netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] aquantia: Reserve space when allocating an SKB
@ 2020-11-18  1:52 Ramsay, Lincoln
  2020-11-18 14:02 ` [EXT] " Igor Russkikh
  0 siblings, 1 reply; 28+ messages in thread
From: Ramsay, Lincoln @ 2020-11-18  1:52 UTC (permalink / raw)
  To: Igor Russkikh, David S. Miller, Jakub Kicinski, netdev

When performing IPv6 forwarding, there is an expectation that SKBs
will have some headroom. When forwarding a packet from the aquantia
driver, this does not always happen, triggering a kernel warning.

It was observed that napi_alloc_skb and other ethernet drivers
reserve (NET_SKB_PAD + NET_IP_ALIGN) bytes in new SKBs. Do this
when calling build_skb as well.

Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>
---

We have an Aquantia 10G ethernet interface in one of our devices. While testing a new feature, we discovered a problem with it. The problem only shows up in a very specific situation however.

We are using firewalld as a frontend to nftables.
It sets up port forwarding (eg. incoming port 5022 -> other_machine:22).
We also use masquerading on the outgoing packet, although I'm not sure this is relevant to the issue.
IPv4 works fine, IPv6 is a problem.
The bug is triggered by trying to hit this forwarded port (ssh -p 5022 addr). It is 100% reproducible.

The problem is that we get a kernel warning. It is triggered by this line in neighbour.h:
    if (WARN_ON_ONCE(skb_headroom(skb) < hh_alen)) {

It seems that skb_headroom is only 14, when it is expected to be >= 16.

2020-10-19 21:24:24 DEBUG   [console] ------------[ cut here ]------------
2020-10-19 21:24:24 DEBUG   [console] WARNING: CPU: 3 PID: 0 at include/net/neighbour.h:493 ip6_finish_output2+0x538/0x580
2020-10-19 21:24:24 DEBUG   [console] Modules linked in: xt_addrtype xt_MASQUERADE iptable_filter iptable_nat ip6table_raw ip6_tables xt_CT xt_tcpudp iptable_raw ip_tables nf_nat_tftp nft_nat nft_masq nft_objref nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_chain_nat nf_nat xfrm_user nf_conntrack_tftp nf_tables_set x_tables nft_ct nf_tables nfnetlink amd_spirom_nor(O) spi_nor(O) mtd(O) atlantic nct5104_wdt(O) gpio_amd(O) nct7491(O) sch_fq_codel tun qmi_wwan usbnet mii qcserial usb_wwan qcaux nsh nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 i2c_dev cdc_wdm br_netfilter bridge stp llc [last unloaded: nft_reject]
2020-10-19 21:24:24 DEBUG   [console] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G           O      5.4.65-og #1
2020-10-19 21:24:24 DEBUG   [console] RIP: 0010:ip6_finish_output2+0x538/0x580
2020-10-19 21:24:24 DEBUG   [console] Code: 87 e9 fc ff ff 44 89 fa 48 89 74 24 20 48 29 d7 e8 2d 4f 0c 00 48 8b 74 24 20 e9 cf fc ff ff 41 bf 10 00 00 00 e9 c4 fc ff ff <0f> 0b 4c 89 ef 41 bc 01 00 00 00 e8 d8 89 f0 ff e9 ee fc ff ff e8
2020-10-19 21:24:24 DEBUG   [console] RSP: 0018:ffffac2040114ab0 EFLAGS: 00010212
2020-10-19 21:24:24 DEBUG   [console] RAX: ffff9c041a0bf00e RBX: 000000000000000e RCX: ffff9c041a0bf00e
2020-10-19 21:24:24 DEBUG   [console] RDX: 000000000000000e RSI: ffff9c03ddf606c8 RDI: 0000000000000000
2020-10-19 21:24:24 DEBUG   [console] RBP: ffffac2040114b38 R08: 00000000f2000000 R09: 0000000002ec5955
2020-10-19 21:24:24 DEBUG   [console] R10: ffff9c041e57a440 R11: 000000000000000a R12: ffff9c03ddf60600
2020-10-19 21:24:24 DEBUG   [console] R13: ffff9c03dcf24800 R14: 0000000000000000 R15: 0000000000000010
2020-10-19 21:24:24 DEBUG   [console] FS:  0000000000000000(0000) GS:ffff9c0426b80000(0000) knlGS:0000000000000000
2020-10-19 21:24:24 DEBUG   [console] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2020-10-19 21:24:24 DEBUG   [console] CR2: 0000000000a0b4d8 CR3: 0000000222054000 CR4: 00000000000406e0
2020-10-19 21:24:24 DEBUG   [console] Call Trace:
2020-10-19 21:24:24 DEBUG   [console]  <IRQ>
2020-10-19 21:24:24 DEBUG   [console]  ? ipv6_confirm+0x85/0xf0 [nf_conntrack]
2020-10-19 21:24:24 DEBUG   [console]  ip6_output+0x67/0x130
2020-10-19 21:24:24 DEBUG   [console]  ? __ip6_finish_output+0x110/0x110
2020-10-19 21:24:24 DEBUG   [console]  ip6_forward+0x582/0x920
2020-10-19 21:24:24 DEBUG   [console]  ? ip6_frag_init+0x40/0x40
2020-10-19 21:24:24 DEBUG   [console]  ip6_sublist_rcv_finish+0x33/0x50
2020-10-19 21:24:24 DEBUG   [console]  ip6_sublist_rcv+0x212/0x240
2020-10-19 21:24:24 DEBUG   [console]  ? ip6_rcv_finish_core.isra.0+0xc0/0xc0
2020-10-19 21:24:24 DEBUG   [console]  ipv6_list_rcv+0x116/0x140
2020-10-19 21:24:24 DEBUG   [console]  __netif_receive_skb_list_core+0x1b1/0x260
2020-10-19 21:24:24 DEBUG   [console]  netif_receive_skb_list_internal+0x1ba/0x2d0
2020-10-19 21:24:24 DEBUG   [console]  ? napi_gro_receive+0x50/0x90
2020-10-19 21:24:24 DEBUG   [console]  gro_normal_list.part.0+0x14/0x30
2020-10-19 21:24:24 DEBUG   [console]  napi_complete_done+0x81/0x100
2020-10-19 21:24:24 DEBUG   [console]  aq_vec_poll+0x166/0x190 [atlantic]
2020-10-19 21:24:24 DEBUG   [console]  net_rx_action+0x12b/0x2f0
2020-10-19 21:24:24 DEBUG   [console]  __do_softirq+0xd1/0x213
2020-10-19 21:24:24 DEBUG   [console]  irq_exit+0xc8/0xd0
2020-10-19 21:24:24 DEBUG   [console]  do_IRQ+0x48/0xd0
2020-10-19 21:24:24 DEBUG   [console]  common_interrupt+0xf/0xf
2020-10-19 21:24:24 DEBUG   [console]  </IRQ>
2020-10-19 21:24:24 DEBUG   [console] ---[ end trace c1cba758301d342f ]---

After much hunting and debugging, I think I have figured out the issue here.

aq_ring.c has this code (edited slightly for brevity):

if (buff->is_eop && buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
    skb = build_skb(aq_buf_vaddr(&buff->rxdata), AQ_CFG_RX_FRAME_MAX);
    skb_put(skb, buff->len);
} else {
    skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);

There is a significant difference between the SKB produced by these 2 code paths. When napi_alloc_skb creates an SKB, there is a certain amount of headroom reserved. The same pattern appears to be used in all of the other ethernet drivers I have looked at. However, this is not done in the build_skb codepath.

I believe that this is the ultimate cause of the warning we are seeing.

I have created a patch to create some headroom in the SKB. The logic is inspired by the igb driver. This was originally developed against Linux 5.4, then migrated to Linux 5.8. It has been tested on our product against both versions. The patch below was migrated to Linux master (some context changed, but otherwise it applied cleanly).

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 4f913658eea4..57150e3d3257 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -16,6 +16,8 @@
 #include <linux/netdevice.h>
 #include <linux/etherdevice.h>

+#define AQ_SKB_PAD	(NET_SKB_PAD + NET_IP_ALIGN)
+
 static inline void aq_free_rxpage(struct aq_rxpage *rxpage, struct device *dev)
 {
 	unsigned int len = PAGE_SIZE << rxpage->order;
@@ -47,7 +49,7 @@ static int aq_get_rxpage(struct aq_rxpage *rxpage, unsigned int order,
 	rxpage->page = page;
 	rxpage->daddr = daddr;
 	rxpage->order = order;
-	rxpage->pg_off = 0;
+	rxpage->pg_off = AQ_SKB_PAD;

 	return 0;

@@ -67,8 +69,8 @@ static int aq_get_rxpages(struct aq_ring_s *self, struct aq_ring_buff_s *rxbuf,
 		/* One means ring is the only user and can reuse */
 		if (page_ref_count(rxbuf->rxdata.page) > 1) {
 			/* Try reuse buffer */
-			rxbuf->rxdata.pg_off += AQ_CFG_RX_FRAME_MAX;
-			if (rxbuf->rxdata.pg_off + AQ_CFG_RX_FRAME_MAX <=
+			rxbuf->rxdata.pg_off += AQ_CFG_RX_FRAME_MAX + AQ_SKB_PAD;
+			if (rxbuf->rxdata.pg_off + AQ_CFG_RX_FRAME_MAX + AQ_SKB_PAD <=
 				(PAGE_SIZE << order)) {
 				u64_stats_update_begin(&self->stats.rx.syncp);
 				self->stats.rx.pg_flips++;
@@ -84,7 +86,7 @@ static int aq_get_rxpages(struct aq_ring_s *self, struct aq_ring_buff_s *rxbuf,
 				u64_stats_update_end(&self->stats.rx.syncp);
 			}
 		} else {
-			rxbuf->rxdata.pg_off = 0;
+			rxbuf->rxdata.pg_off = AQ_SKB_PAD;
 			u64_stats_update_begin(&self->stats.rx.syncp);
 			self->stats.rx.pg_reuses++;
 			u64_stats_update_end(&self->stats.rx.syncp);
@@ -416,8 +418,8 @@ int aq_ring_rx_clean(struct aq_ring_s *self,
 		/* for single fragment packets use build_skb() */
 		if (buff->is_eop &&
 		    buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
-			skb = build_skb(aq_buf_vaddr(&buff->rxdata),
-					AQ_CFG_RX_FRAME_MAX);
+			skb = build_skb(aq_buf_vaddr(&buff->rxdata) - AQ_SKB_PAD,
+					AQ_CFG_RX_FRAME_MAX + AQ_SKB_PAD);
 			if (unlikely(!skb)) {
 				u64_stats_update_begin(&self->stats.rx.syncp);
 				self->stats.rx.skb_alloc_fails++;
@@ -425,6 +427,7 @@ int aq_ring_rx_clean(struct aq_ring_s *self,
 				err = -ENOMEM;
 				goto err_exit;
 			}
+			skb_reserve(skb, AQ_SKB_PAD);
 			if (is_ptp_ring)
 				buff->len -=
 					aq_ptp_extract_ts(self->aq_nic, skb,
--
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [EXT] [PATCH] aquantia: Reserve space when allocating an SKB
  2020-11-18  1:52 [PATCH] aquantia: Reserve space when allocating an SKB Ramsay, Lincoln
@ 2020-11-18 14:02 ` Igor Russkikh
  2020-11-19  0:14   ` Ramsay, Lincoln
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Russkikh @ 2020-11-18 14:02 UTC (permalink / raw)
  To: Ramsay, Lincoln, David S. Miller, Jakub Kicinski, netdev,
	Dmitry Bogdanov


Hi Ramsay,


> When performing IPv6 forwarding, there is an expectation that SKBs
> will have some headroom. When forwarding a packet from the aquantia
> driver, this does not always happen, triggering a kernel warning.
> 
> It was observed that napi_alloc_skb and other ethernet drivers
> reserve (NET_SKB_PAD + NET_IP_ALIGN) bytes in new SKBs. Do this
> when calling build_skb as well.

Thanks for the analysis, but I think the solution you propose is invalid.


> After much hunting and debugging, I think I have figured out the issue
> here.
> 
> aq_ring.c has this code (edited slightly for brevity):
> 
> if (buff->is_eop && buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
>     skb = build_skb(aq_buf_vaddr(&buff->rxdata), AQ_CFG_RX_FRAME_MAX);
>     skb_put(skb, buff->len);
> } else {
>     skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
> 
> There is a significant difference between the SKB produced by these 2 code
> paths. When napi_alloc_skb creates an SKB, there is a certain amount of
> headroom reserved. The same pattern appears to be used in all of the other
> ethernet drivers I have looked at. However, this is not done in the
> build_skb codepath.

...

> -	rxpage->pg_off = 0;
> +	rxpage->pg_off = AQ_SKB_PAD;
> 
>  	return 0;
> 
> @@ -67,8 +69,8 @@ static int aq_get_rxpages(struct aq_ring_s *self, struct
> aq_ring_buff_s *rxbuf,
>  		/* One means ring is the only user and can reuse */
>  		if (page_ref_count(rxbuf->rxdata.page) > 1) {
>  			/* Try reuse buffer */
> -			rxbuf->rxdata.pg_off += AQ_CFG_RX_FRAME_MAX;
> -			if (rxbuf->rxdata.pg_off + AQ_CFG_RX_FRAME_MAX <=
> +			rxbuf->rxdata.pg_off += AQ_CFG_RX_FRAME_MAX +
> AQ_SKB_PAD;
> +			if (rxbuf->rxdata.pg_off + AQ_CFG_RX_FRAME_MAX +
> AQ_SKB_PAD <=
>  				(PAGE_SIZE << order)) {


Here I understand your intention. You are trying to "offset" the placement of
the packet data, and the restore it back when construction SKB.

The problem however is that hardware is being programmed with fixed descriptor
size for placement. And its equal to AQ_CFG_RX_FRAME_MAX (2K by default).

This means, HW will do writes of up to 2K packet data into a single
descriptor, and then (if not enough), will go for next descriptor data.

With your solution, packets of size (AQ_CFG_RX_FRAME_MAX - AQ_SKB_PAD) up to
size of AQ_CFG_RX_FRAME_MAX will overwrite the area of page they designated
to. Ultimately, HW will do a memory corruption of next page.

The limitation here is we can't tell HW on granularity less than 1K.

I think the only acceptable solution here would be removing that optimized
path of build_skb, and keep only napi_alloc_skb. Or, we can think of keeping
it under some configuration condition (which is also not good).

So far I can't imagine any other good solution.

HW supports also a header split - this could be used to follow the optimized
path, but thats not an easy thing to implement.

Regards,
  Igor

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [EXT] [PATCH] aquantia: Reserve space when allocating an SKB
  2020-11-18 14:02 ` [EXT] " Igor Russkikh
@ 2020-11-19  0:14   ` Ramsay, Lincoln
  2020-11-19  5:19     ` Ramsay, Lincoln
  0 siblings, 1 reply; 28+ messages in thread
From: Ramsay, Lincoln @ 2020-11-19  0:14 UTC (permalink / raw)
  To: Igor Russkikh, David S. Miller, Jakub Kicinski, netdev, Dmitry Bogdanov

Hi Igor,

> Here I understand your intention. You are trying to "offset" the placement of
> the packet data, and the restore it back when construction SKB.

Originally, I just added the skb_reserve call, but that broke everything. When I looked at what the igb driver was doing, this approach seemed reasonable but I wasn't sure it'd work.

> The problem however is that hardware is being programmed with fixed descriptor
> size for placement. And its equal to AQ_CFG_RX_FRAME_MAX (2K by default).
> 
> This means, HW will do writes of up to 2K packet data into a single
> descriptor, and then (if not enough), will go for next descriptor data.
> 
> With your solution, packets of size (AQ_CFG_RX_FRAME_MAX - AQ_SKB_PAD) up to
> size of AQ_CFG_RX_FRAME_MAX will overwrite the area of page they designated
> to. Ultimately, HW will do a memory corruption of next page.

Yeah... this is the kind of thing I was worried about. It seemed to me that the SKB was being built around a hardware buffer rather than around heap-allocated memory. I just hoped that the rx_off value would somehow make it work.

The code in aq_get_rxpages seems to suggest that multiple frames can fit in a rxpage, so maybe the logic there prevents overwriting? (at the expense of not fitting as many frames into the page before it has to get a new one?)

I didn't notice any issues when I was testing, but apart from port forwarding ssh (which is tiny) and some copying of files on (probably not even close to saturating the link) there's not a huge network load placed on the device. I guess it's entirely possible that an overwrite problem would only show up under heavy load? (ie. more, and larger amounts of data in flight through the kernel at once)

> I think the only acceptable solution here would be removing that optimized
> path of build_skb, and keep only napi_alloc_skb. Or, we can think of keeping
> it under some configuration condition (which is also not good).

I'll attempt to confirm that this works too, at least for our tests :)

Lincoln

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [EXT] [PATCH] aquantia: Reserve space when allocating an SKB
  2020-11-19  0:14   ` Ramsay, Lincoln
@ 2020-11-19  5:19     ` Ramsay, Lincoln
  2020-11-19 22:01       ` [PATCH] aquantia: Remove the build_skb path Ramsay, Lincoln
  0 siblings, 1 reply; 28+ messages in thread
From: Ramsay, Lincoln @ 2020-11-19  5:19 UTC (permalink / raw)
  To: Igor Russkikh, David S. Miller, Jakub Kicinski, netdev, Dmitry Bogdanov

Hi Igor,

> > With your solution, packets of size (AQ_CFG_RX_FRAME_MAX - AQ_SKB_PAD) up to
> > size of AQ_CFG_RX_FRAME_MAX will overwrite the area of page they designated
> > to. Ultimately, HW will do a memory corruption of next page.
> 
> The code in aq_get_rxpages seems to suggest that multiple frames can fit in a rxpage, so
> maybe the logic there prevents overwriting? (at the expense of not fitting as many
> frames into the page before it has to get a new one?)

I am not terribly experienced with such low-level code, but it looks to me looks like a page is allocated (4k) and then DMA mapped to the device. Frames are 2k, so only 2 can fit into a single mapped page. If the mapping was done with an offset of AQ_SKB_PAD, that'd leave space for the SKB headroom but it would mean only a single frame could fit into that mapped page. Since this is the "I only have 1 fragment less than 2k" code path, maybe that's ok? I'm not sure if the hardware side can know that it's only allowed to write 1 frame into the buffer...

I noticed on my device that aq_ring_rx_clean always hits the "fast" codepath. I guess that just means I am not pushing it hard enough?

> > I think the only acceptable solution here would be removing that optimized
> > path of build_skb, and keep only napi_alloc_skb. Or, we can think of keeping
> > it under some configuration condition (which is also not good).
> 
> I'll attempt to confirm that this works too, at least for our tests :)

FWIW: This does work. I notice that this has to copy data into an allocated skb rather than building the skb around the data. I guess avoiding that copy is why the fast path existed in the first place?

Would you like me to post this patch (removing the fast path)?

Lincoln

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] aquantia: Remove the build_skb path
  2020-11-19  5:19     ` Ramsay, Lincoln
@ 2020-11-19 22:01       ` Ramsay, Lincoln
  2020-11-19 22:07         ` [PATCH v2] " Ramsay, Lincoln
  2020-11-20  7:52         ` [EXT] [PATCH] " Igor Russkikh
  0 siblings, 2 replies; 28+ messages in thread
From: Ramsay, Lincoln @ 2020-11-19 22:01 UTC (permalink / raw)
  To: Igor Russkikh, David S. Miller, Jakub Kicinski, netdev, Dmitry Bogdanov

The build_skb path fails to allow for an SKB header, but the hardware
buffer it is built around won't allow for this anyway.

Just always use the slower codepath that copies memory into an
allocated SKB.

Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>
---
 .../net/ethernet/aquantia/atlantic/aq_ring.c  | 118 ++++++++----------
 1 file changed, 50 insertions(+), 68 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 68fdb3994088..74f6f41b57e9 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -385,79 +385,61 @@ int aq_ring_rx_clean(struct aq_ring_s *self,
 					      buff->rxdata.pg_off,
 					      buff->len, DMA_FROM_DEVICE);
 
-		/* for single fragment packets use build_skb() */
-		if (buff->is_eop &&
-		    buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
-			skb = build_skb(aq_buf_vaddr(&buff->rxdata),
+		skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
+		if (unlikely(!skb)) {
+			err = -ENOMEM;
+			goto err_exit;
+		}
+		if (is_ptp_ring)
+			buff->len -=
+				aq_ptp_extract_ts(self->aq_nic, skb,
+					aq_buf_vaddr(&buff->rxdata),
+					buff->len);
+
+		hdr_len = buff->len;
+		if (hdr_len > AQ_CFG_RX_HDR_SIZE)
+			hdr_len = eth_get_headlen(skb->dev,
+							aq_buf_vaddr(&buff->rxdata),
+							AQ_CFG_RX_HDR_SIZE);
+
+		memcpy(__skb_put(skb, hdr_len), aq_buf_vaddr(&buff->rxdata),
+			ALIGN(hdr_len, sizeof(long)));
+
+		if (buff->len - hdr_len > 0) {
+			skb_add_rx_frag(skb, 0, buff->rxdata.page,
+					buff->rxdata.pg_off + hdr_len,
+					buff->len - hdr_len,
 					AQ_CFG_RX_FRAME_MAX);
-			if (unlikely(!skb)) {
-				err = -ENOMEM;
-				goto err_exit;
-			}
-			if (is_ptp_ring)
-				buff->len -=
-					aq_ptp_extract_ts(self->aq_nic, skb,
-						aq_buf_vaddr(&buff->rxdata),
-						buff->len);
-			skb_put(skb, buff->len);
 			page_ref_inc(buff->rxdata.page);
-		} else {
-			skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
-			if (unlikely(!skb)) {
-				err = -ENOMEM;
-				goto err_exit;
-			}
-			if (is_ptp_ring)
-				buff->len -=
-					aq_ptp_extract_ts(self->aq_nic, skb,
-						aq_buf_vaddr(&buff->rxdata),
-						buff->len);
-
-			hdr_len = buff->len;
-			if (hdr_len > AQ_CFG_RX_HDR_SIZE)
-				hdr_len = eth_get_headlen(skb->dev,
-							  aq_buf_vaddr(&buff->rxdata),
-							  AQ_CFG_RX_HDR_SIZE);
-
-			memcpy(__skb_put(skb, hdr_len), aq_buf_vaddr(&buff->rxdata),
-			       ALIGN(hdr_len, sizeof(long)));
-
-			if (buff->len - hdr_len > 0) {
-				skb_add_rx_frag(skb, 0, buff->rxdata.page,
-						buff->rxdata.pg_off + hdr_len,
-						buff->len - hdr_len,
-						AQ_CFG_RX_FRAME_MAX);
-				page_ref_inc(buff->rxdata.page);
-			}
+		}
 
-			if (!buff->is_eop) {
-				buff_ = buff;
-				i = 1U;
-				do {
-					next_ = buff_->next,
-					buff_ = &self->buff_ring[next_];
+		if (!buff->is_eop) {
+			buff_ = buff;
+			i = 1U;
+			do {
+				next_ = buff_->next,
+				buff_ = &self->buff_ring[next_];
 
-					dma_sync_single_range_for_cpu(
-							aq_nic_get_dev(self->aq_nic),
-							buff_->rxdata.daddr,
-							buff_->rxdata.pg_off,
-							buff_->len,
-							DMA_FROM_DEVICE);
-					skb_add_rx_frag(skb, i++,
-							buff_->rxdata.page,
-							buff_->rxdata.pg_off,
-							buff_->len,
-							AQ_CFG_RX_FRAME_MAX);
-					page_ref_inc(buff_->rxdata.page);
-					buff_->is_cleaned = 1;
-
-					buff->is_ip_cso &= buff_->is_ip_cso;
-					buff->is_udp_cso &= buff_->is_udp_cso;
-					buff->is_tcp_cso &= buff_->is_tcp_cso;
-					buff->is_cso_err |= buff_->is_cso_err;
+				dma_sync_single_range_for_cpu(
+						aq_nic_get_dev(self->aq_nic),
+						buff_->rxdata.daddr,
+						buff_->rxdata.pg_off,
+						buff_->len,
+						DMA_FROM_DEVICE);
+				skb_add_rx_frag(skb, i++,
+						buff_->rxdata.page,
+						buff_->rxdata.pg_off,
+						buff_->len,
+						AQ_CFG_RX_FRAME_MAX);
+				page_ref_inc(buff_->rxdata.page);
+				buff_->is_cleaned = 1;
 
-				} while (!buff_->is_eop);
-			}
+				buff->is_ip_cso &= buff_->is_ip_cso;
+				buff->is_udp_cso &= buff_->is_udp_cso;
+				buff->is_tcp_cso &= buff_->is_tcp_cso;
+				buff->is_cso_err |= buff_->is_cso_err;
+
+			} while (!buff_->is_eop);
 		}
 
 		if (buff->is_vlan)
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2] aquantia: Remove the build_skb path
  2020-11-19 22:01       ` [PATCH] aquantia: Remove the build_skb path Ramsay, Lincoln
@ 2020-11-19 22:07         ` Ramsay, Lincoln
  2020-11-19 22:15           ` Florian Westphal
  2020-11-20  7:52         ` [EXT] [PATCH] " Igor Russkikh
  1 sibling, 1 reply; 28+ messages in thread
From: Ramsay, Lincoln @ 2020-11-19 22:07 UTC (permalink / raw)
  To: Igor Russkikh, David S. Miller, Jakub Kicinski, netdev, Dmitry Bogdanov

The build_skb path fails to allow for an SKB header, but the hardware
buffer it is built around won't allow for this anyway.

Just always use the slower codepath that copies memory into an
allocated SKB.

Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>
---

This patch is against the master branch rather than the 5.8 branch.


 .../net/ethernet/aquantia/atlantic/aq_ring.c  | 127 ++++++++----------
 1 file changed, 53 insertions(+), 74 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 4f913658eea4..425e8e5afec7 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -413,85 +413,64 @@ int aq_ring_rx_clean(struct aq_ring_s *self,
 					      buff->rxdata.pg_off,
 					      buff->len, DMA_FROM_DEVICE);
 
-		/* for single fragment packets use build_skb() */
-		if (buff->is_eop &&
-		    buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
-			skb = build_skb(aq_buf_vaddr(&buff->rxdata),
+		skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
+		if (unlikely(!skb)) {
+			u64_stats_update_begin(&self->stats.rx.syncp);
+			self->stats.rx.skb_alloc_fails++;
+			u64_stats_update_end(&self->stats.rx.syncp);
+			err = -ENOMEM;
+			goto err_exit;
+		}
+		if (is_ptp_ring)
+			buff->len -=
+				aq_ptp_extract_ts(self->aq_nic, skb,
+					aq_buf_vaddr(&buff->rxdata),
+					buff->len);
+
+		hdr_len = buff->len;
+		if (hdr_len > AQ_CFG_RX_HDR_SIZE)
+			hdr_len = eth_get_headlen(skb->dev,
+							aq_buf_vaddr(&buff->rxdata),
+							AQ_CFG_RX_HDR_SIZE);
+
+		memcpy(__skb_put(skb, hdr_len), aq_buf_vaddr(&buff->rxdata),
+			ALIGN(hdr_len, sizeof(long)));
+
+		if (buff->len - hdr_len > 0) {
+			skb_add_rx_frag(skb, 0, buff->rxdata.page,
+					buff->rxdata.pg_off + hdr_len,
+					buff->len - hdr_len,
 					AQ_CFG_RX_FRAME_MAX);
-			if (unlikely(!skb)) {
-				u64_stats_update_begin(&self->stats.rx.syncp);
-				self->stats.rx.skb_alloc_fails++;
-				u64_stats_update_end(&self->stats.rx.syncp);
-				err = -ENOMEM;
-				goto err_exit;
-			}
-			if (is_ptp_ring)
-				buff->len -=
-					aq_ptp_extract_ts(self->aq_nic, skb,
-						aq_buf_vaddr(&buff->rxdata),
-						buff->len);
-			skb_put(skb, buff->len);
 			page_ref_inc(buff->rxdata.page);
-		} else {
-			skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
-			if (unlikely(!skb)) {
-				u64_stats_update_begin(&self->stats.rx.syncp);
-				self->stats.rx.skb_alloc_fails++;
-				u64_stats_update_end(&self->stats.rx.syncp);
-				err = -ENOMEM;
-				goto err_exit;
-			}
-			if (is_ptp_ring)
-				buff->len -=
-					aq_ptp_extract_ts(self->aq_nic, skb,
-						aq_buf_vaddr(&buff->rxdata),
-						buff->len);
-
-			hdr_len = buff->len;
-			if (hdr_len > AQ_CFG_RX_HDR_SIZE)
-				hdr_len = eth_get_headlen(skb->dev,
-							  aq_buf_vaddr(&buff->rxdata),
-							  AQ_CFG_RX_HDR_SIZE);
-
-			memcpy(__skb_put(skb, hdr_len), aq_buf_vaddr(&buff->rxdata),
-			       ALIGN(hdr_len, sizeof(long)));
-
-			if (buff->len - hdr_len > 0) {
-				skb_add_rx_frag(skb, 0, buff->rxdata.page,
-						buff->rxdata.pg_off + hdr_len,
-						buff->len - hdr_len,
-						AQ_CFG_RX_FRAME_MAX);
-				page_ref_inc(buff->rxdata.page);
-			}
+		}
 
-			if (!buff->is_eop) {
-				buff_ = buff;
-				i = 1U;
-				do {
-					next_ = buff_->next,
-					buff_ = &self->buff_ring[next_];
+		if (!buff->is_eop) {
+			buff_ = buff;
+			i = 1U;
+			do {
+				next_ = buff_->next,
+				buff_ = &self->buff_ring[next_];
 
-					dma_sync_single_range_for_cpu(
-							aq_nic_get_dev(self->aq_nic),
-							buff_->rxdata.daddr,
-							buff_->rxdata.pg_off,
-							buff_->len,
-							DMA_FROM_DEVICE);
-					skb_add_rx_frag(skb, i++,
-							buff_->rxdata.page,
-							buff_->rxdata.pg_off,
-							buff_->len,
-							AQ_CFG_RX_FRAME_MAX);
-					page_ref_inc(buff_->rxdata.page);
-					buff_->is_cleaned = 1;
-
-					buff->is_ip_cso &= buff_->is_ip_cso;
-					buff->is_udp_cso &= buff_->is_udp_cso;
-					buff->is_tcp_cso &= buff_->is_tcp_cso;
-					buff->is_cso_err |= buff_->is_cso_err;
+				dma_sync_single_range_for_cpu(
+						aq_nic_get_dev(self->aq_nic),
+						buff_->rxdata.daddr,
+						buff_->rxdata.pg_off,
+						buff_->len,
+						DMA_FROM_DEVICE);
+				skb_add_rx_frag(skb, i++,
+						buff_->rxdata.page,
+						buff_->rxdata.pg_off,
+						buff_->len,
+						AQ_CFG_RX_FRAME_MAX);
+				page_ref_inc(buff_->rxdata.page);
+				buff_->is_cleaned = 1;
 
-				} while (!buff_->is_eop);
-			}
+				buff->is_ip_cso &= buff_->is_ip_cso;
+				buff->is_udp_cso &= buff_->is_udp_cso;
+				buff->is_tcp_cso &= buff_->is_tcp_cso;
+				buff->is_cso_err |= buff_->is_cso_err;
+
+			} while (!buff_->is_eop);
 		}
 
 		if (buff->is_vlan)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] aquantia: Remove the build_skb path
  2020-11-19 22:07         ` [PATCH v2] " Ramsay, Lincoln
@ 2020-11-19 22:15           ` Florian Westphal
  2020-11-19 22:24             ` Ramsay, Lincoln
  0 siblings, 1 reply; 28+ messages in thread
From: Florian Westphal @ 2020-11-19 22:15 UTC (permalink / raw)
  To: Ramsay, Lincoln
  Cc: Igor Russkikh, David S. Miller, Jakub Kicinski, netdev, Dmitry Bogdanov

Ramsay, Lincoln <Lincoln.Ramsay@digi.com> wrote:
> The build_skb path fails to allow for an SKB header, but the hardware
> buffer it is built around won't allow for this anyway.

What problem is being resolved here?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2] aquantia: Remove the build_skb path
  2020-11-19 22:15           ` Florian Westphal
@ 2020-11-19 22:24             ` Ramsay, Lincoln
  2020-11-19 22:28               ` Florian Westphal
  0 siblings, 1 reply; 28+ messages in thread
From: Ramsay, Lincoln @ 2020-11-19 22:24 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Igor Russkikh, David S. Miller, Jakub Kicinski, netdev, Dmitry Bogdanov

> Ramsay, Lincoln <Lincoln.Ramsay@digi.com> wrote:
> > The build_skb path fails to allow for an SKB header, but the hardware
> > buffer it is built around won't allow for this anyway.
> 
> What problem is being resolved here?

Sorry... Do I need to re-post the context? (I thought the reply headers would have kept this with the original patch that included the justification, plus the discussion that led to this revised patch).

Lincoln

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] aquantia: Remove the build_skb path
  2020-11-19 22:24             ` Ramsay, Lincoln
@ 2020-11-19 22:28               ` Florian Westphal
  2020-11-19 22:34                 ` [PATCH v3] " Ramsay, Lincoln
  0 siblings, 1 reply; 28+ messages in thread
From: Florian Westphal @ 2020-11-19 22:28 UTC (permalink / raw)
  To: Ramsay, Lincoln
  Cc: Florian Westphal, Igor Russkikh, David S. Miller, Jakub Kicinski,
	netdev, Dmitry Bogdanov

Ramsay, Lincoln <Lincoln.Ramsay@digi.com> wrote:
> > Ramsay, Lincoln <Lincoln.Ramsay@digi.com> wrote:
> > > The build_skb path fails to allow for an SKB header, but the hardware
> > > buffer it is built around won't allow for this anyway.
> > 
> > What problem is being resolved here?
> 
> Sorry... Do I need to re-post the context? (I thought the reply headers would have kept this with the original patch that included the justification, plus the discussion that led to this revised patch).

This is the only text that gets recorded in git, see

https://patchwork.kernel.org/project/netdevbpf/patch/CY4PR1001MB2311F01C543420E5F89C0F4DE8E00@CY4PR1001MB2311.namprd10.prod.outlook.com/

so, yes, please include this information in the patch description and
post a v3.

Thank you.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3] aquantia: Remove the build_skb path
  2020-11-19 22:28               ` Florian Westphal
@ 2020-11-19 22:34                 ` Ramsay, Lincoln
  2020-11-19 22:49                   ` Maciej Fijalkowski
  2020-11-19 22:58                   ` Florian Westphal
  0 siblings, 2 replies; 28+ messages in thread
From: Ramsay, Lincoln @ 2020-11-19 22:34 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Igor Russkikh, David S. Miller, Jakub Kicinski, netdev, Dmitry Bogdanov

When performing IPv6 forwarding, there is an expectation that SKBs
will have some headroom. When forwarding a packet from the aquantia
driver, this does not always happen, triggering a kernel warning.

The build_skb path fails to allow for an SKB header, but the hardware
buffer it is built around won't allow for this anyway. Just always use the
slower codepath that copies memory into an allocated SKB.

Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>
---

We have an Aquantia 10G ethernet interface in one of our devices. While testing a new feature, we discovered a problem with it. The problem only shows up in a very specific situation however.

We are using firewalld as a frontend to nftables.
It sets up port forwarding (eg. incoming port 5022 -> other_machine:22).
We also use masquerading on the outgoing packet, although I'm not sure this is relevant to the issue.
IPv4 works fine, IPv6 is a problem.
The bug is triggered by trying to hit this forwarded port (ssh -p 5022 addr). It is 100% reproducible.

The problem is that we get a kernel warning. It is triggered by this line in neighbour.h:
    if (WARN_ON_ONCE(skb_headroom(skb) < hh_alen)) {

It seems that skb_headroom is only 14, when it is expected to be >= 16.

2020-10-19 21:24:24 DEBUG   [console] ------------[ cut here ]------------
2020-10-19 21:24:24 DEBUG   [console] WARNING: CPU: 3 PID: 0 at include/net/neighbour.h:493 ip6_finish_output2+0x538/0x580
2020-10-19 21:24:24 DEBUG   [console] Modules linked in: xt_addrtype xt_MASQUERADE iptable_filter iptable_nat ip6table_raw ip6_tables xt_CT xt_tcpudp iptable_raw ip_tables nf_nat_tftp nft_nat nft_masq nft_objref nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_chain_nat nf_nat xfrm_user nf_conntrack_tftp nf_tables_set x_tables nft_ct nf_tables nfnetlink amd_spirom_nor(O) spi_nor(O) mtd(O) atlantic nct5104_wdt(O) gpio_amd(O) nct7491(O) sch_fq_codel tun qmi_wwan usbnet mii qcserial usb_wwan qcaux nsh nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 i2c_dev cdc_wdm br_netfilter bridge stp llc [last unloaded: nft_reject]
2020-10-19 21:24:24 DEBUG   [console] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G           O      5.4.65-og #1
2020-10-19 21:24:24 DEBUG   [console] RIP: 0010:ip6_finish_output2+0x538/0x580
2020-10-19 21:24:24 DEBUG   [console] Code: 87 e9 fc ff ff 44 89 fa 48 89 74 24 20 48 29 d7 e8 2d 4f 0c 00 48 8b 74 24 20 e9 cf fc ff ff 41 bf 10 00 00 00 e9 c4 fc ff ff <0f> 0b 4c 89 ef 41 bc 01 00 00 00 e8 d8 89 f0 ff e9 ee fc ff ff e8
2020-10-19 21:24:24 DEBUG   [console] RSP: 0018:ffffac2040114ab0 EFLAGS: 00010212
2020-10-19 21:24:24 DEBUG   [console] RAX: ffff9c041a0bf00e RBX: 000000000000000e RCX: ffff9c041a0bf00e
2020-10-19 21:24:24 DEBUG   [console] RDX: 000000000000000e RSI: ffff9c03ddf606c8 RDI: 0000000000000000
2020-10-19 21:24:24 DEBUG   [console] RBP: ffffac2040114b38 R08: 00000000f2000000 R09: 0000000002ec5955
2020-10-19 21:24:24 DEBUG   [console] R10: ffff9c041e57a440 R11: 000000000000000a R12: ffff9c03ddf60600
2020-10-19 21:24:24 DEBUG   [console] R13: ffff9c03dcf24800 R14: 0000000000000000 R15: 0000000000000010
2020-10-19 21:24:24 DEBUG   [console] FS:  0000000000000000(0000) GS:ffff9c0426b80000(0000) knlGS:0000000000000000
2020-10-19 21:24:24 DEBUG   [console] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2020-10-19 21:24:24 DEBUG   [console] CR2: 0000000000a0b4d8 CR3: 0000000222054000 CR4: 00000000000406e0
2020-10-19 21:24:24 DEBUG   [console] Call Trace:
2020-10-19 21:24:24 DEBUG   [console]  <IRQ>
2020-10-19 21:24:24 DEBUG   [console]  ? ipv6_confirm+0x85/0xf0 [nf_conntrack]
2020-10-19 21:24:24 DEBUG   [console]  ip6_output+0x67/0x130
2020-10-19 21:24:24 DEBUG   [console]  ? __ip6_finish_output+0x110/0x110
2020-10-19 21:24:24 DEBUG   [console]  ip6_forward+0x582/0x920
2020-10-19 21:24:24 DEBUG   [console]  ? ip6_frag_init+0x40/0x40
2020-10-19 21:24:24 DEBUG   [console]  ip6_sublist_rcv_finish+0x33/0x50
2020-10-19 21:24:24 DEBUG   [console]  ip6_sublist_rcv+0x212/0x240
2020-10-19 21:24:24 DEBUG   [console]  ? ip6_rcv_finish_core.isra.0+0xc0/0xc0
2020-10-19 21:24:24 DEBUG   [console]  ipv6_list_rcv+0x116/0x140
2020-10-19 21:24:24 DEBUG   [console]  __netif_receive_skb_list_core+0x1b1/0x260
2020-10-19 21:24:24 DEBUG   [console]  netif_receive_skb_list_internal+0x1ba/0x2d0
2020-10-19 21:24:24 DEBUG   [console]  ? napi_gro_receive+0x50/0x90
2020-10-19 21:24:24 DEBUG   [console]  gro_normal_list.part.0+0x14/0x30
2020-10-19 21:24:24 DEBUG   [console]  napi_complete_done+0x81/0x100
2020-10-19 21:24:24 DEBUG   [console]  aq_vec_poll+0x166/0x190 [atlantic]
2020-10-19 21:24:24 DEBUG   [console]  net_rx_action+0x12b/0x2f0
2020-10-19 21:24:24 DEBUG   [console]  __do_softirq+0xd1/0x213
2020-10-19 21:24:24 DEBUG   [console]  irq_exit+0xc8/0xd0
2020-10-19 21:24:24 DEBUG   [console]  do_IRQ+0x48/0xd0
2020-10-19 21:24:24 DEBUG   [console]  common_interrupt+0xf/0xf
2020-10-19 21:24:24 DEBUG   [console]  </IRQ>
2020-10-19 21:24:24 DEBUG   [console] ---[ end trace c1cba758301d342f ]---

After much hunting and debugging, I think I have figured out the issue here.

aq_ring.c has this code (edited slightly for brevity):

if (buff->is_eop && buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
    skb = build_skb(aq_buf_vaddr(&buff->rxdata), AQ_CFG_RX_FRAME_MAX);
    skb_put(skb, buff->len);
} else {
    skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);

There is a significant difference between the SKB produced by these 2 code paths. When napi_alloc_skb creates an SKB, there is a certain amount of headroom reserved. The same pattern appears to be used in all of the other ethernet drivers I have looked at. However, this is not done in the build_skb codepath.

I believe that this is the ultimate cause of the warning we are seeing.

My original proposed patch followed a pattern used in the igb driver. Some extra space in the buffer was used to hold the SKB headroom. However, this is problematic because the hardware doesn't understand this and it may overwrite data. This patch removes the build_skb path entirely, avoiding the issue. It was tested as a patch against Linux 5.8.


 .../net/ethernet/aquantia/atlantic/aq_ring.c  | 127 ++++++++----------
 1 file changed, 53 insertions(+), 74 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 4f913658eea4..425e8e5afec7 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -413,85 +413,64 @@ int aq_ring_rx_clean(struct aq_ring_s *self,
 					      buff->rxdata.pg_off,
 					      buff->len, DMA_FROM_DEVICE);
 
-		/* for single fragment packets use build_skb() */
-		if (buff->is_eop &&
-		    buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
-			skb = build_skb(aq_buf_vaddr(&buff->rxdata),
+		skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
+		if (unlikely(!skb)) {
+			u64_stats_update_begin(&self->stats.rx.syncp);
+			self->stats.rx.skb_alloc_fails++;
+			u64_stats_update_end(&self->stats.rx.syncp);
+			err = -ENOMEM;
+			goto err_exit;
+		}
+		if (is_ptp_ring)
+			buff->len -=
+				aq_ptp_extract_ts(self->aq_nic, skb,
+					aq_buf_vaddr(&buff->rxdata),
+					buff->len);
+
+		hdr_len = buff->len;
+		if (hdr_len > AQ_CFG_RX_HDR_SIZE)
+			hdr_len = eth_get_headlen(skb->dev,
+							aq_buf_vaddr(&buff->rxdata),
+							AQ_CFG_RX_HDR_SIZE);
+
+		memcpy(__skb_put(skb, hdr_len), aq_buf_vaddr(&buff->rxdata),
+			ALIGN(hdr_len, sizeof(long)));
+
+		if (buff->len - hdr_len > 0) {
+			skb_add_rx_frag(skb, 0, buff->rxdata.page,
+					buff->rxdata.pg_off + hdr_len,
+					buff->len - hdr_len,
 					AQ_CFG_RX_FRAME_MAX);
-			if (unlikely(!skb)) {
-				u64_stats_update_begin(&self->stats.rx.syncp);
-				self->stats.rx.skb_alloc_fails++;
-				u64_stats_update_end(&self->stats.rx.syncp);
-				err = -ENOMEM;
-				goto err_exit;
-			}
-			if (is_ptp_ring)
-				buff->len -=
-					aq_ptp_extract_ts(self->aq_nic, skb,
-						aq_buf_vaddr(&buff->rxdata),
-						buff->len);
-			skb_put(skb, buff->len);
 			page_ref_inc(buff->rxdata.page);
-		} else {
-			skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
-			if (unlikely(!skb)) {
-				u64_stats_update_begin(&self->stats.rx.syncp);
-				self->stats.rx.skb_alloc_fails++;
-				u64_stats_update_end(&self->stats.rx.syncp);
-				err = -ENOMEM;
-				goto err_exit;
-			}
-			if (is_ptp_ring)
-				buff->len -=
-					aq_ptp_extract_ts(self->aq_nic, skb,
-						aq_buf_vaddr(&buff->rxdata),
-						buff->len);
-
-			hdr_len = buff->len;
-			if (hdr_len > AQ_CFG_RX_HDR_SIZE)
-				hdr_len = eth_get_headlen(skb->dev,
-							  aq_buf_vaddr(&buff->rxdata),
-							  AQ_CFG_RX_HDR_SIZE);
-
-			memcpy(__skb_put(skb, hdr_len), aq_buf_vaddr(&buff->rxdata),
-			       ALIGN(hdr_len, sizeof(long)));
-
-			if (buff->len - hdr_len > 0) {
-				skb_add_rx_frag(skb, 0, buff->rxdata.page,
-						buff->rxdata.pg_off + hdr_len,
-						buff->len - hdr_len,
-						AQ_CFG_RX_FRAME_MAX);
-				page_ref_inc(buff->rxdata.page);
-			}
+		}
 
-			if (!buff->is_eop) {
-				buff_ = buff;
-				i = 1U;
-				do {
-					next_ = buff_->next,
-					buff_ = &self->buff_ring[next_];
+		if (!buff->is_eop) {
+			buff_ = buff;
+			i = 1U;
+			do {
+				next_ = buff_->next,
+				buff_ = &self->buff_ring[next_];
 
-					dma_sync_single_range_for_cpu(
-							aq_nic_get_dev(self->aq_nic),
-							buff_->rxdata.daddr,
-							buff_->rxdata.pg_off,
-							buff_->len,
-							DMA_FROM_DEVICE);
-					skb_add_rx_frag(skb, i++,
-							buff_->rxdata.page,
-							buff_->rxdata.pg_off,
-							buff_->len,
-							AQ_CFG_RX_FRAME_MAX);
-					page_ref_inc(buff_->rxdata.page);
-					buff_->is_cleaned = 1;
-
-					buff->is_ip_cso &= buff_->is_ip_cso;
-					buff->is_udp_cso &= buff_->is_udp_cso;
-					buff->is_tcp_cso &= buff_->is_tcp_cso;
-					buff->is_cso_err |= buff_->is_cso_err;
+				dma_sync_single_range_for_cpu(
+						aq_nic_get_dev(self->aq_nic),
+						buff_->rxdata.daddr,
+						buff_->rxdata.pg_off,
+						buff_->len,
+						DMA_FROM_DEVICE);
+				skb_add_rx_frag(skb, i++,
+						buff_->rxdata.page,
+						buff_->rxdata.pg_off,
+						buff_->len,
+						AQ_CFG_RX_FRAME_MAX);
+				page_ref_inc(buff_->rxdata.page);
+				buff_->is_cleaned = 1;
 
-				} while (!buff_->is_eop);
-			}
+				buff->is_ip_cso &= buff_->is_ip_cso;
+				buff->is_udp_cso &= buff_->is_udp_cso;
+				buff->is_tcp_cso &= buff_->is_tcp_cso;
+				buff->is_cso_err |= buff_->is_cso_err;
+
+			} while (!buff_->is_eop);
 		}
 
 		if (buff->is_vlan)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] aquantia: Remove the build_skb path
  2020-11-19 22:34                 ` [PATCH v3] " Ramsay, Lincoln
@ 2020-11-19 22:49                   ` Maciej Fijalkowski
  2020-11-20  8:18                     ` [EXT] " Igor Russkikh
  2020-11-19 22:58                   ` Florian Westphal
  1 sibling, 1 reply; 28+ messages in thread
From: Maciej Fijalkowski @ 2020-11-19 22:49 UTC (permalink / raw)
  To: Ramsay, Lincoln
  Cc: Florian Westphal, Igor Russkikh, David S. Miller, Jakub Kicinski,
	netdev, Dmitry Bogdanov

On Thu, Nov 19, 2020 at 10:34:48PM +0000, Ramsay, Lincoln wrote:
> When performing IPv6 forwarding, there is an expectation that SKBs
> will have some headroom. When forwarding a packet from the aquantia
> driver, this does not always happen, triggering a kernel warning.
> 
> The build_skb path fails to allow for an SKB header, but the hardware
> buffer it is built around won't allow for this anyway. Just always use the
> slower codepath that copies memory into an allocated SKB.
> 
> Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>
> ---

(Next time please include in the subject the tree that you're targetting
the patch)

I feel like it's only a workaround, not a real solution. On previous
thread Igor says:

"The limitation here is we can't tell HW on granularity less than 1K."

Are you saying that the minimum headroom that we could provide is 1k?
Maybe put more pressure on memory side and pull in order-1 pages, provide
this big headroom and tailroom for skb_shared_info and use build_skb by
default? With standard 1500 byte MTU.

This issue would pop up again if this driver would like to support XDP
where 256 byte headroom will have to be provided.

[...]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] aquantia: Remove the build_skb path
  2020-11-19 22:34                 ` [PATCH v3] " Ramsay, Lincoln
  2020-11-19 22:49                   ` Maciej Fijalkowski
@ 2020-11-19 22:58                   ` Florian Westphal
  2020-11-19 23:52                     ` [PATCH v4] " Ramsay, Lincoln
  1 sibling, 1 reply; 28+ messages in thread
From: Florian Westphal @ 2020-11-19 22:58 UTC (permalink / raw)
  To: Ramsay, Lincoln
  Cc: Florian Westphal, Igor Russkikh, David S. Miller, Jakub Kicinski,
	netdev, Dmitry Bogdanov

Ramsay, Lincoln <Lincoln.Ramsay@digi.com> wrote:
> When performing IPv6 forwarding, there is an expectation that SKBs
> will have some headroom. When forwarding a packet from the aquantia
> driver, this does not always happen, triggering a kernel warning.
> 
> The build_skb path fails to allow for an SKB header, but the hardware

For build_skb path to work the buffer scheme would need to be changed
to reserve headroom, so yes, I think that the proposed patch is the
most convenient solution.

> buffer it is built around won't allow for this anyway. Just always use the
> slower codepath that copies memory into an allocated SKB.

I thought this changes the driver to always copy the entire packet, but
thats not true, see below.

> It seems that skb_headroom is only 14, when it is expected to be >= 16.

Yes, kernel expects to have some headroom in skbs.

> aq_ring.c has this code (edited slightly for brevity):
> 
> if (buff->is_eop && buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
>     skb = build_skb(aq_buf_vaddr(&buff->rxdata), AQ_CFG_RX_FRAME_MAX);
>     skb_put(skb, buff->len);
> } else {
>     skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
> 
> There is a significant difference between the SKB produced by these 2 code paths. When napi_alloc_skb creates an SKB, there is a certain amount of headroom reserved. The same pattern appears to be used in all of the other ethernet drivers I have looked at. However, this is not done in the build_skb codepath.

I think the above should be part of the commit message rather than this
meta-space (which gets removed by git-am).

> +		skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
> +		if (unlikely(!skb)) {

AQ_CFG_RX_HDR_SIZE is 256 byte, so for larger packets ...

> +		memcpy(__skb_put(skb, hdr_len), aq_buf_vaddr(&buff->rxdata),
> +			ALIGN(hdr_len, sizeof(long)));

This only copies the initial part and then...
> +		if (buff->len - hdr_len > 0) {
> +			skb_add_rx_frag(skb, 0, buff->rxdata.page,
> +					buff->rxdata.pg_off + hdr_len,
> +					buff->len - hdr_len,
>  					AQ_CFG_RX_FRAME_MAX);

The rest is added as a frag.

IOW, this patch looks good to me, but could you update the
commit message so it becomes clear that this doesn't result in a full
copy?

Perhaps something like:
'Just always use the napi_alloc_skb() code path that passes the buffer
 as a page fragment', or similar.

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v4] aquantia: Remove the build_skb path
  2020-11-19 22:58                   ` Florian Westphal
@ 2020-11-19 23:52                     ` Ramsay, Lincoln
  2020-11-20  0:17                       ` Florian Westphal
  2020-11-21 21:22                       ` Jakub Kicinski
  0 siblings, 2 replies; 28+ messages in thread
From: Ramsay, Lincoln @ 2020-11-19 23:52 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Igor Russkikh, David S. Miller, Jakub Kicinski, netdev, Dmitry Bogdanov

When performing IPv6 forwarding, there is an expectation that SKBs
will have some headroom. When forwarding a packet from the aquantia
driver, this does not always happen, triggering a kernel warning.

aq_ring.c has this code (edited slightly for brevity):

if (buff->is_eop && buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
    skb = build_skb(aq_buf_vaddr(&buff->rxdata), AQ_CFG_RX_FRAME_MAX);
} else {
    skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);

There is a significant difference between the SKB produced by these
2 code paths. When napi_alloc_skb creates an SKB, there is a certain
amount of headroom reserved. However, this is not done in the
build_skb codepath.

As the hardware buffer that build_skb is built around does not
handle the presence of the SKB header, this code path is being
removed and the napi_alloc_skb path will always be used. This code
path does have to copy the packet header into the SKB, but it adds
the packet data as a frag.

Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>
---

> For build_skb path to work the buffer scheme would need to be changed
> to reserve headroom, so yes, I think that the proposed patch is the
> most convenient solution.

I don't know about benefits/feasibility, but I did wonder if (in the event that the "fast path" is possible), the dma_mapping could use an offset? The page would include the skb header but the dma mapping would not. If that was done though, only 1 RX frame would fit into the page (at least on my system, where the RX frame seems to be 2k and the page is 4k). Also, there's a possibility to set the "order" variable, so that multiple pages are created at once and I'm not sure if this would work in that case.

> This only copies the initial part and then the rest is added as a frag.

Oh yeah. That's not as bad as I had thought then :)

I wonder though... if the "fast path" is possible, could the whole packet (including header) be added as a frag, avoiding the header copy? Or is that not how SKBs work?


 .../net/ethernet/aquantia/atlantic/aq_ring.c  | 127 ++++++++----------
 1 file changed, 53 insertions(+), 74 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 4f913658eea4..425e8e5afec7 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -413,85 +413,64 @@ int aq_ring_rx_clean(struct aq_ring_s *self,
 					      buff->rxdata.pg_off,
 					      buff->len, DMA_FROM_DEVICE);
 
-		/* for single fragment packets use build_skb() */
-		if (buff->is_eop &&
-		    buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
-			skb = build_skb(aq_buf_vaddr(&buff->rxdata),
+		skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
+		if (unlikely(!skb)) {
+			u64_stats_update_begin(&self->stats.rx.syncp);
+			self->stats.rx.skb_alloc_fails++;
+			u64_stats_update_end(&self->stats.rx.syncp);
+			err = -ENOMEM;
+			goto err_exit;
+		}
+		if (is_ptp_ring)
+			buff->len -=
+				aq_ptp_extract_ts(self->aq_nic, skb,
+					aq_buf_vaddr(&buff->rxdata),
+					buff->len);
+
+		hdr_len = buff->len;
+		if (hdr_len > AQ_CFG_RX_HDR_SIZE)
+			hdr_len = eth_get_headlen(skb->dev,
+							aq_buf_vaddr(&buff->rxdata),
+							AQ_CFG_RX_HDR_SIZE);
+
+		memcpy(__skb_put(skb, hdr_len), aq_buf_vaddr(&buff->rxdata),
+			ALIGN(hdr_len, sizeof(long)));
+
+		if (buff->len - hdr_len > 0) {
+			skb_add_rx_frag(skb, 0, buff->rxdata.page,
+					buff->rxdata.pg_off + hdr_len,
+					buff->len - hdr_len,
 					AQ_CFG_RX_FRAME_MAX);
-			if (unlikely(!skb)) {
-				u64_stats_update_begin(&self->stats.rx.syncp);
-				self->stats.rx.skb_alloc_fails++;
-				u64_stats_update_end(&self->stats.rx.syncp);
-				err = -ENOMEM;
-				goto err_exit;
-			}
-			if (is_ptp_ring)
-				buff->len -=
-					aq_ptp_extract_ts(self->aq_nic, skb,
-						aq_buf_vaddr(&buff->rxdata),
-						buff->len);
-			skb_put(skb, buff->len);
 			page_ref_inc(buff->rxdata.page);
-		} else {
-			skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
-			if (unlikely(!skb)) {
-				u64_stats_update_begin(&self->stats.rx.syncp);
-				self->stats.rx.skb_alloc_fails++;
-				u64_stats_update_end(&self->stats.rx.syncp);
-				err = -ENOMEM;
-				goto err_exit;
-			}
-			if (is_ptp_ring)
-				buff->len -=
-					aq_ptp_extract_ts(self->aq_nic, skb,
-						aq_buf_vaddr(&buff->rxdata),
-						buff->len);
-
-			hdr_len = buff->len;
-			if (hdr_len > AQ_CFG_RX_HDR_SIZE)
-				hdr_len = eth_get_headlen(skb->dev,
-							  aq_buf_vaddr(&buff->rxdata),
-							  AQ_CFG_RX_HDR_SIZE);
-
-			memcpy(__skb_put(skb, hdr_len), aq_buf_vaddr(&buff->rxdata),
-			       ALIGN(hdr_len, sizeof(long)));
-
-			if (buff->len - hdr_len > 0) {
-				skb_add_rx_frag(skb, 0, buff->rxdata.page,
-						buff->rxdata.pg_off + hdr_len,
-						buff->len - hdr_len,
-						AQ_CFG_RX_FRAME_MAX);
-				page_ref_inc(buff->rxdata.page);
-			}
+		}
 
-			if (!buff->is_eop) {
-				buff_ = buff;
-				i = 1U;
-				do {
-					next_ = buff_->next,
-					buff_ = &self->buff_ring[next_];
+		if (!buff->is_eop) {
+			buff_ = buff;
+			i = 1U;
+			do {
+				next_ = buff_->next,
+				buff_ = &self->buff_ring[next_];
 
-					dma_sync_single_range_for_cpu(
-							aq_nic_get_dev(self->aq_nic),
-							buff_->rxdata.daddr,
-							buff_->rxdata.pg_off,
-							buff_->len,
-							DMA_FROM_DEVICE);
-					skb_add_rx_frag(skb, i++,
-							buff_->rxdata.page,
-							buff_->rxdata.pg_off,
-							buff_->len,
-							AQ_CFG_RX_FRAME_MAX);
-					page_ref_inc(buff_->rxdata.page);
-					buff_->is_cleaned = 1;
-
-					buff->is_ip_cso &= buff_->is_ip_cso;
-					buff->is_udp_cso &= buff_->is_udp_cso;
-					buff->is_tcp_cso &= buff_->is_tcp_cso;
-					buff->is_cso_err |= buff_->is_cso_err;
+				dma_sync_single_range_for_cpu(
+						aq_nic_get_dev(self->aq_nic),
+						buff_->rxdata.daddr,
+						buff_->rxdata.pg_off,
+						buff_->len,
+						DMA_FROM_DEVICE);
+				skb_add_rx_frag(skb, i++,
+						buff_->rxdata.page,
+						buff_->rxdata.pg_off,
+						buff_->len,
+						AQ_CFG_RX_FRAME_MAX);
+				page_ref_inc(buff_->rxdata.page);
+				buff_->is_cleaned = 1;
 
-				} while (!buff_->is_eop);
-			}
+				buff->is_ip_cso &= buff_->is_ip_cso;
+				buff->is_udp_cso &= buff_->is_udp_cso;
+				buff->is_tcp_cso &= buff_->is_tcp_cso;
+				buff->is_cso_err |= buff_->is_cso_err;
+
+			} while (!buff_->is_eop);
 		}
 
 		if (buff->is_vlan)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v4] aquantia: Remove the build_skb path
  2020-11-19 23:52                     ` [PATCH v4] " Ramsay, Lincoln
@ 2020-11-20  0:17                       ` Florian Westphal
  2020-11-20  0:23                         ` Ramsay, Lincoln
  2020-11-21 21:22                       ` Jakub Kicinski
  1 sibling, 1 reply; 28+ messages in thread
From: Florian Westphal @ 2020-11-20  0:17 UTC (permalink / raw)
  To: Ramsay, Lincoln
  Cc: Florian Westphal, Igor Russkikh, David S. Miller, Jakub Kicinski,
	netdev, Dmitry Bogdanov

Ramsay, Lincoln <Lincoln.Ramsay@digi.com> wrote:

[ patch looks good to me, I have no further comments ]

> > For build_skb path to work the buffer scheme would need to be changed
> > to reserve headroom, so yes, I think that the proposed patch is the
> > most convenient solution.
> 
> I don't know about benefits/feasibility, but I did wonder if (in the event that the "fast path" is possible), the dma_mapping could use an offset? The page would include the skb header but the dma mapping would not. If that was done though, only 1 RX frame would fit into the page (at least on my system, where the RX frame seems to be 2k and the page is 4k). Also, there's a possibility to set the "order" variable, so that multiple pages are created at once and I'm not sure if this would work in that case.

Yes, this is what some drivers do, they allocate a page, pass
pageaddr + headroom_offset everywhere, except build_skb() which gets the
pageaddr followed by skb_reserve(skb, headroom_offset).

> > This only copies the initial part and then the rest is added as a frag.
> 
> Oh yeah. That's not as bad as I had thought then :)
> 
> I wonder though... if the "fast path" is possible, could the whole packet (including header) be added as a frag, avoiding the header copy? Or is that not how SKBs work?

No, you can either have skb->head point to the page (build_skb), or
skb->head needs to be kmalloc'd (napi_alloc_skb for example).

Both can have page frags. In the second case, at least L2 header
needs to be in skb->head (and ip stack would pull in L3 header as well
later on anyway).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4] aquantia: Remove the build_skb path
  2020-11-20  0:17                       ` Florian Westphal
@ 2020-11-20  0:23                         ` Ramsay, Lincoln
  0 siblings, 0 replies; 28+ messages in thread
From: Ramsay, Lincoln @ 2020-11-20  0:23 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Igor Russkikh, David S. Miller, Jakub Kicinski, netdev, Dmitry Bogdanov

> > I don't know about benefits/feasibility, but I did wonder if (in the event that the "fast path" is possible), the dma_mapping could use an offset? The page would include the skb header but the dma mapping would not. If that was done though, only 1 RX frame would fit into the page (at least on my system, where the RX frame seems to be 2k and the page is 4k). Also, there's a possibility to set the "order" variable, so that multiple pages are created at once and I'm not sure if this would work in that case.
> 
> Yes, this is what some drivers do, they allocate a page, pass
> pageaddr + headroom_offset everywhere, except build_skb() which gets the
> pageaddr followed by skb_reserve(skb, headroom_offset).

If everyone's happy with the patch as-is, I might just leave it and let people more knowledgeable decide if it's worth adding this optimization back in with this offset logic ;)

Lincoln

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [EXT] [PATCH] aquantia: Remove the build_skb path
  2020-11-19 22:01       ` [PATCH] aquantia: Remove the build_skb path Ramsay, Lincoln
  2020-11-19 22:07         ` [PATCH v2] " Ramsay, Lincoln
@ 2020-11-20  7:52         ` Igor Russkikh
  2020-11-23  4:20           ` Ramsay, Lincoln
  1 sibling, 1 reply; 28+ messages in thread
From: Igor Russkikh @ 2020-11-20  7:52 UTC (permalink / raw)
  To: Ramsay, Lincoln, David S. Miller, Jakub Kicinski, netdev,
	Dmitry Bogdanov [C]



On 20/11/2020 1:01 am, Ramsay, Lincoln wrote:
> External Email
> 
> ----------------------------------------------------------------------
> The build_skb path fails to allow for an SKB header, but the hardware
> buffer it is built around won't allow for this anyway.
> 
> Just always use the slower codepath that copies memory into an
> allocated SKB.
> 
> Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>

Acked-by: Igor Russkikh <irusskikh@marvell.com>

Yep, that could be the only way to fix this for now.

Have you tried to estimate any performance drops from this?

The most harm may be here on smaller packets, for stuff like UDP.

Regards,
  Igor

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [EXT] Re: [PATCH v3] aquantia: Remove the build_skb path
  2020-11-19 22:49                   ` Maciej Fijalkowski
@ 2020-11-20  8:18                     ` Igor Russkikh
  2020-11-23 19:28                       ` Maciej Fijalkowski
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Russkikh @ 2020-11-20  8:18 UTC (permalink / raw)
  To: Maciej Fijalkowski, Ramsay, Lincoln
  Cc: Florian Westphal, David S. Miller, Jakub Kicinski, netdev,
	Dmitry Bogdanov [C]



On 20/11/2020 1:49 am, Maciej Fijalkowski wrote:
> External Email
> 
> ----------------------------------------------------------------------
> On Thu, Nov 19, 2020 at 10:34:48PM +0000, Ramsay, Lincoln wrote:
>> When performing IPv6 forwarding, there is an expectation that SKBs
>> will have some headroom. When forwarding a packet from the aquantia
>> driver, this does not always happen, triggering a kernel warning.
>>
>> The build_skb path fails to allow for an SKB header, but the hardware
>> buffer it is built around won't allow for this anyway. Just always use
> the
>> slower codepath that copies memory into an allocated SKB.
>>
>> Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>
>> ---
> 
> (Next time please include in the subject the tree that you're targetting
> the patch)
> 
> I feel like it's only a workaround, not a real solution. On previous
> thread Igor says:
> 
> "The limitation here is we can't tell HW on granularity less than 1K."
> 
> Are you saying that the minimum headroom that we could provide is 1k?

We can tell HW to place packets with 4 bytes granularity addresses, but the
problem is the length granularity of this buffer - 1K.

This means we can do as Ramsay initially suggested - just offset the packet
placement. But then we have to guarantee that 1K after this offset is
available to HW.

Since normal layout is 1400 packets - we do use 2K (half page) for each packet.
This way we reuse each allocated page for at least two packets (and putting
skb_shared into the remaining 512b).

Obviously we may allocate 4K page for a single packet, and tell HW that it can
use 3K for data. This'll give 1K headroom. Quite an overload - assuming IMIX
is of 0.5K - 1.4K..

Of course that depends on a usecase. If you know all your traffic is 16K
jumbos - putting 1K headroom is very small overhead on memory usage.

> Maybe put more pressure on memory side and pull in order-1 pages, provide
> this big headroom and tailroom for skb_shared_info and use build_skb by
> default? With standard 1500 byte MTU.
I know many customers do consider AQC chips in near embedded environments
(routers, etc). They really do care about memories. So that could be risky.

> This issue would pop up again if this driver would like to support XDP
> where 256 byte headroom will have to be provided.

Actually it already popped. Thats one of the reasons I'm delaying with xdp
patch series for this driver.

I think the best tradeoff here would be allocating order 1 or 2 pages (i.e. 8K
or 16K), and reuse the page for multiple placements of 2K XDP packets:

(256+2048)*3 = 6912 (1K overhead for each 3 packets)

(256+2048)*7 = 16128 (200b overhead over 7 packets)

Regards,
  Igor




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4] aquantia: Remove the build_skb path
  2020-11-19 23:52                     ` [PATCH v4] " Ramsay, Lincoln
  2020-11-20  0:17                       ` Florian Westphal
@ 2020-11-21 21:22                       ` Jakub Kicinski
  2020-11-21 21:23                         ` Jakub Kicinski
  2020-11-22 21:55                         ` [PATCH v4] " Ramsay, Lincoln
  1 sibling, 2 replies; 28+ messages in thread
From: Jakub Kicinski @ 2020-11-21 21:22 UTC (permalink / raw)
  To: Ramsay, Lincoln
  Cc: Florian Westphal, Igor Russkikh, David S. Miller, netdev,
	Dmitry Bogdanov

On Thu, 19 Nov 2020 23:52:55 +0000 Ramsay, Lincoln wrote:
> When performing IPv6 forwarding, there is an expectation that SKBs
> will have some headroom. When forwarding a packet from the aquantia
> driver, this does not always happen, triggering a kernel warning.
> 
> aq_ring.c has this code (edited slightly for brevity):
> 
> if (buff->is_eop && buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
>     skb = build_skb(aq_buf_vaddr(&buff->rxdata), AQ_CFG_RX_FRAME_MAX);
> } else {
>     skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
> 
> There is a significant difference between the SKB produced by these
> 2 code paths. When napi_alloc_skb creates an SKB, there is a certain
> amount of headroom reserved. However, this is not done in the
> build_skb codepath.
> 
> As the hardware buffer that build_skb is built around does not
> handle the presence of the SKB header, this code path is being
> removed and the napi_alloc_skb path will always be used. This code
> path does have to copy the packet header into the SKB, but it adds
> the packet data as a frag.
> 
> Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>

I was going to apply as a fix to net and stable but too many small nits
here to pass. First of all please add a From: line at the beginning of
the mail which matches the signoff (or use git-send-email, it'll get it
right).

> diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
> index 4f913658eea4..425e8e5afec7 100644
> --- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
> +++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
> @@ -413,85 +413,64 @@ int aq_ring_rx_clean(struct aq_ring_s *self,
>  					      buff->rxdata.pg_off,
>  					      buff->len, DMA_FROM_DEVICE);
>  
> -		/* for single fragment packets use build_skb() */
> -		if (buff->is_eop &&
> -		    buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
> -			skb = build_skb(aq_buf_vaddr(&buff->rxdata),
> +		skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
> +		if (unlikely(!skb)) {
> +			u64_stats_update_begin(&self->stats.rx.syncp);
> +			self->stats.rx.skb_alloc_fails++;
> +			u64_stats_update_end(&self->stats.rx.syncp);
> +			err = -ENOMEM;
> +			goto err_exit;
> +		}
> +		if (is_ptp_ring)
> +			buff->len -=
> +				aq_ptp_extract_ts(self->aq_nic, skb,
> +					aq_buf_vaddr(&buff->rxdata),
> +					buff->len);

Align continuations of the lines under '(' like: 

  ./scripts/checkpatch.pl --max-line-length=80 --strict 

is asking you to.

> +		hdr_len = buff->len;
> +		if (hdr_len > AQ_CFG_RX_HDR_SIZE)
> +			hdr_len = eth_get_headlen(skb->dev,
> +							aq_buf_vaddr(&buff->rxdata),
> +							AQ_CFG_RX_HDR_SIZE);

ditto

> +		memcpy(__skb_put(skb, hdr_len), aq_buf_vaddr(&buff->rxdata),
> +			ALIGN(hdr_len, sizeof(long)));

And here, etc.

> +		if (buff->len - hdr_len > 0) {
> +			skb_add_rx_frag(skb, 0, buff->rxdata.page,
> +					buff->rxdata.pg_off + hdr_len,
> +					buff->len - hdr_len,

> +		if (!buff->is_eop) {
> +			buff_ = buff;
> +			i = 1U;
> +			do {
> +				next_ = buff_->next,

The end of this line should be a semicolon.

> +				buff_ = &self->buff_ring[next_];

Thanks!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4] aquantia: Remove the build_skb path
  2020-11-21 21:22                       ` Jakub Kicinski
@ 2020-11-21 21:23                         ` Jakub Kicinski
  2020-11-22 22:36                           ` Ramsay, Lincoln
  2020-11-22 21:55                         ` [PATCH v4] " Ramsay, Lincoln
  1 sibling, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2020-11-21 21:23 UTC (permalink / raw)
  To: Ramsay, Lincoln
  Cc: Florian Westphal, Igor Russkikh, David S. Miller, netdev,
	Dmitry Bogdanov

On Sat, 21 Nov 2020 13:22:04 -0800 Jakub Kicinski wrote:
> On Thu, 19 Nov 2020 23:52:55 +0000 Ramsay, Lincoln wrote:
> > When performing IPv6 forwarding, there is an expectation that SKBs
> > will have some headroom. When forwarding a packet from the aquantia
> > driver, this does not always happen, triggering a kernel warning.
> > 
> > aq_ring.c has this code (edited slightly for brevity):
> > 
> > if (buff->is_eop && buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
> >     skb = build_skb(aq_buf_vaddr(&buff->rxdata), AQ_CFG_RX_FRAME_MAX);
> > } else {
> >     skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
> > 
> > There is a significant difference between the SKB produced by these
> > 2 code paths. When napi_alloc_skb creates an SKB, there is a certain
> > amount of headroom reserved. However, this is not done in the
> > build_skb codepath.
> > 
> > As the hardware buffer that build_skb is built around does not
> > handle the presence of the SKB header, this code path is being
> > removed and the napi_alloc_skb path will always be used. This code
> > path does have to copy the packet header into the SKB, but it adds
> > the packet data as a frag.
> > 
> > Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>  
> 
> I was going to apply as a fix to net and stable but too many small nits
> here to pass. First of all please add a From: line at the beginning of
> the mail which matches the signoff (or use git-send-email, it'll get it
> right).

Ah, one more thing, this is the correct fixes tag, right?

Fixes: 018423e90bee ("net: ethernet: aquantia: Add ring support code")

Please add it right before the signoff line.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4] aquantia: Remove the build_skb path
  2020-11-21 21:22                       ` Jakub Kicinski
  2020-11-21 21:23                         ` Jakub Kicinski
@ 2020-11-22 21:55                         ` Ramsay, Lincoln
  1 sibling, 0 replies; 28+ messages in thread
From: Ramsay, Lincoln @ 2020-11-22 21:55 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Florian Westphal, Igor Russkikh, David S. Miller, netdev,
	Dmitry Bogdanov

> Align continuations of the lines under '(' like:

Oh... I didn't run the patch checker over this revised patch. In this case, I am only changing the leading indent. Am I still expected to satisfy the patch checker?

The current patch is very clear about what is happening if you do a diff -w but if I start changing other things to satisfy the checker, that goes away.

I can do that... just double checking that it's wanted...

Lincoln

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4] aquantia: Remove the build_skb path
  2020-11-21 21:23                         ` Jakub Kicinski
@ 2020-11-22 22:36                           ` Ramsay, Lincoln
  2020-11-23 16:42                             ` Jakub Kicinski
  0 siblings, 1 reply; 28+ messages in thread
From: Ramsay, Lincoln @ 2020-11-22 22:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Florian Westphal, Igor Russkikh, David S. Miller, netdev,
	Dmitry Bogdanov

> (Next time please include in the subject the tree that you're targetting
> the patch)

I guess you mean like [PATCH master v5] ? Should I be targeting something other than the master branch on the main git repo? (https://github.com/torvalds/linux.git)

> please add a From: line at the beginning of the mail which matches
> the signoff (or use git-send-email, it'll get it right).

Sure.

> Ah, one more thing, this is the correct fixes tag, right?
> Fixes: 018423e90bee ("net: ethernet: aquantia: Add ring support code")
> Please add it right before the signoff line.

I didn't quite understand this header... but yeah, I guess that's the commit that adds the fast path I am removing.

> > Align continuations of the lines under '(' like:
> 
> I am only changing the leading indent. Am I still expected to satisfy the patch checker?
> 
> The current patch is very clear about what is happening if you do a diff -w but if I start
> changing other things to satisfy the checker, that goes away.

Some of the patch checker complaints are only leading whitespace (obviously not a problem for diff -w), but 2 of them involve actual changes (changing , to ; and moving the first argument from the line below to the line above).

Lincoln

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [EXT] [PATCH] aquantia: Remove the build_skb path
  2020-11-20  7:52         ` [EXT] [PATCH] " Igor Russkikh
@ 2020-11-23  4:20           ` Ramsay, Lincoln
  2020-11-24 14:29             ` Igor Russkikh
  0 siblings, 1 reply; 28+ messages in thread
From: Ramsay, Lincoln @ 2020-11-23  4:20 UTC (permalink / raw)
  To: Igor Russkikh, David S. Miller, Jakub Kicinski, netdev,
	Dmitry Bogdanov [C]

> Yep, that could be the only way to fix this for now.
> Have you tried to estimate any performance drops from this?

Unfortunately, I am not in a very good position to do this. The 10G interfaces on our device don't actually have enough raw PCI bandwidth available to hit 10G transfer rates.

I did use iperf3 and saw bursts over 2Gbit/sec (with average closer to 1.3Gbit/sec on a good run). There was no significant difference between running with and without the patch. I am told that this is about as good as can be expected.

Make of that what you will :)

Lincoln

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4] aquantia: Remove the build_skb path
  2020-11-22 22:36                           ` Ramsay, Lincoln
@ 2020-11-23 16:42                             ` Jakub Kicinski
  2020-11-23 21:40                               ` [PATCH net v5] " Ramsay, Lincoln
  0 siblings, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2020-11-23 16:42 UTC (permalink / raw)
  To: Ramsay, Lincoln
  Cc: Florian Westphal, Igor Russkikh, David S. Miller, netdev,
	Dmitry Bogdanov

On Sun, 22 Nov 2020 22:36:22 +0000 Ramsay, Lincoln wrote:
> > (Next time please include in the subject the tree that you're targetting
> > the patch)  
> 
> I guess you mean like [PATCH master v5] ? Should I be targeting
> something other than the master branch on the main git repo?
> (https://github.com/torvalds/linux.git)

In this case the patch will be merged into the networking tree, and
then travel downstream to Linus. So you want to target this tree:

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/

IOW [PATCH net v5].

> > please add a From: line at the beginning of the mail which matches
> > the signoff (or use git-send-email, it'll get it right).  
> 
> Sure.
> 
> > Ah, one more thing, this is the correct fixes tag, right?
> > Fixes: 018423e90bee ("net: ethernet: aquantia: Add ring support code")
> > Please add it right before the signoff line.  
> 
> I didn't quite understand this header... but yeah, I guess that's the
> commit that adds the fast path I am removing.

Yup, it points to the oldest revision of the code where the bug is
present. In your case oldest revision where:

    When performing IPv6 forwarding, there is an expectation that SKBs
    will have some headroom. When forwarding a packet from the aquantia
    driver, this does not always happen, triggering a kernel warning.

> > > Align continuations of the lines under '(' like:  
> > 
> > I am only changing the leading indent. Am I still expected to satisfy the patch checker?
> > 
> > The current patch is very clear about what is happening if you do a diff -w but if I start
> > changing other things to satisfy the checker, that goes away.  
> 
> Some of the patch checker complaints are only leading whitespace
> (obviously not a problem for diff -w), but 2 of them involve actual
> changes (changing , to ; and moving the first argument from the line
> below to the line above).

I don't think it'll make a huge difference for the review-ability of
this change to heed checkpatch's warnings here.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [EXT] Re: [PATCH v3] aquantia: Remove the build_skb path
  2020-11-20  8:18                     ` [EXT] " Igor Russkikh
@ 2020-11-23 19:28                       ` Maciej Fijalkowski
  2020-11-24 15:26                         ` Igor Russkikh
  0 siblings, 1 reply; 28+ messages in thread
From: Maciej Fijalkowski @ 2020-11-23 19:28 UTC (permalink / raw)
  To: Igor Russkikh
  Cc: Ramsay, Lincoln, Florian Westphal, David S. Miller,
	Jakub Kicinski, netdev, Dmitry Bogdanov [C]

On Fri, Nov 20, 2020 at 11:18:34AM +0300, Igor Russkikh wrote:
> 
> 
> On 20/11/2020 1:49 am, Maciej Fijalkowski wrote:
> > External Email
> > 
> > ----------------------------------------------------------------------
> > On Thu, Nov 19, 2020 at 10:34:48PM +0000, Ramsay, Lincoln wrote:
> >> When performing IPv6 forwarding, there is an expectation that SKBs
> >> will have some headroom. When forwarding a packet from the aquantia
> >> driver, this does not always happen, triggering a kernel warning.
> >>
> >> The build_skb path fails to allow for an SKB header, but the hardware
> >> buffer it is built around won't allow for this anyway. Just always use
> > the
> >> slower codepath that copies memory into an allocated SKB.
> >>
> >> Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>
> >> ---
> > 
> > (Next time please include in the subject the tree that you're targetting
> > the patch)
> > 
> > I feel like it's only a workaround, not a real solution. On previous
> > thread Igor says:
> > 
> > "The limitation here is we can't tell HW on granularity less than 1K."
> > 
> > Are you saying that the minimum headroom that we could provide is 1k?
> 
> We can tell HW to place packets with 4 bytes granularity addresses, but the
> problem is the length granularity of this buffer - 1K.
> 
> This means we can do as Ramsay initially suggested - just offset the packet
> placement. But then we have to guarantee that 1K after this offset is
> available to HW.

Ok, I see, thanks for clarifying.

> 
> Since normal layout is 1400 packets - we do use 2K (half page) for each packet.

What is 'normal layout is 1400 packets' ? Didn't you mean the 1500 byte
standard MTU? So this is what you've been trying to tell me - that for
1500 byte mtu and 1k HW granularity you need to provide to HW 2k of
contiguous space, correct?

> This way we reuse each allocated page for at least two packets (and putting
> skb_shared into the remaining 512b).

I don't think I follow that. I thought that 2k needs to be exclusive for
HW and now you're saying that for remaining 512 bytes you can do whatever
you want.

If that's true then I think you can have build_skb support and I don't see
that 1k granularity as a limitation.

> 
> Obviously we may allocate 4K page for a single packet, and tell HW that it can
> use 3K for data. This'll give 1K headroom. Quite an overload - assuming IMIX
> is of 0.5K - 1.4K..
> 
> Of course that depends on a usecase. If you know all your traffic is 16K
> jumbos - putting 1K headroom is very small overhead on memory usage.
> 
> > Maybe put more pressure on memory side and pull in order-1 pages, provide
> > this big headroom and tailroom for skb_shared_info and use build_skb by
> > default? With standard 1500 byte MTU.
> I know many customers do consider AQC chips in near embedded environments
> (routers, etc). They really do care about memories. So that could be risky.

We have a knob that is controlled by ethtool's priv flag so you can change
the memory model and pull the build_skb out of the picture. Just FYI.

> 
> > This issue would pop up again if this driver would like to support XDP
> > where 256 byte headroom will have to be provided.
> 
> Actually it already popped. Thats one of the reasons I'm delaying with xdp
> patch series for this driver.
> 
> I think the best tradeoff here would be allocating order 1 or 2 pages (i.e. 8K
> or 16K), and reuse the page for multiple placements of 2K XDP packets:
> 
> (256+2048)*3 = 6912 (1K overhead for each 3 packets)
> 
> (256+2048)*7 = 16128 (200b overhead over 7 packets)

And for XDP_PASS you would use build_skb? Then tailroom needs to be
provided.

> 
> Regards,
>   Igor
> 
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH net v5] aquantia: Remove the build_skb path
  2020-11-23 16:42                             ` Jakub Kicinski
@ 2020-11-23 21:40                               ` Ramsay, Lincoln
  2020-11-24 19:02                                 ` Jakub Kicinski
  0 siblings, 1 reply; 28+ messages in thread
From: Ramsay, Lincoln @ 2020-11-23 21:40 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Florian Westphal, Igor Russkikh, David S. Miller, netdev,
	Dmitry Bogdanov

From: Lincoln Ramsay <lincoln.ramsay@opengear.com>

When performing IPv6 forwarding, there is an expectation that SKBs
will have some headroom. When forwarding a packet from the aquantia
driver, this does not always happen, triggering a kernel warning.

aq_ring.c has this code (edited slightly for brevity):

if (buff->is_eop && buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
    skb = build_skb(aq_buf_vaddr(&buff->rxdata), AQ_CFG_RX_FRAME_MAX);
} else {
    skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);

There is a significant difference between the SKB produced by these
2 code paths. When napi_alloc_skb creates an SKB, there is a certain
amount of headroom reserved. However, this is not done in the
build_skb codepath.

As the hardware buffer that build_skb is built around does not
handle the presence of the SKB header, this code path is being
removed and the napi_alloc_skb path will always be used. This code
path does have to copy the packet header into the SKB, but it adds
the packet data as a frag.

Fixes: 018423e90bee ("net: ethernet: aquantia: Add ring support code")
Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>
---
 .../net/ethernet/aquantia/atlantic/aq_ring.c  | 126 ++++++++----------
 1 file changed, 52 insertions(+), 74 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 4f913658eea4..24122ccda614 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -413,85 +413,63 @@ int aq_ring_rx_clean(struct aq_ring_s *self,
 					      buff->rxdata.pg_off,
 					      buff->len, DMA_FROM_DEVICE);
 
-		/* for single fragment packets use build_skb() */
-		if (buff->is_eop &&
-		    buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
-			skb = build_skb(aq_buf_vaddr(&buff->rxdata),
+		skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
+		if (unlikely(!skb)) {
+			u64_stats_update_begin(&self->stats.rx.syncp);
+			self->stats.rx.skb_alloc_fails++;
+			u64_stats_update_end(&self->stats.rx.syncp);
+			err = -ENOMEM;
+			goto err_exit;
+		}
+		if (is_ptp_ring)
+			buff->len -=
+				aq_ptp_extract_ts(self->aq_nic, skb,
+						  aq_buf_vaddr(&buff->rxdata),
+						  buff->len);
+
+		hdr_len = buff->len;
+		if (hdr_len > AQ_CFG_RX_HDR_SIZE)
+			hdr_len = eth_get_headlen(skb->dev,
+						  aq_buf_vaddr(&buff->rxdata),
+						  AQ_CFG_RX_HDR_SIZE);
+
+		memcpy(__skb_put(skb, hdr_len), aq_buf_vaddr(&buff->rxdata),
+		       ALIGN(hdr_len, sizeof(long)));
+
+		if (buff->len - hdr_len > 0) {
+			skb_add_rx_frag(skb, 0, buff->rxdata.page,
+					buff->rxdata.pg_off + hdr_len,
+					buff->len - hdr_len,
 					AQ_CFG_RX_FRAME_MAX);
-			if (unlikely(!skb)) {
-				u64_stats_update_begin(&self->stats.rx.syncp);
-				self->stats.rx.skb_alloc_fails++;
-				u64_stats_update_end(&self->stats.rx.syncp);
-				err = -ENOMEM;
-				goto err_exit;
-			}
-			if (is_ptp_ring)
-				buff->len -=
-					aq_ptp_extract_ts(self->aq_nic, skb,
-						aq_buf_vaddr(&buff->rxdata),
-						buff->len);
-			skb_put(skb, buff->len);
 			page_ref_inc(buff->rxdata.page);
-		} else {
-			skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
-			if (unlikely(!skb)) {
-				u64_stats_update_begin(&self->stats.rx.syncp);
-				self->stats.rx.skb_alloc_fails++;
-				u64_stats_update_end(&self->stats.rx.syncp);
-				err = -ENOMEM;
-				goto err_exit;
-			}
-			if (is_ptp_ring)
-				buff->len -=
-					aq_ptp_extract_ts(self->aq_nic, skb,
-						aq_buf_vaddr(&buff->rxdata),
-						buff->len);
-
-			hdr_len = buff->len;
-			if (hdr_len > AQ_CFG_RX_HDR_SIZE)
-				hdr_len = eth_get_headlen(skb->dev,
-							  aq_buf_vaddr(&buff->rxdata),
-							  AQ_CFG_RX_HDR_SIZE);
-
-			memcpy(__skb_put(skb, hdr_len), aq_buf_vaddr(&buff->rxdata),
-			       ALIGN(hdr_len, sizeof(long)));
-
-			if (buff->len - hdr_len > 0) {
-				skb_add_rx_frag(skb, 0, buff->rxdata.page,
-						buff->rxdata.pg_off + hdr_len,
-						buff->len - hdr_len,
-						AQ_CFG_RX_FRAME_MAX);
-				page_ref_inc(buff->rxdata.page);
-			}
+		}
 
-			if (!buff->is_eop) {
-				buff_ = buff;
-				i = 1U;
-				do {
-					next_ = buff_->next,
-					buff_ = &self->buff_ring[next_];
+		if (!buff->is_eop) {
+			buff_ = buff;
+			i = 1U;
+			do {
+				next_ = buff_->next;
+				buff_ = &self->buff_ring[next_];
 
-					dma_sync_single_range_for_cpu(
-							aq_nic_get_dev(self->aq_nic),
-							buff_->rxdata.daddr,
-							buff_->rxdata.pg_off,
-							buff_->len,
-							DMA_FROM_DEVICE);
-					skb_add_rx_frag(skb, i++,
-							buff_->rxdata.page,
-							buff_->rxdata.pg_off,
-							buff_->len,
-							AQ_CFG_RX_FRAME_MAX);
-					page_ref_inc(buff_->rxdata.page);
-					buff_->is_cleaned = 1;
-
-					buff->is_ip_cso &= buff_->is_ip_cso;
-					buff->is_udp_cso &= buff_->is_udp_cso;
-					buff->is_tcp_cso &= buff_->is_tcp_cso;
-					buff->is_cso_err |= buff_->is_cso_err;
+				dma_sync_single_range_for_cpu(aq_nic_get_dev(self->aq_nic),
+							      buff_->rxdata.daddr,
+							      buff_->rxdata.pg_off,
+							      buff_->len,
+							      DMA_FROM_DEVICE);
+				skb_add_rx_frag(skb, i++,
+						buff_->rxdata.page,
+						buff_->rxdata.pg_off,
+						buff_->len,
+						AQ_CFG_RX_FRAME_MAX);
+				page_ref_inc(buff_->rxdata.page);
+				buff_->is_cleaned = 1;
 
-				} while (!buff_->is_eop);
-			}
+				buff->is_ip_cso &= buff_->is_ip_cso;
+				buff->is_udp_cso &= buff_->is_udp_cso;
+				buff->is_tcp_cso &= buff_->is_tcp_cso;
+				buff->is_cso_err |= buff_->is_cso_err;
+
+			} while (!buff_->is_eop);
 		}
 
 		if (buff->is_vlan)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [EXT] [PATCH] aquantia: Remove the build_skb path
  2020-11-23  4:20           ` Ramsay, Lincoln
@ 2020-11-24 14:29             ` Igor Russkikh
  0 siblings, 0 replies; 28+ messages in thread
From: Igor Russkikh @ 2020-11-24 14:29 UTC (permalink / raw)
  To: Ramsay, Lincoln, David S. Miller, Jakub Kicinski, netdev,
	Dmitry Bogdanov [C]



On 23/11/2020 7:20 am, Ramsay, Lincoln wrote:
>> Yep, that could be the only way to fix this for now.
>> Have you tried to estimate any performance drops from this?
> 
> Unfortunately, I am not in a very good position to do this. The 10G
> interfaces on our device don't actually have enough raw PCI bandwidth
> available to hit 10G transfer rates.
> 
> I did use iperf3 and saw bursts over 2Gbit/sec (with average closer to
> 1.3Gbit/sec on a good run). There was no significant difference between
> running with and without the patch. I am told that this is about as good
> as can be expected.
> 
> Make of that what you will :)

Thats not very useful, but since we anyway have to fix that - lets do it.

I'll try to estimate potential perf drop on my setup when possible.

Thanks,
  Igor

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [EXT] Re: [PATCH v3] aquantia: Remove the build_skb path
  2020-11-23 19:28                       ` Maciej Fijalkowski
@ 2020-11-24 15:26                         ` Igor Russkikh
  0 siblings, 0 replies; 28+ messages in thread
From: Igor Russkikh @ 2020-11-24 15:26 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Ramsay, Lincoln, Florian Westphal, David S. Miller,
	Jakub Kicinski, netdev, Dmitry Bogdanov [C]


>>
>> Since normal layout is 1400 packets - we do use 2K (half page) for each
> packet.
> 
> What is 'normal layout is 1400 packets' ? Didn't you mean the 1500 byte
> standard MTU? So this is what you've been trying to tell me - that for
> 1500 byte mtu and 1k HW granularity you need to provide to HW 2k of
> contiguous space, correct?

Thats right.
Sorry for confusion, of course I meant 1500 standard MTU.

> 
>> This way we reuse each allocated page for at least two packets (and
> putting
>> skb_shared into the remaining 512b).
> 
> I don't think I follow that. I thought that 2k needs to be exclusive for
> HW and now you're saying that for remaining 512 bytes you can do whatever
> you want.

As soon as we've got packet we know its length. IF its less than 2K minus
skb_shared_info - we put that halfpage directly into skb, and placing the tail
for shared_info. This is what fast path is doing now.

> If that's true then I think you can have build_skb support and I don't see
> that 1k granularity as a limitation.

Thats true, but we can't use build_skb exactly because of the reason Ramsay
discovered. We need extra headspace always.

>> I know many customers do consider AQC chips in near embedded
> environments
>> (routers, etc). They really do care about memories. So that could be
> risky.
> 
> We have a knob that is controlled by ethtool's priv flag so you can change
> the memory model and pull the build_skb out of the picture. Just FYI.

Priv flags are considered harmful today...
But I agree in general we lack support of driver fastpath tuning.
Like changing page order for large jumbos or page reuse logic knobs.
May be devlink params could be considered for this?

>>> This issue would pop up again if this driver would like to support XDP
>>> where 256 byte headroom will have to be provided.
>>
>> Actually it already popped. Thats one of the reasons I'm delaying with
> xdp
>> patch series for this driver.
>>
>> I think the best tradeoff here would be allocating order 1 or 2 pages
> (i.e. 8K
>> or 16K), and reuse the page for multiple placements of 2K XDP packets:
>>
>> (256+2048)*3 = 6912 (1K overhead for each 3 packets)
>>
>> (256+2048)*7 = 16128 (200b overhead over 7 packets)
> 
> And for XDP_PASS you would use build_skb? Then tailroom needs to be
> provided.

For efficient PASS - I think both tail and head room should somehow be
reserved. Then yes, build_skb could be used..

Thanks
  Igor


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net v5] aquantia: Remove the build_skb path
  2020-11-23 21:40                               ` [PATCH net v5] " Ramsay, Lincoln
@ 2020-11-24 19:02                                 ` Jakub Kicinski
  0 siblings, 0 replies; 28+ messages in thread
From: Jakub Kicinski @ 2020-11-24 19:02 UTC (permalink / raw)
  To: Ramsay, Lincoln
  Cc: Florian Westphal, Igor Russkikh, David S. Miller, netdev,
	Dmitry Bogdanov

On Mon, 23 Nov 2020 21:40:43 +0000 Ramsay, Lincoln wrote:
> From: Lincoln Ramsay <lincoln.ramsay@opengear.com>
> 
> When performing IPv6 forwarding, there is an expectation that SKBs
> will have some headroom. When forwarding a packet from the aquantia
> driver, this does not always happen, triggering a kernel warning.
> 
> aq_ring.c has this code (edited slightly for brevity):
> 
> if (buff->is_eop && buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
>     skb = build_skb(aq_buf_vaddr(&buff->rxdata), AQ_CFG_RX_FRAME_MAX);
> } else {
>     skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
> 
> There is a significant difference between the SKB produced by these
> 2 code paths. When napi_alloc_skb creates an SKB, there is a certain
> amount of headroom reserved. However, this is not done in the
> build_skb codepath.
> 
> As the hardware buffer that build_skb is built around does not
> handle the presence of the SKB header, this code path is being
> removed and the napi_alloc_skb path will always be used. This code
> path does have to copy the packet header into the SKB, but it adds
> the packet data as a frag.
> 
> Fixes: 018423e90bee ("net: ethernet: aquantia: Add ring support code")
> Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>

Applied, queued of stable.

Thanks!

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-11-24 19:02 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-18  1:52 [PATCH] aquantia: Reserve space when allocating an SKB Ramsay, Lincoln
2020-11-18 14:02 ` [EXT] " Igor Russkikh
2020-11-19  0:14   ` Ramsay, Lincoln
2020-11-19  5:19     ` Ramsay, Lincoln
2020-11-19 22:01       ` [PATCH] aquantia: Remove the build_skb path Ramsay, Lincoln
2020-11-19 22:07         ` [PATCH v2] " Ramsay, Lincoln
2020-11-19 22:15           ` Florian Westphal
2020-11-19 22:24             ` Ramsay, Lincoln
2020-11-19 22:28               ` Florian Westphal
2020-11-19 22:34                 ` [PATCH v3] " Ramsay, Lincoln
2020-11-19 22:49                   ` Maciej Fijalkowski
2020-11-20  8:18                     ` [EXT] " Igor Russkikh
2020-11-23 19:28                       ` Maciej Fijalkowski
2020-11-24 15:26                         ` Igor Russkikh
2020-11-19 22:58                   ` Florian Westphal
2020-11-19 23:52                     ` [PATCH v4] " Ramsay, Lincoln
2020-11-20  0:17                       ` Florian Westphal
2020-11-20  0:23                         ` Ramsay, Lincoln
2020-11-21 21:22                       ` Jakub Kicinski
2020-11-21 21:23                         ` Jakub Kicinski
2020-11-22 22:36                           ` Ramsay, Lincoln
2020-11-23 16:42                             ` Jakub Kicinski
2020-11-23 21:40                               ` [PATCH net v5] " Ramsay, Lincoln
2020-11-24 19:02                                 ` Jakub Kicinski
2020-11-22 21:55                         ` [PATCH v4] " Ramsay, Lincoln
2020-11-20  7:52         ` [EXT] [PATCH] " Igor Russkikh
2020-11-23  4:20           ` Ramsay, Lincoln
2020-11-24 14:29             ` Igor Russkikh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).