netdev.vger.kernel.org archive mirror
* Cadence/macb ethernet driver bug on nonlinear skb buffers
@ 2019-03-08 16:49 Klaus Doth
  2019-03-09  5:50 ` Harini Katakam
  0 siblings, 1 reply; 3+ messages in thread
From: Klaus Doth @ 2019-03-08 16:49 UTC (permalink / raw)
  To: netdev; +Cc: davem, claudiu.beznea, harini.katakam, michal.simek, Nicolas Ferre

Hi,


I think I found a bug in the cadence / macb ethernet driver.

It seems the macb_pad_and_fcs function in macb_main.c does not handle
fragmented/paged sk-buffers correctly: in some cases it does a memmove
and afterwards calls skb_put_u8 on a fragmented buffer. skb_put_u8 then
fails because it asserts that the buffer is linear.


My setup is a Xilinx ZynqMP using two macb ethernet ports, which are
combined in a bridge interface. As long as only those two interfaces are
bridged, everything works fine, but as soon as I add a wireless AP
interface to it, and then connect to the wireless interface using any
WiFi enabled device, the kernel panics with the message appended at the
bottom of this email. I am currently running Kernel 5.0.0-rc8, so this
issue is in the current mainline kernel, and as far as I can see also in
the stable branch.


I did some debugging and traced the issue to the macb_pad_and_fcs
function, and it only occurs for fragmented sk-buffers.

If I understand the code correctly, a fragmented buffer must not be
moved with memmove and then have its free tailroom used for the FCS.
Instead the buffer should be copied, which combines the fragments into
one non-fragmented buffer, because skb_put_u8 does not work on
fragmented buffers. However, as I am not too deep into kernel network
drivers, there may be a better solution, or I could have missed
something important.
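
For illustration, the failing check can be modelled in a few lines of
userspace C. This is only a sketch: struct model_skb, model_is_nonlinear
and model_put are hypothetical stand-ins, not kernel code. The model
keeps just the two length fields and mirrors the "data_len != 0" test
that decides whether a buffer is nonlinear.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct sk_buff; only the two
 * length fields that matter for the linear check are modelled.
 */
struct model_skb {
	unsigned int len;      /* total bytes: linear part + fragments */
	unsigned int data_len; /* bytes held in paged fragments only */
};

/* True when any part of the packet lives in paged fragments. */
static bool model_is_nonlinear(const struct model_skb *skb)
{
	return skb->data_len != 0;
}

/* Models appending n bytes at the tail. Returns false in the case
 * where the real kernel would hit its BUG() assertion instead.
 */
static bool model_put(struct model_skb *skb, unsigned int n)
{
	/* Invariant of the model: fragment bytes are part of len. */
	assert(skb->len >= skb->data_len);

	if (model_is_nonlinear(skb))
		return false; /* no contiguous tail to extend */

	skb->len += n;
	return true;
}
```

With len = 60 and data_len = 0 the put succeeds and len becomes 64.
With data_len = 1400 the model refuses, which corresponds to the point
where the kernel reports the BUG shown in the trace at the bottom of
this email.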


My system is currently running after I changed the first line of
static int macb_pad_and_fcs(struct sk_buff **skb, struct net_device *ndev) from

bool cloned = skb_cloned(*skb) || skb_header_cloned(*skb);

to

bool cloned = skb_cloned(*skb) || skb_header_cloned(*skb) ||
              skb_is_nonlinear(*skb);


I.e. any nonlinear buffer is handled as if it were cloned, which forces
the function to copy the buffer in order to increase its size.
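
As a rough picture of why copying helps, the following userspace sketch
gathers a head and one fragment into a single contiguous allocation
with spare room at the end, after which appending bytes is trivial. All
names here (model_frag_skb, model_linear_buf, model_linearize,
model_put_u8) are hypothetical and only model the idea; this is not the
driver's copy path.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a nonlinear packet: a linear head plus one
 * paged fragment stored elsewhere in memory.
 */
struct model_frag_skb {
	const unsigned char *head;
	size_t head_len;
	const unsigned char *frag;
	size_t frag_len;
};

/* Hypothetical model of the linear copy. */
struct model_linear_buf {
	unsigned char *data;
	size_t len;      /* bytes in use */
	size_t capacity; /* bytes allocated */
};

/* Copy head and fragment into one contiguous buffer with `tailroom`
 * spare bytes. Returns 0 on success, -1 if the allocation fails.
 */
static int model_linearize(const struct model_frag_skb *src,
			   size_t tailroom, struct model_linear_buf *dst)
{
	size_t total = src->head_len + src->frag_len;

	dst->data = malloc(total + tailroom);
	if (!dst->data)
		return -1;

	memcpy(dst->data, src->head, src->head_len);
	memcpy(dst->data + src->head_len, src->frag, src->frag_len);
	dst->len = total;
	dst->capacity = total + tailroom;
	return 0;
}

/* Append one byte; the contiguous copy has a well-defined tail. */
static void model_put_u8(struct model_linear_buf *buf, unsigned char val)
{
	assert(buf->len < buf->capacity);
	buf->data[buf->len++] = val;
}
```

Calling model_linearize() with a tailroom of 4 leaves exactly enough
space for four model_put_u8() calls, one per FCS byte. The cost is an
allocation and a full copy per packet, so this trades throughput for
correctness on fragmented buffers.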


Before the change, the kernel panicked after a few seconds of running
data over the network bridge, and this was reproducible every time the
connection was attempted. Since the change it has been running
continuously for over a day without any issues or visible data loss.


If I can help in any way, let me know.


Best regards,

Klaus.



[ 1123.082887] ------------[ cut here ]------------
[ 1123.087491] kernel BUG at net/core/skbuff.c:1703!
[ 1123.092178] Internal error: Oops - BUG: 0 [#1] SMP
[ 1123.096951] Modules linked in: iwlmvm iwlwifi
[ 1123.101302] CPU: 3 PID: 3171 Comm: irq/53-iwlwifi Not tainted 5.0.0-rc8 #13
[ 1123.108252] Hardware name: xlnx,zynqmp (DT)
[ 1123.112420] pstate: 40000005 (nZcv daif -PAN -UAO)
[ 1123.117200] pc : skb_put+0x48/0x60
[ 1123.120589] lr : macb_start_xmit+0x160/0xac0
[ 1123.124839] sp : ffffff801568b2d0
[ 1123.128138] x29: ffffff801568b2d0 x28: 00000000fffffedf
[ 1123.133433] x27: 0000000000000000 x26: ffffff801568b464
[ 1123.138727] x25: ffffffc02acfc100 x24: ffffff8010ed6000
[ 1123.144022] x23: ffffffc02dccd540 x22: ffffff8010ee0298
[ 1123.149317] x21: ffffffc02df18000 x20: 00000000ae3ec97f
[ 1123.154612] x19: ffffffc02df18000 x18: 0000000000000000
[ 1123.159907] x17: 0000000000000000 x16: 0000000000000000
[ 1123.165202] x15: 0000000000000400 x14: 0000000000000000
[ 1123.170496] x13: 0000000000000000 x12: 0000000000000000
[ 1123.175791] x11: 0000000000000000 x10: 000000d700000070
[ 1123.181086] x9 : ffffffbf0095f588 x8 : 00000000518e3072
[ 1123.186381] x7 : 0000000000000001 x6 : 00000000000000d7
[ 1123.191676] x5 : ffffffc02d3aa921 x4 : 0000000000000121
[ 1123.196970] x3 : 0000000000000000 x2 : 0000000000000000
[ 1123.202265] x1 : 0000000000000001 x0 : ffffffc02acfc100
[ 1123.207562] Process irq/53-iwlwifi (pid: 3171, stack limit = 0x000000002f10bec7)
[ 1123.214938] Call trace:
[ 1123.217371]  skb_put+0x48/0x60
[ 1123.220409]  macb_start_xmit+0x160/0xac0
[ 1123.224316]  dev_hard_start_xmit+0x94/0x128
[ 1123.228482]  sch_direct_xmit+0x144/0x348
[ 1123.232387]  __qdisc_run+0x118/0x520
[ 1123.235947]  __dev_queue_xmit+0x3ac/0x738
[ 1123.239939]  dev_queue_xmit+0x10/0x18
[ 1123.243586]  br_dev_queue_push_xmit+0xac/0x178
[ 1123.248011]  br_forward_finish+0xb0/0xb8
[ 1123.251917]  __br_forward.isra.0+0x128/0x158
[ 1123.256170]  br_forward+0x9c/0xa0
[ 1123.259469]  br_handle_frame_finish+0x2d8/0x3e8
[ 1123.263982]  br_handle_frame+0x1d8/0x2d8
[ 1123.267889]  __netif_receive_skb_core+0x25c/0x8d8
[ 1123.272576]  __netif_receive_skb_one_core+0x38/0x80
[ 1123.277437]  __netif_receive_skb+0x28/0x70
[ 1123.281517]  netif_receive_skb_internal+0x7c/0x128
[ 1123.286291]  napi_gro_receive+0xa4/0xc8
[ 1123.290112]  ieee80211_deliver_skb+0xc8/0x1f0
[ 1123.294459]  ieee80211_rx_handlers+0x9f4/0x1ff8
[ 1123.298973]  ieee80211_prepare_and_rx_handle+0x370/0x1028
[ 1123.304354]  ieee80211_rx_napi+0x6f0/0x968
[ 1123.308449]  iwl_mvm_rx_rx_mpdu+0x470/0xb18 [iwlmvm]
[ 1123.313408]  iwl_mvm_rx+0x54/0x88 [iwlmvm]
[ 1123.317495]  iwl_pcie_rx_handle+0x4cc/0x858 [iwlwifi]
[ 1123.322535]  iwl_pcie_irq_handler+0x188/0x710 [iwlwifi]
[ 1123.327749]  irq_thread_fn+0x28/0x78
[ 1123.331314]  irq_thread+0x124/0x1e8
[ 1123.334787]  kthread+0x128/0x130
[ 1123.337999]  ret_from_fork+0x10/0x18
[ 1123.341558] Code: 540000a8 aa0503e0 a8c17bfd d65f03c0 (d4210000)
[ 1123.347633] ---[ end trace 893b8184596cd876 ]---
[ 1123.352233] Kernel panic - not syncing: Fatal exception in interrupt
[ 1123.358571] SMP: stopping secondary CPUs
[ 1123.362480] Kernel Offset: disabled
[ 1123.365956] CPU features: 0x002,20002004
[ 1123.369861] Memory Limit: none
[ 1123.372902] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---




* RE: Cadence/macb ethernet driver bug on nonlinear skb buffers
  2019-03-08 16:49 Cadence/macb ethernet driver bug on nonlinear skb buffers Klaus Doth
@ 2019-03-09  5:50 ` Harini Katakam
  2019-03-11  8:58   ` Claudiu.Beznea
  0 siblings, 1 reply; 3+ messages in thread
From: Harini Katakam @ 2019-03-09  5:50 UTC (permalink / raw)
  To: Klaus Doth, netdev
  Cc: davem, claudiu.beznea, Michal Simek, Nicolas Ferre,
	'harinikatakamlinux@gmail.com'

Hi Klaus,


Thanks for the debug.
Yes, this is a bug - we recently noticed this on ZynqMP and temporarily disabled
this functionality for fragmented packets before finding a clean solution.
I noticed this error in one of the functional tests using pktgen.

Regards,
Harini


* Re: Cadence/macb ethernet driver bug on nonlinear skb buffers
  2019-03-09  5:50 ` Harini Katakam
@ 2019-03-11  8:58   ` Claudiu.Beznea
  0 siblings, 0 replies; 3+ messages in thread
From: Claudiu.Beznea @ 2019-03-11  8:58 UTC (permalink / raw)
  To: harinik, krnl, netdev; +Cc: davem, michals, Nicolas.Ferre, harinikatakamlinux

Hi Klaus,

Thanks for reporting this.

On 09.03.2019 07:50, Harini Katakam wrote:
> Hi Klaus,
> 
>> -----Original Message-----
>> From: Klaus Doth [mailto:krnl@doth.eu]
>> Sent: Friday, March 8, 2019 10:19 PM
>> To: netdev@vger.kernel.org
>> Cc: davem@davemloft.net; claudiu.beznea@microchip.com; Harini Katakam
>> <harinik@xilinx.com>; Michal Simek <michals@xilinx.com>; Nicolas Ferre
>> <nicolas.ferre@microchip.com>
>> Subject: Cadence/macb ethernet driver bug on nonlinear skb buffers
>>
>> Hi,
>>
>>
>> I think I found a bug in the cadence / macb ethernet driver.
>>
>> It seems the macb_pad_and_fcs function in macb_main.c does not handle cases
>> of fragmented/paged sk-buffers correctly, as sometimes a memmove and
>> afterwards skb_put_u8 is done on fragmented buffers.

I will also take a look at this. I thought I had also tested this scenario with
my pktgen tests. Sorry for the issue.


