From: Toke Høiland-Jørgensen
Subject: Re: Kernel oops with mlx5 and dual XDP redirect programs
Date: Tue, 23 Oct 2018 22:29:52 +0200
Message-ID: <87efcgfltr.fsf@toke.dk>
References: <877eize5ro.fsf@toke.dk> <4e2cfdc3db244f4b9483a0c3dfc62fae55238bb3.camel@mellanox.com> <87a7nax6pe.fsf@toke.dk> <15797ad1ccee84dfd47c6f45af155806b4ccc228.camel@mellanox.com> <8736sxgei1.fsf@toke.dk>
To: Saeed Mahameed, netdev@vger.kernel.org
Cc: Eran Ben Elisha, Tariq Toukan, brouer@redhat.com

Saeed Mahameed writes:

> On Tue, 2018-10-23 at 12:10 +0200, Toke Høiland-Jørgensen wrote:
>> Saeed Mahameed writes:
>>
>> > On Thu, 2018-10-18 at 23:53 +0200, Toke Høiland-Jørgensen wrote:
>> > > Saeed Mahameed writes:
>> > >
>> > > > I think that the mlx5 driver doesn't know how to tell the other
>> > > > device to stop transmitting to it while it is resetting. Maybe
>> > > > Tariq or Jesper know more about this?
>> > > > I will look at this tomorrow afternoon and will try to repro...
>> > >
>> > > Hi Saeed
>> > >
>> > > Did you have a chance to poke at this? :)
>> >
>> > Hi Toke, yes, I have been planning to respond, but I also wanted to
>> > dig more.
>> >
>> > So the root cause is very clear:
>> >
>> > 1. core 1 is doing tx_dev->ndo_xdp_xmit()
>> > 2. core 2 is doing tx_dev->xdp_set() // remove xdp program.
>>
>> Right, it was also my guess that it was related to this interaction.
>> Thanks for looking into it!
>>
>> > and the problem is beyond mlx5, since we don't have a way to tell a
>> > different core/different netdev to stop xmitting, or at least
>> > synchronize with it.
>>
>> Hmm, ideally there should be some way for the higher level XDP API to
>> notice this and abort the call before it even reaches the driver on
>> the TX side, shouldn't there? At LPC, Jesper and I will be talking
>> about a proposal for decoupling the ndo_xdp_xmit() resource allocation
>> from loading and unloading XDP programs, which I guess could be a way
>> to deal with this as well.
>>
>> In the meantime...
>>
>
> Yes, totally agree, this is why my fix is temporary.
> Good idea about LPC, let's discuss this there.
>
>> > I will be waiting for your confirmation that the fix did work.
>>
>> I tested your patch, and it does indeed fix the crash. However, it
>> also seems to have the effect that the XDP redirect continues to
>> function even after removing the XDP program on the target device.
>>
>> I.e., after the call to ./xdp_fwd -d $TX_IF, I still see packets
>> being redirected out $TX_IF. Is this intentional?
>>
>
> Interesting, shouldn't happen, unless there is something weird going on
> when running xdp_fwd -d together with xdp_redirect_map. I just checked
> the code, and if ndo_xdp_set was called with a null program we will
> remove the xdp tx resources; nothing suspicious in the driver.
>
> I will look at this later this week.

Cool. Let me know if you need anything more from me :)

-Toke
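
For readers following the thread, the interaction described above (one core inside ndo_xdp_xmit() on the TX device while another core removes that device's XDP program) can be modelled in user space. This is only a rough sketch under stated assumptions, not the mlx5 patch discussed here: every name in it (model_dev, model_txq, model_xdp_attach, model_xdp_xmit, model_xdp_detach) is hypothetical, and it uses C11 atomics with an in-flight counter as a stand-in for whatever synchronization a real driver or the XDP core would use (RCU, for example). It shows the two properties the thread asks for: teardown waits for transmitters that are already inside the TX path, and transmits attempted after teardown are refused instead of touching freed resources.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-device XDP TX resources; not mlx5 code. */
struct model_txq {
    atomic_int sent;          /* frames "transmitted" so far */
};

struct model_dev {
    atomic_bool xdp_enabled;  /* true while the TX resources may be used */
    atomic_int inflight;      /* callers currently inside model_xdp_xmit() */
    struct model_txq *txq;    /* NULL when no XDP program is attached */
};

/* Models attaching an XDP program: allocate TX resources, then publish. */
int model_xdp_attach(struct model_dev *dev)
{
    struct model_txq *q = malloc(sizeof(*q));

    if (!q)
        return -1;
    atomic_init(&q->sent, 0);
    dev->txq = q;
    atomic_store(&dev->xdp_enabled, true);
    return 0;
}

/* Models the redirect TX entry point (the role ndo_xdp_xmit() plays).
 * Returns 0 on success, -1 if the device has no XDP TX resources. */
int model_xdp_xmit(struct model_dev *dev)
{
    int ret = -1;

    /* Announce ourselves before checking the flag, so that a concurrent
     * detach either sees us in flight or we see the flag cleared. */
    atomic_fetch_add(&dev->inflight, 1);
    if (atomic_load(&dev->xdp_enabled)) {
        assert(dev->txq != NULL);
        atomic_fetch_add(&dev->txq->sent, 1);
        ret = 0;
    }
    atomic_fetch_sub(&dev->inflight, 1);
    return ret;
}

/* Models removing the XDP program: stop new transmitters, wait for the
 * ones already inside the TX path, and only then free the resources. */
void model_xdp_detach(struct model_dev *dev)
{
    atomic_store(&dev->xdp_enabled, false);
    while (atomic_load(&dev->inflight) != 0)
        ;   /* busy-wait; a real implementation would not spin like this */
    free(dev->txq);
    dev->txq = NULL;
}
```

In this model, a transmit attempted after model_xdp_detach() returns -1, which corresponds to the expectation in the thread that redirects out of $TX_IF stop once the program is removed.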