netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* A problem with "ip=..." ipconfig and Atheros alx driver.
@ 2021-04-30 11:59 Nikolai Zhubr
  2021-04-30 12:08 ` Johannes Berg
  0 siblings, 1 reply; 9+ messages in thread
From: Nikolai Zhubr @ 2021-04-30 11:59 UTC (permalink / raw)
  To: Chris Snook, netdev, Johannes Berg, nic-devel

Hello Chris and others,

I'm observing a problem with Atheros alx ethernet driver and in-kernel 
ip4 configuration (using "ip=192.168....." boot parameter).

The problem first showed itself as a huge unexpected delay in bootup as 
long as "ip=..." was specified (and a real device is present). I've then 
noticed a timeout counter "Waiting up to 110 more seconds for network" 
between the "Atheros(R) AR816x/AR817x" message and "eth0: NIC Up: 1 Gbps 
Full" message. Meanwhile, this ethernet device is fully operational and 
my cable is perfectly reliable.

Now, after debugging it a little bit more, I've apparently found the 
root cause. One can see in net/ipv4/ipconfig.c that ic_open_devs() tries 
to ensure carrier is physically present. But before opening device(s) 
and starting wait for the carrier, it calls rtnl_lock(). Now in 
ethernet/atheros/alx/main.c one can see that at opening, it first calls 
netif_carrier_off() then schedules alx_link_check() to do actual work, 
so carrier detection is supposed to happen a bit later. Now looking at 
this alx_link_check() carefully, first thing is does is rtnl_lock(). 
Bingo! Double-lock. Effectively actual carrier check in alx is therefore 
delayed just until ic_open_devs() gave up waiting for it and called 
rtnl_unlock(). Hence this delay and timeout.

I have checked with clean 4.9.268 and 5.4.115 on real hardware.
Can't check with 5.12 at the moment because my gcc is somewhat old to 
compile it, but browsing the code it looks like nothing has changed 
substantially anyway.

Fixing this myself is a bit beyond my capability I'm afraid, but I'd be 
happy do some testing if someone requests me to.


Thank you,

Reagrds,
Nikolai

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
  2021-04-30 11:59 A problem with "ip=..." ipconfig and Atheros alx driver Nikolai Zhubr
@ 2021-04-30 12:08 ` Johannes Berg
  2021-04-30 17:15   ` Nikolai Zhubr
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Berg @ 2021-04-30 12:08 UTC (permalink / raw)
  To: Nikolai Zhubr, Chris Snook, netdev, nic-devel

Hi Nikolai,

Wow - old code. And yet I'm still even using the device. Time to replace
the system that has it ;-)

> Now, after debugging it a little bit more, I've apparently found the 
> root cause. One can see in net/ipv4/ipconfig.c that ic_open_devs() tries 
> to ensure carrier is physically present. But before opening device(s) 
> and starting wait for the carrier, it calls rtnl_lock(). Now in 
> ethernet/atheros/alx/main.c one can see that at opening, it first calls 
> netif_carrier_off() then schedules alx_link_check() to do actual work, 
> so carrier detection is supposed to happen a bit later. Now looking at 
> this alx_link_check() carefully, first thing is does is rtnl_lock(). 
> Bingo! Double-lock. Effectively actual carrier check in alx is therefore 
> delayed just until ic_open_devs() gave up waiting for it and called 
> rtnl_unlock(). Hence this delay and timeout.

Fun. But we can just do it synchronously?

https://p.sipsolutions.net/e4f076ed1b4c8a78.txt

johannes


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
  2021-04-30 12:08 ` Johannes Berg
@ 2021-04-30 17:15   ` Nikolai Zhubr
  2021-04-30 19:02     ` Johannes Berg
  0 siblings, 1 reply; 9+ messages in thread
From: Nikolai Zhubr @ 2021-04-30 17:15 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Chris Snook, netdev

Hi Johannes,

30.04.2021 15:08, Johannes Berg:
>> rtnl_unlock(). Hence this delay and timeout.
>
> Fun. But we can just do it synchronously?
>
> https://p.sipsolutions.net/e4f076ed1b4c8a78.txt

Wow, that was quick! Thanks! However, unfortunately this patch still 
does not quite fix the problem. Although alx_check_link() now gets 
called synchronously at open, it somehow does not pass below the "if 
(old_speed == hw->link_speed) return" line here. I'd guess it is because 
the chip is not yet able to report meaningfull values immediately after 
initialization. The result is still the same: 120 seconds timeout 
happens, then the "eth0: NIC Up: 1 Gbps Full" message appears 
immediately and all is fine after that.
I'll do some more testing later today.


Thank you,

Regards,
Nikolai


>
> johannes
>
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
  2021-04-30 17:15   ` Nikolai Zhubr
@ 2021-04-30 19:02     ` Johannes Berg
  2021-05-01  0:01       ` Nikolai Zhubr
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Berg @ 2021-04-30 19:02 UTC (permalink / raw)
  To: Nikolai Zhubr; +Cc: Chris Snook, netdev

On Fri, 2021-04-30 at 20:15 +0300, Nikolai Zhubr wrote:
> Hi Johannes,
> 
> 30.04.2021 15:08, Johannes Berg:
> > > rtnl_unlock(). Hence this delay and timeout.
> > 
> > Fun. But we can just do it synchronously?
> > 
> > https://p.sipsolutions.net/e4f076ed1b4c8a78.txt
> 
> Wow, that was quick! Thanks! However, unfortunately this patch still 
> does not quite fix the problem. Although alx_check_link() now gets 
> called synchronously at open, it somehow does not pass below the "if 
> (old_speed == hw->link_speed) return" line here. I'd guess it is because 
> the chip is not yet able to report meaningfull values immediately after 
> initialization. The result is still the same: 120 seconds timeout 
> happens, then the "eth0: NIC Up: 1 Gbps Full" message appears 
> immediately and all is fine after that.

Yeah so ... I guess we could poll it, but that's really quite ugly?

I wonder if we can detect this case somehow?

The more reasonable thing is probably to just rework the locking here,
but that's a touchy subject I guess.

How about this?

https://p.sipsolutions.net/5adbe659fb061f06.txt

johannes


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
  2021-04-30 19:02     ` Johannes Berg
@ 2021-05-01  0:01       ` Nikolai Zhubr
  2021-05-01 13:50         ` Nikolai Zhubr
  0 siblings, 1 reply; 9+ messages in thread
From: Nikolai Zhubr @ 2021-05-01  0:01 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Chris Snook, netdev

Hi Johannes,

30.04.2021 22:02, Johannes Berg:
[...]
> The more reasonable thing is probably to just rework the locking here,
> but that's a touchy subject I guess.
>
> How about this?
>
> https://p.sipsolutions.net/5adbe659fb061f06.txt

Is it supposed to work correctly for unmodified 5.12 kernel? Because it 
applies cleanly, but the resulting kernel never finishes booting. No 
panic happens, it just kind of gets stuck forever. Likely somewhere 
within this ip config steps. The carrier timeout countdown does not show 
up, but console scrollback magic still does work, so it is not a 
complete crash.


Thank you,

Regards,
Nikolai

> johannes
>
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
  2021-05-01  0:01       ` Nikolai Zhubr
@ 2021-05-01 13:50         ` Nikolai Zhubr
  2021-05-02  9:04           ` Johannes Berg
  0 siblings, 1 reply; 9+ messages in thread
From: Nikolai Zhubr @ 2021-05-01 13:50 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Chris Snook, netdev

Hi Johannes,

01.05.2021 3:01, I wrote:
[...]
>> https://p.sipsolutions.net/5adbe659fb061f06.txt
>
> Is it supposed to work correctly for unmodified 5.12 kernel? Because it

Success! I've found an apparently missing mutex_unlock() in alx_probe() 
just before the "return 0" line. With that inserted, your patch indeed 
fixes the 120 s boot delay here. And the link apparenly also works fine 
after boot, in normal operation.
Can it be somehow reviewed and submitted unstream please?
I'd like it also ported to 4.xx then. (Maybe I can port myself though)
(Very happy!)


Thank you,

Regards,
Nikolai

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
  2021-05-01 13:50         ` Nikolai Zhubr
@ 2021-05-02  9:04           ` Johannes Berg
  2021-05-02 10:28             ` Nikolai Zhubr
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Berg @ 2021-05-02  9:04 UTC (permalink / raw)
  To: Nikolai Zhubr; +Cc: Chris Snook, netdev

On Sat, 2021-05-01 at 16:50 +0300, Nikolai Zhubr wrote:
> Hi Johannes,
> 
> 01.05.2021 3:01, I wrote:
> [...]
> > > https://p.sipsolutions.net/5adbe659fb061f06.txt
> > 
> > Is it supposed to work correctly for unmodified 5.12 kernel? Because it
> 
> Success! I've found an apparently missing mutex_unlock() in alx_probe() 
> just before the "return 0" line.
> 

D'oh, right, missed that place.

> With that inserted, your patch indeed 
> fixes the 120 s boot delay here. And the link apparenly also works fine 
> after boot, in normal operation.
> Can it be somehow reviewed and submitted unstream please?
> I'd like it also ported to 4.xx then. (Maybe I can port myself though)

Well I don't think anyone's going to still backport such a patch to a
4.x kernel, but of course you can.

I guess I can see about making it a real patch, should only take a
little while to audit (again) the hardware access paths wrt. locking.

johannes


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
  2021-05-02  9:04           ` Johannes Berg
@ 2021-05-02 10:28             ` Nikolai Zhubr
  2021-05-02 11:22               ` Johannes Berg
  0 siblings, 1 reply; 9+ messages in thread
From: Nikolai Zhubr @ 2021-05-02 10:28 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Chris Snook, netdev

Hi Johannes,

02.05.2021 12:04, Johannes Berg:
[...]
> I guess I can see about making it a real patch, should only take a
> little while to audit (again) the hardware access paths wrt. locking.

Greatly appreciated! Hopefully it makes it into mainline soon.


Regards,
Nikolai


> johannes
>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
  2021-05-02 10:28             ` Nikolai Zhubr
@ 2021-05-02 11:22               ` Johannes Berg
  0 siblings, 0 replies; 9+ messages in thread
From: Johannes Berg @ 2021-05-02 11:22 UTC (permalink / raw)
  To: Nikolai Zhubr; +Cc: Chris Snook, netdev

On Sun, 2021-05-02 at 13:28 +0300, Nikolai Zhubr wrote:
> Hi Johannes,
> 
> 02.05.2021 12:04, Johannes Berg:
> [...]
> > I guess I can see about making it a real patch, should only take a
> > little while to audit (again) the hardware access paths wrt. locking.
> 
> Greatly appreciated! Hopefully it makes it into mainline soon.

Well, we're in the middle of the merge window now, so realistically
it'll be quite a while (until 5.14).

johannes


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-05-02 11:22 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-30 11:59 A problem with "ip=..." ipconfig and Atheros alx driver Nikolai Zhubr
2021-04-30 12:08 ` Johannes Berg
2021-04-30 17:15   ` Nikolai Zhubr
2021-04-30 19:02     ` Johannes Berg
2021-05-01  0:01       ` Nikolai Zhubr
2021-05-01 13:50         ` Nikolai Zhubr
2021-05-02  9:04           ` Johannes Berg
2021-05-02 10:28             ` Nikolai Zhubr
2021-05-02 11:22               ` Johannes Berg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).