* A problem with "ip=..." ipconfig and Atheros alx driver.
@ 2021-04-30 11:59 Nikolai Zhubr
2021-04-30 12:08 ` Johannes Berg
0 siblings, 1 reply; 9+ messages in thread
From: Nikolai Zhubr @ 2021-04-30 11:59 UTC (permalink / raw)
To: Chris Snook, netdev, Johannes Berg, nic-devel
Hello Chris and others,
I'm observing a problem with Atheros alx ethernet driver and in-kernel
ip4 configuration (using "ip=192.168....." boot parameter).
The problem first showed itself as a huge unexpected delay in bootup as
long as "ip=..." was specified (and a real device is present). I've then
noticed a timeout counter "Waiting up to 110 more seconds for network"
between the "Atheros(R) AR816x/AR817x" message and "eth0: NIC Up: 1 Gbps
Full" message. Meanwhile, this ethernet device is fully operational and
my cable is perfectly reliable.
Now, after debugging it a little bit more, I've apparently found the
root cause. One can see in net/ipv4/ipconfig.c that ic_open_devs() tries
to ensure carrier is physically present. But before opening device(s)
and starting wait for the carrier, it calls rtnl_lock(). Now in
ethernet/atheros/alx/main.c one can see that at opening, it first calls
netif_carrier_off() then schedules alx_link_check() to do actual work,
so carrier detection is supposed to happen a bit later. Now looking at
this alx_link_check() carefully, first thing is does is rtnl_lock().
Bingo! Double-lock. Effectively actual carrier check in alx is therefore
delayed just until ic_open_devs() gave up waiting for it and called
rtnl_unlock(). Hence this delay and timeout.
I have checked with clean 4.9.268 and 5.4.115 on real hardware.
Can't check with 5.12 at the moment because my gcc is somewhat old to
compile it, but browsing the code it looks like nothing has changed
substantially anyway.
Fixing this myself is a bit beyond my capability I'm afraid, but I'd be
happy do some testing if someone requests me to.
Thank you,
Reagrds,
Nikolai
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
2021-04-30 11:59 A problem with "ip=..." ipconfig and Atheros alx driver Nikolai Zhubr
@ 2021-04-30 12:08 ` Johannes Berg
2021-04-30 17:15 ` Nikolai Zhubr
0 siblings, 1 reply; 9+ messages in thread
From: Johannes Berg @ 2021-04-30 12:08 UTC (permalink / raw)
To: Nikolai Zhubr, Chris Snook, netdev, nic-devel
Hi Nikolai,
Wow - old code. And yet I'm still even using the device. Time to replace
the system that has it ;-)
> Now, after debugging it a little bit more, I've apparently found the
> root cause. One can see in net/ipv4/ipconfig.c that ic_open_devs() tries
> to ensure carrier is physically present. But before opening device(s)
> and starting wait for the carrier, it calls rtnl_lock(). Now in
> ethernet/atheros/alx/main.c one can see that at opening, it first calls
> netif_carrier_off() then schedules alx_link_check() to do actual work,
> so carrier detection is supposed to happen a bit later. Now looking at
> this alx_link_check() carefully, first thing is does is rtnl_lock().
> Bingo! Double-lock. Effectively actual carrier check in alx is therefore
> delayed just until ic_open_devs() gave up waiting for it and called
> rtnl_unlock(). Hence this delay and timeout.
Fun. But we can just do it synchronously?
https://p.sipsolutions.net/e4f076ed1b4c8a78.txt
johannes
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
2021-04-30 12:08 ` Johannes Berg
@ 2021-04-30 17:15 ` Nikolai Zhubr
2021-04-30 19:02 ` Johannes Berg
0 siblings, 1 reply; 9+ messages in thread
From: Nikolai Zhubr @ 2021-04-30 17:15 UTC (permalink / raw)
To: Johannes Berg; +Cc: Chris Snook, netdev
Hi Johannes,
30.04.2021 15:08, Johannes Berg:
>> rtnl_unlock(). Hence this delay and timeout.
>
> Fun. But we can just do it synchronously?
>
> https://p.sipsolutions.net/e4f076ed1b4c8a78.txt
Wow, that was quick! Thanks! However, unfortunately this patch still
does not quite fix the problem. Although alx_check_link() now gets
called synchronously at open, it somehow does not pass below the "if
(old_speed == hw->link_speed) return" line here. I'd guess it is because
the chip is not yet able to report meaningfull values immediately after
initialization. The result is still the same: 120 seconds timeout
happens, then the "eth0: NIC Up: 1 Gbps Full" message appears
immediately and all is fine after that.
I'll do some more testing later today.
Thank you,
Regards,
Nikolai
>
> johannes
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
2021-04-30 17:15 ` Nikolai Zhubr
@ 2021-04-30 19:02 ` Johannes Berg
2021-05-01 0:01 ` Nikolai Zhubr
0 siblings, 1 reply; 9+ messages in thread
From: Johannes Berg @ 2021-04-30 19:02 UTC (permalink / raw)
To: Nikolai Zhubr; +Cc: Chris Snook, netdev
On Fri, 2021-04-30 at 20:15 +0300, Nikolai Zhubr wrote:
> Hi Johannes,
>
> 30.04.2021 15:08, Johannes Berg:
> > > rtnl_unlock(). Hence this delay and timeout.
> >
> > Fun. But we can just do it synchronously?
> >
> > https://p.sipsolutions.net/e4f076ed1b4c8a78.txt
>
> Wow, that was quick! Thanks! However, unfortunately this patch still
> does not quite fix the problem. Although alx_check_link() now gets
> called synchronously at open, it somehow does not pass below the "if
> (old_speed == hw->link_speed) return" line here. I'd guess it is because
> the chip is not yet able to report meaningfull values immediately after
> initialization. The result is still the same: 120 seconds timeout
> happens, then the "eth0: NIC Up: 1 Gbps Full" message appears
> immediately and all is fine after that.
Yeah so ... I guess we could poll it, but that's really quite ugly?
I wonder if we can detect this case somehow?
The more reasonable thing is probably to just rework the locking here,
but that's a touchy subject I guess.
How about this?
https://p.sipsolutions.net/5adbe659fb061f06.txt
johannes
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
2021-04-30 19:02 ` Johannes Berg
@ 2021-05-01 0:01 ` Nikolai Zhubr
2021-05-01 13:50 ` Nikolai Zhubr
0 siblings, 1 reply; 9+ messages in thread
From: Nikolai Zhubr @ 2021-05-01 0:01 UTC (permalink / raw)
To: Johannes Berg; +Cc: Chris Snook, netdev
Hi Johannes,
30.04.2021 22:02, Johannes Berg:
[...]
> The more reasonable thing is probably to just rework the locking here,
> but that's a touchy subject I guess.
>
> How about this?
>
> https://p.sipsolutions.net/5adbe659fb061f06.txt
Is it supposed to work correctly for unmodified 5.12 kernel? Because it
applies cleanly, but the resulting kernel never finishes booting. No
panic happens, it just kind of gets stuck forever. Likely somewhere
within this ip config steps. The carrier timeout countdown does not show
up, but console scrollback magic still does work, so it is not a
complete crash.
Thank you,
Regards,
Nikolai
> johannes
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
2021-05-01 0:01 ` Nikolai Zhubr
@ 2021-05-01 13:50 ` Nikolai Zhubr
2021-05-02 9:04 ` Johannes Berg
0 siblings, 1 reply; 9+ messages in thread
From: Nikolai Zhubr @ 2021-05-01 13:50 UTC (permalink / raw)
To: Johannes Berg; +Cc: Chris Snook, netdev
Hi Johannes,
01.05.2021 3:01, I wrote:
[...]
>> https://p.sipsolutions.net/5adbe659fb061f06.txt
>
> Is it supposed to work correctly for unmodified 5.12 kernel? Because it
Success! I've found an apparently missing mutex_unlock() in alx_probe()
just before the "return 0" line. With that inserted, your patch indeed
fixes the 120 s boot delay here. And the link apparenly also works fine
after boot, in normal operation.
Can it be somehow reviewed and submitted unstream please?
I'd like it also ported to 4.xx then. (Maybe I can port myself though)
(Very happy!)
Thank you,
Regards,
Nikolai
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
2021-05-01 13:50 ` Nikolai Zhubr
@ 2021-05-02 9:04 ` Johannes Berg
2021-05-02 10:28 ` Nikolai Zhubr
0 siblings, 1 reply; 9+ messages in thread
From: Johannes Berg @ 2021-05-02 9:04 UTC (permalink / raw)
To: Nikolai Zhubr; +Cc: Chris Snook, netdev
On Sat, 2021-05-01 at 16:50 +0300, Nikolai Zhubr wrote:
> Hi Johannes,
>
> 01.05.2021 3:01, I wrote:
> [...]
> > > https://p.sipsolutions.net/5adbe659fb061f06.txt
> >
> > Is it supposed to work correctly for unmodified 5.12 kernel? Because it
>
> Success! I've found an apparently missing mutex_unlock() in alx_probe()
> just before the "return 0" line.
>
D'oh, right, missed that place.
> With that inserted, your patch indeed
> fixes the 120 s boot delay here. And the link apparenly also works fine
> after boot, in normal operation.
> Can it be somehow reviewed and submitted unstream please?
> I'd like it also ported to 4.xx then. (Maybe I can port myself though)
Well I don't think anyone's going to still backport such a patch to a
4.x kernel, but of course you can.
I guess I can see about making it a real patch, should only take a
little while to audit (again) the hardware access paths wrt. locking.
johannes
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
2021-05-02 9:04 ` Johannes Berg
@ 2021-05-02 10:28 ` Nikolai Zhubr
2021-05-02 11:22 ` Johannes Berg
0 siblings, 1 reply; 9+ messages in thread
From: Nikolai Zhubr @ 2021-05-02 10:28 UTC (permalink / raw)
To: Johannes Berg; +Cc: Chris Snook, netdev
Hi Johannes,
02.05.2021 12:04, Johannes Berg:
[...]
> I guess I can see about making it a real patch, should only take a
> little while to audit (again) the hardware access paths wrt. locking.
Greatly appreciated! Hopefully it makes it into mainline soon.
Regards,
Nikolai
> johannes
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: A problem with "ip=..." ipconfig and Atheros alx driver.
2021-05-02 10:28 ` Nikolai Zhubr
@ 2021-05-02 11:22 ` Johannes Berg
0 siblings, 0 replies; 9+ messages in thread
From: Johannes Berg @ 2021-05-02 11:22 UTC (permalink / raw)
To: Nikolai Zhubr; +Cc: Chris Snook, netdev
On Sun, 2021-05-02 at 13:28 +0300, Nikolai Zhubr wrote:
> Hi Johannes,
>
> 02.05.2021 12:04, Johannes Berg:
> [...]
> > I guess I can see about making it a real patch, should only take a
> > little while to audit (again) the hardware access paths wrt. locking.
>
> Greatly appreciated! Hopefully it makes it into mainline soon.
Well, we're in the middle of the merge window now, so realistically
it'll be quite a while (until 5.14).
johannes
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2021-05-02 11:22 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-30 11:59 A problem with "ip=..." ipconfig and Atheros alx driver Nikolai Zhubr
2021-04-30 12:08 ` Johannes Berg
2021-04-30 17:15 ` Nikolai Zhubr
2021-04-30 19:02 ` Johannes Berg
2021-05-01 0:01 ` Nikolai Zhubr
2021-05-01 13:50 ` Nikolai Zhubr
2021-05-02 9:04 ` Johannes Berg
2021-05-02 10:28 ` Nikolai Zhubr
2021-05-02 11:22 ` Johannes Berg
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).