Re: [PATCH net-next 0/5] net: phy: improve and simplify phylib state machine

From: Heiner Kallweit <hkallweit1@gmail.com>
To: Andrew Lunn <andrew@lunn.ch>
Cc: Florian Fainelli <f.fainelli@gmail.com>,
	David Miller <davem@davemloft.net>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>
Subject: Re: [PATCH net-next 0/5] net: phy: improve and simplify phylib state machine
Date: Thu, 8 Nov 2018 08:20:01 +0100
Message-ID: <d765706e-1983-5dfa-698b-c5c51f310d1f@gmail.com>
In-Reply-To: <67c3bafb-5b4b-c33e-fd84-4cc6919b4bcb@gmail.com>

On 07.11.2018 21:45, Heiner Kallweit wrote:
> On 07.11.2018 21:21, Andrew Lunn wrote:
>> On Wed, Nov 07, 2018 at 09:05:49PM +0100, Heiner Kallweit wrote:
>>> On 07.11.2018 20:48, Andrew Lunn wrote:
>>>> On Wed, Nov 07, 2018 at 08:41:52PM +0100, Heiner Kallweit wrote:
>>>>> This patch series is based on two axioms:
>>>>>
>>>>> - During autoneg a PHY always reports the link being down
>>>>
>>>> Hi Heiner
>>>>
>>>> I think that is a risky assumption to make.
>>>>
>>> I wasn't sure initially either (I found no clear rule in 802.3 clause 22)
>>> and therefore asked around. Florian agrees with the assumption,
>>> see here: https://www.spinics.net/lists/netdev/msg519242.html
>>>
>>> If a PHY reports the link as up then every user would assume that
>>> data can be transferred. But that's not the case during aneg.
>>> Therefore reporting the link as up during aneg wouldn't make sense.
>>
>> Hi Heiner
>>
>> If auto-neg has already been completed once before, I can see a lazy
>> hardware design not reporting link down, or at least, not until
>> auto-neg actually fails.
>>
> "aneg finished" flag means that the aneg parameters in the register set
> are valid. Once the link goes down that's not necessarily the case any
> longer. E.g. some PHYs have an "auto speed down" feature and reduce
> the speed to save power once they detect the link is down.
> Of course I cannot rule out that there are broken designs (or as you
> stated more politely: lazy designs) out there. But in this case I assume
> we would see issues already. And we would have to think about whether we
> want to support such broken / lazy designs in phylib.
> 
I had one more look at the changes to check what happens if "link up" and
"aneg done" are not in sync.

When the link goes down, the patches don't change the current behavior:
we just check the "link up" bit.

When the link goes up and aneg isn't finished yet, we would report
"link is up" to userspace earlier than we do now. If userspace starts
some network activity based on the "link up" event, it may fail.
But it really would be a major flaw if a PHY signals "link up" whilst
it's not actually ready to transfer data yet.

With such a broken design we would already have issues with the current
code, at least if interrupts are used. The "link up" interrupt would
cause the state machine to go to PHY_CHANGELINK, which doesn't check
the aneg status.
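
To make the two cases concrete, here is a minimal user-space sketch (not
the phylib code itself). The bit values are those of BMSR_LSTATUS and
BMSR_ANEGCOMPLETE in include/uapi/linux/mii.h. The helper names
link_up_old() and link_up_new() are made up for illustration, and
link_up_old() is a simplification that assumes the current polling path
reports "link up" only after aneg has finished.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/mii.h */
#define BMSR_LSTATUS      0x0004 /* link status */
#define BMSR_ANEGCOMPLETE 0x0020 /* auto-negotiation complete */

/* Hypothetical helper: simplified view of the current behavior, assuming
 * "link up" is reported only once aneg has finished as well.
 */
static bool link_up_old(uint16_t bmsr)
{
	return (bmsr & BMSR_LSTATUS) && (bmsr & BMSR_ANEGCOMPLETE);
}

/* Hypothetical helper: with the series, only the link bit is checked. */
static bool link_up_new(uint16_t bmsr)
{
	return bmsr & BMSR_LSTATUS;
}
```

The two helpers differ only for a PHY that sets the link bit while aneg
is still running, which is exactly the broken / lazy design discussed
above. For a link-down event both return false.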

>> And what about if link is down for too short a time for us to notice?
>> I've seen some code fail because the kernel went off and did something
>> else for too long, and a state change was missed. 
>>
> We have this case already, independent of my change.
> genphy_update_link() reads BMSR twice, thus ignoring potential latched
> info about a temporary link failure. When polling, phylib ignores
> everything that happens between two poll intervals.
> 
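For illustration, a small user-space model of the double read mentioned
above. It assumes the usual clause 22 behavior where the BMSR link status
bit latches low until the register is read. struct fake_phy and all
function names here are hypothetical and exist only for this sketch, they
are not phylib APIs.

```c
#include <stdbool.h>
#include <stdint.h>

#define BMSR_LSTATUS 0x0004 /* link status, value from linux/mii.h */

/* Hypothetical model of a PHY with a latching-low link status bit */
struct fake_phy {
	bool link_now;          /* current physical link state */
	bool link_fail_latched; /* set on link loss, cleared by a BMSR read */
};

static void fake_phy_set_link(struct fake_phy *phy, bool up)
{
	phy->link_now = up;
	if (!up)
		phy->link_fail_latched = true;
}

/* A read reports "down" if a failure is latched, then clears the latch */
static uint16_t fake_phy_read_bmsr(struct fake_phy *phy)
{
	uint16_t bmsr = 0;

	if (phy->link_now && !phy->link_fail_latched)
		bmsr |= BMSR_LSTATUS;
	phy->link_fail_latched = false;
	return bmsr;
}

/* Dummy read drops the latched info, second read gives the current state */
static bool update_link_double_read(struct fake_phy *phy)
{
	(void)fake_phy_read_bmsr(phy);
	return fake_phy_read_bmsr(phy) & BMSR_LSTATUS;
}
```

After a short down/up glitch a single read would still report the link
as down once, whereas the double read returns "up" and the glitch goes
unnoticed.
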
>>>> What happens if this assumption is incorrect?
>>>>
>>> Then we have to flush this patch series down the drain ;)
>>> At least I would have to check in detail which parts need to be
>>> changed. I clearly mention the assumptions so that every
>>> reviewer can check whether he agrees.
>>
>> Thanks for doing that. I want to be happy this is safe, and not going
>> to introduce regressions.
>>
>>    Andrew
>>
>