linux-kernel.vger.kernel.org archive mirror
* Problem with 2.4.24 e1000 and keepalived
@ 2004-01-07 19:05 Stephan von Krawczynski
  2004-01-07 21:02 ` Willy Tarreau
  0 siblings, 1 reply; 14+ messages in thread
From: Stephan von Krawczynski @ 2004-01-07 19:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, linux-net

Hello all,

I am looking for confirmation for the following problem.
Setup is a simple pair of routers with 2 nics each, all e1000. If you start a
vrrp setup with keepalived and interface state is down during keepalived
startup, then the failover does not work. If the nics are UP during startup
everything works well. Now the kernel part of the story: the exact same setup
works with tulip cards.
Is there a difference in UP/DOWN state handling/events between e1000 and
tulip? e100 and eepro100 show the same problem, btw.

Any hints are welcome

Regards,
Stephan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem with 2.4.24 e1000 and keepalived
  2004-01-07 19:05 Problem with 2.4.24 e1000 and keepalived Stephan von Krawczynski
@ 2004-01-07 21:02 ` Willy Tarreau
  2004-01-08  2:45   ` Ben Greear
  0 siblings, 1 reply; 14+ messages in thread
From: Willy Tarreau @ 2004-01-07 21:02 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, netdev, linux-net

Hi Stephan,

On Wed, Jan 07, 2004 at 08:05:56PM +0100, Stephan von Krawczynski wrote:
> Setup is a simple pair of routers with 2 nics each, all e1000. If you start a
> vrrp setup with keepalived and interface state is down during keepalived
> startup, then the failover does not work. If the nics are UP during startup
> everything works well. Now the kernel part of the story: the exact same setup
> works with tulip cards.
> Is there a difference regarding UP/DOWN state handling/events in e1000 and
> tulip. e100 and eepro100 show the same problem btw.

I noticed the exact same problem about 1 year ago with the early 2.4
bonding code and eepro100. At the time, I attributed this to a yet
undiscovered bug in the bonding state machine, and could not investigate
much since it was on a remote production machine. Someone went there and
rebooted it and everything went OK. Before the reboot, the switch already
detected an UP link, while the bonding code saw it down (using MII at that
time, not ethtool). I recently read one report (here or on keepalived list)
about someone who got the same problem with another eepro100. I wonder
whether there might be a bug either in the driver or in the chip itself.

What I noticed is that if you load the driver while the cable is unplugged,
and then plug it, the MII status says the link is still down. Unfortunately,
the only e100s I have access to are in production at a customer's site, and I
really cannot run tests there.

Cheers,
Willy



* Re: Problem with 2.4.24 e1000 and keepalived
  2004-01-07 21:02 ` Willy Tarreau
@ 2004-01-08  2:45   ` Ben Greear
  2004-01-08  5:20     ` Willy Tarreau
  2004-01-08  8:14     ` Stephan von Krawczynski
  0 siblings, 2 replies; 14+ messages in thread
From: Ben Greear @ 2004-01-08  2:45 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Stephan von Krawczynski, linux-kernel, netdev, linux-net

Willy Tarreau wrote:
> Hi Stephan,
> 
> On Wed, Jan 07, 2004 at 08:05:56PM +0100, Stephan von Krawczynski wrote:
> 
>>Setup is a simple pair of routers with 2 nics each, all e1000. If you start a
>>vrrp setup with keepalived and interface state is down during keepalived
>>startup, then the failover does not work. If the nics are UP during startup
>>everything works well. Now the kernel part of the story: the exact same setup
>>works with tulip cards.
>>Is there a difference regarding UP/DOWN state handling/events in e1000 and
>>tulip. e100 and eepro100 show the same problem btw.
> 
> 
> I noticed the exact same problem about 1 year ago with the early 2.4
> bonding code and eepro100. At this time, I attributed this to a yet
> undiscovered bug in the bonding state machine, and could not investigate
> much since it was on a remote production machine. Someone went there and
> rebooted it and everything went OK. Before the reboot, the switch already
> detected an UP link, while the bonding code saw it down (using MII at this
> time, not ethtool). I recently read one report (here or on keepalived list)
> about someone who got the same problem with another eepro100. I wonder
> whether there would not be a bug either in the driver or in the chip itself.
> 
> What I noticed is that if you load the driver while the cable is unplugged,
> and then plug it, the MII status says the link is still down. Unfortunately,
> the only e100 I have access to are in prod at a customer's and I really
> cannot make tests there.

You have to bring the interface 'UP' before it will detect link,
with something like:  ifconfig eth2 up

Could that be the problem?

Ben

> 
> Cheers,
> Willy
> 
> 


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com




* Re: Problem with 2.4.24 e1000 and keepalived
  2004-01-08  2:45   ` Ben Greear
@ 2004-01-08  5:20     ` Willy Tarreau
  2004-01-08  8:07       ` Ben Greear
  2004-01-08  8:14     ` Stephan von Krawczynski
  1 sibling, 1 reply; 14+ messages in thread
From: Willy Tarreau @ 2004-01-08  5:20 UTC (permalink / raw)
  To: Ben Greear
  Cc: Willy Tarreau, Stephan von Krawczynski, linux-kernel, netdev, linux-net

Hi Ben,

On Wed, Jan 07, 2004 at 06:45:04PM -0800, Ben Greear wrote:
 
> You have to bring the interface 'UP' before it will detect link,
> with something like:  ifconfig eth2 up

Don't you mean "after" instead of "before" here ? Because the case where
it doesn't work is when everything is set up while the cable is unplugged,
but conversely, if the system goes up with the cable plugged, setting the
interface UP detects the link as UP and works. I believe that the problem
is related to setting the interface UP with nothing plugged into it.

Cheers,
Willy



* Re: Problem with 2.4.24 e1000 and keepalived
  2004-01-08  5:20     ` Willy Tarreau
@ 2004-01-08  8:07       ` Ben Greear
  2004-01-08  8:46         ` Willy Tarreau
  0 siblings, 1 reply; 14+ messages in thread
From: Ben Greear @ 2004-01-08  8:07 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Stephan von Krawczynski, linux-kernel, netdev, linux-net

Willy Tarreau wrote:
> Hi Ben,
> 
> On Wed, Jan 07, 2004 at 06:45:04PM -0800, Ben Greear wrote:
>  
> 
>>You have to bring the interface 'UP' before it will detect link,
>>with something like:  ifconfig eth2 up
> 
> 
> Don't you mean "after" instead of "before" here ? Because the case where
> it doesn't work is when everything is set up while the cable is unplugged,
> but conversely, if the system goes up with the cable plugged, setting the
> interface UP detects the link as UP and works. I believe that the problem
> is related to setting the interface UP with nothing plugged into it.

No, I meant what I said:  You have to tell many drivers to bring the interface
up before they will attempt (or at least report) link negotiation.
You do NOT have to give it an IP address or add any routes to it.

But, I don't know about your particular program, I just suspect it
is related to detecting link state.  I think tg3 detects link when
the interface is not UP, if you have some tg3 nics maybe you could
try with them?
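
A minimal sketch of what "bring the interface up" amounts to, assuming a
Linux host and Python 3. It sets only the IFF_UP flag through the
SIOCGIFFLAGS/SIOCSIFFLAGS ioctls; no address or route is involved. The
helper names (pack_ifreq, unpack_flags, bring_up) are made up for this
illustration, and bring_up() needs root to succeed.

```python
import fcntl
import socket
import struct

SIOCGIFFLAGS = 0x8913  # get interface flags (linux/sockios.h)
SIOCSIFFLAGS = 0x8914  # set interface flags (linux/sockios.h)
IFF_UP = 0x1           # administrative "up" bit (linux/if.h)
IFREQ = "16sH22s"      # ifr_name, ifr_flags, padding up to 40 bytes


def pack_ifreq(ifname, flags=0):
    """Build a struct ifreq carrying an interface name and a flags word."""
    name = ifname.encode()
    if len(name) > 15:
        raise ValueError("interface name too long")
    return struct.pack(IFREQ, name, flags, b"")


def unpack_flags(ifreq):
    """Extract the flags word from a packed struct ifreq."""
    return struct.unpack(IFREQ, ifreq)[1]


def bring_up(ifname):
    """Set IFF_UP on ifname, leaving addresses and routes untouched."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        reply = fcntl.ioctl(s.fileno(), SIOCGIFFLAGS, pack_ifreq(ifname))
        flags = unpack_flags(reply) | IFF_UP
        fcntl.ioctl(s.fileno(), SIOCSIFFLAGS, pack_ifreq(ifname, flags))
```

Calling bring_up("eth2") is intended to do roughly what "ifconfig eth2 up"
does. Whether the driver then starts reporting link is up to the driver.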

Ben

> 
> Cheers,
> Willy
> 


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com




* Re: Problem with 2.4.24 e1000 and keepalived
  2004-01-08  2:45   ` Ben Greear
  2004-01-08  5:20     ` Willy Tarreau
@ 2004-01-08  8:14     ` Stephan von Krawczynski
  2004-01-08  8:47       ` Willy Tarreau
  1 sibling, 1 reply; 14+ messages in thread
From: Stephan von Krawczynski @ 2004-01-08  8:14 UTC (permalink / raw)
  To: Ben Greear; +Cc: willy, linux-kernel, netdev, linux-net

On Wed, 07 Jan 2004 18:45:04 -0800
Ben Greear <greearb@candelatech.com> wrote:

> Willy Tarreau wrote:
> > Hi Stephan,
> > [...]
> > What I noticed is that if you load the driver while the cable is unplugged,
> > and then plug it, the MII status says the link is still down.
> > Unfortunately, the only e100 I have access to are in prod at a customer's
> > and I really cannot make tests there.
> 
> You have to bring the interface 'UP' before it will detect link,
> with something like:  ifconfig eth2 up
> 
> Could that be the problem?
> 
> Ben

Hi Ben,

the situation is like this (exactly this works flawlessly with tulip):

- unplug all interfaces from the switches
- reboot box
- plug in _one_ interface 
- log into the box (yes, network works flawlessly)
- start keepalived
- now plug in rest of the interfaces
- watch keepalived do _nothing_ (seems no UP event shows up)

in comparison to:

- leave all interfaces plugged in
- reboot box
- log in
- start keepalived
- watch it work as expected
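
The difference between the two runs can be pictured with a toy monitor that
polls the link state each interface reports and emits an event only when
that state changes. This is a sketch under that assumption, not a
description of how keepalived is implemented; link_events and the sample
data are hypothetical.

```python
def link_events(samples):
    """Turn successive {ifname: carrier} snapshots into UP/DOWN events."""
    events = []
    previous = {}
    for snapshot in samples:
        for ifname, carrier in sorted(snapshot.items()):
            # Only a change relative to the last known state is an event.
            if ifname in previous and previous[ifname] != carrier:
                events.append((ifname, "UP" if carrier else "DOWN"))
            previous[ifname] = carrier
    return events


# Driver that notices the late cable: an UP event appears.
print(link_events([{"eth1": False}, {"eth1": True}]))   # [('eth1', 'UP')]
# Driver that keeps reporting "down" after the cable is plugged: no event.
print(link_events([{"eth1": False}, {"eth1": False}]))  # []
```

If the driver never updates the state it reports, a monitor of this kind
has nothing to react to, which would match the "no UP event" observation.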

Regards,
Stephan




* Re: Problem with 2.4.24 e1000 and keepalived
  2004-01-08  8:07       ` Ben Greear
@ 2004-01-08  8:46         ` Willy Tarreau
  0 siblings, 0 replies; 14+ messages in thread
From: Willy Tarreau @ 2004-01-08  8:46 UTC (permalink / raw)
  To: Ben Greear
  Cc: Willy Tarreau, Stephan von Krawczynski, linux-kernel, netdev, linux-net

On Thu, Jan 08, 2004 at 12:07:10AM -0800, Ben Greear wrote:
 
> No, I meant what I said:  You have to tell many drivers to bring the interface
> up before they will attempt (or at least report) link negotiation.
> You do NOT have to give it an IP address or add any routes to it.

ah, OK. No, anyway, it is just a matter of wrongly detecting link state
after the link has been plugged while the interface was already UP, no
matter if an IP was set or not.

> But, I don't know about your particular program, I just suspect it
> is related to detecting link state.  I think tg3 detects link when
> the interface is not UP, if you have some tg3 nics maybe you could
> try with them?

As far as I have tested, tg3 are fine WRT this.

Willy



* Re: Problem with 2.4.24 e1000 and keepalived
  2004-01-08  8:14     ` Stephan von Krawczynski
@ 2004-01-08  8:47       ` Willy Tarreau
  2004-01-08 17:49         ` Jonathan Lundell
  0 siblings, 1 reply; 14+ messages in thread
From: Willy Tarreau @ 2004-01-08  8:47 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Ben Greear, linux-kernel, netdev, linux-net

On Thu, Jan 08, 2004 at 09:14:41AM +0100, Stephan von Krawczynski wrote:
> the situation is like this (exactly this works flawlessly with tulip):
> 
> - unplug all interfaces from the switches
> - reboot box
> - plug in _one_ interface 
> - log into the box (yes, network works flawlessly)
> - start keepalived
> - now plug in rest of the interfaces
> - watch keepalived do _nothing_ (seems no UP event shows up)

I agree with this description, and would add :
  - mii-diag ethX or ethtool ethX report link down

Willy



* Re: Problem with 2.4.24 e1000 and keepalived
  2004-01-08  8:47       ` Willy Tarreau
@ 2004-01-08 17:49         ` Jonathan Lundell
  2004-01-09  0:45           ` Willy Tarreau
  0 siblings, 1 reply; 14+ messages in thread
From: Jonathan Lundell @ 2004-01-08 17:49 UTC (permalink / raw)
  To: linux-kernel, linux-net

At 9:47am +0100 1/8/04, Willy Tarreau wrote:
>On Thu, Jan 08, 2004 at 09:14:41AM +0100, Stephan von Krawczynski wrote:
>>  the situation is like this (exactly this works flawlessly with tulip):
>>
>>  - unplug all interfaces from the switches
>>  - reboot box
>>  - plug in _one_ interface
>>  - log into the box (yes, network works flawlessly)
>>  - start keepalived
>>  - now plug in rest of the interfaces
>>  - watch keepalived do _nothing_ (seems no UP event shows up)
>
>I agree with this description, and would add :
>   - mii-diag ethX or ethtool ethX report link down

Which is, IMO, a bug, albeit a kind of specification bug, given the 
way the drivers tend to be written. An Ethernet link can be up or 
down independent of the logical up/down state of the interface, and 
with most drivers the link state is hidden as long as the interface 
is logically down.

One place where you might want to know: an HA system where a 
redundant interface is available to be configured in place of an 
active interface. We'd like to know the state of the link on the 
backup interface, which is logically down, as an indication that it's 
hooked up and ready to go.

It's unfortunate that the two conditions are conflated by most net drivers.
-- 
/Jonathan Lundell.
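
The two conditions can be looked at separately on kernels that expose sysfs,
which postdates the 2.4 series discussed here. The sketch below assumes the
/sys/class/net/<ifname>/flags and /sys/class/net/<ifname>/carrier files; to
my knowledge, reading carrier fails with EINVAL while the interface is
administratively down, which is the same hiding of link state described
above. The function names are hypothetical.

```python
import errno
from pathlib import Path

IFF_UP = 0x1  # administrative "up" bit (linux/if.h)


def parse_flags(text):
    """Parse the hex string found in the sysfs 'flags' file."""
    return int(text.strip(), 16)


def parse_carrier(text):
    """Parse the sysfs 'carrier' file: '1' means link detected."""
    return text.strip() == "1"


def link_view(ifname, root="/sys/class/net"):
    """Return (admin_up, carrier); carrier is None when it is not reported."""
    base = Path(root) / ifname
    admin_up = bool(parse_flags((base / "flags").read_text()) & IFF_UP)
    try:
        carrier = parse_carrier((base / "carrier").read_text())
    except OSError as exc:
        if exc.errno != errno.EINVAL:
            raise
        carrier = None  # kernel declines to report link state
    return admin_up, carrier
```

A result of (False, None) says nothing about the cable; it only says the
link state is not available while the interface is logically down.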


* Re: Problem with 2.4.24 e1000 and keepalived
  2004-01-08 17:49         ` Jonathan Lundell
@ 2004-01-09  0:45           ` Willy Tarreau
  2004-01-09  1:00             ` Jonathan Lundell
  0 siblings, 1 reply; 14+ messages in thread
From: Willy Tarreau @ 2004-01-09  0:45 UTC (permalink / raw)
  To: Jonathan Lundell; +Cc: linux-kernel, linux-net

On Thu, Jan 08, 2004 at 09:49:20AM -0800, Jonathan Lundell wrote:
> One place where you might want to know: an HA system where a 
> redundant interface is available to be configured in place of an 
> active interface. We'd like to know the state of the link on the 
> backup interface, which is logically down, as an indication that it's 
> hooked up and ready to go.

It's exactly under these conditions that I discovered the problem. None
of the interfaces was usable by the bonding driver, although one of them
was properly connected!

> It's unfortunate that the two conditions are conflated by most net drivers.

IMHO, saying "most net drivers" is unfair : tg3, tulip, 3c59x, starfire,
realtek, sis900, dl2k, pcnet32, and IIRC sunhme are OK. eepro100 is nearly
OK but has this annoying bug, and only older 10 Mbps drivers don't report
their status, often because the chip itself doesn't know.

Willy



* Re: Problem with 2.4.24 e1000 and keepalived
  2004-01-09  0:45           ` Willy Tarreau
@ 2004-01-09  1:00             ` Jonathan Lundell
  2004-01-09 12:18               ` Stephan von Krawczynski
  0 siblings, 1 reply; 14+ messages in thread
From: Jonathan Lundell @ 2004-01-09  1:00 UTC (permalink / raw)
  To: linux-kernel, linux-net

At 1:45am +0100 1/9/04, Willy Tarreau wrote:
>  > It's unfortunate that the two conditions are conflated by most net drivers.
>
>IMHO, saying "most net drivers" is unfair : tg3, tulip, 3c59x, starfire,
>realtek, sis900, dl2k, pcnet32, and IIRC sunhme are OK. eepro100 is nearly
>OK but has this annoying bug, and only older 10 Mbps drivers don't report
>their status, often because the chip itself doesn't know.

I'm sure you're right; I should have said most of the drivers that
I'm using (including e100 & e1000).

My impression, though, is that there's a trend to use
netif_carrier_ok() to check the link in newish drivers (of course,
it's the author's choice, not universal), and that netif_carrier_ok()
is generally implemented to depend on the interface being
(logically) up.

It'd be nice if we could define link state reporting to be 
independent of logical up/down state, at least for drivers & devices 
capable of making the distinction.
-- 
/Jonathan Lundell.


* Re: Problem with 2.4.24 e1000 and keepalived
  2004-01-09  1:00             ` Jonathan Lundell
@ 2004-01-09 12:18               ` Stephan von Krawczynski
  2004-01-09 18:43                 ` Jonathan Lundell
  0 siblings, 1 reply; 14+ messages in thread
From: Stephan von Krawczynski @ 2004-01-09 12:18 UTC (permalink / raw)
  To: Jonathan Lundell; +Cc: linux-kernel, linux-net

On Thu, 8 Jan 2004 17:00:42 -0800
Jonathan Lundell <jlundell@lundell-bros.com> wrote:

> At 1:45am +0100 1/9/04, Willy Tarreau wrote:
> >  > It's unfortunate that the two conditions are conflated by most net
> >  > drivers.
> >
> >IMHO, saying "most net drivers" is unfair : tg3, tulip, 3c59x, starfire,
> >realtek, sis900, dl2k, pcnet32, and IIRC sunhme are OK. eepro100 is nearly
> >OK but has this annoying bug, and only older 10 Mbps drivers don't report
> >their status, often because the chip itself doesn't know.
> 
> I'm sure you're right; I should have said most of the drivers that 
> I'm using (including e100 &e1000).

Can we find the cause for this obviously buggy behaviour inside the source? 
Where is the handling of physical up/down events different in tulip compared to
e100(0) ?

Regards,
Stephan




* Re: Problem with 2.4.24 e1000 and keepalived
  2004-01-09 12:18               ` Stephan von Krawczynski
@ 2004-01-09 18:43                 ` Jonathan Lundell
  2004-01-09 23:56                   ` Stephan von Krawczynski
  0 siblings, 1 reply; 14+ messages in thread
From: Jonathan Lundell @ 2004-01-09 18:43 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, linux-net

At 1:18pm +0100 1/9/04, Stephan von Krawczynski wrote:
>On Thu, 8 Jan 2004 17:00:42 -0800
>Jonathan Lundell <jlundell@lundell-bros.com> wrote:
>
>>  At 1:45am +0100 1/9/04, Willy Tarreau wrote:
>>  >  > It's unfortunate that the two conditions are conflated by most net
>>  >  > drivers.
>>  >
>>  >IMHO, saying "most net drivers" is unfair : tg3, tulip, 3c59x, starfire,
>>  >realtek, sis900, dl2k, pcnet32, and IIRC sunhme are OK. eepro100 is nearly
>>  >OK but has this annoying bug, and only older 10 Mbps drivers don't report
>>  >their status, often because the chip itself doesn't know.
>>
>>  I'm sure you're right; I should have said most of the drivers that
>>  I'm using (including e100 &e1000).
>
>Can we find the cause for this obviously buggy behaviour inside the source?
>Where is the handling of physical up/down events different in tulip compared
>to e100(0) ?

In e1000 5.2.20 (as in earlier versions), the link-state reporters
rely on netif_carrier_ok() for the state, which is in turn
maintained by the driver's watchdog timer.

e1000_down() both cancels the watchdog timer and calls 
netif_carrier_off(), guaranteeing that if the interface is logically 
down, the link will be reported as down regardless of the actual link 
state.

I think e100 works the same way, though I haven't looked at the New & 
Improved version.
-- 
/Jonathan Lundell.
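
The behaviour described above can be modelled in a few lines. This is a toy
model of the description only, not driver code: ToyNic and its methods are
invented for illustration, and the watchdog is reduced to an explicit
watchdog_tick() call.

```python
class ToyNic:
    """Toy model: carrier is refreshed only by a watchdog that runs while up."""

    def __init__(self):
        self.cable_plugged = False  # physical link
        self.running = False        # logical (administrative) state
        self.carrier = False        # what netif_carrier_ok() would report

    def up(self):
        self.running = True         # watchdog timer armed

    def down(self):
        self.running = False        # watchdog timer cancelled
        self.carrier = False        # netif_carrier_off()

    def watchdog_tick(self):
        # The watchdog only runs, and so only updates carrier, while up.
        if self.running:
            self.carrier = self.cable_plugged

    def reported_link(self):
        return self.carrier


nic = ToyNic()
nic.cable_plugged = True
nic.watchdog_tick()
print(nic.reported_link())  # False: cable is in, but interface is down
nic.up()
nic.watchdog_tick()
print(nic.reported_link())  # True
nic.down()
print(nic.reported_link())  # False again, regardless of the cable
```

In this toy model, an interface that is already up picks up a cable plugged
in later on the next watchdog tick.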


* Re: Problem with 2.4.24 e1000 and keepalived
  2004-01-09 18:43                 ` Jonathan Lundell
@ 2004-01-09 23:56                   ` Stephan von Krawczynski
  0 siblings, 0 replies; 14+ messages in thread
From: Stephan von Krawczynski @ 2004-01-09 23:56 UTC (permalink / raw)
  To: Jonathan Lundell; +Cc: linux-kernel, linux-net

On Fri, 9 Jan 2004 10:43:13 -0800
Jonathan Lundell <jlundell@lundell-bros.com> wrote:

> At 1:18pm +0100 1/9/04, Stephan von Krawczynski wrote:
> >On Thu, 8 Jan 2004 17:00:42 -0800
> >Jonathan Lundell <jlundell@lundell-bros.com> wrote:
> >
> >>  At 1:45am +0100 1/9/04, Willy Tarreau wrote:
> >>  >  > It's unfortunate that the two conditions are conflated by most net
> >>  >  > drivers.
> >>  >
> >>  >IMHO, saying "most net drivers" is unfair : tg3, tulip, 3c59x, starfire,
> >>  >realtek, sis900, dl2k, pcnet32, and IIRC sunhme are OK. eepro100 is nearly
> >>  >OK but has this annoying bug, and only older 10 Mbps drivers don't report
> >>  >their status, often because the chip itself doesn't know.
> >>
> >>  I'm sure you're right; I should have said most of the drivers that
> >>  I'm using (including e100 &e1000).
> >
> >Can we find the cause for this obviously buggy behaviour inside the source?
> >Where is the handling of physical up/down events different in tulip
> >compared to e100(0) ?
> 
> In e1000 5.2.20 (as in earlier versions), the link-state reporters
> rely on netif_carrier_ok() for the state, which is in turn
> maintained by the driver's watchdog timer.
> 
> e1000_down() both cancels the watchdog timer and calls 
> netif_carrier_off(), guaranteeing that if the interface is logically 
> down, the link will be reported as down regardless of the actual link 
> state.

That cannot be the cause, as the logical interface state is UP in the problem
case.

Regards,
Stephan


end of thread, other threads:[~2004-01-09 23:57 UTC | newest]

Thread overview: 14+ messages
2004-01-07 19:05 Problem with 2.4.24 e1000 and keepalived Stephan von Krawczynski
2004-01-07 21:02 ` Willy Tarreau
2004-01-08  2:45   ` Ben Greear
2004-01-08  5:20     ` Willy Tarreau
2004-01-08  8:07       ` Ben Greear
2004-01-08  8:46         ` Willy Tarreau
2004-01-08  8:14     ` Stephan von Krawczynski
2004-01-08  8:47       ` Willy Tarreau
2004-01-08 17:49         ` Jonathan Lundell
2004-01-09  0:45           ` Willy Tarreau
2004-01-09  1:00             ` Jonathan Lundell
2004-01-09 12:18               ` Stephan von Krawczynski
2004-01-09 18:43                 ` Jonathan Lundell
2004-01-09 23:56                   ` Stephan von Krawczynski
