* Problem with 2.4.24 e1000 and keepalived
From: Stephan von Krawczynski @ 2004-01-07 19:05 UTC
To: linux-kernel; +Cc: netdev, linux-net

Hello all,

I am looking for confirmation for the following problem.

Setup is a simple pair of routers with 2 nics each, all e1000. If you start a
vrrp setup with keepalived and interface state is down during keepalived
startup, then the failover does not work. If the nics are UP during startup
everything works well. Now the kernel part of the story: the exact same setup
works with tulip cards.
Is there a difference regarding UP/DOWN state handling/events in e1000 and
tulip. e100 and eepro100 show the same problem btw.

Any hints are welcome

Regards,
Stephan
* Re: Problem with 2.4.24 e1000 and keepalived
From: Willy Tarreau @ 2004-01-07 21:02 UTC
To: Stephan von Krawczynski; +Cc: linux-kernel, netdev, linux-net

Hi Stephan,

On Wed, Jan 07, 2004 at 08:05:56PM +0100, Stephan von Krawczynski wrote:
> Setup is a simple pair of routers with 2 nics each, all e1000. If you start a
> vrrp setup with keepalived and interface state is down during keepalived
> startup, then the failover does not work. If the nics are UP during startup
> everything works well. Now the kernel part of the story: the exact same setup
> works with tulip cards.
> Is there a difference regarding UP/DOWN state handling/events in e1000 and
> tulip. e100 and eepro100 show the same problem btw.

I noticed the exact same problem about 1 year ago with the early 2.4
bonding code and eepro100. At this time, I attributed this to a yet
undiscovered bug in the bonding state machine, and could not investigate
much since it was on a remote production machine. Someone went there and
rebooted it and everything went OK. Before the reboot, the switch already
detected an UP link, while the bonding code saw it down (using MII at this
time, not ethtool). I recently read one report (here or on keepalived list)
about someone who got the same problem with another eepro100. I wonder
whether there would not be a bug either in the driver or in the chip itself.

What I noticed is that if you load the driver while the cable is unplugged,
and then plug it, the MII status says the link is still down. Unfortunately,
the only e100 I have access to are in prod at a customer's and I really
cannot make tests there.

Cheers,
Willy
* Re: Problem with 2.4.24 e1000 and keepalived
From: Ben Greear @ 2004-01-08 2:45 UTC
To: Willy Tarreau; +Cc: Stephan von Krawczynski, linux-kernel, netdev, linux-net

Willy Tarreau wrote:
> Hi Stephan,
>
> On Wed, Jan 07, 2004 at 08:05:56PM +0100, Stephan von Krawczynski wrote:
>
>>Setup is a simple pair of routers with 2 nics each, all e1000. If you start a
>>vrrp setup with keepalived and interface state is down during keepalived
>>startup, then the failover does not work. If the nics are UP during startup
>>everything works well. Now the kernel part of the story: the exact same setup
>>works with tulip cards.
>>Is there a difference regarding UP/DOWN state handling/events in e1000 and
>>tulip. e100 and eepro100 show the same problem btw.
>
>
> I noticed the exact same problem about 1 year ago with the early 2.4
> bonding code and eepro100. At this time, I attributed this to a yet
> undiscovered bug in the bonding state machine, and could not investigate
> much since it was on a remote production machine. Someone went there and
> rebooted it and everything went OK. Before the reboot, the switch already
> detected an UP link, while the bonding code saw it down (using MII at this
> time, not ethtool). I recently read one report (here or on keepalived list)
> about someone who got the same problem with another eepro100. I wonder
> whether there would not be a bug either in the driver or in the chip itself.
>
> What I noticed is that if you load the driver while the cable is unplugged,
> and then plug it, the MII status says the link is still down. Unfortunately,
> the only e100 I have access to are in prod at a customer's and I really
> cannot make tests there.

You have to bring the interface 'UP' before it will detect link,
with something like: ifconfig eth2 up

Could that be the problem?

Ben

>
> Cheers,
> Willy

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Problem with 2.4.24 e1000 and keepalived
From: Willy Tarreau @ 2004-01-08 5:20 UTC
To: Ben Greear
Cc: Willy Tarreau, Stephan von Krawczynski, linux-kernel, netdev, linux-net

Hi Ben,

On Wed, Jan 07, 2004 at 06:45:04PM -0800, Ben Greear wrote:
> You have to bring the interface 'UP' before it will detect link,
> with something like: ifconfig eth2 up

Don't you mean "after" instead of "before" here ? Because the case where
it doesn't work is when everything is set up while the cable is unplugged,
but conversely, if the system goes up with the cable plugged, setting the
interface UP detects the link as UP and works. I believe that the problem
is related to setting the interface UP with nothing plugged into it.

Cheers,
Willy
* Re: Problem with 2.4.24 e1000 and keepalived
From: Ben Greear @ 2004-01-08 8:07 UTC
To: Willy Tarreau; +Cc: Stephan von Krawczynski, linux-kernel, netdev, linux-net

Willy Tarreau wrote:
> Hi Ben,
>
> On Wed, Jan 07, 2004 at 06:45:04PM -0800, Ben Greear wrote:
>
>
>>You have to bring the interface 'UP' before it will detect link,
>>with something like: ifconfig eth2 up
>
>
> Don't you mean "after" instead of "before" here ? Because the case where
> it doesn't work is when everything is set up while the cable is unplugged,
> but conversely, if the system goes up with the cable plugged, setting the
> interface UP detects the link as UP and works. I believe that the problem
> is related to setting the interface UP with nothing plugged into it.

No, I meant what I said: You have to tell many drivers to bring the interface
up before they will attempt (or at least report) link negotiation.
You do NOT have to give it an IP address or add any routes to it.

But, I don't know about your particular program, I just suspect it
is related to detecting link state. I think tg3 detects link when
the interface is not UP, if you have some tg3 nics maybe you could
try with them?

Ben

>
> Cheers,
> Willy
>

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Problem with 2.4.24 e1000 and keepalived
From: Willy Tarreau @ 2004-01-08 8:46 UTC
To: Ben Greear
Cc: Willy Tarreau, Stephan von Krawczynski, linux-kernel, netdev, linux-net

On Thu, Jan 08, 2004 at 12:07:10AM -0800, Ben Greear wrote:
> No, I meant what I said: You have to tell many drivers to bring the
> interface
> up before they will attempt (or at least report) link negotiation.
> You do NOT have to give it an IP address or add any routes to it.

ah, OK. No, anyway, it is just a matter of wrongly detecting link state
after the link has been plugged while the interface was already UP, no
matter if an IP was set or not.

> But, I don't know about your particular program, I just suspect it
> is related to detecting link state. I think tg3 detects link when
> the interface is not UP, if you have some tg3 nics maybe you could
> try with them?

As far as I have tested, tg3 are fine WRT this.

Willy
* Re: Problem with 2.4.24 e1000 and keepalived
From: Stephan von Krawczynski @ 2004-01-08 8:14 UTC
To: Ben Greear; +Cc: willy, linux-kernel, netdev, linux-net

On Wed, 07 Jan 2004 18:45:04 -0800
Ben Greear <greearb@candelatech.com> wrote:

> Willy Tarreau wrote:
> > Hi Stephan,
> > [...]
> > What I noticed is that if you load the driver while the cable is unplugged,
> > and then plug it, the MII status says the link is still down.
> > Unfortunately, the only e100 I have access to are in prod at a customer's
> > and I really cannot make tests there.
>
> You have to bring the interface 'UP' before it will detect link,
> with something like: ifconfig eth2 up
>
> Could that be the problem?
>
> Ben

Hi Ben,

the situation is like this (exactly this works flawlessly with tulip):

- unplug all interfaces from the switches
- reboot box
- plug in _one_ interface
- log into the box (yes, network works flawlessly)
- start keepalived
- now plug in rest of the interfaces
- watch keepalived do _nothing_ (seems no UP event shows up)

in comparison to:

- let all interfaces plugged in
- reboot box
- log in
- start keepalived
- watch it work as expected

Regards,
Stephan
* Re: Problem with 2.4.24 e1000 and keepalived
From: Willy Tarreau @ 2004-01-08 8:47 UTC
To: Stephan von Krawczynski; +Cc: Ben Greear, linux-kernel, netdev, linux-net

On Thu, Jan 08, 2004 at 09:14:41AM +0100, Stephan von Krawczynski wrote:
> the situation is like this (exactly this works flawlessly with tulip):
>
> - unplug all interfaces from the switches
> - reboot box
> - plug in _one_ interface
> - log into the box (yes, network works flawlessly)
> - start keepalived
> - now plug in rest of the interfaces
> - watch keepalived do _nothing_ (seems no UP event shows up)

I agree with this description, and would add :
  - mii-diag ethX or ethtool ethX report link down

Willy
* Re: Problem with 2.4.24 e1000 and keepalived
From: Jonathan Lundell @ 2004-01-08 17:49 UTC
To: linux-kernel, linux-net

At 9:47am +0100 1/8/04, Willy Tarreau wrote:
>On Thu, Jan 08, 2004 at 09:14:41AM +0100, Stephan von Krawczynski wrote:
>> the situation is like this (exactly this works flawlessly with tulip):
>>
>> - unplug all interfaces from the switches
>> - reboot box
>> - plug in _one_ interface
>> - log into the box (yes, network works flawlessly)
>> - start keepalived
>> - now plug in rest of the interfaces
>> - watch keepalived do _nothing_ (seems no UP event shows up)
>
>I agree with this description, and would add :
>  - mii-diag ethX or ethtool ethX report link down

Which is, IMO, a bug, albeit a kind of specification bug, given the
way the drivers tend to be written. An Ethernet link can be up or
down independent of the logical up/down state of the interface, and
with most drivers the link state is hidden as long as the interface
is logically down.

One place where you might want to know: an HA system where a
redundant interface is available to be configured in place of an
active interface. We'd like to know the state of the link on the
backup interface, which is logically down, as an indication that it's
hooked up and ready to go.

It's unfortunate that the two conditions are conflated by most net drivers.
--
/Jonathan Lundell.
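The two conditions Jonathan distinguishes (logical up/down versus link
up/down) are visible from userspace as the IFF_UP and IFF_RUNNING interface
flags. The following is a minimal Python sketch, not taken from the thread:
the function names describe() and read_flags() are hypothetical, the flag and
ioctl constants are the Linux values from <linux/if.h> and <linux/sockios.h>,
and how closely IFF_RUNNING tracks the physical link depends on the driver
and kernel version.

```python
import fcntl
import socket
import struct

# Linux constants (assumed from <linux/if.h> and <linux/sockios.h>).
IFF_UP = 0x1          # interface is logically (administratively) up
IFF_RUNNING = 0x40    # kernel/driver reports the interface as operational
SIOCGIFFLAGS = 0x8913 # ioctl request: get interface flags


def describe(flags):
    """Turn a raw flag word into the three cases discussed in the thread."""
    if not flags & IFF_UP:
        # With drivers that hide link state while down, nothing more is known.
        return "logically down, link state not reported"
    if flags & IFF_RUNNING:
        return "logically up, carrier reported"
    return "logically up, no carrier reported"


def read_flags(ifname):
    """Read the flag word of a real interface (Linux only)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        request = struct.pack("256s", ifname.encode()[:15])
        reply = fcntl.ioctl(sock.fileno(), SIOCGIFFLAGS, request)
    # In struct ifreq the flags follow the 16-byte interface name.
    return struct.unpack("H", reply[16:18])[0]
```

For example, describe(read_flags("eth2")) on a backup interface that has not
been brought up would fall into the first case, which is exactly the
situation where an HA setup cannot tell whether the cable is connected.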
* Re: Problem with 2.4.24 e1000 and keepalived
From: Willy Tarreau @ 2004-01-09 0:45 UTC
To: Jonathan Lundell; +Cc: linux-kernel, linux-net

On Thu, Jan 08, 2004 at 09:49:20AM -0800, Jonathan Lundell wrote:
> One place where you might want to know: an HA system where a
> redundant interface is available to be configured in place of an
> active interface. We'd like to know the state of the link on the
> backup interface, which is logically down, as an indication that it's
> hooked up and ready to go.

It's exactly under these conditions that I discovered the problem. None of
the interfaces was usable by the bonding driver, although one of them was
properly connected !

> It's unfortunate that the two conditions are conflated by most net drivers.

IMHO, saying "most net drivers" is unfair : tg3, tulip, 3c59x, starfire,
realtek, sis900, dl2k, pcnet32, and IIRC sunhme are OK. eepro100 is nearly
OK but has this annoying bug, and only older 10 Mbps drivers don't report
their status, often because the chip itself doesn't know.

Willy
* Re: Problem with 2.4.24 e1000 and keepalived
From: Jonathan Lundell @ 2004-01-09 1:00 UTC
To: linux-kernel, linux-net

At 1:45am +0100 1/9/04, Willy Tarreau wrote:
> > It's unfortunate that the two conditions are conflated by most net drivers.
>
>IMHO, saying "most net drivers" is unfair : tg3, tulip, 3c59x, starfire,
>realtek, sis900, dl2k, pcnet32, and IIRC sunhme are OK. eepro100 is nearly
>OK but has this annoying bug, and only older 10 Mbps drivers don't report
>their status, often because the chip itself doesn't know.

I'm sure you're right; I should have said most of the drivers that
I'm using (including e100 & e1000).

My impression, though, is that there's a trend to use
netif_carrier_ok() to check the link in newish drivers (of course,
it's author-choice, not universal), and that the netif_carrier_ok()
is generally implemented to be dependent on the interface being
(logically) up.

It'd be nice if we could define link state reporting to be
independent of logical up/down state, at least for drivers & devices
capable of making the distinction.
--
/Jonathan Lundell.
* Re: Problem with 2.4.24 e1000 and keepalived
From: Stephan von Krawczynski @ 2004-01-09 12:18 UTC
To: Jonathan Lundell; +Cc: linux-kernel, linux-net

On Thu, 8 Jan 2004 17:00:42 -0800
Jonathan Lundell <jlundell@lundell-bros.com> wrote:

> At 1:45am +0100 1/9/04, Willy Tarreau wrote:
> > > It's unfortunate that the two conditions are conflated by most net
> > > drivers.
> >
> >IMHO, saying "most net drivers" is unfair : tg3, tulip, 3c59x, starfire,
> >realtek, sis900, dl2k, pcnet32, and IIRC sunhme are OK. eepro100 is nearly
> >OK but has this annoying bug, and only older 10 Mbps drivers don't report
> >their status, often because the chip itself doesn't know.
>
> I'm sure you're right; I should have said most of the drivers that
> I'm using (including e100 &e1000).

Can we find the cause for this obviously buggy behaviour inside the source?
Where is the handling of physical up/down events different in tulip compared to
e100(0) ?

Regards,
Stephan
* Re: Problem with 2.4.24 e1000 and keepalived
From: Jonathan Lundell @ 2004-01-09 18:43 UTC
To: Stephan von Krawczynski; +Cc: linux-kernel, linux-net

At 1:18pm +0100 1/9/04, Stephan von Krawczynski wrote:
>On Thu, 8 Jan 2004 17:00:42 -0800
>Jonathan Lundell <jlundell@lundell-bros.com> wrote:
>
>> At 1:45am +0100 1/9/04, Willy Tarreau wrote:
>> > > It's unfortunate that the two conditions are conflated by most net
>> > > drivers.
>> >
>> >IMHO, saying "most net drivers" is unfair : tg3, tulip, 3c59x, starfire,
>> >realtek, sis900, dl2k, pcnet32, and IIRC sunhme are OK. eepro100 is nearly
>> >OK but has this annoying bug, and only older 10 Mbps drivers don't report
>> >their status, often because the chip itself doesn't know.
>>
>> I'm sure you're right; I should have said most of the drivers that
>> I'm using (including e100 &e1000).
>
>Can we find the cause for this obviously buggy behaviour inside the source?
>Where is the handling of physical up/down events different in tulip
>compared to
>e100(0) ?

In e1000 5.2.20 (as in earlier versions), the link-state reporters
rely on netif_carrier_ok() for the state, which is in turn
maintained by the driver's watchdog timer.

e1000_down() both cancels the watchdog timer and calls
netif_carrier_off(), guaranteeing that if the interface is logically
down, the link will be reported as down regardless of the actual link
state.

I think e100 works the same way, though I haven't looked at the New &
Improved version.
--
/Jonathan Lundell.
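The mechanism Jonathan describes can be illustrated with a toy model. This
is a hedged Python sketch, not driver code: the class ModelNic and its
methods are hypothetical names, and the assumption that bringing the
interface up is what starts the watchdog is an inference from the message
above rather than something it states. Only the behaviour of the "down"
path (cancel the watchdog, force carrier off) comes from the description of
e1000_down().

```python
class ModelNic:
    """Toy model of carrier reporting driven by a watchdog timer."""

    def __init__(self):
        self.cable_plugged = False    # actual physical link state
        self.admin_up = False         # logical interface state
        self.watchdog_active = False  # is the periodic link check running?
        self.carrier_ok = False       # what netif_carrier_ok() would return

    def up(self):
        # Assumption: bringing the interface up starts the watchdog.
        self.admin_up = True
        self.watchdog_active = True

    def down(self):
        # As described for e1000_down(): cancel the watchdog and
        # force the reported carrier off.
        self.admin_up = False
        self.watchdog_active = False
        self.carrier_ok = False

    def watchdog_tick(self):
        # The reported carrier only follows the physical link
        # while the watchdog is running.
        if self.watchdog_active:
            self.carrier_ok = self.cable_plugged


nic = ModelNic()
nic.cable_plugged = True
nic.watchdog_tick()
reported_while_down = nic.carrier_ok   # link is present but hidden

nic.up()
nic.watchdog_tick()
reported_while_up = nic.carrier_ok     # link is now reported
```

In this model an interface that is logically down never reports carrier,
whatever the cable state. A cable plugged in after up() is picked up at the
next watchdog tick, so the model by itself does not reproduce a failure in
which the interface is already logically up.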
* Re: Problem with 2.4.24 e1000 and keepalived
From: Stephan von Krawczynski @ 2004-01-09 23:56 UTC
To: Jonathan Lundell; +Cc: linux-kernel, linux-net

On Fri, 9 Jan 2004 10:43:13 -0800
Jonathan Lundell <jlundell@lundell-bros.com> wrote:

> At 1:18pm +0100 1/9/04, Stephan von Krawczynski wrote:
> >On Thu, 8 Jan 2004 17:00:42 -0800
> >Jonathan Lundell <jlundell@lundell-bros.com> wrote:
> >
> >> At 1:45am +0100 1/9/04, Willy Tarreau wrote:
> >> > > It's unfortunate that the two conditions are conflated by most net
> >> > > drivers.
> >> >
> >> >IMHO, saying "most net drivers" is unfair : tg3, tulip, 3c59x, starfire,
> >> >realtek, sis900, dl2k, pcnet32, and IIRC sunhme are OK. eepro100 is nearly
> >> >OK but has this annoying bug, and only older 10 Mbps drivers don't report
> >> >their status, often because the chip itself doesn't know.
> >>
> >> I'm sure you're right; I should have said most of the drivers that
> >> I'm using (including e100 &e1000).
> >
> >Can we find the cause for this obviously buggy behaviour inside the source?
> >Where is the handling of physical up/down events different in tulip
> >compared to
> >e100(0) ?
>
> In e1000 5.2.20 (as in earlier versions), the link-state reporters
> rely on netif_carrier_ok() for the state, which is in turn
> maintained by the driver's watchdog timer.
>
> e1000_down() both cancels the watchdog timer and calls
> netif_carrier_off(), guaranteeing that if the interface is logically
> down, the link will be reported as down regardless of the actual link
> state.

That cannot be the cause, as the logical interface state is UP in the problem
case.

Regards,
Stephan
Thread overview: 14+ messages
2004-01-07 19:05 Problem with 2.4.24 e1000 and keepalived Stephan von Krawczynski
2004-01-07 21:02 ` Willy Tarreau
2004-01-08  2:45   ` Ben Greear
2004-01-08  5:20     ` Willy Tarreau
2004-01-08  8:07       ` Ben Greear
2004-01-08  8:46         ` Willy Tarreau
2004-01-08  8:14     ` Stephan von Krawczynski
2004-01-08  8:47       ` Willy Tarreau
2004-01-08 17:49         ` Jonathan Lundell
2004-01-09  0:45           ` Willy Tarreau
2004-01-09  1:00             ` Jonathan Lundell
2004-01-09 12:18               ` Stephan von Krawczynski
2004-01-09 18:43                 ` Jonathan Lundell
2004-01-09 23:56                   ` Stephan von Krawczynski