linux-kernel.vger.kernel.org archive mirror
* Random packets loss under x86_64 - routing?
@ 2005-01-14 15:35 Peter Kruse
  2005-01-14 16:37 ` linux-os
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Kruse @ 2005-01-14 15:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Kruse

kernel: 2.4.28 smp x86_64

Hello,

We are seeing a problem on our amd64 Beowulf clusters and could use
some help.
When pinging other machines in a cluster on the same
subnet, the ping fails for some machines, but only right after boot
and after a day or so of idle time.  After a few minutes the
ping packets go through.

Other things we observed:

1. it is not always the same machines that fail
2. if it fails then no packets are sent or received (checked with
    tcpdump on sending and target host) although all hosts are up.
3. There is no difference if using a 64bit or 32bit ping
4. It does not depend on the network adapter or other hardware, we have
    machines with different NICs connected to different switches with the
    same problem.
5. It does however only happen on amd64 (biarch) systems and not on
    pure i386 systems, so it looks kernel-related.
6. I have to reboot to reproduce the problem, it's not enough to
    unload and load the network module.
7. It only happens with ping, not with ssh.

The ping always succeeds when run with the "-r" switch,
which bypasses "the normal routing tables and send directly to a host
on an attached interface".  This makes us think that it is indeed
related to routing - but how?
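
For reference, the quoted man page text for "-r" corresponds, as far as
I know, to the SO_DONTROUTE socket option.  Below is a minimal sketch of
setting that option.  It is not how ping itself is written: ping uses a
raw ICMP socket, while this sketch uses a UDP socket so it can run
without root, and the function name make_dontroute_socket is made up
for illustration.

```python
import socket


def make_dontroute_socket() -> socket.socket:
    """Return a UDP socket with SO_DONTROUTE set (illustrative helper)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # SO_DONTROUTE asks the kernel to send only to hosts on directly
    # attached networks and not to use a gateway from the routing table.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, 1)
    return s
```

Sending through such a socket to a host on the local subnet, and
comparing with a socket that lacks the option, is one way to check
whether the difference seen with "ping -r" also shows up outside ping.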

I can provide strace output if you think that could help.
What else can I do to gather more information?

Please cc to me, as I'm not subscribed, thanks.

	Peter

-- 
Peter Kruse <pk@q-leap.com>, Chief Software Architect
Q-Leap Networks GmbH
phone: +497071-703171, mobile: +49172-6340044




* Re: Random packets loss under x86_64 - routing?
  2005-01-14 15:35 Random packets loss under x86_64 - routing? Peter Kruse
@ 2005-01-14 16:37 ` linux-os
  2005-01-17 13:57   ` Peter Kruse
  2005-01-24 10:24   ` Peter Kruse
  0 siblings, 2 replies; 5+ messages in thread
From: linux-os @ 2005-01-14 16:37 UTC (permalink / raw)
  To: Peter Kruse; +Cc: linux-kernel

On Fri, 14 Jan 2005, Peter Kruse wrote:

> kernel: 2.4.28 smp x86_64
>
> Hello,
>
> We are seeing a problem on our amd64 Beowulf clusters and could use
> some help.
> When ping'ing other machines in a cluster on the same
> subnet, it fails for some machines.  But only right after boot
> and after a day or so of idle time.  After some time (a few minutes) the
> ping packets go through.
>
> Other things we observed:
>
> 1. it is not always the same machines that fail
> 2. if it fails then no packets are sent or received (checked with
>   tcpdump on sending and target host) although all hosts are up.
> 3. There is no difference if using a 64bit or 32bit ping
> 4. It does not depend on the network adapter or other hardware, we have
>   machines with different NICs connected to different switches with the
>   same problem.
> 5. It does however only happen on amd64 (biarch) systems and not on
>   pure i386 systems so it looks like related to the kernel.
> 6. I have to reboot to reproduce the problem, it's not enough to
>   unload and load the network module.
> 7. It only happens with ping, not with ssh.
>
> The ping always succeeds when running with the "-r" switch,
> that bypasses "the normal routing tables and send directly to a host
> on an attached interface".  This makes us think that it is indeed
> related to routing - but how?
>
> I can provide an strace output if you think that could help.
> What else can I do to gather more information?
>
> Please cc to me, as I'm not subscribed, thanks.
>
> 	Peter
>

When they 'disappear', use `arp -d hostname` to delete the
entry from the ARP tables. Then see if you can ping it.
It is possible that the destination machine got re-routed
and the new router's HW address wasn't updated in the
ARP tables. If this is the case, I don't know how to 'fix'
it, but it's a new data-point. When you have dynamic routing,
there needs to be some way to update the ARP tables even though
they eventually expire.

The fact that `ping -r` works seems to show that the ARP table
has stale entries in it.
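
To make the "stale entries" check less manual, here is a rough sketch
that parses text in the layout of /proc/net/arp and lists entries the
kernel has not resolved (the ATF_COM flag, 0x2, is not set).  The
names ArpEntry, parse_arp_table and incomplete_entries, and the sample
addresses, are made up for illustration.

```python
from typing import List, NamedTuple

ATF_COM = 0x02  # flag bit: a hardware address has been resolved


class ArpEntry(NamedTuple):
    ip: str
    flags: int
    mac: str
    device: str


def parse_arp_table(text: str) -> List[ArpEntry]:
    """Parse text laid out like /proc/net/arp (first line is a header)."""
    entries = []
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        ip, _hw_type, flags, mac, _mask, device = fields[:6]
        entries.append(ArpEntry(ip, int(flags, 16), mac, device))
    return entries


def incomplete_entries(entries: List[ArpEntry]) -> List[ArpEntry]:
    """Return entries whose hardware address was never resolved."""
    return [e for e in entries if not e.flags & ATF_COM]


# Made-up sample in the same column layout as /proc/net/arp.
SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
10.0.0.2         0x1         0x2         00:11:22:33:44:55     *        eth0
10.0.0.3         0x1         0x0         00:00:00:00:00:00     *        eth0
"""
```

On a live machine, pass it the contents of /proc/net/arp instead of
SAMPLE, once while the ping fails and once after it recovers.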


Cheers,
Dick Johnson
Penguin : Linux version 2.6.10 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by Dictator Bush.
                  98.36% of all statistics are fiction.


* Re: Random packets loss under x86_64 - routing?
  2005-01-14 16:37 ` linux-os
@ 2005-01-17 13:57   ` Peter Kruse
  2005-01-17 14:27     ` linux-os
  2005-01-24 10:24   ` Peter Kruse
  1 sibling, 1 reply; 5+ messages in thread
From: Peter Kruse @ 2005-01-17 13:57 UTC (permalink / raw)
  To: linux-os; +Cc: linux-kernel

Hello,

thanks for your reply

linux-os wrote:
>
> When they 'disappear', use `arp -d hostname` to delete the
> entry from the ARP tables. Then see if you can ping it.
> It is possible that the destination machine got re-routed
> and the new router's HW address wasn't updated in the
> ARP tables. If this is the case, I don't know how to 'fix'
> it, but it's a new data-point. When you have dynamic routing,
> there needs to be some way to update the ARP tables even though
> they eventually expire.

There is no router between sender and destination host,
they are on the same subnet and connected on the same switch.

> The fact that `ping -r` works seems to show that the ARP table
> has stale entries in it.
> 

Even directly after reboot when the arp table is empty?

	Peter

-- 
Peter Kruse <pk@q-leap.com>, Chief Software Architect
Q-Leap Networks GmbH
phone: +497071-703171, mobile: +49172-6340044


* Re: Random packets loss under x86_64 - routing?
  2005-01-17 13:57   ` Peter Kruse
@ 2005-01-17 14:27     ` linux-os
  0 siblings, 0 replies; 5+ messages in thread
From: linux-os @ 2005-01-17 14:27 UTC (permalink / raw)
  To: Peter Kruse; +Cc: Linux kernel

On Mon, 17 Jan 2005, Peter Kruse wrote:

> Hello,
>
> thanks for your reply
>
> linux-os wrote:
> >
>> When they 'disappear', use `arp -d hostname` to delete the
>> entry from the ARP tables. Then see if you can ping it.
>> It is possible that the destination machine got re-routed
>> and the new router's HW address wasn't updated in the
>> ARP tables. If this is the case, I don't know how to 'fix'
>> it, but it's a new data-point. When you have dynamic routing,
>> there needs to be some way to update the ARP tables even though
>> they eventually expire.
>
> There is no router between sender and destination host,
> they are on the same subnet and connected on the same switch.
>

I suggest that you may __think__ that there is no router.... But
for instance, I can't talk to my own printer here because
of some configuration changes made by the "Net Naz^M^M^M
Wizards" here. Same network, same wire. It gets "redirected".
Basically, everything on this wire is proxy-arped by the
default-route machine. There are duplicate packets on the
wire and redirections everywhere.

You can look at your ARP table with:

`cat /proc/net/arp`

>> The fact that `ping -r` works seems to show that the ARP table
>> has stale entries in it.
>>

The `ping -r` working shows that there is either a bad
ARP table entry or too small a netmask, so the device isn't
really on your network.
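
The netmask half of that can be checked on paper or with a few lines of
code. A minimal sketch follows, with a made-up function name (on_link)
and made-up addresses: it tests whether a target address falls inside
the network implied by the local address and prefix length.

```python
import ipaddress


def on_link(local_ip: str, prefix_len: int, target_ip: str) -> bool:
    """True if target_ip lies inside the network of local_ip/prefix_len."""
    network = ipaddress.ip_interface(f"{local_ip}/{prefix_len}").network
    return ipaddress.ip_address(target_ip) in network


# With a /24 mask both hosts share 192.168.1.0/24.
same_subnet = on_link("192.168.1.10", 24, "192.168.1.200")
# With a /25 mask the local network is only 192.168.1.0 - 192.168.1.127,
# so the same target is no longer considered directly reachable.
too_small = on_link("192.168.1.10", 25, "192.168.1.200")
```

If the result is False for the configured mask, the kernel would look
for a route through a gateway instead of sending directly on the wire.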

>
> Even directly after reboot when the arp table is empty?
>
> 	Peter
>

Just check it out. You'd be surprised what you may find.
Look at the ARP-table entry. Try to ping something that
doesn't respond. Look at the table again. That will tell
you what's happening. It's likely that there is an ARP
table entry from some 'router' that has been set to
proxy-ARP whatever it sees on the wire.
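
The look / ping / look-again comparison can be sketched in code as
well. This is only an illustration with made-up names
(snapshot_from_text, changed_entries, shared_macs): it compares two
IP-to-MAC snapshots and lists hardware addresses that answer for more
than one IP, which is what a proxy-ARP box would look like in the table.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

UNRESOLVED = "00:00:00:00:00:00"  # MAC shown for incomplete entries


def snapshot_from_text(text: str) -> Dict[str, str]:
    """Map IP -> MAC from text laid out like /proc/net/arp."""
    table = {}
    for line in text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) >= 6:
            table[fields[0]] = fields[3].lower()
    return table


def changed_entries(
    before: Dict[str, str], after: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {ip: (old_mac, new_mac)} for entries that differ."""
    changes = {}
    for ip in sorted(set(before) | set(after)):
        old, new = before.get(ip), after.get(ip)
        if old != new:
            changes[ip] = (old, new)
    return changes


def shared_macs(table: Dict[str, str]) -> Dict[str, List[str]]:
    """Return {mac: [ips]} for MACs that appear under several IPs."""
    by_mac = defaultdict(list)
    for ip, mac in table.items():
        if mac != UNRESOLVED:
            by_mac[mac].append(ip)
    return {mac: sorted(ips) for mac, ips in by_mac.items() if len(ips) > 1}
```

Take one snapshot from /proc/net/arp before the failing ping and one
after it, then pass both to changed_entries and the second one to
shared_macs.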


Cheers,
Dick Johnson
Penguin : Linux version 2.6.10 on an i686 machine (5537.79 BogoMips).
  Notice : All mail here is now cached for review by Dictator Bush.
                  98.36% of all statistics are fiction.


* Re: Random packets loss under x86_64 - routing?
  2005-01-14 16:37 ` linux-os
  2005-01-17 13:57   ` Peter Kruse
@ 2005-01-24 10:24   ` Peter Kruse
  1 sibling, 0 replies; 5+ messages in thread
From: Peter Kruse @ 2005-01-24 10:24 UTC (permalink / raw)
  To: linux-os; +Cc: linux-kernel

linux-os wrote:
> 
> The fact that `ping -r` works seems to show that the ARP table
> has stale entries in it.
> 
> 

The problem is gone with 2.6.10.  We can live with this solution.

Thanks,

	Peter

-- 
Peter Kruse <pk@q-leap.com>, Chief Software Architect
Q-Leap Networks GmbH
phone: +497071-703171, mobile: +49172-6340044

