linux-kernel.vger.kernel.org archive mirror
* e1000, sshd, and the infamous "Corrupted MAC on input"
@ 2005-02-03  3:44 Ethan Weinstein
  2005-02-03  7:04 ` Matt Mackall
  0 siblings, 1 reply; 7+ messages in thread
From: Ethan Weinstein @ 2005-02-03  3:44 UTC (permalink / raw)
  To: linux-kernel

Hey all,

I've been having quite a time with the e1000 driver running at gigabit 
speeds.  Running it at 100Fdx, which I've done for a long time, has never 
been a problem.  Last week I picked up a gigabit switch, and that's 
when the trouble began.  I find that transferring large amounts of data 
using scp invariably ends up with sshd spitting out "Disconnecting: 
Corrupted MAC on input."  After deciding I must have purchased a bum 
switch, I grabbed another model.. only to get the same error.
Finally, I used a crossover cable between the two boxes, which resulted 
in the same error from sshd again.

Both systems run 2.6.10 with 4k stacks and regparm enabled.  System 1 
has an onboard Intel 82547EI, system 2 has an onboard Intel 82545EM, and 
both have NAPI enabled.  Oddly, running the NICs at 100Fdx does not 
generate this error no matter how much pressure I put on them.  I've 
found a lot of scuttlebutt regarding these problems with sshd on the 
net, but this appears to be a hardware/driver problem.  There's mention of a 
specific problem with e1000 here: 
http://www.psc.edu/networking/projects/hpn-ssh  but no apparent resolution.

Any suggestions are greatly appreciated.

-E

* Re: e1000, sshd, and the infamous "Corrupted MAC on input"
  2005-02-03  3:44 e1000, sshd, and the infamous "Corrupted MAC on input" Ethan Weinstein
@ 2005-02-03  7:04 ` Matt Mackall
  2005-02-04  4:16   ` Ethan Weinstein
  0 siblings, 1 reply; 7+ messages in thread
From: Matt Mackall @ 2005-02-03  7:04 UTC (permalink / raw)
  To: Ethan Weinstein; +Cc: linux-kernel

On Wed, Feb 02, 2005 at 10:44:14PM -0500, Ethan Weinstein wrote:
> Hey all,
> 
> I've been having quite a time with the e1000 driver running at gigabit 
> speeds.  Running it at 100Fdx, which I've done for a long time, has never 
> been a problem.  Last week I picked up a gigabit switch, and that's 
> when the trouble began.  I find that transferring large amounts of data 
> using scp invariably ends up with sshd spitting out "Disconnecting: 
> Corrupted MAC on input."  After deciding I must have purchased a bum 
> switch, I grabbed another model.. only to get the same error.
> Finally, I used a crossover cable between the two boxes, which resulted 
> in the same error from sshd again.

Well ssh isn't an especially good test as it's hard to debug.

Try transferring large compressed files via netcat and comparing the
results. eg:

host1# nc -l -p 2000 > foo.bz2

host2# nc host1 2000 < foo.bz2

If the md5sums differ, follow up with a cmp -bl to see what changed.

Then we can look at the failure patterns and determine if there's some
data or alignment dependence.
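
As a rough idea of what to look for in the cmp output, here is a minimal
sketch in C that summarizes how two copies differ. The struct and function
names are made up for illustration (they are not from any existing tool),
and it assumes you read both files in equal-sized chunks and pass the
chunk's file offset as base.

```c
#include <stddef.h>
#include <stdint.h>

/* Running summary of the differences between two byte streams. */
struct diff_stats {
    uint64_t ndiff;     /* total number of differing bytes */
    uint64_t runs;      /* number of contiguous runs of differing bytes */
    uint64_t first;     /* offset of the first differing byte (if ndiff > 0) */
    uint64_t last;      /* offset of the last differing byte (if ndiff > 0) */
    unsigned xor_mask;  /* OR of a[i] ^ b[i]: bit positions that ever flipped */
};

/*
 * Compare n bytes of a and b, which start at file offset base, and fold
 * the result into *st. Start with a zero-initialized struct and call this
 * once per chunk, in order.
 */
void diff_update(struct diff_stats *st, const unsigned char *a,
                 const unsigned char *b, size_t n, uint64_t base)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] == b[i])
            continue;
        uint64_t off = base + i;
        if (st->ndiff == 0)
            st->first = off;
        /* A new run starts unless this byte directly follows the last bad one. */
        if (st->ndiff == 0 || off != st->last + 1)
            st->runs++;
        st->last = off;
        st->ndiff++;
        st->xor_mask |= (unsigned)(a[i] ^ b[i]);
    }
}
```

An xor_mask with a single bit set would point at a flipped or stuck bit,
while run start offsets that are always a multiple of 64 or 4096 would
hint at an alignment dependence.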

-- 
Mathematics is the supreme nostalgia of our time.

* Re: e1000, sshd, and the infamous "Corrupted MAC on input"
  2005-02-03  7:04 ` Matt Mackall
@ 2005-02-04  4:16   ` Ethan Weinstein
  2005-02-04  5:08     ` Matt Mackall
  2005-02-04  6:03     ` Willy Tarreau
  0 siblings, 2 replies; 7+ messages in thread
From: Ethan Weinstein @ 2005-02-04  4:16 UTC (permalink / raw)
  To: Matt Mackall; +Cc: linux-kernel

Matt Mackall wrote:
> On Wed, Feb 02, 2005 at 10:44:14PM -0500, Ethan Weinstein wrote:
...
>>Finally, I used a crossover cable between the two boxes, which resulted 
>>in the same error from sshd again.
> 
> 
> Well ssh isn't an especially good test as it's hard to debug.
> 
> Try transferring large compressed files via netcat and comparing the
> results. eg:
> 
> host1# nc -l -p 2000 > foo.bz2
> 
> host2# nc host1 2000 < foo.bz2
> 
> If the md5sums differ, follow up with a cmp -bl to see what changed.
> 
> Then we can look at the failure patterns and determine if there's some
> data or alignment dependence.
> 

Excellent tip, thanks.  I was able to reproduce the problem several times 
using this technique with nc; however, the problem was intermittent (as 
nasty problems like this often are).  I used a 1.3G gzipped tarball and 
experienced several botched transfers along with a few good ones.  To 
be fair, I also switched back to 100Fdx and repeated; I didn't get a 
single failure at this speed over 25 or so runs.

The results of two cmp's are here:

http://www.stinkfoot.org/e1000tests.out

What next?

-Ethan

* Re: e1000, sshd, and the infamous "Corrupted MAC on input"
  2005-02-04  4:16   ` Ethan Weinstein
@ 2005-02-04  5:08     ` Matt Mackall
  2005-02-04  5:54       ` Has anyone dumped udev for devfs? Anthony DiSante
  2005-02-05  4:53       ` e1000, sshd, and the infamous "Corrupted MAC on input" Ethan Weinstein
  2005-02-04  6:03     ` Willy Tarreau
  1 sibling, 2 replies; 7+ messages in thread
From: Matt Mackall @ 2005-02-04  5:08 UTC (permalink / raw)
  To: Ethan Weinstein; +Cc: linux-kernel

On Thu, Feb 03, 2005 at 11:16:37PM -0500, Ethan Weinstein wrote:
> Matt Mackall wrote:
> >On Wed, Feb 02, 2005 at 10:44:14PM -0500, Ethan Weinstein wrote:
> ...
> >>Finally, I used a crossover cable between the two boxes, which resulted 
> >>in the same error from sshd again.
> >
> >
> >Well ssh isn't an especially good test as it's hard to debug.
> >
> >Try transferring large compressed files via netcat and comparing the
> >results. eg:
> >
> >host1# nc -l -p 2000 > foo.bz2
> >
> >host2# nc host1 2000 < foo.bz2
> >
> >If the md5sums differ, follow up with a cmp -bl to see what changed.
> >
> >Then we can look at the failure patterns and determine if there's some
> >data or alignment dependence.
> >
> 
> Excellent tip, thanks.  I was able to reproduce the problem several times 
> using this technique with nc; however, the problem was intermittent (as 
> nasty problems like this often are).  I used a 1.3G gzipped tarball and 
> experienced several botched transfers along with a few good ones.  To 
> be fair, I also switched back to 100Fdx and repeated; I didn't get a 
> single failure at this speed over 25 or so runs.
> 
> The results of two cmp's are here:
> 
> http://www.stinkfoot.org/e1000tests.out
> 
> What next?

OK, being reproducible without ssh makes narrowing this down much easier.
Are you seeing errors on the interface? If not, that would indicate problems
after CRC checking on the receive side. Do errors happen in both
directions? If not, it may be CPU speed-related or specific to a given
NIC - swap them if they're not onboard.

The next test is to send patterns. Try sending yourself a gigabyte of:

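#include <stdio.h>

/* Write 0x10000000 consecutive 32-bit counters (1 GiB) to stdout. */
int main(void)
{
        int i;

        for (i = 0; i < 0x10000000; i++) {
                /* each word encodes its own position in the stream */
                fwrite(&i, sizeof i, 1, stdout);
        }
        return 0;
}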

If there's some sort of partial DMA transfer going on, this should
make it evident.
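
To check the received stream, something like the following could be used
on the receiving side. It is only a sketch: the function name is invented
here, and it assumes both hosts share the same byte order and a 4-byte
int, since the generator writes the counter in host order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Verify that buf holds nwords consecutive 32-bit counters beginning at
 * start, in host byte order. Returns the index of the first word that
 * does not match, or nwords if the whole buffer is intact.
 */
size_t check_counters(const unsigned char *buf, size_t nwords, uint32_t start)
{
    for (size_t k = 0; k < nwords; k++) {
        uint32_t v;
        /* memcpy avoids alignment problems with arbitrary buffers */
        memcpy(&v, buf + 4 * k, sizeof v);
        if (v != start + (uint32_t)k)
            return k;
    }
    return nwords;
}
```

Called on successive chunks (advancing start by the number of words
already checked), the first bad index times 4 gives the byte offset of
the damage, and the value actually found there shows where the data came
from: an older counter value would suggest a stale or repeated buffer.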

-- 
Mathematics is the supreme nostalgia of our time.

* Re: Has anyone dumped udev for devfs?
  2005-02-04  5:08     ` Matt Mackall
@ 2005-02-04  5:54       ` Anthony DiSante
  2005-02-05  4:53       ` e1000, sshd, and the infamous "Corrupted MAC on input" Ethan Weinstein
  1 sibling, 0 replies; 7+ messages in thread
From: Anthony DiSante @ 2005-02-04  5:54 UTC (permalink / raw)
  To: linux-kernel

Kevin Fries wrote:
 > Any ETA on when udev is going to be ready for prime time?  And, any
 > clue why Fedora insists on relying on a program that does not f*(&%ing
 > work!!!!
 >
 > I am trying to get a Microtek X12 USL scanner attached, and udev fails
 > to mount it, every time.  Has anyone tried uninstalling udev and
 > reinstalling devfs to stop all these damn usb failures?
 >
 > If so, any hints on how not to make your system unstable?
 >
 > TIA
 > Kevin Fries

I haven't gone back to devfs, but I feel your pain.  udev+hal worked fine 
for a couple months, until hald started intermittently locking up.  Now I 
can't go 2 days without a reboot, because hald so often goes into 
"uninterruptible sleep" and is totally unkillable.  I've upgraded udev, hal, 
and my kernel a bunch of times, but nothing has fixed this.  And it's not a 
single piece of hardware; sometimes it's USB, sometimes Firewire, sometimes 
a CDROM, that causes hald to take a nap, permanently.

-Anthony DiSante
http://nodivisions.com/

* Re: e1000, sshd, and the infamous "Corrupted MAC on input"
  2005-02-04  4:16   ` Ethan Weinstein
  2005-02-04  5:08     ` Matt Mackall
@ 2005-02-04  6:03     ` Willy Tarreau
  1 sibling, 0 replies; 7+ messages in thread
From: Willy Tarreau @ 2005-02-04  6:03 UTC (permalink / raw)
  To: Ethan Weinstein; +Cc: Matt Mackall, linux-kernel

Hi,

On Thu, Feb 03, 2005 at 11:16:37PM -0500, Ethan Weinstein wrote:
(...) 
> Excellent tip, thanks.  I was able to reproduce the problem several times 
> using this technique with nc; however, the problem was intermittent (as 
> nasty problems like this often are).  I used a 1.3G gzipped tarball and 
> experienced several botched transfers along with a few good ones.  To 
> be fair, I also switched back to 100Fdx and repeated; I didn't get a 
> single failure at this speed over 25 or so runs.
> 
> The results of two cmp's are here:
> 
> http://www.stinkfoot.org/e1000tests.out
> 
> What next?

I would disable rx/tx checksum offloading on the cards to rule out a bug
in that part. One explanation for what you are seeing would be that some
frames are corrupted at gigabit speed (possibly on one of the cards
themselves), and the cards either don't compute the checksum correctly on
the receive side or ignore it when it's bad.

IIRC, you can do this with ethtool (substitute your interface for eth0):

  # ethtool -K eth0 rx off tx off
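
For reference, the checksum in question is the 16-bit ones'-complement
Internet checksum described in RFC 1071. Below is a minimal sketch of it
in C. The function name is chosen here for illustration, it leaves out
the TCP pseudo-header, and the 32-bit accumulator is only meant for
packet-sized input.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement Internet checksum (RFC 1071) over len bytes of data. */
uint16_t inet_checksum(const unsigned char *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    /* Add the data as big-endian 16-bit words. */
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];

    /* An odd trailing byte is padded with a zero byte on the right. */
    if (i < len)
        sum += (uint32_t)data[i] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Because the sum is commutative, this checksum cannot see 16-bit words
that arrive swapped, which is one more reason to compare md5sums end to
end rather than trust it alone.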

Willy


* Re: e1000, sshd, and the infamous "Corrupted MAC on input"
  2005-02-04  5:08     ` Matt Mackall
  2005-02-04  5:54       ` Has anyone dumped udev for devfs? Anthony DiSante
@ 2005-02-05  4:53       ` Ethan Weinstein
  1 sibling, 0 replies; 7+ messages in thread
From: Ethan Weinstein @ 2005-02-05  4:53 UTC (permalink / raw)
  To: Matt Mackall; +Cc: linux-kernel

Matt Mackall wrote:
> 
> OK, being reproducible without ssh makes narrowing this down much easier.
> Are you seeing errors on the interface? If not, that would indicate problems
> after CRC checking on the receive side. Do errors happen in both
> directions? If not, it may be CPU speed-related or specific to a given
> NIC - swap them if they're not onboard. 
> 
> The next test is to send patterns. Try sending yourself a gigabyte of:
> 
> #include <stdio.h>
> 
> int main(void)
> {
>         int i;
> 
>         for (i = 0; i < 0x10000000; i++) {
>                 fwrite(&i, 4, 1, stdout);
>         }
> }
> 
> If there's some sort of partial DMA transfer going on, this should
> make it evident.
> 

No errors reported on either interface.

Interesting results, though only in one direction.  It seems highly likely 
the problem is only with the 82545EM, as I couldn't get a botched 
transfer FROM it to the 82547EI after 20 or so attempts (both of these 
are onboard, unfortunately, so no swapping).  Several transfers TO it did 
yield bad files, though (using my big 1.6G gzipped tarball).

Now, on to the patterns.  I didn't get a _single_ failure in either 
direction using what that code snippet generated in over 20 attempts. 
Perhaps we're failing on larger amounts of more complex data?
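
One way to test that guess would be a stream that looks as "busy" as
compressed data but can still be regenerated on the receiver. A minimal
sketch in C, assuming a xorshift32 generator (picked purely for
illustration; the function names are hypothetical) and a fixed
little-endian byte layout so both ends agree:

```c
#include <stddef.h>
#include <stdint.h>

/* One step of the 32-bit xorshift generator; x must be nonzero. */
uint32_t xorshift32(uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

/*
 * Fill buf with nwords pseudo-random 32-bit words, stored little-endian.
 * Returns the generator state so the next chunk continues the sequence.
 */
uint32_t fill_random(unsigned char *buf, size_t nwords, uint32_t state)
{
    for (size_t k = 0; k < nwords; k++) {
        state = xorshift32(state);
        buf[4 * k + 0] = (unsigned char)(state & 0xff);
        buf[4 * k + 1] = (unsigned char)((state >> 8) & 0xff);
        buf[4 * k + 2] = (unsigned char)((state >> 16) & 0xff);
        buf[4 * k + 3] = (unsigned char)((state >> 24) & 0xff);
    }
    return state;
}
```

With the same nonzero seed on both hosts, the receiver can regenerate
each chunk and compare it against what arrived, so the exact position
and content of any corruption is known without shipping a reference file.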

-Ethan
