* access beyond end of device again
@ 2002-06-24 14:35 Kevin
  2002-06-24 14:35 ` Oleg Drokin
  2002-06-24 14:37 ` Robert Brockway
  0 siblings, 2 replies; 25+ messages in thread
From: Kevin @ 2002-06-24 14:35 UTC (permalink / raw)
  To: reiserfs-list

I'm getting these errors again:
  attempt to access beyond end of device
  38:01: rw=0, want=2052028788, limit=58633312

anyone know what causes them? and more importantly a way to stop them
from coming back?
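
The kernel line above can be picked apart mechanically. This is only a sketch: it assumes the 2.4-era layout seen in this one example (`major:minor: rw=..., want=..., limit=...`), that `want` and `limit` use the same units, and that `rw=0` denotes a read. The name `parse_beyond_eod` is made up for illustration.

```python
import re

# Pattern for the 2.4-style line quoted above; the field layout is assumed
# from this single example, not taken from kernel documentation.
LINE_RE = re.compile(
    r"(?P<major>[0-9a-fA-F]+):(?P<minor>[0-9a-fA-F]+): "
    r"rw=(?P<rw>\d+), want=(?P<want>\d+), limit=(?P<limit>\d+)"
)

def parse_beyond_eod(line):
    """Return the fields of an 'access beyond end of device' line as a dict."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("unrecognised message: %r" % line)
    # rw, want and limit are decimal; major/minor are kept as printed.
    fields = {k: int(v) for k, v in m.groupdict().items()
              if k not in ("major", "minor")}
    fields["device"] = "%s:%s" % (m.group("major"), m.group("minor"))
    fields["overshoot"] = fields["want"] - fields["limit"]
    return fields

info = parse_beyond_eod("38:01: rw=0, want=2052028788, limit=58633312")
print(info["device"], info["overshoot"])
```

Here the requested block is roughly 35 times the device limit, which looks more like a garbage block number than a partition that is a little too small.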


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: access beyond end of device again
  2002-06-24 14:35 access beyond end of device again Kevin
@ 2002-06-24 14:35 ` Oleg Drokin
  2002-06-24 14:45   ` Dirk Mueller
  2002-06-24 14:37 ` Robert Brockway
  1 sibling, 1 reply; 25+ messages in thread
From: Oleg Drokin @ 2002-06-24 14:35 UTC (permalink / raw)
  To: Kevin; +Cc: reiserfs-list

Hello!

   Do you get these during normal operations? 
   Then it seems some of the unformatted pointers were corrupted and you need
   to run reiserfsck to clear these.

Bye,
    Oleg
On Mon, Jun 24, 2002 at 07:35:24AM -0700, Kevin wrote:
> I'm getting these errors again:
>   attempt to access beyond end of device
>   38:01: rw=0, want=2052028788, limit=58633312
> 
> anyone know what causes them? and more importantly a way to stop them
> from coming back?
> 


* Re: access beyond end of device again
  2002-06-24 14:35 access beyond end of device again Kevin
  2002-06-24 14:35 ` Oleg Drokin
@ 2002-06-24 14:37 ` Robert Brockway
  2002-06-24 14:48   ` Oleg Drokin
  1 sibling, 1 reply; 25+ messages in thread
From: Robert Brockway @ 2002-06-24 14:37 UTC (permalink / raw)
  To: Kevin; +Cc: reiserfs-list

On Mon, 24 Jun 2002, Kevin wrote:

> I'm getting these errors again:
>   attempt to access beyond end of device
>   38:01: rw=0, want=2052028788, limit=58633312

What sort of device is this supposed to be? :)  38:01 is either a "Myricom 
PCI Myrinet board" or something "reserved for Linux/AP+" (and I'm assuming 
we're talking about a block device rather than a character device here :)

I could be way off but could you confirm what you think the filesystem is 
on?

Rob
-- Robert Brockway B.Sc. email: robert@timetraveller.org  ICQ: 104781119
   Linux counter project ID #16440 (http://counter.li.org)
   avon: up 16 days, 23:28,  1 user,  load average: 0.00, 0.02, 0.00
   "The earth is but one country and mankind its citizens" -Baha'u'llah



* Re: access beyond end of device again
  2002-06-24 14:35 ` Oleg Drokin
@ 2002-06-24 14:45   ` Dirk Mueller
  2002-06-24 14:49     ` Oleg Drokin
  0 siblings, 1 reply; 25+ messages in thread
From: Dirk Mueller @ 2002-06-24 14:45 UTC (permalink / raw)
  To: reiserfs-list

On Mon, 24 Jun 2002, Oleg Drokin wrote:

> 
>    Do you get these during normal operations? 
>    Then it seems some of unformatted pointers were corrupted and you need
>    to run reiserfsck to clear these.

most likely the partition is indeed too small (played with fsck lately?)


Dirk


* Re: access beyond end of device again
  2002-06-24 14:37 ` Robert Brockway
@ 2002-06-24 14:48   ` Oleg Drokin
  2002-06-24 14:49     ` Robert Brockway
  2002-07-10 18:04     ` Tim Small
  0 siblings, 2 replies; 25+ messages in thread
From: Oleg Drokin @ 2002-06-24 14:48 UTC (permalink / raw)
  To: Robert Brockway; +Cc: Kevin, reiserfs-list

Hello!

On Tue, Jun 25, 2002 at 12:37:02AM +1000, Robert Brockway wrote:
> > I'm getting these errors again:
> >   attempt to access beyond end of device
> >   38:01: rw=0, want=2052028788, limit=58633312
> What sort of device is this supposed to be? :)  38:01 is either a "Myricom 
> PCI Myrinet board" or something "reserved for Linux/AP+" (and I'm assuming 
> we're talking about a block device rather than a character device here :)

Numbers printed are in hex, so this is:
    block       Fifth IDE hard disk/CD-ROM interface
                  0 = /dev/hdi          Master: whole disk (or CD-ROM)
                 64 = /dev/hdj          Slave: whole disk (or CD-ROM)

                Partitions are handled the same way as for the first
                interface (see major number 3).

Bye,
    Oleg
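
The hex reading can be checked in a few lines. This is a sketch that only covers the fifth IDE interface described in the listing above (0x38 is 56 decimal); the function names are invented for illustration, and the 64-minors-per-disk split is taken from the "0 = /dev/hdi, 64 = /dev/hdj" entries.

```python
def decode_device(tag):
    """Split a 'MAJOR:MINOR' tag printed in hex into decimal integers."""
    major_hex, minor_hex = tag.split(":")
    return int(major_hex, 16), int(minor_hex, 16)

def ide5_name(major, minor):
    """Name a device on the fifth IDE interface (major 56), per the listing above."""
    if major != 56:
        raise ValueError("only major 56 (fifth IDE interface) is handled here")
    if not 0 <= minor < 128:
        raise ValueError("minor out of range for one IDE interface")
    disk = "hdi" if minor < 64 else "hdj"   # minor 0 = hdi, minor 64 = hdj
    partition = minor % 64                  # 0 means the whole disk
    return "/dev/%s%s" % (disk, partition or "")

major, minor = decode_device("38:01")
print(major, minor, ide5_name(major, minor))   # 56 1 /dev/hdi1
```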


* Re: access beyond end of device again
  2002-06-24 14:48   ` Oleg Drokin
@ 2002-06-24 14:49     ` Robert Brockway
  2002-06-24 17:30       ` Kevin
  2002-07-10 18:04     ` Tim Small
  1 sibling, 1 reply; 25+ messages in thread
From: Robert Brockway @ 2002-06-24 14:49 UTC (permalink / raw)
  To: reiserfs-list

On Mon, 24 Jun 2002, Oleg Drokin wrote:

> Numbers printed are in hex, so this is:

Damn, they are too (*blushes* :)  Sorry...ahhh..late here...slinks back 
into hole :)

ObContent: I have seen this error before.  A number of different causes 
right up to & including actual disk problems.

Has there been any messing with the partition table of late?

Can we see an fdisk -l of the relevant disk to see how big the filesystem 
*should be* :)

Rob

-- Robert Brockway B.Sc. email: robert@timetraveller.org  ICQ: 104781119
   Linux counter project ID #16440 (http://counter.li.org)
   avon: up 16 days, 23:38,  1 user,  load average: 0.07, 0.03, 0.01
   "The earth is but one country and mankind its citizens" -Baha'u'llah



* Re: access beyond end of device again
  2002-06-24 14:45   ` Dirk Mueller
@ 2002-06-24 14:49     ` Oleg Drokin
  2002-06-24 16:46       ` Hans Reiser
  2002-06-24 16:59       ` Dirk Mueller
  0 siblings, 2 replies; 25+ messages in thread
From: Oleg Drokin @ 2002-06-24 14:49 UTC (permalink / raw)
  To: Dirk Mueller; +Cc: reiserfs-list

Hello!

On Mon, Jun 24, 2002 at 04:45:01PM +0200, Dirk Mueller wrote:

> >    Do you get these during normal operations? 
> >    Then it seems some of unformatted pointers were corrupted and you need
> >    to run reiserfsck to clear these.
> most likely the partition is indeed too small (played with fsck lately?)

reiserfs does not allocate any blocks past the partition size,
so I cannot even imagine what you are speaking about ;)

Bye,
    Oleg


* Re: access beyond end of device again
  2002-06-24 14:49     ` Oleg Drokin
@ 2002-06-24 16:46       ` Hans Reiser
  2002-06-25  5:11         ` Oleg Drokin
  2002-06-24 16:59       ` Dirk Mueller
  1 sibling, 1 reply; 25+ messages in thread
From: Hans Reiser @ 2002-06-24 16:46 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Dirk Mueller, reiserfs-list

Oleg Drokin wrote:

>Hello!
>
>On Mon, Jun 24, 2002 at 04:45:01PM +0200, Dirk Mueller wrote:
>
>  
>
>>>   Do you get these during normal operations? 
>>>   Then it seems some of unformatted pointers were corrupted and you need
>>>   to run reiserfsck to clear these.
>>>      
>>>
>>most likely the partition is indeed too small (played with fsck lately?)
>>    
>>
>
>reiserfs does not allocate any blocks past the partition size,
>so I cannot even imagine what you are speaking about ;)
>
>Bye,
>    Oleg
>
>
>  
>
This can be caused by fdisk followed by mkreiserfs without a reboot 
between fdisk and mkreiserfs, yes?

-- 
Hans





* Re: access beyond end of device again
  2002-06-24 14:49     ` Oleg Drokin
  2002-06-24 16:46       ` Hans Reiser
@ 2002-06-24 16:59       ` Dirk Mueller
  1 sibling, 0 replies; 25+ messages in thread
From: Dirk Mueller @ 2002-06-24 16:59 UTC (permalink / raw)
  To: reiserfs-list

On Mon, 24 Jun 2002, Oleg Drokin wrote:

> > most likely the partition is indeed too small (played with fsck lately?)

fdisk I meant. sorry. 


Dirk


* Re: access beyond end of device again
  2002-06-24 14:49     ` Robert Brockway
@ 2002-06-24 17:30       ` Kevin
  2002-06-25  5:54         ` Oleg Drokin
  0 siblings, 1 reply; 25+ messages in thread
From: Kevin @ 2002-06-24 17:30 UTC (permalink / raw)
  To: reiserfs-list


> On Mon, 24 Jun 2002, Oleg Drokin wrote:


> Can we see an fdisk -l of the relevant disk to see how big the filesystem 
> *should be* :)


I reformatted the partition since the last time it rebooted, but I
haven't changed the actual partition.  The only time I see the error
is when trying to read certain files.  Everything else seems to work
fine.  The drive is a 60g maxtor attached to an hpt366.  It is the
master on its channel, and there is no slave.  I'm running 2.4.18 with
reiserfsprogs 3.x.1c-pre4.

fdisk -l /dev/hdi1
Disk /dev/hdi: 16 heads, 63 sectors, 116336 cylinders
Units = cylinders of 1008 * 512 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/hdi1             1    116336  58633312+  83  Linux


df /dev/hdi1
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/hdi1             58631516  39128508  19503008  67% /opt
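
The fdisk figures can be cross-checked against the kernel's `limit` with a little arithmetic. This sketch assumes the partition starts at sector 63 (the usual first-track offset for a 63-sector geometry; the thread does not state it) and that fdisk's Blocks column and the kernel's `limit` both count 1 KiB units.

```python
HEADS, SECTORS_PER_TRACK, CYLINDERS = 16, 63, 116336   # from the fdisk output
SECTOR_BYTES = 512
START_SECTOR = 63           # assumption: partition begins after the first track

total_sectors = HEADS * SECTORS_PER_TRACK * CYLINDERS
part_sectors = total_sectors - START_SECTOR

# Size in 1 KiB blocks; a non-zero remainder means an odd number of sectors,
# which fdisk marks with a trailing '+'.
blocks_1k, odd_sector = divmod(part_sectors * SECTOR_BYTES, 1024)
print("%d%s" % (blocks_1k, "+" if odd_sector else ""))

KERNEL_LIMIT = 58633312     # limit= from the error message
DF_1K_BLOCKS = 58631516     # 1k-blocks column from df
print(blocks_1k == KERNEL_LIMIT, KERNEL_LIMIT - DF_1K_BLOCKS)
```

Under those assumptions the computed size agrees with both fdisk's 58633312+ and the kernel's limit, and df reports 1796 fewer 1k-blocks than the partition holds, so the filesystem itself fits inside the partition.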



* Re: access beyond end of device again
  2002-06-24 16:46       ` Hans Reiser
@ 2002-06-25  5:11         ` Oleg Drokin
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Drokin @ 2002-06-25  5:11 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Dirk Mueller, reiserfs-list

Hello!

On Mon, Jun 24, 2002 at 08:46:14PM +0400, Hans Reiser wrote:
> >reiserfs does not allocate any blocks past the partition size,
> >so I cannot even imagine what you are speaking about ;)
> This can be caused by fdisk followed by mkreiserfs without a reboot 
> between fdisk and mkreiserfs, yes?

A similar problem, but not this exact one, I'd say.
The sequence of events would have to be this:
Have a HDD with several partitions, with at least one of them mounted.
Destroy one of the partitions and create a smaller one in its place
(or just resize a partition down).
mkfs the partition without rebooting.
But if the resize removed more than 132 MB of space,
then such a partition won't mount on the next reboot, simply because not all
of the bitmaps can be read (and the messages indicated that the requested
sector is far away from the partition end).

The main thing here is for the HDD to be in use (at least one partition mounted).
If nothing was mounted from the HDD, the kernel is able to re-read the
partition table correctly.

Bye,
    Oleg
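
The bitmap argument can be illustrated with a rough calculation. This is a sketch under assumptions that are not stated in the thread: 4096-byte blocks, one bitmap block per 32768 filesystem blocks, the first bitmap at block 17 and each later one at a multiple of 32768. The function names are hypothetical.

```python
BLOCK_SIZE = 4096                     # assumed reiserfs block size
BLOCKS_PER_BITMAP = BLOCK_SIZE * 8    # one bit per block: 32768 blocks (128 MiB)
FIRST_BITMAP_BLOCK = 17               # assumed: 64 KiB reserved, superblock, then bitmap 0

def bitmap_block_numbers(fs_blocks):
    """Block numbers of all bitmap blocks for a filesystem of fs_blocks blocks."""
    count = -(-fs_blocks // BLOCKS_PER_BITMAP)        # ceiling division
    return [FIRST_BITMAP_BLOCK if i == 0 else i * BLOCKS_PER_BITMAP
            for i in range(count)]

def unreadable_bitmaps(fs_blocks, device_blocks):
    """Bitmap blocks that fall past the end of a device of device_blocks blocks."""
    return [b for b in bitmap_block_numbers(fs_blocks) if b >= device_blocks]

# Filesystem made for 1,000,000 blocks while the real partition is 40,000 blocks smaller.
print(unreadable_bitmaps(1000000, 960000))
```

With these numbers the last bitmap block (983040) lies beyond the shrunken partition, so a mount that reads every bitmap would fail. One bitmap block covers 128 MiB under these assumptions, which is in the same range as the figure quoted above.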


* Re: access beyond end of device again
  2002-06-24 17:30       ` Kevin
@ 2002-06-25  5:54         ` Oleg Drokin
  2002-06-25  6:08           ` Kevin
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Drokin @ 2002-06-25  5:54 UTC (permalink / raw)
  To: Kevin; +Cc: reiserfs-list

Hello!

On Mon, Jun 24, 2002 at 10:30:45AM -0700, Kevin wrote:

> > Can we see an fdisk -l of the relevant disk to see how big the filesystem 
> > *should be* :)
> I reformatted the partition since the last time it rebooted, but I

Reformatted as in mkreiserfs?

> haven't changed the actual partition.  The only time I see the error
> is when trying to read certain files.  Everything else seems to work

Yeah, the ones that contain invalid block numbers in their metadata.

> fine.  The drive is a 60g maxtor attached to an hpt366.  It is the
> master on its channel, and there is no slave.  I'm running 2.4.18 with
> reiserfsprogs 3.x.1c-pre4.
> fdisk -l /dev/hdi1
> Disk /dev/hdi: 16 heads, 63 sectors, 116336 cylinders
> Units = cylinders of 1008 * 512 bytes
>    Device Boot    Start       End    Blocks   Id  System
> /dev/hdi1             1    116336  58633312+  83  Linux
> /dev/hdi1             58631516  39128508  19503008  67% /opt

The numbers look correct.

Bye,
    Oleg


* Re: access beyond end of device again
  2002-06-25  5:54         ` Oleg Drokin
@ 2002-06-25  6:08           ` Kevin
  2002-06-25  6:15             ` Oleg Drokin
  0 siblings, 1 reply; 25+ messages in thread
From: Kevin @ 2002-06-25  6:08 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: reiserfs-list


> Reformatted as in mkreiserfs?

Correct.  That's the only change I've made to it though.  Would
rebuilding the superblocks help?  Last time it happened, I rebuilt the
tree, and it deleted the files that I had trouble with.  So it's hard
to say if it really fixed it, or if it was just waiting for me to
write to that spot again.  I'm about to run the Maxtor diags on the
drive and do a factory recertification, just to make sure.



* Re: access beyond end of device again
  2002-06-25  6:08           ` Kevin
@ 2002-06-25  6:15             ` Oleg Drokin
  2002-06-25  7:46               ` Kevin
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Drokin @ 2002-06-25  6:15 UTC (permalink / raw)
  To: Kevin; +Cc: reiserfs-list

Hello!

On Wed, Jun 26, 2002 at 11:09:35PM -0700, Kevin wrote:

> > Reformatted as in mkreiserfs?
> Correct.  That's the only change I've made to it though.  Would
> rebuilding the superblocks help?  Last time it happened, I rebuilt the

No. Superblock is fine in your case.

> tree, and it deleted the files that I had trouble with.  So it's hard

--fix-fixable can fix such errors (it just zeroes the offending pointers).
It would be nice if you first ran reiserfsck --check and showed us the output,
though, because you may have other kinds of corruption as well.

Bye,
    Oleg


* Re: access beyond end of device again
  2002-06-25  6:15             ` Oleg Drokin
@ 2002-06-25  7:46               ` Kevin
  2002-06-25  7:55                 ` Oleg Drokin
  0 siblings, 1 reply; 25+ messages in thread
From: Kevin @ 2002-06-25  7:46 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: reiserfs-list

> --fix-fixable can fix such errors (it just zeroes the offending pointers).
> It would be nice if you first ran reiserfsck --check and showed us the output,
> though, because you may have other kinds of corruption as well.

> Bye,
>     Oleg

I put the log here as it's somewhat long:
http://redefine.org/~coggy/hdi.txt

Maxtor diags found nothing wrong with the drive itself.



* Re: access beyond end of device again
  2002-06-25  7:46               ` Kevin
@ 2002-06-25  7:55                 ` Oleg Drokin
  2002-06-25  8:07                   ` Kevin
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Drokin @ 2002-06-25  7:55 UTC (permalink / raw)
  To: Kevin; +Cc: reiserfs-list

Hello!

On Thu, Jun 27, 2002 at 12:47:46AM -0700, Kevin wrote:
> > --fix-fixable can fix such errors (it just zeroes the offending pointers).
> > It would be nice if you first ran reiserfsck --check and showed us the output,
> > though, because you may have other kinds of corruption as well.
> I put the log here as it's somewhat long:
> http://redefine.org/~coggy/hdi.txt

Well, you have some corrupted leaves in addition to the corrupted unformatted
pointers, so reiserfsck --rebuild-tree seems to be needed.
Are you sure there is no data corruption when writing to disk?
There were reports that VIA chipsets have problems with more than 3 IDE
channels being in use simultaneously.
I am sure there is even a test suite that triggers these bugs reliably.
(You have not told us anything about your motherboard/system, so maybe
you do not use a VIA chipset at all, but finding one of those tools
that exercise several HDDs simultaneously is advisable.)

Bye,
    Oleg
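
The order of operations suggested in this thread (check first, then either --fix-fixable or --rebuild-tree) can be written down as a small planner. This is only a sketch: it builds command lines and does not run anything, the function name and the `leaves_corrupted` flag are invented for illustration, and it assumes the filesystem is unmounted before any of the commands are used.

```python
def reiserfsck_plan(device, leaves_corrupted):
    """Return the reiserfsck invocations discussed above, in order, as argv lists."""
    # Always look before touching anything.
    steps = [["reiserfsck", "--check", device]]
    if leaves_corrupted:
        # Corrupted leaves: the thread says the tree has to be rebuilt.
        steps.append(["reiserfsck", "--rebuild-tree", device])
    else:
        # Only bad pointers: --fix-fixable zeroes the offending pointers.
        steps.append(["reiserfsck", "--fix-fixable", device])
    return steps

for cmd in reiserfsck_plan("/dev/hdi1", leaves_corrupted=True):
    print(" ".join(cmd))
```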


* Re: access beyond end of device again
  2002-06-25  7:55                 ` Oleg Drokin
@ 2002-06-25  8:07                   ` Kevin
  2002-06-25  8:13                     ` Oleg Drokin
  0 siblings, 1 reply; 25+ messages in thread
From: Kevin @ 2002-06-25  8:07 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: reiserfs-list


> Well, you have some corrupted leaves in conjunction to corrupted unformatted
> pointers. So reiserfsck --rebuild-tree seems to be needed.
> Are you sure there is no data corruption when writing to disk?
> There were reports that VIA chipsets have problems with more than 3 IDE
> channels being in use simultaneously.
> I am sure there is even a test suite that trigger these bugs reliable.
> (you have not told us anything about your motherboard/system so may
> be you do not use VIA chipset of course, but finding one of such tools
> that uses several HDDs simultaneously is advisable).

> Bye,
>     Oleg

It is a 2x400 celeron on an Abit BP6.  There are 5 hdd's total, spread
across 3 controllers.  All the drives are the master on their channel,
with no slaves.  As far as the testing, do you know of any such tools?
It's worth a try.  However, when the file that triggered the error was
written, the system was not under any stress at all.



* Re: access beyond end of device again
  2002-06-25  8:07                   ` Kevin
@ 2002-06-25  8:13                     ` Oleg Drokin
  2002-06-25  8:46                       ` Hans Reiser
       [not found]                       ` <353485111.20020627013212@redefine.org>
  0 siblings, 2 replies; 25+ messages in thread
From: Oleg Drokin @ 2002-06-25  8:13 UTC (permalink / raw)
  To: Kevin; +Cc: reiserfs-list

[-- Attachment #1: Type: text/plain, Size: 1260 bytes --]

Hello!

On Thu, Jun 27, 2002 at 01:08:47AM -0700, Kevin wrote:

> > Well, you have some corrupted leaves in conjunction to corrupted unformatted
> > pointers. So reiserfsck --rebuild-tree seems to be needed.
> > Are you sure there is no data corruption when writing to disk?
> > There were reports that VIA chipsets have problems with more than 3 IDE
> > channels being in use simultaneously.
> > I am sure there is even a test suite that trigger these bugs reliable.
> > (you have not told us anything about your motherboard/system so may
> > be you do not use VIA chipset of course, but finding one of such tools
> > that uses several HDDs simultaneously is advisable).
> It is a 2x400 celeron on an Abit BP6.  There are 5 hdd's total, spread

Abit BP6 is a particularly bad motherboard, you know.
And running celerons in SMP mode is not supported by Intel.
 
> across 3 controllers.  All the drives are the master on their channel,
> with no slaves.  As far as the testing, do you know of any such tools?
> It's worth a try.  However, when the file that triggered the error was
> written, the system was not under any stress at all.

E.g. http://www.bit-net.com/~rmiller/dt.html
Also take a look at the two messages from lkml that I have attached.

Bye,
    Oleg

[-- Attachment #2: m1 --]
[-- Type: text/plain, Size: 6085 bytes --]

From linux-kernel-owner+green=40namesys.com@vger.kernel.org  Wed May  8 05:48:03 2002
Return-Path: <linux-kernel-owner+green=40namesys.com@vger.kernel.org>
Delivered-To: green@localhost.namesys.com
Received: from localhost (localhost [127.0.0.1])
	by angband.namesys.com (Postfix on SuSE Linux 7.3 (i386)) with ESMTP id CEAAC41907
	for <green@localhost>; Wed,  8 May 2002 05:48:03 +0400 (MSD)
Delivered-To: green@namesys.com
Received: from thebsh.namesys.com [212.16.7.65]
	by localhost with POP3 (fetchmail-5.9.0)
	for green@localhost (single-drop); Wed, 08 May 2002 05:48:03 +0400 (MSD)
Received: (qmail 29959 invoked from network); 8 May 2002 01:46:15 -0000
Received: from vger.kernel.org (209.116.70.75)
  by thebsh.namesys.com with SMTP; 8 May 2002 01:46:15 -0000
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id <S315478AbSEHBoK>; Tue, 7 May 2002 21:44:10 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id <S315479AbSEHBoJ>; Tue, 7 May 2002 21:44:09 -0400
Received: from pop.gmx.net ([213.165.64.20]:62967 "HELO mail.gmx.net")
	by vger.kernel.org with SMTP id <S315478AbSEHBoI> convert rfc822-to-8bit;
	Tue, 7 May 2002 21:44:08 -0400
Received: (qmail 32229 invoked by uid 0); 8 May 2002 01:44:01 -0000
Received: from adsl-162-85.adsl-pool.axelero.hu (HELO lead) (62.201.85.162)
  by mail.gmx.net (mp001-rz3) with SMTP; 8 May 2002 01:44:01 -0000
Reply-To: <bPObject@axelero.hu>
From: "P. Breuer" <bPObject@gmx.ch>
To: <andre@linux-ide.org>
Cc: <linux-kernel@vger.kernel.org>
Subject: PROBLEM: silent data corruption using HPT370 on an ABIT VP6
Date:	Wed, 8 May 2002 03:43:59 +0200
Message-ID: <EGEOJJNFHLHGOKNADENLOEGCCFAA.bPObject@gmx.ch>
MIME-Version: 1.0
Content-Type:	text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000
Importance: Normal
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List:	linux-kernel@vger.kernel.org
Status: RO
Content-Length: 3968
Lines: 83

1. Silent disk corruption using HPT370 on an ABIT VP6

2. I have tracked down a crooked bug somewhere in the IDE driver
   leading to a slow and silent data corruption, which is a most alarming threat
   for the incautious. The case is simple: "cp file1 file2; diff file1 file2"
   shows differences under certain conditions.

3. Keywords: kernel, driver, ide, data corruption, i386

4. Kernel versions: 2.4.16 or 2.4.18 (error reproducible in both versions)

5. Hardware environment (details see below):
   ABIT VP6 motherboard including: dual Pentium III, VIA APOLLO PRO chipset
   VIA onboard EIDE controller,
   HPT370 "raid" UDMA/100 controller, integrated on board
   Promise TX2 (PDC) UDMA/100 PCI controller card
   Hard disks (all masters):
     2 x 6GB Quantum Fireball EX6.4A on VIA,
     2 x 40GB Quantum FireballP AS40.0 on PDC,
     2 x 40GB Quantum FireballP AS40.0 on HPT

6. Software environment:
   IDE driver (kernel-integrated)
   raidtools-0.90-5 (optional)
   General: four 40GB disks of identical geometry have three partitions each,
     same partitioning, identified by /dev/hd[e,g,i,k][1-3],
     /dev/md[0-2] are three RAID-5 arrays defined on the four disks accordingly
     each out of three raid partitions are formatted ext3 with internal journal

7. ERROR description:
   Let "file1" be a "large" data file, e.g. 1GB, on a RAID array described above.
   Then "cp file1 file2; cmp -l file1 file2" shows (subtle) differences.
   There are random differences on several random spots between the files.
   The "spots" occur usually as blocks of few bytes in succession. The difference
   is up to several dozens of bytes at a 1GB file copy.

8. Tracking down the error:
   I have conducted over 100 test cases: the error is consistent, though random.

   First I excluded an error in the raid software:
     umount /dev/md[0-2]; raidstop /dev/md[0-2].
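   I used a script to read all four raw disks concurrently:
   
   for d in e g i k; do
     # one background subshell per disk, so the four disks are read concurrently
     ( for i in 1 2 3 4 5; do
         # five identical reads of the first partition, one checksum per read
         dd if=/dev/hd"$d"1 count=2500000 2> /dev/null | md5sum
       done ) >> trc"$d".md5sum &
   done
   wait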
   
   I found NO differences in trce.md5sum and trcg.md5sum (both disks are on the
   Promise controller), but significant differences in trci.md5sum and trck.md5sum,
   displaying 3 and 5 different read results out of 5 identical reads, resp.
   (both disks are on the HPT370 controller).

   Oops!!!

   I stayed focused on the HPT370 controller, and compiled a small test environment with a
   single processor motherboard and a HPT370A PCI controller card, which, in addition, has
   the same HPT BIOS version (1.0.3b1) as the integrated one. I found no problem using this
   configuration, so the error might well be related only to the SMP architecture.

9. Solution or workaround?
   I browsed through the HighPoint Software web pages and found a remarkable replacement
   for the kernel IDE-driver. This is a SCSI IDE emulation module, called hpt37x2.o, that
   can be built for "any" 2.4.x kernel. And IT WORKS, at least for me, for at least two days now ;)
   The only drawback is that it is not GPL'd and the complete source is not available.
   The existence of a working driver is strong evidence that the kernel driver is in error!

10. Attachments:
   I have saved several files out of /proc, boot log, etc. from the test period,
   i.e. while using the faulty driver. They are available upon request. Because the
   HPT driver is not a native IDE driver but a SCSI emulation, it is not possible to switch
   between booting the old and new kernels very easily. For example, the raid arrays are not
   recognised from the foreign configuration.

Peter Breuer [P.Breuer@freemail.hu]
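
The repeated-read test from item 8 can be approximated in Python. This is a sketch, not the author's script: the function names and parameters are hypothetical, and unlike the original 1.2 GB `dd` reads it makes no attempt to defeat the page cache, so on a small file every pass may be served from memory.

```python
import hashlib

def repeated_read_digests(path, passes=5, chunk_size=1 << 20, max_bytes=None):
    """Read the same file or device several times and return one MD5 per pass."""
    digests = []
    for _ in range(passes):
        md5 = hashlib.md5()
        remaining = max_bytes
        with open(path, "rb", buffering=0) as f:
            while remaining is None or remaining > 0:
                want = chunk_size if remaining is None else min(chunk_size, remaining)
                chunk = f.read(want)
                if not chunk:          # end of file or device
                    break
                md5.update(chunk)
                if remaining is not None:
                    remaining -= len(chunk)
        digests.append(md5.hexdigest())
    return digests

def reads_are_consistent(path, **kwargs):
    """True when every pass produced the same checksum."""
    return len(set(repeated_read_digests(path, **kwargs))) == 1
```

On healthy hardware `reads_are_consistent` should return True for any path; the report above corresponds to it returning False for the disks on the HPT370 controller.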
   

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[-- Attachment #3: m2 --]
[-- Type: text/plain, Size: 7189 bytes --]

From linux-kernel-owner+green=40namesys.com@vger.kernel.org  Tue May 14 22:04:04 2002
Return-Path: <linux-kernel-owner+green=40namesys.com@vger.kernel.org>
Delivered-To: green@localhost.namesys.com
Received: from localhost (localhost [127.0.0.1])
	by angband.namesys.com (Postfix on SuSE Linux 7.3 (i386)) with ESMTP id 0E193B17A1
	for <green@localhost>; Tue, 14 May 2002 22:04:04 +0400 (MSD)
Delivered-To: green@namesys.com
Received: from thebsh.namesys.com [212.16.7.65]
	by localhost with POP3 (fetchmail-5.9.0)
	for green@localhost (single-drop); Tue, 14 May 2002 22:04:04 +0400 (MSD)
Received: (qmail 11573 invoked from network); 14 May 2002 18:03:31 -0000
Received: from vger.kernel.org (209.116.70.75)
  by thebsh.namesys.com with SMTP; 14 May 2002 18:03:31 -0000
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id <S315935AbSENR4J>; Tue, 14 May 2002 13:56:09 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id <S315937AbSENR4I>; Tue, 14 May 2002 13:56:08 -0400
Received: from mail.netbeat.de ([62.208.140.19]:53266 "HELO mail.netbeat.de")
	by vger.kernel.org with SMTP id <S315935AbSENRzk>;
	Tue, 14 May 2002 13:55:40 -0400
Received: (qmail 2315 invoked from network); 14 May 2002 17:57:31 -0000
Received: from pd9542a05.dip.t-dialin.net (HELO qs2) (217.84.42.5)
  by mail.netbeat.de with SMTP; 14 May 2002 17:57:31 -0000
Date:	Tue, 14 May 2002 19:55:33 +0200
From: Henning Schroeder <hgs@anna-strasse.de>
X-Mailer: The Bat! (v1.53d)
Reply-To: Henning Schroeder <hgs@anna-strasse.de>
Organization: =?ISO-8859-1?B?QW5uYXN0cmFzc2UgV/xyemJ1cmc=?=
X-Priority: 3 (Normal)
Message-ID: <379487051.20020514195533@anna-strasse.de>
To: linux-kernel@vger.kernel.org
Subject: IDE *data corruption* VIA VT8367
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List:	linux-kernel@vger.kernel.org
Status: RO
Content-Length: 5198
Lines: 128

Hello,

I'm not quite sure whether this is a kernel issue, but I can't think
of another evildoer :-)

ASUS A7V266-E Mainboard (VT8367 [KT266] Chipset, with VIA IDE and
Promise 20265 IDE Controller on board), 4x MAXTOR 6L020J1 (20GB
ATA-100) attached at the four ports (resulting in hda, hdc, hde, hdg).

Robin Miller's Data Test Program (dt) from
http://www.bit-net.com/~rmiller/dt.html reports data errors on (and
only on) hdg when tests are run in parallel. This is especially nasty
because I plan to use the drives in a RAID-0 fashion, which results in
data errors as well.

These combinations give errors: (hda hdc hde hdg), (hdc hde hdg)

These combinations run flawlessly: (hda hdc hde), (hde hdg), (hda hdc
hdg). I did not test more combinations because every test takes some
hours.

Attaching hdg as a slave drive to the first Promise port (which gives
me hdf instead and leaves the second Promise port empty) makes the array run
fine, but performance drops to a figure comparable to a single drive.

There are no error logs whatsoever (except for the dt output). Without
RAID-array and without heavy IDE access, the machine runs stable.

Kernels tested: 2.4.18, 2.4.19pre8

Has anybody seen this before? Any info would be appreciated. I would
be happy to provide more information.

Diagnostics attached below.


------- output from dt (this is actually output from testing the raid
array) ----------------

Command Line:

    % dt.d/dt of=/data/test limit=1g min=512 max=32k align=rotate procs=15 log=dtlog runtime=12h 

        --> Date: June 2nd, Version: 14.10, Author: Robin T. Miller <--

[...]

dt (2150): Error number 1 occurred on Wed May  8 20:16:40 2002
dt (2150): Data compare error at byte 5116 in record number 36
dt (2150): Relative block number where the error occcured is 639 (offset 508)
dt (2150): Data expected = 0xde, data found = 0x33, byte count = 18432
dt (2150): The incorrect data starts at address 0x80b1688 (marked by asterisk '*')
dt (2150): Dumping Pattern Buffer (base = 0x80b1688, offset = 0, limit = 4 bytes):

0x80b1688 *de c6 de c6

dt (2150): The incorrect data starts at address 0x80b33ff (marked by asterisk '*')
dt (2150): Dumping Data Buffer (base = 0x80b2003, offset = 5116, limit = 64 bytes):

0x80b33df  de c6 de c6 de c6 de c6 de c6 de c6 de c6 de c6
0x80b33ef  de c6 de c6 de c6 de c6 de c6 de c6 de c6 de c6
0x80b33ff *33 33 33 33 de c6 de c6 de c6 de c6 de c6 de c6
0x80b340f  de c6 de c6 de c6 de c6 de c6 de c6 de c6 de c6


[...]

dt (2148): Error number 1 occurred on Wed May  8 20:16:42 2002
dt (2148): Data compare error at byte 2044 in record number 857
dt (2148): Relative block number where the error occcured is 27343 (offset 508)
dt (2148): Data expected = 0xff, data found = 0x26, byte count = 12800
dt (2148): The incorrect data starts at address 0x80b1688 (marked by asterisk '*')
dt (2148): Dumping Pattern Buffer (base = 0x80b1688, offset = 0, limit = 4 bytes):

0x80b1688 *ff 00 ff 00

dt (2148): The incorrect data starts at address 0x80b27fc (marked by asterisk '*')
dt (2148): Dumping Data Buffer (base = 0x80b2000, offset = 2044, limit = 64 bytes):

0x80b27dc  ff 00 ff 00 ff 00 ff 00 ff 00 ff 00 ff 00 ff 00
0x80b27ec  ff 00 ff 00 ff 00 ff 00 ff 00 ff 00 ff 00 ff 00
0x80b27fc *26 33 67 66 ff 00 ff 00 ff 00 ff 00 ff 00 ff 00
0x80b280c  ff 00 ff 00 ff 00 ff 00 ff 00 ff 00 ff 00 ff 00

[...]

dt (2160): Error number 1 occurred on Wed May  8 20:16:46 2002
dt (2160): Data compare error at byte 24572 in record number 49
dt (2160): Relative block number where the error occcured is 1223 (offset 508)
dt (2160): Data expected = 0x39, data found = 0xff, byte count = 25088
dt (2160): The incorrect data starts at address 0x80b1688 (marked by asterisk '*')
dt (2160): Dumping Pattern Buffer (base = 0x80b1688, offset = 0, limit = 4 bytes):

0x80b1688 *39 9c c3 39

dt (2160): The incorrect data starts at address 0x80b7ffc (marked by asterisk '*')
dt (2160): Dumping Data Buffer (base = 0x80b2000, offset = 24572, limit = 64 bytes):

0x80b7fdc  39 9c c3 39 39 9c c3 39 39 9c c3 39 39 9c c3 39
0x80b7fec  39 9c c3 39 39 9c c3 39 39 9c c3 39 39 9c c3 39
0x80b7ffc *ff 00 ff 00 39 9c c3 39 39 9c c3 39 39 9c c3 39
0x80b800c  39 9c c3 39 39 9c c3 39 39 9c c3 39 39 9c c3 39


[.... ad nauseam]
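
The three dt reports above share a pattern that a few lines of arithmetic make explicit. The numbers below are copied from the dt output; treating 512 bytes as the sector size is an assumption, though it matches the "(offset 508)" that dt prints.

```python
SECTOR = 512   # assumed sector size

# (byte offset in record, data buffer base, reported address of the bad data)
reports = [
    (5116,  0x80b2003, 0x80b33ff),
    (2044,  0x80b2000, 0x80b27fc),
    (24572, 0x80b2000, 0x80b7ffc),
]

# Position of the first bad byte inside its 512-byte sector.
offsets_in_sector = [byte % SECTOR for byte, _, _ in reports]
print(offsets_in_sector)

# Cross-check: buffer base + record offset should equal the reported address.
addresses_match = all(base + byte == addr for byte, base, addr in reports)
print(addresses_match)
```

Every error starts at offset 508, and each dump shows four bad bytes before the expected pattern resumes, so in these samples the corruption covers exactly the last four bytes of a 512-byte sector.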

-------------- lspci output ------------

00:00.0 Host bridge: VIA Technologies, Inc. VT8367 [KT266]
00:01.0 PCI bridge: VIA Technologies, Inc. VT8367 [KT266 AGP]
00:06.0 Unknown mass storage controller: Promise Technology, Inc. 20265 (rev 02)
00:0c.0 VGA compatible unclassified device: S3 Inc. 86c864 [Vision 864 DRAM] vers 0
00:0e.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 0c)
00:0f.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 0c)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8233 PCI to ISA Bridge
00:11.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06)



-- 
Best regards,
 Henning                          mailto:hgs@anna-strasse.de




* Re: access beyond end of device again
  2002-06-25  8:13                     ` Oleg Drokin
@ 2002-06-25  8:46                       ` Hans Reiser
  2002-06-25  8:58                         ` Oleg Drokin
       [not found]                       ` <353485111.20020627013212@redefine.org>
  1 sibling, 1 reply; 25+ messages in thread
From: Hans Reiser @ 2002-06-25  8:46 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Kevin, reiserfs-list

Oleg Drokin wrote:

>Abit BP6 is particularly bad motherboard, you know.
>And running celerons in SMP mode is not supported by Intel.
All of Namesys used to use BP6s running celerons in SMP mode....

It was a poor motherboard though, as I dimly remember.  Not as bad as the
current Tyans we use though.....

Hans



* Re: access beyond end of device again
  2002-06-25  8:46                       ` Hans Reiser
@ 2002-06-25  8:58                         ` Oleg Drokin
  2002-06-25  9:06                           ` Hans Reiser
  0 siblings, 1 reply; 25+ messages in thread
From: Oleg Drokin @ 2002-06-25  8:58 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Kevin, reiserfs-list

Hello!

On Tue, Jun 25, 2002 at 12:46:52PM +0400, Hans Reiser wrote:
> >Abit BP6 is particularly bad motherboard, you know.
> >And running celerons in SMP mode is not supported by Intel.
> All of Namesys used to use BP6s running celerons in SMP mode....

Yes, ask Vitaly about his unpleasant experiences, and logs full of errors.

Bye,
    Oleg


* Re: access beyond end of device again
  2002-06-25  8:58                         ` Oleg Drokin
@ 2002-06-25  9:06                           ` Hans Reiser
  2002-06-25  9:41                             ` Oleg Drokin
  0 siblings, 1 reply; 25+ messages in thread
From: Hans Reiser @ 2002-06-25  9:06 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Kevin, reiserfs-list

Oleg Drokin wrote:

>Hello!
>
>On Tue, Jun 25, 2002 at 12:46:52PM +0400, Hans Reiser wrote:
>
>>>Abit BP6 is particularly bad motherboard, you know.
>>>And running celerons in SMP mode is not supported by Intel.
>>>
>>All of Namesys used to use BP6s running celerons in SMP mode....
>>
>
>Yes, ask Vitaly about his unpleasant experiences, and logs full of errors.
>
>Bye,
>    Oleg
>
How fortunate that it was the fsck specialist who had the controller go 
bad on him;-)......

-- 
Hans





* Re: access beyond end of device again
  2002-06-25  9:06                           ` Hans Reiser
@ 2002-06-25  9:41                             ` Oleg Drokin
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Drokin @ 2002-06-25  9:41 UTC (permalink / raw)
  To: Hans Reiser; +Cc: Kevin, reiserfs-list

Hello!

On Tue, Jun 25, 2002 at 01:06:09PM +0400, Hans Reiser wrote:

> >Yes, ask Vitaly about his unpleasant experiences, and logs full of errors.
> How fortunate that it was the fsck specialist who had the controller go 
> bad on him;-)......

Yes, we already figured that out ;)

Bye,
    Oleg


* Re: access beyond end of device again
       [not found]                       ` <353485111.20020627013212@redefine.org>
@ 2002-06-25 23:22                         ` Kevin
  2002-06-26  4:53                           ` Oleg Drokin
  0 siblings, 1 reply; 25+ messages in thread
From: Kevin @ 2002-06-25 23:22 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: reiserfs-list

You think moving the drive to a different controller (not hpt) would
help?  I've got an extra channel on a promise ata66 card that I could
use.



* Re: access beyond end of device again
  2002-06-25 23:22                         ` Kevin
@ 2002-06-26  4:53                           ` Oleg Drokin
  0 siblings, 0 replies; 25+ messages in thread
From: Oleg Drokin @ 2002-06-26  4:53 UTC (permalink / raw)
  To: Kevin; +Cc: reiserfs-list

Hello!

On Tue, Jun 25, 2002 at 04:22:42PM -0700, Kevin wrote:
> You think moving the drive to a different controller (not hpt) would
> help?  I've got an extra channel on a promise ata66 card that I could
> use.

I am not sure, but it seems Promise controllers have even more problems
with VIA chipsets than HPT controllers do.
But it is impossible to say until you try that combination and see what happens.

Bye,
    Oleg


* Re: access beyond end of device again
  2002-06-24 14:48   ` Oleg Drokin
  2002-06-24 14:49     ` Robert Brockway
@ 2002-07-10 18:04     ` Tim Small
  1 sibling, 0 replies; 25+ messages in thread
From: Tim Small @ 2002-07-10 18:04 UTC (permalink / raw)
  To: reiserfs-list

I've seen this sort of problem before, with a 160G disk.  I'd installed 
a kernel with IDE patches to enable access to all 160G of the device, 
and created a reiserfs on it.  Then one of my colleagues decided to 
install our 'standard' kernel on it.  Sadly, the entire filesystem was 
turned to cheese, and reiserfsck didn't get much back.

However, this looks to be a LONG way beyond the end of the device, but 
"cat /proc/partitions" might still be worth doing...


Tim.
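
To put a number on "a LONG way": a small sketch comparing the two values from Kevin's log line. The ratio does not depend on units. The GiB figures assume the kernel prints want and limit in 1 KiB blocks, which I believe is what 2.4 kernels do but have not checked against this exact version. The function names are only illustrative.

```python
WANT = 2052028788   # "want=" from the log line
LIMIT = 58633312    # "limit=" from the log line
KIB_PER_GIB = 1024 * 1024


def overshoot(want, limit):
    """Return how many times larger the requested block is than the device limit."""
    return want / limit


def to_gib(blocks_1k):
    """Convert a count of 1 KiB blocks to GiB (the 1 KiB unit is an assumption)."""
    return blocks_1k / KIB_PER_GIB


print(f"want/limit = {overshoot(WANT, LIMIT):.1f}x")
print(f"limit ~ {to_gib(LIMIT):.1f} GiB, want ~ {to_gib(WANT):.1f} GiB")
```

A request roughly 35 times past the end of the device is hard to explain by a slightly wrong disk size, which fits the corrupted-pointer explanation Oleg gave at the start of the thread.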

Oleg Drokin wrote:

>Hello!
>
>On Tue, Jun 25, 2002 at 12:37:02AM +1000, Robert Brockway wrote:
>
>>>I'm getting these errors again:
>>>  attempt to access beyond end of device
>>>  38:01: rw=0, want=2052028788, limit=58633312
>>>
>>What sort of device is this supposed to be? :)  38:01 is either a "Myricom 
>>PCI Myrinet board" or something "reserved for Linux/AP+" (and I'm assuming 
>>we're talking about a block device rather than a character device here :)
>>
>
>Numbers printed are in hex, so this is:
>    block       Fifth IDE hard disk/CD-ROM interface
>                  0 = /dev/hdi          Master: whole disk (or CD-ROM)
>                 64 = /dev/hdj          Slave: whole disk (or CD-ROM)
>
>                Partitions are handled the same way as for the first
>                interface (see major number 3).
>
>Bye,
>    Oleg
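
The decoding Oleg describes can be sketched in a few lines of Python. This is only an illustration: the function name decode_ide4 is made up, it handles just the major number quoted above (0x38 = 56, with minors 0-63 on hdi and 64-127 on hdj), and the partition naming follows the "same way as for the first interface" note.

```python
def decode_ide4(devno):
    """Decode a hex 'MAJOR:MINOR' pair for block major 56 (fifth IDE interface)."""
    major_text, minor_text = devno.split(":")
    major, minor = int(major_text, 16), int(minor_text, 16)
    if major != 56 or not 0 <= minor < 128:
        raise ValueError("this sketch only covers major 56, minors 0-127")
    disk = "hdi" if minor < 64 else "hdj"   # master or slave on that interface
    partition = minor % 64                  # 0 means the whole disk
    name = "/dev/" + disk + (str(partition) if partition else "")
    return major, minor, name


print(decode_ide4("38:01"))
```

Read this way, the 38:01 in the error message is the first partition on /dev/hdi rather than a Myrinet board.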



