linux-kernel.vger.kernel.org archive mirror
* 2.5.75 does not boot - TCQ oops
@ 2003-07-11  2:51 Ivan Gyurdiev
  2003-07-11  8:03 ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Ivan Gyurdiev @ 2003-07-11  2:51 UTC (permalink / raw)
  To: LKML

See

http://www.ussg.iu.edu/hypermail/linux/kernel/0307.0/0515.html

where the bug is described for 2.5.74.
I got no replies, and the bug persists in 2.5.75 (+bk patches).

Note:
The machine boots with TASKFILE on, TCQ is causing the problem.




* Re: 2.5.75 does not boot - TCQ oops
@ 2003-07-11  8:35 Voluspa
  0 siblings, 0 replies; 12+ messages in thread
From: Voluspa @ 2003-07-11  8:35 UTC (permalink / raw)
  To: linux-kernel


On 2003-07-11 2:51:42 Ivan Gyurdiev wrote:

> http://www.ussg.iu.edu/hypermail/linux/kernel/0307.0/0515.html

Reading the handcrafted log, yes, that's 'exactly' what I get ;-)
If prodded, I can do a transcription as well.

> where the bug is described for 2.5.74.
> I got no replies, and the bug persists in 2.5.75 (+bk patches).

Haven't tried the 2.5.74, but plain 2.5.75 is where I crash.

> Note:
> The machine boots with TASKFILE on, TCQ is causing the problem.

Yes. I'm writing this on a machine where CONFIG_IDE_TASK_IOCTL is not set,
CONFIG_IDE_TASKFILE_IO=y, and CONFIG_BLK_DEV_IDE_TCQ is not set.

Speaking of TASKFILE... I had some hope that it would fix at least a bit
of the regression in disk speed since 2.4.19-ac1+preempt (my yardstick,
an excellent kernel). Running hdparm -tT /dev/hda on that kernel, I get
about 123 MB/sec and 27 MB/sec. On this 2.5.75 I see 119 MB/sec and
22 MB/sec.
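
To put those figures in proportion, here is a small sketch of the
arithmetic. It assumes the first number of each pair is the cached-read
result (-T) and the second the buffered disk read (-t), which is the order
hdparm normally prints them. The function name is made up for illustration.

```python
def percent_drop(old_mb_s: float, new_mb_s: float) -> float:
    """Return the relative slowdown from old to new, in percent."""
    return (old_mb_s - new_mb_s) / old_mb_s * 100.0


# Figures quoted in the message, in MB/sec.
# Assumption: first value = cached reads, second = buffered disk reads.
cached = percent_drop(123, 119)    # 2.4.19-ac1+preempt vs 2.5.75
buffered = percent_drop(27, 22)    # 2.4.19-ac1+preempt vs 2.5.75

print(f"cached reads:   {cached:.1f}% slower")
print(f"buffered reads: {buffered:.1f}% slower")
```

Under that reading, the buffered disk figure accounts for nearly all of the
regression; the cached figure moves only a few percent.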

Here's hoping it can be cranked up before 2.6!

Mvh
Mats Johannesson

* Re: 2.5.75 does not boot - TCQ oops
@ 2003-07-12  4:05 Voluspa
  0 siblings, 0 replies; 12+ messages in thread
From: Voluspa @ 2003-07-12  4:05 UTC (permalink / raw)
  To: linux-kernel


On 2003-07-11 20:58:09 Ivan Gyurdiev wrote:

> Patch confirmed to work - the machine boots.
[...]
> Most massive fs corruption I've ever had.
[...]
> I blamed the reiserfs bk work at first (which I applied along with
> [Axboe's] tcq patch), but I noted that the fs only gets corrupted
> with a tcq-enabled kernel

I took home 2.5.75-bk1, applied the tcq patch and then used the computer
for five hours in the TCQ+TASKFILE environment. Filesystem is ext2.

Untarred a kernel. Copied it to a couple of destinations. Compiled.
Listened to music. Watched part of a movie. Did an NFS move of a file
(which, by the way, was pure horror: 600k in about 3 minutes) from a
machine with a 2.2.16 kernel. Then read about your woes.

Checked the md5sum of a large file that I keep for... corruption checks.
It was ok. Did a read massage with "cd /usr ; find . -type f -exec md5sum {}
\;". No hiccups. Except...
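
The same kind of read pass can be turned into a before/after comparison.
The following is a minimal sketch, not what was actually run here: it
checksums every regular file under a directory (skipping symlinks, like
"find -type f") and reports files whose MD5 changed between two passes.
All function names are hypothetical.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Map relative path -> MD5 digest for each regular file under root."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            sums[os.path.relpath(full, root)] = md5_of_file(full)
    return sums


def changed_files(before, after):
    """Return sorted paths whose digest differs or that went missing."""
    return sorted(p for p in before if after.get(p) != before[p])


# Example use (not executed here):
#   baseline = checksum_tree("/usr")   # taken on a known-good kernel
#   current = checksum_tree("/usr")    # taken again on the kernel under test
#   print(changed_files(baseline, current))
```

Reading files back only exercises the read path; it catches data that
changed on disk since the baseline, not writes that were wrong from the
start.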

Found 1 error in /var/log/kernel that I _never_ get with the 2.4.19:
Jul 12 02:03:39 loke kernel: hda: status error: status=0x48 { DriveReady DataRequest }
Jul 12 02:03:39 loke kernel: 
Jul 12 02:03:39 loke kernel: hda: drive not ready for command

So I shut down X in preparation for a reboot and a full fs check. While
waiting for the distributed project foldingathome to checkpoint its work,
I hit another error I had never seen before (time is UTC):

[01:10:00] [SPG] 100.0 % 
[01:10:00] [SPG] Writing current.xyz                                   
[01:10:01] [SPG] Sequence 15 completed:                                
[01:10:01] SNEYSGTFSFKTKQSKDEMLDALQIKNSYISQMRQITPKMAIEYPKGTPT . . .    
[01:10:01] - Error: Checksums don't match (work/wudata_06.arc)
[01:10:01] [SPG] Error: checksum error                                 
[01:10:02] CoreStatus = 0 (0)
[01:10:02] Client-core communications error: ERROR 0x0
[01:10:02] Deleting current work unit & continuing...

The reboot didn't reveal any fs corruption. Still, I've returned to a
safe kernel :-) The disk where TCQ was enabled (using depth 8)
is an IBM-DTLA-307015. Unfortunately, or luckily, my
IC35L080AVVA07-0 shares its life with a CD, so no TCQ there.

Mvh
Mats Johannesson

* Re: 2.5.75 does not boot - TCQ oops
@ 2003-07-12  7:51 Ivan Gyurdiev
  0 siblings, 0 replies; 12+ messages in thread
From: Ivan Gyurdiev @ 2003-07-12  7:51 UTC (permalink / raw)
  To: lista1, axboe; +Cc: LKML

> [ Voluspa wrote: ]
> I took home 2.5.75-bk1, applied the tcq patch and then used the computer
> for five hours in the TCQ+TASKFILE environment. Filesystem is ext2. 
....
>
> No hiccups. Except...
>======================================================

What would be the best approach to track down the problem?
I've recovered 99% of my system.
(rpm -V is a wonderful thing together with wget).

I intend to convert my root filesystem from reiserfs to xfs tomorrow
for the purposes of testing (and because I've had too many problems with
reiser over time). Should I do this and re-test TCQ to rule out any reiser
problems, or should I stick with the current setup and do more testing as
needed? I'll have to figure out some precautions if so :)
How can I help?

On the good side, the fs corruption dug out 10 gigs of free space
I never would have thought was available... cleanup for free.





* Re: 2.5.75 does not boot - TCQ oops
@ 2003-07-12 10:46 Ivan Gyurdiev
  2003-07-12 13:07 ` Voluspa
  0 siblings, 1 reply; 12+ messages in thread
From: Ivan Gyurdiev @ 2003-07-12 10:46 UTC (permalink / raw)
  To: lista1, axboe; +Cc: LKML

Okay, I figured out some more things.

I split my root fs in half and made two identical copies - one on reiser and
one on xfs. I compiled a bunch of kernels, and tested some more.

A TCQ enabled kernel with AS, queue depth of 32 works fine on both reiser and
xfs. A TCQ enabled kernel with deadline, queue depth of 32 works fine on both
reiser and xfs. Note the depth of 32. I had TCQ enabled on earlier (pre-74)
kernels and it worked fine everywhere with depth 32.

However, the kernel that crashed was using the default queue depth
(the default is 8, even though the comment says 32).
I tested that kernel again with reiserfs, TCQ, the two elevators, and queue
depth 8 - on-boot fsck detects corruption every time, marks the system
unclean, and requires --rebuild-tree or --fix-fixable on any further mounts.
I now get a "Wrong amount of used blocks." message.

Have not tested depth 8 TCQ kernel with xfs, since that's my surviving root 
fs, and I'd like to avoid corruption there. 

Have not tested queue depths other than 8 and 32. 
I could test some more on reiser now that I have a backup root fs.
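
The combinations reported so far can be laid out as a small table in code,
which also makes the untested cases explicit. This is only a sketch: the
labels are shorthand chosen for illustration, and the outcomes are just the
ones stated above.

```python
from itertools import product

# Dimensions of the test matrix described in the message.
FILESYSTEMS = ("reiserfs", "xfs")
ELEVATORS = ("as", "deadline")
DEPTHS = (8, 32)

# Outcomes as reported: depth 32 was fine everywhere,
# depth 8 corrupted reiserfs under both elevators.
results = {}
for fs, elevator in product(FILESYSTEMS, ELEVATORS):
    results[(fs, elevator, 32)] = "ok"
for elevator in ELEVATORS:
    results[("reiserfs", elevator, 8)] = "corruption"

# Everything in the full matrix without a recorded outcome is untested.
untested = [combo for combo in product(FILESYSTEMS, ELEVATORS, DEPTHS)
            if combo not in results]

for fs, elevator, depth in untested:
    print(f"untested: {fs}, {elevator}, depth {depth}")
```

The two remaining cells are both xfs at depth 8, which is the test that
would separate a reiserfs problem from a queue-depth problem.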







Thread overview: 12+ messages
2003-07-11  2:51 2.5.75 does not boot - TCQ oops Ivan Gyurdiev
2003-07-11  8:03 ` Jens Axboe
2003-07-11  8:28   ` Jens Axboe
2003-07-11  8:34     ` Jens Axboe
2003-07-11 10:54       ` Bartlomiej Zolnierkiewicz
2003-07-11 10:55         ` Jens Axboe
2003-07-11 20:58       ` Ivan Gyurdiev
2003-07-11  8:35 Voluspa
2003-07-12  4:05 Voluspa
2003-07-12  7:51 Ivan Gyurdiev
2003-07-12 10:46 Ivan Gyurdiev
2003-07-12 13:07 ` Voluspa
