Failed reads from RAID-0 array (from newbie who has read the FAQ)

* Failed reads from RAID-0 array (from newbie who has read the FAQ)
@ 2007-03-17  2:20 Michael Schwarz
  2007-03-17  5:31 ` Neil Brown
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Schwarz @ 2007-03-17  2:20 UTC (permalink / raw)
  To: linux-raid

I'm not a Linux newbie (I've even written a couple of books and done some
very light device driver work), but I'm completely new to the software
raid subsystem.

I'm doing something rather oddball. I'm making an array of USB flash
drives and comparing read and write rates.

Well, I've had great success writing. I've got seven flash drives on a
hub. I've joined them up both linear and raid0 and written large amounts
of data to them. But come time to read from them, linear works, but raid0
hangs after transferring just shy of 2G of data. It doesn't matter if it
is reading from one file or from many files whose cumulative size is just shy
of 2G. It doesn't matter if I'm using "dd" or "cp" to read the file or
files.

The process doing the transfer is unkillable. Not with a kill -15 or a
kill -9. It won't die, but it also won't make progress.

"Linear" always works. Raid-0 always hangs.

Here are my mdadm commands to create the array:

mdadm --create /dev/md0 --level=linear --auto=md --chunk=32
--raid-devices=7 /dev/sd?

(The wildcard works because the seven flash drives are the only scsi
devices on the system).

The command for the raid-0 array is the same as above except for the
"--level=0" it takes to make a raid 0 array.

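For reference, the two variants described above can be spelled out side by side. This is only a dry-run sketch: it prints the commands and runs nothing, the helper name build_md_create is made up for illustration, and the flags are exactly those quoted in the post.

```shell
# Dry run only: prints the mdadm command for a given level, runs nothing.
# build_md_create is a hypothetical helper name; the flags come from the post.
build_md_create() {
    level=$1    # "linear" or "0"
    printf '%s\n' "mdadm --create /dev/md0 --level=$level --auto=md --chunk=32 --raid-devices=7 /dev/sd?"
}

build_md_create linear
build_md_create 0
```
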
I then use "mkfs" to make the filesystem and mount the resulting array at
"/mnt"

Can anyone give a raid newbie some tips? Is there something obvious I'm
missing? Would it help to provide strace/ltrace/ptrace of the hanging copy
command?

Any help (including URLs of manuals I should RTFM) would be most welcome.

Thanks!

-- 
Michael Schwarz


* Re: Failed reads from RAID-0 array (from newbie who has read the FAQ)
  2007-03-17  2:20 Failed reads from RAID-0 array (from newbie who has read the FAQ) Michael Schwarz
@ 2007-03-17  5:31 ` Neil Brown
  2007-03-17 18:01   ` Michael Schwarz
       [not found]   ` <45FC33A4.2090408@tmr.com>
  0 siblings, 2 replies; 23+ messages in thread
From: Neil Brown @ 2007-03-17  5:31 UTC (permalink / raw)
  To: mschwarz; +Cc: linux-raid, linux-usb-users

On Friday March 16, mschwarz@multitool.net wrote:
> I'm not a Linux newbie (I've even written a couple of books and done some
> very light device driver work), but I'm completely new to the software
> raid subsystem.
> 
> I'm doing something rather oddball. I'm making an array of USB flash
> drives and comparing read and write rates.
> 
> Well, I've had great success writing. I've got seven flash drives on a
> hub. I've joined them up both linear and raid0 and written large amounts
> of data to them. But come time to read from them, linear works, but raid0
> hangs after transferring just shy of 2G of data. It doesn't matter if it
> reading from one file or from many files whose cumulative size is just shy
> of 2G. It doesn't matter if I'm using "dd" or "cp" to read the file or
> files.
> 
> The process doing the transfer is unkillable. Not with a kill -15 or a
> kill -9. It won't die, but it also won't make progress.
> 
> "Linear" always works. Raid-0 always hangs.

My guess would be a locking bug in the usb storage driver or some
lower level USB driver.
A significant difference between raid0 and linear is that a largish IO
will touch all drives for raid-0, but only one or two for linear.
That gives much more opportunity for locking bugs to hit.
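
To put rough numbers on that difference, here is a small arithmetic sketch. It assumes equal-sized members and simple striping (chunk k on member k mod N); the function names and the 1 GiB member size in the usage lines are illustrative assumptions, not details from this thread.

```shell
# Count how many member devices one IO touches.
# All sizes are in KiB. Function names are hypothetical.
raid0_devices_touched() {
    offset_kib=$1; len_kib=$2; chunk_kib=$3; ndev=$4
    first=$(( offset_kib / chunk_kib ))
    last=$(( (offset_kib + len_kib - 1) / chunk_kib ))
    chunks=$(( last - first + 1 ))
    # An IO spanning at least ndev chunks hits every member.
    if [ "$chunks" -lt "$ndev" ]; then echo "$chunks"; else echo "$ndev"; fi
}

linear_devices_touched() {
    offset_kib=$1; len_kib=$2; dev_kib=$3
    first=$(( offset_kib / dev_kib ))
    last=$(( (offset_kib + len_kib - 1) / dev_kib ))
    echo $(( last - first + 1 ))
}

# A 1 MiB read at offset 0, 32 KiB chunks, 7 members:
raid0_devices_touched 0 1024 32 7        # prints 7
# The same read on a linear array of (assumed) 1 GiB members:
linear_devices_touched 0 1024 1048576    # prints 1
```
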

When it is in the hanging state, do
  echo t > /proc/sysrq-trigger

and look in the kernel logs for the stack trace of all processes.
Hopefully the stack trace for the processes in 'D' state will be
informative.
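
A possible way to find the 'D' state processes before digging through the full dump is to filter ps output. This is a sketch, not part of the advice above: d_state_only is a made-up helper name, and the usage line assumes the procps version of ps. Writing to /proc/sysrq-trigger itself needs root.

```shell
# d_state_only keeps the header line plus rows whose STAT column starts
# with D (uninterruptible sleep). The helper name is hypothetical.
d_state_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Typical use (assumes procps ps):
#   ps -eo pid,stat,wchan:32,comm | d_state_only
```
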

NeilBrown





* Re: Failed reads from RAID-0 array (from newbie who has read the FAQ)
  2007-03-17  5:31 ` Neil Brown
@ 2007-03-17 18:01   ` Michael Schwarz
  2007-03-17 20:49     ` Alan Stern
       [not found]   ` <45FC33A4.2090408@tmr.com>
  1 sibling, 1 reply; 23+ messages in thread
From: Michael Schwarz @ 2007-03-17 18:01 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, linux-usb-users

Neil:

Relevant stack trace follows. Any suggestions? blk_backing_dev_unplug...
Does that mean the raid subsystem thinks one of the usb drives has been
removed? I assure you that physically this is untrue, but that doesn't
mean that some sort of logical disconnect hasn't happened...

Makes me wonder if one of my USB hub connections is intermittent...

I would also welcome any tips on any other developers group to follow up
with. I haven't hacked any kernel code since the 2.2.x kernel and things
have changed a bit! I don't mind digging into this, but I suspect I could
get things cleared up fast if I could find the right subject expert!



 =======================
cp            D E2FBEDB0  1784  4271   4270                     (NOTLB)
       e2fbedb4 00200086 c15dc550 e2fbedb0 00000001 00200082 00001000 00000000
       00000000 c15dc550 0000000a e94182b0 f3161430 26320f40 000001c5 00000000
       e94183bc c1c8c480 00000000 ecd7d300 c04e0bf2 c042e0e4 f7d767f8 003b6622
Call Trace:
 [<c04e0bf2>] blk_backing_dev_unplug+0x73/0x7b
 [<c042e0e4>] getnstimeofday+0x30/0xb6
 [<c061ec7e>] io_schedule+0x3a/0x5c
 [<c045626b>] sync_page+0x0/0x3b
 [<c04562a3>] sync_page+0x38/0x3b
 [<c061ed8a>] __wait_on_bit_lock+0x2a/0x52
 [<c045625d>] __lock_page+0x58/0x5e
 [<c043788e>] wake_bit_function+0x0/0x3c
 [<c04569e3>] do_generic_mapping_read+0x1e0/0x459
 [<c0458b0d>] generic_file_aio_read+0x173/0x1a6
 [<c0456070>] file_read_actor+0x0/0xe0
 [<c047202f>] do_sync_read+0xc7/0x10a
 [<c0437859>] autoremove_wake_function+0x0/0x35
 [<c0471f68>] do_sync_read+0x0/0x10a
 [<c04728b6>] vfs_read+0xa6/0x152
 [<c0472d0f>] sys_read+0x41/0x67
 [<c0403f64>] syscall_call+0x7/0xb
 =======================

-- 
Michael Schwarz




* Re: Failed reads from RAID-0 array; still no joy in Mudville.
       [not found]   ` <45FC33A4.2090408@tmr.com>
@ 2007-03-17 19:13     ` Michael Schwarz
  2007-03-17 19:21       ` Michael Schwarz
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Schwarz @ 2007-03-17 19:13 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Neil Brown, linux-raid, linux-usb-users

I'll try playing around with IO sizes with dd.
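
One way such a sweep might look, built on the direct-I/O read Bill Davidsen suggested (quoted further down). This is a dry-run sketch that only prints the commands: dd_sweep is a made-up name, and the block sizes are arbitrary picks, with 224k chosen because it is one full stripe (7 x 32k) on this array.

```shell
# Dry run only: prints one dd read test per block size, runs nothing.
# dd_sweep is a hypothetical name; /dev/md0 is the array from this thread.
dd_sweep() {
    dev=$1
    for bs in 4k 32k 224k 1024k; do
        printf '%s\n' "dd if=$dev iflag=direct bs=$bs of=/dev/null"
    done
}

dd_sweep /dev/md0
```
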

What I'm finding so far is ABSOLUTE consistency on where it locks. If it
were a race condition with kernel locks I guess I would expect it to be
more indeterminate (in my limited experience) unless it is due to a specific
"deadly embrace" condition between the usb driver(s) and the raid
subsystem.

I must admit that I'm not familiar enough with either one. I will also
mention that I experienced this lockup phenomenon with both a stock Fedora
Core 6 i686 kernel and with a stock Ubuntu kernel, so the behavior isn't
terribly kernel compile/module mix sensitive.

I've downloaded the kernel-devel package for my Fedora kernel and I'm
going to start working backwards from the stack trace I've captured to see
where I'm hanging and why. strace wasn't particularly helpful since the
write to file was buffered and so I can't be sure I have the call that
failed. (I'll take a look and see if there's an 'unbuffered write' switch
on strace -- there probably is).

Anyways, I'm still hoping someone who knows a lot will see this and say
"oh, yeah! That's because of BLAH." I don't mind becoming more
knowledgeable about the 2.6.x kernel, but this wasn't how I wanted to go
about it! ;-)

Thanks again, all...

What I find odd is that it seems to be a "per-process" problem. I can
still access the md drive from other processes when the copy is hung.  I'm
going to see if it is "positional" by copying the file that is "hung"
alone and see if it hangs in the same place on the same file, or if it
hangs later or what... There will be more posts from me. (Fair warning to
all!)

-- 
Michael Schwarz

> Neil, would retrying this with small i/o show anything, assuming your
> thought is the cause? Also, would it give useful information to use dd
> with direct i/o on read:
>   dd if=/dev/md0 iflag=direct bs=1024k of=/dev/null
> and see if large buffer with O_DIRECT works?
>
> These are suggestions on getting more info, if the trace doesn't clarify
> the problem.
>
> --
> bill davidsen <davidsen@tmr.com>
>   CTO TMR Associates, Inc
>   Doing interesting things with small computers since 1979
>
>


* Re: Failed reads from RAID-0 array; still no joy in Mudville.
  2007-03-17 19:13     ` Failed reads from RAID-0 array; still no joy in Mudville Michael Schwarz
@ 2007-03-17 19:21       ` Michael Schwarz
  2007-03-18 17:22         ` Bill Davidsen
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Schwarz @ 2007-03-17 19:21 UTC (permalink / raw)
  To: mschwarz; +Cc: linux-raid, linux-usb-users

Update:

(For those who've been waiting breathlessly). It hangs at a particular
point in a particular file. In other words, it doesn't depend on the total
number of bytes transferred. Rather, when it reaches a particular point in
a particular file (12267520 bytes into a file that is 1073709056 bytes
long) it hangs.
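
For what it is worth, the arithmetic on that offset can be sketched as below. Big caveat: 12267520 is an offset into a file, not into /dev/md0, and the filesystem decides where the file's blocks really live, so the member number only means something if the same offset were applied to the raw array. The sketch also assumes a single-zone raid0 with equal-sized members, and locate_in_stripe is a made-up helper name.

```shell
# Assumes a single-zone raid0 with equal-sized members, so chunk k lands on
# member (k mod ndev). locate_in_stripe is a hypothetical helper name.
locate_in_stripe() {
    offset=$1; chunk=$2; ndev=$3     # offset and chunk in bytes
    k=$(( offset / chunk ))
    echo "chunk=$k member=$(( k % ndev )) offset_in_chunk=$(( offset % chunk ))"
}

# 32 KiB chunks (--chunk=32) and 7 members, as in the original post:
locate_in_stripe 12267520 32768 7
# prints: chunk=374 member=3 offset_in_chunk=12288
```

The offset is also an exact multiple of 4096 (2995 x 4096), which fits a read that stalls on a page boundary.
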

I begin to suspect that I have a "dead spot" in my USB hub. But if that is
true, what gets me is why the writes work. Do cp and dd not check to
see if writes succeed?

I know it isn't a particular flash drive because I've used two different
sets of 7 USB drives and it seems to fail consistently no matter which.

Nonetheless, I'm beginning to think I'm dealing with a hardware issue, not
a kernel issue, just because it is so consistent.

Thanks again for all the help.


-- 
Michael Schwarz





* Re: Failed reads from RAID-0 array (from newbie who has read the FAQ)
  2007-03-17 18:01   ` Michael Schwarz
@ 2007-03-17 20:49     ` Alan Stern
  2007-03-17 21:35       ` Michael Schwarz
  0 siblings, 1 reply; 23+ messages in thread
From: Alan Stern @ 2007-03-17 20:49 UTC (permalink / raw)
  To: Michael Schwarz; +Cc: Neil Brown, linux-raid, linux-usb-users

On Sat, 17 Mar 2007, Michael Schwarz wrote:

> Neil:
> 
> Relevant stack trace follows. Any suggestions? blk_backing_dev_unplug...
> Does that mean the raid subsystem thinks one of the usb drives has been
> removed? I assure you that physically this is untrue, but that doesn't
> mean that some sort logical disconnect hasn't happened...
> 
> Makes me wonder if one of my USB hub connections is intermittent...
> 
> I would also welcome any tips on any other developers group to follow up
> with. I haven't hacked any kernel code since the 2.2.x kernel and things
> have changed a bit! I don't mind digging into this, but I suspect I could
> get things cleared up fast if I could find the right subject expert!
> 
> 
> 
>  =======================
> cp            D E2FBEDB0  1784  4271   4270                     (NOTLB)
>        e2fbedb4 00200086 c15dc550 e2fbedb0 00000001 00200082 00001000
> 00000000
>        00000000 c15dc550 0000000a e94182b0 f3161430 26320f40 000001c5
> 00000000
>        e94183bc c1c8c480 00000000 ecd7d300 c04e0bf2 c042e0e4 f7d767f8
> 003b6622
> Call Trace:
>  [<c04e0bf2>] blk_backing_dev_unplug+0x73/0x7b
>  [<c042e0e4>] getnstimeofday+0x30/0xb6
>  [<c061ec7e>] io_schedule+0x3a/0x5c
>  [<c045626b>] sync_page+0x0/0x3b
>  [<c04562a3>] sync_page+0x38/0x3b
>  [<c061ed8a>] __wait_on_bit_lock+0x2a/0x52
>  [<c045625d>] __lock_page+0x58/0x5e
>  [<c043788e>] wake_bit_function+0x0/0x3c
>  [<c04569e3>] do_generic_mapping_read+0x1e0/0x459
>  [<c0458b0d>] generic_file_aio_read+0x173/0x1a6
>  [<c0456070>] file_read_actor+0x0/0xe0
>  [<c047202f>] do_sync_read+0xc7/0x10a
>  [<c0437859>] autoremove_wake_function+0x0/0x35
>  [<c0471f68>] do_sync_read+0x0/0x10a
>  [<c04728b6>] vfs_read+0xa6/0x152
>  [<c0472d0f>] sys_read+0x41/0x67
>  [<c0403f64>] syscall_call+0x7/0xb
>  =======================

This isn't much help.  The important processes here are khubd,
usb-storage, and scsi_eh_*.  Possibly some raid-related processes too, but 
I don't know which they would be.

It also would help a lot to see your dmesg log.  Especially if you would
build your kernel with CONFIG_USB_DEBUG turned on.
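
A quick way to check whether a given kernel config already has that option set is sketched below. The helper name usb_debug_enabled is made up, and the /boot/config-* location in the usage line is an assumption about how the distribution ships its config.

```shell
# usb_debug_enabled is a hypothetical helper; it only greps a config file
# and returns 0 when the option is built in.
usb_debug_enabled() {
    grep -q '^CONFIG_USB_DEBUG=y$' "$1"
}

# Typical use, assuming the distro installs its config under /boot:
#   usb_debug_enabled "/boot/config-$(uname -r)" && echo "USB debug is on"
```
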

> Update:
>
> (For those who've been waiting breathlessly). It hangs at a particular
> point in a particular file. In other words, it doesn't depend on the total
> number of bytes transfered. Rather, when it reaches a particular point in
> a particular file (12267520 bytes into a file that is 1073709056 bytes
> long) it hangs.
>
> I begin to suspect that I have a "dead spot" in my USB hub. But what gets
> me if that is true is why does the write work? Do cp and dd not check to
> see if writes succeed?

Depends what you mean.  They do check the return codes from the underlying 
device drivers, but they don't try to read the data back to make sure it 
really was written.
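
If a read-back check is wanted, something along these lines would do it. This is a sketch with a made-up helper name (verify_copy) and made-up file names in the usage lines. Note that a read right after a write may be served from the page cache, so unmounting and remounting first (or, as root, writing 3 to /proc/sys/vm/drop_caches) gives a more honest test of the devices.

```shell
# verify_copy is a hypothetical helper: returns 0 only if both files are
# byte-identical. cmp -s is silent and reports via its exit status.
verify_copy() {
    cmp -s "$1" "$2"
}

# Typical use after a copy onto the array mounted at /mnt
# (big.img is an illustrative file name):
#   cp big.img /mnt/big.img && sync
#   verify_copy big.img /mnt/big.img || echo "read-back mismatch"
```
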

> I know it isn't a particular flash drive because I've used two different
> sets of 7 USB drives and it seems to fail consistently no matter which.

But you haven't tried using different hubs, different USB cables, or
different computers.

> Nonetheless, I'm beginning to think I'm dealing with a hardware issue, not
> a kernel issue, just because it is so consistent.

People have reported problems in which the hardware fails when it 
encounters a certain pattern of bytes in the data stream.  Maybe you're 
seeing the same sort of thing.

Alan Stern




* Re: Failed reads from RAID-0 array (from newbie who has read the FAQ)
  2007-03-17 20:49     ` Alan Stern
@ 2007-03-17 21:35       ` Michael Schwarz
  2007-03-18  2:06         ` [Linux-usb-users] " Alan Stern
  2007-03-18  2:12         ` Alan Stern
  0 siblings, 2 replies; 23+ messages in thread
From: Michael Schwarz @ 2007-03-17 21:35 UTC (permalink / raw)
  To: Alan Stern; +Cc: Neil Brown, linux-raid, linux-usb-users

Comments/questions below...

-- 
Michael Schwarz


>
> This isn't much help.  The important processes here are khubd,
> usb-storage, and scsi_eh_*.  Possibly some raid-related processes too, but
> I don't know which they would be.

I have no copy of khubd running. What is the list policy on attachments? I
have the kernel stack traces for the kernel threads you want, but the text
is rather long. Being new to the group I do not know if I should just go
ahead and paste something that long in here, or if it should be a MIME
attachment. Can you (or someone) tell me, and I'll provide what is asked
for. Aw, heck... I'll just put it at the end...

> It also would help a lot to see your dmesg log.  Especially if you would
> build your kernel with CONFIG_USB_DEBUG turned on.

I saw nothing unusual in the dmesg log, and I'm fairly familiar with
reading it. I'm working with a stock FC6 kernel right now, but I will
eventually roll my own with that flag on if we get to where we need
that...

I'm hoping some wise heads here will be able to make something out of some
more logs and stack traces before I go that far...

>
>> I begin to suspect that I have a "dead spot" in my USB hub. But what
>> gets
>> me if that is true is why does the write work? Do cp and dd not check to
>> see if writes succeed?
>
> Depends what you mean.  They do check the return codes from the underlying
> device drivers, but they don't try to read the data back to make sure it
> really was written.

Yeah, I meant return codes from the system calls.

>
>> I know it isn't a particular flash drive because I've used two different
>> sets of 7 USB drives and it seems to fail consistently no matter which.
>
> But you haven't tried using different hubs, different USB cables, or
> different computers.

I just got back from buying a different hub and different cables to do
just that. And I do have a different computer to try, but will try that
last.

>
>> Nonetheless, I'm beginning to think I'm dealing with a hardware issue,
>> not
>> a kernel issue, just because it is so consistent.
>
> People have reported problems in which the hardware fails when it
> encounters a certain pattern of bytes in the data stream.  Maybe you're
> seeing the same sort of thing.

This theory is (somewhat) blown by the fact that I have tried different
data sources (from vobcopied DVD images to dd'ed images of Linux distros).
I suppose I could try just a large file of zeros...
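
A sketch of how such pattern files might be generated, for zeros and for other fixed bytes. The helper name make_pattern_file, the output paths, and the sizes in the usage lines are all illustrative assumptions.

```shell
# make_pattern_file PATH SIZE_KIB OCTAL_BYTE
# Hypothetical helper: fills PATH with SIZE_KIB KiB of one repeated byte,
# given as three octal digits (000 = zeros, 377 = 0xFF, 125 = 0x55).
make_pattern_file() {
    dd if=/dev/zero bs=1024 count="$2" 2>/dev/null | tr '\000' "\\$3" > "$1"
}

# Typical use (1 GiB each; paths are illustrative):
#   make_pattern_file /tmp/zeros.img 1048576 000
#   make_pattern_file /tmp/ones.img  1048576 377
```
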

Anyways, while the list readers ponder this message, I'm going to try the
new hub and cables. Thanks Alan and Neil!

Nasty big stack trace set follows:

Mar 17 16:13:08 localhost kernel:  =======================
Mar 17 16:13:08 localhost kernel: scsi_eh_7     S 00000000  3744  4385    
 7          4386  2257 (L-TLB)
Mar 17 16:13:08 localhost kernel:        ebe2af68 00000046 00000000
00000000 c1c8c94c f7da30f0 c0420c7c c1c8c480
Mar 17 16:13:08 localhost kernel:        f7da30f0 ebe2af28 00000008
e7676d30 f7da25f0 a0caa580 00000032 00000000
Mar 17 16:13:08 localhost kernel:        e7676e3c c1c8c480 00000000
f7f53880 f7da30f0 c1c8c94c f7d98f2c f7f53880
Mar 17 16:13:08 localhost kernel: Call Trace:
Mar 17 16:13:08 localhost kernel:  [<c0420c7c>] enqueue_task+0x29/0x39
Mar 17 16:13:08 localhost kernel:  [<c061fc0d>] _spin_unlock_irq+0x5/0x7
Mar 17 16:13:08 localhost kernel:  [<c061e5c1>]
__sched_text_start+0x999/0xa21
Mar 17 16:13:08 localhost kernel:  [<f88e5415>]
scsi_error_handler+0x0/0x9b6 [scsi_mod]
Mar 17 16:13:08 localhost kernel:  [<f88e5474>]
scsi_error_handler+0x5f/0x9b6 [scsi_mod]
Mar 17 16:13:08 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:08 localhost kernel:  [<f88e5415>]
scsi_error_handler+0x0/0x9b6 [scsi_mod]
Mar 17 16:13:08 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:08 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:08 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10
Mar 17 16:13:08 localhost kernel:  =======================
Mar 17 16:13:08 localhost kernel: usb-storage   S F7E08840  3048  4386    
 7          4410  4385 (L-TLB)
Mar 17 16:13:08 localhost kernel:        f3272f5c 00000046 f3272eec
f7e08840 e764f000 f3efa000 f88cbd5f c1c8c4d4
Mar 17 16:13:08 localhost kernel:        f7da0b30 c0420c7c 0000000a
f7da25f0 f7da0b30 6ce38240 0000096b 00000000
Mar 17 16:13:08 localhost kernel:        f7da26fc c1c8c480 00000000
ec92ec40 00000000 0000000f 00000000 00000000
Mar 17 16:13:08 localhost kernel: Call Trace:
Mar 17 16:13:08 localhost kernel:  [<f88cbd5f>]
usb_stor_bulk_transfer_buf+0x61/0x98 [usb_storage]
Mar 17 16:13:08 localhost kernel:  [<c0420c7c>] enqueue_task+0x29/0x39
Mar 17 16:13:08 localhost kernel:  [<c061fa0d>]
__down_interruptible+0xab/0xf0
Mar 17 16:13:08 localhost kernel:  [<c042f220>] del_timer+0x41/0x47
Mar 17 16:13:08 localhost kernel:  [<c04226ab>] default_wake_function+0x0/0xc
Mar 17 16:13:08 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:08 localhost kernel:  [<c061f8ef>]
__down_failed_interruptible+0x7/0xc
Mar 17 16:13:08 localhost kernel:  [<f88cd067>]
usb_stor_control_thread+0x45/0x1a3 [usb_storage]
Mar 17 16:13:08 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:08 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:08 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:08 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:08 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10
Mar 17 16:13:08 localhost kernel:  =======================
Mar 17 16:13:08 localhost kernel: scsi_eh_8     S 00000000  3744  4410    
 7          4411  4386 (L-TLB)
Mar 17 16:13:08 localhost kernel:        f1ad2f68 00000046 00000000
00000000 c1c8c4d4 f7da30f0 c0420c7c c1c8c480
Mar 17 16:13:08 localhost kernel:        f7da30f0 f1ad2f28 00000009
f3f150f0 f3f145f0 b0d342c0 00000032 00000000
Mar 17 16:13:08 localhost kernel:        f3f151fc c1c8c480 00000000
f7f53880 f7da30f0 c1c8c4d4 f7d98f2c f7f53880
Mar 17 16:13:08 localhost kernel: Call Trace:
Mar 17 16:13:08 localhost kernel:  [<c0420c7c>] enqueue_task+0x29/0x39
Mar 17 16:13:08 localhost kernel:  [<c061fc0d>] _spin_unlock_irq+0x5/0x7
Mar 17 16:13:09 localhost kernel:  [<c061e5c1>]
__sched_text_start+0x999/0xa21
Mar 17 16:13:09 localhost kernel:  [<f88e5415>]
scsi_error_handler+0x0/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<f88e5474>]
scsi_error_handler+0x5f/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:09 localhost kernel:  [<f88e5415>]
scsi_error_handler+0x0/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10
Mar 17 16:13:09 localhost kernel:  =======================
Mar 17 16:13:09 localhost kernel: usb-storage   S F3F03B40  3068  4411    
 7          4436  4410 (L-TLB)
Mar 17 16:13:09 localhost kernel:        e3339f5c 00000046 e3339eec
f3f03b40 e9355400 f3efa080 f88cbd5f c1c8c4d4
Mar 17 16:13:09 localhost kernel:        f7da0b30 c0420c7c 0000000a
f3f145f0 f7da0b30 7578c640 0000096b 00000000
Mar 17 16:13:09 localhost kernel:        f3f146fc c1c8c480 00000000
e9ddad40 00000000 0000000f 00000000 00000000
Mar 17 16:13:09 localhost kernel: Call Trace:
Mar 17 16:13:09 localhost kernel:  [<f88cbd5f>]
usb_stor_bulk_transfer_buf+0x61/0x98 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c0420c7c>] enqueue_task+0x29/0x39
Mar 17 16:13:09 localhost kernel:  [<c061fa0d>]
__down_interruptible+0xab/0xf0
Mar 17 16:13:09 localhost kernel:  [<c042f220>] del_timer+0x41/0x47
Mar 17 16:13:09 localhost kernel:  [<c04226ab>] default_wake_function+0x0/0xc
Mar 17 16:13:09 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c061f8ef>]
__down_failed_interruptible+0x7/0xc
Mar 17 16:13:09 localhost kernel:  [<f88cd067>]
usb_stor_control_thread+0x45/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:09 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10
Mar 17 16:13:09 localhost kernel:  =======================
Mar 17 16:13:09 localhost kernel: scsi_eh_9     S 00000000  3744  4436    
 7          4437  4411 (L-TLB)
Mar 17 16:13:09 localhost kernel:        e31b4f68 00000046 00000000
00000000 c1c8c4d4 f7da30f0 c0420c7c c1c8c480
Mar 17 16:13:09 localhost kernel:        f7da30f0 e31b4f28 00000009
f3e2c2b0 e2c70db0 c0cc9dc0 00000032 00000000
Mar 17 16:13:09 localhost kernel:        f3e2c3bc c1c8c480 00000000
f3ce14c0 f7da30f0 c1c8c4d4 f7d98f2c f3ce14c0
Mar 17 16:13:09 localhost kernel: Call Trace:
Mar 17 16:13:09 localhost kernel:  [<c0420c7c>] enqueue_task+0x29/0x39
Mar 17 16:13:09 localhost kernel:  [<c061fc0d>] _spin_unlock_irq+0x5/0x7
Mar 17 16:13:09 localhost kernel:  [<c061e5c1>]
__sched_text_start+0x999/0xa21
Mar 17 16:13:09 localhost kernel:  [<f88e5415>]
scsi_error_handler+0x0/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<f88e5474>]
scsi_error_handler+0x5f/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:09 localhost kernel:  [<f88e5415>]
scsi_error_handler+0x0/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10
Mar 17 16:13:09 localhost kernel:  =======================
Mar 17 16:13:09 localhost kernel: usb-storage   S F3D0B440  3048  4437    
 7          4451  4436 (L-TLB)
Mar 17 16:13:09 localhost kernel:        e2cc8f5c 00000046 e2cc8eec
f3d0b440 f3cc0400 f3efa100 f88cbd5f c1c8c4d4
Mar 17 16:13:09 localhost kernel:        f7da0b30 c0420c7c 0000000a
e2c70db0 f7da0b30 5cf96980 0000096b 00000000
Mar 17 16:13:09 localhost kernel:        e2c70ebc c1c8c480 00000000
e9316180 00000000 0000000f 00000000 00000000
Mar 17 16:13:09 localhost kernel: Call Trace:
Mar 17 16:13:09 localhost kernel:  [<f88cbd5f>]
usb_stor_bulk_transfer_buf+0x61/0x98 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c0420c7c>] enqueue_task+0x29/0x39
Mar 17 16:13:09 localhost kernel:  [<c061fa0d>]
__down_interruptible+0xab/0xf0
Mar 17 16:13:09 localhost kernel:  [<c042f220>] del_timer+0x41/0x47
Mar 17 16:13:09 localhost kernel:  [<c04226ab>] default_wake_function+0x0/0xc
Mar 17 16:13:09 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c061f8ef>]
__down_failed_interruptible+0x7/0xc
Mar 17 16:13:09 localhost kernel:  [<f88cd067>]
usb_stor_control_thread+0x45/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:09 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10
Mar 17 16:13:09 localhost kernel:  =======================
Mar 17 16:13:09 localhost kernel: scsi_eh_10    S 00000000  3744  4451    
 7          4452  4437 (L-TLB)
Mar 17 16:13:09 localhost kernel:        e3368f68 00000046 00000000
00000000 c1c8c94c f7da30f0 c0420c7c c1c8c480
Mar 17 16:13:09 localhost kernel:        f7da30f0 e3368f28 00000009
ea739870 ea738270 d0d53b00 00000032 00000000
Mar 17 16:13:09 localhost kernel:        ea73997c c1c8c480 00000000
f7f53880 f7d086f0 c1c8c94c f7d10c80 f7f53880
Mar 17 16:13:09 localhost kernel: Call Trace:
Mar 17 16:13:09 localhost kernel:  [<c0420c7c>] enqueue_task+0x29/0x39
Mar 17 16:13:09 localhost kernel:  [<c061fc0d>] _spin_unlock_irq+0x5/0x7
Mar 17 16:13:09 localhost kernel:  [<c061e5c1>]
__sched_text_start+0x999/0xa21
Mar 17 16:13:09 localhost kernel:  [<f88e5415>]
scsi_error_handler+0x0/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<f88e5474>]
scsi_error_handler+0x5f/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:09 localhost kernel:  [<f88e5415>]
scsi_error_handler+0x0/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10
Mar 17 16:13:09 localhost kernel:  =======================
Mar 17 16:13:09 localhost kernel: usb-storage   S F3F03CC0  3032  4452    
 7          4476  4451 (L-TLB)
Mar 17 16:13:09 localhost kernel:        e3352f5c 00000046 e3352eec
f3f03cc0 f3cbc000 f3efa180 f88cbd5f c1c8c4d4
Mar 17 16:13:09 localhost kernel:        f7da0b30 c0420c7c 0000000a
ea738270 f7da0b30 aab2df80 0000096b 00000000
Mar 17 16:13:09 localhost kernel:        ea73837c c1c8c480 00000000
f3fb8540 00000000 0000000f 00000000 00000000
Mar 17 16:13:09 localhost kernel: Call Trace:
Mar 17 16:13:09 localhost kernel:  [<f88cbd5f>]
usb_stor_bulk_transfer_buf+0x61/0x98 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c0420c7c>] enqueue_task+0x29/0x39
Mar 17 16:13:09 localhost kernel:  [<c061fa0d>]
__down_interruptible+0xab/0xf0
Mar 17 16:13:09 localhost kernel:  [<c042f220>] del_timer+0x41/0x47
Mar 17 16:13:09 localhost kernel:  [<c04226ab>] default_wake_function+0x0/0xc
Mar 17 16:13:09 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c061f8ef>]
__down_failed_interruptible+0x7/0xc
Mar 17 16:13:09 localhost kernel:  [<f88cd067>]
usb_stor_control_thread+0x45/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:09 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10
Mar 17 16:13:09 localhost kernel:  =======================
Mar 17 16:13:09 localhost kernel: scsi_eh_11    D 00000000  3688  4476    
 7          4477  4452 (L-TLB)
Mar 17 16:13:09 localhost kernel:        f32eff30 00000046 00000000
00000000 00000000 00000000 f3f16260 c1c89780
Mar 17 16:13:09 localhost kernel:        00000000 00000000 0000000a
e32a2e30 f3f160b0 31528500 00000254 00000000
Mar 17 16:13:09 localhost kernel:        e32a2f3c c1c8c480 00000000
e9316b80 00000096 f7cb1400 f7e08640 f7e08640
Mar 17 16:13:09 localhost kernel: Call Trace:
Mar 17 16:13:09 localhost kernel:  [<c0586b4f>] unlink1+0x74/0x86
Mar 17 16:13:09 localhost kernel:  [<c061e701>] wait_for_completion+0x73/0x98
Mar 17 16:13:09 localhost kernel:  [<c04226ab>] default_wake_function+0x0/0xc
Mar 17 16:13:09 localhost kernel:  [<f88cb53a>] command_abort+0x64/0x6d
[usb_storage]
Mar 17 16:13:09 localhost kernel:  [<f88e57a3>]
scsi_error_handler+0x38e/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:09 localhost kernel:  [<f88e5415>]
scsi_error_handler+0x0/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10
Mar 17 16:13:09 localhost kernel:  =======================
Mar 17 16:13:09 localhost kernel: usb-storage   S 00000010  3048  4477    
 7          4500  4476 (L-TLB)
Mar 17 16:13:09 localhost kernel:        e31dee78 00000046 f88459c0
00000010 f7e0865c f3da8364 c0587c0e 00000010
Mar 17 16:13:09 localhost kernel:        00000000 f7da25f0 0000000a
edb68630 f7da25f0 390b2d00 00000246 00000000
Mar 17 16:13:09 localhost kernel:        edb6873c c1c8c480 00000000
ec92ec40 00219434 c04062cf 00000073 ffffffff
Mar 17 16:13:09 localhost kernel: Call Trace:
Mar 17 16:13:09 localhost kernel:  [<c0587c0e>]
usb_hcd_submit_urb+0x6cd/0x773
Mar 17 16:13:09 localhost kernel:  [<c04062cf>] do_IRQ+0xc6/0xdb
Mar 17 16:13:09 localhost kernel:  [<c061ecc2>] schedule_timeout+0x13/0x8d
Mar 17 16:13:09 localhost kernel:  [<c061e925>]
wait_for_completion_interruptible_timeout+0x99/0xd5
Mar 17 16:13:09 localhost kernel:  [<c04226ab>] default_wake_function+0x0/0xc
Mar 17 16:13:09 localhost kernel:  [<f88cb90c>]
usb_stor_msg_common+0xc9/0xe8 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<f88cbd5f>]
usb_stor_bulk_transfer_buf+0x61/0x98 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<f88cc2a9>]
usb_stor_Bulk_transport+0xcb/0x221 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<f88cc414>]
usb_stor_invoke_transport+0x15/0x259 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c061fa40>]
__down_interruptible+0xde/0xf0
Mar 17 16:13:09 localhost kernel:  [<c04226ab>] default_wake_function+0x0/0xc
Mar 17 16:13:09 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<f88cd14a>]
usb_stor_control_thread+0x128/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:09 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10
Mar 17 16:13:09 localhost kernel:  =======================
Mar 17 16:13:09 localhost kernel: scsi_eh_12    S 00000000  3744  4500    
 7          4501  4477 (L-TLB)
Mar 17 16:13:09 localhost kernel:        e2d43f68 00000046 00000000
00000000 c1c8c4d4 f7da30f0 c0420c7c c1c8c480
Mar 17 16:13:09 localhost kernel:        f7da30f0 e2d43f28 00000009
e2d4d370 e2c71330 f0c7f100 00000032 00000000
Mar 17 16:13:09 localhost kernel:        e2d4d47c c1c8c480 00000000
f7f53880 f7da30f0 c1c8c4d4 f7d98f2c f7f53880
Mar 17 16:13:09 localhost kernel: Call Trace:
Mar 17 16:13:09 localhost kernel:  [<c0420c7c>] enqueue_task+0x29/0x39
Mar 17 16:13:09 localhost kernel:  [<c061fc0d>] _spin_unlock_irq+0x5/0x7
Mar 17 16:13:09 localhost kernel:  [<c061e5c1>]
__sched_text_start+0x999/0xa21
Mar 17 16:13:09 localhost kernel:  [<f88e5415>]
scsi_error_handler+0x0/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<f88e5474>]
scsi_error_handler+0x5f/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:09 localhost kernel:  [<f88e5415>]
scsi_error_handler+0x0/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10
Mar 17 16:13:09 localhost kernel:  =======================
Mar 17 16:13:09 localhost kernel: usb-storage   S F3D0B740  3048  4501    
 7          4518  4500 (L-TLB)
Mar 17 16:13:09 localhost kernel:        f1315f5c 00000046 f1315eec
f3d0b740 e764f400 f3efa280 f88cbd5f c1c8c4d4
Mar 17 16:13:09 localhost kernel:        f7da0b30 c0420c7c 0000000a
e2c71330 f7da0b30 a4522ec0 0000096b 00000000
Mar 17 16:13:09 localhost kernel:        e2c7143c c1c8c480 00000000
e9ddab40 00000000 0000000f 00000000 00000000
Mar 17 16:13:09 localhost kernel: Call Trace:
Mar 17 16:13:09 localhost kernel:  [<f88cbd5f>]
usb_stor_bulk_transfer_buf+0x61/0x98 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c0420c7c>] enqueue_task+0x29/0x39
Mar 17 16:13:09 localhost kernel:  [<c061fa0d>]
__down_interruptible+0xab/0xf0
Mar 17 16:13:09 localhost kernel:  [<c042f220>] del_timer+0x41/0x47
Mar 17 16:13:09 localhost kernel:  [<c04226ab>] default_wake_function+0x0/0xc
Mar 17 16:13:09 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c061f8ef>]
__down_failed_interruptible+0x7/0xc
Mar 17 16:13:09 localhost kernel:  [<f88cd067>]
usb_stor_control_thread+0x45/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:09 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10
Mar 17 16:13:09 localhost kernel:  =======================
Mar 17 16:13:09 localhost kernel: scsi_eh_13    S 00000000  3744  4518    
 7          4519  4501 (L-TLB)
Mar 17 16:13:09 localhost kernel:        e2d2af68 00000046 00000000
00000000 c1c8c94c f7da30f0 c0420c7c c1c8c480
Mar 17 16:13:09 localhost kernel:        f7da30f0 e2d2af28 00000009
e2d4d8f0 e2c718b0 00c14c00 00000033 00000000
Mar 17 16:13:09 localhost kernel:        e2d4d9fc c1c8c480 00000000
f7f53880 f7d086f0 c1c8c94c f7d10c80 f7f53880
Mar 17 16:13:09 localhost kernel: Call Trace:
Mar 17 16:13:09 localhost kernel:  [<c0420c7c>] enqueue_task+0x29/0x39
Mar 17 16:13:09 localhost kernel:  [<c061fc0d>] _spin_unlock_irq+0x5/0x7
Mar 17 16:13:09 localhost kernel:  [<c061e5c1>]
__sched_text_start+0x999/0xa21
Mar 17 16:13:09 localhost kernel:  [<f88e5415>]
scsi_error_handler+0x0/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<f88e5474>]
scsi_error_handler+0x5f/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:09 localhost kernel:  [<f88e5415>]
scsi_error_handler+0x0/0x9b6 [scsi_mod]
Mar 17 16:13:09 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10
Mar 17 16:13:09 localhost kernel:  =======================
Mar 17 16:13:09 localhost kernel: usb-storage   S F7E088C0  3068  4519    
 7          4976  4518 (L-TLB)
Mar 17 16:13:09 localhost kernel:        e2c3ff5c 00000046 e2c3feec
f7e088c0 f3cc1400 f3efa300 f88cbd5f c1c8c4d4
Mar 17 16:13:09 localhost kernel:        f7da0b30 c0420c7c 0000000a
e2c718b0 f7da0b30 9f5071c0 0000096b 00000000
Mar 17 16:13:09 localhost kernel:        e2c719bc c1c8c480 00000000
e2c445c0 00000000 0000000f 00000000 00000000
Mar 17 16:13:09 localhost kernel: Call Trace:
Mar 17 16:13:09 localhost kernel:  [<f88cbd5f>]
usb_stor_bulk_transfer_buf+0x61/0x98 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c0420c7c>] enqueue_task+0x29/0x39
Mar 17 16:13:09 localhost kernel:  [<c061fa0d>]
__down_interruptible+0xab/0xf0
Mar 17 16:13:09 localhost kernel:  [<c042f220>] del_timer+0x41/0x47
Mar 17 16:13:09 localhost kernel:  [<c04226ab>] default_wake_function+0x0/0xc
Mar 17 16:13:09 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c061f8ef>]
__down_failed_interruptible+0x7/0xc
Mar 17 16:13:09 localhost kernel:  [<f88cd067>]
usb_stor_control_thread+0x45/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c0420a03>] complete+0x39/0x48
Mar 17 16:13:09 localhost kernel:  [<f88cd022>]
usb_stor_control_thread+0x0/0x1a3 [usb_storage]
Mar 17 16:13:09 localhost kernel:  [<c043779f>] kthread+0xb0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c04376ef>] kthread+0x0/0xd9
Mar 17 16:13:09 localhost kernel:  [<c0404b33>] kernel_thread_helper+0x7/0x10


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Linux-usb-users@lists.sourceforge.net
To unsubscribe, use the last form field at:
https://lists.sourceforge.net/lists/listinfo/linux-usb-users

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linux-usb-users] Failed reads from RAID-0 array (from newbie who has read the FAQ)
  2007-03-17 21:35       ` Michael Schwarz
@ 2007-03-18  2:06         ` Alan Stern
  2007-03-18  2:12         ` Alan Stern
  1 sibling, 0 replies; 23+ messages in thread
From: Alan Stern @ 2007-03-18  2:06 UTC (permalink / raw)
  To: Michael Schwarz; +Cc: Neil Brown, linux-raid, linux-usb-users

On Sat, 17 Mar 2007, Michael Schwarz wrote:

> Comments/questions below...
> 
> -- 
> Michael Schwarz
>
> > This isn't much help.  The important processes here are khubd,
> > usb-storage, and scsi_eh_*.  Possibly some raid-related processes too, but
> > I don't know which they would be.
> 
> I have no copy of khubd running.

That in itself is a very bad sign.  You need to look at the dmesg log.

Alan Stern


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linux-usb-users] Failed reads from RAID-0 array (from newbie who has read the FAQ)
  2007-03-17 21:35       ` Michael Schwarz
  2007-03-18  2:06         ` [Linux-usb-users] " Alan Stern
@ 2007-03-18  2:12         ` Alan Stern
  2007-03-18  4:42           ` Michael Schwarz
  1 sibling, 1 reply; 23+ messages in thread
From: Alan Stern @ 2007-03-18  2:12 UTC (permalink / raw)
  To: Michael Schwarz; +Cc: Neil Brown, linux-raid, linux-usb-users

On Sat, 17 Mar 2007, Michael Schwarz wrote:

> Nasty big stack trace set follows:

This format is kind of awkward.  For one thing, a lot of lines were 
wrapped by your email program.

For another, you copied the stack trace from the syslog log file.  That is 
not a good way to do it; syslogd is liable to miss bits and pieces of 
the kernel log when a lot of information comes along all at once.  You're 
much better off getting the stack trace data directly from dmesg.  (And 
when you do, you don't end up with 30 columns of wasted data added to the 
beginning of each line.)

Alan Stern
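
For a trace that has already been saved from the syslog file, the per-line prefix described above can be stripped after the fact. This is only a sketch: strip_syslog_prefix is a hypothetical helper name, the pattern assumes the "Mon DD HH:MM:SS host kernel: " layout seen in the posted trace, and it cannot restore anything syslogd dropped or rejoin lines the mail program wrapped.

```shell
# strip_syslog_prefix: read syslog lines on stdin and remove the leading
# "Mar 17 16:13:09 localhost kernel: " style prefix, leaving what dmesg
# would have printed. Lines without the prefix pass through unchanged.
strip_syslog_prefix() {
    sed -E 's/^[A-Z][a-z]{2} [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2} [^ ]+ kernel: //'
}

# Example on one line in the format of the posted trace.
printf '%s\n' 'Mar 17 16:13:09 localhost kernel: Call Trace:' | strip_syslog_prefix
```

Capturing straight from dmesg, as suggested above, remains the more reliable route; the filter only tidies logs that already exist.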

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Failed reads from RAID-0 array (from newbie who has read the FAQ)
  2007-03-18  2:12         ` Alan Stern
@ 2007-03-18  4:42           ` Michael Schwarz
  2007-03-18 16:56             ` [Linux-usb-users] " Michael Schwarz
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Schwarz @ 2007-03-18  4:42 UTC (permalink / raw)
  To: Alan Stern; +Cc: Neil Brown, linux-raid, linux-usb-users

Yeah, I understand that.

Sorry, I use squirrelmail. Pretty limited...

I'll get you a "raw" dmesg output when I replicate the problem.

Let me clarify on khubd: There is such an entry in my process table, but
there was no kernel thread stack trace for it when I dumped the traces. I
don't know if that is a bad sign...

Right now I thought it would be best to verify my hardware, so I'm working
with the new hubs and cables, writing a large file to each of seven
attached (non-md) flash drives and diff-ing the usb drive contents against
the original file. If I have dead cables, connectors, or flash drives that
would save you all a lot of hassle.

When I'm done with that, I'll again replicate my problem, grab the logs
straight from dmesg, and post another entry here. I'll even fire up kmail
or mutt to avoid bad formatting.

Thanks again.

-- 
Michael Schwarz

> On Sat, 17 Mar 2007, Michael Schwarz wrote:
>
>> Nasty big stack trace set follows:
>
> This format is kind of awkward.  For one thing, a lot of lines were
> wrapped by your email program.
>
> For another, you copied the stack trace from the syslog log file.  That is
> not a good way to do it; syslogd is liable to miss bits and pieces of
> the kernel log when a lot of information comes along all at once.  You're
> much better off getting the stack trace data directly from dmesg.  (And
> when you do, you don't end up with 30 columns of wasted data added to the
> beginning of each line.)
>
> Alan Stern
>
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linux-usb-users] Failed reads from RAID-0 array (from newbie   who has read the FAQ)
  2007-03-18  4:42           ` Michael Schwarz
@ 2007-03-18 16:56             ` Michael Schwarz
  2007-03-18 17:44               ` Michael Schwarz
                                 ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Michael Schwarz @ 2007-03-18 16:56 UTC (permalink / raw)
  To: mschwarz; +Cc: Alan Stern, Neil Brown, linux-raid, linux-usb-users

[-- Attachment #1: Type: text/plain, Size: 1591 bytes --]

Okay. I've verified my hardware (by doing large write/reads to non-raid
file systems on each of the seven USB flash drives on the hub).

So this morning I booted cold and began gathering log data. I'm sending it
to you guys (you should know this) before looking at it myself.  Here's
the sequence:

Cold boot
Gnome login
echo t > /proc/sysrq-trigger
dmesg > dmesg-0-beforehub.log
Attached hub with 7 drives
dmesg > dmesg-1-afterhub.log
mdadm --create /dev/md0 --auto=md --level=0 --raid-devices=7 /dev/sd?
dmesg > dmesg-2-aftermdcreate.log
mke2fs -b 4096 -R stride=16 /dev/md0
dmesg > dmesg-3-aftermkfs.log
mount /dev/md0 /mnt
cp -rv ~mschwarz/FUTURAMA_S2D2/* /mnt
dmesg > dmesg-4-afterbigwrite.log
cp -rv /mnt/* fs2d2/

At this point, the process hangs. So I ran:

echo t > /proc/sysrq-trigger
dmesg > dmesg-5-hungread.log

...in a different root window. All these operations were performed as root
(in order to be as dangerous as possible -- actually, in order to reduce
possible permissions issues; although I don't think there are any)
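
The stride=16 given to mke2fs in the sequence above is the RAID chunk size counted in filesystem blocks. A minimal sketch of that arithmetic follows, assuming a 64 KiB chunk: the mdadm line above passes no --chunk, so that figure is inferred from stride=16 with -b 4096 rather than stated anywhere in this thread. stride_blocks is a hypothetical helper name.

```shell
# stride_blocks CHUNK_KIB BLOCK_BYTES
# Prints how many filesystem blocks fit in one RAID chunk.
stride_blocks() {
    chunk_kib=$1
    block_bytes=$2
    echo $(( chunk_kib * 1024 / block_bytes ))
}

# Assumed 64 KiB chunk with the 4096-byte blocks from the mke2fs line.
stride_blocks 64 4096
# The --chunk=32 value from the first message in the thread, same block size.
stride_blocks 32 4096
```

With these inputs the helper prints 16 and then 8, so an array created with --chunk=32 would call for stride=8 rather than 16.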

All of these dmesg logs are attached in gzip format. I don't know what
majordomo will do with those, but the cc's going directly to Alan and Neil
should come through. I'm going to start combing these files myself, so if
you guys want to save time, you can certainly give me a couple hours to
get started! ;-)

As always, I'm very grateful for your assistance and that of the group!

-- 
Michael Schwarz

> Yeah, I understand that.
>
> Sorry, I use squirrelmail. Pretty limited...
>
> I'll get you a "raw" dmesg output when I replicate the problem.
>

[-- Attachment #2: dmesg-0-beforehub.log.gz --]
[-- Type: application/x-gzip, Size: 8407 bytes --]

[-- Attachment #3: dmesg-1-afterhub.log.gz --]
[-- Type: application/x-gzip, Size: 9085 bytes --]

[-- Attachment #4: dmesg-2-aftermdcreate.log.gz --]
[-- Type: application/x-gzip, Size: 9340 bytes --]

[-- Attachment #5: dmesg-3-aftermkfs.log.gz --]
[-- Type: application/x-gzip, Size: 9336 bytes --]

[-- Attachment #6: dmesg-4-afterbigwrite.log.gz --]
[-- Type: application/x-gzip, Size: 9350 bytes --]

[-- Attachment #7: dmesg-5-hungread.log.gz --]
[-- Type: application/x-gzip, Size: 16862 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Failed reads from RAID-0 array; still no joy in Mudville.
  2007-03-17 19:21       ` Michael Schwarz
@ 2007-03-18 17:22         ` Bill Davidsen
  2007-03-18 17:39           ` Michael Schwarz
  0 siblings, 1 reply; 23+ messages in thread
From: Bill Davidsen @ 2007-03-18 17:22 UTC (permalink / raw)
  To: mschwarz; +Cc: linux-raid, linux-usb-users

Michael Schwarz wrote:
> Update:
>
> (For those who've been waiting breathlessly). It hangs at a particular
> point in a particular file. In other words, it doesn't depend on the total
> number of bytes transferred. Rather, when it reaches a particular point in
> a particular file (12267520 bytes into a file that is 1073709056 bytes
> long) it hangs.
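
The byte offset quoted above can be converted into a dd skip count, so that a read starts just short of the reported hang point instead of at the beginning of the file. This is a sketch under assumptions: block_index is a hypothetical helper, the file path is a placeholder, and the 4096-byte read size and the 16-block lead-in are arbitrary choices. Data still held in the page cache may be served without touching the drives at all.

```shell
# block_index OFFSET_BYTES BLOCK_BYTES
# Prints the index of the block that contains the given byte offset.
block_index() {
    offset_bytes=$1
    block_bytes=$2
    echo $(( offset_bytes / block_bytes ))
}

FILE=/mnt/example.mpg          # hypothetical path; use the file that hangs
BS=4096                        # assumed read size in bytes
HANG_BLOCK=$(block_index 12267520 "$BS")
START=$(( HANG_BLOCK - 16 ))   # begin 16 blocks before the reported offset

# Print the command instead of running it, so nothing hangs by accident.
echo "dd if=$FILE of=/dev/null bs=$BS skip=$START count=64"
```

12267520 is an exact multiple of 4096 (block 2995), so the printed command starts at block 2979 and reads 64 blocks across the reported offset.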
>   

I have an odd thought: have you tried copying that same file to 
/dev/null or similar? The reason I ask is that if it were by any chance 
a sparse file, while the program is reading all those unwritten bytes 
odd things may happen. Sorry, I haven't seen this in years, but I do 
remember seeing a filesystem on the destination end running out of space 
because all those unwritten pages were now being "really written" as zeros.

Use of cp with the --sparse= flag may change the behavior if this is the 
case.
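
One way to check for sparseness before experimenting with cp flags is to compare a file's apparent length with the space actually allocated to it. The sketch below assumes GNU stat (its %s, %b and %B format sequences); looks_sparse and check_file are hypothetical helper names, and filesystems that compress or inline data can make the comparison misleading.

```shell
# looks_sparse SIZE_BYTES ALLOCATED_BLOCKS BLOCK_UNIT_BYTES
# Succeeds (exit 0) when fewer bytes are allocated than the file's length.
looks_sparse() {
    size_bytes=$1
    allocated_bytes=$(( $2 * $3 ))
    [ "$allocated_bytes" -lt "$size_bytes" ]
}

# check_file PATH
# %s = apparent size, %b = allocated blocks, %B = bytes per %b block.
check_file() {
    set -- $(stat -c '%s %b %B' "$1")
    if looks_sparse "$1" "$2" "$3"; then
        echo "sparse"
    else
        echo "not sparse"
    fi
}
```

Running check_file on the file that hangs would print "sparse" or "not sparse"; a file whose allocation matches its length rules this cause out.
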
> I begin to suspect that I have a "dead spot" in my USB hub. But what gets
> me if that is true is why does the write work? Do cp and dd not check to
> see if writes succeed?
>
> I know it isn't a particular flash drive because I've used two different
> sets of 7 USB drives and it seems to fail consistently no matter which.
>
> Nonetheless, I'm beginning to think I'm dealing with a hardware issue, not
> a kernel issue, just because it is so consistent.
>
> Thanks again for all the help.
>
>
>   


-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Failed reads from RAID-0 array; still no joy in Mudville.
  2007-03-18 17:22         ` Bill Davidsen
@ 2007-03-18 17:39           ` Michael Schwarz
  2007-03-18 18:21             ` Bill Davidsen
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Schwarz @ 2007-03-18 17:39 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-raid, linux-usb-users

I've tried both single and multiple files. The files are not sparse. They
are highly compressed files (mpeg files) that would, to the filesystem, be
nearly random with no repeated patterns or voids.

-- 
Michael Schwarz

> Michael Schwarz wrote:
>> Update:
>>
>> (For those who've been waiting breathlessly). It hangs at a particular
>> point in a particular file. In other words, it doesn't depend on the
>> total
>> number of bytes transferred. Rather, when it reaches a particular point
>> in
>> a particular file (12267520 bytes into a file that is 1073709056 bytes
>> long) it hangs.
>>
>
> I have an odd thought: have you tried copying that same file to
> /dev/null or similar? The reason I ask is that if it were by any chance
> a sparse file, while the program is reading all those unwritten bytes
> odd things may happen. Sorry, I haven't seen this in years, but I do
> remember seeing a filesystem on the destination end running out of space
> because all those unwritten pages were now being "really written" as
> zeros.
>
> Use of cp with the --sparse= flag may change the behavior if this is the
> case.
>> I begin to suspect that I have a "dead spot" in my USB hub. But what
>> gets
>> me if that is true is why does the write work? Do cp and dd not check to
>> see if writes succeed?
>>
>> I know it isn't a particular flash drive because I've used two different
>> sets of 7 USB drives and it seems to fail consistently no matter which.
>>
>> Nonetheless, I'm beginning to think I'm dealing with a hardware issue,
>> not
>> a kernel issue, just because it is so consistent.
>>
>> Thanks again for all the help.
>>
>>
>>
>
>
> --
> bill davidsen <davidsen@tmr.com>
>   CTO TMR Associates, Inc
>   Doing interesting things with small computers since 1979
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linux-usb-users] Failed reads from RAID-0 array (from newbie   who has read the FAQ)
  2007-03-18 16:56             ` [Linux-usb-users] " Michael Schwarz
@ 2007-03-18 17:44               ` Michael Schwarz
  2007-03-18 21:55               ` Michael Schwarz
  2007-03-18 21:57               ` Neil Brown
  2 siblings, 0 replies; 23+ messages in thread
From: Michael Schwarz @ 2007-03-18 17:44 UTC (permalink / raw)
  To: mschwarz; +Cc: Alan Stern, Neil Brown, linux-raid, linux-usb-users

As I suspected, majordomo doesn't like attachments.

I looked through the logs. The only odd thing I see before the read that
hangs is this message:

smartd[3069]: Device: /dev/hda, 1 Currently unreadable (pending) sectors

Which I only see in /var/log/messages because the stack dump blows
whatever buffer size is reserved for dmesg (the whole stack trace doesn't
make it in).

I'm going to try a different computer running a different OS next.

Alan, Neil, I wasn't able to make anything of those logs. I've also
grabbed /var/log/messages to get the gap between dmesg-4-* and dmesg-5-*.
I'll send that to you two in a separate message.

If anyone else would like my logs, let me know.


-- 
Michael Schwarz

> Okay. I've verified my hardware (by doing large write/reads to non-raid
> file systems on each of the seven USB flash drives on the hub).
>
> So this morning I booted cold and began gathering log data. I'm sending it
> to you guys (you should know this) before looking at it myself.  Here's
> the sequence:
>
> Cold boot
> Gnome login
> echo t > /proc/sysrq-trigger
> dmesg > dmesg-0-beforehub.log
> Attached hub with 7 drives
> dmesg > dmesg-1-afterhub.log
> mdadm --create /dev/md0 --auto=md --level=0 --raid-devices=7 /dev/sd?
> dmesg > dmesg-2-aftermdcreate.log
> mke2fs -b 4096 -R stride=16 /dev/md0
> dmesg > dmesg-3-aftermkfs.log
> mount /dev/md0 /mnt
> cp -rv ~mschwarz/FUTURAMA_S2D2/* /mnt
> dmesg > dmesg-4-afterbigwrite.log
> cp -rv /mnt/* fs2d2/
>
> At this point, the process hangs. So I ran:
>
> echo t > /proc/sysrq-trigger
> dmesg > dmesg-5-hungread.log
>
> ...in a different root window. All these operations were performed as root
> (in order to be as dangerous as possible -- actually, in order to reduce
> possible permissions issues; although I don't think there are any)
>
> All of these dmesg logs are attached in gzip format. I don't know what
> majordomo will do with those, but the cc's going directly to Alan and Neil
> should come through. I'm going to start combing these files myself, so if
> you guys want to save time, you can certainly give me a couple hours to
> get started! ;-)
>
> As always, I'm very grateful for your assistance and that of the group!
>
> --
> Michael Schwarz
>
>> Yeah, I understand that.
>>
>> Sorry, I use squirrelmail. Pretty limited...
>>
>> I'll get you a "raw" dmesg output when I replicate the problem.
>>
>
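
A note on the "-R stride=16" argument in the sequence quoted above: stride is
the RAID chunk size divided by the filesystem block size. Below is a minimal
sketch of that arithmetic. It assumes the array used a 64 KiB chunk (no
--chunk option was passed to mdadm in this run, so the chunk size is an
assumption), and "ext2_stride" is a made-up helper name.

```shell
# stride = RAID chunk size / filesystem block size (both in bytes).
# ext2_stride is a hypothetical helper, not part of e2fsprogs.
ext2_stride() {
    echo $(( $1 / $2 ))
}

# Assumed 64 KiB chunk with the 4096-byte blocks given to mke2fs -b.
ext2_stride $((64 * 1024)) 4096   # prints 16
```

If the array had been created with a different chunk size, the stride would
need to change to match.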


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Failed reads from RAID-0 array; still no joy in Mudville.
  2007-03-18 17:39           ` Michael Schwarz
@ 2007-03-18 18:21             ` Bill Davidsen
  0 siblings, 0 replies; 23+ messages in thread
From: Bill Davidsen @ 2007-03-18 18:21 UTC (permalink / raw)
  To: mschwarz; +Cc: linux-raid, linux-usb-users

Michael Schwarz wrote:
> I've tried both single and multiple files. The files are not sparse. They
> are highly compressed files (mpeg files) that would, to the filesystem, be
> nearly random with no repeated patterns or voids.
>
>   
Good, one possible cause eliminated.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linux-usb-users] Failed reads from RAID-0 array (from newbie   who has read the FAQ)
  2007-03-18 16:56             ` [Linux-usb-users] " Michael Schwarz
  2007-03-18 17:44               ` Michael Schwarz
@ 2007-03-18 21:55               ` Michael Schwarz
  2007-03-18 21:57               ` Neil Brown
  2 siblings, 0 replies; 23+ messages in thread
From: Michael Schwarz @ 2007-03-18 21:55 UTC (permalink / raw)
  To: mschwarz; +Cc: Alan Stern, Neil Brown, linux-raid, linux-usb-users

Just tried it on a stock Ubuntu Edgy install. Same thing. Locks on read.

I've got a dmesg (w/stack trace) file from the ubuntu attempt (it was
clean prior to doing the read) which I will send to Alan and Neil (and
anyone else who asks for it). There were no error messages in dmesg prior
to running the stack trace.

-- 
Michael Schwarz

> Okay. I've verified my hardware (by doing large write/reads to non-raid
> file systems on each of the seven USB flash drives on the hub).
>
> So this morning I booted cold and began gathering log data. I'm sending it
> to you guys (you should know this) before looking at it myself.  Here's
> the sequence:
>
> Cold boot
> Gnome login
> echo t > /proc/sysrq-trigger
> dmesg > dmesg-0-beforehub.log
> Attached hub with 7 drives
> dmesg > dmesg-1-afterhub.log
> mdadm --create /dev/md0 --auto=md --level=0 --raid-devices=7 /dev/sd?
> dmesg > dmesg-2-aftermdcreate.log
> mke2fs -b 4096 -R stride=16 /dev/md0
> dmesg > dmesg-3-aftermkfs.log
> mount /dev/md0 /mnt
> cp -rv ~mschwarz/FUTURAMA_S2D2/* /mnt
> dmesg > dmesg-4-afterbigwrite.log
> cp -rv /mnt/* fs2d2/
>
> At this point, the process hangs. So I ran:
>
> echo t > /proc/sysrq-trigger
> dmesg > dmesg-5-hungread.log
>
> ...in a different root window. All these operations were performed as root
> (in order to be as dangerous as possible -- actually, in order to reduce
> possible permissions issues; although I don't think there are any)
>
> All of these dmesg logs are attached in gzip format. I don't know what
> majordomo will do with those, but the cc's going directly to Alan and Neil
> should come through. I'm going to start combing these files myself, so if
> you guys want to save time, you can certainly give me a couple hours to
> get started! ;-)
>
> As always, I'm very grateful for your assistance and that of the group!
>
> --
> Michael Schwarz
>
>> Yeah, I understand that.
>>
>> Sorry, I use squirrelmail. Pretty limited...
>>
>> I'll get you a "raw" dmesg output when I replicate the problem.
>>
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linux-usb-users] Failed reads from RAID-0 array (from newbie   who has read the FAQ)
  2007-03-18 16:56             ` [Linux-usb-users] " Michael Schwarz
  2007-03-18 17:44               ` Michael Schwarz
  2007-03-18 21:55               ` Michael Schwarz
@ 2007-03-18 21:57               ` Neil Brown
  2007-03-19  3:27                 ` Michael Schwarz
  2 siblings, 1 reply; 23+ messages in thread
From: Neil Brown @ 2007-03-18 21:57 UTC (permalink / raw)
  To: mschwarz; +Cc: Alan Stern, linux-raid, linux-usb-users

On Sunday March 18, mschwarz@multitool.net wrote:
> cp -rv /mnt/* fs2d2/
> 
> At this point, the process hangs. So I ran:
> 
> echo t > /proc/sysrq-trigger
> dmesg > dmesg-5-hungread.log

Unfortunately (as you say), the whole trace doesn't fit.
Could you try compiling the kernel with a larger value for
CONFIG_LOG_BUF_SHIFT ??  It looks like you have 17.  21 is the max. 
19 should probably be sufficient.
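
For scale: the kernel log buffer is 2^CONFIG_LOG_BUF_SHIFT bytes, so the
values mentioned above work out as follows. This is only a sketch of the
arithmetic, and the helper name "log_buf_kib" is illustrative.

```shell
# Kernel log buffer size in KiB for a given CONFIG_LOG_BUF_SHIFT value.
# The buffer holds 2^shift bytes; log_buf_kib is a hypothetical helper.
log_buf_kib() {
    echo $(( (1 << $1) / 1024 ))
}

for s in 17 19 21; do
    echo "CONFIG_LOG_BUF_SHIFT=$s -> $(log_buf_kib "$s") KiB"
done
# CONFIG_LOG_BUF_SHIFT=17 -> 128 KiB
# CONFIG_LOG_BUF_SHIFT=19 -> 512 KiB
# CONFIG_LOG_BUF_SHIFT=21 -> 2048 KiB
```

If rebuilding is inconvenient, the log_buf_len= boot parameter may be a way
to enlarge the buffer without a rebuild. Whether the kernel in question
honours it is an assumption to check against its
Documentation/kernel-parameters.txt.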

Two things look a bit odd.
1/ hald-addon-st (process 3974) seems to be hung doing a
  'test_unit_ready' after a media-changed signal.  Any idea why?
  Could you try killing off hald while running the test?

2/ one usb-storage thread (3667) appears to be waiting for
  IO to complete (though that is just a guess really).

Maybe usb-storage is waiting for the hald test-unit-ready?

But I'm a bit out of my depth here, so I'll leave it to the USB
experts.

NeilBrown

 =======================
hald-addon-st D EF9FBD00  2812  3974   2935          3977  3966 (NOTLB)
       ef9fbd14 00000086 00000002 ef9fbd00 ef9fbcfc 00000000 00000000 ed4fcbe4 
       c04dc5cc 00000086 0000000a ed407770 c06fb480 18f88700 00000206 00000000 
       ed40787c c1c8c480 00000000 ebe7adc0 001d605d db30e9c8 00000096 ffffffff 
Call Trace:
 [<c04dc5cc>] elv_next_request+0xfe/0x1ac
 [<c061e701>] wait_for_completion+0x73/0x98
 [<c04226ab>] default_wake_function+0x0/0xc
 [<c04df415>] blk_execute_rq+0xcf/0xe5
 [<c04de74f>] blk_end_sync_rq+0x0/0x23
 [<c04dbdf0>] elv_set_request+0x14/0x22
 [<c04decda>] get_request+0x205/0x2b2
 [<c04df4e7>] get_request_wait+0x26/0x16c
 [<f8de1116>] scsi_execute+0xc6/0xd9 [scsi_mod]
 [<f8de11e0>] scsi_execute_req+0xb7/0xd5 [scsi_mod]
 [<f8de1241>] scsi_test_unit_ready+0x43/0x80 [scsi_mod]
 [<f8d726a5>] sd_media_changed+0x60/0xb5 [sd_mod]
 [<c04e8c82>] kobject_get+0xf/0x13
 [<c0491481>] check_disk_change+0x16/0x5c
 [<c055890a>] class_device_get+0xe/0x14
 [<f8d72b70>] sd_open+0x92/0x120 [sd_mod]
 [<c04e14cc>] exact_match+0x0/0x4
 [<c0491b65>] do_open+0x19f/0x255
 [<c0491d8e>] blkdev_open+0x0/0x4d
 [<c0491db3>] blkdev_open+0x25/0x4d
 [<c0470cac>] __dentry_open+0xc3/0x17a
 [<c0470ddd>] nameidata_to_filp+0x24/0x33
 [<c0470e1e>] do_filp_open+0x32/0x39
 [<c061f0e0>] do_nanosleep+0x42/0x66
 [<c0470bdf>] get_unused_fd+0xb3/0xbd
 [<c0470e67>] do_sys_open+0x42/0xbe
 [<c0470f1c>] sys_open+0x1c/0x1e
 [<c0403f64>] syscall_call+0x7/0xb
 =======================
usb-storage   S 00000010  3048  3667      7          3669  3666 (L-TLB)
       ebcaee78 00000046 f88459c0 00000010 ebc6b7dc f6de08e4 c0587c0e 00000010 
       00000000 c06fb480 0000000a ed5f2bb0 d80fa9b0 e8b0e880 00000205 00000000 
       ed5f2cbc c1c8c480 00000000 ebe7a9c0 001d5d31 00000205 00000000 ffffffff 
Call Trace:
 [<c0587c0e>] usb_hcd_submit_urb+0x6cd/0x773
 [<c061ecc2>] schedule_timeout+0x13/0x8d
 [<c061e925>] wait_for_completion_interruptible_timeout+0x99/0xd5
 [<c04226ab>] default_wake_function+0x0/0xc
 [<f8db090c>] usb_stor_msg_common+0xc9/0xe8 [usb_storage]
 [<f8db0d5f>] usb_stor_bulk_transfer_buf+0x61/0x98 [usb_storage]
 [<f8db12a9>] usb_stor_Bulk_transport+0xcb/0x221 [usb_storage]
 [<f8db2022>] usb_stor_control_thread+0x0/0x1a3 [usb_storage]
 [<f8db1414>] usb_stor_invoke_transport+0x15/0x259 [usb_storage]
 [<c061fa40>] __down_interruptible+0xde/0xf0
 [<c04226ab>] default_wake_function+0x0/0xc
 [<f8db2022>] usb_stor_control_thread+0x0/0x1a3 [usb_storage]
 [<f8db214a>] usb_stor_control_thread+0x128/0x1a3 [usb_storage]
 [<c0420a03>] complete+0x39/0x48
 [<f8db2022>] usb_stor_control_thread+0x0/0x1a3 [usb_storage]
 [<c043779f>] kthread+0xb0/0xd9
 [<c04376ef>] kthread+0x0/0xd9
 [<c0404b33>] kernel_thread_helper+0x7/0x10
 =======================

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Failed reads from RAID-0 array (from newbie who has read the FAQ)
  2007-03-18 21:57               ` Neil Brown
@ 2007-03-19  3:27                 ` Michael Schwarz
  2007-03-19 14:29                   ` Bill Davidsen
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Schwarz @ 2007-03-19  3:27 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Alan Stern, linux-usb-users

More than ever, I am convinced that it is actually a hardware problem, but
I am curious for the opinions of both of you on whether the "system"
(meaning, I guess, the combination of usb-storage driver and raid) is
really doing the best with what it has.

My last effort was to switch to a different computer. When I did, I got in
the dmesg log (unfortunately, not preserved, although I should be able to
recreate) that one of the flash drives had bad blocks. Some part of the
system eventually decided it was a "dead device" (I believe dmesg indicated
the scsi subsystem said so). The device (it happened to be /dev/sdc) was
peremptorily dropped from the system. This appears to be what hung the
raid system.

(Why these messages never appeared on the other computer is beyond me;
obviously some difference in how the actual USB controller reports errors,
but, as I said, I've never studied USB drivers or hardware. In fact, once
you get beyond the UARTs you are getting sophisticated to me)

I've built an array of five known-good devices and so far it works
swimmingly (at least on the hardware that was better at error reporting).

So it seems to me that there is probably nothing actually wrong with the
drivers or their interactions; it leaves me only asking if there should
be some sort of improvement in error reporting/recovery up to userland.

If I am right and the scsi system was marking a device as dead, shouldn't
the userland read against the md device get an error instead of an
indefinite hang?

Beyond this question which I leave to you (although I'd love to hear your
answers/thoughts), I think we can safely say that the problem was hardware
(even if hard to find). If either of you would like, I'd be happy to find
time this week to recreate the error on my "better" PC and send that
along.

As for rolling a custom kernel with more message buffer, well, I'm going
to be getting into a new device driver in the coming months, so a custom
debug kernel is definitely in my future, but I'm not sure when.

I must say, the kernel has become a much more complex beastie since 2.2.x!
(Although it also appears to be improved and somewhat more organized --
but definitely MUCH larger!)

Thank you both so much! I wouldn't even have diagnosed my hardware problem
without your prompts. I'm very grateful. Let me know if you'd like those
dmesg logs or if you'd just like to let it go!

-- 
Michael Schwarz

> On Sunday March 18, mschwarz@multitool.net wrote:
>> cp -rv /mnt/* fs2d2/
>>
>> At this point, the process hangs. So I ran:
>>
>> echo t > /proc/sysrq-trigger
>> dmesg > dmesg-5-hungread.log
>
> Unfortunately (as you say), the whole trace doesn't fit.
> Could you try compiling the kernel with a larger value for
> CONFIG_LOG_BUF_SHIFT ??  It looks like you have 17.  21 is the max.
> 19 should probably be sufficient.
>
> Two things look a bit odd.
> 1/ hald-addon-st (process 3974) seems to be hung doing a
>   'test_unit_ready' after a media-changed signal.  Any idea why?
>   Could you try killing off hald while running the test?
>
> 2/ one usb-storage thread (3667) appears to be waiting for
>   IO to complete (though that is just a guess really).
>
> Maybe usb-storage is waiting for the hald test-unit-ready?
>
> But I'm a bit out of my depth here, so I'll leave it to the USB
> experts.
>
> NeilBrown
>
>  =======================
> hald-addon-st D EF9FBD00  2812  3974   2935          3977  3966 (NOTLB)
>        ef9fbd14 00000086 00000002 ef9fbd00 ef9fbcfc 00000000 00000000 ed4fcbe4
>        c04dc5cc 00000086 0000000a ed407770 c06fb480 18f88700 00000206 00000000
>        ed40787c c1c8c480 00000000 ebe7adc0 001d605d db30e9c8 00000096 ffffffff
> Call Trace:
>  [<c04dc5cc>] elv_next_request+0xfe/0x1ac
>  [<c061e701>] wait_for_completion+0x73/0x98
>  [<c04226ab>] default_wake_function+0x0/0xc
>  [<c04df415>] blk_execute_rq+0xcf/0xe5
>  [<c04de74f>] blk_end_sync_rq+0x0/0x23
>  [<c04dbdf0>] elv_set_request+0x14/0x22
>  [<c04decda>] get_request+0x205/0x2b2
>  [<c04df4e7>] get_request_wait+0x26/0x16c
>  [<f8de1116>] scsi_execute+0xc6/0xd9 [scsi_mod]
>  [<f8de11e0>] scsi_execute_req+0xb7/0xd5 [scsi_mod]
>  [<f8de1241>] scsi_test_unit_ready+0x43/0x80 [scsi_mod]
>  [<f8d726a5>] sd_media_changed+0x60/0xb5 [sd_mod]
>  [<c04e8c82>] kobject_get+0xf/0x13
>  [<c0491481>] check_disk_change+0x16/0x5c
>  [<c055890a>] class_device_get+0xe/0x14
>  [<f8d72b70>] sd_open+0x92/0x120 [sd_mod]
>  [<c04e14cc>] exact_match+0x0/0x4
>  [<c0491b65>] do_open+0x19f/0x255
>  [<c0491d8e>] blkdev_open+0x0/0x4d
>  [<c0491db3>] blkdev_open+0x25/0x4d
>  [<c0470cac>] __dentry_open+0xc3/0x17a
>  [<c0470ddd>] nameidata_to_filp+0x24/0x33
>  [<c0470e1e>] do_filp_open+0x32/0x39
>  [<c061f0e0>] do_nanosleep+0x42/0x66
>  [<c0470bdf>] get_unused_fd+0xb3/0xbd
>  [<c0470e67>] do_sys_open+0x42/0xbe
>  [<c0470f1c>] sys_open+0x1c/0x1e
>  [<c0403f64>] syscall_call+0x7/0xb
>  =======================
> usb-storage   S 00000010  3048  3667      7          3669  3666 (L-TLB)
>        ebcaee78 00000046 f88459c0 00000010 ebc6b7dc f6de08e4 c0587c0e 00000010
>        00000000 c06fb480 0000000a ed5f2bb0 d80fa9b0 e8b0e880 00000205 00000000
>        ed5f2cbc c1c8c480 00000000 ebe7a9c0 001d5d31 00000205 00000000 ffffffff
> Call Trace:
>  [<c0587c0e>] usb_hcd_submit_urb+0x6cd/0x773
>  [<c061ecc2>] schedule_timeout+0x13/0x8d
>  [<c061e925>] wait_for_completion_interruptible_timeout+0x99/0xd5
>  [<c04226ab>] default_wake_function+0x0/0xc
>  [<f8db090c>] usb_stor_msg_common+0xc9/0xe8 [usb_storage]
>  [<f8db0d5f>] usb_stor_bulk_transfer_buf+0x61/0x98 [usb_storage]
>  [<f8db12a9>] usb_stor_Bulk_transport+0xcb/0x221 [usb_storage]
>  [<f8db2022>] usb_stor_control_thread+0x0/0x1a3 [usb_storage]
>  [<f8db1414>] usb_stor_invoke_transport+0x15/0x259 [usb_storage]
>  [<c061fa40>] __down_interruptible+0xde/0xf0
>  [<c04226ab>] default_wake_function+0x0/0xc
>  [<f8db2022>] usb_stor_control_thread+0x0/0x1a3 [usb_storage]
>  [<f8db214a>] usb_stor_control_thread+0x128/0x1a3 [usb_storage]
>  [<c0420a03>] complete+0x39/0x48
>  [<f8db2022>] usb_stor_control_thread+0x0/0x1a3 [usb_storage]
>  [<c043779f>] kthread+0xb0/0xd9
>  [<c04376ef>] kthread+0x0/0xd9
>  [<c0404b33>] kernel_thread_helper+0x7/0x10
>  =======================
>

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Linux-usb-users@lists.sourceforge.net
To unsubscribe, use the last form field at:
https://lists.sourceforge.net/lists/listinfo/linux-usb-users

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Failed reads from RAID-0 array (from newbie who has read the FAQ)
  2007-03-19  3:27                 ` Michael Schwarz
@ 2007-03-19 14:29                   ` Bill Davidsen
  2007-03-19 14:54                     ` [Linux-usb-users] " Michael Schwarz
  0 siblings, 1 reply; 23+ messages in thread
From: Bill Davidsen @ 2007-03-19 14:29 UTC (permalink / raw)
  To: mschwarz; +Cc: Neil Brown, linux-raid, Alan Stern, linux-usb-users

Michael Schwarz wrote:
> More than ever, I am convinced that it is actually a hardware problem, but
> I am curious for the opinions of both of you on whether the "system"
> (meaning, I guess, the combination of usb-storage driver and raid) is
> really doing the best with what it has.
>   

See below, but the short answer is there is probably room for improvement.
> My last effort was to switch to a different computer. When I did, I got in
> the dmesg log (unfortunately, not preserved, although I should be able to
> recreate) that one of the flash drives had bad blocks. Some part of the
> system eventually decided it was a "dead device" (I believe dmesg indicated
> the scsi subsystem said so). The device (it happened to be /dev/sdc) was
> peremptorily dropped from the system. This appears to be what hung the
> raid system.
>
> (Why these messages never appeared on the other computer is beyond me;
> obviously some difference in how the actual USB controller reports errors,
> but, as I said, I've never studied USB drivers or hardware. In fact, once
> you get beyond the UARTs you are getting sophisticated to me)
>
> I've built an array of five known-good devices and so far it works
> swimmingly (at least on the hardware that was better at error reporting).
>
> So it seems to me that there is probably nothing actually wrong with the
> drivers or their interactions; it leaves me only asking if there should
> be some sort of improvement in error reporting/recovery up to userland.
>
> If I am right and the scsi system was marking a device as dead, shouldn't
> the userland read against the md device get an error instead of an
> indefinite hang?
>   

Let me make sure I have this scenario right... one write process (dd or 
cp) hangs, but you can still access data on the array, so the devices 
(all of them?) are working. It would be useful at that point to see if 
/proc/mdstat shows one device as failed.
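
A minimal sketch of that check, for reference. It assumes the usual mdstat
convention of tagging a faulty member with "(F)" after its slot number.
Whether a raid0 array (which has no redundancy) ever marks a member this way
is not established in this thread. "failed_members" is a made-up helper name.

```shell
# List md member devices flagged "(F)" (faulty) in mdstat-format text.
# Reads the file named in $1, defaulting to /proc/mdstat.
# failed_members is a hypothetical helper, not an mdadm command.
failed_members() {
    grep -o '[a-z0-9]*\[[0-9]*](F)' "${1:-/proc/mdstat}" | sed 's/\[.*//'
}

# Example, run from a second shell while the copy is stuck:
#   failed_members
```

An empty result means no member is flagged, which by itself does not rule
out a device that has stopped responding.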

Given that I have described the behavior, I would think that there is 
still a problem in the driver or md somewhere, hangs should time out, 
errors should be reported up, and if this is caused by a lost write 
completion, I would hope that would be timed out and reported. That's my 
read on it, these "just hangs" cases probably are undetected or 
mishandled errors which should be passed up and reported to the 
application or retried and completed. Or handled in some better way than 
what you describe.

Bad hardware is a fact of life, if you feel like chasing this more, an 
understanding of what the hardware did wrong and what the kernel didn't 
do right would be helpful. Of course the failure mode may be so rare, 
and the fix so time-consuming that it won't get fixed, but it can get 
documented.
> Beyond this question which I leave to you (although I'd love to hear your
> answers/thoughts), I think we can safely say that the problem was hardware
> (even if hard to find). If either of you would like, I'd be happy to find
> time this week to recreate the error on my "better" PC and send that
> along.
>
> As for rolling a custom kernel with more message buffer, well, I'm going
> to be getting into a new device driver in the coming months, so a custom
> debug kernel is definitely in my future, but I'm not sure when.
>
> I must say, the kernel has become a much more complex beastie since 2.2.x!
> (Although it also appears to be improved and somewhat more organized --
> but definitely MUCH larger!)
>
> Thank you both so much! I wouldn't even have diagnosed my hardware problem
> without your prompts. I'm very grateful. Let me know if you'd like those
> dmesg logs or if you'd just like to let it go!
>
>   
-- 

bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linux-usb-users] Failed reads from RAID-0 array (from newbie   who has read the FAQ)
  2007-03-19 14:29                   ` Bill Davidsen
@ 2007-03-19 14:54                     ` Michael Schwarz
  2007-03-19 15:31                       ` Alan Stern
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Schwarz @ 2007-03-19 14:54 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Neil Brown, Alan Stern, linux-raid, linux-usb-users

I'm going to hang on to the hardware. This is a pilot/demo that may lead
to development of a new device, and, if so, I'll be getting back into
device driver writing. Working this problem would be great practice for
that. So I will do it. The only problem is I don't know when!

I believe I can replicate the problem, so I'll find time (perhaps next
weekend) to capture the data of interest.

Mr. Stern: Where might I go for low level programming information on USB
devices? I'm interested in registers/DMA/packet formats, etc.

I've found info on the USB protocol itself, but I haven't found info on
devices. Obviously I can dig through kernel source, but documents would be
nice! Again, if this is an unreasonable request for you to "do my
homework," just say so! I won't be offended. I'm sure I can find it myself
given time, but if you happen to have some URLs handy, they'd be
appreciated.

YET AGAIN thank you both! You've been of great help.

-- 
Michael Schwarz

> Michael Schwarz wrote:
>> More than ever, I am convinced that it is actually a hardware problem,
>> but
>> I am curious for the opinions of both of you on whether the "system"
>> (meaning, I guess, the combination of usb-storage driver and raid) is
>> really doing the best with what it has.
>>
>
> See below, but the short answer is there is probably room for improvement.
>> My last effort was to switch to a different computer. When I did, I got
>> in
>> the dmesg log (unfortunately, not preserved, although I should be able
>> to
>> recreate) that one of the flash drives had bad blocks. Some part of the
>> system eventually decided it was a "dead device" (I believe dmesg indicated
>> the scsi subsystem said so). The device (it happened to be /dev/sdc) was
>> peremptorily dropped from the system. This appears to be what hung the
>> raid system.
>>
>> (Why these messages never appeared on the other computer is beyond me;
>> obviously some difference in how the actual USB controller reports
>> errors,
>> but, as I said, I've never studied USB drivers or hardware. In fact,
>> once
>> you get beyond the UARTs you are getting sophisticated to me)
>>
>> I've built an array of five known-good devices and so far it works
>> swimmingly (at least on the hardware that was better at error
>> reporting).
>>
>> So it seems to me that there is probably nothing actually wrong with the
>> drivers or their interactions; it leaves me only asking if there should
>> be some sort of improvement in error reporting/recovery up to userland.
>>
>> If I am right and the scsi system was marking a device as dead,
>> shouldn't
>> the userland read against the md device get an error instead of an
>> indefinite hang?
>>
>
> Let me make sure I have this scenario right... one write process (dd or
> cp) hangs, but you can still access data on the array, so the devices
> (all of them?) are working. It would be useful at that point to see if
> /proc/mdstat shows one device as failed.
>
> Given that I have described the behavior, I would think that there is
> still a problem in the driver or md somewhere, hangs should time out,
> errors should be reported up, and if this is caused by a lost write
> completion, I would hope that would be timed out and reported. That's my
> read on it, these "just hangs" cases probably are undetected or
> mishandled errors which should be passed up and reported to the
> application or retried and completed. Or handled in some better way than
> what you describe.
>
> Bad hardware is a fact of life, if you feel like chasing this more, an
> understanding of what the hardware did wrong and what the kernel didn't
> do right would be helpful. Of course the failure mode may be so rare,
> and the fix so time-consuming that it won't get fixed, but it can get
> documented.
>> Beyond this question which I leave to you (although I'd love to hear
>> your
>> answers/thoughts), I think we can safely say that the problem was
>> hardware
>> (even if hard to find). If either of you would like, I'd be happy to
>> find
>> time this week to recreate the error on my "better" PC and send that
>> along.
>>
>> As for rolling a custom kernel with more message buffer, well, I'm going
>> to be getting into a new device driver in the coming months, so a custom
>> debug kernel is definitely in my future, but I'm not sure when.
>>
>> I must say, the kernel has become a much more complex beastie since
>> 2.2.x!
>> (Although it also appears to be improved and somewhat more organized --
>> but definitely MUCH larger!)
>>
>> Thank you both so much! I wouldn't even have diagnosed my hardware
>> problem
>> without your prompts. I'm very grateful. Let me know if you'd like those
>> dmesg logs or if you'd just like to let it go!
>>
>>
> --
>
> bill davidsen <davidsen@tmr.com>
>   CTO TMR Associates, Inc
>   Doing interesting things with small computers since 1979
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linux-usb-users] Failed reads from RAID-0 array (from newbie who has read the FAQ)
  2007-03-19 14:54                     ` [Linux-usb-users] " Michael Schwarz
@ 2007-03-19 15:31                       ` Alan Stern
  2007-03-19 16:58                         ` Michael Schwarz
  0 siblings, 1 reply; 23+ messages in thread
From: Alan Stern @ 2007-03-19 15:31 UTC (permalink / raw)
  To: Michael Schwarz; +Cc: Bill Davidsen, Neil Brown, linux-raid, linux-usb-users

On Mon, 19 Mar 2007, Michael Schwarz wrote:

> I'm going to hang on to the hardware. This is a pilot/demo that may lead
> to development of a new device, and, if so, I'll be getting back into
> device driver writing. Working this problem would be great practice for
> that. So I will do it. The only problem is I don't know when!
> 
> I believe I can replicate the problem, so I'll find time (perhaps next
> weekend) to capture the data of interest.

Michael, you don't seem to appreciate the basic principles for tracking 
down problems.

	First: Simplify.  Get rid of everything that isn't relevant
	to the problem and could serve to distract you.  In particular,
	don't run X.  That will eliminate around half of your running
	processes and shrink the stack dump down so that it might fit
	in the kernel buffer without overflowing.

	Second: Simplify.  Don't run kernels that have been modified by
	Fedora or anybody else.  Use a plain vanilla kernel from
	kernel.org.

	Third: Simplify.  Try not to collect the same data over and over
	again (take a look at the starts of all those dmesg files you
	compressed and emailed).  You can clear the kernel's log buffer
	after dumping it by doing "dmesg -c >/dev/null".

	Fourth: Be prepared to make changes.  This means making changes
	to the kernel configuration or source code, another reason for 
	using a stock kernel.

To get some really useful data, you need to build a kernel with 
CONFIG_USB_DEBUG turned on.  Without that setting there won't be any 
helpful debugging information in the log.

Then you should run a minimal system.  Single-user mode would be best, 
but that can be _too_ bare-bones.  No GUI will suffice.

Then you should clear the kernel log before starting the big file 
copy.  Basically nothing that happens before then is important, because 
nothing has gone wrong.

Then after the hang occurs, see what shows up in the dmesg log.  And get a 
stack dump.
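
The steps above can be sketched as a few shell functions. This is only an
outline under stated assumptions: it is meant to be run as root from a text
console, the variable names SRC_DIR, DEST_DIR and LOG_FILE and their default
values are made up for illustration, and it uses only commands that already
appear in this thread (dmesg -c, cp, the sysrq trigger).

```shell
#!/bin/sh
# Sketch only: run as root from a text console (no X), one step at a time.
# SRC_DIR, DEST_DIR and LOG_FILE are hypothetical names for illustration.
SRC_DIR="${SRC_DIR:-/mnt}"                  # where the md array is mounted
DEST_DIR="${DEST_DIR:-/tmp/readtest}"       # scratch target for the big read
LOG_FILE="${LOG_FILE:-dmesg-hungread.log}"  # where the final log is saved

# Empty the kernel log buffer so only new messages are captured.
clear_kernel_log() {
    dmesg -c >/dev/null
}

# Start the large read in the background; the shell stays usable even if
# the copy later blocks inside the kernel.
start_copy() {
    mkdir -p "$DEST_DIR"
    cp -r "$SRC_DIR"/. "$DEST_DIR"/ &
    COPY_PID=$!
}

# After the copy has stalled: dump every task's stack, then save the log.
dump_stacks() {
    echo t >/proc/sysrq-trigger
    dmesg >"$LOG_FILE"
}
```

A session would call clear_kernel_log, then start_copy, wait until the copy
stops making progress, and finally call dump_stacks.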

> Mr. Stern: Where might I go for low level programming information on USB
> devices? I'm interested in registers/DMA/packet formats, etc.

Are you interested in USB devices (i.e., flash drives, webcams, and so on
-- the things you plug in to a USB connection) or USB controllers (the 
hardware in your computer that manages the USB bus)?

> I've found info on the USB protocol itself, but I haven't found info on
> devices. Obviously I can dig through kernel source, but documents would be
> nice! Again, if this is an unreasonable request for you to "do my
> homework," just say so! I won't be offended. I'm sure I can find it myself
> given time, but if you happen to have some URLs handy, they'd be
> appreciated.

There are three types of USB controllers used in personal computers: UHCI, 
OHCI, and EHCI.  Links to their specifications are available here:

	http://www.usb.org/developers/resources/

Specifications for various classes of USB devices are available here:

	http://www.usb.org/developers/devclass_docs

Alan Stern


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Linux-usb-users] Failed reads from RAID-0 array (from newbie   who has read the FAQ)
  2007-03-19 15:31                       ` Alan Stern
@ 2007-03-19 16:58                         ` Michael Schwarz
  2007-03-19 18:17                           ` Alan Stern
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Schwarz @ 2007-03-19 16:58 UTC (permalink / raw)
  To: Alan Stern; +Cc: Bill Davidsen, Neil Brown, linux-raid, linux-usb-users

Comments below.

-- 
Michael Schwarz

> On Mon, 19 Mar 2007, Michael Schwarz wrote:
>
>> I'm going to hang on to the hardware. This is a pilot/demo that may lead
>> to development of a new device, and, if so, I'll be getting back into
>> device driver writing. Working this problem would be great practice for
>> that. So I will do it. The only problem is I don't know when!
>>
>> I believe I can replicate the problem, so I'll find time (perhaps next
>> weekend) to capture the data of interest.
>
> Michael, you don't seem to appreciate the basic principles for tracking
> down problems.

I want to bristle at this. I've been a professional software developer for
nearly 20 years. But I can't because all of your points below are, of
course, dead on for tracking down a device-level problem.

>
> 	First: Simplify.  Get rid of everything that isn't relevant
> 	to the problem and could serve to distract you.  In particular,
> 	don't run X.  That will eliminate around half of your running
> 	processes and shrink the stack dump down so that it might fit
> 	in the kernel buffer without overflowing.

Right on. And I know this; I should have had two boxes where I was
working; one where I could do browsy-emaily things separate from the
problem I was working.

>
> 	Second: Simplify.  Don't run kernels that have been modified by
> 	Fedora or anybody else.  Use a plain vanilla kernel from
> 	kernel.org.

Yeah; But here was where I lacked confidence. I used to know every inch of
my kernel and my hardware, but, as previously stated, that was back in the
2.2.x days. I wasn't confident that I could run my hardware with a
plain-vanilla kernel or that I could successfully roll my own working
2.6.x kernel in a timely manner. But, of course, I understand why this is
a good idea.

>
> 	Third: Simplify.  Try not to collect the same data over and over
> 	again (take a look at the starts of all those dmesg files you
> 	compressed and emailed).  You can clear the kernel's log buffer
> 	after dumping it by doing "dmesg -c >/dev/null".

Thanks, I actually didn't know that flag. Makes me feel pretty stupid...

>
> 	Fourth: Be prepared to make changes.  This means making changes
> 	to the kernel configuration or source code, another reason for
> 	using a stock kernel.

I agree -- I just lacked confidence doing so with newer kernels. I used to
ALWAYS build my own kernel right up through the 2.2.x series, building the
kernel to exactly match my hardware. I just haven't kept up. And if you
compare the 2.2.x kernel's configuration parameter list to the 2.6.x,
well, you can maybe understand why I was reluctant to launch on that when
under time pressure. But your point (I gather) is that if I had, it might
well have taken less time than it did...

>
> To get some really useful data, you need to build a kernel with
> CONFIG_USB_DEBUG turned on.  Without that setting there won't be any
> helpful debugging information in the log.

Before I send any more info on this problem, I will do this and all of the
above.
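
A quick way to check whether a given kernel configuration already has the option turned on is sketched below. The helper name config_has is hypothetical, and the location of the config file is an assumption: many distributions install one as /boot/config-$(uname -r), but that is a packaging convention, not a guarantee.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel .config enables the given option.
# Usage: config_has USB_DEBUG /path/to/.config
config_has() {
    grep -q "^CONFIG_$1=y" "$2"
}
```

For example, 'config_has USB_DEBUG "/boot/config-$(uname -r)" && echo enabled' prints "enabled" only when the running kernel's saved config has CONFIG_USB_DEBUG=y.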

>
> Then you should run a minimal system.  Single-user mode would be best,
> but that can be _too_ bare-bones.  No GUI will suffice.

Will do.

>
> Then you should clear the kernel log before starting the big file
> copy.  Basically nothing that happens before then is important, because
> nothing has gone wrong.
>
> Then after the hang occurs, see what shows up in the dmesg log.  And get a
> stack dump.
>
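
One common way to get a stack dump of every task into the kernel log is the magic SysRq "t" request. The sketch below assumes that is the kind of dump meant here, and that the kernel was built with CONFIG_MAGIC_SYSRQ; the function name and the SYSRQ_TRIGGER override are hypothetical, and writing to the real trigger file needs root.

```shell
#!/bin/sh
# Hypothetical helper: ask the kernel to log the stack of every task
# (SysRq 't'). SYSRQ_TRIGGER exists only so the function can be dry-run
# against an ordinary file; the default is the real trigger.
dump_task_stacks() {
    echo t > "${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
}
```

After the hang, running dump_task_stacks followed by "dmesg > after-hang.log" would capture both the driver messages and the stack traces in one file, provided the log buffer is large enough to hold them.
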
>> Mr. Stern: Where might I go for low level programming information on USB
>> devices? I'm interested in registers/DMA/packet formats, etc.
>
> Are you interested in USB devices (i.e., flash drives, webcams, and so on
> -- the things you plug in to a USB connection) or USB controllers (the
> hardware in your computer that manages the USB bus)?

Firstly the controllers, then specific devices.

>
>> I've found info on the USB protocol itself, but I haven't found info on
>> devices. Obviously I can dig through kernel source, but documents would
>> be nice! Again, if this is an unreasonable request for you to "do my
>> homework," just say so! I won't be offended. I'm sure I can find it
>> myself given time, but if you happen to have some URLs handy, they'd be
>> appreciated.
>
> There are three types of USB controllers used in personal computers: UHCI,
> OHCI, and EHCI.  Links to their specifications are available here:
>
> 	http://www.usb.org/developers/resources/

Thanks. This is just what I wanted.

>
> Specifications for various classes of USB devices are available here:
>
> 	http://www.usb.org/developers/devclass_docs

And this. Thank you much. I won't post on this issue again until I've
"cleared the decks" of the items you mention above. Thanks again.

>
> Alan Stern
>
>


* Re: Failed reads from RAID-0 array (from newbie who has read the FAQ)
  2007-03-19 16:58                         ` Michael Schwarz
@ 2007-03-19 18:17                           ` Alan Stern
  0 siblings, 0 replies; 23+ messages in thread
From: Alan Stern @ 2007-03-19 18:17 UTC (permalink / raw)
  To: Michael Schwarz; +Cc: Neil Brown, linux-raid, Bill Davidsen, linux-usb-users

On Mon, 19 Mar 2007, Michael Schwarz wrote:

> Yeah; But here was where I lacked confidence. I used to know every inch of
> my kernel and my hardware, but, as previously stated, that was back in the
> 2.2.x days. I wasn't confident that I could run my hardware with a
> plain-vanilla kernel or that I could successfully roll my own working
> 2.6.x kernel in a timely manner. But, of course, I understand why this is
> a good idea.

It's not so hard to do, if you start from a known-good configuration.  
For instance, you could take the config your current distribution's kernel
is built from and just use it, although it would take a long time to build
because it includes so many drivers.  Whittling it down to just the
drivers you need would be tedious but not very difficult.
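
Starting from the distribution's configuration might look like the sketch below. The helper name seed_config is hypothetical, and the source path in the usage note assumes the distribution installs its config under /boot, which is common but not universal.

```shell
#!/bin/sh
# Hypothetical helper: copy a known-good kernel config into a source tree
# as .config, ready to be updated with "make oldconfig".
# Usage: seed_config /path/to/known-good-config /path/to/linux-tree
seed_config() {
    cp "$1" "$2/.config"
}
```

A typical sequence would be 'seed_config "/boot/config-$(uname -r)" .' from the top of the kernel.org source tree, followed by "make oldconfig" to answer any questions about options the old config does not mention, and then the usual build.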

Alan Stern




end of thread, other threads:[~2007-03-19 18:17 UTC | newest]

Thread overview: 23+ messages
2007-03-17  2:20 Failed reads from RAID-0 array (from newbie who has read the FAQ) Michael Schwarz
2007-03-17  5:31 ` Neil Brown
2007-03-17 18:01   ` Michael Schwarz
2007-03-17 20:49     ` Alan Stern
2007-03-17 21:35       ` Michael Schwarz
2007-03-18  2:06         ` [Linux-usb-users] " Alan Stern
2007-03-18  2:12         ` Alan Stern
2007-03-18  4:42           ` Michael Schwarz
2007-03-18 16:56             ` [Linux-usb-users] " Michael Schwarz
2007-03-18 17:44               ` Michael Schwarz
2007-03-18 21:55               ` Michael Schwarz
2007-03-18 21:57               ` Neil Brown
2007-03-19  3:27                 ` Michael Schwarz
2007-03-19 14:29                   ` Bill Davidsen
2007-03-19 14:54                     ` [Linux-usb-users] " Michael Schwarz
2007-03-19 15:31                       ` Alan Stern
2007-03-19 16:58                         ` Michael Schwarz
2007-03-19 18:17                           ` Alan Stern
     [not found]   ` <45FC33A4.2090408@tmr.com>
2007-03-17 19:13     ` Failed reads from RAID-0 array; still no joy in Mudville Michael Schwarz
2007-03-17 19:21       ` Michael Schwarz
2007-03-18 17:22         ` Bill Davidsen
2007-03-18 17:39           ` Michael Schwarz
2007-03-18 18:21             ` Bill Davidsen