linux-kernel.vger.kernel.org archive mirror
* kernel BUG using multipath on 2.6.0-test5
@ 2003-09-26  1:57 Steven Dake
  2003-09-26 12:17 ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Steven Dake @ 2003-09-26  1:57 UTC (permalink / raw)
  To: linux-scsi, linux-kernel

Folks,

I've finally gotten around to trying out 2.6.0-test5 with some automatic
multipathing code on which I am working.  The code automatically
determines and configures multiple paths to a device using the MD
driver.

My program works fine on 2.4.x + qlogic FC driver but not on 2.6.0-test5
with qlogic FC driver.  The qlogic FC driver works for the individual
drives, so it is less likely to be the cause of the problem.

The mdstat looks like this:
root@192.168.1.95:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [multipath]
md254 : active multipath sdb[1] sdd[0]
      1000000 blocks [2/2] [UU]

md255 : active multipath sda[1] sdc[0]
      1000000 blocks [2/2] [UU]

unused devices: <none>

I attempted a mke2fs /dev/md254 (which is the multipath device) and the
process froze.  Looking at /var/log/messages, I see:

--------------------

kernel BUG at drivers/scsi/scsi_lib.c:544!
invalid operand: 0000 [#1]
CPU:    2
EIP:    0060:[<c01f5db3>]    Not tainted
EFLAGS: 00010046
EIP is at scsi_alloc_sgtable+0xed/0xfa
eax: 00000000   ebx: e8a75378   ecx: f7ce2424   edx: f7cfd200
esi: f7cfd200   edi: f7cfd200   ebp: f7ce2400   esp: ea11fc68
ds: 007b   es: 007b   ss: 0068
Process mke2fs (pid: 147, threadinfo=ea11e000 task=f76b9940)
Stack: e8a75378 c01f172c f7d21e60 e8a75378 f7cfd200 f7cfd200 f7ce2400
c01f631d
       f7cfd200 00000020 e8a75378 e8a75378 f7cfd200 e8a75378 c01f6455
f7cfd200
       00000020 c03c4680 00000001 e8a75378 f7ce4c00 c03c4680 00000001
c01be1aa
Call Trace:
 [<c01f172c>] __scsi_get_command+0x2b/0x74
 [<c01f631d>] scsi_init_io+0x7a/0x13d
 [<c01f6455>] scsi_prep_fn+0x75/0x171
 [<c01be1aa>] elv_next_request+0x47/0xf1
 [<c01bf9e9>] generic_unplug_device+0x5a/0x69
 [<c01bfb2a>] blk_run_queues+0x85/0x9d
 [<c014da8c>] __wait_on_buffer+0xd7/0xde
 [<c011b4e1>] autoremove_wake_function+0x0/0x4f
 [<c011b4e1>] autoremove_wake_function+0x0/0x4f
 [<c014f8af>] __block_prepare_write+0x11a/0x432
 [<c0150406>] block_prepare_write+0x34/0x4d
 [<c015390a>] blkdev_get_block+0x0/0x5b
 [<c01331bf>] generic_file_aio_write_nolock+0x3be/0xa9e
 [<c015390a>] blkdev_get_block+0x0/0x5b
 [<c013ec16>] do_anonymous_page+0x11f/0x1e9
 [<c0134fff>] buffered_rmqueue+0xc0/0x13f
 [<c0135111>] __alloc_pages+0x93/0x30f
 [<c013391d>] generic_file_write_nolock+0x7e/0x9c
 [<c013f238>] handle_mm_fault+0xf7/0x162
 [<c0117832>] do_page_fault+0x126/0x43f
 [<c01548cc>] blkdev_file_write+0x37/0x3b
 [<c014c7a6>] vfs_write+0xbc/0x127
 [<c0153ac7>] block_llseek+0x0/0xef
 [<c014c8b6>] sys_write+0x42/0x63
 [<c01090d7>] syscall_call+0x7/0xb

Code: 0f 0b 20 02 1c bb 2c c0 e9 2d ff ff ff 8b 44 24 08 8b 54 24

Should this work, or is multipath known to be broken in 2.6?  Has anyone
started debugging multipath and want to work together on it?

Thanks
-steve



* Re: kernel BUG using multipath on 2.6.0-test5
  2003-09-26  1:57 kernel BUG using multipath on 2.6.0-test5 Steven Dake
@ 2003-09-26 12:17 ` Matthew Wilcox
  2003-09-26 12:26   ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2003-09-26 12:17 UTC (permalink / raw)
  To: Steven Dake; +Cc: linux-scsi, linux-kernel

On Thu, Sep 25, 2003 at 06:57:15PM -0700, Steven Dake wrote:
> kernel BUG at drivers/scsi/scsi_lib.c:544!

        BUG_ON(!cmd->use_sg);

>  [<c01f631d>] scsi_init_io+0x7a/0x13d

static int scsi_init_io(struct scsi_cmnd *cmd)
        struct request     *req = cmd->request;
        cmd->use_sg = req->nr_phys_segments;
        sgpnt = scsi_alloc_sgtable(cmd, GFP_ATOMIC);

>  [<c01f6455>] scsi_prep_fn+0x75/0x171

static int scsi_prep_fn(struct request_queue *q, struct request *req)
        struct scsi_cmnd *cmd;
        cmd->request = req;
        ret = scsi_init_io(cmd);

... this is getting outside my area of confidence.  Ask axboe why we might
get a zero nr_phys_segments request passed in.
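
To make the chain above concrete, here is a minimal userspace sketch, not
kernel code.  The mock_* structures and function names are hypothetical
stand-ins that keep only the fields quoted in this message; the point is
just that the segment count is copied into use_sg unchecked, so a request
with zero segments lands on the BUG_ON condition.

```c
#include <stdbool.h>

/* Hypothetical, stripped-down stand-ins for struct request and
 * struct scsi_cmnd; only the fields quoted in the thread are kept. */
struct mock_request {
	unsigned short nr_phys_segments;
};

struct mock_scsi_cmnd {
	struct mock_request *request;
	unsigned short use_sg;
};

/* Mirrors the condition behind BUG_ON(!cmd->use_sg): returns true
 * where the real kernel would hit the BUG. */
bool mock_sgtable_would_bug(const struct mock_scsi_cmnd *cmd)
{
	return cmd->use_sg == 0;
}

/* Mirrors the quoted lines of scsi_init_io(): the segment count is
 * copied from the request without any validation.
 * Returns 0 on success, -1 where the kernel would BUG. */
int mock_init_io(struct mock_scsi_cmnd *cmd)
{
	cmd->use_sg = cmd->request->nr_phys_segments;
	return mock_sgtable_would_bug(cmd) ? -1 : 0;
}
```

A request built with nr_phys_segments = 0 makes mock_init_io() return -1,
which corresponds to the oops in the original report.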

-- 
"It's not Hollywood.  War is real, war is primarily not about defeat or
victory, it is about death.  I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk


* Re: kernel BUG using multipath on 2.6.0-test5
  2003-09-26 12:17 ` Matthew Wilcox
@ 2003-09-26 12:26   ` Jens Axboe
  2003-09-26 20:14     ` Steven Dake
  0 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2003-09-26 12:26 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Steven Dake, linux-scsi, linux-kernel

On Fri, Sep 26 2003, Matthew Wilcox wrote:
> On Thu, Sep 25, 2003 at 06:57:15PM -0700, Steven Dake wrote:
> > kernel BUG at drivers/scsi/scsi_lib.c:544!
> 
>         BUG_ON(!cmd->use_sg);
> 
> >  [<c01f631d>] scsi_init_io+0x7a/0x13d
> 
> static int scsi_init_io(struct scsi_cmnd *cmd)
>         struct request     *req = cmd->request;
>         cmd->use_sg = req->nr_phys_segments;
>         sgpnt = scsi_alloc_sgtable(cmd, GFP_ATOMIC);
> 
> >  [<c01f6455>] scsi_prep_fn+0x75/0x171
> 
> static int scsi_prep_fn(struct request_queue *q, struct request *req)
>         struct scsi_cmnd *cmd;
>         cmd->request = req;
>         ret = scsi_init_io(cmd);
> 
> .. this is getting outside my area of confidence.  Ask axboe why we might
> get a zero nr_phys_segments request passed in.

Looks like an mp bug. I'd suggest adding something ala

	if (!req->nr_phys_segments || !req->nr_hw_segments) {
		blk_dump_rq_flags(req, "scsi_init_io");
		return BLKPREP_KILL;
	}

inside the first

	} else if (req->flags & (REQ_CMD | REQ_BLOCK_PC)) {

drivers/scsi/scsi_lib.c:scsi_prep_fn(). That will show the state of such
a buggy request. I'm pretty sure this is an mp bug though.
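
The shape of that check can be tried out in userspace with a sketch like
the one below.  Everything prefixed mock_ or MOCK_ is a hypothetical
stand-in: the return values are assumed for illustration and are not the
kernel's BLKPREP_* definitions, and mock_dump_rq() only imitates the role
of blk_dump_rq_flags().

```c
#include <stdio.h>

/* Assumed values for illustration only; the real BLKPREP_* constants
 * live in the kernel's block-layer headers. */
enum { MOCK_BLKPREP_OK = 0, MOCK_BLKPREP_KILL = 1 };

/* Hypothetical stand-in for struct request. */
struct mock_request {
	unsigned long sector;
	unsigned short nr_phys_segments;
	unsigned short nr_hw_segments;
};

/* Plays the role of blk_dump_rq_flags(): report the request state. */
void mock_dump_rq(const struct mock_request *req, const char *who)
{
	fprintf(stderr, "%s: sector %lu, phys segs %u, hw segs %u\n",
		who, req->sector,
		(unsigned int)req->nr_phys_segments,
		(unsigned int)req->nr_hw_segments);
}

/* Reject a request with no segments before it reaches the code that
 * would otherwise trip the BUG_ON. */
int mock_prep_fn(const struct mock_request *req)
{
	if (!req->nr_phys_segments || !req->nr_hw_segments) {
		mock_dump_rq(req, "mock_prep_fn");
		return MOCK_BLKPREP_KILL;
	}
	return MOCK_BLKPREP_OK;
}
```

The check turns a hard BUG into a logged, killed request, so the state of
the bad request can be inspected while the machine keeps running.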

-- 
Jens Axboe



* Re: kernel BUG using multipath on 2.6.0-test5
  2003-09-26 12:26   ` Jens Axboe
@ 2003-09-26 20:14     ` Steven Dake
  2003-09-27  0:34       ` [PATCH] fixes defect with " Steven Dake
  0 siblings, 1 reply; 6+ messages in thread
From: Steven Dake @ 2003-09-26 20:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Matthew Wilcox, linux-scsi, linux-kernel

On Fri, 2003-09-26 at 05:26, Jens Axboe wrote: 
> On Fri, Sep 26 2003, Matthew Wilcox wrote:
> > On Thu, Sep 25, 2003 at 06:57:15PM -0700, Steven Dake wrote:
> > > kernel BUG at drivers/scsi/scsi_lib.c:544!
> > 
> >         BUG_ON(!cmd->use_sg);
> > 
> > >  [<c01f631d>] scsi_init_io+0x7a/0x13d
> > 
> > static int scsi_init_io(struct scsi_cmnd *cmd)
> >         struct request     *req = cmd->request;
> >         cmd->use_sg = req->nr_phys_segments;
> >         sgpnt = scsi_alloc_sgtable(cmd, GFP_ATOMIC);
> > 
> > >  [<c01f6455>] scsi_prep_fn+0x75/0x171
> > 
> > static int scsi_prep_fn(struct request_queue *q, struct request *req)
> >         struct scsi_cmnd *cmd;
> >         cmd->request = req;
> >         ret = scsi_init_io(cmd);
> > 
> > .. this is getting outside my area of confidence.  Ask axboe why we might
> > get a zero nr_phys_segments request passed in.
> 
> Looks like an mp bug. I'd suggest adding something ala
> 
> 	if (!req->nr_phys_segments || !req->nr_hw_segments) {
> 		blk_dump_rq_flags(req, "scsi_init_io");
> 		return BLKPREP_KILL;
> 	}
> 
> inside the first
> 
> 	} else if (req->flags & (REQ_CMD | REQ_BLOCK_PC)) {
> 
> drivers/scsi/scsi_lib.c:scsi_prep_fn(). That will show the state of such
> a buggy request. I'm pretty sure this is an mp bug though.

scsi_prep_fn: dev sdd: flags = REQ_CMD REQ_STARTED
sector 0, nr/cnr 8/8
bio c2694708, biotail c2694708, buffer f76dc000, data 00000000, len 0
multipath: IO failure on sdd, disabling IO path.
        Operation continuing on 1 IO paths.
multipath: sdd: rescheduling sector 8
MULTIPATH conf printout:
-- wd:1 rd:2
disk0, o:0, dev:sdd
disk1, o:1, dev:sdb
MULTIPATH conf printout:
-- wd:1 rd:2
disk1, o:1, dev:sdb
multipath: sdd: redirecting sector 0 to another IO path
scsi_prep_fn: dev sdb: flags = REQ_CMD REQ_STARTED
sector 0, nr/cnr 0/8
bio c2694708, biotail c2694708, buffer f76dc000, data 00000000, len 0

I assume a length of zero is wrong...  I'll trace up the stack and see
where the bad data gets into the request.

Thanks for the pointer...
-steve  



* [PATCH] fixes defect with kernel BUG using multipath on 2.6.0-test5
  2003-09-26 20:14     ` Steven Dake
@ 2003-09-27  0:34       ` Steven Dake
  2003-09-27  8:33         ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Steven Dake @ 2003-09-27  0:34 UTC (permalink / raw)
  To: neilb; +Cc: axboe, Matthew Wilcox, linux-scsi, linux-kernel, linux-raid

[-- Attachment #1: Type: text/plain, Size: 3365 bytes --]

Folks,
Thanks, Matt and Jens, for the debugging help on the multipath problem.  I
now have a patch (attached) which solves the problem and makes multipath
work properly.  There are two sets of flags in a block io request (bio):
bi_flags and bi_rw.  bi_flags holds flags for the block-level code, and
bi_rw holds flags for the low-level device drivers.  The multipath driver
set a bi_rw flag in the bi_flags field: BIO_RW_FAILFAST (bit 3) was being
set in bi_flags.  FAILFAST is a hint to the low-level driver that it
should try to fail out quickly.  Unfortunately, bit 3 of bi_flags is
BIO_SEG_VALID, which tells the block subsystem that the segment counts are
already valid and shouldn't be recalculated.  As a result, the block layer
skipped the recalculation and the request ended up with phys and hw
segment counts of 0.  Not good.

Neil, can you send this upstream?

Thanks
-steve
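
The collision described above can be reproduced in a small userspace
sketch.  The mock_bio structure and the MOCK_* constants are hypothetical
stand-ins; the bit numbers are taken from the description in this message
(both flags use bit 3), not copied from the kernel headers.

```c
#include <stdbool.h>

/* Bit numbers as described in this thread: both are 3, but each is
 * only meaningful in its own field. Assumed values, for illustration. */
enum { MOCK_BIO_SEG_VALID = 3 };	/* belongs in bi_flags */
enum { MOCK_BIO_RW_FAILFAST = 3 };	/* belongs in bi_rw */

/* Hypothetical stand-in for struct bio, keeping only the two fields. */
struct mock_bio {
	unsigned long bi_flags;	/* state for the block layer */
	unsigned long bi_rw;	/* read/write hints for drivers */
};

bool mock_seg_valid(const struct mock_bio *bio)
{
	return (bio->bi_flags & (1UL << MOCK_BIO_SEG_VALID)) != 0;
}

bool mock_failfast(const struct mock_bio *bio)
{
	return (bio->bi_rw & (1UL << MOCK_BIO_RW_FAILFAST)) != 0;
}

/* What the multipath code did before the patch: right flag, wrong field. */
void mock_set_failfast_buggy(struct mock_bio *bio)
{
	bio->bi_flags |= (1UL << MOCK_BIO_RW_FAILFAST);
}

/* What the patch does: the flag goes into bi_rw. */
void mock_set_failfast_fixed(struct mock_bio *bio)
{
	bio->bi_rw |= (1UL << MOCK_BIO_RW_FAILFAST);
}
```

After mock_set_failfast_buggy() the bio reports its segments as valid and
carries no failfast hint at all; after mock_set_failfast_fixed() it is the
other way around, which is the intended behaviour.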

[-- Attachment #2: fix_defect_multipath_BUGs.patch --]
[-- Type: text/plain, Size: 459 bytes --]

--- linux-2.6.0-test5/drivers/md/multipath.c	2003-09-08 12:50:03.000000000 -0700
+++ linux-fixmd2/drivers/md/multipath.c	2003-09-26 17:13:09.000000000 -0700
@@ -178,7 +178,7 @@
 
 	mp_bh->bio = *bio;
 	mp_bh->bio.bi_bdev = multipath->rdev->bdev;
-	mp_bh->bio.bi_flags |= (1 << BIO_RW_FAILFAST);
+	mp_bh->bio.bi_rw |= (1 << BIO_RW_FAILFAST);
 	mp_bh->bio.bi_end_io = multipath_end_request;
 	mp_bh->bio.bi_private = mp_bh;
 	generic_make_request(&mp_bh->bio);


* Re: [PATCH] fixes defect with kernel BUG using multipath on 2.6.0-test5
  2003-09-27  0:34       ` [PATCH] fixes defect with " Steven Dake
@ 2003-09-27  8:33         ` Jens Axboe
  0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2003-09-27  8:33 UTC (permalink / raw)
  To: Steven Dake; +Cc: neilb, Matthew Wilcox, linux-scsi, linux-kernel, linux-raid

On Fri, Sep 26 2003, Steven Dake wrote:
> Folks,
> Thanks Matt and Jens for the debug help on the multipath problem.  I now
> have a patch (attached) which solves the problem and makes multipath
> work properly.   There are two types of "flags" that are used in a block
> io request, bi_flags, and bi_rw.  bi_flags is used for flags to the
> block level code, and bi_rw is used for flags to the low level device
> drivers.  The code in the multipath driver used the wrong flag in the
> wrong field.  In this case, the flag FAILFAST (bit 3) was being set in
> the bi_flags field.  FAILFAST is a hint to the low level driver that it
> should try to fail out quickly.  Unfortunately, the value 3 is also
> BIO_SEG_VALID, which is a flag to the block subsystem that the segments
> shouldn't be recalculated.  The result was that the wrong field was set,
> telling the block layer not to recalculate the segments resulting in
> phys and hw segments of 0.  Not good.
> 
> Neil can you send upstream ?

Ouch, good catch! I'm sorry to say this is actually my fault...

-- 
Jens Axboe

