* Block regression since 3.1-rc3
@ 2011-09-30 15:51 ` Christophe Saout
  0 siblings, 0 replies; 25+ messages in thread
From: Christophe Saout @ 2011-09-30 15:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jens Axboe, dm-devel

Hello,

Today I was trying to boot a new 3.1 RC kernel on one of our server
machines and ran into a weird block-related crash.

Whenever I was trying to access some filesystem on the external RAID, I
would soon run into the crash seen below, followed by a total lockup a
bit later.

The external RAID is accessed via multipath and simple logical volumes
(both device-mapper) on top. The local hard disks work just fine (LVM on
top of an Areca hardware RAID).

The bug seems to have been introduced between rc2 and rc3. Up to rc2 everything
is fine; rc3 and also the latest kernel show this bug. From the git
log I can see that the block/ directory has seen a larger update, but
this is just a wild guess.

The crash has been seen with ext3 and GFS2, apparently when processing
write requests (writing files or on umount). All of the crashes are
identical up to generic_make_request.

(oh, and this machine runs as Dom0 under XEN, but I can reproduce the 
 issue when booting natively too, so that's unrelated)

I think you guys should know - maybe you have an idea or have something
for me to test.  I would have tried to file a bug somewhere, but half of
the infrastructure is down at the moment.

Thanks a lot,

	Christophe


Sep 30 10:05:26 vserv80 kernel: CPU 2 
Sep 30 10:05:26 vserv80 kernel: Modules linked in: xenfs gfs2 fuse dlm configfs dummy bonding xt_TPROXY nf_tproxy_core xt_socket xt_owner xt_connmark iptable_mangle ipt_MASQUERADE iptable_nat ipt_ULOG ipt_REJECT xt_mark iptable_filter iptable_raw ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state xt_tcpudp xt_physdev xt_hashlimit xt_multiport ip6table_filter ip6_tables ebt_redirect ebt_ip ebt_mark_m ebtable_broute bridge stp llc ebtable_nat ext4 jbd2 dm_round_robin dm_multipath ebtable_filter ebtables x_tables nf_nat_tftp nf_conntrack_tftp nf_nat_sip nf_conntrack_sip nf_nat_irc nf_conntrack_irc nf_nat_h323 nf_conntrack_h323 nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack xen_netback xen_blkback xen_gntdev ipmi_si ipmi_devintf ipmi_msghandler loop e1000e ohci_hcd tpm_tis tpm ehci_hcd qla2xxx pata_atiixp tpm_bios igb i2c_piix4 scsi_transport_fc
Sep 30 10:05:26 vserv80 kernel:
Sep 30 10:05:26 vserv80 kernel: Pid: 27551, comm: umount Not tainted 3.1.0-rc3 #1 Supermicro H8DG6/H8DGi/H8DG6/H8DGi
Sep 30 10:05:26 vserv80 kernel: RIP: e030:[<ffffffff8121a2e4>]  [<ffffffff8121a2e4>] blk_insert_flush+0x134/0x140
Sep 30 10:05:26 vserv80 kernel: RSP: e02b:ffff880098c89988  EFLAGS: 00010046
Sep 30 10:05:26 vserv80 kernel: RAX: 0000000000000000 RBX: ffff880098e4c018 RCX: ffff88009a158698
Sep 30 10:05:26 vserv80 kernel: RDX: 0000000000000005 RSI: 0000000000000001 RDI: ffff880098e4c018
Sep 30 10:05:26 vserv80 kernel: RBP: ffff880098c89988 R08: 0000000000000001 R09: 0000000000000001
Sep 30 10:05:26 vserv80 kernel: R10: ffff8800a0e44080 R11: 0000000000000000 R12: ffff88009a158698
Sep 30 10:05:26 vserv80 kernel: R13: 00000000ffff00fb R14: 0000000000000005 R15: ffff88009ed37c44
Sep 30 10:05:26 vserv80 kernel: FS:  00007faaaf8e6740(0000) GS:ffff8800d75f8000(0000) knlGS:0000000000000000
Sep 30 10:05:26 vserv80 kernel: CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Sep 30 10:05:26 vserv80 kernel: CR2: 00007faaaf01f0b0 CR3: 000000009aa99000 CR4: 0000000000000660
Sep 30 10:05:26 vserv80 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 30 10:05:26 vserv80 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 30 10:05:26 vserv80 kernel: Process umount (pid: 27551, threadinfo ffff880098c88000, task ffff8800924a8000)
Sep 30 10:05:26 vserv80 kernel: ffff880098c899b8 ffffffff812110a2 ffff880098e4c018 ffff88009a158698
Sep 30 10:05:26 vserv80 kernel: 00000000ffff00fb ffff88009ed37c00 ffff880098c899e8 ffffffff81215bd4
Sep 30 10:05:26 vserv80 kernel: ffff880098c899e8 ffff880000000005 ffff880098e4c018 ffff88009a158698
Sep 30 10:05:26 vserv80 kernel: [<ffffffff812110a2>] __elv_add_request+0x172/0x240
Sep 30 10:05:26 vserv80 kernel: [<ffffffff81215bd4>] add_acct_request+0x34/0x40
Sep 30 10:05:26 vserv80 kernel: [<ffffffff81215c4b>] blk_insert_cloned_request+0x6b/0x90
Sep 30 10:05:26 vserv80 kernel: [<ffffffff8138ec39>] dm_dispatch_request+0x39/0x70
Sep 30 10:05:26 vserv80 kernel: [<ffffffff8139011b>] dm_request_fn+0x1ab/0x2b0
Sep 30 10:05:26 vserv80 kernel: [<ffffffff81211a56>] __blk_run_queue+0x16/0x20
Sep 30 10:05:26 vserv80 kernel: [<ffffffff812182ed>] __make_request+0x2ed/0x330
Sep 30 10:05:26 vserv80 kernel: [<ffffffff8138fd3e>] dm_request+0x16e/0x1e0
Sep 30 10:05:26 vserv80 kernel: [<ffffffff81216077>] generic_make_request+0x2d7/0x520
Sep 30 10:05:26 vserv80 kernel: [<ffffffff8104be91>] ? get_parent_ip+0x11/0x50
Sep 30 10:05:26 vserv80 kernel: [<ffffffff81216321>] submit_bio+0x61/0xd0
Sep 30 10:05:26 vserv80 kernel: [<ffffffff81146c86>] submit_bh+0xe6/0x120
Sep 30 10:05:26 vserv80 kernel: [<ffffffffa034c338>] log_write_header+0x208/0x4b0 [gfs2]
Sep 30 10:05:26 vserv80 kernel: [<ffffffff814c85c5>] ? sub_preempt_count+0x95/0xd0
Sep 30 10:05:26 vserv80 kernel: [<ffffffffa034c698>] gfs2_log_shutdown+0xb8/0x280 [gfs2]
Sep 30 10:05:26 vserv80 kernel: [<ffffffffa0362908>] gfs2_make_fs_ro+0x78/0xc0 [gfs2]
Sep 30 10:05:26 vserv80 kernel: [<ffffffffa03628e9>] ? gfs2_make_fs_ro+0x59/0xc0 [gfs2]
Sep 30 10:05:26 vserv80 kernel: [<ffffffffa0362b38>] gfs2_put_super+0x1e8/0x210 [gfs2]
Sep 30 10:05:26 vserv80 kernel: [<ffffffff8111c64d>] generic_shutdown_super+0x5d/0xe0
Sep 30 10:05:26 vserv80 kernel: [<ffffffff8111c6fc>] kill_block_super+0x2c/0x80
Sep 30 10:05:26 vserv80 kernel: [<ffffffffa035465c>] gfs2_kill_sb+0x5c/0x90 [gfs2]
Sep 30 10:05:26 vserv80 kernel: [<ffffffff8111ca08>] deactivate_locked_super+0x38/0x90
Sep 30 10:05:26 vserv80 kernel: [<ffffffff8111d4e5>] deactivate_super+0x45/0x60
Sep 30 10:05:26 vserv80 kernel: [<ffffffff81137a77>] mntput_no_expire+0xe7/0x130
Sep 30 10:05:26 vserv80 kernel: [<ffffffff81138489>] sys_umount+0x79/0x390
Sep 30 10:05:26 vserv80 kernel: [<ffffffff814cbed2>] system_call_fastpath+0x16/0x1b
Sep 30 10:05:26 vserv80 kernel: RSP <ffff880098c89988>
Sep 30 10:05:26 vserv80 kernel: ---[ end trace 906f3285f4cc5ae9 ]---
Sep 30 10:05:26 vserv80 kernel: note: umount[27551] exited with preempt_count 1

Message from syslogd@vserv80 at Sep 30 17:37:08 ...
 kernel:------------[ cut here ]------------

Message from syslogd@vserv80 at Sep 30 17:37:08 ...
 kernel:invalid opcode: 0000 [#1] PREEMPT SMP 

Message from syslogd@vserv80 at Sep 30 17:37:08 ...
 kernel:Stack:

Message from syslogd@vserv80 at Sep 30 17:37:08 ...
 kernel:Call Trace:

Message from syslogd@vserv80 at Sep 30 17:37:08 ...
 kernel:Code: d2 e8 31 c6 ff ff c9 c3 0f 1f 80 00 00 00 00 48 8b 41 08 48 89 79 08 48 89 0f 48 89 47 08 48 89 38 48 89 cf e8 2e d8 ff ff c9 c3 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 89 f2 be 02 00 00 00 48 

(half of the stuff ended up on the console and not in the logfile, dunno why)


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] Block regression since 3.1-rc3
  2011-09-30 15:51 ` Christophe Saout
  (?)
@ 2011-09-30 18:02 ` Christophe Saout
  -1 siblings, 0 replies; 25+ messages in thread
From: Christophe Saout @ 2011-09-30 18:02 UTC (permalink / raw)
  To: device-mapper development; +Cc: linux-kernel, Jens Axboe

Hello again,

> Today I was trying to boot a new 3.1 RC kernel on one of our server
> machines and ran into a weird block-related crash.
> 
> Whenever I was trying to access some filesystem on the external RAID, I
> would soon run into the crash seen below, followed by a total lockup a
> bit later.
> 
> The external RAID is accessed via multipath and simple logical volumes
> on top. [...]
>
> The bug seems to be introduced between rc2 and rc3. Up to rc2 everything
> is fine, and rc3 and also the latest kernel shows this bug. From the git
> log I can see that the block/ directory has seen some larger update, but
> this is just a wild guess.

Actually, I just found out that reverting

commit 4853abaae7e4a2af938115ce9071ef8684fb7af4
Author: Jeff Moyer <jmoyer@redhat.com>
Date:   Mon Aug 15 21:37:25 2011 +0200

    block: fix flush machinery for stacking drivers with differring flush flags

makes the problem disappear.

Let me know if I can help with further information.

Cheers,
	Christophe



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] Block regression since 3.1-rc3
  2011-09-30 15:51 ` Christophe Saout
  (?)
  (?)
@ 2011-10-03 14:43 ` Jeff Moyer
  2011-10-04 12:02   ` Christophe Saout
  -1 siblings, 1 reply; 25+ messages in thread
From: Jeff Moyer @ 2011-10-03 14:43 UTC (permalink / raw)
  To: Christophe Saout, device-mapper development; +Cc: linux-kernel, Jens Axboe

Christophe Saout <christophe@saout.de> writes:

> I think you guys should know - maybe you have an idea or have something
> for me to test.  I would have tried to file a bug somewhere, but half of
> the infrastructure is down at the moment.

You didn't include the top few lines which tell us what went wrong.  I
tried to reproduce this myself, but I ran into other issues with GFS2.
Anyway, it would help a great deal if you could retrigger the failure
and provide the full failure output.  You can get that by issuing the
'dmesg' command and redirecting it to a file.  And please CC me on any
future correspondence on this issue.

Thanks!
Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] Block regression since 3.1-rc3
  2011-10-03 14:43 ` Jeff Moyer
@ 2011-10-04 12:02   ` Christophe Saout
  2011-10-04 13:32     ` Jeff Moyer
  2011-10-06 19:51     ` Jeff Moyer
  0 siblings, 2 replies; 25+ messages in thread
From: Christophe Saout @ 2011-10-04 12:02 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: device-mapper development, linux-kernel, Jens Axboe

Hi Jeff,

> Anyway, it would help a great deal if you could retrigger the failure
> and provide the full failure output.  You can get that by issuing the
> 'dmesg' command and redirecting it to a file.

Oh, sorry, yes, there's a line missing.

Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);

The bug is not specific to GFS2, I also get it when using an ext3 mount
on that multipath storage pool when writing to a file.

Thanks,
	Christophe

------------[ cut here ]------------
kernel BUG at block/blk-flush.c:323!
invalid opcode: 0000 [#1] PREEMPT SMP 
CPU 18 
Modules linked in: gfs2 fuse dlm configfs dummy bonding xt_TPROXY
nf_tproxy_core xt_socket xt_owner xt_connmark iptable_mangle
ipt_MASQUERADE iptable_nat ipt_ULOG ipt_REJECT xt_mark iptable_filter
iptable_raw ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6
xt_state xt_tcpudp xt_physdev xt_hashlimit xt_multiport ip6table_filter
ip6_tables ebt_redirect ebt_ip ebt_mark_m ebtable_broute bridge stp llc
ebtable_nat ext4 jbd2 dm_round_robin dm_multipath ebtable_filter
ebtables x_tables nf_nat_tftp nf_conntrack_tftp nf_nat_sip
nf_conntrack_sip nf_nat_irc nf_conntrack_irc nf_nat_h323
nf_conntrack_h323 nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
nf_conntrack_ftp nf_conntrack ipmi_si ipmi_devintf ipmi_msghandler loop
e1000e qla2xxx igb pata_atiixp ohci_hcd ehci_hcd tpm_tis tpm
scsi_transport_fc tpm_bios i2c_piix4

Pid: 5282, comm: umount Not tainted 3.1.0-rc3 #1 Supermicro
H8DG6/H8DGi/H8DG6/H8DGi
RIP: 0010:[<ffffffff8121a2e4>]  [<ffffffff8121a2e4>] blk_insert_flush
+0x134/0x140
RSP: 0018:ffff8808155f3988  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff880c149f0018 RCX: ffff881014170000
RDX: 0000000000000005 RSI: 0000000000000001 RDI: ffff880c149f0018
RBP: ffff8808155f3988 R08: 0000000000000001 R09: 0000000000000001
R10: ffff880415cf0440 R11: 0000000000000000 R12: ffff881014170000
R13: 0000000000000082 R14: 0000000000000005 R15: ffff880c13763044
FS:  00007f2d80611740(0000) GS:ffff880c17c00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f2d7fd4c0b0 CR3: 0000000c0fa3e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process umount (pid: 5282, threadinfo ffff8808155f2000, task
ffff8808155ed320)
Stack:
 ffff8808155f39b8 ffffffff812110a2 ffff880c149f0018 ffff881014170000
 0000000000000082 ffff880c13763000 ffff8808155f39e8 ffffffff81215bd4
 ffff8808155f39e8 ffff880c00000005 ffff880c149f0018 ffff881014170000
Call Trace:
 [<ffffffff812110a2>] __elv_add_request+0x172/0x240
 [<ffffffff81215bd4>] add_acct_request+0x34/0x40
 [<ffffffff81215c4b>] blk_insert_cloned_request+0x6b/0x90
 [<ffffffff8138ec39>] dm_dispatch_request+0x39/0x70
 [<ffffffff8139011b>] dm_request_fn+0x1ab/0x2b0
 [<ffffffff81211a56>] __blk_run_queue+0x16/0x20
 [<ffffffff812182ed>] __make_request+0x2ed/0x330
 [<ffffffff8138fd3e>] dm_request+0x16e/0x1e0
 [<ffffffff81216077>] generic_make_request+0x2d7/0x520
 [<ffffffff8104be91>] ? get_parent_ip+0x11/0x50
 [<ffffffff81216321>] submit_bio+0x61/0xd0
 [<ffffffff81146c86>] submit_bh+0xe6/0x120
 [<ffffffffa034c338>] log_write_header+0x208/0x4b0 [gfs2]
 [<ffffffff814c85c5>] ? sub_preempt_count+0x95/0xd0
 [<ffffffffa034c698>] gfs2_log_shutdown+0xb8/0x280 [gfs2]
 [<ffffffffa0362908>] gfs2_make_fs_ro+0x78/0xc0 [gfs2]
 [<ffffffffa03628e9>] ? gfs2_make_fs_ro+0x59/0xc0 [gfs2]
 [<ffffffffa0362b38>] gfs2_put_super+0x1e8/0x210 [gfs2]
 [<ffffffff8111c64d>] generic_shutdown_super+0x5d/0xe0
 [<ffffffff8111c6fc>] kill_block_super+0x2c/0x80
 [<ffffffffa035465c>] gfs2_kill_sb+0x5c/0x90 [gfs2]
 [<ffffffff8111ca08>] deactivate_locked_super+0x38/0x90
 [<ffffffff8111d4e5>] deactivate_super+0x45/0x60
 [<ffffffff81137a77>] mntput_no_expire+0xe7/0x130
 [<ffffffff81138489>] sys_umount+0x79/0x390
 [<ffffffff814cbed2>] system_call_fastpath+0x16/0x1b
Code: d2 e8 51 c7 ff ff c9 c3 0f 1f 80 00 00 00 00 48 8b 41 08 48 89 79
08 48 89 0f 48 89 47 08 48 89 38 48 89 cf e8 3e d9 ff ff c9 c3 <0f> 0b
eb fe 0f 1f 84 00 00 00 00 00 55 89 f2 be 02 00 00 00 48 
RIP  [<ffffffff8121a2e4>] blk_insert_flush+0x134/0x140
 RSP <ffff8808155f3988>
---[ end trace d37a65fe4ba84f7e ]---
note: umount[5282] exited with preempt_count 1




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] Block regression since 3.1-rc3
  2011-10-04 12:02   ` Christophe Saout
@ 2011-10-04 13:32     ` Jeff Moyer
  2011-10-04 15:06       ` Christophe Saout
  2011-10-06 19:51     ` Jeff Moyer
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff Moyer @ 2011-10-04 13:32 UTC (permalink / raw)
  To: Christophe Saout; +Cc: device-mapper development, linux-kernel, Jens Axboe

Christophe Saout <christophe@saout.de> writes:

> Hi Jeff,
>
>> Anyway, it would help a great deal if you could retrigger the failure
>> and provide the full failure output.  You can get that by issuing the
>> 'dmesg' command and redirecting it to a file.
>
> Oh, sorry, yes, there's a line missing.
>
> Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
>
> The bug is not specific to GFS2, I also get it when using an ext3 mount
> on that multipath storage pool when writing to a file.

Great, thanks!  I tried ext3 and ext4 as well, and was unable to trigger
this problem.  I set up a linear volume on top of a multipath device.  Is
that the configuration you are using?  Anyway, this may be enough to get
me on the right path...

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] Block regression since 3.1-rc3
  2011-10-04 13:32     ` Jeff Moyer
@ 2011-10-04 15:06       ` Christophe Saout
  0 siblings, 0 replies; 25+ messages in thread
From: Christophe Saout @ 2011-10-04 15:06 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: device-mapper development, linux-kernel, Jens Axboe

Hi Jeff,

> > Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
> >
> > The bug is not specific to GFS2, I also get it when using an ext3 mount
> > on that multipath storage pool when writing to a file.
> 
> Great, thanks!  I tried ext3 and ext4 as well, and was unable to trigger
> this problem.  I set up a linear volume on top of a multipath device.  Is
> that the configuration you are using?  Anyway, this may be enough to get
> me on the right path...

Yes - the device-mapper tables look like:

vserv81:/root # dmsetup table RAID_A-test.root
0 20971520 linear 253:6 384

(this is one of the filesystems that causes the crash when mounted via ext3)

And 253:6 is the multipath device:

vserv81:/root # dmsetup table 3600d023100065bef0000000009f337c0
0 15626018816 multipath 0 0 2 1 round-robin 0 1 1 8:16 1000 round-robin 0 1 1 8:48 1000 

And 8:16, 8:48 are two qla2xxx FC ports.

Cheers,
	Christophe



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] Block regression since 3.1-rc3
  2011-10-04 12:02   ` Christophe Saout
  2011-10-04 13:32     ` Jeff Moyer
@ 2011-10-06 19:51     ` Jeff Moyer
  2011-10-06 22:02       ` Christophe Saout
                         ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Jeff Moyer @ 2011-10-06 19:51 UTC (permalink / raw)
  To: Christophe Saout
  Cc: device-mapper development, linux-kernel, Jens Axboe, Tejun Heo

Christophe Saout <christophe@saout.de> writes:

> Hi Jeff,
>
>> Anyway, it would help a great deal if you could retrigger the failure
>> and provide the full failure output.  You can get that by issuing the
>> 'dmesg' command and redirecting it to a file.
>
> Oh, sorry, yes, there's a line missing.
>
> Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);

OK, it turns out my testing was incomplete.  I only tested targets that
had a write-through cache, so I didn't hit this problem.  It reproduces
pretty easily with just multipath involved (no linear target on top) when
running against the right storage.

So, here's a patch, but I don't have a full explanation for it just yet.
What I observed was that, on fsync, blkdev_issue_flush was called.
Eventually, the flush request gets cloned, and blk_insert_cloned_request
is called.  This cloned request never actually gets issued to the
q->request_fn (scsi_request_fn in my case).  So, it may be that there is
no plug list for this, so the queue isn't goosed?  I'll try to come up
with a better explanation, or Tejun may just know off the top of his
head what's going on.
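
A minimal userspace sketch of the trigger path described above (not part of the
original report; the mount point below is just a placeholder): writing a file on
the multipath-backed filesystem and calling fsync() is enough to send an empty
flush request down the stack via blkdev_issue_flush.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char buf[] = "flush test\n";
	int fd = open("/mnt/test/flush-test", O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, buf, strlen(buf)) < 0)
		perror("write");
	if (fsync(fd) < 0)	/* fsync ends up in blkdev_issue_flush() */
		perror("fsync");
	close(fd);
	return 0;
}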

So, the patch works for me, but is very much just an RFC.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 491eb30..7aa4736 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -320,7 +320,7 @@ void blk_insert_flush(struct request *rq)
 		return;
 	}
 
-	BUG_ON(!rq->bio || rq->bio != rq->biotail);
+	BUG_ON(rq->bio && rq->bio != rq->biotail);
 
 	/*
 	 * If there's data but flush is not necessary, the request can be
@@ -345,6 +345,12 @@ void blk_insert_flush(struct request *rq)
 	rq->end_io = flush_data_end_io;
 
 	blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0);
+
+	/*
+	 * A cloned empty flush needs a queue kick to make progress.
+	 */
+	if (!rq->bio)
+		blk_run_queue_async(q);
 }
 
 /**

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [dm-devel] Block regression since 3.1-rc3
  2011-10-06 19:51     ` Jeff Moyer
@ 2011-10-06 22:02       ` Christophe Saout
  2011-10-08 11:02       ` Shaohua Li
  2011-10-09  8:08       ` [dm-devel] " Shaohua Li
  2 siblings, 0 replies; 25+ messages in thread
From: Christophe Saout @ 2011-10-06 22:02 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: device-mapper development, linux-kernel, Jens Axboe, Tejun Heo

Hi Jeff,

> > Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
> 
> OK, it turns out my testing was incomplete.  I only tested targets that
> had a write-through cache, so I didn't hit this problem.  It reproduces
> pretty easily with just multipath involved (no linear target on top) when
> running against the right storage.
> 
> So, here's a patch, but I don't have a full explanation for it just yet.
> What I observed was that, on fsync, blkdev_issue_flush was called.
> Eventually, the flush request gets cloned, and blk_insert_cloned_request
> is called.  This cloned request never actually gets issued to the
> q->request_fn (scsi_request_fn in my case).  So, it may be that there is
> no plug list for this, so the queue isn't goosed?  I'll try to come up
> with a better explanation, or Tejun may just know off the top of his
> head what's going on.
> 
> So, the patch works for me, but is very much just an RFC.

It does?  If I apply the patch, the system dies while booting when
scanning the volume groups:

Setting up LVM Volume GroupsBUG: unable to handle kernel NULL pointer
dereference at 0000000000000308
IP: [<ffffffff8130675b>] sd_prep_fn+0x16b/0xa70
PGD 9c96c067 PUD 9f69f067 PMD 0 
Oops: 0000 [#1] PREEMPT SMP 
CPU 14 
Modules linked in: dm_round_robin dm_multipath ebtable_filter ebtables
x_tables nf_nat_tftp nf_conntrack_tftp nf_nat_sip nf_conntrack_sip
nf_nat_irc nf_conntrack_irc nf_nat_h323 nf_conntrack_h323 nf_nat_ftp
nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack
xen_netback xen_blkback xen_gntdev ipmi_si ipmi_devintf ipmi_msghandler
loop tpm_tis tpm igb tpm_bios ohci_hcd e1000e pata_atiixp qla2xxx
scsi_transport_fc i2c_piix4 ehci_hcd

Pid: 2520, comm: kpartx Not tainted 3.1.0-rc8 #1 Supermicro
H8DG6/H8DGi/H8DG6/H8DGi
RIP: e030:[<ffffffff8130675b>]  [<ffffffff8130675b>] sd_prep_fn
+0x16b/0xa70
RSP: e02b:ffff88009c973918  EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffff88009cea0018 RCX: 0000000000001000
RDX: 0000000000000000 RSI: ffff88009e61d440 RDI: 0000000000000000
RBP: ffff88009c973978 R08: ffff8800a0b40700 R09: 0000000000000001
R10: 0000000000000001 R11: ffff88009e61d440 R12: ffff8800a042b480
R13: 0000000000000000 R14: ffff88009ee7c000 R15: 0000000000000000
FS:  00007f7d769617a0(0000) GS:ffff8800d74cf000(0000)
knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000308 CR3: 000000009c979000 CR4: 0000000000000660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kpartx (pid: 2520, threadinfo ffff88009c972000, task
ffff88009c898000)
Stack:
 ffff88009c973928 ffffffff810ceef0 0000000000000000 ffffffff00001000
 0000000000000000 ffff8800a0b72a00 ffff88009c898000 ffff88009cea0018
 ffff8800a042b480 ffff88009cea0018 ffff88009c972000 ffff88009c973fd8
Call Trace:
 [<ffffffff810ceef0>] ? mempool_alloc_slab+0x10/0x20
 [<ffffffff81215e72>] blk_peek_request+0xd2/0x270
 [<ffffffff812f8acb>] scsi_request_fn+0x4b/0x3c0
 [<ffffffff8120fa46>] __blk_run_queue+0x16/0x20
 [<ffffffff8120ef03>] __elv_add_request+0xb3/0x270
 [<ffffffff81213494>] add_acct_request+0x34/0x40
 [<ffffffff812134fd>] blk_insert_cloned_request+0x5d/0x80
 [<ffffffff8138bff9>] dm_dispatch_request+0x39/0x70
 [<ffffffff8138db3c>] dm_request_fn+0x17c/0x280
 [<ffffffff81214f98>] queue_unplugged+0x48/0x100
 [<ffffffff81216335>] blk_flush_plug_list+0x1e5/0x230
 [<ffffffff8100b30f>] ? xen_restore_fl_direct_reloc+0x4/0x4
 [<ffffffff810ccac0>] ? sleep_on_page+0x10/0x10
 [<ffffffff814c0704>] io_schedule+0x44/0x80
 [<ffffffff810ccac9>] sleep_on_page_killable+0x9/0x40
 [<ffffffff814c0d42>] __wait_on_bit_lock+0x52/0xb0
 [<ffffffff8114b7a0>] ? I_BDEV+0x10/0x10
 [<ffffffff810cca32>] __lock_page_killable+0x62/0x70
 [<ffffffff81075330>] ? autoremove_wake_function+0x40/0x40
 [<ffffffff810ce608>] generic_file_aio_read+0x718/0x750
 [<ffffffff810053c9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81118142>] do_sync_read+0xd2/0x110
 [<ffffffff8114b9fc>] ? block_ioctl+0x3c/0x40
 [<ffffffff81129bd6>] ? do_vfs_ioctl+0x96/0x4f0
 [<ffffffff811188c3>] vfs_read+0xc3/0x170
 [<ffffffff811189bc>] sys_read+0x4c/0x90
 [<ffffffff814c9612>] system_call_fastpath+0x16/0x1b
Code: 4c 89 f7 48 89 55 b0 89 4d b8 e8 b1 16 ff ff 85 c0 41 89 c5 48 8b
55 b0 8b 4d b8 0f 85 00 ff ff ff 4d 85 f6 4c 8b 83 d8 00 00 00 <4c> 8b
8a 08 03 00 00 0f 84 c0 02 00 00 41 8b 86 10 06 00 00 31 
RIP  [<ffffffff8130675b>] sd_prep_fn+0x16b/0xa70
 RSP <ffff88009c973918>
CR2: 0000000000000308
---[ end trace 7def99230bac37fd ]---


	Christophe



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] Block regression since 3.1-rc3
  2011-10-06 19:51     ` Jeff Moyer
  2011-10-06 22:02       ` Christophe Saout
@ 2011-10-08 11:02       ` Shaohua Li
  2011-10-08 12:05           ` Christophe Saout
  2011-10-08 16:14         ` Mike Snitzer
  2011-10-09  8:08       ` [dm-devel] " Shaohua Li
  2 siblings, 2 replies; 25+ messages in thread
From: Shaohua Li @ 2011-10-08 11:02 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christophe Saout, device-mapper development, linux-kernel,
	Jens Axboe, Tejun Heo

[-- Attachment #1: Type: text/plain, Size: 2971 bytes --]

2011/10/7 Jeff Moyer <jmoyer@redhat.com>:
> Christophe Saout <christophe@saout.de> writes:
>
>> Hi Jeff,
>>
>>> Anyway, it would help a great deal if you could retrigger the failure
>>> and provide the full failure output.  You can get that by issuing the
>>> 'dmesg' command and redirecting it to a file.
>>
>> Oh, sorry, yes, there's a line missing.
>>
>> Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
>
> OK, it turns out my testing was incomplete.  I only tested targets that
> had a write-through cache, so I didn't hit this problem.  It reproduces
> pretty easily with just multipath involved (no linear target on top) when
> running against the right storage.
>
> So, here's a patch, but I don't have a full explanation for it just yet.
> What I observed was that, on fsync, blkdev_issue_flush was called.
> Eventually, the flush request gets cloned, and blk_insert_cloned_request
> is called.  This cloned request never actually gets issued to the
> q->request_fn (scsi_request_fn in my case).  So, it may be that there is
> no plug list for this, so the queue isn't goosed?  I'll try to come up
> with a better explanation, or Tejun may just know off the top of his
> head what's going on.
>
> So, the patch works for me, but is very much just an RFC.
>
> Cheers,
> Jeff
>
> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
>
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 491eb30..7aa4736 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -320,7 +320,7 @@ void blk_insert_flush(struct request *rq)
>                return;
>        }
>
> -       BUG_ON(!rq->bio || rq->bio != rq->biotail);
> +       BUG_ON(rq->bio && rq->bio != rq->biotail);
>
>        /*
>         * If there's data but flush is not necessary, the request can be
> @@ -345,6 +345,12 @@ void blk_insert_flush(struct request *rq)
>        rq->end_io = flush_data_end_io;
>
>        blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0);
> +
> +       /*
> +        * A cloned empty flush needs a queue kick to make progress.
> +        */
> +       if (!rq->bio)
> +               blk_run_queue_async(q);
>  }
Looks like the dm request-based flush logic is broken.

saved_make_request_fn
  __make_request
    blk_insert_flush
but blk_insert_flush doesn't put the original request on the list; instead,
q->flush_rq is put on the list.
then
dm_request_fn
  blk_peek_request
    dm_prep_fn
      clone_rq
  map_request
    blk_insert_cloned_request
so q->flush_rq is cloned and gets dispatched. But we can't clone q->flush_rq
and use it to do the flush; map_request could even assign a different blockdev
to the cloned request.

Cloning q->flush_rq is absolutely wrong. We can clear the original request's
flush bit and set the cloned request's flush bit, so we can skip the
blk_insert_flush logic for the original request.

Can you please try this patch? It's only compile-tested.

Thanks,
Shaohua

[-- Attachment #2: block-cloned-request-flush.patch --]
[-- Type: text/x-patch, Size: 2240 bytes --]

Signed-off-by: Shaohua Li <shaohua.li@intel.com>

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 52b39f3..07a930d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1424,6 +1424,13 @@ static int dm_make_request(struct request_queue *q, struct bio *bio)
 {
 	struct mapped_device *md = q->queuedata;
 
+	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
+		if (bio->bi_rw & REQ_FLUSH)
+			bio->bi_rw |= REQ_CLONED_FLUSH;
+		if (bio->bi_rw & REQ_FUA)
+			bio->bi_rw |= REQ_CLONED_FUA;
+		bio->bi_rw &= ~(REQ_FLUSH|REQ_FUA);
+	}
 	return md->saved_make_request_fn(q, bio); /* call __make_request() */
 }
 
@@ -1547,6 +1554,14 @@ static int dm_prep_fn(struct request_queue *q, struct request *rq)
 	if (!clone)
 		return BLKPREP_DEFER;
 
+	if (rq->cmd_flags & REQ_CLONED_FLUSH)
+		clone->cmd_flags |= REQ_FLUSH;
+	if (rq->cmd_flags & REQ_CLONED_FUA)
+		clone->cmd_flags |= REQ_FUA;
+
+	rq->cmd_flags &=  ~(REQ_FLUSH|REQ_FUA);
+	clone->cmd_flags &=  ~(REQ_FLUSH|REQ_FUA);
+
 	rq->special = clone;
 	rq->cmd_flags |= REQ_DONTPREP;
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 71fc53b..6ef5dc3 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -153,6 +153,8 @@ enum rq_flag_bits {
 	__REQ_FLUSH_SEQ,	/* request for flush sequence */
 	__REQ_IO_STAT,		/* account I/O stat */
 	__REQ_MIXED_MERGE,	/* merge of different types, fail separately */
+	__REQ_CLONED_FLUSH,
+	__REQ_CLONED_FUA,
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -170,7 +172,7 @@ enum rq_flag_bits {
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_PRIO | \
-	 REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | REQ_SECURE)
+	 REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | REQ_SECURE | REQ_CLONED_FLUSH | REQ_CLONED_FUA)
 #define REQ_CLONE_MASK		REQ_COMMON_MASK
 
 #define REQ_RAHEAD		(1 << __REQ_RAHEAD)
@@ -194,5 +196,7 @@ enum rq_flag_bits {
 #define REQ_IO_STAT		(1 << __REQ_IO_STAT)
 #define REQ_MIXED_MERGE		(1 << __REQ_MIXED_MERGE)
 #define REQ_SECURE		(1 << __REQ_SECURE)
+#define REQ_CLONED_FLUSH	(1 << __REQ_CLONED_FLUSH)
+#define REQ_CLONED_FUA		(1 << __REQ_CLONED_FUA)
 
 #endif /* __LINUX_BLK_TYPES_H */

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [dm-devel] Block regression since 3.1-rc3
  2011-10-08 11:02       ` Shaohua Li
@ 2011-10-08 12:05           ` Christophe Saout
  2011-10-08 16:14         ` Mike Snitzer
  1 sibling, 0 replies; 25+ messages in thread
From: Christophe Saout @ 2011-10-08 12:05 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jeff Moyer, device-mapper development, linux-kernel, Jens Axboe,
	Tejun Heo

Hi Shaohua,

> >> Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
> >
> > OK, it turns out my testing was incomplete.  I only tested targets that
> > had a write-through cache, so I didn't hit this problem.  It reproduces
> > pretty easily with just multipath involved (no linear target on top) when
> > running against the right storage.
> >
> > So, here's a patch, but I don't have a full explanation for it just yet.
> > What I observed was that, on fsync, blkdev_issue_flush was called.
> > Eventually, the flush request gets cloned, and blk_insert_cloned_request
> > is called.  This cloned request never actually gets issued to the
> > q->request_fn (scsi_request_fn in my case).  So, it may be that there is
> > no plug list for this, so the queue isn't goosed?  I'll try to come up
> > with a better explanation, or Tejun may just know off the top of his
> > head what's going on.
> >
> > So, the patch works for me, but is very much just an RFC.
> [...]
> Looks the dm request based flush logic is broken.
> 
> saved_make_request_fn
>   __make_request
>     blk_insert_flush
> but blk_insert_flush doesn't put the original request to list, instead, the
> q->flush_rq is in list.
> then
> dm_request_fn
>   blk_peek_request
>     dm_prep_fn
>       clone_rq
>   map_request
>     blk_insert_cloned_request
> so q->flush_rq is cloned, and get dispatched. but we can't clone q->flush_rq
> and use it to do flush. map_request even could assign a different blockdev to
> the cloned request.
> 
> Clone q->flush_rq is absolutely wrong. we can clear original request's flush bit
> and set cloned request flush bit, so we can skip the blk_insert_flush logic for
> original request.
> 
> can you please try this patch. Only compilation tested.

Now I get a different BUG when doing what triggered the original BUG
(e.g. unmounting the filesystem):

------------[ cut here ]------------
kernel BUG at drivers/scsi/scsi_lib.c:1152!
invalid opcode: 0000 [#1] PREEMPT SMP 
CPU 7 
Modules linked in: xenfs gfs2 fuse dlm configfs dummy bonding xt_TPROXY nf_tproxy_core xt_socket xt_owner xt_connmark iptable_mangle ipt_MASQUERADE iptable_nat ipt_ULOG ipt_REJECT xt_mark iptable_filter iptable_raw ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state xt_tcpudp xt_physdev xt_hashlimit xt_multiport ip6table_filter ip6_tables ebt_redirect ebt_ip ebt_mark_m ebtable_broute bridge stp llc ebtable_nat ext4 jbd2 dm_round_robin dm_multipath ebtable_filter ebtables x_tables nf_nat_tftp nf_conntrack_tftp nf_nat_sip nf_conntrack_sip nf_nat_irc nf_conntrack_irc nf_nat_h323 nf_conntrack_h323 nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack xen_netback xen_blkback xen_gntdev ipmi_si ipmi_devintf ipmi_msghandler loop ohci_hcd ehci_hcd e1000e tpm_tis igb qla2xxx tpm tpm_bios scsi_transport_fc pata_atiixp i2c_piix4

Pid: 5056, comm: kworker/7:2 Not tainted 3.1.0-rc8 #1 Supermicro H8DG6/H8DGi/H8DG6/H8DGi
RIP: e030:[<ffffffff812f7eae>]  [<ffffffff812f7eae>] scsi_setup_fs_cmnd+0x9e/0xe0
RSP: e02b:ffff88009b299b70  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff88009b9cc018 RCX: 0000000000000000
RDX: ffff88009f193c00 RSI: ffff88009b9cc018 RDI: ffff88009b905800
RBP: ffff88009b299b80 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000001 R11: ffff8800a0e40700 R12: ffff88009b905800
R13: 0000000000000001 R14: ffff88009b905800 R15: 0000000000000000
FS:  00007fc76ab77700(0000) GS:ffff8800d767f000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fc76a1d0beb CR3: 0000000001805000 CR4: 0000000000000660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/7:2 (pid: 5056, threadinfo ffff88009b298000, task ffff88009ed68000)
Stack:
 ffff88009b9cc018 ffff88009e5f1a60 ffff88009b299bf0 ffffffff8130675f
 ffff88009b299ba0 ffffffff810ceef0 ffff88009f193c00 ffffffff00000000
 ffffffff8100ab6d ffff88009b299c60 ffff88009ed68000 ffff88009b9cc018
Call Trace:
 [<ffffffff8130675f>] sd_prep_fn+0x14f/0xa70
 [<ffffffff810ceef0>] ? mempool_alloc_slab+0x10/0x20
 [<ffffffff8100ab6d>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81215db2>] blk_peek_request+0xd2/0x270
 [<ffffffff812f8aeb>] scsi_request_fn+0x4b/0x3c0
 [<ffffffff8120fa46>] __blk_run_queue+0x16/0x20
 [<ffffffff8120ef03>] __elv_add_request+0xb3/0x270
 [<ffffffff81213494>] add_acct_request+0x34/0x40
 [<ffffffff8121350b>] blk_insert_cloned_request+0x6b/0x90
 [<ffffffff8138c049>] dm_dispatch_request+0x39/0x70
 [<ffffffff8138db9c>] dm_request_fn+0x17c/0x280
 [<ffffffff8100b322>] ? check_events+0x12/0x20
 [<ffffffff81222db0>] ? cfq_init_queue+0x430/0x430
 [<ffffffff8120fa46>] __blk_run_queue+0x16/0x20
 [<ffffffff81222de4>] cfq_kick_queue+0x34/0x50
 [<ffffffff8106e039>] process_one_work+0x129/0x4a0
 [<ffffffff81070401>] worker_thread+0x161/0x340
 [<ffffffff810702a0>] ? manage_workers.clone.22+0x230/0x230
 [<ffffffff81074d36>] kthread+0x96/0xa0
 [<ffffffff814cb7b4>] kernel_thread_helper+0x4/0x10
 [<ffffffff814c9873>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff814c2dc0>] ? retint_restore_args+0x5/0x6
 [<ffffffff814cb7b0>] ? gs_change+0x13/0x13
Code: ff 5b 41 5c c9 c3 0f 1f 80 00 00 00 00 4c 89 e7 be 20 00 00 00 e8 a3 a2 ff ff 48 85 c0 48 89 c7 74 36 48 89 83 e0 00 00 00 eb 9b <0f> 0b 48 8b 00 48 85 c0 0f 84 77 ff ff ff 48 8b 40 50 48 85 c0 
RIP  [<ffffffff812f7eae>] scsi_setup_fs_cmnd+0x9e/0xe0
 RSP <ffff88009b299b70>
---[ end trace 35f29bc1fdb897a6 ]---
note: kworker/7:2[5056] exited with preempt_count 1

(followed by an oops and the machine is dead in an instant)

	Christophe



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Block regression since 3.1-rc3
  2011-10-08 11:02       ` Shaohua Li
  2011-10-08 12:05           ` Christophe Saout
@ 2011-10-08 16:14         ` Mike Snitzer
  2011-10-09  4:31           ` Shaohua Li
  2011-10-10 21:33           ` Tejun Heo
  1 sibling, 2 replies; 25+ messages in thread
From: Mike Snitzer @ 2011-10-08 16:14 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Jeff Moyer, Jens Axboe, Tejun Heo, device-mapper development,
	Christophe Saout, linux-kernel

On Sat, Oct 08 2011 at  7:02am -0400,
Shaohua Li <shli@kernel.org> wrote:

> Looks the dm request based flush logic is broken.
> 
> saved_make_request_fn
>   __make_request
>     blk_insert_flush
> but blk_insert_flush doesn't put the original request to list, instead, the
> q->flush_rq is in list.
> then
> dm_request_fn
>   blk_peek_request
>     dm_prep_fn
>       clone_rq
>   map_request
>     blk_insert_cloned_request
> so q->flush_rq is cloned, and get dispatched. but we can't clone q->flush_rq
> and use it to do flush. map_request even could assign a different blockdev to
> the cloned request.

You haven't explained why cloning q->flush_rq is broken.  What is the
problem with map_request changing the blockdev?  For the purposes of
request-based DM the flush machinery has already managed the processing
of the flush at the higher level request_queue.

By the time request-based DM is cloning a flush request it really has no
need to reenter the flush machinery (even though Tejun wants it to --
but in practice it doesn't buy us anything because we never stack
request-based DM at the moment.  Instead it showcases how brittle this
path is).

> Clone q->flush_rq is absolutely wrong.

I'm still missing the _why_.

Taking a step back:

Unless others have an immediate ah-ha moment, I'd suggest we revert
commit 4853abaae7e4a2a (block: fix flush machinery for stacking drivers
with differring flush flags), thereby avoiding unnecessarily reentering
the flush machinery.

If commit ed8b752bccf256 (dm table: set flush capability based on
underlying devices) is in place, the flush gets fed directly to
scsi_request_fn, which is fine because the request-based DM's
request_queue's flush_flags reflect the flush capabilities of the
underlying device(s).

We are then covered relative to the only request-based DM use-case
people care about (e.g. dm-multipath, which doesn't use stacked
request-based DM).

We can revisit upholding the purity of the flush machinery for stacked
devices in >= 3.2.
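
As a minimal sketch of the idea behind ed8b752bccf256 (illustrative only, not the
real dm-table code; the helper name and device array are made up): the stacking
driver derives its own queue's flush capability from the queues underneath it,
using the 3.1-era blk_queue_flush() API, so a flush issued to the DM device can
simply be cloned and passed straight down without reentering the flush machinery.

#include <linux/blkdev.h>

/*
 * Advertise on the stacked queue whatever flush/FUA support the lower
 * queues advertise; if none of them do, empty flushes are just no-ops.
 */
static void stacked_queue_set_flush(struct request_queue *q,
				    struct block_device **lower, int count)
{
	unsigned int flush = 0;
	int i;

	for (i = 0; i < count; i++) {
		struct request_queue *lq = bdev_get_queue(lower[i]);

		flush |= lq->flush_flags & (REQ_FLUSH | REQ_FUA);
	}

	blk_queue_flush(q, flush);
}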

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] Block regression since 3.1-rc3
  2011-10-08 12:05           ` Christophe Saout
  (?)
@ 2011-10-09  4:26           ` Shaohua Li
  -1 siblings, 0 replies; 25+ messages in thread
From: Shaohua Li @ 2011-10-09  4:26 UTC (permalink / raw)
  To: Christophe Saout
  Cc: Jeff Moyer, device-mapper development, linux-kernel, Jens Axboe,
	Tejun Heo

[-- Attachment #1: Type: text/plain, Size: 5989 bytes --]

2011/10/8 Christophe Saout <christophe@saout.de>:
> Hi Shaohua,
>
>> >> Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
>> >
>> > OK, it turns out my testing was incomplete.  I only tested targets that
>> > had a write-through cache, so I didn't hit this problem.  It reproduces
>> > pretty easily with just multipath involved (no linear target on top) when
>> > running against the right storage.
>> >
>> > So, here's a patch, but I don't have a full explanation for it just yet.
>> > What I observed was that, on fsync, blkdev_issue_flush was called.
>> > Eventually, the flush request gets cloned, and blk_insert_cloned_request
>> > is called.  This cloned request never actually gets issued to the
>> > q->request_fn (scsi_request_fn in my case).  So, it may be that there is
>> > no plug list for this, so the queue isn't goosed?  I'll try to come up
>> > with a better explanation, or Tejun may just know off the top of his
>> > head what's going on.
>> >
>> > So, the patch works for me, but is very much just an RFC.
>> [...]
>> Looks the dm request based flush logic is broken.
>>
>> saved_make_request_fn
>>   __make_request
>>     blk_insert_flush
>> but blk_insert_flush doesn't put the original request to list, instead, the
>> q->flush_rq is in list.
>> then
>> dm_request_fn
>>   blk_peek_request
>>     dm_prep_fn
>>       clone_rq
>>   map_request
>>     blk_insert_cloned_request
>> so q->flush_rq is cloned, and get dispatched. but we can't clone q->flush_rq
>> and use it to do flush. map_request even could assign a different blockdev to
>> the cloned request.
>>
>> Clone q->flush_rq is absolutely wrong. we can clear original request's flush bit
>> and set cloned request flush bit, so we can skip the blk_insert_flush logic for
>> original request.
>>
>> can you please try this patch. Only compilation tested.
>
> Now I get a different BUG when doing what triggered the original BUG
> (e.g. unmounting the filesystem):
>
> ------------[ cut here ]------------
> kernel BUG at drivers/scsi/scsi_lib.c:1152!
> invalid opcode: 0000 [#1] PREEMPT SMP
> CPU 7
> Modules linked in: xenfs gfs2 fuse dlm configfs dummy bonding xt_TPROXY nf_tproxy_core xt_socket xt_owner xt_connmark iptable_mangle ipt_MASQUERADE iptable_nat ipt_ULOG ipt_REJECT xt_mark iptable_filter iptable_raw ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state xt_tcpudp xt_physdev xt_hashlimit xt_multiport ip6table_filter ip6_tables ebt_redirect ebt_ip ebt_mark_m ebtable_broute bridge stp llc ebtable_nat ext4 jbd2 dm_round_robin dm_multipath ebtable_filter ebtables x_tables nf_nat_tftp nf_conntrack_tftp nf_nat_sip nf_conntrack_sip nf_nat_irc nf_conntrack_irc nf_nat_h323 nf_conntrack_h323 nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack xen_netback xen_blkback xen_gntdev ipmi_si ipmi_devintf ipmi_msghandler loop ohci_hcd ehci_hcd e1000e tpm_tis igb qla2xxx tpm tpm_bios scsi_transport_fc pata_atiixp i2c_piix4
>
> Pid: 5056, comm: kworker/7:2 Not tainted 3.1.0-rc8 #1 Supermicro H8DG6/H8DGi/H8DG6/H8DGi
> RIP: e030:[<ffffffff812f7eae>]  [<ffffffff812f7eae>] scsi_setup_fs_cmnd+0x9e/0xe0
> RSP: e02b:ffff88009b299b70  EFLAGS: 00010046
> RAX: 0000000000000000 RBX: ffff88009b9cc018 RCX: 0000000000000000
> RDX: ffff88009f193c00 RSI: ffff88009b9cc018 RDI: ffff88009b905800
> RBP: ffff88009b299b80 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000001 R11: ffff8800a0e40700 R12: ffff88009b905800
> R13: 0000000000000001 R14: ffff88009b905800 R15: 0000000000000000
> FS:  00007fc76ab77700(0000) GS:ffff8800d767f000(0000) knlGS:0000000000000000
> CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fc76a1d0beb CR3: 0000000001805000 CR4: 0000000000000660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process kworker/7:2 (pid: 5056, threadinfo ffff88009b298000, task ffff88009ed68000)
> Stack:
>  ffff88009b9cc018 ffff88009e5f1a60 ffff88009b299bf0 ffffffff8130675f
>  ffff88009b299ba0 ffffffff810ceef0 ffff88009f193c00 ffffffff00000000
>  ffffffff8100ab6d ffff88009b299c60 ffff88009ed68000 ffff88009b9cc018
> Call Trace:
>  [<ffffffff8130675f>] sd_prep_fn+0x14f/0xa70
>  [<ffffffff810ceef0>] ? mempool_alloc_slab+0x10/0x20
>  [<ffffffff8100ab6d>] ? xen_force_evtchn_callback+0xd/0x10
>  [<ffffffff81215db2>] blk_peek_request+0xd2/0x270
>  [<ffffffff812f8aeb>] scsi_request_fn+0x4b/0x3c0
>  [<ffffffff8120fa46>] __blk_run_queue+0x16/0x20
>  [<ffffffff8120ef03>] __elv_add_request+0xb3/0x270
>  [<ffffffff81213494>] add_acct_request+0x34/0x40
>  [<ffffffff8121350b>] blk_insert_cloned_request+0x6b/0x90
>  [<ffffffff8138c049>] dm_dispatch_request+0x39/0x70
>  [<ffffffff8138db9c>] dm_request_fn+0x17c/0x280
>  [<ffffffff8100b322>] ? check_events+0x12/0x20
>  [<ffffffff81222db0>] ? cfq_init_queue+0x430/0x430
>  [<ffffffff8120fa46>] __blk_run_queue+0x16/0x20
>  [<ffffffff81222de4>] cfq_kick_queue+0x34/0x50
>  [<ffffffff8106e039>] process_one_work+0x129/0x4a0
>  [<ffffffff81070401>] worker_thread+0x161/0x340
>  [<ffffffff810702a0>] ? manage_workers.clone.22+0x230/0x230
>  [<ffffffff81074d36>] kthread+0x96/0xa0
>  [<ffffffff814cb7b4>] kernel_thread_helper+0x4/0x10
>  [<ffffffff814c9873>] ? int_ret_from_sys_call+0x7/0x1b
>  [<ffffffff814c2dc0>] ? retint_restore_args+0x5/0x6
>  [<ffffffff814cb7b0>] ? gs_change+0x13/0x13
> Code: ff 5b 41 5c c9 c3 0f 1f 80 00 00 00 00 4c 89 e7 be 20 00 00 00 e8 a3 a2 ff ff 48 85 c0 48 89 c7 74 36 48 89 83 e0 00 00 00 eb 9b <0f> 0b 48 8b 00 48 85 c0 0f 84 77 ff ff ff 48 8b 40 50 48 85 c0
> RIP  [<ffffffff812f7eae>] scsi_setup_fs_cmnd+0x9e/0xe0
>  RSP <ffff88009b299b70>
> ---[ end trace 35f29bc1fdb897a6 ]---
> note: kworker/7:2[5056] exited with preempt_count 1
>
> (followed by an oops and the machine is dead in an instant)
>
There was a typo in that one; please check this new version.

[-- Attachment #2: block-cloned-request-flush.patch --]
[-- Type: text/x-patch, Size: 2268 bytes --]

Signed-off-by: Shaohua Li <shaohua.li@intel.com>

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 52b39f3..07a930d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1424,6 +1424,13 @@ static int dm_make_request(struct request_queue *q, struct bio *bio)
 {
 	struct mapped_device *md = q->queuedata;
 
+	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
+		if (bio->bi_rw & REQ_FLUSH)
+			bio->bi_rw |= REQ_CLONED_FLUSH;
+		if (bio->bi_rw & REQ_FUA)
+			bio->bi_rw |= REQ_CLONED_FUA;
+		bio->bi_rw &= ~(REQ_FLUSH|REQ_FUA);
+	}
 	return md->saved_make_request_fn(q, bio); /* call __make_request() */
 }
 
@@ -1547,6 +1554,14 @@ static int dm_prep_fn(struct request_queue *q, struct request *rq)
 	if (!clone)
 		return BLKPREP_DEFER;
 
+	if (rq->cmd_flags & REQ_CLONED_FLUSH)
+		clone->cmd_flags |= REQ_FLUSH;
+	if (rq->cmd_flags & REQ_CLONED_FUA)
+		clone->cmd_flags |= REQ_FUA;
+
+	rq->cmd_flags &=  ~(REQ_CLONED_FLUSH|REQ_CLONED_FUA);
+	clone->cmd_flags &=  ~(REQ_CLONED_FLUSH|REQ_CLONED_FUA);
+
 	rq->special = clone;
 	rq->cmd_flags |= REQ_DONTPREP;
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 71fc53b..6ef5dc3 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -153,6 +153,8 @@ enum rq_flag_bits {
 	__REQ_FLUSH_SEQ,	/* request for flush sequence */
 	__REQ_IO_STAT,		/* account I/O stat */
 	__REQ_MIXED_MERGE,	/* merge of different types, fail separately */
+	__REQ_CLONED_FLUSH,
+	__REQ_CLONED_FUA,
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -170,7 +172,7 @@ enum rq_flag_bits {
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_PRIO | \
-	 REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | REQ_SECURE)
+	 REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | REQ_SECURE | REQ_CLONED_FLUSH | REQ_CLONED_FUA)
 #define REQ_CLONE_MASK		REQ_COMMON_MASK
 
 #define REQ_RAHEAD		(1 << __REQ_RAHEAD)
@@ -194,5 +196,7 @@ enum rq_flag_bits {
 #define REQ_IO_STAT		(1 << __REQ_IO_STAT)
 #define REQ_MIXED_MERGE		(1 << __REQ_MIXED_MERGE)
 #define REQ_SECURE		(1 << __REQ_SECURE)
+#define REQ_CLONED_FLUSH	(1 << __REQ_CLONED_FLUSH)
+#define REQ_CLONED_FUA		(1 << __REQ_CLONED_FUA)
 
 #endif /* __LINUX_BLK_TYPES_H */

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: Block regression since 3.1-rc3
  2011-10-08 16:14         ` Mike Snitzer
@ 2011-10-09  4:31           ` Shaohua Li
  2011-10-09  7:16               ` Shaohua Li
  2011-10-10 21:33           ` Tejun Heo
  1 sibling, 1 reply; 25+ messages in thread
From: Shaohua Li @ 2011-10-09  4:31 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jeff Moyer, Jens Axboe, Tejun Heo, device-mapper development,
	Christophe Saout, linux-kernel

2011/10/9 Mike Snitzer <snitzer@redhat.com>:
> On Sat, Oct 08 2011 at  7:02am -0400,
> Shaohua Li <shli@kernel.org> wrote:
>
>> Looks like the dm request-based flush logic is broken.
>>
>> saved_make_request_fn
>>   __make_request
>>     blk_insert_flush
>> but blk_insert_flush doesn't put the original request on the list; instead,
>> q->flush_rq is on the list.
>> then
>> dm_request_fn
>>   blk_peek_request
>>     dm_prep_fn
>>       clone_rq
>>   map_request
>>     blk_insert_cloned_request
>> so q->flush_rq is cloned and gets dispatched, but we can't clone q->flush_rq
>> and use it to do the flush. map_request could even assign a different blockdev
>> to the cloned request.
>
> You haven't explained why cloning q->flush_rq is broken.  What is the
> problem with map_request changing the blockdev?  For the purposes of
> request-based DM the flush machinery has already managed the processing
> of the flush at the higher level request_queue.
hmm, looks like I overlooked the issue. The cloned flush_rq has some problems,
but they can be fixed:
1. it doesn't set request->bio, request->biotail
2. REQ_CLONE_MASK should set REQ_FLUSH_SEQ

> By the time request-based DM is cloning a flush request it really has no
> need to reenter the flush machinery (even though Tejun wants it to --
> but in practice it doesn't buy us anything because we never stack
> request-based DM at the moment.  Instead it showcases how brittle this
> path is).
If there is no benefit, we'd better not clone a flush request. Clearing the
flush bit on the original and setting it on the cloned request is cleaner and
avoids unnecessary overhead/complexity.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Block regression since 3.1-rc3
  2011-10-09  4:31           ` Shaohua Li
@ 2011-10-09  7:16               ` Shaohua Li
  0 siblings, 0 replies; 25+ messages in thread
From: Shaohua Li @ 2011-10-09  7:16 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jeff Moyer, Jens Axboe, Tejun Heo, device-mapper development,
	Christophe Saout, linux-kernel

2011/10/9 Shaohua Li <shli@kernel.org>:
> 2011/10/9 Mike Snitzer <snitzer@redhat.com>:
>> On Sat, Oct 08 2011 at  7:02am -0400,
>> Shaohua Li <shli@kernel.org> wrote:
>>
>>> Looks like the dm request-based flush logic is broken.
>>>
>>> saved_make_request_fn
>>>   __make_request
>>>     blk_insert_flush
>>> but blk_insert_flush doesn't put the original request on the list; instead,
>>> q->flush_rq is on the list.
>>> then
>>> dm_request_fn
>>>   blk_peek_request
>>>     dm_prep_fn
>>>       clone_rq
>>>   map_request
>>>     blk_insert_cloned_request
>>> so q->flush_rq is cloned and gets dispatched, but we can't clone q->flush_rq
>>> and use it to do the flush. map_request could even assign a different blockdev
>>> to the cloned request.
>>
>> You haven't explained why cloning q->flush_rq is broken.  What is the
>> problem with map_request changing the blockdev?  For the purposes of
>> request-based DM the flush machinery has already managed the processing
>> of the flush at the higher level request_queue.
> hmm, looks like I overlooked the issue. The cloned flush_rq has some problems,
> but they can be fixed:
> 1. it doesn't set request->bio, request->biotail
> 2. REQ_CLONE_MASK should set REQ_FLUSH_SEQ
oh, we don't need to set REQ_FLUSH_SEQ; the low-level queue will set it
anyway. Sorry for the noise. Jeff's patch looks OK then.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Block regression since 3.1-rc3
@ 2011-10-09  7:16               ` Shaohua Li
  0 siblings, 0 replies; 25+ messages in thread
From: Shaohua Li @ 2011-10-09  7:16 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Christophe Saout, device-mapper development,
	linux-kernel, Jeff Moyer, Tejun Heo

2011/10/9 Shaohua Li <shli@kernel.org>:
> 2011/10/9 Mike Snitzer <snitzer@redhat.com>:
>> On Sat, Oct 08 2011 at  7:02am -0400,
>> Shaohua Li <shli@kernel.org> wrote:
>>
>>> Looks like the dm request-based flush logic is broken.
>>>
>>> saved_make_request_fn
>>>   __make_request
>>>     blk_insert_flush
>>> but blk_insert_flush doesn't put the original request on the list; instead,
>>> q->flush_rq is on the list.
>>> then
>>> dm_request_fn
>>>   blk_peek_request
>>>     dm_prep_fn
>>>       clone_rq
>>>   map_request
>>>     blk_insert_cloned_request
>>> so q->flush_rq is cloned and gets dispatched, but we can't clone q->flush_rq
>>> and use it to do the flush. map_request could even assign a different blockdev
>>> to the cloned request.
>>
>> You haven't explained why cloning q->flush_rq is broken.  What is the
>> problem with map_request changing the blockdev?  For the purposes of
>> request-based DM the flush machinery has already managed the processing
>> of the flush at the higher level request_queue.
> hmm, looks like I overlooked the issue. The cloned flush_rq has some problems,
> but they can be fixed:
> 1. it doesn't set request->bio, request->biotail
> 2. REQ_CLONE_MASK should set REQ_FLUSH_SEQ
oh, we don't need to set REQ_FLUSH_SEQ; the low-level queue will set it
anyway. Sorry for the noise. Jeff's patch looks OK then.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] Block regression since 3.1-rc3
  2011-10-06 19:51     ` Jeff Moyer
  2011-10-06 22:02       ` Christophe Saout
  2011-10-08 11:02       ` Shaohua Li
@ 2011-10-09  8:08       ` Shaohua Li
  2 siblings, 0 replies; 25+ messages in thread
From: Shaohua Li @ 2011-10-09  8:08 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christophe Saout, device-mapper development, linux-kernel,
	Jens Axboe, Tejun Heo

2011/10/7 Jeff Moyer <jmoyer@redhat.com>:
> Christophe Saout <christophe@saout.de> writes:
>
>> Hi Jeff,
>>
>>> Anyway, it would help a great deal if you could retrigger the failure
>>> and provide the full failure output.  You can get that by issuing the
>>> 'dmesg' command and redirecting it to a file.
>>
>> Oh, sorry, yes, there's a line missing.
>>
>> Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
>
> OK, it turns out my testing was incomplete.  I only tested targets that
> had a write-through cache, so I didn't hit this problem.  It reproduces
> pretty easily with just multipath involved (no linear target on top) when
> running against the right storage.
>
> So, here's a patch, but I don't have a full explanation for it just yet.
> What I observed was that, on fsync, blkdev_issue_flush was called.
> Eventually, the flush request gets cloned, and blk_insert_cloned_request
> is called.  This cloned request never actually gets issued to the
> q->request_fn (scsi_request_fn in my case).  So, it may be that there is
> no plug list for this, so the queue isn't goosed?  I'll try to come up
> with a better explanation, or Tejun may just know off the top of his
> head what's going on.
blk_insert_flush() just inserts the request into a list and doesn't actually
dispatch it to the drive, because normally the queue gets run later by some
other path. But blk_insert_cloned_request() really does mean "dispatch this
request to the drive". If we don't run the queue, the queue will stall.

> So, the patch works for me, but is very much just an RFC.
>
> Cheers,
> Jeff
>
> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
>
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 491eb30..7aa4736 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -320,7 +320,7 @@ void blk_insert_flush(struct request *rq)
>                return;
>        }
>
> -       BUG_ON(!rq->bio || rq->bio != rq->biotail);
> +       BUG_ON(rq->bio && rq->bio != rq->biotail);
>
>        /*
>         * If there's data but flush is not necessary, the request can be
> @@ -345,6 +345,12 @@ void blk_insert_flush(struct request *rq)
>        rq->end_io = flush_data_end_io;
>
>        blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0);
> +
> +       /*
> +        * A cloned empty flush needs a queue kick to make progress.
> +        */
> +       if (!rq->bio)
> +               blk_run_queue_async(q);
>  }
the rq could be a cloned FUA request, so rq->bio may be non-NULL.
Better to move blk_run_queue_async() to blk_insert_cloned_request().
We need to run the queue in the flush case anyway.
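
Roughly, the move being suggested could look like the sketch below. This is
only an illustration against the 3.1-era code paths visible in the traces
above (add_acct_request(), ELEVATOR_INSERT_FLUSH, blk_run_queue_async()); it
is not a tested patch and the details are approximate:

int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
{
	unsigned long flags;
	int where = ELEVATOR_INSERT_BACK;

	/* limit and fault-injection checks elided */

	spin_lock_irqsave(q->queue_lock, flags);

	if (rq->cmd_flags & (REQ_FLUSH | REQ_FUA))
		where = ELEVATOR_INSERT_FLUSH;

	add_acct_request(q, rq, where);

	/*
	 * ELEVATOR_INSERT_BACK runs the queue on its own, but a request
	 * routed into the flush machinery does not, so kick the queue
	 * here instead of inside blk_insert_flush().
	 */
	if (where == ELEVATOR_INSERT_FLUSH)
		blk_run_queue_async(q);

	spin_unlock_irqrestore(q->queue_lock, flags);

	return 0;
}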

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Block regression since 3.1-rc3
  2011-10-09  7:16               ` Shaohua Li
  (?)
@ 2011-10-09  8:12               ` Christophe Saout
  2011-10-09 13:47                 ` Christophe Saout
  -1 siblings, 1 reply; 25+ messages in thread
From: Christophe Saout @ 2011-10-09  8:12 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Mike Snitzer, Jeff Moyer, Jens Axboe, Tejun Heo,
	device-mapper development, linux-kernel

Hi Shaohua,

> >>> Looks like the dm request-based flush logic is broken.
> >>>
> >>> saved_make_request_fn
> >>>   __make_request
> >>>     blk_insert_flush
> >>> but blk_insert_flush doesn't put the original request on the list; instead,
> >>> q->flush_rq is on the list.
> >>> then
> >>> dm_request_fn
> >>>   blk_peek_request
> >>>     dm_prep_fn
> >>>       clone_rq
> >>>   map_request
> >>>     blk_insert_cloned_request
> >>> so q->flush_rq is cloned and gets dispatched, but we can't clone q->flush_rq
> >>> and use it to do the flush. map_request could even assign a different blockdev
> >>> to the cloned request.
> >>
> >> You haven't explained why cloning q->flush_rq is broken.  What is the
> >> problem with map_request changing the blockdev?  For the purposes of
> >> request-based DM the flush machinery has already managed the processing
> >> of the flush at the higher level request_queue.
> > hmm, looks like I overlooked the issue. The cloned flush_rq has some problems,
> > but they can be fixed:
> > 1. it doesn't set request->bio, request->biotail
> > 2. REQ_CLONE_MASK should set REQ_FLUSH_SEQ
> oh, we don't need to set REQ_FLUSH_SEQ; the low-level queue will set it
> anyway. Sorry for the noise. Jeff's patch looks OK then.

Just for the record: I tried your latest patch and it indeed makes the
crash go away. However, things are getting horribly slow (hangs for
something like 20 seconds when writing), at least with GFS2. This is
probably because the queue isn't kicked, as you stated in your most
recent comment, I guess.

Also for the record: Jeff's patch causes other crashes on my machine
(already posted), so while it might be fine from a technical point of
view, it still has issues in its current form. Jeff is working on it.

Thanks,
	Christophe



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Block regression since 3.1-rc3
  2011-10-09  8:12               ` Christophe Saout
@ 2011-10-09 13:47                 ` Christophe Saout
  0 siblings, 0 replies; 25+ messages in thread
From: Christophe Saout @ 2011-10-09 13:47 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: device-mapper development, Shaohua Li, Jens Axboe, Mike Snitzer,
	linux-kernel, Tejun Heo

Hello again,

> Also for the record: Jeff's patch causes other crashes on my machine
> (already posted), so while it might be fine from a technical point of
> view, it still has issues in its current form. Jeff is working on it.

I have to apologize.  Looks like I made a mistake the first time I tried
Jeff's patch.  Shaohua asked me to check again and this time the box
runs nicely.  I probably forgot something stupid like a modules_install.

The patch that Jeff posted is actually working.  Mea culpa.  Sorry for
the fuss!

Thanks,
	Christophe



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Block regression since 3.1-rc3
  2011-10-08 16:14         ` Mike Snitzer
  2011-10-09  4:31           ` Shaohua Li
@ 2011-10-10 21:33           ` Tejun Heo
  2011-10-11 19:56             ` Mike Snitzer
  1 sibling, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2011-10-10 21:33 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Shaohua Li, Jeff Moyer, Jens Axboe, device-mapper development,
	Christophe Saout, linux-kernel

Hello, Mike.

On Sat, Oct 08, 2011 at 12:14:21PM -0400, Mike Snitzer wrote:
> Unless others have an immediate ah-ha moment, I'd suggest we revert
> commit 4853abaae7e4a2a (block: fix flush machinery for stacking drivers
> with differring flush flags), thereby avoiding unnecessarily reentering
> the flush machinery.

I don't object to the immediate fix but think that adding such a special
case is gonna make the thing even more brittle and make future changes
even more difficult.  Those one-off cases tend to cause pretty severe
headaches when someone wants to evolve common code, so let's please
find out what went wrong and fix it properly so that everyone follows
the same set of rules.

> If commit ed8b752bccf256 (dm table: set flush capability based on
> underlying devices) is in place the flush gets fed directly to
> scsi_request_fn, which is fine because the request-based DM's
> request_queue's flush_flags reflect the flush capabilities of the
> underlying device(s).
> 
> We are then covered relative to the only request-based DM use-case
> people care about (e.g. dm-multipath, which doesn't use stacked
> request-based DM).
> 
> We can revisit upholding the purity of the flush machinery for stacked
> devices in >= 3.2.

Hmmm... another rather nasty assumption the current flush code makes
is that every flush request has either zero or a single bio attached to
it.  The assumption has been there for quite some time now.
That somehow seems broken by request-based dm (either that or the wrong
request is taking the INSERT_FLUSH path).  Is it possible that an rq w/ a
single bio can end up with multiple bios after cloning?

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Block regression since 3.1-rc3
  2011-10-10 21:33           ` Tejun Heo
@ 2011-10-11 19:56             ` Mike Snitzer
  2011-10-11 20:52               ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Mike Snitzer @ 2011-10-11 19:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Shaohua Li, Jeff Moyer, Jens Axboe, device-mapper development,
	Christophe Saout, linux-kernel

On Mon, Oct 10 2011 at  5:33pm -0400,
Tejun Heo <tj@kernel.org> wrote:

> Hello, Mike.
> 
> On Sat, Oct 08, 2011 at 12:14:21PM -0400, Mike Snitzer wrote:
> > Unless others have an immediate ah-ha moment, I'd suggest we revert
> > commit 4853abaae7e4a2a (block: fix flush machinery for stacking drivers
> > with differring flush flags), thereby avoiding unnecessarily reentering
> > the flush machinery.
> 
> I don't object to the immediate fix but think that adding such a special
> case is gonna make the thing even more brittle and make future changes
> even more difficult.  Those one-off cases tend to cause pretty severe
> headaches when someone wants to evolve common code, so let's please
> find out what went wrong and fix it properly so that everyone follows
> the same set of rules.

Are you referring to Jeff's fix as "the immediate fix"?  Christophe
seems to have had success with it after all.

As for the special case that you're suggesting makes the code more
brittle, etc.  If you could be more specific that'd be awesome.

Jeff asked a question about the need to kick the queue in this case (as
he didn't feel he had a proper justification for why it was needed).

If we can get a proper patch header together to justify Jeff's patch
that'd be great.  And then revisit any of the special casing you'd like
us to avoid in >= 3.2?

(we're obviously _very_ short on time for a 3.1 fix right now).

> > If commit ed8b752bccf256 (dm table: set flush capability based on
> > underlying devices) is in place the flush gets fed directly to
> > scsi_request_fn, which is fine because the request-based DM's
> > request_queue's flush_flags reflect the flush capabilities of the
> > underlying device(s).
> > 
> > We are then covered relative to the only request-based DM use-case
> > people care about (e.g. dm-multipath, which doesn't use stacked
> > request-based DM).
> > 
> > We can revisit upholding the purity of the flush machinery for stacked
> > devices in >= 3.2.
> 
> Hmmm... another rather nasty assumption the current flush code makes
> is that every flush request has either zero or single bio attached to
> it.  The assumption has always been there for quite some time now.

OK.

> That somehow seems broken by request based dm (either that or wrong
> request is taking INSERT_FLUSH path).

Where was this issue of a flush having multiple bios reported?

> Is it possible that an rq w/ a single bio can end up with multiple bios
> after cloning?

No, blk_rq_prep_clone() just clones the bio(s) that are attached to the
request being cloned.
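
To make that concrete, the bio-for-bio copy works roughly like the sketch
below. It is a simplified illustration rather than the actual 3.1 code:
clone_one_bio() is a stand-in for the real bio allocation and cloning
helpers, and error unwinding plus integrity cloning are left out.

static int clone_request_bios(struct request *clone, struct request *rq_src,
			      gfp_t gfp_mask)
{
	struct bio *bio_src, *bio;

	__rq_for_each_bio(bio_src, rq_src) {
		bio = clone_one_bio(bio_src, gfp_mask);
		if (!bio)
			return -ENOMEM;

		/*
		 * Each source bio yields exactly one bio on the clone,
		 * appended in order, so a zero- or single-bio request
		 * stays a zero- or single-bio request after cloning.
		 */
		if (clone->bio) {
			clone->biotail->bi_next = bio;
			clone->biotail = bio;
		} else {
			clone->bio = clone->biotail = bio;
		}
	}
	return 0;
}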

Mike

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Block regression since 3.1-rc3
  2011-10-11 19:56             ` Mike Snitzer
@ 2011-10-11 20:52               ` Tejun Heo
  2011-10-11 22:07                 ` Jeff Moyer
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2011-10-11 20:52 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Shaohua Li, Jeff Moyer, Jens Axboe, device-mapper development,
	Christophe Saout, linux-kernel

Hello, Mike.

On Tue, Oct 11, 2011 at 03:56:12PM -0400, Mike Snitzer wrote:
> > I don't object to the immediate fix but think that adding such a special
> > case is gonna make the thing even more brittle and make future changes
> > even more difficult.  Those one-off cases tend to cause pretty severe
> > headaches when someone wants to evolve common code, so let's please
> > find out what went wrong and fix it properly so that everyone follows
> > the same set of rules.
> 
> Are you referring to Jeff's fix as "the immediate fix"?  Christophe
> seems to have had success with it after all.

I meant reverting the previous commit.  Oops... it seems like I
misread Jeff's patch.  Please read on.

> As for the special case that you're suggesting makes the code more
> brittle, etc.  If you could be more specific that'd be awesome.

I was still talking about the previous attempt at having dm treated
specially by the flush machinery (the purity thing someone was talking
about).

> Jeff asked a question about the need to kick the queue in this case (as
> he didn't feel he had a proper justification for why it was needed).
> 
> If we can get a proper patch header together to justify Jeff's patch
> that'd be great.  And then revisit any of the special casing you'd like
> us to avoid in >= 3.2?
> 
> (we're obviously _very_ short on time for a 3.1 fix right now).
...
> > Hmmm... another rather nasty assumption the current flush code makes
> > is that every flush request has either zero or single bio attached to
> > it.  The assumption has always been there for quite some time now.
> 
> OK.
> 
> > That somehow seems broken by request based dm (either that or wrong
> > request is taking INSERT_FLUSH path).
> 
> Where was this issue of a flush having multiple bios reported?

I was misreading Jeff's patch, so the problem is a request w/o a bio
reaching INSERT_FLUSH, not rqs with multiple bios.  Sorry about
that.  Having another look...

Ah, okay, so blk-flush on the lower-layer device is seeing the
q->flush_rq of the upper layer, which doesn't have a bio.  Yes, the
BUG_ON() change looks correct to me.  That, or we can do

  BUG_ON(rq->bio != rq->biotail); /* assumes zero or single bio rq */

As for the blk_run_queue_async(), it's a bit confusing.  Currently,
the block layer isn't clear about who's responsible for kicking the queue
after putting a request onto the elevator, and I suppose Jeff put it there
because blk_insert_cloned_request() doesn't kick the queue.

Hmm... Jeff, you also added blk_run_queue_async() call in
4853abaae7e4a too.  Is there a reason why blk_insert_cloned_request()
isn't calling __blk_run_queue() or async variant of it like
blk_insert_request() does?

At any rate, the queue kicking is a different issue.  Let's not mix
the two here.  The BUG_ON() change looks good to me.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Block regression since 3.1-rc3
  2011-10-11 20:52               ` Tejun Heo
@ 2011-10-11 22:07                 ` Jeff Moyer
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Moyer @ 2011-10-11 22:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, Shaohua Li, Jens Axboe, device-mapper development,
	Christophe Saout, linux-kernel

Tejun Heo <tj@kernel.org> writes:

> Hello, Mike.
>
> On Tue, Oct 11, 2011 at 03:56:12PM -0400, Mike Snitzer wrote:
>> > I don't object to the immediate fix but think that adding such a special
>> > case is gonna make the thing even more brittle and make future changes
>> > even more difficult.  Those one-off cases tend to cause pretty severe
>> > headaches when someone wants to evolve common code, so let's please
>> > find out what went wrong and fix it properly so that everyone follows
>> > the same set of rules.
>> 
>> Are you referring to Jeff's fix as "the immediate fix"?  Christophe
>> seems to have had success with it after all.
>
> I meant reverting the previous commit.  Oops... it seems like I
> misread Jeff's patch.  Please read on.
>
>> As for the special case that you're suggesting makes the code more
>> brittle, etc.  If you could be more specific that'd be awesome.
>
> I was still talking about the previous attempt of making dm treated
> special by flush machinery.  (the purity thing someone was talking
> about)
>
>> Jeff asked a question about the need to kick the queue in this case (as
>> he didn't feel he had a proper justification for why it was needed).
>> 
>> If we can get a proper patch header together to justify Jeff's patch
>> that'd be great.  And then revisit any of the special casing you'd like
>> us to avoid in >= 3.2?
>> 
>> (we're obviously _very_ short on time for a 3.1 fix right now).
> ...
>> > Hmmm... another rather nasty assumption the current flush code makes
>> > is that every flush request has either zero or single bio attached to
>> > it.  The assumption has always been there for quite some time now.
>> 
>> OK.
>> 
>> > That somehow seems broken by request based dm (either that or wrong
>> > request is taking INSERT_FLUSH path).
>> 
>> Where was this issue of a flush having multiple bios reported?
>
> I was misreading Jeff's patch, so the problem is a request w/o a bio
> reaching INSERT_FLUSH, not rqs with multiple bios.  Sorry about
> that.  Having another look...
>
> Ah, okay, so blk-flush on the lower-layer device is seeing the
> q->flush_rq of the upper layer, which doesn't have a bio.  Yes, the
> BUG_ON() change looks correct to me.  That, or we can do
>
>   BUG_ON(rq->bio != rq->biotail); /* assumes zero or single bio rq */
>
> As for the blk_run_queue_async(), it's a bit confusing.  Currently,
> the block layer isn't clear about who's responsible for kicking the queue
> after putting a request onto the elevator, and I suppose Jeff put it there
> because blk_insert_cloned_request() doesn't kick the queue.
>
> Hmm... Jeff, you also added blk_run_queue_async() call in
> 4853abaae7e4a too.  Is there a reason why blk_insert_cloned_request()
> isn't calling __blk_run_queue() or async variant of it like
> blk_insert_request() does?
>
> At any rate, the queue kicking is a different issue.  Let's not mix
> the two here.  The BUG_ON() change looks good to me.

I can submit a two-part series, sure.  I'll have to get back to you on
where I think the right place for the queue kick is.  I'll look at it in
detail tomorrow.

-Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [dm-devel] Block regression since 3.1-rc3
  2011-10-08 12:05           ` Christophe Saout
  (?)
  (?)
@ 2011-11-01  4:54           ` illdred
  -1 siblings, 0 replies; 25+ messages in thread
From: illdred @ 2011-11-01  4:54 UTC (permalink / raw)
  To: Christophe Saout, linux-kernel, illdred

On Sat, 08 Oct 2011 14:05:05 +0200
Christophe Saout <christophe@saout.de> wrote:

> Hi Shaohua,
> 
> > >> Line 323 is this one: BUG_ON(!rq->bio || rq->bio != rq->biotail);
> > >
> > > OK, it turns out my testing was incomplete.  I only tested targets that
> > > had a write-through cache, so I didn't hit this problem.  It reproduces
> > > pretty easily with just multipath involved (no linear target on top) when
> > > running against the right storage.
> > >
> > > So, here's a patch, but I don't have a full explanation for it just yet.
> > > What I observed was that, on fsync, blkdev_issue_flush was called.
> > > Eventually, the flush request gets cloned, and blk_insert_cloned_request
> > > is called.  This cloned request never actually gets issued to the
> > > q->request_fn (scsi_request_fn in my case).  So, it may be that there is
> > > no plug list for this, so the queue isn't goosed?  I'll try to come up
> > > with a better explanation, or Tejun may just know off the top of his
> > > head what's going on.
> > >
> > > So, the patch works for me, but is very much just an RFC.
> > [...]
> > Looks like the dm request-based flush logic is broken.
> >
> > saved_make_request_fn
> >   __make_request
> >     blk_insert_flush
> > but blk_insert_flush doesn't put the original request on the list; instead,
> > q->flush_rq is on the list.
> > then
> > dm_request_fn
> >   blk_peek_request
> >     dm_prep_fn
> >       clone_rq
> >   map_request
> >     blk_insert_cloned_request
> > so q->flush_rq is cloned and gets dispatched, but we can't clone q->flush_rq
> > and use it to do the flush. map_request could even assign a different blockdev
> > to the cloned request.
> >
> > Cloning q->flush_rq is absolutely wrong. We can clear the original request's
> > flush bit and set the cloned request's flush bit, so we can skip the
> > blk_insert_flush logic for the original request.
> >
> > Can you please try this patch? It is only compile-tested.
> 
> Now I get a different BUG when doing what triggered the original BUG
> (e.g. unmounting the filesystem):
> 
> ------------[ cut here ]------------
> kernel BUG at drivers/scsi/scsi_lib.c:1152!
> invalid opcode: 0000 [#1] PREEMPT SMP 
> CPU 7 
> Modules linked in: xenfs gfs2 fuse dlm configfs dummy bonding xt_TPROXY nf_tproxy_core xt_socket xt_owner xt_connmark iptable_mangle ipt_MASQUERADE iptable_nat ipt_ULOG ipt_REJECT xt_mark iptable_filter iptable_raw ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state xt_tcpudp xt_physdev xt_hashlimit xt_multiport ip6table_filter ip6_tables ebt_redirect ebt_ip ebt_mark_m ebtable_broute bridge stp llc ebtable_nat ext4 jbd2 dm_round_robin dm_multipath ebtable_filter ebtables x_tables nf_nat_tftp nf_conntrack_tftp nf_nat_sip nf_conntrack_sip nf_nat_irc nf_conntrack_irc nf_nat_h323 nf_conntrack_h323 nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack xen_netback xen_blkback xen_gntdev ipmi_si ipmi_devintf ipmi_msghandler loop ohci_hcd ehci_hcd e1000e tpm_tis igb qla2xxx tpm tpm_bios scsi_transport_fc pata_atiixp i2c_piix4
> 
> Pid: 5056, comm: kworker/7:2 Not tainted 3.1.0-rc8 #1 Supermicro H8DG6/H8DGi/H8DG6/H8DGi
> RIP: e030:[<ffffffff812f7eae>]  [<ffffffff812f7eae>] scsi_setup_fs_cmnd+0x9e/0xe0
> RSP: e02b:ffff88009b299b70  EFLAGS: 00010046
> RAX: 0000000000000000 RBX: ffff88009b9cc018 RCX: 0000000000000000
> RDX: ffff88009f193c00 RSI: ffff88009b9cc018 RDI: ffff88009b905800
> RBP: ffff88009b299b80 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000001 R11: ffff8800a0e40700 R12: ffff88009b905800
> R13: 0000000000000001 R14: ffff88009b905800 R15: 0000000000000000
> FS:  00007fc76ab77700(0000) GS:ffff8800d767f000(0000) knlGS:0000000000000000
> CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fc76a1d0beb CR3: 0000000001805000 CR4: 0000000000000660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process kworker/7:2 (pid: 5056, threadinfo ffff88009b298000, task ffff88009ed68000)
> Stack:
>  ffff88009b9cc018 ffff88009e5f1a60 ffff88009b299bf0 ffffffff8130675f
>  ffff88009b299ba0 ffffffff810ceef0 ffff88009f193c00 ffffffff00000000
>  ffffffff8100ab6d ffff88009b299c60 ffff88009ed68000 ffff88009b9cc018
> Call Trace:
>  [<ffffffff8130675f>] sd_prep_fn+0x14f/0xa70
>  [<ffffffff810ceef0>] ? mempool_alloc_slab+0x10/0x20
>  [<ffffffff8100ab6d>] ? xen_force_evtchn_callback+0xd/0x10
>  [<ffffffff81215db2>] blk_peek_request+0xd2/0x270
>  [<ffffffff812f8aeb>] scsi_request_fn+0x4b/0x3c0
>  [<ffffffff8120fa46>] __blk_run_queue+0x16/0x20
>  [<ffffffff8120ef03>] __elv_add_request+0xb3/0x270
>  [<ffffffff81213494>] add_acct_request+0x34/0x40
>  [<ffffffff8121350b>] blk_insert_cloned_request+0x6b/0x90
>  [<ffffffff8138c049>] dm_dispatch_request+0x39/0x70
>  [<ffffffff8138db9c>] dm_request_fn+0x17c/0x280
>  [<ffffffff8100b322>] ? check_events+0x12/0x20
>  [<ffffffff81222db0>] ? cfq_init_queue+0x430/0x430
>  [<ffffffff8120fa46>] __blk_run_queue+0x16/0x20
>  [<ffffffff81222de4>] cfq_kick_queue+0x34/0x50
>  [<ffffffff8106e039>] process_one_work+0x129/0x4a0
>  [<ffffffff81070401>] worker_thread+0x161/0x340
>  [<ffffffff810702a0>] ? manage_workers.clone.22+0x230/0x230
>  [<ffffffff81074d36>] kthread+0x96/0xa0
>  [<ffffffff814cb7b4>] kernel_thread_helper+0x4/0x10
>  [<ffffffff814c9873>] ? int_ret_from_sys_call+0x7/0x1b
>  [<ffffffff814c2dc0>] ? retint_restore_args+0x5/0x6
>  [<ffffffff814cb7b0>] ? gs_change+0x13/0x13
> Code: ff 5b 41 5c c9 c3 0f 1f 80 00 00 00 00 4c 89 e7 be 20 00 00 00 e8 a3 a2 ff ff 48 85 c0 48 89 c7 74 36 48 89 83 e0 00 00 00 eb 9b <0f> 0b 48 8b 00 48 85 c0 0f 84 77 ff ff ff 48 8b 40 50 48 85 c0 
> RIP  [<ffffffff812f7eae>] scsi_setup_fs_cmnd+0x9e/0xe0
>  RSP <ffff88009b299b70>
> ---[ end trace 35f29bc1fdb897a6 ]---
> note: kworker/7:2[5056] exited with preempt_count 1
> 
> (followed by an oops and the machine is dead in an instant)
> 
> 	Christophe
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> 
Hi!

I have experienced this "kernel BUG at drivers/scsi/scsi_lib.c:1152!" too!

linux-3.1-rc3 fails to boot on my computer. Lots of error messages get printed to the console, one of which is:
note: md0_raid [255] exited with preempt_count 1

and some of the error messages mention something about SCSI before scrolling past. Then the system stops doing anything.

I've managed to take screen shots of all the details:

http://www.flickr.com/photos/illdred/sets/72157627863015429/

there are 5 images.

The root filesystem is on an mdadm RAID partition and normally works.

I have these relevant kernel options in the grub boot configuration:
root=/dev/md0 rootfstype=ext4 md=0,/dev/sda7,/dev/sdc8 raid=noautodetect

downgrading to the linux-3.0.7-1 package lets the system boot again.

I'm running Archlinux, btw.

thanks,

Fraser 'illdred' DeLisle

p.s. please cc illdred@gmail.com as I am not a list subscriber

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2011-11-01  4:54 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-30 15:51 Block regression since 3.1-rc3 Christophe Saout
2011-09-30 15:51 ` Christophe Saout
2011-09-30 18:02 ` [dm-devel] " Christophe Saout
2011-10-03 14:43 ` Jeff Moyer
2011-10-04 12:02   ` Christophe Saout
2011-10-04 13:32     ` Jeff Moyer
2011-10-04 15:06       ` Christophe Saout
2011-10-06 19:51     ` Jeff Moyer
2011-10-06 22:02       ` Christophe Saout
2011-10-08 11:02       ` Shaohua Li
2011-10-08 12:05         ` Christophe Saout
2011-10-08 12:05           ` Christophe Saout
2011-10-09  4:26           ` Shaohua Li
2011-11-01  4:54           ` illdred
2011-10-08 16:14         ` Mike Snitzer
2011-10-09  4:31           ` Shaohua Li
2011-10-09  7:16             ` Shaohua Li
2011-10-09  7:16               ` Shaohua Li
2011-10-09  8:12               ` Christophe Saout
2011-10-09 13:47                 ` Christophe Saout
2011-10-10 21:33           ` Tejun Heo
2011-10-11 19:56             ` Mike Snitzer
2011-10-11 20:52               ` Tejun Heo
2011-10-11 22:07                 ` Jeff Moyer
2011-10-09  8:08       ` [dm-devel] " Shaohua Li
