* [PATCH] block: bios with an offset are always gappy
@ 2017-04-13 8:06 Johannes Thumshirn
2017-04-13 9:48 ` Christoph Hellwig
2017-04-13 10:02 ` Ming Lei
0 siblings, 2 replies; 13+ messages in thread
From: Johannes Thumshirn @ 2017-04-13 8:06 UTC (permalink / raw)
To: Jens Axboe, Omar Sandoval
Cc: Ming Lei, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
Linux Block Layer Mailinglist, Linux Kernel Mailinglist,
Johannes Thumshirn
Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
in nvme_setup_prps() because the dma_len will drop below zero but the
length not.
A git bisect tracked the behaviour down to commit 729204ef49ec ("block:
relax check on sg gap"). Since commit 729204ef49ec a bio's offsets are not
taken into account in the decision if the bio will gap any more. Restore
the old behavior of checking bio offsets as well for the decision if a
bio will gap.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Fixes: 729204ef49ec ("block: relax check on sg gap")
Cc: Ming Lei <ming.lei@redhat.com>
---
include/linux/blkdev.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7548f332121a..a03b7196209e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1677,11 +1677,14 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
{
if (bio_has_data(prev) && queue_virt_boundary(q)) {
struct bio_vec pb, nb;
+ bool offset;
bio_get_last_bvec(prev, &pb);
bio_get_first_bvec(next, &nb);
- if (!bios_segs_mergeable(q, prev, &pb, &nb))
+ offset = pb.bv_offset || nb.bv_offset;
+
+ if (offset || !bios_segs_mergeable(q, prev, &pb, &nb))
return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
}
--
2.12.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy
2017-04-13 8:06 [PATCH] block: bios with an offset are always gappy Johannes Thumshirn
@ 2017-04-13 9:48 ` Christoph Hellwig
2017-04-13 9:56 ` Johannes Thumshirn
2017-04-13 10:01 ` Johannes Thumshirn
2017-04-13 10:02 ` Ming Lei
1 sibling, 2 replies; 13+ messages in thread
From: Christoph Hellwig @ 2017-04-13 9:48 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Jens Axboe, Omar Sandoval, Ming Lei, Bart Van Assche,
Hannes Reinecke, Christoph Hellwig,
Linux Block Layer Mailinglist, Linux Kernel Mailinglist
On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> in nvme_setup_prps() because the dma_len will drop below zero but the
> length not.
I think we should also turns this into a WARN_ON_ONCE + error return..
But do you have an exact btrfsprogs version and command line? I do a lot
of testing that involves mkfs.btrfs on nvme and haven't seen it..
> A git bisect tracked the behaviour down to commit 729204ef49ec ("block:
> relax check on sg gap"). Since commit 729204ef49ec a bio's offsets are not
> taken into account in the decision if the bio will gap any more. Restore
> the old behavior of checking bio offsets as well for the decision if a
> bio will gap.
>
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> Fixes: 729204ef49ec ("block: relax check on sg gap")
> Cc: Ming Lei <ming.lei@redhat.com>
> ---
> include/linux/blkdev.h | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7548f332121a..a03b7196209e 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1677,11 +1677,14 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
> {
> if (bio_has_data(prev) && queue_virt_boundary(q)) {
> struct bio_vec pb, nb;
> + bool offset;
>
> bio_get_last_bvec(prev, &pb);
> bio_get_first_bvec(next, &nb);
>
> - if (!bios_segs_mergeable(q, prev, &pb, &nb))
> + offset = pb.bv_offset || nb.bv_offset;
> +
> + if (offset || !bios_segs_mergeable(q, prev, &pb, &nb))
> return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
> }
I think the code in NVMe (and potentially the other drivers using
virt_queue_boundary) is bogus. All of them are actually fine with
gaps in the protocol, as long as the gaps are aligned to said boundary.
So I suspect what we really need is to fix up NVMe, and after that
we could even relax the above check, to not check for offset but
offset & queue_virt_boundary(q).
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy
2017-04-13 9:48 ` Christoph Hellwig
@ 2017-04-13 9:56 ` Johannes Thumshirn
2017-04-13 10:01 ` Johannes Thumshirn
1 sibling, 0 replies; 13+ messages in thread
From: Johannes Thumshirn @ 2017-04-13 9:56 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Omar Sandoval, Ming Lei, Bart Van Assche,
Hannes Reinecke, Linux Block Layer Mailinglist,
Linux Kernel Mailinglist
On Thu, Apr 13, 2017 at 11:48:35AM +0200, Christoph Hellwig wrote:
> I think we should also turns this into a WARN_ON_ONCE + error return..
>
> But do you have an exact btrfsprogs version and command line? I do a lot
> of testing that involves mkfs.btrfs on nvme and haven't seen it..
Sure, it's:
mkfs.btrfs, part of btrfs-progs v4.5.3+20160729
Qemu is 2.6.2
[...]
>
> I think the code in NVMe (and potentially the other drivers using
> virt_queue_boundary) is bogus. All of them are actually fine with
> gaps in the protocol, as long as the gaps are aligned to said boundary.
>
> So I suspect what we really need is to fix up NVMe, and after that
> we could even relax the above check, to not check for offset but
> offset & queue_virt_boundary(q).
That's what I tried doing the last two days but as we're rather late in the rc
cycle and it is a regression that came in with -rc1 I'd rather like to have it
fixed or at least have a band aid in place.
Byte,
Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy
2017-04-13 9:48 ` Christoph Hellwig
2017-04-13 9:56 ` Johannes Thumshirn
@ 2017-04-13 10:01 ` Johannes Thumshirn
1 sibling, 0 replies; 13+ messages in thread
From: Johannes Thumshirn @ 2017-04-13 10:01 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Omar Sandoval, Ming Lei, Bart Van Assche,
Hannes Reinecke, Linux Block Layer Mailinglist,
Linux Kernel Mailinglist
On Thu, Apr 13, 2017 at 11:48:35AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> > in nvme_setup_prps() because the dma_len will drop below zero but the
> > length not.
>
> I think we should also turns this into a WARN_ON_ONCE + error return..
>
> But do you have an exact btrfsprogs version and command line? I do a lot
> of testing that involves mkfs.btrfs on nvme and haven't seen it..
Ah one detail I forgot: mkfs.xfs _does_ work. Haven't checked ext4 though.
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy
2017-04-13 8:06 [PATCH] block: bios with an offset are always gappy Johannes Thumshirn
2017-04-13 9:48 ` Christoph Hellwig
@ 2017-04-13 10:02 ` Ming Lei
2017-04-13 10:10 ` Johannes Thumshirn
2017-04-13 11:53 ` Johannes Thumshirn
1 sibling, 2 replies; 13+ messages in thread
From: Ming Lei @ 2017-04-13 10:02 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke,
Christoph Hellwig, Linux Block Layer Mailinglist,
Linux Kernel Mailinglist
On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> in nvme_setup_prps() because the dma_len will drop below zero but the
> length not.
Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned
or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact
mkfs command line and size of your emulated NVMe?
>
> A git bisect tracked the behaviour down to commit 729204ef49ec ("block:
> relax check on sg gap"). Since commit 729204ef49ec a bio's offsets are not
> taken into account in the decision if the bio will gap any more. Restore
> the old behavior of checking bio offsets as well for the decision if a
> bio will gap.
>
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> Fixes: 729204ef49ec ("block: relax check on sg gap")
> Cc: Ming Lei <ming.lei@redhat.com>
> ---
> include/linux/blkdev.h | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7548f332121a..a03b7196209e 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1677,11 +1677,14 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
> {
> if (bio_has_data(prev) && queue_virt_boundary(q)) {
> struct bio_vec pb, nb;
> + bool offset;
>
> bio_get_last_bvec(prev, &pb);
> bio_get_first_bvec(next, &nb);
>
> - if (!bios_segs_mergeable(q, prev, &pb, &nb))
> + offset = pb.bv_offset || nb.bv_offset;
We don't consider pb's offset here, because if pb.bv_offset
isn't zero, pb should be the only bvec in 'prev' and will be
put in a standalone segement, and we still can make 'nb' into
this segment if both are mergeable.
But the following issue might be caused by commit 729204ef49ec
("block: relax check on sg gap"):
- if the 'next' has more than one segment
- the segment merged from 'pb' and the 1st segment of 'next'
may end at un-aligned virt boundary
Could you try the following patch to see if it fixes your issue?
---
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7548f332121a..65d1510681c6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1659,16 +1659,28 @@ static inline bool bvec_gap_to_prev(struct request_queue *q,
* and the 1st bvec in the 2nd bio can be handled in one segment.
*/
static inline bool bios_segs_mergeable(struct request_queue *q,
- struct bio *prev, struct bio_vec *prev_last_bv,
+ struct bio *prev, struct bio *next,
+ struct bio_vec *prev_last_bv,
struct bio_vec *next_first_bv)
{
if (!BIOVEC_PHYS_MERGEABLE(prev_last_bv, next_first_bv))
return false;
if (!BIOVEC_SEG_BOUNDARY(q, prev_last_bv, next_first_bv))
return false;
- if (prev->bi_seg_back_size + next_first_bv->bv_len >
+ if (prev->bi_seg_back_size + next->bi_seg_front_size >
queue_max_segment_size(q))
return false;
+
+ /*
+ * if 'next' has multiple segments, we need to make
+ * sure the merged segment from 'pb' and the 1st segment
+ * of 'next' ends at aligned virt boundary.
+ */
+ if ((next->bi_seg_front_size < next->bi_iter.bi_size) &&
+ ((prev_last_bv->bv_offset + prev_last_bv->bv_len +
+ next->bi_seg_front_size) & queue_virt_boundary(q)))
+ return false;
+
return true;
}
@@ -1681,7 +1693,7 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
bio_get_last_bvec(prev, &pb);
bio_get_first_bvec(next, &nb);
- if (!bios_segs_mergeable(q, prev, &pb, &nb))
+ if (!bios_segs_mergeable(q, prev, next, &pb, &nb))
return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
}
--
Ming
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy
2017-04-13 10:02 ` Ming Lei
@ 2017-04-13 10:10 ` Johannes Thumshirn
2017-04-13 11:53 ` Johannes Thumshirn
1 sibling, 0 replies; 13+ messages in thread
From: Johannes Thumshirn @ 2017-04-13 10:10 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke,
Christoph Hellwig, Linux Block Layer Mailinglist,
Linux Kernel Mailinglist
On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote:
> Could you try the following patch to see if it fixes your issue?
Sure, jsut have a short lunch break and then I'll report back.
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy
2017-04-13 10:02 ` Ming Lei
2017-04-13 10:10 ` Johannes Thumshirn
@ 2017-04-13 11:53 ` Johannes Thumshirn
2017-04-13 12:11 ` Ming Lei
2017-04-13 14:45 ` Ming Lei
1 sibling, 2 replies; 13+ messages in thread
From: Johannes Thumshirn @ 2017-04-13 11:53 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke,
Christoph Hellwig, Linux Block Layer Mailinglist,
Linux Kernel Mailinglist
On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote:
> On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> > in nvme_setup_prps() because the dma_len will drop below zero but the
> > length not.
>
> Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned
> or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact
> mkfs command line and size of your emulated NVMe?
the exact cmdline is mkfs.btrfs -f /dev/nvme0n1p1 (-f because there was a
existing btrfs on the image). The image is 17179869184 (a.k.a 16G) bytes.
[...]
> Could you try the following patch to see if it fixes your issue?
It's back to the old, erratic behaviour, see log below.
>
> ---
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7548f332121a..65d1510681c6 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1659,16 +1659,28 @@ static inline bool bvec_gap_to_prev(struct request_queue *q,
> * and the 1st bvec in the 2nd bio can be handled in one segment.
> */
> static inline bool bios_segs_mergeable(struct request_queue *q,
> - struct bio *prev, struct bio_vec *prev_last_bv,
> + struct bio *prev, struct bio *next,
> + struct bio_vec *prev_last_bv,
> struct bio_vec *next_first_bv)
> {
> if (!BIOVEC_PHYS_MERGEABLE(prev_last_bv, next_first_bv))
> return false;
> if (!BIOVEC_SEG_BOUNDARY(q, prev_last_bv, next_first_bv))
> return false;
> - if (prev->bi_seg_back_size + next_first_bv->bv_len >
> + if (prev->bi_seg_back_size + next->bi_seg_front_size >
> queue_max_segment_size(q))
> return false;
> +
> + /*
> + * if 'next' has multiple segments, we need to make
> + * sure the merged segment from 'pb' and the 1st segment
> + * of 'next' ends at aligned virt boundary.
> + */
> + if ((next->bi_seg_front_size < next->bi_iter.bi_size) &&
> + ((prev_last_bv->bv_offset + prev_last_bv->bv_len +
> + next->bi_seg_front_size) & queue_virt_boundary(q)))
> + return false;
> +
> return true;
> }
>
> @@ -1681,7 +1693,7 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
> bio_get_last_bvec(prev, &pb);
> bio_get_first_bvec(next, &nb);
>
> - if (!bios_segs_mergeable(q, prev, &pb, &nb))
> + if (!bios_segs_mergeable(q, prev, next, &pb, &nb))
> return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
> }
dracut:/# [ 1.211567] tsc: Refined TSC clocksource calibration: 2297.338 MHz
[ 1.212601] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x211d6274d86, max_idle_ns: 440795243673 ns
dracut:/# modprobe btrfs
[ 8.139179] raid6_pq: module verification failed: signature and/or required key missing - tainting kernel
[ 8.207509] raid6: sse2x1 gen() 6827 MB/s
[ 8.275512] raid6: sse2x1 xor() 5654 MB/s
[ 8.343507] raid6: sse2x2 gen() 11573 MB/s
[ 8.411503] raid6: sse2x2 xor() 8826 MB/s
[ 8.479504] raid6: sse2x4 gen() 14794 MB/s
[ 8.547504] raid6: sse2x4 xor() 10618 MB/s
[ 8.547830] raid6: using algorithm sse2x4 gen() 14794 MB/s
[ 8.548218] raid6: .... xor() 10618 MB/s, rmw enabled
[ 8.548558] raid6: using intx1 recovery algorithm
[ 8.549341] xor: measuring software checksum speed
[ 8.587533] prefetch64-sse: 15090.000 MB/sec
[ 8.627553] generic_sse: 13530.000 MB/sec
[ 8.627945] xor: using function: prefetch64-sse (15090.000 MB/sec)
[ 8.633795] Btrfs loaded, crc32c=crc32c-generic, assert=on
dracut:/# modprobe nvme
[ 12.348762] nvme nvme0: pci function 0000:00:04.0
[ 12.386300] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
dracut:/# [ 12.391707] nvme0n1: p1
dracut:/# mkfs.b[ 36.553376] random: fast init done
tr
dracut:/# mkfs.btrfs -f /dev/nvme0n1p1
btrfs-progs v4.5.3+20160729
See http://btrfs.wiki.kernel.org for more information.
Detected a SSD, turning off metadata duplication. Mkfs with -m dup if you want to force metadata duplication.
[ 46.696671] ------------[ cut here ]------------
[ 46.697338] kernel BUG at drivers/nvme/host/pci.c:494!
[ 46.697806] invalid opcode: 0000 [#1] SMP
[ 46.698175] Modules linked in: nvme(E) nvme_core(E) btrfs(E) xor(E) raid6_pq(E)
[ 46.698879] CPU: 1 PID: 18 Comm: kworker/1:0H Tainted: G E 4.11.0-rc6-default+ #43
[ 46.699686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 46.700737] Workqueue: kblockd blk_mq_run_work_fn
[ 46.701169] task: ffff88007bd24540 task.stack: ffffc900003bc000
[ 46.701709] RIP: 0010:nvme_queue_rq+0x85d/0x886 [nvme]
[ 46.702185] RSP: 0018:ffffc900003bfc78 EFLAGS: 00010286
[ 46.702670] RAX: 0000000000000078 RBX: 0000000000001000 RCX: 000000007f625000
[ 46.703318] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[ 46.703968] RBP: ffffc900003bfd50 R08: 000000000013ee00 R09: 0000000000001000
[ 46.704624] R10: ffff88007f1ed000 R11: ffff88007f220000 R12: ffff88007f1ed000
[ 46.705276] R13: 00000000fffffe00 R14: 0000000000000010 R15: 000000000012fe00
[ 46.705927] FS: 0000000000000000(0000) GS:ffff88007ea80000(0000) knlGS:0000000000000000
[ 46.706673] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 46.707199] CR2: 00007ffdd1294000 CR3: 000000007b742000 CR4: 00000000000006e0
[ 46.707846] Call Trace:
[ 46.708082] blk_mq_dispatch_rq_list+0x2a0/0x3d0
[ 46.708510] blk_mq_sched_dispatch_requests+0x138/0x160
[ 46.708991] __blk_mq_run_hw_queue+0x8c/0xa0
[ 46.709407] blk_mq_run_work_fn+0x12/0x20
[ 46.709781] process_one_work+0x153/0x400
[ 46.710152] worker_thread+0x12b/0x4b0
[ 46.711698] kthread+0x109/0x140
[ 46.712013] ? rescuer_thread+0x340/0x340
[ 46.712391] ? kthread_park+0x90/0x90
[ 46.712741] ret_from_fork+0x2c/0x40
[ 46.713081] Code: 01 00 48 8b 40 10 48 89 45 a8 49 8b 87 70 01 00 00 48 89 45 b0 0f 84 3e fa ff ff 49 8b 87 88 01 00 00 48 89 45 a0 e9 2e fa ff ff <0f> 0b 4c 8b 0d e2 35 8f e1 eb 80 0f 0b 4c 89 ef c6 07 00 0f 1f
[ 46.714861] RIP: nvme_queue_rq+0x85d/0x886 [nvme] RSP: ffffc900003bfc78
[ 46.715810] ---[ end trace 280a594163a124fb ]---
[ 46.796265] ------------[ cut here ]------------
Thanks,
Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy
2017-04-13 11:53 ` Johannes Thumshirn
@ 2017-04-13 12:11 ` Ming Lei
[not found] ` <20170413122010.GJ6734@linux-x5ow.site>
2017-04-13 14:45 ` Ming Lei
1 sibling, 1 reply; 13+ messages in thread
From: Ming Lei @ 2017-04-13 12:11 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke,
Christoph Hellwig, Linux Block Layer Mailinglist,
Linux Kernel Mailinglist
[-- Attachment #1: Type: text/plain, Size: 1032 bytes --]
On Thu, Apr 13, 2017 at 01:53:28PM +0200, Johannes Thumshirn wrote:
> On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote:
> > On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> > > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> > > in nvme_setup_prps() because the dma_len will drop below zero but the
> > > length not.
> >
> > Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned
> > or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact
> > mkfs command line and size of your emulated NVMe?
>
> the exact cmdline is mkfs.btrfs -f /dev/nvme0n1p1 (-f because there was a
> existing btrfs on the image). The image is 17179869184 (a.k.a 16G) bytes.
>
> [...]
>
> > Could you try the following patch to see if it fixes your issue?
>
> It's back to the old, erratic behaviour, see log below.
Ok, could you apply the attached debug patch and collect the
ftrace log? (ftrace_dump_on_oops need to be passed to kernel cmd line).
Thanks,
Ming
[-- Attachment #2: dump-rq.patch --]
[-- Type: text/plain, Size: 2383 bytes --]
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 26a5fd05fe88..a813a36d48d9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -491,6 +491,8 @@ static bool nvme_setup_prps(struct nvme_dev *dev, struct request *req)
break;
if (dma_len > 0)
continue;
+ if (dma_len < 0)
+ blk_dump_rq(req, "nvme dma sg gap");
BUG_ON(dma_len < 0);
sg = sg_next(sg);
dma_addr = sg_dma_address(sg);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..f3b001e401d2 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -811,5 +811,29 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
#endif /* CONFIG_BLK_DEV_INTEGRITY */
+static inline void blk_dump_bio(struct bio *bio, const char *msg)
+{
+ struct bvec_iter iter;
+ struct bio_vec bvec;
+ int i = 0;
+ unsigned sectors = 0;
+
+ trace_printk("%s-%p: %hx/%hx %u %llu %u\n",
+ msg, bio,
+ bio->bi_flags, bio->bi_opf,
+ bio->bi_phys_segments,
+ (unsigned long long)bio->bi_iter.bi_sector,
+ bio->bi_iter.bi_size);
+ bio_for_each_segment(bvec, bio, iter) {
+ sectors += bvec.bv_len >> 9;
+ trace_printk("\t %d: %lu %u %u(%u)\n", i++,
+ (unsigned long)page_to_pfn(bvec.bv_page),
+ bvec.bv_offset,
+ bvec.bv_len, bvec.bv_len >> 12);
+ }
+ trace_printk("\t total sectors %u\n", sectors);
+}
+
+
#endif /* CONFIG_BLOCK */
#endif /* __LINUX_BIO_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7548f332121a..b75d6fe5a1b9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1698,6 +1698,22 @@ static inline bool req_gap_front_merge(struct request *req, struct bio *bio)
return bio_will_gap(req->q, bio, req->bio);
}
+static inline void blk_dump_rq(const struct request *req, const char *msg)
+{
+ struct bio *bio;
+ int i = 0;
+
+ trace_printk("%s: dump bvec for %p(f:%x, seg: %d)\n",
+ msg, req, req->cmd_flags,
+ req->nr_phys_segments);
+
+ __rq_for_each_bio(bio, req) {
+ char num[16];
+ snprintf(num, 16, "%d", i++);
+ blk_dump_bio(bio, num);
+ }
+}
+
int kblockd_schedule_work(struct work_struct *work);
int kblockd_schedule_work_on(int cpu, struct work_struct *work);
int kblockd_schedule_delayed_work(struct delayed_work *dwork, unsigned long delay);
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy
[not found] ` <20170413122010.GJ6734@linux-x5ow.site>
@ 2017-04-13 13:44 ` Ming Lei
0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2017-04-13 13:44 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke,
Christoph Hellwig, Linux Block Layer Mailinglist,
Linux Kernel Mailinglist
On Thu, Apr 13, 2017 at 02:20:10PM +0200, Johannes Thumshirn wrote:
> On Thu, Apr 13, 2017 at 08:11:40PM +0800, Ming Lei wrote:
> > Ok, could you apply the attached debug patch and collect the
> > ftrace log? (ftrace_dump_on_oops need to be passed to kernel cmd line).
>
Thanks Johannes very much for collecting the log.
The root cause should be the following:
- if 'offset' of one segment isn't zero, we can't make that segment
at full size, otherwise the segment ends in the unaligned
virt boundary.
Give me a while, I will figure out one patch to address the issue.
Thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy
2017-04-13 11:53 ` Johannes Thumshirn
2017-04-13 12:11 ` Ming Lei
@ 2017-04-13 14:45 ` Ming Lei
2017-04-13 14:50 ` Johannes Thumshirn
2017-04-13 20:35 ` Andreas Mohr
1 sibling, 2 replies; 13+ messages in thread
From: Ming Lei @ 2017-04-13 14:45 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke,
Christoph Hellwig, Linux Block Layer Mailinglist,
Linux Kernel Mailinglist
On Thu, Apr 13, 2017 at 01:53:28PM +0200, Johannes Thumshirn wrote:
> On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote:
> > On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> > > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> > > in nvme_setup_prps() because the dma_len will drop below zero but the
> > > length not.
> >
> > Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned
> > or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact
> > mkfs command line and size of your emulated NVMe?
>
> the exact cmdline is mkfs.btrfs -f /dev/nvme0n1p1 (-f because there was a
> existing btrfs on the image). The image is 17179869184 (a.k.a 16G) bytes.
>
> [...]
>
> > Could you try the following patch to see if it fixes your issue?
>
> It's back to the old, erratic behaviour, see log below.
Johannes, could you test the following patch?
Thanks
Ming
---
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b75d6fe5a1b9..cc68dfaef528 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1672,12 +1672,27 @@ static inline bool bios_segs_mergeable(struct request_queue *q,
return true;
}
-static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
- struct bio *next)
+static inline bool bio_will_gap(struct request_queue *q,
+ struct request *prev_rq,
+ struct bio *prev,
+ struct bio *next)
{
if (bio_has_data(prev) && queue_virt_boundary(q)) {
struct bio_vec pb, nb;
+ /*
+ * don't merge if the 1st bio starts with non-zero
+ * offset, otherwise it is quite difficult to respect
+ * sg gap limit. We work hard to merge huge of small
+ * bios in case of mkfs.
+ */
+ if (prev_rq)
+ bio_get_first_bvec(prev_rq->bio, &pb);
+ else
+ bio_get_first_bvec(prev, &pb);
+ if (pb.bv_offset)
+ return true;
+
bio_get_last_bvec(prev, &pb);
bio_get_first_bvec(next, &nb);
@@ -1690,12 +1705,12 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
static inline bool req_gap_back_merge(struct request *req, struct bio *bio)
{
- return bio_will_gap(req->q, req->biotail, bio);
+ return bio_will_gap(req->q, req, req->biotail, bio);
}
static inline bool req_gap_front_merge(struct request *req, struct bio *bio)
{
- return bio_will_gap(req->q, bio, req->bio);
+ return bio_will_gap(req->q, NULL, bio, req->bio);
}
static inline void blk_dump_rq(const struct request *req, const char *msg)
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy
2017-04-13 14:45 ` Ming Lei
@ 2017-04-13 14:50 ` Johannes Thumshirn
2017-04-13 20:35 ` Andreas Mohr
1 sibling, 0 replies; 13+ messages in thread
From: Johannes Thumshirn @ 2017-04-13 14:50 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke,
Christoph Hellwig, Linux Block Layer Mailinglist,
Linux Kernel Mailinglist
On Thu, Apr 13, 2017 at 10:45:10PM +0800, Ming Lei wrote:
> On Thu, Apr 13, 2017 at 01:53:28PM +0200, Johannes Thumshirn wrote:
> > On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote:
> > > On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> > > > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> > > > in nvme_setup_prps() because the dma_len will drop below zero but the
> > > > length not.
> > >
> > > Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned
> > > or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact
> > > mkfs command line and size of your emulated NVMe?
> >
> > the exact cmdline is mkfs.btrfs -f /dev/nvme0n1p1 (-f because there was a
> > existing btrfs on the image). The image is 17179869184 (a.k.a 16G) bytes.
> >
> > [...]
> >
> > > Could you try the following patch to see if it fixes your issue?
> >
> > It's back to the old, erratic behaviour, see log below.
>
> Johannes, could you test the following patch?
>
> Thanks
> Ming
Works, awesome thanks!
Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy
2017-04-13 14:45 ` Ming Lei
2017-04-13 14:50 ` Johannes Thumshirn
@ 2017-04-13 20:35 ` Andreas Mohr
2017-04-14 1:15 ` Ming Lei
1 sibling, 1 reply; 13+ messages in thread
From: Andreas Mohr @ 2017-04-13 20:35 UTC (permalink / raw)
To: Ming Lei
Cc: Johannes Thumshirn, Jens Axboe, Omar Sandoval, Bart Van Assche,
Hannes Reinecke, Christoph Hellwig,
Linux Block Layer Mailinglist, Linux Kernel Mailinglist
On Thu, Apr 13, 2017 at 10:45:10PM +0800, Ming Lei wrote:
> + /*
> + * don't merge if the 1st bio starts with non-zero
> + * offset, otherwise it is quite difficult to respect
> + * sg gap limit. We work hard to merge huge of small
> + * bios in case of mkfs.
"huge of small bios in case" - -ENOPARSE.
"huge numbers of"?
"huge or small"?
HTH,
Andreas Mohr
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy
2017-04-13 20:35 ` Andreas Mohr
@ 2017-04-14 1:15 ` Ming Lei
0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2017-04-14 1:15 UTC (permalink / raw)
To: Andreas Mohr
Cc: Johannes Thumshirn, Jens Axboe, Omar Sandoval, Bart Van Assche,
Hannes Reinecke, Christoph Hellwig,
Linux Block Layer Mailinglist, Linux Kernel Mailinglist
On Thu, Apr 13, 2017 at 10:35:17PM +0200, Andreas Mohr wrote:
> On Thu, Apr 13, 2017 at 10:45:10PM +0800, Ming Lei wrote:
> > + /*
> > + * don't merge if the 1st bio starts with non-zero
> > + * offset, otherwise it is quite difficult to respect
> > + * sg gap limit. We work hard to merge huge of small
> > + * bios in case of mkfs.
>
> "huge of small bios in case" - -ENOPARSE.
>
> "huge numbers of"?
> "huge or small"?
OK, it should be tons of or sort of description. As you can see
in the trace, there are 2560 bios merged in one request.
Thanks,
Ming
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2017-04-14 1:15 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-13 8:06 [PATCH] block: bios with an offset are always gappy Johannes Thumshirn
2017-04-13 9:48 ` Christoph Hellwig
2017-04-13 9:56 ` Johannes Thumshirn
2017-04-13 10:01 ` Johannes Thumshirn
2017-04-13 10:02 ` Ming Lei
2017-04-13 10:10 ` Johannes Thumshirn
2017-04-13 11:53 ` Johannes Thumshirn
2017-04-13 12:11 ` Ming Lei
[not found] ` <20170413122010.GJ6734@linux-x5ow.site>
2017-04-13 13:44 ` Ming Lei
2017-04-13 14:45 ` Ming Lei
2017-04-13 14:50 ` Johannes Thumshirn
2017-04-13 20:35 ` Andreas Mohr
2017-04-14 1:15 ` Ming Lei
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).