* [RFC PATCH] Workaround for discard on non-conformant nvme devices
@ 2019-11-04 21:47 Eduard Hasenleithner
2019-11-06 16:52 ` Sagi Grimberg
0 siblings, 1 reply; 7+ messages in thread
From: Eduard Hasenleithner @ 2019-11-04 21:47 UTC (permalink / raw)
To: linux-nvme
As documented in https://bugzilla.kernel.org/show_bug.cgi?id=202665
there are lots of Linux nvme users which get IO-MMU related errors when
performing discard on nvme. So far analysis suggests that the errors are
caused by non-conformant nvme devices which are reading beyond the end of
the buffer containing the segments to be discarded.
Until now two different variants of this behavior have been observed:
The controller found on an Intel 660p always reads a multiple of 512
bytes. If the last chunk exceeds a page it continues with the subsequent
page. For a Corsair MP510 the situation is even worse: The controller
always reads a full page (4096 bytes). Then when the address is not
aligned to 4096 it will continue reading at the address given in PRP2
(which is most of the time 0).
This patch makes the nvme_setup_discard function always request a
multiple of a page size (4096) from the kernel for storing the segment
array. Since this makes the buffer always page-aligned the device
reading beyond end of a page is avoided.
The patch is based on the linux-5.3.7 tarball. Note that the patch itself
is not tested yet; for my tests some time ago I simply hard-coded 256
segments. For now this email is meant to inform the nvme kernel
developers about the topic.
Signed-off-by: Eduard Hasenleithner <eduard@hasenleithner.at>
--- drivers/nvme/host/core.c.orig 2019-11-04 21:53:20.758837001 +0100
+++ drivers/nvme/host/core.c 2019-11-04 22:05:54.409415849 +0100
@@ -561,9 +561,9 @@ static blk_status_t nvme_setup_discard(s
unsigned short segments = blk_rq_nr_discard_segments(req), n = 0;
struct nvme_dsm_range *range;
struct bio *bio;
+ size_t aligned_size = round_up(sizeof *range * segments, 4096);
- range = kmalloc_array(segments, sizeof(*range),
- GFP_ATOMIC | __GFP_NOWARN);
+ range = kmalloc(aligned_size, GFP_ATOMIC | __GFP_NOWARN);
if (!range) {
/*
* If we fail allocation our range, fallback to the controller
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
* Re: [RFC PATCH] Workaround for discard on non-conformant nvme devices
2019-11-04 21:47 [RFC PATCH] Workaround for discard on non-conformant nvme devices Eduard Hasenleithner
@ 2019-11-06 16:52 ` Sagi Grimberg
2019-11-06 18:23 ` Keith Busch
0 siblings, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2019-11-06 16:52 UTC (permalink / raw)
To: Eduard Hasenleithner, linux-nvme
> As documented in https://bugzilla.kernel.org/show_bug.cgi?id=202665
> there are lots of Linux nvme users which get IO-MMU related errors when
> performing discard on nvme. So far analysis suggests that the errors are
> caused by non-conformant nvme devices which are reading beyond the end of
> the buffer containing the segments to be discarded.
>
> Until now two different variants of this behavior have been observed:
> The controller found on an Intel 660p always reads a multiple of 512
> bytes. If the last chunk exceeds a page it continues with the subsequent
> page. For a Corsair MP510 the situation is even worse: The controller
> always reads a full page (4096 bytes). Then when the address is not
> aligned to 4096 it will continue reading at the address given in PRP2
> (which is most of the time 0).
>
> This patch makes the nvme_setup_discard function always request a
> multiple of a page size (4096) from the kernel for storing the segment
> array. Since this makes the buffer always page-aligned the device
> reading beyond end of a page is avoided.
This needs to come with a quirk for it.
* Re: [RFC PATCH] Workaround for discard on non-conformant nvme devices
2019-11-06 16:52 ` Sagi Grimberg
@ 2019-11-06 18:23 ` Keith Busch
2019-11-06 20:22 ` Eduard Hasenleithner
0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2019-11-06 18:23 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-nvme, Eduard Hasenleithner
On Wed, Nov 06, 2019 at 08:52:20AM -0800, Sagi Grimberg wrote:
> > As documented in https://bugzilla.kernel.org/show_bug.cgi?id=202665
> > there are lots of Linux nvme users which get IO-MMU related errors when
> > performing discard on nvme. So far analysis suggests that the errors are
> > caused by non-conformant nvme devices which are reading beyond the end of
> > the buffer containing the segments to be discarded.
> >
> > Until now two different variants of this behavior have been observed:
> > The controller found on an Intel 660p always reads a multiple of 512
> > bytes. If the last chunk exceeds a page it continues with the subsequent
> > page. For a Corsair MP510 the situation is even worse: The controller
> > always reads a full page (4096 bytes). Then when the address is not
> > aligned to 4096 it will continue reading at the address given in PRP2
> > (which is most of the time 0).
> >
> > This patch makes the nvme_setup_discard function always request a
> > multiple of a page size (4096) from the kernel for storing the segment
> > array. Since this makes the buffer always page-aligned the device
> > reading beyond end of a page is avoided.
>
> This needs to come with a quirk for it.
Yeah, and have the broken ones use the ctrl->discard_page so that you
don't need to worry about the alignment of the address kmalloc returns.
* Re: [RFC PATCH] Workaround for discard on non-conformant nvme devices
2019-11-06 18:23 ` Keith Busch
@ 2019-11-06 20:22 ` Eduard Hasenleithner
2019-11-06 20:43 ` Keith Busch
0 siblings, 1 reply; 7+ messages in thread
From: Eduard Hasenleithner @ 2019-11-06 20:22 UTC (permalink / raw)
To: Keith Busch, Sagi Grimberg; +Cc: linux-nvme
On 06.11.19 19:23, Keith Busch wrote:
>> This needs to come with a quirk for it.
>
> Yeah, and have the broken ones use the ctrl->discard_page so that you
> don't need to worry about the alignment of the address kmalloc returns.
This raises some questions for me.
Why bother maintaining a quirk list and adding the complexity of
separate code paths?
Couldn't we just simply use the discard_page for all discards?
Are there even nvme devices out there which have conformant behavior in
this respect?
* Re: [RFC PATCH] Workaround for discard on non-conformant nvme devices
2019-11-06 20:22 ` Eduard Hasenleithner
@ 2019-11-06 20:43 ` Keith Busch
2019-11-06 21:10 ` Eduard Hasenleithner
0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2019-11-06 20:43 UTC (permalink / raw)
To: Eduard Hasenleithner; +Cc: Sagi Grimberg, linux-nvme
On Wed, Nov 06, 2019 at 09:22:00PM +0100, Eduard Hasenleithner wrote:
> On 06.11.19 19:23, Keith Busch wrote:
> > > This needs to come with a quirk for it.
> >
> > Yeah, and have the broken ones use the ctrl->discard_page so that you
> > don't need to worry about the alignment of the address kmalloc returns.
>
> This raises some questions for me.
> Why bother maintaining a quirk list and adding the complexity of
> separate code paths?
It documents in the code why the driver is deviating from spec-compliant
behavior.
> Couldn't we just simply use the discard_page for all discards?
That's normally just a fallback to ensure forward progress under memory
pressure. It's not particularly performant, though.
> Are there even nvme devices out there which have conformant behavior in this
> respect?
I believe most of them do conform, but I don't have a large sample size
to confirm.
* Re: [RFC PATCH] Workaround for discard on non-conformant nvme devices
2019-11-06 20:43 ` Keith Busch
@ 2019-11-06 21:10 ` Eduard Hasenleithner
2019-11-06 21:34 ` Keith Busch
0 siblings, 1 reply; 7+ messages in thread
From: Eduard Hasenleithner @ 2019-11-06 21:10 UTC (permalink / raw)
To: Keith Busch; +Cc: Sagi Grimberg, linux-nvme
On 06.11.19 21:43, Keith Busch wrote:
> On Wed, Nov 06, 2019 at 09:22:00PM +0100, Eduard Hasenleithner wrote:
>> Couldn't we just simply use the discard_page for all discards?
>
> That's normally just a fallback to ensure forward progress under memory
> pressure. It's not particularly performant, though.
So we are accepting a considerable performance impact for the "broken"
devices?
>> Are there even nvme devices out there which have conformant behavior in this
>> respect?
>
> I believe most of them do conform, but I don't have a large sample size
> to confirm.
Do you have an IO-MMU active on your setup? In my setup it is active and
the nvme device is attached directly to the CPU, so the CPU knows
exactly which initiator issued the access, and the IO-MMU page table
defines with page granularity which memory the nvme device may read or
write. I'm under the impression that slightly older setups may be less
restrictive wrt DMA IO.
Another aspect is that vendors probably only test with Windows, and I'm
guessing that Windows doesn't bother maintaining anything but page-sized,
page-aligned memory for issuing discard commands.
* Re: [RFC PATCH] Workaround for discard on non-conformant nvme devices
2019-11-06 21:10 ` Eduard Hasenleithner
@ 2019-11-06 21:34 ` Keith Busch
0 siblings, 0 replies; 7+ messages in thread
From: Keith Busch @ 2019-11-06 21:34 UTC (permalink / raw)
To: Eduard Hasenleithner; +Cc: Sagi Grimberg, linux-nvme
On Wed, Nov 06, 2019 at 10:10:58PM +0100, Eduard Hasenleithner wrote:
> On 06.11.19 21:43, Keith Busch wrote:
> > On Wed, Nov 06, 2019 at 09:22:00PM +0100, Eduard Hasenleithner wrote:
> > > Couldn't we just simply use the discard_page for all discards?
> >
> > That's normally just a fallback to ensure forward progress under memory
> > pressure. It's not particularly performant, though.
>
> So we are accepting a considerable performance impact for the "broken"
> devices?
I'd wager the devices that got this wrong aren't operating on DSMs beyond
QD1 anyway, so the impact to them is probably much less than "considerable".
> > > Are there even nvme devices out there which have conformant behavior in this
> > > respect?
> >
> > I believe most of them do conform, but I don't have a large sample size
> > to confirm.
>
> Do you have an IO-MMU active on your setup?
Sometimes, but that's not why I know at least some devices are
compliant. I've wasted too many hours analyzing protocol traces.