* nvme: controller resets
@ 2015-11-10 14:30 Stephan Günther
  2015-11-10 15:51 ` Keith Busch
From: Stephan Günther @ 2015-11-10 14:30 UTC (permalink / raw)


Hello,

recently we submitted a small patch that enabled support for the Apple
NVMe controller. More testing revealed some interesting behavior we
cannot explain:

1) Formatting a partition as vfat or ext2 works fine and, so far,
arbitrary loads are handled correctly by the controller.

2) ext3/4 fails, but maybe not immediately.

3) mkfs.btrfs fails immediately.

The error is the same every time:
| nvme 0000:03:00.0: Failed status: 3, reset controller
| nvme 0000:03:00.0: Cancelling I/O 38 QID 1
| nvme 0000:03:00.0: Cancelling I/O 39 QID 1
| nvme 0000:03:00.0: Device not ready; aborting reset
| nvme 0000:03:00.0: Device failed to resume
| blk_update_request: I/O error, dev nvme0n1, sector 0
| blk_update_request: I/O error, dev nvme0n1, sector 977104768
| Buffer I/O error on dev nvme0n1p3, logical block 120827120, async page read


While trying to isolate the problem we found that running 'partprobe -d'
also causes the problem.

So we attached strace to determine the failing ioctl/syscall. However,
running 'strace -f partprobe -d' suddenly worked fine. Similarly,
'strace -f mkfs.btrfs' worked. However, mounting the file system caused
the problem again.

Due to the different behavior with and without strace we assume there
could be some kind of race condition.

Any ideas how we can track the problem further?

Best,
Stephan


* nvme: controller resets
  2015-11-10 14:30 nvme: controller resets Stephan Günther
@ 2015-11-10 15:51 ` Keith Busch
  2015-11-10 20:45   ` Stephan Günther
  2015-11-10 22:28   ` Vedant Lath
From: Keith Busch @ 2015-11-10 15:51 UTC (permalink / raw)


On Tue, Nov 10, 2015 at 03:30:43PM +0100, Stephan Günther wrote:
> Hello,
> 
> recently we submitted a small patch that enabled support for the Apple
> NVMe controller. More testing revealed some interesting behavior we
> cannot explain:
> 
> 1) Formatting a partition as vfat or ext2 works fine and, so far,
> arbitrary loads are handled correctly by the controller.
> 
> 2) ext3/4 fails, but maybe not immediately.
> 
> 3) mkfs.btrfs fails immediately.
> 
> The error is the same every time:
> | nvme 0000:03:00.0: Failed status: 3, reset controller
> | nvme 0000:03:00.0: Cancelling I/O 38 QID 1
> | nvme 0000:03:00.0: Cancelling I/O 39 QID 1
> | nvme 0000:03:00.0: Device not ready; aborting reset
> | nvme 0000:03:00.0: Device failed to resume
> | blk_update_request: I/O error, dev nvme0n1, sector 0
> | blk_update_request: I/O error, dev nvme0n1, sector 977104768
> | Buffer I/O error on dev nvme0n1p3, logical block 120827120, async page read

It says the controller asserted an internal failure status, then failed
the reset recovery. Sounds like there are other quirks to this device
you may have to reverse engineer.

> While trying to isolate the problem we found that running 'partprobe -d'
> also causes the problem.
> 
> So we attached strace to determine the failing ioctl/syscall. However,
> running 'strace -f partprobe -d' suddenly worked fine. Similarly,
> 'strace -f mkfs.btrfs' worked. However, mounting the file system caused
> the problem again.
> 
> Due to the different behavior with and without strace we assume there
> could be some kind of race condition.
> 
> Any ideas how we can track the problem further?

Not sure really. Normally I file a f/w bug for this kind of thing. :)

But I'll throw out some potential ideas. Try throttling driver capabilities
and see if anything improves: reduce queue count to 1 and depth to 2
(requires code change).
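
Something like the sketch below is what I have in mind. It is only an
illustrative diff against drivers/nvme/host/pci.c, not a tested patch;
the exact context lines (in particular where nr_io_queues is derived)
may look different in your tree, and the values are debug-only:

@@ static int nvme_dev_map(struct nvme_dev *dev)
 	dev->q_depth = min_t(int, NVME_CAP_MQES(cap) + 1, NVME_Q_DEPTH);
+	/* debug experiment only: 2 entries = 1 usable slot per queue */
+	dev->q_depth = 2;

@@ static int nvme_setup_io_queues(struct nvme_dev *dev)
-	nr_io_queues = num_possible_cpus();
+	nr_io_queues = 1;	/* debug experiment only: a single I/O queue */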

If you're able to recreate with reduced settings, then your controller's
failure can be caused by a single command, and it's hopefully just a
matter of finding that command.

If the problem is not reproducible with reduced settings, then perhaps
it's related to concurrent queue usage or high depth, and you can play
with either to see if you discover anything interesting.

Of course, I could be way off...


* nvme: controller resets
  2015-11-10 15:51 ` Keith Busch
@ 2015-11-10 20:45   ` Stephan Günther
  2015-11-10 21:16     ` Vedant Lath
  2015-11-10 22:28   ` Vedant Lath
From: Stephan Günther @ 2015-11-10 20:45 UTC (permalink / raw)


On 2015/November/10 03:51, Keith Busch wrote:
> On Tue, Nov 10, 2015 at 03:30:43PM +0100, Stephan Günther wrote:
> > Hello,
> > 
> > recently we submitted a small patch that enabled support for the Apple
> > NVMe controller. More testing revealed some interesting behavior we
> > cannot explain:
> > 
> > 1) Formatting a partition as vfat or ext2 works fine and, so far,
> > arbitrary loads are handled correctly by the controller.
> > 
> > 2) ext3/4 fails, but maybe not immediately.
> > 
> > 3) mkfs.btrfs fails immediately.
> > 
> > The error is the same every time:
> > | nvme 0000:03:00.0: Failed status: 3, reset controller
> > | nvme 0000:03:00.0: Cancelling I/O 38 QID 1
> > | nvme 0000:03:00.0: Cancelling I/O 39 QID 1
> > | nvme 0000:03:00.0: Device not ready; aborting reset
> > | nvme 0000:03:00.0: Device failed to resume
> > | blk_update_request: I/O error, dev nvme0n1, sector 0
> > | blk_update_request: I/O error, dev nvme0n1, sector 977104768
> > | Buffer I/O error on dev nvme0n1p3, logical block 120827120, async page read
> 
> It says the controller asserted an internal failure status, then failed
> the reset recovery. Sounds like there are other quirks to this device
> you may have to reverse engineer.

We figured that one out: NVME_CSTS_CFS = Controller Fatal Status ...
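
The reported "Failed status: 3" is simply CSTS with RDY (bit 0) and CFS
(bit 1) set. The driver's polling thread watches for CFS and then forces
the reset, roughly like this (paraphrased from pci.c, not the exact code):

	u32 csts = readl(&dev->bar->csts);

	if (csts & NVME_CSTS_CFS) {
		/* controller signals a fatal internal error */
		dev_warn(dev->dev, "Failed status: %x, reset controller\n", csts);
		/* ... schedule a controller reset ... */
	}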

> 
> > While trying to isolate the problem we found that running 'partprobe -d'
> > also causes the problem.
> > 
> > So we attached strace to determine the failing ioctl/syscall. However,
> > running 'strace -f partprobe -d' suddenly worked fine. Similarly,
> > 'strace -f mkfs.btrfs' worked. However, mounting the file system caused
> > the problem again.
> > 
> > Due to the different behavior with and without strace we assume there
> > could be some kind of race condition.
> > 
> > Any ideas how we can track the problem further?
> 
> Not sure really. Normally I file a f/w bug for this kind of thing. :)

I would file one if there were any hope of an answer...

> 
> But I'll throw out some potential ideas. Try throttling driver capabilities

That's the next thing we will try.

> and see if anything improves: reduce queue count to 1 and depth to 2
> (requires code change).

Reducing the queue count rendered the controller unable to resume. Maybe
we missed something. However, since the errors always hint at QID 1, I
don't think that too many queues are the problem.

Reducing the queue depth to 32/16 resulted in the same error. Reduction 
to 2/2 failed.

> 
> If you're able to recreate with reduced settings, then your controller's
> failure can be caused by a single command, and it's hopefully just a
> matter of finding that command.
> 
> If the problem is not reproducible with reduced settings, then perhaps
> it's related to concurrent queue usage or high depth, and you can play
> with either to see if you discover anything interesting.

Starting the kernel with nr_cpus=1 didn't change anything although race 
conditions are probably still possible due to async signalling or
interrupts.


The only thing that might still explain something: 'nvme show-regs'
suffers from the same problems with readq. If other userspace tools
read the controller's capabilities in a similar way for any reason,
they have to fail as well.

But I know of no reason why, e.g., mkfs.btrfs should do something like
that.
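
For reference, the readq issue boils down to this controller apparently
not handling a single 8-byte MMIO access. The usual workaround is to
split such a read into two 32-bit accesses, roughly like the sketch
below (illustrative only, the helper name is made up):

	/* sketch: read a 64-bit NVMe register as two 32-bit accesses,
	 * low dword first, for controllers that choke on 8-byte reads */
	static u64 nvme_readq_split(const volatile void __iomem *addr)
	{
		u64 lo = readl(addr);
		u64 hi = readl(addr + 4);

		return lo | (hi << 32);
	}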

Best,
Stephan


* nvme: controller resets
  2015-11-10 20:45   ` Stephan Günther
@ 2015-11-10 21:16     ` Vedant Lath
  2015-11-10 21:34       ` Stephan Günther
From: Vedant Lath @ 2015-11-10 21:16 UTC (permalink / raw)


> Starting the kernel with nr_cpus=1 didn't change anything although race
> conditions are probably still possible due to async signalling or
> interrupts.
>
>
> The only thing that might still explain something: 'nvme show-regs'
> suffers from the same problems with readq. If other userspace tools
> read the controller's capabilities in a similar way for any reason,
> they have to fail as well.
>
> But I know of no reason why, e.g., mkfs.btrfs should do something like
> that.

I was able to run mkfs.btrfs successfully two times (without strace),
once with btrfs-progs v3.17 and the second time with btrfs-progs v4.3.
However, my third attempt produced an error similar to yours. So your
successful runs with mkfs.btrfs might not have been related to strace
but might just reflect the intermittent nature of this error.

partprobe always resulted in a crash for me.

I tried nvme-cli to get the error log but that was empty.

> $ sudo ./nvme get-feature /dev/nvme0 -f 0x07 -H
> get-feature: 0x07 (Number of Queues), Current value: 00000000
>         Number of IO Completion Queues Allocated (NCQA): 1
>         Number of IO Submission Queues Allocated (NSQA): 1

Does this mean the SSD supports only one queue? But all features from
0x00 to 0x0B return 0, so I'm not sure if that value is legitimate.

Could you please check if the firmware version of your SSD is the same
as mine? I have posted details given by nvme-cli at
http://pastebin.ubuntu.com/13049434/ .


* nvme: controller resets
  2015-11-10 21:16     ` Vedant Lath
@ 2015-11-10 21:34       ` Stephan Günther
  2015-11-10 21:43         ` Vedant Lath
From: Stephan Günther @ 2015-11-10 21:34 UTC (permalink / raw)


On 2015/November/11 02:46, Vedant Lath wrote:
> I tried nvme-cli to get the error log but that was empty.

Ditto.

> Could you please check if the firmware version of your SSD is the same
> as mine? I have posted details given by nvme-cli at
> http://pastebin.ubuntu.com/13049434/ .

Mine:

| NVME Identify Controller:
| vid     : 0x106b
| ssvid   : 0x106b
| sn      : C025xxxxxxxxxxxxxx
| mn      : APPLE SSD AP0512H
| fr      : 5.841.01

I have reports from other users with firmware 5.831.01 and 5.844.01. Both
have the same problems.

Are you using a MacBook8,1 or a MacBook7,1?


* nvme: controller resets
  2015-11-10 21:34       ` Stephan Günther
@ 2015-11-10 21:43         ` Vedant Lath
  2015-11-10 22:02           ` Stephan Günther
From: Vedant Lath @ 2015-11-10 21:43 UTC (permalink / raw)


On Wed, Nov 11, 2015 at 3:04 AM, Stephan Günther <guenther@tum.de> wrote:
> Mine:
>
> | NVME Identify Controller:
> | vid     : 0x106b
> | ssvid   : 0x106b
> | sn      : C025xxxxxxxxxxxxxx
> | mn      : APPLE SSD AP0512H
> | fr      : 5.841.01
>
> I have reports of other guys with firmware 5.831.01 and 5.844.01. Both
> have the same problems.

That's a different firmware revision than mine. Mine is 7.831.01. But
the minor version and patch number are the same. :/ Maybe the first
digit indicates which MacBook it is built for?

I have a feeling that this SSD's NVMe implementation is more like a
draft implementation of the NVMe standard. I am hopeful that future
firmware upgrades will make it fully compliant with NVMe.

> Are you using a MacBook8,1 or a MacBook7,1?

I am using a MacBookAir7,1. It is the 11" MacBook Air (Early 2015)
model with 256 GB SSD.


* nvme: controller resets
  2015-11-10 21:43         ` Vedant Lath
@ 2015-11-10 22:02           ` Stephan Günther
From: Stephan Günther @ 2015-11-10 22:02 UTC (permalink / raw)


On 2015/November/11 03:13, Vedant Lath wrote:
> On Wed, Nov 11, 2015 at 3:04 AM, Stephan Günther <guenther@tum.de> wrote:
> > Mine:
> >
> > | NVME Identify Controller:
> > | vid     : 0x106b
> > | ssvid   : 0x106b
> > | sn      : C025xxxxxxxxxxxxxx
> > | mn      : APPLE SSD AP0512H
> > | fr      : 5.841.01
> >
> > I have reports of other guys with firmware 5.831.01 and 5.844.01. Both
> > have the same problems.
> 
> That's a different firmware revision than mine. Mine is 7.831.01. But
> the minor version and patch number is same. :/ Maybe the first digit
> is indicative of the MacBook it is built for?

Educated guess since you have a MacBook7,1.

> 
> I have a feeling that this SSD's NVMe implementation is more like a
> draft implementation of the NVMe standard. I am hopeful that future
> firmware upgrades will make it fully compliant with NVMe.

I don't have much hope regarding that. But let's see.

I wrote to opensource at apple.com asking for help in that matter. After
the pleasant surprise with Broadcom/WLAN yesterday I have some hope.

> 
> > Are you using a MacBook8,1 or a MacBook7,1?
> 
> I am using a MacBookAir7,1. It is the 11" MacBook Air (Early 2015)
> model with 256 GB SSD.


* nvme: controller resets
  2015-11-10 15:51 ` Keith Busch
  2015-11-10 20:45   ` Stephan Günther
@ 2015-11-10 22:28   ` Vedant Lath
  2015-11-11 21:56     ` Vedant Lath
From: Vedant Lath @ 2015-11-10 22:28 UTC (permalink / raw)


On Tue, Nov 10, 2015 at 9:21 PM, Keith Busch <keith.busch@intel.com> wrote:
> Not sure really. Normally I file a f/w bug for this kind of thing. :)
>
> But I'll throw out some potential ideas. Try throttling driver capabilities
> and see if anything improves: reduce queue count to 1 and depth to 2
> (requires code change).
>
> If you're able to recreate with reduced settings, then your controller's
> failure can be caused by a single command, and it's hopefully just a
> matter of finding that command.
>
> If the problem is not reproducible with reduced settings, then perhaps
> it's related to concurrent queue usage or high depth, and you can play
> with either to see if you discover anything interesting.
>
> Of course, I could be way off...

Is there any way to monitor all the commands going through the wire?
Wouldn't that help? That would at least tell us which NVMe command
results in a reset, and the flow of the commands leading up to the
reset can give us more context into the error.


* nvme: controller resets
  2015-11-10 22:28   ` Vedant Lath
@ 2015-11-11 21:56     ` Vedant Lath
  2015-11-11 22:09       ` Stephan Günther
  2015-11-11 22:14       ` Keith Busch
From: Vedant Lath @ 2015-11-11 21:56 UTC (permalink / raw)


On Wed, Nov 11, 2015 at 3:58 AM, Vedant Lath <vedant@lath.in> wrote:
> On Tue, Nov 10, 2015 at 9:21 PM, Keith Busch <keith.busch@intel.com> wrote:
>> Not sure really. Normally I file a f/w bug for this kind of thing. :)
>>
>> But I'll throw out some potential ideas. Try throttling driver capabilities
>> and see if anything improves: reduce queue count to 1 and depth to 2
>> (requires code change).
>>
>> If you're able to recreate with reduced settings, then your controller's
>> failure can be caused by a single command, and it's hopefully just a
>> matter of finding that command.
>>
>> If the problem is not reproducible with reduced settings, then perhaps
>> it's related to concurrent queue usage or high depth, and you can play
>> with either to see if you discover anything interesting.
>>
>> Of course, I could be way off...
>
> Is there any way to monitor all the commands going through the wire?
> Wouldn't that help? That would at least tell us which NVMe command
> results in a reset, and the flow of the commands leading up to the
> reset can give us more context into the error.

Reducing I/O queue depth to 2 fixes the crash. Increasing I/O queue
depth to 3 again results in a crash.

I did some debugging with pr_debug() in __nvme_submit_cmd() and a few
other functions and found out that running partprobe -d is a very
simple test case. With I/O queue depth as 3, I got the following
trace:
[...]
[  307.045206] nvme: submit: QID 1 CMD opcode 2 flags 0 cid 0 NSID 1
[  307.045209] nvme: submit: rsvd2 0 metadata 0 prp1 2308943872 prp2 0
slba 48907264 length 0 control 0 dsmgmt 0 reftag 0 apptag 0 appmask 0
[  307.045311] nvme: nvme_finish_cmd called
[  308.474874] nvme: submit: QID 0 CMD opcode C flags 0 cid 0 NSID 0
[  308.474900] nvme: nvme_finish_cmd called
[  308.474908] nvme: async_req_completion called
(completion of add_id)
(run partprobe -d)
[  318.488668] nvme: submit: QID 1 CMD opcode 0 flags 0 cid 1 NSID 1
[  318.497900] nvme: nvme_finish_cmd called
[  318.498318] nvme: submit: QID 1 CMD opcode 0 flags 0 cid 1 NSID 1
[  318.498835] nvme: submit: QID 1 CMD opcode 2 flags 0 cid 0 NSID 1
[  318.498839] nvme: submit: rsvd2 0 metadata 0 prp1 6042226688 prp2 0
slba 76790 length 0 control 0 dsmgmt 0 reftag 0 apptag 0 appmask 0
[  319.472684] nvme 0000:04:00.0: Failed status: 3, reset controller
[  319.472946] nvme 0000:04:00.0: Cancelling I/O 0 QID 1
[  319.472952] nvme 0000:04:00.0: Cancelling I/O 1 QID 1
[  319.472983] nvme: MQES is 1023, compiled Q_DEPTH is 3, configured
q_depth is 3
[  335.069654] nvme 0000:04:00.0: Device not ready; aborting reset
[  335.069679] nvme 0000:04:00.0: Device failed to resume
[  335.071921] blk_update_request: I/O error, dev nvme0n1, sector 0
[  335.071931] blk_update_request: I/O error, dev nvme0n1, sector 614320
[  335.071947] Buffer I/O error on dev nvme0n1p1, logical block 76784,
async page read

With mkfs.btrfs /dev/nvme0n1p5, the last few messages before the crash were:
[  235.496658] nvme: submit: QID 1 CMD opcode 1 flags 0 cid 0 NSID 1
[  235.496668] nvme: nvme_finish_cmd called
[  235.496685] nvme: submit: QID 1 CMD opcode 1 flags 0 cid 1 NSID 1
[  235.496686] nvme: nvme_finish_cmd called
[  235.496696] nvme: submit: QID 1 CMD opcode 1 flags 0 cid 0 NSID 1
[  235.496710] nvme: nvme_finish_cmd called
[  235.496720] nvme: nvme_finish_cmd called
[  235.496731] nvme: submit: QID 1 CMD opcode 0 flags 0 cid 0 NSID 1
[  235.496753] nvme: submit: QID 1 CMD opcode 2 flags 0 cid 1 NSID 1
[  235.496755] nvme: submit: rsvd2 0 metadata 0 prp1 5897535488 prp2
2318221568 slba 41495303 length 0 control 0 dsmgmt 0 reftag 0 apptag 0
appmask 0
[  236.189627] nvme 0000:04:00.0: Failed status: 3, reset controller
[  236.189985] nvme 0000:04:00.0: Cancelling I/O 0 QID 1
[  236.189995] nvme 0000:04:00.0: Cancelling I/O 1 QID 1
[  236.190056] nvme: MQES is 1023, compiled Q_DEPTH is 3, configured
q_depth is 3
[  251.792099] nvme 0000:04:00.0: Device not ready; aborting reset
[  251.792140] nvme 0000:04:00.0: Device failed to resume
[  251.796491] blk_update_request: I/O error, dev nvme0n1, sector 331962424
[  251.796519] blk_update_request: I/O error, dev nvme0n1, sector 0
[  251.798340] Buffer I/O error on dev nvme0n1p5, logical block 7,
async page read

I am not sure when nvme_finish_cmd is called. Does it mean that a
command has completed execution and its request can be freed from the
completion queue?

In both cases, the controller crashes when FLUSH and WRITE commands
are in the queue at the same time.

What could be the best way around this firmware bug? There's no hope
that this will be fixed in a firmware update.


* nvme: controller resets
  2015-11-11 21:56     ` Vedant Lath
@ 2015-11-11 22:09       ` Stephan Günther
  2015-11-12 14:02         ` Vedant Lath
  2015-11-11 22:14       ` Keith Busch
From: Stephan Günther @ 2015-11-11 22:09 UTC (permalink / raw)


On 2015/November/12 03:26, Vedant Lath wrote:
> On Wed, Nov 11, 2015 at 3:58 AM, Vedant Lath <vedant@lath.in> wrote:
> > On Tue, Nov 10, 2015 at 9:21 PM, Keith Busch <keith.busch@intel.com> wrote:
> >> Not sure really. Normally I file a f/w bug for this kind of thing. :)
> >>
> >> But I'll throw out some potential ideas. Try throttling driver capabilities
> >> and see if anything improves: reduce queue count to 1 and depth to 2
> >> (requires code change).
> >>
> >> If you're able to recreate with reduced settings, then your controller's
> >> failure can be caused by a single command, and it's hopefully just a
> >> matter of finding that command.
> >>
> >> If the problem is not reproducible with reduced settings, then perhaps
> >> it's related to concurrent queue usage or high depth, and you can play
> >> with either to see if you discover anything interesting.
> >>
> >> Of course, I could be way off...
> >
> > Is there any way to monitor all the commands going through the wire?
> > Wouldn't that help? That would at least tell us which NVMe command
> > results in a reset, and the flow of the commands leading up to the
> > reset can give us more context into the error.
> 
> Reducing I/O queue depth to 2 fixes the crash. Increasing I/O queue
> depth to 3 again results in a crash.

The device fails to initialize with those settings for me. However, I
think I found the problem:

@@ -2273,7 +2276,7 @@ static void nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid)
        if (dev->stripe_size)
                blk_queue_chunk_sectors(ns->queue, dev->stripe_size >> 9);
        if (dev->vwc & NVME_CTRL_VWC_PRESENT)
-               blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);
+               blk_queue_flush(ns->queue, REQ_FUA);
        blk_queue_virt_boundary(ns->queue, dev->page_size - 1);

        disk->major = nvme_major

With these changes I was able to create a btrfs, copy several GiB of
data, umount, remount, scrub, and balance.

The problem is *not* the flush itself (issuing the ioctl does not
provoke the error). It is either a combination of flush with other
commands or some flags issued together with a flush.


* nvme: controller resets
  2015-11-11 21:56     ` Vedant Lath
  2015-11-11 22:09       ` Stephan Günther
@ 2015-11-11 22:14       ` Keith Busch
  2015-11-12  9:45         ` Vedant Lath
From: Keith Busch @ 2015-11-11 22:14 UTC (permalink / raw)


On Thu, Nov 12, 2015 at 03:26:04AM +0530, Vedant Lath wrote:
> [  235.496753] nvme: submit: QID 1 CMD opcode 2 flags 0 cid 1 NSID 1
> [  235.496755] nvme: submit: rsvd2 0 metadata 0 prp1 5897535488 prp2
> 2318221568 slba 41495303 length 0 control 0 dsmgmt 0 reftag 0 apptag 0
> appmask 0

Let's examine the above command.

You've got PRP1 set to 0x15f854000, and length set to 0 (1 block). Based
on other info provided on this device, a block is 4k.

Seeing PRP1 is 4k aligned and you're transferring 4k of data should
mean you only need one PRP. Your command however shows PRP2 is used and
pointing to a list (must be a list rather than data since the offset is
0x100 aligned).
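
Spelling out the arithmetic from your trace (assuming 4k pages):

  prp1 = 5897535488 = 0x15f854000 -> offset within its 4k page = 0x000
  prp2 = 2318221568 = 0x8a2d4100  -> offset within its 4k page = 0x100

A second data pointer in PRP2 would have to be page aligned (offset 0),
so an offset of 0x100 only makes sense as a pointer into a PRP list
(entry 32 of that list, since 0x100 / 8 = 32).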

Either your new prints are incomplete, or there's a nasty bug somewhere.


* nvme: controller resets
  2015-11-11 22:14       ` Keith Busch
@ 2015-11-12  9:45         ` Vedant Lath
  2015-11-12 11:26           ` Vedant Lath
  2015-11-16 21:33           ` Stephan Günther
From: Vedant Lath @ 2015-11-12  9:45 UTC (permalink / raw)


On Thu, Nov 12, 2015 at 3:44 AM, Keith Busch <keith.busch@intel.com> wrote:
> On Thu, Nov 12, 2015 at 03:26:04AM +0530, Vedant Lath wrote:
>> [  235.496753] nvme: submit: QID 1 CMD opcode 2 flags 0 cid 1 NSID 1
>> [  235.496755] nvme: submit: rsvd2 0 metadata 0 prp1 5897535488 prp2
>> 2318221568 slba 41495303 length 0 control 0 dsmgmt 0 reftag 0 apptag 0
>> appmask 0
>
> Let's examine the above command.
>
> You've got PRP1 set to 0x15f854000, and length set to 0 (1 block). Based
> on other info provided on this device, a block is 4k.
>
> Seeing PRP1 is 4k aligned and you're transferring 4k of data should
> mean you only need one PRP. Your command however shows PRP2 is used and
> pointing to a list (must be a list rather than data since the offset is
> 0x100 aligned).
>
> Either your new prints are incomplete, or there's a nasty bug somewhere.

The print statements seem fine:
@@ -389,6 +402,22 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
 {
        u16 tail = nvmeq->sq_tail;

+       pr_debug("nvme: submit: QID %X CMD opcode %X flags %X cid %X
NSID %X\012", nvmeq->qid, cmd->common.opcode, cmd->common.flags,
cmd->common.command_id, cmd->common.nsid);
+       if (cmd->common.opcode == 2) {
+               pr_debug("nvme: submit: rsvd2 %llu metadata %llu prp1
%llu prp2 %llu slba %llu length %u control %u dsmgmt %d reftag %d
apptag %u appmask %u\012",
+                        cmd->rw.rsvd2,
+                        cmd->rw.metadata,
+                        cmd->rw.prp1,
+                        cmd->rw.prp2,
+                        cmd->rw.slba,
+                        cmd->rw.length,
+                        cmd->rw.control,
+                        cmd->rw.dsmgmt,
+                        cmd->rw.reftag,
+                        cmd->rw.apptag,
+                        cmd->rw.appmask);
+       }
+
        if (nvmeq->sq_cmds_io)
                memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
        else

So it must be a bug somewhere.

This does not look like a device-specific bug so it should be
reproducible on other systems as well. The step to reproduce is:
mkfs.btrfs /dev/nvme0n1px (where x is a partition number on the SSD).
All WRITE commands done due to mkfs.btrfs have non-zero values for
both prp1 and prp2 regardless of length. All of them succeed except
the last one which is just after a FLUSH.


* nvme: controller resets
  2015-11-12  9:45         ` Vedant Lath
@ 2015-11-12 11:26           ` Vedant Lath
  2015-11-16 21:33           ` Stephan Günther
From: Vedant Lath @ 2015-11-12 11:26 UTC (permalink / raw)


On Thu, Nov 12, 2015 at 3:26 AM, Vedant Lath <vedant@lath.in> wrote:
> In both cases, the controller crashes when FLUSH and WRITE commands
> are in the queue at the same time.

On Thu, Nov 12, 2015 at 3:15 PM, Vedant Lath <vedant@lath.in> wrote:
> All WRITE commands done due to mkfs.btrfs have non-zero values for
> both prp1 and prp2 regardless of length. All of them succeed except
> the last one which is just after a FLUSH.

Sorry, I meant READ instead of WRITE here. It can't READ and FLUSH at
the same time.


* nvme: controller resets
  2015-11-11 22:09       ` Stephan Günther
@ 2015-11-12 14:02         ` Vedant Lath
From: Vedant Lath @ 2015-11-12 14:02 UTC (permalink / raw)


On Thu, Nov 12, 2015 at 3:39 AM, Stephan Günther <guenther@tum.de> wrote:
> On 2015/November/12 03:26, Vedant Lath wrote:
>> Reducing I/O queue depth to 2 fixes the crash. Increasing I/O queue
>> depth to 3 again results in a crash.
>
> The device fails to initialize with those settings for me. However, I
> think I found the problem:
>
> @@ -2273,7 +2276,7 @@ static void nvme_alloc_ns(struct nvme_dev *dev, unsigned nsid)
>         if (dev->stripe_size)
>                 blk_queue_chunk_sectors(ns->queue, dev->stripe_size >> 9);
>         if (dev->vwc & NVME_CTRL_VWC_PRESENT)
> -               blk_queue_flush(ns->queue, REQ_FLUSH | REQ_FUA);
> +               blk_queue_flush(ns->queue, REQ_FUA);
>         blk_queue_virt_boundary(ns->queue, dev->page_size - 1);
>
>         disk->major = nvme_major
>
> With these changes I was able to create a btrfs, copy several GiB of
> data, umount, remount, scrub, and balance.
>
> The problem is *not* the flush itself (issuing the ioctl does not
> provoke the error). It is either a combination of flush with other
> commands or some flags issued together with a flush.

If we don't do FLUSH then we risk data corruption on power loss. Even
though get-feature says the volatile write cache is disabled, I observed
data corruption with this patch when doing a power-failure test (using
diskchecker.pl [1][2]). I tested with I/O queue depths of 2 and 1024;
in both cases, in the subsequent boot, I was not able to mount the
btrfs filesystem on the SSD. With FLUSH and an I/O queue depth of 2, I
got only 1 error with diskchecker.pl; I guess that was the last write
before power loss.

I also observed a latency of 8-25 ms when doing FLUSH, which indicates
that FLUSH is not a no-op even when VWC is reported as disabled.

I wonder how the OS X driver handles it.

1: https://gist.github.com/bradfitz/3172656
2: http://brad.livejournal.com/2116715.html


* nvme: controller resets
  2015-11-12  9:45         ` Vedant Lath
  2015-11-12 11:26           ` Vedant Lath
@ 2015-11-16 21:33           ` Stephan Günther
From: Stephan Günther @ 2015-11-16 21:33 UTC (permalink / raw)


On 2015/November/12 03:15, Vedant Lath wrote:
> On Thu, Nov 12, 2015 at 3:44 AM, Keith Busch <keith.busch@intel.com> wrote:
> > On Thu, Nov 12, 2015 at 03:26:04AM +0530, Vedant Lath wrote:
> >> [  235.496753] nvme: submit: QID 1 CMD opcode 2 flags 0 cid 1 NSID 1
> >> [  235.496755] nvme: submit: rsvd2 0 metadata 0 prp1 5897535488 prp2
> >> 2318221568 slba 41495303 length 0 control 0 dsmgmt 0 reftag 0 apptag 0
> >> appmask 0
> >
> > Let's examine the above command.
> >
> > You've got PRP1 set to 0x15f854000, and length set to 0 (1 block). Based
> > on other info provided on this device, a block is 4k.
> >
> > Seeing PRP1 is 4k aligned and you're transferring 4k of data should
> > mean you only need one PRP. Your command however shows PRP2 is used and
> > pointing to a list (must be a list rather than data since the offset is
> > 0x100 aligned).
> >
> > Either your new prints are incomplete, or there's a nasty bug somewhere.
> 
> The print statements seem fine:
> @@ -389,6 +402,22 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
>  {
>         u16 tail = nvmeq->sq_tail;
> 
> +       pr_debug("nvme: submit: QID %X CMD opcode %X flags %X cid %X
> NSID %X\012", nvmeq->qid, cmd->common.opcode, cmd->common.flags,
> cmd->common.command_id, cmd->common.nsid);
> +       if (cmd->common.opcode == 2) {
> +               pr_debug("nvme: submit: rsvd2 %llu metadata %llu prp1
> %llu prp2 %llu slba %llu length %u control %u dsmgmt %d reftag %d
> apptag %u appmask %u\012",
> +                        cmd->rw.rsvd2,
> +                        cmd->rw.metadata,
> +                        cmd->rw.prp1,
> +                        cmd->rw.prp2,
> +                        cmd->rw.slba,
> +                        cmd->rw.length,
> +                        cmd->rw.control,
> +                        cmd->rw.dsmgmt,
> +                        cmd->rw.reftag,
> +                        cmd->rw.apptag,
> +                        cmd->rw.appmask);
> +       }
> +
>         if (nvmeq->sq_cmds_io)
>                 memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
>         else
> 
> So it must be a bug somewhere.
> 
> This does not look like a device-specific bug so it should be
> reproducible on other systems as well. The step to reproduce is:
> mkfs.btrfs /dev/nvme0n1px (where x is a partition number on the SSD).
> All WRITE commands done due to mkfs.btrfs have non-zero values for
> both prp1 and prp2 regardless of length. All of them succeed except
> the last one which is just after a FLUSH.

For the time being I propose the patch below. I tested it over the
weekend and it seems to work better than disabling FLUSHes as I had
proposed previously.

Although this is probably not a final fix, it is the best we can do
right now to both support the Apple controller and avoid data loss, as
far as we currently know.


Background: the Apple NVMe controller seems to be quite compliant with
NVMe, except for two things:

1) It does not handle 64-bit transfers, which has recently been
addressed.

2) There is a serious problem with the combination of queue depth and
FLUSH commands. Either we disable FLUSHes entirely (and lose data on
panics or power loss) or we limit the queue depth to 2 (effectively a
single outstanding command per queue), which is obviously not desirable
but seems to have far less impact on performance than I suspected.

As time is running short, I propose this fix for now. Should it become
permanent, it should probably become a quirk. If we come up with new
results (or an apple falls on our heads), it will be replaced soon.


Signed-off-by: Stephan Günther <guenther@tum.de>
Signed-off-by: Maurice Leclaure <leclaire@in.tum.de>
---
 drivers/nvme/host/pci.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8187df2..15bbedb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2701,6 +2701,14 @@ static int nvme_dev_map(struct nvme_dev *dev)
 	dev->q_depth = min_t(int, NVME_CAP_MQES(cap) + 1, NVME_Q_DEPTH);
 	dev->db_stride = 1 << NVME_CAP_STRIDE(cap);
 	dev->dbs = ((void __iomem *)dev->bar) + 4096;
+
+	/*
+	 * Hotfix for the Apple controller found in the MacBook8,1 and some
+	 * MacBook7,1 to avoid controller resets and data loss.
+	 */
+	if (pdev->vendor == PCI_VENDOR_ID_APPLE && pdev->device == 0x2001)
+		dev->q_depth = 2;
+
 	if (readl(&dev->bar->vs) >= NVME_VS(1, 2))
 		dev->cmb = nvme_map_cmb(dev);

