regressions.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
@ 2021-09-25 13:10 Orlando Chamberlain
  2021-09-25 17:16 ` Keith Busch
  2021-09-26  5:59 ` Thorsten Leemhuis
  0 siblings, 2 replies; 17+ messages in thread
From: Orlando Chamberlain @ 2021-09-25 13:10 UTC (permalink / raw)
  To: linux-nvme; +Cc: regressions, gargaditya08, kbusch, sagi, hare, dwagner, hch

[-- Attachment #1: Type: text/plain, Size: 2979 bytes --]

Commit e7006de6c238 causes the SSD controller on Apple T2 computers to crash
and prevents linux from booting.

This commit implemented a counter that is stored within the NVMe command_id,
however this counter makes the command_id higher than normal, causing a panic
on the T2 security chip that functions as the SSD controller, which then
causes the system to power off after a few seconds.

This was reported on bugzilla here:
https://bugzilla.kernel.org/show_bug.cgi?id=214509 but it was not originally
classified as NVMe (when the report was created it was unknown what was
causing it), so I don't know if it notified the NVMe mailing list when it
was later reclassified to NVMe. Sorry if you've already seen this issue.

The T2 security chip (which is the SSD) has this line in its crash log (the
rest of this log is in an attachment on the bugzilla report):

panic(cpu 1 caller 0xfffffff028d884ec): ANS2 Recoverable Panic - assert failed: [7447]:command id out of range error (cid = 4120), status_reg: 0x2000 - Null(2)

This is the entry in lspci -nn for the ssd:

04:00.0 Mass storage controller [0180]: Apple Inc. ANS2 NVMe Controller [106b:2005] (rev 01)

This commit was included in 5.14.6 and backported to 5.10.67, but does not
occur in 5.14.5 and 5.10.66. I am on a MacBookPro16,1, the crash has been
reproduced on a MacBookPro16,2 as well. I have been able to reproduce on Arch
Linux with vanilla kernel 5.10.67 (others have gotten it on 5.14.6) with no
DKMS modules, and I bisected it to that commit
(e7006de6c23803799be000a5dcce4d916a36541a).

I've tried to modify the genctr so that it is in the other side of the
command_id (which I thought might make the command_id's lower) with the patch
below, but it did not prevent the crash.

Regards,
Orlando Chamberlain

--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -489,17 +489,20 @@ struct nvme_ctrl_ops {

 /*
  * nvme command_id is constructed as such:
- * | xxxx | xxxxxxxxxxxx |
- *   gen    request tag
- */
+ * | xxxxxxxxxxxx | xxxx |
+ *   request tag    gen
+ *
+ * The gen is at the end as the SSD in Apple T2 Computers
+ * crashes when the command_id is too high.
+*/
 #define nvme_genctr_mask(gen)                  (gen & 0xf)
-#define nvme_cid_install_genctr(gen)           (nvme_genctr_mask(gen) << 12)
-#define nvme_genctr_from_cid(cid)              ((cid & 0xf000) >> 12)
-#define nvme_tag_from_cid(cid)                 (cid & 0xfff)
+#define nvme_cid_install_genctr(gen)           (nvme_genctr_mask(gen))
+#define nvme_genctr_from_cid(cid)              ((cid & 0x000f))
+#define nvme_tag_from_cid(cid)                 ((cid & 0xfff0) >> 4)

 static inline u16 nvme_cid(struct request *rq)
 {
-       return nvme_cid_install_genctr(nvme_req(rq)->genctr) | rq->tag;
+       return nvme_cid_install_genctr(nvme_req(rq)->genctr) | (rq->tag << 4);
 }

 static inline struct request *nvme_find_rq(struct blk_mq_tags *tags,

[-- Attachment #2: publickey - redecorating@protonmail.com - 0xEE1BCCD7.asc --]
[-- Type: application/pgp-keys, Size: 1873 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-09-25 13:10 [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD Orlando Chamberlain
@ 2021-09-25 17:16 ` Keith Busch
  2021-09-25 23:40   ` Orlando Chamberlain
  2021-09-26  5:59 ` Thorsten Leemhuis
  1 sibling, 1 reply; 17+ messages in thread
From: Keith Busch @ 2021-09-25 17:16 UTC (permalink / raw)
  To: Orlando Chamberlain
  Cc: linux-nvme, regressions, gargaditya08, sagi, hare, dwagner, hch

On Sat, Sep 25, 2021 at 01:10:42PM +0000, Orlando Chamberlain wrote:
> Commit e7006de6c238 causes the SSD controller on Apple T2 computers to crash
> and prevents linux from booting.
> 
> This commit implemented a counter that is stored within the NVMe command_id,
> however this counter makes the command_id higher than normal, causing a panic
> on the T2 security chip that functions as the SSD controller, which then
> causes the system to power off after a few seconds.

Ah, yet another spec non-complainat quirk from these controllers.

> This was reported on bugzilla here:
> https://bugzilla.kernel.org/show_bug.cgi?id=214509 but it was not originally
> classified as NVMe (when the report was created it was unknown what was
> causing it), so I don't know if it notified the NVMe mailing list when it
> was later reclassified to NVMe. Sorry if you've already seen this issue.

The mailing list was not copied, so thank you for directly notifying
this list. 
 
> The T2 security chip (which is the SSD) has this line in its crash log (the
> rest of this log is in an attachment on the bugzilla report):
> 
> panic(cpu 1 caller 0xfffffff028d884ec): ANS2 Recoverable Panic - assert failed: [7447]:command id out of range error (cid = 4120), status_reg: 0x2000 - Null(2)
> 
> This is the entry in lspci -nn for the ssd:
> 
> 04:00.0 Mass storage controller [0180]: Apple Inc. ANS2 NVMe Controller [106b:2005] (rev 01)
> 
> This commit was included in 5.14.6 and backported to 5.10.67, but does not
> occur in 5.14.5 and 5.10.66. I am on a MacBookPro16,1, the crash has been
> reproduced on a MacBookPro16,2 as well. 

Is the PCI VID:DID the same from in your lspci output for all affected
macbooks?

> I have been able to reproduce on Arch
> Linux with vanilla kernel 5.10.67 (others have gotten it on 5.14.6) with no
> DKMS modules, and I bisected it to that commit
> (e7006de6c23803799be000a5dcce4d916a36541a).
> 
> I've tried to modify the genctr so that it is in the other side of the
> command_id (which I thought might make the command_id's lower) with the patch
> below, but it did not prevent the crash.

That might mean the h/w is using the command id as an index into
internal structures. That is not spec compliant, so it sounds like
we'll need to introduce another quirk for the macs.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-09-25 17:16 ` Keith Busch
@ 2021-09-25 23:40   ` Orlando Chamberlain
  2021-09-26  2:08     ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Orlando Chamberlain @ 2021-09-25 23:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, regressions, gargaditya08, sagi, hare, dwagner, hch



On 26/9/21 03:16, Keith Busch wrote:
> On Sat, Sep 25, 2021 at 01:10:42PM +0000, Orlando Chamberlain wrote:
>> Commit e7006de6c238 causes the SSD controller on Apple T2 computers to crash
>> and prevents linux from booting.
>>
>> This commit implemented a counter that is stored within the NVMe command_id,
>> however this counter makes the command_id higher than normal, causing a panic
>> on the T2 security chip that functions as the SSD controller, which then
>> causes the system to power off after a few seconds.
> 
> Ah, yet another spec non-complainat quirk from these controllers.

Apple does like to think different.

>> This is the entry in lspci -nn for the ssd:
>>
>> 04:00.0 Mass storage controller [0180]: Apple Inc. ANS2 NVMe Controller [106b:2005] (rev 01)
>>
>> This commit was included in 5.14.6 and backported to 5.10.67, but does not
>> occur in 5.14.5 and 5.10.66. I am on a MacBookPro16,1, the crash has been
>> reproduced on a MacBookPro16,2 as well.
> 
> Is the PCI VID:DID the same from in your lspci output for all affected
> macbooks?

Yes, they all have 106b:2005

>> I've tried to modify the genctr so that it is in the other side of the
>> command_id (which I thought might make the command_id's lower) with the patch
>> below, but it did not prevent the crash.
> 
> That might mean the h/w is using the command id as an index into
> internal structures. That is not spec compliant, so it sounds like
> we'll need to introduce another quirk for the macs.
> 

I've managed to get it to boot by commenting out the counter increment, which might work
as a quirk:

--- a/drivers/nvme/host/core.c

+++ b/drivers/nvme/host/core.c

@@ -1027,7 +1027,7 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)

                return BLK_STS_IOERR;

        }

 

-       nvme_req(req)->genctr++;

+       //nvme_req(req)->genctr++;

        cmd->common.command_id = nvme_cid(req);

        trace_nvme_setup_cmd(req, cmd);

        return ret;


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-09-25 23:40   ` Orlando Chamberlain
@ 2021-09-26  2:08     ` Keith Busch
  2021-09-26  3:53       ` Orlando Chamberlain
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2021-09-26  2:08 UTC (permalink / raw)
  To: Orlando Chamberlain
  Cc: linux-nvme, regressions, gargaditya08, sagi, hare, dwagner, hch

On Sat, Sep 25, 2021 at 11:40:19PM +0000, Orlando Chamberlain wrote:
> I've managed to get it to boot by commenting out the counter increment, which might work
> as a quirk:

Great, that's essentially what I proposed in a different thread (sorry, I see
now you were not copied there):

  http://lists.infradead.org/pipermail/linux-nvme/2021-September/027665.html

The final patch may look a bit different, but it should be okay to test with
the one in the link.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-09-26  2:08     ` Keith Busch
@ 2021-09-26  3:53       ` Orlando Chamberlain
  2021-09-26  4:35         ` Orlando Chamberlain
  0 siblings, 1 reply; 17+ messages in thread
From: Orlando Chamberlain @ 2021-09-26  3:53 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, regressions, gargaditya08, sagi, hare, dwagner, hch

On 26/9/21 12:08, Keith Busch wrote:
> On Sat, Sep 25, 2021 at 11:40:19PM +0000, Orlando Chamberlain wrote:
>> I've managed to get it to boot by commenting out the counter increment, which might work
>> as a quirk:
> 
> Great, that's essentially what I proposed in a different thread (sorry, I see
> now you were not copied there):
> 
>   http://lists.infradead.org/pipermail/linux-nvme/2021-September/027665.html
> 
> The final patch may look a bit different, but it should be okay to test with
> the one in the link.
> 
I've tried that patch, it doesn't crash, but /dev/nvme0, /dev/nvme0n1 etc don't show up so
it doesn't progress past the initramfs. There weren't any errors other than preexisting
acpi table ones.

(I sent this a few minutes ago but I forgot to reply all so here it is again)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-09-26  3:53       ` Orlando Chamberlain
@ 2021-09-26  4:35         ` Orlando Chamberlain
       [not found]           ` <PNZPR01MB4415801C6084E8CFD068A84AB8A69@PNZPR01MB4415.INDPRD01.PROD.OUTLOOK.COM>
  0 siblings, 1 reply; 17+ messages in thread
From: Orlando Chamberlain @ 2021-09-26  4:35 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, regressions, gargaditya08, sagi, hare, dwagner, hch

On 26/9/21 13:53, Orlando Chamberlain wrote:
>> Great, that's essentially what I proposed in a different thread (sorry, I see
>> now you were not copied there):
>>
>>   http://lists.infradead.org/pipermail/linux-nvme/2021-September/027665.html
>>
>> The final patch may look a bit different, but it should be okay to test with
>> the one in the link.
>>
> I've tried that patch, it doesn't crash, but /dev/nvme0, /dev/nvme0n1 etc don't show up so
> it doesn't progress past the initramfs. There weren't any errors other than preexisting
> acpi table ones.
> 

The issue was that there's a , instead of a | at the end of the second line here. It works when that is replaced with a |.

-				NVME_QUIRK_SHARED_TAGS },
+				NVME_QUIRK_SHARED_TAGS ,
+				NVME_QUIRK_SKIP_CID_GEN },


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-09-25 13:10 [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD Orlando Chamberlain
  2021-09-25 17:16 ` Keith Busch
@ 2021-09-26  5:59 ` Thorsten Leemhuis
  2021-10-04  9:17   ` Thorsten Leemhuis
  1 sibling, 1 reply; 17+ messages in thread
From: Thorsten Leemhuis @ 2021-09-26  5:59 UTC (permalink / raw)
  To: regressions

On 25.09.21 15:10, Orlando Chamberlain wrote:
> Commit e7006de6c238 causes the SSD controller on Apple T2 computers to crash
> and prevents linux from booting.
> 
> This commit implemented a counter that is stored within the NVMe command_id,
> however this counter makes the command_id higher than normal, causing a panic
> on the T2 security chip that functions as the SSD controller, which then
> causes the system to power off after a few seconds.
> 
> This was reported on bugzilla here:
> https://bugzilla.kernel.org/show_bug.cgi?id=214509 but it was not originally
> classified as NVMe (when the report was created it was unknown what was
> causing it), so I don't know if it notified the NVMe mailing list when it
> was later reclassified to NVMe. Sorry if you've already seen this issue.
> 
> The T2 security chip (which is the SSD) has this line in its crash log (the
> rest of this log is in an attachment on the bugzilla report):
> 
> panic(cpu 1 caller 0xfffffff028d884ec): ANS2 Recoverable Panic - assert failed: [7447]:command id out of range error (cid = 4120), status_reg: 0x2000 - Null(2)
> 
> This is the entry in lspci -nn for the ssd:
> 
> 04:00.0 Mass storage controller [0180]: Apple Inc. ANS2 NVMe Controller [106b:2005] (rev 01)
> 
> This commit was included in 5.14.6 and backported to 5.10.67, but does not
> occur in 5.14.5 and 5.10.66. I am on a MacBookPro16,1, the crash has been
> reproduced on a MacBookPro16,2 as well. I have been able to reproduce on Arch
> Linux with vanilla kernel 5.10.67 (others have gotten it on 5.14.6) with no
> DKMS modules, and I bisected it to that commit
> (e7006de6c23803799be000a5dcce4d916a36541a).

Feel free to ignore this message. I write it to make regzbot track above
issue. Regzbot is the regression tracking bot I'm working on. It's still
in the early stages and this is still one of the first few regression I
make it track to get started and things tested in the field. That also
why I'm sending the mail just to the regressions list (it will do its
fully magic nevertheless). For details see:
https://linux-regtracking.leemhuis.info/post/inital-regzbot-running/
https://linux-regtracking.leemhuis.info/post/regzbot-approach/

#regzbot ^introduced e7006de6c23803799be000a5dcce4d916a36541a

#regzbot monitor
https://lore.kernel.org/lkml/CAHk-=wgML11x9afCvmg9yhVm9wi5mvnjBvmX+i7OfMA0Vd4FWA@mail.gmail.com/

Ciao, Thorsten



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
       [not found]           ` <PNZPR01MB4415801C6084E8CFD068A84AB8A69@PNZPR01MB4415.INDPRD01.PROD.OUTLOOK.COM>
@ 2021-09-26  8:44             ` Sagi Grimberg
  2021-09-27  4:22               ` Orlando Chamberlain
  0 siblings, 1 reply; 17+ messages in thread
From: Sagi Grimberg @ 2021-09-26  8:44 UTC (permalink / raw)
  To: Aditya Garg, Orlando Chamberlain, kbusch
  Cc: linux-nvme, regressions, hare, dwagner, hch


> I checked out the proposal sent by Orlando Chamberlain to replace NVME_QUIRK_SHARED_TAGS , by NVME_QUIRK_SHARED_TAGS | given in the patch on http://lists.infradead.org/pipermail/linux-nvme/2021-September/027665.html. The , still causes panics to the T2 as described before. In the case of |, the kernel boots correctly without panicking the T2, but in case we are having Linux on an External Drive, which is my case, then the internal SSD doesn't seem to be recognised at all. I've tested the patch on 5.14.7.

That sounds like a separate issue, because with this patch applied,
all tags should be within the queue entry range (with generation
set to 0 always).

Is it possible that the io_queue_depth is being set to something
that exceeds NVME_PCI_MAX_QUEUE_SIZE (4095) ? the default is 1024

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-09-26  8:44             ` Sagi Grimberg
@ 2021-09-27  4:22               ` Orlando Chamberlain
  2021-09-27  4:51                 ` Aditya Garg
  0 siblings, 1 reply; 17+ messages in thread
From: Orlando Chamberlain @ 2021-09-27  4:22 UTC (permalink / raw)
  To: Sagi Grimberg, Aditya Garg, kbusch
  Cc: linux-nvme, regressions, hare, dwagner, hch

On 26/9/21 18:44, Sagi Grimberg wrote:
> 
>> I checked out the proposal sent by Orlando Chamberlain to replace NVME_QUIRK_SHARED_TAGS , by NVME_QUIRK_SHARED_TAGS | given in the patch on http://lists.infradead.org/pipermail/linux-nvme/2021-September/027665.html. The , still causes panics to the T2 as described before. In the case of |, the kernel boots correctly without panicking the T2, but in case we are having Linux on an External Drive, which is my case, then the internal SSD doesn't seem to be recognised at all. I've tested the patch on 5.14.7.
> 
> That sounds like a separate issue, because with this patch applied,
> all tags should be within the queue entry range (with generation
> set to 0 always).
> 
> Is it possible that the io_queue_depth is being set to something
> that exceeds NVME_PCI_MAX_QUEUE_SIZE (4095) ? the default is 1024
>
I've been able to reproduce it by using the same kernel Aditya is using:
https://github.com/AdityaGarg8/T2-Big-Sur-Ubuntu-Kernel/actions/runs/1275383460

From the initramfs:

# dmesg | grep nvme
nvme nvme0: pci function 0000:04:00.0
nvme nvme0: 1/0/0 default/read/poll queues
nvme nvme0: Identify NS List failed (status=0xb)
nvme nvme0: LightNVM init failure

It might be because this is 5.14.7, while I've been using 5.15-rc2. Additionally,
there are differences in kernel configs, I've put both configs in this gist
https://gist.github.com/Redecorating/c8cf574df969f9b4f626dfb9c6b2a758


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-09-27  4:22               ` Orlando Chamberlain
@ 2021-09-27  4:51                 ` Aditya Garg
  2021-09-27  6:05                   ` Sven Peter
  0 siblings, 1 reply; 17+ messages in thread
From: Aditya Garg @ 2021-09-27  4:51 UTC (permalink / raw)
  To: Orlando Chamberlain, sagi, kbusch
  Cc: linux-nvme, regressions, hare, dwagner, hch, torvalds, axboe,
	james.smart, chaitanya.kulkarni, akpm, linux-kernel

I am getting the same error.

________________________________________
From: Orlando Chamberlain <redecorating@protonmail.com>
Sent: Monday, September 27, 2021 4:22 AM
To: Sagi Grimberg; Aditya Garg; kbusch@kernel.org
Cc: linux-nvme@lists.infradead.org; regressions@lists.linux.dev; hare@suse.de; dwagner@suse.de; hch@lst.de
Subject: Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD

On 26/9/21 18:44, Sagi Grimberg wrote:
>
>> I checked out the proposal sent by Orlando Chamberlain to replace NVME_QUIRK_SHARED_TAGS , by NVME_QUIRK_SHARED_TAGS | given in the patch on http://lists.infradead.org/pipermail/linux-nvme/2021-September/027665.html. The , still causes panics to the T2 as described before. In the case of |, the kernel boots correctly without panicking the T2, but in case we are having Linux on an External Drive, which is my case, then the internal SSD doesn't seem to be recognised at all. I've tested the patch on 5.14.7.
>
> That sounds like a separate issue, because with this patch applied,
> all tags should be within the queue entry range (with generation
> set to 0 always).
>
> Is it possible that the io_queue_depth is being set to something
> that exceeds NVME_PCI_MAX_QUEUE_SIZE (4095) ? the default is 1024
>
I've been able to reproduce it by using the same kernel Aditya is using:
https://github.com/AdityaGarg8/T2-Big-Sur-Ubuntu-Kernel/actions/runs/1275383460

From the initramfs:

# dmesg | grep nvme
nvme nvme0: pci function 0000:04:00.0
nvme nvme0: 1/0/0 default/read/poll queues
nvme nvme0: Identify NS List failed (status=0xb)
nvme nvme0: LightNVM init failure

It might be because this is 5.14.7, while I've been using 5.15-rc2. Additionally,
there are differences in kernel configs, I've put both configs in this gist
https://gist.github.com/Redecorating/c8cf574df969f9b4f626dfb9c6b2a758


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-09-27  4:51                 ` Aditya Garg
@ 2021-09-27  6:05                   ` Sven Peter
  2021-09-27 15:02                     ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Sven Peter @ 2021-09-27  6:05 UTC (permalink / raw)
  To: Aditya Garg, Orlando Chamberlain, sagi, Keith Busch
  Cc: linux-nvme, regressions, hare, dwagner, hch, Linus Torvalds,
	axboe, james.smart, chaitanya.kulkarni, akpm, linux-kernel

Hi,


On Mon, Sep 27, 2021, at 06:51, Aditya Garg wrote:
> I am getting the same error.
>
> ________________________________________
> From: Orlando Chamberlain <redecorating@protonmail.com>
> Sent: Monday, September 27, 2021 4:22 AM
> To: Sagi Grimberg; Aditya Garg; kbusch@kernel.org
> Cc: linux-nvme@lists.infradead.org; regressions@lists.linux.dev; 
> hare@suse.de; dwagner@suse.de; hch@lst.de
> Subject: Re: [REGRESSION] nvme: code command_id with a genctr for 
> use-after-free validation crashes apple T2 SSD
>
> On 26/9/21 18:44, Sagi Grimberg wrote:
>>
>>> I checked out the proposal sent by Orlando Chamberlain to replace NVME_QUIRK_SHARED_TAGS , by NVME_QUIRK_SHARED_TAGS | given in the patch on http://lists.infradead.org/pipermail/linux-nvme/2021-September/027665.html. The , still causes panics to the T2 as described before. In the case of |, the kernel boots correctly without panicking the T2, but in case we are having Linux on an External Drive, which is my case, then the internal SSD doesn't seem to be recognised at all. I've tested the patch on 5.14.7.
>>
>> That sounds like a separate issue, because with this patch applied,
>> all tags should be within the queue entry range (with generation
>> set to 0 always).
>>
>> Is it possible that the io_queue_depth is being set to something
>> that exceeds NVME_PCI_MAX_QUEUE_SIZE (4095) ? the default is 1024
>>
> I've been able to reproduce it by using the same kernel Aditya is using:
> https://github.com/AdityaGarg8/T2-Big-Sur-Ubuntu-Kernel/actions/runs/1275383460
>
> From the initramfs:
>
> # dmesg | grep nvme
> nvme nvme0: pci function 0000:04:00.0
> nvme nvme0: 1/0/0 default/read/poll queues
> nvme nvme0: Identify NS List failed (status=0xb)
> nvme nvme0: LightNVM init failure

Maybe I should've just submitted the quirks required for the M1 already...
The ANS2 firmware on there doesn't support the vanilla nvme_scan_ns_list
call. That should not break anything though because core.c falls back to
nvme_scan_ns_sequential in that case which does work. It just results
in that "Identify NS List failed (status=0xb)" error message.

Not sure where that LightNVM failure comes from though. Afaict lightnvm has been 
removed from 5.15 and shouldn't be used for the Apple controller anyway.

Sven

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-09-27  6:05                   ` Sven Peter
@ 2021-09-27 15:02                     ` Keith Busch
  0 siblings, 0 replies; 17+ messages in thread
From: Keith Busch @ 2021-09-27 15:02 UTC (permalink / raw)
  To: Sven Peter
  Cc: Aditya Garg, Orlando Chamberlain, sagi, linux-nvme, regressions,
	hare, dwagner, hch, Linus Torvalds, axboe, james.smart,
	chaitanya.kulkarni, akpm, linux-kernel

On Mon, Sep 27, 2021 at 08:05:00AM +0200, Sven Peter wrote:
> Hi,
> 
> 
> On Mon, Sep 27, 2021, at 06:51, Aditya Garg wrote:
> > I am getting the same error.
> >
> > ________________________________________
> > From: Orlando Chamberlain <redecorating@protonmail.com>
> > Sent: Monday, September 27, 2021 4:22 AM
> > To: Sagi Grimberg; Aditya Garg; kbusch@kernel.org
> > Cc: linux-nvme@lists.infradead.org; regressions@lists.linux.dev; 
> > hare@suse.de; dwagner@suse.de; hch@lst.de
> > Subject: Re: [REGRESSION] nvme: code command_id with a genctr for 
> > use-after-free validation crashes apple T2 SSD
> >
> > On 26/9/21 18:44, Sagi Grimberg wrote:
> >>
> >>> I checked out the proposal sent by Orlando Chamberlain to replace NVME_QUIRK_SHARED_TAGS , by NVME_QUIRK_SHARED_TAGS | given in the patch on http://lists.infradead.org/pipermail/linux-nvme/2021-September/027665.html. The , still causes panics to the T2 as described before. In the case of |, the kernel boots correctly without panicking the T2, but in case we are having Linux on an External Drive, which is my case, then the internal SSD doesn't seem to be recognised at all. I've tested the patch on 5.14.7.
> >>
> >> That sounds like a separate issue, because with this patch applied,
> >> all tags should be within the queue entry range (with generation
> >> set to 0 always).
> >>
> >> Is it possible that the io_queue_depth is being set to something
> >> that exceeds NVME_PCI_MAX_QUEUE_SIZE (4095) ? the default is 1024
> >>
> > I've been able to reproduce it by using the same kernel Aditya is using:
> > https://github.com/AdityaGarg8/T2-Big-Sur-Ubuntu-Kernel/actions/runs/1275383460
> >
> > From the initramfs:
> >
> > # dmesg | grep nvme
> > nvme nvme0: pci function 0000:04:00.0
> > nvme nvme0: 1/0/0 default/read/poll queues
> > nvme nvme0: Identify NS List failed (status=0xb)
> > nvme nvme0: LightNVM init failure
> 
> Maybe I should've just submitted the quirks required for the M1 already...
> The ANS2 firmware on there doesn't support the vanilla nvme_scan_ns_list
> call. That should not break anything though because core.c falls back to
> nvme_scan_ns_sequential in that case which does work. It just results
> in that "Identify NS List failed (status=0xb)" error message.
> 
> Not sure where that LightNVM failure comes from though. Afaict lightnvm has been 
> removed from 5.15 and shouldn't be used for the Apple controller anyway.

Yes, lightnvm was removed in 5.15, so the quirk bit identifying these
devices was removed. The patch was made for upstream 5.15. If you want
to backport it to stable, the bit will need to be changed, otherwise it
make the driver think its an openchannel SSD instead and fail.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-09-26  5:59 ` Thorsten Leemhuis
@ 2021-10-04  9:17   ` Thorsten Leemhuis
  2021-10-04  9:27     ` Greg KH
  0 siblings, 1 reply; 17+ messages in thread
From: Thorsten Leemhuis @ 2021-10-04  9:17 UTC (permalink / raw)
  To: regressions


On 26.09.21 07:59, Thorsten Leemhuis wrote:
> On 25.09.21 15:10, Orlando Chamberlain wrote:
>> Commit e7006de6c238 causes the SSD controller on Apple T2 computers to crash
>> and prevents linux from booting.
>>
>> This commit implemented a counter that is stored within the NVMe command_id,
>> however this counter makes the command_id higher than normal, causing a panic
>> on the T2 security chip that functions as the SSD controller, which then
>> causes the system to power off after a few seconds.
>>
>> This was reported on bugzilla here:
>> https://bugzilla.kernel.org/show_bug.cgi?id=214509 but it was not originally
>> classified as NVMe (when the report was created it was unknown what was
>> causing it), so I don't know if it notified the NVMe mailing list when it
>> was later reclassified to NVMe. Sorry if you've already seen this issue.
>>
>> The T2 security chip (which is the SSD) has this line in its crash log (the
>> rest of this log is in an attachment on the bugzilla report):
>>
>> panic(cpu 1 caller 0xfffffff028d884ec): ANS2 Recoverable Panic - assert failed: [7447]:command id out of range error (cid = 4120), status_reg: 0x2000 - Null(2)
>>
>> This is the entry in lspci -nn for the ssd:
>>
>> 04:00.0 Mass storage controller [0180]: Apple Inc. ANS2 NVMe Controller [106b:2005] (rev 01)
>>
>> This commit was included in 5.14.6 and backported to 5.10.67, but does not
>> occur in 5.14.5 and 5.10.66. I am on a MacBookPro16,1, the crash has been
>> reproduced on a MacBookPro16,2 as well. I have been able to reproduce on Arch
>> Linux with vanilla kernel 5.10.67 (others have gotten it on 5.14.6) with no
>> DKMS modules, and I bisected it to that commit
>> (e7006de6c23803799be000a5dcce4d916a36541a).
> 
> Feel free to ignore this message. I write it to make regzbot track above
> issue. Regzbot is the regression tracking bot I'm working on. It's still
> in the early stages and this is still one of the first few regression I
> make it track to get started and things tested in the field. That also
> why I'm sending the mail just to the regressions list (it will do its
> fully magic nevertheless). For details see:
> https://linux-regtracking.leemhuis.info/post/inital-regzbot-running/
> https://linux-regtracking.leemhuis.info/post/regzbot-approach/
> 
> #regzbot ^introduced e7006de6c23803799be000a5dcce4d916a36541a
> 
> #regzbot monitor
> https://lore.kernel.org/lkml/CAHk-=wgML11x9afCvmg9yhVm9wi5mvnjBvmX+i7OfMA0Vd4FWA@mail.gmail.com/

FWIW, this is just for the record: the fix for this landed in mainline,
but didn't refer to this thread or the one montitored, hence I need to
write this mail to make regzbot mark this regression as resolved:

#regzbot monitor
https://lore.kernel.org/all/20210927154306.387437-1-kbusch@kernel.org/

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-10-04  9:17   ` Thorsten Leemhuis
@ 2021-10-04  9:27     ` Greg KH
  2021-10-04 10:11       ` Thorsten Leemhuis
  0 siblings, 1 reply; 17+ messages in thread
From: Greg KH @ 2021-10-04  9:27 UTC (permalink / raw)
  To: Thorsten Leemhuis; +Cc: regressions

On Mon, Oct 04, 2021 at 11:17:21AM +0200, Thorsten Leemhuis wrote:
> 
> On 26.09.21 07:59, Thorsten Leemhuis wrote:
> > On 25.09.21 15:10, Orlando Chamberlain wrote:
> >> Commit e7006de6c238 causes the SSD controller on Apple T2 computers to crash
> >> and prevents linux from booting.
> >>
> >> This commit implemented a counter that is stored within the NVMe command_id,
> >> however this counter makes the command_id higher than normal, causing a panic
> >> on the T2 security chip that functions as the SSD controller, which then
> >> causes the system to power off after a few seconds.
> >>
> >> This was reported on bugzilla here:
> >> https://bugzilla.kernel.org/show_bug.cgi?id=214509 but it was not originally
> >> classified as NVMe (when the report was created it was unknown what was
> >> causing it), so I don't know if it notified the NVMe mailing list when it
> >> was later reclassified to NVMe. Sorry if you've already seen this issue.
> >>
> >> The T2 security chip (which is the SSD) has this line in its crash log (the
> >> rest of this log is in an attachment on the bugzilla report):
> >>
> >> panic(cpu 1 caller 0xfffffff028d884ec): ANS2 Recoverable Panic - assert failed: [7447]:command id out of range error (cid = 4120), status_reg: 0x2000 - Null(2)
> >>
> >> This is the entry in lspci -nn for the ssd:
> >>
> >> 04:00.0 Mass storage controller [0180]: Apple Inc. ANS2 NVMe Controller [106b:2005] (rev 01)
> >>
> >> This commit was included in 5.14.6 and backported to 5.10.67, but does not
> >> occur in 5.14.5 and 5.10.66. I am on a MacBookPro16,1, the crash has been
> >> reproduced on a MacBookPro16,2 as well. I have been able to reproduce on Arch
> >> Linux with vanilla kernel 5.10.67 (others have gotten it on 5.14.6) with no
> >> DKMS modules, and I bisected it to that commit
> >> (e7006de6c23803799be000a5dcce4d916a36541a).
> > 
> > Feel free to ignore this message. I write it to make regzbot track above
> > issue. Regzbot is the regression tracking bot I'm working on. It's still
> > in the early stages and this is still one of the first few regression I
> > make it track to get started and things tested in the field. That also
> > why I'm sending the mail just to the regressions list (it will do its
> > fully magic nevertheless). For details see:
> > https://linux-regtracking.leemhuis.info/post/inital-regzbot-running/
> > https://linux-regtracking.leemhuis.info/post/regzbot-approach/
> > 
> > #regzbot ^introduced e7006de6c23803799be000a5dcce4d916a36541a
> > 
> > #regzbot monitor
> > https://lore.kernel.org/lkml/CAHk-=wgML11x9afCvmg9yhVm9wi5mvnjBvmX+i7OfMA0Vd4FWA@mail.gmail.com/
> 
> FWIW, this is just for the record: the fix for this landed in mainline,
> but didn't refer to this thread or the one montitored, hence I need to
> write this mail to make regzbot mark this regression as resolved:
> 
> #regzbot monitor
> https://lore.kernel.org/all/20210927154306.387437-1-kbusch@kernel.org/

Thanks for tracking this, I'll go queue it up for 5.10.y and 5.14.y now.

greg k-h

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-10-04  9:27     ` Greg KH
@ 2021-10-04 10:11       ` Thorsten Leemhuis
  2021-10-04 11:36         ` Greg KH
  0 siblings, 1 reply; 17+ messages in thread
From: Thorsten Leemhuis @ 2021-10-04 10:11 UTC (permalink / raw)
  To: Greg KH; +Cc: regressions

On 04.10.21 11:27, Greg KH wrote:
> On Mon, Oct 04, 2021 at 11:17:21AM +0200, Thorsten Leemhuis wrote:
>> On 26.09.21 07:59, Thorsten Leemhuis wrote:
>>> On 25.09.21 15:10, Orlando Chamberlain wrote:
>>>> Commit e7006de6c238 causes the SSD controller on Apple T2 computers to crash
>>>> and prevents linux from booting.
>>>>
>>>> This commit implemented a counter that is stored within the NVMe command_id,
>>>> however this counter makes the command_id higher than normal, causing a panic
>>>> on the T2 security chip that functions as the SSD controller, which then
>>>> causes the system to power off after a few seconds.
>>>>
>>>> This was reported on bugzilla here:
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=214509 but it was not originally
>>>> classified as NVMe (when the report was created it was unknown what was
>>>> causing it), so I don't know if it notified the NVMe mailing list when it
>>>> was later reclassified to NVMe. Sorry if you've already seen this issue.
> [...]
>>> Feel free to ignore this message. I write it to make regzbot track above
> [...]
>> FWIW, this is just for the record: the fix for this landed in mainline,
>> but didn't refer to this thread or the one montitored, hence I need to
>> write this mail to make regzbot mark this regression as resolved:
>>
>> #regzbot monitor
>> https://lore.kernel.org/all/20210927154306.387437-1-kbusch@kernel.org/
> 
> Thanks for tracking this, I'll go queue it up for 5.10.y and 5.14.y now.

Nice to hear that regzbot is helpful even while still in the early stage
of field testing. FWIW, this regression made me consider adding two things:

* guess it would be a good idea if regzbot would simply export a
parseable list with commit IDs of still unresolved mainline regressions,
as your backports script then could simply consult it and put such
commits on hold

* it afaics should even be possible to make regzbot detect "hey, this
commit was backported, but the fix was not yet, so lets add it to the
list of regressions in stable and longterm kernels and notify the stable
team about it.

But for now there are other, more important things I need to work on.
Will add this to the issue tracker for regzbot to work on this another time.

Ciao, Thorsten




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-10-04 10:11       ` Thorsten Leemhuis
@ 2021-10-04 11:36         ` Greg KH
  2021-10-05  5:50           ` Thorsten Leemhuis
  0 siblings, 1 reply; 17+ messages in thread
From: Greg KH @ 2021-10-04 11:36 UTC (permalink / raw)
  To: Thorsten Leemhuis; +Cc: regressions

On Mon, Oct 04, 2021 at 12:11:43PM +0200, Thorsten Leemhuis wrote:
> On 04.10.21 11:27, Greg KH wrote:
> > On Mon, Oct 04, 2021 at 11:17:21AM +0200, Thorsten Leemhuis wrote:
> >> On 26.09.21 07:59, Thorsten Leemhuis wrote:
> >>> On 25.09.21 15:10, Orlando Chamberlain wrote:
> >>>> Commit e7006de6c238 causes the SSD controller on Apple T2 computers to crash
> >>>> and prevents linux from booting.
> >>>>
> >>>> This commit implemented a counter that is stored within the NVMe command_id,
> >>>> however this counter makes the command_id higher than normal, causing a panic
> >>>> on the T2 security chip that functions as the SSD controller, which then
> >>>> causes the system to power off after a few seconds.
> >>>>
> >>>> This was reported on bugzilla here:
> >>>> https://bugzilla.kernel.org/show_bug.cgi?id=214509 but it was not originally
> >>>> classified as NVMe (when the report was created it was unknown what was
> >>>> causing it), so I don't know if it notified the NVMe mailing list when it
> >>>> was later reclassified to NVMe. Sorry if you've already seen this issue.
> > [...]
> >>> Feel free to ignore this message. I write it to make regzbot track above
> > [...]
> >> FWIW, this is just for the record: the fix for this landed in mainline,
> >> but didn't refer to this thread or the one montitored, hence I need to
> >> write this mail to make regzbot mark this regression as resolved:
> >>
> >> #regzbot monitor
> >> https://lore.kernel.org/all/20210927154306.387437-1-kbusch@kernel.org/
> > 
> > Thanks for tracking this, I'll go queue it up for 5.10.y and 5.14.y now.
> 
> Nice to hear that regzbot is helpful even while still in the early stage
> of field testing. FWIW, this regression made me consider adding two things:
> 
> * guess it would be a good idea if regzbot would simply export a
> parseable list with commit IDs of still unresolved mainline regressions,
> as your backports script then could simply consult it and put such
> commits on hold

That would be nice, but that's not a normal thing that happens often.

> * it afaics should even be possible to make regzbot detect "hey, this
> commit was backported, but the fix was not yet, so lets add it to the
> list of regressions in stable and longterm kernels and notify the stable
> team about it.

If a commit has a Fixes: tag in it, our scripts will pick this up.  But
this commit did not have a Fixes: tag in it, so it missed it :(

Ideally all regression fixes will have a fixes: tag in them, and then
you don't have to add any new scripting.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD
  2021-10-04 11:36         ` Greg KH
@ 2021-10-05  5:50           ` Thorsten Leemhuis
  0 siblings, 0 replies; 17+ messages in thread
From: Thorsten Leemhuis @ 2021-10-05  5:50 UTC (permalink / raw)
  To: Greg KH; +Cc: regressions

On 04.10.21 13:36, Greg KH wrote:
> On Mon, Oct 04, 2021 at 12:11:43PM +0200, Thorsten Leemhuis wrote:
>> On 04.10.21 11:27, Greg KH wrote:
>>> On Mon, Oct 04, 2021 at 11:17:21AM +0200, Thorsten Leemhuis wrote:
>>>> On 26.09.21 07:59, Thorsten Leemhuis wrote:
>>>>> On 25.09.21 15:10, Orlando Chamberlain wrote:
>>>>>> Commit e7006de6c238 causes the SSD controller on Apple T2 computers to crash
>>>>>> and prevents linux from booting.
>>>>>>
>>>>>> This commit implemented a counter that is stored within the NVMe command_id,
>>>>>> however this counter makes the command_id higher than normal, causing a panic
>>>>>> on the T2 security chip that functions as the SSD controller, which then
>>>>>> causes the system to power off after a few seconds.
>>>>>>
>>>>>> This was reported on bugzilla here:
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=214509 but it was not originally
>>>>>> classified as NVMe (when the report was created it was unknown what was
>>>>>> causing it), so I don't know if it notified the NVMe mailing list when it
>>>>>> was later reclassified to NVMe. Sorry if you've already seen this issue.
>>> [...]
>>>>> Feel free to ignore this message. I write it to make regzbot track above
>>> [...]
>>>> FWIW, this is just for the record: the fix for this landed in mainline,
>>>> but didn't refer to this thread or the one montitored, hence I need to
>>>> write this mail to make regzbot mark this regression as resolved:
>>>>
>>>> #regzbot monitor
>>>> https://lore.kernel.org/all/20210927154306.387437-1-kbusch@kernel.org/
>>>
>>> Thanks for tracking this, I'll go queue it up for 5.10.y and 5.14.y now.
>>
>> Nice to hear that regzbot is helpful even while still in the early stage
>> of field testing. FWIW, this regression made me consider adding two things:
>>
>> * guess it would be a good idea if regzbot would simply export a
>> parseable list with commit IDs of still unresolved mainline regressions,
>> as your backports script then could simply consult it and put such
>> commits on hold
> 
> That would be nice, but that's not a normal thing that happens often.

Okay, thx for letting me know, I already suspected something like that.

>> * it afaics should even be possible to make regzbot detect "hey, this
>> commit was backported, but the fix was not yet, so lets add it to the
>> list of regressions in stable and longterm kernels and notify the stable
>> team about it.
> 
> If a commit has a Fixes: tag in it, our scripts will pick this up.  But
> this commit did not have a Fixes: tag in it, so it missed it :(
> 
> Ideally all regression fixes will have a fixes: tag in them, and then
> you don't have to add any new scripting.

Yeah, sure, but maybe some users run into the regression before it's
fixed in mainline or before the fix is backported to the stable trees.
Hence it IMHO might be good to have it listed to save users the trouble
of reporting the issue again. But whatever, that's finetuning that can
be done later, for now there are more important things that regzbot
needs to learn first IMHO. :-D

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-10-05  5:50 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-25 13:10 [REGRESSION] nvme: code command_id with a genctr for use-after-free validation crashes apple T2 SSD Orlando Chamberlain
2021-09-25 17:16 ` Keith Busch
2021-09-25 23:40   ` Orlando Chamberlain
2021-09-26  2:08     ` Keith Busch
2021-09-26  3:53       ` Orlando Chamberlain
2021-09-26  4:35         ` Orlando Chamberlain
     [not found]           ` <PNZPR01MB4415801C6084E8CFD068A84AB8A69@PNZPR01MB4415.INDPRD01.PROD.OUTLOOK.COM>
2021-09-26  8:44             ` Sagi Grimberg
2021-09-27  4:22               ` Orlando Chamberlain
2021-09-27  4:51                 ` Aditya Garg
2021-09-27  6:05                   ` Sven Peter
2021-09-27 15:02                     ` Keith Busch
2021-09-26  5:59 ` Thorsten Leemhuis
2021-10-04  9:17   ` Thorsten Leemhuis
2021-10-04  9:27     ` Greg KH
2021-10-04 10:11       ` Thorsten Leemhuis
2021-10-04 11:36         ` Greg KH
2021-10-05  5:50           ` Thorsten Leemhuis

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).