* nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() @ 2022-03-08 16:45 Maurizio Lombardi 2022-03-08 19:52 ` Keith Busch 2022-03-09 6:26 ` Christoph Hellwig 0 siblings, 2 replies; 12+ messages in thread From: Maurizio Lombardi @ 2022-03-08 16:45 UTC (permalink / raw) To: linux-nvme; +Cc: axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei Hello, I recently received a bug report complaining about disk corruptions when issuing a NVME_IOCTL_ADMIN_CMD / IDENTIFY ioctl() with cmd.data_len = 8192 bytes and the buffer address not aligned to the page size. This is the C program that we used to reproduce the issue (tested with 5.17.0-rc6): http://bsdbackstore.it/misc/nvme_ioctl_512.c simply run it by passing a path to an nvme device: ./nvme_ioctl_512 /dev/nvme0n1 It appears to be very unpredictable. Sometimes I hit disk corruptions after a few tries, sometimes it takes hours. Sometimes the ioctl() returns success and sometimes it fails. We suspect that the root cause is that the nvme-host driver doesn't enforce the 4096 byte limit for the IDENTIFY commands as the nvme-target does (see the nvmet_execute_identify() --> nvmet_check_transfer_len(req, NVME_IDENTIFY_DATA_SIZE) code). So if we pass a 8192-byte buffer not aligned to the page size, it will need 3 pages on archs where page size is 4k and the nvme spec says that the data buffer may not cross more than one page boundary. Does it make sense to you? What's your opinion on this? Thanks, Maurizio Lombardi ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() 2022-03-08 16:45 nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() Maurizio Lombardi @ 2022-03-08 19:52 ` Keith Busch 2022-03-09 0:18 ` Ming Lei 2022-03-09 6:26 ` Christoph Hellwig 1 sibling, 1 reply; 12+ messages in thread From: Keith Busch @ 2022-03-08 19:52 UTC (permalink / raw) To: Maurizio Lombardi Cc: linux-nvme, axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei On Tue, Mar 08, 2022 at 05:45:20PM +0100, Maurizio Lombardi wrote: > Hello, > > I recently received a bug report complaining about disk corruptions when > issuing a NVME_IOCTL_ADMIN_CMD / IDENTIFY ioctl() with cmd.data_len = > 8192 bytes and the buffer address not aligned to the page size. > > This is the C program that we used to reproduce the issue (tested with > 5.17.0-rc6): http://bsdbackstore.it/misc/nvme_ioctl_512.c > > simply run it by passing a path to an nvme device: > ./nvme_ioctl_512 /dev/nvme0n1 > > It appears to be very unpredictable. Sometimes I hit disk corruptions > after a few tries, sometimes it takes hours. Sometimes the ioctl() > returns success and sometimes it fails. > > We suspect that the root cause is that the nvme-host driver doesn't > enforce the 4096 byte limit for the IDENTIFY commands as the > nvme-target does (see the nvmet_execute_identify() --> > nvmet_check_transfer_len(req, NVME_IDENTIFY_DATA_SIZE) code). > So if we pass a 8192-byte buffer not aligned to the page size, it will > need 3 pages on archs where page size is 4k and the nvme spec says > that the data buffer may not cross more than one page boundary. > > Does it make sense to you? What's your opinion on this? You are telling the driver to prepare a 3-page PRP, so it makes a PRP list. The device knows it's a 4k payload, though, so it thinks your PRP list pointer is actually a pointer to the data destination. The device is corrupting that memory, which could lead to on-disk corruption if that memory is concurrently used for a data-out command. Observing that type of corruption is probably not deterministic. This was an unfortunate pitfall of nvme's PRP method: the transfer length is implicit, so both sides need to agree on that for everything to work. If either side is mistaken on the transfer length, then you get corruption. In short: don't do that. If your application misuses the ioctl to break it, you get to keep both pieces. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() 2022-03-08 19:52 ` Keith Busch @ 2022-03-09 0:18 ` Ming Lei 2022-03-09 0:39 ` Keith Busch 0 siblings, 1 reply; 12+ messages in thread From: Ming Lei @ 2022-03-09 0:18 UTC (permalink / raw) To: Keith Busch Cc: Maurizio Lombardi, linux-nvme, axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei On Tue, Mar 08, 2022 at 11:52:38AM -0800, Keith Busch wrote: > On Tue, Mar 08, 2022 at 05:45:20PM +0100, Maurizio Lombardi wrote: > > Hello, > > > > I recently received a bug report complaining about disk corruptions when > > issuing a NVME_IOCTL_ADMIN_CMD / IDENTIFY ioctl() with cmd.data_len = > > 8192 bytes and the buffer address not aligned to the page size. > > > > This is the C program that we used to reproduce the issue (tested with > > 5.17.0-rc6): http://bsdbackstore.it/misc/nvme_ioctl_512.c > > > > simply run it by passing a path to an nvme device: > > ./nvme_ioctl_512 /dev/nvme0n1 > > > > It appears to be very unpredictable. Sometimes I hit disk corruptions > > after a few tries, sometimes it takes hours. Sometimes the ioctl() > > returns success and sometimes it fails. > > > > We suspect that the root cause is that the nvme-host driver doesn't > > enforce the 4096 byte limit for the IDENTIFY commands as the > > nvme-target does (see the nvmet_execute_identify() --> > > nvmet_check_transfer_len(req, NVME_IDENTIFY_DATA_SIZE) code). > > So if we pass a 8192-byte buffer not aligned to the page size, it will > > need 3 pages on archs where page size is 4k and the nvme spec says > > that the data buffer may not cross more than one page boundary. > > > > Does it make sense to you? What's your opinion on this? > > You are telling the driver to prepare a 3-page PRP, so it makes a PRP > list. The device knows it's a 4k payload, though, so it thinks your PRP > list pointer is actually a pointer to the data destination. The device > is corrupting that memory, which could lead to on-disk corruption if > that memory is concurrently used for a data-out command. Observing that > type of corruption is probably not deterministic. > > This was an unfortunate pitfall of nvme's PRP method: the transfer > length is implicit, so both sides need to agree on that for everything > to work. If either side is mistaken on the transfer length, then you get > corruption. > > In short: don't do that. If your application misuses the ioctl to break > it, you get to keep both pieces. Given NVMe spec states that data length of IDENTIFY command should be 4096bytes, and PRP list can't be used. So looks nvme driver need to validate the command before submitting to hardware, otherwise any buggy application can break FS or memory easily. Thanks, Ming ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() 2022-03-09 0:18 ` Ming Lei @ 2022-03-09 0:39 ` Keith Busch 2022-03-09 1:02 ` Ming Lei 0 siblings, 1 reply; 12+ messages in thread From: Keith Busch @ 2022-03-09 0:39 UTC (permalink / raw) To: Ming Lei Cc: Maurizio Lombardi, linux-nvme, axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei On Wed, Mar 09, 2022 at 08:18:47AM +0800, Ming Lei wrote: > Given NVMe spec states that data length of IDENTIFY command should be > 4096bytes, and PRP list can't be used. > > So looks nvme driver need to validate the command before submitting to > hardware, otherwise any buggy application can break FS or memory easily. No way. The driver does not police the user passthrough interface for these kinds of things. It couldn't ever be complete or future proof if it did. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() 2022-03-09 0:39 ` Keith Busch @ 2022-03-09 1:02 ` Ming Lei 2022-03-09 1:14 ` Keith Busch 0 siblings, 1 reply; 12+ messages in thread From: Ming Lei @ 2022-03-09 1:02 UTC (permalink / raw) To: Keith Busch Cc: Maurizio Lombardi, linux-nvme, axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei, linux-kernel On Tue, Mar 08, 2022 at 04:39:04PM -0800, Keith Busch wrote: > On Wed, Mar 09, 2022 at 08:18:47AM +0800, Ming Lei wrote: > > Given NVMe spec states that data length of IDENTIFY command should be > > 4096bytes, and PRP list can't be used. > > > > So looks nvme driver need to validate the command before submitting to > > hardware, otherwise any buggy application can break FS or memory easily. > > No way. The driver does not police the user passthrough interface for > these kinds of things. So you trust application to provide correct data always? From user viewpoint, this defect provides one easy hole to break FS or memory, it is one serious issue, IMO. The FS/memory corruption can be reproduced easily even in VM. > It couldn't ever be complete or future proof if > it did. But the spec states clearly the data length of IDENTIFY command is 4096 and PRP list can't be used, so why do you think it isn't complete or future proof to validate data length of IDENTIFY in nvme driver? Thanks, Ming ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() 2022-03-09 1:02 ` Ming Lei @ 2022-03-09 1:14 ` Keith Busch 2022-03-09 2:48 ` Ming Lei 0 siblings, 1 reply; 12+ messages in thread From: Keith Busch @ 2022-03-09 1:14 UTC (permalink / raw) To: Ming Lei Cc: Maurizio Lombardi, linux-nvme, axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei, linux-kernel On Wed, Mar 09, 2022 at 09:02:42AM +0800, Ming Lei wrote: > On Tue, Mar 08, 2022 at 04:39:04PM -0800, Keith Busch wrote: > > On Wed, Mar 09, 2022 at 08:18:47AM +0800, Ming Lei wrote: > > > Given NVMe spec states that data length of IDENTIFY command should be > > > 4096bytes, and PRP list can't be used. > > > > > > So looks nvme driver need to validate the command before submitting to > > > hardware, otherwise any buggy application can break FS or memory easily. > > > > No way. The driver does not police the user passthrough interface for > > these kinds of things. > > So you trust application to provide correct data always? > > From user viewpoint, this defect provides one easy hole to break FS or > memory, it is one serious issue, IMO. The FS/memory corruption can > be reproduced easily even in VM. It doesn't seem so serious considering it's been this way for 10 years, and we already knew about this. It's even been reported before: http://lists.infradead.org/pipermail/linux-nvme/2013-August/000365.html > > It couldn't ever be complete or future proof if > > it did. > > But the spec states clearly the data length of IDENTIFY command is 4096 > and PRP list can't be used, so why do you think it isn't complete or > future proof to validate data length of IDENTIFY in nvme driver? The current spec says that opcode uses 4k today. What about some time in the future? And why are you focusing on Identify anyway? The same potential for abuse exists with any of the other numerous opcodes that don't have a fixed transfer size, most of which the driver couldn't possibly ever know what the transfer length is supposed to be. This is a priviledged operation; the applications get to own the fallout if they misuse it. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() 2022-03-09 1:14 ` Keith Busch @ 2022-03-09 2:48 ` Ming Lei 2022-03-09 3:09 ` Keith Busch 0 siblings, 1 reply; 12+ messages in thread From: Ming Lei @ 2022-03-09 2:48 UTC (permalink / raw) To: Keith Busch Cc: Maurizio Lombardi, linux-nvme, axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei, linux-kernel On Tue, Mar 08, 2022 at 05:14:29PM -0800, Keith Busch wrote: > On Wed, Mar 09, 2022 at 09:02:42AM +0800, Ming Lei wrote: > > On Tue, Mar 08, 2022 at 04:39:04PM -0800, Keith Busch wrote: > > > On Wed, Mar 09, 2022 at 08:18:47AM +0800, Ming Lei wrote: > > > > Given NVMe spec states that data length of IDENTIFY command should be > > > > 4096bytes, and PRP list can't be used. > > > > > > > > So looks nvme driver need to validate the command before submitting to > > > > hardware, otherwise any buggy application can break FS or memory easily. > > > > > > No way. The driver does not police the user passthrough interface for > > > these kinds of things. > > > > So you trust application to provide correct data always? > > > > From user viewpoint, this defect provides one easy hole to break FS or > > memory, it is one serious issue, IMO. The FS/memory corruption can > > be reproduced easily even in VM. > > It doesn't seem so serious considering it's been this way for 10 years, > and we already knew about this. It's even been reported before: > > http://lists.infradead.org/pipermail/linux-nvme/2013-August/000365.html BTW, this issue is actually one real report from one Red Hat Customer. > > > > It couldn't ever be complete or future proof if > > > it did. > > > > But the spec states clearly the data length of IDENTIFY command is 4096 > > and PRP list can't be used, so why do you think it isn't complete or > > future proof to validate data length of IDENTIFY in nvme driver? > > The current spec says that opcode uses 4k today. What about some time in > the future? spec change should only be applied on future hardware, which can not break current in-market hardware. nvme target has validated the Identify's transfer length already. > And why are you focusing on Identify anyway? Nvme spec states explicitly that the following 4 commands can't use PRP list: - Identify command - Namespace Attachment command - Namespace Management command - Set Features command So it should be enough to just validate these commands. Thanks, Ming ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() 2022-03-09 2:48 ` Ming Lei @ 2022-03-09 3:09 ` Keith Busch 0 siblings, 0 replies; 12+ messages in thread From: Keith Busch @ 2022-03-09 3:09 UTC (permalink / raw) To: Ming Lei Cc: Maurizio Lombardi, linux-nvme, axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei, linux-kernel On Wed, Mar 09, 2022 at 10:48:35AM +0800, Ming Lei wrote: > > > On Tue, Mar 08, 2022 at 04:39:04PM -0800, Keith Busch wrote: > > BTW, this issue is actually one real report from one Red Hat Customer. And the correct fix has always been to fix the application. > > > > > > But the spec states clearly the data length of IDENTIFY command is 4096 > > > and PRP list can't be used, so why do you think it isn't complete or > > > future proof to validate data length of IDENTIFY in nvme driver? > > > > The current spec says that opcode uses 4k today. What about some time in > > the future? > > spec change should only be applied on future hardware, which can not break > current in-market hardware. If a new Identify CNS were invented by committee that uses 8k, then the older driver enforcing only 4k mappings will create more corruption, and then it would be the driver's fault. > nvme target has validated the Identify's transfer length already. nvmet provides a fabrics targets, which uses SGL, not PRP. The SGL encodes the length, so it's possible to validate it. > > And why are you focusing on Identify anyway? > > Nvme spec states explicitly that the following 4 commands can't use PRP list: > > - Identify command > - Namespace Attachment command > - Namespace Management command > - Set Features command > > So it should be enough to just validate these commands. Why are these 4 opcodes so special that the driver should provide training wheels for broken apps, yet it must trust the same app with the hundreds of other possible opcodes through the same interface? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() 2022-03-08 16:45 nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() Maurizio Lombardi 2022-03-08 19:52 ` Keith Busch @ 2022-03-09 6:26 ` Christoph Hellwig 2022-03-09 16:23 ` Keith Busch 1 sibling, 1 reply; 12+ messages in thread From: Christoph Hellwig @ 2022-03-09 6:26 UTC (permalink / raw) To: Maurizio Lombardi Cc: linux-nvme, axboe, Christoph Hellwig, Sagi Grimberg, Ming Lei On Tue, Mar 08, 2022 at 05:45:20PM +0100, Maurizio Lombardi wrote: > We suspect that the root cause is that the nvme-host driver doesn't > enforce the 4096 byte limit for the IDENTIFY commands as the > nvme-target does (see the nvmet_execute_identify() --> > nvmet_check_transfer_len(req, NVME_IDENTIFY_DATA_SIZE) code). > So if we pass a 8192-byte buffer not aligned to the page size, it will > need 3 pages on archs where page size is 4k and the nvme spec says > that the data buffer may not cross more than one page boundary. > > Does it make sense to you? What's your opinion on this? Combination of a broken application (does what the spec explicitly tells it not do) and broken hardware (does the most stupid thing when fed invalid input), not much the driver can do here. But we really should talk to the nvme working group to ECN the text for the single PRP requirement to spell out the consequence in more detail, and maybe also mandate how it is handled for the next spec version. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() 2022-03-09 6:26 ` Christoph Hellwig @ 2022-03-09 16:23 ` Keith Busch 2022-03-10 16:04 ` Christoph Hellwig 0 siblings, 1 reply; 12+ messages in thread From: Keith Busch @ 2022-03-09 16:23 UTC (permalink / raw) To: Christoph Hellwig Cc: Maurizio Lombardi, linux-nvme, axboe, Sagi Grimberg, Ming Lei On Wed, Mar 09, 2022 at 07:26:30AM +0100, Christoph Hellwig wrote: > On Tue, Mar 08, 2022 at 05:45:20PM +0100, Maurizio Lombardi wrote: > > We suspect that the root cause is that the nvme-host driver doesn't > > enforce the 4096 byte limit for the IDENTIFY commands as the > > nvme-target does (see the nvmet_execute_identify() --> > > nvmet_check_transfer_len(req, NVME_IDENTIFY_DATA_SIZE) code). > > So if we pass a 8192-byte buffer not aligned to the page size, it will > > need 3 pages on archs where page size is 4k and the nvme spec says > > that the data buffer may not cross more than one page boundary. > > > > Does it make sense to you? What's your opinion on this? > > Combination of a broken application (does what the spec explicitly > tells it not do) and broken hardware (does the most stupid thing when > fed invalid input), not much the driver can do here. There's nothing the hardware can do either to know it was given invalid input here if PRP2 is page aligned. There's no way it can tell the difference between a PRP List vs PRP destination. > But we really should talk to the nvme working group to ECN the text > for the single PRP requirement to spell out the consequence in more > detail, and maybe also mandate how it is handled for the next spec > version. It's not a "single PRP requirement". The spec just says the "data structure is 4096 bytes". This can validly span 2 PRPs if the first one has a non-zero offset. The spec created the "NDT" command field to help detect mismatched host/device PRP expectations. Unfortunately it only applies to vendor-specific commands, and no one implemented it anyway. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() 2022-03-09 16:23 ` Keith Busch @ 2022-03-10 16:04 ` Christoph Hellwig 2022-03-10 17:38 ` Keith Busch 0 siblings, 1 reply; 12+ messages in thread From: Christoph Hellwig @ 2022-03-10 16:04 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Maurizio Lombardi, linux-nvme, axboe, Sagi Grimberg, Ming Lei On Wed, Mar 09, 2022 at 08:23:03AM -0800, Keith Busch wrote: > > Combination of a broken application (does what the spec explicitly > > tells it not do) and broken hardware (does the most stupid thing when > > fed invalid input), not much the driver can do here. > > There's nothing the hardware can do either to know it was given invalid > input here if PRP2 is page aligned. There's no way it can tell the > difference between a PRP List vs PRP destination. Well, it can know that there must be at most two PRP2 for Identify when the MPS is set to 4k. Yes, this is annoying especially with hardware allerated frontends, but that's what you get for that stupid globally harmful microptimization that PRPs are. > > But we really should talk to the nvme working group to ECN the text > > for the single PRP requirement to spell out the consequence in more > > detail, and maybe also mandate how it is handled for the next spec > > version. > > It's not a "single PRP requirement". The spec just says the "data > structure is 4096 bytes". This can validly span 2 PRPs if the first one > has a non-zero offset. Yes. But even 2 PRPs are not a PRP list given that the data pointer has two PRP fields. (completly ingoring the issues with non-4k MPS). ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() 2022-03-10 16:04 ` Christoph Hellwig @ 2022-03-10 17:38 ` Keith Busch 0 siblings, 0 replies; 12+ messages in thread From: Keith Busch @ 2022-03-10 17:38 UTC (permalink / raw) To: Christoph Hellwig Cc: Maurizio Lombardi, linux-nvme, axboe, Sagi Grimberg, Ming Lei On Thu, Mar 10, 2022 at 05:04:00PM +0100, Christoph Hellwig wrote: > On Wed, Mar 09, 2022 at 08:23:03AM -0800, Keith Busch wrote: > > > Combination of a broken application (does what the spec explicitly > > > tells it not do) and broken hardware (does the most stupid thing when > > > fed invalid input), not much the driver can do here. > > > > There's nothing the hardware can do either to know it was given invalid > > input here if PRP2 is page aligned. There's no way it can tell the > > difference between a PRP List vs PRP destination. > > Well, it can know that there must be at most two PRP2 for Identify when > the MPS is set to 4k. Yes, this is annoying especially with hardware > allerated frontends, but that's what you get for that stupid globally > harmful microptimization that PRPs are. Yep. If the host is purposufully tricked into making a PRP list when the real payload didn't need it, the device will think PRP2 is the destination buffer and corrupt that memory, and it's not the device's fault. I agree the optimization was not worth the trouble it inflicted. I don't believe it's the driver's responsibility either, though, and am completely against the driver providing any sanity checks for broken apps as Ming advocated. It is a maintenance nightmare and doomed for failure. Just one example: 1.0's 'Get Log Page' said no PRP lists allowed, so an older driver enforcing that would have crippled 1.1+ devices. The app must own the responsibility for using the interface correctly. ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2022-03-10 17:38 UTC | newest] Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-03-08 16:45 nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl() Maurizio Lombardi 2022-03-08 19:52 ` Keith Busch 2022-03-09 0:18 ` Ming Lei 2022-03-09 0:39 ` Keith Busch 2022-03-09 1:02 ` Ming Lei 2022-03-09 1:14 ` Keith Busch 2022-03-09 2:48 ` Ming Lei 2022-03-09 3:09 ` Keith Busch 2022-03-09 6:26 ` Christoph Hellwig 2022-03-09 16:23 ` Keith Busch 2022-03-10 16:04 ` Christoph Hellwig 2022-03-10 17:38 ` Keith Busch
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.