From: Lei Rao <lei.rao@intel.com>
To: kbusch@kernel.org, axboe@fb.com, kch@nvidia.com, hch@lst.de,
sagi@grimberg.me, alex.williamson@redhat.com, cohuck@redhat.com,
jgg@ziepe.ca, yishaih@nvidia.com,
shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
mjrosato@linux.ibm.com, linux-kernel@vger.kernel.org,
linux-nvme@lists.infradead.org, kvm@vger.kernel.org
Cc: eddie.dong@intel.com, yadong.li@intel.com, yi.l.liu@intel.com,
Konrad.wilk@oracle.com, stephen@eideticom.com,
hang.yuan@intel.com, Lei Rao <lei.rao@intel.com>
Subject: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
Date: Tue, 6 Dec 2022 13:58:16 +0800
Message-ID: <20221206055816.292304-6-lei.rao@intel.com>
In-Reply-To: <20221206055816.292304-1-lei.rao@intel.com>
The documentation describes the details of the NVMe hardware
extension to support VFIO live migration.
Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
drivers/vfio/pci/nvme/nvme.txt | 278 +++++++++++++++++++++++++++++++++
1 file changed, 278 insertions(+)
create mode 100644 drivers/vfio/pci/nvme/nvme.txt
diff --git a/drivers/vfio/pci/nvme/nvme.txt b/drivers/vfio/pci/nvme/nvme.txt
new file mode 100644
index 000000000000..eadcf2082eed
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.txt
@@ -0,0 +1,278 @@
+===========================
+NVMe Live Migration Support
+===========================
+
+Introduction
+------------
+To support live migration, the NVMe device defines a vendor-specific extension:
+five new admin commands and a capability flag in the vendor-specific area of
+the Identify Controller data structure. Software uses the live migration
+admin commands to query the size of a VF's migration state data, save and
+load that data, and suspend and resume the given VF device. The commands are
+submitted to the admin queue of the NVMe PF device and are ignored if placed
+in the admin queue of a VF device. This is because, in a virtualization
+scenario, the NVMe VF device is passed through to the virtual machine, so its
+admin queue is not available to the hypervisor for submitting live migration
+commands. Software can use the capability flag in the Identify Controller
+data structure to detect whether the NVMe device supports live migration.
+The following sections describe the format of the commands and the capability
+flag in detail.
+
+Definition of opcode for live migration commands
+------------------------------------------------
+
++---------------------------+-----------+-----------+------------+
+|                           |           |           |            |
+|      Opcode by Field      |           |           |            |
+|                           |           |           |            |
++--------+---------+--------+           |           |            |
+|        |         |        | Combined  | Namespace |            |
+|   07   |  06:02  | 01:00  |  Opcode   | Identifier|  Command   |
+|        |         |        |           |   used    |            |
++--------+---------+--------+           |           |            |
+|Generic | Function|  Data  |           |           |            |
+|command |         |Transfer|           |           |            |
++--------+---------+--------+-----------+-----------+------------+
+|                                                                |
+|                     Vendor Specific Opcode                     |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           | Query the  |
+|   1b   |  10001  |   00   |   0xC4    |           | data size  |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           | Suspend the|
+|   1b   |  10010  |   00   |   0xC8    |           |     VF     |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           | Resume the |
+|   1b   |  10011  |   00   |   0xCC    |           |     VF     |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           |  Save the  |
+|   1b   |  10100  |   10   |   0xD2    |           |device data |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           |  Load the  |
+|   1b   |  10101  |   01   |   0xD5    |           |device data |
++--------+---------+--------+-----------+-----------+------------+
+
+Definition of QUERY_DATA_SIZE command
+-------------------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|  Bytes  |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   Bits    |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   07:00   |Opcode (OPC): set to 0xC4 to indicate a query command               | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   09:08   |Fused Operation (FUSE): see the NVMe specification for details [1]  | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |   13:10   |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   15:14   |PRP or SGL for Data Transfer (PSDT): see the NVMe specification [1] | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   31:16   |Command Identifier (CID)                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  39:04  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+|  41:40  | VF index: selects the VF controller whose internal data size is queried            |
++---------+------------------------------------------------------------------------------------+
+|  63:42  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+
+The QUERY_DATA_SIZE command queries the size of the NVMe VF internal data used for live
+migration. When the NVMe firmware receives the command, it returns the size of the VF's
+internal data. The size depends on how many I/O queues have been created.
+
+Definition of SUSPEND command
+-----------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|  Bytes  |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   Bits    |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   07:00   |Opcode (OPC): set to 0xC8 to indicate a suspend command             | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   09:08   |Fused Operation (FUSE): see the NVMe specification for details [1]  | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |   13:10   |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   15:14   |PRP or SGL for Data Transfer (PSDT): see the NVMe specification [1] | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   31:16   |Command Identifier (CID)                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  39:04  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+|  41:40  | VF index: selects the VF controller to suspend                                     |
++---------+------------------------------------------------------------------------------------+
+|  63:42  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+
+The SUSPEND command is used to suspend the NVMe VF controller. When the NVMe firmware receives
+this command, it will suspend the NVMe VF controller.
+
+Definition of RESUME command
+----------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|  Bytes  |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   Bits    |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   07:00   |Opcode (OPC): set to 0xCC to indicate a resume command              | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   09:08   |Fused Operation (FUSE): see the NVMe specification for details [1]  | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |   13:10   |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   15:14   |PRP or SGL for Data Transfer (PSDT): see the NVMe specification [1] | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   31:16   |Command Identifier (CID)                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  39:04  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+|  41:40  | VF index: selects the VF controller to resume                                      |
++---------+------------------------------------------------------------------------------------+
+|  63:42  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+
+The RESUME command is used to resume the NVMe VF controller. When firmware receives this command,
+it will restart the NVMe VF controller.
+
+Definition of SAVE_DATA command
+-------------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|  Bytes  |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   Bits    |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   07:00   |Opcode (OPC): set to 0xD2 to indicate a save command                | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   09:08   |Fused Operation (FUSE): see the NVMe specification for details [1]  | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |   13:10   |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   15:14   |PRP or SGL for Data Transfer (PSDT): see the NVMe specification [1] | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   31:16   |Command Identifier (CID)                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  23:04  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+|  31:24  | PRP Entry 1: first PRP entry for the command or a PRP List pointer                 |
++---------+------------------------------------------------------------------------------------+
+|  39:32  | PRP Entry 2: second address entry (reserved, page base address or PRP List pointer)|
++---------+------------------------------------------------------------------------------------+
+|  41:40  | VF index: selects the VF controller whose internal data is saved                   |
++---------+------------------------------------------------------------------------------------+
+|  63:42  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+
+The SAVE_DATA command is used to save the NVMe VF internal data for live migration. When firmware
+receives this command, it will save the admin queue states, save some registers, drain IO SQs
+and CQs, save every IO queue state, disable the VF controller, and transfer all data to the
+host memory through DMA.
+
+Definition of LOAD_DATA command
+-------------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|  Bytes  |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   Bits    |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   07:00   |Opcode (OPC): set to 0xD5 to indicate a load command                | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   09:08   |Fused Operation (FUSE): see the NVMe specification for details [1]  | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |   13:10   |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   15:14   |PRP or SGL for Data Transfer (PSDT): see the NVMe specification [1] | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |   31:16   |Command Identifier (CID)                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  23:04  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+|  31:24  | PRP Entry 1: first PRP entry for the command or a PRP List pointer                 |
++---------+------------------------------------------------------------------------------------+
+|  39:32  | PRP Entry 2: second address entry (reserved, page base address or PRP List pointer)|
++---------+------------------------------------------------------------------------------------+
+|  41:40  | VF index: selects the VF controller whose internal data is loaded                  |
++---------+------------------------------------------------------------------------------------+
+|  47:44  | Size: the size of the device's internal data to be loaded                          |
++---------+------------------------------------------------------------------------------------+
+|  63:48  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+
+The LOAD_DATA command is used to restore the NVMe VF internal data. When firmware receives this
+command, it will read the device's internal data from the host memory through DMA, restore the
+admin queue states and some registers, and restore every IO queue state.
+
+Extensions of the vendor-specific field in the identify controller data structure
+---------------------------------------------------------------------------------
+
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|  Bytes  | I/O  |Admin | Disc |          Description          |
+|         |      |      |      |                               |
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|  01:00  |  M   |  M   |  R   | PCI Vendor ID(VID)            |
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|  03:02  |  M   |  M   |  R   | PCI Subsystem Vendor ID(SSVID)|
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|   ...   | ...  | ...  | ...  | ...                           |
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|  3072   |  O   |  O   |  O   | Live Migration Support        |
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|4095:3073|  O   |  O   |  O   | Vendor Specific               |
++---------+------+------+------+-------------------------------+
+
+According to the NVMe specification, bytes 3072 to 4095 are vendor-specific fields.
+The NVMe device uses byte 3072 of the Identify Controller data structure to indicate
+whether live migration is supported: 0x00 means live migration is not supported, 0x01
+means live migration is supported, and all other values are reserved.
+
+[1] https://nvmexpress.org/wp-content/uploads/NVMe-NVM-Express-2.0a-2021.07.26-Ratified.pdf
--
2.34.1
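
The opcode table and the command layouts described in the document above can
be sketched in C as follows. This is a minimal illustration under stated
assumptions, not the driver code from this series: every identifier
(nvme_lm_*, put_le16) is hypothetical, multi-byte fields are assumed to be
little-endian as is usual for NVMe, and only the fields that the tables list
for the commands without data transfer (opcode, CID, VF index) are filled in.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Combined opcodes, copied from the opcode table in the document. */
enum nvme_lm_opcode {
	NVME_LM_QUERY_DATA_SIZE = 0xC4,
	NVME_LM_SUSPEND         = 0xC8,
	NVME_LM_RESUME          = 0xCC,
	NVME_LM_SAVE_DATA       = 0xD2,
	NVME_LM_LOAD_DATA       = 0xD5,
};

/* Byte offsets taken from the command tables (hypothetical names). */
#define NVME_LM_CMD_SIZE      64
#define NVME_LM_OFF_OPCODE    0   /* bits 07:00 of bytes 03:00 */
#define NVME_LM_OFF_CID       2   /* bits 31:16 of bytes 03:00 */
#define NVME_LM_OFF_VF_INDEX  40  /* bytes 41:40 */

/* Identify Controller data structure: capability flag at byte 3072. */
#define NVME_LM_ID_CTRL_SIZE  4096
#define NVME_LM_ID_CTRL_OFF   3072

/*
 * Compose an opcode from its fields: bit 7 is the generic-command bit,
 * bits 6:2 the function, bits 1:0 the data transfer direction.
 */
static uint8_t nvme_lm_make_opcode(unsigned int generic,
				   unsigned int function,
				   unsigned int xfer)
{
	assert(generic <= 1 && function <= 0x1f && xfer <= 3);
	return (uint8_t)((generic << 7) | (function << 2) | xfer);
}

/* Store a 16-bit value in little-endian byte order (assumed layout). */
static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}

/*
 * Fill a 64-byte command for QUERY_DATA_SIZE, SUSPEND or RESUME.
 * All bytes the tables mark as reserved are left zero.
 */
static void nvme_lm_build_cmd(uint8_t cmd[NVME_LM_CMD_SIZE], uint8_t opcode,
			      uint16_t cid, uint16_t vf_index)
{
	memset(cmd, 0, NVME_LM_CMD_SIZE);
	cmd[NVME_LM_OFF_OPCODE] = opcode;
	put_le16(&cmd[NVME_LM_OFF_CID], cid);
	put_le16(&cmd[NVME_LM_OFF_VF_INDEX], vf_index);
}

/* Return nonzero if byte 3072 reports live migration support (0x01). */
static int nvme_lm_supported(const uint8_t id_ctrl[NVME_LM_ID_CTRL_SIZE])
{
	return id_ctrl[NVME_LM_ID_CTRL_OFF] == 0x01;
}
```

A caller would first check nvme_lm_supported() on the PF's Identify
Controller data, then build a command with nvme_lm_build_cmd() and submit it
to the PF admin queue. SAVE_DATA and LOAD_DATA additionally need the PRP
entries (and, for LOAD_DATA, the size field), which this sketch leaves out.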