From: Lei Rao <lei.rao@intel.com>
To: kbusch@kernel.org, axboe@fb.com, kch@nvidia.com, hch@lst.de,
	sagi@grimberg.me, alex.williamson@redhat.com, cohuck@redhat.com,
	jgg@ziepe.ca, yishaih@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
	mjrosato@linux.ibm.com, linux-kernel@vger.kernel.org,
	linux-nvme@lists.infradead.org, kvm@vger.kernel.org
Cc: eddie.dong@intel.com, yadong.li@intel.com, yi.l.liu@intel.com,
	Konrad.wilk@oracle.com, stephen@eideticom.com,
	hang.yuan@intel.com, Lei Rao <lei.rao@intel.com>
Subject: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device
Date: Tue,  6 Dec 2022 13:58:16 +0800
Message-ID: <20221206055816.292304-6-lei.rao@intel.com>
In-Reply-To: <20221206055816.292304-1-lei.rao@intel.com>

The documentation describes the details of the NVMe hardware
extension to support VFIO live migration.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Yadong Li <yadong.li@intel.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
Reviewed-by: Hang Yuan <hang.yuan@intel.com>
---
 drivers/vfio/pci/nvme/nvme.txt | 278 +++++++++++++++++++++++++++++++++
 1 file changed, 278 insertions(+)
 create mode 100644 drivers/vfio/pci/nvme/nvme.txt

diff --git a/drivers/vfio/pci/nvme/nvme.txt b/drivers/vfio/pci/nvme/nvme.txt
new file mode 100644
index 000000000000..eadcf2082eed
--- /dev/null
+++ b/drivers/vfio/pci/nvme/nvme.txt
@@ -0,0 +1,278 @@
+===========================
+NVMe Live Migration Support
+===========================
+
+Introduction
+------------
+To support live migration of virtual functions (VFs), the NVMe device
+defines five vendor-specific admin commands and a capability flag in the
+vendor-specific area of the Identify Controller data structure. Software
+uses these admin commands to query the size of the device migration state
+data, to save and load that data, and to suspend and resume a given VF.
+The commands must be submitted to the admin queue of the NVMe PF device;
+they are ignored if placed in a VF device's admin queue. The reason is
+that, in a virtualization scenario, the NVMe VF device is passed through
+to the virtual machine, so its admin queue is not available to the
+hypervisor for submitting live migration commands. Software can use the
+capability flag in the Identify Controller data structure to detect
+whether the NVMe device supports live migration. The following sections
+describe the format of the commands and the capability flag in detail.
+
+Definition of opcode for live migration commands
+------------------------------------------------
+
++---------------------------+-----------+-----------+------------+
+|                           |           |           |            |
+|     Opcode by Field       |           |           |            |
+|                           |           |           |            |
++--------+---------+--------+           |           |            |
+|        |         |        | Combined  | Namespace |            |
+|    07  |  06:02  | 01:00  |  Opcode   | Identifier|  Command   |
+|        |         |        |           |    used   |            |
++--------+---------+--------+           |           |            |
+|Generic | Function|  Data  |           |           |            |
+|command |         |Transfer|           |           |            |
++--------+---------+--------+-----------+-----------+------------+
+|                                                                |
+|                     Vendor Specific Opcode                     |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           | Query the  |
+|   1b   |  10001  |  00    |   0xC4    |           | data size  |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           | Suspend the|
+|   1b   |  10010  |  00    |   0xC8    |           |    VF      |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           | Resume the |
+|   1b   |  10011  |  00    |   0xCC    |           |    VF      |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           | Save the   |
+|   1b   |  10100  |  10    |   0xD2    |           |device data |
++--------+---------+--------+-----------+-----------+------------+
+|        |         |        |           |           | Load the   |
+|   1b   |  10101  |  01    |   0xD5    |           |device data |
++--------+---------+--------+-----------+-----------+------------+
+
+Definition of QUERY_DATA_SIZE command
+-------------------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|   Bytes |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  Bits     |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  07:00    |Opcode(OPC):set to 0xC4 to indicate a query command                 | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  09:08    |Fused Operation(FUSE):Please see NVMe SPEC for more details[1]      | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |  13:10    |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  15:14    |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1]    | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  31:16    |Command Identifier(CID)                                             | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  39:04  |  Reserved                                                                          |
++---------+------------------------------------------------------------------------------------+
+|  41:40  |  VF index: means which VF controller internal data size to query                   |
++---------+------------------------------------------------------------------------------------+
+|  63:42  |  Reserved                                                                          |
++---------+------------------------------------------------------------------------------------+
+
+The QUERY_DATA_SIZE command is used to query the NVMe VF internal data size for live migration.
+When the NVMe firmware receives the command, it will return the size of NVMe VF internal
+data. The data size depends on how many IO queues are created.
+
+Definition of SUSPEND command
+-----------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|   Bytes |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  Bits     |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  07:00    |Opcode(OPC):set to 0xC8 to indicate a suspend command               | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  09:08    |Fused Operation(FUSE):Please see NVMe specification for details[1]  | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |  13:10    |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  15:14    |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1]     | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  31:16    |Command Identifier(CID)                                             | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  39:04  |  Reserved                                                                          |
++---------+------------------------------------------------------------------------------------+
+|  41:40  |  VF index: means which VF controller to suspend                                    |
++---------+------------------------------------------------------------------------------------+
+|  63:42  |  Reserved                                                                          |
++---------+------------------------------------------------------------------------------------+
+
+The SUSPEND command is used to suspend the NVMe VF controller. When the NVMe firmware receives
+this command, it will suspend the NVMe VF controller.
+
+Definition of RESUME command
+----------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|   Bytes |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  Bits     |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  07:00    |Opcode(OPC):set to 0xCC to indicate a resume command                | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  09:08    |Fused Operation(FUSE):Please see NVMe SPEC for details[1]           | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |  13:10    |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  15:14    |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1]     | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  31:16    |Command Identifier(CID)                                             | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  39:04  |  Reserved                                                                          |
++---------+------------------------------------------------------------------------------------+
+|  41:40  |  VF index: means which VF controller to resume                                     |
++---------+------------------------------------------------------------------------------------+
+|  63:42  |  Reserved                                                                          |
++---------+------------------------------------------------------------------------------------+
+
+The RESUME command is used to resume the NVMe VF controller. When firmware receives this command,
+it will restart the NVMe VF controller.
+
+Definition of SAVE_DATA command
+-------------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|   Bytes |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  Bits     |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  07:00    |Opcode(OPC):set to 0xD2 to indicate a save command                  | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  09:08    |Fused Operation(FUSE):Please see NVMe SPEC for details[1]           | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |  13:10    |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  15:14    |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1]     | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  31:16    |Command Identifier(CID)                                             | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  23:04  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+|  31:24  | PRP Entry1:the first PRP entry for the command or a PRP List Pointer               |
++---------+------------------------------------------------------------------------------------+
+|  39:32  | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)|
++---------+------------------------------------------------------------------------------------+
+|  41:40  | VF index: means which VF controller internal data to save                          |
++---------+------------------------------------------------------------------------------------+
+|  63:42  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+
+The SAVE_DATA command is used to save the NVMe VF internal data for live migration. When firmware
+receives this command, it will save the admin queue states, save some registers, drain IO SQs
+and CQs, save every IO queue state, disable the VF controller, and transfer all data to the
+host memory through DMA.
+
+Definition of LOAD_DATA command
+-------------------------------
+
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|   Bytes |                                    Description                                     |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|         |                                                                                    |
+|         |                                                                                    |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  Bits     |Description                                                         | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  07:00    |Opcode(OPC):set to 0xD5 to indicate a load command                  | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  09:08    |Fused Operation(FUSE):Please see NVMe SPEC for details[1]           | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|  03:00  | |  13:10    |Reserved                                                            | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  15:14    |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1]    | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         | |  31:16    |Command Identifier(CID)                                             | |
+|         | +-----------+--------------------------------------------------------------------+ |
+|         |                                                                                    |
+|         |                                                                                    |
++---------+------------------------------------------------------------------------------------+
+|  23:04  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+|  31:24  | PRP Entry1:the first PRP entry for the command or a PRP List Pointer               |
++---------+------------------------------------------------------------------------------------+
+|  39:32  | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)|
++---------+------------------------------------------------------------------------------------+
+|  41:40  | VF index: means which VF controller internal data to load                          |
++---------+------------------------------------------------------------------------------------+
+|  47:44  | Size: means the size of the device's internal data to be loaded                    |
++---------+------------------------------------------------------------------------------------+
+|  63:48  | Reserved                                                                           |
++---------+------------------------------------------------------------------------------------+
+
+The LOAD_DATA command is used to restore the NVMe VF internal data. When firmware receives this
+command, it will read the device's internal data from the host memory through DMA, restore the
+admin queue states and some registers, and restore every IO queue state.
+
+Extensions of the vendor-specific field in the identify controller data structure
+---------------------------------------------------------------------------------
+
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|  Bytes  | I/O  |Admin | Disc |        Description            |
+|         |      |      |      |                               |
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+| 01:00   |  M   |  M   |  R   | PCI Vendor ID(VID)            |
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+| 03:02   |  M   |  M   |  R   | PCI Subsystem Vendor ID(SSVID)|
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|  ...    | ...  | ...  | ...  |  ...                          |
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|  3072   |  O   |  O   |  O   | Live Migration Support        |
++---------+------+------+------+-------------------------------+
+|         |      |      |      |                               |
+|4095:3073|  O   |  O   |  O   | Vendor Specific               |
++---------+------+------+------+-------------------------------+
+
+According to the NVMe specification, bytes 3072 to 4095 of the Identify Controller data
+structure are vendor specific. The NVMe device uses byte 3072 to indicate whether live
+migration is supported: 0x00 means live migration is not supported, 0x01 means live
+migration is supported, and all other values are reserved.
+
+[1] https://nvmexpress.org/wp-content/uploads/NVMe-NVM-Express-2.0a-2021.07.26-Ratified.pdf
-- 
2.34.1
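
As a cross-check of the opcode, command-layout, and capability tables in
nvme.txt above, here is a minimal C sketch. It is not taken from the patch
series: the names nvme_lm_opcode, struct nvme_lm_cmd, nvme_lm_init_load and
nvme_lm_supported are hypothetical, the struct assumes a little-endian host
(kernel code would likely use __le16/__le32/__le64 fields instead), and bytes
43:42 of LOAD_DATA, which the table does not list, are assumed to be reserved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Combined opcodes as listed in the opcode table. */
enum {
	NVME_LM_QUERY_DATA_SIZE = 0xC4,
	NVME_LM_SUSPEND         = 0xC8,
	NVME_LM_RESUME          = 0xCC,
	NVME_LM_SAVE_DATA       = 0xD2,
	NVME_LM_LOAD_DATA       = 0xD5,
};

/*
 * Rebuild a combined opcode from its three fields:
 * bit 7 (generic command), bits 6:2 (function), bits 1:0 (data transfer).
 */
static inline uint8_t nvme_lm_opcode(uint8_t generic, uint8_t function,
				     uint8_t xfer)
{
	return (uint8_t)(((generic & 0x1) << 7) |
			 ((function & 0x1f) << 2) |
			 (xfer & 0x3));
}

/* 64-byte submission queue entry following the LOAD_DATA byte table. */
struct nvme_lm_cmd {
	uint8_t  opcode;     /* byte 00: OPC */
	uint8_t  flags;      /* byte 01: FUSE (bits 09:08), PSDT (bits 15:14) */
	uint16_t cid;        /* bytes 03:02: command identifier */
	uint8_t  rsvd4[20];  /* bytes 23:04: reserved */
	uint64_t prp1;       /* bytes 31:24: PRP Entry 1 */
	uint64_t prp2;       /* bytes 39:32: PRP Entry 2 */
	uint16_t vf_index;   /* bytes 41:40: target VF */
	uint8_t  rsvd42[2];  /* bytes 43:42: ASSUMED reserved, not in the table */
	uint32_t size;       /* bytes 47:44: data size, LOAD_DATA only */
	uint8_t  rsvd48[16]; /* bytes 63:48: reserved */
};

static_assert(sizeof(struct nvme_lm_cmd) == 64, "SQE must be 64 bytes");
static_assert(offsetof(struct nvme_lm_cmd, prp1) == 24, "PRP1 at byte 24");
static_assert(offsetof(struct nvme_lm_cmd, vf_index) == 40, "VF index at byte 40");
static_assert(offsetof(struct nvme_lm_cmd, size) == 44, "Size at byte 44");

/* Fill in a LOAD_DATA command; all fields not named here stay zero. */
static inline void nvme_lm_init_load(struct nvme_lm_cmd *cmd,
				     uint16_t vf_index, uint64_t prp1,
				     uint32_t size)
{
	memset(cmd, 0, sizeof(*cmd));
	cmd->opcode = NVME_LM_LOAD_DATA;
	cmd->prp1 = prp1;
	cmd->vf_index = vf_index;
	cmd->size = size;
}

/* Byte 3072 of the 4096-byte Identify Controller data: 0x01 = supported. */
static inline int nvme_lm_supported(const uint8_t *id_ctrl)
{
	return id_ctrl[3072] == 0x01;
}
```

Recomputing the opcodes from their fields reproduces the Combined Opcode
column. For example, generic = 1b, function = 10100b and data transfer = 10b
(controller to host) give 0xD2 for SAVE_DATA, while data transfer = 01b (host
to controller) with function = 10101b gives 0xD5 for LOAD_DATA.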

