All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6 Resend] *** Vhost-pci RFC ***
@ 2016-05-29  8:11 ` Wei Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

This RFC proposes a design of vhost-pci, which is a new virtio device type.
The vhost-pci device is used for inter-VM communication. Please read the RFC
patches for details.


Wei Wang (6):
  Vhost-pci RFC: Introduction
  Vhost-pci RFC: Modification Scope
  Vhost-pci RFC: Benefits to KVM
  Vhost-pci RFC: Detailed Description in the Virtio Specification Format
  Vhost-pci RFC: Future Security Enhancement
  Vhost-pci RFC: Experimental Results

 Benefits          |   8 ++
 Details           | 324 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 FutureWorks       |  21 ++++
 Introduction      |  31 ++++++
 ModificationScope |   3 +
 Results           |  18 +++
 6 files changed, 405 insertions(+)
 create mode 100644 Benefits
 create mode 100644 Details
 create mode 100644 FutureWorks
 create mode 100644 Introduction
 create mode 100644 ModificationScope
 create mode 100644 Results

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH 0/6 Resend] *** Vhost-pci RFC ***
@ 2016-05-29  8:11 ` Wei Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

This RFC proposes a design of vhost-pci, which is a new virtio device type.
The vhost-pci device is used for inter-VM communication. Please read the RFC
patches for details.


Wei Wang (6):
  Vhost-pci RFC: Introduction
  Vhost-pci RFC: Modification Scope
  Vhost-pci RFC: Benefits to KVM
  Vhost-pci RFC: Detailed Description in the Virtio Specification Format
  Vhost-pci RFC: Future Security Enhancement
  Vhost-pci RFC: Experimental Results

 Benefits          |   8 ++
 Details           | 324 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 FutureWorks       |  21 ++++
 Introduction      |  31 ++++++
 ModificationScope |   3 +
 Results           |  18 +++
 6 files changed, 405 insertions(+)
 create mode 100644 Benefits
 create mode 100644 Details
 create mode 100644 FutureWorks
 create mode 100644 Introduction
 create mode 100644 ModificationScope
 create mode 100644 Results

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [virtio-comment] [PATCH 1/6 Resend] Vhost-pci RFC: Introduction
  2016-05-29  8:11 ` [Qemu-devel] " Wei Wang
@ 2016-05-29  8:11   ` Wei Wang
  -1 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 Introduction | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
 create mode 100644 Introduction

diff --git a/Introduction b/Introduction
new file mode 100644
index 0000000..8774676
--- /dev/null
+++ b/Introduction
@@ -0,0 +1,31 @@
+Current vhost-user based backend designs for virtio-net devices present scaling
+challenges, as communication intensive applications (e.g. virtual network
+functions) running in VMs start to stress this centralized design and resources
+assigned to it.
+
+Vhost-pci was initially proposed by Michael S. Tsirkin with vIOMMU support to
+offer a protected and point-to-point based inter-VM communication. In many
+cases, such as network function virtualization (NFV) and software defined
+networking (SDN) usages, VMs in an isolated network trust each other and they
+may be chained together to complete a task. In these use cases, people care
+more about performance than security. In this RFC we present a comprehensive
+design of vhost-pci without vIOMMU support. A VM with such a vhost-pci device
+is able to copy data to another VM's memory directly. The advantages of using
+vhost-pci over vhost-user are: 1) one less packet copy per packet transfer; and
+2) better scalability.
+
+To follow the naming conventions in the virtio specification, we call the VM
+who sends packets to the destination VM the device VM, and the VM who provides
+the vring and receives packets the driver VM. The vhost-pci device/driver works
+independently in the device VM. It can be considered as a simple device mapping
+the entire memory of a driver VM. But a lot of times, it may be used to
+communicate to a virtio device in the driver VM. That is, it usually plays the
+role of a backend part of a virtio device of a driver VM.
+
+The vhost-pci design is not limited to networking usages. The design presented
+in this RFC is quite fundamental, and it is potentially useful for other
+inter-VM data moving usages. For the convenience of descriptions, we will
+simply use "virtio device" to refer to the device on a driver VM that is backed
+by a vhost-pci device. The figures of this RFC are shown in this link:
+https://etherpad.opnfv.org/p/vhost-pci_RFC
+
-- 
1.8.3.1


This publicly archived list offers a means to provide input to the

OASIS Virtual I/O Device (VIRTIO) TC.



In order to verify user consent to the Feedback License terms and

to minimize spam in the list archive, subscription is required

before posting.



Subscribe: virtio-comment-subscribe@lists.oasis-open.org

Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org

List help: virtio-comment-help@lists.oasis-open.org

List archive: https://lists.oasis-open.org/archives/virtio-comment/

Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf

List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists

Committee: https://www.oasis-open.org/committees/virtio/

Join OASIS: https://www.oasis-open.org/join/

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH 1/6 Resend] Vhost-pci RFC: Introduction
@ 2016-05-29  8:11   ` Wei Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 Introduction | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)
 create mode 100644 Introduction

diff --git a/Introduction b/Introduction
new file mode 100644
index 0000000..8774676
--- /dev/null
+++ b/Introduction
@@ -0,0 +1,31 @@
+Current vhost-user based backend designs for virtio-net devices present scaling
+challenges, as communication intensive applications (e.g. virtual network
+functions) running in VMs start to stress this centralized design and resources
+assigned to it.
+
+Vhost-pci was initially proposed by Michael S. Tsirkin with vIOMMU support to
+offer a protected and point-to-point based inter-VM communication. In many
+cases, such as network function virtualization (NFV) and software defined
+networking (SDN) usages, VMs in an isolated network trust each other and they
+may be chained together to complete a task. In these use cases, people care
+more about performance than security. In this RFC we present a comprehensive
+design of vhost-pci without vIOMMU support. A VM with such a vhost-pci device
+is able to copy data to another VM's memory directly. The advantages of using
+vhost-pci over vhost-user are: 1) one less packet copy per packet transfer; and
+2) better scalability.
+
+To follow the naming conventions in the virtio specification, we call the VM
+who sends packets to the destination VM the device VM, and the VM who provides
+the vring and receives packets the driver VM. The vhost-pci device/driver works
+independently in the device VM. It can be considered as a simple device mapping
+the entire memory of a driver VM. But a lot of times, it may be used to
+communicate to a virtio device in the driver VM. That is, it usually plays the
+role of a backend part of a virtio device of a driver VM.
+
+The vhost-pci design is not limited to networking usages. The design presented
+in this RFC is quite fundamental, and it is potentially useful for other
+inter-VM data moving usages. For the convenience of descriptions, we will
+simply use "virtio device" to refer to the device on a driver VM that is backed
+by a vhost-pci device. The figures of this RFC are shown in this link:
+https://etherpad.opnfv.org/p/vhost-pci_RFC
+
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [virtio-comment] [PATCH 2/6 Resend] Vhost-pci RFC: Modification Scope
  2016-05-29  8:11 ` [Qemu-devel] " Wei Wang
@ 2016-05-29  8:11   ` Wei Wang
  -1 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 ModificationScope | 3 +++
 1 file changed, 3 insertions(+)
 create mode 100644 ModificationScope

diff --git a/ModificationScope b/ModificationScope
new file mode 100644
index 0000000..ea949ee
--- /dev/null
+++ b/ModificationScope
@@ -0,0 +1,3 @@
+Changes to: QEMU, Guest kernel
+Affects specific archs: x86
+Affects specific guests: Linux
-- 
1.8.3.1


This publicly archived list offers a means to provide input to the

OASIS Virtual I/O Device (VIRTIO) TC.



In order to verify user consent to the Feedback License terms and

to minimize spam in the list archive, subscription is required

before posting.



Subscribe: virtio-comment-subscribe@lists.oasis-open.org

Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org

List help: virtio-comment-help@lists.oasis-open.org

List archive: https://lists.oasis-open.org/archives/virtio-comment/

Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf

List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists

Committee: https://www.oasis-open.org/committees/virtio/

Join OASIS: https://www.oasis-open.org/join/

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH 2/6 Resend] Vhost-pci RFC: Modification Scope
@ 2016-05-29  8:11   ` Wei Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 ModificationScope | 3 +++
 1 file changed, 3 insertions(+)
 create mode 100644 ModificationScope

diff --git a/ModificationScope b/ModificationScope
new file mode 100644
index 0000000..ea949ee
--- /dev/null
+++ b/ModificationScope
@@ -0,0 +1,3 @@
+Changes to: QEMU, Guest kernel
+Affects specific archs: x86
+Affects specific guests: Linux
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [virtio-comment] [PATCH 3/6 Resend] Vhost-pci RFC: Benefits to KVM
  2016-05-29  8:11 ` [Qemu-devel] " Wei Wang
@ 2016-05-29  8:11   ` Wei Wang
  -1 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 Benefits | 8 ++++++++
 1 file changed, 8 insertions(+)
 create mode 100644 Benefits

diff --git a/Benefits b/Benefits
new file mode 100644
index 0000000..7d10485
--- /dev/null
+++ b/Benefits
@@ -0,0 +1,8 @@
+The vhost-pci design provides KVM with a faster and more scalable inter-VM
+communication mechanism. In NFV usages and today's virtualized datacenters,
+where multiple VMs connect to a centralized vswitch to communicate, the more
+the VMs connect to (or chained via) the vswitch, the lower the inter-VM
+communication throughput will be. The vhost-pci design successfully resolves
+this scalability issue and makes KVM a better hypervisor for NFV and
+datacenters.
+
-- 
1.8.3.1


This publicly archived list offers a means to provide input to the

OASIS Virtual I/O Device (VIRTIO) TC.



In order to verify user consent to the Feedback License terms and

to minimize spam in the list archive, subscription is required

before posting.



Subscribe: virtio-comment-subscribe@lists.oasis-open.org

Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org

List help: virtio-comment-help@lists.oasis-open.org

List archive: https://lists.oasis-open.org/archives/virtio-comment/

Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf

List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists

Committee: https://www.oasis-open.org/committees/virtio/

Join OASIS: https://www.oasis-open.org/join/

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH 3/6 Resend] Vhost-pci RFC: Benefits to KVM
@ 2016-05-29  8:11   ` Wei Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 Benefits | 8 ++++++++
 1 file changed, 8 insertions(+)
 create mode 100644 Benefits

diff --git a/Benefits b/Benefits
new file mode 100644
index 0000000..7d10485
--- /dev/null
+++ b/Benefits
@@ -0,0 +1,8 @@
+The vhost-pci design provides KVM with a faster and more scalable inter-VM
+communication mechanism. In NFV usages and today's virtualized datacenters,
+where multiple VMs connect to a centralized vswitch to communicate, the more
+the VMs connect to (or chained via) the vswitch, the lower the inter-VM
+communication throughput will be. The vhost-pci design successfully resolves
+this scalability issue and makes KVM a better hypervisor for NFV and
+datacenters.
+
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [virtio-comment] [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
  2016-05-29  8:11 ` [Qemu-devel] " Wei Wang
@ 2016-05-29  8:11   ` Wei Wang
  -1 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 Details | 324 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 324 insertions(+)
 create mode 100644 Details

diff --git a/Details b/Details
new file mode 100644
index 0000000..4ea2252
--- /dev/null
+++ b/Details
@@ -0,0 +1,324 @@
+1 Device ID
+TBD
+
+2 Virtqueues
+0 controlq
+
+3 Feature Bits
+3.1 Local Feature Bits
+Currently no local feature bits are defined, so the standard virtio feature
+bits negation will always be successful and complete.
+
+3.2 Remote Feature Bits
+The remote feature bits are obtained from the frontend virtio device and
+negotiated with the vhost-pci driver via the controlq. The negotiation steps
+are described in 4.5 Device Initialization.
+
+4 Device Configuration Layout
+struct vhost_pci_config {
+	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
+	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
+	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
+	u32 ack_type;
+	u32 ack_device_type;
+	u64 ack_device_id;
+	union {
+		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
+		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
+		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
+		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
+		u64 ack_memory_info;		
+		u64 ack_device_info;
+		u64 ack_feature_bits;
+	};
+};
+
+The configuration fields are currently used for the vhost-pci driver to
+acknowledge to the vhost-pci device after it receives controlq messages. 
+
+4.5 Device Initialization
+When a device VM boots, it creates a vhost-pci server socket.
+
+When a virtio device on the driver VM is created with specifying the use of a
+vhost-pci device as a backend, a client socket is created and connected to the
+corresponding vhost-pci server for message exchanges.
+
+The messages passed to the vhost-pci server is proceeded by the following
+header:
+struct vhost_pci_socket_hdr {
+	#define VHOST_PCI_SOCKET_MEMORY_INFO 0
+	#define VHOST_PCI_SOCKET_MEMORY_INFO_ACK 1
+	#define VHOST_PCI_SOCKET_DEVICE_INFO 2
+	#define VHOST_PCI_SOCKET_DEVICE_INFO_ACK 3
+	#define VHOST_PCI_SOCKET_FEATURE_BITS 4
+	#define VHOST_PCI_SOCKET_FEATURE_BITS_ACK 5
+	u16 msg_type;
+	u16 msg_version;
+	u32 msg_len;
+	u64 qemu_pid;
+};
+
+The payload of the above message types can be constructed using the structures
+below:
+/* VHOST_PCI_SOCKET_MEMORY_INFO message */
+struct vhost_pci_socket_memory_info {
+	#define VHOST_PCI_ADD_MEMORY 0
+	#define VHOST_PCI_DEL_MEMORY 1
+	u16 ops;
+	u32 nregions;
+	struct vhost_pci_memory_region {
+		int fd;
+		u64 guest_phys_addr;
+		u64 memory_size;
+		u64 mmap_offset;
+	} regions[VHOST_PCI_MAX_NREGIONS];
+};
+
+/* VHOST_PCI_SOCKET_DEVICE_INFO message */
+struct vhost_pci_device_info {
+	#define VHOST_PCI_ADD_FRONTEND_DEVICE 0
+	#define VHOST_PCI_DEL_FRONTEND_DEVICE 1
+	u16    ops;
+	u32    nvirtq;
+	#define VHOST_PCI_FRONTEND_DEVICE_NET 1
+	#define VHOST_PCI_FRONTEND_DEVICE_BLK 2
+	#define VHOST_PCI_FRONTEND_DEVICE_CONSOLE 3
+	#define VHOST_PCI_FRONTEND_DEVICE_ENTROPY 4
+	#define VHOST_PCI_FRONTEND_DEVICE_BALLOON 5
+	#define VHOST_PCI_FRONTEND_DEVICE_SCSI 8
+	u32    device_type;
+	u64    device_id;
+	struct virtq exotic_virtq[VHOST_PCI_MAX_NVIRTQ];
+};
+The device_id field identifies the device. For example, it can be used to 
+store a MAC address if the device_type is VHOST_PCI_FRONTEND_DEVICE_NET.
+
+/* VHOST_PCI_SOCKET_FEATURE_BITS message*/
+struct vhost_pci_feature_bits {
+	u64 feature_bits;
+};
+
+/* VHOST_PCI_SOCKET_xx_ACK messages */
+struct vhost_pci_socket_ack {
+	#define VHOST_PCI_SOCKET_ACK_ADD_DONE 0
+	#define VHOST_PCI_SOCKET_ACK_ADD_FAIL 1
+	#define VHOST_PCI_SOCKET_ACK_DEL_DONE 2
+	#define VHOST_PCI_SOCKET_ACK_DEL_FAIL 3
+	u64 ack;
+};
+
+The driver update message passed via the controlq is preceded by the following
+header:
+struct vhost_pci_controlq_hdr {
+	#define VHOST_PCI_CONTROLQ_MEMORY_INFO 0
+	#define VHOST_PCI_CONTROLQ_DEVICE_INFO 1
+	#define VHOST_PCI_CONTROLQ_FEATURE_BITS 2
+	#define VHOST_PCI_CONTROLQ_UPDATE_DONE 3
+	u16 msg_type;
+	u16 msg_version;
+	u32 msg_len;
+};
+
+The payload of a VHOST_PCI_CONTROLQ_MEMORY_INFO message can be constructed
+using the following structure:
+/* VHOST_PCI_CONTROLQ_MEMORY_INFO message */
+struct vhost_pci_controlq_memory_info {
+	#define VHOST_PCI_ADD_MEMORY 0
+	#define VHOST_PCI_DEL_MEMORY 1
+	u16  ops;
+	u32 nregion;
+	struct exotic_memory_region {
+		u64   region_base_xgpa;
+		u64   size;
+		u64   offset_in_bar_area;
+	} region[VHOST_PCI_MAX_NREGIONS];
+};
+
+The payload of VHOST_PCI_CONTROLQ_DEVICE_INFO and
+VHOST_PCI_CONTROLQ_FEATURE_BITS messages can be constructed using the
+vhost_pci_device_info structure and the vhost_pci_feature_bits structure
+respectively.
+
+The payload of a VHOST_PCI_CONTROLQ_UPDATE_DONE message can be constructed
+using the structure below:
+struct vhost_pci_controlq_update_done {
+	u32    device_type;
+	u64    device_id;
+};
+
+Fig. 1 shows the initialization steps.
+
+When the vhost-pci server receives a VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message,
+it checks if a vhost-pci device has been created for the requesting VM whose
+QEMU process id is qemu_pid. If yes, it will simply update the subsequent
+received messages to the vhost-pci driver via the controlq. Otherwise, the
+server creates a new vhost-pci device, and continues the following
+initialization steps.
+
+The vhost-pci server adds up all the memory region size, and uses a 64-bit
+device bar for the mapping of all the memory regions obtained from the socket
+message. To better support memory hot-plugging of the driver VM, the bar is
+configured with a double size of the driver VM's memory. The server maps the
+received memory info via the QEMU MemoryRegion mechanism, and then the new
+created vhost-pci device is hot-plugged to the VM.
+
+When the device status is updated with DRIVER_OK, a
+VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, which is stemed from the memory
+info socket message, is put on the controlq and a controlq interrupt is injected
+to the VM.
+
+When the vhost-pci server receives a
+VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE) acknowledgement from the driver,
+it sends a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE) message to the client
+that is identified by the ack_device_type and ack_device_id fields.
+
+When the vhost-pci server receives a
+VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message, a
+VHOST_PCI_CONTROLQ_FEATURE_BITS(feature bits) message is put on the controlq
+and a controlq interrupt is injected to the VM.
+
+If the vhost-pci server notices that the driver fully accepted the offered
+feature bits, it sends a VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message
+to the client. If the vhost-pci server notices that the vhost-pci driver only
+accepted a subset of the offered feature bits, it sends a
+VHOST_PCI_SOCKET_FEATURE_BITS(accepted feature bits) message back to the
+client. The client side virtio device re-negotiates the new feature bits with
+its driver, and sends back a VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE)
+message to the server.
+
+Either when the vhost-pci driver fully accepted the offered feature bits or a
+VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message is received from the
+client, the vhost-pci server puts a VHOST_PCI_CONTROLQ_UPDATE_DONE message on
+the controlq, and a controlq interrupt is injected to the VM.
+
+When the vhost-pci server receives a VHOST_PCI_SOCKET_DEVICE_INFO(ADD) message,
+a VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) message is put on the controlq and a
+controlq interrupt is injected to the VM.
+
+When the vhost-pci server receives a
+VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) acknowledgement from the driver,
+it sends a VHOST_PCI_SOCKET_DEVICE_INFO_ACK(ADD_DONE) message to the
+corresponding client.
+
+4.5.1 Device Requirements: Device Initialization
+To let a VM be capable of creating vhost-pci devices, a vhost-pci server MUST
+be created when it boots.
+
+The vhost-pci server socket path SHOULD be provided to a virtio client socket
+for the connection to the vhost-pci server.
+
+The virtio device MUST finish the feature bits negotiation with its driver
+before negotiating them with the vhost-pci device.
+
+If the client receives a VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message,
+it MUST reset the device to go into backwards capability mode, re-negotiate
+the received feature bits with its driver, and send back a
+VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the server.
+
+In any cases that an acknowledgement from the vhost-pci driver indicates a
+FAIL, the vhost-pci server SHOULD send a FAIL socket message to the client.
+
+In any cases that the msg_type is different between the sender and the
+receiver, the receiver SHOULD acknowledge a FAIL to the sender or convert the
+message to its version if the converted version is still functionally usable.
+
+4.5.2 Driver Requirements: Device Initialization
+The vhost-pci driver MUST NOT accept any feature bits that are not offered by
+the remote feature bits, and SHOULD acknowledge to the device of the accepted
+feature bits by writing them to the vhost_pci_config fields.
+
+When the vhost-pci driver receives a VHOST_PCI_CONTROLQ_UPDATE_DONE message
+from the controlq, the vhost-pci driver MUST initialize the corresponding
+driver interface of the device_type if it has not been initialized, and add
+the device_id to the frontend device list that records all the frontend virtio
+devices being supported by vhost-pci for inter-VM communications.
+
+The vhost-pci driver SHOULD acknowledge to the device that the device and
+memory info update (add or delete) is DONE or FAIL by writing the
+acknowledgement (DONE or FAIL) to the vhost_pci_config fields.
+
+The vhost-pci driver MUST ensure that writing to the vhost_pci_config fields
+to be atomic.
+
+4.6 Device Operation
+4.6.1 Device Requirements: Device Operation
+4.6.1.1 Frontend Device Info Update
+When the frontend virtio device changes any info (e.g. device_id, virtq
+address) that it has sent to the vhost-pci device, it SHOULD send a
+VHOST_PCI_SOCKET_DEVICE_INFO(ADD) message, which contains the new device info,
+to the vhost-pci server. The vhost-pci device SHOULD insert a
+VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) to the controlq and inject a contrlq
+interrupt to the VM.
+
+When the vhost-pci device receives a
+VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) acknowledgement from the driver,
+it SHOULD send a VHOST_PCI_SOCKET_DEVICE_INFO_ACK(ADD_DONE) message to the
+client that is identified by the ack_device_type and ack_device_id fields, to
+indicate that the vhost-pci driver has finished the handling of the device
+info update.
+
+4.6.1.2 Frontend Device Remove
+When the frontend virtio device is removed (e.g. hot-plug out), the client
+SHOULD send a VHOST_PCI_SOCKET_DEVICE_INFO(DEL) message to the vhost-pci
+server. The vhost-pci device SHOULD put a VHOST_PCI_CONTROLQ_DEVICE_INFO(DEL)
+message on the controlq and inject a contrlq interrupt to the VM.
+
+When the vhost-pci receives a VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(DEL_DONE), it
+SHOULD send a VHOST_PCI_SOCKET_DEVICE_INFO_ACK(DEL_DONE) message to the
+corresponding client to indicate that the vhost-pci driver has removed the
+vhost-pci based inter-VM communication support for the requesting virtio
+device.
+
+4.6.1.3 Driver VM Shutdown and Migration
+Before the driver VM is destroyed or migrated, all the clients that connect to
+the vhost-pci server SHOULD send a VHOST_PCI_SOCKET_DEVICE_INFO(DEL) message to
+the vhost-pci server. The destroying or migrating activity MUST wait until all
+the VHOST_PCI_SOCKET_DEL_CONNECTION_ACK(DEL_DONE) messages are received.
+
+When a vhost-pci device has no frontend devices, the vhost-pci device SHOULD be
+destroyed.
+
+4.6.1.4 Driver VM Memory Hot-plug
+When the vhost-pci server receives a VHOST_PCI_SOCKET_MEMORY_INFO(DEL) message,
+a VHOST_PCI_CONTROLQ_MEMORY_INFO(DEL) message SHOULD be put on the controlq and
+a controlq interrupt is injected to the VM. When the vhost-pci server receives
+a VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(DEL_DONE) acknowledgement from the driver,
+it SHOULD unmap that memory region and send a
+VHOST_PCI_SOCKET_MEMORY_INFO_ACK(DEL_DONE) message to the client.
+
+When the vhost-pci server receives a VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message,
+and the received memory info is new to what has already been mapped, it
+calculates the total received memory size.
+
+If the new memory size plus the mapped memory size is smaller than the address
+space size reserved by the bar, the server SHOULD map the new memory and expose
+it to the VM via the QEMU MemoryRegion mechanism. Then it SHOULD put the new
+memory info on the controlq, and injects a controlq interrupt to the VM.
+
+If the new memory size plus the mapped memory size is larger than the address
+space size reserved by the bar, the server clones out a new vhost-pci device,
+configures the bar size to be double of the current memory, hot-plugs out the
+old vhost-pci device, and hot-plugs in the new vhost-pci device to the VM. The
+initialization steps SHOULD follow 4.5 Device Initialization, except the
+interaction between the server and client is not needed.
+
+When the vhost-pci server receives a
+VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE) acknowledgement from the driver,
+it SHOULD send a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE) message to the
+client.
+
+4.6.2 Driver Requirements: Device Operation
+The vhost-pci driver SHOULD acknowledge to the vhost-pci device by writing
+VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) to the vhost_pci_config fields
+when it finishes handling the device info update.
+
+The vhost-pci driver SHOULD ensure that all the CPUs are noticed about the
+device info update before acknowledging to the vhost-pci device.
+
+The vhost-pci driver SHOULD acknowledge to the vhost-pci device by writing
+VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(DEL_DONE) to vhost_pci_config fields when
+it finishes removing the vhost-pci support for the requesting virtio device.
+
+The vhost-pci driver SHOULD ensure that all the CPUs are noticed about the
+removing of the vhost-pci support for the requesting virtio device before
+acknowledging to the vhost-pci device.
-- 
1.8.3.1


This publicly archived list offers a means to provide input to the

OASIS Virtual I/O Device (VIRTIO) TC.



In order to verify user consent to the Feedback License terms and

to minimize spam in the list archive, subscription is required

before posting.



Subscribe: virtio-comment-subscribe@lists.oasis-open.org

Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org

List help: virtio-comment-help@lists.oasis-open.org

List archive: https://lists.oasis-open.org/archives/virtio-comment/

Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf

List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists

Committee: https://www.oasis-open.org/committees/virtio/

Join OASIS: https://www.oasis-open.org/join/

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
@ 2016-05-29  8:11   ` Wei Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 Details | 324 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 324 insertions(+)
 create mode 100644 Details

diff --git a/Details b/Details
new file mode 100644
index 0000000..4ea2252
--- /dev/null
+++ b/Details
@@ -0,0 +1,324 @@
+1 Device ID
+TBD
+
+2 Virtqueues
+0 controlq
+
+3 Feature Bits
+3.1 Local Feature Bits
+Currently no local feature bits are defined, so the standard virtio feature
+bits negation will always be successful and complete.
+
+3.2 Remote Feature Bits
+The remote feature bits are obtained from the frontend virtio device and
+negotiated with the vhost-pci driver via the controlq. The negotiation steps
+are described in 4.5 Device Initialization.
+
+4 Device Configuration Layout
+struct vhost_pci_config {
+	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
+	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
+	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
+	u32 ack_type;
+	u32 ack_device_type;
+	u64 ack_device_id;
+	union {
+		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
+		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
+		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
+		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
+		u64 ack_memory_info;		
+		u64 ack_device_info;
+		u64 ack_feature_bits;
+	};
+};
+
+The configuration fields are currently used for the vhost-pci driver to
+acknowledge to the vhost-pci device after it receives controlq messages. 
+
+4.5 Device Initialization
+When a device VM boots, it creates a vhost-pci server socket.
+
+When a virtio device on the driver VM is created with specifying the use of a
+vhost-pci device as a backend, a client socket is created and connected to the
+corresponding vhost-pci server for message exchanges.
+
+The messages passed to the vhost-pci server is proceeded by the following
+header:
+struct vhost_pci_socket_hdr {
+	#define VHOST_PCI_SOCKET_MEMORY_INFO 0
+	#define VHOST_PCI_SOCKET_MEMORY_INFO_ACK 1
+	#define VHOST_PCI_SOCKET_DEVICE_INFO 2
+	#define VHOST_PCI_SOCKET_DEVICE_INFO_ACK 3
+	#define VHOST_PCI_SOCKET_FEATURE_BITS 4
+	#define VHOST_PCI_SOCKET_FEATURE_BITS_ACK 5
+	u16 msg_type;
+	u16 msg_version;
+	u32 msg_len;
+	u64 qemu_pid;
+};
+
+The payload of the above message types can be constructed using the structures
+below:
+/* VHOST_PCI_SOCKET_MEMORY_INFO message */
+struct vhost_pci_socket_memory_info {
+	#define VHOST_PCI_ADD_MEMORY 0
+	#define VHOST_PCI_DEL_MEMORY 1
+	u16 ops;
+	u32 nregions;
+	struct vhost_pci_memory_region {
+		int fd;
+		u64 guest_phys_addr;
+		u64 memory_size;
+		u64 mmap_offset;
+	} regions[VHOST_PCI_MAX_NREGIONS];
+};
+
+/* VHOST_PCI_SOCKET_DEVICE_INFO message */
+struct vhost_pci_device_info {
+	#define VHOST_PCI_ADD_FRONTEND_DEVICE 0
+	#define VHOST_PCI_DEL_FRONTEND_DEVICE 1
+	u16    ops;
+	u32    nvirtq;
+	#define VHOST_PCI_FRONTEND_DEVICE_NET 1
+	#define VHOST_PCI_FRONTEND_DEVICE_BLK 2
+	#define VHOST_PCI_FRONTEND_DEVICE_CONSOLE 3
+	#define VHOST_PCI_FRONTEND_DEVICE_ENTROPY 4
+	#define VHOST_PCI_FRONTEND_DEVICE_BALLOON 5
+	#define VHOST_PCI_FRONTEND_DEVICE_SCSI 8
+	u32    device_type;
+	u64    device_id;
+	struct virtq exotic_virtq[VHOST_PCI_MAX_NVIRTQ];
+};
+The device_id field identifies the device. For example, it can be used to 
+store a MAC address if the device_type is VHOST_PCI_FRONTEND_DEVICE_NET.
+
+/* VHOST_PCI_SOCKET_FEATURE_BITS message*/
+struct vhost_pci_feature_bits {
+	u64 feature_bits;
+};
+
+/* VHOST_PCI_SOCKET_xx_ACK messages */
+struct vhost_pci_socket_ack {
+	#define VHOST_PCI_SOCKET_ACK_ADD_DONE 0
+	#define VHOST_PCI_SOCKET_ACK_ADD_FAIL 1
+	#define VHOST_PCI_SOCKET_ACK_DEL_DONE 2
+	#define VHOST_PCI_SOCKET_ACK_DEL_FAIL 3
+	u64 ack;
+};
+
+The driver update message passed via the controlq is preceded by the following
+header:
+struct vhost_pci_controlq_hdr {
+	#define VHOST_PCI_CONTROLQ_MEMORY_INFO 0
+	#define VHOST_PCI_CONTROLQ_DEVICE_INFO 1
+	#define VHOST_PCI_CONTROLQ_FEATURE_BITS 2
+	#define VHOST_PCI_CONTROLQ_UPDATE_DONE 3
+	u16 msg_type;
+	u16 msg_version;
+	u32 msg_len;
+};
+
+The payload of a VHOST_PCI_CONTROLQ_MEMORY_INFO message can be constructed
+using the following structure:
+/* VHOST_PCI_CONTROLQ_MEMORY_INFO message */
+struct vhost_pci_controlq_memory_info {
+	#define VHOST_PCI_ADD_MEMORY 0
+	#define VHOST_PCI_DEL_MEMORY 1
+	u16  ops;
+	u32 nregion;
+	struct exotic_memory_region {
+		u64   region_base_xgpa;
+		u64   size;
+		u64   offset_in_bar_area;
+	} region[VHOST_PCI_MAX_NREGIONS];
+};
+
+The payload of VHOST_PCI_CONTROLQ_DEVICE_INFO and
+VHOST_PCI_CONTROLQ_FEATURE_BITS messages can be constructed using the
+vhost_pci_device_info structure and the vhost_pci_feature_bits structure
+respectively.
+
+The payload of a VHOST_PCI_CONTROLQ_UPDATE_DONE message can be constructed
+using the structure below:
+struct vhost_pci_controlq_update_done {
+	u32    device_type;
+	u64    device_id;
+};
+
+Fig. 1 shows the initialization steps.
+
+When the vhost-pci server receives a VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message,
+it checks if a vhost-pci device has been created for the requesting VM whose
+QEMU process id is qemu_pid. If yes, it will simply update the subsequent
+received messages to the vhost-pci driver via the controlq. Otherwise, the
+server creates a new vhost-pci device, and continues the following
+initialization steps.
+
+The vhost-pci server adds up all the memory region size, and uses a 64-bit
+device bar for the mapping of all the memory regions obtained from the socket
+message. To better support memory hot-plugging of the driver VM, the bar is
+configured with a double size of the driver VM's memory. The server maps the
+received memory info via the QEMU MemoryRegion mechanism, and then the new
+created vhost-pci device is hot-plugged to the VM.
+
+When the device status is updated with DRIVER_OK, a
+VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, which is stemed from the memory
+info socket message, is put on the controlq and a controlq interrupt is injected
+to the VM.
+
+When the vhost-pci server receives a
+VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE) acknowledgement from the driver,
+it sends a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE) message to the client
+that is identified by the ack_device_type and ack_device_id fields.
+
+When the vhost-pci server receives a
+VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message, a
+VHOST_PCI_CONTROLQ_FEATURE_BITS(feature bits) message is put on the controlq
+and a controlq interrupt is injected to the VM.
+
+If the vhost-pci server notices that the driver fully accepted the offered
+feature bits, it sends a VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message
+to the client. If the vhost-pci server notices that the vhost-pci driver only
+accepted a subset of the offered feature bits, it sends a
+VHOST_PCI_SOCKET_FEATURE_BITS(accepted feature bits) message back to the
+client. The client side virtio device re-negotiates the new feature bits with
+its driver, and sends back a VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE)
+message to the server.
+
+Either when the vhost-pci driver fully accepted the offered feature bits or a
+VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message is received from the
+client, the vhost-pci server puts a VHOST_PCI_CONTROLQ_UPDATE_DONE message on
+the controlq, and a controlq interrupt is injected to the VM.
+
+When the vhost-pci server receives a VHOST_PCI_SOCKET_DEVICE_INFO(ADD) message,
+a VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) message is put on the controlq and a
+controlq interrupt is injected to the VM.
+
+When the vhost-pci server receives a
+VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) acknowledgement from the driver,
+it sends a VHOST_PCI_SOCKET_DEVICE_INFO_ACK(ADD_DONE) message to the
+corresponding client.
+
+4.5.1 Device Requirements: Device Initialization
+To let a VM be capable of creating vhost-pci devices, a vhost-pci server MUST
+be created when it boots.
+
+The vhost-pci server socket path SHOULD be provided to a virtio client socket
+for the connection to the vhost-pci server.
+
+The virtio device MUST finish the feature bits negotiation with its driver
+before negotiating them with the vhost-pci device.
+
+If the client receives a VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message,
+it MUST reset the device to go into backwards capability mode, re-negotiate
+the received feature bits with its driver, and send back a
+VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the server.
+
+In any cases that an acknowledgement from the vhost-pci driver indicates a
+FAIL, the vhost-pci server SHOULD send a FAIL socket message to the client.
+
+In any cases that the msg_type is different between the sender and the
+receiver, the receiver SHOULD acknowledge a FAIL to the sender or convert the
+message to its version if the converted version is still functionally usable.
+
+4.5.2 Driver Requirements: Device Initialization
+The vhost-pci driver MUST NOT accept any feature bits that are not offered by
+the remote feature bits, and SHOULD acknowledge to the device of the accepted
+feature bits by writing them to the vhost_pci_config fields.
+
+When the vhost-pci driver receives a VHOST_PCI_CONTROLQ_UPDATE_DONE message
+from the controlq, the vhost-pci driver MUST initialize the corresponding
+driver interface of the device_type if it has not been initialized, and add
+the device_id to the frontend device list that records all the frontend virtio
+devices being supported by vhost-pci for inter-VM communications.
+
+The vhost-pci driver SHOULD acknowledge to the device that the device and
+memory info update (add or delete) is DONE or FAIL by writing the
+acknowledgement (DONE or FAIL) to the vhost_pci_config fields.
+
+The vhost-pci driver MUST ensure that writing to the vhost_pci_config fields
+to be atomic.
+
+4.6 Device Operation
+4.6.1 Device Requirements: Device Operation
+4.6.1.1 Frontend Device Info Update
+When the frontend virtio device changes any info (e.g. device_id, virtq
+address) that it has sent to the vhost-pci device, it SHOULD send a
+VHOST_PCI_SOCKET_DEVICE_INFO(ADD) message, which contains the new device info,
+to the vhost-pci server. The vhost-pci device SHOULD insert a
+VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) to the controlq and inject a contrlq
+interrupt to the VM.
+
+When the vhost-pci device receives a
+VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) acknowledgement from the driver,
+it SHOULD send a VHOST_PCI_SOCKET_DEVICE_INFO_ACK(ADD_DONE) message to the
+client that is identified by the ack_device_type and ack_device_id fields, to
+indicate that the vhost-pci driver has finished the handling of the device
+info update.
+
+4.6.1.2 Frontend Device Remove
+When the frontend virtio device is removed (e.g. hot-plug out), the client
+SHOULD send a VHOST_PCI_SOCKET_DEVICE_INFO(DEL) message to the vhost-pci
+server. The vhost-pci device SHOULD put a VHOST_PCI_CONTROLQ_DEVICE_INFO(DEL)
+message on the controlq and inject a contrlq interrupt to the VM.
+
+When the vhost-pci receives a VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(DEL_DONE), it
+SHOULD send a VHOST_PCI_SOCKET_DEVICE_INFO_ACK(DEL_DONE) message to the
+corresponding client to indicate that the vhost-pci driver has removed the
+vhost-pci based inter-VM communication support for the requesting virtio
+device.
+
+4.6.1.3 Driver VM Shutdown and Migration
+Before the driver VM is destroyed or migrated, all the clients that connect to
+the vhost-pci server SHOULD send a VHOST_PCI_SOCKET_DEVICE_INFO(DEL) message to
+the vhost-pci server. The destroying or migrating activity MUST wait until all
+the VHOST_PCI_SOCKET_DEL_CONNECTION_ACK(DEL_DONE) messages are received.
+
+When a vhost-pci device has no frontend devices, the vhost-pci device SHOULD be
+destroyed.
+
+4.6.1.4 Driver VM Memory Hot-plug
+When the vhost-pci server receives a VHOST_PCI_SOCKET_MEMORY_INFO(DEL) message,
+a VHOST_PCI_CONTROLQ_MEMORY_INFO(DEL) message SHOULD be put on the controlq and
+a controlq interrupt is injected to the VM. When the vhost-pci server receives
+a VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(DEL_DONE) acknowledgement from the driver,
+it SHOULD unmap that memory region and send a
+VHOST_PCI_SOCKET_MEMORY_INFO_ACK(DEL_DONE) message to the client.
+
+When the vhost-pci server receives a VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message,
+and the received memory info is new to what has already been mapped, it
+calculates the total received memory size.
+
+If the new memory size plus the mapped memory size is smaller than the address
+space size reserved by the bar, the server SHOULD map the new memory and expose
+it to the VM via the QEMU MemoryRegion mechanism. Then it SHOULD put the new
+memory info on the controlq, and injects a controlq interrupt to the VM.
+
+If the new memory size plus the mapped memory size is larger than the address
+space size reserved by the bar, the server clones out a new vhost-pci device,
+configures the bar size to be double of the current memory, hot-plugs out the
+old vhost-pci device, and hot-plugs in the new vhost-pci device to the VM. The
+initialization steps SHOULD follow 4.5 Device Initialization, except the
+interaction between the server and client is not needed.
+
+When the vhost-pci server receives a
+VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE) acknowledgement from the driver,
+it SHOULD send a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE) message to the
+client.
+
+4.6.2 Driver Requirements: Device Operation
+The vhost-pci driver SHOULD acknowledge to the vhost-pci device by writing
+VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) to the vhost_pci_config fields
+when it finishes handling the device info update.
+
+The vhost-pci driver SHOULD ensure that all the CPUs are noticed about the
+device info update before acknowledging to the vhost-pci device.
+
+The vhost-pci driver SHOULD acknowledge to the vhost-pci device by writing
+VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(DEL_DONE) to vhost_pci_config fields when
+it finishes removing the vhost-pci support for the requesting virtio device.
+
+The vhost-pci driver SHOULD ensure that all the CPUs are noticed about the
+removing of the vhost-pci support for the requesting virtio device before
+acknowledging to the vhost-pci device.
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [virtio-comment] [PATCH 5/6 Resend] Vhost-pci RFC: Future Security Enhancement
  2016-05-29  8:11 ` [Qemu-devel] " Wei Wang
@ 2016-05-29  8:11   ` Wei Wang
  -1 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 FutureWorks | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
 create mode 100644 FutureWorks

diff --git a/FutureWorks b/FutureWorks
new file mode 100644
index 0000000..210edcd
--- /dev/null
+++ b/FutureWorks
@@ -0,0 +1,21 @@
+The vhost-pci design is currently suitable for a group of VMs who trust each
+other. To extend it to a more general use case, two security features can be
+added in the future.
+
+1 vIOMMU
+vIOMMU provides the driver VM with the ability to restrict the device VM to
+transiently access a specified portion of its memory. The vhost-pci design
+proposed in this RFC can be extended to access the driver VM's memory with
+vIOMMU. Precisely, the vIOMMU engine in the driver VM configures access
+permissions (R/W) for the vhost-pci device to access its memory. More details
+can be found at https://wiki.opnfv.org/display/kvm/Vm2vm+Mst and
+https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg03993.html
+
+2 eptp switching
+The idea of eptp swithing allows a vhost-pci device driver to access the mapped
+driver VM's memory in an alternative view, where only a piece of trusted code
+can access the driver VM's memory. More details can be found at
+http://events.linuxfoundation.org/sites/events/files/slides/
+Jun_Nakajima_NFV_KVM%202015_final.pdf
+
+
-- 
1.8.3.1


This publicly archived list offers a means to provide input to the

OASIS Virtual I/O Device (VIRTIO) TC.



In order to verify user consent to the Feedback License terms and

to minimize spam in the list archive, subscription is required

before posting.



Subscribe: virtio-comment-subscribe@lists.oasis-open.org

Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org

List help: virtio-comment-help@lists.oasis-open.org

List archive: https://lists.oasis-open.org/archives/virtio-comment/

Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf

List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists

Committee: https://www.oasis-open.org/committees/virtio/

Join OASIS: https://www.oasis-open.org/join/

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH 5/6 Resend] Vhost-pci RFC: Future Security Enhancement
@ 2016-05-29  8:11   ` Wei Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 FutureWorks | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
 create mode 100644 FutureWorks

diff --git a/FutureWorks b/FutureWorks
new file mode 100644
index 0000000..210edcd
--- /dev/null
+++ b/FutureWorks
@@ -0,0 +1,21 @@
+The vhost-pci design is currently suitable for a group of VMs who trust each
+other. To extend it to a more general use case, two security features can be
+added in the future.
+
+1 vIOMMU
+vIOMMU provides the driver VM with the ability to restrict the device VM to
+transiently access a specified portion of its memory. The vhost-pci design
+proposed in this RFC can be extended to access the driver VM's memory with
+vIOMMU. Precisely, the vIOMMU engine in the driver VM configures access
+permissions (R/W) for the vhost-pci device to access its memory. More details
+can be found at https://wiki.opnfv.org/display/kvm/Vm2vm+Mst and
+https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg03993.html
+
+2 eptp switching
+The idea of eptp swithing allows a vhost-pci device driver to access the mapped
+driver VM's memory in an alternative view, where only a piece of trusted code
+can access the driver VM's memory. More details can be found at
+http://events.linuxfoundation.org/sites/events/files/slides/
+Jun_Nakajima_NFV_KVM%202015_final.pdf
+
+
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 6/6 Resend] Vhost-pci RFC: Experimental Results
  2016-05-29  8:11 ` [Qemu-devel] " Wei Wang
@ 2016-05-29  8:11   ` Wei Wang
  -1 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 Results | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
 create mode 100644 Results

diff --git a/Results b/Results
new file mode 100644
index 0000000..7402826
--- /dev/null
+++ b/Results
@@ -0,0 +1,18 @@
+We have built a fundamental vhost-pci based inter-VM communication framework
+for network packet transmission. To test the throughput affected by scaling
+with more VMs to stream out packets, we chain 2 to 5 VMs, and follow the vsperf
+test methodology proposed by OPNFV, as shown in Fig. 2. The first VM is
+passthrough-ed with a physical NIC to inject packets from an external packet
+generator, and the last VM is passthrough-ed with a physical NIC to eject
+packets back to the external generator. A layer2 forwarding module in each VM
+is responsible for forwarding incoming packets from NIC1 (the injection NIC) to
+NIC2 (the ejection NIC). In the traditional way, NIC2 is a virtio-net device
+connected to the vhost-user backend in OVS. With our proposed solution, NIC2 is
+a vhost-pci device, which directly copies packets to the next VM. The packet
+generator implements the RFC2544 standard, which keeps running at a 0 packet
+loss rate.
+
+Fig. 3 shows the scalability test results. In the vhost-user case, a
+significant performance drop (40%~55%) occurs when 4 and 5 VMs are chained
+together. The vhost-pci based inter-VM communication scales well (no
+significant throughput drop) with more VMs are chained together.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Qemu-devel] [PATCH 6/6 Resend] Vhost-pci RFC: Experimental Results
@ 2016-05-29  8:11   ` Wei Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Wei Wang @ 2016-05-29  8:11 UTC (permalink / raw)
  To: kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini; +Cc: Wei Wang

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 Results | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
 create mode 100644 Results

diff --git a/Results b/Results
new file mode 100644
index 0000000..7402826
--- /dev/null
+++ b/Results
@@ -0,0 +1,18 @@
+We have built a fundamental vhost-pci based inter-VM communication framework
+for network packet transmission. To test the throughput affected by scaling
+with more VMs to stream out packets, we chain 2 to 5 VMs, and follow the vsperf
+test methodology proposed by OPNFV, as shown in Fig. 2. The first VM is
+passthrough-ed with a physical NIC to inject packets from an external packet
+generator, and the last VM is passthrough-ed with a physical NIC to eject
+packets back to the external generator. A layer2 forwarding module in each VM
+is responsible for forwarding incoming packets from NIC1 (the injection NIC) to
+NIC2 (the ejection NIC). In the traditional way, NIC2 is a virtio-net device
+connected to the vhost-user backend in OVS. With our proposed solution, NIC2 is
+a vhost-pci device, which directly copies packets to the next VM. The packet
+generator implements the RFC2544 standard, which keeps running at a 0 packet
+loss rate.
+
+Fig. 3 shows the scalability test results. In the vhost-user case, a
+significant performance drop (40%~55%) occurs when 4 and 5 VMs are chained
+together. The vhost-pci based inter-VM communication scales well (no
+significant throughput drop) with more VMs are chained together.
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [virtio-comment] [PATCH 5/6 Resend] Vhost-pci RFC: Future Security Enhancement
  2016-05-29  8:11   ` [Qemu-devel] " Wei Wang
@ 2016-05-30  6:23     ` Jan Kiszka
  -1 siblings, 0 replies; 34+ messages in thread
From: Jan Kiszka @ 2016-05-30  6:23 UTC (permalink / raw)
  To: Wei Wang, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On 2016-05-29 10:11, Wei Wang wrote:
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> ---
>  FutureWorks | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
>  create mode 100644 FutureWorks
> 
> diff --git a/FutureWorks b/FutureWorks
> new file mode 100644
> index 0000000..210edcd
> --- /dev/null
> +++ b/FutureWorks
> @@ -0,0 +1,21 @@
> +The vhost-pci design is currently suitable for a group of VMs who trust each
> +other. To extend it to a more general use case, two security features can be
> +added in the future.

Sounds a bit like security is just "nice to have" in the foreseen use
cases of this mechanism. Is that really true?

> +
> +1 vIOMMU
> +vIOMMU provides the driver VM with the ability to restrict the device VM to
> +transiently access a specified portion of its memory. The vhost-pci design
> +proposed in this RFC can be extended to access the driver VM's memory with
> +vIOMMU. Precisely, the vIOMMU engine in the driver VM configures access
> +permissions (R/W) for the vhost-pci device to access its memory. More details
> +can be found at https://wiki.opnfv.org/display/kvm/Vm2vm+Mst and
> +https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg03993.html

Do you have performance estimates on this approach already?

One challenge should be how to let the VMs reuse existing buffer
mappings so that the vIOMMU isn't continuously reprogrammed - which is
likely not very efficient.

The other is how to hand over packets/buffers in a chain of multiple
VMs. Ideally, there is already a hand-over from sender to the first
receiver so that the sender can no longer mess with the packet after the
receiver started processing it. However, that will work against efficiency.

Essentially, it's the old IPC question of remap vs. copy here. The rest
is "just" interfaces to exploit this elegantly.

> +
> +2 eptp switching
> +The idea of eptp swithing allows a vhost-pci device driver to access the mapped
> +driver VM's memory in an alternative view, where only a piece of trusted code
> +can access the driver VM's memory. More details can be found at
> +http://events.linuxfoundation.org/sites/events/files/slides/
> +Jun_Nakajima_NFV_KVM%202015_final.pdf

As we learned a while back, this one is not really secure. Any updates
on if/how this is going to be fixed?

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [virtio-comment] [PATCH 5/6 Resend] Vhost-pci RFC: Future Security Enhancement
@ 2016-05-30  6:23     ` Jan Kiszka
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Kiszka @ 2016-05-30  6:23 UTC (permalink / raw)
  To: Wei Wang, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On 2016-05-29 10:11, Wei Wang wrote:
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> ---
>  FutureWorks | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
>  create mode 100644 FutureWorks
> 
> diff --git a/FutureWorks b/FutureWorks
> new file mode 100644
> index 0000000..210edcd
> --- /dev/null
> +++ b/FutureWorks
> @@ -0,0 +1,21 @@
> +The vhost-pci design is currently suitable for a group of VMs who trust each
> +other. To extend it to a more general use case, two security features can be
> +added in the future.

Sounds a bit like security is just "nice to have" in the foreseen use
cases of this mechanism. Is that really true?

> +
> +1 vIOMMU
> +vIOMMU provides the driver VM with the ability to restrict the device VM to
> +transiently access a specified portion of its memory. The vhost-pci design
> +proposed in this RFC can be extended to access the driver VM's memory with
> +vIOMMU. Precisely, the vIOMMU engine in the driver VM configures access
> +permissions (R/W) for the vhost-pci device to access its memory. More details
> +can be found at https://wiki.opnfv.org/display/kvm/Vm2vm+Mst and
> +https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg03993.html

Do you have performance estimates on this approach already?

One challenge should be how to let the VMs reuse existing buffer
mappings so that the vIOMMU isn't continuously reprogrammed - which is
likely not very efficient.

The other is how to hand over packets/buffers in a chain of multiple
VMs. Ideally, there is already a hand-over from sender to the first
receiver so that the sender can no longer mess with the packet after the
receiver started processing it. However, that will work against efficiency.

Essentially, it's the old IPC question of remap vs. copy here. The rest
is "just" interfaces to exploit this elegantly.

> +
> +2 eptp switching
> +The idea of eptp swithing allows a vhost-pci device driver to access the mapped
> +driver VM's memory in an alternative view, where only a piece of trusted code
> +can access the driver VM's memory. More details can be found at
> +http://events.linuxfoundation.org/sites/events/files/slides/
> +Jun_Nakajima_NFV_KVM%202015_final.pdf

As we learned a while back, this one is not really secure. Any updates
on if/how this is going to be fixed?

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [virtio-comment] [PATCH 5/6 Resend] Vhost-pci RFC: Future Security Enhancement
  2016-05-30  6:23     ` [Qemu-devel] " Jan Kiszka
@ 2016-05-31  8:00       ` Wang, Wei W
  -1 siblings, 0 replies; 34+ messages in thread
From: Wang, Wei W @ 2016-05-31  8:00 UTC (permalink / raw)
  To: Jan Kiszka, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On Mon 5/30/2016 2:24 PM, Jan Kiszka Wrote:
> On 2016-05-29 10:11, Wei Wang wrote:
> > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > ---
> >  FutureWorks | 21 +++++++++++++++++++++
> >  1 file changed, 21 insertions(+)
> >  create mode 100644 FutureWorks
> >
> > diff --git a/FutureWorks b/FutureWorks new file mode 100644 index
> > 0000000..210edcd
> > --- /dev/null
> > +++ b/FutureWorks
> > @@ -0,0 +1,21 @@
> > +The vhost-pci design is currently suitable for a group of VMs who
> > +trust each other. To extend it to a more general use case, two
> > +security features can be added in the future.
> 
> Sounds a bit like security is just "nice to have" in the foreseen use cases of this
> mechanism. Is that really true?

Not really. It's usually a tradeoff between performance and security, so I think having security doesn't always mean "Nice" :-)

Instead of proposing a compromised solution, we can actually offer two independent solutions, performance oriented vhost-pci (let's call it fast vhost-pci) and security oriented vhost-pci (say, secure vhost-pci). It's up to the users to choose which one to use according to their use cases. So, the secured version of vhost-pci can be viewed as another option for users (not a replacement of this proposal).

Here is a use example:
There are two groups of VMs running on the same host machine. The frequent inter-VM communication between VMs in Group A can choose the fast vhost-pci mechanism. In a special case that a VM from Group A needs to communicate with a VM from Group B, they should set up a new NIC each and specify the use of the secure vhost-pci. 
Since the secure vhost-pci is on our future plan, the traditional vhost-user can be an option for that inter-Group communication currently.

> > +
> > +1 vIOMMU
> > +vIOMMU provides the driver VM with the ability to restrict the device
> > +VM to transiently access a specified portion of its memory. The
> > +vhost-pci design proposed in this RFC can be extended to access the
> > +driver VM's memory with vIOMMU. Precisely, the vIOMMU engine in the
> > +driver VM configures access permissions (R/W) for the vhost-pci
> > +device to access its memory. More details can be found at
> > +https://wiki.opnfv.org/display/kvm/Vm2vm+Mst and
> > +https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg03993.html
> 
> Do you have performance estimates on this approach already?
> 
> One challenge should be how to let the VMs reuse existing buffer mappings so
> that the vIOMMU isn't continuously reprogrammed - which is likely not very
> efficient.

I think one option here is to reserve a large block of GPA area (like a memory pool). The buffers are allocated from and freed to the pool.

Another one would be using batching. For example, set up a batch of 32 buffers (just give the starting guest physical address, and the 32 buffers are guest-physically continuous) each time.

> The other is how to hand over packets/buffers in a chain of multiple VMs. Ideally,
> there is already a hand-over from sender to the first receiver so that the sender
> can no longer mess with the packet after the receiver started processing it.
> However, that will work against efficiency.
> 
> Essentially, it's the old IPC question of remap vs. copy here. The rest is "just"
> interfaces to exploit this elegantly.

There are several ways to do a remapping. The remapping we are using here is to have the entire driver VM's memory mapped by the device VM, that is, the driver VM's memory is completely shared with the device VM. 

If I understand that old remapping based IPC problem correctly, this kind of remapping requires a high degree of coordination between the two parts (i.e. the device VM and the driver VM). I think "virtq" is right the coordination mechanism here - the device VM grabs and fills a buffer from the available ring and puts the filled buffer to the used ring,  so I don't think the sender would mess with the packet after the receiver started to process it.
Please point out if I didn't get your point correctly. Thanks.

> 
> > +
> > +2 eptp switching
> > +The idea of eptp swithing allows a vhost-pci device driver to access
> > +the mapped driver VM's memory in an alternative view, where only a
> > +piece of trusted code can access the driver VM's memory. More details
> > +can be found at
> > +http://events.linuxfoundation.org/sites/events/files/slides/
> > +Jun_Nakajima_NFV_KVM%202015_final.pdf
> 
> As we learned a while back, this one is not really secure. Any updates on if/how
> this is going to be fixed?
> 

Right, that is why we claimed it as a protection mechanism. However, one option we are trying is to fix the related holes from the hardware. We can give updates once it's ready.

Best,
Wei


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [virtio-comment] [PATCH 5/6 Resend] Vhost-pci RFC: Future Security Enhancement
@ 2016-05-31  8:00       ` Wang, Wei W
  0 siblings, 0 replies; 34+ messages in thread
From: Wang, Wei W @ 2016-05-31  8:00 UTC (permalink / raw)
  To: Jan Kiszka, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On Mon 5/30/2016 2:24 PM, Jan Kiszka Wrote:
> On 2016-05-29 10:11, Wei Wang wrote:
> > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > ---
> >  FutureWorks | 21 +++++++++++++++++++++
> >  1 file changed, 21 insertions(+)
> >  create mode 100644 FutureWorks
> >
> > diff --git a/FutureWorks b/FutureWorks new file mode 100644 index
> > 0000000..210edcd
> > --- /dev/null
> > +++ b/FutureWorks
> > @@ -0,0 +1,21 @@
> > +The vhost-pci design is currently suitable for a group of VMs who
> > +trust each other. To extend it to a more general use case, two
> > +security features can be added in the future.
> 
> Sounds a bit like security is just "nice to have" in the foreseen use cases of this
> mechanism. Is that really true?

Not really. It's usually a tradeoff between performance and security, so I think having security doesn't always mean "Nice" :-)

Instead of proposing a compromised solution, we can actually offer two independent solutions, performance oriented vhost-pci (let's call it fast vhost-pci) and security oriented vhost-pci (say, secure vhost-pci). It's up to the users to choose which one to use according to their use cases. So, the secured version of vhost-pci can be viewed as another option for users (not a replacement of this proposal).

Here is a use example:
There are two groups of VMs running on the same host machine. The frequent inter-VM communication between VMs in Group A can choose the fast vhost-pci mechanism. In a special case that a VM from Group A needs to communicate with a VM from Group B, they should set up a new NIC each and specify the use of the secure vhost-pci. 
Since the secure vhost-pci is on our future plan, the traditional vhost-user can be an option for that inter-Group communication currently.

> > +
> > +1 vIOMMU
> > +vIOMMU provides the driver VM with the ability to restrict the device
> > +VM to transiently access a specified portion of its memory. The
> > +vhost-pci design proposed in this RFC can be extended to access the
> > +driver VM's memory with vIOMMU. Precisely, the vIOMMU engine in the
> > +driver VM configures access permissions (R/W) for the vhost-pci
> > +device to access its memory. More details can be found at
> > +https://wiki.opnfv.org/display/kvm/Vm2vm+Mst and
> > +https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg03993.html
> 
> Do you have performance estimates on this approach already?
> 
> One challenge should be how to let the VMs reuse existing buffer mappings so
> that the vIOMMU isn't continuously reprogrammed - which is likely not very
> efficient.

I think one option here is to reserve a large block of GPA area (like a memory pool). The buffers are allocated from and freed to the pool.

Another one would be using batching. For example, set up a batch of 32 buffers (just give the starting guest physical address, and the 32 buffers are guest-physically continuous) each time.

> The other is how to hand over packets/buffers in a chain of multiple VMs. Ideally,
> there is already a hand-over from sender to the first receiver so that the sender
> can no longer mess with the packet after the receiver started processing it.
> However, that will work against efficiency.
> 
> Essentially, it's the old IPC question of remap vs. copy here. The rest is "just"
> interfaces to exploit this elegantly.

There are several ways to do a remapping. The remapping we are using here is to have the entire driver VM's memory mapped by the device VM, that is, the driver VM's memory is completely shared with the device VM. 

If I understand that old remapping based IPC problem correctly, this kind of remapping requires a high degree of coordination between the two parts (i.e. the device VM and the driver VM). I think "virtq" is right the coordination mechanism here - the device VM grabs and fills a buffer from the available ring and puts the filled buffer to the used ring,  so I don't think the sender would mess with the packet after the receiver started to process it.
Please point out if I didn't get your point correctly. Thanks.

> 
> > +
> > +2 eptp switching
> > +The idea of eptp swithing allows a vhost-pci device driver to access
> > +the mapped driver VM's memory in an alternative view, where only a
> > +piece of trusted code can access the driver VM's memory. More details
> > +can be found at
> > +http://events.linuxfoundation.org/sites/events/files/slides/
> > +Jun_Nakajima_NFV_KVM%202015_final.pdf
> 
> As we learned a while back, this one is not really secure. Any updates on if/how
> this is going to be fixed?
> 

Right, that is why we claimed it as a protection mechanism. However, one option we are trying is to fix the related holes from the hardware. We can give updates once it's ready.

Best,
Wei

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
  2016-05-29  8:11   ` [Qemu-devel] " Wei Wang
@ 2016-06-01  8:15     ` Xiao Guangrong
  -1 siblings, 0 replies; 34+ messages in thread
From: Xiao Guangrong @ 2016-06-01  8:15 UTC (permalink / raw)
  To: Wei Wang, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini



On 05/29/2016 04:11 PM, Wei Wang wrote:
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> ---
>   Details | 324 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 324 insertions(+)
>   create mode 100644 Details
>
> diff --git a/Details b/Details
> new file mode 100644
> index 0000000..4ea2252
> --- /dev/null
> +++ b/Details
> @@ -0,0 +1,324 @@
> +1 Device ID
> +TBD
> +
> +2 Virtqueues
> +0 controlq
> +
> +3 Feature Bits
> +3.1 Local Feature Bits
> +Currently no local feature bits are defined, so the standard virtio feature
> +bits negation will always be successful and complete.
> +
> +3.2 Remote Feature Bits
> +The remote feature bits are obtained from the frontend virtio device and
> +negotiated with the vhost-pci driver via the controlq. The negotiation steps
> +are described in 4.5 Device Initialization.
> +
> +4 Device Configuration Layout
> +struct vhost_pci_config {
> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
> +	u32 ack_type;
> +	u32 ack_device_type;
> +	u64 ack_device_id;
> +	union {
> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
> +		u64 ack_memory_info;		
> +		u64 ack_device_info;
> +		u64 ack_feature_bits;
> +	};
> +};

Do you need to write all these 4 field to ack the operation? It seems
it is not efficient and it is not flexible if the driver need to
offer more data to the device in the further. Can we dedicate a
vq for this purpose?

BTW, current approach can not handle the case if there are multiple
same kind of requests in the control queue, e.g, if there are two
memory-add request in the control queue.

> +
> +The configuration fields are currently used for the vhost-pci driver to
> +acknowledge to the vhost-pci device after it receives controlq messages.
> +
> +4.5 Device Initialization
> +When a device VM boots, it creates a vhost-pci server socket.
> +
> +When a virtio device on the driver VM is created with specifying the use of a
> +vhost-pci device as a backend, a client socket is created and connected to the
> +corresponding vhost-pci server for message exchanges.
> +
> +The messages passed to the vhost-pci server is proceeded by the following
> +header:
> +struct vhost_pci_socket_hdr {
> +	#define VHOST_PCI_SOCKET_MEMORY_INFO 0
> +	#define VHOST_PCI_SOCKET_MEMORY_INFO_ACK 1
> +	#define VHOST_PCI_SOCKET_DEVICE_INFO 2
> +	#define VHOST_PCI_SOCKET_DEVICE_INFO_ACK 3
> +	#define VHOST_PCI_SOCKET_FEATURE_BITS 4
> +	#define VHOST_PCI_SOCKET_FEATURE_BITS_ACK 5
> +	u16 msg_type;
> +	u16 msg_version;
> +	u32 msg_len;
> +	u64 qemu_pid;
> +};
> +
> +The payload of the above message types can be constructed using the structures
> +below:
> +/* VHOST_PCI_SOCKET_MEMORY_INFO message */
> +struct vhost_pci_socket_memory_info {
> +	#define VHOST_PCI_ADD_MEMORY 0
> +	#define VHOST_PCI_DEL_MEMORY 1
> +	u16 ops;
> +	u32 nregions;
> +	struct vhost_pci_memory_region {
> +		int fd;
> +		u64 guest_phys_addr;
> +		u64 memory_size;
> +		u64 mmap_offset;
> +	} regions[VHOST_PCI_MAX_NREGIONS];
> +};
> +
> +/* VHOST_PCI_SOCKET_DEVICE_INFO message */
> +struct vhost_pci_device_info {
> +	#define VHOST_PCI_ADD_FRONTEND_DEVICE 0
> +	#define VHOST_PCI_DEL_FRONTEND_DEVICE 1
> +	u16    ops;
> +	u32    nvirtq;
> +	#define VHOST_PCI_FRONTEND_DEVICE_NET 1
> +	#define VHOST_PCI_FRONTEND_DEVICE_BLK 2
> +	#define VHOST_PCI_FRONTEND_DEVICE_CONSOLE 3
> +	#define VHOST_PCI_FRONTEND_DEVICE_ENTROPY 4
> +	#define VHOST_PCI_FRONTEND_DEVICE_BALLOON 5
> +	#define VHOST_PCI_FRONTEND_DEVICE_SCSI 8
> +	u32    device_type;
> +	u64    device_id;
> +	struct virtq exotic_virtq[VHOST_PCI_MAX_NVIRTQ];
> +};
> +The device_id field identifies the device. For example, it can be used to
> +store a MAC address if the device_type is VHOST_PCI_FRONTEND_DEVICE_NET.
> +
> +/* VHOST_PCI_SOCKET_FEATURE_BITS message*/
> +struct vhost_pci_feature_bits {
> +	u64 feature_bits;
> +};

We not only have 'socket feature bits' but also the feature bits for per virtio device
plugged in on the side of vhost-pci device.

E.g: if there are two virtio devices (e.g, a NIC and BLK) both of them need to directly communicate
with another VM. The feature bits of these two devices need to be negotiated with that VM 
respectively. And you can not put these feature bits in vhost_pci_device_info struct as its
vq is not created at that time.

> +
> +/* VHOST_PCI_SOCKET_xx_ACK messages */
> +struct vhost_pci_socket_ack {
> +	#define VHOST_PCI_SOCKET_ACK_ADD_DONE 0
> +	#define VHOST_PCI_SOCKET_ACK_ADD_FAIL 1
> +	#define VHOST_PCI_SOCKET_ACK_DEL_DONE 2
> +	#define VHOST_PCI_SOCKET_ACK_DEL_FAIL 3
> +	u64 ack;
> +};
> +
> +The driver update message passed via the controlq is preceded by the following
> +header:
> +struct vhost_pci_controlq_hdr {
> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO 0
> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO 1
> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS 2
> +	#define VHOST_PCI_CONTROLQ_UPDATE_DONE 3
> +	u16 msg_type;
> +	u16 msg_version;
> +	u32 msg_len;
> +};
> +
> +The payload of a VHOST_PCI_CONTROLQ_MEMORY_INFO message can be constructed
> +using the following structure:
> +/* VHOST_PCI_CONTROLQ_MEMORY_INFO message */
> +struct vhost_pci_controlq_memory_info {
> +	#define VHOST_PCI_ADD_MEMORY 0
> +	#define VHOST_PCI_DEL_MEMORY 1
> +	u16  ops;
> +	u32 nregion;
> +	struct exotic_memory_region {
> +		u64   region_base_xgpa;
> +		u64   size;
> +		u64   offset_in_bar_area;
> +	} region[VHOST_PCI_MAX_NREGIONS];
> +};
> +
> +The payload of VHOST_PCI_CONTROLQ_DEVICE_INFO and
> +VHOST_PCI_CONTROLQ_FEATURE_BITS messages can be constructed using the
> +vhost_pci_device_info structure and the vhost_pci_feature_bits structure
> +respectively.
> +
> +The payload of a VHOST_PCI_CONTROLQ_UPDATE_DONE message can be constructed
> +using the structure below:
> +struct vhost_pci_controlq_update_done {
> +	u32    device_type;
> +	u64    device_id;
> +};
> +
> +Fig. 1 shows the initialization steps.
> +
> +When the vhost-pci server receives a VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message,
> +it checks if a vhost-pci device has been created for the requesting VM whose
> +QEMU process id is qemu_pid. If yes, it will simply update the subsequent
> +received messages to the vhost-pci driver via the controlq. Otherwise, the
> +server creates a new vhost-pci device, and continues the following
> +initialization steps.


qemu-pid is not stable as the existing VM will be killed silently and the new
vhost-pci driver reusing the same qemu-pid will ask to join before the
vhost-device gets to know the previous one has gone.

> +
> +The vhost-pci server adds up all the memory region size, and uses a 64-bit
> +device bar for the mapping of all the memory regions obtained from the socket
> +message. To better support memory hot-plugging of the driver VM, the bar is
> +configured with a double size of the driver VM's memory. The server maps the
> +received memory info via the QEMU MemoryRegion mechanism, and then the new
> +created vhost-pci device is hot-plugged to the VM.
> +
> +When the device status is updated with DRIVER_OK, a
> +VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, which is stemed from the memory
> +info socket message, is put on the controlq and a controlq interrupt is injected
> +to the VM.
> +
> +When the vhost-pci server receives a
> +VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE) acknowledgement from the driver,
> +it sends a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE) message to the client
> +that is identified by the ack_device_type and ack_device_id fields.
> +
> +When the vhost-pci server receives a
> +VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message, a
> +VHOST_PCI_CONTROLQ_FEATURE_BITS(feature bits) message is put on the controlq
> +and a controlq interrupt is injected to the VM.
> +
> +If the vhost-pci server notices that the driver fully accepted the offered
> +feature bits, it sends a VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message
> +to the client. If the vhost-pci server notices that the vhost-pci driver only
> +accepted a subset of the offered feature bits, it sends a
> +VHOST_PCI_SOCKET_FEATURE_BITS(accepted feature bits) message back to the
> +client. The client side virtio device re-negotiates the new feature bits with
> +its driver, and sends back a VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE)
> +message to the server.
> +
> +Either when the vhost-pci driver fully accepted the offered feature bits or a
> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message is received from the
> +client, the vhost-pci server puts a VHOST_PCI_CONTROLQ_UPDATE_DONE message on
> +the controlq, and a controlq interrupt is injected to the VM.

Why VHOST_PCI_CONTROLQ_UPDATE_DONE is needed?

> +
> +When the vhost-pci server receives a VHOST_PCI_SOCKET_DEVICE_INFO(ADD) message,
> +a VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) message is put on the controlq and a
> +controlq interrupt is injected to the VM.
> +
> +When the vhost-pci server receives a
> +VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) acknowledgement from the driver,
> +it sends a VHOST_PCI_SOCKET_DEVICE_INFO_ACK(ADD_DONE) message to the
> +corresponding client.
> +
> +4.5.1 Device Requirements: Device Initialization
> +To let a VM be capable of creating vhost-pci devices, a vhost-pci server MUST
> +be created when it boots.
> +
> +The vhost-pci server socket path SHOULD be provided to a virtio client socket
> +for the connection to the vhost-pci server.
> +
> +The virtio device MUST finish the feature bits negotiation with its driver
> +before negotiating them with the vhost-pci device.
> +
> +If the client receives a VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message,
> +it MUST reset the device to go into backwards capability mode, re-negotiate
> +the received feature bits with its driver, and send back a
> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the server.
> +
> +In any cases that an acknowledgement from the vhost-pci driver indicates a
> +FAIL, the vhost-pci server SHOULD send a FAIL socket message to the client.
> +
> +In any cases that the msg_type is different between the sender and the
> +receiver, the receiver SHOULD acknowledge a FAIL to the sender or convert the
> +message to its version if the converted version is still functionally usable.
> +
> +4.5.2 Driver Requirements: Device Initialization
> +The vhost-pci driver MUST NOT accept any feature bits that are not offered by
> +the remote feature bits, and SHOULD acknowledge to the device of the accepted
> +feature bits by writing them to the vhost_pci_config fields.
> +
> +When the vhost-pci driver receives a VHOST_PCI_CONTROLQ_UPDATE_DONE message
> +from the controlq, the vhost-pci driver MUST initialize the corresponding
> +driver interface of the device_type if it has not been initialized, and add
> +the device_id to the frontend device list that records all the frontend virtio
> +devices being supported by vhost-pci for inter-VM communications.

Okay, i saw how to use it here. But, once the driver gets
VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) then it knows how to communicate with the
virtio device on another VM. Why we postpone the initialize until it gets
VHOST_PCI_CONTROLQ_UPDATE_DONE?

> +
> +The vhost-pci driver SHOULD acknowledge to the device that the device and
> +memory info update (add or delete) is DONE or FAIL by writing the
> +acknowledgement (DONE or FAIL) to the vhost_pci_config fields.
> +
> +The vhost-pci driver MUST ensure that writing to the vhost_pci_config fields
> +to be atomic.
> +
> +4.6 Device Operation
> +4.6.1 Device Requirements: Device Operation
> +4.6.1.1 Frontend Device Info Update
> +When the frontend virtio device changes any info (e.g. device_id, virtq
> +address) that it has sent to the vhost-pci device, it SHOULD send a
> +VHOST_PCI_SOCKET_DEVICE_INFO(ADD) message, which contains the new device info,
> +to the vhost-pci server. The vhost-pci device SHOULD insert a
> +VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) to the controlq and inject a contrlq
> +interrupt to the VM.
> +
> +When the vhost-pci device receives a
> +VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) acknowledgement from the driver,
> +it SHOULD send a VHOST_PCI_SOCKET_DEVICE_INFO_ACK(ADD_DONE) message to the
> +client that is identified by the ack_device_type and ack_device_id fields, to
> +indicate that the vhost-pci driver has finished the handling of the device
> +info update.

If VHOST_PCI_CONTROLQ_UPDATE_DONE is really needed, you missed it here.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
@ 2016-06-01  8:15     ` Xiao Guangrong
  0 siblings, 0 replies; 34+ messages in thread
From: Xiao Guangrong @ 2016-06-01  8:15 UTC (permalink / raw)
  To: Wei Wang, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini



On 05/29/2016 04:11 PM, Wei Wang wrote:
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> ---
>   Details | 324 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 324 insertions(+)
>   create mode 100644 Details
>
> diff --git a/Details b/Details
> new file mode 100644
> index 0000000..4ea2252
> --- /dev/null
> +++ b/Details
> @@ -0,0 +1,324 @@
> +1 Device ID
> +TBD
> +
> +2 Virtqueues
> +0 controlq
> +
> +3 Feature Bits
> +3.1 Local Feature Bits
> +Currently no local feature bits are defined, so the standard virtio feature
> +bits negation will always be successful and complete.
> +
> +3.2 Remote Feature Bits
> +The remote feature bits are obtained from the frontend virtio device and
> +negotiated with the vhost-pci driver via the controlq. The negotiation steps
> +are described in 4.5 Device Initialization.
> +
> +4 Device Configuration Layout
> +struct vhost_pci_config {
> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
> +	u32 ack_type;
> +	u32 ack_device_type;
> +	u64 ack_device_id;
> +	union {
> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
> +		u64 ack_memory_info;		
> +		u64 ack_device_info;
> +		u64 ack_feature_bits;
> +	};
> +};

Do you need to write all these 4 field to ack the operation? It seems
it is not efficient and it is not flexible if the driver need to
offer more data to the device in the further. Can we dedicate a
vq for this purpose?

BTW, current approach can not handle the case if there are multiple
same kind of requests in the control queue, e.g, if there are two
memory-add request in the control queue.

> +
> +The configuration fields are currently used for the vhost-pci driver to
> +acknowledge to the vhost-pci device after it receives controlq messages.
> +
> +4.5 Device Initialization
> +When a device VM boots, it creates a vhost-pci server socket.
> +
> +When a virtio device on the driver VM is created with specifying the use of a
> +vhost-pci device as a backend, a client socket is created and connected to the
> +corresponding vhost-pci server for message exchanges.
> +
> +The messages passed to the vhost-pci server is proceeded by the following
> +header:
> +struct vhost_pci_socket_hdr {
> +	#define VHOST_PCI_SOCKET_MEMORY_INFO 0
> +	#define VHOST_PCI_SOCKET_MEMORY_INFO_ACK 1
> +	#define VHOST_PCI_SOCKET_DEVICE_INFO 2
> +	#define VHOST_PCI_SOCKET_DEVICE_INFO_ACK 3
> +	#define VHOST_PCI_SOCKET_FEATURE_BITS 4
> +	#define VHOST_PCI_SOCKET_FEATURE_BITS_ACK 5
> +	u16 msg_type;
> +	u16 msg_version;
> +	u32 msg_len;
> +	u64 qemu_pid;
> +};
> +
> +The payload of the above message types can be constructed using the structures
> +below:
> +/* VHOST_PCI_SOCKET_MEMORY_INFO message */
> +struct vhost_pci_socket_memory_info {
> +	#define VHOST_PCI_ADD_MEMORY 0
> +	#define VHOST_PCI_DEL_MEMORY 1
> +	u16 ops;
> +	u32 nregions;
> +	struct vhost_pci_memory_region {
> +		int fd;
> +		u64 guest_phys_addr;
> +		u64 memory_size;
> +		u64 mmap_offset;
> +	} regions[VHOST_PCI_MAX_NREGIONS];
> +};
> +
> +/* VHOST_PCI_SOCKET_DEVICE_INFO message */
> +struct vhost_pci_device_info {
> +	#define VHOST_PCI_ADD_FRONTEND_DEVICE 0
> +	#define VHOST_PCI_DEL_FRONTEND_DEVICE 1
> +	u16    ops;
> +	u32    nvirtq;
> +	#define VHOST_PCI_FRONTEND_DEVICE_NET 1
> +	#define VHOST_PCI_FRONTEND_DEVICE_BLK 2
> +	#define VHOST_PCI_FRONTEND_DEVICE_CONSOLE 3
> +	#define VHOST_PCI_FRONTEND_DEVICE_ENTROPY 4
> +	#define VHOST_PCI_FRONTEND_DEVICE_BALLOON 5
> +	#define VHOST_PCI_FRONTEND_DEVICE_SCSI 8
> +	u32    device_type;
> +	u64    device_id;
> +	struct virtq exotic_virtq[VHOST_PCI_MAX_NVIRTQ];
> +};
> +The device_id field identifies the device. For example, it can be used to
> +store a MAC address if the device_type is VHOST_PCI_FRONTEND_DEVICE_NET.
> +
> +/* VHOST_PCI_SOCKET_FEATURE_BITS message*/
> +struct vhost_pci_feature_bits {
> +	u64 feature_bits;
> +};

We not only have 'socket feature bits' but also the feature bits for per virtio device
plugged in on the side of vhost-pci device.

E.g: if there are two virtio devices (e.g, a NIC and BLK) both of them need to directly communicate
with another VM. The feature bits of these two devices need to be negotiated with that VM 
respectively. And you can not put these feature bits in vhost_pci_device_info struct as its
vq is not created at that time.

> +
> +/* VHOST_PCI_SOCKET_xx_ACK messages */
> +struct vhost_pci_socket_ack {
> +	#define VHOST_PCI_SOCKET_ACK_ADD_DONE 0
> +	#define VHOST_PCI_SOCKET_ACK_ADD_FAIL 1
> +	#define VHOST_PCI_SOCKET_ACK_DEL_DONE 2
> +	#define VHOST_PCI_SOCKET_ACK_DEL_FAIL 3
> +	u64 ack;
> +};
> +
> +The driver update message passed via the controlq is preceded by the following
> +header:
> +struct vhost_pci_controlq_hdr {
> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO 0
> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO 1
> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS 2
> +	#define VHOST_PCI_CONTROLQ_UPDATE_DONE 3
> +	u16 msg_type;
> +	u16 msg_version;
> +	u32 msg_len;
> +};
> +
> +The payload of a VHOST_PCI_CONTROLQ_MEMORY_INFO message can be constructed
> +using the following structure:
> +/* VHOST_PCI_CONTROLQ_MEMORY_INFO message */
> +struct vhost_pci_controlq_memory_info {
> +	#define VHOST_PCI_ADD_MEMORY 0
> +	#define VHOST_PCI_DEL_MEMORY 1
> +	u16  ops;
> +	u32 nregion;
> +	struct exotic_memory_region {
> +		u64   region_base_xgpa;
> +		u64   size;
> +		u64   offset_in_bar_area;
> +	} region[VHOST_PCI_MAX_NREGIONS];
> +};
> +
> +The payload of VHOST_PCI_CONTROLQ_DEVICE_INFO and
> +VHOST_PCI_CONTROLQ_FEATURE_BITS messages can be constructed using the
> +vhost_pci_device_info structure and the vhost_pci_feature_bits structure
> +respectively.
> +
> +The payload of a VHOST_PCI_CONTROLQ_UPDATE_DONE message can be constructed
> +using the structure below:
> +struct vhost_pci_controlq_update_done {
> +	u32    device_type;
> +	u64    device_id;
> +};
> +
> +Fig. 1 shows the initialization steps.
> +
> +When the vhost-pci server receives a VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message,
> +it checks if a vhost-pci device has been created for the requesting VM whose
> +QEMU process id is qemu_pid. If yes, it will simply update the subsequent
> +received messages to the vhost-pci driver via the controlq. Otherwise, the
> +server creates a new vhost-pci device, and continues the following
> +initialization steps.


qemu-pid is not stable as the existing VM will be killed silently and the new
vhost-pci driver reusing the same qemu-pid will ask to join before the
vhost-device gets to know the previous one has gone.

> +
> +The vhost-pci server adds up all the memory region size, and uses a 64-bit
> +device bar for the mapping of all the memory regions obtained from the socket
> +message. To better support memory hot-plugging of the driver VM, the bar is
> +configured with a double size of the driver VM's memory. The server maps the
> +received memory info via the QEMU MemoryRegion mechanism, and then the new
> +created vhost-pci device is hot-plugged to the VM.
> +
> +When the device status is updated with DRIVER_OK, a
> +VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, which is stemed from the memory
> +info socket message, is put on the controlq and a controlq interrupt is injected
> +to the VM.
> +
> +When the vhost-pci server receives a
> +VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE) acknowledgement from the driver,
> +it sends a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE) message to the client
> +that is identified by the ack_device_type and ack_device_id fields.
> +
> +When the vhost-pci server receives a
> +VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message, a
> +VHOST_PCI_CONTROLQ_FEATURE_BITS(feature bits) message is put on the controlq
> +and a controlq interrupt is injected to the VM.
> +
> +If the vhost-pci server notices that the driver fully accepted the offered
> +feature bits, it sends a VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message
> +to the client. If the vhost-pci server notices that the vhost-pci driver only
> +accepted a subset of the offered feature bits, it sends a
> +VHOST_PCI_SOCKET_FEATURE_BITS(accepted feature bits) message back to the
> +client. The client side virtio device re-negotiates the new feature bits with
> +its driver, and sends back a VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE)
> +message to the server.
> +
> +Either when the vhost-pci driver fully accepted the offered feature bits or a
> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message is received from the
> +client, the vhost-pci server puts a VHOST_PCI_CONTROLQ_UPDATE_DONE message on
> +the controlq, and a controlq interrupt is injected to the VM.

Why VHOST_PCI_CONTROLQ_UPDATE_DONE is needed?

> +
> +When the vhost-pci server receives a VHOST_PCI_SOCKET_DEVICE_INFO(ADD) message,
> +a VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) message is put on the controlq and a
> +controlq interrupt is injected to the VM.
> +
> +When the vhost-pci server receives a
> +VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) acknowledgement from the driver,
> +it sends a VHOST_PCI_SOCKET_DEVICE_INFO_ACK(ADD_DONE) message to the
> +corresponding client.
> +
> +4.5.1 Device Requirements: Device Initialization
> +To let a VM be capable of creating vhost-pci devices, a vhost-pci server MUST
> +be created when it boots.
> +
> +The vhost-pci server socket path SHOULD be provided to a virtio client socket
> +for the connection to the vhost-pci server.
> +
> +The virtio device MUST finish the feature bits negotiation with its driver
> +before negotiating them with the vhost-pci device.
> +
> +If the client receives a VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message,
> +it MUST reset the device to go into backwards capability mode, re-negotiate
> +the received feature bits with its driver, and send back a
> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the server.
> +
> +In any cases that an acknowledgement from the vhost-pci driver indicates a
> +FAIL, the vhost-pci server SHOULD send a FAIL socket message to the client.
> +
> +In any cases that the msg_type is different between the sender and the
> +receiver, the receiver SHOULD acknowledge a FAIL to the sender or convert the
> +message to its version if the converted version is still functionally usable.
> +
> +4.5.2 Driver Requirements: Device Initialization
> +The vhost-pci driver MUST NOT accept any feature bits that are not offered by
> +the remote feature bits, and SHOULD acknowledge to the device of the accepted
> +feature bits by writing them to the vhost_pci_config fields.
> +
> +When the vhost-pci driver receives a VHOST_PCI_CONTROLQ_UPDATE_DONE message
> +from the controlq, the vhost-pci driver MUST initialize the corresponding
> +driver interface of the device_type if it has not been initialized, and add
> +the device_id to the frontend device list that records all the frontend virtio
> +devices being supported by vhost-pci for inter-VM communications.

Okay, i saw how to use it here. But, once the driver gets
VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) then it knows how to communicate with the
virtio device on another VM. Why we postpone the initialize until it gets
VHOST_PCI_CONTROLQ_UPDATE_DONE?

> +
> +The vhost-pci driver SHOULD acknowledge to the device that the device and
> +memory info update (add or delete) is DONE or FAIL by writing the
> +acknowledgement (DONE or FAIL) to the vhost_pci_config fields.
> +
> +The vhost-pci driver MUST ensure that writing to the vhost_pci_config fields
> +to be atomic.
> +
> +4.6 Device Operation
> +4.6.1 Device Requirements: Device Operation
> +4.6.1.1 Frontend Device Info Update
> +When the frontend virtio device changes any info (e.g. device_id, virtq
> +address) that it has sent to the vhost-pci device, it SHOULD send a
> +VHOST_PCI_SOCKET_DEVICE_INFO(ADD) message, which contains the new device info,
> +to the vhost-pci server. The vhost-pci device SHOULD insert a
> +VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) to the controlq and inject a contrlq
> +interrupt to the VM.
> +
> +When the vhost-pci device receives a
> +VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) acknowledgement from the driver,
> +it SHOULD send a VHOST_PCI_SOCKET_DEVICE_INFO_ACK(ADD_DONE) message to the
> +client that is identified by the ack_device_type and ack_device_id fields, to
> +indicate that the vhost-pci driver has finished the handling of the device
> +info update.

If VHOST_PCI_CONTROLQ_UPDATE_DONE is really needed, you missed it here.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [virtio-comment] RE: [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
  2016-06-01  8:15     ` [Qemu-devel] " Xiao Guangrong
@ 2016-06-02  3:15       ` Wang, Wei W
  -1 siblings, 0 replies; 34+ messages in thread
From: Wang, Wei W @ 2016-06-02  3:15 UTC (permalink / raw)
  To: 'Xiao Guangrong',
	kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On Wed 6/1/2016 4:15 PM, Xiao Guangrong wrote:
> On 05/29/2016 04:11 PM, Wei Wang wrote:
> > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > ---
> >   Details | 324
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 324 insertions(+)
> >   create mode 100644 Details
> >
> > diff --git a/Details b/Details
> > new file mode 100644
> > index 0000000..4ea2252
> > --- /dev/null
> > +++ b/Details
> > @@ -0,0 +1,324 @@
> > +1 Device ID
> > +TBD
> > +
> > +2 Virtqueues
> > +0 controlq
> > +
> > +3 Feature Bits
> > +3.1 Local Feature Bits
> > +Currently no local feature bits are defined, so the standard virtio
> > +feature bits negation will always be successful and complete.
> > +
> > +3.2 Remote Feature Bits
> > +The remote feature bits are obtained from the frontend virtio device
> > +and negotiated with the vhost-pci driver via the controlq. The
> > +negotiation steps are described in 4.5 Device Initialization.
> > +
> > +4 Device Configuration Layout
> > +struct vhost_pci_config {
> > +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
> > +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
> > +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
> > +	u32 ack_type;
> > +	u32 ack_device_type;
> > +	u64 ack_device_id;
> > +	union {
> > +		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
> > +		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
> > +		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
> > +		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
> > +		u64 ack_memory_info;
> > +		u64 ack_device_info;
> > +		u64 ack_feature_bits;
> > +	};
> > +};
> 
> Do you need to write all these 4 field to ack the operation? It seems it is not
> efficient and it is not flexible if the driver need to offer more data to the device
> in the further. Can we dedicate a vq for this purpose?

Yes, all the 4 fields are required to be written. The vhost-pci server usually connects to multiple clients, and the "ack_device_type" and "ack_device_id" fields are used to identify them.

Agree, another controlq for the guest->host direction looks better, and the above fileds can be converted to be the controlq message header.

> 
> BTW, current approach can not handle the case if there are multiple same kind
> of requests in the control queue, e.g, if there are two memory-add request in
> the control queue.

A vhost-pci device corresponds to a driver VM. The two memory-add requests on the controlq are all for the same driver VM.  Memory-add requests for different driver VMs  couldn’t be present on the same controlq. I haven’t seen the issue yet. Can you please explain more? Thanks.


> > +
> > +The configuration fields are currently used for the vhost-pci driver
> > +to acknowledge to the vhost-pci device after it receives controlq messages.
> > +
> > +4.5 Device Initialization
> > +When a device VM boots, it creates a vhost-pci server socket.
> > +
> > +When a virtio device on the driver VM is created with specifying the
> > +use of a vhost-pci device as a backend, a client socket is created
> > +and connected to the corresponding vhost-pci server for message exchanges.
> > +
> > +The messages passed to the vhost-pci server is proceeded by the
> > +following
> > +header:
> > +struct vhost_pci_socket_hdr {
> > +	#define VHOST_PCI_SOCKET_MEMORY_INFO 0
> > +	#define VHOST_PCI_SOCKET_MEMORY_INFO_ACK 1
> > +	#define VHOST_PCI_SOCKET_DEVICE_INFO 2
> > +	#define VHOST_PCI_SOCKET_DEVICE_INFO_ACK 3
> > +	#define VHOST_PCI_SOCKET_FEATURE_BITS 4
> > +	#define VHOST_PCI_SOCKET_FEATURE_BITS_ACK 5
> > +	u16 msg_type;
> > +	u16 msg_version;
> > +	u32 msg_len;
> > +	u64 qemu_pid;
> > +};
> > +
> > +The payload of the above message types can be constructed using the
> > +structures
> > +below:
> > +/* VHOST_PCI_SOCKET_MEMORY_INFO message */ struct
> > +vhost_pci_socket_memory_info {
> > +	#define VHOST_PCI_ADD_MEMORY 0
> > +	#define VHOST_PCI_DEL_MEMORY 1
> > +	u16 ops;
> > +	u32 nregions;
> > +	struct vhost_pci_memory_region {
> > +		int fd;
> > +		u64 guest_phys_addr;
> > +		u64 memory_size;
> > +		u64 mmap_offset;
> > +	} regions[VHOST_PCI_MAX_NREGIONS];
> > +};
> > +
> > +/* VHOST_PCI_SOCKET_DEVICE_INFO message */ struct
> > +vhost_pci_device_info {
> > +	#define VHOST_PCI_ADD_FRONTEND_DEVICE 0
> > +	#define VHOST_PCI_DEL_FRONTEND_DEVICE 1
> > +	u16    ops;
> > +	u32    nvirtq;
> > +	#define VHOST_PCI_FRONTEND_DEVICE_NET 1
> > +	#define VHOST_PCI_FRONTEND_DEVICE_BLK 2
> > +	#define VHOST_PCI_FRONTEND_DEVICE_CONSOLE 3
> > +	#define VHOST_PCI_FRONTEND_DEVICE_ENTROPY 4
> > +	#define VHOST_PCI_FRONTEND_DEVICE_BALLOON 5
> > +	#define VHOST_PCI_FRONTEND_DEVICE_SCSI 8
> > +	u32    device_type;
> > +	u64    device_id;
> > +	struct virtq exotic_virtq[VHOST_PCI_MAX_NVIRTQ];
> > +};
> > +The device_id field identifies the device. For example, it can be
> > +used to store a MAC address if the device_type is
> VHOST_PCI_FRONTEND_DEVICE_NET.
> > +
> > +/* VHOST_PCI_SOCKET_FEATURE_BITS message*/ struct
> > +vhost_pci_feature_bits {
> > +	u64 feature_bits;
> > +};
> 
> We not only have 'socket feature bits' but also the feature bits for per virtio
> device plugged in on the side of vhost-pci device.

Yes. It is mentioned in "3 Feature Bits". The socket feature bits here are actually the remote feature bits (got from a socket message).

> 
> E.g: if there are two virtio devices (e.g, a NIC and BLK) both of them need to
> directly communicate with another VM. The feature bits of these two devices
> need to be negotiated with that VM respectively. And you can not put these
> feature bits in vhost_pci_device_info struct as its vq is not created at that time.

Right. If you check the initialization steps below, there is a statement "When the device status is updated with DRIVER_OK".

> > +
> > +/* VHOST_PCI_SOCKET_xx_ACK messages */ struct vhost_pci_socket_ack {
> > +	#define VHOST_PCI_SOCKET_ACK_ADD_DONE 0
> > +	#define VHOST_PCI_SOCKET_ACK_ADD_FAIL 1
> > +	#define VHOST_PCI_SOCKET_ACK_DEL_DONE 2
> > +	#define VHOST_PCI_SOCKET_ACK_DEL_FAIL 3
> > +	u64 ack;
> > +};
> > +
> > +The driver update message passed via the controlq is preceded by the
> > +following
> > +header:
> > +struct vhost_pci_controlq_hdr {
> > +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO 0
> > +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO 1
> > +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS 2
> > +	#define VHOST_PCI_CONTROLQ_UPDATE_DONE 3
> > +	u16 msg_type;
> > +	u16 msg_version;
> > +	u32 msg_len;
> > +};
> > +
> > +The payload of a VHOST_PCI_CONTROLQ_MEMORY_INFO message can be
> > +constructed using the following structure:
> > +/* VHOST_PCI_CONTROLQ_MEMORY_INFO message */ struct
> > +vhost_pci_controlq_memory_info {
> > +	#define VHOST_PCI_ADD_MEMORY 0
> > +	#define VHOST_PCI_DEL_MEMORY 1
> > +	u16  ops;
> > +	u32 nregion;
> > +	struct exotic_memory_region {
> > +		u64   region_base_xgpa;
> > +		u64   size;
> > +		u64   offset_in_bar_area;
> > +	} region[VHOST_PCI_MAX_NREGIONS];
> > +};
> > +
> > +The payload of VHOST_PCI_CONTROLQ_DEVICE_INFO and
> > +VHOST_PCI_CONTROLQ_FEATURE_BITS messages can be constructed using
> the
> > +vhost_pci_device_info structure and the vhost_pci_feature_bits
> > +structure respectively.
> > +
> > +The payload of a VHOST_PCI_CONTROLQ_UPDATE_DONE message can be
> > +constructed using the structure below:
> > +struct vhost_pci_controlq_update_done {
> > +	u32    device_type;
> > +	u64    device_id;
> > +};
> > +
> > +Fig. 1 shows the initialization steps.
> > +
> > +When the vhost-pci server receives a
> > +VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message, it checks if a vhost-pci
> > +device has been created for the requesting VM whose QEMU process id
> > +is qemu_pid. If yes, it will simply update the subsequent received
> > +messages to the vhost-pci driver via the controlq. Otherwise, the
> > +server creates a new vhost-pci device, and continues the following
> initialization steps.
> 
> 
> qemu-pid is not stable as the existing VM will be killed silently and the new
> vhost-pci driver reusing the same qemu-pid will ask to join before the vhost-
> device gets to know the previous one has gone.

Would it be a normal and legal operation to silently kill a QEMU? I guess only the system admin can do that, right?

If that's true, I think we can add a new field, "u64 tsc_of_birth" to the vhost_pci_socket_hdr structure. It records the tsc when the QEMU is created. 
If that's true, another problem would be the remove of the vhost-pci device for a silently killed driver VM.
The vhost-pci server may need to periodically send a checking message to check if the driver VM is silently killed. If that really happens, it should remove the related vhost-pci device.
 
> > +
> > +The vhost-pci server adds up all the memory region size, and uses a
> > +64-bit device bar for the mapping of all the memory regions obtained
> > +from the socket message. To better support memory hot-plugging of the
> > +driver VM, the bar is configured with a double size of the driver
> > +VM's memory. The server maps the received memory info via the QEMU
> > +MemoryRegion mechanism, and then the new created vhost-pci device is
> hot-plugged to the VM.
> > +
> > +When the device status is updated with DRIVER_OK, a
> > +VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, which is stemed
> from the
> > +memory info socket message, is put on the controlq and a controlq
> > +interrupt is injected to the VM.
> > +
> > +When the vhost-pci server receives a
> > +VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE)
> acknowledgement from the
> > +driver, it sends a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE)
> message
> > +to the client that is identified by the ack_device_type and ack_device_id fields.
> > +
> > +When the vhost-pci server receives a
> > +VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message, a
> > +VHOST_PCI_CONTROLQ_FEATURE_BITS(feature bits) message is put on the
> > +controlq and a controlq interrupt is injected to the VM.
> > +
> > +If the vhost-pci server notices that the driver fully accepted the
> > +offered feature bits, it sends a
> > +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the client.
> If
> > +the vhost-pci server notices that the vhost-pci driver only accepted
> > +a subset of the offered feature bits, it sends a
> > +VHOST_PCI_SOCKET_FEATURE_BITS(accepted feature bits) message back to
> > +the client. The client side virtio device re-negotiates the new
> > +feature bits with its driver, and sends back a
> > +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE)
> > +message to the server.
> > +
> > +Either when the vhost-pci driver fully accepted the offered feature
> > +bits or a
> > +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message is received
> from
> > +the client, the vhost-pci server puts a
> > +VHOST_PCI_CONTROLQ_UPDATE_DONE message on the controlq, and a
> controlq interrupt is injected to the VM.
> 
> Why VHOST_PCI_CONTROLQ_UPDATE_DONE is needed?

OK, this one looks redundant. We can set up the related support for that frontend device when the device info is received via the controlq.

Best,
Wei
 
> > +
> > +When the vhost-pci server receives a
> > +VHOST_PCI_SOCKET_DEVICE_INFO(ADD) message, a
> > +VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) message is put on the controlq
> and a controlq interrupt is injected to the VM.
> > +
> > +When the vhost-pci server receives a
> > +VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) acknowledgement
> from the
> > +driver, it sends a VHOST_PCI_SOCKET_DEVICE_INFO_ACK(ADD_DONE)
> message
> > +to the corresponding client.
> > +
> > +4.5.1 Device Requirements: Device Initialization To let a VM be
> > +capable of creating vhost-pci devices, a vhost-pci server MUST be
> > +created when it boots.
> > +
> > +The vhost-pci server socket path SHOULD be provided to a virtio
> > +client socket for the connection to the vhost-pci server.
> > +
> > +The virtio device MUST finish the feature bits negotiation with its
> > +driver before negotiating them with the vhost-pci device.
> > +
> > +If the client receives a VHOST_PCI_SOCKET_FEATURE_BITS(feature bits)
> > +message, it MUST reset the device to go into backwards capability
> > +mode, re-negotiate the received feature bits with its driver, and
> > +send back a
> > +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the server.
> > +
> > +In any cases that an acknowledgement from the vhost-pci driver
> > +indicates a FAIL, the vhost-pci server SHOULD send a FAIL socket message to
> the client.
> > +
> > +In any cases that the msg_type is different between the sender and
> > +the receiver, the receiver SHOULD acknowledge a FAIL to the sender or
> > +convert the message to its version if the converted version is still functionally
> usable.
> > +
> > +4.5.2 Driver Requirements: Device Initialization The vhost-pci driver
> > +MUST NOT accept any feature bits that are not offered by the remote
> > +feature bits, and SHOULD acknowledge to the device of the accepted
> > +feature bits by writing them to the vhost_pci_config fields.
> > +
> > +When the vhost-pci driver receives a
> VHOST_PCI_CONTROLQ_UPDATE_DONE
> > +message from the controlq, the vhost-pci driver MUST initialize the
> > +corresponding driver interface of the device_type if it has not been
> > +initialized, and add the device_id to the frontend device list that
> > +records all the frontend virtio devices being supported by vhost-pci for inter-
> VM communications.
> 
> Okay, i saw how to use it here. But, once the driver gets
> VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) then it knows how to
> communicate with the virtio device on another VM. Why we postpone the
> initialize until it gets VHOST_PCI_CONTROLQ_UPDATE_DONE?
> 
> > +
> > +The vhost-pci driver SHOULD acknowledge to the device that the device
> > +and memory info update (add or delete) is DONE or FAIL by writing the
> > +acknowledgement (DONE or FAIL) to the vhost_pci_config fields.
> > +
> > +The vhost-pci driver MUST ensure that writing to the vhost_pci_config
> > +fields to be atomic.
> > +
> > +4.6 Device Operation
> > +4.6.1 Device Requirements: Device Operation
> > +4.6.1.1 Frontend Device Info Update
> > +When the frontend virtio device changes any info (e.g. device_id,
> > +virtq
> > +address) that it has sent to the vhost-pci device, it SHOULD send a
> > +VHOST_PCI_SOCKET_DEVICE_INFO(ADD) message, which contains the new
> > +device info, to the vhost-pci server. The vhost-pci device SHOULD
> > +insert a
> > +VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) to the controlq and inject a
> > +contrlq interrupt to the VM.
> > +
> > +When the vhost-pci device receives a
> > +VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) acknowledgement
> from the
> > +driver, it SHOULD send a
> VHOST_PCI_SOCKET_DEVICE_INFO_ACK(ADD_DONE)
> > +message to the client that is identified by the ack_device_type and
> > +ack_device_id fields, to indicate that the vhost-pci driver has
> > +finished the handling of the device info update.
> 
> If VHOST_PCI_CONTROLQ_UPDATE_DONE is really needed, you missed it here.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
@ 2016-06-02  3:15       ` Wang, Wei W
  0 siblings, 0 replies; 34+ messages in thread
From: Wang, Wei W @ 2016-06-02  3:15 UTC (permalink / raw)
  To: 'Xiao Guangrong',
	kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On Wed 6/1/2016 4:15 PM, Xiao Guangrong wrote:
> On 05/29/2016 04:11 PM, Wei Wang wrote:
> > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > ---
> >   Details | 324
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 324 insertions(+)
> >   create mode 100644 Details
> >
> > diff --git a/Details b/Details
> > new file mode 100644
> > index 0000000..4ea2252
> > --- /dev/null
> > +++ b/Details
> > @@ -0,0 +1,324 @@
> > +1 Device ID
> > +TBD
> > +
> > +2 Virtqueues
> > +0 controlq
> > +
> > +3 Feature Bits
> > +3.1 Local Feature Bits
> > +Currently no local feature bits are defined, so the standard virtio
> > +feature bits negation will always be successful and complete.
> > +
> > +3.2 Remote Feature Bits
> > +The remote feature bits are obtained from the frontend virtio device
> > +and negotiated with the vhost-pci driver via the controlq. The
> > +negotiation steps are described in 4.5 Device Initialization.
> > +
> > +4 Device Configuration Layout
> > +struct vhost_pci_config {
> > +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
> > +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
> > +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
> > +	u32 ack_type;
> > +	u32 ack_device_type;
> > +	u64 ack_device_id;
> > +	union {
> > +		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
> > +		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
> > +		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
> > +		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
> > +		u64 ack_memory_info;
> > +		u64 ack_device_info;
> > +		u64 ack_feature_bits;
> > +	};
> > +};
> 
> Do you need to write all these 4 field to ack the operation? It seems it is not
> efficient and it is not flexible if the driver need to offer more data to the device
> in the further. Can we dedicate a vq for this purpose?

Yes, all the 4 fields are required to be written. The vhost-pci server usually connects to multiple clients, and the "ack_device_type" and "ack_device_id" fields are used to identify them.

Agree, another controlq for the guest->host direction looks better, and the above fileds can be converted to be the controlq message header.

> 
> BTW, current approach can not handle the case if there are multiple same kind
> of requests in the control queue, e.g, if there are two memory-add request in
> the control queue.

A vhost-pci device corresponds to a driver VM. The two memory-add requests on the controlq are all for the same driver VM.  Memory-add requests for different driver VMs  couldn’t be present on the same controlq. I haven’t seen the issue yet. Can you please explain more? Thanks.


> > +
> > +The configuration fields are currently used for the vhost-pci driver
> > +to acknowledge to the vhost-pci device after it receives controlq messages.
> > +
> > +4.5 Device Initialization
> > +When a device VM boots, it creates a vhost-pci server socket.
> > +
> > +When a virtio device on the driver VM is created with specifying the
> > +use of a vhost-pci device as a backend, a client socket is created
> > +and connected to the corresponding vhost-pci server for message exchanges.
> > +
> > +The messages passed to the vhost-pci server is proceeded by the
> > +following
> > +header:
> > +struct vhost_pci_socket_hdr {
> > +	#define VHOST_PCI_SOCKET_MEMORY_INFO 0
> > +	#define VHOST_PCI_SOCKET_MEMORY_INFO_ACK 1
> > +	#define VHOST_PCI_SOCKET_DEVICE_INFO 2
> > +	#define VHOST_PCI_SOCKET_DEVICE_INFO_ACK 3
> > +	#define VHOST_PCI_SOCKET_FEATURE_BITS 4
> > +	#define VHOST_PCI_SOCKET_FEATURE_BITS_ACK 5
> > +	u16 msg_type;
> > +	u16 msg_version;
> > +	u32 msg_len;
> > +	u64 qemu_pid;
> > +};
> > +
> > +The payload of the above message types can be constructed using the
> > +structures
> > +below:
> > +/* VHOST_PCI_SOCKET_MEMORY_INFO message */ struct
> > +vhost_pci_socket_memory_info {
> > +	#define VHOST_PCI_ADD_MEMORY 0
> > +	#define VHOST_PCI_DEL_MEMORY 1
> > +	u16 ops;
> > +	u32 nregions;
> > +	struct vhost_pci_memory_region {
> > +		int fd;
> > +		u64 guest_phys_addr;
> > +		u64 memory_size;
> > +		u64 mmap_offset;
> > +	} regions[VHOST_PCI_MAX_NREGIONS];
> > +};
> > +
> > +/* VHOST_PCI_SOCKET_DEVICE_INFO message */ struct
> > +vhost_pci_device_info {
> > +	#define VHOST_PCI_ADD_FRONTEND_DEVICE 0
> > +	#define VHOST_PCI_DEL_FRONTEND_DEVICE 1
> > +	u16    ops;
> > +	u32    nvirtq;
> > +	#define VHOST_PCI_FRONTEND_DEVICE_NET 1
> > +	#define VHOST_PCI_FRONTEND_DEVICE_BLK 2
> > +	#define VHOST_PCI_FRONTEND_DEVICE_CONSOLE 3
> > +	#define VHOST_PCI_FRONTEND_DEVICE_ENTROPY 4
> > +	#define VHOST_PCI_FRONTEND_DEVICE_BALLOON 5
> > +	#define VHOST_PCI_FRONTEND_DEVICE_SCSI 8
> > +	u32    device_type;
> > +	u64    device_id;
> > +	struct virtq exotic_virtq[VHOST_PCI_MAX_NVIRTQ];
> > +};
> > +The device_id field identifies the device. For example, it can be
> > +used to store a MAC address if the device_type is
> VHOST_PCI_FRONTEND_DEVICE_NET.
> > +
> > +/* VHOST_PCI_SOCKET_FEATURE_BITS message*/ struct
> > +vhost_pci_feature_bits {
> > +	u64 feature_bits;
> > +};
> 
> We not only have 'socket feature bits' but also the feature bits for per virtio
> device plugged in on the side of vhost-pci device.

Yes. It is mentioned in "3 Feature Bits". The socket feature bits here are actually the remote feature bits (got from a socket message).

> 
> E.g: if there are two virtio devices (e.g, a NIC and BLK) both of them need to
> directly communicate with another VM. The feature bits of these two devices
> need to be negotiated with that VM respectively. And you can not put these
> feature bits in vhost_pci_device_info struct as its vq is not created at that time.

Right. If you check the initialization steps below, there is a statement "When the device status is updated with DRIVER_OK".

> > +
> > +/* VHOST_PCI_SOCKET_xx_ACK messages */ struct vhost_pci_socket_ack {
> > +	#define VHOST_PCI_SOCKET_ACK_ADD_DONE 0
> > +	#define VHOST_PCI_SOCKET_ACK_ADD_FAIL 1
> > +	#define VHOST_PCI_SOCKET_ACK_DEL_DONE 2
> > +	#define VHOST_PCI_SOCKET_ACK_DEL_FAIL 3
> > +	u64 ack;
> > +};
> > +
> > +The driver update message passed via the controlq is preceded by the
> > +following
> > +header:
> > +struct vhost_pci_controlq_hdr {
> > +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO 0
> > +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO 1
> > +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS 2
> > +	#define VHOST_PCI_CONTROLQ_UPDATE_DONE 3
> > +	u16 msg_type;
> > +	u16 msg_version;
> > +	u32 msg_len;
> > +};
> > +
> > +The payload of a VHOST_PCI_CONTROLQ_MEMORY_INFO message can be
> > +constructed using the following structure:
> > +/* VHOST_PCI_CONTROLQ_MEMORY_INFO message */ struct
> > +vhost_pci_controlq_memory_info {
> > +	#define VHOST_PCI_ADD_MEMORY 0
> > +	#define VHOST_PCI_DEL_MEMORY 1
> > +	u16  ops;
> > +	u32 nregion;
> > +	struct exotic_memory_region {
> > +		u64   region_base_xgpa;
> > +		u64   size;
> > +		u64   offset_in_bar_area;
> > +	} region[VHOST_PCI_MAX_NREGIONS];
> > +};
> > +
> > +The payload of VHOST_PCI_CONTROLQ_DEVICE_INFO and
> > +VHOST_PCI_CONTROLQ_FEATURE_BITS messages can be constructed using
> the
> > +vhost_pci_device_info structure and the vhost_pci_feature_bits
> > +structure respectively.
> > +
> > +The payload of a VHOST_PCI_CONTROLQ_UPDATE_DONE message can be
> > +constructed using the structure below:
> > +struct vhost_pci_controlq_update_done {
> > +	u32    device_type;
> > +	u64    device_id;
> > +};
> > +
> > +Fig. 1 shows the initialization steps.
> > +
> > +When the vhost-pci server receives a
> > +VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message, it checks if a vhost-pci
> > +device has been created for the requesting VM whose QEMU process id
> > +is qemu_pid. If yes, it will simply update the subsequent received
> > +messages to the vhost-pci driver via the controlq. Otherwise, the
> > +server creates a new vhost-pci device, and continues the following
> initialization steps.
> 
> 
> qemu-pid is not stable as the existing VM will be killed silently and the new
> vhost-pci driver reusing the same qemu-pid will ask to join before the vhost-
> device gets to know the previous one has gone.

Would it be a normal and legal operation to silently kill a QEMU? I guess only the system admin can do that, right?

If that's true, I think we can add a new field, "u64 tsc_of_birth" to the vhost_pci_socket_hdr structure. It records the tsc when the QEMU is created. 
If that's true, another problem would be the remove of the vhost-pci device for a silently killed driver VM.
The vhost-pci server may need to periodically send a checking message to check if the driver VM is silently killed. If that really happens, it should remove the related vhost-pci device.
 
> > +
> > +The vhost-pci server adds up all the memory region size, and uses a
> > +64-bit device bar for the mapping of all the memory regions obtained
> > +from the socket message. To better support memory hot-plugging of the
> > +driver VM, the bar is configured with a double size of the driver
> > +VM's memory. The server maps the received memory info via the QEMU
> > +MemoryRegion mechanism, and then the new created vhost-pci device is
> hot-plugged to the VM.
> > +
> > +When the device status is updated with DRIVER_OK, a
> > +VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, which is stemed
> from the
> > +memory info socket message, is put on the controlq and a controlq
> > +interrupt is injected to the VM.
> > +
> > +When the vhost-pci server receives a
> > +VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE)
> acknowledgement from the
> > +driver, it sends a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE)
> message
> > +to the client that is identified by the ack_device_type and ack_device_id fields.
> > +
> > +When the vhost-pci server receives a
> > +VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message, a
> > +VHOST_PCI_CONTROLQ_FEATURE_BITS(feature bits) message is put on the
> > +controlq and a controlq interrupt is injected to the VM.
> > +
> > +If the vhost-pci server notices that the driver fully accepted the
> > +offered feature bits, it sends a
> > +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the client.
> If
> > +the vhost-pci server notices that the vhost-pci driver only accepted
> > +a subset of the offered feature bits, it sends a
> > +VHOST_PCI_SOCKET_FEATURE_BITS(accepted feature bits) message back to
> > +the client. The client side virtio device re-negotiates the new
> > +feature bits with its driver, and sends back a
> > +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE)
> > +message to the server.
> > +
> > +Either when the vhost-pci driver fully accepted the offered feature
> > +bits or a
> > +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message is received
> from
> > +the client, the vhost-pci server puts a
> > +VHOST_PCI_CONTROLQ_UPDATE_DONE message on the controlq, and a
> controlq interrupt is injected to the VM.
> 
> Why VHOST_PCI_CONTROLQ_UPDATE_DONE is needed?

OK, this one looks redundant. We can set up the related support for that frontend device when the device info is received via the controlq.

Best,
Wei
 
> > +
> > +When the vhost-pci server receives a
> > +VHOST_PCI_SOCKET_DEVICE_INFO(ADD) message, a
> > +VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) message is put on the controlq
> and a controlq interrupt is injected to the VM.
> > +
> > +When the vhost-pci server receives a
> > +VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) acknowledgement
> from the
> > +driver, it sends a VHOST_PCI_SOCKET_DEVICE_INFO_ACK(ADD_DONE)
> message
> > +to the corresponding client.
> > +
> > +4.5.1 Device Requirements: Device Initialization To let a VM be
> > +capable of creating vhost-pci devices, a vhost-pci server MUST be
> > +created when it boots.
> > +
> > +The vhost-pci server socket path SHOULD be provided to a virtio
> > +client socket for the connection to the vhost-pci server.
> > +
> > +The virtio device MUST finish the feature bits negotiation with its
> > +driver before negotiating them with the vhost-pci device.
> > +
> > +If the client receives a VHOST_PCI_SOCKET_FEATURE_BITS(feature bits)
> > +message, it MUST reset the device to go into backwards capability
> > +mode, re-negotiate the received feature bits with its driver, and
> > +send back a
> > +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the server.
> > +
> > +In any cases that an acknowledgement from the vhost-pci driver
> > +indicates a FAIL, the vhost-pci server SHOULD send a FAIL socket message to
> the client.
> > +
> > +In any cases that the msg_type is different between the sender and
> > +the receiver, the receiver SHOULD acknowledge a FAIL to the sender or
> > +convert the message to its version if the converted version is still functionally
> usable.
> > +
> > +4.5.2 Driver Requirements: Device Initialization The vhost-pci driver
> > +MUST NOT accept any feature bits that are not offered by the remote
> > +feature bits, and SHOULD acknowledge to the device of the accepted
> > +feature bits by writing them to the vhost_pci_config fields.
> > +
> > +When the vhost-pci driver receives a
> VHOST_PCI_CONTROLQ_UPDATE_DONE
> > +message from the controlq, the vhost-pci driver MUST initialize the
> > +corresponding driver interface of the device_type if it has not been
> > +initialized, and add the device_id to the frontend device list that
> > +records all the frontend virtio devices being supported by vhost-pci for inter-
> VM communications.
> 
> Okay, i saw how to use it here. But, once the driver gets
> VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) then it knows how to
> communicate with the virtio device on another VM. Why we postpone the
> initialize until it gets VHOST_PCI_CONTROLQ_UPDATE_DONE?
> 
> > +
> > +The vhost-pci driver SHOULD acknowledge to the device that the device
> > +and memory info update (add or delete) is DONE or FAIL by writing the
> > +acknowledgement (DONE or FAIL) to the vhost_pci_config fields.
> > +
> > +The vhost-pci driver MUST ensure that writing to the vhost_pci_config
> > +fields to be atomic.
> > +
> > +4.6 Device Operation
> > +4.6.1 Device Requirements: Device Operation
> > +4.6.1.1 Frontend Device Info Update
> > +When the frontend virtio device changes any info (e.g. device_id,
> > +virtq
> > +address) that it has sent to the vhost-pci device, it SHOULD send a
> > +VHOST_PCI_SOCKET_DEVICE_INFO(ADD) message, which contains the new
> > +device info, to the vhost-pci server. The vhost-pci device SHOULD
> > +insert a
> > +VHOST_PCI_CONTROLQ_DEVICE_INFO(ADD) to the controlq and inject a
> > +contrlq interrupt to the VM.
> > +
> > +When the vhost-pci device receives a
> > +VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK(ADD_DONE) acknowledgement
> from the
> > +driver, it SHOULD send a
> VHOST_PCI_SOCKET_DEVICE_INFO_ACK(ADD_DONE)
> > +message to the client that is identified by the ack_device_type and
> > +ack_device_id fields, to indicate that the vhost-pci driver has
> > +finished the handling of the device info update.
> 
> If VHOST_PCI_CONTROLQ_UPDATE_DONE is really needed, you missed it here.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
  2016-06-02  3:15       ` [Qemu-devel] " Wang, Wei W
@ 2016-06-02  3:52         ` Xiao Guangrong
  -1 siblings, 0 replies; 34+ messages in thread
From: Xiao Guangrong @ 2016-06-02  3:52 UTC (permalink / raw)
  To: Wang, Wei W, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini



On 06/02/2016 11:15 AM, Wang, Wei W wrote:
> On Wed 6/1/2016 4:15 PM, Xiao Guangrong wrote:
>> On 05/29/2016 04:11 PM, Wei Wang wrote:
>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>> ---
>>>    Details | 324
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 324 insertions(+)
>>>    create mode 100644 Details
>>>
>>> diff --git a/Details b/Details
>>> new file mode 100644
>>> index 0000000..4ea2252
>>> --- /dev/null
>>> +++ b/Details
>>> @@ -0,0 +1,324 @@
>>> +1 Device ID
>>> +TBD
>>> +
>>> +2 Virtqueues
>>> +0 controlq
>>> +
>>> +3 Feature Bits
>>> +3.1 Local Feature Bits
>>> +Currently no local feature bits are defined, so the standard virtio
>>> +feature bits negation will always be successful and complete.
>>> +
>>> +3.2 Remote Feature Bits
>>> +The remote feature bits are obtained from the frontend virtio device
>>> +and negotiated with the vhost-pci driver via the controlq. The
>>> +negotiation steps are described in 4.5 Device Initialization.
>>> +
>>> +4 Device Configuration Layout
>>> +struct vhost_pci_config {
>>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
>>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
>>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
>>> +	u32 ack_type;
>>> +	u32 ack_device_type;
>>> +	u64 ack_device_id;
>>> +	union {
>>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
>>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
>>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
>>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
>>> +		u64 ack_memory_info;
>>> +		u64 ack_device_info;
>>> +		u64 ack_feature_bits;
>>> +	};
>>> +};
>>
>> Do you need to write all these 4 field to ack the operation? It seems it is not
>> efficient and it is not flexible if the driver need to offer more data to the device
>> in the further. Can we dedicate a vq for this purpose?
>
> Yes, all the 4 fields are required to be written. The vhost-pci server usually connects to multiple clients, and the "ack_device_type" and "ack_device_id" fields are used to identify them.
>
> Agree, another controlq for the guest->host direction looks better, and the above fileds can be converted to be the controlq message header.
>

Thanks.

>>
>> BTW, current approach can not handle the case if there are multiple same kind
>> of requests in the control queue, e.g, if there are two memory-add request in
>> the control queue.
>
> A vhost-pci device corresponds to a driver VM. The two memory-add requests on the controlq are all for the same driver VM.  Memory-add requests for different driver VMs  couldn’t be present on the same controlq. I haven’t seen the issue yet. Can you please explain more? Thanks.

The issue is caused by "The two memory-add requests on the controlq are all for the same driver VM",
the driver need to ACK these request respectively, however, these two requests have the same
ack_type, device_type, device_id, ack_memory_info, then QEMU is not able to figure out which request
has been acked.

>
>
>>> +
>>> +The configuration fields are currently used for the vhost-pci driver
>>> +to acknowledge to the vhost-pci device after it receives controlq messages.
>>> +
>>> +4.5 Device Initialization
>>> +When a device VM boots, it creates a vhost-pci server socket.
>>> +
>>> +When a virtio device on the driver VM is created with specifying the
>>> +use of a vhost-pci device as a backend, a client socket is created
>>> +and connected to the corresponding vhost-pci server for message exchanges.
>>> +
>>> +The messages passed to the vhost-pci server is proceeded by the
>>> +following
>>> +header:
>>> +struct vhost_pci_socket_hdr {
>>> +	#define VHOST_PCI_SOCKET_MEMORY_INFO 0
>>> +	#define VHOST_PCI_SOCKET_MEMORY_INFO_ACK 1
>>> +	#define VHOST_PCI_SOCKET_DEVICE_INFO 2
>>> +	#define VHOST_PCI_SOCKET_DEVICE_INFO_ACK 3
>>> +	#define VHOST_PCI_SOCKET_FEATURE_BITS 4
>>> +	#define VHOST_PCI_SOCKET_FEATURE_BITS_ACK 5
>>> +	u16 msg_type;
>>> +	u16 msg_version;
>>> +	u32 msg_len;
>>> +	u64 qemu_pid;
>>> +};
>>> +
>>> +The payload of the above message types can be constructed using the
>>> +structures
>>> +below:
>>> +/* VHOST_PCI_SOCKET_MEMORY_INFO message */ struct
>>> +vhost_pci_socket_memory_info {
>>> +	#define VHOST_PCI_ADD_MEMORY 0
>>> +	#define VHOST_PCI_DEL_MEMORY 1
>>> +	u16 ops;
>>> +	u32 nregions;
>>> +	struct vhost_pci_memory_region {
>>> +		int fd;
>>> +		u64 guest_phys_addr;
>>> +		u64 memory_size;
>>> +		u64 mmap_offset;
>>> +	} regions[VHOST_PCI_MAX_NREGIONS];
>>> +};
>>> +
>>> +/* VHOST_PCI_SOCKET_DEVICE_INFO message */ struct
>>> +vhost_pci_device_info {
>>> +	#define VHOST_PCI_ADD_FRONTEND_DEVICE 0
>>> +	#define VHOST_PCI_DEL_FRONTEND_DEVICE 1
>>> +	u16    ops;
>>> +	u32    nvirtq;
>>> +	#define VHOST_PCI_FRONTEND_DEVICE_NET 1
>>> +	#define VHOST_PCI_FRONTEND_DEVICE_BLK 2
>>> +	#define VHOST_PCI_FRONTEND_DEVICE_CONSOLE 3
>>> +	#define VHOST_PCI_FRONTEND_DEVICE_ENTROPY 4
>>> +	#define VHOST_PCI_FRONTEND_DEVICE_BALLOON 5
>>> +	#define VHOST_PCI_FRONTEND_DEVICE_SCSI 8
>>> +	u32    device_type;
>>> +	u64    device_id;
>>> +	struct virtq exotic_virtq[VHOST_PCI_MAX_NVIRTQ];
>>> +};
>>> +The device_id field identifies the device. For example, it can be
>>> +used to store a MAC address if the device_type is
>> VHOST_PCI_FRONTEND_DEVICE_NET.
>>> +
>>> +/* VHOST_PCI_SOCKET_FEATURE_BITS message*/ struct
>>> +vhost_pci_feature_bits {
>>> +	u64 feature_bits;
>>> +};
>>
>> We not only have 'socket feature bits' but also the feature bits for per virtio
>> device plugged in on the side of vhost-pci device.
>
> Yes. It is mentioned in "3 Feature Bits". The socket feature bits here are actually the remote feature bits (got from a socket message).

Hmm, there are two questions:
1) The device related info (e.g, device-type, device-id, etc) has not been included
    there, so the vhost-pci driver can not write proper values to the fields of
    vhost_pci_config.

2) different virtio device may have different feature bits, VHOST_PCI_SOCKET_FEATURE_BITS
    and VHOST_PCI_CONTROLQ_FEATURE_BITS should be able to negotiate the feature bits for
    different virtio device.

>
>>
>> E.g: if there are two virtio devices (e.g, a NIC and BLK) both of them need to
>> directly communicate with another VM. The feature bits of these two devices
>> need to be negotiated with that VM respectively. And you can not put these
>> feature bits in vhost_pci_device_info struct as its vq is not created at that time.
>
> Right. If you check the initialization steps below, there is a statement "When the device status is updated with DRIVER_OK".
>
>>> +
>>> +/* VHOST_PCI_SOCKET_xx_ACK messages */ struct vhost_pci_socket_ack {
>>> +	#define VHOST_PCI_SOCKET_ACK_ADD_DONE 0
>>> +	#define VHOST_PCI_SOCKET_ACK_ADD_FAIL 1
>>> +	#define VHOST_PCI_SOCKET_ACK_DEL_DONE 2
>>> +	#define VHOST_PCI_SOCKET_ACK_DEL_FAIL 3
>>> +	u64 ack;
>>> +};
>>> +
>>> +The driver update message passed via the controlq is preceded by the
>>> +following
>>> +header:
>>> +struct vhost_pci_controlq_hdr {
>>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO 0
>>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO 1
>>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS 2
>>> +	#define VHOST_PCI_CONTROLQ_UPDATE_DONE 3
>>> +	u16 msg_type;
>>> +	u16 msg_version;
>>> +	u32 msg_len;
>>> +};
>>> +
>>> +The payload of a VHOST_PCI_CONTROLQ_MEMORY_INFO message can be
>>> +constructed using the following structure:
>>> +/* VHOST_PCI_CONTROLQ_MEMORY_INFO message */ struct
>>> +vhost_pci_controlq_memory_info {
>>> +	#define VHOST_PCI_ADD_MEMORY 0
>>> +	#define VHOST_PCI_DEL_MEMORY 1
>>> +	u16  ops;
>>> +	u32 nregion;
>>> +	struct exotic_memory_region {
>>> +		u64   region_base_xgpa;
>>> +		u64   size;
>>> +		u64   offset_in_bar_area;
>>> +	} region[VHOST_PCI_MAX_NREGIONS];
>>> +};
>>> +
>>> +The payload of VHOST_PCI_CONTROLQ_DEVICE_INFO and
>>> +VHOST_PCI_CONTROLQ_FEATURE_BITS messages can be constructed using
>> the
>>> +vhost_pci_device_info structure and the vhost_pci_feature_bits
>>> +structure respectively.
>>> +
>>> +The payload of a VHOST_PCI_CONTROLQ_UPDATE_DONE message can be
>>> +constructed using the structure below:
>>> +struct vhost_pci_controlq_update_done {
>>> +	u32    device_type;
>>> +	u64    device_id;
>>> +};
>>> +
>>> +Fig. 1 shows the initialization steps.
>>> +
>>> +When the vhost-pci server receives a
>>> +VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message, it checks if a vhost-pci
>>> +device has been created for the requesting VM whose QEMU process id
>>> +is qemu_pid. If yes, it will simply update the subsequent received
>>> +messages to the vhost-pci driver via the controlq. Otherwise, the
>>> +server creates a new vhost-pci device, and continues the following
>> initialization steps.
>>
>>
>> qemu-pid is not stable as the existing VM will be killed silently and the new
>> vhost-pci driver reusing the same qemu-pid will ask to join before the vhost-
>> device gets to know the previous one has gone.
>
> Would it be a normal and legal operation to silently kill a QEMU? I guess only the system admin can do that, right?

Yup, it is a valid operation as a problematic VM will be killed and as you said the admin
can kill it silently, anyway, the design should have a way to handle this case properly.

>
> If that's true, I think we can add a new field, "u64 tsc_of_birth" to the vhost_pci_socket_hdr structure. It records the tsc when the QEMU is created.

You only need to identify a VM, can use UUID instead?

> If that's true, another problem would be the remove of the vhost-pci device for a silently killed driver VM.
> The vhost-pci server may need to periodically send a checking message to check if the driver VM is silently killed. If that really happens, it should remove the related vhost-pci device.

Yes.

>
>>> +
>>> +The vhost-pci server adds up all the memory region size, and uses a
>>> +64-bit device bar for the mapping of all the memory regions obtained
>>> +from the socket message. To better support memory hot-plugging of the
>>> +driver VM, the bar is configured with a double size of the driver
>>> +VM's memory. The server maps the received memory info via the QEMU
>>> +MemoryRegion mechanism, and then the new created vhost-pci device is
>> hot-plugged to the VM.
>>> +
>>> +When the device status is updated with DRIVER_OK, a
>>> +VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, which is stemed
>> from the
>>> +memory info socket message, is put on the controlq and a controlq
>>> +interrupt is injected to the VM.
>>> +
>>> +When the vhost-pci server receives a
>>> +VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE)
>> acknowledgement from the
>>> +driver, it sends a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE)
>> message
>>> +to the client that is identified by the ack_device_type and ack_device_id fields.
>>> +
>>> +When the vhost-pci server receives a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message, a
>>> +VHOST_PCI_CONTROLQ_FEATURE_BITS(feature bits) message is put on the
>>> +controlq and a controlq interrupt is injected to the VM.
>>> +
>>> +If the vhost-pci server notices that the driver fully accepted the
>>> +offered feature bits, it sends a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the client.
>> If
>>> +the vhost-pci server notices that the vhost-pci driver only accepted
>>> +a subset of the offered feature bits, it sends a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS(accepted feature bits) message back to
>>> +the client. The client side virtio device re-negotiates the new
>>> +feature bits with its driver, and sends back a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE)
>>> +message to the server.
>>> +
>>> +Either when the vhost-pci driver fully accepted the offered feature
>>> +bits or a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message is received
>> from
>>> +the client, the vhost-pci server puts a
>>> +VHOST_PCI_CONTROLQ_UPDATE_DONE message on the controlq, and a
>> controlq interrupt is injected to the VM.
>>
>> Why VHOST_PCI_CONTROLQ_UPDATE_DONE is needed?
>
> OK, this one looks redundant. We can set up the related support for that frontend device when the device info is received via the controlq.
>

Great.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
@ 2016-06-02  3:52         ` Xiao Guangrong
  0 siblings, 0 replies; 34+ messages in thread
From: Xiao Guangrong @ 2016-06-02  3:52 UTC (permalink / raw)
  To: Wang, Wei W, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini



On 06/02/2016 11:15 AM, Wang, Wei W wrote:
> On Wed 6/1/2016 4:15 PM, Xiao Guangrong wrote:
>> On 05/29/2016 04:11 PM, Wei Wang wrote:
>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>> ---
>>>    Details | 324
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 324 insertions(+)
>>>    create mode 100644 Details
>>>
>>> diff --git a/Details b/Details
>>> new file mode 100644
>>> index 0000000..4ea2252
>>> --- /dev/null
>>> +++ b/Details
>>> @@ -0,0 +1,324 @@
>>> +1 Device ID
>>> +TBD
>>> +
>>> +2 Virtqueues
>>> +0 controlq
>>> +
>>> +3 Feature Bits
>>> +3.1 Local Feature Bits
>>> +Currently no local feature bits are defined, so the standard virtio
>>> +feature bits negation will always be successful and complete.
>>> +
>>> +3.2 Remote Feature Bits
>>> +The remote feature bits are obtained from the frontend virtio device
>>> +and negotiated with the vhost-pci driver via the controlq. The
>>> +negotiation steps are described in 4.5 Device Initialization.
>>> +
>>> +4 Device Configuration Layout
>>> +struct vhost_pci_config {
>>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
>>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
>>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
>>> +	u32 ack_type;
>>> +	u32 ack_device_type;
>>> +	u64 ack_device_id;
>>> +	union {
>>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
>>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
>>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
>>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
>>> +		u64 ack_memory_info;
>>> +		u64 ack_device_info;
>>> +		u64 ack_feature_bits;
>>> +	};
>>> +};
>>
>> Do you need to write all these 4 field to ack the operation? It seems it is not
>> efficient and it is not flexible if the driver need to offer more data to the device
>> in the further. Can we dedicate a vq for this purpose?
>
> Yes, all the 4 fields are required to be written. The vhost-pci server usually connects to multiple clients, and the "ack_device_type" and "ack_device_id" fields are used to identify them.
>
> Agree, another controlq for the guest->host direction looks better, and the above fileds can be converted to be the controlq message header.
>

Thanks.

>>
>> BTW, current approach can not handle the case if there are multiple same kind
>> of requests in the control queue, e.g, if there are two memory-add request in
>> the control queue.
>
> A vhost-pci device corresponds to a driver VM. The two memory-add requests on the controlq are all for the same driver VM.  Memory-add requests for different driver VMs  couldn’t be present on the same controlq. I haven’t seen the issue yet. Can you please explain more? Thanks.

The issue is caused by "The two memory-add requests on the controlq are all for the same driver VM",
the driver need to ACK these request respectively, however, these two requests have the same
ack_type, device_type, device_id, ack_memory_info, then QEMU is not able to figure out which request
has been acked.

>
>
>>> +
>>> +The configuration fields are currently used for the vhost-pci driver
>>> +to acknowledge to the vhost-pci device after it receives controlq messages.
>>> +
>>> +4.5 Device Initialization
>>> +When a device VM boots, it creates a vhost-pci server socket.
>>> +
>>> +When a virtio device on the driver VM is created with specifying the
>>> +use of a vhost-pci device as a backend, a client socket is created
>>> +and connected to the corresponding vhost-pci server for message exchanges.
>>> +
>>> +The messages passed to the vhost-pci server is proceeded by the
>>> +following
>>> +header:
>>> +struct vhost_pci_socket_hdr {
>>> +	#define VHOST_PCI_SOCKET_MEMORY_INFO 0
>>> +	#define VHOST_PCI_SOCKET_MEMORY_INFO_ACK 1
>>> +	#define VHOST_PCI_SOCKET_DEVICE_INFO 2
>>> +	#define VHOST_PCI_SOCKET_DEVICE_INFO_ACK 3
>>> +	#define VHOST_PCI_SOCKET_FEATURE_BITS 4
>>> +	#define VHOST_PCI_SOCKET_FEATURE_BITS_ACK 5
>>> +	u16 msg_type;
>>> +	u16 msg_version;
>>> +	u32 msg_len;
>>> +	u64 qemu_pid;
>>> +};
>>> +
>>> +The payload of the above message types can be constructed using the
>>> +structures
>>> +below:
>>> +/* VHOST_PCI_SOCKET_MEMORY_INFO message */ struct
>>> +vhost_pci_socket_memory_info {
>>> +	#define VHOST_PCI_ADD_MEMORY 0
>>> +	#define VHOST_PCI_DEL_MEMORY 1
>>> +	u16 ops;
>>> +	u32 nregions;
>>> +	struct vhost_pci_memory_region {
>>> +		int fd;
>>> +		u64 guest_phys_addr;
>>> +		u64 memory_size;
>>> +		u64 mmap_offset;
>>> +	} regions[VHOST_PCI_MAX_NREGIONS];
>>> +};
>>> +
>>> +/* VHOST_PCI_SOCKET_DEVICE_INFO message */ struct
>>> +vhost_pci_device_info {
>>> +	#define VHOST_PCI_ADD_FRONTEND_DEVICE 0
>>> +	#define VHOST_PCI_DEL_FRONTEND_DEVICE 1
>>> +	u16    ops;
>>> +	u32    nvirtq;
>>> +	#define VHOST_PCI_FRONTEND_DEVICE_NET 1
>>> +	#define VHOST_PCI_FRONTEND_DEVICE_BLK 2
>>> +	#define VHOST_PCI_FRONTEND_DEVICE_CONSOLE 3
>>> +	#define VHOST_PCI_FRONTEND_DEVICE_ENTROPY 4
>>> +	#define VHOST_PCI_FRONTEND_DEVICE_BALLOON 5
>>> +	#define VHOST_PCI_FRONTEND_DEVICE_SCSI 8
>>> +	u32    device_type;
>>> +	u64    device_id;
>>> +	struct virtq exotic_virtq[VHOST_PCI_MAX_NVIRTQ];
>>> +};
>>> +The device_id field identifies the device. For example, it can be
>>> +used to store a MAC address if the device_type is
>> VHOST_PCI_FRONTEND_DEVICE_NET.
>>> +
>>> +/* VHOST_PCI_SOCKET_FEATURE_BITS message*/ struct
>>> +vhost_pci_feature_bits {
>>> +	u64 feature_bits;
>>> +};
>>
>> We not only have 'socket feature bits' but also the feature bits for per virtio
>> device plugged in on the side of vhost-pci device.
>
> Yes. It is mentioned in "3 Feature Bits". The socket feature bits here are actually the remote feature bits (got from a socket message).

Hmm, there are two questions:
1) The device related info (e.g, device-type, device-id, etc) has not been included
    there, so the vhost-pci driver can not write proper values to the fields of
    vhost_pci_config.

2) different virtio device may have different feature bits, VHOST_PCI_SOCKET_FEATURE_BITS
    and VHOST_PCI_CONTROLQ_FEATURE_BITS should be able to negotiate the feature bits for
    different virtio device.

>
>>
>> E.g: if there are two virtio devices (e.g, a NIC and BLK) both of them need to
>> directly communicate with another VM. The feature bits of these two devices
>> need to be negotiated with that VM respectively. And you can not put these
>> feature bits in vhost_pci_device_info struct as its vq is not created at that time.
>
> Right. If you check the initialization steps below, there is a statement "When the device status is updated with DRIVER_OK".
>
>>> +
>>> +/* VHOST_PCI_SOCKET_xx_ACK messages */ struct vhost_pci_socket_ack {
>>> +	#define VHOST_PCI_SOCKET_ACK_ADD_DONE 0
>>> +	#define VHOST_PCI_SOCKET_ACK_ADD_FAIL 1
>>> +	#define VHOST_PCI_SOCKET_ACK_DEL_DONE 2
>>> +	#define VHOST_PCI_SOCKET_ACK_DEL_FAIL 3
>>> +	u64 ack;
>>> +};
>>> +
>>> +The driver update message passed via the controlq is preceded by the
>>> +following
>>> +header:
>>> +struct vhost_pci_controlq_hdr {
>>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO 0
>>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO 1
>>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS 2
>>> +	#define VHOST_PCI_CONTROLQ_UPDATE_DONE 3
>>> +	u16 msg_type;
>>> +	u16 msg_version;
>>> +	u32 msg_len;
>>> +};
>>> +
>>> +The payload of a VHOST_PCI_CONTROLQ_MEMORY_INFO message can be
>>> +constructed using the following structure:
>>> +/* VHOST_PCI_CONTROLQ_MEMORY_INFO message */ struct
>>> +vhost_pci_controlq_memory_info {
>>> +	#define VHOST_PCI_ADD_MEMORY 0
>>> +	#define VHOST_PCI_DEL_MEMORY 1
>>> +	u16  ops;
>>> +	u32 nregion;
>>> +	struct exotic_memory_region {
>>> +		u64   region_base_xgpa;
>>> +		u64   size;
>>> +		u64   offset_in_bar_area;
>>> +	} region[VHOST_PCI_MAX_NREGIONS];
>>> +};
>>> +
>>> +The payload of VHOST_PCI_CONTROLQ_DEVICE_INFO and
>>> +VHOST_PCI_CONTROLQ_FEATURE_BITS messages can be constructed using
>> the
>>> +vhost_pci_device_info structure and the vhost_pci_feature_bits
>>> +structure respectively.
>>> +
>>> +The payload of a VHOST_PCI_CONTROLQ_UPDATE_DONE message can be
>>> +constructed using the structure below:
>>> +struct vhost_pci_controlq_update_done {
>>> +	u32    device_type;
>>> +	u64    device_id;
>>> +};
>>> +
>>> +Fig. 1 shows the initialization steps.
>>> +
>>> +When the vhost-pci server receives a
>>> +VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message, it checks if a vhost-pci
>>> +device has been created for the requesting VM whose QEMU process id
>>> +is qemu_pid. If yes, it will simply update the subsequent received
>>> +messages to the vhost-pci driver via the controlq. Otherwise, the
>>> +server creates a new vhost-pci device, and continues the following
>> initialization steps.
>>
>>
>> qemu-pid is not stable as the existing VM will be killed silently and the new
>> vhost-pci driver reusing the same qemu-pid will ask to join before the vhost-
>> device gets to know the previous one has gone.
>
> Would it be a normal and legal operation to silently kill a QEMU? I guess only the system admin can do that, right?

Yup, it is a valid operation as a problematic VM will be killed and as you said the admin
can kill it silently, anyway, the design should have a way to handle this case properly.

>
> If that's true, I think we can add a new field, "u64 tsc_of_birth" to the vhost_pci_socket_hdr structure. It records the tsc when the QEMU is created.

You only need to identify a VM, can use UUID instead?

> If that's true, another problem would be the remove of the vhost-pci device for a silently killed driver VM.
> The vhost-pci server may need to periodically send a checking message to check if the driver VM is silently killed. If that really happens, it should remove the related vhost-pci device.

Yes.

>
>>> +
>>> +The vhost-pci server adds up all the memory region size, and uses a
>>> +64-bit device bar for the mapping of all the memory regions obtained
>>> +from the socket message. To better support memory hot-plugging of the
>>> +driver VM, the bar is configured with a double size of the driver
>>> +VM's memory. The server maps the received memory info via the QEMU
>>> +MemoryRegion mechanism, and then the new created vhost-pci device is
>> hot-plugged to the VM.
>>> +
>>> +When the device status is updated with DRIVER_OK, a
>>> +VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, which is stemed
>> from the
>>> +memory info socket message, is put on the controlq and a controlq
>>> +interrupt is injected to the VM.
>>> +
>>> +When the vhost-pci server receives a
>>> +VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE)
>> acknowledgement from the
>>> +driver, it sends a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE)
>> message
>>> +to the client that is identified by the ack_device_type and ack_device_id fields.
>>> +
>>> +When the vhost-pci server receives a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message, a
>>> +VHOST_PCI_CONTROLQ_FEATURE_BITS(feature bits) message is put on the
>>> +controlq and a controlq interrupt is injected to the VM.
>>> +
>>> +If the vhost-pci server notices that the driver fully accepted the
>>> +offered feature bits, it sends a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the client.
>> If
>>> +the vhost-pci server notices that the vhost-pci driver only accepted
>>> +a subset of the offered feature bits, it sends a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS(accepted feature bits) message back to
>>> +the client. The client side virtio device re-negotiates the new
>>> +feature bits with its driver, and sends back a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE)
>>> +message to the server.
>>> +
>>> +Either when the vhost-pci driver fully accepted the offered feature
>>> +bits or a
>>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message is received
>> from
>>> +the client, the vhost-pci server puts a
>>> +VHOST_PCI_CONTROLQ_UPDATE_DONE message on the controlq, and a
>> controlq interrupt is injected to the VM.
>>
>> Why VHOST_PCI_CONTROLQ_UPDATE_DONE is needed?
>
> OK, this one looks redundant. We can set up the related support for that frontend device when the device info is received via the controlq.
>

Great.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [virtio-comment] RE: [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
  2016-06-02  3:52         ` [Qemu-devel] " Xiao Guangrong
@ 2016-06-02  8:43           ` Wang, Wei W
  -1 siblings, 0 replies; 34+ messages in thread
From: Wang, Wei W @ 2016-06-02  8:43 UTC (permalink / raw)
  To: Xiao Guangrong, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On Thu 6/2/2016 11:52 AM,  Xiao Guangrong wrote:
> On 06/02/2016 11:15 AM, Wang, Wei W wrote:
> > On Wed 6/1/2016 4:15 PM, Xiao Guangrong wrote:
> >> On 05/29/2016 04:11 PM, Wei Wang wrote:
> >>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> >>> ---
> >>>    Details | 324
> >>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>    1 file changed, 324 insertions(+)
> >>>    create mode 100644 Details
> >>>
> >>> diff --git a/Details b/Details
> >>> new file mode 100644
> >>> index 0000000..4ea2252
> >>> --- /dev/null
> >>> +++ b/Details
> >>> @@ -0,0 +1,324 @@
> >>> +1 Device ID
> >>> +TBD
> >>> +
> >>> +2 Virtqueues
> >>> +0 controlq
> >>> +
> >>> +3 Feature Bits
> >>> +3.1 Local Feature Bits
> >>> +Currently no local feature bits are defined, so the standard virtio
> >>> +feature bits negation will always be successful and complete.
> >>> +
> >>> +3.2 Remote Feature Bits
> >>> +The remote feature bits are obtained from the frontend virtio
> >>> +device and negotiated with the vhost-pci driver via the controlq.
> >>> +The negotiation steps are described in 4.5 Device Initialization.
> >>> +
> >>> +4 Device Configuration Layout
> >>> +struct vhost_pci_config {
> >>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
> >>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
> >>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
> >>> +	u32 ack_type;
> >>> +	u32 ack_device_type;
> >>> +	u64 ack_device_id;
> >>> +	union {
> >>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
> >>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
> >>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
> >>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
> >>> +		u64 ack_memory_info;
> >>> +		u64 ack_device_info;
> >>> +		u64 ack_feature_bits;
> >>> +	};
> >>> +};
> >>
> >> Do you need to write all these 4 field to ack the operation? It seems
> >> it is not efficient and it is not flexible if the driver need to
> >> offer more data to the device in the further. Can we dedicate a vq
> >> for this purpose?
> >
> > Yes, all the 4 fields are required to be written. The vhost-pci server usually
> connects to multiple clients, and the "ack_device_type" and "ack_device_id"
> fields are used to identify them.
> >
> > Agree, another controlq for the guest->host direction looks better, and the
> above fileds can be converted to be the controlq message header.
> >
> 
> Thanks.
> 
> >>
> >> BTW, current approach can not handle the case if there are multiple
> >> same kind of requests in the control queue, e.g, if there are two
> >> memory-add request in the control queue.
> >
> > A vhost-pci device corresponds to a driver VM. The two memory-add requests
> on the controlq are all for the same driver VM.  Memory-add requests for
> different driver VMs  couldn’t be present on the same controlq. I haven’t seen
> the issue yet. Can you please explain more? Thanks.
> 
> The issue is caused by "The two memory-add requests on the controlq are all for
> the same driver VM", the driver need to ACK these request respectively, however,
> these two requests have the same ack_type, device_type, device_id,
> ack_memory_info, then QEMU is not able to figure out which request has been
> acked.

Normally pieces of memory info should be combined into one message (the structure includes multiple memory regions) and sent by the client. In a rare case like this: the driver VM hot-adds 1GB memory, followed by hot-adding another 1GB memory. The first piece of memory info is passed via the socket and controlq to the vhost-pci driver, then the second. Normally they won't get an opportunity to be put on the controlq at the same time. 
Even the implementation batches the controlq messages, there will be a sequence difference between the two messages on the controlq, right?

From the QEMU's (vhost-pci server) perspective, it just sends back an ACK to the client whenever it receives an ACK from the vhost-pci driver.
From the client's perspective, it will receive two ACK messages in this example.
Since the two have a sequence difference, the client should be able to distinguish the two (first sent, first acked), right?

Do you see a case where handling the first message is delayed and the second message is handled and ACK-ed first? 

> >
> >
> >>> +
> >>> +The configuration fields are currently used for the vhost-pci
> >>> +driver to acknowledge to the vhost-pci device after it receives controlq
> messages.
> >>> +
> >>> +4.5 Device Initialization
> >>> +When a device VM boots, it creates a vhost-pci server socket.
> >>> +
> >>> +When a virtio device on the driver VM is created with specifying
> >>> +the use of a vhost-pci device as a backend, a client socket is
> >>> +created and connected to the corresponding vhost-pci server for message
> exchanges.
> >>> +
> >>> +The messages passed to the vhost-pci server is proceeded by the
> >>> +following
> >>> +header:
> >>> +struct vhost_pci_socket_hdr {
> >>> +	#define VHOST_PCI_SOCKET_MEMORY_INFO 0
> >>> +	#define VHOST_PCI_SOCKET_MEMORY_INFO_ACK 1
> >>> +	#define VHOST_PCI_SOCKET_DEVICE_INFO 2
> >>> +	#define VHOST_PCI_SOCKET_DEVICE_INFO_ACK 3
> >>> +	#define VHOST_PCI_SOCKET_FEATURE_BITS 4
> >>> +	#define VHOST_PCI_SOCKET_FEATURE_BITS_ACK 5
> >>> +	u16 msg_type;
> >>> +	u16 msg_version;
> >>> +	u32 msg_len;
> >>> +	u64 qemu_pid;
> >>> +};
> >>> +
> >>> +The payload of the above message types can be constructed using the
> >>> +structures
> >>> +below:
> >>> +/* VHOST_PCI_SOCKET_MEMORY_INFO message */ struct
> >>> +vhost_pci_socket_memory_info {
> >>> +	#define VHOST_PCI_ADD_MEMORY 0
> >>> +	#define VHOST_PCI_DEL_MEMORY 1
> >>> +	u16 ops;
> >>> +	u32 nregions;
> >>> +	struct vhost_pci_memory_region {
> >>> +		int fd;
> >>> +		u64 guest_phys_addr;
> >>> +		u64 memory_size;
> >>> +		u64 mmap_offset;
> >>> +	} regions[VHOST_PCI_MAX_NREGIONS]; };
> >>> +
> >>> +/* VHOST_PCI_SOCKET_DEVICE_INFO message */ struct
> >>> +vhost_pci_device_info {
> >>> +	#define VHOST_PCI_ADD_FRONTEND_DEVICE 0
> >>> +	#define VHOST_PCI_DEL_FRONTEND_DEVICE 1
> >>> +	u16    ops;
> >>> +	u32    nvirtq;
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_NET 1
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_BLK 2
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_CONSOLE 3
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_ENTROPY 4
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_BALLOON 5
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_SCSI 8
> >>> +	u32    device_type;
> >>> +	u64    device_id;
> >>> +	struct virtq exotic_virtq[VHOST_PCI_MAX_NVIRTQ];
> >>> +};
> >>> +The device_id field identifies the device. For example, it can be
> >>> +used to store a MAC address if the device_type is
> >> VHOST_PCI_FRONTEND_DEVICE_NET.
> >>> +
> >>> +/* VHOST_PCI_SOCKET_FEATURE_BITS message*/ struct
> >>> +vhost_pci_feature_bits {
> >>> +	u64 feature_bits;
> >>> +};
> >>
> >> We not only have 'socket feature bits' but also the feature bits for
> >> per virtio device plugged in on the side of vhost-pci device.
> >
> > Yes. It is mentioned in "3 Feature Bits". The socket feature bits here are
> actually the remote feature bits (got from a socket message).
> 
> Hmm, there are two questions:
> 1) The device related info (e.g, device-type, device-id, etc) has not been included
>     there, so the vhost-pci driver can not write proper values to the fields of
>     vhost_pci_config.

Right. For the controlq feature bits messages, the data structure lacks the "device_type" and "device_id" fields. I will add them.

I think "device_type" and "device_id" don’t need to be included in the socket feature bits message structure:
each frontend virtio device has its own connection to the vhost-pci server, that is, the connection itself has actually already been associated with a "device_type+device_id" (the server should maintain a table for such a relationship after a device info socket message is received). 

> 2) different virtio device may have different feature bits,
> VHOST_PCI_SOCKET_FEATURE_BITS
>     and VHOST_PCI_CONTROLQ_FEATURE_BITS should be able to negotiate the
> feature bits for
>     different virtio device.

Then, I think this shouldn’t be a problem after "device_type" and "device_id" are added to the controlq feature bits message structure.

> >
> >>
> >> E.g: if there are two virtio devices (e.g, a NIC and BLK) both of
> >> them need to directly communicate with another VM. The feature bits
> >> of these two devices need to be negotiated with that VM respectively.
> >> And you can not put these feature bits in vhost_pci_device_info struct as its
> vq is not created at that time.
> >
> > Right. If you check the initialization steps below, there is a statement "When
> the device status is updated with DRIVER_OK".
> >
> >>> +
> >>> +/* VHOST_PCI_SOCKET_xx_ACK messages */ struct vhost_pci_socket_ack {
> >>> +	#define VHOST_PCI_SOCKET_ACK_ADD_DONE 0
> >>> +	#define VHOST_PCI_SOCKET_ACK_ADD_FAIL 1
> >>> +	#define VHOST_PCI_SOCKET_ACK_DEL_DONE 2
> >>> +	#define VHOST_PCI_SOCKET_ACK_DEL_FAIL 3
> >>> +	u64 ack;
> >>> +};
> >>> +
> >>> +The driver update message passed via the controlq is preceded by
> >>> +the following
> >>> +header:
> >>> +struct vhost_pci_controlq_hdr {
> >>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO 0
> >>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO 1
> >>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS 2
> >>> +	#define VHOST_PCI_CONTROLQ_UPDATE_DONE 3
> >>> +	u16 msg_type;
> >>> +	u16 msg_version;
> >>> +	u32 msg_len;
> >>> +};
> >>> +
> >>> +The payload of a VHOST_PCI_CONTROLQ_MEMORY_INFO message can be
> >>> +constructed using the following structure:
> >>> +/* VHOST_PCI_CONTROLQ_MEMORY_INFO message */ struct
> >>> +vhost_pci_controlq_memory_info {
> >>> +	#define VHOST_PCI_ADD_MEMORY 0
> >>> +	#define VHOST_PCI_DEL_MEMORY 1
> >>> +	u16  ops;
> >>> +	u32 nregion;
> >>> +	struct exotic_memory_region {
> >>> +		u64   region_base_xgpa;
> >>> +		u64   size;
> >>> +		u64   offset_in_bar_area;
> >>> +	} region[VHOST_PCI_MAX_NREGIONS];
> >>> +};
> >>> +
> >>> +The payload of VHOST_PCI_CONTROLQ_DEVICE_INFO and
> >>> +VHOST_PCI_CONTROLQ_FEATURE_BITS messages can be constructed
> using
> >> the
> >>> +vhost_pci_device_info structure and the vhost_pci_feature_bits
> >>> +structure respectively.
> >>> +
> >>> +The payload of a VHOST_PCI_CONTROLQ_UPDATE_DONE message can be
> >>> +constructed using the structure below:
> >>> +struct vhost_pci_controlq_update_done {
> >>> +	u32    device_type;
> >>> +	u64    device_id;
> >>> +};
> >>> +
> >>> +Fig. 1 shows the initialization steps.
> >>> +
> >>> +When the vhost-pci server receives a
> >>> +VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message, it checks if a vhost-
> pci
> >>> +device has been created for the requesting VM whose QEMU process id
> >>> +is qemu_pid. If yes, it will simply update the subsequent received
> >>> +messages to the vhost-pci driver via the controlq. Otherwise, the
> >>> +server creates a new vhost-pci device, and continues the following
> >> initialization steps.
> >>
> >>
> >> qemu-pid is not stable as the existing VM will be killed silently and
> >> the new vhost-pci driver reusing the same qemu-pid will ask to join
> >> before the vhost- device gets to know the previous one has gone.
> >
> > Would it be a normal and legal operation to silently kill a QEMU? I guess only
> the system admin can do that, right?
> 
> Yup, it is a valid operation as a problematic VM will be killed and as you said the
> admin can kill it silently, anyway, the design should have a way to handle this
> case properly.
> 
> >
> > If that's true, I think we can add a new field, "u64 tsc_of_birth" to the
> vhost_pci_socket_hdr structure. It records the tsc when the QEMU is created.
> 
> You only need to identify a VM, can use UUID instead?

Yes. The two methods are actually similar.

Best,
Wei

> 
> > If that's true, another problem would be the remove of the vhost-pci device
> for a silently killed driver VM.
> > The vhost-pci server may need to periodically send a checking message to
> check if the driver VM is silently killed. If that really happens, it should remove
> the related vhost-pci device.
> 
> Yes.
> 
> >
> >>> +
> >>> +The vhost-pci server adds up all the memory region size, and uses a
> >>> +64-bit device bar for the mapping of all the memory regions
> >>> +obtained from the socket message. To better support memory
> >>> +hot-plugging of the driver VM, the bar is configured with a double
> >>> +size of the driver VM's memory. The server maps the received memory
> >>> +info via the QEMU MemoryRegion mechanism, and then the new created
> >>> +vhost-pci device is
> >> hot-plugged to the VM.
> >>> +
> >>> +When the device status is updated with DRIVER_OK, a
> >>> +VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, which is stemed
> >> from the
> >>> +memory info socket message, is put on the controlq and a controlq
> >>> +interrupt is injected to the VM.
> >>> +
> >>> +When the vhost-pci server receives a
> >>> +VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE)
> >> acknowledgement from the
> >>> +driver, it sends a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE)
> >> message
> >>> +to the client that is identified by the ack_device_type and ack_device_id
> fields.
> >>> +
> >>> +When the vhost-pci server receives a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message, a
> >>> +VHOST_PCI_CONTROLQ_FEATURE_BITS(feature bits) message is put on
> the
> >>> +controlq and a controlq interrupt is injected to the VM.
> >>> +
> >>> +If the vhost-pci server notices that the driver fully accepted the
> >>> +offered feature bits, it sends a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the
> client.
> >> If
> >>> +the vhost-pci server notices that the vhost-pci driver only
> >>> +accepted a subset of the offered feature bits, it sends a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS(accepted feature bits) message back
> >>> +to the client. The client side virtio device re-negotiates the new
> >>> +feature bits with its driver, and sends back a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE)
> >>> +message to the server.
> >>> +
> >>> +Either when the vhost-pci driver fully accepted the offered feature
> >>> +bits or a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message is received
> >> from
> >>> +the client, the vhost-pci server puts a
> >>> +VHOST_PCI_CONTROLQ_UPDATE_DONE message on the controlq, and a
> >> controlq interrupt is injected to the VM.
> >>
> >> Why VHOST_PCI_CONTROLQ_UPDATE_DONE is needed?
> >
> > OK, this one looks redundant. We can set up the related support for that
> frontend device when the device info is received via the controlq.
> >
> 
> Great.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
@ 2016-06-02  8:43           ` Wang, Wei W
  0 siblings, 0 replies; 34+ messages in thread
From: Wang, Wei W @ 2016-06-02  8:43 UTC (permalink / raw)
  To: Xiao Guangrong, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On Thu 6/2/2016 11:52 AM,  Xiao Guangrong wrote:
> On 06/02/2016 11:15 AM, Wang, Wei W wrote:
> > On Wed 6/1/2016 4:15 PM, Xiao Guangrong wrote:
> >> On 05/29/2016 04:11 PM, Wei Wang wrote:
> >>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> >>> ---
> >>>    Details | 324
> >>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>    1 file changed, 324 insertions(+)
> >>>    create mode 100644 Details
> >>>
> >>> diff --git a/Details b/Details
> >>> new file mode 100644
> >>> index 0000000..4ea2252
> >>> --- /dev/null
> >>> +++ b/Details
> >>> @@ -0,0 +1,324 @@
> >>> +1 Device ID
> >>> +TBD
> >>> +
> >>> +2 Virtqueues
> >>> +0 controlq
> >>> +
> >>> +3 Feature Bits
> >>> +3.1 Local Feature Bits
> >>> +Currently no local feature bits are defined, so the standard virtio
> >>> +feature bits negation will always be successful and complete.
> >>> +
> >>> +3.2 Remote Feature Bits
> >>> +The remote feature bits are obtained from the frontend virtio
> >>> +device and negotiated with the vhost-pci driver via the controlq.
> >>> +The negotiation steps are described in 4.5 Device Initialization.
> >>> +
> >>> +4 Device Configuration Layout
> >>> +struct vhost_pci_config {
> >>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
> >>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
> >>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
> >>> +	u32 ack_type;
> >>> +	u32 ack_device_type;
> >>> +	u64 ack_device_id;
> >>> +	union {
> >>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
> >>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
> >>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
> >>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
> >>> +		u64 ack_memory_info;
> >>> +		u64 ack_device_info;
> >>> +		u64 ack_feature_bits;
> >>> +	};
> >>> +};
> >>
> >> Do you need to write all these 4 field to ack the operation? It seems
> >> it is not efficient and it is not flexible if the driver need to
> >> offer more data to the device in the further. Can we dedicate a vq
> >> for this purpose?
> >
> > Yes, all the 4 fields are required to be written. The vhost-pci server usually
> connects to multiple clients, and the "ack_device_type" and "ack_device_id"
> fields are used to identify them.
> >
> > Agree, another controlq for the guest->host direction looks better, and the
> above fileds can be converted to be the controlq message header.
> >
> 
> Thanks.
> 
> >>
> >> BTW, current approach can not handle the case if there are multiple
> >> same kind of requests in the control queue, e.g, if there are two
> >> memory-add request in the control queue.
> >
> > A vhost-pci device corresponds to a driver VM. The two memory-add requests
> on the controlq are all for the same driver VM.  Memory-add requests for
> different driver VMs  couldn’t be present on the same controlq. I haven’t seen
> the issue yet. Can you please explain more? Thanks.
> 
> The issue is caused by "The two memory-add requests on the controlq are all for
> the same driver VM", the driver need to ACK these request respectively, however,
> these two requests have the same ack_type, device_type, device_id,
> ack_memory_info, then QEMU is not able to figure out which request has been
> acked.

Normally pieces of memory info should be combined into one message (the structure includes multiple memory regions) and sent by the client. In a rare case like this: the driver VM hot-adds 1GB memory, followed by hot-adding another 1GB memory. The first piece of memory info is passed via the socket and controlq to the vhost-pci driver, then the second. Normally they won't get an opportunity to be put on the controlq at the same time. 
Even the implementation batches the controlq messages, there will be a sequence difference between the two messages on the controlq, right?

From the QEMU's (vhost-pci server) perspective, it just sends back an ACK to the client whenever it receives an ACK from the vhost-pci driver.
From the client's perspective, it will receive two ACK messages in this example.
Since the two have a sequence difference, the client should be able to distinguish the two (first sent, first acked), right?

Do you see a case where handling the first message is delayed and the second message is handled and ACK-ed first? 

> >
> >
> >>> +
> >>> +The configuration fields are currently used for the vhost-pci
> >>> +driver to acknowledge to the vhost-pci device after it receives controlq
> messages.
> >>> +
> >>> +4.5 Device Initialization
> >>> +When a device VM boots, it creates a vhost-pci server socket.
> >>> +
> >>> +When a virtio device on the driver VM is created with specifying
> >>> +the use of a vhost-pci device as a backend, a client socket is
> >>> +created and connected to the corresponding vhost-pci server for message
> exchanges.
> >>> +
> >>> +The messages passed to the vhost-pci server is proceeded by the
> >>> +following
> >>> +header:
> >>> +struct vhost_pci_socket_hdr {
> >>> +	#define VHOST_PCI_SOCKET_MEMORY_INFO 0
> >>> +	#define VHOST_PCI_SOCKET_MEMORY_INFO_ACK 1
> >>> +	#define VHOST_PCI_SOCKET_DEVICE_INFO 2
> >>> +	#define VHOST_PCI_SOCKET_DEVICE_INFO_ACK 3
> >>> +	#define VHOST_PCI_SOCKET_FEATURE_BITS 4
> >>> +	#define VHOST_PCI_SOCKET_FEATURE_BITS_ACK 5
> >>> +	u16 msg_type;
> >>> +	u16 msg_version;
> >>> +	u32 msg_len;
> >>> +	u64 qemu_pid;
> >>> +};
> >>> +
> >>> +The payload of the above message types can be constructed using the
> >>> +structures
> >>> +below:
> >>> +/* VHOST_PCI_SOCKET_MEMORY_INFO message */ struct
> >>> +vhost_pci_socket_memory_info {
> >>> +	#define VHOST_PCI_ADD_MEMORY 0
> >>> +	#define VHOST_PCI_DEL_MEMORY 1
> >>> +	u16 ops;
> >>> +	u32 nregions;
> >>> +	struct vhost_pci_memory_region {
> >>> +		int fd;
> >>> +		u64 guest_phys_addr;
> >>> +		u64 memory_size;
> >>> +		u64 mmap_offset;
> >>> +	} regions[VHOST_PCI_MAX_NREGIONS]; };
> >>> +
> >>> +/* VHOST_PCI_SOCKET_DEVICE_INFO message */ struct
> >>> +vhost_pci_device_info {
> >>> +	#define VHOST_PCI_ADD_FRONTEND_DEVICE 0
> >>> +	#define VHOST_PCI_DEL_FRONTEND_DEVICE 1
> >>> +	u16    ops;
> >>> +	u32    nvirtq;
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_NET 1
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_BLK 2
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_CONSOLE 3
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_ENTROPY 4
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_BALLOON 5
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_SCSI 8
> >>> +	u32    device_type;
> >>> +	u64    device_id;
> >>> +	struct virtq exotic_virtq[VHOST_PCI_MAX_NVIRTQ];
> >>> +};
> >>> +The device_id field identifies the device. For example, it can be
> >>> +used to store a MAC address if the device_type is
> >> VHOST_PCI_FRONTEND_DEVICE_NET.
> >>> +
> >>> +/* VHOST_PCI_SOCKET_FEATURE_BITS message*/ struct
> >>> +vhost_pci_feature_bits {
> >>> +	u64 feature_bits;
> >>> +};
> >>
> >> We not only have 'socket feature bits' but also the feature bits for
> >> per virtio device plugged in on the side of vhost-pci device.
> >
> > Yes. It is mentioned in "3 Feature Bits". The socket feature bits here are
> actually the remote feature bits (got from a socket message).
> 
> Hmm, there are two questions:
> 1) The device related info (e.g, device-type, device-id, etc) has not been included
>     there, so the vhost-pci driver can not write proper values to the fields of
>     vhost_pci_config.

Right. For the controlq feature bits messages, the data structure lacks the "device_type" and "device_id" fields. I will add them.

I think "device_type" and "device_id" don’t need to be included in the socket feature bits message structure:
each frontend virtio device has its own connection to the vhost-pci server, that is, the connection itself has actually already been associated with a "device_type+device_id" (the server should maintain a table for such a relationship after a device info socket message is received). 

> 2) different virtio device may have different feature bits,
> VHOST_PCI_SOCKET_FEATURE_BITS
>     and VHOST_PCI_CONTROLQ_FEATURE_BITS should be able to negotiate the
> feature bits for
>     different virtio device.

Then, I think this shouldn’t be a problem after "device_type" and "device_id" are added to the controlq feature bits message structure.

> >
> >>
> >> E.g: if there are two virtio devices (e.g, a NIC and BLK) both of
> >> them need to directly communicate with another VM. The feature bits
> >> of these two devices need to be negotiated with that VM respectively.
> >> And you can not put these feature bits in vhost_pci_device_info struct as its
> vq is not created at that time.
> >
> > Right. If you check the initialization steps below, there is a statement "When
> the device status is updated with DRIVER_OK".
> >
> >>> +
> >>> +/* VHOST_PCI_SOCKET_xx_ACK messages */ struct vhost_pci_socket_ack {
> >>> +	#define VHOST_PCI_SOCKET_ACK_ADD_DONE 0
> >>> +	#define VHOST_PCI_SOCKET_ACK_ADD_FAIL 1
> >>> +	#define VHOST_PCI_SOCKET_ACK_DEL_DONE 2
> >>> +	#define VHOST_PCI_SOCKET_ACK_DEL_FAIL 3
> >>> +	u64 ack;
> >>> +};
> >>> +
> >>> +The driver update message passed via the controlq is preceded by
> >>> +the following
> >>> +header:
> >>> +struct vhost_pci_controlq_hdr {
> >>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO 0
> >>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO 1
> >>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS 2
> >>> +	#define VHOST_PCI_CONTROLQ_UPDATE_DONE 3
> >>> +	u16 msg_type;
> >>> +	u16 msg_version;
> >>> +	u32 msg_len;
> >>> +};
> >>> +
> >>> +The payload of a VHOST_PCI_CONTROLQ_MEMORY_INFO message can be
> >>> +constructed using the following structure:
> >>> +/* VHOST_PCI_CONTROLQ_MEMORY_INFO message */ struct
> >>> +vhost_pci_controlq_memory_info {
> >>> +	#define VHOST_PCI_ADD_MEMORY 0
> >>> +	#define VHOST_PCI_DEL_MEMORY 1
> >>> +	u16  ops;
> >>> +	u32 nregion;
> >>> +	struct exotic_memory_region {
> >>> +		u64   region_base_xgpa;
> >>> +		u64   size;
> >>> +		u64   offset_in_bar_area;
> >>> +	} region[VHOST_PCI_MAX_NREGIONS];
> >>> +};
> >>> +
> >>> +The payload of VHOST_PCI_CONTROLQ_DEVICE_INFO and
> >>> +VHOST_PCI_CONTROLQ_FEATURE_BITS messages can be constructed
> using
> >> the
> >>> +vhost_pci_device_info structure and the vhost_pci_feature_bits
> >>> +structure respectively.
> >>> +
> >>> +The payload of a VHOST_PCI_CONTROLQ_UPDATE_DONE message can be
> >>> +constructed using the structure below:
> >>> +struct vhost_pci_controlq_update_done {
> >>> +	u32    device_type;
> >>> +	u64    device_id;
> >>> +};
> >>> +
> >>> +Fig. 1 shows the initialization steps.
> >>> +
> >>> +When the vhost-pci server receives a
> >>> +VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message, it checks if a vhost-
> pci
> >>> +device has been created for the requesting VM whose QEMU process id
> >>> +is qemu_pid. If yes, it will simply update the subsequent received
> >>> +messages to the vhost-pci driver via the controlq. Otherwise, the
> >>> +server creates a new vhost-pci device, and continues the following
> >> initialization steps.
> >>
> >>
> >> qemu-pid is not stable as the existing VM will be killed silently and
> >> the new vhost-pci driver reusing the same qemu-pid will ask to join
> >> before the vhost- device gets to know the previous one has gone.
> >
> > Would it be a normal and legal operation to silently kill a QEMU? I guess only
> the system admin can do that, right?
> 
> Yup, it is a valid operation as a problematic VM will be killed and as you said the
> admin can kill it silently, anyway, the design should have a way to handle this
> case properly.
> 
> >
> > If that's true, I think we can add a new field, "u64 tsc_of_birth" to the
> vhost_pci_socket_hdr structure. It records the tsc when the QEMU is created.
> 
> You only need to identify a VM, can use UUID instead?

Yes. The two methods are actually similar.

Best,
Wei

> 
> > If that's true, another problem would be the remove of the vhost-pci device
> for a silently killed driver VM.
> > The vhost-pci server may need to periodically send a checking message to
> check if the driver VM is silently killed. If that really happens, it should remove
> the related vhost-pci device.
> 
> Yes.
> 
> >
> >>> +
> >>> +The vhost-pci server adds up all the memory region size, and uses a
> >>> +64-bit device bar for the mapping of all the memory regions
> >>> +obtained from the socket message. To better support memory
> >>> +hot-plugging of the driver VM, the bar is configured with a double
> >>> +size of the driver VM's memory. The server maps the received memory
> >>> +info via the QEMU MemoryRegion mechanism, and then the new created
> >>> +vhost-pci device is
> >> hot-plugged to the VM.
> >>> +
> >>> +When the device status is updated with DRIVER_OK, a
> >>> +VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, which is stemed
> >> from the
> >>> +memory info socket message, is put on the controlq and a controlq
> >>> +interrupt is injected to the VM.
> >>> +
> >>> +When the vhost-pci server receives a
> >>> +VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE)
> >> acknowledgement from the
> >>> +driver, it sends a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE)
> >> message
> >>> +to the client that is identified by the ack_device_type and ack_device_id
> fields.
> >>> +
> >>> +When the vhost-pci server receives a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message, a
> >>> +VHOST_PCI_CONTROLQ_FEATURE_BITS(feature bits) message is put on
> the
> >>> +controlq and a controlq interrupt is injected to the VM.
> >>> +
> >>> +If the vhost-pci server notices that the driver fully accepted the
> >>> +offered feature bits, it sends a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the
> client.
> >> If
> >>> +the vhost-pci server notices that the vhost-pci driver only
> >>> +accepted a subset of the offered feature bits, it sends a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS(accepted feature bits) message back
> >>> +to the client. The client side virtio device re-negotiates the new
> >>> +feature bits with its driver, and sends back a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE)
> >>> +message to the server.
> >>> +
> >>> +Either when the vhost-pci driver fully accepted the offered feature
> >>> +bits or a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message is received
> >> from
> >>> +the client, the vhost-pci server puts a
> >>> +VHOST_PCI_CONTROLQ_UPDATE_DONE message on the controlq, and a
> >> controlq interrupt is injected to the VM.
> >>
> >> Why VHOST_PCI_CONTROLQ_UPDATE_DONE is needed?
> >
> > OK, this one looks redundant. We can set up the related support for that
> frontend device when the device info is received via the controlq.
> >
> 
> Great.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [virtio-comment] [PATCH 5/6 Resend] Vhost-pci RFC: Future Security Enhancement
  2016-05-31  8:00       ` [Qemu-devel] " Wang, Wei W
@ 2016-06-02  9:27         ` Jan Kiszka
  -1 siblings, 0 replies; 34+ messages in thread
From: Jan Kiszka @ 2016-06-02  9:27 UTC (permalink / raw)
  To: Wang, Wei W, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On 2016-05-31 10:00, Wang, Wei W wrote:
> On Mon 5/30/2016 2:24 PM, Jan Kiszka Wrote:
>> On 2016-05-29 10:11, Wei Wang wrote:
>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>> ---
>>>  FutureWorks | 21 +++++++++++++++++++++
>>>  1 file changed, 21 insertions(+)
>>>  create mode 100644 FutureWorks
>>>
>>> diff --git a/FutureWorks b/FutureWorks new file mode 100644 index
>>> 0000000..210edcd
>>> --- /dev/null
>>> +++ b/FutureWorks
>>> @@ -0,0 +1,21 @@
>>> +The vhost-pci design is currently suitable for a group of VMs who
>>> +trust each other. To extend it to a more general use case, two
>>> +security features can be added in the future.
>>
>> Sounds a bit like security is just "nice to have" in the foreseen use cases of this
>> mechanism. Is that really true?
> 
> Not really. It's usually a tradeoff between performance and security, so I think having security doesn't always mean "Nice" :-)

I don't disagree. I'm rather wondering if the variant without isolation
has valid use-cases at all.

> 
> Instead of proposing a compromised solution, we can actually offer two independent solutions, performance oriented vhost-pci (let's call it fast vhost-pci) and security oriented vhost-pci (say, secure vhost-pci). It's up to the users to choose which one to use according to their use cases. So, the secured version of vhost-pci can be viewed as another option for users (not a replacement of this proposal).
> 
> Here is a use example:
> There are two groups of VMs running on the same host machine. The frequent inter-VM communication between VMs in Group A can choose the fast vhost-pci mechanism. In a special case that a VM from Group A needs to communicate with a VM from Group B, they should set up a new NIC each and specify the use of the secure vhost-pci. 
> Since the secure vhost-pci is on our future plan, the traditional vhost-user can be an option for that inter-Group communication currently.
> 
>>> +
>>> +1 vIOMMU
>>> +vIOMMU provides the driver VM with the ability to restrict the device
>>> +VM to transiently access a specified portion of its memory. The
>>> +vhost-pci design proposed in this RFC can be extended to access the
>>> +driver VM's memory with vIOMMU. Precisely, the vIOMMU engine in the
>>> +driver VM configures access permissions (R/W) for the vhost-pci
>>> +device to access its memory. More details can be found at
>>> +https://wiki.opnfv.org/display/kvm/Vm2vm+Mst and
>>> +https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg03993.html
>>
>> Do you have performance estimates on this approach already?
>>
>> One challenge should be how to let the VMs reuse existing buffer mappings so
>> that the vIOMMU isn't continuously reprogrammed - which is likely not very
>> efficient.
> 
> I think one option here is to reserve a large block of GPA area (like a memory pool). The buffers are allocated from and freed to the pool.

That's basically communication via shared memory regions (like ivshmem).
We'll do this for safety and security critical setups where we need to
keep the hypervisor complexity minimal (we cannot effort a vIOMMU
implementation, e.g.). Possibly, there is some room for sharing solution
details on the guest side here.

> 
> Another one would be using batching. For example, set up a batch of 32 buffers (just give the starting guest physical address, and the 32 buffers are guest-physically continuous) each time.
> 
>> The other is how to hand over packets/buffers in a chain of multiple VMs. Ideally,
>> there is already a hand-over from sender to the first receiver so that the sender
>> can no longer mess with the packet after the receiver started processing it.
>> However, that will work against efficiency.
>>
>> Essentially, it's the old IPC question of remap vs. copy here. The rest is "just"
>> interfaces to exploit this elegantly.
> 
> There are several ways to do a remapping. The remapping we are using here is to have the entire driver VM's memory mapped by the device VM, that is, the driver VM's memory is completely shared with the device VM. 

I know - that's what I would call "without isolation".

> 
> If I understand that old remapping based IPC problem correctly, this kind of remapping requires a high degree of coordination between the two parts (i.e. the device VM and the driver VM). I think "virtq" is right the coordination mechanism here - the device VM grabs and fills a buffer from the available ring and puts the filled buffer to the used ring,  so I don't think the sender would mess with the packet after the receiver started to process it.
> Please point out if I didn't get your point correctly. Thanks.

If you look at kdbus, e.g., they have some threshold value for handing
over messages from sender to receiver via changes to the page tables
(removal from sender side, addition to reader side - they don't rely on
both sides being "nice" with each other) vs. simply copying the data
between both address spaces. The costly paging changes - and it doesn't
matter of they affect a process or a VM - apparently only pay off if the
data blocks are large enough. A typical network packet is way below that
threshold (a few 100K on x86 IIRC).

> 
>>
>>> +
>>> +2 eptp switching
>>> +The idea of eptp swithing allows a vhost-pci device driver to access
>>> +the mapped driver VM's memory in an alternative view, where only a
>>> +piece of trusted code can access the driver VM's memory. More details
>>> +can be found at
>>> +http://events.linuxfoundation.org/sites/events/files/slides/
>>> +Jun_Nakajima_NFV_KVM%202015_final.pdf
>>
>> As we learned a while back, this one is not really secure. Any updates on if/how
>> this is going to be fixed?
>>
> 
> Right, that is why we claimed it as a protection mechanism. However, one option we are trying is to fix the related holes from the hardware. We can give updates once it's ready.

Thanks, looking forward.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [virtio-comment] [PATCH 5/6 Resend] Vhost-pci RFC: Future Security Enhancement
@ 2016-06-02  9:27         ` Jan Kiszka
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Kiszka @ 2016-06-02  9:27 UTC (permalink / raw)
  To: Wang, Wei W, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On 2016-05-31 10:00, Wang, Wei W wrote:
> On Mon 5/30/2016 2:24 PM, Jan Kiszka Wrote:
>> On 2016-05-29 10:11, Wei Wang wrote:
>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>> ---
>>>  FutureWorks | 21 +++++++++++++++++++++
>>>  1 file changed, 21 insertions(+)
>>>  create mode 100644 FutureWorks
>>>
>>> diff --git a/FutureWorks b/FutureWorks new file mode 100644 index
>>> 0000000..210edcd
>>> --- /dev/null
>>> +++ b/FutureWorks
>>> @@ -0,0 +1,21 @@
>>> +The vhost-pci design is currently suitable for a group of VMs who
>>> +trust each other. To extend it to a more general use case, two
>>> +security features can be added in the future.
>>
>> Sounds a bit like security is just "nice to have" in the foreseen use cases of this
>> mechanism. Is that really true?
> 
> Not really. It's usually a tradeoff between performance and security, so I think having security doesn't always mean "Nice" :-)

I don't disagree. I'm rather wondering if the variant without isolation
has valid use-cases at all.

> 
> Instead of proposing a compromised solution, we can actually offer two independent solutions, performance oriented vhost-pci (let's call it fast vhost-pci) and security oriented vhost-pci (say, secure vhost-pci). It's up to the users to choose which one to use according to their use cases. So, the secured version of vhost-pci can be viewed as another option for users (not a replacement of this proposal).
> 
> Here is a use example:
> There are two groups of VMs running on the same host machine. The frequent inter-VM communication between VMs in Group A can choose the fast vhost-pci mechanism. In a special case that a VM from Group A needs to communicate with a VM from Group B, they should set up a new NIC each and specify the use of the secure vhost-pci. 
> Since the secure vhost-pci is on our future plan, the traditional vhost-user can be an option for that inter-Group communication currently.
> 
>>> +
>>> +1 vIOMMU
>>> +vIOMMU provides the driver VM with the ability to restrict the device
>>> +VM to transiently access a specified portion of its memory. The
>>> +vhost-pci design proposed in this RFC can be extended to access the
>>> +driver VM's memory with vIOMMU. Precisely, the vIOMMU engine in the
>>> +driver VM configures access permissions (R/W) for the vhost-pci
>>> +device to access its memory. More details can be found at
>>> +https://wiki.opnfv.org/display/kvm/Vm2vm+Mst and
>>> +https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg03993.html
>>
>> Do you have performance estimates on this approach already?
>>
>> One challenge should be how to let the VMs reuse existing buffer mappings so
>> that the vIOMMU isn't continuously reprogrammed - which is likely not very
>> efficient.
> 
> I think one option here is to reserve a large block of GPA area (like a memory pool). The buffers are allocated from and freed to the pool.

That's basically communication via shared memory regions (like ivshmem).
We'll do this for safety and security critical setups where we need to
keep the hypervisor complexity minimal (we cannot effort a vIOMMU
implementation, e.g.). Possibly, there is some room for sharing solution
details on the guest side here.

> 
> Another one would be using batching. For example, set up a batch of 32 buffers (just give the starting guest physical address, and the 32 buffers are guest-physically continuous) each time.
> 
>> The other is how to hand over packets/buffers in a chain of multiple VMs. Ideally,
>> there is already a hand-over from sender to the first receiver so that the sender
>> can no longer mess with the packet after the receiver started processing it.
>> However, that will work against efficiency.
>>
>> Essentially, it's the old IPC question of remap vs. copy here. The rest is "just"
>> interfaces to exploit this elegantly.
> 
> There are several ways to do a remapping. The remapping we are using here is to have the entire driver VM's memory mapped by the device VM, that is, the driver VM's memory is completely shared with the device VM. 

I know - that's what I would call "without isolation".

> 
> If I understand that old remapping based IPC problem correctly, this kind of remapping requires a high degree of coordination between the two parts (i.e. the device VM and the driver VM). I think "virtq" is right the coordination mechanism here - the device VM grabs and fills a buffer from the available ring and puts the filled buffer to the used ring,  so I don't think the sender would mess with the packet after the receiver started to process it.
> Please point out if I didn't get your point correctly. Thanks.

If you look at kdbus, e.g., they have some threshold value for handing
over messages from sender to receiver via changes to the page tables
(removal from sender side, addition to reader side - they don't rely on
both sides being "nice" with each other) vs. simply copying the data
between both address spaces. The costly paging changes - and it doesn't
matter of they affect a process or a VM - apparently only pay off if the
data blocks are large enough. A typical network packet is way below that
threshold (a few 100K on x86 IIRC).

> 
>>
>>> +
>>> +2 eptp switching
>>> +The idea of eptp swithing allows a vhost-pci device driver to access
>>> +the mapped driver VM's memory in an alternative view, where only a
>>> +piece of trusted code can access the driver VM's memory. More details
>>> +can be found at
>>> +http://events.linuxfoundation.org/sites/events/files/slides/
>>> +Jun_Nakajima_NFV_KVM%202015_final.pdf
>>
>> As we learned a while back, this one is not really secure. Any updates on if/how
>> this is going to be fixed?
>>
> 
> Right, that is why we claimed it as a protection mechanism. However, one option we are trying is to fix the related holes from the hardware. We can give updates once it's ready.

Thanks, looking forward.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
  2016-06-02  8:43           ` [Qemu-devel] " Wang, Wei W
@ 2016-06-02 11:13             ` Xiao Guangrong
  -1 siblings, 0 replies; 34+ messages in thread
From: Xiao Guangrong @ 2016-06-02 11:13 UTC (permalink / raw)
  To: Wang, Wei W, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini



On 06/02/2016 04:43 PM, Wang, Wei W wrote:
> On Thu 6/2/2016 11:52 AM,  Xiao Guangrong wrote:
>> On 06/02/2016 11:15 AM, Wang, Wei W wrote:
>>> On Wed 6/1/2016 4:15 PM, Xiao Guangrong wrote:
>>>> On 05/29/2016 04:11 PM, Wei Wang wrote:
>>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>>>> ---
>>>>>     Details | 324
>>>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>     1 file changed, 324 insertions(+)
>>>>>     create mode 100644 Details
>>>>>
>>>>> diff --git a/Details b/Details
>>>>> new file mode 100644
>>>>> index 0000000..4ea2252
>>>>> --- /dev/null
>>>>> +++ b/Details
>>>>> @@ -0,0 +1,324 @@
>>>>> +1 Device ID
>>>>> +TBD
>>>>> +
>>>>> +2 Virtqueues
>>>>> +0 controlq
>>>>> +
>>>>> +3 Feature Bits
>>>>> +3.1 Local Feature Bits
>>>>> +Currently no local feature bits are defined, so the standard virtio
>>>>> +feature bits negation will always be successful and complete.
>>>>> +
>>>>> +3.2 Remote Feature Bits
>>>>> +The remote feature bits are obtained from the frontend virtio
>>>>> +device and negotiated with the vhost-pci driver via the controlq.
>>>>> +The negotiation steps are described in 4.5 Device Initialization.
>>>>> +
>>>>> +4 Device Configuration Layout
>>>>> +struct vhost_pci_config {
>>>>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
>>>>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
>>>>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
>>>>> +	u32 ack_type;
>>>>> +	u32 ack_device_type;
>>>>> +	u64 ack_device_id;
>>>>> +	union {
>>>>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
>>>>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
>>>>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
>>>>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
>>>>> +		u64 ack_memory_info;
>>>>> +		u64 ack_device_info;
>>>>> +		u64 ack_feature_bits;
>>>>> +	};
>>>>> +};
>>>>
>>>> Do you need to write all these 4 field to ack the operation? It seems
>>>> it is not efficient and it is not flexible if the driver need to
>>>> offer more data to the device in the further. Can we dedicate a vq
>>>> for this purpose?
>>>
>>> Yes, all the 4 fields are required to be written. The vhost-pci server usually
>> connects to multiple clients, and the "ack_device_type" and "ack_device_id"
>> fields are used to identify them.
>>>
>>> Agree, another controlq for the guest->host direction looks better, and the
>> above fileds can be converted to be the controlq message header.
>>>
>>
>> Thanks.
>>
>>>>
>>>> BTW, current approach can not handle the case if there are multiple
>>>> same kind of requests in the control queue, e.g, if there are two
>>>> memory-add request in the control queue.
>>>
>>> A vhost-pci device corresponds to a driver VM. The two memory-add requests
>> on the controlq are all for the same driver VM.  Memory-add requests for
>> different driver VMs  couldn’t be present on the same controlq. I haven’t seen
>> the issue yet. Can you please explain more? Thanks.
>>
>> The issue is caused by "The two memory-add requests on the controlq are all for
>> the same driver VM", the driver need to ACK these request respectively, however,
>> these two requests have the same ack_type, device_type, device_id,
>> ack_memory_info, then QEMU is not able to figure out which request has been
>> acked.
>
> Normally pieces of memory info should be combined into one message (the structure includes multiple memory regions) and sent by the client. In a rare case like this: the driver VM hot-adds 1GB memory, followed by hot-adding another 1GB memory. The first piece of memory info is passed via the socket and controlq to the vhost-pci driver, then the second. Normally they won't get an opportunity to be put on the controlq at the same time.
> Even the implementation batches the controlq messages, there will be a sequence difference between the two messages on the controlq, right?

That assumes the driver should serially handle the control messages...

>
>  From the QEMU's (vhost-pci server) perspective, it just sends back an ACK to the client whenever it receives an ACK from the vhost-pci driver.
>  From the client's perspective, it will receive two ACK messages in this example.
> Since the two have a sequence difference, the client should be able to distinguish the two (first sent, first acked), right?

That assumes that the vhost-pci server and remote virtio device should use
serial mode too.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
@ 2016-06-02 11:13             ` Xiao Guangrong
  0 siblings, 0 replies; 34+ messages in thread
From: Xiao Guangrong @ 2016-06-02 11:13 UTC (permalink / raw)
  To: Wang, Wei W, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini



On 06/02/2016 04:43 PM, Wang, Wei W wrote:
> On Thu 6/2/2016 11:52 AM,  Xiao Guangrong wrote:
>> On 06/02/2016 11:15 AM, Wang, Wei W wrote:
>>> On Wed 6/1/2016 4:15 PM, Xiao Guangrong wrote:
>>>> On 05/29/2016 04:11 PM, Wei Wang wrote:
>>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>>>>> ---
>>>>>     Details | 324
>>>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>     1 file changed, 324 insertions(+)
>>>>>     create mode 100644 Details
>>>>>
>>>>> diff --git a/Details b/Details
>>>>> new file mode 100644
>>>>> index 0000000..4ea2252
>>>>> --- /dev/null
>>>>> +++ b/Details
>>>>> @@ -0,0 +1,324 @@
>>>>> +1 Device ID
>>>>> +TBD
>>>>> +
>>>>> +2 Virtqueues
>>>>> +0 controlq
>>>>> +
>>>>> +3 Feature Bits
>>>>> +3.1 Local Feature Bits
>>>>> +Currently no local feature bits are defined, so the standard virtio
>>>>> +feature bits negation will always be successful and complete.
>>>>> +
>>>>> +3.2 Remote Feature Bits
>>>>> +The remote feature bits are obtained from the frontend virtio
>>>>> +device and negotiated with the vhost-pci driver via the controlq.
>>>>> +The negotiation steps are described in 4.5 Device Initialization.
>>>>> +
>>>>> +4 Device Configuration Layout
>>>>> +struct vhost_pci_config {
>>>>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
>>>>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
>>>>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
>>>>> +	u32 ack_type;
>>>>> +	u32 ack_device_type;
>>>>> +	u64 ack_device_id;
>>>>> +	union {
>>>>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
>>>>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
>>>>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
>>>>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
>>>>> +		u64 ack_memory_info;
>>>>> +		u64 ack_device_info;
>>>>> +		u64 ack_feature_bits;
>>>>> +	};
>>>>> +};
>>>>
>>>> Do you need to write all these 4 field to ack the operation? It seems
>>>> it is not efficient and it is not flexible if the driver need to
>>>> offer more data to the device in the further. Can we dedicate a vq
>>>> for this purpose?
>>>
>>> Yes, all the 4 fields are required to be written. The vhost-pci server usually
>> connects to multiple clients, and the "ack_device_type" and "ack_device_id"
>> fields are used to identify them.
>>>
>>> Agree, another controlq for the guest->host direction looks better, and the
>> above fileds can be converted to be the controlq message header.
>>>
>>
>> Thanks.
>>
>>>>
>>>> BTW, current approach can not handle the case if there are multiple
>>>> same kind of requests in the control queue, e.g, if there are two
>>>> memory-add request in the control queue.
>>>
>>> A vhost-pci device corresponds to a driver VM. The two memory-add requests
>> on the controlq are all for the same driver VM.  Memory-add requests for
>> different driver VMs  couldn’t be present on the same controlq. I haven’t seen
>> the issue yet. Can you please explain more? Thanks.
>>
>> The issue is caused by "The two memory-add requests on the controlq are all for
>> the same driver VM", the driver need to ACK these request respectively, however,
>> these two requests have the same ack_type, device_type, device_id,
>> ack_memory_info, then QEMU is not able to figure out which request has been
>> acked.
>
> Normally pieces of memory info should be combined into one message (the structure includes multiple memory regions) and sent by the client. In a rare case like this: the driver VM hot-adds 1GB memory, followed by hot-adding another 1GB memory. The first piece of memory info is passed via the socket and controlq to the vhost-pci driver, then the second. Normally they won't get an opportunity to be put on the controlq at the same time.
> Even the implementation batches the controlq messages, there will be a sequence difference between the two messages on the controlq, right?

That assumes the driver should serially handle the control messages...

>
>  From the QEMU's (vhost-pci server) perspective, it just sends back an ACK to the client whenever it receives an ACK from the vhost-pci driver.
>  From the client's perspective, it will receive two ACK messages in this example.
> Since the two have a sequence difference, the client should be able to distinguish the two (first sent, first acked), right?

That assumes that the vhost-pci server and remote virtio device should use
serial mode too.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [virtio-comment] [PATCH 5/6 Resend] Vhost-pci RFC: Future Security Enhancement
  2016-06-02  9:27         ` [Qemu-devel] " Jan Kiszka
@ 2016-06-03  5:54           ` Wang, Wei W
  -1 siblings, 0 replies; 34+ messages in thread
From: Wang, Wei W @ 2016-06-03  5:54 UTC (permalink / raw)
  To: 'Jan Kiszka',
	kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On Thu 6/2/2016 5:27 PM, Jan Kiszka wrote:
> On 2016-05-31 10:00, Wang, Wei W wrote:
> > On Mon 5/30/2016 2:24 PM, Jan Kiszka Wrote:
> >> On 2016-05-29 10:11, Wei Wang wrote:
> >>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> >>> ---
> >>>  FutureWorks | 21 +++++++++++++++++++++
> >>>  1 file changed, 21 insertions(+)
> >>>  create mode 100644 FutureWorks
> >>>
> >>> diff --git a/FutureWorks b/FutureWorks new file mode 100644 index
> >>> 0000000..210edcd
> >>> --- /dev/null
> >>> +++ b/FutureWorks
> >>> @@ -0,0 +1,21 @@
> >>> +The vhost-pci design is currently suitable for a group of VMs who
> >>> +trust each other. To extend it to a more general use case, two
> >>> +security features can be added in the future.
> >>
> >> Sounds a bit like security is just "nice to have" in the foreseen use
> >> cases of this mechanism. Is that really true?
> >
> > Not really. It's usually a tradeoff between performance and security,
> > so I think having security doesn't always mean "Nice" :-)
> 
> I don't disagree. I'm rather wondering if the variant without isolation has valid
> use-cases at all.


I think one of the use examples is Network Function Virtualization. A group of 
Network Function VMs are chained together. AFAIK, compared to isolation,
performance is more important to them.


> > Instead of proposing a compromised solution, we can actually offer two
> independent solutions, performance oriented vhost-pci (let's call it fast vhost-pci)
> and security oriented vhost-pci (say, secure vhost-pci). It's up to the users to
> choose which one to use according to their use cases. So, the secured version of
> vhost-pci can be viewed as another option for users (not a replacement of this
> proposal).
> >
> > Here is a use example:
> > There are two groups of VMs running on the same host machine. The frequent
> inter-VM communication between VMs in Group A can choose the fast vhost-pci
> mechanism. In a special case that a VM from Group A needs to communicate
> with a VM from Group B, they should set up a new NIC each and specify the use
> of the secure vhost-pci.
> > Since the secure vhost-pci is on our future plan, the traditional vhost-user can
> be an option for that inter-Group communication currently.
> >
> >>> +
> >>> +1 vIOMMU
> >>> +vIOMMU provides the driver VM with the ability to restrict the
> >>> +device VM to transiently access a specified portion of its memory.
> >>> +The vhost-pci design proposed in this RFC can be extended to access
> >>> +the driver VM's memory with vIOMMU. Precisely, the vIOMMU engine in
> >>> +the driver VM configures access permissions (R/W) for the vhost-pci
> >>> +device to access its memory. More details can be found at
> >>> +https://wiki.opnfv.org/display/kvm/Vm2vm+Mst and
> >>> +https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg03993.html
> >>
> >> Do you have performance estimates on this approach already?
> >>
> >> One challenge should be how to let the VMs reuse existing buffer
> >> mappings so that the vIOMMU isn't continuously reprogrammed - which
> >> is likely not very efficient.
> >
> > I think one option here is to reserve a large block of GPA area (like a memory
> pool). The buffers are allocated from and freed to the pool.
> 
> That's basically communication via shared memory regions (like ivshmem).
> We'll do this for safety and security critical setups where we need to keep the
> hypervisor complexity minimal (we cannot effort a vIOMMU implementation,
> e.g.). Possibly, there is some room for sharing solution details on the guest side
> here.

I think people from redhat have started digging into the guest side vIOMMU. 

> >
> > Another one would be using batching. For example, set up a batch of 32
> buffers (just give the starting guest physical address, and the 32 buffers are
> guest-physically continuous) each time.
> >
> >> The other is how to hand over packets/buffers in a chain of multiple
> >> VMs. Ideally, there is already a hand-over from sender to the first
> >> receiver so that the sender can no longer mess with the packet after the
> receiver started processing it.
> >> However, that will work against efficiency.
> >>
> >> Essentially, it's the old IPC question of remap vs. copy here. The rest is "just"
> >> interfaces to exploit this elegantly.
> >
> > There are several ways to do a remapping. The remapping we are using here is
> to have the entire driver VM's memory mapped by the device VM, that is, the
> driver VM's memory is completely shared with the device VM.
> 
> I know - that's what I would call "without isolation".
> 
> >
> > If I understand that old remapping based IPC problem correctly, this kind of
> remapping requires a high degree of coordination between the two parts (i.e.
> the device VM and the driver VM). I think "virtq" is right the coordination
> mechanism here - the device VM grabs and fills a buffer from the available ring
> and puts the filled buffer to the used ring,  so I don't think the sender would
> mess with the packet after the receiver started to process it.
> > Please point out if I didn't get your point correctly. Thanks.
> 
> If you look at kdbus, e.g., they have some threshold value for handing over
> messages from sender to receiver via changes to the page tables (removal from
> sender side, addition to reader side - they don't rely on both sides being "nice"
> with each other) vs. simply copying the data between both address spaces. The
> costly paging changes - and it doesn't matter of they affect a process or a VM -
> apparently only pay off if the data blocks are large enough. A typical network
> packet is way below that threshold (a few 100K on x86 IIRC).

I think it's different. This method is like page flipping, which frequently modifies
both the sender and receiver's pages tables. It is costly because frequently maintaining 
complete TLB consistency across all the CPUs are expensive. I think that's why they
use a threshold to avoid TLB thrashing.

The mapping setup in vhost-pci is a one-time thing (we can ignore the memory hotplug
cases in the discussion, as that happens relatively rare). We don't need that frequent 
page table modifications to transfer data, so I think we also don't have that limitation.


Best,
Wei
> 
> >
> >>
> >>> +
> >>> +2 eptp switching
> >>> +The idea of eptp swithing allows a vhost-pci device driver to
> >>> +access the mapped driver VM's memory in an alternative view, where
> >>> +only a piece of trusted code can access the driver VM's memory.
> >>> +More details can be found at
> >>> +http://events.linuxfoundation.org/sites/events/files/slides/
> >>> +Jun_Nakajima_NFV_KVM%202015_final.pdf
> >>
> >> As we learned a while back, this one is not really secure. Any
> >> updates on if/how this is going to be fixed?
> >>
> >
> > Right, that is why we claimed it as a protection mechanism. However, one
> option we are trying is to fix the related holes from the hardware. We can give
> updates once it's ready.
> 
> Thanks, looking forward.
> 
> Jan
> 
> --
> Siemens AG, Corporate Technology, CT RDA ITP SES-DE Corporate Competence
> Center Embedded Linux

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [virtio-comment] [PATCH 5/6 Resend] Vhost-pci RFC: Future Security Enhancement
@ 2016-06-03  5:54           ` Wang, Wei W
  0 siblings, 0 replies; 34+ messages in thread
From: Wang, Wei W @ 2016-06-03  5:54 UTC (permalink / raw)
  To: 'Jan Kiszka',
	kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On Thu 6/2/2016 5:27 PM, Jan Kiszka wrote:
> On 2016-05-31 10:00, Wang, Wei W wrote:
> > On Mon 5/30/2016 2:24 PM, Jan Kiszka Wrote:
> >> On 2016-05-29 10:11, Wei Wang wrote:
> >>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> >>> ---
> >>>  FutureWorks | 21 +++++++++++++++++++++
> >>>  1 file changed, 21 insertions(+)
> >>>  create mode 100644 FutureWorks
> >>>
> >>> diff --git a/FutureWorks b/FutureWorks new file mode 100644 index
> >>> 0000000..210edcd
> >>> --- /dev/null
> >>> +++ b/FutureWorks
> >>> @@ -0,0 +1,21 @@
> >>> +The vhost-pci design is currently suitable for a group of VMs who
> >>> +trust each other. To extend it to a more general use case, two
> >>> +security features can be added in the future.
> >>
> >> Sounds a bit like security is just "nice to have" in the foreseen use
> >> cases of this mechanism. Is that really true?
> >
> > Not really. It's usually a tradeoff between performance and security,
> > so I think having security doesn't always mean "Nice" :-)
> 
> I don't disagree. I'm rather wondering if the variant without isolation has valid
> use-cases at all.


I think one of the use examples is Network Function Virtualization. A group of 
Network Function VMs are chained together. AFAIK, compared to isolation,
performance is more important to them.


> > Instead of proposing a compromised solution, we can actually offer two
> independent solutions, performance oriented vhost-pci (let's call it fast vhost-pci)
> and security oriented vhost-pci (say, secure vhost-pci). It's up to the users to
> choose which one to use according to their use cases. So, the secured version of
> vhost-pci can be viewed as another option for users (not a replacement of this
> proposal).
> >
> > Here is a use example:
> > There are two groups of VMs running on the same host machine. The frequent
> inter-VM communication between VMs in Group A can choose the fast vhost-pci
> mechanism. In a special case that a VM from Group A needs to communicate
> with a VM from Group B, they should set up a new NIC each and specify the use
> of the secure vhost-pci.
> > Since the secure vhost-pci is on our future plan, the traditional vhost-user can
> be an option for that inter-Group communication currently.
> >
> >>> +
> >>> +1 vIOMMU
> >>> +vIOMMU provides the driver VM with the ability to restrict the
> >>> +device VM to transiently access a specified portion of its memory.
> >>> +The vhost-pci design proposed in this RFC can be extended to access
> >>> +the driver VM's memory with vIOMMU. Precisely, the vIOMMU engine in
> >>> +the driver VM configures access permissions (R/W) for the vhost-pci
> >>> +device to access its memory. More details can be found at
> >>> +https://wiki.opnfv.org/display/kvm/Vm2vm+Mst and
> >>> +https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg03993.html
> >>
> >> Do you have performance estimates on this approach already?
> >>
> >> One challenge should be how to let the VMs reuse existing buffer
> >> mappings so that the vIOMMU isn't continuously reprogrammed - which
> >> is likely not very efficient.
> >
> > I think one option here is to reserve a large block of GPA area (like a memory
> pool). The buffers are allocated from and freed to the pool.
> 
> That's basically communication via shared memory regions (like ivshmem).
> We'll do this for safety and security critical setups where we need to keep the
> hypervisor complexity minimal (we cannot effort a vIOMMU implementation,
> e.g.). Possibly, there is some room for sharing solution details on the guest side
> here.

I think people from redhat have started digging into the guest side vIOMMU. 

> >
> > Another one would be using batching. For example, set up a batch of 32
> buffers (just give the starting guest physical address, and the 32 buffers are
> guest-physically continuous) each time.
> >
> >> The other is how to hand over packets/buffers in a chain of multiple
> >> VMs. Ideally, there is already a hand-over from sender to the first
> >> receiver so that the sender can no longer mess with the packet after the
> receiver started processing it.
> >> However, that will work against efficiency.
> >>
> >> Essentially, it's the old IPC question of remap vs. copy here. The rest is "just"
> >> interfaces to exploit this elegantly.
> >
> > There are several ways to do a remapping. The remapping we are using here is
> to have the entire driver VM's memory mapped by the device VM, that is, the
> driver VM's memory is completely shared with the device VM.
> 
> I know - that's what I would call "without isolation".
> 
> >
> > If I understand that old remapping based IPC problem correctly, this kind of
> remapping requires a high degree of coordination between the two parts (i.e.
> the device VM and the driver VM). I think "virtq" is right the coordination
> mechanism here - the device VM grabs and fills a buffer from the available ring
> and puts the filled buffer to the used ring,  so I don't think the sender would
> mess with the packet after the receiver started to process it.
> > Please point out if I didn't get your point correctly. Thanks.
> 
> If you look at kdbus, e.g., they have some threshold value for handing over
> messages from sender to receiver via changes to the page tables (removal from
> sender side, addition to reader side - they don't rely on both sides being "nice"
> with each other) vs. simply copying the data between both address spaces. The
> costly paging changes - and it doesn't matter of they affect a process or a VM -
> apparently only pay off if the data blocks are large enough. A typical network
> packet is way below that threshold (a few 100K on x86 IIRC).

I think it's different. This method is like page flipping, which frequently modifies
both the sender and receiver's pages tables. It is costly because frequently maintaining 
complete TLB consistency across all the CPUs are expensive. I think that's why they
use a threshold to avoid TLB thrashing.

The mapping setup in vhost-pci is a one-time thing (we can ignore the memory hotplug
cases in the discussion, as that happens relatively rare). We don't need that frequent 
page table modifications to transfer data, so I think we also don't have that limitation.


Best,
Wei
> 
> >
> >>
> >>> +
> >>> +2 eptp switching
> >>> +The idea of eptp swithing allows a vhost-pci device driver to
> >>> +access the mapped driver VM's memory in an alternative view, where
> >>> +only a piece of trusted code can access the driver VM's memory.
> >>> +More details can be found at
> >>> +http://events.linuxfoundation.org/sites/events/files/slides/
> >>> +Jun_Nakajima_NFV_KVM%202015_final.pdf
> >>
> >> As we learned a while back, this one is not really secure. Any
> >> updates on if/how this is going to be fixed?
> >>
> >
> > Right, that is why we claimed it as a protection mechanism. However, one
> option we are trying is to fix the related holes from the hardware. We can give
> updates once it's ready.
> 
> Thanks, looking forward.
> 
> Jan
> 
> --
> Siemens AG, Corporate Technology, CT RDA ITP SES-DE Corporate Competence
> Center Embedded Linux

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [virtio-comment] RE: [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
  2016-06-02 11:13             ` [Qemu-devel] " Xiao Guangrong
@ 2016-06-03  6:12               ` Wang, Wei W
  -1 siblings, 0 replies; 34+ messages in thread
From: Wang, Wei W @ 2016-06-03  6:12 UTC (permalink / raw)
  To: Xiao Guangrong, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On Thu 6/2/2016 7:13 PM, Xiao Guangrong wrote:
> On 06/02/2016 04:43 PM, Wang, Wei W wrote:
> > On Thu 6/2/2016 11:52 AM,  Xiao Guangrong wrote:
> >> On 06/02/2016 11:15 AM, Wang, Wei W wrote:
> >>> On Wed 6/1/2016 4:15 PM, Xiao Guangrong wrote:
> >>>> On 05/29/2016 04:11 PM, Wei Wang wrote:
> >>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> >>>>> ---
> >>>>>     Details | 324
> >>>>
> >>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>     1 file changed, 324 insertions(+)
> >>>>>     create mode 100644 Details
> >>>>>
> >>>>> diff --git a/Details b/Details
> >>>>> new file mode 100644
> >>>>> index 0000000..4ea2252
> >>>>> --- /dev/null
> >>>>> +++ b/Details
> >>>>> @@ -0,0 +1,324 @@
> >>>>> +1 Device ID
> >>>>> +TBD
> >>>>> +
> >>>>> +2 Virtqueues
> >>>>> +0 controlq
> >>>>> +
> >>>>> +3 Feature Bits
> >>>>> +3.1 Local Feature Bits
> >>>>> +Currently no local feature bits are defined, so the standard
> >>>>> +virtio feature bits negation will always be successful and complete.
> >>>>> +
> >>>>> +3.2 Remote Feature Bits
> >>>>> +The remote feature bits are obtained from the frontend virtio
> >>>>> +device and negotiated with the vhost-pci driver via the controlq.
> >>>>> +The negotiation steps are described in 4.5 Device Initialization.
> >>>>> +
> >>>>> +4 Device Configuration Layout
> >>>>> +struct vhost_pci_config {
> >>>>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
> >>>>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
> >>>>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
> >>>>> +	u32 ack_type;
> >>>>> +	u32 ack_device_type;
> >>>>> +	u64 ack_device_id;
> >>>>> +	union {
> >>>>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
> >>>>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
> >>>>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
> >>>>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
> >>>>> +		u64 ack_memory_info;
> >>>>> +		u64 ack_device_info;
> >>>>> +		u64 ack_feature_bits;
> >>>>> +	};
> >>>>> +};
> >>>>
> >>>> Do you need to write all these 4 field to ack the operation? It
> >>>> seems it is not efficient and it is not flexible if the driver need
> >>>> to offer more data to the device in the further. Can we dedicate a
> >>>> vq for this purpose?
> >>>
> >>> Yes, all the 4 fields are required to be written. The vhost-pci
> >>> server usually
> >> connects to multiple clients, and the "ack_device_type" and "ack_device_id"
> >> fields are used to identify them.
> >>>
> >>> Agree, another controlq for the guest->host direction looks better,
> >>> and the
> >> above fileds can be converted to be the controlq message header.
> >>>
> >>
> >> Thanks.
> >>
> >>>>
> >>>> BTW, current approach can not handle the case if there are multiple
> >>>> same kind of requests in the control queue, e.g, if there are two
> >>>> memory-add request in the control queue.
> >>>
> >>> A vhost-pci device corresponds to a driver VM. The two memory-add
> >>> requests
> >> on the controlq are all for the same driver VM.  Memory-add requests
> >> for different driver VMs  couldn’t be present on the same controlq. I
> >> haven’t seen the issue yet. Can you please explain more? Thanks.
> >>
> >> The issue is caused by "The two memory-add requests on the controlq
> >> are all for the same driver VM", the driver need to ACK these request
> >> respectively, however, these two requests have the same ack_type,
> >> device_type, device_id, ack_memory_info, then QEMU is not able to
> >> figure out which request has been acked.
> >
> > Normally pieces of memory info should be combined into one message (the
> structure includes multiple memory regions) and sent by the client. In a rare case
> like this: the driver VM hot-adds 1GB memory, followed by hot-adding another
> 1GB memory. The first piece of memory info is passed via the socket and
> controlq to the vhost-pci driver, then the second. Normally they won't get an
> opportunity to be put on the controlq at the same time.
> > Even the implementation batches the controlq messages, there will be a
> sequence difference between the two messages on the controlq, right?
> 
> That assumes the driver should serially handle the control messages...
> 
> >
> >  From the QEMU's (vhost-pci server) perspective, it just sends back an ACK to
> the client whenever it receives an ACK from the vhost-pci driver.
> >  From the client's perspective, it will receive two ACK messages in this example.
> > Since the two have a sequence difference, the client should be able to
> distinguish the two (first sent, first acked), right?
> 
> That assumes that the vhost-pci server and remote virtio device should use serial
> mode too.

Before adding more fields to support that, I have another two questions to discuss here:

In my understanding, socket messages are sent one by one. How would a client send
two messages in parallel to the server?

Regarding the controlq, I understand that there are optimizations to parallelize ring 
operations, but that are mostly done to increase the data plane performance. 
Is it necessary to do such parallelism for the control message operations, which would
complicate the design?

Best,
Wei




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format
@ 2016-06-03  6:12               ` Wang, Wei W
  0 siblings, 0 replies; 34+ messages in thread
From: Wang, Wei W @ 2016-06-03  6:12 UTC (permalink / raw)
  To: Xiao Guangrong, kvm, qemu-devel, virtio-comment, mst, stefanha, pbonzini

On Thu 6/2/2016 7:13 PM, Xiao Guangrong wrote:
> On 06/02/2016 04:43 PM, Wang, Wei W wrote:
> > On Thu 6/2/2016 11:52 AM,  Xiao Guangrong wrote:
> >> On 06/02/2016 11:15 AM, Wang, Wei W wrote:
> >>> On Wed 6/1/2016 4:15 PM, Xiao Guangrong wrote:
> >>>> On 05/29/2016 04:11 PM, Wei Wang wrote:
> >>>>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> >>>>> ---
> >>>>>     Details | 324
> >>>>
> >>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>     1 file changed, 324 insertions(+)
> >>>>>     create mode 100644 Details
> >>>>>
> >>>>> diff --git a/Details b/Details
> >>>>> new file mode 100644
> >>>>> index 0000000..4ea2252
> >>>>> --- /dev/null
> >>>>> +++ b/Details
> >>>>> @@ -0,0 +1,324 @@
> >>>>> +1 Device ID
> >>>>> +TBD
> >>>>> +
> >>>>> +2 Virtqueues
> >>>>> +0 controlq
> >>>>> +
> >>>>> +3 Feature Bits
> >>>>> +3.1 Local Feature Bits
> >>>>> +Currently no local feature bits are defined, so the standard
> >>>>> +virtio feature bits negation will always be successful and complete.
> >>>>> +
> >>>>> +3.2 Remote Feature Bits
> >>>>> +The remote feature bits are obtained from the frontend virtio
> >>>>> +device and negotiated with the vhost-pci driver via the controlq.
> >>>>> +The negotiation steps are described in 4.5 Device Initialization.
> >>>>> +
> >>>>> +4 Device Configuration Layout
> >>>>> +struct vhost_pci_config {
> >>>>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
> >>>>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
> >>>>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
> >>>>> +	u32 ack_type;
> >>>>> +	u32 ack_device_type;
> >>>>> +	u64 ack_device_id;
> >>>>> +	union {
> >>>>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
> >>>>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
> >>>>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
> >>>>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
> >>>>> +		u64 ack_memory_info;
> >>>>> +		u64 ack_device_info;
> >>>>> +		u64 ack_feature_bits;
> >>>>> +	};
> >>>>> +};
> >>>>
> >>>> Do you need to write all these 4 field to ack the operation? It
> >>>> seems it is not efficient and it is not flexible if the driver need
> >>>> to offer more data to the device in the further. Can we dedicate a
> >>>> vq for this purpose?
> >>>
> >>> Yes, all the 4 fields are required to be written. The vhost-pci
> >>> server usually
> >> connects to multiple clients, and the "ack_device_type" and "ack_device_id"
> >> fields are used to identify them.
> >>>
> >>> Agree, another controlq for the guest->host direction looks better,
> >>> and the
> >> above fileds can be converted to be the controlq message header.
> >>>
> >>
> >> Thanks.
> >>
> >>>>
> >>>> BTW, current approach can not handle the case if there are multiple
> >>>> same kind of requests in the control queue, e.g, if there are two
> >>>> memory-add request in the control queue.
> >>>
> >>> A vhost-pci device corresponds to a driver VM. The two memory-add
> >>> requests
> >> on the controlq are all for the same driver VM.  Memory-add requests
> >> for different driver VMs  couldn’t be present on the same controlq. I
> >> haven’t seen the issue yet. Can you please explain more? Thanks.
> >>
> >> The issue is caused by "The two memory-add requests on the controlq
> >> are all for the same driver VM", the driver need to ACK these request
> >> respectively, however, these two requests have the same ack_type,
> >> device_type, device_id, ack_memory_info, then QEMU is not able to
> >> figure out which request has been acked.
> >
> > Normally pieces of memory info should be combined into one message (the
> structure includes multiple memory regions) and sent by the client. In a rare case
> like this: the driver VM hot-adds 1GB memory, followed by hot-adding another
> 1GB memory. The first piece of memory info is passed via the socket and
> controlq to the vhost-pci driver, then the second. Normally they won't get an
> opportunity to be put on the controlq at the same time.
> > Even the implementation batches the controlq messages, there will be a
> sequence difference between the two messages on the controlq, right?
> 
> That assumes the driver should serially handle the control messages...
> 
> >
> >  From the QEMU's (vhost-pci server) perspective, it just sends back an ACK to
> the client whenever it receives an ACK from the vhost-pci driver.
> >  From the client's perspective, it will receive two ACK messages in this example.
> > Since the two have a sequence difference, the client should be able to
> distinguish the two (first sent, first acked), right?
> 
> That assumes that the vhost-pci server and remote virtio device should use serial
> mode too.

Before adding more fields to support that, I have another two questions to discuss here:

In my understanding, socket messages are sent one by one. How would a client send
two messages in parallel to the server?

Regarding the controlq, I understand that there are optimizations to parallelize ring 
operations, but that are mostly done to increase the data plane performance. 
Is it necessary to do such parallelism for the control message operations, which would
complicate the design?

Best,
Wei




^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2016-06-03  6:12 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-29  8:11 [PATCH 0/6 Resend] *** Vhost-pci RFC *** Wei Wang
2016-05-29  8:11 ` [Qemu-devel] " Wei Wang
2016-05-29  8:11 ` [virtio-comment] [PATCH 1/6 Resend] Vhost-pci RFC: Introduction Wei Wang
2016-05-29  8:11   ` [Qemu-devel] " Wei Wang
2016-05-29  8:11 ` [virtio-comment] [PATCH 2/6 Resend] Vhost-pci RFC: Modification Scope Wei Wang
2016-05-29  8:11   ` [Qemu-devel] " Wei Wang
2016-05-29  8:11 ` [virtio-comment] [PATCH 3/6 Resend] Vhost-pci RFC: Benefits to KVM Wei Wang
2016-05-29  8:11   ` [Qemu-devel] " Wei Wang
2016-05-29  8:11 ` [virtio-comment] [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format Wei Wang
2016-05-29  8:11   ` [Qemu-devel] " Wei Wang
2016-06-01  8:15   ` Xiao Guangrong
2016-06-01  8:15     ` [Qemu-devel] " Xiao Guangrong
2016-06-02  3:15     ` [virtio-comment] " Wang, Wei W
2016-06-02  3:15       ` [Qemu-devel] " Wang, Wei W
2016-06-02  3:52       ` Xiao Guangrong
2016-06-02  3:52         ` [Qemu-devel] " Xiao Guangrong
2016-06-02  8:43         ` [virtio-comment] " Wang, Wei W
2016-06-02  8:43           ` [Qemu-devel] " Wang, Wei W
2016-06-02 11:13           ` Xiao Guangrong
2016-06-02 11:13             ` [Qemu-devel] " Xiao Guangrong
2016-06-03  6:12             ` [virtio-comment] " Wang, Wei W
2016-06-03  6:12               ` [Qemu-devel] " Wang, Wei W
2016-05-29  8:11 ` [virtio-comment] [PATCH 5/6 Resend] Vhost-pci RFC: Future Security Enhancement Wei Wang
2016-05-29  8:11   ` [Qemu-devel] " Wei Wang
2016-05-30  6:23   ` [virtio-comment] " Jan Kiszka
2016-05-30  6:23     ` [Qemu-devel] " Jan Kiszka
2016-05-31  8:00     ` Wang, Wei W
2016-05-31  8:00       ` [Qemu-devel] " Wang, Wei W
2016-06-02  9:27       ` Jan Kiszka
2016-06-02  9:27         ` [Qemu-devel] " Jan Kiszka
2016-06-03  5:54         ` Wang, Wei W
2016-06-03  5:54           ` [Qemu-devel] " Wang, Wei W
2016-05-29  8:11 ` [PATCH 6/6 Resend] Vhost-pci RFC: Experimental Results Wei Wang
2016-05-29  8:11   ` [Qemu-devel] " Wei Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.