* [PATCH RFC 0/8] basic vfio-ccw infrastructure
@ 2016-04-29 12:11 ` Dong Jia Shi
  0 siblings, 0 replies; 36+ messages in thread
From: Dong Jia Shi @ 2016-04-29 12:11 UTC (permalink / raw)
  To: kvm, linux-s390, qemu-devel
  Cc: bjsdjshi, renxiaof, cornelia.huck, borntraeger, agraf, alex.williamson

vfio: ccw: basic vfio-ccw infrastructure
========================================

Introduction
------------

Here we describe vfio support for Channel I/O devices (aka. CCW
devices) on Linux/s390. The motivation for vfio-ccw is to pass CCW
devices through to a virtual machine, with vfio as the means.

Unlike other hardware architectures, s390 defines a unified I/O access
method, called Channel I/O, with its own access patterns:
- Channel programs run asynchronously on a separate (co)processor.
- The channel subsystem accesses any memory designated by the caller
  in the channel program directly, i.e. there is no iommu involved.
Thus, when we introduce vfio support for these devices, we realize it
with a no-iommu vfio implementation.

This document does not intend to explain the s390 hardware architecture
in every detail. More information/references can be found here:
- A good starting point for Channel I/O in general:
  https://en.wikipedia.org/wiki/Channel_I/O
- s390 architecture:
  s390 Principles of Operation manual (IBM Form. No. SA22-7832)
- The existing Qemu code, which implements a simple emulated channel
  subsystem, is also a good reference that makes the flow easier to
  follow:
  qemu/hw/s390x/css.c

Motivation of vfio-ccw
----------------------

Currently, a guest virtualized via qemu/kvm on s390 only sees
paravirtualized virtio devices via the "Virtio Over Channel I/O
(virtio-ccw)" transport. This makes virtio devices discoverable via
standard operating system algorithms for handling channel devices.

However, this is not enough. For the majority of devices on s390, which
use the standard Channel I/O based mechanism, we also need to provide
the functionality of passing them through to a Qemu virtual machine.
This includes devices that don't have a virtio counterpart (e.g. tape
drives) or that have specific characteristics which guests want to
exploit.

For passing a device to a guest, we want to use the same interface as
everybody else, namely vfio. Thus, we would like to introduce vfio
support for channel devices, and we would like to name this new vfio
device type "vfio-ccw".

Access patterns of CCW devices
------------------------------

The s390 architecture implements a so-called channel subsystem that
provides a unified view of the devices physically attached to the
system. Although the s390 hardware platform knows about a huge variety
of different peripheral attachments, such as disk devices (aka. DASDs),
tapes, and communication controllers, they can all be accessed by a
well-defined access method, and they all present I/O completion in a
unified way: via I/O interruptions.

All I/O requires the use of channel command words (CCWs). A CCW is an
instruction to a specialized I/O channel processor. A channel program
is a sequence of CCWs which are executed by the I/O channel subsystem.
To issue a channel program to the channel subsystem, it is required to
build an operation request block (ORB), which points to the channel
program and describes its format and other control information to the
system. The operating system signals the I/O channel subsystem to
begin executing the channel program with a SSCH (start subchannel)
instruction. The central processor is then free to proceed with
non-I/O instructions until interrupted. The I/O completion result is
received by the interrupt handler in the form of an interrupt response
block (IRB).

Back to vfio-ccw; in short:
- ORBs and CCW programs are built in user space (with virtual
  addresses).
- ORBs and CCW programs are passed to the kernel.
- The kernel translates virtual addresses to real addresses and starts
  the I/O by issuing a privileged Channel I/O instruction (e.g. SSCH).
- CCW programs run asynchronously on a separate processor.
- I/O completion is signaled to the host with an I/O interruption, and
  the result is copied to user space as an IRB.


vfio-ccw patches overview
-------------------------

It follows that we need vfio-ccw with a vfio no-iommu mode. For now,
our patches are based on the current no-iommu implementation, which is
a good starting point for the vfio-ccw code review. Note that the
implementation is far from complete yet, but we'd like to get feedback
on the general architecture.

The current no-iommu implementation considers vfio-ccw devices as
unsupported and taints the kernel. This should not be the case for
vfio-ccw. Whether the end result will use the existing no-iommu code
or a new module is an implementation detail.

* CCW translation APIs
- Description:
  These patches introduce a group of APIs (prefixed with 'ccwchain_')
  to do CCW translation. The CCWs passed in by a user space program
  are organized in a buffer and contain user virtual memory addresses.
  These APIs copy the CCWs into kernel space and assemble a runnable
  kernel CCW program by replacing the user virtual addresses with
  their corresponding physical addresses.
- Patches:
  vfio: ccw: introduce page array interfaces
  vfio: ccw: introduce ccw chain interfaces

* vfio-ccw device driver
- Description:
  The following patches introduce vfio-ccw, which utilizes the CCW
  translation APIs. vfio-ccw is a driver for vfio-based ccw devices
  which can bind to any device that is passed to the guest and
  implements the following vfio ioctls:
    VFIO_DEVICE_GET_INFO
    VFIO_DEVICE_CCW_HOT_RESET
    VFIO_DEVICE_CCW_CMD_REQUEST
  With the CMD_REQUEST ioctl, a user space program can pass a CCW
  program to the kernel, which performs further CCW translation before
  issuing it to a real device. Currently we map I/O, which is
  basically asynchronous, to this synchronous interface: the ioctl
  does not return until the interrupt handler has received the I/O
  execution result.
- Patches:
  vfio: ccw: basic implementation for vfio_ccw driver
  vfio: ccw: realize VFIO_DEVICE_GET_INFO ioctl
  vfio: ccw: realize VFIO_DEVICE_CCW_HOT_RESET ioctl
  vfio: ccw: realize VFIO_DEVICE_CCW_CMD_REQUEST ioctl

The user of vfio-ccw is not limited to Qemu, but Qemu is definitely a
good example for understanding how these patches work. Here is a bit
more detail on how an I/O request triggered by a Qemu guest is handled
(error handling omitted):

Explanation:
Q1-Q4: Qemu side process.
K1-K6: Kernel side process.

Q1. Intercept a ssch instruction.
Q2. Translate the guest ccw program to a user space ccw program
    (u_ccwchain).
Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
    K1. Copy from u_ccwchain to kernel (k_ccwchain).
    K2. Translate the user space ccw program to a kernel space ccw
        program, which becomes runnable for a real device.
    K3. With the necessary information contained in the orb passed in
        by Qemu, issue the k_ccwchain to the device, and wait event q
        for the I/O result.
    K4. Interrupt handler gets the I/O result, and wakes up the wait q.
    K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
        update the user space irb.
    K6. Copy irb and scsw back to user space.
Q4. Update the irb for the guest.

Limitations
-----------

The current vfio-ccw implementation focuses on supporting only the
basic commands needed to implement block device functionality
(read/write) for DASD/ECKD devices. Some commands may need special
handling in the future, for example, anything related to path
grouping.

DASD is a kind of storage device, while ECKD is a data recording
format. More information on DASD and ECKD can be found here:
https://en.wikipedia.org/wiki/Direct-access_storage_device
https://en.wikipedia.org/wiki/Count_key_data

Together with the corresponding work in Qemu, we can now bring a
passed-through DASD/ECKD device online in a guest and use it as a
block device.

Reference
---------
1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204)
3. https://en.wikipedia.org/wiki/Channel_I/O
4. https://www.kernel.org/doc/Documentation/s390/cds.txt

Dong Jia Shi (8):
  iommu: s390: enable iommu api for s390 ccw devices
  s390: move orb.h from drivers/s390/ to arch/s390/
  vfio: ccw: basic implementation for vfio_ccw driver
  vfio: ccw: realize VFIO_DEVICE_GET_INFO ioctl
  vfio: ccw: realize VFIO_DEVICE_CCW_HOT_RESET ioctl
  vfio: ccw: introduce page array interfaces
  vfio: ccw: introduce ccw chain interfaces
  vfio: ccw: realize VFIO_DEVICE_CCW_CMD_REQUEST ioctl

 arch/s390/include/asm/irq.h                       |   1 +
 {drivers/s390/cio => arch/s390/include/asm}/orb.h |   0
 arch/s390/kernel/irq.c                            |   1 +
 drivers/iommu/Kconfig                             |   6 +-
 drivers/s390/cio/eadm_sch.c                       |   2 +-
 drivers/s390/cio/eadm_sch.h                       |   2 +-
 drivers/s390/cio/io_sch.h                         |   2 +-
 drivers/s390/cio/ioasm.c                          |   2 +-
 drivers/s390/cio/ioasm.h                          |   2 +-
 drivers/s390/cio/trace.h                          |   2 +-
 drivers/vfio/Kconfig                              |   1 +
 drivers/vfio/Makefile                             |   1 +
 drivers/vfio/ccw/Kconfig                          |   7 +
 drivers/vfio/ccw/Makefile                         |   2 +
 drivers/vfio/ccw/ccwchain.c                       | 569 ++++++++++++++++++++++
 drivers/vfio/ccw/ccwchain.h                       |  49 ++
 drivers/vfio/ccw/vfio_ccw.c                       | 416 ++++++++++++++++
 include/uapi/linux/vfio.h                         |  32 ++
 18 files changed, 1088 insertions(+), 9 deletions(-)
 rename {drivers/s390/cio => arch/s390/include/asm}/orb.h (100%)
 create mode 100644 drivers/vfio/ccw/Kconfig
 create mode 100644 drivers/vfio/ccw/Makefile
 create mode 100644 drivers/vfio/ccw/ccwchain.c
 create mode 100644 drivers/vfio/ccw/ccwchain.h
 create mode 100644 drivers/vfio/ccw/vfio_ccw.c

-- 
2.6.6

* [PATCH RFC 1/8] iommu: s390: enable iommu api for s390 ccw devices
  2016-04-29 12:11 ` [Qemu-devel] " Dong Jia Shi
@ 2016-04-29 12:11   ` Dong Jia Shi
  -1 siblings, 0 replies; 36+ messages in thread
From: Dong Jia Shi @ 2016-04-29 12:11 UTC (permalink / raw)
  To: kvm, linux-s390, qemu-devel
  Cc: bjsdjshi, renxiaof, cornelia.huck, borntraeger, agraf, alex.williamson

This enables the IOMMU API if CONFIG_CCW is configured.

Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
---
 drivers/iommu/Kconfig | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index dd1dc39..63bbc3d 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -331,11 +331,11 @@ config ARM_SMMU_V3
 	  the ARM SMMUv3 architecture.
 
 config S390_IOMMU
-	def_bool y if S390 && PCI
-	depends on S390 && PCI
+	def_bool y
+	depends on S390 && (PCI || CCW)
 	select IOMMU_API
 	help
-	  Support for the IOMMU API for s390 PCI devices.
+	  Support for the IOMMU API for s390 PCI and CCW devices.
 
 config MTK_IOMMU
 	bool "MTK IOMMU Support"
-- 
2.6.6

* [PATCH RFC 2/8] s390: move orb.h from drivers/s390/ to arch/s390/
  2016-04-29 12:11 ` [Qemu-devel] " Dong Jia Shi
@ 2016-04-29 12:11   ` Dong Jia Shi
  -1 siblings, 0 replies; 36+ messages in thread
From: Dong Jia Shi @ 2016-04-29 12:11 UTC (permalink / raw)
  To: kvm, linux-s390, qemu-devel
  Cc: bjsdjshi, renxiaof, cornelia.huck, borntraeger, agraf, alex.williamson

Let's make orb-related definitions available outside
of the common I/O layer for future use (e.g. for
passing channel devices to a guest).

Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
---
 {drivers/s390/cio => arch/s390/include/asm}/orb.h | 0
 drivers/s390/cio/eadm_sch.c                       | 2 +-
 drivers/s390/cio/eadm_sch.h                       | 2 +-
 drivers/s390/cio/io_sch.h                         | 2 +-
 drivers/s390/cio/ioasm.c                          | 2 +-
 drivers/s390/cio/ioasm.h                          | 2 +-
 drivers/s390/cio/trace.h                          | 2 +-
 7 files changed, 6 insertions(+), 6 deletions(-)
 rename {drivers/s390/cio => arch/s390/include/asm}/orb.h (100%)

diff --git a/drivers/s390/cio/orb.h b/arch/s390/include/asm/orb.h
similarity index 100%
rename from drivers/s390/cio/orb.h
rename to arch/s390/include/asm/orb.h
diff --git a/drivers/s390/cio/eadm_sch.c b/drivers/s390/cio/eadm_sch.c
index b3f44bc..8082a03 100644
--- a/drivers/s390/cio/eadm_sch.c
+++ b/drivers/s390/cio/eadm_sch.c
@@ -21,12 +21,12 @@
 #include <asm/cio.h>
 #include <asm/scsw.h>
 #include <asm/eadm.h>
+#include <asm/orb.h>
 
 #include "eadm_sch.h"
 #include "ioasm.h"
 #include "cio.h"
 #include "css.h"
-#include "orb.h"
 
 MODULE_DESCRIPTION("driver for s390 eadm subchannels");
 MODULE_LICENSE("GPL");
diff --git a/drivers/s390/cio/eadm_sch.h b/drivers/s390/cio/eadm_sch.h
index 9664e46..2184920 100644
--- a/drivers/s390/cio/eadm_sch.h
+++ b/drivers/s390/cio/eadm_sch.h
@@ -5,7 +5,7 @@
 #include <linux/device.h>
 #include <linux/timer.h>
 #include <linux/list.h>
-#include "orb.h"
+#include <asm/orb.h>
 
 struct eadm_private {
 	union orb orb;
diff --git a/drivers/s390/cio/io_sch.h b/drivers/s390/cio/io_sch.h
index 8975060..b768523 100644
--- a/drivers/s390/cio/io_sch.h
+++ b/drivers/s390/cio/io_sch.h
@@ -5,8 +5,8 @@
 #include <asm/schid.h>
 #include <asm/ccwdev.h>
 #include <asm/irq.h>
+#include <asm/orb.h>
 #include "css.h"
-#include "orb.h"
 
 struct io_subchannel_private {
 	union orb orb;		/* operation request block */
diff --git a/drivers/s390/cio/ioasm.c b/drivers/s390/cio/ioasm.c
index 9898481..7fd413d 100644
--- a/drivers/s390/cio/ioasm.c
+++ b/drivers/s390/cio/ioasm.c
@@ -7,9 +7,9 @@
 #include <asm/chpid.h>
 #include <asm/schid.h>
 #include <asm/crw.h>
+#include <asm/orb.h>
 
 #include "ioasm.h"
-#include "orb.h"
 #include "cio.h"
 
 int stsch(struct subchannel_id schid, struct schib *addr)
diff --git a/drivers/s390/cio/ioasm.h b/drivers/s390/cio/ioasm.h
index b31ee6b..b2ca4a3 100644
--- a/drivers/s390/cio/ioasm.h
+++ b/drivers/s390/cio/ioasm.h
@@ -4,7 +4,7 @@
 #include <asm/chpid.h>
 #include <asm/schid.h>
 #include <asm/crw.h>
-#include "orb.h"
+#include <asm/orb.h>
 #include "cio.h"
 #include "trace.h"
 
diff --git a/drivers/s390/cio/trace.h b/drivers/s390/cio/trace.h
index 5b807a0..ba58f7c 100644
--- a/drivers/s390/cio/trace.h
+++ b/drivers/s390/cio/trace.h
@@ -7,10 +7,10 @@
 
 #include <linux/kernel.h>
 #include <asm/crw.h>
+#include <asm/orb.h>
 #include <uapi/asm/chpid.h>
 #include <uapi/asm/schid.h>
 #include "cio.h"
-#include "orb.h"
 
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM s390
-- 
2.6.6

* [PATCH RFC 3/8] vfio: ccw: basic implementation for vfio_ccw driver
  2016-04-29 12:11 ` [Qemu-devel] " Dong Jia Shi
@ 2016-04-29 12:11   ` Dong Jia Shi
  -1 siblings, 0 replies; 36+ messages in thread
From: Dong Jia Shi @ 2016-04-29 12:11 UTC (permalink / raw)
  To: kvm, linux-s390, qemu-devel
  Cc: bjsdjshi, renxiaof, cornelia.huck, borntraeger, agraf, alex.williamson

Add a basic vfio_ccw driver, which depends on the VFIO No-IOMMU
support.

Add a new config option:
  Device Drivers
  --> VFIO Non-Privileged userspace driver framework
    --> VFIO No-IOMMU support
      --> VFIO support for ccw devices

Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
---
 arch/s390/include/asm/irq.h |   1 +
 arch/s390/kernel/irq.c      |   1 +
 drivers/vfio/Kconfig        |   1 +
 drivers/vfio/Makefile       |   1 +
 drivers/vfio/ccw/Kconfig    |   7 ++
 drivers/vfio/ccw/Makefile   |   2 +
 drivers/vfio/ccw/vfio_ccw.c | 160 ++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 173 insertions(+)
 create mode 100644 drivers/vfio/ccw/Kconfig
 create mode 100644 drivers/vfio/ccw/Makefile
 create mode 100644 drivers/vfio/ccw/vfio_ccw.c

diff --git a/arch/s390/include/asm/irq.h b/arch/s390/include/asm/irq.h
index f97b055..5ec272a 100644
--- a/arch/s390/include/asm/irq.h
+++ b/arch/s390/include/asm/irq.h
@@ -66,6 +66,7 @@ enum interruption_class {
 	IRQIO_VAI,
 	NMI_NMI,
 	CPU_RST,
+	IRQIO_VFC,
 	NR_ARCH_IRQS
 };
 
diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
index c373a1d..706002a 100644
--- a/arch/s390/kernel/irq.c
+++ b/arch/s390/kernel/irq.c
@@ -88,6 +88,7 @@ static const struct irq_class irqclass_sub_desc[] = {
 	{.irq = IRQIO_VAI,  .name = "VAI", .desc = "[I/O] Virtual I/O Devices AI"},
 	{.irq = NMI_NMI,    .name = "NMI", .desc = "[NMI] Machine Check"},
 	{.irq = CPU_RST,    .name = "RST", .desc = "[CPU] CPU Restart"},
+	{.irq = IRQIO_VFC,  .name = "VFC", .desc = "[I/O] VFIO CCW Devices"},
 };
 
 void __init init_IRQ(void)
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index da6e2ce..f1d414c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -46,6 +46,7 @@ menuconfig VFIO_NOIOMMU
 
 	  If you don't know what to do here, say N.
 
+source "drivers/vfio/ccw/Kconfig"
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f..2b39593 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
+obj-$(CONFIG_VFIO_CCW) += ccw/
diff --git a/drivers/vfio/ccw/Kconfig b/drivers/vfio/ccw/Kconfig
new file mode 100644
index 0000000..6281152
--- /dev/null
+++ b/drivers/vfio/ccw/Kconfig
@@ -0,0 +1,7 @@
+config VFIO_CCW
+	tristate "VFIO support for CCW devices"
+	depends on VFIO_NOIOMMU && CCW
+	help
+	  VFIO support for CCW bus driver. Note that this is just
+	  the base driver; you'll also need a userspace program
+	  to provide a device configuration and channel programs.
diff --git a/drivers/vfio/ccw/Makefile b/drivers/vfio/ccw/Makefile
new file mode 100644
index 0000000..ea14ca9
--- /dev/null
+++ b/drivers/vfio/ccw/Makefile
@@ -0,0 +1,2 @@
+vfio-ccw-y := vfio_ccw.o
+obj-$(CONFIG_VFIO_CCW) += vfio-ccw.o
diff --git a/drivers/vfio/ccw/vfio_ccw.c b/drivers/vfio/ccw/vfio_ccw.c
new file mode 100644
index 0000000..8b0acae
--- /dev/null
+++ b/drivers/vfio/ccw/vfio_ccw.c
@@ -0,0 +1,160 @@
+/*
+ * vfio based ccw device driver
+ *
+ * Copyright IBM Corp. 2016
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License (version 2 only)
+ * as published by the Free Software Foundation.
+ *
+ * Author(s): Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
+ *            Xiao Feng Ren <renxiaof@linux.vnet.ibm.com>
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/iommu.h>
+#include <linux/vfio.h>
+#include <asm/ccwdev.h>
+#include <asm/cio.h>
+
+/**
+ * struct vfio_ccw_device
+ * @cdev: ccw device
+ * @going_away: if an offline procedure was already ongoing
+ */
+struct vfio_ccw_device {
+	struct ccw_device	*cdev;
+	bool			going_away;
+};
+
+enum vfio_ccw_device_type {
+	vfio_dasd_eckd,
+};
+
+struct ccw_device_id vfio_ccw_ids[] = {
+	{ CCW_DEVICE_DEVTYPE(0x3990, 0, 0x3390, 0),
+	  .driver_info = vfio_dasd_eckd},
+	{ /* End of list. */ },
+};
+MODULE_DEVICE_TABLE(ccw, vfio_ccw_ids);
+
+/*
+ * vfio callbacks
+ */
+static int vfio_ccw_open(void *device_data)
+{
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	return 0;
+}
+
+static void vfio_ccw_release(void *device_data)
+{
+	module_put(THIS_MODULE);
+}
+
+static long vfio_ccw_ioctl(void *device_data, unsigned int cmd,
+			   unsigned long arg)
+{
+	return -ENOTTY;
+}
+
+static const struct vfio_device_ops vfio_ccw_ops = {
+	.name		= "vfio_ccw",
+	.open		= vfio_ccw_open,
+	.release	= vfio_ccw_release,
+	.ioctl		= vfio_ccw_ioctl,
+};
+
+static int vfio_ccw_probe(struct ccw_device *cdev)
+{
+	struct iommu_group *group = vfio_iommu_group_get(&cdev->dev);
+
+	if (!group)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int vfio_ccw_set_offline(struct ccw_device *cdev)
+{
+	struct vfio_device *device = vfio_device_get_from_dev(&cdev->dev);
+	struct vfio_ccw_device *vdev;
+
+	if (!device)
+		return 0;
+
+	vdev = vfio_device_data(device);
+	vfio_device_put(device);
+	if (!vdev || vdev->going_away)
+		return 0;
+
+	vdev->going_away = true;
+	vfio_del_group_dev(&cdev->dev);
+	kfree(vdev);
+
+	return 0;
+}
+
+static void vfio_ccw_remove(struct ccw_device *cdev)
+{
+	if (cdev && cdev->online)
+		vfio_ccw_set_offline(cdev);
+
+	vfio_iommu_group_put(cdev->dev.iommu_group, &cdev->dev);
+}
+
+static int vfio_ccw_set_online(struct ccw_device *cdev)
+{
+	struct vfio_ccw_device *vdev;
+	int ret;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev)
+		return -ENOMEM;
+
+	vdev->cdev = cdev;
+
+	ret = vfio_add_group_dev(&cdev->dev, &vfio_ccw_ops, vdev);
+	if (ret)
+		kfree(vdev);
+
+	return ret;
+}
+
+static int vfio_ccw_notify(struct ccw_device *cdev, int event)
+{
+	/* LATER: We probably need to handle device/path state changes. */
+	return 0;
+}
+
+static struct ccw_driver vfio_ccw_driver = {
+	.driver = {
+		.name	= "vfio_ccw",
+		.owner	= THIS_MODULE,
+	},
+	.ids	     = vfio_ccw_ids,
+	.probe	     = vfio_ccw_probe,
+	.remove      = vfio_ccw_remove,
+	.set_offline = vfio_ccw_set_offline,
+	.set_online  = vfio_ccw_set_online,
+	.notify      = vfio_ccw_notify,
+	.int_class   = IRQIO_VFC,
+};
+
+static int __init vfio_ccw_init(void)
+{
+	return ccw_driver_register(&vfio_ccw_driver);
+}
+
+static void __exit vfio_ccw_cleanup(void)
+{
+	ccw_driver_unregister(&vfio_ccw_driver);
+}
+
+module_init(vfio_ccw_init);
+module_exit(vfio_ccw_cleanup);
+
+MODULE_LICENSE("GPL v2");
-- 
2.6.6


* [PATCH RFC 4/8] vfio: ccw: realize VFIO_DEVICE_GET_INFO ioctl
  2016-04-29 12:11 ` [Qemu-devel] " Dong Jia Shi
@ 2016-04-29 12:11   ` Dong Jia Shi
  -1 siblings, 0 replies; 36+ messages in thread
From: Dong Jia Shi @ 2016-04-29 12:11 UTC (permalink / raw)
  To: kvm, linux-s390, qemu-devel
  Cc: bjsdjshi, renxiaof, cornelia.huck, borntraeger, agraf, alex.williamson

Introduce device information about vfio-ccw: VFIO_DEVICE_FLAGS_CCW.
Realize VFIO_DEVICE_GET_INFO ioctl for vfio-ccw.

Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
---
 drivers/vfio/ccw/vfio_ccw.c | 20 ++++++++++++++++++++
 include/uapi/linux/vfio.h   |  1 +
 2 files changed, 21 insertions(+)

diff --git a/drivers/vfio/ccw/vfio_ccw.c b/drivers/vfio/ccw/vfio_ccw.c
index 8b0acae..7331aed 100644
--- a/drivers/vfio/ccw/vfio_ccw.c
+++ b/drivers/vfio/ccw/vfio_ccw.c
@@ -58,6 +58,26 @@ static void vfio_ccw_release(void *device_data)
 static long vfio_ccw_ioctl(void *device_data, unsigned int cmd,
 			   unsigned long arg)
 {
+	unsigned long minsz;
+
+	if (cmd == VFIO_DEVICE_GET_INFO) {
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_CCW;
+		info.num_regions = 0;
+		info.num_irqs = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+
 	return -ENOTTY;
 }
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 255a211..aaedfcd 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -198,6 +198,7 @@ struct vfio_device_info {
 #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
 #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
 #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
+#define VFIO_DEVICE_FLAGS_CCW	(1 << 4)	/* vfio-ccw device */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
-- 
2.6.6


* [PATCH RFC 5/8] vfio: ccw: realize VFIO_DEVICE_CCW_HOT_RESET ioctl
  2016-04-29 12:11 ` [Qemu-devel] " Dong Jia Shi
@ 2016-04-29 12:11   ` Dong Jia Shi
  -1 siblings, 0 replies; 36+ messages in thread
From: Dong Jia Shi @ 2016-04-29 12:11 UTC (permalink / raw)
  To: kvm, linux-s390, qemu-devel
  Cc: bjsdjshi, renxiaof, cornelia.huck, borntraeger, agraf, alex.williamson

Introduce VFIO_DEVICE_CCW_HOT_RESET ioctl for vfio-ccw to make it
possible to hot-reset the device.

We try to achieve a hot reset by first offlining the device and then
onlining it again: this should clear all state at the subchannel.

Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
---
 drivers/vfio/ccw/vfio_ccw.c | 50 ++++++++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h   |  8 ++++++++
 2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/ccw/vfio_ccw.c b/drivers/vfio/ccw/vfio_ccw.c
index 7331aed..9700448 100644
--- a/drivers/vfio/ccw/vfio_ccw.c
+++ b/drivers/vfio/ccw/vfio_ccw.c
@@ -22,10 +22,12 @@
  * struct vfio_ccw_device
  * @cdev: ccw device
  * @going_away: if an offline procedure was already ongoing
+ * @hot_reset: if hot-reset is ongoing
  */
 struct vfio_ccw_device {
 	struct ccw_device	*cdev;
 	bool			going_away;
+	bool			hot_reset;
 };
 
 enum vfio_ccw_device_type {
@@ -58,6 +60,7 @@ static void vfio_ccw_release(void *device_data)
 static long vfio_ccw_ioctl(void *device_data, unsigned int cmd,
 			   unsigned long arg)
 {
+	struct vfio_ccw_device *vcdev = device_data;
 	unsigned long minsz;
 
 	if (cmd == VFIO_DEVICE_GET_INFO) {
@@ -76,6 +79,34 @@ static long vfio_ccw_ioctl(void *device_data, unsigned int cmd,
 		info.num_irqs = 0;
 
 		return copy_to_user((void __user *)arg, &info, minsz);
+
+	} else if (cmd == VFIO_DEVICE_CCW_HOT_RESET) {
+		unsigned long flags;
+		int ret;
+
+		spin_lock_irqsave(get_ccwdev_lock(vcdev->cdev), flags);
+		if (!vcdev->cdev->online) {
+			spin_unlock_irqrestore(get_ccwdev_lock(vcdev->cdev),
+					       flags);
+			return -EINVAL;
+		}
+
+		if (vcdev->hot_reset) {
+			spin_unlock_irqrestore(get_ccwdev_lock(vcdev->cdev),
+					       flags);
+			return -EBUSY;
+		}
+		vcdev->hot_reset = true;
+		spin_unlock_irqrestore(get_ccwdev_lock(vcdev->cdev), flags);
+
+		ret = ccw_device_set_offline(vcdev->cdev);
+		if (!ret)
+			ret = ccw_device_set_online(vcdev->cdev);
+
+		spin_lock_irqsave(get_ccwdev_lock(vcdev->cdev), flags);
+		vcdev->hot_reset = false;
+		spin_unlock_irqrestore(get_ccwdev_lock(vcdev->cdev), flags);
+		return ret;
 	}
 
 	return -ENOTTY;
@@ -108,7 +139,7 @@ static int vfio_ccw_set_offline(struct ccw_device *cdev)
 
 	vdev = vfio_device_data(device);
 	vfio_device_put(device);
-	if (!vdev || vdev->going_away)
+	if (!vdev || vdev->hot_reset || vdev->going_away)
 		return 0;
 
 	vdev->going_away = true;
@@ -128,9 +159,26 @@ void vfio_ccw_remove(struct ccw_device *cdev)
 
 static int vfio_ccw_set_online(struct ccw_device *cdev)
 {
+	struct vfio_device *device = vfio_device_get_from_dev(&cdev->dev);
 	struct vfio_ccw_device *vdev;
 	int ret;
 
+	if (!device)
+		goto create_device;
+
+	vdev = vfio_device_data(device);
+	vfio_device_put(device);
+	if (!vdev)
+		goto create_device;
+
+	/*
+	 * During hot reset, we just want to disable/enable the
+	 * subchannel and need not set up anything again.
+	 */
+	if (vdev->hot_reset)
+		return 0;
+
+create_device:
 	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
 	if (!vdev)
 		return -ENOMEM;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index aaedfcd..889a316 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -687,6 +687,14 @@ struct vfio_iommu_spapr_tce_remove {
 };
 #define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
 
+/**
+ * VFIO_DEVICE_CCW_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 21)
+ *
+ * Hot reset the channel I/O device. All state of the subchannel will be
+ * cleared.
+ */
+#define VFIO_DEVICE_CCW_HOT_RESET	_IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /* ***************************************************************** */
 
 #endif /* _UAPIVFIO_H */
-- 
2.6.6


* [PATCH RFC 6/8] vfio: ccw: introduce page array interfaces
  2016-04-29 12:11 ` [Qemu-devel] " Dong Jia Shi
@ 2016-04-29 12:11   ` Dong Jia Shi
  -1 siblings, 0 replies; 36+ messages in thread
From: Dong Jia Shi @ 2016-04-29 12:11 UTC (permalink / raw)
  To: kvm, linux-s390, qemu-devel
  Cc: bjsdjshi, renxiaof, cornelia.huck, borntraeger, agraf, alex.williamson

CCW translation requires pinning and unpinning sets of memory pages
frequently, and we currently lack support for doing this efficiently.
Introduce a page_array data structure and helper functions to handle
these pin/unpin operations.

Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
---
 drivers/vfio/ccw/Makefile   |   2 +-
 drivers/vfio/ccw/ccwchain.c | 128 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 129 insertions(+), 1 deletion(-)
 create mode 100644 drivers/vfio/ccw/ccwchain.c

diff --git a/drivers/vfio/ccw/Makefile b/drivers/vfio/ccw/Makefile
index ea14ca9..ac62330 100644
--- a/drivers/vfio/ccw/Makefile
+++ b/drivers/vfio/ccw/Makefile
@@ -1,2 +1,2 @@
-vfio-ccw-y := vfio_ccw.o
+vfio-ccw-y := vfio_ccw.o ccwchain.o
 obj-$(CONFIG_VFIO_CCW) += vfio-ccw.o
diff --git a/drivers/vfio/ccw/ccwchain.c b/drivers/vfio/ccw/ccwchain.c
new file mode 100644
index 0000000..03b4e82
--- /dev/null
+++ b/drivers/vfio/ccw/ccwchain.c
@@ -0,0 +1,128 @@
+/*
+ * ccwchain interfaces
+ *
+ * Copyright IBM Corp. 2016
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License (version 2 only)
+ * as published by the Free Software Foundation.
+ *
+ * Author(s): Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
+ *            Xiao Feng Ren <renxiaof@linux.vnet.ibm.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+
+struct page_array {
+	u64			hva;
+	int			nr;
+	struct page		**items;
+};
+
+struct page_arrays {
+	struct page_array	*parray;
+	int			nr;
+};
+
+/*
+ * Helpers to operate page_array.
+ */
+/*
+ * page_array_pin() - pin user pages in memory
+ * @p: page_array on which to perform the operation
+ *
+ * Attempt to pin user pages in memory.
+ *
+ * Usage of page_array:
+ * @p->hva      starting user address. Assigned by caller.
+ * @p->nr       on entry: number of pages from @p->hva to pin, set by the
+ *              caller; on exit: number of pages actually pinned.
+ * @p->items    array that receives pointers to the pages pinned. Allocated by
+ *              caller.
+ *
+ * Returns:
+ *   Number of pages pinned on success. If @p->nr is 0 or negative, returns 0.
+ *   If no pages were pinned, returns -errno.
+ */
+static int page_array_pin(struct page_array *p)
+{
+	int i, nr;
+
+	nr = get_user_pages_fast(p->hva, p->nr, 1, p->items);
+	if (nr <= 0) {
+		p->nr = 0;
+		return nr;
+	} else if (nr != p->nr) {
+		for (i = 0; i < nr; i++)
+			put_page(p->items[i]);
+		p->nr = 0;
+		return -ENOMEM;
+	}
+
+	return nr;
+}
+
+/* Unpin the items before releasing the memory. */
+static void page_array_items_unpin_free(struct page_array *p)
+{
+	int i;
+
+	for (i = 0; i < p->nr; i++)
+		put_page(p->items[i]);
+
+	p->nr = 0;
+	kfree(p->items);
+}
+
+/* Alloc memory for items, then pin pages with them. */
+static int page_array_items_alloc_pin(u64 hva,
+				      unsigned int len,
+				      struct page_array *p)
+{
+	int ret;
+
+	if (!len || p->nr)
+		return -EINVAL;
+
+	p->hva = hva;
+
+	p->nr = ((hva & ~PAGE_MASK) + len + (PAGE_SIZE - 1)) >> PAGE_SHIFT;
+	if (!p->nr)
+		return -EINVAL;
+
+	p->items = kcalloc(p->nr, sizeof(*p->items), GFP_KERNEL);
+	if (!p->items)
+		return -ENOMEM;
+
+	ret = page_array_pin(p);
+	if (ret <= 0)
+		kfree(p->items);
+
+	return ret;
+}
+
+static int page_arrays_init(struct page_arrays *ps, int nr)
+{
+	ps->parray = kcalloc(nr, sizeof(*ps->parray), GFP_KERNEL);
+	if (!ps->parray) {
+		ps->nr = 0;
+		return -ENOMEM;
+	}
+
+	ps->nr = nr;
+	return 0;
+}
+
+static void page_arrays_unpin_free(struct page_arrays *ps)
+{
+	int i;
+
+	for (i = 0; i < ps->nr; i++)
+		page_array_items_unpin_free(ps->parray + i);
+
+	kfree(ps->parray);
+
+	ps->parray = NULL;
+	ps->nr = 0;
+}
-- 
2.6.6

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH RFC 6/8] vfio: ccw: introduce page array interfaces
@ 2016-04-29 12:11   ` Dong Jia Shi
  0 siblings, 0 replies; 36+ messages in thread
From: Dong Jia Shi @ 2016-04-29 12:11 UTC (permalink / raw)
  To: kvm, linux-s390, qemu-devel
  Cc: bjsdjshi, renxiaof, cornelia.huck, borntraeger, agraf, alex.williamson

CCW translation requires to pin/unpin sets of mem pages frequently.
Currently we have a lack of support to do this in an efficient way.
So we introduce page_array data structure and helper functions to
handle pin/unpin operations here.

Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
---
 drivers/vfio/ccw/Makefile   |   2 +-
 drivers/vfio/ccw/ccwchain.c | 128 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 129 insertions(+), 1 deletion(-)
 create mode 100644 drivers/vfio/ccw/ccwchain.c

diff --git a/drivers/vfio/ccw/Makefile b/drivers/vfio/ccw/Makefile
index ea14ca9..ac62330 100644
--- a/drivers/vfio/ccw/Makefile
+++ b/drivers/vfio/ccw/Makefile
@@ -1,2 +1,2 @@
-vfio-ccw-y := vfio_ccw.o
+vfio-ccw-y := vfio_ccw.o ccwchain.o
 obj-$(CONFIG_VFIO_CCW) += vfio-ccw.o
diff --git a/drivers/vfio/ccw/ccwchain.c b/drivers/vfio/ccw/ccwchain.c
new file mode 100644
index 0000000..03b4e82
--- /dev/null
+++ b/drivers/vfio/ccw/ccwchain.c
@@ -0,0 +1,128 @@
+/*
+ * ccwchain interfaces
+ *
+ * Copyright IBM Corp. 2016
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License (version 2 only)
+ * as published by the Free Software Foundation.
+ *
+ * Author(s): Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
+ *            Xiao Feng Ren <renxiaof@linux.vnet.ibm.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+
+struct page_array {
+	u64			hva;
+	int			nr;
+	struct page		**items;
+};
+
+struct page_arrays {
+	struct page_array	*parray;
+	int			nr;
+};
+
+/*
+ * Helpers to operate page_array.
+ */
+/*
+ * page_array_pin() - pin user pages in memory
+ * @p: page_array on which to perform the operation
+ *
+ * Attempt to pin user pages in memory.
+ *
+ * Usage of page_array:
+ * @p->hva      starting user address. Assigned by caller.
+ * @p->nr       number of pages from @p->hva to pin. Assigned by caller.
+ *              number of pages pinned. Assigned by callee.
+ * @p->items    array that receives pointers to the pages pinned. Allocated by
+ *              caller.
+ *
+ * Returns:
+ *   Number of pages pinned on success. If @p->nr is 0 or negative, returns 0.
+ *   If no pages were pinned, returns -errno.
+ */
+static int page_array_pin(struct page_array *p)
+{
+	int i, nr;
+
+	nr = get_user_pages_fast(p->hva, p->nr, 1, p->items);
+	if (nr <= 0) {
+		p->nr = 0;
+		return nr;
+	} else if (nr != p->nr) {
+		for (i = 0; i < nr; i++)
+			put_page(p->items[i]);
+		p->nr = 0;
+		return -ENOMEM;
+	}
+
+	return nr;
+}
+
+/* Unpin the items before releasing the memory. */
+static void page_array_items_unpin_free(struct page_array *p)
+{
+	int i;
+
+	for (i = 0; i < p->nr; i++)
+		put_page(p->items[i]);
+
+	p->nr = 0;
+	kfree(p->items);
+}
+
+/* Allocate memory for the items array, then pin the user pages. */
+static int page_array_items_alloc_pin(u64 hva,
+				      unsigned int len,
+				      struct page_array *p)
+{
+	int ret;
+
+	if (!len || p->nr)
+		return -EINVAL;
+
+	p->hva = hva;
+
+	p->nr = ((hva & ~PAGE_MASK) + len + (PAGE_SIZE - 1)) >> PAGE_SHIFT;
+	if (!p->nr)
+		return -EINVAL;
+
+	p->items = kcalloc(p->nr, sizeof(*p->items), GFP_KERNEL);
+	if (!p->items)
+		return -ENOMEM;
+
+	ret = page_array_pin(p);
+	if (ret <= 0)
+		kfree(p->items);
+
+	return ret;
+}
+
+static int page_arrays_init(struct page_arrays *ps, int nr)
+{
+	ps->parray = kcalloc(nr, sizeof(*ps->parray), GFP_KERNEL);
+	if (!ps->parray) {
+		ps->nr = 0;
+		return -ENOMEM;
+	}
+
+	ps->nr = nr;
+	return 0;
+}
+
+static void page_arrays_unpin_free(struct page_arrays *ps)
+{
+	int i;
+
+	for (i = 0; i < ps->nr; i++)
+		page_array_items_unpin_free(ps->parray + i);
+
+	kfree(ps->parray);
+
+	ps->parray = NULL;
+	ps->nr = 0;
+}
-- 
2.6.6


* [PATCH RFC 7/8] vfio: ccw: introduce ccw chain interfaces
  2016-04-29 12:11 ` [Qemu-devel] " Dong Jia Shi
@ 2016-04-29 12:11   ` Dong Jia Shi
  -1 siblings, 0 replies; 36+ messages in thread
From: Dong Jia Shi @ 2016-04-29 12:11 UTC (permalink / raw)
  To: kvm, linux-s390, qemu-devel
  Cc: bjsdjshi, renxiaof, cornelia.huck, borntraeger, agraf, alex.williamson

Introduce ccwchain structure and helper functions that can be used to
handle special ccw programs issued from user-space.

The following limitations apply:
1. Supports only prefetch enabled mode.
2. Supports direct ccw chaining by translating them to idal ccws.
3. Supports idal(c64) ccw chaining.

These interfaces are designed to support translation only for special
ccw programs, which are generated and formatted by a user-space
program. This makes it possible for VFIO to leverage the interfaces
to realize channel I/O device drivers in user-space.

User-space programs should prepare the ccws according to the rules
below:
1. Allocate a 4K memory buffer in user-space to store all of the ccw
   program information.
2. Lower 2K of the buffer are used to store a maximum of 256 ccws.
3. Upper 2K of the buffer are used to store a maximum of 256
   corresponding cda data sets, each having a length of 8 bytes.
4. All of the ccws should be placed one after another.
5. For direct and idal ccw:
   - Find a free cda data entry, and find its offset to the address
     of the cda buffer.
   - Store the offset as the CDA value in the ccw.
   - Store the user virtual address of the data (idaw) as the data of
     the cda entry.
6. For tic ccw:
   - Find the target ccw, and find its offset to the address of the
     ccw buffer.
   - Store the offset as the CDA value in the ccw.

Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
---
 drivers/vfio/ccw/ccwchain.c | 441 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/vfio/ccw/ccwchain.h |  49 +++++
 2 files changed, 490 insertions(+)
 create mode 100644 drivers/vfio/ccw/ccwchain.h

diff --git a/drivers/vfio/ccw/ccwchain.c b/drivers/vfio/ccw/ccwchain.c
index 03b4e82..964b6479 100644
--- a/drivers/vfio/ccw/ccwchain.c
+++ b/drivers/vfio/ccw/ccwchain.c
@@ -11,8 +11,19 @@
  *            Xiao Feng Ren <renxiaof@linux.vnet.ibm.com>
  */
 
+#include <asm/ccwdev.h>
+#include <asm/idals.h>
+#include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/slab.h>
+#include "ccwchain.h"
+
+/*
+ * Max length for ccw chain.
+ * XXX: Limit to 256, need to check more?
+ */
+#define CCWCHAIN_LEN_MAX	256
+#define CDA_ITEM_SIZE		3 /* sizeof(u64) == (1 << 3) */
 
 struct page_array {
 	u64			hva;
@@ -25,6 +36,20 @@ struct page_arrays {
 	int			nr;
 };
 
+struct ccwchain_buf {
+	struct ccw1		ccw[CCWCHAIN_LEN_MAX];
+	u64			cda[CCWCHAIN_LEN_MAX];
+};
+
+struct ccwchain {
+	struct ccwchain_buf	buf;
+
+	/* Number of valid ccws in the chain */
+	int			nr;
+	/* Pinned pages for the original data. */
+	struct page_arrays	*pss;
+};
+
 /*
  * Helpers to operate page_array.
  */
@@ -126,3 +151,419 @@ static void page_arrays_unpin_free(struct page_arrays *ps)
 	ps->parray = NULL;
 	ps->nr = 0;
 }
+
+/*
+ * Helpers to operate ccwchain.
+ */
+/* Return the number of idal words needed for an address/length pair. */
+static inline unsigned int ccwchain_idal_nr_words(u64 addr, unsigned int length)
+{
+	/*
+	 * A user virtual address and its corresponding kernel physical
+	 * address have the same offset within a page, since pages are
+	 * pinned at page granularity.
+	 * Although idal_nr_words expects a virtual address as its first
+	 * parameter, only the page offset matters, so it is fine to pass
+	 * either the hva or the hpa as the input.
+	 */
+	return idal_nr_words((void *)(addr), length);
+}
+
+/* Create the list of idal words for a page_arrays. */
+static inline void ccwchain_idal_create_words(unsigned long *idaws,
+					      struct page_arrays *ps)
+{
+	int i, j, k;
+
+	/*
+	 * Idal words (except the first one) rely on the memory being 4K
+	 * aligned. If a user virtual address is 4K aligned, then its
+	 * corresponding kernel physical address will also be 4K aligned.
+	 * Thus it is safe here to simply use the hpa to create an
+	 * idaw.
+	 */
+	k = 0;
+	for (i = 0; i < ps->nr; i++)
+		for (j = 0; j < ps->parray[i].nr; j++) {
+			idaws[k] = page_to_phys(ps->parray[i].items[j]);
+			if (k == 0)
+				idaws[k] += ps->parray[i].hva & ~PAGE_MASK;
+			k++;
+		}
+}
+
+#define ccw_is_test(_ccw) (((_ccw)->cmd_code & 0x0F) == 0)
+
+#define ccw_is_noop(_ccw) ((_ccw)->cmd_code == CCW_CMD_NOOP)
+
+#define ccw_is_tic(_ccw) ((_ccw)->cmd_code == CCW_CMD_TIC)
+
+#define ccw_is_idal(_ccw) ((_ccw)->flags & CCW_FLAG_IDA)
+
+/* Free resource for a ccw that allocated memory for its cda. */
+static void ccw_chain_cda_free(struct ccwchain *chain, int idx)
+{
+	struct ccw1 *ccw = chain->buf.ccw + idx;
+
+	if (!ccw->count)
+		return;
+
+	kfree((void *)(u64)ccw->cda);
+}
+
+/* Unpin the pages then free the memory resources. */
+static void ccw_chain_unpin_free(struct ccwchain *chain)
+{
+	int i;
+
+	if (!chain)
+		return;
+
+	for (i = 0; i < chain->nr; i++) {
+		page_arrays_unpin_free(chain->pss + i);
+		ccw_chain_cda_free(chain, i);
+	}
+
+	kfree(chain->pss);
+	kfree(chain);
+}
+
+static int ccw_chain_fetch_tic(struct ccwchain *chain, int idx)
+{
+	struct ccw1 *ccw = chain->buf.ccw + idx;
+
+	if (ccw->cda >= sizeof(chain->buf.ccw))
+		return -EINVAL;
+
+	/*
+	 * tic_ccw.cda stores the offset from the address of the first ccw
+	 * of the chain. Here we update its value with the real address.
+	 */
+	ccw->cda += virt_to_phys(chain->buf.ccw);
+
+	return 0;
+}
+
+static int ccw_chain_fetch_direct(struct ccwchain *chain, int idx)
+{
+	struct ccw1 *ccw;
+	struct page_arrays *ps;
+	unsigned long *idaws;
+	u64 cda_hva;
+	int i, cidaw;
+
+	ccw = chain->buf.ccw + idx;
+
+	/*
+	 * direct_ccw.cda stores the offset of its cda data in the cda buffer.
+	 */
+	i = ccw->cda >> CDA_ITEM_SIZE;
+	if (i >= CCWCHAIN_LEN_MAX)
+		return -EINVAL;
+	cda_hva = chain->buf.cda[i];
+	if (IS_ERR_VALUE(cda_hva))
+		return -EFAULT;
+
+	/*
+	 * Pin data page(s) in memory.
+	 * The number of pages is also the count of the idaws that will be
+	 * needed when translating a direct ccw to an idal ccw.
+	 */
+	ps = chain->pss + idx;
+	if (page_arrays_init(ps, 1))
+		return -ENOMEM;
+	cidaw = page_array_items_alloc_pin(cda_hva, ccw->count, ps->parray);
+	if (cidaw <= 0)
+		return cidaw;
+
+	/* Translate this direct ccw to an idal ccw. */
+	idaws = kcalloc(cidaw, sizeof(*idaws), GFP_DMA | GFP_KERNEL);
+	if (!idaws) {
+		page_arrays_unpin_free(ps);
+		return -ENOMEM;
+	}
+	ccw->cda = (__u32) virt_to_phys(idaws);
+	ccw->flags |= CCW_FLAG_IDA;
+
+	ccwchain_idal_create_words(idaws, ps);
+
+	return 0;
+}
+
+static int ccw_chain_fetch_idal(struct ccwchain *chain, int idx)
+{
+	struct ccw1 *ccw;
+	struct page_arrays *ps;
+	unsigned long *idaws;
+	unsigned int cidaw, idaw_len;
+	int i, ret;
+	u64 cda_hva, idaw_hva;
+
+	ccw = chain->buf.ccw + idx;
+
+	/* idal_ccw.cda stores the offset of its cda data in the cda buffer. */
+	i = ccw->cda >> CDA_ITEM_SIZE;
+	if (i >= CCWCHAIN_LEN_MAX)
+		return -EINVAL;
+	cda_hva = chain->buf.cda[i];
+	if (IS_ERR_VALUE(cda_hva))
+		return -EFAULT;
+
+	/* Calculate size of idaws. */
+	ret = copy_from_user(&idaw_hva, (void __user *)cda_hva, sizeof(*idaws));
+	if (ret)
+		return -EFAULT;
+
+	cidaw = ccwchain_idal_nr_words(idaw_hva, ccw->count);
+	idaw_len = cidaw * sizeof(*idaws);
+
+	/* Pin data page(s) in memory. */
+	ps = chain->pss + idx;
+	ret = page_arrays_init(ps, cidaw);
+	if (ret)
+		return ret;
+
+	/* Translate idal ccw to use new allocated idaws. */
+	idaws = kzalloc(idaw_len, GFP_DMA | GFP_KERNEL);
+	if (!idaws) {
+		ret = -ENOMEM;
+		goto out_unpin;
+	}
+
+	if (copy_from_user(idaws, (void __user *)cda_hva, idaw_len)) {
+		ret = -EFAULT;
+		goto out_free_idaws;
+	}
+	ccw->cda = virt_to_phys(idaws);
+
+	for (i = 0; i < cidaw; i++) {
+		idaw_hva = *(idaws + i);
+		if (IS_ERR_VALUE(idaw_hva)) {
+			ret = -EFAULT;
+			goto out_free_idaws;
+		}
+
+		ret = page_array_items_alloc_pin(idaw_hva, 1, ps->parray + i);
+		if (ret <= 0)
+			goto out_free_idaws;
+	}
+
+	ccwchain_idal_create_words(idaws, ps);
+
+	return 0;
+
+out_free_idaws:
+	kfree(idaws);
+out_unpin:
+	page_arrays_unpin_free(ps);
+	return ret;
+}
+
+/*
+ * Fetch one ccw.
+ * To reduce memory copying, we pin the cda page(s) in memory,
+ * and to get rid of the 2G cda limitation of ccw1, we translate
+ * direct ccws to idal ccws.
+ */
+static int ccw_chain_fetch_one(struct ccwchain *chain, int idx)
+{
+	struct ccw1 *ccw = chain->buf.ccw + idx;
+
+	if (ccw_is_test(ccw) || ccw_is_noop(ccw))
+		return 0;
+
+	if (ccw_is_tic(ccw))
+		return ccw_chain_fetch_tic(chain, idx);
+
+	if (ccw_is_idal(ccw))
+		return ccw_chain_fetch_idal(chain, idx);
+
+	return ccw_chain_fetch_direct(chain, idx);
+}
+
+static int ccw_chain_copy_from_user(struct ccwchain_cmd *cmd)
+{
+	struct ccwchain *chain;
+	int ret;
+
+	if (!cmd->nr || cmd->nr > CCWCHAIN_LEN_MAX) {
+		ret = -EINVAL;
+		goto out_error;
+	}
+
+	chain = kzalloc(sizeof(*chain), GFP_DMA | GFP_KERNEL);
+	if (!chain) {
+		ret = -ENOMEM;
+		goto out_error;
+	}
+
+	chain->nr = cmd->nr;
+
+	/* Copy current chain from user. */
+	if (copy_from_user(&chain->buf, (void __user *)cmd->u_ccwchain,
+			   sizeof(chain->buf))) {
+		ret = -EFAULT;
+		goto out_free_chain;
+	}
+
+	/* Alloc memory for page_arrays. */
+	chain->pss = kcalloc(chain->nr, sizeof(*chain->pss), GFP_KERNEL);
+	if (!chain->pss) {
+		ret = -ENOMEM;
+		goto out_free_chain;
+	}
+
+	cmd->k_ccwchain = chain;
+
+	return 0;
+
+out_free_chain:
+	kfree(chain);
+out_error:
+	cmd->k_ccwchain = NULL;
+	return ret;
+}
+
+/**
+ * ccwchain_alloc() - allocate resources for a ccw chain.
+ * @cmd: ccwchain command on which to perform the operation
+ *
+ * This function is a wrapper around ccw_chain_copy_from_user().
+ *
+ * This creates a ccwchain and allocates a memory buffer that can hold at most
+ * @cmd->nr ccws. It then copies the user-space ccw program from
+ * @cmd->u_ccwchain into the buffer, and stores the address of the ccwchain
+ * in @cmd->k_ccwchain as the output.
+ *
+ * Returns:
+ *   %0 on success and a negative error value on failure.
+ */
+int ccwchain_alloc(struct ccwchain_cmd *cmd)
+{
+	return ccw_chain_copy_from_user(cmd);
+}
+
+/**
+ * ccwchain_free() - free resources for a ccw chain.
+ * @cmd: ccwchain command on which to perform the operation
+ *
+ * This function is a wrapper around ccw_chain_unpin_free().
+ *
+ * This unpins the memory pages and frees the memory occupied by the ccwchain
+ * referenced by @cmd->k_ccwchain, which must have been set by a previous call
+ * to ccwchain_alloc(). Otherwise, undefined behavior occurs.
+ */
+void ccwchain_free(struct ccwchain_cmd *cmd)
+{
+	ccw_chain_unpin_free(cmd->k_ccwchain);
+}
+
+/**
+ * ccwchain_prefetch() - translate a user-space ccw program to a real-device
+ *                       runnable ccw program.
+ * @cmd: ccwchain command on which to perform the operation
+ *
+ * This function translates the user-space ccw program (@cmd->u_ccwchain) and
+ * stores the result to @cmd->k_ccwchain. @cmd must have been returned by a
+ * previous call to ccwchain_alloc(). Otherwise, undefined behavior occurs.
+ *
+ * The S/390 CCW Translation APIs (prefixed by 'ccwchain_') are introduced as
+ * helpers to do ccw chain translation inside the kernel. Basically they accept
+ * a special ccw program issued by a user-space process, and translate the ccw
+ * program to a real-device runnable ccw program.
+ *
+ * The ccws passed in should be well organized in a user-space buffer, using
+ * virtual memory addresses and offsets inside the buffer. These APIs will copy
+ * the ccws into a kernel-space buffer, and update the virtual addresses and the
+ * offsets with their corresponding physical addresses. Then channel I/O device
+ * drivers could issue the translated ccw program to real devices to perform an
+ * I/O operation.
+ *
+ * User-space ccw program format:
+ * These interfaces are designed to support translation only for special ccw
+ * programs, which are generated and formatted by a user-space program. This
+ * makes it possible for things like VFIO to leverage the interfaces to
+ * realize channel I/O device drivers in user-space.
+ *
+ * User-space programs should prepare the ccws according to the rules below:
+ * 1. Allocate a 4K memory buffer in user-space to store all of the ccw
+ *    program information.
+ * 2. Lower 2K of the buffer are used to store a maximum of 256 ccws.
+ * 3. Upper 2K of the buffer are used to store a maximum of 256 corresponding
+ *    cda data sets, each having a length of 8 bytes.
+ * 4. All of the ccws should be placed one after another.
+ * 5. For direct and idal ccws:
+ *    - Find a free cda data entry, and find its offset from the address of
+ *      the cda buffer.
+ *    - Store the offset as the CDA value in the ccw.
+ *    - Store the user virtual address of the data (idaw) as the data of the
+ *      cda entry.
+ * 6. For tic ccws:
+ *    - Find the target ccw, and find its offset from the address of the ccw
+ *      buffer.
+ *    - Store the offset as the CDA value in the ccw.
+ *
+ * Limitations:
+ * 1. Supports only prefetch enabled mode.
+ * 2. Supports direct ccw chaining by translating them to idal ccws.
+ * 3. Supports idal(c64) ccw chaining.
+ *
+ * Returns:
+ *   %0 on success and a negative error value on failure.
+ */
+int ccwchain_prefetch(struct ccwchain_cmd *cmd)
+{
+	int ret, i;
+	struct ccwchain *chain = cmd->k_ccwchain;
+
+	for (i = 0; i < chain->nr; i++) {
+		ret = ccw_chain_fetch_one(chain, i);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * ccwchain_get_cpa() - get the ccw program address of a ccwchain
+ * @cmd: ccwchain command on which to perform the operation
+ *
+ * This function returns the address of the translated kernel ccw program.
+ * Channel I/O device drivers could issue this address to real devices to
+ * perform an I/O operation.
+ */
+struct ccw1 *ccwchain_get_cpa(struct ccwchain_cmd *cmd)
+{
+	return ((struct ccwchain *)cmd->k_ccwchain)->buf.ccw;
+}
+
+/**
+ * ccwchain_update_scsw() - update scsw for a ccw chain.
+ * @cmd: ccwchain command on which to perform the operation
+ * @scsw: I/O result of the ccw program and also the target to be updated
+ *
+ * @scsw contains the I/O results of the ccw program pointed to by @cmd.
+ * However, what @scsw->cpa stores is a kernel physical address, which is
+ * meaningless to the user-space program that is waiting for the I/O results.
+ *
+ * This function updates @scsw->cpa to its corresponding user-space ccw
+ * address (an offset inside the user-space ccw buffer).
+ */
+void ccwchain_update_scsw(struct ccwchain_cmd *cmd, union scsw *scsw)
+{
+	u32 cpa = scsw->cmd.cpa;
+	struct ccwchain *chain = cmd->k_ccwchain;
+
+	/*
+	 * LATER:
+	 * For now, only update the cmd.cpa part. We may need to deal with
+	 * other portions of the schib as well, even if we don't return them
+	 * in the ioctl directly. Path status changes etc.
+	 */
+	cpa = cpa - (u32)(u64)(chain->buf.ccw);
+	if (cpa & (1 << 31))
+		cpa &= (1 << 31) - 1U;
+
+	scsw->cmd.cpa = cpa;
+}
diff --git a/drivers/vfio/ccw/ccwchain.h b/drivers/vfio/ccw/ccwchain.h
new file mode 100644
index 0000000..b72ac2a
--- /dev/null
+++ b/drivers/vfio/ccw/ccwchain.h
@@ -0,0 +1,49 @@
+/*
+ * ccwchain interfaces
+ *
+ * Copyright IBM Corp. 2016
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License (version 2 only)
+ * as published by the Free Software Foundation.
+ *
+ * Author(s): Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
+ *            Xiao Feng Ren <renxiaof@linux.vnet.ibm.com>
+ */
+
+#ifndef _CCW_CHAIN_H_
+#define _CCW_CHAIN_H_
+
+#include <asm/cio.h>
+#include <asm/scsw.h>
+
+/**
+ * struct ccwchain_cmd - manage information for ccw program
+ * @u_ccwchain: handle of a user-space ccw program
+ * @k_ccwchain: handle of a kernel-space ccw program
+ * @nr: number of ccws in the ccw program
+ *
+ * @u_ccwchain is the user-space virtual address of a buffer where a user-space
+ * ccw program is stored. The size of this buffer is 4K, of which the lower 2K
+ * is for the ccws and the upper 2K for cda data.
+ *
+ * @k_ccwchain is the kernel-space address of a ccwchain struct that holds
+ * the translated result of @u_ccwchain. This is opaque to user-space
+ * programs.
+ *
+ * @nr is the number of ccws in both user-space ccw program and kernel-space ccw
+ * program.
+ */
+struct ccwchain_cmd {
+	void *u_ccwchain;
+	void *k_ccwchain;
+	int nr;
+};
+
+extern int ccwchain_alloc(struct ccwchain_cmd *cmd);
+extern void ccwchain_free(struct ccwchain_cmd *cmd);
+extern int ccwchain_prefetch(struct ccwchain_cmd *cmd);
+extern struct ccw1 *ccwchain_get_cpa(struct ccwchain_cmd *cmd);
+extern void ccwchain_update_scsw(struct ccwchain_cmd *cmd, union scsw *scsw);
+
+#endif
-- 
2.6.6

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH RFC 7/8] vfio: ccw: introduce ccw chain interfaces
@ 2016-04-29 12:11   ` Dong Jia Shi
  0 siblings, 0 replies; 36+ messages in thread
From: Dong Jia Shi @ 2016-04-29 12:11 UTC (permalink / raw)
  To: kvm, linux-s390, qemu-devel
  Cc: bjsdjshi, renxiaof, cornelia.huck, borntraeger, agraf, alex.williamson

Introduce ccwchain structure and helper functions that can be used to
handle special ccw programs issued from user-space.

The following limitations apply:
1. Supports only prefetch enabled mode.
2. Supports direct ccw chaining by translating them to idal ccws.
3. Supports idal(c64) ccw chaining.

These interfaces are designed to support translation only for special
ccw programs, which are generated and formatted by a user-space
program. Thus this will make it possible for VFIO to leverage the
interfaces to realize channel I/O device drivers in user-space.

User-space programs should prepare the ccws according to the rules
below:
1. Allocate a 4K memory buffer in user-space to store all of the ccw
   program information.
2. Lower 2k of the buffer are used to store a maximum of 256 ccws.
3. Upper 2k of the buffer are used to store a maximum of 256
   corresponding cda data sets, each having a length of 8 bytes.
4. All of the ccws should be placed one after another.
5. For direct and idal ccw:
   - Find a free cda data entry, and find its offset to the address
     of the cda buffer.
   - Store the offset as the CDA value in the ccw.
   - Store the user virtual address of the data (idaw) as the data of
     the cda entry.
6. For tic ccw:
   - Find the target ccw, and find its offset to the address of the
     ccw buffer.
   - Store the offset as the CDA value in the ccw.

Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
---
 drivers/vfio/ccw/ccwchain.c | 441 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/vfio/ccw/ccwchain.h |  49 +++++
 2 files changed, 490 insertions(+)
 create mode 100644 drivers/vfio/ccw/ccwchain.h

diff --git a/drivers/vfio/ccw/ccwchain.c b/drivers/vfio/ccw/ccwchain.c
index 03b4e82..964b6479 100644
--- a/drivers/vfio/ccw/ccwchain.c
+++ b/drivers/vfio/ccw/ccwchain.c
@@ -11,8 +11,19 @@
  *            Xiao Feng Ren <renxiaof@linux.vnet.ibm.com>
  */
 
+#include <asm/ccwdev.h>
+#include <asm/idals.h>
+#include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/slab.h>
+#include "ccwchain.h"
+
+/*
+ * Max length for ccw chain.
+ * XXX: Limit to 256, need to check more?
+ */
+#define CCWCHAIN_LEN_MAX	256
+#define CDA_ITEM_SIZE		3 /* sizeof(u64) == (1 << 3) */
 
 struct page_array {
 	u64			hva;
@@ -25,6 +36,20 @@ struct page_arrays {
 	int			nr;
 };
 
+struct ccwchain_buf {
+	struct ccw1		ccw[CCWCHAIN_LEN_MAX];
+	u64			cda[CCWCHAIN_LEN_MAX];
+};
+
+struct ccwchain {
+	struct ccwchain_buf	buf;
+
+	/* Valid ccw number in chain */
+	int			nr;
+	/* Pinned PAGEs for the original data. */
+	struct page_arrays	*pss;
+};
+
 /*
  * Helpers to operate page_array.
  */
@@ -126,3 +151,419 @@ static void page_arrays_unpin_free(struct page_arrays *ps)
 	ps->parray = NULL;
 	ps->nr = 0;
 }
+
+/*
+ * Helpers to operate ccwchain.
+ */
+/* Return the number of idal words needed for an address/length pair. */
+static inline unsigned int ccwchain_idal_nr_words(u64 addr, unsigned int length)
+{
+	/*
+	 * User virtual address and its corresponding kernel physical address
+	 * are aligned by pages. Thus their offsets to the page boundary will be
+	 * the same.
+	 * Althought idal_nr_words expects a virtual address as its first param,
+	 * it is the offset that matters. It's fine to use either hva or hpa as
+	 * the input, since they have the same offset inside a page.
+	 */
+	return idal_nr_words((void *)(addr), length);
+}
+
+/* Create the list idal words for a page_arrays. */
+static inline void ccwchain_idal_create_words(unsigned long *idaws,
+					      struct page_arrays *ps)
+{
+	int i, j, k;
+
+	/*
+	 * Idal words (execept the first one) rely on the memory being 4k
+	 * aligned. If a user virtual address is 4K aligned, then it's
+	 * corresponding kernel physical address will also be 4K aligned. Thus
+	 * there will be no problem here to simply use the hpa to create an
+	 * idaw.
+	 */
+	k = 0;
+	for (i = 0; i < ps->nr; i++)
+		for (j = 0; j < ps->parray[i].nr; j++) {
+			idaws[k] = page_to_phys(ps->parray[i].items[j]);
+			if (k == 0)
+				idaws[k] += ps->parray[i].hva & ~PAGE_MASK;
+			k++;
+		}
+}
+
+#define ccw_is_test(_ccw) (((_ccw)->cmd_code & 0x0F) == 0)
+
+#define ccw_is_noop(_ccw) ((_ccw)->cmd_code == CCW_CMD_NOOP)
+
+#define ccw_is_tic(_ccw) ((_ccw)->cmd_code == CCW_CMD_TIC)
+
+#define ccw_is_idal(_ccw) ((_ccw)->flags & CCW_FLAG_IDA)
+
+/* Free resource for a ccw that allocated memory for its cda. */
+static void ccw_chain_cda_free(struct ccwchain *chain, int idx)
+{
+	struct ccw1 *ccw = chain->buf.ccw + idx;
+
+	if (!ccw->count)
+		return;
+
+	kfree((void *)(u64)ccw->cda);
+}
+
+/* Unpin the pages then free the memory resources. */
+static void ccw_chain_unpin_free(struct ccwchain *chain)
+{
+	int i;
+
+	if (!chain)
+		return;
+
+	for (i = 0; i < chain->nr; i++) {
+		page_arrays_unpin_free(chain->pss + i);
+		ccw_chain_cda_free(chain, i);
+	}
+
+	kfree(chain->pss);
+	kfree(chain);
+}
+
+static int ccw_chain_fetch_tic(struct ccwchain *chain, int idx)
+{
+	struct ccw1 *ccw = chain->buf.ccw + idx;
+
+	if (ccw->cda >= sizeof(chain->buf.ccw))
+		return -EINVAL;
+
+	/*
+	 * tic_ccw.cda stores the offset to the address of the first ccw
+	 * of the chain. Here we update its value with the the real address.
+	 */
+	ccw->cda += virt_to_phys(chain->buf.ccw);
+
+	return 0;
+}
+
+static int ccw_chain_fetch_direct(struct ccwchain *chain, int idx)
+{
+	struct ccw1 *ccw;
+	struct page_arrays *ps;
+	unsigned long *idaws;
+	u64 cda_hva;
+	int i, cidaw;
+
+	ccw = chain->buf.ccw + idx;
+
+	/*
+	 * direct_ccw.cda stores the offset of its cda data in the cda buffer.
+	 */
+	i = ccw->cda >> CDA_ITEM_SIZE;
+	if (i < 0)
+		return -EINVAL;
+	cda_hva = chain->buf.cda[i];
+	if (IS_ERR_VALUE(cda_hva))
+		return -EFAULT;
+
+	/*
+	 * Pin data page(s) in memory.
+	 * The number of pages actually is the count of the idaws which will be
+	 * needed when translating a direct ccw to a idal ccw.
+	 */
+	ps = chain->pss + idx;
+	if (page_arrays_init(ps, 1))
+		return -ENOMEM;
+	cidaw = page_array_items_alloc_pin(cda_hva, ccw->count, ps->parray);
+	if (cidaw <= 0)
+		return cidaw;
+
+	/* Translate this direct ccw to a idal ccw. */
+	idaws = kcalloc(cidaw, sizeof(*idaws), GFP_DMA | GFP_KERNEL);
+	if (!idaws) {
+		page_arrays_unpin_free(ps);
+		return -ENOMEM;
+	}
+	ccw->cda = (__u32) virt_to_phys(idaws);
+	ccw->flags |= CCW_FLAG_IDA;
+
+	ccwchain_idal_create_words(idaws, ps);
+
+	return 0;
+}
+
+static int ccw_chain_fetch_idal(struct ccwchain *chain, int idx)
+{
+	struct ccw1 *ccw;
+	struct page_arrays *ps;
+	unsigned long *idaws;
+	unsigned int cidaw, idaw_len;
+	int i, ret;
+	u64 cda_hva, idaw_hva;
+
+	ccw = chain->buf.ccw + idx;
+
+	/* idal_ccw.cda stores the offset of its cda data in the cda buffer. */
+	i = ccw->cda >> CDA_ITEM_SIZE;
+	if (i < 0)
+		return -EINVAL;
+	cda_hva = chain->buf.cda[i];
+	if (IS_ERR_VALUE(cda_hva))
+		return -EFAULT;
+
+	/* Calculate size of idaws. */
+	ret = copy_from_user(&idaw_hva, (void __user *)cda_hva, sizeof(*idaws));
+	if (ret)
+		return ret;
+
+	cidaw = ccwchain_idal_nr_words(idaw_hva, ccw->count);
+	idaw_len = cidaw * sizeof(*idaws);
+
+	/* Pin data page(s) in memory. */
+	ps = chain->pss + idx;
+	ret = page_arrays_init(ps, cidaw);
+	if (ret)
+		return ret;
+
+	/* Translate idal ccw to use new allocated idaws. */
+	idaws = kzalloc(idaw_len, GFP_DMA | GFP_KERNEL);
+	if (!idaws) {
+		ret = -ENOMEM;
+		goto out_unpin;
+	}
+
+	ret = copy_from_user(idaws, (void __user *)cda_hva, idaw_len);
+	if (ret)
+		goto out_free_idaws;
+
+	ccw->cda = virt_to_phys(idaws);
+
+	for (i = 0; i < cidaw; i++) {
+		idaw_hva = *(idaws + i);
+		if (IS_ERR_VALUE(idaw_hva)) {
+			ret = -EFAULT;
+			goto out_free_idaws;
+		}
+
+		ret = page_array_items_alloc_pin(idaw_hva, 1, ps->parray + i);
+		if (ret <= 0)
+			goto out_free_idaws;
+	}
+
+	ccwchain_idal_create_words(idaws, ps);
+
+	return 0;
+
+out_free_idaws:
+	kfree(idaws);
+out_unpin:
+	page_arrays_unpin_free(ps);
+	return ret;
+}
+
+/*
+ * Fetch one ccw.
+ * To reduce memory copy, we'll pin the cda page in memory,
+ * and to get rid of the cda 2G limitiaion of ccw1, we'll translate
+ * direct ccws to idal ccws.
+ */
+static int ccw_chain_fetch_one(struct ccwchain *chain, int idx)
+{
+	struct ccw1 *ccw = chain->buf.ccw + idx;
+
+	if (ccw_is_test(ccw) || ccw_is_noop(ccw))
+		return 0;
+
+	if (ccw_is_tic(ccw))
+		return ccw_chain_fetch_tic(chain, idx);
+
+	if (ccw_is_idal(ccw))
+		return ccw_chain_fetch_idal(chain, idx);
+
+	return ccw_chain_fetch_direct(chain, idx);
+}
+
+static int ccw_chain_copy_from_user(struct ccwchain_cmd *cmd)
+{
+	struct ccwchain *chain;
+	int ret;
+
+	if (!cmd->nr || cmd->nr > CCWCHAIN_LEN_MAX) {
+		ret = -EINVAL;
+		goto out_error;
+	}
+
+	chain = kzalloc(sizeof(*chain), GFP_DMA | GFP_KERNEL);
+	if (!chain) {
+		ret = -ENOMEM;
+		goto out_error;
+	}
+
+	chain->nr = cmd->nr;
+
+	/* Copy current chain from user. */
+	ret = copy_from_user(&chain->buf,
+			     (void __user *)cmd->u_ccwchain,
+			     sizeof(chain->buf));
+	if (ret)
+		goto out_free_chain;
+
+	/* Alloc memory for page_arrays. */
+	chain->pss = kcalloc(chain->nr, sizeof(*chain->pss), GFP_KERNEL);
+	if (!chain->pss) {
+		ret = -ENOMEM;
+		goto out_free_chain;
+	}
+
+	cmd->k_ccwchain = chain;
+
+	return 0;
+
+out_free_chain:
+	kfree(chain);
+out_error:
+	cmd->k_ccwchain = NULL;
+	return ret;
+}
+
+/**
+ * ccwchain_alloc() - allocate resources for a ccw chain.
+ * @cmd: ccwchain command on which to perform the operation
+ *
+ * This function is a wrapper around ccw_chain_copy_from_user().
+ *
+ * This creates a ccwchain and allocates a memory buffer, that could at most
+ * contain @cmd->nr ccws, for the ccwchain. Then it copies user-space ccw
+ * program from @cmd->u_ccwchain to the buffer, and stores the address of the
+ * ccwchain to @cmd->k_ccwchain as the output.
+ *
+ * Returns:
+ *   %0 on success and a negative error value on failure.
+ */
+int ccwchain_alloc(struct ccwchain_cmd *cmd)
+{
+	return ccw_chain_copy_from_user(cmd);
+}
+
+/**
+ * ccwchain_free() - free resources for a ccw chain.
+ * @cmd: ccwchain command on which to perform the operation
+ *
+ * This function is a wrapper around ccw_chain_unpin_free().
+ *
+ * This unpins the memory pages and frees the memory space occupied by @cmd,
+ * which must have been returned by a previous call to ccwchain_alloc().
+ * Otherwise, undefined behavior occurs.
+ */
+void ccwchain_free(struct ccwchain_cmd *cmd)
+{
+	ccw_chain_unpin_free(cmd->k_ccwchain);
+}
+
+/**
+ * ccwchain_prefetch() - translate a user-space ccw program to a real-device
+ *                       runnable ccw program.
+ * @cmd: ccwchain command on which to perform the operation
+ *
+ * This function translates the user-space ccw program (@cmd->u_ccwchain) and
+ * stores the result to @cmd->k_ccwchain. @cmd must have been returned by a
+ * previous call to ccwchain_alloc(). Otherwise, undefined behavior occurs.
+ *
+ * The S/390 CCW Translation APIS (prefixed by 'ccwchain_') are introduced as
+ * helpers to do ccw chain translation inside the kernel. Basically they accept
+ * a special ccw program issued by a user-space process, and translate the ccw
+ * program to a real-device runnable ccw program.
+ *
+ * The ccws passed in should be well organized in a user-space buffer, using
+ * virtual memory addresses and offsets inside the buffer. These APIs will copy
+ * the ccws into a kernel-space buffer, and update the virtual addresses and the
+ * offsets with their corresponding physical addresses. Then channel I/O device
+ * drivers could issue the translated ccw program to real devices to perform an
+ * I/O operation.
+ *
+ * User-space ccw program format:
+ * These interfaces are designed to support translation only for special ccw
+ * programs, which are generated and formatted by a user-space program. Thus
+ * this will make it possible for things like VFIO to leverage the interfaces to
+ * realize channel I/O device drivers in user-space.
+ *
+ * User-space programs should prepare the ccws according to the rules below
+ * 1. Alloc a 4K bytes memory buffer in user-space to store all of the ccw
+ *    program information.
+ * 2. Lower 2K of the buffer are used to store a maximum of 256 ccws.
+ * 3. Upper 2K of the buffer are used to store a maximum of 256 corresponding
+ *    cda data sets, each having a length of 8 bytes.
+ * 4. All of the ccws should be placed one after another.
+ * 5. For direct and idal ccw
+ *    - Find a free cda data entry, and find its offset to the address of the
+ *      cda buffer.
+ *    - Store the offset as the CDA value in the ccw.
+ *    - Store the virtual address of the data(idaw) as the data of the cda
+ *      entry.
+ * 6. For tic ccw
+ *    - Find the target ccw, and find its offset to the address of the ccw
+ *      buffer.
+ *    - Store the offset as the CDA value in the ccw.
+ *
+ * Limitations:
+ * 1. Supports only prefetch enabled mode.
+ * 2. Supports direct ccw chaining by translating them to idal ccws.
+ * 3. Supports idal(c64) ccw chaining.
+ *
+ * Returns:
+ *   %0 on success and a negative error value on failure.
+ */
+int ccwchain_prefetch(struct ccwchain_cmd *cmd)
+{
+	int ret, i;
+	struct ccwchain *chain = cmd->k_ccwchain;
+
+	for (i = 0; i < chain->nr; i++) {
+		ret = ccw_chain_fetch_one(chain, i);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * ccwchain_get_cpa() - get the ccw program address of a ccwchain
+ * @cmd: ccwchain command on which to perform the operation
+ *
+ * This function returns the address of the translated kernel ccw program.
+ * Channel I/O device drivers could issue this address to real devices to
+ * perform an I/O operation.
+ */
+struct ccw1 *ccwchain_get_cpa(struct ccwchain_cmd *cmd)
+{
+	return ((struct ccwchain *)cmd->k_ccwchain)->buf.ccw;
+}
+
+/**
+ * ccwchain_update_scsw() - update scsw for a ccw chain.
+ * @cmd: ccwchain command on which to perform the operation
+ * @scsw: I/O result of the ccw program and also the target to be updated
+ *
+ * @scsw contains the I/O results of the ccw program pointed to by @cmd.
+ * However, @scsw->cpa stores a kernel physical address, which is meaningless
+ * to the user-space program waiting for the I/O results.
+ *
+ * This function updates @scsw->cpa to its corresponding user-space ccw
+ * address (an offset inside the user-space ccw buffer).
+ */
+void ccwchain_update_scsw(struct ccwchain_cmd *cmd, union scsw *scsw)
+{
+	u32 cpa = scsw->cmd.cpa;
+	struct ccwchain *chain = cmd->k_ccwchain;
+
+	/*
+	 * LATER:
+	 * For now, only update the cmd.cpa part. We may need to deal with
+	 * other portions of the schib as well, even if we don't return them
+	 * in the ioctl directly. Path status changes etc.
+	 */
+	cpa = cpa - (u32)(u64)(chain->buf.ccw);
+	if (cpa & (1 << 31))
+		cpa &= (1 << 31) - 1U;
+
+	scsw->cmd.cpa = cpa;
+}
diff --git a/drivers/vfio/ccw/ccwchain.h b/drivers/vfio/ccw/ccwchain.h
new file mode 100644
index 0000000..b72ac2a
--- /dev/null
+++ b/drivers/vfio/ccw/ccwchain.h
@@ -0,0 +1,49 @@
+/*
+ * ccwchain interfaces
+ *
+ * Copyright IBM Corp. 2016
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License (version 2 only)
+ * as published by the Free Software Foundation.
+ *
+ * Author(s): Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
+ *            Xiao Feng Ren <renxiaof@linux.vnet.ibm.com>
+ */
+
+#ifndef _CCW_CHAIN_H_
+#define _CCW_CHAIN_H_
+
+#include <asm/cio.h>
+#include <asm/scsw.h>
+
+/**
+ * struct ccwchain_cmd - manage information for ccw program
+ * @u_ccwchain: handle of a user-space ccw program
+ * @k_ccwchain: handle of a kernel-space ccw program
+ * @nr: number of ccws in the ccw program
+ *
+ * @u_ccwchain is a user-space virtual address of a buffer where a user-space
+ * ccw program is stored. The buffer is 4K in size; the lower 2K holds the
+ * ccws and the upper 2K the cda data.
+ *
+ * @k_ccwchain is the kernel-space physical address of a ccwchain struct,
+ * which contains the translated result of @u_ccwchain. This is opaque to
+ * user-space programs.
+ *
+ * @nr is the number of ccws in both user-space ccw program and kernel-space ccw
+ * program.
+ */
+struct ccwchain_cmd {
+	void *u_ccwchain;
+	void *k_ccwchain;
+	int nr;
+};
+
+extern int ccwchain_alloc(struct ccwchain_cmd *cmd);
+extern void ccwchain_free(struct ccwchain_cmd *cmd);
+extern int ccwchain_prefetch(struct ccwchain_cmd *cmd);
+extern struct ccw1 *ccwchain_get_cpa(struct ccwchain_cmd *cmd);
+extern void ccwchain_update_scsw(struct ccwchain_cmd *cmd, union scsw *scsw);
+
+#endif
-- 
2.6.6

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH RFC 8/8] vfio: ccw: realize VFIO_DEVICE_CCW_CMD_REQUEST ioctl
  2016-04-29 12:11 ` [Qemu-devel] " Dong Jia Shi
@ 2016-04-29 12:11   ` Dong Jia Shi
  -1 siblings, 0 replies; 36+ messages in thread
From: Dong Jia Shi @ 2016-04-29 12:11 UTC (permalink / raw)
  To: kvm, linux-s390, qemu-devel
  Cc: bjsdjshi, renxiaof, cornelia.huck, borntraeger, agraf, alex.williamson

Introduce VFIO_DEVICE_CCW_CMD_REQUEST ioctl for vfio-ccw
to handle the translated ccw commands.

We implement the basic ccw command handling infrastructure
here:
1. Issue the translated ccw commands to the device.
2. Once we get the execution result, update the guest SCSW
   with it.

Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
---
 drivers/vfio/ccw/vfio_ccw.c | 190 +++++++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/vfio.h   |  23 ++++++
 2 files changed, 212 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/ccw/vfio_ccw.c b/drivers/vfio/ccw/vfio_ccw.c
index 9700448..3979544 100644
--- a/drivers/vfio/ccw/vfio_ccw.c
+++ b/drivers/vfio/ccw/vfio_ccw.c
@@ -17,15 +17,30 @@
 #include <linux/vfio.h>
 #include <asm/ccwdev.h>
 #include <asm/cio.h>
+#include <asm/orb.h>
+#include "ccwchain.h"
 
 /**
  * struct vfio_ccw_device
  * @cdev: ccw device
+ * @curr_intparm: the current interrupt parameter, used when waiting
+ *                for an I/O interrupt
+ * @wait_q: wait queue for I/O interrupts
+ * @ccwchain_cmd: address map for the current ccwchain
+ * @irb: irb info received from the interrupt
+ * @orb: orb for the currently processed ssch request
+ * @scsw: scsw info
  * @going_away: if an offline procedure was already ongoing
  * @hot_reset: if hot-reset is ongoing
  */
 struct vfio_ccw_device {
 	struct ccw_device	*cdev;
+	u32			curr_intparm;
+	wait_queue_head_t	wait_q;
+	struct ccwchain_cmd	ccwchain_cmd;
+	struct irb		irb;
+	union orb		orb;
+	union scsw		scsw;
 	bool			going_away;
 	bool			hot_reset;
 };
@@ -42,6 +57,118 @@ struct ccw_device_id vfio_ccw_ids[] = {
 MODULE_DEVICE_TABLE(ccw, vfio_ccw_ids);
 
 /*
+ * LATER:
+ * This is good for Linux guests; but we may need an interface to
+ * deal with further bits in the orb.
+ */
+static unsigned long flags_from_orb(union orb *orb)
+{
+	unsigned long flags = 0;
+
+	flags |= orb->cmd.pfch ? 0 : DOIO_DENY_PREFETCH;
+	flags |= orb->cmd.spnd ? DOIO_ALLOW_SUSPEND : 0;
+	flags |= orb->cmd.ssic ? (DOIO_SUPPRESS_INTER | DOIO_ALLOW_SUSPEND) : 0;
+
+	return flags;
+}
+
+/* Check if the current intparm has been set. */
+static int doing_io(struct vfio_ccw_device *vcdev, u32 intparm)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(get_ccwdev_lock(vcdev->cdev), flags);
+	ret = (vcdev->curr_intparm == intparm);
+	spin_unlock_irqrestore(get_ccwdev_lock(vcdev->cdev), flags);
+	return ret;
+}
+
+int vfio_ccw_io_helper(struct vfio_ccw_device *vcdev)
+{
+	struct ccwchain_cmd *ccwchain_cmd;
+	struct ccw1 *cpa;
+	u32 intparm;
+	unsigned long io_flags, lock_flags;
+	int ret;
+
+	ccwchain_cmd = &vcdev->ccwchain_cmd;
+	cpa = ccwchain_get_cpa(ccwchain_cmd);
+	intparm = (u32)(u64)ccwchain_cmd->k_ccwchain;
+	io_flags = flags_from_orb(&vcdev->orb);
+
+	spin_lock_irqsave(get_ccwdev_lock(vcdev->cdev), lock_flags);
+	ret = ccw_device_start(vcdev->cdev, cpa, intparm,
+			       vcdev->orb.cmd.lpm, io_flags);
+	if (!ret)
+		vcdev->curr_intparm = 0;
+	spin_unlock_irqrestore(get_ccwdev_lock(vcdev->cdev), lock_flags);
+
+	if (!ret)
+		wait_event(vcdev->wait_q,
+			   doing_io(vcdev, intparm));
+
+	ccwchain_update_scsw(ccwchain_cmd, &(vcdev->irb.scsw));
+
+	return ret;
+}
+
+/* Handle a ccw command request from userspace. */
+int vfio_ccw_cmd_request(struct vfio_ccw_device *vcdev,
+			 struct vfio_ccw_cmd *ccw_cmd)
+{
+	union orb *orb = &vcdev->orb;
+	union scsw *scsw = &vcdev->scsw;
+	struct irb *irb = &vcdev->irb;
+	int ret;
+
+	memcpy(orb, ccw_cmd->orb_area, sizeof(*orb));
+	memcpy(scsw, ccw_cmd->scsw_area, sizeof(*scsw));
+	vcdev->ccwchain_cmd.u_ccwchain = (void *)ccw_cmd->ccwchain_buf;
+	vcdev->ccwchain_cmd.k_ccwchain = NULL;
+	vcdev->ccwchain_cmd.nr = ccw_cmd->ccwchain_nr;
+
+	if (scsw->cmd.fctl & SCSW_FCTL_START_FUNC) {
+		/*
+		 * XXX:
+		 * Only support prefetch enable mode now.
+		 * Only support 64bit addressing idal.
+		 */
+		if (!orb->cmd.pfch || !orb->cmd.c64)
+			return -EOPNOTSUPP;
+
+		ret = ccwchain_alloc(&vcdev->ccwchain_cmd);
+		if (ret)
+			return ret;
+
+		ret = ccwchain_prefetch(&vcdev->ccwchain_cmd);
+		if (ret) {
+			ccwchain_free(&vcdev->ccwchain_cmd);
+			return ret;
+		}
+
+		/* Start channel program and wait for I/O interrupt. */
+		ret = vfio_ccw_io_helper(vcdev);
+		if (!ret) {
+			/* Get irb info and copy it to irb_area. */
+			memcpy(ccw_cmd->irb_area, irb, sizeof(*irb));
+		}
+
+		ccwchain_free(&vcdev->ccwchain_cmd);
+	} else if (scsw->cmd.fctl & SCSW_FCTL_HALT_FUNC) {
+		/* XXX: Handle halt. */
+		ret = -EOPNOTSUPP;
+	} else if (scsw->cmd.fctl & SCSW_FCTL_CLEAR_FUNC) {
+		/* XXX: Handle clear. */
+		ret = -EOPNOTSUPP;
+	} else {
+		ret = -EOPNOTSUPP;
+	}
+
+	return ret;
+}
+
+/*
  * vfio callbacks
  */
 static int vfio_ccw_open(void *device_data)
@@ -107,6 +234,24 @@ static long vfio_ccw_ioctl(void *device_data, unsigned int cmd,
 		vcdev->hot_reset = false;
 		spin_unlock_irqrestore(get_ccwdev_lock(vcdev->cdev), flags);
 		return ret;
+
+	} else if (cmd == VFIO_DEVICE_CCW_CMD_REQUEST) {
+		struct vfio_ccw_cmd ccw_cmd;
+		int ret;
+
+		minsz = offsetofend(struct vfio_ccw_cmd, ccwchain_buf);
+
+		if (copy_from_user(&ccw_cmd, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (ccw_cmd.argsz < minsz)
+			return -EINVAL;
+
+		ret = vfio_ccw_cmd_request(vcdev, &ccw_cmd);
+		if (ret)
+			return ret;
+
+		return copy_to_user((void __user *)arg, &ccw_cmd, minsz) ?
+			-EFAULT : 0;
 	}
 
 	return -ENOTTY;
@@ -119,6 +264,25 @@ static const struct vfio_device_ops vfio_ccw_ops = {
 	.ioctl		= vfio_ccw_ioctl,
 };
 
+static void vfio_ccw_int_handler(struct ccw_device *cdev,
+				unsigned long intparm,
+				struct irb *irb)
+{
+	struct vfio_device *device = dev_get_drvdata(&cdev->dev);
+	struct vfio_ccw_device *vdev;
+
+	if (!device)
+		return;
+
+	vdev = vfio_device_data(device);
+	if (!vdev)
+		return;
+
+	vdev->curr_intparm = intparm;
+	memcpy(&vdev->irb, irb, sizeof(*irb));
+	wake_up(&vdev->wait_q);
+}
+
 static int vfio_ccw_probe(struct ccw_device *cdev)
 {
 	struct iommu_group *group = vfio_iommu_group_get(&cdev->dev);
@@ -126,6 +290,8 @@ static int vfio_ccw_probe(struct ccw_device *cdev)
 	if (!group)
 		return -EINVAL;
 
+	cdev->handler = vfio_ccw_int_handler;
+
 	return 0;
 }
 
@@ -142,6 +308,9 @@ static int vfio_ccw_set_offline(struct ccw_device *cdev)
 	if (!vdev || vdev->hot_reset || vdev->going_away)
 		return 0;
 
+	/* Put the vfio_device reference we got during the online process. */
+	vfio_device_put(device);
+
 	vdev->going_away = true;
 	vfio_del_group_dev(&cdev->dev);
 	kfree(vdev);
@@ -155,6 +324,8 @@ void vfio_ccw_remove(struct ccw_device *cdev)
 		vfio_ccw_set_offline(cdev);
 
 	vfio_iommu_group_put(cdev->dev.iommu_group, &cdev->dev);
+
+	cdev->handler = NULL;
 }
 
 static int vfio_ccw_set_online(struct ccw_device *cdev)
@@ -186,8 +357,25 @@ create_device:
 	vdev->cdev = cdev;
 
 	ret = vfio_add_group_dev(&cdev->dev, &vfio_ccw_ops, vdev);
-	if (ret)
+	if (ret) {
+		kfree(vdev);
+		return ret;
+	}
+
+	/*
+	 * Get a reference to the vfio_device for this device, and do not put
+	 * it until the device goes offline. This way we don't need to get/put
+	 * a reference every time the int_handler runs, and we avoid an
+	 * incorrect use of a mutex in the int_handler.
+	 */
+	device = vfio_device_get_from_dev(&cdev->dev);
+	if (!device) {
+		vfio_del_group_dev(&cdev->dev);
 		kfree(vdev);
+		return -ENODEV;
+	}
+
+	init_waitqueue_head(&vdev->wait_q);
 
 	return ret;
 }
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 889a316..5e8a58e 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -695,6 +695,29 @@ struct vfio_iommu_spapr_tce_remove {
  */
 #define VFIO_DEVICE_CCW_HOT_RESET	_IO(VFIO_TYPE, VFIO_BASE + 21)
 
+/**
+ * VFIO_DEVICE_CCW_CMD_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
+ *                                     struct vfio_ccw_cmd)
+ *
+ * Submit a user-space ccw program to be translated and issued to a real
+ * device as a channel I/O operation.
+ */
+struct vfio_ccw_cmd {
+	__u32 argsz;
+	__u8 cssid;
+	__u8 ssid;
+	__u16 devno;
+#define ORB_AREA_SIZE 12
+	__u8 orb_area[ORB_AREA_SIZE];
+#define SCSW_AREA_SIZE 12
+	__u8 scsw_area[SCSW_AREA_SIZE];
+#define IRB_AREA_SIZE 96
+	__u8 irb_area[IRB_AREA_SIZE];
+	__u32 ccwchain_nr;
+	__u64 ccwchain_buf;
+} __attribute__((packed));
+#define VFIO_DEVICE_CCW_CMD_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
+
 /* ***************************************************************** */
 
 #endif /* _UAPIVFIO_H */
-- 
2.6.6

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC 0/8] basic vfio-ccw infrastructure
  2016-04-29 12:11 ` [Qemu-devel] " Dong Jia Shi
@ 2016-04-29 17:17   ` Alex Williamson
  -1 siblings, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2016-04-29 17:17 UTC (permalink / raw)
  To: Dong Jia Shi
  Cc: kvm, linux-s390, qemu-devel, renxiaof, cornelia.huck, borntraeger, agraf

On Fri, 29 Apr 2016 14:11:47 +0200
Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com> wrote:

> vfio: ccw: basic vfio-ccw infrastructure
> ========================================
> 
> Introduction
> ------------
> 
> Here we describe the vfio support for Channel I/O devices (aka. CCW
> devices) for Linux/s390. Motivation for vfio-ccw is to passthrough CCW
> devices to a virtual machine, while vfio is the means.
> 
> Different than other hardware architectures, s390 has defined a unified
> I/O access method, which is so called Channel I/O. It has its own
> access patterns:
> - Channel programs run asynchronously on a separate (co)processor.
> - The channel subsystem will access any memory designated by the caller
>   in the channel program directly, i.e. there is no iommu involved.
> Thus when we introduce vfio support for these devices, we realize it
> with a no-iommu vfio implementation.
> 
> This document does not intend to explain the s390 hardware architecture
> in every detail. More information/reference could be found here:
> - A good start to know Channel I/O in general:
>   https://en.wikipedia.org/wiki/Channel_I/O
> - s390 architecture:
>   s390 Principles of Operation manual (IBM Form. No. SA22-7832)
> - The existing Qemu code which implements a simple emulated channel
>   subsystem could also be a good reference. It makes it easier to
>   follow the flow.
>   qemu/hw/s390x/css.c
> 
> Motivation of vfio-ccw
> ----------------------
> 
> Currently, a guest virtualized via qemu/kvm on s390 only sees
> paravirtualized virtio devices via the "Virtio Over Channel I/O
> (virtio-ccw)" transport. This makes virtio devices discoverable via
> standard operating system algorithms for handling channel devices.
> 
> However this is not enough. On s390 for the majority of devices, which
> use the standard Channel I/O based mechanism, we also need to provide
> the functionality of passing through them to a Qemu virtual machine.
> This includes devices that don't have a virtio counterpart (e.g. tape
> drives) or that have specific characteristics which guests want to
> exploit.
> 
> For passing a device to a guest, we want to use the same interface as
> everybody else, namely vfio. Thus, we would like to introduce vfio
> support for channel devices. And we would like to name this new vfio
> device "vfio-ccw".
> 
> Access patterns of CCW devices
> ------------------------------
> 
> s390 architecture has implemented a so called channel subsystem, that
> provides a unified view of the devices physically attached to the
> systems. Though the s390 hardware platform knows about a huge variety of
> different peripheral attachments like disk devices (aka. DASDs), tapes,
> communication controllers, etc. They can all be accessed by a well
> defined access method and they are presenting I/O completion a unified
> way: I/O interruptions.
> 
> All I/O requires the use of channel command words (CCWs). A CCW is an
> instruction to a specialized I/O channel processor. A channel program
> is a sequence of CCWs which are executed by the I/O channel subsystem.
> To issue a CCW program to the channel subsystem, it is required to
> build an operation request block (ORB), which can be used to point out
> the format of the CCW and other control information to the system. The
> operating system signals the I/O channel subsystem to begin executing
> the channel program with a SSCH (start sub-channel) instruction. The
> central processor is then free to proceed with non-I/O instructions
> until interrupted. The I/O completion result is received by the
> interrupt handler in the form of interrupt response block (IRB).
> 
> Back to vfio-ccw, in short:
> - ORBs and CCW programs are built in user space (with virtual
>   addresses).
> - ORBs and CCW programs are passed to the kernel.
> - kernel translates virtual addresses to real addresses and starts the
>   IO with issuing a privileged Channel I/O instruction (e.g SSCH).
> - CCW programs run asynchronously on a separate processor.
> - I/O completion will be signaled to the host with I/O interruptions.
>   And it will be copied as IRB to user space.
> 
> 
> vfio-ccw patches overview
> -------------------------
> 
> It follows that we need vfio-ccw with a vfio no-iommu mode. For now,
> our patches are based on the current no-iommu implementation. It's a
> good start to launch the code review for vfio-ccw. Note that the
> implementation is far from complete yet; but we'd like to get feedback
> for the general architecture.
> 
> The current no-iommu implementation would consider vfio-ccw as
> unsupported and will taint the kernel. This should be not true for
> vfio-ccw. But whether the end result will be using the existing
> no-iommu code or a new module would be an implementation detail.
> 
> * CCW translation APIs
> - Description:
>   These introduce a group of APIs (start with 'ccwchain_') to do CCW
>   translation. The CCWs passed in by a user space program are organized
>   in a buffer, with their user virtual memory addresses. These APIs will
>   copy the CCWs into the kernel space, and assemble a runnable kernel
>   CCW program by updating the user virtual addresses with their
>   corresponding physical addresses.
> - Patches:
>   vfio: ccw: introduce page array interfaces
>   vfio: ccw: introduce ccw chain interfaces
> 
> * vfio-ccw device driver
> - Description:
>   The following patches introduce vfio-ccw, which utilizes the CCW
>   translation APIs. vfio-ccw is a driver for vfio-based ccw devices
>   which can bind to any device that is passed to the guest and
>   implements the following vfio ioctls:
>     VFIO_DEVICE_GET_INFO
>     VFIO_DEVICE_CCW_HOT_RESET
>     VFIO_DEVICE_CCW_CMD_REQUEST
>   With this CMD_REQUEST ioctl, user space program can pass a CCW
>   program to the kernel, to do further CCW translation before issuing
>   them to a real device. Currently we map I/O that is basically async
>   to this synchronous interface, which means it will not return until
>   the interrupt handler got the I/O execution result.
> - Patches:
>   vfio: ccw: basic implementation for vfio_ccw driver
>   vfio: ccw: realize VFIO_DEVICE_GET_INFO ioctl
>   vfio: ccw: realize VFIO_DEVICE_CCW_HOT_RESET ioctl
>   vfio: ccw: realize VFIO_DEVICE_CCW_CMD_REQUEST ioctl
> 
> The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> good example to get understand how these patches work. Here is a little
> bit more detail how an I/O request triggered by the Qemu guest will be
> handled (without error handling).
> 
> Explanation:
> Q1-Q4: Qemu side process.
> K1-K6: Kernel side process.
> 
> Q1. Intercept a ssch instruction.
> Q2. Translate the guest ccw program to a user space ccw program
>     (u_ccwchain).

Is this replacing guest physical address in the program with QEMU
virtual addresses?

> Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
>     K1. Copy from u_ccwchain to kernel (k_ccwchain).
>     K2. Translate the user space ccw program to a kernel space ccw
>         program, which becomes runnable for a real device.

And here we translate and likely pin QEMU virtual address to physical
addresses to further modify the program sent into the channel?

>     K3. With the necessary information contained in the orb passed in
>         by Qemu, issue the k_ccwchain to the device, and wait event q
>         for the I/O result.
>     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
>     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
>         update the user space irb.
>     K6. Copy irb and scsw back to user space.
> Q4. Update the irb for the guest.

If the answers to my questions above are both yes, then this is really
a mediated interface, not a direct assignment.  We don't need an iommu
because we're policing and translating the program for the device
before it gets sent to hardware.  I think there are better ways than
noiommu to handle such devices perhaps even with better performance
than this two-stage translation.  In fact, I think the solution we plan
to implement for vGPU support would work here.

Like your device, a vGPU is mediated, we don't have IOMMU level
translation or isolation since a vGPU is largely a software construct,
but we do have software policing and translating how the GPU is
programmed.  To do this we're creating a type1 compatible vfio iommu
backend that uses the existing map and unmap ioctls, but rather than
programming them into an IOMMU for a device, it simply stores the
translations for use by later requests.  This means that a device
programmed in a VM with guest physical addresses can have the
vfio kernel convert that address to process virtual address, pin the
page and program the hardware with the host physical address in one
step.

This architecture also makes the vfio api completely compatible with
existing usage without tainting QEMU with support for noiommu devices.
I would strongly suggest following a similar approach and dropping the
noiommu interface.  We really do not need to confuse users with noiommu
devices that are safe and assignable and devices where noiommu should
warn them to stay away.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC 0/8] basic vfio-ccw infrastructure
  2016-04-29 17:17   ` [Qemu-devel] " Alex Williamson
@ 2016-05-04  9:26     ` Dong Jia
  -1 siblings, 0 replies; 36+ messages in thread
From: Dong Jia @ 2016-05-04  9:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-s390, qemu-devel, renxiaof, cornelia.huck, borntraeger, agraf

On Fri, 29 Apr 2016 11:17:35 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

Dear Alex:

Thanks for the comments.

[...]

> > 
> > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> > good example to get understand how these patches work. Here is a little
> > bit more detail how an I/O request triggered by the Qemu guest will be
> > handled (without error handling).
> > 
> > Explanation:
> > Q1-Q4: Qemu side process.
> > K1-K6: Kernel side process.
> > 
> > Q1. Intercept a ssch instruction.
> > Q2. Translate the guest ccw program to a user space ccw program
> >     (u_ccwchain).
> 
> Is this replacing guest physical address in the program with QEMU
> virtual addresses?
Yes.

> 
> > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> >     K2. Translate the user space ccw program to a kernel space ccw
> >         program, which becomes runnable for a real device.
> 
> And here we translate and likely pin QEMU virtual address to physical
> addresses to further modify the program sent into the channel?
Yes. Exactly.

> 
> >     K3. With the necessary information contained in the orb passed in
> >         by Qemu, issue the k_ccwchain to the device, and wait event q
> >         for the I/O result.
> >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> >         update the user space irb.
> >     K6. Copy irb and scsw back to user space.
> > Q4. Update the irb for the guest.
> 
> If the answers to my questions above are both yes,
Yes, they are.

> then this is really a mediated interface, not a direct assignment.
Right. This is true.

> We don't need an iommu
> because we're policing and translating the program for the device
> before it gets sent to hardware.  I think there are better ways than
> noiommu to handle such devices perhaps even with better performance
> than this two-stage translation.  In fact, I think the solution we plan
> to implement for vGPU support would work here.
> 
> Like your device, a vGPU is mediated, we don't have IOMMU level
> translation or isolation since a vGPU is largely a software construct,
> but we do have software policing and translating how the GPU is
> programmed.  To do this we're creating a type1 compatible vfio iommu
> backend that uses the existing map and unmap ioctls, but rather than
> programming them into an IOMMU for a device, it simply stores the
> translations for use by later requests.  This means that a device
> programmed in a VM with guest physical addresses can have the
> vfio kernel convert that address to process virtual address, pin the
> page and program the hardware with the host physical address in one
> step.
I've read through the mail threads that discuss how to add vGPU
support to VFIO. I'm afraid that proposal cannot simply be applied
to this case, especially if we want to keep the vfio api completely
compatible with the existing usage.

AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
fixed range of addresses in the memory space for DMA operations. Any
address inside this range will not be used for any other purpose. Thus
we can add a memory listener on this range, and pin the pages for
further use (DMA operations). And we can keep the pages pinned during
the life cycle of the VM (not quite accurate; I should rather say 'of
the target device').

Well, a Subchannel Device does not have such a range of addresses. The
device driver simply calls kmalloc() to get a piece of memory, and
assembles a ccw program with it, before issuing the ccw program to
perform an I/O operation. So the Qemu memory listener can't tell
whether an address is for an I/O operation or for something else. And
this makes the memory listener unnecessary for our case.

The only point in time at which we know we should pin pages for I/O is
when an I/O instruction (e.g. ssch) is intercepted. At that
point, we know the address contained in the parameter of the ssch
instruction points to a piece of memory that contains a ccw program.
Then we do: pin the pages --> convert the ccw program --> perform the
I/O --> return the I/O result --> and unpin the pages.

> 
> This architecture also makes the vfio api completely compatible with
> existing usage without tainting QEMU with support for noiommu devices.
> I would strongly suggest following a similar approach and dropping the
> noiommu interface.  We really do not need to confuse users between
> devices that are safe and assignable and devices where noiommu should
> warn them to stay away.  Thanks,
Understood. But as explained above, even if we introduce a new vfio
iommu backend, what it does would probably look quite like what the
no-iommu backend does. Any ideas about this?

> 
> Alex
> 

--------
Dong Jia

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC 0/8] basic vfio-ccw infrastructure
  2016-05-04  9:26     ` [Qemu-devel] " Dong Jia
@ 2016-05-04 19:26       ` Alex Williamson
  -1 siblings, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2016-05-04 19:26 UTC (permalink / raw)
  To: Dong Jia
  Cc: kvm, linux-s390, qemu-devel, renxiaof, cornelia.huck, borntraeger, agraf

On Wed, 4 May 2016 17:26:29 +0800
Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:

> On Fri, 29 Apr 2016 11:17:35 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> Dear Alex:
> 
> Thanks for the comments.
> 
> [...]
> 
> > > 
> > > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> > > good example to get understand how these patches work. Here is a little
> > > bit more detail how an I/O request triggered by the Qemu guest will be
> > > handled (without error handling).
> > > 
> > > Explanation:
> > > Q1-Q4: Qemu side process.
> > > K1-K6: Kernel side process.
> > > 
> > > Q1. Intercept a ssch instruction.
> > > Q2. Translate the guest ccw program to a user space ccw program
> > >     (u_ccwchain).  
> > 
> > Is this replacing guest physical address in the program with QEMU
> > virtual addresses?  
> Yes.
> 
> >   
> > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > >     K2. Translate the user space ccw program to a kernel space ccw
> > >         program, which becomes runnable for a real device.  
> > 
> > And here we translate and likely pin QEMU virtual address to physical
> > addresses to further modify the program sent into the channel?  
> Yes. Exactly.
> 
> >   
> > >     K3. With the necessary information contained in the orb passed in
> > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > >         for the I/O result.
> > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > >         update the user space irb.
> > >     K6. Copy irb and scsw back to user space.
> > > Q4. Update the irb for the guest.  
> > 
> > If the answers to my questions above are both yes,  
> Yes, they are.
> 
> > then this is really a mediated interface, not a direct assignment.  
> Right. This is true.
> 
> > We don't need an iommu
> > because we're policing and translating the program for the device
> > before it gets sent to hardware.  I think there are better ways than
> > noiommu to handle such devices perhaps even with better performance
> > than this two-stage translation.  In fact, I think the solution we plan
> > to implement for vGPU support would work here.
> > 
> > Like your device, a vGPU is mediated, we don't have IOMMU level
> > translation or isolation since a vGPU is largely a software construct,
> > but we do have software policing and translating how the GPU is
> > programmed.  To do this we're creating a type1 compatible vfio iommu
> > backend that uses the existing map and unmap ioctls, but rather than
> > programming them into an IOMMU for a device, it simply stores the
> > translations for use by later requests.  This means that a device
> > programmed in a VM with guest physical addresses can have the
> > vfio kernel convert that address to process virtual address, pin the
> > page and program the hardware with the host physical address in one
> > step.  
> I've read through the mail threads that discuss how to add vGPU
> support to VFIO. I'm afraid that proposal cannot simply be applied
> to this case, especially if we want to keep the vfio api completely
> compatible with the existing usage.
> 
> AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
> fixed range of addresses in the memory space for DMA operations. Any
> address inside this range will not be used for any other purpose. Thus
> we can add a memory listener on this range, and pin the pages for
> further use (DMA operations). And we can keep the pages pinned during
> the life cycle of the VM (not quite accurate; I should rather say 'of
> the target device').

That's not entirely accurate.  Ignoring a guest IOMMU, current device
assignment pins all of guest memory, not just a dedicated, exclusive
range of it, in order to map it through the hardware IOMMU.  That gives
the guest the ability to transparently perform DMA with the device
since the IOMMU maps the guest physical to host physical translations.

That's not what vGPU is about.  In the case of vGPU the proposal is to
use the same QEMU vfio MemoryListener API, but only for the purpose of
having an accurate database of guest physical to process virtual
translations for the VM.  In your above example, this means step Q2 is
eliminated because step K2 has the information to perform both a guest
physical to process virtual translation and to pin the page to get a
host physical address.  So you'd only need to modify the program once.

> Well, a Subchannel Device does not have such a range of addresses. The
> device driver simply calls kmalloc() to get a piece of memory, and
> assembles a ccw program with it, before issuing the ccw program to
> perform an I/O operation. So the Qemu memory listener can't tell
> whether an address is for an I/O operation or for something else. And
> this makes the memory listener unnecessary for our case.

It's only unnecessary because QEMU is manipulating the program to
replace those addresses with process virtual addresses.  The purpose
of the MemoryListener in the vGPU approach is only to inform the
kernel so that it can perform that translation itself.
 
> The only point in time at which we know we should pin pages for I/O is
> when an I/O instruction (e.g. ssch) is intercepted. At that
> point, we know the address contained in the parameter of the ssch
> instruction points to a piece of memory that contains a ccw program.
> Then we do: pin the pages --> convert the ccw program --> perform the
> I/O --> return the I/O result --> and unpin the pages.

And you could do exactly the same with the vGPU model, it's simply a
difference of how many times the program is converted and using the
MemoryListener to update guest physical to process virtual addresses in
the kernel.

> > This architecture also makes the vfio api completely compatible with
> > existing usage without tainting QEMU with support for noiommu devices.
> > I would strongly suggest following a similar approach and dropping the
> > noiommu interface.  We really do not need to confuse users between
> > devices that are safe and assignable and devices where noiommu should
> > warn them to stay away.  Thanks,  
> Understood. But as explained above, even if we introduce a new vfio
> iommu backend, what it does would probably look quite like what the
> no-iommu backend does. Any ideas about this?

It's not: a mediated device simply shifts the isolation guarantees from
hardware protection in an IOMMU to software protection in a mediated
vfio bus driver.  The IOMMU interface simply becomes a database through
which we can perform in-kernel translations.  All you want is the vfio
device model, and you have the ability to provide that in a secure way,
which is the same as vGPU.  The no-iommu code is intended to provide
the vfio device model in a known-to-be-insecure way.  I don't think you
want to build on that and I don't think we want no-iommu anywhere near
QEMU.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure
@ 2016-05-04 19:26       ` Alex Williamson
  0 siblings, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2016-05-04 19:26 UTC (permalink / raw)
  To: Dong Jia
  Cc: kvm, linux-s390, qemu-devel, renxiaof, cornelia.huck, borntraeger, agraf

On Wed, 4 May 2016 17:26:29 +0800
Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:

> On Fri, 29 Apr 2016 11:17:35 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> Dear Alex:
> 
> Thanks for the comments.
> 
> [...]
> 
> > > 
> > > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> > > good example to get understand how these patches work. Here is a little
> > > bit more detail how an I/O request triggered by the Qemu guest will be
> > > handled (without error handling).
> > > 
> > > Explanation:
> > > Q1-Q4: Qemu side process.
> > > K1-K6: Kernel side process.
> > > 
> > > Q1. Intercept a ssch instruction.
> > > Q2. Translate the guest ccw program to a user space ccw program
> > >     (u_ccwchain).  
> > 
> > Is this replacing guest physical address in the program with QEMU
> > virtual addresses?  
> Yes.
> 
> >   
> > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > >     K2. Translate the user space ccw program to a kernel space ccw
> > >         program, which becomes runnable for a real device.  
> > 
> > And here we translate and likely pin QEMU virtual address to physical
> > addresses to further modify the program sent into the channel?  
> Yes. Exactly.
> 
> >   
> > >     K3. With the necessary information contained in the orb passed in
> > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > >         for the I/O result.
> > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > >         update the user space irb.
> > >     K6. Copy irb and scsw back to user space.
> > > Q4. Update the irb for the guest.  
> > 
> > If the answers to my questions above are both yes,  
> Yes, they are.
> 
> > then this is really a mediated interface, not a direct assignment.  
> Right. This is true.
> 
> > We don't need an iommu
> > because we're policing and translating the program for the device
> > before it gets sent to hardware.  I think there are better ways than
> > noiommu to handle such devices perhaps even with better performance
> > than this two-stage translation.  In fact, I think the solution we plan
> > to implement for vGPU support would work here.
> > 
> > Like your device, a vGPU is mediated, we don't have IOMMU level
> > translation or isolation since a vGPU is largely a software construct,
> > but we do have software policing and translating how the GPU is
> > programmed.  To do this we're creating a type1 compatible vfio iommu
> > backend that uses the existing map and unmap ioctls, but rather than
> > programming them into an IOMMU for a device, it simply stores the
> > translations for use by later requests.  This means that a device
> > programmed in a VM with guest physical addresses can have the
> > vfio kernel convert that address to process virtual address, pin the
> > page and program the hardware with the host physical address in one
> > step.  
> I've read through the mail threads those discuss how to add vGPU
> support in VFIO. I'm afraid that proposal could not be simply addressed
> to this case, especially if we want to make the vfio api completely
> compatible with the existing usage.
> 
> AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
> fixed range of address in the memory space for DMA operations. Any
> address inside this range will not be used for other purpose. Thus we
> can add memory listener on this range, and pin the pages for further
> use (DMA operation). And we can keep the pages pinned during the life
> cycle of the VM (not quite accurate, or I should say 'the target
> device').

That's not entirely accurate.  Ignoring a guest IOMMU, current device
assignment pins all of guest memory, not just a dedicated, exclusive
range of it, in order to map it through the hardware IOMMU.  That gives
the guest the ability to transparently perform DMA with the device
since the IOMMU maps the guest physical to host physical translations.

That's not what vGPU is about.  In the case of vGPU the proposal is to
use the same QEMU vfio MemoryListener API, but only for the purpose of
having an accurate database of guest physical to process virtual
translations for the VM.  In your above example, this means step Q2 is
eliminated because step K2 has the information to perform both a guest
physical to process virtual translation and to pin the page to get a
host physical address.  So you'd only need to modify the program once.

> Well, a Subchannel Device does not have such a range of address. The
> device driver simply calls kalloc() to get a piece of memory, and
> assembles a ccw program with it, before issuing the ccw program to
> perform an I/O operation. So the Qemu memory listener can't tell if an
> address is for an I/O operation, or for whatever else. And this makes
> the memory listener unnecessary for our case.

It's only unnecessary because QEMU is manipulating the program to
replace those addresses with process virtual addresses.  The purpose
of the MemoryListener in the vGPU approach is only to inform the
kernel so that it can perform that translation itself.
 
> The only point in time at which we know we should pin pages for I/O
> is the time that an I/O instruction (e.g. ssch) is intercepted. At
> this point, we know the address contained in the parameter of the
> ssch instruction points to a piece of memory that contains a ccw
> program. Then we do: pin the pages --> convert the ccw program -->
> perform the I/O --> return the I/O result --> and unpin the pages.

And you could do exactly the same with the vGPU model, it's simply a
difference of how many times the program is converted and using the
MemoryListener to update guest physical to process virtual addresses in
the kernel.

> > This architecture also makes the vfio api completely compatible with
> > existing usage without tainting QEMU with support for noiommu devices.
> > I would strongly suggest following a similar approach and dropping the
> > noiommu interface.  We really do not need to confuse users with noiommu
> > devices that are safe and assignable and devices where noiommu should
> > warn them to stay away.  Thanks,  
> Understand. But like explained above, even if we introduce a new vfio
> iommu backend, what it does would probably look quite like what the
> no-iommu backend does. Any idea about this?

It's not, a mediated device simply shifts the isolation guarantees from
hardware protection in an IOMMU to software protection in a mediated
vfio bus driver.  The IOMMU interface simply becomes a database through
which we can perform in-kernel translations.  All you want is the vfio
device model and you have the ability to do that in a secure way, which
is the same as vGPU.  The no-iommu code is intended to provide the vfio
device model in a known-to-be-insecure means.  I don't think you want
to build on that and I don't think we want no-iommu anywhere near
QEMU.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC 0/8] basic vfio-ccw infrastructure
  2016-05-04 19:26       ` [Qemu-devel] " Alex Williamson
@ 2016-05-05 10:29         ` Dong Jia
  -1 siblings, 0 replies; 36+ messages in thread
From: Dong Jia @ 2016-05-05 10:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-s390, qemu-devel, renxiaof, cornelia.huck,
	borntraeger, agraf, Dong Jia

On Wed, 4 May 2016 13:26:53 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 4 May 2016 17:26:29 +0800
> Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:
> 
> > On Fri, 29 Apr 2016 11:17:35 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > Dear Alex:
> > 
> > Thanks for the comments.
> > 
> > [...]
> > 
> > > > 
> > > > The user of vfio-ccw is not limited to Qemu, but Qemu is definitely a
> > > > good example for understanding how these patches work. Here is a little
> > > > bit more detail on how an I/O request triggered by the Qemu guest will
> > > > be handled (without error handling).
> > > > 
> > > > Explanation:
> > > > Q1-Q4: Qemu side process.
> > > > K1-K6: Kernel side process.
> > > > 
> > > > Q1. Intercept a ssch instruction.
> > > > Q2. Translate the guest ccw program to a user space ccw program
> > > >     (u_ccwchain).  
> > > 
> > > Is this replacing guest physical address in the program with QEMU
> > > virtual addresses?  
> > Yes.
> > 
> > >   
> > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > >         program, which becomes runnable for a real device.  
> > > 
> > > And here we translate and likely pin QEMU virtual address to physical
> > > addresses to further modify the program sent into the channel?  
> > Yes. Exactly.
> > 
> > >   
> > > >     K3. With the necessary information contained in the orb passed in
> > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > >         for the I/O result.
> > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > >         update the user space irb.
> > > >     K6. Copy irb and scsw back to user space.
> > > > Q4. Update the irb for the guest.  
> > > 
> > > If the answers to my questions above are both yes,  
> > Yes, they are.
> > 
> > > then this is really a mediated interface, not a direct assignment.  
> > Right. This is true.
> > 
> > > We don't need an iommu
> > > because we're policing and translating the program for the device
> > > before it gets sent to hardware.  I think there are better ways than
> > > noiommu to handle such devices perhaps even with better performance
> > > than this two-stage translation.  In fact, I think the solution we plan
> > > to implement for vGPU support would work here.
> > > 
> > > Like your device, a vGPU is mediated, we don't have IOMMU level
> > > translation or isolation since a vGPU is largely a software construct,
> > > but we do have software policing and translating how the GPU is
> > > programmed.  To do this we're creating a type1 compatible vfio iommu
> > > backend that uses the existing map and unmap ioctls, but rather than
> > > programming them into an IOMMU for a device, it simply stores the
> > > translations for use by later requests.  This means that a device
> > > programmed in a VM with guest physical addresses can have the
> > > vfio kernel convert that address to process virtual address, pin the
> > > page and program the hardware with the host physical address in one
> > > step.  
> > I've read through the mail threads that discuss how to add vGPU
> > support in VFIO. I'm afraid that proposal could not simply be applied
> > to this case, especially if we want to keep the vfio api completely
> > compatible with the existing usage.
> > 
> > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive
> > and fixed range of addresses in the memory space for DMA operations.
> > Any address inside this range will not be used for any other purpose.
> > Thus we can add a memory listener on this range, and pin the pages
> > for further use (DMA operations). And we can keep the pages pinned
> > during the life cycle of the VM (not quite accurate, or I should say
> > 'the target device').
> 
> That's not entirely accurate.  Ignoring a guest IOMMU, current device
> assignment pins all of guest memory, not just a dedicated, exclusive
> range of it, in order to map it through the hardware IOMMU.  That gives
> the guest the ability to transparently perform DMA with the device
> since the IOMMU maps the guest physical to host physical translations.
Thanks for this explanation.

I noticed in the Qemu part, when we tried to introduce vfio-pci to the
s390 architecture, we set the IOMMU width by calling
memory_region_add_subregion before initializing the address_space of
the PCI device, which will be registered with the vfio_memory_listener
later. The 'width' of the subregion is what I called the 'range' in the
former reply.

The first reason we did that is that we know the dma memory range
exactly, and we got the width from 'dma_addr_end - dma_addr_start'.
The second reason we had to do that is that using the following
statement makes the initialization of the guest take a tremendously
long time:
    group = vfio_get_group(groupid, &address_space_memory);
Because doing a map on the [0, UINT64_MAX] range costs a lot of time.
For me, it's unacceptably long (more than 5 minutes).

My questions are:
1. Why do we have to 'pin all of guest memory' if we do know the
iommu memory range?
2. Didn't you hit the long startup time problem as well? Or I must be
missing something. For the vfio-ccw case, there is no fixed range. So
according to your proposal, vfio-ccw has to pin all of guest memory.
And I guess I will encounter this problem again.

> 
> That's not what vGPU is about.  In the case of vGPU the proposal is to
> use the same QEMU vfio MemoryListener API, but only for the purpose of
> having an accurate database of guest physical to process virtual
> translations for the VM.  In your above example, this means step Q2 is
> eliminated because step K2 has the information to perform both a guest
> physical to process virtual translation and to pin the page to get a
> host physical address.  So you'd only need to modify the program once.
According to my understanding of your proposal, I should do:
------------------------------------------------------------
#1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
When starting the guest, pin all of guest memory, and form the database.

#2. In the driver of the ccw devices, when an I/O instruction was
intercepted, query the database and translate the ccw program for I/O
operation.

I also noticed in another thread:
---------------------------------
[Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

Kirti did:
1. don't pin the pages in the map ioctl for the vGPU case.
2. export vfio_pin_pages and vfio_unpin_pages.

Although their patches didn't show how these interfaces were used, I
guess they can either use these interfaces to pin/unpin all of the
guest memory, or pin/unpin memory on demand. So can I reuse their work
to finish my #1? If the answer is yes, then I could change my plan and
do:
#1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
When starting the guest, form the <vaddr, iova, size> database.

#2. In the driver of the ccw devices, when an I/O instruction was
intercepted, call vfio_pin_pages (Kirti's version) to get the host
physical address, then translate the ccw program for I/O operation.

So which one is the right way to go?

> 
> > Well, a Subchannel Device does not have such a range of addresses.
> > The device driver simply calls kmalloc() to get a piece of memory,
> > and assembles a ccw program with it, before issuing the ccw program
> > to perform an I/O operation. So the Qemu memory listener can't tell
> > whether an address is for an I/O operation or for something else.
> > And this makes the memory listener unnecessary for our case.
> 
> It's only unnecessary because QEMU is manipulating the program to
> replace those addresses with process virtual addresses.  The purpose
> of the MemoryListener in the vGPU approach is only to inform the
> kernel so that it can perform that translation itself.
> 
> > The only point in time at which we know we should pin pages for I/O
> > is the time that an I/O instruction (e.g. ssch) is intercepted. At
> > this point, we know the address contained in the parameter of the
> > ssch instruction points to a piece of memory that contains a ccw
> > program. Then we do: pin the pages --> convert the ccw program -->
> > perform the I/O --> return the I/O result --> and unpin the pages.
> 
> And you could do exactly the same with the vGPU model, it's simply a
> difference of how many times the program is converted and using the
> MemoryListener to update guest physical to process virtual addresses in
> the kernel.
Understand.

> 
> > > This architecture also makes the vfio api completely compatible with
> > > existing usage without tainting QEMU with support for noiommu devices.
> > > I would strongly suggest following a similar approach and dropping the
> > > noiommu interface.  We really do not need to confuse users with noiommu
> > > devices that are safe and assignable and devices where noiommu should
> > > warn them to stay away.  Thanks,  
> > Understand. But like explained above, even if we introduce a new vfio
> > iommu backend, what it does would probably look quite like what the
> > no-iommu backend does. Any idea about this?
> 
> It's not, a mediated device simply shifts the isolation guarantees from
> hardware protection in an IOMMU to software protection in a mediated
> vfio bus driver.  The IOMMU interface simply becomes a database through
> which we can perform in-kernel translations.  All you want is the vfio
> device model and you have the ability to do that in a secure way, which
> is the same as vGPU.  The no-iommu code is intended to provide the vfio
> device model in a known-to-be-insecure means.  I don't think you want
> to build on that and I don't think we want no-iommu anywhere near
> QEMU.  Thanks,
Got it. I will mimic the vGPU model, once the above questions are
clarified. :>

> 
> Alex
> 

--------
Dong Jia

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC 0/8] basic vfio-ccw infrastructure
  2016-05-05 10:29         ` [Qemu-devel] " Dong Jia
@ 2016-05-05 19:19           ` Alex Williamson
  -1 siblings, 0 replies; 36+ messages in thread
From: Alex Williamson @ 2016-05-05 19:19 UTC (permalink / raw)
  To: Dong Jia
  Cc: kvm, linux-s390, qemu-devel, renxiaof, cornelia.huck,
	borntraeger, agraf, Tian, Kevin, Song, Jike, Neo Jia,
	Kirti Wankhede

[cc +Intel,NVIDIA]

On Thu, 5 May 2016 18:29:08 +0800
Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:

> On Wed, 4 May 2016 13:26:53 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Wed, 4 May 2016 17:26:29 +0800
> > Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:
> >   
> > > On Fri, 29 Apr 2016 11:17:35 -0600
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > 
> > > Dear Alex:
> > > 
> > > Thanks for the comments.
> > > 
> > > [...]
> > >   
> > > > > 
> > > > > The user of vfio-ccw is not limited to Qemu, but Qemu is definitely a
> > > > > good example for understanding how these patches work. Here is a little
> > > > > bit more detail on how an I/O request triggered by the Qemu guest will
> > > > > be handled (without error handling).
> > > > > 
> > > > > Explanation:
> > > > > Q1-Q4: Qemu side process.
> > > > > K1-K6: Kernel side process.
> > > > > 
> > > > > Q1. Intercept a ssch instruction.
> > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > >     (u_ccwchain).    
> > > > 
> > > > Is this replacing guest physical address in the program with QEMU
> > > > virtual addresses?    
> > > Yes.
> > >   
> > > >     
> > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > > >         program, which becomes runnable for a real device.    
> > > > 
> > > > And here we translate and likely pin QEMU virtual address to physical
> > > > addresses to further modify the program sent into the channel?    
> > > Yes. Exactly.
> > >   
> > > >     
> > > > >     K3. With the necessary information contained in the orb passed in
> > > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > >         for the I/O result.
> > > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > >         update the user space irb.
> > > > >     K6. Copy irb and scsw back to user space.
> > > > > Q4. Update the irb for the guest.    
> > > > 
> > > > If the answers to my questions above are both yes,    
> > > Yes, they are.
> > >   
> > > > then this is really a mediated interface, not a direct assignment.    
> > > Right. This is true.
> > >   
> > > > We don't need an iommu
> > > > because we're policing and translating the program for the device
> > > > before it gets sent to hardware.  I think there are better ways than
> > > > noiommu to handle such devices perhaps even with better performance
> > > > than this two-stage translation.  In fact, I think the solution we plan
> > > > to implement for vGPU support would work here.
> > > > 
> > > > Like your device, a vGPU is mediated, we don't have IOMMU level
> > > > translation or isolation since a vGPU is largely a software construct,
> > > > but we do have software policing and translating how the GPU is
> > > > programmed.  To do this we're creating a type1 compatible vfio iommu
> > > > backend that uses the existing map and unmap ioctls, but rather than
> > > > programming them into an IOMMU for a device, it simply stores the
> > > > translations for use by later requests.  This means that a device
> > > > programmed in a VM with guest physical addresses can have the
> > > > vfio kernel convert that address to process virtual address, pin the
> > > > page and program the hardware with the host physical address in one
> > > > step.    
> > > I've read through the mail threads that discuss how to add vGPU
> > > support in VFIO. I'm afraid that proposal could not simply be applied
> > > to this case, especially if we want to keep the vfio api completely
> > > compatible with the existing usage.
> > > 
> > > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive
> > > and fixed range of addresses in the memory space for DMA operations.
> > > Any address inside this range will not be used for any other purpose.
> > > Thus we can add a memory listener on this range, and pin the pages
> > > for further use (DMA operations). And we can keep the pages pinned
> > > during the life cycle of the VM (not quite accurate, or I should say
> > > 'the target device').  
> > 
> > That's not entirely accurate.  Ignoring a guest IOMMU, current device
> > assignment pins all of guest memory, not just a dedicated, exclusive
> > range of it, in order to map it through the hardware IOMMU.  That gives
> > the guest the ability to transparently perform DMA with the device
> > since the IOMMU maps the guest physical to host physical translations.  
> Thanks for this explanation.
> 
> I noticed in the Qemu part, when we tried to introduce vfio-pci to the
> s390 architecture, we set the IOMMU width by calling
> memory_region_add_subregion before initializing the address_space of
> the PCI device, which will be registered with the vfio_memory_listener
> later. The 'width' of the subregion is what I called the 'range' in the
> former reply.
> 
> The first reason we did that is that we know the dma memory range
> exactly, and we got the width from 'dma_addr_end - dma_addr_start'.
> The second reason we had to do that is that using the following
> statement makes the initialization of the guest take a tremendously
> long time:
>     group = vfio_get_group(groupid, &address_space_memory);
> Because doing a map on the [0, UINT64_MAX] range costs a lot of time.
> For me, it's unacceptably long (more than 5 minutes).
> 
> My questions are:
> 1. Why do we have to 'pin all of guest memory' if we do know the
> iommu memory range?

We have a few different configurations here, so let's not confuse them.
On x86 with pci device assignment we typically don't have a guest IOMMU,
so the guest assumes the device can DMA to any address in the guest
memory space.  To enable that, we pin all of guest memory and map it
through the IOMMU.  Even with a guest IOMMU on x86, it's an optional
feature that the guest OS may or may not use, so we'll always at least
start up in this mode and the guest may or may not enable something else.

When we have a guest IOMMU available, the device switches to a
different address space, note that in current QEMU code,
vfio_get_group() is actually called as:

    group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev));

Where pci_device_iommu_address_space() determines whether the device is
translated by an IOMMU and defaults back to &address_space_memory if
not.  So we already have code that is supposed to handle the difference
between whether we're mapping all of guest memory or whether we're only
registering translations populated in the IOMMU for the device.

It appears that S390 implements some sort of IOMMU in the guest, so
theoretically DMA_MAP and DMA_UNMAP operations are only going to map
the IOTLB translations relevant to that device.  At least that's how
it's supposed to work.  So we shouldn't be pinning all of guest memory
for the PCI case.

When we switch to the vgpu/mediated-device approach, everything should
work the same except the DMA_MAP and DMA_UNMAP ioctls don't do any
pinning or IOMMU mapping.  They only update the in-kernel vfio view of
IOVA to process virtual translations.  These translations are then
consumed only when a device operation requires DMA.  At that point we
do an IOVA-to-VA translation and page_to_pfn(get_user_pages()) to get a
host physical address, which is only pinned while the operation is
inflight.

> 2. Didn't you hit the long startup time problem as well? Or I must be
> missing something. For the vfio-ccw case, there is no fixed range. So
> according to your proposal, vfio-ccw has to pin all of guest memory.
> And I guess I will encounter this problem again.

x86 with a guest IOMMU is very new and still not upstream, so I don't
know if there's a point at which we perform an operation over the
entire address space that would be slow.  It seems like something we
could optimize though.  x86 without a guest IOMMU only performs DMA_MAP
operations for the actual populated guest memory.  This is of course
not free, but is negligible for small guests and scales as the memory
size of the guest increases.  According to the vgpu/mediated-device
proposal, there would be no pinning occurring at startup; the DMA_MAP
calls would only populate a tree of IOVA-to-VA mappings using the
granularity of the DMA_MAP parameters themselves.

> > 
> > That's not what vGPU is about.  In the case of vGPU the proposal is to
> > use the same QEMU vfio MemoryListener API, but only for the purpose of
> > having an accurate database of guest physical to process virtual
> > translations for the VM.  In your above example, this means step Q2 is
> > eliminated because step K2 has the information to perform both a guest
> > physical to process virtual translation and to pin the page to get a
> > host physical address.  So you'd only need to modify the program once.  
> According to my understanding of your proposal, I should do:
> ------------------------------------------------------------
> #1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
> When starting the guest, pin all of guest memory, and form the database.

I hope that we can have a common type1-compatible iommu backend for
vfio, there's nothing ccw specific there.  Pages would not be pinned,
only registered for later retrieval by the mediated-device backend and
only for the runtime of the ccw program in your case.

> #2. In the driver of the ccw devices, when an I/O instruction was
> intercepted, query the database and translate the ccw program for I/O
> operation.

The database query would be the point at which the page is pinned, so
there would be some sort of 'put' of the translation after the ccw
program executes to release the pin.

> I also noticed in another thread:
> ---------------------------------
> [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
> 
> Kirti did:
> 1. don't pin the pages in the map ioctl for the vGPU case.
> 2. export vfio_pin_pages and vfio_unpin_pages.
> 
> Although their patches didn't show how these interfaces were used, I
> guess they can either use these interfaces to pin/unpin all of the
> guest memory, or pin/unpin memory on demand. So can I reuse their work
> to finish my #1? If the answer is yes, then I could change my plan and

Yes, we would absolutely only want one vfio iommu backend doing this,
there's nothing device specific about it.  We're looking at supporting
both modes of operation, fully pinned and pin-on-demand.  NVIDIA vGPU
wants the on-demand approach while Intel vGPU wants to pin the entire
guest, at least for an initial solution.  This iommu backend would need
to support both as determined by the mediated device backend.

> do:
> #1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
> When starting the guest, form the <vaddr, iova, size> database.
> 
> #2. In the driver of the ccw devices, when an I/O instruction was
> intercepted, call vfio_pin_pages (Kirti's version) to get the host
> physical address, then translate the ccw program for I/O operation.
> 
> So which one is the right way to go?

As above, I think we have a need to support both approaches in this new
iommu backend, it will be up to you to determine which is appropriate
for your devices and guest drivers.  A fully pinned guest has a latency
advantage, but obviously there are numerous disadvantages for the
pinning itself.  Pinning on-demand has overhead to set up each DMA
operation by the device but has a much smaller pinning footprint.

> > > Well, a Subchannel Device does not have such a range of address. The
> > > device driver simply calls kmalloc() to get a piece of memory, and
> > > assembles a ccw program with it, before issuing the ccw program to
> > > perform an I/O operation. So the Qemu memory listener can't tell if an
> > > address is for an I/O operation, or for whatever else. And this makes
> > > the memory listener unnecessary for our case.  
> > 
> > It's only unnecessary because QEMU is manipulating the program to
> > replace those addresses with process virtual addresses.  The purpose
> > of the MemoryListener in the vGPU approach is only to inform the
> > kernel so that it can perform that translation itself.
> >   
> > > The only time point that we know we should pin pages for I/O, is the
> > > time that an I/O instruction (e.g. ssch) was intercepted. At this
> > > point, we know the address contained in the parameter of the ssch
> > > instruction points to a piece of memory that contains a ccw program.
> > > Then we do: pin the pages --> convert the ccw program --> perform the
> > > I/O --> return the I/O result --> and unpin the pages.  
> > 
> > And you could do exactly the same with the vGPU model, it's simply a
> > difference of how many times the program is converted and using the
> > MemoryListener to update guest physical to process virtual addresses in
> > the kernel.  
> Understand.
> 
> >   
> > > > This architecture also makes the vfio api completely compatible with
> > > > existing usage without tainting QEMU with support for noiommu devices.
> > > > I would strongly suggest following a similar approach and dropping the
> > > > noiommu interface.  We really do not need to confuse users with noiommu
> > > > devices that are safe and assignable and devices where noiommu should
> > > > warn them to stay away.  Thanks,    
> > > Understand. But like explained above, even if we introduce a new vfio
> > > iommu backend, what it does would probably look quite like what the
> > > no-iommu backend does. Any idea about this?  
> > 
> > It's not, a mediated device simply shifts the isolation guarantees from
> > hardware protection in an IOMMU to software protection in a mediated
> > vfio bus driver.  The IOMMU interface simply becomes a database through
> > which we can perform in-kernel translations.  All you want is the vfio
> > device model and you have the ability to do that in a secure way, which
> > is the same as vGPU.  The no-iommu code is intended to provide the vfio
> > device model in a known-to-be-insecure means.  I don't think you want
> > to build on that and I don't think we want no-iommu anywhere near
> > QEMU.  Thanks,  
> Got it. I will mimic the vGPU model, once the above questions are
> clarified. :>

Thanks,
Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC 0/8] basic vfio-ccw infrastructure
  2016-05-05 19:19           ` [Qemu-devel] " Alex Williamson
@ 2016-05-05 20:23             ` Neo Jia
  -1 siblings, 0 replies; 36+ messages in thread
From: Neo Jia @ 2016-05-05 20:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Dong Jia, kvm, linux-s390, qemu-devel, renxiaof, cornelia.huck,
	borntraeger, agraf, Tian, Kevin, Song, Jike, Kirti Wankhede

On Thu, May 05, 2016 at 01:19:45PM -0600, Alex Williamson wrote:
> [cc +Intel,NVIDIA]
> 
> On Thu, 5 May 2016 18:29:08 +0800
> Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:
> 
> > On Wed, 4 May 2016 13:26:53 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > > On Wed, 4 May 2016 17:26:29 +0800
> > > Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:
> > >   
> > > > On Fri, 29 Apr 2016 11:17:35 -0600
> > > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > > 
> > > > Dear Alex:
> > > > 
> > > > Thanks for the comments.
> > > > 
> > > > [...]
> > > >   
> > > > > > 
> > > > > > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> > > > > > good example to get understand how these patches work. Here is a little
> > > > > > bit more detail how an I/O request triggered by the Qemu guest will be
> > > > > > handled (without error handling).
> > > > > > 
> > > > > > Explanation:
> > > > > > Q1-Q4: Qemu side process.
> > > > > > K1-K6: Kernel side process.
> > > > > > 
> > > > > > Q1. Intercept a ssch instruction.
> > > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > >     (u_ccwchain).    
> > > > > 
> > > > > Is this replacing guest physical address in the program with QEMU
> > > > > virtual addresses?    
> > > > Yes.
> > > >   
> > > > >     
> > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > > > >         program, which becomes runnable for a real device.    
> > > > > 
> > > > > And here we translate and likely pin QEMU virtual address to physical
> > > > > addresses to further modify the program sent into the channel?    
> > > > Yes. Exactly.
> > > >   
> > > > >     
> > > > > >     K3. With the necessary information contained in the orb passed in
> > > > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > > >         for the I/O result.
> > > > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > > >         update the user space irb.
> > > > > >     K6. Copy irb and scsw back to user space.
> > > > > > Q4. Update the irb for the guest.    
> > > > > 
> > > > > If the answers to my questions above are both yes,    
> > > > Yes, they are.
> > > >   
> > > > > then this is really a mediated interface, not a direct assignment.    
> > > > Right. This is true.
> > > >   
> > > > > We don't need an iommu
> > > > > because we're policing and translating the program for the device
> > > > > before it gets sent to hardware.  I think there are better ways than
> > > > > noiommu to handle such devices perhaps even with better performance
> > > > > than this two-stage translation.  In fact, I think the solution we plan
> > > > > to implement for vGPU support would work here.
> > > > > 
> > > > > Like your device, a vGPU is mediated, we don't have IOMMU level
> > > > > translation or isolation since a vGPU is largely a software construct,
> > > > > but we do have software policing and translating how the GPU is
> > > > > programmed.  To do this we're creating a type1 compatible vfio iommu
> > > > > backend that uses the existing map and unmap ioctls, but rather than
> > > > > programming them into an IOMMU for a device, it simply stores the
> > > > > translations for use by later requests.  This means that a device
> > > > > programmed in a VM with guest physical addresses can have the
> > > > > vfio kernel convert that address to process virtual address, pin the
> > > > > page and program the hardware with the host physical address in one
> > > > > step.    
> > > > I've read through the mail threads those discuss how to add vGPU
> > > > support in VFIO. I'm afraid that proposal could not be simply addressed
> > > > to this case, especially if we want to make the vfio api completely
> > > > compatible with the existing usage.
> > > > 
> > > > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
> > > > fixed range of address in the memory space for DMA operations. Any
> > > > address inside this range will not be used for other purpose. Thus we
> > > > can add memory listener on this range, and pin the pages for further
> > > > use (DMA operation). And we can keep the pages pinned during the life
> > > > cycle of the VM (not quite accurate, or I should say 'the target
> > > > device').  
> > > 
> > > That's not entirely accurate.  Ignoring a guest IOMMU, current device
> > > assignment pins all of guest memory, not just a dedicated, exclusive
> > > range of it, in order to map it through the hardware IOMMU.  That gives
> > > the guest the ability to transparently perform DMA with the device
> > > since the IOMMU maps the guest physical to host physical translations.  
> > Thanks for this explanation.
> > 
> > I noticed in the Qemu part, when we tried to introduce vfio-pci to the
> > s390 architecture, we set the IOMMU width by calling
> > memory_region_add_subregion before initializing the address_space of
> > the PCI device, which will be registered with the vfio_memory_listener
> > later. The 'width' of the subregion is what I called the 'range' in the
> > former reply.
> > 
> > The first reason we did that is, we know exactly the dma memory
> > range, and we got the width by 'dma_addr_end - dma_addr_start'. The
> > second reason we have to do that is that using the following statement
> > makes the initialization of the guest take tremendously long:
> >     group = vfio_get_group(groupid, &address_space_memory);
> > Because doing a map over the [0, UINT64_MAX] range costs a lot of time. For
> > me, it's unacceptably long (more than 5 minutes).
> > 
> > My questions are:
> > 1. Why do we have to 'pin all of guest memory' if we know the
> > iommu memory range?
> 
> We have a few different configurations here, so let's not confuse them.
> On x86 with pci device assignment we typically don't have a guest IOMMU
> so the guest assumes the device can DMA to any address in the guest
> memory space.  To enable that we pin all of guest memory and map it
> through the IOMMU.  Even with a guest IOMMU on x86, it's an optional
> feature that the guest OS may or may not use, so we'll always at least
> start up in this mode and the guest may or may not enable something else.
> 
> When we have a guest IOMMU available, the device switches to a
> different address space, note that in current QEMU code,
> vfio_get_group() is actually called as:
> 
>     group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev));
> 
> Where pci_device_iommu_address_space() determines whether the device is
> translated by an IOMMU and defaults back to &address_space_memory if
> not.  So we already have code that is supposed to handle the difference
> between whether we're mapping all of guest memory or whether we're only
> registering translations populated in the IOMMU for the device.
> 
> It appears that S390 implements some sort of IOMMU in the guest, so
> theoretically DMA_MAP and DMA_UNMAP operations are only going to map
> the IOTLB translations relevant to that device.  At least that's how
> it's supposed to work.  So we shouldn't be pinning all of guest memory
> for the PCI case.
> 
> When we switch to the vgpu/mediated-device approach, everything should
> work the same except the DMA_MAP and DMA_UNMAP ioctls don't do any
> pinning or IOMMU mapping.  They only update the in-kernel vfio view of
> IOVA to process virtual translations.  These translations are then
> consumed only when a device operation requires DMA.  At that point we
> do an IOVA-to-VA translation and page_to_pfn(get_user_pages()) to get a
> host physical address, which is only pinned while the operation is
> inflight.
> 
> > 2. Didn't you hit the long startup-time problem too? Or am I
> > missing something? For the vfio-ccw case, there is no fixed range. So
> > according to your proposal, vfio-ccw has to pin all of guest memory.
> > And I guess I will encounter this problem again.
> 
> x86 with a guest IOMMU is very new and still not upstream, so I don't
> know if there's a point at which we perform an operation over the
> entire address space; that would be slow.  It seems like something we
> could optimize though.  x86 without a guest IOMMU only performs DMA_MAP
> operations for the actual populated guest memory.  This is of course
> not free, but is negligible for small guests and scales as the memory
> size of the guest increases.  According to the vgpu/mediated-device
> proposal, there would be no pinning occurring at startup; the DMA_MAP
> would only be populating a tree of IOVA-to-VA mappings using the
> granularity of the DMA_MAP parameters themselves.
> 
> > > 
> > > That's not what vGPU is about.  In the case of vGPU the proposal is to
> > > use the same QEMU vfio MemoryListener API, but only for the purpose of
> > > having an accurate database of guest physical to process virtual
> > > translations for the VM.  In your above example, this means step Q2 is
> > > eliminated because step K2 has the information to perform both a guest
> > > physical to process virtual translation and to pin the page to get a
> > > host physical address.  So you'd only need to modify the program once.  
> > According to my understanding of your proposal, I should do:
> > ------------------------------------------------------------
> > #1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
> > When starting the guest, pin all of guest memory, and form the database.
> 
> I hope that we can have a common type1-compatible iommu backend for
> vfio, there's nothing ccw specific there.  Pages would not be pinned,
> only registered for later retrieval by the mediated-device backend and
> only for the runtime of the ccw program in your case.
> 
> > #2. In the driver of the ccw devices, when an I/O instruction was
> > intercepted, query the database and translate the ccw program for I/O
> > operation.
> 
> The database query would be the point at which the page is pinned, so
> there would be some sort of 'put' of the translation after the ccw
> program executes to release the pin.
> 
> > I also noticed in another thread:
> > ---------------------------------
> > [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
> > 
> > Kirti did:
> > 1. don't pin the pages in the map ioctl for the vGPU case.
> > 2. export vfio_pin_pages and vfio_unpin_pages.
> > 
> > Although their patches didn't show how these interfaces were used, I
> > guess them can either use these interfaces to pin/unpin all of the
> > guest memory, or pin/unpin memory on demand. So can I reuse their work
> > to finish my #1? If the answer is yes, then I could change my plan and
> 
> Yes, we would absolutely only want one vfio iommu backend doing this,
> there's nothing device specific about it.  We're looking at supporting
> both modes of operation, fully pinned and pin-on-demand.  NVIDIA vGPU
> wants the on-demand approach while Intel vGPU wants to pin the entire
> guest, at least for an initial solution.  This iommu backend would need
> to support both as determined by the mediated device backend.

Right, we will add a new callback to the mediated device backend interface for
this purpose in the v4 patch.

Thanks,
Neo

> 
> > do:
> > #1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
> > When starting the guest, form the <vaddr, iova, size> database.
> > 
> > #2. In the driver of the ccw devices, when an I/O instruction was
> > intercepted, call vfio_pin_pages (Kirti's version) to get the host
> > physical address, then translate the ccw program for I/O operation.
> > 
> > So which one is the right way to go?
> 
> As above, I think we have a need to support both approaches in this new
> iommu backend; it will be up to you to determine which is appropriate
> for your devices and guest drivers.  A fully pinned guest has a latency
> advantage, but obviously there are numerous disadvantages for the
> pinning itself.  Pinning on-demand has overhead to set up each DMA
> operation by the device but has a much smaller pinning footprint.
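The two policies Alex contrasts here can be sketched side by side. This is a hypothetical model, not the real iommu backend: `pin_on_map` stands in for however the mediated device backend would select fully-pinned versus pin-on-demand operation.

```python
# Illustrative sketch of the two pinning policies: a fully-pinned
# backend pins at DMA_MAP time (low per-I/O latency, large footprint);
# an on-demand backend defers pinning until a translation is used.
# All names here are assumptions, not the real vfio interfaces.

class IommuBackend:
    def __init__(self, pin_on_map):
        self.pin_on_map = pin_on_map  # chosen by the mediated device backend
        self.mapped = {}              # iova -> vaddr
        self.pinned = set()

    def dma_map(self, iova, vaddr):
        self.mapped[iova] = vaddr
        if self.pin_on_map:           # fully-pinned mode: pin up front
            self.pinned.add(iova)

    def translate(self, iova):
        vaddr = self.mapped[iova]
        if not self.pin_on_map:       # on-demand mode: pin at first use
            self.pinned.add(iova)
        return vaddr

    def put(self, iova):
        if not self.pin_on_map:       # on-demand pins released per request
            self.pinned.discard(iova)
```

Either mode presents the same map/translate interface to the device driver; only when the pin is taken differs.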
> 
> > > > Well, a Subchannel Device does not have such a range of address. The
> > > > device driver simply calls kmalloc() to get a piece of memory, and
> > > > assembles a ccw program with it, before issuing the ccw program to
> > > > perform an I/O operation. So the Qemu memory listener can't tell if an
> > > > address is for an I/O operation, or for whatever else. And this makes
> > > > the memory listener unnecessary for our case.  
> > > 
> > > It's only unnecessary because QEMU is manipulating the program to
> > > replace those addresses with process virtual addresses.  The purpose
> > > of the MemoryListener in the vGPU approach is only to inform the
> > > kernel so that it can perform that translation itself.
> > >   
> > > > The only time point that we know we should pin pages for I/O, is the
> > > > time that an I/O instruction (e.g. ssch) was intercepted. At this
> > > > point, we know the address contained in the parameter of the ssch
> > > > instruction points to a piece of memory that contains a ccw program.
> > > > Then we do: pin the pages --> convert the ccw program --> perform the
> > > > I/O --> return the I/O result --> and unpin the pages.  
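The pin --> convert --> I/O --> unpin chain above can be written as one driver-side function. A minimal Python model follows; `pin_pages`, `unpin_pages`, `do_io`, and the `cda` field layout are illustrative stand-ins, not the real kernel interfaces.

```python
# Illustrative model of handling an intercepted ssch: pin the guest
# pages referenced by the ccw program, rewrite the data addresses to
# host addresses, run the I/O, then unpin in all cases.

def handle_ssch(ccw_program, pin_pages, unpin_pages, do_io):
    pinned = []
    try:
        for ccw in ccw_program:                # each ccw carries a 'cda'
            host_addr = pin_pages(ccw["cda"])  # pin + translate guest addr
            pinned.append(ccw["cda"])
            ccw["cda"] = host_addr             # convert the ccw program
        return do_io(ccw_program)              # perform I/O, return result
    finally:
        for guest_addr in pinned:              # unpin even on error
            unpin_pages(guest_addr)
```

The `finally` block mirrors the requirement that the pin is released once the I/O result has been returned, whether or not the operation succeeded.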
> > > 
> > > And you could do exactly the same with the vGPU model, it's simply a
> > > difference of how many times the program is converted and using the
> > > MemoryListener to update guest physical to process virtual addresses in
> > > the kernel.  
> > Understand.
> > 
> > >   
> > > > > This architecture also makes the vfio api completely compatible with
> > > > > existing usage without tainting QEMU with support for noiommu devices.
> > > > > I would strongly suggest following a similar approach and dropping the
> > > > > noiommu interface.  We really do not need to confuse users with noiommu
> > > > > devices that are safe and assignable and devices where noiommu should
> > > > > warn them to stay away.  Thanks,    
> > > > Understand. But like explained above, even if we introduce a new vfio
> > > > iommu backend, what it does would probably look quite like what the
> > > > no-iommu backend does. Any idea about this?  
> > > 
> > > It's not; a mediated device simply shifts the isolation guarantees from
> > > hardware protection in an IOMMU to software protection in a mediated
> > > vfio bus driver.  The IOMMU interface simply becomes a database through
> > > which we can perform in-kernel translations.  All you want is the vfio
> > > device model and you have the ability to do that in a secure way, which
> > > is the same as vGPU.  The no-iommu code is intended to provide the vfio
> > > device model in a known-to-be-insecure means.  I don't think you want
> > > to build on that and I don't think we want no-iommu anywhere near
> > > QEMU.  Thanks,  
> > Got it. I will mimic the vGPU model, once the above questions are
> > clarified. :>
> 
> Thanks,
> Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure
@ 2016-05-05 20:23             ` Neo Jia
  0 siblings, 0 replies; 36+ messages in thread
From: Neo Jia @ 2016-05-05 20:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Dong Jia, kvm, linux-s390, qemu-devel, renxiaof, cornelia.huck,
	borntraeger, agraf, Tian, Kevin, Song, Jike, Kirti Wankhede

On Thu, May 05, 2016 at 01:19:45PM -0600, Alex Williamson wrote:
> [cc +Intel,NVIDIA]
> 
> On Thu, 5 May 2016 18:29:08 +0800
> Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:
> 
> > On Wed, 4 May 2016 13:26:53 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > > On Wed, 4 May 2016 17:26:29 +0800
> > > Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:
> > >   
> > > > On Fri, 29 Apr 2016 11:17:35 -0600
> > > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > > 
> > > > Dear Alex:
> > > > 
> > > > Thanks for the comments.
> > > > 
> > > > [...]
> > > >   
> > > > > > 
> > > > > > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> > > > > > good example to get understand how these patches work. Here is a little
> > > > > > bit more detail how an I/O request triggered by the Qemu guest will be
> > > > > > handled (without error handling).
> > > > > > 
> > > > > > Explanation:
> > > > > > Q1-Q4: Qemu side process.
> > > > > > K1-K6: Kernel side process.
> > > > > > 
> > > > > > Q1. Intercept a ssch instruction.
> > > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > >     (u_ccwchain).    
> > > > > 
> > > > > Is this replacing guest physical address in the program with QEMU
> > > > > virtual addresses?    
> > > > Yes.
> > > >   
> > > > >     
> > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > > > >         program, which becomes runnable for a real device.    
> > > > > 
> > > > > And here we translate and likely pin QEMU virtual address to physical
> > > > > addresses to further modify the program sent into the channel?    
> > > > Yes. Exactly.
> > > >   
> > > > >     
> > > > > >     K3. With the necessary information contained in the orb passed in
> > > > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > > >         for the I/O result.
> > > > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > > >         update the user space irb.
> > > > > >     K6. Copy irb and scsw back to user space.
> > > > > > Q4. Update the irb for the guest.    
> > > > > 
> > > > > If the answers to my questions above are both yes,    
> > > > Yes, they are.
> > > >   
> > > > > then this is really a mediated interface, not a direct assignment.    
> > > > Right. This is true.
> > > >   
> > > > > We don't need an iommu
> > > > > because we're policing and translating the program for the device
> > > > > before it gets sent to hardware.  I think there are better ways than
> > > > > noiommu to handle such devices perhaps even with better performance
> > > > > than this two-stage translation.  In fact, I think the solution we plan
> > > > > to implement for vGPU support would work here.
> > > > > 
> > > > > Like your device, a vGPU is mediated, we don't have IOMMU level
> > > > > translation or isolation since a vGPU is largely a software construct,
> > > > > but we do have software policing and translating how the GPU is
> > > > > programmed.  To do this we're creating a type1 compatible vfio iommu
> > > > > backend that uses the existing map and unmap ioctls, but rather than
> > > > > programming them into an IOMMU for a device, it simply stores the
> > > > > translations for use by later requests.  This means that a device
> > > > > programmed in a VM with guest physical addresses can have the
> > > > > vfio kernel convert that address to process virtual address, pin the
> > > > > page and program the hardware with the host physical address in one
> > > > > step.    
> > > > I've read through the mail threads those discuss how to add vGPU
> > > > support in VFIO. I'm afraid that proposal could not be simply addressed
> > > > to this case, especially if we want to make the vfio api completely
> > > > compatible with the existing usage.
> > > > 
> > > > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
> > > > fixed range of address in the memory space for DMA operations. Any
> > > > address inside this range will not be used for other purpose. Thus we
> > > > can add memory listener on this range, and pin the pages for further
> > > > use (DMA operation). And we can keep the pages pinned during the life
> > > > cycle of the VM (not quite accurate, or I should say 'the target
> > > > device').  
> > > 
> > > That's not entirely accurate.  Ignoring a guest IOMMU, current device
> > > assignment pins all of guest memory, not just a dedicated, exclusive
> > > range of it, in order to map it through the hardware IOMMU.  That gives
> > > the guest the ability to transparently perform DMA with the device
> > > since the IOMMU maps the guest physical to host physical translations.  
> > Thanks for this explanation.
> > 
> > I noticed in the Qemu part, when we tried to introduce vfio-pci to the
> > s390 architecture, we set the IOMMU width by calling
> > memory_region_add_subregion before initializing the address_space of
> > the PCI device, which will be registered with the vfio_memory_listener
> > later. The 'width' of the subregion is what I called the 'range' in the
> > former reply.
> > 
> > The first reason we did that is, we know exactly the dma memory
> > range, and we got the width by 'dma_addr_end - dma_addr_start'. The
> > second reason we have to do that is, using the following statement will
> > cause the initialization of the guest tremendously long:
> >     group = vfio_get_group(groupid, &address_space_memory);
> > Because doing map on [0, UINT64_MAX] range does cost lots of time. For
> > me, it's unacceptably long (more than 5 minutes).
> > 
> > My questions are:
> > 1. Why we have to 'pin all of guest memory' if we do know the
> > iommu memory range?
> 
> We have a few different configuration here, so let's not confuse them.
> On x86 with pci device assignment we typically don't have a guest IOMMU
> so the guest assumes the device can DMA to any address in the guest
> memory space.  To enable that we pin all of guest memory and map it
> through the IOMMU.  Even with a guest IOMMU on x86, it's an optional
> feature that the guest OS may or may not use, so we'll always at least
> startup in this mode and the guest may or may not enable something else.
> 
> When we have a guest IOMMU available, the device switches to a
> different address space, note that in current QEMU code,
> vfio_get_group() is actually called as:
> 
>     group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev));
> 
> Where pci_device_iommu_address_space() determines whether the device is
> translated by an IOMMU and defaults back to &address_space_memory if
> not.  So we already have code that is supposed to handle the difference
> between whether we're mapping all of guest memory or whether we're only
> registering translations populated in the IOMMU for the device.
> 
> It appears that S390 implements some sort of IOMMU in the guest, so
> theoretically DMA_MAP and DMA_UNMAP operations are only going to map
> the IOTLB translations relevant to that device.  At least that's how
> it's supposed to work.  So we shouldn't be pinning all of guest memory
> for the PCI case.
> 
> When we switch to the vgpu/mediated-device approach, everything should
> work the same except the DMA_MAP and DMA_UNMAP ioctls don't do any
> pinning or IOMMU mapping.  They only update the in-kernel vfio view of
> IOVA to process virtual translations.  These translations are then
> consumed only when a device operation requires DMA.  At that point we
> do an IOVA-to-VA translation and page_to_pfn(get_user_pages()) to get a
> host physical address, which is only pinned while the operation is
> inflight.
> 
> > 2. Didn't you have the long time starting problem either? Or I
> > must miss something. For the vfio-ccw case, there is no fixed range. So
> > according to your proposal, vfio-ccw has to pin all of guest memory.
> > And I guess I will encounter this problem again.
> 
> x86 with a guest IOMMU is very new and still not upstream, so I don't
> know if there's a point at which we perform an operation over the
> entire address space, that would be slow.  It seems like something we
> could optimize though.  x86 without a guest IOMMU only performs DMA_MAP
> operations for the actual populated guest memory.  This is of course
> not free, but is negligible for small guests and scales as the memory
> size of the guest increases.  According the the vgpu/mediated-device
> proposal, there would be no pinning occurring at startup, the DMA_MAP
> would only be populating a tree of IOVA-to-VA mappings using the
> granularity of the DMA_MAP parameters itself.
> 
> > > 
> > > That's not what vGPU is about.  In the case of vGPU the proposal is to
> > > use the same QEMU vfio MemoryListener API, but only for the purpose of
> > > having an accurate database of guest physical to process virtual
> > > translations for the VM.  In your above example, this means step Q2 is
> > > eliminated because step K2 has the information to perform both a guest
> > > physical to process virtual translation and to pin the page to get a
> > > host physical address.  So you'd only need to modify the program once.  
> > According to my understanding of your proposal, I should do:
> > ------------------------------------------------------------
> > #1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
> > When starting the guest, pin all of guest memory, and form the database.
> 
> I hope that we can have a common type1-compatible iommu backend for
> vfio, there's nothing ccw specific there.  Pages would not be pinned,
> only registered for later retrieval by the mediated-device backend and
> only for the runtime of the ccw program in your case.
> 
> > #2. In the driver of the ccw devices, when an I/O instruction was
> > intercepted, query the database and translate the ccw program for I/O
> > operation.
> 
> The database query would be the point at which the page is pinned, so
> there would be some sort of 'put' of the translation after the ccw
> program executes to release the pin.
> 
> > I also noticed in another thread:
> > ---------------------------------
> > [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
> > 
> > Kirti did:
> > 1. don't pin the pages in the map ioctl for the vGPU case.
> > 2. export vfio_pin_pages and vfio_unpin_pages.
> > 
> > Although their patches didn't show how these interfaces were used, I
> > guess them can either use these interfaces to pin/unpin all of the
> > guest memory, or pin/unpin memory on demand. So can I reuse their work
> > to finish my #1? If the answer is yes, then I could change my plan and
> 
> Yes, we would absolutely only want one vfio iommu backend doing this,
> there's nothing device specific about it.  We're looking at supporting
> both modes of operation, fully pinned and pin-on-demand.  NVIDIA vGPU
> wants the on-demand approach while Intel vGPU wants to pin the entire
> guest, at least for an initial solution.  This iommu backend would need
> to support both as determined by the mediated device backend.

Right, we will add a new callback to mediated device backend interface for this
purpose in v4 version patch.

Thanks,
Neo

> 
> > do:
> > #1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
> > When starting the guest, form the <vaddr, iova, size> database.
> > 
> > #2. In the driver of the ccw devices, when an I/O instruction was
> > intercepted, call vfio_pin_pages (Kirti's version) to get the host
> > physical address, then translate the ccw program for I/O operation.
> > 
> > So which one is the right way to go?
> 
> As above, I think we have a need to support both approaches in this new
> iommu backend, it will be up to you to determine which is appropriate
> for your devices and guest drivers.  A fully pinned guest has a latency
> advantage, but obviously there are numerous disadvantages for the
> pinning itself.  Pinning on-demand has overhead to set up each DMA
> operation by the device but has a much smaller pinning footprint.
> 
> > > > Well, a Subchannel Device does not have such a range of address. The
> > > > device driver simply calls kmalloc() to get a piece of memory, and
> > > > assembles a ccw program with it, before issuing the ccw program to
> > > > perform an I/O operation. So the Qemu memory listener can't tell if an
> > > > address is for an I/O operation, or for whatever else. And this makes
> > > > the memory listener unnecessary for our case.  
> > > 
> > > It's only unnecessary because QEMU is manipulating the program to
> > > replace those addresses with process virtual addresses.  The purpose
> > > of the MemoryListener in the vGPU approach is only to inform the
> > > kernel so that it can perform that translation itself.
> > >   
> > > > The only time point that we know we should pin pages for I/O, is the
> > > > time that an I/O instruction (e.g. ssch) was intercepted. At this
> > > > point, we know the address contained in the parameter of the ssch
> > > > instruction points to a piece of memory that contains a ccw program.
> > > > Then we do: pin the pages --> convert the ccw program --> perform the
> > > > I/O --> return the I/O result --> and unpin the pages.  
> > > 
> > > And you could do exactly the same with the vGPU model, it's simply a
> > > difference of how many times the program is converted and using the
> > > MemoryListener to update guest physical to process virtual addresses in
> > > the kernel.  
> > Understand.
> > 
> > >   
> > > > > This architecture also makes the vfio api completely compatible with
> > > > > existing usage without tainting QEMU with support for noiommu devices.
> > > > > I would strongly suggest following a similar approach and dropping the
> > > > > noiommu interface.  We really do not need to confuse users with noiommu
> > > > > devices that are safe and assignable and devices where noiommu should
> > > > > warn them to stay away.  Thanks,    
> > > > Understand. But like explained above, even if we introduce a new vfio
> > > > iommu backend, what it does would probably look quite like what the
> > > > no-iommu backend does. Any idea about this?  
> > > 
> > > It's not, a mediated device simply shifts the isolation guarantees from
> > > hardware protection in an IOMMU to software protection in a mediated
> > > vfio bus driver.  The IOMMU interface simply becomes a database through
> > > which we can perform in-kernel translations.  All you want is the vfio
> > > device model and you have the ability to do that in a secure way, which
> > > is the same as vGPU.  The no-iommu code is intended to provide the vfio
> > > device model in a known-to-be-insecure means.  I don't think you want
> > > to build on that and I don't think we want no-iommu anywhere near
> > > QEMU.  Thanks,  
> > Got it. I will mimic the vGPU model, once the above questions are
> > clarified. :>
> 
> Thanks,
> Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC 0/8] basic vfio-ccw infrastructure
  2016-05-05 19:19           ` [Qemu-devel] " Alex Williamson
@ 2016-05-09  9:55             ` Dong Jia
  -1 siblings, 0 replies; 36+ messages in thread
From: Dong Jia @ 2016-05-09  9:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-s390, qemu-devel, renxiaof, cornelia.huck,
	borntraeger, agraf, Tian, Kevin, Song, Jike, Neo Jia,
	Kirti Wankhede, Dong Jia

On Thu, 5 May 2016 13:19:45 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> [cc +Intel,NVIDIA]
> 
> On Thu, 5 May 2016 18:29:08 +0800
> Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:
> 
> > On Wed, 4 May 2016 13:26:53 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > > On Wed, 4 May 2016 17:26:29 +0800
> > > Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:
> > >   
> > > > On Fri, 29 Apr 2016 11:17:35 -0600
> > > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > > 
> > > > Dear Alex:
> > > > 
> > > > Thanks for the comments.
> > > > 
> > > > [...]
> > > >   
> > > > > > 
> > > > > > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> > > > > > good example to understand how these patches work. Here is a little
> > > > > > bit more detail on how an I/O request triggered by the Qemu guest will be
> > > > > > handled (without error handling).
> > > > > > 
> > > > > > Explanation:
> > > > > > Q1-Q4: Qemu side process.
> > > > > > K1-K6: Kernel side process.
> > > > > > 
> > > > > > Q1. Intercept a ssch instruction.
> > > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > >     (u_ccwchain).    
> > > > > 
> > > > > Is this replacing guest physical address in the program with QEMU
> > > > > virtual addresses?    
> > > > Yes.
> > > >   
> > > > >     
> > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > > > >         program, which becomes runnable for a real device.    
> > > > > 
> > > > > And here we translate and likely pin QEMU virtual address to physical
> > > > > addresses to further modify the program sent into the channel?    
> > > > Yes. Exactly.
> > > >   
> > > > >     
> > > > > >     K3. With the necessary information contained in the orb passed in
> > > > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > > >         for the I/O result.
> > > > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > > >         update the user space irb.
> > > > > >     K6. Copy irb and scsw back to user space.
> > > > > > Q4. Update the irb for the guest.    
> > > > > 
> > > > > If the answers to my questions above are both yes,    
> > > > Yes, they are.
> > > >   
> > > > > then this is really a mediated interface, not a direct assignment.    
> > > > Right. This is true.
> > > >   
> > > > > We don't need an iommu
> > > > > because we're policing and translating the program for the device
> > > > > before it gets sent to hardware.  I think there are better ways than
> > > > > noiommu to handle such devices perhaps even with better performance
> > > > > than this two-stage translation.  In fact, I think the solution we plan
> > > > > to implement for vGPU support would work here.
> > > > > 
> > > > > Like your device, a vGPU is mediated, we don't have IOMMU level
> > > > > translation or isolation since a vGPU is largely a software construct,
> > > > > but we do have software policing and translating how the GPU is
> > > > > programmed.  To do this we're creating a type1 compatible vfio iommu
> > > > > backend that uses the existing map and unmap ioctls, but rather than
> > > > > programming them into an IOMMU for a device, it simply stores the
> > > > > translations for use by later requests.  This means that a device
> > > > > programmed in a VM with guest physical addresses can have the
> > > > > vfio kernel convert that address to process virtual address, pin the
> > > > > page and program the hardware with the host physical address in one
> > > > > step.    
> > > > I've read through the mail threads that discuss how to add vGPU
> > > > support in VFIO. I'm afraid that proposal could not be simply applied
> > > > to this case, especially if we want to make the vfio api completely
> > > > compatible with the existing usage.
> > > > 
> > > > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
> > > > fixed range of addresses in the memory space for DMA operations. Any
> > > > address inside this range will not be used for any other purpose. Thus we
> > > > can add memory listener on this range, and pin the pages for further
> > > > use (DMA operation). And we can keep the pages pinned during the life
> > > > cycle of the VM (not quite accurate, or I should say 'the target
> > > > device').  
> > > 
> > > That's not entirely accurate.  Ignoring a guest IOMMU, current device
> > > assignment pins all of guest memory, not just a dedicated, exclusive
> > > range of it, in order to map it through the hardware IOMMU.  That gives
> > > the guest the ability to transparently perform DMA with the device
> > > since the IOMMU maps the guest physical to host physical translations.  
> > Thanks for this explanation.
> > 
> > I noticed in the Qemu part, when we tried to introduce vfio-pci to the
> > s390 architecture, we set the IOMMU width by calling
> > memory_region_add_subregion before initializing the address_space of
> > the PCI device, which will be registered with the vfio_memory_listener
> > later. The 'width' of the subregion is what I called the 'range' in the
> > former reply.
> > 
> > The first reason we did that is, we know exactly the dma memory
> > range, and we got the width by 'dma_addr_end - dma_addr_start'. The
> > second reason we have to do that is, using the following statement will
> > make the initialization of the guest take a tremendously long time:
> >     group = vfio_get_group(groupid, &address_space_memory);
> > Because doing a map on the [0, UINT64_MAX] range costs a lot of time. For
> > me, it's unacceptably long (more than 5 minutes).
> > 
> > My questions are:
> > 1. Why do we have to 'pin all of guest memory' if we do know the
> > iommu memory range?
> 
> We have a few different configurations here, so let's not confuse them.
> On x86 with pci device assignment we typically don't have a guest IOMMU
> so the guest assumes the device can DMA to any address in the guest
> memory space.  To enable that we pin all of guest memory and map it
> through the IOMMU.  Even with a guest IOMMU on x86, it's an optional
> feature that the guest OS may or may not use, so we'll always at least
> startup in this mode and the guest may or may not enable something else.
> 
> When we have a guest IOMMU available, the device switches to a
> different address space, note that in current QEMU code,
> vfio_get_group() is actually called as:
> 
>     group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev));
> 
> Where pci_device_iommu_address_space() determines whether the device is
> translated by an IOMMU and defaults back to &address_space_memory if
> not.  So we already have code that is supposed to handle the difference
> between whether we're mapping all of guest memory or whether we're only
> registering translations populated in the IOMMU for the device.
Big thanks! I'm clear about this now.

> 
> It appears that S390 implements some sort of IOMMU in the guest, so
> theoretically DMA_MAP and DMA_UNMAP operations are only going to map
> the IOTLB translations relevant to that device.  At least that's how
> it's supposed to work.  So we shouldn't be pinning all of guest memory
> for the PCI case.
Nod.

> 
> When we switch to the vgpu/mediated-device approach, everything should
> work the same except the DMA_MAP and DMA_UNMAP ioctls don't do any
> pinning or IOMMU mapping.  They only update the in-kernel vfio view of
> IOVA to process virtual translations.  These translations are then
> consumed only when a device operation requires DMA.  At that point we
> do an IOVA-to-VA translation and page_to_pfn(get_user_pages()) to get a
> host physical address, which is only pinned while the operation is
> inflight.
Got this.
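To make the flow Alex describes concrete, here is a small userspace C model of the pin-on-demand scheme: DMA_MAP only records an IOVA-to-VA translation, and a device operation later looks the translation up, takes a pin for the duration of the request, and puts it afterwards. Everything here is an invention of this sketch (the names dma_map, iova_to_va_pin, iova_put, and the flat array standing in for the kernel's mapping tree); it is not the real vfio API.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* One registered translation: guest-view iova -> process virtual addr. */
struct mapping {
    uint64_t iova;
    uint64_t vaddr;
    uint64_t size;
    int pinned;     /* stand-in for a get_user_pages() pin count */
};

#define MAX_MAPPINGS 16
static struct mapping db[MAX_MAPPINGS];
static int nr_mappings;

/* DMA_MAP: register the translation only; nothing is pinned yet. */
int dma_map(uint64_t iova, uint64_t vaddr, uint64_t size)
{
    if (nr_mappings == MAX_MAPPINGS)
        return -1;
    db[nr_mappings++] = (struct mapping){ iova, vaddr, size, 0 };
    return 0;
}

/* Device operation: translate iova to a virtual address and take a
 * pin for the lifetime of the request. Returns -1 if unmapped. */
int64_t iova_to_va_pin(uint64_t iova)
{
    for (int i = 0; i < nr_mappings; i++) {
        if (iova >= db[i].iova && iova - db[i].iova < db[i].size) {
            db[i].pinned++;
            return (int64_t)(db[i].vaddr + (iova - db[i].iova));
        }
    }
    return -1;
}

/* The 'put' after the request completes, releasing the pin. */
void iova_put(uint64_t iova)
{
    for (int i = 0; i < nr_mappings; i++)
        if (iova >= db[i].iova && iova - db[i].iova < db[i].size)
            db[i].pinned--;
}
```

In the real kernel the pin would come from get_user_pages() followed by page_to_pfn(), and the lookup from whatever tree the type1-compatible backend keeps.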

> 
> > 2. Didn't you have the long startup time problem too? Or I
> > must miss something. For the vfio-ccw case, there is no fixed range. So
> > according to your proposal, vfio-ccw has to pin all of guest memory.
> > And I guess I will encounter this problem again.
> 
> x86 with a guest IOMMU is very new and still not upstream, so I don't
> know if there's a point at which we perform an operation over the
> entire address space, that would be slow.  It seems like something we
> could optimize though.  x86 without a guest IOMMU only performs DMA_MAP
> operations for the actual populated guest memory.  This is of course
> not free, but is negligible for small guests and scales as the memory
> size of the guest increases.
I can't recall clearly, but our problem must have something to do with
the iommu_replay action in vfio_listener_region_add. It led us to do
an iommu replay over the whole guest address space at that time.

Since we will definitely not pin all of the guest memory, this won't be a
problem for us anymore. :>

>  According to the vgpu/mediated-device
> proposal, there would be no pinning occurring at startup, the DMA_MAP
> would only be populating a tree of IOVA-to-VA mappings using the
> granularity of the DMA_MAP parameters itself.
Understand and got this too.

> 
> > > 
> > > That's not what vGPU is about.  In the case of vGPU the proposal is to
> > > use the same QEMU vfio MemoryListener API, but only for the purpose of
> > > having an accurate database of guest physical to process virtual
> > > translations for the VM.  In your above example, this means step Q2 is
> > > eliminated because step K2 has the information to perform both a guest
> > > physical to process virtual translation and to pin the page to get a
> > > host physical address.  So you'd only need to modify the program once.  
> > According to my understanding of your proposal, I should do:
> > ------------------------------------------------------------
> > #1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
> > When starting the guest, pin all of guest memory, and form the database.
> 
> I hope that we can have a common type1-compatible iommu backend for
> vfio, there's nothing ccw specific there.  Pages would not be pinned,
> only registered for later retrieval by the mediated-device backend and
> only for the runtime of the ccw program in your case.
This sounds reasonable and feasible.

> 
> > #2. In the driver of the ccw devices, when an I/O instruction was
> > intercepted, query the database and translate the ccw program for I/O
> > operation.
> 
> The database query would be the point at which the page is pinned,
Just like Kirti's vfio_pin_pages.

> so
> there would be some sort of 'put' of the translation after the ccw
> program executes to release the pin.
Right. Quite obviously, each call to vfio_pin_pages should be paired with
a call to vfio_unpin_pages.
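The pairing requirement can be illustrated with a toy model: every data address in a ccw chain is pinned before the chain is issued, and every pin is released afterwards, including on the error path where only part of the chain was pinned. Note that struct ccw is reduced to a single field here, and ccw_pin_page()/ccw_unpin_page() are hypothetical stand-ins for the vfio_pin_pages()/vfio_unpin_pages() interfaces, not their real signatures.

```c
#include <assert.h>
#include <stdint.h>

struct ccw {
    uint64_t cda;   /* channel data address to translate and pin */
};

int pin_count;      /* outstanding pins in this model */

/* Stand-in pin: a zero cda models a translation failure. */
static int ccw_pin_page(uint64_t cda)
{
    if (cda == 0)
        return -1;
    pin_count++;
    return 0;
}

static void ccw_unpin_page(uint64_t cda)
{
    (void)cda;
    pin_count--;
}

/* Pin every page of the chain, or unwind what was pinned on failure. */
int ccw_chain_pin(struct ccw *chain, int len)
{
    int i;

    for (i = 0; i < len; i++) {
        if (ccw_pin_page(chain[i].cda) < 0) {
            while (--i >= 0)
                ccw_unpin_page(chain[i].cda);
            return -1;
        }
    }
    return 0;
}

/* The matching 'put' once the I/O completes. */
void ccw_chain_unpin(struct ccw *chain, int len)
{
    for (int i = 0; i < len; i++)
        ccw_unpin_page(chain[i].cda);
}
```

The unwind loop in ccw_chain_pin() is the important part: a partial failure must not leave pages pinned, which is the same invariant as calling the pin and unpin interfaces in pairs.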

> 
> > I also noticed in another thread:
> > ---------------------------------
> > [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
> > 
> > Kirti did:
> > 1. don't pin the pages in the map ioctl for the vGPU case.
> > 2. export vfio_pin_pages and vfio_unpin_pages.
> > 
> > Although their patches didn't show how these interfaces were used, I
> > guess they can either use these interfaces to pin/unpin all of the
> > guest memory, or pin/unpin memory on demand. So can I reuse their work
> > to finish my #1? If the answer is yes, then I could change my plan and
> 
> Yes, we would absolutely only want one vfio iommu backend doing this,
> there's nothing device specific about it.  We're looking at supporting
> both modes of operation, fully pinned and pin-on-demand.  NVIDIA vGPU
> wants the on-demand approach while Intel vGPU wants to pin the entire
> guest, at least for an initial solution.  This iommu backend would need
> to support both as determined by the mediated device backend.
I will stay tuned to their discussion.

> 
> > do:
> > #1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
> > When starting the guest, form the <vaddr, iova, size> database.
> > 
> > #2. In the driver of the ccw devices, when an I/O instruction was
> > intercepted, call vfio_pin_pages (Kirti's version) to get the host
> > physical address, then translate the ccw program for I/O operation.
> > 
> > So which one is the right way to go?
> 
> As above, I think we have a need to support both approaches in this new
> iommu backend, it will be up to you to determine which is appropriate
> for your devices and guest drivers.  A fully pinned guest has a latency
> advantage, but obviously there are numerous disadvantages for the
> pinning itself.  Pinning on-demand has overhead to set up each DMA
> operation by the device but has a much smaller pinning footprint.
Got it.
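The trade-off between the two modes can be sketched as a toy accounting model: a fully pinned backend pays its whole pinning cost at map time, while a pin-on-demand backend pins only around each operation. The mode flag and the counter below are inventions of this sketch, not part of any proposed vfio interface.

```c
#include <assert.h>

/* The two modes of operation discussed above. */
enum pin_mode { PIN_FULL, PIN_ON_DEMAND };

struct backend {
    enum pin_mode mode;
    int pinned_pages;   /* current pinning footprint */
};

/* Map time: a fully pinned backend pins everything up front. */
void backend_map(struct backend *b, int npages)
{
    if (b->mode == PIN_FULL)
        b->pinned_pages += npages;
}

/* Per-operation setup: pin-on-demand pays its cost here instead. */
void backend_access_begin(struct backend *b, int npages)
{
    if (b->mode == PIN_ON_DEMAND)
        b->pinned_pages += npages;
}

/* Operation complete: pin-on-demand drops back to a small footprint. */
void backend_access_end(struct backend *b, int npages)
{
    if (b->mode == PIN_ON_DEMAND)
        b->pinned_pages -= npages;
}
```

This mirrors the point above: full pinning trades a large steady-state footprint for zero per-operation latency, while on-demand pinning inverts that trade.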

> 
> > > > Well, a Subchannel Device does not have such a range of address. The
> > > > device driver simply calls kmalloc() to get a piece of memory, and
> > > > assembles a ccw program with it, before issuing the ccw program to
> > > > perform an I/O operation. So the Qemu memory listener can't tell if an
> > > > address is for an I/O operation, or for whatever else. And this makes
> > > > the memory listener unnecessary for our case.  
> > > 
> > > It's only unnecessary because QEMU is manipulating the program to
> > > replace those addresses with process virtual addresses.  The purpose
> > > of the MemoryListener in the vGPU approach is only to inform the
> > > kernel so that it can perform that translation itself.
> > >   
> > > > The only time point that we know we should pin pages for I/O, is the
> > > > time that an I/O instruction (e.g. ssch) was intercepted. At this
> > > > point, we know the address contained in the parameter of the ssch
> > > > instruction points to a piece of memory that contains a ccw program.
> > > > Then we do: pin the pages --> convert the ccw program --> perform the
> > > > I/O --> return the I/O result --> and unpin the pages.  
> > > 
> > > And you could do exactly the same with the vGPU model, it's simply a
> > > difference of how many times the program is converted and using the
> > > MemoryListener to update guest physical to process virtual addresses in
> > > the kernel.  
> > Understand.
> > 
> > >   
> > > > > This architecture also makes the vfio api completely compatible with
> > > > > existing usage without tainting QEMU with support for noiommu devices.
> > > > > I would strongly suggest following a similar approach and dropping the
> > > > > noiommu interface.  We really do not need to confuse users with noiommu
> > > > > devices that are safe and assignable and devices where noiommu should
> > > > > warn them to stay away.  Thanks,    
> > > > Understand. But like explained above, even if we introduce a new vfio
> > > > iommu backend, what it does would probably look quite like what the
> > > > no-iommu backend does. Any idea about this?  
> > > 
> > > It's not, a mediated device simply shifts the isolation guarantees from
> > > hardware protection in an IOMMU to software protection in a mediated
> > > vfio bus driver.  The IOMMU interface simply becomes a database through
> > > which we can perform in-kernel translations.  All you want is the vfio
> > > device model and you have the ability to do that in a secure way, which
> > > is the same as vGPU.  The no-iommu code is intended to provide the vfio
> > > device model in a known-to-be-insecure means.  I don't think you want
> > > to build on that and I don't think we want no-iommu anywhere near
> > > QEMU.  Thanks,  
> > Got it. I will mimic the vGPU model, once the above questions are
> > clarified. :>
> 
> Thanks,
> Alex
> 

Thanks again. Things are a lot clearer for me now.

--------
Dong Jia

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH RFC 0/8] basic vfio-ccw infrastructure
@ 2016-05-09  9:55             ` Dong Jia
  0 siblings, 0 replies; 36+ messages in thread
From: Dong Jia @ 2016-05-09  9:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-s390, qemu-devel, renxiaof, cornelia.huck,
	borntraeger, agraf, Tian, Kevin, Song, Jike, Neo Jia,
	Kirti Wankhede, Dong Jia

On Thu, 5 May 2016 13:19:45 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> [cc +Intel,NVIDIA]
> 
> On Thu, 5 May 2016 18:29:08 +0800
> Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:
> 
> > On Wed, 4 May 2016 13:26:53 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > > On Wed, 4 May 2016 17:26:29 +0800
> > > Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:
> > >   
> > > > On Fri, 29 Apr 2016 11:17:35 -0600
> > > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > > 
> > > > Dear Alex:
> > > > 
> > > > Thanks for the comments.
> > > > 
> > > > [...]
> > > >   
> > > > > > 
> > > > > > The user of vfio-ccw is not limited to Qemu, while Qemu is definitely a
> > > > > > good example to get understand how these patches work. Here is a little
> > > > > > bit more detail how an I/O request triggered by the Qemu guest will be
> > > > > > handled (without error handling).
> > > > > > 
> > > > > > Explanation:
> > > > > > Q1-Q4: Qemu side process.
> > > > > > K1-K6: Kernel side process.
> > > > > > 
> > > > > > Q1. Intercept a ssch instruction.
> > > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > >     (u_ccwchain).    
> > > > > 
> > > > > Is this replacing guest physical address in the program with QEMU
> > > > > virtual addresses?    
> > > > Yes.
> > > >   
> > > > >     
> > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > > > >         program, which becomes runnable for a real device.    
> > > > > 
> > > > > And here we translate and likely pin QEMU virtual address to physical
> > > > > addresses to further modify the program sent into the channel?    
> > > > Yes. Exactly.
> > > >   
> > > > >     
> > > > > >     K3. With the necessary information contained in the orb passed in
> > > > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > > >         for the I/O result.
> > > > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > > >         update the user space irb.
> > > > > >     K6. Copy irb and scsw back to user space.
> > > > > > Q4. Update the irb for the guest.    
> > > > > 
> > > > > If the answers to my questions above are both yes,    
> > > > Yes, they are.
> > > >   
> > > > > then this is really a mediated interface, not a direct assignment.    
> > > > Right. This is true.
> > > >   
> > > > > We don't need an iommu
> > > > > because we're policing and translating the program for the device
> > > > > before it gets sent to hardware.  I think there are better ways than
> > > > > noiommu to handle such devices perhaps even with better performance
> > > > > than this two-stage translation.  In fact, I think the solution we plan
> > > > > to implement for vGPU support would work here.
> > > > > 
> > > > > Like your device, a vGPU is mediated, we don't have IOMMU level
> > > > > translation or isolation since a vGPU is largely a software construct,
> > > > > but we do have software policing and translating how the GPU is
> > > > > programmed.  To do this we're creating a type1 compatible vfio iommu
> > > > > backend that uses the existing map and unmap ioctls, but rather than
> > > > > programming them into an IOMMU for a device, it simply stores the
> > > > > translations for use by later requests.  This means that a device
> > > > > programmed in a VM with guest physical addresses can have the
> > > > > vfio kernel convert that address to process virtual address, pin the
> > > > > page and program the hardware with the host physical address in one
> > > > > step.    
> > > > I've read through the mail threads those discuss how to add vGPU
> > > > support in VFIO. I'm afraid that proposal could not be simply addressed
> > > > to this case, especially if we want to make the vfio api completely
> > > > compatible with the existing usage.
> > > > 
> > > > AFAIU, a PCI device (or a vGPU device) uses a dedicated, exclusive and
> > > > fixed range of address in the memory space for DMA operations. Any
> > > > address inside this range will not be used for other purpose. Thus we
> > > > can add memory listener on this range, and pin the pages for further
> > > > use (DMA operation). And we can keep the pages pinned during the life
> > > > cycle of the VM (not quite accurate, or I should say 'the target
> > > > device').  
> > > 
> > > That's not entirely accurate.  Ignoring a guest IOMMU, current device
> > > assignment pins all of guest memory, not just a dedicated, exclusive
> > > range of it, in order to map it through the hardware IOMMU.  That gives
> > > the guest the ability to transparently perform DMA with the device
> > > since the IOMMU maps the guest physical to host physical translations.  
> > Thanks for this explanation.
> > 
> > I noticed in the Qemu part, when we tried to introduce vfio-pci to the
> > s390 architecture, we set the IOMMU width by calling
> > memory_region_add_subregion before initializing the address_space of
> > the PCI device, which will be registered with the vfio_memory_listener
> > later. The 'width' of the subregion is what I called the 'range' in the
> > former reply.
> > 
> > The first reason we did that is, we know exactly the dma memory
> > range, and we got the width by 'dma_addr_end - dma_addr_start'. The
> > second reason we have to do that is, using the following statement will
> > cause the initialization of the guest tremendously long:
> >     group = vfio_get_group(groupid, &address_space_memory);
> > Because doing map on [0, UINT64_MAX] range does cost lots of time. For
> > me, it's unacceptably long (more than 5 minutes).
> > 
> > My questions are:
> > 1. Why we have to 'pin all of guest memory' if we do know the
> > iommu memory range?
> 
> We have a few different configuration here, so let's not confuse them.
> On x86 with pci device assignment we typically don't have a guest IOMMU
> so the guest assumes the device can DMA to any address in the guest
> memory space.  To enable that we pin all of guest memory and map it
> through the IOMMU.  Even with a guest IOMMU on x86, it's an optional
> feature that the guest OS may or may not use, so we'll always at least
> startup in this mode and the guest may or may not enable something else.
> 
> When we have a guest IOMMU available, the device switches to a
> different address space, note that in current QEMU code,
> vfio_get_group() is actually called as:
> 
>     group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev));
> 
> Where pci_device_iommu_address_space() determines whether the device is
> translated by an IOMMU and defaults back to &address_space_memory if
> not.  So we already have code that is supposed to handle the difference
> between whether we're mapping all of guest memory or whether we're only
> registering translations populated in the IOMMU for the device.
Big thanks! I'm clear about this now.

> 
> It appears that S390 implements some sort of IOMMU in the guest, so
> theoretically DMA_MAP and DMA_UNMAP operations are only going to map
> the IOTLB translations relevant to that device.  At least that's how
> it's supposed to work.  So we shouldn't be pinning all of guest memory
> for the PCI case.
Nod.

> 
> When we switch to the vgpu/mediated-device approach, everything should
> work the same except the DMA_MAP and DMA_UNMAP ioctls don't do any
> pinning or IOMMU mapping.  They only update the in-kernel vfio view of
> IOVA to process virtual translations.  These translations are then
> consumed only when a device operation requires DMA.  At that point we
> do an IOVA-to-VA translation and page_to_pfn(get_user_pages()) to get a
> host physical address, which is only pinned while the operation is
> inflight.
Got this.

> 
> > 2. Didn't you have the long time starting problem either? Or I
> > must miss something. For the vfio-ccw case, there is no fixed range. So
> > according to your proposal, vfio-ccw has to pin all of guest memory.
> > And I guess I will encounter this problem again.
> 
> x86 with a guest IOMMU is very new and still not upstream, so I don't
> know if there's a point at which we perform an operation over the
> entire address space, that would be slow.  It seems like something we
> could optimize though.  x86 without a guest IOMMU only performs DMA_MAP
> operations for the actual populated guest memory.  This is of course
> not free, but is negligible for small guests and scales as the memory
> size of the guest increases.
I can't recall clearly, but our problem must have something to do with
the iommu_replay action in vfio_listener_region_add. It lead us to do
iommu replay in the whole guest address space at that time.

Since we will definitely not pin all the guest memory. This won't be a
problem for us anymore. :>

>  According the the vgpu/mediated-device
> proposal, there would be no pinning occurring at startup, the DMA_MAP
> would only be populating a tree of IOVA-to-VA mappings using the
> granularity of the DMA_MAP parameters itself.
Understand and got this too.

> 
> > > 
> > > That's not what vGPU is about.  In the case of vGPU the proposal is to
> > > use the same QEMU vfio MemoryListener API, but only for the purpose of
> > > having an accurate database of guest physical to process virtual
> > > translations for the VM.  In your above example, this means step Q2 is
> > > eliminated because step K2 has the information to perform both a guest
> > > physical to process virtual translation and to pin the page to get a
> > > host physical address.  So you'd only need to modify the program once.  
> > According to my understanding of your proposal, I should do:
> > ------------------------------------------------------------
> > #1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
> > When starting the guest, pin all of guest memory, and form the database.
> 
> I hope that we can have a common type1-compatible iommu backend for
> vfio, there's nothing ccw specific there.  Pages would not be pinned,
> only registered for later retrieval by the mediated-device backend and
> only for the runtime of the ccw program in your case.
This sounds reasonable and feasible.

> 
> > #2. In the driver of the ccw devices, when an I/O instruction was
> > intercepted, query the database and translate the ccw program for I/O
> > operation.
> 
> The database query would be the point at which the page is pinned,
Just like Kirti's vfio_pin_pages.

> so
> there would be some sort of 'put' of the translation after the ccw
> program executes to release the pin.
Right. Quite obviously, if we call vfio_pin_pages, we should call
vfio_unpin_pages in pair.

> 
> > I also noticed in another thread:
> > ---------------------------------
> > [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
> > 
> > Kirti did:
> > 1. don't pin the pages in the map ioctl for the vGPU case.
> > 2. export vfio_pin_pages and vfio_unpin_pages.
> > 
> > Although their patches didn't show how these interfaces were used, I
> > guess them can either use these interfaces to pin/unpin all of the
> > guest memory, or pin/unpin memory on demand. So can I reuse their work
> > to finish my #1? If the answer is yes, then I could change my plan and
> 
> Yes, we would absolutely only want one vfio iommu backend doing this,
> there's nothing device specific about it.  We're looking at supporting
> both modes of operation, fully pinned and pin-on-demand.  NVIDIA vGPU
> wants the on-demand approach while Intel vGPU wants to pin the entire
> guest, at least for an initial solution.  This iommu backend would need
> to support both as determined by the mediated device backend.
I will stay tuned with their discussion.

> 
> > do:
> > #1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
> > When starting the guest, form the <vaddr, iova, size> database.
> > 
> > #2. In the driver of the ccw devices, when an I/O instruction was
> > intercepted, call vfio_pin_pages (Kirti's version) to get the host
> > physical address, then translate the ccw program for I/O operation.
> > 
> > So which one is the right way to go?
> 
> As above, I think we have a need to support both approaches in this new
> iommu backend, it will be up to you to determine which is appropriate
> for your devices and guest drivers.  A fully pinned guest has a latency
> advantage, but obviously there are numerous disadvantages for the
> pinning itself.  Pinning on-demand has overhead to set up each DMA
> operation issued by the device but has a much smaller pinning footprint.
Got it.

> 
> > > > Well, a Subchannel Device does not have such a range of addresses. The
> > > > device driver simply calls kmalloc() to get a piece of memory, and
> > > > assembles a ccw program with it, before issuing the ccw program to
> > > > perform an I/O operation. So the Qemu memory listener can't tell if an
> > > > address is for an I/O operation, or for whatever else. And this makes
> > > > the memory listener unnecessary for our case.  
> > > 
> > > It's only unnecessary because QEMU is manipulating the program to
> > > replace those addresses with process virtual addresses.  The purpose
> > > of the MemoryListener in the vGPU approach is only to inform the
> > > kernel so that it can perform that translation itself.
> > >   
> > > > The only time point that we know we should pin pages for I/O, is the
> > > > time that an I/O instruction (e.g. ssch) was intercepted. At this
> > > > point, we know the address contained in the parameter of the ssch
> > > > instruction points to a piece of memory that contains a ccw program.
> > > > Then we do: pin the pages --> convert the ccw program --> perform the
> > > > I/O --> return the I/O result --> and unpin the pages.  
> > > 
> > > And you could do exactly the same with the vGPU model, it's simply a
> > > difference of how many times the program is converted and using the
> > > MemoryListener to update guest physical to process virtual addresses in
> > > the kernel.  
> > Understand.
> > 
> > >   
> > > > > This architecture also makes the vfio api completely compatible with
> > > > > existing usage without tainting QEMU with support for noiommu devices.
> > > > > I would strongly suggest following a similar approach and dropping the
> > > > > noiommu interface.  We really do not need to confuse users with noiommu
> > > > > devices that are safe and assignable and devices where noiommu should
> > > > > warn them to stay away.  Thanks,    
> > > > Understand. But like explained above, even if we introduce a new vfio
> > > > iommu backend, what it does would probably look quite like what the
> > > > no-iommu backend does. Any idea about this?  
> > > 
> > > It's not; a mediated device simply shifts the isolation guarantees from
> > > hardware protection in an IOMMU to software protection in a mediated
> > > vfio bus driver.  The IOMMU interface simply becomes a database through
> > > which we can perform in-kernel translations.  All you want is the vfio
> > > device model and you have the ability to do that in a secure way, which
> > > is the same as vGPU.  The no-iommu code is intended to provide the vfio
> > > device model by a known-to-be-insecure means.  I don't think you want
> > > to build on that and I don't think we want no-iommu anywhere near
> > > QEMU.  Thanks,  
> > Got it. I will mimic the vGPU model, once the above questions are
> > clarified. :>
> 
> Thanks,
> Alex
> 

Thanks again. Things are a lot clearer for me now.

--------
Dong Jia

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH RFC 0/8] basic vfio-ccw infrastructure
  2016-05-05 20:23             ` Neo Jia
  (?)
@ 2016-05-09  9:59               ` Dong Jia
  -1 siblings, 0 replies; 36+ messages in thread
From: Dong Jia @ 2016-05-09  9:59 UTC (permalink / raw)
  To: Neo Jia
  Cc: Alex Williamson, kvm, linux-s390, qemu-devel, renxiaof,
	cornelia.huck, borntraeger, agraf, Tian, Kevin, Song, Jike,
	Kirti Wankhede, Dong Jia

On Thu, 5 May 2016 13:23:11 -0700
Neo Jia <cjia@nvidia.com> wrote:

> > > I also noticed in another thread:
> > > ---------------------------------
> > > [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
> > > 
> > > Kirti did:
> > > 1. don't pin the pages in the map ioctl for the vGPU case.
> > > 2. export vfio_pin_pages and vfio_unpin_pages.
> > > 
> > > Although their patches didn't show how these interfaces were used, I
> > > guess they could either use these interfaces to pin/unpin all of the
> > > guest memory, or pin/unpin memory on demand. So can I reuse their work
> > > to finish my #1? If the answer is yes, then I could change my plan and  
> > 
> > Yes, we would absolutely only want one vfio iommu backend doing this,
> > there's nothing device specific about it.  We're looking at supporting
> > both modes of operation, fully pinned and pin-on-demand.  NVIDIA vGPU
> > wants the on-demand approach while Intel vGPU wants to pin the entire
> > guest, at least for an initial solution.  This iommu backend would need
> > to support both as determined by the mediated device backend.  
> 
> Right, we will add a new callback to the mediated device backend interface for this
> purpose in v4 version patch.
Dear Neo:
Thanks for this information.

What interests me most is the new vfio iommu backend. Looking forward to
your new patches. :>

> 
> Thanks,
> Neo
> 
> >   
> > > do:
> > > #1. Introduce a vfio_iommu_type1_ccw as the vfio iommu backend for ccw.
> > > When starting the guest, form the <vaddr, iova, size> database.
> > > 
> > > #2. In the driver of the ccw devices, when an I/O instruction was
> > > intercepted, call vfio_pin_pages (Kirti's version) to get the host
> > > physical address, then translate the ccw program for I/O operation.
> > > 
> > > So which one is the right way to go?  
> > 
> > As above, I think we have a need to support both approaches in this new
> > iommu backend, it will be up to you to determine which is appropriate
> > for your devices and guest drivers.  A fully pinned guest has a latency
> > advantage, but obviously there are numerous disadvantages for the
> > pinning itself.  Pinning on-demand has overhead to set up each DMA
> > operation issued by the device but has a much smaller pinning footprint.


--------
Dong Jia

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2016-05-09  9:59 UTC | newest]

Thread overview: 36+ messages
2016-04-29 12:11 [PATCH RFC 0/8] basic vfio-ccw infrastructure Dong Jia Shi
2016-04-29 12:11 ` [Qemu-devel] " Dong Jia Shi
2016-04-29 12:11 ` [PATCH RFC 1/8] iommu: s390: enable iommu api for s390 ccw devices Dong Jia Shi
2016-04-29 12:11   ` [Qemu-devel] " Dong Jia Shi
2016-04-29 12:11 ` [PATCH RFC 2/8] s390: move orb.h from drivers/s390/ to arch/s390/ Dong Jia Shi
2016-04-29 12:11   ` [Qemu-devel] " Dong Jia Shi
2016-04-29 12:11 ` [PATCH RFC 3/8] vfio: ccw: basic implementation for vfio_ccw driver Dong Jia Shi
2016-04-29 12:11   ` [Qemu-devel] " Dong Jia Shi
2016-04-29 12:11 ` [PATCH RFC 4/8] vfio: ccw: realize VFIO_DEVICE_GET_INFO ioctl Dong Jia Shi
2016-04-29 12:11   ` [Qemu-devel] " Dong Jia Shi
2016-04-29 12:11 ` [PATCH RFC 5/8] vfio: ccw: realize VFIO_DEVICE_CCW_HOT_RESET ioctl Dong Jia Shi
2016-04-29 12:11   ` [Qemu-devel] " Dong Jia Shi
2016-04-29 12:11 ` [PATCH RFC 6/8] vfio: ccw: introduce page array interfaces Dong Jia Shi
2016-04-29 12:11   ` [Qemu-devel] " Dong Jia Shi
2016-04-29 12:11 ` [PATCH RFC 7/8] vfio: ccw: introduce ccw chain interfaces Dong Jia Shi
2016-04-29 12:11   ` [Qemu-devel] " Dong Jia Shi
2016-04-29 12:11 ` [PATCH RFC 8/8] vfio: ccw: realize VFIO_DEVICE_CCW_CMD_REQUEST ioctl Dong Jia Shi
2016-04-29 12:11   ` [Qemu-devel] " Dong Jia Shi
2016-04-29 17:17 ` [PATCH RFC 0/8] basic vfio-ccw infrastructure Alex Williamson
2016-04-29 17:17   ` [Qemu-devel] " Alex Williamson
2016-05-04  9:26   ` Dong Jia
2016-05-04  9:26     ` [Qemu-devel] " Dong Jia
2016-05-04 19:26     ` Alex Williamson
2016-05-04 19:26       ` [Qemu-devel] " Alex Williamson
2016-05-05 10:29       ` Dong Jia
2016-05-05 10:29         ` [Qemu-devel] " Dong Jia
2016-05-05 19:19         ` Alex Williamson
2016-05-05 19:19           ` [Qemu-devel] " Alex Williamson
2016-05-05 20:23           ` Neo Jia
2016-05-05 20:23             ` [Qemu-devel] " Neo Jia
2016-05-05 20:23             ` Neo Jia
2016-05-09  9:59             ` Dong Jia
2016-05-09  9:59               ` [Qemu-devel] " Dong Jia
2016-05-09  9:59               ` Dong Jia
2016-05-09  9:55           ` Dong Jia
2016-05-09  9:55             ` [Qemu-devel] " Dong Jia
