All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Marcel Apfelbaum <marcel@redhat.com>
Cc: qemu-devel@nongnu.org, ehabkost@redhat.com, imammedo@redhat.com,
	yuval.shaia@oracle.com, pbonzini@redhat.com
Subject: Re: [Qemu-devel] [PATCH V2 3/5] docs: add pvrdma device documentation
Date: Tue, 19 Dec 2017 19:47:55 +0200	[thread overview]
Message-ID: <20171219194739-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <20171217125457.3429-4-marcel@redhat.com>

On Sun, Dec 17, 2017 at 02:54:55PM +0200, Marcel Apfelbaum wrote:
> Signed-off-by: Marcel Apfelbaum <marcel@redhat.com>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> ---
>  docs/pvrdma.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 145 insertions(+)
>  create mode 100644 docs/pvrdma.txt
> 
> diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
> new file mode 100644
> index 0000000000..74c5cf2495
> --- /dev/null
> +++ b/docs/pvrdma.txt
> @@ -0,0 +1,145 @@
> +Paravirtualized RDMA Device (PVRDMA)
> +====================================
> +
> +
> +1. Description
> +===============
> +PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
> +It works with its Linux Kernel driver AS IS, no need for any special guest
> +modifications.
> +
> +While it complies with the VMware device, it can also communicate with bare
> +metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
> +can work with Soft-RoCE (rxe).
> +
> +It does not require the whole guest RAM to be pinned allowing memory
> +over-commit and, even if not implemented yet, migration support will be
> +possible with some HW assistance.
> +
> +A project presentation accompany this document:
> +- http://events.linuxfoundation.org/sites/events/files/slides/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf
> +
> +
> +
> +2. Setup
> +========
> +
> +
> +2.1 Guest setup
> +===============
> +Fedora 27+ kernels work out of the box, older distributions
> +require updating the kernel to 4.14 to include the pvrdma driver.
> +
> +However the libpvrdma library needed by User Level Software is still
> +not available as part of the distributions, so the rdma-core library
> +needs to be compiled and optionally installed.
> +
> +Please follow the instructions at:
> +  https://github.com/linux-rdma/rdma-core.git
> +
> +
> +2.2 Host Setup
> +==============
> +The pvrdma backend is an ibdevice interface that can be exposed
> +either by a Soft-RoCE(rxe) device on machines with no RDMA device,
> +or an HCA SRIOV function(VF/PF).
> +Note that ibdevice interfaces can't be shared between pvrdma devices,
> +each one requiring a separate instance (rxe or SRIOV VF).
> +
> +
> +2.2.1 Soft-RoCE backend(rxe)
> +===========================
> +A stable version of rxe is required, Fedora 27+ or a Linux
> +Kernel 4.14+ is preferred.
> +
> +The rdma_rxe module is part of the Linux Kernel but not loaded by default.
> +Install the User Level library (librxe) following the instructions from:
> +https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home
> +
> +Associate an ETH interface with rxe by running:
> +   rxe_cfg add eth0
> +An rxe0 ibdevice interface will be created and can be used as pvrdma backend.
> +
> +
> +2.2.2 RDMA device Virtual Function backend
> +==========================================
> +Nothing special is required, the pvrdma device can work not only with
> +Ethernet Links, but also Infinibands Links.
> +All is needed is an ibdevice with an active port, for Mellanox cards
> +will be something like mlx5_6 which can be the backend.
> +
> +
> +2.2.3 QEMU setup
> +================
> +Configure QEMU with --enable-rdma flag, installing
> +the required RDMA libraries.
> +
> +
> +3. Usage
> +========
> +Currently the device is working only with memory backed RAM
> +and it must be mark as "shared":
> +   -m 1G \
> +   -object memory-backend-ram,id=mb1,size=1G,share \
> +   -numa node,memdev=mb1 \
> +
> +The pvrdma device is composed of two functions:
> + - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
> +   but is required to pass the ibdevice GID using its MAC.
> +   Examples:
> +     For an rxe backend using eth0 interface it will use its mac:
> +       -device vmxnet3,addr=<slot>.0,multifunction=on,mac=<eth0 MAC>
> +     For an SRIOV VF, we take the Ethernet Interface exposed by it:
> +       -device vmxnet3,multifunction=on,mac=<RoCE eth MAC>
> + - Function 1 is the actual device:
> +       -device pvrdma,addr=<slot>.1,backend-dev=<ibdevice>,backend-gid-idx=<gid>,backend-port=<port>
> +   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
> + Note: Pay special attention that the GID at backend-gid-idx matches vmxnet's MAC.
> + The rules of conversion are part of the RoCE spec, but since manual conversion
> + is not required, spotting problems is not hard:
> +    Example: GID: fe80:0000:0000:0000:7efe:90ff:fecb:743a
> +             MAC: 7c:fe:90:cb:74:3a
> +    Note the difference between the first byte of the MAC and the GID.
> +
> +
> +4. Implementation details
> +=========================
> +The device acts like a proxy between the Guest Driver and the host
> +ibdevice interface.
> +On configuration path:
> + - For every hardware resource request (PD/QP/CQ/...) the pvrdma will request
> +   a resource from the backend interface, maintaining a 1-1 mapping
> +   between the guest and host.
> +On data path:
> + - Every post_send/receive received from the guest will be converted into
> +   a post_send/receive for the backend. The buffers data will not be touched
> +   or copied resulting in near bare-metal performance for large enough buffers.
> + - Completions from the backend interface will result in completions for
> +   the pvrdma device.


Where's the host/guest interface documented?

> +
> +
> +5. Limitations
> +==============
> +- The device obviously is limited by the Guest Linux Driver features implementation
> +  of the VMware device API.
> +- Memory registration mechanism requires mremap for every page in the buffer in order
> +  to map it to a contiguous virtual address range. Since this is not the data path
> +  it should not matter much.
> +- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is attached,
> +  so it can't work with huge pages. The limitation will be addressed in the future,
> +  however QEMU allocates Gust RAM with MADV_HUGEPAGE so if there are enough huge
> +  pages available, QEMU will use them.
> +- As previously stated, migration is not supported yet, however with some hardware
> +  support can be done.
> +
> +
> +
> +6. Performance
> +==============
> +By design the pvrdma device exits on each post-send/receive, so for small buffers
> +the performance is affected; however for medium buffers it will became close to
> +bare metal and from 1MB buffers and  up it reaches bare metal performance.
> +(tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device)
> +
> +All the above assumes no memory registration is done on data path.
> -- 
> 2.13.5

  reply	other threads:[~2017-12-19 17:48 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-17 12:54 [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation Marcel Apfelbaum
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 1/5] pci/shpc: Move function to generic header file Marcel Apfelbaum
2017-12-17 18:16   ` Philippe Mathieu-Daudé
2017-12-17 19:03     ` Yuval Shaia
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 2/5] mem: add share parameter to memory-backend-ram Marcel Apfelbaum
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 3/5] docs: add pvrdma device documentation Marcel Apfelbaum
2017-12-19 17:47   ` Michael S. Tsirkin [this message]
2017-12-20 14:45     ` Marcel Apfelbaum
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 4/5] pvrdma: initial implementation Marcel Apfelbaum
2017-12-19 16:12   ` Michael S. Tsirkin
2017-12-19 17:29     ` Marcel Apfelbaum
2017-12-19 17:48   ` Michael S. Tsirkin
2017-12-20 15:25     ` Yuval Shaia
2017-12-20 18:01       ` Michael S. Tsirkin
2017-12-19 19:13   ` Philippe Mathieu-Daudé
2017-12-20  4:08     ` Michael S. Tsirkin
2017-12-20 14:46       ` Marcel Apfelbaum
2017-12-17 12:54 ` [Qemu-devel] [PATCH V2 5/5] MAINTAINERS: add entry for hw/net/pvrdma Marcel Apfelbaum
2017-12-19 17:49   ` Michael S. Tsirkin
2017-12-19 18:05 ` [Qemu-devel] [PATCH V2 0/5] hw/pvrdma: PVRDMA device implementation Michael S. Tsirkin
2017-12-20 15:07   ` Marcel Apfelbaum
2017-12-21  0:05     ` Michael S. Tsirkin
2017-12-21  7:27       ` Yuval Shaia
2017-12-21 14:22         ` Michael S. Tsirkin
2017-12-21 15:59           ` Marcel Apfelbaum
2017-12-21 20:46             ` Michael S. Tsirkin
2017-12-21 22:30               ` Yuval Shaia
2017-12-22  4:58                 ` Marcel Apfelbaum
2017-12-20 17:56   ` Yuval Shaia
2017-12-20 18:05     ` Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171219194739-mutt-send-email-mst@kernel.org \
    --to=mst@redhat.com \
    --cc=ehabkost@redhat.com \
    --cc=imammedo@redhat.com \
    --cc=marcel@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=yuval.shaia@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.