From: Stefan Hajnoczi <stefanha@redhat.com>
To: zhenwei pi <pizhenwei@bytedance.com>
Cc: virtio-comment@lists.oasis-open.org
Subject: Re: Re: Re: [virtio-comment] Re: [PATCH v2 05/11] transport-fabrics: introduce Keyed Transmission
Date: Mon, 5 Jun 2023 13:21:39 -0400
Message-ID: <20230605172139.GF1624556@fedora>
In-Reply-To: <ced49494-1839-50d3-6742-0d5c1e7b33c7@bytedance.com>

On Fri, Jun 02, 2023 at 08:55:02AM +0800, zhenwei pi wrote:
> 
> 
> On 6/2/23 05:23, Stefan Hajnoczi wrote:
> > On Thu, Jun 01, 2023 at 03:13:53PM -0400, Stefan Hajnoczi wrote:
> > > On Thu, Jun 01, 2023 at 09:09:49PM +0800, zhenwei pi wrote:
> > > > On 6/1/23 19:33, Stefan Hajnoczi wrote:
> > > > > On Thu, Jun 01, 2023 at 05:02:45PM +0800, zhenwei pi wrote:
> > > > > > On 6/1/23 00:20, Stefan Hajnoczi wrote:
> > > > > > > On Thu, May 04, 2023 at 04:19:04PM +0800, zhenwei pi wrote:
> > > One more idea to play with: VIRTIO has flexible message framing, so
> > > devices must process a virtqueue buffer the same regardless of whether
> > > it has 1 large element or many small elements. Therefore the virtqueue
> > > RDMA protocol does not need to preserve the virtqueue element count and
> > > sizes from the driver. For example, the target can offer a list of
> > > key/length pairs that the initiator RDMA WRITEs the virtqueue buffer
> > > contents into. For a virtio-blk device that would be a struct
> > > virtio_blk_outhdr followed by a large page-aligned buffer for the I/O
> > > buffer data to be transferred. Then the device always has a properly aligned
> > > and contiguous buffer. Unfortunately this approach breaks down when the
> > > virtqueue carries requests that are organized very differently, but it
> > > might be useful when there is a most common request type.
> > 
> > I'm not sure if I explained this well. What I'm trying to say is that I
> > think RDMA benefits when the receiver's memory constraints are visible
> > to the sender. The sender performs RDMA WRITEs to the locations where
> > the receiver can efficiently process the data.
> > 
> > This protocol proposal doesn't really take advantage of this approach
> > because it communicates the virtqueue buffer elements from the initiator
> > (the sender) to the target (the receiver). That's the wrong way around.
> > 
> > I have never used RDMA myself, so this might be wrong, but as long as
> > the RDMA API allows the sender to specify a scatter-gather list as
> > input, then I think the details of the virtqueue buffer elements that
> > don't have the WRITE flag should never be communicated over the network.
> > Instead the initiator should RDMA WRITE from the VIRTIO driver's
> > scatter-gather list to the target's preferred destination.
> > 
> > Stefan
> 
> Hi,
> 
> I think I follow your point. "the target can offer a list of key/length
> pairs that the initiator RDMA WRITEs the virtqueue buffer contents into"
> does not seem good to me. I'd prefer to expose only the initiator side's
> RDMA memory region (the target side uses RDMA READ/WRITE to operate on
> the initiator's memory, which means the target side has no need to
> allocate/pin memory buffers).

Many targets will need to pin memory for the underlying disk I/O anyway.
If the initiator RDMA WRITEs data into the target's pinned memory, then
the target can forward the data to the disk without copies.

But assuming the target doesn't want to pin memory, the protocol can
still be simplified. The initiator sends a VQ_OP command containing:
1. VQ_OP header with a list of <addr, key, len> tuples for WRITE
   virtqueue buffer elements.
2. The contents of the !WRITE virtqueue buffer elements.
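
A minimal sketch of such a command, to make the two parts concrete. The
field widths, the little-endian layout, the num_desc and out_len fields,
the opcode value, and the function names are all assumptions made for
illustration; nothing in this thread defines a wire format.

```python
import struct

# Hypothetical layout (an assumption, not taken from the patch series):
#   header:     u16 opcode, u16 num_desc, u32 out_len   (little-endian)
#   descriptor: u64 addr, u32 key, u32 len              (one per WRITE element)
HDR = struct.Struct("<HHI")
DESC = struct.Struct("<QII")
VQ_OP = 0x1  # placeholder opcode value


def pack_vq_op(write_elems, out_elems):
    """Build one VQ_OP command.

    write_elems: list of (addr, key, length) for the WRITE elements.
    out_elems:   list of bytes objects for the !WRITE elements, sent inline.
    """
    # The !WRITE elements are concatenated, so their boundaries are not kept.
    payload = b"".join(out_elems)
    msg = HDR.pack(VQ_OP, len(write_elems), len(payload))
    for addr, key, length in write_elems:
        msg += DESC.pack(addr, key, length)
    return msg + payload


def unpack_vq_op(msg):
    """Inverse of pack_vq_op: returns (write_elems, payload)."""
    opcode, num_desc, out_len = HDR.unpack_from(msg, 0)
    if opcode != VQ_OP:
        raise ValueError("unexpected opcode")
    off = HDR.size
    write_elems = []
    for _ in range(num_desc):
        write_elems.append(DESC.unpack_from(msg, off))
        off += DESC.size
    payload = msg[off:off + out_len]
    return write_elems, payload
```

Only the WRITE elements are described by <addr, key, len> tuples here; the
!WRITE data travels as plain bytes after the header.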

Note that this approach does not involve the target sending RDMA READs.
They seem inefficient to me when the ibv_*() APIs allow the initiator to
send the !WRITE virtqueue buffer elements along with the request using a
scatter-gather list.

The target receives the VQ_OP command and sends RDMA WRITEs to fill in
the used buffer elements. The last RDMA WRITE may need to be a WRITE
WITH IMM to complete the request efficiently.
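
The target side of this flow can be modelled without RDMA hardware. In the
sketch below, initiator memory is a dict of bytearrays indexed by a made-up
key, and rdma_write() stands in for posting an RDMA WRITE. The class, the
function names, and the use of a request id as the immediate value are
assumptions for illustration only. Only the final write carries immediate
data.

```python
class InitiatorMemory:
    """Toy model of the memory regions the initiator has exposed, by key."""

    def __init__(self):
        self.regions = {}      # key -> (base_addr, bytearray)
        self.completions = []  # imm values delivered by WRITE WITH IMM

    def register(self, key, base_addr, length):
        self.regions[key] = (base_addr, bytearray(length))

    def rdma_write(self, addr, key, data, imm=None):
        """Stand-in for an RDMA WRITE; imm is not None models WRITE WITH IMM."""
        base, buf = self.regions[key]
        off = addr - base
        if off < 0 or off + len(data) > len(buf):
            raise ValueError("write outside registered region")
        buf[off:off + len(data)] = data
        if imm is not None:
            self.completions.append(imm)


def complete_request(mem, write_elems, response, request_id):
    """Scatter `response` across the WRITE elements in order.

    write_elems is a list of (addr, key, length). Returns the bytes written.
    """
    chunks = []
    pos = 0
    for addr, key, length in write_elems:
        if pos >= len(response):
            break
        chunk = response[pos:pos + length]
        chunks.append((addr, key, chunk))
        pos += len(chunk)
    if pos < len(response):
        raise ValueError("response larger than WRITE elements")
    for i, (addr, key, chunk) in enumerate(chunks):
        last = i == len(chunks) - 1
        # Only the last write signals completion to the initiator.
        mem.rdma_write(addr, key, chunk, imm=request_id if last else None)
    return pos
```

With an empty response this sketch issues no write and signals nothing; a
real protocol would need to define that case.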

> From my point of view, this protocol needs to be effective and
> maintainable. Mapping the vring mechanism onto RDMA WRITE in two
> directions (initiator to target, and target to initiator) leads to high
> complexity ...

My concern is that simply mapping vrings to RDMA is inefficient. It is
not necessary for the target to RDMA READ virtqueue buffer elements when
the initiator could include them in its send scatter-gather list
instead.
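
A back-of-the-envelope count illustrates the difference. The model below is
a simplification under stated assumptions, not a description of the
proposed protocol: one SEND carries the command, the target issues one RDMA
READ per !WRITE element when the data is not sent inline, and one RDMA
WRITE per WRITE element. The function name and the returned keys are made
up for this sketch.

```python
def work_requests(num_out_elems, num_in_elems, inline):
    """Count work requests for one virtqueue request in a simplified model.

    num_out_elems: number of !WRITE elements (driver to device).
    num_in_elems:  number of WRITE elements (device to driver).
    inline:        True if the !WRITE data rides along with the command.
    """
    sends = 1                                # the command itself
    reads = 0 if inline else num_out_elems   # assumed: one READ per element
    writes = num_in_elems                    # assumed: one WRITE per element
    return {
        "send": sends,
        "read": reads,
        "write": writes,
        "total": sends + reads + writes,
    }
```

For a request with two !WRITE elements and one WRITE element, this model
counts four work requests when the target reads the data and two when the
data is sent inline.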

If we forget about vrings and focus instead on how to offer virtqueue
semantics at minimal RDMA cost, then I think the protocol would look
more like what I'm describing.

Stefan
