From: Dominique Martinet <asmadeus@codewreck.org>
To: Christian Schoenebeck <linux_oss@crudebyte.com>
Cc: v9fs-developer@lists.sourceforge.net, netdev@vger.kernel.org,
	Eric Van Hensbergen <ericvh@gmail.com>,
	Latchesar Ionkov <lucho@ionkov.net>, Greg Kurz <groug@kaod.org>,
	Vivek Goyal <vgoyal@redhat.com>,
	Nikolay Kichukov <nikolay@oldum.net>
Subject: Re: [PATCH v4 12/12] net/9p: allocate appropriate reduced message buffers
Date: Sun, 3 Apr 2022 21:37:55 +0900	[thread overview]
Message-ID: <YkmVI6pqTuMD8dVi@codewreck.org> (raw)
In-Reply-To: <1953222.pKi1t3aLRd@silver>

Christian Schoenebeck wrote on Sun, Apr 03, 2022 at 01:29:53PM +0200:
> So maybe I should just exclude the 9p RDMA transport from this 9p message size 
> reduction change in v5 until somebody had a chance to test this change with 
> RDMA.

Yes, I'm pretty certain it won't work, so we'll want to exclude it unless
we can extend the RDMA transport protocol to address buffers.

From my understanding, RDMA comes with two types of primitives:
 - recv/send, which is the only one 9p uses: just a pool of buffers
registered to the NIC that get filled on a first-come-first-served
basis (I'm not sure whether it picks a first fit, truncates the
message, or errors out if the message doesn't fit... But since that's
all we use for 9p, we have no way of guaranteeing that a read reply
lands in the big buffer allocated for it rather than in some other,
smaller one; see the rough sketch after this list)

If we're lucky the algorithm used is smallest-fit first, but it doesn't
look like it:
---
The order of the Receive Request consumptions in a Receive Queue is by
the order that they were posted to it.
When you have a SRQ, you cannot predict which Receive Request will be
consumed by which QP, so all the Receive Requests in that SRQ should be
able to contain the incoming message (in terms of length).
--- https://www.rdmamojo.com/2013/02/02/ibv_post_recv/ (in a comment)


 - read/write, which is addressed: the remote end can hand out a
cookie along with an address/size, and the other side then operates
directly on that remote memory (hence the "remote direct memory
access" name). There is also some cool stuff that can be done, like
atomic compare-and-swap or arithmetic operations on remote memory,
but that doesn't really concern us.
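
To make the recv/send limitation concrete, here is a rough
userspace-libibverbs sketch of posting a receive buffer (illustrative
only: the helper name is made up, and the kernel transport uses the
in-kernel equivalents of these calls). Receives posted to a queue are
consumed strictly in posting order, so every buffer has to be large
enough for whatever reply happens to arrive next:

#include <stdint.h>
#include <infiniband/verbs.h>

/* Illustrative only: post one registered buffer to a QP's receive
 * queue.  Whichever message arrives next is written into the oldest
 * posted buffer, regardless of its size, so mixing small and large
 * buffers on the same queue is not safe. */
static int post_recv_buffer(struct ibv_qp *qp, struct ibv_mr *mr,
			    void *buf, uint32_t len)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t)buf,
		.length = len,
		.lkey   = mr->lkey,
	};
	struct ibv_recv_wr wr = {
		.wr_id   = (uintptr_t)buf,
		.sg_list = &sge,
		.num_sge = 1,
	};
	struct ibv_recv_wr *bad_wr;

	return ibv_post_recv(qp, &wr, &bad_wr);
}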

Using reads/writes like NFS over RDMA does would resolve the problem
and allow what they call "direct data placement", i.e. reading or
writing directly from the page cache or user buffer as a real
zero-copy operation. It requires the cookie/address to be sent and the
client to act on it, though, so it's a real transport-specific
protocol change; but given the low number of users I think that's
something that could be considered if someone wants to work on it.
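
For comparison, a similarly rough sketch of the read/write primitive
in userspace verbs (again illustrative only, nothing like this exists
in the 9p transport today; same headers as the sketch above): given
the (remote address, rkey) cookie advertised by the peer, pulling data
straight into a locally registered buffer is a single work request,
which is essentially what direct data placement boils down to:

/* Illustrative only: RDMA-read 'len' bytes from the peer's advertised
 * (remote_addr, rkey) cookie into a locally registered buffer. */
static int rdma_read_remote(struct ibv_qp *qp, struct ibv_mr *local_mr,
			    void *local_buf, uint32_t len,
			    uint64_t remote_addr, uint32_t rkey)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t)local_buf,
		.length = len,
		.lkey   = local_mr->lkey,
	};
	struct ibv_send_wr wr = {
		.opcode              = IBV_WR_RDMA_READ,
		.sg_list             = &sge,
		.num_sge             = 1,
		.send_flags          = IBV_SEND_SIGNALED,
		.wr.rdma.remote_addr = remote_addr,
		.wr.rdma.rkey        = rkey,
	};
	struct ibv_send_wr *bad_wr;

	return ibv_post_send(qp, &wr, &bad_wr);
}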

Until then we'll be safer with that bit disabled...

> Which makes me wonder, what is that exact hardware, hypervisor, OS that 
> supports 9p & RDMA?

I've used it with Mellanox InfiniBand cards in the past. These support
SR-IOV virtual functions, so they are quite practical for VMs: you can
do the whole thing with a single machine and no cable.

I'm pretty sure it'd work with any recent server hardware that supports
RoCE (I -think- it's getting more common?). As for emulation, ~10 years
ago I got it to run with softiwarp, which has since been merged into
the kernel (siw), so that might be the easiest way to run it now.

Server-side, both diod and nfs-ganesha support 9p over RDMA; I haven't
used diod recently but ganesha ought to work.


> On the long-term I can imagine to add RDMA transport support on QEMU 9p side. 

What would you expect it to be used for?

> There is already RDMA code in QEMU, however it is only used for migration by 
> QEMU so far I think.

Yes, looking at it a bit, there's live migration over RDMA (I tested it
at my previous job), some handling for gluster+rdma, and a
paravirtualized RDMA device (pvrdma).
The docs for it say it works with soft-RoCE, so it would probably also
work for tests (I'm not sure what the difference is between rxe and
siw), but at that point you've just set up virtualized RDMA on the host
anyway...

I'll try to get something set up for tests on my end as well; it's
definitely something I had on my todo...
-- 
Dominique

