From: Ralph Schmieder <ralph.schmieder@gmail.com>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: "\"Daniel P. Berrangé\"" <berrange@redhat.com>, qemu-devel@nongnu.org
Subject: Re: socket.c added support for unix domain socket datagram transport
Date: Mon, 26 Apr 2021 13:14:48 +0200
Message-ID: <2DC6F891-4F28-4044-A055-0CDAB45A3C24@gmail.com>
In-Reply-To: <20210423183901.12a71759@redhat.com>



> On Apr 23, 2021, at 18:39, Stefano Brivio <sbrivio@redhat.com> wrote:
> 
> On Fri, 23 Apr 2021 17:48:08 +0200
> Ralph Schmieder <ralph.schmieder@gmail.com> wrote:
> 
>> Hi, Stefano... Thanks for the detailed response... inline
>> Thanks,
>> -ralph
>> 
>> 
>>> On Apr 23, 2021, at 17:29, Stefano Brivio <sbrivio@redhat.com>
>>> wrote:
>>> 
>>> Hi Ralph,
>>> 
>>> On Fri, 23 Apr 2021 08:56:48 +0200
>>> Ralph Schmieder <ralph.schmieder@gmail.com> wrote:
>>> 
>>>> Hey...  new to this list.  I was looking for a way to use Unix
>>>> domain sockets as a network transport between local VMs.
>>>> 
>>>> I'm part of a team where we run dozens if not hundreds of VMs on a
>>>> single compute instance which are highly interconnected.
>>>> 
>>>> In the current implementation, I use UDP sockets (e.g. something
>>>> like 
>>>> 
>>>> -netdev
>>>> id=bla,type=socket,udp=localhost:1234,localaddr=localhost:5678) 
>>>> 
>>>> which works great.
>>>> 
>>>> The downside of this approach is that I need to keep track of all
>>>> the UDP ports in use and that there's a potential for clashes.
>>>> Clearly, having Unix domain sockets where I could store the
>>>> sockets in the VM's directory would be much easier to manage.
>>>> 
>>>> However, even though there is some AF_UNIX support in net/socket.c,
>>>> it's
>>>> 
>>>> - not configurable
>>>> - it doesn't work  
>>> 
>>> I hate to say this, but I've been working on something very similar
>>> just in the past days, with the notable difference that I'm using
>>> stream-oriented AF_UNIX sockets instead of datagram-oriented.
>>> 
>>> I have a similar use case, and after some experiments I picked a
>>> stream-oriented socket over datagram-oriented because:
>>> 
>>> - overhead appears to be the same
>>> 
>>> - message boundaries make little sense here -- you already have a
>>> 32-bit vnet header with the message size defining the message
>>> boundaries
>>> 
>>> - datagram-oriented AF_UNIX sockets are actually reliable and
>>> there's no packet reordering on Linux, but this is not "formally"
>>> guaranteed
>>> 
>>> - it's helpful for me to know when a qemu instance disconnects for
>>> any reason
>>> 
>> 
>> IMO, dgram is the right tool for this, as it is symmetrical to using a
>> UDP transport... Since I need to pick up those packets outside of
>> Qemu (inside a custom networking fabric), I'd otherwise have to make
>> assumptions about the data, because I don't know the length of a
>> packet in advance.
> 
> Okay, so it doesn't seem to fit your case, but this specific point is
> where you actually have a small advantage using a stream-oriented
> socket. If you receive a packet and have a smaller receive buffer, you
> can read the length of the packet from the vnet header and then read
> the rest of the packet at a later time.
> 
> With a datagram-oriented socket, you would need to know the maximum
> packet size in advance, and use a receive buffer that's large enough to
> contain it, because if you don't, you'll discard data.

For me, the maximum packet size is a jumbo frame (e.g. 9 * 1024 bytes) anyway -- everything must fit into an atomic write of that size.
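
Just so I'm sure I understand the stream case: with the 32-bit length
prefix that the TCP code in net/socket.c already uses (if I read it
right), the receiving end would look roughly like the sketch below.
read_exact() and recv_one_packet() are made-up names, and the 9 KiB
buffer is just my jumbo-frame assumption:

#include <arpa/inet.h>	/* ntohl() */
#include <stdint.h>
#include <unistd.h>

/* Keep reading until exactly 'count' bytes arrived (or error/EOF). */
static ssize_t read_exact(int fd, void *buf, size_t count)
{
	size_t done = 0;

	while (done < count) {
		ssize_t n = read(fd, (uint8_t *)buf + done, count - done);

		if (n <= 0)
			return -1;	/* error, or the peer went away */
		done += n;
	}
	return done;
}

/* Receive one length-prefixed packet; returns its length, or -1. */
static ssize_t recv_one_packet(int fd, uint8_t *buf, size_t size)
{
	uint32_t len;

	if (read_exact(fd, &len, sizeof(len)) < 0)
		return -1;
	len = ntohl(len);	/* length prefix is in network byte order */
	if (len > size)
		return -1;	/* bigger than my 9 KiB assumption: give up */

	return read_exact(fd, buf, len);
}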

> 
> The same reasoning applies to a receive buffer that's larger than the
> maximum packet size you can get -- you can then read multiple packets at
> a time, filling your buffer, partially reading a packet at the end of
> it, and reading the rest later.
> 
> With a datagram-oriented socket you need to resort to recvmmsg() to
> receive multiple packets with one syscall (nothing against it, it's
> just slightly more tedious).
> 
>> Using the datagram approach fits nicely into this concept.
>> So, yes, in my instance the transport IS truly connectionless and VMs
>> just keep sending packets if the fabric isn't there or doesn't pick
>> up their packets.
> 
> I see, in that case I guess you really need a datagram-oriented
> socket... even though what happens with my patch (just like with the
> existing TCP support) is that your fabric would need to be there when
> qemu starts, but if it disappears later, qemu will simply close the
> socket. Indeed, it's not "hotplug", which is probably what you need.

That's the point.  This is peer-to-peer/point-to-point and not client/server.
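
On the recvmmsg() point above, just for completeness: batching datagram
receives isn't much code either. A rough sketch (the batch size and the
9 KiB per-message buffers are my own numbers, nothing QEMU prescribes):

#define _GNU_SOURCE		/* recvmmsg() */
#include <sys/socket.h>
#include <sys/uio.h>
#include <stdint.h>
#include <string.h>

#define BATCH	16
#define MAX_PKT	(9 * 1024)	/* jumbo frame, see above */

/* Receive up to BATCH datagrams with one syscall; returns the count or -1. */
static int recv_batch(int fd, uint8_t bufs[BATCH][MAX_PKT],
		      struct mmsghdr msgs[BATCH], struct iovec iovs[BATCH])
{
	int i;

	memset(msgs, 0, BATCH * sizeof(*msgs));
	for (i = 0; i < BATCH; i++) {
		iovs[i].iov_base = bufs[i];
		iovs[i].iov_len = MAX_PKT;
		msgs[i].msg_hdr.msg_iov = &iovs[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	/* on return, msgs[i].msg_len is the size of the i-th datagram */
	return recvmmsg(fd, msgs, BATCH, 0, NULL);
}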

> 
>> And maybe there's use for both, as there's currently already support
>> for connection oriented (TCP) and connectionless (UDP) inet
>> transports. 
> 
> Yes, I think so.
> 
>>>> As a side note, I tried to pass in an already open FD, but that
>>>> didn't work either.  
>>> 
>>> This actually worked for me as a quick work-around, either with:
>>> 	https://github.com/StevenVanAcker/udstools
>>> 
>>> or with a trivial C implementation of that, that does essentially:
>>> 
>>> 	errno = 0;	/* so the check below only catches strtol() failures */
>>> 	fd = strtol(argv[1], NULL, 0);
>>> 	if (fd < 3 || fd > INT_MAX || errno)
>>> 		usage(argv[0]);
>>> 
>>> 	s = socket(AF_UNIX, SOCK_STREAM, 0);
>>> 	if (s < 0) {
>>> 		perror("socket");
>>> 		exit(EXIT_FAILURE);
>>> 	}
>>> 
>>> 	/* addr is a struct sockaddr_un, sun_family = AF_UNIX, with the
>>> 	 * target socket path in sun_path (setup not shown here) */
>>> 	if (connect(s, (const struct sockaddr *)&addr, sizeof(addr)) < 0) {
>>> 		perror("connect");
>>> 		exit(EXIT_FAILURE);
>>> 	}
>>> 
>>> 	if (dup2(s, (int)fd) < 0) {
>>> 		perror("dup");
>>> 		exit(EXIT_FAILURE);
>>> 	}
>>> 
>>> 	close(s);
>>> 
>>> 	execvp(argv[2], argv + 2);
>>> 	perror("execvp");
>>> 
>>> where argv[1] is the socket number you pass in the qemu command line
>>> (-net socket,fd=X) and argv[2] is the path to qemu.
>> 
>> As I was looking for dgram support I didn't even try with a stream
>> socket ;)
> 
> Mind that it also works with a SOCK_DGRAM ;) ...that was my original
> attempt, actually.
> 
>>>> So, I added some code which does work for me... e.g.
>>>> 
>>>> - can specify the socket paths like -netdev
>>>> id=bla,type=socket,unix=/tmp/in:/tmp/out
>>>> - it does forward packets between two Qemu instances running
>>>> back-to-back
>>>> 
>>>> I'm wondering if this is of interest for the wider community and,
>>>> if so, how to proceed.
>>>> 
>>>> Thanks,
>>>> -ralph
>>>> 
>>>> Commit
>>>> https://github.com/rschmied/qemu/commit/73f02114e718ec898c7cd8e855d0d5d5d7abe362
>>>> 
>>> 
>>> I think your patch could be a bit simpler, as you could mostly reuse
>>> net_socket_udp_init() for your initialisation, and perhaps rename
>>> it to net_socket_dgram_init().  
>> 
>> Thanks... I agree that my code can likely be shortened... it was just
>> a PoC that I cobbled together yesterday and it still has a lot of
>> to-be-removed lines.
> 
> I'm not sure if it helps, but I guess you could "conceptually" recycle
> my patch and in some sense "extend" the UDP parts to a generic datagram
> interface, just like mine extends the TCP implementation to a generic
> stream interface.
> 
> About command line and documentation, I guess it's clear that
> "connect=" implies something stream-oriented, so I would prefer to
> leave it like that for a stream-oriented AF_UNIX socket -- it behaves
> just like TCP.
> 
> On the other hand, you can't recycle the current UDP "mcast=" stuff
> because it's not multicast (AF_UNIX multicast support for Linux was
> proposed some years ago, https://lwn.net/Articles/482523/, but not
> merged), and of course not "udp="... would "unix_dgram=" make sense
> to you?
> 
> On a side note, I wonder why you need two named sockets instead of
> one -- I mean, they're bidirectional...


Hmm... each peer needs to send unsolicited frames/packets to the other end and thus needs to bind to its own socket.  Pretty much for the same reason the UDP transport requires you to specify both a local and a remote address.  Even though for AF_INET the local port does not have to be given explicitly, the OS would still assign an ephemeral port to make the tuple unique.  Am I missing something?
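
In other words, each end of a link does roughly this (the paths are
made-up examples -- in practice they'd live in the VM's directory), which
is exactly why I ended up with two socket paths per link:

#include <sys/socket.h>
#include <sys/un.h>
#include <string.h>
#include <unistd.h>

/* bind() to our own path so the peer can send us unsolicited frames */
static int dgram_peer_socket(const char *local_path)
{
	struct sockaddr_un us = { .sun_family = AF_UNIX };
	int s = socket(AF_UNIX, SOCK_DGRAM, 0);

	if (s < 0)
		return -1;

	strncpy(us.sun_path, local_path, sizeof(us.sun_path) - 1);
	unlink(local_path);	/* get rid of a stale socket file, if any */

	if (bind(s, (struct sockaddr *)&us, sizeof(us)) < 0) {
		close(s);
		return -1;
	}
	return s;
}

/* sendto() per packet, so it doesn't matter whether the peer is up yet */
static ssize_t dgram_send(int s, const char *peer_path,
			  const void *buf, size_t len)
{
	struct sockaddr_un them = { .sun_family = AF_UNIX };

	strncpy(them.sun_path, peer_path, sizeof(them.sun_path) - 1);

	return sendto(s, buf, len, 0,
		      (struct sockaddr *)&them, sizeof(them));
}

With the option name you suggest, that would then end up on the command
line as something like -netdev socket,id=n0,unix_dgram=/tmp/a.sock:/tmp/b.sock
(just to illustrate the shape, nothing final).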

Another thing: on Windows, there's an AF_UNIX/SOCK_STREAM implementation... so, technically, it should be possible to use that code path on Windows, too.  Not a Windows guy, though, so I can't say whether it would simply work or not:

https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/



> 
> -- 
> Stefano
> 


