From: Ralph Schmieder <ralph.schmieder@gmail.com>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: "\"Daniel P. Berrangé\"" <berrange@redhat.com>, qemu-devel@nongnu.org
Subject: Re: socket.c added support for unix domain socket datagram transport
Date: Mon, 26 Apr 2021 13:14:48 +0200
Message-ID: <2DC6F891-4F28-4044-A055-0CDAB45A3C24@gmail.com>
In-Reply-To: <20210423183901.12a71759@redhat.com>



> On Apr 23, 2021, at 18:39, Stefano Brivio <sbrivio@redhat.com> wrote:
> 
> On Fri, 23 Apr 2021 17:48:08 +0200
> Ralph Schmieder <ralph.schmieder@gmail.com> wrote:
> 
>> Hi, Stefano... Thanks for the detailed response... inline
>> Thanks,
>> -ralph
>> 
>> 
>>> On Apr 23, 2021, at 17:29, Stefano Brivio <sbrivio@redhat.com>
>>> wrote:
>>> 
>>> Hi Ralph,
>>> 
>>> On Fri, 23 Apr 2021 08:56:48 +0200
>>> Ralph Schmieder <ralph.schmieder@gmail.com> wrote:
>>> 
>>>> Hey...  new to this list.  I was looking for a way to use Unix
>>>> domain sockets as a network transport between local VMs.
>>>> 
>>>> I'm part of a team where we run dozens if not hundreds of VMs on a
>>>> single compute instance which are highly interconnected.
>>>> 
>>>> In the current implementation, I use UDP sockets (e.g. something
>>>> like 
>>>> 
>>>> -netdev
>>>> id=bla,type=socket,udp=localhost:1234,localaddr=localhost:5678) 
>>>> 
>>>> which works great.
>>>> 
>>>> The downside of this approach is that I need to keep track of all
>>>> the UDP ports in use and that there's a potential for clashes.
>>>> Clearly, having Unix domain sockets where I could store the
>>>> sockets in the VM's directory would be much easier to manage.
>>>> 
>>>> However, even though there is some AF_UNIX support in net/socket.c,
>>>> it's
>>>> 
>>>> - not configurable
>>>> - it doesn't work  
>>> 
>>> I hate to say this, but I've been working on something very similar
>>> just in the past days, with the notable difference that I'm using
>>> stream-oriented AF_UNIX sockets instead of datagram-oriented.
>>> 
>>> I have a similar use case, and after some experiments I picked a
>>> stream-oriented socket over datagram-oriented because:
>>> 
>>> - overhead appears to be the same
>>> 
>>> - message boundaries make little sense here -- you already have a
>>> 32-bit vnet header with the message size defining the message
>>> boundaries
>>> 
>>> - datagram-oriented AF_UNIX sockets are actually reliable and
>>> there's no packet reordering on Linux, but this is not "formally"
>>> guaranteed
>>> 
>>> - it's helpful for me to know when a qemu instance disconnects for
>>> any reason
>>> 
>> 
>> IMO, dgram is the right tool for this, as it is symmetrical to using a
>> UDP transport... Since I need to pick up those packets outside of
>> Qemu (inside a custom networking fabric), I'd otherwise have to make
>> assumptions about the data, because I don't know the length of a
>> packet in advance.
> 
> Okay, so it doesn't seem to fit your case, but this specific point is
> where you actually have a small advantage using a stream-oriented
> socket. If you receive a packet and have a smaller receive buffer, you
> can read the length of the packet from the vnet header and then read
> the rest of the packet at a later time.
> 
> With a datagram-oriented socket, you would need to know the maximum
> packet size in advance, and use a receive buffer that's large enough to
> contain it, because if you don't, you'll discard data.

For me, the maximum packet size is a jumbo frame (e.g. 9 * 1024 bytes) anyway -- everything must fit into an atomic write of that size.
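
Just so I'm sure I understand the stream case: with the 32-bit length
prefix that the TCP code in net/socket.c already uses (if I read it
right), the receiving end would look roughly like the sketch below.
read_exact() and recv_one_packet() are made-up names, and the 9 KiB
buffer is just my jumbo-frame assumption:

#include <arpa/inet.h>	/* ntohl() */
#include <stdint.h>
#include <unistd.h>

/* Keep reading until exactly 'count' bytes arrived (or error/EOF). */
static ssize_t read_exact(int fd, void *buf, size_t count)
{
	size_t done = 0;

	while (done < count) {
		ssize_t n = read(fd, (uint8_t *)buf + done, count - done);

		if (n <= 0)
			return -1;	/* error, or the peer went away */
		done += n;
	}
	return done;
}

/* Receive one length-prefixed packet; returns its length, or -1. */
static ssize_t recv_one_packet(int fd, uint8_t *buf, size_t size)
{
	uint32_t len;

	if (read_exact(fd, &len, sizeof(len)) < 0)
		return -1;
	len = ntohl(len);	/* length prefix is in network byte order */
	if (len > size)
		return -1;	/* bigger than my 9 KiB assumption: give up */

	return read_exact(fd, buf, len);
}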

> 
> The same reasoning applies to a receive buffer that's larger than the
> maximum packet size you can get -- you can then read multiple packets at
> a time, filling your buffer, partially reading a packet at the end of
> it, and reading the rest later.
> 
> With a datagram-oriented socket you need to resort to recvmmsg() to
> receive multiple packets with one syscall (nothing against it, it's
> just slightly more tedious).
> 
>> Using the datagram approach fits nicely into this concept.
>> So, yes, in my instance the transport IS truly connectionless and VMs
>> just keep sending packets if the fabric isn't there or doesn't pick
>> up their packets.
> 
> I see, in that case I guess you really need a datagram-oriented
> socket... even though what happens with my patch (just like with the
> existing TCP support) is that your fabric would need to be there when
> qemu starts, but if it disappears later, qemu will simply close the
> socket. Indeed, it's not "hotplug", which is probably what you need.

That's the point.  This is peer-to-peer/point-to-point and not client/server.
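
On the recvmmsg() point above, just for completeness: batching datagram
receives isn't much code either. A rough sketch (the batch size and the
9 KiB per-message buffers are my own numbers, nothing QEMU prescribes):

#define _GNU_SOURCE		/* recvmmsg() */
#include <sys/socket.h>
#include <sys/uio.h>
#include <stdint.h>
#include <string.h>

#define BATCH	16
#define MAX_PKT	(9 * 1024)	/* jumbo frame, see above */

/* Receive up to BATCH datagrams with one syscall; returns the count or -1. */
static int recv_batch(int fd, uint8_t bufs[BATCH][MAX_PKT],
		      struct mmsghdr msgs[BATCH], struct iovec iovs[BATCH])
{
	int i;

	memset(msgs, 0, BATCH * sizeof(*msgs));
	for (i = 0; i < BATCH; i++) {
		iovs[i].iov_base = bufs[i];
		iovs[i].iov_len = MAX_PKT;
		msgs[i].msg_hdr.msg_iov = &iovs[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	/* on return, msgs[i].msg_len is the size of the i-th datagram */
	return recvmmsg(fd, msgs, BATCH, 0, NULL);
}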

> 
>> And maybe there's use for both, as there's currently already support
>> for connection oriented (TCP) and connectionless (UDP) inet
>> transports. 
> 
> Yes, I think so.
> 
>>>> As a side note, I tried to pass in an already open FD, but that
>>>> didn't work either.  
>>> 
>>> This actually worked for me as a quick work-around, either with:
>>> 	https://github.com/StevenVanAcker/udstools
>>> 
>>> or with a trivial C implementation of that, that does essentially:
>>> 
>>> 	errno = 0;	/* so the check below only catches strtol() failures */
>>> 	fd = strtol(argv[1], NULL, 0);
>>> 	if (fd < 3 || fd > INT_MAX || errno)
>>> 		usage(argv[0]);
>>> 
>>> 	s = socket(AF_UNIX, SOCK_STREAM, 0);
>>> 	if (s < 0) {
>>> 		perror("socket");
>>> 		exit(EXIT_FAILURE);
>>> 	}
>>> 
>>> 	/* addr is a struct sockaddr_un, sun_family = AF_UNIX, with the
>>> 	 * target socket path in sun_path (setup not shown here) */
>>> 	if (connect(s, (const struct sockaddr *)&addr, sizeof(addr)) < 0) {
>>> 		perror("connect");
>>> 		exit(EXIT_FAILURE);
>>> 	}
>>> 
>>> 	if (dup2(s, (int)fd) < 0) {
>>> 		perror("dup");
>>> 		exit(EXIT_FAILURE);
>>> 	}
>>> 
>>> 	close(s);
>>> 
>>> 	execvp(argv[2], argv + 2);
>>> 	perror("execvp");
>>> 
>>> where argv[1] is the socket number you pass in the qemu command line
>>> (-net socket,fd=X) and argv[2] is the path to qemu.
>> 
>> As I was looking for dgram support I didn't even try with a stream
>> socket ;)
> 
> Mind that it also works with a SOCK_DGRAM ;) ...that was my original
> attempt, actually.
> 
>>>> So, I added some code which does work for me... e.g.
>>>> 
>>>> - can specify the socket paths like -netdev
>>>> id=bla,type=socket,unix=/tmp/in:/tmp/out
>>>> - it does forward packets between two Qemu instances running
>>>> back-to-back
>>>> 
>>>> I'm wondering if this is of interest for the wider community and,
>>>> if so, how to proceed.
>>>> 
>>>> Thanks,
>>>> -ralph
>>>> 
>>>> Commit
>>>> https://github.com/rschmied/qemu/commit/73f02114e718ec898c7cd8e855d0d5d5d7abe362
>>>> 
>>> 
>>> I think your patch could be a bit simpler, as you could mostly reuse
>>> net_socket_udp_init() for your initialisation, and perhaps rename
>>> it to net_socket_dgram_init().  
>> 
>> Thanks... I agree that my code can likely be shortened... it was just
>> a PoC that I cobbled together yesterday and it still has a lot of
>> to-be-removed lines.
> 
> I'm not sure if it helps, but I guess you could "conceptually" recycle
> my patch and in some sense "extend" the UDP parts to a generic datagram
> interface, just like mine extends the TCP implementation to a generic
> stream interface.
> 
> About command line and documentation, I guess it's clear that
> "connect=" implies something stream-oriented, so I would prefer to
> leave it like that for a stream-oriented AF_UNIX socket -- it behaves
> just like TCP.
> 
> On the other hand, you can't recycle the current UDP "mcast=" stuff
> because it's not multicast (AF_UNIX multicast support for Linux was
> proposed some years ago, https://lwn.net/Articles/482523/, but not
> merged), and of course not "udp="... would "unix_dgram=" make sense
> to you?
> 
> On a side note, I wonder why you need two named sockets instead of
> one -- I mean, they're bidirectional...


Hmm... each peer needs to send unsolicited frames/packets to the other end and thus needs to bind to its own socket.  Pretty much for the same reason the UDP transport requires you to specify both a local and a remote address.  Even though for AF_INET the local port does not have to be given explicitly, the OS would still assign an ephemeral port to make the tuple unique.  Am I missing something?
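
In other words, each end of a link does roughly this (the paths are
made-up examples -- in practice they'd live in the VM's directory), which
is exactly why I ended up with two socket paths per link:

#include <sys/socket.h>
#include <sys/un.h>
#include <string.h>
#include <unistd.h>

/* bind() to our own path so the peer can send us unsolicited frames */
static int dgram_peer_socket(const char *local_path)
{
	struct sockaddr_un us = { .sun_family = AF_UNIX };
	int s = socket(AF_UNIX, SOCK_DGRAM, 0);

	if (s < 0)
		return -1;

	strncpy(us.sun_path, local_path, sizeof(us.sun_path) - 1);
	unlink(local_path);	/* get rid of a stale socket file, if any */

	if (bind(s, (struct sockaddr *)&us, sizeof(us)) < 0) {
		close(s);
		return -1;
	}
	return s;
}

/* sendto() per packet, so it doesn't matter whether the peer is up yet */
static ssize_t dgram_send(int s, const char *peer_path,
			  const void *buf, size_t len)
{
	struct sockaddr_un them = { .sun_family = AF_UNIX };

	strncpy(them.sun_path, peer_path, sizeof(them.sun_path) - 1);

	return sendto(s, buf, len, 0,
		      (struct sockaddr *)&them, sizeof(them));
}

With the option name you suggest, that would then end up on the command
line as something like -netdev socket,id=n0,unix_dgram=/tmp/a.sock:/tmp/b.sock
(just to illustrate the shape, nothing final).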

Another thing: on Windows, there's an AF_UNIX/SOCK_STREAM implementation... so, technically, it should be possible to use that code path on Windows, too.  Not a Windows guy, though, so I can't say whether it would simply work or not:

https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/



> 
> -- 
> Stefano
> 


