From: Javier Martinez Canillas <javier@collabora.co.uk>
To: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>,
	Lennart Poettering <lennart@poettering.net>,
	Kay Sievers <kay.sievers@vrfy.org>,
	Alban Crequy <alban.crequy@collabora.co.uk>,
	Bart Cerneels <bart.cerneels@collabora.co.uk>,
	Rodrigo Moya <rodrigo.moya@collabora.co.uk>,
	Sjoerd Simons <sjoerd.simons@collabora.co.uk>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 01/10] af_unix: Documentation on multicast unix sockets
Date: Mon, 20 Feb 2012 16:57:26 +0100
Message-ID: <1329753455-1106-2-git-send-email-javier@collabora.co.uk>
In-Reply-To: <1329753455-1106-1-git-send-email-javier@collabora.co.uk>

From: Alban Crequy <alban.crequy@collabora.co.uk>

Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
Reviewed-by: Ian Molton <ian.molton@collabora.co.uk>
---
 .../networking/multicast-unix-sockets.txt          |  180 ++++++++++++++++++++
 1 files changed, 180 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/networking/multicast-unix-sockets.txt

diff --git a/Documentation/networking/multicast-unix-sockets.txt b/Documentation/networking/multicast-unix-sockets.txt
new file mode 100644
index 0000000..ec9a19c
--- /dev/null
+++ b/Documentation/networking/multicast-unix-sockets.txt
@@ -0,0 +1,180 @@
+Multicast Unix sockets
+======================
+
+Multicast is implemented on SOCK_DGRAM and SOCK_SEQPACKET Unix sockets.
+
+A userspace application can create a multicast group with:
+
+  struct unix_mreq mreq = {0,};
+  mreq.address.sun_family = AF_UNIX;
+  mreq.address.sun_path[0] = '\0';
+  strcpy(mreq.address.sun_path + 1, "socket-address");
+
+  sockfd = socket(AF_UNIX, SOCK_DGRAM, 0);
+  ret = setsockopt(sockfd, SOL_UNIX, UNIX_CREATE_GROUP, &mreq, sizeof(mreq));
+
+This allocates a struct unix_mcast_group, which is reference counted and exists
+as long as the socket that created it exists or the group has at least one
+member.
+
+SOCK_DGRAM sockets can join a multicast group with:
+
+  ret = setsockopt(sockfd, SOL_UNIX, UNIX_JOIN_GROUP, &mreq, sizeof(mreq));
+
+This allocates a struct unix_mcast, which holds the settings of the membership,
+mainly whether loopback is enabled. A socket can be a member of several
+multicast groups.
+
+Since SOCK_SEQPACKET sockets are connection-oriented, the semantics are
+different. A client cannot join a group itself; it can only connect. The
+accepted socket is then used to make the peer join the group with:
+
+  ret = setsockopt(groupfd, SOL_UNIX, UNIX_CREATE_GROUP, &val, vallen);
+  ret = listen(groupfd, 10);
+  connfd = accept(groupfd, NULL, NULL);
+  ret = setsockopt(connfd, SOL_UNIX, UNIX_ACCEPT_GROUP, &mreq, sizeof(mreq));
+
+The socket is part of the multicast group until it is released, shut down with
+RCV_SHUTDOWN, or explicitly leaves the group:
+
+  ret = setsockopt(sockfd, SOL_UNIX, UNIX_LEAVE_GROUP, &mreq, sizeof(mreq));
+
+Struct unix_mcast nodes are linked in two RCU lists:
+- (struct unix_sock)->mcast_subscriptions
+- (struct unix_mcast_group)->mcast_members
+
+              unix_mcast_group  unix_mcast_group
+                      |                 |
+                      v                 v
+unix_sock  ---->  unix_mcast  ----> unix_mcast
+                      |
+                      v
+unix_sock  ---->  unix_mcast
+                      |
+                      v
+unix_sock  ---->  unix_mcast
+
+
+SOCK_DGRAM semantics
+====================
+
+          G          The socket which created the group
+       /  |  \
+     P1  P2  P3      The member sockets
+
+Messages sent to the group are received by all members. The sender itself
+receives its own message only if it has UNIX_MREQ_LOOPBACK set.
+
+Non-members can also send to the group socket G, and the message is broadcast
+to the group members. Socket G itself, however, does not receive messages sent
+to the group through it.
+
+
+SOCK_SEQPACKET semantics
+========================
+
+When a client connects to a SOCK_SEQPACKET multicast socket, a new socket is
+created and its file descriptor is returned by accept().
+
+          L          The listening socket
+       /  |  \
+     A1  A2  A3      The accepted sockets
+      |   |   |
+     C1  C2  C3      The connected sockets
+
+Messages sent on the C1 socket are received by:
+- C1 itself if UNIX_MREQ_LOOPBACK is set.
+- The peer socket A1 if UNIX_MREQ_SEND_TO_PEER is set.
+- The other members of the multicast group C2 and C3.
+
+Only members can send to the group in this case.
+
+
+Atomic delivery and ordering
+============================
+
+Each message sent is delivered atomically to either none of the recipients or
+all the recipients, even with interruptions and errors.
+
+Locking is used in order to keep the ordering consistent on all recipients. We
+want to avoid the following scenario, with two senders, A and B, and two
+recipients, C and D:
+
+           C    D
+A -------->|    |    Step 1: A's message is delivered to C
+B -------->|    |    Step 2: B's message is delivered to C
+B ---------|--->|    Step 3: B's message is delivered to D
+A ---------|--->|    Step 4: A's message is delivered to D
+
+Result: - C received (A, B)
+        - D received (B, A)
+
+Although A and B had a list of recipients (C, D) in the same order, C and D
+received the messages in a different order. To avoid this scenario, we need a
+locking mechanism while the messages are being delivered with skb_queue_tail().
+
+Solution 1:
+The easiest implementation would be to use a global spinlock on the group, but
+it creates avoidable contention, especially when there are two independent
+streams set up with socket filters; e.g. if A sends messages received only by
+C, and B sends messages received only by D.
+
+Solution 2:
+Fine-grained locking could be implemented with a spinlock on each recipient.
+Before delivering the message to the recipients, the sender takes a spinlock on
+each recipient at the same time.
+
+Taking several spinlocks of the same kind at once is dangerous and can lead to
+deadlocks. This is prevented by sorting the list of sockets by memory address
+and taking the spinlocks in that order. The ordered list of recipients is
+computed on demand when a message is sent and the list is cached for
+performance. When the group membership changes, the generation of the
+membership is incremented and the ordered recipient list is invalidated.
+
+With this solution, the number of spinlocks taken simultaneously can be
+arbitrarily large. Whilst it works, it breaks the lockdep mechanism.
+
+Solution 3:
+The current implementation is similar to solution 2 but with a limit on the
+number of spinlocks taken simultaneously (8), so lockdep works fine. A hash
+function and a bit array with n=8 specify which spinlocks to take. Contention
+on independent streams can still happen but it is less likely.
+
+
+Flow control
+============
+
+When a socket's receiving queue is full, the default behavior is to block
+senders (or to return -EAGAIN on non-blocking sockets). The socket can also
+join a multicast group with the flag UNIX_MREQ_DROP_WHEN_FULL. In this case,
+messages sent to the group will not be delivered to that socket when its
+receiving queue is full.
+
+Messages are still delivered atomically to all members who don't have the flag
+UNIX_MREQ_DROP_WHEN_FULL. If send() returns -EAGAIN, nobody received the
+message. If send() blocks because of one member, the other members don't
+receive the message until all sockets (except those with
+UNIX_MREQ_DROP_WHEN_FULL set) can receive at the same time.
+
+poll/epoll/select on POLLOUT events behave consistently with this: they block
+if at least one member of the multicast group without UNIX_MREQ_DROP_WHEN_FULL
+has a full receiving queue.
+
+
+Multicast socket reference counting
+===================================
+
+A poller for POLLOUT events can block for any member of the group. The poller
+can use the wait queue "peer_wait" of any member. So it is important that Unix
+sockets are not released before all pollers exit. This is achieved by:
+
+- Incrementing the reference counter of a socket when it joins a multicast
+  group.
+- Decrementing it when the group is destroyed, that is, when all
+  sockets keeping a reference on the group have released their
+  reference.
+
+struct unix_mcast_group keeps track of both current members and previous
+members. When a socket leaves a group, it is removed from the members list and
+put in the dead members list. This is done in order to take advantage of RCU
+lists, which reduces lock contention.
-- 
1.7.7.6
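
The deadlock-avoidance rule of Solution 2 in the "Atomic delivery and
ordering" section (sort the recipients by memory address, then take every
lock in that order) can be tried out in userspace. What follows is only a
minimal sketch under stated assumptions: pthread mutexes stand in for the
kernel spinlocks, and struct recipient, lock_all(), unlock_all() and
deliver() are hypothetical names, not code from this series. The cached
recipient list and its generation counter are left out.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a receiving socket. */
struct recipient {
	pthread_mutex_t lock;
	int queued;		/* number of messages delivered so far */
};

/* qsort() comparator: order recipient pointers by memory address. */
static int cmp_addr(const void *a, const void *b)
{
	uintptr_t x = (uintptr_t)*(struct recipient *const *)a;
	uintptr_t y = (uintptr_t)*(struct recipient *const *)b;

	return (x > y) - (x < y);
}

/*
 * Sort the recipient list by address, then lock in that order. Two senders
 * with overlapping recipient sets always acquire the shared locks in the
 * same order, so they cannot deadlock against each other. The list is
 * assumed to contain each recipient at most once.
 */
static void lock_all(struct recipient **r, size_t n)
{
	qsort(r, n, sizeof(*r), cmp_addr);
	for (size_t i = 0; i < n; i++)
		pthread_mutex_lock(&r[i]->lock);
}

/* Release in reverse order of acquisition. */
static void unlock_all(struct recipient **r, size_t n)
{
	while (n-- > 0)
		pthread_mutex_unlock(&r[n]->lock);
}

/* Deliver one message to every recipient while holding all their locks. */
static void deliver(struct recipient **r, size_t n)
{
	lock_all(r, n);
	for (size_t i = 0; i < n; i++)
		r[i]->queued++;
	unlock_all(r, n);
}
```

Because every recipient lock is held for the whole delivery, two concurrent
deliver() calls with a common recipient are serialized, which is what keeps
the message order identical on C and D in the scenario from the document.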

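Solution 3 in the same section bounds the number of locks held at once by
mapping recipients onto 8 locks and recording the needed ones in a bit array.
A rough sketch of that idea, with loud assumptions: the hash below (pointer
bits modulo 8) is made up for illustration and is not the hash used by the
patch, pthread mutexes again stand in for spinlocks, and struct group_locks
and all function names are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NLOCKS 8	/* the limit quoted in the document */

/* Hypothetical per-group lock table. */
struct group_locks {
	pthread_mutex_t lock[NLOCKS];
};

static void group_locks_init(struct group_locks *g)
{
	for (int i = 0; i < NLOCKS; i++)
		pthread_mutex_init(&g->lock[i], NULL);
}

/* Illustrative hash only: drop low alignment bits, reduce modulo NLOCKS. */
static unsigned int lock_index(const void *recipient)
{
	return (unsigned int)(((uintptr_t)recipient >> 4) % NLOCKS);
}

/* Build the bit array: bit i set means lock i is needed for this send. */
static uint8_t lock_mask(void *const *recipients, size_t n)
{
	uint8_t mask = 0;

	for (size_t i = 0; i < n; i++)
		mask |= (uint8_t)(1u << lock_index(recipients[i]));
	return mask;
}

/* Take the selected locks in ascending index order (a fixed global order). */
static void lock_masked(struct group_locks *g, uint8_t mask)
{
	for (int i = 0; i < NLOCKS; i++)
		if (mask & (1u << i))
			pthread_mutex_lock(&g->lock[i]);
}

/* Release the selected locks in descending index order. */
static void unlock_masked(struct group_locks *g, uint8_t mask)
{
	for (int i = NLOCKS - 1; i >= 0; i--)
		if (mask & (1u << i))
			pthread_mutex_unlock(&g->lock[i]);
}
```

At most 8 locks are ever held, whatever the group size. Two recipients that
hash to the same index share a lock, which is the residual contention on
independent streams that the document mentions.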

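The "Flow control" rules can also be modeled in a few lines: a send is
all-or-nothing across the members without UNIX_MREQ_DROP_WHEN_FULL, while
members with the flag are skipped when their queue is full. This is a
sketch of the non-blocking case only. struct member, its fields and
mcast_send() are hypothetical, queue depth is a plain counter, and the
locking discussed earlier is omitted.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a group member's receive queue. */
struct member {
	int queued;		/* messages currently queued */
	int capacity;		/* queue limit */
	bool drop_when_full;	/* models UNIX_MREQ_DROP_WHEN_FULL */
};

static bool is_full(const struct member *m)
{
	return m->queued >= m->capacity;
}

/*
 * Non-blocking send to all members. Returns 0 on success, or -EAGAIN if any
 * member without drop_when_full is full, in which case nobody receives the
 * message.
 */
static int mcast_send(struct member *m, size_t n)
{
	/* Pass 1: refuse the whole send if a strict member has no room. */
	for (size_t i = 0; i < n; i++)
		if (is_full(&m[i]) && !m[i].drop_when_full)
			return -EAGAIN;

	/* Pass 2: deliver; any member still full here has drop_when_full. */
	for (size_t i = 0; i < n; i++)
		if (!is_full(&m[i]))
			m[i].queued++;
	return 0;
}
```

The first pass is the same condition a POLLOUT poller waits on: the group is
writable only when no member without the flag has a full queue.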