* [PATCH RFC 0/11] netlink: memory mapped I/O
@ 2011-09-03 17:26 kaber
  2011-09-03 17:26 ` [PATCH 01/11] netlink: add symbolic value for congested state kaber
                   ` (11 more replies)
  0 siblings, 12 replies; 21+ messages in thread
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

The following RFC patches contain an implementation of memory mapped I/O
for netlink. The implementation is modelled after AF_PACKET memory mapped
I/O with a few differences:

- In order to perform memory mapped I/O to userspace, the kernel allocates
  skbs with the data area pointing to the data area of the mapped frames.
  All netlink subsystems assume a linear data area, so for the sake of
  simplicity, the mapped data area is not attached to the paged area but
  to skb->data. This requires the introduction of a special skb allocation
  function that just allocates an skb head without a data area. Since this
  is quite a rare use case, I introduced a new function based on __alloc_skb
  instead of splitting it up into head and data allocation. The alternative
  would be to introduce __alloc_skb_head and __alloc_skb_data functions,
  which would actually be useful for a specific error case in memory mapped
  netlink, but would require a couple of extra instructions for the common
  skb allocation case, so it doesn't really seem worth it.

  In order to get the destination memory area for skb->data before message
  construction, memory mapped netlink I/O needs to look up the destination
  socket during allocation instead of during transmission because the
  ring is owned by the receiving socket/process. A special skb allocation
  function (netlink_alloc_skb) taking the destination pid as an argument is
  used for this; all subsystems that want to support memory mapped I/O need
  to use this function (a rough usage sketch follows below the list), and
  automatic fallback to the receive queue happens for unconverted
  subsystems. Dumps automatically use memory mapped I/O if the receiving
  socket has enabled it.

  The visible effect of looking up the destination socket during allocation
  instead of transmission is that message ordering in userspace might
  change in case allocation and transmission aren't performed atomically.
  This usually doesn't matter since most subsystems have a BKL-like lock
  such as the rtnl mutex. To my knowledge the only currently existing case
  where it might matter is nfnetlink_queue combined with the recently
  introduced batched verdicts, but a) that subsystem already includes
  sequence numbers which allow userspace to reorder messages in case it
  cares to, and the reordering window is quite small anyway, and b) with
  memory mapped transmission batching can be performed in a subsystem
  independent manner.

- AF_NETLINK contains flow control for database dumps; with regular I/O,
  dump continuations are triggered based on the socket's receive queue
  space and by recvmsg() calls. Since with memory mapped I/O there are no
  recvmsg() calls under normal operation, this is done in netlink_poll(),
  under the assumption that userspace has processed all pending frames
  before invoking poll(), thus the ring is expected to have room for new
  messages. Dumps currently don't benefit as much as they could from
  memory mapped I/O because each single continuation requires a poll()
  call. A more aggressive approach seems like a good idea to me, especially
  in case the socket is not subscribed to any multicast groups (IOW it is
  only receiving explicitly requested data).
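
For subsystems converted to use netlink_alloc_skb() (see the first point
above), the change essentially boils down to a different allocation call.
A rough sketch of what a converted notification path could look like (the
function and its arguments are made up for illustration; the real
conversion is done for nfnetlink_queue in the last patch):

        static int example_notify(struct sock *ctrl_sk, u32 dst_pid,
                                  u32 seq, int msg_type)
        {
                struct sk_buff *skb;
                struct nlmsghdr *nlh;

                /* Allocates directly into the destination socket's RX ring
                 * when one is mapped, otherwise falls back to a regular skb.
                 */
                skb = netlink_alloc_skb(ctrl_sk, NLMSG_GOODSIZE, dst_pid,
                                        GFP_KERNEL);
                if (skb == NULL)
                        return -ENOBUFS;

                nlh = nlmsg_put(skb, dst_pid, seq, msg_type, 0, 0);
                if (nlh == NULL) {
                        kfree_skb(skb);
                        return -EMSGSIZE;
                }
                /* ... add payload attributes ... */
                nlmsg_end(skb, nlh);

                /* returns the number of bytes sent or a negative error */
                return netlink_unicast(ctrl_sk, skb, dst_pid, MSG_DONTWAIT);
        }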

Besides that, the memory mapped netlink implementation extends the frame
states defined by AF_PACKET between userspace and the kernel by a SKIP
status. This is intended for the case that userspace wants to queue frames
for a longer period of time (specifically when using nfnetlink_queue, an
IDS and stream reassembly, as requested by Eric Leblond). The kernel skips
over all frames marked with SKIP when looking for unused frames and only
fails when not finding a free frame or when having skipped the entire ring.
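
From userspace, making use of the SKIP state only requires toggling the
per-frame status word; roughly (assuming hdr points at the struct
nl_mmap_hdr of a received frame that should be kept around):

        /* Keep the frame: the kernel will skip over it when looking for
         * unused frames instead of treating it as free.
         */
        hdr->nm_status = NL_MMAP_STATUS_SKIP;

        /* ... later, once the queued frame has been consumed ... */
        hdr->nm_status = NL_MMAP_STATUS_UNUSED;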

Also noteworthy is memory mapped sendmsg: the kernel performs validation
of messages before accepting and processing them. In order to prevent
userspace from changing the message contents after validation, the
kernel checks that the ring is only mapped once and that the file
descriptor is not shared (in order to avoid having userspace set up
another mapping after the first check). If either condition does not
hold, the message is copied to an allocated skb and processed as with
regular I/O. I'd especially appreciate review of this part since I'm not
really versed in memory, file and process management.

The remaining interesting details are included in the changelogs of the
individual patches and the documentation, so I won't repeat them here.

As an example, nfnetlink_queue is converted to support memory mapped
I/O. Other subsystems that would probably benefit are nfnetlink_log,
audit and maybe iSCSI, I'm not sure. Since I don't own sufficiently
powerful hardware for real testing, my test cases were based on iperf
over loopback. Depending on the MTU, the latest patchset shows a 900%
improvement for an MTU of 1500 and a roughly 300% improvement for an
MTU of 15000.

Jesper will be taking benchmarks on real hardware sometime soon; for now
I'd just like to get the basic concept reviewed. An example implementation
for userspace queueing is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/kaber/libmnl-mmap.git

once master.kernel.org is up again. My git tree of the kernel parts
is not up to date anymore; the latest patches show considerably better
performance in my limited test setup.

Comments and review highly welcome!

Cheers,
Patrick


* [PATCH 01/11] netlink: add symbolic value for congested state
  2011-09-03 17:26 [PATCH RFC 0/11] netlink: memory mapped I/O kaber
@ 2011-09-03 17:26 ` kaber
  2011-09-03 17:26 ` [PATCH 02/11] net: add function to allocate skbuff head without data area kaber
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

From: Patrick McHardy <kaber@trash.net>

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/netlink/af_netlink.c |   18 +++++++++++-------
 1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 0a4db02..fc63ca5 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -88,6 +88,10 @@ struct listeners {
 	unsigned long		masks[0];
 };
 
+/* state bits */
+#define NETLINK_CONGESTED	0x0
+
+/* flags */
 #define NETLINK_KERNEL_SOCKET	0x1
 #define NETLINK_RECV_PKTINFO	0x2
 #define NETLINK_BROADCAST_SEND_ERROR	0x4
@@ -737,7 +741,7 @@ static void netlink_overrun(struct sock *sk)
 	struct netlink_sock *nlk = nlk_sk(sk);
 
 	if (!(nlk->flags & NETLINK_RECV_NO_ENOBUFS)) {
-		if (!test_and_set_bit(0, &nlk_sk(sk)->state)) {
+		if (!test_and_set_bit(NETLINK_CONGESTED, &nlk_sk(sk)->state)) {
 			sk->sk_err = ENOBUFS;
 			sk->sk_error_report(sk);
 		}
@@ -798,7 +802,7 @@ int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
 	nlk = nlk_sk(sk);
 
 	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-	    test_bit(0, &nlk->state)) {
+	    test_bit(NETLINK_CONGESTED, &nlk->state)) {
 		DECLARE_WAITQUEUE(wait, current);
 		if (!*timeo) {
 			if (!ssk || netlink_is_kernel(ssk))
@@ -812,7 +816,7 @@ int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
 		add_wait_queue(&nlk->wait, &wait);
 
 		if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-		     test_bit(0, &nlk->state)) &&
+		     test_bit(NETLINK_CONGESTED, &nlk->state)) &&
 		    !sock_flag(sk, SOCK_DEAD))
 			*timeo = schedule_timeout(*timeo);
 
@@ -876,8 +880,8 @@ static inline void netlink_rcv_wake(struct sock *sk)
 	struct netlink_sock *nlk = nlk_sk(sk);
 
 	if (skb_queue_empty(&sk->sk_receive_queue))
-		clear_bit(0, &nlk->state);
-	if (!test_bit(0, &nlk->state))
+		clear_bit(NETLINK_CONGESTED, &nlk->state);
+	if (!test_bit(NETLINK_CONGESTED, &nlk->state))
 		wake_up_interruptible(&nlk->wait);
 }
 
@@ -958,7 +962,7 @@ static inline int netlink_broadcast_deliver(struct sock *sk,
 	struct netlink_sock *nlk = nlk_sk(sk);
 
 	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
-	    !test_bit(0, &nlk->state)) {
+	    !test_bit(NETLINK_CONGESTED, &nlk->state)) {
 		skb_set_owner_r(skb, sk);
 		skb_queue_tail(&sk->sk_receive_queue, skb);
 		sk->sk_data_ready(sk, skb->len);
@@ -1236,7 +1240,7 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
 	case NETLINK_NO_ENOBUFS:
 		if (val) {
 			nlk->flags |= NETLINK_RECV_NO_ENOBUFS;
-			clear_bit(0, &nlk->state);
+			clear_bit(NETLINK_CONGESTED, &nlk->state);
 			wake_up_interruptible(&nlk->wait);
 		} else
 			nlk->flags &= ~NETLINK_RECV_NO_ENOBUFS;
-- 
1.7.4.4


* [PATCH 02/11] net: add function to allocate skbuff head without data area
  2011-09-03 17:26 [PATCH RFC 0/11] netlink: memory mapped I/O kaber
  2011-09-03 17:26 ` [PATCH 01/11] netlink: add symbolic value for congested state kaber
@ 2011-09-03 17:26 ` kaber
  2011-09-04  8:12   ` Eric Dumazet
  2011-09-03 17:26 ` [PATCH 03/11] netlink: add helper function for queueing skbs to the receive queue kaber
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

From: Patrick McHardy <kaber@trash.net>

Add a function to allocate a skbuff head without any data. This will be
used by memory mapped netlink to attach data from the mmaped area to the
skb.

Additionally change skb_release_all() to check whether the skb has a
data area to allow the skb destructor to clear the data pointer in
case only a head has been allocated.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 include/linux/skbuff.h |    6 ++++++
 net/core/skbuff.c      |   31 ++++++++++++++++++++++++++++++-
 2 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7b996ed..8cfc285 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -521,6 +521,12 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 	return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
 }
 
+extern struct sk_buff *__alloc_skb_head(gfp_t priority, int node);
+static inline struct sk_buff *alloc_skb_head(gfp_t priority)
+{
+	return __alloc_skb_head(priority, -1);
+}
+
 extern bool skb_recycle_check(struct sk_buff *skb, int skb_size);
 
 extern struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2beda82..d632de9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -153,6 +153,34 @@ static void skb_under_panic(struct sk_buff *skb, int sz, void *here)
  *
  */
 
+struct sk_buff *__alloc_skb_head(gfp_t gfp_mask, int node)
+{
+	struct sk_buff *skb;
+
+	/* Get the HEAD */
+	skb = kmem_cache_alloc_node(skbuff_head_cache,
+				    gfp_mask & ~__GFP_DMA, node);
+	if (!skb)
+		goto out;
+	prefetchw(skb);
+
+	/*
+	 * Only clear those fields we need to clear, not those that we will
+	 * actually initialise below. Hence, don't put any more fields after
+	 * the tail pointer in struct sk_buff!
+	 */
+	memset(skb, 0, offsetof(struct sk_buff, tail));
+	skb->data = NULL;
+	skb->truesize = sizeof(struct sk_buff);
+	atomic_set(&skb->users, 1);
+
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+	skb->mac_header = ~0U;
+#endif
+out:
+	return skb;
+}
+
 /**
  *	__alloc_skb	-	allocate a network buffer
  *	@size: size to allocate
@@ -414,7 +442,8 @@ static void skb_release_head_state(struct sk_buff *skb)
 static void skb_release_all(struct sk_buff *skb)
 {
 	skb_release_head_state(skb);
-	skb_release_data(skb);
+	if (likely(skb->data))
+		skb_release_data(skb);
 }
 
 /**
-- 
1.7.4.4


* [PATCH 03/11] netlink: add helper function for queueing skbs to the receive queue
  2011-09-03 17:26 [PATCH RFC 0/11] netlink: memory mapped I/O kaber
  2011-09-03 17:26 ` [PATCH 01/11] netlink: add symbolic value for congested state kaber
  2011-09-03 17:26 ` [PATCH 02/11] net: add function to allocate skbuff head without data area kaber
@ 2011-09-03 17:26 ` kaber
  2011-09-03 17:26 ` [PATCH 04/11] netlink: don't orphan skb in netlink_trim() kaber
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

From: Patrick McHardy <kaber@trash.net>

Consolidate skb receive queue code to allow overloading it for memory
mapped sockets.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/netlink/af_netlink.c |   26 ++++++++++++++------------
 1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index fc63ca5..a9f876b 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -153,6 +153,14 @@ static struct hlist_head *nl_pid_hashfn(struct nl_pid_hash *hash, u32 pid)
 	return &hash->table[jhash_1word(pid, hash->rnd) & hash->mask];
 }
 
+static void netlink_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+{
+	unsigned int len = skb->len;
+
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+	sk->sk_data_ready(sk, len);
+}
+
 static void netlink_sock_destruct(struct sock *sk)
 {
 	struct netlink_sock *nlk = nlk_sk(sk);
@@ -838,8 +846,7 @@ int netlink_sendskb(struct sock *sk, struct sk_buff *skb)
 {
 	int len = skb->len;
 
-	skb_queue_tail(&sk->sk_receive_queue, skb);
-	sk->sk_data_ready(sk, len);
+	netlink_queue_rcv_skb(sk, skb);
 	sock_put(sk);
 	return len;
 }
@@ -964,8 +971,7 @@ static inline int netlink_broadcast_deliver(struct sock *sk,
 	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
 	    !test_bit(NETLINK_CONGESTED, &nlk->state)) {
 		skb_set_owner_r(skb, sk);
-		skb_queue_tail(&sk->sk_receive_queue, skb);
-		sk->sk_data_ready(sk, skb->len);
+		netlink_queue_rcv_skb(sk, skb);
 		return atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf;
 	}
 	return -1;
@@ -1689,10 +1695,8 @@ static int netlink_dump(struct sock *sk)
 
 		if (sk_filter(sk, skb))
 			kfree_skb(skb);
-		else {
-			skb_queue_tail(&sk->sk_receive_queue, skb);
-			sk->sk_data_ready(sk, skb->len);
-		}
+		else
+			netlink_queue_rcv_skb(sk, skb);
 		return 0;
 	}
 
@@ -1706,10 +1710,8 @@ static int netlink_dump(struct sock *sk)
 
 	if (sk_filter(sk, skb))
 		kfree_skb(skb);
-	else {
-		skb_queue_tail(&sk->sk_receive_queue, skb);
-		sk->sk_data_ready(sk, skb->len);
-	}
+	else
+		netlink_queue_rcv_skb(sk, skb);
 
 	if (cb->done)
 		cb->done(cb);
-- 
1.7.4.4


* [PATCH 04/11] netlink: don't orphan skb in netlink_trim()
  2011-09-03 17:26 [PATCH RFC 0/11] netlink: memory mapped I/O kaber
                   ` (2 preceding siblings ...)
  2011-09-03 17:26 ` [PATCH 03/11] netlink: add helper function for queueing skbs to the receive queue kaber
@ 2011-09-03 17:26 ` kaber
  2011-09-03 17:26 ` [PATCH 05/11] netlink: add netlink_skb_set_owner_r() kaber
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

From: Patrick McHardy <kaber@trash.net>

Netlink doesn't account skbs to the sending socket, so there's no
need to orphan the skb before trimming it.

Removing the skb_orphan() call is required for mmap'ed netlink, which uses
a netlink specific skb destructor that must not be invoked before the
final freeing of the skb.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/netlink/af_netlink.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index a9f876b..7b9d7d0 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -862,7 +862,7 @@ static inline struct sk_buff *netlink_trim(struct sk_buff *skb,
 {
 	int delta;
 
-	skb_orphan(skb);
+	WARN_ON(skb->sk != NULL);
 
 	delta = skb->end - skb->tail;
 	if (delta * 2 < skb->truesize)
-- 
1.7.4.4



* [PATCH 05/11] netlink: add netlink_skb_set_owner_r()
  2011-09-03 17:26 [PATCH RFC 0/11] netlink: memory mapped I/O kaber
                   ` (3 preceding siblings ...)
  2011-09-03 17:26 ` [PATCH 04/11] netlink: don't orphan skb in netlink_trim() kaber
@ 2011-09-03 17:26 ` kaber
  2011-09-03 17:26 ` [PATCH 06/11] netlink: memory mapped netlink: ring setup kaber
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

From: Patrick McHardy <kaber@trash.net>

For mmap'ed I/O a netlink specific skb destructor needs to be invoked after the
final kfree_skb() to clean up state in the memory mapped frames. This doesn't
work currently since the skb's ownership is transferred to the receiving
socket using skb_set_owner_r(), which orphans the skb, thereby invoking the
destructor prematurely.

Since netlink doesn't account skbs to the originating socket, there's no need
to orphan the skb. Add a netlink specific skb_set_owner_r() variant that does
not orphan the skb and use a netlink specific destructor to call sock_rfree().

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/netlink/af_netlink.c |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7b9d7d0..1402acf 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -161,6 +161,20 @@ static void netlink_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	sk->sk_data_ready(sk, len);
 }
 
+static void netlink_skb_destructor(struct sk_buff *skb)
+{
+	sock_rfree(skb);
+}
+
+static void netlink_skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
+{
+	WARN_ON(skb->sk != NULL);
+	skb->sk = sk;
+	skb->destructor = netlink_skb_destructor;
+	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
+	sk_mem_charge(sk, skb->truesize);
+}
+
 static void netlink_sock_destruct(struct sock *sk)
 {
 	struct netlink_sock *nlk = nlk_sk(sk);
@@ -838,7 +852,7 @@ int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
 		}
 		return 1;
 	}
-	skb_set_owner_r(skb, sk);
+	netlink_skb_set_owner_r(skb, sk);
 	return 0;
 }
 
@@ -900,7 +914,7 @@ static inline int netlink_unicast_kernel(struct sock *sk, struct sk_buff *skb)
 	ret = -ECONNREFUSED;
 	if (nlk->netlink_rcv != NULL) {
 		ret = skb->len;
-		skb_set_owner_r(skb, sk);
+		netlink_skb_set_owner_r(skb, sk);
 		nlk->netlink_rcv(skb);
 	}
 	kfree_skb(skb);
-- 
1.7.4.4


* [PATCH 06/11] netlink: memory mapped netlink: ring setup
  2011-09-03 17:26 [PATCH RFC 0/11] netlink: memory mapped I/O kaber
                   ` (4 preceding siblings ...)
  2011-09-03 17:26 ` [PATCH 05/11] netlink: add netlink_skb_set_owner_r() kaber
@ 2011-09-03 17:26 ` kaber
  2011-09-03 17:26 ` [PATCH 07/11] netlink: add memory mapped netlink helper functions kaber
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

From: Patrick McHardy <kaber@trash.net>

Add support for memory mapped RX and TX ring setup and teardown based on
the af_packet.c code. The following patches will use this to add the real
memory mapped receive and transmit functionality.
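
From userspace, setting up the rings could look roughly as follows (the
sizes are arbitrary example values that satisfy the alignment and
consistency checks in netlink_set_ring(); error handling and includes
are omitted):

        struct nl_mmap_req req = {
                .nm_block_size  = 4096,
                .nm_block_nr    = 64,
                .nm_frame_size  = 2048,
                .nm_frame_nr    = 64 * 4096 / 2048,
        };
        unsigned int ring_size = req.nm_block_nr * req.nm_block_size;
        void *rx_ring;

        setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req));
        setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req));

        /* A single mmap() covers both rings, RX ring first, TX ring second */
        rx_ring = mmap(NULL, 2 * ring_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);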

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 include/linux/netlink.h  |   32 +++++
 net/Kconfig              |    9 ++
 net/netlink/af_netlink.c |  287 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 326 insertions(+), 2 deletions(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 2e17c5d..969b95e 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -1,6 +1,7 @@
 #ifndef __LINUX_NETLINK_H
 #define __LINUX_NETLINK_H
 
+#include <linux/netlink.h>
 #include <linux/socket.h> /* for sa_family_t */
 #include <linux/types.h>
 
@@ -102,11 +103,42 @@ struct nlmsgerr {
 #define NETLINK_PKTINFO		3
 #define NETLINK_BROADCAST_ERROR	4
 #define NETLINK_NO_ENOBUFS	5
+#define NETLINK_RX_RING		6
+#define NETLINK_TX_RING		7
 
 struct nl_pktinfo {
 	__u32	group;
 };
 
+struct nl_mmap_req {
+	unsigned int	nm_block_size;
+	unsigned int	nm_block_nr;
+	unsigned int	nm_frame_size;
+	unsigned int	nm_frame_nr;
+};
+
+struct nl_mmap_hdr {
+	unsigned int	nm_status;
+	unsigned int	nm_len;
+	__u32		nm_group;
+	/* credentials */
+	__u32		nm_pid;
+	__u32		nm_uid;
+	__u32		nm_gid;
+};
+
+enum nl_mmap_status {
+	NL_MMAP_STATUS_UNUSED,
+	NL_MMAP_STATUS_RESERVED,
+	NL_MMAP_STATUS_VALID,
+	NL_MMAP_STATUS_COPY,
+	NL_MMAP_STATUS_SKIP,
+};
+
+#define NL_MMAP_MSG_ALIGNMENT		NLMSG_ALIGNTO
+#define NL_MMAP_MSG_ALIGN(sz)		__ALIGN_KERNEL(sz, NL_MMAP_MSG_ALIGNMENT)
+#define NL_MMAP_HDRLEN			NL_MMAP_MSG_ALIGN(sizeof(struct nl_mmap_hdr))
+
 #define NET_MAJOR 36		/* Major 36 is reserved for networking 						*/
 
 enum {
diff --git a/net/Kconfig b/net/Kconfig
index a073148..93599e0 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -23,6 +23,15 @@ menuconfig NET
 
 if NET
 
+config NETLINK_MMAP
+	bool "Netlink: mmaped I/O"
+	help
+	  This option enables support for memory mapped netlink I/O. This
+	  reduces overhead by avoiding copying data between kernel- and
+	  userspace.
+
+	  If unsure, say N.
+
 config WANT_COMPAT_NETLINK_MESSAGES
 	bool
 	help
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 1402acf..6d4db46 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -55,6 +55,7 @@
 #include <linux/types.h>
 #include <linux/audit.h>
 #include <linux/mutex.h>
+#include <linux/vmalloc.h>
 
 #include <net/net_namespace.h>
 #include <net/sock.h>
@@ -64,6 +65,20 @@
 #define NLGRPSZ(x)	(ALIGN(x, sizeof(unsigned long) * 8) / 8)
 #define NLGRPLONGS(x)	(NLGRPSZ(x)/sizeof(unsigned long))
 
+struct netlink_ring {
+	void			**pg_vec;
+	unsigned int		head;
+	unsigned int		frames_per_block;
+	unsigned int		frame_size;
+	unsigned int		frame_max;
+
+	unsigned int		pg_vec_order;
+	unsigned int		pg_vec_pages;
+	unsigned int		pg_vec_len;
+
+	atomic_t		pending;
+};
+
 struct netlink_sock {
 	/* struct sock has to be the first member of netlink_sock */
 	struct sock		sk;
@@ -81,6 +96,12 @@ struct netlink_sock {
 	struct mutex		cb_def_mutex;
 	void			(*netlink_rcv)(struct sk_buff *skb);
 	struct module		*module;
+#ifdef CONFIG_NETLINK_MMAP
+	struct mutex		pg_vec_lock;
+	struct netlink_ring	rx_ring;
+	struct netlink_ring	tx_ring;
+	atomic_t		mapped;
+#endif /* CONFIG_NETLINK_MMAP */
 };
 
 struct listeners {
@@ -153,6 +174,234 @@ static struct hlist_head *nl_pid_hashfn(struct nl_pid_hash *hash, u32 pid)
 	return &hash->table[jhash_1word(pid, hash->rnd) & hash->mask];
 }
 
+#ifdef CONFIG_NETLINK_MMAP
+static __pure struct page *pgvec_to_page(const void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		return vmalloc_to_page(addr);
+	else
+		return virt_to_page(addr);
+}
+
+static void free_pg_vec(void **pg_vec, unsigned int order, unsigned int len)
+{
+	unsigned int i;
+
+	for (i = 0; i < len; i++) {
+		if (pg_vec[i] != NULL) {
+			if (is_vmalloc_addr(pg_vec[i]))
+				vfree(pg_vec[i]);
+			else
+				free_pages((unsigned long)pg_vec[i], order);
+		}
+	}
+	kfree(pg_vec);
+}
+
+static void *alloc_one_pg_vec_page(unsigned long order)
+{
+	void *buffer;
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | __GFP_ZERO |
+			  __GFP_NOWARN | __GFP_NORETRY;
+
+	buffer = (void *)__get_free_pages(gfp_flags, order);
+	if (buffer != NULL)
+		return buffer;
+
+	buffer = vzalloc((1 << order) * PAGE_SIZE);
+	if (buffer != NULL)
+		return buffer;
+
+	gfp_flags &= ~__GFP_NORETRY;
+	return (void *)__get_free_pages(gfp_flags, order);
+}
+
+static void **alloc_pg_vec(struct netlink_sock *nlk,
+			   struct nl_mmap_req *req, unsigned int order)
+{
+	unsigned int block_nr = req->nm_block_nr;
+	unsigned int i;
+	void **pg_vec, *ptr;
+
+	pg_vec = kcalloc(block_nr, sizeof(void *), GFP_KERNEL);
+	if (pg_vec == NULL)
+		return NULL;
+
+	for (i = 0; i < block_nr; i++) {
+		pg_vec[i] = ptr = alloc_one_pg_vec_page(order);
+		if (pg_vec[i] == NULL)
+			goto err1;
+	}
+
+	return pg_vec;
+err1:
+	free_pg_vec(pg_vec, order, block_nr);
+	return NULL;
+}
+
+static int netlink_set_ring(struct sock *sk, struct nl_mmap_req *req,
+			    bool closing, bool tx_ring)
+{
+	struct netlink_sock *nlk = nlk_sk(sk);
+	struct netlink_ring *ring;
+	struct sk_buff_head *queue;
+	void **pg_vec = NULL;
+	unsigned int order = 0;
+	int err;
+
+	ring  = tx_ring ? &nlk->tx_ring : &nlk->rx_ring;
+	queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
+
+	if (!closing) {
+		if (atomic_read(&nlk->mapped))
+			return -EBUSY;
+		if (atomic_read(&ring->pending))
+			return -EBUSY;
+	}
+
+	if (req->nm_block_nr) {
+		if (ring->pg_vec != NULL)
+			return -EBUSY;
+
+		if ((int)req->nm_block_size <= 0)
+			return -EINVAL;
+		if (!IS_ALIGNED(req->nm_block_size, PAGE_SIZE))
+			return -EINVAL;
+		if (req->nm_frame_size < NL_MMAP_HDRLEN)
+			return -EINVAL;
+		if (!IS_ALIGNED(req->nm_frame_size, NL_MMAP_MSG_ALIGNMENT))
+			return -EINVAL;
+
+		ring->frames_per_block = req->nm_block_size /
+					 req->nm_frame_size;
+		if (ring->frames_per_block == 0)
+			return -EINVAL;
+		if (ring->frames_per_block * req->nm_block_nr !=
+		    req->nm_frame_nr)
+			return -EINVAL;
+
+		order = get_order(req->nm_block_size);
+		pg_vec = alloc_pg_vec(nlk, req, order);
+		if (pg_vec == NULL)
+			return -ENOMEM;
+	} else {
+		if (req->nm_frame_nr)
+			return -EINVAL;
+	}
+
+	err = -EBUSY;
+	mutex_lock(&nlk->pg_vec_lock);
+	if (closing || atomic_read(&nlk->mapped) == 0) {
+		err = 0;
+		spin_lock_bh(&queue->lock);
+
+		ring->frame_max		= req->nm_frame_nr - 1;
+		ring->head		= 0;
+		ring->frame_size	= req->nm_frame_size;
+		ring->pg_vec_pages	= req->nm_block_size / PAGE_SIZE;
+
+		swap(ring->pg_vec_len, req->nm_block_nr);
+		swap(ring->pg_vec_order, order);
+		swap(ring->pg_vec, pg_vec);
+
+		__skb_queue_purge(queue);
+		spin_unlock_bh(&queue->lock);
+
+		WARN_ON(atomic_read(&nlk->mapped));
+	}
+	mutex_unlock(&nlk->pg_vec_lock);
+
+	if (pg_vec)
+		free_pg_vec(pg_vec, order, req->nm_block_nr);
+	return err;
+}
+
+static void netlink_mm_open(struct vm_area_struct *vma)
+{
+	struct file *file = vma->vm_file;
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+
+	if (sk)
+		atomic_inc(&nlk_sk(sk)->mapped);
+}
+
+static void netlink_mm_close(struct vm_area_struct *vma)
+{
+	struct file *file = vma->vm_file;
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+
+	if (sk)
+		atomic_dec(&nlk_sk(sk)->mapped);
+}
+
+static const struct vm_operations_struct netlink_mmap_ops = {
+	.open	= netlink_mm_open,
+	.close	= netlink_mm_close,
+};
+
+static int netlink_mmap(struct file *file, struct socket *sock,
+			struct vm_area_struct *vma)
+{
+	struct sock *sk = sock->sk;
+	struct netlink_sock *nlk = nlk_sk(sk);
+	struct netlink_ring *ring;
+	unsigned long start, size, expected;
+	unsigned int i;
+	int err = -EINVAL;
+
+	if (vma->vm_pgoff)
+		return -EINVAL;
+
+	mutex_lock(&nlk->pg_vec_lock);
+
+	expected = 0;
+	for (ring = &nlk->rx_ring; ring <= &nlk->tx_ring; ring++) {
+		if (ring->pg_vec == NULL)
+			continue;
+		expected += ring->pg_vec_len * ring->pg_vec_pages * PAGE_SIZE;
+	}
+
+	if (expected == 0)
+		goto out;
+
+	size = vma->vm_end - vma->vm_start;
+	if (size != expected)
+		goto out;
+
+	start = vma->vm_start;
+	for (ring = &nlk->rx_ring; ring <= &nlk->tx_ring; ring++) {
+		if (ring->pg_vec == NULL)
+			continue;
+
+		for (i = 0; i < ring->pg_vec_len; i++) {
+			struct page *page;
+			void *kaddr = ring->pg_vec[i];
+			unsigned int pg_num;
+
+			for (pg_num = 0; pg_num < ring->pg_vec_pages; pg_num++) {
+				page = pgvec_to_page(kaddr);
+				err = vm_insert_page(vma, start, page);
+				if (err < 0)
+					goto out;
+				start += PAGE_SIZE;
+				kaddr += PAGE_SIZE;
+			}
+		}
+	}
+
+	atomic_inc(&nlk->mapped);
+	vma->vm_ops = &netlink_mmap_ops;
+	err = 0;
+out:
+	mutex_unlock(&nlk->pg_vec_lock);
+	return err;
+}
+#else /* CONFIG_NETLINK_MMAP */
+#define netlink_mmap			sock_no_mmap
+#endif /* CONFIG_NETLINK_MMAP */
+
 static void netlink_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	unsigned int len = skb->len;
@@ -186,6 +435,17 @@ static void netlink_sock_destruct(struct sock *sk)
 	}
 
 	skb_queue_purge(&sk->sk_receive_queue);
+#ifdef CONFIG_NETLINK_MMAP
+	if (1) {
+		struct nl_mmap_req req;
+
+		memset(&req, 0, sizeof(req));
+		if (nlk->rx_ring.pg_vec)
+			netlink_set_ring(sk, &req, true, false);
+		if (nlk->tx_ring.pg_vec)
+			netlink_set_ring(sk, &req, true, true);
+	}
+#endif /* CONFIG_NETLINK_MMAP */
 
 	if (!sock_flag(sk, SOCK_DEAD)) {
 		printk(KERN_ERR "Freeing alive netlink socket %p\n", sk);
@@ -448,6 +708,9 @@ static int __netlink_create(struct net *net, struct socket *sock,
 		mutex_init(nlk->cb_mutex);
 	}
 	init_waitqueue_head(&nlk->wait);
+#ifdef CONFIG_NETLINK_MMAP
+	mutex_init(&nlk->pg_vec_lock);
+#endif
 
 	sk->sk_destruct = netlink_sock_destruct;
 	sk->sk_protocol = protocol;
@@ -1222,7 +1485,8 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
 	if (level != SOL_NETLINK)
 		return -ENOPROTOOPT;
 
-	if (optlen >= sizeof(int) &&
+	if (optname != NETLINK_RX_RING && optname != NETLINK_TX_RING &&
+	    optlen >= sizeof(int) &&
 	    get_user(val, (unsigned int __user *)optval))
 		return -EFAULT;
 
@@ -1266,6 +1530,25 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
 			nlk->flags &= ~NETLINK_RECV_NO_ENOBUFS;
 		err = 0;
 		break;
+#ifdef CONFIG_NETLINK_MMAP
+	case NETLINK_RX_RING:
+	case NETLINK_TX_RING: {
+		struct nl_mmap_req req;
+
+		/* Rings might consume more memory than queue limits, require
+		 * CAP_NET_ADMIN.
+		 */
+		if (!capable(CAP_NET_ADMIN))
+			return -EPERM;
+		if (optlen < sizeof(req))
+			return -EINVAL;
+		if (copy_from_user(&req, optval, sizeof(req)))
+			return -EFAULT;
+		err = netlink_set_ring(sk, &req, false,
+				       optname == NETLINK_TX_RING);
+		break;
+	}
+#endif /* CONFIG_NETLINK_MMAP */
 	default:
 		err = -ENOPROTOOPT;
 	}
@@ -2081,7 +2364,7 @@ static const struct proto_ops netlink_ops = {
 	.getsockopt =	netlink_getsockopt,
 	.sendmsg =	netlink_sendmsg,
 	.recvmsg =	netlink_recvmsg,
-	.mmap =		sock_no_mmap,
+	.mmap =		netlink_mmap,
 	.sendpage =	sock_no_sendpage,
 };
 
-- 
1.7.4.4


* [PATCH 07/11] netlink: add memory mapped netlink helper functions
  2011-09-03 17:26 [PATCH RFC 0/11] netlink: memory mapped I/O kaber
                   ` (5 preceding siblings ...)
  2011-09-03 17:26 ` [PATCH 06/11] netlink: memory mapped netlink: ring setup kaber
@ 2011-09-03 17:26 ` kaber
  2011-09-03 17:26 ` [PATCH 08/11] netlink: implement memory mapped sendmsg() kaber
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

From: Patrick McHardy <kaber@trash.net>

Add helper functions for looking up memory mapped frame headers, reading
and writing their status, setting up skbs with memory mapped data areas
and cleaning up state again, as well as a poll function.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 include/linux/netlink.h  |    8 ++
 net/netlink/af_netlink.c |  184 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 190 insertions(+), 2 deletions(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 969b95e..955adc1 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -190,10 +190,18 @@ static inline struct nlmsghdr *nlmsg_hdr(const struct sk_buff *skb)
 	return (struct nlmsghdr *)skb->data;
 }
 
+enum netlink_skb_flags {
+	NETLINK_SKB_MMAPED	= 0x1,		/* Packet data is mmapped */
+	NETLINK_SKB_TX		= 0x2,		/* Packet was sent by userspace */
+	NETLINK_SKB_DELIVERED	= 0x4,		/* Packet was delivered */
+};
+
 struct netlink_skb_parms {
 	struct ucred		creds;		/* Skb credentials	*/
 	__u32			pid;
 	__u32			dst_group;
+	__u32			flags;
+	struct sock		*sk;		/* socket owning mmaped ring */
 };
 
 #define NETLINK_CB(skb)		(*(struct netlink_skb_parms*)&((skb)->cb))
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 6d4db46..229bc03 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -56,6 +56,7 @@
 #include <linux/audit.h>
 #include <linux/mutex.h>
 #include <linux/vmalloc.h>
+#include <asm/cacheflush.h>
 
 #include <net/net_namespace.h>
 #include <net/sock.h>
@@ -158,6 +159,7 @@ static DECLARE_WAIT_QUEUE_HEAD(nl_table_wait);
 
 static int netlink_dump(struct sock *sk);
 static void netlink_destroy_callback(struct netlink_callback *cb);
+static void netlink_skb_destructor(struct sk_buff *skb);
 
 static DEFINE_RWLOCK(nl_table_lock);
 static atomic_t nl_table_users = ATOMIC_INIT(0);
@@ -175,6 +177,11 @@ static struct hlist_head *nl_pid_hashfn(struct nl_pid_hash *hash, u32 pid)
 }
 
 #ifdef CONFIG_NETLINK_MMAP
+static bool netlink_skb_is_mmaped(const struct sk_buff *skb)
+{
+	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
+}
+
 static __pure struct page *pgvec_to_page(const void *addr)
 {
 	if (is_vmalloc_addr(addr))
@@ -398,8 +405,153 @@ out:
 	mutex_unlock(&nlk->pg_vec_lock);
 	return 0;
 }
+
+static void netlink_frame_flush_dcache(const struct nl_mmap_hdr *hdr)
+{
+#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 1
+	struct page *p_start, *p_end;
+
+	/* First page is flushed through netlink_{get,set}_status */
+	p_start = pgvec_to_page(hdr + PAGE_SIZE);
+	p_end   = pgvec_to_page((void *)hdr + NL_MMAP_HDRLEN + hdr->nm_len - 1);
+	while (p_start <= p_end) {
+		flush_dcache_page(p_start);
+		p_start++;
+	}
+#endif
+}
+
+static enum nl_mmap_status netlink_get_status(const struct nl_mmap_hdr *hdr)
+{
+	smp_rmb();
+	flush_dcache_page(pgvec_to_page(hdr));
+	return hdr->nm_status;
+}
+
+static void netlink_set_status(struct nl_mmap_hdr *hdr,
+			       enum nl_mmap_status status)
+{
+	hdr->nm_status = status;
+	flush_dcache_page(pgvec_to_page(hdr));
+	smp_wmb();
+}
+
+static struct nl_mmap_hdr *
+__netlink_lookup_frame(const struct netlink_ring *ring, unsigned int pos)
+{
+	unsigned int pg_vec_pos, frame_off;
+
+	pg_vec_pos = pos / ring->frames_per_block;
+	frame_off  = pos % ring->frames_per_block;
+
+	return ring->pg_vec[pg_vec_pos] + (frame_off * ring->frame_size);
+}
+
+static struct nl_mmap_hdr *
+netlink_lookup_frame(const struct netlink_ring *ring, unsigned int pos,
+		     enum nl_mmap_status status)
+{
+	struct nl_mmap_hdr *hdr;
+
+	hdr = __netlink_lookup_frame(ring, pos);
+	if (netlink_get_status(hdr) != status)
+		return NULL;
+
+	return hdr;
+}
+
+static struct nl_mmap_hdr *
+netlink_current_frame(const struct netlink_ring *ring,
+		      enum nl_mmap_status status)
+{
+	return netlink_lookup_frame(ring, ring->head, status);
+}
+
+static struct nl_mmap_hdr *
+netlink_previous_frame(const struct netlink_ring *ring,
+		       enum nl_mmap_status status)
+{
+	unsigned int prev;
+
+	prev = ring->head ? ring->head - 1 : ring->frame_max;
+	return netlink_lookup_frame(ring, prev, status);
+}
+
+static void netlink_increment_head(struct netlink_ring *ring)
+{
+	ring->head = ring->head != ring->frame_max ? ring->head + 1 : 0;
+}
+
+static void netlink_forward_ring(struct netlink_ring *ring)
+{
+	unsigned int head = ring->head, pos = head;
+	const struct nl_mmap_hdr *hdr;
+
+	do {
+		hdr = __netlink_lookup_frame(ring, pos);
+		if (hdr->nm_status == NL_MMAP_STATUS_UNUSED)
+			break;
+		if (hdr->nm_status != NL_MMAP_STATUS_SKIP)
+			break;
+		netlink_increment_head(ring);
+	} while (ring->head != head);
+}
+
+static unsigned int netlink_poll(struct file *file, struct socket *sock,
+				 poll_table *wait)
+{
+	struct sock *sk = sock->sk;
+	struct netlink_sock *nlk = nlk_sk(sk);
+	unsigned int mask;
+
+	mask = datagram_poll(file, sock, wait);
+
+	spin_lock_bh(&sk->sk_receive_queue.lock);
+	if (nlk->rx_ring.pg_vec) {
+		netlink_forward_ring(&nlk->rx_ring);
+		if (!netlink_previous_frame(&nlk->rx_ring, NL_MMAP_STATUS_UNUSED))
+			mask |= POLLIN | POLLRDNORM;
+	}
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
+
+	spin_lock_bh(&sk->sk_write_queue.lock);
+	if (nlk->tx_ring.pg_vec) {
+		if (netlink_current_frame(&nlk->tx_ring, NL_MMAP_STATUS_UNUSED))
+			mask |= POLLOUT | POLLWRNORM;
+	}
+	spin_unlock_bh(&sk->sk_write_queue.lock);
+
+	return mask;
+}
+
+static struct nl_mmap_hdr *netlink_mmap_hdr(struct sk_buff *skb)
+{
+	return (struct nl_mmap_hdr *)(skb->head - NL_MMAP_HDRLEN);
+}
+
+void netlink_ring_setup_skb(struct sk_buff *skb, struct sock *sk,
+			    struct netlink_ring *ring, struct nl_mmap_hdr *hdr)
+{
+	unsigned int size;
+	void *data;
+
+	size = ring->frame_size - NL_MMAP_HDRLEN;
+	data = (void *)hdr + NL_MMAP_HDRLEN;
+
+	skb->head	= data;
+	skb->data	= data;
+	skb_reset_tail_pointer(skb);
+	skb->end	= skb->tail + size;
+	skb->len	= 0;
+
+	skb->destructor	= netlink_skb_destructor;
+	NETLINK_CB(skb).flags |= NETLINK_SKB_MMAPED;
+	NETLINK_CB(skb).sk = sk;
+}
 #else /* CONFIG_NETLINK_MMAP */
+#define netlink_skb_is_mmaped(skb)	false
 #define netlink_mmap			sock_no_mmap
+#define netlink_poll			datagram_poll
 #endif /* CONFIG_NETLINK_MMAP */
 
 static void netlink_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
@@ -412,7 +564,35 @@ static void netlink_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
 static void netlink_skb_destructor(struct sk_buff *skb)
 {
-	sock_rfree(skb);
+#ifdef CONFIG_NETLINK_MMAP
+	struct nl_mmap_hdr *hdr;
+	struct netlink_ring *ring;
+	struct sock *sk;
+
+	/* If a packet from the kernel to userspace was freed because of an
+	 * error without being delivered to userspace, the kernel must reset
+	 * the status. In the direction userspace to kernel, the status is
+	 * always reset here after the packet was processed and freed.
+	 */
+	if (netlink_skb_is_mmaped(skb)) {
+		hdr = netlink_mmap_hdr(skb);
+		sk = NETLINK_CB(skb).sk;
+
+		if (!(NETLINK_CB(skb).flags & NETLINK_SKB_DELIVERED)) {
+			hdr->nm_len = 0;
+			netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
+		}
+		ring = &nlk_sk(sk)->rx_ring;
+
+		WARN_ON(atomic_read(&ring->pending) == 0);
+		atomic_dec(&ring->pending);
+		sock_put(sk);
+
+		skb->data = NULL;
+	}
+#endif
+	if (skb->sk != NULL)
+		sock_rfree(skb);
 }
 
 static void netlink_skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
@@ -2356,7 +2536,7 @@ static const struct proto_ops netlink_ops = {
 	.socketpair =	sock_no_socketpair,
 	.accept =	sock_no_accept,
 	.getname =	netlink_getname,
-	.poll =		datagram_poll,
+	.poll =		netlink_poll,
 	.ioctl =	sock_no_ioctl,
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
-- 
1.7.4.4



* [PATCH 08/11] netlink: implement memory mapped sendmsg()
  2011-09-03 17:26 [PATCH RFC 0/11] netlink: memory mapped I/O kaber
                   ` (6 preceding siblings ...)
  2011-09-03 17:26 ` [PATCH 07/11] netlink: add memory mapped netlink helper functions kaber
@ 2011-09-03 17:26 ` kaber
  2011-09-04 16:18   ` Michał Mirosław
  2011-09-03 17:26 ` [PATCH 09/11] netlink: implement memory mapped recvmsg() kaber
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

From: Patrick McHardy <kaber@trash.net>

Add support for memory mapped sendmsg() to netlink. Userspace queues frames
to be processed into the TX ring and invokes sendmsg() with
msg.iov.iov_base = NULL to trigger processing of all pending messages.

Since the kernel usually performs full message validation before beginning
processing, userspace must be prevented from modifying the message
contents while the kernel is processing them. In order to do so, the
frame's contents are copied to an allocated skb in case the ring is
mapped more than once or the file descriptor is shared (f.i. through
AF_UNIX file descriptor passing).

Otherwise an skb without a data area is allocated, the data pointer is set
to point to the data area of the ring frame and the skb is processed.
Once the skb is freed, the destructor releases the frame back to userspace
by setting the status to NL_MMAP_STATUS_UNUSED.
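
On the userspace side, transmission could then look roughly like this
(current_tx_frame() is an application-side helper returning the frame at
the current TX ring position; error handling omitted):

        struct nl_mmap_hdr *hdr = current_tx_frame();
        struct nlmsghdr *nlh = (struct nlmsghdr *)((char *)hdr + NL_MMAP_HDRLEN);
        struct iovec iov = { .iov_base = NULL, .iov_len = 0 };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

        /* Construct the message directly inside the ring frame */
        nlh->nlmsg_len   = NLMSG_LENGTH(0);
        nlh->nlmsg_type  = NLMSG_NOOP;
        nlh->nlmsg_flags = NLM_F_REQUEST;
        nlh->nlmsg_seq   = 1;
        nlh->nlmsg_pid   = 0;

        /* Hand the frame over to the kernel */
        hdr->nm_len    = nlh->nlmsg_len;
        hdr->nm_status = NL_MMAP_STATUS_VALID;

        /* iov_base == NULL triggers processing of all pending TX frames */
        sendmsg(fd, &msg, 0);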

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/netlink/af_netlink.c |  129 +++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 123 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 229bc03..9b6400f 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -182,6 +182,11 @@ static bool netlink_skb_is_mmaped(const struct sk_buff *skb)
 	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
 }
 
+static bool netlink_tx_is_mmaped(struct sock *sk)
+{
+	return nlk_sk(sk)->tx_ring.pg_vec != NULL;
+}
+
 static __pure struct page *pgvec_to_page(const void *addr)
 {
 	if (is_vmalloc_addr(addr))
@@ -548,10 +553,108 @@ void netlink_ring_setup_skb(struct sk_buff *skb, struct sock *sk,
 	NETLINK_CB(skb).flags |= NETLINK_SKB_MMAPED;
 	NETLINK_CB(skb).sk = sk;
 }
+
+static int netlink_mmap_sendmsg(struct sock *sk, struct msghdr *msg,
+				u32 dst_pid, u32 dst_group,
+				struct sock_iocb *siocb)
+{
+	struct netlink_sock *nlk = nlk_sk(sk);
+	struct netlink_ring *ring;
+	struct nl_mmap_hdr *hdr;
+	struct sk_buff *skb;
+	unsigned int maxlen;
+	bool excl = true;
+	int err = 0, len = 0;
+
+	/* Netlink messages are validated by the receiver before processing.
+	 * In order to avoid userspace changing the contents of the message
+	 * after validation, the socket and the ring may only be used by a
+	 * single process, otherwise we fall back to copying.
+	 */
+	if (atomic_long_read(&sk->sk_socket->file->f_count) > 2 ||
+	    atomic_read(&nlk->mapped) > 1)
+		excl = false;
+
+	mutex_lock(&nlk->pg_vec_lock);
+
+	ring   = &nlk->tx_ring;
+	maxlen = ring->frame_size - NL_MMAP_HDRLEN;
+
+	do {
+		hdr = netlink_current_frame(ring, NL_MMAP_STATUS_VALID);
+		if (hdr == NULL) {
+			if (!(msg->msg_flags & MSG_DONTWAIT) &&
+			    atomic_read(&nlk->tx_ring.pending))
+				schedule();
+			continue;
+		}
+		if (hdr->nm_len > maxlen) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		netlink_frame_flush_dcache(hdr);
+
+		if (likely(dst_pid == 0 && dst_group == 0 && excl)) {
+			skb = alloc_skb_head(GFP_KERNEL);
+			if (skb == NULL) {
+				err = -ENOBUFS;
+				goto out;
+			}
+			sock_hold(sk);
+			netlink_ring_setup_skb(skb, sk, ring, hdr);
+			NETLINK_CB(skb).flags |= NETLINK_SKB_TX;
+			__skb_put(skb, hdr->nm_len);
+			netlink_set_status(hdr, NL_MMAP_STATUS_RESERVED);
+			atomic_inc(&ring->pending);
+		} else {
+			skb = alloc_skb(hdr->nm_len, GFP_KERNEL);
+			if (skb == NULL) {
+				err = -ENOBUFS;
+				goto out;
+			}
+			__skb_put(skb, hdr->nm_len);
+			memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, hdr->nm_len);
+			netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
+		}
+
+		netlink_increment_head(ring);
+
+		NETLINK_CB(skb).pid	  = nlk->pid;
+		NETLINK_CB(skb).dst_group = dst_group;
+		NETLINK_CB(skb).creds	  = siocb->scm->creds;
+
+		err = security_netlink_send(sk, skb);
+		if (err) {
+			kfree_skb(skb);
+			goto out;
+		}
+
+		if (unlikely(dst_group)) {
+			atomic_inc(&skb->users);
+			netlink_broadcast(sk, skb, dst_pid, dst_group, GFP_KERNEL);
+		}
+		err = netlink_unicast(sk, skb, dst_pid, msg->msg_flags & MSG_DONTWAIT);
+		if (err < 0)
+			goto out;
+		len += err;
+
+	} while (hdr != NULL ||
+		 (!(msg->msg_flags & MSG_DONTWAIT) &&
+		  atomic_read(&nlk->tx_ring.pending)));
+
+	if (len > 0)
+		err = len;
+out:
+	mutex_unlock(&nlk->pg_vec_lock);
+	return err;
+}
 #else /* CONFIG_NETLINK_MMAP */
 #define netlink_skb_is_mmaped(skb)	false
+#define netlink_tx_is_mmaped(sk)	false
 #define netlink_mmap			sock_no_mmap
 #define netlink_poll			datagram_poll
+#define netlink_mmap_sendmsg(sk, msg, dst_pid, dst_group, siocb)	0
 #endif /* CONFIG_NETLINK_MMAP */
 
 static void netlink_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
@@ -578,11 +681,16 @@ static void netlink_skb_destructor(struct sk_buff *skb)
 		hdr = netlink_mmap_hdr(skb);
 		sk = NETLINK_CB(skb).sk;
 
-		if (!(NETLINK_CB(skb).flags & NETLINK_SKB_DELIVERED)) {
-			hdr->nm_len = 0;
-			netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
+		if (NETLINK_CB(skb).flags & NETLINK_SKB_TX) {
+			netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
+			ring = &nlk_sk(sk)->tx_ring;
+		} else {
+			if (!(NETLINK_CB(skb).flags & NETLINK_SKB_DELIVERED)) {
+				hdr->nm_len = 0;
+				netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
+			}
+			ring = &nlk_sk(sk)->rx_ring;
 		}
-		ring = &nlk_sk(sk)->rx_ring;
 
 		WARN_ON(atomic_read(&ring->pending) == 0);
 		atomic_dec(&ring->pending);
@@ -1266,8 +1374,9 @@ int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
 
 	nlk = nlk_sk(sk);
 
-	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-	    test_bit(NETLINK_CONGESTED, &nlk->state)) {
+	if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
+	     test_bit(NETLINK_CONGESTED, &nlk->state)) &&
+	    !netlink_skb_is_mmaped(skb)) {
 		DECLARE_WAITQUEUE(wait, current);
 		if (!*timeo) {
 			if (!ssk || netlink_is_kernel(ssk))
@@ -1320,6 +1429,8 @@ static inline struct sk_buff *netlink_trim(struct sk_buff *skb,
 	int delta;
 
 	WARN_ON(skb->sk != NULL);
+	if (netlink_skb_is_mmaped(skb))
+		return skb;
 
 	delta = skb->end - skb->tail;
 	if (delta * 2 < skb->truesize)
@@ -1839,6 +1950,12 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
 			goto out;
 	}
 
+	if (netlink_tx_is_mmaped(sk) &&
+	    msg->msg_iov->iov_base == NULL) {
+		err = netlink_mmap_sendmsg(sk, msg, dst_pid, dst_group, siocb);
+		goto out;
+	}
+
 	err = -EMSGSIZE;
 	if (len > sk->sk_sndbuf - 32)
 		goto out;
-- 
1.7.4.4


* [PATCH 09/11] netlink: implement memory mapped recvmsg()
  2011-09-03 17:26 [PATCH RFC 0/11] netlink: memory mapped I/O kaber
                   ` (7 preceding siblings ...)
  2011-09-03 17:26 ` [PATCH 08/11] netlink: implement memory mapped sendmsg() kaber
@ 2011-09-03 17:26 ` kaber
  2011-09-03 17:26 ` [PATCH 10/11] netlink: add documentation for memory mapped I/O kaber
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

From: Patrick McHardy <kaber@trash.net>

Add support for memory mapped recvmsg() to netlink. When the kernel wants
to send a message to userspace, it uses a special skb allocation function
that looks up the receiving socket and, in case the socket uses memory
mapped I/O, allocates an skb without a data area and sets the data pointer
to point to the data area of the ring frame. Message construction then
proceeds as normal; once completed, the ownership of the ring frame is
transferred to userspace by setting the status to NL_MMAP_STATUS_VALID.

When memory mapped I/O can't be used because the required size exceeds
the frame size or the sending subsystem doesn't support memory mapped
I/O, the skb is queued to the socket receive queue and the frame status
is set to NL_MMAP_STATUS_COPY to instruct userspace to invoke recvmsg()
to receive the queued message.

Unlike with regular socket I/O, userspace normally doesn't invoke
recvmsg(), so flow control of dumps happens in netlink_poll() under
the assumption that poll() is usually only invoked when all pending
frames have been processed.
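
On the userspace side, reception then roughly becomes the following loop
(current_rx_frame()/advance_rx_frame() are application-side ring
bookkeeping helpers; pfd, msg, buf and len for the recvmsg() fallback are
assumed to be set up elsewhere):

        struct nl_mmap_hdr *hdr;

        poll(&pfd, 1, -1);

        for (;;) {
                hdr = current_rx_frame();

                if (hdr->nm_status == NL_MMAP_STATUS_VALID) {
                        /* message was constructed directly in the ring frame */
                        process_msg((char *)hdr + NL_MMAP_HDRLEN, hdr->nm_len);
                } else if (hdr->nm_status == NL_MMAP_STATUS_COPY) {
                        /* message exceeded the frame size or the subsystem
                         * doesn't support mmap: fetch it via recvmsg()
                         */
                        len = recvmsg(fd, &msg, 0);
                        process_msg(buf, len);
                } else {
                        /* NL_MMAP_STATUS_UNUSED: nothing more pending */
                        break;
                }

                /* Release the frame back to the kernel and advance */
                hdr->nm_len    = 0;
                hdr->nm_status = NL_MMAP_STATUS_UNUSED;
                advance_rx_frame();
        }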

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 include/linux/netlink.h  |    2 +
 net/netlink/af_netlink.c |  207 ++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 182 insertions(+), 27 deletions(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 955adc1..2bef899 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -223,6 +223,8 @@ extern void __netlink_clear_multicast_users(struct sock *sk, unsigned int group)
 extern void netlink_clear_multicast_users(struct sock *sk, unsigned int group);
 extern void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err);
 extern int netlink_has_listeners(struct sock *sk, unsigned int group);
+extern struct sk_buff *netlink_alloc_skb(struct sock *ssk, unsigned int size,
+					 u32 dst_pid, gfp_t gfp_mask);
 extern int netlink_unicast(struct sock *ssk, struct sk_buff *skb, __u32 pid, int nonblock);
 extern int netlink_broadcast(struct sock *ssk, struct sk_buff *skb, __u32 pid,
 			     __u32 group, gfp_t allocation);
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 9b6400f..2f4745d 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -176,12 +176,40 @@ static struct hlist_head *nl_pid_hashfn(struct nl_pid_hash *hash, u32 pid)
 	return &hash->table[jhash_1word(pid, hash->rnd) & hash->mask];
 }
 
+static void netlink_overrun(struct sock *sk)
+{
+	struct netlink_sock *nlk = nlk_sk(sk);
+
+	if (!(nlk->flags & NETLINK_RECV_NO_ENOBUFS)) {
+		if (!test_and_set_bit(NETLINK_CONGESTED, &nlk_sk(sk)->state)) {
+			sk->sk_err = ENOBUFS;
+			sk->sk_error_report(sk);
+		}
+	}
+	atomic_inc(&sk->sk_drops);
+}
+
+static inline void netlink_rcv_wake(struct sock *sk)
+{
+	struct netlink_sock *nlk = nlk_sk(sk);
+
+	if (skb_queue_empty(&sk->sk_receive_queue))
+		clear_bit(NETLINK_CONGESTED, &nlk->state);
+	if (!test_bit(NETLINK_CONGESTED, &nlk->state))
+		wake_up_interruptible(&nlk->wait);
+}
+
 #ifdef CONFIG_NETLINK_MMAP
 static bool netlink_skb_is_mmaped(const struct sk_buff *skb)
 {
 	return NETLINK_CB(skb).flags & NETLINK_SKB_MMAPED;
 }
 
+static bool netlink_rx_is_mmaped(struct sock *sk)
+{
+	return nlk_sk(sk)->rx_ring.pg_vec != NULL;
+}
+
 static bool netlink_tx_is_mmaped(struct sock *sk)
 {
 	return nlk_sk(sk)->tx_ring.pg_vec != NULL;
@@ -508,6 +536,22 @@ static unsigned int netlink_poll(struct file *file, struct socket *sock,
 	struct sock *sk = sock->sk;
 	struct netlink_sock *nlk = nlk_sk(sk);
 	unsigned int mask;
+	int err;
+
+	if (nlk->rx_ring.pg_vec != NULL) {
+		/* Memory mapped sockets don't call recvmsg(), so flow control
+		 * is performed here under the assumption that the entire ring
+		 * has been processed before invoking poll().
+		 */
+		if (nlk->cb != NULL) {
+			err = netlink_dump(sk);
+			if (err < 0) {
+				sk->sk_err = err;
+				sk->sk_error_report(sk);
+			}
+		}
+		netlink_rcv_wake(sk);
+	}
 
 	mask = datagram_poll(file, sock, wait);
 
@@ -649,8 +693,53 @@ out:
 	mutex_unlock(&nlk->pg_vec_lock);
 	return err;
 }
+
+static void netlink_queue_mmaped_skb(struct sock *sk, struct sk_buff *skb)
+{
+	struct nl_mmap_hdr *hdr;
+
+	hdr = netlink_mmap_hdr(skb);
+	hdr->nm_len	= skb->len;
+	hdr->nm_group	= NETLINK_CB(skb).dst_group;
+	hdr->nm_pid	= NETLINK_CB(skb).creds.pid;
+	hdr->nm_uid	= NETLINK_CB(skb).creds.uid;
+	hdr->nm_gid	= NETLINK_CB(skb).creds.gid;
+	netlink_frame_flush_dcache(hdr);
+	netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
+
+	NETLINK_CB(skb).flags |= NETLINK_SKB_DELIVERED;
+	kfree_skb(skb);
+}
+
+static void netlink_ring_set_copied(struct sock *sk, struct sk_buff *skb)
+{
+	struct netlink_sock *nlk = nlk_sk(sk);
+	struct netlink_ring *ring = &nlk->rx_ring;
+	struct nl_mmap_hdr *hdr;
+
+	spin_lock_bh(&sk->sk_receive_queue.lock);
+	hdr = netlink_current_frame(ring, NL_MMAP_STATUS_UNUSED);
+	if (hdr == NULL) {
+		spin_unlock_bh(&sk->sk_receive_queue.lock);
+		kfree_skb(skb);
+		netlink_overrun(sk);
+		return;
+	}
+	netlink_increment_head(ring);
+	__skb_queue_tail(&sk->sk_receive_queue, skb);
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
+
+	hdr->nm_len	= skb->len;
+	hdr->nm_group	= NETLINK_CB(skb).dst_group;
+	hdr->nm_pid	= NETLINK_CB(skb).creds.pid;
+	hdr->nm_uid	= NETLINK_CB(skb).creds.uid;
+	hdr->nm_gid	= NETLINK_CB(skb).creds.gid;
+	netlink_set_status(hdr, NL_MMAP_STATUS_COPY);
+}
+
 #else /* CONFIG_NETLINK_MMAP */
 #define netlink_skb_is_mmaped(skb)	false
+#define netlink_rx_is_mmaped(sk)	false
 #define netlink_tx_is_mmaped(sk)	false
 #define netlink_mmap			sock_no_mmap
 #define netlink_poll			datagram_poll
@@ -661,7 +750,14 @@ static void netlink_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	unsigned int len = skb->len;
 
-	skb_queue_tail(&sk->sk_receive_queue, skb);
+#ifdef CONFIG_NETLINK_MMAP
+	if (netlink_skb_is_mmaped(skb))
+		netlink_queue_mmaped_skb(sk, skb);
+	else if (netlink_rx_is_mmaped(sk))
+		netlink_ring_set_copied(sk, skb);
+	else
+#endif /* CONFIG_NETLINK_MMAP */
+		skb_queue_tail(&sk->sk_receive_queue, skb);
 	sk->sk_data_ready(sk, len);
 }
 
@@ -1309,19 +1405,6 @@ static int netlink_getname(struct socket *sock, struct sockaddr *addr,
 	return 0;
 }
 
-static void netlink_overrun(struct sock *sk)
-{
-	struct netlink_sock *nlk = nlk_sk(sk);
-
-	if (!(nlk->flags & NETLINK_RECV_NO_ENOBUFS)) {
-		if (!test_and_set_bit(NETLINK_CONGESTED, &nlk_sk(sk)->state)) {
-			sk->sk_err = ENOBUFS;
-			sk->sk_error_report(sk);
-		}
-	}
-	atomic_inc(&sk->sk_drops);
-}
-
 static struct sock *netlink_getsockbypid(struct sock *ssk, u32 pid)
 {
 	struct sock *sock;
@@ -1450,16 +1533,6 @@ static inline struct sk_buff *netlink_trim(struct sk_buff *skb,
 	return skb;
 }
 
-static inline void netlink_rcv_wake(struct sock *sk)
-{
-	struct netlink_sock *nlk = nlk_sk(sk);
-
-	if (skb_queue_empty(&sk->sk_receive_queue))
-		clear_bit(NETLINK_CONGESTED, &nlk->state);
-	if (!test_bit(NETLINK_CONGESTED, &nlk->state))
-		wake_up_interruptible(&nlk->wait);
-}
-
 static inline int netlink_unicast_kernel(struct sock *sk, struct sk_buff *skb)
 {
 	int ret;
@@ -1512,6 +1585,69 @@ retry:
 }
 EXPORT_SYMBOL(netlink_unicast);
 
+struct sk_buff *netlink_alloc_skb(struct sock *ssk, unsigned int size,
+				  u32 dst_pid, gfp_t gfp_mask)
+{
+#ifdef CONFIG_NETLINK_MMAP
+	struct sock *sk = NULL;
+	struct sk_buff *skb;
+	struct netlink_ring *ring;
+	struct nl_mmap_hdr *hdr;
+	unsigned int maxlen;
+
+	sk = netlink_getsockbypid(ssk, dst_pid);
+	if (IS_ERR(sk))
+		goto out;
+
+	ring = &nlk_sk(sk)->rx_ring;
+	/* fast-path without atomic ops for common case: non-mmaped receiver */
+	if (ring->pg_vec == NULL)
+		goto out_put;
+
+	skb = alloc_skb_head(gfp_mask);
+	if (skb == NULL)
+		goto err1;
+
+	spin_lock_bh(&sk->sk_receive_queue.lock);
+	/* check again under lock */
+	if (ring->pg_vec == NULL)
+		goto out_free;
+
+	maxlen = ring->frame_size - NL_MMAP_HDRLEN;
+	if (maxlen < size)
+		goto out_free;
+
+	netlink_forward_ring(ring);
+	hdr = netlink_current_frame(ring, NL_MMAP_STATUS_UNUSED);
+	if (hdr == NULL)
+		goto err2;
+	netlink_ring_setup_skb(skb, sk, ring, hdr);
+	netlink_set_status(hdr, NL_MMAP_STATUS_RESERVED);
+	atomic_inc(&ring->pending);
+	netlink_increment_head(ring);
+
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
+	return skb;
+
+err2:
+	kfree_skb(skb);
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
+	netlink_overrun(sk);
+err1:
+	sock_put(sk);
+	return NULL;
+
+out_free:
+	kfree_skb(skb);
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
+out_put:
+	sock_put(sk);
+out:
+#endif
+	return alloc_skb(size, gfp_mask);
+}
+EXPORT_SYMBOL_GPL(netlink_alloc_skb);
+
 int netlink_has_listeners(struct sock *sk, unsigned int group)
 {
 	int res = 0;
@@ -2278,9 +2414,13 @@ static int netlink_dump(struct sock *sk)
 
 	alloc_size = max_t(int, cb->min_dump_alloc, NLMSG_GOODSIZE);
 
-	skb = sock_rmalloc(sk, alloc_size, 0, GFP_KERNEL);
+	if (!netlink_rx_is_mmaped(sk) &&
+	    atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
+		goto errout_skb;
+	skb = netlink_alloc_skb(sk, alloc_size, nlk->pid, GFP_KERNEL);
 	if (!skb)
 		goto errout_skb;
+	netlink_skb_set_owner_r(skb, sk);
 
 	len = cb->dump(skb, cb);
 
@@ -2337,11 +2477,23 @@ int netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 	if (cb == NULL)
 		return -ENOBUFS;
 
+	/* Memory mapped dump requests need to be copied to avoid looping
+	 * on the pending state in netlink_mmap_sendmsg() while the cb holds
+	 * a reference to the skb.
+	 */
+	if (netlink_skb_is_mmaped(skb)) {
+		skb = skb_copy(skb, GFP_KERNEL);
+		if (skb == NULL) {
+			kfree(cb);
+			return -ENOBUFS;
+		}
+	} else
+		atomic_inc(&skb->users);
+
 	cb->dump = dump;
 	cb->done = done;
 	cb->nlh = nlh;
 	cb->min_dump_alloc = min_dump_alloc;
-	atomic_inc(&skb->users);
 	cb->skb = skb;
 
 	sk = netlink_lookup(sock_net(ssk), ssk->sk_protocol, NETLINK_CB(skb).pid);
@@ -2386,7 +2538,8 @@ void netlink_ack(struct sk_buff *in_skb, struct nlmsghdr *nlh, int err)
 	if (err)
 		payload += nlmsg_len(nlh);
 
-	skb = nlmsg_new(payload, GFP_KERNEL);
+	skb = netlink_alloc_skb(in_skb->sk, nlmsg_total_size(payload),
+				NETLINK_CB(in_skb).pid, GFP_KERNEL);
 	if (!skb) {
 		struct sock *sk;
 
-- 
1.7.4.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 10/11] netlink: add documentation for memory mapped I/O
  2011-09-03 17:26 [PATCH RFC 0/11] netlink: memory mapped I/O kaber
                   ` (8 preceding siblings ...)
  2011-09-03 17:26 ` [PATCH 09/11] netlink: implement memory mapped recvmsg() kaber
@ 2011-09-03 17:26 ` kaber
  2011-09-03 17:26 ` [PATCH 11/11] nfnetlink: add support for memory mapped netlink kaber
  2011-09-17  5:48 ` [PATCH RFC 0/11] netlink: memory mapped I/O David Miller
  11 siblings, 0 replies; 21+ messages in thread
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

From: Patrick McHardy <kaber@trash.net>

Add documentation and some example code for memory mapped netlink I/O.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 Documentation/networking/netlink_mmap.txt |  328 +++++++++++++++++++++++++++++
 1 files changed, 328 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/networking/netlink_mmap.txt

diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt
new file mode 100644
index 0000000..8034811
--- /dev/null
+++ b/Documentation/networking/netlink_mmap.txt
@@ -0,0 +1,328 @@
+This file documents how to use memory mapped I/O with netlink.
+
+Overview
+--------
+
+Memory mapped netlink I/O can be used to increase throughput and decrease
+overhead of unicast receive and transmit operations. Some netlink subsystems
+require high throughput, mainly the netfilter subsystems nfnetlink_queue
+and nfnetlink_log, but it can also help speed up large dump operations of,
+for instance, the routing database.
+
+Memory mapped netlink I/O uses two circular ring buffers for RX and TX which
+are mapped into the process's address space.
+
+The RX ring is used by the kernel to directly construct netlink messages into
+user-space memory without copying them as done with regular socket I/O.
+Additionally, as long as the ring contains messages, no recvmsg() or poll()
+syscalls have to be issued by user-space to get more messages.
+
+The TX ring is used to process messages directly from user-space memory; the
+kernel processes all messages contained in the ring using a single sendmsg()
+call.
+
+Usage overview
+--------------
+
+In order to use memory mapped netlink I/O, user-space needs three main changes:
+
+- ring setup
+- conversion of the RX path to get messages from the ring instead of recvmsg()
+- conversion of the TX path to construct messages into the ring
+
+Ring setup is done using setsockopt() to provide the ring parameters to the
+kernel, followed by a call to mmap() to map the ring into the process's
+address space:
+
+- setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &params, sizeof(params));
+- setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &params, sizeof(params));
+- ring = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)
+
+Usage of either ring is optional, but even if only the RX ring is used the
+mapping still needs to be writable in order to update the frame status after
+processing.
+
+Conversion of the reception path involves calling poll() on the file
+descriptor; once the socket is readable, the frames from the ring are
+processed in order until no more messages are available, as indicated by
+a status word in the frame header.
+
+On the kernel side, in order to make use of memory mapped I/O on receive, the
+originating netlink subsystem needs to support memory mapped I/O; otherwise
+it will use an allocated socket buffer as usual and the contents will be
+copied to the ring on transmission, nullifying most of the performance gains.
+Dumps of kernel databases automatically support memory mapped I/O.
+
+Conversion of the transmit path involves changing message construction to
+use memory from the TX ring instead of (usually) a buffer declared on the
+stack and setting up the frame header appropriately. Optionally, poll() can
+be used to wait for free frames in the TX ring, as sketched after the
+transmission example below.
+
+Structures and definitions for using memory mapped I/O are contained in
+<linux/netlink.h>.
+
+RX and TX rings
+---------------
+
+Each ring contains a number of contiguous memory blocks, containing frames of
+fixed size, dependent on the parameters used for ring setup.
+
+Ring:	[ block 0 ]
+		[ frame 0 ]
+		[ frame 1 ]
+	[ block 1 ]
+		[ frame 2 ]
+		[ frame 3 ]
+	...
+	[ block n ]
+		[ frame 2 * n ]
+		[ frame 2 * n + 1 ]
+
+The blocks are only visible to the kernel; from the point of view of user-space,
+the ring just contains the frames in a contiguous memory zone.
+
+The ring parameters used for setting up the ring are defined as follows:
+
+struct nl_mmap_req {
+	unsigned int	nm_block_size;
+	unsigned int	nm_block_nr;
+	unsigned int	nm_frame_size;
+	unsigned int	nm_frame_nr;
+};
+
+Frames are grouped into blocks, where each block is a contiguous region of memory
+and holds nm_block_size / nm_frame_size frames. The total number of frames in
+the ring is nm_frame_nr. The following invariants hold:
+
+- frames_per_block = nm_block_size / nm_frame_size
+
+- nm_frame_nr = frames_per_block * nm_block_nr
+
+Some parameters are constrained, specifically:
+
+- nm_block_size must be a multiple of the architecture's memory page size.
+  The getpagesize() function can be used to get the page size.
+
+- nm_frame_size must be greater than or equal to NL_MMAP_HDRLEN, IOW a frame
+  must be able to hold at least the frame header
+
+- nm_frame_size must be less than or equal to nm_block_size
+
+- nm_frame_size must be a multiple of NL_MMAP_MSG_ALIGNMENT
+
+- nm_frame_nr must equal the actual number of frames as specified above.
+
+When the kernel can't allocate physically contiguous memory for a ring block,
+it will fall back to using physically discontiguous memory. This might affect
+performance negatively; in order to avoid this, the nm_frame_size parameter
+should be chosen to be as small as possible for the required frame size and
+the number of blocks should be increased instead.
+
+Ring frames
+-----------
+
+Each frame contains a frame header, consisting of a synchronization word and
+some meta-data, and the message itself.
+
+Frame:	[ header message ]
+
+The frame header is defined as follows:
+
+struct nl_mmap_hdr {
+	unsigned int	nm_status;
+	unsigned int	nm_len;
+	__u32		nm_group;
+	/* credentials */
+	__u32		nm_pid;
+	__u32		nm_uid;
+	__u32		nm_gid;
+};
+
+- nm_status is used for synchronizing processing between the kernel and user-
+  space and specifies ownership of the frame as well as the operation to perform
+
+- nm_len contains the length of the message contained in the data area
+
+- nm_group specifies the destination multicast group of the message
+
+- nm_pid, nm_uid and nm_gid contain the netlink pid, UID and GID of the sending
+  process. These values correspond to the data available using SOCK_PASSCRED in
+  the SCM_CREDENTIALS cmsg.
+
+The possible values in the status word are:
+
+- NL_MMAP_STATUS_UNUSED:
+	RX ring:	frame belongs to the kernel and contains no message
+			for user-space. Appropriate action is to invoke poll()
+			to wait for new messages.
+	TX ring:	frame belongs to user-space and can be used for
+			message construction.
+
+- NL_MMAP_STATUS_RESERVED:
+	RX ring only:	frame is currently used by the kernel for message
+			construction and contains no valid message yet.
+			Appropriate action is to invoke poll() to wait for
+			new messages.
+
+- NL_MMAP_STATUS_VALID:
+	RX ring:	frame contains a valid message. Appropriate action is
+			to process the message and release the frame back to
+			the kernel by setting the status to
+			NL_MMAP_STATUS_UNUSED.
+	TX ring:	the frame contains a valid message from user-space to
+			be processed by the kernel. After completing processing
+			the kernel will release the frame back to user-space by
+			setting the status to NL_MMAP_STATUS_UNUSED.
+
+- NL_MMAP_STATUS_COPY:
+	RX ring only:	a message is ready to be processed but could not be
+			stored in the ring, either because it exceeded the
+			frame size or because the originating subsystem does
+			not support memory mapped I/O. Appropriate action is
+			to invoke recvmsg() to receive the message and release
+			the frame back to the kernel by setting the status to
+			NL_MMAP_STATUS_UNUSED.
+
+- NL_MMAP_STATUS_SKIP:
+	RX ring only:	user-space queued the message for later processing, but
+			processed some messages following it in the ring. The
+			kernel should skip this frame when looking for unused
+			frames.
+
+The data area of a frame begins at an offset of NL_MMAP_HDRLEN relative to the
+frame header.
+
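+For illustration only (these helpers are not part of this patch), frame
+navigation from user-space can be expressed as a few small functions; the
+frame_size and ring_size values are assumed to come from the ring setup:
+
+	static struct nl_mmap_hdr *nl_ring_frame(void *ring, unsigned int offset)
+	{
+		/* A frame header is located at the start of each frame */
+		return (struct nl_mmap_hdr *)((char *)ring + offset);
+	}
+
+	static void *nl_frame_data(struct nl_mmap_hdr *hdr)
+	{
+		/* The message starts NL_MMAP_HDRLEN bytes into the frame */
+		return (char *)hdr + NL_MMAP_HDRLEN;
+	}
+
+	static unsigned int nl_frame_next(unsigned int offset,
+					  unsigned int frame_size,
+					  unsigned int ring_size)
+	{
+		/* Frames are laid out back to back; wrap at the ring end */
+		return (offset + frame_size) % ring_size;
+	}
+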
+TX limitations
+--------------
+
+Kernel processing usually involves validation of the message received from
+user-space, then processing its contents. The kernel must ensure that
+user-space is not able to modify the message contents after they have been
+validated. In order to do so, the message is copied from the ring frame
+to an allocated buffer if either of these conditions is false:
+
+- only a single mapping of the ring exists
+- the file descriptor is not shared between processes
+
+This means that for threaded programs, the kernel will fall back to copying.
+
+Example
+-------
+
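+The examples below assume an already created and bound netlink socket fd.
+A minimal sketch of that setup might look as follows; the protocol used here,
+NETLINK_ROUTE, is only a placeholder for whatever netlink protocol the
+application actually speaks:
+
+	struct sockaddr_nl addr = {
+		.nl_family	= AF_NETLINK,
+	};
+	int fd;
+
+	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (fd < 0)
+		exit(1);
+
+	/* nl_pid == 0: the kernel assigns an address for this socket */
+	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
+		exit(1);
+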
+Ring setup:
+
+	unsigned int block_size = 16 * getpagesize();
+	struct nl_mmap_req req = {
+		.nm_block_size		= block_size,
+		.nm_block_nr		= 64,
+		.nm_frame_size		= 16384,
+		.nm_frame_nr		= 64 * block_size / 16384,
+	};
+	unsigned int ring_size;
+	void *rx_ring, *tx_ring;
+
+	/* Configure ring parameters */
+	if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
+		exit(1);
+	if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
+		exit(1);
+
+	/* Calculate size of each individual ring */
+	ring_size = req.nm_block_nr * req.nm_block_size;
+
+	/* Map RX/TX rings. The TX ring is located after the RX ring */
+	rx_ring = mmap(NULL, 2 * ring_size, PROT_READ | PROT_WRITE,
+		       MAP_SHARED, fd, 0);
+	if ((long)rx_ring == -1L)
+		exit(1);
+	tx_ring = rx_ring + ring_size;
+
+Message reception:
+
+This example assumes that the ring parameters from the ring setup above, such
+as ring_size and the frame size (frame_size), are available.
+
+	unsigned int frame_offset = 0;
+	struct nl_mmap_hdr *hdr;
+	struct nlmsghdr *nlh;
+	unsigned char buf[16384];
+	ssize_t len;
+
+	while (1) {
+		struct pollfd pfds[1];
+
+		pfds[0].fd	= fd;
+		pfds[0].events	= POLLIN | POLLERR;
+		pfds[0].revents	= 0;
+
+		if (poll(pfds, 1, -1) < 0 && errno != EINTR)
+			exit(1);
+
+		/* Check for errors. Error handling omitted */
+		if (pfds[0].revents & POLLERR)
+			<handle error>
+
+		/* If no new messages, poll again */
+		if (!(pfds[0].revents & POLLIN))
+			continue;
+
+		/* Process all frames */
+		while (1) {
+			/* Get next frame header */
+			hdr = rx_ring + frame_offset;
+
+			if (hdr->nm_status == NL_MMAP_STATUS_VALID) {
+				/* Regular memory mapped frame */
+				nlh = (void *)hdr + NL_MMAP_HDRLEN;
+				len = hdr->nm_len;
+			} else if (hdr->nm_status == NL_MMAP_STATUS_COPY) {
+				/* Frame queued to socket receive queue */
+				len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
+				if (len <= 0)
+					break;
+				nlh = buf;
+			} else
+				/* No more messages to process, continue polling */
+				break;
+
+			process_msg(nlh);
+
+			/* Release frame back to the kernel */
+			hdr->nm_status = NL_MMAP_STATUS_UNUSED;
+
+			/* Advance frame offset to next frame */
+			frame_offset = (frame_offset + frame_size) % ring_size;
+		}
+	}
+
+Message transmission:
+
+This example assumes that the ring parameters from the ring setup above are
+available. A single message is constructed and transmitted; to send multiple
+messages at once, they would be constructed in consecutive frames before a
+final call to sendto().
+
+	unsigned int frame_offset = 0;
+	struct nl_mmap_hdr *hdr;
+	struct nlmsghdr *nlh;
+	struct sockaddr_nl addr = {
+		.nl_family	= AF_NETLINK,
+	};
+
+	hdr = tx_ring + frame_offset;
+	if (hdr->nm_status != NL_MMAP_STATUS_UNUSED)
+		/* No frame available. Use poll() to wait for one instead (see sketch below) */
+		exit(1);
+
+	nlh = (void *)hdr + NL_MMAP_HDRLEN;
+
+	/* Build message */
+	build_message(nlh);
+
+	/* Fill frame header: length and status need to be set */
+	hdr->nm_len	= nlh->nlmsg_len;
+	hdr->nm_status	= NL_MMAP_STATUS_VALID;
+
+	if (sendto(fd, NULL, 0, 0, (struct sockaddr *)&addr, sizeof(addr)) < 0)
+		exit(1);
+
+	/* Advance frame offset to next frame */
+	frame_offset = (frame_offset + frame_size) % ring_size;
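+
+Waiting for a free TX frame:
+
+This is only a sketch, not part of the example above; it assumes the fd,
+tx_ring, frame_offset, frame_size and ring_size variables from the previous
+examples and that poll() reports POLLOUT once a TX frame has been released
+back to user-space:
+
+	struct pollfd pfd = {
+		.fd	= fd,
+		.events	= POLLOUT,
+	};
+
+	hdr = tx_ring + frame_offset;
+	while (hdr->nm_status != NL_MMAP_STATUS_UNUSED) {
+		/* Frame still owned by the kernel, wait until it is released */
+		if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
+			exit(1);
+	}
+
+	/* Frame is now available for message construction */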
-- 
1.7.4.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 11/11] nfnetlink: add support for memory mapped netlink
  2011-09-03 17:26 [PATCH RFC 0/11] netlink: memory mapped I/O kaber
                   ` (9 preceding siblings ...)
  2011-09-03 17:26 ` [PATCH 10/11] netlink: add documentation for memory mapped I/O kaber
@ 2011-09-03 17:26 ` kaber
  2011-09-17  5:48 ` [PATCH RFC 0/11] netlink: memory mapped I/O David Miller
  11 siblings, 0 replies; 21+ messages in thread
From: kaber @ 2011-09-03 17:26 UTC (permalink / raw)
  To: davem; +Cc: netfilter-devel, netdev

From: Patrick McHardy <kaber@trash.net>

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 include/linux/netfilter/nfnetlink.h |    2 ++
 net/netfilter/nfnetlink.c           |    7 +++++++
 net/netfilter/nfnetlink_log.c       |    9 +++++----
 net/netfilter/nfnetlink_queue.c     |    2 +-
 4 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h
index 74d3386..07b48cf 100644
--- a/include/linux/netfilter/nfnetlink.h
+++ b/include/linux/netfilter/nfnetlink.h
@@ -78,6 +78,8 @@ extern int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n);
 extern int nfnetlink_subsys_unregister(const struct nfnetlink_subsystem *n);
 
 extern int nfnetlink_has_listeners(struct net *net, unsigned int group);
+extern struct sk_buff *nfnetlink_alloc_skb(struct net *net, unsigned int size,
+					   u32 dst_pid, gfp_t gfp_mask);
 extern int nfnetlink_send(struct sk_buff *skb, struct net *net, u32 pid, unsigned group,
 			  int echo, gfp_t flags);
 extern int nfnetlink_set_err(struct net *net, u32 pid, u32 group, int error);
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 1905976..de0d9b3 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -103,6 +103,13 @@ int nfnetlink_has_listeners(struct net *net, unsigned int group)
 }
 EXPORT_SYMBOL_GPL(nfnetlink_has_listeners);
 
+struct sk_buff *nfnetlink_alloc_skb(struct net *net, unsigned int size,
+				    u32 dst_pid, gfp_t gfp_mask)
+{
+	return netlink_alloc_skb(net->nfnl, size, dst_pid, gfp_mask);
+}
+EXPORT_SYMBOL_GPL(nfnetlink_alloc_skb);
+
 int nfnetlink_send(struct sk_buff *skb, struct net *net, u32 pid,
 		   unsigned group, int echo, gfp_t flags)
 {
diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index 2d8158a..c37fb0c 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -296,7 +296,7 @@ nfulnl_set_flags(struct nfulnl_instance *inst, u_int16_t flags)
 }
 
 static struct sk_buff *
-nfulnl_alloc_skb(unsigned int inst_size, unsigned int pkt_size)
+nfulnl_alloc_skb(u32 peer_pid, unsigned int inst_size, unsigned int pkt_size)
 {
 	struct sk_buff *skb;
 	unsigned int n;
@@ -305,7 +305,7 @@ nfulnl_alloc_skb(unsigned int inst_size, unsigned int pkt_size)
 	 * message.  WARNING: has to be <= 128k due to slab restrictions */
 
 	n = max(inst_size, pkt_size);
-	skb = alloc_skb(n, GFP_ATOMIC);
+	skb = nfnetlink_alloc_skb(&init_net, n, peer_pid, GFP_ATOMIC);
 	if (!skb) {
 		pr_notice("nfnetlink_log: can't alloc whole buffer (%u bytes)\n",
 			inst_size);
@@ -314,7 +314,8 @@ nfulnl_alloc_skb(unsigned int inst_size, unsigned int pkt_size)
 			/* try to allocate only as much as we need for current
 			 * packet */
 
-			skb = alloc_skb(pkt_size, GFP_ATOMIC);
+			skb = nfnetlink_alloc_skb(&init_net, pkt_size, peer_pid,
+						  GFP_ATOMIC);
 			if (!skb)
 				pr_err("nfnetlink_log: can't even alloc %u "
 				       "bytes\n", pkt_size);
@@ -642,7 +643,7 @@ nfulnl_log_packet(u_int8_t pf,
 	}
 
 	if (!inst->skb) {
-		inst->skb = nfulnl_alloc_skb(inst->nlbufsiz, size);
+		inst->skb = nfulnl_alloc_skb(inst->peer_pid, inst->nlbufsiz, size);
 		if (!inst->skb)
 			goto alloc_failure;
 	}
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 00bd475..2bb29a3 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -266,7 +266,7 @@ nfqnl_build_packet_message(struct nfqnl_instance *queue,
 	}
 
 
-	skb = alloc_skb(size, GFP_ATOMIC);
+	skb = nfnetlink_alloc_skb(&init_net, size, queue->peer_pid, GFP_ATOMIC);
 	if (!skb)
 		goto nlmsg_failure;
 
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 02/11] net: add function to allocate skbuff head without data area
  2011-09-03 17:26 ` [PATCH 02/11] net: add function to allocate skbuff head without data area kaber
@ 2011-09-04  8:12   ` Eric Dumazet
  2011-09-07 15:20     ` Patrick McHardy
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2011-09-04  8:12 UTC (permalink / raw)
  To: kaber; +Cc: davem, netfilter-devel, netdev

On Saturday, 03 September 2011 at 19:26 +0200, kaber@trash.net wrote:

> +struct sk_buff *__alloc_skb_head(gfp_t gfp_mask, int node)
> +{
> +	struct sk_buff *skb;
> +
> +	/* Get the HEAD */
> +	skb = kmem_cache_alloc_node(skbuff_head_cache,
> +				    gfp_mask & ~__GFP_DMA, node);
> +	if (!skb)
> +		goto out;
> +	prefetchw(skb);

Please remove this prefetchw(), since we have no delay between it and
actual memset(skb).

> +
> +	/*
> +	 * Only clear those fields we need to clear, not those that we will
> +	 * actually initialise below. Hence, don't put any more fields after
> +	 * the tail pointer in struct sk_buff!
> +	 */
> +	memset(skb, 0, offsetof(struct sk_buff, tail));




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 08/11] netlink: implement memory mapped sendmsg()
  2011-09-03 17:26 ` [PATCH 08/11] netlink: implement memory mapped sendmsg() kaber
@ 2011-09-04 16:18   ` Michał Mirosław
  2011-09-07 15:22     ` Patrick McHardy
  0 siblings, 1 reply; 21+ messages in thread
From: Michał Mirosław @ 2011-09-04 16:18 UTC (permalink / raw)
  To: kaber; +Cc: davem, netfilter-devel, netdev

On Sat, Sep 03, 2011 at 07:26:08PM +0200, kaber@trash.net wrote:
> From: Patrick McHardy <kaber@trash.net>
> 
> Add support for memory mapped sendmsg() to netlink. Userspace queued to
> be processed frames into the TX ring and invokes sendmsg with
> msg.iov.iov_base = NULL to trigger processing of all pending messages.
> 
> Since the kernel usually performs full message validation before beginning
> processing, userspace must be prevented from modifying the message
> contents while the kernel is processing them. In order to do so, the
> frames contents are copied to an allocated skb in case the the ring is
> mapped more than once or the file descriptor is shared (f.i. through
> AF_UNIX file descriptor passing).
> 
> Otherwise an skb without a data area is allocated, the data pointer set
> to point to the data area of the ring frame and the skb is processed.
> Once the skb is freed, the destructor releases the frame back to userspace
> by setting the status to NL_MMAP_STATUS_UNUSED.

Is this protected from threads? Like: one thread waits on sendmsg() and
another (same process) changes the buffer.

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 02/11] net: add function to allocate skbuff head without data area
  2011-09-04  8:12   ` Eric Dumazet
@ 2011-09-07 15:20     ` Patrick McHardy
  0 siblings, 0 replies; 21+ messages in thread
From: Patrick McHardy @ 2011-09-07 15:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netfilter-devel, netdev

On 04.09.2011 10:12, Eric Dumazet wrote:
> On Saturday, 03 September 2011 at 19:26 +0200, kaber@trash.net wrote:
> 
>> +struct sk_buff *__alloc_skb_head(gfp_t gfp_mask, int node)
>> +{
>> +	struct sk_buff *skb;
>> +
>> +	/* Get the HEAD */
>> +	skb = kmem_cache_alloc_node(skbuff_head_cache,
>> +				    gfp_mask & ~__GFP_DMA, node);
>> +	if (!skb)
>> +		goto out;
>> +	prefetchw(skb);
> 
> Please remove this prefetchw(), since we have no delay between it and
> actual memset(skb).

Done, thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 08/11] netlink: implement memory mapped sendmsg()
  2011-09-04 16:18   ` Michał Mirosław
@ 2011-09-07 15:22     ` Patrick McHardy
  2011-09-07 20:03       ` Michał Mirosław
  0 siblings, 1 reply; 21+ messages in thread
From: Patrick McHardy @ 2011-09-07 15:22 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: davem, netfilter-devel, netdev

On 04.09.2011 18:18, Michał Mirosław wrote:
> On Sat, Sep 03, 2011 at 07:26:08PM +0200, kaber@trash.net wrote:
>> From: Patrick McHardy <kaber@trash.net>
>>
>> Add support for memory mapped sendmsg() to netlink. Userspace queued to
>> be processed frames into the TX ring and invokes sendmsg with
>> msg.iov.iov_base = NULL to trigger processing of all pending messages.
>>
>> Since the kernel usually performs full message validation before beginning
>> processing, userspace must be prevented from modifying the message
>> contents while the kernel is processing them. In order to do so, the
>> frames contents are copied to an allocated skb in case the the ring is
>> mapped more than once or the file descriptor is shared (f.i. through
>> AF_UNIX file descriptor passing).
>>
>> Otherwise an skb without a data area is allocated, the data pointer set
>> to point to the data area of the ring frame and the skb is processed.
>> Once the skb is freed, the destructor releases the frame back to userspace
>> by setting the status to NL_MMAP_STATUS_UNUSED.
> 
> Is this protected from threads? Like: one thread waits on sendmsg() and
> another (same process) changes the buffer.

Yes, if the ring is mapped multiple times (or the file descriptor
is changed), the contents are copied to an allocated skb.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 08/11] netlink: implement memory mapped sendmsg()
  2011-09-07 15:22     ` Patrick McHardy
@ 2011-09-07 20:03       ` Michał Mirosław
  2011-09-08  9:33         ` Patrick McHardy
  0 siblings, 1 reply; 21+ messages in thread
From: Michał Mirosław @ 2011-09-07 20:03 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netfilter-devel, netdev

On Wed, Sep 07, 2011 at 05:22:00PM +0200, Patrick McHardy wrote:
> On 04.09.2011 18:18, Michał Mirosław wrote:
> > On Sat, Sep 03, 2011 at 07:26:08PM +0200, kaber@trash.net wrote:
> >> From: Patrick McHardy <kaber@trash.net>
> >>
> >> Add support for memory mapped sendmsg() to netlink. Userspace queued to
> >> be processed frames into the TX ring and invokes sendmsg with
> >> msg.iov.iov_base = NULL to trigger processing of all pending messages.
> >>
> >> Since the kernel usually performs full message validation before beginning
> >> processing, userspace must be prevented from modifying the message
> >> contents while the kernel is processing them. In order to do so, the
> >> frames contents are copied to an allocated skb in case the the ring is
> >> mapped more than once or the file descriptor is shared (f.i. through
> >> AF_UNIX file descriptor passing).
> >>
> >> Otherwise an skb without a data area is allocated, the data pointer set
> >> to point to the data area of the ring frame and the skb is processed.
> >> Once the skb is freed, the destructor releases the frame back to userspace
> >> by setting the status to NL_MMAP_STATUS_UNUSED.
> > 
> > Is this protected from threads? Like: one thread waits on sendmsg() and
> > another (same process) changes the buffer.
> Yes, if the ring is mapped multiple times (or the file descriptor
> is changed), the contents are copied to an allocated skb.

I mean:

[1] mmap()
[1] fill buffers
[1] pthread_create() [creates: 2]
[1] sendmsg() starts
[2] modify buffers
[1] sendmsg() returns

So: no multiple mmaps, and no touching of the fd. I haven't dug into
filesystem layer to see if threads affect file->f_count, but there
sure are no multiple mappings here.

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 08/11] netlink: implement memory mapped sendmsg()
  2011-09-07 20:03       ` Michał Mirosław
@ 2011-09-08  9:33         ` Patrick McHardy
  2011-09-08 18:08           ` Michał Mirosław
  0 siblings, 1 reply; 21+ messages in thread
From: Patrick McHardy @ 2011-09-08  9:33 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: davem, netfilter-devel, netdev

On 07.09.2011 22:03, Michał Mirosław wrote:
> On Wed, Sep 07, 2011 at 05:22:00PM +0200, Patrick McHardy wrote:
>> On 04.09.2011 18:18, Michał Mirosław wrote:
>>> On Sat, Sep 03, 2011 at 07:26:08PM +0200, kaber@trash.net wrote:
>>>> From: Patrick McHardy <kaber@trash.net>
>>>>
>>>> Add support for memory mapped sendmsg() to netlink. Userspace queued to
>>>> be processed frames into the TX ring and invokes sendmsg with
>>>> msg.iov.iov_base = NULL to trigger processing of all pending messages.
>>>>
>>>> Since the kernel usually performs full message validation before beginning
>>>> processing, userspace must be prevented from modifying the message
>>>> contents while the kernel is processing them. In order to do so, the
>>>> frames contents are copied to an allocated skb in case the the ring is
>>>> mapped more than once or the file descriptor is shared (f.i. through
>>>> AF_UNIX file descriptor passing).
>>>>
>>>> Otherwise an skb without a data area is allocated, the data pointer set
>>>> to point to the data area of the ring frame and the skb is processed.
>>>> Once the skb is freed, the destructor releases the frame back to userspace
>>>> by setting the status to NL_MMAP_STATUS_UNUSED.
>>>
>>> Is this protected from threads? Like: one thread waits on sendmsg() and
>>> another (same process) changes the buffer.
>> Yes, if the ring is mapped multiple times (or the file descriptor
>> is changed), the contents are copied to an allocated skb.
> 
> I mean:
> 
> [1] mmap()
> [1] fill buffers
> [1] pthread_create() [creates: 2]
> [1] sendmsg() starts
> [2] modify buffers
> [1] sendmsg() returns
> 
> So: no multiple mmaps, and no touching of the fd. I haven't dug into
> filesystem layer to see if threads affect file->f_count, but there
> sure are no multiple mappings here.

If CLONE_VM is given to clone(), the mapping is visible in both
threads and thus we have multiple mappings (vma_ops->open() is
invoked through clone()). Without CLONE_VM, the second thread
can't access the ring unless it mmap()s it itself, in case we'd
also have multiple mappings.

The file descriptor check is only meant for the case that
the fd is passed to a second process through AF_UNIX, the
first process invokes sendmsg(), sendmsg() checks for multiple
mappings and the second process invokes mmap() after that.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 08/11] netlink: implement memory mapped sendmsg()
  2011-09-08  9:33         ` Patrick McHardy
@ 2011-09-08 18:08           ` Michał Mirosław
  0 siblings, 0 replies; 21+ messages in thread
From: Michał Mirosław @ 2011-09-08 18:08 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netfilter-devel, netdev

On Thu, Sep 08, 2011 at 11:33:09AM +0200, Patrick McHardy wrote:
> On 07.09.2011 22:03, Michał Mirosław wrote:
> > On Wed, Sep 07, 2011 at 05:22:00PM +0200, Patrick McHardy wrote:
> >> On 04.09.2011 18:18, Michał Mirosław wrote:
> >>> On Sat, Sep 03, 2011 at 07:26:08PM +0200, kaber@trash.net wrote:
> >>>> From: Patrick McHardy <kaber@trash.net>
> >>>>
> >>>> Add support for memory mapped sendmsg() to netlink. Userspace queued to
> >>>> be processed frames into the TX ring and invokes sendmsg with
> >>>> msg.iov.iov_base = NULL to trigger processing of all pending messages.
> >>>>
> >>>> Since the kernel usually performs full message validation before beginning
> >>>> processing, userspace must be prevented from modifying the message
> >>>> contents while the kernel is processing them. In order to do so, the
> >>>> frames contents are copied to an allocated skb in case the the ring is
> >>>> mapped more than once or the file descriptor is shared (f.i. through
> >>>> AF_UNIX file descriptor passing).
> >>>>
> >>>> Otherwise an skb without a data area is allocated, the data pointer set
> >>>> to point to the data area of the ring frame and the skb is processed.
> >>>> Once the skb is freed, the destructor releases the frame back to userspace
> >>>> by setting the status to NL_MMAP_STATUS_UNUSED.
> >>>
> >>> Is this protected from threads? Like: one thread waits on sendmsg() and
> >>> another (same process) changes the buffer.
> >> Yes, if the ring is mapped multiple times (or the file descriptor
> >> is changed), the contents are copied to an allocated skb.
> > 
> > I mean:
> > 
> > [1] mmap()
> > [1] fill buffers
> > [1] pthread_create() [creates: 2]
> > [1] sendmsg() starts
> > [2] modify buffers
> > [1] sendmsg() returns
> > 
> > So: no multiple mmaps, and no touching of the fd. I haven't dug into
> > filesystem layer to see if threads affect file->f_count, but there
> > sure are no multiple mappings here.
> If CLONE_VM is given to clone(), the mapping is visible in both
> threads and thus we have multiple mappings (vma_ops->open() is
> invoked through clone()). Without CLONE_VM, the second thread
> can't access the ring unless it mmap()s it itself, in case we'd
> also have multiple mappings.

I took a quick look at kernel/fork.c, and it looks to me that if CLONE_VM
is set, then vma->open() is actually avoided --- it's called via dup_mm()
and dup_mmap() only if CLONE_VM is not set and the VMA needs to be copied.

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC 0/11] netlink: memory mapped I/O
  2011-09-03 17:26 [PATCH RFC 0/11] netlink: memory mapped I/O kaber
                   ` (10 preceding siblings ...)
  2011-09-03 17:26 ` [PATCH 11/11] nfnetlink: add support for memory mapped netlink kaber
@ 2011-09-17  5:48 ` David Miller
  11 siblings, 0 replies; 21+ messages in thread
From: David Miller @ 2011-09-17  5:48 UTC (permalink / raw)
  To: kaber; +Cc: netfilter-devel, netdev

From: kaber@trash.net
Date: Sat,  3 Sep 2011 19:26:00 +0200

> Comments and review highly welcome!

Looks great to me, I look forward to integrating the final version
of this stuff.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 01/11] netlink: add symbolic value for congested state
  2012-08-20  6:18 [PATCH 00/11] " Patrick McHardy
@ 2012-08-20  6:18 ` Patrick McHardy
  0 siblings, 0 replies; 21+ messages in thread
From: Patrick McHardy @ 2012-08-20  6:18 UTC (permalink / raw)
  To: Florian.Westphal; +Cc: netdev, netfilter-devel

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/netlink/af_netlink.c |   18 +++++++++++-------
 1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 5463969..1bdfa52 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -89,6 +89,10 @@ struct listeners {
 	unsigned long		masks[0];
 };
 
+/* state bits */
+#define NETLINK_CONGESTED	0x0
+
+/* flags */
 #define NETLINK_KERNEL_SOCKET	0x1
 #define NETLINK_RECV_PKTINFO	0x2
 #define NETLINK_BROADCAST_SEND_ERROR	0x4
@@ -762,7 +766,7 @@ static void netlink_overrun(struct sock *sk)
 	struct netlink_sock *nlk = nlk_sk(sk);
 
 	if (!(nlk->flags & NETLINK_RECV_NO_ENOBUFS)) {
-		if (!test_and_set_bit(0, &nlk_sk(sk)->state)) {
+		if (!test_and_set_bit(NETLINK_CONGESTED, &nlk_sk(sk)->state)) {
 			sk->sk_err = ENOBUFS;
 			sk->sk_error_report(sk);
 		}
@@ -823,7 +827,7 @@ int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
 	nlk = nlk_sk(sk);
 
 	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-	    test_bit(0, &nlk->state)) {
+	    test_bit(NETLINK_CONGESTED, &nlk->state)) {
 		DECLARE_WAITQUEUE(wait, current);
 		if (!*timeo) {
 			if (!ssk || netlink_is_kernel(ssk))
@@ -837,7 +841,7 @@ int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
 		add_wait_queue(&nlk->wait, &wait);
 
 		if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-		     test_bit(0, &nlk->state)) &&
+		     test_bit(NETLINK_CONGESTED, &nlk->state)) &&
 		    !sock_flag(sk, SOCK_DEAD))
 			*timeo = schedule_timeout(*timeo);
 
@@ -907,8 +911,8 @@ static void netlink_rcv_wake(struct sock *sk)
 	struct netlink_sock *nlk = nlk_sk(sk);
 
 	if (skb_queue_empty(&sk->sk_receive_queue))
-		clear_bit(0, &nlk->state);
-	if (!test_bit(0, &nlk->state))
+		clear_bit(NETLINK_CONGESTED, &nlk->state);
+	if (!test_bit(NETLINK_CONGESTED, &nlk->state))
 		wake_up_interruptible(&nlk->wait);
 }
 
@@ -990,7 +994,7 @@ static int netlink_broadcast_deliver(struct sock *sk, struct sk_buff *skb)
 	struct netlink_sock *nlk = nlk_sk(sk);
 
 	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
-	    !test_bit(0, &nlk->state)) {
+	    !test_bit(NETLINK_CONGESTED, &nlk->state)) {
 		skb_set_owner_r(skb, sk);
 		__netlink_sendskb(sk, skb);
 		return atomic_read(&sk->sk_rmem_alloc) > (sk->sk_rcvbuf >> 1);
@@ -1270,7 +1274,7 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
 	case NETLINK_NO_ENOBUFS:
 		if (val) {
 			nlk->flags |= NETLINK_RECV_NO_ENOBUFS;
-			clear_bit(0, &nlk->state);
+			clear_bit(NETLINK_CONGESTED, &nlk->state);
 			wake_up_interruptible(&nlk->wait);
 		} else {
 			nlk->flags &= ~NETLINK_RECV_NO_ENOBUFS;
-- 
1.7.7.6


^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2012-08-20  6:18 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-03 17:26 [PATCH RFC 0/11] netlink: memory mapped I/O kaber
2011-09-03 17:26 ` [PATCH 01/11] netlink: add symbolic value for congested state kaber
2011-09-03 17:26 ` [PATCH 02/11] net: add function to allocate skbuff head without data area kaber
2011-09-04  8:12   ` Eric Dumazet
2011-09-07 15:20     ` Patrick McHardy
2011-09-03 17:26 ` [PATCH 03/11] netlink: add helper function for queueing skbs to the receive queue kaber
2011-09-03 17:26 ` [PATCH 04/11] netlink: don't orphan skb in netlink_trim() kaber
2011-09-03 17:26 ` [PATCH 05/11] netlink: add netlink_skb_set_owner_r() kaber
2011-09-03 17:26 ` [PATCH 06/11] netlink: memory mapped netlink: ring setup kaber
2011-09-03 17:26 ` [PATCH 07/11] netlink: add memory mapped netlink helper functions kaber
2011-09-03 17:26 ` [PATCH 08/11] netlink: implement memory mapped sendmsg() kaber
2011-09-04 16:18   ` Michał Mirosław
2011-09-07 15:22     ` Patrick McHardy
2011-09-07 20:03       ` Michał Mirosław
2011-09-08  9:33         ` Patrick McHardy
2011-09-08 18:08           ` Michał Mirosław
2011-09-03 17:26 ` [PATCH 09/11] netlink: implement memory mapped recvmsg() kaber
2011-09-03 17:26 ` [PATCH 10/11] netlink: add documentation for memory mapped I/O kaber
2011-09-03 17:26 ` [PATCH 11/11] nfnetlink: add support for memory mapped netlink kaber
2011-09-17  5:48 ` [PATCH RFC 0/11] netlink: memory mapped I/O David Miller
2012-08-20  6:18 [PATCH 00/11] " Patrick McHardy
2012-08-20  6:18 ` [PATCH 01/11] netlink: add symbolic value for congested state Patrick McHardy
