* [PATCH net-next v2 0/6] packet: refine rollover
From: Willem de Bruijn @ 2015-05-09 21:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, david.laight, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Refine packet socket rollover:

1. mitigate a case of lock contention
2. avoid exporting resource exhaustion to other sockets,
   by migrating only to a victim socket that has ample room
3. avoid reordering of most flows on the socket,
   by migrating first the flow responsible for load imbalance
4. help processes detect load imbalance,
   by exporting rollover counters

Context: rollover implements flow migration in packet socket fanout
groups in case of extreme load imbalance. It is a specific
implementation of migration that minimizes reordering by selecting
the same victim socket when possible (and by selecting subsequent
victims in a round robin fashion, from which its name derives).
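
For illustration, a minimal userspace sketch of joining a fanout group
with the rollover flag set (the group id is a placeholder and error
handling is omitted):

  #include <stdint.h>
  #include <arpa/inet.h>
  #include <sys/socket.h>
  #include <linux/if_ether.h>
  #include <linux/if_packet.h>

  /* Sketch: open a packet socket and join fanout group 42 in hash
   * mode, with rollover as fallback on load imbalance.
   */
  static int open_fanout_rollover_socket(void)
  {
          int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
          uint16_t id = 42;       /* placeholder fanout group id */
          uint16_t type_flags = PACKET_FANOUT_HASH |
                                PACKET_FANOUT_FLAG_ROLLOVER;
          int fanout_arg = id | (type_flags << 16);

          setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                     &fanout_arg, sizeof(fanout_arg));
          return fd;
  }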

Changes:
  v1 -> v2:
    - huge flow detection: run lockless
    - huge flow detection: replace stored index with random
    - contention avoidance: test in packet_poll while lock held
    - contention avoidance: clear pressure sooner

          packet_poll and packet_recvmsg would clear the pressure flag
          only if the sock is empty, to avoid taking the necessary
          lock. But,
          * packet_poll already holds this lock, so a lockless variant
            __packet_rcv_has_room is cheap.
          * packet_recvmsg is usually called only for non-ring sockets,
            for which the check also runs lockless.

    - preparation: drop "single return" patch

          packet_rcv_has_room is now a locked wrapper around
          __packet_rcv_has_room, achieving the same (single footer).

Willem de Bruijn (6):
  packet: rollover prepare: move code out of callsites
  packet: rollover prepare: per-socket state
  packet: rollover only to socket with headroom
  packet: rollover lock contention avoidance
  packet: rollover huge flows before small flows
  packet: rollover statistics

 include/uapi/linux/if_packet.h |   7 ++
 net/packet/af_packet.c         | 172 ++++++++++++++++++++++++++++++++++-------
 net/packet/internal.h          |  12 ++-
 3 files changed, 163 insertions(+), 28 deletions(-)

-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next v2 1/6] packet: rollover prepare: move code out of callsites
From: Willem de Bruijn @ 2015-05-09 21:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, david.laight, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover
logic into the callee to simplify these callsites, especially with
upcoming changes.

The main difference between the two callsites is that the FLAG
variant tests whether the socket previously selected by another
mode (RR, RND, HASH, ...) has room before migrating flows, whereas
the rollover mode has no original socket to test.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 5102c3c..56f9c52 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1318,18 +1318,22 @@ static unsigned int fanout_demux_rnd(struct packet_fanout *f,
 
 static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 					  struct sk_buff *skb,
-					  unsigned int idx, unsigned int skip,
+					  unsigned int idx, bool try_self,
 					  unsigned int num)
 {
 	unsigned int i, j;
 
+	if (try_self && packet_rcv_has_room(pkt_sk(f->arr[idx]), skb))
+		return idx;
+
 	i = j = min_t(int, f->next[idx], num - 1);
 	do {
-		if (i != skip && packet_rcv_has_room(pkt_sk(f->arr[i]), skb)) {
+		if (i != idx && packet_rcv_has_room(pkt_sk(f->arr[i]), skb)) {
 			if (i != j)
 				f->next[idx] = i;
 			return i;
 		}
+
 		if (++i == num)
 			i = 0;
 	} while (i != j);
@@ -1386,17 +1390,14 @@ static int packet_rcv_fanout(struct sk_buff *skb, struct net_device *dev,
 		idx = fanout_demux_qm(f, skb, num);
 		break;
 	case PACKET_FANOUT_ROLLOVER:
-		idx = fanout_demux_rollover(f, skb, 0, (unsigned int) -1, num);
+		idx = fanout_demux_rollover(f, skb, 0, false, num);
 		break;
 	}
 
-	po = pkt_sk(f->arr[idx]);
-	if (fanout_has_flag(f, PACKET_FANOUT_FLAG_ROLLOVER) &&
-	    unlikely(!packet_rcv_has_room(po, skb))) {
-		idx = fanout_demux_rollover(f, skb, idx, idx, num);
-		po = pkt_sk(f->arr[idx]);
-	}
+	if (fanout_has_flag(f, PACKET_FANOUT_FLAG_ROLLOVER))
+		idx = fanout_demux_rollover(f, skb, idx, true, num);
 
+	po = pkt_sk(f->arr[idx]);
 	return po->prot_hook.func(skb, dev, &po->prot_hook, orig_dev);
 }
 
-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next v2 2/6] packet: rollover prepare: per-socket state
From: Willem de Bruijn @ 2015-05-09 21:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, david.laight, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Replace rollover state per fanout group with state per socket. Future
patches will add fields to the new structure.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 21 ++++++++++++++++++---
 net/packet/internal.h  |  6 +++++-
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 56f9c52..f8ec909 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1321,16 +1321,18 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 					  unsigned int idx, bool try_self,
 					  unsigned int num)
 {
+	struct packet_sock *po;
 	unsigned int i, j;
 
-	if (try_self && packet_rcv_has_room(pkt_sk(f->arr[idx]), skb))
+	po = pkt_sk(f->arr[idx]);
+	if (try_self && packet_rcv_has_room(po, skb))
 		return idx;
 
-	i = j = min_t(int, f->next[idx], num - 1);
+	i = j = min_t(int, po->rollover->sock, num - 1);
 	do {
 		if (i != idx && packet_rcv_has_room(pkt_sk(f->arr[i]), skb)) {
 			if (i != j)
-				f->next[idx] = i;
+				po->rollover->sock = i;
 			return i;
 		}
 
@@ -1468,6 +1470,12 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
 	if (po->fanout)
 		return -EALREADY;
 
+	if (type_flags & PACKET_FANOUT_FLAG_ROLLOVER) {
+		po->rollover = kzalloc(sizeof(*po->rollover), GFP_KERNEL);
+		if (!po->rollover)
+			return -ENOMEM;
+	}
+
 	mutex_lock(&fanout_mutex);
 	match = NULL;
 	list_for_each_entry(f, &fanout_list, list) {
@@ -1516,6 +1524,10 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
 	}
 out:
 	mutex_unlock(&fanout_mutex);
+	if (err) {
+		kfree(po->rollover);
+		po->rollover = NULL;
+	}
 	return err;
 }
 
@@ -1537,6 +1549,8 @@ static void fanout_release(struct sock *sk)
 		kfree(f);
 	}
 	mutex_unlock(&fanout_mutex);
+
+	kfree(po->rollover);
 }
 
 static const struct proto_ops packet_ops;
@@ -2863,6 +2877,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 
 	spin_lock_init(&po->bind_lock);
 	mutex_init(&po->pg_vec_lock);
+	po->rollover = NULL;
 	po->prot_hook.func = packet_rcv;
 
 	if (sock->type == SOCK_PACKET)
diff --git a/net/packet/internal.h b/net/packet/internal.h
index fe6e20c..a9d33a2 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -82,12 +82,15 @@ struct packet_fanout {
 	atomic_t		rr_cur;
 	struct list_head	list;
 	struct sock		*arr[PACKET_FANOUT_MAX];
-	int			next[PACKET_FANOUT_MAX];
 	spinlock_t		lock;
 	atomic_t		sk_ref;
 	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
 };
 
+struct packet_rollover {
+	int			sock;
+} ____cacheline_aligned_in_smp;
+
 struct packet_sock {
 	/* struct sock has to be the first member of packet_sock */
 	struct sock		sk;
@@ -104,6 +107,7 @@ struct packet_sock {
 				has_vnet_hdr:1;
 	int			ifindex;	/* bound device		*/
 	__be16			num;
+	struct packet_rollover	*rollover;
 	struct packet_mclist	*mclist;
 	atomic_t		mapped;
 	enum tpacket_versions	tp_version;
-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next v2 3/6] packet: rollover only to socket with headroom
From: Willem de Bruijn @ 2015-05-09 21:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, david.laight, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Only migrate flows to sockets that have sufficient headroom, where
sufficient is defined as having at least 25% empty space.

The kernel has three different buffer types: a regular socket, a ring
with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The
latter two do not expose a read pointer to the kernel, so headroom
cannot be computed easily. Each of the three needs a different
implementation to estimate free space.

Tested:
  Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

  bench_rollover has as many sockets as there are NIC receive queues
  in the system. Each socket is owned by a process that is pinned to
  one of the receive cpus. RFS is disabled. RPS is enabled with an
  identity mapping (cpu x -> cpu x), to count drops with softnettop.

    lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s
    Press [Enter] to exit

    cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
      0         16         16          0          0          0          0
      1         21         21          0          0          0          0
      2    5227502    5227502          0          0          0          0
      3         18         18          0          0          0          0
      4    6083289    6083289          0    5227496          0          0
      5         22         22          0          0          0          0
      6         21         21          0          0          0          0
      7          9          9          0          0          0          0

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 74 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 58 insertions(+), 16 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index f8ec909..fb421a8 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1234,27 +1234,68 @@ static void packet_free_pending(struct packet_sock *po)
 	free_percpu(po->tx_ring.pending_refcnt);
 }
 
-static bool packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
+#define ROOM_POW_OFF	2
+#define ROOM_NONE	0x0
+#define ROOM_LOW	0x1
+#define ROOM_NORMAL	0x2
+
+static bool __tpacket_has_room(struct packet_sock *po, int pow_off)
+{
+	int idx, len;
+
+	len = po->rx_ring.frame_max + 1;
+	idx = po->rx_ring.head;
+	if (pow_off)
+		idx += len >> pow_off;
+	if (idx >= len)
+		idx -= len;
+	return packet_lookup_frame(po, &po->rx_ring, idx, TP_STATUS_KERNEL);
+}
+
+static bool __tpacket_v3_has_room(struct packet_sock *po, int pow_off)
+{
+	int idx, len;
+
+	len = po->rx_ring.prb_bdqc.knum_blocks;
+	idx = po->rx_ring.prb_bdqc.kactive_blk_num;
+	if (pow_off)
+		idx += len >> pow_off;
+	if (idx >= len)
+		idx -= len;
+	return prb_lookup_block(po, &po->rx_ring, idx, TP_STATUS_KERNEL);
+}
+
+static int packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
 {
 	struct sock *sk = &po->sk;
-	bool has_room;
+	int ret = ROOM_NONE;
 
-	if (po->prot_hook.func != tpacket_rcv)
-		return (atomic_read(&sk->sk_rmem_alloc) + skb->truesize)
-			<= sk->sk_rcvbuf;
+	if (po->prot_hook.func != tpacket_rcv) {
+		int avail = sk->sk_rcvbuf - atomic_read(&sk->sk_rmem_alloc)
+					  - skb->truesize;
+		if (avail > (sk->sk_rcvbuf >> ROOM_POW_OFF))
+			return ROOM_NORMAL;
+		else if (avail > 0)
+			return ROOM_LOW;
+		else
+			return ROOM_NONE;
+	}
 
 	spin_lock(&sk->sk_receive_queue.lock);
-	if (po->tp_version == TPACKET_V3)
-		has_room = prb_lookup_block(po, &po->rx_ring,
-					    po->rx_ring.prb_bdqc.kactive_blk_num,
-					    TP_STATUS_KERNEL);
-	else
-		has_room = packet_lookup_frame(po, &po->rx_ring,
-					       po->rx_ring.head,
-					       TP_STATUS_KERNEL);
+	if (po->tp_version == TPACKET_V3) {
+		if (__tpacket_v3_has_room(po, ROOM_POW_OFF))
+			ret = ROOM_NORMAL;
+		else if (__tpacket_v3_has_room(po, 0))
+			ret = ROOM_LOW;
+	} else {
+		if (__tpacket_has_room(po, ROOM_POW_OFF))
+			ret = ROOM_NORMAL;
+		else if (__tpacket_has_room(po, 0))
+			ret = ROOM_LOW;
+	}
 	spin_unlock(&sk->sk_receive_queue.lock);
 
-	return has_room;
+	return ret;
 }
 
 static void packet_sock_destruct(struct sock *sk)
@@ -1325,12 +1366,13 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 	unsigned int i, j;
 
 	po = pkt_sk(f->arr[idx]);
-	if (try_self && packet_rcv_has_room(po, skb))
+	if (try_self && packet_rcv_has_room(po, skb) != ROOM_NONE)
 		return idx;
 
 	i = j = min_t(int, po->rollover->sock, num - 1);
 	do {
-		if (i != idx && packet_rcv_has_room(pkt_sk(f->arr[i]), skb)) {
+		if (i != idx &&
+		    packet_rcv_has_room(pkt_sk(f->arr[i]), skb) == ROOM_NORMAL) {
 			if (i != j)
 				po->rollover->sock = i;
 			return i;
-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next v2 4/6] packet: rollover lock contention avoidance
From: Willem de Bruijn @ 2015-05-09 21:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, david.laight, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Rollover has to call packet_rcv_has_room on sockets in the fanout
group to find a socket to migrate to. This operation is expensive,
especially if the packet sockets use rings, in which case a lock has
to be acquired.

Avoid having all sockets pound on the lock by temporarily marking a
socket as "under memory pressure" when such pressure is detected.
While the flag is set, only the socket owner may call
packet_rcv_has_room on the socket; it clears the flag once it detects
normal conditions. In the meantime, the socket is not used as a
rollover victim by any other socket in the group.

Under reasonably balanced load, each socket writer frequently calls
packet_rcv_has_room and clears its own pressure field. As a backup
for when the socket is rarely written to, also clear the flag on
reading (packet_recvmsg, packet_poll) if this can be done cheaply
(i.e., without calling packet_rcv_has_room). This is only for
edge cases.

Tested:
  Ran bench_rollover: a process with 8 sockets in a single fanout
  group, each pinned to a single cpu that receives one nic recv
  interrupt. RPS and RFS are disabled. The benchmark uses packet
  rx_ring, which has to take a lock when determining whether a
  socket has room.

  Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
  uniformly across the packet sockets (and inserted an iptables
  rule to drop in PREROUTING to avoid protocol stack processing).

  Without this patch, all sockets try to migrate traffic to
  neighbors, causing lock contention when searching for a neighbor
  with room. This lock accounts for the top 9 entries in the profile.

    perf record -a -g sleep 5

    -  17.82%   bench_rollover  [kernel.kallsyms]    [k] _raw_spin_lock
       - _raw_spin_lock
          - 99.00% spin_lock
    	 + 81.77% packet_rcv_has_room.isra.41
    	 + 18.23% tpacket_rcv
          + 0.84% packet_rcv_has_room.isra.41
    +   5.20%      ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.15%      ksoftirqd/1  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.14%      ksoftirqd/2  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.12%      ksoftirqd/7  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.12%      ksoftirqd/5  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.10%      ksoftirqd/4  [kernel.kallsyms]    [k] _raw_spin_lock
    +   4.66%      ksoftirqd/0  [kernel.kallsyms]    [k] _raw_spin_lock
    +   4.45%      ksoftirqd/3  [kernel.kallsyms]    [k] _raw_spin_lock
    +   1.55%   bench_rollover  [kernel.kallsyms]    [k] packet_rcv_has_room.isra.41

  On net-next with this patch, this lock contention is no longer a
  top entry. Most time is spent in the actual read function. The next
  entries are copies and receive-path processing:

    +  15.52%  bench_rollover  bench_rollover     [.] reader
    +   4.68%         swapper  [kernel.kallsyms]  [k] memcpy_erms
    +   2.77%         swapper  [kernel.kallsyms]  [k] packet_lookup_frame.isra.51
    +   2.56%     ksoftirqd/1  [kernel.kallsyms]  [k] memcpy_erms
    +   2.16%         swapper  [kernel.kallsyms]  [k] tpacket_rcv
    +   1.93%         swapper  [kernel.kallsyms]  [k] mlx4_en_process_rx_cq

  Looking closer at the remaining _raw_spin_lock, the cost of probing
  in rollover is now comparable to the cost of taking the lock later
  in tpacket_rcv.

    -   1.51%         swapper  [kernel.kallsyms]  [k] _raw_spin_lock
       - _raw_spin_lock
          + 33.41% packet_rcv_has_room
          + 28.15% tpacket_rcv
          + 19.54% enqueue_to_backlog
          + 6.45% __free_pages_ok
          + 2.78% packet_rcv_fanout
          + 2.13% fanout_demux_rollover
          + 2.01% netif_receive_skb_internal

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 38 +++++++++++++++++++++++++++++++-------
 net/packet/internal.h  |  1 +
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index fb421a8..e6b252a 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1265,14 +1265,14 @@ static bool __tpacket_v3_has_room(struct packet_sock *po, int pow_off)
 	return prb_lookup_block(po, &po->rx_ring, idx, TP_STATUS_KERNEL);
 }
 
-static int packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
+static int __packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
 {
 	struct sock *sk = &po->sk;
 	int ret = ROOM_NONE;
 
 	if (po->prot_hook.func != tpacket_rcv) {
 		int avail = sk->sk_rcvbuf - atomic_read(&sk->sk_rmem_alloc)
-					  - skb->truesize;
+					  - (skb ? skb->truesize : 0);
 		if (avail > (sk->sk_rcvbuf >> ROOM_POW_OFF))
 			return ROOM_NORMAL;
 		else if (avail > 0)
@@ -1281,7 +1281,6 @@ static int packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
 			return ROOM_NONE;
 	}
 
-	spin_lock(&sk->sk_receive_queue.lock);
 	if (po->tp_version == TPACKET_V3) {
 		if (__tpacket_v3_has_room(po, ROOM_POW_OFF))
 			ret = ROOM_NORMAL;
@@ -1293,7 +1292,26 @@ static int packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
 		else if (__tpacket_has_room(po, 0))
 			ret = ROOM_LOW;
 	}
-	spin_unlock(&sk->sk_receive_queue.lock);
+
+	return ret;
+}
+
+static int packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
+{
+	int ret;
+	bool has_room;
+
+	if (po->prot_hook.func == tpacket_rcv) {
+		spin_lock(&po->sk.sk_receive_queue.lock);
+		ret = __packet_rcv_has_room(po, skb);
+		spin_unlock(&po->sk.sk_receive_queue.lock);
+	} else {
+		ret = __packet_rcv_has_room(po, skb);
+	}
+
+	has_room = ret == ROOM_NORMAL;
+	if (po->pressure == has_room)
+		xchg(&po->pressure, !has_room);
 
 	return ret;
 }
@@ -1362,7 +1380,7 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 					  unsigned int idx, bool try_self,
 					  unsigned int num)
 {
-	struct packet_sock *po;
+	struct packet_sock *po, *po_next;
 	unsigned int i, j;
 
 	po = pkt_sk(f->arr[idx]);
@@ -1371,8 +1389,9 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 
 	i = j = min_t(int, po->rollover->sock, num - 1);
 	do {
-		if (i != idx &&
-		    packet_rcv_has_room(pkt_sk(f->arr[i]), skb) == ROOM_NORMAL) {
+		po_next = pkt_sk(f->arr[i]);
+		if (po_next != po && !po_next->pressure &&
+		    packet_rcv_has_room(po_next, skb) == ROOM_NORMAL) {
 			if (i != j)
 				po->rollover->sock = i;
 			return i;
@@ -2997,6 +3016,9 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 	if (skb == NULL)
 		goto out;
 
+	if (pkt_sk(sk)->pressure)
+		packet_rcv_has_room(pkt_sk(sk), NULL);
+
 	if (pkt_sk(sk)->has_vnet_hdr) {
 		struct virtio_net_hdr vnet_hdr = { 0 };
 
@@ -3752,6 +3774,8 @@ static unsigned int packet_poll(struct file *file, struct socket *sock,
 			TP_STATUS_KERNEL))
 			mask |= POLLIN | POLLRDNORM;
 	}
+	if (po->pressure && __packet_rcv_has_room(po, NULL) == ROOM_NORMAL)
+		xchg(&po->pressure, 0);
 	spin_unlock_bh(&sk->sk_receive_queue.lock);
 	spin_lock_bh(&sk->sk_write_queue.lock);
 	if (po->tx_ring.pg_vec) {
diff --git a/net/packet/internal.h b/net/packet/internal.h
index a9d33a2..22d7d77 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -105,6 +105,7 @@ struct packet_sock {
 				auxdata:1,
 				origdev:1,
 				has_vnet_hdr:1;
+	int			pressure;
 	int			ifindex;	/* bound device		*/
 	__be16			num;
 	struct packet_rollover	*rollover;
-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next v2 5/6] packet: rollover huge flows before small flows
From: Willem de Bruijn @ 2015-05-09 21:25 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, david.laight, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Migrate flows from one socket to another within the fanout group not
only when a socket is full. Start migrating huge flows early, to
divert possible 4-tuple attacks without affecting normal traffic.

Introduce fanout_flow_is_huge(). This detects huge flows, which are
defined as taking up more than half the load. It does so cheaply, by
storing the rxhashes of the N most recent packets. If over half of
these match the rxhash of the current packet, the flow is classified
as huge and rolled over early. This only protects against 4-tuple
attacks. N is chosen to fit all data in a single cache line (16
entries with 64-byte cache lines).

Tested:
  Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

    lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
    cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
      0         14         14          0          0          0          0
      1         20         20          0          0          0          0
      2         16         16          0          0          0          0
      3    6168824    6168824          0    4867721    4867721          0
      4    4867741    4867741          0          0          0          0
      5         12         12          0          0          0          0
      6         15         15          0          0          0          0
      7         17         17          0          0          0          0

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 25 ++++++++++++++++++++++---
 net/packet/internal.h  |  2 ++
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index e6b252a..0755203 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1341,6 +1341,20 @@ static int fanout_rr_next(struct packet_fanout *f, unsigned int num)
 	return x;
 }
 
+static bool fanout_flow_is_huge(struct packet_sock *po, struct sk_buff *skb)
+{
+	u32 rxhash;
+	int i, count = 0;
+
+	rxhash = skb_get_hash(skb);
+	for (i = 0; i < ROLLOVER_HLEN; i++)
+		if (po->rollover->history[i] == rxhash)
+			count++;
+
+	po->rollover->history[prandom_u32() % ROLLOVER_HLEN] = rxhash;
+	return count > (ROLLOVER_HLEN >> 1);
+}
+
 static unsigned int fanout_demux_hash(struct packet_fanout *f,
 				      struct sk_buff *skb,
 				      unsigned int num)
@@ -1381,11 +1395,16 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 					  unsigned int num)
 {
 	struct packet_sock *po, *po_next;
-	unsigned int i, j;
+	unsigned int i, j, room;
 
 	po = pkt_sk(f->arr[idx]);
-	if (try_self && packet_rcv_has_room(po, skb) != ROOM_NONE)
-		return idx;
+
+	if (try_self) {
+		room = packet_rcv_has_room(po, skb);
+		if (room == ROOM_NORMAL ||
+		    (room == ROOM_LOW && !fanout_flow_is_huge(po, skb)))
+			return idx;
+	}
 
 	i = j = min_t(int, po->rollover->sock, num - 1);
 	do {
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 22d7d77..a9d30a1 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -89,6 +89,8 @@ struct packet_fanout {
 
 struct packet_rollover {
 	int			sock;
+#define ROLLOVER_HLEN	(L1_CACHE_BYTES / sizeof(u32))
+	u32			history[ROLLOVER_HLEN] ____cacheline_aligned;
 } ____cacheline_aligned_in_smp;
 
 struct packet_sock {
-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next v2 6/6] packet: rollover statistics
From: Willem de Bruijn @ 2015-05-09 21:25 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, david.laight, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Rollover indicates exceptional conditions. Export a counter to inform
socket owners of this state.

If no socket with sufficient room is found, rollover fails. Also count
these events.

Finally, also count when flows are rolled over early thanks to huge
flow detection, to validate its correctness.

Tested:
  Read counters in bench_rollover while running the other tests in the patchset

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/uapi/linux/if_packet.h |  7 +++++++
 net/packet/af_packet.c         | 19 ++++++++++++++++++-
 net/packet/internal.h          |  3 +++
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index 053bd10..52b6501 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -54,6 +54,7 @@ struct sockaddr_ll {
 #define PACKET_FANOUT			18
 #define PACKET_TX_HAS_OFF		19
 #define PACKET_QDISC_BYPASS		20
+#define PACKET_ROLLOVER_STATS		21
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
@@ -75,6 +76,12 @@ struct tpacket_stats_v3 {
 	unsigned int	tp_freeze_q_cnt;
 };
 
+struct tpacket_rollover_stats {
+	unsigned long	tp_all;
+	unsigned long	tp_huge;
+	unsigned long	tp_failed;
+};
+
 union tpacket_stats_u {
 	struct tpacket_stats stats1;
 	struct tpacket_stats_v3 stats3;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 0755203..727409c 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1395,7 +1395,7 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 					  unsigned int num)
 {
 	struct packet_sock *po, *po_next;
-	unsigned int i, j, room;
+	unsigned int i, j, room = ROOM_NONE;
 
 	po = pkt_sk(f->arr[idx]);
 
@@ -1413,6 +1413,9 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 		    packet_rcv_has_room(po_next, skb) == ROOM_NORMAL) {
 			if (i != j)
 				po->rollover->sock = i;
+			atomic_long_inc(&po->rollover->num);
+			if (room == ROOM_LOW)
+				atomic_long_inc(&po->rollover->num_huge);
 			return i;
 		}
 
@@ -1420,6 +1423,7 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 			i = 0;
 	} while (i != j);
 
+	atomic_long_inc(&po->rollover->num_failed);
 	return idx;
 }
 
@@ -1554,6 +1558,9 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
 		po->rollover = kzalloc(sizeof(*po->rollover), GFP_KERNEL);
 		if (!po->rollover)
 			return -ENOMEM;
+		atomic_long_set(&po->rollover->num, 0);
+		atomic_long_set(&po->rollover->num_huge, 0);
+		atomic_long_set(&po->rollover->num_failed, 0);
 	}
 
 	mutex_lock(&fanout_mutex);
@@ -3581,6 +3588,7 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	struct packet_sock *po = pkt_sk(sk);
 	void *data = &val;
 	union tpacket_stats_u st;
+	struct tpacket_rollover_stats rstats;
 
 	if (level != SOL_PACKET)
 		return -ENOPROTOOPT;
@@ -3656,6 +3664,15 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 			((u32)po->fanout->flags << 24)) :
 		       0);
 		break;
+	case PACKET_ROLLOVER_STATS:
+		if (!po->rollover)
+			return -EINVAL;
+		rstats.tp_all = atomic_long_read(&po->rollover->num);
+		rstats.tp_huge = atomic_long_read(&po->rollover->num_huge);
+		rstats.tp_failed = atomic_long_read(&po->rollover->num_failed);
+		data = &rstats;
+		lv = sizeof(rstats);
+		break;
 	case PACKET_TX_HAS_OFF:
 		val = po->tp_tx_has_off;
 		break;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index a9d30a1..c035d26 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -89,6 +89,9 @@ struct packet_fanout {
 
 struct packet_rollover {
 	int			sock;
+	atomic_long_t		num;
+	atomic_long_t		num_huge;
+	atomic_long_t		num_failed;
 #define ROLLOVER_HLEN	(L1_CACHE_BYTES / sizeof(u32))
 	u32			history[ROLLOVER_HLEN] ____cacheline_aligned;
 } ____cacheline_aligned_in_smp;
-- 
2.2.0.rc0.207.ga3a616c


* Re: [PATCH net-next v2 6/6] packet: rollover statistics
From: David Miller @ 2015-05-11 17:09 UTC (permalink / raw)
  To: willemb; +Cc: netdev, eric.dumazet, david.laight

From: Willem de Bruijn <willemb@google.com>
Date: Sat,  9 May 2015 17:25:01 -0400

> From: Willem de Bruijn <willemb@google.com>
> 
> Rollover indicates exceptional conditions. Export a counter to inform
> socket owners of this state.
> 
> If no socket with sufficient room is found, rollover fails. Also count
> these events.
> 
> Finally, also count when flows are rolled over early thanks to huge
> flow detection, to validate its correctness.
> 
> Tested:
>   Read counters in bench_rollover on all other tests in the patchset
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  include/uapi/linux/if_packet.h |  7 +++++++
>  net/packet/af_packet.c         | 19 ++++++++++++++++++-
>  net/packet/internal.h          |  3 +++
>  3 files changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
> index 053bd10..52b6501 100644
> --- a/include/uapi/linux/if_packet.h
> +++ b/include/uapi/linux/if_packet.h
> @@ -54,6 +54,7 @@ struct sockaddr_ll {
>  #define PACKET_FANOUT			18
>  #define PACKET_TX_HAS_OFF		19
>  #define PACKET_QDISC_BYPASS		20
> +#define PACKET_ROLLOVER_STATS		21
>  
>  #define PACKET_FANOUT_HASH		0
>  #define PACKET_FANOUT_LB		1
> @@ -75,6 +76,12 @@ struct tpacket_stats_v3 {
>  	unsigned int	tp_freeze_q_cnt;
>  };
>  
> +struct tpacket_rollover_stats {
> +	unsigned long	tp_all;
> +	unsigned long	tp_huge;
> +	unsigned long	tp_failed;
> +};

Please do not use variably sized types, especially "long" for user exported
interfaces.  Otherwise you will need to add a compat layer to translate
the data into userspace for 32-bit tasks running on a 64-bit kernel.

If a 32-bit value really is not sufficient, use __aligned_u64.


* Re: [PATCH net-next v2 6/6] packet: rollover statistics
From: Willem de Bruijn @ 2015-05-12 15:04 UTC (permalink / raw)
  To: David Miller; +Cc: Network Development, Eric Dumazet, David Laight

>>
>> +struct tpacket_rollover_stats {
>> +     unsigned long   tp_all;
>> +     unsigned long   tp_huge;
>> +     unsigned long   tp_failed;
>> +};
>
> Please do not use variably sized types, especially "long" for user exported
> interfaces.  Otherwise you will need to add a compat layer to translate
> the data into userspace for 32-bit tasks running on a 64-bit kernel.
>
> If a 32-bit value really is not sufficient, use __aligned_u64.

Thanks for catching that. I converted them to __aligned_u64. The
internal counters remain atomic_long_t.

I'll push v3 of the patchset with only this change.
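
For reference, a minimal userspace sketch of reading these counters,
assuming the v3 layout in which tp_all, tp_huge and tp_failed are
__aligned_u64 (error handling reduced to a bare return):

  #include <stdio.h>
  #include <sys/socket.h>
  #include <linux/if_packet.h>

  /* Sketch: dump the rollover counters of a fanout member socket.
   * Assumes headers that already provide PACKET_ROLLOVER_STATS and
   * struct tpacket_rollover_stats.
   */
  static void print_rollover_stats(int fd)
  {
          struct tpacket_rollover_stats rstats;
          socklen_t len = sizeof(rstats);

          if (getsockopt(fd, SOL_PACKET, PACKET_ROLLOVER_STATS,
                         &rstats, &len))
                  return;

          printf("rollover: all=%llu huge=%llu failed=%llu\n",
                 (unsigned long long)rstats.tp_all,
                 (unsigned long long)rstats.tp_huge,
                 (unsigned long long)rstats.tp_failed);
  }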


