* [PATCH net-next 0/7] packet: refine rollover
@ 2015-05-06 18:27 Willem de Bruijn
  2015-05-06 18:27 ` [PATCH net-next 1/7] packet: rollover prepare: move code out of callsites Willem de Bruijn
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Willem de Bruijn @ 2015-05-06 18:27 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Refine packet socket rollover:

1. mitigate a case of lock contention
2. avoid resource exhaustion on other sockets,
   by migrating only to a victim socket that has ample room
3. avoid reordering other flows on the same socket,
   by migrating first the flow responsible for load imbalance
4. help processes detect load imbalance,
   by exporting rollover counters

Context: rollover implements flow migration in packet socket fanout
groups in case of extreme load imbalance. It is a specific
implementation of migration that minimizes reordering by selecting
the same victim socket when possible (and by selecting subsequent
victims in a round robin fashion, from which its name derives).
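
For readers unfamiliar with the interface, a minimal userspace sketch of
joining a fanout group with the rollover flag enabled (group id, protocol
and error handling here are illustrative, not part of this series):

#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

int open_fanout_socket(void)
{
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        int id = 42;                            /* arbitrary group id */
        int type = PACKET_FANOUT_HASH |         /* demux by rxhash */
                   PACKET_FANOUT_FLAG_ROLLOVER; /* migrate on overload */
        int arg = id | (type << 16);

        if (fd < 0)
                return -1;
        if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg)))
                return -1;
        return fd;
}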

Willem de Bruijn (7):
  packet: rollover prepare: move code out of callsites
  packet: rollover prepare: per-socket state
  packet: rollover prepare: single return in packet_rcv_has_room
  packet: rollover lock contention avoidance
  packet: rollover only to socket with headroom
  packet: rollover huge flows before small flows
  packet: rollover statistics

 include/uapi/linux/if_packet.h |   7 ++
 net/packet/af_packet.c         | 164 ++++++++++++++++++++++++++++++++++-------
 net/packet/internal.h          |  14 +++-
 3 files changed, 157 insertions(+), 28 deletions(-)

-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next 1/7] packet: rollover prepare: move code out of callsites
  2015-05-06 18:27 [PATCH net-next 0/7] packet: refine rollover Willem de Bruijn
@ 2015-05-06 18:27 ` Willem de Bruijn
  2015-05-06 18:27 ` [PATCH net-next 2/7] packet: rollover prepare: per-socket state Willem de Bruijn
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Willem de Bruijn @ 2015-05-06 18:27 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover
logic into the callee to simplify these callsites, especially with
upcoming changes.

The main difference between the two callsites is that the FLAG
variant tests whether the socket previously selected by another
mode (RR, RND, HASH, ..) has room before migrating flows, whereas the
rollover mode has no original socket to test.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 5102c3c..56f9c52 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1318,18 +1318,22 @@ static unsigned int fanout_demux_rnd(struct packet_fanout *f,
 
 static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 					  struct sk_buff *skb,
-					  unsigned int idx, unsigned int skip,
+					  unsigned int idx, bool try_self,
 					  unsigned int num)
 {
 	unsigned int i, j;
 
+	if (try_self && packet_rcv_has_room(pkt_sk(f->arr[idx]), skb))
+		return idx;
+
 	i = j = min_t(int, f->next[idx], num - 1);
 	do {
-		if (i != skip && packet_rcv_has_room(pkt_sk(f->arr[i]), skb)) {
+		if (i != idx && packet_rcv_has_room(pkt_sk(f->arr[i]), skb)) {
 			if (i != j)
 				f->next[idx] = i;
 			return i;
 		}
+
 		if (++i == num)
 			i = 0;
 	} while (i != j);
@@ -1386,17 +1390,14 @@ static int packet_rcv_fanout(struct sk_buff *skb, struct net_device *dev,
 		idx = fanout_demux_qm(f, skb, num);
 		break;
 	case PACKET_FANOUT_ROLLOVER:
-		idx = fanout_demux_rollover(f, skb, 0, (unsigned int) -1, num);
+		idx = fanout_demux_rollover(f, skb, 0, false, num);
 		break;
 	}
 
-	po = pkt_sk(f->arr[idx]);
-	if (fanout_has_flag(f, PACKET_FANOUT_FLAG_ROLLOVER) &&
-	    unlikely(!packet_rcv_has_room(po, skb))) {
-		idx = fanout_demux_rollover(f, skb, idx, idx, num);
-		po = pkt_sk(f->arr[idx]);
-	}
+	if (fanout_has_flag(f, PACKET_FANOUT_FLAG_ROLLOVER))
+		idx = fanout_demux_rollover(f, skb, idx, true, num);
 
+	po = pkt_sk(f->arr[idx]);
 	return po->prot_hook.func(skb, dev, &po->prot_hook, orig_dev);
 }
 
-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next 2/7] packet: rollover prepare: per-socket state
  2015-05-06 18:27 [PATCH net-next 0/7] packet: refine rollover Willem de Bruijn
  2015-05-06 18:27 ` [PATCH net-next 1/7] packet: rollover prepare: move code out of callsites Willem de Bruijn
@ 2015-05-06 18:27 ` Willem de Bruijn
  2015-05-06 18:27 ` [PATCH net-next 3/7] packet: rollover prepare: single return in packet_rcv_has_room Willem de Bruijn
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Willem de Bruijn @ 2015-05-06 18:27 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Replace rollover state per fanout group with state per socket. Future
patches will add fields to the new structure.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 21 ++++++++++++++++++---
 net/packet/internal.h  |  6 +++++-
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 56f9c52..f8ec909 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1321,16 +1321,18 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 					  unsigned int idx, bool try_self,
 					  unsigned int num)
 {
+	struct packet_sock *po;
 	unsigned int i, j;
 
-	if (try_self && packet_rcv_has_room(pkt_sk(f->arr[idx]), skb))
+	po = pkt_sk(f->arr[idx]);
+	if (try_self && packet_rcv_has_room(po, skb))
 		return idx;
 
-	i = j = min_t(int, f->next[idx], num - 1);
+	i = j = min_t(int, po->rollover->sock, num - 1);
 	do {
 		if (i != idx && packet_rcv_has_room(pkt_sk(f->arr[i]), skb)) {
 			if (i != j)
-				f->next[idx] = i;
+				po->rollover->sock = i;
 			return i;
 		}
 
@@ -1468,6 +1470,12 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
 	if (po->fanout)
 		return -EALREADY;
 
+	if (type_flags & PACKET_FANOUT_FLAG_ROLLOVER) {
+		po->rollover = kzalloc(sizeof(*po->rollover), GFP_KERNEL);
+		if (!po->rollover)
+			return -ENOMEM;
+	}
+
 	mutex_lock(&fanout_mutex);
 	match = NULL;
 	list_for_each_entry(f, &fanout_list, list) {
@@ -1516,6 +1524,10 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
 	}
 out:
 	mutex_unlock(&fanout_mutex);
+	if (err) {
+		kfree(po->rollover);
+		po->rollover = NULL;
+	}
 	return err;
 }
 
@@ -1537,6 +1549,8 @@ static void fanout_release(struct sock *sk)
 		kfree(f);
 	}
 	mutex_unlock(&fanout_mutex);
+
+	kfree(po->rollover);
 }
 
 static const struct proto_ops packet_ops;
@@ -2863,6 +2877,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 
 	spin_lock_init(&po->bind_lock);
 	mutex_init(&po->pg_vec_lock);
+	po->rollover = NULL;
 	po->prot_hook.func = packet_rcv;
 
 	if (sock->type == SOCK_PACKET)
diff --git a/net/packet/internal.h b/net/packet/internal.h
index fe6e20c..a9d33a2 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -82,12 +82,15 @@ struct packet_fanout {
 	atomic_t		rr_cur;
 	struct list_head	list;
 	struct sock		*arr[PACKET_FANOUT_MAX];
-	int			next[PACKET_FANOUT_MAX];
 	spinlock_t		lock;
 	atomic_t		sk_ref;
 	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
 };
 
+struct packet_rollover {
+	int			sock;
+} ____cacheline_aligned_in_smp;
+
 struct packet_sock {
 	/* struct sock has to be the first member of packet_sock */
 	struct sock		sk;
@@ -104,6 +107,7 @@ struct packet_sock {
 				has_vnet_hdr:1;
 	int			ifindex;	/* bound device		*/
 	__be16			num;
+	struct packet_rollover	*rollover;
 	struct packet_mclist	*mclist;
 	atomic_t		mapped;
 	enum tpacket_versions	tp_version;
-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next 3/7] packet: rollover prepare: single return in packet_rcv_has_room
  2015-05-06 18:27 [PATCH net-next 0/7] packet: refine rollover Willem de Bruijn
  2015-05-06 18:27 ` [PATCH net-next 1/7] packet: rollover prepare: move code out of callsites Willem de Bruijn
  2015-05-06 18:27 ` [PATCH net-next 2/7] packet: rollover prepare: per-socket state Willem de Bruijn
@ 2015-05-06 18:27 ` Willem de Bruijn
  2015-05-07 13:49   ` David Laight
  2015-05-06 18:27 ` [PATCH net-next 4/7] packet: rollover lock contention avoidance Willem de Bruijn
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2015-05-06 18:27 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Convert

  if (x) {
    A;
    return y;
  }
  B;
  return y;

Into

  if (x) {
    A;
  } else {
    B;
  }

  return y;

No other changes.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index f8ec909..8275056 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1239,20 +1239,21 @@ static bool packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
 	struct sock *sk = &po->sk;
 	bool has_room;
 
-	if (po->prot_hook.func != tpacket_rcv)
-		return (atomic_read(&sk->sk_rmem_alloc) + skb->truesize)
-			<= sk->sk_rcvbuf;
-
-	spin_lock(&sk->sk_receive_queue.lock);
-	if (po->tp_version == TPACKET_V3)
-		has_room = prb_lookup_block(po, &po->rx_ring,
-					    po->rx_ring.prb_bdqc.kactive_blk_num,
-					    TP_STATUS_KERNEL);
-	else
-		has_room = packet_lookup_frame(po, &po->rx_ring,
-					       po->rx_ring.head,
-					       TP_STATUS_KERNEL);
-	spin_unlock(&sk->sk_receive_queue.lock);
+	if (po->prot_hook.func != tpacket_rcv) {
+		has_room = (atomic_read(&sk->sk_rmem_alloc) + skb->truesize)
+			   <= sk->sk_rcvbuf;
+	} else {
+		spin_lock(&sk->sk_receive_queue.lock);
+		if (po->tp_version == TPACKET_V3)
+			has_room = prb_lookup_block(po, &po->rx_ring,
+						    po->rx_ring.prb_bdqc.kactive_blk_num,
+						    TP_STATUS_KERNEL);
+		else
+			has_room = packet_lookup_frame(po, &po->rx_ring,
+						       po->rx_ring.head,
+						       TP_STATUS_KERNEL);
+		spin_unlock(&sk->sk_receive_queue.lock);
+	}
 
 	return has_room;
 }
-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next 4/7] packet: rollover lock contention avoidance
  2015-05-06 18:27 [PATCH net-next 0/7] packet: refine rollover Willem de Bruijn
                   ` (2 preceding siblings ...)
  2015-05-06 18:27 ` [PATCH net-next 3/7] packet: rollover prepare: single return in packet_rcv_has_room Willem de Bruijn
@ 2015-05-06 18:27 ` Willem de Bruijn
  2015-05-06 19:44   ` Eric Dumazet
  2015-05-06 18:27 ` [PATCH net-next 5/7] packet: rollover only to socket with headroom Willem de Bruijn
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2015-05-06 18:27 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Rollover has to call packet_rcv_has_room on sockets in the fanout
group to find a socket to migrate to. This operation is expensive,
especially if the packet sockets use rings, in which case a lock has
to be acquired.

Avoid having all sockets pound on the lock by temporarily marking a
socket as "under memory pressure" when such pressure is detected.
While set, only the socket owner may call packet_rcv_has_room on the
socket. Once it detects normal conditions, it clears the flag. The
socket is not used as a victim by any other socket in the meantime.

Under reasonably balanced load, each socket writer frequently calls
packet_rcv_has_room and clears its own pressure field. As a backup
for when the socket is rarely written to, also clear the flag on
reading (packet_recvmsg, packet_poll) if this can be done cheaply
(i.e., without calling packet_rcv_has_room). This is only for
edge cases.
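
The update itself is a two-line idiom in packet_rcv_has_room, shown in
the diff below and repeated here with comments added for exposition:

        /* po->pressure is only a hint, so the unlocked test is fine.
         * Compare before writing: the field is dirtied only when the
         * state actually flips, not on every packet.
         */
        if (po->pressure == has_room)
                xchg(&po->pressure, !has_room);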

Tested:
  Ran bench_rollover: a process with 8 sockets in a single fanout
  group, each pinned to a single cpu that receives one nic recv
  interrupt. RPS and RFS are disabled. The benchmark uses packet
  rx_ring, which has to take a lock when determining whether a
  socket has room.

  Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
  uniformly across the packet sockets (and inserted an iptables
  rule to drop in PREROUTING to avoid protocol stack processing).

  Without this patch, all sockets try to migrate traffic to
  neighbors, causing lock contention when searching for a non-
  empty neighbor. The lock is the top 9 entries.

    perf record -a -g sleep 5

    -  17.82%   bench_rollover  [kernel.kallsyms]    [k] _raw_spin_lock
       - _raw_spin_lock
          - 99.00% spin_lock
    	 + 81.77% packet_rcv_has_room.isra.41
    	 + 18.23% tpacket_rcv
          + 0.84% packet_rcv_has_room.isra.41
    +   5.20%      ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.15%      ksoftirqd/1  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.14%      ksoftirqd/2  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.12%      ksoftirqd/7  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.12%      ksoftirqd/5  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.10%      ksoftirqd/4  [kernel.kallsyms]    [k] _raw_spin_lock
    +   4.66%      ksoftirqd/0  [kernel.kallsyms]    [k] _raw_spin_lock
    +   4.45%      ksoftirqd/3  [kernel.kallsyms]    [k] _raw_spin_lock
    +   1.55%   bench_rollover  [kernel.kallsyms]    [k] packet_rcv_has_room.isra.41

  On net-next with this patch, this lock contention is no longer a
  top entry. Most time is spent in the actual read function. Next up
  are other locks:

    +  17.51%  bench_rollover  bench_rollover      [.] reader
    +   4.95%         swapper  [kernel.kallsyms]   [k] memcpy_erms
    +   2.37%         swapper  [kernel.kallsyms]   [k] _raw_spin_lock
    +   2.31%         swapper  [kernel.kallsyms]   [k] tpacket_rcv
    +   2.20%         swapper  [kernel.kallsyms]   [k] udp4_gro_receive
    +   1.65%         swapper  [kernel.kallsyms]   [k] mlx4_en_process_rx_cq
    +   1.39%         swapper  [kernel.kallsyms]   [k] kmem_cache_alloc

    Looking closer at the remaining _raw_spin_lock, the cost of probing
    in rollover is now comparable to the cost of taking the lock later
    in tpacket_rcv.

    -   2.37%         swapper  [kernel.kallsyms]   [k] _raw_spin_lock
       - _raw_spin_lock
          + 36.86% tpacket_rcv
          + 36.72% packet_rcv_has_room.isra.54
          + 13.17% enqueue_to_backlog
          + 3.98% __free_pages_ok
          + 1.49% packet_rcv_fanout
          + 1.37% fanout_demux_rollover
          + 1.26% netif_receive_skb_internal
          + 1.16% mlx4_en_poll_rx_cq
          + 0.85% get_next_timer_interrupt
          + 0.56% handle_irq
          + 0.53% process_backlog

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 16 ++++++++++++++--
 net/packet/internal.h  |  1 +
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 8275056..fdb5261 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1255,6 +1255,9 @@ static bool packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
 		spin_unlock(&sk->sk_receive_queue.lock);
 	}
 
+	if (po->pressure == has_room)
+		xchg(&po->pressure, !has_room);
+
 	return has_room;
 }
 
@@ -1322,7 +1325,7 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 					  unsigned int idx, bool try_self,
 					  unsigned int num)
 {
-	struct packet_sock *po;
+	struct packet_sock *po, *po_next;
 	unsigned int i, j;
 
 	po = pkt_sk(f->arr[idx]);
@@ -1331,7 +1334,9 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 
 	i = j = min_t(int, po->rollover->sock, num - 1);
 	do {
-		if (i != idx && packet_rcv_has_room(pkt_sk(f->arr[i]), skb)) {
+		po_next = pkt_sk(f->arr[i]);
+		if (po_next != po && !po_next->pressure &&
+		    packet_rcv_has_room(po_next, skb)) {
 			if (i != j)
 				po->rollover->sock = i;
 			return i;
@@ -2956,6 +2961,9 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 	if (skb == NULL)
 		goto out;
 
+	if (pkt_sk(sk)->pressure && !sk_rmem_alloc_get(sk))
+		xchg(&pkt_sk(sk)->pressure, 0);
+
 	if (pkt_sk(sk)->has_vnet_hdr) {
 		struct virtio_net_hdr vnet_hdr = { 0 };
 
@@ -3718,6 +3726,10 @@ static unsigned int packet_poll(struct file *file, struct socket *sock,
 			mask |= POLLOUT | POLLWRNORM;
 	}
 	spin_unlock_bh(&sk->sk_write_queue.lock);
+
+	if (po->pressure && !(mask & POLLIN))
+		xchg(&po->pressure, 0);
+
 	return mask;
 }
 
diff --git a/net/packet/internal.h b/net/packet/internal.h
index a9d33a2..22d7d77 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -105,6 +105,7 @@ struct packet_sock {
 				auxdata:1,
 				origdev:1,
 				has_vnet_hdr:1;
+	int			pressure;
 	int			ifindex;	/* bound device		*/
 	__be16			num;
 	struct packet_rollover	*rollover;
-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next 5/7] packet: rollover only to socket with headroom
  2015-05-06 18:27 [PATCH net-next 0/7] packet: refine rollover Willem de Bruijn
                   ` (3 preceding siblings ...)
  2015-05-06 18:27 ` [PATCH net-next 4/7] packet: rollover lock contention avoidance Willem de Bruijn
@ 2015-05-06 18:27 ` Willem de Bruijn
  2015-05-06 18:27 ` [PATCH net-next 6/7] packet: rollover huge flows before small flows Willem de Bruijn
  2015-05-06 18:27 ` [PATCH net-next 7/7] packet: rollover statistics Willem de Bruijn
  6 siblings, 0 replies; 16+ messages in thread
From: Willem de Bruijn @ 2015-05-06 18:27 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Only migrate flows to sockets that have sufficient headroom, where
sufficient is defined as having at least 25% empty space.

The kernel has three different buffer types: a regular socket, a ring
with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The
latter two do not expose a read pointer to the kernel, so headroom is
not computed easily. Each of the three needs a different implementation
to estimate free space.
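
As a worked example of the frame ring probe (the ring size is
hypothetical, not from the patch): with ROOM_POW_OFF of 2 and a
256-frame TPACKET_V2 ring whose head is at frame 200,
__tpacket_has_room computes

        len = 256;                      /* rx_ring.frame_max + 1 */
        idx = 200 + (256 >> 2);         /* head + len/4 = 264 */
        if (idx >= len)
                idx -= len;             /* wraps to frame 8 */

and reports ROOM_NORMAL only if frame 8 is still TP_STATUS_KERNEL,
i.e. (assuming in-order consumption) at least a quarter of the ring
ahead of head is free.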

Tested:
  Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

  bench_rollover has as many sockets as there are NIC receive queues
  in the system. Each socket is owned by a process that is pinned to
  one of the receive cpus. RFS is disabled. RPS is enabled with an
  identity mapping (cpu x -> cpu x), to count drops with softnettop.

      lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -s -r
      Press [Enter] to exit

      cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
        0    1563828    1563828          0    8674715          0          0
        1    1212590    1212590          0          0          0          0
        2    1236807    1236808          0          0          0          0
        3    1246906    1246906          0          0          0          0
        4    1236603    1236603          0          0          0          0
        5    1238808    1238808          0          0          0          0
        6    1250217    1250217          0          0          0          0
        7    1252906    1252906          0          0          0          0

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 68 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 54 insertions(+), 14 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index fdb5261..d0c4c95 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1234,31 +1234,71 @@ static void packet_free_pending(struct packet_sock *po)
 	free_percpu(po->tx_ring.pending_refcnt);
 }
 
-static bool packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
+#define ROOM_POW_OFF	2
+#define ROOM_NONE	0x0
+#define ROOM_LOW	0x1
+#define ROOM_NORMAL	0x2
+
+static bool __tpacket_has_room(struct packet_sock *po, int pow_off)
+{
+	int idx, len;
+
+	len = po->rx_ring.frame_max + 1;
+	idx = po->rx_ring.head;
+	if (pow_off)
+		idx += len >> pow_off;
+	if (idx >= len)
+		idx -= len;
+	return packet_lookup_frame(po, &po->rx_ring, idx, TP_STATUS_KERNEL);
+}
+
+static bool __tpacket_v3_has_room(struct packet_sock *po, int pow_off)
+{
+	int idx, len;
+
+	len = po->rx_ring.prb_bdqc.knum_blocks;
+	idx = po->rx_ring.prb_bdqc.kactive_blk_num;
+	if (pow_off)
+		idx += len >> pow_off;
+	if (idx >= len)
+		idx -= len;
+	return prb_lookup_block(po, &po->rx_ring, idx, TP_STATUS_KERNEL);
+}
+
+static int packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
 {
 	struct sock *sk = &po->sk;
+	int ret = ROOM_NONE;
 	bool has_room;
 
 	if (po->prot_hook.func != tpacket_rcv) {
-		has_room = (atomic_read(&sk->sk_rmem_alloc) + skb->truesize)
-			   <= sk->sk_rcvbuf;
+		int avail = sk->sk_rcvbuf - atomic_read(&sk->sk_rmem_alloc)
+					  - skb->truesize;
+		if (avail > (sk->sk_rcvbuf >> ROOM_POW_OFF))
+			ret = ROOM_NORMAL;
+		else if (avail > 0)
+			ret = ROOM_LOW;
 	} else {
 		spin_lock(&sk->sk_receive_queue.lock);
-		if (po->tp_version == TPACKET_V3)
-			has_room = prb_lookup_block(po, &po->rx_ring,
-						    po->rx_ring.prb_bdqc.kactive_blk_num,
-						    TP_STATUS_KERNEL);
-		else
-			has_room = packet_lookup_frame(po, &po->rx_ring,
-						       po->rx_ring.head,
-						       TP_STATUS_KERNEL);
+		if (po->tp_version == TPACKET_V3) {
+			if (__tpacket_v3_has_room(po, ROOM_POW_OFF))
+				ret = ROOM_NORMAL;
+			else if (__tpacket_v3_has_room(po, 0))
+				ret = ROOM_LOW;
+		} else {
+			if (__tpacket_has_room(po, ROOM_POW_OFF))
+				ret = ROOM_NORMAL;
+			else if (__tpacket_has_room(po, 0))
+				ret = ROOM_LOW;
+		}
 		spin_unlock(&sk->sk_receive_queue.lock);
 	}
 
+	has_room = ret == ROOM_NORMAL;
 	if (po->pressure == has_room)
 		xchg(&po->pressure, !has_room);
 
-	return has_room;
+	return ret;
 }
 
 static void packet_sock_destruct(struct sock *sk)
@@ -1329,14 +1369,14 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 	unsigned int i, j;
 
 	po = pkt_sk(f->arr[idx]);
-	if (try_self && packet_rcv_has_room(po, skb))
+	if (try_self && packet_rcv_has_room(po, skb) != ROOM_NONE)
 		return idx;
 
 	i = j = min_t(int, po->rollover->sock, num - 1);
 	do {
 		po_next = pkt_sk(f->arr[i]);
 		if (po_next != po && !po_next->pressure &&
-		    packet_rcv_has_room(po_next, skb)) {
+		    packet_rcv_has_room(po_next, skb) == ROOM_NORMAL) {
 			if (i != j)
 				po->rollover->sock = i;
 			return i;
-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next 6/7] packet: rollover huge flows before small flows
  2015-05-06 18:27 [PATCH net-next 0/7] packet: refine rollover Willem de Bruijn
                   ` (4 preceding siblings ...)
  2015-05-06 18:27 ` [PATCH net-next 5/7] packet: rollover only to socket with headroom Willem de Bruijn
@ 2015-05-06 18:27 ` Willem de Bruijn
  2015-05-06 19:34   ` Eric Dumazet
  2015-05-06 18:27 ` [PATCH net-next 7/7] packet: rollover statistics Willem de Bruijn
  6 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2015-05-06 18:27 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Migrate flows from a socket to another socket in the fanout group not
only when the socket is full. Start migrating huge flows early, to
divert possible 4-tuple attacks without affecting normal traffic.

Introduce fanout_flow_is_huge(). This detects huge flows, which are
defined as taking up more than half the load. It does so cheaply, by
storing the rxhashes of the N most recent packets. If over half of
these are the same rxhash as the current packet, then drop it. This
only protects against 4-tuple attacks. N is chosen to fit all data in
a single cache line.
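
Concretely, assuming a 64-byte L1 cache line (other line sizes change N):

        ROLLOVER_HLEN  = L1_CACHE_BYTES / sizeof(u32) = 64 / 4 = 16
        huge threshold = ROLLOVER_HLEN >> 1 = 8

i.e. a flow is treated as huge when more than 8 of the last 16 recorded
rxhashes equal the rxhash of the current packet.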

Tested:
  Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

      lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
      cpu        rx       rx.k     drop.k   rollover     r.huge   r.failed
       0    1202599    1202599          0          0          0          0
       1    1221096    1221096          0          0          0          0
       2    1202296    1202296          0          0          0          0
       3    1229998    1229998          0          0          0          0
       4    1229551    1229551          0          0          0          0
       5    1221097    1221097          0          0          0          0
       6    1223496    1223496          0          0          0          0
       7    1616768    1616768          0    8530027    8530027          0

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 30 +++++++++++++++++++++++++++---
 net/packet/internal.h  |  4 ++++
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d0c4c95..4e54b6b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1326,6 +1326,24 @@ static int fanout_rr_next(struct packet_fanout *f, unsigned int num)
 	return x;
 }
 
+static bool fanout_flow_is_huge(struct packet_sock *po, struct sk_buff *skb)
+{
+	u32 rxhash;
+	int i, count = 0;
+
+	rxhash = skb_get_hash(skb);
+	spin_lock(&po->rollover->hist_lock);
+	for (i = 0; i < ROLLOVER_HLEN; i++)
+		if (po->rollover->history[i] == rxhash)
+			count++;
+
+	i = po->rollover->hist_idx++ & (ROLLOVER_HLEN - 1);
+	po->rollover->history[i] = rxhash;
+	spin_unlock(&po->rollover->hist_lock);
+
+	return count > (ROLLOVER_HLEN >> 1);
+}
+
 static unsigned int fanout_demux_hash(struct packet_fanout *f,
 				      struct sk_buff *skb,
 				      unsigned int num)
@@ -1366,11 +1384,16 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 					  unsigned int num)
 {
 	struct packet_sock *po, *po_next;
-	unsigned int i, j;
+	unsigned int i, j, room;
 
 	po = pkt_sk(f->arr[idx]);
-	if (try_self && packet_rcv_has_room(po, skb) != ROOM_NONE)
-		return idx;
+
+	if (try_self) {
+		room = packet_rcv_has_room(po, skb);
+		if (room == ROOM_NORMAL ||
+		    (room == ROOM_LOW && !fanout_flow_is_huge(po, skb)))
+			return idx;
+	}
 
 	i = j = min_t(int, po->rollover->sock, num - 1);
 	do {
@@ -1520,6 +1543,7 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
 		po->rollover = kzalloc(sizeof(*po->rollover), GFP_KERNEL);
 		if (!po->rollover)
 			return -ENOMEM;
+		spin_lock_init(&po->rollover->hist_lock);
 	}
 
 	mutex_lock(&fanout_mutex);
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 22d7d77..6f479c4 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -89,6 +89,10 @@ struct packet_fanout {
 
 struct packet_rollover {
 	int			sock;
+	int			hist_idx;
+#define ROLLOVER_HLEN	(L1_CACHE_BYTES / sizeof(u32))
+	u32			history[ROLLOVER_HLEN] ____cacheline_aligned;
+	spinlock_t		hist_lock;
 } ____cacheline_aligned_in_smp;
 
 struct packet_sock {
-- 
2.2.0.rc0.207.ga3a616c


* [PATCH net-next 7/7] packet: rollover statistics
  2015-05-06 18:27 [PATCH net-next 0/7] packet: refine rollover Willem de Bruijn
                   ` (5 preceding siblings ...)
  2015-05-06 18:27 ` [PATCH net-next 6/7] packet: rollover huge flows before small flows Willem de Bruijn
@ 2015-05-06 18:27 ` Willem de Bruijn
  6 siblings, 0 replies; 16+ messages in thread
From: Willem de Bruijn @ 2015-05-06 18:27 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Rollover indicates exceptional conditions. Export a counter to inform
socket owners of this state.

If no socket with sufficient room is found, rollover fails. Also count
these events.

Finally, also count when flows are rolled over early thanks to huge
flow detection, to validate its correctness.
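
For completeness, a sketch of reading the new counters from userspace
(the helper below is hypothetical, not part of the patch; the socket
must already have rollover enabled, otherwise the kernel returns
EINVAL as per the getsockopt change below):

#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

static void print_rollover_stats(int fd)
{
        struct tpacket_rollover_stats rstats;
        socklen_t len = sizeof(rstats);

        if (getsockopt(fd, SOL_PACKET, PACKET_ROLLOVER_STATS,
                       &rstats, &len) == 0)
                printf("rollover=%lu huge=%lu failed=%lu\n",
                       rstats.tp_all, rstats.tp_huge, rstats.tp_failed);
}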

Tested:
  Read counters in bench_rollover on all other tests in the patchset

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/uapi/linux/if_packet.h |  7 +++++++
 net/packet/af_packet.c         | 17 +++++++++++++++++
 net/packet/internal.h          |  3 +++
 3 files changed, 27 insertions(+)

diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index 053bd10..52b6501 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -54,6 +54,7 @@ struct sockaddr_ll {
 #define PACKET_FANOUT			18
 #define PACKET_TX_HAS_OFF		19
 #define PACKET_QDISC_BYPASS		20
+#define PACKET_ROLLOVER_STATS		21
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
@@ -75,6 +76,12 @@ struct tpacket_stats_v3 {
 	unsigned int	tp_freeze_q_cnt;
 };
 
+struct tpacket_rollover_stats {
+	unsigned long	tp_all;
+	unsigned long	tp_huge;
+	unsigned long	tp_failed;
+};
+
 union tpacket_stats_u {
 	struct tpacket_stats stats1;
 	struct tpacket_stats_v3 stats3;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 4e54b6b..9a82196 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1402,6 +1402,9 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 		    packet_rcv_has_room(po_next, skb) == ROOM_NORMAL) {
 			if (i != j)
 				po->rollover->sock = i;
+			atomic_long_inc(&po->rollover->num);
+			if (room == ROOM_LOW)
+				atomic_long_inc(&po->rollover->num_huge);
 			return i;
 		}
 
@@ -1409,6 +1412,7 @@ static unsigned int fanout_demux_rollover(struct packet_fanout *f,
 			i = 0;
 	} while (i != j);
 
+	atomic_long_inc(&po->rollover->num_failed);
 	return idx;
 }
 
@@ -1544,6 +1548,9 @@ static int fanout_add(struct sock *sk, u16 id, u16 type_flags)
 		if (!po->rollover)
 			return -ENOMEM;
 		spin_lock_init(&po->rollover->hist_lock);
+		atomic_long_set(&po->rollover->num, 0);
+		atomic_long_set(&po->rollover->num_huge, 0);
+		atomic_long_set(&po->rollover->num_failed, 0);
 	}
 
 	mutex_lock(&fanout_mutex);
@@ -3571,6 +3578,7 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	struct packet_sock *po = pkt_sk(sk);
 	void *data = &val;
 	union tpacket_stats_u st;
+	struct tpacket_rollover_stats rstats;
 
 	if (level != SOL_PACKET)
 		return -ENOPROTOOPT;
@@ -3646,6 +3654,15 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 			((u32)po->fanout->flags << 24)) :
 		       0);
 		break;
+	case PACKET_ROLLOVER_STATS:
+		if (!po->rollover)
+			return -EINVAL;
+		rstats.tp_all = atomic_long_read(&po->rollover->num);
+		rstats.tp_huge = atomic_long_read(&po->rollover->num_huge);
+		rstats.tp_failed = atomic_long_read(&po->rollover->num_failed);
+		data = &rstats;
+		lv = sizeof(rstats);
+		break;
 	case PACKET_TX_HAS_OFF:
 		val = po->tp_tx_has_off;
 		break;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 6f479c4..5d51b7b 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -89,6 +89,9 @@ struct packet_fanout {
 
 struct packet_rollover {
 	int			sock;
+	atomic_long_t		num;
+	atomic_long_t		num_huge;
+	atomic_long_t		num_failed;
 	int			hist_idx;
 #define ROLLOVER_HLEN	(L1_CACHE_BYTES / sizeof(u32))
 	u32			history[ROLLOVER_HLEN] ____cacheline_aligned;
-- 
2.2.0.rc0.207.ga3a616c


* Re: [PATCH net-next 6/7] packet: rollover huge flows before small flows
  2015-05-06 18:27 ` [PATCH net-next 6/7] packet: rollover huge flows before small flows Willem de Bruijn
@ 2015-05-06 19:34   ` Eric Dumazet
  2015-05-06 20:06     ` Willem de Bruijn
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-05-06 19:34 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, davem

On Wed, 2015-05-06 at 14:27 -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Migrate flows from a socket to another socket in the fanout group not
> only when the socket is full. Start migrating huge flows early, to
> divert possible 4-tuple attacks without affecting normal traffic.
> 
> Introduce fanout_flow_is_huge(). This detects huge flows, which are
> defined as taking up more than half the load. It does so cheaply, by
> storing the rxhashes of the N most recent packets. If over half of
> these are the same rxhash as the current packet, then drop it. This
> only protects against 4-tuple attacks. N is chosen to fit all data in
> a single cache line.
> 
> Tested:
>   Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
> 
>       lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
>       cpu        rx       rx.k     drop.k   rollover     r.huge   r.failed
>        0    1202599    1202599          0          0          0          0
>        1    1221096    1221096          0          0          0          0
>        2    1202296    1202296          0          0          0          0
>        3    1229998    1229998          0          0          0          0
>        4    1229551    1229551          0          0          0          0
>        5    1221097    1221097          0          0          0          0
>        6    1223496    1223496          0          0          0          0
>        7    1616768    1616768          0    8530027    8530027          0
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  net/packet/af_packet.c | 30 +++++++++++++++++++++++++++---
>  net/packet/internal.h  |  4 ++++
>  2 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index d0c4c95..4e54b6b 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -1326,6 +1326,24 @@ static int fanout_rr_next(struct packet_fanout *f, unsigned int num)
>  	return x;
>  }
>  
> +static bool fanout_flow_is_huge(struct packet_sock *po, struct sk_buff *skb)
> +{
> +	u32 rxhash;
> +	int i, count = 0;
> +
> +	rxhash = skb_get_hash(skb);
> +	spin_lock(&po->rollover->hist_lock);
> +	for (i = 0; i < ROLLOVER_HLEN; i++)
> +		if (po->rollover->history[i] == rxhash)
> +			count++;
> +
> +	i = po->rollover->hist_idx++ & (ROLLOVER_HLEN - 1);
> +	po->rollover->history[i] = rxhash;
> +	spin_unlock(&po->rollover->hist_lock);
> +
> +	return count > (ROLLOVER_HLEN >> 1);
> +}
> +

I am not a huge fan of this strategy or memory placement, because of the
spinlock that protects something that should be a hint rather than an
ultra-precise decision. By the time the lock is released, the status might
already be imprecise.

You touch 3 cache lines here, one for rollover->hist_idx++, one for
history[i] = hash, and one for the spinlock.

(And following patch has the atomic_long_inc() for stats)
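
For reference, the field placement added in patch 6 (net/packet/internal.h),
annotated under the assumption of 64-byte cache lines and a cache line
aligned allocation:

        struct packet_rollover {
                int        sock;       /* line 0 */
                int        hist_idx;   /* line 0 */
                /* 64 bytes -> all of line 1: */
                u32        history[ROLLOVER_HLEN] ____cacheline_aligned;
                spinlock_t hist_lock;  /* line 2 */
        } ____cacheline_aligned_in_smp;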


* Re: [PATCH net-next 4/7] packet: rollover lock contention avoidance
  2015-05-06 18:27 ` [PATCH net-next 4/7] packet: rollover lock contention avoidance Willem de Bruijn
@ 2015-05-06 19:44   ` Eric Dumazet
  2015-05-06 21:05     ` Willem de Bruijn
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-05-06 19:44 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, davem

On Wed, 2015-05-06 at 14:27 -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 

> @@ -3718,6 +3726,10 @@ static unsigned int packet_poll(struct file *file, struct socket *sock,
>  			mask |= POLLOUT | POLLWRNORM;
>  	}
>  	spin_unlock_bh(&sk->sk_write_queue.lock);
> +
> +	if (po->pressure && !(mask & POLLIN))
> +		xchg(&po->pressure, 0);
> +
>  	return mask;

This looks racy to me: several threads could run here, and a thread
could remove pressure just set by another one.

Also, waiting for the queue to become completely empty before releasing
pressure might depend on scheduling to utilize the queue shortly.

(We usually use max_occupancy/2 thresholds)

Maybe this would be better, but please check.

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 5102c3cc4eec..fab59f8bb336 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3694,6 +3694,8 @@ static unsigned int packet_poll(struct file *file, struct socket *sock,
                        TP_STATUS_KERNEL))
                        mask |= POLLIN | POLLRDNORM;
        }
+       if (po->pressure && !(mask & POLLIN))
+               xchg(&po->pressure, 0);
        spin_unlock_bh(&sk->sk_receive_queue.lock);
        spin_lock_bh(&sk->sk_write_queue.lock);
        if (po->tx_ring.pg_vec) {


* Re: [PATCH net-next 6/7] packet: rollover huge flows before small flows
  2015-05-06 19:34   ` Eric Dumazet
@ 2015-05-06 20:06     ` Willem de Bruijn
  2015-05-06 20:16       ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Willem de Bruijn @ 2015-05-06 20:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Network Development, David Miller

On Wed, May 6, 2015 at 3:34 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2015-05-06 at 14:27 -0400, Willem de Bruijn wrote:
>> From: Willem de Bruijn <willemb@google.com>
>>
>> Migrate flows from a socket to another socket in the fanout group not
>> only when the socket is full. Start migrating huge flows early, to
>> divert possible 4-tuple attacks without affecting normal traffic.
>>
>> Introduce fanout_flow_is_huge(). This detects huge flows, which are
>> defined as taking up more than half the load. It does so cheaply, by
>> storing the rxhashes of the N most recent packets. If over half of
>> these are the same rxhash as the current packet, then drop it. This
>> only protects against 4-tuple attacks. N is chosen to fit all data in
>> a single cache line.
>>
>> Tested:
>>   Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
>>
>>       lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
>>       cpu        rx       rx.k     drop.k   rollover     r.huge   r.failed
>>        0    1202599    1202599          0          0          0          0
>>        1    1221096    1221096          0          0          0          0
>>        2    1202296    1202296          0          0          0          0
>>        3    1229998    1229998          0          0          0          0
>>        4    1229551    1229551          0          0          0          0
>>        5    1221097    1221097          0          0          0          0
>>        6    1223496    1223496          0          0          0          0
>>        7    1616768    1616768          0    8530027    8530027          0
>>
>> Signed-off-by: Willem de Bruijn <willemb@google.com>
>> ---
>>  net/packet/af_packet.c | 30 +++++++++++++++++++++++++++---
>>  net/packet/internal.h  |  4 ++++
>>  2 files changed, 31 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index d0c4c95..4e54b6b 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -1326,6 +1326,24 @@ static int fanout_rr_next(struct packet_fanout *f, unsigned int num)
>>       return x;
>>  }
>>
>> +static bool fanout_flow_is_huge(struct packet_sock *po, struct sk_buff *skb)
>> +{
>> +     u32 rxhash;
>> +     int i, count = 0;
>> +
>> +     rxhash = skb_get_hash(skb);
>> +     spin_lock(&po->rollover->hist_lock);
>> +     for (i = 0; i < ROLLOVER_HLEN; i++)
>> +             if (po->rollover->history[i] == rxhash)
>> +                     count++;
>> +
>> +     i = po->rollover->hist_idx++ & (ROLLOVER_HLEN - 1);
>> +     po->rollover->history[i] = rxhash;
>> +     spin_unlock(&po->rollover->hist_lock);
>> +
>> +     return count > (ROLLOVER_HLEN >> 1);
>> +}
>> +
>
> I am not a huge fan of this strategy or memory placement, because of the
> spinlock that protects something which should be a hint, more than an
> ultra precise decision. At the time lock is released, the status might
> already be imprecise.
>
> You touch 3 cache lines here, one for rollover->hist_idx++, one for
> history[i] = hash, and one for the spinlock.

Do you suggest running lockless, similar to the rfs table?
I can reduce history length to make hist_idx fit in the same
cacheline.

>
> (And following patch has the atomic_long_inc() for stats)
>
>
>


* Re: [PATCH net-next 6/7] packet: rollover huge flows before small flows
  2015-05-06 20:06     ` Willem de Bruijn
@ 2015-05-06 20:16       ` Eric Dumazet
  2015-05-06 20:19         ` Willem de Bruijn
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2015-05-06 20:16 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Network Development, David Miller

On Wed, 2015-05-06 at 16:06 -0400, Willem de Bruijn wrote:

> Do you suggest running lockless, similar to the rfs table?
> I can reduce history length to make hist_idx fit in the same
> cacheline.

I suggest not using hist_idx at all, and no lock either.

for (i = 0; i < ROLLOVER_HLEN; i++)
      if (po->rollover->history[i] == rxhash)
              count++;

po->rollover->history[prandom_u32() % ROLLOVER_HLEN] = rxhash;


* Re: [PATCH net-next 6/7] packet: rollover huge flows before small flows
  2015-05-06 20:16       ` Eric Dumazet
@ 2015-05-06 20:19         ` Willem de Bruijn
  0 siblings, 0 replies; 16+ messages in thread
From: Willem de Bruijn @ 2015-05-06 20:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Network Development, David Miller

On Wed, May 6, 2015 at 4:16 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2015-05-06 at 16:06 -0400, Willem de Bruijn wrote:
>
>> Do you suggest running lockless, similar to the rfs table?
>> I can reduce history length to make hist_idx fit in the same
>> cacheline.
>
> I suggest not using hist_idx at all, and no lock either.
>
> for (i = 0; i < ROLLOVER_HLEN; i++)
>       if (po->rollover->history[i] == rxhash)
>               count++;
>
> po->rollover->history[prandom_u32() % ROLLOVER_HLEN] = rxhash;

Thanks, I'll do that.
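
Putting that together, the lockless version might look roughly like the
below (a sketch based on the suggestion above, not a posted revision):

static bool fanout_flow_is_huge(struct packet_sock *po, struct sk_buff *skb)
{
        u32 rxhash;
        int i, count = 0;

        rxhash = skb_get_hash(skb);
        for (i = 0; i < ROLLOVER_HLEN; i++)
                if (po->rollover->history[i] == rxhash)
                        count++;

        /* Random replacement: no shared write index and no lock. The
         * history is only a hint, so racy updates are acceptable.
         */
        po->rollover->history[prandom_u32() % ROLLOVER_HLEN] = rxhash;

        return count > (ROLLOVER_HLEN >> 1);
}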


* Re: [PATCH net-next 4/7] packet: rollover lock contention avoidance
  2015-05-06 19:44   ` Eric Dumazet
@ 2015-05-06 21:05     ` Willem de Bruijn
  0 siblings, 0 replies; 16+ messages in thread
From: Willem de Bruijn @ 2015-05-06 21:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Network Development, David Miller

On Wed, May 6, 2015 at 3:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2015-05-06 at 14:27 -0400, Willem de Bruijn wrote:
>> From: Willem de Bruijn <willemb@google.com>
>>
>
>> @@ -3718,6 +3726,10 @@ static unsigned int packet_poll(struct file *file, struct socket *sock,
>>                       mask |= POLLOUT | POLLWRNORM;
>>       }
>>       spin_unlock_bh(&sk->sk_write_queue.lock);
>> +
>> +     if (po->pressure && !(mask & POLLIN))
>> +             xchg(&po->pressure, 0);
>> +
>>       return mask;
>
> This looks racy to me: several threads could run here, and a thread
> could remove pressure just set by another one.
>
> Also, waiting for the queue to become completely empty before releasing
> pressure might depend on scheduling to utilize the queue shortly.
>
> (We usually use max_occupancy/2 thresholds)

Yes, half would be better. I could not figure out a way to test anything
besides empty without taking the receive lock. But as your change points
out, packet_poll already takes that lock. I can call a lockless variant of
packet_rcv_has_room right there, then (iff po->pressure).

> Maybe this would be better, but please check.
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 5102c3cc4eec..fab59f8bb336 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -3694,6 +3694,8 @@ static unsigned int packet_poll(struct file *file, struct socket *sock,
>                         TP_STATUS_KERNEL))
>                         mask |= POLLIN | POLLRDNORM;
>         }
> +       if (po->pressure && !(mask & POLLIN))
> +               xchg(&po->pressure, 0);
>         spin_unlock_bh(&sk->sk_receive_queue.lock);
>         spin_lock_bh(&sk->sk_write_queue.lock);
>         if (po->tx_ring.pg_vec) {
>

That would fix it for rings. I'll move the check there.
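
Combined with a lockless helper, the poll path might then look roughly
like this (a sketch only; __packet_rcv_has_room is a hypothetical
variant of packet_rcv_has_room that expects sk_receive_queue.lock to be
held by the caller and tolerates a NULL skb):

        spin_lock_bh(&sk->sk_receive_queue.lock);
        if (po->rx_ring.pg_vec) {
                if (!packet_previous_rx_frame(po, &po->rx_ring,
                                              TP_STATUS_KERNEL))
                        mask |= POLLIN | POLLRDNORM;
        }
        /* Clear the pressure hint once normal room is available again,
         * without taking the queue lock a second time.
         */
        if (po->pressure && __packet_rcv_has_room(po, NULL) == ROOM_NORMAL)
                po->pressure = 0;
        spin_unlock_bh(&sk->sk_receive_queue.lock);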


* RE: [PATCH net-next 3/7] packet: rollover prepare: single return in packet_rcv_has_room
  2015-05-06 18:27 ` [PATCH net-next 3/7] packet: rollover prepare: single return in packet_rcv_has_room Willem de Bruijn
@ 2015-05-07 13:49   ` David Laight
  2015-05-07 16:05     ` Willem de Bruijn
  0 siblings, 1 reply; 16+ messages in thread
From: David Laight @ 2015-05-07 13:49 UTC (permalink / raw)
  To: 'Willem de Bruijn', netdev; +Cc: davem

From: Willem de Bruijn
> Sent: 06 May 2015 19:27
> Convert
> 
>   if (x) {
>     A;
>     return y;
>   }
>   B;
>   return y;
> 
> Into
> 
>   if (x) {
>     A;
>   } else {
>     B;
>   }
> 
>   return y;
> 
> No other changes.

...

IMHO there is little benefit in the 'single return from function' style.
In general it just makes scan-reading of code more difficult (once you
see a 'return' you know there is no more relevant code) and hits the RH margin.

	David


* Re: [PATCH net-next 3/7] packet: rollover prepare: single return in packet_rcv_has_room
  2015-05-07 13:49   ` David Laight
@ 2015-05-07 16:05     ` Willem de Bruijn
  0 siblings, 0 replies; 16+ messages in thread
From: Willem de Bruijn @ 2015-05-07 16:05 UTC (permalink / raw)
  To: David Laight; +Cc: netdev, davem

> IMHO there is little benefit in the 'single return from function' style.
> In general it just makes scan-reading of code more difficult (once you
> see a 'return' you know then is no more relevant code) and hits the RH margin.

Subsequent patches add code to the footer of the function. I should
have been more clear about that in the commit message. The change
is indeed not very useful on its own. But at the end of the patch set,
this is the footer:

      has_room = ret == ROOM_NORMAL;
      if (po->pressure == has_room)
              xchg(&po->pressure, !has_room);

      return ret;

which otherwise has to be duplicated.


