* [PATCH net-next v4 0/3] udp: refactor memory accounting
@ 2016-10-19 12:46 Paolo Abeni
       [not found] ` <cover.1476877189.git.pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2016-10-19 12:47 ` [PATCH net-next v4 2/3] udp: implement memory accounting helpers Paolo Abeni
  0 siblings, 2 replies; 12+ messages in thread
From: Paolo Abeni @ 2016-10-19 12:46 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, James Morris, Trond Myklebust, Alexander Duyck,
	Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa,
	Edward Cree, linux-nfs

This patch series refactors the udp memory accounting, replacing the
generic implementation with a custom one, in order to remove the need to
lock the socket on the enqueue and dequeue operations. The socket backlog
usage is dropped, as well.

The first patch factors out pieces of some queue and memory management
socket helpers, so that they can later be used by the udp memory accounting
functions.
The second patch adds the memory accounting helpers, without using them.
The third patch replaces the old rx memory accounting path for udp over ipv4
and udp over ipv6. In-kernel UDP users are updated, as well.

The memory accounting schema is described in detail in the individual patch
commit messages.

The performance gain depends on the specific scenario; with few flows (and
little contention in the original code) the differences are within the noise,
while with several flows contending for the same socket, the measured speed-up
is significant (e.g. above 100% in case of extreme contention).

v3 -> v4:
 - simplified the locking schema, always use a plain spinlock

v2 -> v3:
 - do not set the now unused backlog_rcv callback

v1 -> v2:
 - slightly changed the memory accounting schema; we now perform lazy reclaim
 - fixed a forward_alloc updating issue
 - fixed memory counter integer overflows

Paolo Abeni (3):
  net/socket: factor out helpers for memory and queue manipulation
  udp: implement memory accounting helpers
  udp: use its own memory accounting schema

 include/linux/udp.h   |   1 +
 include/net/sock.h    |   4 ++
 include/net/udp.h     |   4 ++
 net/core/datagram.c   |  36 +++++++-----
 net/core/sock.c       |  66 +++++++++++++++-------
 net/ipv4/udp.c        | 153 +++++++++++++++++++++++++++++++++++++++-----------
 net/ipv6/udp.c        |  34 +++--------
 net/sunrpc/svcsock.c  |  20 +++++--
 net/sunrpc/xprtsock.c |   2 +-
 9 files changed, 221 insertions(+), 99 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH net-next v4 1/3] net/socket: factor out helpers for memory and queue manipulation
  2016-10-19 12:46 [PATCH net-next v4 0/3] udp: refactor memory accounting Paolo Abeni
@ 2016-10-19 12:47     ` Paolo Abeni
  2016-10-19 12:47 ` [PATCH net-next v4 2/3] udp: implement memory accounting helpers Paolo Abeni
  1 sibling, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2016-10-19 12:47 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David S. Miller, James Morris, Trond Myklebust, Alexander Duyck,
	Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa,
	Edward Cree, linux-nfs-u79uwXL29TY76Z2rM5mHXA

Factor out basic sock operations that the udp code can use with its own
memory accounting schema. No functional change is introduced
in the existing APIs.
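
As a (hypothetical) illustration of the intended split, a protocol that
tracks sk_forward_alloc under its own lock could call the two new helpers
directly, while __sk_mem_schedule()/__sk_mem_reclaim() stay as the
forward_alloc-updating wrappers; proto_charge_rmem()/proto_uncharge_rmem()
below are illustrative names only, not part of this patch:

/* charge 'size' bytes against the global protocol limits without
 * touching sk_forward_alloc
 */
static int proto_charge_rmem(struct sock *sk, int size)
{
	int amt = sk_mem_pages(size);

	if (!__sk_mem_raise_allocated(sk, size, amt, SK_MEM_RECV))
		return -ENOBUFS;

	/* the caller accounts amt << SK_MEM_QUANTUM_SHIFT bytes in its
	 * private forward allocation, under its own lock
	 */
	return 0;
}

/* hand back 'pages' quanta (SK_MEM_QUANTUM units) to the global
 * accounting, as the udp code does on dequeue in the next patch
 */
static void proto_uncharge_rmem(struct sock *sk, int pages)
{
	__sk_mem_reduce_allocated(sk, pages);
}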

v2 -> v4:
  - avoid exporting __sock_enqueue_skb

v1 -> v2:
  - avoid exporting sock_rmem_free

Acked-by: Hannes Frederic Sowa <hannes-tFNcAqjVMyqKXQKiL6tip0B+6BGkLq7r@public.gmane.org>
Signed-off-by: Paolo Abeni <pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 include/net/sock.h  |  4 ++++
 net/core/datagram.c | 36 +++++++++++++++++------------
 net/core/sock.c     | 66 +++++++++++++++++++++++++++++++++++++----------------
 3 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ebf75db..2764895 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1274,7 +1274,9 @@ static inline struct inode *SOCK_INODE(struct socket *socket)
 /*
  * Functions for memory accounting
  */
+int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind);
 int __sk_mem_schedule(struct sock *sk, int size, int kind);
+void __sk_mem_reduce_allocated(struct sock *sk, int amount);
 void __sk_mem_reclaim(struct sock *sk, int amount);
 
 #define SK_MEM_QUANTUM ((int)PAGE_SIZE)
@@ -1950,6 +1952,8 @@ void sk_reset_timer(struct sock *sk, struct timer_list *timer,
 
 void sk_stop_timer(struct sock *sk, struct timer_list *timer);
 
+int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
+			unsigned int flags);
 int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 
diff --git a/net/core/datagram.c b/net/core/datagram.c
index b7de71f..bfb973a 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -323,6 +323,27 @@ void __skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb, int len)
 }
 EXPORT_SYMBOL(__skb_free_datagram_locked);
 
+int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
+			unsigned int flags)
+{
+	int err = 0;
+
+	if (flags & MSG_PEEK) {
+		err = -ENOENT;
+		spin_lock_bh(&sk->sk_receive_queue.lock);
+		if (skb == skb_peek(&sk->sk_receive_queue)) {
+			__skb_unlink(skb, &sk->sk_receive_queue);
+			atomic_dec(&skb->users);
+			err = 0;
+		}
+		spin_unlock_bh(&sk->sk_receive_queue.lock);
+	}
+
+	atomic_inc(&sk->sk_drops);
+	return err;
+}
+EXPORT_SYMBOL(__sk_queue_drop_skb);
+
 /**
  *	skb_kill_datagram - Free a datagram skbuff forcibly
  *	@sk: socket
@@ -346,23 +367,10 @@ void __skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb, int len)
 
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags)
 {
-	int err = 0;
-
-	if (flags & MSG_PEEK) {
-		err = -ENOENT;
-		spin_lock_bh(&sk->sk_receive_queue.lock);
-		if (skb == skb_peek(&sk->sk_receive_queue)) {
-			__skb_unlink(skb, &sk->sk_receive_queue);
-			atomic_dec(&skb->users);
-			err = 0;
-		}
-		spin_unlock_bh(&sk->sk_receive_queue.lock);
-	}
+	int err = __sk_queue_drop_skb(sk, skb, flags);
 
 	kfree_skb(skb);
-	atomic_inc(&sk->sk_drops);
 	sk_mem_reclaim_partial(sk);
-
 	return err;
 }
 EXPORT_SYMBOL(skb_kill_datagram);
diff --git a/net/core/sock.c b/net/core/sock.c
index c73e28f..7c85e6d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -404,8 +404,8 @@ static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
 	}
 }
 
-
 int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+
 {
 	unsigned long flags;
 	struct sk_buff_head *list = &sk->sk_receive_queue;
@@ -2091,24 +2091,18 @@ int sk_wait_data(struct sock *sk, long *timeo, const struct sk_buff *skb)
 EXPORT_SYMBOL(sk_wait_data);
 
 /**
- *	__sk_mem_schedule - increase sk_forward_alloc and memory_allocated
+ *	__sk_mem_raise_allocated - increase memory_allocated
  *	@sk: socket
  *	@size: memory size to allocate
+ *	@amt: pages to allocate
  *	@kind: allocation type
  *
- *	If kind is SK_MEM_SEND, it means wmem allocation. Otherwise it means
- *	rmem allocation. This function assumes that protocols which have
- *	memory_pressure use sk_wmem_queued as write buffer accounting.
+ *	Similar to __sk_mem_schedule(), but does not update sk_forward_alloc
  */
-int __sk_mem_schedule(struct sock *sk, int size, int kind)
+int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	struct proto *prot = sk->sk_prot;
-	int amt = sk_mem_pages(size);
-	long allocated;
-
-	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
-
-	allocated = sk_memory_allocated_add(sk, amt);
+	long allocated = sk_memory_allocated_add(sk, amt);
 
 	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
 	    !mem_cgroup_charge_skmem(sk->sk_memcg, amt))
@@ -2169,9 +2163,6 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 
 	trace_sock_exceed_buf_limit(sk, prot, allocated);
 
-	/* Alas. Undo changes. */
-	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
-
 	sk_memory_allocated_sub(sk, amt);
 
 	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
@@ -2179,18 +2170,40 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 
 	return 0;
 }
+EXPORT_SYMBOL(__sk_mem_raise_allocated);
+
+/**
+ *	__sk_mem_schedule - increase sk_forward_alloc and memory_allocated
+ *	@sk: socket
+ *	@size: memory size to allocate
+ *	@kind: allocation type
+ *
+ *	If kind is SK_MEM_SEND, it means wmem allocation. Otherwise it means
+ *	rmem allocation. This function assumes that protocols which have
+ *	memory_pressure use sk_wmem_queued as write buffer accounting.
+ */
+int __sk_mem_schedule(struct sock *sk, int size, int kind)
+{
+	int ret, amt = sk_mem_pages(size);
+
+	sk->sk_forward_alloc += amt << SK_MEM_QUANTUM_SHIFT;
+	ret = __sk_mem_raise_allocated(sk, size, amt, kind);
+	if (!ret)
+		sk->sk_forward_alloc -= amt << SK_MEM_QUANTUM_SHIFT;
+	return ret;
+}
 EXPORT_SYMBOL(__sk_mem_schedule);
 
 /**
- *	__sk_mem_reclaim - reclaim memory_allocated
+ *	__sk_mem_reduce_allocated - reclaim memory_allocated
  *	@sk: socket
- *	@amount: number of bytes (rounded down to a SK_MEM_QUANTUM multiple)
+ *	@amount: number of quanta
+ *
+ *	Similar to __sk_mem_reclaim(), but does not update sk_forward_alloc
  */
-void __sk_mem_reclaim(struct sock *sk, int amount)
+void __sk_mem_reduce_allocated(struct sock *sk, int amount)
 {
-	amount >>= SK_MEM_QUANTUM_SHIFT;
 	sk_memory_allocated_sub(sk, amount);
-	sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT;
 
 	if (mem_cgroup_sockets_enabled && sk->sk_memcg)
 		mem_cgroup_uncharge_skmem(sk->sk_memcg, amount);
@@ -2199,6 +2212,19 @@ void __sk_mem_reclaim(struct sock *sk, int amount)
 	    (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)))
 		sk_leave_memory_pressure(sk);
 }
+EXPORT_SYMBOL(__sk_mem_reduce_allocated);
+
+/**
+ *	__sk_mem_reclaim - reclaim sk_forward_alloc and memory_allocated
+ *	@sk: socket
+ *	@amount: number of bytes (rounded down to a SK_MEM_QUANTUM multiple)
+ */
+void __sk_mem_reclaim(struct sock *sk, int amount)
+{
+	amount >>= SK_MEM_QUANTUM_SHIFT;
+	sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT;
+	__sk_mem_reduce_allocated(sk, amount);
+}
 EXPORT_SYMBOL(__sk_mem_reclaim);
 
 int sk_set_peek_off(struct sock *sk, int val)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net-next v4 2/3] udp: implement memory accounting helpers
  2016-10-19 12:46 [PATCH net-next v4 0/3] udp: refactor memory accounting Paolo Abeni
       [not found] ` <cover.1476877189.git.pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-10-19 12:47 ` Paolo Abeni
  2016-10-20  4:43   ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Paolo Abeni @ 2016-10-19 12:47 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, James Morris, Trond Myklebust, Alexander Duyck,
	Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa,
	Edward Cree, linux-nfs

Avoid using the generic memory accounting helpers.
Add a new spinlock_t to the udp_sock structure and use
it to protect the memory accounting operation, both on
enqueue and on dequeue.

On dequeue perform partial memory reclaiming, trying to
leave a quantum of forward allocated memory.

On enqueue use a custom helper, to allow some optimizations:
- use a plain spin_lock() variant instead of the slightly
  more costly spin_lock_irqsave(),
- avoid the dst_force check, since the calling code has already
  dropped the skb dst

The above needs custom memory reclaiming on shutdown, provided
by udp_destruct_sock().
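
To make the dequeue-side numbers concrete, this is roughly the rounding the
partial reclaim performs (illustrative only, assuming SK_MEM_QUANTUM ==
4096):

/* bytes handed back to the global accounting for a given forward
 * allocation; a non-zero 'partial' keeps some bytes cached on the socket
 */
static int reclaimable_bytes(int fwd_alloc, int partial)
{
	/* e.g. fwd_alloc = 6000, partial = 1: 4096 bytes are released,
	 * 1904 bytes stay cached as forward allocation
	 */
	return (fwd_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
}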

v3 -> v4:
  - reworked the memory accounting, simplifying the schema
  - provide a helper for both memory scheduling and enqueuing

v1 -> v2:
  - use a udp specific destructor to perform memory reclaiming
  - remove a couple of helpers, unneeded after the above cleanup
  - do not reclaim memory on dequeue if not under memory
    pressure
  - reworked the fwd accounting schema to avoid potential
    integer overflow

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/udp.h |   1 +
 include/net/udp.h   |   4 ++
 net/ipv4/udp.c      | 111 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 116 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index d1fd8cd..8ea7f8b 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -66,6 +66,7 @@ struct udp_sock {
 #define UDPLITE_RECV_CC  0x4		/* set via udplite setsocktopt        */
 	__u8		 pcflag;        /* marks socket as UDP-Lite if > 0    */
 	__u8		 unused[3];
+	spinlock_t	 mem_lock;	/* protects memory accounting         */
 	/*
 	 * For encapsulation sockets.
 	 */
diff --git a/include/net/udp.h b/include/net/udp.h
index ea53a87..18f1e6b 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -246,6 +246,9 @@ static inline __be16 udp_flow_src_port(struct net *net, struct sk_buff *skb,
 }
 
 /* net/ipv4/udp.c */
+void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
+int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb);
+
 void udp_v4_early_demux(struct sk_buff *skb);
 int udp_get_port(struct sock *sk, unsigned short snum,
 		 int (*saddr_cmp)(const struct sock *,
@@ -258,6 +261,7 @@ int udp_get_port(struct sock *sk, unsigned short snum,
 void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
 int udp_rcv(struct sk_buff *skb);
 int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
+int udp_init_sock(struct sock *sk);
 int udp_disconnect(struct sock *sk, int flags);
 unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait);
 struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 7d96dc2..7047bb3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1172,6 +1172,117 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 	return ret;
 }
 
+static void udp_rmem_release(struct sock *sk, int size, int partial)
+{
+	struct udp_sock *up = udp_sk(sk);
+	int amt;
+
+	atomic_sub(size, &sk->sk_rmem_alloc);
+
+	spin_lock_bh(&up->mem_lock);
+	sk->sk_forward_alloc += size;
+	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
+	sk->sk_forward_alloc -= amt;
+	spin_unlock_bh(&up->mem_lock);
+
+	if (amt)
+		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
+}
+
+static void udp_rmem_free(struct sk_buff *skb)
+{
+	udp_rmem_release(skb->sk, skb->truesize, 1);
+}
+
+int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
+{
+	struct sk_buff_head *list = &sk->sk_receive_queue;
+	int rmem, delta, amt, err = -ENOMEM;
+	struct udp_sock *up = udp_sk(sk);
+	int size = skb->truesize;
+
+	/* try to avoid the costly atomic add/sub pair when the receive
+	 * queue is full
+	 */
+	if (atomic_read(&sk->sk_rmem_alloc) + size > sk->sk_rcvbuf)
+		goto drop;
+
+	rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
+	if (rmem > sk->sk_rcvbuf)
+		goto uncharge_drop;
+
+	spin_lock(&up->mem_lock);
+	if (size < sk->sk_forward_alloc)
+		goto unlock;
+
+	amt = sk_mem_pages(size);
+	delta = amt << SK_MEM_QUANTUM_SHIFT;
+	if (!__sk_mem_raise_allocated(sk, delta, amt, SK_MEM_RECV)) {
+		err = -ENOBUFS;
+		spin_unlock(&up->mem_lock);
+		goto uncharge_drop;
+	}
+
+	sk->sk_forward_alloc += delta;
+
+unlock:
+	sk->sk_forward_alloc -= size;
+	spin_unlock(&up->mem_lock);
+
+	/* the skb owner is now the udp socket */
+	skb_orphan(skb);
+	skb->sk = sk;
+	skb->destructor = udp_rmem_free;
+
+	skb->dev = NULL;
+	sock_skb_set_dropcount(sk, skb);
+
+	spin_lock(&list->lock);
+	__skb_queue_tail(list, skb);
+	spin_unlock(&list->lock);
+
+	if (!sock_flag(sk, SOCK_DEAD))
+		sk->sk_data_ready(sk);
+
+	return 0;
+
+uncharge_drop:
+	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
+
+drop:
+	atomic_inc(&sk->sk_drops);
+	return err;
+}
+EXPORT_SYMBOL_GPL(__udp_enqueue_schedule_skb);
+
+static void udp_destruct_sock(struct sock *sk)
+{
+	/* reclaim completely the forward allocated memory */
+	__skb_queue_purge(&sk->sk_receive_queue);
+	udp_rmem_release(sk, 0, 0);
+	inet_sock_destruct(sk);
+}
+
+int udp_init_sock(struct sock *sk)
+{
+	sk->sk_destruct = udp_destruct_sock;
+	spin_lock_init(&udp_sk(sk)->mem_lock);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(udp_init_sock);
+
+void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len)
+{
+	if (unlikely(READ_ONCE(sk->sk_peek_off) >= 0)) {
+		bool slow = lock_sock_fast(sk);
+
+		sk_peek_offset_bwd(sk, len);
+		unlock_sock_fast(sk, slow);
+	}
+	consume_skb(skb);
+}
+EXPORT_SYMBOL_GPL(skb_consume_udp);
+
 /**
  *	first_packet_length	- return length of first packet in receive queue
  *	@sk: socket
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net-next v4 3/3] udp: use its own memory accounting schema
  2016-10-19 12:46 [PATCH net-next v4 0/3] udp: refactor memory accounting Paolo Abeni
@ 2016-10-19 12:47     ` Paolo Abeni
  2016-10-19 12:47 ` [PATCH net-next v4 2/3] udp: implement memory accounting helpers Paolo Abeni
  1 sibling, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2016-10-19 12:47 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: David S. Miller, James Morris, Trond Myklebust, Alexander Duyck,
	Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa,
	Edward Cree, linux-nfs-u79uwXL29TY76Z2rM5mHXA

Completely avoid the default sock memory accounting and replace it
with udp-specific accounting.

Since the new memory accounting model completely encapsulates
the required locking, remove the socket lock on both enqueue and
dequeue, and avoid using the backlog on enqueue.

Be sure to clean up the rx queue memory on socket destruction, using
udp's own sk_destruct.

Tested using pktgen with random src port, 64 byte packets,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.

nr readers      Kpps (vanilla)  Kpps (patched)
1               170             420
3               1250            1900
6               3000            3500
9               4200            4400
12              5700            6100
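
For reference, a minimal stand-in for the udp_sink receiver (illustrative
only, not the actual tool used for the numbers above): each instance binds
the same port with SO_REUSEPORT and drains datagrams in a tight loop.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
	struct sockaddr_in addr;
	char buf[2048];
	int fd, one = 1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(argc > 1 ? atoi(argv[1]) : 9999);

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0 ||
	    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) ||
	    bind(fd, (struct sockaddr *)&addr, sizeof(addr))) {
		perror("udp sink");
		return 1;
	}
	for (;;)
		recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
}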

v3 -> v4:
  - remove useless sk_rcvqueues_full() call

v2 -> v3:
  - do not set the now unused backlog_rcv callback

v1 -> v2:
  - add memory pressure support
  - fixed dropwatch accounting for ipv6

Acked-by: Hannes Frederic Sowa <hannes-tFNcAqjVMyqKXQKiL6tip0B+6BGkLq7r@public.gmane.org>
Signed-off-by: Paolo Abeni <pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 net/ipv4/udp.c        | 42 ++++++++----------------------------------
 net/ipv6/udp.c        | 34 ++++++++--------------------------
 net/sunrpc/svcsock.c  | 20 ++++++++++++++++----
 net/sunrpc/xprtsock.c |  2 +-
 4 files changed, 33 insertions(+), 65 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 7047bb3..1925ca7 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1312,13 +1312,8 @@ static int first_packet_length(struct sock *sk)
 	res = skb ? skb->len : -1;
 	spin_unlock_bh(&rcvq->lock);
 
-	if (!skb_queue_empty(&list_kill)) {
-		bool slow = lock_sock_fast(sk);
-
+	if (!skb_queue_empty(&list_kill))
 		__skb_queue_purge(&list_kill);
-		sk_mem_reclaim_partial(sk);
-		unlock_sock_fast(sk, slow);
-	}
 	return res;
 }
 
@@ -1367,7 +1362,6 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 	int err;
 	int is_udplite = IS_UDPLITE(sk);
 	bool checksum_valid = false;
-	bool slow;
 
 	if (flags & MSG_ERRQUEUE)
 		return ip_recv_error(sk, msg, len, addr_len);
@@ -1408,13 +1402,12 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 	}
 
 	if (unlikely(err)) {
-		trace_kfree_skb(skb, udp_recvmsg);
 		if (!peeked) {
 			atomic_inc(&sk->sk_drops);
 			UDP_INC_STATS(sock_net(sk),
 				      UDP_MIB_INERRORS, is_udplite);
 		}
-		skb_free_datagram_locked(sk, skb);
+		kfree_skb(skb);
 		return err;
 	}
 
@@ -1439,16 +1432,15 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 	if (flags & MSG_TRUNC)
 		err = ulen;
 
-	__skb_free_datagram_locked(sk, skb, peeking ? -err : err);
+	skb_consume_udp(sk, skb, peeking ? -err : err);
 	return err;
 
 csum_copy_err:
-	slow = lock_sock_fast(sk);
-	if (!skb_kill_datagram(sk, skb, flags)) {
+	if (!__sk_queue_drop_skb(sk, skb, flags)) {
 		UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
 		UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
 	}
-	unlock_sock_fast(sk, slow);
+	kfree_skb(skb);
 
 	/* starting over for a new packet, but check if we need to yield */
 	cond_resched();
@@ -1567,7 +1559,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		sk_incoming_cpu_update(sk);
 	}
 
-	rc = __sock_queue_rcv_skb(sk, skb);
+	rc = __udp_enqueue_schedule_skb(sk, skb);
 	if (rc < 0) {
 		int is_udplite = IS_UDPLITE(sk);
 
@@ -1582,7 +1574,6 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	}
 
 	return 0;
-
 }
 
 static struct static_key udp_encap_needed __read_mostly;
@@ -1604,7 +1595,6 @@ void udp_encap_enable(void)
 int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct udp_sock *up = udp_sk(sk);
-	int rc;
 	int is_udplite = IS_UDPLITE(sk);
 
 	/*
@@ -1691,25 +1681,9 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		goto drop;
 
 	udp_csum_pull_header(skb);
-	if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) {
-		__UDP_INC_STATS(sock_net(sk), UDP_MIB_RCVBUFERRORS,
-				is_udplite);
-		goto drop;
-	}
-
-	rc = 0;
 
 	ipv4_pktinfo_prepare(sk, skb);
-	bh_lock_sock(sk);
-	if (!sock_owned_by_user(sk))
-		rc = __udp_queue_rcv_skb(sk, skb);
-	else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
-		bh_unlock_sock(sk);
-		goto drop;
-	}
-	bh_unlock_sock(sk);
-
-	return rc;
+	return __udp_queue_rcv_skb(sk, skb);
 
 csum_error:
 	__UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
@@ -2319,13 +2293,13 @@ struct proto udp_prot = {
 	.connect	   = ip4_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
+	.init		   = udp_init_sock,
 	.destroy	   = udp_destroy_sock,
 	.setsockopt	   = udp_setsockopt,
 	.getsockopt	   = udp_getsockopt,
 	.sendmsg	   = udp_sendmsg,
 	.recvmsg	   = udp_recvmsg,
 	.sendpage	   = udp_sendpage,
-	.backlog_rcv	   = __udp_queue_rcv_skb,
 	.release_cb	   = ip4_datagram_release_cb,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 9aa7c1c..71963b2 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -334,7 +334,6 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	int is_udplite = IS_UDPLITE(sk);
 	bool checksum_valid = false;
 	int is_udp4;
-	bool slow;
 
 	if (flags & MSG_ERRQUEUE)
 		return ipv6_recv_error(sk, msg, len, addr_len);
@@ -378,7 +377,6 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			goto csum_copy_err;
 	}
 	if (unlikely(err)) {
-		trace_kfree_skb(skb, udpv6_recvmsg);
 		if (!peeked) {
 			atomic_inc(&sk->sk_drops);
 			if (is_udp4)
@@ -388,7 +386,7 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 				UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
 					       is_udplite);
 		}
-		skb_free_datagram_locked(sk, skb);
+		kfree_skb(skb);
 		return err;
 	}
 	if (!peeked) {
@@ -437,12 +435,11 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	if (flags & MSG_TRUNC)
 		err = ulen;
 
-	__skb_free_datagram_locked(sk, skb, peeking ? -err : err);
+	skb_consume_udp(sk, skb, peeking ? -err : err);
 	return err;
 
 csum_copy_err:
-	slow = lock_sock_fast(sk);
-	if (!skb_kill_datagram(sk, skb, flags)) {
+	if (!__sk_queue_drop_skb(sk, skb, flags)) {
 		if (is_udp4) {
 			UDP_INC_STATS(sock_net(sk),
 				      UDP_MIB_CSUMERRORS, is_udplite);
@@ -455,7 +452,7 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 				       UDP_MIB_INERRORS, is_udplite);
 		}
 	}
-	unlock_sock_fast(sk, slow);
+	kfree_skb(skb);
 
 	/* starting over for a new packet, but check if we need to yield */
 	cond_resched();
@@ -523,7 +520,7 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		sk_incoming_cpu_update(sk);
 	}
 
-	rc = __sock_queue_rcv_skb(sk, skb);
+	rc = __udp_enqueue_schedule_skb(sk, skb);
 	if (rc < 0) {
 		int is_udplite = IS_UDPLITE(sk);
 
@@ -535,6 +532,7 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		kfree_skb(skb);
 		return -1;
 	}
+
 	return 0;
 }
 
@@ -556,7 +554,6 @@ void udpv6_encap_enable(void)
 int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct udp_sock *up = udp_sk(sk);
-	int rc;
 	int is_udplite = IS_UDPLITE(sk);
 
 	if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
@@ -622,25 +619,10 @@ int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		goto drop;
 
 	udp_csum_pull_header(skb);
-	if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) {
-		__UDP6_INC_STATS(sock_net(sk),
-				 UDP_MIB_RCVBUFERRORS, is_udplite);
-		goto drop;
-	}
 
 	skb_dst_drop(skb);
 
-	bh_lock_sock(sk);
-	rc = 0;
-	if (!sock_owned_by_user(sk))
-		rc = __udpv6_queue_rcv_skb(sk, skb);
-	else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
-		bh_unlock_sock(sk);
-		goto drop;
-	}
-	bh_unlock_sock(sk);
-
-	return rc;
+	return __udpv6_queue_rcv_skb(sk, skb);
 
 csum_error:
 	__UDP6_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
@@ -1433,12 +1415,12 @@ struct proto udpv6_prot = {
 	.connect	   = ip6_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
+	.init		   = udp_init_sock,
 	.destroy	   = udpv6_destroy_sock,
 	.setsockopt	   = udpv6_setsockopt,
 	.getsockopt	   = udpv6_getsockopt,
 	.sendmsg	   = udpv6_sendmsg,
 	.recvmsg	   = udpv6_recvmsg,
-	.backlog_rcv	   = __udpv6_queue_rcv_skb,
 	.release_cb	   = ip6_datagram_release_cb,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 57625f6..e2a55dc 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -39,6 +39,7 @@
 #include <net/checksum.h>
 #include <net/ip.h>
 #include <net/ipv6.h>
+#include <net/udp.h>
 #include <net/tcp.h>
 #include <net/tcp_states.h>
 #include <asm/uaccess.h>
@@ -129,6 +130,18 @@ static void svc_release_skb(struct svc_rqst *rqstp)
 	}
 }
 
+static void svc_release_udp_skb(struct svc_rqst *rqstp)
+{
+	struct sk_buff *skb = rqstp->rq_xprt_ctxt;
+
+	if (skb) {
+		rqstp->rq_xprt_ctxt = NULL;
+
+		dprintk("svc: service %p, releasing skb %p\n", rqstp, skb);
+		consume_skb(skb);
+	}
+}
+
 union svc_pktinfo_u {
 	struct in_pktinfo pkti;
 	struct in6_pktinfo pkti6;
@@ -575,7 +588,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 			goto out_free;
 		}
 		local_bh_enable();
-		skb_free_datagram_locked(svsk->sk_sk, skb);
+		consume_skb(skb);
 	} else {
 		/* we can use it in-place */
 		rqstp->rq_arg.head[0].iov_base = skb->data;
@@ -602,8 +615,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 
 	return len;
 out_free:
-	trace_kfree_skb(skb, svc_udp_recvfrom);
-	skb_free_datagram_locked(svsk->sk_sk, skb);
+	kfree_skb(skb);
 	return 0;
 }
 
@@ -660,7 +672,7 @@ static struct svc_xprt *svc_udp_create(struct svc_serv *serv,
 	.xpo_create = svc_udp_create,
 	.xpo_recvfrom = svc_udp_recvfrom,
 	.xpo_sendto = svc_udp_sendto,
-	.xpo_release_rqst = svc_release_skb,
+	.xpo_release_rqst = svc_release_udp_skb,
 	.xpo_detach = svc_sock_detach,
 	.xpo_free = svc_sock_free,
 	.xpo_prep_reply_hdr = svc_udp_prep_reply_hdr,
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index bf16883..8e04a43 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1074,7 +1074,7 @@ static void xs_udp_data_receive(struct sock_xprt *transport)
 		skb = skb_recv_datagram(sk, 0, 1, &err);
 		if (skb != NULL) {
 			xs_udp_data_read_skb(&transport->xprt, sk, skb);
-			skb_free_datagram_locked(sk, skb);
+			consume_skb(skb);
 			continue;
 		}
 		if (!test_and_clear_bit(XPRT_SOCK_DATA_READY, &transport->sock_state))
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v4 3/3] udp: use its own memory accounting schema
  2016-10-19 12:47     ` Paolo Abeni
  (?)
@ 2016-10-20  1:54     ` Eric Dumazet
  -1 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2016-10-20  1:54 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, James Morris, Trond Myklebust,
	Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert,
	Hannes Frederic Sowa, Edward Cree, linux-nfs

On Wed, 2016-10-19 at 14:47 +0200, Paolo Abeni wrote:
> Completely avoid default sock memory accounting and replace it
> with udp-specific accounting.
> 
> Since the new memory accounting model encapsulates completely
> the required locking, remove the socket lock on both enqueue and
> dequeue, and avoid using the backlog on enqueue.
> 
> Be sure to clean-up rx queue memory on socket destruction, using
> udp its own sk_destruct.
> 
> Tested using pktgen with random src port, 64 bytes packet,
> wire-speed on a 10G link as sender and udp_sink as the receiver,
> using an l4 tuple rxhash to stress the contention, and one or more
> udp_sink instances with reuseport.
> 
> nr readers      Kpps (vanilla)  Kpps (patched)
> 1               170             420
> 3               1250            1900
> 6               3000            3500
> 9               4200            4400
> 12              5700            6100
> 
> v3 -> v4:
>   - remove useless sk_rcvqueues_full() call
> 
> v2 -> v3:
>   - do not set the now unsed backlog_rcv callback
> 
> v1 -> v2:
>   - add memory pressure support
>   - fixed dropwatch accounting for ipv6
> 
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
>  net/ipv4/udp.c        | 42 ++++++++----------------------------------
>  net/ipv6/udp.c        | 34 ++++++++--------------------------
>  net/sunrpc/svcsock.c  | 20 ++++++++++++++++----
>  net/sunrpc/xprtsock.c |  2 +-
>  4 files changed, 33 insertions(+), 65 deletions(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 7047bb3..1925ca7 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1312,13 +1312,8 @@ static int first_packet_length(struct sock *sk)
>  	res = skb ? skb->len : -1;
>  	spin_unlock_bh(&rcvq->lock);
>  
> -	if (!skb_queue_empty(&list_kill)) {
> -		bool slow = lock_sock_fast(sk);
> -
> +	if (!skb_queue_empty(&list_kill))
>  		__skb_queue_purge(&list_kill);
> -		sk_mem_reclaim_partial(sk);
> -		unlock_sock_fast(sk, slow);
> -	}

Simply call __skb_queue_purge(&list_kill);

(without testing if list_kill is empty or not)

>  	return res;
>  }
>  

We also could remove list_kill and directly call kfree_skb() in the
loop.
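
Taken together, the two suggestions would shrink first_packet_length()
roughly to the following (sketch only, untested):

	spin_lock_bh(&rcvq->lock);
	while ((skb = skb_peek(rcvq)) != NULL &&
	       udp_lib_checksum_complete(skb)) {
		__UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, IS_UDPLITE(sk));
		__UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, IS_UDPLITE(sk));
		atomic_inc(&sk->sk_drops);
		__skb_unlink(skb, rcvq);
		/* no socket lock is needed any more: free the skb in place */
		kfree_skb(skb);
	}
	res = skb ? skb->len : -1;
	spin_unlock_bh(&rcvq->lock);
	return res;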

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v4 2/3] udp: implement memory accounting helpers
  2016-10-19 12:47 ` [PATCH net-next v4 2/3] udp: implement memory accounting helpers Paolo Abeni
@ 2016-10-20  4:43   ` Eric Dumazet
       [not found]     ` <1476938622.5650.111.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2016-10-20  4:43 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, James Morris, Trond Myklebust,
	Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert,
	Hannes Frederic Sowa, Edward Cree, linux-nfs

On Wed, 2016-10-19 at 14:47 +0200, Paolo Abeni wrote:
> Avoid using the generic helpers.
> Adds a new spinlock_t to the udp_sock structure and use
> it to protect the memory accounting operation, both on
> enqueue and on dequeue.
> 
> On dequeue perform partial memory reclaiming, trying to
> leave a quantum of forward allocated memory.
> 
> On enqueue use a custom helper, to allow some optimizations:
> - use a plain spin_lock() variant instead of the slightly
>   costly spin_lock_irqsave(),
> - avoid dst_force check, since the calling code has already
>   dropped the skb dst
> 
> The above needs custom memory reclaiming on shutdown, provided
> by the udp_destruct_sock().
> 
> v3 -> v4:
>   - reworked memory accunting, simplifying the schema
>   - provide an helper for both memory scheduling and enqueuing
> 
> v1 -> v2:
>   - use a udp specific destrctor to perform memory reclaiming
>   - remove a couple of helpers, unneeded after the above cleanup
>   - do not reclaim memory on dequeue if not under memory
>     pressure
>   - reworked the fwd accounting schema to avoid potential
>     integer overflow
> 
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
>  include/linux/udp.h |   1 +
>  include/net/udp.h   |   4 ++
>  net/ipv4/udp.c      | 111 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 116 insertions(+)
> 
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index d1fd8cd..8ea7f8b 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -66,6 +66,7 @@ struct udp_sock {
>  #define UDPLITE_RECV_CC  0x4		/* set via udplite setsocktopt        */
>  	__u8		 pcflag;        /* marks socket as UDP-Lite if > 0    */
>  	__u8		 unused[3];
> +	spinlock_t	 mem_lock;	/* protects memory accounting         */
>  	/*
>  	 * For encapsulation sockets.
>  	 */
> diff --git a/include/net/udp.h b/include/net/udp.h
> index ea53a87..18f1e6b 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -246,6 +246,9 @@ static inline __be16 udp_flow_src_port(struct net *net, struct sk_buff *skb,
>  }
>  
>  /* net/ipv4/udp.c */
> +void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
> +int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb);
> +
>  void udp_v4_early_demux(struct sk_buff *skb);
>  int udp_get_port(struct sock *sk, unsigned short snum,
>  		 int (*saddr_cmp)(const struct sock *,
> @@ -258,6 +261,7 @@ int udp_get_port(struct sock *sk, unsigned short snum,
>  void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
>  int udp_rcv(struct sk_buff *skb);
>  int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
> +int udp_init_sock(struct sock *sk);
>  int udp_disconnect(struct sock *sk, int flags);
>  unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait);
>  struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 7d96dc2..7047bb3 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1172,6 +1172,117 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
>  	return ret;
>  }
>  
> +static void udp_rmem_release(struct sock *sk, int size, int partial)
> +{
> +	struct udp_sock *up = udp_sk(sk);
> +	int amt;
> +
> +	atomic_sub(size, &sk->sk_rmem_alloc);
> +
> +	spin_lock_bh(&up->mem_lock);
> +	sk->sk_forward_alloc += size;
> +	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
> +	sk->sk_forward_alloc -= amt;
> +	spin_unlock_bh(&up->mem_lock);
> +
> +	if (amt)
> +		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
> +}
> +
> +static void udp_rmem_free(struct sk_buff *skb)
> +{
> +	udp_rmem_release(skb->sk, skb->truesize, 1);
> +}
> +
> +int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> +{
> +	struct sk_buff_head *list = &sk->sk_receive_queue;
> +	int rmem, delta, amt, err = -ENOMEM;
> +	struct udp_sock *up = udp_sk(sk);
> +	int size = skb->truesize;
> +
> +	/* try to avoid the costly atomic add/sub pair when the receive
> +	 * queue is full
> +	 */
> +	if (atomic_read(&sk->sk_rmem_alloc) + size > sk->sk_rcvbuf)
> +		goto drop;
> +
> +	rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> +	if (rmem > sk->sk_rcvbuf)
> +		goto uncharge_drop;

This is dangerous: we used to accept one packet in the receive queue, even
if sk_rcvbuf is too small. (Check __sock_queue_rcv_skb())

With your code we might drop all packets even if receive queue is empty,
this might add regressions for some apps.
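
One way to keep that behaviour would be to let the queue accept a single
skb over the limit (sketch only, untested, not what this series does):

	/* accept the skb when the queue is empty or nearly empty, even if
	 * sk_rcvbuf is very small, mirroring __sock_queue_rcv_skb()
	 */
	rmem = atomic_read(&sk->sk_rmem_alloc);
	if (rmem > sk->sk_rcvbuf)
		goto drop;

	rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
	if (rmem > sk->sk_rcvbuf + size)
		goto uncharge_drop;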

> +
> +	spin_lock(&up->mem_lock);
> +	if (size < sk->sk_forward_alloc)
> +		goto unlock;

Strange label name, since you do "sk->sk_forward_alloc -= size;" before
unlock.

You could avoid the goto by :

	if (size >= sk->sk_forward_alloc) {
             amt = sk_mem_pages(size);
             ...
        }
        sk->sk_forward_alloc += delta;

> +
> +	amt = sk_mem_pages(size);
> +	delta = amt << SK_MEM_QUANTUM_SHIFT;
> +	if (!__sk_mem_raise_allocated(sk, delta, amt, SK_MEM_RECV)) {
> +		err = -ENOBUFS;
> +		spin_unlock(&up->mem_lock);
> +		goto uncharge_drop;
> +	}
> +
> +	sk->sk_forward_alloc += delta;
> +
> +unlock:
> +	sk->sk_forward_alloc -= size;
> +	spin_unlock(&up->mem_lock);
> +
> +	/* the skb owner in now the udp socket */
> +	skb_orphan(skb);
> +	skb->sk = sk;
> +	skb->destructor = udp_rmem_free;

> +
> +	skb->dev = NULL;
> +	sock_skb_set_dropcount(sk, skb);
> +
> +	spin_lock(&list->lock);

Since we acquire this lock anyway to store the packet,
why not protect sk_forward_alloc as well with this spinlock,
and get rid of up->mem_lock?

Ideally a single spin_lock() should be enough.

Then almost all other atomic ops could be transformed to non-atomic ops.
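
One possible shape for that consolidation (sketch only, untested; the skb
ownership/destructor setup from the posted patch would happen before taking
the lock):

	spin_lock(&list->lock);
	if (size >= sk->sk_forward_alloc) {
		amt = sk_mem_pages(size);
		delta = amt << SK_MEM_QUANTUM_SHIFT;
		if (!__sk_mem_raise_allocated(sk, delta, amt, SK_MEM_RECV)) {
			spin_unlock(&list->lock);
			err = -ENOBUFS;
			goto uncharge_drop;
		}
		sk->sk_forward_alloc += delta;
	}
	sk->sk_forward_alloc -= size;
	__skb_queue_tail(list, skb);
	spin_unlock(&list->lock);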


> +	__skb_queue_tail(list, skb);
> +	spin_unlock(&list->lock);
> +
> +	if (!sock_flag(sk, SOCK_DEAD))
> +		sk->sk_data_ready(sk);
> +
> +	return 0;
> +
> +uncharge_drop:
> +	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
> +
> +drop:
> +	atomic_inc(&sk->sk_drops);
> +	return err;
> +}
> +EXPORT_SYMBOL_GPL(__udp_enqueue_schedule_skb);
> +
> +static void udp_destruct_sock(struct sock *sk)
> +{
> +	/* reclaim completely the forward allocated memory */
> +	__skb_queue_purge(&sk->sk_receive_queue);
> +	udp_rmem_release(sk, 0, 0);
> +	inet_sock_destruct(sk);
> +}
> +
> +int udp_init_sock(struct sock *sk)
> +{
> +	sk->sk_destruct = udp_destruct_sock;
> +	spin_lock_init(&udp_sk(sk)->mem_lock);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(udp_init_sock);
> +
> +void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len)
> +{
> +	if (unlikely(READ_ONCE(sk->sk_peek_off) >= 0)) {
> +		bool slow = lock_sock_fast(sk);
> +
> +		sk_peek_offset_bwd(sk, len);
> +		unlock_sock_fast(sk, slow);
> +	}
> +	consume_skb(skb);
> +}
> +EXPORT_SYMBOL_GPL(skb_consume_udp);
> +
>  /**
>   *	first_packet_length	- return length of first packet in receive queue
>   *	@sk: socket

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v4 2/3] udp: implement memory accounting helpers
  2016-10-20  4:43   ` Eric Dumazet
@ 2016-10-20  7:02         ` Paolo Abeni
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2016-10-20  7:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David S. Miller, James Morris,
	Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet,
	Tom Herbert, Hannes Frederic Sowa, Edward Cree,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA

On Wed, 2016-10-19 at 21:43 -0700, Eric Dumazet wrote:
> On Wed, 2016-10-19 at 14:47 +0200, Paolo Abeni wrote:
> > Avoid using the generic helpers.
> > Add a new spinlock_t to the udp_sock structure and use
> > it to protect the memory accounting operation, both on
> > enqueue and on dequeue.
> > 
> > On dequeue perform partial memory reclaiming, trying to
> > leave a quantum of forward allocated memory.
> > 
> > On enqueue use a custom helper, to allow some optimizations:
> > - use a plain spin_lock() variant instead of the slightly
> >   costly spin_lock_irqsave(),
> > - avoid dst_force check, since the calling code has already
> >   dropped the skb dst
> > 
> > The above needs custom memory reclaiming on shutdown, provided
> > by the udp_destruct_sock().
> > 
> > v3 -> v4:
> > >   - reworked memory accounting, simplifying the schema
> > >   - provide a helper for both memory scheduling and enqueuing
> > 
> > v1 -> v2:
> > >   - use a udp specific destructor to perform memory reclaiming
> >   - remove a couple of helpers, unneeded after the above cleanup
> >   - do not reclaim memory on dequeue if not under memory
> >     pressure
> >   - reworked the fwd accounting schema to avoid potential
> >     integer overflow
> > 
> > Acked-by: Hannes Frederic Sowa <hannes-tFNcAqjVMyqKXQKiL6tip0B+6BGkLq7r@public.gmane.org>
> > Signed-off-by: Paolo Abeni <pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  include/linux/udp.h |   1 +
> >  include/net/udp.h   |   4 ++
> >  net/ipv4/udp.c      | 111 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 116 insertions(+)
> > 
> > diff --git a/include/linux/udp.h b/include/linux/udp.h
> > index d1fd8cd..8ea7f8b 100644
> > --- a/include/linux/udp.h
> > +++ b/include/linux/udp.h
> > @@ -66,6 +66,7 @@ struct udp_sock {
> >  #define UDPLITE_RECV_CC  0x4		/* set via udplite setsocktopt        */
> >  	__u8		 pcflag;        /* marks socket as UDP-Lite if > 0    */
> >  	__u8		 unused[3];
> > +	spinlock_t	 mem_lock;	/* protects memory accounting         */
> >  	/*
> >  	 * For encapsulation sockets.
> >  	 */
> > diff --git a/include/net/udp.h b/include/net/udp.h
> > index ea53a87..18f1e6b 100644
> > --- a/include/net/udp.h
> > +++ b/include/net/udp.h
> > @@ -246,6 +246,9 @@ static inline __be16 udp_flow_src_port(struct net *net, struct sk_buff *skb,
> >  }
> >  
> >  /* net/ipv4/udp.c */
> > +void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
> > +int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb);
> > +
> >  void udp_v4_early_demux(struct sk_buff *skb);
> >  int udp_get_port(struct sock *sk, unsigned short snum,
> >  		 int (*saddr_cmp)(const struct sock *,
> > @@ -258,6 +261,7 @@ int udp_get_port(struct sock *sk, unsigned short snum,
> >  void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
> >  int udp_rcv(struct sk_buff *skb);
> >  int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
> > +int udp_init_sock(struct sock *sk);
> >  int udp_disconnect(struct sock *sk, int flags);
> >  unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait);
> >  struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 7d96dc2..7047bb3 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1172,6 +1172,117 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
> >  	return ret;
> >  }
> >  
> > +static void udp_rmem_release(struct sock *sk, int size, int partial)
> > +{
> > +	struct udp_sock *up = udp_sk(sk);
> > +	int amt;
> > +
> > +	atomic_sub(size, &sk->sk_rmem_alloc);
> > +
> > +	spin_lock_bh(&up->mem_lock);
> > +	sk->sk_forward_alloc += size;
> > +	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
> > +	sk->sk_forward_alloc -= amt;
> > +	spin_unlock_bh(&up->mem_lock);
> > +
> > +	if (amt)
> > +		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
> > +}
> > +
> > +static void udp_rmem_free(struct sk_buff *skb)
> > +{
> > +	udp_rmem_release(skb->sk, skb->truesize, 1);
> > +}
> > +
> > +int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > +{
> > +	struct sk_buff_head *list = &sk->sk_receive_queue;
> > +	int rmem, delta, amt, err = -ENOMEM;
> > +	struct udp_sock *up = udp_sk(sk);
> > +	int size = skb->truesize;
> > +
> > +	/* try to avoid the costly atomic add/sub pair when the receive
> > +	 * queue is full
> > +	 */
> > +	if (atomic_read(&sk->sk_rmem_alloc) + size > sk->sk_rcvbuf)
> > +		goto drop;
> > +
> > +	rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > +	if (rmem > sk->sk_rcvbuf)
> > +		goto uncharge_drop;
> 
> This is dangerous : We used to accept one packet in receive queue, even
> if sk_rcvbuf is too small. (Check __sock_queue_rcv_skb())
> 
> With your code we might drop all packets even if receive queue is empty,
> this might add regressions for some apps.

Thank you for reviewing this! 

I missed that point. I'll take care of it (and of your other comments)
in v5.

Regards,

Paolo


^ permalink raw reply	[flat|nested] 12+ messages in thread


* Re: [PATCH net-next v4 2/3] udp: implement memory accounting helpers
  2016-10-20  4:43   ` Eric Dumazet
@ 2016-10-20 10:18         ` Paolo Abeni
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Abeni @ 2016-10-20 10:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David S. Miller, James Morris,
	Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet,
	Tom Herbert, Hannes Frederic Sowa, Edward Cree,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA

On Wed, 2016-10-19 at 21:43 -0700, Eric Dumazet wrote:
> On Wed, 2016-10-19 at 14:47 +0200, Paolo Abeni wrote:
> > Avoid using the generic helpers.
> > Add a new spinlock_t to the udp_sock structure and use
> > it to protect the memory accounting operation, both on
> > enqueue and on dequeue.
> > 
> > On dequeue perform partial memory reclaiming, trying to
> > leave a quantum of forward allocated memory.
> > 
> > On enqueue use a custom helper, to allow some optimizations:
> > - use a plain spin_lock() variant instead of the slightly
> >   costly spin_lock_irqsave(),
> > - avoid dst_force check, since the calling code has already
> >   dropped the skb dst
> > 
> > The above needs custom memory reclaiming on shutdown, provided
> > by the udp_destruct_sock().
> > 
> > v3 -> v4:
> > >   - reworked memory accounting, simplifying the schema
> > >   - provide a helper for both memory scheduling and enqueuing
> > 
> > v1 -> v2:
> > >   - use a udp specific destructor to perform memory reclaiming
> >   - remove a couple of helpers, unneeded after the above cleanup
> >   - do not reclaim memory on dequeue if not under memory
> >     pressure
> >   - reworked the fwd accounting schema to avoid potential
> >     integer overflow
> > 
> > Acked-by: Hannes Frederic Sowa <hannes-tFNcAqjVMyqKXQKiL6tip0B+6BGkLq7r@public.gmane.org>
> > Signed-off-by: Paolo Abeni <pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  include/linux/udp.h |   1 +
> >  include/net/udp.h   |   4 ++
> >  net/ipv4/udp.c      | 111 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 116 insertions(+)
> > 
> > diff --git a/include/linux/udp.h b/include/linux/udp.h
> > index d1fd8cd..8ea7f8b 100644
> > --- a/include/linux/udp.h
> > +++ b/include/linux/udp.h
> > @@ -66,6 +66,7 @@ struct udp_sock {
> >  #define UDPLITE_RECV_CC  0x4		/* set via udplite setsocktopt        */
> >  	__u8		 pcflag;        /* marks socket as UDP-Lite if > 0    */
> >  	__u8		 unused[3];
> > +	spinlock_t	 mem_lock;	/* protects memory accounting         */
> >  	/*
> >  	 * For encapsulation sockets.
> >  	 */
> > diff --git a/include/net/udp.h b/include/net/udp.h
> > index ea53a87..18f1e6b 100644
> > --- a/include/net/udp.h
> > +++ b/include/net/udp.h
> > @@ -246,6 +246,9 @@ static inline __be16 udp_flow_src_port(struct net *net, struct sk_buff *skb,
> >  }
> >  
> >  /* net/ipv4/udp.c */
> > +void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
> > +int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb);
> > +
> >  void udp_v4_early_demux(struct sk_buff *skb);
> >  int udp_get_port(struct sock *sk, unsigned short snum,
> >  		 int (*saddr_cmp)(const struct sock *,
> > @@ -258,6 +261,7 @@ int udp_get_port(struct sock *sk, unsigned short snum,
> >  void udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst);
> >  int udp_rcv(struct sk_buff *skb);
> >  int udp_ioctl(struct sock *sk, int cmd, unsigned long arg);
> > +int udp_init_sock(struct sock *sk);
> >  int udp_disconnect(struct sock *sk, int flags);
> >  unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait);
> >  struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 7d96dc2..7047bb3 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1172,6 +1172,117 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
> >  	return ret;
> >  }
> >  
> > +static void udp_rmem_release(struct sock *sk, int size, int partial)
> > +{
> > +	struct udp_sock *up = udp_sk(sk);
> > +	int amt;
> > +
> > +	atomic_sub(size, &sk->sk_rmem_alloc);
> > +
> > +	spin_lock_bh(&up->mem_lock);
> > +	sk->sk_forward_alloc += size;
> > +	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
> > +	sk->sk_forward_alloc -= amt;
> > +	spin_unlock_bh(&up->mem_lock);
> > +
> > +	if (amt)
> > +		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
> > +}
> > +
> > +static void udp_rmem_free(struct sk_buff *skb)
> > +{
> > +	udp_rmem_release(skb->sk, skb->truesize, 1);
> > +}
> > +
> > +int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> > +{
> > +	struct sk_buff_head *list = &sk->sk_receive_queue;
> > +	int rmem, delta, amt, err = -ENOMEM;
> > +	struct udp_sock *up = udp_sk(sk);
> > +	int size = skb->truesize;
> > +
> > +	/* try to avoid the costly atomic add/sub pair when the receive
> > +	 * queue is full
> > +	 */
> > +	if (atomic_read(&sk->sk_rmem_alloc) + size > sk->sk_rcvbuf)
> > +		goto drop;
> > +
> > +	rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > +	if (rmem > sk->sk_rcvbuf)
> > +		goto uncharge_drop;
> 
> This is dangerous : We used to accept one packet in receive queue, even
> if sk_rcvbuf is too small. (Check __sock_queue_rcv_skb())
> 
> With your code we might drop all packets even if receive queue is empty,
> this might add regressions for some apps.
> 
> > +
> > +	spin_lock(&up->mem_lock);
> > +	if (size < sk->sk_forward_alloc)
> > +		goto unlock;
> 
> Strange label name, since you do "sk->sk_forward_alloc -= size;" before
> unlock.
> 
> You could avoid the goto by :
> 
> 	if (size >= sk->sk_forward_alloc) {
>              amt = sk_mem_pages(size);
>              ...
>         }
>         sk->sk_forward_alloc += delta;
> 
> > +
> > +	amt = sk_mem_pages(size);
> > +	delta = amt << SK_MEM_QUANTUM_SHIFT;
> > +	if (!__sk_mem_raise_allocated(sk, delta, amt, SK_MEM_RECV)) {
> > +		err = -ENOBUFS;
> > +		spin_unlock(&up->mem_lock);
> > +		goto uncharge_drop;
> > +	}
> > +
> > +	sk->sk_forward_alloc += delta;
> > +
> > +unlock:
> > +	sk->sk_forward_alloc -= size;
> > +	spin_unlock(&up->mem_lock);
> > +
> > +	/* the skb owner is now the udp socket */
> > +	skb_orphan(skb);
> > +	skb->sk = sk;
> > +	skb->destructor = udp_rmem_free;
> 
> > +
> > +	skb->dev = NULL;
> > +	sock_skb_set_dropcount(sk, skb);
> > +
> > +	spin_lock(&list->lock);
> 
> Since we acquire this lock anyway to store the packet,
> why not protecting sk_forward_alloc as well with this spinlock,
> and get rid of up->mem_lock.
> 
> Ideally a single spin_lock() should be enough.

I thought that would hit us back, since udp_recvmsg() needs to acquire
it twice and we are increasing the bh lock scope, but a quick test
showed otherwise!!!
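
For reference, the two acquisitions would be something like the sketch below
(hypothetical: it assumes udp_rmem_release() is reworked to expect the receive
queue lock held instead of taking mem_lock itself, and that a packet is present):

	struct sk_buff *skb;

	/* 1st acquisition: unlink the skb from the receive queue */
	spin_lock_bh(&sk->sk_receive_queue.lock);
	skb = __skb_dequeue(&sk->sk_receive_queue);
	spin_unlock_bh(&sk->sk_receive_queue.lock);

	/* ... checksum and copy to user space ... */

	/* 2nd acquisition: return the forward allocated memory of the
	 * consumed skb under the same lock
	 */
	spin_lock_bh(&sk->sk_receive_queue.lock);
	udp_rmem_release(sk, skb->truesize, 1);
	spin_unlock_bh(&sk->sk_receive_queue.lock);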

> Then almost all other atomic ops could be transformed to non atomic ops.

I'm not sure which atomic operations you refer to ?!? AFAICS we still
need them to manipulate udp_memory_allocated, sk_rmem_alloc, and
sk_drops. Can you please hint which ones ?

Thank you,

Paolo



^ permalink raw reply	[flat|nested] 12+ messages in thread


end of thread, other threads:[~2016-10-20 10:18 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-19 12:46 [PATCH net-next v4 0/3] udp: refactor memory accounting Paolo Abeni
     [not found] ` <cover.1476877189.git.pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-10-19 12:47   ` [PATCH net-next v4 1/3] net/socket: factor out helpers for memory and queue manipulation Paolo Abeni
2016-10-19 12:47     ` Paolo Abeni
2016-10-19 12:47   ` [PATCH net-next v4 3/3] udp: use it's own memory accounting schema Paolo Abeni
2016-10-19 12:47     ` Paolo Abeni
2016-10-20  1:54     ` Eric Dumazet
2016-10-19 12:47 ` [PATCH net-next v4 2/3] udp: implement memory accounting helpers Paolo Abeni
2016-10-20  4:43   ` Eric Dumazet
     [not found]     ` <1476938622.5650.111.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
2016-10-20  7:02       ` Paolo Abeni
2016-10-20  7:02         ` Paolo Abeni
2016-10-20 10:18       ` Paolo Abeni
2016-10-20 10:18         ` Paolo Abeni
