* [PATCH net-next v3 0/3] udp: refactor memory accounting @ 2016-09-28 10:52 ` Paolo Abeni 0 siblings, 0 replies; 20+ messages in thread From: Paolo Abeni @ 2016-09-28 10:52 UTC (permalink / raw) To: netdev Cc: David S. Miller, James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa, Edward Cree, linux-nfs This patch series refactors the udp memory accounting, replacing the generic implementation with a custom one, in order to remove the need for locking the socket on the enqueue and dequeue operations. The socket backlog usage is dropped, as well. The first patch factors out core pieces of some queue and memory management socket helpers, so that they can later be used by the udp memory accounting functions. The second patch adds the memory accounting helpers, without using them. The third patch replaces the old rx memory accounting path for udp over ipv4 and udp over ipv6. In-kernel UDP users are updated, as well. The memory accounting schema is described in detail in the individual patch commit messages. The performance gain depends on the specific scenario; with few flows (and little contention in the original code) the differences are in the noise range, while with several flows contending for the same socket, the measured speed-up is relevant (e.g. 
even over 100% in case of extreme contention) v2 -> v3: - do not set the now unused backlog_rcv callback v1 -> v2: - slightly changed the memory accounting schema; we now perform lazy reclaim - fixed forward_alloc updating issue - fixed memory counter integer overflows Paolo Abeni (3): net/socket: factor out helpers for memory and queue manipulation udp: implement memory accounting helpers udp: use it's own memory accounting schema include/linux/udp.h | 2 + include/net/sock.h | 5 ++ include/net/udp.h | 7 ++ net/core/datagram.c | 36 ++++++---- net/core/sock.c | 96 ++++++++++++++++++--------- net/ipv4/udp.c | 179 ++++++++++++++++++++++++++++++++++++++++++-------- net/ipv6/udp.c | 33 ++++------ net/sunrpc/svcsock.c | 20 ++++-- net/sunrpc/xprtsock.c | 2 +- 9 files changed, 279 insertions(+), 101 deletions(-) -- 1.8.3.1 ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH net-next v3 1/3] net/socket: factor out helpers for memory and queue manipulation 2016-09-28 10:52 ` Paolo Abeni (?) @ 2016-09-28 10:52 ` Paolo Abeni -1 siblings, 0 replies; 20+ messages in thread From: Paolo Abeni @ 2016-09-28 10:52 UTC (permalink / raw) To: netdev Cc: David S. Miller, James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa, Edward Cree, linux-nfs Basic sock operations that udp code can use with its own memory accounting schema. No functional change is introduced in the existing APIs. v1 -> v2: - avoid export sock_rmem_free Signed-off-by: Paolo Abeni <pabeni@redhat.com> --- include/net/sock.h | 5 +++ net/core/datagram.c | 36 ++++++++++++-------- net/core/sock.c | 96 +++++++++++++++++++++++++++++++++++------------------ 3 files changed, 91 insertions(+), 46 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index ebf75db..7e42922 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1274,7 +1274,9 @@ static inline struct inode *SOCK_INODE(struct socket *socket) /* * Functions for memory accounting */ +int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind); int __sk_mem_schedule(struct sock *sk, int size, int kind); +void __sk_mem_reduce_allocated(struct sock *sk, int amount); void __sk_mem_reclaim(struct sock *sk, int amount); #define SK_MEM_QUANTUM ((int)PAGE_SIZE) @@ -1950,6 +1952,9 @@ void sk_reset_timer(struct sock *sk, struct timer_list *timer, void sk_stop_timer(struct sock *sk, struct timer_list *timer); +int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb, + unsigned int flags); +void __sock_enqueue_skb(struct sock *sk, struct sk_buff *skb); int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb); int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb); diff --git a/net/core/datagram.c b/net/core/datagram.c index b7de71f..bfb973a 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -323,6 +323,27 @@ void 
__skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb, int len) } EXPORT_SYMBOL(__skb_free_datagram_locked); +int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb, + unsigned int flags) +{ + int err = 0; + + if (flags & MSG_PEEK) { + err = -ENOENT; + spin_lock_bh(&sk->sk_receive_queue.lock); + if (skb == skb_peek(&sk->sk_receive_queue)) { + __skb_unlink(skb, &sk->sk_receive_queue); + atomic_dec(&skb->users); + err = 0; + } + spin_unlock_bh(&sk->sk_receive_queue.lock); + } + + atomic_inc(&sk->sk_drops); + return err; +} +EXPORT_SYMBOL(__sk_queue_drop_skb); + /** * skb_kill_datagram - Free a datagram skbuff forcibly * @sk: socket @@ -346,23 +367,10 @@ EXPORT_SYMBOL(__skb_free_datagram_locked); int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags) { - int err = 0; - - if (flags & MSG_PEEK) { - err = -ENOENT; - spin_lock_bh(&sk->sk_receive_queue.lock); - if (skb == skb_peek(&sk->sk_receive_queue)) { - __skb_unlink(skb, &sk->sk_receive_queue); - atomic_dec(&skb->users); - err = 0; - } - spin_unlock_bh(&sk->sk_receive_queue.lock); - } + int err = __sk_queue_drop_skb(sk, skb, flags); kfree_skb(skb); - atomic_inc(&sk->sk_drops); sk_mem_reclaim_partial(sk); - return err; } EXPORT_SYMBOL(skb_kill_datagram); diff --git a/net/core/sock.c b/net/core/sock.c index 038e660..102f4d1 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -405,24 +405,12 @@ static void sock_disable_timestamp(struct sock *sk, unsigned long flags) } -int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) +void __sock_enqueue_skb(struct sock *sk, struct sk_buff *skb) { unsigned long flags; struct sk_buff_head *list = &sk->sk_receive_queue; - if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf) { - atomic_inc(&sk->sk_drops); - trace_sock_rcvqueue_full(sk, skb); - return -ENOMEM; - } - - if (!sk_rmem_schedule(sk, skb, skb->truesize)) { - atomic_inc(&sk->sk_drops); - return -ENOBUFS; - } - skb->dev = NULL; - skb_set_owner_r(skb, sk); /* we escape from rcu 
protected region, make sure we dont leak * a norefcounted dst @@ -436,6 +424,24 @@ int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) if (!sock_flag(sk, SOCK_DEAD)) sk->sk_data_ready(sk); +} +EXPORT_SYMBOL(__sock_enqueue_skb); + +int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) +{ + if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf) { + atomic_inc(&sk->sk_drops); + trace_sock_rcvqueue_full(sk, skb); + return -ENOMEM; + } + + if (!sk_rmem_schedule(sk, skb, skb->truesize)) { + atomic_inc(&sk->sk_drops); + return -ENOBUFS; + } + + skb_set_owner_r(skb, sk); + __sock_enqueue_skb(sk, skb); return 0; } EXPORT_SYMBOL(__sock_queue_rcv_skb); @@ -2091,24 +2097,18 @@ int sk_wait_data(struct sock *sk, long *timeo, const struct sk_buff *skb) EXPORT_SYMBOL(sk_wait_data); /** - * __sk_mem_schedule - increase sk_forward_alloc and memory_allocated + * __sk_mem_raise_allocated - increase memory_allocated * @sk: socket * @size: memory size to allocate + * @amt: pages to allocate * @kind: allocation type * - * If kind is SK_MEM_SEND, it means wmem allocation. Otherwise it means - * rmem allocation. This function assumes that protocols which have - * memory_pressure use sk_wmem_queued as write buffer accounting. + * Similar to __sk_mem_schedule(), but does not update sk_forward_alloc */ -int __sk_mem_schedule(struct sock *sk, int size, int kind) +int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind) { struct proto *prot = sk->sk_prot; - int amt = sk_mem_pages(size); - long allocated; - - sk->sk_forward_alloc += amt * SK_MEM_QUANTUM; - - allocated = sk_memory_allocated_add(sk, amt); + long allocated = sk_memory_allocated_add(sk, amt); if (mem_cgroup_sockets_enabled && sk->sk_memcg && !mem_cgroup_charge_skmem(sk->sk_memcg, amt)) @@ -2169,9 +2169,6 @@ suppress_allocation: trace_sock_exceed_buf_limit(sk, prot, allocated); - /* Alas. Undo changes. 
*/ - sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM; - sk_memory_allocated_sub(sk, amt); if (mem_cgroup_sockets_enabled && sk->sk_memcg) @@ -2179,18 +2176,40 @@ suppress_allocation: return 0; } +EXPORT_SYMBOL(__sk_mem_raise_allocated); + +/** + * __sk_mem_schedule - increase sk_forward_alloc and memory_allocated + * @sk: socket + * @size: memory size to allocate + * @kind: allocation type + * + * If kind is SK_MEM_SEND, it means wmem allocation. Otherwise it means + * rmem allocation. This function assumes that protocols which have + * memory_pressure use sk_wmem_queued as write buffer accounting. + */ +int __sk_mem_schedule(struct sock *sk, int size, int kind) +{ + int ret, amt = sk_mem_pages(size); + + sk->sk_forward_alloc += amt << SK_MEM_QUANTUM_SHIFT; + ret = __sk_mem_raise_allocated(sk, size, amt, kind); + if (!ret) + sk->sk_forward_alloc -= amt << SK_MEM_QUANTUM_SHIFT; + return ret; +} EXPORT_SYMBOL(__sk_mem_schedule); /** - * __sk_mem_reclaim - reclaim memory_allocated + * __sk_mem_reduce_allocated - reclaim memory_allocated * @sk: socket - * @amount: number of bytes (rounded down to a SK_MEM_QUANTUM multiple) + * @amount: number of quanta + * + * Similar to __sk_mem_reclaim(), but does not update sk_forward_alloc */ -void __sk_mem_reclaim(struct sock *sk, int amount) +void __sk_mem_reduce_allocated(struct sock *sk, int amount) { - amount >>= SK_MEM_QUANTUM_SHIFT; sk_memory_allocated_sub(sk, amount); - sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT; if (mem_cgroup_sockets_enabled && sk->sk_memcg) mem_cgroup_uncharge_skmem(sk->sk_memcg, amount); @@ -2199,6 +2218,19 @@ void __sk_mem_reclaim(struct sock *sk, int amount) (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0))) sk_leave_memory_pressure(sk); } +EXPORT_SYMBOL(__sk_mem_reduce_allocated); + +/** + * __sk_mem_reclaim - reclaim sk_forward_alloc and memory_allocated + * @sk: socket + * @amount: number of bytes (rounded down to a SK_MEM_QUANTUM multiple) + */ +void __sk_mem_reclaim(struct sock *sk, 
int amount) +{ + amount >>= SK_MEM_QUANTUM_SHIFT; + sk->sk_forward_alloc -= amount << SK_MEM_QUANTUM_SHIFT; + __sk_mem_reduce_allocated(sk, amount); +} EXPORT_SYMBOL(__sk_mem_reclaim); int sk_set_peek_off(struct sock *sk, int val) -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH net-next v3 2/3] udp: implement memory accounting helpers 2016-09-28 10:52 ` Paolo Abeni (?) (?) @ 2016-09-28 10:52 ` Paolo Abeni 2016-09-29 1:42 ` Eric Dumazet -1 siblings, 1 reply; 20+ messages in thread From: Paolo Abeni @ 2016-09-28 10:52 UTC (permalink / raw) To: netdev Cc: David S. Miller, James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa, Edward Cree, linux-nfs Avoid usage of the common memory accounting functions, since the logic is quite different. To account for forward allocation, a couple of new atomic_t members are added to udp_sock: 'mem_allocated' and 'can_reclaim'. The current forward allocation is estimated as 'mem_allocated' minus 'sk_rmem_alloc'. When the forward allocation can't cope with the packet to be enqueued, 'mem_allocated' is incremented by the packet size rounded up to the next SK_MEM_QUANTUM. After a dequeue, if under memory pressure, we try to partially reclaim all forward allocated memory rounded down to an SK_MEM_QUANTUM and 'mem_allocated' is decreased by that amount. To protect against concurrent reclaim, we use 'can_reclaim' as a non-blocking synchronization point and let only one process do the work. sk->sk_forward_alloc is set after each memory update to the currently estimated forward allocation, without any lock or protection. This value is updated/maintained only to expose some semi-reasonable value to the eventual reader, and is guaranteed to be 0 at socket destruction time. The above needs custom memory reclaiming on shutdown, provided by the udp_destruct_sock() helper, which completely reclaims the allocated forward memory. As a consequence of the above schema, enqueue to sk_error_queue will cause the fwd value to be underestimated. 
v1 -> v2: - use a udp specific destructor to perform memory reclaiming - remove a couple of helpers, unneeded after the above cleanup - do not reclaim memory on dequeue if not under memory pressure - reworked the fwd accounting schema to avoid potential integer overflow Signed-off-by: Paolo Abeni <pabeni@redhat.com> --- include/linux/udp.h | 2 + include/net/udp.h | 7 +++ net/ipv4/udp.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 146 insertions(+) diff --git a/include/linux/udp.h b/include/linux/udp.h index d1fd8cd..16aa22b 100644 --- a/include/linux/udp.h +++ b/include/linux/udp.h @@ -42,6 +42,8 @@ static inline u32 udp_hashfn(const struct net *net, u32 num, u32 mask) struct udp_sock { /* inet_sock has to be the first member */ struct inet_sock inet; + atomic_t mem_allocated; + atomic_t can_reclaim; #define udp_port_hash inet.sk.__sk_common.skc_u16hashes[0] #define udp_portaddr_hash inet.sk.__sk_common.skc_u16hashes[1] #define udp_portaddr_node inet.sk.__sk_common.skc_portaddr_node diff --git a/include/net/udp.h b/include/net/udp.h index ea53a87..b9563ef 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -98,6 +98,8 @@ static inline struct udp_hslot *udp_hashslot2(struct udp_table *table, extern struct proto udp_prot; extern atomic_long_t udp_memory_allocated; +extern int udp_memory_pressure; +extern struct percpu_counter udp_sockets_allocated; /* sysctl variables for udp */ extern long sysctl_udp_mem[3]; @@ -246,6 +248,10 @@ static inline __be16 udp_flow_src_port(struct net *net, struct sk_buff *skb, } /* net/ipv4/udp.c */ +void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len); +int udp_rmem_schedule(struct sock *sk, struct sk_buff *skb); +void udp_enter_memory_pressure(struct sock *sk); + void udp_v4_early_demux(struct sk_buff *skb); int udp_get_port(struct sock *sk, unsigned short snum, int (*saddr_cmp)(const struct sock *, @@ -258,6 +264,7 @@ void udp_flush_pending_frames(struct sock *sk); void 
udp4_hwcsum(struct sk_buff *skb, __be32 src, __be32 dst); int udp_rcv(struct sk_buff *skb); int udp_ioctl(struct sock *sk, int cmd, unsigned long arg); +int udp_init_sock(struct sock *sk); int udp_disconnect(struct sock *sk, int flags); unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait); struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb, diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 7d96dc2..2218901 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -131,6 +131,12 @@ EXPORT_SYMBOL(sysctl_udp_wmem_min); atomic_long_t udp_memory_allocated; EXPORT_SYMBOL(udp_memory_allocated); +int udp_memory_pressure __read_mostly; +EXPORT_SYMBOL(udp_memory_pressure); + +struct percpu_counter udp_sockets_allocated; +EXPORT_SYMBOL(udp_sockets_allocated); + #define MAX_UDP_PORTS 65536 #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN) @@ -1172,6 +1178,137 @@ out: return ret; } +static int __udp_forward(struct udp_sock *up, int rmem) +{ + return atomic_read(&up->mem_allocated) - rmem; +} + +static bool udp_under_memory_pressure(const struct sock *sk) +{ + if (mem_cgroup_sockets_enabled && sk->sk_memcg && + mem_cgroup_under_socket_pressure(sk->sk_memcg)) + return true; + + return READ_ONCE(udp_memory_pressure); +} + +void udp_enter_memory_pressure(struct sock *sk) +{ + WRITE_ONCE(udp_memory_pressure, 1); +} +EXPORT_SYMBOL(udp_enter_memory_pressure); + +/* if partial != 0 do nothing if not under memory pressure and avoid + * reclaiming last quanta + */ +static void udp_rmem_release(struct sock *sk, int partial) +{ + struct udp_sock *up = udp_sk(sk); + int fwd, amt; + + if (partial && !udp_under_memory_pressure(sk)) + return; + + /* we can have concurrent release; if we catch any conflict + * we let only one of them do the work + */ + if (atomic_dec_if_positive(&up->can_reclaim) < 0) + return; + + fwd = __udp_forward(up, atomic_read(&sk->sk_rmem_alloc)); + if (fwd < SK_MEM_QUANTUM + partial) { + atomic_inc(&up->can_reclaim); + return; 
+ } + + amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1); + atomic_sub(amt, &up->mem_allocated); + atomic_inc(&up->can_reclaim); + + __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT); + sk->sk_forward_alloc = fwd - amt; +} + +static void udp_rmem_free(struct sk_buff *skb) +{ + struct sock *sk = skb->sk; + + atomic_sub(skb->truesize, &sk->sk_rmem_alloc); + udp_rmem_release(sk, 1); +} + +int udp_rmem_schedule(struct sock *sk, struct sk_buff *skb) +{ + int fwd, amt, delta, rmem, err = -ENOMEM; + struct udp_sock *up = udp_sk(sk); + + rmem = atomic_add_return(skb->truesize, &sk->sk_rmem_alloc); + if (rmem > sk->sk_rcvbuf) + goto drop; + + fwd = __udp_forward(up, rmem); + if (fwd > 0) + goto no_alloc; + + amt = sk_mem_pages(skb->truesize); + delta = amt << SK_MEM_QUANTUM_SHIFT; + if (!__sk_mem_raise_allocated(sk, delta, amt, SK_MEM_RECV)) { + err = -ENOBUFS; + goto drop; + } + + /* if we have some skbs in the error queue, the forward allocation could + * be understimated, even below 0; avoid exporting such values + */ + fwd = atomic_add_return(delta, &up->mem_allocated) - rmem; + if (fwd < 0) + fwd = SK_MEM_QUANTUM; + +no_alloc: + sk->sk_forward_alloc = fwd; + skb_orphan(skb); + skb->sk = sk; + skb->destructor = udp_rmem_free; + return 0; + +drop: + atomic_sub(skb->truesize, &sk->sk_rmem_alloc); + atomic_inc(&sk->sk_drops); + return err; +} +EXPORT_SYMBOL_GPL(udp_rmem_schedule); + +static void udp_destruct_sock(struct sock *sk) +{ + /* reclaim completely the forward allocated memory */ + __skb_queue_purge(&sk->sk_receive_queue); + udp_rmem_release(sk, 0); + inet_sock_destruct(sk); +} + +int udp_init_sock(struct sock *sk) +{ + struct udp_sock *up = udp_sk(sk); + + atomic_set(&up->mem_allocated, 0); + atomic_set(&up->can_reclaim, 1); + sk->sk_destruct = udp_destruct_sock; + return 0; +} +EXPORT_SYMBOL_GPL(udp_init_sock); + +void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len) +{ + if (unlikely(READ_ONCE(sk->sk_peek_off) >= 0)) { + bool slow = 
lock_sock_fast(sk); + + sk_peek_offset_bwd(sk, len); + unlock_sock_fast(sk, slow); + } + consume_skb(skb); +} +EXPORT_SYMBOL_GPL(skb_consume_udp); + /** * first_packet_length - return length of first packet in receive queue * @sk: socket -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers 2016-09-28 10:52 ` [PATCH net-next v3 2/3] udp: implement memory accounting helpers Paolo Abeni @ 2016-09-29 1:42 ` Eric Dumazet [not found] ` <1475113378.28155.124.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org> 0 siblings, 1 reply; 20+ messages in thread From: Eric Dumazet @ 2016-09-29 1:42 UTC (permalink / raw) To: Paolo Abeni Cc: netdev, David S. Miller, James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa, Edward Cree, linux-nfs On Wed, 2016-09-28 at 12:52 +0200, Paolo Abeni wrote: > +static void udp_rmem_release(struct sock *sk, int partial) > +{ > + struct udp_sock *up = udp_sk(sk); > + int fwd, amt; > + > + if (partial && !udp_under_memory_pressure(sk)) > + return; > + > + /* we can have concurrent release; if we catch any conflict > + * we let only one of them do the work > + */ > + if (atomic_dec_if_positive(&up->can_reclaim) < 0) > + return; > + > + fwd = __udp_forward(up, atomic_read(&sk->sk_rmem_alloc)); > + if (fwd < SK_MEM_QUANTUM + partial) { > + atomic_inc(&up->can_reclaim); > + return; > + } > + > + amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1); > + atomic_sub(amt, &up->mem_allocated); > + atomic_inc(&up->can_reclaim); > + > + __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT); > + sk->sk_forward_alloc = fwd - amt; > +} This is racy... all these atomics make me nervous... ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers @ 2016-09-29 7:34 ` Paolo Abeni 0 siblings, 0 replies; 20+ messages in thread From: Paolo Abeni @ 2016-09-29 7:34 UTC (permalink / raw) To: Eric Dumazet Cc: netdev, David S. Miller, James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa, Edward Cree, linux-nfs Hi Eric, On Wed, 2016-09-28 at 18:42 -0700, Eric Dumazet wrote: > On Wed, 2016-09-28 at 12:52 +0200, Paolo Abeni wrote: > > > +static void udp_rmem_release(struct sock *sk, int partial) > > +{ > > + struct udp_sock *up = udp_sk(sk); > > + int fwd, amt; > > + > > + if (partial && !udp_under_memory_pressure(sk)) > > + return; > > + > > + /* we can have concurrent release; if we catch any conflict > > + * we let only one of them do the work > > + */ > > + if (atomic_dec_if_positive(&up->can_reclaim) < 0) > > + return; > > + > > + fwd = __udp_forward(up, atomic_read(&sk->sk_rmem_alloc)); > > + if (fwd < SK_MEM_QUANTUM + partial) { > > + atomic_inc(&up->can_reclaim); > > + return; > > + } > > + > > + amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1); > > + atomic_sub(amt, &up->mem_allocated); > > + atomic_inc(&up->can_reclaim); > > + > > + __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT); > > + sk->sk_forward_alloc = fwd - amt; > > +} Thank you for reviewing this! > This is racy... Could you please elaborate? > all these atomics make me nervous... I'd like to drop some of them if possible. atomic_inc(&up->can_reclaim); could probably be replaced with atomic_set(&up->can_reclaim, 1) since we don't have concurrent processes doing that and can_reclaim.counter is known to be 0 at that point. Performance wise the impact is minimal, since in normal condition we do the reclaim only on socket shutdown. Paolo ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers @ 2016-09-29 9:31 ` Paolo Abeni 0 siblings, 0 replies; 20+ messages in thread From: Paolo Abeni @ 2016-09-29 9:31 UTC (permalink / raw) To: Eric Dumazet Cc: netdev, David S. Miller, James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa, Edward Cree, linux-nfs On Wed, 2016-09-28 at 18:42 -0700, Eric Dumazet wrote: > On Wed, 2016-09-28 at 12:52 +0200, Paolo Abeni wrote: > > > +static void udp_rmem_release(struct sock *sk, int partial) > > +{ > > + struct udp_sock *up = udp_sk(sk); > > + int fwd, amt; > > + > > + if (partial && !udp_under_memory_pressure(sk)) > > + return; > > + > > + /* we can have concurrent release; if we catch any conflict > > + * we let only one of them do the work > > + */ > > + if (atomic_dec_if_positive(&up->can_reclaim) < 0) > > + return; > > + > > + fwd = __udp_forward(up, atomic_read(&sk->sk_rmem_alloc)); > > + if (fwd < SK_MEM_QUANTUM + partial) { > > + atomic_inc(&up->can_reclaim); > > + return; > > + } > > + > > + amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1); > > + atomic_sub(amt, &up->mem_allocated); > > + atomic_inc(&up->can_reclaim); > > + > > + __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT); > > + sk->sk_forward_alloc = fwd - amt; > > +} > > > This is racy... all these atomics make me nervous... Ah, perhaps I got it: if we have a concurrent memory scheduling, we could end up with a value of mem_allocated below the real need. That mismatch will not drift: at worst we can end up with mem_allocated being single SK_MEM_QUANTUM below what is strictly needed. 
A possible alternative could be: static void udp_rmem_release(struct sock *sk, int partial) { struct udp_sock *up = udp_sk(sk); int fwd, amt, alloc_old, alloc; if (partial && !udp_under_memory_pressure(sk)) return; alloc = atomic_read(&up->mem_allocated); fwd = alloc - atomic_read(&sk->sk_rmem_alloc); if (fwd < SK_MEM_QUANTUM + partial) return; amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1); alloc_old = atomic_cmpxchg(&up->mem_allocated, alloc, alloc - amt); /* if a concurrent update is detected, just do nothing; if said update * is due to another memory release, that release take care of * reclaiming the memory for us, too. * Otherwise we will be able to release on later dequeue, since * we will eventually stop colliding with the writer when it will * consume all the fwd allocated memory */ if (alloc_old != alloc) return; __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT); sk->sk_forward_alloc = fwd - amt; } which is even more lazy in reclaiming but should never underestimate the needed forward allocation, and under pressure should eventually free the needed memory. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers 2016-09-29 9:31 ` Paolo Abeni @ 2016-09-29 13:24 ` Eric Dumazet -1 siblings, 0 replies; 20+ messages in thread From: Eric Dumazet @ 2016-09-29 13:24 UTC (permalink / raw) To: Paolo Abeni Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David S. Miller, James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa, Edward Cree, linux-nfs-u79uwXL29TY76Z2rM5mHXA On Thu, 2016-09-29 at 11:31 +0200, Paolo Abeni wrote: > On Wed, 2016-09-28 at 18:42 -0700, Eric Dumazet wrote: > > On Wed, 2016-09-28 at 12:52 +0200, Paolo Abeni wrote: > > > > > +static void udp_rmem_release(struct sock *sk, int partial) > > > +{ > > > + struct udp_sock *up = udp_sk(sk); > > > + int fwd, amt; > > > + > > > + if (partial && !udp_under_memory_pressure(sk)) > > > + return; > > > + > > > + /* we can have concurrent release; if we catch any conflict > > > + * we let only one of them do the work > > > + */ > > > + if (atomic_dec_if_positive(&up->can_reclaim) < 0) > > > + return; > > > + > > > + fwd = __udp_forward(up, atomic_read(&sk->sk_rmem_alloc)); > > > + if (fwd < SK_MEM_QUANTUM + partial) { > > > + atomic_inc(&up->can_reclaim); > > > + return; > > > + } > > > + > > > + amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1); > > > + atomic_sub(amt, &up->mem_allocated); > > > + atomic_inc(&up->can_reclaim); > > > + > > > + __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT); > > > + sk->sk_forward_alloc = fwd - amt; > > > +} > > > > > > This is racy... all these atomics make me nervous... > > Ah, perhaps I got it: if we have a concurrent memory scheduling, we > could end up with a value of mem_allocated below the real need. > > That mismatch will not drift: at worst we can end up with mem_allocated > being single SK_MEM_QUANTUM below what is strictly needed. 
> > A possible alternative could be: > > static void udp_rmem_release(struct sock *sk, int partial) > { > struct udp_sock *up = udp_sk(sk); > int fwd, amt, alloc_old, alloc; > > if (partial && !udp_under_memory_pressure(sk)) > return; > > alloc = atomic_read(&up->mem_allocated); > fwd = alloc - atomic_read(&sk->sk_rmem_alloc); > if (fwd < SK_MEM_QUANTUM + partial) > return; > > amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1); > alloc_old = atomic_cmpxchg(&up->mem_allocated, alloc, alloc - amt); > /* if a concurrent update is detected, just do nothing; if said update > * is due to another memory release, that release take care of > * reclaiming the memory for us, too. > * Otherwise we will be able to release on later dequeue, since > * we will eventually stop colliding with the writer when it will > * consume all the fwd allocated memory > */ > if (alloc_old != alloc) > return; > > __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT); > sk->sk_forward_alloc = fwd - amt; Can still be done from multiple cpus. Add some ndelay() or udelay() beforehand to simulate the fact that the current cpu could be interrupted by an NMI handler (perf, for example)... or a hard IRQ handler... Then make sure your tests involve 16 concurrent cpus dealing with one udp socket... > } > > which is even more lazy in reclaiming but should never underestimate the > needed forward allocation, and under pressure should eventually free the > needed memory. If this code is rarely used, why don't you simply use a real spinlock, so that we do not have to worry about all this? A spinlock acquisition/release is a _single_ locked operation. Faster than the 3 atomics you got in the last version. spinlock code (ticket or MCS) avoids starvation. Then, you can safely update multiple fields in the socket. And you get nice lockdep support as a bonus. cmpxchg() is fine when a single field needs an exclusion.
But there you have multiple fields to update at once: sk_memory_allocated_add() and sk_memory_allocated_sub() can work using atomic_long_add_return() and atomic_long_sub() because their caller owns the socket lock and can safely update sk->sk_forward_alloc without additional locking, but UDP won't have this luxury after your patches. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers 2016-09-29 13:24 ` Eric Dumazet @ 2016-09-29 14:01 ` Paolo Abeni -1 siblings, 0 replies; 20+ messages in thread From: Paolo Abeni @ 2016-09-29 14:01 UTC (permalink / raw) To: Eric Dumazet Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David S. Miller, James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa, Edward Cree, linux-nfs-u79uwXL29TY76Z2rM5mHXA On Thu, 2016-09-29 at 06:24 -0700, Eric Dumazet wrote: > On Thu, 2016-09-29 at 11:31 +0200, Paolo Abeni wrote: > > On Wed, 2016-09-28 at 18:42 -0700, Eric Dumazet wrote: > > > On Wed, 2016-09-28 at 12:52 +0200, Paolo Abeni wrote: > > > > > > > +static void udp_rmem_release(struct sock *sk, int partial) > > > > +{ > > > > + struct udp_sock *up = udp_sk(sk); > > > > + int fwd, amt; > > > > + > > > > + if (partial && !udp_under_memory_pressure(sk)) > > > > + return; > > > > + > > > > + /* we can have concurrent release; if we catch any conflict > > > > + * we let only one of them do the work > > > > + */ > > > > + if (atomic_dec_if_positive(&up->can_reclaim) < 0) > > > > + return; > > > > + > > > > + fwd = __udp_forward(up, atomic_read(&sk->sk_rmem_alloc)); > > > > + if (fwd < SK_MEM_QUANTUM + partial) { > > > > + atomic_inc(&up->can_reclaim); > > > > + return; > > > > + } > > > > + > > > > + amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1); > > > > + atomic_sub(amt, &up->mem_allocated); > > > > + atomic_inc(&up->can_reclaim); > > > > + > > > > + __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT); > > > > + sk->sk_forward_alloc = fwd - amt; > > > > +} > > > > > > > > > This is racy... all these atomics make me nervous... > > > > Ah, perhaps I got it: if we have a concurrent memory scheduling, we > > could end up with a value of mem_allocated below the real need. 
> > > > That mismatch will not drift: at worst we can end up with mem_allocated > > being single SK_MEM_QUANTUM below what is strictly needed. > > > > A possible alternative could be: > > > > static void udp_rmem_release(struct sock *sk, int partial) > > { > > struct udp_sock *up = udp_sk(sk); > > int fwd, amt, alloc_old, alloc; > > > > if (partial && !udp_under_memory_pressure(sk)) > > return; > > > > alloc = atomic_read(&up->mem_allocated); > > fwd = alloc - atomic_read(&sk->sk_rmem_alloc); > > if (fwd < SK_MEM_QUANTUM + partial) > > return; > > > > amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1); > > alloc_old = atomic_cmpxchg(&up->mem_allocated, alloc, alloc - amt); > > /* if a concurrent update is detected, just do nothing; if said update > > * is due to another memory release, that release take care of > > * reclaiming the memory for us, too. > > * Otherwise we will be able to release on later dequeue, since > > * we will eventually stop colliding with the writer when it will > > * consume all the fwd allocated memory > > */ > > if (alloc_old != alloc) > > return; > > > > __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT); > > sk->sk_forward_alloc = fwd - amt; > > Can still be done from multiple cpus. > > Add some ndelay() or udelay() before to simulate fact that current cpu > could be interrupted by an NMI handler (perf for example)... or hard IRQ > handler... > > Then make sure your tests involve 16 concurrent cpus dealing with one > udp socket... Thank you again for reviewing this. I'm working on this sort of test right now. > > > } > > > > which is even more lazy in reclaiming but should never underestimate the > > needed forward allocation, and under pressure should eventually free the > > needed memory. > > > If this code is rarely used, why don't you simply use a real spinlock, > so that we do not have to worry about all this ? > > A spinlock acquisition/release is a _single_ locked operation. > Faster than the 3 atomic you got in last version. 
> spinlock code (ticket or MCS) avoids starvation. I'd like to avoid adding a lock, if possible, to avoid any possible source of contention. > Then, you can safely update multiple fields in the socket. > > And you get nice lockdep support as a bonus. > > cmpxchg() is fine when a single field need an exclusion. But there you > have multiple fields to update at once : > > sk_memory_allocated_add() and sk_memory_allocated_sub() can work using > atomic_long_add_return() and atomic_long_sub() because their caller owns > the socket lock and can safely update sk->sk_forward_alloc without > additional locking, but UDP wont have this luxury after your patches. When we reach __sk_mem_reduce_allocated() we are sure we can free the specified amount of memory, so we only need to ensure consistent sk_prot->memory_allocated updates. The current atomic operation suffices for this. Paolo ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers 2016-09-29 14:01 ` Paolo Abeni @ 2016-09-29 14:13 ` Eric Dumazet -1 siblings, 0 replies; 20+ messages in thread From: Eric Dumazet @ 2016-09-29 14:13 UTC (permalink / raw) To: Paolo Abeni Cc: netdev-u79uwXL29TY76Z2rM5mHXA, David S. Miller, James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa, Edward Cree, linux-nfs-u79uwXL29TY76Z2rM5mHXA On Thu, 2016-09-29 at 16:01 +0200, Paolo Abeni wrote: > When we reach __sk_mem_reduce_allocated() we are sure we can free the > specified amount of memory, so we only need to ensure consistent > sk_prot->memory_allocated updates. The current atomic operation suffices > to this. Then why are you updating sk->sk_forward_alloc using racy operations? If this is not needed or racy, do not do it, so that we can remove this annoying thing that a dynamic checker could very well detect. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers 2016-09-29 14:13 ` Eric Dumazet (?) @ 2016-09-29 14:34 ` Paolo Abeni 2016-09-29 14:49 ` Eric Dumazet -1 siblings, 1 reply; 20+ messages in thread From: Paolo Abeni @ 2016-09-29 14:34 UTC (permalink / raw) To: Eric Dumazet Cc: netdev, David S. Miller, James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa, Edward Cree, linux-nfs On Thu, 2016-09-29 at 07:13 -0700, Eric Dumazet wrote: > On Thu, 2016-09-29 at 16:01 +0200, Paolo Abeni wrote: > > > When we reach __sk_mem_reduce_allocated() we are sure we can free the > > specified amount of memory, so we only need to ensure consistent > > sk_prot->memory_allocated updates. The current atomic operation suffices > > to this. > > Then why are you updating sk->sk_forward_alloc using racy operations ? > > If this is not needed or racy, do not do it. Thank you for all the feedback. The actual forward allocated memory value is: atomic_read(&up->mem_allocated) - atomic_read(&sk->sk_rmem_alloc). sk_forward_alloc is updated only to hint to the user space the forward allocated memory value via the diag interface. If such information is not needed we can drop the update, and sk_forward_alloc will always be seen as 0 even when the socket has some forward allocation. Cheers, Paolo ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers 2016-09-29 14:34 ` Paolo Abeni @ 2016-09-29 14:49 ` Eric Dumazet 2016-09-29 14:59 ` Paolo Abeni 0 siblings, 1 reply; 20+ messages in thread From: Eric Dumazet @ 2016-09-29 14:49 UTC (permalink / raw) To: Paolo Abeni Cc: Eric Dumazet, netdev, David S. Miller, James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann, Tom Herbert, Hannes Frederic Sowa, Edward Cree, linux-nfs On Thu, Sep 29, 2016 at 7:34 AM, Paolo Abeni <pabeni@redhat.com> wrote: > On Thu, 2016-09-29 at 07:13 -0700, Eric Dumazet wrote: >> On Thu, 2016-09-29 at 16:01 +0200, Paolo Abeni wrote: >> >> > When we reach __sk_mem_reduce_allocated() we are sure we can free the >> > specified amount of memory, so we only need to ensure consistent >> > sk_prot->memory_allocated updates. The current atomic operation suffices >> > to this. >> >> Then why are you updating sk->sk_forward_alloc using racy operations ? >> >> If this is not needed or racy, do not do it. > > Thank you for all the feedback. > > The actual forward allocated memory value is: > > atomic_read(&up->mem_allocated) - atomic_read(&sk->sk_rmem_alloc). > > sk_forward_alloc is updated only to hint to the user space the forward > allocated memory value via the diag interface. > > If such information is not needed we can drop the update, and > sk_forward_alloc will always be seen as 0 even when the socket has some > forward allocation. The information is needed and we want an accurate one, really. Think about debugging on a live server, some stuck or mad sockets ;) Please consider adding a proper accessor, able to deal with the UDP peculiarities. 
int sk_forward_alloc_get(const struct sock *sk) { if (sk is not UDP) return sk->sk_forward_alloc; return the precise amount using the private fields that UDP maintains, } Then use this accessor in these places : net/ipv4/af_inet.c:155: WARN_ON(sk->sk_forward_alloc); net/core/sock_diag.c:66: mem[SK_MEMINFO_FWD_ALLOC] = sk->sk_forward_alloc; net/ipv4/inet_diag.c:191: .idiag_fmem = sk->sk_forward_alloc, net/sched/em_meta.c:462: dst->value = sk->sk_forward_alloc; Thanks. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers 2016-09-29 14:49 ` Eric Dumazet @ 2016-09-29 14:59 ` Paolo Abeni 0 siblings, 0 replies; 20+ messages in thread From: Paolo Abeni @ 2016-09-29 14:59 UTC (permalink / raw) To: Eric Dumazet Cc: Eric Dumazet, netdev, David S. Miller, James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann, Tom Herbert, Hannes Frederic Sowa, Edward Cree, linux-nfs On Thu, 2016-09-29 at 07:49 -0700, Eric Dumazet wrote: > On Thu, Sep 29, 2016 at 7:34 AM, Paolo Abeni <pabeni@redhat.com> wrote: > > On Thu, 2016-09-29 at 07:13 -0700, Eric Dumazet wrote: > >> On Thu, 2016-09-29 at 16:01 +0200, Paolo Abeni wrote: > >> > >> > When we reach __sk_mem_reduce_allocated() we are sure we can free the > >> > specified amount of memory, so we only need to ensure consistent > >> > sk_prot->memory_allocated updates. The current atomic operation suffices > >> > to this. > >> > >> Then why are you updating sk->sk_forward_alloc using racy operations ? > >> > >> If this is not needed or racy, do not do it. > > > > Thank you for all the feedback. > > > > The actual forward allocated memory value is: > > > > atomic_read(&up->mem_allocated) - atomic_read(&sk->sk_rmem_alloc). > > > > sk_forward_alloc is updated only to hint to the user space the forward > > allocated memory value via the diag interface. > > > > If such information is not needed we can drop the update, and > > sk_forward_alloc will always be seen as 0 even when the socket has some > > forward allocation. > > The information is needed and we want an accurate one, really. > > Think about debugging on a live server, some stuck or mad sockets ;) > > Please consider adding a proper accessor, able to deal with the UDP > peculiarities. Nice suggestion, thanks! I'll try that in v4. Perhaps it is worth adding a struct proto helper for this? I'm sorry, but I'll not be in Tokyo, so I'll probably produce some traffic on netdev in the meantime. 
Cheers, Paolo ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH net-next v3 3/3] udp: use it's own memory accounting schema 2016-09-28 10:52 ` Paolo Abeni @ 2016-09-28 10:52 ` Paolo Abeni -1 siblings, 0 replies; 20+ messages in thread From: Paolo Abeni @ 2016-09-28 10:52 UTC (permalink / raw) To: netdev-u79uwXL29TY76Z2rM5mHXA Cc: David S. Miller, James Morris, Trond Myklebust, Alexander Duyck, Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa, Edward Cree, linux-nfs-u79uwXL29TY76Z2rM5mHXA Completely avoid default sock memory accounting and replace it with udp-specific accounting. The udp code now tracks the memory pressure status, to optimize memory reclaiming. Since the new memory accounting model does not require socket locking, remove the lock on enqueue and free and avoid using the backlog on enqueue. Be sure to clean-up rx queue memory on socket destruction, using udp its own sk_destruct. Tested using pktgen with random src port, 64 bytes packet, wire-speed on a 10G link as sender and udp_sink as the receiver, using an l4 tuple rxhash to stress the contention, and one or more udp_sink instances with reuseport. nr readers Kpps (vanilla) Kpps (patched) 1 150 600 3 1300 2300 6 3100 3900 9 4300 4650 12 5800 6550 v2 -> v3: - do not set the now unsed backlog_rcv callback v1 -> v2: - add memory pressure support - fixed dropwatch accounting for ipv6 Signed-off-by: Paolo Abeni <pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- net/ipv4/udp.c | 42 +++++++++++++----------------------------- net/ipv6/udp.c | 33 ++++++++++++--------------------- net/sunrpc/svcsock.c | 20 ++++++++++++++++---- net/sunrpc/xprtsock.c | 2 +- 4 files changed, 42 insertions(+), 55 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 2218901..c05bf5d 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1338,13 +1338,8 @@ static int first_packet_length(struct sock *sk) res = skb ? 
* [PATCH net-next v3 3/3] udp: use it's own memory accounting schema
@ 2016-09-28 10:52 ` Paolo Abeni
  0 siblings, 0 replies; 20+ messages in thread
From: Paolo Abeni @ 2016-09-28 10:52 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, James Morris, Trond Myklebust, Alexander Duyck,
      Daniel Borkmann, Eric Dumazet, Tom Herbert, Hannes Frederic Sowa,
      Edward Cree, linux-nfs

Completely avoid the default sock memory accounting and replace it
with udp-specific accounting.

The udp code now tracks the memory pressure status, to optimize
memory reclaiming.

Since the new memory accounting model does not require socket locking,
remove the lock on enqueue and free and avoid using the backlog on
enqueue.

Be sure to clean up the rx queue memory on socket destruction, using
udp's own sk_destruct.

Tested using pktgen with random src port, 64-byte packets, wire-speed
on a 10G link as sender and udp_sink as the receiver, using an l4
tuple rxhash to stress the contention, and one or more udp_sink
instances with reuseport.

nr readers      Kpps (vanilla)  Kpps (patched)
1               150             600
3               1300            2300
6               3100            3900
9               4300            4650
12              5800            6550

v2 -> v3:
 - do not set the now unused backlog_rcv callback

v1 -> v2:
 - add memory pressure support
 - fixed dropwatch accounting for ipv6

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/udp.c        | 42 +++++++++++++-----------------------------
 net/ipv6/udp.c        | 33 ++++++++++++---------------------
 net/sunrpc/svcsock.c  | 20 ++++++++++++++++----
 net/sunrpc/xprtsock.c |  2 +-
 4 files changed, 42 insertions(+), 55 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 2218901..c05bf5d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1338,13 +1338,8 @@ static int first_packet_length(struct sock *sk)
 	res = skb ? skb->len : -1;
 	spin_unlock_bh(&rcvq->lock);
 
-	if (!skb_queue_empty(&list_kill)) {
-		bool slow = lock_sock_fast(sk);
-
+	if (!skb_queue_empty(&list_kill))
 		__skb_queue_purge(&list_kill);
-		sk_mem_reclaim_partial(sk);
-		unlock_sock_fast(sk, slow);
-	}
 	return res;
 }
 
@@ -1393,7 +1388,6 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 	int err;
 	int is_udplite = IS_UDPLITE(sk);
 	bool checksum_valid = false;
-	bool slow;
 
 	if (flags & MSG_ERRQUEUE)
 		return ip_recv_error(sk, msg, len, addr_len);
@@ -1434,13 +1428,12 @@ try_again:
 	}
 
 	if (unlikely(err)) {
-		trace_kfree_skb(skb, udp_recvmsg);
 		if (!peeked) {
 			atomic_inc(&sk->sk_drops);
 			UDP_INC_STATS(sock_net(sk),
 				      UDP_MIB_INERRORS, is_udplite);
 		}
-		skb_free_datagram_locked(sk, skb);
+		kfree_skb(skb);
 		return err;
 	}
 
@@ -1465,16 +1458,15 @@ try_again:
 	if (flags & MSG_TRUNC)
 		err = ulen;
 
-	__skb_free_datagram_locked(sk, skb, peeking ? -err : err);
+	skb_consume_udp(sk, skb, peeking ? -err : err);
 	return err;
 
 csum_copy_err:
-	slow = lock_sock_fast(sk);
-	if (!skb_kill_datagram(sk, skb, flags)) {
+	if (!__sk_queue_drop_skb(sk, skb, flags)) {
 		UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
 		UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
 	}
-	unlock_sock_fast(sk, slow);
+	kfree_skb(skb);
 
 	/* starting over for a new packet, but check if we need to yield */
 	cond_resched();
@@ -1593,7 +1585,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		sk_incoming_cpu_update(sk);
 	}
 
-	rc = __sock_queue_rcv_skb(sk, skb);
+	rc = udp_rmem_schedule(sk, skb);
 	if (rc < 0) {
 		int is_udplite = IS_UDPLITE(sk);
 
@@ -1607,8 +1599,8 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		return -1;
 	}
 
+	__sock_enqueue_skb(sk, skb);
 	return 0;
-
 }
 
 static struct static_key udp_encap_needed __read_mostly;
@@ -1630,7 +1622,6 @@ EXPORT_SYMBOL(udp_encap_enable);
 int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct udp_sock *up = udp_sk(sk);
-	int rc;
 	int is_udplite = IS_UDPLITE(sk);
 
 	/*
@@ -1723,19 +1714,8 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		goto drop;
 	}
 
-	rc = 0;
-
 	ipv4_pktinfo_prepare(sk, skb);
-	bh_lock_sock(sk);
-	if (!sock_owned_by_user(sk))
-		rc = __udp_queue_rcv_skb(sk, skb);
-	else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
-		bh_unlock_sock(sk);
-		goto drop;
-	}
-	bh_unlock_sock(sk);
-
-	return rc;
+	return __udp_queue_rcv_skb(sk, skb);
 
 csum_error:
 	__UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
@@ -2345,19 +2325,22 @@ struct proto udp_prot = {
 	.connect	   = ip4_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
+	.init		   = udp_init_sock,
 	.destroy	   = udp_destroy_sock,
 	.setsockopt	   = udp_setsockopt,
 	.getsockopt	   = udp_getsockopt,
 	.sendmsg	   = udp_sendmsg,
 	.recvmsg	   = udp_recvmsg,
 	.sendpage	   = udp_sendpage,
-	.backlog_rcv	   = __udp_queue_rcv_skb,
 	.release_cb	   = ip4_datagram_release_cb,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
 	.rehash		   = udp_v4_rehash,
 	.get_port	   = udp_v4_get_port,
+	.enter_memory_pressure = udp_enter_memory_pressure,
+	.sockets_allocated = &udp_sockets_allocated,
 	.memory_allocated  = &udp_memory_allocated,
+	.memory_pressure   = &udp_memory_pressure,
 	.sysctl_mem	   = sysctl_udp_mem,
 	.sysctl_wmem	   = &sysctl_udp_wmem_min,
 	.sysctl_rmem	   = &sysctl_udp_rmem_min,
@@ -2641,6 +2624,7 @@ void __init udp_init(void)
 {
 	unsigned long limit;
 
+	percpu_counter_init(&udp_sockets_allocated, 0, GFP_KERNEL);
 	udp_table_init(&udp_table, "UDP");
 	limit = nr_free_buffer_pages() / 8;
 	limit = max(limit, 128UL);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 9aa7c1c..c88e576 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -334,7 +334,6 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	int is_udplite = IS_UDPLITE(sk);
 	bool checksum_valid = false;
 	int is_udp4;
-	bool slow;
 
 	if (flags & MSG_ERRQUEUE)
 		return ipv6_recv_error(sk, msg, len, addr_len);
@@ -378,7 +377,6 @@ try_again:
 		goto csum_copy_err;
 	}
 	if (unlikely(err)) {
-		trace_kfree_skb(skb, udpv6_recvmsg);
 		if (!peeked) {
 			atomic_inc(&sk->sk_drops);
 			if (is_udp4)
@@ -388,7 +386,7 @@ try_again:
 				UDP6_INC_STATS(sock_net(sk),
 					       UDP_MIB_INERRORS, is_udplite);
 		}
-		skb_free_datagram_locked(sk, skb);
+		kfree_skb(skb);
 		return err;
 	}
 	if (!peeked) {
@@ -437,12 +435,11 @@ try_again:
 	if (flags & MSG_TRUNC)
 		err = ulen;
 
-	__skb_free_datagram_locked(sk, skb, peeking ? -err : err);
+	skb_consume_udp(sk, skb, peeking ? -err : err);
 	return err;
 
 csum_copy_err:
-	slow = lock_sock_fast(sk);
-	if (!skb_kill_datagram(sk, skb, flags)) {
+	if (!__sk_queue_drop_skb(sk, skb, flags)) {
 		if (is_udp4) {
 			UDP_INC_STATS(sock_net(sk),
 				      UDP_MIB_CSUMERRORS, is_udplite);
@@ -455,7 +452,7 @@ csum_copy_err:
 				       UDP_MIB_INERRORS, is_udplite);
 		}
 	}
-	unlock_sock_fast(sk, slow);
+	kfree_skb(skb);
 
 	/* starting over for a new packet, but check if we need to yield */
 	cond_resched();
@@ -523,7 +520,7 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		sk_incoming_cpu_update(sk);
 	}
 
-	rc = __sock_queue_rcv_skb(sk, skb);
+	rc = udp_rmem_schedule(sk, skb);
 	if (rc < 0) {
 		int is_udplite = IS_UDPLITE(sk);
 
@@ -535,6 +532,8 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 		kfree_skb(skb);
 		return -1;
 	}
+
+	__sock_enqueue_skb(sk, skb);
 	return 0;
 }
 
@@ -556,7 +555,6 @@ EXPORT_SYMBOL(udpv6_encap_enable);
 int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct udp_sock *up = udp_sk(sk);
-	int rc;
 	int is_udplite = IS_UDPLITE(sk);
 
 	if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
@@ -630,17 +628,7 @@ int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
 	skb_dst_drop(skb);
 
-	bh_lock_sock(sk);
-	rc = 0;
-	if (!sock_owned_by_user(sk))
-		rc = __udpv6_queue_rcv_skb(sk, skb);
-	else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
-		bh_unlock_sock(sk);
-		goto drop;
-	}
-	bh_unlock_sock(sk);
-
-	return rc;
+	return __udpv6_queue_rcv_skb(sk, skb);
 
 csum_error:
 	__UDP6_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
@@ -1433,18 +1421,21 @@ struct proto udpv6_prot = {
 	.connect	   = ip6_datagram_connect,
 	.disconnect	   = udp_disconnect,
 	.ioctl		   = udp_ioctl,
+	.init		   = udp_init_sock,
 	.destroy	   = udpv6_destroy_sock,
 	.setsockopt	   = udpv6_setsockopt,
 	.getsockopt	   = udpv6_getsockopt,
 	.sendmsg	   = udpv6_sendmsg,
 	.recvmsg	   = udpv6_recvmsg,
-	.backlog_rcv	   = __udpv6_queue_rcv_skb,
 	.release_cb	   = ip6_datagram_release_cb,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
 	.rehash		   = udp_v6_rehash,
 	.get_port	   = udp_v6_get_port,
+	.enter_memory_pressure = udp_enter_memory_pressure,
+	.sockets_allocated = &udp_sockets_allocated,
 	.memory_allocated  = &udp_memory_allocated,
+	.memory_pressure   = &udp_memory_pressure,
 	.sysctl_mem	   = sysctl_udp_mem,
 	.sysctl_wmem	   = &sysctl_udp_wmem_min,
 	.sysctl_rmem	   = &sysctl_udp_rmem_min,
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 57625f6..e2a55dc 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -39,6 +39,7 @@
 #include <net/checksum.h>
 #include <net/ip.h>
 #include <net/ipv6.h>
+#include <net/udp.h>
 #include <net/tcp.h>
 #include <net/tcp_states.h>
 #include <asm/uaccess.h>
@@ -129,6 +130,18 @@ static void svc_release_skb(struct svc_rqst *rqstp)
 	}
 }
 
+static void svc_release_udp_skb(struct svc_rqst *rqstp)
+{
+	struct sk_buff *skb = rqstp->rq_xprt_ctxt;
+
+	if (skb) {
+		rqstp->rq_xprt_ctxt = NULL;
+
+		dprintk("svc: service %p, releasing skb %p\n", rqstp, skb);
+		consume_skb(skb);
+	}
+}
+
 union svc_pktinfo_u {
 	struct in_pktinfo pkti;
 	struct in6_pktinfo pkti6;
@@ -575,7 +588,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 			goto out_free;
 		}
 		local_bh_enable();
-		skb_free_datagram_locked(svsk->sk_sk, skb);
+		consume_skb(skb);
 	} else {
 		/* we can use it in-place */
 		rqstp->rq_arg.head[0].iov_base = skb->data;
@@ -602,8 +615,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 	return len;
 out_free:
-	trace_kfree_skb(skb, svc_udp_recvfrom);
-	skb_free_datagram_locked(svsk->sk_sk, skb);
+	kfree_skb(skb);
 	return 0;
 }
 
@@ -660,7 +672,7 @@ static struct svc_xprt_ops svc_udp_ops = {
 	.xpo_create = svc_udp_create,
 	.xpo_recvfrom = svc_udp_recvfrom,
 	.xpo_sendto = svc_udp_sendto,
-	.xpo_release_rqst = svc_release_skb,
+	.xpo_release_rqst = svc_release_udp_skb,
 	.xpo_detach = svc_sock_detach,
 	.xpo_free = svc_sock_free,
 	.xpo_prep_reply_hdr = svc_udp_prep_reply_hdr,
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index bf16883..8e04a43 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1074,7 +1074,7 @@ static void xs_udp_data_receive(struct sock_xprt *transport)
 		skb = skb_recv_datagram(sk, 0, 1, &err);
 		if (skb != NULL) {
 			xs_udp_data_read_skb(&transport->xprt, sk, skb);
-			skb_free_datagram_locked(sk, skb);
+			consume_skb(skb);
 			continue;
 		}
 		if (!test_and_clear_bit(XPRT_SOCK_DATA_READY, &transport->sock_state))
--
1.8.3.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2016-09-29 14:59 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-28 10:52 [PATCH net-next v3 0/3] udp: refactor memory accounting Paolo Abeni
2016-09-28 10:52 ` Paolo Abeni
2016-09-28 10:52 ` [PATCH net-next v3 1/3] net/socket: factor out helpers for memory and queue manipulation Paolo Abeni
2016-09-28 10:52 ` [PATCH net-next v3 2/3] udp: implement memory accounting helpers Paolo Abeni
2016-09-29  1:42 ` Eric Dumazet
[not found] ` <1475113378.28155.124.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
2016-09-29  7:34 ` Paolo Abeni
2016-09-29  7:34 ` Paolo Abeni
2016-09-29  9:31 ` Paolo Abeni
2016-09-29  9:31 ` Paolo Abeni
[not found] ` <1475141514.4676.28.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-09-29 13:24 ` Eric Dumazet
2016-09-29 13:24 ` Eric Dumazet
[not found] ` <1475155472.28155.164.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
2016-09-29 14:01 ` Paolo Abeni
2016-09-29 14:01 ` Paolo Abeni
[not found] ` <1475157674.4676.52.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-09-29 14:13 ` Eric Dumazet
2016-09-29 14:13 ` Eric Dumazet
2016-09-29 14:34 ` Paolo Abeni
2016-09-29 14:49 ` Eric Dumazet
2016-09-29 14:59 ` Paolo Abeni
[not found] ` <cover.1475048434.git.pabeni-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-09-28 10:52 ` [PATCH net-next v3 3/3] udp: use it's own memory accounting schema Paolo Abeni
2016-09-28 10:52 ` Paolo Abeni