Date: Fri, 22 Mar 2019 08:56:40 -0700
In-Reply-To: <20190322155640.248144-1-edumazet@google.com>
Message-Id: <20190322155640.248144-4-edumazet@google.com>
References: <20190322155640.248144-1-edumazet@google.com>
X-Mailer: git-send-email 2.21.0.392.gf8f6787159e-goog
Subject: [PATCH v3 net-next 3/3] tcp: add one skb cache for rx
From: Eric Dumazet
To: "David S. Miller"
Cc: netdev, Eric Dumazet, Eric Dumazet, Soheil Hassas Yeganeh,
    Willem de Bruijn

Often, recvmsg() system calls and BH handling for a particular TCP
socket are done on different cpus. This means the incoming skb has to
be allocated on one cpu, but freed on another.

This incurs high spinlock contention in the slab layer for small RPCs,
and also a high number of cache line ping-pongs for larger packets. A
full size GRO packet might use 45 page fragments, meaning that up to
45 put_page() calls can be involved.

Moreover, performing the __kfree_skb() in the recvmsg() context adds
latency for user applications, and increases the probability of
trapping them in backlog processing, since the BH handler might find
the socket owned by the user.

This patch, combined with the prior one, increases RPC performance by
about 10 % on servers with a large number of cores. (A tcp_rr workload
with 10,000 flows and 112 threads reaches 9 Mpps instead of 8 Mpps.)

This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem:

- the CPU handling the NIC rx interrupts, feeding the receive queue,
  and (after this patch) freeing the skbs that were consumed.

- the CPU in the recvmsg() system call, essentially 100 % busy copying
  out data to user space.

Having at most one skb in a per-socket cache has very little risk of
memory exhaustion, and since it is protected by the socket lock, its
management is essentially free.

Note that if rps/rfs is used, we do not enable this feature, because
there is a high chance that the same cpu is handling both the
recvmsg() system call and the TCP rx path, but that another cpu did
the skb allocations in the device driver right before the RPS/RFS
logic. To properly handle this case, it seems we would need to record
on which cpu each skb was allocated, and use a different channel to
give skbs back to that cpu.
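
To show the mechanism in isolation, here is a minimal userspace sketch.
It is not kernel code: struct buf, struct sock_model, consume_buf() and
rx_path() are made-up stand-ins for sk_buff, sock, sk_eat_skb() and the
tcp_v4_rcv()/tcp_v6_rcv() changes below. The consumer parks the last
consumed buffer in a single per-socket slot instead of freeing it, and
the rx path detaches that slot and frees it on the cpu that allocated it:

#include <stdio.h>
#include <stdlib.h>

struct buf {                            /* stand-in for struct sk_buff */
        char data[2048];
};

struct sock_model {                     /* stand-in for struct sock */
        struct buf *rx_cache;           /* at most one cached buffer */
};

/* recvmsg()/consumer side, mirroring the sk_eat_skb() change: park the
 * buffer in the empty slot instead of freeing it.  (The real code also
 * calls skb_orphan() and skips the cache when RPS/RFS is enabled.)
 */
static void consume_buf(struct sock_model *sk, struct buf *b)
{
        if (!sk->rx_cache) {
                sk->rx_cache = b;
                return;
        }
        free(b);
}

/* BH/rx side, mirroring the tcp_v4_rcv()/tcp_v6_rcv() change: detach the
 * cached buffer while "owning" the socket, then free it here, on the cpu
 * that allocated it.
 */
static void rx_path(struct sock_model *sk)
{
        struct buf *to_free = sk->rx_cache;

        sk->rx_cache = NULL;
        /* ... process the newly received buffer ... */
        if (to_free)
                free(to_free);
}

int main(void)
{
        struct sock_model sk = { .rx_cache = NULL };

        consume_buf(&sk, malloc(sizeof(struct buf)));   /* recvmsg() ate one skb */
        rx_path(&sk);                                   /* next rx pass frees it */
        printf("cache empty: %s\n", sk.rx_cache ? "no" : "yes");
        return 0;
}

Because the slot holds at most one buffer, and in the kernel it is only
touched by whoever holds the socket lock, no extra synchronization is
needed and the memory-exhaustion risk stays negligible.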
Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Acked-by: Willem de Bruijn
---
 include/net/sock.h  | 10 ++++++++++
 net/ipv4/af_inet.c  |  4 ++++
 net/ipv4/tcp.c      |  4 ++++
 net/ipv4/tcp_ipv4.c | 11 +++++++++--
 net/ipv6/tcp_ipv6.c | 12 +++++++++---
 5 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 314c47a8f5d19918393aa854a95e6e0f7ec6b604..577d91fb56267371c6bc5ae65f7454deba726bd6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -368,6 +368,7 @@ struct sock {
 	atomic_t		sk_drops;
 	int			sk_rcvlowat;
 	struct sk_buff_head	sk_error_queue;
+	struct sk_buff		*sk_rx_skb_cache;
 	struct sk_buff_head	sk_receive_queue;
 	/*
 	 * The backlog queue is special, it is always used with
@@ -2438,6 +2439,15 @@ static inline void skb_setup_tx_timestamp(struct sk_buff *skb, __u16 tsflags)
 static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
 {
 	__skb_unlink(skb, &sk->sk_receive_queue);
+	if (
+#ifdef CONFIG_RPS
+	    !static_branch_unlikely(&rps_needed) &&
+#endif
+	    !sk->sk_rx_skb_cache) {
+		sk->sk_rx_skb_cache = skb;
+		skb_orphan(skb);
+		return;
+	}
 	__kfree_skb(skb);
 }

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index eab3ebde981e78a6a0a4852c3b4374c02ede1187..7f3a984ad618580ae28501c3fe3dd3fa915a66a2 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -136,6 +136,10 @@ void inet_sock_destruct(struct sock *sk)
 	struct inet_sock *inet = inet_sk(sk);

 	__skb_queue_purge(&sk->sk_receive_queue);
+	if (sk->sk_rx_skb_cache) {
+		__kfree_skb(sk->sk_rx_skb_cache);
+		sk->sk_rx_skb_cache = NULL;
+	}
 	__skb_queue_purge(&sk->sk_error_queue);

 	sk_mem_reclaim(sk);

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f0b5a599914514fee2ee14c7083796dfcd3614cd..29b94edf05f9357d3a33744d677827ce624738ae 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2583,6 +2583,10 @@ int tcp_disconnect(struct sock *sk, int flags)
 	tcp_clear_xmit_timers(sk);
 	__skb_queue_purge(&sk->sk_receive_queue);
+	if (sk->sk_rx_skb_cache) {
+		__kfree_skb(sk->sk_rx_skb_cache);
+		sk->sk_rx_skb_cache = NULL;
+	}
 	tp->copied_seq = tp->rcv_nxt;
 	tp->urg_data = 0;
 	tcp_write_queue_purge(sk);

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 277d71239d755d858be70663320d8de2ab23dfcc..3979939804b70b805655d94c598a6cb397e35947 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1774,6 +1774,7 @@ static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
 int tcp_v4_rcv(struct sk_buff *skb)
 {
 	struct net *net = dev_net(skb->dev);
+	struct sk_buff *skb_to_free;
 	int sdif = inet_sdif(skb);
 	const struct iphdr *iph;
 	const struct tcphdr *th;
@@ -1905,11 +1906,17 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
 	if (!sock_owned_by_user(sk)) {
+		skb_to_free = sk->sk_rx_skb_cache;
+		sk->sk_rx_skb_cache = NULL;
 		ret = tcp_v4_do_rcv(sk, skb);
-	} else if (tcp_add_backlog(sk, skb)) {
-		goto discard_and_relse;
+	} else {
+		if (tcp_add_backlog(sk, skb))
+			goto discard_and_relse;
+		skb_to_free = NULL;
 	}
 	bh_unlock_sock(sk);
+	if (skb_to_free)
+		__kfree_skb(skb_to_free);

 put_and_return:
 	if (refcounted)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 983ad7a751027cb8fbaee095b90225d71fbaa698..77d723bbe05085881d3d5d4ca0cb4dbcede8d11d 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1436,6 +1436,7 @@ static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
 static int tcp_v6_rcv(struct sk_buff *skb)
 {
+	struct sk_buff *skb_to_free;
 	int sdif = inet6_sdif(skb);
 	const struct tcphdr *th;
 	const struct ipv6hdr *hdr;
@@ -1562,12 +1563,17 @@ static int tcp_v6_rcv(struct sk_buff *skb)
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
 	if (!sock_owned_by_user(sk)) {
+		skb_to_free = sk->sk_rx_skb_cache;
+		sk->sk_rx_skb_cache = NULL;
 		ret = tcp_v6_do_rcv(sk, skb);
-	} else if (tcp_add_backlog(sk, skb)) {
-		goto discard_and_relse;
+	} else {
+		if (tcp_add_backlog(sk, skb))
+			goto discard_and_relse;
+		skb_to_free = NULL;
 	}
 	bh_unlock_sock(sk);
-
+	if (skb_to_free)
+		__kfree_skb(skb_to_free);
 put_and_return:
 	if (refcounted)
 		sock_put(sk);
-- 
2.21.0.392.gf8f6787159e-goog