From: Eric Dumazet
To: "David S. Miller"
Cc: netdev, Eric Dumazet, Soheil Hassas Yeganeh, Willem de Bruijn
Subject: [PATCH v3 net-next 2/3] tcp: add one skb cache for tx
Date: Fri, 22 Mar 2019 08:56:39 -0700
Message-Id: <20190322155640.248144-3-edumazet@google.com>
In-Reply-To: <20190322155640.248144-1-edumazet@google.com>
References: <20190322155640.248144-1-edumazet@google.com>

On hosts with a lot of cores, RPC workloads suffer from heavy
contention on slab spinlocks.

 20.69%  [kernel]  [k] queued_spin_lock_slowpath
  5.64%  [kernel]  [k] _raw_spin_lock
  3.83%  [kernel]  [k] syscall_return_via_sysret
  3.48%  [kernel]  [k] __entry_text_start
  1.76%  [kernel]  [k] __netif_receive_skb_core
  1.64%  [kernel]  [k] __fget

For each sendmsg(), we allocate one skb and free it when the ACK
packet comes back. In many cases, ACK packets are handled by other
cpus, and this unfortunately incurs heavy costs in the slab layer.

This patch adds an extra pointer to the socket structure, so that we
can try to reuse the same skb and avoid these expensive costs.

We cache at most one skb per socket, so this should be safe as far as
memory pressure is concerned.

Signed-off-by: Eric Dumazet
Acked-by: Soheil Hassas Yeganeh
Acked-by: Willem de Bruijn
---
 include/net/sock.h |  5 +++++
 net/ipv4/tcp.c     | 50 +++++++++++++++++++++-------------------------
 2 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index fecdf639225c2d4995ee2e2cd9be57f3d4f22777..314c47a8f5d19918393aa854a95e6e0f7ec6b604 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -414,6 +414,7 @@ struct sock {
 		struct sk_buff	*sk_send_head;
 		struct rb_root	tcp_rtx_queue;
 	};
+	struct sk_buff	*sk_tx_skb_cache;
 	struct sk_buff_head	sk_write_queue;
 	__s32			sk_peek_off;
 	int			sk_write_pending;
@@ -1463,6 +1464,10 @@ static inline void sk_mem_uncharge(struct sock *sk, int size)
 
 static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
 {
+	if (!sk->sk_tx_skb_cache) {
+		sk->sk_tx_skb_cache = skb;
+		return;
+	}
 	sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
 	sk->sk_wmem_queued -= skb->truesize;
 	sk_mem_uncharge(sk, skb->truesize);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6baa6dc1b13b0b94b1da238668b93e167cf444fe..f0b5a599914514fee2ee14c7083796dfcd3614cd 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -865,6 +865,21 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
 {
 	struct sk_buff *skb;
 
+	skb = sk->sk_tx_skb_cache;
+	if (skb && !size) {
+		const struct sk_buff_fclones *fclones;
+
+		fclones = container_of(skb, struct sk_buff_fclones, skb1);
+		if (refcount_read(&fclones->fclone_ref) == 1) {
+			sk->sk_wmem_queued -= skb->truesize;
+			sk_mem_uncharge(sk, skb->truesize);
+			skb->truesize -= skb->data_len;
+			sk->sk_tx_skb_cache = NULL;
+			pskb_trim(skb, 0);
+			INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
+			return skb;
+		}
+	}
 	/* The TCP header must be at least 32-bit aligned. */
 	size = ALIGN(size, 4);
 
@@ -1098,30 +1113,6 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
 }
 EXPORT_SYMBOL(tcp_sendpage);
 
-/* Do not bother using a page frag for very small frames.
- * But use this heuristic only for the first skb in write queue.
- *
- * Having no payload in skb->head allows better SACK shifting
- * in tcp_shift_skb_data(), reducing sack/rack overhead, because
- * write queue has less skbs.
- * Each skb can hold up to MAX_SKB_FRAGS * 32Kbytes, or ~0.5 MB.
- * This also speeds up tso_fragment(), since it wont fallback
- * to tcp_fragment().
- */
-static int linear_payload_sz(bool first_skb)
-{
-	if (first_skb)
-		return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
-	return 0;
-}
-
-static int select_size(bool first_skb, bool zc)
-{
-	if (zc)
-		return 0;
-	return linear_payload_sz(first_skb);
-}
-
 void tcp_free_fastopen_req(struct tcp_sock *tp)
 {
 	if (tp->fastopen_req) {
@@ -1272,7 +1263,6 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 
 		if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
 			bool first_skb;
-			int linear;
 
 new_segment:
 			if (!sk_stream_memory_free(sk))
@@ -1283,8 +1273,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				goto restart;
 			}
 			first_skb = tcp_rtx_and_write_queues_empty(sk);
-			linear = select_size(first_skb, zc);
-			skb = sk_stream_alloc_skb(sk, linear, sk->sk_allocation,
+			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
 						  first_skb);
 			if (!skb)
 				goto wait_for_memory;
@@ -2552,6 +2541,13 @@ void tcp_write_queue_purge(struct sock *sk)
 		sk_wmem_free_skb(sk, skb);
 	}
 	tcp_rtx_queue_purge(sk);
+	skb = sk->sk_tx_skb_cache;
+	if (skb) {
+		sk->sk_wmem_queued -= skb->truesize;
+		sk_mem_uncharge(sk, skb->truesize);
+		__kfree_skb(skb);
+		sk->sk_tx_skb_cache = NULL;
+	}
 	INIT_LIST_HEAD(&tcp_sk(sk)->tsorted_sent_queue);
 	sk_mem_reclaim(sk);
 	tcp_clear_all_retrans_hints(tcp_sk(sk));
-- 
2.21.0.392.gf8f6787159e-goog
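
For readers who want the gist of the mechanism without walking the diff:
the patch keeps at most one skb parked in sk->sk_tx_skb_cache. The free
path (sk_wmem_free_skb) fills the empty slot instead of returning the skb
to the slab allocator, and the allocation path (sk_stream_alloc_skb) hands
the cached skb back out, but only when the fast-clone reference count shows
that no retransmit clone still holds it. The stand-alone C sketch below
mirrors that single-slot cache pattern with plain malloc()/free(); the
toy_sock/buf types and the buf_alloc()/buf_free() helpers are hypothetical
names used only for illustration, not kernel APIs.

	/*
	 * Stand-alone sketch of a one-element per-socket buffer cache
	 * (hypothetical toy types, not kernel code).
	 *
	 * Build: cc -Wall -o cache_sketch cache_sketch.c
	 */
	#include <stdio.h>
	#include <stdlib.h>

	struct buf {
		int refcnt;			/* users still holding this buffer */
		unsigned char data[2048];
	};

	struct toy_sock {
		struct buf *tx_cache;		/* at most one cached tx buffer */
	};

	/* Free path: park the buffer in the empty cache slot instead of freeing. */
	static void buf_free(struct toy_sock *sk, struct buf *b)
	{
		if (!sk->tx_cache) {
			sk->tx_cache = b;	/* no allocator round-trip */
			return;
		}
		free(b);			/* slot already occupied */
	}

	/* Alloc path: reuse the cached buffer only if we are its sole owner. */
	static struct buf *buf_alloc(struct toy_sock *sk)
	{
		struct buf *b = sk->tx_cache;

		if (b && b->refcnt == 1) {	/* analogue of the fclone_ref check */
			sk->tx_cache = NULL;
			return b;
		}
		b = malloc(sizeof(*b));
		if (b)
			b->refcnt = 1;
		return b;
	}

	int main(void)
	{
		struct toy_sock sk = { .tx_cache = NULL };
		struct buf *first = buf_alloc(&sk);
		struct buf *second;

		buf_free(&sk, first);		/* goes into the cache */
		second = buf_alloc(&sk);	/* comes straight back from it */
		printf("reused: %s\n", second == first ? "yes" : "no");

		free(second);
		return 0;
	}

In this sketch the refcnt == 1 test plays the role of the
refcount_read(&fclones->fclone_ref) == 1 check in the patch: the cached
buffer may only be recycled once no other user (for TCP, a clone queued
for possible retransmission) still references it.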