From: Kuniyuki Iwashima <kuniyu@amazon.com>
To: "David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Jeff Layton <jlayton@kernel.org>,
Chuck Lever <chuck.lever@oracle.com>,
Luis Chamberlain <mcgrof@kernel.org>,
Kees Cook <keescook@chromium.org>,
Iurii Zaikin <yzaikin@google.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>,
Kuniyuki Iwashima <kuni1840@gmail.com>, <netdev@vger.kernel.org>,
<linux-fsdevel@vger.kernel.org>
Subject: [PATCH v1 net-next 13/13] udp: Introduce optional per-netns hash table.
Date: Thu, 25 Aug 2022 17:04:45 -0700 [thread overview]
Message-ID: <20220826000445.46552-14-kuniyu@amazon.com> (raw)
In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com>
We introduce an optional per-netns hash table for UDP.
With a smaller hash table, we can look up sockets faster and isolate
noisy neighbours. Also, we can reduce lock contention.
We can control the hash table size by a new sysctl knob. However,
depending on workloads, it will require very sensitive tuning, so we
disable the feature by default (net.ipv4.udp_child_ehash_entries == 0).
Moreover, we can fall back to using the global hash table in case we
fail to allocate enough memory for a new hash table.
We can check the current hash table size by another read-only sysctl
knob, net.ipv4.udp_hash_entries. A negative value means the netns
shares the global hash table (per-netns hash table is disabled or
failed to allocate memory).
We could optimise the hash table lookup/iteration further by removing
netns comparison for the per-netns one in the future. Also, we could
optimise the sparse udp_hslot layout by putting it in udp_table.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
Documentation/networking/ip-sysctl.rst | 20 ++++++++
include/net/netns/ipv4.h | 2 +
net/ipv4/sysctl_net_ipv4.c | 56 +++++++++++++++++++++
net/ipv4/udp.c | 69 ++++++++++++++++++++++++++
4 files changed, 147 insertions(+)
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 97a0952b11e3..6dc4e2853e39 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1090,6 +1090,26 @@ udp_rmem_min - INTEGER
udp_wmem_min - INTEGER
UDP does not have tx memory accounting and this tunable has no effect.
+udp_hash_entries - INTEGER
+ Read-only number of hash buckets for UDP sockets in the current
+ networking namespace.
+
+ A negative value means the networking namespace does not own its
+ hash buckets and shares the initial networking namespace's one.
+
+udp_child_ehash_entries - INTEGER
+ Control the number of hash buckets for UDP sockets in the child
+ networking namespace, which must be set before clone() or unshare().
+
+ The written value except for 0 is rounded up to 2^n. 0 is a special
+ value, meaning the child networking namespace will share the initial
+ networking namespace's hash buckets.
+
+ Note that the child will use the global one in case the kernel
+ fails to allocate enough memory.
+
+ Default: 0
+
RAW variables
=============
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index c367da5d61e2..a1be7ebb7338 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -200,6 +200,8 @@ struct netns_ipv4 {
atomic_t dev_addr_genid;
+ unsigned int sysctl_udp_child_hash_entries;
+
#ifdef CONFIG_SYSCTL
unsigned long *sysctl_local_reserved_ports;
int sysctl_ip_prot_sock;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 03a3187c4705..b3cea3f36463 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -424,6 +424,47 @@ static int proc_tcp_child_ehash_entries(struct ctl_table *table, int write,
return 0;
}
+static int proc_udp_hash_entries(struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct net *net = container_of(table->data, struct net,
+ ipv4.sysctl_udp_child_hash_entries);
+ int udp_hash_entries;
+ struct ctl_table tbl;
+
+ udp_hash_entries = net->ipv4.udp_table->mask + 1;
+
+ /* A negative number indicates that the child netns
+ * shares the global udp_table.
+ */
+ if (!net_eq(net, &init_net) && net->ipv4.udp_table == &udp_table)
+ udp_hash_entries *= -1;
+
+ tbl.data = &udp_hash_entries;
+ tbl.maxlen = sizeof(int);
+
+ return proc_dointvec(&tbl, write, buffer, lenp, ppos);
+}
+
+static int proc_udp_child_hash_entries(struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ unsigned int udp_child_hash_entries;
+ int ret;
+
+ ret = proc_douintvec(table, write, buffer, lenp, ppos);
+ if (!write || ret)
+ return ret;
+
+ udp_child_hash_entries = READ_ONCE(*(unsigned int *)table->data);
+ if (udp_child_hash_entries)
+ udp_child_hash_entries = roundup_pow_of_two(udp_child_hash_entries);
+
+ WRITE_ONCE(*(unsigned int *)table->data, udp_child_hash_entries);
+
+ return 0;
+}
+
#ifdef CONFIG_IP_ROUTE_MULTIPATH
static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
void *buffer, size_t *lenp,
@@ -1378,6 +1419,21 @@ static struct ctl_table ipv4_net_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_INT_MAX,
},
+ {
+ .procname = "udp_hash_entries",
+ .data = &init_net.ipv4.sysctl_udp_child_hash_entries,
+ .mode = 0444,
+ .proc_handler = proc_udp_hash_entries,
+ },
+ {
+ .procname = "udp_child_hash_entries",
+ .data = &init_net.ipv4.sysctl_udp_child_hash_entries,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_udp_child_hash_entries,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_INT_MAX,
+ },
{
.procname = "udp_rmem_min",
.data = &init_net.ipv4.sysctl_udp_rmem_min,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f4825e38762a..c41306225305 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3309,8 +3309,77 @@ static int __net_init udp_sysctl_init(struct net *net)
return 0;
}
+static struct udp_table __net_init *udp_pernet_table_alloc(unsigned int hash_entries)
+{
+ struct udp_table *udptable;
+ int i;
+
+ udptable = kmalloc(sizeof(*udptable), GFP_KERNEL);
+ if (!udptable)
+ goto out;
+
+ udptable->hash = kvmalloc_array(hash_entries * 2,
+ sizeof(struct udp_hslot), GFP_KERNEL);
+ if (!udptable->hash)
+ goto free_table;
+
+ udptable->hash2 = udptable->hash + hash_entries;
+ udptable->mask = hash_entries - 1;
+ udptable->log = ilog2(hash_entries);
+
+ for (i = 0; i < hash_entries; i++) {
+ INIT_HLIST_HEAD(&udptable->hash[i].head);
+ udptable->hash[i].count = 0;
+ spin_lock_init(&udptable->hash[i].lock);
+
+ INIT_HLIST_HEAD(&udptable->hash2[i].head);
+ udptable->hash2[i].count = 0;
+ spin_lock_init(&udptable->hash2[i].lock);
+ }
+
+ return udptable;
+
+free_table:
+ kfree(udptable);
+out:
+ return NULL;
+}
+
+static int __net_init udp_pernet_table_init(struct net *net, struct net *old_net)
+{
+ struct udp_table *udptable;
+ unsigned int hash_entries;
+
+ hash_entries = READ_ONCE(old_net->ipv4.sysctl_udp_child_hash_entries);
+ if (!hash_entries)
+ goto out;
+
+ udptable = udp_pernet_table_alloc(hash_entries);
+ if (udptable)
+ net->ipv4.udp_table = udptable;
+ else
+ pr_warn("Failed to allocate UDP hash table (entries: %u) "
+ "for a netns, fallback to use the global one\n",
+ hash_entries);
+out:
+ return 0;
+}
+
+static void __net_exit udp_pernet_table_free(struct net *net)
+{
+ struct udp_table *udptable = net->ipv4.udp_table;
+
+ if (udptable == &udp_table)
+ return;
+
+ kvfree(udptable->hash);
+ kfree(udptable);
+}
+
static struct pernet_operations __net_initdata udp_sysctl_ops = {
.init = udp_sysctl_init,
+ .init2 = udp_pernet_table_init,
+ .exit = udp_pernet_table_free,
};
#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_PROC_FS)
--
2.30.2
next prev parent reply other threads:[~2022-08-26 0:09 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-08-26 0:04 [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns hash table Kuniyuki Iwashima
2022-08-26 0:04 ` [PATCH v1 net-next 01/13] fs/lock: Revive LOCK_MAND Kuniyuki Iwashima
2022-08-26 10:02 ` Jeff Layton
2022-08-26 16:48 ` Kuniyuki Iwashima
2022-08-26 0:04 ` [PATCH v1 net-next 02/13] sysctl: Support LOCK_MAND for read/write Kuniyuki Iwashima
2022-08-26 0:04 ` [PATCH v1 net-next 03/13] selftest: sysctl: Add test for flock(LOCK_MAND) Kuniyuki Iwashima
2022-08-26 0:04 ` [PATCH v1 net-next 04/13] net: Introduce init2() for pernet_operations Kuniyuki Iwashima
2022-08-26 15:20 ` Eric Dumazet
2022-08-26 17:03 ` Kuniyuki Iwashima
2022-08-26 0:04 ` [PATCH v1 net-next 05/13] tcp: Clean up some functions Kuniyuki Iwashima
2022-08-26 0:04 ` [PATCH v1 net-next 06/13] tcp: Set NULL to sk->sk_prot->h.hashinfo Kuniyuki Iwashima
2022-08-26 15:40 ` Eric Dumazet
2022-08-26 17:26 ` Kuniyuki Iwashima
2022-08-26 0:04 ` [PATCH v1 net-next 07/13] tcp: Access &tcp_hashinfo via net Kuniyuki Iwashima
2022-08-26 0:04 ` [PATCH v1 net-next 08/13] tcp: Introduce optional per-netns ehash Kuniyuki Iwashima
2022-08-26 15:24 ` Eric Dumazet
2022-08-26 17:19 ` Kuniyuki Iwashima
2022-08-26 0:04 ` [PATCH v1 net-next 09/13] udp: Clean up some functions Kuniyuki Iwashima
2022-08-26 0:04 ` [PATCH v1 net-next 10/13] udp: Set NULL to sk->sk_prot->h.udp_table Kuniyuki Iwashima
2022-08-26 0:04 ` [PATCH v1 net-next 11/13] udp: Set NULL to udp_seq_afinfo.udp_table Kuniyuki Iwashima
2022-08-26 0:04 ` [PATCH v1 net-next 12/13] udp: Access &udp_table via net Kuniyuki Iwashima
2022-08-26 0:04 ` Kuniyuki Iwashima [this message]
2022-08-26 15:17 ` [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns hash table Eric Dumazet
2022-08-26 16:51 ` Kuniyuki Iwashima
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220826000445.46552-14-kuniyu@amazon.com \
--to=kuniyu@amazon.com \
--cc=chuck.lever@oracle.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=jlayton@kernel.org \
--cc=keescook@chromium.org \
--cc=kuba@kernel.org \
--cc=kuni1840@gmail.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=mcgrof@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=yzaikin@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).