* [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns hash table.
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, netdev, linux-fsdevel

The more sockets we have in the hash table, the more time we spend
looking up a socket.  When a number of small workloads run on the same
host, they penalise each other and cause performance degradation.

Also, the root cause might be a single workload that consumes far more
resources than the others.  This often happens on a cloud service,
where different workloads share the same computing resources.

On an EC2 c5.24xlarge instance (192 GiB memory and 524288 (1Mi / 2)
ehash entries), with iperf3 running in a separate netns, creating 24Mi
sockets without data transfer in the root netns causes about a 10%
performance regression for the iperf3 connection.

 thash_entries		sockets		length		Gbps
	524288		      1		     1		50.7
			   24Mi		    48		45.1

The regression basically tracks the length of each hash bucket's list.
To see how performance drops as the length grows, I set thash_entries
to 131072 (1Mi / 8), and here's the result.

 thash_entries		sockets		length		Gbps
        131072		      1		     1		50.7
			    1Mi		     8		49.9
			    2Mi		    16		48.9
			    4Mi		    32		47.3
			    8Mi		    64		44.6
			   16Mi		   128		40.6
			   24Mi		   192		36.3
			   32Mi		   256		32.5
			   40Mi		   320		27.0
			   48Mi		   384		25.0

To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP and UDP.  With a smaller hash table, we
can look up sockets faster and isolate noisy neighbours.  Also, we can
reduce lock contention.

We can control and check the hash size via sysctl knobs.  It requires
some tuning based on workloads, so the per-netns hash table is disabled
by default.

  # dmesg | cut -d ' ' -f 5- | grep "established hash"
  TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)

  # sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries

  # sysctl net.ipv4.tcp_child_ehash_entries
  net.ipv4.tcp_child_ehash_entries = 0  # disabled by default

  # ip netns add test1
  # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = -524288  # share the global ehash

  # sysctl -w net.ipv4.tcp_child_ehash_entries=100
  net.ipv4.tcp_child_ehash_entries = 100

  # sysctl net.ipv4.tcp_child_ehash_entries
  net.ipv4.tcp_child_ehash_entries = 128  # rounded up to 2^n

  # ip netns add test2
  # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 128  # own per-netns ehash

  [ UDP has the same interface as udp_hash_entries and
    udp_child_hash_entries. ]

When multiple netns are created concurrently with different hash table
sizes, we can guarantee each netns's size in one of the following ways.

  1) Share the global hash table and create a per-netns one

  First, unshare() with tcp_child_ehash_entries==0.  This creates
  dedicated netns sysctl knobs, where we can safely change
  tcp_child_ehash_entries and then clone()/unshare() again to create a
  per-netns hash table.
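
  A minimal sketch of this sequence with ip-netns(8), assuming nested
  "ip netns" invocations behave like plain unshare() (16384 is just an
  example size):

  # ip netns add parent
  # ip netns exec parent sysctl -w net.ipv4.tcp_child_ehash_entries=16384
  # ip netns exec parent ip netns add child  # child gets a 16384-entry ehash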

  2) Lock the sysctl knob

  We can use flock(LOCK_MAND) or BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny
  read/write on sysctl knobs.
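
  For example, a parent process could pin the knob with flock(2) before
  creating the child netns.  A minimal C sketch (error handling and
  includes omitted):

  int fd = open("/proc/sys/net/ipv4/tcp_child_ehash_entries", O_RDWR);

  flock(fd, LOCK_MAND | LOCK_READ);  /* others can read but not write    */
  write(fd, "16384", 5);             /* the holder itself can still write */
  unshare(CLONE_NEWNET);             /* the child's ehash size is fixed   */
  flock(fd, LOCK_UN);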

For details, please see each patch.

  patch  1 -  3: mandatory lock support for sysctl (fs stuff)
  patch  4 -  7: prep patch for per-netns TCP ehash
  patch       8: add per-netns TCP ehash
  patch  9 - 12: prep patch for per-netns UDP hash table
  patch      13: add per-netns UDP hash table


Kuniyuki Iwashima (13):
  fs/lock: Revive LOCK_MAND.
  sysctl: Support LOCK_MAND for read/write.
  selftest: sysctl: Add test for flock(LOCK_MAND).
  net: Introduce init2() for pernet_operations.
  tcp: Clean up some functions.
  tcp: Set NULL to sk->sk_prot->h.hashinfo.
  tcp: Access &tcp_hashinfo via net.
  tcp: Introduce optional per-netns ehash.
  udp: Clean up some functions.
  udp: Set NULL to sk->sk_prot->h.udp_table.
  udp: Set NULL to udp_seq_afinfo.udp_table.
  udp: Access &udp_table via net.
  udp: Introduce optional per-netns hash table.

 Documentation/networking/ip-sysctl.rst        |  40 +++++
 .../chelsio/inline_crypto/chtls/chtls_cm.c    |   5 +-
 .../mellanox/mlx5/core/en_accel/ktls_rx.c     |   5 +-
 .../net/ethernet/netronome/nfp/crypto/tls.c   |   5 +-
 fs/locks.c                                    |  83 ++++++---
 fs/proc/proc_sysctl.c                         |  25 ++-
 include/linux/fs.h                            |   1 +
 include/net/inet_hashtables.h                 |  16 ++
 include/net/net_namespace.h                   |   3 +
 include/net/netns/ipv4.h                      |   4 +
 include/uapi/asm-generic/fcntl.h              |   5 -
 net/core/filter.c                             |   9 +-
 net/core/net_namespace.c                      |  18 +-
 net/dccp/proto.c                              |   2 +
 net/ipv4/af_inet.c                            |   2 +-
 net/ipv4/esp4.c                               |   3 +-
 net/ipv4/inet_connection_sock.c               |  25 ++-
 net/ipv4/inet_hashtables.c                    | 102 ++++++++---
 net/ipv4/inet_timewait_sock.c                 |   4 +-
 net/ipv4/netfilter/nf_socket_ipv4.c           |   2 +-
 net/ipv4/netfilter/nf_tproxy_ipv4.c           |  17 +-
 net/ipv4/sysctl_net_ipv4.c                    | 113 ++++++++++++
 net/ipv4/tcp.c                                |   1 +
 net/ipv4/tcp_diag.c                           |  18 +-
 net/ipv4/tcp_ipv4.c                           | 122 +++++++++----
 net/ipv4/tcp_minisocks.c                      |   2 +-
 net/ipv4/udp.c                                | 164 ++++++++++++++----
 net/ipv4/udp_diag.c                           |   6 +-
 net/ipv4/udp_offload.c                        |   5 +-
 net/ipv6/esp6.c                               |   3 +-
 net/ipv6/inet6_hashtables.c                   |   4 +-
 net/ipv6/netfilter/nf_socket_ipv6.c           |   2 +-
 net/ipv6/netfilter/nf_tproxy_ipv6.c           |   5 +-
 net/ipv6/tcp_ipv6.c                           |  30 +++-
 net/ipv6/udp.c                                |  31 ++--
 net/ipv6/udp_offload.c                        |   5 +-
 net/mptcp/mptcp_diag.c                        |   7 +-
 tools/testing/selftests/sysctl/.gitignore     |   2 +
 tools/testing/selftests/sysctl/Makefile       |   9 +-
 tools/testing/selftests/sysctl/sysctl_flock.c | 157 +++++++++++++++++
 40 files changed, 854 insertions(+), 208 deletions(-)
 create mode 100644 tools/testing/selftests/sysctl/.gitignore
 create mode 100644 tools/testing/selftests/sysctl/sysctl_flock.c

-- 
2.30.2



* [PATCH v1 net-next 01/13] fs/lock: Revive LOCK_MAND.
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, netdev, linux-fsdevel

Commit 90f7d7a0d0d6 ("locks: remove LOCK_MAND flock lock support")
removed LOCK_MAND support from the kernel because nothing checked the
flag, nor was there any use case.  This patch revives LOCK_MAND to
introduce a mandatory lock for read/write on /proc/sys.  Currently, it's
the only use case, so we add two changes while reverting the commit.

First, we used to allow any f_mode for LOCK_MAND, but we do not bring
that behaviour back.  Instead, we enforce the same FMODE_READ|FMODE_WRITE
check as for LOCK_SH and LOCK_EX.

Second, each f_op->flock() handler returned -EOPNOTSUPP when called with
LOCK_MAND.  The following patch does not use f_op->flock(), so we do the
validation before calling f_op->flock().
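
For example, with the change below, flock_translate_cmd() maps commands
roughly as follows (sketch):

  flock(fd, LOCK_SH);                /* fl_type = F_RDLCK, as before    */
  flock(fd, LOCK_MAND | LOCK_READ);  /* fl_type = LOCK_MAND | LOCK_READ */
  flock(fd, LOCK_MAND | LOCK_RW);    /* fl_type = LOCK_MAND | LOCK_RW   */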

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 fs/locks.c                       | 57 ++++++++++++++++++++------------
 include/uapi/asm-generic/fcntl.h |  5 ---
 2 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index c266cfdc3291..03ff10a3165e 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -421,6 +421,10 @@ static inline int flock_translate_cmd(int cmd) {
 	case LOCK_UN:
 		return F_UNLCK;
 	}
+
+	if (cmd & LOCK_MAND)
+		return cmd & (LOCK_MAND | LOCK_RW);
+
 	return -EINVAL;
 }
 
@@ -879,6 +883,10 @@ static bool flock_locks_conflict(struct file_lock *caller_fl,
 	if (caller_fl->fl_file == sys_fl->fl_file)
 		return false;
 
+	if (caller_fl->fl_type & LOCK_MAND ||
+	    sys_fl->fl_type & LOCK_MAND)
+		return true;
+
 	return locks_conflict(caller_fl, sys_fl);
 }
 
@@ -2077,9 +2085,7 @@ EXPORT_SYMBOL(locks_lock_inode_wait);
  *	- %LOCK_SH -- a shared lock.
  *	- %LOCK_EX -- an exclusive lock.
  *	- %LOCK_UN -- remove an existing lock.
- *	- %LOCK_MAND -- a 'mandatory' flock. (DEPRECATED)
- *
- *	%LOCK_MAND support has been removed from the kernel.
+ *	- %LOCK_MAND -- a 'mandatory' flock. (only supported on /proc/sys/)
  */
 SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
 {
@@ -2087,19 +2093,6 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
 	struct file_lock fl;
 	struct fd f;
 
-	/*
-	 * LOCK_MAND locks were broken for a long time in that they never
-	 * conflicted with one another and didn't prevent any sort of open,
-	 * read or write activity.
-	 *
-	 * Just ignore these requests now, to preserve legacy behavior, but
-	 * throw a warning to let people know that they don't actually work.
-	 */
-	if (cmd & LOCK_MAND) {
-		pr_warn_once("Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.\n");
-		return 0;
-	}
-
 	type = flock_translate_cmd(cmd & ~LOCK_NB);
 	if (type < 0)
 		return type;
@@ -2109,6 +2102,7 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
 	if (!f.file)
 		return error;
 
+	/* LOCK_MAND supports only read/write on proc_sysctl for now */
 	if (type != F_UNLCK && !(f.file->f_mode & (FMODE_READ | FMODE_WRITE)))
 		goto out_putf;
 
@@ -2122,12 +2116,18 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
 	if (can_sleep)
 		fl.fl_flags |= FL_SLEEP;
 
-	if (f.file->f_op->flock)
+	if (f.file->f_op->flock) {
+		if (cmd & LOCK_MAND) {
+			error = -EOPNOTSUPP;
+			goto out_putf;
+		}
+
 		error = f.file->f_op->flock(f.file,
 					    (can_sleep) ? F_SETLKW : F_SETLK,
 					    &fl);
-	else
+	} else {
 		error = locks_lock_file_wait(f.file, &fl);
+	}
 
  out_putf:
 	fdput(f);
@@ -2711,7 +2711,11 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
 		seq_printf(f, " %s ",
 			     (inode == NULL) ? "*NOINODE*" : "ADVISORY ");
 	} else if (IS_FLOCK(fl)) {
-		seq_puts(f, "FLOCK  ADVISORY  ");
+		if (fl->fl_type & LOCK_MAND) {
+			seq_puts(f, "FLOCK  MANDATORY ");
+		} else {
+			seq_puts(f, "FLOCK  ADVISORY  ");
+		}
 	} else if (IS_LEASE(fl)) {
 		if (fl->fl_flags & FL_DELEG)
 			seq_puts(f, "DELEG  ");
@@ -2727,10 +2731,19 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
 	} else {
 		seq_puts(f, "UNKNOWN UNKNOWN  ");
 	}
-	type = IS_LEASE(fl) ? target_leasetype(fl) : fl->fl_type;
 
-	seq_printf(f, "%s ", (type == F_WRLCK) ? "WRITE" :
-			     (type == F_RDLCK) ? "READ" : "UNLCK");
+	if (fl->fl_type & LOCK_MAND) {
+		seq_printf(f, "%s ",
+			   (fl->fl_type & LOCK_READ)
+			   ? (fl->fl_type & LOCK_WRITE) ? "RW   " : "READ "
+			   : (fl->fl_type & LOCK_WRITE) ? "WRITE" : "NONE ");
+	} else {
+		type = IS_LEASE(fl) ? target_leasetype(fl) : fl->fl_type;
+
+		seq_printf(f, "%s ", (type == F_WRLCK) ? "WRITE" :
+			   (type == F_RDLCK) ? "READ" : "UNLCK");
+	}
+
 	if (inode) {
 		/* userspace relies on this representation of dev_t */
 		seq_printf(f, "%d %02x:%02x:%lu ", fl_pid,
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 1ecdb911add8..94fb8c6fd543 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -180,11 +180,6 @@ struct f_owner_ex {
 #define LOCK_NB		4	/* or'd with one of the above to prevent
 				   blocking */
 #define LOCK_UN		8	/* remove lock */
-
-/*
- * LOCK_MAND support has been removed from the kernel. We leave the symbols
- * here to not break legacy builds, but these should not be used in new code.
- */
 #define LOCK_MAND	32	/* This is a mandatory flock ... */
 #define LOCK_READ	64	/* which allows concurrent read operations */
 #define LOCK_WRITE	128	/* which allows concurrent write operations */
-- 
2.30.2



* [PATCH v1 net-next 02/13] sysctl: Support LOCK_MAND for read/write.
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, netdev, linux-fsdevel

The preceding patch added LOCK_MAND support for flock(), and this patch
adds read/write protection on sysctl knobs.  The read/write operations
will return -EPERM if the file is mandatory-locked.

The following patches introduce sysctl knobs which are read in clone() or
unshare() to control a per-netns hash table size for TCP/UDP.  In such a
case, we can use write protection to guarantee the hash table's size for
the child netns.
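
For example, while one process holds LOCK_MAND | LOCK_READ on a knob,
accesses via any other file description behave as below (sketch, error
handling omitted):

  int fd1, fd2;
  char buf[8];

  fd1 = open("/proc/sys/net/ipv4/tcp_child_ehash_entries", O_RDWR);
  flock(fd1, LOCK_MAND | LOCK_READ);

  fd2 = open("/proc/sys/net/ipv4/tcp_child_ehash_entries", O_RDWR);
  read(fd2, buf, sizeof(buf));  /* OK, LOCK_READ is set                    */
  write(fd2, "100", 3);         /* fails with EPERM, LOCK_WRITE is not set */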

The difference from BPF_PROG_TYPE_CGROUP_SYSCTL is that the BPF prog
requires processes to be in the same cgroup to allow/deny read/write on
sysctl knobs.

Note that the read protection might be useless, especially for sysctl
knobs whose value we can infer in another way.  For example, we can
infer fs.nr_open by opening too many files and checking the error, and
net.ipv4.tcp_syn_retries by dropping SYNs and capturing the
retransmissions.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 fs/locks.c            | 26 ++++++++++++++++++++++++++
 fs/proc/proc_sysctl.c | 25 ++++++++++++++++++++++++-
 include/linux/fs.h    |  1 +
 3 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/fs/locks.c b/fs/locks.c
index 03ff10a3165e..c858c6c61920 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -890,6 +890,32 @@ static bool flock_locks_conflict(struct file_lock *caller_fl,
 	return locks_conflict(caller_fl, sys_fl);
 }
 
+int flock_mandatory_locked(struct file *filp)
+{
+	struct file_lock_context *ctx;
+	struct file_lock *fl;
+	int flags = 0;
+
+	ctx = smp_load_acquire(&file_inode(filp)->i_flctx);
+	if (!ctx)
+		goto out;
+
+	spin_lock(&ctx->flc_lock);
+	list_for_each_entry(fl, &ctx->flc_flock, fl_list) {
+		if (!(fl->fl_type & LOCK_MAND))
+			continue;
+
+		if (fl->fl_file != filp)
+			flags = fl->fl_type & (LOCK_MAND | LOCK_RW);
+
+		break;
+	}
+	spin_unlock(&ctx->flc_lock);
+out:
+	return flags;
+}
+EXPORT_SYMBOL(flock_mandatory_locked);
+
 void
 posix_test_lock(struct file *filp, struct file_lock *fl)
 {
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 021e83fe831f..ce2755670970 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -561,10 +561,30 @@ static struct dentry *proc_sys_lookup(struct inode *dir, struct dentry *dentry,
 	return err;
 }
 
+static bool proc_mandatory_locked(struct file *filp, int write)
+{
+	int flags = flock_mandatory_locked(filp);
+
+	if (flags & LOCK_MAND) {
+		if (write) {
+			if (flags & LOCK_WRITE)
+				return false;
+		} else {
+			if (flags & LOCK_READ)
+				return false;
+		}
+
+		return true;
+	}
+
+	return false;
+}
+
 static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 		int write)
 {
-	struct inode *inode = file_inode(iocb->ki_filp);
+	struct file *filp = iocb->ki_filp;
+	struct inode *inode = file_inode(filp);
 	struct ctl_table_header *head = grab_header(inode);
 	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
 	size_t count = iov_iter_count(iter);
@@ -582,6 +602,9 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 	if (sysctl_perm(head, table, write ? MAY_WRITE : MAY_READ))
 		goto out;
 
+	if (proc_mandatory_locked(filp, write))
+		goto out;
+
 	/* if that can happen at all, it should be -EINVAL, not -EISDIR */
 	error = -EINVAL;
 	if (!table->proc_handler)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9eced4cc286e..5d1d4b10a868 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1164,6 +1164,7 @@ extern void locks_copy_conflock(struct file_lock *, struct file_lock *);
 extern void locks_remove_posix(struct file *, fl_owner_t);
 extern void locks_remove_file(struct file *);
 extern void locks_release_private(struct file_lock *);
+int flock_mandatory_locked(struct file *filp);
 extern void posix_test_lock(struct file *, struct file_lock *);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
 extern int locks_delete_block(struct file_lock *);
-- 
2.30.2



* [PATCH v1 net-next 03/13] selftest: sysctl: Add test for flock(LOCK_MAND).
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, netdev, linux-fsdevel

This patch adds test cases for LOCK_MAND:

  - LOCK_MAND with/without LOCK_READ/LOCK_WRITE
  - read/write
    - same fd
    - different fd
      - same PID
      - different PID
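
The test is wired into the kselftest framework and can be run with,
e.g.:

  # make -C tools/testing/selftests TARGETS=sysctl run_tests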

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 tools/testing/selftests/sysctl/.gitignore     |   2 +
 tools/testing/selftests/sysctl/Makefile       |   9 +-
 tools/testing/selftests/sysctl/sysctl_flock.c | 157 ++++++++++++++++++
 3 files changed, 163 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/sysctl/.gitignore
 create mode 100644 tools/testing/selftests/sysctl/sysctl_flock.c

diff --git a/tools/testing/selftests/sysctl/.gitignore b/tools/testing/selftests/sysctl/.gitignore
new file mode 100644
index 000000000000..a3382ba798a6
--- /dev/null
+++ b/tools/testing/selftests/sysctl/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+/sysctl_flock
diff --git a/tools/testing/selftests/sysctl/Makefile b/tools/testing/selftests/sysctl/Makefile
index 110301f9f5be..eb565b6c8340 100644
--- a/tools/testing/selftests/sysctl/Makefile
+++ b/tools/testing/selftests/sysctl/Makefile
@@ -2,12 +2,11 @@
 # Makefile for sysctl selftests.
 # Expects kernel.sysctl_writes_strict=1.
 
-# No binaries, but make sure arg-less "make" doesn't trigger "run_tests".
-all:
+CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
+CFLAGS += -D_GNU_SOURCE
 
 TEST_PROGS := sysctl.sh
 
-include ../lib.mk
+TEST_GEN_PROGS += sysctl_flock
 
-# Nothing to clean up.
-clean:
+include ../lib.mk
diff --git a/tools/testing/selftests/sysctl/sysctl_flock.c b/tools/testing/selftests/sysctl/sysctl_flock.c
new file mode 100644
index 000000000000..4c4be10ae0d3
--- /dev/null
+++ b/tools/testing/selftests/sysctl/sysctl_flock.c
@@ -0,0 +1,157 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Amazon.com Inc. or its affiliates. */
+
+#include <unistd.h>
+#include <sys/file.h>
+#include <sys/types.h>
+
+#include "../kselftest_harness.h"
+
+#define SYSCTL_PATH	"/proc/sys/net/ipv4/tcp_migrate_req"
+#define SYSCTL_BUFLEN	8
+
+FIXTURE(proc_sysctl_flock)
+{
+	int fd;
+	char buf[SYSCTL_BUFLEN];
+	int len;
+};
+
+FIXTURE_VARIANT(proc_sysctl_flock)
+{
+	int cmd;
+};
+
+FIXTURE_VARIANT_ADD(proc_sysctl_flock, lock_mand)
+{
+	.cmd = LOCK_MAND,
+};
+
+FIXTURE_VARIANT_ADD(proc_sysctl_flock, lock_mand_read)
+{
+	.cmd = LOCK_MAND | LOCK_READ,
+};
+
+FIXTURE_VARIANT_ADD(proc_sysctl_flock, lock_mand_write)
+{
+	.cmd = LOCK_MAND | LOCK_WRITE,
+};
+
+FIXTURE_VARIANT_ADD(proc_sysctl_flock, lock_mand_read_write)
+{
+	.cmd = LOCK_MAND | LOCK_READ | LOCK_WRITE,
+};
+
+int proc_sysctl_open(struct __test_metadata *_metadata)
+{
+	int fd;
+
+	fd = open(SYSCTL_PATH, O_RDWR);
+	ASSERT_NE(-1, fd);
+
+	return fd;
+}
+
+FIXTURE_SETUP(proc_sysctl_flock)
+{
+	self->fd = proc_sysctl_open(_metadata);
+	ASSERT_EQ(0, flock(self->fd, variant->cmd));
+
+	self->len = read(self->fd, self->buf, sizeof(self->buf));
+	ASSERT_NE(-1, self->len);
+
+	ASSERT_EQ(self->len, write(self->fd, self->buf, self->len));
+}
+
+FIXTURE_TEARDOWN(proc_sysctl_flock)
+{
+	flock(self->fd, LOCK_UN);
+	close(self->fd);
+}
+
+int is_readable(int cmd)
+{
+	return cmd & LOCK_READ;
+}
+
+int is_writable(int cmd)
+{
+	return cmd & LOCK_WRITE;
+}
+
+void proc_sysctl_newfd_read(struct __test_metadata *_metadata,
+			    FIXTURE_DATA(proc_sysctl_flock) *self,
+			    const FIXTURE_VARIANT(proc_sysctl_flock) *variant)
+{
+	char buf[SYSCTL_BUFLEN];
+	int err, fd;
+
+	fd = proc_sysctl_open(_metadata);
+
+	err = read(fd, buf, SYSCTL_BUFLEN);
+	if (is_readable(variant->cmd)) {
+		ASSERT_EQ(self->len, err);
+	} else {
+		ASSERT_EQ(-1, err);
+	}
+
+	close(fd);
+}
+
+void proc_sysctl_newfd_write(struct __test_metadata *_metadata,
+			     FIXTURE_DATA(proc_sysctl_flock) *self,
+			     const FIXTURE_VARIANT(proc_sysctl_flock) *variant)
+{
+	int err, fd;
+
+	fd = proc_sysctl_open(_metadata);
+
+	err = write(fd, self->buf, self->len);
+	if (is_writable(variant->cmd)) {
+		ASSERT_EQ(self->len, err);
+	} else {
+		ASSERT_EQ(-1, err);
+	}
+
+	close(fd);
+}
+
+void proc_sysctl_fork(struct __test_metadata *_metadata,
+		      FIXTURE_DATA(proc_sysctl_flock) *self,
+		      const FIXTURE_VARIANT(proc_sysctl_flock) *variant,
+		      int read)
+{
+	int pid, status;
+
+	pid = fork();
+	if (pid == 0) {
+		if (read)
+			return proc_sysctl_newfd_read(_metadata, self, variant);
+		else
+			return proc_sysctl_newfd_write(_metadata, self, variant);
+	}
+
+	waitpid(pid, &status, 0);
+}
+
+TEST_F(proc_sysctl_flock, test_newfd_read)
+{
+	proc_sysctl_newfd_read(_metadata, self, variant);
+}
+
+TEST_F(proc_sysctl_flock, test_newfd_write)
+{
+	proc_sysctl_newfd_write(_metadata, self, variant);
+}
+
+TEST_F(proc_sysctl_flock, test_fork_newfd_read)
+{
+	proc_sysctl_fork(_metadata, self, variant, 1);
+}
+
+TEST_F(proc_sysctl_flock, test_fork_newfd_write)
+{
+	proc_sysctl_fork(_metadata, self, variant, 0);
+}
+
+TEST_HARNESS_MAIN
-- 
2.30.2



* [PATCH v1 net-next 04/13] net: Introduce init2() for pernet_operations.
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, netdev, linux-fsdevel

This patch adds a new init function for pernet_operations, init2().

We call each init2() only during clone() or unshare(), where we can
access the parent netns while creating the child netns.
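
For example, a pernet subsystem that inherits a setting from the parent
netns would look like this (hypothetical foo_size field, sketch):

  static int __net_init foo_init2(struct net *net, struct net *old_net)
  {
          /* old_net is the parent netns; inherit its setting */
          net->foo_size = old_net->foo_size;
          return 0;
  }

  static struct pernet_operations foo_ops = {
          .init2 = foo_init2,
  };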

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/net_namespace.h |  3 +++
 net/core/net_namespace.c    | 18 +++++++++++-------
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 8c3587d5c308..3ca426649756 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -410,6 +410,8 @@ struct pernet_operations {
 	 * from register_pernet_subsys(), unregister_pernet_subsys()
 	 * register_pernet_device() and unregister_pernet_device().
 	 *
+	 * init2() is called during clone() or unshare() only.
+	 *
 	 * Exit methods using blocking RCU primitives, such as
 	 * synchronize_rcu(), should be implemented via exit_batch.
 	 * Then, destruction of a group of net requires single
@@ -422,6 +424,7 @@ struct pernet_operations {
 	 * the calls.
 	 */
 	int (*init)(struct net *net);
+	int (*init2)(struct net *net, struct net *old_net);
 	void (*pre_exit)(struct net *net);
 	void (*exit)(struct net *net);
 	void (*exit_batch)(struct list_head *net_exit_list);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6b9f19122ec1..b120ff97d9f5 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -116,7 +116,8 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 	return 0;
 }
 
-static int ops_init(const struct pernet_operations *ops, struct net *net)
+static int ops_init(const struct pernet_operations *ops,
+		    struct net *net, struct net *old_net)
 {
 	int err = -ENOMEM;
 	void *data = NULL;
@@ -133,6 +134,8 @@ static int ops_init(const struct pernet_operations *ops, struct net *net)
 	err = 0;
 	if (ops->init)
 		err = ops->init(net);
+	if (!err && ops->init2 && old_net)
+		err = ops->init2(net, old_net);
 	if (!err)
 		return 0;
 
@@ -301,7 +304,8 @@ EXPORT_SYMBOL_GPL(get_net_ns_by_id);
 /*
  * setup_net runs the initializers for the network namespace object.
  */
-static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
+static __net_init int setup_net(struct net *net, struct net *old_net,
+				struct user_namespace *user_ns)
 {
 	/* Must be called with pernet_ops_rwsem held */
 	const struct pernet_operations *ops, *saved_ops;
@@ -323,7 +327,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 	mutex_init(&net->ipv4.ra_mutex);
 
 	list_for_each_entry(ops, &pernet_list, list) {
-		error = ops_init(ops, net);
+		error = ops_init(ops, net, old_net);
 		if (error < 0)
 			goto out_undo;
 	}
@@ -469,7 +473,7 @@ struct net *copy_net_ns(unsigned long flags,
 	if (rv < 0)
 		goto put_userns;
 
-	rv = setup_net(net, user_ns);
+	rv = setup_net(net, old_net, user_ns);
 
 	up_read(&pernet_ops_rwsem);
 
@@ -1107,7 +1111,7 @@ void __init net_ns_init(void)
 	init_net.key_domain = &init_net_key_domain;
 #endif
 	down_write(&pernet_ops_rwsem);
-	if (setup_net(&init_net, &init_user_ns))
+	if (setup_net(&init_net, NULL, &init_user_ns))
 		panic("Could not setup the initial network namespace");
 
 	init_net_initialized = true;
@@ -1148,7 +1152,7 @@ static int __register_pernet_operations(struct list_head *list,
 
 			memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(net));
 			old = set_active_memcg(memcg);
-			error = ops_init(ops, net);
+			error = ops_init(ops, net, NULL);
 			set_active_memcg(old);
 			mem_cgroup_put(memcg);
 			if (error)
@@ -1188,7 +1192,7 @@ static int __register_pernet_operations(struct list_head *list,
 		return 0;
 	}
 
-	return ops_init(ops, &init_net);
+	return ops_init(ops, &init_net, NULL);
 }
 
 static void __unregister_pernet_operations(struct pernet_operations *ops)
-- 
2.30.2



* [PATCH v1 net-next 05/13] tcp: Clean up some functions.
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, netdev, linux-fsdevel

This patch adds no functional change.  It cleans up some functions that
the following patches touch so that they become tidy and easy to
review/revert.  The changes are:

  - Keep reverse Christmas tree order
  - Remove unnecessary init of port in inet_csk_find_open_port()
  - Use req_to_sk() once in reqsk_queue_unlink()

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/inet_connection_sock.c | 21 ++++++++++-----------
 net/ipv4/inet_hashtables.c      | 29 +++++++++++++++--------------
 net/ipv4/tcp_ipv4.c             |  4 ++--
 3 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index f0038043b661..8e71d65cfad4 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -286,15 +286,13 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 			struct inet_bind_hashbucket **head2_ret, int *port_ret)
 {
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
-	int port = 0;
+	int i, low, high, attempt_half, port, l3mdev;
 	struct inet_bind_hashbucket *head, *head2;
 	struct net *net = sock_net(sk);
-	bool relax = false;
-	int i, low, high, attempt_half;
 	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
-	int l3mdev;
+	bool relax = false;
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
 ports_exhausted:
@@ -471,15 +469,14 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 {
 	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
-	int ret = 1, port = snum;
-	struct net *net = sock_net(sk);
 	bool found_port = false, check_bind_conflict = true;
 	bool bhash_created = false, bhash2_created = false;
 	struct inet_bind_hashbucket *head, *head2;
 	struct inet_bind2_bucket *tb2 = NULL;
 	struct inet_bind_bucket *tb = NULL;
 	bool head2_lock_acquired = false;
-	int l3mdev;
+	int ret = 1, port = snum, l3mdev;
+	struct net *net = sock_net(sk);
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
 
@@ -909,14 +906,16 @@ static void reqsk_migrate_reset(struct request_sock *req)
 /* return true if req was found in the ehash table */
 static bool reqsk_queue_unlink(struct request_sock *req)
 {
-	struct inet_hashinfo *hashinfo = req_to_sk(req)->sk_prot->h.hashinfo;
+	struct sock *sk = req_to_sk(req);
 	bool found = false;
 
-	if (sk_hashed(req_to_sk(req))) {
-		spinlock_t *lock = inet_ehash_lockp(hashinfo, req->rsk_hash);
+	if (sk_hashed(sk)) {
+		struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+		spinlock_t *lock;
 
+		lock = inet_ehash_lockp(hashinfo, req->rsk_hash);
 		spin_lock(lock);
-		found = __sk_nulls_del_node_init_rcu(req_to_sk(req));
+		found = __sk_nulls_del_node_init_rcu(sk);
 		spin_unlock(lock);
 	}
 	if (timer_pending(&req->rsk_timer) && del_timer_sync(&req->rsk_timer))
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 60d77e234a68..29dce78de179 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -169,13 +169,14 @@ void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
 static void __inet_put_port(struct sock *sk)
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
-	const int bhash = inet_bhashfn(sock_net(sk), inet_sk(sk)->inet_num,
-			hashinfo->bhash_size);
-	struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash];
-	struct inet_bind_hashbucket *head2 =
-		inet_bhashfn_portaddr(hashinfo, sk, sock_net(sk),
-				      inet_sk(sk)->inet_num);
+	struct inet_bind_hashbucket *head, *head2;
+	struct net *net = sock_net(sk);
 	struct inet_bind_bucket *tb;
+	int bhash;
+
+	bhash = inet_bhashfn(net, inet_sk(sk)->inet_num, hashinfo->bhash_size);
+	head = &hashinfo->bhash[bhash];
+	head2 = inet_bhashfn_portaddr(hashinfo, sk, net, inet_sk(sk)->inet_num);
 
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
@@ -209,17 +210,17 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 {
 	struct inet_hashinfo *table = sk->sk_prot->h.hashinfo;
 	unsigned short port = inet_sk(child)->inet_num;
-	const int bhash = inet_bhashfn(sock_net(sk), port,
-			table->bhash_size);
-	struct inet_bind_hashbucket *head = &table->bhash[bhash];
-	struct inet_bind_hashbucket *head2 =
-		inet_bhashfn_portaddr(table, child, sock_net(sk), port);
+	struct inet_bind_hashbucket *head, *head2;
 	bool created_inet_bind_bucket = false;
-	bool update_fastreuse = false;
 	struct net *net = sock_net(sk);
+	bool update_fastreuse = false;
 	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
-	int l3mdev;
+	int bhash, l3mdev;
+
+	bhash = inet_bhashfn(net, port, table->bhash_size);
+	head = &table->bhash[bhash];
+	head2 = inet_bhashfn_portaddr(table, child, net, port);
 
 	spin_lock(&head->lock);
 	spin_lock(&head2->lock);
@@ -629,8 +630,8 @@ static bool inet_ehash_lookup_by_sk(struct sock *sk,
 bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
-	struct hlist_nulls_head *list;
 	struct inet_ehash_bucket *head;
+	struct hlist_nulls_head *list;
 	spinlock_t *lock;
 	bool ret = true;
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cc2ad67f75be..61a9bf661814 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2406,9 +2406,9 @@ static void *established_get_first(struct seq_file *seq)
 
 static void *established_get_next(struct seq_file *seq, void *cur)
 {
-	struct sock *sk = cur;
-	struct hlist_nulls_node *node;
 	struct tcp_iter_state *st = seq->private;
+	struct hlist_nulls_node *node;
+	struct sock *sk = cur;
 
 	++st->num;
 	++st->offset;
-- 
2.30.2



* [PATCH v1 net-next 06/13] tcp: Set NULL to sk->sk_prot->h.hashinfo.
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, netdev, linux-fsdevel

We will soon introduce an optional per-netns ehash.

This means we cannot use the global sk->sk_prot->h.hashinfo
to fetch a TCP hashinfo.

Instead, set sk->sk_prot->h.hashinfo to NULL for TCP and fetch a
proper hashinfo from net->ipv4.tcp_death_row->hashinfo.

Note that we need not use sk->sk_prot->h.hashinfo if DCCP is
disabled.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/inet_hashtables.h   | 10 ++++++++++
 net/ipv4/af_inet.c              |  2 +-
 net/ipv4/inet_connection_sock.c |  6 +++---
 net/ipv4/inet_hashtables.c      | 14 +++++++-------
 net/ipv4/tcp_ipv4.c             |  2 +-
 net/ipv6/tcp_ipv6.c             |  2 +-
 6 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 44a419b9e3d5..2c866112433e 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -170,6 +170,16 @@ struct inet_hashinfo {
 	struct inet_listen_hashbucket	*lhash2;
 };
 
+static inline struct inet_hashinfo *inet_get_hashinfo(const struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_IP_DCCP)
+	return sk->sk_prot->h.hashinfo ? :
+		sock_net(sk)->ipv4.tcp_death_row->hashinfo;
+#else
+	return sock_net(sk)->ipv4.tcp_death_row->hashinfo;
+#endif
+}
+
 static inline struct inet_listen_hashbucket *
 inet_lhash2_bucket(struct inet_hashinfo *h, u32 hash)
 {
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index d3ab1ae32ef5..6d81d98a34ca 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1250,7 +1250,7 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 	}
 
 	prev_addr_hashbucket =
-		inet_bhashfn_portaddr(sk->sk_prot->h.hashinfo, sk,
+		inet_bhashfn_portaddr(inet_get_hashinfo(sk), sk,
 				      sock_net(sk), inet->inet_num);
 
 	inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 8e71d65cfad4..9e3613da3e57 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -285,7 +285,7 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 			struct inet_bind2_bucket **tb2_ret,
 			struct inet_bind_hashbucket **head2_ret, int *port_ret)
 {
-	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hinfo = inet_get_hashinfo(sk);
 	int i, low, high, attempt_half, port, l3mdev;
 	struct inet_bind_hashbucket *head, *head2;
 	struct net *net = sock_net(sk);
@@ -468,7 +468,7 @@ void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
 int inet_csk_get_port(struct sock *sk, unsigned short snum)
 {
 	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
-	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hinfo = inet_get_hashinfo(sk);
 	bool found_port = false, check_bind_conflict = true;
 	bool bhash_created = false, bhash2_created = false;
 	struct inet_bind_hashbucket *head, *head2;
@@ -910,7 +910,7 @@ static bool reqsk_queue_unlink(struct request_sock *req)
 	bool found = false;
 
 	if (sk_hashed(sk)) {
-		struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+		struct inet_hashinfo *hashinfo = inet_get_hashinfo(sk);
 		spinlock_t *lock;
 
 		lock = inet_ehash_lockp(hashinfo, req->rsk_hash);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 29dce78de179..23bfff989be6 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -168,7 +168,7 @@ void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
  */
 static void __inet_put_port(struct sock *sk)
 {
-	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hashinfo = inet_get_hashinfo(sk);
 	struct inet_bind_hashbucket *head, *head2;
 	struct net *net = sock_net(sk);
 	struct inet_bind_bucket *tb;
@@ -208,7 +208,7 @@ EXPORT_SYMBOL(inet_put_port);
 
 int __inet_inherit_port(const struct sock *sk, struct sock *child)
 {
-	struct inet_hashinfo *table = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *table = inet_get_hashinfo(sk);
 	unsigned short port = inet_sk(child)->inet_num;
 	struct inet_bind_hashbucket *head, *head2;
 	bool created_inet_bind_bucket = false;
@@ -629,7 +629,7 @@ static bool inet_ehash_lookup_by_sk(struct sock *sk,
  */
 bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
 {
-	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hashinfo = inet_get_hashinfo(sk);
 	struct inet_ehash_bucket *head;
 	struct hlist_nulls_head *list;
 	spinlock_t *lock;
@@ -701,7 +701,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
 
 int __inet_hash(struct sock *sk, struct sock *osk)
 {
-	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hashinfo = inet_get_hashinfo(sk);
 	struct inet_listen_hashbucket *ilb2;
 	int err = 0;
 
@@ -747,7 +747,7 @@ EXPORT_SYMBOL_GPL(inet_hash);
 
 void inet_unhash(struct sock *sk)
 {
-	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hashinfo = inet_get_hashinfo(sk);
 
 	if (sk_unhashed(sk))
 		return;
@@ -834,7 +834,7 @@ inet_bind2_bucket_find(const struct inet_bind_hashbucket *head, const struct net
 struct inet_bind_hashbucket *
 inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, int port)
 {
-	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hinfo = inet_get_hashinfo(sk);
 	u32 hash;
 #if IS_ENABLED(CONFIG_IPV6)
 	struct in6_addr addr_any = {};
@@ -850,7 +850,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
 
 int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
 {
-	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hinfo = inet_get_hashinfo(sk);
 	struct inet_bind2_bucket *tb2, *new_tb2;
 	int l3mdev = inet_sk_bound_l3mdev(sk);
 	struct inet_bind_hashbucket *head2;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 61a9bf661814..7c3b3ce85a5e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3083,7 +3083,7 @@ struct proto tcp_prot = {
 	.slab_flags		= SLAB_TYPESAFE_BY_RCU,
 	.twsk_prot		= &tcp_timewait_sock_ops,
 	.rsk_prot		= &tcp_request_sock_ops,
-	.h.hashinfo		= &tcp_hashinfo,
+	.h.hashinfo		= NULL,
 	.no_autobind		= true,
 	.diag_destroy		= tcp_abort,
 };
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index ff5c4fc135fc..791f24da9212 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2193,7 +2193,7 @@ struct proto tcpv6_prot = {
 	.slab_flags		= SLAB_TYPESAFE_BY_RCU,
 	.twsk_prot		= &tcp6_timewait_sock_ops,
 	.rsk_prot		= &tcp6_request_sock_ops,
-	.h.hashinfo		= &tcp_hashinfo,
+	.h.hashinfo		= NULL,
 	.no_autobind		= true,
 	.diag_destroy		= tcp_abort,
 };
-- 
2.30.2



* [PATCH v1 net-next 07/13] tcp: Access &tcp_hashinfo via net.
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, netdev, linux-fsdevel

We will soon introduce an optional per-netns ehash.

This means we cannot use tcp_hashinfo directly in most places.

Instead, access it via net->ipv4.tcp_death_row->hashinfo.

Direct access to tcp_hashinfo remains valid only while initialising
tcp_hashinfo itself and while creating/destroying each netns.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 .../chelsio/inline_crypto/chtls/chtls_cm.c    |  5 +-
 .../mellanox/mlx5/core/en_accel/ktls_rx.c     |  5 +-
 .../net/ethernet/netronome/nfp/crypto/tls.c   |  5 +-
 net/core/filter.c                             |  5 +-
 net/ipv4/esp4.c                               |  3 +-
 net/ipv4/inet_hashtables.c                    |  2 +-
 net/ipv4/netfilter/nf_socket_ipv4.c           |  2 +-
 net/ipv4/netfilter/nf_tproxy_ipv4.c           | 17 +++--
 net/ipv4/tcp_diag.c                           | 18 +++++-
 net/ipv4/tcp_ipv4.c                           | 63 +++++++++++--------
 net/ipv4/tcp_minisocks.c                      |  2 +-
 net/ipv6/esp6.c                               |  3 +-
 net/ipv6/inet6_hashtables.c                   |  4 +-
 net/ipv6/netfilter/nf_socket_ipv6.c           |  2 +-
 net/ipv6/netfilter/nf_tproxy_ipv6.c           |  5 +-
 net/ipv6/tcp_ipv6.c                           | 16 ++---
 net/mptcp/mptcp_diag.c                        |  7 ++-
 17 files changed, 98 insertions(+), 66 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
index ddfe9208529a..f90bfba4b303 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
@@ -1069,8 +1069,7 @@ static void chtls_pass_accept_rpl(struct sk_buff *skb,
 	cxgb4_l2t_send(csk->egress_dev, skb, csk->l2t_entry);
 }
 
-static void inet_inherit_port(struct inet_hashinfo *hash_info,
-			      struct sock *lsk, struct sock *newsk)
+static void inet_inherit_port(struct sock *lsk, struct sock *newsk)
 {
 	local_bh_disable();
 	__inet_inherit_port(lsk, newsk);
@@ -1240,7 +1239,7 @@ static struct sock *chtls_recv_sock(struct sock *lsk,
 						     ipv4.sysctl_tcp_window_scaling),
 					   tp->window_clamp);
 	neigh_release(n);
-	inet_inherit_port(&tcp_hashinfo, lsk, newsk);
+	inet_inherit_port(lsk, newsk);
 	csk_set_flag(csk, CSK_CONN_INLINE);
 	bh_unlock_sock(newsk); /* tcp_create_openreq_child ->sk_clone_lock */
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c
index 13145ecaf839..9ae36772cc15 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c
@@ -461,6 +461,7 @@ static void resync_update_sn(struct mlx5e_rq *rq, struct sk_buff *skb)
 {
 	struct ethhdr *eth = (struct ethhdr *)(skb->data);
 	struct net_device *netdev = rq->netdev;
+	struct net *net = dev_net(netdev);
 	struct sock *sk = NULL;
 	unsigned int datalen;
 	struct iphdr *iph;
@@ -475,7 +476,7 @@ static void resync_update_sn(struct mlx5e_rq *rq, struct sk_buff *skb)
 		depth += sizeof(struct iphdr);
 		th = (void *)iph + sizeof(struct iphdr);
 
-		sk = inet_lookup_established(dev_net(netdev), &tcp_hashinfo,
+		sk = inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 					     iph->saddr, th->source, iph->daddr,
 					     th->dest, netdev->ifindex);
 #if IS_ENABLED(CONFIG_IPV6)
@@ -485,7 +486,7 @@ static void resync_update_sn(struct mlx5e_rq *rq, struct sk_buff *skb)
 		depth += sizeof(struct ipv6hdr);
 		th = (void *)ipv6h + sizeof(struct ipv6hdr);
 
-		sk = __inet6_lookup_established(dev_net(netdev), &tcp_hashinfo,
+		sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 						&ipv6h->saddr, th->source,
 						&ipv6h->daddr, ntohs(th->dest),
 						netdev->ifindex, 0);
diff --git a/drivers/net/ethernet/netronome/nfp/crypto/tls.c b/drivers/net/ethernet/netronome/nfp/crypto/tls.c
index 78368e71ce83..7948c04a32dc 100644
--- a/drivers/net/ethernet/netronome/nfp/crypto/tls.c
+++ b/drivers/net/ethernet/netronome/nfp/crypto/tls.c
@@ -474,6 +474,7 @@ int nfp_net_tls_rx_resync_req(struct net_device *netdev,
 {
 	struct nfp_net *nn = netdev_priv(netdev);
 	struct nfp_net_tls_offload_ctx *ntls;
+	struct net *net = dev_net(netdev);
 	struct ipv6hdr *ipv6h;
 	struct tcphdr *th;
 	struct iphdr *iph;
@@ -494,13 +495,13 @@ int nfp_net_tls_rx_resync_req(struct net_device *netdev,
 
 	switch (ipv6h->version) {
 	case 4:
-		sk = inet_lookup_established(dev_net(netdev), &tcp_hashinfo,
+		sk = inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 					     iph->saddr, th->source, iph->daddr,
 					     th->dest, netdev->ifindex);
 		break;
 #if IS_ENABLED(CONFIG_IPV6)
 	case 6:
-		sk = __inet6_lookup_established(dev_net(netdev), &tcp_hashinfo,
+		sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 						&ipv6h->saddr, th->source,
 						&ipv6h->daddr, ntohs(th->dest),
 						netdev->ifindex, 0);
diff --git a/net/core/filter.c b/net/core/filter.c
index e8508aaafd27..dd5844033b66 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6468,6 +6468,7 @@ static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = {
 static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 			      int dif, int sdif, u8 family, u8 proto)
 {
+	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;
 	bool refcounted = false;
 	struct sock *sk = NULL;
 
@@ -6476,7 +6477,7 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 		__be32 dst4 = tuple->ipv4.daddr;
 
 		if (proto == IPPROTO_TCP)
-			sk = __inet_lookup(net, &tcp_hashinfo, NULL, 0,
+			sk = __inet_lookup(net, hinfo, NULL, 0,
 					   src4, tuple->ipv4.sport,
 					   dst4, tuple->ipv4.dport,
 					   dif, sdif, &refcounted);
@@ -6490,7 +6491,7 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 		struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr;
 
 		if (proto == IPPROTO_TCP)
-			sk = __inet6_lookup(net, &tcp_hashinfo, NULL, 0,
+			sk = __inet6_lookup(net, hinfo, NULL, 0,
 					    src6, tuple->ipv6.sport,
 					    dst6, ntohs(tuple->ipv6.dport),
 					    dif, sdif, &refcounted);
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 5c03eba787e5..951c965d9322 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -134,6 +134,7 @@ static void esp_free_tcp_sk(struct rcu_head *head)
 static struct sock *esp_find_tcp_sk(struct xfrm_state *x)
 {
 	struct xfrm_encap_tmpl *encap = x->encap;
+	struct net *net = xs_net(x);
 	struct esp_tcp_sk *esk;
 	__be16 sport, dport;
 	struct sock *nsk;
@@ -160,7 +161,7 @@ static struct sock *esp_find_tcp_sk(struct xfrm_state *x)
 	}
 	spin_unlock_bh(&x->lock);
 
-	sk = inet_lookup_established(xs_net(x), &tcp_hashinfo, x->id.daddr.a4,
+	sk = inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo, x->id.daddr.a4,
 				     dport, x->props.saddr.a4, sport, 0);
 	if (!sk)
 		return ERR_PTR(-ENOENT);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 23bfff989be6..5eb21a95179b 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -386,7 +386,7 @@ static inline struct sock *inet_lookup_run_bpf(struct net *net,
 	struct sock *sk, *reuse_sk;
 	bool no_reuseport;
 
-	if (hashinfo != &tcp_hashinfo)
+	if (hashinfo != net->ipv4.tcp_death_row->hashinfo)
 		return NULL; /* only TCP is supported */
 
 	no_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_TCP, saddr, sport,
diff --git a/net/ipv4/netfilter/nf_socket_ipv4.c b/net/ipv4/netfilter/nf_socket_ipv4.c
index 2d42e4c35a20..9ba496d061cc 100644
--- a/net/ipv4/netfilter/nf_socket_ipv4.c
+++ b/net/ipv4/netfilter/nf_socket_ipv4.c
@@ -71,7 +71,7 @@ nf_socket_get_sock_v4(struct net *net, struct sk_buff *skb, const int doff,
 {
 	switch (protocol) {
 	case IPPROTO_TCP:
-		return inet_lookup(net, &tcp_hashinfo, skb, doff,
+		return inet_lookup(net, net->ipv4.tcp_death_row->hashinfo, skb, doff,
 				   saddr, sport, daddr, dport,
 				   in->ifindex);
 	case IPPROTO_UDP:
diff --git a/net/ipv4/netfilter/nf_tproxy_ipv4.c b/net/ipv4/netfilter/nf_tproxy_ipv4.c
index b2bae0b0e42a..ce4e0d50a55c 100644
--- a/net/ipv4/netfilter/nf_tproxy_ipv4.c
+++ b/net/ipv4/netfilter/nf_tproxy_ipv4.c
@@ -79,6 +79,7 @@ nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb,
 		      const struct net_device *in,
 		      const enum nf_tproxy_lookup_t lookup_type)
 {
+	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;
 	struct sock *sk;
 
 	switch (protocol) {
@@ -92,12 +93,10 @@ nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb,
 
 		switch (lookup_type) {
 		case NF_TPROXY_LOOKUP_LISTENER:
-			sk = inet_lookup_listener(net, &tcp_hashinfo, skb,
-						    ip_hdrlen(skb) +
-						      __tcp_hdrlen(hp),
-						    saddr, sport,
-						    daddr, dport,
-						    in->ifindex, 0);
+			sk = inet_lookup_listener(net, hinfo, skb,
+						  ip_hdrlen(skb) + __tcp_hdrlen(hp),
+						  saddr, sport, daddr, dport,
+						  in->ifindex, 0);
 
 			if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
 				sk = NULL;
@@ -108,9 +107,9 @@ nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb,
 			 */
 			break;
 		case NF_TPROXY_LOOKUP_ESTABLISHED:
-			sk = inet_lookup_established(net, &tcp_hashinfo,
-						    saddr, sport, daddr, dport,
-						    in->ifindex);
+			sk = inet_lookup_established(net, hinfo,
+						     saddr, sport, daddr, dport,
+						     in->ifindex);
 			break;
 		default:
 			BUG();
diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
index 75a1c985f49a..958c923020c0 100644
--- a/net/ipv4/tcp_diag.c
+++ b/net/ipv4/tcp_diag.c
@@ -181,13 +181,21 @@ static size_t tcp_diag_get_aux_size(struct sock *sk, bool net_admin)
 static void tcp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			  const struct inet_diag_req_v2 *r)
 {
-	inet_diag_dump_icsk(&tcp_hashinfo, skb, cb, r);
+	struct inet_hashinfo *hinfo;
+
+	hinfo = sock_net(cb->skb->sk)->ipv4.tcp_death_row->hashinfo;
+
+	inet_diag_dump_icsk(hinfo, skb, cb, r);
 }
 
 static int tcp_diag_dump_one(struct netlink_callback *cb,
 			     const struct inet_diag_req_v2 *req)
 {
-	return inet_diag_dump_one_icsk(&tcp_hashinfo, cb, req);
+	struct inet_hashinfo *hinfo;
+
+	hinfo = sock_net(cb->skb->sk)->ipv4.tcp_death_row->hashinfo;
+
+	return inet_diag_dump_one_icsk(hinfo, cb, req);
 }
 
 #ifdef CONFIG_INET_DIAG_DESTROY
@@ -195,9 +203,13 @@ static int tcp_diag_destroy(struct sk_buff *in_skb,
 			    const struct inet_diag_req_v2 *req)
 {
 	struct net *net = sock_net(in_skb->sk);
-	struct sock *sk = inet_diag_find_one_icsk(net, &tcp_hashinfo, req);
+	struct inet_hashinfo *hinfo;
+	struct sock *sk;
 	int err;
 
+	hinfo = net->ipv4.tcp_death_row->hashinfo;
+	sk = inet_diag_find_one_icsk(net, hinfo, req);
+
 	if (IS_ERR(sk))
 		return PTR_ERR(sk);
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7c3b3ce85a5e..b07930643b11 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -249,7 +249,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 
 	if (!inet->inet_saddr) {
 		if (inet_csk(sk)->icsk_bind2_hash) {
-			prev_addr_hashbucket = inet_bhashfn_portaddr(&tcp_hashinfo,
+			prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
 								     sk, sock_net(sk),
 								     inet->inet_num);
 			prev_sk_rcv_saddr = sk->sk_rcv_saddr;
@@ -494,7 +494,7 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
 	int err;
 	struct net *net = dev_net(skb->dev);
 
-	sk = __inet_lookup_established(net, &tcp_hashinfo, iph->daddr,
+	sk = __inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo, iph->daddr,
 				       th->dest, iph->saddr, ntohs(th->source),
 				       inet_iif(skb), 0);
 	if (!sk) {
@@ -759,7 +759,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
 		 * Incoming packet is checked with md5 hash with finding key,
 		 * no RST generated if md5 hash doesn't match.
 		 */
-		sk1 = __inet_lookup_listener(net, &tcp_hashinfo, NULL, 0,
+		sk1 = __inet_lookup_listener(net, net->ipv4.tcp_death_row->hashinfo, NULL, 0,
 					     ip_hdr(skb)->saddr,
 					     th->source, ip_hdr(skb)->daddr,
 					     ntohs(th->source), dif, sdif);
@@ -1728,6 +1728,7 @@ EXPORT_SYMBOL(tcp_v4_do_rcv);
 
 int tcp_v4_early_demux(struct sk_buff *skb)
 {
+	struct net *net = dev_net(skb->dev);
 	const struct iphdr *iph;
 	const struct tcphdr *th;
 	struct sock *sk;
@@ -1744,7 +1745,7 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 	if (th->doff < sizeof(struct tcphdr) / 4)
 		return 0;
 
-	sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
+	sk = __inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 				       iph->saddr, th->source,
 				       iph->daddr, ntohs(th->dest),
 				       skb->skb_iif, inet_sdif(skb));
@@ -1970,7 +1971,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	th = (const struct tcphdr *)skb->data;
 	iph = ip_hdr(skb);
 lookup:
-	sk = __inet_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th), th->source,
+	sk = __inet_lookup_skb(net->ipv4.tcp_death_row->hashinfo,
+			       skb, __tcp_hdrlen(th), th->source,
 			       th->dest, sdif, &refcounted);
 	if (!sk)
 		goto no_tcp_socket;
@@ -2152,9 +2154,9 @@ int tcp_v4_rcv(struct sk_buff *skb)
 	}
 	switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {
 	case TCP_TW_SYN: {
-		struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev),
-							&tcp_hashinfo, skb,
-							__tcp_hdrlen(th),
+		struct sock *sk2 = inet_lookup_listener(net,
+							net->ipv4.tcp_death_row->hashinfo,
+							skb, __tcp_hdrlen(th),
 							iph->saddr, th->source,
 							iph->daddr, th->dest,
 							inet_iif(skb),
@@ -2304,15 +2306,16 @@ static bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
  */
 static void *listening_get_first(struct seq_file *seq)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct tcp_iter_state *st = seq->private;
 
 	st->offset = 0;
-	for (; st->bucket <= tcp_hashinfo.lhash2_mask; st->bucket++) {
+	for (; st->bucket <= hinfo->lhash2_mask; st->bucket++) {
 		struct inet_listen_hashbucket *ilb2;
 		struct hlist_nulls_node *node;
 		struct sock *sk;
 
-		ilb2 = &tcp_hashinfo.lhash2[st->bucket];
+		ilb2 = &hinfo->lhash2[st->bucket];
 		if (hlist_nulls_empty(&ilb2->nulls_head))
 			continue;
 
@@ -2337,6 +2340,7 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 	struct tcp_iter_state *st = seq->private;
 	struct inet_listen_hashbucket *ilb2;
 	struct hlist_nulls_node *node;
+	struct inet_hashinfo *hinfo;
 	struct sock *sk = cur;
 
 	++st->num;
@@ -2348,7 +2352,8 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 			return sk;
 	}
 
-	ilb2 = &tcp_hashinfo.lhash2[st->bucket];
+	hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
+	ilb2 = &hinfo->lhash2[st->bucket];
 	spin_unlock(&ilb2->lock);
 	++st->bucket;
 	return listening_get_first(seq);
@@ -2370,9 +2375,10 @@ static void *listening_get_idx(struct seq_file *seq, loff_t *pos)
 	return rc;
 }
 
-static inline bool empty_bucket(const struct tcp_iter_state *st)
+static inline bool empty_bucket(struct inet_hashinfo *hinfo,
+				const struct tcp_iter_state *st)
 {
-	return hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].chain);
+	return hlist_nulls_empty(&hinfo->ehash[st->bucket].chain);
 }
 
 /*
@@ -2381,20 +2387,21 @@ static inline bool empty_bucket(const struct tcp_iter_state *st)
  */
 static void *established_get_first(struct seq_file *seq)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct tcp_iter_state *st = seq->private;
 
 	st->offset = 0;
-	for (; st->bucket <= tcp_hashinfo.ehash_mask; ++st->bucket) {
+	for (; st->bucket <= hinfo->ehash_mask; ++st->bucket) {
 		struct sock *sk;
 		struct hlist_nulls_node *node;
-		spinlock_t *lock = inet_ehash_lockp(&tcp_hashinfo, st->bucket);
+		spinlock_t *lock = inet_ehash_lockp(hinfo, st->bucket);
 
 		/* Lockless fast path for the common case of empty buckets */
-		if (empty_bucket(st))
+		if (empty_bucket(hinfo, st))
 			continue;
 
 		spin_lock_bh(lock);
-		sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
+		sk_nulls_for_each(sk, node, &hinfo->ehash[st->bucket].chain) {
 			if (seq_sk_match(seq, sk))
 				return sk;
 		}
@@ -2406,6 +2413,7 @@ static void *established_get_first(struct seq_file *seq)
 
 static void *established_get_next(struct seq_file *seq, void *cur)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct tcp_iter_state *st = seq->private;
 	struct hlist_nulls_node *node;
 	struct sock *sk = cur;
@@ -2420,7 +2428,7 @@ static void *established_get_next(struct seq_file *seq, void *cur)
 			return sk;
 	}
 
-	spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
+	spin_unlock_bh(inet_ehash_lockp(hinfo, st->bucket));
 	++st->bucket;
 	return established_get_first(seq);
 }
@@ -2458,6 +2466,7 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
 
 static void *tcp_seek_last_pos(struct seq_file *seq)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct tcp_iter_state *st = seq->private;
 	int bucket = st->bucket;
 	int offset = st->offset;
@@ -2466,7 +2475,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 
 	switch (st->state) {
 	case TCP_SEQ_STATE_LISTENING:
-		if (st->bucket > tcp_hashinfo.lhash2_mask)
+		if (st->bucket > hinfo->lhash2_mask)
 			break;
 		st->state = TCP_SEQ_STATE_LISTENING;
 		rc = listening_get_first(seq);
@@ -2478,7 +2487,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 		st->state = TCP_SEQ_STATE_ESTABLISHED;
 		fallthrough;
 	case TCP_SEQ_STATE_ESTABLISHED:
-		if (st->bucket > tcp_hashinfo.ehash_mask)
+		if (st->bucket > hinfo->ehash_mask)
 			break;
 		rc = established_get_first(seq);
 		while (offset-- && rc && bucket == st->bucket)
@@ -2546,16 +2555,17 @@ EXPORT_SYMBOL(tcp_seq_next);
 
 void tcp_seq_stop(struct seq_file *seq, void *v)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct tcp_iter_state *st = seq->private;
 
 	switch (st->state) {
 	case TCP_SEQ_STATE_LISTENING:
 		if (v != SEQ_START_TOKEN)
-			spin_unlock(&tcp_hashinfo.lhash2[st->bucket].lock);
+			spin_unlock(&hinfo->lhash2[st->bucket].lock);
 		break;
 	case TCP_SEQ_STATE_ESTABLISHED:
 		if (v)
-			spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
+			spin_unlock_bh(inet_ehash_lockp(hinfo, st->bucket));
 		break;
 	}
 }
@@ -2750,6 +2760,7 @@ static int bpf_iter_tcp_realloc_batch(struct bpf_tcp_iter_state *iter,
 static unsigned int bpf_iter_tcp_listening_batch(struct seq_file *seq,
 						 struct sock *start_sk)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct bpf_tcp_iter_state *iter = seq->private;
 	struct tcp_iter_state *st = &iter->state;
 	struct hlist_nulls_node *node;
@@ -2769,7 +2780,7 @@ static unsigned int bpf_iter_tcp_listening_batch(struct seq_file *seq,
 			expected++;
 		}
 	}
-	spin_unlock(&tcp_hashinfo.lhash2[st->bucket].lock);
+	spin_unlock(&hinfo->lhash2[st->bucket].lock);
 
 	return expected;
 }
@@ -2777,6 +2788,7 @@ static unsigned int bpf_iter_tcp_listening_batch(struct seq_file *seq,
 static unsigned int bpf_iter_tcp_established_batch(struct seq_file *seq,
 						   struct sock *start_sk)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct bpf_tcp_iter_state *iter = seq->private;
 	struct tcp_iter_state *st = &iter->state;
 	struct hlist_nulls_node *node;
@@ -2796,13 +2808,14 @@ static unsigned int bpf_iter_tcp_established_batch(struct seq_file *seq,
 			expected++;
 		}
 	}
-	spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
+	spin_unlock_bh(inet_ehash_lockp(hinfo, st->bucket));
 
 	return expected;
 }
 
 static struct sock *bpf_iter_tcp_batch(struct seq_file *seq)
 {
+	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo;
 	struct bpf_tcp_iter_state *iter = seq->private;
 	struct tcp_iter_state *st = &iter->state;
 	unsigned int expected;
@@ -2818,7 +2831,7 @@ static struct sock *bpf_iter_tcp_batch(struct seq_file *seq)
 		st->offset = 0;
 		st->bucket++;
 		if (st->state == TCP_SEQ_STATE_LISTENING &&
-		    st->bucket > tcp_hashinfo.lhash2_mask) {
+		    st->bucket > hinfo->lhash2_mask) {
 			st->state = TCP_SEQ_STATE_ESTABLISHED;
 			st->bucket = 0;
 		}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index cb95d88497ae..361aad67c6d6 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -319,7 +319,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 		/* Linkage updates.
 		 * Note that access to tw after this point is illegal.
 		 */
-		inet_twsk_hashdance(tw, sk, &tcp_hashinfo);
+		inet_twsk_hashdance(tw, sk, tcp_death_row->hashinfo);
 		local_bh_enable();
 	} else {
 		/* Sorry, if we're out of memory, just CLOSE this
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 8220923a12f7..3cdeed3d2e1a 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -151,6 +151,7 @@ static void esp_free_tcp_sk(struct rcu_head *head)
 static struct sock *esp6_find_tcp_sk(struct xfrm_state *x)
 {
 	struct xfrm_encap_tmpl *encap = x->encap;
+	struct net *net = xs_net(x);
 	struct esp_tcp_sk *esk;
 	__be16 sport, dport;
 	struct sock *nsk;
@@ -177,7 +178,7 @@ static struct sock *esp6_find_tcp_sk(struct xfrm_state *x)
 	}
 	spin_unlock_bh(&x->lock);
 
-	sk = __inet6_lookup_established(xs_net(x), &tcp_hashinfo, &x->id.daddr.in6,
+	sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row->hashinfo, &x->id.daddr.in6,
 					dport, &x->props.saddr.in6, ntohs(sport), 0, 0);
 	if (!sk)
 		return ERR_PTR(-ENOENT);
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 7d53d62783b1..52513a6a54fd 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -21,8 +21,6 @@
 #include <net/ip.h>
 #include <net/sock_reuseport.h>
 
-extern struct inet_hashinfo tcp_hashinfo;
-
 u32 inet6_ehashfn(const struct net *net,
 		  const struct in6_addr *laddr, const u16 lport,
 		  const struct in6_addr *faddr, const __be16 fport)
@@ -169,7 +167,7 @@ static inline struct sock *inet6_lookup_run_bpf(struct net *net,
 	struct sock *sk, *reuse_sk;
 	bool no_reuseport;
 
-	if (hashinfo != &tcp_hashinfo)
+	if (hashinfo != net->ipv4.tcp_death_row->hashinfo)
 		return NULL; /* only TCP is supported */
 
 	no_reuseport = bpf_sk_lookup_run_v6(net, IPPROTO_TCP, saddr, sport,
diff --git a/net/ipv6/netfilter/nf_socket_ipv6.c b/net/ipv6/netfilter/nf_socket_ipv6.c
index aa5bb8789ba0..6356407065c7 100644
--- a/net/ipv6/netfilter/nf_socket_ipv6.c
+++ b/net/ipv6/netfilter/nf_socket_ipv6.c
@@ -83,7 +83,7 @@ nf_socket_get_sock_v6(struct net *net, struct sk_buff *skb, int doff,
 {
 	switch (protocol) {
 	case IPPROTO_TCP:
-		return inet6_lookup(net, &tcp_hashinfo, skb, doff,
+		return inet6_lookup(net, net->ipv4.tcp_death_row->hashinfo, skb, doff,
 				    saddr, sport, daddr, dport,
 				    in->ifindex);
 	case IPPROTO_UDP:
diff --git a/net/ipv6/netfilter/nf_tproxy_ipv6.c b/net/ipv6/netfilter/nf_tproxy_ipv6.c
index 6bac68fb27a3..b0e5fb3e4d95 100644
--- a/net/ipv6/netfilter/nf_tproxy_ipv6.c
+++ b/net/ipv6/netfilter/nf_tproxy_ipv6.c
@@ -80,6 +80,7 @@ nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff,
 		      const struct net_device *in,
 		      const enum nf_tproxy_lookup_t lookup_type)
 {
+	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;
 	struct sock *sk;
 
 	switch (protocol) {
@@ -93,7 +94,7 @@ nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff,
 
 		switch (lookup_type) {
 		case NF_TPROXY_LOOKUP_LISTENER:
-			sk = inet6_lookup_listener(net, &tcp_hashinfo, skb,
+			sk = inet6_lookup_listener(net, hinfo, skb,
 						   thoff + __tcp_hdrlen(hp),
 						   saddr, sport,
 						   daddr, ntohs(dport),
@@ -108,7 +109,7 @@ nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff,
 			 */
 			break;
 		case NF_TPROXY_LOOKUP_ESTABLISHED:
-			sk = __inet6_lookup_established(net, &tcp_hashinfo,
+			sk = __inet6_lookup_established(net, hinfo,
 							saddr, sport, daddr, ntohs(dport),
 							in->ifindex, 0);
 			break;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 791f24da9212..27b2fd98a2c4 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -286,12 +286,14 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 		goto failure;
 	}
 
+	tcp_death_row = sock_net(sk)->ipv4.tcp_death_row;
+
 	if (!saddr) {
 		struct inet_bind_hashbucket *prev_addr_hashbucket = NULL;
 		struct in6_addr prev_v6_rcv_saddr;
 
 		if (icsk->icsk_bind2_hash) {
-			prev_addr_hashbucket = inet_bhashfn_portaddr(&tcp_hashinfo,
+			prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
 								     sk, sock_net(sk),
 								     inet->inet_num);
 			prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
@@ -325,7 +327,6 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	inet->inet_dport = usin->sin6_port;
 
 	tcp_set_state(sk, TCP_SYN_SENT);
-	tcp_death_row = sock_net(sk)->ipv4.tcp_death_row;
 	err = inet6_hash_connect(tcp_death_row, sk);
 	if (err)
 		goto late_failure;
@@ -403,7 +404,7 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	bool fatal;
 	int err;
 
-	sk = __inet6_lookup_established(net, &tcp_hashinfo,
+	sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 					&hdr->daddr, th->dest,
 					&hdr->saddr, ntohs(th->source),
 					skb->dev->ifindex, inet6_sdif(skb));
@@ -1037,7 +1038,7 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
 		 * no RST generated if md5 hash doesn't match.
 		 */
 		sk1 = inet6_lookup_listener(net,
-					   &tcp_hashinfo, NULL, 0,
+					   net->ipv4.tcp_death_row->hashinfo, NULL, 0,
 					   &ipv6h->saddr,
 					   th->source, &ipv6h->daddr,
 					   ntohs(th->source), dif, sdif);
@@ -1636,7 +1637,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 	hdr = ipv6_hdr(skb);
 
 lookup:
-	sk = __inet6_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th),
+	sk = __inet6_lookup_skb(net->ipv4.tcp_death_row->hashinfo, skb, __tcp_hdrlen(th),
 				th->source, th->dest, inet6_iif(skb), sdif,
 				&refcounted);
 	if (!sk)
@@ -1811,7 +1812,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 	{
 		struct sock *sk2;
 
-		sk2 = inet6_lookup_listener(dev_net(skb->dev), &tcp_hashinfo,
+		sk2 = inet6_lookup_listener(net, net->ipv4.tcp_death_row->hashinfo,
 					    skb, __tcp_hdrlen(th),
 					    &ipv6_hdr(skb)->saddr, th->source,
 					    &ipv6_hdr(skb)->daddr,
@@ -1844,6 +1845,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 
 void tcp_v6_early_demux(struct sk_buff *skb)
 {
+	struct net *net = dev_net(skb->dev);
 	const struct ipv6hdr *hdr;
 	const struct tcphdr *th;
 	struct sock *sk;
@@ -1861,7 +1863,7 @@ void tcp_v6_early_demux(struct sk_buff *skb)
 		return;
 
 	/* Note : We use inet6_iif() here, not tcp_v6_iif() */
-	sk = __inet6_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
+	sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 					&hdr->saddr, th->source,
 					&hdr->daddr, ntohs(th->dest),
 					inet6_iif(skb), inet6_sdif(skb));
diff --git a/net/mptcp/mptcp_diag.c b/net/mptcp/mptcp_diag.c
index 7f9a71780437..12b0aac25a39 100644
--- a/net/mptcp/mptcp_diag.c
+++ b/net/mptcp/mptcp_diag.c
@@ -81,15 +81,18 @@ static void mptcp_diag_dump_listeners(struct sk_buff *skb, struct netlink_callba
 	struct mptcp_diag_ctx *diag_ctx = (void *)cb->ctx;
 	struct nlattr *bc = cb_data->inet_diag_nla_bc;
 	struct net *net = sock_net(skb->sk);
+	struct inet_hashinfo *hinfo;
 	int i;
 
-	for (i = diag_ctx->l_slot; i <= tcp_hashinfo.lhash2_mask; i++) {
+	hinfo = net->ipv4.tcp_death_row->hashinfo;
+
+	for (i = diag_ctx->l_slot; i <= hinfo->lhash2_mask; i++) {
 		struct inet_listen_hashbucket *ilb;
 		struct hlist_nulls_node *node;
 		struct sock *sk;
 		int num = 0;
 
-		ilb = &tcp_hashinfo.lhash2[i];
+		ilb = &hinfo->lhash2[i];
 
 		rcu_read_lock();
 		spin_lock(&ilb->lock);
-- 
2.30.2



* [PATCH v1 net-next 08/13] tcp: Introduce optional per-netns ehash.
  2022-08-26  0:04 [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns hash table Kuniyuki Iwashima
                   ` (6 preceding siblings ...)
  2022-08-26  0:04 ` [PATCH v1 net-next 07/13] tcp: Access &tcp_hashinfo via net Kuniyuki Iwashima
@ 2022-08-26  0:04 ` Kuniyuki Iwashima
  2022-08-26 15:24   ` Eric Dumazet
  2022-08-26  0:04 ` [PATCH v1 net-next 09/13] udp: Clean up some functions Kuniyuki Iwashima
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 24+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-fsdevel

The more sockets we have in the hash table, the more time we spend
looking up the socket.  When we run a number of small workloads on
the same host, they penalise each other and cause performance degradation.

Also, the root cause might be a single workload that consumes far more
resources than the others.  This often happens on a cloud service where
different workloads share the same computing resource.

To resolve the issue, we introduce an optional per-netns hash table for
TCP, but it's just ehash, and we still share the global bhash and lhash2.

With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours.  Also, we can reduce lock contention.

We can control the ehash size via a new sysctl knob.  However, depending
on workloads, it requires very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0).  If the
kernel fails to allocate enough memory for a new ehash, the netns falls
back to using the global ehash.

We can check the current ehash size via another read-only sysctl knob,
net.ipv4.tcp_ehash_entries.  A negative value means the netns shares
the global ehash (the per-netns ehash is disabled or its allocation
failed).

  # dmesg | cut -d ' ' -f 5- | grep "established hash"
  TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)

  # sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries

  # sysctl net.ipv4.tcp_child_ehash_entries
  net.ipv4.tcp_child_ehash_entries = 0  # disabled by default

  # ip netns add test1
  # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = -524288  # share the global ehash

  # sysctl -w net.ipv4.tcp_child_ehash_entries=100
  net.ipv4.tcp_child_ehash_entries = 100

  # sysctl net.ipv4.tcp_child_ehash_entries
  net.ipv4.tcp_child_ehash_entries = 128  # rounded up to 2^n

  # ip netns add test2
  # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 128  # own per-netns ehash

When two or more processes in the same netns concurrently create
per-netns ehashes with different sizes, we can guarantee which size
wins in one of the following ways (a sketch of way 1 follows the
list):

  1) Share the global ehash and create a per-netns ehash

  First, unshare() with tcp_child_ehash_entries==0.  This creates a
  netns with dedicated sysctl knobs, where we can safely change
  tcp_child_ehash_entries and then clone()/unshare() again to create
  a per-netns ehash.

  2) Lock the sysctl knob

  We can use flock(LOCK_MAND) or BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny
  read/write on sysctl knobs.
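
A minimal userspace sketch of way 1 (an illustration only: error and
capability handling are omitted, and /proc/sys is assumed to be
mounted):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sched.h>
  #include <unistd.h>

  int main(void)
  {
          int fd;

          /* 1st netns: tcp_child_ehash_entries is 0 here, so it still
           * shares the global ehash but owns private sysctl knobs.
           */
          unshare(CLONE_NEWNET);

          fd = open("/proc/sys/net/ipv4/tcp_child_ehash_entries", O_WRONLY);
          write(fd, "128", 3);  /* the kernel rounds it up to 2^n */
          close(fd);

          /* 2nd netns: created with its own 128-bucket ehash. */
          unshare(CLONE_NEWNET);

          return 0;
  }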

Note the default values of two sysctl knobs depend on the ehash size and
should be tuned carefully:

  tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
  tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
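
For example, a netns created with tcp_child_ehash_entries=8192 gets
tcp_max_tw_buckets=4096 and tcp_max_syn_backlog=max(128, 8192/128)=128
by default.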

In the future, we could further optimise ehash lookup/iteration by
removing the netns comparison for the per-netns ehash.

As a bonus, we can dismantle a netns faster.  Currently, while
destroying a netns, we call inet_twsk_purge(), which walks through the
global ehash.  The walk can be expensive because the global ehash holds
many sockets, in states other than TIME_WAIT and from all netns.  After
splitting the ehash, inet_twsk_purge() only needs to walk each netns's
own table to clean up its TIME_WAIT sockets.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 Documentation/networking/ip-sysctl.rst | 20 +++++++++
 include/net/inet_hashtables.h          |  6 +++
 include/net/netns/ipv4.h               |  1 +
 net/dccp/proto.c                       |  2 +
 net/ipv4/inet_hashtables.c             | 57 ++++++++++++++++++++++++++
 net/ipv4/inet_timewait_sock.c          |  4 +-
 net/ipv4/sysctl_net_ipv4.c             | 57 ++++++++++++++++++++++++++
 net/ipv4/tcp.c                         |  1 +
 net/ipv4/tcp_ipv4.c                    | 53 ++++++++++++++++++++----
 net/ipv6/tcp_ipv6.c                    | 12 +++++-
 10 files changed, 202 insertions(+), 11 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 56cd4ea059b2..97a0952b11e3 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1037,6 +1037,26 @@ tcp_challenge_ack_limit - INTEGER
 	in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
 	Default: 1000
 
+tcp_ehash_entries - INTEGER
+	Read-only number of hash buckets for TCP sockets in the current
+	networking namespace.
+
+	A negative value means the networking namespace does not own its
+	hash buckets and shares the initial networking namespace's one.
+
+tcp_child_ehash_entries - INTEGER
+	Control the number of hash buckets for TCP sockets in the child
+	networking namespace, which must be set before clone() or unshare().
+
+	The written value except for 0 is rounded up to 2^n.  0 is a special
+	value, meaning the child networking namespace will share the initial
+	networking namespace's hash buckets.
+
+	Note that the child will use the global one in case the kernel
+	fails to allocate enough memory.
+
+	Default: 0
+
 UDP variables
 =============
 
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 2c866112433e..039440936ab2 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -168,6 +168,8 @@ struct inet_hashinfo {
 	/* The 2nd listener table hashed by local port and address */
 	unsigned int			lhash2_mask;
 	struct inet_listen_hashbucket	*lhash2;
+
+	bool				pernet;
 };
 
 static inline struct inet_hashinfo *inet_get_hashinfo(const struct sock *sk)
@@ -214,6 +216,10 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
 	hashinfo->ehash_locks = NULL;
 }
 
+struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo,
+						 unsigned int ehash_entries);
+void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo);
+
 struct inet_bind_bucket *
 inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 			struct inet_bind_hashbucket *head,
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index c7320ef356d9..6d9c01879027 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -170,6 +170,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_pacing_ca_ratio;
 	int sysctl_tcp_wmem[3];
 	int sysctl_tcp_rmem[3];
+	unsigned int sysctl_tcp_child_ehash_entries;
 	unsigned long sysctl_tcp_comp_sack_delay_ns;
 	unsigned long sysctl_tcp_comp_sack_slack_ns;
 	int sysctl_max_syn_backlog;
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 7cd4a6cc99fc..c548ca3e9b0e 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1197,6 +1197,8 @@ static int __init dccp_init(void)
 		INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain);
 	}
 
+	dccp_hashinfo.pernet = false;
+
 	rc = dccp_mib_init();
 	if (rc)
 		goto out_free_dccp_bhash2;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 5eb21a95179b..a57932b14bc6 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -1145,3 +1145,60 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
 	return 0;
 }
 EXPORT_SYMBOL_GPL(inet_ehash_locks_alloc);
+
+struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo,
+						 unsigned int ehash_entries)
+{
+	struct inet_hashinfo *new_hashinfo;
+	int i;
+
+	new_hashinfo = kmalloc(sizeof(*new_hashinfo), GFP_KERNEL);
+	if (!new_hashinfo)
+		goto err;
+
+	new_hashinfo->ehash = kvmalloc_array(ehash_entries,
+					     sizeof(struct inet_ehash_bucket),
+					     GFP_KERNEL);
+	if (!new_hashinfo->ehash)
+		goto free_hashinfo;
+
+	new_hashinfo->ehash_mask = ehash_entries - 1;
+
+	if (inet_ehash_locks_alloc(new_hashinfo))
+		goto free_ehash;
+
+	for (i = 0; i < ehash_entries; i++)
+		INIT_HLIST_NULLS_HEAD(&new_hashinfo->ehash[i].chain, i);
+
+	new_hashinfo->bind_bucket_cachep = hashinfo->bind_bucket_cachep;
+	new_hashinfo->bhash = hashinfo->bhash;
+	new_hashinfo->bind2_bucket_cachep = hashinfo->bind2_bucket_cachep;
+	new_hashinfo->bhash2 = hashinfo->bhash2;
+	new_hashinfo->bhash_size = hashinfo->bhash_size;
+
+	new_hashinfo->lhash2_mask = hashinfo->lhash2_mask;
+	new_hashinfo->lhash2 = hashinfo->lhash2;
+
+	new_hashinfo->pernet = true;
+
+	return new_hashinfo;
+
+free_ehash:
+	kvfree(new_hashinfo->ehash);
+free_hashinfo:
+	kfree(new_hashinfo);
+err:
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_alloc);
+
+void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo)
+{
+	if (!hashinfo->pernet)
+		return;
+
+	inet_ehash_locks_free(hashinfo);
+	kvfree(hashinfo->ehash);
+	kfree(hashinfo);
+}
+EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_free);
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 47ccc343c9fb..a5d40acde9d6 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -59,8 +59,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
 	inet_twsk_bind_unhash(tw, hashinfo);
 	spin_unlock(&bhead->lock);
 
-	if (refcount_dec_and_test(&tw->tw_dr->tw_refcount))
+	if (refcount_dec_and_test(&tw->tw_dr->tw_refcount)) {
+		inet_pernet_hashinfo_free(hashinfo);
 		kfree(tw->tw_dr);
+	}
 
 	inet_twsk_put(tw);
 }
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 5490c285668b..03a3187c4705 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -382,6 +382,48 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl,
 	return ret;
 }
 
+static int proc_tcp_ehash_entries(struct ctl_table *table, int write,
+				  void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = container_of(table->data, struct net,
+				       ipv4.sysctl_tcp_child_ehash_entries);
+	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;
+	int tcp_ehash_entries;
+	struct ctl_table tbl;
+
+	tcp_ehash_entries = hinfo->ehash_mask + 1;
+
+	/* A negative number indicates that the child netns
+	 * shares the global ehash.
+	 */
+	if (!net_eq(net, &init_net) && !hinfo->pernet)
+		tcp_ehash_entries *= -1;
+
+	tbl.data = &tcp_ehash_entries;
+	tbl.maxlen = sizeof(int);
+
+	return proc_dointvec(&tbl, write, buffer, lenp, ppos);
+}
+
+static int proc_tcp_child_ehash_entries(struct ctl_table *table, int write,
+					void *buffer, size_t *lenp, loff_t *ppos)
+{
+	unsigned int tcp_child_ehash_entries;
+	int ret;
+
+	ret = proc_douintvec(table, write, buffer, lenp, ppos);
+	if (!write || ret)
+		return ret;
+
+	tcp_child_ehash_entries = READ_ONCE(*(unsigned int *)table->data);
+	if (tcp_child_ehash_entries)
+		tcp_child_ehash_entries = roundup_pow_of_two(tcp_child_ehash_entries);
+
+	WRITE_ONCE(*(unsigned int *)table->data, tcp_child_ehash_entries);
+
+	return 0;
+}
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
 					  void *buffer, size_t *lenp,
@@ -1321,6 +1363,21 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1         = SYSCTL_ZERO,
 		.extra2         = SYSCTL_ONE,
 	},
+	{
+		.procname	= "tcp_ehash_entries",
+		.data		= &init_net.ipv4.sysctl_tcp_child_ehash_entries,
+		.mode		= 0444,
+		.proc_handler	= proc_tcp_ehash_entries,
+	},
+	{
+		.procname	= "tcp_child_ehash_entries",
+		.data		= &init_net.ipv4.sysctl_tcp_child_ehash_entries,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_tcp_child_ehash_entries,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_INT_MAX,
+	},
 	{
 		.procname	= "udp_rmem_min",
 		.data		= &init_net.ipv4.sysctl_udp_rmem_min,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index baf6adb723ad..f8ce673e32cb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4788,6 +4788,7 @@ void __init tcp_init(void)
 		INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain);
 	}
 
+	tcp_hashinfo.pernet = false;
 
 	cnt = tcp_hashinfo.ehash_mask + 1;
 	sysctl_tcp_max_orphans = cnt / 2;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index b07930643b11..604119f46b52 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3109,14 +3109,23 @@ static void __net_exit tcp_sk_exit(struct net *net)
 	if (net->ipv4.tcp_congestion_control)
 		bpf_module_put(net->ipv4.tcp_congestion_control,
 			       net->ipv4.tcp_congestion_control->owner);
-	if (refcount_dec_and_test(&tcp_death_row->tw_refcount))
+	if (refcount_dec_and_test(&tcp_death_row->tw_refcount)) {
+		inet_pernet_hashinfo_free(tcp_death_row->hashinfo);
 		kfree(tcp_death_row);
+	}
 }
 
-static int __net_init tcp_sk_init(struct net *net)
+static void __net_init tcp_set_hashinfo(struct net *net, struct inet_hashinfo *hinfo)
 {
-	int cnt;
+	int ehash_entries = hinfo->ehash_mask + 1;
 
+	net->ipv4.tcp_death_row->hashinfo = hinfo;
+	net->ipv4.tcp_death_row->sysctl_max_tw_buckets = ehash_entries / 2;
+	net->ipv4.sysctl_max_syn_backlog = max(128, ehash_entries / 128);
+}
+
+static int __net_init tcp_sk_init(struct net *net)
+{
 	net->ipv4.sysctl_tcp_ecn = 2;
 	net->ipv4.sysctl_tcp_ecn_fallback = 1;
 
@@ -3145,12 +3154,10 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.tcp_death_row = kzalloc(sizeof(struct inet_timewait_death_row), GFP_KERNEL);
 	if (!net->ipv4.tcp_death_row)
 		return -ENOMEM;
+
 	refcount_set(&net->ipv4.tcp_death_row->tw_refcount, 1);
-	cnt = tcp_hashinfo.ehash_mask + 1;
-	net->ipv4.tcp_death_row->sysctl_max_tw_buckets = cnt / 2;
-	net->ipv4.tcp_death_row->hashinfo = &tcp_hashinfo;
+	tcp_set_hashinfo(net, &tcp_hashinfo);
 
-	net->ipv4.sysctl_max_syn_backlog = max(128, cnt / 128);
 	net->ipv4.sysctl_tcp_sack = 1;
 	net->ipv4.sysctl_tcp_window_scaling = 1;
 	net->ipv4.sysctl_tcp_timestamps = 1;
@@ -3206,18 +3213,46 @@ static int __net_init tcp_sk_init(struct net *net)
 	return 0;
 }
 
+static int __net_init tcp_sk_init_pernet_hashinfo(struct net *net, struct net *old_net)
+{
+	struct inet_hashinfo *child_hinfo;
+	int ehash_entries;
+
+	ehash_entries = READ_ONCE(old_net->ipv4.sysctl_tcp_child_ehash_entries);
+	if (!ehash_entries)
+		goto out;
+
+	child_hinfo = inet_pernet_hashinfo_alloc(&tcp_hashinfo, ehash_entries);
+	if (child_hinfo)
+		tcp_set_hashinfo(net, child_hinfo);
+	else
+		pr_warn("Failed to allocate TCP ehash (entries: %u) "
+			"for a netns, fallback to use the global one\n",
+			ehash_entries);
+out:
+	return 0;
+}
+
 static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
 {
+	bool purge_once = true;
 	struct net *net;
 
-	inet_twsk_purge(&tcp_hashinfo, AF_INET);
+	list_for_each_entry(net, net_exit_list, exit_list) {
+		if (net->ipv4.tcp_death_row->hashinfo->pernet) {
+			inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET);
+		} else if (purge_once) {
+			inet_twsk_purge(&tcp_hashinfo, AF_INET);
+			purge_once = false;
+		}
 
-	list_for_each_entry(net, net_exit_list, exit_list)
 		tcp_fastopen_ctx_destroy(net);
+	}
 }
 
 static struct pernet_operations __net_initdata tcp_sk_ops = {
        .init	   = tcp_sk_init,
+       .init2	   = tcp_sk_init_pernet_hashinfo,
        .exit	   = tcp_sk_exit,
        .exit_batch = tcp_sk_exit_batch,
 };
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 27b2fd98a2c4..19f730428720 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2229,7 +2229,17 @@ static void __net_exit tcpv6_net_exit(struct net *net)
 
 static void __net_exit tcpv6_net_exit_batch(struct list_head *net_exit_list)
 {
-	inet_twsk_purge(&tcp_hashinfo, AF_INET6);
+	bool purge_once = true;
+	struct net *net;
+
+	list_for_each_entry(net, net_exit_list, exit_list) {
+		if (net->ipv4.tcp_death_row->hashinfo->pernet) {
+			inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET6);
+		} else if (purge_once) {
+			inet_twsk_purge(&tcp_hashinfo, AF_INET6);
+			purge_once = false;
+		}
+	}
 }
 
 static struct pernet_operations tcpv6_net_ops = {
-- 
2.30.2



* [PATCH v1 net-next 09/13] udp: Clean up some functions.
  2022-08-26  0:04 [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns hash table Kuniyuki Iwashima
                   ` (7 preceding siblings ...)
  2022-08-26  0:04 ` [PATCH v1 net-next 08/13] tcp: Introduce optional per-netns ehash Kuniyuki Iwashima
@ 2022-08-26  0:04 ` Kuniyuki Iwashima
  2022-08-26  0:04 ` [PATCH v1 net-next 10/13] udp: Set NULL to sk->sk_prot->h.udp_table Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-fsdevel

This patch adds no functional change.  It cleans up some functions
that the following patches touch so that they are tidy and easy to
review/revert.  The change is:

  - Keep reverse Christmas tree order (illustrated in the sketch
    below)
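
A minimal sketch of the convention (illustrative code, not taken from
the patch): declarations run from the longest line at the top to the
shortest at the bottom, with uses kept out of the declarations.

  #include <net/sock.h>

  static int example(struct sock *sk)
  {
          struct net *net = sock_net(sk);   /* longest line first */
          unsigned int count = 0;
          int err = 0;                      /* shortest line last */

          if (!net)
                  err = -EINVAL;
          else
                  count++;

          return err ? err : (int)count;
  }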

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/udp.c | 35 +++++++++++++++++++++--------------
 net/ipv6/udp.c | 12 ++++++++----
 2 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 34eda973bbf1..c073304e60bb 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -232,10 +232,10 @@ static int udp_reuseport_add_sock(struct sock *sk, struct udp_hslot *hslot)
 int udp_lib_get_port(struct sock *sk, unsigned short snum,
 		     unsigned int hash2_nulladdr)
 {
-	struct udp_hslot *hslot, *hslot2;
 	struct udp_table *udptable = sk->sk_prot->h.udp_table;
-	int    error = 1;
+	struct udp_hslot *hslot, *hslot2;
 	struct net *net = sock_net(sk);
+	int error = 1;
 
 	if (!snum) {
 		int low, high, remaining;
@@ -2524,10 +2524,13 @@ static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
 						  __be16 rmt_port, __be32 rmt_addr,
 						  int dif, int sdif)
 {
-	struct sock *sk, *result;
 	unsigned short hnum = ntohs(loc_port);
-	unsigned int slot = udp_hashfn(net, hnum, udp_table.mask);
-	struct udp_hslot *hslot = &udp_table.hash[slot];
+	struct sock *sk, *result;
+	struct udp_hslot *hslot;
+	unsigned int slot;
+
+	slot = udp_hashfn(net, hnum, udp_table.mask);
+	hslot = &udp_table.hash[slot];
 
 	/* Do not bother scanning a too big list */
 	if (hslot->count > 10)
@@ -2555,14 +2558,18 @@ static struct sock *__udp4_lib_demux_lookup(struct net *net,
 					    __be16 rmt_port, __be32 rmt_addr,
 					    int dif, int sdif)
 {
-	unsigned short hnum = ntohs(loc_port);
-	unsigned int hash2 = ipv4_portaddr_hash(net, loc_addr, hnum);
-	unsigned int slot2 = hash2 & udp_table.mask;
-	struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
 	INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr);
-	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
+	unsigned short hnum = ntohs(loc_port);
+	unsigned int hash2, slot2;
+	struct udp_hslot *hslot2;
+	__portpair ports;
 	struct sock *sk;
 
+	hash2 = ipv4_portaddr_hash(net, loc_addr, hnum);
+	slot2 = hash2 & udp_table.mask;
+	hslot2 = &udp_table.hash2[slot2];
+	ports = INET_COMBINED_PORTS(rmt_port, hnum);
+
 	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
 		if (inet_match(net, sk, acookie, ports, dif, sdif))
 			return sk;
@@ -2963,10 +2970,10 @@ EXPORT_SYMBOL(udp_prot);
 
 static struct sock *udp_get_first(struct seq_file *seq, int start)
 {
-	struct sock *sk;
-	struct udp_seq_afinfo *afinfo;
 	struct udp_iter_state *state = seq->private;
 	struct net *net = seq_file_net(seq);
+	struct udp_seq_afinfo *afinfo;
+	struct sock *sk;
 
 	if (state->bpf_seq_afinfo)
 		afinfo = state->bpf_seq_afinfo;
@@ -2997,9 +3004,9 @@ static struct sock *udp_get_first(struct seq_file *seq, int start)
 
 static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk)
 {
-	struct udp_seq_afinfo *afinfo;
 	struct udp_iter_state *state = seq->private;
 	struct net *net = seq_file_net(seq);
+	struct udp_seq_afinfo *afinfo;
 
 	if (state->bpf_seq_afinfo)
 		afinfo = state->bpf_seq_afinfo;
@@ -3055,8 +3062,8 @@ EXPORT_SYMBOL(udp_seq_next);
 
 void udp_seq_stop(struct seq_file *seq, void *v)
 {
-	struct udp_seq_afinfo *afinfo;
 	struct udp_iter_state *state = seq->private;
+	struct udp_seq_afinfo *afinfo;
 
 	if (state->bpf_seq_afinfo)
 		afinfo = state->bpf_seq_afinfo;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 16c176e7c69a..f7e0248a00bc 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1036,12 +1036,16 @@ static struct sock *__udp6_lib_demux_lookup(struct net *net,
 			int dif, int sdif)
 {
 	unsigned short hnum = ntohs(loc_port);
-	unsigned int hash2 = ipv6_portaddr_hash(net, loc_addr, hnum);
-	unsigned int slot2 = hash2 & udp_table.mask;
-	struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
-	const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
+	unsigned int hash2, slot2;
+	struct udp_hslot *hslot2;
+	__portpair ports;
 	struct sock *sk;
 
+	hash2 = ipv6_portaddr_hash(net, loc_addr, hnum);
+	slot2 = hash2 & udp_table.mask;
+	hslot2 = &udp_table.hash2[slot2];
+	ports = INET_COMBINED_PORTS(rmt_port, hnum);
+
 	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
 		if (sk->sk_state == TCP_ESTABLISHED &&
 		    inet6_match(net, sk, rmt_addr, loc_addr, ports, dif, sdif))
-- 
2.30.2



* [PATCH v1 net-next 10/13] udp: Set NULL to sk->sk_prot->h.udp_table.
  2022-08-26  0:04 [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns hash table Kuniyuki Iwashima
                   ` (8 preceding siblings ...)
  2022-08-26  0:04 ` [PATCH v1 net-next 09/13] udp: Clean up some functions Kuniyuki Iwashima
@ 2022-08-26  0:04 ` Kuniyuki Iwashima
  2022-08-26  0:04 ` [PATCH v1 net-next 11/13] udp: Set NULL to udp_seq_afinfo.udp_table Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-fsdevel

We will soon introduce an optional per-netns hash table
for UDP.

This means we can no longer use the global sk->sk_prot->h.udp_table
to fetch a UDP hash table.

Instead, we set sk->sk_prot->h.udp_table to NULL for UDP and fetch
the proper table from net->ipv4.udp_table.

Note that UDP-Lite still needs sk->sk_prot->h.udp_table because it
keeps its own global hash table.
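
In other words, the new udp_get_table_prot() helper in the diff below
selects the table as follows (udplite_table is the existing global
UDP-Lite hash table, which this series leaves untouched):

  /* UDP:      sk->sk_prot->h.udp_table == NULL
   *           -> use sock_net(sk)->ipv4.udp_table
   * UDP-Lite: sk->sk_prot->h.udp_table == &udplite_table
   *           -> keep using the global UDP-Lite table
   */
  struct udp_table *udptable = sk->sk_prot->h.udp_table ? :
                               sock_net(sk)->ipv4.udp_table;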

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/netns/ipv4.h |  1 +
 net/ipv4/udp.c           | 15 +++++++++++----
 net/ipv6/udp.c           |  2 +-
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 6d9c01879027..c367da5d61e2 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -42,6 +42,7 @@ struct tcp_fastopen_context;
 
 struct netns_ipv4 {
 	struct inet_timewait_death_row *tcp_death_row;
+	struct udp_table *udp_table;
 
 #ifdef CONFIG_SYSCTL
 	struct ctl_table_header	*forw_hdr;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c073304e60bb..1bfae1fbe682 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -131,6 +131,11 @@ EXPORT_PER_CPU_SYMBOL_GPL(udp_memory_per_cpu_fw_alloc);
 #define MAX_UDP_PORTS 65536
 #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN)
 
+static struct udp_table *udp_get_table_prot(struct sock *sk)
+{
+	return sk->sk_prot->h.udp_table ? : sock_net(sk)->ipv4.udp_table;
+}
+
 static int udp_lib_lport_inuse(struct net *net, __u16 num,
 			       const struct udp_hslot *hslot,
 			       unsigned long *bitmap,
@@ -232,7 +237,7 @@ static int udp_reuseport_add_sock(struct sock *sk, struct udp_hslot *hslot)
 int udp_lib_get_port(struct sock *sk, unsigned short snum,
 		     unsigned int hash2_nulladdr)
 {
-	struct udp_table *udptable = sk->sk_prot->h.udp_table;
+	struct udp_table *udptable = udp_get_table_prot(sk);
 	struct udp_hslot *hslot, *hslot2;
 	struct net *net = sock_net(sk);
 	int error = 1;
@@ -2004,7 +2009,7 @@ EXPORT_SYMBOL(udp_disconnect);
 void udp_lib_unhash(struct sock *sk)
 {
 	if (sk_hashed(sk)) {
-		struct udp_table *udptable = sk->sk_prot->h.udp_table;
+		struct udp_table *udptable = udp_get_table_prot(sk);
 		struct udp_hslot *hslot, *hslot2;
 
 		hslot  = udp_hashslot(udptable, sock_net(sk),
@@ -2035,7 +2040,7 @@ EXPORT_SYMBOL(udp_lib_unhash);
 void udp_lib_rehash(struct sock *sk, u16 newhash)
 {
 	if (sk_hashed(sk)) {
-		struct udp_table *udptable = sk->sk_prot->h.udp_table;
+		struct udp_table *udptable = udp_get_table_prot(sk);
 		struct udp_hslot *hslot, *hslot2, *nhslot2;
 
 		hslot2 = udp_hashslot2(udptable, udp_sk(sk)->udp_portaddr_hash);
@@ -2960,7 +2965,7 @@ struct proto udp_prot = {
 	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_udp_wmem_min),
 	.sysctl_rmem_offset	= offsetof(struct net, ipv4.sysctl_udp_rmem_min),
 	.obj_size		= sizeof(struct udp_sock),
-	.h.udp_table		= &udp_table,
+	.h.udp_table		= NULL,
 	.diag_destroy		= udp_abort,
 };
 EXPORT_SYMBOL(udp_prot);
@@ -3273,6 +3278,8 @@ EXPORT_SYMBOL(udp_flow_hashrnd);
 
 static int __net_init udp_sysctl_init(struct net *net)
 {
+	net->ipv4.udp_table = &udp_table;
+
 	net->ipv4.sysctl_udp_rmem_min = PAGE_SIZE;
 	net->ipv4.sysctl_udp_wmem_min = PAGE_SIZE;
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index f7e0248a00bc..5dc069d5ad1e 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1747,7 +1747,7 @@ struct proto udpv6_prot = {
 	.sysctl_wmem_offset     = offsetof(struct net, ipv4.sysctl_udp_wmem_min),
 	.sysctl_rmem_offset     = offsetof(struct net, ipv4.sysctl_udp_rmem_min),
 	.obj_size		= sizeof(struct udp6_sock),
-	.h.udp_table		= &udp_table,
+	.h.udp_table		= NULL,
 	.diag_destroy		= udp_abort,
 };
 
-- 
2.30.2



* [PATCH v1 net-next 11/13] udp: Set NULL to udp_seq_afinfo.udp_table.
  2022-08-26  0:04 [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns hash table Kuniyuki Iwashima
                   ` (9 preceding siblings ...)
  2022-08-26  0:04 ` [PATCH v1 net-next 10/13] udp: Set NULL to sk->sk_prot->h.udp_table Kuniyuki Iwashima
@ 2022-08-26  0:04 ` Kuniyuki Iwashima
  2022-08-26  0:04 ` [PATCH v1 net-next 12/13] udp: Access &udp_table via net Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-fsdevel

We will soon introduce an optional per-netns hash table
for UDP.

This means we can no longer use the global udp_seq_afinfo.udp_table
to fetch a UDP hash table.

Instead, we set udp_seq_afinfo.udp_table to NULL for UDP and fetch
the proper table from net->ipv4.udp_table.

Note that UDP-Lite still needs udp_seq_afinfo.udp_table because it
keeps its own global hash table.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/udp.c | 32 ++++++++++++++++++++++++--------
 net/ipv6/udp.c |  2 +-
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1bfae1fbe682..17b4b175fc67 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -136,6 +136,12 @@ static struct udp_table *udp_get_table_prot(struct sock *sk)
 	return sk->sk_prot->h.udp_table ? : sock_net(sk)->ipv4.udp_table;
 }
 
+static struct udp_table *udp_get_table_afinfo(struct udp_seq_afinfo *afinfo,
+					      struct net *net)
+{
+	return afinfo->udp_table ? : net->ipv4.udp_table;
+}
+
 static int udp_lib_lport_inuse(struct net *net, __u16 num,
 			       const struct udp_hslot *hslot,
 			       unsigned long *bitmap,
@@ -2978,6 +2984,7 @@ static struct sock *udp_get_first(struct seq_file *seq, int start)
 	struct udp_iter_state *state = seq->private;
 	struct net *net = seq_file_net(seq);
 	struct udp_seq_afinfo *afinfo;
+	struct udp_table *udptable;
 	struct sock *sk;
 
 	if (state->bpf_seq_afinfo)
@@ -2985,9 +2992,11 @@ static struct sock *udp_get_first(struct seq_file *seq, int start)
 	else
 		afinfo = pde_data(file_inode(seq->file));
 
-	for (state->bucket = start; state->bucket <= afinfo->udp_table->mask;
+	udptable = udp_get_table_afinfo(afinfo, net);
+
+	for (state->bucket = start; state->bucket <= udptable->mask;
 	     ++state->bucket) {
-		struct udp_hslot *hslot = &afinfo->udp_table->hash[state->bucket];
+		struct udp_hslot *hslot = &udptable->hash[state->bucket];
 
 		if (hlist_empty(&hslot->head))
 			continue;
@@ -3012,6 +3021,7 @@ static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk)
 	struct udp_iter_state *state = seq->private;
 	struct net *net = seq_file_net(seq);
 	struct udp_seq_afinfo *afinfo;
+	struct udp_table *udptable;
 
 	if (state->bpf_seq_afinfo)
 		afinfo = state->bpf_seq_afinfo;
@@ -3025,8 +3035,11 @@ static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk)
 			 sk->sk_family != afinfo->family)));
 
 	if (!sk) {
-		if (state->bucket <= afinfo->udp_table->mask)
-			spin_unlock_bh(&afinfo->udp_table->hash[state->bucket].lock);
+		udptable = udp_get_table_afinfo(afinfo, net);
+
+		if (state->bucket <= udptable->mask)
+			spin_unlock_bh(&udptable->hash[state->bucket].lock);
+
 		return udp_get_first(seq, state->bucket + 1);
 	}
 	return sk;
@@ -3069,14 +3082,17 @@ void udp_seq_stop(struct seq_file *seq, void *v)
 {
 	struct udp_iter_state *state = seq->private;
 	struct udp_seq_afinfo *afinfo;
+	struct udp_table *udptable;
 
 	if (state->bpf_seq_afinfo)
 		afinfo = state->bpf_seq_afinfo;
 	else
 		afinfo = pde_data(file_inode(seq->file));
 
-	if (state->bucket <= afinfo->udp_table->mask)
-		spin_unlock_bh(&afinfo->udp_table->hash[state->bucket].lock);
+	udptable = udp_get_table_afinfo(afinfo, seq_file_net(seq));
+
+	if (state->bucket <= udptable->mask)
+		spin_unlock_bh(&udptable->hash[state->bucket].lock);
 }
 EXPORT_SYMBOL(udp_seq_stop);
 
@@ -3189,7 +3205,7 @@ EXPORT_SYMBOL(udp_seq_ops);
 
 static struct udp_seq_afinfo udp4_seq_afinfo = {
 	.family		= AF_INET,
-	.udp_table	= &udp_table,
+	.udp_table	= NULL,
 };
 
 static int __net_init udp4_proc_init_net(struct net *net)
@@ -3309,7 +3325,7 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
 		return -ENOMEM;
 
 	afinfo->family = AF_UNSPEC;
-	afinfo->udp_table = &udp_table;
+	afinfo->udp_table = NULL;
 	st->bpf_seq_afinfo = afinfo;
 	ret = bpf_iter_init_seq_net(priv_data, aux);
 	if (ret)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 5dc069d5ad1e..8abf16c5a8a1 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1697,7 +1697,7 @@ EXPORT_SYMBOL(udp6_seq_ops);
 
 static struct udp_seq_afinfo udp6_seq_afinfo = {
 	.family		= AF_INET6,
-	.udp_table	= &udp_table,
+	.udp_table	= NULL,
 };
 
 int __net_init udp6_proc_init(struct net *net)
-- 
2.30.2



* [PATCH v1 net-next 12/13] udp: Access &udp_table via net.
  2022-08-26  0:04 [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns hash table Kuniyuki Iwashima
                   ` (10 preceding siblings ...)
  2022-08-26  0:04 ` [PATCH v1 net-next 11/13] udp: Set NULL to udp_seq_afinfo.udp_table Kuniyuki Iwashima
@ 2022-08-26  0:04 ` Kuniyuki Iwashima
  2022-08-26  0:04 ` [PATCH v1 net-next 13/13] udp: Introduce optional per-netns hash table Kuniyuki Iwashima
  2022-08-26 15:17 ` [PATCH v1 net-next 00/13] tcp/udp: " Eric Dumazet
  13 siblings, 0 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-fsdevel

We will soon introduce an optional per-netns hash table
for UDP.

This means we can no longer use udp_table directly in most places.

Instead, we access it via net->ipv4.udp_table.

Direct access to the global udp_table remains valid only while
initialising udp_table itself and while creating/destroying each
netns.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/core/filter.c      |  4 ++--
 net/ipv4/udp.c         | 23 +++++++++++++----------
 net/ipv4/udp_diag.c    |  6 +++---
 net/ipv4/udp_offload.c |  5 +++--
 net/ipv6/udp.c         | 19 +++++++++++--------
 net/ipv6/udp_offload.c |  5 +++--
 6 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index dd5844033b66..38b247f5de24 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6484,7 +6484,7 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 		else
 			sk = __udp4_lib_lookup(net, src4, tuple->ipv4.sport,
 					       dst4, tuple->ipv4.dport,
-					       dif, sdif, &udp_table, NULL);
+					       dif, sdif, net->ipv4.udp_table, NULL);
 #if IS_ENABLED(CONFIG_IPV6)
 	} else {
 		struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr;
@@ -6500,7 +6500,7 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 							    src6, tuple->ipv6.sport,
 							    dst6, tuple->ipv6.dport,
 							    dif, sdif,
-							    &udp_table, NULL);
+							    net->ipv4.udp_table, NULL);
 #endif
 	}
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 17b4b175fc67..f4825e38762a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -478,7 +478,7 @@ static struct sock *udp4_lookup_run_bpf(struct net *net,
 	struct sock *sk, *reuse_sk;
 	bool no_reuseport;
 
-	if (udptable != &udp_table)
+	if (udptable != net->ipv4.udp_table)
 		return NULL; /* only UDP is supported */
 
 	no_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_UDP, saddr, sport,
@@ -559,10 +559,11 @@ struct sock *udp4_lib_lookup_skb(const struct sk_buff *skb,
 				 __be16 sport, __be16 dport)
 {
 	const struct iphdr *iph = ip_hdr(skb);
+	struct net *net = dev_net(skb->dev);
 
-	return __udp4_lib_lookup(dev_net(skb->dev), iph->saddr, sport,
+	return __udp4_lib_lookup(net, iph->saddr, sport,
 				 iph->daddr, dport, inet_iif(skb),
-				 inet_sdif(skb), &udp_table, NULL);
+				 inet_sdif(skb), net->ipv4.udp_table, NULL);
 }
 
 /* Must be called under rcu_read_lock().
@@ -575,7 +576,7 @@ struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
 	struct sock *sk;
 
 	sk = __udp4_lib_lookup(net, saddr, sport, daddr, dport,
-			       dif, 0, &udp_table, NULL);
+			       dif, 0, net->ipv4.udp_table, NULL);
 	if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
 		sk = NULL;
 	return sk;
@@ -810,7 +811,7 @@ int __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 
 int udp_err(struct sk_buff *skb, u32 info)
 {
-	return __udp4_lib_err(skb, info, &udp_table);
+	return __udp4_lib_err(skb, info, dev_net(skb->dev)->ipv4.udp_table);
 }
 
 /*
@@ -2535,13 +2536,14 @@ static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
 						  __be16 rmt_port, __be32 rmt_addr,
 						  int dif, int sdif)
 {
+	struct udp_table *udptable = net->ipv4.udp_table;
 	unsigned short hnum = ntohs(loc_port);
 	struct sock *sk, *result;
 	struct udp_hslot *hslot;
 	unsigned int slot;
 
-	slot = udp_hashfn(net, hnum, udp_table.mask);
-	hslot = &udp_table.hash[slot];
+	slot = udp_hashfn(net, hnum, udptable->mask);
+	hslot = &udptable->hash[slot];
 
 	/* Do not bother scanning a too big list */
 	if (hslot->count > 10)
@@ -2569,6 +2571,7 @@ static struct sock *__udp4_lib_demux_lookup(struct net *net,
 					    __be16 rmt_port, __be32 rmt_addr,
 					    int dif, int sdif)
 {
+	struct udp_table *udptable = net->ipv4.udp_table;
 	INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr);
 	unsigned short hnum = ntohs(loc_port);
 	unsigned int hash2, slot2;
@@ -2577,8 +2580,8 @@ static struct sock *__udp4_lib_demux_lookup(struct net *net,
 	struct sock *sk;
 
 	hash2 = ipv4_portaddr_hash(net, loc_addr, hnum);
-	slot2 = hash2 & udp_table.mask;
-	hslot2 = &udp_table.hash2[slot2];
+	slot2 = hash2 & udptable->mask;
+	hslot2 = &udptable->hash2[slot2];
 	ports = INET_COMBINED_PORTS(rmt_port, hnum);
 
 	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
@@ -2660,7 +2663,7 @@ int udp_v4_early_demux(struct sk_buff *skb)
 
 int udp_rcv(struct sk_buff *skb)
 {
-	return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);
+	return __udp4_lib_rcv(skb, dev_net(skb->dev)->ipv4.udp_table, IPPROTO_UDP);
 }
 
 void udp_destroy_sock(struct sock *sk)
diff --git a/net/ipv4/udp_diag.c b/net/ipv4/udp_diag.c
index 1ed8c4d78e5c..de3f2d31f510 100644
--- a/net/ipv4/udp_diag.c
+++ b/net/ipv4/udp_diag.c
@@ -147,13 +147,13 @@ static void udp_dump(struct udp_table *table, struct sk_buff *skb,
 static void udp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			  const struct inet_diag_req_v2 *r)
 {
-	udp_dump(&udp_table, skb, cb, r);
+	udp_dump(sock_net(cb->skb->sk)->ipv4.udp_table, skb, cb, r);
 }
 
 static int udp_diag_dump_one(struct netlink_callback *cb,
 			     const struct inet_diag_req_v2 *req)
 {
-	return udp_dump_one(&udp_table, cb, req);
+	return udp_dump_one(sock_net(cb->skb->sk)->ipv4.udp_table, cb, req);
 }
 
 static void udp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
@@ -225,7 +225,7 @@ static int __udp_diag_destroy(struct sk_buff *in_skb,
 static int udp_diag_destroy(struct sk_buff *in_skb,
 			    const struct inet_diag_req_v2 *req)
 {
-	return __udp_diag_destroy(in_skb, req, &udp_table);
+	return __udp_diag_destroy(in_skb, req, sock_net(in_skb->sk)->ipv4.udp_table);
 }
 
 static int udplite_diag_destroy(struct sk_buff *in_skb,
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 6d1a4bec2614..aedde65e2268 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -600,10 +600,11 @@ static struct sock *udp4_gro_lookup_skb(struct sk_buff *skb, __be16 sport,
 					__be16 dport)
 {
 	const struct iphdr *iph = skb_gro_network_header(skb);
+	struct net *net = dev_net(skb->dev);
 
-	return __udp4_lib_lookup(dev_net(skb->dev), iph->saddr, sport,
+	return __udp4_lib_lookup(net, iph->saddr, sport,
 				 iph->daddr, dport, inet_iif(skb),
-				 inet_sdif(skb), &udp_table, NULL);
+				 inet_sdif(skb), net->ipv4.udp_table, NULL);
 }
 
 INDIRECT_CALLABLE_SCOPE
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8abf16c5a8a1..739a8c86f0b6 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -203,7 +203,7 @@ static inline struct sock *udp6_lookup_run_bpf(struct net *net,
 	struct sock *sk, *reuse_sk;
 	bool no_reuseport;
 
-	if (udptable != &udp_table)
+	if (udptable != net->ipv4.udp_table)
 		return NULL; /* only UDP is supported */
 
 	no_reuseport = bpf_sk_lookup_run_v6(net, IPPROTO_UDP, saddr, sport,
@@ -284,10 +284,11 @@ struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb,
 				 __be16 sport, __be16 dport)
 {
 	const struct ipv6hdr *iph = ipv6_hdr(skb);
+	struct net *net = dev_net(skb->dev);
 
-	return __udp6_lib_lookup(dev_net(skb->dev), &iph->saddr, sport,
+	return __udp6_lib_lookup(net, &iph->saddr, sport,
 				 &iph->daddr, dport, inet6_iif(skb),
-				 inet6_sdif(skb), &udp_table, NULL);
+				 inet6_sdif(skb), net->ipv4.udp_table, NULL);
 }
 
 /* Must be called under rcu_read_lock().
@@ -300,7 +301,7 @@ struct sock *udp6_lib_lookup(struct net *net, const struct in6_addr *saddr, __be
 	struct sock *sk;
 
 	sk =  __udp6_lib_lookup(net, saddr, sport, daddr, dport,
-				dif, 0, &udp_table, NULL);
+				dif, 0, net->ipv4.udp_table, NULL);
 	if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
 		sk = NULL;
 	return sk;
@@ -667,7 +668,8 @@ static __inline__ int udpv6_err(struct sk_buff *skb,
 				struct inet6_skb_parm *opt, u8 type,
 				u8 code, int offset, __be32 info)
 {
-	return __udp6_lib_err(skb, opt, type, code, offset, info, &udp_table);
+	return __udp6_lib_err(skb, opt, type, code, offset, info,
+			      dev_net(skb->dev)->ipv4.udp_table);
 }
 
 static int udpv6_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
@@ -1035,6 +1037,7 @@ static struct sock *__udp6_lib_demux_lookup(struct net *net,
 			__be16 rmt_port, const struct in6_addr *rmt_addr,
 			int dif, int sdif)
 {
+	struct udp_table *udptable = net->ipv4.udp_table;
 	unsigned short hnum = ntohs(loc_port);
 	unsigned int hash2, slot2;
 	struct udp_hslot *hslot2;
@@ -1042,8 +1045,8 @@ static struct sock *__udp6_lib_demux_lookup(struct net *net,
 	struct sock *sk;
 
 	hash2 = ipv6_portaddr_hash(net, loc_addr, hnum);
-	slot2 = hash2 & udp_table.mask;
-	hslot2 = &udp_table.hash2[slot2];
+	slot2 = hash2 & udptable->mask;
+	hslot2 = &udptable->hash2[slot2];
 	ports = INET_COMBINED_PORTS(rmt_port, hnum);
 
 	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
@@ -1099,7 +1102,7 @@ void udp_v6_early_demux(struct sk_buff *skb)
 
 INDIRECT_CALLABLE_SCOPE int udpv6_rcv(struct sk_buff *skb)
 {
-	return __udp6_lib_rcv(skb, &udp_table, IPPROTO_UDP);
+	return __udp6_lib_rcv(skb, dev_net(skb->dev)->ipv4.udp_table, IPPROTO_UDP);
 }
 
 /*
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 7720d04ed396..e0e10f6bcdc1 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -116,10 +116,11 @@ static struct sock *udp6_gro_lookup_skb(struct sk_buff *skb, __be16 sport,
 					__be16 dport)
 {
 	const struct ipv6hdr *iph = skb_gro_network_header(skb);
+	struct net *net = dev_net(skb->dev);
 
-	return __udp6_lib_lookup(dev_net(skb->dev), &iph->saddr, sport,
+	return __udp6_lib_lookup(net, &iph->saddr, sport,
 				 &iph->daddr, dport, inet6_iif(skb),
-				 inet6_sdif(skb), &udp_table, NULL);
+				 inet6_sdif(skb), net->ipv4.udp_table, NULL);
 }
 
 INDIRECT_CALLABLE_SCOPE
-- 
2.30.2



* [PATCH v1 net-next 13/13] udp: Introduce optional per-netns hash table.
  2022-08-26  0:04 [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns hash table Kuniyuki Iwashima
                   ` (11 preceding siblings ...)
  2022-08-26  0:04 ` [PATCH v1 net-next 12/13] udp: Access &udp_table via net Kuniyuki Iwashima
@ 2022-08-26  0:04 ` Kuniyuki Iwashima
  2022-08-26 15:17 ` [PATCH v1 net-next 00/13] tcp/udp: " Eric Dumazet
  13 siblings, 0 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-26  0:04 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-fsdevel

We introduce an optional per-netns hash table for UDP.

With a smaller hash table, we can look up sockets faster and isolate
noisy neighbours.  Also, we can reduce lock contention.

We can control the hash table size via a new sysctl knob.  However,
depending on workloads, it will require very careful tuning, so we
disable the feature by default (net.ipv4.udp_child_hash_entries == 0).
If we fail to allocate enough memory for a new hash table, we fall
back to the global hash table.

We can check the current hash table size via another read-only sysctl
knob, net.ipv4.udp_hash_entries.  A negative value means the netns
shares the global hash table (the per-netns hash table is disabled or
its allocation failed).
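
For example, the interface mirrors the TCP knobs in the cover letter
(a sketch; the concrete sizes below are illustrative):

  # sysctl net.ipv4.udp_hash_entries
  net.ipv4.udp_hash_entries = 65536  # global hash table, sized at boot

  # sysctl -w net.ipv4.udp_child_hash_entries=100
  net.ipv4.udp_child_hash_entries = 100

  # ip netns add test1
  # ip netns exec test1 sysctl net.ipv4.udp_hash_entries
  net.ipv4.udp_hash_entries = 128  # 100 rounded up to 2^n, own hash table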

In the future, we could optimise the hash table lookup/iteration
further by skipping the netns comparison for per-netns hash tables.
Also, we could optimise the sparse udp_hslot layout by moving it into
udp_table.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 Documentation/networking/ip-sysctl.rst | 20 ++++++++
 include/net/netns/ipv4.h               |  2 +
 net/ipv4/sysctl_net_ipv4.c             | 56 +++++++++++++++++++++
 net/ipv4/udp.c                         | 69 ++++++++++++++++++++++++++
 4 files changed, 147 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 97a0952b11e3..6dc4e2853e39 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1090,6 +1090,26 @@ udp_rmem_min - INTEGER
 udp_wmem_min - INTEGER
 	UDP does not have tx memory accounting and this tunable has no effect.
 
+udp_hash_entries - INTEGER
+	Read-only number of hash buckets for UDP sockets in the current
+	networking namespace.
+
+	A negative value means the networking namespace does not own its
+	hash buckets and shares those of the initial networking namespace.
+
+udp_child_hash_entries - INTEGER
+	Control the number of hash buckets for UDP sockets in the child
+	networking namespace, which must be set before clone() or unshare().
+
+	Any value written other than 0 is rounded up to the next power of
+	two (2^n).  0 is a special value, meaning the child networking
+	namespace will share the initial networking namespace's hash buckets.
+
+	Note that the child falls back to the global hash buckets if the
+	kernel fails to allocate enough memory for the per-netns ones.
+
+	Default: 0
+
 RAW variables
 =============
 
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index c367da5d61e2..a1be7ebb7338 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -200,6 +200,8 @@ struct netns_ipv4 {
 
 	atomic_t dev_addr_genid;
 
+	unsigned int sysctl_udp_child_hash_entries;
+
 #ifdef CONFIG_SYSCTL
 	unsigned long *sysctl_local_reserved_ports;
 	int sysctl_ip_prot_sock;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 03a3187c4705..b3cea3f36463 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -424,6 +424,47 @@ static int proc_tcp_child_ehash_entries(struct ctl_table *table, int write,
 	return 0;
 }
 
+static int proc_udp_hash_entries(struct ctl_table *table, int write,
+				 void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = container_of(table->data, struct net,
+				       ipv4.sysctl_udp_child_hash_entries);
+	int udp_hash_entries;
+	struct ctl_table tbl;
+
+	udp_hash_entries = net->ipv4.udp_table->mask + 1;
+
+	/* A negative number indicates that the child netns
+	 * shares the global udp_table.
+	 */
+	if (!net_eq(net, &init_net) && net->ipv4.udp_table == &udp_table)
+		udp_hash_entries *= -1;
+
+	tbl.data = &udp_hash_entries;
+	tbl.maxlen = sizeof(int);
+
+	return proc_dointvec(&tbl, write, buffer, lenp, ppos);
+}
+
+static int proc_udp_child_hash_entries(struct ctl_table *table, int write,
+				       void *buffer, size_t *lenp, loff_t *ppos)
+{
+	unsigned int udp_child_hash_entries;
+	int ret;
+
+	ret = proc_douintvec(table, write, buffer, lenp, ppos);
+	if (!write || ret)
+		return ret;
+
+	udp_child_hash_entries = READ_ONCE(*(unsigned int *)table->data);
+	if (udp_child_hash_entries)
+		udp_child_hash_entries = roundup_pow_of_two(udp_child_hash_entries);
+
+	WRITE_ONCE(*(unsigned int *)table->data, udp_child_hash_entries);
+
+	return 0;
+}
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
 					  void *buffer, size_t *lenp,
@@ -1378,6 +1419,21 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_INT_MAX,
 	},
+	{
+		.procname	= "udp_hash_entries",
+		.data		= &init_net.ipv4.sysctl_udp_child_hash_entries,
+		.mode		= 0444,
+		.proc_handler	= proc_udp_hash_entries,
+	},
+	{
+		.procname	= "udp_child_hash_entries",
+		.data		= &init_net.ipv4.sysctl_udp_child_hash_entries,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_udp_child_hash_entries,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_INT_MAX,
+	},
 	{
 		.procname	= "udp_rmem_min",
 		.data		= &init_net.ipv4.sysctl_udp_rmem_min,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f4825e38762a..c41306225305 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3309,8 +3309,77 @@ static int __net_init udp_sysctl_init(struct net *net)
 	return 0;
 }
 
+static struct udp_table __net_init *udp_pernet_table_alloc(unsigned int hash_entries)
+{
+	struct udp_table *udptable;
+	int i;
+
+	udptable = kmalloc(sizeof(*udptable), GFP_KERNEL);
+	if (!udptable)
+		goto out;
+
+	udptable->hash = kvmalloc_array(hash_entries * 2,
+					sizeof(struct udp_hslot), GFP_KERNEL);
+	if (!udptable->hash)
+		goto free_table;
+
+	udptable->hash2 = udptable->hash + hash_entries;
+	udptable->mask = hash_entries - 1;
+	udptable->log = ilog2(hash_entries);
+
+	for (i = 0; i < hash_entries; i++) {
+		INIT_HLIST_HEAD(&udptable->hash[i].head);
+		udptable->hash[i].count = 0;
+		spin_lock_init(&udptable->hash[i].lock);
+
+		INIT_HLIST_HEAD(&udptable->hash2[i].head);
+		udptable->hash2[i].count = 0;
+		spin_lock_init(&udptable->hash2[i].lock);
+	}
+
+	return udptable;
+
+free_table:
+	kfree(udptable);
+out:
+	return NULL;
+}
+
+static int __net_init udp_pernet_table_init(struct net *net, struct net *old_net)
+{
+	struct udp_table *udptable;
+	unsigned int hash_entries;
+
+	hash_entries = READ_ONCE(old_net->ipv4.sysctl_udp_child_hash_entries);
+	if (!hash_entries)
+		goto out;
+
+	udptable = udp_pernet_table_alloc(hash_entries);
+	if (udptable)
+		net->ipv4.udp_table = udptable;
+	else
+		pr_warn("Failed to allocate UDP hash table (entries: %u) "
+			"for a netns, falling back to the global one\n",
+			hash_entries);
+out:
+	return 0;
+}
+
+static void __net_exit udp_pernet_table_free(struct net *net)
+{
+	struct udp_table *udptable = net->ipv4.udp_table;
+
+	if (udptable == &udp_table)
+		return;
+
+	kvfree(udptable->hash);
+	kfree(udptable);
+}
+
 static struct pernet_operations __net_initdata udp_sysctl_ops = {
 	.init	= udp_sysctl_init,
+	.init2	= udp_pernet_table_init,
+	.exit	= udp_pernet_table_free,
 };
 
 #if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_PROC_FS)
-- 
2.30.2



* Re: [PATCH v1 net-next 01/13] fs/lock: Revive LOCK_MAND.
  2022-08-26  0:04 ` [PATCH v1 net-next 01/13] fs/lock: Revive LOCK_MAND Kuniyuki Iwashima
@ 2022-08-26 10:02   ` Jeff Layton
  2022-08-26 16:48     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2022-08-26 10:02 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Chuck Lever, Luis Chamberlain, Kees Cook,
	Iurii Zaikin
  Cc: Kuniyuki Iwashima, netdev, linux-fsdevel

On Thu, 2022-08-25 at 17:04 -0700, Kuniyuki Iwashima wrote:
> The commit 90f7d7a0d0d6 ("locks: remove LOCK_MAND flock lock support")
> removed LOCK_MAND support from the kernel because nothing checked the
> flag, nor was there any use case.  This patch revives LOCK_MAND to
> introduce a mandatory lock for read/write on /proc/sys.  Currently, it's
> the only use case, so we added two changes while reverting the commit.
> 
> First, we used to allow any f_mode for LOCK_MAND, but we do not bring
> that back.  Instead, we require FMODE_READ|FMODE_WRITE, as for LOCK_SH
> and LOCK_EX.
> 
> Second, when f_ops->flock() was called with LOCK_MAND, each implementation
> returned -EOPNOTSUPP.  The following patch does not use f_ops->flock(),
> so we moved the validation before the f_ops->flock() call.
> 
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  fs/locks.c                       | 57 ++++++++++++++++++++------------
>  include/uapi/asm-generic/fcntl.h |  5 ---
>  2 files changed, 35 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index c266cfdc3291..03ff10a3165e 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -421,6 +421,10 @@ static inline int flock_translate_cmd(int cmd) {
>  	case LOCK_UN:
>  		return F_UNLCK;
>  	}
> +
> +	if (cmd & LOCK_MAND)
> +		return cmd & (LOCK_MAND | LOCK_RW);
> +
>  	return -EINVAL;
>  }
>  
> @@ -879,6 +883,10 @@ static bool flock_locks_conflict(struct file_lock *caller_fl,
>  	if (caller_fl->fl_file == sys_fl->fl_file)
>  		return false;
>  
> +	if (caller_fl->fl_type & LOCK_MAND ||
> +	    sys_fl->fl_type & LOCK_MAND)
> +		return true;
> +
>  	return locks_conflict(caller_fl, sys_fl);
>  }
>  
> @@ -2077,9 +2085,7 @@ EXPORT_SYMBOL(locks_lock_inode_wait);
>   *	- %LOCK_SH -- a shared lock.
>   *	- %LOCK_EX -- an exclusive lock.
>   *	- %LOCK_UN -- remove an existing lock.
> - *	- %LOCK_MAND -- a 'mandatory' flock. (DEPRECATED)
> - *
> - *	%LOCK_MAND support has been removed from the kernel.
> + *	- %LOCK_MAND -- a 'mandatory' flock. (only supported on /proc/sys/)
>   */
>  SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
>  {
> @@ -2087,19 +2093,6 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
>  	struct file_lock fl;
>  	struct fd f;
>  
> -	/*
> -	 * LOCK_MAND locks were broken for a long time in that they never
> -	 * conflicted with one another and didn't prevent any sort of open,
> -	 * read or write activity.
> -	 *
> -	 * Just ignore these requests now, to preserve legacy behavior, but
> -	 * throw a warning to let people know that they don't actually work.
> -	 */
> -	if (cmd & LOCK_MAND) {
> -		pr_warn_once("Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.\n");
> -		return 0;
> -	}
> -
>  	type = flock_translate_cmd(cmd & ~LOCK_NB);
>  	if (type < 0)
>  		return type;
> @@ -2109,6 +2102,7 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
>  	if (!f.file)
>  		return error;
>  
> +	/* LOCK_MAND supports only read/write on proc_sysctl for now */
>  	if (type != F_UNLCK && !(f.file->f_mode & (FMODE_READ | FMODE_WRITE)))
>  		goto out_putf;
>  
> @@ -2122,12 +2116,18 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
>  	if (can_sleep)
>  		fl.fl_flags |= FL_SLEEP;
>  
> -	if (f.file->f_op->flock)
> +	if (f.file->f_op->flock) {
> +		if (cmd & LOCK_MAND) {
> +			error = -EOPNOTSUPP;
> +			goto out_putf;
> +		}
> +
>  		error = f.file->f_op->flock(f.file,
>  					    (can_sleep) ? F_SETLKW : F_SETLK,
>  					    &fl);
> -	else
> +	} else {
>  		error = locks_lock_file_wait(f.file, &fl);
> +	}
>  
>   out_putf:
>  	fdput(f);
> @@ -2711,7 +2711,11 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
>  		seq_printf(f, " %s ",
>  			     (inode == NULL) ? "*NOINODE*" : "ADVISORY ");
>  	} else if (IS_FLOCK(fl)) {
> -		seq_puts(f, "FLOCK  ADVISORY  ");
> +		if (fl->fl_type & LOCK_MAND) {
> +			seq_puts(f, "FLOCK  MANDATORY ");
> +		} else {
> +			seq_puts(f, "FLOCK  ADVISORY  ");
> +		}
>  	} else if (IS_LEASE(fl)) {
>  		if (fl->fl_flags & FL_DELEG)
>  			seq_puts(f, "DELEG  ");
> @@ -2727,10 +2731,19 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
>  	} else {
>  		seq_puts(f, "UNKNOWN UNKNOWN  ");
>  	}
> -	type = IS_LEASE(fl) ? target_leasetype(fl) : fl->fl_type;
>  
> -	seq_printf(f, "%s ", (type == F_WRLCK) ? "WRITE" :
> -			     (type == F_RDLCK) ? "READ" : "UNLCK");
> +	if (fl->fl_type & LOCK_MAND) {
> +		seq_printf(f, "%s ",
> +			   (fl->fl_type & LOCK_READ)
> +			   ? (fl->fl_type & LOCK_WRITE) ? "RW   " : "READ "
> +			   : (fl->fl_type & LOCK_WRITE) ? "WRITE" : "NONE ");
> +	} else {
> +		type = IS_LEASE(fl) ? target_leasetype(fl) : fl->fl_type;
> +
> +		seq_printf(f, "%s ", (type == F_WRLCK) ? "WRITE" :
> +			   (type == F_RDLCK) ? "READ" : "UNLCK");
> +	}
> +
>  	if (inode) {
>  		/* userspace relies on this representation of dev_t */
>  		seq_printf(f, "%d %02x:%02x:%lu ", fl_pid,
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 1ecdb911add8..94fb8c6fd543 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -180,11 +180,6 @@ struct f_owner_ex {
>  #define LOCK_NB		4	/* or'd with one of the above to prevent
>  				   blocking */
>  #define LOCK_UN		8	/* remove lock */
> -
> -/*
> - * LOCK_MAND support has been removed from the kernel. We leave the symbols
> - * here to not break legacy builds, but these should not be used in new code.
> - */
>  #define LOCK_MAND	32	/* This is a mandatory flock ... */
>  #define LOCK_READ	64	/* which allows concurrent read operations */
>  #define LOCK_WRITE	128	/* which allows concurrent write operations */

NACK.

This may break legacy userland code that sets LOCK_MAND on flock calls
(e.g. old versions of samba).
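
For example, a legacy caller doing

	flock(fd, LOCK_MAND | LOCK_READ);

has been a warn-once no-op (returning 0) since the removal, but with
this change it would suddenly start taking real flock locks that
conflict with every other flock lock on the file.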

If you want to add a new mechanism that does something similar with a
new flag, then that may be possible, but please don't overload old flags
that could still be used in the field with new meanings.

If you do decide to use flock for this functionality (and I'm not sure
this is a good idea), then I'd also like to see a clear description of
the semantics this provides.
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns hash table.
  2022-08-26  0:04 [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns hash table Kuniyuki Iwashima
                   ` (12 preceding siblings ...)
  2022-08-26  0:04 ` [PATCH v1 net-next 13/13] udp: Introduce optional per-netns hash table Kuniyuki Iwashima
@ 2022-08-26 15:17 ` Eric Dumazet
  2022-08-26 16:51   ` Kuniyuki Iwashima
  13 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2022-08-26 15:17 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Jeff Layton,
	Chuck Lever, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Kuniyuki Iwashima, netdev, linux-fsdevel

On Thu, Aug 25, 2022 at 5:05 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> [...]
>
>   2) Lock the sysctl knob
>

This is orthogonal.

Your series should have been split in three really.

I do not want to discuss the merit of re-instating LOCK_MAND :/



>   We can use flock(LOCK_MAND) or BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny
>   read/write on sysctl knobs.
>
> [...]


* Re: [PATCH v1 net-next 04/13] net: Introduce init2() for pernet_operations.
  2022-08-26  0:04 ` [PATCH v1 net-next 04/13] net: Introduce init2() for pernet_operations Kuniyuki Iwashima
@ 2022-08-26 15:20   ` Eric Dumazet
  2022-08-26 17:03     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2022-08-26 15:20 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Jeff Layton,
	Chuck Lever, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Kuniyuki Iwashima, netdev, linux-fsdevel

On Thu, Aug 25, 2022 at 5:06 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> This patch adds a new init function for pernet_operations, init2().

Why ?

This seems not really needed...

TCP ops->init can trivially reach the parent net_ns if needed,
because the current process is the one creating the new net_ns.
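
Something like this (untested sketch; init_net keeps the global ehash,
hence the guard):

	/* in tcp_sk_init() */
	if (!net_eq(net, &init_net)) {
		struct net *old_net = current->nsproxy->net_ns;
		unsigned int ehash_entries;

		ehash_entries = READ_ONCE(old_net->ipv4.sysctl_tcp_child_ehash_entries);
		if (ehash_entries)
			/* allocate and install the per-netns ehash */;
	}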

>
> We call each init2() during clone() or unshare() only, where we can
> access the parent netns for a child netns creation.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  include/net/net_namespace.h |  3 +++
>  net/core/net_namespace.c    | 18 +++++++++++-------
>  2 files changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index 8c3587d5c308..3ca426649756 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -410,6 +410,8 @@ struct pernet_operations {
>          * from register_pernet_subsys(), unregister_pernet_subsys()
>          * register_pernet_device() and unregister_pernet_device().
>          *
> +        * init2() is called during clone() or unshare() only.
> +        *
>          * Exit methods using blocking RCU primitives, such as
>          * synchronize_rcu(), should be implemented via exit_batch.
>          * Then, destruction of a group of net requires single
> @@ -422,6 +424,7 @@ struct pernet_operations {
>          * the calls.
>          */
>         int (*init)(struct net *net);
> +       int (*init2)(struct net *net, struct net *old_net);
>         void (*pre_exit)(struct net *net);
>         void (*exit)(struct net *net);
>         void (*exit_batch)(struct list_head *net_exit_list);
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 6b9f19122ec1..b120ff97d9f5 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -116,7 +116,8 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
>         return 0;
>  }
>
> -static int ops_init(const struct pernet_operations *ops, struct net *net)
> +static int ops_init(const struct pernet_operations *ops,
> +                   struct net *net, struct net *old_net)
>  {
>         int err = -ENOMEM;
>         void *data = NULL;
> @@ -133,6 +134,8 @@ static int ops_init(const struct pernet_operations *ops, struct net *net)
>         err = 0;
>         if (ops->init)
>                 err = ops->init(net);
> +       if (!err && ops->init2 && old_net)
> +               err = ops->init2(net, old_net);

If an error comes here, while ops->init() was a success, we probably
leave things in a bad state (memory leak ?)
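
e.g. something like (untested; 'cleanup' being the existing unwind path
below, and ops that only implement exit_batch would need more care):

	if (ops->init) {
		err = ops->init(net);
		if (err)
			goto cleanup;
	}

	if (ops->init2 && old_net) {
		err = ops->init2(net, old_net);
		if (err) {
			/* undo ops->init() before unwinding */
			if (ops->exit)
				ops->exit(net);
			goto cleanup;
		}
	}

	return 0;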

>         if (!err)
>                 return 0;
>
> @@ -301,7 +304,8 @@ EXPORT_SYMBOL_GPL(get_net_ns_by_id);
>  /*
>   * setup_net runs the initializers for the network namespace object.
>   */
> -static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
> +static __net_init int setup_net(struct net *net, struct net *old_net,
> +                               struct user_namespace *user_ns)
>  {
>         /* Must be called with pernet_ops_rwsem held */
>         const struct pernet_operations *ops, *saved_ops;
> @@ -323,7 +327,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
>         mutex_init(&net->ipv4.ra_mutex);
>
>         list_for_each_entry(ops, &pernet_list, list) {
> -               error = ops_init(ops, net);
> +               error = ops_init(ops, net, old_net);
>                 if (error < 0)
>                         goto out_undo;
>         }
> @@ -469,7 +473,7 @@ struct net *copy_net_ns(unsigned long flags,
>         if (rv < 0)
>                 goto put_userns;
>
> -       rv = setup_net(net, user_ns);
> +       rv = setup_net(net, old_net, user_ns);
>
>         up_read(&pernet_ops_rwsem);
>
> @@ -1107,7 +1111,7 @@ void __init net_ns_init(void)
>         init_net.key_domain = &init_net_key_domain;
>  #endif
>         down_write(&pernet_ops_rwsem);
> -       if (setup_net(&init_net, &init_user_ns))
> +       if (setup_net(&init_net, NULL, &init_user_ns))
>                 panic("Could not setup the initial network namespace");
>
>         init_net_initialized = true;
> @@ -1148,7 +1152,7 @@ static int __register_pernet_operations(struct list_head *list,
>
>                         memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(net));
>                         old = set_active_memcg(memcg);
> -                       error = ops_init(ops, net);
> +                       error = ops_init(ops, net, NULL);
>                         set_active_memcg(old);
>                         mem_cgroup_put(memcg);
>                         if (error)
> @@ -1188,7 +1192,7 @@ static int __register_pernet_operations(struct list_head *list,
>                 return 0;
>         }
>
> -       return ops_init(ops, &init_net);
> +       return ops_init(ops, &init_net, NULL);
>  }
>
>  static void __unregister_pernet_operations(struct pernet_operations *ops)
> --
> 2.30.2
>


* Re: [PATCH v1 net-next 08/13] tcp: Introduce optional per-netns ehash.
  2022-08-26  0:04 ` [PATCH v1 net-next 08/13] tcp: Introduce optional per-netns ehash Kuniyuki Iwashima
@ 2022-08-26 15:24   ` Eric Dumazet
  2022-08-26 17:19     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2022-08-26 15:24 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Jeff Layton,
	Chuck Lever, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Kuniyuki Iwashima, netdev, linux-fsdevel

On Thu, Aug 25, 2022 at 5:07 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> The more sockets we have in the hash table, the more time we spend
> looking up the socket.  While running a number of small workloads on
> the same host, they penalise each other and cause performance degradation.
>
> Also, the root cause might be a single workload that consumes much more
> resources than the others.  It often happens on a cloud service where
> different workloads share the same computing resource.
>
> To resolve the issue, we introduce an optional per-netns hash table for
> TCP, but it's just ehash, and we still share the global bhash and lhash2.
>
> With a smaller ehash, we can look up non-listener sockets faster and
> isolate such noisy neighbours.  Also, we can reduce lock contention.
>
> We can control the ehash size via a new sysctl knob.  However, depending
> on workloads, it will require very careful tuning, so we disable the
> feature by default (net.ipv4.tcp_child_ehash_entries == 0).  If we fail
> to allocate enough memory for a new ehash, we fall back to the global
> one.
>
> We can check the current ehash size via another read-only sysctl knob,
> net.ipv4.tcp_ehash_entries.  A negative value means the netns shares
> the global ehash (the per-netns ehash is disabled or its allocation
> failed).
>
>   # dmesg | cut -d ' ' -f 5- | grep "established hash"
>   TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
>
>   # sysctl net.ipv4.tcp_ehash_entries
>   net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries
>
>   # sysctl net.ipv4.tcp_child_ehash_entries
>   net.ipv4.tcp_child_ehash_entries = 0  # disabled by default
>
>   # ip netns add test1
>   # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
>   net.ipv4.tcp_ehash_entries = -524288  # share the global ehash
>
>   # sysctl -w net.ipv4.tcp_child_ehash_entries=100
>   net.ipv4.tcp_child_ehash_entries = 100
>
>   # sysctl net.ipv4.tcp_child_ehash_entries
>   net.ipv4.tcp_child_ehash_entries = 128  # rounded up to 2^n
>
>   # ip netns add test2
>   # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
>   net.ipv4.tcp_ehash_entries = 128  # own per-netns ehash
>
> When more than two processes in the same netns create per-netns ehash
> concurrently with different sizes, we need to guarantee the size in
> one of the following ways:
>
>   1) Share the global ehash and create per-netns ehash
>
>   First, unshare() with tcp_child_ehash_entries==0.  It creates dedicated
>   netns sysctl knobs where we can safely change tcp_child_ehash_entries
>   and clone()/unshare() to create a per-netns ehash.
>
>   2) Lock the sysctl knob
>
>   We can use flock(LOCK_MAND) or BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny
>   read/write on sysctl knobs.
>
> Note the default values of two sysctl knobs depend on the ehash size and
> should be tuned carefully:
>
>   tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
>   tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
>
> Also, we could optimise ehash lookup/iteration further by removing netns
> comparison for the per-netns ehash in the future.
>
> As a bonus, we can dismantle a netns faster.  Currently, while destroying
> a netns, we call inet_twsk_purge(), which walks through the global ehash.
> It can potentially be big because it holds sockets other than TIME_WAIT
> from all netns.  Splitting the ehash changes that situation; then
> inet_twsk_purge() only needs to clean up TIME_WAIT sockets in each
> netns.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  Documentation/networking/ip-sysctl.rst | 20 +++++++++
>  include/net/inet_hashtables.h          |  6 +++
>  include/net/netns/ipv4.h               |  1 +
>  net/dccp/proto.c                       |  2 +
>  net/ipv4/inet_hashtables.c             | 57 ++++++++++++++++++++++++++
>  net/ipv4/inet_timewait_sock.c          |  4 +-
>  net/ipv4/sysctl_net_ipv4.c             | 57 ++++++++++++++++++++++++++
>  net/ipv4/tcp.c                         |  1 +
>  net/ipv4/tcp_ipv4.c                    | 53 ++++++++++++++++++++----
>  net/ipv6/tcp_ipv6.c                    | 12 +++++-
>  10 files changed, 202 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index 56cd4ea059b2..97a0952b11e3 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -1037,6 +1037,26 @@ tcp_challenge_ack_limit - INTEGER
>         in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
>         Default: 1000
>
> +tcp_ehash_entries - INTEGER
> +       Read-only number of hash buckets for TCP sockets in the current
> +       networking namespace.
> +
> +       A negative value means the networking namespace does not own its
> +       hash buckets and shares those of the initial networking namespace.
> +
> +tcp_child_ehash_entries - INTEGER
> +       Control the number of hash buckets for TCP sockets in the child
> +       networking namespace, which must be set before clone() or unshare().
> +
> +       Any value written other than 0 is rounded up to the next power of
> +       two (2^n).  0 is a special value, meaning the child networking
> +       namespace will share the initial networking namespace's hash buckets.
> +
> +       Note that the child falls back to the global hash buckets if the
> +       kernel fails to allocate enough memory for the per-netns ones.
> +
> +       Default: 0
> +
>  UDP variables
>  =============
>
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 2c866112433e..039440936ab2 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -168,6 +168,8 @@ struct inet_hashinfo {
>         /* The 2nd listener table hashed by local port and address */
>         unsigned int                    lhash2_mask;
>         struct inet_listen_hashbucket   *lhash2;
> +
> +       bool                            pernet;
>  };
>
>  static inline struct inet_hashinfo *inet_get_hashinfo(const struct sock *sk)
> @@ -214,6 +216,10 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
>         hashinfo->ehash_locks = NULL;
>  }
>
> +struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo,
> +                                                unsigned int ehash_entries);
> +void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo);
> +
>  struct inet_bind_bucket *
>  inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
>                         struct inet_bind_hashbucket *head,
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index c7320ef356d9..6d9c01879027 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -170,6 +170,7 @@ struct netns_ipv4 {
>         int sysctl_tcp_pacing_ca_ratio;
>         int sysctl_tcp_wmem[3];
>         int sysctl_tcp_rmem[3];
> +       unsigned int sysctl_tcp_child_ehash_entries;
>         unsigned long sysctl_tcp_comp_sack_delay_ns;
>         unsigned long sysctl_tcp_comp_sack_slack_ns;
>         int sysctl_max_syn_backlog;
> diff --git a/net/dccp/proto.c b/net/dccp/proto.c
> index 7cd4a6cc99fc..c548ca3e9b0e 100644
> --- a/net/dccp/proto.c
> +++ b/net/dccp/proto.c
> @@ -1197,6 +1197,8 @@ static int __init dccp_init(void)
>                 INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain);
>         }
>
> +       dccp_hashinfo.pernet = false;
> +
>         rc = dccp_mib_init();
>         if (rc)
>                 goto out_free_dccp_bhash2;
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 5eb21a95179b..a57932b14bc6 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -1145,3 +1145,60 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
>         return 0;
>  }
>  EXPORT_SYMBOL_GPL(inet_ehash_locks_alloc);
> +
> +struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo,
> +                                                unsigned int ehash_entries)
> +{
> +       struct inet_hashinfo *new_hashinfo;
> +       int i;
> +
> +       new_hashinfo = kmalloc(sizeof(*new_hashinfo), GFP_KERNEL);
> +       if (!new_hashinfo)
> +               goto err;
> +
> +       new_hashinfo->ehash = kvmalloc_array(ehash_entries,
> +                                            sizeof(struct inet_ehash_bucket),
> +                                            GFP_KERNEL);

GFP_KERNEL_ACCOUNT ?
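
i.e. (sketch):

	new_hashinfo->ehash = kvmalloc_array(ehash_entries,
					     sizeof(struct inet_ehash_bucket),
					     GFP_KERNEL_ACCOUNT);

so the potentially large allocation is charged to the creating
process' memory cgroup.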

> +       if (!new_hashinfo->ehash)
> +               goto free_hashinfo;
> +
> +       new_hashinfo->ehash_mask = ehash_entries - 1;
> +
> +       if (inet_ehash_locks_alloc(new_hashinfo))
> +               goto free_ehash;
> +
> +       for (i = 0; i < ehash_entries; i++)
> +               INIT_HLIST_NULLS_HEAD(&new_hashinfo->ehash[i].chain, i);
> +
> +       new_hashinfo->bind_bucket_cachep = hashinfo->bind_bucket_cachep;
> +       new_hashinfo->bhash = hashinfo->bhash;
> +       new_hashinfo->bind2_bucket_cachep = hashinfo->bind2_bucket_cachep;
> +       new_hashinfo->bhash2 = hashinfo->bhash2;
> +       new_hashinfo->bhash_size = hashinfo->bhash_size;
> +
> +       new_hashinfo->lhash2_mask = hashinfo->lhash2_mask;
> +       new_hashinfo->lhash2 = hashinfo->lhash2;
> +
> +       new_hashinfo->pernet = true;
> +
> +       return new_hashinfo;
> +
> +free_ehash:
> +       kvfree(new_hashinfo->ehash);
> +free_hashinfo:
> +       kfree(new_hashinfo);
> +err:
> +       return NULL;
> +}
> +EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_alloc);
> +
> +void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo)
> +{
> +       if (!hashinfo->pernet)
> +               return;
> +
> +       inet_ehash_locks_free(hashinfo);
> +       kvfree(hashinfo->ehash);
> +       kfree(hashinfo);
> +}
> +EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_free);
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 47ccc343c9fb..a5d40acde9d6 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -59,8 +59,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
>         inet_twsk_bind_unhash(tw, hashinfo);
>         spin_unlock(&bhead->lock);
>
> -       if (refcount_dec_and_test(&tw->tw_dr->tw_refcount))
> +       if (refcount_dec_and_test(&tw->tw_dr->tw_refcount)) {
> +               inet_pernet_hashinfo_free(hashinfo);
>                 kfree(tw->tw_dr);
> +       }
>
>         inet_twsk_put(tw);
>  }
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 5490c285668b..03a3187c4705 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -382,6 +382,48 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl,
>         return ret;
>  }
>
> +static int proc_tcp_ehash_entries(struct ctl_table *table, int write,
> +                                 void *buffer, size_t *lenp, loff_t *ppos)
> +{
> +       struct net *net = container_of(table->data, struct net,
> +                                      ipv4.sysctl_tcp_child_ehash_entries);
> +       struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;
> +       int tcp_ehash_entries;
> +       struct ctl_table tbl;
> +
> +       tcp_ehash_entries = hinfo->ehash_mask + 1;
> +
> +       /* A negative number indicates that the child netns
> +        * shares the global ehash.
> +        */
> +       if (!net_eq(net, &init_net) && !hinfo->pernet)
> +               tcp_ehash_entries *= -1;
> +
> +       tbl.data = &tcp_ehash_entries;
> +       tbl.maxlen = sizeof(int);
> +
> +       return proc_dointvec(&tbl, write, buffer, lenp, ppos);
> +}
> +
> +static int proc_tcp_child_ehash_entries(struct ctl_table *table, int write,
> +                                       void *buffer, size_t *lenp, loff_t *ppos)
> +{
> +       unsigned int tcp_child_ehash_entries;
> +       int ret;
> +
> +       ret = proc_douintvec(table, write, buffer, lenp, ppos);
> +       if (!write || ret)
> +               return ret;
> +
> +       tcp_child_ehash_entries = READ_ONCE(*(unsigned int *)table->data);
> +       if (tcp_child_ehash_entries)
> +               tcp_child_ehash_entries = roundup_pow_of_two(tcp_child_ehash_entries);
> +
> +       WRITE_ONCE(*(unsigned int *)table->data, tcp_child_ehash_entries);
> +
> +       return 0;
> +}
> +
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>  static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
>                                           void *buffer, size_t *lenp,
> @@ -1321,6 +1363,21 @@ static struct ctl_table ipv4_net_table[] = {
>                 .extra1         = SYSCTL_ZERO,
>                 .extra2         = SYSCTL_ONE,
>         },
> +       {
> +               .procname       = "tcp_ehash_entries",
> +               .data           = &init_net.ipv4.sysctl_tcp_child_ehash_entries,
> +               .mode           = 0444,
> +               .proc_handler   = proc_tcp_ehash_entries,
> +       },
> +       {
> +               .procname       = "tcp_child_ehash_entries",
> +               .data           = &init_net.ipv4.sysctl_tcp_child_ehash_entries,
> +               .maxlen         = sizeof(unsigned int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_tcp_child_ehash_entries,
> +               .extra1         = SYSCTL_ZERO,
> +               .extra2         = SYSCTL_INT_MAX,

Have you really tested what happens if you set the sysctl to the max
value, 0x7fffffff?

I would assume some kernel allocations will fail, or some loops will
trigger some kind of soft lockups.
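
For example, 0x7fffffff would be rounded up to 2^31 buckets, so the
ehash allocation alone would be 2^31 * sizeof(struct inet_ehash_bucket)
(16 GiB with 8-byte buckets on 64-bit), and the chain-initialising
loop would run 2^31 iterations.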

> +       },
>         {
>                 .procname       = "udp_rmem_min",
>                 .data           = &init_net.ipv4.sysctl_udp_rmem_min,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index baf6adb723ad..f8ce673e32cb 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4788,6 +4788,7 @@ void __init tcp_init(void)
>                 INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain);
>         }
>
> +       tcp_hashinfo.pernet = false;
>
>         cnt = tcp_hashinfo.ehash_mask + 1;
>         sysctl_tcp_max_orphans = cnt / 2;
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index b07930643b11..604119f46b52 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -3109,14 +3109,23 @@ static void __net_exit tcp_sk_exit(struct net *net)
>         if (net->ipv4.tcp_congestion_control)
>                 bpf_module_put(net->ipv4.tcp_congestion_control,
>                                net->ipv4.tcp_congestion_control->owner);
> -       if (refcount_dec_and_test(&tcp_death_row->tw_refcount))
> +       if (refcount_dec_and_test(&tcp_death_row->tw_refcount)) {
> +               inet_pernet_hashinfo_free(tcp_death_row->hashinfo);
>                 kfree(tcp_death_row);
> +       }
>  }
>
> -static int __net_init tcp_sk_init(struct net *net)
> +static void __net_init tcp_set_hashinfo(struct net *net, struct inet_hashinfo *hinfo)
>  {
> -       int cnt;
> +       int ehash_entries = hinfo->ehash_mask + 1;

0x7fffffff + 1 -> integer overflow

>
> +       net->ipv4.tcp_death_row->hashinfo = hinfo;
> +       net->ipv4.tcp_death_row->sysctl_max_tw_buckets = ehash_entries / 2;
> +       net->ipv4.sysctl_max_syn_backlog = max(128, ehash_entries / 128);
> +}
> +
> +static int __net_init tcp_sk_init(struct net *net)
> +{
>         net->ipv4.sysctl_tcp_ecn = 2;
>         net->ipv4.sysctl_tcp_ecn_fallback = 1;
>
> @@ -3145,12 +3154,10 @@ static int __net_init tcp_sk_init(struct net *net)
>         net->ipv4.tcp_death_row = kzalloc(sizeof(struct inet_timewait_death_row), GFP_KERNEL);
>         if (!net->ipv4.tcp_death_row)
>                 return -ENOMEM;
> +
>         refcount_set(&net->ipv4.tcp_death_row->tw_refcount, 1);
> -       cnt = tcp_hashinfo.ehash_mask + 1;
> -       net->ipv4.tcp_death_row->sysctl_max_tw_buckets = cnt / 2;
> -       net->ipv4.tcp_death_row->hashinfo = &tcp_hashinfo;
> +       tcp_set_hashinfo(net, &tcp_hashinfo);
>
> -       net->ipv4.sysctl_max_syn_backlog = max(128, cnt / 128);
>         net->ipv4.sysctl_tcp_sack = 1;
>         net->ipv4.sysctl_tcp_window_scaling = 1;
>         net->ipv4.sysctl_tcp_timestamps = 1;
> @@ -3206,18 +3213,46 @@ static int __net_init tcp_sk_init(struct net *net)
>         return 0;
>  }
>
> +static int __net_init tcp_sk_init_pernet_hashinfo(struct net *net, struct net *old_net)
> +{
> +       struct inet_hashinfo *child_hinfo;
> +       int ehash_entries;
> +
> +       ehash_entries = READ_ONCE(old_net->ipv4.sysctl_tcp_child_ehash_entries);
> +       if (!ehash_entries)
> +               goto out;
> +
> +       child_hinfo = inet_pernet_hashinfo_alloc(&tcp_hashinfo, ehash_entries);
> +       if (child_hinfo)
> +               tcp_set_hashinfo(net, child_hinfo);
> +       else
> +               pr_warn("Failed to allocate TCP ehash (entries: %u) "
> +                       "for a netns, falling back to the global one\n",
> +                       ehash_entries);
> +out:
> +       return 0;
> +}
> +
>  static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
>  {
> +       bool purge_once = true;
>         struct net *net;
>
> -       inet_twsk_purge(&tcp_hashinfo, AF_INET);
> +       list_for_each_entry(net, net_exit_list, exit_list) {
> +               if (net->ipv4.tcp_death_row->hashinfo->pernet) {
> +                       inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET);
> +               } else if (purge_once) {
> +                       inet_twsk_purge(&tcp_hashinfo, AF_INET);
> +                       purge_once = false;
> +               }
>
> -       list_for_each_entry(net, net_exit_list, exit_list)
>                 tcp_fastopen_ctx_destroy(net);
> +       }
>  }
>
>  static struct pernet_operations __net_initdata tcp_sk_ops = {
>         .init      = tcp_sk_init,
> +       .init2     = tcp_sk_init_pernet_hashinfo,
>         .exit      = tcp_sk_exit,
>         .exit_batch = tcp_sk_exit_batch,
>  };
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 27b2fd98a2c4..19f730428720 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -2229,7 +2229,17 @@ static void __net_exit tcpv6_net_exit(struct net *net)
>
>  static void __net_exit tcpv6_net_exit_batch(struct list_head *net_exit_list)
>  {
> -       inet_twsk_purge(&tcp_hashinfo, AF_INET6);
> +       bool purge_once = true;
> +       struct net *net;
> +

This looks like a duplicate of ipv4 function. Opportunity of factorization ?
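
e.g. a shared helper along these lines (untested; the name is just a
suggestion), called with AF_INET/AF_INET6 from the two exit_batch
handlers, with tcp_fastopen_ctx_destroy() kept in the IPv4 one:

	void tcp_twsk_purge(struct list_head *net_exit_list, int family)
	{
		bool purge_once = true;
		struct net *net;

		list_for_each_entry(net, net_exit_list, exit_list) {
			if (net->ipv4.tcp_death_row->hashinfo->pernet) {
				inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo,
						family);
			} else if (purge_once) {
				inet_twsk_purge(&tcp_hashinfo, family);
				purge_once = false;
			}
		}
	}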

> +       list_for_each_entry(net, net_exit_list, exit_list) {
> +               if (net->ipv4.tcp_death_row->hashinfo->pernet) {
> +                       inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET6);
> +               } else if (purge_once) {
> +                       inet_twsk_purge(&tcp_hashinfo, AF_INET6);
> +                       purge_once = false;
> +               }
> +       }
>  }
>
>  static struct pernet_operations tcpv6_net_ops = {
> --
> 2.30.2
>


* Re: [PATCH v1 net-next 06/13] tcp: Set NULL to sk->sk_prot->h.hashinfo.
  2022-08-26  0:04 ` [PATCH v1 net-next 06/13] tcp: Set NULL to sk->sk_prot->h.hashinfo Kuniyuki Iwashima
@ 2022-08-26 15:40   ` Eric Dumazet
  2022-08-26 17:26     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2022-08-26 15:40 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Jeff Layton,
	Chuck Lever, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Kuniyuki Iwashima, netdev, linux-fsdevel

On Thu, Aug 25, 2022 at 5:07 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> We will soon introduce an optional per-netns ehash.
>
> This means we cannot use the global sk->sk_prot->h.hashinfo
> to fetch a TCP hashinfo.
>
> Instead, set NULL to sk->sk_prot->h.hashinfo for TCP and get
> a proper hashinfo from net->ipv4.tcp_death_row->hashinfo.
>
> Note that we need not use sk->sk_prot->h.hashinfo if DCCP is
> disabled.
>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  include/net/inet_hashtables.h   | 10 ++++++++++
>  net/ipv4/af_inet.c              |  2 +-
>  net/ipv4/inet_connection_sock.c |  6 +++---
>  net/ipv4/inet_hashtables.c      | 14 +++++++-------
>  net/ipv4/tcp_ipv4.c             |  2 +-
>  net/ipv6/tcp_ipv6.c             |  2 +-
>  6 files changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 44a419b9e3d5..2c866112433e 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -170,6 +170,16 @@ struct inet_hashinfo {
>         struct inet_listen_hashbucket   *lhash2;
>  };
>
> +static inline struct inet_hashinfo *inet_get_hashinfo(const struct sock *sk)
> +{
> +#if IS_ENABLED(CONFIG_IP_DCCP)
> +       return sk->sk_prot->h.hashinfo ? :
> +               sock_net(sk)->ipv4.tcp_death_row->hashinfo;
> +#else
> +       return sock_net(sk)->ipv4.tcp_death_row->hashinfo;
> +#endif
> +}
>

If sk_prot->h.hashinfo must disappear, I would rather add a new
inet->hashinfo field:

return inet_sk(sk)->hashinfo

Conceptually, the pointer no longer belongs in sk_prot, and it does not
belong in struct net either; otherwise, you should name this helper
tcp_or_dccp_get_hashinfo() to avoid any temptation to use it for other
inet protocols.
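
e.g. (same body as the quoted hunk, just with an honest name):

	static inline struct inet_hashinfo *tcp_or_dccp_get_hashinfo(const struct sock *sk)
	{
	#if IS_ENABLED(CONFIG_IP_DCCP)
		return sk->sk_prot->h.hashinfo ? :
			sock_net(sk)->ipv4.tcp_death_row->hashinfo;
	#else
		return sock_net(sk)->ipv4.tcp_death_row->hashinfo;
	#endif
	}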


* Re: [PATCH v1 net-next 01/13] fs/lock: Revive LOCK_MAND.
  2022-08-26 10:02   ` Jeff Layton
@ 2022-08-26 16:48     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-26 16:48 UTC (permalink / raw)
  To: jlayton
  Cc: chuck.lever, davem, edumazet, keescook, kuba, kuni1840, kuniyu,
	linux-fsdevel, mcgrof, netdev, pabeni, yzaikin

From:   Jeff Layton <jlayton@kernel.org>
Date:   Fri, 26 Aug 2022 06:02:44 -0400
> On Thu, 2022-08-25 at 17:04 -0700, Kuniyuki Iwashima wrote:
> > The commit 90f7d7a0d0d6 ("locks: remove LOCK_MAND flock lock support")
> > [...]
> 
> NACK.
> 
> This may break legacy userland code that sets LOCK_MAND on flock calls
> (e.g. old versions of samba).
> 
> If you want to add a new mechanism that does something similar with a
> new flag, then that may be possible, but please don't overload old flags
> that could still be used in the field with new meanings.

Exactly, that makes sense.
Thanks for the feedback!


> If you do decide to use flock for this functionality (and I'm not sure
> this is a good idea),

Actually, patches 1-2 were experimental, to show all the available options
(flock()'s latency vs unshare()'s memory cost), and I prefer unshare().
If both had been unacceptable, I would have added a clone() BPF hook.

But it seems unshare() works at least, so I'll drop this part in the next
spin.

Thank you.


> then I'd also like to see a clear description of
> the semantics this provides.
> -- 
> Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns hash table.
  2022-08-26 15:17 ` [PATCH v1 net-next 00/13] tcp/udp: " Eric Dumazet
@ 2022-08-26 16:51   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-26 16:51 UTC (permalink / raw)
  To: edumazet
  Cc: chuck.lever, davem, jlayton, keescook, kuba, kuni1840, kuniyu,
	linux-fsdevel, mcgrof, netdev, pabeni, yzaikin

From:   Eric Dumazet <edumazet@google.com>
Date:   Fri, 26 Aug 2022 08:17:25 -0700
> On Thu, Aug 25, 2022 at 5:05 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > [...]
> >
> > When creating per-netns concurrently with different sizes, we can
> > guarantee the size by doing one of these ways.
> >
> >   1) Share the global hash table and create per-netns one
> >
> >   First, unshare() with tcp_child_ehash_entries==0.  It creates dedicated
> >   netns sysctl knobs where we can safely change tcp_child_ehash_entries
> >   and clone()/unshare() to create a per-netns hash table.
> >
> >   2) Lock the sysctl knob
> >
> 
> This is orthogonal.
> 
> Your series should have been split in three really.
> 
> I do not want to discuss the merit of re-instating LOCK_MAND :/

I see.
I'll drop the flock() part right away and respin only the TCP part in v2.


* Re: [PATCH v1 net-next 04/13] net: Introduce init2() for pernet_operations.
  2022-08-26 15:20   ` Eric Dumazet
@ 2022-08-26 17:03     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-26 17:03 UTC (permalink / raw)
  To: edumazet
  Cc: chuck.lever, davem, jlayton, keescook, kuba, kuni1840, kuniyu,
	linux-fsdevel, mcgrof, netdev, pabeni, yzaikin

From:   Eric Dumazet <edumazet@google.com>
Date:   Fri, 26 Aug 2022 08:20:06 -0700
> On Thu, Aug 25, 2022 at 5:06 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > This patch adds a new init function for pernet_operations, init2().
> 
> Why ?
> 
> This seems not really needed...
> 
> TCP ops->init can trivially reach the parent net_ns if needed,
> because the current process is the one doing the creation of a new net_ns.

Yes, that's true because IPv4 TCP/UDP are both unloadable.

At first, I was thinking of a general interface, but I'm fine
with dropping this patch and accessing current->nsproxy->net_ns as
sysctl_devconf_inherit_init_net does.
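
For example, tcp_sk_init() could read the parent's knob directly; a rough
sketch (untested, knob name as in this series):

        static int __net_init tcp_sk_init(struct net *net)
        {
                /* On the clone()/unshare() path, current still runs in the
                 * parent netns, so the parent's sysctl value is visible here.
                 * For the initial netns, old_net == net and the knob is 0.
                 */
                struct net *old_net = current->nsproxy->net_ns;
                unsigned int ehash_entries;

                ehash_entries = READ_ONCE(old_net->ipv4.sysctl_tcp_child_ehash_entries);
                ...
        }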


> 
> >
> > We call each init2() during clone() or unshare() only, where we can
> > access the parent netns for a child netns creation.
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  include/net/net_namespace.h |  3 +++
> >  net/core/net_namespace.c    | 18 +++++++++++-------
> >  2 files changed, 14 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> > index 8c3587d5c308..3ca426649756 100644
> > --- a/include/net/net_namespace.h
> > +++ b/include/net/net_namespace.h
> > @@ -410,6 +410,8 @@ struct pernet_operations {
> >          * from register_pernet_subsys(), unregister_pernet_subsys()
> >          * register_pernet_device() and unregister_pernet_device().
> >          *
> > +        * init2() is called during clone() or unshare() only.
> > +        *
> >          * Exit methods using blocking RCU primitives, such as
> >          * synchronize_rcu(), should be implemented via exit_batch.
> >          * Then, destruction of a group of net requires single
> > @@ -422,6 +424,7 @@ struct pernet_operations {
> >          * the calls.
> >          */
> >         int (*init)(struct net *net);
> > +       int (*init2)(struct net *net, struct net *old_net);
> >         void (*pre_exit)(struct net *net);
> >         void (*exit)(struct net *net);
> >         void (*exit_batch)(struct list_head *net_exit_list);
> > diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> > index 6b9f19122ec1..b120ff97d9f5 100644
> > --- a/net/core/net_namespace.c
> > +++ b/net/core/net_namespace.c
> > @@ -116,7 +116,8 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
> >         return 0;
> >  }
> >
> > -static int ops_init(const struct pernet_operations *ops, struct net *net)
> > +static int ops_init(const struct pernet_operations *ops,
> > +                   struct net *net, struct net *old_net)
> >  {
> >         int err = -ENOMEM;
> >         void *data = NULL;
> > @@ -133,6 +134,8 @@ static int ops_init(const struct pernet_operations *ops, struct net *net)
> >         err = 0;
> >         if (ops->init)
> >                 err = ops->init(net);
> > +       if (!err && ops->init2 && old_net)
> > +               err = ops->init2(net, old_net);
> 
> If an error comes here, while ops->init() was a success, we probably
> leave things in a bad state (memory leak ?)

Somehow I assumed .exit() would handle that case.  Yes, it's really bad
design... at the very least I should have added .exit2().
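
If this had stayed, ops_init() would also need to unwind a successful
init() when init2() fails, roughly like this (sketch, untested):

        err = 0;
        if (ops->init)
                err = ops->init(net);
        if (!err && ops->init2 && old_net) {
                err = ops->init2(net, old_net);
                /* Unwind init() so its allocations are not leaked. */
                if (err && ops->exit)
                        ops->exit(net);
        }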

I'll drop this in v2.

Thank you!


> 
> >         if (!err)
> >                 return 0;
> >
> > @@ -301,7 +304,8 @@ EXPORT_SYMBOL_GPL(get_net_ns_by_id);
> >  /*
> >   * setup_net runs the initializers for the network namespace object.
> >   */
> > -static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
> > +static __net_init int setup_net(struct net *net, struct net *old_net,
> > +                               struct user_namespace *user_ns)
> >  {
> >         /* Must be called with pernet_ops_rwsem held */
> >         const struct pernet_operations *ops, *saved_ops;
> > @@ -323,7 +327,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
> >         mutex_init(&net->ipv4.ra_mutex);
> >
> >         list_for_each_entry(ops, &pernet_list, list) {
> > -               error = ops_init(ops, net);
> > +               error = ops_init(ops, net, old_net);
> >                 if (error < 0)
> >                         goto out_undo;
> >         }
> > @@ -469,7 +473,7 @@ struct net *copy_net_ns(unsigned long flags,
> >         if (rv < 0)
> >                 goto put_userns;
> >
> > -       rv = setup_net(net, user_ns);
> > +       rv = setup_net(net, old_net, user_ns);
> >
> >         up_read(&pernet_ops_rwsem);
> >
> > @@ -1107,7 +1111,7 @@ void __init net_ns_init(void)
> >         init_net.key_domain = &init_net_key_domain;
> >  #endif
> >         down_write(&pernet_ops_rwsem);
> > -       if (setup_net(&init_net, &init_user_ns))
> > +       if (setup_net(&init_net, NULL, &init_user_ns))
> >                 panic("Could not setup the initial network namespace");
> >
> >         init_net_initialized = true;
> > @@ -1148,7 +1152,7 @@ static int __register_pernet_operations(struct list_head *list,
> >
> >                         memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(net));
> >                         old = set_active_memcg(memcg);
> > -                       error = ops_init(ops, net);
> > +                       error = ops_init(ops, net, NULL);
> >                         set_active_memcg(old);
> >                         mem_cgroup_put(memcg);
> >                         if (error)
> > @@ -1188,7 +1192,7 @@ static int __register_pernet_operations(struct list_head *list,
> >                 return 0;
> >         }
> >
> > -       return ops_init(ops, &init_net);
> > +       return ops_init(ops, &init_net, NULL);
> >  }
> >
> >  static void __unregister_pernet_operations(struct pernet_operations *ops)
> > --
> > 2.30.2
> >


* Re: [PATCH v1 net-next 08/13] tcp: Introduce optional per-netns ehash.
  2022-08-26 15:24   ` Eric Dumazet
@ 2022-08-26 17:19     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-26 17:19 UTC (permalink / raw)
  To: edumazet
  Cc: chuck.lever, davem, jlayton, keescook, kuba, kuni1840, kuniyu,
	linux-fsdevel, mcgrof, netdev, pabeni, yzaikin

From:   Eric Dumazet <edumazet@google.com>
Date:   Fri, 26 Aug 2022 08:24:54 -0700
> On Thu, Aug 25, 2022 at 5:07 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > The more sockets we have in the hash table, the more time we spend
> > looking up the socket.  While running a number of small workloads on
> > the same host, they penalise each other and cause performance degradation.
> >
> > Also, the root cause might be a single workload that consumes much more
> > resources than the others.  It often happens on a cloud service where
> > different workloads share the same computing resource.
> >
> > To resolve the issue, we introduce an optional per-netns hash table for
> > TCP, but it's just ehash, and we still share the global bhash and lhash2.
> >
> > With a smaller ehash, we can look up non-listener sockets faster and
> > isolate such noisy neighbours.  Also, we can reduce lock contention.
> >
> > We can control the ehash size by a new sysctl knob.  However, depending
> > on workloads, it will require very sensitive tuning, so we disable the
> > feature by default (net.ipv4.tcp_child_ehash_entries == 0).  Moreover,
> > we can fall back to using the global ehash in case we fail to allocate
> > enough memory for a new ehash.
> >
> > We can check the current ehash size by another read-only sysctl knob,
> > net.ipv4.tcp_ehash_entries.  A negative value means the netns shares
> > the global ehash (per-netns ehash is disabled or failed to allocate
> > memory).
> >
> >   [...]
> >
> > When more than two processes in the same netns create per-netns ehash
> > concurrently with different sizes, we need to guarantee the size in
> > one of the following ways:
> >
> >   1) Share the global ehash and create per-netns ehash
> >
> >   First, unshare() with tcp_child_ehash_entries==0.  It creates dedicated
> >   netns sysctl knobs where we can safely change tcp_child_ehash_entries
> >   and clone()/unshare() to create a per-netns ehash.
> >
> >   2) Lock the sysctl knob
> >
> >   We can use flock(LOCK_MAND) or BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny
> >   read/write on sysctl knobs.
> >
> > Note the default values of two sysctl knobs depend on the ehash size and
> > should be tuned carefully:
> >
> >   tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
> >   tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
> >
> > Also, we could optimise ehash lookup/iteration further by removing netns
> > comparison for the per-netns ehash in the future.
> >
> > As a bonus, we can dismantle netns faster.  Currently, while destroying
> > netns, we call inet_twsk_purge(), which walks through the global ehash.
> > It can be potentially big because it can have many sockets other than
> > TIME_WAIT in all netns.  Splitting ehash changes that situation, where
> > it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
> > in each netns.
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  Documentation/networking/ip-sysctl.rst | 20 +++++++++
> >  include/net/inet_hashtables.h          |  6 +++
> >  include/net/netns/ipv4.h               |  1 +
> >  net/dccp/proto.c                       |  2 +
> >  net/ipv4/inet_hashtables.c             | 57 ++++++++++++++++++++++++++
> >  net/ipv4/inet_timewait_sock.c          |  4 +-
> >  net/ipv4/sysctl_net_ipv4.c             | 57 ++++++++++++++++++++++++++
> >  net/ipv4/tcp.c                         |  1 +
> >  net/ipv4/tcp_ipv4.c                    | 53 ++++++++++++++++++++----
> >  net/ipv6/tcp_ipv6.c                    | 12 +++++-
> >  10 files changed, 202 insertions(+), 11 deletions(-)
> >
> > diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> > index 56cd4ea059b2..97a0952b11e3 100644
> > --- a/Documentation/networking/ip-sysctl.rst
> > +++ b/Documentation/networking/ip-sysctl.rst
> > @@ -1037,6 +1037,26 @@ tcp_challenge_ack_limit - INTEGER
> >         in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
> >         Default: 1000
> >
> > +tcp_ehash_entries - INTEGER
> > +       Read-only number of hash buckets for TCP sockets in the current
> > +       networking namespace.
> > +
> > +       A negative value means the networking namespace does not own its
> > +       hash buckets and shares the initial networking namespace's one.
> > +
> > +tcp_child_ehash_entries - INTEGER
> > +       Control the number of hash buckets for TCP sockets in the child
> > +       networking namespace, which must be set before clone() or unshare().
> > +
> > +       The written value except for 0 is rounded up to 2^n.  0 is a special
> > +       value, meaning the child networking namespace will share the initial
> > +       networking namespace's hash buckets.
> > +
> > +       Note that the child will use the global one in case the kernel
> > +       fails to allocate enough memory.
> > +
> > +       Default: 0
> > +
> >  UDP variables
> >  =============
> >
> > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > index 2c866112433e..039440936ab2 100644
> > --- a/include/net/inet_hashtables.h
> > +++ b/include/net/inet_hashtables.h
> > @@ -168,6 +168,8 @@ struct inet_hashinfo {
> >         /* The 2nd listener table hashed by local port and address */
> >         unsigned int                    lhash2_mask;
> >         struct inet_listen_hashbucket   *lhash2;
> > +
> > +       bool                            pernet;
> >  };
> >
> >  static inline struct inet_hashinfo *inet_get_hashinfo(const struct sock *sk)
> > @@ -214,6 +216,10 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
> >         hashinfo->ehash_locks = NULL;
> >  }
> >
> > +struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo,
> > +                                                unsigned int ehash_entries);
> > +void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo);
> > +
> >  struct inet_bind_bucket *
> >  inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
> >                         struct inet_bind_hashbucket *head,
> > diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> > index c7320ef356d9..6d9c01879027 100644
> > --- a/include/net/netns/ipv4.h
> > +++ b/include/net/netns/ipv4.h
> > @@ -170,6 +170,7 @@ struct netns_ipv4 {
> >         int sysctl_tcp_pacing_ca_ratio;
> >         int sysctl_tcp_wmem[3];
> >         int sysctl_tcp_rmem[3];
> > +       unsigned int sysctl_tcp_child_ehash_entries;
> >         unsigned long sysctl_tcp_comp_sack_delay_ns;
> >         unsigned long sysctl_tcp_comp_sack_slack_ns;
> >         int sysctl_max_syn_backlog;
> > diff --git a/net/dccp/proto.c b/net/dccp/proto.c
> > index 7cd4a6cc99fc..c548ca3e9b0e 100644
> > --- a/net/dccp/proto.c
> > +++ b/net/dccp/proto.c
> > @@ -1197,6 +1197,8 @@ static int __init dccp_init(void)
> >                 INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain);
> >         }
> >
> > +       dccp_hashinfo.pernet = false;
> > +
> >         rc = dccp_mib_init();
> >         if (rc)
> >                 goto out_free_dccp_bhash2;
> > diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> > index 5eb21a95179b..a57932b14bc6 100644
> > --- a/net/ipv4/inet_hashtables.c
> > +++ b/net/ipv4/inet_hashtables.c
> > @@ -1145,3 +1145,60 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
> >         return 0;
> >  }
> >  EXPORT_SYMBOL_GPL(inet_ehash_locks_alloc);
> > +
> > +struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo,
> > +                                                unsigned int ehash_entries)
> > +{
> > +       struct inet_hashinfo *new_hashinfo;
> > +       int i;
> > +
> > +       new_hashinfo = kmalloc(sizeof(*new_hashinfo), GFP_KERNEL);
> > +       if (!new_hashinfo)
> > +               goto err;
> > +
> > +       new_hashinfo->ehash = kvmalloc_array(ehash_entries,
> > +                                            sizeof(struct inet_ehash_bucket),
> > +                                            GFP_KERNEL);
> 
> GFP_KERNEL_ACCOUNT ?

Right, we should account for the allocation.
I'll use it in v2.
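
i.e. the ehash allocation above becomes (sketch of the v2 change):

        new_hashinfo->ehash = kvmalloc_array(ehash_entries,
                                             sizeof(struct inet_ehash_bucket),
                                             GFP_KERNEL_ACCOUNT);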


> 
> > +       if (!new_hashinfo->ehash)
> > +               goto free_hashinfo;
> > +
> > +       new_hashinfo->ehash_mask = ehash_entries - 1;
> > +
> > +       if (inet_ehash_locks_alloc(new_hashinfo))
> > +               goto free_ehash;
> > +
> > +       for (i = 0; i < ehash_entries; i++)
> > +               INIT_HLIST_NULLS_HEAD(&new_hashinfo->ehash[i].chain, i);
> > +
> > +       new_hashinfo->bind_bucket_cachep = hashinfo->bind_bucket_cachep;
> > +       new_hashinfo->bhash = hashinfo->bhash;
> > +       new_hashinfo->bind2_bucket_cachep = hashinfo->bind2_bucket_cachep;
> > +       new_hashinfo->bhash2 = hashinfo->bhash2;
> > +       new_hashinfo->bhash_size = hashinfo->bhash_size;
> > +
> > +       new_hashinfo->lhash2_mask = hashinfo->lhash2_mask;
> > +       new_hashinfo->lhash2 = hashinfo->lhash2;
> > +
> > +       new_hashinfo->pernet = true;
> > +
> > +       return new_hashinfo;
> > +
> > +free_ehash:
> > +       kvfree(new_hashinfo->ehash);
> > +free_hashinfo:
> > +       kfree(new_hashinfo);
> > +err:
> > +       return NULL;
> > +}
> > +EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_alloc);
> > +
> > +void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo)
> > +{
> > +       if (!hashinfo->pernet)
> > +               return;
> > +
> > +       inet_ehash_locks_free(hashinfo);
> > +       kvfree(hashinfo->ehash);
> > +       kfree(hashinfo);
> > +}
> > +EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_free);
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index 47ccc343c9fb..a5d40acde9d6 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -59,8 +59,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> >         inet_twsk_bind_unhash(tw, hashinfo);
> >         spin_unlock(&bhead->lock);
> >
> > -       if (refcount_dec_and_test(&tw->tw_dr->tw_refcount))
> > +       if (refcount_dec_and_test(&tw->tw_dr->tw_refcount)) {
> > +               inet_pernet_hashinfo_free(hashinfo);
> >                 kfree(tw->tw_dr);
> > +       }
> >
> >         inet_twsk_put(tw);
> >  }
> > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> > index 5490c285668b..03a3187c4705 100644
> > --- a/net/ipv4/sysctl_net_ipv4.c
> > +++ b/net/ipv4/sysctl_net_ipv4.c
> > @@ -382,6 +382,48 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl,
> >         return ret;
> >  }
> >
> > +static int proc_tcp_ehash_entries(struct ctl_table *table, int write,
> > +                                 void *buffer, size_t *lenp, loff_t *ppos)
> > +{
> > +       struct net *net = container_of(table->data, struct net,
> > +                                      ipv4.sysctl_tcp_child_ehash_entries);
> > +       struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;
> > +       int tcp_ehash_entries;
> > +       struct ctl_table tbl;
> > +
> > +       tcp_ehash_entries = hinfo->ehash_mask + 1;
> > +
> > +       /* A negative number indicates that the child netns
> > +        * shares the global ehash.
> > +        */
> > +       if (!net_eq(net, &init_net) && !hinfo->pernet)
> > +               tcp_ehash_entries *= -1;
> > +
> > +       tbl.data = &tcp_ehash_entries;
> > +       tbl.maxlen = sizeof(int);
> > +
> > +       return proc_dointvec(&tbl, write, buffer, lenp, ppos);
> > +}
> > +
> > +static int proc_tcp_child_ehash_entries(struct ctl_table *table, int write,
> > +                                       void *buffer, size_t *lenp, loff_t *ppos)
> > +{
> > +       unsigned int tcp_child_ehash_entries;
> > +       int ret;
> > +
> > +       ret = proc_douintvec(table, write, buffer, lenp, ppos);
> > +       if (!write || ret)
> > +               return ret;
> > +
> > +       tcp_child_ehash_entries = READ_ONCE(*(unsigned int *)table->data);
> > +       if (tcp_child_ehash_entries)
> > +               tcp_child_ehash_entries = roundup_pow_of_two(tcp_child_ehash_entries);
> > +
> > +       WRITE_ONCE(*(unsigned int *)table->data, tcp_child_ehash_entries);
> > +
> > +       return 0;
> > +}
> > +
> >  #ifdef CONFIG_IP_ROUTE_MULTIPATH
> >  static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
> >                                           void *buffer, size_t *lenp,
> > @@ -1321,6 +1363,21 @@ static struct ctl_table ipv4_net_table[] = {
> >                 .extra1         = SYSCTL_ZERO,
> >                 .extra2         = SYSCTL_ONE,
> >         },
> > +       {
> > +               .procname       = "tcp_ehash_entries",
> > +               .data           = &init_net.ipv4.sysctl_tcp_child_ehash_entries,
> > +               .mode           = 0444,
> > +               .proc_handler   = proc_tcp_ehash_entries,
> > +       },
> > +       {
> > +               .procname       = "tcp_child_ehash_entries",
> > +               .data           = &init_net.ipv4.sysctl_tcp_child_ehash_entries,
> > +               .maxlen         = sizeof(unsigned int),
> > +               .mode           = 0644,
> > +               .proc_handler   = proc_tcp_child_ehash_entries,
> > +               .extra1         = SYSCTL_ZERO,
> > +               .extra2         = SYSCTL_INT_MAX,
> 
> Have you really tested what happens if you set the sysctl to max value
> 0x7fffffff
> 
> I would assume some kernel allocations will fail, or some loops will
> trigger some kind of soft lockups.

Yes, I saw a vmalloc() error splat and the fallback to the global ehash.
I think capping it at 4Mi or 8Mi should be enough for most workloads.
What do you think?

---8<---
[   46.525863] ------------[ cut here ]------------
[   46.526095] WARNING: CPU: 0 PID: 240 at mm/util.c:624 kvmalloc_node+0xbb/0xc0
[   46.526534] Modules linked in:
[   46.526901] CPU: 0 PID: 240 Comm: ip Not tainted 6.0.0-rc1-per-netns-hash-tcpudp-15620-gd02cde62bac1 #121
[   46.527241] Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014
[   46.527568] RIP: 0010:kvmalloc_node+0xbb/0xc0
[   46.527870] Code: 55 48 89 ef 68 00 04 00 00 48 8d 4c 0a ff e8 7c 74 03 00 48 83 c4 18 5d 41 5c 41 5d c3 cc cc cc cc 41 81 e4 00 20 00 00 75 ed <0f> 0b eb e9 90 55 48 89 fd e8 57 1d 03 00 48 89 ef 84 c0 74 06 5d
[   46.528493] RSP: 0018:ffffc9000022fd80 EFLAGS: 00000246
[   46.528801] RAX: 0000000000000000 RBX: 0000000080000000 RCX: 0000000000000000
[   46.529107] RDX: 0000000000000015 RSI: ffffffff81220571 RDI: 0000000000052cc0
[   46.529390] RBP: 0000000400000000 R08: ffffffff830ee280 R09: 0000000000000060
[   46.529730] R10: ffff888005305a00 R11: 0000000000001788 R12: 0000000000000000
[   46.529989] R13: 00000000ffffffff R14: ffffffff8389db80 R15: ffff88800550e300
[   46.530351] FS:  00007f1ca8200740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[   46.530731] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   46.530931] CR2: 00007f1ca82e9150 CR3: 00000000054ca000 CR4: 00000000000006f0
[   46.531386] Call Trace:
[   46.531966]  <TASK>
[   46.532277]  inet_pernet_hashinfo_alloc+0x40/0xe0
[   46.532530]  tcp_sk_init_pernet_hashinfo+0x26/0x80
[   46.532806]  ops_init+0x7a/0x150
[   46.532965]  setup_net+0x145/0x2b0
[   46.533116]  copy_net_ns+0xf8/0x1c0
[   46.533310]  create_new_namespaces+0x10e/0x2e0
[   46.533478]  unshare_nsproxy_namespaces+0x57/0x90
[   46.533731]  ksys_unshare+0x183/0x320
[   46.533863]  __x64_sys_unshare+0x9/0x10
[   46.534029]  do_syscall_64+0x3b/0x90
[   46.534192]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   46.534506] RIP: 0033:0x7f1ca82f86c7
[   46.534906] Code: 73 01 c3 48 8b 0d a9 07 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 10 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 79 07 0c 00 f7 d8 64 89 01 48
[   46.535581] RSP: 002b:00007ffd394c2298 EFLAGS: 00000206 ORIG_RAX: 0000000000000110
[   46.535860] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f1ca82f86c7
[   46.536118] RDX: 0000000000080000 RSI: 0000560877451812 RDI: 0000000040000000
[   46.536425] RBP: 0000000000000005 R08: 0000000000000000 R09: 00007ffd394c2140
[   46.536826] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
[   46.537094] R13: 00007ffd394c22b8 R14: 00007ffd394c4490 R15: 00007f1ca82006c8
[   46.537437]  </TASK>
[   46.537558] ---[ end trace 0000000000000000 ]---
[   46.538077] TCP: Failed to allocate TCP ehash (entries: 2147483648) for a netns, fallback to use the global one
---8<---
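
If we do cap it, proc_tcp_child_ehash_entries() could clamp the value
before rounding it up, e.g. (hypothetical 8Mi cap, not in this series):

        if (tcp_child_ehash_entries) {
                /* hypothetical cap; any power of two within vmalloc limits */
                tcp_child_ehash_entries = min(tcp_child_ehash_entries,
                                              8U * 1024 * 1024);
                tcp_child_ehash_entries = roundup_pow_of_two(tcp_child_ehash_entries);
        }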


> 
> > +       },
> >         {
> >                 .procname       = "udp_rmem_min",
> >                 .data           = &init_net.ipv4.sysctl_udp_rmem_min,
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index baf6adb723ad..f8ce673e32cb 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -4788,6 +4788,7 @@ void __init tcp_init(void)
> >                 INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain);
> >         }
> >
> > +       tcp_hashinfo.pernet = false;
> >
> >         cnt = tcp_hashinfo.ehash_mask + 1;
> >         sysctl_tcp_max_orphans = cnt / 2;
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index b07930643b11..604119f46b52 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -3109,14 +3109,23 @@ static void __net_exit tcp_sk_exit(struct net *net)
> >         if (net->ipv4.tcp_congestion_control)
> >                 bpf_module_put(net->ipv4.tcp_congestion_control,
> >                                net->ipv4.tcp_congestion_control->owner);
> > -       if (refcount_dec_and_test(&tcp_death_row->tw_refcount))
> > +       if (refcount_dec_and_test(&tcp_death_row->tw_refcount)) {
> > +               inet_pernet_hashinfo_free(tcp_death_row->hashinfo);
> >                 kfree(tcp_death_row);
> > +       }
> >  }
> >
> > -static int __net_init tcp_sk_init(struct net *net)
> > +static void __net_init tcp_set_hashinfo(struct net *net, struct inet_hashinfo *hinfo)
> >  {
> > -       int cnt;
> > +       int ehash_entries = hinfo->ehash_mask + 1;
> 
> 0x7fffffff + 1 -> integer overflow
> 

Nice catch!
I'll change it to unsigned int.
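
i.e. (sketch of the v2 fix):

        static void __net_init tcp_set_hashinfo(struct net *net, struct inet_hashinfo *hinfo)
        {
                /* ehash_mask + 1 still fits in unsigned int */
                unsigned int ehash_entries = hinfo->ehash_mask + 1;
                ...
        }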


> >
> > +       net->ipv4.tcp_death_row->hashinfo = hinfo;
> > +       net->ipv4.tcp_death_row->sysctl_max_tw_buckets = ehash_entries / 2;
> > +       net->ipv4.sysctl_max_syn_backlog = max(128, ehash_entries / 128);
> > +}
> > +
> > +static int __net_init tcp_sk_init(struct net *net)
> > +{
> >         net->ipv4.sysctl_tcp_ecn = 2;
> >         net->ipv4.sysctl_tcp_ecn_fallback = 1;
> >
> > @@ -3145,12 +3154,10 @@ static int __net_init tcp_sk_init(struct net *net)
> >         net->ipv4.tcp_death_row = kzalloc(sizeof(struct inet_timewait_death_row), GFP_KERNEL);
> >         if (!net->ipv4.tcp_death_row)
> >                 return -ENOMEM;
> > +
> >         refcount_set(&net->ipv4.tcp_death_row->tw_refcount, 1);
> > -       cnt = tcp_hashinfo.ehash_mask + 1;
> > -       net->ipv4.tcp_death_row->sysctl_max_tw_buckets = cnt / 2;
> > -       net->ipv4.tcp_death_row->hashinfo = &tcp_hashinfo;
> > +       tcp_set_hashinfo(net, &tcp_hashinfo);
> >
> > -       net->ipv4.sysctl_max_syn_backlog = max(128, cnt / 128);
> >         net->ipv4.sysctl_tcp_sack = 1;
> >         net->ipv4.sysctl_tcp_window_scaling = 1;
> >         net->ipv4.sysctl_tcp_timestamps = 1;
> > @@ -3206,18 +3213,46 @@ static int __net_init tcp_sk_init(struct net *net)
> >         return 0;
> >  }
> >
> > +static int __net_init tcp_sk_init_pernet_hashinfo(struct net *net, struct net *old_net)
> > +{
> > +       struct inet_hashinfo *child_hinfo;
> > +       int ehash_entries;
> > +
> > +       ehash_entries = READ_ONCE(old_net->ipv4.sysctl_tcp_child_ehash_entries);
> > +       if (!ehash_entries)
> > +               goto out;
> > +
> > +       child_hinfo = inet_pernet_hashinfo_alloc(&tcp_hashinfo, ehash_entries);
> > +       if (child_hinfo)
> > +               tcp_set_hashinfo(net, child_hinfo);
> > +       else
> > +               pr_warn("Failed to allocate TCP ehash (entries: %u) "
> > +                       "for a netns, fallback to use the global one\n",
> > +                       ehash_entries);
> > +out:
> > +       return 0;
> > +}
> > +
> >  static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
> >  {
> > +       bool purge_once = true;
> >         struct net *net;
> >
> > -       inet_twsk_purge(&tcp_hashinfo, AF_INET);
> > +       list_for_each_entry(net, net_exit_list, exit_list) {
> > +               if (net->ipv4.tcp_death_row->hashinfo->pernet) {
> > +                       inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET);
> > +               } else if (purge_once) {
> > +                       inet_twsk_purge(&tcp_hashinfo, AF_INET);
> > +                       purge_once = false;
> > +               }
> >
> > -       list_for_each_entry(net, net_exit_list, exit_list)
> >                 tcp_fastopen_ctx_destroy(net);
> > +       }
> >  }
> >
> >  static struct pernet_operations __net_initdata tcp_sk_ops = {
> >         .init      = tcp_sk_init,
> > +       .init2     = tcp_sk_init_pernet_hashinfo,
> >         .exit      = tcp_sk_exit,
> >         .exit_batch = tcp_sk_exit_batch,
> >  };
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index 27b2fd98a2c4..19f730428720 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -2229,7 +2229,17 @@ static void __net_exit tcpv6_net_exit(struct net *net)
> >
> >  static void __net_exit tcpv6_net_exit_batch(struct list_head *net_exit_list)
> >  {
> > -       inet_twsk_purge(&tcp_hashinfo, AF_INET6);
> > +       bool purge_once = true;
> > +       struct net *net;
> > +
> 
> This looks like a duplicate of ipv4 function. Opportunity of factorization ?

Exactly.
I'll factorise it in v2.
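
The shared helper could look like this (sketch; the name is hypothetical):

        void tcp_twsk_purge(struct list_head *net_exit_list, int family)
        {
                bool purge_once = true;
                struct net *net;

                list_for_each_entry(net, net_exit_list, exit_list) {
                        if (net->ipv4.tcp_death_row->hashinfo->pernet) {
                                inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, family);
                        } else if (purge_once) {
                                inet_twsk_purge(&tcp_hashinfo, family);
                                purge_once = false;
                        }
                }
        }

so that tcp_sk_exit_batch() passes AF_INET and tcpv6_net_exit_batch()
passes AF_INET6.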


> 
> > +       list_for_each_entry(net, net_exit_list, exit_list) {
> > +               if (net->ipv4.tcp_death_row->hashinfo->pernet) {
> > +                       inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET6);
> > +               } else if (purge_once) {
> > +                       inet_twsk_purge(&tcp_hashinfo, AF_INET6);
> > +                       purge_once = false;
> > +               }
> > +       }
> >  }
> >
> >  static struct pernet_operations tcpv6_net_ops = {
> > --
> > 2.30.2
> >


* Re: [PATCH v1 net-next 06/13] tcp: Set NULL to sk->sk_prot->h.hashinfo.
  2022-08-26 15:40   ` Eric Dumazet
@ 2022-08-26 17:26     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2022-08-26 17:26 UTC (permalink / raw)
  To: edumazet
  Cc: chuck.lever, davem, jlayton, keescook, kuba, kuni1840, kuniyu,
	linux-fsdevel, mcgrof, netdev, pabeni, yzaikin

From:   Eric Dumazet <edumazet@google.com>
Date:   Fri, 26 Aug 2022 08:40:49 -0700
> On Thu, Aug 25, 2022 at 5:07 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > We will soon introduce an optional per-netns ehash.
> >
> > This means we cannot use the global sk->sk_prot->h.hashinfo
> > to fetch a TCP hashinfo.
> >
> > Instead, set NULL to sk->sk_prot->h.hashinfo for TCP and get
> > a proper hashinfo from net->ipv4.tcp_death_row->hashinfo.
> >
> > Note that we need not use sk->sk_prot->h.hashinfo if DCCP is
> > disabled.
> >
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  include/net/inet_hashtables.h   | 10 ++++++++++
> >  net/ipv4/af_inet.c              |  2 +-
> >  net/ipv4/inet_connection_sock.c |  6 +++---
> >  net/ipv4/inet_hashtables.c      | 14 +++++++-------
> >  net/ipv4/tcp_ipv4.c             |  2 +-
> >  net/ipv6/tcp_ipv6.c             |  2 +-
> >  6 files changed, 23 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> > index 44a419b9e3d5..2c866112433e 100644
> > --- a/include/net/inet_hashtables.h
> > +++ b/include/net/inet_hashtables.h
> > @@ -170,6 +170,16 @@ struct inet_hashinfo {
> >         struct inet_listen_hashbucket   *lhash2;
> >  };
> >
> > +static inline struct inet_hashinfo *inet_get_hashinfo(const struct sock *sk)
> > +{
> > +#if IS_ENABLED(CONFIG_IP_DCCP)
> > +       return sk->sk_prot->h.hashinfo ? :
> > +               sock_net(sk)->ipv4.tcp_death_row->hashinfo;
> > +#else
> > +       return sock_net(sk)->ipv4.tcp_death_row->hashinfo;
> > +#endif
> > +}
> >
> 
> If the sk_prot->h.hashinfo must disappear, I would rather add a new
> inet->hashinfo field
> 
> return inet_sk(sk)->hashinfo
> 
> Conceptually, the pointer no longer belongs to sk_prot, and not in struct net,
> otherwise you should name this helper   tcp_or_dccp_get_hashinfo() to avoid
> any temptation to use it for other inet protocol.

That makes sense.

To keep the series simple, I'll use tcp_or_dccp_get_hashinfo() in v2 and
post a follow-up patch to add inet_sk(sk)->hashinfo, which I think needs
a bit more work and testing.
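
The renamed helper would keep the same body (sketch of the v2 version):

        static inline struct inet_hashinfo *tcp_or_dccp_get_hashinfo(const struct sock *sk)
        {
        #if IS_ENABLED(CONFIG_IP_DCCP)
                return sk->sk_prot->h.hashinfo ? :
                        sock_net(sk)->ipv4.tcp_death_row->hashinfo;
        #else
                return sock_net(sk)->ipv4.tcp_death_row->hashinfo;
        #endif
        }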

Thank you!
