linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table.
@ 2022-07-06  5:21 Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 01/16] sysctl: Clean up proc_handler definitions Kuniyuki Iwashima
                   ` (15 more replies)
  0 siblings, 16 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This series changes some proc handlers to use READ_ONCE()/WRITE_ONCE()
internally and tries to fix a data-race on the sysctl side.  However, we
still need a fix for readers/writers in other subsystems.

Not to miss the fix, we convert such handlers to a wrapper function of one
with the "_lockless" suffix.  When we add a fix on other subsystems, we set
the lockless handler as .proc_handler to mark the sysctl knob safe.

After this series, if a proc handler does not have the lockless suffix, it
means we need fixes in other subsystems.  Finally, when there is no user of
proc handlers without the lockless suffix, we can remove them and get free
from sysctl data-races.

This series starts fixing from ipv4_table.


Kuniyuki Iwashima (16):
  sysctl: Clean up proc_handler definitions.
  sysctl: Add proc_dobool_lockless().
  sysctl: Add proc_dointvec_lockless().
  sysctl: Add proc_douintvec_lockless().
  sysctl: Add proc_dointvec_minmax_lockless().
  sysctl: Add proc_douintvec_minmax_lockless().
  sysctl: Add proc_doulongvec_minmax_lockless().
  sysctl: Add proc_dointvec_jiffies_lockless().
  tcp: Fix a data-race around sysctl_tcp_max_orphans.
  inetpeer: Fix data-races around sysctl.
  net: Fix a data-race around sysctl_mem.
  tcp: Mark sysctl_tcp_low_latency obsolete.
  cipso: Fix a data-race around cipso_v4_cache_bucketsize.
  cipso: Fix data-races around boolean sysctl.
  icmp: Fix data-races around sysctl.
  ipv4: Fix a data-race around sysctl_fib_sync_mem.

 Documentation/networking/ip-sysctl.rst |   2 +-
 include/linux/sysctl.h                 |  51 ++---
 include/net/sock.h                     |   2 +-
 include/trace/events/sock.h            |   6 +-
 kernel/sysctl.c                        | 258 ++++++++++++++-----------
 net/decnet/sysctl_net_decnet.c         |   2 +-
 net/ipv4/cipso_ipv4.c                  |  19 +-
 net/ipv4/fib_trie.c                    |   2 +-
 net/ipv4/icmp.c                        |   5 +-
 net/ipv4/inetpeer.c                    |  13 +-
 net/ipv4/sysctl_net_ipv4.c             |  29 +--
 net/ipv4/tcp.c                         |   3 +-
 net/sctp/sysctl.c                      |   2 +-
 13 files changed, 214 insertions(+), 180 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v1 net 01/16] sysctl: Clean up proc_handler definitions.
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 02/16] sysctl: Add proc_dobool_lockless() Kuniyuki Iwashima
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel

All proc_handler variants have almost the same function prototypes in
sysctl.h and empty functions in sysctl.c in case CONFIG_PROC_SYSCTL is
disabled.

This patch arranges them in the same order and defines them cleanly with
two macros so that we can add lockless helpers easily in the following
commits.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/linux/sysctl.h |  43 ++++++++---------
 kernel/sysctl.c        | 105 ++++++++++-------------------------------
 2 files changed, 45 insertions(+), 103 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 80263f7cdb77..9beab3a4de3d 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -62,29 +62,26 @@ extern const int sysctl_vals[];
 extern const unsigned long sysctl_long_vals[];
 
 typedef int proc_handler(struct ctl_table *ctl, int write, void *buffer,
-		size_t *lenp, loff_t *ppos);
-
-int proc_dostring(struct ctl_table *, int, void *, size_t *, loff_t *);
-int proc_dobool(struct ctl_table *table, int write, void *buffer,
-		size_t *lenp, loff_t *ppos);
-int proc_dointvec(struct ctl_table *, int, void *, size_t *, loff_t *);
-int proc_douintvec(struct ctl_table *, int, void *, size_t *, loff_t *);
-int proc_dointvec_minmax(struct ctl_table *, int, void *, size_t *, loff_t *);
-int proc_douintvec_minmax(struct ctl_table *table, int write, void *buffer,
-		size_t *lenp, loff_t *ppos);
-int proc_dou8vec_minmax(struct ctl_table *table, int write, void *buffer,
-			size_t *lenp, loff_t *ppos);
-int proc_dointvec_jiffies(struct ctl_table *, int, void *, size_t *, loff_t *);
-int proc_dointvec_userhz_jiffies(struct ctl_table *, int, void *, size_t *,
-		loff_t *);
-int proc_dointvec_ms_jiffies(struct ctl_table *, int, void *, size_t *,
-		loff_t *);
-int proc_doulongvec_minmax(struct ctl_table *, int, void *, size_t *, loff_t *);
-int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int, void *,
-		size_t *, loff_t *);
-int proc_do_large_bitmap(struct ctl_table *, int, void *, size_t *, loff_t *);
-int proc_do_static_key(struct ctl_table *table, int write, void *buffer,
-		size_t *lenp, loff_t *ppos);
+			 size_t *lenp, loff_t *ppos);
+
+#define PROC_HANDLER(function)						\
+	int function(struct ctl_table *table, int write, void *buffer,	\
+		     size_t *lenp, loff_t *ppos)
+
+PROC_HANDLER(proc_dostring);
+PROC_HANDLER(proc_dobool);
+PROC_HANDLER(proc_dointvec);
+PROC_HANDLER(proc_douintvec);
+PROC_HANDLER(proc_dointvec_minmax);
+PROC_HANDLER(proc_douintvec_minmax);
+PROC_HANDLER(proc_dou8vec_minmax);
+PROC_HANDLER(proc_doulongvec_minmax);
+PROC_HANDLER(proc_doulongvec_ms_jiffies_minmax);
+PROC_HANDLER(proc_dointvec_jiffies);
+PROC_HANDLER(proc_dointvec_userhz_jiffies);
+PROC_HANDLER(proc_dointvec_ms_jiffies);
+PROC_HANDLER(proc_do_large_bitmap);
+PROC_HANDLER(proc_do_static_key);
 
 /*
  * Register a set of sysctl names by calling register_sysctl_table
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e52b6e372c60..1082c8bc5ba5 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1016,7 +1016,6 @@ int proc_dou8vec_minmax(struct ctl_table *table, int write,
 		*data = val;
 	return 0;
 }
-EXPORT_SYMBOL_GPL(proc_dou8vec_minmax);
 
 #ifdef CONFIG_MAGIC_SYSRQ
 static int sysrq_sysctl_handler(struct ctl_table *table, int write,
@@ -1475,83 +1474,28 @@ int proc_do_large_bitmap(struct ctl_table *table, int write,
 
 #else /* CONFIG_PROC_SYSCTL */
 
-int proc_dostring(struct ctl_table *table, int write,
-		  void *buffer, size_t *lenp, loff_t *ppos)
-{
-	return -ENOSYS;
-}
-
-int proc_dobool(struct ctl_table *table, int write,
-		void *buffer, size_t *lenp, loff_t *ppos)
-{
-	return -ENOSYS;
-}
-
-int proc_dointvec(struct ctl_table *table, int write,
-		  void *buffer, size_t *lenp, loff_t *ppos)
-{
-	return -ENOSYS;
-}
-
-int proc_douintvec(struct ctl_table *table, int write,
-		  void *buffer, size_t *lenp, loff_t *ppos)
-{
-	return -ENOSYS;
-}
-
-int proc_dointvec_minmax(struct ctl_table *table, int write,
-		    void *buffer, size_t *lenp, loff_t *ppos)
-{
-	return -ENOSYS;
-}
-
-int proc_douintvec_minmax(struct ctl_table *table, int write,
-			  void *buffer, size_t *lenp, loff_t *ppos)
-{
-	return -ENOSYS;
-}
-
-int proc_dou8vec_minmax(struct ctl_table *table, int write,
-			void *buffer, size_t *lenp, loff_t *ppos)
-{
-	return -ENOSYS;
-}
 
-int proc_dointvec_jiffies(struct ctl_table *table, int write,
-		    void *buffer, size_t *lenp, loff_t *ppos)
-{
-	return -ENOSYS;
-}
-
-int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write,
-		    void *buffer, size_t *lenp, loff_t *ppos)
-{
-	return -ENOSYS;
-}
-
-int proc_dointvec_ms_jiffies(struct ctl_table *table, int write,
-			     void *buffer, size_t *lenp, loff_t *ppos)
-{
-	return -ENOSYS;
-}
-
-int proc_doulongvec_minmax(struct ctl_table *table, int write,
-		    void *buffer, size_t *lenp, loff_t *ppos)
-{
-	return -ENOSYS;
-}
-
-int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write,
-				      void *buffer, size_t *lenp, loff_t *ppos)
-{
-	return -ENOSYS;
-}
+#define PROC_HANDLER_ENOSYS(function)				\
+	int function(struct ctl_table *table, int write,	\
+		 void *buffer, size_t *lenp, loff_t *ppos)	\
+	{							\
+		return -ENOSYS;					\
+	}
 
-int proc_do_large_bitmap(struct ctl_table *table, int write,
-			 void *buffer, size_t *lenp, loff_t *ppos)
-{
-	return -ENOSYS;
-}
+PROC_HANDLER_ENOSYS(proc_dostring);
+PROC_HANDLER_ENOSYS(proc_dobool);
+PROC_HANDLER_ENOSYS(proc_dointvec);
+PROC_HANDLER_ENOSYS(proc_douintvec);
+PROC_HANDLER_ENOSYS(proc_dointvec_minmax);
+PROC_HANDLER_ENOSYS(proc_douintvec_minmax);
+PROC_HANDLER_ENOSYS(proc_dou8vec_minmax);
+PROC_HANDLER_ENOSYS(proc_doulongvec_minmax);
+PROC_HANDLER_ENOSYS(proc_doulongvec_ms_jiffies_minmax);
+PROC_HANDLER_ENOSYS(proc_dointvec_jiffies);
+PROC_HANDLER_ENOSYS(proc_dointvec_userhz_jiffies);
+PROC_HANDLER_ENOSYS(proc_dointvec_ms_jiffies);
+PROC_HANDLER_ENOSYS(proc_do_cad_pid);
+PROC_HANDLER_ENOSYS(proc_do_large_bitmap);
 
 #endif /* CONFIG_PROC_SYSCTL */
 
@@ -2448,15 +2392,16 @@ int __init sysctl_init_bases(void)
  * No sense putting this after each symbol definition, twice,
  * exception granted :-)
  */
+EXPORT_SYMBOL(proc_dostring);
 EXPORT_SYMBOL(proc_dobool);
 EXPORT_SYMBOL(proc_dointvec);
 EXPORT_SYMBOL(proc_douintvec);
-EXPORT_SYMBOL(proc_dointvec_jiffies);
 EXPORT_SYMBOL(proc_dointvec_minmax);
 EXPORT_SYMBOL_GPL(proc_douintvec_minmax);
-EXPORT_SYMBOL(proc_dointvec_userhz_jiffies);
-EXPORT_SYMBOL(proc_dointvec_ms_jiffies);
-EXPORT_SYMBOL(proc_dostring);
+EXPORT_SYMBOL_GPL(proc_dou8vec_minmax);
 EXPORT_SYMBOL(proc_doulongvec_minmax);
 EXPORT_SYMBOL(proc_doulongvec_ms_jiffies_minmax);
+EXPORT_SYMBOL(proc_dointvec_jiffies);
+EXPORT_SYMBOL(proc_dointvec_userhz_jiffies);
+EXPORT_SYMBOL(proc_dointvec_ms_jiffies);
 EXPORT_SYMBOL(proc_do_large_bitmap);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 02/16] sysctl: Add proc_dobool_lockless().
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 01/16] sysctl: Clean up proc_handler definitions Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 03/16] sysctl: Add proc_dointvec_lockless() Kuniyuki Iwashima
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel, Jia He

A sysctl variable is accessed concurrently, and there is always a chance of
data-race.  So, all readers and writers need some basic protection to avoid
load/store-tearing.

This patch changes proc_dobool() to use READ_ONCE()/WRITE_ONCE() internally
to fix a data-race on the sysctl side.  For now, proc_dobool() itself is
tolerant to a data-race, but we still need to add annotations on the other
subsystem's side.

In case we miss such fixes, this patch converts proc_dobool() to a wrapper
of proc_dobool_lockless().  When we fix a data-race in the other subsystem,
we can explicitly set it as a handler.

Also, this patch removes proc_dobool()'s document and adds
proc_dobool_lockless()'s one so that no one will use proc_dobool() anymore.

Fixes: a2071573d634 ("sysctl: introduce new proc handler proc_dobool")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Jia He <hejianet@gmail.com>
---
 include/linux/sysctl.h |  2 ++
 kernel/sysctl.c        | 23 ++++++++++++++++-------
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 9beab3a4de3d..fcafc16abbad 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -83,6 +83,8 @@ PROC_HANDLER(proc_dointvec_ms_jiffies);
 PROC_HANDLER(proc_do_large_bitmap);
 PROC_HANDLER(proc_do_static_key);
 
+PROC_HANDLER(proc_dobool_lockless);
+
 /*
  * Register a set of sysctl names by calling register_sysctl_table
  * with an initialised array of struct ctl_table's.  An entry with 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1082c8bc5ba5..bc6fcc64eeaf 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -424,13 +424,12 @@ static void proc_put_char(void **buf, size_t *size, char c)
 }
 
 static int do_proc_dobool_conv(bool *negp, unsigned long *lvalp,
-				int *valp,
-				int write, void *data)
+			       int *valp, int write, void *data)
 {
 	if (write) {
-		*(bool *)valp = *lvalp;
+		WRITE_ONCE(*(bool *)valp, *lvalp);
 	} else {
-		int val = *(bool *)valp;
+		int val = READ_ONCE(*(bool *)valp);
 
 		*lvalp = (unsigned long)val;
 		*negp = false;
@@ -701,7 +700,7 @@ int do_proc_douintvec(struct ctl_table *table, int write,
 }
 
 /**
- * proc_dobool - read/write a bool
+ * proc_dobool_lockless - read/write a bool locklessly
  * @table: the sysctl table
  * @write: %TRUE if this is a write to the sysctl file
  * @buffer: the user buffer
@@ -713,13 +712,19 @@ int do_proc_douintvec(struct ctl_table *table, int write,
  *
  * Returns 0 on success.
  */
-int proc_dobool(struct ctl_table *table, int write, void *buffer,
-		size_t *lenp, loff_t *ppos)
+int proc_dobool_lockless(struct ctl_table *table, int write, void *buffer,
+			 size_t *lenp, loff_t *ppos)
 {
 	return do_proc_dointvec(table, write, buffer, lenp, ppos,
 				do_proc_dobool_conv, NULL);
 }
 
+int proc_dobool(struct ctl_table *table, int write, void *buffer,
+		size_t *lenp, loff_t *ppos)
+{
+	return proc_dobool_lockless(table, write, buffer, lenp, ppos);
+}
+
 /**
  * proc_dointvec - read a vector of integers
  * @table: the sysctl table
@@ -1497,6 +1502,8 @@ PROC_HANDLER_ENOSYS(proc_dointvec_ms_jiffies);
 PROC_HANDLER_ENOSYS(proc_do_cad_pid);
 PROC_HANDLER_ENOSYS(proc_do_large_bitmap);
 
+PROC_HANDLER_ENOSYS(proc_dobool_lockless);
+
 #endif /* CONFIG_PROC_SYSCTL */
 
 #if defined(CONFIG_SYSCTL)
@@ -2405,3 +2412,5 @@ EXPORT_SYMBOL(proc_dointvec_jiffies);
 EXPORT_SYMBOL(proc_dointvec_userhz_jiffies);
 EXPORT_SYMBOL(proc_dointvec_ms_jiffies);
 EXPORT_SYMBOL(proc_do_large_bitmap);
+
+EXPORT_SYMBOL(proc_dobool_lockless);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 03/16] sysctl: Add proc_dointvec_lockless().
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 01/16] sysctl: Clean up proc_handler definitions Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 02/16] sysctl: Add proc_dobool_lockless() Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  7:00   ` Eric Dumazet
  2022-07-06  5:21 ` [PATCH v1 net 04/16] sysctl: Add proc_douintvec_lockless() Kuniyuki Iwashima
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel

A sysctl variable is accessed concurrently, and there is always a chance of
data-race.  So, all readers and writers need some basic protection to avoid
load/store-tearing.

This patch changes proc_dointvec() to use READ_ONCE()/WRITE_ONCE()
internally to fix a data-race on the sysctl side.  For now, proc_dointvec()
itself is tolerant to a data-race, but we still need to add annotations on
the other subsystem's side.

In case we miss such fixes, this patch converts proc_dointvec() to a
wrapper of proc_dointvec_lockless().  When we fix a data-race in the other
subsystem, we can explicitly set it as a handler.

Also, this patch removes proc_dointvec()'s document and adds
proc_dointvec_lockless()'s one so that no one will use proc_dointvec()
anymore.

While we are on it, we remove some trailing spaces.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/linux/sysctl.h |  1 +
 kernel/sysctl.c        | 27 +++++++++++++++++++--------
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index fcafc16abbad..cb87919b5508 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -84,6 +84,7 @@ PROC_HANDLER(proc_do_large_bitmap);
 PROC_HANDLER(proc_do_static_key);
 
 PROC_HANDLER(proc_dobool_lockless);
+PROC_HANDLER(proc_dointvec_lockless);
 
 /*
  * Register a set of sysctl names by calling register_sysctl_table
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index bc6fcc64eeaf..50d9b78aa0b3 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -445,14 +445,17 @@ static int do_proc_dointvec_conv(bool *negp, unsigned long *lvalp,
 		if (*negp) {
 			if (*lvalp > (unsigned long) INT_MAX + 1)
 				return -EINVAL;
-			*valp = -*lvalp;
+
+			WRITE_ONCE(*valp, -*lvalp);
 		} else {
 			if (*lvalp > (unsigned long) INT_MAX)
 				return -EINVAL;
-			*valp = *lvalp;
+
+			WRITE_ONCE(*valp, *lvalp);
 		}
 	} else {
-		int val = *valp;
+		int val = READ_ONCE(*valp);
+
 		if (val < 0) {
 			*negp = true;
 			*lvalp = -(unsigned long)val;
@@ -491,12 +494,12 @@ static int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
 	int *i, vleft, first = 1, err = 0;
 	size_t left;
 	char *p;
-	
+
 	if (!tbl_data || !table->maxlen || !*lenp || (*ppos && !write)) {
 		*lenp = 0;
 		return 0;
 	}
-	
+
 	i = (int *) tbl_data;
 	vleft = table->maxlen / sizeof(*i);
 	left = *lenp;
@@ -726,7 +729,7 @@ int proc_dobool(struct ctl_table *table, int write, void *buffer,
 }
 
 /**
- * proc_dointvec - read a vector of integers
+ * proc_dointvec_lockless - read/write a vector of integers locklessly
  * @table: the sysctl table
  * @write: %TRUE if this is a write to the sysctl file
  * @buffer: the user buffer
@@ -734,14 +737,20 @@ int proc_dobool(struct ctl_table *table, int write, void *buffer,
  * @ppos: file position
  *
  * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
- * values from/to the user buffer, treated as an ASCII string. 
+ * values from/to the user buffer, treated as an ASCII string.
  *
  * Returns 0 on success.
  */
+int proc_dointvec_lockless(struct ctl_table *table, int write, void *buffer,
+			   size_t *lenp, loff_t *ppos)
+{
+	return do_proc_dointvec(table, write, buffer, lenp, ppos, NULL, NULL);
+}
+
 int proc_dointvec(struct ctl_table *table, int write, void *buffer,
 		  size_t *lenp, loff_t *ppos)
 {
-	return do_proc_dointvec(table, write, buffer, lenp, ppos, NULL, NULL);
+	return proc_dointvec_lockless(table, write, buffer, lenp, ppos);
 }
 
 #ifdef CONFIG_COMPACTION
@@ -1503,6 +1512,7 @@ PROC_HANDLER_ENOSYS(proc_do_cad_pid);
 PROC_HANDLER_ENOSYS(proc_do_large_bitmap);
 
 PROC_HANDLER_ENOSYS(proc_dobool_lockless);
+PROC_HANDLER_ENOSYS(proc_dointvec_lockless);
 
 #endif /* CONFIG_PROC_SYSCTL */
 
@@ -2414,3 +2424,4 @@ EXPORT_SYMBOL(proc_dointvec_ms_jiffies);
 EXPORT_SYMBOL(proc_do_large_bitmap);
 
 EXPORT_SYMBOL(proc_dobool_lockless);
+EXPORT_SYMBOL(proc_dointvec_lockless);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 04/16] sysctl: Add proc_douintvec_lockless().
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
                   ` (2 preceding siblings ...)
  2022-07-06  5:21 ` [PATCH v1 net 03/16] sysctl: Add proc_dointvec_lockless() Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 05/16] sysctl: Add proc_dointvec_minmax_lockless() Kuniyuki Iwashima
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel,
	Subash Abhinov Kasiviswanathan

A sysctl variable is accessed concurrently, and there is always a chance of
data-race.  So, all readers and writers need some basic protection to avoid
load/store-tearing.

This patch changes proc_douintvec() to use READ_ONCE()/WRITE_ONCE()
internally to fix a data-race on the sysctl side.  For now,
proc_douintvec() itself is tolerant to a data-race, but we still need to
add annotations on the other subsystem's side.

In case we miss such fixes, this patch converts proc_douintvec() to a
wrapper of proc_douintvec_lockless().  When we fix a data-race in the other
subsystem, we can explicitly set it as a handler.

Also, this patch removes proc_douintvec()'s document and adds
proc_douintvec_lockless()'s one so that no one will use proc_douintvec()
anymore.

Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
 include/linux/sysctl.h |  1 +
 kernel/sysctl.c        | 20 +++++++++++++++-----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index cb87919b5508..770ee1833c25 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -85,6 +85,7 @@ PROC_HANDLER(proc_do_static_key);
 
 PROC_HANDLER(proc_dobool_lockless);
 PROC_HANDLER(proc_dointvec_lockless);
+PROC_HANDLER(proc_douintvec_lockless);
 
 /*
  * Register a set of sysctl names by calling register_sysctl_table
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 50d9b78aa0b3..be8a7d912180 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -474,9 +474,11 @@ static int do_proc_douintvec_conv(unsigned long *lvalp,
 	if (write) {
 		if (*lvalp > UINT_MAX)
 			return -EINVAL;
-		*valp = *lvalp;
+
+		WRITE_ONCE(*valp, *lvalp);
 	} else {
-		unsigned int val = *valp;
+		unsigned int val = READ_ONCE(*valp);
+
 		*lvalp = (unsigned long)val;
 	}
 	return 0;
@@ -775,7 +777,7 @@ static int proc_dointvec_minmax_warn_RT_change(struct ctl_table *table,
 #endif
 
 /**
- * proc_douintvec - read a vector of unsigned integers
+ * proc_douintvec_lockless - read/write a vector of unsigned integers locklessly
  * @table: the sysctl table
  * @write: %TRUE if this is a write to the sysctl file
  * @buffer: the user buffer
@@ -787,13 +789,19 @@ static int proc_dointvec_minmax_warn_RT_change(struct ctl_table *table,
  *
  * Returns 0 on success.
  */
-int proc_douintvec(struct ctl_table *table, int write, void *buffer,
-		size_t *lenp, loff_t *ppos)
+int proc_douintvec_lockless(struct ctl_table *table, int write, void *buffer,
+			    size_t *lenp, loff_t *ppos)
 {
 	return do_proc_douintvec(table, write, buffer, lenp, ppos,
 				 do_proc_douintvec_conv, NULL);
 }
 
+int proc_douintvec(struct ctl_table *table, int write, void *buffer,
+		   size_t *lenp, loff_t *ppos)
+{
+	return proc_douintvec_lockless(table, write, buffer, lenp, ppos);
+}
+
 /*
  * Taint values can only be increased
  * This means we can safely use a temporary.
@@ -1513,6 +1521,7 @@ PROC_HANDLER_ENOSYS(proc_do_large_bitmap);
 
 PROC_HANDLER_ENOSYS(proc_dobool_lockless);
 PROC_HANDLER_ENOSYS(proc_dointvec_lockless);
+PROC_HANDLER_ENOSYS(proc_douintvec_lockless);
 
 #endif /* CONFIG_PROC_SYSCTL */
 
@@ -2425,3 +2434,4 @@ EXPORT_SYMBOL(proc_do_large_bitmap);
 
 EXPORT_SYMBOL(proc_dobool_lockless);
 EXPORT_SYMBOL(proc_dointvec_lockless);
+EXPORT_SYMBOL(proc_douintvec_lockless);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 05/16] sysctl: Add proc_dointvec_minmax_lockless().
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
                   ` (3 preceding siblings ...)
  2022-07-06  5:21 ` [PATCH v1 net 04/16] sysctl: Add proc_douintvec_lockless() Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 06/16] sysctl: Add proc_douintvec_minmax_lockless() Kuniyuki Iwashima
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel

A sysctl variable is accessed concurrently, and there is always a chance of
data-race.  So, all readers and writers need some basic protection to avoid
load/store-tearing.

This patch changes proc_dointvec_minmax() to use READ_ONCE()/WRITE_ONCE()
internally to fix a data-race on the sysctl side.  For now,
proc_dointvec_minmax() itself is tolerant to a data-race, but we still need
to add annotations on the other subsystem's side.

In case we miss such fixes, this patch converts proc_dointvec_minmax() to a
wrapper of proc_dointvec_minmax_lockless().  When we fix a data-race in the
other subsystem, we can explicitly set it as a handler.

Also, this patch removes proc_dointvec_minmax()'s document and adds
proc_dointvec_minmax_lockless()'s one so that no one will use
proc_dointvec_minmax() anymore.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/linux/sysctl.h |  1 +
 kernel/sysctl.c        | 19 +++++++++++++++----
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 770ee1833c25..7f91cc625d56 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -86,6 +86,7 @@ PROC_HANDLER(proc_do_static_key);
 PROC_HANDLER(proc_dobool_lockless);
 PROC_HANDLER(proc_dointvec_lockless);
 PROC_HANDLER(proc_douintvec_lockless);
+PROC_HANDLER(proc_dointvec_minmax_lockless);
 
 /*
  * Register a set of sysctl names by calling register_sysctl_table
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index be8a7d912180..aead731ae74b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -879,14 +879,16 @@ static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
 		if ((param->min && *param->min > tmp) ||
 		    (param->max && *param->max < tmp))
 			return -EINVAL;
-		*valp = tmp;
+
+		WRITE_ONCE(*valp, tmp);
 	}
 
 	return 0;
 }
 
 /**
- * proc_dointvec_minmax - read a vector of integers with min/max values
+ * proc_dointvec_minmax_lockless - read/write a vector of integers with
+ * min/max values locklessly
  * @table: the sysctl table
  * @write: %TRUE if this is a write to the sysctl file
  * @buffer: the user buffer
@@ -901,8 +903,8 @@ static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
  *
  * Returns 0 on success or -EINVAL on write when the range check fails.
  */
-int proc_dointvec_minmax(struct ctl_table *table, int write,
-		  void *buffer, size_t *lenp, loff_t *ppos)
+int proc_dointvec_minmax_lockless(struct ctl_table *table, int write,
+				  void *buffer, size_t *lenp, loff_t *ppos)
 {
 	struct do_proc_dointvec_minmax_conv_param param = {
 		.min = (int *) table->extra1,
@@ -912,6 +914,13 @@ int proc_dointvec_minmax(struct ctl_table *table, int write,
 				do_proc_dointvec_minmax_conv, &param);
 }
 
+int proc_dointvec_minmax(struct ctl_table *table, int write,
+			 void *buffer, size_t *lenp, loff_t *ppos)
+{
+	return proc_dointvec_minmax_lockless(table, write, buffer,
+					     lenp, ppos);
+}
+
 /**
  * struct do_proc_douintvec_minmax_conv_param - proc_douintvec_minmax() range checking structure
  * @min: pointer to minimum allowable value
@@ -1522,6 +1531,7 @@ PROC_HANDLER_ENOSYS(proc_do_large_bitmap);
 PROC_HANDLER_ENOSYS(proc_dobool_lockless);
 PROC_HANDLER_ENOSYS(proc_dointvec_lockless);
 PROC_HANDLER_ENOSYS(proc_douintvec_lockless);
+PROC_HANDLER_ENOSYS(proc_dointvec_minmax_lockless);
 
 #endif /* CONFIG_PROC_SYSCTL */
 
@@ -2435,3 +2445,4 @@ EXPORT_SYMBOL(proc_do_large_bitmap);
 EXPORT_SYMBOL(proc_dobool_lockless);
 EXPORT_SYMBOL(proc_dointvec_lockless);
 EXPORT_SYMBOL(proc_douintvec_lockless);
+EXPORT_SYMBOL(proc_dointvec_minmax_lockless);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 06/16] sysctl: Add proc_douintvec_minmax_lockless().
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
                   ` (4 preceding siblings ...)
  2022-07-06  5:21 ` [PATCH v1 net 05/16] sysctl: Add proc_dointvec_minmax_lockless() Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 07/16] sysctl: Add proc_doulongvec_minmax_lockless() Kuniyuki Iwashima
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel

A sysctl variable is accessed concurrently, and there is always a chance of
data-race.  So, all readers and writers need some basic protection to avoid
load/store-tearing.

This patch changes proc_douintvec_minmax() to use READ_ONCE()/WRITE_ONCE()
internally to fix a data-race on the sysctl side.  For now,
proc_douintvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

In case we miss such fixes, this patch converts proc_douintvec_minmax() to
a wrapper of proc_douintvec_minmax_lockless().  When we fix a data-race in
the other subsystem, we can explicitly set it as a handler.

Also, this patch removes proc_douintvec_minmax()'s document and adds
proc_douintvec_minmax_lockless()'s one so that no one will use
proc_douintvec_minmax() anymore.

Fixes: 61d9b56a8920 ("sysctl: add unsigned int range support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Luis R. Rodriguez <mcgrof@kernel.org>
---
 include/linux/sysctl.h |  1 +
 kernel/sysctl.c        | 17 +++++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 7f91cc625d56..830d1a8f21d4 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -87,6 +87,7 @@ PROC_HANDLER(proc_dobool_lockless);
 PROC_HANDLER(proc_dointvec_lockless);
 PROC_HANDLER(proc_douintvec_lockless);
 PROC_HANDLER(proc_dointvec_minmax_lockless);
+PROC_HANDLER(proc_douintvec_minmax_lockless);
 
 /*
  * Register a set of sysctl names by calling register_sysctl_table
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index aead731ae74b..8ff57b8d1212 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -954,14 +954,15 @@ static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
 		    (param->max && *param->max < tmp))
 			return -ERANGE;
 
-		*valp = tmp;
+		WRITE_ONCE(*valp, tmp);
 	}
 
 	return 0;
 }
 
 /**
- * proc_douintvec_minmax - read a vector of unsigned ints with min/max values
+ * proc_douintvec_minmax_lockless - read/write a vector of unsigned ints
+ * with min/max values locklessly
  * @table: the sysctl table
  * @write: %TRUE if this is a write to the sysctl file
  * @buffer: the user buffer
@@ -979,8 +980,8 @@ static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
  *
  * Returns 0 on success or -ERANGE on write when the range check fails.
  */
-int proc_douintvec_minmax(struct ctl_table *table, int write,
-			  void *buffer, size_t *lenp, loff_t *ppos)
+int proc_douintvec_minmax_lockless(struct ctl_table *table, int write,
+				   void *buffer, size_t *lenp, loff_t *ppos)
 {
 	struct do_proc_douintvec_minmax_conv_param param = {
 		.min = (unsigned int *) table->extra1,
@@ -990,6 +991,12 @@ int proc_douintvec_minmax(struct ctl_table *table, int write,
 				 do_proc_douintvec_minmax_conv, &param);
 }
 
+int proc_douintvec_minmax(struct ctl_table *table, int write,
+			  void *buffer, size_t *lenp, loff_t *ppos)
+{
+	return proc_douintvec_minmax_lockless(table, write, buffer, lenp, ppos);
+}
+
 /**
  * proc_dou8vec_minmax - read a vector of unsigned chars with min/max values
  * @table: the sysctl table
@@ -1532,6 +1539,7 @@ PROC_HANDLER_ENOSYS(proc_dobool_lockless);
 PROC_HANDLER_ENOSYS(proc_dointvec_lockless);
 PROC_HANDLER_ENOSYS(proc_douintvec_lockless);
 PROC_HANDLER_ENOSYS(proc_dointvec_minmax_lockless);
+PROC_HANDLER_ENOSYS(proc_douintvec_minmax_lockless);
 
 #endif /* CONFIG_PROC_SYSCTL */
 
@@ -2446,3 +2454,4 @@ EXPORT_SYMBOL(proc_dobool_lockless);
 EXPORT_SYMBOL(proc_dointvec_lockless);
 EXPORT_SYMBOL(proc_douintvec_lockless);
 EXPORT_SYMBOL(proc_dointvec_minmax_lockless);
+EXPORT_SYMBOL_GPL(proc_douintvec_minmax_lockless);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 07/16] sysctl: Add proc_doulongvec_minmax_lockless().
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
                   ` (5 preceding siblings ...)
  2022-07-06  5:21 ` [PATCH v1 net 06/16] sysctl: Add proc_douintvec_minmax_lockless() Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 08/16] sysctl: Add proc_dointvec_jiffies_lockless() Kuniyuki Iwashima
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel

A sysctl variable is accessed concurrently, and there is always a chance of
data-race.  So, all readers and writers need some basic protection to avoid
load/store-tearing.

This patch changes proc_doulongvec_minmax() to use READ_ONCE()/WRITE_ONCE()
internally to fix a data-race on the sysctl side.  For now,
proc_doulongvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

In case we miss such fixes, this patch converts proc_doulongvec_minmax() to
a wrapper of proc_doulongvec_minmax_lockless().  When we fix a data-race in
the other subsystem, we can explicitly set it as a handler.

Also, this patch removes proc_doulongvec_minmax()'s document and adds
proc_doulongvec_minmax_lockless()'s one so that no one will use
proc_doulongvec_minmax() anymore.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/linux/sysctl.h |  1 +
 kernel/sysctl.c        | 21 +++++++++++++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 830d1a8f21d4..c23b6beef748 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -88,6 +88,7 @@ PROC_HANDLER(proc_dointvec_lockless);
 PROC_HANDLER(proc_douintvec_lockless);
 PROC_HANDLER(proc_dointvec_minmax_lockless);
 PROC_HANDLER(proc_douintvec_minmax_lockless);
+PROC_HANDLER(proc_doulongvec_minmax_lockless);
 
 /*
  * Register a set of sysctl names by calling register_sysctl_table
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8ff57b8d1212..931ab58985f2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1127,9 +1127,11 @@ static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table,
 				err = -EINVAL;
 				break;
 			}
-			*i = val;
+
+			WRITE_ONCE(*i, val);
 		} else {
-			val = convdiv * (*i) / convmul;
+			val = convdiv * READ_ONCE(*i) / convmul;
+
 			if (!first)
 				proc_put_char(&buffer, &left, '\t');
 			proc_put_long(&buffer, &left, val, false);
@@ -1157,7 +1159,8 @@ static int do_proc_doulongvec_minmax(struct ctl_table *table, int write,
 }
 
 /**
- * proc_doulongvec_minmax - read a vector of long integers with min/max values
+ * proc_doulongvec_minmax_lockless - read/write a vector of long integers
+ * with min/max values locklessly
  * @table: the sysctl table
  * @write: %TRUE if this is a write to the sysctl file
  * @buffer: the user buffer
@@ -1172,10 +1175,18 @@ static int do_proc_doulongvec_minmax(struct ctl_table *table, int write,
  *
  * Returns 0 on success.
  */
+int proc_doulongvec_minmax_lockless(struct ctl_table *table, int write,
+				    void *buffer, size_t *lenp, loff_t *ppos)
+{
+	return do_proc_doulongvec_minmax(table, write, buffer, lenp, ppos,
+					 1l, 1l);
+}
+
 int proc_doulongvec_minmax(struct ctl_table *table, int write,
 			   void *buffer, size_t *lenp, loff_t *ppos)
 {
-    return do_proc_doulongvec_minmax(table, write, buffer, lenp, ppos, 1l, 1l);
+	return proc_doulongvec_minmax_lockless(table, write, buffer,
+					       lenp, ppos);
 }
 
 /**
@@ -1540,6 +1551,7 @@ PROC_HANDLER_ENOSYS(proc_dointvec_lockless);
 PROC_HANDLER_ENOSYS(proc_douintvec_lockless);
 PROC_HANDLER_ENOSYS(proc_dointvec_minmax_lockless);
 PROC_HANDLER_ENOSYS(proc_douintvec_minmax_lockless);
+PROC_HANDLER_ENOSYS(proc_doulongvec_minmax_lockless);
 
 #endif /* CONFIG_PROC_SYSCTL */
 
@@ -2455,3 +2467,4 @@ EXPORT_SYMBOL(proc_dointvec_lockless);
 EXPORT_SYMBOL(proc_douintvec_lockless);
 EXPORT_SYMBOL(proc_dointvec_minmax_lockless);
 EXPORT_SYMBOL_GPL(proc_douintvec_minmax_lockless);
+EXPORT_SYMBOL(proc_doulongvec_minmax_lockless);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 08/16] sysctl: Add proc_dointvec_jiffies_lockless().
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
                   ` (6 preceding siblings ...)
  2022-07-06  5:21 ` [PATCH v1 net 07/16] sysctl: Add proc_doulongvec_minmax_lockless() Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 09/16] tcp: Fix a data-race around sysctl_tcp_max_orphans Kuniyuki Iwashima
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel

A sysctl variable is accessed concurrently, and there is always a chance of
data-race.  So, all readers and writers need some basic protection to avoid
load/store-tearing.

This patch changes proc_dointvec_jiffies() to use READ_ONCE()/WRITE_ONCE()
internally to fix a data-race on the sysctl side.  For now,
proc_dointvec_jiffies() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

In case we miss such fixes, this patch converts proc_dointvec_jiffies() to
a wrapper of proc_dointvec_jiffies_lockless().  When we fix a data-race in
the other subsystem, we can explicitly set it as a handler.

Also, this patch removes proc_dointvec_jiffies()'s document and adds
proc_dointvec_jiffies_lockless()'s one so that no one will use
proc_dointvec_jiffies() anymore.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/linux/sysctl.h |  1 +
 kernel/sysctl.c        | 26 ++++++++++++++++++++------
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index c23b6beef748..8747dbc721f5 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -89,6 +89,7 @@ PROC_HANDLER(proc_douintvec_lockless);
 PROC_HANDLER(proc_dointvec_minmax_lockless);
 PROC_HANDLER(proc_douintvec_minmax_lockless);
 PROC_HANDLER(proc_doulongvec_minmax_lockless);
+PROC_HANDLER(proc_dointvec_jiffies_lockless);
 
 /*
  * Register a set of sysctl names by calling register_sysctl_table
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 931ab58985f2..11a1ce837623 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1221,10 +1221,15 @@ static int do_proc_dointvec_jiffies_conv(bool *negp, unsigned long *lvalp,
 	if (write) {
 		if (*lvalp > INT_MAX / HZ)
 			return 1;
-		*valp = *negp ? -(*lvalp*HZ) : (*lvalp*HZ);
+
+		if (*negp)
+			WRITE_ONCE(*valp, -(*lvalp * HZ));
+		else
+			WRITE_ONCE(*valp, *lvalp * HZ);
 	} else {
-		int val = *valp;
+		int val = READ_ONCE(*valp);
 		unsigned long lval;
+
 		if (val < 0) {
 			*negp = true;
 			lval = -(unsigned long)val;
@@ -1286,7 +1291,8 @@ static int do_proc_dointvec_ms_jiffies_conv(bool *negp, unsigned long *lvalp,
 }
 
 /**
- * proc_dointvec_jiffies - read a vector of integers as seconds
+ * proc_dointvec_jiffies_lockless - read/write a vector of integers as
+ * seconds locklessly
  * @table: the sysctl table
  * @write: %TRUE if this is a write to the sysctl file
  * @buffer: the user buffer
@@ -1294,17 +1300,23 @@ static int do_proc_dointvec_ms_jiffies_conv(bool *negp, unsigned long *lvalp,
  * @ppos: file position
  *
  * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
- * values from/to the user buffer, treated as an ASCII string. 
+ * values from/to the user buffer, treated as an ASCII string.
  * The values read are assumed to be in seconds, and are converted into
  * jiffies.
  *
  * Returns 0 on success.
  */
+int proc_dointvec_jiffies_lockless(struct ctl_table *table, int write,
+				   void *buffer, size_t *lenp, loff_t *ppos)
+{
+	return do_proc_dointvec(table, write, buffer, lenp, ppos,
+				do_proc_dointvec_jiffies_conv, NULL);
+}
+
 int proc_dointvec_jiffies(struct ctl_table *table, int write,
 			  void *buffer, size_t *lenp, loff_t *ppos)
 {
-    return do_proc_dointvec(table,write,buffer,lenp,ppos,
-		    	    do_proc_dointvec_jiffies_conv,NULL);
+	return proc_dointvec_jiffies_lockless(table, write, buffer, lenp, ppos);
 }
 
 /**
@@ -1552,6 +1564,7 @@ PROC_HANDLER_ENOSYS(proc_douintvec_lockless);
 PROC_HANDLER_ENOSYS(proc_dointvec_minmax_lockless);
 PROC_HANDLER_ENOSYS(proc_douintvec_minmax_lockless);
 PROC_HANDLER_ENOSYS(proc_doulongvec_minmax_lockless);
+PROC_HANDLER_ENOSYS(proc_dointvec_jiffies_lockless);
 
 #endif /* CONFIG_PROC_SYSCTL */
 
@@ -2468,3 +2481,4 @@ EXPORT_SYMBOL(proc_douintvec_lockless);
 EXPORT_SYMBOL(proc_dointvec_minmax_lockless);
 EXPORT_SYMBOL_GPL(proc_douintvec_minmax_lockless);
 EXPORT_SYMBOL(proc_doulongvec_minmax_lockless);
+EXPORT_SYMBOL(proc_dointvec_jiffies_lockless);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 09/16] tcp: Fix a data-race around sysctl_tcp_max_orphans.
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
                   ` (7 preceding siblings ...)
  2022-07-06  5:21 ` [PATCH v1 net 08/16] sysctl: Add proc_dointvec_jiffies_lockless() Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 10/16] inetpeer: Fix data-races around sysctl Kuniyuki Iwashima
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel

While reading sysctl_tcp_max_orphans, it can be changed concurrently.  So,
we need to add READ_ONCE().  Then we can set proc_dointvec_lockless() as
the handler to mark it safe.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/sysctl_net_ipv4.c | 2 +-
 net/ipv4/tcp.c             | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index cd448cdd3b38..aa5adf136556 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -476,7 +476,7 @@ static struct ctl_table ipv4_table[] = {
 		.data		= &sysctl_tcp_max_orphans,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_dointvec_lockless,
 	},
 	{
 		.procname	= "inet_peer_threshold",
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 028513d3e2a2..2222dfdde316 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2715,7 +2715,8 @@ static void tcp_orphan_update(struct timer_list *unused)
 
 static bool tcp_too_many_orphans(int shift)
 {
-	return READ_ONCE(tcp_orphan_cache) << shift > sysctl_tcp_max_orphans;
+	return READ_ONCE(tcp_orphan_cache) << shift >
+		READ_ONCE(sysctl_tcp_max_orphans);
 }
 
 bool tcp_check_oom(struct sock *sk, int shift)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 10/16] inetpeer: Fix data-races around sysctl.
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
                   ` (8 preceding siblings ...)
  2022-07-06  5:21 ` [PATCH v1 net 09/16] tcp: Fix a data-race around sysctl_tcp_max_orphans Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 11/16] net: Fix a data-race around sysctl_mem Kuniyuki Iwashima
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel

While reading sysctl variables, they can be changed concurrently.  So, we
need to add READ_ONCE().  Then we can set lockless variants as the handler
to mark them safe.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/inetpeer.c        | 13 ++++++++-----
 net/ipv4/sysctl_net_ipv4.c |  6 +++---
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index da21dfce24d7..69e1342a9068 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -141,16 +141,19 @@ static void inet_peer_gc(struct inet_peer_base *base,
 			 struct inet_peer *gc_stack[],
 			 unsigned int gc_cnt)
 {
+	int i, peer_threshold, peer_maxttl, peer_minttl;
 	struct inet_peer *p;
 	__u32 delta, ttl;
-	int i;
 
-	if (base->total >= inet_peer_threshold)
+	peer_threshold = READ_ONCE(inet_peer_threshold);
+	peer_maxttl = READ_ONCE(inet_peer_maxttl);
+	peer_minttl = READ_ONCE(inet_peer_minttl);
+
+	if (base->total >= peer_threshold)
 		ttl = 0; /* be aggressive */
 	else
-		ttl = inet_peer_maxttl
-				- (inet_peer_maxttl - inet_peer_minttl) / HZ *
-					base->total / inet_peer_threshold * HZ;
+		ttl = peer_maxttl - (peer_maxttl - peer_minttl) / HZ *
+			base->total / peer_threshold * HZ;
 	for (i = 0; i < gc_cnt; i++) {
 		p = gc_stack[i];
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index aa5adf136556..eea11218a663 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -483,21 +483,21 @@ static struct ctl_table ipv4_table[] = {
 		.data		= &inet_peer_threshold,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_dointvec_lockless,
 	},
 	{
 		.procname	= "inet_peer_minttl",
 		.data		= &inet_peer_minttl,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_jiffies,
+		.proc_handler	= proc_dointvec_jiffies_lockless,
 	},
 	{
 		.procname	= "inet_peer_maxttl",
 		.data		= &inet_peer_maxttl,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_jiffies,
+		.proc_handler	= proc_dointvec_jiffies_lockless,
 	},
 	{
 		.procname	= "tcp_mem",
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 11/16] net: Fix a data-race around sysctl_mem.
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
                   ` (9 preceding siblings ...)
  2022-07-06  5:21 ` [PATCH v1 net 10/16] inetpeer: Fix data-races around sysctl Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06 13:17   ` Steven Rostedt
  2022-07-06  5:21 ` [PATCH v1 net 12/16] tcp: Mark sysctl_tcp_low_latency obsolete Kuniyuki Iwashima
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel,
	Satoru Moriya, Steven Rostedt

While reading .sysctl_mem, it can be changed concurrently.  So, we need to
add READ_ONCE().  Then we can set proc_doulongvec_minmax_lockless() as the
handler to mark it safe.

Fixes: 3847ce32aea9 ("core: add tracepoints for queueing skb to rcvbuf")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Satoru Moriya <satoru.moriya@hds.com>
CC: Steven Rostedt <rostedt@goodmis.org>
---
 include/net/sock.h             | 2 +-
 include/trace/events/sock.h    | 6 +++---
 net/decnet/sysctl_net_decnet.c | 2 +-
 net/ipv4/sysctl_net_ipv4.c     | 4 ++--
 net/sctp/sysctl.c              | 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 72ca97ccb460..9fa54762e077 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1529,7 +1529,7 @@ void __sk_mem_reclaim(struct sock *sk, int amount);
 /* sysctl_mem values are in pages, we convert them in SK_MEM_QUANTUM units */
 static inline long sk_prot_mem_limits(const struct sock *sk, int index)
 {
-	long val = sk->sk_prot->sysctl_mem[index];
+	long val = READ_ONCE(sk->sk_prot->sysctl_mem[index]);
 
 #if PAGE_SIZE > SK_MEM_QUANTUM
 	val <<= PAGE_SHIFT - SK_MEM_QUANTUM_SHIFT;
diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index 12c315782766..3c36c2812782 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -122,9 +122,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
 
 	TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld sysctl_rmem=%d rmem_alloc=%d sysctl_wmem=%d wmem_alloc=%d wmem_queued=%d kind=%s",
 		__entry->name,
-		__entry->sysctl_mem[0],
-		__entry->sysctl_mem[1],
-		__entry->sysctl_mem[2],
+		READ_ONCE(__entry->sysctl_mem[0]),
+		READ_ONCE(__entry->sysctl_mem[1]),
+		READ_ONCE(__entry->sysctl_mem[2]),
 		__entry->allocated,
 		__entry->sysctl_rmem,
 		__entry->rmem_alloc,
diff --git a/net/decnet/sysctl_net_decnet.c b/net/decnet/sysctl_net_decnet.c
index 67b5ab2657b7..e7e658f1ba67 100644
--- a/net/decnet/sysctl_net_decnet.c
+++ b/net/decnet/sysctl_net_decnet.c
@@ -315,7 +315,7 @@ static struct ctl_table dn_table[] = {
 		.data = &sysctl_decnet_mem,
 		.maxlen = sizeof(sysctl_decnet_mem),
 		.mode = 0644,
-		.proc_handler = proc_doulongvec_minmax
+		.proc_handler = proc_doulongvec_minmax_lockless,
 	},
 	{
 		.procname = "decnet_rmem",
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index eea11218a663..b14931ca5c85 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -504,7 +504,7 @@ static struct ctl_table ipv4_table[] = {
 		.maxlen		= sizeof(sysctl_tcp_mem),
 		.data		= &sysctl_tcp_mem,
 		.mode		= 0644,
-		.proc_handler	= proc_doulongvec_minmax,
+		.proc_handler	= proc_doulongvec_minmax_lockless,
 	},
 	{
 		.procname	= "tcp_low_latency",
@@ -570,7 +570,7 @@ static struct ctl_table ipv4_table[] = {
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
 		.mode		= 0644,
-		.proc_handler	= proc_doulongvec_minmax,
+		.proc_handler	= proc_doulongvec_minmax_lockless,
 	},
 	{
 		.procname	= "fib_sync_mem",
diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
index b46a416787ec..fa79bf4059d1 100644
--- a/net/sctp/sysctl.c
+++ b/net/sctp/sysctl.c
@@ -64,7 +64,7 @@ static struct ctl_table sctp_table[] = {
 		.data		= &sysctl_sctp_mem,
 		.maxlen		= sizeof(sysctl_sctp_mem),
 		.mode		= 0644,
-		.proc_handler	= proc_doulongvec_minmax
+		.proc_handler	= proc_doulongvec_minmax_lockless,
 	},
 	{
 		.procname	= "sctp_rmem",
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 12/16] tcp: Mark sysctl_tcp_low_latency obsolete.
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
                   ` (10 preceding siblings ...)
  2022-07-06  5:21 ` [PATCH v1 net 11/16] net: Fix a data-race around sysctl_mem Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 13/16] cipso: Fix a data-race around cipso_v4_cache_bucketsize Kuniyuki Iwashima
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel

Since commit b2fb4f54ecd4 ("tcp: uninline tcp_prequeue()"),
sysctl_tcp_low_latency is no longer used.  However, to mark
it safe and finally remove proc_dointvec(), this patch changes
handler to a lockless variant.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/sysctl_net_ipv4.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index b14931ca5c85..0287d55f9230 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -507,11 +507,12 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_doulongvec_minmax_lockless,
 	},
 	{
+		/* obsolete */
 		.procname	= "tcp_low_latency",
 		.data		= &sysctl_tcp_low_latency,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_dointvec_lockless,
 	},
 #ifdef CONFIG_NETLABEL
 	{
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 13/16] cipso: Fix a data-race around cipso_v4_cache_bucketsize.
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
                   ` (11 preceding siblings ...)
  2022-07-06  5:21 ` [PATCH v1 net 12/16] tcp: Mark sysctl_tcp_low_latency obsolete Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 14/16] cipso: Fix data-races around boolean sysctl Kuniyuki Iwashima
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel, Paul Moore

While reading cipso_v4_cache_bucketsize, it can be changed concurrently.
So, we need to add READ_ONCE().  Then we can set proc_dointvec_lockless()
as the handler to mark it safe.

Fixes: 446fda4f2682 ("[NetLabel]: CIPSOv4 engine")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Paul Moore <paul.moore@hp.com>
---
 Documentation/networking/ip-sysctl.rst |  2 +-
 net/ipv4/cipso_ipv4.c                  | 12 ++++++------
 net/ipv4/sysctl_net_ipv4.c             |  2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 9f41961d11d5..0e58001f8580 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1085,7 +1085,7 @@ cipso_cache_enable - BOOLEAN
 cipso_cache_bucket_size - INTEGER
 	The CIPSO label cache consists of a fixed size hash table with each
 	hash bucket containing a number of cache entries.  This variable limits
-	the number of entries in each hash bucket; the larger the value the
+	the number of entries in each hash bucket; the larger the value is, the
 	more CIPSO label mappings that can be cached.  When the number of
 	entries in a given hash bucket reaches this limit adding new entries
 	causes the oldest entry in the bucket to be removed to make room.
diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index 62d5f99760aa..0600e9b06e1a 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -296,13 +296,13 @@ static int cipso_v4_cache_check(const unsigned char *key,
 int cipso_v4_cache_add(const unsigned char *cipso_ptr,
 		       const struct netlbl_lsm_secattr *secattr)
 {
-	int ret_val = -EPERM;
-	u32 bkt;
-	struct cipso_v4_map_cache_entry *entry = NULL;
+	int bkt_size = READ_ONCE(cipso_v4_cache_bucketsize);
 	struct cipso_v4_map_cache_entry *old_entry = NULL;
-	u32 cipso_ptr_len;
+	struct cipso_v4_map_cache_entry *entry = NULL;
+	u32 bkt, cipso_ptr_len;
+	int ret_val = -EPERM;
 
-	if (!cipso_v4_cache_enabled || cipso_v4_cache_bucketsize <= 0)
+	if (!cipso_v4_cache_enabled || bkt_size <= 0)
 		return 0;
 
 	cipso_ptr_len = cipso_ptr[1];
@@ -322,7 +322,7 @@ int cipso_v4_cache_add(const unsigned char *cipso_ptr,
 
 	bkt = entry->hash & (CIPSO_V4_CACHE_BUCKETS - 1);
 	spin_lock_bh(&cipso_v4_cache[bkt].lock);
-	if (cipso_v4_cache[bkt].size < cipso_v4_cache_bucketsize) {
+	if (cipso_v4_cache[bkt].size < bkt_size) {
 		list_add(&entry->list, &cipso_v4_cache[bkt].list);
 		cipso_v4_cache[bkt].size += 1;
 	} else {
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 0287d55f9230..5cd32b33bbac 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -527,7 +527,7 @@ static struct ctl_table ipv4_table[] = {
 		.data		= &cipso_v4_cache_bucketsize,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_dointvec_lockless,
 	},
 	{
 		.procname	= "cipso_rbm_optfmt",
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 14/16] cipso: Fix data-races around boolean sysctl.
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
                   ` (12 preceding siblings ...)
  2022-07-06  5:21 ` [PATCH v1 net 13/16] cipso: Fix a data-race around cipso_v4_cache_bucketsize Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 15/16] icmp: Fix data-races around sysctl Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 16/16] ipv4: Fix a data-race around sysctl_fib_sync_mem Kuniyuki Iwashima
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel, Paul Moore

While reading sysctl variables, they can be changed concurrently.  So,
we need to add READ_ONCE().  Then we can set lockless variants as the
handler to mark it safe.

Also, these sysctl knob are boolean, so this patch changes their int
handler to boolean one.

  - cipso_v4_cache_enabled
  - cipso_v4_rbm_optfmt
  - cipso_v4_rbm_strictvalid

Fixes: 446fda4f2682 ("[NetLabel]: CIPSOv4 engine")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Paul Moore <paul.moore@hp.com>
---
 net/ipv4/cipso_ipv4.c      | 9 +++++----
 net/ipv4/sysctl_net_ipv4.c | 6 +++---
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index 0600e9b06e1a..2756170e470f 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -239,7 +239,7 @@ static int cipso_v4_cache_check(const unsigned char *key,
 	struct cipso_v4_map_cache_entry *prev_entry = NULL;
 	u32 hash;
 
-	if (!cipso_v4_cache_enabled)
+	if (!READ_ONCE(cipso_v4_cache_enabled))
 		return -ENOENT;
 
 	hash = cipso_v4_map_cache_hash(key, key_len);
@@ -302,7 +302,7 @@ int cipso_v4_cache_add(const unsigned char *cipso_ptr,
 	u32 bkt, cipso_ptr_len;
 	int ret_val = -EPERM;
 
-	if (!cipso_v4_cache_enabled || bkt_size <= 0)
+	if (!READ_ONCE(cipso_v4_cache_enabled) || bkt_size <= 0)
 		return 0;
 
 	cipso_ptr_len = cipso_ptr[1];
@@ -1199,7 +1199,8 @@ static int cipso_v4_gentag_rbm(const struct cipso_v4_doi *doi_def,
 		/* This will send packets using the "optimized" format when
 		 * possible as specified in  section 3.4.2.6 of the
 		 * CIPSO draft. */
-		if (cipso_v4_rbm_optfmt && ret_val > 0 && ret_val <= 10)
+		if (READ_ONCE(cipso_v4_rbm_optfmt) && ret_val > 0 &&
+		    ret_val <= 10)
 			tag_len = 14;
 		else
 			tag_len = 4 + ret_val;
@@ -1603,7 +1604,7 @@ int cipso_v4_validate(const struct sk_buff *skb, unsigned char **option)
 			 * all the CIPSO validations here but it doesn't
 			 * really specify _exactly_ what we need to validate
 			 * ... so, just make it a sysctl tunable. */
-			if (cipso_v4_rbm_strictvalid) {
+			if (READ_ONCE(cipso_v4_rbm_strictvalid)) {
 				if (cipso_v4_map_lvl_valid(doi_def,
 							   tag[3]) < 0) {
 					err_offset = opt_iter + 3;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 5cd32b33bbac..706795a3b369 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -520,7 +520,7 @@ static struct ctl_table ipv4_table[] = {
 		.data		= &cipso_v4_cache_enabled,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_dobool_lockless,
 	},
 	{
 		.procname	= "cipso_cache_bucket_size",
@@ -534,14 +534,14 @@ static struct ctl_table ipv4_table[] = {
 		.data		= &cipso_v4_rbm_optfmt,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_dobool_lockless,
 	},
 	{
 		.procname	= "cipso_rbm_strictvalid",
 		.data		= &cipso_v4_rbm_strictvalid,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
+		.proc_handler	= proc_dobool_lockless,
 	},
 #endif /* CONFIG_NETLABEL */
 	{
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 15/16] icmp: Fix data-races around sysctl.
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
                   ` (13 preceding siblings ...)
  2022-07-06  5:21 ` [PATCH v1 net 14/16] cipso: Fix data-races around boolean sysctl Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  2022-07-06  5:21 ` [PATCH v1 net 16/16] ipv4: Fix a data-race around sysctl_fib_sync_mem Kuniyuki Iwashima
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel

While reading sysctl variables, it can be changed concurrently.  So, we
need to add READ_ONCE().  Then we can set proc_dointvec_minmax_lockless()
as the handler to mark it safe.

Fixes: 4cdf507d5452 ("icmp: add a global rate limitation")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/icmp.c            | 5 +++--
 net/ipv4/sysctl_net_ipv4.c | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index efea0e796f06..0f9e61d29f73 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -253,11 +253,12 @@ bool icmp_global_allow(void)
 	spin_lock(&icmp_global.lock);
 	delta = min_t(u32, now - icmp_global.stamp, HZ);
 	if (delta >= HZ / 50) {
-		incr = sysctl_icmp_msgs_per_sec * delta / HZ ;
+		incr = READ_ONCE(sysctl_icmp_msgs_per_sec) * delta / HZ;
 		if (incr)
 			WRITE_ONCE(icmp_global.stamp, now);
 	}
-	credit = min_t(u32, icmp_global.credit + incr, sysctl_icmp_msgs_burst);
+	credit = min_t(u32, icmp_global.credit + incr,
+		       READ_ONCE(sysctl_icmp_msgs_burst));
 	if (credit) {
 		/* We want to use a credit of one in average, but need to randomize
 		 * it for security reasons.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 706795a3b369..3b1d18be0857 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -555,7 +555,7 @@ static struct ctl_table ipv4_table[] = {
 		.data		= &sysctl_icmp_msgs_per_sec,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= proc_dointvec_minmax_lockless,
 		.extra1		= SYSCTL_ZERO,
 	},
 	{
@@ -563,7 +563,7 @@ static struct ctl_table ipv4_table[] = {
 		.data		= &sysctl_icmp_msgs_burst,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= proc_dointvec_minmax_lockless,
 		.extra1		= SYSCTL_ZERO,
 	},
 	{
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v1 net 16/16] ipv4: Fix a data-race around sysctl_fib_sync_mem.
  2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
                   ` (14 preceding siblings ...)
  2022-07-06  5:21 ` [PATCH v1 net 15/16] icmp: Fix data-races around sysctl Kuniyuki Iwashima
@ 2022-07-06  5:21 ` Kuniyuki Iwashima
  15 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06  5:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, linux-kernel, David Ahern

While reading sysctl_fib_sync_mem, it can be changed concurrently.  So, we
need to add READ_ONCE().  Then we can set proc_douintvec_minmax_lockless()
as the handler to mark it safe.

Fixes: 9ab948a91b2c ("ipv4: Allow amount of dirty memory from fib resizing to be controllable")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: David Ahern <dsahern@gmail.com>
---
 net/ipv4/fib_trie.c        | 2 +-
 net/ipv4/sysctl_net_ipv4.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 2734c3af7e24..46e8a5125853 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -498,7 +498,7 @@ static void tnode_free(struct key_vector *tn)
 		tn = container_of(head, struct tnode, rcu)->kv;
 	}
 
-	if (tnode_free_size >= sysctl_fib_sync_mem) {
+	if (tnode_free_size >= READ_ONCE(sysctl_fib_sync_mem)) {
 		tnode_free_size = 0;
 		synchronize_rcu();
 	}
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 3b1d18be0857..7ea681df37c4 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -578,7 +578,7 @@ static struct ctl_table ipv4_table[] = {
 		.data		= &sysctl_fib_sync_mem,
 		.maxlen		= sizeof(sysctl_fib_sync_mem),
 		.mode		= 0644,
-		.proc_handler	= proc_douintvec_minmax,
+		.proc_handler	= proc_douintvec_minmax_lockless,
 		.extra1		= &sysctl_fib_sync_mem_min,
 		.extra2		= &sysctl_fib_sync_mem_max,
 	},
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 net 03/16] sysctl: Add proc_dointvec_lockless().
  2022-07-06  5:21 ` [PATCH v1 net 03/16] sysctl: Add proc_dointvec_lockless() Kuniyuki Iwashima
@ 2022-07-06  7:00   ` Eric Dumazet
  2022-07-06 16:15     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2022-07-06  7:00 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Kuniyuki Iwashima, netdev, LKML

On Wed, Jul 6, 2022 at 7:22 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> A sysctl variable is accessed concurrently, and there is always a chance of
> data-race.  So, all readers and writers need some basic protection to avoid
> load/store-tearing.
>
> This patch changes proc_dointvec() to use READ_ONCE()/WRITE_ONCE()
> internally to fix a data-race on the sysctl side.  For now, proc_dointvec()
> itself is tolerant to a data-race, but we still need to add annotations on
> the other subsystem's side.
>
> In case we miss such fixes, this patch converts proc_dointvec() to a
> wrapper of proc_dointvec_lockless().  When we fix a data-race in the other
> subsystem, we can explicitly set it as a handler.
>
> Also, this patch removes proc_dointvec()'s document and adds
> proc_dointvec_lockless()'s one so that no one will use proc_dointvec()
> anymore.
>
> While we are on it, we remove some trailing spaces.


I do not see why you add more functions.

Really all sysctls can change locklessly by nature, as I pointed out.

So I would simply add WRITE_ONCE() whenever they are written, and
READ_ONCE() when they are read.

If stable teams care enough, they will have to backport these changes,
so I would rather not have to change
proc_dointvec() to proc_dointvec_lockless() in many files, with many
conflicts, that ultimately will either
add bugs, or ask extra work for maintainers.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 net 11/16] net: Fix a data-race around sysctl_mem.
  2022-07-06  5:21 ` [PATCH v1 net 11/16] net: Fix a data-race around sysctl_mem Kuniyuki Iwashima
@ 2022-07-06 13:17   ` Steven Rostedt
  2022-07-06 13:27     ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2022-07-06 13:17 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Kuniyuki Iwashima,
	netdev, linux-kernel, Satoru Moriya

On Tue, 5 Jul 2022 22:21:25 -0700
Kuniyuki Iwashima <kuniyu@amazon.com> wrote:

> --- a/include/trace/events/sock.h
> +++ b/include/trace/events/sock.h
> @@ -122,9 +122,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
>  
>  	TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld sysctl_rmem=%d rmem_alloc=%d sysctl_wmem=%d wmem_alloc=%d wmem_queued=%d kind=%s",
>  		__entry->name,
> -		__entry->sysctl_mem[0],
> -		__entry->sysctl_mem[1],
> -		__entry->sysctl_mem[2],
> +		READ_ONCE(__entry->sysctl_mem[0]),
> +		READ_ONCE(__entry->sysctl_mem[1]),
> +		READ_ONCE(__entry->sysctl_mem[2]),

This is not reading anything to do with sysctl. It's reading the content of
what was recorded in the ring buffer.

That is, the READ_ONCE() here is not necessary, and if anything will break
user space parsing, as this is exported to user space to tell it how to
read the binary format in the ring buffer.

-- Steve


>  		__entry->allocated,
>  		__entry->sysctl_rmem,
>  		__entry->rmem_alloc,

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 net 11/16] net: Fix a data-race around sysctl_mem.
  2022-07-06 13:17   ` Steven Rostedt
@ 2022-07-06 13:27     ` Steven Rostedt
  2022-07-06 16:27       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2022-07-06 13:27 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Kuniyuki Iwashima,
	netdev, linux-kernel, Satoru Moriya

On Wed, 6 Jul 2022 09:17:07 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 5 Jul 2022 22:21:25 -0700
> Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> 
> > --- a/include/trace/events/sock.h
> > +++ b/include/trace/events/sock.h
> > @@ -122,9 +122,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
> >  
> >  	TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld sysctl_rmem=%d rmem_alloc=%d sysctl_wmem=%d wmem_alloc=%d wmem_queued=%d kind=%s",
> >  		__entry->name,
> > -		__entry->sysctl_mem[0],
> > -		__entry->sysctl_mem[1],
> > -		__entry->sysctl_mem[2],
> > +		READ_ONCE(__entry->sysctl_mem[0]),
> > +		READ_ONCE(__entry->sysctl_mem[1]),
> > +		READ_ONCE(__entry->sysctl_mem[2]),  
> 
> This is not reading anything to do with sysctl. It's reading the content of
> what was recorded in the ring buffer.
> 
> That is, the READ_ONCE() here is not necessary, and if anything will break
> user space parsing, as this is exported to user space to tell it how to
> read the binary format in the ring buffer.

I take that back. Looking at the actual trace event, it is pointing to
sysctl memory, which is a major bug.

TRACE_EVENT(sock_exceed_buf_limit,

        TP_PROTO(struct sock *sk, struct proto *prot, long allocated, int kind),

        TP_ARGS(sk, prot, allocated, kind),

        TP_STRUCT__entry(
                __array(char, name, 32)
                __field(long *, sysctl_mem)

sysctl_mem is a pointer.

                __field(long, allocated)
                __field(int, sysctl_rmem)
                __field(int, rmem_alloc)
                __field(int, sysctl_wmem)
                __field(int, wmem_alloc)
                __field(int, wmem_queued)
                __field(int, kind)
        ),

        TP_fast_assign(
                strncpy(__entry->name, prot->name, 32);

                __entry->sysctl_mem = prot->sysctl_mem;


They save the pointer **IN THE RING BUFFER**!!!

                __entry->allocated = allocated;
                __entry->sysctl_rmem = sk_get_rmem0(sk, prot);
                __entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
                __entry->sysctl_wmem = sk_get_wmem0(sk, prot);
                __entry->wmem_alloc = refcount_read(&sk->sk_wmem_alloc);
                __entry->wmem_queued = READ_ONCE(sk->sk_wmem_queued);
                __entry->kind = kind;
        ),

        TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld sysctl_rmem=%d rmem_alloc=%d sysctl_wmem=%d wmem_alloc=%d wmem_queued=%d kind=%s",
                __entry->name,
                __entry->sysctl_mem[0],
                __entry->sysctl_mem[1],
                __entry->sysctl_mem[2],

They are now reading a stale pointer, which can be read at any time. That
is, you get the information of what is in sysctl_mem at the time the ring
buffer is read (which is useless from user space), and not at the time of
the event.

Thanks for pointing this out. This needs to be fixed.

-- Steve


                __entry->allocated,
                __entry->sysctl_rmem,
                __entry->rmem_alloc,
                __entry->sysctl_wmem,
                __entry->wmem_alloc,
                __entry->wmem_queued,
                show_skmem_kind_names(__entry->kind)
        )

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 net 03/16] sysctl: Add proc_dointvec_lockless().
  2022-07-06  7:00   ` Eric Dumazet
@ 2022-07-06 16:15     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06 16:15 UTC (permalink / raw)
  To: edumazet
  Cc: davem, keescook, kuba, kuni1840, kuniyu, linux-kernel, mcgrof,
	netdev, pabeni, yzaikin

From:   Eric Dumazet <edumazet@google.com>
Date:   Wed, 6 Jul 2022 09:00:11 +0200
> On Wed, Jul 6, 2022 at 7:22 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > A sysctl variable is accessed concurrently, and there is always a chance of
> > data-race.  So, all readers and writers need some basic protection to avoid
> > load/store-tearing.
> >
> > This patch changes proc_dointvec() to use READ_ONCE()/WRITE_ONCE()
> > internally to fix a data-race on the sysctl side.  For now, proc_dointvec()
> > itself is tolerant to a data-race, but we still need to add annotations on
> > the other subsystem's side.
> >
> > In case we miss such fixes, this patch converts proc_dointvec() to a
> > wrapper of proc_dointvec_lockless().  When we fix a data-race in the other
> > subsystem, we can explicitly set it as a handler.
> >
> > Also, this patch removes proc_dointvec()'s document and adds
> > proc_dointvec_lockless()'s one so that no one will use proc_dointvec()
> > anymore.
> >
> > While we are on it, we remove some trailing spaces.
> 
> 
> I do not see why you add more functions.

It was not to miss where we still need fixes and to be taken care of
by newly added sysctl knob.


> Really all sysctls can change locklessly by nature, as I pointed out.
> 
> So I would simply add WRITE_ONCE() whenever they are written, and
> READ_ONCE() when they are read.
> 
> If stable teams care enough, they will have to backport these changes,
> so I would rather not have to change
> proc_dointvec() to proc_dointvec_lockless() in many files, with many
> conflicts, that ultimately will either
> add bugs, or ask extra work for maintainers.

Indeed, I will drop such changes and just add annotations in *_conv().
Thank you!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 net 11/16] net: Fix a data-race around sysctl_mem.
  2022-07-06 13:27     ` Steven Rostedt
@ 2022-07-06 16:27       ` Kuniyuki Iwashima
  0 siblings, 0 replies; 22+ messages in thread
From: Kuniyuki Iwashima @ 2022-07-06 16:27 UTC (permalink / raw)
  To: rostedt
  Cc: davem, edumazet, keescook, kuba, kuni1840, kuniyu, linux-kernel,
	mcgrof, netdev, pabeni, satoru.moriya, yzaikin

From:   Steven Rostedt <rostedt@goodmis.org>
Date:   Wed, 6 Jul 2022 09:27:11 -0400
> On Wed, 6 Jul 2022 09:17:07 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Tue, 5 Jul 2022 22:21:25 -0700
> > Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > 
> > > --- a/include/trace/events/sock.h
> > > +++ b/include/trace/events/sock.h
> > > @@ -122,9 +122,9 @@ TRACE_EVENT(sock_exceed_buf_limit,
> > >  
> > >  	TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld sysctl_rmem=%d rmem_alloc=%d sysctl_wmem=%d wmem_alloc=%d wmem_queued=%d kind=%s",
> > >  		__entry->name,
> > > -		__entry->sysctl_mem[0],
> > > -		__entry->sysctl_mem[1],
> > > -		__entry->sysctl_mem[2],
> > > +		READ_ONCE(__entry->sysctl_mem[0]),
> > > +		READ_ONCE(__entry->sysctl_mem[1]),
> > > +		READ_ONCE(__entry->sysctl_mem[2]),  
> > 
> > This is not reading anything to do with sysctl. It's reading the content of
> > what was recorded in the ring buffer.
> > 
> > That is, the READ_ONCE() here is not necessary, and if anything will break
> > user space parsing, as this is exported to user space to tell it how to
> > read the binary format in the ring buffer.
> 
> I take that back. Looking at the actual trace event, it is pointing to
> sysctl memory, which is a major bug.
> 
> TRACE_EVENT(sock_exceed_buf_limit,
> 
>         TP_PROTO(struct sock *sk, struct proto *prot, long allocated, int kind),
> 
>         TP_ARGS(sk, prot, allocated, kind),
> 
>         TP_STRUCT__entry(
>                 __array(char, name, 32)
>                 __field(long *, sysctl_mem)
> 
> sysctl_mem is a pointer.
> 
>                 __field(long, allocated)
>                 __field(int, sysctl_rmem)
>                 __field(int, rmem_alloc)
>                 __field(int, sysctl_wmem)
>                 __field(int, wmem_alloc)
>                 __field(int, wmem_queued)
>                 __field(int, kind)
>         ),
> 
>         TP_fast_assign(
>                 strncpy(__entry->name, prot->name, 32);
> 
>                 __entry->sysctl_mem = prot->sysctl_mem;
> 
> 
> They save the pointer **IN THE RING BUFFER**!!!
> 
>                 __entry->allocated = allocated;
>                 __entry->sysctl_rmem = sk_get_rmem0(sk, prot);
>                 __entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
>                 __entry->sysctl_wmem = sk_get_wmem0(sk, prot);
>                 __entry->wmem_alloc = refcount_read(&sk->sk_wmem_alloc);
>                 __entry->wmem_queued = READ_ONCE(sk->sk_wmem_queued);
>                 __entry->kind = kind;
>         ),
> 
>         TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld sysctl_rmem=%d rmem_alloc=%d sysctl_wmem=%d wmem_alloc=%d wmem_queued=%d kind=%s",
>                 __entry->name,
>                 __entry->sysctl_mem[0],
>                 __entry->sysctl_mem[1],
>                 __entry->sysctl_mem[2],
> 
> They are now reading a stale pointer, which can be read at any time. That
> is, you get the information of what is in sysctl_mem at the time the ring
> buffer is read (which is useless from user space), and not at the time of
> the event.
> 
> Thanks for pointing this out. This needs to be fixed.

For the record, Steve fixed this properly here, so I'll drop the tracing
part in v2.
https://lore.kernel.org/netdev/20220706105040.54fc03b0@gandalf.local.home/

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-07-06 16:28 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-06  5:21 [PATCH v1 net 00/16] sysctl: Fix data-races around ipv4_table Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 01/16] sysctl: Clean up proc_handler definitions Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 02/16] sysctl: Add proc_dobool_lockless() Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 03/16] sysctl: Add proc_dointvec_lockless() Kuniyuki Iwashima
2022-07-06  7:00   ` Eric Dumazet
2022-07-06 16:15     ` Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 04/16] sysctl: Add proc_douintvec_lockless() Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 05/16] sysctl: Add proc_dointvec_minmax_lockless() Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 06/16] sysctl: Add proc_douintvec_minmax_lockless() Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 07/16] sysctl: Add proc_doulongvec_minmax_lockless() Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 08/16] sysctl: Add proc_dointvec_jiffies_lockless() Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 09/16] tcp: Fix a data-race around sysctl_tcp_max_orphans Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 10/16] inetpeer: Fix data-races around sysctl Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 11/16] net: Fix a data-race around sysctl_mem Kuniyuki Iwashima
2022-07-06 13:17   ` Steven Rostedt
2022-07-06 13:27     ` Steven Rostedt
2022-07-06 16:27       ` Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 12/16] tcp: Mark sysctl_tcp_low_latency obsolete Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 13/16] cipso: Fix a data-race around cipso_v4_cache_bucketsize Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 14/16] cipso: Fix data-races around boolean sysctl Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 15/16] icmp: Fix data-races around sysctl Kuniyuki Iwashima
2022-07-06  5:21 ` [PATCH v1 net 16/16] ipv4: Fix a data-race around sysctl_fib_sync_mem Kuniyuki Iwashima

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).