* [PATCH v1 net 0/2] raw/ping: Fix locking in /proc/net/{raw,icmp}.
@ 2023-04-03 19:49 Kuniyuki Iwashima
  2023-04-03 19:49 ` [PATCH v1 net 1/2] raw: Fix NULL deref in raw_get_next() Kuniyuki Iwashima
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Kuniyuki Iwashima @ 2023-04-03 19:49 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

The first patch fixes a NULL deref for /proc/net/raw, and the second
one fixes the same issue for ping sockets (/proc/net/icmp).

The first patch also converts hlist_nulls to hlist. This is because
the current code uses sk_nulls_for_each() for lockless readers instead
of sk_nulls_for_each_rcu(), which adds a memory barrier, and raw
sockets do not use the nulls marker nor SLAB_TYPESAFE_BY_RCU in the
first place.

OTOH, ping sockets already use sk_nulls_for_each_rcu(), so such a
conversion can be posted later for net-next.


Kuniyuki Iwashima (2):
  raw: Fix NULL deref in raw_get_next().
  ping: Fix potential NULL deref for /proc/net/icmp.

 include/net/raw.h   |  4 ++--
 net/ipv4/ping.c     |  8 ++++----
 net/ipv4/raw.c      | 36 +++++++++++++++++++-----------------
 net/ipv4/raw_diag.c | 10 ++++------
 net/ipv6/raw.c      | 10 ++++------
 5 files changed, 33 insertions(+), 35 deletions(-)

-- 
2.30.2



* [PATCH v1 net 1/2] raw: Fix NULL deref in raw_get_next().
  2023-04-03 19:49 [PATCH v1 net 0/2] raw/ping: Fix locking in /proc/net/{raw,icmp} Kuniyuki Iwashima
@ 2023-04-03 19:49 ` Kuniyuki Iwashima
  2023-04-04  1:46   ` Eric Dumazet
  2023-04-04  2:46   ` Jason Xing
  2023-04-03 19:49 ` [PATCH v1 net 2/2] ping: Fix potential NULL deref for /proc/net/icmp Kuniyuki Iwashima
  2023-04-05  2:00 ` [PATCH v1 net 0/2] raw/ping: Fix locking in /proc/net/{raw,icmp} patchwork-bot+netdevbpf
  2 siblings, 2 replies; 11+ messages in thread
From: Kuniyuki Iwashima @ 2023-04-03 19:49 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, syzbot, Dae R . Jeong

Dae R. Jeong reported a NULL deref in raw_get_next() [0].

It seems that the repro was running these sequences in parallel so
that one thread was iterating on a socket that was being freed in
another netns.

  unshare(0x40060200)
  r0 = syz_open_procfs(0x0, &(0x7f0000002080)='net/raw\x00')
  socket$inet_icmp_raw(0x2, 0x3, 0x1)
  pread64(r0, &(0x7f0000000000)=""/10, 0xa, 0x10000000007f)

After commit 0daf07e52709 ("raw: convert raw sockets to RCU"), we
use RCU and hlist_nulls_for_each_entry() to iterate over SOCK_RAW
sockets.  However, we should use a spinlock for the slow paths to
avoid the NULL deref.

Also, SOCK_RAW does not use SLAB_TYPESAFE_BY_RCU, and the slab object
is not reused during iteration in the grace period.  In fact, the
lockless readers do not check the nulls marker with get_nulls_value().
So, SOCK_RAW should use hlist instead of hlist_nulls.
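
For reference, a reader of an SLAB_TYPESAFE_BY_RCU slab normally has
to re-check the nulls marker and restart the walk in case the object
was freed and recycled onto another chain, roughly like this
(illustrative sketch only, not code in this patch):

  begin:
	sk_nulls_for_each_rcu(sk, hnode, hlist) {
		if (raw_v4_match(net, sk, ...))
			goto found;
	}
	/* A recycled socket could have led us onto another chain;
	 * the nulls value tells us which bucket the chain ends in,
	 * so restart if it is not ours.
	 */
	if (get_nulls_value(hnode) != hash)
		goto begin;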

Instead of adding an unnecessary barrier via sk_nulls_for_each_rcu(),
let's convert hlist_nulls to hlist and use sk_for_each_rcu() for the
fast paths, and sk_for_each() with the spinlock for /proc/net/raw.

[0]:
general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
CPU: 2 PID: 20952 Comm: syz-executor.0 Not tainted 6.2.0-g048ec869bafd-dirty #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:read_pnet include/net/net_namespace.h:383 [inline]
RIP: 0010:sock_net include/net/sock.h:649 [inline]
RIP: 0010:raw_get_next net/ipv4/raw.c:974 [inline]
RIP: 0010:raw_get_idx net/ipv4/raw.c:986 [inline]
RIP: 0010:raw_seq_start+0x431/0x800 net/ipv4/raw.c:995
Code: ef e8 33 3d 94 f7 49 8b 6d 00 4c 89 ef e8 b7 65 5f f7 49 89 ed 49 83 c5 98 0f 84 9a 00 00 00 48 83 c5 c8 48 89 e8 48 c1 e8 03 <42> 80 3c 30 00 74 08 48 89 ef e8 00 3d 94 f7 4c 8b 7d 00 48 89 ef
RSP: 0018:ffffc9001154f9b0 EFLAGS: 00010206
RAX: 0000000000000005 RBX: 1ffff1100302c8fd RCX: 0000000000000000
RDX: 0000000000000028 RSI: ffffc9001154f988 RDI: ffffc9000f77a338
RBP: 0000000000000029 R08: ffffffff8a50ffb4 R09: fffffbfff24b6bd9
R10: fffffbfff24b6bd9 R11: 0000000000000000 R12: ffff88801db73b78
R13: fffffffffffffff9 R14: dffffc0000000000 R15: 0000000000000030
FS:  00007f843ae8e700(0000) GS:ffff888063700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055bb9614b35f CR3: 000000003c672000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 seq_read_iter+0x4c6/0x10f0 fs/seq_file.c:225
 seq_read+0x224/0x320 fs/seq_file.c:162
 pde_read fs/proc/inode.c:316 [inline]
 proc_reg_read+0x23f/0x330 fs/proc/inode.c:328
 vfs_read+0x31e/0xd30 fs/read_write.c:468
 ksys_pread64 fs/read_write.c:665 [inline]
 __do_sys_pread64 fs/read_write.c:675 [inline]
 __se_sys_pread64 fs/read_write.c:672 [inline]
 __x64_sys_pread64+0x1e9/0x280 fs/read_write.c:672
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x4e/0xa0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x478d29
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f843ae8dbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
RAX: ffffffffffffffda RBX: 0000000000791408 RCX: 0000000000478d29
RDX: 000000000000000a RSI: 0000000020000000 RDI: 0000000000000003
RBP: 00000000f477909a R08: 0000000000000000 R09: 0000000000000000
R10: 000010000000007f R11: 0000000000000246 R12: 0000000000791740
R13: 0000000000791414 R14: 0000000000791408 R15: 00007ffc2eb48a50
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:read_pnet include/net/net_namespace.h:383 [inline]
RIP: 0010:sock_net include/net/sock.h:649 [inline]
RIP: 0010:raw_get_next net/ipv4/raw.c:974 [inline]
RIP: 0010:raw_get_idx net/ipv4/raw.c:986 [inline]
RIP: 0010:raw_seq_start+0x431/0x800 net/ipv4/raw.c:995
Code: ef e8 33 3d 94 f7 49 8b 6d 00 4c 89 ef e8 b7 65 5f f7 49 89 ed 49 83 c5 98 0f 84 9a 00 00 00 48 83 c5 c8 48 89 e8 48 c1 e8 03 <42> 80 3c 30 00 74 08 48 89 ef e8 00 3d 94 f7 4c 8b 7d 00 48 89 ef
RSP: 0018:ffffc9001154f9b0 EFLAGS: 00010206
RAX: 0000000000000005 RBX: 1ffff1100302c8fd RCX: 0000000000000000
RDX: 0000000000000028 RSI: ffffc9001154f988 RDI: ffffc9000f77a338
RBP: 0000000000000029 R08: ffffffff8a50ffb4 R09: fffffbfff24b6bd9
R10: fffffbfff24b6bd9 R11: 0000000000000000 R12: ffff88801db73b78
R13: fffffffffffffff9 R14: dffffc0000000000 R15: 0000000000000030
FS:  00007f843ae8e700(0000) GS:ffff888063700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f92ff166000 CR3: 000000003c672000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: 0daf07e52709 ("raw: convert raw sockets to RCU")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Dae R. Jeong <threeearcat@gmail.com>
Link: https://lore.kernel.org/netdev/ZCA2mGV_cmq7lIfV@dragonet/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 include/net/raw.h   |  4 ++--
 net/ipv4/raw.c      | 36 +++++++++++++++++++-----------------
 net/ipv4/raw_diag.c | 10 ++++------
 net/ipv6/raw.c      | 10 ++++------
 4 files changed, 29 insertions(+), 31 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index 2c004c20ed99..3af5289fdead 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -37,7 +37,7 @@ int raw_rcv(struct sock *, struct sk_buff *);
 struct raw_hashinfo {
 	spinlock_t lock;
 
-	struct hlist_nulls_head ht[RAW_HTABLE_SIZE] ____cacheline_aligned;
+	struct hlist_head ht[RAW_HTABLE_SIZE] ____cacheline_aligned;
 };
 
 static inline u32 raw_hashfunc(const struct net *net, u32 proto)
@@ -51,7 +51,7 @@ static inline void raw_hashinfo_init(struct raw_hashinfo *hashinfo)
 
 	spin_lock_init(&hashinfo->lock);
 	for (i = 0; i < RAW_HTABLE_SIZE; i++)
-		INIT_HLIST_NULLS_HEAD(&hashinfo->ht[i], i);
+		INIT_HLIST_HEAD(&hashinfo->ht[i]);
 }
 
 #ifdef CONFIG_PROC_FS
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 94df935ee0c5..8088a5011e7d 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -91,12 +91,12 @@ EXPORT_SYMBOL_GPL(raw_v4_hashinfo);
 int raw_hash_sk(struct sock *sk)
 {
 	struct raw_hashinfo *h = sk->sk_prot->h.raw_hash;
-	struct hlist_nulls_head *hlist;
+	struct hlist_head *hlist;
 
 	hlist = &h->ht[raw_hashfunc(sock_net(sk), inet_sk(sk)->inet_num)];
 
 	spin_lock(&h->lock);
-	__sk_nulls_add_node_rcu(sk, hlist);
+	sk_add_node_rcu(sk, hlist);
 	sock_set_flag(sk, SOCK_RCU_FREE);
 	spin_unlock(&h->lock);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
@@ -110,7 +110,7 @@ void raw_unhash_sk(struct sock *sk)
 	struct raw_hashinfo *h = sk->sk_prot->h.raw_hash;
 
 	spin_lock(&h->lock);
-	if (__sk_nulls_del_node_init_rcu(sk))
+	if (sk_del_node_init_rcu(sk))
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
 	spin_unlock(&h->lock);
 }
@@ -163,16 +163,15 @@ static int icmp_filter(const struct sock *sk, const struct sk_buff *skb)
 static int raw_v4_input(struct net *net, struct sk_buff *skb,
 			const struct iphdr *iph, int hash)
 {
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
 	int sdif = inet_sdif(skb);
+	struct hlist_head *hlist;
 	int dif = inet_iif(skb);
 	int delivered = 0;
 	struct sock *sk;
 
 	hlist = &raw_v4_hashinfo.ht[hash];
 	rcu_read_lock();
-	sk_nulls_for_each(sk, hnode, hlist) {
+	sk_for_each_rcu(sk, hlist) {
 		if (!raw_v4_match(net, sk, iph->protocol,
 				  iph->saddr, iph->daddr, dif, sdif))
 			continue;
@@ -264,10 +263,9 @@ static void raw_err(struct sock *sk, struct sk_buff *skb, u32 info)
 void raw_icmp_error(struct sk_buff *skb, int protocol, u32 info)
 {
 	struct net *net = dev_net(skb->dev);
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
 	int dif = skb->dev->ifindex;
 	int sdif = inet_sdif(skb);
+	struct hlist_head *hlist;
 	const struct iphdr *iph;
 	struct sock *sk;
 	int hash;
@@ -276,7 +274,7 @@ void raw_icmp_error(struct sk_buff *skb, int protocol, u32 info)
 	hlist = &raw_v4_hashinfo.ht[hash];
 
 	rcu_read_lock();
-	sk_nulls_for_each(sk, hnode, hlist) {
+	sk_for_each_rcu(sk, hlist) {
 		iph = (const struct iphdr *)skb->data;
 		if (!raw_v4_match(net, sk, iph->protocol,
 				  iph->daddr, iph->saddr, dif, sdif))
@@ -950,14 +948,13 @@ static struct sock *raw_get_first(struct seq_file *seq, int bucket)
 {
 	struct raw_hashinfo *h = pde_data(file_inode(seq->file));
 	struct raw_iter_state *state = raw_seq_private(seq);
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
+	struct hlist_head *hlist;
 	struct sock *sk;
 
 	for (state->bucket = bucket; state->bucket < RAW_HTABLE_SIZE;
 			++state->bucket) {
 		hlist = &h->ht[state->bucket];
-		sk_nulls_for_each(sk, hnode, hlist) {
+		sk_for_each(sk, hlist) {
 			if (sock_net(sk) == seq_file_net(seq))
 				return sk;
 		}
@@ -970,7 +967,7 @@ static struct sock *raw_get_next(struct seq_file *seq, struct sock *sk)
 	struct raw_iter_state *state = raw_seq_private(seq);
 
 	do {
-		sk = sk_nulls_next(sk);
+		sk = sk_next(sk);
 	} while (sk && sock_net(sk) != seq_file_net(seq));
 
 	if (!sk)
@@ -989,9 +986,12 @@ static struct sock *raw_get_idx(struct seq_file *seq, loff_t pos)
 }
 
 void *raw_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(RCU)
+	__acquires(&h->lock)
 {
-	rcu_read_lock();
+	struct raw_hashinfo *h = pde_data(file_inode(seq->file));
+
+	spin_lock(&h->lock);
+
 	return *pos ? raw_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
 }
 EXPORT_SYMBOL_GPL(raw_seq_start);
@@ -1010,9 +1010,11 @@ void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 EXPORT_SYMBOL_GPL(raw_seq_next);
 
 void raw_seq_stop(struct seq_file *seq, void *v)
-	__releases(RCU)
+	__releases(&h->lock)
 {
-	rcu_read_unlock();
+	struct raw_hashinfo *h = pde_data(file_inode(seq->file));
+
+	spin_unlock(&h->lock);
 }
 EXPORT_SYMBOL_GPL(raw_seq_stop);
 
diff --git a/net/ipv4/raw_diag.c b/net/ipv4/raw_diag.c
index 999321834b94..da3591a66a16 100644
--- a/net/ipv4/raw_diag.c
+++ b/net/ipv4/raw_diag.c
@@ -57,8 +57,7 @@ static bool raw_lookup(struct net *net, struct sock *sk,
 static struct sock *raw_sock_get(struct net *net, const struct inet_diag_req_v2 *r)
 {
 	struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
+	struct hlist_head *hlist;
 	struct sock *sk;
 	int slot;
 
@@ -68,7 +67,7 @@ static struct sock *raw_sock_get(struct net *net, const struct inet_diag_req_v2
 	rcu_read_lock();
 	for (slot = 0; slot < RAW_HTABLE_SIZE; slot++) {
 		hlist = &hashinfo->ht[slot];
-		sk_nulls_for_each(sk, hnode, hlist) {
+		sk_for_each_rcu(sk, hlist) {
 			if (raw_lookup(net, sk, r)) {
 				/*
 				 * Grab it and keep until we fill
@@ -142,9 +141,8 @@ static void raw_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	struct raw_hashinfo *hashinfo = raw_get_hashinfo(r);
 	struct net *net = sock_net(skb->sk);
 	struct inet_diag_dump_data *cb_data;
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
 	int num, s_num, slot, s_slot;
+	struct hlist_head *hlist;
 	struct sock *sk = NULL;
 	struct nlattr *bc;
 
@@ -161,7 +159,7 @@ static void raw_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 		num = 0;
 
 		hlist = &hashinfo->ht[slot];
-		sk_nulls_for_each(sk, hnode, hlist) {
+		sk_for_each_rcu(sk, hlist) {
 			struct inet_sock *inet = inet_sk(sk);
 
 			if (!net_eq(sock_net(sk), net))
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index bac9ba747bde..a327aa481df4 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -141,10 +141,9 @@ EXPORT_SYMBOL(rawv6_mh_filter_unregister);
 static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 {
 	struct net *net = dev_net(skb->dev);
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
 	const struct in6_addr *saddr;
 	const struct in6_addr *daddr;
+	struct hlist_head *hlist;
 	struct sock *sk;
 	bool delivered = false;
 	__u8 hash;
@@ -155,7 +154,7 @@ static bool ipv6_raw_deliver(struct sk_buff *skb, int nexthdr)
 	hash = raw_hashfunc(net, nexthdr);
 	hlist = &raw_v6_hashinfo.ht[hash];
 	rcu_read_lock();
-	sk_nulls_for_each(sk, hnode, hlist) {
+	sk_for_each_rcu(sk, hlist) {
 		int filtered;
 
 		if (!raw_v6_match(net, sk, nexthdr, daddr, saddr,
@@ -333,15 +332,14 @@ void raw6_icmp_error(struct sk_buff *skb, int nexthdr,
 		u8 type, u8 code, int inner_offset, __be32 info)
 {
 	struct net *net = dev_net(skb->dev);
-	struct hlist_nulls_head *hlist;
-	struct hlist_nulls_node *hnode;
+	struct hlist_head *hlist;
 	struct sock *sk;
 	int hash;
 
 	hash = raw_hashfunc(net, nexthdr);
 	hlist = &raw_v6_hashinfo.ht[hash];
 	rcu_read_lock();
-	sk_nulls_for_each(sk, hnode, hlist) {
+	sk_for_each_rcu(sk, hlist) {
 		/* Note: ipv6_hdr(skb) != skb->data */
 		const struct ipv6hdr *ip6h = (const struct ipv6hdr *)skb->data;
 
-- 
2.30.2



* [PATCH v1 net 2/2] ping: Fix potential NULL deref for /proc/net/icmp.
  2023-04-03 19:49 [PATCH v1 net 0/2] raw/ping: Fix locking in /proc/net/{raw,icmp} Kuniyuki Iwashima
  2023-04-03 19:49 ` [PATCH v1 net 1/2] raw: Fix NULL deref in raw_get_next() Kuniyuki Iwashima
@ 2023-04-03 19:49 ` Kuniyuki Iwashima
  2023-04-04  1:47   ` Eric Dumazet
  2023-04-05  2:00 ` [PATCH v1 net 0/2] raw/ping: Fix locking in /proc/net/{raw,icmp} patchwork-bot+netdevbpf
  2 siblings, 1 reply; 11+ messages in thread
From: Kuniyuki Iwashima @ 2023-04-03 19:49 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

After commit dbca1596bbb0 ("ping: convert to RCU lookups, get rid
of rwlock"), we use RCU for ping sockets, but we should use a spinlock
for /proc/net/icmp to avoid the potential NULL deref mentioned in
the previous patch.

Let's go back to using the spinlock there.

Note that we can convert ping sockets to use hlist instead of
hlist_nulls, because we do not use SLAB_TYPESAFE_BY_RCU for ping sockets.

Fixes: dbca1596bbb0 ("ping: convert to RCU lookups, get rid of rwlock")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/ping.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 409ec2a1f95b..5178a3f3cb53 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -1089,13 +1089,13 @@ static struct sock *ping_get_idx(struct seq_file *seq, loff_t pos)
 }
 
 void *ping_seq_start(struct seq_file *seq, loff_t *pos, sa_family_t family)
-	__acquires(RCU)
+	__acquires(ping_table.lock)
 {
 	struct ping_iter_state *state = seq->private;
 	state->bucket = 0;
 	state->family = family;
 
-	rcu_read_lock();
+	spin_lock(&ping_table.lock);
 
 	return *pos ? ping_get_idx(seq, *pos-1) : SEQ_START_TOKEN;
 }
@@ -1121,9 +1121,9 @@ void *ping_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 EXPORT_SYMBOL_GPL(ping_seq_next);
 
 void ping_seq_stop(struct seq_file *seq, void *v)
-	__releases(RCU)
+	__releases(ping_table.lock)
 {
-	rcu_read_unlock();
+	spin_unlock(&ping_table.lock);
 }
 EXPORT_SYMBOL_GPL(ping_seq_stop);
 
-- 
2.30.2



* Re: [PATCH v1 net 1/2] raw: Fix NULL deref in raw_get_next().
  2023-04-03 19:49 ` [PATCH v1 net 1/2] raw: Fix NULL deref in raw_get_next() Kuniyuki Iwashima
@ 2023-04-04  1:46   ` Eric Dumazet
  2023-04-04  2:46   ` Jason Xing
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2023-04-04  1:46 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Kuniyuki Iwashima,
	netdev, syzbot, Dae R . Jeong

On Mon, Apr 3, 2023 at 9:51 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> Dae R. Jeong reported a NULL deref in raw_get_next() [0].
>
> It seems that the repro was running these sequences in parallel so
> that one thread was iterating on a socket that was being freed in
> another netns.
>
>   unshare(0x40060200)
>   r0 = syz_open_procfs(0x0, &(0x7f0000002080)='net/raw\x00')
>   socket$inet_icmp_raw(0x2, 0x3, 0x1)
>   pread64(r0, &(0x7f0000000000)=""/10, 0xa, 0x10000000007f)
>
> After commit 0daf07e52709 ("raw: convert raw sockets to RCU"), we
> use RCU and hlist_nulls_for_each_entry() to iterate over SOCK_RAW
> sockets.  However, we should use spinlock for slow paths to avoid
> the NULL deref.
>
> Also, SOCK_RAW does not use SLAB_TYPESAFE_BY_RCU, and the slab object
> is not reused during iteration in the grace period.  In fact, the
> lockless readers do not check the nulls marker with get_nulls_value().
> So, SOCK_RAW should use hlist instead of hlist_nulls.
>
> Instead of adding an unnecessary barrier by sk_nulls_for_each_rcu(),
> let's convert hlist_nulls to hlist and use sk_for_each_rcu() for
> fast paths and sk_for_each() and spinlock for /proc/net/raw.
>
> Fixes: 0daf07e52709 ("raw: convert raw sockets to RCU")
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Reported-by: Dae R. Jeong <threeearcat@gmail.com>
> Link: https://lore.kernel.org/netdev/ZCA2mGV_cmq7lIfV@dragonet/
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  include/net/raw.h   |  4 ++--
>  net/ipv4/raw.c      | 36 +++++++++++++++++++-----------------
>  net/ipv4/raw_diag.c | 10 ++++------
>  net/ipv6/raw.c      | 10 ++++------
>  4 files changed, 29 insertions(+), 31 deletions(-)
>

SGTM, thanks a lot.
Reviewed-by: Eric Dumazet <edumazet@google.com>


* Re: [PATCH v1 net 2/2] ping: Fix potential NULL deref for /proc/net/icmp.
  2023-04-03 19:49 ` [PATCH v1 net 2/2] ping: Fix potential NULL deref for /proc/net/icmp Kuniyuki Iwashima
@ 2023-04-04  1:47   ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2023-04-04  1:47 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Kuniyuki Iwashima, netdev

On Mon, Apr 3, 2023 at 9:51 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> After commit dbca1596bbb0 ("ping: convert to RCU lookups, get rid
> of rwlock"), we use RCU for ping sockets, but we should use a spinlock
> for /proc/net/icmp to avoid the potential NULL deref mentioned in
> the previous patch.
>
> Let's go back to using the spinlock there.
>
> Note that we can convert ping sockets to use hlist instead of
> hlist_nulls, because we do not use SLAB_TYPESAFE_BY_RCU for ping sockets.

Yes, this could be done later if we care enough.

>
> Fixes: dbca1596bbb0 ("ping: convert to RCU lookups, get rid of rwlock")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !


* Re: [PATCH v1 net 1/2] raw: Fix NULL deref in raw_get_next().
  2023-04-03 19:49 ` [PATCH v1 net 1/2] raw: Fix NULL deref in raw_get_next() Kuniyuki Iwashima
  2023-04-04  1:46   ` Eric Dumazet
@ 2023-04-04  2:46   ` Jason Xing
  2023-04-04  4:07     ` Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: Jason Xing @ 2023-04-04  2:46 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Kuniyuki Iwashima, netdev, syzbot, Dae R . Jeong

On Tue, Apr 4, 2023 at 4:02 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> Dae R. Jeong reported a NULL deref in raw_get_next() [0].
>
> It seems that the repro was running these sequences in parallel so
> that one thread was iterating on a socket that was being freed in
> another netns.
>
>   unshare(0x40060200)
>   r0 = syz_open_procfs(0x0, &(0x7f0000002080)='net/raw\x00')
>   socket$inet_icmp_raw(0x2, 0x3, 0x1)
>   pread64(r0, &(0x7f0000000000)=""/10, 0xa, 0x10000000007f)
>
> After commit 0daf07e52709 ("raw: convert raw sockets to RCU"), we
> use RCU and hlist_nulls_for_each_entry() to iterate over SOCK_RAW
> sockets.  However, we should use a spinlock for the slow paths to
> avoid the NULL deref.
>
> Also, SOCK_RAW does not use SLAB_TYPESAFE_BY_RCU, and the slab object
> is not reused during iteration in the grace period.  In fact, the
> lockless readers do not check the nulls marker with get_nulls_value().
> So, SOCK_RAW should use hlist instead of hlist_nulls.
>
> Instead of adding an unnecessary barrier via sk_nulls_for_each_rcu(),
> let's convert hlist_nulls to hlist and use sk_for_each_rcu() for the
> fast paths, and sk_for_each() with the spinlock for /proc/net/raw.
>
> [...]
>
> Fixes: 0daf07e52709 ("raw: convert raw sockets to RCU")
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Reported-by: Dae R. Jeong <threeearcat@gmail.com>
> Link: https://lore.kernel.org/netdev/ZCA2mGV_cmq7lIfV@dragonet/
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  include/net/raw.h   |  4 ++--
>  net/ipv4/raw.c      | 36 +++++++++++++++++++-----------------
>  net/ipv4/raw_diag.c | 10 ++++------
>  net/ipv6/raw.c      | 10 ++++------
>  4 files changed, 29 insertions(+), 31 deletions(-)
>
[...]
>  void *raw_seq_start(struct seq_file *seq, loff_t *pos)
> -       __acquires(RCU)
> +       __acquires(&h->lock)
>  {
> -       rcu_read_lock();
> +       struct raw_hashinfo *h = pde_data(file_inode(seq->file));
> +
> +       spin_lock(&h->lock);

I would like to ask two questions that confuse me:
1) Why would we use a spinlock to protect readers of the raw hashtable
when they are already under RCU protection? Normally, with RCU, we
only need to make sure the socket is destroyed via call_rcu(), which
prevents a reader of the socket from getting a NULL pointer.
2) Doesn't taking the spinlock while we are cat'ing the /proc/net/raw
file block/postpone hashing sockets somewhere else?

Thanks,
Jason

> [...]


* Re: [PATCH v1 net 1/2] raw: Fix NULL deref in raw_get_next().
  2023-04-04  2:46   ` Jason Xing
@ 2023-04-04  4:07     ` Eric Dumazet
  2023-04-04  6:56       ` Jason Xing
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2023-04-04  4:07 UTC (permalink / raw)
  To: Jason Xing
  Cc: Kuniyuki Iwashima, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Kuniyuki Iwashima, netdev, syzbot, Dae R . Jeong

On Tue, Apr 4, 2023 at 4:46 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> I would like to ask two questions that confuse me:
> 1) Why would we use a spinlock to protect readers of the raw hashtable
> when they are already under RCU protection? Normally, with RCU, we
> only need to make sure the socket is destroyed via call_rcu(), which
> prevents a reader of the socket from getting a NULL pointer.

Yes, but then we cannot sleep or yield the CPU.
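
For example, everything done during the iteration has to stay atomic,
roughly (illustrative sketch only):

	rcu_read_lock();
	sk_for_each_rcu(sk, hlist) {
		/* No mutex_lock(), no GFP_KERNEL allocation, no
		 * cond_resched() here: an RCU read-side critical
		 * section must not block.
		 */
	}
	rcu_read_unlock();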

> 2) Doesn't taking the spinlock while we are cat'ing the /proc/net/raw
> file block/postpone hashing sockets somewhere else?

/proc/net/raw file access is rare, and limited in duration (at most
one page filled per system call)

Use of RCU was only intended in my original patch to solve deadlock issues
under packet floods, like DOS attacks.

Really using RCU in the data/fast path makes sense (and we did that already).
For control paths (or /proc/.... files), not so much.


* Re: [PATCH v1 net 1/2] raw: Fix NULL deref in raw_get_next().
  2023-04-04  4:07     ` Eric Dumazet
@ 2023-04-04  6:56       ` Jason Xing
  2023-04-04  7:23         ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Xing @ 2023-04-04  6:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Kuniyuki Iwashima, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Kuniyuki Iwashima, netdev, syzbot, Dae R . Jeong

On Tue, Apr 4, 2023 at 12:07 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Apr 4, 2023 at 4:46 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > I would like to ask two questions that confuse me:
> > 1) Why would we use a spinlock to protect readers of the raw hashtable
> > when they are already under RCU protection? Normally, with RCU, we
> > only need to make sure the socket is destroyed via call_rcu(), which
> > prevents a reader of the socket from getting a NULL pointer.
>
> Yes, but then we cannot sleep or yield the CPU.

Indeed. We also cannot sleep/yield under the protection of the spin
lock. And I checked the caller in fs/seq_file.c and noticed that we
have no chance to sleep/yield between ->start and ->stop.

So I wonder why we couldn't use RCU directly, as in the patch [1] you
proposed before, instead of deliberately switching to a spinlock?
Taking the spinlock on the whole hashinfo to protect the reader side
is heavy, and RCU outperforms a spinlock in this case, I think.

[1]: commit 0daf07e52709 ("raw: convert raw sockets to RCU")

Thanks,
Jason
>
> > 2) Doesn't taking the spinlock while we are cat'ing the /proc/net/raw
> > file block/postpone hashing sockets somewhere else?
>
> /proc/net/raw file access is rare, and limited in duration (at most
> one page filled per system call)
>
> Use of RCU was only intended in my original patch to solve deadlock issues
> under packet floods, like DOS attacks.
>
> Really using RCU in the data/fast path makes sense (and we did that already).
> For control paths (or /proc/.... files), not so much.


* Re: [PATCH v1 net 1/2] raw: Fix NULL deref in raw_get_next().
  2023-04-04  6:56       ` Jason Xing
@ 2023-04-04  7:23         ` Eric Dumazet
  2023-04-04  7:34           ` Jason Xing
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2023-04-04  7:23 UTC (permalink / raw)
  To: Jason Xing
  Cc: Kuniyuki Iwashima, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Kuniyuki Iwashima, netdev, syzbot, Dae R . Jeong

On Tue, Apr 4, 2023 at 8:56 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Tue, Apr 4, 2023 at 12:07 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Apr 4, 2023 at 4:46 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > I would like to ask two questions that confuse me:
> > > 1) Why would we use a spinlock to protect readers of the raw hashtable
> > > when they are already under RCU protection? Normally, with RCU, we
> > > only need to make sure the socket is destroyed via call_rcu(), which
> > > prevents a reader of the socket from getting a NULL pointer.
> >
> > Yes, but then we cannot sleep or yield the CPU.
>
> Indeed. We also cannot sleep/yield under the protection of the spin
> lock. And I checked the caller in fs/seq_file.c and noticed that we
> have no chance to sleep/yield between ->start and ->stop.
>

You missed my point.
The spinlock can trivially be replaced by a mutex now that the fast
path has been RCU-converted.
This would allow raw_get_idx()/raw_get_first() to use cond_resched(),
if some hosts try to use 10,000 raw sockets :/
Is it a real problem to solve right now? I do not think so.
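
Roughly like this, if it ever becomes worthwhile (untested sketch,
assuming raw_hashinfo.lock is converted to a mutex):

	void *raw_seq_start(struct seq_file *seq, loff_t *pos)
		__acquires(&h->lock)
	{
		struct raw_hashinfo *h = pde_data(file_inode(seq->file));

		/* Sleeping lock: raw_get_idx()/raw_get_first() could
		 * then cond_resched() between buckets, which neither
		 * RCU nor a spinlock allows.
		 */
		mutex_lock(&h->lock);
		return *pos ? raw_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
	}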

> So I wonder why we couldn't use RCU directly, as in the patch [1] you
> proposed before, instead of deliberately switching to a spinlock?
> Taking the spinlock on the whole hashinfo to protect the reader side
> is heavy, and RCU outperforms a spinlock in this case, I think.

A spinlock is just fine; most hosts have fewer than 10 raw sockets,
because raw sockets make things _much_ slower.

RCU 'just because' does not make sense; it would suggest that raw
sockets scale, while they do not.


* Re: [PATCH v1 net 1/2] raw: Fix NULL deref in raw_get_next().
  2023-04-04  7:23         ` Eric Dumazet
@ 2023-04-04  7:34           ` Jason Xing
  0 siblings, 0 replies; 11+ messages in thread
From: Jason Xing @ 2023-04-04  7:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Kuniyuki Iwashima, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Kuniyuki Iwashima, netdev, syzbot, Dae R . Jeong

On Tue, Apr 4, 2023 at 3:23 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Apr 4, 2023 at 8:56 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Tue, Apr 4, 2023 at 12:07 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Apr 4, 2023 at 4:46 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > I would like to ask two questions that confuse me:
> > > > 1) Why would we use a spinlock to protect readers of the raw hashtable
> > > > when they are already under RCU protection? Normally, with RCU, we
> > > > only need to make sure the socket is destroyed via call_rcu(), which
> > > > prevents a reader of the socket from getting a NULL pointer.
> > >
> > > Yes, but then we cannot sleep or yield the CPU.
> >
> > Indeed. We also cannot sleep/yield under the protection of the spin
> > lock. And I checked the caller in fs/seq_file.c and noticed that we
> > have no chance to sleep/yield between ->start and ->stop.
> >
>
> You missed my point.
> The spinlock can trivially be replaced by a mutex now that the fast
> path has been RCU-converted.
> This would allow raw_get_idx()/raw_get_first() to use cond_resched(),
> if some hosts try to use 10,000 raw sockets :/

Thanks for the clarification. I agree. The patch itself is good for now :)

> Is it a real problem to solve right now? I do not think so.
>
> > So I wonder why we couldn't use RCU directly, as in the patch [1] you
> > proposed before, instead of deliberately switching to a spinlock?
> > Taking the spinlock on the whole hashinfo to protect the reader side
> > is heavy, and RCU outperforms a spinlock in this case, I think.
>
> A spinlock is just fine; most hosts have fewer than 10 raw sockets,
> because raw sockets make things _much_ slower.

Sure.

Thanks,
Jason

>
> RCU 'just because' does not make sense; it would suggest that raw
> sockets scale, while they do not.


* Re: [PATCH v1 net 0/2] raw/ping: Fix locking in /proc/net/{raw,icmp}.
  2023-04-03 19:49 [PATCH v1 net 0/2] raw/ping: Fix locking in /proc/net/{raw,icmp} Kuniyuki Iwashima
  2023-04-03 19:49 ` [PATCH v1 net 1/2] raw: Fix NULL deref in raw_get_next() Kuniyuki Iwashima
  2023-04-03 19:49 ` [PATCH v1 net 2/2] ping: Fix potential NULL deref for /proc/net/icmp Kuniyuki Iwashima
@ 2023-04-05  2:00 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 11+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-04-05  2:00 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, pabeni, kuni1840, netdev

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 3 Apr 2023 12:49:57 -0700 you wrote:
> The first patch fixes a NULL deref for /proc/net/raw, and the second
> one fixes the same issue for ping sockets (/proc/net/icmp).
> 
> The first patch also converts hlist_nulls to hlist. This is because
> the current code uses sk_nulls_for_each() for lockless readers instead
> of sk_nulls_for_each_rcu(), which adds a memory barrier, and raw
> sockets do not use the nulls marker nor SLAB_TYPESAFE_BY_RCU in the
> first place.
> 
> [...]

Here is the summary with links:
  - [v1,net,1/2] raw: Fix NULL deref in raw_get_next().
    https://git.kernel.org/netdev/net/c/0a78cf7264d2
  - [v1,net,2/2] ping: Fix potential NULL deref for /proc/net/icmp.
    https://git.kernel.org/netdev/net/c/ab5fb73ffa01

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




