linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Eric Dumazet <edumazet@google.com>,
	syzbot <syzkaller@googlegroups.com>,
	Pablo Neira Ayuso <pablo@netfilter.org>,
	Sasha Levin <sashal@kernel.org>,
	kadlec@netfilter.org, fw@strlen.de, davem@davemloft.net,
	kuba@kernel.org, netfilter-devel@vger.kernel.org,
	coreteam@netfilter.org, netdev@vger.kernel.org
Subject: [PATCH AUTOSEL 5.14 20/40] netfilter: conntrack: serialize hash resizes and cleanups
Date: Tue,  5 Oct 2021 09:49:59 -0400	[thread overview]
Message-ID: <20211005135020.214291-20-sashal@kernel.org> (raw)
In-Reply-To: <20211005135020.214291-1-sashal@kernel.org>

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit e9edc188fc76499b0b9bd60364084037f6d03773 ]

Syzbot was able to trigger the following warning [1]

No repro found by syzbot yet but I was able to trigger similar issue
by having 2 scripts running in parallel, changing conntrack hash sizes,
and:

for j in `seq 1 1000` ; do unshare -n /bin/true >/dev/null ; done

It would take more than 5 minutes for net_namespace structures
to be cleaned up.

This is because nf_ct_iterate_cleanup() has to restart everytime
a resize happened.

By adding a mutex, we can serialize hash resizes and cleanups
and also make get_next_corpse() faster by skipping over empty
buckets.

Even without resizes in the picture, this patch considerably
speeds up network namespace dismantles.

[1]
INFO: task syz-executor.0:8312 can't die for more than 144 seconds.
task:syz-executor.0  state:R  running task     stack:25672 pid: 8312 ppid:  6573 flags:0x00004006
Call Trace:
 context_switch kernel/sched/core.c:4955 [inline]
 __schedule+0x940/0x26f0 kernel/sched/core.c:6236
 preempt_schedule_common+0x45/0xc0 kernel/sched/core.c:6408
 preempt_schedule_thunk+0x16/0x18 arch/x86/entry/thunk_64.S:35
 __local_bh_enable_ip+0x109/0x120 kernel/softirq.c:390
 local_bh_enable include/linux/bottom_half.h:32 [inline]
 get_next_corpse net/netfilter/nf_conntrack_core.c:2252 [inline]
 nf_ct_iterate_cleanup+0x15a/0x450 net/netfilter/nf_conntrack_core.c:2275
 nf_conntrack_cleanup_net_list+0x14c/0x4f0 net/netfilter/nf_conntrack_core.c:2469
 ops_exit_list+0x10d/0x160 net/core/net_namespace.c:171
 setup_net+0x639/0xa30 net/core/net_namespace.c:349
 copy_net_ns+0x319/0x760 net/core/net_namespace.c:470
 create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
 unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
 ksys_unshare+0x445/0x920 kernel/fork.c:3128
 __do_sys_unshare kernel/fork.c:3202 [inline]
 __se_sys_unshare kernel/fork.c:3200 [inline]
 __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3200
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f63da68e739
RSP: 002b:00007f63d7c05188 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00007f63da792f80 RCX: 00007f63da68e739
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000000
RBP: 00007f63da6e8cc4 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f63da792f80
R13: 00007fff50b75d3f R14: 00007f63d7c05300 R15: 0000000000022000

Showing all locks held in the system:
1 lock held by khungtaskd/27:
 #0: ffffffff8b980020 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6446
2 locks held by kworker/u4:2/153:
 #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
 #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic_long_set include/linux/atomic/atomic-long.h:41 [inline]
 #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: atomic_long_set include/linux/atomic/atomic-instrumented.h:1198 [inline]
 #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:634 [inline]
 #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:661 [inline]
 #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x896/0x1690 kernel/workqueue.c:2268
 #1: ffffc9000140fdb0 ((kfence_timer).work){+.+.}-{0:0}, at: process_one_work+0x8ca/0x1690 kernel/workqueue.c:2272
1 lock held by systemd-udevd/2970:
1 lock held by in:imklog/6258:
 #0: ffff88807f970ff0 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:990
3 locks held by kworker/1:6/8158:
1 lock held by syz-executor.0/8312:
2 locks held by kworker/u4:13/9320:
1 lock held by syz-executor.5/10178:
1 lock held by syz-executor.4/10217:

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/netfilter/nf_conntrack_core.c | 70 ++++++++++++++++---------------
 1 file changed, 37 insertions(+), 33 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index d31dbccbe7bd..4f074d7653b8 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -75,6 +75,9 @@ static __read_mostly struct kmem_cache *nf_conntrack_cachep;
 static DEFINE_SPINLOCK(nf_conntrack_locks_all_lock);
 static __read_mostly bool nf_conntrack_locks_all;
 
+/* serialize hash resizes and nf_ct_iterate_cleanup */
+static DEFINE_MUTEX(nf_conntrack_mutex);
+
 #define GC_SCAN_INTERVAL	(120u * HZ)
 #define GC_SCAN_MAX_DURATION	msecs_to_jiffies(10)
 
@@ -2192,28 +2195,31 @@ get_next_corpse(int (*iter)(struct nf_conn *i, void *data),
 	spinlock_t *lockp;
 
 	for (; *bucket < nf_conntrack_htable_size; (*bucket)++) {
+		struct hlist_nulls_head *hslot = &nf_conntrack_hash[*bucket];
+
+		if (hlist_nulls_empty(hslot))
+			continue;
+
 		lockp = &nf_conntrack_locks[*bucket % CONNTRACK_LOCKS];
 		local_bh_disable();
 		nf_conntrack_lock(lockp);
-		if (*bucket < nf_conntrack_htable_size) {
-			hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[*bucket], hnnode) {
-				if (NF_CT_DIRECTION(h) != IP_CT_DIR_REPLY)
-					continue;
-				/* All nf_conn objects are added to hash table twice, one
-				 * for original direction tuple, once for the reply tuple.
-				 *
-				 * Exception: In the IPS_NAT_CLASH case, only the reply
-				 * tuple is added (the original tuple already existed for
-				 * a different object).
-				 *
-				 * We only need to call the iterator once for each
-				 * conntrack, so we just use the 'reply' direction
-				 * tuple while iterating.
-				 */
-				ct = nf_ct_tuplehash_to_ctrack(h);
-				if (iter(ct, data))
-					goto found;
-			}
+		hlist_nulls_for_each_entry(h, n, hslot, hnnode) {
+			if (NF_CT_DIRECTION(h) != IP_CT_DIR_REPLY)
+				continue;
+			/* All nf_conn objects are added to hash table twice, one
+			 * for original direction tuple, once for the reply tuple.
+			 *
+			 * Exception: In the IPS_NAT_CLASH case, only the reply
+			 * tuple is added (the original tuple already existed for
+			 * a different object).
+			 *
+			 * We only need to call the iterator once for each
+			 * conntrack, so we just use the 'reply' direction
+			 * tuple while iterating.
+			 */
+			ct = nf_ct_tuplehash_to_ctrack(h);
+			if (iter(ct, data))
+				goto found;
 		}
 		spin_unlock(lockp);
 		local_bh_enable();
@@ -2231,26 +2237,20 @@ get_next_corpse(int (*iter)(struct nf_conn *i, void *data),
 static void nf_ct_iterate_cleanup(int (*iter)(struct nf_conn *i, void *data),
 				  void *data, u32 portid, int report)
 {
-	unsigned int bucket = 0, sequence;
+	unsigned int bucket = 0;
 	struct nf_conn *ct;
 
 	might_sleep();
 
-	for (;;) {
-		sequence = read_seqcount_begin(&nf_conntrack_generation);
-
-		while ((ct = get_next_corpse(iter, data, &bucket)) != NULL) {
-			/* Time to push up daises... */
+	mutex_lock(&nf_conntrack_mutex);
+	while ((ct = get_next_corpse(iter, data, &bucket)) != NULL) {
+		/* Time to push up daises... */
 
-			nf_ct_delete(ct, portid, report);
-			nf_ct_put(ct);
-			cond_resched();
-		}
-
-		if (!read_seqcount_retry(&nf_conntrack_generation, sequence))
-			break;
-		bucket = 0;
+		nf_ct_delete(ct, portid, report);
+		nf_ct_put(ct);
+		cond_resched();
 	}
+	mutex_unlock(&nf_conntrack_mutex);
 }
 
 struct iter_data {
@@ -2486,8 +2486,10 @@ int nf_conntrack_hash_resize(unsigned int hashsize)
 	if (!hash)
 		return -ENOMEM;
 
+	mutex_lock(&nf_conntrack_mutex);
 	old_size = nf_conntrack_htable_size;
 	if (old_size == hashsize) {
+		mutex_unlock(&nf_conntrack_mutex);
 		kvfree(hash);
 		return 0;
 	}
@@ -2523,6 +2525,8 @@ int nf_conntrack_hash_resize(unsigned int hashsize)
 	nf_conntrack_all_unlock();
 	local_bh_enable();
 
+	mutex_unlock(&nf_conntrack_mutex);
+
 	synchronize_net();
 	kvfree(old_hash);
 	return 0;
-- 
2.33.0


  parent reply	other threads:[~2021-10-05 13:53 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-05 13:49 [PATCH AUTOSEL 5.14 01/40] ext4: check and update i_disksize properly Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 02/40] ext4: correct the error path of ext4_write_inline_data_end() Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 03/40] ASoC: Intel: sof_sdw: tag SoundWire BEs as non-atomic Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 04/40] ext4: enforce buffer head state assertion in ext4_da_map_blocks Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 05/40] ALSA: oxfw: fix transmission method for Loud models based on OXFW971 Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 06/40] dt-bindings: interconnect: sdm660: Add missing a2noc qos clocks Sasha Levin
2021-10-05 19:11   ` Georgi Djakov
2021-10-06 15:12     ` Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 07/40] interconnect: qcom: " Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 08/40] ALSA: usb-audio: Unify mixer resume and reset_resume procedure Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 09/40] netfilter: ipset: Fix oversized kvmalloc() calls Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 10/40] HID: betop: fix slab-out-of-bounds Write in betop_probe Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 11/40] HID: apple: Fix logical maximum and usage maximum of Magic Keyboard JIS Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 12/40] netfilter: ip6_tables: zero-initialize fragment offset Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 13/40] HID: wacom: Add new Intuos BT (CTL-4100WL/CTL-6100WL) device IDs Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 14/40] HID: amd_sfh: Fix potential NULL pointer dereference Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 15/40] ASoC: SOF: loader: release_firmware() on load failure to avoid batching Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 16/40] KVM: arm64: nvhe: Fix missing FORCE for hyp-reloc.S build rule Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 17/40] netfilter: nf_tables: Fix oversized kvmalloc() calls Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 18/40] netfilter: nf_nat_masquerade: make async masq_inet6_event handling generic Sasha Levin
2021-10-05 13:49 ` [PATCH AUTOSEL 5.14 19/40] netfilter: nf_nat_masquerade: defer conntrack walk to work queue Sasha Levin
2021-10-05 13:49 ` Sasha Levin [this message]
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 21/40] mac80211: Drop frames from invalid MAC address in ad-hoc mode Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 22/40] pinctrl: qcom: sc7280: Add PM suspend callbacks Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 23/40] m68k: Handle arrivals of multiple signals correctly Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 24/40] hwmon: (ltc2947) Properly handle errors when looking for the external clock Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 25/40] net: prevent user from passing illegal stab size Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 26/40] mac80211: check return value of rhashtable_init Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 27/40] net: bgmac-platform: handle mac-address deferral Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 28/40] net: mdiobus: Fix memory leak in __mdiobus_register Sasha Levin
2021-10-05 14:02   ` Andrew Lunn
2021-10-06 15:12     ` Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 29/40] nvme: add command id quirk for apple controllers Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 30/40] vboxfs: fix broken legacy mount signature checking Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 31/40] net: sun: SUNVNET_COMMON should depend on INET Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 32/40] drm/amdgpu: fix gart.bo pin_count leak Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 33/40] scsi: ses: Fix unsigned comparison with less than zero Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 34/40] scsi: virtio_scsi: Fix spelling mistake "Unsupport" -> "Unsupported" Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 35/40] scsi: qla2xxx: Fix excessive messages during device logout Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 36/40] perf/x86: Reset destroy callback on event init failure Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 37/40] perf/core: fix userpage->time_enabled of inactive events Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 38/40] sched: Always inline is_percpu_thread() Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 39/40] io_uring: kill fasync Sasha Levin
2021-10-05 13:50 ` [PATCH AUTOSEL 5.14 40/40] hwmon: (pmbus/ibm-cffps) max_power_out swap changes Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211005135020.214291-20-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=coreteam@netfilter.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=fw@strlen.de \
    --cc=kadlec@netfilter.org \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=netfilter-devel@vger.kernel.org \
    --cc=pablo@netfilter.org \
    --cc=stable@vger.kernel.org \
    --cc=syzkaller@googlegroups.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).