linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Bug in vhost-net with RT kernels with SMP
@ 2021-12-15 15:44 Florent Carli
  2021-12-17 10:32 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 4+ messages in thread
From: Florent Carli @ 2021-12-15 15:44 UTC (permalink / raw)
  To: linux-kernel

Hello all,

We encounter a bug in host-net with RT kernels running on SMP.

Our setup is:
- a multi cpu virtualized hypervisor that runs an RT kernel with SMP
(tested with 4.19 and 5.10) (originally I encountered the bug with a
physical host but we virtualized it to facilitate debugging)
- a guest running a standard kernel with a tap interface on
virtio/vhost-net (one vcpu)
- this tap is bridged with the host so that they can ping each other
- the guest is pinging the host as quickly as possible ("i=0; while [
$i -lt 100000 ]; do ping -c 1 10.10.10.2; i=$((i+1)); done")
- it may crash after some time, but to crash "quicker", we have found
that other network trafics at the same time do help, so we have the
host ping the guest the same way. Having to manage several network
flows at the same time seems to be key to triggering this bug.

After a few seconds, we get the following back trace:

[  116.435285] WARNING: CPU: 2 PID: 365 at fs/eventfd.c:74
eventfd_signal+0x79/0x90
[  116.435303] Modules linked in:
[  116.435305] CPU: 2 PID: 365 Comm: vhost-361 Not tainted
5.10.83-rt58-mainline-rt #1
[  116.435307] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.13.0-1ubuntu1.1 04/01/2014
[  116.435308] BUG: using smp_processor_id() in preemptible [00000000]
code: vhost-361/365
[  116.435310] caller is print_stop_info+0x16/0x30
[  116.435317] CPU: 2 PID: 365 Comm: vhost-361 Not tainted
5.10.83-rt58-mainline-rt #1
[  116.435319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.13.0-1ubuntu1.1 04/01/2014
[  116.435320] Call Trace:
[  116.435330]  dump_stack+0x57/0x6a
[  116.435362]  check_preemption_disabled+0xc8/0xd0
[  116.435377]  print_stop_info+0x16/0x30
[  116.435382]  show_regs+0x10/0x40
[  116.435399]  __warn+0x6d/0xa0
[  116.435410]  ? eventfd_signal+0x79/0x90
[  116.435413]  report_bug+0x95/0xb0
[  116.435434]  handle_bug+0x41/0x80
[  116.435437]  exc_invalid_op+0x14/0x60
[  116.435440]  asm_exc_invalid_op+0x12/0x20
[  116.435446] RIP: 0010:eventfd_signal+0x79/0x90
[  116.435449] Code: 01 00 00 00 be 03 00 00 00 4c 89 ef e8 30 ee e6
ff 65 ff 0d 69 bf 39 43 4c 89 ef e8 51 a7 77 00 4c 89 e0 5b 5d 41 5c
41 5d c3 <0f> 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d c3 0f 1f 84 00 00
00 00
[  116.435451] RSP: 0018:ffffb0e480793d50 EFLAGS: 00010202
[  116.435452] RAX: 0000000000000001 RBX: ffff9ff7c5e94b10 RCX: 0000000000000000
[  116.435454] RDX: 0000000000008ac4 RSI: 0000000000000001 RDI: ffff9ff7c5e144b8
[  116.435455] RBP: 0000000000000100 R08: 0000000000000000 R09: 0000000000000000
[  116.435456] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9ff7c53a3c40
[  116.435457] R13: ffffb0e480793e18 R14: ffff9ff7c5e90000 R15: ffff9ff7c5e94b10
[  116.435462]  vhost_tx_batch.constprop.0+0xa4/0x170
[  116.435474]  handle_tx_copy+0x156/0x570
[  116.435477]  ? __vhost_add_used_n+0x210/0x210
[  116.435479]  handle_tx+0xa0/0xe0
[  116.435481]  vhost_worker+0x8e/0xd0
[  116.435483]  kthread+0x17c/0x1a0
[  116.435491]  ? __kthread_parkme+0xa0/0xa0
[  116.435495]  ret_from_fork+0x22/0x30
[  116.435500] RIP: 0010:eventfd_signal+0x79/0x90
[  116.435503] Code: 01 00 00 00 be 03 00 00 00 4c 89 ef e8 30 ee e6
ff 65 ff 0d 69 bf 39 43 4c 89 ef e8 51 a7 77 00 4c 89 e0 5b 5d 41 5c
41 5d c3 <0f> 0b 45 31 e4 5b 5d 4c 89 e0 41 5c 41 5d c3 0f 1f 84 00 00
00 00
[  116.435504] RSP: 0018:ffffb0e480793d50 EFLAGS: 00010202
[  116.435505] RAX: 0000000000000001 RBX: ffff9ff7c5e94b10 RCX: 0000000000000000
[  116.435506] RDX: 0000000000008ac4 RSI: 0000000000000001 RDI: ffff9ff7c5e144b8
[  116.435507] RBP: 0000000000000100 R08: 0000000000000000 R09: 0000000000000000
[  116.435508] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9ff7c53a3c40
[  116.435509] R13: ffffb0e480793e18 R14: ffff9ff7c5e90000 R15: ffff9ff7c5e94b10
[  116.435510] FS:  0000000000000000(0000) GS:ffff9ff7fbd00000(0000)
knlGS:0000000000000000
[  116.435513] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  116.435515] CR2: 00007fff835d2080 CR3: 0000000107968002 CR4: 0000000000372ee0
[  116.435516] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  116.435517] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  116.435518] Call Trace:
[  116.435519]  vhost_tx_batch.constprop.0+0xa4/0x170
[  116.435521]  handle_tx_copy+0x156/0x570
[  116.435523]  ? __vhost_add_used_n+0x210/0x210
[  116.435525]  handle_tx+0xa0/0xe0
[  116.435527]  vhost_worker+0x8e/0xd0
[  116.435529]  kthread+0x17c/0x1a0
[  116.435531]  ? __kthread_parkme+0xa0/0xa0
[  116.435532]  ret_from_fork+0x22/0x30

The tap interface will still work but for about as long as it took to
crash, as if the situation needs to happen a second time for the
interface to stop working.
Anyway, the tap interface will fail shortly after: it will still be
there, but no trafic will flow between the guest and host (with no
kernel log or error message this time).

We can usually get the interface back up, by issuing "ip link set tap0
down", "ip link set tap0 up" from within the guest.
If we try the experiment once again, the tap will fail but we won't
get a new backtrace.
We get the backtrace only once, and we need to restart the host if we
want to see it again.
It's like the crash sets the kernel in an "unstable mode", in which
the vhost-net interfaces will easily fail (probably when the situation
will present itself for the "second" time).

A first analyzis was done on the #linux-rt irc channel:
"huh, looks like the underlying issue is hitting the WARN_ON_ONCE()
condition in eventfd_signal(), but in the bug handler (which
apparently is preemptible), we get into print_stop_info() and try to
use smp_processor_id() in that context, which is wrong because thigns
are preemptible.
I can't see the smp_processor_id() usage in mainlin (I don't have an
RT tree to hand), but someone with more RT knowledge might want to
look at that"

We were recommended to post to LKLM for further analysis.
Thanks.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Bug in vhost-net with RT kernels with SMP
  2021-12-15 15:44 Bug in vhost-net with RT kernels with SMP Florent Carli
@ 2021-12-17 10:32 ` Sebastian Andrzej Siewior
  2021-12-17 10:32   ` [PATCH 1/2] stop_machine: Remove this_cpu_ptr() from print_stop_info() Sebastian Andrzej Siewior
  2021-12-17 10:32   ` [PATCH 2/2] eventfd: Make signal recursion protection a task bit Sebastian Andrzej Siewior
  0 siblings, 2 replies; 4+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-12-17 10:32 UTC (permalink / raw)
  To: fcarli
  Cc: linux-kernel, Luis Claudio R . Goncalves, stable-rt, Thomas Gleixner

On 2021-12-15 16:44:12 [+0100], Florent Carli wrote:
> Hello all,
Hi,
…

I tried to reproduce this but nothing. However…

> If we try the experiment once again, the tap will fail but we won't
> get a new backtrace.

The backtrace is meant to appear only once.

> A first analyzis was done on the #linux-rt irc channel:
> "huh, looks like the underlying issue is hitting the WARN_ON_ONCE()
> condition in eventfd_signal(), but in the bug handler (which
> apparently is preemptible), we get into print_stop_info() and try to
> use smp_processor_id() in that context, which is wrong because thigns
> are preemptible.
> I can't see the smp_processor_id() usage in mainlin (I don't have an
> RT tree to hand), but someone with more RT knowledge might want to
> look at that"

So that smp_processor_id() thingy is sad because the code changed before
it hit upstream and I didn't notice it.
Patch #1 contains the missing bits and avoids the backtrace from
WARN_ON_ONCE().
Patch #2 is a backport from upstream avoiding trigger WARN_ON_ONCE in
the first place.
Could you please try these two if they help?

> Thanks.

Sebastian



^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH 1/2] stop_machine: Remove this_cpu_ptr() from print_stop_info().
  2021-12-17 10:32 ` Sebastian Andrzej Siewior
@ 2021-12-17 10:32   ` Sebastian Andrzej Siewior
  2021-12-17 10:32   ` [PATCH 2/2] eventfd: Make signal recursion protection a task bit Sebastian Andrzej Siewior
  1 sibling, 0 replies; 4+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-12-17 10:32 UTC (permalink / raw)
  To: fcarli
  Cc: linux-kernel, Luis Claudio R . Goncalves, stable-rt,
	Thomas Gleixner, Sebastian Andrzej Siewior

This aligns the patch ("stop_machine: Add function and caller debug
info) with commit
  a8b62fd085050 ("stop_machine: Add function and caller debug info")

that was merged upstream and is slightly different.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/stop_machine.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index dbf585cf4b9f8..971d8acceaecb 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -51,7 +51,11 @@ static bool stop_machine_initialized = false;
 
 void print_stop_info(const char *log_lvl, struct task_struct *task)
 {
-	struct cpu_stopper *stopper = this_cpu_ptr(&cpu_stopper);
+	/*
+	 * If @task is a stopper task, it cannot migrate and task_cpu() is
+	 * stable.
+	 */
+	struct cpu_stopper *stopper = per_cpu_ptr(&cpu_stopper, task_cpu(task));
 
 	if (task != stopper->thread)
 		return;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 2/2] eventfd: Make signal recursion protection a task bit
  2021-12-17 10:32 ` Sebastian Andrzej Siewior
  2021-12-17 10:32   ` [PATCH 1/2] stop_machine: Remove this_cpu_ptr() from print_stop_info() Sebastian Andrzej Siewior
@ 2021-12-17 10:32   ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 4+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-12-17 10:32 UTC (permalink / raw)
  To: fcarli
  Cc: linux-kernel, Luis Claudio R . Goncalves, stable-rt,
	Thomas Gleixner, Daniel Bristot de Oliveira, Jason Wang, Al Viro,
	Sebastian Andrzej Siewior

From: Thomas Gleixner <tglx@linutronix.de>

Upstream commit b542e383d8c005f06a131e2b40d5889b812f19c6

The recursion protection for eventfd_signal() is based on a per CPU
variable and relies on the !RT semantics of spin_lock_irqsave() for
protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither
disables preemption nor interrupts which allows the spin lock held section
to be preempted. If the preempting task invokes eventfd_signal() as well,
then the recursion warning triggers.

Paolo suggested to protect the per CPU variable with a local lock, but
that's heavyweight and actually not necessary. The goal of this protection
is to prevent the task stack from overflowing, which can be achieved with a
per task recursion protection as well.

Replace the per CPU variable with a per task bit similar to other recursion
protection bits like task_struct::in_page_owner. This works on both !RT and
RT kernels and removes as a side effect the extra per CPU storage.

No functional change for !RT kernels.

Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/87wnp9idso.ffs@tglx
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 fs/aio.c                |  2 +-
 fs/eventfd.c            | 12 +++++-------
 include/linux/eventfd.h | 11 +++++------
 include/linux/sched.h   |  4 ++++
 4 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 76ce0cc3ee4ec..51b08ab01dffc 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1695,7 +1695,7 @@ static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 		list_del(&iocb->ki_list);
 		iocb->ki_res.res = mangle_poll(mask);
 		req->done = true;
-		if (iocb->ki_eventfd && eventfd_signal_count()) {
+		if (iocb->ki_eventfd && eventfd_signal_allowed()) {
 			iocb = NULL;
 			INIT_WORK(&req->work, aio_poll_put_work);
 			schedule_work(&req->work);
diff --git a/fs/eventfd.c b/fs/eventfd.c
index df466ef81dddf..9035ca60bfcf3 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -25,8 +25,6 @@
 #include <linux/idr.h>
 #include <linux/uio.h>
 
-DEFINE_PER_CPU(int, eventfd_wake_count);
-
 static DEFINE_IDA(eventfd_ida);
 
 struct eventfd_ctx {
@@ -67,21 +65,21 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 	 * Deadlock or stack overflow issues can happen if we recurse here
 	 * through waitqueue wakeup handlers. If the caller users potentially
 	 * nested waitqueues with custom wakeup handlers, then it should
-	 * check eventfd_signal_count() before calling this function. If
-	 * it returns true, the eventfd_signal() call should be deferred to a
+	 * check eventfd_signal_allowed() before calling this function. If
+	 * it returns false, the eventfd_signal() call should be deferred to a
 	 * safe context.
 	 */
-	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+	if (WARN_ON_ONCE(current->in_eventfd_signal))
 		return 0;
 
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
-	this_cpu_inc(eventfd_wake_count);
+	current->in_eventfd_signal = 1;
 	if (ULLONG_MAX - ctx->count < n)
 		n = ULLONG_MAX - ctx->count;
 	ctx->count += n;
 	if (waitqueue_active(&ctx->wqh))
 		wake_up_locked_poll(&ctx->wqh, EPOLLIN);
-	this_cpu_dec(eventfd_wake_count);
+	current->in_eventfd_signal = 0;
 	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
 	return n;
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index dc4fd8a6644dd..836b4c021a0a4 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -14,6 +14,7 @@
 #include <linux/err.h>
 #include <linux/percpu-defs.h>
 #include <linux/percpu.h>
+#include <linux/sched.h>
 
 /*
  * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
@@ -42,11 +43,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n);
 int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *wait,
 				  __u64 *cnt);
 
-DECLARE_PER_CPU(int, eventfd_wake_count);
-
-static inline bool eventfd_signal_count(void)
+static inline bool eventfd_signal_allowed(void)
 {
-	return this_cpu_read(eventfd_wake_count);
+	return !current->in_eventfd_signal;
 }
 
 #else /* CONFIG_EVENTFD */
@@ -77,9 +76,9 @@ static inline int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx,
 	return -ENOSYS;
 }
 
-static inline bool eventfd_signal_count(void)
+static inline bool eventfd_signal_allowed(void)
 {
-	return false;
+	return true;
 }
 
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 409a24036952c..29e6ff1af1df9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -852,6 +852,10 @@ struct task_struct {
 	/* Stalled due to lack of memory */
 	unsigned			in_memstall:1;
 #endif
+#ifdef CONFIG_EVENTFD
+	/* Recursion prevention for eventfd_signal() */
+	unsigned			in_eventfd_signal:1;
+#endif
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-12-17 10:32 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-15 15:44 Bug in vhost-net with RT kernels with SMP Florent Carli
2021-12-17 10:32 ` Sebastian Andrzej Siewior
2021-12-17 10:32   ` [PATCH 1/2] stop_machine: Remove this_cpu_ptr() from print_stop_info() Sebastian Andrzej Siewior
2021-12-17 10:32   ` [PATCH 2/2] eventfd: Make signal recursion protection a task bit Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).