linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff Layton <jlayton@poochiereds.net>
To: Chris Worley <chris.worley@primarydata.com>
Cc: Tejun Heo <tj@kernel.org>,
	linux-kernel@vger.kernel.org, bfields@fieldses.org,
	Michael Skralivetsky <michael.skralivetsky@primarydata.com>,
	Trond Myklebust <trond.myklebust@primarydata.com>,
	Shaohua Li <shli@fb.com>
Subject: Re: timer code oops when calling mod_delayed_work
Date: Tue, 3 Nov 2015 12:55:04 -0500	[thread overview]
Message-ID: <20151103125504.6649138f@tlielax.poochiereds.net> (raw)
In-Reply-To: <20151102203339.7ed8f2bb@synchrony.poochiereds.net>

On Mon, 2 Nov 2015 20:33:39 -0500
Jeff Layton <jlayton@poochiereds.net> wrote:

> On Mon, 2 Nov 2015 14:56:33 -0500
> Jeff Layton <jlayton@poochiereds.net> wrote:
> 
> > On Mon, 2 Nov 2015 09:48:41 -1000
> > Chris Worley <chris.worley@primarydata.com> wrote:
> > 
> > > On Sat, Oct 31, 2015 at 11:54 AM, Jeff Layton <jlayton@poochiereds.net> wrote:
> > > > On Sat, 31 Oct 2015 17:31:07 -0400
> > > > Tejun Heo <tj@kernel.org> wrote:
> > > ...
> > > >>
> > > >> > I have asked Chris and Michael to see if they can bisect it down, but
> > > >> > it may be a bit before they can get that done. Any insight you might
> > > >> > have in the meantime would helpful.
> > > >>
> > > >> Yeah, I'd love to find out how reproducible the issue is.  If the
> > > >> problem is rarely reproducible, it might make sense to try
> > > >> instrumentation before trying bisection as it *could* be a latent bug
> > > >> which has been there all along and bisecting to the commit introducing
> > > >> the code wouldn't help us too much.
> > > >>
> > > >
> > > > It seems fairly reproducible, at least on v4.3-rc7 kernels:
> > > >
> > > > This came about when I asked them to perf test some nfsd patches that I
> > > > have queued up. I patched a Fedora 4.3-rc7 kernel and wanted to see
> > > > what the perf delta was (with NFSv3, fwiw):
> > > >
> > > >     Patched kernels here: http://koji.fedoraproject.org/koji/taskinfo?taskID=11598089
> > > >
> > > >     Unpatched kernels here: http://koji.fedoraproject.org/koji/buildinfo?buildID=694377
> > > >
> > > > Michael was using the SPEC SFS VDI workload to test, and was able to
> > > > get the same panic on both kernels. So it does seem to be reproducible.
> > > > It might even be possible to tune the VM to make the shrinker fire more
> > > > often, which may help tickle this more.
> > > >
> > > > In any case, I've asked them to try something v4.2-ish and see if it's
> > > > reproducible there, and then try v4.1 if it is. I figure anything
> > > > earlier is probably not worth testing if it still fails on v4.1. If it
> > > > turns out not to be reproducible on those earlier kernels then we can
> > > > bisect from there to track it down.
> > > 
> > > The trick seems to be the NFS thread count: I initially though this
> > > was SFS/VDI specific, but when I ratcheted up the thread count to what
> > > Michael was using, 256 threads oopses on fio (throughput) benchmarks
> > > too.
> > > 
> > > In bisecting kernels, it appeared between 4.2.3-200 and 4.2.5-200 (all
> > > the 4.2.4 kernels were bad).
> > > 
> > > Jeff has a lead on this...
> > > 
> > > Chris
> > 
> > Thanks Chris. This commit went in between those kernels:
> > 
> > commit a7c571f2e3ff9243ba41c242486f53dbca37d864
> > Author: Shaohua Li <shli@fb.com>
> > Date:   Wed Sep 30 09:05:30 2015 -0700
> > 
> >     workqueue: make sure delayed work run in local cpu
> >     
> >     commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.
> >     
> > 
> > ...and I suspect that it's the culprit. That commit causes this code to
> > always use add_timer_on, which seems to have different semantics from
> > add_timer.
> > 
> > I'm going to build a v4.2.5 kernel with that patch reverted to confirm
> > it, but it seems likely...
> > 
> 
> (dropping Lai's address, and adding Shaohua Li's)
> 
> Ok, I built a kernel with that patch reverted and that seems to fix the
> problem.
> 
> Looking at the patch, I guess the main difference is that we're no
> longer using add_timer for unbound workqueue tasks. That said, we
> should have possession of the PENDING bit before calling add_timer_on.
> 
> So...I'm a little stumped. Any thoughts as to where the race could be?

Ok, I've been able to reproduce this with a small module that I wrote
as a reproducer:

[  197.502139] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  197.503073] IP: [<ffffffff8112d67c>] detach_if_pending+0x6c/0x190
[  197.503752] PGD 0 
[  197.504061] Oops: 0002 [#1] SMP 
[  197.504449] Modules linked in: wqthrash(OE-) snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core xfs snd_hwdep snd_seq libcrc32c snd_seq_device ppdev snd_pcm snd_timer snd soundcore parport_pc e1000 i2c_piix4 parport joydev virtio_balloon pvpanic acpi_cpufreq nfsd nfs_acl lockd auth_rpcgss grace sunrpc virtio_blk virtio_console qxl drm_kms_helper ttm virtio_pci virtio_ring serio_raw virtio drm ata_generic pata_acpi
[  197.509358] CPU: 1 PID: 794 Comm: kworker/u8:9 Tainted: G           OE   4.3.0-0.rc7.git2.1.fc24.x86_64 #1
[  197.510767] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  197.511850] Workqueue: wqthrash wqthrash_workfunc [wqthrash]
[  197.513141] task: ffff88020f918000 ti: ffff8802143a8000 task.ti: ffff8802143a8000
[  197.513958] RIP: 0010:[<ffffffff8112d67c>]  [<ffffffff8112d67c>] detach_if_pending+0x6c/0x190
[  197.514927] RSP: 0018:ffff8802143abc68  EFLAGS: 00010086
[  197.515539] RAX: dead000000000200 RBX: ffffffffa022e050 RCX: 000000000000000f
[  197.516552] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
[  197.517374] RBP: ffff8802143abc88 R08: 0000000000000001 R09: 0000000000000001
[  197.518194] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[  197.518992] R13: ffff880216fce2c0 R14: 0000000000000001 R15: 00000000000003e8
[  197.519817] FS:  0000000000000000(0000) GS:ffff880217000000(0000) knlGS:0000000000000000
[  197.520738] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  197.521410] CR2: 0000000000000000 CR3: 00000000daa9b000 CR4: 00000000000006e0
[  197.522234] Stack:
[  197.522584]  ffffffffa022e050 0000000000000000 ffff880216fce2c0 ffff88021688f800
[  197.524102]  ffff8802143abcc0 ffffffff8112db72 0000000000000086 00000000ba809f75
[  197.525594]  ffffffffa022e000 ffff8802143abd00 0000000000000001 ffff8802143abcf0
[  197.527082] Call Trace:
[  197.527581]  [<ffffffff8112db72>] del_timer+0x62/0xa0
[  197.532565]  [<ffffffff810c913e>] try_to_grab_pending+0xce/0x190
[  197.533253]  [<ffffffff810c9242>] mod_delayed_work_on+0x42/0xb0
[  197.533965]  [<ffffffffa022c081>] wqthrash_workfunc+0x61/0x90 [wqthrash]
[  197.534766]  [<ffffffff810c8470>] process_one_work+0x230/0x6a0
[  197.535683]  [<ffffffff810c83d9>] ? process_one_work+0x199/0x6a0
[  197.536378]  [<ffffffff810c892e>] worker_thread+0x4e/0x450
[  197.537004]  [<ffffffff810c88e0>] ? process_one_work+0x6a0/0x6a0
[  197.537724]  [<ffffffff810cf981>] kthread+0x101/0x120
[  197.538662]  [<ffffffff81104fa9>] ? trace_hardirqs_on_caller+0x129/0x1b0
[  197.539926]  [<ffffffff810cf880>] ? kthread_create_on_node+0x250/0x250
[  197.541160]  [<ffffffff81864fef>] ret_from_fork+0x3f/0x70
[  197.542187]  [<ffffffff810cf880>] ? kthread_create_on_node+0x250/0x250
[  197.543416] Code: 3b 89 e4 00 e8 06 64 ff ff 85 c0 74 0d 80 3d 45 61 e4 00 00 0f 84 c4 00 00 00 65 ff 0d de e7 ed 7e 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 45 84 f6 74 08 48 c7 43 08 00 00 00 
[  197.547527] RIP  [<ffffffff8112d67c>] detach_if_pending+0x6c/0x190
[  197.548481]  RSP <ffff8802143abc68>
[  197.549220] CR2: 0000000000000000
[  197.549883] ---[ end trace 10dec6677e9e3a68 ]---

Apply this patch, build a wqthrash.ko module, plug it in and watch it
explode. It basically just spawns a bunch of (immediate) workqueue jobs
that attempt to schedule and modify a single piece of delayed_work.

I still don't have a lead on the actual bug, but this does seem to
tickle it fairly easily.

-----------------------8<------------------------------

[PATCH] wqthrash: add module that will thrash the workqueue code

...let's see if we can tickle the races in the delayed_work
infrastructure.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
---
 samples/Makefile            |  3 +-
 samples/wqthrash/Makefile   |  1 +
 samples/wqthrash/wqthrash.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 samples/wqthrash/Makefile
 create mode 100644 samples/wqthrash/wqthrash.c

diff --git a/samples/Makefile b/samples/Makefile
index f00257bcc5a7..7e92dc91879d 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,5 @@
 # Makefile for Linux samples code
 
 obj-$(CONFIG_SAMPLES)	+= kobject/ kprobes/ trace_events/ livepatch/ \
-			   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/
+			   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ \
+			   seccomp/ wqthrash/
diff --git a/samples/wqthrash/Makefile b/samples/wqthrash/Makefile
new file mode 100644
index 000000000000..da4c36a2c7d5
--- /dev/null
+++ b/samples/wqthrash/Makefile
@@ -0,0 +1 @@
+obj-m += wqthrash.o
diff --git a/samples/wqthrash/wqthrash.c b/samples/wqthrash/wqthrash.c
new file mode 100644
index 000000000000..b1c52c20fb8f
--- /dev/null
+++ b/samples/wqthrash/wqthrash.c
@@ -0,0 +1,94 @@
+/*
+ * wqthrash.c -- thrash the delayed workqueue rescheduling code
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/delay.h>
+#include <linux/sched.h>
+
+#define	ITERATIONS	(102400)
+#define	DELAY		(HZ)
+#define	THRASHERS	(256)
+
+static void
+dummy_work(struct work_struct *unused)
+{
+	schedule_timeout_uninterruptible(1);
+}
+
+static DECLARE_DELAYED_WORK(wqthrash_delayed_work, dummy_work);
+
+static void
+wqthrash_workfunc(struct work_struct *unused)
+{
+	int	i;
+
+	for (i = 0; i < ITERATIONS; ++i) {
+		schedule_delayed_work(&wqthrash_delayed_work, DELAY);
+		cond_resched();
+		mod_delayed_work(system_wq, &wqthrash_delayed_work, DELAY);
+		cond_resched();
+	}
+}
+
+static struct workqueue_struct	*wq;
+static struct work_struct	*tw;
+
+static int
+wqthrash_init(void)
+{
+	int			i;
+
+	wq = alloc_workqueue("wqthrash", WQ_UNBOUND, 0);
+	if (!wq)
+		return -ENOMEM;
+
+	tw = kcalloc(THRASHERS, sizeof(*tw), GFP_KERNEL);
+	if (!tw) {
+		destroy_workqueue(wq);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < THRASHERS; ++i) {
+		INIT_WORK(&tw[i], wqthrash_workfunc);
+		queue_work(wq, &tw[i]);
+	}
+	return 0;
+}
+
+static void
+wqthrash_exit(void)
+{
+	int			i;
+
+	for (i = 0; i < THRASHERS; ++i)
+		flush_work(&tw[i]);
+
+	kfree(tw);
+	destroy_workqueue(wq);
+	cancel_delayed_work_sync(&wqthrash_delayed_work);
+}
+
+module_init(wqthrash_init);
+module_exit(wqthrash_exit);
+MODULE_LICENSE("GPL");
-- 
2.4.3


  reply	other threads:[~2015-11-03 17:55 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-29 14:31 timer code oops when calling mod_delayed_work Jeff Layton
2015-10-29 17:58 ` Jeff Layton
2015-10-31  2:00   ` Tejun Heo
2015-10-31 11:34     ` Jeff Layton
2015-10-31 21:31       ` Tejun Heo
2015-10-31 21:54         ` Jeff Layton
2015-11-02 19:48           ` Chris Worley
2015-11-02 19:56             ` Jeff Layton
2015-11-03  1:33               ` Jeff Layton
2015-11-03 17:55                 ` Jeff Layton [this message]
2015-11-03 22:54                   ` Tejun Heo
2015-11-04  0:06                     ` Tejun Heo
2015-11-04 11:48                       ` Jeff Layton
2015-11-04 17:15                         ` [PATCH] timer: add_timer_on() should perform proper migration Tejun Heo
2015-11-04 19:27                           ` [tip:timers/urgent] timers: Use proper base migration in add_timer_on() tip-bot for Tejun Heo
2015-11-04 19:35                           ` [PATCH] timer: add_timer_on() should perform proper migration Thomas Gleixner
2015-11-04 19:43                             ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20151103125504.6649138f@tlielax.poochiereds.net \
    --to=jlayton@poochiereds.net \
    --cc=bfields@fieldses.org \
    --cc=chris.worley@primarydata.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=michael.skralivetsky@primarydata.com \
    --cc=shli@fb.com \
    --cc=tj@kernel.org \
    --cc=trond.myklebust@primarydata.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).