From: Dave Chinner <david@fromorbit.com>
To: Chris Leech <cleech@redhat.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Lee Duncan <lduncan@suse.com>,
open-iscsi@googlegroups.com,
Linux SCSI List <linux-scsi@vger.kernel.org>,
linux-block@vger.kernel.org, Christoph Hellwig <hch@lst.de>
Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
Date: Thu, 22 Dec 2016 16:13:22 +1100
Message-ID: <20161222051322.GF4758@dastard>
In-Reply-To: <20161222001303.nvrtm22szn3hgxar@straylight.hirudinean.org>
On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Hi,
> >
> > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > >> Thanks Dave,
> > >>
> > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > >> modules loaded (virtio block) so there's something else going on in the
> > >> current merge window. I'll keep an eye on it and make sure there's
> > >> nothing iSCSI needs fixing for.
> > >
> > > OK, so before this slips through the cracks.....
> > >
> > > Linus - your tree as of a few minutes ago still panics immediately
> > > when starting xfstests on iscsi devices. It appears to be a
> > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > seem to have bounced it and no-one is looking at it.
> >
> > Hmm. There's not much to go by.
> >
> > Can somebody in iscsi-land please try to just bisect it - I'm not
> > seeing a lot of clues to where this comes from otherwise.
>
> Yeah, my hopes of this being quickly resolved by someone else didn't
> work out and whatever is going on in that test VM is looking like a
> different kind of odd. I'm saving that off for later, and seeing if I
> can't do a bisect on the iSCSI issue.
There may be deeper issues. I just started running scalability tests
(e.g. 16-way fsmark create tests) and about a minute in I got a
directory corruption reported - something I hadn't seen in the dev
cycle at all. I unmounted the fs, mkfs'd it again, ran the
workload again and about a minute in this fired:
[628867.607417] ------------[ cut here ]------------
[628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
[628867.610702] Modules linked in:
[628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: G W 4.9.0-dgc #18
[628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[628867.616179] Workqueue: events rht_deferred_worker
[628867.632422] Call Trace:
[628867.634691] dump_stack+0x63/0x83
[628867.637937] __warn+0xcb/0xf0
[628867.641359] warn_slowpath_null+0x1d/0x20
[628867.643362] shadow_lru_isolate+0x171/0x220
[628867.644627] __list_lru_walk_one.isra.11+0x79/0x110
[628867.645780] ? __list_lru_init+0x70/0x70
[628867.646628] list_lru_walk_one+0x17/0x20
[628867.647488] scan_shadow_nodes+0x34/0x50
[628867.648358] shrink_slab.part.65.constprop.86+0x1dc/0x410
[628867.649506] shrink_node+0x57/0x90
[628867.650233] do_try_to_free_pages+0xdd/0x230
[628867.651157] try_to_free_pages+0xce/0x1a0
[628867.652342] __alloc_pages_slowpath+0x2df/0x960
[628867.653332] ? __might_sleep+0x4a/0x80
[628867.654148] __alloc_pages_nodemask+0x24b/0x290
[628867.655237] kmalloc_order+0x21/0x50
[628867.656016] kmalloc_order_trace+0x24/0xc0
[628867.656878] __kmalloc+0x17d/0x1d0
[628867.657644] bucket_table_alloc+0x195/0x1d0
[628867.658564] ? __might_sleep+0x4a/0x80
[628867.659449] rht_deferred_worker+0x287/0x3c0
[628867.660366] ? _raw_spin_unlock_irq+0xe/0x30
[628867.661294] process_one_work+0x1de/0x4d0
[628867.662208] worker_thread+0x4b/0x4f0
[628867.662990] kthread+0x10c/0x140
[628867.663687] ? process_one_work+0x4d0/0x4d0
[628867.664564] ? kthread_create_on_node+0x40/0x40
[628867.665523] ret_from_fork+0x25/0x30
[628867.666317] ---[ end trace 7c38634006a9955e ]---
Now, this workload does not touch the page cache at all - it's
entirely an XFS metadata workload, so it should not really be
affecting the working set code.
And worse, on that last error, the /host/ is now going into meltdown
(running 4.7.5) with 32 CPUs all burning down in ACPI code:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
35074 root      -2   0       0      0      0 R  99.0  0.0  12:38.92 acpi_pad/12
35079 root      -2   0       0      0      0 R  99.0  0.0  12:39.40 acpi_pad/16
35080 root      -2   0       0      0      0 R  99.0  0.0  12:39.29 acpi_pad/17
35085 root      -2   0       0      0      0 R  99.0  0.0  12:39.35 acpi_pad/22
35087 root      -2   0       0      0      0 R  99.0  0.0  12:39.13 acpi_pad/24
35090 root      -2   0       0      0      0 R  99.0  0.0  12:38.89 acpi_pad/27
35093 root      -2   0       0      0      0 R  99.0  0.0  12:38.88 acpi_pad/30
35063 root      -2   0       0      0      0 R  98.1  0.0  12:40.64 acpi_pad/1
35065 root      -2   0       0      0      0 R  98.1  0.0  12:40.38 acpi_pad/3
35066 root      -2   0       0      0      0 R  98.1  0.0  12:40.30 acpi_pad/4
35067 root      -2   0       0      0      0 R  98.1  0.0  12:40.82 acpi_pad/5
35077 root      -2   0       0      0      0 R  98.1  0.0  12:39.65 acpi_pad/14
35078 root      -2   0       0      0      0 R  98.1  0.0  12:39.58 acpi_pad/15
35081 root      -2   0       0      0      0 R  98.1  0.0  12:39.32 acpi_pad/18
35072 root      -2   0       0      0      0 R  96.2  0.0  12:40.14 acpi_pad/10
35073 root      -2   0       0      0      0 R  96.2  0.0  12:39.39 acpi_pad/11
35076 root      -2   0       0      0      0 R  96.2  0.0  12:39.39 acpi_pad/13
35084 root      -2   0       0      0      0 R  96.2  0.0  12:39.06 acpi_pad/21
35092 root      -2   0       0      0      0 R  96.2  0.0  12:39.14 acpi_pad/29
35069 root      -2   0       0      0      0 R  95.2  0.0  12:40.71 acpi_pad/7
35068 root      -2   0       0      0      0 R  94.2  0.0  12:40.29 acpi_pad/6
35062 root      -2   0       0      0      0 D  93.3  0.0  12:40.56 acpi_pad/0
35064 root      -2   0       0      0      0 D  92.3  0.0  12:40.18 acpi_pad/2
35082 root      -2   0       0      0      0 R  92.3  0.0  12:39.64 acpi_pad/19
35083 root      -2   0       0      0      0 R  92.3  0.0  12:38.98 acpi_pad/20
35086 root      -2   0       0      0      0 R  92.3  0.0  12:40.11 acpi_pad/23
35088 root      -2   0       0      0      0 R  92.3  0.0  12:39.45 acpi_pad/25
35089 root      -2   0       0      0      0 R  92.3  0.0  12:39.11 acpi_pad/26
35070 root      -2   0       0      0      0 D  91.3  0.0  12:40.21 acpi_pad/8
35071 root      -2   0       0      0      0 D  91.3  0.0  12:39.98 acpi_pad/9
35091 root      -2   0       0      0      0 D  91.3  0.0  12:39.33 acpi_pad/28
perf top says:
65.98% [kernel] [k] power_saving_thread
3.27% [kernel] [k] native_queued_spin_lock_slowpath
1.61% [kernel] [k] native_write_msr
1.39% [kernel] [k] update_curr_rt
1.20% [kernel] [k] intel_pstate_update_util
1.01% [kernel] [k] __do_softirq
1.01% [kernel] [k] ktime_get
0.99% [kernel] [k] ktime_get_update_offsets_now
0.93% [kernel] [k] rcu_check_callbacks
0.90% [kernel] [k] _raw_spin_lock
0.88% [kernel] [k] perf_event_task_tick
0.82% [kernel] [k] native_irq_return_iret
0.81% [kernel] [k] run_timer_softirq
0.75% [kernel] [k] trigger_load_balance
No idea how to recover this, so I'm just going to reboot it. Back in
a bit.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com