xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
From: Boris Ostrovsky <boris.ostrovsky@oracle.com>
To: Steven Haigh <netwiz@crc.id.au>,
	xen-devel <xen-devel@lists.xenproject.org>,
	linux-kernel@vger.kernel.org
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Subject: Re: 4.4: INFO: rcu_sched self-detected stall on CPU
Date: Fri, 25 Mar 2016 08:23:27 -0400	[thread overview]
Message-ID: <56F52DBF.5080006__43930.2838909954$1458908685$gmane$org@oracle.com> (raw)
In-Reply-To: <56F4A816.3050505@crc.id.au>

On 03/24/2016 10:53 PM, Steven Haigh wrote:
> Hi all,
>
> Firstly, I've cross-posted this to xen-devel and the lkml - as this
> problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
> I have also CC'ed Greg KH for his awesome insight as maintainer.
>
> Please CC myself into replies - as I'm not a member of the kernel
> mailing list - I may miss replies from monitoring the archives.
>
> I've noticed recently that heavy disk IO is causing rcu_sched to detect
> stalls. The process mentioned usually goes to 100% CPU usage, and
> eventually processes start segfaulting and dying. The only fix to
> recover the system is to use 'xl destroy' to force-kill the VM and to
> start it again.
>
> The majority of these issues seem to mention ext4 in the trace. This may
> indicate an issue there - or may be a red herring.
>
> The gritty details:
> INFO: rcu_sched self-detected stall on CPU
> #0110-...: (20999 ticks this GP) idle=327/140000000000001/0
> softirq=1101493/1101493 fqs=6973
> #011 (t=21000 jiffies g=827095 c=827094 q=524)
> Task dump for CPU 0:
> rsync           R  running task        0  2446   2444 0x00000088
> ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000
> ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001
> ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00
> Call Trace:
> <IRQ>  [<ffffffff810a625f>] sched_show_task+0xaf/0x110
> [<ffffffff810a8699>] dump_cpu_task+0x39/0x40
> [<ffffffff810d0e5a>] rcu_dump_cpu_stacks+0x8a/0xc0
> [<ffffffff810d4884>] rcu_check_callbacks+0x424/0x7a0
> [<ffffffff810a91e1>] ? account_system_time+0x81/0x110
> [<ffffffff810a9481>] ? account_process_tick+0x61/0x160
> [<ffffffff810e8050>] ? tick_sched_do_timer+0x30/0x30
> [<ffffffff810d9749>] update_process_times+0x39/0x60
> [<ffffffff810e7aa6>] tick_sched_handle.isra.15+0x36/0x50
> [<ffffffff810e808d>] tick_sched_timer+0x3d/0x70
> [<ffffffff810da342>] __hrtimer_run_queues+0xf2/0x250
> [<ffffffff810da698>] hrtimer_interrupt+0xa8/0x190
> [<ffffffff8100c61e>] xen_timer_interrupt+0x2e/0x140
> [<ffffffff810c8555>] handle_irq_event_percpu+0x55/0x1e0
> [<ffffffff810cbbca>] handle_percpu_irq+0x3a/0x50
> [<ffffffff810c7d22>] generic_handle_irq+0x22/0x30
> [<ffffffff813e50ff>] __evtchn_fifo_handle_events+0x15f/0x180
> [<ffffffff813e5130>] evtchn_fifo_handle_events+0x10/0x20
> [<ffffffff813e2233>] __xen_evtchn_do_upcall+0x43/0x80
> [<ffffffff813e3ea0>] xen_evtchn_do_upcall+0x30/0x50
> [<ffffffff8165deb2>] xen_hvm_callback_vector+0x82/0x90
> <EOI>  [<ffffffff810baf0d>] ? queued_write_lock_slowpath+0x3d/0x80
> [<ffffffff8165bcce>] _raw_write_lock+0x1e/0x30

This looks to me like ext4 failing to grab a lock. Everything above it 
(in Xen code) is regular tick interrupt handling which detects the stall.

Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates 
any possible issues with pv locks.

Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg, 
/var/log/xen/) Are you oversubscribing your guest (CPU-wise)?

-boris

> [<ffffffff81289b23>] ext4_es_remove_extent+0x43/0xc0
> [<ffffffff812692f9>] ext4_clear_inode+0x39/0x80
> [<ffffffff8124cafd>] ext4_evict_inode+0x8d/0x4e0
> [<ffffffff811df537>] evict+0xb7/0x180
> [<ffffffff811df636>] dispose_list+0x36/0x50
> [<ffffffff811e054b>] prune_icache_sb+0x4b/0x60
> [<ffffffff811c92c1>] super_cache_scan+0x141/0x190
> [<ffffffff811655fe>] shrink_slab.part.37+0x1ee/0x390
> [<ffffffff8116950c>] shrink_zone+0x26c/0x280
> [<ffffffff8116988c>] do_try_to_free_pages+0x15c/0x410
> [<ffffffff81169bfa>] try_to_free_pages+0xba/0x170
> [<ffffffff8115d1e5>] __alloc_pages_nodemask+0x525/0xa60
> [<ffffffff811a839c>] ? kmem_cache_free+0xcc/0x2c0
> [<ffffffff811a120d>] alloc_pages_current+0x8d/0x120
> [<ffffffff81154601>] __page_cache_alloc+0x91/0xc0
> [<ffffffff81154b66>] pagecache_get_page+0x56/0x1e0
> [<ffffffff81154ee6>] grab_cache_page_write_begin+0x26/0x40
> [<ffffffff8124c811>] ext4_da_write_begin+0xa1/0x300
> [<ffffffff8124d3a4>] ? ext4_da_write_end+0x124/0x2b0
> [<ffffffff811543a0>] generic_perform_write+0xc0/0x1a0
> [<ffffffff81156408>] __generic_file_write_iter+0x188/0x1e0
> [<ffffffff812421d0>] ext4_file_write_iter+0xf0/0x340
> [<ffffffff811c59fa>] __vfs_write+0xaa/0xe0
> [<ffffffff811c6022>] vfs_write+0xa2/0x1a0
> [<ffffffff811c6d26>] SyS_write+0x46/0xa0
> [<ffffffff8165c02e>] entry_SYSCALL_64_fastpath+0x12/0x71
>
> Some 11 hours later:
> sshd[785]: segfault at 1f0 ip 00007f03bb94ae5c sp 00007ffe9eb54470 error
> 4 in ld-2.17.so[7f03bb940000+21000]
> sh[787]: segfault at 1f0 ip 00007f6b4a0dfe5c sp 00007ffe3d4a71e0 error 4
> in ld-2.17.so[7f6b4a0d5000+21000]
> systemd-cgroups[788]: segfault at 1f0 ip 00007f4baa82ce5c sp
> 00007ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000]
> sshd[791]: segfault at 1f0 ip 00007ff8c8a8ce5c sp 00007ffede9e1c20 error
> 4 in ld-2.17.so[7ff8c8a82000+21000]
> sshd[792]: segfault at 1f0 ip 00007f183cf75e5c sp 00007ffc81ab7160 error
> 4 in ld-2.17.so[7f183cf6b000+21000]
> sshd[793]: segfault at 1f0 ip 00007f3c665ece5c sp 00007ffd9a13c850 error
> 4 in ld-2.17.so[7f3c665e2000+21000]
>
>  From isolated testing, this does not occur on kernel 4.5.x - however I
> have not verified this myself.
>
> The kernel config used can be found in the kernel-xen git repo if it
> assists in debugging:
> 	http://xen.crc.id.au/git/?p=kernel-xen;a=summary
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

  reply	other threads:[~2016-03-25 12:23 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-25  2:53 4.4: INFO: rcu_sched self-detected stall on CPU Steven Haigh
2016-03-25 12:23 ` Boris Ostrovsky [this message]
     [not found] ` <56F52DBF.5080006@oracle.com>
2016-03-25 14:05   ` Steven Haigh
     [not found]   ` <56F545B1.8080609@crc.id.au>
2016-03-25 14:44     ` Boris Ostrovsky
     [not found]     ` <56F54EE0.6030004@oracle.com>
2016-03-25 16:04       ` Steven Haigh
     [not found]       ` <56F56172.9020805@crc.id.au>
2016-03-25 16:20         ` Boris Ostrovsky
     [not found]         ` <56F5653B.1090700@oracle.com>
2016-03-25 21:07           ` Steven Haigh
     [not found]           ` <56F5A87A.8000903@crc.id.au>
2016-03-29  8:56             ` Steven Haigh
     [not found]             ` <56FA4336.2030301@crc.id.au>
2016-03-29 14:14               ` Boris Ostrovsky
     [not found]               ` <56FA8DDD.7070406@oracle.com>
2016-03-29 17:44                 ` Steven Haigh
     [not found]                 ` <56FABF17.7090608@crc.id.au>
2016-03-29 18:04                   ` Steven Haigh
     [not found]                   ` <56FAC3AC.9050802@crc.id.au>
2016-03-30 13:44                     ` Boris Ostrovsky
2016-05-02 20:54                     ` gregkh
2016-04-02  1:50                 ` Steven Haigh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='56F52DBF.5080006__43930.2838909954$1458908685$gmane$org@oracle.com' \
    --to=boris.ostrovsky@oracle.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netwiz@crc.id.au \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).