xen-devel.lists.xenproject.org archive mirror
* 4.4: INFO: rcu_sched self-detected stall on CPU
@ 2016-03-25  2:53 Steven Haigh
  2016-03-25 12:23 ` Boris Ostrovsky
       [not found] ` <56F52DBF.5080006@oracle.com>
  0 siblings, 2 replies; 14+ messages in thread
From: Steven Haigh @ 2016-03-25  2:53 UTC (permalink / raw)
  To: xen-devel, linux-kernel; +Cc: gregkh



Hi all,

Firstly, I've cross-posted this to xen-devel and the lkml - as this
problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
I have also CC'ed Greg KH for his awesome insight as maintainer.

Please CC myself into replies - as I'm not a member of the kernel
mailing list - I may miss replies from monitoring the archives.

I've noticed recently that heavy disk IO is causing rcu_sched to detect
stalls. The process mentioned usually goes to 100% CPU usage, and
eventually processes start segfaulting and dying. The only fix to
recover the system is to use 'xl destroy' to force-kill the VM and to
start it again.

The majority of these issues seem to mention ext4 in the trace. This may
indicate an issue there - or may be a red herring.

The gritty details:
INFO: rcu_sched self-detected stall on CPU
#0110-...: (20999 ticks this GP) idle=327/140000000000001/0
softirq=1101493/1101493 fqs=6973
#011 (t=21000 jiffies g=827095 c=827094 q=524)
Task dump for CPU 0:
rsync           R  running task        0  2446   2444 0x00000088
ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000
ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001
ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00
Call Trace:
<IRQ>  [<ffffffff810a625f>] sched_show_task+0xaf/0x110
[<ffffffff810a8699>] dump_cpu_task+0x39/0x40
[<ffffffff810d0e5a>] rcu_dump_cpu_stacks+0x8a/0xc0
[<ffffffff810d4884>] rcu_check_callbacks+0x424/0x7a0
[<ffffffff810a91e1>] ? account_system_time+0x81/0x110
[<ffffffff810a9481>] ? account_process_tick+0x61/0x160
[<ffffffff810e8050>] ? tick_sched_do_timer+0x30/0x30
[<ffffffff810d9749>] update_process_times+0x39/0x60
[<ffffffff810e7aa6>] tick_sched_handle.isra.15+0x36/0x50
[<ffffffff810e808d>] tick_sched_timer+0x3d/0x70
[<ffffffff810da342>] __hrtimer_run_queues+0xf2/0x250
[<ffffffff810da698>] hrtimer_interrupt+0xa8/0x190
[<ffffffff8100c61e>] xen_timer_interrupt+0x2e/0x140
[<ffffffff810c8555>] handle_irq_event_percpu+0x55/0x1e0
[<ffffffff810cbbca>] handle_percpu_irq+0x3a/0x50
[<ffffffff810c7d22>] generic_handle_irq+0x22/0x30
[<ffffffff813e50ff>] __evtchn_fifo_handle_events+0x15f/0x180
[<ffffffff813e5130>] evtchn_fifo_handle_events+0x10/0x20
[<ffffffff813e2233>] __xen_evtchn_do_upcall+0x43/0x80
[<ffffffff813e3ea0>] xen_evtchn_do_upcall+0x30/0x50
[<ffffffff8165deb2>] xen_hvm_callback_vector+0x82/0x90
<EOI>  [<ffffffff810baf0d>] ? queued_write_lock_slowpath+0x3d/0x80
[<ffffffff8165bcce>] _raw_write_lock+0x1e/0x30
[<ffffffff81289b23>] ext4_es_remove_extent+0x43/0xc0
[<ffffffff812692f9>] ext4_clear_inode+0x39/0x80
[<ffffffff8124cafd>] ext4_evict_inode+0x8d/0x4e0
[<ffffffff811df537>] evict+0xb7/0x180
[<ffffffff811df636>] dispose_list+0x36/0x50
[<ffffffff811e054b>] prune_icache_sb+0x4b/0x60
[<ffffffff811c92c1>] super_cache_scan+0x141/0x190
[<ffffffff811655fe>] shrink_slab.part.37+0x1ee/0x390
[<ffffffff8116950c>] shrink_zone+0x26c/0x280
[<ffffffff8116988c>] do_try_to_free_pages+0x15c/0x410
[<ffffffff81169bfa>] try_to_free_pages+0xba/0x170
[<ffffffff8115d1e5>] __alloc_pages_nodemask+0x525/0xa60
[<ffffffff811a839c>] ? kmem_cache_free+0xcc/0x2c0
[<ffffffff811a120d>] alloc_pages_current+0x8d/0x120
[<ffffffff81154601>] __page_cache_alloc+0x91/0xc0
[<ffffffff81154b66>] pagecache_get_page+0x56/0x1e0
[<ffffffff81154ee6>] grab_cache_page_write_begin+0x26/0x40
[<ffffffff8124c811>] ext4_da_write_begin+0xa1/0x300
[<ffffffff8124d3a4>] ? ext4_da_write_end+0x124/0x2b0
[<ffffffff811543a0>] generic_perform_write+0xc0/0x1a0
[<ffffffff81156408>] __generic_file_write_iter+0x188/0x1e0
[<ffffffff812421d0>] ext4_file_write_iter+0xf0/0x340
[<ffffffff811c59fa>] __vfs_write+0xaa/0xe0
[<ffffffff811c6022>] vfs_write+0xa2/0x1a0
[<ffffffff811c6d26>] SyS_write+0x46/0xa0
[<ffffffff8165c02e>] entry_SYSCALL_64_fastpath+0x12/0x71

Some 11 hours later:
sshd[785]: segfault at 1f0 ip 00007f03bb94ae5c sp 00007ffe9eb54470 error
4 in ld-2.17.so[7f03bb940000+21000]
sh[787]: segfault at 1f0 ip 00007f6b4a0dfe5c sp 00007ffe3d4a71e0 error 4
in ld-2.17.so[7f6b4a0d5000+21000]
systemd-cgroups[788]: segfault at 1f0 ip 00007f4baa82ce5c sp
00007ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000]
sshd[791]: segfault at 1f0 ip 00007ff8c8a8ce5c sp 00007ffede9e1c20 error
4 in ld-2.17.so[7ff8c8a82000+21000]
sshd[792]: segfault at 1f0 ip 00007f183cf75e5c sp 00007ffc81ab7160 error
4 in ld-2.17.so[7f183cf6b000+21000]
sshd[793]: segfault at 1f0 ip 00007f3c665ece5c sp 00007ffd9a13c850 error
4 in ld-2.17.so[7f3c665e2000+21000]

From isolated testing, this does not occur on kernel 4.5.x - however I
have not verified this myself.

The kernel config used can be found in the kernel-xen git repo if it
assists in debugging:
	http://xen.crc.id.au/git/?p=kernel-xen;a=summary

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



* Re: 4.4: INFO: rcu_sched self-detected stall on CPU
  2016-03-25  2:53 4.4: INFO: rcu_sched self-detected stall on CPU Steven Haigh
@ 2016-03-25 12:23 ` Boris Ostrovsky
       [not found] ` <56F52DBF.5080006@oracle.com>
  1 sibling, 0 replies; 14+ messages in thread
From: Boris Ostrovsky @ 2016-03-25 12:23 UTC (permalink / raw)
  To: Steven Haigh, xen-devel, linux-kernel; +Cc: gregkh

On 03/24/2016 10:53 PM, Steven Haigh wrote:
> Hi all,
>
> Firstly, I've cross-posted this to xen-devel and the lkml - as this
> problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
> I have also CC'ed Greg KH for his awesome insight as maintainer.
>
> Please CC myself into replies - as I'm not a member of the kernel
> mailing list - I may miss replies from monitoring the archives.
>
> I've noticed recently that heavy disk IO is causing rcu_sched to detect
> stalls. The process mentioned usually goes to 100% CPU usage, and
> eventually processes start segfaulting and dying. The only fix to
> recover the system is to use 'xl destroy' to force-kill the VM and to
> start it again.
>
> The majority of these issues seem to mention ext4 in the trace. This may
> indicate an issue there - or may be a red herring.
>
> The gritty details:
> INFO: rcu_sched self-detected stall on CPU
> #0110-...: (20999 ticks this GP) idle=327/140000000000001/0
> softirq=1101493/1101493 fqs=6973
> #011 (t=21000 jiffies g=827095 c=827094 q=524)
> Task dump for CPU 0:
> rsync           R  running task        0  2446   2444 0x00000088
> ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000
> ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001
> ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00
> Call Trace:
> <IRQ>  [<ffffffff810a625f>] sched_show_task+0xaf/0x110
> [<ffffffff810a8699>] dump_cpu_task+0x39/0x40
> [<ffffffff810d0e5a>] rcu_dump_cpu_stacks+0x8a/0xc0
> [<ffffffff810d4884>] rcu_check_callbacks+0x424/0x7a0
> [<ffffffff810a91e1>] ? account_system_time+0x81/0x110
> [<ffffffff810a9481>] ? account_process_tick+0x61/0x160
> [<ffffffff810e8050>] ? tick_sched_do_timer+0x30/0x30
> [<ffffffff810d9749>] update_process_times+0x39/0x60
> [<ffffffff810e7aa6>] tick_sched_handle.isra.15+0x36/0x50
> [<ffffffff810e808d>] tick_sched_timer+0x3d/0x70
> [<ffffffff810da342>] __hrtimer_run_queues+0xf2/0x250
> [<ffffffff810da698>] hrtimer_interrupt+0xa8/0x190
> [<ffffffff8100c61e>] xen_timer_interrupt+0x2e/0x140
> [<ffffffff810c8555>] handle_irq_event_percpu+0x55/0x1e0
> [<ffffffff810cbbca>] handle_percpu_irq+0x3a/0x50
> [<ffffffff810c7d22>] generic_handle_irq+0x22/0x30
> [<ffffffff813e50ff>] __evtchn_fifo_handle_events+0x15f/0x180
> [<ffffffff813e5130>] evtchn_fifo_handle_events+0x10/0x20
> [<ffffffff813e2233>] __xen_evtchn_do_upcall+0x43/0x80
> [<ffffffff813e3ea0>] xen_evtchn_do_upcall+0x30/0x50
> [<ffffffff8165deb2>] xen_hvm_callback_vector+0x82/0x90
> <EOI>  [<ffffffff810baf0d>] ? queued_write_lock_slowpath+0x3d/0x80
> [<ffffffff8165bcce>] _raw_write_lock+0x1e/0x30

This looks to me like ext4 failing to grab a lock. Everything above it 
(in Xen code) is regular tick interrupt handling which detects the stall.

Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates 
any possible issues with pv locks.

Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg, 
/var/log/xen/) Are you oversubscribing your guest (CPU-wise)?
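
For example, something along these lines (only a sketch - substitute the
real guest name, and adjust paths to wherever your packages install the
kernel config):

    # in the guest: confirm how the running kernel was built
    grep CONFIG_PARAVIRT_SPINLOCKS /boot/config-$(uname -r)

    # in dom0: hypervisor messages and vcpu placement / oversubscription
    xl dmesg | tail -n 100
    xl vcpu-list <domain>
    xl info | grep -E 'nr_cpus|free_memory'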

-boris

> [<ffffffff81289b23>] ext4_es_remove_extent+0x43/0xc0
> [<ffffffff812692f9>] ext4_clear_inode+0x39/0x80
> [<ffffffff8124cafd>] ext4_evict_inode+0x8d/0x4e0
> [<ffffffff811df537>] evict+0xb7/0x180
> [<ffffffff811df636>] dispose_list+0x36/0x50
> [<ffffffff811e054b>] prune_icache_sb+0x4b/0x60
> [<ffffffff811c92c1>] super_cache_scan+0x141/0x190
> [<ffffffff811655fe>] shrink_slab.part.37+0x1ee/0x390
> [<ffffffff8116950c>] shrink_zone+0x26c/0x280
> [<ffffffff8116988c>] do_try_to_free_pages+0x15c/0x410
> [<ffffffff81169bfa>] try_to_free_pages+0xba/0x170
> [<ffffffff8115d1e5>] __alloc_pages_nodemask+0x525/0xa60
> [<ffffffff811a839c>] ? kmem_cache_free+0xcc/0x2c0
> [<ffffffff811a120d>] alloc_pages_current+0x8d/0x120
> [<ffffffff81154601>] __page_cache_alloc+0x91/0xc0
> [<ffffffff81154b66>] pagecache_get_page+0x56/0x1e0
> [<ffffffff81154ee6>] grab_cache_page_write_begin+0x26/0x40
> [<ffffffff8124c811>] ext4_da_write_begin+0xa1/0x300
> [<ffffffff8124d3a4>] ? ext4_da_write_end+0x124/0x2b0
> [<ffffffff811543a0>] generic_perform_write+0xc0/0x1a0
> [<ffffffff81156408>] __generic_file_write_iter+0x188/0x1e0
> [<ffffffff812421d0>] ext4_file_write_iter+0xf0/0x340
> [<ffffffff811c59fa>] __vfs_write+0xaa/0xe0
> [<ffffffff811c6022>] vfs_write+0xa2/0x1a0
> [<ffffffff811c6d26>] SyS_write+0x46/0xa0
> [<ffffffff8165c02e>] entry_SYSCALL_64_fastpath+0x12/0x71
>
> Some 11 hours later:
> sshd[785]: segfault at 1f0 ip 00007f03bb94ae5c sp 00007ffe9eb54470 error
> 4 in ld-2.17.so[7f03bb940000+21000]
> sh[787]: segfault at 1f0 ip 00007f6b4a0dfe5c sp 00007ffe3d4a71e0 error 4
> in ld-2.17.so[7f6b4a0d5000+21000]
> systemd-cgroups[788]: segfault at 1f0 ip 00007f4baa82ce5c sp
> 00007ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000]
> sshd[791]: segfault at 1f0 ip 00007ff8c8a8ce5c sp 00007ffede9e1c20 error
> 4 in ld-2.17.so[7ff8c8a82000+21000]
> sshd[792]: segfault at 1f0 ip 00007f183cf75e5c sp 00007ffc81ab7160 error
> 4 in ld-2.17.so[7f183cf6b000+21000]
> sshd[793]: segfault at 1f0 ip 00007f3c665ece5c sp 00007ffd9a13c850 error
> 4 in ld-2.17.so[7f3c665e2000+21000]
>
>  From isolated testing, this does not occur on kernel 4.5.x - however I
> have not verified this myself.
>
> The kernel config used can be found in the kernel-xen git repo if it
> assists in debugging:
> 	http://xen.crc.id.au/git/?p=kernel-xen;a=summary
>
>
>



* Re: 4.4: INFO: rcu_sched self-detected stall on CPU
       [not found] ` <56F52DBF.5080006@oracle.com>
@ 2016-03-25 14:05   ` Steven Haigh
       [not found]   ` <56F545B1.8080609@crc.id.au>
  1 sibling, 0 replies; 14+ messages in thread
From: Steven Haigh @ 2016-03-25 14:05 UTC (permalink / raw)
  To: Boris Ostrovsky, xen-devel, linux-kernel; +Cc: gregkh



On 25/03/2016 11:23 PM, Boris Ostrovsky wrote:
> On 03/24/2016 10:53 PM, Steven Haigh wrote:
>> Hi all,
>>
>> Firstly, I've cross-posted this to xen-devel and the lkml - as this
>> problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
>> I have also CC'ed Greg KH for his awesome insight as maintainer.
>>
>> Please CC myself into replies - as I'm not a member of the kernel
>> mailing list - I may miss replies from monitoring the archives.
>>
>> I've noticed recently that heavy disk IO is causing rcu_sched to detect
>> stalls. The process mentioned usually goes to 100% CPU usage, and
>> eventually processes start segfaulting and dying. The only fix to
>> recover the system is to use 'xl destroy' to force-kill the VM and to
>> start it again.
>>
>> The majority of these issues seem to mention ext4 in the trace. This may
>> indicate an issue there - or may be a red herring.
>>
>> The gritty details:
>> INFO: rcu_sched self-detected stall on CPU
>> #0110-...: (20999 ticks this GP) idle=327/140000000000001/0
>> softirq=1101493/1101493 fqs=6973
>> #011 (t=21000 jiffies g=827095 c=827094 q=524)
>> Task dump for CPU 0:
>> rsync           R  running task        0  2446   2444 0x00000088
>> ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000
>> ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001
>> ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00
>> Call Trace:
>> <IRQ>  [<ffffffff810a625f>] sched_show_task+0xaf/0x110
>> [<ffffffff810a8699>] dump_cpu_task+0x39/0x40
>> [<ffffffff810d0e5a>] rcu_dump_cpu_stacks+0x8a/0xc0
>> [<ffffffff810d4884>] rcu_check_callbacks+0x424/0x7a0
>> [<ffffffff810a91e1>] ? account_system_time+0x81/0x110
>> [<ffffffff810a9481>] ? account_process_tick+0x61/0x160
>> [<ffffffff810e8050>] ? tick_sched_do_timer+0x30/0x30
>> [<ffffffff810d9749>] update_process_times+0x39/0x60
>> [<ffffffff810e7aa6>] tick_sched_handle.isra.15+0x36/0x50
>> [<ffffffff810e808d>] tick_sched_timer+0x3d/0x70
>> [<ffffffff810da342>] __hrtimer_run_queues+0xf2/0x250
>> [<ffffffff810da698>] hrtimer_interrupt+0xa8/0x190
>> [<ffffffff8100c61e>] xen_timer_interrupt+0x2e/0x140
>> [<ffffffff810c8555>] handle_irq_event_percpu+0x55/0x1e0
>> [<ffffffff810cbbca>] handle_percpu_irq+0x3a/0x50
>> [<ffffffff810c7d22>] generic_handle_irq+0x22/0x30
>> [<ffffffff813e50ff>] __evtchn_fifo_handle_events+0x15f/0x180
>> [<ffffffff813e5130>] evtchn_fifo_handle_events+0x10/0x20
>> [<ffffffff813e2233>] __xen_evtchn_do_upcall+0x43/0x80
>> [<ffffffff813e3ea0>] xen_evtchn_do_upcall+0x30/0x50
>> [<ffffffff8165deb2>] xen_hvm_callback_vector+0x82/0x90
>> <EOI>  [<ffffffff810baf0d>] ? queued_write_lock_slowpath+0x3d/0x80
>> [<ffffffff8165bcce>] _raw_write_lock+0x1e/0x30
> 
> This looks to me like ext4 failing to grab a lock. Everything above it
> (in Xen code) is regular tick interrupt handling which detects the stall.
> 
> Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates
> any possible issues with pv locks.
> 
> Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg,
> /var/log/xen/) Are you oversubscribing your guest (CPU-wise)?

There is nothing special being logged anywhere that I can see. dmesg /
xl dmesg on the Dom0 show nothing unusual.

I do share CPUs - but I don't give any DomU more than 2 vcpus. The
physical host has 4 cores - 1 pinned to the Dom0.
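
(For the record, the dom0 pinning is the usual boot-time arrangement -
roughly the following on the Xen hypervisor command line, quoted from
memory rather than copied from the host:)

    dom0_max_vcpus=1 dom0_vcpus_pin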

I log to a remote syslog on this system - and I've uploaded the entire
log to a pastebin (don't want to do a 45Kb attachment here):
    http://paste.fedoraproject.org/345095/58914452

Not sure if it makes any difference at all, but my DomU config is:
# cat /etc/xen/backup.vm
name            = "backup.vm"
memory          = 2048
vcpus           = 2
cpus            = "1-3"
disk            = [ 'phy:/dev/vg_raid1_new/backup.vm,xvda,w' ]
vif             = [ "mac=00:11:36:35:35:09, bridge=br203,
vifname=vm.backup, script=vif-bridge" ]
bootloader      = 'pygrub'
pvh             = 1

on_poweroff     = 'destroy'
on_reboot       = 'restart'
on_crash        = 'restart'
cpu_weight      = 64

I never had this problem when running kernel 4.1.x - it only started
when I upgraded everything to 4.4 - not exactly a great help - but may
help narrow things down?

>> [<ffffffff81289b23>] ext4_es_remove_extent+0x43/0xc0
>> [<ffffffff812692f9>] ext4_clear_inode+0x39/0x80
>> [<ffffffff8124cafd>] ext4_evict_inode+0x8d/0x4e0
>> [<ffffffff811df537>] evict+0xb7/0x180
>> [<ffffffff811df636>] dispose_list+0x36/0x50
>> [<ffffffff811e054b>] prune_icache_sb+0x4b/0x60
>> [<ffffffff811c92c1>] super_cache_scan+0x141/0x190
>> [<ffffffff811655fe>] shrink_slab.part.37+0x1ee/0x390
>> [<ffffffff8116950c>] shrink_zone+0x26c/0x280
>> [<ffffffff8116988c>] do_try_to_free_pages+0x15c/0x410
>> [<ffffffff81169bfa>] try_to_free_pages+0xba/0x170
>> [<ffffffff8115d1e5>] __alloc_pages_nodemask+0x525/0xa60
>> [<ffffffff811a839c>] ? kmem_cache_free+0xcc/0x2c0
>> [<ffffffff811a120d>] alloc_pages_current+0x8d/0x120
>> [<ffffffff81154601>] __page_cache_alloc+0x91/0xc0
>> [<ffffffff81154b66>] pagecache_get_page+0x56/0x1e0
>> [<ffffffff81154ee6>] grab_cache_page_write_begin+0x26/0x40
>> [<ffffffff8124c811>] ext4_da_write_begin+0xa1/0x300
>> [<ffffffff8124d3a4>] ? ext4_da_write_end+0x124/0x2b0
>> [<ffffffff811543a0>] generic_perform_write+0xc0/0x1a0
>> [<ffffffff81156408>] __generic_file_write_iter+0x188/0x1e0
>> [<ffffffff812421d0>] ext4_file_write_iter+0xf0/0x340
>> [<ffffffff811c59fa>] __vfs_write+0xaa/0xe0
>> [<ffffffff811c6022>] vfs_write+0xa2/0x1a0
>> [<ffffffff811c6d26>] SyS_write+0x46/0xa0
>> [<ffffffff8165c02e>] entry_SYSCALL_64_fastpath+0x12/0x71
>>
>> Some 11 hours later:
>> sshd[785]: segfault at 1f0 ip 00007f03bb94ae5c sp 00007ffe9eb54470 error
>> 4 in ld-2.17.so[7f03bb940000+21000]
>> sh[787]: segfault at 1f0 ip 00007f6b4a0dfe5c sp 00007ffe3d4a71e0 error 4
>> in ld-2.17.so[7f6b4a0d5000+21000]
>> systemd-cgroups[788]: segfault at 1f0 ip 00007f4baa82ce5c sp
>> 00007ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000]
>> sshd[791]: segfault at 1f0 ip 00007ff8c8a8ce5c sp 00007ffede9e1c20 error
>> 4 in ld-2.17.so[7ff8c8a82000+21000]
>> sshd[792]: segfault at 1f0 ip 00007f183cf75e5c sp 00007ffc81ab7160 error
>> 4 in ld-2.17.so[7f183cf6b000+21000]
>> sshd[793]: segfault at 1f0 ip 00007f3c665ece5c sp 00007ffd9a13c850 error
>> 4 in ld-2.17.so[7f3c665e2000+21000]
>>
>>  From isolated testing, this does not occur on kernel 4.5.x - however I
>> have not verified this myself.
>>
>> The kernel config used can be found in the kernel-xen git repo if it
>> assists in debugging:
>>     http://xen.crc.id.au/git/?p=kernel-xen;a=summary
>>
>>
>>
> 

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



* Re: 4.4: INFO: rcu_sched self-detected stall on CPU
       [not found]   ` <56F545B1.8080609@crc.id.au>
@ 2016-03-25 14:44     ` Boris Ostrovsky
       [not found]     ` <56F54EE0.6030004@oracle.com>
  1 sibling, 0 replies; 14+ messages in thread
From: Boris Ostrovsky @ 2016-03-25 14:44 UTC (permalink / raw)
  To: Steven Haigh, xen-devel, linux-kernel; +Cc: gregkh



On 03/25/2016 10:05 AM, Steven Haigh wrote:
> On 25/03/2016 11:23 PM, Boris Ostrovsky wrote:
>> On 03/24/2016 10:53 PM, Steven Haigh wrote:
>>> Hi all,
>>>
>>> Firstly, I've cross-posted this to xen-devel and the lkml - as this
>>> problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
>>> I have also CC'ed Greg KH for his awesome insight as maintainer.
>>>
>>> Please CC myself into replies - as I'm not a member of the kernel
>>> mailing list - I may miss replies from monitoring the archives.
>>>
>>> I've noticed recently that heavy disk IO is causing rcu_sched to detect
>>> stalls. The process mentioned usually goes to 100% CPU usage, and
>>> eventually processes start segfaulting and dying. The only fix to
>>> recover the system is to use 'xl destroy' to force-kill the VM and to
>>> start it again.
>>>
>>> The majority of these issues seem to mention ext4 in the trace. This may
>>> indicate an issue there - or may be a red herring.
>>>
>>> The gritty details:
>>> INFO: rcu_sched self-detected stall on CPU
>>> #0110-...: (20999 ticks this GP) idle=327/140000000000001/0
>>> softirq=1101493/1101493 fqs=6973
>>> #011 (t=21000 jiffies g=827095 c=827094 q=524)
>>> Task dump for CPU 0:
>>> rsync           R  running task        0  2446   2444 0x00000088
>>> ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000
>>> ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001
>>> ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00
>>> Call Trace:
>>> <IRQ>  [<ffffffff810a625f>] sched_show_task+0xaf/0x110
>>> [<ffffffff810a8699>] dump_cpu_task+0x39/0x40
>>> [<ffffffff810d0e5a>] rcu_dump_cpu_stacks+0x8a/0xc0
>>> [<ffffffff810d4884>] rcu_check_callbacks+0x424/0x7a0
>>> [<ffffffff810a91e1>] ? account_system_time+0x81/0x110
>>> [<ffffffff810a9481>] ? account_process_tick+0x61/0x160
>>> [<ffffffff810e8050>] ? tick_sched_do_timer+0x30/0x30
>>> [<ffffffff810d9749>] update_process_times+0x39/0x60
>>> [<ffffffff810e7aa6>] tick_sched_handle.isra.15+0x36/0x50
>>> [<ffffffff810e808d>] tick_sched_timer+0x3d/0x70
>>> [<ffffffff810da342>] __hrtimer_run_queues+0xf2/0x250
>>> [<ffffffff810da698>] hrtimer_interrupt+0xa8/0x190
>>> [<ffffffff8100c61e>] xen_timer_interrupt+0x2e/0x140
>>> [<ffffffff810c8555>] handle_irq_event_percpu+0x55/0x1e0
>>> [<ffffffff810cbbca>] handle_percpu_irq+0x3a/0x50
>>> [<ffffffff810c7d22>] generic_handle_irq+0x22/0x30
>>> [<ffffffff813e50ff>] __evtchn_fifo_handle_events+0x15f/0x180
>>> [<ffffffff813e5130>] evtchn_fifo_handle_events+0x10/0x20
>>> [<ffffffff813e2233>] __xen_evtchn_do_upcall+0x43/0x80
>>> [<ffffffff813e3ea0>] xen_evtchn_do_upcall+0x30/0x50
>>> [<ffffffff8165deb2>] xen_hvm_callback_vector+0x82/0x90
>>> <EOI>  [<ffffffff810baf0d>] ? queued_write_lock_slowpath+0x3d/0x80
>>> [<ffffffff8165bcce>] _raw_write_lock+0x1e/0x30
>> This looks to me like ext4 failing to grab a lock. Everything above it
>> (in Xen code) is regular tick interrupt handling which detects the stall.
>>
>> Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates
>> any possible issues with pv locks.
>>
>> Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg,
>> /var/log/xen/) Are you oversubscribing your guest (CPU-wise)?
> There is nothing special being logged anywhere that I can see. dmesg /
> xl dmesg on the Dom0 show nothing unusual.
>
> I do share CPUs - but I don't give any DomU more than 2 vcpus. The
> physical host has 4 cores - 1 pinned to the Dom0.
>
> I log to a remote syslog on this system - and I've uploaded the entire
> log to a pastebin (don't want to do a 45Kb attachment here):
>      http://paste.fedoraproject.org/345095/58914452

That doesn't look like a full log. In any case, the RCU stall may be a 
secondary problem --- there is a bunch of splats before the stall.

-boris


>
> Not sure if it makes any difference at all, but my DomU config is:
> # cat /etc/xen/backup.vm
> name            = "backup.vm"
> memory          = 2048
> vcpus           = 2
> cpus            = "1-3"
> disk            = [ 'phy:/dev/vg_raid1_new/backup.vm,xvda,w' ]
> vif             = [ "mac=00:11:36:35:35:09, bridge=br203,
> vifname=vm.backup, script=vif-bridge" ]
> bootloader      = 'pygrub'
> pvh             = 1
>
> on_poweroff     = 'destroy'
> on_reboot       = 'restart'
> on_crash        = 'restart'
> cpu_weight      = 64
>
> I never had this problem when running kernel 4.1.x - it only started
> when I upgraded everything to 4.4 - not exactly a great help - but may
> help narrow things down?
>
>>> [<ffffffff81289b23>] ext4_es_remove_extent+0x43/0xc0
>>> [<ffffffff812692f9>] ext4_clear_inode+0x39/0x80
>>> [<ffffffff8124cafd>] ext4_evict_inode+0x8d/0x4e0
>>> [<ffffffff811df537>] evict+0xb7/0x180
>>> [<ffffffff811df636>] dispose_list+0x36/0x50
>>> [<ffffffff811e054b>] prune_icache_sb+0x4b/0x60
>>> [<ffffffff811c92c1>] super_cache_scan+0x141/0x190
>>> [<ffffffff811655fe>] shrink_slab.part.37+0x1ee/0x390
>>> [<ffffffff8116950c>] shrink_zone+0x26c/0x280
>>> [<ffffffff8116988c>] do_try_to_free_pages+0x15c/0x410
>>> [<ffffffff81169bfa>] try_to_free_pages+0xba/0x170
>>> [<ffffffff8115d1e5>] __alloc_pages_nodemask+0x525/0xa60
>>> [<ffffffff811a839c>] ? kmem_cache_free+0xcc/0x2c0
>>> [<ffffffff811a120d>] alloc_pages_current+0x8d/0x120
>>> [<ffffffff81154601>] __page_cache_alloc+0x91/0xc0
>>> [<ffffffff81154b66>] pagecache_get_page+0x56/0x1e0
>>> [<ffffffff81154ee6>] grab_cache_page_write_begin+0x26/0x40
>>> [<ffffffff8124c811>] ext4_da_write_begin+0xa1/0x300
>>> [<ffffffff8124d3a4>] ? ext4_da_write_end+0x124/0x2b0
>>> [<ffffffff811543a0>] generic_perform_write+0xc0/0x1a0
>>> [<ffffffff81156408>] __generic_file_write_iter+0x188/0x1e0
>>> [<ffffffff812421d0>] ext4_file_write_iter+0xf0/0x340
>>> [<ffffffff811c59fa>] __vfs_write+0xaa/0xe0
>>> [<ffffffff811c6022>] vfs_write+0xa2/0x1a0
>>> [<ffffffff811c6d26>] SyS_write+0x46/0xa0
>>> [<ffffffff8165c02e>] entry_SYSCALL_64_fastpath+0x12/0x71
>>>
>>> Some 11 hours later:
>>> sshd[785]: segfault at 1f0 ip 00007f03bb94ae5c sp 00007ffe9eb54470 error
>>> 4 in ld-2.17.so[7f03bb940000+21000]
>>> sh[787]: segfault at 1f0 ip 00007f6b4a0dfe5c sp 00007ffe3d4a71e0 error 4
>>> in ld-2.17.so[7f6b4a0d5000+21000]
>>> systemd-cgroups[788]: segfault at 1f0 ip 00007f4baa82ce5c sp
>>> 00007ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000]
>>> sshd[791]: segfault at 1f0 ip 00007ff8c8a8ce5c sp 00007ffede9e1c20 error
>>> 4 in ld-2.17.so[7ff8c8a82000+21000]
>>> sshd[792]: segfault at 1f0 ip 00007f183cf75e5c sp 00007ffc81ab7160 error
>>> 4 in ld-2.17.so[7f183cf6b000+21000]
>>> sshd[793]: segfault at 1f0 ip 00007f3c665ece5c sp 00007ffd9a13c850 error
>>> 4 in ld-2.17.so[7f3c665e2000+21000]
>>>
>>>   From isolated testing, this does not occur on kernel 4.5.x - however I
>>> have not verified this myself.
>>>
>>> The kernel config used can be found in the kernel-xen git repo if it
>>> assists in debugging:
>>>      http://xen.crc.id.au/git/?p=kernel-xen;a=summary
>>>
>>>
>>>



* Re: 4.4: INFO: rcu_sched self-detected stall on CPU
       [not found]     ` <56F54EE0.6030004@oracle.com>
@ 2016-03-25 16:04       ` Steven Haigh
       [not found]       ` <56F56172.9020805@crc.id.au>
  1 sibling, 0 replies; 14+ messages in thread
From: Steven Haigh @ 2016-03-25 16:04 UTC (permalink / raw)
  To: Boris Ostrovsky, xen-devel, linux-kernel; +Cc: gregkh



On 26/03/2016 1:44 AM, Boris Ostrovsky wrote:
> On 03/25/2016 10:05 AM, Steven Haigh wrote:
>> On 25/03/2016 11:23 PM, Boris Ostrovsky wrote:
>>> On 03/24/2016 10:53 PM, Steven Haigh wrote:
>>>> Hi all,
>>>>
>>>> Firstly, I've cross-posted this to xen-devel and the lkml - as this
>>>> problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
>>>> I have also CC'ed Greg KH for his awesome insight as maintainer.
>>>>
>>>> Please CC myself into replies - as I'm not a member of the kernel
>>>> mailing list - I may miss replies from monitoring the archives.
>>>>
>>>> I've noticed recently that heavy disk IO is causing rcu_sched to detect
>>>> stalls. The process mentioned usually goes to 100% CPU usage, and
>>>> eventually processes start segfaulting and dying. The only fix to
>>>> recover the system is to use 'xl destroy' to force-kill the VM and to
>>>> start it again.
>>>>
>>>> The majority of these issues seem to mention ext4 in the trace. This
>>>> may
>>>> indicate an issue there - or may be a red herring.
>>>>
>>>> The gritty details:
>>>> INFO: rcu_sched self-detected stall on CPU
>>>> #0110-...: (20999 ticks this GP) idle=327/140000000000001/0
>>>> softirq=1101493/1101493 fqs=6973
>>>> #011 (t=21000 jiffies g=827095 c=827094 q=524)
>>>> Task dump for CPU 0:
>>>> rsync           R  running task        0  2446   2444 0x00000088
>>>> ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000
>>>> ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001
>>>> ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00
>>>> Call Trace:
>>>> <IRQ>  [<ffffffff810a625f>] sched_show_task+0xaf/0x110
>>>> [<ffffffff810a8699>] dump_cpu_task+0x39/0x40
>>>> [<ffffffff810d0e5a>] rcu_dump_cpu_stacks+0x8a/0xc0
>>>> [<ffffffff810d4884>] rcu_check_callbacks+0x424/0x7a0
>>>> [<ffffffff810a91e1>] ? account_system_time+0x81/0x110
>>>> [<ffffffff810a9481>] ? account_process_tick+0x61/0x160
>>>> [<ffffffff810e8050>] ? tick_sched_do_timer+0x30/0x30
>>>> [<ffffffff810d9749>] update_process_times+0x39/0x60
>>>> [<ffffffff810e7aa6>] tick_sched_handle.isra.15+0x36/0x50
>>>> [<ffffffff810e808d>] tick_sched_timer+0x3d/0x70
>>>> [<ffffffff810da342>] __hrtimer_run_queues+0xf2/0x250
>>>> [<ffffffff810da698>] hrtimer_interrupt+0xa8/0x190
>>>> [<ffffffff8100c61e>] xen_timer_interrupt+0x2e/0x140
>>>> [<ffffffff810c8555>] handle_irq_event_percpu+0x55/0x1e0
>>>> [<ffffffff810cbbca>] handle_percpu_irq+0x3a/0x50
>>>> [<ffffffff810c7d22>] generic_handle_irq+0x22/0x30
>>>> [<ffffffff813e50ff>] __evtchn_fifo_handle_events+0x15f/0x180
>>>> [<ffffffff813e5130>] evtchn_fifo_handle_events+0x10/0x20
>>>> [<ffffffff813e2233>] __xen_evtchn_do_upcall+0x43/0x80
>>>> [<ffffffff813e3ea0>] xen_evtchn_do_upcall+0x30/0x50
>>>> [<ffffffff8165deb2>] xen_hvm_callback_vector+0x82/0x90
>>>> <EOI>  [<ffffffff810baf0d>] ? queued_write_lock_slowpath+0x3d/0x80
>>>> [<ffffffff8165bcce>] _raw_write_lock+0x1e/0x30
>>> This looks to me like ext4 failing to grab a lock. Everything above it
>>> (in Xen code) is regular tick interrupt handling which detects the
>>> stall.
>>>
>>> Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates
>>> any possible issues with pv locks.
>>>
>>> Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg,
>>> /var/log/xen/) Are you oversubscribing your guest (CPU-wise)?
>> There is nothing special being logged anywhere that I can see. dmesg /
>> xl dmesg on the Dom0 show nothing unusual.
>>
>> I do share CPUs - but I don't give any DomU more than 2 vcpus. The
>> physical host has 4 cores - 1 pinned to the Dom0.
>>
>> I log to a remote syslog on this system - and I've uploaded the entire
>> log to a pastebin (don't want to do a 45Kb attachment here):
>>      http://paste.fedoraproject.org/345095/58914452
> 
> That doesn't look like a full log. In any case, the RCU stall may be a
> secondary problem --- there is a bunch of splats before the stall.

It may not actually be the full logs. Once the system gets really upset,
you can't run anything - as such, grabbing anything from dmesg is not
possible.

The logs provided above is all that gets spat out to the syslog server.

I'll try tinkering with a few things to see if I can get more output -
but right now, that's all I've been able to achieve. So far, my only
ideas are to remove the 'quiet' options from the kernel command line -
but I'm not sure how much that would help.

Suggestions gladly accepted on this front.

>>
>> Not sure if it makes any difference at all, but my DomU config is:
>> # cat /etc/xen/backup.vm
>> name            = "backup.vm"
>> memory          = 2048
>> vcpus           = 2
>> cpus            = "1-3"
>> disk            = [ 'phy:/dev/vg_raid1_new/backup.vm,xvda,w' ]
>> vif             = [ "mac=00:11:36:35:35:09, bridge=br203,
>> vifname=vm.backup, script=vif-bridge" ]
>> bootloader      = 'pygrub'
>> pvh             = 1
>>
>> on_poweroff     = 'destroy'
>> on_reboot       = 'restart'
>> on_crash        = 'restart'
>> cpu_weight      = 64
>>
>> I never had this problem when running kernel 4.1.x - it only started
>> when I upgraded everything to 4.4 - not exactly a great help - but may
>> help narrow things down?
>>
>>>> [<ffffffff81289b23>] ext4_es_remove_extent+0x43/0xc0
>>>> [<ffffffff812692f9>] ext4_clear_inode+0x39/0x80
>>>> [<ffffffff8124cafd>] ext4_evict_inode+0x8d/0x4e0
>>>> [<ffffffff811df537>] evict+0xb7/0x180
>>>> [<ffffffff811df636>] dispose_list+0x36/0x50
>>>> [<ffffffff811e054b>] prune_icache_sb+0x4b/0x60
>>>> [<ffffffff811c92c1>] super_cache_scan+0x141/0x190
>>>> [<ffffffff811655fe>] shrink_slab.part.37+0x1ee/0x390
>>>> [<ffffffff8116950c>] shrink_zone+0x26c/0x280
>>>> [<ffffffff8116988c>] do_try_to_free_pages+0x15c/0x410
>>>> [<ffffffff81169bfa>] try_to_free_pages+0xba/0x170
>>>> [<ffffffff8115d1e5>] __alloc_pages_nodemask+0x525/0xa60
>>>> [<ffffffff811a839c>] ? kmem_cache_free+0xcc/0x2c0
>>>> [<ffffffff811a120d>] alloc_pages_current+0x8d/0x120
>>>> [<ffffffff81154601>] __page_cache_alloc+0x91/0xc0
>>>> [<ffffffff81154b66>] pagecache_get_page+0x56/0x1e0
>>>> [<ffffffff81154ee6>] grab_cache_page_write_begin+0x26/0x40
>>>> [<ffffffff8124c811>] ext4_da_write_begin+0xa1/0x300
>>>> [<ffffffff8124d3a4>] ? ext4_da_write_end+0x124/0x2b0
>>>> [<ffffffff811543a0>] generic_perform_write+0xc0/0x1a0
>>>> [<ffffffff81156408>] __generic_file_write_iter+0x188/0x1e0
>>>> [<ffffffff812421d0>] ext4_file_write_iter+0xf0/0x340
>>>> [<ffffffff811c59fa>] __vfs_write+0xaa/0xe0
>>>> [<ffffffff811c6022>] vfs_write+0xa2/0x1a0
>>>> [<ffffffff811c6d26>] SyS_write+0x46/0xa0
>>>> [<ffffffff8165c02e>] entry_SYSCALL_64_fastpath+0x12/0x71
>>>>
>>>> Some 11 hours later:
>>>> sshd[785]: segfault at 1f0 ip 00007f03bb94ae5c sp 00007ffe9eb54470
>>>> error
>>>> 4 in ld-2.17.so[7f03bb940000+21000]
>>>> sh[787]: segfault at 1f0 ip 00007f6b4a0dfe5c sp 00007ffe3d4a71e0
>>>> error 4
>>>> in ld-2.17.so[7f6b4a0d5000+21000]
>>>> systemd-cgroups[788]: segfault at 1f0 ip 00007f4baa82ce5c sp
>>>> 00007ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000]
>>>> sshd[791]: segfault at 1f0 ip 00007ff8c8a8ce5c sp 00007ffede9e1c20
>>>> error
>>>> 4 in ld-2.17.so[7ff8c8a82000+21000]
>>>> sshd[792]: segfault at 1f0 ip 00007f183cf75e5c sp 00007ffc81ab7160
>>>> error
>>>> 4 in ld-2.17.so[7f183cf6b000+21000]
>>>> sshd[793]: segfault at 1f0 ip 00007f3c665ece5c sp 00007ffd9a13c850
>>>> error
>>>> 4 in ld-2.17.so[7f3c665e2000+21000]
>>>>
>>>>   From isolated testing, this does not occur on kernel 4.5.x -
>>>> however I
>>>> have not verified this myself.
>>>>
>>>> The kernel config used can be found in the kernel-xen git repo if it
>>>> assists in debugging:
>>>>      http://xen.crc.id.au/git/?p=kernel-xen;a=summary
>>>>
>>>>
>>>>
> 

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



* Re: 4.4: INFO: rcu_sched self-detected stall on CPU
       [not found]       ` <56F56172.9020805@crc.id.au>
@ 2016-03-25 16:20         ` Boris Ostrovsky
       [not found]         ` <56F5653B.1090700@oracle.com>
  1 sibling, 0 replies; 14+ messages in thread
From: Boris Ostrovsky @ 2016-03-25 16:20 UTC (permalink / raw)
  To: Steven Haigh, xen-devel, linux-kernel; +Cc: gregkh

On 03/25/2016 12:04 PM, Steven Haigh wrote:
>
> It may not actually be the full logs. Once the system gets really upset,
> you can't run anything - as such, grabbing anything from dmesg is not
> possible.
>
> The logs provided above is all that gets spat out to the syslog server.
>
> I'll try tinkering with a few things to see if I can get more output -
> but right now, that's all I've been able to achieve. So far, my only
> ideas are to remove the 'quiet' options from the kernel command line -
> but I'm not sure how much that would help.
>
> Suggestions gladly accepted on this front.

You probably want to run connected to guest serial console (" 
serial='pty' " in guest config file and something like 'loglevel=7 
console=tty0 console=ttyS0,38400n8' on guest kernel commandline). And 
start the guest with 'xl create -c <cfg>' or connect later with 'xl 
console <domainID>'.
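
Putting that together, roughly (illustrative only - adjust the config
path and guest name to match your setup):

    # guest config (e.g. /etc/xen/backup.vm)
    serial      = 'pty'

    # guest kernel command line
    loglevel=7 console=tty0 console=ttyS0,38400n8

    # start with the console attached ...
    xl create -c /etc/xen/backup.vm
    # ... or attach to an already-running guest
    xl console <domainID>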

-boris




* Re: 4.4: INFO: rcu_sched self-detected stall on CPU
       [not found]         ` <56F5653B.1090700@oracle.com>
@ 2016-03-25 21:07           ` Steven Haigh
       [not found]           ` <56F5A87A.8000903@crc.id.au>
  1 sibling, 0 replies; 14+ messages in thread
From: Steven Haigh @ 2016-03-25 21:07 UTC (permalink / raw)
  To: Boris Ostrovsky, xen-devel, linux-kernel; +Cc: gregkh



On 26/03/2016 3:20 AM, Boris Ostrovsky wrote:
> On 03/25/2016 12:04 PM, Steven Haigh wrote:
>> It may not actually be the full logs. Once the system gets really upset,
>> you can't run anything - as such, grabbing anything from dmesg is not
>> possible.
>>
>> The logs provided above is all that gets spat out to the syslog server.
>>
>> I'll try tinkering with a few things to see if I can get more output -
>> but right now, that's all I've been able to achieve. So far, my only
>> ideas are to remove the 'quiet' options from the kernel command line -
>> but I'm not sure how much that would help.
>>
>> Suggestions gladly accepted on this front.
> 
> You probably want to run connected to guest serial console ("
> serial='pty' " in guest config file and something like 'loglevel=7
> console=tty0 console=ttyS0,38400n8' on guest kernel commandline). And
> start the guest with 'xl create -c <cfg>' or connect later with 'xl
> console <domainID>'.

Ok thanks, I've booted the DomU with:

$ cat /proc/cmdline
root=UUID=63ade949-ee67-4afb-8fe7-ecd96faa15e2 ro enforcemodulesig=1
selinux=0 fsck.repair=yes loglevel=7 console=tty0 console=ttyS0,38400n8

I've left a screen session attached to the console (via xl console) and
I'll see if that turns anything up. As this seems to be rather
unpredictable when it happens, it may take a day or two to get anything.
I just hope it's more than the syslog output :)
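
Something like this is what I mean (session name arbitrary - screen's -L
option simply logs everything to screenlog.0 in the current directory,
so the output survives even if the guest wedges):

    screen -L -S backup-console xl console backup.vm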

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



* Re: 4.4: INFO: rcu_sched self-detected stall on CPU
       [not found]           ` <56F5A87A.8000903@crc.id.au>
@ 2016-03-29  8:56             ` Steven Haigh
       [not found]             ` <56FA4336.2030301@crc.id.au>
  1 sibling, 0 replies; 14+ messages in thread
From: Steven Haigh @ 2016-03-29  8:56 UTC (permalink / raw)
  To: Boris Ostrovsky, xen-devel, linux-kernel; +Cc: gregkh



On 26/03/2016 8:07 AM, Steven Haigh wrote:
> On 26/03/2016 3:20 AM, Boris Ostrovsky wrote:
>> On 03/25/2016 12:04 PM, Steven Haigh wrote:
>>> It may not actually be the full logs. Once the system gets really upset,
>>> you can't run anything - as such, grabbing anything from dmesg is not
>>> possible.
>>>
>>> The logs provided above is all that gets spat out to the syslog server.
>>>
>>> I'll try tinkering with a few things to see if I can get more output -
>>> but right now, that's all I've been able to achieve. So far, my only
>>> ideas are to remove the 'quiet' options from the kernel command line -
>>> but I'm not sure how much that would help.
>>>
>>> Suggestions gladly accepted on this front.
>>
>> You probably want to run connected to guest serial console ("
>> serial='pty' " in guest config file and something like 'loglevel=7
>> console=tty0 console=ttyS0,38400n8' on guest kernel commandline). And
>> start the guest with 'xl create -c <cfg>' or connect later with 'xl
>> console <domainID>'.
> 
> Ok thanks, I've booted the DomU with:
> 
> $ cat /proc/cmdline
> root=UUID=63ade949-ee67-4afb-8fe7-ecd96faa15e2 ro enforcemodulesig=1
> selinux=0 fsck.repair=yes loglevel=7 console=tty0 console=ttyS0,38400n8
> 
> I've left a screen session attached to the console (via xl console) and
> I'll see if that turns anything up. As this seems to be rather
> unpredictable when it happens, it may take a day or two to get anything.
> I just hope it's more than the syslog output :)

Interestingly enough, this just happened again - but on a different
virtual machine. I'm starting to wonder if this may have something to do
with the uptime of the machine - as the system that this seems to happen
to is always different.

Destroying it and monitoring it again has so far come up blank.

I've thrown the latest lot of kernel messages here:
    http://paste.fedoraproject.org/346802/59241532

Interestingly, around the same time, /var/log/messages on the remote
syslog server shows:
Mar 29 17:00:01 zeus systemd: Created slice user-0.slice.
Mar 29 17:00:01 zeus systemd: Starting user-0.slice.
Mar 29 17:00:01 zeus systemd: Started Session 1567 of user root.
Mar 29 17:00:01 zeus systemd: Starting Session 1567 of user root.
Mar 29 17:00:01 zeus systemd: Removed slice user-0.slice.
Mar 29 17:00:01 zeus systemd: Stopping user-0.slice.
Mar 29 17:01:01 zeus systemd: Created slice user-0.slice.
Mar 29 17:01:01 zeus systemd: Starting user-0.slice.
Mar 29 17:01:01 zeus systemd: Started Session 1568 of user root.
Mar 29 17:01:01 zeus systemd: Starting Session 1568 of user root.
Mar 29 17:08:34 zeus ntpdate[18569]: adjust time server 203.56.246.94
offset -0.002247 sec
Mar 29 17:08:34 zeus systemd: Removed slice user-0.slice.
Mar 29 17:08:34 zeus systemd: Stopping user-0.slice.
Mar 29 17:10:01 zeus systemd: Created slice user-0.slice.
Mar 29 17:10:01 zeus systemd: Starting user-0.slice.
Mar 29 17:10:01 zeus systemd: Started Session 1569 of user root.
Mar 29 17:10:01 zeus systemd: Starting Session 1569 of user root.
Mar 29 17:10:01 zeus systemd: Removed slice user-0.slice.
Mar 29 17:10:01 zeus systemd: Stopping user-0.slice.
Mar 29 17:20:01 zeus systemd: Created slice user-0.slice.
Mar 29 17:20:01 zeus systemd: Starting user-0.slice.
Mar 29 17:20:01 zeus systemd: Started Session 1570 of user root.
Mar 29 17:20:01 zeus systemd: Starting Session 1570 of user root.
Mar 29 17:20:01 zeus systemd: Removed slice user-0.slice.
Mar 29 17:20:01 zeus systemd: Stopping user-0.slice.
Mar 29 17:30:55 zeus systemd: systemd-logind.service watchdog timeout
(limit 1min)!
Mar 29 17:32:25 zeus systemd: systemd-logind.service stop-sigabrt timed
out. Terminating.
Mar 29 17:33:56 zeus systemd: systemd-logind.service stop-sigterm timed
out. Killing.
Mar 29 17:35:26 zeus systemd: systemd-logind.service still around after
SIGKILL. Ignoring.
Mar 29 17:36:56 zeus systemd: systemd-logind.service stop-final-sigterm
timed out. Killing.
Mar 29 17:38:26 zeus systemd: systemd-logind.service still around after
final SIGKILL. Entering failed mode.
Mar 29 17:38:26 zeus systemd: Unit systemd-logind.service entered failed
state.
Mar 29 17:38:26 zeus systemd: systemd-logind.service failed.

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



* Re: 4.4: INFO: rcu_sched self-detected stall on CPU
       [not found]             ` <56FA4336.2030301@crc.id.au>
@ 2016-03-29 14:14               ` Boris Ostrovsky
       [not found]               ` <56FA8DDD.7070406@oracle.com>
  1 sibling, 0 replies; 14+ messages in thread
From: Boris Ostrovsky @ 2016-03-29 14:14 UTC (permalink / raw)
  To: Steven Haigh, xen-devel, linux-kernel; +Cc: gregkh

On 03/29/2016 04:56 AM, Steven Haigh wrote:
>
> Interestingly enough, this just happened again - but on a different
> virtual machine. I'm starting to wonder if this may have something to do
> with the uptime of the machine - as the system that this seems to happen
> to is always different.
>
> Destroying it and monitoring it again has so far come up blank.
>
> I've thrown the latest lot of kernel messages here:
>      http://paste.fedoraproject.org/346802/59241532

Would be good to see full console log. The one that you posted starts 
with an error so I wonder what was before that.

Have you tried this on bare metal, BTW? And you said this is only 
observed on 4.4, not 4.5, right?

>
> Interestingly, around the same time, /var/log/messages on the remote
> syslog server shows:
> Mar 29 17:00:01 zeus systemd: Created slice user-0.slice.
> Mar 29 17:00:01 zeus systemd: Starting user-0.slice.
> Mar 29 17:00:01 zeus systemd: Started Session 1567 of user root.
> Mar 29 17:00:01 zeus systemd: Starting Session 1567 of user root.
> Mar 29 17:00:01 zeus systemd: Removed slice user-0.slice.
> Mar 29 17:00:01 zeus systemd: Stopping user-0.slice.
> Mar 29 17:01:01 zeus systemd: Created slice user-0.slice.
> Mar 29 17:01:01 zeus systemd: Starting user-0.slice.
> Mar 29 17:01:01 zeus systemd: Started Session 1568 of user root.
> Mar 29 17:01:01 zeus systemd: Starting Session 1568 of user root.
> Mar 29 17:08:34 zeus ntpdate[18569]: adjust time server 203.56.246.94
> offset -0.002247 sec
> Mar 29 17:08:34 zeus systemd: Removed slice user-0.slice.
> Mar 29 17:08:34 zeus systemd: Stopping user-0.slice.
> Mar 29 17:10:01 zeus systemd: Created slice user-0.slice.
> Mar 29 17:10:01 zeus systemd: Starting user-0.slice.
> Mar 29 17:10:01 zeus systemd: Started Session 1569 of user root.
> Mar 29 17:10:01 zeus systemd: Starting Session 1569 of user root.
> Mar 29 17:10:01 zeus systemd: Removed slice user-0.slice.
> Mar 29 17:10:01 zeus systemd: Stopping user-0.slice.
> Mar 29 17:20:01 zeus systemd: Created slice user-0.slice.
> Mar 29 17:20:01 zeus systemd: Starting user-0.slice.
> Mar 29 17:20:01 zeus systemd: Started Session 1570 of user root.
> Mar 29 17:20:01 zeus systemd: Starting Session 1570 of user root.
> Mar 29 17:20:01 zeus systemd: Removed slice user-0.slice.
> Mar 29 17:20:01 zeus systemd: Stopping user-0.slice.
> Mar 29 17:30:55 zeus systemd: systemd-logind.service watchdog timeout
> (limit 1min)!
> Mar 29 17:32:25 zeus systemd: systemd-logind.service stop-sigabrt timed
> out. Terminating.
> Mar 29 17:33:56 zeus systemd: systemd-logind.service stop-sigterm timed
> out. Killing.
> Mar 29 17:35:26 zeus systemd: systemd-logind.service still around after
> SIGKILL. Ignoring.
> Mar 29 17:36:56 zeus systemd: systemd-logind.service stop-final-sigterm
> timed out. Killing.
> Mar 29 17:38:26 zeus systemd: systemd-logind.service still around after
> final SIGKILL. Entering failed mode.
> Mar 29 17:38:26 zeus systemd: Unit systemd-logind.service entered failed
> state.
> Mar 29 17:38:26 zeus systemd: systemd-logind.service failed.


These may be result of your system not feeling well, which is not 
surprising.

-boris


* Re: 4.4: INFO: rcu_sched self-detected stall on CPU
       [not found]               ` <56FA8DDD.7070406@oracle.com>
@ 2016-03-29 17:44                 ` Steven Haigh
       [not found]                 ` <56FABF17.7090608@crc.id.au>
  2016-04-02  1:50                 ` Steven Haigh
  2 siblings, 0 replies; 14+ messages in thread
From: Steven Haigh @ 2016-03-29 17:44 UTC (permalink / raw)
  To: Boris Ostrovsky, xen-devel, linux-kernel; +Cc: gregkh



On 30/03/2016 1:14 AM, Boris Ostrovsky wrote:
> On 03/29/2016 04:56 AM, Steven Haigh wrote:
>>
>> Interestingly enough, this just happened again - but on a different
>> virtual machine. I'm starting to wonder if this may have something to do
>> with the uptime of the machine - as the system that this seems to happen
>> to is always different.
>>
>> Destroying it and monitoring it again has so far come up blank.
>>
>> I've thrown the latest lot of kernel messages here:
>>      http://paste.fedoraproject.org/346802/59241532
> 
> Would be good to see full console log. The one that you posted starts
> with an error so I wonder what was before that.

Agreed. It started off with me observing this on one VM - but since
trying to get details on that VM, others have started showing issues as
well. It's frustrating - it seems I've been playing whack-a-mole trying
to get more debug info on what is going on.

So, I've changed the kernel command line to the following on ALL VMs on
this system:
enforcemodulesig=1 selinux=0 fsck.repair=yes loglevel=7 console=tty0
console=ttyS0,38400n8

In the Dom0 (which runs the same kernel package), I've started a screen
session with a window for each of the running DomUs, attached to the
console via 'xl console blah' - so hopefully the next one that goes down
(whichever one that is) will get caught in the console.
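
(Roughly, one way to script that - a sketch only, not the exact commands
in use here:)

    # one detached, logging screen session per running domU
    for dom in $(xl list | awk 'NR>2 {print $1}'); do
        screen -L -dmS "console-$dom" xl console "$dom"
    done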

> Have you tried this on bare metal, BTW? And you said this is only
> observed on 4.4, not 4.5, right?

I use the same kernel package as the Dom0 kernel - and so far haven't
seen any issues running this as the Dom0. I haven't used it on baremetal
as a non-xen kernel as yet.

The kernel package I'm currently running is for CentOS / Scientific
Linux / RHEL at:
    http://au1.mirror.crc.id.au/repo/el7-testing/x86_64/

I'm using 4.4.6-3 at the moment - which has CONFIG_PREEMPT_VOLUNTARY set
- which *MAY* have increased the time between occurrences - or may have
no effect at all. I'm not convinced either way as yet.

With respect to 4.5, I have had reports from another user of my packages
that they haven't seen the same crash using the same Xen packages but
with kernel 4.5. I have not verified this myself as yet as I haven't
gone down the path of making 4.5 packages for testing. As such, I
wouldn't treat this as a conclusive test case as yet.

I'm hoping that the steps I've taken above will give some more
information that lets us drill down into exactly what is going on - or
at least give more pointers to the root cause.

>>
>> Interestingly, around the same time, /var/log/messages on the remote
>> syslog server shows:
>> Mar 29 17:00:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:00:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:00:01 zeus systemd: Started Session 1567 of user root.
>> Mar 29 17:00:01 zeus systemd: Starting Session 1567 of user root.
>> Mar 29 17:00:01 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:00:01 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:01:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:01:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:01:01 zeus systemd: Started Session 1568 of user root.
>> Mar 29 17:01:01 zeus systemd: Starting Session 1568 of user root.
>> Mar 29 17:08:34 zeus ntpdate[18569]: adjust time server 203.56.246.94
>> offset -0.002247 sec
>> Mar 29 17:08:34 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:08:34 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Started Session 1569 of user root.
>> Mar 29 17:10:01 zeus systemd: Starting Session 1569 of user root.
>> Mar 29 17:10:01 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:10:01 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Created slice user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Starting user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Started Session 1570 of user root.
>> Mar 29 17:20:01 zeus systemd: Starting Session 1570 of user root.
>> Mar 29 17:20:01 zeus systemd: Removed slice user-0.slice.
>> Mar 29 17:20:01 zeus systemd: Stopping user-0.slice.
>> Mar 29 17:30:55 zeus systemd: systemd-logind.service watchdog timeout
>> (limit 1min)!
>> Mar 29 17:32:25 zeus systemd: systemd-logind.service stop-sigabrt timed
>> out. Terminating.
>> Mar 29 17:33:56 zeus systemd: systemd-logind.service stop-sigterm timed
>> out. Killing.
>> Mar 29 17:35:26 zeus systemd: systemd-logind.service still around after
>> SIGKILL. Ignoring.
>> Mar 29 17:36:56 zeus systemd: systemd-logind.service stop-final-sigterm
>> timed out. Killing.
>> Mar 29 17:38:26 zeus systemd: systemd-logind.service still around after
>> final SIGKILL. Entering failed mode.
>> Mar 29 17:38:26 zeus systemd: Unit systemd-logind.service entered failed
>> state.
>> Mar 29 17:38:26 zeus systemd: systemd-logind.service failed.
> 
> 
> These may be result of your system not feeling well, which is not
> surprising.
> 
> -boris

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



* Re: 4.4: INFO: rcu_sched self-detected stall on CPU
       [not found]                 ` <56FABF17.7090608@crc.id.au>
@ 2016-03-29 18:04                   ` Steven Haigh
       [not found]                   ` <56FAC3AC.9050802@crc.id.au>
  1 sibling, 0 replies; 14+ messages in thread
From: Steven Haigh @ 2016-03-29 18:04 UTC (permalink / raw)
  To: Boris Ostrovsky, xen-devel, linux-kernel; +Cc: gregkh



Greg, please see below - this is probably more for you...

On 03/29/2016 04:56 AM, Steven Haigh wrote:
>
> Interestingly enough, this just happened again - but on a different
> virtual machine. I'm starting to wonder if this may have something to do
> with the uptime of the machine - as the system that this seems to happen
> to is always different.
>
> Destroying it and monitoring it again has so far come up blank.
>
> I've thrown the latest lot of kernel messages here:
>      http://paste.fedoraproject.org/346802/59241532

So I just did a bit of digging via the almighty Google.

I started hunting for these lines, as they happen just before the stall:
BUG: Bad rss-counter state mm:ffff88007b7db480 idx:2 val:-1
BUG: Bad rss-counter state mm:ffff880079c638c0 idx:0 val:-1
BUG: Bad rss-counter state mm:ffff880079c638c0 idx:2 val:-1

I stumbled across this post on the lkml:
    http://marc.info/?l=linux-kernel&m=145141546409607

The patch attached seems to reference the following change in
unmap_mapping_range in mm/memory.c:
> -	struct zap_details details;
> +	struct zap_details details = { };

When I browse the GIT tree for 4.4.6:
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/mm/memory.c?id=refs/tags/v4.4.6

I see at line 2411:
struct zap_details details;
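
For what it's worth, the difference between the two declarations is just
whether the on-stack structure starts out zeroed. A minimal standalone
illustration (a made-up struct, not the real zap_details):

    #include <stdio.h>

    struct details {
        void *check_mapping;   /* stand-ins for the real zap_details fields */
        long  first_index;
        long  last_index;
    };

    int main(void)
    {
        struct details a;         /* automatic storage: contents indeterminate */
        struct details b = { };   /* every field zero-initialised */

        (void)a;                  /* silence unused-variable warnings */

        /* reading a.check_mapping before assigning it would be undefined
         * behaviour, whereas b.check_mapping is guaranteed to be NULL */
        printf("b.check_mapping = %p\n", b.check_mapping);
        return 0;
    }

With the bare declaration, any field the caller never assigns holds
whatever happened to be on the stack; with '= { }' every field is
guaranteed to start at zero.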

Is this something that has been missed being merged into the 4.4 tree?
I'll admit my kernel knowledge is not enough to understand what the code
actually does - but the similarities here seem uncanny.

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897



* Re: 4.4: INFO: rcu_sched self-detected stall on CPU
       [not found]                   ` <56FAC3AC.9050802@crc.id.au>
@ 2016-03-30 13:44                     ` Boris Ostrovsky
  2016-05-02 20:54                     ` gregkh
  1 sibling, 0 replies; 14+ messages in thread
From: Boris Ostrovsky @ 2016-03-30 13:44 UTC (permalink / raw)
  To: Steven Haigh, xen-devel, linux-kernel; +Cc: gregkh

On 03/29/2016 02:04 PM, Steven Haigh wrote:
> Greg, please see below - this is probably more for you...
>
> On 03/29/2016 04:56 AM, Steven Haigh wrote:
>> Interestingly enough, this just happened again - but on a different
>> virtual machine. I'm starting to wonder if this may have something to do
>> with the uptime of the machine - as the system that this seems to happen
>> to is always different.
>>
>> Destroying it and monitoring it again has so far come up blank.
>>
>> I've thrown the latest lot of kernel messages here:
>>       http://paste.fedoraproject.org/346802/59241532
> So I just did a bit of digging via the almighty Google.
>
> I started hunting for these lines, as they happen just before the stall:
> BUG: Bad rss-counter state mm:ffff88007b7db480 idx:2 val:-1
> BUG: Bad rss-counter state mm:ffff880079c638c0 idx:0 val:-1
> BUG: Bad rss-counter state mm:ffff880079c638c0 idx:2 val:-1
>
> I stumbled across this post on the lkml:
>      http://marc.info/?l=linux-kernel&m=145141546409607
>
> The patch attached seems to reference the following change in
> unmap_mapping_range in mm/memory.c:
>> -	struct zap_details details;
>> +	struct zap_details details = { };
> When I browse the GIT tree for 4.4.6:
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/mm/memory.c?id=refs/tags/v4.4.6
>
> I see at line 2411:
> struct zap_details details;
>
> Is this something that has been missed being merged into the 4.4 tree?
> I'll admit my kernel knowledge is not enough to understand what the code
> actually does - but the similarities here seem uncanny.


The patch that you are referring to is trying to fix a bug in a feature 
that's not in the mainline yet ("mm, oom: introduce oom reaper").

-boris

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 4.4: INFO: rcu_sched self-detected stall on CPU
       [not found]               ` <56FA8DDD.7070406@oracle.com>
  2016-03-29 17:44                 ` Steven Haigh
       [not found]                 ` <56FABF17.7090608@crc.id.au>
@ 2016-04-02  1:50                 ` Steven Haigh
  2 siblings, 0 replies; 14+ messages in thread
From: Steven Haigh @ 2016-04-02  1:50 UTC (permalink / raw)
  To: Boris Ostrovsky, xen-devel, linux-kernel


[-- Attachment #1.1.1: Type: text/plain, Size: 10386 bytes --]

On 30/03/2016 1:14 AM, Boris Ostrovsky wrote:
> On 03/29/2016 04:56 AM, Steven Haigh wrote:
>>
>> Interestingly enough, this just happened again - but on a different
>> virtual machine. I'm starting to wonder if this may have something to do
>> with the uptime of the machine - as the system that this seems to happen
>> to is always different.
>>
>> Destroying it and monitoring it again has so far come up blank.
>>
>> I've thrown the latest lot of kernel messages here:
>>      http://paste.fedoraproject.org/346802/59241532
> 
> Would be good to see full console log. The one that you posted starts
> with an error so I wonder what was before that.

Ok, so I had a virtual machine do this again today. Both vcpus went to
100% usage and essentially hung. I attached to the screen console that
was connected via 'xl console' and copied the entire buffer to paste below:

yum-cron[30740]: segfault at 1781ab8 ip 00007f2a7fcd282f sp
00007ffe8655fe90 error 5 in libpython2.7.so.1.0[7f2a7fbf5000+178000]
swap_free: Bad swap file entry 2a2b7d5bb69515d8
BUG: Bad page map in process yum-cron  pte:56fab76d2a2bb06a pmd:0309e067
addr:0000000001780000 vm_flags:00100073 anon_vma:ffff88007b974c08
mapping:          (null) index:1780
file:          (null) fault:          (null) mmap:          (null)
readpage:          (null)
CPU: 0 PID: 30740 Comm: yum-cron Tainted: G    B
4.4.6-4.el7xen.x86_64 #1
 0000000000000000 ffff88004176bac0 ffffffff81323d17 0000000001780000
 00003ffffffff000 ffff88004176bb08 ffffffff8117e574 ffffffff81193d6e
 0000000000001780 ffff88000309ec00 0000000001780000 56fab76d2a2bb06a
Call Trace:
 [<ffffffff81323d17>] dump_stack+0x63/0x8c
 [<ffffffff8117e574>] print_bad_pte+0x1e4/0x290
 [<ffffffff81193d6e>] ? swap_info_get+0x7e/0xe0
 [<ffffffff8117fdef>] unmap_single_vma+0x4ff/0x840
 [<ffffffff81180837>] unmap_vmas+0x47/0x90
 [<ffffffff811891f8>] exit_mmap+0x98/0x150
 [<ffffffff8107a557>] mmput+0x47/0x100
 [<ffffffff8107f81e>] do_exit+0x24e/0xad0
 [<ffffffff8108011f>] do_group_exit+0x3f/0xa0
 [<ffffffff8108aa03>] get_signal+0x1c3/0x5e0
 [<ffffffff81017208>] do_signal+0x28/0x630
 [<ffffffff81152d7d>] ? printk+0x4d/0x4f
 [<ffffffff810c735f>] ? vprintk_default+0x1f/0x30
 [<ffffffff8107938c>] ? bad_area_access_error+0x43/0x4a
 [<ffffffff810603ac>] ? __do_page_fault+0x22c/0x3f0
 [<ffffffff8107808d>] exit_to_usermode_loop+0x4c/0x95
 [<ffffffff81003878>] prepare_exit_to_usermode+0x18/0x20
 [<ffffffff8165cae5>] retint_user+0x8/0x13
BUG: Bad page state in process yum-cron  pfn:0f3bf
page:ffffea00003cefc0 count:0 mapcount:7 mapping:ffff88000f3bf008
index:0xffff88000f3bf000
flags: 0x10000000000094(referenced|dirty|slab)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
flags: 0x80(slab)
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache
x86_pkg_temp_thermal coretemp crct10dif_pclmul aesni_intel aes_x86_64
lrw gf128mul glue_helper ablk_helper cryptd pcspkr nfsd auth_rpcgss
nfs_acl lockd grace sunrpc ip_tables xen_netfront crc32c_intel
xen_gntalloc xen_evtchn ipv6 autofs4
CPU: 0 PID: 30740 Comm: yum-cron Tainted: G    B
4.4.6-4.el7xen.x86_64 #1
 0000000000000000 ffff88004176b958 ffffffff81323d17 ffffea00003cefc0
 ffffffff817ab348 ffff88004176b980 ffffffff811c1ab5 ffffea00003cefc0
 0000000000000001 0000000000000000 ffff88004176b9d0 ffffffff81159584
Call Trace:
 [<ffffffff81323d17>] dump_stack+0x63/0x8c
 [<ffffffff811c1ab5>] bad_page.part.69+0xdf/0xfc
 [<ffffffff81159584>] free_pages_prepare+0x294/0x2a0
 [<ffffffff8115b6c1>] free_hot_cold_page+0x31/0x160
 [<ffffffff8115b839>] free_hot_cold_page_list+0x49/0xb0
 [<ffffffff811621b5>] release_pages+0xc5/0x260
 [<ffffffff8119380d>] free_pages_and_swap_cache+0x7d/0x90
 [<ffffffff8117ddd6>] tlb_flush_mmu_free+0x36/0x60
 [<ffffffff8117ff54>] unmap_single_vma+0x664/0x840
 [<ffffffff81180837>] unmap_vmas+0x47/0x90
 [<ffffffff811891f8>] exit_mmap+0x98/0x150
 [<ffffffff8107a557>] mmput+0x47/0x100
 [<ffffffff8107f81e>] do_exit+0x24e/0xad0
 [<ffffffff8108011f>] do_group_exit+0x3f/0xa0
 [<ffffffff8108aa03>] get_signal+0x1c3/0x5e0
 [<ffffffff81017208>] do_signal+0x28/0x630
 [<ffffffff81152d7d>] ? printk+0x4d/0x4f
 [<ffffffff810c735f>] ? vprintk_default+0x1f/0x30
 [<ffffffff8107938c>] ? bad_area_access_error+0x43/0x4a
 [<ffffffff810603ac>] ? __do_page_fault+0x22c/0x3f0
 [<ffffffff8107808d>] exit_to_usermode_loop+0x4c/0x95
 [<ffffffff81003878>] prepare_exit_to_usermode+0x18/0x20
 [<ffffffff8165cae5>] retint_user+0x8/0x13
BUG: Bad rss-counter state mm:ffff88007b99e4c0 idx:0 val:-1
BUG: Bad rss-counter state mm:ffff88007b99e4c0 idx:1 val:2
BUG: Bad rss-counter state mm:ffff88007b99e4c0 idx:2 val:-1
yum-cron[4197]: segfault at 32947fcb ip 00007ff0fa1bf8bd sp
00007ffdb1c54990 error 4 in libpython2.7.so.1.0[7ff0fa13a000+178000]
BUG: unable to handle kernel paging request at ffff88010f3beffe
IP: [<ffffffff811a7fb9>] free_block+0x119/0x190
PGD 188b063 PUD 0
Oops: 0002 [#1] SMP
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache
x86_pkg_temp_thermal coretemp crct10dif_pclmul aesni_intel aes_x86_64
lrw gf128mul glue_helper ablk_helper cryptd pcspkr nfsd auth_rpcgss
nfs_acl lockd grace sunrpc ip_tables xen_netfront crc32c_intel
xen_gntalloc xen_evtchn ipv6 autofs4
CPU: 1 PID: 8519 Comm: kworker/1:2 Tainted: G    B
4.4.6-4.el7xen.x86_64 #1
Workqueue: events cache_reap
task: ffff8800346bf1c0 ti: ffff880051700000 task.ti: ffff880051700000
RIP: 0010:[<ffffffff811a7fb9>]  [<ffffffff811a7fb9>] free_block+0x119/0x190
RSP: 0018:ffff880051703d40  EFLAGS: 00010082
RAX: ffffea00003cefc0 RBX: ffffea0000000000 RCX: ffff88000f3bf000
RDX: 00000000fffffffe RSI: ffff88007fd19c40 RDI: ffff88007d012100
RBP: ffff880051703d68 R08: ffff880051703d88 R09: 0000000000000006
R10: ffff88007d01a940 R11: 0000000080000000 R12: 000077ff80000000
R13: ffff88007fd19c58 R14: ffff88007d01a948 R15: ffff88007d01a968
FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88010f3beffe CR3: 000000007bc8d000 CR4: 0000000000040620
Stack:
 ffff88007fd19bf0 ffff880051703d88 ffff88007d01a940 000000000000000b
 ffff88007d012100 ffff880051703dc0 ffffffff811a871b 0000000b51703dd8
 ffff88007fd19c00 ffff880051703d88 ffff880051703d88 ffff88007d012100
Call Trace:
 [<ffffffff811a871b>] drain_array+0xab/0x130
 [<ffffffff811a955c>] cache_reap+0x6c/0x230
 [<ffffffff81093831>] process_one_work+0x151/0x400
 [<ffffffff8109437a>] worker_thread+0x11a/0x460
 [<ffffffff8165848f>] ? __schedule+0x2bf/0x880
 [<ffffffff81094260>] ? rescuer_thread+0x2f0/0x2f0
 [<ffffffff810993d9>] kthread+0xc9/0xe0
 [<ffffffff81099310>] ? kthread_park+0x60/0x60
 [<ffffffff8165c38f>] ret_from_fork+0x3f/0x70
 [<ffffffff81099310>] ? kthread_park+0x60/0x60
Code: 41 89 d1 4c 0f af c9 0f b6 4f 1c 49 c1 e9 20 44 29 ca d3 ea 0f b6
4f 1d 41 01 d1 8b 50 18 41 d3 e9 48 8b 48 10 83 ea 01 89 50 18 <44> 88
0c 11 49 8b 52 38 48 83 c2 01 49 89 52 38 8b 48 18 85 c9
RIP  [<ffffffff811a7fb9>] free_block+0x119/0x190
 RSP <ffff880051703d40>
CR2: ffff88010f3beffe
---[ end trace 99d22a980c39d4db ]---
BUG: unable to handle kernel paging request at ffffffffffffffd8
IP: [<ffffffff81099b40>] kthread_data+0x10/0x20
PGD 188d067 PUD 188f067 PMD 0
Oops: 0000 [#2] SMP
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache
x86_pkg_temp_thermal coretemp crct10dif_pclmul aesni_intel aes_x86_64
lrw gf128mul glue_helper ablk_helper cryptd pcspkr nfsd auth_rpcgss
nfs_acl lockd grace sunrpc ip_tables xen_netfront crc32c_intel
xen_gntalloc xen_evtchn ipv6 autofs4
CPU: 1 PID: 8519 Comm: kworker/1:2 Tainted: G    B D
4.4.6-4.el7xen.x86_64 #1
task: ffff8800346bf1c0 ti: ffff880051700000 task.ti: ffff880051700000
RIP: 0010:[<ffffffff81099b40>]  [<ffffffff81099b40>] kthread_data+0x10/0x20
RSP: 0018:ffff880051703a60  EFLAGS: 00010002
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
RDX: ffffffff81bd4080 RSI: 0000000000000001 RDI: ffff8800346bf1c0
RBP: ffff880051703a60 R08: ffff88007c8e19d0 R09: 0000c88ed0b2b602
R10: 000000000000001f R11: 0000000000000000 R12: ffff8800346bf1c0
R13: 0000000000000000 R14: 00000000000163c0 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 000000007bc8d000 CR4: 0000000000040620
Stack:
 ffff880051703a78 ffffffff81094741 ffff88007fd163c0 ffff880051703ac0
 ffffffff8165858f ffffffff810d2e27 ffff8800346bf1c0 ffff880051704000
 ffff8800346bf510 0000000000000046 ffff8800346bf1c0 ffff88007d3facc0
Call Trace:
 [<ffffffff81094741>] wq_worker_sleeping+0x11/0x90
 [<ffffffff8165858f>] __schedule+0x3bf/0x880
 [<ffffffff810d2e27>] ? call_rcu_sched+0x17/0x20
 [<ffffffff81658a85>] schedule+0x35/0x80
 [<ffffffff8107fc29>] do_exit+0x659/0xad0
 [<ffffffff8101a60a>] oops_end+0x9a/0xd0
 [<ffffffff8105f7ad>] no_context+0x10d/0x360
 [<ffffffff8105fb09>] __bad_area_nosemaphore+0x109/0x210
 [<ffffffff8105fc23>] bad_area_nosemaphore+0x13/0x20
 [<ffffffff81060200>] __do_page_fault+0x80/0x3f0
 [<ffffffff81060592>] do_page_fault+0x22/0x30
 [<ffffffff8165e0e8>] page_fault+0x28/0x30
 [<ffffffff811a7fb9>] ? free_block+0x119/0x190
 [<ffffffff811a871b>] drain_array+0xab/0x130
 [<ffffffff811a955c>] cache_reap+0x6c/0x230
 [<ffffffff81093831>] process_one_work+0x151/0x400
 [<ffffffff8109437a>] worker_thread+0x11a/0x460
 [<ffffffff8165848f>] ? __schedule+0x2bf/0x880
 [<ffffffff81094260>] ? rescuer_thread+0x2f0/0x2f0
 [<ffffffff810993d9>] kthread+0xc9/0xe0
 [<ffffffff81099310>] ? kthread_park+0x60/0x60
 [<ffffffff8165c38f>] ret_from_fork+0x3f/0x70
 [<ffffffff81099310>] ? kthread_park+0x60/0x60
Code: ff ff eb 92 be 3a 02 00 00 48 c7 c7 0b 5e 7a 81 e8 e6 34 fe ff e9
d5 fe ff ff 90 66 66 66 66 90 55 48 8b 87 00 04 00 00 48 89 e5 <48> 8b
40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90
RIP  [<ffffffff81099b40>] kthread_data+0x10/0x20
 RSP <ffff880051703a60>
CR2: ffffffffffffffd8
---[ end trace 99d22a980c39d4dc ]---
Fixing recursive fault but reboot is needed!

There is nothing further logged on the console after this, and the
messages before it only concern an NFS server.

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 4.4: INFO: rcu_sched self-detected stall on CPU
       [not found]                   ` <56FAC3AC.9050802@crc.id.au>
  2016-03-30 13:44                     ` Boris Ostrovsky
@ 2016-05-02 20:54                     ` gregkh
  1 sibling, 0 replies; 14+ messages in thread
From: gregkh @ 2016-05-02 20:54 UTC (permalink / raw)
  To: Steven Haigh; +Cc: xen-devel, Boris Ostrovsky, linux-kernel

On Wed, Mar 30, 2016 at 05:04:28AM +1100, Steven Haigh wrote:
> Greg, please see below - this is probably more for you...
> 
> On 03/29/2016 04:56 AM, Steven Haigh wrote:
> >
> > Interestingly enough, this just happened again - but on a different
> > virtual machine. I'm starting to wonder if this may have something to do
> > with the uptime of the machine - as the system that this seems to happen
> > to is always different.
> >
> > Destroying it and monitoring it again has so far come up blank.
> >
> > I've thrown the latest lot of kernel messages here:
> >      http://paste.fedoraproject.org/346802/59241532
> 
> So I just did a bit of digging via the almighty Google.
> 
> I started hunting for these lines, as they happen just before the stall:
> BUG: Bad rss-counter state mm:ffff88007b7db480 idx:2 val:-1
> BUG: Bad rss-counter state mm:ffff880079c638c0 idx:0 val:-1
> BUG: Bad rss-counter state mm:ffff880079c638c0 idx:2 val:-1
> 
> I stumbled across this post on the lkml:
>     http://marc.info/?l=linux-kernel&m=145141546409607
> 
> The patch attached seems to reference the following change in
> unmap_mapping_range in mm/memory.c:
> > -	struct zap_details details;
> > +	struct zap_details details = { };
> 
> When I browse the GIT tree for 4.4.6:
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/mm/memory.c?id=refs/tags/v4.4.6
> 
> I see at line 2411:
> struct zap_details details;
> 
> Is this something that has been missed being merged into the 4.4 tree?
> I'll admit my kernel knowledge is not enough to understand what the code
> actually does - but the similarities here seem uncanny.

I'm sorry, I have no idea what you are asking me about here.  Did I miss
a patch that should be backported?  Did I backport something
incorrectly?

confused,

greg k-h

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-05-02 20:54 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-25  2:53 4.4: INFO: rcu_sched self-detected stall on CPU Steven Haigh
2016-03-25 12:23 ` Boris Ostrovsky
     [not found] ` <56F52DBF.5080006@oracle.com>
2016-03-25 14:05   ` Steven Haigh
     [not found]   ` <56F545B1.8080609@crc.id.au>
2016-03-25 14:44     ` Boris Ostrovsky
     [not found]     ` <56F54EE0.6030004@oracle.com>
2016-03-25 16:04       ` Steven Haigh
     [not found]       ` <56F56172.9020805@crc.id.au>
2016-03-25 16:20         ` Boris Ostrovsky
     [not found]         ` <56F5653B.1090700@oracle.com>
2016-03-25 21:07           ` Steven Haigh
     [not found]           ` <56F5A87A.8000903@crc.id.au>
2016-03-29  8:56             ` Steven Haigh
     [not found]             ` <56FA4336.2030301@crc.id.au>
2016-03-29 14:14               ` Boris Ostrovsky
     [not found]               ` <56FA8DDD.7070406@oracle.com>
2016-03-29 17:44                 ` Steven Haigh
     [not found]                 ` <56FABF17.7090608@crc.id.au>
2016-03-29 18:04                   ` Steven Haigh
     [not found]                   ` <56FAC3AC.9050802@crc.id.au>
2016-03-30 13:44                     ` Boris Ostrovsky
2016-05-02 20:54                     ` gregkh
2016-04-02  1:50                 ` Steven Haigh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).