From: Steven Haigh <netwiz@crc.id.au>
To: Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	xen-devel <xen-devel@lists.xenproject.org>,
	linux-kernel@vger.kernel.org
Subject: Re: 4.4: INFO: rcu_sched self-detected stall on CPU
Date: Sat, 2 Apr 2016 12:50:07 +1100
Message-ID: <56FF254F.3050103__38527.4805514657$1459561922$gmane$org@crc.id.au>
In-Reply-To: <56FA8DDD.7070406@oracle.com>



On 30/03/2016 1:14 AM, Boris Ostrovsky wrote:
> On 03/29/2016 04:56 AM, Steven Haigh wrote:
>>
>> Interestingly enough, this just happened again - but on a different
>> virtual machine. I'm starting to wonder if this may have something to do
>> with the uptime of the machine - as the system that this seems to happen
>> to is always different.
>>
>> Destroying it and monitoring it again has so far come up blank.
>>
>> I've thrown the latest lot of kernel messages here:
>>      http://paste.fedoraproject.org/346802/59241532
> 
> Would be good to see full console log. The one that you posted starts
> with an error so I wonder what was before that.

Ok, so I had a virtual machine do this again today. Both vcpus went to
100% usage and the guest essentially hung. I attached to the screen session
that was connected via 'xl console' and copied the entire buffer, pasted below:

yum-cron[30740]: segfault at 1781ab8 ip 00007f2a7fcd282f sp
00007ffe8655fe90 error 5 in libpython2.7.so.1.0[7f2a7fbf5000+178000]
swap_free: Bad swap file entry 2a2b7d5bb69515d8
BUG: Bad page map in process yum-cron  pte:56fab76d2a2bb06a pmd:0309e067
addr:0000000001780000 vm_flags:00100073 anon_vma:ffff88007b974c08
mapping:          (null) index:1780
file:          (null) fault:          (null) mmap:          (null)
readpage:          (null)
CPU: 0 PID: 30740 Comm: yum-cron Tainted: G    B
4.4.6-4.el7xen.x86_64 #1
 0000000000000000 ffff88004176bac0 ffffffff81323d17 0000000001780000
 00003ffffffff000 ffff88004176bb08 ffffffff8117e574 ffffffff81193d6e
 0000000000001780 ffff88000309ec00 0000000001780000 56fab76d2a2bb06a
Call Trace:
 [<ffffffff81323d17>] dump_stack+0x63/0x8c
 [<ffffffff8117e574>] print_bad_pte+0x1e4/0x290
 [<ffffffff81193d6e>] ? swap_info_get+0x7e/0xe0
 [<ffffffff8117fdef>] unmap_single_vma+0x4ff/0x840
 [<ffffffff81180837>] unmap_vmas+0x47/0x90
 [<ffffffff811891f8>] exit_mmap+0x98/0x150
 [<ffffffff8107a557>] mmput+0x47/0x100
 [<ffffffff8107f81e>] do_exit+0x24e/0xad0
 [<ffffffff8108011f>] do_group_exit+0x3f/0xa0
 [<ffffffff8108aa03>] get_signal+0x1c3/0x5e0
 [<ffffffff81017208>] do_signal+0x28/0x630
 [<ffffffff81152d7d>] ? printk+0x4d/0x4f
 [<ffffffff810c735f>] ? vprintk_default+0x1f/0x30
 [<ffffffff8107938c>] ? bad_area_access_error+0x43/0x4a
 [<ffffffff810603ac>] ? __do_page_fault+0x22c/0x3f0
 [<ffffffff8107808d>] exit_to_usermode_loop+0x4c/0x95
 [<ffffffff81003878>] prepare_exit_to_usermode+0x18/0x20
 [<ffffffff8165cae5>] retint_user+0x8/0x13
BUG: Bad page state in process yum-cron  pfn:0f3bf
page:ffffea00003cefc0 count:0 mapcount:7 mapping:ffff88000f3bf008
index:0xffff88000f3bf000
flags: 0x10000000000094(referenced|dirty|slab)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
flags: 0x80(slab)
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache
x86_pkg_temp_thermal coretemp crct10dif_pclmul aesni_intel aes_x86_64
lrw gf128mul glue_helper ablk_helper cryptd pcspkr nfsd auth_rpcgss
nfs_acl lockd grace sunrpc ip_tables xen_netfront crc32c_intel
xen_gntalloc xen_evtchn ipv6 autofs4
CPU: 0 PID: 30740 Comm: yum-cron Tainted: G    B
4.4.6-4.el7xen.x86_64 #1
 0000000000000000 ffff88004176b958 ffffffff81323d17 ffffea00003cefc0
 ffffffff817ab348 ffff88004176b980 ffffffff811c1ab5 ffffea00003cefc0
 0000000000000001 0000000000000000 ffff88004176b9d0 ffffffff81159584
Call Trace:
 [<ffffffff81323d17>] dump_stack+0x63/0x8c
 [<ffffffff811c1ab5>] bad_page.part.69+0xdf/0xfc
 [<ffffffff81159584>] free_pages_prepare+0x294/0x2a0
 [<ffffffff8115b6c1>] free_hot_cold_page+0x31/0x160
 [<ffffffff8115b839>] free_hot_cold_page_list+0x49/0xb0
 [<ffffffff811621b5>] release_pages+0xc5/0x260
 [<ffffffff8119380d>] free_pages_and_swap_cache+0x7d/0x90
 [<ffffffff8117ddd6>] tlb_flush_mmu_free+0x36/0x60
 [<ffffffff8117ff54>] unmap_single_vma+0x664/0x840
 [<ffffffff81180837>] unmap_vmas+0x47/0x90
 [<ffffffff811891f8>] exit_mmap+0x98/0x150
 [<ffffffff8107a557>] mmput+0x47/0x100
 [<ffffffff8107f81e>] do_exit+0x24e/0xad0
 [<ffffffff8108011f>] do_group_exit+0x3f/0xa0
 [<ffffffff8108aa03>] get_signal+0x1c3/0x5e0
 [<ffffffff81017208>] do_signal+0x28/0x630
 [<ffffffff81152d7d>] ? printk+0x4d/0x4f
 [<ffffffff810c735f>] ? vprintk_default+0x1f/0x30
 [<ffffffff8107938c>] ? bad_area_access_error+0x43/0x4a
 [<ffffffff810603ac>] ? __do_page_fault+0x22c/0x3f0
 [<ffffffff8107808d>] exit_to_usermode_loop+0x4c/0x95
 [<ffffffff81003878>] prepare_exit_to_usermode+0x18/0x20
 [<ffffffff8165cae5>] retint_user+0x8/0x13
BUG: Bad rss-counter state mm:ffff88007b99e4c0 idx:0 val:-1
BUG: Bad rss-counter state mm:ffff88007b99e4c0 idx:1 val:2
BUG: Bad rss-counter state mm:ffff88007b99e4c0 idx:2 val:-1
yum-cron[4197]: segfault at 32947fcb ip 00007ff0fa1bf8bd sp
00007ffdb1c54990 error 4 in libpython2.7.so.1.0[7ff0fa13a000+178000]
BUG: unable to handle kernel paging request at ffff88010f3beffe
IP: [<ffffffff811a7fb9>] free_block+0x119/0x190
PGD 188b063 PUD 0
Oops: 0002 [#1] SMP
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache
x86_pkg_temp_thermal coretemp crct10dif_pclmul aesni_intel aes_x86_64
lrw gf128mul glue_helper ablk_helper cryptd pcspkr nfsd auth_rpcgss
nfs_acl lockd grace sunrpc ip_tables xen_netfront crc32c_intel
xen_gntalloc xen_evtchn ipv6 autofs4
CPU: 1 PID: 8519 Comm: kworker/1:2 Tainted: G    B
4.4.6-4.el7xen.x86_64 #1
Workqueue: events cache_reap
task: ffff8800346bf1c0 ti: ffff880051700000 task.ti: ffff880051700000
RIP: 0010:[<ffffffff811a7fb9>]  [<ffffffff811a7fb9>] free_block+0x119/0x190
RSP: 0018:ffff880051703d40  EFLAGS: 00010082
RAX: ffffea00003cefc0 RBX: ffffea0000000000 RCX: ffff88000f3bf000
RDX: 00000000fffffffe RSI: ffff88007fd19c40 RDI: ffff88007d012100
RBP: ffff880051703d68 R08: ffff880051703d88 R09: 0000000000000006
R10: ffff88007d01a940 R11: 0000000080000000 R12: 000077ff80000000
R13: ffff88007fd19c58 R14: ffff88007d01a948 R15: ffff88007d01a968
FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88010f3beffe CR3: 000000007bc8d000 CR4: 0000000000040620
Stack:
 ffff88007fd19bf0 ffff880051703d88 ffff88007d01a940 000000000000000b
 ffff88007d012100 ffff880051703dc0 ffffffff811a871b 0000000b51703dd8
 ffff88007fd19c00 ffff880051703d88 ffff880051703d88 ffff88007d012100
Call Trace:
 [<ffffffff811a871b>] drain_array+0xab/0x130
 [<ffffffff811a955c>] cache_reap+0x6c/0x230
 [<ffffffff81093831>] process_one_work+0x151/0x400
 [<ffffffff8109437a>] worker_thread+0x11a/0x460
 [<ffffffff8165848f>] ? __schedule+0x2bf/0x880
 [<ffffffff81094260>] ? rescuer_thread+0x2f0/0x2f0
 [<ffffffff810993d9>] kthread+0xc9/0xe0
 [<ffffffff81099310>] ? kthread_park+0x60/0x60
 [<ffffffff8165c38f>] ret_from_fork+0x3f/0x70
 [<ffffffff81099310>] ? kthread_park+0x60/0x60
Code: 41 89 d1 4c 0f af c9 0f b6 4f 1c 49 c1 e9 20 44 29 ca d3 ea 0f b6
4f 1d 41 01 d1 8b 50 18 41 d3 e9 48 8b 48 10 83 ea 01 89 50 18 <44> 88
0c 11 49 8b 52 38 48 83 c2 01 49 89 52 38 8b 48 18 85 c9
RIP  [<ffffffff811a7fb9>] free_block+0x119/0x190
 RSP <ffff880051703d40>
CR2: ffff88010f3beffe
---[ end trace 99d22a980c39d4db ]---
BUG: unable to handle kernel paging request at ffffffffffffffd8
IP: [<ffffffff81099b40>] kthread_data+0x10/0x20
PGD 188d067 PUD 188f067 PMD 0
Oops: 0000 [#2] SMP
Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache
x86_pkg_temp_thermal coretemp crct10dif_pclmul aesni_intel aes_x86_64
lrw gf128mul glue_helper ablk_helper cryptd pcspkr nfsd auth_rpcgss
nfs_acl lockd grace sunrpc ip_tables xen_netfront crc32c_intel
xen_gntalloc xen_evtchn ipv6 autofs4
CPU: 1 PID: 8519 Comm: kworker/1:2 Tainted: G    B D
4.4.6-4.el7xen.x86_64 #1
task: ffff8800346bf1c0 ti: ffff880051700000 task.ti: ffff880051700000
RIP: 0010:[<ffffffff81099b40>]  [<ffffffff81099b40>] kthread_data+0x10/0x20
RSP: 0018:ffff880051703a60  EFLAGS: 00010002
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
RDX: ffffffff81bd4080 RSI: 0000000000000001 RDI: ffff8800346bf1c0
RBP: ffff880051703a60 R08: ffff88007c8e19d0 R09: 0000c88ed0b2b602
R10: 000000000000001f R11: 0000000000000000 R12: ffff8800346bf1c0
R13: 0000000000000000 R14: 00000000000163c0 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 000000007bc8d000 CR4: 0000000000040620
Stack:
 ffff880051703a78 ffffffff81094741 ffff88007fd163c0 ffff880051703ac0
 ffffffff8165858f ffffffff810d2e27 ffff8800346bf1c0 ffff880051704000
 ffff8800346bf510 0000000000000046 ffff8800346bf1c0 ffff88007d3facc0
Call Trace:
 [<ffffffff81094741>] wq_worker_sleeping+0x11/0x90
 [<ffffffff8165858f>] __schedule+0x3bf/0x880
 [<ffffffff810d2e27>] ? call_rcu_sched+0x17/0x20
 [<ffffffff81658a85>] schedule+0x35/0x80
 [<ffffffff8107fc29>] do_exit+0x659/0xad0
 [<ffffffff8101a60a>] oops_end+0x9a/0xd0
 [<ffffffff8105f7ad>] no_context+0x10d/0x360
 [<ffffffff8105fb09>] __bad_area_nosemaphore+0x109/0x210
 [<ffffffff8105fc23>] bad_area_nosemaphore+0x13/0x20
 [<ffffffff81060200>] __do_page_fault+0x80/0x3f0
 [<ffffffff81060592>] do_page_fault+0x22/0x30
 [<ffffffff8165e0e8>] page_fault+0x28/0x30
 [<ffffffff811a7fb9>] ? free_block+0x119/0x190
 [<ffffffff811a871b>] drain_array+0xab/0x130
 [<ffffffff811a955c>] cache_reap+0x6c/0x230
 [<ffffffff81093831>] process_one_work+0x151/0x400
 [<ffffffff8109437a>] worker_thread+0x11a/0x460
 [<ffffffff8165848f>] ? __schedule+0x2bf/0x880
 [<ffffffff81094260>] ? rescuer_thread+0x2f0/0x2f0
 [<ffffffff810993d9>] kthread+0xc9/0xe0
 [<ffffffff81099310>] ? kthread_park+0x60/0x60
 [<ffffffff8165c38f>] ret_from_fork+0x3f/0x70
 [<ffffffff81099310>] ? kthread_park+0x60/0x60
Code: ff ff eb 92 be 3a 02 00 00 48 c7 c7 0b 5e 7a 81 e8 e6 34 fe ff e9
d5 fe ff ff 90 66 66 66 66 90 55 48 8b 87 00 04 00 00 48 89 e5 <48> 8b
40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90
RIP  [<ffffffff81099b40>] kthread_data+0x10/0x20
 RSP <ffff880051703a60>
CR2: ffffffffffffffd8
---[ end trace 99d22a980c39d4dc ]---
Fixing recursive fault but reboot is needed!

Nothing further is logged on the console after this, and the preceding
messages all concern an NFS server.
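
As an aside, a more reliable way to get the full console log that Boris
asked for (rather than scraping a screen buffer after the fact) would be
to let xenconsoled log each guest's console to disk. The following is only
a rough sketch; the sysconfig path and service names are assumptions that
vary by distro and Xen packaging:

  # Ask xenconsoled to log guest console output (stock xencommons sysconfig;
  # on other distros this may live in /etc/default/xencommons instead):
  echo 'XENCONSOLED_TRACE=guest' >> /etc/sysconfig/xencommons

  # Restart the console daemon so the setting takes effect (via the
  # xencommons service, or however the distro starts xenconsoled).

  # Each domU's console output should then be appended to:
  #   /var/log/xen/console/guest-<domain-name>.log

  # For a one-off capture of a single session instead:
  xl console <domain-name> | tee /tmp/domU-console.log

That way the next occurrence should leave a complete log from boot onwards,
including anything printed before the first error.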

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897




Thread overview: 14+ messages
2016-03-25  2:53 4.4: INFO: rcu_sched self-detected stall on CPU Steven Haigh
2016-03-25 12:23 ` Boris Ostrovsky
2016-03-25 14:05   ` Steven Haigh
2016-03-25 14:44     ` Boris Ostrovsky
2016-03-25 16:04       ` Steven Haigh
2016-03-25 16:20         ` Boris Ostrovsky
2016-03-25 21:07           ` Steven Haigh
2016-03-29  8:56             ` Steven Haigh
2016-03-29 14:14               ` Boris Ostrovsky
2016-03-29 17:44                 ` Steven Haigh
2016-03-29 18:04                   ` Steven Haigh
2016-03-30 13:44                     ` Boris Ostrovsky
2016-05-02 20:54                     ` gregkh
2016-04-02  1:50                 ` Steven Haigh [this message]
