From mboxrd@z Thu Jan 1 00:00:00 1970 From: Steven Haigh Subject: Re: 4.4: INFO: rcu_sched self-detected stall on CPU Date: Sat, 26 Mar 2016 01:05:37 +1100 Message-ID: <56F545B1.8080609__4147.94172761521$1458915734$gmane$org@crc.id.au> References: <56F4A816.3050505@crc.id.au> <56F52DBF.5080006@oracle.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============4243238403309626012==" Return-path: Received: from mail6.bemta5.messagelabs.com ([195.245.231.135]) by lists.xenproject.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ajSNF-0005YY-Hy for xen-devel@lists.xenproject.org; Fri, 25 Mar 2016 14:06:02 +0000 In-Reply-To: <56F52DBF.5080006@oracle.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Errors-To: xen-devel-bounces@lists.xen.org Sender: "Xen-devel" To: Boris Ostrovsky , xen-devel , linux-kernel@vger.kernel.org Cc: "gregkh@linuxfoundation.org" List-Id: xen-devel@lists.xenproject.org This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --===============4243238403309626012== Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="CcxAmHofcQ5sReDQvvjuAx3NHWKhClbld" This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --CcxAmHofcQ5sReDQvvjuAx3NHWKhClbld Content-Type: multipart/mixed; boundary="CKqGFjJ1V9V9lEk5qAIvL5v6mXbnDjeIs" From: Steven Haigh To: Boris Ostrovsky , xen-devel , linux-kernel@vger.kernel.org Cc: "gregkh@linuxfoundation.org" Message-ID: <56F545B1.8080609@crc.id.au> Subject: Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU References: <56F4A816.3050505@crc.id.au> <56F52DBF.5080006@oracle.com> In-Reply-To: <56F52DBF.5080006@oracle.com> --CKqGFjJ1V9V9lEk5qAIvL5v6mXbnDjeIs Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable On 25/03/2016 11:23 PM, Boris Ostrovsky wrote: > On 03/24/2016 10:53 PM, Steven Haigh wrote: >> Hi all, >> >> Firstly, I've cross-posted this to xen-devel and the lkml - as this >> problem seems to only exist when using kernel 4.4 as a Xen DomU kernel= =2E >> I have also CC'ed Greg KH for his awesome insight as maintainer. >> >> Please CC myself into replies - as I'm not a member of the kernel >> mailing list - I may miss replies from monitoring the archives. >> >> I've noticed recently that heavy disk IO is causing rcu_sched to detec= t >> stalls. The process mentioned usually goes to 100% CPU usage, and >> eventually processes start segfaulting and dying. The only fix to >> recover the system is to use 'xl destroy' to force-kill the VM and to >> start it again. >> >> The majority of these issues seem to mention ext4 in the trace. This m= ay >> indicate an issue there - or may be a red herring. >> >> The gritty details: >> INFO: rcu_sched self-detected stall on CPU >> #0110-...: (20999 ticks this GP) idle=3D327/140000000000001/0 >> softirq=3D1101493/1101493 fqs=3D6973 >> #011 (t=3D21000 jiffies g=3D827095 c=3D827094 q=3D524) >> Task dump for CPU 0: >> rsync R running task 0 2446 2444 0x00000088 >> ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000 >> ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001 >> ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00 >> Call Trace: >> [] sched_show_task+0xaf/0x110 >> [] dump_cpu_task+0x39/0x40 >> [] rcu_dump_cpu_stacks+0x8a/0xc0 >> [] rcu_check_callbacks+0x424/0x7a0 >> [] ? account_system_time+0x81/0x110 >> [] ? account_process_tick+0x61/0x160 >> [] ? tick_sched_do_timer+0x30/0x30 >> [] update_process_times+0x39/0x60 >> [] tick_sched_handle.isra.15+0x36/0x50 >> [] tick_sched_timer+0x3d/0x70 >> [] __hrtimer_run_queues+0xf2/0x250 >> [] hrtimer_interrupt+0xa8/0x190 >> [] xen_timer_interrupt+0x2e/0x140 >> [] handle_irq_event_percpu+0x55/0x1e0 >> [] handle_percpu_irq+0x3a/0x50 >> [] generic_handle_irq+0x22/0x30 >> [] __evtchn_fifo_handle_events+0x15f/0x180 >> [] evtchn_fifo_handle_events+0x10/0x20 >> [] __xen_evtchn_do_upcall+0x43/0x80 >> [] xen_evtchn_do_upcall+0x30/0x50 >> [] xen_hvm_callback_vector+0x82/0x90 >> [] ? queued_write_lock_slowpath+0x3d/0x80 >> [] _raw_write_lock+0x1e/0x30 >=20 > This looks to me like ext4 failing to grab a lock. Everything above it > (in Xen code) is regular tick interrupt handling which detects the stal= l. >=20 > Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates > any possible issues with pv locks. >=20 > Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg, > /var/log/xen/) Are you oversubscribing your guest (CPU-wise)? There is nothing special being logged anywhere that I can see. dmesg / xl dmesg on the Dom0 show nothing unusual. I do share CPUs - but I don't give any DomU more than 2 vcpus. The physical host has 4 cores - 1 pinned to the Dom0. I log to a remote syslog on this system - and I've uploaded the entire log to a pastebin (don't want to do a 45Kb attachment here): http://paste.fedoraproject.org/345095/58914452 Not sure if it makes any difference at all, but my DomU config is: # cat /etc/xen/backup.vm name =3D "backup.vm" memory =3D 2048 vcpus =3D 2 cpus =3D "1-3" disk =3D [ 'phy:/dev/vg_raid1_new/backup.vm,xvda,w' ] vif =3D [ "mac=3D00:11:36:35:35:09, bridge=3Dbr203, vifname=3Dvm.backup, script=3Dvif-bridge" ] bootloader =3D 'pygrub' pvh =3D 1 on_poweroff =3D 'destroy' on_reboot =3D 'restart' on_crash =3D 'restart' cpu_weight =3D 64 I never had this problem when running kernel 4.1.x - it only started when I upgraded everything to 4.4 - not exactly a great help - but may help narrow things down? >> [] ext4_es_remove_extent+0x43/0xc0 >> [] ext4_clear_inode+0x39/0x80 >> [] ext4_evict_inode+0x8d/0x4e0 >> [] evict+0xb7/0x180 >> [] dispose_list+0x36/0x50 >> [] prune_icache_sb+0x4b/0x60 >> [] super_cache_scan+0x141/0x190 >> [] shrink_slab.part.37+0x1ee/0x390 >> [] shrink_zone+0x26c/0x280 >> [] do_try_to_free_pages+0x15c/0x410 >> [] try_to_free_pages+0xba/0x170 >> [] __alloc_pages_nodemask+0x525/0xa60 >> [] ? kmem_cache_free+0xcc/0x2c0 >> [] alloc_pages_current+0x8d/0x120 >> [] __page_cache_alloc+0x91/0xc0 >> [] pagecache_get_page+0x56/0x1e0 >> [] grab_cache_page_write_begin+0x26/0x40 >> [] ext4_da_write_begin+0xa1/0x300 >> [] ? ext4_da_write_end+0x124/0x2b0 >> [] generic_perform_write+0xc0/0x1a0 >> [] __generic_file_write_iter+0x188/0x1e0 >> [] ext4_file_write_iter+0xf0/0x340 >> [] __vfs_write+0xaa/0xe0 >> [] vfs_write+0xa2/0x1a0 >> [] SyS_write+0x46/0xa0 >> [] entry_SYSCALL_64_fastpath+0x12/0x71 >> >> Some 11 hours later: >> sshd[785]: segfault at 1f0 ip 00007f03bb94ae5c sp 00007ffe9eb54470 err= or >> 4 in ld-2.17.so[7f03bb940000+21000] >> sh[787]: segfault at 1f0 ip 00007f6b4a0dfe5c sp 00007ffe3d4a71e0 error= 4 >> in ld-2.17.so[7f6b4a0d5000+21000] >> systemd-cgroups[788]: segfault at 1f0 ip 00007f4baa82ce5c sp >> 00007ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000] >> sshd[791]: segfault at 1f0 ip 00007ff8c8a8ce5c sp 00007ffede9e1c20 err= or >> 4 in ld-2.17.so[7ff8c8a82000+21000] >> sshd[792]: segfault at 1f0 ip 00007f183cf75e5c sp 00007ffc81ab7160 err= or >> 4 in ld-2.17.so[7f183cf6b000+21000] >> sshd[793]: segfault at 1f0 ip 00007f3c665ece5c sp 00007ffd9a13c850 err= or >> 4 in ld-2.17.so[7f3c665e2000+21000] >> >> From isolated testing, this does not occur on kernel 4.5.x - however = I >> have not verified this myself. >> >> The kernel config used can be found in the kernel-xen git repo if it >> assists in debugging: >> http://xen.crc.id.au/git/?p=3Dkernel-xen;a=3Dsummary >> >> >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel >=20 --=20 Steven Haigh Email: netwiz@crc.id.au Web: https://www.crc.id.au Phone: (03) 9001 6090 - 0412 935 897 --CKqGFjJ1V9V9lEk5qAIvL5v6mXbnDjeIs-- --CcxAmHofcQ5sReDQvvjuAx3NHWKhClbld Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBCAAGBQJW9UW7AAoJEEGvNdV6fTHcOMUP/1WRdIxwa11UkJME1y+IQoFG 63ILVQJ0OFU2zPPum1lUGD4gncre7qxLm+qhxuWmg5VxaZvHc6Z0z7oxcQDnyDyw X6/A44blSasbl1meTx9QJ3rncYhsDTQgnMgIamCrKScRA/55SVWCo6d5w0N1f7A0 zbYR/i38AsKa+hV42DRdrx7jEJEUAUghq8q0fmIaetQVym9CMjUfL3N1IHFpLhJ/ rrAn7gfFj3TpAkcLkouk9BOev3dYytgfFEuean9KCyrtLI6FX6mGDjRXo5veuSlG xNko34JfTSJrIOnxOeTy4KU614qk2PiFsQgKk9P4Vr5TI6AwZyvtll6sj1OZcQOh hMf2YjrEbCrqF1D6kqoe3bTx4xM1tw6K6+vtxgir83bZzmUMk+ew/FgstCC6mCO7 H1U0sOhQHWueqZaPz3y+h/diwy7ln/r7bbRJubMusv/gFhlgK08A0C7XiWZrjvBH QboUo6CPkuRtpMqGXdVcPprxe1ZqLzm72m4B989H14/HZ0zCmyRMUisEFa9gXWr3 Tz3xa/O9VElPFcbTNnEOCMQojVARj5bKvWrDjXnntSyHq5rbknNYciDjB8Vw/dq3 iEFbsoFp/5m0DfMruc2okFT+gNCdV2EX07vS6o3SMBBLbrofRgwxo/XJq341IF+n S1MEEZvg5AhBWbQVARq8 =sfQo -----END PGP SIGNATURE----- --CcxAmHofcQ5sReDQvvjuAx3NHWKhClbld-- --===============4243238403309626012== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: base64 Content-Disposition: inline X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KWGVuLWRldmVs IG1haWxpbmcgbGlzdApYZW4tZGV2ZWxAbGlzdHMueGVuLm9yZwpodHRwOi8vbGlzdHMueGVuLm9y Zy94ZW4tZGV2ZWwK --===============4243238403309626012==--