On 30/03/2016 1:14 AM, Boris Ostrovsky wrote: > On 03/29/2016 04:56 AM, Steven Haigh wrote: >> >> Interestingly enough, this just happened again - but on a different >> virtual machine. I'm starting to wonder if this may have something to do >> with the uptime of the machine - as the system that this seems to happen >> to is always different. >> >> Destroying it and monitoring it again has so far come up blank. >> >> I've thrown the latest lot of kernel messages here: >> http://paste.fedoraproject.org/346802/59241532 > > Would be good to see full console log. The one that you posted starts > with an error so I wonder what was before that. Agreed. It started off with me observing this on one VM - but since trying to get details on that VM - others have started showing issues as well. It frustrating as it seems I've been playing whack-a-mole to get more debug on what is going on. So, I've changed the kernel command line to the following on ALL VMs on this system: enforcemodulesig=1 selinux=0 fsck.repair=yes loglevel=7 console=tty0 console=ttyS0,38400n8 In the Dom0 (which runs the same kernel package), I've started a screen sessions with a screen for each of the DomUs running attached to the console via 'xl console blah' - so hopefully the next one that goes down (whichever one that is) will get caught in the console. > Have you tried this on bare metal, BTW? And you said this is only > observed on 4.4, not 4.5, right? I use the same kernel package as the Dom0 kernel - and so far haven't seen any issues running this as the Dom0. I haven't used it on baremetal as a non-xen kernel as yet. The kernel package I'm currently running is for CentOS / Scientific Linux / RHEL at: http://au1.mirror.crc.id.au/repo/el7-testing/x86_64/ I'm using 4.4.6-3 at the moment - which has CONFIG_PREEMPT_VOLUNTARY set - which *MAY* have increased the time between this happening - or may have no effect at all. I'm not convinced either way as yet. With respect to 4.5, I have had reports from another user of my packages that they haven't seen the same crash using the same Xen packages but with kernel 4.5. I have not verified this myself as yet as I haven't gone down the path of making 4.5 packages for testing. As such, I wouldn't treat this as a conclusive test case as yet. I'm hoping that the steps I've taken above may give some more information in which we can drill down into exactly what is going on - or at least give more pointers into the root cause. >> >> Interestingly, around the same time, /var/log/messages on the remote >> syslog server shows: >> Mar 29 17:00:01 zeus systemd: Created slice user-0.slice. >> Mar 29 17:00:01 zeus systemd: Starting user-0.slice. >> Mar 29 17:00:01 zeus systemd: Started Session 1567 of user root. >> Mar 29 17:00:01 zeus systemd: Starting Session 1567 of user root. >> Mar 29 17:00:01 zeus systemd: Removed slice user-0.slice. >> Mar 29 17:00:01 zeus systemd: Stopping user-0.slice. >> Mar 29 17:01:01 zeus systemd: Created slice user-0.slice. >> Mar 29 17:01:01 zeus systemd: Starting user-0.slice. >> Mar 29 17:01:01 zeus systemd: Started Session 1568 of user root. >> Mar 29 17:01:01 zeus systemd: Starting Session 1568 of user root. >> Mar 29 17:08:34 zeus ntpdate[18569]: adjust time server 203.56.246.94 >> offset -0.002247 sec >> Mar 29 17:08:34 zeus systemd: Removed slice user-0.slice. >> Mar 29 17:08:34 zeus systemd: Stopping user-0.slice. >> Mar 29 17:10:01 zeus systemd: Created slice user-0.slice. >> Mar 29 17:10:01 zeus systemd: Starting user-0.slice. >> Mar 29 17:10:01 zeus systemd: Started Session 1569 of user root. >> Mar 29 17:10:01 zeus systemd: Starting Session 1569 of user root. >> Mar 29 17:10:01 zeus systemd: Removed slice user-0.slice. >> Mar 29 17:10:01 zeus systemd: Stopping user-0.slice. >> Mar 29 17:20:01 zeus systemd: Created slice user-0.slice. >> Mar 29 17:20:01 zeus systemd: Starting user-0.slice. >> Mar 29 17:20:01 zeus systemd: Started Session 1570 of user root. >> Mar 29 17:20:01 zeus systemd: Starting Session 1570 of user root. >> Mar 29 17:20:01 zeus systemd: Removed slice user-0.slice. >> Mar 29 17:20:01 zeus systemd: Stopping user-0.slice. >> Mar 29 17:30:55 zeus systemd: systemd-logind.service watchdog timeout >> (limit 1min)! >> Mar 29 17:32:25 zeus systemd: systemd-logind.service stop-sigabrt timed >> out. Terminating. >> Mar 29 17:33:56 zeus systemd: systemd-logind.service stop-sigterm timed >> out. Killing. >> Mar 29 17:35:26 zeus systemd: systemd-logind.service still around after >> SIGKILL. Ignoring. >> Mar 29 17:36:56 zeus systemd: systemd-logind.service stop-final-sigterm >> timed out. Killing. >> Mar 29 17:38:26 zeus systemd: systemd-logind.service still around after >> final SIGKILL. Entering failed mode. >> Mar 29 17:38:26 zeus systemd: Unit systemd-logind.service entered failed >> state. >> Mar 29 17:38:26 zeus systemd: systemd-logind.service failed. > > > These may be result of your system not feeling well, which is not > surprising. > > -boris -- Steven Haigh Email: netwiz@crc.id.au Web: https://www.crc.id.au Phone: (03) 9001 6090 - 0412 935 897