On Fri, Dec 02, 2016 at 09:36:42AM +0100, Thomas Gleixner wrote: > On Fri, 2 Dec 2016, David Gibson wrote: > > On Thu, Dec 01, 2016 at 12:59:51PM +0100, Thomas Gleixner wrote: > > > So I assume that you are talking about a VM which was not scheduled by the > > > host due to overcommitment (who ever thought that this is a good idea) or > > > whatever other reason (yes, people were complaining about wreckage caused > > > by stopping kernels with debuggers) for a long enough time to trigger that > > > overflow situation. If that's the case then the unsigned conversion will > > > just make it more unlikely but it still will happen. > > > > It was essentially the stopped by debugger case. I forget exactly > > why, but the guest was being explicitly stopped from outside, it > > wasn't just scheduling lag. I think it was something in the vicinity > > of 10 minutes stopped. > > Ok. Debuggers stopping stuff is one issue, but if I understood Liav > correctly, then he is seing the issue on a heavy loaded machine. Right. I can't speak to other situations which might trigger this. > Liav, can you please describe the scenario in detail? Are you observing > this on bare metal or in a VM which gets scheduled out long enough or was > there debugging/hypervisor intervention involved? > > > It's long enough ago that I can't be sure, but I thought we'd tried > > various different stoppage periods, which should have also triggered > > the unsigned overflow you're describing, and didn't observe the crash > > once the change was applied. Note that there have been other changes > > to the timekeeping code since then, which might have made a > > difference. > > > > I agree that it's not reasonable for the guest to be entirely > > unaffected by such a large stoppage: I'd have no complaints if the > > guest time was messed up, and/or it spewed warnings. But complete > > guest death seems a rather more fragile response to the situation than > > we'd like. > > Guests death? Is it really dead/crashed or just stuck in that endless loop > trying to add that huge negative value piecewise? Well, I don't know. But the point was it was unusable from the console, and didn't come back any time soon. > That's at least what Liav was describing as he mentioned > __iter_div_u64_rem() explicitely. > > While I'm less worried about debuggers, I worry about the real thing. > > I agree that we should not starve after resume from a debug stop, but in > that case the least of my worries is time going backwards. > > Though if the signed mult overrun is observable in a live system, then we > need to worry about time going backwards even with the unsigned > conversion. Simply because once we fixed the starvation issue people with > insane enough setups will trigger the unsigned overrun and complain about > time going backwards. > > Thanks, > > tglx > > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson