On Mon, 2018-07-16 at 08:40 -0700, Paul E. McKenney wrote:
> Most of the weekend was devoted to testing today's upcoming pull request,
> but I did get a bit more testing done on this.
>
> I was able to make this happen more often by tweaking rcutorture a
> bit, but I still do not yet have statistically significant results.
> Nevertheless, I have thus far only seen failures with David's patch or
> with both David's and my patch.  And I actually got a full-up rcutorture
> failure (a too-short grace period) in addition to the aforementioned
> close calls.
>
> Over this coming week I expect to devote significant testing time to
> the commit just prior to David's in my stack.  If I don't see failures
> on that commit, we will need to spend some quality time with the KVM
> folks on whether or not kvm_x86_ops->run() and friends have the option of
> failing to return, but instead causing control to pop up somewhere else.
> Or someone could tell me how I am being blind to some obvious bug in
> the two commits that allow RCU to treat KVM guest-OS execution as an
> extended quiescent state.  ;-)

One thing we can try, if my patch is implicated, is moving the calls to
rcu_kvm_en{ter,xit} closer to the actual VM entry. Let's try putting
them around the large asm block in arch/x86/kvm/vmx.c::vmx_vcpu_run()
for example. If that fixes it, then we know we've missed something else
interesting that's happening in the middle.

Testing on Skylake shows a guest CPUID goes from ~3000 cycles to ~3500
with this patch, so in the next iteration it definitely needs to be
ifdef CONFIG_NO_HZ_FULL anyway, because it's actually required there
(AFAICT) and it's too expensive otherwise as Christian pointed out.
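
To make the two ideas above concrete (hooks placed tightly around the VM
entry, and compiled out unless CONFIG_NO_HZ_FULL is set), here is a rough
userspace sketch. It is not kernel code and not the actual patch: the
bodies of rcu_kvm_enter()/rcu_kvm_exit() are counting stubs, and
guest_run(), vcpu_run_sketch(), eqs_depth and eqs_entries are
hypothetical names invented purely for illustration.

```c
/*
 * Userspace illustration only. The real rcu_kvm_enter()/rcu_kvm_exit()
 * live in the kernel; the stubs below just count so the placement and
 * the #ifdef guard can be seen in isolation.
 */

static int eqs_depth;   /* >0 while "inside" the extended quiescent state */
static int eqs_entries; /* how many times the EQS was entered */

#ifdef CONFIG_NO_HZ_FULL
static inline void rcu_kvm_enter(void) { eqs_depth++; eqs_entries++; }
static inline void rcu_kvm_exit(void)  { eqs_depth--; }
#else
/* Compiled out: no extra work on the VM entry/exit path. */
static inline void rcu_kvm_enter(void) { }
static inline void rcu_kvm_exit(void)  { }
#endif

/*
 * Hypothetical stand-in for the large asm block in vmx_vcpu_run().
 * Returns the EQS depth observed while "in the guest".
 */
static int guest_run(void)
{
	return eqs_depth;
}

/* Hypothetical stand-in for vmx_vcpu_run(). */
static int vcpu_run_sketch(void)
{
	int seen;

	/* ... host-side setup stays outside the EQS ... */

	rcu_kvm_enter();    /* as close to VM entry as possible */
	seen = guest_run();
	rcu_kvm_exit();     /* immediately after VM exit */

	/* ... exit handling stays outside the EQS ... */

	return seen;
}
```

Built with -DCONFIG_NO_HZ_FULL the stubs count one balanced enter/exit
pair per run; built without it they are empty and the run path does no
extra work. Whether the real hooks can safely sit that close to the asm
block is exactly what the experiment would tell us.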