* Re: 4.14.18 -> 4.14.24 - almost all guests hanged
       [not found]   ` <20180307145623.GH28488@pcnci.linuxbox.cz>
@ 2018-03-07 20:29     ` Nikola Ciprich
  2018-03-08 10:53       ` 王金浦
  2018-03-08 14:17       ` Greg KH
  0 siblings, 2 replies; 3+ messages in thread
From: Nikola Ciprich @ 2018-03-07 20:29 UTC
  To: 王金浦; +Cc: KVM list, nik, stable

Hi,

> > > I'd like to report that when upgrading our cluster from 4.14.18 to
> > > 4.14.24-rc1 (with live guest migration), almost none of the guests survived.
> > What's your hardware setup? Intel with IBPB-enabled microcode?
> Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
> 
> So I suppose no IBPB (at least that's what the meltdown checker reports).
> 
> 
> > Do guests hang right after live migration?
> Yes, I just tried it.
> 
> 
> > 
> > Are you able to reproduce the problem? Does it work with latest upstream?
> I'm able to reproduce it quickly. I'll revert the cluster to 4.14.18 now,
> but set up a test system right afterwards and test the patch you've proposed.
> 
> > 
> > Not sure it helps, but the following patch is missing in 4.14.24
> > 
> > commit 37b95951c58fdf08dc10afa9d02066ed9f176fb5 upstream.
> > 
> > kvm_valid_sregs() should use X86_CR0_PG and X86_CR4_PAE to check bit
> > status rather than X86_CR0_PG_BIT and X86_CR4_PAE_BIT. This patch is
> > to fix it.
> > 
> > Fixes: f29810335965a(KVM/x86: Check input paging mode when cs.l is set)
> > Reported-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
> > Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> 
> I'll test and report.

So indeed, applying this one on top of 4.14.24-rc1 fixes the migration for me.
Greg, could you queue this one up please?

Jack, thanks for the hint!
BR
nik



> 
> n.
> 
> 
> > 
> > Regards,
> > Jack
> > >
> > > I noticed that most of them got stuck in the "paused" state with no
> > > way to resume them (virsh just reported that the guest cannot be
> > > continued and needs to be rebooted).
> > >
> > > In dmesg, lots of the following messages appeared:
> > >
> > > [  116.593508] device vnet0 entered promiscuous mode
> > > [  124.143532] *** Guest State ***
> > > [  124.143594] CR0: actual=0x0000000000000030, shadow=0x0000000060000010, gh_mask=fffffffffffffff7
> > > [  124.143668] CR4: actual=0x0000000000002050, shadow=0x0000000000000000, gh_mask=ffffffffffffe871
> > > [  124.143871] CR3 = 0x00000000feffc000
> > > [  124.143984] RSP = 0xffffffff82003e98  RIP = 0xffffffff816df002
> > > [  124.144102] RFLAGS=0x00000246         DR7 = 0x0000000000000400
> > > [  124.144221] Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
> > > [  124.144341] CS:   sel=0xf000, attr=0x0009b, limit=0x0000ffff, base=0x00000000ffff0000
> > > [  124.144516] DS:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
> > > [  124.144692] SS:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
> > > [  124.144907] ES:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
> > > [  124.145089] FS:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
> > > [  124.145272] GS:   sel=0x0000, attr=0x00093, limit=0x0000ffff, base=0x0000000000000000
> > > [  124.145447] GDTR:                           limit=0x0000ffff, base=0x0000000000000000
> > > [  124.145626] LDTR: sel=0x0000, attr=0x00082, limit=0x0000ffff, base=0x0000000000000000
> > > [  124.145814] IDTR:                           limit=0x0000ffff, base=0x0000000000000000
> > > [  124.145995] TR:   sel=0x0000, attr=0x0008b, limit=0x0000ffff, base=0x0000000000000000
> > > [  124.146173] EFER =     0x0000000000000000  PAT = 0x0007040600070406
> > > [  124.146292] DebugCtl = 0x0000000000000000  DebugExceptions = 0x0000000000000000
> > > [  124.146466] Interruptibility = 00000000  ActivityState = 00000000
> > > [  124.146579] *** Host State ***
> > > [  124.146687] RIP = 0xffffffffa046a817  RSP = 0xffffc900200a7cb8
> > > [  124.146832] CS=0010 SS=0018 DS=0000 ES=0000 FS=0000 GS=0000 TR=0040
> > > [  124.146961] FSBase=00007fe82eff7700 GSBase=ffff881fffb40000 TRBase=fffffe00000df000
> > > [  124.147144] GDTBase=fffffe00000dd000 IDTBase=fffffe0000000000
> > > [  124.147262] CR0=0000000080050033 CR3=0000001f5b8fe004 CR4=00000000000626e0
> > > [  124.147381] Sysenter RSP=fffffe00000de200 CS:RIP=0010:ffffffff81801f60
> > > [  124.147499] EFER = 0x0000000000000d01  PAT = 0x0407050600070106
> > > [  124.147614] *** Control State ***
> > > [  124.147734] PinBased=0000007f CPUBased=96a1e9fa SecondaryExec=000004f2
> > > [  124.147849] EntryControls=0000d1ff ExitControls=002fefff
> > > [  124.147965] ExceptionBitmap=00060042 PFECmask=00000000 PFECmatch=00000000
> > > [  124.148085] VMEntry: intr_info=80000081 errcode=00000000 ilen=00000000
> > > [  124.148201] VMExit: intr_info=00000000 errcode=00000000 ilen=00000000
> > > [  124.148318]         reason=80000021 qualification=0000000000000000
> > > [  124.148432] IDTVectoring: info=00000000 errcode=00000000
> > > [  124.148545] TSC Offset = 0xffed7296fb06bc34
> > > [  124.148655] TPR Threshold = 0x00
> > > [  124.148770] EPT pointer = 0x0000001f1a0af01e
> > > [  124.148882] PLE Gap=00000080 Window=00001000
> > > [  124.148995] Virtual processor ID = 0x0001
> > >
> > > (never seen anything like that)
> > >
> > > I haven't yet gone through all the patches between those two versions,
> > > so I don't have any suspects yet. If anyone recognizes this as a known
> > > problem, please let me know.
> > >
> > > I'm going to check whether I can reproduce the problem.
> > >
> > > BR
> > >
> > > nik

-- 
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@linuxbox.cz
-------------------------------------


* Re: 4.14.18 -> 4.14.24 - almost all guests hanged
  2018-03-07 20:29     ` 4.14.18 -> 4.14.24 - almost all guests hanged Nikola Ciprich
@ 2018-03-08 10:53       ` 王金浦
  2018-03-08 14:17       ` Greg KH
  1 sibling, 0 replies; 3+ messages in thread
From: 王金浦 @ 2018-03-08 10:53 UTC
  To: Nikola Ciprich; +Cc: KVM list, nik, stable

2018-03-07 21:29 GMT+01:00 Nikola Ciprich <nikola.ciprich@linuxbox.cz>:
> so indeed, this one on top of 4.14.24-rc1 fixes the migration for me.
> Greg, could you queue this one up please?
>
> Jack, thanks for the hint!
> BR
> nik

Hi Nik,

Thanks for testing and letting us know. Greg has already queued the patch
for 4.14.25-rc1.

There are some other KVM fixes and performance enhancements as well.

Regards,
Jack



* Re: 4.14.18 -> 4.14.24 - almost all guests hanged
  2018-03-07 20:29     ` 4.14.18 -> 4.14.24 - almost all guests hanged Nikola Ciprich
  2018-03-08 10:53       ` 王金浦
@ 2018-03-08 14:17       ` Greg KH
  1 sibling, 0 replies; 3+ messages in thread
From: Greg KH @ 2018-03-08 14:17 UTC
  To: Nikola Ciprich; +Cc: 王金浦, KVM list, nik, stable

On Wed, Mar 07, 2018 at 09:29:10PM +0100, Nikola Ciprich wrote:
> so indeed, this one on top of 4.14.24-rc1 fixes the migration for me.
> Greg, could you queue this one up please?

As was already pointed out, this is already queued up to be in the next
release.

thanks,

greg k-h

