kvm.vger.kernel.org archive mirror
* Re: Nvidia GPU PCI passthrough and kernel commit #5f33887a36824f1e906863460535be5d841a4364
       [not found] <PH0PR02MB84228844F6176836E8C86B1BA40F9@PH0PR02MB8422.namprd02.prod.outlook.com>
@ 2022-11-24  1:39 ` Paolo Bonzini
       [not found]   ` <PH0PR02MB84229CEBB3C7A8DAC626107CA40F9@PH0PR02MB8422.namprd02.prod.outlook.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Paolo Bonzini @ 2022-11-24  1:39 UTC (permalink / raw)
  To: Ashish Gupta (SJC), kvm; +Cc: seanjc, John Levon

On 11/24/22 01:56, Ashish Gupta (SJC) wrote:
> Nutanix uses a KVM-based hypervisor called AHV (Acropolis 
> Hypervisor).
> 
> The latest AHV release is based on kernel v5.10.117, where we found that 
> Nvidia GPU cards (10/A30/A40 etc.) stopped working.
> 
> Guest VMs (based on CentOS 7 or Ubuntu 16.10) were able to identify the card, 
> but after installing the Nvidia GRID driver we saw the following logs in 
> the guest VM.
> 

Have you tested with a more recent version than 5.10.x, to see if the 
bug is still there?

Paolo



* Re: Nvidia GPU PCI passthrough and kernel commit #5f33887a36824f1e906863460535be5d841a4364
       [not found]     ` <PH0PR02MB8422D2C6A7F56200FCD384D8A40F9@PH0PR02MB8422.namprd02.prod.outlook.com>
@ 2022-11-25 22:45       ` Paolo Bonzini
       [not found]         ` <PH0PR02MB84221C062510FCFAEE7EE9BAA4109@PH0PR02MB8422.namprd02.prod.outlook.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Paolo Bonzini @ 2022-11-25 22:45 UTC (permalink / raw)
  To: Ashish Gupta (SJC); +Cc: kvm, seanjc, John Levon

What about a much newer kernel, like 6.0 or so?

Paolo

On Thu, Nov 24, 2022 at 7:18 AM Ashish Gupta (SJC)
<ashish.gupta1@nutanix.com> wrote:
>
> Hi Paolo,
>
> With v5.10.155 it also failed in a similar way.
>
> [root@ahvgpu04-1 ~]# uname -r
> 5.10.155-2.el7.nutanix.20220304.242.x86_64
>
> Logs from guest vm.
>
> [  113.669214] NVRM: GPU at PCI:0000:00:06: GPU-fcdeaa4c-664a-4de8-2e32-23e14628ce8c
> [  113.669215] NVRM: GPU Board Serial Number: 1651522000466
> [  113.669216] NVRM: Xid (PCI:0000:00:06): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
> [  113.669384] NVRM: Xid (PCI:0000:00:06): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
> [  113.669400] NVRM: Xid (PCI:0000:00:06): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
> [  113.669498] NVRM: Xid (PCI:0000:00:06): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
> [  113.669609] NVRM: Xid (PCI:0000:00:06): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_CONTROL (0x20800a70 0x0).
> [  113.669615] NVRM: Xid (PCI:0000:00:06): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_CONTROL (0x20800a6c 0x4).
> [  113.670156] NVRM: Xid (PCI:0000:00:06): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_CONTROL (0x6 0x0).
> [  113.670247] NVRM: Xid (PCI:0000:00:06): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
> [  113.670338] NVRM: Xid (PCI:0000:00:06): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function GSP_RM_CONTROL (0x20800a38 0x18).
> [  113.672663] NVRM: Xid (PCI:0000:00:06): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function UPDATE_BAR_PDE (0x0 0x0).
> [  113.672702] NVRM: Xid (PCI:0000:00:06): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
> [  113.672709] NVRM: Xid (PCI:0000:00:06): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function FREE (0x0 0x0).
> [  113.672787] NVRM: Xid (PCI:0000:00:06): 119, pid='<unknown>', name=<unknown>, Timeout waiting for RPC from GSP! Expected function UNLOADING_GUEST_DRIVER (0x0 0x0).
> [  113.674376] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x11:0x45:2540)
> [  113.675130] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
> [  113.850458] NVRM: GPU 0000:00:06.0: RmInitAdapter failed! (0x22:0x56:731)
> [  113.851206] NVRM: GPU 0000:00:06.0: rm_init_adapter failed, device minor number 0
>
> Regards,
>
> --Ashish Gupta
>
>
>
> From: Ashish Gupta (SJC) <ashish.gupta1@nutanix.com>
> Date: Wednesday, November 23, 2022 at 5:49 PM
> To: Paolo Bonzini <pbonzini@redhat.com>, kvm@vger.kernel.org <kvm@vger.kernel.org>
> Cc: seanjc@google.com <seanjc@google.com>, John Levon <john.levon@nutanix.com>
> Subject: Re: Nvidia GPU PCI passthrough and kernel commit #5f33887a36824f1e906863460535be5d841a4364
>
> > Have you tested with a more recent version than 5.10.x, to see if the
> > bug is still there?
>
>
> I am building an image with v5.10.155.
>
> I hope to have a result in 2-3 hours and will update the thread.
>
>
>
> Regards,
>
> --Ashish Gupta
>



* Re: Nvidia GPU PCI passthrough and kernel commit #5f33887a36824f1e906863460535be5d841a4364
       [not found]         ` <PH0PR02MB84221C062510FCFAEE7EE9BAA4109@PH0PR02MB8422.namprd02.prod.outlook.com>
@ 2022-11-28 17:54           ` Paolo Bonzini
       [not found]             ` <PH0PR02MB8422C61596331E2B17E476C7A4149@PH0PR02MB8422.namprd02.prod.outlook.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Paolo Bonzini @ 2022-11-28 17:54 UTC (permalink / raw)
  To: Ashish Gupta (SJC); +Cc: kvm, seanjc, John Levon

On 11/27/22 19:29, Ashish Gupta (SJC) wrote:
> Hi Paolo,
> 
> I checked on Ubuntu 20.04 with kernel 6.0.
> 
> Nvidia GPU PCI passthrough works fine there.
> 
> Could this be a problem in 5.10.x that was fixed by some 
> subsequent commit?

Yes, most likely (and also that's probably why it wasn't reported until 
now).  If you would like it to be fixed in 5.10, you can try the latest 
release of all the 5.10-5.19 stable branches.  Having the first fixed 
release might be enough to figure out a candidate fix.
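
The search suggested above can be sketched as a small helper. This is only an illustration: `first_fixed_release` and `passthrough_works` are hypothetical names, and the test itself (booting a host on the given kernel and loading the guest driver) is assumed to be supplied by the tester. The only results filled in are the two reported in this thread.

```python
from typing import Callable, Optional, Sequence


def first_fixed_release(
    releases: Sequence[str],
    passthrough_works: Callable[[str], bool],
) -> Optional[str]:
    """Return the oldest release for which the passthrough test passes.

    `releases` must be ordered oldest to newest. A linear scan is used
    instead of a bisection because a fix may have been backported to
    some stable branches and not to others, so the results are not
    guaranteed to be monotonic across branches.
    """
    for release in releases:
        if passthrough_works(release):
            return release
    return None


# Results reported in this thread; the latest release of each stable
# branch in between would be added here once tested.
reported = {"5.10.155": False, "6.0": True}
print(first_fixed_release(["5.10.155", "6.0"], reported.__getitem__))
```

With more branches tested, the first passing entry narrows the range of commits in which to look for a candidate fix.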

Paolo



* Re: Nvidia GPU PCI passthrough and kernel commit #5f33887a36824f1e906863460535be5d841a4364
       [not found]             ` <PH0PR02MB8422C61596331E2B17E476C7A4149@PH0PR02MB8422.namprd02.prod.outlook.com>
@ 2022-12-02  1:16               ` Paolo Bonzini
       [not found]                 ` <PH0PR02MB842221DE947D855CD1BF8AEFA4179@PH0PR02MB8422.namprd02.prod.outlook.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Paolo Bonzini @ 2022-12-02  1:16 UTC (permalink / raw)
  To: Ashish Gupta (SJC), Suresh Gumpula, Felipe Franciosi
  Cc: kvm, seanjc, John Levon, Bijan Mottahedeh, Eiichi Tsukata

On 12/2/22 01:29, Ashish Gupta (SJC) wrote:
> Hi Paolo,
> 
> While we were assessing the code change made by commit 
> 5f33887a36824f1e906863460535be5d841a4364,
> 
> Bijan noticed the following:
> 
> From the changed code in commit 
> 5f33887a36824f1e906863460535be5d841a4364, we see that the following check
> 
> !kvm_vcpu_apicv_active(vcpu)
> 
> has been removed, so in fact the new code is basically assuming that 
> apicv is always active.

Right, instead it checks irqchip_in_kernel(kvm) && enable_apicv.  This 
is documented in the commit message:

     However, these checks do not attempt to synchronize with changes to
     the IRTE.  In particular, there is no path that updates the IRTE
     when APICv is re-activated on vCPU 0; and there is no path to wakeup
     a CPU that has APICv disabled, if the wakeup occurs because of an
     IRTE that points to a posted interrupt.
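
To make the difference between the two conditions concrete, here is a toy model in Python. It is not kernel code: the classes and function names are hypothetical, and only the two terms named in this message (`irqchip_in_kernel(kvm)` and `enable_apicv`) plus the removed per-vCPU check are modelled; any other conditions in the real code are left out.

```python
from dataclasses import dataclass


@dataclass
class ToyVcpu:
    apicv_active: bool  # per-vCPU state that can change at runtime


@dataclass
class ToyVm:
    irqchip_in_kernel: bool  # fixed once the VM is set up


def old_check(vcpu: ToyVcpu) -> bool:
    # Removed condition: follows the current APICv state of one vCPU.
    return vcpu.apicv_active


def new_check(vm: ToyVm, enable_apicv: bool) -> bool:
    # Replacement condition: depends only on VM-wide and host-wide
    # settings, so it does not change when a vCPU toggles APICv.
    return vm.irqchip_in_kernel and enable_apicv


vm = ToyVm(irqchip_in_kernel=True)
vcpu = ToyVcpu(apicv_active=True)
before = (old_check(vcpu), new_check(vm, enable_apicv=True))
vcpu.apicv_active = False  # APICv temporarily deactivated on this vCPU
after = (old_check(vcpu), new_check(vm, enable_apicv=True))
print(before, after)
```

In this model the old check flips when the vCPU's APICv state changes while the new one stays the same, which is the sense in which the new condition is stable with respect to the IRTE.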

The full series is at 
https://lore.kernel.org/lkml/20211123004311.2954158-2-pbonzini@redhat.com/T/ 
and has more details:

     Now that APICv can be disabled per-CPU (depending on whether it has
     some setup that is incompatible) we need to deal with guests having
     a mix of vCPUs with enabled/disabled posted interrupts.  For
     assigned devices, their posted interrupt configuration must be the
     same across the whole VM, so handle posted interrupts by hand on
     vCPUs with disabled posted interrupts.
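
A rough way to picture "handling posted interrupts by hand" is the toy model below. It is a sketch under stated assumptions, not the kernel's implementation: `ToyVcpu`, `post` and the `skip_when_apicv_disabled` flag are hypothetical, and PIR and IRR are reduced to plain sets of vector numbers.

```python
class ToyVcpu:
    def __init__(self, apicv_active: bool) -> None:
        self.apicv_active = apicv_active
        self.pir = set()  # vectors posted by the assigned device
        self.irr = set()  # vectors pending delivery to the guest

    def post(self, vector: int) -> None:
        # The device's IRTE points at a posted interrupt, so the vector
        # lands in PIR regardless of this vCPU's APICv state.
        self.pir.add(vector)

    def sync_pir_to_irr(self, skip_when_apicv_disabled: bool) -> None:
        # With the flag set, a vCPU with APICv disabled never looks at
        # PIR, so posted vectors stay stuck there.
        if skip_when_apicv_disabled and not self.apicv_active:
            return
        self.irr |= self.pir
        self.pir.clear()


vcpu = ToyVcpu(apicv_active=False)
vcpu.post(0x41)
vcpu.sync_pir_to_irr(skip_when_apicv_disabled=True)
stuck = (set(vcpu.pir), set(vcpu.irr))
vcpu.sync_pir_to_irr(skip_when_apicv_disabled=False)
delivered = (set(vcpu.pir), set(vcpu.irr))
print(stuck, delivered)
```

In the model, the vector only reaches IRR once the synchronization is done unconditionally.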

All four patches were marked as stable, but it looks like the first 
three did not apply and therefore are not part of 5.10.

78311a514099932cd8434d5d2194aa94e56ab67c
     KVM: x86: ignore APICv if LAPIC is not enabled
7e1901f6c86c896acff6609e0176f93f756d8b2a
     KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled
37c4dbf337c5c2cdb24365ffae6ed70ac1e74d7a
     KVM: x86: check PIR even for vCPUs with disabled APICv

The three commits do not have any subsequent commit that Fixes them.
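
One way to double-check which of these commits reached a stable branch is to look for the upstream-commit markers that stable backports normally carry in their commit messages. The sketch below is an assumption-laden helper, not an official tool: the function names are hypothetical, and the log text is assumed to come from something like `git log --format=%B` run over the stable branch.

```python
import re
from typing import Iterable, List, Set

# The three upstream commits listed above.
CANDIDATES = [
    "78311a514099932cd8434d5d2194aa94e56ab67c",
    "7e1901f6c86c896acff6609e0176f93f756d8b2a",
    "37c4dbf337c5c2cdb24365ffae6ed70ac1e74d7a",
]

# Markers commonly found in stable backports:
#   "commit <sha> upstream."  or  "[ Upstream commit <sha> ]"
UPSTREAM_RE = re.compile(
    r"^[ \t]*(?:commit ([0-9a-f]{40}) upstream\.?"
    r"|\[ Upstream commit ([0-9a-f]{40}) \])[ \t]*$",
    re.MULTILINE,
)


def upstream_ids(stable_log: str) -> Set[str]:
    """Collect every upstream commit id referenced in the log text."""
    return {a or b for a, b in UPSTREAM_RE.findall(stable_log)}


def missing_backports(wanted: Iterable[str], stable_log: str) -> List[str]:
    """Return the wanted upstream ids that the log never references."""
    present = upstream_ids(stable_log)
    return [sha for sha in wanted if sha not in present]


# With an empty log, all three candidates are reported as missing.
print(missing_backports(CANDIDATES, ""))
```

A commit that was backported without either marker would be reported as missing, so the output is a starting point for a manual check rather than a definitive answer.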

> The latest upstream code however seems to disable apicv conditionally 
> depending on if it is actually being used:

Right.

> We found that once we disable the Hyper-V enlightenments for the Linux VM, 
> everything works fine (on v5.10.84).
> 
> Further, Eiichi noticed that your change was introduced in 5.16 and 
> backported to 5.10.84.
> 
> On the other hand, Vitaly's patch (commit 
> #0f250a646382e017725001a552624be0c86527bf) was introduced in 5.15 and 
> NOT backported to 5.10.X.
> 
> Should we backport Vitaly's patch to stable 5.10.X? Do you think that 
> will solve the issue we are facing?

As you found out there are a lot of dependent changes to introduce 
__kvm_request_apicv_update so it's not really feasible.

Paolo



* Re: Nvidia GPU PCI passthrough and kernel commit #5f33887a36824f1e906863460535be5d841a4364
       [not found]                     ` <PH0PR02MB8422F29708DF1CF961C580D1A4169@PH0PR02MB8422.namprd02.prod.outlook.com>
@ 2022-12-05 20:04                       ` Paolo Bonzini
  0 siblings, 0 replies; 5+ messages in thread
From: Paolo Bonzini @ 2022-12-05 20:04 UTC (permalink / raw)
  To: Ashish Gupta (SJC)
  Cc: Suresh Gumpula, Felipe Franciosi, kvm, Sean Christopherson,
	John Levon, Bijan Mottahedeh, Eiichi Tsukata

On 12/3/22 18:27, Ashish Gupta (SJC) wrote:
> As I have a setup ready to test this, I can apply the 3 missing 
> patches and test.
> 
> I would like to understand whether there are any other patches (from other 
> authors) that you think would be needed as well.

There shouldn't be any others.  On the other hand, they probably do not 
apply cleanly; otherwise Greg would have included them.  I'm not sure 
whether the backport will be simpler if more patches are added, or whether 
the required changes are trivial.

Paolo

> Please let me know.
> 
> I will start with your patches first.
> 
> Regards,
> 
> --Ashish Gupta
> 
> From: Paolo Bonzini <pbonzini@redhat.com>
> Date: Friday, December 2, 2022 at 1:39 PM
> To: Ashish Gupta (SJC) <ashish.gupta1@nutanix.com>
> Cc: Suresh Gumpula <suresh.gumpula@nutanix.com>, Felipe Franciosi 
> <felipe@nutanix.com>, kvm <kvm@vger.kernel.org>, Sean Christopherson 
> <seanjc@google.com>, John Levon <john.levon@nutanix.com>, Bijan 
> Mottahedeh <bijan.mottahedeh@nutanix.com>, Eiichi Tsukata 
> <eiichi.tsukata@nutanix.com>
> Subject: Re: Nvidia GPU PCI passthrough and kernel commit 
> #5f33887a36824f1e906863460535be5d841a4364
> 
> Yes, I think so. Are you going to test a backport of the three missing 
> patches or would you like me to prepare it?
> 
> Thanks for the report and the tests!
> 
> Paolo
> 
> On Fri, Dec 2, 2022, 20:59 Ashish Gupta (SJC) <ashish.gupta1@nutanix.com> 
> wrote:
> 
>     Thanks Paolo,
> 
>     > All four patches were marked as stable, but it looks like the first
>     > three did not apply and therefore are not part of 5.10.
> 
>     It sounds like a subset of the changes was backported to the 5.10.x
>     kernel and some were not.
> 
>     Wouldn’t that make the 5.10.x kernel unstable for this kind of issue?
> 
>     Do you think we should backport all the relevant changes to a
>     stable branch like 5.10.x, including patches from other authors
>     in this area?
> 
>     Regards,
> 
>     --Ashish Gupta
> 


