qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: Oliver O'Halloran <oohall@gmail.com>,
	Mahesh Salgaonkar <mahesh@linux.ibm.com>
Cc: Daniel Henrique Barboza <danielhb413@gmail.com>,
	Qemu-ppc <qemu-ppc@nongnu.org>,
	Qemu-devel <qemu-devel@nongnu.org>,
	David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [PATCH] spapr: Modify ibm, get-config-addr-info2 to set DEVNUM in PE config address.
Date: Thu, 29 Apr 2021 18:10:36 +1000	[thread overview]
Message-ID: <89a51a2d-ef9c-1fa4-1789-52bb20b03773@ozlabs.ru> (raw)
In-Reply-To: <CAOSf1CGGgxX4mGhjjsVGYA391XrABEOTh2xmiafW6V7cScyG4g@mail.gmail.com>



On 4/28/21 22:33, Oliver O'Halloran wrote:
> On Tue, Apr 27, 2021 at 9:56 PM Mahesh Salgaonkar <mahesh@linux.ibm.com> wrote:
>>
>> With upstream kernel, especially after commit 98ba956f6a389
>> ("powerpc/pseries/eeh: Rework device EEH PE determination") we see that KVM
>> guest isn't able to enable EEH option for PCI pass-through devices anymore.
> 
> How are you passing the devices through to the guest?
> 
>> [root@atest-guest ~]# dmesg | grep EEH
>> [    0.032337] EEH: pSeries platform initialized
>> [    0.298207] EEH: No capable adapters found: recovery disabled.
>> [root@atest-guest ~]#
>>
>> So far the linux kernel was assuming pe_config_addr equal to device's
>> config_addr and using it to enable EEH on the PE through ibm,set-eeh-option
>> RTAS call. Which wasn't the correct way as per PAPR. The linux kernel
>> commit 98ba956f6a389 fixed this flow. With that fixed, linux now gets the
>> PE config address first using the ibm,get-config-addr-info2 RTAS call, and
>> then uses the found address as argument to ibm,set-eeh-option RTAS call to
>> enable EEH on the PCI device.
> 
> That's not really correct. EEH being enabled or disabled is a per-PE
> setting rather than a per-device setting. The fact there's not a 1-1
> correspondence between devices and PEs is a large part of why the
> get-config-addr-info2 RTAS call exists in the first place.
> Unfortunately, the initial implementation of EEH support in linux
> conflated the two because in the past there was typically a single
> device within a PE. However, that assumption was never really correct
> and it has long outlived its usefulness.
> 
>> This works on PowerVM lpar but fails in qemu
>> KVM guests. This is because ibm,set-eeh-option RTAS call in qemu expects PE
>> config address bits 16:20 be populated with device number (DEVNUM).
>>
>> The rtas call ibm,get-config-addr-info2 in qemu always returns the PE
>> config address in form of "00BB0001" (i.e.  <00><BUS><DEVFN><REG>) where
>> "BB" represents the bus number of PE's primary bus and with device number
>> information always set to zero. However until commit 98ba956f6a389 this
>> return value wasn't used to enable EEH on the PCI device.
>>
>> Now in qemu guest with recent kernel the ibm,set-eeh-option RTAS call fails
>> with -3 return value indicating that there is no PCI device exist for the
>> specified pe config address. The rtas_ibm_set_eeh_option call uses
>> pci_find_device() to get the PC device that matches specific bus and devfn
>> extracted from PE config address passed as argument. Since the DEVFN part
>> of PE config always contains zero, pci_find_device() fails to find the
>> specific PCI device and hence fails to enable the EEH capability.
>>
>> hw/ppc/spapr_pci_vfio.c: spapr_phb_vfio_eeh_set_option()
>>     case RTAS_EEH_ENABLE: {
>>          PCIHostState *phb;
>>          PCIDevice *pdev;
>>
>>          /*
>>           * The EEH functionality is enabled on basis of PCI device,
>>           * instead of PE. We need check the validity of the PCI
>>           * device address.
>>           */
>>          phb = PCI_HOST_BRIDGE(sphb);
>>          pdev = pci_find_device(phb->bus,
>>                                 (addr >> 16) & 0xFF, (addr >> 8) & 0xFF);
>>          if (!pdev || !object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
>>              return RTAS_OUT_PARAM_ERROR;
>>          }
>>
>> This patch fixes this issue by populating DEVNUM field (bits 16:20) of PE
>> config address with device number.
> 
> I don't think this is a good idea and I'm fairly sure you're
> introducing some subtle breakage here. As an example, say that on the
> host side you have two devices on the same bus:
> 
> 0000:00:00.0 - card 1
> 0000:00:01.0 - card 2
> 
> On PowerNV (i.e. the hypervisor) these would be placed into the same
> hardware PE since they're on the same bus. Now, if I take both devices
> and pass them through to the guest on the same PHB and bus (use
> 0005:ff) we'll have:
> 
> 0005:ff:00.0 - card 1
> 0005:ff:01.0 - card 2
> 
> With this patch applied get-config-addr-info2 returns 00BBD001, so the
> PE we get for each device will be:
> 
> card 1 - 00ff0001
> card 2 - 00ff1001
> 
> Which implies the two are in different PEs. As a result, if the guest
> requests a reset of card 1's PE then the guest will see an unexpected
> reset of card 2 as well. From the hypervisor's point of view the two
> are in the same PE so this is a legitimate thing to do, but due to
> this patch the guest doesn't know that.


Agree. I guess we should only use vphbid:00:00.0 as a PE config address 
in QEMU as there is really just one per vphb which allows EEH.


> As far as I can remember this is why you're supposed to pass each EEH
> capable devices to the guest on a seperate spapr-phb (which matches
> what PHYP does). Alexy can probably tell you more.


The primary reason was that the EEH subdriver in VFIO did a poor job 
synchronizing states from different PEs so recovery was either tricky or 
broken:

https://git.qemu.org/?p=qemu.git;a=commitdiff;h=3153119;hp=f1a6cf3ef734aab142d5f7ce52e219474ababf6b


-- 
Alexey


  reply	other threads:[~2021-04-29  8:12 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-27 11:56 [PATCH] spapr: Modify ibm,get-config-addr-info2 to set DEVNUM in PE config address Mahesh Salgaonkar
2021-04-27 14:02 ` [PATCH] spapr: Modify ibm, get-config-addr-info2 " Daniel Henrique Barboza
2021-04-28 12:33 ` Oliver O'Halloran
2021-04-29  8:10   ` Alexey Kardashevskiy [this message]
2021-04-29  9:02   ` Mahesh J Salgaonkar
2021-04-30  1:53     ` Oliver O'Halloran
2021-05-03 15:47       ` Mahesh J Salgaonkar
2021-04-30 10:52     ` Daniel Henrique Barboza
2021-05-03  8:52       ` Mahesh J Salgaonkar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=89a51a2d-ef9c-1fa4-1789-52bb20b03773@ozlabs.ru \
    --to=aik@ozlabs.ru \
    --cc=danielhb413@gmail.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=mahesh@linux.ibm.com \
    --cc=oohall@gmail.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).