linux-hyperv.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
@ 2022-09-13 12:47 Ajay Kaher
  2022-09-13 13:34 ` Vitaly Kuznetsov
  0 siblings, 1 reply; 16+ messages in thread
From: Ajay Kaher @ 2022-09-13 12:47 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: x86, hpa, linux-pci, linux-kernel, rostedt, Srivatsa Bhat,
	srivatsa, Alexey Makhalov, Vasavi Sirnapalli, er.ajay.kaher,
	willy, Nadav Amit, linux-hyperv, kvm, jailhouse-dev, xen-devel,
	acrn-dev, helgaas, bhelgaas, tglx, mingo, bp, dave.hansen,
	Alexander Graf


Note: Corrected the Subject.

> On 07/09/22, 8:50 PM, "Vitaly Kuznetsov" <vkuznets@redhat.com> wrote:
>
>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>> index ddb7986..1e5a8f7 100644
>> --- a/arch/x86/pci/common.c
>> +++ b/arch/x86/pci/common.c
>> @@ -20,6 +20,7 @@
>>  #include <asm/pci_x86.h>
>>  #include <asm/setup.h>
>>  #include <asm/irqdomain.h>
>> +#include <asm/hypervisor.h>
>>
>>  unsigned int pci_probe = PCI_PROBE_BIOS | PCI_PROBE_CONF1 | PCI_PROBE_CONF2 |
>>                               PCI_PROBE_MMCONF;
>> @@ -57,14 +58,58 @@ int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>       return -EINVAL;
>>  }
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +static int vm_raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>> +                                             int reg, int len, u32 *val)
>> +{
>> +     if (raw_pci_ext_ops)
>> +             return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>> +     if (domain == 0 && reg < 256 && raw_pci_ops)
>> +             return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>> +     return -EINVAL;
>> +}
>> +
>> +static int vm_raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>> +                                             int reg, int len, u32 val)
>> +{
>> +     if (raw_pci_ext_ops)
>> +             return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>> +     if (domain == 0 && reg < 256 && raw_pci_ops)
>> +             return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>> +     return -EINVAL;
>> +}
>
> These look exactly like raw_pci_read()/raw_pci_write() but with inverted
> priority. We could've added a parameter but to be more flexible, I'd
> suggest we add a 'priority' field to 'struct pci_raw_ops' and make
> raw_pci_read()/raw_pci_write() check it before deciding what to use
> first. To be on the safe side, you can leave raw_pci_ops's priority
> higher than raw_pci_ext_ops's by default and only tweak it in
> arch/x86/kernel/cpu/vmware.c

Thanks Vitaly for your response.

1. we have multiple objects of struct pci_raw_ops, 2. adding 'priority' field to struct pci_raw_ops
doesn't seems to be appropriate as need to take decision which object of struct pci_raw_ops has
to be used, not something with-in struct pci_raw_ops.

It's a generic solution for all hypervisor (sorry for earlier wrong Subject), not specific to VMware.

Further looking for feedback if it's impacting to any hypervisor.

-Ajay 




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-09-13 12:47 [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor Ajay Kaher
@ 2022-09-13 13:34 ` Vitaly Kuznetsov
  2022-09-29  5:36   ` Ajay Kaher
  0 siblings, 1 reply; 16+ messages in thread
From: Vitaly Kuznetsov @ 2022-09-13 13:34 UTC (permalink / raw)
  To: Ajay Kaher
  Cc: x86, hpa, linux-pci, linux-kernel, rostedt, Srivatsa Bhat,
	srivatsa, Alexey Makhalov, Vasavi Sirnapalli, er.ajay.kaher,
	willy, Nadav Amit, linux-hyperv, kvm, jailhouse-dev, xen-devel,
	acrn-dev, helgaas, bhelgaas, tglx, mingo, bp, dave.hansen,
	Alexander Graf

Ajay Kaher <akaher@vmware.com> writes:

> Note: Corrected the Subject.
>
>> On 07/09/22, 8:50 PM, "Vitaly Kuznetsov" <vkuznets@redhat.com> wrote:
>>
>>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>>> index ddb7986..1e5a8f7 100644
>>> --- a/arch/x86/pci/common.c
>>> +++ b/arch/x86/pci/common.c
>>> @@ -20,6 +20,7 @@
>>>  #include <asm/pci_x86.h>
>>>  #include <asm/setup.h>
>>>  #include <asm/irqdomain.h>
>>> +#include <asm/hypervisor.h>
>>>
>>>  unsigned int pci_probe = PCI_PROBE_BIOS | PCI_PROBE_CONF1 | PCI_PROBE_CONF2 |
>>>                               PCI_PROBE_MMCONF;
>>> @@ -57,14 +58,58 @@ int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>>       return -EINVAL;
>>>  }
>>>
>>> +#ifdef CONFIG_HYPERVISOR_GUEST
>>> +static int vm_raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>> +                                             int reg, int len, u32 *val)
>>> +{
>>> +     if (raw_pci_ext_ops)
>>> +             return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>>> +     if (domain == 0 && reg < 256 && raw_pci_ops)
>>> +             return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>>> +     return -EINVAL;
>>> +}
>>> +
>>> +static int vm_raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>> +                                             int reg, int len, u32 val)
>>> +{
>>> +     if (raw_pci_ext_ops)
>>> +             return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>> +     if (domain == 0 && reg < 256 && raw_pci_ops)
>>> +             return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>>> +     return -EINVAL;
>>> +}
>>
>> These look exactly like raw_pci_read()/raw_pci_write() but with inverted
>> priority. We could've added a parameter but to be more flexible, I'd
>> suggest we add a 'priority' field to 'struct pci_raw_ops' and make
>> raw_pci_read()/raw_pci_write() check it before deciding what to use
>> first. To be on the safe side, you can leave raw_pci_ops's priority
>> higher than raw_pci_ext_ops's by default and only tweak it in
>> arch/x86/kernel/cpu/vmware.c
>
> Thanks Vitaly for your response.
>
> 1. we have multiple objects of struct pci_raw_ops, 2. adding 'priority' field to struct pci_raw_ops
> doesn't seems to be appropriate as need to take decision which object of struct pci_raw_ops has
> to be used, not something with-in struct pci_raw_ops.

I'm not sure I follow, you have two instances of 'struct pci_raw_ops'
which are called 'raw_pci_ops' and 'raw_pci_ext_ops'. What if you do
something like (completely untested):

diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
index 70533fdcbf02..fb8270fa6c78 100644
--- a/arch/x86/include/asm/pci_x86.h
+++ b/arch/x86/include/asm/pci_x86.h
@@ -116,6 +116,7 @@ extern void (*pcibios_disable_irq)(struct pci_dev *dev);
 extern bool mp_should_keep_irq(struct device *dev);
 
 struct pci_raw_ops {
+       int rating;
        int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
                                                int reg, int len, u32 *val);
        int (*write)(unsigned int domain, unsigned int bus, unsigned int devfn,
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index ddb798603201..e9965fd11576 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
 int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
                                                int reg, int len, u32 *val)
 {
-       if (domain == 0 && reg < 256 && raw_pci_ops)
+       if (domain == 0 && reg < 256 && raw_pci_ops &&
+           (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
                return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
        if (raw_pci_ext_ops)
                return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
@@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
 int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
                                                int reg, int len, u32 val)
 {
-       if (domain == 0 && reg < 256 && raw_pci_ops)
+       if (domain == 0 && reg < 256 && raw_pci_ops &&
+           (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
                return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
        if (raw_pci_ext_ops)
                return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);

and then somewhere in Vmware hypervisor initialization code
(arch/x86/kernel/cpu/vmware.c) you do

 raw_pci_ext_ops->rating = 100;

why wouldn't it work? 

(diclaimer: completely untested, raw_pci_ops/raw_pci_ext_ops
initialization has to be checked so 'rating' is not garbage).

>
> It's a generic solution for all hypervisor (sorry for earlier wrong
> Subject), not specific to VMware. Further looking for feedback if it's
> impacting to any hypervisor.

That's the tricky part. We can check modern hypervisor versions, but
what about all other versions in existence? How can we know that there's
no QEMU/Hyper-V/... version out there where MMIO path is broken? I'd
suggest we limit the change to Vmware hypervisor, other hypervisors may
use the same mechanism (like the one above) later (but the person
suggesting the patch is always responsible for the research why it is
safe to do so).

-- 
Vitaly


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-09-13 13:34 ` Vitaly Kuznetsov
@ 2022-09-29  5:36   ` Ajay Kaher
  2022-09-29  9:12     ` Alexander Graf
  2022-10-03 15:03     ` Vitaly Kuznetsov
  0 siblings, 2 replies; 16+ messages in thread
From: Ajay Kaher @ 2022-09-29  5:36 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: x86, hpa, linux-pci, linux-kernel, rostedt, Srivatsa Bhat,
	srivatsa, Alexey Makhalov, Vasavi Sirnapalli, er.ajay.kaher,
	willy, Nadav Amit, linux-hyperv, kvm, jailhouse-dev, xen-devel,
	acrn-dev, helgaas, bhelgaas, tglx, mingo, bp, dave.hansen,
	Alexander Graf


> On 13/09/22, 7:05 PM, "Vitaly Kuznetsov" <vkuznets@redhat.com> wrote:
>>
>> Thanks Vitaly for your response.
>>
>> 1. we have multiple objects of struct pci_raw_ops, 2. adding 'priority' field to struct pci_raw_ops
>> doesn't seems to be appropriate as need to take decision which object of struct pci_raw_ops has
>> to be used, not something with-in struct pci_raw_ops.
>
> I'm not sure I follow, you have two instances of 'struct pci_raw_ops'
> which are called 'raw_pci_ops' and 'raw_pci_ext_ops'. What if you do
> something like (completely untested):
>
> diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
> index 70533fdcbf02..fb8270fa6c78 100644
> --- a/arch/x86/include/asm/pci_x86.h
> +++ b/arch/x86/include/asm/pci_x86.h
> @@ -116,6 +116,7 @@ extern void (*pcibios_disable_irq)(struct pci_dev *dev);
> extern bool mp_should_keep_irq(struct device *dev);
>
> struct pci_raw_ops {
> +       int rating;
>          int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
>                                                int reg, int len, u32 *val);
>          int (*write)(unsigned int domain, unsigned int bus, unsigned int devfn,
> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
> index ddb798603201..e9965fd11576 100644
> --- a/arch/x86/pci/common.c
> +++ b/arch/x86/pci/common.c
> @@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>  int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>                                                 int reg, int len, u32 *val)
> {
> -       if (domain == 0 && reg < 256 && raw_pci_ops)
> +       if (domain == 0 && reg < 256 && raw_pci_ops &&
> +           (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>                 return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>         if (raw_pci_ext_ops)
>                 return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
> @@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>  int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>                                                 int reg, int len, u32 val)
> {
> -       if (domain == 0 && reg < 256 && raw_pci_ops)
> +       if (domain == 0 && reg < 256 && raw_pci_ops &&
> +           (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>                 return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>          if (raw_pci_ext_ops)
>                 return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>
> and then somewhere in Vmware hypervisor initialization code
> (arch/x86/kernel/cpu/vmware.c) you do
>
>  raw_pci_ext_ops->rating = 100;

Thanks Vitaly, for your review and helping us to improve the code.

I was working to make changes as you suggested, but before sending v3 would like to
discuss on following:

If we add rating with-in struct pci_raw_ops then we can't have pci_mmcfg as const,
and following change is must in arch/x86/pci/mmconfig_64.c:

-const struct pci_raw_ops pci_mmcfg = {
+struct pci_raw_ops pci_mmcfg = {
 	.read =		pci_mmcfg_read,
 	.write =	pci_mmcfg_write,
};

So to avoid this change, is it fine to have global bool prefer_raw_pci_ext_ops?

And raw_pci_read() will have following change:

-	if (domain == 0 && reg < 256 && raw_pci_ops)
+	if (domain == 0 && reg < 256 && raw_pci_ops &&
+	     (!prefer_raw_pci_ext_ops ||  !raw_pci_ext_ops)

>
> why wouldn't it work?
>
> (diclaimer: completely untested, raw_pci_ops/raw_pci_ext_ops
> initialization has to be checked so 'rating' is not garbage).
>
>>
>> It's a generic solution for all hypervisor (sorry for earlier wrong
>> Subject), not specific to VMware. Further looking for feedback if it's
>> impacting to any hypervisor.
>
> That's the tricky part. We can check modern hypervisor versions, but
> what about all other versions in existence? How can we know that there's
> no QEMU/Hyper-V/... version out there where MMIO path is broken? I'd
> suggest we limit the change to Vmware hypervisor, other hypervisors may
> use the same mechanism (like the one above) later (but the person
> suggesting the patch is always responsible for the research why it is
> safe to do so).

Ok, as of now we will make this change specific to VMware hypervisor.

- Ajay




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-09-29  5:36   ` Ajay Kaher
@ 2022-09-29  9:12     ` Alexander Graf
  2022-10-03 15:03     ` Vitaly Kuznetsov
  1 sibling, 0 replies; 16+ messages in thread
From: Alexander Graf @ 2022-09-29  9:12 UTC (permalink / raw)
  To: Ajay Kaher, Vitaly Kuznetsov
  Cc: x86, hpa, linux-pci, linux-kernel, rostedt, Srivatsa Bhat,
	srivatsa, Alexey Makhalov, Vasavi Sirnapalli, er.ajay.kaher,
	willy, Nadav Amit, linux-hyperv, kvm, jailhouse-dev, xen-devel,
	acrn-dev, helgaas, bhelgaas, tglx, mingo, bp, dave.hansen,
	Michael S. Tsirkin


On 29.09.22 07:36, Ajay Kaher wrote:
>> On 13/09/22, 7:05 PM, "Vitaly Kuznetsov" <vkuznets@redhat.com> wrote:
>>> Thanks Vitaly for your response.
>>>
>>> 1. we have multiple objects of struct pci_raw_ops, 2. adding 'priority' field to struct pci_raw_ops
>>> doesn't seems to be appropriate as need to take decision which object of struct pci_raw_ops has
>>> to be used, not something with-in struct pci_raw_ops.
>> I'm not sure I follow, you have two instances of 'struct pci_raw_ops'
>> which are called 'raw_pci_ops' and 'raw_pci_ext_ops'. What if you do
>> something like (completely untested):
>>
>> diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
>> index 70533fdcbf02..fb8270fa6c78 100644
>> --- a/arch/x86/include/asm/pci_x86.h
>> +++ b/arch/x86/include/asm/pci_x86.h
>> @@ -116,6 +116,7 @@ extern void (*pcibios_disable_irq)(struct pci_dev *dev);
>> extern bool mp_should_keep_irq(struct device *dev);
>>
>> struct pci_raw_ops {
>> +       int rating;
>>           int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
>>                                                 int reg, int len, u32 *val);
>>           int (*write)(unsigned int domain, unsigned int bus, unsigned int devfn,
>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>> index ddb798603201..e9965fd11576 100644
>> --- a/arch/x86/pci/common.c
>> +++ b/arch/x86/pci/common.c
>> @@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>>   int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>                                                  int reg, int len, u32 *val)
>> {
>> -       if (domain == 0 && reg < 256 && raw_pci_ops)
>> +       if (domain == 0 && reg < 256 && raw_pci_ops &&
>> +           (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>>                  return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>>          if (raw_pci_ext_ops)
>>                  return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>> @@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>   int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>                                                  int reg, int len, u32 val)
>> {
>> -       if (domain == 0 && reg < 256 && raw_pci_ops)
>> +       if (domain == 0 && reg < 256 && raw_pci_ops &&
>> +           (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>>                  return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>>           if (raw_pci_ext_ops)
>>                  return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>
>> and then somewhere in Vmware hypervisor initialization code
>> (arch/x86/kernel/cpu/vmware.c) you do
>>
>>   raw_pci_ext_ops->rating = 100;
> Thanks Vitaly, for your review and helping us to improve the code.
>
> I was working to make changes as you suggested, but before sending v3 would like to
> discuss on following:
>
> If we add rating with-in struct pci_raw_ops then we can't have pci_mmcfg as const,
> and following change is must in arch/x86/pci/mmconfig_64.c:
>
> -const struct pci_raw_ops pci_mmcfg = {
> +struct pci_raw_ops pci_mmcfg = {
>          .read =         pci_mmcfg_read,
>          .write =        pci_mmcfg_write,
> };
>
> So to avoid this change, is it fine to have global bool prefer_raw_pci_ext_ops?
>
> And raw_pci_read() will have following change:
>
> -       if (domain == 0 && reg < 256 && raw_pci_ops)
> +       if (domain == 0 && reg < 256 && raw_pci_ops &&
> +            (!prefer_raw_pci_ext_ops ||  !raw_pci_ext_ops)
>
>> why wouldn't it work?
>>
>> (diclaimer: completely untested, raw_pci_ops/raw_pci_ext_ops
>> initialization has to be checked so 'rating' is not garbage).
>>
>>> It's a generic solution for all hypervisor (sorry for earlier wrong
>>> Subject), not specific to VMware. Further looking for feedback if it's
>>> impacting to any hypervisor.
>> That's the tricky part. We can check modern hypervisor versions, but
>> what about all other versions in existence? How can we know that there's
>> no QEMU/Hyper-V/... version out there where MMIO path is broken? I'd
>> suggest we limit the change to Vmware hypervisor, other hypervisors may
>> use the same mechanism (like the one above) later (but the person
>> suggesting the patch is always responsible for the research why it is
>> safe to do so).
> Ok, as of now we will make this change specific to VMware hypervisor.


Is there a way we can make it an ACPI property in MCFG to have the 
environment self-describe the fact that it's safe to do ECAM access for 
config space access over legacy PIO? That way we don't need to patch 
guests every time a hypervisor decides that it's safe to prefer ECAM.

Also, Michael (CC'ed) mentioned that according to spec, your PCIe host 
bridge with PCI_COMMAND.MEMORY=0 would stop responding to its ECAM 
window. Given that most ARM systems have no PIO fallback path, we want 
to make sure we never hit that condition.


Alex





Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-09-29  5:36   ` Ajay Kaher
  2022-09-29  9:12     ` Alexander Graf
@ 2022-10-03 15:03     ` Vitaly Kuznetsov
  2022-10-03 17:34       ` Nadav Amit
  2022-10-03 21:01       ` H. Peter Anvin
  1 sibling, 2 replies; 16+ messages in thread
From: Vitaly Kuznetsov @ 2022-10-03 15:03 UTC (permalink / raw)
  To: Ajay Kaher
  Cc: x86, hpa, linux-pci, linux-kernel, rostedt, Srivatsa Bhat,
	srivatsa, Alexey Makhalov, Vasavi Sirnapalli, er.ajay.kaher,
	willy, Nadav Amit, linux-hyperv, kvm, jailhouse-dev, xen-devel,
	acrn-dev, helgaas, bhelgaas, tglx, mingo, bp, dave.hansen,
	Alexander Graf

Ajay Kaher <akaher@vmware.com> writes:

>> On 13/09/22, 7:05 PM, "Vitaly Kuznetsov" <vkuznets@redhat.com> wrote:
>>>
>>> Thanks Vitaly for your response.
>>>
>>> 1. we have multiple objects of struct pci_raw_ops, 2. adding 'priority' field to struct pci_raw_ops
>>> doesn't seems to be appropriate as need to take decision which object of struct pci_raw_ops has
>>> to be used, not something with-in struct pci_raw_ops.
>>
>> I'm not sure I follow, you have two instances of 'struct pci_raw_ops'
>> which are called 'raw_pci_ops' and 'raw_pci_ext_ops'. What if you do
>> something like (completely untested):
>>
>> diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
>> index 70533fdcbf02..fb8270fa6c78 100644
>> --- a/arch/x86/include/asm/pci_x86.h
>> +++ b/arch/x86/include/asm/pci_x86.h
>> @@ -116,6 +116,7 @@ extern void (*pcibios_disable_irq)(struct pci_dev *dev);
>> extern bool mp_should_keep_irq(struct device *dev);
>>
>> struct pci_raw_ops {
>> +       int rating;
>>          int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
>>                                                int reg, int len, u32 *val);
>>          int (*write)(unsigned int domain, unsigned int bus, unsigned int devfn,
>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>> index ddb798603201..e9965fd11576 100644
>> --- a/arch/x86/pci/common.c
>> +++ b/arch/x86/pci/common.c
>> @@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>>  int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>                                                 int reg, int len, u32 *val)
>> {
>> -       if (domain == 0 && reg < 256 && raw_pci_ops)
>> +       if (domain == 0 && reg < 256 && raw_pci_ops &&
>> +           (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>>                 return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>>         if (raw_pci_ext_ops)
>>                 return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>> @@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>  int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>                                                 int reg, int len, u32 val)
>> {
>> -       if (domain == 0 && reg < 256 && raw_pci_ops)
>> +       if (domain == 0 && reg < 256 && raw_pci_ops &&
>> +           (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>>                 return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>>          if (raw_pci_ext_ops)
>>                 return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>
>> and then somewhere in Vmware hypervisor initialization code
>> (arch/x86/kernel/cpu/vmware.c) you do
>>
>>  raw_pci_ext_ops->rating = 100;
>
> Thanks Vitaly, for your review and helping us to improve the code.
>
> I was working to make changes as you suggested, but before sending v3 would like to
> discuss on following:
>
> If we add rating with-in struct pci_raw_ops then we can't have pci_mmcfg as const,
> and following change is must in arch/x86/pci/mmconfig_64.c:
>
> -const struct pci_raw_ops pci_mmcfg = {
> +struct pci_raw_ops pci_mmcfg = {
>  	.read =		pci_mmcfg_read,
>  	.write =	pci_mmcfg_write,
> };
>
> So to avoid this change, is it fine to have global bool prefer_raw_pci_ext_ops?
>
> And raw_pci_read() will have following change:
>
> -	if (domain == 0 && reg < 256 && raw_pci_ops)
> +	if (domain == 0 && reg < 256 && raw_pci_ops &&
> +	     (!prefer_raw_pci_ext_ops ||  !raw_pci_ext_ops)
>

Not my but rather PCI maintainer's call but IMHO dropping 'const' is
better, introducing a new global var is our 'last resort' and should be
avoided whenever possible. Alternatively, you can add a
raw_pci_ext_ops_preferred() function checking somethin within 'struct
hypervisor_x86' but I'm unsure if it's better.

Also, please check Alex' question/suggestion.

...

-- 
Vitaly


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-10-03 15:03     ` Vitaly Kuznetsov
@ 2022-10-03 17:34       ` Nadav Amit
  2022-10-03 21:06         ` H. Peter Anvin
                           ` (2 more replies)
  2022-10-03 21:01       ` H. Peter Anvin
  1 sibling, 3 replies; 16+ messages in thread
From: Nadav Amit @ 2022-10-03 17:34 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Alexander Graf
  Cc: Ajay Kaher, x86, hpa, linux-pci, linux-kernel, rostedt,
	Srivatsa Bhat, srivatsa, Alexey Makhalov, Vasavi Sirnapalli,
	er.ajay.kaher, willy, linux-hyperv, kvm, jailhouse-dev,
	xen-devel, acrn-dev, helgaas, bhelgaas, Thomas Gleixner, mingo,
	bp, dave.hansen

On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:

> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
> better, introducing a new global var is our 'last resort' and should be
> avoided whenever possible. Alternatively, you can add a
> raw_pci_ext_ops_preferred() function checking somethin within 'struct
> hypervisor_x86' but I'm unsure if it's better.
> 
> Also, please check Alex' question/suggestion.

Here is my take (and Ajay knows probably more than me):

Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
The two options are either to use a reserved field (which who knows, might
be used one day) or some OEM ID. I am also not familiar with
PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.

Anyhow, I understand (although not relate) to the objection for a new global
variable. How about explicitly calling this hardware bug a “bug” and using
the proper infrastructure? Calling it explicitly a bug may even push whoever
can to resolve it.

IOW, how about doing something along the lines of (not tested):


-- >8 --

Subject: [PATCH] x86/PCI: Prefer MMIO over PIO on VMware hypervisor

---
 arch/x86/include/asm/cpufeatures.h | 1 +
 arch/x86/kernel/cpu/common.c       | 2 ++
 arch/x86/kernel/cpu/vmware.c       | 2 ++
 arch/x86/pci/common.c              | 6 ++++--
 4 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ef4775c6db01..216b6f357b6d 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -460,5 +460,6 @@
 #define X86_BUG_MMIO_UNKNOWN		X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
 #define X86_BUG_RETBLEED		X86_BUG(27) /* CPU is affected by RETBleed */
 #define X86_BUG_EIBRS_PBRSB		X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
+#define X86_BUG_ECAM_MMIO		X86_BUG(29) /* ECAM MMIO is buggy and PIO is preferable */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3e508f239098..c94175fa304b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1299,6 +1299,8 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 {
 	u64 ia32_cap = x86_read_arch_cap_msr();
 
+	setup_force_cpu_bug(X86_BUG_ECAM_MMIO);
+
 	/* Set ITLB_MULTIHIT bug if cpu is not in the whitelist and not mitigated */
 	if (!cpu_matches(cpu_vuln_whitelist, NO_ITLB_MULTIHIT) &&
 	    !(ia32_cap & ARCH_CAP_PSCHANGE_MC_NO))
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 02039ec3597d..8903776284a6 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -385,6 +385,8 @@ static void __init vmware_set_capabilities(void)
 		setup_force_cpu_cap(X86_FEATURE_VMCALL);
 	else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
 		setup_force_cpu_cap(X86_FEATURE_VMW_VMMCALL);
+
+	setup_clear_cpu_cap(X86_BUG_ECAM_MMIO);
 }
 
 static void __init vmware_platform_setup(void)
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index ddb798603201..bc81cf4c1014 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
 int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
 						int reg, int len, u32 *val)
 {
-	if (domain == 0 && reg < 256 && raw_pci_ops)
+	if (domain == 0 && reg < 256 && raw_pci_ops &&
+	    (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
 		return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
 	if (raw_pci_ext_ops)
 		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
@@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
 int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
 						int reg, int len, u32 val)
 {
-	if (domain == 0 && reg < 256 && raw_pci_ops)
+	if (domain == 0 && reg < 256 && raw_pci_ops &&
+	    (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
 		return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
 	if (raw_pci_ext_ops)
 		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
-- 
2.34.1





^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-10-03 15:03     ` Vitaly Kuznetsov
  2022-10-03 17:34       ` Nadav Amit
@ 2022-10-03 21:01       ` H. Peter Anvin
  1 sibling, 0 replies; 16+ messages in thread
From: H. Peter Anvin @ 2022-10-03 21:01 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Ajay Kaher
  Cc: x86, linux-pci, linux-kernel, rostedt, Srivatsa Bhat, srivatsa,
	Alexey Makhalov, Vasavi Sirnapalli, er.ajay.kaher, willy,
	Nadav Amit, linux-hyperv, kvm, jailhouse-dev, xen-devel,
	acrn-dev, helgaas, bhelgaas, tglx, mingo, bp, dave.hansen,
	Alexander Graf

On October 3, 2022 8:03:41 AM PDT, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>Ajay Kaher <akaher@vmware.com> writes:
>
>>> On 13/09/22, 7:05 PM, "Vitaly Kuznetsov" <vkuznets@redhat.com> wrote:
>>>>
>>>> Thanks Vitaly for your response.
>>>>
>>>> 1. we have multiple objects of struct pci_raw_ops, 2. adding 'priority' field to struct pci_raw_ops
>>>> doesn't seems to be appropriate as need to take decision which object of struct pci_raw_ops has
>>>> to be used, not something with-in struct pci_raw_ops.
>>>
>>> I'm not sure I follow, you have two instances of 'struct pci_raw_ops'
>>> which are called 'raw_pci_ops' and 'raw_pci_ext_ops'. What if you do
>>> something like (completely untested):
>>>
>>> diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
>>> index 70533fdcbf02..fb8270fa6c78 100644
>>> --- a/arch/x86/include/asm/pci_x86.h
>>> +++ b/arch/x86/include/asm/pci_x86.h
>>> @@ -116,6 +116,7 @@ extern void (*pcibios_disable_irq)(struct pci_dev *dev);
>>> extern bool mp_should_keep_irq(struct device *dev);
>>>
>>> struct pci_raw_ops {
>>> +       int rating;
>>>          int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
>>>                                                int reg, int len, u32 *val);
>>>          int (*write)(unsigned int domain, unsigned int bus, unsigned int devfn,
>>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>>> index ddb798603201..e9965fd11576 100644
>>> --- a/arch/x86/pci/common.c
>>> +++ b/arch/x86/pci/common.c
>>> @@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>>>  int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>>                                                 int reg, int len, u32 *val)
>>> {
>>> -       if (domain == 0 && reg < 256 && raw_pci_ops)
>>> +       if (domain == 0 && reg < 256 && raw_pci_ops &&
>>> +           (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>>>                 return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>>>         if (raw_pci_ext_ops)
>>>                 return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>>> @@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>>  int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>>                                                 int reg, int len, u32 val)
>>> {
>>> -       if (domain == 0 && reg < 256 && raw_pci_ops)
>>> +       if (domain == 0 && reg < 256 && raw_pci_ops &&
>>> +           (!raw_pci_ext_ops || raw_pci_ext_ops->rating <= raw_pci_ops->rating))
>>>                 return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>>>          if (raw_pci_ext_ops)
>>>                 return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>>
>>> and then somewhere in Vmware hypervisor initialization code
>>> (arch/x86/kernel/cpu/vmware.c) you do
>>>
>>>  raw_pci_ext_ops->rating = 100;
>>
>> Thanks Vitaly, for your review and helping us to improve the code.
>>
>> I was working to make changes as you suggested, but before sending v3 would like to
>> discuss on following:
>>
>> If we add rating with-in struct pci_raw_ops then we can't have pci_mmcfg as const,
>> and following change is must in arch/x86/pci/mmconfig_64.c:
>>
>> -const struct pci_raw_ops pci_mmcfg = {
>> +struct pci_raw_ops pci_mmcfg = {
>>  	.read =		pci_mmcfg_read,
>>  	.write =	pci_mmcfg_write,
>> };
>>
>> So to avoid this change, is it fine to have global bool prefer_raw_pci_ext_ops?
>>
>> And raw_pci_read() will have following change:
>>
>> -	if (domain == 0 && reg < 256 && raw_pci_ops)
>> +	if (domain == 0 && reg < 256 && raw_pci_ops &&
>> +	     (!prefer_raw_pci_ext_ops ||  !raw_pci_ext_ops)
>>
>
>Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>better, introducing a new global var is our 'last resort' and should be
>avoided whenever possible. Alternatively, you can add a
>raw_pci_ext_ops_preferred() function checking somethin within 'struct
>hypervisor_x86' but I'm unsure if it's better.
>
>Also, please check Alex' question/suggestion.
>
>...
>

Could this be ro_after_init?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-10-03 17:34       ` Nadav Amit
@ 2022-10-03 21:06         ` H. Peter Anvin
  2022-10-03 21:28           ` Nadav Amit
  2022-10-04  8:22         ` Alexander Graf
  2022-10-04  8:30         ` Vitaly Kuznetsov
  2 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2022-10-03 21:06 UTC (permalink / raw)
  To: Nadav Amit, Vitaly Kuznetsov, Alexander Graf
  Cc: Ajay Kaher, x86, linux-pci, linux-kernel, rostedt, Srivatsa Bhat,
	srivatsa, Alexey Makhalov, Vasavi Sirnapalli, er.ajay.kaher,
	willy, linux-hyperv, kvm, jailhouse-dev, xen-devel, acrn-dev,
	helgaas, bhelgaas, Thomas Gleixner, mingo, bp, dave.hansen

On October 3, 2022 10:34:15 AM PDT, Nadav Amit <namit@vmware.com> wrote:
>On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>> better, introducing a new global var is our 'last resort' and should be
>> avoided whenever possible. Alternatively, you can add a
>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>> hypervisor_x86' but I'm unsure if it's better.
>> 
>> Also, please check Alex' question/suggestion.
>
>Here is my take (and Ajay knows probably more than me):
>
>Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
>The two options are either to use a reserved field (which who knows, might
>be used one day) or some OEM ID. I am also not familiar with
>PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>
>Anyhow, I understand (although not relate) to the objection for a new global
>variable. How about explicitly calling this hardware bug a “bug” and using
>the proper infrastructure? Calling it explicitly a bug may even push whoever
>can to resolve it.
>
>IOW, how about doing something along the lines of (not tested):
>
>
>-- >8 --
>
>Subject: [PATCH] x86/PCI: Prefer MMIO over PIO on VMware hypervisor
>
>---
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/kernel/cpu/common.c       | 2 ++
> arch/x86/kernel/cpu/vmware.c       | 2 ++
> arch/x86/pci/common.c              | 6 ++++--
> 4 files changed, 9 insertions(+), 2 deletions(-)
>
>diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
>index ef4775c6db01..216b6f357b6d 100644
>--- a/arch/x86/include/asm/cpufeatures.h
>+++ b/arch/x86/include/asm/cpufeatures.h
>@@ -460,5 +460,6 @@
> #define X86_BUG_MMIO_UNKNOWN		X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
> #define X86_BUG_RETBLEED		X86_BUG(27) /* CPU is affected by RETBleed */
> #define X86_BUG_EIBRS_PBRSB		X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
>+#define X86_BUG_ECAM_MMIO		X86_BUG(29) /* ECAM MMIO is buggy and PIO is preferable */
> 
> #endif /* _ASM_X86_CPUFEATURES_H */
>diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>index 3e508f239098..c94175fa304b 100644
>--- a/arch/x86/kernel/cpu/common.c
>+++ b/arch/x86/kernel/cpu/common.c
>@@ -1299,6 +1299,8 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
> {
> 	u64 ia32_cap = x86_read_arch_cap_msr();
> 
>+	setup_force_cpu_bug(X86_BUG_ECAM_MMIO);
>+
> 	/* Set ITLB_MULTIHIT bug if cpu is not in the whitelist and not mitigated */
> 	if (!cpu_matches(cpu_vuln_whitelist, NO_ITLB_MULTIHIT) &&
> 	    !(ia32_cap & ARCH_CAP_PSCHANGE_MC_NO))
>diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
>index 02039ec3597d..8903776284a6 100644
>--- a/arch/x86/kernel/cpu/vmware.c
>+++ b/arch/x86/kernel/cpu/vmware.c
>@@ -385,6 +385,8 @@ static void __init vmware_set_capabilities(void)
> 		setup_force_cpu_cap(X86_FEATURE_VMCALL);
> 	else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
> 		setup_force_cpu_cap(X86_FEATURE_VMW_VMMCALL);
>+
>+	setup_clear_cpu_cap(X86_BUG_ECAM_MMIO);
> }
> 
> static void __init vmware_platform_setup(void)
>diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>index ddb798603201..bc81cf4c1014 100644
>--- a/arch/x86/pci/common.c
>+++ b/arch/x86/pci/common.c
>@@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
> int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
> 						int reg, int len, u32 *val)
> {
>-	if (domain == 0 && reg < 256 && raw_pci_ops)
>+	if (domain == 0 && reg < 256 && raw_pci_ops &&
>+	    (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
> 		return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
> 	if (raw_pci_ext_ops)
> 		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>@@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
> int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
> 						int reg, int len, u32 val)
> {
>-	if (domain == 0 && reg < 256 && raw_pci_ops)
>+	if (domain == 0 && reg < 256 && raw_pci_ops &&
>+	    (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
> 		return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
> 	if (raw_pci_ext_ops)
> 		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);

Also... any reason we can't just set raw_pci_ops == raw_ext_pci_ops for the case when the latter is preferred, and dispense with the conditionals in the use path? Similarly, raw_ext_pci_ops could be pointed to error routines instead of left at NULL.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-10-03 21:06         ` H. Peter Anvin
@ 2022-10-03 21:28           ` Nadav Amit
  2022-10-03 23:51             ` H. Peter Anvin
  0 siblings, 1 reply; 16+ messages in thread
From: Nadav Amit @ 2022-10-03 21:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Vitaly Kuznetsov, Alexander Graf, Ajay Kaher, x86, linux-pci,
	linux-kernel, rostedt, Srivatsa Bhat, srivatsa, Alexey Makhalov,
	Vasavi Sirnapalli, er.ajay.kaher, willy, linux-hyperv, kvm,
	jailhouse-dev, xen-devel, acrn-dev, helgaas, bhelgaas,
	Thomas Gleixner, mingo, bp, dave.hansen

On Oct 3, 2022, at 2:06 PM, H. Peter Anvin <hpa@zytor.com> wrote:

> ⚠ External Email
> 
> On October 3, 2022 10:34:15 AM PDT, Nadav Amit <namit@vmware.com> wrote:
>> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>> 
>>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>>> better, introducing a new global var is our 'last resort' and should be
>>> avoided whenever possible. Alternatively, you can add a
>>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>>> hypervisor_x86' but I'm unsure if it's better.
>>> 
>>> Also, please check Alex' question/suggestion.
>> 
>> Here is my take (and Ajay knows probably more than me):
>> 
>> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
>> The two options are either to use a reserved field (which who knows, might
>> be used one day) or some OEM ID. I am also not familiar with
>> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>> 
>> Anyhow, I understand (although not relate) to the objection for a new global
>> variable. How about explicitly calling this hardware bug a “bug” and using
>> the proper infrastructure? Calling it explicitly a bug may even push whoever
>> can to resolve it.
>> 
>> IOW, how about doing something along the lines of (not tested):
>> 
>> 
>> -- >8 --
>> 
>> Subject: [PATCH] x86/PCI: Prefer MMIO over PIO on VMware hypervisor
>> 
>> ---
>> arch/x86/include/asm/cpufeatures.h | 1 +
>> arch/x86/kernel/cpu/common.c       | 2 ++
>> arch/x86/kernel/cpu/vmware.c       | 2 ++
>> arch/x86/pci/common.c              | 6 ++++--
>> 4 files changed, 9 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
>> index ef4775c6db01..216b6f357b6d 100644
>> --- a/arch/x86/include/asm/cpufeatures.h
>> +++ b/arch/x86/include/asm/cpufeatures.h
>> @@ -460,5 +460,6 @@
>> #define X86_BUG_MMIO_UNKNOWN          X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
>> #define X86_BUG_RETBLEED              X86_BUG(27) /* CPU is affected by RETBleed */
>> #define X86_BUG_EIBRS_PBRSB           X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
>> +#define X86_BUG_ECAM_MMIO             X86_BUG(29) /* ECAM MMIO is buggy and PIO is preferable */
>> 
>> #endif /* _ASM_X86_CPUFEATURES_H */
>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>> index 3e508f239098..c94175fa304b 100644
>> --- a/arch/x86/kernel/cpu/common.c
>> +++ b/arch/x86/kernel/cpu/common.c
>> @@ -1299,6 +1299,8 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
>> {
>>      u64 ia32_cap = x86_read_arch_cap_msr();
>> 
>> +      setup_force_cpu_bug(X86_BUG_ECAM_MMIO);
>> +
>>      /* Set ITLB_MULTIHIT bug if cpu is not in the whitelist and not mitigated */
>>      if (!cpu_matches(cpu_vuln_whitelist, NO_ITLB_MULTIHIT) &&
>>          !(ia32_cap & ARCH_CAP_PSCHANGE_MC_NO))
>> diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
>> index 02039ec3597d..8903776284a6 100644
>> --- a/arch/x86/kernel/cpu/vmware.c
>> +++ b/arch/x86/kernel/cpu/vmware.c
>> @@ -385,6 +385,8 @@ static void __init vmware_set_capabilities(void)
>>              setup_force_cpu_cap(X86_FEATURE_VMCALL);
>>      else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
>>              setup_force_cpu_cap(X86_FEATURE_VMW_VMMCALL);
>> +
>> +      setup_clear_cpu_cap(X86_BUG_ECAM_MMIO);
>> }
>> 
>> static void __init vmware_platform_setup(void)
>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>> index ddb798603201..bc81cf4c1014 100644
>> --- a/arch/x86/pci/common.c
>> +++ b/arch/x86/pci/common.c
>> @@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>> int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>                                              int reg, int len, u32 *val)
>> {
>> -      if (domain == 0 && reg < 256 && raw_pci_ops)
>> +      if (domain == 0 && reg < 256 && raw_pci_ops &&
>> +          (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
>>              return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>>      if (raw_pci_ext_ops)
>>              return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>> @@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>> int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>                                              int reg, int len, u32 val)
>> {
>> -      if (domain == 0 && reg < 256 && raw_pci_ops)
>> +      if (domain == 0 && reg < 256 && raw_pci_ops &&
>> +          (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
>>              return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>>      if (raw_pci_ext_ops)
>>              return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
> 
> Also... any reason we can't just set raw_pci_ops == raw_ext_pci_ops for the case when the latter is preferred, and dispense with the conditionals in the use path? Similarly, raw_ext_pci_ops could be pointed to error routines instead of left at NULL.

I understood from Ajay that the initialization of raw_ext_pci_ops can be
done after the hypervisor initialization takes place, so doing what exactly
what you proposed by is not possible. It can probably be resolved, but I do
not think the end result would be simpler or cleaner. I think that the “bug”
solution really conveys the behavior.

IIUC performance would not be noticeable affected by 2 more conditional
branches.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-10-03 21:28           ` Nadav Amit
@ 2022-10-03 23:51             ` H. Peter Anvin
  2022-10-04  0:19               ` Nadav Amit
  0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2022-10-03 23:51 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Vitaly Kuznetsov, Alexander Graf, Ajay Kaher, x86, linux-pci,
	linux-kernel, rostedt, Srivatsa Bhat, srivatsa, Alexey Makhalov,
	Vasavi Sirnapalli, er.ajay.kaher, willy, linux-hyperv, kvm,
	jailhouse-dev, xen-devel, acrn-dev, helgaas, bhelgaas,
	Thomas Gleixner, mingo, bp, dave.hansen

On October 3, 2022 2:28:40 PM PDT, Nadav Amit <namit@vmware.com> wrote:
>On Oct 3, 2022, at 2:06 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
>> ⚠ External Email
>> 
>> On October 3, 2022 10:34:15 AM PDT, Nadav Amit <namit@vmware.com> wrote:
>>> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>>> 
>>>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>>>> better, introducing a new global var is our 'last resort' and should be
>>>> avoided whenever possible. Alternatively, you can add a
>>>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>>>> hypervisor_x86' but I'm unsure if it's better.
>>>> 
>>>> Also, please check Alex' question/suggestion.
>>> 
>>> Here is my take (and Ajay knows probably more than me):
>>> 
>>> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
>>> The two options are either to use a reserved field (which who knows, might
>>> be used one day) or some OEM ID. I am also not familiar with
>>> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>>> 
>>> Anyhow, I understand (although not relate) to the objection for a new global
>>> variable. How about explicitly calling this hardware bug a “bug” and using
>>> the proper infrastructure? Calling it explicitly a bug may even push whoever
>>> can to resolve it.
>>> 
>>> IOW, how about doing something along the lines of (not tested):
>>> 
>>> 
>>> -- >8 --
>>> 
>>> Subject: [PATCH] x86/PCI: Prefer MMIO over PIO on VMware hypervisor
>>> 
>>> ---
>>> arch/x86/include/asm/cpufeatures.h | 1 +
>>> arch/x86/kernel/cpu/common.c       | 2 ++
>>> arch/x86/kernel/cpu/vmware.c       | 2 ++
>>> arch/x86/pci/common.c              | 6 ++++--
>>> 4 files changed, 9 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
>>> index ef4775c6db01..216b6f357b6d 100644
>>> --- a/arch/x86/include/asm/cpufeatures.h
>>> +++ b/arch/x86/include/asm/cpufeatures.h
>>> @@ -460,5 +460,6 @@
>>> #define X86_BUG_MMIO_UNKNOWN          X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
>>> #define X86_BUG_RETBLEED              X86_BUG(27) /* CPU is affected by RETBleed */
>>> #define X86_BUG_EIBRS_PBRSB           X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
>>> +#define X86_BUG_ECAM_MMIO             X86_BUG(29) /* ECAM MMIO is buggy and PIO is preferable */
>>> 
>>> #endif /* _ASM_X86_CPUFEATURES_H */
>>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>>> index 3e508f239098..c94175fa304b 100644
>>> --- a/arch/x86/kernel/cpu/common.c
>>> +++ b/arch/x86/kernel/cpu/common.c
>>> @@ -1299,6 +1299,8 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
>>> {
>>>      u64 ia32_cap = x86_read_arch_cap_msr();
>>> 
>>> +      setup_force_cpu_bug(X86_BUG_ECAM_MMIO);
>>> +
>>>      /* Set ITLB_MULTIHIT bug if cpu is not in the whitelist and not mitigated */
>>>      if (!cpu_matches(cpu_vuln_whitelist, NO_ITLB_MULTIHIT) &&
>>>          !(ia32_cap & ARCH_CAP_PSCHANGE_MC_NO))
>>> diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
>>> index 02039ec3597d..8903776284a6 100644
>>> --- a/arch/x86/kernel/cpu/vmware.c
>>> +++ b/arch/x86/kernel/cpu/vmware.c
>>> @@ -385,6 +385,8 @@ static void __init vmware_set_capabilities(void)
>>>              setup_force_cpu_cap(X86_FEATURE_VMCALL);
>>>      else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
>>>              setup_force_cpu_cap(X86_FEATURE_VMW_VMMCALL);
>>> +
>>> +      setup_clear_cpu_cap(X86_BUG_ECAM_MMIO);
>>> }
>>> 
>>> static void __init vmware_platform_setup(void)
>>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>>> index ddb798603201..bc81cf4c1014 100644
>>> --- a/arch/x86/pci/common.c
>>> +++ b/arch/x86/pci/common.c
>>> @@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>>> int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>>                                              int reg, int len, u32 *val)
>>> {
>>> -      if (domain == 0 && reg < 256 && raw_pci_ops)
>>> +      if (domain == 0 && reg < 256 && raw_pci_ops &&
>>> +          (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
>>>              return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>>>      if (raw_pci_ext_ops)
>>>              return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>>> @@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>> int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>>                                              int reg, int len, u32 val)
>>> {
>>> -      if (domain == 0 && reg < 256 && raw_pci_ops)
>>> +      if (domain == 0 && reg < 256 && raw_pci_ops &&
>>> +          (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
>>>              return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>>>      if (raw_pci_ext_ops)
>>>              return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>> 
>> Also... any reason we can't just set raw_pci_ops == raw_ext_pci_ops for the case when the latter is preferred, and dispense with the conditionals in the use path? Similarly, raw_ext_pci_ops could be pointed to error routines instead of left at NULL.
>
>I understood from Ajay that the initialization of raw_ext_pci_ops can be
>done after the hypervisor initialization takes place, so doing what exactly
>what you proposed by is not possible. It can probably be resolved, but I do
>not think the end result would be simpler or cleaner. I think that the “bug”
>solution really conveys the behavior.
>
>IIUC performance would not be noticeable affected by 2 more conditional
>branches.
>
>

Isn't that exactly what you would want?!?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-10-03 23:51             ` H. Peter Anvin
@ 2022-10-04  0:19               ` Nadav Amit
  0 siblings, 0 replies; 16+ messages in thread
From: Nadav Amit @ 2022-10-04  0:19 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Vitaly Kuznetsov, Alexander Graf, Ajay Kaher, x86, linux-pci,
	linux-kernel, rostedt, Srivatsa Bhat, srivatsa, Alexey Makhalov,
	Vasavi Sirnapalli, er.ajay.kaher, willy, linux-hyperv, kvm,
	jailhouse-dev, xen-devel, acrn-dev, helgaas, bhelgaas,
	Thomas Gleixner, mingo, bp, dave.hansen

On Oct 3, 2022, at 4:51 PM, H. Peter Anvin <hpa@zytor.com> wrote:

> ⚠ External Email
> 
> On October 3, 2022 2:28:40 PM PDT, Nadav Amit <namit@vmware.com> wrote:
>> On Oct 3, 2022, at 2:06 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> 
>>> ⚠ External Email
>>> 
>>> On October 3, 2022 10:34:15 AM PDT, Nadav Amit <namit@vmware.com> wrote:
>>>> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>>>> 
>>>>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>>>>> better, introducing a new global var is our 'last resort' and should be
>>>>> avoided whenever possible. Alternatively, you can add a
>>>>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>>>>> hypervisor_x86' but I'm unsure if it's better.
>>>>> 
>>>>> Also, please check Alex' question/suggestion.
>>>> 
>>>> Here is my take (and Ajay knows probably more than me):
>>>> 
>>>> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
>>>> The two options are either to use a reserved field (which who knows, might
>>>> be used one day) or some OEM ID. I am also not familiar with
>>>> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>>>> 
>>>> Anyhow, I understand (although not relate) to the objection for a new global
>>>> variable. How about explicitly calling this hardware bug a “bug” and using
>>>> the proper infrastructure? Calling it explicitly a bug may even push whoever
>>>> can to resolve it.
>>>> 
>>>> IOW, how about doing something along the lines of (not tested):
>>>> 
>>>> 
>>>> -- >8 --
>>>> 
>>>> Subject: [PATCH] x86/PCI: Prefer MMIO over PIO on VMware hypervisor
>>>> 
>>>> ---
>>>> arch/x86/include/asm/cpufeatures.h | 1 +
>>>> arch/x86/kernel/cpu/common.c       | 2 ++
>>>> arch/x86/kernel/cpu/vmware.c       | 2 ++
>>>> arch/x86/pci/common.c              | 6 ++++--
>>>> 4 files changed, 9 insertions(+), 2 deletions(-)
>>>> 
>>>> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
>>>> index ef4775c6db01..216b6f357b6d 100644
>>>> --- a/arch/x86/include/asm/cpufeatures.h
>>>> +++ b/arch/x86/include/asm/cpufeatures.h
>>>> @@ -460,5 +460,6 @@
>>>> #define X86_BUG_MMIO_UNKNOWN          X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
>>>> #define X86_BUG_RETBLEED              X86_BUG(27) /* CPU is affected by RETBleed */
>>>> #define X86_BUG_EIBRS_PBRSB           X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
>>>> +#define X86_BUG_ECAM_MMIO             X86_BUG(29) /* ECAM MMIO is buggy and PIO is preferable */
>>>> 
>>>> #endif /* _ASM_X86_CPUFEATURES_H */
>>>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>>>> index 3e508f239098..c94175fa304b 100644
>>>> --- a/arch/x86/kernel/cpu/common.c
>>>> +++ b/arch/x86/kernel/cpu/common.c
>>>> @@ -1299,6 +1299,8 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
>>>> {
>>>>     u64 ia32_cap = x86_read_arch_cap_msr();
>>>> 
>>>> +      setup_force_cpu_bug(X86_BUG_ECAM_MMIO);
>>>> +
>>>>     /* Set ITLB_MULTIHIT bug if cpu is not in the whitelist and not mitigated */
>>>>     if (!cpu_matches(cpu_vuln_whitelist, NO_ITLB_MULTIHIT) &&
>>>>         !(ia32_cap & ARCH_CAP_PSCHANGE_MC_NO))
>>>> diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
>>>> index 02039ec3597d..8903776284a6 100644
>>>> --- a/arch/x86/kernel/cpu/vmware.c
>>>> +++ b/arch/x86/kernel/cpu/vmware.c
>>>> @@ -385,6 +385,8 @@ static void __init vmware_set_capabilities(void)
>>>>             setup_force_cpu_cap(X86_FEATURE_VMCALL);
>>>>     else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
>>>>             setup_force_cpu_cap(X86_FEATURE_VMW_VMMCALL);
>>>> +
>>>> +      setup_clear_cpu_cap(X86_BUG_ECAM_MMIO);
>>>> }
>>>> 
>>>> static void __init vmware_platform_setup(void)
>>>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>>>> index ddb798603201..bc81cf4c1014 100644
>>>> --- a/arch/x86/pci/common.c
>>>> +++ b/arch/x86/pci/common.c
>>>> @@ -40,7 +40,8 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>>>> int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>>>                                             int reg, int len, u32 *val)
>>>> {
>>>> -      if (domain == 0 && reg < 256 && raw_pci_ops)
>>>> +      if (domain == 0 && reg < 256 && raw_pci_ops &&
>>>> +          (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
>>>>             return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>>>>     if (raw_pci_ext_ops)
>>>>             return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>>>> @@ -50,7 +51,8 @@ int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>>> int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>>>                                             int reg, int len, u32 val)
>>>> {
>>>> -      if (domain == 0 && reg < 256 && raw_pci_ops)
>>>> +      if (domain == 0 && reg < 256 && raw_pci_ops &&
>>>> +          (boot_cpu_has_bug(X86_BUG_ECAM_MMIO) || !raw_pci_ext_ops))
>>>>             return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>>>>     if (raw_pci_ext_ops)
>>>>             return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>> 
>>> Also... any reason we can't just set raw_pci_ops == raw_ext_pci_ops for the case when the latter is preferred, and dispense with the conditionals in the use path? Similarly, raw_ext_pci_ops could be pointed to error routines instead of left at NULL.
>> 
>> I understood from Ajay that the initialization of raw_ext_pci_ops can be
>> done after the hypervisor initialization takes place, so doing what exactly
>> what you proposed by is not possible. It can probably be resolved, but I do
>> not think the end result would be simpler or cleaner. I think that the “bug”
>> solution really conveys the behavior.
>> 
>> IIUC performance would not be noticeable affected by 2 more conditional
>> branches.
> 
> Isn't that exactly what you would want?!?

Two branches (which are probably mostly predicted correctly) are
inexpensive (nanoseconds?)

Causing a VM-exit, and the whole mess of handling it in the hypervisor
(potentially the userspace part of the hypervisor) is expensive
(microseconds). IIUC, Ajay wants to let reads to pass through to memory,
avoiding these VM-exits.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-10-03 17:34       ` Nadav Amit
  2022-10-03 21:06         ` H. Peter Anvin
@ 2022-10-04  8:22         ` Alexander Graf
  2022-10-04 18:48           ` Nadav Amit
  2022-10-04  8:30         ` Vitaly Kuznetsov
  2 siblings, 1 reply; 16+ messages in thread
From: Alexander Graf @ 2022-10-04  8:22 UTC (permalink / raw)
  To: Nadav Amit, Vitaly Kuznetsov
  Cc: Ajay Kaher, x86, hpa, linux-pci, linux-kernel, rostedt,
	Srivatsa Bhat, srivatsa, Alexey Makhalov, Vasavi Sirnapalli,
	er.ajay.kaher, willy, linux-hyperv, kvm, jailhouse-dev,
	xen-devel, acrn-dev, helgaas, bhelgaas, Thomas Gleixner, mingo,
	bp, dave.hansen, Michael S. Tsirkin

Hey Nadav,

On 03.10.22 19:34, Nadav Amit wrote:
> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>> better, introducing a new global var is our 'last resort' and should be
>> avoided whenever possible. Alternatively, you can add a
>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>> hypervisor_x86' but I'm unsure if it's better.
>>
>> Also, please check Alex' question/suggestion.
> Here is my take (and Ajay knows probably more than me):
>
> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
> The two options are either to use a reserved field (which who knows, might
> be used one day) or some OEM ID. I am also not familiar with
> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>
> Anyhow, I understand (although not relate) to the objection for a new global
> variable. How about explicitly calling this hardware bug a “bug” and using
> the proper infrastructure? Calling it explicitly a bug may even push whoever
> can to resolve it.


I am a lot more concerned with how we propagate it externally than 
within Linux. If we hard code that all Linux kernels 6.2+ that are 
running in VMware prefer ECAM over PIO, we lock ourselves into that 
stance for better or worse, which means:

* All past and future versions of any VMware hypervisor product have to 
always allow ECAM access for any PCIe config space write
* No other hypervisor benefits from any of this without upstream code change
* No real hardware platform benefits from this without upstream code change

By moving it into MCFG, we can create a path for the outside environment 
to tell the OS whether it's safe to use ECAM always. This obviously 
doesn't work with MCFG as it stands today, we'd have to propose an MCFG 
spec change to the PCI SIG's "PCI Firmware Specification" to add the 
respective field. Future VMware versions could then always expose the 
flag - and if you find it broken, remove it again.

Putting all of the logic on which system potentially prefers ECAM over 
PIO config space access into Linux is just a big hack that we should 
avoid as much as possible.


Alex




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-10-03 17:34       ` Nadav Amit
  2022-10-03 21:06         ` H. Peter Anvin
  2022-10-04  8:22         ` Alexander Graf
@ 2022-10-04  8:30         ` Vitaly Kuznetsov
  2 siblings, 0 replies; 16+ messages in thread
From: Vitaly Kuznetsov @ 2022-10-04  8:30 UTC (permalink / raw)
  To: Nadav Amit, Alexander Graf, Ajay Kaher
  Cc: x86, hpa, linux-pci, linux-kernel, rostedt, Srivatsa Bhat,
	srivatsa, Alexey Makhalov, Vasavi Sirnapalli, er.ajay.kaher,
	willy, linux-hyperv, kvm, jailhouse-dev, xen-devel, acrn-dev,
	helgaas, bhelgaas, Thomas Gleixner, mingo, bp, dave.hansen

Nadav Amit <namit@vmware.com> writes:

> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>> better, introducing a new global var is our 'last resort' and should be
>> avoided whenever possible. Alternatively, you can add a
>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>> hypervisor_x86' but I'm unsure if it's better.
>> 
>> Also, please check Alex' question/suggestion.
>
> Here is my take (and Ajay knows probably more than me):
>
> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
> The two options are either to use a reserved field (which who knows, might
> be used one day) or some OEM ID. I am also not familiar with
> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>
> Anyhow, I understand (although not relate) to the objection for a new global
> variable. How about explicitly calling this hardware bug a “bug” and using
> the proper infrastructure? Calling it explicitly a bug may even push whoever
> can to resolve it.
>
> IOW, how about doing something along the lines of (not tested):
>

Works for me. Going forward, the intention shoud be to also clear the
bug on other x86 hypervisors, e.g. we test modern Hyper-V versions and
if MMIO works well we clear it, we test modern QEMU/KVM setups and if
MMIO works introduce a feature bit somewhere and also clear the bug in
the guest when the bit is set.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-10-04  8:22         ` Alexander Graf
@ 2022-10-04 18:48           ` Nadav Amit
  2022-10-10 14:58             ` Nadav Amit
  2022-10-10 17:05             ` Michael S. Tsirkin
  0 siblings, 2 replies; 16+ messages in thread
From: Nadav Amit @ 2022-10-04 18:48 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Vitaly Kuznetsov, Ajay Kaher, x86, hpa, linux-pci, linux-kernel,
	rostedt, Srivatsa Bhat, srivatsa, Alexey Makhalov,
	Vasavi Sirnapalli, er.ajay.kaher, willy, linux-hyperv, kvm,
	jailhouse-dev, xen-devel, helgaas, bhelgaas, Thomas Gleixner,
	mingo, bp, dave.hansen, Michael S. Tsirkin

On Oct 4, 2022, at 1:22 AM, Alexander Graf <graf@amazon.com> wrote:

> ⚠ External Email
> 
> Hey Nadav,
> 
> On 03.10.22 19:34, Nadav Amit wrote:
>> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>> 
>>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>>> better, introducing a new global var is our 'last resort' and should be
>>> avoided whenever possible. Alternatively, you can add a
>>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>>> hypervisor_x86' but I'm unsure if it's better.
>>> 
>>> Also, please check Alex' question/suggestion.
>> Here is my take (and Ajay knows probably more than me):
>> 
>> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
>> The two options are either to use a reserved field (which who knows, might
>> be used one day) or some OEM ID. I am also not familiar with
>> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>> 
>> Anyhow, I understand (although not relate) to the objection for a new global
>> variable. How about explicitly calling this hardware bug a “bug” and using
>> the proper infrastructure? Calling it explicitly a bug may even push whoever
>> can to resolve it.
> 
> 
> I am a lot more concerned with how we propagate it externally than
> within Linux. If we hard code that all Linux kernels 6.2+ that are
> running in VMware prefer ECAM over PIO, we lock ourselves into that
> stance for better or worse, which means:
> 
> * All past and future versions of any VMware hypervisor product have to
> always allow ECAM access for any PCIe config space write
> * No other hypervisor benefits from any of this without upstream code change
> * No real hardware platform benefits from this without upstream code change
> 
> By moving it into MCFG, we can create a path for the outside environment
> to tell the OS whether it's safe to use ECAM always. This obviously
> doesn't work with MCFG as it stands today, we'd have to propose an MCFG
> spec change to the PCI SIG's "PCI Firmware Specification" to add the
> respective field. Future VMware versions could then always expose the
> flag - and if you find it broken, remove it again.
> 
> Putting all of the logic on which system potentially prefers ECAM over
> PIO config space access into Linux is just a big hack that we should
> avoid as much as possible.

Thanks Alex. You raise important points. Let me try to break down your
concerns slightly differently:

1. Enabling MMIO access should be selective, and potentially controlled by
the hypervisor. The very least a "chicken-bit” is needed.

2. PCI SIG would change its specifications to address unclear hardware bug.

I think (1) makes sense and we can discuss different ways of addressing it.
But (2) would not happen in a reasonable timeline and seems to me as an
unnecessary complication.

But before we discuss how to address the issue, perhaps we need to first
understand it better. I am not sure that I understand this MMIO bug, and so
far nobody was able to provide exact details.

So I went to have a look. It might not be super helpful, but for the record,
here is what I collected.

First, we have commit d6ece5491ae71d ("i386/x86-64 Correct for broken MCFG
tables on K8 systems”). It tried to "try to discover all devices on bus 0
that are unreachable using MM and fallback for them.” Interestingly, it
seems similar to FreeBSD code (commit 2d10570afe2b3e) that also mentions K8
and has similar detection logic in FreeBSD’s pcie_cfgregopen().

Then commit a0ca9909609470 ("PCI x86: always use conf1 to access config
space below 256 bytes”). The correspondence [1] mentions some bugs: ATI
chipset, VIA chipset, Intel 3 Series Express chipset family and some reports
on Nvidia. It turned out some devices had problem probing - to figure out if
MMIO is broken - the way the previous patch did.

All of these bugs are circa 2008, of course. And note that FreeBSD did not
take a similar path. The correspondence around Linux patch is endless. I
admit that I did not understand whether eventually the issues were found to
be per-bus or per-device.


Back to the matter at hand. The benefit of using the MCFG approach that you
propose is that it can enable native systems to use MMIO as well. However,
since the list of bugs is unclear and the problems might be device-specific,
it is not clear what information BIOSes have that Linux doesn’t. In other
words, the benefit of getting it into the specifications is questionable,
and the complexity+time is high.

Can we agree that the feature would be enabled explicitly by the hypervisor
and Linux would enable it based on the hypervisor input (through some
channel?)

Thanks,
Nadav

[1] https://lore.kernel.org/all/20080112144030.GA19279@jurassic.park.msu.ru/T/#u

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-10-04 18:48           ` Nadav Amit
@ 2022-10-10 14:58             ` Nadav Amit
  2022-10-10 17:05             ` Michael S. Tsirkin
  1 sibling, 0 replies; 16+ messages in thread
From: Nadav Amit @ 2022-10-10 14:58 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Alexander Graf, Vitaly Kuznetsov, Ajay Kaher, x86, hpa,
	linux-pci, linux-kernel, rostedt, Srivatsa Bhat, srivatsa,
	Alexey Makhalov, Vasavi Sirnapalli, er.ajay.kaher, willy,
	linux-hyperv, kvm, jailhouse-dev, xen-devel, helgaas, bhelgaas,
	Thomas Gleixner, mingo, bp, dave.hansen, Michael S. Tsirkin

On Oct 4, 2022, at 11:48 AM, Nadav Amit <namit@vmware.com> wrote:

> On Oct 4, 2022, at 1:22 AM, Alexander Graf <graf@amazon.com> wrote:
> 
>> ⚠ External Email
>> 
>> Hey Nadav,
>> 
>> On 03.10.22 19:34, Nadav Amit wrote:
>>> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>>> 
>>>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
>>>> better, introducing a new global var is our 'last resort' and should be
>>>> avoided whenever possible. Alternatively, you can add a
>>>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
>>>> hypervisor_x86' but I'm unsure if it's better.
>>>> 
>>>> Also, please check Alex' question/suggestion.
>>> Here is my take (and Ajay knows probably more than me):
>>> 
>>> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
>>> The two options are either to use a reserved field (which who knows, might
>>> be used one day) or some OEM ID. I am also not familiar with
>>> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
>>> 
>>> Anyhow, I understand (although not relate) to the objection for a new global
>>> variable. How about explicitly calling this hardware bug a “bug” and using
>>> the proper infrastructure? Calling it explicitly a bug may even push whoever
>>> can to resolve it.
>> 
>> 
>> I am a lot more concerned with how we propagate it externally than
>> within Linux. If we hard code that all Linux kernels 6.2+ that are
>> running in VMware prefer ECAM over PIO, we lock ourselves into that
>> stance for better or worse, which means:
>> 
>> * All past and future versions of any VMware hypervisor product have to
>> always allow ECAM access for any PCIe config space write
>> * No other hypervisor benefits from any of this without upstream code change
>> * No real hardware platform benefits from this without upstream code change
>> 
>> By moving it into MCFG, we can create a path for the outside environment
>> to tell the OS whether it's safe to use ECAM always. This obviously
>> doesn't work with MCFG as it stands today, we'd have to propose an MCFG
>> spec change to the PCI SIG's "PCI Firmware Specification" to add the
>> respective field. Future VMware versions could then always expose the
>> flag - and if you find it broken, remove it again.
>> 
>> Putting all of the logic on which system potentially prefers ECAM over
>> PIO config space access into Linux is just a big hack that we should
>> avoid as much as possible.
> 
> Thanks Alex. You raise important points. Let me try to break down your
> concerns slightly differently:
> 
> 1. Enabling MMIO access should be selective, and potentially controlled by
> the hypervisor. The very least a "chicken-bit” is needed.
> 
> 2. PCI SIG would change its specifications to address unclear hardware bug.
> 
> I think (1) makes sense and we can discuss different ways of addressing it.
> But (2) would not happen in a reasonable timeline and seems to me as an
> unnecessary complication.
> 
> But before we discuss how to address the issue, perhaps we need to first
> understand it better. I am not sure that I understand this MMIO bug, and so
> far nobody was able to provide exact details.
> 
> So I went to have a look. It might not be super helpful, but for the record,
> here is what I collected.
> 
> First, we have commit d6ece5491ae71d ("i386/x86-64 Correct for broken MCFG
> tables on K8 systems”). It tried to "try to discover all devices on bus 0
> that are unreachable using MM and fallback for them.” Interestingly, it
> seems similar to FreeBSD code (commit 2d10570afe2b3e) that also mentions K8
> and has similar detection logic in FreeBSD’s pcie_cfgregopen().
> 
> Then commit a0ca9909609470 ("PCI x86: always use conf1 to access config
> space below 256 bytes”). The correspondence [1] mentions some bugs: ATI
> chipset, VIA chipset, Intel 3 Series Express chipset family and some reports
> on Nvidia. It turned out some devices had problem probing - to figure out if
> MMIO is broken - the way the previous patch did.
> 
> All of these bugs are circa 2008, of course. And note that FreeBSD did not
> take a similar path. The correspondence around Linux patch is endless. I
> admit that I did not understand whether eventually the issues were found to
> be per-bus or per-device.
> 
> 
> Back to the matter at hand. The benefit of using the MCFG approach that you
> propose is that it can enable native systems to use MMIO as well. However,
> since the list of bugs is unclear and the problems might be device-specific,
> it is not clear what information BIOSes have that Linux doesn’t. In other
> words, the benefit of getting it into the specifications is questionable,
> and the complexity+time is high.
> 
> Can we agree that the feature would be enabled explicitly by the hypervisor
> and Linux would enable it based on the hypervisor input (through some
> channel?)

Alex, is it ok with you? We will enable the feature (disable the bug)
explicitly from the hypervisor, but would not rely on MCFG changes, which
would even in the best case would take some time.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor
  2022-10-04 18:48           ` Nadav Amit
  2022-10-10 14:58             ` Nadav Amit
@ 2022-10-10 17:05             ` Michael S. Tsirkin
  1 sibling, 0 replies; 16+ messages in thread
From: Michael S. Tsirkin @ 2022-10-10 17:05 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Alexander Graf, Vitaly Kuznetsov, Ajay Kaher, x86, hpa,
	linux-pci, linux-kernel, rostedt, Srivatsa Bhat, srivatsa,
	Alexey Makhalov, Vasavi Sirnapalli, er.ajay.kaher, willy,
	linux-hyperv, kvm, jailhouse-dev, xen-devel, helgaas, bhelgaas,
	Thomas Gleixner, mingo, bp, dave.hansen

On Tue, Oct 04, 2022 at 06:48:11PM +0000, Nadav Amit wrote:
> On Oct 4, 2022, at 1:22 AM, Alexander Graf <graf@amazon.com> wrote:
> 
> > ⚠ External Email
> > 
> > Hey Nadav,
> > 
> > On 03.10.22 19:34, Nadav Amit wrote:
> >> On Oct 3, 2022, at 8:03 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> >> 
> >>> Not my but rather PCI maintainer's call but IMHO dropping 'const' is
> >>> better, introducing a new global var is our 'last resort' and should be
> >>> avoided whenever possible. Alternatively, you can add a
> >>> raw_pci_ext_ops_preferred() function checking somethin within 'struct
> >>> hypervisor_x86' but I'm unsure if it's better.
> >>> 
> >>> Also, please check Alex' question/suggestion.
> >> Here is my take (and Ajay knows probably more than me):
> >> 
> >> Looking briefly on MCFG, I do not see a clean way of using the ACPI table.
> >> The two options are either to use a reserved field (which who knows, might
> >> be used one day) or some OEM ID. I am also not familiar with
> >> PCI_COMMAND.MEMORY=0, so Ajay can hopefully give some answer about that.
> >> 
> >> Anyhow, I understand (although not relate) to the objection for a new global
> >> variable. How about explicitly calling this hardware bug a “bug” and using
> >> the proper infrastructure? Calling it explicitly a bug may even push whoever
> >> can to resolve it.
> > 
> > 
> > I am a lot more concerned with how we propagate it externally than
> > within Linux. If we hard code that all Linux kernels 6.2+ that are
> > running in VMware prefer ECAM over PIO, we lock ourselves into that
> > stance for better or worse, which means:
> > 
> > * All past and future versions of any VMware hypervisor product have to
> > always allow ECAM access for any PCIe config space write
> > * No other hypervisor benefits from any of this without upstream code change
> > * No real hardware platform benefits from this without upstream code change
> > 
> > By moving it into MCFG, we can create a path for the outside environment
> > to tell the OS whether it's safe to use ECAM always. This obviously
> > doesn't work with MCFG as it stands today, we'd have to propose an MCFG
> > spec change to the PCI SIG's "PCI Firmware Specification" to add the
> > respective field. Future VMware versions could then always expose the
> > flag - and if you find it broken, remove it again.
> > 
> > Putting all of the logic on which system potentially prefers ECAM over
> > PIO config space access into Linux is just a big hack that we should
> > avoid as much as possible.
> 
> Thanks Alex. You raise important points. Let me try to break down your
> concerns slightly differently:
> 
> 1. Enabling MMIO access should be selective, and potentially controlled by
> the hypervisor. The very least a "chicken-bit” is needed.
> 
> 2. PCI SIG would change its specifications to address unclear hardware bug.
> 
> I think (1) makes sense and we can discuss different ways of addressing it.
> But (2) would not happen in a reasonable timeline and seems to me as an
> unnecessary complication.
> 
> But before we discuss how to address the issue, perhaps we need to first
> understand it better. I am not sure that I understand this MMIO bug, and so
> far nobody was able to provide exact details.
> 
> So I went to have a look. It might not be super helpful, but for the record,
> here is what I collected.
> 
> First, we have commit d6ece5491ae71d ("i386/x86-64 Correct for broken MCFG
> tables on K8 systems”). It tried to "try to discover all devices on bus 0
> that are unreachable using MM and fallback for them.” Interestingly, it
> seems similar to FreeBSD code (commit 2d10570afe2b3e) that also mentions K8
> and has similar detection logic in FreeBSD’s pcie_cfgregopen().
> 
> Then commit a0ca9909609470 ("PCI x86: always use conf1 to access config
> space below 256 bytes”). The correspondence [1] mentions some bugs: ATI
> chipset, VIA chipset, Intel 3 Series Express chipset family and some reports
> on Nvidia. It turned out some devices had problem probing - to figure out if
> MMIO is broken - the way the previous patch did.

There's also a statement by Linus that MCFG might not cover all buses
in that thread.  I didn't think the implications through yet ...

> All of these bugs are circa 2008, of course. And note that FreeBSD did not
> take a similar path. The correspondence around Linux patch is endless. I
> admit that I did not understand whether eventually the issues were found to
> be per-bus or per-device.
> 
> 
> Back to the matter at hand. The benefit of using the MCFG approach that you
> propose is that it can enable native systems to use MMIO as well. However,
> since the list of bugs is unclear and the problems might be device-specific,
> it is not clear what information BIOSes have that Linux doesn’t. In other
> words, the benefit of getting it into the specifications is questionable,
> and the complexity+time is high.
> 
> Can we agree that the feature would be enabled explicitly by the hypervisor
> and Linux would enable it based on the hypervisor input (through some
> channel?)
> 
> Thanks,
> Nadav
> 
> [1] https://lore.kernel.org/all/20080112144030.GA19279@jurassic.park.msu.ru/T/#u


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-10-10 17:05 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-13 12:47 [PATCH v2] x86/PCI: Prefer MMIO over PIO on all hypervisor Ajay Kaher
2022-09-13 13:34 ` Vitaly Kuznetsov
2022-09-29  5:36   ` Ajay Kaher
2022-09-29  9:12     ` Alexander Graf
2022-10-03 15:03     ` Vitaly Kuznetsov
2022-10-03 17:34       ` Nadav Amit
2022-10-03 21:06         ` H. Peter Anvin
2022-10-03 21:28           ` Nadav Amit
2022-10-03 23:51             ` H. Peter Anvin
2022-10-04  0:19               ` Nadav Amit
2022-10-04  8:22         ` Alexander Graf
2022-10-04 18:48           ` Nadav Amit
2022-10-10 14:58             ` Nadav Amit
2022-10-10 17:05             ` Michael S. Tsirkin
2022-10-04  8:30         ` Vitaly Kuznetsov
2022-10-03 21:01       ` H. Peter Anvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).