All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it
@ 2012-06-12 12:02 Wei Wang
  2012-06-12 15:13 ` Jan Beulich
  0 siblings, 1 reply; 33+ messages in thread
From: Wei Wang @ 2012-06-12 12:02 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk

[-- Attachment #1: Type: text/plain, Size: 397 bytes --]

Hi Jan & Andrew

I had attached a revised patch, please check it.

I found that the following Linux commit triggers this issue. It has been 
included into 3.4 pv_ops.

" commit a776c491ca5e38c26d9f66923ff574d041e747f4
   Author: Eric W. Biederman <ebiederm@xmission.com>
   Date:   Mon Oct 17 11:46:06 2011 -0700

   PCI: msi: Disable msi interrupts when we initialize a pci device "

Thanks,
Wei

[-- Attachment #2: iommu-msi.patch --]
[-- Type: text/x-patch, Size: 1991 bytes --]

# HG changeset patch
# Parent f6bfaf9daa508c31b2bca0e461202db2759426fc
# User Wei Wang <wei.wang2@amd.com>
Re-enable iommu msi capability block if it is disabled by dom0
Linux commit a776c491ca5e38c26d9f66923ff574d041e747f4 disables msi interrupts. If is
running as a dom0, iommu interrupt will be disabled and hypervisor cannot process any
event and ppr logs afterwards.

Signed-off-by: Wei Wang <wei.wang2@amd.com>

diff -r f6bfaf9daa50 xen/drivers/passthrough/amd/pci_amd_iommu.c
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c	Wed Jun 06 16:37:05 2012 +0100
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c	Tue Jun 12 13:58:40 2012 +0200
@@ -81,6 +81,30 @@ static void disable_translation(u32 *dte
     dte[0] = entry;
 }
 
+static void iommu_msi_check_enable(struct amd_iommu *iommu)
+{
+    unsigned long flags;
+    uint16_t control;
+    uint8_t bus = PCI_BUS(iommu->bdf);
+    uint8_t dev = PCI_SLOT(iommu->bdf);
+    uint8_t func = PCI_FUNC(iommu->bdf);
+
+    ASSERT( iommu->msi_cap );
+
+    spin_lock_irqsave(&iommu->lock, flags);
+
+    control = pci_conf_read16(iommu->seg, bus, dev, func,
+                              iommu->msi_cap + PCI_MSI_FLAGS);
+    if ( !(control & PCI_MSI_FLAGS_ENABLE) )
+    {    
+        control |= PCI_MSI_FLAGS_ENABLE;
+        pci_conf_write16(iommu->seg, bus, dev, func,
+                         iommu->msi_cap + PCI_MSI_FLAGS, control);
+    }
+
+    spin_unlock_irqrestore(&iommu->lock, flags);
+}
+
 static void amd_iommu_setup_domain_device(
     struct domain *domain, struct amd_iommu *iommu, int bdf)
 {
@@ -101,6 +125,12 @@ static void amd_iommu_setup_domain_devic
     if ( ats_enabled )
         dte_i = 1;
 
+    /* 
+     * In some cases, dom0 disables iommu msi capability, 
+     * check and re-enable it here.
+     */
+    iommu_msi_check_enable(iommu); 
+
     /* get device-table entry */
     req_id = get_dma_requestor_id(iommu->seg, bdf);
     dte = iommu->dev_table.buffer + (req_id * IOMMU_DEV_TABLE_ENTRY_SIZE);

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it
  2012-06-12 12:02 [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it Wei Wang
@ 2012-06-12 15:13 ` Jan Beulich
  2012-06-12 16:08   ` Andrew Cooper
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2012-06-12 15:13 UTC (permalink / raw)
  To: Wei Wang
  Cc: Andrew Cooper, Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk

>>> On 12.06.12 at 14:02, Wei Wang <wei.wang2@amd.com> wrote:
> I had attached a revised patch, please check it.

While the patch technically looks better now, you didn't eliminate
my objections to the approach you take, nor did you comment on
the proposed alternative.

A fundamental problem is that your IOMMUs show up as a "normal"
PCI devices, breaking the separation between what is being
managed by the hypervisor vs by the Dom0 kernel. (This even
allows something as odd as passing through an IOMMU to a
DomU, which would clearly upset the hypervisor.)

> I found that the following Linux commit triggers this issue. It has been 
> included into 3.4 pv_ops.
> 
> " commit a776c491ca5e38c26d9f66923ff574d041e747f4
>    Author: Eric W. Biederman <ebiederm@xmission.com>
>    Date:   Mon Oct 17 11:46:06 2011 -0700
> 
>    PCI: msi: Disable msi interrupts when we initialize a pci device "

Thanks for locating this. As it stands, it is incomplete though
anyway: If the kexec-ed kernel is one built without CONFIG_PCI_MSI,
it won't have a means to suppress the "screaming" interrupts.

Jan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it
  2012-06-12 15:13 ` Jan Beulich
@ 2012-06-12 16:08   ` Andrew Cooper
  2012-06-12 16:43     ` Jan Beulich
  0 siblings, 1 reply; 33+ messages in thread
From: Andrew Cooper @ 2012-06-12 16:08 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Wang, Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk



On 12/06/12 16:13, Jan Beulich wrote:
>>>> On 12.06.12 at 14:02, Wei Wang <wei.wang2@amd.com> wrote:
>> I had attached a revised patch, please check it.
> While the patch technically looks better now, you didn't eliminate
> my objections to the approach you take, nor did you comment on
> the proposed alternative.
>
> A fundamental problem is that your IOMMUs show up as a "normal"
> PCI devices, breaking the separation between what is being
> managed by the hypervisor vs by the Dom0 kernel. (This even
> allows something as odd as passing through an IOMMU to a
> DomU, which would clearly upset the hypervisor.)
>
>> I found that the following Linux commit triggers this issue. It has been 
>> included into 3.4 pv_ops.
>>
>> " commit a776c491ca5e38c26d9f66923ff574d041e747f4
>>    Author: Eric W. Biederman <ebiederm@xmission.com>
>>    Date:   Mon Oct 17 11:46:06 2011 -0700
>>
>>    PCI: msi: Disable msi interrupts when we initialize a pci device "
> Thanks for locating this. As it stands, it is incomplete though
> anyway: If the kexec-ed kernel is one built without CONFIG_PCI_MSI,
> it won't have a means to suppress the "screaming" interrupts.
>
> Jan

I feel that the correct solution would be for Xen to hide the PCI
devices from dom0.  Xen already hides the DMAR ACPI table (by turning it
to an XMAR table), and this would be the logical way to proceed now that
IOMMU internals are appearing as PCI devices.

( A similar issue comes into play now that newer generations of CPUs are
exposing CPU internals as PCI devices )

~Andrew

>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it
  2012-06-12 16:08   ` Andrew Cooper
@ 2012-06-12 16:43     ` Jan Beulich
  2012-06-14 12:13       ` Wei Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2012-06-12 16:43 UTC (permalink / raw)
  To: Wei Wang, Andrew Cooper
  Cc: Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk

>>> On 12.06.12 at 18:08, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 12/06/12 16:13, Jan Beulich wrote:
>>>>> On 12.06.12 at 14:02, Wei Wang <wei.wang2@amd.com> wrote:
>>> I had attached a revised patch, please check it.
>> While the patch technically looks better now, you didn't eliminate
>> my objections to the approach you take, nor did you comment on
>> the proposed alternative.
>>
>> A fundamental problem is that your IOMMUs show up as a "normal"
>> PCI devices, breaking the separation between what is being
>> managed by the hypervisor vs by the Dom0 kernel. (This even
>> allows something as odd as passing through an IOMMU to a
>> DomU, which would clearly upset the hypervisor.)
>>
>>> I found that the following Linux commit triggers this issue. It has been 
>>> included into 3.4 pv_ops.
>>>
>>> " commit a776c491ca5e38c26d9f66923ff574d041e747f4
>>>    Author: Eric W. Biederman <ebiederm@xmission.com>
>>>    Date:   Mon Oct 17 11:46:06 2011 -0700
>>>
>>>    PCI: msi: Disable msi interrupts when we initialize a pci device "
>> Thanks for locating this. As it stands, it is incomplete though
>> anyway: If the kexec-ed kernel is one built without CONFIG_PCI_MSI,
>> it won't have a means to suppress the "screaming" interrupts.
> 
> I feel that the correct solution would be for Xen to hide the PCI
> devices from dom0.  Xen already hides the DMAR ACPI table (by turning it
> to an XMAR table), and this would be the logical way to proceed now that
> IOMMU internals are appearing as PCI devices.

That is precisely what I suggested in my response to the first
version of this patch, and I'd also volunteer to look into putting
together a first draft implementation if we sort of agree that
this is the way to go.

> ( A similar issue comes into play now that newer generations of CPUs are
> exposing CPU internals as PCI devices )

Indeed, good point. Might be a little tricky though to determine
which one(s) they are, and still avoid conflicting with things like
the EDAC drivers in Dom0.

Jan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it
  2012-06-12 16:43     ` Jan Beulich
@ 2012-06-14 12:13       ` Wei Wang
  2012-06-14 14:18         ` Jan Beulich
  2012-06-20 15:45         ` [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it Jan Beulich
  0 siblings, 2 replies; 33+ messages in thread
From: Wei Wang @ 2012-06-14 12:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk

Am 12.06.2012 18:43, schrieb Jan Beulich:
>>>> On 12.06.12 at 18:08, Andrew Cooper<andrew.cooper3@citrix.com>  wrote:
>> On 12/06/12 16:13, Jan Beulich wrote:
>>>>>> On 12.06.12 at 14:02, Wei Wang<wei.wang2@amd.com>  wrote:
>>>> I had attached a revised patch, please check it.
>>> While the patch technically looks better now, you didn't eliminate
>>> my objections to the approach you take, nor did you comment on
>>> the proposed alternative.
>>>
>>> A fundamental problem is that your IOMMUs show up as a "normal"
>>> PCI devices, breaking the separation between what is being
>>> managed by the hypervisor vs by the Dom0 kernel. (This even
>>> allows something as odd as passing through an IOMMU to a
>>> DomU, which would clearly upset the hypervisor.)
>>>
>>>> I found that the following Linux commit triggers this issue. It has been
>>>> included into 3.4 pv_ops.
>>>>
>>>> " commit a776c491ca5e38c26d9f66923ff574d041e747f4
>>>>     Author: Eric W. Biederman<ebiederm@xmission.com>
>>>>     Date:   Mon Oct 17 11:46:06 2011 -0700
>>>>
>>>>     PCI: msi: Disable msi interrupts when we initialize a pci device "
>>> Thanks for locating this. As it stands, it is incomplete though
>>> anyway: If the kexec-ed kernel is one built without CONFIG_PCI_MSI,
>>> it won't have a means to suppress the "screaming" interrupts.
>>
>> I feel that the correct solution would be for Xen to hide the PCI
>> devices from dom0.  Xen already hides the DMAR ACPI table (by turning it
>> to an XMAR table), and this would be the logical way to proceed now that
>> IOMMU internals are appearing as PCI devices.

That sounds absolutely good to me. thanks for the suggestion.

> That is precisely what I suggested in my response to the first
> version of this patch, and I'd also volunteer to look into putting
> together a first draft implementation if we sort of agree that
> this is the way to go.

Cool! thanks for doing that. Looking forward to it in Xen 4.2 since 
iommu msi is really broken with recent Linux dom0...

Thanks
Wei


>> ( A similar issue comes into play now that newer generations of CPUs are
>> exposing CPU internals as PCI devices )
>
> Indeed, good point. Might be a little tricky though to determine
> which one(s) they are, and still avoid conflicting with things like
> the EDAC drivers in Dom0.
>
> Jan
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it
  2012-06-14 12:13       ` Wei Wang
@ 2012-06-14 14:18         ` Jan Beulich
  2012-06-14 15:15           ` Wei Wang
  2012-06-20 15:45         ` [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it Jan Beulich
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2012-06-14 14:18 UTC (permalink / raw)
  To: Wei Wang
  Cc: Andrew Cooper, Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk

>>> On 14.06.12 at 14:13, Wei Wang <wei.wang2@amd.com> wrote:
> Am 12.06.2012 18:43, schrieb Jan Beulich:
>> That is precisely what I suggested in my response to the first
>> version of this patch, and I'd also volunteer to look into putting
>> together a first draft implementation if we sort of agree that
>> this is the way to go.
> 
> Cool! thanks for doing that. Looking forward to it in Xen 4.2 since 
> iommu msi is really broken with recent Linux dom0...

That's unlikely to be for 4.2 - it's going to have a reasonable risk
of regressions, and functionality like that I don't think is nice to
rush into a feature frozen tree, the more that it provides a
workaround for the combination of questionable design and an
improper kernel side fix.

Have you at all considered getting this fixed on the kernel side?
As I don't have direct access to any AMD IOMMU capable
system - can one, other than by enumerating the respective
PCI IDs or reading ACPI tables, reasonably easily identify the
devices in question (e.g. via vendor/class/sub-class or some
such)? That might permit skipping those in the offending kernel
code...

Jan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it
  2012-06-14 14:18         ` Jan Beulich
@ 2012-06-14 15:15           ` Wei Wang
  2012-06-14 15:27             ` Jan Beulich
  2012-06-21  9:59               ` Jan Beulich
  0 siblings, 2 replies; 33+ messages in thread
From: Wei Wang @ 2012-06-14 15:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Jeremy Fitzhardinge, xen-devel, Hurwitz, Sherry,
	Konrad Rzeszutek Wilk

Am 14.06.2012 16:18, schrieb Jan Beulich:
>>>> On 14.06.12 at 14:13, Wei Wang<wei.wang2@amd.com>  wrote:
>> Am 12.06.2012 18:43, schrieb Jan Beulich:
>>> That is precisely what I suggested in my response to the first
>>> version of this patch, and I'd also volunteer to look into putting
>>> together a first draft implementation if we sort of agree that
>>> this is the way to go.
>>
>> Cool! thanks for doing that. Looking forward to it in Xen 4.2 since
>> iommu msi is really broken with recent Linux dom0...
>
> That's unlikely to be for 4.2 - it's going to have a reasonable risk
> of regressions, and functionality like that I don't think is nice to
> rush into a feature frozen tree, the more that it provides a
> workaround for the combination of questionable design and an
> improper kernel side fix.

Yes, it looks like improper to me also, but it would also be nice to 
have something Xen to handle this situation. Maybe we should not trust 
guest os, even dom0 some times...

> Have you at all considered getting this fixed on the kernel side?
> As I don't have direct access to any AMD IOMMU capable
> system - can one, other than by enumerating the respective
> PCI IDs or reading ACPI tables, reasonably easily identify the
> devices in question (e.g. via vendor/class/sub-class or some
> such)? That might permit skipping those in the offending kernel
> code...

AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral) 
and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this 
should be enough to identify amd iommu device. I could send you a kernel 
patch for review using this approach. I would believe that fixing this 
issue in 4.2, no matter how, is really important for amd iommu.

Thanks,
Wei

> Jan
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it
  2012-06-14 15:15           ` Wei Wang
@ 2012-06-14 15:27             ` Jan Beulich
  2012-06-21  9:59               ` Jan Beulich
  1 sibling, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2012-06-14 15:27 UTC (permalink / raw)
  To: Wei Wang
  Cc: Andrew Cooper, Jeremy Fitzhardinge, xen-devel,
	Konrad Rzeszutek Wilk, Sherry Hurwitz

>>> On 14.06.12 at 17:15, Wei Wang <wei.wang2@amd.com> wrote:
> Am 14.06.2012 16:18, schrieb Jan Beulich:
>> Have you at all considered getting this fixed on the kernel side?
>> As I don't have direct access to any AMD IOMMU capable
>> system - can one, other than by enumerating the respective
>> PCI IDs or reading ACPI tables, reasonably easily identify the
>> devices in question (e.g. via vendor/class/sub-class or some
>> such)? That might permit skipping those in the offending kernel
>> code...
> 
> AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral) 
> and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this 
> should be enough to identify amd iommu device. I could send you a kernel 
> patch for review using this approach.

Yes, please.

Jan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it
  2012-06-14 12:13       ` Wei Wang
  2012-06-14 14:18         ` Jan Beulich
@ 2012-06-20 15:45         ` Jan Beulich
  2012-06-21 15:29           ` Wei Wang
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2012-06-20 15:45 UTC (permalink / raw)
  To: Wei Wang
  Cc: Andrew Cooper, Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk

[-- Attachment #1: Type: text/plain, Size: 14055 bytes --]

>>> On 14.06.12 at 14:13, Wei Wang <wei.wang2@amd.com> wrote:
> Am 12.06.2012 18:43, schrieb Jan Beulich:
>>>>> On 12.06.12 at 18:08, Andrew Cooper<andrew.cooper3@citrix.com>  wrote:
>>> On 12/06/12 16:13, Jan Beulich wrote:
>>>>>>> On 12.06.12 at 14:02, Wei Wang<wei.wang2@amd.com>  wrote:
>>>>> I had attached a revised patch, please check it.
>>>> While the patch technically looks better now, you didn't eliminate
>>>> my objections to the approach you take, nor did you comment on
>>>> the proposed alternative.
>>>>
>>>> A fundamental problem is that your IOMMUs show up as a "normal"
>>>> PCI devices, breaking the separation between what is being
>>>> managed by the hypervisor vs by the Dom0 kernel. (This even
>>>> allows something as odd as passing through an IOMMU to a
>>>> DomU, which would clearly upset the hypervisor.)
>>>>
>>>>> I found that the following Linux commit triggers this issue. It has been
>>>>> included into 3.4 pv_ops.
>>>>>
>>>>> " commit a776c491ca5e38c26d9f66923ff574d041e747f4
>>>>>     Author: Eric W. Biederman<ebiederm@xmission.com>
>>>>>     Date:   Mon Oct 17 11:46:06 2011 -0700
>>>>>
>>>>>     PCI: msi: Disable msi interrupts when we initialize a pci device "
>>>> Thanks for locating this. As it stands, it is incomplete though
>>>> anyway: If the kexec-ed kernel is one built without CONFIG_PCI_MSI,
>>>> it won't have a means to suppress the "screaming" interrupts.
>>>
>>> I feel that the correct solution would be for Xen to hide the PCI
>>> devices from dom0.  Xen already hides the DMAR ACPI table (by turning it
>>> to an XMAR table), and this would be the logical way to proceed now that
>>> IOMMU internals are appearing as PCI devices.
> 
> That sounds absolutely good to me. thanks for the suggestion.
> 
>> That is precisely what I suggested in my response to the first
>> version of this patch, and I'd also volunteer to look into putting
>> together a first draft implementation if we sort of agree that
>> this is the way to go.
> 
> Cool! thanks for doing that. Looking forward to it in Xen 4.2 since 
> iommu msi is really broken with recent Linux dom0...

Rather than submitting it for inclusion immediately, I'd rather see
you first use it successfully. Below/attached what appears to
work fine for me in contrived situations (i.e. hiding bridges with
nothing connected, as I still don't have any AMD IOMMU system
at hand).

Jan

Note that due to ptwr_do_page_fault() being run first, there'll be a
MEM_LOG() issued for each such attempt. If that's undesirable, the
order of the calls would need to be swapped.

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1875,6 +1875,7 @@ static int mod_l1_entry(l1_pgentry_t *pl
             break;
         case 1:
             l1e_remove_flags(nl1e, _PAGE_RW);
+            rc = 0;
             break;
         }
         if ( page )
@@ -5208,6 +5209,97 @@ int ptwr_do_page_fault(struct vcpu *v, u
     return 0;
 }
 
+#ifdef __x86_64__
+/*************************
+ * fault handling for read-only MMIO pages
+ */
+
+struct mmio_ro_emulate_ctxt {
+    struct x86_emulate_ctxt ctxt;
+    unsigned long cr2;
+};
+
+static int mmio_ro_emulated_read(
+    enum x86_segment seg,
+    unsigned long offset,
+    void *p_data,
+    unsigned int bytes,
+    struct x86_emulate_ctxt *ctxt)
+{
+    return X86EMUL_UNHANDLEABLE;
+}
+
+static int mmio_ro_emulated_write(
+    enum x86_segment seg,
+    unsigned long offset,
+    void *p_data,
+    unsigned int bytes,
+    struct x86_emulate_ctxt *ctxt)
+{
+    struct mmio_ro_emulate_ctxt *mmio_ro_ctxt =
+        container_of(ctxt, struct mmio_ro_emulate_ctxt, ctxt);
+
+    /* Only allow naturally-aligned stores at the original %cr2 address. */
+    if ( ((bytes | offset) & (bytes - 1)) || offset != mmio_ro_ctxt->cr2 )
+    {
+        MEM_LOG("mmio_ro_emulate: bad access (cr2=%lx, addr=%lx, bytes=%u)",
+                mmio_ro_ctxt->cr2, offset, bytes);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    return X86EMUL_OKAY;
+}
+
+static const struct x86_emulate_ops mmio_ro_emulate_ops = {
+    .read       = mmio_ro_emulated_read,
+    .insn_fetch = ptwr_emulated_read,
+    .write      = mmio_ro_emulated_write,
+};
+
+/* Check if guest is trying to modify a r/o MMIO page. */
+int mmio_ro_do_page_fault(struct vcpu *v, unsigned long addr,
+                          struct cpu_user_regs *regs)
+{
+    l1_pgentry_t      pte;
+    unsigned long     mfn;
+    unsigned int      addr_size = is_pv_32on64_domain(v->domain) ?
+                                  32 : BITS_PER_LONG;
+    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = {
+        .ctxt.regs = regs,
+        .ctxt.addr_size = addr_size,
+        .ctxt.sp_size = addr_size,
+        .cr2 = addr
+    };
+    int rc;
+
+    /* Attempt to read the PTE that maps the VA being accessed. */
+    guest_get_eff_l1e(v, addr, &pte);
+
+    /* We are looking only for read-only mappings of MMIO pages. */
+    if ( ((l1e_get_flags(pte) & (_PAGE_PRESENT|_PAGE_RW)) != _PAGE_PRESENT) )
+        return 0;
+
+    mfn = l1e_get_pfn(pte);
+    if ( mfn_valid(mfn) )
+    {
+        struct page_info *page = mfn_to_page(mfn);
+        struct domain *owner = page_get_owner_and_reference(page);
+
+        if ( owner )
+            put_page(page);
+        if ( owner != dom_io )
+            return 0;
+    }
+
+    if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
+        return 0;
+
+    rc = x86_emulate(&mmio_ro_ctxt.ctxt, &mmio_ro_emulate_ops);
+
+    return rc != X86EMUL_UNHANDLEABLE ? EXCRET_fault_fixed : 0;
+}
+#endif /* __x86_64__ */
+
 void free_xen_pagetable(void *v)
 {
     if ( system_state == SYS_STATE_early_boot )
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1349,20 +1349,23 @@ static int fixup_page_fault(unsigned lon
         return 0;
     }
 
-    if ( VM_ASSIST(d, VMASST_TYPE_writable_pagetables) &&
-         guest_kernel_mode(v, regs) )
-    {
-        unsigned int mbs = PFEC_write_access;
-        unsigned int mbz = PFEC_reserved_bit | PFEC_insn_fetch;
-
-        /* Do not check if access-protection fault since the page may 
-           legitimately be not present in shadow page tables */
-        if ( !paging_mode_enabled(d) )
-            mbs |= PFEC_page_present;
-
-        if ( ((regs->error_code & (mbs | mbz)) == mbs) &&
+    if ( guest_kernel_mode(v, regs) &&
+         !(regs->error_code & (PFEC_reserved_bit | PFEC_insn_fetch)) &&
+         (regs->error_code & PFEC_write_access) )
+    {
+        if ( VM_ASSIST(d, VMASST_TYPE_writable_pagetables) &&
+             /* Do not check if access-protection fault since the page may
+                legitimately be not present in shadow page tables */
+             (paging_mode_enabled(d) ||
+              (regs->error_code & PFEC_page_present)) &&
              ptwr_do_page_fault(v, addr, regs) )
             return EXCRET_fault_fixed;
+
+#ifdef __x86_64__
+        if ( IS_PRIV(d) && (regs->error_code & PFEC_page_present) &&
+             mmio_ro_do_page_fault(v, addr, regs) )
+            return EXCRET_fault_fixed;
+#endif
     }
 
     /* For non-external shadowed guests, we fix up both their own 
@@ -1690,6 +1693,13 @@ static int pci_cfg_ok(struct domain *d, 
         return 0;
 
     machine_bdf = (d->arch.pci_cf8 >> 8) & 0xFFFF;
+    if ( write )
+    {
+        const unsigned long *ro_map = pci_get_ro_map(0);
+
+        if ( ro_map && test_bit(machine_bdf, ro_map) )
+            return 0;
+    }
     start = d->arch.pci_cf8 & 0xFF;
     end = start + size - 1;
     if (xsm_pci_config_permission(d, machine_bdf, start, end, write))
--- a/xen/arch/x86/x86_32/pci.c
+++ b/xen/arch/x86/x86_32/pci.c
@@ -6,6 +6,7 @@
 
 #include <xen/spinlock.h>
 #include <xen/pci.h>
+#include <xen/init.h>
 #include <asm/io.h>
 
 #define PCI_CONF_ADDRESS(bus, dev, func, reg) \
@@ -70,3 +71,7 @@ void pci_conf_write32(
     BUG_ON((bus > 255) || (dev > 31) || (func > 7) || (reg > 255));
     pci_conf_write(PCI_CONF_ADDRESS(bus, dev, func, reg), 0, 4, data);
 }
+
+void __init arch_pci_ro_device(int seg, int bdf)
+{
+}
--- a/xen/arch/x86/x86_64/mmconfig_64.c
+++ b/xen/arch/x86/x86_64/mmconfig_64.c
@@ -145,9 +145,30 @@ static void __iomem *mcfg_ioremap(const 
     return (void __iomem *) virt;
 }
 
+void arch_pci_ro_device(int seg, int bdf)
+{
+    unsigned int idx, bus = PCI_BUS(bdf);
+
+    for (idx = 0; idx < pci_mmcfg_config_num; ++idx) {
+        const struct acpi_mcfg_allocation *cfg = pci_mmcfg_virt[idx].cfg;
+        unsigned long mfn = (cfg->address >> PAGE_SHIFT) + bdf;
+
+        if (!pci_mmcfg_virt[idx].virt || cfg->pci_segment != seg ||
+            cfg->start_bus_number > bus || cfg->end_bus_number < bus)
+            continue;
+
+        if (rangeset_add_singleton(mmio_ro_ranges, mfn))
+            printk(XENLOG_ERR
+                   "%04x:%02x:%02x.%u: could not mark MCFG (mfn %#lx) read-only\n",
+                   cfg->pci_segment, bus, PCI_SLOT(bdf), PCI_FUNC(bdf),
+                   mfn);
+    }
+}
+
 int pci_mmcfg_arch_enable(unsigned int idx)
 {
     const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
+    const unsigned long *ro_map = pci_get_ro_map(cfg->pci_segment);
 
     if (pci_mmcfg_virt[idx].virt)
         return 0;
@@ -159,6 +180,16 @@ int pci_mmcfg_arch_enable(unsigned int i
     }
     printk(KERN_INFO "PCI: Using MCFG for segment %04x bus %02x-%02x\n",
            cfg->pci_segment, cfg->start_bus_number, cfg->end_bus_number);
+    if (ro_map) {
+        unsigned int bdf = PCI_BDF(cfg->start_bus_number, 0, 0);
+        unsigned int end = PCI_BDF(cfg->end_bus_number, -1, -1);
+
+        while ((bdf = find_next_bit(ro_map, end + 1, bdf)) <= end) {
+            arch_pci_ro_device(cfg->pci_segment, bdf);
+            if (bdf++ == end)
+                break;
+        }
+    }
     return 0;
 }
 
--- a/xen/drivers/passthrough/amd/iommu_detect.c
+++ b/xen/drivers/passthrough/amd/iommu_detect.c
@@ -153,6 +153,12 @@ int __init amd_iommu_detect_one_acpi(
     if ( rt )
         return -ENODEV;
 
+    rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func));
+    if ( rt )
+        printk(XENLOG_ERR
+               "Could not mark config space of %04x:%02x:%02x.%u read-only (%d)\n",
+               iommu->seg, bus, dev, func, rt);
+
     list_add_tail(&iommu->list, &amd_iommu_head);
 
     return 0;
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -593,11 +593,3 @@ void hvm_dpci_eoi(struct domain *d, unsi
 unlock:
     spin_unlock(&d->event_lock);
 }
-
-static int __init setup_mmio_ro_ranges(void)
-{
-    mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
-                                  RANGESETF_prettyprint_hex);
-    return 0;
-}
-__initcall(setup_mmio_ro_ranges);
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -36,6 +36,7 @@
 struct pci_seg {
     struct list_head alldevs_list;
     u16 nr;
+    unsigned long *ro_map;
     /* bus2bridge_lock protects bus2bridge array */
     spinlock_t bus2bridge_lock;
 #define MAX_BUSES 256
@@ -106,6 +107,8 @@ void __init pt_pci_init(void)
     radix_tree_init(&pci_segments);
     if ( !alloc_pseg(0) )
         panic("Could not initialize PCI segment 0\n");
+    mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
+                                  RANGESETF_prettyprint_hex);
 }
 
 int __init pci_add_segment(u16 seg)
@@ -113,6 +116,13 @@ int __init pci_add_segment(u16 seg)
     return alloc_pseg(seg) ? 0 : -ENOMEM;
 }
 
+const unsigned long *pci_get_ro_map(u16 seg)
+{
+    struct pci_seg *pseg = get_pseg(seg);
+
+    return pseg ? pseg->ro_map : NULL;
+}
+
 static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
 {
     struct pci_dev *pdev;
@@ -198,6 +208,33 @@ static void free_pdev(struct pci_seg *ps
     xfree(pdev);
 }
 
+int __init pci_ro_device(int seg, int bus, int devfn)
+{
+    struct pci_seg *pseg = alloc_pseg(seg);
+    struct pci_dev *pdev;
+
+    if ( !pseg )
+        return -ENOMEM;
+    pdev = alloc_pdev(pseg, bus, devfn);
+    if ( !pdev )
+        return -ENOMEM;
+
+    if ( !pseg->ro_map )
+    {
+        size_t sz = BITS_TO_LONGS(PCI_BDF(-1, -1, -1) + 1) * sizeof(long);
+
+        pseg->ro_map = alloc_xenheap_pages(get_order_from_bytes(sz), 0);
+        if ( !pseg->ro_map )
+            return -ENOMEM;
+        memset(pseg->ro_map, 0, sz);
+    }
+
+    __set_bit(PCI_BDF2(bus, devfn), pseg->ro_map);
+    arch_pci_ro_device(seg, PCI_BDF2(bus, devfn));
+
+    return 0;
+}
+
 struct pci_dev *pci_get_pdev(int seg, int bus, int devfn)
 {
     struct pci_seg *pseg = get_pseg(seg);
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -555,6 +555,8 @@ void memguard_unguard_stack(void *p);
 
 int  ptwr_do_page_fault(struct vcpu *, unsigned long,
                         struct cpu_user_regs *);
+int  mmio_ro_do_page_fault(struct vcpu *, unsigned long,
+                           struct cpu_user_regs *);
 
 int audit_adjust_pgtables(struct domain *d, int dir, int noisy);
 
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -98,8 +98,11 @@ struct pci_dev *pci_lock_domain_pdev(
 void setup_dom0_pci_devices(struct domain *, void (*)(struct pci_dev *));
 void pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
+const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
 int pci_remove_device(u16 seg, u8 bus, u8 devfn);
+int pci_ro_device(int seg, int bus, int devfn);
+void arch_pci_ro_device(int seg, int bdf);
 struct pci_dev *pci_get_pdev(int seg, int bus, int devfn);
 struct pci_dev *pci_get_pdev_by_domain(
     struct domain *, int seg, int bus, int devfn);


[-- Attachment #2: pci-disallow-write.patch --]
[-- Type: text/plain, Size: 11607 bytes --]

Note that due to ptwr_do_page_fault() being run first, there'll be a
MEM_LOG() issued for each such attempt. If that's undesirable, the
order of the calls would need to be swapped.

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1875,6 +1875,7 @@ static int mod_l1_entry(l1_pgentry_t *pl
             break;
         case 1:
             l1e_remove_flags(nl1e, _PAGE_RW);
+            rc = 0;
             break;
         }
         if ( page )
@@ -5208,6 +5209,97 @@ int ptwr_do_page_fault(struct vcpu *v, u
     return 0;
 }
 
+#ifdef __x86_64__
+/*************************
+ * fault handling for read-only MMIO pages
+ */
+
+struct mmio_ro_emulate_ctxt {
+    struct x86_emulate_ctxt ctxt;
+    unsigned long cr2;
+};
+
+static int mmio_ro_emulated_read(
+    enum x86_segment seg,
+    unsigned long offset,
+    void *p_data,
+    unsigned int bytes,
+    struct x86_emulate_ctxt *ctxt)
+{
+    return X86EMUL_UNHANDLEABLE;
+}
+
+static int mmio_ro_emulated_write(
+    enum x86_segment seg,
+    unsigned long offset,
+    void *p_data,
+    unsigned int bytes,
+    struct x86_emulate_ctxt *ctxt)
+{
+    struct mmio_ro_emulate_ctxt *mmio_ro_ctxt =
+        container_of(ctxt, struct mmio_ro_emulate_ctxt, ctxt);
+
+    /* Only allow naturally-aligned stores at the original %cr2 address. */
+    if ( ((bytes | offset) & (bytes - 1)) || offset != mmio_ro_ctxt->cr2 )
+    {
+        MEM_LOG("mmio_ro_emulate: bad access (cr2=%lx, addr=%lx, bytes=%u)",
+                mmio_ro_ctxt->cr2, offset, bytes);
+        return X86EMUL_UNHANDLEABLE;
+    }
+
+    return X86EMUL_OKAY;
+}
+
+static const struct x86_emulate_ops mmio_ro_emulate_ops = {
+    .read       = mmio_ro_emulated_read,
+    .insn_fetch = ptwr_emulated_read,
+    .write      = mmio_ro_emulated_write,
+};
+
+/* Check if guest is trying to modify a r/o MMIO page. */
+int mmio_ro_do_page_fault(struct vcpu *v, unsigned long addr,
+                          struct cpu_user_regs *regs)
+{
+    l1_pgentry_t      pte;
+    unsigned long     mfn;
+    unsigned int      addr_size = is_pv_32on64_domain(v->domain) ?
+                                  32 : BITS_PER_LONG;
+    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = {
+        .ctxt.regs = regs,
+        .ctxt.addr_size = addr_size,
+        .ctxt.sp_size = addr_size,
+        .cr2 = addr
+    };
+    int rc;
+
+    /* Attempt to read the PTE that maps the VA being accessed. */
+    guest_get_eff_l1e(v, addr, &pte);
+
+    /* We are looking only for read-only mappings of MMIO pages. */
+    if ( ((l1e_get_flags(pte) & (_PAGE_PRESENT|_PAGE_RW)) != _PAGE_PRESENT) )
+        return 0;
+
+    mfn = l1e_get_pfn(pte);
+    if ( mfn_valid(mfn) )
+    {
+        struct page_info *page = mfn_to_page(mfn);
+        struct domain *owner = page_get_owner_and_reference(page);
+
+        if ( owner )
+            put_page(page);
+        if ( owner != dom_io )
+            return 0;
+    }
+
+    if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
+        return 0;
+
+    rc = x86_emulate(&mmio_ro_ctxt.ctxt, &mmio_ro_emulate_ops);
+
+    return rc != X86EMUL_UNHANDLEABLE ? EXCRET_fault_fixed : 0;
+}
+#endif /* __x86_64__ */
+
 void free_xen_pagetable(void *v)
 {
     if ( system_state == SYS_STATE_early_boot )
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1349,20 +1349,23 @@ static int fixup_page_fault(unsigned lon
         return 0;
     }
 
-    if ( VM_ASSIST(d, VMASST_TYPE_writable_pagetables) &&
-         guest_kernel_mode(v, regs) )
-    {
-        unsigned int mbs = PFEC_write_access;
-        unsigned int mbz = PFEC_reserved_bit | PFEC_insn_fetch;
-
-        /* Do not check if access-protection fault since the page may 
-           legitimately be not present in shadow page tables */
-        if ( !paging_mode_enabled(d) )
-            mbs |= PFEC_page_present;
-
-        if ( ((regs->error_code & (mbs | mbz)) == mbs) &&
+    if ( guest_kernel_mode(v, regs) &&
+         !(regs->error_code & (PFEC_reserved_bit | PFEC_insn_fetch)) &&
+         (regs->error_code & PFEC_write_access) )
+    {
+        if ( VM_ASSIST(d, VMASST_TYPE_writable_pagetables) &&
+             /* Do not check if access-protection fault since the page may
+                legitimately be not present in shadow page tables */
+             (paging_mode_enabled(d) ||
+              (regs->error_code & PFEC_page_present)) &&
              ptwr_do_page_fault(v, addr, regs) )
             return EXCRET_fault_fixed;
+
+#ifdef __x86_64__
+        if ( IS_PRIV(d) && (regs->error_code & PFEC_page_present) &&
+             mmio_ro_do_page_fault(v, addr, regs) )
+            return EXCRET_fault_fixed;
+#endif
     }
 
     /* For non-external shadowed guests, we fix up both their own 
@@ -1690,6 +1693,13 @@ static int pci_cfg_ok(struct domain *d, 
         return 0;
 
     machine_bdf = (d->arch.pci_cf8 >> 8) & 0xFFFF;
+    if ( write )
+    {
+        const unsigned long *ro_map = pci_get_ro_map(0);
+
+        if ( ro_map && test_bit(machine_bdf, ro_map) )
+            return 0;
+    }
     start = d->arch.pci_cf8 & 0xFF;
     end = start + size - 1;
     if (xsm_pci_config_permission(d, machine_bdf, start, end, write))
--- a/xen/arch/x86/x86_32/pci.c
+++ b/xen/arch/x86/x86_32/pci.c
@@ -6,6 +6,7 @@
 
 #include <xen/spinlock.h>
 #include <xen/pci.h>
+#include <xen/init.h>
 #include <asm/io.h>
 
 #define PCI_CONF_ADDRESS(bus, dev, func, reg) \
@@ -70,3 +71,7 @@ void pci_conf_write32(
     BUG_ON((bus > 255) || (dev > 31) || (func > 7) || (reg > 255));
     pci_conf_write(PCI_CONF_ADDRESS(bus, dev, func, reg), 0, 4, data);
 }
+
+void __init arch_pci_ro_device(int seg, int bdf)
+{
+}
--- a/xen/arch/x86/x86_64/mmconfig_64.c
+++ b/xen/arch/x86/x86_64/mmconfig_64.c
@@ -145,9 +145,30 @@ static void __iomem *mcfg_ioremap(const 
     return (void __iomem *) virt;
 }
 
+void arch_pci_ro_device(int seg, int bdf)
+{
+    unsigned int idx, bus = PCI_BUS(bdf);
+
+    for (idx = 0; idx < pci_mmcfg_config_num; ++idx) {
+        const struct acpi_mcfg_allocation *cfg = pci_mmcfg_virt[idx].cfg;
+        unsigned long mfn = (cfg->address >> PAGE_SHIFT) + bdf;
+
+        if (!pci_mmcfg_virt[idx].virt || cfg->pci_segment != seg ||
+            cfg->start_bus_number > bus || cfg->end_bus_number < bus)
+            continue;
+
+        if (rangeset_add_singleton(mmio_ro_ranges, mfn))
+            printk(XENLOG_ERR
+                   "%04x:%02x:%02x.%u: could not mark MCFG (mfn %#lx) read-only\n",
+                   cfg->pci_segment, bus, PCI_SLOT(bdf), PCI_FUNC(bdf),
+                   mfn);
+    }
+}
+
 int pci_mmcfg_arch_enable(unsigned int idx)
 {
     const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
+    const unsigned long *ro_map = pci_get_ro_map(cfg->pci_segment);
 
     if (pci_mmcfg_virt[idx].virt)
         return 0;
@@ -159,6 +180,16 @@ int pci_mmcfg_arch_enable(unsigned int i
     }
     printk(KERN_INFO "PCI: Using MCFG for segment %04x bus %02x-%02x\n",
            cfg->pci_segment, cfg->start_bus_number, cfg->end_bus_number);
+    if (ro_map) {
+        unsigned int bdf = PCI_BDF(cfg->start_bus_number, 0, 0);
+        unsigned int end = PCI_BDF(cfg->end_bus_number, -1, -1);
+
+        while ((bdf = find_next_bit(ro_map, end + 1, bdf)) <= end) {
+            arch_pci_ro_device(cfg->pci_segment, bdf);
+            if (bdf++ == end)
+                break;
+        }
+    }
     return 0;
 }
 
--- a/xen/drivers/passthrough/amd/iommu_detect.c
+++ b/xen/drivers/passthrough/amd/iommu_detect.c
@@ -153,6 +153,12 @@ int __init amd_iommu_detect_one_acpi(
     if ( rt )
         return -ENODEV;
 
+    rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func));
+    if ( rt )
+        printk(XENLOG_ERR
+               "Could not mark config space of %04x:%02x:%02x.%u read-only (%d)\n",
+               iommu->seg, bus, dev, func, rt);
+
     list_add_tail(&iommu->list, &amd_iommu_head);
 
     return 0;
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -593,11 +593,3 @@ void hvm_dpci_eoi(struct domain *d, unsi
 unlock:
     spin_unlock(&d->event_lock);
 }
-
-static int __init setup_mmio_ro_ranges(void)
-{
-    mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
-                                  RANGESETF_prettyprint_hex);
-    return 0;
-}
-__initcall(setup_mmio_ro_ranges);
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -36,6 +36,7 @@
 struct pci_seg {
     struct list_head alldevs_list;
     u16 nr;
+    unsigned long *ro_map;
     /* bus2bridge_lock protects bus2bridge array */
     spinlock_t bus2bridge_lock;
 #define MAX_BUSES 256
@@ -106,6 +107,8 @@ void __init pt_pci_init(void)
     radix_tree_init(&pci_segments);
     if ( !alloc_pseg(0) )
         panic("Could not initialize PCI segment 0\n");
+    mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
+                                  RANGESETF_prettyprint_hex);
 }
 
 int __init pci_add_segment(u16 seg)
@@ -113,6 +116,13 @@ int __init pci_add_segment(u16 seg)
     return alloc_pseg(seg) ? 0 : -ENOMEM;
 }
 
+const unsigned long *pci_get_ro_map(u16 seg)
+{
+    struct pci_seg *pseg = get_pseg(seg);
+
+    return pseg ? pseg->ro_map : NULL;
+}
+
 static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
 {
     struct pci_dev *pdev;
@@ -198,6 +208,33 @@ static void free_pdev(struct pci_seg *ps
     xfree(pdev);
 }
 
+int __init pci_ro_device(int seg, int bus, int devfn)
+{
+    struct pci_seg *pseg = alloc_pseg(seg);
+    struct pci_dev *pdev;
+
+    if ( !pseg )
+        return -ENOMEM;
+    pdev = alloc_pdev(pseg, bus, devfn);
+    if ( !pdev )
+        return -ENOMEM;
+
+    if ( !pseg->ro_map )
+    {
+        size_t sz = BITS_TO_LONGS(PCI_BDF(-1, -1, -1) + 1) * sizeof(long);
+
+        pseg->ro_map = alloc_xenheap_pages(get_order_from_bytes(sz), 0);
+        if ( !pseg->ro_map )
+            return -ENOMEM;
+        memset(pseg->ro_map, 0, sz);
+    }
+
+    __set_bit(PCI_BDF2(bus, devfn), pseg->ro_map);
+    arch_pci_ro_device(seg, PCI_BDF2(bus, devfn));
+
+    return 0;
+}
+
 struct pci_dev *pci_get_pdev(int seg, int bus, int devfn)
 {
     struct pci_seg *pseg = get_pseg(seg);
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -555,6 +555,8 @@ void memguard_unguard_stack(void *p);
 
 int  ptwr_do_page_fault(struct vcpu *, unsigned long,
                         struct cpu_user_regs *);
+int  mmio_ro_do_page_fault(struct vcpu *, unsigned long,
+                           struct cpu_user_regs *);
 
 int audit_adjust_pgtables(struct domain *d, int dir, int noisy);
 
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -98,8 +98,11 @@ struct pci_dev *pci_lock_domain_pdev(
 void setup_dom0_pci_devices(struct domain *, void (*)(struct pci_dev *));
 void pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
+const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
 int pci_remove_device(u16 seg, u8 bus, u8 devfn);
+int pci_ro_device(int seg, int bus, int devfn);
+void arch_pci_ro_device(int seg, int bdf);
 struct pci_dev *pci_get_pdev(int seg, int bus, int devfn);
 struct pci_dev *pci_get_pdev_by_domain(
     struct domain *, int seg, int bus, int devfn);

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0 (was: Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it)
  2012-06-14 15:15           ` Wei Wang
@ 2012-06-21  9:59               ` Jan Beulich
  2012-06-21  9:59               ` Jan Beulich
  1 sibling, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2012-06-21  9:59 UTC (permalink / raw)
  To: Wei Wang, Jesse Barnes, ebiederm
  Cc: Sherry Hurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, Konrad Rzeszutek Wilk, linux-pci

>>> On 14.06.12 at 17:15, Wei Wang <wei.wang2@amd.com> wrote:
> Am 14.06.2012 16:18, schrieb Jan Beulich:
>> Have you at all considered getting this fixed on the kernel side?
>> As I don't have direct access to any AMD IOMMU capable
>> system - can one, other than by enumerating the respective
>> PCI IDs or reading ACPI tables, reasonably easily identify the
>> devices in question (e.g. via vendor/class/sub-class or some
>> such)? That might permit skipping those in the offending kernel
>> code...
> 
> AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral) 
> and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this 
> should be enough to identify amd iommu device. I could send you a kernel 
> patch for review using this approach. I would believe that fixing this 
> issue in 4.2, no matter how, is really important for amd iommu.

As you didn't come forward with anything, here's my first
take on this:

Commit d5dea7d95c48d7bc951cee4910a7fd9c0cd26fb0 ("PCI: msi: Disable msi
interrupts when we initialize a pci device") caused MSI to get disabled
on Xen Dom0 despite it having got turned on purposefully by the
hypervisor. As an immediate band aid, suppress the disabling in this
specific case.

I don't think, however, that either the original change or this fix are
really appropriate. The original fix, besides leaving aside the
presence of a hypervisor like Xen, doesn't deal with all possible
cases (in particular it has no effect if the secondary kernel was built
with CONFIG_PCI_MSI unset). And taking into account a hypervisor like
Xen, the logic like this should probably be skipped altogether (i.e. it
should be entirely left to the hypervisor, being the entity in control
of IRQs).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: stable@kernel.org 

---
 drivers/pci/msi.c       |    7 +++++++
 include/linux/pci_ids.h |    1 +
 2 files changed, 8 insertions(+)

--- 3.5-rc3/drivers/pci/msi.c
+++ 3.5-rc3-xen-pci-msi-no-iommu-disable/drivers/pci/msi.c
@@ -20,6 +20,7 @@
 #include <linux/errno.h>
 #include <linux/io.h>
 #include <linux/slab.h>
+#include <xen/xen.h>
 
 #include "pci.h"
 #include "msi.h"
@@ -1022,7 +1023,13 @@ void pci_msi_init_pci_dev(struct pci_dev
 	/* Disable the msi hardware to avoid screaming interrupts
 	 * during boot.  This is the power on reset default so
 	 * usually this should be a noop.
+	 * But on a Xen host don't do this for IOMMUs which the hypervisor
+	 * is in control of (and hence has already enabled on purpose).
 	 */
+	if (xen_initial_domain()
+	    && (dev->class >> 8) == PCI_CLASS_SYSTEM_IOMMU
+	    && dev->vendor == PCI_VENDOR_ID_AMD)
+		return;
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
 	if (pos)
 		msi_set_enable(dev, pos, 0);
--- 3.5-rc3/include/linux/pci_ids.h
+++ 3.5-rc3-xen-pci-msi-no-iommu-disable/include/linux/pci_ids.h
@@ -75,6 +75,7 @@
 #define PCI_CLASS_SYSTEM_RTC		0x0803
 #define PCI_CLASS_SYSTEM_PCI_HOTPLUG	0x0804
 #define PCI_CLASS_SYSTEM_SDHCI		0x0805
+#define PCI_CLASS_SYSTEM_IOMMU		0x0806
 #define PCI_CLASS_SYSTEM_OTHER		0x0880
 
 #define PCI_BASE_CLASS_INPUT		0x09




^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0 (was: Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it)
@ 2012-06-21  9:59               ` Jan Beulich
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2012-06-21  9:59 UTC (permalink / raw)
  To: Wei Wang, Jesse Barnes, ebiederm
  Cc: Sherry Hurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, Konrad Rzeszutek Wilk, linux-pci

>>> On 14.06.12 at 17:15, Wei Wang <wei.wang2@amd.com> wrote:
> Am 14.06.2012 16:18, schrieb Jan Beulich:
>> Have you at all considered getting this fixed on the kernel side?
>> As I don't have direct access to any AMD IOMMU capable
>> system - can one, other than by enumerating the respective
>> PCI IDs or reading ACPI tables, reasonably easily identify the
>> devices in question (e.g. via vendor/class/sub-class or some
>> such)? That might permit skipping those in the offending kernel
>> code...
> 
> AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral) 
> and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this 
> should be enough to identify amd iommu device. I could send you a kernel 
> patch for review using this approach. I would believe that fixing this 
> issue in 4.2, no matter how, is really important for amd iommu.

As you didn't come forward with anything, here's my first
take on this:

Commit d5dea7d95c48d7bc951cee4910a7fd9c0cd26fb0 ("PCI: msi: Disable msi
interrupts when we initialize a pci device") caused MSI to get disabled
on Xen Dom0 despite it having got turned on purposefully by the
hypervisor. As an immediate band aid, suppress the disabling in this
specific case.

I don't think, however, that either the original change or this fix are
really appropriate. The original fix, besides leaving aside the
presence of a hypervisor like Xen, doesn't deal with all possible
cases (in particular it has no effect if the secondary kernel was built
with CONFIG_PCI_MSI unset). And taking into account a hypervisor like
Xen, the logic like this should probably be skipped altogether (i.e. it
should be entirely left to the hypervisor, being the entity in control
of IRQs).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: stable@kernel.org 

---
 drivers/pci/msi.c       |    7 +++++++
 include/linux/pci_ids.h |    1 +
 2 files changed, 8 insertions(+)

--- 3.5-rc3/drivers/pci/msi.c
+++ 3.5-rc3-xen-pci-msi-no-iommu-disable/drivers/pci/msi.c
@@ -20,6 +20,7 @@
 #include <linux/errno.h>
 #include <linux/io.h>
 #include <linux/slab.h>
+#include <xen/xen.h>
 
 #include "pci.h"
 #include "msi.h"
@@ -1022,7 +1023,13 @@ void pci_msi_init_pci_dev(struct pci_dev
 	/* Disable the msi hardware to avoid screaming interrupts
 	 * during boot.  This is the power on reset default so
 	 * usually this should be a noop.
+	 * But on a Xen host don't do this for IOMMUs which the hypervisor
+	 * is in control of (and hence has already enabled on purpose).
 	 */
+	if (xen_initial_domain()
+	    && (dev->class >> 8) == PCI_CLASS_SYSTEM_IOMMU
+	    && dev->vendor == PCI_VENDOR_ID_AMD)
+		return;
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
 	if (pos)
 		msi_set_enable(dev, pos, 0);
--- 3.5-rc3/include/linux/pci_ids.h
+++ 3.5-rc3-xen-pci-msi-no-iommu-disable/include/linux/pci_ids.h
@@ -75,6 +75,7 @@
 #define PCI_CLASS_SYSTEM_RTC		0x0803
 #define PCI_CLASS_SYSTEM_PCI_HOTPLUG	0x0804
 #define PCI_CLASS_SYSTEM_SDHCI		0x0805
+#define PCI_CLASS_SYSTEM_IOMMU		0x0806
 #define PCI_CLASS_SYSTEM_OTHER		0x0880
 
 #define PCI_BASE_CLASS_INPUT		0x09

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
  2012-06-21  9:59               ` Jan Beulich
@ 2012-06-21 11:08                 ` Eric W. Biederman
  -1 siblings, 0 replies; 33+ messages in thread
From: Eric W. Biederman @ 2012-06-21 11:08 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Wang, Jesse Barnes, Sherry Hurwitz, Andrew Cooper,
	Jeremy Fitzhardinge, stable, xen-devel, Konrad Rzeszutek Wilk,
	linux-pci

"Jan Beulich" <JBeulich@suse.com> writes:

>>>> On 14.06.12 at 17:15, Wei Wang <wei.wang2@amd.com> wrote:
>> Am 14.06.2012 16:18, schrieb Jan Beulich:
>>> Have you at all considered getting this fixed on the kernel side?
>>> As I don't have direct access to any AMD IOMMU capable
>>> system - can one, other than by enumerating the respective
>>> PCI IDs or reading ACPI tables, reasonably easily identify the
>>> devices in question (e.g. via vendor/class/sub-class or some
>>> such)? That might permit skipping those in the offending kernel
>>> code...
>> 
>> AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral) 
>> and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this 
>> should be enough to identify amd iommu device. I could send you a kernel 
>> patch for review using this approach. I would believe that fixing this 
>> issue in 4.2, no matter how, is really important for amd iommu.
>
> As you didn't come forward with anything, here's my first
> take on this:
>
> Commit d5dea7d95c48d7bc951cee4910a7fd9c0cd26fb0 ("PCI: msi: Disable msi
> interrupts when we initialize a pci device") caused MSI to get disabled
> on Xen Dom0 despite it having got turned on purposefully by the
> hypervisor. As an immediate band aid, suppress the disabling in this
> specific case.
>
> I don't think, however, that either the original change or this fix are
> really appropriate. The original fix, besides leaving aside the
> presence of a hypervisor like Xen, doesn't deal with all possible
> cases (in particular it has no effect if the secondary kernel was built
> with CONFIG_PCI_MSI unset). And taking into account a hypervisor like
> Xen, the logic like this should probably be skipped altogether (i.e. it
> should be entirely left to the hypervisor, being the entity in control
> of IRQs).

The original fix is fine. Xen dom0 remains insane.  Although perhaps
some better than Xen dom0 once was.

Why does the dom0 kernel even get any access to pci devices that
Xen doesn't want it to touch?

As far as I can tell my patch has revealed a design bug in Xen.  If you
don't want to be messed up by the kernel don't let the kernel touch
things.  At the very least we need an abstraction much cleaner
than the patch below.

Eric


> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Cc: stable@kernel.org 
>
> ---
>  drivers/pci/msi.c       |    7 +++++++
>  include/linux/pci_ids.h |    1 +
>  2 files changed, 8 insertions(+)
>
> --- 3.5-rc3/drivers/pci/msi.c
> +++ 3.5-rc3-xen-pci-msi-no-iommu-disable/drivers/pci/msi.c
> @@ -20,6 +20,7 @@
>  #include <linux/errno.h>
>  #include <linux/io.h>
>  #include <linux/slab.h>
> +#include <xen/xen.h>
>  
>  #include "pci.h"
>  #include "msi.h"
> @@ -1022,7 +1023,13 @@ void pci_msi_init_pci_dev(struct pci_dev
>  	/* Disable the msi hardware to avoid screaming interrupts
>  	 * during boot.  This is the power on reset default so
>  	 * usually this should be a noop.
> +	 * But on a Xen host don't do this for IOMMUs which the hypervisor
> +	 * is in control of (and hence has already enabled on purpose).
>  	 */
> +	if (xen_initial_domain()
> +	    && (dev->class >> 8) == PCI_CLASS_SYSTEM_IOMMU
> +	    && dev->vendor == PCI_VENDOR_ID_AMD)
> +		return;
>  	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
>  	if (pos)
>  		msi_set_enable(dev, pos, 0);
> --- 3.5-rc3/include/linux/pci_ids.h
> +++ 3.5-rc3-xen-pci-msi-no-iommu-disable/include/linux/pci_ids.h
> @@ -75,6 +75,7 @@
>  #define PCI_CLASS_SYSTEM_RTC		0x0803
>  #define PCI_CLASS_SYSTEM_PCI_HOTPLUG	0x0804
>  #define PCI_CLASS_SYSTEM_SDHCI		0x0805
> +#define PCI_CLASS_SYSTEM_IOMMU		0x0806
>  #define PCI_CLASS_SYSTEM_OTHER		0x0880
>  
>  #define PCI_BASE_CLASS_INPUT		0x09

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
@ 2012-06-21 11:08                 ` Eric W. Biederman
  0 siblings, 0 replies; 33+ messages in thread
From: Eric W. Biederman @ 2012-06-21 11:08 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Wang, Jesse Barnes, Sherry Hurwitz, Andrew Cooper,
	Jeremy Fitzhardinge, stable, xen-devel, Konrad Rzeszutek Wilk,
	linux-pci

"Jan Beulich" <JBeulich@suse.com> writes:

>>>> On 14.06.12 at 17:15, Wei Wang <wei.wang2@amd.com> wrote:
>> Am 14.06.2012 16:18, schrieb Jan Beulich:
>>> Have you at all considered getting this fixed on the kernel side?
>>> As I don't have direct access to any AMD IOMMU capable
>>> system - can one, other than by enumerating the respective
>>> PCI IDs or reading ACPI tables, reasonably easily identify the
>>> devices in question (e.g. via vendor/class/sub-class or some
>>> such)? That might permit skipping those in the offending kernel
>>> code...
>> 
>> AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral) 
>> and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this 
>> should be enough to identify amd iommu device. I could send you a kernel 
>> patch for review using this approach. I would believe that fixing this 
>> issue in 4.2, no matter how, is really important for amd iommu.
>
> As you didn't come forward with anything, here's my first
> take on this:
>
> Commit d5dea7d95c48d7bc951cee4910a7fd9c0cd26fb0 ("PCI: msi: Disable msi
> interrupts when we initialize a pci device") caused MSI to get disabled
> on Xen Dom0 despite it having got turned on purposefully by the
> hypervisor. As an immediate band aid, suppress the disabling in this
> specific case.
>
> I don't think, however, that either the original change or this fix are
> really appropriate. The original fix, besides leaving aside the
> presence of a hypervisor like Xen, doesn't deal with all possible
> cases (in particular it has no effect if the secondary kernel was built
> with CONFIG_PCI_MSI unset). And taking into account a hypervisor like
> Xen, the logic like this should probably be skipped altogether (i.e. it
> should be entirely left to the hypervisor, being the entity in control
> of IRQs).

The original fix is fine. Xen dom0 remains insane.  Although perhaps
some better than Xen dom0 once was.

Why does the dom0 kernel even get any access to pci devices that
Xen doesn't want it to touch?

As far as I can tell my patch has revealed a design bug in Xen.  If you
don't want to be messed up by the kernel don't let the kernel touch
things.  At the very least we need an abstraction much cleaner
than the patch below.

Eric


> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Cc: stable@kernel.org 
>
> ---
>  drivers/pci/msi.c       |    7 +++++++
>  include/linux/pci_ids.h |    1 +
>  2 files changed, 8 insertions(+)
>
> --- 3.5-rc3/drivers/pci/msi.c
> +++ 3.5-rc3-xen-pci-msi-no-iommu-disable/drivers/pci/msi.c
> @@ -20,6 +20,7 @@
>  #include <linux/errno.h>
>  #include <linux/io.h>
>  #include <linux/slab.h>
> +#include <xen/xen.h>
>  
>  #include "pci.h"
>  #include "msi.h"
> @@ -1022,7 +1023,13 @@ void pci_msi_init_pci_dev(struct pci_dev
>  	/* Disable the msi hardware to avoid screaming interrupts
>  	 * during boot.  This is the power on reset default so
>  	 * usually this should be a noop.
> +	 * But on a Xen host don't do this for IOMMUs which the hypervisor
> +	 * is in control of (and hence has already enabled on purpose).
>  	 */
> +	if (xen_initial_domain()
> +	    && (dev->class >> 8) == PCI_CLASS_SYSTEM_IOMMU
> +	    && dev->vendor == PCI_VENDOR_ID_AMD)
> +		return;
>  	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
>  	if (pos)
>  		msi_set_enable(dev, pos, 0);
> --- 3.5-rc3/include/linux/pci_ids.h
> +++ 3.5-rc3-xen-pci-msi-no-iommu-disable/include/linux/pci_ids.h
> @@ -75,6 +75,7 @@
>  #define PCI_CLASS_SYSTEM_RTC		0x0803
>  #define PCI_CLASS_SYSTEM_PCI_HOTPLUG	0x0804
>  #define PCI_CLASS_SYSTEM_SDHCI		0x0805
> +#define PCI_CLASS_SYSTEM_IOMMU		0x0806
>  #define PCI_CLASS_SYSTEM_OTHER		0x0880
>  
>  #define PCI_BASE_CLASS_INPUT		0x09

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
  2012-06-21  9:59               ` Jan Beulich
@ 2012-06-21 11:21                 ` Wei Wang
  -1 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2012-06-21 11:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Jesse Barnes, ebiederm, Sherry Hurwitz, Andrew Cooper,
	Jeremy Fitzhardinge, stable, xen-devel, Konrad Rzeszutek Wilk,
	linux-pci

On 06/21/2012 11:59 AM, Jan Beulich wrote:
>>>> On 14.06.12 at 17:15, Wei Wang<wei.wang2@amd.com>  wrote:
>> Am 14.06.2012 16:18, schrieb Jan Beulich:
>>> Have you at all considered getting this fixed on the kernel side?
>>> As I don't have direct access to any AMD IOMMU capable
>>> system - can one, other than by enumerating the respective
>>> PCI IDs or reading ACPI tables, reasonably easily identify the
>>> devices in question (e.g. via vendor/class/sub-class or some
>>> such)? That might permit skipping those in the offending kernel
>>> code...
>>
>> AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral)
>> and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this
>> should be enough to identify amd iommu device. I could send you a kernel
>> patch for review using this approach. I would believe that fixing this
>> issue in 4.2, no matter how, is really important for amd iommu.
>
> As you didn't come forward with anything, here's my first
> take on this:

Hi Jan
Thanks a lot for the patch. Actually I plan to send my version today, 
which is based on 3.4 pv_ops but looks very similar to yours. So, Acked!

I also evaluated the possibility of hiding iommu device from dom0. I 
think the change is no quite a lot, at least, for io based pcicfg 
access. A proof-of-concept patch is attached.

Thanks,
Wei

diff -r baa85434d0ec xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c      Thu Jun 21 11:30:59 2012 +0200
+++ b/xen/arch/x86/traps.c      Thu Jun 21 13:19:02 2012 +0200
@@ -73,6 +73,7 @@
  #include <asm/hpet.h>
  #include <public/arch-x86/cpuid.h>
  #include <xsm/xsm.h>
+#include <asm/hvm/svm/amd-iommu-proto.h>

  /*
   * opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
@@ -1686,10 +1687,19 @@ static int pci_cfg_ok(struct domain *d,
  {
      uint32_t machine_bdf;
      uint16_t start, end;
+    struct amd_iommu *iommu;
+
      if (!IS_PRIV(d))
          return 0;

      machine_bdf = (d->arch.pci_cf8 >> 8) & 0xFFFF;
+
+    for_each_amd_iommu ( iommu )
+    {
+        if ( machine_bdf == iommu->bdf )
+            return 0;
+    }
+
      start = d->arch.pci_cf8 & 0xFF;
      end = start + size - 1;
      if (xsm_pci_config_permission(d, machine_bdf, start, end, write))

>
> Commit d5dea7d95c48d7bc951cee4910a7fd9c0cd26fb0 ("PCI: msi: Disable msi
> interrupts when we initialize a pci device") caused MSI to get disabled
> on Xen Dom0 despite it having got turned on purposefully by the
> hypervisor. As an immediate band aid, suppress the disabling in this
> specific case.
>
> I don't think, however, that either the original change or this fix are
> really appropriate. The original fix, besides leaving aside the
> presence of a hypervisor like Xen, doesn't deal with all possible
> cases (in particular it has no effect if the secondary kernel was built
> with CONFIG_PCI_MSI unset). And taking into account a hypervisor like
> Xen, the logic like this should probably be skipped altogether (i.e. it
> should be entirely left to the hypervisor, being the entity in control
> of IRQs).
>
> Signed-off-by: Jan Beulich<jbeulich@suse.com>
> Cc: stable@kernel.org
>
> ---
>   drivers/pci/msi.c       |    7 +++++++
>   include/linux/pci_ids.h |    1 +
>   2 files changed, 8 insertions(+)
>
> --- 3.5-rc3/drivers/pci/msi.c
> +++ 3.5-rc3-xen-pci-msi-no-iommu-disable/drivers/pci/msi.c
> @@ -20,6 +20,7 @@
>   #include<linux/errno.h>
>   #include<linux/io.h>
>   #include<linux/slab.h>
> +#include<xen/xen.h>
>
>   #include "pci.h"
>   #include "msi.h"
> @@ -1022,7 +1023,13 @@ void pci_msi_init_pci_dev(struct pci_dev
>   	/* Disable the msi hardware to avoid screaming interrupts
>   	 * during boot.  This is the power on reset default so
>   	 * usually this should be a noop.
> +	 * But on a Xen host don't do this for IOMMUs which the hypervisor
> +	 * is in control of (and hence has already enabled on purpose).
>   	 */
> +	if (xen_initial_domain()
> +	&&  (dev->class>>  8) == PCI_CLASS_SYSTEM_IOMMU
> +	&&  dev->vendor == PCI_VENDOR_ID_AMD)
> +		return;
>   	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
>   	if (pos)
>   		msi_set_enable(dev, pos, 0);
> --- 3.5-rc3/include/linux/pci_ids.h
> +++ 3.5-rc3-xen-pci-msi-no-iommu-disable/include/linux/pci_ids.h
> @@ -75,6 +75,7 @@
>   #define PCI_CLASS_SYSTEM_RTC		0x0803
>   #define PCI_CLASS_SYSTEM_PCI_HOTPLUG	0x0804
>   #define PCI_CLASS_SYSTEM_SDHCI		0x0805
> +#define PCI_CLASS_SYSTEM_IOMMU		0x0806
>   #define PCI_CLASS_SYSTEM_OTHER		0x0880
>
>   #define PCI_BASE_CLASS_INPUT		0x09
>
>
>
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
@ 2012-06-21 11:21                 ` Wei Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2012-06-21 11:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Jesse Barnes, ebiederm, Sherry Hurwitz, Andrew Cooper,
	Jeremy Fitzhardinge, stable, xen-devel, Konrad Rzeszutek Wilk,
	linux-pci

On 06/21/2012 11:59 AM, Jan Beulich wrote:
>>>> On 14.06.12 at 17:15, Wei Wang<wei.wang2@amd.com>  wrote:
>> Am 14.06.2012 16:18, schrieb Jan Beulich:
>>> Have you at all considered getting this fixed on the kernel side?
>>> As I don't have direct access to any AMD IOMMU capable
>>> system - can one, other than by enumerating the respective
>>> PCI IDs or reading ACPI tables, reasonably easily identify the
>>> devices in question (e.g. via vendor/class/sub-class or some
>>> such)? That might permit skipping those in the offending kernel
>>> code...
>>
>> AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral)
>> and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this
>> should be enough to identify amd iommu device. I could send you a kernel
>> patch for review using this approach. I would believe that fixing this
>> issue in 4.2, no matter how, is really important for amd iommu.
>
> As you didn't come forward with anything, here's my first
> take on this:

Hi Jan
Thanks a lot for the patch. Actually I plan to send my version today, 
which is based on 3.4 pv_ops but looks very similar to yours. So, Acked!

I also evaluated the possibility of hiding iommu device from dom0. I 
think the change is no quite a lot, at least, for io based pcicfg 
access. A proof-of-concept patch is attached.

Thanks,
Wei

diff -r baa85434d0ec xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c      Thu Jun 21 11:30:59 2012 +0200
+++ b/xen/arch/x86/traps.c      Thu Jun 21 13:19:02 2012 +0200
@@ -73,6 +73,7 @@
  #include <asm/hpet.h>
  #include <public/arch-x86/cpuid.h>
  #include <xsm/xsm.h>
+#include <asm/hvm/svm/amd-iommu-proto.h>

  /*
   * opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
@@ -1686,10 +1687,19 @@ static int pci_cfg_ok(struct domain *d,
  {
      uint32_t machine_bdf;
      uint16_t start, end;
+    struct amd_iommu *iommu;
+
      if (!IS_PRIV(d))
          return 0;

      machine_bdf = (d->arch.pci_cf8 >> 8) & 0xFFFF;
+
+    for_each_amd_iommu ( iommu )
+    {
+        if ( machine_bdf == iommu->bdf )
+            return 0;
+    }
+
      start = d->arch.pci_cf8 & 0xFF;
      end = start + size - 1;
      if (xsm_pci_config_permission(d, machine_bdf, start, end, write))

>
> Commit d5dea7d95c48d7bc951cee4910a7fd9c0cd26fb0 ("PCI: msi: Disable msi
> interrupts when we initialize a pci device") caused MSI to get disabled
> on Xen Dom0 despite it having got turned on purposefully by the
> hypervisor. As an immediate band aid, suppress the disabling in this
> specific case.
>
> I don't think, however, that either the original change or this fix are
> really appropriate. The original fix, besides leaving aside the
> presence of a hypervisor like Xen, doesn't deal with all possible
> cases (in particular it has no effect if the secondary kernel was built
> with CONFIG_PCI_MSI unset). And taking into account a hypervisor like
> Xen, the logic like this should probably be skipped altogether (i.e. it
> should be entirely left to the hypervisor, being the entity in control
> of IRQs).
>
> Signed-off-by: Jan Beulich<jbeulich@suse.com>
> Cc: stable@kernel.org
>
> ---
>   drivers/pci/msi.c       |    7 +++++++
>   include/linux/pci_ids.h |    1 +
>   2 files changed, 8 insertions(+)
>
> --- 3.5-rc3/drivers/pci/msi.c
> +++ 3.5-rc3-xen-pci-msi-no-iommu-disable/drivers/pci/msi.c
> @@ -20,6 +20,7 @@
>   #include<linux/errno.h>
>   #include<linux/io.h>
>   #include<linux/slab.h>
> +#include<xen/xen.h>
>
>   #include "pci.h"
>   #include "msi.h"
> @@ -1022,7 +1023,13 @@ void pci_msi_init_pci_dev(struct pci_dev
>   	/* Disable the msi hardware to avoid screaming interrupts
>   	 * during boot.  This is the power on reset default so
>   	 * usually this should be a noop.
> +	 * But on a Xen host don't do this for IOMMUs which the hypervisor
> +	 * is in control of (and hence has already enabled on purpose).
>   	 */
> +	if (xen_initial_domain()
> +	&&  (dev->class>>  8) == PCI_CLASS_SYSTEM_IOMMU
> +	&&  dev->vendor == PCI_VENDOR_ID_AMD)
> +		return;
>   	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
>   	if (pos)
>   		msi_set_enable(dev, pos, 0);
> --- 3.5-rc3/include/linux/pci_ids.h
> +++ 3.5-rc3-xen-pci-msi-no-iommu-disable/include/linux/pci_ids.h
> @@ -75,6 +75,7 @@
>   #define PCI_CLASS_SYSTEM_RTC		0x0803
>   #define PCI_CLASS_SYSTEM_PCI_HOTPLUG	0x0804
>   #define PCI_CLASS_SYSTEM_SDHCI		0x0805
> +#define PCI_CLASS_SYSTEM_IOMMU		0x0806
>   #define PCI_CLASS_SYSTEM_OTHER		0x0880
>
>   #define PCI_BASE_CLASS_INPUT		0x09
>
>
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
  2012-06-21 11:21                 ` Wei Wang
@ 2012-06-21 12:06                   ` Jan Beulich
  -1 siblings, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2012-06-21 12:06 UTC (permalink / raw)
  To: Wei Wang
  Cc: SherryHurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, KonradRzeszutek Wilk, linux-pci, Jesse Barnes,
	ebiederm

>>> On 21.06.12 at 13:21, Wei Wang <wei.wang2@amd.com> wrote:
> On 06/21/2012 11:59 AM, Jan Beulich wrote:
>>>>> On 14.06.12 at 17:15, Wei Wang<wei.wang2@amd.com>  wrote:
>>> Am 14.06.2012 16:18, schrieb Jan Beulich:
>>>> Have you at all considered getting this fixed on the kernel side?
>>>> As I don't have direct access to any AMD IOMMU capable
>>>> system - can one, other than by enumerating the respective
>>>> PCI IDs or reading ACPI tables, reasonably easily identify the
>>>> devices in question (e.g. via vendor/class/sub-class or some
>>>> such)? That might permit skipping those in the offending kernel
>>>> code...
>>>
>>> AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral)
>>> and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this
>>> should be enough to identify amd iommu device. I could send you a kernel
>>> patch for review using this approach. I would believe that fixing this
>>> issue in 4.2, no matter how, is really important for amd iommu.
>>
>> As you didn't come forward with anything, here's my first
>> take on this:
> 
> Hi Jan
> Thanks a lot for the patch. Actually I plan to send my version today, 
> which is based on 3.4 pv_ops but looks very similar to yours. So, Acked!
> 
> I also evaluated the possibility of hiding iommu device from dom0. I 
> think the change is no quite a lot, at least, for io based pcicfg 
> access. A proof-of-concept patch is attached.

This completely hides the device from Dom0, but only when
config space is accessed via method 1. Did you not see my
earlier patch doing this for MCFG as well (albeit only disallowing
writes, so allowing the device to still be seen by Dom0)?

Whether completely hiding the device is actually okay I can't
easily tell: Would IOMMUs always be either at func 0 of a single-
unction device, or at a non-zero func of a multi-function one? If
not, other devices may get hidden implicitly.

Also I noticed just now that guest_io_read() wouldn't really
behave correctly when pci_cfg_ok() returned false - it might
pass back 0xff even for a multi-byte read. I'll send a fix shortly.

Jan

> --- a/xen/arch/x86/traps.c      Thu Jun 21 11:30:59 2012 +0200
> +++ b/xen/arch/x86/traps.c      Thu Jun 21 13:19:02 2012 +0200
> @@ -73,6 +73,7 @@
>   #include <asm/hpet.h>
>   #include <public/arch-x86/cpuid.h>
>   #include <xsm/xsm.h>
> +#include <asm/hvm/svm/amd-iommu-proto.h>
> 
>   /*
>    * opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
> @@ -1686,10 +1687,19 @@ static int pci_cfg_ok(struct domain *d,
>   {
>       uint32_t machine_bdf;
>       uint16_t start, end;
> +    struct amd_iommu *iommu;
> +
>       if (!IS_PRIV(d))
>           return 0;
> 
>       machine_bdf = (d->arch.pci_cf8 >> 8) & 0xFFFF;
> +
> +    for_each_amd_iommu ( iommu )
> +    {
> +        if ( machine_bdf == iommu->bdf )
> +            return 0;
> +    }
> +
>       start = d->arch.pci_cf8 & 0xFF;
>       end = start + size - 1;
>       if (xsm_pci_config_permission(d, machine_bdf, start, end, write))



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
@ 2012-06-21 12:06                   ` Jan Beulich
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2012-06-21 12:06 UTC (permalink / raw)
  To: Wei Wang
  Cc: SherryHurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, KonradRzeszutek Wilk, linux-pci, Jesse Barnes,
	ebiederm

>>> On 21.06.12 at 13:21, Wei Wang <wei.wang2@amd.com> wrote:
> On 06/21/2012 11:59 AM, Jan Beulich wrote:
>>>>> On 14.06.12 at 17:15, Wei Wang<wei.wang2@amd.com>  wrote:
>>> Am 14.06.2012 16:18, schrieb Jan Beulich:
>>>> Have you at all considered getting this fixed on the kernel side?
>>>> As I don't have direct access to any AMD IOMMU capable
>>>> system - can one, other than by enumerating the respective
>>>> PCI IDs or reading ACPI tables, reasonably easily identify the
>>>> devices in question (e.g. via vendor/class/sub-class or some
>>>> such)? That might permit skipping those in the offending kernel
>>>> code...
>>>
>>> AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral)
>>> and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this
>>> should be enough to identify amd iommu device. I could send you a kernel
>>> patch for review using this approach. I would believe that fixing this
>>> issue in 4.2, no matter how, is really important for amd iommu.
>>
>> As you didn't come forward with anything, here's my first
>> take on this:
> 
> Hi Jan
> Thanks a lot for the patch. Actually I plan to send my version today, 
> which is based on 3.4 pv_ops but looks very similar to yours. So, Acked!
> 
> I also evaluated the possibility of hiding iommu device from dom0. I 
> think the change is no quite a lot, at least, for io based pcicfg 
> access. A proof-of-concept patch is attached.

This completely hides the device from Dom0, but only when
config space is accessed via method 1. Did you not see my
earlier patch doing this for MCFG as well (albeit only disallowing
writes, so allowing the device to still be seen by Dom0)?

Whether completely hiding the device is actually okay I can't
easily tell: Would IOMMUs always be either at func 0 of a single-
unction device, or at a non-zero func of a multi-function one? If
not, other devices may get hidden implicitly.

Also I noticed just now that guest_io_read() wouldn't really
behave correctly when pci_cfg_ok() returned false - it might
pass back 0xff even for a multi-byte read. I'll send a fix shortly.

Jan

> --- a/xen/arch/x86/traps.c      Thu Jun 21 11:30:59 2012 +0200
> +++ b/xen/arch/x86/traps.c      Thu Jun 21 13:19:02 2012 +0200
> @@ -73,6 +73,7 @@
>   #include <asm/hpet.h>
>   #include <public/arch-x86/cpuid.h>
>   #include <xsm/xsm.h>
> +#include <asm/hvm/svm/amd-iommu-proto.h>
> 
>   /*
>    * opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
> @@ -1686,10 +1687,19 @@ static int pci_cfg_ok(struct domain *d,
>   {
>       uint32_t machine_bdf;
>       uint16_t start, end;
> +    struct amd_iommu *iommu;
> +
>       if (!IS_PRIV(d))
>           return 0;
> 
>       machine_bdf = (d->arch.pci_cf8 >> 8) & 0xFFFF;
> +
> +    for_each_amd_iommu ( iommu )
> +    {
> +        if ( machine_bdf == iommu->bdf )
> +            return 0;
> +    }
> +
>       start = d->arch.pci_cf8 & 0xFF;
>       end = start + size - 1;
>       if (xsm_pci_config_permission(d, machine_bdf, start, end, write))

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
  2012-06-21 12:06                   ` Jan Beulich
@ 2012-06-21 12:28                     ` Wei Wang
  -1 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2012-06-21 12:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: SherryHurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, KonradRzeszutek Wilk, linux-pci, Jesse Barnes,
	ebiederm

On 06/21/2012 02:06 PM, Jan Beulich wrote:
>>>> On 21.06.12 at 13:21, Wei Wang<wei.wang2@amd.com>  wrote:
>> On 06/21/2012 11:59 AM, Jan Beulich wrote:
>>>>>> On 14.06.12 at 17:15, Wei Wang<wei.wang2@amd.com>   wrote:
>>>> Am 14.06.2012 16:18, schrieb Jan Beulich:
>>>>> Have you at all considered getting this fixed on the kernel side?
>>>>> As I don't have direct access to any AMD IOMMU capable
>>>>> system - can one, other than by enumerating the respective
>>>>> PCI IDs or reading ACPI tables, reasonably easily identify the
>>>>> devices in question (e.g. via vendor/class/sub-class or some
>>>>> such)? That might permit skipping those in the offending kernel
>>>>> code...
>>>>
>>>> AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral)
>>>> and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this
>>>> should be enough to identify amd iommu device. I could send you a kernel
>>>> patch for review using this approach. I would believe that fixing this
>>>> issue in 4.2, no matter how, is really important for amd iommu.
>>>
>>> As you didn't come forward with anything, here's my first
>>> take on this:
>>
>> Hi Jan
>> Thanks a lot for the patch. Actually I plan to send my version today,
>> which is based on 3.4 pv_ops but looks very similar to yours. So, Acked!
>>
>> I also evaluated the possibility of hiding iommu device from dom0. I
>> think the change is no quite a lot, at least, for io based pcicfg
>> access. A proof-of-concept patch is attached.
>
> This completely hides the device from Dom0, but only when
> config space is accessed via method 1. Did you not see my
> earlier patch doing this for MCFG as well
Could you please provide a particular c/s number?... (I saw too many c/s 
might be related to this topic). so that I could work out a patch to 
support both i/o and mmcfg.

(albeit only disallowing
> writes, so allowing the device to still be seen by Dom0)?
Sounds better to me...this still allows user to check iommu status from 
lspci.

> Whether completely hiding the device is actually okay I can't
> easily tell: Would IOMMUs always be either at func 0 of a single-
> unction device, or at a non-zero func of a multi-function one? If
> not, other devices may get hidden implicitly.

AMD IOMMU is an independent pci-e endpoint, and this function will not 
be used for other purposes other than containing an iommu. So I don't 
see that iommu will share bdf value with other devices.

Thanks,
Wei

> Also I noticed just now that guest_io_read() wouldn't really
> behave correctly when pci_cfg_ok() returned false - it might
> pass back 0xff even for a multi-byte read. I'll send a fix shortly.
>
> Jan
>
>> --- a/xen/arch/x86/traps.c      Thu Jun 21 11:30:59 2012 +0200
>> +++ b/xen/arch/x86/traps.c      Thu Jun 21 13:19:02 2012 +0200
>> @@ -73,6 +73,7 @@
>>    #include<asm/hpet.h>
>>    #include<public/arch-x86/cpuid.h>
>>    #include<xsm/xsm.h>
>> +#include<asm/hvm/svm/amd-iommu-proto.h>
>>
>>    /*
>>     * opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
>> @@ -1686,10 +1687,19 @@ static int pci_cfg_ok(struct domain *d,
>>    {
>>        uint32_t machine_bdf;
>>        uint16_t start, end;
>> +    struct amd_iommu *iommu;
>> +
>>        if (!IS_PRIV(d))
>>            return 0;
>>
>>        machine_bdf = (d->arch.pci_cf8>>  8)&  0xFFFF;
>> +
>> +    for_each_amd_iommu ( iommu )
>> +    {
>> +        if ( machine_bdf == iommu->bdf )
>> +            return 0;
>> +    }
>> +
>>        start = d->arch.pci_cf8&  0xFF;
>>        end = start + size - 1;
>>        if (xsm_pci_config_permission(d, machine_bdf, start, end, write))
>
>
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
@ 2012-06-21 12:28                     ` Wei Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2012-06-21 12:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: SherryHurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, KonradRzeszutek Wilk, linux-pci, Jesse Barnes,
	ebiederm

On 06/21/2012 02:06 PM, Jan Beulich wrote:
>>>> On 21.06.12 at 13:21, Wei Wang<wei.wang2@amd.com>  wrote:
>> On 06/21/2012 11:59 AM, Jan Beulich wrote:
>>>>>> On 14.06.12 at 17:15, Wei Wang<wei.wang2@amd.com>   wrote:
>>>> Am 14.06.2012 16:18, schrieb Jan Beulich:
>>>>> Have you at all considered getting this fixed on the kernel side?
>>>>> As I don't have direct access to any AMD IOMMU capable
>>>>> system - can one, other than by enumerating the respective
>>>>> PCI IDs or reading ACPI tables, reasonably easily identify the
>>>>> devices in question (e.g. via vendor/class/sub-class or some
>>>>> such)? That might permit skipping those in the offending kernel
>>>>> code...
>>>>
>>>> AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral)
>>>> and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this
>>>> should be enough to identify amd iommu device. I could send you a kernel
>>>> patch for review using this approach. I would believe that fixing this
>>>> issue in 4.2, no matter how, is really important for amd iommu.
>>>
>>> As you didn't come forward with anything, here's my first
>>> take on this:
>>
>> Hi Jan
>> Thanks a lot for the patch. Actually I plan to send my version today,
>> which is based on 3.4 pv_ops but looks very similar to yours. So, Acked!
>>
>> I also evaluated the possibility of hiding iommu device from dom0. I
>> think the change is no quite a lot, at least, for io based pcicfg
>> access. A proof-of-concept patch is attached.
>
> This completely hides the device from Dom0, but only when
> config space is accessed via method 1. Did you not see my
> earlier patch doing this for MCFG as well
Could you please provide a particular c/s number?... (I saw too many c/s 
might be related to this topic). so that I could work out a patch to 
support both i/o and mmcfg.

(albeit only disallowing
> writes, so allowing the device to still be seen by Dom0)?
Sounds better to me...this still allows user to check iommu status from 
lspci.

> Whether completely hiding the device is actually okay I can't
> easily tell: Would IOMMUs always be either at func 0 of a single-
> unction device, or at a non-zero func of a multi-function one? If
> not, other devices may get hidden implicitly.

AMD IOMMU is an independent pci-e endpoint, and this function will not 
be used for other purposes other than containing an iommu. So I don't 
see that iommu will share bdf value with other devices.

Thanks,
Wei

> Also I noticed just now that guest_io_read() wouldn't really
> behave correctly when pci_cfg_ok() returned false - it might
> pass back 0xff even for a multi-byte read. I'll send a fix shortly.
>
> Jan
>
>> --- a/xen/arch/x86/traps.c      Thu Jun 21 11:30:59 2012 +0200
>> +++ b/xen/arch/x86/traps.c      Thu Jun 21 13:19:02 2012 +0200
>> @@ -73,6 +73,7 @@
>>    #include<asm/hpet.h>
>>    #include<public/arch-x86/cpuid.h>
>>    #include<xsm/xsm.h>
>> +#include<asm/hvm/svm/amd-iommu-proto.h>
>>
>>    /*
>>     * opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
>> @@ -1686,10 +1687,19 @@ static int pci_cfg_ok(struct domain *d,
>>    {
>>        uint32_t machine_bdf;
>>        uint16_t start, end;
>> +    struct amd_iommu *iommu;
>> +
>>        if (!IS_PRIV(d))
>>            return 0;
>>
>>        machine_bdf = (d->arch.pci_cf8>>  8)&  0xFFFF;
>> +
>> +    for_each_amd_iommu ( iommu )
>> +    {
>> +        if ( machine_bdf == iommu->bdf )
>> +            return 0;
>> +    }
>> +
>>        start = d->arch.pci_cf8&  0xFF;
>>        end = start + size - 1;
>>        if (xsm_pci_config_permission(d, machine_bdf, start, end, write))
>
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
  2012-06-21 11:08                 ` Eric W. Biederman
@ 2012-06-21 12:28                   ` Jan Beulich
  -1 siblings, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2012-06-21 12:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Sherry Hurwitz, Wei Wang, Andrew Cooper, Jeremy Fitzhardinge,
	stable, xen-devel, Konrad Rzeszutek Wilk, linux-pci,
	Jesse Barnes

>>> On 21.06.12 at 13:08, Eric W. Biederman <ebiederm@xmission.com> wrote:
> "Jan Beulich" <JBeulich@suse.com> writes:
> 
>>>>> On 14.06.12 at 17:15, Wei Wang <wei.wang2@amd.com> wrote:
>>> Am 14.06.2012 16:18, schrieb Jan Beulich:
>>>> Have you at all considered getting this fixed on the kernel side?
>>>> As I don't have direct access to any AMD IOMMU capable
>>>> system - can one, other than by enumerating the respective
>>>> PCI IDs or reading ACPI tables, reasonably easily identify the
>>>> devices in question (e.g. via vendor/class/sub-class or some
>>>> such)? That might permit skipping those in the offending kernel
>>>> code...
>>> 
>>> AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral) 
>>> and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this 
>>> should be enough to identify amd iommu device. I could send you a kernel 
>>> patch for review using this approach. I would believe that fixing this 
>>> issue in 4.2, no matter how, is really important for amd iommu.
>>
>> As you didn't come forward with anything, here's my first
>> take on this:
>>
>> Commit d5dea7d95c48d7bc951cee4910a7fd9c0cd26fb0 ("PCI: msi: Disable msi
>> interrupts when we initialize a pci device") caused MSI to get disabled
>> on Xen Dom0 despite it having got turned on purposefully by the
>> hypervisor. As an immediate band aid, suppress the disabling in this
>> specific case.
>>
>> I don't think, however, that either the original change or this fix are
>> really appropriate. The original fix, besides leaving aside the
>> presence of a hypervisor like Xen, doesn't deal with all possible
>> cases (in particular it has no effect if the secondary kernel was built
>> with CONFIG_PCI_MSI unset). And taking into account a hypervisor like
>> Xen, the logic like this should probably be skipped altogether (i.e. it
>> should be entirely left to the hypervisor, being the entity in control
>> of IRQs).
> 
> The original fix is fine. Xen dom0 remains insane.  Although perhaps
> some better than Xen dom0 once was.
> 
> Why does the dom0 kernel even get any access to pci devices that
> Xen doesn't want it to touch?
> 
> As far as I can tell my patch has revealed a design bug in Xen.  If you
> don't want to be messed up by the kernel don't let the kernel touch
> things.

I rather think this is a design bug of the AMD IOMMU - it should
never have got surfaced as a PCI device.

Furthermore, fully hiding _any_ PCI device from Dom0 has its
own set of problems - depending on the nature of the device,
full emulation of all devices may become necessary to get this
implemented (because this can implicitly hide other devices, or
the Dom0 kernel re-assigning BARs could get in conflict with
what the hidden devices use).

Xen's model has always been to allow Dom0 full control over all
peripherals and bridges, so if all of the sudden PCI devices of
other kinds show up, we can't simply re-model everything.

>  At the very least we need an abstraction much cleaner
> than the patch below.

Probably - any suggestion? As said in the mail I sent, this is
meant to overcome the immediate problem, and a more
permanent fix would be desirable (ideally suppressing the
playing with the MSI related data altogether, following the
rest of the MSI support in Xen/Dom0).

Jan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
@ 2012-06-21 12:28                   ` Jan Beulich
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2012-06-21 12:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Sherry Hurwitz, Wei Wang, Andrew Cooper, Jeremy Fitzhardinge,
	stable, xen-devel, Konrad Rzeszutek Wilk, linux-pci,
	Jesse Barnes

>>> On 21.06.12 at 13:08, Eric W. Biederman <ebiederm@xmission.com> wrote:
> "Jan Beulich" <JBeulich@suse.com> writes:
> 
>>>>> On 14.06.12 at 17:15, Wei Wang <wei.wang2@amd.com> wrote:
>>> Am 14.06.2012 16:18, schrieb Jan Beulich:
>>>> Have you at all considered getting this fixed on the kernel side?
>>>> As I don't have direct access to any AMD IOMMU capable
>>>> system - can one, other than by enumerating the respective
>>>> PCI IDs or reading ACPI tables, reasonably easily identify the
>>>> devices in question (e.g. via vendor/class/sub-class or some
>>>> such)? That might permit skipping those in the offending kernel
>>>> code...
>>> 
>>> AMD IOMMUs (both v1 and v2) uses class id 08 (System Base Peripheral) 
>>> and sub class id 06 (IOMMU). Combined with PCI_VENDEOR_ID_AMD, this 
>>> should be enough to identify amd iommu device. I could send you a kernel 
>>> patch for review using this approach. I would believe that fixing this 
>>> issue in 4.2, no matter how, is really important for amd iommu.
>>
>> As you didn't come forward with anything, here's my first
>> take on this:
>>
>> Commit d5dea7d95c48d7bc951cee4910a7fd9c0cd26fb0 ("PCI: msi: Disable msi
>> interrupts when we initialize a pci device") caused MSI to get disabled
>> on Xen Dom0 despite it having got turned on purposefully by the
>> hypervisor. As an immediate band aid, suppress the disabling in this
>> specific case.
>>
>> I don't think, however, that either the original change or this fix are
>> really appropriate. The original fix, besides leaving aside the
>> presence of a hypervisor like Xen, doesn't deal with all possible
>> cases (in particular it has no effect if the secondary kernel was built
>> with CONFIG_PCI_MSI unset). And taking into account a hypervisor like
>> Xen, the logic like this should probably be skipped altogether (i.e. it
>> should be entirely left to the hypervisor, being the entity in control
>> of IRQs).
> 
> The original fix is fine. Xen dom0 remains insane.  Although perhaps
> some better than Xen dom0 once was.
> 
> Why does the dom0 kernel even get any access to pci devices that
> Xen doesn't want it to touch?
> 
> As far as I can tell my patch has revealed a design bug in Xen.  If you
> don't want to be messed up by the kernel don't let the kernel touch
> things.

I rather think this is a design bug of the AMD IOMMU - it should
never have got surfaced as a PCI device.

Furthermore, fully hiding _any_ PCI device from Dom0 has its
own set of problems - depending on the nature of the device,
full emulation of all devices may become necessary to get this
implemented (because this can implicitly hide other devices, or
the Dom0 kernel re-assigning BARs could get in conflict with
what the hidden devices use).

Xen's model has always been to allow Dom0 full control over all
peripherals and bridges, so if all of the sudden PCI devices of
other kinds show up, we can't simply re-model everything.

>  At the very least we need an abstraction much cleaner
> than the patch below.

Probably - any suggestion? As said in the mail I sent, this is
meant to overcome the immediate problem, and a more
permanent fix would be desirable (ideally suppressing the
playing with the MSI related data altogether, following the
rest of the MSI support in Xen/Dom0).

Jan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
  2012-06-21 12:28                     ` Wei Wang
@ 2012-06-21 12:45                       ` Jan Beulich
  -1 siblings, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2012-06-21 12:45 UTC (permalink / raw)
  To: Wei Wang
  Cc: SherryHurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, KonradRzeszutek Wilk, linux-pci, Jesse Barnes,
	ebiederm

>>> On 21.06.12 at 14:28, Wei Wang <wei.wang2@amd.com> wrote:
> On 06/21/2012 02:06 PM, Jan Beulich wrote:
>>>>> On 21.06.12 at 13:21, Wei Wang<wei.wang2@amd.com>  wrote:
>>> I also evaluated the possibility of hiding iommu device from dom0. I
>>> think the change is no quite a lot, at least, for io based pcicfg
>>> access. A proof-of-concept patch is attached.
>>
>> This completely hides the device from Dom0, but only when
>> config space is accessed via method 1. Did you not see my
>> earlier patch doing this for MCFG as well
> Could you please provide a particular c/s number?... (I saw too many c/s 
> might be related to this topic). so that I could work out a patch to 
> support both i/o and mmcfg.

I sent this to you yesterday, so you'd be able to test whether
it actually fulfills its purpose before we discuss whether this is
acceptable for 4.2. See
http://lists.xen.org/archives/html/xen-devel/2012-06/msg01129.html

> (albeit only disallowing
>> writes, so allowing the device to still be seen by Dom0)?
> Sounds better to me...this still allows user to check iommu status from 
> lspci.
> 
>> Whether completely hiding the device is actually okay I can't
>> easily tell: Would IOMMUs always be either at func 0 of a single-
>> unction device, or at a non-zero func of a multi-function one? If
>> not, other devices may get hidden implicitly.
> 
> AMD IOMMU is an independent pci-e endpoint, and this function will not 
> be used for other purposes other than containing an iommu. So I don't 
> see that iommu will share bdf value with other devices.

The question is not regarding bdf, but regarding whether under
the same seg:bus:dev there might be multiple functions, one of
which is the IOMMU, and if so, whether the IOMMU would be
guaranteed to have a non-zero function number.

Jan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
@ 2012-06-21 12:45                       ` Jan Beulich
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2012-06-21 12:45 UTC (permalink / raw)
  To: Wei Wang
  Cc: SherryHurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, KonradRzeszutek Wilk, linux-pci, Jesse Barnes,
	ebiederm

>>> On 21.06.12 at 14:28, Wei Wang <wei.wang2@amd.com> wrote:
> On 06/21/2012 02:06 PM, Jan Beulich wrote:
>>>>> On 21.06.12 at 13:21, Wei Wang<wei.wang2@amd.com>  wrote:
>>> I also evaluated the possibility of hiding iommu device from dom0. I
>>> think the change is no quite a lot, at least, for io based pcicfg
>>> access. A proof-of-concept patch is attached.
>>
>> This completely hides the device from Dom0, but only when
>> config space is accessed via method 1. Did you not see my
>> earlier patch doing this for MCFG as well
> Could you please provide a particular c/s number?... (I saw too many c/s 
> might be related to this topic). so that I could work out a patch to 
> support both i/o and mmcfg.

I sent this to you yesterday, so you'd be able to test whether
it actually fulfills its purpose before we discuss whether this is
acceptable for 4.2. See
http://lists.xen.org/archives/html/xen-devel/2012-06/msg01129.html

> (albeit only disallowing
>> writes, so allowing the device to still be seen by Dom0)?
> Sounds better to me...this still allows user to check iommu status from 
> lspci.
> 
>> Whether completely hiding the device is actually okay I can't
>> easily tell: Would IOMMUs always be either at func 0 of a single-
>> unction device, or at a non-zero func of a multi-function one? If
>> not, other devices may get hidden implicitly.
> 
> AMD IOMMU is an independent pci-e endpoint, and this function will not 
> be used for other purposes other than containing an iommu. So I don't 
> see that iommu will share bdf value with other devices.

The question is not regarding bdf, but regarding whether under
the same seg:bus:dev there might be multiple functions, one of
which is the IOMMU, and if so, whether the IOMMU would be
guaranteed to have a non-zero function number.

Jan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
  2012-06-21 12:45                       ` Jan Beulich
@ 2012-06-21 13:10                         ` Wei Wang
  -1 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2012-06-21 13:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: SherryHurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, KonradRzeszutek Wilk, linux-pci, Jesse Barnes,
	ebiederm

On 06/21/2012 02:45 PM, Jan Beulich wrote:
>>>> On 21.06.12 at 14:28, Wei Wang<wei.wang2@amd.com>  wrote:
>> On 06/21/2012 02:06 PM, Jan Beulich wrote:
>>>>>> On 21.06.12 at 13:21, Wei Wang<wei.wang2@amd.com>   wrote:
>>>> I also evaluated the possibility of hiding iommu device from dom0. I
>>>> think the change is no quite a lot, at least, for io based pcicfg
>>>> access. A proof-of-concept patch is attached.
>>>
>>> This completely hides the device from Dom0, but only when
>>> config space is accessed via method 1. Did you not see my
>>> earlier patch doing this for MCFG as well
>> Could you please provide a particular c/s number?... (I saw too many c/s
>> might be related to this topic). so that I could work out a patch to
>> support both i/o and mmcfg.
>
> I sent this to you yesterday, so you'd be able to test whether
> it actually fulfills its purpose before we discuss whether this is
> acceptable for 4.2. See
> http://lists.xen.org/archives/html/xen-devel/2012-06/msg01129.html

Oh, yes I found it, my email filter did not work well so I did not see 
it at the right folder. I will test right now.

>
>> (albeit only disallowing
>>> writes, so allowing the device to still be seen by Dom0)?
>> Sounds better to me...this still allows user to check iommu status from
>> lspci.
>>
>>> Whether completely hiding the device is actually okay I can't
>>> easily tell: Would IOMMUs always be either at func 0 of a single-
>>> unction device, or at a non-zero func of a multi-function one? If
>>> not, other devices may get hidden implicitly.
>>
>> AMD IOMMU is an independent pci-e endpoint, and this function will not
>> be used for other purposes other than containing an iommu. So I don't
>> see that iommu will share bdf value with other devices.
>
> The question is not regarding bdf, but regarding whether under
> the same seg:bus:dev there might be multiple functions, one of
> which is the IOMMU, and if so, whether the IOMMU would be
> guaranteed to have a non-zero function number.

In a real system (single or multiple iommu), amd iommu shares the same 
device number with north bridge but has function number 2.. (e.g 
bus:00.2) Howerver according to spec, it does not guaranteed to have 
non-zero function number. So what is the problem you see if iommu uses 
fun0 on a multi-func device?

Thanks,
Wei

> Jan
>
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
@ 2012-06-21 13:10                         ` Wei Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2012-06-21 13:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: SherryHurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, KonradRzeszutek Wilk, linux-pci, Jesse Barnes,
	ebiederm

On 06/21/2012 02:45 PM, Jan Beulich wrote:
>>>> On 21.06.12 at 14:28, Wei Wang<wei.wang2@amd.com>  wrote:
>> On 06/21/2012 02:06 PM, Jan Beulich wrote:
>>>>>> On 21.06.12 at 13:21, Wei Wang<wei.wang2@amd.com>   wrote:
>>>> I also evaluated the possibility of hiding iommu device from dom0. I
>>>> think the change is no quite a lot, at least, for io based pcicfg
>>>> access. A proof-of-concept patch is attached.
>>>
>>> This completely hides the device from Dom0, but only when
>>> config space is accessed via method 1. Did you not see my
>>> earlier patch doing this for MCFG as well
>> Could you please provide a particular c/s number?... (I saw too many c/s
>> might be related to this topic). so that I could work out a patch to
>> support both i/o and mmcfg.
>
> I sent this to you yesterday, so you'd be able to test whether
> it actually fulfills its purpose before we discuss whether this is
> acceptable for 4.2. See
> http://lists.xen.org/archives/html/xen-devel/2012-06/msg01129.html

Oh, yes I found it, my email filter did not work well so I did not see 
it at the right folder. I will test right now.

>
>> (albeit only disallowing
>>> writes, so allowing the device to still be seen by Dom0)?
>> Sounds better to me...this still allows user to check iommu status from
>> lspci.
>>
>>> Whether completely hiding the device is actually okay I can't
>>> easily tell: Would IOMMUs always be either at func 0 of a single-
>>> unction device, or at a non-zero func of a multi-function one? If
>>> not, other devices may get hidden implicitly.
>>
>> AMD IOMMU is an independent pci-e endpoint, and this function will not
>> be used for other purposes other than containing an iommu. So I don't
>> see that iommu will share bdf value with other devices.
>
> The question is not regarding bdf, but regarding whether under
> the same seg:bus:dev there might be multiple functions, one of
> which is the IOMMU, and if so, whether the IOMMU would be
> guaranteed to have a non-zero function number.

In a real system (single or multiple iommu), amd iommu shares the same 
device number with north bridge but has function number 2.. (e.g 
bus:00.2) Howerver according to spec, it does not guaranteed to have 
non-zero function number. So what is the problem you see if iommu uses 
fun0 on a multi-func device?

Thanks,
Wei

> Jan
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
  2012-06-21 13:10                         ` Wei Wang
@ 2012-06-21 13:24                           ` Jan Beulich
  -1 siblings, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2012-06-21 13:24 UTC (permalink / raw)
  To: Wei Wang
  Cc: SherryHurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, KonradRzeszutek Wilk, linux-pci, Jesse Barnes,
	ebiederm

>>> On 21.06.12 at 15:10, Wei Wang <wei.wang2@amd.com> wrote:
> On 06/21/2012 02:45 PM, Jan Beulich wrote:
>>>>> On 21.06.12 at 14:28, Wei Wang<wei.wang2@amd.com>  wrote:
>>> AMD IOMMU is an independent pci-e endpoint, and this function will not
>>> be used for other purposes other than containing an iommu. So I don't
>>> see that iommu will share bdf value with other devices.
>>
>> The question is not regarding bdf, but regarding whether under
>> the same seg:bus:dev there might be multiple functions, one of
>> which is the IOMMU, and if so, whether the IOMMU would be
>> guaranteed to have a non-zero function number.
> 
> In a real system (single or multiple iommu), amd iommu shares the same 
> device number with north bridge but has function number 2.. (e.g 
> bus:00.2) Howerver according to spec, it does not guaranteed to have 
> non-zero function number. So what is the problem you see if iommu uses 
> fun0 on a multi-func device?

If it's on func 0 and gets hidden completely (as done by your
partial patch), other functions won't be found when scanning
for them (because secondary functions get looked at only
when func 0 actually exists, as otherwise evaluating the header
type register is invalid).

Jan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
@ 2012-06-21 13:24                           ` Jan Beulich
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2012-06-21 13:24 UTC (permalink / raw)
  To: Wei Wang
  Cc: SherryHurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, KonradRzeszutek Wilk, linux-pci, Jesse Barnes,
	ebiederm

>>> On 21.06.12 at 15:10, Wei Wang <wei.wang2@amd.com> wrote:
> On 06/21/2012 02:45 PM, Jan Beulich wrote:
>>>>> On 21.06.12 at 14:28, Wei Wang<wei.wang2@amd.com>  wrote:
>>> AMD IOMMU is an independent pci-e endpoint, and this function will not
>>> be used for other purposes other than containing an iommu. So I don't
>>> see that iommu will share bdf value with other devices.
>>
>> The question is not regarding bdf, but regarding whether under
>> the same seg:bus:dev there might be multiple functions, one of
>> which is the IOMMU, and if so, whether the IOMMU would be
>> guaranteed to have a non-zero function number.
> 
> In a real system (single or multiple iommu), amd iommu shares the same 
> device number with north bridge but has function number 2.. (e.g 
> bus:00.2) Howerver according to spec, it does not guaranteed to have 
> non-zero function number. So what is the problem you see if iommu uses 
> fun0 on a multi-func device?

If it's on func 0 and gets hidden completely (as done by your
partial patch), other functions won't be found when scanning
for them (because secondary functions get looked at only
when func 0 actually exists, as otherwise evaluating the header
type register is invalid).

Jan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
  2012-06-21 13:24                           ` Jan Beulich
@ 2012-06-21 13:27                             ` Wei Wang
  -1 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2012-06-21 13:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: SherryHurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, KonradRzeszutek Wilk, linux-pci, Jesse Barnes,
	ebiederm

On 06/21/2012 03:24 PM, Jan Beulich wrote:
>>>> On 21.06.12 at 15:10, Wei Wang<wei.wang2@amd.com>  wrote:
>> On 06/21/2012 02:45 PM, Jan Beulich wrote:
>>>>>> On 21.06.12 at 14:28, Wei Wang<wei.wang2@amd.com>   wrote:
>>>> AMD IOMMU is an independent pci-e endpoint, and this function will not
>>>> be used for other purposes other than containing an iommu. So I don't
>>>> see that iommu will share bdf value with other devices.
>>>
>>> The question is not regarding bdf, but regarding whether under
>>> the same seg:bus:dev there might be multiple functions, one of
>>> which is the IOMMU, and if so, whether the IOMMU would be
>>> guaranteed to have a non-zero function number.
>>
>> In a real system (single or multiple iommu), amd iommu shares the same
>> device number with north bridge but has function number 2.. (e.g
>> bus:00.2) Howerver according to spec, it does not guaranteed to have
>> non-zero function number. So what is the problem you see if iommu uses
>> fun0 on a multi-func device?
>
> If it's on func 0 and gets hidden completely (as done by your
> partial patch), other functions won't be found when scanning
> for them (because secondary functions get looked at only
> when func 0 actually exists, as otherwise evaluating the header
> type register is invalid).

OK, understood. Then I think we do need to allow pci cfg read for iommu 
device.

Thanks
Wei

> Jan
>
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0
@ 2012-06-21 13:27                             ` Wei Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2012-06-21 13:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: SherryHurwitz, Andrew Cooper, Jeremy Fitzhardinge, stable,
	xen-devel, KonradRzeszutek Wilk, linux-pci, Jesse Barnes,
	ebiederm

On 06/21/2012 03:24 PM, Jan Beulich wrote:
>>>> On 21.06.12 at 15:10, Wei Wang<wei.wang2@amd.com>  wrote:
>> On 06/21/2012 02:45 PM, Jan Beulich wrote:
>>>>>> On 21.06.12 at 14:28, Wei Wang<wei.wang2@amd.com>   wrote:
>>>> AMD IOMMU is an independent pci-e endpoint, and this function will not
>>>> be used for other purposes other than containing an iommu. So I don't
>>>> see that iommu will share bdf value with other devices.
>>>
>>> The question is not regarding bdf, but regarding whether under
>>> the same seg:bus:dev there might be multiple functions, one of
>>> which is the IOMMU, and if so, whether the IOMMU would be
>>> guaranteed to have a non-zero function number.
>>
>> In a real system (single or multiple iommu), amd iommu shares the same
>> device number with north bridge but has function number 2.. (e.g
>> bus:00.2) Howerver according to spec, it does not guaranteed to have
>> non-zero function number. So what is the problem you see if iommu uses
>> fun0 on a multi-func device?
>
> If it's on func 0 and gets hidden completely (as done by your
> partial patch), other functions won't be found when scanning
> for them (because secondary functions get looked at only
> when func 0 actually exists, as otherwise evaluating the header
> type register is invalid).

OK, understood. Then I think we do need to allow pci cfg read for iommu 
device.

Thanks
Wei

> Jan
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it
  2012-06-20 15:45         ` [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it Jan Beulich
@ 2012-06-21 15:29           ` Wei Wang
  2012-06-21 15:49             ` Jan Beulich
  0 siblings, 1 reply; 33+ messages in thread
From: Wei Wang @ 2012-06-21 15:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk

On 06/20/2012 05:45 PM, Jan Beulich wrote:
  > Rather than submitting it for inclusion immediately, I'd rather see
> you first use it successfully. Below/attached what appears to
> work fine for me in contrived situations (i.e. hiding bridges with
> nothing connected, as I still don't have any AMD IOMMU system
> at hand).

Hi Jan,

The patch works for me. IOMMU msi Capability is shown as enabled with 
this patch.
Thanks,
Wei

00:00.2 Generic system peripheral [0806]: Advanced Micro Devices [AMD] 
Device 1419
         Subsystem: Advanced Micro Devices [AMD] Device 1419
         Flags: bus master, fast devsel, latency 0, IRQ 11
         Capabilities: [40] Secure device <?>
         Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
         Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+


> Jan
>
> Note that due to ptwr_do_page_fault() being run first, there'll be a
> MEM_LOG() issued for each such attempt. If that's undesirable, the
> order of the calls would need to be swapped.
>
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -1875,6 +1875,7 @@ static int mod_l1_entry(l1_pgentry_t *pl
>               break;
>           case 1:
>               l1e_remove_flags(nl1e, _PAGE_RW);
> +            rc = 0;
>               break;
>           }
>           if ( page )
> @@ -5208,6 +5209,97 @@ int ptwr_do_page_fault(struct vcpu *v, u
>       return 0;
>   }
>
> +#ifdef __x86_64__
> +/*************************
> + * fault handling for read-only MMIO pages
> + */
> +
> +struct mmio_ro_emulate_ctxt {
> +    struct x86_emulate_ctxt ctxt;
> +    unsigned long cr2;
> +};
> +
> +static int mmio_ro_emulated_read(
> +    enum x86_segment seg,
> +    unsigned long offset,
> +    void *p_data,
> +    unsigned int bytes,
> +    struct x86_emulate_ctxt *ctxt)
> +{
> +    return X86EMUL_UNHANDLEABLE;
> +}
> +
> +static int mmio_ro_emulated_write(
> +    enum x86_segment seg,
> +    unsigned long offset,
> +    void *p_data,
> +    unsigned int bytes,
> +    struct x86_emulate_ctxt *ctxt)
> +{
> +    struct mmio_ro_emulate_ctxt *mmio_ro_ctxt =
> +        container_of(ctxt, struct mmio_ro_emulate_ctxt, ctxt);
> +
> +    /* Only allow naturally-aligned stores at the original %cr2 address. */
> +    if ( ((bytes | offset)&  (bytes - 1)) || offset != mmio_ro_ctxt->cr2 )
> +    {
> +        MEM_LOG("mmio_ro_emulate: bad access (cr2=%lx, addr=%lx, bytes=%u)",
> +                mmio_ro_ctxt->cr2, offset, bytes);
> +        return X86EMUL_UNHANDLEABLE;
> +    }
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static const struct x86_emulate_ops mmio_ro_emulate_ops = {
> +    .read       = mmio_ro_emulated_read,
> +    .insn_fetch = ptwr_emulated_read,
> +    .write      = mmio_ro_emulated_write,
> +};
> +
> +/* Check if guest is trying to modify a r/o MMIO page. */
> +int mmio_ro_do_page_fault(struct vcpu *v, unsigned long addr,
> +                          struct cpu_user_regs *regs)
> +{
> +    l1_pgentry_t      pte;
> +    unsigned long     mfn;
> +    unsigned int      addr_size = is_pv_32on64_domain(v->domain) ?
> +                                  32 : BITS_PER_LONG;
> +    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = {
> +        .ctxt.regs = regs,
> +        .ctxt.addr_size = addr_size,
> +        .ctxt.sp_size = addr_size,
> +        .cr2 = addr
> +    };
> +    int rc;
> +
> +    /* Attempt to read the PTE that maps the VA being accessed. */
> +    guest_get_eff_l1e(v, addr,&pte);
> +
> +    /* We are looking only for read-only mappings of MMIO pages. */
> +    if ( ((l1e_get_flags(pte)&  (_PAGE_PRESENT|_PAGE_RW)) != _PAGE_PRESENT) )
> +        return 0;
> +
> +    mfn = l1e_get_pfn(pte);
> +    if ( mfn_valid(mfn) )
> +    {
> +        struct page_info *page = mfn_to_page(mfn);
> +        struct domain *owner = page_get_owner_and_reference(page);
> +
> +        if ( owner )
> +            put_page(page);
> +        if ( owner != dom_io )
> +            return 0;
> +    }
> +
> +    if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
> +        return 0;
> +
> +    rc = x86_emulate(&mmio_ro_ctxt.ctxt,&mmio_ro_emulate_ops);
> +
> +    return rc != X86EMUL_UNHANDLEABLE ? EXCRET_fault_fixed : 0;
> +}
> +#endif /* __x86_64__ */
> +
>   void free_xen_pagetable(void *v)
>   {
>       if ( system_state == SYS_STATE_early_boot )
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -1349,20 +1349,23 @@ static int fixup_page_fault(unsigned lon
>           return 0;
>       }
>
> -    if ( VM_ASSIST(d, VMASST_TYPE_writable_pagetables)&&
> -         guest_kernel_mode(v, regs) )
> -    {
> -        unsigned int mbs = PFEC_write_access;
> -        unsigned int mbz = PFEC_reserved_bit | PFEC_insn_fetch;
> -
> -        /* Do not check if access-protection fault since the page may
> -           legitimately be not present in shadow page tables */
> -        if ( !paging_mode_enabled(d) )
> -            mbs |= PFEC_page_present;
> -
> -        if ( ((regs->error_code&  (mbs | mbz)) == mbs)&&
> +    if ( guest_kernel_mode(v, regs)&&
> +         !(regs->error_code&  (PFEC_reserved_bit | PFEC_insn_fetch))&&
> +         (regs->error_code&  PFEC_write_access) )
> +    {
> +        if ( VM_ASSIST(d, VMASST_TYPE_writable_pagetables)&&
> +             /* Do not check if access-protection fault since the page may
> +                legitimately be not present in shadow page tables */
> +             (paging_mode_enabled(d) ||
> +              (regs->error_code&  PFEC_page_present))&&
>                ptwr_do_page_fault(v, addr, regs) )
>               return EXCRET_fault_fixed;
> +
> +#ifdef __x86_64__
> +        if ( IS_PRIV(d)&&  (regs->error_code&  PFEC_page_present)&&
> +             mmio_ro_do_page_fault(v, addr, regs) )
> +            return EXCRET_fault_fixed;
> +#endif
>       }
>
>       /* For non-external shadowed guests, we fix up both their own
> @@ -1690,6 +1693,13 @@ static int pci_cfg_ok(struct domain *d,
>           return 0;
>
>       machine_bdf = (d->arch.pci_cf8>>  8)&  0xFFFF;
> +    if ( write )
> +    {
> +        const unsigned long *ro_map = pci_get_ro_map(0);
> +
> +        if ( ro_map&&  test_bit(machine_bdf, ro_map) )
> +            return 0;
> +    }
>       start = d->arch.pci_cf8&  0xFF;
>       end = start + size - 1;
>       if (xsm_pci_config_permission(d, machine_bdf, start, end, write))
> --- a/xen/arch/x86/x86_32/pci.c
> +++ b/xen/arch/x86/x86_32/pci.c
> @@ -6,6 +6,7 @@
>
>   #include<xen/spinlock.h>
>   #include<xen/pci.h>
> +#include<xen/init.h>
>   #include<asm/io.h>
>
>   #define PCI_CONF_ADDRESS(bus, dev, func, reg) \
> @@ -70,3 +71,7 @@ void pci_conf_write32(
>       BUG_ON((bus>  255) || (dev>  31) || (func>  7) || (reg>  255));
>       pci_conf_write(PCI_CONF_ADDRESS(bus, dev, func, reg), 0, 4, data);
>   }
> +
> +void __init arch_pci_ro_device(int seg, int bdf)
> +{
> +}
> --- a/xen/arch/x86/x86_64/mmconfig_64.c
> +++ b/xen/arch/x86/x86_64/mmconfig_64.c
> @@ -145,9 +145,30 @@ static void __iomem *mcfg_ioremap(const
>       return (void __iomem *) virt;
>   }
>
> +void arch_pci_ro_device(int seg, int bdf)
> +{
> +    unsigned int idx, bus = PCI_BUS(bdf);
> +
> +    for (idx = 0; idx<  pci_mmcfg_config_num; ++idx) {
> +        const struct acpi_mcfg_allocation *cfg = pci_mmcfg_virt[idx].cfg;
> +        unsigned long mfn = (cfg->address>>  PAGE_SHIFT) + bdf;
> +
> +        if (!pci_mmcfg_virt[idx].virt || cfg->pci_segment != seg ||
> +            cfg->start_bus_number>  bus || cfg->end_bus_number<  bus)
> +            continue;
> +
> +        if (rangeset_add_singleton(mmio_ro_ranges, mfn))
> +            printk(XENLOG_ERR
> +                   "%04x:%02x:%02x.%u: could not mark MCFG (mfn %#lx) read-only\n",
> +                   cfg->pci_segment, bus, PCI_SLOT(bdf), PCI_FUNC(bdf),
> +                   mfn);
> +    }
> +}
> +
>   int pci_mmcfg_arch_enable(unsigned int idx)
>   {
>       const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
> +    const unsigned long *ro_map = pci_get_ro_map(cfg->pci_segment);
>
>       if (pci_mmcfg_virt[idx].virt)
>           return 0;
> @@ -159,6 +180,16 @@ int pci_mmcfg_arch_enable(unsigned int i
>       }
>       printk(KERN_INFO "PCI: Using MCFG for segment %04x bus %02x-%02x\n",
>              cfg->pci_segment, cfg->start_bus_number, cfg->end_bus_number);
> +    if (ro_map) {
> +        unsigned int bdf = PCI_BDF(cfg->start_bus_number, 0, 0);
> +        unsigned int end = PCI_BDF(cfg->end_bus_number, -1, -1);
> +
> +        while ((bdf = find_next_bit(ro_map, end + 1, bdf))<= end) {
> +            arch_pci_ro_device(cfg->pci_segment, bdf);
> +            if (bdf++ == end)
> +                break;
> +        }
> +    }
>       return 0;
>   }
>
> --- a/xen/drivers/passthrough/amd/iommu_detect.c
> +++ b/xen/drivers/passthrough/amd/iommu_detect.c
> @@ -153,6 +153,12 @@ int __init amd_iommu_detect_one_acpi(
>       if ( rt )
>           return -ENODEV;
>
> +    rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func));
> +    if ( rt )
> +        printk(XENLOG_ERR
> +               "Could not mark config space of %04x:%02x:%02x.%u read-only (%d)\n",
> +               iommu->seg, bus, dev, func, rt);
> +
>       list_add_tail(&iommu->list,&amd_iommu_head);
>
>       return 0;
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -593,11 +593,3 @@ void hvm_dpci_eoi(struct domain *d, unsi
>   unlock:
>       spin_unlock(&d->event_lock);
>   }
> -
> -static int __init setup_mmio_ro_ranges(void)
> -{
> -    mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
> -                                  RANGESETF_prettyprint_hex);
> -    return 0;
> -}
> -__initcall(setup_mmio_ro_ranges);
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -36,6 +36,7 @@
>   struct pci_seg {
>       struct list_head alldevs_list;
>       u16 nr;
> +    unsigned long *ro_map;
>       /* bus2bridge_lock protects bus2bridge array */
>       spinlock_t bus2bridge_lock;
>   #define MAX_BUSES 256
> @@ -106,6 +107,8 @@ void __init pt_pci_init(void)
>       radix_tree_init(&pci_segments);
>       if ( !alloc_pseg(0) )
>           panic("Could not initialize PCI segment 0\n");
> +    mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
> +                                  RANGESETF_prettyprint_hex);
>   }
>
>   int __init pci_add_segment(u16 seg)
> @@ -113,6 +116,13 @@ int __init pci_add_segment(u16 seg)
>       return alloc_pseg(seg) ? 0 : -ENOMEM;
>   }
>
> +const unsigned long *pci_get_ro_map(u16 seg)
> +{
> +    struct pci_seg *pseg = get_pseg(seg);
> +
> +    return pseg ? pseg->ro_map : NULL;
> +}
> +
>   static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>   {
>       struct pci_dev *pdev;
> @@ -198,6 +208,33 @@ static void free_pdev(struct pci_seg *ps
>       xfree(pdev);
>   }
>
> +int __init pci_ro_device(int seg, int bus, int devfn)
> +{
> +    struct pci_seg *pseg = alloc_pseg(seg);
> +    struct pci_dev *pdev;
> +
> +    if ( !pseg )
> +        return -ENOMEM;
> +    pdev = alloc_pdev(pseg, bus, devfn);
> +    if ( !pdev )
> +        return -ENOMEM;
> +
> +    if ( !pseg->ro_map )
> +    {
> +        size_t sz = BITS_TO_LONGS(PCI_BDF(-1, -1, -1) + 1) * sizeof(long);
> +
> +        pseg->ro_map = alloc_xenheap_pages(get_order_from_bytes(sz), 0);
> +        if ( !pseg->ro_map )
> +            return -ENOMEM;
> +        memset(pseg->ro_map, 0, sz);
> +    }
> +
> +    __set_bit(PCI_BDF2(bus, devfn), pseg->ro_map);
> +    arch_pci_ro_device(seg, PCI_BDF2(bus, devfn));
> +
> +    return 0;
> +}
> +
>   struct pci_dev *pci_get_pdev(int seg, int bus, int devfn)
>   {
>       struct pci_seg *pseg = get_pseg(seg);
> --- a/xen/include/asm-x86/mm.h
> +++ b/xen/include/asm-x86/mm.h
> @@ -555,6 +555,8 @@ void memguard_unguard_stack(void *p);
>
>   int  ptwr_do_page_fault(struct vcpu *, unsigned long,
>                           struct cpu_user_regs *);
> +int  mmio_ro_do_page_fault(struct vcpu *, unsigned long,
> +                           struct cpu_user_regs *);
>
>   int audit_adjust_pgtables(struct domain *d, int dir, int noisy);
>
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -98,8 +98,11 @@ struct pci_dev *pci_lock_domain_pdev(
>   void setup_dom0_pci_devices(struct domain *, void (*)(struct pci_dev *));
>   void pci_release_devices(struct domain *d);
>   int pci_add_segment(u16 seg);
> +const unsigned long *pci_get_ro_map(u16 seg);
>   int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
>   int pci_remove_device(u16 seg, u8 bus, u8 devfn);
> +int pci_ro_device(int seg, int bus, int devfn);
> +void arch_pci_ro_device(int seg, int bdf);
>   struct pci_dev *pci_get_pdev(int seg, int bus, int devfn);
>   struct pci_dev *pci_get_pdev_by_domain(
>       struct domain *, int seg, int bus, int devfn);
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it
  2012-06-21 15:29           ` Wei Wang
@ 2012-06-21 15:49             ` Jan Beulich
  2012-06-21 16:31               ` Keir Fraser
  2012-06-22  9:03               ` Wei Wang
  0 siblings, 2 replies; 33+ messages in thread
From: Jan Beulich @ 2012-06-21 15:49 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Wei Wang, Andrew Cooper, Jeremy Fitzhardinge, xen-devel,
	Konrad Rzeszutek Wilk

>>> On 21.06.12 at 17:29, Wei Wang <wei.wang2@amd.com> wrote:
> On 06/20/2012 05:45 PM, Jan Beulich wrote:
>   > Rather than submitting it for inclusion immediately, I'd rather see
>> you first use it successfully. Below/attached what appears to
>> work fine for me in contrived situations (i.e. hiding bridges with
>> nothing connected, as I still don't have any AMD IOMMU system
>> at hand).
> 
> The patch works for me. IOMMU msi Capability is shown as enabled with 
> this patch.

Keir,

the question now arises whether we really want this, and
particularly this late before 4.2. The Linux folks don't seem to
be willing to take the strait forward workaround for the
problem introduced at their end, so we will likely need
something (the more that the questionable fix already made
it into various stable trees) before 4.2 goes out (and even
the older trees would be affected, just that putting a change
like this there is even more questionable).

There are obviously more potential problems in this area: If
any of the MMIO addresses used by AMD's IOMMU is
configurable through one of the BARs, and if Dom0 decides to
re-assign MMIO space, neither allowing the writes nor simply
dropping them as done here will work. Whether that's a real
problem I don't know - Wei? And there might be other issues
arising from dropping all writes - we might just currently not be
aware of them.

Jan

> 00:00.2 Generic system peripheral [0806]: Advanced Micro Devices [AMD] 
> Device 1419
>          Subsystem: Advanced Micro Devices [AMD] Device 1419
>          Flags: bus master, fast devsel, latency 0, IRQ 11
>          Capabilities: [40] Secure device <?>
>          Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
>          Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+
> 
> 
>> Jan
>>
>> Note that due to ptwr_do_page_fault() being run first, there'll be a
>> MEM_LOG() issued for each such attempt. If that's undesirable, the
>> order of the calls would need to be swapped.
>>
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -1875,6 +1875,7 @@ static int mod_l1_entry(l1_pgentry_t *pl
>>               break;
>>           case 1:
>>               l1e_remove_flags(nl1e, _PAGE_RW);
>> +            rc = 0;
>>               break;
>>           }
>>           if ( page )
>> @@ -5208,6 +5209,97 @@ int ptwr_do_page_fault(struct vcpu *v, u
>>       return 0;
>>   }
>>
>> +#ifdef __x86_64__
>> +/*************************
>> + * fault handling for read-only MMIO pages
>> + */
>> +
>> +struct mmio_ro_emulate_ctxt {
>> +    struct x86_emulate_ctxt ctxt;
>> +    unsigned long cr2;
>> +};
>> +
>> +static int mmio_ro_emulated_read(
>> +    enum x86_segment seg,
>> +    unsigned long offset,
>> +    void *p_data,
>> +    unsigned int bytes,
>> +    struct x86_emulate_ctxt *ctxt)
>> +{
>> +    return X86EMUL_UNHANDLEABLE;
>> +}
>> +
>> +static int mmio_ro_emulated_write(
>> +    enum x86_segment seg,
>> +    unsigned long offset,
>> +    void *p_data,
>> +    unsigned int bytes,
>> +    struct x86_emulate_ctxt *ctxt)
>> +{
>> +    struct mmio_ro_emulate_ctxt *mmio_ro_ctxt =
>> +        container_of(ctxt, struct mmio_ro_emulate_ctxt, ctxt);
>> +
>> +    /* Only allow naturally-aligned stores at the original %cr2 address. */
>> +    if ( ((bytes | offset)&  (bytes - 1)) || offset != mmio_ro_ctxt->cr2 )
>> +    {
>> +        MEM_LOG("mmio_ro_emulate: bad access (cr2=%lx, addr=%lx, 
> bytes=%u)",
>> +                mmio_ro_ctxt->cr2, offset, bytes);
>> +        return X86EMUL_UNHANDLEABLE;
>> +    }
>> +
>> +    return X86EMUL_OKAY;
>> +}
>> +
>> +static const struct x86_emulate_ops mmio_ro_emulate_ops = {
>> +    .read       = mmio_ro_emulated_read,
>> +    .insn_fetch = ptwr_emulated_read,
>> +    .write      = mmio_ro_emulated_write,
>> +};
>> +
>> +/* Check if guest is trying to modify a r/o MMIO page. */
>> +int mmio_ro_do_page_fault(struct vcpu *v, unsigned long addr,
>> +                          struct cpu_user_regs *regs)
>> +{
>> +    l1_pgentry_t      pte;
>> +    unsigned long     mfn;
>> +    unsigned int      addr_size = is_pv_32on64_domain(v->domain) ?
>> +                                  32 : BITS_PER_LONG;
>> +    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = {
>> +        .ctxt.regs = regs,
>> +        .ctxt.addr_size = addr_size,
>> +        .ctxt.sp_size = addr_size,
>> +        .cr2 = addr
>> +    };
>> +    int rc;
>> +
>> +    /* Attempt to read the PTE that maps the VA being accessed. */
>> +    guest_get_eff_l1e(v, addr,&pte);
>> +
>> +    /* We are looking only for read-only mappings of MMIO pages. */
>> +    if ( ((l1e_get_flags(pte)&  (_PAGE_PRESENT|_PAGE_RW)) != _PAGE_PRESENT) 
> )
>> +        return 0;
>> +
>> +    mfn = l1e_get_pfn(pte);
>> +    if ( mfn_valid(mfn) )
>> +    {
>> +        struct page_info *page = mfn_to_page(mfn);
>> +        struct domain *owner = page_get_owner_and_reference(page);
>> +
>> +        if ( owner )
>> +            put_page(page);
>> +        if ( owner != dom_io )
>> +            return 0;
>> +    }
>> +
>> +    if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
>> +        return 0;
>> +
>> +    rc = x86_emulate(&mmio_ro_ctxt.ctxt,&mmio_ro_emulate_ops);
>> +
>> +    return rc != X86EMUL_UNHANDLEABLE ? EXCRET_fault_fixed : 0;
>> +}
>> +#endif /* __x86_64__ */
>> +
>>   void free_xen_pagetable(void *v)
>>   {
>>       if ( system_state == SYS_STATE_early_boot )
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -1349,20 +1349,23 @@ static int fixup_page_fault(unsigned lon
>>           return 0;
>>       }
>>
>> -    if ( VM_ASSIST(d, VMASST_TYPE_writable_pagetables)&&
>> -         guest_kernel_mode(v, regs) )
>> -    {
>> -        unsigned int mbs = PFEC_write_access;
>> -        unsigned int mbz = PFEC_reserved_bit | PFEC_insn_fetch;
>> -
>> -        /* Do not check if access-protection fault since the page may
>> -           legitimately be not present in shadow page tables */
>> -        if ( !paging_mode_enabled(d) )
>> -            mbs |= PFEC_page_present;
>> -
>> -        if ( ((regs->error_code&  (mbs | mbz)) == mbs)&&
>> +    if ( guest_kernel_mode(v, regs)&&
>> +         !(regs->error_code&  (PFEC_reserved_bit | PFEC_insn_fetch))&&
>> +         (regs->error_code&  PFEC_write_access) )
>> +    {
>> +        if ( VM_ASSIST(d, VMASST_TYPE_writable_pagetables)&&
>> +             /* Do not check if access-protection fault since the page may
>> +                legitimately be not present in shadow page tables */
>> +             (paging_mode_enabled(d) ||
>> +              (regs->error_code&  PFEC_page_present))&&
>>                ptwr_do_page_fault(v, addr, regs) )
>>               return EXCRET_fault_fixed;
>> +
>> +#ifdef __x86_64__
>> +        if ( IS_PRIV(d)&&  (regs->error_code&  PFEC_page_present)&&
>> +             mmio_ro_do_page_fault(v, addr, regs) )
>> +            return EXCRET_fault_fixed;
>> +#endif
>>       }
>>
>>       /* For non-external shadowed guests, we fix up both their own
>> @@ -1690,6 +1693,13 @@ static int pci_cfg_ok(struct domain *d,
>>           return 0;
>>
>>       machine_bdf = (d->arch.pci_cf8>>  8)&  0xFFFF;
>> +    if ( write )
>> +    {
>> +        const unsigned long *ro_map = pci_get_ro_map(0);
>> +
>> +        if ( ro_map&&  test_bit(machine_bdf, ro_map) )
>> +            return 0;
>> +    }
>>       start = d->arch.pci_cf8&  0xFF;
>>       end = start + size - 1;
>>       if (xsm_pci_config_permission(d, machine_bdf, start, end, write))
>> --- a/xen/arch/x86/x86_32/pci.c
>> +++ b/xen/arch/x86/x86_32/pci.c
>> @@ -6,6 +6,7 @@
>>
>>   #include<xen/spinlock.h>
>>   #include<xen/pci.h>
>> +#include<xen/init.h>
>>   #include<asm/io.h>
>>
>>   #define PCI_CONF_ADDRESS(bus, dev, func, reg) \
>> @@ -70,3 +71,7 @@ void pci_conf_write32(
>>       BUG_ON((bus>  255) || (dev>  31) || (func>  7) || (reg>  255));
>>       pci_conf_write(PCI_CONF_ADDRESS(bus, dev, func, reg), 0, 4, data);
>>   }
>> +
>> +void __init arch_pci_ro_device(int seg, int bdf)
>> +{
>> +}
>> --- a/xen/arch/x86/x86_64/mmconfig_64.c
>> +++ b/xen/arch/x86/x86_64/mmconfig_64.c
>> @@ -145,9 +145,30 @@ static void __iomem *mcfg_ioremap(const
>>       return (void __iomem *) virt;
>>   }
>>
>> +void arch_pci_ro_device(int seg, int bdf)
>> +{
>> +    unsigned int idx, bus = PCI_BUS(bdf);
>> +
>> +    for (idx = 0; idx<  pci_mmcfg_config_num; ++idx) {
>> +        const struct acpi_mcfg_allocation *cfg = pci_mmcfg_virt[idx].cfg;
>> +        unsigned long mfn = (cfg->address>>  PAGE_SHIFT) + bdf;
>> +
>> +        if (!pci_mmcfg_virt[idx].virt || cfg->pci_segment != seg ||
>> +            cfg->start_bus_number>  bus || cfg->end_bus_number<  bus)
>> +            continue;
>> +
>> +        if (rangeset_add_singleton(mmio_ro_ranges, mfn))
>> +            printk(XENLOG_ERR
>> +                   "%04x:%02x:%02x.%u: could not mark MCFG (mfn %#lx) 
> read-only\n",
>> +                   cfg->pci_segment, bus, PCI_SLOT(bdf), PCI_FUNC(bdf),
>> +                   mfn);
>> +    }
>> +}
>> +
>>   int pci_mmcfg_arch_enable(unsigned int idx)
>>   {
>>       const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
>> +    const unsigned long *ro_map = pci_get_ro_map(cfg->pci_segment);
>>
>>       if (pci_mmcfg_virt[idx].virt)
>>           return 0;
>> @@ -159,6 +180,16 @@ int pci_mmcfg_arch_enable(unsigned int i
>>       }
>>       printk(KERN_INFO "PCI: Using MCFG for segment %04x bus %02x-%02x\n",
>>              cfg->pci_segment, cfg->start_bus_number, cfg->end_bus_number);
>> +    if (ro_map) {
>> +        unsigned int bdf = PCI_BDF(cfg->start_bus_number, 0, 0);
>> +        unsigned int end = PCI_BDF(cfg->end_bus_number, -1, -1);
>> +
>> +        while ((bdf = find_next_bit(ro_map, end + 1, bdf))<= end) {
>> +            arch_pci_ro_device(cfg->pci_segment, bdf);
>> +            if (bdf++ == end)
>> +                break;
>> +        }
>> +    }
>>       return 0;
>>   }
>>
>> --- a/xen/drivers/passthrough/amd/iommu_detect.c
>> +++ b/xen/drivers/passthrough/amd/iommu_detect.c
>> @@ -153,6 +153,12 @@ int __init amd_iommu_detect_one_acpi(
>>       if ( rt )
>>           return -ENODEV;
>>
>> +    rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func));
>> +    if ( rt )
>> +        printk(XENLOG_ERR
>> +               "Could not mark config space of %04x:%02x:%02x.%u read-only 
> (%d)\n",
>> +               iommu->seg, bus, dev, func, rt);
>> +
>>       list_add_tail(&iommu->list,&amd_iommu_head);
>>
>>       return 0;
>> --- a/xen/drivers/passthrough/io.c
>> +++ b/xen/drivers/passthrough/io.c
>> @@ -593,11 +593,3 @@ void hvm_dpci_eoi(struct domain *d, unsi
>>   unlock:
>>       spin_unlock(&d->event_lock);
>>   }
>> -
>> -static int __init setup_mmio_ro_ranges(void)
>> -{
>> -    mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
>> -                                  RANGESETF_prettyprint_hex);
>> -    return 0;
>> -}
>> -__initcall(setup_mmio_ro_ranges);
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -36,6 +36,7 @@
>>   struct pci_seg {
>>       struct list_head alldevs_list;
>>       u16 nr;
>> +    unsigned long *ro_map;
>>       /* bus2bridge_lock protects bus2bridge array */
>>       spinlock_t bus2bridge_lock;
>>   #define MAX_BUSES 256
>> @@ -106,6 +107,8 @@ void __init pt_pci_init(void)
>>       radix_tree_init(&pci_segments);
>>       if ( !alloc_pseg(0) )
>>           panic("Could not initialize PCI segment 0\n");
>> +    mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
>> +                                  RANGESETF_prettyprint_hex);
>>   }
>>
>>   int __init pci_add_segment(u16 seg)
>> @@ -113,6 +116,13 @@ int __init pci_add_segment(u16 seg)
>>       return alloc_pseg(seg) ? 0 : -ENOMEM;
>>   }
>>
>> +const unsigned long *pci_get_ro_map(u16 seg)
>> +{
>> +    struct pci_seg *pseg = get_pseg(seg);
>> +
>> +    return pseg ? pseg->ro_map : NULL;
>> +}
>> +
>>   static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>>   {
>>       struct pci_dev *pdev;
>> @@ -198,6 +208,33 @@ static void free_pdev(struct pci_seg *ps
>>       xfree(pdev);
>>   }
>>
>> +int __init pci_ro_device(int seg, int bus, int devfn)
>> +{
>> +    struct pci_seg *pseg = alloc_pseg(seg);
>> +    struct pci_dev *pdev;
>> +
>> +    if ( !pseg )
>> +        return -ENOMEM;
>> +    pdev = alloc_pdev(pseg, bus, devfn);
>> +    if ( !pdev )
>> +        return -ENOMEM;
>> +
>> +    if ( !pseg->ro_map )
>> +    {
>> +        size_t sz = BITS_TO_LONGS(PCI_BDF(-1, -1, -1) + 1) * sizeof(long);
>> +
>> +        pseg->ro_map = alloc_xenheap_pages(get_order_from_bytes(sz), 0);
>> +        if ( !pseg->ro_map )
>> +            return -ENOMEM;
>> +        memset(pseg->ro_map, 0, sz);
>> +    }
>> +
>> +    __set_bit(PCI_BDF2(bus, devfn), pseg->ro_map);
>> +    arch_pci_ro_device(seg, PCI_BDF2(bus, devfn));
>> +
>> +    return 0;
>> +}
>> +
>>   struct pci_dev *pci_get_pdev(int seg, int bus, int devfn)
>>   {
>>       struct pci_seg *pseg = get_pseg(seg);
>> --- a/xen/include/asm-x86/mm.h
>> +++ b/xen/include/asm-x86/mm.h
>> @@ -555,6 +555,8 @@ void memguard_unguard_stack(void *p);
>>
>>   int  ptwr_do_page_fault(struct vcpu *, unsigned long,
>>                           struct cpu_user_regs *);
>> +int  mmio_ro_do_page_fault(struct vcpu *, unsigned long,
>> +                           struct cpu_user_regs *);
>>
>>   int audit_adjust_pgtables(struct domain *d, int dir, int noisy);
>>
>> --- a/xen/include/xen/pci.h
>> +++ b/xen/include/xen/pci.h
>> @@ -98,8 +98,11 @@ struct pci_dev *pci_lock_domain_pdev(
>>   void setup_dom0_pci_devices(struct domain *, void (*)(struct pci_dev *));
>>   void pci_release_devices(struct domain *d);
>>   int pci_add_segment(u16 seg);
>> +const unsigned long *pci_get_ro_map(u16 seg);
>>   int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info 
> *);
>>   int pci_remove_device(u16 seg, u8 bus, u8 devfn);
>> +int pci_ro_device(int seg, int bus, int devfn);
>> +void arch_pci_ro_device(int seg, int bdf);
>>   struct pci_dev *pci_get_pdev(int seg, int bus, int devfn);
>>   struct pci_dev *pci_get_pdev_by_domain(
>>       struct domain *, int seg, int bus, int devfn);
>>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it
  2012-06-21 15:49             ` Jan Beulich
@ 2012-06-21 16:31               ` Keir Fraser
  2012-06-22  9:03               ` Wei Wang
  1 sibling, 0 replies; 33+ messages in thread
From: Keir Fraser @ 2012-06-21 16:31 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Wang, Andrew Cooper, Jeremy Fitzhardinge, xen-devel,
	Konrad Rzeszutek Wilk

On 21/06/2012 16:49, "Jan Beulich" <JBeulich@suse.com> wrote:

> the question now arises whether we really want this, and
> particularly this late before 4.2. The Linux folks don't seem to
> be willing to take the strait forward workaround for the
> problem introduced at their end, so we will likely need
> something (the more that the questionable fix already made
> it into various stable trees) before 4.2 goes out (and even
> the older trees would be affected, just that putting a change
> like this there is even more questionable).

I'd be okay seeing the proposed patch go in ahead of 4.2.0-rc1.

 -- Keir

> There are obviously more potential problems in this area: If
> any of the MMIO addresses used by AMD's IOMMU is
> configurable through one of the BARs, and if Dom0 decides to
> re-assign MMIO space, neither allowing the writes nor simply
> dropping them as done here will work. Whether that's a real
> problem I don't know - Wei? And there might be other issues
> arising from dropping all writes - we might just currently not be
> aware of them.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it
  2012-06-21 15:49             ` Jan Beulich
  2012-06-21 16:31               ` Keir Fraser
@ 2012-06-22  9:03               ` Wei Wang
  1 sibling, 0 replies; 33+ messages in thread
From: Wei Wang @ 2012-06-22  9:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Jeremy Fitzhardinge, xen-devel, Keir Fraser,
	Konrad Rzeszutek Wilk

On 06/21/2012 05:49 PM, Jan Beulich wrote:
>>>> On 21.06.12 at 17:29, Wei Wang<wei.wang2@amd.com>  wrote:
>> On 06/20/2012 05:45 PM, Jan Beulich wrote:
>>    >  Rather than submitting it for inclusion immediately, I'd rather see
>>> you first use it successfully. Below/attached what appears to
>>> work fine for me in contrived situations (i.e. hiding bridges with
>>> nothing connected, as I still don't have any AMD IOMMU system
>>> at hand).
>>
>> The patch works for me. IOMMU msi Capability is shown as enabled with
>> this patch.
>
> Keir,
>
> the question now arises whether we really want this, and
> particularly this late before 4.2. The Linux folks don't seem to
> be willing to take the strait forward workaround for the
> problem introduced at their end, so we will likely need
> something (the more that the questionable fix already made
> it into various stable trees) before 4.2 goes out (and even
> the older trees would be affected, just that putting a change
> like this there is even more questionable).
>
> There are obviously more potential problems in this area: If
> any of the MMIO addresses used by AMD's IOMMU is
> configurable through one of the BARs, and if Dom0 decides to
> re-assign MMIO space, neither allowing the writes nor simply
> dropping them as done here will work. Whether that's a real
> problem I don't know - Wei? And there might be other issues
> arising from dropping all writes - we might just currently not be
> aware of them.

AMD IOMMU does not have any PCI BARs, All address configuration should 
go to ACPI tables. So this should be fine at least for AMD IOMMU.

Thanks
Wei

> Jan
>
>> 00:00.2 Generic system peripheral [0806]: Advanced Micro Devices [AMD]
>> Device 1419
>>           Subsystem: Advanced Micro Devices [AMD] Device 1419
>>           Flags: bus master, fast devsel, latency 0, IRQ 11
>>           Capabilities: [40] Secure device<?>
>>           Capabilities: [54] MSI: Enable+ Count=1/1 Maskable- 64bit+
>>           Capabilities: [64] HyperTransport: MSI Mapping Enable+ Fixed+
>>
>>
>>> Jan
>>>
>>> Note that due to ptwr_do_page_fault() being run first, there'll be a
>>> MEM_LOG() issued for each such attempt. If that's undesirable, the
>>> order of the calls would need to be swapped.
>>>
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -1875,6 +1875,7 @@ static int mod_l1_entry(l1_pgentry_t *pl
>>>                break;
>>>            case 1:
>>>                l1e_remove_flags(nl1e, _PAGE_RW);
>>> +            rc = 0;
>>>                break;
>>>            }
>>>            if ( page )
>>> @@ -5208,6 +5209,97 @@ int ptwr_do_page_fault(struct vcpu *v, u
>>>        return 0;
>>>    }
>>>
>>> +#ifdef __x86_64__
>>> +/*************************
>>> + * fault handling for read-only MMIO pages
>>> + */
>>> +
>>> +struct mmio_ro_emulate_ctxt {
>>> +    struct x86_emulate_ctxt ctxt;
>>> +    unsigned long cr2;
>>> +};
>>> +
>>> +static int mmio_ro_emulated_read(
>>> +    enum x86_segment seg,
>>> +    unsigned long offset,
>>> +    void *p_data,
>>> +    unsigned int bytes,
>>> +    struct x86_emulate_ctxt *ctxt)
>>> +{
>>> +    return X86EMUL_UNHANDLEABLE;
>>> +}
>>> +
>>> +static int mmio_ro_emulated_write(
>>> +    enum x86_segment seg,
>>> +    unsigned long offset,
>>> +    void *p_data,
>>> +    unsigned int bytes,
>>> +    struct x86_emulate_ctxt *ctxt)
>>> +{
>>> +    struct mmio_ro_emulate_ctxt *mmio_ro_ctxt =
>>> +        container_of(ctxt, struct mmio_ro_emulate_ctxt, ctxt);
>>> +
>>> +    /* Only allow naturally-aligned stores at the original %cr2 address. */
>>> +    if ( ((bytes | offset)&   (bytes - 1)) || offset != mmio_ro_ctxt->cr2 )
>>> +    {
>>> +        MEM_LOG("mmio_ro_emulate: bad access (cr2=%lx, addr=%lx,
>> bytes=%u)",
>>> +                mmio_ro_ctxt->cr2, offset, bytes);
>>> +        return X86EMUL_UNHANDLEABLE;
>>> +    }
>>> +
>>> +    return X86EMUL_OKAY;
>>> +}
>>> +
>>> +static const struct x86_emulate_ops mmio_ro_emulate_ops = {
>>> +    .read       = mmio_ro_emulated_read,
>>> +    .insn_fetch = ptwr_emulated_read,
>>> +    .write      = mmio_ro_emulated_write,
>>> +};
>>> +
>>> +/* Check if guest is trying to modify a r/o MMIO page. */
>>> +int mmio_ro_do_page_fault(struct vcpu *v, unsigned long addr,
>>> +                          struct cpu_user_regs *regs)
>>> +{
>>> +    l1_pgentry_t      pte;
>>> +    unsigned long     mfn;
>>> +    unsigned int      addr_size = is_pv_32on64_domain(v->domain) ?
>>> +                                  32 : BITS_PER_LONG;
>>> +    struct mmio_ro_emulate_ctxt mmio_ro_ctxt = {
>>> +        .ctxt.regs = regs,
>>> +        .ctxt.addr_size = addr_size,
>>> +        .ctxt.sp_size = addr_size,
>>> +        .cr2 = addr
>>> +    };
>>> +    int rc;
>>> +
>>> +    /* Attempt to read the PTE that maps the VA being accessed. */
>>> +    guest_get_eff_l1e(v, addr,&pte);
>>> +
>>> +    /* We are looking only for read-only mappings of MMIO pages. */
>>> +    if ( ((l1e_get_flags(pte)&   (_PAGE_PRESENT|_PAGE_RW)) != _PAGE_PRESENT)
>> )
>>> +        return 0;
>>> +
>>> +    mfn = l1e_get_pfn(pte);
>>> +    if ( mfn_valid(mfn) )
>>> +    {
>>> +        struct page_info *page = mfn_to_page(mfn);
>>> +        struct domain *owner = page_get_owner_and_reference(page);
>>> +
>>> +        if ( owner )
>>> +            put_page(page);
>>> +        if ( owner != dom_io )
>>> +            return 0;
>>> +    }
>>> +
>>> +    if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
>>> +        return 0;
>>> +
>>> +    rc = x86_emulate(&mmio_ro_ctxt.ctxt,&mmio_ro_emulate_ops);
>>> +
>>> +    return rc != X86EMUL_UNHANDLEABLE ? EXCRET_fault_fixed : 0;
>>> +}
>>> +#endif /* __x86_64__ */
>>> +
>>>    void free_xen_pagetable(void *v)
>>>    {
>>>        if ( system_state == SYS_STATE_early_boot )
>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -1349,20 +1349,23 @@ static int fixup_page_fault(unsigned lon
>>>            return 0;
>>>        }
>>>
>>> -    if ( VM_ASSIST(d, VMASST_TYPE_writable_pagetables)&&
>>> -         guest_kernel_mode(v, regs) )
>>> -    {
>>> -        unsigned int mbs = PFEC_write_access;
>>> -        unsigned int mbz = PFEC_reserved_bit | PFEC_insn_fetch;
>>> -
>>> -        /* Do not check if access-protection fault since the page may
>>> -           legitimately be not present in shadow page tables */
>>> -        if ( !paging_mode_enabled(d) )
>>> -            mbs |= PFEC_page_present;
>>> -
>>> -        if ( ((regs->error_code&   (mbs | mbz)) == mbs)&&
>>> +    if ( guest_kernel_mode(v, regs)&&
>>> +         !(regs->error_code&   (PFEC_reserved_bit | PFEC_insn_fetch))&&
>>> +         (regs->error_code&   PFEC_write_access) )
>>> +    {
>>> +        if ( VM_ASSIST(d, VMASST_TYPE_writable_pagetables)&&
>>> +             /* Do not check if access-protection fault since the page may
>>> +                legitimately be not present in shadow page tables */
>>> +             (paging_mode_enabled(d) ||
>>> +              (regs->error_code&   PFEC_page_present))&&
>>>                 ptwr_do_page_fault(v, addr, regs) )
>>>                return EXCRET_fault_fixed;
>>> +
>>> +#ifdef __x86_64__
>>> +        if ( IS_PRIV(d)&&   (regs->error_code&   PFEC_page_present)&&
>>> +             mmio_ro_do_page_fault(v, addr, regs) )
>>> +            return EXCRET_fault_fixed;
>>> +#endif
>>>        }
>>>
>>>        /* For non-external shadowed guests, we fix up both their own
>>> @@ -1690,6 +1693,13 @@ static int pci_cfg_ok(struct domain *d,
>>>            return 0;
>>>
>>>        machine_bdf = (d->arch.pci_cf8>>   8)&   0xFFFF;
>>> +    if ( write )
>>> +    {
>>> +        const unsigned long *ro_map = pci_get_ro_map(0);
>>> +
>>> +        if ( ro_map&&   test_bit(machine_bdf, ro_map) )
>>> +            return 0;
>>> +    }
>>>        start = d->arch.pci_cf8&   0xFF;
>>>        end = start + size - 1;
>>>        if (xsm_pci_config_permission(d, machine_bdf, start, end, write))
>>> --- a/xen/arch/x86/x86_32/pci.c
>>> +++ b/xen/arch/x86/x86_32/pci.c
>>> @@ -6,6 +6,7 @@
>>>
>>>    #include<xen/spinlock.h>
>>>    #include<xen/pci.h>
>>> +#include<xen/init.h>
>>>    #include<asm/io.h>
>>>
>>>    #define PCI_CONF_ADDRESS(bus, dev, func, reg) \
>>> @@ -70,3 +71,7 @@ void pci_conf_write32(
>>>        BUG_ON((bus>   255) || (dev>   31) || (func>   7) || (reg>   255));
>>>        pci_conf_write(PCI_CONF_ADDRESS(bus, dev, func, reg), 0, 4, data);
>>>    }
>>> +
>>> +void __init arch_pci_ro_device(int seg, int bdf)
>>> +{
>>> +}
>>> --- a/xen/arch/x86/x86_64/mmconfig_64.c
>>> +++ b/xen/arch/x86/x86_64/mmconfig_64.c
>>> @@ -145,9 +145,30 @@ static void __iomem *mcfg_ioremap(const
>>>        return (void __iomem *) virt;
>>>    }
>>>
>>> +void arch_pci_ro_device(int seg, int bdf)
>>> +{
>>> +    unsigned int idx, bus = PCI_BUS(bdf);
>>> +
>>> +    for (idx = 0; idx<   pci_mmcfg_config_num; ++idx) {
>>> +        const struct acpi_mcfg_allocation *cfg = pci_mmcfg_virt[idx].cfg;
>>> +        unsigned long mfn = (cfg->address>>   PAGE_SHIFT) + bdf;
>>> +
>>> +        if (!pci_mmcfg_virt[idx].virt || cfg->pci_segment != seg ||
>>> +            cfg->start_bus_number>   bus || cfg->end_bus_number<   bus)
>>> +            continue;
>>> +
>>> +        if (rangeset_add_singleton(mmio_ro_ranges, mfn))
>>> +            printk(XENLOG_ERR
>>> +                   "%04x:%02x:%02x.%u: could not mark MCFG (mfn %#lx)
>> read-only\n",
>>> +                   cfg->pci_segment, bus, PCI_SLOT(bdf), PCI_FUNC(bdf),
>>> +                   mfn);
>>> +    }
>>> +}
>>> +
>>>    int pci_mmcfg_arch_enable(unsigned int idx)
>>>    {
>>>        const typeof(pci_mmcfg_config[0]) *cfg = pci_mmcfg_virt[idx].cfg;
>>> +    const unsigned long *ro_map = pci_get_ro_map(cfg->pci_segment);
>>>
>>>        if (pci_mmcfg_virt[idx].virt)
>>>            return 0;
>>> @@ -159,6 +180,16 @@ int pci_mmcfg_arch_enable(unsigned int i
>>>        }
>>>        printk(KERN_INFO "PCI: Using MCFG for segment %04x bus %02x-%02x\n",
>>>               cfg->pci_segment, cfg->start_bus_number, cfg->end_bus_number);
>>> +    if (ro_map) {
>>> +        unsigned int bdf = PCI_BDF(cfg->start_bus_number, 0, 0);
>>> +        unsigned int end = PCI_BDF(cfg->end_bus_number, -1, -1);
>>> +
>>> +        while ((bdf = find_next_bit(ro_map, end + 1, bdf))<= end) {
>>> +            arch_pci_ro_device(cfg->pci_segment, bdf);
>>> +            if (bdf++ == end)
>>> +                break;
>>> +        }
>>> +    }
>>>        return 0;
>>>    }
>>>
>>> --- a/xen/drivers/passthrough/amd/iommu_detect.c
>>> +++ b/xen/drivers/passthrough/amd/iommu_detect.c
>>> @@ -153,6 +153,12 @@ int __init amd_iommu_detect_one_acpi(
>>>        if ( rt )
>>>            return -ENODEV;
>>>
>>> +    rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func));
>>> +    if ( rt )
>>> +        printk(XENLOG_ERR
>>> +               "Could not mark config space of %04x:%02x:%02x.%u read-only
>> (%d)\n",
>>> +               iommu->seg, bus, dev, func, rt);
>>> +
>>>        list_add_tail(&iommu->list,&amd_iommu_head);
>>>
>>>        return 0;
>>> --- a/xen/drivers/passthrough/io.c
>>> +++ b/xen/drivers/passthrough/io.c
>>> @@ -593,11 +593,3 @@ void hvm_dpci_eoi(struct domain *d, unsi
>>>    unlock:
>>>        spin_unlock(&d->event_lock);
>>>    }
>>> -
>>> -static int __init setup_mmio_ro_ranges(void)
>>> -{
>>> -    mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
>>> -                                  RANGESETF_prettyprint_hex);
>>> -    return 0;
>>> -}
>>> -__initcall(setup_mmio_ro_ranges);
>>> --- a/xen/drivers/passthrough/pci.c
>>> +++ b/xen/drivers/passthrough/pci.c
>>> @@ -36,6 +36,7 @@
>>>    struct pci_seg {
>>>        struct list_head alldevs_list;
>>>        u16 nr;
>>> +    unsigned long *ro_map;
>>>        /* bus2bridge_lock protects bus2bridge array */
>>>        spinlock_t bus2bridge_lock;
>>>    #define MAX_BUSES 256
>>> @@ -106,6 +107,8 @@ void __init pt_pci_init(void)
>>>        radix_tree_init(&pci_segments);
>>>        if ( !alloc_pseg(0) )
>>>            panic("Could not initialize PCI segment 0\n");
>>> +    mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
>>> +                                  RANGESETF_prettyprint_hex);
>>>    }
>>>
>>>    int __init pci_add_segment(u16 seg)
>>> @@ -113,6 +116,13 @@ int __init pci_add_segment(u16 seg)
>>>        return alloc_pseg(seg) ? 0 : -ENOMEM;
>>>    }
>>>
>>> +const unsigned long *pci_get_ro_map(u16 seg)
>>> +{
>>> +    struct pci_seg *pseg = get_pseg(seg);
>>> +
>>> +    return pseg ? pseg->ro_map : NULL;
>>> +}
>>> +
>>>    static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
>>>    {
>>>        struct pci_dev *pdev;
>>> @@ -198,6 +208,33 @@ static void free_pdev(struct pci_seg *ps
>>>        xfree(pdev);
>>>    }
>>>
>>> +int __init pci_ro_device(int seg, int bus, int devfn)
>>> +{
>>> +    struct pci_seg *pseg = alloc_pseg(seg);
>>> +    struct pci_dev *pdev;
>>> +
>>> +    if ( !pseg )
>>> +        return -ENOMEM;
>>> +    pdev = alloc_pdev(pseg, bus, devfn);
>>> +    if ( !pdev )
>>> +        return -ENOMEM;
>>> +
>>> +    if ( !pseg->ro_map )
>>> +    {
>>> +        size_t sz = BITS_TO_LONGS(PCI_BDF(-1, -1, -1) + 1) * sizeof(long);
>>> +
>>> +        pseg->ro_map = alloc_xenheap_pages(get_order_from_bytes(sz), 0);
>>> +        if ( !pseg->ro_map )
>>> +            return -ENOMEM;
>>> +        memset(pseg->ro_map, 0, sz);
>>> +    }
>>> +
>>> +    __set_bit(PCI_BDF2(bus, devfn), pseg->ro_map);
>>> +    arch_pci_ro_device(seg, PCI_BDF2(bus, devfn));
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>    struct pci_dev *pci_get_pdev(int seg, int bus, int devfn)
>>>    {
>>>        struct pci_seg *pseg = get_pseg(seg);
>>> --- a/xen/include/asm-x86/mm.h
>>> +++ b/xen/include/asm-x86/mm.h
>>> @@ -555,6 +555,8 @@ void memguard_unguard_stack(void *p);
>>>
>>>    int  ptwr_do_page_fault(struct vcpu *, unsigned long,
>>>                            struct cpu_user_regs *);
>>> +int  mmio_ro_do_page_fault(struct vcpu *, unsigned long,
>>> +                           struct cpu_user_regs *);
>>>
>>>    int audit_adjust_pgtables(struct domain *d, int dir, int noisy);
>>>
>>> --- a/xen/include/xen/pci.h
>>> +++ b/xen/include/xen/pci.h
>>> @@ -98,8 +98,11 @@ struct pci_dev *pci_lock_domain_pdev(
>>>    void setup_dom0_pci_devices(struct domain *, void (*)(struct pci_dev *));
>>>    void pci_release_devices(struct domain *d);
>>>    int pci_add_segment(u16 seg);
>>> +const unsigned long *pci_get_ro_map(u16 seg);
>>>    int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info
>> *);
>>>    int pci_remove_device(u16 seg, u8 bus, u8 devfn);
>>> +int pci_ro_device(int seg, int bus, int devfn);
>>> +void arch_pci_ro_device(int seg, int bdf);
>>>    struct pci_dev *pci_get_pdev(int seg, int bus, int devfn);
>>>    struct pci_dev *pci_get_pdev_by_domain(
>>>        struct domain *, int seg, int bus, int devfn);
>>>
>
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2012-06-22  9:03 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-12 12:02 [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it Wei Wang
2012-06-12 15:13 ` Jan Beulich
2012-06-12 16:08   ` Andrew Cooper
2012-06-12 16:43     ` Jan Beulich
2012-06-14 12:13       ` Wei Wang
2012-06-14 14:18         ` Jan Beulich
2012-06-14 15:15           ` Wei Wang
2012-06-14 15:27             ` Jan Beulich
2012-06-21  9:59             ` [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0 (was: Re: [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it) Jan Beulich
2012-06-21  9:59               ` Jan Beulich
2012-06-21 11:08               ` [PATCH] PCI/MSI: don't disable AMD IOMMU MSI on Xen dom0 Eric W. Biederman
2012-06-21 11:08                 ` Eric W. Biederman
2012-06-21 12:28                 ` Jan Beulich
2012-06-21 12:28                   ` Jan Beulich
2012-06-21 11:21               ` Wei Wang
2012-06-21 11:21                 ` Wei Wang
2012-06-21 12:06                 ` Jan Beulich
2012-06-21 12:06                   ` Jan Beulich
2012-06-21 12:28                   ` Wei Wang
2012-06-21 12:28                     ` Wei Wang
2012-06-21 12:45                     ` Jan Beulich
2012-06-21 12:45                       ` Jan Beulich
2012-06-21 13:10                       ` Wei Wang
2012-06-21 13:10                         ` Wei Wang
2012-06-21 13:24                         ` Jan Beulich
2012-06-21 13:24                           ` Jan Beulich
2012-06-21 13:27                           ` Wei Wang
2012-06-21 13:27                             ` Wei Wang
2012-06-20 15:45         ` [PATCH V2] amd iommu: re-enable iommu msi if dom0 disabled it Jan Beulich
2012-06-21 15:29           ` Wei Wang
2012-06-21 15:49             ` Jan Beulich
2012-06-21 16:31               ` Keir Fraser
2012-06-22  9:03               ` Wei Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.