All of lore.kernel.org
 help / color / mirror / Atom feed
* RMRR Fix Design for Xen
@ 2014-12-19  1:21 Tiejun Chen
  2014-12-19  9:49 ` Fabio Fantoni
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Tiejun Chen @ 2014-12-19  1:21 UTC (permalink / raw)
  To: JBeulich, konrad.wilk, tim, kevin.tian, yang.z.zhang,
	stefano.stabellini, ian.campbell, ian.jackson, wei.liu2
  Cc: xen-devel

                    RMRR Fix Design for Xen

This design is a goal to fix RMRR for Xen. It includes four sectors as
follows:

    * Background
    * What is RMRR
    * Current RMRR Issues
    * Design Overview

We hope this can help us to understand current problem then figure out a
clean and better solution everyone can agree now to go forward.

Background
==========

We first identified this RMRR defect when trying to pass-through IGD device,
which can be simply fixed by adding an identity mapping in case of shared
EPT table. However along with the community discussion, it boiled down to
a more general RMRR problem, i.e. the identity mapping is brute-added
in hypervisor, w/o considering whether conflicting with an existing guest
PFN ranges. As a general solution we need invent a new mechanism so
reserved ranges allocated by hypervisor can be exported to the user space
toolstack and hvmloader, so conflict can be detected when constructing
guest PFN layout, with best-effort avoidance policies to further help.

What is RMRR
============

RMRR is a acronym for Reserved Memory Region Reporting.

BIOS may report each such reserved memory region through the RMRR structures,
along with the devices that requires access to the specified reserved memory
region. Reserved memory ranges that are either not DMA targets, or memory
ranges that may be target of BIOS initiated DMA only during pre-boot phase
(such as from a boot disk drive) must not be included in the reserved memory
region reporting. The base address of each RMRR region must be 4KB aligned and
the size must be an integer multiple of 4KB. BIOS must report the RMRR reported
memory addresses as reserved in the system memory map returned through methods
suchas INT15, EFI GetMemoryMap etc. The reserved memory region reporting
structures are optional. If there are no RMRR structures, the system software
concludes that the platform does not have any reserved memory ranges that are 
DMA targets.

The RMRR regions are expected to be used for legacy usages (such as USB, UMA
Graphics, etc.) requiring reserved memory. Platform designers shouldavoid or
limit use of reserved memory regions since these require system software to
create holes in the DMA virtual address range available to system software and
its drivers.

The following is grabbed from my BDW:

(XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ab80a000 end_address ab81dfff
(XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ad000000 end_address af7fffff

Here USB occupies 0xab80a000:0xab81dfff, IGD owns 0xad000000:0xaf7fffff.

Note there are zero or more Reserved Memory Region Reporting (RMRR) in one given
platform. And multiple devices may share one RMRR range. Additionally RMRR can
go anyplace.

Current RMRR Issues
===================

#1 RMRR may conflict RAM, mmio or other ranges in Guest physical level.

#2 Xen doesn't create RMRR mapping in case of shared ept, then the assigned
   device can't work well.

#3 Xen doesn't consider that case multiple devices may share one RMRR entry.
   This also is a damage between different VMs when we assign such devices to
   different VMs.

#4 Something like USB, is still restricted to current RMRR implementation. We
   should work out this case.

Design Overview
===============

First of all we need to make sure all resources don't overlap RMRR. And then
in case of shared ept, we can set these identity entries. And Certainly we will
group all devices associated to one same RMRR entry, then make sure all group
devices should be assigned to same VM.

1. Setup RMRR identity mapping

current status:
    * identity mapping only setup in non-shared ept case

proposal:

In non-shared ept case, IOMMU stuff always set those entries and RMRR is already
marked reserved in host so its fine enough. But in shared ept case, we need to
check any conflit, so we should follow up

  - gfn space unoccupied
        -> insert mapping: success.
            gfn:_mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct, p2m_access_rw
  - gfn space already occupied by 1:1 RMRR mapping
        -> do nothing; success.
  - gfn space already occupied by other mapping
        -> fail.

expectation:
    * only devices w/ non-conflicting RMRR can be assigned
    * fortunately this achieves the very initial intention to support IGD
      pass-through on BDW
    * provide last level protection in hypervisor to ensure a secure assignment

2. Check to reserve RMRR in guest physical level

current status:
    * Xen doesn't reserve RMRR in guest physical level

proposal:
    * Get RMRR info as necessary
    * Then check all potential points to reserve or skip RMRR

expectation:
    * Make sure all guest resources don't overlap RMRR range.

2.1 Expose RMRR to user space

current status:
    * Xen always record RMRR info into one list, acpi_rmrr_units, while parsing
      acpi. So we can retrieve these info by lookup that list.

proposal:
    * RMRR would be exposed by a new hypercall, which Jan already finished in
      current version but just expose all RMRR info unconditionally.
    * Furthermore we can expose RMRR on demand to diminish shrinking guest
      RAM/MMIO space.
    * So we will introduce a new parameter, 'rdm_forcecheck' and to collaborate
      with SBDFs to control which RMRR should be exposed:

        - We can set this parameter in .cfg file like,

            rdm_forcecheck = 1 => Of course this should be 0 by default.

        '1' means we should force check to reserve all ranges unconditionally.
        and if failed VM wouldn't be created successfully. This also can give
        user a chance to work well with later hotplug, even if not a device
        assignment while creating VM.

        If 0, we just check those assigned pci devices. As you know we already
        have such an existing hypercall to assign PCI devices, looks we can work
        directly under this hypercall to get that necessary SBDF to sort which
        RMRR should be handled. But obviously, we need to get these info before
        we populate guest memory to make sure these RMRR ranges should be
        excluded from guest memory. But unfortunately the memory populating
        takes place before a device assignment, so we can't live on that
        directly.

        But as we discussed it just benefit that assigned case to reorder that
        order, but not good to hotplug case. So we have to introduce a new
        DOMCTL to pass that global parameter with SBDF at the same time.

        For example, if we own these two RMRR entries,

        [00:14.0]    RMRR region: base_addr ab80a000 end_address ab81dfff
        [00:02.0]    RMRR region: base_addr ad000000 end_address af7fffff

        If 'rdm_forcecheck = 1', any caller to that hypercall may get these two
        entries. But if 'rdm_forcecheck = 0', and in .cfg file,

        #1 'pci=[00:14.0, 00:02.0]' -> The caller still get these two entries.
        #2 'pci=[00:02.0]' -> The caller just get one entry, 0xad000000:0xaf7fffff
        #2 'pci=[00:14.0]' -> The caller just get one entry, 0xab80a000:0xab81dfff
        #4 'pci=[others]' or no any pci configuration -> The caller get nothing.

        And ultimately, if we expose any RMRR entry at gfn,

        in non-shared ept case, 

        p2m table:  valid non-identity mapping, gfn:mfn != 1:1
        VT-D table: no such a mapping until set identity mapping,
                    gfn:_mfn(gfn) == 1:1 when we have a associated device
                    assignment.
        
        in shared ept case,

        p2m table\VT-D table:  always INVALID until set identity mapping,
                               gfn:_mfn(gfn) == 1:1 when we have a associated
                               device assignment.

        But if we don't expose any RMRR entry,

        in non-shared ept case, 

        p2m table:  valid non-identity mapping, gfn:mfn != 1:1
        VT-D table: no such a mapping until set identity mapping,
                    gfn:_mfn(gfn) == 1:1 when we have a associated device
                    assignment.

        in shared ept case,

        p2m table\VT-D table:  If guest RAM already cover gfn, we have sunc a
                               valid non-identity mapping, gfn:mfn != 1:1, but
                               we can't set any identity mapping again then
                               that associated device can't be assigned
                               successfully. If not, we'll set identity mapping,
                               gfn:_mfn(gfn) == 1:1, to work well.

 
expectation:
    * Provice a way to get necessary RMRR info.

Note here that new DOMCTL is already finished but need to fix some code style
issues.

2.2 Detect and avoid RMRR conflictions with guest RAM/MMIO

current status:
    * Currently Xen doesn't detect anything to avoid any conflict.

proposal:
    * We need to cover all points to check any conflict. Below lists places
      where reserved region conflictions should be detected:

        1>. libxc:setup_guest():modules_init()

        There are some modules, like acpi_module and smbios_module. They may
        occupy some ranges so we need to check if they're conflicting with
        all ranges from that new hypercall above.

        2>. libxc:setup_guest():xc_domain_populate_physmap()

        There are two ways to exclude RMRR ranges here:
            
            #1 Before we populate guest RAM without any RMRR range from that
               new hypercall to skip RMRR ranges when populating guest RAM.
            #2 In Xen we can fliter RMRR range while calling
               XENMEM_populate_physmap, its no change to libxc, but skip
               setting p2m entry for RMRR ranges in Xen hypervisor

        But to compare #1, #2 is not better since Xen still allocate those
        range, and we have to sync somewhere else in Xen. And #1 is cleaner
        because #2 actually shrinks the available memory size to the guest.

        3>. hvmloader:pci_setup()

        Here we should populate guest MMIO excluding all ranges from that new
        hypercall to detect RMRR conflictions for allocating PCI MMIO BAR.

        4>. hvmloader:mem_hole_alloc()

        Here we should populate mem holw excluding all ranges from that new
        hypercall to detect RMRR conflictions for dynamic allocation in
        hvmloader, e.g. as used for IGD opregion

        5>. hvmloader:build_e820_table():

        Finally we need let VM know that RMRR regions are reserved through e820
        table

    Its a little bit tricky to handle this inside hvmloader since as you know,
    struct hvm_info_table is only a approach between libxc and hvmloader. But
    however, making up all above places will bring some duplicated logic.
    Especially between libxc and hvmloader, which skip same regions in guest
    physical layer(one in populating guest RAM, the other in constructing e820)
    But current design has some limitation to pass information between libxc and
    hvmloader,

struct hvm_info_table {
    ...
    /*
     *  0x0 to page_to_phys(low_mem_pgend)-1:
     *    RAM below 4GB (except for VGA hole 0xA0000-0xBFFFF)
     */
    uint32_t    low_mem_pgend;
    ...
    /*
     *  0x100000000 to page_to_phys(high_mem_pgend)-1:
     *    RAM above 4GB
     */
    uint32_t    high_mem_pgend;
    ...
}

    nothing valuable is passed to hvmloader so we have to figure out a way to
    handle those points in hvmloader. Currently,

            #1 Reconstruct hvm_info_table{}

            We can store all necessary RMRR info into hvm_info_table as well,
            then we can pick them conveniently but oftentimes these RMRR
            entries are scattered and the number is also undertimined, so its
            a little bit ugly to do.

            #2 Xenstore

            We may store all necessary RMRR info as Xenstore then pick them in
            hvmloader.

            #3 A hypercall

            We may have to call our new hypercall again to pick them, but
            obviously this may introduce some duplicated codes.

            #4 XENMEM_{set_,}memory_map pair of hypercalls
            
            As Jan commented it "could be used(and is readily available to be
            extended that way, since for HVM domains XENMEM_set_memory_map
            returns -EPERM at present). The only potentially problematic aspect
            I can see with using it might be its limiting of the entry count to
            E820MAX." So a preparation patch is required before RMRR, and
            hvmloader still needs to query RMRR information.

expectation:
    * Cover all points to make sure guest doesn't occupt any RMRR range.

3. Group multiple devices sharing same RMRR entry

current status:
    * Xen doesn't distinguish if multiple devices share same RMRR entry. This
      may lead a leak to corruption, e.g. Device A and device B share same entry
      and then device A is assigned to domain A, device B is assigned to domain
      B. So domain A can read something from that range, even rewrite
      maliciously that range to corrupt device B inside domain B.

proposal:
    * We will group all devices by means of one same RMRR entry. Theoretically,
      we should make sure all devices in one group are allowed to assign to one
      VM. But in Xen side the hardware domain owns all devices at first, we
      should unbound all group devies before assign one of group device. But its
      hard to do such thing in current VT-d stuff. And actually its rare to have
      the group device in real world so we just prevent assigning any group
      device simply.
    * We will introduce two field, gid, in struct, acpi_rmrr_unit:
        gid: indicate if this device belongs a group.
      Then when we add or attach device in iommu side, we will check this field
      if we're assigning a group device then determine if we assign that.


expectation:
    * Make all device associated to one RMRR to be assigned same VM.

4. Handle in case of force accessing to RMRR regions

proposal:
    * Although we already reserve these ranges in guest e820 table, its
      possible to access these ranges.
      In non-shared ept case it will issue such EPT violation since we have no
      p2m table actually. But its same to access other reserved range so just
      live on Xen's default behavior. In shared-ept case guest can read or
      write anything from this range, but such a leak or corruption just
      happens inside same domain so this behavior is also same as a native
      case, it should be fine.

expectation:
    * Don't broken normal RMRR usage.

5. Re-enable USB

current status:
    * Currently Xen doesn't allow USB passthrough.

proposal:
    * Cleanup some legacy codes related to RMRR to accommodate our RMRR
      support.

expectation:
    * Make USB wor well after we handle RMRR properly.

6. Hotplug case

current status:
    * only work well in non-shared ept case or in case of shared ept & all
      associated gfns are free.

proposal:
    * If user ensure there'll be a hotplug device in advace, 'rdm_forcecheck'
      can be set to reserve all RMRR range as we describe above. Then any
      device can be hotplugged successfully.
    * If not, there are still two scenarios here:
      in non-shared ept case it still work well as original;
      in shared ept case, its just going a case of "1. Setup RMRR identity
      mapping"
        - gfn space unoccupied -> insert mapping: success.
        - gfn space already occupied by 1:1 RMRR mapping -> do nothing; success.
        - gfn space already occupied by other mapping -> fail.

expectation:
    * provide a way to let hotplug work in case of shared ept totally

Open Issue
==========

When other stuffs like ballon mechanism, to populate memory accessing RMRR
range, or others like qemu, force map RMRR range, we may need to avoid such a
possibility but we're not 100% sure this so just open this to double check.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RMRR Fix Design for Xen
  2014-12-19  1:21 RMRR Fix Design for Xen Tiejun Chen
@ 2014-12-19  9:49 ` Fabio Fantoni
  2014-12-19 15:13 ` Jan Beulich
  2015-01-05 17:01 ` George Dunlap
  2 siblings, 0 replies; 7+ messages in thread
From: Fabio Fantoni @ 2014-12-19  9:49 UTC (permalink / raw)
  To: Tiejun Chen, JBeulich, konrad.wilk, tim, kevin.tian,
	yang.z.zhang, stefano.stabellini, ian.campbell, ian.jackson,
	wei.liu2
  Cc: xen-devel

Il 19/12/2014 02:21, Tiejun Chen ha scritto:
>                      RMRR Fix Design for Xen
>
> This design is a goal to fix RMRR for Xen. It includes four sectors as
> follows:
>
>      * Background
>      * What is RMRR
>      * Current RMRR Issues
>      * Design Overview
>
> We hope this can help us to understand current problem then figure out a
> clean and better solution everyone can agree now to go forward.
>
> Background
> ==========
>
> We first identified this RMRR defect when trying to pass-through IGD device,
> which can be simply fixed by adding an identity mapping in case of shared
> EPT table. However along with the community discussion, it boiled down to
> a more general RMRR problem, i.e. the identity mapping is brute-added
> in hypervisor, w/o considering whether conflicting with an existing guest
> PFN ranges. As a general solution we need invent a new mechanism so
> reserved ranges allocated by hypervisor can be exported to the user space
> toolstack and hvmloader, so conflict can be detected when constructing
> guest PFN layout, with best-effort avoidance policies to further help.
>
> What is RMRR
> ============
>
> RMRR is a acronym for Reserved Memory Region Reporting.
>
> BIOS may report each such reserved memory region through the RMRR structures,
> along with the devices that requires access to the specified reserved memory
> region. Reserved memory ranges that are either not DMA targets, or memory
> ranges that may be target of BIOS initiated DMA only during pre-boot phase
> (such as from a boot disk drive) must not be included in the reserved memory
> region reporting. The base address of each RMRR region must be 4KB aligned and
> the size must be an integer multiple of 4KB. BIOS must report the RMRR reported
> memory addresses as reserved in the system memory map returned through methods
> suchas INT15, EFI GetMemoryMap etc. The reserved memory region reporting
> structures are optional. If there are no RMRR structures, the system software
> concludes that the platform does not have any reserved memory ranges that are
> DMA targets.
>
> The RMRR regions are expected to be used for legacy usages (such as USB, UMA
> Graphics, etc.) requiring reserved memory. Platform designers shouldavoid or
> limit use of reserved memory regions since these require system software to
> create holes in the DMA virtual address range available to system software and
> its drivers.
>
> The following is grabbed from my BDW:
>
> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ab80a000 end_address ab81dfff
> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ad000000 end_address af7fffff
>
> Here USB occupies 0xab80a000:0xab81dfff, IGD owns 0xad000000:0xaf7fffff.
>
> Note there are zero or more Reserved Memory Region Reporting (RMRR) in one given
> platform. And multiple devices may share one RMRR range. Additionally RMRR can
> go anyplace.
>
> Current RMRR Issues
> ===================
>
> #1 RMRR may conflict RAM, mmio or other ranges in Guest physical level.

Sorry if my question is not inherent, I don't have good knowledge about 
it, xen domUs require that all memory regions is correctly defined in 
hvmloader or do them automatically and correct in any emulated devices 
assigned to domUs?
I'm unable to have qxl vga working in linux domUs and unable to found 
the problem for now, I saw a memory warning in system logs of domU with 
qxl which makes me think that perhaps the differences in memory regions 
of qxl are not considered properly in hvmloader.
The warning I found in one Fedora domU with qxl is:
> ioremap error for 0xfc001000-0xfc002000, requested 0x10, got 0x0 
Here a post about with logs compared also with stdvga tests and kvm 
(with qxl full working) test:
http://lists.xen.org/archives/html/xen-devel/2013-10/msg00016.html
Here the qxl support for libxl patch updated to latest xen if someone 
want fast try it:
https://github.com/Fantu/Xen/commit/fadecf8d6ee0e8c7e421fafba67aa11879e8b8fe

Can someone tell me is can be a hvmloader memory problem or this is not 
related?

Thanks for any reply and sorry for my bad english.

>
> #2 Xen doesn't create RMRR mapping in case of shared ept, then the assigned
>     device can't work well.
>
> #3 Xen doesn't consider that case multiple devices may share one RMRR entry.
>     This also is a damage between different VMs when we assign such devices to
>     different VMs.
>
> #4 Something like USB, is still restricted to current RMRR implementation. We
>     should work out this case.
>
> Design Overview
> ===============
>
> First of all we need to make sure all resources don't overlap RMRR. And then
> in case of shared ept, we can set these identity entries. And Certainly we will
> group all devices associated to one same RMRR entry, then make sure all group
> devices should be assigned to same VM.
>
> 1. Setup RMRR identity mapping
>
> current status:
>      * identity mapping only setup in non-shared ept case
>
> proposal:
>
> In non-shared ept case, IOMMU stuff always set those entries and RMRR is already
> marked reserved in host so its fine enough. But in shared ept case, we need to
> check any conflit, so we should follow up
>
>    - gfn space unoccupied
>          -> insert mapping: success.
>              gfn:_mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct, p2m_access_rw
>    - gfn space already occupied by 1:1 RMRR mapping
>          -> do nothing; success.
>    - gfn space already occupied by other mapping
>          -> fail.
>
> expectation:
>      * only devices w/ non-conflicting RMRR can be assigned
>      * fortunately this achieves the very initial intention to support IGD
>        pass-through on BDW
>      * provide last level protection in hypervisor to ensure a secure assignment
>
> 2. Check to reserve RMRR in guest physical level
>
> current status:
>      * Xen doesn't reserve RMRR in guest physical level
>
> proposal:
>      * Get RMRR info as necessary
>      * Then check all potential points to reserve or skip RMRR
>
> expectation:
>      * Make sure all guest resources don't overlap RMRR range.
>
> 2.1 Expose RMRR to user space
>
> current status:
>      * Xen always record RMRR info into one list, acpi_rmrr_units, while parsing
>        acpi. So we can retrieve these info by lookup that list.
>
> proposal:
>      * RMRR would be exposed by a new hypercall, which Jan already finished in
>        current version but just expose all RMRR info unconditionally.
>      * Furthermore we can expose RMRR on demand to diminish shrinking guest
>        RAM/MMIO space.
>      * So we will introduce a new parameter, 'rdm_forcecheck' and to collaborate
>        with SBDFs to control which RMRR should be exposed:
>
>          - We can set this parameter in .cfg file like,
>
>              rdm_forcecheck = 1 => Of course this should be 0 by default.
>
>          '1' means we should force check to reserve all ranges unconditionally.
>          and if failed VM wouldn't be created successfully. This also can give
>          user a chance to work well with later hotplug, even if not a device
>          assignment while creating VM.
>
>          If 0, we just check those assigned pci devices. As you know we already
>          have such an existing hypercall to assign PCI devices, looks we can work
>          directly under this hypercall to get that necessary SBDF to sort which
>          RMRR should be handled. But obviously, we need to get these info before
>          we populate guest memory to make sure these RMRR ranges should be
>          excluded from guest memory. But unfortunately the memory populating
>          takes place before a device assignment, so we can't live on that
>          directly.
>
>          But as we discussed it just benefit that assigned case to reorder that
>          order, but not good to hotplug case. So we have to introduce a new
>          DOMCTL to pass that global parameter with SBDF at the same time.
>
>          For example, if we own these two RMRR entries,
>
>          [00:14.0]    RMRR region: base_addr ab80a000 end_address ab81dfff
>          [00:02.0]    RMRR region: base_addr ad000000 end_address af7fffff
>
>          If 'rdm_forcecheck = 1', any caller to that hypercall may get these two
>          entries. But if 'rdm_forcecheck = 0', and in .cfg file,
>
>          #1 'pci=[00:14.0, 00:02.0]' -> The caller still get these two entries.
>          #2 'pci=[00:02.0]' -> The caller just get one entry, 0xad000000:0xaf7fffff
>          #2 'pci=[00:14.0]' -> The caller just get one entry, 0xab80a000:0xab81dfff
>          #4 'pci=[others]' or no any pci configuration -> The caller get nothing.
>
>          And ultimately, if we expose any RMRR entry at gfn,
>
>          in non-shared ept case,
>
>          p2m table:  valid non-identity mapping, gfn:mfn != 1:1
>          VT-D table: no such a mapping until set identity mapping,
>                      gfn:_mfn(gfn) == 1:1 when we have a associated device
>                      assignment.
>          
>          in shared ept case,
>
>          p2m table\VT-D table:  always INVALID until set identity mapping,
>                                 gfn:_mfn(gfn) == 1:1 when we have a associated
>                                 device assignment.
>
>          But if we don't expose any RMRR entry,
>
>          in non-shared ept case,
>
>          p2m table:  valid non-identity mapping, gfn:mfn != 1:1
>          VT-D table: no such a mapping until set identity mapping,
>                      gfn:_mfn(gfn) == 1:1 when we have a associated device
>                      assignment.
>
>          in shared ept case,
>
>          p2m table\VT-D table:  If guest RAM already cover gfn, we have sunc a
>                                 valid non-identity mapping, gfn:mfn != 1:1, but
>                                 we can't set any identity mapping again then
>                                 that associated device can't be assigned
>                                 successfully. If not, we'll set identity mapping,
>                                 gfn:_mfn(gfn) == 1:1, to work well.
>
>   
> expectation:
>      * Provice a way to get necessary RMRR info.
>
> Note here that new DOMCTL is already finished but need to fix some code style
> issues.
>
> 2.2 Detect and avoid RMRR conflictions with guest RAM/MMIO
>
> current status:
>      * Currently Xen doesn't detect anything to avoid any conflict.
>
> proposal:
>      * We need to cover all points to check any conflict. Below lists places
>        where reserved region conflictions should be detected:
>
>          1>. libxc:setup_guest():modules_init()
>
>          There are some modules, like acpi_module and smbios_module. They may
>          occupy some ranges so we need to check if they're conflicting with
>          all ranges from that new hypercall above.
>
>          2>. libxc:setup_guest():xc_domain_populate_physmap()
>
>          There are two ways to exclude RMRR ranges here:
>              
>              #1 Before we populate guest RAM without any RMRR range from that
>                 new hypercall to skip RMRR ranges when populating guest RAM.
>              #2 In Xen we can fliter RMRR range while calling
>                 XENMEM_populate_physmap, its no change to libxc, but skip
>                 setting p2m entry for RMRR ranges in Xen hypervisor
>
>          But to compare #1, #2 is not better since Xen still allocate those
>          range, and we have to sync somewhere else in Xen. And #1 is cleaner
>          because #2 actually shrinks the available memory size to the guest.
>
>          3>. hvmloader:pci_setup()
>
>          Here we should populate guest MMIO excluding all ranges from that new
>          hypercall to detect RMRR conflictions for allocating PCI MMIO BAR.
>
>          4>. hvmloader:mem_hole_alloc()
>
>          Here we should populate mem holw excluding all ranges from that new
>          hypercall to detect RMRR conflictions for dynamic allocation in
>          hvmloader, e.g. as used for IGD opregion
>
>          5>. hvmloader:build_e820_table():
>
>          Finally we need let VM know that RMRR regions are reserved through e820
>          table
>
>      Its a little bit tricky to handle this inside hvmloader since as you know,
>      struct hvm_info_table is only a approach between libxc and hvmloader. But
>      however, making up all above places will bring some duplicated logic.
>      Especially between libxc and hvmloader, which skip same regions in guest
>      physical layer(one in populating guest RAM, the other in constructing e820)
>      But current design has some limitation to pass information between libxc and
>      hvmloader,
>
> struct hvm_info_table {
>      ...
>      /*
>       *  0x0 to page_to_phys(low_mem_pgend)-1:
>       *    RAM below 4GB (except for VGA hole 0xA0000-0xBFFFF)
>       */
>      uint32_t    low_mem_pgend;
>      ...
>      /*
>       *  0x100000000 to page_to_phys(high_mem_pgend)-1:
>       *    RAM above 4GB
>       */
>      uint32_t    high_mem_pgend;
>      ...
> }
>
>      nothing valuable is passed to hvmloader so we have to figure out a way to
>      handle those points in hvmloader. Currently,
>
>              #1 Reconstruct hvm_info_table{}
>
>              We can store all necessary RMRR info into hvm_info_table as well,
>              then we can pick them conveniently but oftentimes these RMRR
>              entries are scattered and the number is also undertimined, so its
>              a little bit ugly to do.
>
>              #2 Xenstore
>
>              We may store all necessary RMRR info as Xenstore then pick them in
>              hvmloader.
>
>              #3 A hypercall
>
>              We may have to call our new hypercall again to pick them, but
>              obviously this may introduce some duplicated codes.
>
>              #4 XENMEM_{set_,}memory_map pair of hypercalls
>              
>              As Jan commented it "could be used(and is readily available to be
>              extended that way, since for HVM domains XENMEM_set_memory_map
>              returns -EPERM at present). The only potentially problematic aspect
>              I can see with using it might be its limiting of the entry count to
>              E820MAX." So a preparation patch is required before RMRR, and
>              hvmloader still needs to query RMRR information.
>
> expectation:
>      * Cover all points to make sure guest doesn't occupt any RMRR range.
>
> 3. Group multiple devices sharing same RMRR entry
>
> current status:
>      * Xen doesn't distinguish if multiple devices share same RMRR entry. This
>        may lead a leak to corruption, e.g. Device A and device B share same entry
>        and then device A is assigned to domain A, device B is assigned to domain
>        B. So domain A can read something from that range, even rewrite
>        maliciously that range to corrupt device B inside domain B.
>
> proposal:
>      * We will group all devices by means of one same RMRR entry. Theoretically,
>        we should make sure all devices in one group are allowed to assign to one
>        VM. But in Xen side the hardware domain owns all devices at first, we
>        should unbound all group devies before assign one of group device. But its
>        hard to do such thing in current VT-d stuff. And actually its rare to have
>        the group device in real world so we just prevent assigning any group
>        device simply.
>      * We will introduce two field, gid, in struct, acpi_rmrr_unit:
>          gid: indicate if this device belongs a group.
>        Then when we add or attach device in iommu side, we will check this field
>        if we're assigning a group device then determine if we assign that.
>
>
> expectation:
>      * Make all device associated to one RMRR to be assigned same VM.
>
> 4. Handle in case of force accessing to RMRR regions
>
> proposal:
>      * Although we already reserve these ranges in guest e820 table, its
>        possible to access these ranges.
>        In non-shared ept case it will issue such EPT violation since we have no
>        p2m table actually. But its same to access other reserved range so just
>        live on Xen's default behavior. In shared-ept case guest can read or
>        write anything from this range, but such a leak or corruption just
>        happens inside same domain so this behavior is also same as a native
>        case, it should be fine.
>
> expectation:
>      * Don't broken normal RMRR usage.
>
> 5. Re-enable USB
>
> current status:
>      * Currently Xen doesn't allow USB passthrough.
>
> proposal:
>      * Cleanup some legacy codes related to RMRR to accommodate our RMRR
>        support.
>
> expectation:
>      * Make USB wor well after we handle RMRR properly.
>
> 6. Hotplug case
>
> current status:
>      * only work well in non-shared ept case or in case of shared ept & all
>        associated gfns are free.
>
> proposal:
>      * If user ensure there'll be a hotplug device in advace, 'rdm_forcecheck'
>        can be set to reserve all RMRR range as we describe above. Then any
>        device can be hotplugged successfully.
>      * If not, there are still two scenarios here:
>        in non-shared ept case it still work well as original;
>        in shared ept case, its just going a case of "1. Setup RMRR identity
>        mapping"
>          - gfn space unoccupied -> insert mapping: success.
>          - gfn space already occupied by 1:1 RMRR mapping -> do nothing; success.
>          - gfn space already occupied by other mapping -> fail.
>
> expectation:
>      * provide a way to let hotplug work in case of shared ept totally
>
> Open Issue
> ==========
>
> When other stuffs like ballon mechanism, to populate memory accessing RMRR
> range, or others like qemu, force map RMRR range, we may need to avoid such a
> possibility but we're not 100% sure this so just open this to double check.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RMRR Fix Design for Xen
  2014-12-19  1:21 RMRR Fix Design for Xen Tiejun Chen
  2014-12-19  9:49 ` Fabio Fantoni
@ 2014-12-19 15:13 ` Jan Beulich
  2014-12-22  2:11   ` Chen, Tiejun
  2015-01-05 17:01 ` George Dunlap
  2 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2014-12-19 15:13 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, yang.z.zhang

>>> On 19.12.14 at 02:21, <tiejun.chen@intel.com> wrote:
> #4 Something like USB, is still restricted to current RMRR implementation. We
>    should work out this case.

This can mean all or nothing. My understanding is that right now code
assumes that USB devices won't use their RMRR-specified memory
regions post-boot (kind of contrary to your earlier statement that in
such a case the regions shouldn't be listed in RMRRs in the first place).

> Design Overview
> ===============
> 
> First of all we need to make sure all resources don't overlap RMRR. And then
> in case of shared ept, we can set these identity entries. And Certainly we 
> will
> group all devices associated to one same RMRR entry, then make sure all 
> group
> devices should be assigned to same VM.
> 
> 1. Setup RMRR identity mapping
> 
> current status:
>     * identity mapping only setup in non-shared ept case
> 
> proposal:
> 
> In non-shared ept case, IOMMU stuff always set those entries and RMRR is 
> already marked reserved in host so its fine enough.

Is it? Where? Or am I misunderstanding the whole statement, likely
due to me silently replacing "host" by "guest" (since reservation in
host address spaces is of no interest here afaict)?

> But in shared ept case, we need to
> check any conflit, so we should follow up
> 
>   - gfn space unoccupied
>         -> insert mapping: success.
>             gfn:_mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct, p2m_access_rw
>   - gfn space already occupied by 1:1 RMRR mapping
>         -> do nothing; success.
>   - gfn space already occupied by other mapping
>         -> fail.
> 
> expectation:
>     * only devices w/ non-conflicting RMRR can be assigned
>     * fortunately this achieves the very initial intention to support IGD
>       pass-through on BDW

Are you trying to say here that doing the above is all you need for
your specific machine? If so, that's clearly not something to go into
a design document.

Also there's clearly an alternative proposal: Drop support for sharing
page tables. Your colleagues will surely have told you that we've
been considering this for quite some time, and had actually hoped
for them to do the necessary VT-d side work to allow for this without
causing performance regressions.

> 2.1 Expose RMRR to user space
> 
> current status:
>     * Xen always record RMRR info into one list, acpi_rmrr_units, while 
> parsing
>       acpi. So we can retrieve these info by lookup that list.
> 
> proposal:
>     * RMRR would be exposed by a new hypercall, which Jan already finished 
> in
>       current version but just expose all RMRR info unconditionally.
>     * Furthermore we can expose RMRR on demand to diminish shrinking guest
>       RAM/MMIO space.
>     * So we will introduce a new parameter, 'rdm_forcecheck' and to 
> collaborate
>       with SBDFs to control which RMRR should be exposed:
> 
>         - We can set this parameter in .cfg file like,
> 
>             rdm_forcecheck = 1 => Of course this should be 0 by default.
> 
>         '1' means we should force check to reserve all ranges 
> unconditionally.
>         and if failed VM wouldn't be created successfully. This also can 
> give
>         user a chance to work well with later hotplug, even if not a device
>         assignment while creating VM.
> 
>         If 0, we just check those assigned pci devices. As you know we already

assigned? Wasn't the plan to have a separate potentially-to-be-
assigned list? And I can only re-iterate that the name
"rdm_forcecheck" doesn't really express what you mean. Since
your intention is to check all devices (rather than do a check that
otherwise wouldn't be done), "rdm_all" or "rdm_check_all" would
seem closer to the intended behavior.

>         have such an existing hypercall to assign PCI devices, looks we can 
> work
>         directly under this hypercall to get that necessary SBDF to sort 
> which
>         RMRR should be handled. But obviously, we need to get these info 
> before
>         we populate guest memory to make sure these RMRR ranges should be
>         excluded from guest memory. But unfortunately the memory populating
>         takes place before a device assignment, so we can't live on that
>         directly.
> 
>         But as we discussed it just benefit that assigned case to reorder 
> that
>         order, but not good to hotplug case. So we have to introduce a new
>         DOMCTL to pass that global parameter with SBDF at the same time.
> 
>         For example, if we own these two RMRR entries,

"own" is confusing here, I assume you mean if there are such entries.

>         [00:14.0]    RMRR region: base_addr ab80a000 end_address ab81dfff
>         [00:02.0]    RMRR region: base_addr ad000000 end_address af7fffff
> 
>         If 'rdm_forcecheck = 1', any caller to that hypercall may get these two

s/may/will/ I suppose.

>         entries. But if 'rdm_forcecheck = 0', and in .cfg file,
> 
>         #1 'pci=[00:14.0, 00:02.0]' -> The caller still get these two entries.
>         #2 'pci=[00:02.0]' -> The caller just get one entry, 
> 0xad000000:0xaf7fffff
>         #2 'pci=[00:14.0]' -> The caller just get one entry, 
> 0xab80a000:0xab81dfff
>         #4 'pci=[others]' or no any pci configuration -> The caller get 
> nothing.

Oh, interesting, you dropped the idea of a separate list of possibly-
to-be-assigned devices. Why that?

>         And ultimately, if we expose any RMRR entry at gfn,
> 
>         in non-shared ept case, 
> 
>         p2m table:  valid non-identity mapping, gfn:mfn != 1:1

This would mean ...

>         VT-D table: no such a mapping until set identity mapping,
>                     gfn:_mfn(gfn) == 1:1 when we have a associated device
>                     assignment.

... the two mappings to go out of sync when such a 1:1 mapping gets
established. I think we should avoid such a situation if at all possible.

>         in shared ept case,
> 
>         p2m table\VT-D table:  always INVALID until set identity mapping,
>                                gfn:_mfn(gfn) == 1:1 when we have a 
> associated
>                                device assignment.
> 
>         But if we don't expose any RMRR entry,
> 
>         in non-shared ept case, 
> 
>         p2m table:  valid non-identity mapping, gfn:mfn != 1:1
>         VT-D table: no such a mapping until set identity mapping,
>                     gfn:_mfn(gfn) == 1:1 when we have a associated device
>                     assignment.
> 
>         in shared ept case,
> 
>         p2m table\VT-D table:  If guest RAM already cover gfn, we have sunc a
>                                valid non-identity mapping, gfn:mfn != 1:1, 
> but
>                                we can't set any identity mapping again then
>                                that associated device can't be assigned
>                                successfully. If not, we'll set identity 
> mapping,
>                                gfn:_mfn(gfn) == 1:1, to work well.

So in the end we'd still have various cases of different behavior,
when really it would be much better (if not a requirement) if guest
observable behavior wouldn't depend on implementation details
like (non-)shared page tables.

> 2.2 Detect and avoid RMRR conflictions with guest RAM/MMIO
> 
> current status:
>     * Currently Xen doesn't detect anything to avoid any conflict.
> 
> proposal:
>     * We need to cover all points to check any conflict. Below lists places
>       where reserved region conflictions should be detected:
> 
>         1>. libxc:setup_guest():modules_init()
> 
>         There are some modules, like acpi_module and smbios_module. They may
>         occupy some ranges so we need to check if they're conflicting with
>         all ranges from that new hypercall above.

I'm missing of what conflict resolution you expect to do here.
Following earlier discussion in the context of patch review made
pretty clear that this isn't really straightforward, having the
potential to prevent a guest from booting (e.g. when the virtual
BIOS conflicts with an RMRR).

>         2>. libxc:setup_guest():xc_domain_populate_physmap()
> 
>         There are two ways to exclude RMRR ranges here:
>             
>             #1 Before we populate guest RAM without any RMRR range from that
>                new hypercall to skip RMRR ranges when populating guest RAM.
>             #2 In Xen we can fliter RMRR range while calling
>                XENMEM_populate_physmap, its no change to libxc, but skip
>                setting p2m entry for RMRR ranges in Xen hypervisor
> 
>         But to compare #1, #2 is not better since Xen still allocate those
>         range, and we have to sync somewhere else in Xen. And #1 is cleaner
>         because #2 actually shrinks the available memory size to the guest.
> 
>         3>. hvmloader:pci_setup()
> 
>         Here we should populate guest MMIO excluding all ranges from that 
> new
>         hypercall to detect RMRR conflictions for allocating PCI MMIO BAR.
> 
>         4>. hvmloader:mem_hole_alloc()
> 
>         Here we should populate mem holw excluding all ranges from that new
>         hypercall to detect RMRR conflictions for dynamic allocation in
>         hvmloader, e.g. as used for IGD opregion
> 
>         5>. hvmloader:build_e820_table():
> 
>         Finally we need let VM know that RMRR regions are reserved through 
> e820
>         table
> 
>     Its a little bit tricky to handle this inside hvmloader since as you 
> know,
>     struct hvm_info_table is only a approach between libxc and hvmloader. 
> But
>     however, making up all above places will bring some duplicated logic.
>     Especially between libxc and hvmloader, which skip same regions in guest
>     physical layer(one in populating guest RAM, the other in constructing 
> e820)
>     But current design has some limitation to pass information between libxc 
> and
>     hvmloader,
> 
> struct hvm_info_table {
>     ...
>     /*
>      *  0x0 to page_to_phys(low_mem_pgend)-1:
>      *    RAM below 4GB (except for VGA hole 0xA0000-0xBFFFF)
>      */
>     uint32_t    low_mem_pgend;
>     ...
>     /*
>      *  0x100000000 to page_to_phys(high_mem_pgend)-1:
>      *    RAM above 4GB
>      */
>     uint32_t    high_mem_pgend;
>     ...
> }
> 
>     nothing valuable is passed to hvmloader so we have to figure out a way 
> to
>     handle those points in hvmloader. Currently,
> 
>             #1 Reconstruct hvm_info_table{}
> 
>             We can store all necessary RMRR info into hvm_info_table as 
> well,
>             then we can pick them conveniently but oftentimes these RMRR
>             entries are scattered and the number is also undertimined, so 
> its
>             a little bit ugly to do.
> 
>             #2 Xenstore
> 
>             We may store all necessary RMRR info as Xenstore then pick them 
> in
>             hvmloader.
> 
>             #3 A hypercall
> 
>             We may have to call our new hypercall again to pick them, but
>             obviously this may introduce some duplicated codes.
> 
>             #4 XENMEM_{set_,}memory_map pair of hypercalls
>             
>             As Jan commented it "could be used(and is readily available to 
> be
>             extended that way, since for HVM domains XENMEM_set_memory_map
>             returns -EPERM at present). The only potentially problematic 
> aspect
>             I can see with using it might be its limiting of the entry count 
> to
>             E820MAX." So a preparation patch is required before RMRR, and
>             hvmloader still needs to query RMRR information.

Somewhere above I'm missing the mentioning of the option to reorder
operations of hvmloader, to e.g. base PCI BAR assignment on the
previously set up E820 table, rather than assuming a (mostly) fixed
size window where to put these.

> 3. Group multiple devices sharing same RMRR entry
> 
> current status:
>     * Xen doesn't distinguish if multiple devices share same RMRR entry. 
> This
>       may lead a leak to corruption, e.g. Device A and device B share same 
> entry
>       and then device A is assigned to domain A, device B is assigned to 
> domain
>       B. So domain A can read something from that range, even rewrite
>       maliciously that range to corrupt device B inside domain B.
> 
> proposal:
>     * We will group all devices by means of one same RMRR entry. 
> Theoretically,
>       we should make sure all devices in one group are allowed to assign to 
> one
>       VM. But in Xen side the hardware domain owns all devices at first, we
>       should unbound all group devies before assign one of group device. But 
> its
>       hard to do such thing in current VT-d stuff. And actually its rare to 
> have
>       the group device in real world so we just prevent assigning any group
>       device simply.
>     * We will introduce two field, gid, in struct, acpi_rmrr_unit:
>         gid: indicate if this device belongs a group.
>       Then when we add or attach device in iommu side, we will check this 
> field
>       if we're assigning a group device then determine if we assign that.
> 
> 
> expectation:
>     * Make all device associated to one RMRR to be assigned same VM.

A few lines up you say you're intending to refuse such assignments
rather than making them happen in a consistent way - what's the
plan in the end?

> 4. Handle in case of force accessing to RMRR regions
> 
> proposal:
>     * Although we already reserve these ranges in guest e820 table, its
>       possible to access these ranges.
>       In non-shared ept case it will issue such EPT violation since we have 
> no
>       p2m table actually. But its same to access other reserved range so 
> just
>       live on Xen's default behavior. In shared-ept case guest can read or
>       write anything from this range, but such a leak or corruption just
>       happens inside same domain so this behavior is also same as a native
>       case, it should be fine.
> 
> expectation:
>     * Don't broken normal RMRR usage.

I'm not getting the purpose of this whole section.

> 5. Re-enable USB
> 
> current status:
>     * Currently Xen doesn't allow USB passthrough.

???

> 6. Hotplug case
> 
> current status:
>     * only work well in non-shared ept case or in case of shared ept & all
>       associated gfns are free.
> 
> proposal:
>     * If user ensure there'll be a hotplug device in advace, 
> 'rdm_forcecheck'
>       can be set to reserve all RMRR range as we describe above. Then any
>       device can be hotplugged successfully.
>     * If not, there are still two scenarios here:
>       in non-shared ept case it still work well as original;

So you continue to assume that the two tables going out of sync is
acceptable.

>       in shared ept case, its just going a case of "1. Setup RMRR identity
>       mapping"
>         - gfn space unoccupied -> insert mapping: success.
>         - gfn space already occupied by 1:1 RMRR mapping -> do nothing; 
> success.
>         - gfn space already occupied by other mapping -> fail.
> 
> expectation:
>     * provide a way to let hotplug work in case of shared ept totally
> 
> Open Issue
> ==========
> 
> When other stuffs like ballon mechanism, to populate memory accessing RMRR
> range, or others like qemu, force map RMRR range, we may need to avoid such 
> a
> possibility but we're not 100% sure this so just open this to double check.

I think a HVM guest can be expected to play by what E820 table
says is reserved. That includes not using such ranges for ballooning.
If it still does, it'll get what it deserves.

Jan

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RMRR Fix Design for Xen
  2014-12-19 15:13 ` Jan Beulich
@ 2014-12-22  2:11   ` Chen, Tiejun
  2014-12-22  4:48     ` Tian, Kevin
  0 siblings, 1 reply; 7+ messages in thread
From: Chen, Tiejun @ 2014-12-22  2:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, yang.z.zhang

Jan,

Thanks for your time but I'm not going to address your comments here. 
Because I heard this design is totally not satisfied your expectation. 
But this really was reviewed with several revisions by Kevin and Yang 
before sending in public...

Anyway, I guess the only thing what I can do is that, Kevin and Yang, or 
other appropriate guys should finish this design as you expect. So now 
I'd better not say anything to avoid bringing any inconvenience.

Tiejun

On 2014/12/19 23:13, Jan Beulich wrote:
>>>> On 19.12.14 at 02:21, <tiejun.chen@intel.com> wrote:
>> #4 Something like USB, is still restricted to current RMRR implementation. We
>>     should work out this case.
>
> This can mean all or nothing. My understanding is that right now code
> assumes that USB devices won't use their RMRR-specified memory
> regions post-boot (kind of contrary to your earlier statement that in
> such a case the regions shouldn't be listed in RMRRs in the first place).
>
>> Design Overview
>> ===============
>>
>> First of all we need to make sure all resources don't overlap RMRR. And then
>> in case of shared ept, we can set these identity entries. And Certainly we
>> will
>> group all devices associated to one same RMRR entry, then make sure all
>> group
>> devices should be assigned to same VM.
>>
>> 1. Setup RMRR identity mapping
>>
>> current status:
>>      * identity mapping only setup in non-shared ept case
>>
>> proposal:
>>
>> In non-shared ept case, IOMMU stuff always set those entries and RMRR is
>> already marked reserved in host so its fine enough.
>
> Is it? Where? Or am I misunderstanding the whole statement, likely
> due to me silently replacing "host" by "guest" (since reservation in
> host address spaces is of no interest here afaict)?
>
>> But in shared ept case, we need to
>> check any conflit, so we should follow up
>>
>>    - gfn space unoccupied
>>          -> insert mapping: success.
>>              gfn:_mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct, p2m_access_rw
>>    - gfn space already occupied by 1:1 RMRR mapping
>>          -> do nothing; success.
>>    - gfn space already occupied by other mapping
>>          -> fail.
>>
>> expectation:
>>      * only devices w/ non-conflicting RMRR can be assigned
>>      * fortunately this achieves the very initial intention to support IGD
>>        pass-through on BDW
>
> Are you trying to say here that doing the above is all you need for
> your specific machine? If so, that's clearly not something to go into
> a design document.
>
> Also there's clearly an alternative proposal: Drop support for sharing
> page tables. Your colleagues will surely have told you that we've
> been considering this for quite some time, and had actually hoped
> for them to do the necessary VT-d side work to allow for this without
> causing performance regressions.
>
>> 2.1 Expose RMRR to user space
>>
>> current status:
>>      * Xen always record RMRR info into one list, acpi_rmrr_units, while
>> parsing
>>        acpi. So we can retrieve these info by lookup that list.
>>
>> proposal:
>>      * RMRR would be exposed by a new hypercall, which Jan already finished
>> in
>>        current version but just expose all RMRR info unconditionally.
>>      * Furthermore we can expose RMRR on demand to diminish shrinking guest
>>        RAM/MMIO space.
>>      * So we will introduce a new parameter, 'rdm_forcecheck' and to
>> collaborate
>>        with SBDFs to control which RMRR should be exposed:
>>
>>          - We can set this parameter in .cfg file like,
>>
>>              rdm_forcecheck = 1 => Of course this should be 0 by default.
>>
>>          '1' means we should force check to reserve all ranges
>> unconditionally.
>>          and if failed VM wouldn't be created successfully. This also can
>> give
>>          user a chance to work well with later hotplug, even if not a device
>>          assignment while creating VM.
>>
>>          If 0, we just check those assigned pci devices. As you know we already
>
> assigned? Wasn't the plan to have a separate potentially-to-be-
> assigned list? And I can only re-iterate that the name
> "rdm_forcecheck" doesn't really express what you mean. Since
> your intention is to check all devices (rather than do a check that
> otherwise wouldn't be done), "rdm_all" or "rdm_check_all" would
> seem closer to the intended behavior.
>
>>          have such an existing hypercall to assign PCI devices, looks we can
>> work
>>          directly under this hypercall to get that necessary SBDF to sort
>> which
>>          RMRR should be handled. But obviously, we need to get these info
>> before
>>          we populate guest memory to make sure these RMRR ranges should be
>>          excluded from guest memory. But unfortunately the memory populating
>>          takes place before a device assignment, so we can't live on that
>>          directly.
>>
>>          But as we discussed it just benefit that assigned case to reorder
>> that
>>          order, but not good to hotplug case. So we have to introduce a new
>>          DOMCTL to pass that global parameter with SBDF at the same time.
>>
>>          For example, if we own these two RMRR entries,
>
> "own" is confusing here, I assume you mean if there are such entries.
>
>>          [00:14.0]    RMRR region: base_addr ab80a000 end_address ab81dfff
>>          [00:02.0]    RMRR region: base_addr ad000000 end_address af7fffff
>>
>>          If 'rdm_forcecheck = 1', any caller to that hypercall may get these two
>
> s/may/will/ I suppose.
>
>>          entries. But if 'rdm_forcecheck = 0', and in .cfg file,
>>
>>          #1 'pci=[00:14.0, 00:02.0]' -> The caller still get these two entries.
>>          #2 'pci=[00:02.0]' -> The caller just get one entry,
>> 0xad000000:0xaf7fffff
>>          #2 'pci=[00:14.0]' -> The caller just get one entry,
>> 0xab80a000:0xab81dfff
>>          #4 'pci=[others]' or no any pci configuration -> The caller get
>> nothing.
>
> Oh, interesting, you dropped the idea of a separate list of possibly-
> to-be-assigned devices. Why that?
>
>>          And ultimately, if we expose any RMRR entry at gfn,
>>
>>          in non-shared ept case,
>>
>>          p2m table:  valid non-identity mapping, gfn:mfn != 1:1
>
> This would mean ...
>
>>          VT-D table: no such a mapping until set identity mapping,
>>                      gfn:_mfn(gfn) == 1:1 when we have a associated device
>>                      assignment.
>
> ... the two mappings to go out of sync when such a 1:1 mapping gets
> established. I think we should avoid such a situation if at all possible.
>
>>          in shared ept case,
>>
>>          p2m table\VT-D table:  always INVALID until set identity mapping,
>>                                 gfn:_mfn(gfn) == 1:1 when we have a
>> associated
>>                                 device assignment.
>>
>>          But if we don't expose any RMRR entry,
>>
>>          in non-shared ept case,
>>
>>          p2m table:  valid non-identity mapping, gfn:mfn != 1:1
>>          VT-D table: no such a mapping until set identity mapping,
>>                      gfn:_mfn(gfn) == 1:1 when we have a associated device
>>                      assignment.
>>
>>          in shared ept case,
>>
>>          p2m table\VT-D table:  If guest RAM already cover gfn, we have sunc a
>>                                 valid non-identity mapping, gfn:mfn != 1:1,
>> but
>>                                 we can't set any identity mapping again then
>>                                 that associated device can't be assigned
>>                                 successfully. If not, we'll set identity
>> mapping,
>>                                 gfn:_mfn(gfn) == 1:1, to work well.
>
> So in the end we'd still have various cases of different behavior,
> when really it would be much better (if not a requirement) if guest
> observable behavior wouldn't depend on implementation details
> like (non-)shared page tables.
>
>> 2.2 Detect and avoid RMRR conflictions with guest RAM/MMIO
>>
>> current status:
>>      * Currently Xen doesn't detect anything to avoid any conflict.
>>
>> proposal:
>>      * We need to cover all points to check any conflict. Below lists places
>>        where reserved region conflictions should be detected:
>>
>>          1>. libxc:setup_guest():modules_init()
>>
>>          There are some modules, like acpi_module and smbios_module. They may
>>          occupy some ranges so we need to check if they're conflicting with
>>          all ranges from that new hypercall above.
>
> I'm missing of what conflict resolution you expect to do here.
> Following earlier discussion in the context of patch review made
> pretty clear that this isn't really straightforward, having the
> potential to prevent a guest from booting (e.g. when the virtual
> BIOS conflicts with an RMRR).
>
>>          2>. libxc:setup_guest():xc_domain_populate_physmap()
>>
>>          There are two ways to exclude RMRR ranges here:
>>
>>              #1 Before we populate guest RAM without any RMRR range from that
>>                 new hypercall to skip RMRR ranges when populating guest RAM.
>>              #2 In Xen we can fliter RMRR range while calling
>>                 XENMEM_populate_physmap, its no change to libxc, but skip
>>                 setting p2m entry for RMRR ranges in Xen hypervisor
>>
>>          But to compare #1, #2 is not better since Xen still allocate those
>>          range, and we have to sync somewhere else in Xen. And #1 is cleaner
>>          because #2 actually shrinks the available memory size to the guest.
>>
>>          3>. hvmloader:pci_setup()
>>
>>          Here we should populate guest MMIO excluding all ranges from that
>> new
>>          hypercall to detect RMRR conflictions for allocating PCI MMIO BAR.
>>
>>          4>. hvmloader:mem_hole_alloc()
>>
>>          Here we should populate mem holw excluding all ranges from that new
>>          hypercall to detect RMRR conflictions for dynamic allocation in
>>          hvmloader, e.g. as used for IGD opregion
>>
>>          5>. hvmloader:build_e820_table():
>>
>>          Finally we need let VM know that RMRR regions are reserved through
>> e820
>>          table
>>
>>      Its a little bit tricky to handle this inside hvmloader since as you
>> know,
>>      struct hvm_info_table is only a approach between libxc and hvmloader.
>> But
>>      however, making up all above places will bring some duplicated logic.
>>      Especially between libxc and hvmloader, which skip same regions in guest
>>      physical layer(one in populating guest RAM, the other in constructing
>> e820)
>>      But current design has some limitation to pass information between libxc
>> and
>>      hvmloader,
>>
>> struct hvm_info_table {
>>      ...
>>      /*
>>       *  0x0 to page_to_phys(low_mem_pgend)-1:
>>       *    RAM below 4GB (except for VGA hole 0xA0000-0xBFFFF)
>>       */
>>      uint32_t    low_mem_pgend;
>>      ...
>>      /*
>>       *  0x100000000 to page_to_phys(high_mem_pgend)-1:
>>       *    RAM above 4GB
>>       */
>>      uint32_t    high_mem_pgend;
>>      ...
>> }
>>
>>      nothing valuable is passed to hvmloader so we have to figure out a way
>> to
>>      handle those points in hvmloader. Currently,
>>
>>              #1 Reconstruct hvm_info_table{}
>>
>>              We can store all necessary RMRR info into hvm_info_table as
>> well,
>>              then we can pick them conveniently but oftentimes these RMRR
>>              entries are scattered and the number is also undertimined, so
>> its
>>              a little bit ugly to do.
>>
>>              #2 Xenstore
>>
>>              We may store all necessary RMRR info as Xenstore then pick them
>> in
>>              hvmloader.
>>
>>              #3 A hypercall
>>
>>              We may have to call our new hypercall again to pick them, but
>>              obviously this may introduce some duplicated codes.
>>
>>              #4 XENMEM_{set_,}memory_map pair of hypercalls
>>
>>              As Jan commented it "could be used(and is readily available to
>> be
>>              extended that way, since for HVM domains XENMEM_set_memory_map
>>              returns -EPERM at present). The only potentially problematic
>> aspect
>>              I can see with using it might be its limiting of the entry count
>> to
>>              E820MAX." So a preparation patch is required before RMRR, and
>>              hvmloader still needs to query RMRR information.
>
> Somewhere above I'm missing the mentioning of the option to reorder
> operations of hvmloader, to e.g. base PCI BAR assignment on the
> previously set up E820 table, rather than assuming a (mostly) fixed
> size window where to put these.
>
>> 3. Group multiple devices sharing same RMRR entry
>>
>> current status:
>>      * Xen doesn't distinguish if multiple devices share same RMRR entry.
>> This
>>        may lead a leak to corruption, e.g. Device A and device B share same
>> entry
>>        and then device A is assigned to domain A, device B is assigned to
>> domain
>>        B. So domain A can read something from that range, even rewrite
>>        maliciously that range to corrupt device B inside domain B.
>>
>> proposal:
>>      * We will group all devices by means of one same RMRR entry.
>> Theoretically,
>>        we should make sure all devices in one group are allowed to assign to
>> one
>>        VM. But in Xen side the hardware domain owns all devices at first, we
>>        should unbound all group devies before assign one of group device. But
>> its
>>        hard to do such thing in current VT-d stuff. And actually its rare to
>> have
>>        the group device in real world so we just prevent assigning any group
>>        device simply.
>>      * We will introduce two field, gid, in struct, acpi_rmrr_unit:
>>          gid: indicate if this device belongs a group.
>>        Then when we add or attach device in iommu side, we will check this
>> field
>>        if we're assigning a group device then determine if we assign that.
>>
>>
>> expectation:
>>      * Make all device associated to one RMRR to be assigned same VM.
>
> A few lines up you say you're intending to refuse such assignments
> rather than making them happen in a consistent way - what's the
> plan in the end?
>
>> 4. Handle in case of force accessing to RMRR regions
>>
>> proposal:
>>      * Although we already reserve these ranges in guest e820 table, its
>>        possible to access these ranges.
>>        In non-shared ept case it will issue such EPT violation since we have
>> no
>>        p2m table actually. But its same to access other reserved range so
>> just
>>        live on Xen's default behavior. In shared-ept case guest can read or
>>        write anything from this range, but such a leak or corruption just
>>        happens inside same domain so this behavior is also same as a native
>>        case, it should be fine.
>>
>> expectation:
>>      * Don't broken normal RMRR usage.
>
> I'm not getting the purpose of this whole section.
>
>> 5. Re-enable USB
>>
>> current status:
>>      * Currently Xen doesn't allow USB passthrough.
>
> ???
>
>> 6. Hotplug case
>>
>> current status:
>>      * only work well in non-shared ept case or in case of shared ept & all
>>        associated gfns are free.
>>
>> proposal:
>>      * If user ensure there'll be a hotplug device in advace,
>> 'rdm_forcecheck'
>>        can be set to reserve all RMRR range as we describe above. Then any
>>        device can be hotplugged successfully.
>>      * If not, there are still two scenarios here:
>>        in non-shared ept case it still work well as original;
>
> So you continue to assume that the two tables going out of sync is
> acceptable.
>
>>        in shared ept case, its just going a case of "1. Setup RMRR identity
>>        mapping"
>>          - gfn space unoccupied -> insert mapping: success.
>>          - gfn space already occupied by 1:1 RMRR mapping -> do nothing;
>> success.
>>          - gfn space already occupied by other mapping -> fail.
>>
>> expectation:
>>      * provide a way to let hotplug work in case of shared ept totally
>>
>> Open Issue
>> ==========
>>
>> When other stuffs like ballon mechanism, to populate memory accessing RMRR
>> range, or others like qemu, force map RMRR range, we may need to avoid such
>> a
>> possibility but we're not 100% sure this so just open this to double check.
>
> I think a HVM guest can be expected to play by what E820 table
> says is reserved. That includes not using such ranges for ballooning.
> If it still does, it'll get what it deserves.
>
> Jan
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RMRR Fix Design for Xen
  2014-12-22  2:11   ` Chen, Tiejun
@ 2014-12-22  4:48     ` Tian, Kevin
  0 siblings, 0 replies; 7+ messages in thread
From: Tian, Kevin @ 2014-12-22  4:48 UTC (permalink / raw)
  To: Chen, Tiejun, Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z

I'll work out a new design proposal based on below content and previous
discussions. Thanks Tiejun for your hard-working and Jan for your careful
reviews so far.

For below comment:
> > Also there's clearly an alternative proposal: Drop support for sharing
> > page tables. Your colleagues will surely have told you that we've
> > been considering this for quite some time, and had actually hoped
> > for them to do the necessary VT-d side work to allow for this without
> > causing performance regressions.

let's separate it from RMRR discussion, because RMRR issues are about
general p2m and thus orthogonal to the implementation difference between
shared or not-shared fact (though it did lead to different behaviors w/ current
bogus logic).

Thanks
Kevin

> From: Chen, Tiejun
> Sent: Monday, December 22, 2014 10:12 AM
> 
> Jan,
> 
> Thanks for your time but I'm not going to address your comments here.
> Because I heard this design is totally not satisfied your expectation.
> But this really was reviewed with several revisions by Kevin and Yang
> before sending in public...
> 
> Anyway, I guess the only thing what I can do is that, Kevin and Yang, or
> other appropriate guys should finish this design as you expect. So now
> I'd better not say anything to avoid bringing any inconvenience.
> 
> Tiejun
> 
> On 2014/12/19 23:13, Jan Beulich wrote:
> >>>> On 19.12.14 at 02:21, <tiejun.chen@intel.com> wrote:
> >> #4 Something like USB, is still restricted to current RMRR implementation.
> We
> >>     should work out this case.
> >
> > This can mean all or nothing. My understanding is that right now code
> > assumes that USB devices won't use their RMRR-specified memory
> > regions post-boot (kind of contrary to your earlier statement that in
> > such a case the regions shouldn't be listed in RMRRs in the first place).
> >
> >> Design Overview
> >> ===============
> >>
> >> First of all we need to make sure all resources don't overlap RMRR. And
> then
> >> in case of shared ept, we can set these identity entries. And Certainly we
> >> will
> >> group all devices associated to one same RMRR entry, then make sure all
> >> group
> >> devices should be assigned to same VM.
> >>
> >> 1. Setup RMRR identity mapping
> >>
> >> current status:
> >>      * identity mapping only setup in non-shared ept case
> >>
> >> proposal:
> >>
> >> In non-shared ept case, IOMMU stuff always set those entries and RMRR is
> >> already marked reserved in host so its fine enough.
> >
> > Is it? Where? Or am I misunderstanding the whole statement, likely
> > due to me silently replacing "host" by "guest" (since reservation in
> > host address spaces is of no interest here afaict)?
> >
> >> But in shared ept case, we need to
> >> check any conflit, so we should follow up
> >>
> >>    - gfn space unoccupied
> >>          -> insert mapping: success.
> >>              gfn:_mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct,
> p2m_access_rw
> >>    - gfn space already occupied by 1:1 RMRR mapping
> >>          -> do nothing; success.
> >>    - gfn space already occupied by other mapping
> >>          -> fail.
> >>
> >> expectation:
> >>      * only devices w/ non-conflicting RMRR can be assigned
> >>      * fortunately this achieves the very initial intention to support IGD
> >>        pass-through on BDW
> >
> > Are you trying to say here that doing the above is all you need for
> > your specific machine? If so, that's clearly not something to go into
> > a design document.
> >
> > Also there's clearly an alternative proposal: Drop support for sharing
> > page tables. Your colleagues will surely have told you that we've
> > been considering this for quite some time, and had actually hoped
> > for them to do the necessary VT-d side work to allow for this without
> > causing performance regressions.
> >
> >> 2.1 Expose RMRR to user space
> >>
> >> current status:
> >>      * Xen always record RMRR info into one list, acpi_rmrr_units, while
> >> parsing
> >>        acpi. So we can retrieve these info by lookup that list.
> >>
> >> proposal:
> >>      * RMRR would be exposed by a new hypercall, which Jan already
> finished
> >> in
> >>        current version but just expose all RMRR info unconditionally.
> >>      * Furthermore we can expose RMRR on demand to diminish
> shrinking guest
> >>        RAM/MMIO space.
> >>      * So we will introduce a new parameter, 'rdm_forcecheck' and to
> >> collaborate
> >>        with SBDFs to control which RMRR should be exposed:
> >>
> >>          - We can set this parameter in .cfg file like,
> >>
> >>              rdm_forcecheck = 1 => Of course this should be 0 by
> default.
> >>
> >>          '1' means we should force check to reserve all ranges
> >> unconditionally.
> >>          and if failed VM wouldn't be created successfully. This also can
> >> give
> >>          user a chance to work well with later hotplug, even if not a
> device
> >>          assignment while creating VM.
> >>
> >>          If 0, we just check those assigned pci devices. As you know we
> already
> >
> > assigned? Wasn't the plan to have a separate potentially-to-be-
> > assigned list? And I can only re-iterate that the name
> > "rdm_forcecheck" doesn't really express what you mean. Since
> > your intention is to check all devices (rather than do a check that
> > otherwise wouldn't be done), "rdm_all" or "rdm_check_all" would
> > seem closer to the intended behavior.
> >
> >>          have such an existing hypercall to assign PCI devices, looks we
> can
> >> work
> >>          directly under this hypercall to get that necessary SBDF to sort
> >> which
> >>          RMRR should be handled. But obviously, we need to get these
> info
> >> before
> >>          we populate guest memory to make sure these RMRR ranges
> should be
> >>          excluded from guest memory. But unfortunately the memory
> populating
> >>          takes place before a device assignment, so we can't live on that
> >>          directly.
> >>
> >>          But as we discussed it just benefit that assigned case to reorder
> >> that
> >>          order, but not good to hotplug case. So we have to introduce a
> new
> >>          DOMCTL to pass that global parameter with SBDF at the same
> time.
> >>
> >>          For example, if we own these two RMRR entries,
> >
> > "own" is confusing here, I assume you mean if there are such entries.
> >
> >>          [00:14.0]    RMRR region: base_addr ab80a000 end_address
> ab81dfff
> >>          [00:02.0]    RMRR region: base_addr ad000000 end_address
> af7fffff
> >>
> >>          If 'rdm_forcecheck = 1', any caller to that hypercall may get
> these two
> >
> > s/may/will/ I suppose.
> >
> >>          entries. But if 'rdm_forcecheck = 0', and in .cfg file,
> >>
> >>          #1 'pci=[00:14.0, 00:02.0]' -> The caller still get these two
> entries.
> >>          #2 'pci=[00:02.0]' -> The caller just get one entry,
> >> 0xad000000:0xaf7fffff
> >>          #2 'pci=[00:14.0]' -> The caller just get one entry,
> >> 0xab80a000:0xab81dfff
> >>          #4 'pci=[others]' or no any pci configuration -> The caller get
> >> nothing.
> >
> > Oh, interesting, you dropped the idea of a separate list of possibly-
> > to-be-assigned devices. Why that?
> >
> >>          And ultimately, if we expose any RMRR entry at gfn,
> >>
> >>          in non-shared ept case,
> >>
> >>          p2m table:  valid non-identity mapping, gfn:mfn != 1:1
> >
> > This would mean ...
> >
> >>          VT-D table: no such a mapping until set identity mapping,
> >>                      gfn:_mfn(gfn) == 1:1 when we have a associated
> device
> >>                      assignment.
> >
> > ... the two mappings to go out of sync when such a 1:1 mapping gets
> > established. I think we should avoid such a situation if at all possible.
> >
> >>          in shared ept case,
> >>
> >>          p2m table\VT-D table:  always INVALID until set identity
> mapping,
> >>                                 gfn:_mfn(gfn) == 1:1 when we have
> a
> >> associated
> >>                                 device assignment.
> >>
> >>          But if we don't expose any RMRR entry,
> >>
> >>          in non-shared ept case,
> >>
> >>          p2m table:  valid non-identity mapping, gfn:mfn != 1:1
> >>          VT-D table: no such a mapping until set identity mapping,
> >>                      gfn:_mfn(gfn) == 1:1 when we have a associated
> device
> >>                      assignment.
> >>
> >>          in shared ept case,
> >>
> >>          p2m table\VT-D table:  If guest RAM already cover gfn, we
> have sunc a
> >>                                 valid non-identity mapping,
> gfn:mfn != 1:1,
> >> but
> >>                                 we can't set any identity mapping
> again then
> >>                                 that associated device can't be
> assigned
> >>                                 successfully. If not, we'll set identity
> >> mapping,
> >>                                 gfn:_mfn(gfn) == 1:1, to work well.
> >
> > So in the end we'd still have various cases of different behavior,
> > when really it would be much better (if not a requirement) if guest
> > observable behavior wouldn't depend on implementation details
> > like (non-)shared page tables.
> >
> >> 2.2 Detect and avoid RMRR conflictions with guest RAM/MMIO
> >>
> >> current status:
> >>      * Currently Xen doesn't detect anything to avoid any conflict.
> >>
> >> proposal:
> >>      * We need to cover all points to check any conflict. Below lists places
> >>        where reserved region conflictions should be detected:
> >>
> >>          1>. libxc:setup_guest():modules_init()
> >>
> >>          There are some modules, like acpi_module and smbios_module.
> They may
> >>          occupy some ranges so we need to check if they're conflicting
> with
> >>          all ranges from that new hypercall above.
> >
> > I'm missing of what conflict resolution you expect to do here.
> > Following earlier discussion in the context of patch review made
> > pretty clear that this isn't really straightforward, having the
> > potential to prevent a guest from booting (e.g. when the virtual
> > BIOS conflicts with an RMRR).
> >
> >>          2>. libxc:setup_guest():xc_domain_populate_physmap()
> >>
> >>          There are two ways to exclude RMRR ranges here:
> >>
> >>              #1 Before we populate guest RAM without any RMRR
> range from that
> >>                 new hypercall to skip RMRR ranges when populating
> guest RAM.
> >>              #2 In Xen we can fliter RMRR range while calling
> >>                 XENMEM_populate_physmap, its no change to libxc,
> but skip
> >>                 setting p2m entry for RMRR ranges in Xen hypervisor
> >>
> >>          But to compare #1, #2 is not better since Xen still allocate those
> >>          range, and we have to sync somewhere else in Xen. And #1 is
> cleaner
> >>          because #2 actually shrinks the available memory size to the
> guest.
> >>
> >>          3>. hvmloader:pci_setup()
> >>
> >>          Here we should populate guest MMIO excluding all ranges
> from that
> >> new
> >>          hypercall to detect RMRR conflictions for allocating PCI MMIO
> BAR.
> >>
> >>          4>. hvmloader:mem_hole_alloc()
> >>
> >>          Here we should populate mem holw excluding all ranges from
> that new
> >>          hypercall to detect RMRR conflictions for dynamic allocation in
> >>          hvmloader, e.g. as used for IGD opregion
> >>
> >>          5>. hvmloader:build_e820_table():
> >>
> >>          Finally we need let VM know that RMRR regions are reserved
> through
> >> e820
> >>          table
> >>
> >>      Its a little bit tricky to handle this inside hvmloader since as you
> >> know,
> >>      struct hvm_info_table is only a approach between libxc and
> hvmloader.
> >> But
> >>      however, making up all above places will bring some duplicated
> logic.
> >>      Especially between libxc and hvmloader, which skip same regions in
> guest
> >>      physical layer(one in populating guest RAM, the other in
> constructing
> >> e820)
> >>      But current design has some limitation to pass information between
> libxc
> >> and
> >>      hvmloader,
> >>
> >> struct hvm_info_table {
> >>      ...
> >>      /*
> >>       *  0x0 to page_to_phys(low_mem_pgend)-1:
> >>       *    RAM below 4GB (except for VGA hole 0xA0000-0xBFFFF)
> >>       */
> >>      uint32_t    low_mem_pgend;
> >>      ...
> >>      /*
> >>       *  0x100000000 to page_to_phys(high_mem_pgend)-1:
> >>       *    RAM above 4GB
> >>       */
> >>      uint32_t    high_mem_pgend;
> >>      ...
> >> }
> >>
> >>      nothing valuable is passed to hvmloader so we have to figure out a
> way
> >> to
> >>      handle those points in hvmloader. Currently,
> >>
> >>              #1 Reconstruct hvm_info_table{}
> >>
> >>              We can store all necessary RMRR info into hvm_info_table
> as
> >> well,
> >>              then we can pick them conveniently but oftentimes these
> RMRR
> >>              entries are scattered and the number is also undertimined,
> so
> >> its
> >>              a little bit ugly to do.
> >>
> >>              #2 Xenstore
> >>
> >>              We may store all necessary RMRR info as Xenstore then
> pick them
> >> in
> >>              hvmloader.
> >>
> >>              #3 A hypercall
> >>
> >>              We may have to call our new hypercall again to pick them,
> but
> >>              obviously this may introduce some duplicated codes.
> >>
> >>              #4 XENMEM_{set_,}memory_map pair of hypercalls
> >>
> >>              As Jan commented it "could be used(and is readily
> available to
> >> be
> >>              extended that way, since for HVM domains
> XENMEM_set_memory_map
> >>              returns -EPERM at present). The only potentially
> problematic
> >> aspect
> >>              I can see with using it might be its limiting of the entry
> count
> >> to
> >>              E820MAX." So a preparation patch is required before
> RMRR, and
> >>              hvmloader still needs to query RMRR information.
> >
> > Somewhere above I'm missing the mentioning of the option to reorder
> > operations of hvmloader, to e.g. base PCI BAR assignment on the
> > previously set up E820 table, rather than assuming a (mostly) fixed
> > size window where to put these.
> >
> >> 3. Group multiple devices sharing same RMRR entry
> >>
> >> current status:
> >>      * Xen doesn't distinguish if multiple devices share same RMRR entry.
> >> This
> >>        may lead a leak to corruption, e.g. Device A and device B share
> same
> >> entry
> >>        and then device A is assigned to domain A, device B is assigned to
> >> domain
> >>        B. So domain A can read something from that range, even rewrite
> >>        maliciously that range to corrupt device B inside domain B.
> >>
> >> proposal:
> >>      * We will group all devices by means of one same RMRR entry.
> >> Theoretically,
> >>        we should make sure all devices in one group are allowed to
> assign to
> >> one
> >>        VM. But in Xen side the hardware domain owns all devices at first,
> we
> >>        should unbound all group devies before assign one of group device.
> But
> >> its
> >>        hard to do such thing in current VT-d stuff. And actually its rare to
> >> have
> >>        the group device in real world so we just prevent assigning any
> group
> >>        device simply.
> >>      * We will introduce two field, gid, in struct, acpi_rmrr_unit:
> >>          gid: indicate if this device belongs a group.
> >>        Then when we add or attach device in iommu side, we will check
> this
> >> field
> >>        if we're assigning a group device then determine if we assign that.
> >>
> >>
> >> expectation:
> >>      * Make all device associated to one RMRR to be assigned same VM.
> >
> > A few lines up you say you're intending to refuse such assignments
> > rather than making them happen in a consistent way - what's the
> > plan in the end?
> >
> >> 4. Handle in case of force accessing to RMRR regions
> >>
> >> proposal:
> >>      * Although we already reserve these ranges in guest e820 table, its
> >>        possible to access these ranges.
> >>        In non-shared ept case it will issue such EPT violation since we
> have
> >> no
> >>        p2m table actually. But its same to access other reserved range so
> >> just
> >>        live on Xen's default behavior. In shared-ept case guest can read
> or
> >>        write anything from this range, but such a leak or corruption just
> >>        happens inside same domain so this behavior is also same as a
> native
> >>        case, it should be fine.
> >>
> >> expectation:
> >>      * Don't broken normal RMRR usage.
> >
> > I'm not getting the purpose of this whole section.
> >
> >> 5. Re-enable USB
> >>
> >> current status:
> >>      * Currently Xen doesn't allow USB passthrough.
> >
> > ???
> >
> >> 6. Hotplug case
> >>
> >> current status:
> >>      * only work well in non-shared ept case or in case of shared ept & all
> >>        associated gfns are free.
> >>
> >> proposal:
> >>      * If user ensure there'll be a hotplug device in advace,
> >> 'rdm_forcecheck'
> >>        can be set to reserve all RMRR range as we describe above. Then
> any
> >>        device can be hotplugged successfully.
> >>      * If not, there are still two scenarios here:
> >>        in non-shared ept case it still work well as original;
> >
> > So you continue to assume that the two tables going out of sync is
> > acceptable.
> >
> >>        in shared ept case, its just going a case of "1. Setup RMRR identity
> >>        mapping"
> >>          - gfn space unoccupied -> insert mapping: success.
> >>          - gfn space already occupied by 1:1 RMRR mapping -> do
> nothing;
> >> success.
> >>          - gfn space already occupied by other mapping -> fail.
> >>
> >> expectation:
> >>      * provide a way to let hotplug work in case of shared ept totally
> >>
> >> Open Issue
> >> ==========
> >>
> >> When other stuffs like ballon mechanism, to populate memory accessing
> RMRR
> >> range, or others like qemu, force map RMRR range, we may need to avoid
> such
> >> a
> >> possibility but we're not 100% sure this so just open this to double check.
> >
> > I think a HVM guest can be expected to play by what E820 table
> > says is reserved. That includes not using such ranges for ballooning.
> > If it still does, it'll get what it deserves.
> >
> > Jan
> >

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RMRR Fix Design for Xen
  2014-12-19  1:21 RMRR Fix Design for Xen Tiejun Chen
  2014-12-19  9:49 ` Fabio Fantoni
  2014-12-19 15:13 ` Jan Beulich
@ 2015-01-05 17:01 ` George Dunlap
  2015-01-06  1:19   ` Chen, Tiejun
  2 siblings, 1 reply; 7+ messages in thread
From: George Dunlap @ 2015-01-05 17:01 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: Tian, Kevin, Wei Liu, Ian Campbell, Stefano Stabellini,
	Ian Jackson, Tim Deegan, xen-devel, Jan Beulich, Zhang, Yang Z

On Fri, Dec 19, 2014 at 1:21 AM, Tiejun Chen <tiejun.chen@intel.com> wrote:
>                     RMRR Fix Design for Xen
>
> This design is a goal to fix RMRR for Xen. It includes four sectors as
> follows:
>
>     * Background
>     * What is RMRR
>     * Current RMRR Issues
>     * Design Overview
>
> We hope this can help us to understand current problem then figure out a
> clean and better solution everyone can agree now to go forward.
>
> Background
> ==========
>
> We first identified this RMRR defect when trying to pass-through IGD device,
> which can be simply fixed by adding an identity mapping in case of shared
> EPT table. However along with the community discussion, it boiled down to
> a more general RMRR problem, i.e. the identity mapping is brute-added
> in hypervisor, w/o considering whether conflicting with an existing guest
> PFN ranges. As a general solution we need invent a new mechanism so
> reserved ranges allocated by hypervisor can be exported to the user space
> toolstack and hvmloader, so conflict can be detected when constructing
> guest PFN layout, with best-effort avoidance policies to further help.
>
> What is RMRR
> ============
>
> RMRR is a acronym for Reserved Memory Region Reporting.
>
> BIOS may report each such reserved memory region through the RMRR structures,
> along with the devices that requires access to the specified reserved memory
> region. Reserved memory ranges that are either not DMA targets, or memory
> ranges that may be target of BIOS initiated DMA only during pre-boot phase
> (such as from a boot disk drive) must not be included in the reserved memory
> region reporting. The base address of each RMRR region must be 4KB aligned and
> the size must be an integer multiple of 4KB. BIOS must report the RMRR reported
> memory addresses as reserved in the system memory map returned through methods
> suchas INT15, EFI GetMemoryMap etc. The reserved memory region reporting
> structures are optional. If there are no RMRR structures, the system software
> concludes that the platform does not have any reserved memory ranges that are
> DMA targets.
>
> The RMRR regions are expected to be used for legacy usages (such as USB, UMA
> Graphics, etc.) requiring reserved memory. Platform designers shouldavoid or
> limit use of reserved memory regions since these require system software to
> create holes in the DMA virtual address range available to system software and
> its drivers.
>
> The following is grabbed from my BDW:
>
> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ab80a000 end_address ab81dfff
> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ad000000 end_address af7fffff
>
> Here USB occupies 0xab80a000:0xab81dfff, IGD owns 0xad000000:0xaf7fffff.
>
> Note there are zero or more Reserved Memory Region Reporting (RMRR) in one given
> platform. And multiple devices may share one RMRR range. Additionally RMRR can
> go anyplace.

Tiejun,

Thanks for this document -- such a document is really helpful in
figuring out the best way to architect the solution to a problem.

I hope you don't mind me asking a few additional questions here.
You've said that:
* RMRR is a range used by devices (typically legacy devices such as
USB, but apparently also newer devices like IGD)
* RMRR ranges are reported by BIOSes
* RMRR ranges should be avoided by the guest.

I'm still missing a few things, however.

* In the case of passing through a virtual device, how does the
"range" apply wrt gpfn space and mfn space?  I assume in example
above, the range [ab80a000,ab81dfff] corresponds to an mfn range.
When passing through this device to the guest, do pfns
[ab80a000,ab81dfff] need to be mapped to the same mfn range (i.e., 1-1
mapping), or can they be mapped from somewhere else in pfn space?

* You've described the range, but later on you talk about Xen
"creating" RMRR mappings.  What does this mean?  Are there registers
that need to be written?  Do the ept / IOMMU tables need some kind of
special flags?

Thanks,
 -George

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: RMRR Fix Design for Xen
  2015-01-05 17:01 ` George Dunlap
@ 2015-01-06  1:19   ` Chen, Tiejun
  0 siblings, 0 replies; 7+ messages in thread
From: Chen, Tiejun @ 2015-01-06  1:19 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, Wei Liu, Ian Campbell, Stefano Stabellini,
	Ian Jackson, Tim Deegan, xen-devel, Jan Beulich, Zhang, Yang Z

On 2015/1/6 1:01, George Dunlap wrote:
> On Fri, Dec 19, 2014 at 1:21 AM, Tiejun Chen <tiejun.chen@intel.com> wrote:
>>                      RMRR Fix Design for Xen
>>
>> This design is a goal to fix RMRR for Xen. It includes four sectors as
>> follows:
>>
>>      * Background
>>      * What is RMRR
>>      * Current RMRR Issues
>>      * Design Overview
>>
>> We hope this can help us to understand current problem then figure out a
>> clean and better solution everyone can agree now to go forward.
>>
>> Background
>> ==========
>>
>> We first identified this RMRR defect when trying to pass-through IGD device,
>> which can be simply fixed by adding an identity mapping in case of shared
>> EPT table. However along with the community discussion, it boiled down to
>> a more general RMRR problem, i.e. the identity mapping is brute-added
>> in hypervisor, w/o considering whether conflicting with an existing guest
>> PFN ranges. As a general solution we need invent a new mechanism so
>> reserved ranges allocated by hypervisor can be exported to the user space
>> toolstack and hvmloader, so conflict can be detected when constructing
>> guest PFN layout, with best-effort avoidance policies to further help.
>>
>> What is RMRR
>> ============
>>
>> RMRR is a acronym for Reserved Memory Region Reporting.
>>
>> BIOS may report each such reserved memory region through the RMRR structures,
>> along with the devices that requires access to the specified reserved memory
>> region. Reserved memory ranges that are either not DMA targets, or memory
>> ranges that may be target of BIOS initiated DMA only during pre-boot phase
>> (such as from a boot disk drive) must not be included in the reserved memory
>> region reporting. The base address of each RMRR region must be 4KB aligned and
>> the size must be an integer multiple of 4KB. BIOS must report the RMRR reported
>> memory addresses as reserved in the system memory map returned through methods
>> suchas INT15, EFI GetMemoryMap etc. The reserved memory region reporting
>> structures are optional. If there are no RMRR structures, the system software
>> concludes that the platform does not have any reserved memory ranges that are
>> DMA targets.
>>
>> The RMRR regions are expected to be used for legacy usages (such as USB, UMA
>> Graphics, etc.) requiring reserved memory. Platform designers shouldavoid or
>> limit use of reserved memory regions since these require system software to
>> create holes in the DMA virtual address range available to system software and
>> its drivers.
>>
>> The following is grabbed from my BDW:
>>
>> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
>> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ab80a000 end_address ab81dfff
>> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
>> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ad000000 end_address af7fffff
>>
>> Here USB occupies 0xab80a000:0xab81dfff, IGD owns 0xad000000:0xaf7fffff.
>>
>> Note there are zero or more Reserved Memory Region Reporting (RMRR) in one given
>> platform. And multiple devices may share one RMRR range. Additionally RMRR can
>> go anyplace.
>
> Tiejun,
>
> Thanks for this document -- such a document is really helpful in
> figuring out the best way to architect the solution to a problem.
>
> I hope you don't mind me asking a few additional questions here.

Everything is fine to me :)

> You've said that:
> * RMRR is a range used by devices (typically legacy devices such as
> USB, but apparently also newer devices like IGD)
> * RMRR ranges are reported by BIOSes
> * RMRR ranges should be avoided by the guest.
>
> I'm still missing a few things, however.
>
> * In the case of passing through a virtual device, how does the
> "range" apply wrt gpfn space and mfn space?  I assume in example
> above, the range [ab80a000,ab81dfff] corresponds to an mfn range.
> When passing through this device to the guest, do pfns
> [ab80a000,ab81dfff] need to be mapped to the same mfn range (i.e., 1-1
> mapping), or can they be mapped from somewhere else in pfn space?

Always 1:1 mapping here.

>
> * You've described the range, but later on you talk about Xen
> "creating" RMRR mappings.  What does this mean?  Are there registers
> that need to be written?  Do the ept / IOMMU tables need some kind of
> special flags?

We don't use or introduce any special flags.

As you know Xen supports two scenarios. In case of non-shared ept, we 
just create VT-D table to cover those 1:1 mappings but if case of shared 
ept we always create and share such 1:1 mappings.

BTW, this v1 document is not good as a design review so Kevin sent out 
v2, "(v2) Design proposal for RMRR fix", again. I hope that can provide 
more as you expect.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-01-06  1:19 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-19  1:21 RMRR Fix Design for Xen Tiejun Chen
2014-12-19  9:49 ` Fabio Fantoni
2014-12-19 15:13 ` Jan Beulich
2014-12-22  2:11   ` Chen, Tiejun
2014-12-22  4:48     ` Tian, Kevin
2015-01-05 17:01 ` George Dunlap
2015-01-06  1:19   ` Chen, Tiejun

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.