All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V2 0/3] Add handling of extended regions (safe ranges) on Arm (Was "xen/memory: Introduce a hypercall to provide unallocated space")
@ 2021-09-10 18:18 Oleksandr Tyshchenko
  2021-09-10 18:18 ` [PATCH V2 1/3] xen: Introduce "gpaddr_bits" field to XEN_SYSCTL_physinfo Oleksandr Tyshchenko
                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Oleksandr Tyshchenko @ 2021-09-10 18:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Tyshchenko, Ian Jackson, Wei Liu, Anthony PERARD,
	Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall,
	Stefano Stabellini, Juergen Gross, Volodymyr Babchuk,
	Roger Pau Monné,
	Henry Wang, Bertrand Marquis, Wei Chen

From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>

You can find an initial discussion at [1] and [2].

The extended region (safe range) is a region of guest physical address space
which is unused and could be safely used to create grant/foreign mappings instead
of wasting real RAM pages from the domain memory for establishing these mappings.

The extended regions are chosen at the domain creation time and advertised
to it via "reg" property under hypervisor node in the guest device-tree
(the indexes for extended regions are 1...N).
No device tree bindings update is needed, guest infers the presense of extended
regions from the number of regions in "reg" property.
New compatible/property will be needed (but only after this patch [3] or alternative
goes in) to indicate that "region 0 is safe to use". Until this patch is merged it is
not safe to use extended regions for the grant table space.

The extended regions are calculated differently for direct mapped
Dom0 (with and without IOMMU) and non-direct mapped DomUs.

Please note the following limitations:
- The extended region feature is only supported for 64-bit domain.
- The ACPI case is not covered.

Xen patch series is also available at [4]. The corresponding Linux patch series is at [5] for now
(last 4 patches).

Tested on Renesas Salvator-X board + H3 ES3.0 SoC (Arm64) with updated virtio-disk backend [6]
running in Dom0 (256MB RAM) and DomD (2GB RAM). In both cases the backend pre-maps DomU memory
which is 3GB. All foreign memory gets mapped into extended regions (so the amount of RAM in
the backend domain is not reduced). No issues were observed.

[1] https://lore.kernel.org/xen-devel/1627489110-25633-1-git-send-email-olekstysh@gmail.com/
[2] https://lore.kernel.org/xen-devel/1631034578-12598-1-git-send-email-olekstysh@gmail.com/
[3] https://lore.kernel.org/xen-devel/1631228688-30347-1-git-send-email-olekstysh@gmail.com/
[4] https://github.com/otyshchenko1/xen/commits/map_opt_ml3
[5] https://github.com/otyshchenko1/linux/commits/map_opt_ml2
[6] https://github.com/otyshchenko1/virtio-disk/commits/map_opt_next

Oleksandr Tyshchenko (3):
  xen: Introduce "gpaddr_bits" field to XEN_SYSCTL_physinfo
  xen/arm: Add handling of extended regions for Dom0
  libxl/arm: Add handling of extended regions for DomU

 tools/include/libxl.h            |   7 ++
 tools/libs/light/libxl.c         |   2 +
 tools/libs/light/libxl_arm.c     |  89 ++++++++++++++-
 tools/libs/light/libxl_types.idl |   2 +
 xen/arch/arm/domain_build.c      | 226 ++++++++++++++++++++++++++++++++++++++-
 xen/arch/arm/sysctl.c            |   2 +
 xen/arch/x86/sysctl.c            |   2 +
 xen/include/public/sysctl.h      |   3 +-
 8 files changed, 328 insertions(+), 5 deletions(-)

-- 
2.7.4



^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH V2 1/3] xen: Introduce "gpaddr_bits" field to XEN_SYSCTL_physinfo
  2021-09-10 18:18 [PATCH V2 0/3] Add handling of extended regions (safe ranges) on Arm (Was "xen/memory: Introduce a hypercall to provide unallocated space") Oleksandr Tyshchenko
@ 2021-09-10 18:18 ` Oleksandr Tyshchenko
  2021-09-16 14:49   ` Jan Beulich
  2021-09-10 18:18 ` [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0 Oleksandr Tyshchenko
  2021-09-10 18:18 ` [PATCH V2 3/3] libxl/arm: Add handling of extended regions for DomU Oleksandr Tyshchenko
  2 siblings, 1 reply; 43+ messages in thread
From: Oleksandr Tyshchenko @ 2021-09-10 18:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Tyshchenko, Ian Jackson, Wei Liu, Anthony PERARD,
	Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall,
	Stefano Stabellini, Juergen Gross, Volodymyr Babchuk,
	Roger Pau Monné,
	Henry Wang, Bertrand Marquis, Wei Chen

From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>

We need to pass info about maximum supported guest address
space size to the toolstack on Arm in order to properly
calculate the base and size of the extended region (safe range)
for the guest. The extended region is unused address space which
could be safely used by domain for foreign/grant mappings on Arm.
The extended region itself will be handled by the subsequents
patch.

Use p2m_ipa_bits variable on Arm, the x86 equivalent is
hap_paddr_bits.

As we change the size of structure bump the interface version.

Suggested-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
---
Please note, that recent review comments [1] haven't been addressed yet.
It is not forgotten, some clarification is needed. It will be addressed for the next version.

[1] https://lore.kernel.org/xen-devel/973f5344-aa10-3ad6-ff02-ad5f358ad279@citrix.com/

Changes since RFC:
   - update patch subject/description
   - replace arch-specific sub-struct with common gpaddr_bits
     field and update code to reflect that
---
 tools/include/libxl.h            | 7 +++++++
 tools/libs/light/libxl.c         | 2 ++
 tools/libs/light/libxl_types.idl | 2 ++
 xen/arch/arm/sysctl.c            | 2 ++
 xen/arch/x86/sysctl.c            | 2 ++
 xen/include/public/sysctl.h      | 3 ++-
 6 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/include/libxl.h b/tools/include/libxl.h
index b9ba16d..da44944 100644
--- a/tools/include/libxl.h
+++ b/tools/include/libxl.h
@@ -855,6 +855,13 @@ typedef struct libxl__ctx libxl_ctx;
  */
 #define LIBXL_HAVE_PHYSINFO_MAX_POSSIBLE_MFN 1
 
+ /*
+  * LIBXL_HAVE_PHYSINFO_GPADDR_BITS
+  *
+  * If this is defined, libxl_physinfo has a "gpaddr_bits" field.
+  */
+ #define LIBXL_HAVE_PHYSINFO_GPADDR_BITS 1
+
 /*
  * LIBXL_HAVE_DOMINFO_OUTSTANDING_MEMKB 1
  *
diff --git a/tools/libs/light/libxl.c b/tools/libs/light/libxl.c
index 204eb0b..c86624f 100644
--- a/tools/libs/light/libxl.c
+++ b/tools/libs/light/libxl.c
@@ -405,6 +405,8 @@ int libxl_get_physinfo(libxl_ctx *ctx, libxl_physinfo *physinfo)
     physinfo->cap_vmtrace =
         !!(xcphysinfo.capabilities & XEN_SYSCTL_PHYSCAP_vmtrace);
 
+    physinfo->gpaddr_bits = xcphysinfo.gpaddr_bits;
+
     GC_FREE;
     return 0;
 }
diff --git a/tools/libs/light/libxl_types.idl b/tools/libs/light/libxl_types.idl
index 3f9fff6..f7437e4 100644
--- a/tools/libs/light/libxl_types.idl
+++ b/tools/libs/light/libxl_types.idl
@@ -1061,6 +1061,8 @@ libxl_physinfo = Struct("physinfo", [
     ("cap_shadow", bool),
     ("cap_iommu_hap_pt_share", bool),
     ("cap_vmtrace", bool),
+
+    ("gpaddr_bits", uint32),
     ], dir=DIR_OUT)
 
 libxl_connectorinfo = Struct("connectorinfo", [
diff --git a/xen/arch/arm/sysctl.c b/xen/arch/arm/sysctl.c
index f87944e..91dca4f 100644
--- a/xen/arch/arm/sysctl.c
+++ b/xen/arch/arm/sysctl.c
@@ -15,6 +15,8 @@
 void arch_do_physinfo(struct xen_sysctl_physinfo *pi)
 {
     pi->capabilities |= XEN_SYSCTL_PHYSCAP_hvm | XEN_SYSCTL_PHYSCAP_hap;
+
+    pi->gpaddr_bits = p2m_ipa_bits;
 }
 
 long arch_do_sysctl(struct xen_sysctl *sysctl,
diff --git a/xen/arch/x86/sysctl.c b/xen/arch/x86/sysctl.c
index aff52a1..7b14865 100644
--- a/xen/arch/x86/sysctl.c
+++ b/xen/arch/x86/sysctl.c
@@ -135,6 +135,8 @@ void arch_do_physinfo(struct xen_sysctl_physinfo *pi)
         pi->capabilities |= XEN_SYSCTL_PHYSCAP_hap;
     if ( IS_ENABLED(CONFIG_SHADOW_PAGING) )
         pi->capabilities |= XEN_SYSCTL_PHYSCAP_shadow;
+
+    pi->gpaddr_bits = hap_paddr_bits;
 }
 
 long arch_do_sysctl(
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 039ccf8..f53b42e 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -35,7 +35,7 @@
 #include "domctl.h"
 #include "physdev.h"
 
-#define XEN_SYSCTL_INTERFACE_VERSION 0x00000013
+#define XEN_SYSCTL_INTERFACE_VERSION 0x00000014
 
 /*
  * Read console content from Xen buffer ring.
@@ -120,6 +120,7 @@ struct xen_sysctl_physinfo {
     uint64_aligned_t outstanding_pages;
     uint64_aligned_t max_mfn; /* Largest possible MFN on this host */
     uint32_t hw_cap[8];
+    uint32_t gpaddr_bits;
 };
 
 /*
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-10 18:18 [PATCH V2 0/3] Add handling of extended regions (safe ranges) on Arm (Was "xen/memory: Introduce a hypercall to provide unallocated space") Oleksandr Tyshchenko
  2021-09-10 18:18 ` [PATCH V2 1/3] xen: Introduce "gpaddr_bits" field to XEN_SYSCTL_physinfo Oleksandr Tyshchenko
@ 2021-09-10 18:18 ` Oleksandr Tyshchenko
  2021-09-14  0:55   ` Stefano Stabellini
  2021-09-17 15:48   ` Julien Grall
  2021-09-10 18:18 ` [PATCH V2 3/3] libxl/arm: Add handling of extended regions for DomU Oleksandr Tyshchenko
  2 siblings, 2 replies; 43+ messages in thread
From: Oleksandr Tyshchenko @ 2021-09-10 18:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Julien Grall,
	Volodymyr Babchuk, Henry Wang, Bertrand Marquis, Wei Chen

From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>

The extended region (safe range) is a region of guest physical
address space which is unused and could be safely used to create
grant/foreign mappings instead of wasting real RAM pages from
the domain memory for establishing these mappings.

The extended regions are chosen at the domain creation time and
advertised to it via "reg" property under hypervisor node in
the guest device-tree. As region 0 is reserved for grant table
space (always present), the indexes for extended regions are 1...N.
If extended regions could not be allocated for some reason,
Xen doesn't fail and behaves as usual, so only inserts region 0.

Please note the following limitations:
- The extended region feature is only supported for 64-bit domain.
- The ACPI case is not covered.

***

As Dom0 is direct mapped domain on Arm (e.g. MFN == GFN)
the algorithm to choose extended regions for it is different
in comparison with the algorithm for non-direct mapped DomU.
What is more, that extended regions should be chosen differently
whether IOMMU is enabled or not.

Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
holes found in host device-tree if otherwise. Make sure that
extended regions are 2MB-aligned and located within maximum possible
addressable physical memory range. The maximum number of extended
regions is 128.

Suggested-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
---
Changes since RFC:
   - update patch description
   - drop uneeded "extended-region" DT property
---

 xen/arch/arm/domain_build.c | 226 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 224 insertions(+), 2 deletions(-)

diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index 206038d..070ec27 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -724,6 +724,196 @@ static int __init make_memory_node(const struct domain *d,
     return res;
 }
 
+static int __init add_ext_regions(unsigned long s, unsigned long e, void *data)
+{
+    struct meminfo *ext_regions = data;
+    paddr_t start, size;
+
+    if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
+        return 0;
+
+    /* Both start and size of the extended region should be 2MB aligned */
+    start = (s + SZ_2M - 1) & ~(SZ_2M - 1);
+    if ( start > e )
+        return 0;
+
+    size = (e - start + 1) & ~(SZ_2M - 1);
+    if ( !size )
+        return 0;
+
+    ext_regions->bank[ext_regions->nr_banks].start = start;
+    ext_regions->bank[ext_regions->nr_banks].size = size;
+    ext_regions->nr_banks ++;
+
+    return 0;
+}
+
+/*
+ * The extended regions will be prevalidated by the memory hotplug path
+ * in Linux which requires for any added address range to be within maximum
+ * possible addressable physical memory range for which the linear mapping
+ * could be created.
+ * For 48-bit VA space size the maximum addressable range are:
+ * 0x40000000 - 0x80003fffffff
+ */
+#define EXT_REGION_START   0x40000000ULL
+#define EXT_REGION_END     0x80003fffffffULL
+
+static int __init find_unallocated_memory(const struct kernel_info *kinfo,
+                                          struct meminfo *ext_regions)
+{
+    const struct meminfo *assign_mem = &kinfo->mem;
+    struct rangeset *unalloc_mem;
+    paddr_t start, end;
+    unsigned int i;
+    int res;
+
+    dt_dprintk("Find unallocated memory for extended regions\n");
+
+    unalloc_mem = rangeset_new(NULL, NULL, 0);
+    if ( !unalloc_mem )
+        return -ENOMEM;
+
+    /* Start with all available RAM */
+    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
+    {
+        start = bootinfo.mem.bank[i].start;
+        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size - 1;
+        res = rangeset_add_range(unalloc_mem, start, end);
+        if ( res )
+        {
+            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
+                   start, end);
+            goto out;
+        }
+    }
+
+    /* Remove RAM assigned to Dom0 */
+    for ( i = 0; i < assign_mem->nr_banks; i++ )
+    {
+        start = assign_mem->bank[i].start;
+        end = assign_mem->bank[i].start + assign_mem->bank[i].size - 1;
+        res = rangeset_remove_range(unalloc_mem, start, end);
+        if ( res )
+        {
+            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
+                   start, end);
+            goto out;
+        }
+    }
+
+    /* Remove reserved-memory regions */
+    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
+    {
+        start = bootinfo.reserved_mem.bank[i].start;
+        end = bootinfo.reserved_mem.bank[i].start +
+            bootinfo.reserved_mem.bank[i].size - 1;
+        res = rangeset_remove_range(unalloc_mem, start, end);
+        if ( res )
+        {
+            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
+                   start, end);
+            goto out;
+        }
+    }
+
+    /* Remove grant table region */
+    start = kinfo->gnttab_start;
+    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
+    res = rangeset_remove_range(unalloc_mem, start, end);
+    if ( res )
+    {
+        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
+               start, end);
+        goto out;
+    }
+
+    start = EXT_REGION_START;
+    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
+    res = rangeset_report_ranges(unalloc_mem, start, end,
+                                 add_ext_regions, ext_regions);
+    if ( res )
+        ext_regions->nr_banks = 0;
+    else if ( !ext_regions->nr_banks )
+        res = -ENOENT;
+
+out:
+    rangeset_destroy(unalloc_mem);
+
+    return res;
+}
+
+static int __init find_memory_holes(const struct kernel_info *kinfo,
+                                    struct meminfo *ext_regions)
+{
+    struct dt_device_node *np;
+    struct rangeset *mem_holes;
+    paddr_t start, end;
+    unsigned int i;
+    int res;
+
+    dt_dprintk("Find memory holes for extended regions\n");
+
+    mem_holes = rangeset_new(NULL, NULL, 0);
+    if ( !mem_holes )
+        return -ENOMEM;
+
+    /* Start with maximum possible addressable physical memory range */
+    start = EXT_REGION_START;
+    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
+    res = rangeset_add_range(mem_holes, start, end);
+    if ( res )
+    {
+        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
+               start, end);
+        goto out;
+    }
+
+    /* Remove all regions described by "reg" property (MMIO, RAM, etc) */
+    dt_for_each_device_node( dt_host, np )
+    {
+        unsigned int naddr;
+        u64 addr, size;
+
+        naddr = dt_number_of_address(np);
+
+        for ( i = 0; i < naddr; i++ )
+        {
+            res = dt_device_get_address(np, i, &addr, &size);
+            if ( res )
+            {
+                printk(XENLOG_ERR "Unable to retrieve address %u for %s\n",
+                       i, dt_node_full_name(np));
+                goto out;
+            }
+
+            start = addr & PAGE_MASK;
+            end = PAGE_ALIGN(addr + size) - 1;
+            res = rangeset_remove_range(mem_holes, start, end);
+            if ( res )
+            {
+                printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
+                       start, end);
+                goto out;
+            }
+        }
+    }
+
+    start = EXT_REGION_START;
+    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
+    res = rangeset_report_ranges(mem_holes, start, end,
+                                 add_ext_regions,  ext_regions);
+    if ( res )
+        ext_regions->nr_banks = 0;
+    else if ( !ext_regions->nr_banks )
+        res = -ENOENT;
+
+out:
+    rangeset_destroy(mem_holes);
+
+    return res;
+}
+
 static int __init make_hypervisor_node(struct domain *d,
                                        const struct kernel_info *kinfo,
                                        int addrcells, int sizecells)
@@ -731,11 +921,13 @@ static int __init make_hypervisor_node(struct domain *d,
     const char compat[] =
         "xen,xen-"__stringify(XEN_VERSION)"."__stringify(XEN_SUBVERSION)"\0"
         "xen,xen";
-    __be32 reg[4];
+    __be32 reg[(NR_MEM_BANKS + 1) * 4];
     gic_interrupt_t intr;
     __be32 *cells;
     int res;
     void *fdt = kinfo->fdt;
+    struct meminfo *ext_regions;
+    unsigned int i;
 
     dt_dprintk("Create hypervisor node\n");
 
@@ -757,12 +949,42 @@ static int __init make_hypervisor_node(struct domain *d,
     if ( res )
         return res;
 
+    ext_regions = xzalloc(struct meminfo);
+    if ( !ext_regions )
+        return -ENOMEM;
+
+    if ( is_32bit_domain(d) )
+        printk(XENLOG_WARNING "The extended region is only supported for 64-bit guest\n");
+    else
+    {
+        if ( !is_iommu_enabled(d) )
+            res = find_unallocated_memory(kinfo, ext_regions);
+        else
+            res = find_memory_holes(kinfo, ext_regions);
+
+        if ( res )
+            printk(XENLOG_WARNING "Failed to allocate extended regions\n");
+    }
+
     /* reg 0 is grant table space */
     cells = &reg[0];
     dt_child_set_range(&cells, addrcells, sizecells,
                        kinfo->gnttab_start, kinfo->gnttab_size);
+    /* reg 1...N are extended regions */
+    for ( i = 0; i < ext_regions->nr_banks; i++ )
+    {
+        u64 start = ext_regions->bank[i].start;
+        u64 size = ext_regions->bank[i].size;
+
+        dt_dprintk("Extended region %d: %#"PRIx64"->%#"PRIx64"\n",
+                   i, start, start + size);
+
+        dt_child_set_range(&cells, addrcells, sizecells, start, size);
+    }
+    xfree(ext_regions);
+
     res = fdt_property(fdt, "reg", reg,
-                       dt_cells_to_size(addrcells + sizecells));
+                       dt_cells_to_size(addrcells + sizecells) * (i + 1));
     if ( res )
         return res;
 
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH V2 3/3] libxl/arm: Add handling of extended regions for DomU
  2021-09-10 18:18 [PATCH V2 0/3] Add handling of extended regions (safe ranges) on Arm (Was "xen/memory: Introduce a hypercall to provide unallocated space") Oleksandr Tyshchenko
  2021-09-10 18:18 ` [PATCH V2 1/3] xen: Introduce "gpaddr_bits" field to XEN_SYSCTL_physinfo Oleksandr Tyshchenko
  2021-09-10 18:18 ` [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0 Oleksandr Tyshchenko
@ 2021-09-10 18:18 ` Oleksandr Tyshchenko
  2021-09-16 22:35   ` Stefano Stabellini
  2 siblings, 1 reply; 43+ messages in thread
From: Oleksandr Tyshchenko @ 2021-09-10 18:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Oleksandr Tyshchenko, Ian Jackson, Wei Liu, Anthony PERARD,
	Juergen Gross, Henry Wang, Bertrand Marquis, Wei Chen,
	Julien Grall, Stefano Stabellini

From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>

The extended region (safe range) is a region of guest physical
address space which is unused and could be safely used to create
grant/foreign mappings instead of wasting real RAM pages from
the domain memory for establishing these mappings.

The extended regions are chosen at the domain creation time and
advertised to it via "reg" property under hypervisor node in
the guest device-tree. As region 0 is reserved for grant table
space (always present), the indexes for extended regions are 1...N.
If extended regions could not be allocated for some reason,
Xen doesn't fail and behaves as usual, so only inserts region 0.

Please note the following limitations:
- The extended region feature is only supported for 64-bit domain.
- The ACPI case is not covered.

***

The algorithm to choose extended regions for non-direct mapped
DomU is simpler in comparison with the algorithm for direct mapped
Dom0. As we have a lot of unused space above 4GB, provide single
1GB-aligned region from the second RAM bank taking into the account
the maximum supported guest address space size and the amount of
memory assigned to the guest. The maximum size of the region is 128GB.

Suggested-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
---
Changes since RFC:
   - update patch description
   - drop uneeded "extended-region" DT property
   - clear reg array in finalise_ext_region() and add a TODO
---
 tools/libs/light/libxl_arm.c | 89 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 87 insertions(+), 2 deletions(-)

diff --git a/tools/libs/light/libxl_arm.c b/tools/libs/light/libxl_arm.c
index e3140a6..8c1d9d7 100644
--- a/tools/libs/light/libxl_arm.c
+++ b/tools/libs/light/libxl_arm.c
@@ -615,9 +615,12 @@ static int make_hypervisor_node(libxl__gc *gc, void *fdt,
                               "xen,xen");
     if (res) return res;
 
-    /* reg 0 is grant table space */
+    /*
+     * reg 0 is a placeholder for grant table space, reg 1 is a placeholder
+     * for the extended region.
+     */
     res = fdt_property_regs(gc, fdt, GUEST_ROOT_ADDRESS_CELLS, GUEST_ROOT_SIZE_CELLS,
-                            1,GUEST_GNTTAB_BASE, GUEST_GNTTAB_SIZE);
+                            2, 0, 0, 0, 0);
     if (res) return res;
 
     /*
@@ -1069,6 +1072,86 @@ static void finalise_one_node(libxl__gc *gc, void *fdt, const char *uname,
     }
 }
 
+#define ALIGN_UP_TO_GB(x)   (((x) + GB(1) - 1) & (~(GB(1) - 1)))
+
+#define EXT_REGION_SIZE   GB(128)
+
+static void finalise_ext_region(libxl__gc *gc, struct xc_dom_image *dom)
+{
+    void *fdt = dom->devicetree_blob;
+    uint32_t regs[(GUEST_ROOT_ADDRESS_CELLS + GUEST_ROOT_SIZE_CELLS) * 2];
+    be32 *cells = &regs[0];
+    uint64_t region_size = 0, region_base, bank1end_align, bank1end_max;
+    uint32_t gpaddr_bits;
+    libxl_physinfo info;
+    int offset, rc;
+
+    offset = fdt_path_offset(fdt, "/hypervisor");
+    assert(offset > 0);
+
+    if (strcmp(dom->guest_type, "xen-3.0-aarch64")) {
+        LOG(WARN, "The extended region is only supported for 64-bit guest");
+        goto out;
+    }
+
+    rc = libxl_get_physinfo(CTX, &info);
+    assert(!rc);
+
+    gpaddr_bits = info.gpaddr_bits;
+    assert(gpaddr_bits >= 32 && gpaddr_bits <= 48);
+
+    /*
+     * Try to allocate single 1GB-aligned extended region from the second RAM
+     * bank (above 4GB) taking into the account the maximum supported guest
+     * address space size and the amount of memory assigned to the guest.
+     * The maximum size of the region is 128GB.
+     */
+    bank1end_max = min(1ULL << gpaddr_bits, GUEST_RAM1_BASE + GUEST_RAM1_SIZE);
+    bank1end_align = GUEST_RAM1_BASE +
+        ALIGN_UP_TO_GB((uint64_t)dom->rambank_size[1] << XC_PAGE_SHIFT);
+
+    if (bank1end_max <= bank1end_align) {
+        LOG(WARN, "The extended region cannot be allocated, not enough space");
+        goto out;
+    }
+
+    if (bank1end_max - bank1end_align > EXT_REGION_SIZE) {
+        region_base = bank1end_max - EXT_REGION_SIZE;
+        region_size = EXT_REGION_SIZE;
+    } else {
+        region_base = bank1end_align;
+        region_size = bank1end_max - bank1end_align;
+    }
+
+out:
+    /*
+     * The first region for grant table space must be always present.
+     * If we managed to allocate the extended region then insert it as
+     * a second region.
+     * TODO If we failed to allocate the region, we end up inserting
+     * zero-sized region. This is because we don't know in advance when
+     * creating hypervisor node whether it would be possible to allocate
+     * a region, but we have to create a placeholder anyway. The Linux driver
+     * is able to deal with by checking the region size. We cannot choose
+     * a region when creating hypervisor node because the guest memory layout
+     * is not know at that moment (and dom->rambank_size[1] is empty).
+     * We need to find a way not to expose invalid regions.
+     */
+    memset(regs, 0, sizeof(regs));
+    set_range(&cells, GUEST_ROOT_ADDRESS_CELLS, GUEST_ROOT_SIZE_CELLS,
+              GUEST_GNTTAB_BASE, GUEST_GNTTAB_SIZE);
+    if (region_size > 0) {
+        LOG(DEBUG, "Extended region: %#"PRIx64"->%#"PRIx64"\n",
+            region_base, region_base + region_size);
+
+        set_range(&cells, GUEST_ROOT_ADDRESS_CELLS, GUEST_ROOT_SIZE_CELLS,
+                  region_base, region_size);
+    }
+
+    rc = fdt_setprop_inplace(fdt, offset, "reg", regs, sizeof(regs));
+    assert(!rc);
+}
+
 int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
                                                uint32_t domid,
                                                libxl_domain_config *d_config,
@@ -1109,6 +1192,8 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
 
     }
 
+    finalise_ext_region(gc, dom);
+
     for (i = 0; i < GUEST_RAM_BANKS; i++) {
         const uint64_t size = (uint64_t)dom->rambank_size[i] << XC_PAGE_SHIFT;
 
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-10 18:18 ` [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0 Oleksandr Tyshchenko
@ 2021-09-14  0:55   ` Stefano Stabellini
  2021-09-15 19:10     ` Oleksandr
  2021-09-17 15:48   ` Julien Grall
  1 sibling, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2021-09-14  0:55 UTC (permalink / raw)
  To: Oleksandr Tyshchenko
  Cc: xen-devel, Oleksandr Tyshchenko, Stefano Stabellini,
	Julien Grall, Volodymyr Babchuk, Henry Wang, Bertrand Marquis,
	Wei Chen

On Fri, 10 Sep 2021, Oleksandr Tyshchenko wrote:
> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> 
> The extended region (safe range) is a region of guest physical
> address space which is unused and could be safely used to create
> grant/foreign mappings instead of wasting real RAM pages from
> the domain memory for establishing these mappings.
> 
> The extended regions are chosen at the domain creation time and
> advertised to it via "reg" property under hypervisor node in
> the guest device-tree. As region 0 is reserved for grant table
> space (always present), the indexes for extended regions are 1...N.
> If extended regions could not be allocated for some reason,
> Xen doesn't fail and behaves as usual, so only inserts region 0.
> 
> Please note the following limitations:
> - The extended region feature is only supported for 64-bit domain.
> - The ACPI case is not covered.
> 
> ***
> 
> As Dom0 is direct mapped domain on Arm (e.g. MFN == GFN)
> the algorithm to choose extended regions for it is different
> in comparison with the algorithm for non-direct mapped DomU.
> What is more, that extended regions should be chosen differently
> whether IOMMU is enabled or not.
> 
> Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
> holes found in host device-tree if otherwise. Make sure that
> extended regions are 2MB-aligned and located within maximum possible
> addressable physical memory range. The maximum number of extended
> regions is 128.
> 
> Suggested-by: Julien Grall <jgrall@amazon.com>
> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> ---
> Changes since RFC:
>    - update patch description
>    - drop uneeded "extended-region" DT property
> ---
> 
>  xen/arch/arm/domain_build.c | 226 +++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 224 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> index 206038d..070ec27 100644
> --- a/xen/arch/arm/domain_build.c
> +++ b/xen/arch/arm/domain_build.c
> @@ -724,6 +724,196 @@ static int __init make_memory_node(const struct domain *d,
>      return res;
>  }
>  
> +static int __init add_ext_regions(unsigned long s, unsigned long e, void *data)
> +{
> +    struct meminfo *ext_regions = data;
> +    paddr_t start, size;
> +
> +    if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
> +        return 0;
> +
> +    /* Both start and size of the extended region should be 2MB aligned */
> +    start = (s + SZ_2M - 1) & ~(SZ_2M - 1);
> +    if ( start > e )
> +        return 0;
> +
> +    size = (e - start + 1) & ~(SZ_2M - 1);
> +    if ( !size )
> +        return 0;

Can't you align size as well?

  size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);


> +    ext_regions->bank[ext_regions->nr_banks].start = start;
> +    ext_regions->bank[ext_regions->nr_banks].size = size;
> +    ext_regions->nr_banks ++;
                            ^ no space

> +    return 0;
> +}
> +
> +/*
> + * The extended regions will be prevalidated by the memory hotplug path
> + * in Linux which requires for any added address range to be within maximum
> + * possible addressable physical memory range for which the linear mapping
> + * could be created.
> + * For 48-bit VA space size the maximum addressable range are:
> + * 0x40000000 - 0x80003fffffff

Please don't make Linux-specific comments in Xen code for interfaces
that are supposed to be OS-agnostic.


> + */
> +#define EXT_REGION_START   0x40000000ULL
> +#define EXT_REGION_END     0x80003fffffffULL
> +
> +static int __init find_unallocated_memory(const struct kernel_info *kinfo,
> +                                          struct meminfo *ext_regions)
> +{
> +    const struct meminfo *assign_mem = &kinfo->mem;
> +    struct rangeset *unalloc_mem;
> +    paddr_t start, end;
> +    unsigned int i;
> +    int res;
> +
> +    dt_dprintk("Find unallocated memory for extended regions\n");
> +
> +    unalloc_mem = rangeset_new(NULL, NULL, 0);
> +    if ( !unalloc_mem )
> +        return -ENOMEM;
> +
> +    /* Start with all available RAM */
> +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
> +    {
> +        start = bootinfo.mem.bank[i].start;
> +        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size - 1;

Is the -1 needed? Isn't it going to screw up the size calculation later?


> +        res = rangeset_add_range(unalloc_mem, start, end);
> +        if ( res )
> +        {
> +            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> +                   start, end);
> +            goto out;
> +        }
> +    }
> +
> +    /* Remove RAM assigned to Dom0 */
> +    for ( i = 0; i < assign_mem->nr_banks; i++ )
> +    {
> +        start = assign_mem->bank[i].start;
> +        end = assign_mem->bank[i].start + assign_mem->bank[i].size - 1;
> +        res = rangeset_remove_range(unalloc_mem, start, end);
> +        if ( res )
> +        {
> +            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> +                   start, end);
> +            goto out;
> +        }
> +    }
> +
> +    /* Remove reserved-memory regions */
> +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
> +    {
> +        start = bootinfo.reserved_mem.bank[i].start;
> +        end = bootinfo.reserved_mem.bank[i].start +
> +            bootinfo.reserved_mem.bank[i].size - 1;
> +        res = rangeset_remove_range(unalloc_mem, start, end);
> +        if ( res )
> +        {
> +            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> +                   start, end);
> +            goto out;
> +        }
> +    }
> +
> +    /* Remove grant table region */
> +    start = kinfo->gnttab_start;
> +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
> +    res = rangeset_remove_range(unalloc_mem, start, end);
> +    if ( res )
> +    {
> +        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> +               start, end);
> +        goto out;
> +    }
> +
> +    start = EXT_REGION_START;
> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> +    res = rangeset_report_ranges(unalloc_mem, start, end,
> +                                 add_ext_regions, ext_regions);
> +    if ( res )
> +        ext_regions->nr_banks = 0;
> +    else if ( !ext_regions->nr_banks )
> +        res = -ENOENT;
> +
> +out:
> +    rangeset_destroy(unalloc_mem);
> +
> +    return res;
> +}
> +
> +static int __init find_memory_holes(const struct kernel_info *kinfo,
> +                                    struct meminfo *ext_regions)
> +{
> +    struct dt_device_node *np;
> +    struct rangeset *mem_holes;
> +    paddr_t start, end;
> +    unsigned int i;
> +    int res;
> +
> +    dt_dprintk("Find memory holes for extended regions\n");
> +
> +    mem_holes = rangeset_new(NULL, NULL, 0);
> +    if ( !mem_holes )
> +        return -ENOMEM;
> +
> +    /* Start with maximum possible addressable physical memory range */
> +    start = EXT_REGION_START;
> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> +    res = rangeset_add_range(mem_holes, start, end);
> +    if ( res )
> +    {
> +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> +               start, end);
> +        goto out;
> +    }
> +
> +    /* Remove all regions described by "reg" property (MMIO, RAM, etc) */
> +    dt_for_each_device_node( dt_host, np )

Don't you need something like device_tree_for_each_node ?
dt_for_each_device_node won't go down any deeper in the tree?

Alternatively, maybe we could simply record the highest possible address
of any memory/device/anything as we scan the device tree with
handle_node. Then we can use that as the starting point here. So that we
don't need to scan the device tree twice, and also we don't need my
suggestion below to remove 1GB-aligned 1GB-multiple regions.


> +    {
> +        unsigned int naddr;
> +        u64 addr, size;
> +
> +        naddr = dt_number_of_address(np);
> +
> +        for ( i = 0; i < naddr; i++ )
> +        {
> +            res = dt_device_get_address(np, i, &addr, &size);
> +            if ( res )
> +            {
> +                printk(XENLOG_ERR "Unable to retrieve address %u for %s\n",
> +                       i, dt_node_full_name(np));

It might be possible for a device not to have a range if it doesn't have
any MMIO regions, right? For instance, certain ARM timer nodes. I would
not print any errors and continue.


> +                goto out;
> +            }
> +
> +            start = addr & PAGE_MASK;
> +            end = PAGE_ALIGN(addr + size) - 1;
> +            res = rangeset_remove_range(mem_holes, start, end);
> +            if ( res )
> +            {
> +                printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> +                       start, end);
> +                goto out;
> +            }
> +        }
> +    }

As is, it will result in a myriad of small ranges which is unuseful and
slow to parse. I suggest to simplify it by removing a larger region than
strictly necessary. For instance, you could remove a 1GB-aligned and
1GB-multiple region for each range. That way, you are going to get fewer
large free ranges instance of many small ones which we don't need.


> +    start = EXT_REGION_START;
> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> +    res = rangeset_report_ranges(mem_holes, start, end,
> +                                 add_ext_regions,  ext_regions);
> +    if ( res )
> +        ext_regions->nr_banks = 0;
> +    else if ( !ext_regions->nr_banks )
> +        res = -ENOENT;
> +
> +out:
> +    rangeset_destroy(mem_holes);
> +
> +    return res;
> +}
> +
>  static int __init make_hypervisor_node(struct domain *d,
>                                         const struct kernel_info *kinfo,
>                                         int addrcells, int sizecells)
> @@ -731,11 +921,13 @@ static int __init make_hypervisor_node(struct domain *d,
>      const char compat[] =
>          "xen,xen-"__stringify(XEN_VERSION)"."__stringify(XEN_SUBVERSION)"\0"
>          "xen,xen";
> -    __be32 reg[4];
> +    __be32 reg[(NR_MEM_BANKS + 1) * 4];

If you are xzalloc'ing struct meminfo then you might as well xzalloc reg
too. Or keep both on the stack with a lower NR_MEM_BANKS.


>      gic_interrupt_t intr;
>      __be32 *cells;
>      int res;
>      void *fdt = kinfo->fdt;
> +    struct meminfo *ext_regions;
> +    unsigned int i;
>  
>      dt_dprintk("Create hypervisor node\n");
>  
> @@ -757,12 +949,42 @@ static int __init make_hypervisor_node(struct domain *d,
>      if ( res )
>          return res;
>  
> +    ext_regions = xzalloc(struct meminfo);
> +    if ( !ext_regions )
> +        return -ENOMEM;
> +
> +    if ( is_32bit_domain(d) )
> +        printk(XENLOG_WARNING "The extended region is only supported for 64-bit guest\n");

This is going to add an unconditional warning to all 32bit boots. I
would skip it entirely or only keep it as XENLOG_DEBUG.


> +    else
> +    {
> +        if ( !is_iommu_enabled(d) )
> +            res = find_unallocated_memory(kinfo, ext_regions);
> +        else
> +            res = find_memory_holes(kinfo, ext_regions);
> +
> +        if ( res )
> +            printk(XENLOG_WARNING "Failed to allocate extended regions\n");
> +    }
> +
>      /* reg 0 is grant table space */
>      cells = &reg[0];
>      dt_child_set_range(&cells, addrcells, sizecells,
>                         kinfo->gnttab_start, kinfo->gnttab_size);
> +    /* reg 1...N are extended regions */
> +    for ( i = 0; i < ext_regions->nr_banks; i++ )
> +    {
> +        u64 start = ext_regions->bank[i].start;
> +        u64 size = ext_regions->bank[i].size;
> +
> +        dt_dprintk("Extended region %d: %#"PRIx64"->%#"PRIx64"\n",
> +                   i, start, start + size);
> +
> +        dt_child_set_range(&cells, addrcells, sizecells, start, size);
> +    }
> +    xfree(ext_regions);
> +
>      res = fdt_property(fdt, "reg", reg,
> -                       dt_cells_to_size(addrcells + sizecells));
> +                       dt_cells_to_size(addrcells + sizecells) * (i + 1));
>      if ( res )
>          return res;
>  
> -- 
> 2.7.4
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-14  0:55   ` Stefano Stabellini
@ 2021-09-15 19:10     ` Oleksandr
  2021-09-15 21:21       ` Stefano Stabellini
                         ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Oleksandr @ 2021-09-15 19:10 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Oleksandr Tyshchenko, Julien Grall, Volodymyr Babchuk,
	Henry Wang, Bertrand Marquis, Wei Chen


On 14.09.21 03:55, Stefano Stabellini wrote:

Hi Stefano

> On Fri, 10 Sep 2021, Oleksandr Tyshchenko wrote:
>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>
>> The extended region (safe range) is a region of guest physical
>> address space which is unused and could be safely used to create
>> grant/foreign mappings instead of wasting real RAM pages from
>> the domain memory for establishing these mappings.
>>
>> The extended regions are chosen at the domain creation time and
>> advertised to it via "reg" property under hypervisor node in
>> the guest device-tree. As region 0 is reserved for grant table
>> space (always present), the indexes for extended regions are 1...N.
>> If extended regions could not be allocated for some reason,
>> Xen doesn't fail and behaves as usual, so only inserts region 0.
>>
>> Please note the following limitations:
>> - The extended region feature is only supported for 64-bit domain.
>> - The ACPI case is not covered.
>>
>> ***
>>
>> As Dom0 is direct mapped domain on Arm (e.g. MFN == GFN)
>> the algorithm to choose extended regions for it is different
>> in comparison with the algorithm for non-direct mapped DomU.
>> What is more, that extended regions should be chosen differently
>> whether IOMMU is enabled or not.
>>
>> Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
>> holes found in host device-tree if otherwise. Make sure that
>> extended regions are 2MB-aligned and located within maximum possible
>> addressable physical memory range. The maximum number of extended
>> regions is 128.
>>
>> Suggested-by: Julien Grall <jgrall@amazon.com>
>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>> ---
>> Changes since RFC:
>>     - update patch description
>>     - drop uneeded "extended-region" DT property
>> ---
>>
>>   xen/arch/arm/domain_build.c | 226 +++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 224 insertions(+), 2 deletions(-)
>>
>> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
>> index 206038d..070ec27 100644
>> --- a/xen/arch/arm/domain_build.c
>> +++ b/xen/arch/arm/domain_build.c
>> @@ -724,6 +724,196 @@ static int __init make_memory_node(const struct domain *d,
>>       return res;
>>   }
>>   
>> +static int __init add_ext_regions(unsigned long s, unsigned long e, void *data)
>> +{
>> +    struct meminfo *ext_regions = data;
>> +    paddr_t start, size;
>> +
>> +    if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
>> +        return 0;
>> +
>> +    /* Both start and size of the extended region should be 2MB aligned */
>> +    start = (s + SZ_2M - 1) & ~(SZ_2M - 1);
>> +    if ( start > e )
>> +        return 0;
>> +
>> +    size = (e - start + 1) & ~(SZ_2M - 1);
>> +    if ( !size )
>> +        return 0;
> Can't you align size as well?
>
>    size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);

I am sorry, I don't entirely get what you really meant here. We get both 
start and size 2MB-aligned by the calculations above
(when calculating an alignment, we need to make sure that "start_passed 
<= start_aligned && size_aligned <= size_passed").
If I add the proposing string after, I will reduce the already aligned 
size by 2MB.
If I replace the size calculation with the following, I will get the 
reduced size even if the passed region is initially 2MB-aligned, so 
doesn't need to be adjusted.
size = e - s + 1;
size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);

>
>> +    ext_regions->bank[ext_regions->nr_banks].start = start;
>> +    ext_regions->bank[ext_regions->nr_banks].size = size;
>> +    ext_regions->nr_banks ++;
>                              ^ no space

ok


>
>> +    return 0;
>> +}
>> +
>> +/*
>> + * The extended regions will be prevalidated by the memory hotplug path
>> + * in Linux which requires for any added address range to be within maximum
>> + * possible addressable physical memory range for which the linear mapping
>> + * could be created.
>> + * For 48-bit VA space size the maximum addressable range are:
>> + * 0x40000000 - 0x80003fffffff
> Please don't make Linux-specific comments in Xen code for interfaces
> that are supposed to be OS-agnostic.

You are right. I just wanted to describe where these magic numbers come 
from.
Someone might question why, for example, "0 ... max_gpaddr" can't be 
used. I will move
that Linux-specific comments to the commit message to keep some 
justification of these numbers.


>> + */
>> +#define EXT_REGION_START   0x40000000ULL
>> +#define EXT_REGION_END     0x80003fffffffULL
>> +
>> +static int __init find_unallocated_memory(const struct kernel_info *kinfo,
>> +                                          struct meminfo *ext_regions)
>> +{
>> +    const struct meminfo *assign_mem = &kinfo->mem;
>> +    struct rangeset *unalloc_mem;
>> +    paddr_t start, end;
>> +    unsigned int i;
>> +    int res;
>> +
>> +    dt_dprintk("Find unallocated memory for extended regions\n");
>> +
>> +    unalloc_mem = rangeset_new(NULL, NULL, 0);
>> +    if ( !unalloc_mem )
>> +        return -ENOMEM;
>> +
>> +    /* Start with all available RAM */
>> +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
>> +    {
>> +        start = bootinfo.mem.bank[i].start;
>> +        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size - 1;
> Is the -1 needed? Isn't it going to screw up the size calculation later?
I thought, it was needed. The calculation seems correct.
For example, in my setup when IOMMU is disabled for Dom0 ("unallocated 
to Dom0 RAM"):

--- All available RAM and reserved memory found in DT:

(XEN) Initrd 0000000084000040-0000000085effc48
(XEN) RAM: 0000000048000000 - 00000000bfffffff <--- RAM bank 0
(XEN) RAM: 0000000500000000 - 000000057fffffff <--- RAM bank 1
(XEN) RAM: 0000000600000000 - 000000067fffffff <--- RAM bank 2
(XEN) RAM: 0000000700000000 - 000000077fffffff <--- RAM bank 3
(XEN)
(XEN) MODULE[0]: 0000000078080000 - 00000000781d74c8 Xen
(XEN) MODULE[1]: 0000000057fe7000 - 0000000057ffd080 Device Tree
(XEN) MODULE[2]: 0000000084000040 - 0000000085effc48 Ramdisk
(XEN) MODULE[3]: 000000008a000000 - 000000008c000000 Kernel
(XEN) MODULE[4]: 000000008c000000 - 000000008c010000 XSM
(XEN)  RESVD[0]: 0000000084000040 - 0000000085effc48
(XEN)  RESVD[1]: 0000000054000000 - 0000000056ffffff  <--- Reserved memory

--- Dom0 RAM:

(XEN) Allocating 1:1 mappings totalling 256MB for dom0:
(XEN) BANK[0] 0x00000060000000-0x00000070000000 (256MB)

--- Dom0 grant table range:

(XEN) Grant table range: 0x00000078080000-0x000000780c0000

--- Calculated extended regions printed in make_hypervisor_node():

printk("Extended region %d: %#"PRIx64"->%#"PRIx64"\n", i, start, start + 
size);

(XEN) Extended region 0: 0x48000000->0x54000000
(XEN) Extended region 1: 0x57000000->0x60000000
(XEN) Extended region 2: 0x70000000->0x78000000
(XEN) Extended region 3: 0x78200000->0xc0000000
(XEN) Extended region 4: 0x500000000->0x580000000
(XEN) Extended region 5: 0x600000000->0x680000000
(XEN) Extended region 6: 0x700000000->0x780000000

--- Resulted hypervisor node in Dom0 DT:

hypervisor {
         interrupts = <0x01 0x00 0xf08>;
         interrupt-parent = <0x19>;
         compatible = "xen,xen-4.16\0xen,xen";
         reg = <0x00 0x78080000 0x00 0x40000 0x00 0x48000000 0x00 
0xc000000 0x00 0x57000000 0x00 0x9000000 0x00 0x70000000 0x00 0x8000000 
0x00 0x78200000 0x00 0x47e00000 0x05 0x00 0x00 0x80000000 0x06 0x00 0x00 
0x80000000 0x07 0x00 0x00 0x80000000>;
};

Where index 0 corresponds to the grant table region and indexes 1...N 
correspond to the extended regions.


>> +        res = rangeset_add_range(unalloc_mem, start, end);
>> +        if ( res )
>> +        {
>> +            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>> +                   start, end);
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /* Remove RAM assigned to Dom0 */
>> +    for ( i = 0; i < assign_mem->nr_banks; i++ )
>> +    {
>> +        start = assign_mem->bank[i].start;
>> +        end = assign_mem->bank[i].start + assign_mem->bank[i].size - 1;
>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>> +        if ( res )
>> +        {
>> +            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
>> +                   start, end);
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /* Remove reserved-memory regions */
>> +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
>> +    {
>> +        start = bootinfo.reserved_mem.bank[i].start;
>> +        end = bootinfo.reserved_mem.bank[i].start +
>> +            bootinfo.reserved_mem.bank[i].size - 1;
>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>> +        if ( res )
>> +        {
>> +            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
>> +                   start, end);
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /* Remove grant table region */
>> +    start = kinfo->gnttab_start;
>> +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
>> +    res = rangeset_remove_range(unalloc_mem, start, end);
>> +    if ( res )
>> +    {
>> +        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
>> +               start, end);
>> +        goto out;
>> +    }
>> +
>> +    start = EXT_REGION_START;
>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>> +    res = rangeset_report_ranges(unalloc_mem, start, end,
>> +                                 add_ext_regions, ext_regions);
>> +    if ( res )
>> +        ext_regions->nr_banks = 0;
>> +    else if ( !ext_regions->nr_banks )
>> +        res = -ENOENT;
>> +
>> +out:
>> +    rangeset_destroy(unalloc_mem);
>> +
>> +    return res;
>> +}
>> +
>> +static int __init find_memory_holes(const struct kernel_info *kinfo,
>> +                                    struct meminfo *ext_regions)
>> +{
>> +    struct dt_device_node *np;
>> +    struct rangeset *mem_holes;
>> +    paddr_t start, end;
>> +    unsigned int i;
>> +    int res;
>> +
>> +    dt_dprintk("Find memory holes for extended regions\n");
>> +
>> +    mem_holes = rangeset_new(NULL, NULL, 0);
>> +    if ( !mem_holes )
>> +        return -ENOMEM;
>> +
>> +    /* Start with maximum possible addressable physical memory range */
>> +    start = EXT_REGION_START;
>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>> +    res = rangeset_add_range(mem_holes, start, end);
>> +    if ( res )
>> +    {
>> +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>> +               start, end);
>> +        goto out;
>> +    }
>> +
>> +    /* Remove all regions described by "reg" property (MMIO, RAM, etc) */
>> +    dt_for_each_device_node( dt_host, np )
> Don't you need something like device_tree_for_each_node ?
> dt_for_each_device_node won't go down any deeper in the tree?

Thank you for pointing this out, I will investigate and update the patch.


>
> Alternatively, maybe we could simply record the highest possible address
> of any memory/device/anything as we scan the device tree with
> handle_node. Then we can use that as the starting point here.
I also don't like the idea to scan the DT much, but I failed to find an 
effective solution how to avoid that.
Yes, we can record the highest possible address, but I am afraid, I 
didn't entirely get a suggestion. Is the suggestion to provide a single 
region
starting from highest possible address + 1 and up to the EXT_REGION_END 
suitably aligned? Could you please clarify?


> So that we
> don't need to scan the device tree twice, and also we don't need my
> suggestion below to remove 1GB-aligned 1GB-multiple regions.
I provided some thoughts regarding this below.


>
>
>> +    {
>> +        unsigned int naddr;
>> +        u64 addr, size;
>> +
>> +        naddr = dt_number_of_address(np);
>> +
>> +        for ( i = 0; i < naddr; i++ )
>> +        {
>> +            res = dt_device_get_address(np, i, &addr, &size);
>> +            if ( res )
>> +            {
>> +                printk(XENLOG_ERR "Unable to retrieve address %u for %s\n",
>> +                       i, dt_node_full_name(np));
> It might be possible for a device not to have a range if it doesn't have
> any MMIO regions, right? For instance, certain ARM timer nodes. I would
> not print any errors and continue.
I though, if device didn't have a range, then this loop wouldn't be 
executed at all as dt_number_of_address would return 0.
I will double check.


>
>
>> +                goto out;
>> +            }
>> +
>> +            start = addr & PAGE_MASK;
>> +            end = PAGE_ALIGN(addr + size) - 1;
>> +            res = rangeset_remove_range(mem_holes, start, end);
>> +            if ( res )
>> +            {
>> +                printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
>> +                       start, end);
>> +                goto out;
>> +            }
>> +        }
>> +    }
> As is, it will result in a myriad of small ranges which is unuseful and
> slow to parse. I suggest to simplify it by removing a larger region than
> strictly necessary. For instance, you could remove a 1GB-aligned and
> 1GB-multiple region for each range. That way, you are going to get fewer
> large free ranges instance of many small ones which we don't need.

I agree with you that a lot of small ranges increase the bookkeeping in 
Dom0 and there is also a theoretical (?) possibility
that small ranges occupy all space we provide for extended regions 
(NR_MEM_BANKS)...
But, let's consider my setup as an example again, but when the IOMMU is 
enabled for Dom0 ("holes found in DT").

--- The RAM configuration is the same:

(XEN) RAM: 0000000048000000 - 00000000bfffffff <--- RAM bank 0
(XEN) RAM: 0000000500000000 - 000000057fffffff <--- RAM bank 1
(XEN) RAM: 0000000600000000 - 000000067fffffff <--- RAM bank 2
(XEN) RAM: 0000000700000000 - 000000077fffffff <--- RAM bank 3

--- There are a lot of various platform devices with reg property 
described in DT, I will probably not post all IO ranges here, just say 
that mostly all of them to be mapped at 0xE0000000-0xFFFFFFFF.

--- As we only pick up ranges with size >= 2MB, the calculated extended 
regions are (based on 40-bit IPA):

(XEN) Extended region 0: 0x40000000->0x47e00000
(XEN) Extended region 1: 0xc0000000->0xe6000000
(XEN) Extended region 2: 0xe7000000->0xe7200000
(XEN) Extended region 3: 0xe7400000->0xe7600000
(XEN) Extended region 4: 0xe7800000->0xec000000
(XEN) Extended region 5: 0xec200000->0xec400000
(XEN) Extended region 6: 0xec800000->0xee000000
(XEN) Extended region 7: 0xee600000->0xee800000
(XEN) Extended region 8: 0xeea00000->0xf1000000
(XEN) Extended region 9: 0xf1200000->0xfd000000
(XEN) Extended region 10: 0xfd200000->0xfd800000
(XEN) Extended region 11: 0xfda00000->0xfe000000
(XEN) Extended region 12: 0xfe200000->0xfe600000
(XEN) Extended region 13: 0xfec00000->0xff800000
(XEN) Extended region 14: 0x100000000->0x500000000
(XEN) Extended region 15: 0x580000000->0x600000000
(XEN) Extended region 16: 0x680000000->0x700000000
(XEN) Extended region 17: 0x780000000->0x10000000000

So, if I *correctly* understood your idea about removing 1GB-aligned 
1GB-multiple region for each range we would get the following:

(XEN) Extended region 0: 0x100000000->0x500000000
(XEN) Extended region 1: 0x580000000->0x600000000
(XEN) Extended region 2: 0x680000000->0x700000000
(XEN) Extended region 3: 0x780000000->0x10000000000

As you can see there are no extended regions below 4GB at all. I assume, 
it would be good to provide them for 1:1 mapped Dom0 (for 32-bit DMA 
devices?)
What else worries me is that IPA size could be 36 or even 32. So, I am 
afraid, we might even fail to find extended regions above 4GB.


I think, if 2MB is considered small enough to bother with, probably we 
should go with something in between (16MB, 32MB, 64MB).
For example, we can take into the account ranges with size >= 16MB:

(XEN) Extended region 0: 0x40000000->0x47e00000
(XEN) Extended region 1: 0xc0000000->0xe6000000
(XEN) Extended region 2: 0xe7800000->0xec000000
(XEN) Extended region 3: 0xec800000->0xee000000
(XEN) Extended region 4: 0xeea00000->0xf1000000
(XEN) Extended region 5: 0xf1200000->0xfd000000
(XEN) Extended region 6: 0x100000000->0x500000000
(XEN) Extended region 7: 0x580000000->0x600000000
(XEN) Extended region 8: 0x680000000->0x700000000
(XEN) Extended region 9: 0x780000000->0x10000000000

Any thoughts?


>
>> +    start = EXT_REGION_START;
>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>> +    res = rangeset_report_ranges(mem_holes, start, end,
>> +                                 add_ext_regions,  ext_regions);
>> +    if ( res )
>> +        ext_regions->nr_banks = 0;
>> +    else if ( !ext_regions->nr_banks )
>> +        res = -ENOENT;
>> +
>> +out:
>> +    rangeset_destroy(mem_holes);
>> +
>> +    return res;
>> +}
>> +
>>   static int __init make_hypervisor_node(struct domain *d,
>>                                          const struct kernel_info *kinfo,
>>                                          int addrcells, int sizecells)
>> @@ -731,11 +921,13 @@ static int __init make_hypervisor_node(struct domain *d,
>>       const char compat[] =
>>           "xen,xen-"__stringify(XEN_VERSION)"."__stringify(XEN_SUBVERSION)"\0"
>>           "xen,xen";
>> -    __be32 reg[4];
>> +    __be32 reg[(NR_MEM_BANKS + 1) * 4];
> If you are xzalloc'ing struct meminfo then you might as well xzalloc reg
> too. Or keep both on the stack with a lower NR_MEM_BANKS.
sounds reasonable, I will probably xzalloc reg as well.

>
>
>>       gic_interrupt_t intr;
>>       __be32 *cells;
>>       int res;
>>       void *fdt = kinfo->fdt;
>> +    struct meminfo *ext_regions;
>> +    unsigned int i;
>>   
>>       dt_dprintk("Create hypervisor node\n");
>>   
>> @@ -757,12 +949,42 @@ static int __init make_hypervisor_node(struct domain *d,
>>       if ( res )
>>           return res;
>>   
>> +    ext_regions = xzalloc(struct meminfo);
>> +    if ( !ext_regions )
>> +        return -ENOMEM;
>> +
>> +    if ( is_32bit_domain(d) )
>> +        printk(XENLOG_WARNING "The extended region is only supported for 64-bit guest\n");
> This is going to add an unconditional warning to all 32bit boots. I
> would skip it entirely or only keep it as XENLOG_DEBUG.

agree, I will probably convert to XENLOG_DEBUG.


>
>
>> +    else
>> +    {
>> +        if ( !is_iommu_enabled(d) )
>> +            res = find_unallocated_memory(kinfo, ext_regions);
>> +        else
>> +            res = find_memory_holes(kinfo, ext_regions);
>> +
>> +        if ( res )
>> +            printk(XENLOG_WARNING "Failed to allocate extended regions\n");
>> +    }
>> +
>>       /* reg 0 is grant table space */
>>       cells = &reg[0];
>>       dt_child_set_range(&cells, addrcells, sizecells,
>>                          kinfo->gnttab_start, kinfo->gnttab_size);
>> +    /* reg 1...N are extended regions */
>> +    for ( i = 0; i < ext_regions->nr_banks; i++ )
>> +    {
>> +        u64 start = ext_regions->bank[i].start;
>> +        u64 size = ext_regions->bank[i].size;
>> +
>> +        dt_dprintk("Extended region %d: %#"PRIx64"->%#"PRIx64"\n",
>> +                   i, start, start + size);
>> +
>> +        dt_child_set_range(&cells, addrcells, sizecells, start, size);
>> +    }
>> +    xfree(ext_regions);
>> +
>>       res = fdt_property(fdt, "reg", reg,
>> -                       dt_cells_to_size(addrcells + sizecells));
>> +                       dt_cells_to_size(addrcells + sizecells) * (i + 1));
>>       if ( res )
>>           return res;
>>   
>> -- 
>> 2.7.4
>>

Thank you.


-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-15 19:10     ` Oleksandr
@ 2021-09-15 21:21       ` Stefano Stabellini
  2021-09-16 20:57         ` Oleksandr
  2021-09-17 14:08       ` Oleksandr
  2021-09-17 15:52       ` Julien Grall
  2 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2021-09-15 21:21 UTC (permalink / raw)
  To: Oleksandr
  Cc: Stefano Stabellini, xen-devel, Oleksandr Tyshchenko,
	Julien Grall, Volodymyr Babchuk, Henry Wang, Bertrand Marquis,
	Wei Chen

On Wed, 15 Sep 2021, Oleksandr wrote:
> > On Fri, 10 Sep 2021, Oleksandr Tyshchenko wrote:
> > > From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> > > 
> > > The extended region (safe range) is a region of guest physical
> > > address space which is unused and could be safely used to create
> > > grant/foreign mappings instead of wasting real RAM pages from
> > > the domain memory for establishing these mappings.
> > > 
> > > The extended regions are chosen at the domain creation time and
> > > advertised to it via "reg" property under hypervisor node in
> > > the guest device-tree. As region 0 is reserved for grant table
> > > space (always present), the indexes for extended regions are 1...N.
> > > If extended regions could not be allocated for some reason,
> > > Xen doesn't fail and behaves as usual, so only inserts region 0.
> > > 
> > > Please note the following limitations:
> > > - The extended region feature is only supported for 64-bit domain.
> > > - The ACPI case is not covered.
> > > 
> > > ***
> > > 
> > > As Dom0 is direct mapped domain on Arm (e.g. MFN == GFN)
> > > the algorithm to choose extended regions for it is different
> > > in comparison with the algorithm for non-direct mapped DomU.
> > > What is more, that extended regions should be chosen differently
> > > whether IOMMU is enabled or not.
> > > 
> > > Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
> > > holes found in host device-tree if otherwise. Make sure that
> > > extended regions are 2MB-aligned and located within maximum possible
> > > addressable physical memory range. The maximum number of extended
> > > regions is 128.
> > > 
> > > Suggested-by: Julien Grall <jgrall@amazon.com>
> > > Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> > > ---
> > > Changes since RFC:
> > >     - update patch description
> > >     - drop uneeded "extended-region" DT property
> > > ---
> > > 
> > >   xen/arch/arm/domain_build.c | 226
> > > +++++++++++++++++++++++++++++++++++++++++++-
> > >   1 file changed, 224 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> > > index 206038d..070ec27 100644
> > > --- a/xen/arch/arm/domain_build.c
> > > +++ b/xen/arch/arm/domain_build.c
> > > @@ -724,6 +724,196 @@ static int __init make_memory_node(const struct
> > > domain *d,
> > >       return res;
> > >   }
> > >   +static int __init add_ext_regions(unsigned long s, unsigned long e,
> > > void *data)
> > > +{
> > > +    struct meminfo *ext_regions = data;
> > > +    paddr_t start, size;
> > > +
> > > +    if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
> > > +        return 0;
> > > +
> > > +    /* Both start and size of the extended region should be 2MB aligned
> > > */
> > > +    start = (s + SZ_2M - 1) & ~(SZ_2M - 1);
> > > +    if ( start > e )
> > > +        return 0;
> > > +
> > > +    size = (e - start + 1) & ~(SZ_2M - 1);
> > > +    if ( !size )
> > > +        return 0;
> > Can't you align size as well?
> > 
> >    size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);
> 
> I am sorry, I don't entirely get what you really meant here. We get both start
> and size 2MB-aligned by the calculations above
> (when calculating an alignment, we need to make sure that "start_passed <=
> start_aligned && size_aligned <= size_passed").
> If I add the proposing string after, I will reduce the already aligned size by
> 2MB.
> If I replace the size calculation with the following, I will get the reduced
> size even if the passed region is initially 2MB-aligned, so doesn't need to be
> adjusted.
> size = e - s + 1;
> size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);

Sorry I misread your original code, I think it was working as intended
except for the "+1". I think it should be:

  size = (e - start) & ~(SZ_2M - 1);



> > > + */
> > > +#define EXT_REGION_START   0x40000000ULL
> > > +#define EXT_REGION_END     0x80003fffffffULL
> > > +
> > > +static int __init find_unallocated_memory(const struct kernel_info
> > > *kinfo,
> > > +                                          struct meminfo *ext_regions)
> > > +{
> > > +    const struct meminfo *assign_mem = &kinfo->mem;
> > > +    struct rangeset *unalloc_mem;
> > > +    paddr_t start, end;
> > > +    unsigned int i;
> > > +    int res;
> > > +
> > > +    dt_dprintk("Find unallocated memory for extended regions\n");
> > > +
> > > +    unalloc_mem = rangeset_new(NULL, NULL, 0);
> > > +    if ( !unalloc_mem )
> > > +        return -ENOMEM;
> > > +
> > > +    /* Start with all available RAM */
> > > +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
> > > +    {
> > > +        start = bootinfo.mem.bank[i].start;
> > > +        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size - 1;
> > Is the -1 needed? Isn't it going to screw up the size calculation later?
> I thought, it was needed. The calculation seems correct.

I think that normally for an example MMIO region:

start = 0x48000000
size  = 0x40000000
end   = 0x88000000

So end = start + size and points to the first address out of the range.
In other words, 0x88000000 doesn't actually belong to the MMIO region in
the example.

But here you are passing addresses to rangeset_add_range and other
rangeset functions and I think rangeset takes *inclusive* addresses as
input. So you need to pass start and end-1 because end-1 is the last
address of the MMIO region.

In fact you can see for instance in map_range_to_domain:

        res = iomem_permit_access(d, paddr_to_pfn(addr),
                paddr_to_pfn(PAGE_ALIGN(addr + len - 1)));

Where iomem_permit_access is based on rangeset. So for clarity, I would
do:

start = assign_mem->bank[i].start;
end = assign_mem->bank[i].start + assign_mem->bank[i].size;
res = rangeset_remove_range(unalloc_mem, start, end - 1);

So that we don't get confused on the meaning of "end" which everywhere
else means the first address not in range.


> > > +        res = rangeset_add_range(unalloc_mem, start, end);
> > > +        if ( res )
> > > +        {
> > > +            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> > > +                   start, end);
> > > +            goto out;
> > > +        }
> > > +    }
> > > +
> > > +    /* Remove RAM assigned to Dom0 */
> > > +    for ( i = 0; i < assign_mem->nr_banks; i++ )
> > > +    {
> > > +        start = assign_mem->bank[i].start;
> > > +        end = assign_mem->bank[i].start + assign_mem->bank[i].size - 1;
> > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > +        if ( res )
> > > +        {
> > > +            printk(XENLOG_ERR "Failed to remove:
> > > %#"PRIx64"->%#"PRIx64"\n",
> > > +                   start, end);
> > > +            goto out;
> > > +        }
> > > +    }
> > > +
> > > +    /* Remove reserved-memory regions */
> > > +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
> > > +    {
> > > +        start = bootinfo.reserved_mem.bank[i].start;
> > > +        end = bootinfo.reserved_mem.bank[i].start +
> > > +            bootinfo.reserved_mem.bank[i].size - 1;
> > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > +        if ( res )
> > > +        {
> > > +            printk(XENLOG_ERR "Failed to remove:
> > > %#"PRIx64"->%#"PRIx64"\n",
> > > +                   start, end);
> > > +            goto out;
> > > +        }
> > > +    }
> > > +
> > > +    /* Remove grant table region */
> > > +    start = kinfo->gnttab_start;
> > > +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
> > > +    res = rangeset_remove_range(unalloc_mem, start, end);
> > > +    if ( res )
> > > +    {
> > > +        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> > > +               start, end);
> > > +        goto out;
> > > +    }
> > > +
> > > +    start = EXT_REGION_START;
> > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > +    res = rangeset_report_ranges(unalloc_mem, start, end,
> > > +                                 add_ext_regions, ext_regions);
> > > +    if ( res )
> > > +        ext_regions->nr_banks = 0;
> > > +    else if ( !ext_regions->nr_banks )
> > > +        res = -ENOENT;
> > > +
> > > +out:
> > > +    rangeset_destroy(unalloc_mem);
> > > +
> > > +    return res;
> > > +}
> > > +
> > > +static int __init find_memory_holes(const struct kernel_info *kinfo,
> > > +                                    struct meminfo *ext_regions)
> > > +{
> > > +    struct dt_device_node *np;
> > > +    struct rangeset *mem_holes;
> > > +    paddr_t start, end;
> > > +    unsigned int i;
> > > +    int res;
> > > +
> > > +    dt_dprintk("Find memory holes for extended regions\n");
> > > +
> > > +    mem_holes = rangeset_new(NULL, NULL, 0);
> > > +    if ( !mem_holes )
> > > +        return -ENOMEM;
> > > +
> > > +    /* Start with maximum possible addressable physical memory range */
> > > +    start = EXT_REGION_START;
> > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > +    res = rangeset_add_range(mem_holes, start, end);
> > > +    if ( res )
> > > +    {
> > > +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> > > +               start, end);
> > > +        goto out;
> > > +    }
> > > +
> > > +    /* Remove all regions described by "reg" property (MMIO, RAM, etc) */
> > > +    dt_for_each_device_node( dt_host, np )
> > Don't you need something like device_tree_for_each_node ?
> > dt_for_each_device_node won't go down any deeper in the tree?
> 
> Thank you for pointing this out, I will investigate and update the patch.
> 
> 
> > 
> > Alternatively, maybe we could simply record the highest possible address
> > of any memory/device/anything as we scan the device tree with
> > handle_node. Then we can use that as the starting point here.
> I also don't like the idea to scan the DT much, but I failed to find an
> effective solution how to avoid that.
> Yes, we can record the highest possible address, but I am afraid, I didn't
> entirely get a suggestion. Is the suggestion to provide a single region
> starting from highest possible address + 1 and up to the EXT_REGION_END
> suitably aligned? Could you please clarify?

Yes, that is what I was suggesting as a possible alternative: start from
the highest possible address in DT + 1 and up to the EXT_REGION_END
suitably aligned. But that wouldn't solve the <4GB issue.

 
> > > +                goto out;
> > > +            }
> > > +
> > > +            start = addr & PAGE_MASK;
> > > +            end = PAGE_ALIGN(addr + size) - 1;
> > > +            res = rangeset_remove_range(mem_holes, start, end);
> > > +            if ( res )
> > > +            {
> > > +                printk(XENLOG_ERR "Failed to remove:
> > > %#"PRIx64"->%#"PRIx64"\n",
> > > +                       start, end);
> > > +                goto out;
> > > +            }
> > > +        }
> > > +    }
> > As is, it will result in a myriad of small ranges which is unuseful and
> > slow to parse. I suggest to simplify it by removing a larger region than
> > strictly necessary. For instance, you could remove a 1GB-aligned and
> > 1GB-multiple region for each range. That way, you are going to get fewer
> > large free ranges instance of many small ones which we don't need.
> 
> I agree with you that a lot of small ranges increase the bookkeeping in Dom0
> and there is also a theoretical (?) possibility
> that small ranges occupy all space we provide for extended regions
> (NR_MEM_BANKS)...
> But, let's consider my setup as an example again, but when the IOMMU is
> enabled for Dom0 ("holes found in DT").
> 
> --- The RAM configuration is the same:
> 
> (XEN) RAM: 0000000048000000 - 00000000bfffffff <--- RAM bank 0
> (XEN) RAM: 0000000500000000 - 000000057fffffff <--- RAM bank 1
> (XEN) RAM: 0000000600000000 - 000000067fffffff <--- RAM bank 2
> (XEN) RAM: 0000000700000000 - 000000077fffffff <--- RAM bank 3
> 
> --- There are a lot of various platform devices with reg property described in
> DT, I will probably not post all IO ranges here, just say that mostly all of
> them to be mapped at 0xE0000000-0xFFFFFFFF.
> 
> --- As we only pick up ranges with size >= 2MB, the calculated extended
> regions are (based on 40-bit IPA):
> 
> (XEN) Extended region 0: 0x40000000->0x47e00000
> (XEN) Extended region 1: 0xc0000000->0xe6000000
> (XEN) Extended region 2: 0xe7000000->0xe7200000
> (XEN) Extended region 3: 0xe7400000->0xe7600000
> (XEN) Extended region 4: 0xe7800000->0xec000000
> (XEN) Extended region 5: 0xec200000->0xec400000
> (XEN) Extended region 6: 0xec800000->0xee000000
> (XEN) Extended region 7: 0xee600000->0xee800000
> (XEN) Extended region 8: 0xeea00000->0xf1000000
> (XEN) Extended region 9: 0xf1200000->0xfd000000
> (XEN) Extended region 10: 0xfd200000->0xfd800000
> (XEN) Extended region 11: 0xfda00000->0xfe000000
> (XEN) Extended region 12: 0xfe200000->0xfe600000
> (XEN) Extended region 13: 0xfec00000->0xff800000
> (XEN) Extended region 14: 0x100000000->0x500000000
> (XEN) Extended region 15: 0x580000000->0x600000000
> (XEN) Extended region 16: 0x680000000->0x700000000
> (XEN) Extended region 17: 0x780000000->0x10000000000
> 
> So, if I *correctly* understood your idea about removing 1GB-aligned
> 1GB-multiple region for each range we would get the following:
> 
> (XEN) Extended region 0: 0x100000000->0x500000000
> (XEN) Extended region 1: 0x580000000->0x600000000
> (XEN) Extended region 2: 0x680000000->0x700000000
> (XEN) Extended region 3: 0x780000000->0x10000000000
> 
> As you can see there are no extended regions below 4GB at all. I assume, it
> would be good to provide them for 1:1 mapped Dom0 (for 32-bit DMA devices?)
> What else worries me is that IPA size could be 36 or even 32. So, I am afraid,
> we might even fail to find extended regions above 4GB.
> 
> 
> I think, if 2MB is considered small enough to bother with, probably we should
> go with something in between (16MB, 32MB, 64MB).
> For example, we can take into the account ranges with size >= 16MB:
> 
> (XEN) Extended region 0: 0x40000000->0x47e00000
> (XEN) Extended region 1: 0xc0000000->0xe6000000
> (XEN) Extended region 2: 0xe7800000->0xec000000
> (XEN) Extended region 3: 0xec800000->0xee000000
> (XEN) Extended region 4: 0xeea00000->0xf1000000
> (XEN) Extended region 5: 0xf1200000->0xfd000000
> (XEN) Extended region 6: 0x100000000->0x500000000
> (XEN) Extended region 7: 0x580000000->0x600000000
> (XEN) Extended region 8: 0x680000000->0x700000000
> (XEN) Extended region 9: 0x780000000->0x10000000000
> 
> Any thoughts?

Yeah maybe an intermediate value would be best. I'd go with 64MB.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 1/3] xen: Introduce "gpaddr_bits" field to XEN_SYSCTL_physinfo
  2021-09-10 18:18 ` [PATCH V2 1/3] xen: Introduce "gpaddr_bits" field to XEN_SYSCTL_physinfo Oleksandr Tyshchenko
@ 2021-09-16 14:49   ` Jan Beulich
  2021-09-16 15:43     ` Oleksandr
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2021-09-16 14:49 UTC (permalink / raw)
  To: Oleksandr Tyshchenko
  Cc: Oleksandr Tyshchenko, Ian Jackson, Wei Liu, Anthony PERARD,
	Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Juergen Gross, Volodymyr Babchuk, Roger Pau Monné,
	Henry Wang, Bertrand Marquis, Wei Chen, xen-devel

On 10.09.2021 20:18, Oleksandr Tyshchenko wrote:
> --- a/tools/include/libxl.h
> +++ b/tools/include/libxl.h
> @@ -855,6 +855,13 @@ typedef struct libxl__ctx libxl_ctx;
>   */
>  #define LIBXL_HAVE_PHYSINFO_MAX_POSSIBLE_MFN 1
>  
> + /*
> +  * LIBXL_HAVE_PHYSINFO_GPADDR_BITS
> +  *
> +  * If this is defined, libxl_physinfo has a "gpaddr_bits" field.
> +  */
> + #define LIBXL_HAVE_PHYSINFO_GPADDR_BITS 1

Nit: I don't think you mean to have leading blanks here?

> @@ -120,6 +120,7 @@ struct xen_sysctl_physinfo {
>      uint64_aligned_t outstanding_pages;
>      uint64_aligned_t max_mfn; /* Largest possible MFN on this host */
>      uint32_t hw_cap[8];
> +    uint32_t gpaddr_bits;
>  };

Please make trailing padding explicit. I wonder whether this needs
to be a 32-bit field: I expect we would need a full new ABI by the
time we might reach 256 address bits. Otoh e.g. threads_per_core is
pretty certainly oversized as well ...

Jan



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 1/3] xen: Introduce "gpaddr_bits" field to XEN_SYSCTL_physinfo
  2021-09-16 14:49   ` Jan Beulich
@ 2021-09-16 15:43     ` Oleksandr
  2021-09-16 15:47       ` Jan Beulich
  0 siblings, 1 reply; 43+ messages in thread
From: Oleksandr @ 2021-09-16 15:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Tyshchenko, Ian Jackson, Wei Liu, Anthony PERARD,
	Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Juergen Gross, Volodymyr Babchuk, Roger Pau Monné,
	Henry Wang, Bertrand Marquis, Wei Chen, xen-devel


On 16.09.21 17:49, Jan Beulich wrote:

Hi Jan

> On 10.09.2021 20:18, Oleksandr Tyshchenko wrote:
>> --- a/tools/include/libxl.h
>> +++ b/tools/include/libxl.h
>> @@ -855,6 +855,13 @@ typedef struct libxl__ctx libxl_ctx;
>>    */
>>   #define LIBXL_HAVE_PHYSINFO_MAX_POSSIBLE_MFN 1
>>   
>> + /*
>> +  * LIBXL_HAVE_PHYSINFO_GPADDR_BITS
>> +  *
>> +  * If this is defined, libxl_physinfo has a "gpaddr_bits" field.
>> +  */
>> + #define LIBXL_HAVE_PHYSINFO_GPADDR_BITS 1
> Nit: I don't think you mean to have leading blanks here?

Yes, will remove.


>
>> @@ -120,6 +120,7 @@ struct xen_sysctl_physinfo {
>>       uint64_aligned_t outstanding_pages;
>>       uint64_aligned_t max_mfn; /* Largest possible MFN on this host */
>>       uint32_t hw_cap[8];
>> +    uint32_t gpaddr_bits;
>>   };
> Please make trailing padding explicit. I wonder whether this needs
> to be a 32-bit field: I expect we would need a full new ABI by the
> time we might reach 256 address bits. Otoh e.g. threads_per_core is
> pretty certainly oversized as well ...

I take it, this is a suggestion to make the field uint8_t and add 
uint8_t pad[7] after?


Please note, these are still unaddressed review comments for the last 
version [1], with a suggestion to use domctl (new?).
Also it is not entirely clear to me regarding whether this field will 
even remain gpaddr_bits or became maxphysaddr for example.

[1] 
https://lore.kernel.org/xen-devel/973f5344-aa10-3ad6-ff02-ad5f358ad279@citrix.com/


>
> Jan
>
-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 1/3] xen: Introduce "gpaddr_bits" field to XEN_SYSCTL_physinfo
  2021-09-16 15:43     ` Oleksandr
@ 2021-09-16 15:47       ` Jan Beulich
  2021-09-16 16:05         ` Oleksandr
  0 siblings, 1 reply; 43+ messages in thread
From: Jan Beulich @ 2021-09-16 15:47 UTC (permalink / raw)
  To: Oleksandr
  Cc: Oleksandr Tyshchenko, Ian Jackson, Wei Liu, Anthony PERARD,
	Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Juergen Gross, Volodymyr Babchuk, Roger Pau Monné,
	Henry Wang, Bertrand Marquis, Wei Chen, xen-devel

On 16.09.2021 17:43, Oleksandr wrote:
> On 16.09.21 17:49, Jan Beulich wrote:
>> On 10.09.2021 20:18, Oleksandr Tyshchenko wrote:
>>> @@ -120,6 +120,7 @@ struct xen_sysctl_physinfo {
>>>       uint64_aligned_t outstanding_pages;
>>>       uint64_aligned_t max_mfn; /* Largest possible MFN on this host */
>>>       uint32_t hw_cap[8];
>>> +    uint32_t gpaddr_bits;
>>>   };
>> Please make trailing padding explicit. I wonder whether this needs
>> to be a 32-bit field: I expect we would need a full new ABI by the
>> time we might reach 256 address bits. Otoh e.g. threads_per_core is
>> pretty certainly oversized as well ...
> 
> I take it, this is a suggestion to make the field uint8_t and add 
> uint8_t pad[7] after?

I view this as a viable option at least.

Jan



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 1/3] xen: Introduce "gpaddr_bits" field to XEN_SYSCTL_physinfo
  2021-09-16 15:47       ` Jan Beulich
@ 2021-09-16 16:05         ` Oleksandr
  0 siblings, 0 replies; 43+ messages in thread
From: Oleksandr @ 2021-09-16 16:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Oleksandr Tyshchenko, Ian Jackson, Wei Liu, Anthony PERARD,
	Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Juergen Gross, Volodymyr Babchuk, Roger Pau Monné,
	Henry Wang, Bertrand Marquis, Wei Chen, xen-devel


On 16.09.21 18:47, Jan Beulich wrote:

Hi Jan

> On 16.09.2021 17:43, Oleksandr wrote:
>> On 16.09.21 17:49, Jan Beulich wrote:
>>> On 10.09.2021 20:18, Oleksandr Tyshchenko wrote:
>>>> @@ -120,6 +120,7 @@ struct xen_sysctl_physinfo {
>>>>        uint64_aligned_t outstanding_pages;
>>>>        uint64_aligned_t max_mfn; /* Largest possible MFN on this host */
>>>>        uint32_t hw_cap[8];
>>>> +    uint32_t gpaddr_bits;
>>>>    };
>>> Please make trailing padding explicit. I wonder whether this needs
>>> to be a 32-bit field: I expect we would need a full new ABI by the
>>> time we might reach 256 address bits. Otoh e.g. threads_per_core is
>>> pretty certainly oversized as well ...
>> I take it, this is a suggestion to make the field uint8_t and add
>> uint8_t pad[7] after?
> I view this as a viable option at least.

I got it, sounds reasonable.


>
> Jan
>
-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-15 21:21       ` Stefano Stabellini
@ 2021-09-16 20:57         ` Oleksandr
  2021-09-16 21:30           ` Stefano Stabellini
  0 siblings, 1 reply; 43+ messages in thread
From: Oleksandr @ 2021-09-16 20:57 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Oleksandr Tyshchenko, Julien Grall, Volodymyr Babchuk,
	Henry Wang, Bertrand Marquis, Wei Chen


On 16.09.21 00:21, Stefano Stabellini wrote:

Hi Stefano

> On Wed, 15 Sep 2021, Oleksandr wrote:
>>> On Fri, 10 Sep 2021, Oleksandr Tyshchenko wrote:
>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>
>>>> The extended region (safe range) is a region of guest physical
>>>> address space which is unused and could be safely used to create
>>>> grant/foreign mappings instead of wasting real RAM pages from
>>>> the domain memory for establishing these mappings.
>>>>
>>>> The extended regions are chosen at the domain creation time and
>>>> advertised to it via "reg" property under hypervisor node in
>>>> the guest device-tree. As region 0 is reserved for grant table
>>>> space (always present), the indexes for extended regions are 1...N.
>>>> If extended regions could not be allocated for some reason,
>>>> Xen doesn't fail and behaves as usual, so only inserts region 0.
>>>>
>>>> Please note the following limitations:
>>>> - The extended region feature is only supported for 64-bit domain.
>>>> - The ACPI case is not covered.
>>>>
>>>> ***
>>>>
>>>> As Dom0 is direct mapped domain on Arm (e.g. MFN == GFN)
>>>> the algorithm to choose extended regions for it is different
>>>> in comparison with the algorithm for non-direct mapped DomU.
>>>> What is more, that extended regions should be chosen differently
>>>> whether IOMMU is enabled or not.
>>>>
>>>> Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
>>>> holes found in host device-tree if otherwise. Make sure that
>>>> extended regions are 2MB-aligned and located within maximum possible
>>>> addressable physical memory range. The maximum number of extended
>>>> regions is 128.
>>>>
>>>> Suggested-by: Julien Grall <jgrall@amazon.com>
>>>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>> ---
>>>> Changes since RFC:
>>>>      - update patch description
>>>>      - drop uneeded "extended-region" DT property
>>>> ---
>>>>
>>>>    xen/arch/arm/domain_build.c | 226
>>>> +++++++++++++++++++++++++++++++++++++++++++-
>>>>    1 file changed, 224 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
>>>> index 206038d..070ec27 100644
>>>> --- a/xen/arch/arm/domain_build.c
>>>> +++ b/xen/arch/arm/domain_build.c
>>>> @@ -724,6 +724,196 @@ static int __init make_memory_node(const struct
>>>> domain *d,
>>>>        return res;
>>>>    }
>>>>    +static int __init add_ext_regions(unsigned long s, unsigned long e,
>>>> void *data)
>>>> +{
>>>> +    struct meminfo *ext_regions = data;
>>>> +    paddr_t start, size;
>>>> +
>>>> +    if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
>>>> +        return 0;
>>>> +
>>>> +    /* Both start and size of the extended region should be 2MB aligned
>>>> */
>>>> +    start = (s + SZ_2M - 1) & ~(SZ_2M - 1);
>>>> +    if ( start > e )
>>>> +        return 0;
>>>> +
>>>> +    size = (e - start + 1) & ~(SZ_2M - 1);
>>>> +    if ( !size )
>>>> +        return 0;
>>> Can't you align size as well?
>>>
>>>     size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);
>> I am sorry, I don't entirely get what you really meant here. We get both start
>> and size 2MB-aligned by the calculations above
>> (when calculating an alignment, we need to make sure that "start_passed <=
>> start_aligned && size_aligned <= size_passed").
>> If I add the proposing string after, I will reduce the already aligned size by
>> 2MB.
>> If I replace the size calculation with the following, I will get the reduced
>> size even if the passed region is initially 2MB-aligned, so doesn't need to be
>> adjusted.
>> size = e - s + 1;
>> size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);
> Sorry I misread your original code, I think it was working as intended
> except for the "+1". I think it should be:
>
>    size = (e - start) & ~(SZ_2M - 1);
But why without "+1"? Isn't "e" here the *last address* of passed range?
Without "+1" I get non entirely correct calculations, last valid 2MB is 
missed.

[snip]
(XEN) Extended region 14: 0x580000000->0x5ffe00000
(XEN) Extended region 15: 0x680000000->0x6ffe00000
(XEN) Extended region 16: 0x780000000->0xffffe00000

But should get:

[snip]
(XEN) Extended region 15: 0x580000000->0x600000000
(XEN) Extended region 16: 0x680000000->0x700000000
(XEN) Extended region 17: 0x780000000->0x10000000000

Let's consider how a hole between (for example) RAM bank 1 and bank 2 is 
calculated:
(XEN) RAM: 0000000500000000 - 000000057fffffff <--- RAM bank 1 with size 
0x80000000
(XEN) RAM: 0000000600000000 - 000000067fffffff <--- RAM bank 2 with size 
0x80000000
So the hole size should also be 0x80000000.
If we pass these RAM banks to rangeset_remove_range() one by one:
1: s = 0x500000000 e = 0x57FFFFFFF
2. s = 0x600000000 e = 0x67FFFFFFF
we get s = 0x580000000 e = 0x5FFFFFFFF in add_ext_regions(), where "e" 
is the last address of the hole (not the first address out of the hole), 
so I think, that for proper size calculation we need to add 1 to "e - 
s". Or I really missed something?


>
>>>> + */
>>>> +#define EXT_REGION_START   0x40000000ULL
>>>> +#define EXT_REGION_END     0x80003fffffffULL
>>>> +
>>>> +static int __init find_unallocated_memory(const struct kernel_info
>>>> *kinfo,
>>>> +                                          struct meminfo *ext_regions)
>>>> +{
>>>> +    const struct meminfo *assign_mem = &kinfo->mem;
>>>> +    struct rangeset *unalloc_mem;
>>>> +    paddr_t start, end;
>>>> +    unsigned int i;
>>>> +    int res;
>>>> +
>>>> +    dt_dprintk("Find unallocated memory for extended regions\n");
>>>> +
>>>> +    unalloc_mem = rangeset_new(NULL, NULL, 0);
>>>> +    if ( !unalloc_mem )
>>>> +        return -ENOMEM;
>>>> +
>>>> +    /* Start with all available RAM */
>>>> +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
>>>> +    {
>>>> +        start = bootinfo.mem.bank[i].start;
>>>> +        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size - 1;
>>> Is the -1 needed? Isn't it going to screw up the size calculation later?
>> I thought, it was needed. The calculation seems correct.
> I think that normally for an example MMIO region:
>
> start = 0x48000000
> size  = 0x40000000
> end   = 0x88000000
>
> So end = start + size and points to the first address out of the range.
> In other words, 0x88000000 doesn't actually belong to the MMIO region in
> the example.
>
> But here you are passing addresses to rangeset_add_range and other
> rangeset functions and I think rangeset takes *inclusive* addresses as
> input. So you need to pass start and end-1 because end-1 is the last
> address of the MMIO region.
>
> In fact you can see for instance in map_range_to_domain:
>
>          res = iomem_permit_access(d, paddr_to_pfn(addr),
>                  paddr_to_pfn(PAGE_ALIGN(addr + len - 1)));
>
> Where iomem_permit_access is based on rangeset. So for clarity, I would
> do:
>
> start = assign_mem->bank[i].start;
> end = assign_mem->bank[i].start + assign_mem->bank[i].size;
> res = rangeset_remove_range(unalloc_mem, start, end - 1);
>
> So that we don't get confused on the meaning of "end" which everywhere
> else means the first address not in range.

I got your point, I will update the code if it much cleaner.


>>>> +        res = rangeset_add_range(unalloc_mem, start, end);
>>>> +        if ( res )
>>>> +        {
>>>> +            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>>>> +                   start, end);
>>>> +            goto out;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    /* Remove RAM assigned to Dom0 */
>>>> +    for ( i = 0; i < assign_mem->nr_banks; i++ )
>>>> +    {
>>>> +        start = assign_mem->bank[i].start;
>>>> +        end = assign_mem->bank[i].start + assign_mem->bank[i].size - 1;
>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>> +        if ( res )
>>>> +        {
>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>> +                   start, end);
>>>> +            goto out;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    /* Remove reserved-memory regions */
>>>> +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
>>>> +    {
>>>> +        start = bootinfo.reserved_mem.bank[i].start;
>>>> +        end = bootinfo.reserved_mem.bank[i].start +
>>>> +            bootinfo.reserved_mem.bank[i].size - 1;
>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>> +        if ( res )
>>>> +        {
>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>> +                   start, end);
>>>> +            goto out;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    /* Remove grant table region */
>>>> +    start = kinfo->gnttab_start;
>>>> +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
>>>> +    res = rangeset_remove_range(unalloc_mem, start, end);
>>>> +    if ( res )
>>>> +    {
>>>> +        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
>>>> +               start, end);
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    start = EXT_REGION_START;
>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>> +    res = rangeset_report_ranges(unalloc_mem, start, end,
>>>> +                                 add_ext_regions, ext_regions);
>>>> +    if ( res )
>>>> +        ext_regions->nr_banks = 0;
>>>> +    else if ( !ext_regions->nr_banks )
>>>> +        res = -ENOENT;
>>>> +
>>>> +out:
>>>> +    rangeset_destroy(unalloc_mem);
>>>> +
>>>> +    return res;
>>>> +}
>>>> +
>>>> +static int __init find_memory_holes(const struct kernel_info *kinfo,
>>>> +                                    struct meminfo *ext_regions)
>>>> +{
>>>> +    struct dt_device_node *np;
>>>> +    struct rangeset *mem_holes;
>>>> +    paddr_t start, end;
>>>> +    unsigned int i;
>>>> +    int res;
>>>> +
>>>> +    dt_dprintk("Find memory holes for extended regions\n");
>>>> +
>>>> +    mem_holes = rangeset_new(NULL, NULL, 0);
>>>> +    if ( !mem_holes )
>>>> +        return -ENOMEM;
>>>> +
>>>> +    /* Start with maximum possible addressable physical memory range */
>>>> +    start = EXT_REGION_START;
>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>> +    res = rangeset_add_range(mem_holes, start, end);
>>>> +    if ( res )
>>>> +    {
>>>> +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>>>> +               start, end);
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    /* Remove all regions described by "reg" property (MMIO, RAM, etc) */
>>>> +    dt_for_each_device_node( dt_host, np )
>>> Don't you need something like device_tree_for_each_node ?
>>> dt_for_each_device_node won't go down any deeper in the tree?
>> Thank you for pointing this out, I will investigate and update the patch.
>>
>>
>>> Alternatively, maybe we could simply record the highest possible address
>>> of any memory/device/anything as we scan the device tree with
>>> handle_node. Then we can use that as the starting point here.
>> I also don't like the idea to scan the DT much, but I failed to find an
>> effective solution how to avoid that.
>> Yes, we can record the highest possible address, but I am afraid, I didn't
>> entirely get a suggestion. Is the suggestion to provide a single region
>> starting from highest possible address + 1 and up to the EXT_REGION_END
>> suitably aligned? Could you please clarify?
> Yes, that is what I was suggesting as a possible alternative: start from
> the highest possible address in DT + 1 and up to the EXT_REGION_END
> suitably aligned. But that wouldn't solve the <4GB issue.
>
>>>> +                goto out;
>>>> +            }
>>>> +
>>>> +            start = addr & PAGE_MASK;
>>>> +            end = PAGE_ALIGN(addr + size) - 1;
>>>> +            res = rangeset_remove_range(mem_holes, start, end);
>>>> +            if ( res )
>>>> +            {
>>>> +                printk(XENLOG_ERR "Failed to remove:
>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>> +                       start, end);
>>>> +                goto out;
>>>> +            }
>>>> +        }
>>>> +    }
>>> As is, it will result in a myriad of small ranges which is unuseful and
>>> slow to parse. I suggest to simplify it by removing a larger region than
>>> strictly necessary. For instance, you could remove a 1GB-aligned and
>>> 1GB-multiple region for each range. That way, you are going to get fewer
>>> large free ranges instance of many small ones which we don't need.
>> I agree with you that a lot of small ranges increase the bookkeeping in Dom0
>> and there is also a theoretical (?) possibility
>> that small ranges occupy all space we provide for extended regions
>> (NR_MEM_BANKS)...
>> But, let's consider my setup as an example again, but when the IOMMU is
>> enabled for Dom0 ("holes found in DT").
>>
>> --- The RAM configuration is the same:
>>
>> (XEN) RAM: 0000000048000000 - 00000000bfffffff <--- RAM bank 0
>> (XEN) RAM: 0000000500000000 - 000000057fffffff <--- RAM bank 1
>> (XEN) RAM: 0000000600000000 - 000000067fffffff <--- RAM bank 2
>> (XEN) RAM: 0000000700000000 - 000000077fffffff <--- RAM bank 3
>>
>> --- There are a lot of various platform devices with reg property described in
>> DT, I will probably not post all IO ranges here, just say that mostly all of
>> them to be mapped at 0xE0000000-0xFFFFFFFF.
>>
>> --- As we only pick up ranges with size >= 2MB, the calculated extended
>> regions are (based on 40-bit IPA):
>>
>> (XEN) Extended region 0: 0x40000000->0x47e00000
>> (XEN) Extended region 1: 0xc0000000->0xe6000000
>> (XEN) Extended region 2: 0xe7000000->0xe7200000
>> (XEN) Extended region 3: 0xe7400000->0xe7600000
>> (XEN) Extended region 4: 0xe7800000->0xec000000
>> (XEN) Extended region 5: 0xec200000->0xec400000
>> (XEN) Extended region 6: 0xec800000->0xee000000
>> (XEN) Extended region 7: 0xee600000->0xee800000
>> (XEN) Extended region 8: 0xeea00000->0xf1000000
>> (XEN) Extended region 9: 0xf1200000->0xfd000000
>> (XEN) Extended region 10: 0xfd200000->0xfd800000
>> (XEN) Extended region 11: 0xfda00000->0xfe000000
>> (XEN) Extended region 12: 0xfe200000->0xfe600000
>> (XEN) Extended region 13: 0xfec00000->0xff800000
>> (XEN) Extended region 14: 0x100000000->0x500000000
>> (XEN) Extended region 15: 0x580000000->0x600000000
>> (XEN) Extended region 16: 0x680000000->0x700000000
>> (XEN) Extended region 17: 0x780000000->0x10000000000
>>
>> So, if I *correctly* understood your idea about removing 1GB-aligned
>> 1GB-multiple region for each range we would get the following:
>>
>> (XEN) Extended region 0: 0x100000000->0x500000000
>> (XEN) Extended region 1: 0x580000000->0x600000000
>> (XEN) Extended region 2: 0x680000000->0x700000000
>> (XEN) Extended region 3: 0x780000000->0x10000000000
>>
>> As you can see there are no extended regions below 4GB at all. I assume, it
>> would be good to provide them for 1:1 mapped Dom0 (for 32-bit DMA devices?)
>> What else worries me is that IPA size could be 36 or even 32. So, I am afraid,
>> we might even fail to find extended regions above 4GB.
>>
>>
>> I think, if 2MB is considered small enough to bother with, probably we should
>> go with something in between (16MB, 32MB, 64MB).
>> For example, we can take into the account ranges with size >= 16MB:
>>
>> (XEN) Extended region 0: 0x40000000->0x47e00000
>> (XEN) Extended region 1: 0xc0000000->0xe6000000
>> (XEN) Extended region 2: 0xe7800000->0xec000000
>> (XEN) Extended region 3: 0xec800000->0xee000000
>> (XEN) Extended region 4: 0xeea00000->0xf1000000
>> (XEN) Extended region 5: 0xf1200000->0xfd000000
>> (XEN) Extended region 6: 0x100000000->0x500000000
>> (XEN) Extended region 7: 0x580000000->0x600000000
>> (XEN) Extended region 8: 0x680000000->0x700000000
>> (XEN) Extended region 9: 0x780000000->0x10000000000
>>
>> Any thoughts?
> Yeah maybe an intermediate value would be best. I'd go with 64MB.

I completely agree.

So what I got on my setup with that value.

1. IOMMU is enabled for Dom0:

(XEN) Extended region 0: 0x40000000->0x47e00000
(XEN) Extended region 1: 0xc0000000->0xe6000000
(XEN) Extended region 2: 0xe7800000->0xec000000
(XEN) Extended region 3: 0xf1200000->0xfd000000
(XEN) Extended region 4: 0x100000000->0x500000000
(XEN) Extended region 5: 0x580000000->0x600000000
(XEN) Extended region 6: 0x680000000->0x700000000
(XEN) Extended region 7: 0x780000000->0x10000000000

2. IOMMU is disabled for Dom0:

(XEN) Extended region 0: 0x48000000->0x54000000
(XEN) Extended region 1: 0x57000000->0x60000000
(XEN) Extended region 2: 0x70000000->0x78000000
(XEN) Extended region 3: 0x78200000->0xc0000000
(XEN) Extended region 4: 0x500000000->0x580000000
(XEN) Extended region 5: 0x600000000->0x680000000
(XEN) Extended region 6: 0x700000000->0x780000000

Which is not bad.

Thank you.

-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-16 20:57         ` Oleksandr
@ 2021-09-16 21:30           ` Stefano Stabellini
  2021-09-17  7:28             ` Oleksandr
  0 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2021-09-16 21:30 UTC (permalink / raw)
  To: Oleksandr
  Cc: Stefano Stabellini, xen-devel, Oleksandr Tyshchenko,
	Julien Grall, Volodymyr Babchuk, Henry Wang, Bertrand Marquis,
	Wei Chen

On Thu, 16 Sep 2021, Oleksandr wrote:
> > On Wed, 15 Sep 2021, Oleksandr wrote:
> > > > On Fri, 10 Sep 2021, Oleksandr Tyshchenko wrote:
> > > > > From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> > > > > 
> > > > > The extended region (safe range) is a region of guest physical
> > > > > address space which is unused and could be safely used to create
> > > > > grant/foreign mappings instead of wasting real RAM pages from
> > > > > the domain memory for establishing these mappings.
> > > > > 
> > > > > The extended regions are chosen at the domain creation time and
> > > > > advertised to it via "reg" property under hypervisor node in
> > > > > the guest device-tree. As region 0 is reserved for grant table
> > > > > space (always present), the indexes for extended regions are 1...N.
> > > > > If extended regions could not be allocated for some reason,
> > > > > Xen doesn't fail and behaves as usual, so only inserts region 0.
> > > > > 
> > > > > Please note the following limitations:
> > > > > - The extended region feature is only supported for 64-bit domain.
> > > > > - The ACPI case is not covered.
> > > > > 
> > > > > ***
> > > > > 
> > > > > As Dom0 is direct mapped domain on Arm (e.g. MFN == GFN)
> > > > > the algorithm to choose extended regions for it is different
> > > > > in comparison with the algorithm for non-direct mapped DomU.
> > > > > What is more, that extended regions should be chosen differently
> > > > > whether IOMMU is enabled or not.
> > > > > 
> > > > > Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
> > > > > holes found in host device-tree if otherwise. Make sure that
> > > > > extended regions are 2MB-aligned and located within maximum possible
> > > > > addressable physical memory range. The maximum number of extended
> > > > > regions is 128.
> > > > > 
> > > > > Suggested-by: Julien Grall <jgrall@amazon.com>
> > > > > Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> > > > > ---
> > > > > Changes since RFC:
> > > > >      - update patch description
> > > > >      - drop uneeded "extended-region" DT property
> > > > > ---
> > > > > 
> > > > >    xen/arch/arm/domain_build.c | 226
> > > > > +++++++++++++++++++++++++++++++++++++++++++-
> > > > >    1 file changed, 224 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> > > > > index 206038d..070ec27 100644
> > > > > --- a/xen/arch/arm/domain_build.c
> > > > > +++ b/xen/arch/arm/domain_build.c
> > > > > @@ -724,6 +724,196 @@ static int __init make_memory_node(const struct
> > > > > domain *d,
> > > > >        return res;
> > > > >    }
> > > > >    +static int __init add_ext_regions(unsigned long s, unsigned long
> > > > > e,
> > > > > void *data)
> > > > > +{
> > > > > +    struct meminfo *ext_regions = data;
> > > > > +    paddr_t start, size;
> > > > > +
> > > > > +    if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
> > > > > +        return 0;
> > > > > +
> > > > > +    /* Both start and size of the extended region should be 2MB
> > > > > aligned
> > > > > */
> > > > > +    start = (s + SZ_2M - 1) & ~(SZ_2M - 1);
> > > > > +    if ( start > e )
> > > > > +        return 0;
> > > > > +
> > > > > +    size = (e - start + 1) & ~(SZ_2M - 1);
> > > > > +    if ( !size )
> > > > > +        return 0;
> > > > Can't you align size as well?
> > > > 
> > > >     size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);
> > > I am sorry, I don't entirely get what you really meant here. We get both
> > > start
> > > and size 2MB-aligned by the calculations above
> > > (when calculating an alignment, we need to make sure that "start_passed <=
> > > start_aligned && size_aligned <= size_passed").
> > > If I add the proposing string after, I will reduce the already aligned
> > > size by
> > > 2MB.
> > > If I replace the size calculation with the following, I will get the
> > > reduced
> > > size even if the passed region is initially 2MB-aligned, so doesn't need
> > > to be
> > > adjusted.
> > > size = e - s + 1;
> > > size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);
> > Sorry I misread your original code, I think it was working as intended
> > except for the "+1". I think it should be:
> > 
> >    size = (e - start) & ~(SZ_2M - 1);
> But why without "+1"? Isn't "e" here the *last address* of passed range?
> Without "+1" I get non entirely correct calculations, last valid 2MB is
> missed.

You are right: the "+1" should not be needed if this was "end",
following the normal definition of end. However, add_ext_regions is
called by rangeset_report_ranges, so end here is not actually "end", it
is "end-1".

For clarity, I would ask you to rewrite it like this:

/* 
 * e is actually "end-1" because it is called by rangeset functions
 * which are inclusive of the last address.
 */
e += 1;
size = (e - start) & ~(SZ_2M - 1);


> [snip]
> (XEN) Extended region 14: 0x580000000->0x5ffe00000
> (XEN) Extended region 15: 0x680000000->0x6ffe00000
> (XEN) Extended region 16: 0x780000000->0xffffe00000
> 
> But should get:
> 
> [snip]
> (XEN) Extended region 15: 0x580000000->0x600000000
> (XEN) Extended region 16: 0x680000000->0x700000000
> (XEN) Extended region 17: 0x780000000->0x10000000000
> 
> Let's consider how a hole between (for example) RAM bank 1 and bank 2 is
> calculated:
> (XEN) RAM: 0000000500000000 - 000000057fffffff <--- RAM bank 1 with size
> 0x80000000
> (XEN) RAM: 0000000600000000 - 000000067fffffff <--- RAM bank 2 with size
> 0x80000000
> So the hole size should also be 0x80000000.
> If we pass these RAM banks to rangeset_remove_range() one by one:
> 1: s = 0x500000000 e = 0x57FFFFFFF
> 2. s = 0x600000000 e = 0x67FFFFFFF
> we get s = 0x580000000 e = 0x5FFFFFFFF in add_ext_regions(), where "e" is the
> last address of the hole (not the first address out of the hole), so I think,
> that for proper size calculation we need to add 1 to "e - s". Or I really
> missed something?
> 
> 
> > 
> > > > > + */
> > > > > +#define EXT_REGION_START   0x40000000ULL
> > > > > +#define EXT_REGION_END     0x80003fffffffULL
> > > > > +
> > > > > +static int __init find_unallocated_memory(const struct kernel_info
> > > > > *kinfo,
> > > > > +                                          struct meminfo
> > > > > *ext_regions)
> > > > > +{
> > > > > +    const struct meminfo *assign_mem = &kinfo->mem;
> > > > > +    struct rangeset *unalloc_mem;
> > > > > +    paddr_t start, end;
> > > > > +    unsigned int i;
> > > > > +    int res;
> > > > > +
> > > > > +    dt_dprintk("Find unallocated memory for extended regions\n");
> > > > > +
> > > > > +    unalloc_mem = rangeset_new(NULL, NULL, 0);
> > > > > +    if ( !unalloc_mem )
> > > > > +        return -ENOMEM;
> > > > > +
> > > > > +    /* Start with all available RAM */
> > > > > +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
> > > > > +    {
> > > > > +        start = bootinfo.mem.bank[i].start;
> > > > > +        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size
> > > > > - 1;
> > > > Is the -1 needed? Isn't it going to screw up the size calculation later?
> > > I thought, it was needed. The calculation seems correct.
> > I think that normally for an example MMIO region:
> > 
> > start = 0x48000000
> > size  = 0x40000000
> > end   = 0x88000000
> > 
> > So end = start + size and points to the first address out of the range.
> > In other words, 0x88000000 doesn't actually belong to the MMIO region in
> > the example.
> > 
> > But here you are passing addresses to rangeset_add_range and other
> > rangeset functions and I think rangeset takes *inclusive* addresses as
> > input. So you need to pass start and end-1 because end-1 is the last
> > address of the MMIO region.
> > 
> > In fact you can see for instance in map_range_to_domain:
> > 
> >          res = iomem_permit_access(d, paddr_to_pfn(addr),
> >                  paddr_to_pfn(PAGE_ALIGN(addr + len - 1)));
> > 
> > Where iomem_permit_access is based on rangeset. So for clarity, I would
> > do:
> > 
> > start = assign_mem->bank[i].start;
> > end = assign_mem->bank[i].start + assign_mem->bank[i].size;
> > res = rangeset_remove_range(unalloc_mem, start, end - 1);
> > 
> > So that we don't get confused on the meaning of "end" which everywhere
> > else means the first address not in range.
> 
> I got your point, I will update the code if it much cleaner.
> 
> 
> > > > > +        res = rangeset_add_range(unalloc_mem, start, end);
> > > > > +        if ( res )
> > > > > +        {
> > > > > +            printk(XENLOG_ERR "Failed to add:
> > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > +                   start, end);
> > > > > +            goto out;
> > > > > +        }
> > > > > +    }
> > > > > +
> > > > > +    /* Remove RAM assigned to Dom0 */
> > > > > +    for ( i = 0; i < assign_mem->nr_banks; i++ )
> > > > > +    {
> > > > > +        start = assign_mem->bank[i].start;
> > > > > +        end = assign_mem->bank[i].start + assign_mem->bank[i].size -
> > > > > 1;
> > > > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > +        if ( res )
> > > > > +        {
> > > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > +                   start, end);
> > > > > +            goto out;
> > > > > +        }
> > > > > +    }
> > > > > +
> > > > > +    /* Remove reserved-memory regions */
> > > > > +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
> > > > > +    {
> > > > > +        start = bootinfo.reserved_mem.bank[i].start;
> > > > > +        end = bootinfo.reserved_mem.bank[i].start +
> > > > > +            bootinfo.reserved_mem.bank[i].size - 1;
> > > > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > +        if ( res )
> > > > > +        {
> > > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > +                   start, end);
> > > > > +            goto out;
> > > > > +        }
> > > > > +    }
> > > > > +
> > > > > +    /* Remove grant table region */
> > > > > +    start = kinfo->gnttab_start;
> > > > > +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
> > > > > +    res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > +    if ( res )
> > > > > +    {
> > > > > +        printk(XENLOG_ERR "Failed to remove:
> > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > +               start, end);
> > > > > +        goto out;
> > > > > +    }
> > > > > +
> > > > > +    start = EXT_REGION_START;
> > > > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > > > +    res = rangeset_report_ranges(unalloc_mem, start, end,
> > > > > +                                 add_ext_regions, ext_regions);
> > > > > +    if ( res )
> > > > > +        ext_regions->nr_banks = 0;
> > > > > +    else if ( !ext_regions->nr_banks )
> > > > > +        res = -ENOENT;
> > > > > +
> > > > > +out:
> > > > > +    rangeset_destroy(unalloc_mem);
> > > > > +
> > > > > +    return res;
> > > > > +}
> > > > > +
> > > > > +static int __init find_memory_holes(const struct kernel_info *kinfo,
> > > > > +                                    struct meminfo *ext_regions)
> > > > > +{
> > > > > +    struct dt_device_node *np;
> > > > > +    struct rangeset *mem_holes;
> > > > > +    paddr_t start, end;
> > > > > +    unsigned int i;
> > > > > +    int res;
> > > > > +
> > > > > +    dt_dprintk("Find memory holes for extended regions\n");
> > > > > +
> > > > > +    mem_holes = rangeset_new(NULL, NULL, 0);
> > > > > +    if ( !mem_holes )
> > > > > +        return -ENOMEM;
> > > > > +
> > > > > +    /* Start with maximum possible addressable physical memory range
> > > > > */
> > > > > +    start = EXT_REGION_START;
> > > > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > > > +    res = rangeset_add_range(mem_holes, start, end);
> > > > > +    if ( res )
> > > > > +    {
> > > > > +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> > > > > +               start, end);
> > > > > +        goto out;
> > > > > +    }
> > > > > +
> > > > > +    /* Remove all regions described by "reg" property (MMIO, RAM,
> > > > > etc) */
> > > > > +    dt_for_each_device_node( dt_host, np )
> > > > Don't you need something like device_tree_for_each_node ?
> > > > dt_for_each_device_node won't go down any deeper in the tree?
> > > Thank you for pointing this out, I will investigate and update the patch.
> > > 
> > > 
> > > > Alternatively, maybe we could simply record the highest possible address
> > > > of any memory/device/anything as we scan the device tree with
> > > > handle_node. Then we can use that as the starting point here.
> > > I also don't like the idea to scan the DT much, but I failed to find an
> > > effective solution how to avoid that.
> > > Yes, we can record the highest possible address, but I am afraid, I didn't
> > > entirely get a suggestion. Is the suggestion to provide a single region
> > > starting from highest possible address + 1 and up to the EXT_REGION_END
> > > suitably aligned? Could you please clarify?
> > Yes, that is what I was suggesting as a possible alternative: start from
> > the highest possible address in DT + 1 and up to the EXT_REGION_END
> > suitably aligned. But that wouldn't solve the <4GB issue.
> > 
> > > > > +                goto out;
> > > > > +            }
> > > > > +
> > > > > +            start = addr & PAGE_MASK;
> > > > > +            end = PAGE_ALIGN(addr + size) - 1;
> > > > > +            res = rangeset_remove_range(mem_holes, start, end);
> > > > > +            if ( res )
> > > > > +            {
> > > > > +                printk(XENLOG_ERR "Failed to remove:
> > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > +                       start, end);
> > > > > +                goto out;
> > > > > +            }
> > > > > +        }
> > > > > +    }
> > > > As is, it will result in a myriad of small ranges which is unuseful and
> > > > slow to parse. I suggest to simplify it by removing a larger region than
> > > > strictly necessary. For instance, you could remove a 1GB-aligned and
> > > > 1GB-multiple region for each range. That way, you are going to get fewer
> > > > large free ranges instance of many small ones which we don't need.
> > > I agree with you that a lot of small ranges increase the bookkeeping in
> > > Dom0
> > > and there is also a theoretical (?) possibility
> > > that small ranges occupy all space we provide for extended regions
> > > (NR_MEM_BANKS)...
> > > But, let's consider my setup as an example again, but when the IOMMU is
> > > enabled for Dom0 ("holes found in DT").
> > > 
> > > --- The RAM configuration is the same:
> > > 
> > > (XEN) RAM: 0000000048000000 - 00000000bfffffff <--- RAM bank 0
> > > (XEN) RAM: 0000000500000000 - 000000057fffffff <--- RAM bank 1
> > > (XEN) RAM: 0000000600000000 - 000000067fffffff <--- RAM bank 2
> > > (XEN) RAM: 0000000700000000 - 000000077fffffff <--- RAM bank 3
> > > 
> > > --- There are a lot of various platform devices with reg property
> > > described in
> > > DT, I will probably not post all IO ranges here, just say that mostly all
> > > of
> > > them to be mapped at 0xE0000000-0xFFFFFFFF.
> > > 
> > > --- As we only pick up ranges with size >= 2MB, the calculated extended
> > > regions are (based on 40-bit IPA):
> > > 
> > > (XEN) Extended region 0: 0x40000000->0x47e00000
> > > (XEN) Extended region 1: 0xc0000000->0xe6000000
> > > (XEN) Extended region 2: 0xe7000000->0xe7200000
> > > (XEN) Extended region 3: 0xe7400000->0xe7600000
> > > (XEN) Extended region 4: 0xe7800000->0xec000000
> > > (XEN) Extended region 5: 0xec200000->0xec400000
> > > (XEN) Extended region 6: 0xec800000->0xee000000
> > > (XEN) Extended region 7: 0xee600000->0xee800000
> > > (XEN) Extended region 8: 0xeea00000->0xf1000000
> > > (XEN) Extended region 9: 0xf1200000->0xfd000000
> > > (XEN) Extended region 10: 0xfd200000->0xfd800000
> > > (XEN) Extended region 11: 0xfda00000->0xfe000000
> > > (XEN) Extended region 12: 0xfe200000->0xfe600000
> > > (XEN) Extended region 13: 0xfec00000->0xff800000
> > > (XEN) Extended region 14: 0x100000000->0x500000000
> > > (XEN) Extended region 15: 0x580000000->0x600000000
> > > (XEN) Extended region 16: 0x680000000->0x700000000
> > > (XEN) Extended region 17: 0x780000000->0x10000000000
> > > 
> > > So, if I *correctly* understood your idea about removing 1GB-aligned
> > > 1GB-multiple region for each range we would get the following:
> > > 
> > > (XEN) Extended region 0: 0x100000000->0x500000000
> > > (XEN) Extended region 1: 0x580000000->0x600000000
> > > (XEN) Extended region 2: 0x680000000->0x700000000
> > > (XEN) Extended region 3: 0x780000000->0x10000000000
> > > 
> > > As you can see there are no extended regions below 4GB at all. I assume,
> > > it
> > > would be good to provide them for 1:1 mapped Dom0 (for 32-bit DMA
> > > devices?)
> > > What else worries me is that IPA size could be 36 or even 32. So, I am
> > > afraid,
> > > we might even fail to find extended regions above 4GB.
> > > 
> > > 
> > > I think, if 2MB is considered small enough to bother with, probably we
> > > should
> > > go with something in between (16MB, 32MB, 64MB).
> > > For example, we can take into the account ranges with size >= 16MB:
> > > 
> > > (XEN) Extended region 0: 0x40000000->0x47e00000
> > > (XEN) Extended region 1: 0xc0000000->0xe6000000
> > > (XEN) Extended region 2: 0xe7800000->0xec000000
> > > (XEN) Extended region 3: 0xec800000->0xee000000
> > > (XEN) Extended region 4: 0xeea00000->0xf1000000
> > > (XEN) Extended region 5: 0xf1200000->0xfd000000
> > > (XEN) Extended region 6: 0x100000000->0x500000000
> > > (XEN) Extended region 7: 0x580000000->0x600000000
> > > (XEN) Extended region 8: 0x680000000->0x700000000
> > > (XEN) Extended region 9: 0x780000000->0x10000000000
> > > 
> > > Any thoughts?
> > Yeah maybe an intermediate value would be best. I'd go with 64MB.
> 
> I completely agree.
> 
> So what I got on my setup with that value.
> 
> 1. IOMMU is enabled for Dom0:
> 
> (XEN) Extended region 0: 0x40000000->0x47e00000
> (XEN) Extended region 1: 0xc0000000->0xe6000000
> (XEN) Extended region 2: 0xe7800000->0xec000000
> (XEN) Extended region 3: 0xf1200000->0xfd000000
> (XEN) Extended region 4: 0x100000000->0x500000000
> (XEN) Extended region 5: 0x580000000->0x600000000
> (XEN) Extended region 6: 0x680000000->0x700000000
> (XEN) Extended region 7: 0x780000000->0x10000000000
> 
> 2. IOMMU is disabled for Dom0:
> 
> (XEN) Extended region 0: 0x48000000->0x54000000
> (XEN) Extended region 1: 0x57000000->0x60000000
> (XEN) Extended region 2: 0x70000000->0x78000000
> (XEN) Extended region 3: 0x78200000->0xc0000000
> (XEN) Extended region 4: 0x500000000->0x580000000
> (XEN) Extended region 5: 0x600000000->0x680000000
> (XEN) Extended region 6: 0x700000000->0x780000000
> 
> Which is not bad.

Yeah I think that's good.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 3/3] libxl/arm: Add handling of extended regions for DomU
  2021-09-10 18:18 ` [PATCH V2 3/3] libxl/arm: Add handling of extended regions for DomU Oleksandr Tyshchenko
@ 2021-09-16 22:35   ` Stefano Stabellini
  2021-09-20 20:07     ` Oleksandr
  0 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2021-09-16 22:35 UTC (permalink / raw)
  To: Oleksandr Tyshchenko
  Cc: xen-devel, Oleksandr Tyshchenko, Ian Jackson, Wei Liu,
	Anthony PERARD, Juergen Gross, Henry Wang, Bertrand Marquis,
	Wei Chen, Julien Grall, Stefano Stabellini

On Fri, 10 Sep 2021, Oleksandr Tyshchenko wrote:
> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> 
> The extended region (safe range) is a region of guest physical
> address space which is unused and could be safely used to create
> grant/foreign mappings instead of wasting real RAM pages from
> the domain memory for establishing these mappings.
> 
> The extended regions are chosen at the domain creation time and
> advertised to it via "reg" property under hypervisor node in
> the guest device-tree. As region 0 is reserved for grant table
> space (always present), the indexes for extended regions are 1...N.
> If extended regions could not be allocated for some reason,
> Xen doesn't fail and behaves as usual, so only inserts region 0.
> 
> Please note the following limitations:
> - The extended region feature is only supported for 64-bit domain.
> - The ACPI case is not covered.
> 
> ***
> 
> The algorithm to choose extended regions for non-direct mapped
> DomU is simpler in comparison with the algorithm for direct mapped
> Dom0. As we have a lot of unused space above 4GB, provide single
> 1GB-aligned region from the second RAM bank taking into the account
> the maximum supported guest address space size and the amount of
> memory assigned to the guest. The maximum size of the region is 128GB.
> 
> Suggested-by: Julien Grall <jgrall@amazon.com>
> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> ---
> Changes since RFC:
>    - update patch description
>    - drop uneeded "extended-region" DT property
>    - clear reg array in finalise_ext_region() and add a TODO
> ---
>  tools/libs/light/libxl_arm.c | 89 +++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 87 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/libs/light/libxl_arm.c b/tools/libs/light/libxl_arm.c
> index e3140a6..8c1d9d7 100644
> --- a/tools/libs/light/libxl_arm.c
> +++ b/tools/libs/light/libxl_arm.c
> @@ -615,9 +615,12 @@ static int make_hypervisor_node(libxl__gc *gc, void *fdt,
>                                "xen,xen");
>      if (res) return res;
>  
> -    /* reg 0 is grant table space */
> +    /*
> +     * reg 0 is a placeholder for grant table space, reg 1 is a placeholder
> +     * for the extended region.
> +     */
>      res = fdt_property_regs(gc, fdt, GUEST_ROOT_ADDRESS_CELLS, GUEST_ROOT_SIZE_CELLS,
> -                            1,GUEST_GNTTAB_BASE, GUEST_GNTTAB_SIZE);
> +                            2, 0, 0, 0, 0);
>      if (res) return res;
>  
>      /*
> @@ -1069,6 +1072,86 @@ static void finalise_one_node(libxl__gc *gc, void *fdt, const char *uname,
>      }
>  }
>  
> +#define ALIGN_UP_TO_GB(x)   (((x) + GB(1) - 1) & (~(GB(1) - 1)))

Why do we need to align it to 1GB when for Dom0 is aligned to 2MB? I
think it makes sense to have the same alignment requirement.


> +
> +#define EXT_REGION_SIZE   GB(128)

The region needs to be described in xen/include/public/arch-arm.h like
GUEST_RAM1_BASE/SIZE.


> +static void finalise_ext_region(libxl__gc *gc, struct xc_dom_image *dom)
> +{
> +    void *fdt = dom->devicetree_blob;
> +    uint32_t regs[(GUEST_ROOT_ADDRESS_CELLS + GUEST_ROOT_SIZE_CELLS) * 2];
> +    be32 *cells = &regs[0];
> +    uint64_t region_size = 0, region_base, bank1end_align, bank1end_max;
> +    uint32_t gpaddr_bits;
> +    libxl_physinfo info;
> +    int offset, rc;
> +
> +    offset = fdt_path_offset(fdt, "/hypervisor");
> +    assert(offset > 0);
> +
> +    if (strcmp(dom->guest_type, "xen-3.0-aarch64")) {
> +        LOG(WARN, "The extended region is only supported for 64-bit guest");
> +        goto out;
> +    }
> +
> +    rc = libxl_get_physinfo(CTX, &info);
> +    assert(!rc);
> +
> +    gpaddr_bits = info.gpaddr_bits;
> +    assert(gpaddr_bits >= 32 && gpaddr_bits <= 48);
> +
> +    /*
> +     * Try to allocate single 1GB-aligned extended region from the second RAM
> +     * bank (above 4GB) taking into the account the maximum supported guest
> +     * address space size and the amount of memory assigned to the guest.
> +     * The maximum size of the region is 128GB.
> +     */
> +    bank1end_max = min(1ULL << gpaddr_bits, GUEST_RAM1_BASE + GUEST_RAM1_SIZE);
> +    bank1end_align = GUEST_RAM1_BASE +
> +        ALIGN_UP_TO_GB((uint64_t)dom->rambank_size[1] << XC_PAGE_SHIFT);
> +
> +    if (bank1end_max <= bank1end_align) {
> +        LOG(WARN, "The extended region cannot be allocated, not enough space");
> +        goto out;
> +    }
> +
> +    if (bank1end_max - bank1end_align > EXT_REGION_SIZE) {
> +        region_base = bank1end_max - EXT_REGION_SIZE;
> +        region_size = EXT_REGION_SIZE;
> +    } else {
> +        region_base = bank1end_align;
> +        region_size = bank1end_max - bank1end_align;
> +    }
> +
> +out:
> +    /*
> +     * The first region for grant table space must be always present.
> +     * If we managed to allocate the extended region then insert it as
> +     * a second region.
> +     * TODO If we failed to allocate the region, we end up inserting
> +     * zero-sized region. This is because we don't know in advance when
> +     * creating hypervisor node whether it would be possible to allocate
> +     * a region, but we have to create a placeholder anyway. The Linux driver
> +     * is able to deal with by checking the region size. We cannot choose
> +     * a region when creating hypervisor node because the guest memory layout
> +     * is not know at that moment (and dom->rambank_size[1] is empty).
> +     * We need to find a way not to expose invalid regions.
> +     */

This is not great -- it would be barely spec compliant. 

When make_hypervisor_node is called we know the max memory of the guest
as build_info.max_memkb should be populate, right?

If so, we could at least detect whether we can have an extended region
(if not caculate the exact start address) from make_hypervisor_node.

total_guest_memory = build_info.max_memkb * 1024;
rambank1_approx = total_guest_memory - GUEST_RAM0_SIZE;
extended_region_size = GUEST_RAM1_SIZE - rambank1_approx;

if (extended_region_size >= MIN_EXT_REGION_SIZE)
    allocate_ext_region


> +    memset(regs, 0, sizeof(regs));
> +    set_range(&cells, GUEST_ROOT_ADDRESS_CELLS, GUEST_ROOT_SIZE_CELLS,
> +              GUEST_GNTTAB_BASE, GUEST_GNTTAB_SIZE);
> +    if (region_size > 0) {

I think we want to check against a minimum amount rather than 0. Maybe 64MB?


> +        LOG(DEBUG, "Extended region: %#"PRIx64"->%#"PRIx64"\n",
> +            region_base, region_base + region_size);
> +
> +        set_range(&cells, GUEST_ROOT_ADDRESS_CELLS, GUEST_ROOT_SIZE_CELLS,
> +                  region_base, region_size);
> +    }
> +
> +    rc = fdt_setprop_inplace(fdt, offset, "reg", regs, sizeof(regs));
> +    assert(!rc);
> +}
> +
>  int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
>                                                 uint32_t domid,
>                                                 libxl_domain_config *d_config,
> @@ -1109,6 +1192,8 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
>  
>      }
>  
> +    finalise_ext_region(gc, dom);
> +
>      for (i = 0; i < GUEST_RAM_BANKS; i++) {
>          const uint64_t size = (uint64_t)dom->rambank_size[i] << XC_PAGE_SHIFT;
>  
> -- 
> 2.7.4
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-16 21:30           ` Stefano Stabellini
@ 2021-09-17  7:28             ` Oleksandr
  0 siblings, 0 replies; 43+ messages in thread
From: Oleksandr @ 2021-09-17  7:28 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Oleksandr Tyshchenko, Julien Grall, Volodymyr Babchuk,
	Henry Wang, Bertrand Marquis, Wei Chen


On 17.09.21 00:30, Stefano Stabellini wrote:

Hi Stefano

> On Thu, 16 Sep 2021, Oleksandr wrote:
>>> On Wed, 15 Sep 2021, Oleksandr wrote:
>>>>> On Fri, 10 Sep 2021, Oleksandr Tyshchenko wrote:
>>>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>>>
>>>>>> The extended region (safe range) is a region of guest physical
>>>>>> address space which is unused and could be safely used to create
>>>>>> grant/foreign mappings instead of wasting real RAM pages from
>>>>>> the domain memory for establishing these mappings.
>>>>>>
>>>>>> The extended regions are chosen at the domain creation time and
>>>>>> advertised to it via "reg" property under hypervisor node in
>>>>>> the guest device-tree. As region 0 is reserved for grant table
>>>>>> space (always present), the indexes for extended regions are 1...N.
>>>>>> If extended regions could not be allocated for some reason,
>>>>>> Xen doesn't fail and behaves as usual, so only inserts region 0.
>>>>>>
>>>>>> Please note the following limitations:
>>>>>> - The extended region feature is only supported for 64-bit domain.
>>>>>> - The ACPI case is not covered.
>>>>>>
>>>>>> ***
>>>>>>
>>>>>> As Dom0 is direct mapped domain on Arm (e.g. MFN == GFN)
>>>>>> the algorithm to choose extended regions for it is different
>>>>>> in comparison with the algorithm for non-direct mapped DomU.
>>>>>> What is more, that extended regions should be chosen differently
>>>>>> whether IOMMU is enabled or not.
>>>>>>
>>>>>> Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
>>>>>> holes found in host device-tree if otherwise. Make sure that
>>>>>> extended regions are 2MB-aligned and located within maximum possible
>>>>>> addressable physical memory range. The maximum number of extended
>>>>>> regions is 128.
>>>>>>
>>>>>> Suggested-by: Julien Grall <jgrall@amazon.com>
>>>>>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>>> ---
>>>>>> Changes since RFC:
>>>>>>       - update patch description
>>>>>>       - drop uneeded "extended-region" DT property
>>>>>> ---
>>>>>>
>>>>>>     xen/arch/arm/domain_build.c | 226
>>>>>> +++++++++++++++++++++++++++++++++++++++++++-
>>>>>>     1 file changed, 224 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
>>>>>> index 206038d..070ec27 100644
>>>>>> --- a/xen/arch/arm/domain_build.c
>>>>>> +++ b/xen/arch/arm/domain_build.c
>>>>>> @@ -724,6 +724,196 @@ static int __init make_memory_node(const struct
>>>>>> domain *d,
>>>>>>         return res;
>>>>>>     }
>>>>>>     +static int __init add_ext_regions(unsigned long s, unsigned long
>>>>>> e,
>>>>>> void *data)
>>>>>> +{
>>>>>> +    struct meminfo *ext_regions = data;
>>>>>> +    paddr_t start, size;
>>>>>> +
>>>>>> +    if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
>>>>>> +        return 0;
>>>>>> +
>>>>>> +    /* Both start and size of the extended region should be 2MB
>>>>>> aligned
>>>>>> */
>>>>>> +    start = (s + SZ_2M - 1) & ~(SZ_2M - 1);
>>>>>> +    if ( start > e )
>>>>>> +        return 0;
>>>>>> +
>>>>>> +    size = (e - start + 1) & ~(SZ_2M - 1);
>>>>>> +    if ( !size )
>>>>>> +        return 0;
>>>>> Can't you align size as well?
>>>>>
>>>>>      size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);
>>>> I am sorry, I don't entirely get what you really meant here. We get both
>>>> start
>>>> and size 2MB-aligned by the calculations above
>>>> (when calculating an alignment, we need to make sure that "start_passed <=
>>>> start_aligned && size_aligned <= size_passed").
>>>> If I add the proposing string after, I will reduce the already aligned
>>>> size by
>>>> 2MB.
>>>> If I replace the size calculation with the following, I will get the
>>>> reduced
>>>> size even if the passed region is initially 2MB-aligned, so doesn't need
>>>> to be
>>>> adjusted.
>>>> size = e - s + 1;
>>>> size = (size - (SZ_2M - 1)) & ~(SZ_2M - 1);
>>> Sorry I misread your original code, I think it was working as intended
>>> except for the "+1". I think it should be:
>>>
>>>     size = (e - start) & ~(SZ_2M - 1);
>> But why without "+1"? Isn't "e" here the *last address* of passed range?
>> Without "+1" I get non entirely correct calculations, last valid 2MB is
>> missed.
> You are right: the "+1" should not be needed if this was "end",
> following the normal definition of end. However, add_ext_regions is
> called by rangeset_report_ranges, so end here is not actually "end", it
> is "end-1".

Yes.


>
> For clarity, I would ask you to rewrite it like this:
>
> /*
>   * e is actually "end-1" because it is called by rangeset functions
>   * which are inclusive of the last address.
>   */
> e += 1;
> size = (e - start) & ~(SZ_2M - 1);

Ack, will do.


>
>
>> [snip]
>> (XEN) Extended region 14: 0x580000000->0x5ffe00000
>> (XEN) Extended region 15: 0x680000000->0x6ffe00000
>> (XEN) Extended region 16: 0x780000000->0xffffe00000
>>
>> But should get:
>>
>> [snip]
>> (XEN) Extended region 15: 0x580000000->0x600000000
>> (XEN) Extended region 16: 0x680000000->0x700000000
>> (XEN) Extended region 17: 0x780000000->0x10000000000
>>
>> Let's consider how a hole between (for example) RAM bank 1 and bank 2 is
>> calculated:
>> (XEN) RAM: 0000000500000000 - 000000057fffffff <--- RAM bank 1 with size
>> 0x80000000
>> (XEN) RAM: 0000000600000000 - 000000067fffffff <--- RAM bank 2 with size
>> 0x80000000
>> So the hole size should also be 0x80000000.
>> If we pass these RAM banks to rangeset_remove_range() one by one:
>> 1: s = 0x500000000 e = 0x57FFFFFFF
>> 2. s = 0x600000000 e = 0x67FFFFFFF
>> we get s = 0x580000000 e = 0x5FFFFFFFF in add_ext_regions(), where "e" is the
>> last address of the hole (not the first address out of the hole), so I think,
>> that for proper size calculation we need to add 1 to "e - s". Or I really
>> missed something?
>>
>>
>>>>>> + */
>>>>>> +#define EXT_REGION_START   0x40000000ULL
>>>>>> +#define EXT_REGION_END     0x80003fffffffULL
>>>>>> +
>>>>>> +static int __init find_unallocated_memory(const struct kernel_info
>>>>>> *kinfo,
>>>>>> +                                          struct meminfo
>>>>>> *ext_regions)
>>>>>> +{
>>>>>> +    const struct meminfo *assign_mem = &kinfo->mem;
>>>>>> +    struct rangeset *unalloc_mem;
>>>>>> +    paddr_t start, end;
>>>>>> +    unsigned int i;
>>>>>> +    int res;
>>>>>> +
>>>>>> +    dt_dprintk("Find unallocated memory for extended regions\n");
>>>>>> +
>>>>>> +    unalloc_mem = rangeset_new(NULL, NULL, 0);
>>>>>> +    if ( !unalloc_mem )
>>>>>> +        return -ENOMEM;
>>>>>> +
>>>>>> +    /* Start with all available RAM */
>>>>>> +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
>>>>>> +    {
>>>>>> +        start = bootinfo.mem.bank[i].start;
>>>>>> +        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size
>>>>>> - 1;
>>>>> Is the -1 needed? Isn't it going to screw up the size calculation later?
>>>> I thought, it was needed. The calculation seems correct.
>>> I think that normally for an example MMIO region:
>>>
>>> start = 0x48000000
>>> size  = 0x40000000
>>> end   = 0x88000000
>>>
>>> So end = start + size and points to the first address out of the range.
>>> In other words, 0x88000000 doesn't actually belong to the MMIO region in
>>> the example.
>>>
>>> But here you are passing addresses to rangeset_add_range and other
>>> rangeset functions and I think rangeset takes *inclusive* addresses as
>>> input. So you need to pass start and end-1 because end-1 is the last
>>> address of the MMIO region.
>>>
>>> In fact you can see for instance in map_range_to_domain:
>>>
>>>           res = iomem_permit_access(d, paddr_to_pfn(addr),
>>>                   paddr_to_pfn(PAGE_ALIGN(addr + len - 1)));
>>>
>>> Where iomem_permit_access is based on rangeset. So for clarity, I would
>>> do:
>>>
>>> start = assign_mem->bank[i].start;
>>> end = assign_mem->bank[i].start + assign_mem->bank[i].size;
>>> res = rangeset_remove_range(unalloc_mem, start, end - 1);
>>>
>>> So that we don't get confused on the meaning of "end" which everywhere
>>> else means the first address not in range.
>> I got your point, I will update the code if it much cleaner.
>>
>>
>>>>>> +        res = rangeset_add_range(unalloc_mem, start, end);
>>>>>> +        if ( res )
>>>>>> +        {
>>>>>> +            printk(XENLOG_ERR "Failed to add:
>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>> +                   start, end);
>>>>>> +            goto out;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Remove RAM assigned to Dom0 */
>>>>>> +    for ( i = 0; i < assign_mem->nr_banks; i++ )
>>>>>> +    {
>>>>>> +        start = assign_mem->bank[i].start;
>>>>>> +        end = assign_mem->bank[i].start + assign_mem->bank[i].size -
>>>>>> 1;
>>>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>>>> +        if ( res )
>>>>>> +        {
>>>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>> +                   start, end);
>>>>>> +            goto out;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Remove reserved-memory regions */
>>>>>> +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
>>>>>> +    {
>>>>>> +        start = bootinfo.reserved_mem.bank[i].start;
>>>>>> +        end = bootinfo.reserved_mem.bank[i].start +
>>>>>> +            bootinfo.reserved_mem.bank[i].size - 1;
>>>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>>>> +        if ( res )
>>>>>> +        {
>>>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>> +                   start, end);
>>>>>> +            goto out;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Remove grant table region */
>>>>>> +    start = kinfo->gnttab_start;
>>>>>> +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
>>>>>> +    res = rangeset_remove_range(unalloc_mem, start, end);
>>>>>> +    if ( res )
>>>>>> +    {
>>>>>> +        printk(XENLOG_ERR "Failed to remove:
>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>> +               start, end);
>>>>>> +        goto out;
>>>>>> +    }
>>>>>> +
>>>>>> +    start = EXT_REGION_START;
>>>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>>>> +    res = rangeset_report_ranges(unalloc_mem, start, end,
>>>>>> +                                 add_ext_regions, ext_regions);
>>>>>> +    if ( res )
>>>>>> +        ext_regions->nr_banks = 0;
>>>>>> +    else if ( !ext_regions->nr_banks )
>>>>>> +        res = -ENOENT;
>>>>>> +
>>>>>> +out:
>>>>>> +    rangeset_destroy(unalloc_mem);
>>>>>> +
>>>>>> +    return res;
>>>>>> +}
>>>>>> +
>>>>>> +static int __init find_memory_holes(const struct kernel_info *kinfo,
>>>>>> +                                    struct meminfo *ext_regions)
>>>>>> +{
>>>>>> +    struct dt_device_node *np;
>>>>>> +    struct rangeset *mem_holes;
>>>>>> +    paddr_t start, end;
>>>>>> +    unsigned int i;
>>>>>> +    int res;
>>>>>> +
>>>>>> +    dt_dprintk("Find memory holes for extended regions\n");
>>>>>> +
>>>>>> +    mem_holes = rangeset_new(NULL, NULL, 0);
>>>>>> +    if ( !mem_holes )
>>>>>> +        return -ENOMEM;
>>>>>> +
>>>>>> +    /* Start with maximum possible addressable physical memory range
>>>>>> */
>>>>>> +    start = EXT_REGION_START;
>>>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>>>> +    res = rangeset_add_range(mem_holes, start, end);
>>>>>> +    if ( res )
>>>>>> +    {
>>>>>> +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>>>>>> +               start, end);
>>>>>> +        goto out;
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Remove all regions described by "reg" property (MMIO, RAM,
>>>>>> etc) */
>>>>>> +    dt_for_each_device_node( dt_host, np )
>>>>> Don't you need something like device_tree_for_each_node ?
>>>>> dt_for_each_device_node won't go down any deeper in the tree?
>>>> Thank you for pointing this out, I will investigate and update the patch.
>>>>
>>>>
>>>>> Alternatively, maybe we could simply record the highest possible address
>>>>> of any memory/device/anything as we scan the device tree with
>>>>> handle_node. Then we can use that as the starting point here.
>>>> I also don't like the idea to scan the DT much, but I failed to find an
>>>> effective solution how to avoid that.
>>>> Yes, we can record the highest possible address, but I am afraid, I didn't
>>>> entirely get a suggestion. Is the suggestion to provide a single region
>>>> starting from highest possible address + 1 and up to the EXT_REGION_END
>>>> suitably aligned? Could you please clarify?
>>> Yes, that is what I was suggesting as a possible alternative: start from
>>> the highest possible address in DT + 1 and up to the EXT_REGION_END
>>> suitably aligned. But that wouldn't solve the <4GB issue.
>>>
>>>>>> +                goto out;
>>>>>> +            }
>>>>>> +
>>>>>> +            start = addr & PAGE_MASK;
>>>>>> +            end = PAGE_ALIGN(addr + size) - 1;
>>>>>> +            res = rangeset_remove_range(mem_holes, start, end);
>>>>>> +            if ( res )
>>>>>> +            {
>>>>>> +                printk(XENLOG_ERR "Failed to remove:
>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>> +                       start, end);
>>>>>> +                goto out;
>>>>>> +            }
>>>>>> +        }
>>>>>> +    }
>>>>> As is, it will result in a myriad of small ranges which is unuseful and
>>>>> slow to parse. I suggest to simplify it by removing a larger region than
>>>>> strictly necessary. For instance, you could remove a 1GB-aligned and
>>>>> 1GB-multiple region for each range. That way, you are going to get fewer
>>>>> large free ranges instance of many small ones which we don't need.
>>>> I agree with you that a lot of small ranges increase the bookkeeping in
>>>> Dom0
>>>> and there is also a theoretical (?) possibility
>>>> that small ranges occupy all space we provide for extended regions
>>>> (NR_MEM_BANKS)...
>>>> But, let's consider my setup as an example again, but when the IOMMU is
>>>> enabled for Dom0 ("holes found in DT").
>>>>
>>>> --- The RAM configuration is the same:
>>>>
>>>> (XEN) RAM: 0000000048000000 - 00000000bfffffff <--- RAM bank 0
>>>> (XEN) RAM: 0000000500000000 - 000000057fffffff <--- RAM bank 1
>>>> (XEN) RAM: 0000000600000000 - 000000067fffffff <--- RAM bank 2
>>>> (XEN) RAM: 0000000700000000 - 000000077fffffff <--- RAM bank 3
>>>>
>>>> --- There are a lot of various platform devices with reg property
>>>> described in
>>>> DT, I will probably not post all IO ranges here, just say that mostly all
>>>> of
>>>> them to be mapped at 0xE0000000-0xFFFFFFFF.
>>>>
>>>> --- As we only pick up ranges with size >= 2MB, the calculated extended
>>>> regions are (based on 40-bit IPA):
>>>>
>>>> (XEN) Extended region 0: 0x40000000->0x47e00000
>>>> (XEN) Extended region 1: 0xc0000000->0xe6000000
>>>> (XEN) Extended region 2: 0xe7000000->0xe7200000
>>>> (XEN) Extended region 3: 0xe7400000->0xe7600000
>>>> (XEN) Extended region 4: 0xe7800000->0xec000000
>>>> (XEN) Extended region 5: 0xec200000->0xec400000
>>>> (XEN) Extended region 6: 0xec800000->0xee000000
>>>> (XEN) Extended region 7: 0xee600000->0xee800000
>>>> (XEN) Extended region 8: 0xeea00000->0xf1000000
>>>> (XEN) Extended region 9: 0xf1200000->0xfd000000
>>>> (XEN) Extended region 10: 0xfd200000->0xfd800000
>>>> (XEN) Extended region 11: 0xfda00000->0xfe000000
>>>> (XEN) Extended region 12: 0xfe200000->0xfe600000
>>>> (XEN) Extended region 13: 0xfec00000->0xff800000
>>>> (XEN) Extended region 14: 0x100000000->0x500000000
>>>> (XEN) Extended region 15: 0x580000000->0x600000000
>>>> (XEN) Extended region 16: 0x680000000->0x700000000
>>>> (XEN) Extended region 17: 0x780000000->0x10000000000
>>>>
>>>> So, if I *correctly* understood your idea about removing 1GB-aligned
>>>> 1GB-multiple region for each range we would get the following:
>>>>
>>>> (XEN) Extended region 0: 0x100000000->0x500000000
>>>> (XEN) Extended region 1: 0x580000000->0x600000000
>>>> (XEN) Extended region 2: 0x680000000->0x700000000
>>>> (XEN) Extended region 3: 0x780000000->0x10000000000
>>>>
>>>> As you can see there are no extended regions below 4GB at all. I assume,
>>>> it
>>>> would be good to provide them for 1:1 mapped Dom0 (for 32-bit DMA
>>>> devices?)
>>>> What else worries me is that IPA size could be 36 or even 32. So, I am
>>>> afraid,
>>>> we might even fail to find extended regions above 4GB.
>>>>
>>>>
>>>> I think, if 2MB is considered small enough to bother with, probably we
>>>> should
>>>> go with something in between (16MB, 32MB, 64MB).
>>>> For example, we can take into the account ranges with size >= 16MB:
>>>>
>>>> (XEN) Extended region 0: 0x40000000->0x47e00000
>>>> (XEN) Extended region 1: 0xc0000000->0xe6000000
>>>> (XEN) Extended region 2: 0xe7800000->0xec000000
>>>> (XEN) Extended region 3: 0xec800000->0xee000000
>>>> (XEN) Extended region 4: 0xeea00000->0xf1000000
>>>> (XEN) Extended region 5: 0xf1200000->0xfd000000
>>>> (XEN) Extended region 6: 0x100000000->0x500000000
>>>> (XEN) Extended region 7: 0x580000000->0x600000000
>>>> (XEN) Extended region 8: 0x680000000->0x700000000
>>>> (XEN) Extended region 9: 0x780000000->0x10000000000
>>>>
>>>> Any thoughts?
>>> Yeah maybe an intermediate value would be best. I'd go with 64MB.
>> I completely agree.
>>
>> So what I got on my setup with that value.
>>
>> 1. IOMMU is enabled for Dom0:
>>
>> (XEN) Extended region 0: 0x40000000->0x47e00000
>> (XEN) Extended region 1: 0xc0000000->0xe6000000
>> (XEN) Extended region 2: 0xe7800000->0xec000000
>> (XEN) Extended region 3: 0xf1200000->0xfd000000
>> (XEN) Extended region 4: 0x100000000->0x500000000
>> (XEN) Extended region 5: 0x580000000->0x600000000
>> (XEN) Extended region 6: 0x680000000->0x700000000
>> (XEN) Extended region 7: 0x780000000->0x10000000000
>>
>> 2. IOMMU is disabled for Dom0:
>>
>> (XEN) Extended region 0: 0x48000000->0x54000000
>> (XEN) Extended region 1: 0x57000000->0x60000000
>> (XEN) Extended region 2: 0x70000000->0x78000000
>> (XEN) Extended region 3: 0x78200000->0xc0000000
>> (XEN) Extended region 4: 0x500000000->0x580000000
>> (XEN) Extended region 5: 0x600000000->0x680000000
>> (XEN) Extended region 6: 0x700000000->0x780000000
>>
>> Which is not bad.
> Yeah I think that's good.

-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-15 19:10     ` Oleksandr
  2021-09-15 21:21       ` Stefano Stabellini
@ 2021-09-17 14:08       ` Oleksandr
  2021-09-17 15:52       ` Julien Grall
  2 siblings, 0 replies; 43+ messages in thread
From: Oleksandr @ 2021-09-17 14:08 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Oleksandr Tyshchenko, Julien Grall, Volodymyr Babchuk,
	Henry Wang, Bertrand Marquis, Wei Chen


On 15.09.21 22:10, Oleksandr wrote:

Hi Stefano.

[snip]
>
>>> +static int __init find_memory_holes(const struct kernel_info *kinfo,
>>> +                                    struct meminfo *ext_regions)
>>> +{
>>> +    struct dt_device_node *np;
>>> +    struct rangeset *mem_holes;
>>> +    paddr_t start, end;
>>> +    unsigned int i;
>>> +    int res;
>>> +
>>> +    dt_dprintk("Find memory holes for extended regions\n");
>>> +
>>> +    mem_holes = rangeset_new(NULL, NULL, 0);
>>> +    if ( !mem_holes )
>>> +        return -ENOMEM;
>>> +
>>> +    /* Start with maximum possible addressable physical memory 
>>> range */
>>> +    start = EXT_REGION_START;
>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>> +    res = rangeset_add_range(mem_holes, start, end);
>>> +    if ( res )
>>> +    {
>>> +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>>> +               start, end);
>>> +        goto out;
>>> +    }
>>> +
>>> +    /* Remove all regions described by "reg" property (MMIO, RAM, 
>>> etc) */
>>> +    dt_for_each_device_node( dt_host, np )
>> Don't you need something like device_tree_for_each_node ?
>> dt_for_each_device_node won't go down any deeper in the tree?
>
> Thank you for pointing this out, I will investigate and update the patch.

I have checked, dt_for_each_device_node( dt_host, np ) iterates all 
nodes, so nothing will be skipped.

As an example for this node:

         hdmi@fead0000 {
             compatible = "renesas,r8a7795-hdmi", "renesas,rcar-gen3-hdmi";
             reg = <0x0 0xfead0000 0x0 0x10000>;
             interrupts = <0x0 0x185 0x4>;
             clocks = <0xc 0x1 0x2d9 0xc 0x0 0x28>;
             clock-names = "iahb", "isfr";
             power-domains = <0x9 0x20>;
             resets = <0xc 0x2d9>;
             status = "okay";
             iommus = <0x50 0xc>;
             xen,passthrough;

             ports {
                 #address-cells = <0x1>;
                 #size-cells = <0x0>;

                 port@0 {
                     reg = <0x0>;

                     endpoint {
                         remote-endpoint = <0xb1>;
                         phandle = <0xc1>;
                     };
                 };

                 port@1 {
                     reg = <0x1>;

                     endpoint {
                         remote-endpoint = <0xb2>;
                         phandle = <0xd1>;
                     };
                 };

                 port@2 {
                     reg = <0x2>;

                     endpoint {
                         remote-endpoint = <0x6f>;
                         phandle = <0x6e>;
                     };
                 };
             };
         };


(XEN) process /soc/hdmi@fead0000
(XEN) ---number_of_address = 1
(XEN) -------0: 0xfead0000->0xfeae0000
(XEN) process /soc/hdmi@fead0000/ports
(XEN) ---number_of_address = 0
(XEN) process /soc/hdmi@fead0000/ports/port@0
(XEN) ---number_of_address = 0
(XEN) process /soc/hdmi@fead0000/ports/port@0/endpoint
(XEN) ---number_of_address = 0
(XEN) process /soc/hdmi@fead0000/ports/port@1
(XEN) ---number_of_address = 0
(XEN) process /soc/hdmi@fead0000/ports/port@1/endpoint
(XEN) ---number_of_address = 0
(XEN) process /soc/hdmi@fead0000/ports/port@2
(XEN) ---number_of_address = 0
(XEN) process /soc/hdmi@fead0000/ports/port@2/endpoint
(XEN) ---number_of_address = 0




[snip]

-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-10 18:18 ` [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0 Oleksandr Tyshchenko
  2021-09-14  0:55   ` Stefano Stabellini
@ 2021-09-17 15:48   ` Julien Grall
  2021-09-17 19:51     ` Oleksandr
  1 sibling, 1 reply; 43+ messages in thread
From: Julien Grall @ 2021-09-17 15:48 UTC (permalink / raw)
  To: Oleksandr Tyshchenko, xen-devel
  Cc: Oleksandr Tyshchenko, Stefano Stabellini, Volodymyr Babchuk,
	Henry Wang, Bertrand Marquis, Wei Chen

Hi Oleksandr,

On 10/09/2021 23:18, Oleksandr Tyshchenko wrote:
> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> 
> The extended region (safe range) is a region of guest physical
> address space which is unused and could be safely used to create
> grant/foreign mappings instead of wasting real RAM pages from
> the domain memory for establishing these mappings.
> 
> The extended regions are chosen at the domain creation time and
> advertised to it via "reg" property under hypervisor node in
> the guest device-tree. As region 0 is reserved for grant table
> space (always present), the indexes for extended regions are 1...N.
> If extended regions could not be allocated for some reason,
> Xen doesn't fail and behaves as usual, so only inserts region 0.
> 
> Please note the following limitations:
> - The extended region feature is only supported for 64-bit domain.
> - The ACPI case is not covered.

I understand the ACPI is not covered because we would need to create a 
new binding. But I am not sure to understand why 32-bit domain is not 
supported. Can you explain it?

> 
> ***
> 
> As Dom0 is direct mapped domain on Arm (e.g. MFN == GFN)
> the algorithm to choose extended regions for it is different
> in comparison with the algorithm for non-direct mapped DomU.
> What is more, that extended regions should be chosen differently
> whether IOMMU is enabled or not.
> 
> Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
> holes found in host device-tree if otherwise. 

For the case when the IOMMU is disabled, this will only work if dom0 
cannot allocate memory outside of the original range. This is currently 
the case... but I think this should be spelled out in at least the 
commit message.

> Make sure that
> extended regions are 2MB-aligned and located within maximum possible
> addressable physical memory range. The maximum number of extended
> regions is 128.

Please explain how this limit was chosen.

> 
> Suggested-by: Julien Grall <jgrall@amazon.com>
> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> ---
> Changes since RFC:
>     - update patch description
>     - drop uneeded "extended-region" DT property
> ---
> 
>   xen/arch/arm/domain_build.c | 226 +++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 224 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> index 206038d..070ec27 100644
> --- a/xen/arch/arm/domain_build.c
> +++ b/xen/arch/arm/domain_build.c
> @@ -724,6 +724,196 @@ static int __init make_memory_node(const struct domain *d,
>       return res;
>   }
>   
> +static int __init add_ext_regions(unsigned long s, unsigned long e, void *data)
> +{
> +    struct meminfo *ext_regions = data;
> +    paddr_t start, size;
> +
> +    if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
> +        return 0;
> +
> +    /* Both start and size of the extended region should be 2MB aligned */
> +    start = (s + SZ_2M - 1) & ~(SZ_2M - 1);
> +    if ( start > e )
> +        return 0;
> +
> +    size = (e - start + 1) & ~(SZ_2M - 1);
> +    if ( !size )
> +        return 0;
> +
> +    ext_regions->bank[ext_regions->nr_banks].start = start;
> +    ext_regions->bank[ext_regions->nr_banks].size = size;
> +    ext_regions->nr_banks ++;
> +
> +    return 0;
> +}
> +
> +/*
> + * The extended regions will be prevalidated by the memory hotplug path
> + * in Linux which requires for any added address range to be within maximum
> + * possible addressable physical memory range for which the linear mapping
> + * could be created.
> + * For 48-bit VA space size the maximum addressable range are:

When I read "maximum", I understand an upper limit. But below, you are 
providing a range. So should you drop "maximum"?

Also, this is tailored to Linux using 48-bit VA. How about other limits?

> + * 0x40000000 - 0x80003fffffff
> + */
> +#define EXT_REGION_START   0x40000000ULL

I am probably missing something here.... There are platform out there 
with memory starting at 0 (IIRC ZynqMP is one example). So wouldn't this 
potentially rule out the extended region on such platform?

> +#define EXT_REGION_END     0x80003fffffffULL
> +
> +static int __init find_unallocated_memory(const struct kernel_info *kinfo,
> +                                          struct meminfo *ext_regions)
> +{
> +    const struct meminfo *assign_mem = &kinfo->mem;
> +    struct rangeset *unalloc_mem;
> +    paddr_t start, end;
> +    unsigned int i;
> +    int res;

We technically already know which range of memory is unused. This is 
pretty much any region in the freelist of the page allocator. So how 
about walking the freelist instead?

The advantage is we don't need to worry about modifying the function 
when adding new memory type.

One disavantage is this will not cover *all* the unused memory as this 
is doing. But I think this is an acceptable downside.

> +
> +    dt_dprintk("Find unallocated memory for extended regions\n");
> +
> +    unalloc_mem = rangeset_new(NULL, NULL, 0);
> +    if ( !unalloc_mem )
> +        return -ENOMEM;
> +
> +    /* Start with all available RAM */
> +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
> +    {
> +        start = bootinfo.mem.bank[i].start;
> +        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size - 1;
> +        res = rangeset_add_range(unalloc_mem, start, end);
> +        if ( res )
> +        {
> +            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> +                   start, end);
> +            goto out;
> +        }
> +    }
> +
> +    /* Remove RAM assigned to Dom0 */
> +    for ( i = 0; i < assign_mem->nr_banks; i++ )
> +    {
> +        start = assign_mem->bank[i].start;
> +        end = assign_mem->bank[i].start + assign_mem->bank[i].size - 1;
> +        res = rangeset_remove_range(unalloc_mem, start, end);
> +        if ( res )
> +        {
> +            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> +                   start, end);
> +            goto out;
> +        }
> +    }
> +
> +    /* Remove reserved-memory regions */
> +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
> +    {
> +        start = bootinfo.reserved_mem.bank[i].start;
> +        end = bootinfo.reserved_mem.bank[i].start +
> +            bootinfo.reserved_mem.bank[i].size - 1;
> +        res = rangeset_remove_range(unalloc_mem, start, end);
> +        if ( res )
> +        {
> +            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> +                   start, end);
> +            goto out;
> +        }
> +    }
> +
> +    /* Remove grant table region */
> +    start = kinfo->gnttab_start;
> +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
> +    res = rangeset_remove_range(unalloc_mem, start, end);
> +    if ( res )
> +    {
> +        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> +               start, end);
> +        goto out;
> +    }
> +
> +    start = EXT_REGION_START;
> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> +    res = rangeset_report_ranges(unalloc_mem, start, end,
> +                                 add_ext_regions, ext_regions);
> +    if ( res )
> +        ext_regions->nr_banks = 0;
> +    else if ( !ext_regions->nr_banks )
> +        res = -ENOENT;
> +
> +out:
> +    rangeset_destroy(unalloc_mem);
> +
> +    return res;
> +}
> +
> +static int __init find_memory_holes(const struct kernel_info *kinfo,
> +                                    struct meminfo *ext_regions)
> +{
> +    struct dt_device_node *np;
> +    struct rangeset *mem_holes;
> +    paddr_t start, end;
> +    unsigned int i;
> +    int res;
> +
> +    dt_dprintk("Find memory holes for extended regions\n");
> +
> +    mem_holes = rangeset_new(NULL, NULL, 0);
> +    if ( !mem_holes )
> +        return -ENOMEM;
> +
> +    /* Start with maximum possible addressable physical memory range */
> +    start = EXT_REGION_START;
> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> +    res = rangeset_add_range(mem_holes, start, end);
> +    if ( res )
> +    {
> +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> +               start, end);
> +        goto out;
> +    }
> +
> +    /* Remove all regions described by "reg" property (MMIO, RAM, etc) */

Well... The loop below is not going to handle all the regions described 
in the property "reg". Instead, it will cover a subset of "reg" where 
the memory is addressable.

You will also need to cover "ranges" that will describe the BARs for the 
PCI devices.

> +    dt_for_each_device_node( dt_host, np )
> +    {
> +        unsigned int naddr;
> +        u64 addr, size;
> +
> +        naddr = dt_number_of_address(np);
> +
> +        for ( i = 0; i < naddr; i++ )
> +        {
> +            res = dt_device_get_address(np, i, &addr, &size);
> +            if ( res )
> +            {
> +                printk(XENLOG_ERR "Unable to retrieve address %u for %s\n",
> +                       i, dt_node_full_name(np));
> +                goto out;
> +            }
> +
> +            start = addr & PAGE_MASK;
> +            end = PAGE_ALIGN(addr + size) - 1;
> +            res = rangeset_remove_range(mem_holes, start, end);
> +            if ( res )
> +            {
> +                printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> +                       start, end);
> +                goto out;
> +            }
> +        }
> +    }
> +
> +    start = EXT_REGION_START;
> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> +    res = rangeset_report_ranges(mem_holes, start, end,
> +                                 add_ext_regions,  ext_regions);
> +    if ( res )
> +        ext_regions->nr_banks = 0;
> +    else if ( !ext_regions->nr_banks )
> +        res = -ENOENT;
> +
> +out:
> +    rangeset_destroy(mem_holes);
> +
> +    return res;
> +}
> +
>   static int __init make_hypervisor_node(struct domain *d,
>                                          const struct kernel_info *kinfo,
>                                          int addrcells, int sizecells)
> @@ -731,11 +921,13 @@ static int __init make_hypervisor_node(struct domain *d,
>       const char compat[] =
>           "xen,xen-"__stringify(XEN_VERSION)"."__stringify(XEN_SUBVERSION)"\0"
>           "xen,xen";
> -    __be32 reg[4];
> +    __be32 reg[(NR_MEM_BANKS + 1) * 4];

This is a fairly large allocation on the stack. Could we move to a 
dynamic allocation?

>       gic_interrupt_t intr;
>       __be32 *cells;
>       int res;
>       void *fdt = kinfo->fdt;
> +    struct meminfo *ext_regions;
> +    unsigned int i;
>   
>       dt_dprintk("Create hypervisor node\n");
>   
> @@ -757,12 +949,42 @@ static int __init make_hypervisor_node(struct domain *d,
>       if ( res )
>           return res;
>   
> +    ext_regions = xzalloc(struct meminfo);
> +    if ( !ext_regions )
> +        return -ENOMEM;
> +
> +    if ( is_32bit_domain(d) )
> +        printk(XENLOG_WARNING "The extended region is only supported for 64-bit guest\n");
> +    else
> +    {
> +        if ( !is_iommu_enabled(d) )
> +            res = find_unallocated_memory(kinfo, ext_regions);
> +        else
> +            res = find_memory_holes(kinfo, ext_regions);
> +
> +        if ( res )
> +            printk(XENLOG_WARNING "Failed to allocate extended regions\n");
> +    }
> +
>       /* reg 0 is grant table space */
>       cells = &reg[0];
>       dt_child_set_range(&cells, addrcells, sizecells,
>                          kinfo->gnttab_start, kinfo->gnttab_size);
> +    /* reg 1...N are extended regions */
> +    for ( i = 0; i < ext_regions->nr_banks; i++ )
> +    {
> +        u64 start = ext_regions->bank[i].start;
> +        u64 size = ext_regions->bank[i].size;
> +
> +        dt_dprintk("Extended region %d: %#"PRIx64"->%#"PRIx64"\n",
> +                   i, start, start + size);
> +
> +        dt_child_set_range(&cells, addrcells, sizecells, start, size);
> +    }
> +    xfree(ext_regions);
> +
>       res = fdt_property(fdt, "reg", reg,
> -                       dt_cells_to_size(addrcells + sizecells));
> +                       dt_cells_to_size(addrcells + sizecells) * (i + 1));
>       if ( res )
>           return res;
>   
> 

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-15 19:10     ` Oleksandr
  2021-09-15 21:21       ` Stefano Stabellini
  2021-09-17 14:08       ` Oleksandr
@ 2021-09-17 15:52       ` Julien Grall
  2021-09-17 20:13         ` Oleksandr
  2 siblings, 1 reply; 43+ messages in thread
From: Julien Grall @ 2021-09-17 15:52 UTC (permalink / raw)
  To: Oleksandr, Stefano Stabellini
  Cc: xen-devel, Oleksandr Tyshchenko, Volodymyr Babchuk, Henry Wang,
	Bertrand Marquis, Wei Chen



On 16/09/2021 00:10, Oleksandr wrote:
>>> + * The extended regions will be prevalidated by the memory hotplug path
>>> + * in Linux which requires for any added address range to be within 
>>> maximum
>>> + * possible addressable physical memory range for which the linear 
>>> mapping
>>> + * could be created.
>>> + * For 48-bit VA space size the maximum addressable range are:
>>> + * 0x40000000 - 0x80003fffffff
>> Please don't make Linux-specific comments in Xen code for interfaces
>> that are supposed to be OS-agnostic.
> 
> You are right. I just wanted to describe where these magic numbers come 
> from.
> Someone might question why, for example, "0 ... max_gpaddr" can't be 
> used. I will move
> that Linux-specific comments to the commit message to keep some 
> justification of these numbers.

Please keep some rationale in the code. This is a lot easier to 
understand the code without having to play the git blame game.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-17 15:48   ` Julien Grall
@ 2021-09-17 19:51     ` Oleksandr
  2021-09-17 21:56       ` Stefano Stabellini
                         ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Oleksandr @ 2021-09-17 19:51 UTC (permalink / raw)
  To: Julien Grall
  Cc: xen-devel, Oleksandr Tyshchenko, Stefano Stabellini,
	Volodymyr Babchuk, Henry Wang, Bertrand Marquis, Wei Chen


On 17.09.21 18:48, Julien Grall wrote:
> Hi Oleksandr,

Hi Julien


>
> On 10/09/2021 23:18, Oleksandr Tyshchenko wrote:
>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>
>> The extended region (safe range) is a region of guest physical
>> address space which is unused and could be safely used to create
>> grant/foreign mappings instead of wasting real RAM pages from
>> the domain memory for establishing these mappings.
>>
>> The extended regions are chosen at the domain creation time and
>> advertised to it via "reg" property under hypervisor node in
>> the guest device-tree. As region 0 is reserved for grant table
>> space (always present), the indexes for extended regions are 1...N.
>> If extended regions could not be allocated for some reason,
>> Xen doesn't fail and behaves as usual, so only inserts region 0.
>>
>> Please note the following limitations:
>> - The extended region feature is only supported for 64-bit domain.
>> - The ACPI case is not covered.
>
> I understand the ACPI is not covered because we would need to create a 
> new binding. But I am not sure to understand why 32-bit domain is not 
> supported. Can you explain it?

The 32-bit domain is not supported for simplifying things from the 
beginning. It is a little bit difficult to get everything working at 
start. As I understand from discussion at [1] we can afford that 
simplification. However, I should have mentioned that 32-bit domain is 
not supported "for now".

>
>>
>> ***
>>
>> As Dom0 is direct mapped domain on Arm (e.g. MFN == GFN)
>> the algorithm to choose extended regions for it is different
>> in comparison with the algorithm for non-direct mapped DomU.
>> What is more, that extended regions should be chosen differently
>> whether IOMMU is enabled or not.
>>
>> Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
>> holes found in host device-tree if otherwise. 
>
> For the case when the IOMMU is disabled, this will only work if dom0 
> cannot allocate memory outside of the original range. This is 
> currently the case... but I think this should be spelled out in at 
> least the commit message.

Agree, will update commit description.


>
>
>> Make sure that
>> extended regions are 2MB-aligned and located within maximum possible
>> addressable physical memory range. The maximum number of extended
>> regions is 128.
>
> Please explain how this limit was chosen.
Well, I decided to not introduce new data struct and etc to represent 
extended regions but reuse existing struct meminfo
used for memory/reserved-memory and, as I though, perfectly fitted. So, 
that limit come from NR_MEM_BANKS which is 128.


>
>>
>> Suggested-by: Julien Grall <jgrall@amazon.com>
>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>> ---
>> Changes since RFC:
>>     - update patch description
>>     - drop uneeded "extended-region" DT property
>> ---
>>
>>   xen/arch/arm/domain_build.c | 226 
>> +++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 224 insertions(+), 2 deletions(-)
>>
>> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
>> index 206038d..070ec27 100644
>> --- a/xen/arch/arm/domain_build.c
>> +++ b/xen/arch/arm/domain_build.c
>> @@ -724,6 +724,196 @@ static int __init make_memory_node(const struct 
>> domain *d,
>>       return res;
>>   }
>>   +static int __init add_ext_regions(unsigned long s, unsigned long 
>> e, void *data)
>> +{
>> +    struct meminfo *ext_regions = data;
>> +    paddr_t start, size;
>> +
>> +    if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
>> +        return 0;
>> +
>> +    /* Both start and size of the extended region should be 2MB 
>> aligned */
>> +    start = (s + SZ_2M - 1) & ~(SZ_2M - 1);
>> +    if ( start > e )
>> +        return 0;
>> +
>> +    size = (e - start + 1) & ~(SZ_2M - 1);
>> +    if ( !size )
>> +        return 0;
>> +
>> +    ext_regions->bank[ext_regions->nr_banks].start = start;
>> +    ext_regions->bank[ext_regions->nr_banks].size = size;
>> +    ext_regions->nr_banks ++;
>> +
>> +    return 0;
>> +}
>> +
>> +/*
>> + * The extended regions will be prevalidated by the memory hotplug path
>> + * in Linux which requires for any added address range to be within 
>> maximum
>> + * possible addressable physical memory range for which the linear 
>> mapping
>> + * could be created.
>> + * For 48-bit VA space size the maximum addressable range are:
>
> When I read "maximum", I understand an upper limit. But below, you are 
> providing a range. So should you drop "maximum"?

yes, it is a little bit confusing.


>
>
> Also, this is tailored to Linux using 48-bit VA. How about other limits?
These limits are calculated at [2]. Sorry, I didn't investigate yet what 
values would be for other CONFIG_ARM64_VA_BITS_XXX. Also looks like some 
configs depend on 16K/64K pages...
I will try to investigate and provide limits later on.


>
>
>> + * 0x40000000 - 0x80003fffffff
>> + */
>> +#define EXT_REGION_START   0x40000000ULL
>
> I am probably missing something here.... There are platform out there 
> with memory starting at 0 (IIRC ZynqMP is one example). So wouldn't 
> this potentially rule out the extended region on such platform?

 From my understanding the extended region cannot be in 0...0x40000000 
range. If these platforms have memory above first GB, I believe the 
extended region(s) can be allocated for them.


>
>
>> +#define EXT_REGION_END 0x80003fffffffULL
>> +
>> +static int __init find_unallocated_memory(const struct kernel_info 
>> *kinfo,
>> +                                          struct meminfo *ext_regions)
>> +{
>> +    const struct meminfo *assign_mem = &kinfo->mem;
>> +    struct rangeset *unalloc_mem;
>> +    paddr_t start, end;
>> +    unsigned int i;
>> +    int res;
>
> We technically already know which range of memory is unused. This is 
> pretty much any region in the freelist of the page allocator. So how 
> about walking the freelist instead?

ok, I will investigate the page allocator code (right now I have no 
understanding of how to do that). BTW, I have just grepped "freelist" 
through the code and all page context related appearances are in x86 
code only.

>
> The advantage is we don't need to worry about modifying the function 
> when adding new memory type.
>
> One disavantage is this will not cover *all* the unused memory as this 
> is doing. But I think this is an acceptable downside.
>
>> +
>> +    dt_dprintk("Find unallocated memory for extended regions\n");
>> +
>> +    unalloc_mem = rangeset_new(NULL, NULL, 0);
>> +    if ( !unalloc_mem )
>> +        return -ENOMEM;
>> +
>> +    /* Start with all available RAM */
>> +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
>> +    {
>> +        start = bootinfo.mem.bank[i].start;
>> +        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size 
>> - 1;
>> +        res = rangeset_add_range(unalloc_mem, start, end);
>> +        if ( res )
>> +        {
>> +            printk(XENLOG_ERR "Failed to add: 
>> %#"PRIx64"->%#"PRIx64"\n",
>> +                   start, end);
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /* Remove RAM assigned to Dom0 */
>> +    for ( i = 0; i < assign_mem->nr_banks; i++ )
>> +    {
>> +        start = assign_mem->bank[i].start;
>> +        end = assign_mem->bank[i].start + assign_mem->bank[i].size - 1;
>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>> +        if ( res )
>> +        {
>> +            printk(XENLOG_ERR "Failed to remove: 
>> %#"PRIx64"->%#"PRIx64"\n",
>> +                   start, end);
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /* Remove reserved-memory regions */
>> +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
>> +    {
>> +        start = bootinfo.reserved_mem.bank[i].start;
>> +        end = bootinfo.reserved_mem.bank[i].start +
>> +            bootinfo.reserved_mem.bank[i].size - 1;
>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>> +        if ( res )
>> +        {
>> +            printk(XENLOG_ERR "Failed to remove: 
>> %#"PRIx64"->%#"PRIx64"\n",
>> +                   start, end);
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /* Remove grant table region */
>> +    start = kinfo->gnttab_start;
>> +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
>> +    res = rangeset_remove_range(unalloc_mem, start, end);
>> +    if ( res )
>> +    {
>> +        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
>> +               start, end);
>> +        goto out;
>> +    }
>> +
>> +    start = EXT_REGION_START;
>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>> +    res = rangeset_report_ranges(unalloc_mem, start, end,
>> +                                 add_ext_regions, ext_regions);
>> +    if ( res )
>> +        ext_regions->nr_banks = 0;
>> +    else if ( !ext_regions->nr_banks )
>> +        res = -ENOENT;
>> +
>> +out:
>> +    rangeset_destroy(unalloc_mem);
>> +
>> +    return res;
>> +}
>> +
>> +static int __init find_memory_holes(const struct kernel_info *kinfo,
>> +                                    struct meminfo *ext_regions)
>> +{
>> +    struct dt_device_node *np;
>> +    struct rangeset *mem_holes;
>> +    paddr_t start, end;
>> +    unsigned int i;
>> +    int res;
>> +
>> +    dt_dprintk("Find memory holes for extended regions\n");
>> +
>> +    mem_holes = rangeset_new(NULL, NULL, 0);
>> +    if ( !mem_holes )
>> +        return -ENOMEM;
>> +
>> +    /* Start with maximum possible addressable physical memory range */
>> +    start = EXT_REGION_START;
>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>> +    res = rangeset_add_range(mem_holes, start, end);
>> +    if ( res )
>> +    {
>> +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>> +               start, end);
>> +        goto out;
>> +    }
>> +
>> +    /* Remove all regions described by "reg" property (MMIO, RAM, 
>> etc) */
>
> Well... The loop below is not going to handle all the regions 
> described in the property "reg". Instead, it will cover a subset of 
> "reg" where the memory is addressable.

As I understand, we are only interested in subset of "reg" where the 
memory is addressable.


>
>
> You will also need to cover "ranges" that will describe the BARs for 
> the PCI devices.
Good point. Could you please clarify how to recognize whether it is a 
PCI device as long as PCI support is not merged? Or just to find any 
device nodes with non-empty "ranges" property
and retrieve addresses?


>
>
>> +    dt_for_each_device_node( dt_host, np )
>> +    {
>> +        unsigned int naddr;
>> +        u64 addr, size;
>> +
>> +        naddr = dt_number_of_address(np);
>> +
>> +        for ( i = 0; i < naddr; i++ )
>> +        {
>> +            res = dt_device_get_address(np, i, &addr, &size);
>> +            if ( res )
>> +            {
>> +                printk(XENLOG_ERR "Unable to retrieve address %u for 
>> %s\n",
>> +                       i, dt_node_full_name(np));
>> +                goto out;
>> +            }
>> +
>> +            start = addr & PAGE_MASK;
>> +            end = PAGE_ALIGN(addr + size) - 1;
>> +            res = rangeset_remove_range(mem_holes, start, end);
>> +            if ( res )
>> +            {
>> +                printk(XENLOG_ERR "Failed to remove: 
>> %#"PRIx64"->%#"PRIx64"\n",
>> +                       start, end);
>> +                goto out;
>> +            }
>> +        }
>> +    }
>> +
>> +    start = EXT_REGION_START;
>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>> +    res = rangeset_report_ranges(mem_holes, start, end,
>> +                                 add_ext_regions, ext_regions);
>> +    if ( res )
>> +        ext_regions->nr_banks = 0;
>> +    else if ( !ext_regions->nr_banks )
>> +        res = -ENOENT;
>> +
>> +out:
>> +    rangeset_destroy(mem_holes);
>> +
>> +    return res;
>> +}
>> +
>>   static int __init make_hypervisor_node(struct domain *d,
>>                                          const struct kernel_info 
>> *kinfo,
>>                                          int addrcells, int sizecells)
>> @@ -731,11 +921,13 @@ static int __init make_hypervisor_node(struct 
>> domain *d,
>>       const char compat[] =
>> "xen,xen-"__stringify(XEN_VERSION)"."__stringify(XEN_SUBVERSION)"\0"
>>           "xen,xen";
>> -    __be32 reg[4];
>> +    __be32 reg[(NR_MEM_BANKS + 1) * 4];
>
> This is a fairly large allocation on the stack. Could we move to a 
> dynamic allocation?

Of course, will do.


>
>
>>       gic_interrupt_t intr;
>>       __be32 *cells;
>>       int res;
>>       void *fdt = kinfo->fdt;
>> +    struct meminfo *ext_regions;
>> +    unsigned int i;
>>         dt_dprintk("Create hypervisor node\n");
>>   @@ -757,12 +949,42 @@ static int __init make_hypervisor_node(struct 
>> domain *d,
>>       if ( res )
>>           return res;
>>   +    ext_regions = xzalloc(struct meminfo);
>> +    if ( !ext_regions )
>> +        return -ENOMEM;
>> +
>> +    if ( is_32bit_domain(d) )
>> +        printk(XENLOG_WARNING "The extended region is only supported 
>> for 64-bit guest\n");
>> +    else
>> +    {
>> +        if ( !is_iommu_enabled(d) )
>> +            res = find_unallocated_memory(kinfo, ext_regions);
>> +        else
>> +            res = find_memory_holes(kinfo, ext_regions);
>> +
>> +        if ( res )
>> +            printk(XENLOG_WARNING "Failed to allocate extended 
>> regions\n");
>> +    }
>> +
>>       /* reg 0 is grant table space */
>>       cells = &reg[0];
>>       dt_child_set_range(&cells, addrcells, sizecells,
>>                          kinfo->gnttab_start, kinfo->gnttab_size);
>> +    /* reg 1...N are extended regions */
>> +    for ( i = 0; i < ext_regions->nr_banks; i++ )
>> +    {
>> +        u64 start = ext_regions->bank[i].start;
>> +        u64 size = ext_regions->bank[i].size;
>> +
>> +        dt_dprintk("Extended region %d: %#"PRIx64"->%#"PRIx64"\n",
>> +                   i, start, start + size);
>> +
>> +        dt_child_set_range(&cells, addrcells, sizecells, start, size);
>> +    }
>> +    xfree(ext_regions);
>> +
>>       res = fdt_property(fdt, "reg", reg,
>> -                       dt_cells_to_size(addrcells + sizecells));
>> +                       dt_cells_to_size(addrcells + sizecells) * (i 
>> + 1));
>>       if ( res )
>>           return res;
>>
>
> Cheers,

[1] 
https://lore.kernel.org/xen-devel/cb1c8fd4-a4c5-c18e-c8db-f8e317d95526@xen.org/

[2] 
https://elixir.bootlin.com/linux/v5.15-rc1/source/arch/arm64/mm/mmu.c#L1448


Thank you.


-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-17 15:52       ` Julien Grall
@ 2021-09-17 20:13         ` Oleksandr
  0 siblings, 0 replies; 43+ messages in thread
From: Oleksandr @ 2021-09-17 20:13 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, xen-devel, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Henry Wang, Bertrand Marquis, Wei Chen


On 17.09.21 18:52, Julien Grall wrote:

Hi Julien

>
>
> On 16/09/2021 00:10, Oleksandr wrote:
>>>> + * The extended regions will be prevalidated by the memory hotplug 
>>>> path
>>>> + * in Linux which requires for any added address range to be 
>>>> within maximum
>>>> + * possible addressable physical memory range for which the linear 
>>>> mapping
>>>> + * could be created.
>>>> + * For 48-bit VA space size the maximum addressable range are:
>>>> + * 0x40000000 - 0x80003fffffff
>>> Please don't make Linux-specific comments in Xen code for interfaces
>>> that are supposed to be OS-agnostic.
>>
>> You are right. I just wanted to describe where these magic numbers 
>> come from.
>> Someone might question why, for example, "0 ... max_gpaddr" can't be 
>> used. I will move
>> that Linux-specific comments to the commit message to keep some 
>> justification of these numbers.
>
> Please keep some rationale in the code. This is a lot easier to 
> understand the code without having to play the git blame game.

ok, to be honest I failed to find how to express OS-depended constraints 
in a OS-agnostic way.


>
>
> Cheers,
>
-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-17 19:51     ` Oleksandr
@ 2021-09-17 21:56       ` Stefano Stabellini
  2021-09-17 22:37         ` Stefano Stabellini
  2021-09-21 19:43         ` Oleksandr
  2021-09-18 16:59       ` Oleksandr
  2021-09-19 14:00       ` Julien Grall
  2 siblings, 2 replies; 43+ messages in thread
From: Stefano Stabellini @ 2021-09-17 21:56 UTC (permalink / raw)
  To: Oleksandr
  Cc: Julien Grall, xen-devel, Oleksandr Tyshchenko,
	Stefano Stabellini, Volodymyr Babchuk, Henry Wang,
	Bertrand Marquis, Wei Chen

[-- Attachment #1: Type: text/plain, Size: 5626 bytes --]

On Fri, 17 Sep 2021, Oleksandr wrote:
> > > +
> > > +    dt_dprintk("Find unallocated memory for extended regions\n");
> > > +
> > > +    unalloc_mem = rangeset_new(NULL, NULL, 0);
> > > +    if ( !unalloc_mem )
> > > +        return -ENOMEM;
> > > +
> > > +    /* Start with all available RAM */
> > > +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
> > > +    {
> > > +        start = bootinfo.mem.bank[i].start;
> > > +        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size - 1;
> > > +        res = rangeset_add_range(unalloc_mem, start, end);
> > > +        if ( res )
> > > +        {
> > > +            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> > > +                   start, end);
> > > +            goto out;
> > > +        }
> > > +    }
> > > +
> > > +    /* Remove RAM assigned to Dom0 */
> > > +    for ( i = 0; i < assign_mem->nr_banks; i++ )
> > > +    {
> > > +        start = assign_mem->bank[i].start;
> > > +        end = assign_mem->bank[i].start + assign_mem->bank[i].size - 1;
> > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > +        if ( res )
> > > +        {
> > > +            printk(XENLOG_ERR "Failed to remove:
> > > %#"PRIx64"->%#"PRIx64"\n",
> > > +                   start, end);
> > > +            goto out;
> > > +        }
> > > +    }
> > > +
> > > +    /* Remove reserved-memory regions */
> > > +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
> > > +    {
> > > +        start = bootinfo.reserved_mem.bank[i].start;
> > > +        end = bootinfo.reserved_mem.bank[i].start +
> > > +            bootinfo.reserved_mem.bank[i].size - 1;
> > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > +        if ( res )
> > > +        {
> > > +            printk(XENLOG_ERR "Failed to remove:
> > > %#"PRIx64"->%#"PRIx64"\n",
> > > +                   start, end);
> > > +            goto out;
> > > +        }
> > > +    }
> > > +
> > > +    /* Remove grant table region */
> > > +    start = kinfo->gnttab_start;
> > > +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
> > > +    res = rangeset_remove_range(unalloc_mem, start, end);
> > > +    if ( res )
> > > +    {
> > > +        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> > > +               start, end);
> > > +        goto out;
> > > +    }
> > > +
> > > +    start = EXT_REGION_START;
> > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > +    res = rangeset_report_ranges(unalloc_mem, start, end,
> > > +                                 add_ext_regions, ext_regions);
> > > +    if ( res )
> > > +        ext_regions->nr_banks = 0;
> > > +    else if ( !ext_regions->nr_banks )
> > > +        res = -ENOENT;
> > > +
> > > +out:
> > > +    rangeset_destroy(unalloc_mem);
> > > +
> > > +    return res;
> > > +}
> > > +
> > > +static int __init find_memory_holes(const struct kernel_info *kinfo,
> > > +                                    struct meminfo *ext_regions)
> > > +{
> > > +    struct dt_device_node *np;
> > > +    struct rangeset *mem_holes;
> > > +    paddr_t start, end;
> > > +    unsigned int i;
> > > +    int res;
> > > +
> > > +    dt_dprintk("Find memory holes for extended regions\n");
> > > +
> > > +    mem_holes = rangeset_new(NULL, NULL, 0);
> > > +    if ( !mem_holes )
> > > +        return -ENOMEM;
> > > +
> > > +    /* Start with maximum possible addressable physical memory range */
> > > +    start = EXT_REGION_START;
> > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > +    res = rangeset_add_range(mem_holes, start, end);
> > > +    if ( res )
> > > +    {
> > > +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> > > +               start, end);
> > > +        goto out;
> > > +    }
> > > +
> > > +    /* Remove all regions described by "reg" property (MMIO, RAM, etc) */
> > 
> > Well... The loop below is not going to handle all the regions described in
> > the property "reg". Instead, it will cover a subset of "reg" where the
> > memory is addressable.
> 
> As I understand, we are only interested in subset of "reg" where the memory is
> addressable.
> 
> 
> > 
> > 
> > You will also need to cover "ranges" that will describe the BARs for the PCI
> > devices.
> Good point.

Yes, very good point!


> Could you please clarify how to recognize whether it is a PCI
> device as long as PCI support is not merged? Or just to find any device nodes
> with non-empty "ranges" property
> and retrieve addresses?

Normally any bus can have a ranges property with the aperture and
possible address translations, including /amba (compatible =
"simple-bus"). However, in these cases dt_device_get_address already
takes care of it, see xen/common/device_tree.c:dt_device_get_address.

The PCI bus is special for 2 reasons:
- the ranges property has a different format
- the bus is hot-pluggable

So I think the only one that we need to treat specially is PCI.

As far as I am aware PCI is the only bus (or maybe just the only bus
that we support?) where ranges means the aperture.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-17 21:56       ` Stefano Stabellini
@ 2021-09-17 22:37         ` Stefano Stabellini
  2021-09-19 14:34           ` Julien Grall
  2021-09-21 19:43         ` Oleksandr
  1 sibling, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2021-09-17 22:37 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Oleksandr, Julien Grall, xen-devel, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Henry Wang, Bertrand Marquis, Wei Chen

[-- Attachment #1: Type: text/plain, Size: 7319 bytes --]

On Fri, 17 Sep 2021, Stefano Stabellini wrote:
> On Fri, 17 Sep 2021, Oleksandr wrote:
> > > > +
> > > > +    dt_dprintk("Find unallocated memory for extended regions\n");
> > > > +
> > > > +    unalloc_mem = rangeset_new(NULL, NULL, 0);
> > > > +    if ( !unalloc_mem )
> > > > +        return -ENOMEM;
> > > > +
> > > > +    /* Start with all available RAM */
> > > > +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
> > > > +    {
> > > > +        start = bootinfo.mem.bank[i].start;
> > > > +        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size - 1;
> > > > +        res = rangeset_add_range(unalloc_mem, start, end);
> > > > +        if ( res )
> > > > +        {
> > > > +            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> > > > +                   start, end);
> > > > +            goto out;
> > > > +        }
> > > > +    }
> > > > +
> > > > +    /* Remove RAM assigned to Dom0 */
> > > > +    for ( i = 0; i < assign_mem->nr_banks; i++ )
> > > > +    {
> > > > +        start = assign_mem->bank[i].start;
> > > > +        end = assign_mem->bank[i].start + assign_mem->bank[i].size - 1;
> > > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > > +        if ( res )
> > > > +        {
> > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > +                   start, end);
> > > > +            goto out;
> > > > +        }
> > > > +    }
> > > > +
> > > > +    /* Remove reserved-memory regions */
> > > > +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
> > > > +    {
> > > > +        start = bootinfo.reserved_mem.bank[i].start;
> > > > +        end = bootinfo.reserved_mem.bank[i].start +
> > > > +            bootinfo.reserved_mem.bank[i].size - 1;
> > > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > > +        if ( res )
> > > > +        {
> > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > +                   start, end);
> > > > +            goto out;
> > > > +        }
> > > > +    }
> > > > +
> > > > +    /* Remove grant table region */
> > > > +    start = kinfo->gnttab_start;
> > > > +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
> > > > +    res = rangeset_remove_range(unalloc_mem, start, end);
> > > > +    if ( res )
> > > > +    {
> > > > +        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> > > > +               start, end);
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    start = EXT_REGION_START;
> > > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > > +    res = rangeset_report_ranges(unalloc_mem, start, end,
> > > > +                                 add_ext_regions, ext_regions);
> > > > +    if ( res )
> > > > +        ext_regions->nr_banks = 0;
> > > > +    else if ( !ext_regions->nr_banks )
> > > > +        res = -ENOENT;
> > > > +
> > > > +out:
> > > > +    rangeset_destroy(unalloc_mem);
> > > > +
> > > > +    return res;
> > > > +}
> > > > +
> > > > +static int __init find_memory_holes(const struct kernel_info *kinfo,
> > > > +                                    struct meminfo *ext_regions)
> > > > +{
> > > > +    struct dt_device_node *np;
> > > > +    struct rangeset *mem_holes;
> > > > +    paddr_t start, end;
> > > > +    unsigned int i;
> > > > +    int res;
> > > > +
> > > > +    dt_dprintk("Find memory holes for extended regions\n");
> > > > +
> > > > +    mem_holes = rangeset_new(NULL, NULL, 0);
> > > > +    if ( !mem_holes )
> > > > +        return -ENOMEM;
> > > > +
> > > > +    /* Start with maximum possible addressable physical memory range */
> > > > +    start = EXT_REGION_START;
> > > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > > +    res = rangeset_add_range(mem_holes, start, end);
> > > > +    if ( res )
> > > > +    {
> > > > +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> > > > +               start, end);
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    /* Remove all regions described by "reg" property (MMIO, RAM, etc) */
> > > 
> > > Well... The loop below is not going to handle all the regions described in
> > > the property "reg". Instead, it will cover a subset of "reg" where the
> > > memory is addressable.
> > 
> > As I understand, we are only interested in subset of "reg" where the memory is
> > addressable.
> > 
> > 
> > > 
> > > 
> > > You will also need to cover "ranges" that will describe the BARs for the PCI
> > > devices.
> > Good point.
> 
> Yes, very good point!
> 
> 
> > Could you please clarify how to recognize whether it is a PCI
> > device as long as PCI support is not merged? Or just to find any device nodes
> > with non-empty "ranges" property
> > and retrieve addresses?
> 
> Normally any bus can have a ranges property with the aperture and
> possible address translations, including /amba (compatible =
> "simple-bus"). However, in these cases dt_device_get_address already
> takes care of it, see xen/common/device_tree.c:dt_device_get_address.
> 
> The PCI bus is special for 2 reasons:
> - the ranges property has a different format
> - the bus is hot-pluggable
> 
> So I think the only one that we need to treat specially is PCI.
> 
> As far as I am aware PCI is the only bus (or maybe just the only bus
> that we support?) where ranges means the aperture.

Now that I think about this, there is another "hotpluggable" scenario we
need to think about:

[1] https://marc.info/?l=xen-devel&m=163056546214978

Xilinx devices have FPGA regions with apertures currently not described
in device tree, where things can programmed in PL at runtime making new
devices appear with new MMIO regions out of thin air.

Now let me start by saying that yes, the entire programmable region
aperture could probably be described in device tree, however, in
reality it is not currently done in any of the device trees we use
(including the upstream device trees in linux.git).

So, we have a problem :-(


I can work toward getting the right info on device tree, but in reality
that is going to take time and for now the device tree doesn't have the
FPGA aperture in it. So if we accept this series as is, it is going to
stop features like [1] from working.

If we cannot come up with any better plans, I think it would be better
to drop find_memory_holes, only rely on find_unallocated_memory even
when the IOMMU is on. One idea is that we could add on top of the
regions found by find_unallocated_memory any MMIO regions marked as
xen,passthrough: they are safe because they are not going to dom0 anyway.

The only alternative I can think of is to have a per-board
enable/disable toggle for the extend region but it would be very ugly.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-17 19:51     ` Oleksandr
  2021-09-17 21:56       ` Stefano Stabellini
@ 2021-09-18 16:59       ` Oleksandr
  2021-09-23 10:41         ` Oleksandr
  2021-09-19 14:00       ` Julien Grall
  2 siblings, 1 reply; 43+ messages in thread
From: Oleksandr @ 2021-09-18 16:59 UTC (permalink / raw)
  To: Julien Grall
  Cc: xen-devel, Oleksandr Tyshchenko, Stefano Stabellini,
	Volodymyr Babchuk, Henry Wang, Bertrand Marquis, Wei Chen


Hi Julien.


[snip]


>>
>>
>>> +#define EXT_REGION_END 0x80003fffffffULL
>>> +
>>> +static int __init find_unallocated_memory(const struct kernel_info 
>>> *kinfo,
>>> +                                          struct meminfo *ext_regions)
>>> +{
>>> +    const struct meminfo *assign_mem = &kinfo->mem;
>>> +    struct rangeset *unalloc_mem;
>>> +    paddr_t start, end;
>>> +    unsigned int i;
>>> +    int res;
>>
>> We technically already know which range of memory is unused. This is 
>> pretty much any region in the freelist of the page allocator. So how 
>> about walking the freelist instead?
>
> ok, I will investigate the page allocator code (right now I have no 
> understanding of how to do that). BTW, I have just grepped "freelist" 
> through the code and all page context related appearances are in x86 
> code only.
>
>>
>> The advantage is we don't need to worry about modifying the function 
>> when adding new memory type.
>>
>> One disavantage is this will not cover *all* the unused memory as 
>> this is doing. But I think this is an acceptable downside.

I did some investigations and create test patch. Although, I am not 100% 
sure this is exactly what you meant, but I will provide results anyway.

1. Below the extended regions (unallocated memory, regions >=64MB ) 
calculated by my initial method (bootinfo.mem - kinfo->mem - 
bootinfo.reserved_mem - kinfo->gnttab):

(XEN) Extended region 0: 0x48000000->0x54000000
(XEN) Extended region 1: 0x57000000->0x60000000
(XEN) Extended region 2: 0x70000000->0x78000000
(XEN) Extended region 3: 0x78200000->0xc0000000
(XEN) Extended region 4: 0x500000000->0x580000000
(XEN) Extended region 5: 0x600000000->0x680000000
(XEN) Extended region 6: 0x700000000->0x780000000

2. Below the extended regions (unallocated memory, regions >=64MB) 
calculated by new method (free memory in page allocator):

(XEN) Extended region 0: 0x48000000->0x54000000
(XEN) Extended region 1: 0x58000000->0x60000000
(XEN) Extended region 2: 0x70000000->0x78000000
(XEN) Extended region 3: 0x78200000->0x84000000
(XEN) Extended region 4: 0x86000000->0x8a000000
(XEN) Extended region 5: 0x8c200000->0xc0000000
(XEN) Extended region 6: 0x500000000->0x580000000
(XEN) Extended region 7: 0x600000000->0x680000000
(XEN) Extended region 8: 0x700000000->0x765e00000

Some thoughts regarding that.

1. A few ranges below 4GB are absent in resulting extended regions. I 
assume, this is because of the modules:

(XEN) Checking for initrd in /chosen
(XEN) Initrd 0000000084000040-0000000085effc48
(XEN) RAM: 0000000048000000 - 00000000bfffffff
(XEN) RAM: 0000000500000000 - 000000057fffffff
(XEN) RAM: 0000000600000000 - 000000067fffffff
(XEN) RAM: 0000000700000000 - 000000077fffffff
(XEN)
(XEN) MODULE[0]: 0000000078080000 - 00000000781d74c8 Xen
(XEN) MODULE[1]: 0000000057fe7000 - 0000000057ffd080 Device Tree
(XEN) MODULE[2]: 0000000084000040 - 0000000085effc48 Ramdisk
(XEN) MODULE[3]: 000000008a000000 - 000000008c000000 Kernel
(XEN) MODULE[4]: 000000008c000000 - 000000008c010000 XSM
(XEN)  RESVD[0]: 0000000084000040 - 0000000085effc48
(XEN)  RESVD[1]: 0000000054000000 - 0000000056ffffff

2. Also, it worth mentioning that relatively large chunk (~417MB) of 
memory above 4GB is absent (to be precise, at the end of last RAM bank), 
which I assume, used for Xen internals.
We could really use it for extended regions.
Below free regions in the heap (for last RAM bank) just in case:

(XEN) heap[node=0][zone=23][order=5] 0x00000765ec0000-0x00000765ee0000
(XEN) heap[node=0][zone=23][order=6] 0x00000765e80000-0x00000765ec0000
(XEN) heap[node=0][zone=23][order=7] 0x00000765e00000-0x00000765e80000
(XEN) heap[node=0][zone=23][order=9] 0x00000765c00000-0x00000765e00000
(XEN) heap[node=0][zone=23][order=10] 0x00000765800000-0x00000765c00000
(XEN) heap[node=0][zone=23][order=11] 0x00000765000000-0x00000765800000
(XEN) heap[node=0][zone=23][order=12] 0x00000764000000-0x00000765000000
(XEN) heap[node=0][zone=23][order=14] 0x00000760000000-0x00000764000000
(XEN) heap[node=0][zone=23][order=17] 0x00000740000000-0x00000760000000
(XEN) heap[node=0][zone=23][order=18] 0x00000540000000-0x00000580000000
(XEN) heap[node=0][zone=23][order=18] 0x00000500000000-0x00000540000000
(XEN) heap[node=0][zone=23][order=18] 0x00000640000000-0x00000680000000
(XEN) heap[node=0][zone=23][order=18] 0x00000600000000-0x00000640000000
(XEN) heap[node=0][zone=23][order=18] 0x00000700000000-0x00000740000000

Yes, you already pointed out this disadvantage, so if it is an 
acceptable downside, I am absolutely OK.


3. Common code updates. There is a question how to properly make a 
connection between common allocator internals and Arm's code for 
creating DT. I didn’t come up with anything better
than creating for_each_avail_page() for invoking a callback with page 
and its order.

**********

Below the proposed changes on top of the initial patch, would this be 
acceptable in general?

diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index 523eb19..1e58fc5 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -753,16 +753,33 @@ static int __init add_ext_regions(unsigned long s, 
unsigned long e, void *data)
      return 0;
  }

+static int __init add_unalloc_mem(struct page_info *page, unsigned int 
order,
+                                  void *data)
+{
+    struct rangeset *unalloc_mem = data;
+    paddr_t start, end;
+    int res;
+
+    start = page_to_maddr(page);
+    end = start + pfn_to_paddr(1UL << order);
+    res = rangeset_add_range(unalloc_mem, start, end - 1);
+    if ( res )
+    {
+        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
+               start, end);
+        return res;
+    }
+
+    return 0;
+}
+
  #define EXT_REGION_START   0x40000000ULL
  #define EXT_REGION_END     0x80003fffffffULL

-static int __init find_unallocated_memory(const struct kernel_info *kinfo,
-                                          struct meminfo *ext_regions)
+static int __init find_unallocated_memory(struct meminfo *ext_regions)
  {
-    const struct meminfo *assign_mem = &kinfo->mem;
      struct rangeset *unalloc_mem;
      paddr_t start, end;
-    unsigned int i;
      int res;

      dt_dprintk("Find unallocated memory for extended regions\n");
@@ -771,59 +788,9 @@ static int __init find_unallocated_memory(const 
struct kernel_info *kinfo,
      if ( !unalloc_mem )
          return -ENOMEM;

-    /* Start with all available RAM */
-    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
-    {
-        start = bootinfo.mem.bank[i].start;
-        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size;
-        res = rangeset_add_range(unalloc_mem, start, end - 1);
-        if ( res )
-        {
-            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
-                   start, end);
-            goto out;
-        }
-    }
-
-    /* Remove RAM assigned to Dom0 */
-    for ( i = 0; i < assign_mem->nr_banks; i++ )
-    {
-        start = assign_mem->bank[i].start;
-        end = assign_mem->bank[i].start + assign_mem->bank[i].size;
-        res = rangeset_remove_range(unalloc_mem, start, end - 1);
-        if ( res )
-        {
-            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
-                   start, end);
-            goto out;
-        }
-    }
-
-    /* Remove reserved-memory regions */
-    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
-    {
-        start = bootinfo.reserved_mem.bank[i].start;
-        end = bootinfo.reserved_mem.bank[i].start +
-            bootinfo.reserved_mem.bank[i].size;
-        res = rangeset_remove_range(unalloc_mem, start, end - 1);
-        if ( res )
-        {
-            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
-                   start, end);
-            goto out;
-        }
-    }
-
-    /* Remove grant table region */
-    start = kinfo->gnttab_start;
-    end = kinfo->gnttab_start + kinfo->gnttab_size;
-    res = rangeset_remove_range(unalloc_mem, start, end - 1);
+    res = for_each_avail_page(add_unalloc_mem, unalloc_mem);
      if ( res )
-    {
-        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
-               start, end);
          goto out;
-    }

      start = EXT_REGION_START;
      end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
@@ -840,8 +807,7 @@ out:
      return res;
  }

-static int __init find_memory_holes(const struct kernel_info *kinfo,
-                                    struct meminfo *ext_regions)
+static int __init find_memory_holes(struct meminfo *ext_regions)
  {
      struct dt_device_node *np;
      struct rangeset *mem_holes;
@@ -961,9 +927,9 @@ static int __init make_hypervisor_node(struct domain *d,
      else
      {
          if ( !is_iommu_enabled(d) )
-            res = find_unallocated_memory(kinfo, ext_regions);
+            res = find_unallocated_memory(ext_regions);
          else
-            res = find_memory_holes(kinfo, ext_regions);
+            res = find_memory_holes(ext_regions);

          if ( res )
              printk(XENLOG_WARNING "Failed to allocate extended 
regions\n");
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 8fad139..7cd1020 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1572,6 +1572,40 @@ static int reserve_heap_page(struct page_info *pg)

  }

+/* TODO heap_lock? */
+int for_each_avail_page(int (*cb)(struct page_info *, unsigned int, 
void *),
+                        void *data)
+{
+    unsigned int node, zone, order;
+    int ret;
+
+    for ( node = 0; node < MAX_NUMNODES; node++ )
+    {
+        if ( !avail[node] )
+            continue;
+
+        for ( zone = 0; zone < NR_ZONES; zone++ )
+        {
+            for ( order = 0; order <= MAX_ORDER; order++ )
+            {
+                struct page_info *head, *tmp;
+
+                if ( page_list_empty(&heap(node, zone, order)) )
+                    continue;
+
+                page_list_for_each_safe ( head, tmp, &heap(node, zone, 
order) )
+                {
+                    ret = cb(head, order, data);
+                    if ( ret )
+                        return ret;
+                }
+            }
+        }
+    }
+
+    return 0;
+}
+
  int offline_page(mfn_t mfn, int broken, uint32_t *status)
  {
      unsigned long old_info = 0;
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 667f9da..64dd3e2 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -123,6 +123,9 @@ unsigned int online_page(mfn_t mfn, uint32_t *status);
  int offline_page(mfn_t mfn, int broken, uint32_t *status);
  int query_page_offline(mfn_t mfn, uint32_t *status);

+int for_each_avail_page(int (*cb)(struct page_info *, unsigned int, 
void *),
+                        void *data);
+
  void heap_init_late(void);

  int assign_pages(


[snip]

-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-17 19:51     ` Oleksandr
  2021-09-17 21:56       ` Stefano Stabellini
  2021-09-18 16:59       ` Oleksandr
@ 2021-09-19 14:00       ` Julien Grall
  2021-09-19 17:59         ` Oleksandr
  2 siblings, 1 reply; 43+ messages in thread
From: Julien Grall @ 2021-09-19 14:00 UTC (permalink / raw)
  To: Oleksandr
  Cc: xen-devel, Oleksandr Tyshchenko, Stefano Stabellini,
	Volodymyr Babchuk, Henry Wang, Bertrand Marquis, Wei Chen

Hi,

On 18/09/2021 00:51, Oleksandr wrote:
> On 17.09.21 18:48, Julien Grall wrote:
>> On 10/09/2021 23:18, Oleksandr Tyshchenko wrote:
>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>
>>> The extended region (safe range) is a region of guest physical
>>> address space which is unused and could be safely used to create
>>> grant/foreign mappings instead of wasting real RAM pages from
>>> the domain memory for establishing these mappings.
>>>
>>> The extended regions are chosen at the domain creation time and
>>> advertised to it via "reg" property under hypervisor node in
>>> the guest device-tree. As region 0 is reserved for grant table
>>> space (always present), the indexes for extended regions are 1...N.
>>> If extended regions could not be allocated for some reason,
>>> Xen doesn't fail and behaves as usual, so only inserts region 0.
>>>
>>> Please note the following limitations:
>>> - The extended region feature is only supported for 64-bit domain.
>>> - The ACPI case is not covered.
>>
>> I understand the ACPI is not covered because we would need to create a 
>> new binding. But I am not sure to understand why 32-bit domain is not 
>> supported. Can you explain it?
> 
> The 32-bit domain is not supported for simplifying things from the 
> beginning. It is a little bit difficult to get everything working at 
> start. As I understand from discussion at [1] we can afford that 
> simplification. However, I should have mentioned that 32-bit domain is 
> not supported "for now".

Right, I forgot that. This is where it is useful to write down the 
decision in the commit message.

> 
>>
>>>
>>> ***
>>>
>>> As Dom0 is direct mapped domain on Arm (e.g. MFN == GFN)
>>> the algorithm to choose extended regions for it is different
>>> in comparison with the algorithm for non-direct mapped DomU.
>>> What is more, that extended regions should be chosen differently
>>> whether IOMMU is enabled or not.
>>>
>>> Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
>>> holes found in host device-tree if otherwise. 
>>
>> For the case when the IOMMU is disabled, this will only work if dom0 
>> cannot allocate memory outside of the original range. This is 
>> currently the case... but I think this should be spelled out in at 
>> least the commit message.
> 
> Agree, will update commit description.
> 
> 
>>
>>
>>> Make sure that
>>> extended regions are 2MB-aligned and located within maximum possible
>>> addressable physical memory range. The maximum number of extended
>>> regions is 128.
>>
>> Please explain how this limit was chosen.
> Well, I decided to not introduce new data struct and etc to represent 
> extended regions but reuse existing struct meminfo
> used for memory/reserved-memory and, as I though, perfectly fitted. So, 
> that limit come from NR_MEM_BANKS which is 128.

Ok. So this is an artificial limit. Please make it clear in the commit 
message.

> 
>>
>>>
>>> Suggested-by: Julien Grall <jgrall@amazon.com>
>>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>> ---
>>> Changes since RFC:
>>>     - update patch description
>>>     - drop uneeded "extended-region" DT property
>>> ---
>>>
>>>   xen/arch/arm/domain_build.c | 226 
>>> +++++++++++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 224 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
>>> index 206038d..070ec27 100644
>>> --- a/xen/arch/arm/domain_build.c
>>> +++ b/xen/arch/arm/domain_build.c
>>> @@ -724,6 +724,196 @@ static int __init make_memory_node(const struct 
>>> domain *d,
>>>       return res;
>>>   }
>>>   +static int __init add_ext_regions(unsigned long s, unsigned long 
>>> e, void *data)
>>> +{
>>> +    struct meminfo *ext_regions = data;
>>> +    paddr_t start, size;
>>> +
>>> +    if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
>>> +        return 0;
>>> +
>>> +    /* Both start and size of the extended region should be 2MB 
>>> aligned */
>>> +    start = (s + SZ_2M - 1) & ~(SZ_2M - 1);
>>> +    if ( start > e )
>>> +        return 0;
>>> +
>>> +    size = (e - start + 1) & ~(SZ_2M - 1);
>>> +    if ( !size )
>>> +        return 0;
>>> +
>>> +    ext_regions->bank[ext_regions->nr_banks].start = start;
>>> +    ext_regions->bank[ext_regions->nr_banks].size = size;
>>> +    ext_regions->nr_banks ++;
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +/*
>>> + * The extended regions will be prevalidated by the memory hotplug path
>>> + * in Linux which requires for any added address range to be within 
>>> maximum
>>> + * possible addressable physical memory range for which the linear 
>>> mapping
>>> + * could be created.
>>> + * For 48-bit VA space size the maximum addressable range are:
>>
>> When I read "maximum", I understand an upper limit. But below, you are 
>> providing a range. So should you drop "maximum"?
> 
> yes, it is a little bit confusing.
> 
> 
>>
>>
>> Also, this is tailored to Linux using 48-bit VA. How about other limits?
> These limits are calculated at [2]. Sorry, I didn't investigate yet what 
> values would be for other CONFIG_ARM64_VA_BITS_XXX. Also looks like some 
> configs depend on 16K/64K pages...
> I will try to investigate and provide limits later on.

I have thought a bit more about it. At the moment, you are relying on 
Xen to find a range that is addressable by the OS. This can be quite 
complex as different OS may have different requirement. So how about 
letting the OS to filter the ranges based on its limitations?

> 
> 
>>
>>
>>> + * 0x40000000 - 0x80003fffffff
>>> + */
>>> +#define EXT_REGION_START   0x40000000ULL
>>
>> I am probably missing something here.... There are platform out there 
>> with memory starting at 0 (IIRC ZynqMP is one example). So wouldn't 
>> this potentially rule out the extended region on such platform?
> 
>  From my understanding the extended region cannot be in 0...0x40000000 
> range. If these platforms have memory above first GB, I believe the 
> extended region(s) can be allocated for them.

Do you mean "cannot"? Technically this is a limitation of the current 
version of Linux. Tomorrow, someone may be able to remove that 
limitations. So, as mentionned above, maybe Xen should not do the filtering.

>>> +static int __init find_memory_holes(const struct kernel_info *kinfo,
>>> +                                    struct meminfo *ext_regions)
>>> +{
>>> +    struct dt_device_node *np;
>>> +    struct rangeset *mem_holes;
>>> +    paddr_t start, end;
>>> +    unsigned int i;
>>> +    int res;
>>> +
>>> +    dt_dprintk("Find memory holes for extended regions\n");
>>> +
>>> +    mem_holes = rangeset_new(NULL, NULL, 0);
>>> +    if ( !mem_holes )
>>> +        return -ENOMEM;
>>> +
>>> +    /* Start with maximum possible addressable physical memory range */
>>> +    start = EXT_REGION_START;
>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>> +    res = rangeset_add_range(mem_holes, start, end);
>>> +    if ( res )
>>> +    {
>>> +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>>> +               start, end);
>>> +        goto out;
>>> +    }
>>> +
>>> +    /* Remove all regions described by "reg" property (MMIO, RAM, 
>>> etc) */
>>
>> Well... The loop below is not going to handle all the regions 
>> described in the property "reg". Instead, it will cover a subset of 
>> "reg" where the memory is addressable.
> 
> As I understand, we are only interested in subset of "reg" where the 
> memory is addressable.

Right... That's not what your comment is saying.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-17 22:37         ` Stefano Stabellini
@ 2021-09-19 14:34           ` Julien Grall
  2021-09-19 20:18             ` Oleksandr
  2021-09-20 23:55             ` Stefano Stabellini
  0 siblings, 2 replies; 43+ messages in thread
From: Julien Grall @ 2021-09-19 14:34 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Oleksandr, xen-devel, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Henry Wang, Bertrand Marquis, Wei Chen

Hi Stefano,

On 18/09/2021 03:37, Stefano Stabellini wrote:
> On Fri, 17 Sep 2021, Stefano Stabellini wrote:
>> On Fri, 17 Sep 2021, Oleksandr wrote:
>>>>> +
>>>>> +    dt_dprintk("Find unallocated memory for extended regions\n");
>>>>> +
>>>>> +    unalloc_mem = rangeset_new(NULL, NULL, 0);
>>>>> +    if ( !unalloc_mem )
>>>>> +        return -ENOMEM;
>>>>> +
>>>>> +    /* Start with all available RAM */
>>>>> +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
>>>>> +    {
>>>>> +        start = bootinfo.mem.bank[i].start;
>>>>> +        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size - 1;
>>>>> +        res = rangeset_add_range(unalloc_mem, start, end);
>>>>> +        if ( res )
>>>>> +        {
>>>>> +            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>>>>> +                   start, end);
>>>>> +            goto out;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    /* Remove RAM assigned to Dom0 */
>>>>> +    for ( i = 0; i < assign_mem->nr_banks; i++ )
>>>>> +    {
>>>>> +        start = assign_mem->bank[i].start;
>>>>> +        end = assign_mem->bank[i].start + assign_mem->bank[i].size - 1;
>>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>>> +        if ( res )
>>>>> +        {
>>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>> +                   start, end);
>>>>> +            goto out;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    /* Remove reserved-memory regions */
>>>>> +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
>>>>> +    {
>>>>> +        start = bootinfo.reserved_mem.bank[i].start;
>>>>> +        end = bootinfo.reserved_mem.bank[i].start +
>>>>> +            bootinfo.reserved_mem.bank[i].size - 1;
>>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>>> +        if ( res )
>>>>> +        {
>>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>> +                   start, end);
>>>>> +            goto out;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    /* Remove grant table region */
>>>>> +    start = kinfo->gnttab_start;
>>>>> +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
>>>>> +    res = rangeset_remove_range(unalloc_mem, start, end);
>>>>> +    if ( res )
>>>>> +    {
>>>>> +        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
>>>>> +               start, end);
>>>>> +        goto out;
>>>>> +    }
>>>>> +
>>>>> +    start = EXT_REGION_START;
>>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>>> +    res = rangeset_report_ranges(unalloc_mem, start, end,
>>>>> +                                 add_ext_regions, ext_regions);
>>>>> +    if ( res )
>>>>> +        ext_regions->nr_banks = 0;
>>>>> +    else if ( !ext_regions->nr_banks )
>>>>> +        res = -ENOENT;
>>>>> +
>>>>> +out:
>>>>> +    rangeset_destroy(unalloc_mem);
>>>>> +
>>>>> +    return res;
>>>>> +}
>>>>> +
>>>>> +static int __init find_memory_holes(const struct kernel_info *kinfo,
>>>>> +                                    struct meminfo *ext_regions)
>>>>> +{
>>>>> +    struct dt_device_node *np;
>>>>> +    struct rangeset *mem_holes;
>>>>> +    paddr_t start, end;
>>>>> +    unsigned int i;
>>>>> +    int res;
>>>>> +
>>>>> +    dt_dprintk("Find memory holes for extended regions\n");
>>>>> +
>>>>> +    mem_holes = rangeset_new(NULL, NULL, 0);
>>>>> +    if ( !mem_holes )
>>>>> +        return -ENOMEM;
>>>>> +
>>>>> +    /* Start with maximum possible addressable physical memory range */
>>>>> +    start = EXT_REGION_START;
>>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>>> +    res = rangeset_add_range(mem_holes, start, end);
>>>>> +    if ( res )
>>>>> +    {
>>>>> +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>>>>> +               start, end);
>>>>> +        goto out;
>>>>> +    }
>>>>> +
>>>>> +    /* Remove all regions described by "reg" property (MMIO, RAM, etc) */
>>>>
>>>> Well... The loop below is not going to handle all the regions described in
>>>> the property "reg". Instead, it will cover a subset of "reg" where the
>>>> memory is addressable.
>>>
>>> As I understand, we are only interested in subset of "reg" where the memory is
>>> addressable.
>>>
>>>
>>>>
>>>>
>>>> You will also need to cover "ranges" that will describe the BARs for the PCI
>>>> devices.
>>> Good point.
>>
>> Yes, very good point!
>>
>>
>>> Could you please clarify how to recognize whether it is a PCI
>>> device as long as PCI support is not merged? Or just to find any device nodes
>>> with non-empty "ranges" property
>>> and retrieve addresses?
>>
>> Normally any bus can have a ranges property with the aperture and
>> possible address translations, including /amba (compatible =
>> "simple-bus"). However, in these cases dt_device_get_address already
>> takes care of it, see xen/common/device_tree.c:dt_device_get_address.
>>
>> The PCI bus is special for 2 reasons:
>> - the ranges property has a different format
>> - the bus is hot-pluggable
>>
>> So I think the only one that we need to treat specially is PCI.
>>
>> As far as I am aware PCI is the only bus (or maybe just the only bus
>> that we support?) where ranges means the aperture.
> 
> Now that I think about this, there is another "hotpluggable" scenario we
> need to think about:
> 
> [1] https://marc.info/?l=xen-devel&m=163056546214978
> 
> Xilinx devices have FPGA regions with apertures currently not described
> in device tree, where things can programmed in PL at runtime making new
> devices appear with new MMIO regions out of thin air.
> 
> Now let me start by saying that yes, the entire programmable region
> aperture could probably be described in device tree, however, in
> reality it is not currently done in any of the device trees we use
> (including the upstream device trees in linux.git).

This is rather annoying, but not unheard. There are a couple of 
platforms where the MMIOs are not fully described in the DT.

In fact, we have a callback 'specific_mappings' which create additional 
mappings (e.g. on the omap5) for dom0.

> 
> So, we have a problem :-(
> 
> 
> I can work toward getting the right info on device tree, but in reality
> that is going to take time and for now the device tree doesn't have the
> FPGA aperture in it. So if we accept this series as is, it is going to
> stop features like [1] from working. >
> If we cannot come up with any better plans, I think it would be better
> to drop find_memory_holes, only rely on find_unallocated_memory even
> when the IOMMU is on. One idea is that we could add on top of the
> regions found by find_unallocated_memory any MMIO regions marked as
> xen,passthrough: they are safe because they are not going to dom0 anyway.

(Oleksandr, it looks like some rationale about the different approach is 
missing in the commit message. Can you add it?)

When the IOMMU is on, Xen will do an extra mapping with GFN == MFN for 
every grant mapping in dom0. This is because Linux will always program 
the device with the MFN as it doesn't know whether the device has been 
protected by the hypervisor.

Therefore we can't use find_unallocated_memory() with the IOMMU on as it 
stands.

> 
> The only alternative I can think of is to have a per-board
> enable/disable toggle for the extend region but it would be very ugly.
At least, for your board, you seem to know the list of regions that are 
reserved for future use. So how about adding a per-board list of regions 
that should not be allocated?

This will also include anything mentioned in 'specific_mappings'.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-19 14:00       ` Julien Grall
@ 2021-09-19 17:59         ` Oleksandr
  0 siblings, 0 replies; 43+ messages in thread
From: Oleksandr @ 2021-09-19 17:59 UTC (permalink / raw)
  To: Julien Grall
  Cc: xen-devel, Oleksandr Tyshchenko, Stefano Stabellini,
	Volodymyr Babchuk, Henry Wang, Bertrand Marquis, Wei Chen


On 19.09.21 17:00, Julien Grall wrote:
> Hi,

Hi Julien


>
> On 18/09/2021 00:51, Oleksandr wrote:
>> On 17.09.21 18:48, Julien Grall wrote:
>>> On 10/09/2021 23:18, Oleksandr Tyshchenko wrote:
>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>
>>>> The extended region (safe range) is a region of guest physical
>>>> address space which is unused and could be safely used to create
>>>> grant/foreign mappings instead of wasting real RAM pages from
>>>> the domain memory for establishing these mappings.
>>>>
>>>> The extended regions are chosen at the domain creation time and
>>>> advertised to it via "reg" property under hypervisor node in
>>>> the guest device-tree. As region 0 is reserved for grant table
>>>> space (always present), the indexes for extended regions are 1...N.
>>>> If extended regions could not be allocated for some reason,
>>>> Xen doesn't fail and behaves as usual, so only inserts region 0.
>>>>
>>>> Please note the following limitations:
>>>> - The extended region feature is only supported for 64-bit domain.
>>>> - The ACPI case is not covered.
>>>
>>> I understand the ACPI is not covered because we would need to create 
>>> a new binding. But I am not sure to understand why 32-bit domain is 
>>> not supported. Can you explain it?
>>
>> The 32-bit domain is not supported for simplifying things from the 
>> beginning. It is a little bit difficult to get everything working at 
>> start. As I understand from discussion at [1] we can afford that 
>> simplification. However, I should have mentioned that 32-bit domain 
>> is not supported "for now".
>
> Right, I forgot that. This is where it is useful to write down the 
> decision in the commit message.

ok, will do.


>
>
>>
>>>
>>>>
>>>> ***
>>>>
>>>> As Dom0 is direct mapped domain on Arm (e.g. MFN == GFN)
>>>> the algorithm to choose extended regions for it is different
>>>> in comparison with the algorithm for non-direct mapped DomU.
>>>> What is more, that extended regions should be chosen differently
>>>> whether IOMMU is enabled or not.
>>>>
>>>> Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
>>>> holes found in host device-tree if otherwise. 
>>>
>>> For the case when the IOMMU is disabled, this will only work if dom0 
>>> cannot allocate memory outside of the original range. This is 
>>> currently the case... but I think this should be spelled out in at 
>>> least the commit message.
>>
>> Agree, will update commit description.
>>
>>
>>>
>>>
>>>> Make sure that
>>>> extended regions are 2MB-aligned and located within maximum possible
>>>> addressable physical memory range. The maximum number of extended
>>>> regions is 128.
>>>
>>> Please explain how this limit was chosen.
>> Well, I decided to not introduce new data struct and etc to represent 
>> extended regions but reuse existing struct meminfo
>> used for memory/reserved-memory and, as I though, perfectly fitted. 
>> So, that limit come from NR_MEM_BANKS which is 128.
>
> Ok. So this is an artificial limit. Please make it clear in the commit 
> message.

ok, will do


>
>
>>
>>>
>>>>
>>>> Suggested-by: Julien Grall <jgrall@amazon.com>
>>>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>> ---
>>>> Changes since RFC:
>>>>     - update patch description
>>>>     - drop uneeded "extended-region" DT property
>>>> ---
>>>>
>>>>   xen/arch/arm/domain_build.c | 226 
>>>> +++++++++++++++++++++++++++++++++++++++++++-
>>>>   1 file changed, 224 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
>>>> index 206038d..070ec27 100644
>>>> --- a/xen/arch/arm/domain_build.c
>>>> +++ b/xen/arch/arm/domain_build.c
>>>> @@ -724,6 +724,196 @@ static int __init make_memory_node(const 
>>>> struct domain *d,
>>>>       return res;
>>>>   }
>>>>   +static int __init add_ext_regions(unsigned long s, unsigned long 
>>>> e, void *data)
>>>> +{
>>>> +    struct meminfo *ext_regions = data;
>>>> +    paddr_t start, size;
>>>> +
>>>> +    if ( ext_regions->nr_banks >= ARRAY_SIZE(ext_regions->bank) )
>>>> +        return 0;
>>>> +
>>>> +    /* Both start and size of the extended region should be 2MB 
>>>> aligned */
>>>> +    start = (s + SZ_2M - 1) & ~(SZ_2M - 1);
>>>> +    if ( start > e )
>>>> +        return 0;
>>>> +
>>>> +    size = (e - start + 1) & ~(SZ_2M - 1);
>>>> +    if ( !size )
>>>> +        return 0;
>>>> +
>>>> +    ext_regions->bank[ext_regions->nr_banks].start = start;
>>>> +    ext_regions->bank[ext_regions->nr_banks].size = size;
>>>> +    ext_regions->nr_banks ++;
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +/*
>>>> + * The extended regions will be prevalidated by the memory hotplug 
>>>> path
>>>> + * in Linux which requires for any added address range to be 
>>>> within maximum
>>>> + * possible addressable physical memory range for which the linear 
>>>> mapping
>>>> + * could be created.
>>>> + * For 48-bit VA space size the maximum addressable range are:
>>>
>>> When I read "maximum", I understand an upper limit. But below, you 
>>> are providing a range. So should you drop "maximum"?
>>
>> yes, it is a little bit confusing.
>>
>>
>>>
>>>
>>> Also, this is tailored to Linux using 48-bit VA. How about other 
>>> limits?
>> These limits are calculated at [2]. Sorry, I didn't investigate yet 
>> what values would be for other CONFIG_ARM64_VA_BITS_XXX. Also looks 
>> like some configs depend on 16K/64K pages...
>> I will try to investigate and provide limits later on.

I have rebuilt Linux with CONFIG_ARM64_VA_BITS_39=y and printed the limits.

These are: 0x40000000 - 0x403fffffff


>>
>
> I have thought a bit more about it. At the moment, you are relying on 
> Xen to find a range that is addressable by the OS. This can be quite 
> complex as different OS may have different requirement. So how about 
> letting the OS to filter the ranges based on its limitations?

I think, it is a nice idea, thank you. So, I will drop OS specific 
limits (EXT_REGION_*) from the patch.

>
>
>>
>>
>>>
>>>
>>>> + * 0x40000000 - 0x80003fffffff
>>>> + */
>>>> +#define EXT_REGION_START   0x40000000ULL
>>>
>>> I am probably missing something here.... There are platform out 
>>> there with memory starting at 0 (IIRC ZynqMP is one example). So 
>>> wouldn't this potentially rule out the extended region on such 
>>> platform?
>>
>>  From my understanding the extended region cannot be in 
>> 0...0x40000000 range. If these platforms have memory above first GB, 
>> I believe the extended region(s) can be allocated for them.
>
> Do you mean "cannot"? 
No. I think, there was some misunderstanding from my size. Initially, I 
got this as "extended region feature cannot be used on such platform in 
general", so I tried to say that if these platforms
also had some RAM *above* 0x40000000 then extended region could be 
allocated for them in principle, we only won't be able to take advantage 
of 0...0x40000000...


> Technically this is a limitation of the current version of Linux. 
> Tomorrow, someone may be able to remove that limitations. So, as 
> mentionned above, maybe Xen should not do the filtering.

I got it, sounds reasonable.


>
>
>>>> +static int __init find_memory_holes(const struct kernel_info *kinfo,
>>>> +                                    struct meminfo *ext_regions)
>>>> +{
>>>> +    struct dt_device_node *np;
>>>> +    struct rangeset *mem_holes;
>>>> +    paddr_t start, end;
>>>> +    unsigned int i;
>>>> +    int res;
>>>> +
>>>> +    dt_dprintk("Find memory holes for extended regions\n");
>>>> +
>>>> +    mem_holes = rangeset_new(NULL, NULL, 0);
>>>> +    if ( !mem_holes )
>>>> +        return -ENOMEM;
>>>> +
>>>> +    /* Start with maximum possible addressable physical memory 
>>>> range */
>>>> +    start = EXT_REGION_START;
>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>> +    res = rangeset_add_range(mem_holes, start, end);
>>>> +    if ( res )
>>>> +    {
>>>> +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>>>> +               start, end);
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    /* Remove all regions described by "reg" property (MMIO, RAM, 
>>>> etc) */
>>>
>>> Well... The loop below is not going to handle all the regions 
>>> described in the property "reg". Instead, it will cover a subset of 
>>> "reg" where the memory is addressable.
>>
>> As I understand, we are only interested in subset of "reg" where the 
>> memory is addressable.
>
> Right... That's not what your comment is saying.

ok, will update.


>
>
> Cheers,
>
-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-19 14:34           ` Julien Grall
@ 2021-09-19 20:18             ` Oleksandr
  2021-09-20 23:21               ` Stefano Stabellini
  2021-09-20 23:55             ` Stefano Stabellini
  1 sibling, 1 reply; 43+ messages in thread
From: Oleksandr @ 2021-09-19 20:18 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, xen-devel, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Henry Wang, Bertrand Marquis, Wei Chen


On 19.09.21 17:34, Julien Grall wrote:



> Hi Stefano,

Hi Julien


>
> On 18/09/2021 03:37, Stefano Stabellini wrote:
>> On Fri, 17 Sep 2021, Stefano Stabellini wrote:
>>> On Fri, 17 Sep 2021, Oleksandr wrote:
>>>>>> +
>>>>>> +    dt_dprintk("Find unallocated memory for extended regions\n");
>>>>>> +
>>>>>> +    unalloc_mem = rangeset_new(NULL, NULL, 0);
>>>>>> +    if ( !unalloc_mem )
>>>>>> +        return -ENOMEM;
>>>>>> +
>>>>>> +    /* Start with all available RAM */
>>>>>> +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
>>>>>> +    {
>>>>>> +        start = bootinfo.mem.bank[i].start;
>>>>>> +        end = bootinfo.mem.bank[i].start + 
>>>>>> bootinfo.mem.bank[i].size - 1;
>>>>>> +        res = rangeset_add_range(unalloc_mem, start, end);
>>>>>> +        if ( res )
>>>>>> +        {
>>>>>> +            printk(XENLOG_ERR "Failed to add: 
>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>> +                   start, end);
>>>>>> +            goto out;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Remove RAM assigned to Dom0 */
>>>>>> +    for ( i = 0; i < assign_mem->nr_banks; i++ )
>>>>>> +    {
>>>>>> +        start = assign_mem->bank[i].start;
>>>>>> +        end = assign_mem->bank[i].start + 
>>>>>> assign_mem->bank[i].size - 1;
>>>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>>>> +        if ( res )
>>>>>> +        {
>>>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>> +                   start, end);
>>>>>> +            goto out;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Remove reserved-memory regions */
>>>>>> +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
>>>>>> +    {
>>>>>> +        start = bootinfo.reserved_mem.bank[i].start;
>>>>>> +        end = bootinfo.reserved_mem.bank[i].start +
>>>>>> +            bootinfo.reserved_mem.bank[i].size - 1;
>>>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>>>> +        if ( res )
>>>>>> +        {
>>>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>> +                   start, end);
>>>>>> +            goto out;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Remove grant table region */
>>>>>> +    start = kinfo->gnttab_start;
>>>>>> +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
>>>>>> +    res = rangeset_remove_range(unalloc_mem, start, end);
>>>>>> +    if ( res )
>>>>>> +    {
>>>>>> +        printk(XENLOG_ERR "Failed to remove: 
>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>> +               start, end);
>>>>>> +        goto out;
>>>>>> +    }
>>>>>> +
>>>>>> +    start = EXT_REGION_START;
>>>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>>>> +    res = rangeset_report_ranges(unalloc_mem, start, end,
>>>>>> +                                 add_ext_regions, ext_regions);
>>>>>> +    if ( res )
>>>>>> +        ext_regions->nr_banks = 0;
>>>>>> +    else if ( !ext_regions->nr_banks )
>>>>>> +        res = -ENOENT;
>>>>>> +
>>>>>> +out:
>>>>>> +    rangeset_destroy(unalloc_mem);
>>>>>> +
>>>>>> +    return res;
>>>>>> +}
>>>>>> +
>>>>>> +static int __init find_memory_holes(const struct kernel_info 
>>>>>> *kinfo,
>>>>>> +                                    struct meminfo *ext_regions)
>>>>>> +{
>>>>>> +    struct dt_device_node *np;
>>>>>> +    struct rangeset *mem_holes;
>>>>>> +    paddr_t start, end;
>>>>>> +    unsigned int i;
>>>>>> +    int res;
>>>>>> +
>>>>>> +    dt_dprintk("Find memory holes for extended regions\n");
>>>>>> +
>>>>>> +    mem_holes = rangeset_new(NULL, NULL, 0);
>>>>>> +    if ( !mem_holes )
>>>>>> +        return -ENOMEM;
>>>>>> +
>>>>>> +    /* Start with maximum possible addressable physical memory 
>>>>>> range */
>>>>>> +    start = EXT_REGION_START;
>>>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>>>> +    res = rangeset_add_range(mem_holes, start, end);
>>>>>> +    if ( res )
>>>>>> +    {
>>>>>> +        printk(XENLOG_ERR "Failed to add: 
>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>> +               start, end);
>>>>>> +        goto out;
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Remove all regions described by "reg" property (MMIO, 
>>>>>> RAM, etc) */
>>>>>
>>>>> Well... The loop below is not going to handle all the regions 
>>>>> described in
>>>>> the property "reg". Instead, it will cover a subset of "reg" where 
>>>>> the
>>>>> memory is addressable.
>>>>
>>>> As I understand, we are only interested in subset of "reg" where 
>>>> the memory is
>>>> addressable.
>>>>
>>>>
>>>>>
>>>>>
>>>>> You will also need to cover "ranges" that will describe the BARs 
>>>>> for the PCI
>>>>> devices.
>>>> Good point.
>>>
>>> Yes, very good point!
>>>
>>>
>>>> Could you please clarify how to recognize whether it is a PCI
>>>> device as long as PCI support is not merged? Or just to find any 
>>>> device nodes
>>>> with non-empty "ranges" property
>>>> and retrieve addresses?
>>>
>>> Normally any bus can have a ranges property with the aperture and
>>> possible address translations, including /amba (compatible =
>>> "simple-bus"). However, in these cases dt_device_get_address already
>>> takes care of it, see xen/common/device_tree.c:dt_device_get_address.
>>>
>>> The PCI bus is special for 2 reasons:
>>> - the ranges property has a different format
>>> - the bus is hot-pluggable
>>>
>>> So I think the only one that we need to treat specially is PCI.
>>>
>>> As far as I am aware PCI is the only bus (or maybe just the only bus
>>> that we support?) where ranges means the aperture.
>>
>> Now that I think about this, there is another "hotpluggable" scenario we
>> need to think about:
>>
>> [1] https://marc.info/?l=xen-devel&m=163056546214978
>>
>> Xilinx devices have FPGA regions with apertures currently not described
>> in device tree, where things can programmed in PL at runtime making new
>> devices appear with new MMIO regions out of thin air.
>>
>> Now let me start by saying that yes, the entire programmable region
>> aperture could probably be described in device tree, however, in
>> reality it is not currently done in any of the device trees we use
>> (including the upstream device trees in linux.git).
>
> This is rather annoying, but not unheard. There are a couple of 
> platforms where the MMIOs are not fully described in the DT.
>
> In fact, we have a callback 'specific_mappings' which create 
> additional mappings (e.g. on the omap5) for dom0.
>
>>
>> So, we have a problem :-(
>>
>>
>> I can work toward getting the right info on device tree, but in reality
>> that is going to take time and for now the device tree doesn't have the
>> FPGA aperture in it. So if we accept this series as is, it is going to
>> stop features like [1] from working. >
>> If we cannot come up with any better plans, I think it would be better
>> to drop find_memory_holes, only rely on find_unallocated_memory even
>> when the IOMMU is on. One idea is that we could add on top of the
>> regions found by find_unallocated_memory any MMIO regions marked as
>> xen,passthrough: they are safe because they are not going to dom0 
>> anyway.
>
> (Oleksandr, it looks like some rationale about the different approach 
> is missing in the commit message. Can you add it?)

Yes sure, but let me please clarify what is different approach in this 
context. Is it to *also* take into the account MMIO regions of the 
devices for passthrough for case when IOMMU is off (in addition to 
unallocated memory)? If yes, I wonder whether we will gain much with 
that according to that device's MMIO regions are usually not big enough 
and we stick to allocate extended regions with bigger size (> 64MB).


>
>
> When the IOMMU is on, Xen will do an extra mapping with GFN == MFN for 
> every grant mapping in dom0. This is because Linux will always program 
> the device with the MFN as it doesn't know whether the device has been 
> protected by the hypervisor.
>
> Therefore we can't use find_unallocated_memory() with the IOMMU on as 
> it stands.
>
>>
>> The only alternative I can think of is to have a per-board
>> enable/disable toggle for the extend region but it would be very ugly.
> At least, for your board, you seem to know the list of regions that 
> are reserved for future use. So how about adding a per-board list of 
> regions that should not be allocated?
>
> This will also include anything mentioned in 'specific_mappings'.
>
> Cheers,
>
-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 3/3] libxl/arm: Add handling of extended regions for DomU
  2021-09-16 22:35   ` Stefano Stabellini
@ 2021-09-20 20:07     ` Oleksandr
  2021-09-21 17:35       ` Oleksandr
  0 siblings, 1 reply; 43+ messages in thread
From: Oleksandr @ 2021-09-20 20:07 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Oleksandr Tyshchenko, Ian Jackson, Wei Liu,
	Anthony PERARD, Juergen Gross, Henry Wang, Bertrand Marquis,
	Wei Chen, Julien Grall


On 17.09.21 01:35, Stefano Stabellini wrote:

Hi Stefano

> On Fri, 10 Sep 2021, Oleksandr Tyshchenko wrote:
>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>
>> The extended region (safe range) is a region of guest physical
>> address space which is unused and could be safely used to create
>> grant/foreign mappings instead of wasting real RAM pages from
>> the domain memory for establishing these mappings.
>>
>> The extended regions are chosen at the domain creation time and
>> advertised to it via "reg" property under hypervisor node in
>> the guest device-tree. As region 0 is reserved for grant table
>> space (always present), the indexes for extended regions are 1...N.
>> If extended regions could not be allocated for some reason,
>> Xen doesn't fail and behaves as usual, so only inserts region 0.
>>
>> Please note the following limitations:
>> - The extended region feature is only supported for 64-bit domain.
>> - The ACPI case is not covered.
>>
>> ***
>>
>> The algorithm to choose extended regions for non-direct mapped
>> DomU is simpler in comparison with the algorithm for direct mapped
>> Dom0. As we have a lot of unused space above 4GB, provide single
>> 1GB-aligned region from the second RAM bank taking into the account
>> the maximum supported guest address space size and the amount of
>> memory assigned to the guest. The maximum size of the region is 128GB.
>>
>> Suggested-by: Julien Grall <jgrall@amazon.com>
>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>> ---
>> Changes since RFC:
>>     - update patch description
>>     - drop uneeded "extended-region" DT property
>>     - clear reg array in finalise_ext_region() and add a TODO
>> ---
>>   tools/libs/light/libxl_arm.c | 89 +++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 87 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/libs/light/libxl_arm.c b/tools/libs/light/libxl_arm.c
>> index e3140a6..8c1d9d7 100644
>> --- a/tools/libs/light/libxl_arm.c
>> +++ b/tools/libs/light/libxl_arm.c
>> @@ -615,9 +615,12 @@ static int make_hypervisor_node(libxl__gc *gc, void *fdt,
>>                                 "xen,xen");
>>       if (res) return res;
>>   
>> -    /* reg 0 is grant table space */
>> +    /*
>> +     * reg 0 is a placeholder for grant table space, reg 1 is a placeholder
>> +     * for the extended region.
>> +     */
>>       res = fdt_property_regs(gc, fdt, GUEST_ROOT_ADDRESS_CELLS, GUEST_ROOT_SIZE_CELLS,
>> -                            1,GUEST_GNTTAB_BASE, GUEST_GNTTAB_SIZE);
>> +                            2, 0, 0, 0, 0);
>>       if (res) return res;
>>   
>>       /*
>> @@ -1069,6 +1072,86 @@ static void finalise_one_node(libxl__gc *gc, void *fdt, const char *uname,
>>       }
>>   }
>>   
>> +#define ALIGN_UP_TO_GB(x)   (((x) + GB(1) - 1) & (~(GB(1) - 1)))
> Why do we need to align it to 1GB when for Dom0 is aligned to 2MB? I
> think it makes sense to have the same alignment requirement.
Here, unlike with Dom0, we can provide indeed a big region (the maximum 
size is 128GB), so I decided to use maximum block size for the alignment.
ok, I will use 2MB alignment to be consistent will Dom0.


>
>
>> +
>> +#define EXT_REGION_SIZE   GB(128)
> The region needs to be described in xen/include/public/arch-arm.h like
> GUEST_RAM1_BASE/SIZE.

ok, will do


>
>
>> +static void finalise_ext_region(libxl__gc *gc, struct xc_dom_image *dom)
>> +{
>> +    void *fdt = dom->devicetree_blob;
>> +    uint32_t regs[(GUEST_ROOT_ADDRESS_CELLS + GUEST_ROOT_SIZE_CELLS) * 2];
>> +    be32 *cells = &regs[0];
>> +    uint64_t region_size = 0, region_base, bank1end_align, bank1end_max;
>> +    uint32_t gpaddr_bits;
>> +    libxl_physinfo info;
>> +    int offset, rc;
>> +
>> +    offset = fdt_path_offset(fdt, "/hypervisor");
>> +    assert(offset > 0);
>> +
>> +    if (strcmp(dom->guest_type, "xen-3.0-aarch64")) {
>> +        LOG(WARN, "The extended region is only supported for 64-bit guest");
>> +        goto out;
>> +    }
>> +
>> +    rc = libxl_get_physinfo(CTX, &info);
>> +    assert(!rc);
>> +
>> +    gpaddr_bits = info.gpaddr_bits;
>> +    assert(gpaddr_bits >= 32 && gpaddr_bits <= 48);
>> +
>> +    /*
>> +     * Try to allocate single 1GB-aligned extended region from the second RAM
>> +     * bank (above 4GB) taking into the account the maximum supported guest
>> +     * address space size and the amount of memory assigned to the guest.
>> +     * The maximum size of the region is 128GB.
>> +     */
>> +    bank1end_max = min(1ULL << gpaddr_bits, GUEST_RAM1_BASE + GUEST_RAM1_SIZE);
>> +    bank1end_align = GUEST_RAM1_BASE +
>> +        ALIGN_UP_TO_GB((uint64_t)dom->rambank_size[1] << XC_PAGE_SHIFT);
>> +
>> +    if (bank1end_max <= bank1end_align) {
>> +        LOG(WARN, "The extended region cannot be allocated, not enough space");
>> +        goto out;
>> +    }
>> +
>> +    if (bank1end_max - bank1end_align > EXT_REGION_SIZE) {
>> +        region_base = bank1end_max - EXT_REGION_SIZE;
>> +        region_size = EXT_REGION_SIZE;
>> +    } else {
>> +        region_base = bank1end_align;
>> +        region_size = bank1end_max - bank1end_align;
>> +    }
>> +
>> +out:
>> +    /*
>> +     * The first region for grant table space must be always present.
>> +     * If we managed to allocate the extended region then insert it as
>> +     * a second region.
>> +     * TODO If we failed to allocate the region, we end up inserting
>> +     * zero-sized region. This is because we don't know in advance when
>> +     * creating hypervisor node whether it would be possible to allocate
>> +     * a region, but we have to create a placeholder anyway. The Linux driver
>> +     * is able to deal with by checking the region size. We cannot choose
>> +     * a region when creating hypervisor node because the guest memory layout
>> +     * is not know at that moment (and dom->rambank_size[1] is empty).
>> +     * We need to find a way not to expose invalid regions.
>> +     */
> This is not great -- it would be barely spec compliant.

Absolutely agree.


>   
>
> When make_hypervisor_node is called we know the max memory of the guest
> as build_info.max_memkb should be populate, right?

Right. Just a small change to pass build_info to make_hypervisor_node() 
is needed.


>
> If so, we could at least detect whether we can have an extended region
> (if not caculate the exact start address) from make_hypervisor_node.
>
> total_guest_memory = build_info.max_memkb * 1024;
> rambank1_approx = total_guest_memory - GUEST_RAM0_SIZE;
> extended_region_size = GUEST_RAM1_SIZE - rambank1_approx;
>
> if (extended_region_size >= MIN_EXT_REGION_SIZE)
>      allocate_ext_region
Good point! I will recheck that. I would prefer avoid spreading extended 
region handling (introduce finalise_ext_region())
and do everything from the make_hypervisor_node().


>
>
>> +    memset(regs, 0, sizeof(regs));
>> +    set_range(&cells, GUEST_ROOT_ADDRESS_CELLS, GUEST_ROOT_SIZE_CELLS,
>> +              GUEST_GNTTAB_BASE, GUEST_GNTTAB_SIZE);
>> +    if (region_size > 0) {
> I think we want to check against a minimum amount rather than 0. Maybe 64MB?

Sounds reasonable, will update.


>
>
>> +        LOG(DEBUG, "Extended region: %#"PRIx64"->%#"PRIx64"\n",
>> +            region_base, region_base + region_size);
>> +
>> +        set_range(&cells, GUEST_ROOT_ADDRESS_CELLS, GUEST_ROOT_SIZE_CELLS,
>> +                  region_base, region_size);
>> +    }
>> +
>> +    rc = fdt_setprop_inplace(fdt, offset, "reg", regs, sizeof(regs));
>> +    assert(!rc);
>> +}
>> +
>>   int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
>>                                                  uint32_t domid,
>>                                                  libxl_domain_config *d_config,
>> @@ -1109,6 +1192,8 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
>>   
>>       }
>>   
>> +    finalise_ext_region(gc, dom);
>> +
>>       for (i = 0; i < GUEST_RAM_BANKS; i++) {
>>           const uint64_t size = (uint64_t)dom->rambank_size[i] << XC_PAGE_SHIFT;
>>   
>> -- 
>> 2.7.4


Thank you.


-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-19 20:18             ` Oleksandr
@ 2021-09-20 23:21               ` Stefano Stabellini
  2021-09-21 18:14                 ` Oleksandr
  0 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2021-09-20 23:21 UTC (permalink / raw)
  To: Oleksandr
  Cc: Julien Grall, Stefano Stabellini, xen-devel,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Henry Wang,
	Bertrand Marquis, Wei Chen

[-- Attachment #1: Type: text/plain, Size: 9650 bytes --]

On Sun, 19 Sep 2021, Oleksandr wrote:
> > On 18/09/2021 03:37, Stefano Stabellini wrote:
> > > On Fri, 17 Sep 2021, Stefano Stabellini wrote:
> > > > On Fri, 17 Sep 2021, Oleksandr wrote:
> > > > > > > +
> > > > > > > +    dt_dprintk("Find unallocated memory for extended regions\n");
> > > > > > > +
> > > > > > > +    unalloc_mem = rangeset_new(NULL, NULL, 0);
> > > > > > > +    if ( !unalloc_mem )
> > > > > > > +        return -ENOMEM;
> > > > > > > +
> > > > > > > +    /* Start with all available RAM */
> > > > > > > +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
> > > > > > > +    {
> > > > > > > +        start = bootinfo.mem.bank[i].start;
> > > > > > > +        end = bootinfo.mem.bank[i].start +
> > > > > > > bootinfo.mem.bank[i].size - 1;
> > > > > > > +        res = rangeset_add_range(unalloc_mem, start, end);
> > > > > > > +        if ( res )
> > > > > > > +        {
> > > > > > > +            printk(XENLOG_ERR "Failed to add:
> > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > +                   start, end);
> > > > > > > +            goto out;
> > > > > > > +        }
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    /* Remove RAM assigned to Dom0 */
> > > > > > > +    for ( i = 0; i < assign_mem->nr_banks; i++ )
> > > > > > > +    {
> > > > > > > +        start = assign_mem->bank[i].start;
> > > > > > > +        end = assign_mem->bank[i].start +
> > > > > > > assign_mem->bank[i].size - 1;
> > > > > > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > > > +        if ( res )
> > > > > > > +        {
> > > > > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > +                   start, end);
> > > > > > > +            goto out;
> > > > > > > +        }
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    /* Remove reserved-memory regions */
> > > > > > > +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
> > > > > > > +    {
> > > > > > > +        start = bootinfo.reserved_mem.bank[i].start;
> > > > > > > +        end = bootinfo.reserved_mem.bank[i].start +
> > > > > > > +            bootinfo.reserved_mem.bank[i].size - 1;
> > > > > > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > > > +        if ( res )
> > > > > > > +        {
> > > > > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > +                   start, end);
> > > > > > > +            goto out;
> > > > > > > +        }
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    /* Remove grant table region */
> > > > > > > +    start = kinfo->gnttab_start;
> > > > > > > +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
> > > > > > > +    res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > > > +    if ( res )
> > > > > > > +    {
> > > > > > > +        printk(XENLOG_ERR "Failed to remove:
> > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > +               start, end);
> > > > > > > +        goto out;
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    start = EXT_REGION_START;
> > > > > > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > > > > > +    res = rangeset_report_ranges(unalloc_mem, start, end,
> > > > > > > +                                 add_ext_regions, ext_regions);
> > > > > > > +    if ( res )
> > > > > > > +        ext_regions->nr_banks = 0;
> > > > > > > +    else if ( !ext_regions->nr_banks )
> > > > > > > +        res = -ENOENT;
> > > > > > > +
> > > > > > > +out:
> > > > > > > +    rangeset_destroy(unalloc_mem);
> > > > > > > +
> > > > > > > +    return res;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int __init find_memory_holes(const struct kernel_info
> > > > > > > *kinfo,
> > > > > > > +                                    struct meminfo *ext_regions)
> > > > > > > +{
> > > > > > > +    struct dt_device_node *np;
> > > > > > > +    struct rangeset *mem_holes;
> > > > > > > +    paddr_t start, end;
> > > > > > > +    unsigned int i;
> > > > > > > +    int res;
> > > > > > > +
> > > > > > > +    dt_dprintk("Find memory holes for extended regions\n");
> > > > > > > +
> > > > > > > +    mem_holes = rangeset_new(NULL, NULL, 0);
> > > > > > > +    if ( !mem_holes )
> > > > > > > +        return -ENOMEM;
> > > > > > > +
> > > > > > > +    /* Start with maximum possible addressable physical memory
> > > > > > > range */
> > > > > > > +    start = EXT_REGION_START;
> > > > > > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > > > > > +    res = rangeset_add_range(mem_holes, start, end);
> > > > > > > +    if ( res )
> > > > > > > +    {
> > > > > > > +        printk(XENLOG_ERR "Failed to add:
> > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > +               start, end);
> > > > > > > +        goto out;
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    /* Remove all regions described by "reg" property (MMIO, RAM,
> > > > > > > etc) */
> > > > > > 
> > > > > > Well... The loop below is not going to handle all the regions
> > > > > > described in
> > > > > > the property "reg". Instead, it will cover a subset of "reg" where
> > > > > > the
> > > > > > memory is addressable.
> > > > > 
> > > > > As I understand, we are only interested in subset of "reg" where the
> > > > > memory is
> > > > > addressable.
> > > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > You will also need to cover "ranges" that will describe the BARs for
> > > > > > the PCI
> > > > > > devices.
> > > > > Good point.
> > > > 
> > > > Yes, very good point!
> > > > 
> > > > 
> > > > > Could you please clarify how to recognize whether it is a PCI
> > > > > device as long as PCI support is not merged? Or just to find any
> > > > > device nodes
> > > > > with non-empty "ranges" property
> > > > > and retrieve addresses?
> > > > 
> > > > Normally any bus can have a ranges property with the aperture and
> > > > possible address translations, including /amba (compatible =
> > > > "simple-bus"). However, in these cases dt_device_get_address already
> > > > takes care of it, see xen/common/device_tree.c:dt_device_get_address.
> > > > 
> > > > The PCI bus is special for 2 reasons:
> > > > - the ranges property has a different format
> > > > - the bus is hot-pluggable
> > > > 
> > > > So I think the only one that we need to treat specially is PCI.
> > > > 
> > > > As far as I am aware PCI is the only bus (or maybe just the only bus
> > > > that we support?) where ranges means the aperture.
> > > 
> > > Now that I think about this, there is another "hotpluggable" scenario we
> > > need to think about:
> > > 
> > > [1] https://marc.info/?l=xen-devel&m=163056546214978
> > > 
> > > Xilinx devices have FPGA regions with apertures currently not described
> > > in device tree, where things can programmed in PL at runtime making new
> > > devices appear with new MMIO regions out of thin air.
> > > 
> > > Now let me start by saying that yes, the entire programmable region
> > > aperture could probably be described in device tree, however, in
> > > reality it is not currently done in any of the device trees we use
> > > (including the upstream device trees in linux.git).
> > 
> > This is rather annoying, but not unheard. There are a couple of platforms
> > where the MMIOs are not fully described in the DT.
> > 
> > In fact, we have a callback 'specific_mappings' which create additional
> > mappings (e.g. on the omap5) for dom0.
> > 
> > > 
> > > So, we have a problem :-(
> > > 
> > > 
> > > I can work toward getting the right info on device tree, but in reality
> > > that is going to take time and for now the device tree doesn't have the
> > > FPGA aperture in it. So if we accept this series as is, it is going to
> > > stop features like [1] from working. >
> > > If we cannot come up with any better plans, I think it would be better
> > > to drop find_memory_holes, only rely on find_unallocated_memory even
> > > when the IOMMU is on. One idea is that we could add on top of the
> > > regions found by find_unallocated_memory any MMIO regions marked as
> > > xen,passthrough: they are safe because they are not going to dom0 anyway.
> > 
> > (Oleksandr, it looks like some rationale about the different approach is
> > missing in the commit message. Can you add it?)
> 
> Yes sure, but let me please clarify what is different approach in this
> context. Is it to *also* take into the account MMIO regions of the devices for
> passthrough for case when IOMMU is off (in addition to unallocated memory)? If
> yes, I wonder whether we will gain much with that according to that device's
> MMIO regions are usually not big enough and we stick to allocate extended
> regions with bigger size (> 64MB).

That's fair enough. There are a couple of counter examples where the
MMIO regions for the device to assign are quite large, for instance a
GPU, Xilinx AIEngine, or the PCIe Root Complex with the entire aperture,
but maybe they are not that common. I am not sure if it is worth
scanning the tree for xen,passthrough regions every time at boot for
this.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-19 14:34           ` Julien Grall
  2021-09-19 20:18             ` Oleksandr
@ 2021-09-20 23:55             ` Stefano Stabellini
  1 sibling, 0 replies; 43+ messages in thread
From: Stefano Stabellini @ 2021-09-20 23:55 UTC (permalink / raw)
  To: Julien Grall
  Cc: Stefano Stabellini, Oleksandr, xen-devel, Oleksandr Tyshchenko,
	Volodymyr Babchuk, Henry Wang, Bertrand Marquis, Wei Chen

[-- Attachment #1: Type: text/plain, Size: 9814 bytes --]

On Sun, 19 Sep 2021, Julien Grall wrote:
> On 18/09/2021 03:37, Stefano Stabellini wrote:
> > On Fri, 17 Sep 2021, Stefano Stabellini wrote:
> > > On Fri, 17 Sep 2021, Oleksandr wrote:
> > > > > > +
> > > > > > +    dt_dprintk("Find unallocated memory for extended regions\n");
> > > > > > +
> > > > > > +    unalloc_mem = rangeset_new(NULL, NULL, 0);
> > > > > > +    if ( !unalloc_mem )
> > > > > > +        return -ENOMEM;
> > > > > > +
> > > > > > +    /* Start with all available RAM */
> > > > > > +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
> > > > > > +    {
> > > > > > +        start = bootinfo.mem.bank[i].start;
> > > > > > +        end = bootinfo.mem.bank[i].start +
> > > > > > bootinfo.mem.bank[i].size - 1;
> > > > > > +        res = rangeset_add_range(unalloc_mem, start, end);
> > > > > > +        if ( res )
> > > > > > +        {
> > > > > > +            printk(XENLOG_ERR "Failed to add:
> > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > +                   start, end);
> > > > > > +            goto out;
> > > > > > +        }
> > > > > > +    }
> > > > > > +
> > > > > > +    /* Remove RAM assigned to Dom0 */
> > > > > > +    for ( i = 0; i < assign_mem->nr_banks; i++ )
> > > > > > +    {
> > > > > > +        start = assign_mem->bank[i].start;
> > > > > > +        end = assign_mem->bank[i].start + assign_mem->bank[i].size
> > > > > > - 1;
> > > > > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > > +        if ( res )
> > > > > > +        {
> > > > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > +                   start, end);
> > > > > > +            goto out;
> > > > > > +        }
> > > > > > +    }
> > > > > > +
> > > > > > +    /* Remove reserved-memory regions */
> > > > > > +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
> > > > > > +    {
> > > > > > +        start = bootinfo.reserved_mem.bank[i].start;
> > > > > > +        end = bootinfo.reserved_mem.bank[i].start +
> > > > > > +            bootinfo.reserved_mem.bank[i].size - 1;
> > > > > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > > +        if ( res )
> > > > > > +        {
> > > > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > +                   start, end);
> > > > > > +            goto out;
> > > > > > +        }
> > > > > > +    }
> > > > > > +
> > > > > > +    /* Remove grant table region */
> > > > > > +    start = kinfo->gnttab_start;
> > > > > > +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
> > > > > > +    res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > > +    if ( res )
> > > > > > +    {
> > > > > > +        printk(XENLOG_ERR "Failed to remove:
> > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > +               start, end);
> > > > > > +        goto out;
> > > > > > +    }
> > > > > > +
> > > > > > +    start = EXT_REGION_START;
> > > > > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > > > > +    res = rangeset_report_ranges(unalloc_mem, start, end,
> > > > > > +                                 add_ext_regions, ext_regions);
> > > > > > +    if ( res )
> > > > > > +        ext_regions->nr_banks = 0;
> > > > > > +    else if ( !ext_regions->nr_banks )
> > > > > > +        res = -ENOENT;
> > > > > > +
> > > > > > +out:
> > > > > > +    rangeset_destroy(unalloc_mem);
> > > > > > +
> > > > > > +    return res;
> > > > > > +}
> > > > > > +
> > > > > > +static int __init find_memory_holes(const struct kernel_info
> > > > > > *kinfo,
> > > > > > +                                    struct meminfo *ext_regions)
> > > > > > +{
> > > > > > +    struct dt_device_node *np;
> > > > > > +    struct rangeset *mem_holes;
> > > > > > +    paddr_t start, end;
> > > > > > +    unsigned int i;
> > > > > > +    int res;
> > > > > > +
> > > > > > +    dt_dprintk("Find memory holes for extended regions\n");
> > > > > > +
> > > > > > +    mem_holes = rangeset_new(NULL, NULL, 0);
> > > > > > +    if ( !mem_holes )
> > > > > > +        return -ENOMEM;
> > > > > > +
> > > > > > +    /* Start with maximum possible addressable physical memory
> > > > > > range */
> > > > > > +    start = EXT_REGION_START;
> > > > > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > > > > +    res = rangeset_add_range(mem_holes, start, end);
> > > > > > +    if ( res )
> > > > > > +    {
> > > > > > +        printk(XENLOG_ERR "Failed to add:
> > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > +               start, end);
> > > > > > +        goto out;
> > > > > > +    }
> > > > > > +
> > > > > > +    /* Remove all regions described by "reg" property (MMIO, RAM,
> > > > > > etc) */
> > > > > 
> > > > > Well... The loop below is not going to handle all the regions
> > > > > described in
> > > > > the property "reg". Instead, it will cover a subset of "reg" where the
> > > > > memory is addressable.
> > > > 
> > > > As I understand, we are only interested in subset of "reg" where the
> > > > memory is
> > > > addressable.
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > You will also need to cover "ranges" that will describe the BARs for
> > > > > the PCI
> > > > > devices.
> > > > Good point.
> > > 
> > > Yes, very good point!
> > > 
> > > 
> > > > Could you please clarify how to recognize whether it is a PCI
> > > > device as long as PCI support is not merged? Or just to find any device
> > > > nodes
> > > > with non-empty "ranges" property
> > > > and retrieve addresses?
> > > 
> > > Normally any bus can have a ranges property with the aperture and
> > > possible address translations, including /amba (compatible =
> > > "simple-bus"). However, in these cases dt_device_get_address already
> > > takes care of it, see xen/common/device_tree.c:dt_device_get_address.
> > > 
> > > The PCI bus is special for 2 reasons:
> > > - the ranges property has a different format
> > > - the bus is hot-pluggable
> > > 
> > > So I think the only one that we need to treat specially is PCI.
> > > 
> > > As far as I am aware PCI is the only bus (or maybe just the only bus
> > > that we support?) where ranges means the aperture.
> > 
> > Now that I think about this, there is another "hotpluggable" scenario we
> > need to think about:
> > 
> > [1] https://marc.info/?l=xen-devel&m=163056546214978
> > 
> > Xilinx devices have FPGA regions with apertures currently not described
> > in device tree, where things can programmed in PL at runtime making new
> > devices appear with new MMIO regions out of thin air.
> > 
> > Now let me start by saying that yes, the entire programmable region
> > aperture could probably be described in device tree, however, in
> > reality it is not currently done in any of the device trees we use
> > (including the upstream device trees in linux.git).
> 
> This is rather annoying, but not unheard. There are a couple of platforms
> where the MMIOs are not fully described in the DT.
> 
> In fact, we have a callback 'specific_mappings' which create additional
> mappings (e.g. on the omap5) for dom0.

Just for clarity this is a bit different because it is not an
MMIO-region yet. It is only a *potential* MMIO region. Basically it is
nothing until the Programmable Logic gets programmed. But the
Programmable Logic only uses addresses within a given range, thankfully,
and we know the range beforehand.

 
> > So, we have a problem :-(
> > 
> > 
> > I can work toward getting the right info on device tree, but in reality
> > that is going to take time and for now the device tree doesn't have the
> > FPGA aperture in it. So if we accept this series as is, it is going to
> > stop features like [1] from working. >
> > If we cannot come up with any better plans, I think it would be better
> > to drop find_memory_holes, only rely on find_unallocated_memory even
> > when the IOMMU is on. One idea is that we could add on top of the
> > regions found by find_unallocated_memory any MMIO regions marked as
> > xen,passthrough: they are safe because they are not going to dom0 anyway.
> 
> (Oleksandr, it looks like some rationale about the different approach is
> missing in the commit message. Can you add it?)
> 
> When the IOMMU is on, Xen will do an extra mapping with GFN == MFN for every
> grant mapping in dom0. This is because Linux will always program the device
> with the MFN as it doesn't know whether the device has been protected by the
> hypervisor.
> 
> Therefore we can't use find_unallocated_memory() with the IOMMU on as it
> stands.
> 
> > The only alternative I can think of is to have a per-board
> > enable/disable toggle for the extend region but it would be very ugly.
> At least, for your board, you seem to know the list of regions that are
> reserved for future use.
>
> So how about adding a per-board list of regions that
> should not be allocated?
> 
> This will also include anything mentioned in 'specific_mappings'.

I am OK with that. I should be able to find the potential address ranges
for Xilinx boards. However, the ranges might be different for different
boards (for different family of boards, not for every little different
revision). Hopefully, Xilinx is the worst case as the hardware is
actually programmable.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 3/3] libxl/arm: Add handling of extended regions for DomU
  2021-09-20 20:07     ` Oleksandr
@ 2021-09-21 17:35       ` Oleksandr
  0 siblings, 0 replies; 43+ messages in thread
From: Oleksandr @ 2021-09-21 17:35 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Oleksandr Tyshchenko, Ian Jackson, Wei Liu,
	Anthony PERARD, Juergen Gross, Henry Wang, Bertrand Marquis,
	Wei Chen, Julien Grall


Hi Stefano

[snip]



>
>>
>>
>>> +static void finalise_ext_region(libxl__gc *gc, struct xc_dom_image 
>>> *dom)
>>> +{
>>> +    void *fdt = dom->devicetree_blob;
>>> +    uint32_t regs[(GUEST_ROOT_ADDRESS_CELLS + 
>>> GUEST_ROOT_SIZE_CELLS) * 2];
>>> +    be32 *cells = &regs[0];
>>> +    uint64_t region_size = 0, region_base, bank1end_align, 
>>> bank1end_max;
>>> +    uint32_t gpaddr_bits;
>>> +    libxl_physinfo info;
>>> +    int offset, rc;
>>> +
>>> +    offset = fdt_path_offset(fdt, "/hypervisor");
>>> +    assert(offset > 0);
>>> +
>>> +    if (strcmp(dom->guest_type, "xen-3.0-aarch64")) {
>>> +        LOG(WARN, "The extended region is only supported for 64-bit 
>>> guest");
>>> +        goto out;
>>> +    }
>>> +
>>> +    rc = libxl_get_physinfo(CTX, &info);
>>> +    assert(!rc);
>>> +
>>> +    gpaddr_bits = info.gpaddr_bits;
>>> +    assert(gpaddr_bits >= 32 && gpaddr_bits <= 48);
>>> +
>>> +    /*
>>> +     * Try to allocate single 1GB-aligned extended region from the 
>>> second RAM
>>> +     * bank (above 4GB) taking into the account the maximum 
>>> supported guest
>>> +     * address space size and the amount of memory assigned to the 
>>> guest.
>>> +     * The maximum size of the region is 128GB.
>>> +     */
>>> +    bank1end_max = min(1ULL << gpaddr_bits, GUEST_RAM1_BASE + 
>>> GUEST_RAM1_SIZE);
>>> +    bank1end_align = GUEST_RAM1_BASE +
>>> +        ALIGN_UP_TO_GB((uint64_t)dom->rambank_size[1] << 
>>> XC_PAGE_SHIFT);
>>> +
>>> +    if (bank1end_max <= bank1end_align) {
>>> +        LOG(WARN, "The extended region cannot be allocated, not 
>>> enough space");
>>> +        goto out;
>>> +    }
>>> +
>>> +    if (bank1end_max - bank1end_align > EXT_REGION_SIZE) {
>>> +        region_base = bank1end_max - EXT_REGION_SIZE;
>>> +        region_size = EXT_REGION_SIZE;
>>> +    } else {
>>> +        region_base = bank1end_align;
>>> +        region_size = bank1end_max - bank1end_align;
>>> +    }
>>> +
>>> +out:
>>> +    /*
>>> +     * The first region for grant table space must be always present.
>>> +     * If we managed to allocate the extended region then insert it as
>>> +     * a second region.
>>> +     * TODO If we failed to allocate the region, we end up inserting
>>> +     * zero-sized region. This is because we don't know in advance 
>>> when
>>> +     * creating hypervisor node whether it would be possible to 
>>> allocate
>>> +     * a region, but we have to create a placeholder anyway. The 
>>> Linux driver
>>> +     * is able to deal with by checking the region size. We cannot 
>>> choose
>>> +     * a region when creating hypervisor node because the guest 
>>> memory layout
>>> +     * is not know at that moment (and dom->rambank_size[1] is empty).
>>> +     * We need to find a way not to expose invalid regions.
>>> +     */
>> This is not great -- it would be barely spec compliant.
>
> Absolutely agree.
>
>
>>
>> When make_hypervisor_node is called we know the max memory of the guest
>> as build_info.max_memkb should be populate, right?
>
> Right. Just a small change to pass build_info to 
> make_hypervisor_node() is needed.
>
>
>>
>> If so, we could at least detect whether we can have an extended region
>> (if not caculate the exact start address) from make_hypervisor_node.
>>
>> total_guest_memory = build_info.max_memkb * 1024;
>> rambank1_approx = total_guest_memory - GUEST_RAM0_SIZE;
>> extended_region_size = GUEST_RAM1_SIZE - rambank1_approx;
>>
>> if (extended_region_size >= MIN_EXT_REGION_SIZE)
>>      allocate_ext_region
> Good point! I will recheck that. I would prefer avoid spreading 
> extended region handling (introduce finalise_ext_region())
> and do everything from the make_hypervisor_node().
I experimented with that, so we can indeed calculate the address and 
size of extended region from make_hypervisor_node(),
there is no need for finalise_ext_region(). So, there won't be any 
zero-sized region anymore (due to unpopulated placeholder) if we fail to 
allocate extended region.
Thanks for the idea!

[snip]


-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-20 23:21               ` Stefano Stabellini
@ 2021-09-21 18:14                 ` Oleksandr
  2021-09-21 22:00                   ` Stefano Stabellini
  0 siblings, 1 reply; 43+ messages in thread
From: Oleksandr @ 2021-09-21 18:14 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, xen-devel, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Henry Wang, Bertrand Marquis, Wei Chen


On 21.09.21 02:21, Stefano Stabellini wrote:

Hi Stefano

> On Sun, 19 Sep 2021, Oleksandr wrote:
>>> On 18/09/2021 03:37, Stefano Stabellini wrote:
>>>> On Fri, 17 Sep 2021, Stefano Stabellini wrote:
>>>>> On Fri, 17 Sep 2021, Oleksandr wrote:
>>>>>>>> +
>>>>>>>> +    dt_dprintk("Find unallocated memory for extended regions\n");
>>>>>>>> +
>>>>>>>> +    unalloc_mem = rangeset_new(NULL, NULL, 0);
>>>>>>>> +    if ( !unalloc_mem )
>>>>>>>> +        return -ENOMEM;
>>>>>>>> +
>>>>>>>> +    /* Start with all available RAM */
>>>>>>>> +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
>>>>>>>> +    {
>>>>>>>> +        start = bootinfo.mem.bank[i].start;
>>>>>>>> +        end = bootinfo.mem.bank[i].start +
>>>>>>>> bootinfo.mem.bank[i].size - 1;
>>>>>>>> +        res = rangeset_add_range(unalloc_mem, start, end);
>>>>>>>> +        if ( res )
>>>>>>>> +        {
>>>>>>>> +            printk(XENLOG_ERR "Failed to add:
>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>> +                   start, end);
>>>>>>>> +            goto out;
>>>>>>>> +        }
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    /* Remove RAM assigned to Dom0 */
>>>>>>>> +    for ( i = 0; i < assign_mem->nr_banks; i++ )
>>>>>>>> +    {
>>>>>>>> +        start = assign_mem->bank[i].start;
>>>>>>>> +        end = assign_mem->bank[i].start +
>>>>>>>> assign_mem->bank[i].size - 1;
>>>>>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>>>>>> +        if ( res )
>>>>>>>> +        {
>>>>>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>> +                   start, end);
>>>>>>>> +            goto out;
>>>>>>>> +        }
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    /* Remove reserved-memory regions */
>>>>>>>> +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
>>>>>>>> +    {
>>>>>>>> +        start = bootinfo.reserved_mem.bank[i].start;
>>>>>>>> +        end = bootinfo.reserved_mem.bank[i].start +
>>>>>>>> +            bootinfo.reserved_mem.bank[i].size - 1;
>>>>>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>>>>>> +        if ( res )
>>>>>>>> +        {
>>>>>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>> +                   start, end);
>>>>>>>> +            goto out;
>>>>>>>> +        }
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    /* Remove grant table region */
>>>>>>>> +    start = kinfo->gnttab_start;
>>>>>>>> +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
>>>>>>>> +    res = rangeset_remove_range(unalloc_mem, start, end);
>>>>>>>> +    if ( res )
>>>>>>>> +    {
>>>>>>>> +        printk(XENLOG_ERR "Failed to remove:
>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>> +               start, end);
>>>>>>>> +        goto out;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    start = EXT_REGION_START;
>>>>>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>>>>>> +    res = rangeset_report_ranges(unalloc_mem, start, end,
>>>>>>>> +                                 add_ext_regions, ext_regions);
>>>>>>>> +    if ( res )
>>>>>>>> +        ext_regions->nr_banks = 0;
>>>>>>>> +    else if ( !ext_regions->nr_banks )
>>>>>>>> +        res = -ENOENT;
>>>>>>>> +
>>>>>>>> +out:
>>>>>>>> +    rangeset_destroy(unalloc_mem);
>>>>>>>> +
>>>>>>>> +    return res;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int __init find_memory_holes(const struct kernel_info
>>>>>>>> *kinfo,
>>>>>>>> +                                    struct meminfo *ext_regions)
>>>>>>>> +{
>>>>>>>> +    struct dt_device_node *np;
>>>>>>>> +    struct rangeset *mem_holes;
>>>>>>>> +    paddr_t start, end;
>>>>>>>> +    unsigned int i;
>>>>>>>> +    int res;
>>>>>>>> +
>>>>>>>> +    dt_dprintk("Find memory holes for extended regions\n");
>>>>>>>> +
>>>>>>>> +    mem_holes = rangeset_new(NULL, NULL, 0);
>>>>>>>> +    if ( !mem_holes )
>>>>>>>> +        return -ENOMEM;
>>>>>>>> +
>>>>>>>> +    /* Start with maximum possible addressable physical memory
>>>>>>>> range */
>>>>>>>> +    start = EXT_REGION_START;
>>>>>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>>>>>> +    res = rangeset_add_range(mem_holes, start, end);
>>>>>>>> +    if ( res )
>>>>>>>> +    {
>>>>>>>> +        printk(XENLOG_ERR "Failed to add:
>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>> +               start, end);
>>>>>>>> +        goto out;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    /* Remove all regions described by "reg" property (MMIO, RAM,
>>>>>>>> etc) */
>>>>>>> Well... The loop below is not going to handle all the regions
>>>>>>> described in
>>>>>>> the property "reg". Instead, it will cover a subset of "reg" where
>>>>>>> the
>>>>>>> memory is addressable.
>>>>>> As I understand, we are only interested in subset of "reg" where the
>>>>>> memory is
>>>>>> addressable.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> You will also need to cover "ranges" that will describe the BARs for
>>>>>>> the PCI
>>>>>>> devices.
>>>>>> Good point.
>>>>> Yes, very good point!
>>>>>
>>>>>
>>>>>> Could you please clarify how to recognize whether it is a PCI
>>>>>> device as long as PCI support is not merged? Or just to find any
>>>>>> device nodes
>>>>>> with non-empty "ranges" property
>>>>>> and retrieve addresses?
>>>>> Normally any bus can have a ranges property with the aperture and
>>>>> possible address translations, including /amba (compatible =
>>>>> "simple-bus"). However, in these cases dt_device_get_address already
>>>>> takes care of it, see xen/common/device_tree.c:dt_device_get_address.
>>>>>
>>>>> The PCI bus is special for 2 reasons:
>>>>> - the ranges property has a different format
>>>>> - the bus is hot-pluggable
>>>>>
>>>>> So I think the only one that we need to treat specially is PCI.
>>>>>
>>>>> As far as I am aware PCI is the only bus (or maybe just the only bus
>>>>> that we support?) where ranges means the aperture.
>>>> Now that I think about this, there is another "hotpluggable" scenario we
>>>> need to think about:
>>>>
>>>> [1] https://marc.info/?l=xen-devel&m=163056546214978
>>>>
>>>> Xilinx devices have FPGA regions with apertures currently not described
>>>> in device tree, where things can programmed in PL at runtime making new
>>>> devices appear with new MMIO regions out of thin air.
>>>>
>>>> Now let me start by saying that yes, the entire programmable region
>>>> aperture could probably be described in device tree, however, in
>>>> reality it is not currently done in any of the device trees we use
>>>> (including the upstream device trees in linux.git).
>>> This is rather annoying, but not unheard. There are a couple of platforms
>>> where the MMIOs are not fully described in the DT.
>>>
>>> In fact, we have a callback 'specific_mappings' which create additional
>>> mappings (e.g. on the omap5) for dom0.
>>>
>>>> So, we have a problem :-(
>>>>
>>>>
>>>> I can work toward getting the right info on device tree, but in reality
>>>> that is going to take time and for now the device tree doesn't have the
>>>> FPGA aperture in it. So if we accept this series as is, it is going to
>>>> stop features like [1] from working. >
>>>> If we cannot come up with any better plans, I think it would be better
>>>> to drop find_memory_holes, only rely on find_unallocated_memory even
>>>> when the IOMMU is on. One idea is that we could add on top of the
>>>> regions found by find_unallocated_memory any MMIO regions marked as
>>>> xen,passthrough: they are safe because they are not going to dom0 anyway.
>>> (Oleksandr, it looks like some rationale about the different approach is
>>> missing in the commit message. Can you add it?)
>> Yes sure, but let me please clarify what is different approach in this
>> context. Is it to *also* take into the account MMIO regions of the devices for
>> passthrough for case when IOMMU is off (in addition to unallocated memory)? If
>> yes, I wonder whether we will gain much with that according to that device's
>> MMIO regions are usually not big enough and we stick to allocate extended
>> regions with bigger size (> 64MB).
> That's fair enough. There are a couple of counter examples where the
> MMIO regions for the device to assign are quite large, for instance a
> GPU, Xilinx AIEngine, or the PCIe Root Complex with the entire aperture,
> but maybe they are not that common. I am not sure if it is worth
> scanning the tree for xen,passthrough regions every time at boot for
> this.

ok, I will add a few sentences to commit message about this different 
approach for now. At least this could be implemented later on if there 
is a need.


-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-17 21:56       ` Stefano Stabellini
  2021-09-17 22:37         ` Stefano Stabellini
@ 2021-09-21 19:43         ` Oleksandr
  2021-09-22 18:18           ` Oleksandr
  1 sibling, 1 reply; 43+ messages in thread
From: Oleksandr @ 2021-09-21 19:43 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, xen-devel, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Henry Wang, Bertrand Marquis, Wei Chen


On 18.09.21 00:56, Stefano Stabellini wrote:

Hi Stefano

> On Fri, 17 Sep 2021, Oleksandr wrote:
>>>> +
>>>> +    dt_dprintk("Find unallocated memory for extended regions\n");
>>>> +
>>>> +    unalloc_mem = rangeset_new(NULL, NULL, 0);
>>>> +    if ( !unalloc_mem )
>>>> +        return -ENOMEM;
>>>> +
>>>> +    /* Start with all available RAM */
>>>> +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
>>>> +    {
>>>> +        start = bootinfo.mem.bank[i].start;
>>>> +        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size - 1;
>>>> +        res = rangeset_add_range(unalloc_mem, start, end);
>>>> +        if ( res )
>>>> +        {
>>>> +            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>>>> +                   start, end);
>>>> +            goto out;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    /* Remove RAM assigned to Dom0 */
>>>> +    for ( i = 0; i < assign_mem->nr_banks; i++ )
>>>> +    {
>>>> +        start = assign_mem->bank[i].start;
>>>> +        end = assign_mem->bank[i].start + assign_mem->bank[i].size - 1;
>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>> +        if ( res )
>>>> +        {
>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>> +                   start, end);
>>>> +            goto out;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    /* Remove reserved-memory regions */
>>>> +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
>>>> +    {
>>>> +        start = bootinfo.reserved_mem.bank[i].start;
>>>> +        end = bootinfo.reserved_mem.bank[i].start +
>>>> +            bootinfo.reserved_mem.bank[i].size - 1;
>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>> +        if ( res )
>>>> +        {
>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>> +                   start, end);
>>>> +            goto out;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    /* Remove grant table region */
>>>> +    start = kinfo->gnttab_start;
>>>> +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
>>>> +    res = rangeset_remove_range(unalloc_mem, start, end);
>>>> +    if ( res )
>>>> +    {
>>>> +        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
>>>> +               start, end);
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    start = EXT_REGION_START;
>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>> +    res = rangeset_report_ranges(unalloc_mem, start, end,
>>>> +                                 add_ext_regions, ext_regions);
>>>> +    if ( res )
>>>> +        ext_regions->nr_banks = 0;
>>>> +    else if ( !ext_regions->nr_banks )
>>>> +        res = -ENOENT;
>>>> +
>>>> +out:
>>>> +    rangeset_destroy(unalloc_mem);
>>>> +
>>>> +    return res;
>>>> +}
>>>> +
>>>> +static int __init find_memory_holes(const struct kernel_info *kinfo,
>>>> +                                    struct meminfo *ext_regions)
>>>> +{
>>>> +    struct dt_device_node *np;
>>>> +    struct rangeset *mem_holes;
>>>> +    paddr_t start, end;
>>>> +    unsigned int i;
>>>> +    int res;
>>>> +
>>>> +    dt_dprintk("Find memory holes for extended regions\n");
>>>> +
>>>> +    mem_holes = rangeset_new(NULL, NULL, 0);
>>>> +    if ( !mem_holes )
>>>> +        return -ENOMEM;
>>>> +
>>>> +    /* Start with maximum possible addressable physical memory range */
>>>> +    start = EXT_REGION_START;
>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>> +    res = rangeset_add_range(mem_holes, start, end);
>>>> +    if ( res )
>>>> +    {
>>>> +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>>>> +               start, end);
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    /* Remove all regions described by "reg" property (MMIO, RAM, etc) */
>>> Well... The loop below is not going to handle all the regions described in
>>> the property "reg". Instead, it will cover a subset of "reg" where the
>>> memory is addressable.
>> As I understand, we are only interested in subset of "reg" where the memory is
>> addressable.
>>
>>
>>>
>>> You will also need to cover "ranges" that will describe the BARs for the PCI
>>> devices.
>> Good point.
> Yes, very good point!
>
>
>> Could you please clarify how to recognize whether it is a PCI
>> device as long as PCI support is not merged? Or just to find any device nodes
>> with non-empty "ranges" property
>> and retrieve addresses?
> Normally any bus can have a ranges property with the aperture and
> possible address translations, including /amba (compatible =
> "simple-bus"). However, in these cases dt_device_get_address already
> takes care of it, see xen/common/device_tree.c:dt_device_get_address.
>
> The PCI bus is special for 2 reasons:
> - the ranges property has a different format
> - the bus is hot-pluggable
>
> So I think the only one that we need to treat specially is PCI.
>
> As far as I am aware PCI is the only bus (or maybe just the only bus
> that we support?) where ranges means the aperture.
Thank you for the clarification. I need to find device node with 
non-empty ranges property
(and make sure that device_type property is "pci"), after that I need to 
read the context of ranges property and translate it.


-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-21 18:14                 ` Oleksandr
@ 2021-09-21 22:00                   ` Stefano Stabellini
  2021-09-22 18:25                     ` Oleksandr
  0 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2021-09-21 22:00 UTC (permalink / raw)
  To: Oleksandr
  Cc: Stefano Stabellini, Julien Grall, xen-devel,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Henry Wang,
	Bertrand Marquis, Wei Chen, fnuv

[-- Attachment #1: Type: text/plain, Size: 11158 bytes --]

On Tue, 21 Sep 2021, Oleksandr wrote:
> On 21.09.21 02:21, Stefano Stabellini wrote:
> > On Sun, 19 Sep 2021, Oleksandr wrote:
> > > > On 18/09/2021 03:37, Stefano Stabellini wrote:
> > > > > On Fri, 17 Sep 2021, Stefano Stabellini wrote:
> > > > > > On Fri, 17 Sep 2021, Oleksandr wrote:
> > > > > > > > > +
> > > > > > > > > +    dt_dprintk("Find unallocated memory for extended
> > > > > > > > > regions\n");
> > > > > > > > > +
> > > > > > > > > +    unalloc_mem = rangeset_new(NULL, NULL, 0);
> > > > > > > > > +    if ( !unalloc_mem )
> > > > > > > > > +        return -ENOMEM;
> > > > > > > > > +
> > > > > > > > > +    /* Start with all available RAM */
> > > > > > > > > +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
> > > > > > > > > +    {
> > > > > > > > > +        start = bootinfo.mem.bank[i].start;
> > > > > > > > > +        end = bootinfo.mem.bank[i].start +
> > > > > > > > > bootinfo.mem.bank[i].size - 1;
> > > > > > > > > +        res = rangeset_add_range(unalloc_mem, start, end);
> > > > > > > > > +        if ( res )
> > > > > > > > > +        {
> > > > > > > > > +            printk(XENLOG_ERR "Failed to add:
> > > > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > > > +                   start, end);
> > > > > > > > > +            goto out;
> > > > > > > > > +        }
> > > > > > > > > +    }
> > > > > > > > > +
> > > > > > > > > +    /* Remove RAM assigned to Dom0 */
> > > > > > > > > +    for ( i = 0; i < assign_mem->nr_banks; i++ )
> > > > > > > > > +    {
> > > > > > > > > +        start = assign_mem->bank[i].start;
> > > > > > > > > +        end = assign_mem->bank[i].start +
> > > > > > > > > assign_mem->bank[i].size - 1;
> > > > > > > > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > > > > > +        if ( res )
> > > > > > > > > +        {
> > > > > > > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > > > +                   start, end);
> > > > > > > > > +            goto out;
> > > > > > > > > +        }
> > > > > > > > > +    }
> > > > > > > > > +
> > > > > > > > > +    /* Remove reserved-memory regions */
> > > > > > > > > +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
> > > > > > > > > +    {
> > > > > > > > > +        start = bootinfo.reserved_mem.bank[i].start;
> > > > > > > > > +        end = bootinfo.reserved_mem.bank[i].start +
> > > > > > > > > +            bootinfo.reserved_mem.bank[i].size - 1;
> > > > > > > > > +        res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > > > > > +        if ( res )
> > > > > > > > > +        {
> > > > > > > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > > > +                   start, end);
> > > > > > > > > +            goto out;
> > > > > > > > > +        }
> > > > > > > > > +    }
> > > > > > > > > +
> > > > > > > > > +    /* Remove grant table region */
> > > > > > > > > +    start = kinfo->gnttab_start;
> > > > > > > > > +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
> > > > > > > > > +    res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > > > > > +    if ( res )
> > > > > > > > > +    {
> > > > > > > > > +        printk(XENLOG_ERR "Failed to remove:
> > > > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > > > +               start, end);
> > > > > > > > > +        goto out;
> > > > > > > > > +    }
> > > > > > > > > +
> > > > > > > > > +    start = EXT_REGION_START;
> > > > > > > > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > > > > > > > +    res = rangeset_report_ranges(unalloc_mem, start, end,
> > > > > > > > > +                                 add_ext_regions,
> > > > > > > > > ext_regions);
> > > > > > > > > +    if ( res )
> > > > > > > > > +        ext_regions->nr_banks = 0;
> > > > > > > > > +    else if ( !ext_regions->nr_banks )
> > > > > > > > > +        res = -ENOENT;
> > > > > > > > > +
> > > > > > > > > +out:
> > > > > > > > > +    rangeset_destroy(unalloc_mem);
> > > > > > > > > +
> > > > > > > > > +    return res;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static int __init find_memory_holes(const struct kernel_info
> > > > > > > > > *kinfo,
> > > > > > > > > +                                    struct meminfo
> > > > > > > > > *ext_regions)
> > > > > > > > > +{
> > > > > > > > > +    struct dt_device_node *np;
> > > > > > > > > +    struct rangeset *mem_holes;
> > > > > > > > > +    paddr_t start, end;
> > > > > > > > > +    unsigned int i;
> > > > > > > > > +    int res;
> > > > > > > > > +
> > > > > > > > > +    dt_dprintk("Find memory holes for extended regions\n");
> > > > > > > > > +
> > > > > > > > > +    mem_holes = rangeset_new(NULL, NULL, 0);
> > > > > > > > > +    if ( !mem_holes )
> > > > > > > > > +        return -ENOMEM;
> > > > > > > > > +
> > > > > > > > > +    /* Start with maximum possible addressable physical
> > > > > > > > > memory
> > > > > > > > > range */
> > > > > > > > > +    start = EXT_REGION_START;
> > > > > > > > > +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > > > > > > > > +    res = rangeset_add_range(mem_holes, start, end);
> > > > > > > > > +    if ( res )
> > > > > > > > > +    {
> > > > > > > > > +        printk(XENLOG_ERR "Failed to add:
> > > > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > > > +               start, end);
> > > > > > > > > +        goto out;
> > > > > > > > > +    }
> > > > > > > > > +
> > > > > > > > > +    /* Remove all regions described by "reg" property (MMIO,
> > > > > > > > > RAM,
> > > > > > > > > etc) */
> > > > > > > > Well... The loop below is not going to handle all the regions
> > > > > > > > described in
> > > > > > > > the property "reg". Instead, it will cover a subset of "reg"
> > > > > > > > where
> > > > > > > > the
> > > > > > > > memory is addressable.
> > > > > > > As I understand, we are only interested in subset of "reg" where
> > > > > > > the
> > > > > > > memory is
> > > > > > > addressable.
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > You will also need to cover "ranges" that will describe the BARs
> > > > > > > > for
> > > > > > > > the PCI
> > > > > > > > devices.
> > > > > > > Good point.
> > > > > > Yes, very good point!
> > > > > > 
> > > > > > 
> > > > > > > Could you please clarify how to recognize whether it is a PCI
> > > > > > > device as long as PCI support is not merged? Or just to find any
> > > > > > > device nodes
> > > > > > > with non-empty "ranges" property
> > > > > > > and retrieve addresses?
> > > > > > Normally any bus can have a ranges property with the aperture and
> > > > > > possible address translations, including /amba (compatible =
> > > > > > "simple-bus"). However, in these cases dt_device_get_address already
> > > > > > takes care of it, see
> > > > > > xen/common/device_tree.c:dt_device_get_address.
> > > > > > 
> > > > > > The PCI bus is special for 2 reasons:
> > > > > > - the ranges property has a different format
> > > > > > - the bus is hot-pluggable
> > > > > > 
> > > > > > So I think the only one that we need to treat specially is PCI.
> > > > > > 
> > > > > > As far as I am aware PCI is the only bus (or maybe just the only bus
> > > > > > that we support?) where ranges means the aperture.
> > > > > Now that I think about this, there is another "hotpluggable" scenario
> > > > > we
> > > > > need to think about:
> > > > > 
> > > > > [1] https://marc.info/?l=xen-devel&m=163056546214978
> > > > > 
> > > > > Xilinx devices have FPGA regions with apertures currently not
> > > > > described
> > > > > in device tree, where things can programmed in PL at runtime making
> > > > > new
> > > > > devices appear with new MMIO regions out of thin air.
> > > > > 
> > > > > Now let me start by saying that yes, the entire programmable region
> > > > > aperture could probably be described in device tree, however, in
> > > > > reality it is not currently done in any of the device trees we use
> > > > > (including the upstream device trees in linux.git).
> > > > This is rather annoying, but not unheard. There are a couple of
> > > > platforms
> > > > where the MMIOs are not fully described in the DT.
> > > > 
> > > > In fact, we have a callback 'specific_mappings' which create additional
> > > > mappings (e.g. on the omap5) for dom0.
> > > > 
> > > > > So, we have a problem :-(
> > > > > 
> > > > > 
> > > > > I can work toward getting the right info on device tree, but in
> > > > > reality
> > > > > that is going to take time and for now the device tree doesn't have
> > > > > the
> > > > > FPGA aperture in it. So if we accept this series as is, it is going to
> > > > > stop features like [1] from working. >
> > > > > If we cannot come up with any better plans, I think it would be better
> > > > > to drop find_memory_holes, only rely on find_unallocated_memory even
> > > > > when the IOMMU is on. One idea is that we could add on top of the
> > > > > regions found by find_unallocated_memory any MMIO regions marked as
> > > > > xen,passthrough: they are safe because they are not going to dom0
> > > > > anyway.
> > > > (Oleksandr, it looks like some rationale about the different approach is
> > > > missing in the commit message. Can you add it?)
> > > Yes sure, but let me please clarify what is different approach in this
> > > context. Is it to *also* take into the account MMIO regions of the devices
> > > for
> > > passthrough for case when IOMMU is off (in addition to unallocated
> > > memory)? If
> > > yes, I wonder whether we will gain much with that according to that
> > > device's
> > > MMIO regions are usually not big enough and we stick to allocate extended
> > > regions with bigger size (> 64MB).
> > That's fair enough. There are a couple of counter examples where the
> > MMIO regions for the device to assign are quite large, for instance a
> > GPU, Xilinx AIEngine, or the PCIe Root Complex with the entire aperture,
> > but maybe they are not that common. I am not sure if it is worth
> > scanning the tree for xen,passthrough regions every time at boot for
> > this.
> 
> ok, I will add a few sentences to commit message about this different approach
> for now. At least this could be implemented later on if there is a need.

One thing that worries me about this is that if we take an old Xen with
this series and run it on a new board, it might cause problems. At the
very least [1] wouldn't work.

Can we have a Xen command line argument to disable extended regions as
an emergecy toggle?


[1] https://marc.info/?l=xen-devel&m=163056546214978

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-21 19:43         ` Oleksandr
@ 2021-09-22 18:18           ` Oleksandr
  2021-09-22 21:05             ` Stefano Stabellini
  0 siblings, 1 reply; 43+ messages in thread
From: Oleksandr @ 2021-09-22 18:18 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, xen-devel, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Henry Wang, Bertrand Marquis, Wei Chen



Hi Stefano


[snip]


>>>
>>>
>>>>
>>>> You will also need to cover "ranges" that will describe the BARs 
>>>> for the PCI
>>>> devices.
>>> Good point.
>> Yes, very good point!
>>
>>
>>> Could you please clarify how to recognize whether it is a PCI
>>> device as long as PCI support is not merged? Or just to find any 
>>> device nodes
>>> with non-empty "ranges" property
>>> and retrieve addresses?
>> Normally any bus can have a ranges property with the aperture and
>> possible address translations, including /amba (compatible =
>> "simple-bus"). However, in these cases dt_device_get_address already
>> takes care of it, see xen/common/device_tree.c:dt_device_get_address.
>>
>> The PCI bus is special for 2 reasons:
>> - the ranges property has a different format
>> - the bus is hot-pluggable
>>
>> So I think the only one that we need to treat specially is PCI.
>>
>> As far as I am aware PCI is the only bus (or maybe just the only bus
>> that we support?) where ranges means the aperture.
> Thank you for the clarification. I need to find device node with 
> non-empty ranges property
> (and make sure that device_type property is "pci"), after that I need 
> to read the context of ranges property and translate it.
>
>

OK, I experimented with that and managed to parse ranges property for 
PCI host bridge node.

I tested on my setup where the host device tree contains two PCI host 
bridge nodes with the following:

pcie@fe000000 {
...
             ranges = <0x1000000 0x0 0x0 0x0 0xfe100000 0x0 0x100000 
0x2000000 0x0 0xfe200000 0x0 0xfe200000 0x0 0x200000 0x2000000 0x0 
0x30000000 0x0 0x30000000 0x0 0x8000000 0x42000000 0x0 0x38000000 0x0 
0x38000000 0x0 0x8000000>;
...
};

pcie@ee800000 {
...
             ranges = <0x1000000 0x0 0x0 0x0 0xee900000 0x0 0x100000 
0x2000000 0x0 0xeea00000 0x0 0xeea00000 0x0 0x200000 0x2000000 0x0 
0xc0000000 0x0 0xc0000000 0x0 0x8000000 0x42000000 0x0 0xc8000000 0x0 
0xc8000000 0x0 0x8000000>;
...
};

So Xen retrieves the *CPU addresses* from the ranges:

(XEN) dev /soc/pcie@fe000000 range_size 7 nr_ranges 4
(XEN) 0: addr=fe100000, size=100000
(XEN) 1: addr=fe200000, size=200000
(XEN) 2: addr=30000000, size=8000000
(XEN) 3: addr=38000000, size=8000000
(XEN) dev /soc/pcie@ee800000 range_size 7 nr_ranges 4
(XEN) 0: addr=ee900000, size=100000
(XEN) 1: addr=eea00000, size=200000
(XEN) 2: addr=c0000000, size=8000000
(XEN) 3: addr=c8000000, size=8000000

The code below covers ranges property in the context of finding memory 
holes (to be squashed with current patch):

diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index d37156a..7d20c10 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -834,6 +834,8 @@ static int __init find_memory_holes(struct meminfo 
*ext_regions)
      {
          unsigned int naddr;
          u64 addr, size;
+        const __be32 *ranges;
+        u32 len;

          naddr = dt_number_of_address(np);

@@ -857,6 +859,41 @@ static int __init find_memory_holes(struct meminfo 
*ext_regions)
                  goto out;
              }
          }
+
+        /*
+         * Also looking for non-empty ranges property which would 
likely mean
+         * that we deal with PCI host bridge device and the property here
+         * describes the BARs for the PCI devices.
+         */
+        ranges = dt_get_property(np, "ranges", &len);
+        if ( ranges && len )
+        {
+            unsigned int range_size, nr_ranges;
+            int na, ns, pna;
+
+            pna = dt_n_addr_cells(np);
+            na = dt_child_n_addr_cells(np);
+            ns = dt_child_n_size_cells(np);
+            range_size = pna + na + ns;
+            nr_ranges = len / sizeof(__be32) / range_size;
+
+            for ( i = 0; i < nr_ranges; i++, ranges += range_size )
+            {
+                /* Skip the child address and get the parent (CPU) 
address */
+                addr = dt_read_number(ranges + na, pna);
+                size = dt_read_number(ranges + na + pna, ns);
+
+                start = addr & PAGE_MASK;
+                end = PAGE_ALIGN(addr + size);
+                res = rangeset_remove_range(mem_holes, start, end - 1);
+                if ( res )
+                {
+                    printk(XENLOG_ERR "Failed to remove: 
%#"PRIx64"->%#"PRIx64"\n",
+                           start, end);
+                    goto out;
+                }
+            }
+        }
      }

      start = 0;



-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-21 22:00                   ` Stefano Stabellini
@ 2021-09-22 18:25                     ` Oleksandr
  2021-09-22 20:50                       ` Stefano Stabellini
  0 siblings, 1 reply; 43+ messages in thread
From: Oleksandr @ 2021-09-22 18:25 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, xen-devel, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Henry Wang, Bertrand Marquis, Wei Chen, fnuv


On 22.09.21 01:00, Stefano Stabellini wrote:

Hi Stefano

> On Tue, 21 Sep 2021, Oleksandr wrote:
>> On 21.09.21 02:21, Stefano Stabellini wrote:
>>> On Sun, 19 Sep 2021, Oleksandr wrote:
>>>>> On 18/09/2021 03:37, Stefano Stabellini wrote:
>>>>>> On Fri, 17 Sep 2021, Stefano Stabellini wrote:
>>>>>>> On Fri, 17 Sep 2021, Oleksandr wrote:
>>>>>>>>>> +
>>>>>>>>>> +    dt_dprintk("Find unallocated memory for extended
>>>>>>>>>> regions\n");
>>>>>>>>>> +
>>>>>>>>>> +    unalloc_mem = rangeset_new(NULL, NULL, 0);
>>>>>>>>>> +    if ( !unalloc_mem )
>>>>>>>>>> +        return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> +    /* Start with all available RAM */
>>>>>>>>>> +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
>>>>>>>>>> +    {
>>>>>>>>>> +        start = bootinfo.mem.bank[i].start;
>>>>>>>>>> +        end = bootinfo.mem.bank[i].start +
>>>>>>>>>> bootinfo.mem.bank[i].size - 1;
>>>>>>>>>> +        res = rangeset_add_range(unalloc_mem, start, end);
>>>>>>>>>> +        if ( res )
>>>>>>>>>> +        {
>>>>>>>>>> +            printk(XENLOG_ERR "Failed to add:
>>>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>>>> +                   start, end);
>>>>>>>>>> +            goto out;
>>>>>>>>>> +        }
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    /* Remove RAM assigned to Dom0 */
>>>>>>>>>> +    for ( i = 0; i < assign_mem->nr_banks; i++ )
>>>>>>>>>> +    {
>>>>>>>>>> +        start = assign_mem->bank[i].start;
>>>>>>>>>> +        end = assign_mem->bank[i].start +
>>>>>>>>>> assign_mem->bank[i].size - 1;
>>>>>>>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>>>>>>>> +        if ( res )
>>>>>>>>>> +        {
>>>>>>>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>>>> +                   start, end);
>>>>>>>>>> +            goto out;
>>>>>>>>>> +        }
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    /* Remove reserved-memory regions */
>>>>>>>>>> +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
>>>>>>>>>> +    {
>>>>>>>>>> +        start = bootinfo.reserved_mem.bank[i].start;
>>>>>>>>>> +        end = bootinfo.reserved_mem.bank[i].start +
>>>>>>>>>> +            bootinfo.reserved_mem.bank[i].size - 1;
>>>>>>>>>> +        res = rangeset_remove_range(unalloc_mem, start, end);
>>>>>>>>>> +        if ( res )
>>>>>>>>>> +        {
>>>>>>>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>>>> +                   start, end);
>>>>>>>>>> +            goto out;
>>>>>>>>>> +        }
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    /* Remove grant table region */
>>>>>>>>>> +    start = kinfo->gnttab_start;
>>>>>>>>>> +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
>>>>>>>>>> +    res = rangeset_remove_range(unalloc_mem, start, end);
>>>>>>>>>> +    if ( res )
>>>>>>>>>> +    {
>>>>>>>>>> +        printk(XENLOG_ERR "Failed to remove:
>>>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>>>> +               start, end);
>>>>>>>>>> +        goto out;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    start = EXT_REGION_START;
>>>>>>>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>>>>>>>> +    res = rangeset_report_ranges(unalloc_mem, start, end,
>>>>>>>>>> +                                 add_ext_regions,
>>>>>>>>>> ext_regions);
>>>>>>>>>> +    if ( res )
>>>>>>>>>> +        ext_regions->nr_banks = 0;
>>>>>>>>>> +    else if ( !ext_regions->nr_banks )
>>>>>>>>>> +        res = -ENOENT;
>>>>>>>>>> +
>>>>>>>>>> +out:
>>>>>>>>>> +    rangeset_destroy(unalloc_mem);
>>>>>>>>>> +
>>>>>>>>>> +    return res;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static int __init find_memory_holes(const struct kernel_info
>>>>>>>>>> *kinfo,
>>>>>>>>>> +                                    struct meminfo
>>>>>>>>>> *ext_regions)
>>>>>>>>>> +{
>>>>>>>>>> +    struct dt_device_node *np;
>>>>>>>>>> +    struct rangeset *mem_holes;
>>>>>>>>>> +    paddr_t start, end;
>>>>>>>>>> +    unsigned int i;
>>>>>>>>>> +    int res;
>>>>>>>>>> +
>>>>>>>>>> +    dt_dprintk("Find memory holes for extended regions\n");
>>>>>>>>>> +
>>>>>>>>>> +    mem_holes = rangeset_new(NULL, NULL, 0);
>>>>>>>>>> +    if ( !mem_holes )
>>>>>>>>>> +        return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> +    /* Start with maximum possible addressable physical
>>>>>>>>>> memory
>>>>>>>>>> range */
>>>>>>>>>> +    start = EXT_REGION_START;
>>>>>>>>>> +    end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>>>>>>>>> +    res = rangeset_add_range(mem_holes, start, end);
>>>>>>>>>> +    if ( res )
>>>>>>>>>> +    {
>>>>>>>>>> +        printk(XENLOG_ERR "Failed to add:
>>>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>>>> +               start, end);
>>>>>>>>>> +        goto out;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    /* Remove all regions described by "reg" property (MMIO,
>>>>>>>>>> RAM,
>>>>>>>>>> etc) */
>>>>>>>>> Well... The loop below is not going to handle all the regions
>>>>>>>>> described in
>>>>>>>>> the property "reg". Instead, it will cover a subset of "reg"
>>>>>>>>> where
>>>>>>>>> the
>>>>>>>>> memory is addressable.
>>>>>>>> As I understand, we are only interested in subset of "reg" where
>>>>>>>> the
>>>>>>>> memory is
>>>>>>>> addressable.
>>>>>>>>
>>>>>>>>
>>>>>>>>> You will also need to cover "ranges" that will describe the BARs
>>>>>>>>> for
>>>>>>>>> the PCI
>>>>>>>>> devices.
>>>>>>>> Good point.
>>>>>>> Yes, very good point!
>>>>>>>
>>>>>>>
>>>>>>>> Could you please clarify how to recognize whether it is a PCI
>>>>>>>> device as long as PCI support is not merged? Or just to find any
>>>>>>>> device nodes
>>>>>>>> with non-empty "ranges" property
>>>>>>>> and retrieve addresses?
>>>>>>> Normally any bus can have a ranges property with the aperture and
>>>>>>> possible address translations, including /amba (compatible =
>>>>>>> "simple-bus"). However, in these cases dt_device_get_address already
>>>>>>> takes care of it, see
>>>>>>> xen/common/device_tree.c:dt_device_get_address.
>>>>>>>
>>>>>>> The PCI bus is special for 2 reasons:
>>>>>>> - the ranges property has a different format
>>>>>>> - the bus is hot-pluggable
>>>>>>>
>>>>>>> So I think the only one that we need to treat specially is PCI.
>>>>>>>
>>>>>>> As far as I am aware PCI is the only bus (or maybe just the only bus
>>>>>>> that we support?) where ranges means the aperture.
>>>>>> Now that I think about this, there is another "hotpluggable" scenario
>>>>>> we
>>>>>> need to think about:
>>>>>>
>>>>>> [1] https://marc.info/?l=xen-devel&m=163056546214978
>>>>>>
>>>>>> Xilinx devices have FPGA regions with apertures currently not
>>>>>> described
>>>>>> in device tree, where things can programmed in PL at runtime making
>>>>>> new
>>>>>> devices appear with new MMIO regions out of thin air.
>>>>>>
>>>>>> Now let me start by saying that yes, the entire programmable region
>>>>>> aperture could probably be described in device tree, however, in
>>>>>> reality it is not currently done in any of the device trees we use
>>>>>> (including the upstream device trees in linux.git).
>>>>> This is rather annoying, but not unheard. There are a couple of
>>>>> platforms
>>>>> where the MMIOs are not fully described in the DT.
>>>>>
>>>>> In fact, we have a callback 'specific_mappings' which create additional
>>>>> mappings (e.g. on the omap5) for dom0.
>>>>>
>>>>>> So, we have a problem :-(
>>>>>>
>>>>>>
>>>>>> I can work toward getting the right info on device tree, but in
>>>>>> reality
>>>>>> that is going to take time and for now the device tree doesn't have
>>>>>> the
>>>>>> FPGA aperture in it. So if we accept this series as is, it is going to
>>>>>> stop features like [1] from working. >
>>>>>> If we cannot come up with any better plans, I think it would be better
>>>>>> to drop find_memory_holes, only rely on find_unallocated_memory even
>>>>>> when the IOMMU is on. One idea is that we could add on top of the
>>>>>> regions found by find_unallocated_memory any MMIO regions marked as
>>>>>> xen,passthrough: they are safe because they are not going to dom0
>>>>>> anyway.
>>>>> (Oleksandr, it looks like some rationale about the different approach is
>>>>> missing in the commit message. Can you add it?)
>>>> Yes sure, but let me please clarify what is different approach in this
>>>> context. Is it to *also* take into the account MMIO regions of the devices
>>>> for
>>>> passthrough for case when IOMMU is off (in addition to unallocated
>>>> memory)? If
>>>> yes, I wonder whether we will gain much with that according to that
>>>> device's
>>>> MMIO regions are usually not big enough and we stick to allocate extended
>>>> regions with bigger size (> 64MB).
>>> That's fair enough. There are a couple of counter examples where the
>>> MMIO regions for the device to assign are quite large, for instance a
>>> GPU, Xilinx AIEngine, or the PCIe Root Complex with the entire aperture,
>>> but maybe they are not that common. I am not sure if it is worth
>>> scanning the tree for xen,passthrough regions every time at boot for
>>> this.
>> ok, I will add a few sentences to commit message about this different approach
>> for now. At least this could be implemented later on if there is a need.
> One thing that worries me about this is that if we take an old Xen with
> this series and run it on a new board, it might cause problems. At the
> very least [1] wouldn't work.

I got it.


>
> Can we have a Xen command line argument to disable extended regions as
> an emergecy toggle?

I think, yes. If no preference for the argument name I will name it 
"no-ext-region".

>
>
> [1] https://marc.info/?l=xen-devel&m=163056546214978

-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-22 18:25                     ` Oleksandr
@ 2021-09-22 20:50                       ` Stefano Stabellini
  2021-09-23 10:10                         ` Oleksandr
  0 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2021-09-22 20:50 UTC (permalink / raw)
  To: Oleksandr
  Cc: Stefano Stabellini, Julien Grall, xen-devel,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Henry Wang,
	Bertrand Marquis, Wei Chen, fnuv

[-- Attachment #1: Type: text/plain, Size: 12828 bytes --]

On Wed, 22 Sep 2021, Oleksandr wrote:
> On 22.09.21 01:00, Stefano Stabellini wrote:
> > On Tue, 21 Sep 2021, Oleksandr wrote:
> > > On 21.09.21 02:21, Stefano Stabellini wrote:
> > > > On Sun, 19 Sep 2021, Oleksandr wrote:
> > > > > > On 18/09/2021 03:37, Stefano Stabellini wrote:
> > > > > > > On Fri, 17 Sep 2021, Stefano Stabellini wrote:
> > > > > > > > On Fri, 17 Sep 2021, Oleksandr wrote:
> > > > > > > > > > > +
> > > > > > > > > > > +    dt_dprintk("Find unallocated memory for extended
> > > > > > > > > > > regions\n");
> > > > > > > > > > > +
> > > > > > > > > > > +    unalloc_mem = rangeset_new(NULL, NULL, 0);
> > > > > > > > > > > +    if ( !unalloc_mem )
> > > > > > > > > > > +        return -ENOMEM;
> > > > > > > > > > > +
> > > > > > > > > > > +    /* Start with all available RAM */
> > > > > > > > > > > +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
> > > > > > > > > > > +    {
> > > > > > > > > > > +        start = bootinfo.mem.bank[i].start;
> > > > > > > > > > > +        end = bootinfo.mem.bank[i].start +
> > > > > > > > > > > bootinfo.mem.bank[i].size - 1;
> > > > > > > > > > > +        res = rangeset_add_range(unalloc_mem, start,
> > > > > > > > > > > end);
> > > > > > > > > > > +        if ( res )
> > > > > > > > > > > +        {
> > > > > > > > > > > +            printk(XENLOG_ERR "Failed to add:
> > > > > > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > > > > > +                   start, end);
> > > > > > > > > > > +            goto out;
> > > > > > > > > > > +        }
> > > > > > > > > > > +    }
> > > > > > > > > > > +
> > > > > > > > > > > +    /* Remove RAM assigned to Dom0 */
> > > > > > > > > > > +    for ( i = 0; i < assign_mem->nr_banks; i++ )
> > > > > > > > > > > +    {
> > > > > > > > > > > +        start = assign_mem->bank[i].start;
> > > > > > > > > > > +        end = assign_mem->bank[i].start +
> > > > > > > > > > > assign_mem->bank[i].size - 1;
> > > > > > > > > > > +        res = rangeset_remove_range(unalloc_mem, start,
> > > > > > > > > > > end);
> > > > > > > > > > > +        if ( res )
> > > > > > > > > > > +        {
> > > > > > > > > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > > > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > > > > > +                   start, end);
> > > > > > > > > > > +            goto out;
> > > > > > > > > > > +        }
> > > > > > > > > > > +    }
> > > > > > > > > > > +
> > > > > > > > > > > +    /* Remove reserved-memory regions */
> > > > > > > > > > > +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++
> > > > > > > > > > > )
> > > > > > > > > > > +    {
> > > > > > > > > > > +        start = bootinfo.reserved_mem.bank[i].start;
> > > > > > > > > > > +        end = bootinfo.reserved_mem.bank[i].start +
> > > > > > > > > > > +            bootinfo.reserved_mem.bank[i].size - 1;
> > > > > > > > > > > +        res = rangeset_remove_range(unalloc_mem, start,
> > > > > > > > > > > end);
> > > > > > > > > > > +        if ( res )
> > > > > > > > > > > +        {
> > > > > > > > > > > +            printk(XENLOG_ERR "Failed to remove:
> > > > > > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > > > > > +                   start, end);
> > > > > > > > > > > +            goto out;
> > > > > > > > > > > +        }
> > > > > > > > > > > +    }
> > > > > > > > > > > +
> > > > > > > > > > > +    /* Remove grant table region */
> > > > > > > > > > > +    start = kinfo->gnttab_start;
> > > > > > > > > > > +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
> > > > > > > > > > > +    res = rangeset_remove_range(unalloc_mem, start, end);
> > > > > > > > > > > +    if ( res )
> > > > > > > > > > > +    {
> > > > > > > > > > > +        printk(XENLOG_ERR "Failed to remove:
> > > > > > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > > > > > +               start, end);
> > > > > > > > > > > +        goto out;
> > > > > > > > > > > +    }
> > > > > > > > > > > +
> > > > > > > > > > > +    start = EXT_REGION_START;
> > > > > > > > > > > +    end = min((1ULL << p2m_ipa_bits) - 1,
> > > > > > > > > > > EXT_REGION_END);
> > > > > > > > > > > +    res = rangeset_report_ranges(unalloc_mem, start, end,
> > > > > > > > > > > +                                 add_ext_regions,
> > > > > > > > > > > ext_regions);
> > > > > > > > > > > +    if ( res )
> > > > > > > > > > > +        ext_regions->nr_banks = 0;
> > > > > > > > > > > +    else if ( !ext_regions->nr_banks )
> > > > > > > > > > > +        res = -ENOENT;
> > > > > > > > > > > +
> > > > > > > > > > > +out:
> > > > > > > > > > > +    rangeset_destroy(unalloc_mem);
> > > > > > > > > > > +
> > > > > > > > > > > +    return res;
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +static int __init find_memory_holes(const struct
> > > > > > > > > > > kernel_info
> > > > > > > > > > > *kinfo,
> > > > > > > > > > > +                                    struct meminfo
> > > > > > > > > > > *ext_regions)
> > > > > > > > > > > +{
> > > > > > > > > > > +    struct dt_device_node *np;
> > > > > > > > > > > +    struct rangeset *mem_holes;
> > > > > > > > > > > +    paddr_t start, end;
> > > > > > > > > > > +    unsigned int i;
> > > > > > > > > > > +    int res;
> > > > > > > > > > > +
> > > > > > > > > > > +    dt_dprintk("Find memory holes for extended
> > > > > > > > > > > regions\n");
> > > > > > > > > > > +
> > > > > > > > > > > +    mem_holes = rangeset_new(NULL, NULL, 0);
> > > > > > > > > > > +    if ( !mem_holes )
> > > > > > > > > > > +        return -ENOMEM;
> > > > > > > > > > > +
> > > > > > > > > > > +    /* Start with maximum possible addressable physical
> > > > > > > > > > > memory
> > > > > > > > > > > range */
> > > > > > > > > > > +    start = EXT_REGION_START;
> > > > > > > > > > > +    end = min((1ULL << p2m_ipa_bits) - 1,
> > > > > > > > > > > EXT_REGION_END);
> > > > > > > > > > > +    res = rangeset_add_range(mem_holes, start, end);
> > > > > > > > > > > +    if ( res )
> > > > > > > > > > > +    {
> > > > > > > > > > > +        printk(XENLOG_ERR "Failed to add:
> > > > > > > > > > > %#"PRIx64"->%#"PRIx64"\n",
> > > > > > > > > > > +               start, end);
> > > > > > > > > > > +        goto out;
> > > > > > > > > > > +    }
> > > > > > > > > > > +
> > > > > > > > > > > +    /* Remove all regions described by "reg" property
> > > > > > > > > > > (MMIO,
> > > > > > > > > > > RAM,
> > > > > > > > > > > etc) */
> > > > > > > > > > Well... The loop below is not going to handle all the
> > > > > > > > > > regions
> > > > > > > > > > described in
> > > > > > > > > > the property "reg". Instead, it will cover a subset of "reg"
> > > > > > > > > > where
> > > > > > > > > > the
> > > > > > > > > > memory is addressable.
> > > > > > > > > As I understand, we are only interested in subset of "reg"
> > > > > > > > > where
> > > > > > > > > the
> > > > > > > > > memory is
> > > > > > > > > addressable.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > You will also need to cover "ranges" that will describe the
> > > > > > > > > > BARs
> > > > > > > > > > for
> > > > > > > > > > the PCI
> > > > > > > > > > devices.
> > > > > > > > > Good point.
> > > > > > > > Yes, very good point!
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > Could you please clarify how to recognize whether it is a PCI
> > > > > > > > > device as long as PCI support is not merged? Or just to find
> > > > > > > > > any
> > > > > > > > > device nodes
> > > > > > > > > with non-empty "ranges" property
> > > > > > > > > and retrieve addresses?
> > > > > > > > Normally any bus can have a ranges property with the aperture
> > > > > > > > and
> > > > > > > > possible address translations, including /amba (compatible =
> > > > > > > > "simple-bus"). However, in these cases dt_device_get_address
> > > > > > > > already
> > > > > > > > takes care of it, see
> > > > > > > > xen/common/device_tree.c:dt_device_get_address.
> > > > > > > > 
> > > > > > > > The PCI bus is special for 2 reasons:
> > > > > > > > - the ranges property has a different format
> > > > > > > > - the bus is hot-pluggable
> > > > > > > > 
> > > > > > > > So I think the only one that we need to treat specially is PCI.
> > > > > > > > 
> > > > > > > > As far as I am aware PCI is the only bus (or maybe just the only
> > > > > > > > bus
> > > > > > > > that we support?) where ranges means the aperture.
> > > > > > > Now that I think about this, there is another "hotpluggable"
> > > > > > > scenario
> > > > > > > we
> > > > > > > need to think about:
> > > > > > > 
> > > > > > > [1] https://marc.info/?l=xen-devel&m=163056546214978
> > > > > > > 
> > > > > > > Xilinx devices have FPGA regions with apertures currently not
> > > > > > > described
> > > > > > > in device tree, where things can programmed in PL at runtime
> > > > > > > making
> > > > > > > new
> > > > > > > devices appear with new MMIO regions out of thin air.
> > > > > > > 
> > > > > > > Now let me start by saying that yes, the entire programmable
> > > > > > > region
> > > > > > > aperture could probably be described in device tree, however, in
> > > > > > > reality it is not currently done in any of the device trees we use
> > > > > > > (including the upstream device trees in linux.git).
> > > > > > This is rather annoying, but not unheard. There are a couple of
> > > > > > platforms
> > > > > > where the MMIOs are not fully described in the DT.
> > > > > > 
> > > > > > In fact, we have a callback 'specific_mappings' which create
> > > > > > additional
> > > > > > mappings (e.g. on the omap5) for dom0.
> > > > > > 
> > > > > > > So, we have a problem :-(
> > > > > > > 
> > > > > > > 
> > > > > > > I can work toward getting the right info on device tree, but in
> > > > > > > reality
> > > > > > > that is going to take time and for now the device tree doesn't
> > > > > > > have
> > > > > > > the
> > > > > > > FPGA aperture in it. So if we accept this series as is, it is
> > > > > > > going to
> > > > > > > stop features like [1] from working. >
> > > > > > > If we cannot come up with any better plans, I think it would be
> > > > > > > better
> > > > > > > to drop find_memory_holes, only rely on find_unallocated_memory
> > > > > > > even
> > > > > > > when the IOMMU is on. One idea is that we could add on top of the
> > > > > > > regions found by find_unallocated_memory any MMIO regions marked
> > > > > > > as
> > > > > > > xen,passthrough: they are safe because they are not going to dom0
> > > > > > > anyway.
> > > > > > (Oleksandr, it looks like some rationale about the different
> > > > > > approach is
> > > > > > missing in the commit message. Can you add it?)
> > > > > Yes sure, but let me please clarify what is different approach in this
> > > > > context. Is it to *also* take into the account MMIO regions of the
> > > > > devices
> > > > > for
> > > > > passthrough for case when IOMMU is off (in addition to unallocated
> > > > > memory)? If
> > > > > yes, I wonder whether we will gain much with that according to that
> > > > > device's
> > > > > MMIO regions are usually not big enough and we stick to allocate
> > > > > extended
> > > > > regions with bigger size (> 64MB).
> > > > That's fair enough. There are a couple of counter examples where the
> > > > MMIO regions for the device to assign are quite large, for instance a
> > > > GPU, Xilinx AIEngine, or the PCIe Root Complex with the entire aperture,
> > > > but maybe they are not that common. I am not sure if it is worth
> > > > scanning the tree for xen,passthrough regions every time at boot for
> > > > this.
> > > ok, I will add a few sentences to commit message about this different
> > > approach
> > > for now. At least this could be implemented later on if there is a need.
> > One thing that worries me about this is that if we take an old Xen with
> > this series and run it on a new board, it might cause problems. At the
> > very least [1] wouldn't work.
> 
> I got it.
> 
> 
> > 
> > Can we have a Xen command line argument to disable extended regions as
> > an emergecy toggle?
> 
> I think, yes. If no preference for the argument name I will name it
> "no-ext-region".

It is better to introduce it as ext-regions=yes/no with yes as default.
So that in the future we could extending it to ext-regions=start,size if
we wanted to.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-22 18:18           ` Oleksandr
@ 2021-09-22 21:05             ` Stefano Stabellini
  2021-09-23 10:11               ` Oleksandr
  0 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2021-09-22 21:05 UTC (permalink / raw)
  To: Oleksandr
  Cc: Stefano Stabellini, Julien Grall, xen-devel,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Henry Wang,
	Bertrand Marquis, Wei Chen

[-- Attachment #1: Type: text/plain, Size: 5753 bytes --]

On Wed, 22 Sep 2021, Oleksandr wrote:
> > > > > You will also need to cover "ranges" that will describe the BARs for
> > > > > the PCI
> > > > > devices.
> > > > Good point.
> > > Yes, very good point!
> > > 
> > > 
> > > > Could you please clarify how to recognize whether it is a PCI
> > > > device as long as PCI support is not merged? Or just to find any device
> > > > nodes
> > > > with non-empty "ranges" property
> > > > and retrieve addresses?
> > > Normally any bus can have a ranges property with the aperture and
> > > possible address translations, including /amba (compatible =
> > > "simple-bus"). However, in these cases dt_device_get_address already
> > > takes care of it, see xen/common/device_tree.c:dt_device_get_address.
> > > 
> > > The PCI bus is special for 2 reasons:
> > > - the ranges property has a different format
> > > - the bus is hot-pluggable
> > > 
> > > So I think the only one that we need to treat specially is PCI.
> > > 
> > > As far as I am aware PCI is the only bus (or maybe just the only bus
> > > that we support?) where ranges means the aperture.
> > Thank you for the clarification. I need to find device node with non-empty
> > ranges property
> > (and make sure that device_type property is "pci"), after that I need to
> > read the context of ranges property and translate it.
> > 
> > 
> 
> OK, I experimented with that and managed to parse ranges property for PCI host
> bridge node.
> 
> I tested on my setup where the host device tree contains two PCI host bridge
> nodes with the following:
> 
> pcie@fe000000 {
> ...
>             ranges = <0x1000000 0x0 0x0 0x0 0xfe100000 0x0 0x100000 0x2000000
> 0x0 0xfe200000 0x0 0xfe200000 0x0 0x200000 0x2000000 0x0 0x30000000 0x0
> 0x30000000 0x0 0x8000000 0x42000000 0x0 0x38000000 0x0 0x38000000 0x0
> 0x8000000>;
> ...
> };
> 
> pcie@ee800000 {
> ...
>             ranges = <0x1000000 0x0 0x0 0x0 0xee900000 0x0 0x100000 0x2000000
> 0x0 0xeea00000 0x0 0xeea00000 0x0 0x200000 0x2000000 0x0 0xc0000000 0x0
> 0xc0000000 0x0 0x8000000 0x42000000 0x0 0xc8000000 0x0 0xc8000000 0x0
> 0x8000000>;
> ...
> };
> 
> So Xen retrieves the *CPU addresses* from the ranges:
> 
> (XEN) dev /soc/pcie@fe000000 range_size 7 nr_ranges 4
> (XEN) 0: addr=fe100000, size=100000
> (XEN) 1: addr=fe200000, size=200000
> (XEN) 2: addr=30000000, size=8000000
> (XEN) 3: addr=38000000, size=8000000
> (XEN) dev /soc/pcie@ee800000 range_size 7 nr_ranges 4
> (XEN) 0: addr=ee900000, size=100000
> (XEN) 1: addr=eea00000, size=200000
> (XEN) 2: addr=c0000000, size=8000000
> (XEN) 3: addr=c8000000, size=8000000
> 
> The code below covers ranges property in the context of finding memory holes
> (to be squashed with current patch):
> 
> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> index d37156a..7d20c10 100644
> --- a/xen/arch/arm/domain_build.c
> +++ b/xen/arch/arm/domain_build.c
> @@ -834,6 +834,8 @@ static int __init find_memory_holes(struct meminfo
> *ext_regions)
>      {
>          unsigned int naddr;
>          u64 addr, size;
> +        const __be32 *ranges;
> +        u32 len;
> 
>          naddr = dt_number_of_address(np);
> 
> @@ -857,6 +859,41 @@ static int __init find_memory_holes(struct meminfo
> *ext_regions)
>                  goto out;
>              }
>          }
> +
> +        /*
> +         * Also looking for non-empty ranges property which would likely mean
> +         * that we deal with PCI host bridge device and the property here
> +         * describes the BARs for the PCI devices.
> +         */

One thing to be careful is that ranges with a valid parameter is not
only present in PCI busses. It can be present in amba and other
simple-busses too. In that case the format for ranges in simpler as it
doesn't have a "memory type" like PCI.

When you get addresses from reg, bus ranges properties are automatically
handled for you.

All of this to say that a check on "ranges" is not enough because it
might capture other non-PCI busses that have a different, simpler,
ranges format. You want to check for "ranges" under a device_type =
"pci"; node.


> +        ranges = dt_get_property(np, "ranges", &len);
> +        if ( ranges && len )
> +        {
> +            unsigned int range_size, nr_ranges;
> +            int na, ns, pna;
> +
> +            pna = dt_n_addr_cells(np);
> +            na = dt_child_n_addr_cells(np);
> +            ns = dt_child_n_size_cells(np);
> +            range_size = pna + na + ns;
> +            nr_ranges = len / sizeof(__be32) / range_size;
> +
> +            for ( i = 0; i < nr_ranges; i++, ranges += range_size )
> +            {
> +                /* Skip the child address and get the parent (CPU) address */
> +                addr = dt_read_number(ranges + na, pna);
> +                size = dt_read_number(ranges + na + pna, ns);
> +
> +                start = addr & PAGE_MASK;
> +                end = PAGE_ALIGN(addr + size);
> +                res = rangeset_remove_range(mem_holes, start, end - 1);
> +                if ( res )
> +                {
> +                    printk(XENLOG_ERR "Failed to remove:
> %#"PRIx64"->%#"PRIx64"\n",
> +                           start, end);
> +                    goto out;
> +                }
> +            }
> +        }
>      }

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-22 20:50                       ` Stefano Stabellini
@ 2021-09-23 10:10                         ` Oleksandr
  0 siblings, 0 replies; 43+ messages in thread
From: Oleksandr @ 2021-09-23 10:10 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, xen-devel, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Henry Wang, Bertrand Marquis, Wei Chen, fnuv


On 22.09.21 23:50, Stefano Stabellini wrote:

Hi Stefano.

> On Wed, 22 Sep 2021, Oleksandr wrote:
>> On 22.09.21 01:00, Stefano Stabellini wrote:
>>> On Tue, 21 Sep 2021, Oleksandr wrote:
>>>> On 21.09.21 02:21, Stefano Stabellini wrote:
>>>>> On Sun, 19 Sep 2021, Oleksandr wrote:
>>>>>>> On 18/09/2021 03:37, Stefano Stabellini wrote:
>>>>>>>> On Fri, 17 Sep 2021, Stefano Stabellini wrote:
>>>>>>>>> On Fri, 17 Sep 2021, Oleksandr wrote:
>>>>>>>>>>>> +
>>>>>>>>>>>> +    dt_dprintk("Find unallocated memory for extended
>>>>>>>>>>>> regions\n");
>>>>>>>>>>>> +
>>>>>>>>>>>> +    unalloc_mem = rangeset_new(NULL, NULL, 0);
>>>>>>>>>>>> +    if ( !unalloc_mem )
>>>>>>>>>>>> +        return -ENOMEM;
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* Start with all available RAM */
>>>>>>>>>>>> +    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
>>>>>>>>>>>> +    {
>>>>>>>>>>>> +        start = bootinfo.mem.bank[i].start;
>>>>>>>>>>>> +        end = bootinfo.mem.bank[i].start +
>>>>>>>>>>>> bootinfo.mem.bank[i].size - 1;
>>>>>>>>>>>> +        res = rangeset_add_range(unalloc_mem, start,
>>>>>>>>>>>> end);
>>>>>>>>>>>> +        if ( res )
>>>>>>>>>>>> +        {
>>>>>>>>>>>> +            printk(XENLOG_ERR "Failed to add:
>>>>>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>>>>>> +                   start, end);
>>>>>>>>>>>> +            goto out;
>>>>>>>>>>>> +        }
>>>>>>>>>>>> +    }
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* Remove RAM assigned to Dom0 */
>>>>>>>>>>>> +    for ( i = 0; i < assign_mem->nr_banks; i++ )
>>>>>>>>>>>> +    {
>>>>>>>>>>>> +        start = assign_mem->bank[i].start;
>>>>>>>>>>>> +        end = assign_mem->bank[i].start +
>>>>>>>>>>>> assign_mem->bank[i].size - 1;
>>>>>>>>>>>> +        res = rangeset_remove_range(unalloc_mem, start,
>>>>>>>>>>>> end);
>>>>>>>>>>>> +        if ( res )
>>>>>>>>>>>> +        {
>>>>>>>>>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>>>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>>>>>> +                   start, end);
>>>>>>>>>>>> +            goto out;
>>>>>>>>>>>> +        }
>>>>>>>>>>>> +    }
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* Remove reserved-memory regions */
>>>>>>>>>>>> +    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++
>>>>>>>>>>>> )
>>>>>>>>>>>> +    {
>>>>>>>>>>>> +        start = bootinfo.reserved_mem.bank[i].start;
>>>>>>>>>>>> +        end = bootinfo.reserved_mem.bank[i].start +
>>>>>>>>>>>> +            bootinfo.reserved_mem.bank[i].size - 1;
>>>>>>>>>>>> +        res = rangeset_remove_range(unalloc_mem, start,
>>>>>>>>>>>> end);
>>>>>>>>>>>> +        if ( res )
>>>>>>>>>>>> +        {
>>>>>>>>>>>> +            printk(XENLOG_ERR "Failed to remove:
>>>>>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>>>>>> +                   start, end);
>>>>>>>>>>>> +            goto out;
>>>>>>>>>>>> +        }
>>>>>>>>>>>> +    }
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* Remove grant table region */
>>>>>>>>>>>> +    start = kinfo->gnttab_start;
>>>>>>>>>>>> +    end = kinfo->gnttab_start + kinfo->gnttab_size - 1;
>>>>>>>>>>>> +    res = rangeset_remove_range(unalloc_mem, start, end);
>>>>>>>>>>>> +    if ( res )
>>>>>>>>>>>> +    {
>>>>>>>>>>>> +        printk(XENLOG_ERR "Failed to remove:
>>>>>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>>>>>> +               start, end);
>>>>>>>>>>>> +        goto out;
>>>>>>>>>>>> +    }
>>>>>>>>>>>> +
>>>>>>>>>>>> +    start = EXT_REGION_START;
>>>>>>>>>>>> +    end = min((1ULL << p2m_ipa_bits) - 1,
>>>>>>>>>>>> EXT_REGION_END);
>>>>>>>>>>>> +    res = rangeset_report_ranges(unalloc_mem, start, end,
>>>>>>>>>>>> +                                 add_ext_regions,
>>>>>>>>>>>> ext_regions);
>>>>>>>>>>>> +    if ( res )
>>>>>>>>>>>> +        ext_regions->nr_banks = 0;
>>>>>>>>>>>> +    else if ( !ext_regions->nr_banks )
>>>>>>>>>>>> +        res = -ENOENT;
>>>>>>>>>>>> +
>>>>>>>>>>>> +out:
>>>>>>>>>>>> +    rangeset_destroy(unalloc_mem);
>>>>>>>>>>>> +
>>>>>>>>>>>> +    return res;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +static int __init find_memory_holes(const struct
>>>>>>>>>>>> kernel_info
>>>>>>>>>>>> *kinfo,
>>>>>>>>>>>> +                                    struct meminfo
>>>>>>>>>>>> *ext_regions)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    struct dt_device_node *np;
>>>>>>>>>>>> +    struct rangeset *mem_holes;
>>>>>>>>>>>> +    paddr_t start, end;
>>>>>>>>>>>> +    unsigned int i;
>>>>>>>>>>>> +    int res;
>>>>>>>>>>>> +
>>>>>>>>>>>> +    dt_dprintk("Find memory holes for extended
>>>>>>>>>>>> regions\n");
>>>>>>>>>>>> +
>>>>>>>>>>>> +    mem_holes = rangeset_new(NULL, NULL, 0);
>>>>>>>>>>>> +    if ( !mem_holes )
>>>>>>>>>>>> +        return -ENOMEM;
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* Start with maximum possible addressable physical
>>>>>>>>>>>> memory
>>>>>>>>>>>> range */
>>>>>>>>>>>> +    start = EXT_REGION_START;
>>>>>>>>>>>> +    end = min((1ULL << p2m_ipa_bits) - 1,
>>>>>>>>>>>> EXT_REGION_END);
>>>>>>>>>>>> +    res = rangeset_add_range(mem_holes, start, end);
>>>>>>>>>>>> +    if ( res )
>>>>>>>>>>>> +    {
>>>>>>>>>>>> +        printk(XENLOG_ERR "Failed to add:
>>>>>>>>>>>> %#"PRIx64"->%#"PRIx64"\n",
>>>>>>>>>>>> +               start, end);
>>>>>>>>>>>> +        goto out;
>>>>>>>>>>>> +    }
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* Remove all regions described by "reg" property
>>>>>>>>>>>> (MMIO,
>>>>>>>>>>>> RAM,
>>>>>>>>>>>> etc) */
>>>>>>>>>>> Well... The loop below is not going to handle all the
>>>>>>>>>>> regions
>>>>>>>>>>> described in
>>>>>>>>>>> the property "reg". Instead, it will cover a subset of "reg"
>>>>>>>>>>> where
>>>>>>>>>>> the
>>>>>>>>>>> memory is addressable.
>>>>>>>>>> As I understand, we are only interested in subset of "reg"
>>>>>>>>>> where
>>>>>>>>>> the
>>>>>>>>>> memory is
>>>>>>>>>> addressable.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> You will also need to cover "ranges" that will describe the
>>>>>>>>>>> BARs
>>>>>>>>>>> for
>>>>>>>>>>> the PCI
>>>>>>>>>>> devices.
>>>>>>>>>> Good point.
>>>>>>>>> Yes, very good point!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Could you please clarify how to recognize whether it is a PCI
>>>>>>>>>> device as long as PCI support is not merged? Or just to find
>>>>>>>>>> any
>>>>>>>>>> device nodes
>>>>>>>>>> with non-empty "ranges" property
>>>>>>>>>> and retrieve addresses?
>>>>>>>>> Normally any bus can have a ranges property with the aperture
>>>>>>>>> and
>>>>>>>>> possible address translations, including /amba (compatible =
>>>>>>>>> "simple-bus"). However, in these cases dt_device_get_address
>>>>>>>>> already
>>>>>>>>> takes care of it, see
>>>>>>>>> xen/common/device_tree.c:dt_device_get_address.
>>>>>>>>>
>>>>>>>>> The PCI bus is special for 2 reasons:
>>>>>>>>> - the ranges property has a different format
>>>>>>>>> - the bus is hot-pluggable
>>>>>>>>>
>>>>>>>>> So I think the only one that we need to treat specially is PCI.
>>>>>>>>>
>>>>>>>>> As far as I am aware PCI is the only bus (or maybe just the only
>>>>>>>>> bus
>>>>>>>>> that we support?) where ranges means the aperture.
>>>>>>>> Now that I think about this, there is another "hotpluggable"
>>>>>>>> scenario
>>>>>>>> we
>>>>>>>> need to think about:
>>>>>>>>
>>>>>>>> [1] https://marc.info/?l=xen-devel&m=163056546214978
>>>>>>>>
>>>>>>>> Xilinx devices have FPGA regions with apertures currently not
>>>>>>>> described
>>>>>>>> in device tree, where things can programmed in PL at runtime
>>>>>>>> making
>>>>>>>> new
>>>>>>>> devices appear with new MMIO regions out of thin air.
>>>>>>>>
>>>>>>>> Now let me start by saying that yes, the entire programmable
>>>>>>>> region
>>>>>>>> aperture could probably be described in device tree, however, in
>>>>>>>> reality it is not currently done in any of the device trees we use
>>>>>>>> (including the upstream device trees in linux.git).
>>>>>>> This is rather annoying, but not unheard. There are a couple of
>>>>>>> platforms
>>>>>>> where the MMIOs are not fully described in the DT.
>>>>>>>
>>>>>>> In fact, we have a callback 'specific_mappings' which create
>>>>>>> additional
>>>>>>> mappings (e.g. on the omap5) for dom0.
>>>>>>>
>>>>>>>> So, we have a problem :-(
>>>>>>>>
>>>>>>>>
>>>>>>>> I can work toward getting the right info on device tree, but in
>>>>>>>> reality
>>>>>>>> that is going to take time and for now the device tree doesn't
>>>>>>>> have
>>>>>>>> the
>>>>>>>> FPGA aperture in it. So if we accept this series as is, it is
>>>>>>>> going to
>>>>>>>> stop features like [1] from working. >
>>>>>>>> If we cannot come up with any better plans, I think it would be
>>>>>>>> better
>>>>>>>> to drop find_memory_holes, only rely on find_unallocated_memory
>>>>>>>> even
>>>>>>>> when the IOMMU is on. One idea is that we could add on top of the
>>>>>>>> regions found by find_unallocated_memory any MMIO regions marked
>>>>>>>> as
>>>>>>>> xen,passthrough: they are safe because they are not going to dom0
>>>>>>>> anyway.
>>>>>>> (Oleksandr, it looks like some rationale about the different
>>>>>>> approach is
>>>>>>> missing in the commit message. Can you add it?)
>>>>>> Yes sure, but let me please clarify what is different approach in this
>>>>>> context. Is it to *also* take into the account MMIO regions of the
>>>>>> devices
>>>>>> for
>>>>>> passthrough for case when IOMMU is off (in addition to unallocated
>>>>>> memory)? If
>>>>>> yes, I wonder whether we will gain much with that according to that
>>>>>> device's
>>>>>> MMIO regions are usually not big enough and we stick to allocate
>>>>>> extended
>>>>>> regions with bigger size (> 64MB).
>>>>> That's fair enough. There are a couple of counter examples where the
>>>>> MMIO regions for the device to assign are quite large, for instance a
>>>>> GPU, Xilinx AIEngine, or the PCIe Root Complex with the entire aperture,
>>>>> but maybe they are not that common. I am not sure if it is worth
>>>>> scanning the tree for xen,passthrough regions every time at boot for
>>>>> this.
>>>> ok, I will add a few sentences to commit message about this different
>>>> approach
>>>> for now. At least this could be implemented later on if there is a need.
>>> One thing that worries me about this is that if we take an old Xen with
>>> this series and run it on a new board, it might cause problems. At the
>>> very least [1] wouldn't work.
>> I got it.
>>
>>
>>> Can we have a Xen command line argument to disable extended regions as
>>> an emergecy toggle?
>> I think, yes. If no preference for the argument name I will name it
>> "no-ext-region".
> It is better to introduce it as ext-regions=yes/no with yes as default.
> So that in the future we could extending it to ext-regions=start,size if
> we wanted to.

ok, will do


-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-22 21:05             ` Stefano Stabellini
@ 2021-09-23 10:11               ` Oleksandr
  0 siblings, 0 replies; 43+ messages in thread
From: Oleksandr @ 2021-09-23 10:11 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, xen-devel, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Henry Wang, Bertrand Marquis, Wei Chen


On 23.09.21 00:05, Stefano Stabellini wrote:

Hi Stefano

> On Wed, 22 Sep 2021, Oleksandr wrote:
>>>>>> You will also need to cover "ranges" that will describe the BARs for
>>>>>> the PCI
>>>>>> devices.
>>>>> Good point.
>>>> Yes, very good point!
>>>>
>>>>
>>>>> Could you please clarify how to recognize whether it is a PCI
>>>>> device as long as PCI support is not merged? Or just to find any device
>>>>> nodes
>>>>> with non-empty "ranges" property
>>>>> and retrieve addresses?
>>>> Normally any bus can have a ranges property with the aperture and
>>>> possible address translations, including /amba (compatible =
>>>> "simple-bus"). However, in these cases dt_device_get_address already
>>>> takes care of it, see xen/common/device_tree.c:dt_device_get_address.
>>>>
>>>> The PCI bus is special for 2 reasons:
>>>> - the ranges property has a different format
>>>> - the bus is hot-pluggable
>>>>
>>>> So I think the only one that we need to treat specially is PCI.
>>>>
>>>> As far as I am aware PCI is the only bus (or maybe just the only bus
>>>> that we support?) where ranges means the aperture.
>>> Thank you for the clarification. I need to find device node with non-empty
>>> ranges property
>>> (and make sure that device_type property is "pci"), after that I need to
>>> read the context of ranges property and translate it.
>>>
>>>
>> OK, I experimented with that and managed to parse ranges property for PCI host
>> bridge node.
>>
>> I tested on my setup where the host device tree contains two PCI host bridge
>> nodes with the following:
>>
>> pcie@fe000000 {
>> ...
>>              ranges = <0x1000000 0x0 0x0 0x0 0xfe100000 0x0 0x100000 0x2000000
>> 0x0 0xfe200000 0x0 0xfe200000 0x0 0x200000 0x2000000 0x0 0x30000000 0x0
>> 0x30000000 0x0 0x8000000 0x42000000 0x0 0x38000000 0x0 0x38000000 0x0
>> 0x8000000>;
>> ...
>> };
>>
>> pcie@ee800000 {
>> ...
>>              ranges = <0x1000000 0x0 0x0 0x0 0xee900000 0x0 0x100000 0x2000000
>> 0x0 0xeea00000 0x0 0xeea00000 0x0 0x200000 0x2000000 0x0 0xc0000000 0x0
>> 0xc0000000 0x0 0x8000000 0x42000000 0x0 0xc8000000 0x0 0xc8000000 0x0
>> 0x8000000>;
>> ...
>> };
>>
>> So Xen retrieves the *CPU addresses* from the ranges:
>>
>> (XEN) dev /soc/pcie@fe000000 range_size 7 nr_ranges 4
>> (XEN) 0: addr=fe100000, size=100000
>> (XEN) 1: addr=fe200000, size=200000
>> (XEN) 2: addr=30000000, size=8000000
>> (XEN) 3: addr=38000000, size=8000000
>> (XEN) dev /soc/pcie@ee800000 range_size 7 nr_ranges 4
>> (XEN) 0: addr=ee900000, size=100000
>> (XEN) 1: addr=eea00000, size=200000
>> (XEN) 2: addr=c0000000, size=8000000
>> (XEN) 3: addr=c8000000, size=8000000
>>
>> The code below covers ranges property in the context of finding memory holes
>> (to be squashed with current patch):
>>
>> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
>> index d37156a..7d20c10 100644
>> --- a/xen/arch/arm/domain_build.c
>> +++ b/xen/arch/arm/domain_build.c
>> @@ -834,6 +834,8 @@ static int __init find_memory_holes(struct meminfo
>> *ext_regions)
>>       {
>>           unsigned int naddr;
>>           u64 addr, size;
>> +        const __be32 *ranges;
>> +        u32 len;
>>
>>           naddr = dt_number_of_address(np);
>>
>> @@ -857,6 +859,41 @@ static int __init find_memory_holes(struct meminfo
>> *ext_regions)
>>                   goto out;
>>               }
>>           }
>> +
>> +        /*
>> +         * Also looking for non-empty ranges property which would likely mean
>> +         * that we deal with PCI host bridge device and the property here
>> +         * describes the BARs for the PCI devices.
>> +         */
> One thing to be careful is that ranges with a valid parameter is not
> only present in PCI busses. It can be present in amba and other
> simple-busses too. In that case the format for ranges in simpler as it
> doesn't have a "memory type" like PCI.
>
> When you get addresses from reg, bus ranges properties are automatically
> handled for you.
>
> All of this to say that a check on "ranges" is not enough because it
> might capture other non-PCI busses that have a different, simpler,
> ranges format. You want to check for "ranges" under a device_type =
> "pci"; node.

ok, will do.


>
>
>> +        ranges = dt_get_property(np, "ranges", &len);
>> +        if ( ranges && len )
>> +        {
>> +            unsigned int range_size, nr_ranges;
>> +            int na, ns, pna;
>> +
>> +            pna = dt_n_addr_cells(np);
>> +            na = dt_child_n_addr_cells(np);
>> +            ns = dt_child_n_size_cells(np);
>> +            range_size = pna + na + ns;
>> +            nr_ranges = len / sizeof(__be32) / range_size;
>> +
>> +            for ( i = 0; i < nr_ranges; i++, ranges += range_size )
>> +            {
>> +                /* Skip the child address and get the parent (CPU) address */
>> +                addr = dt_read_number(ranges + na, pna);
>> +                size = dt_read_number(ranges + na + pna, ns);
>> +
>> +                start = addr & PAGE_MASK;
>> +                end = PAGE_ALIGN(addr + size);
>> +                res = rangeset_remove_range(mem_holes, start, end - 1);
>> +                if ( res )
>> +                {
>> +                    printk(XENLOG_ERR "Failed to remove:
>> %#"PRIx64"->%#"PRIx64"\n",
>> +                           start, end);
>> +                    goto out;
>> +                }
>> +            }
>> +        }
>>       }

-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-18 16:59       ` Oleksandr
@ 2021-09-23 10:41         ` Oleksandr
  2021-09-23 16:38           ` Stefano Stabellini
  0 siblings, 1 reply; 43+ messages in thread
From: Oleksandr @ 2021-09-23 10:41 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini
  Cc: xen-devel, Oleksandr Tyshchenko, Volodymyr Babchuk, Henry Wang,
	Bertrand Marquis, Wei Chen


Hi Stefano, Julien


On 18.09.21 19:59, Oleksandr wrote:
>
> Hi Julien.
>
>
> [snip]
>
>
>>>
>>>
>>>> +#define EXT_REGION_END 0x80003fffffffULL
>>>> +
>>>> +static int __init find_unallocated_memory(const struct kernel_info 
>>>> *kinfo,
>>>> +                                          struct meminfo 
>>>> *ext_regions)
>>>> +{
>>>> +    const struct meminfo *assign_mem = &kinfo->mem;
>>>> +    struct rangeset *unalloc_mem;
>>>> +    paddr_t start, end;
>>>> +    unsigned int i;
>>>> +    int res;
>>>
>>> We technically already know which range of memory is unused. This is 
>>> pretty much any region in the freelist of the page allocator. So how 
>>> about walking the freelist instead?
>>
>> ok, I will investigate the page allocator code (right now I have no 
>> understanding of how to do that). BTW, I have just grepped "freelist" 
>> through the code and all page context related appearances are in x86 
>> code only.
>>
>>>
>>> The advantage is we don't need to worry about modifying the function 
>>> when adding new memory type.
>>>
>>> One disavantage is this will not cover *all* the unused memory as 
>>> this is doing. But I think this is an acceptable downside.
>
> I did some investigations and create test patch. Although, I am not 
> 100% sure this is exactly what you meant, but I will provide results 
> anyway.
>
> 1. Below the extended regions (unallocated memory, regions >=64MB ) 
> calculated by my initial method (bootinfo.mem - kinfo->mem - 
> bootinfo.reserved_mem - kinfo->gnttab):
>
> (XEN) Extended region 0: 0x48000000->0x54000000
> (XEN) Extended region 1: 0x57000000->0x60000000
> (XEN) Extended region 2: 0x70000000->0x78000000
> (XEN) Extended region 3: 0x78200000->0xc0000000
> (XEN) Extended region 4: 0x500000000->0x580000000
> (XEN) Extended region 5: 0x600000000->0x680000000
> (XEN) Extended region 6: 0x700000000->0x780000000
>
> 2. Below the extended regions (unallocated memory, regions >=64MB) 
> calculated by new method (free memory in page allocator):
>
> (XEN) Extended region 0: 0x48000000->0x54000000
> (XEN) Extended region 1: 0x58000000->0x60000000
> (XEN) Extended region 2: 0x70000000->0x78000000
> (XEN) Extended region 3: 0x78200000->0x84000000
> (XEN) Extended region 4: 0x86000000->0x8a000000
> (XEN) Extended region 5: 0x8c200000->0xc0000000
> (XEN) Extended region 6: 0x500000000->0x580000000
> (XEN) Extended region 7: 0x600000000->0x680000000
> (XEN) Extended region 8: 0x700000000->0x765e00000
>
> Some thoughts regarding that.
>
> 1. A few ranges below 4GB are absent in resulting extended regions. I 
> assume, this is because of the modules:
>
> (XEN) Checking for initrd in /chosen
> (XEN) Initrd 0000000084000040-0000000085effc48
> (XEN) RAM: 0000000048000000 - 00000000bfffffff
> (XEN) RAM: 0000000500000000 - 000000057fffffff
> (XEN) RAM: 0000000600000000 - 000000067fffffff
> (XEN) RAM: 0000000700000000 - 000000077fffffff
> (XEN)
> (XEN) MODULE[0]: 0000000078080000 - 00000000781d74c8 Xen
> (XEN) MODULE[1]: 0000000057fe7000 - 0000000057ffd080 Device Tree
> (XEN) MODULE[2]: 0000000084000040 - 0000000085effc48 Ramdisk
> (XEN) MODULE[3]: 000000008a000000 - 000000008c000000 Kernel
> (XEN) MODULE[4]: 000000008c000000 - 000000008c010000 XSM
> (XEN)  RESVD[0]: 0000000084000040 - 0000000085effc48
> (XEN)  RESVD[1]: 0000000054000000 - 0000000056ffffff
>
> 2. Also, it worth mentioning that relatively large chunk (~417MB) of 
> memory above 4GB is absent (to be precise, at the end of last RAM 
> bank), which I assume, used for Xen internals.
> We could really use it for extended regions.
> Below free regions in the heap (for last RAM bank) just in case:
>
> (XEN) heap[node=0][zone=23][order=5] 0x00000765ec0000-0x00000765ee0000
> (XEN) heap[node=0][zone=23][order=6] 0x00000765e80000-0x00000765ec0000
> (XEN) heap[node=0][zone=23][order=7] 0x00000765e00000-0x00000765e80000
> (XEN) heap[node=0][zone=23][order=9] 0x00000765c00000-0x00000765e00000
> (XEN) heap[node=0][zone=23][order=10] 0x00000765800000-0x00000765c00000
> (XEN) heap[node=0][zone=23][order=11] 0x00000765000000-0x00000765800000
> (XEN) heap[node=0][zone=23][order=12] 0x00000764000000-0x00000765000000
> (XEN) heap[node=0][zone=23][order=14] 0x00000760000000-0x00000764000000
> (XEN) heap[node=0][zone=23][order=17] 0x00000740000000-0x00000760000000
> (XEN) heap[node=0][zone=23][order=18] 0x00000540000000-0x00000580000000
> (XEN) heap[node=0][zone=23][order=18] 0x00000500000000-0x00000540000000
> (XEN) heap[node=0][zone=23][order=18] 0x00000640000000-0x00000680000000
> (XEN) heap[node=0][zone=23][order=18] 0x00000600000000-0x00000640000000
> (XEN) heap[node=0][zone=23][order=18] 0x00000700000000-0x00000740000000
>
> Yes, you already pointed out this disadvantage, so if it is an 
> acceptable downside, I am absolutely OK.
>
>
> 3. Common code updates. There is a question how to properly make a 
> connection between common allocator internals and Arm's code for 
> creating DT. I didn’t come up with anything better
> than creating for_each_avail_page() for invoking a callback with page 
> and its order.
>
> **********
>
> Below the proposed changes on top of the initial patch, would this be 
> acceptable in general?
>
> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> index 523eb19..1e58fc5 100644
> --- a/xen/arch/arm/domain_build.c
> +++ b/xen/arch/arm/domain_build.c
> @@ -753,16 +753,33 @@ static int __init add_ext_regions(unsigned long 
> s, unsigned long e, void *data)
>      return 0;
>  }
>
> +static int __init add_unalloc_mem(struct page_info *page, unsigned 
> int order,
> +                                  void *data)
> +{
> +    struct rangeset *unalloc_mem = data;
> +    paddr_t start, end;
> +    int res;
> +
> +    start = page_to_maddr(page);
> +    end = start + pfn_to_paddr(1UL << order);
> +    res = rangeset_add_range(unalloc_mem, start, end - 1);
> +    if ( res )
> +    {
> +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> +               start, end);
> +        return res;
> +    }
> +
> +    return 0;
> +}
> +
>  #define EXT_REGION_START   0x40000000ULL
>  #define EXT_REGION_END     0x80003fffffffULL
>
> -static int __init find_unallocated_memory(const struct kernel_info 
> *kinfo,
> -                                          struct meminfo *ext_regions)
> +static int __init find_unallocated_memory(struct meminfo *ext_regions)
>  {
> -    const struct meminfo *assign_mem = &kinfo->mem;
>      struct rangeset *unalloc_mem;
>      paddr_t start, end;
> -    unsigned int i;
>      int res;
>
>      dt_dprintk("Find unallocated memory for extended regions\n");
> @@ -771,59 +788,9 @@ static int __init find_unallocated_memory(const 
> struct kernel_info *kinfo,
>      if ( !unalloc_mem )
>          return -ENOMEM;
>
> -    /* Start with all available RAM */
> -    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
> -    {
> -        start = bootinfo.mem.bank[i].start;
> -        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size;
> -        res = rangeset_add_range(unalloc_mem, start, end - 1);
> -        if ( res )
> -        {
> -            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> -                   start, end);
> -            goto out;
> -        }
> -    }
> -
> -    /* Remove RAM assigned to Dom0 */
> -    for ( i = 0; i < assign_mem->nr_banks; i++ )
> -    {
> -        start = assign_mem->bank[i].start;
> -        end = assign_mem->bank[i].start + assign_mem->bank[i].size;
> -        res = rangeset_remove_range(unalloc_mem, start, end - 1);
> -        if ( res )
> -        {
> -            printk(XENLOG_ERR "Failed to remove: 
> %#"PRIx64"->%#"PRIx64"\n",
> -                   start, end);
> -            goto out;
> -        }
> -    }
> -
> -    /* Remove reserved-memory regions */
> -    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
> -    {
> -        start = bootinfo.reserved_mem.bank[i].start;
> -        end = bootinfo.reserved_mem.bank[i].start +
> -            bootinfo.reserved_mem.bank[i].size;
> -        res = rangeset_remove_range(unalloc_mem, start, end - 1);
> -        if ( res )
> -        {
> -            printk(XENLOG_ERR "Failed to remove: 
> %#"PRIx64"->%#"PRIx64"\n",
> -                   start, end);
> -            goto out;
> -        }
> -    }
> -
> -    /* Remove grant table region */
> -    start = kinfo->gnttab_start;
> -    end = kinfo->gnttab_start + kinfo->gnttab_size;
> -    res = rangeset_remove_range(unalloc_mem, start, end - 1);
> +    res = for_each_avail_page(add_unalloc_mem, unalloc_mem);
>      if ( res )
> -    {
> -        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> -               start, end);
>          goto out;
> -    }
>
>      start = EXT_REGION_START;
>      end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> @@ -840,8 +807,7 @@ out:
>      return res;
>  }
>
> -static int __init find_memory_holes(const struct kernel_info *kinfo,
> -                                    struct meminfo *ext_regions)
> +static int __init find_memory_holes(struct meminfo *ext_regions)
>  {
>      struct dt_device_node *np;
>      struct rangeset *mem_holes;
> @@ -961,9 +927,9 @@ static int __init make_hypervisor_node(struct 
> domain *d,
>      else
>      {
>          if ( !is_iommu_enabled(d) )
> -            res = find_unallocated_memory(kinfo, ext_regions);
> +            res = find_unallocated_memory(ext_regions);
>          else
> -            res = find_memory_holes(kinfo, ext_regions);
> +            res = find_memory_holes(ext_regions);
>
>          if ( res )
>              printk(XENLOG_WARNING "Failed to allocate extended 
> regions\n");
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index 8fad139..7cd1020 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -1572,6 +1572,40 @@ static int reserve_heap_page(struct page_info *pg)
>
>  }
>
> +/* TODO heap_lock? */
> +int for_each_avail_page(int (*cb)(struct page_info *, unsigned int, 
> void *),
> +                        void *data)
> +{
> +    unsigned int node, zone, order;
> +    int ret;
> +
> +    for ( node = 0; node < MAX_NUMNODES; node++ )
> +    {
> +        if ( !avail[node] )
> +            continue;
> +
> +        for ( zone = 0; zone < NR_ZONES; zone++ )
> +        {
> +            for ( order = 0; order <= MAX_ORDER; order++ )
> +            {
> +                struct page_info *head, *tmp;
> +
> +                if ( page_list_empty(&heap(node, zone, order)) )
> +                    continue;
> +
> +                page_list_for_each_safe ( head, tmp, &heap(node, 
> zone, order) )
> +                {
> +                    ret = cb(head, order, data);
> +                    if ( ret )
> +                        return ret;
> +                }
> +            }
> +        }
> +    }
> +
> +    return 0;
> +}
> +
>  int offline_page(mfn_t mfn, int broken, uint32_t *status)
>  {
>      unsigned long old_info = 0;
> diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
> index 667f9da..64dd3e2 100644
> --- a/xen/include/xen/mm.h
> +++ b/xen/include/xen/mm.h
> @@ -123,6 +123,9 @@ unsigned int online_page(mfn_t mfn, uint32_t 
> *status);
>  int offline_page(mfn_t mfn, int broken, uint32_t *status);
>  int query_page_offline(mfn_t mfn, uint32_t *status);
>
> +int for_each_avail_page(int (*cb)(struct page_info *, unsigned int, 
> void *),
> +                        void *data);
> +
>  void heap_init_late(void);
>
>  int assign_pages(


I am sorry, but may I please clarify regarding that? Whether we will go 
this new direction (free memory in page allocator) or leave things as 
they are (bootinfo.mem - kinfo->mem - bootinfo.reserved_mem - 
kinfo->gnttab). This is only one still unclear moment to me in current 
patch before preparing V3.


-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-23 10:41         ` Oleksandr
@ 2021-09-23 16:38           ` Stefano Stabellini
  2021-09-23 17:44             ` Oleksandr
  0 siblings, 1 reply; 43+ messages in thread
From: Stefano Stabellini @ 2021-09-23 16:38 UTC (permalink / raw)
  To: Oleksandr
  Cc: Julien Grall, Stefano Stabellini, xen-devel,
	Oleksandr Tyshchenko, Volodymyr Babchuk, Henry Wang,
	Bertrand Marquis, Wei Chen

[-- Attachment #1: Type: text/plain, Size: 13906 bytes --]

On Thu, 23 Sep 2021, Oleksandr wrote:
> On 18.09.21 19:59, Oleksandr wrote:
> > > > > +#define EXT_REGION_END 0x80003fffffffULL
> > > > > +
> > > > > +static int __init find_unallocated_memory(const struct kernel_info
> > > > > *kinfo,
> > > > > +                                          struct meminfo
> > > > > *ext_regions)
> > > > > +{
> > > > > +    const struct meminfo *assign_mem = &kinfo->mem;
> > > > > +    struct rangeset *unalloc_mem;
> > > > > +    paddr_t start, end;
> > > > > +    unsigned int i;
> > > > > +    int res;
> > > > 
> > > > We technically already know which range of memory is unused. This is
> > > > pretty much any region in the freelist of the page allocator. So how
> > > > about walking the freelist instead?
> > > 
> > > ok, I will investigate the page allocator code (right now I have no
> > > understanding of how to do that). BTW, I have just grepped "freelist"
> > > through the code and all page context related appearances are in x86 code
> > > only.
> > > 
> > > > 
> > > > The advantage is we don't need to worry about modifying the function
> > > > when adding new memory type.
> > > > 
> > > > One disavantage is this will not cover *all* the unused memory as this
> > > > is doing. But I think this is an acceptable downside.
> > 
> > I did some investigations and create test patch. Although, I am not 100%
> > sure this is exactly what you meant, but I will provide results anyway.
> > 
> > 1. Below the extended regions (unallocated memory, regions >=64MB )
> > calculated by my initial method (bootinfo.mem - kinfo->mem -
> > bootinfo.reserved_mem - kinfo->gnttab):
> > 
> > (XEN) Extended region 0: 0x48000000->0x54000000
> > (XEN) Extended region 1: 0x57000000->0x60000000
> > (XEN) Extended region 2: 0x70000000->0x78000000
> > (XEN) Extended region 3: 0x78200000->0xc0000000
> > (XEN) Extended region 4: 0x500000000->0x580000000
> > (XEN) Extended region 5: 0x600000000->0x680000000
> > (XEN) Extended region 6: 0x700000000->0x780000000
> > 
> > 2. Below the extended regions (unallocated memory, regions >=64MB)
> > calculated by new method (free memory in page allocator):
> > 
> > (XEN) Extended region 0: 0x48000000->0x54000000
> > (XEN) Extended region 1: 0x58000000->0x60000000
> > (XEN) Extended region 2: 0x70000000->0x78000000
> > (XEN) Extended region 3: 0x78200000->0x84000000
> > (XEN) Extended region 4: 0x86000000->0x8a000000
> > (XEN) Extended region 5: 0x8c200000->0xc0000000
> > (XEN) Extended region 6: 0x500000000->0x580000000
> > (XEN) Extended region 7: 0x600000000->0x680000000
> > (XEN) Extended region 8: 0x700000000->0x765e00000
> > 
> > Some thoughts regarding that.
> > 
> > 1. A few ranges below 4GB are absent in resulting extended regions. I
> > assume, this is because of the modules:
> > 
> > (XEN) Checking for initrd in /chosen
> > (XEN) Initrd 0000000084000040-0000000085effc48
> > (XEN) RAM: 0000000048000000 - 00000000bfffffff
> > (XEN) RAM: 0000000500000000 - 000000057fffffff
> > (XEN) RAM: 0000000600000000 - 000000067fffffff
> > (XEN) RAM: 0000000700000000 - 000000077fffffff
> > (XEN)
> > (XEN) MODULE[0]: 0000000078080000 - 00000000781d74c8 Xen
> > (XEN) MODULE[1]: 0000000057fe7000 - 0000000057ffd080 Device Tree
> > (XEN) MODULE[2]: 0000000084000040 - 0000000085effc48 Ramdisk
> > (XEN) MODULE[3]: 000000008a000000 - 000000008c000000 Kernel
> > (XEN) MODULE[4]: 000000008c000000 - 000000008c010000 XSM
> > (XEN)  RESVD[0]: 0000000084000040 - 0000000085effc48
> > (XEN)  RESVD[1]: 0000000054000000 - 0000000056ffffff
> > 
> > 2. Also, it worth mentioning that relatively large chunk (~417MB) of memory
> > above 4GB is absent (to be precise, at the end of last RAM bank), which I
> > assume, used for Xen internals.
> > We could really use it for extended regions.
> > Below free regions in the heap (for last RAM bank) just in case:
> > 
> > (XEN) heap[node=0][zone=23][order=5] 0x00000765ec0000-0x00000765ee0000
> > (XEN) heap[node=0][zone=23][order=6] 0x00000765e80000-0x00000765ec0000
> > (XEN) heap[node=0][zone=23][order=7] 0x00000765e00000-0x00000765e80000
> > (XEN) heap[node=0][zone=23][order=9] 0x00000765c00000-0x00000765e00000
> > (XEN) heap[node=0][zone=23][order=10] 0x00000765800000-0x00000765c00000
> > (XEN) heap[node=0][zone=23][order=11] 0x00000765000000-0x00000765800000
> > (XEN) heap[node=0][zone=23][order=12] 0x00000764000000-0x00000765000000
> > (XEN) heap[node=0][zone=23][order=14] 0x00000760000000-0x00000764000000
> > (XEN) heap[node=0][zone=23][order=17] 0x00000740000000-0x00000760000000
> > (XEN) heap[node=0][zone=23][order=18] 0x00000540000000-0x00000580000000
> > (XEN) heap[node=0][zone=23][order=18] 0x00000500000000-0x00000540000000
> > (XEN) heap[node=0][zone=23][order=18] 0x00000640000000-0x00000680000000
> > (XEN) heap[node=0][zone=23][order=18] 0x00000600000000-0x00000640000000
> > (XEN) heap[node=0][zone=23][order=18] 0x00000700000000-0x00000740000000
> > 
> > Yes, you already pointed out this disadvantage, so if it is an acceptable
> > downside, I am absolutely OK.
> > 
> > 
> > 3. Common code updates. There is a question how to properly make a
> > connection between common allocator internals and Arm's code for creating
> > DT. I didn’t come up with anything better
> > than creating for_each_avail_page() for invoking a callback with page and
> > its order.
> > 
> > **********
> > 
> > Below the proposed changes on top of the initial patch, would this be
> > acceptable in general?
> > 
> > diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
> > index 523eb19..1e58fc5 100644
> > --- a/xen/arch/arm/domain_build.c
> > +++ b/xen/arch/arm/domain_build.c
> > @@ -753,16 +753,33 @@ static int __init add_ext_regions(unsigned long s,
> > unsigned long e, void *data)
> >      return 0;
> >  }
> > 
> > +static int __init add_unalloc_mem(struct page_info *page, unsigned int
> > order,
> > +                                  void *data)
> > +{
> > +    struct rangeset *unalloc_mem = data;
> > +    paddr_t start, end;
> > +    int res;
> > +
> > +    start = page_to_maddr(page);
> > +    end = start + pfn_to_paddr(1UL << order);
> > +    res = rangeset_add_range(unalloc_mem, start, end - 1);
> > +    if ( res )
> > +    {
> > +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> > +               start, end);
> > +        return res;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >  #define EXT_REGION_START   0x40000000ULL
> >  #define EXT_REGION_END     0x80003fffffffULL
> > 
> > -static int __init find_unallocated_memory(const struct kernel_info *kinfo,
> > -                                          struct meminfo *ext_regions)
> > +static int __init find_unallocated_memory(struct meminfo *ext_regions)
> >  {
> > -    const struct meminfo *assign_mem = &kinfo->mem;
> >      struct rangeset *unalloc_mem;
> >      paddr_t start, end;
> > -    unsigned int i;
> >      int res;
> > 
> >      dt_dprintk("Find unallocated memory for extended regions\n");
> > @@ -771,59 +788,9 @@ static int __init find_unallocated_memory(const struct
> > kernel_info *kinfo,
> >      if ( !unalloc_mem )
> >          return -ENOMEM;
> > 
> > -    /* Start with all available RAM */
> > -    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
> > -    {
> > -        start = bootinfo.mem.bank[i].start;
> > -        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size;
> > -        res = rangeset_add_range(unalloc_mem, start, end - 1);
> > -        if ( res )
> > -        {
> > -            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
> > -                   start, end);
> > -            goto out;
> > -        }
> > -    }
> > -
> > -    /* Remove RAM assigned to Dom0 */
> > -    for ( i = 0; i < assign_mem->nr_banks; i++ )
> > -    {
> > -        start = assign_mem->bank[i].start;
> > -        end = assign_mem->bank[i].start + assign_mem->bank[i].size;
> > -        res = rangeset_remove_range(unalloc_mem, start, end - 1);
> > -        if ( res )
> > -        {
> > -            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> > -                   start, end);
> > -            goto out;
> > -        }
> > -    }
> > -
> > -    /* Remove reserved-memory regions */
> > -    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
> > -    {
> > -        start = bootinfo.reserved_mem.bank[i].start;
> > -        end = bootinfo.reserved_mem.bank[i].start +
> > -            bootinfo.reserved_mem.bank[i].size;
> > -        res = rangeset_remove_range(unalloc_mem, start, end - 1);
> > -        if ( res )
> > -        {
> > -            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> > -                   start, end);
> > -            goto out;
> > -        }
> > -    }
> > -
> > -    /* Remove grant table region */
> > -    start = kinfo->gnttab_start;
> > -    end = kinfo->gnttab_start + kinfo->gnttab_size;
> > -    res = rangeset_remove_range(unalloc_mem, start, end - 1);
> > +    res = for_each_avail_page(add_unalloc_mem, unalloc_mem);
> >      if ( res )
> > -    {
> > -        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
> > -               start, end);
> >          goto out;
> > -    }
> > 
> >      start = EXT_REGION_START;
> >      end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
> > @@ -840,8 +807,7 @@ out:
> >      return res;
> >  }
> > 
> > -static int __init find_memory_holes(const struct kernel_info *kinfo,
> > -                                    struct meminfo *ext_regions)
> > +static int __init find_memory_holes(struct meminfo *ext_regions)
> >  {
> >      struct dt_device_node *np;
> >      struct rangeset *mem_holes;
> > @@ -961,9 +927,9 @@ static int __init make_hypervisor_node(struct domain *d,
> >      else
> >      {
> >          if ( !is_iommu_enabled(d) )
> > -            res = find_unallocated_memory(kinfo, ext_regions);
> > +            res = find_unallocated_memory(ext_regions);
> >          else
> > -            res = find_memory_holes(kinfo, ext_regions);
> > +            res = find_memory_holes(ext_regions);
> > 
> >          if ( res )
> >              printk(XENLOG_WARNING "Failed to allocate extended regions\n");
> > diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> > index 8fad139..7cd1020 100644
> > --- a/xen/common/page_alloc.c
> > +++ b/xen/common/page_alloc.c
> > @@ -1572,6 +1572,40 @@ static int reserve_heap_page(struct page_info *pg)
> > 
> >  }
> > 
> > +/* TODO heap_lock? */
> > +int for_each_avail_page(int (*cb)(struct page_info *, unsigned int, void
> > *),
> > +                        void *data)
> > +{
> > +    unsigned int node, zone, order;
> > +    int ret;
> > +
> > +    for ( node = 0; node < MAX_NUMNODES; node++ )
> > +    {
> > +        if ( !avail[node] )
> > +            continue;
> > +
> > +        for ( zone = 0; zone < NR_ZONES; zone++ )
> > +        {
> > +            for ( order = 0; order <= MAX_ORDER; order++ )
> > +            {
> > +                struct page_info *head, *tmp;
> > +
> > +                if ( page_list_empty(&heap(node, zone, order)) )
> > +                    continue;
> > +
> > +                page_list_for_each_safe ( head, tmp, &heap(node, zone,
> > order) )
> > +                {
> > +                    ret = cb(head, order, data);
> > +                    if ( ret )
> > +                        return ret;
> > +                }
> > +            }
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >  int offline_page(mfn_t mfn, int broken, uint32_t *status)
> >  {
> >      unsigned long old_info = 0;
> > diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
> > index 667f9da..64dd3e2 100644
> > --- a/xen/include/xen/mm.h
> > +++ b/xen/include/xen/mm.h
> > @@ -123,6 +123,9 @@ unsigned int online_page(mfn_t mfn, uint32_t *status);
> >  int offline_page(mfn_t mfn, int broken, uint32_t *status);
> >  int query_page_offline(mfn_t mfn, uint32_t *status);
> > 
> > +int for_each_avail_page(int (*cb)(struct page_info *, unsigned int, void
> > *),
> > +                        void *data);
> > +
> >  void heap_init_late(void);
> > 
> >  int assign_pages(
> 
> 
> I am sorry, but may I please clarify regarding that? Whether we will go this
> new direction (free memory in page allocator) or leave things as they are
> (bootinfo.mem - kinfo->mem - bootinfo.reserved_mem - kinfo->gnttab). This is
> only one still unclear moment to me in current patch before preparing V3.

I think both approaches are fine. Your original approach leads to better
results in terms of extended regions but the difference is not drastic.
The original approach requires more code (bad) but probably less CPU
cycles (good).

Personally I am fine either way but as Julien was the one to provide
feedback on this it would be best to get his opinion.

But in the meantime I think it is OK to send a v3 so that we can review
the rest.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0
  2021-09-23 16:38           ` Stefano Stabellini
@ 2021-09-23 17:44             ` Oleksandr
  0 siblings, 0 replies; 43+ messages in thread
From: Oleksandr @ 2021-09-23 17:44 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Julien Grall, xen-devel, Oleksandr Tyshchenko, Volodymyr Babchuk,
	Henry Wang, Bertrand Marquis, Wei Chen


On 23.09.21 19:38, Stefano Stabellini wrote:

Hi Stefano

> On Thu, 23 Sep 2021, Oleksandr wrote:
>> On 18.09.21 19:59, Oleksandr wrote:
>>>>>> +#define EXT_REGION_END 0x80003fffffffULL
>>>>>> +
>>>>>> +static int __init find_unallocated_memory(const struct kernel_info
>>>>>> *kinfo,
>>>>>> +                                          struct meminfo
>>>>>> *ext_regions)
>>>>>> +{
>>>>>> +    const struct meminfo *assign_mem = &kinfo->mem;
>>>>>> +    struct rangeset *unalloc_mem;
>>>>>> +    paddr_t start, end;
>>>>>> +    unsigned int i;
>>>>>> +    int res;
>>>>> We technically already know which range of memory is unused. This is
>>>>> pretty much any region in the freelist of the page allocator. So how
>>>>> about walking the freelist instead?
>>>> ok, I will investigate the page allocator code (right now I have no
>>>> understanding of how to do that). BTW, I have just grepped "freelist"
>>>> through the code and all page context related appearances are in x86 code
>>>> only.
>>>>
>>>>> The advantage is we don't need to worry about modifying the function
>>>>> when adding new memory type.
>>>>>
>>>>> One disavantage is this will not cover *all* the unused memory as this
>>>>> is doing. But I think this is an acceptable downside.
>>> I did some investigations and create test patch. Although, I am not 100%
>>> sure this is exactly what you meant, but I will provide results anyway.
>>>
>>> 1. Below the extended regions (unallocated memory, regions >=64MB )
>>> calculated by my initial method (bootinfo.mem - kinfo->mem -
>>> bootinfo.reserved_mem - kinfo->gnttab):
>>>
>>> (XEN) Extended region 0: 0x48000000->0x54000000
>>> (XEN) Extended region 1: 0x57000000->0x60000000
>>> (XEN) Extended region 2: 0x70000000->0x78000000
>>> (XEN) Extended region 3: 0x78200000->0xc0000000
>>> (XEN) Extended region 4: 0x500000000->0x580000000
>>> (XEN) Extended region 5: 0x600000000->0x680000000
>>> (XEN) Extended region 6: 0x700000000->0x780000000
>>>
>>> 2. Below the extended regions (unallocated memory, regions >=64MB)
>>> calculated by new method (free memory in page allocator):
>>>
>>> (XEN) Extended region 0: 0x48000000->0x54000000
>>> (XEN) Extended region 1: 0x58000000->0x60000000
>>> (XEN) Extended region 2: 0x70000000->0x78000000
>>> (XEN) Extended region 3: 0x78200000->0x84000000
>>> (XEN) Extended region 4: 0x86000000->0x8a000000
>>> (XEN) Extended region 5: 0x8c200000->0xc0000000
>>> (XEN) Extended region 6: 0x500000000->0x580000000
>>> (XEN) Extended region 7: 0x600000000->0x680000000
>>> (XEN) Extended region 8: 0x700000000->0x765e00000
>>>
>>> Some thoughts regarding that.
>>>
>>> 1. A few ranges below 4GB are absent in resulting extended regions. I
>>> assume, this is because of the modules:
>>>
>>> (XEN) Checking for initrd in /chosen
>>> (XEN) Initrd 0000000084000040-0000000085effc48
>>> (XEN) RAM: 0000000048000000 - 00000000bfffffff
>>> (XEN) RAM: 0000000500000000 - 000000057fffffff
>>> (XEN) RAM: 0000000600000000 - 000000067fffffff
>>> (XEN) RAM: 0000000700000000 - 000000077fffffff
>>> (XEN)
>>> (XEN) MODULE[0]: 0000000078080000 - 00000000781d74c8 Xen
>>> (XEN) MODULE[1]: 0000000057fe7000 - 0000000057ffd080 Device Tree
>>> (XEN) MODULE[2]: 0000000084000040 - 0000000085effc48 Ramdisk
>>> (XEN) MODULE[3]: 000000008a000000 - 000000008c000000 Kernel
>>> (XEN) MODULE[4]: 000000008c000000 - 000000008c010000 XSM
>>> (XEN)  RESVD[0]: 0000000084000040 - 0000000085effc48
>>> (XEN)  RESVD[1]: 0000000054000000 - 0000000056ffffff
>>>
>>> 2. Also, it worth mentioning that relatively large chunk (~417MB) of memory
>>> above 4GB is absent (to be precise, at the end of last RAM bank), which I
>>> assume, used for Xen internals.
>>> We could really use it for extended regions.
>>> Below free regions in the heap (for last RAM bank) just in case:
>>>
>>> (XEN) heap[node=0][zone=23][order=5] 0x00000765ec0000-0x00000765ee0000
>>> (XEN) heap[node=0][zone=23][order=6] 0x00000765e80000-0x00000765ec0000
>>> (XEN) heap[node=0][zone=23][order=7] 0x00000765e00000-0x00000765e80000
>>> (XEN) heap[node=0][zone=23][order=9] 0x00000765c00000-0x00000765e00000
>>> (XEN) heap[node=0][zone=23][order=10] 0x00000765800000-0x00000765c00000
>>> (XEN) heap[node=0][zone=23][order=11] 0x00000765000000-0x00000765800000
>>> (XEN) heap[node=0][zone=23][order=12] 0x00000764000000-0x00000765000000
>>> (XEN) heap[node=0][zone=23][order=14] 0x00000760000000-0x00000764000000
>>> (XEN) heap[node=0][zone=23][order=17] 0x00000740000000-0x00000760000000
>>> (XEN) heap[node=0][zone=23][order=18] 0x00000540000000-0x00000580000000
>>> (XEN) heap[node=0][zone=23][order=18] 0x00000500000000-0x00000540000000
>>> (XEN) heap[node=0][zone=23][order=18] 0x00000640000000-0x00000680000000
>>> (XEN) heap[node=0][zone=23][order=18] 0x00000600000000-0x00000640000000
>>> (XEN) heap[node=0][zone=23][order=18] 0x00000700000000-0x00000740000000
>>>
>>> Yes, you already pointed out this disadvantage, so if it is an acceptable
>>> downside, I am absolutely OK.
>>>
>>>
>>> 3. Common code updates. There is a question how to properly make a
>>> connection between common allocator internals and Arm's code for creating
>>> DT. I didn’t come up with anything better
>>> than creating for_each_avail_page() for invoking a callback with page and
>>> its order.
>>>
>>> **********
>>>
>>> Below the proposed changes on top of the initial patch, would this be
>>> acceptable in general?
>>>
>>> diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
>>> index 523eb19..1e58fc5 100644
>>> --- a/xen/arch/arm/domain_build.c
>>> +++ b/xen/arch/arm/domain_build.c
>>> @@ -753,16 +753,33 @@ static int __init add_ext_regions(unsigned long s,
>>> unsigned long e, void *data)
>>>       return 0;
>>>   }
>>>
>>> +static int __init add_unalloc_mem(struct page_info *page, unsigned int
>>> order,
>>> +                                  void *data)
>>> +{
>>> +    struct rangeset *unalloc_mem = data;
>>> +    paddr_t start, end;
>>> +    int res;
>>> +
>>> +    start = page_to_maddr(page);
>>> +    end = start + pfn_to_paddr(1UL << order);
>>> +    res = rangeset_add_range(unalloc_mem, start, end - 1);
>>> +    if ( res )
>>> +    {
>>> +        printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>>> +               start, end);
>>> +        return res;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>   #define EXT_REGION_START   0x40000000ULL
>>>   #define EXT_REGION_END     0x80003fffffffULL
>>>
>>> -static int __init find_unallocated_memory(const struct kernel_info *kinfo,
>>> -                                          struct meminfo *ext_regions)
>>> +static int __init find_unallocated_memory(struct meminfo *ext_regions)
>>>   {
>>> -    const struct meminfo *assign_mem = &kinfo->mem;
>>>       struct rangeset *unalloc_mem;
>>>       paddr_t start, end;
>>> -    unsigned int i;
>>>       int res;
>>>
>>>       dt_dprintk("Find unallocated memory for extended regions\n");
>>> @@ -771,59 +788,9 @@ static int __init find_unallocated_memory(const struct
>>> kernel_info *kinfo,
>>>       if ( !unalloc_mem )
>>>           return -ENOMEM;
>>>
>>> -    /* Start with all available RAM */
>>> -    for ( i = 0; i < bootinfo.mem.nr_banks; i++ )
>>> -    {
>>> -        start = bootinfo.mem.bank[i].start;
>>> -        end = bootinfo.mem.bank[i].start + bootinfo.mem.bank[i].size;
>>> -        res = rangeset_add_range(unalloc_mem, start, end - 1);
>>> -        if ( res )
>>> -        {
>>> -            printk(XENLOG_ERR "Failed to add: %#"PRIx64"->%#"PRIx64"\n",
>>> -                   start, end);
>>> -            goto out;
>>> -        }
>>> -    }
>>> -
>>> -    /* Remove RAM assigned to Dom0 */
>>> -    for ( i = 0; i < assign_mem->nr_banks; i++ )
>>> -    {
>>> -        start = assign_mem->bank[i].start;
>>> -        end = assign_mem->bank[i].start + assign_mem->bank[i].size;
>>> -        res = rangeset_remove_range(unalloc_mem, start, end - 1);
>>> -        if ( res )
>>> -        {
>>> -            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
>>> -                   start, end);
>>> -            goto out;
>>> -        }
>>> -    }
>>> -
>>> -    /* Remove reserved-memory regions */
>>> -    for ( i = 0; i < bootinfo.reserved_mem.nr_banks; i++ )
>>> -    {
>>> -        start = bootinfo.reserved_mem.bank[i].start;
>>> -        end = bootinfo.reserved_mem.bank[i].start +
>>> -            bootinfo.reserved_mem.bank[i].size;
>>> -        res = rangeset_remove_range(unalloc_mem, start, end - 1);
>>> -        if ( res )
>>> -        {
>>> -            printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
>>> -                   start, end);
>>> -            goto out;
>>> -        }
>>> -    }
>>> -
>>> -    /* Remove grant table region */
>>> -    start = kinfo->gnttab_start;
>>> -    end = kinfo->gnttab_start + kinfo->gnttab_size;
>>> -    res = rangeset_remove_range(unalloc_mem, start, end - 1);
>>> +    res = for_each_avail_page(add_unalloc_mem, unalloc_mem);
>>>       if ( res )
>>> -    {
>>> -        printk(XENLOG_ERR "Failed to remove: %#"PRIx64"->%#"PRIx64"\n",
>>> -               start, end);
>>>           goto out;
>>> -    }
>>>
>>>       start = EXT_REGION_START;
>>>       end = min((1ULL << p2m_ipa_bits) - 1, EXT_REGION_END);
>>> @@ -840,8 +807,7 @@ out:
>>>       return res;
>>>   }
>>>
>>> -static int __init find_memory_holes(const struct kernel_info *kinfo,
>>> -                                    struct meminfo *ext_regions)
>>> +static int __init find_memory_holes(struct meminfo *ext_regions)
>>>   {
>>>       struct dt_device_node *np;
>>>       struct rangeset *mem_holes;
>>> @@ -961,9 +927,9 @@ static int __init make_hypervisor_node(struct domain *d,
>>>       else
>>>       {
>>>           if ( !is_iommu_enabled(d) )
>>> -            res = find_unallocated_memory(kinfo, ext_regions);
>>> +            res = find_unallocated_memory(ext_regions);
>>>           else
>>> -            res = find_memory_holes(kinfo, ext_regions);
>>> +            res = find_memory_holes(ext_regions);
>>>
>>>           if ( res )
>>>               printk(XENLOG_WARNING "Failed to allocate extended regions\n");
>>> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
>>> index 8fad139..7cd1020 100644
>>> --- a/xen/common/page_alloc.c
>>> +++ b/xen/common/page_alloc.c
>>> @@ -1572,6 +1572,40 @@ static int reserve_heap_page(struct page_info *pg)
>>>
>>>   }
>>>
>>> +/* TODO heap_lock? */
>>> +int for_each_avail_page(int (*cb)(struct page_info *, unsigned int, void
>>> *),
>>> +                        void *data)
>>> +{
>>> +    unsigned int node, zone, order;
>>> +    int ret;
>>> +
>>> +    for ( node = 0; node < MAX_NUMNODES; node++ )
>>> +    {
>>> +        if ( !avail[node] )
>>> +            continue;
>>> +
>>> +        for ( zone = 0; zone < NR_ZONES; zone++ )
>>> +        {
>>> +            for ( order = 0; order <= MAX_ORDER; order++ )
>>> +            {
>>> +                struct page_info *head, *tmp;
>>> +
>>> +                if ( page_list_empty(&heap(node, zone, order)) )
>>> +                    continue;
>>> +
>>> +                page_list_for_each_safe ( head, tmp, &heap(node, zone,
>>> order) )
>>> +                {
>>> +                    ret = cb(head, order, data);
>>> +                    if ( ret )
>>> +                        return ret;
>>> +                }
>>> +            }
>>> +        }
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>   int offline_page(mfn_t mfn, int broken, uint32_t *status)
>>>   {
>>>       unsigned long old_info = 0;
>>> diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
>>> index 667f9da..64dd3e2 100644
>>> --- a/xen/include/xen/mm.h
>>> +++ b/xen/include/xen/mm.h
>>> @@ -123,6 +123,9 @@ unsigned int online_page(mfn_t mfn, uint32_t *status);
>>>   int offline_page(mfn_t mfn, int broken, uint32_t *status);
>>>   int query_page_offline(mfn_t mfn, uint32_t *status);
>>>
>>> +int for_each_avail_page(int (*cb)(struct page_info *, unsigned int, void
>>> *),
>>> +                        void *data);
>>> +
>>>   void heap_init_late(void);
>>>
>>>   int assign_pages(
>>
>> I am sorry, but may I please clarify regarding that? Whether we will go this
>> new direction (free memory in page allocator) or leave things as they are
>> (bootinfo.mem - kinfo->mem - bootinfo.reserved_mem - kinfo->gnttab). This is
>> only one still unclear moment to me in current patch before preparing V3.
> I think both approaches are fine. Your original approach leads to better
> results in terms of extended regions but the difference is not drastic.
> The original approach requires more code (bad) but probably less CPU
> cycles (good).
>
> Personally I am fine either way but as Julien was the one to provide
> feedback on this it would be best to get his opinion.
>
> But in the meantime I think it is OK to send a v3 so that we can review
> the rest.


OK, thank you for the clarification.

I am also fine either way, I just wanted to know which one to pick. 
Anyway, I think I will be able to make updates later on.


-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2021-09-23 17:44 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-10 18:18 [PATCH V2 0/3] Add handling of extended regions (safe ranges) on Arm (Was "xen/memory: Introduce a hypercall to provide unallocated space") Oleksandr Tyshchenko
2021-09-10 18:18 ` [PATCH V2 1/3] xen: Introduce "gpaddr_bits" field to XEN_SYSCTL_physinfo Oleksandr Tyshchenko
2021-09-16 14:49   ` Jan Beulich
2021-09-16 15:43     ` Oleksandr
2021-09-16 15:47       ` Jan Beulich
2021-09-16 16:05         ` Oleksandr
2021-09-10 18:18 ` [PATCH V2 2/3] xen/arm: Add handling of extended regions for Dom0 Oleksandr Tyshchenko
2021-09-14  0:55   ` Stefano Stabellini
2021-09-15 19:10     ` Oleksandr
2021-09-15 21:21       ` Stefano Stabellini
2021-09-16 20:57         ` Oleksandr
2021-09-16 21:30           ` Stefano Stabellini
2021-09-17  7:28             ` Oleksandr
2021-09-17 14:08       ` Oleksandr
2021-09-17 15:52       ` Julien Grall
2021-09-17 20:13         ` Oleksandr
2021-09-17 15:48   ` Julien Grall
2021-09-17 19:51     ` Oleksandr
2021-09-17 21:56       ` Stefano Stabellini
2021-09-17 22:37         ` Stefano Stabellini
2021-09-19 14:34           ` Julien Grall
2021-09-19 20:18             ` Oleksandr
2021-09-20 23:21               ` Stefano Stabellini
2021-09-21 18:14                 ` Oleksandr
2021-09-21 22:00                   ` Stefano Stabellini
2021-09-22 18:25                     ` Oleksandr
2021-09-22 20:50                       ` Stefano Stabellini
2021-09-23 10:10                         ` Oleksandr
2021-09-20 23:55             ` Stefano Stabellini
2021-09-21 19:43         ` Oleksandr
2021-09-22 18:18           ` Oleksandr
2021-09-22 21:05             ` Stefano Stabellini
2021-09-23 10:11               ` Oleksandr
2021-09-18 16:59       ` Oleksandr
2021-09-23 10:41         ` Oleksandr
2021-09-23 16:38           ` Stefano Stabellini
2021-09-23 17:44             ` Oleksandr
2021-09-19 14:00       ` Julien Grall
2021-09-19 17:59         ` Oleksandr
2021-09-10 18:18 ` [PATCH V2 3/3] libxl/arm: Add handling of extended regions for DomU Oleksandr Tyshchenko
2021-09-16 22:35   ` Stefano Stabellini
2021-09-20 20:07     ` Oleksandr
2021-09-21 17:35       ` Oleksandr

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.