From: Igor Mammedov <imammedo@redhat.com>
To: Joao Martins <joao.m.martins@oracle.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Richard Henderson <richard.henderson@linaro.org>,
	qemu-devel@nongnu.org, Daniel Jordan <daniel.m.jordan@oracle.com>,
	David Edmondson <david.edmondson@oracle.com>,
	Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
Date: Mon, 28 Jun 2021 16:32:32 +0200
Message-ID: <20210628163232.5669563c@redhat.com>
In-Reply-To: <a8668c8d-22c5-6168-3f29-3f5055d99c32@oracle.com>

On Wed, 23 Jun 2021 14:04:19 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/23/21 12:39 PM, Igor Mammedov wrote:
> > On Wed, 23 Jun 2021 10:37:38 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> On 6/23/21 8:11 AM, Igor Mammedov wrote:  
> >>> On Tue, 22 Jun 2021 16:49:00 +0100
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>     
> >>>> It is assumed that the whole GPA space is available to be
> >>>> DMA addressable, within a given address space limit. Since
> >>>> kernel v5.4 that is no longer true, and VFIO will validate
> >>>> whether the selected IOVA is indeed valid, i.e. not reserved
> >>>> by the IOMMU on behalf of specific devices, or platform-defined.
> >>>>
> >>>> AMD systems with an IOMMU are an example of such platforms,
> >>>> and in particular may export only these ranges as allowed:
> >>>>
> >>>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> >>>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> >>>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16EB)
> >>>>
> >>>> We already account for the 4G hole, but if the guest is big
> >>>> enough we will fail to allocate a guest larger than ~1010G,
> >>>> given the ~12G hole at the 1Tb boundary reserved for HyperTransport.
> >>>>
> >>>> When creating the region above 4G, take into account which
> >>>> IOVAs are allowed, by defining the known allowed ranges
> >>>> and searching for the next free IOVA range. When finding an
> >>>> invalid IOVA, we mark it as reserved and proceed to the
> >>>> next allowed IOVA region.
> >>>>
> >>>> After accounting for the 1Tb hole on AMD hosts, mtree should
> >>>> look like:
> >>>>
> >>>> 0000000100000000-000000fcffffffff (prio 0, i/o):
> >>>> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
> >>>> 0000010000000000-000001037fffffff (prio 0, i/o):
> >>>> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff    
> >>>
> >>> why not push the whole ram-above-4g region above the 1Tb mark
> >>> when RAM is sufficiently large (regardless of the host used),
> >>> instead of creating yet another hole and all the complexity it
> >>> brings along?
> >>>     
> >>
> >> There's the problem with CMOS, which describes memory above 4G; that's
> >> part of the reason I cap it to 1TB minus the reserved range, i.e. for
> >> AMD, CMOS would only describe up to 1T.
> >>
> >> But if we don't need to care about that, then it's an option, I suppose.
> > we probably do not care about CMOS with such large RAM,
> > as long as QEMU generates a correct E820 (CMOS mattered only with
> > old SeaBIOS, which used it for generating the memory map)
> >   
> OK, good to know.
> 
> Any extension of CMOS would probably also be out of spec.
RAM size in CMOS is approximate at best, as it doesn't account for all
holes; anyway, it's irrelevant if we aren't changing the RAM size.
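(For reference, this is roughly how the above-4G size is encoded today,
modelled on pc_cmos_init(); variable names aside, the point is that
three bytes of 64KiB units top out at 2^24 * 64KiB = 1TiB anyway:)

  /* above-4G RAM size in CMOS: 24 bits of 64KiB units, so at most 1TiB */
  uint64_t val = above_4g_mem_size / 65536;
  rtc_set_memory(s, 0x5b, val);
  rtc_set_memory(s, 0x5c, val >> 8);
  rtc_set_memory(s, 0x5d, val >> 16);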

> >> We would waste 1Tb of address space because of 12G, and btw the logic
> >> here is not so different from the 4G hole; in fact, it could probably
> >> be shared with it.
> > the main reason I'm looking for an alternative is the complexity
> > that making a hole brings in. At this point, we can't do anything
> > about the 4G hole as it's already there, but we can try to avoid that
> > for high RAM and keep the rules there as simple as they are now.
> >   
> Right. But for what it's worth, that complexity is spread across two parts:
> 
> 1) dealing with a sparse RAM model (with more than one hole)
> 
> 2) offsetting everything else that assumes a linear RAM map.
> 
> I don't think that even if we shift the start of RAM to after the 1TB
> boundary we would get away without solving item 2 -- which personally is
> where I find this a tad more hairy. So it would probably make this patch
> less complex, but not vary much in how spread out the changes get.

you are already shifting the hotplug area behind 1Tb in [2/6],
so to be consistent just do the same for the 4Gb+ alias as well.
That will automatically solve the issues in patch 4/6, and all
that at the cost of several lines in one place vs this 200+ LOC series.
Both approaches are a kludge, but shifting everything over the 1Tb
mark is the simplest one.
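As a rough sketch (host_is_amd() and the constants are placeholders,
not the actual patch):

  /* start the whole above-4G alias past the 1Tb HyperTransport hole
   * whenever it would otherwise cross into it */
  #define AMD_HT_START 0xfd00000000ULL   /* reserved range begins ~1012G */
  #define ABOVE_1TB    0x10000000000ULL  /* 1Tb */

  hwaddr start = 0x100000000ULL;         /* 4G */
  if (host_is_amd() && start + above_4g_mem_size > AMD_HT_START) {
      start = ABOVE_1TB;
  }
  memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
                           machine_ram, below_4g_mem_size,
                           above_4g_mem_size);
  memory_region_add_subregion(system_memory, start, ram_above_4g);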

If there were/is actual demand for sparse/non-linear RAM layouts,
I'd switch to pc-dimm based RAM (i.e. generalize it to handle
RAM below 4G and let users specify their own memory map;
see below for more).

> > Also, partitioning/splitting main RAM is one of the things that
> > gets in the way of converting it to the pc-dimm model.
> >   
> Can you expand on that? (a link to a series is enough)
There is no series so far, only a rough idea.

Current initial RAM, with its rather arbitrary splitting rules,
doesn't map very well onto the pc-dimm device model.
Unfortunately, I don't see a way to convert initial RAM to the
pc-dimm model without breaking CLI/ABI/migration.

As a way forward, we can restrict the legacy layout to old
machine versions and switch to pc-dimm based initial RAM
for new machine versions.
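(Gating would follow the usual versioned-machine pattern; the
'legacy_initial_ram' flag below is made up for illustration:)

  /* older versioned machine types chain to the newer options and
   * then override; 'legacy_initial_ram' is illustrative only */
  static void pc_machine_old_options(MachineClass *m)
  {
      pc_machine_new_options(m);
      m->legacy_initial_ram = true;  /* keep the split initial-RAM layout */
  }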

Then users will be able to specify the GPA where each pc-dimm shall
be mapped. For reserving GPA ranges we can change the current GPA
allocator. Alternatively, a more flexible approach could be
a dummy memory device that consumes a reserved range so
that no one can assign a pc-dimm there; this way users can define
arbitrary (subject to alignment restrictions we put on them) sparse
memory layouts without modifying QEMU for yet another hole.
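For the allocator side, a rough sketch (the Range type, reserved[]
contents and gpa_alloc() are illustrative; QEMU_ALIGN_UP/ARRAY_SIZE
are the stock osdep.h helpers):

  /* first-fit GPA allocation that skips reserved ranges;
   * reserved[] must be sorted by start address */
  typedef struct { uint64_t start, size; } Range;

  static const Range reserved[] = {
      { 0xfd00000000ULL, 0x300000000ULL },  /* 1Tb HyperTransport hole */
  };

  static uint64_t gpa_alloc(uint64_t hint, uint64_t size, uint64_t align)
  {
      uint64_t addr = QEMU_ALIGN_UP(hint, align);
      for (size_t i = 0; i < ARRAY_SIZE(reserved); i++) {
          if (addr < reserved[i].start + reserved[i].size &&
              reserved[i].start < addr + size) {
              /* candidate overlaps a reserved range, retry past it */
              addr = QEMU_ALIGN_UP(reserved[i].start + reserved[i].size,
                                   align);
          }
      }
      return addr;
  }

A user-defined sparse layout would then just be pc-dimms at explicit
addresses, e.g. (sizes and addresses illustrative):

  -m 4G,maxmem=2T,slots=8 \
  -object memory-backend-ram,id=m0,size=512G \
  -device pc-dimm,memdev=m0,addr=0x10000000000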


> > Losing 1Tb of address space might be acceptable on a host
> > that can handle such amounts of RAM
> >   
> 



Thread overview: 38+ messages
2021-06-22 15:48 [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Joao Martins
2021-06-22 15:49 ` [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary Joao Martins
2021-06-23  7:11   ` Igor Mammedov
2021-06-23  9:37     ` Joao Martins
2021-06-23 11:39       ` Igor Mammedov
2021-06-23 13:04         ` Joao Martins
2021-06-28 14:32           ` Igor Mammedov [this message]
2021-08-06 10:41             ` Joao Martins
2021-06-23  9:03   ` Igor Mammedov
2021-06-23  9:51     ` Joao Martins
2021-06-23 12:09       ` Igor Mammedov
2021-06-23 13:07         ` Joao Martins
2021-06-28 13:25           ` Igor Mammedov
2021-06-28 13:43             ` Joao Martins
2021-06-28 15:21               ` Igor Mammedov
2021-06-24  9:32     ` Dr. David Alan Gilbert
2021-06-28 14:42       ` Igor Mammedov
2021-06-22 15:49 ` [PATCH RFC 2/6] i386/pc: Round up the hotpluggable memory within valid IOVA ranges Joao Martins
2021-06-22 15:49 ` [PATCH RFC 3/6] pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary Joao Martins
2021-06-22 15:49 ` [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space Joao Martins
2021-06-23 12:30   ` Igor Mammedov
2021-06-23 13:22     ` Joao Martins
2021-06-28 15:37       ` Igor Mammedov
2021-06-23 16:33     ` Laszlo Ersek
2021-06-25 17:19       ` Joao Martins
2021-06-22 15:49 ` [PATCH RFC 5/6] i386/acpi: Fix SRAT ranges in accordance to usable IOVA Joao Martins
2021-06-22 15:49 ` [PATCH RFC 6/6] i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs Joao Martins
2021-06-23  9:18   ` Igor Mammedov
2021-06-23  9:59     ` Joao Martins
2021-06-22 21:16 ` [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Alex Williamson
2021-06-23  7:40   ` David Edmondson
2021-06-23 19:13     ` Alex Williamson
2021-06-23  9:30   ` Joao Martins
2021-06-23 11:58     ` Igor Mammedov
2021-06-23 13:15       ` Joao Martins
2021-06-23 19:27     ` Alex Williamson
2021-06-24  9:22       ` Dr. David Alan Gilbert
2021-06-25 16:54       ` Joao Martins
