From: Igor Mammedov <imammedo@redhat.com>
To: Joao Martins <joao.m.martins@oracle.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Richard Henderson <richard.henderson@linaro.org>,
	qemu-devel@nongnu.org, Daniel Jordan <daniel.m.jordan@oracle.com>,
	David Edmondson <david.edmondson@oracle.com>,
	Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
Date: Mon, 28 Jun 2021 16:32:32 +0200
Message-ID: <20210628163232.5669563c@redhat.com>
In-Reply-To: <a8668c8d-22c5-6168-3f29-3f5055d99c32@oracle.com>

On Wed, 23 Jun 2021 14:04:19 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/23/21 12:39 PM, Igor Mammedov wrote:
> > On Wed, 23 Jun 2021 10:37:38 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> On 6/23/21 8:11 AM, Igor Mammedov wrote:  
> >>> On Tue, 22 Jun 2021 16:49:00 +0100
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>     
> >>>> It is assumed that the whole GPA space is available to be
> >>>> DMA addressable, within a given address space limit. Since
> >>>> kernel v5.4 that is no longer true, and VFIO will validate
> >>>> whether the selected IOVA is indeed valid, i.e. not reserved
> >>>> by the IOMMU on behalf of specific devices, or platform-defined.
> >>>>
> >>>> AMD systems with an IOMMU are an example of such platforms,
> >>>> and in particular may export only these ranges as allowed:
> >>>>
> >>>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> >>>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> >>>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16EB)
> >>>>
> >>>> We already account for the 4G hole, but if the guest is big
> >>>> enough we will fail to allocate a guest larger than ~1010G,
> >>>> given the ~12G hole at the 1Tb boundary reserved for HyperTransport.
> >>>>
> >>>> When creating the region above 4G, take into account which
> >>>> IOVAs are allowed, by defining the known allowed ranges
> >>>> and searching for the next free IOVA range. When finding an
> >>>> invalid IOVA, we mark it as reserved and proceed to the
> >>>> next allowed IOVA region.
> >>>>
> >>>> After accounting for the 1Tb hole on AMD hosts, mtree should
> >>>> look like:
> >>>>
> >>>> 0000000100000000-000000fcffffffff (prio 0, i/o):
> >>>> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
> >>>> 0000010000000000-000001037fffffff (prio 0, i/o):
> >>>> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff    
> >>>
> >>> why not push the whole ram-above-4g region above the 1Tb mark
> >>> when RAM is sufficiently large (regardless of the host used),
> >>> instead of creating yet another hole and all the complexity it
> >>> brings along?
> >>>     
> >>
> >> There's the problem with CMOS, which describes memory above 4G; that's
> >> part of the reason I cap it to 1TB minus the reserved range, i.e. for
> >> AMD, CMOS would only describe up to 1T.
> >>
> >> But if we don't need to care about that, then it's an option, I suppose.
> > we probably do not care about CMOS with such large RAM,
> > as long as QEMU generates a correct E820 (CMOS mattered only with
> > old SeaBIOS, which used it for generating the memory map)
> >   
> OK, good to know.
> 
> Any extension of CMOS would probably also be out of spec.
RAM size in CMOS is approximate at best, as it doesn't account for all
holes; anyway, it's irrelevant if we aren't changing the RAM size.
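(For reference, this is roughly how the above-4G size is encoded today,
modelled on pc_cmos_init(); variable names aside, the point is that
three bytes of 64KiB units top out at 2^24 * 64KiB = 1TiB anyway:)

  /* above-4G RAM size in CMOS: 24 bits of 64KiB units, so at most 1TiB */
  uint64_t val = above_4g_mem_size / 65536;
  rtc_set_memory(s, 0x5b, val);
  rtc_set_memory(s, 0x5c, val >> 8);
  rtc_set_memory(s, 0x5d, val >> 16);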

> >> We would waste 1Tb of address space because of 12G, and btw the logic
> >> here is not so different from the 4G hole; in fact, it could probably
> >> be shared with it.
> > the main reason I'm looking for an alternative is the complexity
> > that making a hole brings in. At this point, we can't do anything
> > about the 4G hole as it's already there, but we can try to avoid that
> > for high RAM and keep the rules there as simple as they are now.
> >   
> Right. But for what it's worth, that complexity is spread across two parts:
> 
> 1) dealing with a sparse RAM model (with more than one hole)
> 
> 2) offsetting everything else that assumes a linear RAM map.
> 
> I don't think that even if we shift the start of RAM to after the 1TB
> boundary we would get away without solving item 2 -- which personally is
> where I find this a tad more hairy. So it would probably make this patch
> less complex, but not vary much in how spread out the changes get.

you are already shifting the hotplug area behind 1Tb in [2/6],
so to be consistent just do the same for the 4Gb+ alias as well.
That will automatically solve the issues in patch 4/6, and all
that at the cost of several lines in one place vs this 200+ LOC series.
Both approaches are a kludge, but shifting everything over the 1Tb
mark is the simplest one.
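As a rough sketch (host_is_amd() and the constants are placeholders,
not the actual patch):

  /* start the whole above-4G alias past the 1Tb HyperTransport hole
   * whenever it would otherwise cross into it */
  #define AMD_HT_START 0xfd00000000ULL   /* reserved range begins ~1012G */
  #define ABOVE_1TB    0x10000000000ULL  /* 1Tb */

  hwaddr start = 0x100000000ULL;         /* 4G */
  if (host_is_amd() && start + above_4g_mem_size > AMD_HT_START) {
      start = ABOVE_1TB;
  }
  memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
                           machine_ram, below_4g_mem_size,
                           above_4g_mem_size);
  memory_region_add_subregion(system_memory, start, ram_above_4g);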

If there were/is actual demand for sparse/non-linear RAM layouts,
I'd switch to pc-dimm based RAM (i.e. generalize it to handle
RAM below 4G and let users specify their own memory map;
see below for more).

> > Also, partitioning/splitting main RAM is one of the things that
> > gets in the way of converting it to the pc-dimm model.
> >   
> Can you expand on that? (a link to a series is enough)
There is no series so far, only a rough idea.

Current initial RAM, with its rather arbitrary splitting rules,
doesn't map very well onto the pc-dimm device model.
Unfortunately, I don't see a way to convert initial RAM to the
pc-dimm model without breaking CLI/ABI/migration.

As a way forward, we can restrict the legacy layout to old
machine versions and switch to pc-dimm based initial RAM
for new machine versions.
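(Gating would follow the usual versioned-machine pattern; the
'legacy_initial_ram' flag below is made up for illustration:)

  /* older versioned machine types chain to the newer options and
   * then override; 'legacy_initial_ram' is illustrative only */
  static void pc_machine_old_options(MachineClass *m)
  {
      pc_machine_new_options(m);
      m->legacy_initial_ram = true;  /* keep the split initial-RAM layout */
  }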

Then users will be able to specify the GPA where each pc-dimm shall
be mapped. For reserving GPA ranges we can change the current GPA
allocator. Alternatively, a more flexible approach could be
a dummy memory device that consumes a reserved range so
that no one can assign a pc-dimm there; this way users can define
arbitrary (subject to alignment restrictions we put on them) sparse
memory layouts without modifying QEMU for yet another hole.
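For the allocator side, a rough sketch (the Range type, reserved[]
contents and gpa_alloc() are illustrative; QEMU_ALIGN_UP/ARRAY_SIZE
are the stock osdep.h helpers):

  /* first-fit GPA allocation that skips reserved ranges;
   * reserved[] must be sorted by start address */
  typedef struct { uint64_t start, size; } Range;

  static const Range reserved[] = {
      { 0xfd00000000ULL, 0x300000000ULL },  /* 1Tb HyperTransport hole */
  };

  static uint64_t gpa_alloc(uint64_t hint, uint64_t size, uint64_t align)
  {
      uint64_t addr = QEMU_ALIGN_UP(hint, align);
      for (size_t i = 0; i < ARRAY_SIZE(reserved); i++) {
          if (addr < reserved[i].start + reserved[i].size &&
              reserved[i].start < addr + size) {
              /* candidate overlaps a reserved range, retry past it */
              addr = QEMU_ALIGN_UP(reserved[i].start + reserved[i].size,
                                   align);
          }
      }
      return addr;
  }

A user-defined sparse layout would then just be pc-dimms at explicit
addresses, e.g. (sizes and addresses illustrative):

  -m 4G,maxmem=2T,slots=8 \
  -object memory-backend-ram,id=m0,size=512G \
  -device pc-dimm,memdev=m0,addr=0x10000000000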


> > Losing 1Tb of address space might be acceptable on a host
> > that can handle such amounts of RAM
> >   
> 



Thread overview: 38+ messages
2021-06-22 15:48 [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Joao Martins
2021-06-22 15:49 ` [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary Joao Martins
2021-06-23  7:11   ` Igor Mammedov
2021-06-23  9:37     ` Joao Martins
2021-06-23 11:39       ` Igor Mammedov
2021-06-23 13:04         ` Joao Martins
2021-06-28 14:32           ` Igor Mammedov [this message]
2021-08-06 10:41             ` Joao Martins
2021-06-23  9:03   ` Igor Mammedov
2021-06-23  9:51     ` Joao Martins
2021-06-23 12:09       ` Igor Mammedov
2021-06-23 13:07         ` Joao Martins
2021-06-28 13:25           ` Igor Mammedov
2021-06-28 13:43             ` Joao Martins
2021-06-28 15:21               ` Igor Mammedov
2021-06-24  9:32     ` Dr. David Alan Gilbert
2021-06-28 14:42       ` Igor Mammedov
2021-06-22 15:49 ` [PATCH RFC 2/6] i386/pc: Round up the hotpluggable memory within valid IOVA ranges Joao Martins
2021-06-22 15:49 ` [PATCH RFC 3/6] pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary Joao Martins
2021-06-22 15:49 ` [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space Joao Martins
2021-06-23 12:30   ` Igor Mammedov
2021-06-23 13:22     ` Joao Martins
2021-06-28 15:37       ` Igor Mammedov
2021-06-23 16:33     ` Laszlo Ersek
2021-06-25 17:19       ` Joao Martins
2021-06-22 15:49 ` [PATCH RFC 5/6] i386/acpi: Fix SRAT ranges in accordance to usable IOVA Joao Martins
2021-06-22 15:49 ` [PATCH RFC 6/6] i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs Joao Martins
2021-06-23  9:18   ` Igor Mammedov
2021-06-23  9:59     ` Joao Martins
2021-06-22 21:16 ` [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Alex Williamson
2021-06-23  7:40   ` David Edmondson
2021-06-23 19:13     ` Alex Williamson
2021-06-23  9:30   ` Joao Martins
2021-06-23 11:58     ` Igor Mammedov
2021-06-23 13:15       ` Joao Martins
2021-06-23 19:27     ` Alex Williamson
2021-06-24  9:22       ` Dr. David Alan Gilbert
2021-06-25 16:54       ` Joao Martins
