From: Joao Martins <joao.m.martins@oracle.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Richard Henderson <richard.henderson@linaro.org>,
	qemu-devel@nongnu.org, Daniel Jordan <daniel.m.jordan@oracle.com>,
	David Edmondson <david.edmondson@oracle.com>,
	Auger Eric <eric.auger@redhat.com>,
	Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Igor Mammedov <imammedo@redhat.com>
Subject: Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
Date: Fri, 25 Jun 2021 17:54:41 +0100	[thread overview]
Message-ID: <e236e766-d3ae-214f-4982-5ec38e3740f7@oracle.com> (raw)
In-Reply-To: <20210623132736.1c7b326a.alex.williamson@redhat.com>



On 6/23/21 8:27 PM, Alex Williamson wrote:
> On Wed, 23 Jun 2021 10:30:29 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 6/22/21 10:16 PM, Alex Williamson wrote:
>>> On Tue, 22 Jun 2021 16:48:59 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>   
>>>> Hey,
>>>>
>>>> This series lets QEMU properly spawn i386 guests with >= 1Tb of memory with VFIO, particularly
>>>> when running on AMD systems with an IOMMU.
>>>>
>>>> Since Linux v5.4, VFIO validates whether the IOVA in a DMA_MAP ioctl is valid and
>>>> returns -EINVAL in those cases. On x86, Intel hosts aren't particularly
>>>> affected by this extra validation. But AMD systems with an IOMMU have a hole at
>>>> the 1TB boundary which is *reserved* for HyperTransport I/O addresses, located
>>>> at FD_0000_0000h - FF_FFFF_FFFFh. See the IOMMU manual [1], specifically
>>>> section '2.1.2 IOMMU Logical Topology', Table 3, for what those addresses mean.
>>>>
>>>> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>>>> -EINVAL, consequently failing the creation of guests bigger than 1010G. Example
>>>> of the failure:
>>>>
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
>>>> 	failed to setup container for group 258: memory listener initialization failed:
>>>> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>>>>
>>>> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
>>>> to do, and it could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST) or
>>>> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2]),
>>>> as documented in the links below.
>>>>
>>>> This series tries to address that by dealing with this AMD-specific 1Tb hole,
>>>> similarly to how we deal with the 4G hole today on x86 in general. It is split
>>>> as follows:
>>>>
>>>> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
>>>>            which also gets used in other parts of pc/acpi besides MR creation. The
>>>>            allowed IOVA *only* changes if it's an AMD host, so no change for
>>>>            Intel. We walk the allowed ranges for memory above 4G, and
>>>>            add an E820_RESERVED type every time we find a hole (which is at the
>>>>            1TB boundary).
>>>> 	   
>>>> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
>>>> 	   understand that it doesn't cover the non-x86 host case running TCG.
>>>>
>>>>            Additionally, as an alternative to the hardcoded ranges we use today,
>>>>            VFIO could advertise the platform's valid IOVA ranges without necessarily
>>>>            requiring a PCI device to be added to the vfio container. That would mean
>>>>            fetching the valid IOVA ranges from VFIO rather than hardcoding them as
>>>>            we do today. But sadly, that wouldn't work for older hypervisors.
>>>
>>>
>>> $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
>>> 0x00000000fee00000 0x00000000feefffff msi
>>> 0x000000fd00000000 0x000000ffffffffff reserved
>>>   
>> Yeap, I am aware.
>>
>> The VFIO advertising extension was suggested just because we already advertise the above
>> info, albeit only behind a non-empty vfio container; we seem to use that for example in
>> collect_usable_iova_ranges().
> 
> VFIO can't guess what groups you'll use to mark reserved ranges in an
> empty container.  Each group might have unique ranges.  A container
> enforcing ranges unrelated to the groups/devices in use doesn't make
> sense.
>  
Hmm OK, I see.

The suggestion/point was made because the AMD IOMMU seems to mark these as reserved, for
both the MSI range and the HyperTransport hole, regardless of device/group
specifics. See amd_iommu_get_resv_regions().

So I thought that some other means of advertising platform-wide reserved ranges would be
more appropriate than replicating the same info across the reserved_regions sysfs
entries of the various groups.
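
For reference, the info Alex's grep shows can be collected from sysfs by any userspace tool
without opening a vfio container at all. A minimal sketch in C (the resv_range struct and
the collect_host_resv_regions() name are made up for illustration here, not existing QEMU
or kernel code):

/*
 * Walk /sys/kernel/iommu_groups/<N>/reserved_regions and collect the
 * host's reserved IOVA ranges.  Entries repeated across groups are not
 * deduplicated here.
 */
#include <glob.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct resv_range {
    uint64_t start, end;
    char type[32];
};

static int collect_host_resv_regions(struct resv_range *out, int max)
{
    glob_t g;
    int n = 0;

    if (glob("/sys/kernel/iommu_groups/*/reserved_regions", 0, NULL, &g)) {
        return 0;
    }
    for (size_t i = 0; i < g.gl_pathc && n < max; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");

        if (!f) {
            continue;
        }
        while (n < max &&
               fscanf(f, "%" SCNx64 " %" SCNx64 " %31s",
                      &out[n].start, &out[n].end, out[n].type) == 3) {
            n++;
        }
        fclose(f);
    }
    globfree(&g);
    return n;
}

On the AMD host above this would return the msi range plus the 0xfd00000000-0xffffffffff
HyperTransport hole, once per group.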

>>> Ideally we might take that into account on all hosts, but of course
>>> then we run into massive compatibility issues when we consider
>>> migration.  We run into similar problems when people try to assign
>>> devices to non-x86 TCG hosts, where the arch doesn't have a natural
>>> memory hole overlapping the msi range.
>>>
>>> The issue here is similar to trying to find a set of supported CPU
>>> flags across hosts, QEMU only has visibility to the host where it runs,
>>> an upper level tool needs to be able to pass through information about
>>> compatibility to all possible migration targets.  
>>
>> I agree with your general sentiment (and idea), but are we sure this is really something as
>> dynamic, and as in need of a common denominator, as CPU features? The memory map looks to be
>> deeply embedded in the devices (ARM) or machine model (x86) that we pass in and doesn't
>> change very often. pc/q35 is one very good example, because it hasn't changed since its
>> inception [a decade?] (and this limitation only exists for multi-socket AMD machines with an
>> IOMMU and more than 1Tb). Additionally, there might be architectural impositions, like on
>> x86 where CMOS seems to tie in with memory above certain boundaries. Unless by migration
>> targets you mean to also cover migrating between Intel and AMD hosts (which may need to
>> keep the reserved range nonetheless as the common denominator).
> 
> I like the flexibility that being able to specify reserved ranges would
> provide, 

/me nods

> but I agree that the machine memory map is usually deeply
> embedded into the arch code and would probably be difficult to
> generalize.  Cross vendor migration should be a consideration and only
> an inter-system management policy could specify the importance of that.
> 
> Perhaps as David mentioned, this is really a machine type issue, where
> the address width downsides you've noted might be sufficient reason
> to introduce a new machine type that includes this memory hole.  That
> would likely be the more traditional solution to this issue.

Maybe there could be a generic facility that stores/manages the reserved ranges, and then
the different machines could provide a default set depending on the running target
heuristics (AMD x86, generic x86, ARM, etc.). To some extent that means tracking reserved
ranges rather than tracking usable IOVA as I do here (and moving that somewhere else
that's not x86-specific).
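
To make that a bit more concrete, such a facility could start from one big usable range and
let each machine punch out its reserved holes. A rough sketch in C (the IovaMap/IovaRange
names and functions are illustrative only, not the API actually proposed in this series):

/*
 * Generic usable-IOVA tracker sketch: start with one usable range and
 * carve out machine-specific reserved holes, e.g. the AMD HyperTransport
 * hole at 0xfd00000000-0xffffffffff.  Illustrative only.
 */
#include <assert.h>
#include <stdint.h>

#define MAX_IOVA_RANGES 16

typedef struct {
    uint64_t start, end;                        /* inclusive */
} IovaRange;

typedef struct {
    IovaRange ranges[MAX_IOVA_RANGES];
    int nr;
} IovaMap;

static void iova_map_init(IovaMap *map, uint64_t start, uint64_t end)
{
    map->ranges[0] = (IovaRange){ start, end };
    map->nr = 1;
}

/* Remove [start, end] from every usable range, splitting where needed. */
static void iova_map_reserve(IovaMap *map, uint64_t start, uint64_t end)
{
    for (int i = 0; i < map->nr; i++) {
        IovaRange *r = &map->ranges[i];

        if (end < r->start || start > r->end) {
            continue;                           /* no overlap */
        }
        if (start > r->start && end < r->end) { /* hole in the middle: split */
            assert(map->nr < MAX_IOVA_RANGES);
            map->ranges[map->nr++] = (IovaRange){ end + 1, r->end };
            r->end = start - 1;
        } else if (start <= r->start && end >= r->end) {
            *r = map->ranges[--map->nr];        /* fully covered: drop it */
            i--;
        } else if (start <= r->start) {
            r->start = end + 1;                 /* clip the front */
        } else {
            r->end = start - 1;                 /* clip the tail */
        }
    }
}

An AMD x86 machine would then do something like iova_map_init(&map, 0x100000000, maxphysaddr)
followed by iova_map_reserve(&map, 0xfd00000000ULL, 0xffffffffffULL), while other targets
would simply skip the reserve step; the MR/SRAT/CMOS code would iterate over map.ranges
instead of assuming one contiguous block above 4G.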



Thread overview: 38+ messages
2021-06-22 15:48 [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Joao Martins
2021-06-22 15:49 ` [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary Joao Martins
2021-06-23  7:11   ` Igor Mammedov
2021-06-23  9:37     ` Joao Martins
2021-06-23 11:39       ` Igor Mammedov
2021-06-23 13:04         ` Joao Martins
2021-06-28 14:32           ` Igor Mammedov
2021-08-06 10:41             ` Joao Martins
2021-06-23  9:03   ` Igor Mammedov
2021-06-23  9:51     ` Joao Martins
2021-06-23 12:09       ` Igor Mammedov
2021-06-23 13:07         ` Joao Martins
2021-06-28 13:25           ` Igor Mammedov
2021-06-28 13:43             ` Joao Martins
2021-06-28 15:21               ` Igor Mammedov
2021-06-24  9:32     ` Dr. David Alan Gilbert
2021-06-28 14:42       ` Igor Mammedov
2021-06-22 15:49 ` [PATCH RFC 2/6] i386/pc: Round up the hotpluggable memory within valid IOVA ranges Joao Martins
2021-06-22 15:49 ` [PATCH RFC 3/6] pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary Joao Martins
2021-06-22 15:49 ` [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space Joao Martins
2021-06-23 12:30   ` Igor Mammedov
2021-06-23 13:22     ` Joao Martins
2021-06-28 15:37       ` Igor Mammedov
2021-06-23 16:33     ` Laszlo Ersek
2021-06-25 17:19       ` Joao Martins
2021-06-22 15:49 ` [PATCH RFC 5/6] i386/acpi: Fix SRAT ranges in accordance to usable IOVA Joao Martins
2021-06-22 15:49 ` [PATCH RFC 6/6] i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs Joao Martins
2021-06-23  9:18   ` Igor Mammedov
2021-06-23  9:59     ` Joao Martins
2021-06-22 21:16 ` [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Alex Williamson
2021-06-23  7:40   ` David Edmondson
2021-06-23 19:13     ` Alex Williamson
2021-06-23  9:30   ` Joao Martins
2021-06-23 11:58     ` Igor Mammedov
2021-06-23 13:15       ` Joao Martins
2021-06-23 19:27     ` Alex Williamson
2021-06-24  9:22       ` Dr. David Alan Gilbert
2021-06-25 16:54       ` Joao Martins [this message]
