linux-acpi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Jacob Pan (Jun)" <jacob.jun.pan@intel.com>
To: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: <linux-acpi@vger.kernel.org>,
	<linux-arm-kernel@lists.infradead.org>,
	<iommu@lists.linux-foundation.org>,
	<virtualization@lists.linux-foundation.org>,
	<linux-pci@vger.kernel.org>, <virtio-dev@lists.oasis-open.org>,
	<rjw@rjwysocki.net>, <lenb@kernel.org>,
	<lorenzo.pieralisi@arm.com>, <guohanjun@huawei.com>,
	<sudeep.holla@arm.com>, <gregkh@linuxfoundation.org>,
	<joro@8bytes.org>, <bhelgaas@google.com>, <mst@redhat.com>,
	<jasowang@redhat.com>, <eric.auger@redhat.com>,
	<sebastien.boeuf@intel.com>, <kevin.tian@intel.com>,
	jacob.jun.pan@intel.com
Subject: Re: [RFC 00/13] virtio-iommu on non-devicetree platforms
Date: Fri, 20 Dec 2019 10:54:30 -0800	[thread overview]
Message-ID: <20191220105430.0000437b@intel.com> (raw)
In-Reply-To: <20191218112044.GA2371701@myrica>

On Wed, 18 Dec 2019 12:20:44 +0100
Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:

> On Tue, Dec 03, 2019 at 07:01:36PM -0800, Jacob Pan (Jun) wrote:
> > Hi Jean,
> > 
> > Sorry for the delay, I was out last week. Comments inline below.
> > 
> > On Mon, 25 Nov 2019 19:02:47 +0100
> > Jean-Philippe Brucker <jean-philippe@linaro.org> wrote:
> >   
> > > On Fri, Nov 22, 2019 at 04:01:02PM -0800, Jacob Pan (Jun) wrote:  
> > > > > (1) ACPI has one table per vendor (DMAR for Intel, IVRS for
> > > > > AMD and IORT for Arm). From my point of view IORT is easier to
> > > > > extend, since we just need to introduce a new node type. There
> > > > > are no dependencies to Arm in the Linux IORT driver, so it
> > > > > works well with CONFIG_X86.   
> > > > From my limited understanding, IORT and VIOT is to solve device
> > > > topology enumeration only? I am not sure how it can be expanded
> > > > to cover information beyond device topology. e.g. DMAR has NUMA
> > > > information and root port ATS, I guess they are not used today
> > > > in the guest but might be additions in the future.    
> > > 
> > > The PCI root-complex node of IORT has an ATS attribute, which we
> > > can already use. However its scope is the root complex, not
> > > individual root ports like with DMAR.
> > > 
> > > I'm not very familiar with NUMA, but it looks like we just need to
> > > specify a proximity domain in relation to the SRAT table, for each
> > > viommu? The SMMUv3 node in IORT has a 4-bytes "proximity domain"
> > > field for this. We can add the same to the VIOT virtio-iommu nodes
> > > later, since the structures are extensible.
> > >   
> > I think there the proximity domain is more for each assigned device
> > than vIOMMU. vIOMMU in the guest can have assigned devices belong to
> > different pIOMMU and proximity domains. If the guest owns the first
> > level page tables (gIOVA or SVA), we want to make sure page tables
> > are allocated from the close proximity domain.
> > 
> > My understanding is virtio IOMMU supports both virtio devices and
> > assigned devices. we could care less about the former in terms of
> > NUMA.
> > 
> > In ACPI, we have _PXM method to retrieve device proximity domain. I
> > don't know if there is something equivalent or a generic way to get
> > _PXM information. I think VMM also need to make sure when an
> > assigned device is used with vIOMMU, there are some memory is
> > allocated from the device's proximity domain.
> >   
> > > But it might be better to keep the bare minimum information in
> > > the FW descriptor, and put the rest in the virtio-iommu. So yes
> > > topology enumeration is something the device cannot do itself
> > > (not fully that is, see (2)) but for the rest, virtio-iommu's
> > > PROBE request can provide details about each endpoint in relation
> > > to their physical IOMMU.
> > > 
> > > We could for example add a bit in a PROBE property saying that the
> > > whole path between the IOMMU and the endpoint supports ATS. For
> > > NUMA it might also be more interesting to have a finer
> > > granularity, since one viommu could be managing endpoints that
> > > are behind different physical IOMMUs. If in the future we want to
> > > allocate page tables close to the physical IOMMU for example, we
> > > might need to describe multiple NUMA nodes per viommu, using the
> > > PROBE request. 
> > Should we reinvent something for NUMA or use ACPI's SRAT, _PXM?   
> 
> Regardless whether we put it in the VIOT or in the virtio-iommu PROBE
> request, we necessarily need to reuse the node IDs defined by SRAT (or
> numa-node-id on devicetree, also a 32-bit value). A virtio-pci based
> virtio-iommu already has the _PXM of its closest bridge and wouldn't
> need anything more in the VIOT, while a virtio-mmio based
> virtio-iommu would need a proximity domain field in the VIOT. That
> could be added later since the table is extensible, but as you
> pointed out, that information might not be very useful.
> 
> > I am not sure how it is handled today in QEMU in terms of guest-host
> > NUMA proximity domain mapping.  
> 
> It looks like the user can specify this guest-host mapping on the
> command-line:
> 
>   -object memory-backend-ram,id=mem0,size=4G,host-nodes=3,policy=bind
>   -object memory-backend-ram,id=mem1,size=4G,host-nodes=4,policy=bind
>   -numa node,memdev=mem0,nodeid=numa0
>   -numa node,memdev=mem1,nodeid=numa1
>   -numa cpu,node-id=numa0,socket-id=0
>   -numa cpu,node-id=numa1,socket-id=1
> 
> numa0 and numa1 would get proximity domains 0 and 1, corresponding to
> host domains 3 and 4. It is also possible to specify the NUMA node of
> a PCI bus (via the PCI expander bridge), and therefore to assign a
> VFIO PCI device in the same proximity domain as its physical location.
> 
>   -device pxb,id=bridge1,bus=pci.0,numa_node=1 (simplified)
>   -device vfio-pci,host=03:01.0,bus=bridge1
> 
Thanks a lot for the thorough explanation. I will give that a try on
x86, I assume the ACPI tables also built to match these cmdline options.
> Linux can use this information to allocate DMA close to the endpoint
> (see for example __iommu_dma_alloc_pages()). For page tables
> allocation, io-pgtables currently takes the node ID of the IOMMU
> device, not the endpoint. For the scenario you describe (virtio-iommu
> endpoints managed by different physical IOMMU), we would need to take
> for example the node ID of the first endpoint in the iommu_domain for
> which we're allocating page tables.
> 
If iommu_domain is shared by multiple device from different NUMA node,
I guess taking the first one is as good as anyone. It would not be
an optimal configuration.
> Is it safe to assume that the pIOMMU is in the same proximity domain
> as the physical endpoint?
I think it is a safe assumption.
>  If that's the case, then the guest already
> has all the information it needs. Otherwise it's easy to add a
> proximity domain PROBE property for each endpoint. Configuring the
> host to pass that information might be more difficult.
> 
I agree, guest should always allocate DMA and IOVA page tables basedon
the endpoint. VT-d currently allocates page table pages based on IOMMU
NUMA node, that might have to change.
> 
> Off topic, I've been wondering how to make iommu-sva aware of NUMA
> topology as well, so that when handling a page request we allocate
> memory on the faulting device's NUMA node, but I think it might
> require invasive changes to the mm subsystem, to pass a NUMA node to
> handle_mm_fault().
> 
> Thanks,
> Jean


      reply	other threads:[~2019-12-20 18:54 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-22 10:49 [RFC 00/13] virtio-iommu on non-devicetree platforms Jean-Philippe Brucker
2019-11-22 10:49 ` [RFC 01/13] ACPI/IORT: Move IORT to the ACPI folder Jean-Philippe Brucker
2019-11-22 10:49 ` [RFC 02/13] ACPI: Add VIOT definitions Jean-Philippe Brucker
2019-11-22 10:49 ` [RFC 03/13] ACPI/IORT: Allow registration of external tables Jean-Philippe Brucker
2019-11-22 10:49 ` [RFC 04/13] ACPI/IORT: Add node categories Jean-Philippe Brucker
2019-11-22 10:49 ` [RFC 05/13] ACPI/IORT: Support VIOT virtio-mmio node Jean-Philippe Brucker
2019-11-22 10:49 ` [RFC 06/13] ACPI/IORT: Support VIOT virtio-pci node Jean-Philippe Brucker
2019-11-22 10:49 ` [RFC 07/13] ACPI/IORT: Defer probe until virtio-iommu-pci has registered a fwnode Jean-Philippe Brucker
2019-11-22 10:49 ` [RFC 08/13] ACPI/IORT: Add callback to update a device's fwnode Jean-Philippe Brucker
2019-11-22 10:49 ` [RFC 09/13] iommu/virtio: Create fwnode if necessary Jean-Philippe Brucker
2019-11-22 10:49 ` [RFC 10/13] iommu/virtio: Update IORT fwnode Jean-Philippe Brucker
2019-11-22 10:49 ` [RFC 11/13] ACPI: Add VIOT table Jean-Philippe Brucker
2019-11-22 10:49 ` [RFC virtio 12/13] virtio-iommu: Add built-in topology description Jean-Philippe Brucker
2019-11-22 10:50 ` [RFC 13/13] iommu/virtio: Add topology description to Jean-Philippe Brucker
2019-11-22 12:53   ` Michael S. Tsirkin
2019-11-25 17:48     ` Jean-Philippe Brucker
2019-11-22 13:00 ` [RFC 00/13] virtio-iommu on non-devicetree platforms Michael S. Tsirkin
2019-11-25 17:53   ` Jean-Philippe Brucker
2019-11-23  0:01 ` Jacob Pan (Jun)
2019-11-25 18:02   ` Jean-Philippe Brucker
2019-12-04  3:01     ` Jacob Pan (Jun)
2019-12-18 11:20       ` Jean-Philippe Brucker
2019-12-20 18:54         ` Jacob Pan (Jun) [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191220105430.0000437b@intel.com \
    --to=jacob.jun.pan@intel.com \
    --cc=bhelgaas@google.com \
    --cc=eric.auger@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=guohanjun@huawei.com \
    --cc=iommu@lists.linux-foundation.org \
    --cc=jasowang@redhat.com \
    --cc=jean-philippe@linaro.org \
    --cc=joro@8bytes.org \
    --cc=kevin.tian@intel.com \
    --cc=lenb@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lorenzo.pieralisi@arm.com \
    --cc=mst@redhat.com \
    --cc=rjw@rjwysocki.net \
    --cc=sebastien.boeuf@intel.com \
    --cc=sudeep.holla@arm.com \
    --cc=virtio-dev@lists.oasis-open.org \
    --cc=virtualization@lists.linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).