From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "Jose Ricardo Ziviani" <joserz@linux.ibm.com>,
	"Daniel Henrique Barboza" <danielhb413@gmail.com>,
	kvm-ppc@vger.kernel.org,
	"Piotr Jaroszynski" <pjaroszynski@nvidia.com>,
	"Leonardo Augusto Guimarães Garcia" <lagarcia@br.ibm.com>,
	linuxppc-dev@lists.ozlabs.org,
	"David Gibson" <david@gibson.dropbear.id.au>
Subject: Re: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation
Date: Wed, 20 Mar 2019 12:57:44 +1100	[thread overview]
Message-ID: <b0d973ee-a612-e6ca-d1ed-5db72f333128@ozlabs.ru> (raw)
In-Reply-To: <20190319103619.6534c7df@x1.home>



On 20/03/2019 03:36, Alex Williamson wrote:
> On Fri, 15 Mar 2019 19:18:35 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
>> (on POWER9) NVLinks. In addition to that, the GPUs themselves have direct
>> peer-to-peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
>> platform puts all interconnected GPUs into the same IOMMU group.
>>
>> However, the user may want to pass individual GPUs to userspace, so
>> in order to do that we need to put them into separate IOMMU groups and
>> cut off the interconnects.
>>
>> Thankfully V100 GPUs implement an interface to do this by programming a
>> link-disabling mask into BAR0 of a GPU. Once a link is disabled in a GPU
>> using this interface, it cannot be re-enabled until a secondary bus reset
>> is issued to the GPU.
>>
>> This defines a reset_done() handler for the V100 NVLink2 device which
>> determines what links need to be disabled. It relies on the presence of
>> the new "ibm,nvlink-peers" device tree property of a GPU, which lists the
>> PCI peers it is connected to (NVLink bridges or peer GPUs).
>>
>> This does not change the existing behaviour; instead it adds
>> a new "isolate_nvlink" kernel parameter to enable such isolation.
>>
>> The alternative approaches would be:
>>
>> 1. do this in the system firmware (skiboot), but for that we would need
>> to tell skiboot via an additional OPAL call whether or not we want this
>> isolation, as skiboot is unaware of IOMMU groups.
>>
>> 2. do this in the secondary bus reset handler in the POWERNV platform -
>> the problem with that is that at that point the device is not enabled,
>> i.e. config space is not restored, so we would need to enable the device
>> (set the MMIO bit in the command register and program a valid address
>> into BAR0) in order to disable the links, and then perhaps undo all this
>> initialization to bring the device back to the state in which
>> pci_try_reset_function() expects to find it.
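
(Stripped of the details, the shape of the handler the patch talks about is
roughly the sketch below. I am using the standard struct pci_error_handlers
here purely for illustration - in the RFC this is presumably wired up via the
device specific error handlers from patch 1/2 - and the BAR0 offset and mask
layout are placeholders, not the real hardware interface.)

#include <linux/pci.h>
#include <linux/of.h>

static void nvlink2_reset_done(struct pci_dev *pdev)
{
	struct device_node *np = pci_device_to_OF_node(pdev);
	void __iomem *bar0;
	u32 mask = 0;	/* one bit per link that has to be cut */

	/* Only act on GPUs which describe their NVLink peers */
	if (!np || !of_find_property(np, "ibm,nvlink-peers", NULL))
		return;

	/* ... walk "ibm,nvlink-peers" and set a bit for every peer to isolate ... */

	bar0 = pci_iomap(pdev, 0, 0);
	if (!bar0)
		return;

	/* Write the link-disabling mask; the offset is a placeholder */
	iowrite32(mask, bar0 + 0x1000);
	pci_iounmap(pdev, bar0);
}

static const struct pci_error_handlers nvlink2_err_handlers = {
	.reset_done	= nvlink2_reset_done,
};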
> 
> The trouble seems to be that this approach only maintains the isolation
> exposed by the IOMMU group when vfio-pci is the active driver for the
> device.  IOMMU groups can be used by any driver and the IOMMU core is
> incorporating groups in various ways.  So, if there's a device specific
> way to configure the isolation reported in the group, which requires
> some sort of active management against things like secondary bus
> resets, then I think we need to manage it above the attached endpoint
> driver.

Fair point. So for now I'll go for 2) then.
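
Roughly, what 2) would mean in practice is something like the sketch below
run from the platform reset path. The 0x1000 offset is a placeholder, I am
assuming a 32-bit BAR0 for brevity, and error handling is trimmed:

#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/io.h>

static void pnv_nvlink2_cut_links(struct pci_dev *pdev, u32 mask)
{
	resource_size_t start = pci_resource_start(pdev, 0);
	void __iomem *bar0;
	u16 cmd;

	/* Config space is not restored yet: program BAR0 and enable MMIO */
	pci_write_config_dword(pdev, PCI_BASE_ADDRESS_0, lower_32_bits(start));
	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
	pci_write_config_word(pdev, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);

	bar0 = ioremap(start, pci_resource_len(pdev, 0));
	if (bar0) {
		/* Write the link-disabling mask; the offset is a placeholder */
		iowrite32(mask, bar0 + 0x1000);
		iounmap(bar0);
	}

	/* Undo the temporary setup so the caller sees a freshly reset device */
	pci_write_config_word(pdev, PCI_COMMAND, cmd);
	pci_write_config_dword(pdev, PCI_BASE_ADDRESS_0, 0);
}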

> Ideally I'd see this as a set of PCI quirks so that we might
> leverage it beyond POWER platforms.  I'm not sure how we get past the
> reliance on device tree properties that we won't have on other
> platforms though, if only NVIDIA could at least open a spec addressing
> the discovery and configuration of NVLink registers on their
> devices :-\  Thanks,

This would be nice, yes...
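
If such a spec ever appears, the quirk side could be as simple as the sketch
below; the FINAL-stage hook, the PCI_ANY_ID match and the body are all
placeholders for what a cross-platform variant might look like:

#include <linux/pci.h>
#include <linux/pci_ids.h>

/* Hypothetical: no open spec exists for the NVLink registers yet */
static void quirk_nvlink2_isolation(struct pci_dev *pdev)
{
	/*
	 * Here we would discover the NVLink topology (on POWER9 from the
	 * "ibm,nvlink-peers" property, elsewhere from that missing spec)
	 * and arrange for the link-disabling mask to be re-applied after
	 * every secondary bus reset.
	 */
	pci_info(pdev, "would set up NVLink2 isolation here\n");
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
			quirk_nvlink2_isolation);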


-- 
Alexey

Thread overview:
2019-03-15  8:18 [PATCH kernel RFC 0/2] vfio, powerpc/powernv: Isolate GV100GL Alexey Kardashevskiy
2019-03-15  8:18 ` [PATCH kernel RFC 1/2] vfio_pci: Allow device specific error handlers Alexey Kardashevskiy
2019-03-15  8:18 ` [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation Alexey Kardashevskiy
2019-03-19 16:36   ` Alex Williamson
2019-03-20  1:57     ` Alexey Kardashevskiy [this message]
2019-03-20  4:38     ` David Gibson
2019-03-20 19:09       ` Alex Williamson
2019-03-20 23:56         ` David Gibson
2019-03-21 18:19           ` Alex Williamson
2019-03-22  3:08             ` David Gibson
2019-03-22 23:10               ` Alex Williamson
2019-04-05  0:34                 ` David Gibson
