linuxppc-dev.lists.ozlabs.org archive mirror
From: Russell Currey <ruscur@russell.cc>
To: Oliver O'Halloran <oohall@gmail.com>,
	Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: KVM list <kvm@vger.kernel.org>,
	Fabiano Rosas <farosas@linux.ibm.com>,
	Alistair Popple <alistair@popple.id.au>,
	kvm-ppc@vger.kernel.org,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
Date: Fri, 17 Apr 2020 11:26:27 +1000	[thread overview]
Message-ID: <b0b361092d2d7e38f753edee6dcd9222b4e388ce.camel@russell.cc> (raw)
In-Reply-To: <CAOSf1CHgUsJ7jGokg6QD6cEDr4-o5hnyyyjRZ=YijsRY3T1sYA@mail.gmail.com>

On Thu, 2020-04-16 at 12:53 +1000, Oliver O'Halloran wrote:
> On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran <oohall@gmail.com> wrote:
> > On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > Anyone? Is it a totally useless or wrong approach? Thanks,
> > 
> > I wouldn't say it's either, but I still hate it.
> > 
> > The 4GB mode being per-PHB makes it difficult to use unless we force
> > that mode on 100% of the time, which I'd prefer not to do. Ideally,
> > devices that actually support 64-bit addressing (which is most of
> > them) should be able to use no-translate mode when possible, since
> > a) it's faster, and b) it frees up room in the TCE cache for devices
> > that actually need it. I know you've done some testing with 100G
> > NICs and found the overhead was fine, but IMO that's a bad test: it's
> > pretty much the best-case scenario, since all the devices on the PHB
> > are in the same PE. The PHB's TCE cache only hits when the TCE
> > matches both the DMA bus address and the PE number for the device,
> > so in a multi-PE environment there's a lot of potential for TCE
> > cache thrashing. If there were only one or two PEs under that PHB it
> > probably wouldn't matter, but if you have an NVMe rack with 20
> > drives it starts to look a bit ugly.
> > 
> > That all said, it might be worth doing this anyway since we probably
> > want the software infrastructure in place to take advantage of it.
> > Maybe expand the command line parameters to allow it to be enabled
> > on a per-PHB basis rather than globally.
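For illustration, a per-PHB opt-in could be parsed from the command line
along these lines; the parameter name, variable and plumbing below are
made up for the sketch (not existing code) and assume a comma-separated
list of PHB indexes:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/string.h>

/* Hypothetical: bitmap of PHB indexes the admin opted in, e.g.
 * "pnv_dma_4g_bypass=0,2" enables the mode on PHB#0 and PHB#2 only. */
static unsigned long pnv_4g_bypass_phbs;

static int __init parse_pnv_4g_bypass(char *str)
{
	char *tok;

	while ((tok = strsep(&str, ",")) != NULL) {
		unsigned long idx;

		if (!kstrtoul(tok, 0, &idx) && idx < BITS_PER_LONG)
			pnv_4g_bypass_phbs |= 1UL << idx;
	}
	return 0;
}
early_param("pnv_dma_4g_bypass", parse_pnv_4g_bypass);

That way the conservative default stays everywhere and only the PHBs
hosting devices known to benefit get opted in.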
> 
> Since we're on the topic
> 
> I've been thinking the real issue we have is that we're trying to
> pick an "optimal" IOMMU config at a point where we don't have enough
> information to work out what's actually optimal. The IOMMU config is
> done on a per-PE basis, but since PEs may contain devices with
> different DMA masks (looking at you, weird AMD audio function) we're
> always going to have to pick something conservative as the default
> config for TVE#0 (64K pages, no bypass mapping), since the driver
> tells us what the device actually supports long after the IOMMU
> configuration is done. What we really want is to be able to have a
> separate IOMMU context for each device, or at the very least a
> separate context for the crippled devices.
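To spell out the ordering problem, here's a rough sketch (the hook and
helper name are hypothetical, not existing powernv entry points): by the
time the driver reports its mask, the conservative TVE#0 window has long
since been configured.

#include <linux/dma-mapping.h>
#include <linux/pci.h>

/*
 * Hypothetical hook, illustrating the ordering problem only: this runs
 * when a driver finally calls dma_set_mask(), i.e. long after the PE's
 * TVE#0 window was set up with the conservative default (64K-page TCE
 * table, no bypass).
 */
static int example_dma_set_mask(struct pci_dev *pdev, u64 mask)
{
	if (mask >= DMA_BIT_MASK(64)) {
		/*
		 * The device could have used a direct bypass window all
		 * along; ideally it would get its own IOMMU context here
		 * instead of staying on the shared translated window.
		 */
		return example_enable_bypass(pdev);	/* hypothetical */
	}

	/* Limited mask: stuck with the conservative translated window. */
	return 0;
}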
> 
> We could allow a per-device IOMMU context by extending the master /
> slave PE thing to cover DMA in addition to MMIO. Right now we only
> use slave PEs when a device's MMIO BARs extend over multiple m64
> segments. When that happens, an MMIO error causes the PHB to freeze
> the PE corresponding to one of those segments, but not any of the
> others. To present a single "PE" to the EEH core, we check the freeze
> status of each of the slave PEs when the EEH core does a PE status
> check, and if any of them are frozen we freeze the rest of them too.
> When a driver sets a limited DMA mask we could move that device to a
> separate slave PE so that it has its own IOMMU context tailored to
> its DMA addressing limits.
> 
> Thoughts?

For what it's worth this sounds like a good idea to me; it just sounds
tricky to implement.  You're adding another layer of complexity on top
of EEH (well, making things look simple to the EEH core and doing your
own freeze handling on top of it) in addition to the DMA handling.

If it works then great, but it has a high potential to become a new bug
haven.
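To spell out what that extra layer would involve, here's a very rough
sketch of the freeze propagation, loosely modelled on the existing
master/slave handling for MMIO; the types and helper names are
simplified stand-ins, not the real pnv_ioda_pe code:

#include <linux/list.h>
#include <linux/types.h>

/* Simplified stand-in for a PE; the real thing is pnv_ioda_pe. */
struct example_pe {
	int pe_number;
	struct list_head list;		/* entry in the master's slave list */
	struct list_head slaves;	/* only used on the master PE */
};

/*
 * Called when the EEH core asks for the status of the (master) PE.
 * If any slave PE - including a future DMA-only slave created for a
 * device with a limited mask - has been frozen by the hardware, freeze
 * the whole compound PE so the EEH core still sees a single PE.
 */
static void example_sync_compound_freeze(struct example_pe *master)
{
	struct example_pe *slave;
	bool frozen = example_pe_is_frozen(master);	/* hypothetical */

	list_for_each_entry(slave, &master->slaves, list)
		frozen |= example_pe_is_frozen(slave);

	if (!frozen)
		return;

	example_freeze_pe(master);			/* hypothetical */
	list_for_each_entry(slave, &master->slaves, list)
		example_freeze_pe(slave);
}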

> 
> Oliver



Thread overview: 19+ messages
2020-03-23  7:53 [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 1/7] powerpc/powernv/ioda: Move TCE bypass base to PE Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 2/7] powerpc/powernv/ioda: Rework for huge DMA window at 4GB Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 3/7] powerpc/powernv/ioda: Allow smaller TCE table levels Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 4/7] powerpc/powernv/phb4: Use IOMMU instead of bypassing Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 5/7] powerpc/iommu: Add a window number to iommu_table_group_ops::get_table_size Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 6/7] powerpc/powernv/phb4: Add 4GB IOMMU bypass mode Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 7/7] vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB Alexey Kardashevskiy
2020-04-08  9:43 ` [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window " Alexey Kardashevskiy
2020-04-16  1:27   ` Alexey Kardashevskiy
2020-04-16  2:34     ` Oliver O'Halloran
2020-04-16  2:53       ` Oliver O'Halloran
2020-04-17  1:26         ` Russell Currey [this message]
2020-04-17  5:47           ` Alexey Kardashevskiy
2020-04-20 14:04             ` Oliver O'Halloran
2020-04-21  5:11               ` Alexey Kardashevskiy
2020-04-21  6:35                 ` Oliver O'Halloran
2020-04-22  6:49                   ` Alexey Kardashevskiy
2020-04-22  9:11                     ` Oliver O'Halloran
