linuxppc-dev.lists.ozlabs.org archive mirror
From: Russell Currey <ruscur@russell.cc>
To: Oliver O'Halloran <oohall@gmail.com>,
	Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: KVM list <kvm@vger.kernel.org>,
	Fabiano Rosas <farosas@linux.ibm.com>,
	Alistair Popple <alistair@popple.id.au>,
	kvm-ppc@vger.kernel.org,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB
Date: Fri, 17 Apr 2020 11:26:27 +1000	[thread overview]
Message-ID: <b0b361092d2d7e38f753edee6dcd9222b4e388ce.camel@russell.cc> (raw)
In-Reply-To: <CAOSf1CHgUsJ7jGokg6QD6cEDr4-o5hnyyyjRZ=YijsRY3T1sYA@mail.gmail.com>

On Thu, 2020-04-16 at 12:53 +1000, Oliver O'Halloran wrote:
> On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran <oohall@gmail.com> wrote:
> > On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > Anyone? Is it a totally useless or wrong approach? Thanks,
> > 
> > I wouldn't say it's either, but I still hate it.
> > 
> > The 4GB mode being per-PHB makes it difficult to use unless we force
> > that mode on 100% of the time, which I'd prefer not to do. Ideally,
> > devices that actually support 64-bit addressing (which is most of
> > them) should be able to use no-translate mode when possible, since
> > a) it's faster, and b) it frees up room in the TCE cache for devices
> > that actually need it. I know you've done some testing with 100G
> > NICs and found the overhead was fine, but IMO that's a bad test: it's
> > pretty much the best-case scenario, since all the devices on the PHB
> > are in the same PE. The PHB's TCE cache only hits when the TCE
> > matches both the DMA bus address and the PE number for the device,
> > so in a multi-PE environment there's a lot of potential for TCE
> > cache thrashing. If there were only one or two PEs under that PHB it
> > probably wouldn't matter, but if you have an NVMe rack with 20
> > drives it starts to look a bit ugly.
> > 
> > That all said, it might be worth doing this anyway since we probably
> > want the software infrastructure in place to take advantage of it.
> > Maybe expand the command line parameters to allow it to be enabled
> > on a per-PHB basis rather than globally.
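For illustration, a per-PHB opt-in could be parsed from the command line
along these lines; the parameter name, variable and plumbing below are
made up for the sketch (not existing code) and assume a comma-separated
list of PHB indexes:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/string.h>

/* Hypothetical: bitmap of PHB indexes the admin opted in, e.g.
 * "pnv_dma_4g_bypass=0,2" enables the mode on PHB#0 and PHB#2 only. */
static unsigned long pnv_4g_bypass_phbs;

static int __init parse_pnv_4g_bypass(char *str)
{
	char *tok;

	while ((tok = strsep(&str, ",")) != NULL) {
		unsigned long idx;

		if (!kstrtoul(tok, 0, &idx) && idx < BITS_PER_LONG)
			pnv_4g_bypass_phbs |= 1UL << idx;
	}
	return 0;
}
early_param("pnv_dma_4g_bypass", parse_pnv_4g_bypass);

That way the conservative default stays everywhere and only the PHBs
hosting devices known to benefit get opted in.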
> 
> Since we're on the topic
> 
> I've been thinking the real issue we have is that we're trying to
> pick an "optimal" IOMMU config at a point where we don't have enough
> information to work out what's actually optimal. The IOMMU config is
> done on a per-PE basis, but since PEs may contain devices with
> different DMA masks (looking at you, weird AMD audio function) we're
> always going to have to pick something conservative as the default
> config for TVE#0 (64K pages, no bypass mapping), since the driver
> tells us what the device actually supports long after the IOMMU
> configuration is done. What we really want is to be able to have a
> separate IOMMU context for each device, or at the very least a
> separate context for the crippled devices.
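To spell out the ordering problem, here's a rough sketch (the hook and
helper name are hypothetical, not existing powernv entry points): by the
time the driver reports its mask, the conservative TVE#0 window has long
since been configured.

#include <linux/dma-mapping.h>
#include <linux/pci.h>

/*
 * Hypothetical hook, illustrating the ordering problem only: this runs
 * when a driver finally calls dma_set_mask(), i.e. long after the PE's
 * TVE#0 window was set up with the conservative default (64K-page TCE
 * table, no bypass).
 */
static int example_dma_set_mask(struct pci_dev *pdev, u64 mask)
{
	if (mask >= DMA_BIT_MASK(64)) {
		/*
		 * The device could have used a direct bypass window all
		 * along; ideally it would get its own IOMMU context here
		 * instead of staying on the shared translated window.
		 */
		return example_enable_bypass(pdev);	/* hypothetical */
	}

	/* Limited mask: stuck with the conservative translated window. */
	return 0;
}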
> 
> We could allow a per-device IOMMU context by extending the master /
> slave PE thing to cover DMA in addition to MMIO. Right now we only
> use slave PEs when a device's MMIO BARs extend over multiple m64
> segments. When that happens, an MMIO error causes the PHB to freeze
> the PE corresponding to one of those segments, but not any of the
> others. To present a single "PE" to the EEH core, we check the freeze
> status of each of the slave PEs when the EEH core does a PE status
> check, and if any of them are frozen we freeze the rest of them too.
> When a driver sets a limited DMA mask we could move that device to a
> separate slave PE so that it has its own IOMMU context tailored to
> its DMA addressing limits.
> 
> Thoughts?

For what it's worth this sounds like a good idea to me; it just sounds
tricky to implement.  You're adding another layer of complexity on top
of EEH (well, making things look simple to the EEH core and doing your
own freeze handling on top of it) in addition to the DMA handling.

If it works then great, but it has a high potential to become a new bug
haven.
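To spell out what that extra layer would involve, here's a very rough
sketch of the freeze propagation, loosely modelled on the existing
master/slave handling for MMIO; the types and helper names are
simplified stand-ins, not the real pnv_ioda_pe code:

#include <linux/list.h>
#include <linux/types.h>

/* Simplified stand-in for a PE; the real thing is pnv_ioda_pe. */
struct example_pe {
	int pe_number;
	struct list_head list;		/* entry in the master's slave list */
	struct list_head slaves;	/* only used on the master PE */
};

/*
 * Called when the EEH core asks for the status of the (master) PE.
 * If any slave PE - including a future DMA-only slave created for a
 * device with a limited mask - has been frozen by the hardware, freeze
 * the whole compound PE so the EEH core still sees a single PE.
 */
static void example_sync_compound_freeze(struct example_pe *master)
{
	struct example_pe *slave;
	bool frozen = example_pe_is_frozen(master);	/* hypothetical */

	list_for_each_entry(slave, &master->slaves, list)
		frozen |= example_pe_is_frozen(slave);

	if (!frozen)
		return;

	example_freeze_pe(master);			/* hypothetical */
	list_for_each_entry(slave, &master->slaves, list)
		example_freeze_pe(slave);
}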

> 
> Oliver



Thread overview: 19+ messages
2020-03-23  7:53 [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window at 4GB Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 1/7] powerpc/powernv/ioda: Move TCE bypass base to PE Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 2/7] powerpc/powernv/ioda: Rework for huge DMA window at 4GB Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 3/7] powerpc/powernv/ioda: Allow smaller TCE table levels Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 4/7] powerpc/powernv/phb4: Use IOMMU instead of bypassing Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 5/7] powerpc/iommu: Add a window number to iommu_table_group_ops::get_table_size Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 6/7] powerpc/powernv/phb4: Add 4GB IOMMU bypass mode Alexey Kardashevskiy
2020-03-23  7:53 ` [PATCH kernel v2 7/7] vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB Alexey Kardashevskiy
2020-04-08  9:43 ` [PATCH kernel v2 0/7] powerpc/powenv/ioda: Allow huge DMA window " Alexey Kardashevskiy
2020-04-16  1:27   ` Alexey Kardashevskiy
2020-04-16  2:34     ` Oliver O'Halloran
2020-04-16  2:53       ` Oliver O'Halloran
2020-04-17  1:26         ` Russell Currey [this message]
2020-04-17  5:47           ` Alexey Kardashevskiy
2020-04-20 14:04             ` Oliver O'Halloran
2020-04-21  5:11               ` Alexey Kardashevskiy
2020-04-21  6:35                 ` Oliver O'Halloran
2020-04-22  6:49                   ` Alexey Kardashevskiy
2020-04-22  9:11                     ` Oliver O'Halloran
