dev.dpdk.org archive mirror
 help / color / mirror / Atom feed
From: David Marchand <david.marchand@redhat.com>
To: "Walker, Benjamin" <benjamin.walker@intel.com>
Cc: "jerinj@marvell.com" <jerinj@marvell.com>,
	"Burakov, Anatoly" <anatoly.burakov@intel.com>,
	 "dev@dpdk.org" <dev@dpdk.org>,
	Maxime Coquelin <maxime.coquelin@redhat.com>,
	hemant.agrawal@nxp.com,  shreyansh.jain@nxp.com,
	Rosen Xu <rosen.xu@intel.com>,
	 Gaetan Rivet <gaetan.rivet@6wind.com>,
	Stephen Hemminger <stephen@networkplumber.org>,
	 "Yigit, Ferruh" <ferruh.yigit@intel.com>,
	Thomas Monjalon <thomas@monjalon.net>,
	Yongseok Koh <yskoh@mellanox.com>
Subject: Re: [dpdk-dev] eal/pci: Improve automatic selection of IOVA mode
Date: Fri, 14 Jun 2019 10:42:22 +0200	[thread overview]
Message-ID: <CAJFAV8y-UKj5Eq+rofXaxFOf_s2cme3a-4GRJKT55Haje-ZuvQ@mail.gmail.com> (raw)
In-Reply-To: <da23632dca896ac78b998c85840b974a2d4d206d.camel@intel.com>

On Mon, Jun 3, 2019 at 6:44 PM Walker, Benjamin <benjamin.walker@intel.com>
wrote:

> On Mon, 2019-06-03 at 12:48 +0200, David Marchand wrote:
> > Hello,
> >
> > On Thu, May 30, 2019 at 7:48 PM Ben Walker <benjamin.walker@intel.com>
> wrote:
> > > In SPDK, not all drivers are registered with DPDK at start up time.
> > > Previously, that meant DPDK always chose to set itself up in IOVA_PA
> > > mode. Instead, when the correct iova choice is unclear based on the
> > > devices and drivers known to DPDK at start up time, use other
> heuristics
> > > (such as whether /proc/self/pagemap is accessible) to make a better
> > > choice.
> > >
> > > This enables SPDK to run as an unprivileged user again without
> requiring
> > > users to explicitly set the iova mode on the command line.
> > >
> >
> > Interesting, I got a bz on something similar the day you sent this
> patchset ;-
> > )
> >
> >
> > - When a dpdk process is started, either it has access to physical
> addresses
> > or not, and this won't change for the rest of its life.
> > Your fix on defaulting to VA based on a rte_eal_using_phys_addrs() check
> makes
> > sense to me.
> > It is the most encountered situation when running ovs as non root on
> recent
> > kernels.
> >
> >
> > - However, I fail to see the need for all of this detection code wrt
> drivers
> > and devices.
> >
> > On one side of the equation, when dpdk starts, it checks physical address
> > availability.
> > On the other side of the equation, we have the drivers that will be
> invoked
> > when probing devices (either at dpdk init, or when hotplugging a device).
> >
> > At this point, the probing call should check the driver requirement wrt
> to the
> > kernel driver the device is attached to.
> > If this requirement is not fulfilled, then the probing fails.
> >
> >
> > - This leaves the --iova-va forcing option.
> > Why do we need it?
> > If we don't have access to physical addresses, no choice but run in VA
> mode.
> > If we have access to physical addresses, the only case would be that you
> want
> > to downgrade from PA to VA.
> > But well, your process can still access it, not sure what the benefit is.
>
> All of the complexity here, at least as far as I understand it, stems from
> supporting hot insert of devices. This is very important to SPDK because
> storage
> devices get hot inserted all the time, so we very much appreciate that
> DPDK has
> put in so much effort in this area and continues to accept our patches to
> improve it. I know hot insert is not nearly as important for network
> devices.
>
> When DPDK starts up, it needs to select whether to use virtual addresses or
> physical addresses in its memory maps. It can do that by answering the
> following
> questions:
>
> 1. Does the system only have buses that support an IOMMU?
> 2. Is the IOMMU sufficiently fast for the use case?
> 3. Will all of the devices that will be used with DPDK throughout the
> application's lifetime work with an IOMMU?
>
> If these three things are true, then the best choice is to use virtual
> addresses
> in the memory translations. However, if any of the above are not true it
> needs
> to fall back to physical addresses.
>
> #1 is checked by simply asking all of the buses, which are known up front.
> #2 is
> just assumed to be true. But #3 is not possible to check fully because of
> hot
> insert.
>
> The code currently approximates the #3 check by looking at the devices
> present
> at initialization time. If a device exists that's bound to vfio-pci, and no
> other devices exist that are bound to a uio driver, and DPDK has a
> registered
> driver that's actually going to load against the vfio-pci devices, then it
> will
> elect to use virtual addresses. This is purely a heuristic - it's not a
> definitive answer because the user could later hot insert a device that
> gets
> bound to uio.
>
> The user, of course, knows the answer to which addressing scheme to use
> typically. For example, these checks assume #2 is true, but there may be
> hardware implementations where it is not and the user wants to force
> physical
> addresses. Or the user may know that they are going to hot insert a device
> at
> run time that doesn't work with the IOMMU. That's why it's important to
> maintain
> the ability for the user to override the default heuristic's decision via
> the
> command line.
>
> My patch series is simply improving the heuristic in a few ways. First,
> previously each bus when queried would return either virtual or physical
> addresses as its choice. However, often the bus just does not have enough
> information to formulate any preference at all (and PCI was defaulting to
> physical addresses in this case). Instead, I made it so that the bus can
> return
> that it doesn't care, which pushes the decision up to a higher level. That
> higher level then makes the decision by checking whether it can access
> /proc/self/pagemap. Second, I narrowed the uio check such that physical
> addresses will only be selected if a device bound to uio exists and there
> is a
> driver registered to use it. Previously if any device was bound to uio it
> would
> select physical addresses, even if DPDK never ended up loading against that
> device.
>
> I think these two things make the heuristic choose the right thing more
> often,
> but it still won't always get it right so the command line option needs to
> remain.
>
>
After some exchanges offlist, on irc and taking some time looking at the
code, here are my conclusions.
Copying bus drivers maintainers/connaisseurs.

We have cases where we prefer using VA even if PA are available (for fslmc
where translating from iova as PA to VA is more costly).

I worked on Ben patches and summarised it as two main issues with the
current code:
- physical addresses availability is not taken into account early enough in
EAL init, and we end up with memory subsystem complaining later which is
not that user friendly.
  A collateral is that the init could have fallen back to using VA in most
cases if there were no strong requirement on PA.
- pci bus driver looks at all devices on the system, with no consideration
on the pci white/blacklist and no consideration on the fact that dpdk has a
driver that supports the device

I prepared a new series that I will send shortly.
I am currently considering the backport potential for it.
Thoughts?

Else, reviews are welcome.

Thanks.

-- 
David Marchand

  reply	other threads:[~2019-06-14  8:42 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-30 17:48 [dpdk-dev] eal/pci: Improve automatic selection of IOVA mode Ben Walker
2019-05-30 17:48 ` [dpdk-dev] [PATCH 01/12] eal: Make rte_eal_using_phys_addrs work sooner Ben Walker
2019-05-30 21:29   ` [dpdk-dev] [PATCH v2 " Ben Walker
2019-05-30 21:29     ` [dpdk-dev] [PATCH v2 02/12] eal/pci: Inline several functions into rte_pci_get_iommu_class Ben Walker
2019-05-30 21:29     ` [dpdk-dev] [PATCH v2 03/12] eal/pci: Rework loops in rte_pci_get_iommu_class Ben Walker
2019-05-30 21:29     ` [dpdk-dev] [PATCH v2 04/12] eal/pci: Collapse two " Ben Walker
2019-05-30 21:29     ` [dpdk-dev] [PATCH v2 05/12] eal/pci: Add function pci_ignore_device Ben Walker
2019-05-30 21:29     ` [dpdk-dev] [PATCH v2 06/12] eal/pci: Correctly test whitelist/blacklist in rte_pci_get_iommu_class Ben Walker
2019-05-30 21:29     ` [dpdk-dev] [PATCH v2 07/12] eal/pci: Reverse if check " Ben Walker
2019-05-30 21:29     ` [dpdk-dev] [PATCH v2 08/12] eal/pci: Collapse loops " Ben Walker
2019-05-30 21:29     ` [dpdk-dev] [PATCH v2 09/12] eal/pci: Simplify rte_pci_get_iommu class by using a switch Ben Walker
2019-05-30 21:29     ` [dpdk-dev] [PATCH v2 10/12] eal/pci: Finding a device bound to UIO does not force PA Ben Walker
2019-05-30 21:29     ` [dpdk-dev] [PATCH v2 11/12] eal/pci: rte_pci_get_iommu_class handles no drivers Ben Walker
2019-05-30 21:29     ` [dpdk-dev] [PATCH v2 12/12] eal: If bus can't decide PA or VA, try to access PA Ben Walker
2019-05-30 17:48 ` [dpdk-dev] [PATCH 02/12] eal/pci: Inline several functions into rte_pci_get_iommu_class Ben Walker
2019-05-30 17:57   ` Stephen Hemminger
2019-05-30 18:09     ` Walker, Benjamin
2019-05-30 17:48 ` [dpdk-dev] [PATCH 03/12] eal/pci: Rework loops in rte_pci_get_iommu_class Ben Walker
2019-05-30 17:48 ` [dpdk-dev] [PATCH 04/12] eal/pci: Collapse two " Ben Walker
2019-05-30 17:48 ` [dpdk-dev] [PATCH 05/12] eal/pci: Add function pci_ignore_device Ben Walker
2019-05-30 17:48 ` [dpdk-dev] [PATCH 06/12] eal/pci: Correctly test whitelist/blacklist in rte_pci_get_iommu_class Ben Walker
2019-05-30 17:48 ` [dpdk-dev] [PATCH 07/12] eal/pci: Reverse if check " Ben Walker
2019-05-30 17:48 ` [dpdk-dev] [PATCH 08/12] eal/pci: Collapse loops " Ben Walker
2019-05-30 17:48 ` [dpdk-dev] [PATCH 09/12] eal/pci: Simplify rte_pci_get_iommu class by using a switch Ben Walker
2019-05-30 17:48 ` [dpdk-dev] [PATCH 10/12] eal/pci: Finding a device bound to UIO does not force PA Ben Walker
2019-05-30 17:48 ` [dpdk-dev] [PATCH 11/12] eal/pci: rte_pci_get_iommu_class handles no drivers Ben Walker
2019-05-30 17:48 ` [dpdk-dev] [PATCH 12/12] eal: If bus can't decide PA or VA, try to access PA Ben Walker
2019-06-03 10:48 ` [dpdk-dev] eal/pci: Improve automatic selection of IOVA mode David Marchand
2019-06-03 16:44   ` Walker, Benjamin
2019-06-14  8:42     ` David Marchand [this message]
2019-06-14  9:39 ` [dpdk-dev] [PATCH v2 0/3] " David Marchand
2019-06-14  9:39   ` [dpdk-dev] [PATCH v2 1/3] kni: refuse to initialise when IOVA is not PA David Marchand
2019-06-14  9:39   ` [dpdk-dev] [PATCH v2 2/3] eal: compute IOVA mode based on PA availability David Marchand
2019-07-03 10:17     ` Burakov, Anatoly
2019-07-04  7:13       ` David Marchand
2019-06-14  9:39   ` [dpdk-dev] [PATCH v2 3/3] bus/pci: only consider usable devices to select IOVA mode David Marchand
2019-07-03 10:45     ` Burakov, Anatoly
2019-07-04  9:18       ` David Marchand
2019-07-04 10:43         ` Burakov, Anatoly
2019-07-04 10:47           ` David Marchand
2019-07-04 17:14     ` Stephen Hemminger
2019-07-05  7:58       ` David Marchand
2019-07-05 16:27         ` Stephen Hemminger
2019-07-05  8:26       ` Thomas Monjalon
2019-06-27 17:05   ` [dpdk-dev] [PATCH v2 0/3] Improve automatic selection of " Thomas Monjalon
2019-07-02 14:18     ` Thomas Monjalon
2019-07-05 14:57   ` Thomas Monjalon
2019-06-04 11:28 [dpdk-dev] eal/pci: " Jerin Jacob Kollanukkaran

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJFAV8y-UKj5Eq+rofXaxFOf_s2cme3a-4GRJKT55Haje-ZuvQ@mail.gmail.com \
    --to=david.marchand@redhat.com \
    --cc=anatoly.burakov@intel.com \
    --cc=benjamin.walker@intel.com \
    --cc=dev@dpdk.org \
    --cc=ferruh.yigit@intel.com \
    --cc=gaetan.rivet@6wind.com \
    --cc=hemant.agrawal@nxp.com \
    --cc=jerinj@marvell.com \
    --cc=maxime.coquelin@redhat.com \
    --cc=rosen.xu@intel.com \
    --cc=shreyansh.jain@nxp.com \
    --cc=stephen@networkplumber.org \
    --cc=thomas@monjalon.net \
    --cc=yskoh@mellanox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).