linux-pci.vger.kernel.org archive mirror
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Hari Vyas <hari.vyas@broadcom.com>,
	bhelgaas@google.com, linux-pci@vger.kernel.org,
	Ray Jui <ray.jui@broadcom.com>,
	linux-kernel@vger.kernel.org,
	Srinath Mannam <srinath.mannam@broadcom.com>,
	Guenter Roeck <linux@roeck-us.net>, Jens Axboe <axboe@kernel.dk>,
	Lukas Wunner <lukas@wunner.de>,
	Konstantin Khlebnikov <khlebnikov@yandex-team.ru>,
	Marta Rybczynska <mrybczyn@kalray.eu>,
	Pierre-Yves Kerbrat <pkerbrat@kalray.eu>
Subject: Re: [RFC PATCH] pci: Proof of concept at fixing pci_enable_device/bridge races
Date: Fri, 17 Aug 2018 13:42:57 +1000
Message-ID: <0ec65b10f5e414e387bd368aa816900fedcaddcc.camel@kernel.crashing.org>
In-Reply-To: <20180817030741.GC10316@bhelgaas-glaptop.roam.corp.google.com>

On Thu, 2018-08-16 at 22:07 -0500, Bjorn Helgaas wrote:
> [+cc Srinath, Guenter, Jens, Lukas, Konstantin, Marta, Pierre-Yves]
> 
> On Thu, Aug 16, 2018 at 07:50:13AM +1000, Benjamin Herrenschmidt wrote:
> > [Note: This isn't meant to be merged, it needs splitting at the very
> > least, see below]
> > 
> > This is something I cooked up quickly today to test whether it would fix
> > my problems with a large number of switches and NVMe devices on POWER.
> > 
> > So far it does...
> > 
> > The issue (as discussed in the Re: PCIe enable device races thread) is
> > that pci_enable_device and related functions along with pci_set_master
> > and pci_enable_bridge are fundamentally racy.
> > 
> > There is no locking protecting the state of the device in pci_dev, and
> > if multiple devices under the same bridge try to enable it simultaneously,
> > some of them will potentially start accessing it before it has actually
> > been enabled.
> > 
> > Now there are a LOT more fields in pci_dev that aren't covered by any
> > form of locking.
> 
> Most of the PCI core relies on the assumption that only a single
> thread touches a device at a time.  This is generally true of the core
> during enumeration because enumeration is single-threaded.  It's
> generally true in normal driver operation because the core shouldn't
> touch a device after a driver claims it.

Mostly :-) There are a few exceptions though.

> But there are several exceptions, and I think we need to understand
> those scenarios before applying locks willy-nilly.

We need to stop creating ad-hoc locks. We have a good handle already on
the main enable/disable and bus master scenario, and the race with
is_added.
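
To make the enable race concrete, here is a rough userspace sketch in C.
It is not the kernel code: struct fake_bridge, fake_bridge_enable() and
fake_bridge_disable() are hypothetical names, and a pthread mutex stands
in for a per-device lock. The point it illustrates is that the count
check and the (modeled) hardware enable happen under one lock, so no
caller can return before the bridge is actually enabled.

```c
#include <pthread.h>

/* Hypothetical stand-in for a bridge's enable state; not struct pci_dev. */
struct fake_bridge {
	pthread_mutex_t lock;	/* serializes enable/disable */
	int enable_cnt;		/* number of active enable requests */
	int hw_enabled;		/* models the bridge's hardware enable state */
};

static void fake_bridge_init(struct fake_bridge *b)
{
	pthread_mutex_init(&b->lock, NULL);
	b->enable_cnt = 0;
	b->hw_enabled = 0;
}

/*
 * The first caller performs the (modeled) hardware enable while holding
 * the lock. A concurrent caller blocks on the lock until that is done,
 * so it can never see enable_cnt > 0 with hw_enabled still 0.
 */
static int fake_bridge_enable(struct fake_bridge *b)
{
	int enabled;

	pthread_mutex_lock(&b->lock);
	if (b->enable_cnt++ == 0)
		b->hw_enabled = 1;
	enabled = b->hw_enabled;
	pthread_mutex_unlock(&b->lock);
	return enabled;
}

/* The last disable turns the (modeled) hardware back off. */
static void fake_bridge_disable(struct fake_bridge *b)
{
	pthread_mutex_lock(&b->lock);
	if (b->enable_cnt > 0 && --b->enable_cnt == 0)
		b->hw_enabled = 0;
	pthread_mutex_unlock(&b->lock);
}
```

With only an atomic counter and no lock, a second caller could see a
non-zero count and return while the first caller is still programming
the device, which is the window described above.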

Ignore the patch itself; it has at least 2 bugs with PM. I'll send a
series improving things a bit later.

> One big exception is that enabling device A may also touch an upstream
> bridge B.  That causes things like your issue and Srinath's issue
> where drivers simultaneously enabling two devices below the same
> bridge corrupt the bridge's state [1].  Marta reported essentially the
> same issue [2].
> 
> Hari's issue [3] seems related to a race between a driver work queue
> and the core enumerating the device.  I should have pushed harder to
> understand this; I feel like we papered over the immediate problem
> without clearing up the larger issue of why the core enumeration path
> (pci_bus_add_device()) is running at the same time as a driver.

It's not. What is happening is that is_added is set by
pci_bus_add_device() after it has bound the driver. An easy fix would
have been to move it up instead:

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 737d1c52f002..ff4d536d43fc 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -323,16 +323,16 @@ void pci_bus_add_device(struct pci_dev *dev)
        pci_proc_attach_device(dev);
        pci_bridge_d3_update(dev);
 
+       dev->is_added = 1;
        dev->match_driver = true;
        retval = device_attach(&dev->dev);
        if (retval < 0 && retval != -EPROBE_DEFER) {
+               dev->is_added = 0;
                pci_warn(dev, "device attach failed (%d)\n", retval);
                pci_proc_detach_device(dev);
                pci_remove_sysfs_dev_files(dev);
                return;
        }
-
-       dev->is_added = 1;
 }
 EXPORT_SYMBOL_GPL(pci_bus_add_device);

(Untested).

Note: another advantage of the above is that the current code has an
odd asymmetry: is_added is currently set after we attach but also
cleared after we detach.

If we want to keep the flag being set after attaching, then we do
indeed need to protect it against concurrent access to other fields.

The easiest way to do that would have been to remove the :1, like this:

-	unsigned int	is_added:1;
+	unsigned int	is_added;
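
For background on why dropping the :1 helps: in C, adjacent bitfields
share one memory location, so a store to one of them is a
read-modify-write of the containing word and can lose a concurrent
update to a neighbour. A plain unsigned int is a memory location of its
own. The sketch below is illustrative only; the structs are cut-down
stand-ins rather than the real struct pci_dev, and other_flag is a
hypothetical neighbour.

```c
#include <stddef.h>

/* Illustrative only: a cut-down flag area, not the real struct pci_dev. */
struct packed_flags {
	unsigned int is_added:1;
	unsigned int is_busmaster:1;
	unsigned int other_flag:1;	/* hypothetical neighbour */
};

/* Same flags with the ":1" dropped from is_added. */
struct split_flags {
	unsigned int is_added;
	unsigned int is_busmaster:1;
	unsigned int other_flag:1;	/* hypothetical neighbour */
};

/* Bytes occupied when every flag is a bitfield. */
static size_t packed_flags_size(void)
{
	return sizeof(struct packed_flags);
}

/* Bytes occupied once is_added is a full unsigned int. */
static size_t split_flags_size(void)
{
	return sizeof(struct split_flags);
}

/* is_added now has an offset of its own; offsetof() is not allowed on a bitfield. */
static size_t split_is_added_offset(void)
{
	return offsetof(struct split_flags, is_added);
}
```

The cost is a few extra bytes per device; the benefit is that a store to
is_added no longer rewrites the word holding the other flags.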

Again, none of these approaches involves the invasive patch you merged.
That patch uses an atomic operation, which gives the false sense of
security that you are somewhat "protected", while in fact it only
protects the field itself and provides no protection against the
callers running concurrently and clashing in different ways.

Finally, we could also move is_added under the protection of the new
mutex I propose adding, but that would really only work as long as
we move all the :1 fields protected by that mutex together inside
struct pci_dev, so as to avoid collisions with other fields
being modified.
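
A minimal userspace sketch of that layout, under stated assumptions:
struct fake_dev and its setters are hypothetical, a pthread mutex stands
in for the proposed mutex, and unlocked_flag represents any bitfield
that is not covered by it. The zero-width bitfield forces the following
bitfield into a separate memory location, so stores to the locked group
cannot clobber it.

```c
#include <pthread.h>

/* Hypothetical device structure; field names are for illustration only. */
struct fake_dev {
	pthread_mutex_t lock;

	/* Flags protected by 'lock', kept together in one storage unit. */
	unsigned int is_added:1;
	unsigned int is_busmaster:1;

	/* Zero-width bitfield: the next bitfield starts a new memory location. */
	unsigned int :0;

	/* Flag NOT protected by 'lock'. */
	unsigned int unlocked_flag:1;
};

static void fake_dev_init(struct fake_dev *d)
{
	pthread_mutex_init(&d->lock, NULL);
	d->is_added = 0;
	d->is_busmaster = 0;
	d->unlocked_flag = 0;
}

/* Every writer of the grouped flags takes the same lock. */
static void fake_dev_set_added(struct fake_dev *d, int val)
{
	pthread_mutex_lock(&d->lock);
	d->is_added = !!val;
	pthread_mutex_unlock(&d->lock);
}

static void fake_dev_set_busmaster(struct fake_dev *d, int val)
{
	pthread_mutex_lock(&d->lock);
	d->is_busmaster = !!val;
	pthread_mutex_unlock(&d->lock);
}
```

This only holds if every writer of a grouped flag takes the lock; a
single unlocked store to any bitfield in the group reintroduces the
collision.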

All of the above are preferable to what you merged.

> DPC/AER error handling adds more cases where the core potentially
> accesses devices asynchronously to the driver.
>
> User-mode sysfs controls like ASPM are also asynchronous to drivers.
> 
> Even setpci is a potential issue, though I don't know how far we want
> to go to protect it.  I think we should make setpci taint the kernel
> anyway.

I wouldn't bother too much about it.

> It might be nice if we had some sort of writeup of the locking
> strategy as a whole.
> 
> [1] https://lkml.kernel.org/r/1501858648-22228-1-git-send-email-srinath.mannam@broadcom.com
> [2] https://lkml.kernel.org/r/744877924.5841545.1521630049567.JavaMail.zimbra@kalray.eu
> [3] https://lkml.kernel.org/r/1530608741-30664-2-git-send-email-hari.vyas@broadcom.com

Rather than not having one ? :)

This is what I'm proposing here. Let me send a patch series demonstrating
the start of this, which also fixes both of the above issues and completely
removes that rather annoying atomic priv_flags.

I would also like to get rid of the atomic enable_cnt but that will
need a bit more churn through archs and drivers.

Cheers,
Ben.
