All of lore.kernel.org
 help / color / mirror / Atom feed
From: Matthew Wilcox <willy@infradead.org>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Xiang Zheng <zhengxiang9@huawei.com>,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	mingo@redhat.com, peterz@infradead.org,
	alex.williamson@redhat.com,
	Wang Haibin <wanghaibin.wang@huawei.com>,
	Guoheyi <guoheyi@huawei.com>,
	yebiaoxiang <yebiaoxiang@huawei.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>
Subject: Re: Kernel panic while doing vfio-pci hot-plug/unplug test
Date: Thu, 24 Oct 2019 04:22:32 -0700	[thread overview]
Message-ID: <20191024112232.GD2963@bombadil.infradead.org> (raw)
In-Reply-To: <20191023204638.GA8868@google.com>

On Wed, Oct 23, 2019 at 03:46:38PM -0500, Bjorn Helgaas wrote:
> [+cc Thomas, Rafael, beginning of thread at
> https://lore.kernel.org/r/79827f2f-9b43-4411-1376-b9063b67aee3@huawei.com]
> 
> On Wed, Oct 23, 2019 at 09:38:51AM -0700, Matthew Wilcox wrote:
> > On Wed, Oct 23, 2019 at 10:15:40AM -0500, Bjorn Helgaas wrote:
> > > I don't like being one of a handful of callers of __add_wait_queue(),
> > > so I like that solution from that point of view.
> > > 
> > > The 7ea7e98fd8d0 ("PCI: Block on access to temporarily unavailable pci
> > > device") commit log suggests that using __add_wait_queue() is a
> > > significant optimization, but I don't know how important that is in
> > > practical terms.  Config accesses are never a performance path anyway,
> > > so I'd be inclined to use add_wait_queue() unless somebody complains.
> > 
> > Wow, this has got pretty messy in the umpteen years since I last looked
> > at it.
> > 
> > Some problems I see:
> > 
> > 1. Commit df65c1bcd9b7b639177a5a15da1b8dc3bee4f5fa (tglx) says:
> > 
> >     x86/PCI: Select CONFIG_PCI_LOCKLESS_CONFIG
> >     
> >     All x86 PCI configuration space accessors have either their own
> >     serialization or can operate completely lockless (ECAM).
> >     
> >     Disable the global lock in the generic PCI configuration space accessors.
> > 
> > The concept behind this patch is broken.  We still need to lock out
> > config space accesses when devices are undergoing D-state transitions.
> > I would suggest that for the contention case that tglx is concerned about,
> > we should have a pci_bus_read_config_unlocked_##size set of functions
> > which can be used for devices we know never go into D states.
> 
> Host bridges that can't do config accesses atomically, e.g., they have
> something like the 0xcf8/0xcfc addr/data ports, need serialization.
> CONFIG_PCI_LOCKLESS_CONFIG removes the use of pci_lock for that, and I
> think that part makes sense regardless of whether devices can enter D
> states.

I disagree.  If a device is in D state, we need to block the access.
Maybe there needs to be a different mechanism for doing it that's not
a machine-wide lock, but it needs to happen.

> We *should* prevent config accesses during D-state transitions (per
> PCIe r5.0, sec 5.9), but I don't think pci_lock ever did that.

It used to set block_ucfg_access.  Maybe that's been lost; I see
there are still calls to pci_dev_lock() in pci_reset_function(),
for example.

> pci_raw_set_power_state() contains delays, but that only prevents
> accesses from the caller, not from other threads or from userspace.
> I suppose we should also prevent accesses by other threads during
> transitions done by ACPI, e.g., _PS0, _PS1, _PS2, _PS3.  AFAICT we
> don't do any of that.
> 
> It looks like pci_lock currently:
> 
>   - Serializes all kernel config accesses system-wide in
>     pci_bus_read_config_##size() (unless CONFIG_PCI_LOCKLESS_CONFIG=y).
> 
>   - Serializes all userspace config accesses system-wide in
>     pci_user_read_config_##size() (this seems unnecessary when
>     CONFIG_PCI_LOCKLESS_CONFIG=y).
> 
>   - Serializes userspace config accesses with resets of the device via
>     the dev->block_cfg_access bit and waitqueue mechanism.
> 
>   - Serializes kernel and userspace config accesses with bus->ops
>     changes in pci_bus_set_ops() (except that we don't serialize
>     kernel config accesses if CONFIG_PCI_LOCKLESS_CONFIG=y, which is
>     probably a problem).  But pci_bus_set_ops() is hardly used and I'm
>     not sure it's worth keeping it.
> 
> > 2. Commit a2e27787f893621c5a6b865acf6b7766f8671328 (jan kiszka)
> >    exports pci_lock.  I think this is a mistake; at best there should be
> >    accessors for the pci_lock.  But I don't understand why it needs to
> >    exclude PCI config space changes throughout pci_check_and_set_intx_mask().
> >    Why can it not do:
> > 
> > -	bus->ops->read(bus, dev->devfn, PCI_COMMAND, 4, &cmd_status_dword);
> > +	pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
> > 
> > 3. I don't understand why 511dd98ce8cf6dc4f8f2cb32a8af31ce9f4ba4a1
> >    changed pci_lock to be a raw spinlock.  The patch description
> >    essentially says "We need it for RT" which isn't terribly helpful.
> > 
> > 4. Finally, getting back to the original problem report here, I wouldn't
> >    write this code this way today.  There's no reason not to use the
> >    regular add_wait_queue etc.  BUT!  Why are we using this custom locking
> >    mechanism?  It pretty much screams to me of an rwsem (reads/writes
> >    of config space take it for read; changes to config space accesses
> >    (disabling and changing of accessor methods) take it for write.
> 
> So maybe the immediate thing is to just convert to add_wait_queue()?

Isn't that going to run foul of the lock inversion you fixed in
cdcb33f9824429a926b971bf041a6cec238f91ff ?

> There's a lot we could clean up here, but I think it would take a fair
> bit of untangling before we actually solve this panic.

Yes, the mess has spread over many years ;-)

  parent reply	other threads:[~2019-10-24 11:22 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-16 13:36 Kernel panic while doing vfio-pci hot-plug/unplug test Xiang Zheng
2019-10-18 13:58 ` Bjorn Helgaas
2019-10-23  9:40   ` Xiang Zheng
2019-10-23 15:15     ` Bjorn Helgaas
2019-10-23 16:38       ` Matthew Wilcox
2019-10-23 20:46         ` Bjorn Helgaas
2019-10-24  3:26           ` Xiang Zheng
2019-10-24 11:22           ` Matthew Wilcox [this message]
2019-10-24 22:32             ` Bjorn Helgaas
2019-10-24  8:07         ` Thomas Gleixner
2020-02-22 16:55 ` Bjorn Helgaas
2020-02-24  1:29   ` Xiang Zheng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191024112232.GD2963@bombadil.infradead.org \
    --to=willy@infradead.org \
    --cc=alex.williamson@redhat.com \
    --cc=guoheyi@huawei.com \
    --cc=helgaas@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rjw@rjwysocki.net \
    --cc=tglx@linutronix.de \
    --cc=wanghaibin.wang@huawei.com \
    --cc=yebiaoxiang@huawei.com \
    --cc=zhengxiang9@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.