linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: Christoph Hellwig <hch@lst.de>
Cc: Christoph Hellwig <hch@lst.de>,
	John Garry <john.garry@huawei.com>,
	Ming Lei <ming.lei@redhat.com>, Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Bjorn Helgaas <bhelgaas@google.com>,
	linux-pci@vger.kernel.org, Sagi Grimberg <sagi@grimberg.me>,
	Daniel Wagner <dwagner@suse.de>, Wen Xiong <wenxiong@us.ibm.com>,
	Hannes Reinecke <hare@suse.de>, Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
Date: Thu, 22 Jul 2021 00:38:07 +0200	[thread overview]
Message-ID: <878s1zpa28.ffs@nanos.tec.linutronix.de> (raw)
In-Reply-To: <20210721203259.GA18960@lst.de>

On Wed, Jul 21 2021 at 22:32, Christoph Hellwig wrote:
> On Wed, Jul 21, 2021 at 10:14:25PM +0200, Thomas Gleixner wrote:
>>   https://lore.kernel.org/r/87o8bxcuxv.ffs@nanos.tec.linutronix.de
>> 
>> TLDR: virtio allocates ONE irq on msix_enable() and then when the
>> guest

OOps, sorry that should have been VFIO not virtio.

>> actually unmasks another entry (e.g. request_irq()), it tears down the
>> allocated one and set's up two. On the third one this repeats ....
>> 
>> There are only two options:
>> 
>>   1) allocate everything upfront, which is undesired
>>   2) append entries, which might need locking, but I'm still trying to
>>      avoid that
>> 
>> There is another problem vs. vector exhaustion which can't be fixed that
>> way, but that's a different story.
>
> FTI, NVMe is similar.  We need one IRQ to setup the admin queue,
> which is used to query/set how many I/O queues are supported.  Just
> two steps though and not unbound.

That's fine because that's controlled by the driver consistently and it
(hopefully) makes sure that the admin queue is quiesced before
everything is torn down after the initial query.

But that's not the case for VFIO. It tears down all in use interrupts
and the guest driver is completely oblivious of that.

Assume the following situation:

 1) VM boots with 8 present CPUs and 16 possible CPUs

 2) The passed through card (PF or VF) supports multiqueue and the
    driver uses managed interrupts which e.g. allocates one queue and
    one interrupt per possible CPU.

    Initial setup requests all the interrupts, but only the first 8
    queue interrupts are unmasked and therefore reallocated by the host
    which works by some definition of works because the device is quiet
    at that point.

 3) Host admin plugs the other 8 CPUs into the guest

    Onlining these CPUs in the guest will unmask the dormant managed
    queue interrupts and cause the host to allocate the remaining 8 per
    queue interrupts one by one thereby tearing down _all_ previously
    allocated ones and then allocating one more than before.

    Assume that while this goes on the guest has I/O running on the
    already online CPUs and their associated queues. Depending on the
    device this either will lose interrupts or reroute them to the
    legacy INTx which is not handled. This might in the best case result
    in a few "timedout" requests, but I managed it at least once to make
    the device go into lala land state, i.e. it did not recover.

The above can be fixed by adding an 'append' mode to the MSI code.

But that does not fix the overcommit issue where the host runs out of
vector space. The result is simply that the guest does not know and just
continues to work on device/queues which will never ever recieve an
interrupt (again).

I got educated that all of this is considered unlikely and my argument
that the concept of unlikely simply does not exist at cloud scale got
ignored. Sure, I know it's VIRT and therefore not subject to common
sense.

Thanks,

        tglx



    
    

  reply	other threads:[~2021-07-21 22:38 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-15 12:08 [PATCH V4 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
2021-07-15 12:08 ` [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed Ming Lei
2021-07-15 12:40   ` Greg Kroah-Hartman
2021-07-16  2:17     ` Ming Lei
2021-07-16 20:01   ` Bjorn Helgaas
2021-07-17  9:30     ` Ming Lei
2021-07-21  0:30       ` Bjorn Helgaas
2021-07-19  7:51   ` John Garry
2021-07-19  9:44     ` Christoph Hellwig
2021-07-19 10:39       ` John Garry
2021-07-20  2:38         ` Ming Lei
2021-07-21  7:20       ` Thomas Gleixner
2021-07-21  7:24         ` Christoph Hellwig
2021-07-21  9:44           ` John Garry
2021-07-21 20:22             ` Thomas Gleixner
2021-07-22  7:48               ` John Garry
2021-07-21 20:14           ` Thomas Gleixner
2021-07-21 20:32             ` Christoph Hellwig
2021-07-21 22:38               ` Thomas Gleixner [this message]
2021-07-22  7:46                 ` Christoph Hellwig
2021-07-15 12:08 ` [PATCH V4 2/3] blk-mq: mark if one queue map uses managed irq Ming Lei
2021-07-15 12:08 ` [PATCH V4 3/3] blk-mq: don't deactivate hctx if managed irq isn't used Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=878s1zpa28.ffs@nanos.tec.linutronix.de \
    --to=tglx@linutronix.de \
    --cc=axboe@kernel.dk \
    --cc=bhelgaas@google.com \
    --cc=dwagner@suse.de \
    --cc=gregkh@linuxfoundation.org \
    --cc=hare@suse.de \
    --cc=hch@lst.de \
    --cc=john.garry@huawei.com \
    --cc=kbusch@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=ming.lei@redhat.com \
    --cc=sagi@grimberg.me \
    --cc=wenxiong@us.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).