Re: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time

* Re: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time
@ 2022-04-21 18:26 Dexuan Cui
  2022-04-26 16:44 ` Lorenzo Pieralisi
  0 siblings, 1 reply; 9+ messages in thread
From: Dexuan Cui @ 2022-04-21 18:26 UTC (permalink / raw)
  To: Jake Oshins, Bjorn Helgaas, lorenzo.pieralisi, bhelgaas, Alex Williamson
  Cc: wei.liu, KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	linux-hyperv, linux-pci, linux-kernel, Michael Kelley (LINUX),
	robh, kw, kvm

> From: Jake Oshins <jakeo@microsoft.com>
> Sent: Wednesday, April 20, 2022 9:01 AM
> To: Bjorn Helgaas <helgaas@kernel.org>; Dexuan Cui <decui@microsoft.com>
> Cc: wei.liu@kernel.org; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>;
> lorenzo.pieralisi@arm.com; bhelgaas@google.com;
> linux-hyperv@vger.kernel.org; linux-pci@vger.kernel.org;
> linux-kernel@vger.kernel.org; Michael Kelley (LINUX)
> <mikelley@microsoft.com>; robh@kernel.org; kw@linux.com; Alex Williamson
> <alex.williamson@redhat.com>; .kvm@vger.kernel.org
> Subject: RE: [EXTERNAL] Re: [PATCH] PCI: hv: Do not set
> PCI_COMMAND_MEMORY to reduce VM boot time

I removed the "[EXTERNAL]" tag as it looks like that prevents the email from being
archived properly. See this link for the archived email thread:
https://lwn.net/ml/linux-kernel/20220419220007.26550-1-decui%40microsoft.com/

> > -----Original Message-----
> > From: Bjorn Helgaas <helgaas@kernel.org>
> > Sent: Wednesday, April 20, 2022 8:36 AM
> > To: Dexuan Cui <decui@microsoft.com>
> > Cc: wei.liu@kernel.org; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> > <haiyangz@microsoft.com>; Stephen Hemminger
> <sthemmin@microsoft.com>;
> > lorenzo.pieralisi@arm.com; bhelgaas@google.com; linux-
> > hyperv@vger.kernel.org; linux-pci@vger.kernel.org; linux-
> > kernel@vger.kernel.org; Michael Kelley (LINUX) <mikelley@microsoft.com>;
> > robh@kernel.org; kw@linux.com; Jake Oshins <jakeo@microsoft.com>; Alex
> > Williamson <alex.williamson@redhat.com>; .kvm@vger.kernel.org

I removed the period from ".kvm@" so the KVM list is correctly Cc'd.

> > Subject: [EXTERNAL] Re: [PATCH] PCI: hv: Do not set
> PCI_COMMAND_MEMORY
> > to reduce VM boot time
> >
> > [+cc Alex, kvm in case this idea is applicable for more than just Hyper-V]

Alex, your comments are welcome!

> > On Tue, Apr 19, 2022 at 03:00:07PM -0700, Dexuan Cui wrote:
> > > A VM on Azure can have 14 GPUs, and each GPU may have a huge MMIO
> > BAR,
> > > e.g. 128 GB. Currently the boot time of such a VM can be 4+ minutes,
> > > and most of the time is used by the host to unmap/map the vBAR from/to
> > > pBAR when the VM clears and sets the PCI_COMMAND_MEMORY bit: each
> > > unmap/map operation for a 128GB BAR needs about 1.8 seconds, and the
> > > pci-hyperv driver and the Linux PCI subsystem flip the
> > > PCI_COMMAND_MEMORY bit eight times (see pci_setup_device() ->
> > > pci_read_bases() and pci_std_update_resource()), increasing the boot
> > > time by 1.8 * 8 = 14.4 seconds per GPU, i.e. 14.4 * 14 = 201.6 seconds in
> total.
> > >
> > > Fix the slowness by not turning on the PCI_COMMAND_MEMORY in
> > > pci-hyperv.c, so the bit stays in the off state before the PCI device
> > > driver calls
> > > pci_enable_device(): when the bit is off, pci_read_bases() and
> > > pci_std_update_resource() don't cause Hyper-V to unmap/map the vBARs.
> > > With this change, the boot time of such a VM is reduced by
> > > 1.8 * (8-1) * 14 = 176.4 seconds.
> > >
> > > Tested-by: Boqun Feng (Microsoft) <boqun.feng@gmail.com>
> > > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > > Cc: Jake Oshins <jakeo@microsoft.com>
> > > ---
> > >  drivers/pci/controller/pci-hyperv.c | 17 +++++++++++------
> > >  1 file changed, 11 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > b/drivers/pci/controller/pci-hyperv.c
> > > index d270a204324e..f9fbbd8d94db 100644
> > > --- a/drivers/pci/controller/pci-hyperv.c
> > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > @@ -2082,12 +2082,17 @@ static void prepopulate_bars(struct
> > hv_pcibus_device *hbus)
> > >  				}
> > >  			}
> > >  			if (high_size <= 1 && low_size <= 1) {
> > > -				/* Set the memory enable bit. */
> > > -				_hv_pcifront_read_config(hpdev,
> > PCI_COMMAND, 2,
> > > -							 &command);
> > > -				command |= PCI_COMMAND_MEMORY;
> > > -				_hv_pcifront_write_config(hpdev,
> > PCI_COMMAND, 2,
> > > -							  command);
> > > +				/*
> > > +				 * No need to set the
> > PCI_COMMAND_MEMORY bit as
> > > +				 * the core PCI driver doesn't require the bit
> > > +				 * to be pre-set. Actually here we intentionally
> > > +				 * keep the bit off so that the PCI BAR probing
> > > +				 * in the core PCI driver doesn't cause Hyper-V
> > > +				 * to unnecessarily unmap/map the virtual
> > BARs
> > > +				 * from/to the physical BARs multiple times.
> > > +				 * This reduces the VM boot time significantly
> > > +				 * if the BAR sizes are huge.
> > > +				 */
> > >  				break;
> > >  			}
> > >  		}
> > > --
> > > 2.17.1
> > >
> 
> My question about this, directed at the people who know a lot more about the
> PCI subsystem in Linux than I do (Bjorn, Alex, etc.), is whether this change can
> create a problem.

Bjorn, Lorenzo, Alex, it would be nice to have your insights!

> In a physical computer, you might sometimes want to move
> a device from one part of address space to another, typically when another
> device is being hot-added to the system.  Imagine a scenario where you have,
> in this case a VM, and there are perhaps 15 NVMe SSDs passed through to the
> VM.  One of them dies and so it is removed.  The replacement SSD has a
> larger BAR than the one that was removed (because it's a different SKU) and
> now it doesn't fit neatly into the hole that was vacated by the previous SSD.
> 
> In this case, will the kernel ever attempt to move some of the existing SSDs to
> make room for the new one?  And if so, does this come with the requirement
> that they actually get mapped out of the address space while they are being
> moved?  (This was the scenario that prompted me to write the code above as
> it was, originally.)
> 
> Thanks,
> Jake Oshins

Sorry I don't quite follow. pci-hyperv allocates MMIO for the bridge window
in hv_pci_allocate_bridge_windows() and registers the MMIO ranges to the core
PCI driver via pci_add_resource(), and later the core PCI driver probes the
bus/device(s), validates the BAR sizes and the pre-initialized BAR values, and
uses the BAR configuration. IMO the whole process doesn't require the bit
PCI_COMMAND_MEMORY to be pre-set, and there should be no issue to
delay setting the bit to a PCI device device's .probe() -> pci_enable_device().

When a device is removed, hv_eject_device_work() -> 
pci_stop_and_remove_bus_device() triggers the PCI device driver's .remove(),
which calls pci_disable_device(), which instructs the hypervisor to unmap
the vBAR from the pBAR, so there should not be any issue when a new device
is added later. 

With the patch, if no PCI device driver is loaded, the PCI_COMMAND_MEMORY
bit stays in the off state, and the hypervisor doesn't map the vBAR to the pBAR.
I don't see any issue with this.

hv_pci_allocate_bridge_windows() -> vmbus_allocate_mmio() makes sure
that there is enough MMIO space for the devices on the bus, so the core
PCI driver doesn't have to worry about any MMIO conflict issue.

Please let me know if I understood and answered the questions.

Thanks,
-- Dexuan

^ permalink raw reply	[flat|nested] 9+ messages in thread