linux-kernel.vger.kernel.org archive mirror
From: Alex Williamson <alex.williamson@redhat.com>
To: Abhishek Sahu <abhsahu@nvidia.com>
Cc: Cornelia Huck <cohuck@redhat.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
	Kevin Tian <kevin.tian@intel.com>,
	"Rafael J . Wysocki" <rafael@kernel.org>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<linux-pm@vger.kernel.org>, <linux-pci@vger.kernel.org>
Subject: Re: [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state
Date: Wed, 6 Jul 2022 09:40:41 -0600	[thread overview]
Message-ID: <20220706094041.29875c80.alex.williamson@redhat.com> (raw)
In-Reply-To: <20220701110814.7310-5-abhsahu@nvidia.com>

On Fri, 1 Jul 2022 16:38:12 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> Currently, if runtime power management is enabled for a vfio-pci
> based device in the guest OS, then the guest OS writes the
> PCI_PM_CTRL register. This write request is handled in
> vfio_pm_config_write(), which performs the actual PCI_PM_CTRL
> register write. With this, at most the D3hot state can be achieved
> for low power. If we can use the runtime PM framework instead, then
> we can achieve the D3cold state (on supported systems), which helps
> in saving the maximum power.
> 
> 1. The D3cold state can't be achieved by writing the standard PCI
>    PM config registers. This patch implements the newly added
>    'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' device feature, which
>    can be used to put the device into the D3cold state.
> 
> 2. Hypervisors can implement virtual ACPI methods. For example,
>    in a Linux guest, if the PCI device's ACPI node has _PR3 and _PR0
>    power resources with _ON/_OFF methods, then the guest invokes the
>    _OFF method during the D3cold transition and _ON during the D0
>    transition. The hypervisor can trap these virtual ACPI calls and
>    then call 'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' with the
>    respective flags.
> 
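For context, roughly how a VMM might drive this from its trapped
virtual ACPI _OFF/_ON handlers. This is only a sketch against the uAPI
proposed in patch 2/6 of this series (VFIO_DEVICE_FEATURE_POWER_MANAGEMENT,
struct vfio_device_feature_power_management and the VFIO_PM_LOW_POWER_*
flags are not in mainline headers); error handling is omitted:

  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  /* Sketch only: the feature and flag definitions come from patch 2/6. */
  static int vfio_pm_set_low_power(int device_fd, uint32_t pm_flag)
  {
          /* VFIO_DEVICE_FEATURE payload follows the feature header. */
          uint8_t buf[sizeof(struct vfio_device_feature) +
                      sizeof(struct vfio_device_feature_power_management)];
          struct vfio_device_feature *feature = (void *)buf;
          struct vfio_device_feature_power_management *pm =
                  (void *)feature->data;

          memset(buf, 0, sizeof(buf));
          feature->argsz = sizeof(buf);
          feature->flags = VFIO_DEVICE_FEATURE_SET |
                           VFIO_DEVICE_FEATURE_POWER_MANAGEMENT;
          pm->flags = pm_flag;    /* VFIO_PM_LOW_POWER_ENTER or _EXIT */

          return ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
  }

A trapped _OFF would call this with VFIO_PM_LOW_POWER_ENTER, a trapped
_ON with VFIO_PM_LOW_POWER_EXIT.
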
> 3. The vfio-pci driver uses the runtime PM framework to achieve the
>    D3cold state. For the D3cold transition, decrement the usage count;
>    for the D0 transition, increment the usage count.
> 
> 4. If the D3cold state is not supported, then the device will remain
>    in the D3hot state. But with runtime PM, the root port can now
>    also go into the suspended state.
> 
> 5. The 'pm_runtime_engaged' flag tracks entry into and exit from
>    runtime PM. This flag is protected by the 'memory_lock' semaphore.
> 
> 6. On exit, the flag clearing and usage count increment are
>    protected by 'memory_lock'. The actual wake-up happens outside
>    'memory_lock', since 'memory_lock' will also be needed inside the
>    runtime_resume callback in subsequent patches.
> 
> 7. In D3cold, all kinds of device-related access (BAR read/write,
>    config read/write, etc.) need to be disabled. For BAR access,
>    we can use the existing D3hot memory disable support. During low
>    power entry, invalidate the mmap mappings and add a check for the
>    'pm_runtime_engaged' flag.

Not disabled, just wrapped in pm-get/put.  If the device is indefinitely
in low-power without a wake-up eventfd, mmap faults are fatal to the
user.
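
To make that concrete, a simplified sketch (not the in-tree fault
handler, which also takes vma_lock and tracks vmas) of why an mmap
access is fatal while the user has put the device in low power: with
the __vfio_pci_memory_enabled() change in this patch, the fault path
takes the SIGBUS branch. vfio_pci_io_remap() is a hypothetical
stand-in for the existing io_remap_pfn_range() bookkeeping:

  static vm_fault_t vfio_pci_mmap_fault_sketch(struct vm_fault *vmf)
  {
          struct vfio_pci_core_device *vdev = vmf->vma->vm_private_data;
          vm_fault_t ret = VM_FAULT_NOPAGE;

          down_read(&vdev->memory_lock);

          /*
           * With this patch, __vfio_pci_memory_enabled() also returns
           * false while pm_runtime_engaged is set, so a user touching
           * the zapped mmap without first exiting low power gets SIGBUS.
           */
          if (!__vfio_pci_memory_enabled(vdev))
                  ret = VM_FAULT_SIGBUS;
          else if (vfio_pci_io_remap(vdev, vmf->vma))     /* hypothetical */
                  ret = VM_FAULT_SIGBUS;

          up_read(&vdev->memory_lock);
          return ret;
  }

Region read/write, by contrast, is wrapped in
pm_runtime_resume_and_get()/pm_runtime_put() in vfio_pci_rw() below,
so those accesses wake the device instead of faulting.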
 
> 8. For config space, ideally we would return an error for any config
>    access from the user side once the user has moved the device into
>    the low power state. But checking the 'pm_runtime_engaged' flag
>    alone is not sufficient, due to the following possible scenario
>    where a user config space access races with the low power entry
>    IOCTL.
> 
>    a. A config space access happens and vfio_pci_config_rw() is
>       called.
>    b. The IOCTL to move into the low power state is called.
>    c. The IOCTL moves the device into D3cold.
>    d. vfio_pci_config_rw() completes the access and returns.
> 
>    Now, if we only check 'pm_runtime_engaged', then in the above
>    sequence the config space access happens while the device is
>    already in the low power state. To prevent this, we increment the
>    usage count before any config space access and decrement it after
>    completing the access. Also, to prevent similar cases for other
>    types of access, the usage count is incremented for all kinds of
>    access.

Unnecessary, just wrap in pm-get/put.  Thanks,

Alex

> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_config.c |   2 +-
>  drivers/vfio/pci/vfio_pci_core.c   | 169 +++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h      |   1 +
>  3 files changed, 164 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 9343f597182d..21a4743d011f 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -408,7 +408,7 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
>  	 * PF SR-IOV capability, there's therefore no need to trigger
>  	 * faults based on the virtual value.
>  	 */
> -	return pdev->current_state < PCI_D3hot &&
> +	return !vdev->pm_runtime_engaged && pdev->current_state < PCI_D3hot &&
>  	       (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY));
>  }
>  
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 5948d930449b..8c17ca41d156 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -264,6 +264,18 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
>  {
>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>  
> +	down_write(&vdev->memory_lock);
> +	/*
> +	 * The user can move the device into D3hot state before invoking
> +	 * power management IOCTL. Move the device into D0 state here and then
> +	 * the pci-driver core runtime PM suspend function will move the device
> +	 * into the low power state. Also, for the devices which have
> +	 * NoSoftRst-, it will help in restoring the original state
> +	 * (saved locally in 'vdev->pm_save').
> +	 */
> +	vfio_pci_set_power_state(vdev, PCI_D0);
> +	up_write(&vdev->memory_lock);
> +
>  	/*
>  	 * If INTx is enabled, then mask INTx before going into the runtime
>  	 * suspended state and unmask the same in the runtime resume.
> @@ -386,6 +398,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	struct pci_dev *pdev = vdev->pdev;
>  	struct vfio_pci_dummy_resource *dummy_res, *tmp;
>  	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
> +	bool do_resume = false;
>  	int i, bar;
>  
>  	/* For needs_reset */
> @@ -393,6 +406,25 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  
>  	/*
>  	 * This function can be invoked while the power state is non-D0.
> +	 * This non-D0 power state can be with or without runtime PM.
> +	 * Increment the usage count corresponding to pm_runtime_put()
> +	 * called during setting of 'pm_runtime_engaged'. The device will
> +	 * wake up if it has already gone into the suspended state.
> +	 * Otherwise, the next vfio_pci_set_power_state() will change the
> +	 * device power state to D0.
> +	 */
> +	down_write(&vdev->memory_lock);
> +	if (vdev->pm_runtime_engaged) {
> +		vdev->pm_runtime_engaged = false;
> +		pm_runtime_get_noresume(&pdev->dev);
> +		do_resume = true;
> +	}
> +	up_write(&vdev->memory_lock);
> +
> +	if (do_resume)
> +		pm_runtime_resume(&pdev->dev);
> +
> +	/*
>  	 * This function calls __pci_reset_function_locked() which internally
>  	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
>  	 * fail if the power state is non-D0. Also, for the devices which
> @@ -1190,6 +1222,99 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>  
> +static int vfio_pci_pm_validate_flags(u32 flags)
> +{
> +	if (!flags)
> +		return -EINVAL;
> +	/* Only valid flags should be set */
> +	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> +		return -EINVAL;
> +	/* Both enter and exit should not be set */
> +	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
> +	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
> +				    void __user *arg, size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct vfio_device_feature_power_management vfio_pm = { 0 };
> +	int ret = 0;
> +
> +	ret = vfio_check_feature(flags, argsz,
> +				 VFIO_DEVICE_FEATURE_SET |
> +				 VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(vfio_pm));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (flags & VFIO_DEVICE_FEATURE_GET) {
> +		down_read(&vdev->memory_lock);
> +		if (vdev->pm_runtime_engaged)
> +			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
> +		else
> +			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
> +		up_read(&vdev->memory_lock);
> +
> +		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
> +			return -EFAULT;
> +
> +		return 0;
> +	}
> +
> +	if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
> +		return -EFAULT;
> +
> +	ret = vfio_pci_pm_validate_flags(vfio_pm.flags);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * The vdev power related flags are protected with 'memory_lock'
> +	 * semaphore.
> +	 */
> +	if (vfio_pm.flags & VFIO_PM_LOW_POWER_ENTER) {
> +		vfio_pci_zap_and_down_write_memory_lock(vdev);
> +		if (vdev->pm_runtime_engaged) {
> +			up_write(&vdev->memory_lock);
> +			return -EINVAL;
> +		}
> +
> +		vdev->pm_runtime_engaged = true;
> +		up_write(&vdev->memory_lock);
> +		pm_runtime_put(&pdev->dev);
> +	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
> +		down_write(&vdev->memory_lock);
> +		if (!vdev->pm_runtime_engaged) {
> +			up_write(&vdev->memory_lock);
> +			return -EINVAL;
> +		}
> +
> +		vdev->pm_runtime_engaged = false;
> +		pm_runtime_get_noresume(&pdev->dev);
> +		up_write(&vdev->memory_lock);
> +		ret = pm_runtime_resume(&pdev->dev);
> +		if (ret < 0) {
> +			down_write(&vdev->memory_lock);
> +			if (!vdev->pm_runtime_engaged) {
> +				vdev->pm_runtime_engaged = true;
> +				pm_runtime_put_noidle(&pdev->dev);
> +			}
> +			up_write(&vdev->memory_lock);
> +			return ret;
> +		}
> +	} else {
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
>  static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>  				       void __user *arg, size_t argsz)
>  {
> @@ -1224,6 +1349,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
> +		return vfio_pci_core_feature_pm(device, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -1234,31 +1361,47 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>  			   size_t count, loff_t *ppos, bool iswrite)
>  {
>  	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	int ret;
>  
>  	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
>  		return -EINVAL;
>  
> +	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
> +	if (ret < 0) {
> +		pci_info_ratelimited(vdev->pdev, "runtime resume failed %d\n",
> +				     ret);
> +		return -EIO;
> +	}
> +
>  	switch (index) {
>  	case VFIO_PCI_CONFIG_REGION_INDEX:
> -		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>  
>  	case VFIO_PCI_ROM_REGION_INDEX:
>  		if (iswrite)
> -			return -EINVAL;
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +			ret = -EINVAL;
> +		else
> +			ret = vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +		break;
>  
>  	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>  
>  	case VFIO_PCI_VGA_REGION_INDEX:
> -		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		break;
> +
>  	default:
>  		index -= VFIO_PCI_NUM_REGIONS;
> -		return vdev->region[index].ops->rw(vdev, buf,
> +		ret = vdev->region[index].ops->rw(vdev, buf,
>  						   count, ppos, iswrite);
> +		break;
>  	}
>  
> -	return -EINVAL;
> +	pm_runtime_put(&vdev->pdev->dev);
> +	return ret;
>  }
>  
>  ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> @@ -2157,6 +2300,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		goto err_unlock;
>  	}
>  
> +	/*
> +	 * Some of the devices in the dev_set can be in the runtime suspended
> +	 * state. Increment the usage count for all the devices in the dev_set
> +	 * before reset and decrement the same after reset.
> +	 */
> +	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
> +	if (ret)
> +		goto err_unlock;
> +
>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>  		/*
>  		 * Test whether all the affected devices are contained by the
> @@ -2212,6 +2364,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		else
>  			mutex_unlock(&cur->vma_lock);
>  	}
> +
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> +		pm_runtime_put(&cur->pdev->dev);
>  err_unlock:
>  	mutex_unlock(&dev_set->lock);
>  	return ret;
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index cdfd328ba6b1..bf4823b008f9 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -125,6 +125,7 @@ struct vfio_pci_core_device {
>  	bool			nointx;
>  	bool			needs_pm_restore;
>  	bool			pm_intx_masked;
> +	bool			pm_runtime_engaged;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;


Thread overview: 26+ messages
2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
2022-07-06 15:39   ` Alex Williamson
2022-07-08  9:21     ` Abhishek Sahu
2022-07-08 15:45       ` Alex Williamson
2022-07-11  9:18         ` Abhishek Sahu
2022-07-11 12:57           ` Alex Williamson
2022-07-01 11:08 ` [PATCH v4 2/6] vfio: Add a new device feature for the power management Abhishek Sahu
2022-07-06 15:39   ` Alex Williamson
2022-07-08  9:39     ` Abhishek Sahu
2022-07-08 16:36       ` Alex Williamson
2022-07-11  9:43         ` Abhishek Sahu
2022-07-11 13:04           ` Alex Williamson
2022-07-11 17:30             ` Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-08  9:43     ` Abhishek Sahu
2022-07-08 16:49       ` Alex Williamson
2022-07-11  9:50         ` Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson [this message]
2022-07-01 11:08 ` [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-01 11:08 ` [PATCH v4 6/6] vfio/pci: Add support for virtual PME Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-08  9:45     ` Abhishek Sahu
