linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH net-next 6/6] mlxsw: pci: Add support for new reset flow
       [not found] ` <c61d07469ecf5d3053442e24d4d050405f466b76.1679502371.git.petrm@nvidia.com>
@ 2023-03-23  9:13   ` Petr Machata
  2023-03-23 16:51   ` Bjorn Helgaas
  1 sibling, 0 replies; 10+ messages in thread
From: Petr Machata @ 2023-03-23  9:13 UTC (permalink / raw)
  To: Petr Machata
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Ido Schimmel, Amit Cohen, mlxsw, linux-pci,
	Bjorn Helgaas


Petr Machata <petrm@nvidia.com> writes:

> From: Amit Cohen <amcohen@nvidia.com>
>
> The driver resets the device during probe and during a devlink reload.
> The current reset method reloads the current firmware version or a pending
> one, if one was previously flashed using devlink. However, the reset does
> not take down the PCI link, preventing the PCI firmware from being
> upgraded, unless the system is rebooted.
>
> To solve this problem, a new reset command (6) was implemented in the
> firmware. Unlike the current command (1), after issuing the new command
> the device will not start the reset immediately, but only after the PCI
> link was disabled. The driver is expected to wait for 500ms before
> re-enabling the link to give the firmware enough time to start the reset.
>
> Implement the new reset method and use it only after verifying it is
> supported by the current firmware version by querying the Management
> Capabilities Mask (MCAM) register. Consider the PCI firmware to be
> operational either after waiting for a predefined time of 2000ms or after
> reading an active link status when "Data Link Layer Link Active Reporting"
> is supported. For good measures, make sure the device ID can be read from
> the configuration space of the device.
>
> Once the PCI firmware is operational, go back to the regular reset flow
> and wait for the entire device to become ready. That is, repeatedly read
> the "system_status" register from the BAR until a value of "FW_READY"
> (0x5E) appears.
>
> Signed-off-by: Amit Cohen <amcohen@nvidia.com>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> Signed-off-by: Petr Machata <petrm@nvidia.com>

What I forgot to do was to CC PCI maintainer / subsystem. None of us in
the mlxsw team is well versed in PCI odds and ends, and we would
appreciate a sanity check of this code.

> ---
>  drivers/net/ethernet/mellanox/mlxsw/pci.c    | 151 ++++++++++++++++++-
>  drivers/net/ethernet/mellanox/mlxsw/pci_hw.h |   5 +
>  2 files changed, 155 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
> index 73ae2fdd94c4..9b11c5280424 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
> @@ -1459,6 +1459,137 @@ static int mlxsw_pci_sys_ready_wait(struct mlxsw_pci *mlxsw_pci,
>  	return -EBUSY;
>  }
>  
> +static int mlxsw_pci_link_active_wait(struct pci_dev *pdev)
> +{
> +	unsigned long end;
> +	u16 lnksta;
> +	int err;
> +
> +	end = jiffies + msecs_to_jiffies(MLXSW_PCI_TOGGLE_TIMEOUT_MSECS);
> +	do {
> +		msleep(MLXSW_PCI_TOGGLE_WAIT_MSECS);
> +		err = pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnksta);
> +		if (err)
> +			return pcibios_err_to_errno(err);
> +
> +		if (lnksta & PCI_EXP_LNKSTA_DLLLA)
> +			return 0;
> +	} while (time_before(jiffies, end));
> +
> +	pci_err(pdev, "PCI link not ready (0x%04x) after %d ms\n", lnksta,
> +		MLXSW_PCI_TOGGLE_TIMEOUT_MSECS);
> +
> +	return -ETIMEDOUT;
> +}
> +
> +static int mlxsw_pci_link_active_check(struct pci_dev *pdev)
> +{
> +	u32 lnkcap;
> +	int err;
> +
> +	err = pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &lnkcap);
> +	if (err)
> +		goto out;
> +
> +	if (lnkcap & PCI_EXP_LNKCAP_DLLLARC)
> +		return mlxsw_pci_link_active_wait(pdev);
> +
> +	/* In case the device does not support "Data Link Layer Link Active
> +	 * Reporting", simply wait for a predefined time for the device to
> +	 * become active.
> +	 */
> +	pci_dbg(pdev, "No PCI link reporting capability (0x%08x)\n", lnkcap);
> +
> +out:
> +	/* Sleep before handling the rest of the flow and accessing to PCI. */
> +	msleep(MLXSW_PCI_TOGGLE_TIMEOUT_MSECS);
> +	return pcibios_err_to_errno(err);
> +}
> +
> +static int mlxsw_pci_link_toggle(struct pci_dev *pdev)
> +{
> +	int err;
> +
> +	/* Disable the link. */
> +	err = pcie_capability_set_word(pdev, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LD);
> +	if (err)
> +		return pcibios_err_to_errno(err);
> +
> +	/* Sleep to give firmware enough time to start the reset. */
> +	msleep(MLXSW_PCI_TOGGLE_WAIT_BEFORE_EN_MSECS);
> +
> +	/* Enable the link. */
> +	err = pcie_capability_clear_word(pdev, PCI_EXP_LNKCTL,
> +					 PCI_EXP_LNKCTL_LD);
> +	if (err)
> +		return pcibios_err_to_errno(err);
> +
> +	/* Wait for link active. */
> +	return mlxsw_pci_link_active_check(pdev);
> +}
> +
> +static int mlxsw_pci_device_id_read(struct pci_dev *pdev, u16 exp_dev_id)
> +{
> +	unsigned long end;
> +	u16 dev_id;
> +	int err;
> +
> +	end = jiffies + msecs_to_jiffies(MLXSW_PCI_TOGGLE_TIMEOUT_MSECS);
> +	do {
> +		msleep(MLXSW_PCI_TOGGLE_WAIT_MSECS);
> +
> +		/* Expect to get the correct PCI device ID as first indication
> +		 * that the ASIC is available.
> +		 */
> +		err = pci_read_config_word(pdev, PCI_DEVICE_ID, &dev_id);
> +		if (err)
> +			return pcibios_err_to_errno(err);
> +
> +		if (dev_id == exp_dev_id)
> +			return 0;
> +	} while (time_before(jiffies, end));
> +
> +	pci_err(pdev, "PCI device ID is not as expected after %d ms\n",
> +		MLXSW_PCI_TOGGLE_TIMEOUT_MSECS);
> +
> +	return -ETIMEDOUT;
> +}
> +
> +static int mlxsw_pci_reset_at_pci_disable(struct mlxsw_pci *mlxsw_pci)
> +{
> +	struct pci_bus *bridge_bus = mlxsw_pci->pdev->bus;
> +	struct pci_dev *bridge_pdev = bridge_bus->self;
> +	struct pci_dev *pdev = mlxsw_pci->pdev;
> +	char mrsr_pl[MLXSW_REG_MRSR_LEN];
> +	u16 dev_id = pdev->device;
> +	int err;
> +
> +	mlxsw_reg_mrsr_pack(mrsr_pl,
> +			    MLXSW_REG_MRSR_COMMAND_RESET_AT_PCI_DISABLE);
> +	err = mlxsw_reg_write(mlxsw_pci->core, MLXSW_REG(mrsr), mrsr_pl);
> +	if (err)
> +		return err;
> +
> +	/* Save the PCI configuration space so that we will be able to restore
> +	 * it after the firmware was reset.
> +	 */
> +	pci_save_state(pdev);
> +	pci_cfg_access_lock(pdev);
> +
> +	err = mlxsw_pci_link_toggle(bridge_pdev);
> +	if (err) {
> +		pci_err(bridge_pdev, "Failed to toggle PCI link\n");
> +		goto restore;
> +	}
> +
> +	err = mlxsw_pci_device_id_read(pdev, dev_id);
> +
> +restore:
> +	pci_cfg_access_unlock(pdev);
> +	pci_restore_state(pdev);
> +	return err;
> +}
> +
>  static int mlxsw_pci_reset_sw(struct mlxsw_pci *mlxsw_pci)
>  {
>  	char mrsr_pl[MLXSW_REG_MRSR_LEN];
> @@ -1471,6 +1602,8 @@ static int
>  mlxsw_pci_reset(struct mlxsw_pci *mlxsw_pci, const struct pci_device_id *id)
>  {
>  	struct pci_dev *pdev = mlxsw_pci->pdev;
> +	char mcam_pl[MLXSW_REG_MCAM_LEN];
> +	bool pci_reset_supported;
>  	u32 sys_status;
>  	int err;
>  
> @@ -1481,7 +1614,23 @@ mlxsw_pci_reset(struct mlxsw_pci *mlxsw_pci, const struct pci_device_id *id)
>  		return err;
>  	}
>  
> -	err = mlxsw_pci_reset_sw(mlxsw_pci);
> +	mlxsw_reg_mcam_pack(mcam_pl,
> +			    MLXSW_REG_MCAM_FEATURE_GROUP_ENHANCED_FEATURES);
> +	err = mlxsw_reg_query(mlxsw_pci->core, MLXSW_REG(mcam), mcam_pl);
> +	if (err)
> +		return err;
> +
> +	mlxsw_reg_mcam_unpack(mcam_pl, MLXSW_REG_MCAM_PCI_RESET,
> +			      &pci_reset_supported);
> +
> +	if (pci_reset_supported) {
> +		pci_dbg(pdev, "Starting PCI reset flow\n");
> +		err = mlxsw_pci_reset_at_pci_disable(mlxsw_pci);
> +	} else {
> +		pci_dbg(pdev, "Starting software reset flow\n");
> +		err = mlxsw_pci_reset_sw(mlxsw_pci);
> +	}
> +
>  	if (err)
>  		return err;
>  
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci_hw.h b/drivers/net/ethernet/mellanox/mlxsw/pci_hw.h
> index 48dbfea0a2a1..ded0828d7f1f 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/pci_hw.h
> +++ b/drivers/net/ethernet/mellanox/mlxsw/pci_hw.h
> @@ -27,6 +27,11 @@
>  
>  #define MLXSW_PCI_SW_RESET_TIMEOUT_MSECS	900000
>  #define MLXSW_PCI_SW_RESET_WAIT_MSECS		200
> +
> +#define MLXSW_PCI_TOGGLE_WAIT_BEFORE_EN_MSECS	500
> +#define MLXSW_PCI_TOGGLE_WAIT_MSECS		20
> +#define MLXSW_PCI_TOGGLE_TIMEOUT_MSECS		2000
> +
>  #define MLXSW_PCI_FW_READY			0xA1844
>  #define MLXSW_PCI_FW_READY_MASK			0xFFFF
>  #define MLXSW_PCI_FW_READY_MAGIC		0x5E


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 6/6] mlxsw: pci: Add support for new reset flow
       [not found] ` <c61d07469ecf5d3053442e24d4d050405f466b76.1679502371.git.petrm@nvidia.com>
  2023-03-23  9:13   ` [PATCH net-next 6/6] mlxsw: pci: Add support for new reset flow Petr Machata
@ 2023-03-23 16:51   ` Bjorn Helgaas
  2023-03-26 13:53     ` Ido Schimmel
  1 sibling, 1 reply; 10+ messages in thread
From: Bjorn Helgaas @ 2023-03-23 16:51 UTC (permalink / raw)
  To: Petr Machata
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Ido Schimmel, Amit Cohen, mlxsw, linux-pci

Hi Petr, thanks for pointing me here.

On Wed, Mar 22, 2023 at 05:49:35PM +0100, Petr Machata wrote:
> From: Amit Cohen <amcohen@nvidia.com>
> 
> The driver resets the device during probe and during a devlink reload.
> The current reset method reloads the current firmware version or a pending
> one, if one was previously flashed using devlink. However, the reset does
> not take down the PCI link, preventing the PCI firmware from being
> upgraded, unless the system is rebooted.

Just to make sure I understand this correctly, the above sounds like
"firmware" includes two parts that have different rules for loading:

  - Current reset method is completely mlxsw-specific and resets the
    mlxsw core but not the PCIe interface; this loads only firmware
    part A

  - A PCIe reset resets both the mlxsw core and the PCIe interface;
    this loads both firmware part A and part B

> To solve this problem, a new reset command (6) was implemented in the
> firmware. Unlike the current command (1), after issuing the new command
> the device will not start the reset immediately, but only after the PCI
> link was disabled. The driver is expected to wait for 500ms before
> re-enabling the link to give the firmware enough time to start the reset.

I guess the idea here is that the mlxsw driver:

  - Tells the firmware we're going to reset
    (MLXSW_REG_MRSR_COMMAND_RESET_AT_PCI_DISABLE)

  - Saves PCI config state

  - Disables the link (mlxsw_pci_link_toggle()), which causes a PCIe
    hot reset

  - The firmware notices the link disable and starts its own internal
    reset

  - The mlxsw driver waits 500ms
    (MLXSW_PCI_TOGGLE_WAIT_BEFORE_EN_MSECS)

  - Enables link and waits for it to be active
    (mlxsw_pci_link_active_check()

  - Waits for device to be ready again (mlxsw_pci_device_id_read())

So the first question is why you don't simply use
pci_reset_function(), since it is supposed to cause a reset and do all
the necessary waiting for the device to be ready.  This is quite
complicated to do correctly; in fact, we still discover issues there
regularly.  There are many special cases in PCIe r6.0, sec 6.6.1, and
it would be much better if we can avoid trying to handle them all in
individual drivers.

Of course, pci_reset_function() does *not* include details like
MLXSW_PCI_TOGGLE_WAIT_BEFORE_EN_MSECS.

I assume that flashing the firmware to the device followed by a power
cycle (without ever doing MLXSW_REG_MRSR_COMMAND_RESET_AT_PCI_DISABLE)
would load the new firmware everywhere.  Can we not do the same with a
PCIe reset?

> Implement the new reset method and use it only after verifying it is
> supported by the current firmware version by querying the Management
> Capabilities Mask (MCAM) register. Consider the PCI firmware to be
> operational either after waiting for a predefined time of 2000ms or after
> reading an active link status when "Data Link Layer Link Active Reporting"
> is supported. For good measures, make sure the device ID can be read from
> the configuration space of the device.
> 
> Once the PCI firmware is operational, go back to the regular reset flow
> and wait for the entire device to become ready. That is, repeatedly read
> the "system_status" register from the BAR until a value of "FW_READY"
> (0x5E) appears.
> 
> Signed-off-by: Amit Cohen <amcohen@nvidia.com>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlxsw/pci.c    | 151 ++++++++++++++++++-
>  drivers/net/ethernet/mellanox/mlxsw/pci_hw.h |   5 +
>  2 files changed, 155 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
> index 73ae2fdd94c4..9b11c5280424 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
> @@ -1459,6 +1459,137 @@ static int mlxsw_pci_sys_ready_wait(struct mlxsw_pci *mlxsw_pci,
>  	return -EBUSY;
>  }
>  
> +static int mlxsw_pci_link_active_wait(struct pci_dev *pdev)
> +{
> +	unsigned long end;
> +	u16 lnksta;
> +	int err;
> +
> +	end = jiffies + msecs_to_jiffies(MLXSW_PCI_TOGGLE_TIMEOUT_MSECS);
> +	do {
> +		msleep(MLXSW_PCI_TOGGLE_WAIT_MSECS);
> +		err = pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnksta);
> +		if (err)
> +			return pcibios_err_to_errno(err);
> +
> +		if (lnksta & PCI_EXP_LNKSTA_DLLLA)
> +			return 0;
> +	} while (time_before(jiffies, end));
> +
> +	pci_err(pdev, "PCI link not ready (0x%04x) after %d ms\n", lnksta,
> +		MLXSW_PCI_TOGGLE_TIMEOUT_MSECS);
> +
> +	return -ETIMEDOUT;
> +}
> +
> +static int mlxsw_pci_link_active_check(struct pci_dev *pdev)
> +{
> +	u32 lnkcap;
> +	int err;
> +
> +	err = pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &lnkcap);
> +	if (err)
> +		goto out;
> +
> +	if (lnkcap & PCI_EXP_LNKCAP_DLLLARC)
> +		return mlxsw_pci_link_active_wait(pdev);
> +
> +	/* In case the device does not support "Data Link Layer Link Active
> +	 * Reporting", simply wait for a predefined time for the device to
> +	 * become active.
> +	 */
> +	pci_dbg(pdev, "No PCI link reporting capability (0x%08x)\n", lnkcap);
> +
> +out:
> +	/* Sleep before handling the rest of the flow and accessing to PCI. */
> +	msleep(MLXSW_PCI_TOGGLE_TIMEOUT_MSECS);
> +	return pcibios_err_to_errno(err);
> +}
> +
> +static int mlxsw_pci_link_toggle(struct pci_dev *pdev)
> +{
> +	int err;
> +
> +	/* Disable the link. */
> +	err = pcie_capability_set_word(pdev, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LD);
> +	if (err)
> +		return pcibios_err_to_errno(err);
> +
> +	/* Sleep to give firmware enough time to start the reset. */
> +	msleep(MLXSW_PCI_TOGGLE_WAIT_BEFORE_EN_MSECS);
> +
> +	/* Enable the link. */
> +	err = pcie_capability_clear_word(pdev, PCI_EXP_LNKCTL,
> +					 PCI_EXP_LNKCTL_LD);
> +	if (err)
> +		return pcibios_err_to_errno(err);
> +
> +	/* Wait for link active. */
> +	return mlxsw_pci_link_active_check(pdev);
> +}
> +
> +static int mlxsw_pci_device_id_read(struct pci_dev *pdev, u16 exp_dev_id)
> +{
> +	unsigned long end;
> +	u16 dev_id;
> +	int err;
> +
> +	end = jiffies + msecs_to_jiffies(MLXSW_PCI_TOGGLE_TIMEOUT_MSECS);
> +	do {
> +		msleep(MLXSW_PCI_TOGGLE_WAIT_MSECS);
> +
> +		/* Expect to get the correct PCI device ID as first indication
> +		 * that the ASIC is available.
> +		 */
> +		err = pci_read_config_word(pdev, PCI_DEVICE_ID, &dev_id);
> +		if (err)
> +			return pcibios_err_to_errno(err);
> +
> +		if (dev_id == exp_dev_id)
> +			return 0;
> +	} while (time_before(jiffies, end));
> +
> +	pci_err(pdev, "PCI device ID is not as expected after %d ms\n",
> +		MLXSW_PCI_TOGGLE_TIMEOUT_MSECS);
> +
> +	return -ETIMEDOUT;
> +}
> +
> +static int mlxsw_pci_reset_at_pci_disable(struct mlxsw_pci *mlxsw_pci)
> +{
> +	struct pci_bus *bridge_bus = mlxsw_pci->pdev->bus;
> +	struct pci_dev *bridge_pdev = bridge_bus->self;
> +	struct pci_dev *pdev = mlxsw_pci->pdev;
> +	char mrsr_pl[MLXSW_REG_MRSR_LEN];
> +	u16 dev_id = pdev->device;
> +	int err;
> +
> +	mlxsw_reg_mrsr_pack(mrsr_pl,
> +			    MLXSW_REG_MRSR_COMMAND_RESET_AT_PCI_DISABLE);
> +	err = mlxsw_reg_write(mlxsw_pci->core, MLXSW_REG(mrsr), mrsr_pl);
> +	if (err)
> +		return err;
> +
> +	/* Save the PCI configuration space so that we will be able to restore
> +	 * it after the firmware was reset.
> +	 */
> +	pci_save_state(pdev);
> +	pci_cfg_access_lock(pdev);
> +
> +	err = mlxsw_pci_link_toggle(bridge_pdev);
> +	if (err) {
> +		pci_err(bridge_pdev, "Failed to toggle PCI link\n");
> +		goto restore;
> +	}
> +
> +	err = mlxsw_pci_device_id_read(pdev, dev_id);
> +
> +restore:
> +	pci_cfg_access_unlock(pdev);
> +	pci_restore_state(pdev);
> +	return err;
> +}
> +
>  static int mlxsw_pci_reset_sw(struct mlxsw_pci *mlxsw_pci)
>  {
>  	char mrsr_pl[MLXSW_REG_MRSR_LEN];
> @@ -1471,6 +1602,8 @@ static int
>  mlxsw_pci_reset(struct mlxsw_pci *mlxsw_pci, const struct pci_device_id *id)
>  {
>  	struct pci_dev *pdev = mlxsw_pci->pdev;
> +	char mcam_pl[MLXSW_REG_MCAM_LEN];
> +	bool pci_reset_supported;
>  	u32 sys_status;
>  	int err;
>  
> @@ -1481,7 +1614,23 @@ mlxsw_pci_reset(struct mlxsw_pci *mlxsw_pci, const struct pci_device_id *id)
>  		return err;
>  	}
>  
> -	err = mlxsw_pci_reset_sw(mlxsw_pci);
> +	mlxsw_reg_mcam_pack(mcam_pl,
> +			    MLXSW_REG_MCAM_FEATURE_GROUP_ENHANCED_FEATURES);
> +	err = mlxsw_reg_query(mlxsw_pci->core, MLXSW_REG(mcam), mcam_pl);
> +	if (err)
> +		return err;
> +
> +	mlxsw_reg_mcam_unpack(mcam_pl, MLXSW_REG_MCAM_PCI_RESET,
> +			      &pci_reset_supported);
> +
> +	if (pci_reset_supported) {
> +		pci_dbg(pdev, "Starting PCI reset flow\n");
> +		err = mlxsw_pci_reset_at_pci_disable(mlxsw_pci);
> +	} else {
> +		pci_dbg(pdev, "Starting software reset flow\n");
> +		err = mlxsw_pci_reset_sw(mlxsw_pci);
> +	}
> +
>  	if (err)
>  		return err;
>  
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci_hw.h b/drivers/net/ethernet/mellanox/mlxsw/pci_hw.h
> index 48dbfea0a2a1..ded0828d7f1f 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/pci_hw.h
> +++ b/drivers/net/ethernet/mellanox/mlxsw/pci_hw.h
> @@ -27,6 +27,11 @@
>  
>  #define MLXSW_PCI_SW_RESET_TIMEOUT_MSECS	900000
>  #define MLXSW_PCI_SW_RESET_WAIT_MSECS		200
> +
> +#define MLXSW_PCI_TOGGLE_WAIT_BEFORE_EN_MSECS	500
> +#define MLXSW_PCI_TOGGLE_WAIT_MSECS		20
> +#define MLXSW_PCI_TOGGLE_TIMEOUT_MSECS		2000
> +
>  #define MLXSW_PCI_FW_READY			0xA1844
>  #define MLXSW_PCI_FW_READY_MASK			0xFFFF
>  #define MLXSW_PCI_FW_READY_MAGIC		0x5E
> -- 
> 2.39.0
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 6/6] mlxsw: pci: Add support for new reset flow
  2023-03-23 16:51   ` Bjorn Helgaas
@ 2023-03-26 13:53     ` Ido Schimmel
  2023-03-29 16:01       ` Bjorn Helgaas
  0 siblings, 1 reply; 10+ messages in thread
From: Ido Schimmel @ 2023-03-26 13:53 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Petr Machata, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, Amit Cohen, mlxsw, linux-pci

Hi Bjorn,

On Thu, Mar 23, 2023 at 11:51:15AM -0500, Bjorn Helgaas wrote:
> Hi Petr, thanks for pointing me here.
> 
> On Wed, Mar 22, 2023 at 05:49:35PM +0100, Petr Machata wrote:
> > From: Amit Cohen <amcohen@nvidia.com>
> > 
> > The driver resets the device during probe and during a devlink reload.
> > The current reset method reloads the current firmware version or a pending
> > one, if one was previously flashed using devlink. However, the reset does
> > not take down the PCI link, preventing the PCI firmware from being
> > upgraded, unless the system is rebooted.
> 
> Just to make sure I understand this correctly, the above sounds like
> "firmware" includes two parts that have different rules for loading:
> 
>   - Current reset method is completely mlxsw-specific and resets the
>     mlxsw core but not the PCIe interface; this loads only firmware
>     part A
> 
>   - A PCIe reset resets both the mlxsw core and the PCIe interface;
>     this loads both firmware part A and part B

Yes. A few years ago I had to flash a new firmware in order to test a
fix in the PCIe firmware and the bug still reproduced after a devlink
reload. Only after a reboot the new PCIe firmware was loaded and the bug
was fixed. Bugs in PCIe firmware are not common, but we would like to
avoid the scenario where users must reboot the machine in order to load
the new firmware.

> 
> > To solve this problem, a new reset command (6) was implemented in the
> > firmware. Unlike the current command (1), after issuing the new command
> > the device will not start the reset immediately, but only after the PCI
> > link was disabled. The driver is expected to wait for 500ms before
> > re-enabling the link to give the firmware enough time to start the reset.
> 
> I guess the idea here is that the mlxsw driver:
> 
>   - Tells the firmware we're going to reset
>     (MLXSW_REG_MRSR_COMMAND_RESET_AT_PCI_DISABLE)
> 
>   - Saves PCI config state
> 
>   - Disables the link (mlxsw_pci_link_toggle()), which causes a PCIe
>     hot reset
> 
>   - The firmware notices the link disable and starts its own internal
>     reset
> 
>   - The mlxsw driver waits 500ms
>     (MLXSW_PCI_TOGGLE_WAIT_BEFORE_EN_MSECS)
> 
>   - Enables link and waits for it to be active
>     (mlxsw_pci_link_active_check()
> 
>   - Waits for device to be ready again (mlxsw_pci_device_id_read())

Correct.

> 
> So the first question is why you don't simply use
> pci_reset_function(), since it is supposed to cause a reset and do all
> the necessary waiting for the device to be ready.  This is quite
> complicated to do correctly; in fact, we still discover issues there
> regularly.  There are many special cases in PCIe r6.0, sec 6.6.1, and
> it would be much better if we can avoid trying to handle them all in
> individual drivers.

I see that this function takes the device lock and I think (didn't try)
it will deadlock if we were to replace the current code with it since we
also perform a reset during probe where I believe the device lock is
already taken.

__pci_reset_function_locked() is another option, but it assumes the
device lock was already taken, which is correct during probe, but not
when reset is performed as part of devlink reload.

Let's put the locking issues aside and assume we can use
__pci_reset_function_locked(). I'm trying to figure out what it can
allow us to remove from the driver in favor of common PCI code. It
essentially invokes one of the supported reset methods. Looking at my
device, I see the following:

 # cat /sys/class/pci_bus/0000\:01/device/0000\:01\:00.0/reset_method 
 pm bus

So I assume it will invoke pci_pm_reset(). I'm not sure it can work for
us as our reset procedure requires us to disable the link on the
downstream port as a way of notifying the device that it should start
the reset procedure.

We might be able to use the "device_specific" method and add quirks in
"pci_dev_reset_methods". However, I'm not sure what would be the
benefit, as it basically means moving the code in
mlxsw_pci_link_toggle() to drivers/pci/quirks.c. Also, when the "probe"
argument is "true" we can't actually determine if this reset method is
supported or not, as we can't query that from the configuration space of
the device in the current implementation. It's queried using a command
interface that is specific to mlxsw and resides in the driver itself.
Not usable from drivers/pci/quirks.c.

> 
> Of course, pci_reset_function() does *not* include details like
> MLXSW_PCI_TOGGLE_WAIT_BEFORE_EN_MSECS.
> 
> I assume that flashing the firmware to the device followed by a power
> cycle (without ever doing MLXSW_REG_MRSR_COMMAND_RESET_AT_PCI_DISABLE)
> would load the new firmware everywhere.  Can we not do the same with a
> PCIe reset?

Yes, that's what we would like to achieve.

Thanks for the feedback!

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 6/6] mlxsw: pci: Add support for new reset flow
  2023-03-26 13:53     ` Ido Schimmel
@ 2023-03-29 16:01       ` Bjorn Helgaas
  2023-03-29 17:10         ` Alex Williamson
  2023-04-13 10:26         ` Lukas Wunner
  0 siblings, 2 replies; 10+ messages in thread
From: Bjorn Helgaas @ 2023-03-29 16:01 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Petr Machata, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, Amit Cohen, mlxsw, linux-pci,
	Alex Williamson, Lukas Wunner

[+cc Alex, Lukas for link-disable reset thoughts, beginning of thread:
https://lore.kernel.org/all/cover.1679502371.git.petrm@nvidia.com/]

On Sun, Mar 26, 2023 at 04:53:58PM +0300, Ido Schimmel wrote:
> On Thu, Mar 23, 2023 at 11:51:15AM -0500, Bjorn Helgaas wrote:
> > On Wed, Mar 22, 2023 at 05:49:35PM +0100, Petr Machata wrote:
> > > From: Amit Cohen <amcohen@nvidia.com>
> > > 
> > > The driver resets the device during probe and during a devlink reload.
> > > The current reset method reloads the current firmware version or a pending
> > > one, if one was previously flashed using devlink. However, the reset does
> > > not take down the PCI link, preventing the PCI firmware from being
> > > upgraded, unless the system is rebooted.
> > 
> > Just to make sure I understand this correctly, the above sounds like
> > "firmware" includes two parts that have different rules for loading:
> > 
> >   - Current reset method is completely mlxsw-specific and resets the
> >     mlxsw core but not the PCIe interface; this loads only firmware
> >     part A
> > 
> >   - A PCIe reset resets both the mlxsw core and the PCIe interface;
> >     this loads both firmware part A and part B
> 
> Yes. A few years ago I had to flash a new firmware in order to test a
> fix in the PCIe firmware and the bug still reproduced after a devlink
> reload. Only after a reboot the new PCIe firmware was loaded and the bug
> was fixed. Bugs in PCIe firmware are not common, but we would like to
> avoid the scenario where users must reboot the machine in order to load
> the new firmware.
> 
> > > To solve this problem, a new reset command (6) was implemented in the
> > > firmware. Unlike the current command (1), after issuing the new command
> > > the device will not start the reset immediately, but only after the PCI
> > > link was disabled. The driver is expected to wait for 500ms before
> > > re-enabling the link to give the firmware enough time to start the reset.
> > 
> > I guess the idea here is that the mlxsw driver:
> > 
> >   - Tells the firmware we're going to reset
> >     (MLXSW_REG_MRSR_COMMAND_RESET_AT_PCI_DISABLE)
> > 
> >   - Saves PCI config state
> > 
> >   - Disables the link (mlxsw_pci_link_toggle()), which causes a PCIe
> >     hot reset
> > 
> >   - The firmware notices the link disable and starts its own internal
> >     reset
> > 
> >   - The mlxsw driver waits 500ms
> >     (MLXSW_PCI_TOGGLE_WAIT_BEFORE_EN_MSECS)
> > 
> >   - Enables link and waits for it to be active
> >     (mlxsw_pci_link_active_check()
> > 
> >   - Waits for device to be ready again (mlxsw_pci_device_id_read())
> 
> Correct.
> 
> > So the first question is why you don't simply use
> > pci_reset_function(), since it is supposed to cause a reset and do all
> > the necessary waiting for the device to be ready.  This is quite
> > complicated to do correctly; in fact, we still discover issues there
> > regularly.  There are many special cases in PCIe r6.0, sec 6.6.1, and
> > it would be much better if we can avoid trying to handle them all in
> > individual drivers.
> 
> I see that this function takes the device lock and I think (didn't try)
> it will deadlock if we were to replace the current code with it since we
> also perform a reset during probe where I believe the device lock is
> already taken.
> 
> __pci_reset_function_locked() is another option, but it assumes the
> device lock was already taken, which is correct during probe, but not
> when reset is performed as part of devlink reload.
> 
> Let's put the locking issues aside and assume we can use
> __pci_reset_function_locked(). I'm trying to figure out what it can
> allow us to remove from the driver in favor of common PCI code. It
> essentially invokes one of the supported reset methods. Looking at my
> device, I see the following:
> 
>  # cat /sys/class/pci_bus/0000\:01/device/0000\:01\:00.0/reset_method 
>  pm bus
> 
> So I assume it will invoke pci_pm_reset(). I'm not sure it can work for
> us as our reset procedure requires us to disable the link on the
> downstream port as a way of notifying the device that it should start
> the reset procedure.

Hmmm, pci_pm_reset() puts the device in D3hot, then back to D0.  Spec
says that results in "undefined internal Function state," which
doesn't even sound like a guaranteed reset, but it's what we have, and
in any case, it does not disable the link.

> We might be able to use the "device_specific" method and add quirks in
> "pci_dev_reset_methods". However, I'm not sure what would be the
> benefit, as it basically means moving the code in
> mlxsw_pci_link_toggle() to drivers/pci/quirks.c. Also, when the "probe"
> argument is "true" we can't actually determine if this reset method is
> supported or not, as we can't query that from the configuration space of
> the device in the current implementation. It's queried using a command
> interface that is specific to mlxsw and resides in the driver itself.
> Not usable from drivers/pci/quirks.c.

Spec (PCIe r6.0, sec 6.6.1) says "Disabling a Link causes Downstream
components to undergo a hot reset."  That seems like it *could* be a
general-purpose method of doing a reset, and I don't know why the PCI
core doesn't support it.  Added Alex and Lukas in case they know.

But it sounds like there's some wrinkle with your device?  I suppose a
link disable actually causes a reset, but that reset may not trigger
the firmware reload you need?  If we had a generic "disable link"
reset method, maybe a device quirk could disable it if necessary?

> > Of course, pci_reset_function() does *not* include details like
> > MLXSW_PCI_TOGGLE_WAIT_BEFORE_EN_MSECS.
> > 
> > I assume that flashing the firmware to the device followed by a power
> > cycle (without ever doing MLXSW_REG_MRSR_COMMAND_RESET_AT_PCI_DISABLE)
> > would load the new firmware everywhere.  Can we not do the same with a
> > PCIe reset?
> 
> Yes, that's what we would like to achieve.
> 
> Thanks for the feedback!

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 6/6] mlxsw: pci: Add support for new reset flow
  2023-03-29 16:01       ` Bjorn Helgaas
@ 2023-03-29 17:10         ` Alex Williamson
  2023-03-30  8:26           ` Ido Schimmel
  2023-04-13 10:26         ` Lukas Wunner
  1 sibling, 1 reply; 10+ messages in thread
From: Alex Williamson @ 2023-03-29 17:10 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Ido Schimmel, Petr Machata, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, Amit Cohen, mlxsw,
	linux-pci, Lukas Wunner

On Wed, 29 Mar 2023 11:01:44 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> [+cc Alex, Lukas for link-disable reset thoughts, beginning of thread:
> https://lore.kernel.org/all/cover.1679502371.git.petrm@nvidia.com/]
> 
> On Sun, Mar 26, 2023 at 04:53:58PM +0300, Ido Schimmel wrote:
> > On Thu, Mar 23, 2023 at 11:51:15AM -0500, Bjorn Helgaas wrote:  
> > > On Wed, Mar 22, 2023 at 05:49:35PM +0100, Petr Machata wrote:  
> > > > From: Amit Cohen <amcohen@nvidia.com>
> > > > 
> > > > The driver resets the device during probe and during a devlink reload.
> > > > The current reset method reloads the current firmware version or a pending
> > > > one, if one was previously flashed using devlink. However, the reset does
> > > > not take down the PCI link, preventing the PCI firmware from being
> > > > upgraded, unless the system is rebooted.  
> > > 
> > > Just to make sure I understand this correctly, the above sounds like
> > > "firmware" includes two parts that have different rules for loading:
> > > 
> > >   - Current reset method is completely mlxsw-specific and resets the
> > >     mlxsw core but not the PCIe interface; this loads only firmware
> > >     part A
> > > 
> > >   - A PCIe reset resets both the mlxsw core and the PCIe interface;
> > >     this loads both firmware part A and part B  
> > 
> > Yes. A few years ago I had to flash a new firmware in order to test a
> > fix in the PCIe firmware and the bug still reproduced after a devlink
> > reload. Only after a reboot the new PCIe firmware was loaded and the bug
> > was fixed. Bugs in PCIe firmware are not common, but we would like to
> > avoid the scenario where users must reboot the machine in order to load
> > the new firmware.
> >   
> > > > To solve this problem, a new reset command (6) was implemented in the
> > > > firmware. Unlike the current command (1), after issuing the new command
> > > > the device will not start the reset immediately, but only after the PCI
> > > > link was disabled. The driver is expected to wait for 500ms before
> > > > re-enabling the link to give the firmware enough time to start the reset.  
> > > 
> > > I guess the idea here is that the mlxsw driver:
> > > 
> > >   - Tells the firmware we're going to reset
> > >     (MLXSW_REG_MRSR_COMMAND_RESET_AT_PCI_DISABLE)
> > > 
> > >   - Saves PCI config state
> > > 
> > >   - Disables the link (mlxsw_pci_link_toggle()), which causes a PCIe
> > >     hot reset
> > > 
> > >   - The firmware notices the link disable and starts its own internal
> > >     reset
> > > 
> > >   - The mlxsw driver waits 500ms
> > >     (MLXSW_PCI_TOGGLE_WAIT_BEFORE_EN_MSECS)
> > > 
> > >   - Enables link and waits for it to be active
> > >     (mlxsw_pci_link_active_check()
> > > 
> > >   - Waits for device to be ready again (mlxsw_pci_device_id_read())  
> > 
> > Correct.
> >   
> > > So the first question is why you don't simply use
> > > pci_reset_function(), since it is supposed to cause a reset and do all
> > > the necessary waiting for the device to be ready.  This is quite
> > > complicated to do correctly; in fact, we still discover issues there
> > > regularly.  There are many special cases in PCIe r6.0, sec 6.6.1, and
> > > it would be much better if we can avoid trying to handle them all in
> > > individual drivers.  
> > 
> > I see that this function takes the device lock and I think (didn't try)
> > it will deadlock if we were to replace the current code with it since we
> > also perform a reset during probe where I believe the device lock is
> > already taken.
> > 
> > __pci_reset_function_locked() is another option, but it assumes the
> > device lock was already taken, which is correct during probe, but not
> > when reset is performed as part of devlink reload.
> > 
> > Let's put the locking issues aside and assume we can use
> > __pci_reset_function_locked(). I'm trying to figure out what it can
> > allow us to remove from the driver in favor of common PCI code. It
> > essentially invokes one of the supported reset methods. Looking at my
> > device, I see the following:
> > 
> >  # cat /sys/class/pci_bus/0000\:01/device/0000\:01\:00.0/reset_method 
> >  pm bus
> > 
> > So I assume it will invoke pci_pm_reset(). I'm not sure it can work for
> > us as our reset procedure requires us to disable the link on the
> > downstream port as a way of notifying the device that it should start
> > the reset procedure.  
> 
> Hmmm, pci_pm_reset() puts the device in D3hot, then back to D0.  Spec
> says that results in "undefined internal Function state," which
> doesn't even sound like a guaranteed reset, but it's what we have, and
> in any case, it does not disable the link.
> 
> > We might be able to use the "device_specific" method and add quirks in
> > "pci_dev_reset_methods". However, I'm not sure what would be the
> > benefit, as it basically means moving the code in
> > mlxsw_pci_link_toggle() to drivers/pci/quirks.c. Also, when the "probe"
> > argument is "true" we can't actually determine if this reset method is
> > supported or not, as we can't query that from the configuration space of
> > the device in the current implementation. It's queried using a command
> > interface that is specific to mlxsw and resides in the driver itself.
> > Not usable from drivers/pci/quirks.c.  
> 
> Spec (PCIe r6.0, sec 6.6.1) says "Disabling a Link causes Downstream
> components to undergo a hot reset."  That seems like it *could* be a
> general-purpose method of doing a reset, and I don't know why the PCI
> core doesn't support it.  Added Alex and Lukas in case they know.

I think we don't have it because it's unclear how it's actually
different from a secondary bus reset from the bridge control register,
which is what "bus" would do when selected for the example above.  Per
the spec, both must cause a hot reset.  It seems this device needs a
significantly longer delay though.

Note that hot resets can be generated by a userspace driver with
ownership of the device and will make use of the pci-core reset
mechanisms.  Therefore if there is not a device specific reset, we'll
use the standard delays and the device ought not to get itself wedged
if the link becomes active at an unexpected point relative to a
firmware update.  This might be a point in favor of a device specific
reset solution in pci-core.  Thanks,

Alex
 
> But it sounds like there's some wrinkle with your device?  I suppose a
> link disable actually causes a reset, but that reset may not trigger
> the firmware reload you need?  If we had a generic "disable link"
> reset method, maybe a device quirk could disable it if necessary?
> 
> > > Of course, pci_reset_function() does *not* include details like
> > > MLXSW_PCI_TOGGLE_WAIT_BEFORE_EN_MSECS.
> > > 
> > > I assume that flashing the firmware to the device followed by a power
> > > cycle (without ever doing MLXSW_REG_MRSR_COMMAND_RESET_AT_PCI_DISABLE)
> > > would load the new firmware everywhere.  Can we not do the same with a
> > > PCIe reset?  
> > 
> > Yes, that's what we would like to achieve.
> > 
> > Thanks for the feedback!  
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 6/6] mlxsw: pci: Add support for new reset flow
  2023-03-29 17:10         ` Alex Williamson
@ 2023-03-30  8:26           ` Ido Schimmel
  2023-03-30 18:49             ` Alex Williamson
  0 siblings, 1 reply; 10+ messages in thread
From: Ido Schimmel @ 2023-03-30  8:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, Petr Machata, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, Amit Cohen, mlxsw,
	linux-pci, Lukas Wunner

On Wed, Mar 29, 2023 at 11:10:01AM -0600, Alex Williamson wrote:
> I think we don't have it because it's unclear how it's actually
> different from a secondary bus reset from the bridge control register,
> which is what "bus" would do when selected for the example above.  Per
> the spec, both must cause a hot reset.  It seems this device needs a
> significantly longer delay though.

Assuming you are referring to the 2ms sleep in
pci_reset_secondary_bus(), then yes. In our case, after disabling the
link on the downstream port we need to wait for 500ms before enabling
it.

> Note that hot resets can be generated by a userspace driver with
> ownership of the device and will make use of the pci-core reset
> mechanisms.  Therefore if there is not a device specific reset, we'll
> use the standard delays and the device ought not to get itself wedged
> if the link becomes active at an unexpected point relative to a
> firmware update.  This might be a point in favor of a device specific
> reset solution in pci-core.  Thanks,

I assume you referring to something like this:

# echo 1 > /sys/class/pci_bus/0000:03/device/0000:03:00.0/reset

Doesn't seem to have any effect (network ports remain up, at least).
Anyway, this device is completely managed by the kernel, not a user
space driver. I'm not aware of anyone using this method to reset the
device.

If I understand Bjorn and you correctly, we have two options:

1. Keep the current implementation inside the driver.

2. Call __pci_reset_function_locked() from the driver and move the link
toggling to drivers/pci/quirks.c as a "device_specific" method.

Personally, I don't see any benefit in 2, but we can try to implement
it, see if it even works and then decide.

Please let me know what are your preferences.

Thanks

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 6/6] mlxsw: pci: Add support for new reset flow
  2023-03-30  8:26           ` Ido Schimmel
@ 2023-03-30 18:49             ` Alex Williamson
  2023-04-13  8:10               ` Ido Schimmel
  0 siblings, 1 reply; 10+ messages in thread
From: Alex Williamson @ 2023-03-30 18:49 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Bjorn Helgaas, Petr Machata, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, Amit Cohen, mlxsw,
	linux-pci, Lukas Wunner

On Thu, 30 Mar 2023 11:26:30 +0300
Ido Schimmel <idosch@nvidia.com> wrote:

> On Wed, Mar 29, 2023 at 11:10:01AM -0600, Alex Williamson wrote:
> > I think we don't have it because it's unclear how it's actually
> > different from a secondary bus reset from the bridge control register,
> > which is what "bus" would do when selected for the example above.  Per
> > the spec, both must cause a hot reset.  It seems this device needs a
> > significantly longer delay though.  
> 
> Assuming you are referring to the 2ms sleep in
> pci_reset_secondary_bus(), then yes. In our case, after disabling the
> link on the downstream port we need to wait for 500ms before enabling
> it.
> 
> > Note that hot resets can be generated by a userspace driver with
> > ownership of the device and will make use of the pci-core reset
> > mechanisms.  Therefore if there is not a device specific reset, we'll
> > use the standard delays and the device ought not to get itself wedged
> > if the link becomes active at an unexpected point relative to a
> > firmware update.  This might be a point in favor of a device specific
> > reset solution in pci-core.  Thanks,  
> 
> I assume you referring to something like this:
> 
> # echo 1 > /sys/class/pci_bus/0000:03/device/0000:03:00.0/reset
> 
> Doesn't seem to have any effect (network ports remain up, at least).
> Anyway, this device is completely managed by the kernel, not a user
> space driver. I'm not aware of anyone using this method to reset the
> device.

The pci-sysfs reset attribute is only meant to reset the linked device,
so if this is a single function device then it might be accessing bus
reset, but it also might be using FLR or PM reset.  There's a
reset_method attribute to determine and select.

In any case, if the device is unaffected, that suggests we're dealing
with a device that doesn't comply with PCIe reset standards, which
might suggests it needs a device specific reset or to flag broken reset
methods regardless.

Note that QEMU is a vfio-pci userspace driver, so assigning the device
to a VM, where kernel drivers in the guest are managing the device is a
use case of userspace drivers which should have a functional reset
mechanism to avoid data leakage between userspace sessions.
 
> If I understand Bjorn and you correctly, we have two options:
> 
> 1. Keep the current implementation inside the driver.
> 
> 2. Call __pci_reset_function_locked() from the driver and move the link
> toggling to drivers/pci/quirks.c as a "device_specific" method.
> 
> Personally, I don't see any benefit in 2, but we can try to implement
> it, see if it even works and then decide.

The second option enables use cases like above, where the PCI-core can
perform an effective reset of the device rather than embedding that
into a specific driver.  Even if not intended as a primary use case,
it's a more complete solution and avoids potentially unhappy users that
assume such use cases are available.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 6/6] mlxsw: pci: Add support for new reset flow
  2023-03-30 18:49             ` Alex Williamson
@ 2023-04-13  8:10               ` Ido Schimmel
  2023-04-13 15:26                 ` Jakub Kicinski
  0 siblings, 1 reply; 10+ messages in thread
From: Ido Schimmel @ 2023-04-13  8:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Bjorn Helgaas, Petr Machata, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, Amit Cohen, mlxsw,
	linux-pci, Lukas Wunner

On Thu, Mar 30, 2023 at 12:49:58PM -0600, Alex Williamson wrote:
> On Thu, 30 Mar 2023 11:26:30 +0300
> Ido Schimmel <idosch@nvidia.com> wrote:
> 
> > On Wed, Mar 29, 2023 at 11:10:01AM -0600, Alex Williamson wrote:
> > > I think we don't have it because it's unclear how it's actually
> > > different from a secondary bus reset from the bridge control register,
> > > which is what "bus" would do when selected for the example above.  Per
> > > the spec, both must cause a hot reset.  It seems this device needs a
> > > significantly longer delay though.  
> > 
> > Assuming you are referring to the 2ms sleep in
> > pci_reset_secondary_bus(), then yes. In our case, after disabling the
> > link on the downstream port we need to wait for 500ms before enabling
> > it.
> > 
> > > Note that hot resets can be generated by a userspace driver with
> > > ownership of the device and will make use of the pci-core reset
> > > mechanisms.  Therefore if there is not a device specific reset, we'll
> > > use the standard delays and the device ought not to get itself wedged
> > > if the link becomes active at an unexpected point relative to a
> > > firmware update.  This might be a point in favor of a device specific
> > > reset solution in pci-core.  Thanks,  
> > 
> > I assume you referring to something like this:
> > 
> > # echo 1 > /sys/class/pci_bus/0000:03/device/0000:03:00.0/reset
> > 
> > Doesn't seem to have any effect (network ports remain up, at least).
> > Anyway, this device is completely managed by the kernel, not a user
> > space driver. I'm not aware of anyone using this method to reset the
> > device.
> 
> The pci-sysfs reset attribute is only meant to reset the linked device,
> so if this is a single function device then it might be accessing bus
> reset, but it also might be using FLR or PM reset.  There's a
> reset_method attribute to determine and select.
> 
> In any case, if the device is unaffected, that suggests we're dealing
> with a device that doesn't comply with PCIe reset standards, which
> might suggests it needs a device specific reset or to flag broken reset
> methods regardless.
> 
> Note that QEMU is a vfio-pci userspace driver, so assigning the device
> to a VM, where kernel drivers in the guest are managing the device is a
> use case of userspace drivers which should have a functional reset
> mechanism to avoid data leakage between userspace sessions.
>  
> > If I understand Bjorn and you correctly, we have two options:
> > 
> > 1. Keep the current implementation inside the driver.
> > 
> > 2. Call __pci_reset_function_locked() from the driver and move the link
> > toggling to drivers/pci/quirks.c as a "device_specific" method.
> > 
> > Personally, I don't see any benefit in 2, but we can try to implement
> > it, see if it even works and then decide.
> 
> The second option enables use cases like above, where the PCI-core can
> perform an effective reset of the device rather than embedding that
> into a specific driver.  Even if not intended as a primary use case,
> it's a more complete solution and avoids potentially unhappy users that
> assume such use cases are available.  Thanks,

I believe we are discussing three different issues:

1. Reset by PCI core when driver is bound to the device.
2. Reset by PCI core when driver is not bound to the device.
3. Reset by the driver itself during probe / devlink reload.

These are my notes on each:

1. In this case, the reset_prepare() and reset_done() handlers of the
driver are invoked by the PCI core before and after the PCI reset.
Without them, if the used PCI reset method indeed reset the entire
device, then the driver and the device would be out of sync. I have
implemented these handlers for the driver:
https://github.com/idosch/linux/commit/28bc0dc06c01559c19331578bbba9f2b0460ab5d.patch

2. In this case, the handlers are not available and we only need to
ensure that the used PCI reset method reset the device. The method can
be a generic method such as "pm" or "bus" (which are available in my
case) or a "device_specific" method that is implemented in
drivers/pci/quirks.c by accessing the configuration space of the device.

I need to check if one of the generic methods works for this device and
if not, check with the PCI team what we can do about it.

3. In this case, the driver already issues a reset via its command
interface during probe / devlink reload to ensure it is working with a
device in a clean state. The patch we are discussing merely adds another
reset method. Instead of starting the reset when the command is issued,
the reset will start after toggling the link on the downstream port.

To summarize, I would like to proceed as follows:

1. Submit the following patch that implements the PCI reset handlers for
the device:
https://github.com/idosch/linux/commit/28bc0dc06c01559c19331578bbba9f2b0460ab5d.patch

2. Check if one of the generic reset methods works for this device and
if not, check with the PCI team what we can do about it.

3. Re-submit the current patchset for inclusion.

Note that 2 does not block 3 since the solution for 2 would be either in
the firmware or in drivers/pci/quirks.c.

Please let me know if you are OK with the above.

Thanks

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 6/6] mlxsw: pci: Add support for new reset flow
  2023-03-29 16:01       ` Bjorn Helgaas
  2023-03-29 17:10         ` Alex Williamson
@ 2023-04-13 10:26         ` Lukas Wunner
  1 sibling, 0 replies; 10+ messages in thread
From: Lukas Wunner @ 2023-04-13 10:26 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Ido Schimmel, Petr Machata, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, Amit Cohen, mlxsw,
	linux-pci, Alex Williamson

On Wed, Mar 29, 2023 at 11:01:44AM -0500, Bjorn Helgaas wrote:
> On Sun, Mar 26, 2023 at 04:53:58PM +0300, Ido Schimmel wrote:
> > On Thu, Mar 23, 2023 at 11:51:15AM -0500, Bjorn Helgaas wrote:
> > > So the first question is why you don't simply use
> > > pci_reset_function(), since it is supposed to cause a reset and do all
> > > the necessary waiting for the device to be ready.

I agree, this driver should use one of the reset helpers provided
by the PCI core.

Not least because if the Downstream Port is hotplug-capable,
fiddling with the Link Disable bit behind the hotplug driver's back
will cause unbinding and rebinding of the device.

> > __pci_reset_function_locked() is another option, but it assumes the
> > device lock was already taken, which is correct during probe, but not
> > when reset is performed as part of devlink reload.

Just call pci_dev_lock() in the devlink reload code path.

> Hmmm, pci_pm_reset() puts the device in D3hot, then back to D0.  Spec
> says that results in "undefined internal Function state," which
> doesn't even sound like a guaranteed reset, but it's what we have, and
> in any case, it does not disable the link.

Per PCIe r6.0.1 sec 5.8, the device undergoes a Soft Reset when moving
from D3hot to D0 (unless the No_Soft_Reset bit is set in the Power
Management Control/Status Register, sec 7.5.2.2).

The driver can set the PCI_DEV_FLAGS_NO_PM_RESET flag if this reset method
doesn't have any effect (via quirk_no_pm_reset()).

We could also discuss moving pci_pm_reset after pci_reset_bus_function
in pci_reset_fn_methods[] if this reset method has little value on the
majority of devices.

Alternatively, if Secondary Bus Reset has the intended effect, the driver
could invoke that directly via pci_reset_bus().

> Spec (PCIe r6.0, sec 6.6.1) says "Disabling a Link causes Downstream
> components to undergo a hot reset."  That seems like it *could* be a
> general-purpose method of doing a reset, and I don't know why the PCI
> core doesn't support it.  Added Alex and Lukas in case they know.

Sounds reasonable to me.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH net-next 6/6] mlxsw: pci: Add support for new reset flow
  2023-04-13  8:10               ` Ido Schimmel
@ 2023-04-13 15:26                 ` Jakub Kicinski
  0 siblings, 0 replies; 10+ messages in thread
From: Jakub Kicinski @ 2023-04-13 15:26 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Alex Williamson, Bjorn Helgaas, Petr Machata, David S. Miller,
	Eric Dumazet, Paolo Abeni, netdev, Amit Cohen, mlxsw, linux-pci,
	Lukas Wunner

On Thu, 13 Apr 2023 11:10:54 +0300 Ido Schimmel wrote:
> I believe we are discussing three different issues:
> 
> 1. Reset by PCI core when driver is bound to the device.
> 2. Reset by PCI core when driver is not bound to the device.
> 3. Reset by the driver itself during probe / devlink reload.
> 
> These are my notes on each:
> 
> 1. In this case, the reset_prepare() and reset_done() handlers of the
> driver are invoked by the PCI core before and after the PCI reset.
> Without them, if the used PCI reset method indeed reset the entire
> device, then the driver and the device would be out of sync. I have
> implemented these handlers for the driver:
> https://github.com/idosch/linux/commit/28bc0dc06c01559c19331578bbba9f2b0460ab5d.patch
> 
> 2. In this case, the handlers are not available and we only need to
> ensure that the used PCI reset method reset the device. The method can
> be a generic method such as "pm" or "bus" (which are available in my
> case) or a "device_specific" method that is implemented in
> drivers/pci/quirks.c by accessing the configuration space of the device.
> 
> I need to check if one of the generic methods works for this device and
> if not, check with the PCI team what we can do about it.
> 
> 3. In this case, the driver already issues a reset via its command
> interface during probe / devlink reload to ensure it is working with a
> device in a clean state. The patch we are discussing merely adds another
> reset method. Instead of starting the reset when the command is issued,
> the reset will start after toggling the link on the downstream port.
> 
> To summarize, I would like to proceed as follows:
> 
> 1. Submit the following patch that implements the PCI reset handlers for
> the device:
> https://github.com/idosch/linux/commit/28bc0dc06c01559c19331578bbba9f2b0460ab5d.patch

I'm probably confused but does this mean that PCI reset would impact
the datapath? From the commit message it sounded like the new reset
is harder than the previous one.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-04-13 15:26 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <cover.1679502371.git.petrm@nvidia.com>
     [not found] ` <c61d07469ecf5d3053442e24d4d050405f466b76.1679502371.git.petrm@nvidia.com>
2023-03-23  9:13   ` [PATCH net-next 6/6] mlxsw: pci: Add support for new reset flow Petr Machata
2023-03-23 16:51   ` Bjorn Helgaas
2023-03-26 13:53     ` Ido Schimmel
2023-03-29 16:01       ` Bjorn Helgaas
2023-03-29 17:10         ` Alex Williamson
2023-03-30  8:26           ` Ido Schimmel
2023-03-30 18:49             ` Alex Williamson
2023-04-13  8:10               ` Ido Schimmel
2023-04-13 15:26                 ` Jakub Kicinski
2023-04-13 10:26         ` Lukas Wunner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).