linux-pci.vger.kernel.org archive mirror
* [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow
@ 2023-10-17  7:42 Ido Schimmel
  2023-10-17  7:42 ` [RFC PATCH net-next 01/12] netdevsim: Block until all devices are released Ido Schimmel
                   ` (11 more replies)
  0 siblings, 12 replies; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  7:42 UTC (permalink / raw)
  To: netdev, linux-pci
  Cc: davem, kuba, pabeni, edumazet, bhelgaas, alex.williamson, lukas,
	petrm, jiri, mlxsw, Ido Schimmel

This patchset changes mlxsw to issue a PCI reset during probe and
devlink reload so that the PCI firmware can be upgraded without a
reboot.

Unlike the previous version of this patchset [1], in this version the
driver no longer tries to issue a PCI reset by triggering a PCI link
toggle on its own, but instead calls the PCI core to issue the reset.

The PCI APIs require the device lock to be held which is why patches
#2-#3 adjust devlink to hold a reference on the associated device and
acquire the device lock during reload. Patch #1 prepares netdevsim for
these devlink changes. See the commit message for more details.

Patches #4-#5 add reset method quirks for NVIDIA Spectrum devices.

Patch #6 adds a debug level print in PCI core so that the device ready
delay is printed even if it is shorter than one second.

Patches #7-#9 are straightforward preparations in mlxsw.

Patch #10 finally implements the new reset flow in mlxsw.

Patch #11 adds PCI reset handlers in mlxsw to prevent user space from
resetting the device from underneath an unaware driver. Instead, the
driver is gracefully de-initialized before the PCI reset and then
initialized again after it.

Patch #12 adds a PCI reset selftest to make sure this code path does not
regress.

[1] https://lore.kernel.org/netdev/cover.1679502371.git.petrm@nvidia.com/

Amit Cohen (3):
  mlxsw: Extend MRSR pack() function to support new commands
  mlxsw: pci: Rename mlxsw_pci_sw_reset()
  mlxsw: pci: Move software reset code to a separate function

Ido Schimmel (9):
  netdevsim: Block until all devices are released
  devlink: Hold a reference on parent device
  devlink: Acquire device lock during reload
  PCI: Add no PM reset quirk for NVIDIA Spectrum devices
  PCI: Add device-specific reset for NVIDIA Spectrum devices
  PCI: Add debug print for device ready delay
  mlxsw: pci: Add support for new reset flow
  mlxsw: pci: Implement PCI reset handlers
  selftests: mlxsw: Add PCI reset test

 drivers/net/ethernet/mellanox/mlxsw/pci.c     | 90 +++++++++++++++++--
 drivers/net/ethernet/mellanox/mlxsw/reg.h     | 16 +++-
 drivers/net/netdevsim/bus.c                   | 12 +++
 drivers/pci/pci.c                             |  3 +
 drivers/pci/quirks.c                          | 42 +++++++++
 net/devlink/core.c                            |  7 +-
 net/devlink/dev.c                             |  8 ++
 net/devlink/devl_internal.h                   | 19 +++-
 net/devlink/health.c                          |  3 +-
 net/devlink/netlink.c                         | 21 +++--
 net/devlink/region.c                          |  3 +-
 .../selftests/drivers/net/mlxsw/pci_reset.sh  | 58 ++++++++++++
 12 files changed, 261 insertions(+), 21 deletions(-)
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/pci_reset.sh

-- 
2.40.1



* [RFC PATCH net-next 01/12] netdevsim: Block until all devices are released
  2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
@ 2023-10-17  7:42 ` Ido Schimmel
  2023-10-19  0:53   ` Jakub Kicinski
  2023-10-17  7:42 ` [RFC PATCH net-next 02/12] devlink: Hold a reference on parent device Ido Schimmel
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  7:42 UTC (permalink / raw)
  To: netdev, linux-pci
  Cc: davem, kuba, pabeni, edumazet, bhelgaas, alex.williamson, lukas,
	petrm, jiri, mlxsw, Ido Schimmel

Like other buses, devices on the netdevsim bus have a release callback
that is invoked when the reference count of the device drops to zero.
However, unlike other buses such as PCI, the release callback is not
necessarily built into the kernel, as netdevsim can be built as a
module.

The above is problematic as nothing prevents the module from being
unloaded before the release callback has been invoked, which can happen
asynchronously. One such example is going to be added in subsequent
patches where devlink will call put_device() from an RCU callback.

The issue is not theoretical and the reproducer in [1] can reliably
crash the kernel. The conclusion of this discussion was that the issue
should be solved in netdevsim, which is what this patch is trying to do.

Add a reference count that is increased when a device is added to the
bus and decreased when a device is released. Signal a completion when
the reference count drops to zero and wait for the completion when
unloading the module so that the module will not be unloaded before all
the devices have been released. The reference count is initialized to
one so that the completion is only signaled when unloading the module.
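
The counting scheme described above can be sketched in user space as
follows. This is only a model under stated assumptions: the struct and
function names are hypothetical, a boolean stands in for the kernel
completion, and no concurrency is modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model; all names are hypothetical, not the netdevsim code. */
struct bus_model {
	unsigned int refs;	/* devices on the bus plus the bus itself */
	bool all_released;	/* stands in for the kernel completion */
};

/* Drop one reference and signal once the count reaches zero. */
static void bus_model_put(struct bus_model *bus)
{
	assert(bus->refs > 0);
	if (--bus->refs == 0)
		bus->all_released = true;
}

/* The count starts at one: that reference belongs to the bus itself. */
static void bus_model_init(struct bus_model *bus)
{
	bus->refs = 1;
	bus->all_released = false;
}

/* A new device takes a reference. */
static void bus_model_dev_add(struct bus_model *bus)
{
	bus->refs++;
}

/* The release callback of a device drops its reference. */
static void bus_model_dev_release(struct bus_model *bus)
{
	bus_model_put(bus);
}

/* Module unload drops the bus reference before waiting. */
static void bus_model_exit_begin(struct bus_model *bus)
{
	bus_model_put(bus);
}
```

Because the bus holds the initial reference, releasing every device
while the module is loaded never signals the completion. Only the drop
at unload time can bring the count to zero.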

With this patch, the reproducer in [1] no longer crashes the kernel.

[1] https://lore.kernel.org/netdev/20230619125015.1541143-2-idosch@nvidia.com/

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/netdevsim/bus.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/net/netdevsim/bus.c b/drivers/net/netdevsim/bus.c
index 0787ad252dd9..bcbc1e19edde 100644
--- a/drivers/net/netdevsim/bus.c
+++ b/drivers/net/netdevsim/bus.c
@@ -3,11 +3,13 @@
  * Copyright (C) 2019 Mellanox Technologies. All rights reserved
  */
 
+#include <linux/completion.h>
 #include <linux/device.h>
 #include <linux/idr.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
 #include <linux/mutex.h>
+#include <linux/refcount.h>
 #include <linux/slab.h>
 #include <linux/sysfs.h>
 
@@ -17,6 +19,8 @@ static DEFINE_IDA(nsim_bus_dev_ids);
 static LIST_HEAD(nsim_bus_dev_list);
 static DEFINE_MUTEX(nsim_bus_dev_list_lock);
 static bool nsim_bus_enable;
+static refcount_t nsim_bus_devs; /* Including the bus itself. */
+static DECLARE_COMPLETION(nsim_bus_devs_released);
 
 static struct nsim_bus_dev *to_nsim_bus_dev(struct device *dev)
 {
@@ -121,6 +125,8 @@ static void nsim_bus_dev_release(struct device *dev)
 
 	nsim_bus_dev = container_of(dev, struct nsim_bus_dev, dev);
 	kfree(nsim_bus_dev);
+	if (refcount_dec_and_test(&nsim_bus_devs))
+		complete(&nsim_bus_devs_released);
 }
 
 static struct device_type nsim_bus_dev_type = {
@@ -170,6 +176,7 @@ new_device_store(const struct bus_type *bus, const char *buf, size_t count)
 		goto err;
 	}
 
+	refcount_inc(&nsim_bus_devs);
 	/* Allow using nsim_bus_dev */
 	smp_store_release(&nsim_bus_dev->init, true);
 
@@ -326,6 +333,7 @@ int nsim_bus_init(void)
 	err = driver_register(&nsim_driver);
 	if (err)
 		goto err_bus_unregister;
+	refcount_set(&nsim_bus_devs, 1);
 	/* Allow using resources */
 	smp_store_release(&nsim_bus_enable, true);
 	return 0;
@@ -341,6 +349,8 @@ void nsim_bus_exit(void)
 
 	/* Disallow using resources */
 	smp_store_release(&nsim_bus_enable, false);
+	if (refcount_dec_and_test(&nsim_bus_devs))
+		complete(&nsim_bus_devs_released);
 
 	mutex_lock(&nsim_bus_dev_list_lock);
 	list_for_each_entry_safe(nsim_bus_dev, tmp, &nsim_bus_dev_list, list) {
@@ -349,6 +359,8 @@ void nsim_bus_exit(void)
 	}
 	mutex_unlock(&nsim_bus_dev_list_lock);
 
+	wait_for_completion(&nsim_bus_devs_released);
+
 	driver_unregister(&nsim_driver);
 	bus_unregister(&nsim_bus);
 }
-- 
2.40.1



* [RFC PATCH net-next 02/12] devlink: Hold a reference on parent device
  2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
  2023-10-17  7:42 ` [RFC PATCH net-next 01/12] netdevsim: Block until all devices are released Ido Schimmel
@ 2023-10-17  7:42 ` Ido Schimmel
  2023-10-17  7:56   ` Jiri Pirko
  2023-10-17  7:42 ` [RFC PATCH net-next 03/12] devlink: Acquire device lock during reload Ido Schimmel
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  7:42 UTC (permalink / raw)
  To: netdev, linux-pci
  Cc: davem, kuba, pabeni, edumazet, bhelgaas, alex.williamson, lukas,
	petrm, jiri, mlxsw, Ido Schimmel

Each devlink instance is associated with a parent device and a pointer
to this device is stored in the devlink structure, but devlink does not
hold a reference on this device.

This is going to be a problem in the next patch where - among other
things - devlink will acquire the device lock during netns dismantle,
before the reload operation. Since netns dismantle is performed
asynchronously and since a reference is not held on the parent device,
it will be possible to hit a use-after-free.

Prepare for the upcoming change by holding a reference on the parent
device.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
---
 net/devlink/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/devlink/core.c b/net/devlink/core.c
index bcbbb952569f..5b8b692b8c76 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -4,6 +4,7 @@
  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
  */
 
+#include <linux/device.h>
 #include <net/genetlink.h>
 #define CREATE_TRACE_POINTS
 #include <trace/events/devlink.h>
@@ -310,6 +311,7 @@ static void devlink_release(struct work_struct *work)
 
 	mutex_destroy(&devlink->lock);
 	lockdep_unregister_key(&devlink->lock_key);
+	put_device(devlink->dev);
 	kfree(devlink);
 }
 
@@ -425,6 +427,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
 	if (ret < 0)
 		goto err_xa_alloc;
 
+	get_device(dev);
 	devlink->dev = dev;
 	devlink->ops = ops;
 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
-- 
2.40.1



* [RFC PATCH net-next 03/12] devlink: Acquire device lock during reload
  2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
  2023-10-17  7:42 ` [RFC PATCH net-next 01/12] netdevsim: Block until all devices are released Ido Schimmel
  2023-10-17  7:42 ` [RFC PATCH net-next 02/12] devlink: Hold a reference on parent device Ido Schimmel
@ 2023-10-17  7:42 ` Ido Schimmel
  2023-10-17  8:04   ` Jiri Pirko
  2023-10-17  7:42 ` [RFC PATCH net-next 04/12] PCI: Add no PM reset quirk for NVIDIA Spectrum devices Ido Schimmel
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  7:42 UTC (permalink / raw)
  To: netdev, linux-pci
  Cc: davem, kuba, pabeni, edumazet, bhelgaas, alex.williamson, lukas,
	petrm, jiri, mlxsw, Ido Schimmel

Device drivers register with devlink from their probe routines (under
the device lock) by acquiring the devlink instance lock and calling
devl_register().

Drivers that support a devlink reload usually implement the
reload_{down, up}() operations in a similar fashion to their remove and
probe routines, respectively.

However, while the remove and probe routines are invoked with the device
lock held, the reload operations are only invoked with the devlink
instance lock held. It is therefore impossible for drivers to acquire
the device lock from their reload operations, as this would result in
lock inversion.

The motivating use case for invoking the reload operations with the
device lock held is in mlxsw, which needs to trigger a PCI reset as part
of the reload. The driver cannot call pci_reset_function() as this
function acquires the device lock. Instead, it needs to call
__pci_reset_function_locked(), which expects the device lock to be held.

To that end, adjust devlink to always acquire the device lock before the
devlink instance lock when performing a reload. Do that both when reload
is triggered explicitly by user space and when it is triggered as part
of netns dismantle.
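
The lock ordering described above can be sketched with a small
user-space model. The names are hypothetical and plain booleans stand in
for the two mutexes, so this only illustrates the ordering rule (device
lock first, devlink instance lock second, release in reverse), not the
kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the two locks; not kernel code. */
struct instance_model {
	bool device_locked;	/* stands in for the parent device lock */
	bool devlink_locked;	/* stands in for the devlink instance lock */
};

/* Take the device lock (if requested) first, then the instance lock. */
static void model_dev_lock(struct instance_model *inst, bool dev_lock)
{
	/* The device lock must never be taken after the instance lock. */
	assert(!inst->devlink_locked);
	if (dev_lock) {
		assert(!inst->device_locked);
		inst->device_locked = true;
	}
	inst->devlink_locked = true;
}

/* Release in the reverse order. */
static void model_dev_unlock(struct instance_model *inst, bool dev_lock)
{
	assert(inst->devlink_locked);
	inst->devlink_locked = false;
	if (dev_lock) {
		assert(inst->device_locked);
		inst->device_locked = false;
	}
}

/* In this model a reload is only legal with both locks held. */
static bool model_reload_allowed(const struct instance_model *inst)
{
	return inst->device_locked && inst->devlink_locked;
}
```

Commands that do not reload pass false and keep taking only the
instance lock, which is why the flag is a parameter rather than
unconditional behavior.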

Tested the following flows with netdevsim and mlxsw while lockdep is
enabled:

netdevsim:

 # echo "10 1" > /sys/bus/netdevsim/new_device
 # devlink dev reload netdevsim/netdevsim10
 # ip netns add bla
 # devlink dev reload netdevsim/netdevsim10 netns bla
 # ip netns del bla
 # echo 10 > /sys/bus/netdevsim/del_device

mlxsw:

 # devlink dev reload pci/0000:01:00.0
 # ip netns add bla
 # devlink dev reload pci/0000:01:00.0 netns bla
 # ip netns del bla
 # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
 # echo 1 > /sys/bus/pci/rescan

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 net/devlink/core.c          |  4 ++--
 net/devlink/dev.c           |  8 ++++++++
 net/devlink/devl_internal.h | 19 ++++++++++++++++++-
 net/devlink/health.c        |  3 ++-
 net/devlink/netlink.c       | 21 ++++++++++++++-------
 net/devlink/region.c        |  3 ++-
 6 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/net/devlink/core.c b/net/devlink/core.c
index 5b8b692b8c76..0f866f2cbaf6 100644
--- a/net/devlink/core.c
+++ b/net/devlink/core.c
@@ -502,14 +502,14 @@ static void __net_exit devlink_pernet_pre_exit(struct net *net)
 	 * all devlink instances from this namespace into init_net.
 	 */
 	devlinks_xa_for_each_registered_get(net, index, devlink) {
-		devl_lock(devlink);
+		devl_dev_lock(devlink, true);
 		err = 0;
 		if (devl_is_registered(devlink))
 			err = devlink_reload(devlink, &init_net,
 					     DEVLINK_RELOAD_ACTION_DRIVER_REINIT,
 					     DEVLINK_RELOAD_LIMIT_UNSPEC,
 					     &actions_performed, NULL);
-		devl_unlock(devlink);
+		devl_dev_unlock(devlink, true);
 		devlink_put(devlink);
 		if (err && err != -EOPNOTSUPP)
 			pr_warn("Failed to reload devlink instance into init_net\n");
diff --git a/net/devlink/dev.c b/net/devlink/dev.c
index dc8039ca2b38..70cebe716187 100644
--- a/net/devlink/dev.c
+++ b/net/devlink/dev.c
@@ -4,6 +4,7 @@
  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
  */
 
+#include <linux/device.h>
 #include <net/genetlink.h>
 #include <net/sock.h>
 #include "devl_internal.h"
@@ -433,6 +434,13 @@ int devlink_reload(struct devlink *devlink, struct net *dest_net,
 	struct net *curr_net;
 	int err;
 
+	/* Make sure the reload operations are invoked with the device lock
+	 * held to allow drivers to trigger functionality that expects it
+	 * (e.g., PCI reset) and to close possible races between these
+	 * operations and probe/remove.
+	 */
+	device_lock_assert(devlink->dev);
+
 	memcpy(remote_reload_stats, devlink->stats.remote_reload_stats,
 	       sizeof(remote_reload_stats));
 
diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
index 741d1bf1bec8..a9c5e52c40a7 100644
--- a/net/devlink/devl_internal.h
+++ b/net/devlink/devl_internal.h
@@ -3,6 +3,7 @@
  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
  */
 
+#include <linux/device.h>
 #include <linux/etherdevice.h>
 #include <linux/mutex.h>
 #include <linux/netdevice.h>
@@ -96,6 +97,20 @@ static inline bool devl_is_registered(struct devlink *devlink)
 	return xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED);
 }
 
+static inline void devl_dev_lock(struct devlink *devlink, bool dev_lock)
+{
+	if (dev_lock)
+		device_lock(devlink->dev);
+	devl_lock(devlink);
+}
+
+static inline void devl_dev_unlock(struct devlink *devlink, bool dev_lock)
+{
+	devl_unlock(devlink);
+	if (dev_lock)
+		device_unlock(devlink->dev);
+}
+
 typedef void devlink_rel_notify_cb_t(struct devlink *devlink, u32 obj_index);
 typedef void devlink_rel_cleanup_cb_t(struct devlink *devlink, u32 obj_index,
 				      u32 rel_index);
@@ -113,6 +128,7 @@ int devlink_rel_devlink_handle_put(struct sk_buff *msg, struct devlink *devlink,
 /* Netlink */
 #define DEVLINK_NL_FLAG_NEED_PORT		BIT(0)
 #define DEVLINK_NL_FLAG_NEED_DEVLINK_OR_PORT	BIT(1)
+#define DEVLINK_NL_FLAG_NEED_DEV_LOCK		BIT(2)
 
 enum devlink_multicast_groups {
 	DEVLINK_MCGRP_CONFIG,
@@ -140,7 +156,8 @@ typedef int devlink_nl_dump_one_func_t(struct sk_buff *msg,
 				       int flags);
 
 struct devlink *
-devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs);
+devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
+			    bool dev_lock);
 
 int devlink_nl_dumpit(struct sk_buff *msg, struct netlink_callback *cb,
 		      devlink_nl_dump_one_func_t *dump_one);
diff --git a/net/devlink/health.c b/net/devlink/health.c
index 51e6e81e31bb..3c4c049c3636 100644
--- a/net/devlink/health.c
+++ b/net/devlink/health.c
@@ -1266,7 +1266,8 @@ devlink_health_reporter_get_from_cb_lock(struct netlink_callback *cb)
 	struct nlattr **attrs = info->attrs;
 	struct devlink *devlink;
 
-	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs);
+	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs,
+					      false);
 	if (IS_ERR(devlink))
 		return NULL;
 
diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
index 499304d9de49..14d598000d72 100644
--- a/net/devlink/netlink.c
+++ b/net/devlink/netlink.c
@@ -124,7 +124,8 @@ int devlink_nl_msg_reply_and_new(struct sk_buff **msg, struct genl_info *info)
 }
 
 struct devlink *
-devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs)
+devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
+			    bool dev_lock)
 {
 	struct devlink *devlink;
 	unsigned long index;
@@ -138,12 +139,12 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs)
 	devname = nla_data(attrs[DEVLINK_ATTR_DEV_NAME]);
 
 	devlinks_xa_for_each_registered_get(net, index, devlink) {
-		devl_lock(devlink);
+		devl_dev_lock(devlink, dev_lock);
 		if (devl_is_registered(devlink) &&
 		    strcmp(devlink->dev->bus->name, busname) == 0 &&
 		    strcmp(dev_name(devlink->dev), devname) == 0)
 			return devlink;
-		devl_unlock(devlink);
+		devl_dev_unlock(devlink, dev_lock);
 		devlink_put(devlink);
 	}
 
@@ -155,9 +156,12 @@ static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
 {
 	struct devlink_port *devlink_port;
 	struct devlink *devlink;
+	bool dev_lock;
 	int err;
 
-	devlink = devlink_get_from_attrs_lock(genl_info_net(info), info->attrs);
+	dev_lock = !!(flags & DEVLINK_NL_FLAG_NEED_DEV_LOCK);
+	devlink = devlink_get_from_attrs_lock(genl_info_net(info), info->attrs,
+					      dev_lock);
 	if (IS_ERR(devlink))
 		return PTR_ERR(devlink);
 
@@ -177,7 +181,7 @@ static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
 	return 0;
 
 unlock:
-	devl_unlock(devlink);
+	devl_dev_unlock(devlink, dev_lock);
 	devlink_put(devlink);
 	return err;
 }
@@ -205,9 +209,11 @@ void devlink_nl_post_doit(const struct genl_split_ops *ops,
 			  struct sk_buff *skb, struct genl_info *info)
 {
 	struct devlink *devlink;
+	bool dev_lock;
 
+	dev_lock = !!(ops->internal_flags & DEVLINK_NL_FLAG_NEED_DEV_LOCK);
 	devlink = info->user_ptr[0];
-	devl_unlock(devlink);
+	devl_dev_unlock(devlink, dev_lock);
 	devlink_put(devlink);
 }
 
@@ -219,7 +225,7 @@ static int devlink_nl_inst_single_dumpit(struct sk_buff *msg,
 	struct devlink *devlink;
 	int err;
 
-	devlink = devlink_get_from_attrs_lock(sock_net(msg->sk), attrs);
+	devlink = devlink_get_from_attrs_lock(sock_net(msg->sk), attrs, false);
 	if (IS_ERR(devlink))
 		return PTR_ERR(devlink);
 	err = dump_one(msg, devlink, cb, flags | NLM_F_DUMP_FILTERED);
@@ -420,6 +426,7 @@ static const struct genl_small_ops devlink_nl_small_ops[40] = {
 		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
 		.doit = devlink_nl_cmd_reload,
 		.flags = GENL_ADMIN_PERM,
+		.internal_flags = DEVLINK_NL_FLAG_NEED_DEV_LOCK,
 	},
 	{
 		.cmd = DEVLINK_CMD_PARAM_SET,
diff --git a/net/devlink/region.c b/net/devlink/region.c
index d197cdb662db..30c6c49ec10b 100644
--- a/net/devlink/region.c
+++ b/net/devlink/region.c
@@ -883,7 +883,8 @@ int devlink_nl_cmd_region_read_dumpit(struct sk_buff *skb,
 
 	start_offset = state->start_offset;
 
-	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs);
+	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs,
+					      false);
 	if (IS_ERR(devlink))
 		return PTR_ERR(devlink);
 
-- 
2.40.1



* [RFC PATCH net-next 04/12] PCI: Add no PM reset quirk for NVIDIA Spectrum devices
  2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
                   ` (2 preceding siblings ...)
  2023-10-17  7:42 ` [RFC PATCH net-next 03/12] devlink: Acquire device lock during reload Ido Schimmel
@ 2023-10-17  7:42 ` Ido Schimmel
  2023-10-18 19:40   ` Bjorn Helgaas
  2023-10-17  7:42 ` [RFC PATCH net-next 05/12] PCI: Add device-specific reset " Ido Schimmel
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  7:42 UTC (permalink / raw)
  To: netdev, linux-pci
  Cc: davem, kuba, pabeni, edumazet, bhelgaas, alex.williamson, lukas,
	petrm, jiri, mlxsw, Ido Schimmel

Spectrum-{1,2,3,4} devices report that a D3hot->D0 transition causes a
reset (i.e., they advertise NoSoftRst-). However, this transition seems
to have no effect on the device: It continues to be operational and
network ports remain up. Advertising this support makes it seem as if a
PM reset is viable for these devices. Mark it as unavailable to skip it
when testing reset methods.

Before:

 # cat /sys/bus/pci/devices/0000\:03\:00.0/reset_method
 pm bus

After:

 # cat /sys/bus/pci/devices/0000\:03\:00.0/reset_method
 bus
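
As a rough illustration of why the quirk is needed, the availability
check for a PM reset can be modeled as below. This is a sketch under
assumptions: the constant mirrors the No_Soft_Reset bit of the PMCSR
register, the flag name is a hypothetical stand-in for the quirk flag,
and the function is not the PCI core's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copy for the sketch; mirrors the PMCSR No_Soft_Reset bit. */
#define MODEL_PM_CTRL_NO_SOFT_RESET	0x0008
/* Hypothetical stand-in for the "no PM reset" quirk flag. */
#define MODEL_FLAG_NO_PM_RESET		0x1

static bool model_pm_reset_available(bool has_pm_cap, uint16_t pmcsr,
				     unsigned int dev_flags)
{
	/* No PM capability, or the quirk says the method does not work. */
	if (!has_pm_cap || (dev_flags & MODEL_FLAG_NO_PM_RESET))
		return false;
	/* NoSoftRst+ means D3hot->D0 keeps state, so nothing is reset. */
	if (pmcsr & MODEL_PM_CTRL_NO_SOFT_RESET)
		return false;
	return true;
}
```

A device that advertises NoSoftRst- passes the register check, so only
the quirk flag can take "pm" out of the list of reset methods.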

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/pci/quirks.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index eeec1d6f9023..23f6bd2184e2 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3784,6 +3784,19 @@ static void quirk_no_pm_reset(struct pci_dev *dev)
 DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
 			       PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
 
+/*
+ * Spectrum-{1,2,3,4} devices report that a D3hot->D0 transition causes a reset
+ * (i.e., they advertise NoSoftRst-). However, this transition seems to have no
+ * effect on the device: It continues to be operational and network ports
+ * remain up. Advertising this support makes it seem as if a PM reset is viable
+ * for these devices. Mark it as unavailable to skip it when testing reset
+ * methods.
+ */
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MELLANOX, 0xcb84, quirk_no_pm_reset);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MELLANOX, 0xcf6c, quirk_no_pm_reset);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MELLANOX, 0xcf70, quirk_no_pm_reset);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MELLANOX, 0xcf80, quirk_no_pm_reset);
+
 /*
  * Thunderbolt controllers with broken MSI hotplug signaling:
  * Entire 1st generation (Light Ridge, Eagle Ridge, Light Peak) and part
-- 
2.40.1



* [RFC PATCH net-next 05/12] PCI: Add device-specific reset for NVIDIA Spectrum devices
  2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
                   ` (3 preceding siblings ...)
  2023-10-17  7:42 ` [RFC PATCH net-next 04/12] PCI: Add no PM reset quirk for NVIDIA Spectrum devices Ido Schimmel
@ 2023-10-17  7:42 ` Ido Schimmel
  2023-10-17 10:00   ` Lukas Wunner
  2023-10-18 20:08   ` Bjorn Helgaas
  2023-10-17  7:42 ` [RFC PATCH net-next 06/12] PCI: Add debug print for device ready delay Ido Schimmel
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  7:42 UTC (permalink / raw)
  To: netdev, linux-pci
  Cc: davem, kuba, pabeni, edumazet, bhelgaas, alex.williamson, lukas,
	petrm, jiri, mlxsw, Ido Schimmel

The PCIe specification defines two methods to trigger a hot reset across
a link: Bus reset and link disablement (r6.0.1, sec 7.1, sec 6.6.1). In
the first method, the Secondary Bus Reset (SBR) bit in the Bridge
Control Register of the Downstream Port is asserted for at least 1ms
(r6.0.1, sec 7.5.1.3.13). In the second method, the Link Disable bit in
the Link Control Register of the Downstream Port is asserted and then
cleared to disable and enable the link (r6.0.1, sec 7.5.3.7).

While the two methods are identical from the perspective of the
Downstream device, they are different as far as the host is concerned.
In the first method, the Link Training and Status State Machine (LTSSM)
of the Downstream Port is expected to be in the Hot Reset state as long
as the SBR bit is asserted. In the second method, the LTSSM of the
Downstream Port is expected to be in the Disabled state as long as the
Link Disable bit is asserted.

The above difference is important because the specification requires
the LTSSM to exit from the Hot Reset state to the Detect state within a
2ms timeout (r6.0.1, sec 4.2.7.11). NVIDIA Spectrum devices
cannot guarantee it and a host enforcing such a behavior might fail to
communicate with the device after issuing a Secondary Bus Reset. With
the link disablement method, the host can leave the link disabled for
enough time to allow the device to undergo a hot reset and reach the
Detect state. After enabling the link, the host will exit from the
Disabled state to Detect state (r6.0.1, sec 4.2.7.9) and observe that
the device is already in the Detect state.

The PCI core only implements the first method, which might not work with
NVIDIA Spectrum devices on certain hosts, as explained above. Therefore,
implement the link disablement method as a device-specific method for
NVIDIA Spectrum devices. Specifically, disable the link, wait for 500ms,
enable the link and then wait for the device to become accessible.
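
The disable, wait, enable sequence can be sketched against a simulated
Link Control register. This is a user-space model with hypothetical
names: the delay hook records the wait instead of sleeping, and the
final wait for the device to become accessible is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Link Disable bit of the Link Control register (local copy). */
#define MODEL_LNKCTL_LD	0x0010

struct port_model {
	uint16_t lnkctl;	/* simulated Link Control register */
	unsigned int held_ms;	/* time spent with the link disabled */
};

/* Hypothetical delay hook: records the wait instead of sleeping. */
static void model_sleep(struct port_model *port, unsigned int ms)
{
	if (port->lnkctl & MODEL_LNKCTL_LD)
		port->held_ms += ms;
}

/* Disable the link, hold it down for hold_ms, then enable it again. */
static void model_link_toggle(struct port_model *port, unsigned int hold_ms)
{
	port->lnkctl |= MODEL_LNKCTL_LD;
	model_sleep(port, hold_ms);
	port->lnkctl &= (uint16_t)~MODEL_LNKCTL_LD;
}
```

Only the Link Disable bit is touched, so the other bits of the register
are left as they were before the toggle.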

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/pci/quirks.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 23f6bd2184e2..a6e308bb934c 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4182,6 +4182,31 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
 	return 0;
 }
 
+#define PCI_DEVICE_ID_MELLANOX_SPECTRUM		0xcb84
+#define PCI_DEVICE_ID_MELLANOX_SPECTRUM2	0xcf6c
+#define PCI_DEVICE_ID_MELLANOX_SPECTRUM3	0xcf70
+#define PCI_DEVICE_ID_MELLANOX_SPECTRUM4	0xcf80
+
+static int reset_mlx(struct pci_dev *pdev, bool probe)
+{
+	struct pci_dev *bridge = pdev->bus->self;
+
+	if (probe)
+		return 0;
+
+	/*
+	 * Disable the link on the Downstream port in order to trigger a hot
+	 * reset in the Downstream device. Wait for 500ms before enabling the
+	 * link so that the firmware on the device will have enough time to
+	 * transition the Upstream port to the Detect state.
+	 */
+	pcie_capability_set_word(bridge, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LD);
+	msleep(500);
+	pcie_capability_clear_word(bridge, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LD);
+
+	return pci_bridge_wait_for_secondary_bus(bridge, "link toggle");
+}
+
 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
 	{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
 		 reset_intel_82599_sfp_virtfn },
@@ -4197,6 +4222,10 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
 		reset_chelsio_generic_dev },
 	{ PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HINIC_VF,
 		reset_hinic_vf_dev },
+	{ PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_SPECTRUM, reset_mlx },
+	{ PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_SPECTRUM2, reset_mlx },
+	{ PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_SPECTRUM3, reset_mlx },
+	{ PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_SPECTRUM4, reset_mlx },
 	{ 0 }
 };
 
-- 
2.40.1



* [RFC PATCH net-next 06/12] PCI: Add debug print for device ready delay
  2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
                   ` (4 preceding siblings ...)
  2023-10-17  7:42 ` [RFC PATCH net-next 05/12] PCI: Add device-specific reset " Ido Schimmel
@ 2023-10-17  7:42 ` Ido Schimmel
  2023-10-18 19:41   ` Bjorn Helgaas
  2023-10-17  7:42 ` [RFC PATCH net-next 07/12] mlxsw: Extend MRSR pack() function to support new commands Ido Schimmel
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  7:42 UTC (permalink / raw)
  To: netdev, linux-pci
  Cc: davem, kuba, pabeni, edumazet, bhelgaas, alex.williamson, lukas,
	petrm, jiri, mlxsw, Ido Schimmel

Currently, the time it took a PCI device to become ready after reset is
only printed if it was longer than 1000ms ('PCI_RESET_WAIT'). However,
for debugging purposes it is useful to know this time even if it was
shorter. For example, with the device I am working on, hardware
engineers asked to verify that it becomes ready on the first try (no
delay).

To that end, add a debug level print that can be enabled using dynamic
debug. Example:

 # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
 # dmesg -c | grep ready
 # echo "file drivers/pci/pci.c +p" > /sys/kernel/debug/dynamic_debug/control
 # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
 # dmesg -c | grep ready
 [  396.060335] mlxsw_spectrum4 0000:01:00.0: ready 0ms after link toggle
 # echo "file drivers/pci/pci.c -p" > /sys/kernel/debug/dynamic_debug/control
 # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
 # dmesg -c | grep ready
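
The choice between the two print levels can be modeled as below. This is
a sketch with hypothetical names. It assumes that the polling loop keeps
a back-off value that starts at 1, and that the reported time is that
value minus one, which matches the "ready 0ms" line above.

```c
#include <assert.h>

/* Threshold in milliseconds; mirrors PCI_RESET_WAIT from the text. */
#define MODEL_RESET_WAIT	1000

enum model_level { MODEL_LEVEL_DEBUG, MODEL_LEVEL_INFO };

/* 'delay' is the back-off value of the poll loop, assumed to start at 1. */
static enum model_level model_ready_level(int delay)
{
	return delay > MODEL_RESET_WAIT ? MODEL_LEVEL_INFO : MODEL_LEVEL_DEBUG;
}

/* The value that ends up in the "ready %dms" message. */
static int model_ready_ms(int delay)
{
	return delay - 1;
}
```

A device that is ready on the first try has a back-off value of 1, so
it is reported as 0ms at debug level.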

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/pci/pci.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 59c01d68c6d5..0a708e65c5c4 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1216,6 +1216,9 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 	if (delay > PCI_RESET_WAIT)
 		pci_info(dev, "ready %dms after %s\n", delay - 1,
 			 reset_type);
+	else
+		pci_dbg(dev, "ready %dms after %s\n", delay - 1,
+			reset_type);
 
 	return 0;
 }
-- 
2.40.1



* [RFC PATCH net-next 07/12] mlxsw: Extend MRSR pack() function to support new commands
  2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
                   ` (5 preceding siblings ...)
  2023-10-17  7:42 ` [RFC PATCH net-next 06/12] PCI: Add debug print for device ready delay Ido Schimmel
@ 2023-10-17  7:42 ` Ido Schimmel
  2023-10-17  7:42 ` [RFC PATCH net-next 08/12] mlxsw: pci: Rename mlxsw_pci_sw_reset() Ido Schimmel
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  7:42 UTC (permalink / raw)
  To: netdev, linux-pci
  Cc: davem, kuba, pabeni, edumazet, bhelgaas, alex.williamson, lukas,
	petrm, jiri, mlxsw, Amit Cohen, Ido Schimmel

From: Amit Cohen <amcohen@nvidia.com>

Currently mlxsw_reg_mrsr_pack() always sets 'command=1'. In preparation
for supporting the new reset flow, pass the command as an argument to
the function and add an enum for this field.

For now, always pass 'command=1' to the pack() function.
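
The effect of passing the command as an argument can be sketched with a
user-space model of the payload. The names are hypothetical, and two
details are assumptions rather than facts taken from this patch: an
8-byte payload and a big-endian 32-bit word at offset 0x00 that holds
the 4-bit command field in its lowest bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MRSR_LEN	8	/* assumed payload length in bytes */

enum model_mrsr_command {
	MODEL_MRSR_COMMAND_SOFTWARE_RESET = 1,
	MODEL_MRSR_COMMAND_RESET_AT_PCI_DISABLE = 6,
};

/* Zero the payload, then store the command in the 4-bit field. */
static void model_mrsr_pack(uint8_t *payload, enum model_mrsr_command command)
{
	memset(payload, 0, MODEL_MRSR_LEN);
	/* Lowest 4 bits of the big-endian word at offset 0x00 (assumed). */
	payload[3] = (uint8_t)(command & 0xf);
}

/* Read the command back from the same 4-bit field. */
static unsigned int model_mrsr_command_get(const uint8_t *payload)
{
	return payload[3] & 0xf;
}
```

With the enum in place, existing callers keep passing the software
reset command and behave as before.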

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/pci.c |  2 +-
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 14 ++++++++++++--
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 7fae963b2608..afa7df273202 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -1479,7 +1479,7 @@ static int mlxsw_pci_sw_reset(struct mlxsw_pci *mlxsw_pci,
 		return err;
 	}
 
-	mlxsw_reg_mrsr_pack(mrsr_pl);
+	mlxsw_reg_mrsr_pack(mrsr_pl, MLXSW_REG_MRSR_COMMAND_SOFTWARE_RESET);
 	err = mlxsw_reg_write(mlxsw_pci->core, MLXSW_REG(mrsr), mrsr_pl);
 	if (err)
 		return err;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 9970921ceef3..44f528326394 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -10122,6 +10122,15 @@ mlxsw_reg_mgir_unpack(char *payload, u32 *hw_rev, char *fw_info_psid,
 
 MLXSW_REG_DEFINE(mrsr, MLXSW_REG_MRSR_ID, MLXSW_REG_MRSR_LEN);
 
+enum mlxsw_reg_mrsr_command {
+	/* Switch soft reset, does not reset PCI firmware. */
+	MLXSW_REG_MRSR_COMMAND_SOFTWARE_RESET = 1,
+	/* Reset is performed when the PCI link is disabled.
+	 * This command also resets the PCI firmware.
+	 */
+	MLXSW_REG_MRSR_COMMAND_RESET_AT_PCI_DISABLE = 6,
+};
+
 /* reg_mrsr_command
  * Reset/shutdown command
  * 0 - do nothing
@@ -10130,10 +10139,11 @@ MLXSW_REG_DEFINE(mrsr, MLXSW_REG_MRSR_ID, MLXSW_REG_MRSR_LEN);
  */
 MLXSW_ITEM32(reg, mrsr, command, 0x00, 0, 4);
 
-static inline void mlxsw_reg_mrsr_pack(char *payload)
+static inline void mlxsw_reg_mrsr_pack(char *payload,
+				       enum mlxsw_reg_mrsr_command command)
 {
 	MLXSW_REG_ZERO(mrsr, payload);
-	mlxsw_reg_mrsr_command_set(payload, 1);
+	mlxsw_reg_mrsr_command_set(payload, command);
 }
 
 /* MLCR - Management LED Control Register
-- 
2.40.1
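
For readers unfamiliar with the register helpers, here is a minimal
userspace sketch of what the pack() function above does with the
4-bit command field. It is not the driver's code: the mrsr_* names,
the 8-byte payload length, and the big-endian dword layout are
assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

#define MRSR_LEN 0x08 /* assumed payload length in bytes */

enum mrsr_command {
	MRSR_COMMAND_SOFTWARE_RESET = 1,
	MRSR_COMMAND_RESET_AT_PCI_DISABLE = 6,
};

/* Store a 4-bit command in bits 3:0 of the dword at offset 0x00.
 * Assuming a big-endian dword, those bits are the low nibble of byte 3.
 */
void mrsr_command_set(uint8_t *payload, enum mrsr_command command)
{
	payload[3] = (uint8_t)((payload[3] & 0xf0) |
			       ((unsigned int)command & 0x0f));
}

/* Read the 4-bit command back from the payload. */
unsigned int mrsr_command_get(const uint8_t *payload)
{
	return payload[3] & 0x0f;
}

/* Zero the whole payload, then set the requested command. */
void mrsr_pack(uint8_t *payload, enum mrsr_command command)
{
	memset(payload, 0, MRSR_LEN);
	mrsr_command_set(payload, command);
}
```

Passing the command as an enum keeps the call sites self-describing
once more than one reset type exists.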



* [RFC PATCH net-next 08/12] mlxsw: pci: Rename mlxsw_pci_sw_reset()
  2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
                   ` (6 preceding siblings ...)
  2023-10-17  7:42 ` [RFC PATCH net-next 07/12] mlxsw: Extend MRSR pack() function to support new commands Ido Schimmel
@ 2023-10-17  7:42 ` Ido Schimmel
  2023-10-17  7:42 ` [RFC PATCH net-next 09/12] mlxsw: pci: Move software reset code to a separate function Ido Schimmel
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  7:42 UTC (permalink / raw)
  To: netdev, linux-pci
  Cc: davem, kuba, pabeni, edumazet, bhelgaas, alex.williamson, lukas,
	petrm, jiri, mlxsw, Amit Cohen, Ido Schimmel

From: Amit Cohen <amcohen@nvidia.com>

In the next patches, mlxsw_pci_sw_reset() will be extended to support
more reset types and will not necessarily issue a software reset. Rename
the function to reflect that.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/pci.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index afa7df273202..af47d450332f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -1464,8 +1464,8 @@ static int mlxsw_pci_sys_ready_wait(struct mlxsw_pci *mlxsw_pci,
 	return -EBUSY;
 }
 
-static int mlxsw_pci_sw_reset(struct mlxsw_pci *mlxsw_pci,
-			      const struct pci_device_id *id)
+static int
+mlxsw_pci_reset(struct mlxsw_pci *mlxsw_pci, const struct pci_device_id *id)
 {
 	struct pci_dev *pdev = mlxsw_pci->pdev;
 	char mrsr_pl[MLXSW_REG_MRSR_LEN];
@@ -1525,9 +1525,9 @@ static int mlxsw_pci_init(void *bus_priv, struct mlxsw_core *mlxsw_core,
 	if (!mbox)
 		return -ENOMEM;
 
-	err = mlxsw_pci_sw_reset(mlxsw_pci, mlxsw_pci->id);
+	err = mlxsw_pci_reset(mlxsw_pci, mlxsw_pci->id);
 	if (err)
-		goto err_sw_reset;
+		goto err_reset;
 
 	err = mlxsw_pci_alloc_irq_vectors(mlxsw_pci);
 	if (err < 0) {
@@ -1659,7 +1659,7 @@ static int mlxsw_pci_init(void *bus_priv, struct mlxsw_core *mlxsw_core,
 err_query_fw:
 	mlxsw_pci_free_irq_vectors(mlxsw_pci);
 err_alloc_irq:
-err_sw_reset:
+err_reset:
 mbox_put:
 	mlxsw_cmd_mbox_free(mbox);
 	return err;
-- 
2.40.1



* [RFC PATCH net-next 09/12] mlxsw: pci: Move software reset code to a separate function
  2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
                   ` (7 preceding siblings ...)
  2023-10-17  7:42 ` [RFC PATCH net-next 08/12] mlxsw: pci: Rename mlxsw_pci_sw_reset() Ido Schimmel
@ 2023-10-17  7:42 ` Ido Schimmel
  2023-10-17  7:42 ` [RFC PATCH net-next 10/12] mlxsw: pci: Add support for new reset flow Ido Schimmel
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  7:42 UTC (permalink / raw)
  To: netdev, linux-pci
  Cc: davem, kuba, pabeni, edumazet, bhelgaas, alex.williamson, lukas,
	petrm, jiri, mlxsw, Amit Cohen, Ido Schimmel

From: Amit Cohen <amcohen@nvidia.com>

In general, the existing flow of software reset in the driver is:
1. Wait for system ready status.
2. Send MRSR command, to start the reset.
3. Wait for system ready status.

This flow will be extended once a new reset command is supported. In
preparation, move step #2 to a separate function.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/pci.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index af47d450332f..1980343ff873 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -1464,11 +1464,18 @@ static int mlxsw_pci_sys_ready_wait(struct mlxsw_pci *mlxsw_pci,
 	return -EBUSY;
 }
 
+static int mlxsw_pci_reset_sw(struct mlxsw_pci *mlxsw_pci)
+{
+	char mrsr_pl[MLXSW_REG_MRSR_LEN];
+
+	mlxsw_reg_mrsr_pack(mrsr_pl, MLXSW_REG_MRSR_COMMAND_SOFTWARE_RESET);
+	return mlxsw_reg_write(mlxsw_pci->core, MLXSW_REG(mrsr), mrsr_pl);
+}
+
 static int
 mlxsw_pci_reset(struct mlxsw_pci *mlxsw_pci, const struct pci_device_id *id)
 {
 	struct pci_dev *pdev = mlxsw_pci->pdev;
-	char mrsr_pl[MLXSW_REG_MRSR_LEN];
 	u32 sys_status;
 	int err;
 
@@ -1479,8 +1486,7 @@ mlxsw_pci_reset(struct mlxsw_pci *mlxsw_pci, const struct pci_device_id *id)
 		return err;
 	}
 
-	mlxsw_reg_mrsr_pack(mrsr_pl, MLXSW_REG_MRSR_COMMAND_SOFTWARE_RESET);
-	err = mlxsw_reg_write(mlxsw_pci->core, MLXSW_REG(mrsr), mrsr_pl);
+	err = mlxsw_pci_reset_sw(mlxsw_pci);
 	if (err)
 		return err;
 
-- 
2.40.1



* [RFC PATCH net-next 10/12] mlxsw: pci: Add support for new reset flow
  2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
                   ` (8 preceding siblings ...)
  2023-10-17  7:42 ` [RFC PATCH net-next 09/12] mlxsw: pci: Move software reset code to a separate function Ido Schimmel
@ 2023-10-17  7:42 ` Ido Schimmel
  2023-10-17  7:42 ` [RFC PATCH net-next 11/12] mlxsw: pci: Implement PCI reset handlers Ido Schimmel
  2023-10-17  7:42 ` [RFC PATCH net-next 12/12] selftests: mlxsw: Add PCI reset test Ido Schimmel
  11 siblings, 0 replies; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  7:42 UTC (permalink / raw)
  To: netdev, linux-pci
  Cc: davem, kuba, pabeni, edumazet, bhelgaas, alex.williamson, lukas,
	petrm, jiri, mlxsw, Ido Schimmel

The driver resets the device during probe and during a devlink reload.
The current reset method reloads the current firmware version or a
pending one, if one was previously flashed using devlink. However, the
current reset method does not result in a PCI hot reset, preventing the
PCI firmware from being upgraded, unless the system is rebooted.

To solve this problem, a new reset command (6) was implemented in the
firmware. Unlike the current command (1), after issuing the new command
the device will not start the reset immediately, but only after a PCI
hot reset.

Implement the new reset method by first verifying that it is supported
by the current firmware version by querying the Management Capabilities
Mask (MCAM) register. If supported, issue the new reset command (6) via
MRSR register followed by a PCI reset by calling
__pci_reset_function_locked().

Once the PCI firmware is operational, go back to the regular reset flow
and wait for the entire device to become ready. That is, repeatedly read
the "system_status" register from the BAR until a value of "FW_READY"
(0x5E) appears.

Tested:

 # for i in $(seq 1 10); do devlink dev reload pci/0000:01:00.0; done

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/pci.c | 44 ++++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlxsw/reg.h |  2 ++
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 1980343ff873..b5bb47b0215f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -1464,6 +1464,33 @@ static int mlxsw_pci_sys_ready_wait(struct mlxsw_pci *mlxsw_pci,
 	return -EBUSY;
 }
 
+static int mlxsw_pci_reset_at_pci_disable(struct mlxsw_pci *mlxsw_pci)
+{
+	struct pci_dev *pdev = mlxsw_pci->pdev;
+	char mrsr_pl[MLXSW_REG_MRSR_LEN];
+	int err;
+
+	mlxsw_reg_mrsr_pack(mrsr_pl,
+			    MLXSW_REG_MRSR_COMMAND_RESET_AT_PCI_DISABLE);
+	err = mlxsw_reg_write(mlxsw_pci->core, MLXSW_REG(mrsr), mrsr_pl);
+	if (err)
+		return err;
+
+	device_lock_assert(&pdev->dev);
+
+	pci_cfg_access_lock(pdev);
+	pci_save_state(pdev);
+
+	err = __pci_reset_function_locked(pdev);
+	if (err)
+		pci_err(pdev, "PCI function reset failed with %d\n", err);
+
+	pci_restore_state(pdev);
+	pci_cfg_access_unlock(pdev);
+
+	return err;
+}
+
 static int mlxsw_pci_reset_sw(struct mlxsw_pci *mlxsw_pci)
 {
 	char mrsr_pl[MLXSW_REG_MRSR_LEN];
@@ -1476,6 +1503,8 @@ static int
 mlxsw_pci_reset(struct mlxsw_pci *mlxsw_pci, const struct pci_device_id *id)
 {
 	struct pci_dev *pdev = mlxsw_pci->pdev;
+	char mcam_pl[MLXSW_REG_MCAM_LEN];
+	bool pci_reset_supported;
 	u32 sys_status;
 	int err;
 
@@ -1486,10 +1515,25 @@ mlxsw_pci_reset(struct mlxsw_pci *mlxsw_pci, const struct pci_device_id *id)
 		return err;
 	}
 
-	err = mlxsw_pci_reset_sw(mlxsw_pci);
+	mlxsw_reg_mcam_pack(mcam_pl,
+			    MLXSW_REG_MCAM_FEATURE_GROUP_ENHANCED_FEATURES);
+	err = mlxsw_reg_query(mlxsw_pci->core, MLXSW_REG(mcam), mcam_pl);
 	if (err)
 		return err;
 
+	mlxsw_reg_mcam_unpack(mcam_pl, MLXSW_REG_MCAM_PCI_RESET,
+			      &pci_reset_supported);
+
+	if (pci_reset_supported) {
+		pci_dbg(pdev, "Starting PCI reset flow\n");
+		err = mlxsw_pci_reset_at_pci_disable(mlxsw_pci);
+	} else {
+		pci_dbg(pdev, "Starting software reset flow\n");
+		err = mlxsw_pci_reset_sw(mlxsw_pci);
+	}
+	if (err)
+		return err;
+
 	err = mlxsw_pci_sys_ready_wait(mlxsw_pci, id, &sys_status);
 	if (err) {
 		dev_err(&pdev->dev, "Failed to reach system ready status after reset. Status is 0x%x\n",
diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 44f528326394..c314afd4a8ff 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -10594,6 +10594,8 @@ MLXSW_ITEM32(reg, mcam, feature_group, 0x00, 16, 8);
 enum mlxsw_reg_mcam_mng_feature_cap_mask_bits {
 	/* If set, MCIA supports 128 bytes payloads. Otherwise, 48 bytes. */
 	MLXSW_REG_MCAM_MCIA_128B = 34,
+	/* If set, MRSR.command=6 is supported. */
+	MLXSW_REG_MCAM_PCI_RESET = 48,
 };
 
 #define MLXSW_REG_BYTES_PER_DWORD 0x4
-- 
2.40.1
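
The overall flow described in the commit message (wait for ready,
pick a reset method based on the reported capability, wait for ready
again) can be sketched in userspace as follows. This is only a model
under stated assumptions: struct fake_dev and every function below
are hypothetical stand-ins, and the "device" becomes ready after a
fixed number of reads instead of being polled through the BAR.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SYS_STATUS_FW_READY 0x5E

/* Hypothetical stand-in for the device, not the driver's structure. */
struct fake_dev {
	uint32_t sys_status;      /* value read back from the BAR */
	bool pci_reset_supported; /* what the MCAM query would report */
	int polls_until_ready;    /* reads left before FW_READY appears */
	int sw_resets;            /* times command 1 was issued */
	int pci_resets;           /* times command 6 + PCI reset ran */
	int reset_err;            /* error injected into the reset step */
};

/* Read system_status repeatedly until FW_READY or the budget runs out. */
int sys_ready_wait(struct fake_dev *dev, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (dev->polls_until_ready > 0)
			dev->polls_until_ready--;
		else
			dev->sys_status = SYS_STATUS_FW_READY;
		if (dev->sys_status == SYS_STATUS_FW_READY)
			return 0;
	}
	return -EBUSY;
}

/* Model of command 1: the device goes away and needs a few polls. */
int reset_sw(struct fake_dev *dev)
{
	if (dev->reset_err)
		return dev->reset_err;
	dev->sw_resets++;
	dev->sys_status = 0;
	dev->polls_until_ready = 3;
	return 0;
}

/* Model of command 6 followed by a PCI reset issued by the PCI core. */
int reset_at_pci_disable(struct fake_dev *dev)
{
	if (dev->reset_err)
		return dev->reset_err;
	dev->pci_resets++;
	dev->sys_status = 0;
	dev->polls_until_ready = 3;
	return 0;
}

int dev_reset(struct fake_dev *dev)
{
	int err;

	err = sys_ready_wait(dev, 100);
	if (err)
		return err;

	if (dev->pci_reset_supported)
		err = reset_at_pci_disable(dev);
	else
		err = reset_sw(dev);
	if (err)
		return err; /* do not wait for a reset that never started */

	return sys_ready_wait(dev, 100);
}
```

Both branches converge on the same final wait, which is why the new
method can be added without touching the ready-polling logic.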



* [RFC PATCH net-next 11/12] mlxsw: pci: Implement PCI reset handlers
  2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
                   ` (9 preceding siblings ...)
  2023-10-17  7:42 ` [RFC PATCH net-next 10/12] mlxsw: pci: Add support for new reset flow Ido Schimmel
@ 2023-10-17  7:42 ` Ido Schimmel
  2023-10-17  7:42 ` [RFC PATCH net-next 12/12] selftests: mlxsw: Add PCI reset test Ido Schimmel
  11 siblings, 0 replies; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  7:42 UTC (permalink / raw)
  To: netdev, linux-pci
  Cc: davem, kuba, pabeni, edumazet, bhelgaas, alex.williamson, lukas,
	petrm, jiri, mlxsw, Ido Schimmel

Implement reset_prepare() and reset_done() handlers that are invoked by
the PCI core before and after issuing a PCI reset, respectively.

Specifically, implement reset_prepare() by calling
mlxsw_core_bus_device_unregister() and reset_done() by calling
mlxsw_core_bus_device_register(). This is the same implementation as the
reload_{down,up}() devlink operations with the following differences:

1. The devlink instance is unregistered and then registered again after
   the reset.

2. A reset via the device's command interface (using MRSR register) is
   not issued during reset_done() as PCI core already issued a PCI
   reset.

Tested:

 # for i in $(seq 1 10); do echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset; done

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/pci.c | 28 +++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index b5bb47b0215f..8de953902918 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -128,6 +128,7 @@ struct mlxsw_pci {
 	const struct pci_device_id *id;
 	enum mlxsw_pci_cqe_v max_cqe_ver; /* Maximal supported CQE version */
 	u8 num_sdq_cqs; /* Number of CQs used for SDQs */
+	bool skip_reset;
 };
 
 static void mlxsw_pci_queue_tasklet_schedule(struct mlxsw_pci_queue *q)
@@ -1515,6 +1516,10 @@ mlxsw_pci_reset(struct mlxsw_pci *mlxsw_pci, const struct pci_device_id *id)
 		return err;
 	}
 
+	/* PCI core already issued a PCI reset, do not issue another reset. */
+	if (mlxsw_pci->skip_reset)
+		return 0;
+
 	mlxsw_reg_mcam_pack(mcam_pl,
 			    MLXSW_REG_MCAM_FEATURE_GROUP_ENHANCED_FEATURES);
 	err = mlxsw_reg_query(mlxsw_pci->core, MLXSW_REG(mcam), mcam_pl);
@@ -2085,11 +2090,34 @@ static void mlxsw_pci_remove(struct pci_dev *pdev)
 	kfree(mlxsw_pci);
 }
 
+static void mlxsw_pci_reset_prepare(struct pci_dev *pdev)
+{
+	struct mlxsw_pci *mlxsw_pci = pci_get_drvdata(pdev);
+
+	mlxsw_core_bus_device_unregister(mlxsw_pci->core, false);
+}
+
+static void mlxsw_pci_reset_done(struct pci_dev *pdev)
+{
+	struct mlxsw_pci *mlxsw_pci = pci_get_drvdata(pdev);
+
+	mlxsw_pci->skip_reset = true;
+	mlxsw_core_bus_device_register(&mlxsw_pci->bus_info, &mlxsw_pci_bus,
+				       mlxsw_pci, false, NULL, NULL);
+	mlxsw_pci->skip_reset = false;
+}
+
+static const struct pci_error_handlers mlxsw_pci_err_handler = {
+	.reset_prepare = mlxsw_pci_reset_prepare,
+	.reset_done = mlxsw_pci_reset_done,
+};
+
 int mlxsw_pci_driver_register(struct pci_driver *pci_driver)
 {
 	pci_driver->probe = mlxsw_pci_probe;
 	pci_driver->remove = mlxsw_pci_remove;
 	pci_driver->shutdown = mlxsw_pci_remove;
+	pci_driver->err_handler = &mlxsw_pci_err_handler;
 	return pci_register_driver(pci_driver);
 }
 EXPORT_SYMBOL(mlxsw_pci_driver_register);
-- 
2.40.1
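
The role of the skip_reset flag can be illustrated with a small
userspace model. Everything here is a simplified assumption: struct
fake_pci and the fake_* functions are hypothetical, and the model
skips the whole reset step rather than only the part after the first
ready wait.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver's private data. */
struct fake_pci {
	bool skip_reset;   /* set only while reset_done() re-registers */
	bool registered;   /* core instance currently registered */
	int device_resets; /* resets issued through the command interface */
};

/* Init path shared by probe, devlink reload and reset_done(). */
int fake_bus_init(struct fake_pci *p)
{
	/* The PCI core already reset the device, so do not reset again. */
	if (!p->skip_reset)
		p->device_resets++;
	return 0;
}

int fake_core_register(struct fake_pci *p)
{
	int err = fake_bus_init(p);

	if (err)
		return err;
	p->registered = true;
	return 0;
}

void fake_core_unregister(struct fake_pci *p)
{
	p->registered = false;
}

/* Called before the PCI reset: tear the driver state down gracefully. */
void fake_reset_prepare(struct fake_pci *p)
{
	fake_core_unregister(p);
}

/* Called after the PCI reset: bring the driver back up without
 * issuing a second reset through the device's command interface.
 */
void fake_reset_done(struct fake_pci *p)
{
	p->skip_reset = true;
	fake_core_register(p);
	p->skip_reset = false;
}
```

Clearing the flag right after registration keeps later reloads on the
normal path, where the driver issues the reset itself.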



* [RFC PATCH net-next 12/12] selftests: mlxsw: Add PCI reset test
  2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
                   ` (10 preceding siblings ...)
  2023-10-17  7:42 ` [RFC PATCH net-next 11/12] mlxsw: pci: Implement PCI reset handlers Ido Schimmel
@ 2023-10-17  7:42 ` Ido Schimmel
  11 siblings, 0 replies; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  7:42 UTC (permalink / raw)
  To: netdev, linux-pci
  Cc: davem, kuba, pabeni, edumazet, bhelgaas, alex.williamson, lukas,
	petrm, jiri, mlxsw, Ido Schimmel

Test that PCI reset works correctly by verifying that only the expected
reset methods are supported and that after issuing the reset the ifindex
of the port changes.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 .../selftests/drivers/net/mlxsw/pci_reset.sh  | 58 +++++++++++++++++++
 1 file changed, 58 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/pci_reset.sh

diff --git a/tools/testing/selftests/drivers/net/mlxsw/pci_reset.sh b/tools/testing/selftests/drivers/net/mlxsw/pci_reset.sh
new file mode 100755
index 000000000000..2ea22806d530
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/mlxsw/pci_reset.sh
@@ -0,0 +1,58 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test that PCI reset works correctly by verifying that only the expected reset
+# methods are supported and that after issuing the reset the ifindex of the
+# port changes.
+
+lib_dir=$(dirname $0)/../../../net/forwarding
+
+ALL_TESTS="
+	pci_reset_test
+"
+NUM_NETIFS=1
+source $lib_dir/lib.sh
+source $lib_dir/devlink_lib.sh
+
+pci_reset_test()
+{
+	RET=0
+
+	local bus=$(echo $DEVLINK_DEV | cut -d '/' -f 1)
+	local bdf=$(echo $DEVLINK_DEV | cut -d '/' -f 2)
+
+	if [ $bus != "pci" ]; then
+		check_err 1 "devlink device is not a PCI device"
+		log_test "pci reset"
+		return
+	fi
+
+	if [ ! -f /sys/bus/pci/devices/$bdf/reset_method ]; then
+		check_err 1 "reset is not supported"
+		log_test "pci reset"
+		return
+	fi
+
+	[[ $(cat /sys/bus/pci/devices/$bdf/reset_method) == "device_specific bus" ]]
+	check_err $? "only \"device_specific\" and \"bus\" reset methods should be supported"
+
+	local ifindex_pre=$(ip -j link show dev $swp1 | jq '.[]["ifindex"]')
+
+	echo 1 > /sys/bus/pci/devices/$bdf/reset
+	check_err $? "reset failed"
+
+	# Wait for udev to rename newly created netdev.
+	udevadm settle
+
+	local ifindex_post=$(ip -j link show dev $swp1 | jq '.[]["ifindex"]')
+
+	[[ $ifindex_pre != $ifindex_post ]]
+	check_err $? "reset not performed"
+
+	log_test "pci reset"
+}
+
+swp1=${NETIFS[p1]}
+tests_run
+
+exit $EXIT_STATUS
-- 
2.40.1



* Re: [RFC PATCH net-next 02/12] devlink: Hold a reference on parent device
  2023-10-17  7:42 ` [RFC PATCH net-next 02/12] devlink: Hold a reference on parent device Ido Schimmel
@ 2023-10-17  7:56   ` Jiri Pirko
  2023-10-17  8:11     ` Ido Schimmel
  0 siblings, 1 reply; 24+ messages in thread
From: Jiri Pirko @ 2023-10-17  7:56 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, linux-pci, davem, kuba, pabeni, edumazet, bhelgaas,
	alex.williamson, lukas, petrm, jiri, mlxsw

Tue, Oct 17, 2023 at 09:42:47AM CEST, idosch@nvidia.com wrote:
>Each devlink instance is associated with a parent device and a pointer
>to this device is stored in the devlink structure, but devlink does not
>hold a reference on this device.
>
>This is going to be a problem in the next patch where - among other
>things - devlink will acquire the device lock during netns dismantle,
>before the reload operation. Since netns dismantle is performed
>asynchronously and since a reference is not held on the parent device,
>it will be possible to hit a use-after-free.
>
>Prepare for the upcoming change by holding a reference on the parent
>device.
>

Just a note, I'm currently pushing the same patch as a part
of my patchset:
https://lore.kernel.org/all/20231013121029.353351-4-jiri@resnulli.us/


>Signed-off-by: Ido Schimmel <idosch@nvidia.com>
>Reviewed-by: Jiri Pirko <jiri@nvidia.com>
>---
> net/devlink/core.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/net/devlink/core.c b/net/devlink/core.c
>index bcbbb952569f..5b8b692b8c76 100644
>--- a/net/devlink/core.c
>+++ b/net/devlink/core.c
>@@ -4,6 +4,7 @@
>  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
>  */
> 
>+#include <linux/device.h>
> #include <net/genetlink.h>
> #define CREATE_TRACE_POINTS
> #include <trace/events/devlink.h>
>@@ -310,6 +311,7 @@ static void devlink_release(struct work_struct *work)
> 
> 	mutex_destroy(&devlink->lock);
> 	lockdep_unregister_key(&devlink->lock_key);
>+	put_device(devlink->dev);
> 	kfree(devlink);
> }
> 
>@@ -425,6 +427,7 @@ struct devlink *devlink_alloc_ns(const struct devlink_ops *ops,
> 	if (ret < 0)
> 		goto err_xa_alloc;
> 
>+	get_device(dev);
> 	devlink->dev = dev;

Nit:
	devlink->dev = get_device(dev);


> 	devlink->ops = ops;
> 	xa_init_flags(&devlink->ports, XA_FLAGS_ALLOC);
>-- 
>2.40.1
>
>
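
The nit above relies on get_device() returning the pointer it was
given, which lets the reference be taken and stored in one statement.
A minimal userspace sketch of that pattern, using a plain integer
counter and hypothetical fake_* names in place of the kernel's
kobject-based reference counting:

```c
#include <stddef.h>

/* Hypothetical device with a bare reference counter. */
struct fake_device {
	int refcount;
};

/* Take a reference and return the same pointer; NULL is passed through. */
struct fake_device *fake_get_device(struct fake_device *dev)
{
	if (dev)
		dev->refcount++;
	return dev;
}

/* Drop a reference; NULL is ignored. */
void fake_put_device(struct fake_device *dev)
{
	if (dev)
		dev->refcount--;
}

struct fake_devlink {
	struct fake_device *dev;
};

/* Allocation side: store the parent and pin it in a single statement. */
void fake_devlink_init(struct fake_devlink *dl, struct fake_device *dev)
{
	dl->dev = fake_get_device(dev);
}

/* Release side: drop the reference taken at init time. */
void fake_devlink_release(struct fake_devlink *dl)
{
	fake_put_device(dl->dev);
	dl->dev = NULL;
}
```

Pairing the get in the allocation path with the put in the release
path is what keeps the parent alive for as long as the devlink
instance can still dereference it.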


* Re: [RFC PATCH net-next 03/12] devlink: Acquire device lock during reload
  2023-10-17  7:42 ` [RFC PATCH net-next 03/12] devlink: Acquire device lock during reload Ido Schimmel
@ 2023-10-17  8:04   ` Jiri Pirko
  0 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2023-10-17  8:04 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, linux-pci, davem, kuba, pabeni, edumazet, bhelgaas,
	alex.williamson, lukas, petrm, jiri, mlxsw

Tue, Oct 17, 2023 at 09:42:48AM CEST, idosch@nvidia.com wrote:
>Device drivers register with devlink from their probe routines (under
>the device lock) by acquiring the devlink instance lock and calling
>devl_register().
>
>Drivers that support a devlink reload usually implement the
>reload_{down, up}() operations in a similar fashion to their remove and
>probe routines, respectively.
>
>However, while the remove and probe routines are invoked with the device
>lock held, the reload operations are only invoked with the devlink
>instance lock held. It is therefore impossible for drivers to acquire
>the device lock from their reload operations, as this would result in
>lock inversion.
>
>The motivating use case for invoking the reload operations with the
>device lock held is in mlxsw which needs to trigger a PCI reset as part
>of the reload. The driver cannot call pci_reset_function() as this
>function acquires the device lock. Instead, it needs to call
>__pci_reset_function_locked(), which expects the device lock to be held.
>
>To that end, adjust devlink to always acquire the device lock before the
>devlink instance lock when performing a reload. Do that both when reload
>is triggered explicitly by user space and when it is triggered as part
>of netns dismantle.
>
>Tested the following flows with netdevsim and mlxsw while lockdep is
>enabled:
>
>netdevsim:
>
> # echo "10 1" > /sys/bus/netdevsim/new_device
> # devlink dev reload netdevsim/netdevsim10
> # ip netns add bla
> # devlink dev reload netdevsim/netdevsim10 netns bla
> # ip netns del bla
> # echo 10 > /sys/bus/netdevsim/del_device
>
>mlxsw:
>
> # devlink dev reload pci/0000:01:00.0
> # ip netns add bla
> # devlink dev reload pci/0000:01:00.0 netns bla
> # ip netns del bla
> # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/remove
> # echo 1 > /sys/bus/pci/rescan
>
>Signed-off-by: Ido Schimmel <idosch@nvidia.com>
>---
> net/devlink/core.c          |  4 ++--
> net/devlink/dev.c           |  8 ++++++++
> net/devlink/devl_internal.h | 19 ++++++++++++++++++-
> net/devlink/health.c        |  3 ++-
> net/devlink/netlink.c       | 21 ++++++++++++++-------
> net/devlink/region.c        |  3 ++-
> 6 files changed, 46 insertions(+), 12 deletions(-)
>
>diff --git a/net/devlink/core.c b/net/devlink/core.c
>index 5b8b692b8c76..0f866f2cbaf6 100644
>--- a/net/devlink/core.c
>+++ b/net/devlink/core.c
>@@ -502,14 +502,14 @@ static void __net_exit devlink_pernet_pre_exit(struct net *net)
> 	 * all devlink instances from this namespace into init_net.
> 	 */
> 	devlinks_xa_for_each_registered_get(net, index, devlink) {
>-		devl_lock(devlink);
>+		devl_dev_lock(devlink, true);
> 		err = 0;
> 		if (devl_is_registered(devlink))
> 			err = devlink_reload(devlink, &init_net,
> 					     DEVLINK_RELOAD_ACTION_DRIVER_REINIT,
> 					     DEVLINK_RELOAD_LIMIT_UNSPEC,
> 					     &actions_performed, NULL);
>-		devl_unlock(devlink);
>+		devl_dev_unlock(devlink, true);
> 		devlink_put(devlink);
> 		if (err && err != -EOPNOTSUPP)
> 			pr_warn("Failed to reload devlink instance into init_net\n");
>diff --git a/net/devlink/dev.c b/net/devlink/dev.c
>index dc8039ca2b38..70cebe716187 100644
>--- a/net/devlink/dev.c
>+++ b/net/devlink/dev.c
>@@ -4,6 +4,7 @@
>  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
>  */
> 
>+#include <linux/device.h>
> #include <net/genetlink.h>
> #include <net/sock.h>
> #include "devl_internal.h"
>@@ -433,6 +434,13 @@ int devlink_reload(struct devlink *devlink, struct net *dest_net,
> 	struct net *curr_net;
> 	int err;
> 
>+	/* Make sure the reload operations are invoked with the device lock
>+	 * held to allow drivers to trigger functionality that expects it
>+	 * (e.g., PCI reset) and to close possible races between these
>+	 * operations and probe/remove.
>+	 */
>+	device_lock_assert(devlink->dev);
>+
> 	memcpy(remote_reload_stats, devlink->stats.remote_reload_stats,
> 	       sizeof(remote_reload_stats));
> 
>diff --git a/net/devlink/devl_internal.h b/net/devlink/devl_internal.h
>index 741d1bf1bec8..a9c5e52c40a7 100644
>--- a/net/devlink/devl_internal.h
>+++ b/net/devlink/devl_internal.h
>@@ -3,6 +3,7 @@
>  * Copyright (c) 2016 Jiri Pirko <jiri@mellanox.com>
>  */
> 
>+#include <linux/device.h>
> #include <linux/etherdevice.h>
> #include <linux/mutex.h>
> #include <linux/netdevice.h>
>@@ -96,6 +97,20 @@ static inline bool devl_is_registered(struct devlink *devlink)
> 	return xa_get_mark(&devlinks, devlink->index, DEVLINK_REGISTERED);
> }
> 
>+static inline void devl_dev_lock(struct devlink *devlink, bool dev_lock)
>+{
>+	if (dev_lock)
>+		device_lock(devlink->dev);
>+	devl_lock(devlink);
>+}
>+
>+static inline void devl_dev_unlock(struct devlink *devlink, bool dev_lock)
>+{
>+	devl_unlock(devlink);
>+	if (dev_lock)
>+		device_unlock(devlink->dev);
>+}
>+
> typedef void devlink_rel_notify_cb_t(struct devlink *devlink, u32 obj_index);
> typedef void devlink_rel_cleanup_cb_t(struct devlink *devlink, u32 obj_index,
> 				      u32 rel_index);
>@@ -113,6 +128,7 @@ int devlink_rel_devlink_handle_put(struct sk_buff *msg, struct devlink *devlink,
> /* Netlink */
> #define DEVLINK_NL_FLAG_NEED_PORT		BIT(0)
> #define DEVLINK_NL_FLAG_NEED_DEVLINK_OR_PORT	BIT(1)
>+#define DEVLINK_NL_FLAG_NEED_DEV_LOCK		BIT(2)
> 
> enum devlink_multicast_groups {
> 	DEVLINK_MCGRP_CONFIG,
>@@ -140,7 +156,8 @@ typedef int devlink_nl_dump_one_func_t(struct sk_buff *msg,
> 				       int flags);
> 
> struct devlink *
>-devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs);
>+devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
>+			    bool dev_lock);
> 
> int devlink_nl_dumpit(struct sk_buff *msg, struct netlink_callback *cb,
> 		      devlink_nl_dump_one_func_t *dump_one);
>diff --git a/net/devlink/health.c b/net/devlink/health.c
>index 51e6e81e31bb..3c4c049c3636 100644
>--- a/net/devlink/health.c
>+++ b/net/devlink/health.c
>@@ -1266,7 +1266,8 @@ devlink_health_reporter_get_from_cb_lock(struct netlink_callback *cb)
> 	struct nlattr **attrs = info->attrs;
> 	struct devlink *devlink;
> 
>-	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs);
>+	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs,
>+					      false);
> 	if (IS_ERR(devlink))
> 		return NULL;
> 
>diff --git a/net/devlink/netlink.c b/net/devlink/netlink.c
>index 499304d9de49..14d598000d72 100644
>--- a/net/devlink/netlink.c
>+++ b/net/devlink/netlink.c
>@@ -124,7 +124,8 @@ int devlink_nl_msg_reply_and_new(struct sk_buff **msg, struct genl_info *info)
> }
> 
> struct devlink *
>-devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs)
>+devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs,
>+			    bool dev_lock)
> {
> 	struct devlink *devlink;
> 	unsigned long index;
>@@ -138,12 +139,12 @@ devlink_get_from_attrs_lock(struct net *net, struct nlattr **attrs)
> 	devname = nla_data(attrs[DEVLINK_ATTR_DEV_NAME]);
> 
> 	devlinks_xa_for_each_registered_get(net, index, devlink) {
>-		devl_lock(devlink);
>+		devl_dev_lock(devlink, dev_lock);
> 		if (devl_is_registered(devlink) &&
> 		    strcmp(devlink->dev->bus->name, busname) == 0 &&
> 		    strcmp(dev_name(devlink->dev), devname) == 0)
> 			return devlink;
>-		devl_unlock(devlink);
>+		devl_dev_unlock(devlink, dev_lock);
> 		devlink_put(devlink);
> 	}
> 
>@@ -155,9 +156,12 @@ static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
> {
> 	struct devlink_port *devlink_port;
> 	struct devlink *devlink;
>+	bool dev_lock;
> 	int err;
> 
>-	devlink = devlink_get_from_attrs_lock(genl_info_net(info), info->attrs);
>+	dev_lock = !!(flags & DEVLINK_NL_FLAG_NEED_DEV_LOCK);

I know you are aware, but just for the record: This conflicts
with my patchset "devlink: finish conversion to generated split_ops"
where I'm removing use of internal_flags. Ops that need this (should
be only reload) would need separate devlink_nl_pre/post_doit() helpers.

Otherwise the patch looks fine to me.


>+	devlink = devlink_get_from_attrs_lock(genl_info_net(info), info->attrs,
>+					      dev_lock);
> 	if (IS_ERR(devlink))
> 		return PTR_ERR(devlink);
> 
>@@ -177,7 +181,7 @@ static int __devlink_nl_pre_doit(struct sk_buff *skb, struct genl_info *info,
> 	return 0;
> 
> unlock:
>-	devl_unlock(devlink);
>+	devl_dev_unlock(devlink, dev_lock);
> 	devlink_put(devlink);
> 	return err;
> }
>@@ -205,9 +209,11 @@ void devlink_nl_post_doit(const struct genl_split_ops *ops,
> 			  struct sk_buff *skb, struct genl_info *info)
> {
> 	struct devlink *devlink;
>+	bool dev_lock;
> 
>+	dev_lock = !!(ops->internal_flags & DEVLINK_NL_FLAG_NEED_DEV_LOCK);
> 	devlink = info->user_ptr[0];
>-	devl_unlock(devlink);
>+	devl_dev_unlock(devlink, dev_lock);
> 	devlink_put(devlink);
> }
> 
>@@ -219,7 +225,7 @@ static int devlink_nl_inst_single_dumpit(struct sk_buff *msg,
> 	struct devlink *devlink;
> 	int err;
> 
>-	devlink = devlink_get_from_attrs_lock(sock_net(msg->sk), attrs);
>+	devlink = devlink_get_from_attrs_lock(sock_net(msg->sk), attrs, false);
> 	if (IS_ERR(devlink))
> 		return PTR_ERR(devlink);
> 	err = dump_one(msg, devlink, cb, flags | NLM_F_DUMP_FILTERED);
>@@ -420,6 +426,7 @@ static const struct genl_small_ops devlink_nl_small_ops[40] = {
> 		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
> 		.doit = devlink_nl_cmd_reload,
> 		.flags = GENL_ADMIN_PERM,
>+		.internal_flags = DEVLINK_NL_FLAG_NEED_DEV_LOCK,
> 	},
> 	{
> 		.cmd = DEVLINK_CMD_PARAM_SET,
>diff --git a/net/devlink/region.c b/net/devlink/region.c
>index d197cdb662db..30c6c49ec10b 100644
>--- a/net/devlink/region.c
>+++ b/net/devlink/region.c
>@@ -883,7 +883,8 @@ int devlink_nl_cmd_region_read_dumpit(struct sk_buff *skb,
> 
> 	start_offset = state->start_offset;
> 
>-	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs);
>+	devlink = devlink_get_from_attrs_lock(sock_net(cb->skb->sk), attrs,
>+					      false);
> 	if (IS_ERR(devlink))
> 		return PTR_ERR(devlink);
> 
>-- 
>2.40.1
>
>
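
The locking rule in this patch (device lock first, devlink instance
lock second, released in reverse order) can be modelled in userspace
with two flags and assertions. This is a sketch only: the fake_*
names are hypothetical and the booleans stand in for real mutexes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical devlink instance with both locks modelled as flags. */
struct fake_devlink {
	bool dev_locked;  /* parent device lock */
	bool devl_locked; /* devlink instance lock */
};

/* Optionally take the device lock, then always the instance lock. */
void fake_devl_dev_lock(struct fake_devlink *dl, bool dev_lock)
{
	if (dev_lock) {
		/* Taking it after the instance lock would invert the order. */
		assert(!dl->devl_locked);
		assert(!dl->dev_locked);
		dl->dev_locked = true;
	}
	assert(!dl->devl_locked);
	dl->devl_locked = true;
}

/* Release in the reverse order of acquisition. */
void fake_devl_dev_unlock(struct fake_devlink *dl, bool dev_lock)
{
	assert(dl->devl_locked);
	dl->devl_locked = false;
	if (dev_lock) {
		assert(dl->dev_locked);
		dl->dev_locked = false;
	}
}

/* Reload refuses to run unless both locks are held, loosely mirroring
 * the device_lock_assert() added to devlink_reload().
 */
int fake_reload(const struct fake_devlink *dl)
{
	if (!dl->dev_locked || !dl->devl_locked)
		return -1;
	return 0;
}
```

Callers that never reload pass false and keep their existing
behaviour, which is why only the reload command sets the new flag.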


* Re: [RFC PATCH net-next 02/12] devlink: Hold a reference on parent device
  2023-10-17  7:56   ` Jiri Pirko
@ 2023-10-17  8:11     ` Ido Schimmel
  2023-10-17  9:01       ` Jiri Pirko
  0 siblings, 1 reply; 24+ messages in thread
From: Ido Schimmel @ 2023-10-17  8:11 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, linux-pci, davem, kuba, pabeni, edumazet, bhelgaas,
	alex.williamson, lukas, petrm, jiri, mlxsw

On Tue, Oct 17, 2023 at 09:56:18AM +0200, Jiri Pirko wrote:
> Tue, Oct 17, 2023 at 09:42:47AM CEST, idosch@nvidia.com wrote:
> >Each devlink instance is associated with a parent device and a pointer
> >to this device is stored in the devlink structure, but devlink does not
> >hold a reference on this device.
> >
> >This is going to be a problem in the next patch where - among other
> >things - devlink will acquire the device lock during netns dismantle,
> >before the reload operation. Since netns dismantle is performed
> >asynchronously and since a reference is not held on the parent device,
> >it will be possible to hit a use-after-free.
> >
> >Prepare for the upcoming change by holding a reference on the parent
> >device.
> >
> 
> Just a note, I'm currently pushing the same patch as a part
> of my patchset:
> https://lore.kernel.org/all/20231013121029.353351-4-jiri@resnulli.us/

Then you probably need patch #1 as well:

https://lore.kernel.org/netdev/20231017074257.3389177-2-idosch@nvidia.com/
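
The reference-holding idea in the quoted commit message can be illustrated with a small userspace model. This is only a sketch: the model_* names are hypothetical stand-ins for get_device()/put_device() and the devlink alloc/free paths, not the code from either patchset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device: only the reference count. */
struct model_device {
	int refcount;
	bool released;
};

static struct model_device *model_get_device(struct model_device *dev)
{
	dev->refcount++;
	return dev;
}

static void model_put_device(struct model_device *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->released = true;	/* the release callback would run here */
}

/* Hypothetical stand-in for the devlink instance and its parent pointer. */
struct model_devlink {
	struct model_device *dev;
};

static void model_devlink_init(struct model_devlink *dl,
			       struct model_device *dev)
{
	/* Take a reference so the parent outlives asynchronous users. */
	dl->dev = model_get_device(dev);
}

static void model_devlink_free(struct model_devlink *dl)
{
	model_put_device(dl->dev);
	dl->dev = NULL;
}
```

In this model, a parent that starts with one reference survives its owner's model_put_device() for as long as the devlink instance exists, and is released only when model_devlink_free() drops the last reference.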


* Re: [RFC PATCH net-next 02/12] devlink: Hold a reference on parent device
  2023-10-17  8:11     ` Ido Schimmel
@ 2023-10-17  9:01       ` Jiri Pirko
  0 siblings, 0 replies; 24+ messages in thread
From: Jiri Pirko @ 2023-10-17  9:01 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, linux-pci, davem, kuba, pabeni, edumazet, bhelgaas,
	alex.williamson, lukas, petrm, jiri, mlxsw

Tue, Oct 17, 2023 at 10:11:24AM CEST, idosch@nvidia.com wrote:
>On Tue, Oct 17, 2023 at 09:56:18AM +0200, Jiri Pirko wrote:
>> Tue, Oct 17, 2023 at 09:42:47AM CEST, idosch@nvidia.com wrote:
>> >Each devlink instance is associated with a parent device and a pointer
>> >to this device is stored in the devlink structure, but devlink does not
>> >hold a reference on this device.
>> >
>> >This is going to be a problem in the next patch where - among other
>> >things - devlink will acquire the device lock during netns dismantle,
>> >before the reload operation. Since netns dismantle is performed
>> >asynchronously and since a reference is not held on the parent device,
>> >it will be possible to hit a use-after-free.
>> >
>> >Prepare for the upcoming change by holding a reference on the parent
>> >device.
>> >
>> 
>> Just a note, I'm currently pushing the same patch as a part
>> of my patchset:
>> https://lore.kernel.org/all/20231013121029.353351-4-jiri@resnulli.us/
>
>Then you probably need patch #1 as well:
>
>https://lore.kernel.org/netdev/20231017074257.3389177-2-idosch@nvidia.com/

Correct.


* Re: [RFC PATCH net-next 05/12] PCI: Add device-specific reset for NVIDIA Spectrum devices
  2023-10-17  7:42 ` [RFC PATCH net-next 05/12] PCI: Add device-specific reset " Ido Schimmel
@ 2023-10-17 10:00   ` Lukas Wunner
  2023-10-18 20:08   ` Bjorn Helgaas
  1 sibling, 0 replies; 24+ messages in thread
From: Lukas Wunner @ 2023-10-17 10:00 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, linux-pci, davem, kuba, pabeni, edumazet, bhelgaas,
	alex.williamson, petrm, jiri, mlxsw

On Tue, Oct 17, 2023 at 10:42:50AM +0300, Ido Schimmel wrote:
> The PCIe specification defines two methods to trigger a hot reset across
> a link: Bus reset and link disablement (r6.0.1, sec 7.1, sec 6.6.1). In
> the first method, the Secondary Bus Reset (SBR) bit in the Bridge
> Control Register of the Downstream Port is asserted for at least 1ms
> (r6.0.1, sec 7.5.1.3.13). In the second method, the Link Disable bit in
> the Link Control Register of the Downstream Port is asserted and then
> cleared to disable and enable the link (r6.0.1, sec 7.5.3.7).
> 
> While the two methods are identical from the perspective of the
> Downstream device, they are different as far as the host is concerned.
> In the first method, the Link Training and Status State Machine (LTSSM)
> of the Downstream Port is expected to be in the Hot Reset state as long
> as the SBR bit is asserted. In the second method, the LTSSM of the
> Downstream Port is expected to be in the Disabled state as long as the
> Link Disable bit is asserted.
> 
> The above difference is of importance because the specification
> requires the LTSSM to exit from the Hot Reset state to the Detect state
> within a 2ms timeout (r6.0.1, sec 4.2.7.11). NVIDIA Spectrum devices
> cannot guarantee it and a host enforcing such a behavior might fail to
> communicate with the device after issuing a Secondary Bus Reset.

How does that failure manifest itself exactly?  Is the problem that
the Vendor ID register in config space is read too early and the
device doesn't like that?

It is possible to increase the d3cold_delay in struct pci_dev to
lengthen the delay until the Vendor ID is read.  Have you considered
that instead of using the Link Disable method?

The following commit queued for v6.7 introduces a quirk for a 1 second
d3cold_delay, perhaps you can take advantage of it?

https://git.kernel.org/pci/pci/c/c9260693aa0c

Thanks,

Lukas


* Re: [RFC PATCH net-next 04/12] PCI: Add no PM reset quirk for NVIDIA Spectrum devices
  2023-10-17  7:42 ` [RFC PATCH net-next 04/12] PCI: Add no PM reset quirk for NVIDIA Spectrum devices Ido Schimmel
@ 2023-10-18 19:40   ` Bjorn Helgaas
  2023-10-22  8:23     ` Ido Schimmel
  0 siblings, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2023-10-18 19:40 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, linux-pci, davem, kuba, pabeni, edumazet, bhelgaas,
	alex.williamson, lukas, petrm, jiri, mlxsw

On Tue, Oct 17, 2023 at 10:42:49AM +0300, Ido Schimmel wrote:
> Spectrum-{1,2,3,4} devices report that a D3hot->D0 transition causes a
> reset (i.e., they advertise NoSoftRst-). However, this transition seems
> to have no effect on the device: It continues to be operational and
> network ports remain up. Advertising this support makes it seem as if a
> PM reset is viable for these devices. Mark it as unavailable to skip it
> when testing reset methods.
> 
> Before:
> 
>  # cat /sys/bus/pci/devices/0000\:03\:00.0/reset_method
>  pm bus
> 
> After:
> 
>  # cat /sys/bus/pci/devices/0000\:03\:00.0/reset_method
>  bus
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Hopefully since these are NVIDIA parts and you work at NVIDIA, this is
stronger than "this transition *seems* to have no effect" :)

The spec actually says NoSoftRst- means internal state is "undefined"
after a D3hot->D0 transition, so preserving it would not be a defect
per spec.  The kernel assumption that NoSoftRst- means the device will
do a reset is perhaps a little too aggressive.
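
A minimal userspace sketch of the decision being discussed, assuming the No_Soft_Reset bit value from include/uapi/linux/pci_regs.h. The function name and the boolean quirk parameter are hypothetical; the real check lives in the PCI core and is only approximated here.

```c
#include <stdbool.h>
#include <stdint.h>

/* No_Soft_Reset bit in the PM Control/Status register (pci_regs.h value). */
#define PCI_PM_CTRL_NO_SOFT_RESET	0x0008

/*
 * Hypothetical helper: is a PM (D3hot->D0) reset considered usable?
 * pmcsr is the raw PM Control/Status value; no_pm_reset_quirk models a
 * device flagged by a quirk such as quirk_no_pm_reset().
 */
static bool model_pm_reset_candidate(uint16_t pmcsr, bool no_pm_reset_quirk)
{
	if (no_pm_reset_quirk)
		return false;

	/* NoSoftRst- (bit clear) is treated as "the transition resets". */
	return !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
}
```

With the bit clear and no quirk the helper returns true, which corresponds to "pm" appearing in reset_method; with the quirk applied it returns false regardless of what the device advertises.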

> ---
>  drivers/pci/quirks.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index eeec1d6f9023..23f6bd2184e2 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3784,6 +3784,19 @@ static void quirk_no_pm_reset(struct pci_dev *dev)
>  DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
>  			       PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
>  
> +/*
> + * Spectrum-{1,2,3,4} devices report that a D3hot->D0 transition causes a reset
> + * (i.e., they advertise NoSoftRst-). However, this transition seems to have no
> + * effect on the device: It continues to be operational and network ports
> + * remain up. Advertising this support makes it seem as if a PM reset is viable
> + * for these devices. Mark it as unavailable to skip it when testing reset
> + * methods.
> + */
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MELLANOX, 0xcb84, quirk_no_pm_reset);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MELLANOX, 0xcf6c, quirk_no_pm_reset);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MELLANOX, 0xcf70, quirk_no_pm_reset);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MELLANOX, 0xcf80, quirk_no_pm_reset);
> +
>  /*
>   * Thunderbolt controllers with broken MSI hotplug signaling:
>   * Entire 1st generation (Light Ridge, Eagle Ridge, Light Peak) and part
> -- 
> 2.40.1
> 


* Re: [RFC PATCH net-next 06/12] PCI: Add debug print for device ready delay
  2023-10-17  7:42 ` [RFC PATCH net-next 06/12] PCI: Add debug print for device ready delay Ido Schimmel
@ 2023-10-18 19:41   ` Bjorn Helgaas
  0 siblings, 0 replies; 24+ messages in thread
From: Bjorn Helgaas @ 2023-10-18 19:41 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, linux-pci, davem, kuba, pabeni, edumazet, bhelgaas,
	alex.williamson, lukas, petrm, jiri, mlxsw

On Tue, Oct 17, 2023 at 10:42:51AM +0300, Ido Schimmel wrote:
> Currently, the time it took a PCI device to become ready after reset is
> only printed if it was longer than 1000ms ('PCI_RESET_WAIT'). However,
> for debugging purposes it is useful to know this time even if it was
> shorter. For example, with the device I am working on, hardware
> engineers asked to verify that it becomes ready on the first try (no
> delay).
> 
> To that end, add a debug level print that can be enabled using dynamic
> debug. Example:
> 
>  # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
>  # dmesg -c | grep ready
>  # echo "file drivers/pci/pci.c +p" > /sys/kernel/debug/dynamic_debug/control
>  # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
>  # dmesg -c | grep ready
>  [  396.060335] mlxsw_spectrum4 0000:01:00.0: ready 0ms after link toggle
>  # echo "file drivers/pci/pci.c -p" > /sys/kernel/debug/dynamic_debug/control
>  # echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
>  # dmesg -c | grep ready
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/pci.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 59c01d68c6d5..0a708e65c5c4 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1216,6 +1216,9 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>  	if (delay > PCI_RESET_WAIT)
>  		pci_info(dev, "ready %dms after %s\n", delay - 1,
>  			 reset_type);
> +	else
> +		pci_dbg(dev, "ready %dms after %s\n", delay - 1,
> +			reset_type);
>  
>  	return 0;
>  }
> -- 
> 2.40.1
> 
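
Why the example above prints "ready 0ms" can be seen in a small userspace model of the reporting logic. The hunk only shows that delay - 1 is printed and compared against PCI_RESET_WAIT; the starting value of 1 and the doubling after each failed poll are assumptions about the rest of pci_dev_wait(), and the model_* names are hypothetical.

```c
#include <stdbool.h>

#define PCI_RESET_WAIT	1000	/* ms, as quoted in the commit message */

/*
 * Value of 'delay' after the device failed to respond failed_polls
 * times. Assumes the delay starts at 1 ms and doubles after each sleep.
 */
static int model_delay_after(int failed_polls)
{
	int delay = 1;

	for (int i = 0; i < failed_polls; i++)
		delay *= 2;
	return delay;
}

/* The number printed in "ready %dms after %s". */
static int model_reported_ready_ms(int failed_polls)
{
	return model_delay_after(failed_polls) - 1;
}

/* true: pci_info() branch; false: the new pci_dbg() branch. */
static bool model_uses_info_level(int failed_polls)
{
	return model_delay_after(failed_polls) > PCI_RESET_WAIT;
}
```

Under these assumptions, a device that answers the first read has zero failed polls, so the reported time is 0 ms and the message goes through the debug-level branch.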


* Re: [RFC PATCH net-next 05/12] PCI: Add device-specific reset for NVIDIA Spectrum devices
  2023-10-17  7:42 ` [RFC PATCH net-next 05/12] PCI: Add device-specific reset " Ido Schimmel
  2023-10-17 10:00   ` Lukas Wunner
@ 2023-10-18 20:08   ` Bjorn Helgaas
  2023-10-25 11:05     ` Ido Schimmel
  1 sibling, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2023-10-18 20:08 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, linux-pci, davem, kuba, pabeni, edumazet, bhelgaas,
	alex.williamson, lukas, petrm, jiri, mlxsw

On Tue, Oct 17, 2023 at 10:42:50AM +0300, Ido Schimmel wrote:
> The PCIe specification defines two methods to trigger a hot reset across
> a link: Bus reset and link disablement (r6.0.1, sec 7.1, sec 6.6.1). In
> the first method, the Secondary Bus Reset (SBR) bit in the Bridge
> Control Register of the Downstream Port is asserted for at least 1ms
> (r6.0.1, sec 7.5.1.3.13). In the second method, the Link Disable bit in
> the Link Control Register of the Downstream Port is asserted and then
> cleared to disable and enable the link (r6.0.1, sec 7.5.3.7).
> 
> While the two methods are identical from the perspective of the
> Downstream device, they are different as far as the host is concerned.
> In the first method, the Link Training and Status State Machine (LTSSM)
> of the Downstream Port is expected to be in the Hot Reset state as long
> as the SBR bit is asserted. In the second method, the LTSSM of the
> Downstream Port is expected to be in the Disabled state as long as the
> Link Disable bit is asserted.
> 
> The above difference is of importance because the specification
> requires the LTSSM to exit from the Hot Reset state to the Detect state
> within a 2ms timeout (r6.0.1, sec 4.2.7.11).

I don't read 4.2.7.11 quite that way.  Here's the text (from r6.0):

  • Lanes that were directed by a higher Layer to initiate Hot
    Reset:

    ◦ All Lanes in the configured Link transmit TS1 Ordered Sets
      with the Hot Reset bit asserted and the configured Link and
      Lane numbers.

    ◦ If two consecutive TS1 Ordered Sets are received on any
      Lane with the Hot Reset bit asserted and configured Link
      and Lane numbers, then:

      ▪ LinkUp = 0b (False)

      ▪ If no higher Layer is directing the Physical Layer to
        remain in Hot Reset, the next state is Detect

      ▪ Otherwise, all Lanes in the configured Link continue to
        transmit TS1 Ordered Sets with the Hot Reset bit asserted
        and the configured Link and Lane numbers.

    ◦ Otherwise, after a 2 ms timeout next state is Detect.

I assume that SBR being set constitutes a "higher Layer directing the
Physical Layer to remain in Hot Reset," so I would read this as saying
the LTSSM stays in Hot Reset as long as SBR is set.  Then, *after* a
2 ms timeout (not *within* 2 ms), the next state is Detect.
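
The reading above can be written down as a tiny transition function. This is a sketch of the quoted bullets only, for lanes directed by a higher layer to initiate Hot Reset; the enum and function names are hypothetical and the rest of the LTSSM is not modeled.

```c
#include <stdbool.h>

enum model_ltssm_state {
	MODEL_LTSSM_HOT_RESET,
	MODEL_LTSSM_DETECT,
};

/*
 * Next state while in Hot Reset.
 * two_ts1_received:   two consecutive TS1s with the Hot Reset bit seen
 * higher_layer_holds: e.g. SBR still set in the Bridge Control register
 * ms_in_state:        time spent in Hot Reset so far
 */
static enum model_ltssm_state
model_hot_reset_next(bool two_ts1_received, bool higher_layer_holds,
		     unsigned int ms_in_state)
{
	if (two_ts1_received)
		return higher_layer_holds ? MODEL_LTSSM_HOT_RESET
					  : MODEL_LTSSM_DETECT;

	/* "Otherwise, after a 2 ms timeout next state is Detect." */
	return ms_in_state >= 2 ? MODEL_LTSSM_DETECT : MODEL_LTSSM_HOT_RESET;
}
```

In this model the port stays in Hot Reset for as long as the higher layer keeps directing it, however long that is, once the TS1 handshake has completed.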

> NVIDIA Spectrum devices cannot guarantee it and a host enforcing
> such a behavior might fail to communicate with the device after
> issuing a Secondary Bus Reset.

I don't quite follow this.  What behavior is the host enforcing here?
I guess you're doing an SBR, and the Spectrum device doesn't respond
as expected afterwards?

It looks like pci_reset_secondary_bus() asserts SBR for at least
2 ms.  Then pci_bridge_wait_for_secondary_bus() should wait before
accessing the device, but maybe we don't wait long enough?

I guess this ends up back at d3cold_delay as suggested by Lukas.
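
For comparison, the two sequences under discussion can be modeled in userspace as plain bit operations on two registers. The bit values are the ones in include/uapi/linux/pci_regs.h; the struct, the function names and the held_ms field are hypothetical, and the hold times are the 2 ms mentioned above and the 500 ms from the patch below.

```c
#include <stdint.h>

#define PCI_BRIDGE_CTL_BUS_RESET	0x40	/* Secondary Bus Reset */
#define PCI_EXP_LNKCTL_LD		0x0010	/* Link Disable */

/* Hypothetical model of the Downstream Port's relevant registers. */
struct model_port {
	uint16_t bridge_ctl;	/* Bridge Control register */
	uint16_t lnkctl;	/* Link Control register */
	unsigned int held_ms;	/* how long the bit was held asserted */
};

/* Models the bus reset method: assert SBR, hold, deassert. */
static void model_secondary_bus_reset(struct model_port *p)
{
	p->bridge_ctl |= PCI_BRIDGE_CTL_BUS_RESET;
	p->held_ms = 2;		/* stands in for msleep(2) */
	p->bridge_ctl &= (uint16_t)~PCI_BRIDGE_CTL_BUS_RESET;
}

/* Models the link disablement method from the proposed quirk. */
static void model_link_toggle(struct model_port *p)
{
	p->lnkctl |= PCI_EXP_LNKCTL_LD;
	p->held_ms = 500;	/* stands in for msleep(500) */
	p->lnkctl &= (uint16_t)~PCI_EXP_LNKCTL_LD;
}
```

Both helpers leave the other register bits untouched; the only differences captured here are which register is written and how long the bit stays set.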

> With the link disablement method, the host can leave the link
> disabled for enough time to allow the device to undergo a hot reset
> and reach the Detect state. After enabling the link, the host will
> exit from the Disabled state to Detect state (r6.0.1, sec 4.2.7.9)
> and observe that the device is already in the Detect state.
> 
> The PCI core only implements the first method, which might not work with
> NVIDIA Spectrum devices on certain hosts, as explained above. Therefore,
> implement the link disablement method as a device-specific method for
> NVIDIA Spectrum devices. Specifically, disable the link, wait for 500ms,
> enable the link and then wait for the device to become accessible.
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  drivers/pci/quirks.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 23f6bd2184e2..a6e308bb934c 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4182,6 +4182,31 @@ static int reset_hinic_vf_dev(struct pci_dev *pdev, bool probe)
>  	return 0;
>  }
>  
> +#define PCI_DEVICE_ID_MELLANOX_SPECTRUM		0xcb84
> +#define PCI_DEVICE_ID_MELLANOX_SPECTRUM2	0xcf6c
> +#define PCI_DEVICE_ID_MELLANOX_SPECTRUM3	0xcf70
> +#define PCI_DEVICE_ID_MELLANOX_SPECTRUM4	0xcf80
> +
> +static int reset_mlx(struct pci_dev *pdev, bool probe)
> +{
> +	struct pci_dev *bridge = pdev->bus->self;
> +
> +	if (probe)
> +		return 0;
> +
> +	/*
> +	 * Disable the link on the Downstream port in order to trigger a hot
> +	 * reset in the Downstream device. Wait for 500ms before enabling the
> +	 * link so that the firmware on the device will have enough time to
> +	 * transition the Upstream port to the Detect state.
> +	 */
> +	pcie_capability_set_word(bridge, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LD);
> +	msleep(500);
> +	pcie_capability_clear_word(bridge, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_LD);
> +
> +	return pci_bridge_wait_for_secondary_bus(bridge, "link toggle");
> +}
> +
>  static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
>  	{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
>  		 reset_intel_82599_sfp_virtfn },
> @@ -4197,6 +4222,10 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
>  		reset_chelsio_generic_dev },
>  	{ PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HINIC_VF,
>  		reset_hinic_vf_dev },
> +	{ PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_SPECTRUM, reset_mlx },
> +	{ PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_SPECTRUM2, reset_mlx },
> +	{ PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_SPECTRUM3, reset_mlx },
> +	{ PCI_VENDOR_ID_MELLANOX, PCI_DEVICE_ID_MELLANOX_SPECTRUM4, reset_mlx },
>  	{ 0 }
>  };
>  
> -- 
> 2.40.1
> 


* Re: [RFC PATCH net-next 01/12] netdevsim: Block until all devices are released
  2023-10-17  7:42 ` [RFC PATCH net-next 01/12] netdevsim: Block until all devices are released Ido Schimmel
@ 2023-10-19  0:53   ` Jakub Kicinski
  0 siblings, 0 replies; 24+ messages in thread
From: Jakub Kicinski @ 2023-10-19  0:53 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, linux-pci, davem, pabeni, edumazet, bhelgaas,
	alex.williamson, lukas, petrm, jiri, mlxsw

On Tue, 17 Oct 2023 10:42:46 +0300 Ido Schimmel wrote:
> Like other buses, devices on the netdevsim bus have a release callback
> that is invoked when the reference count of the device drops to zero.
> However, unlike other buses such as PCI, the release callback is not
> necessarily built into the kernel, as netdevsim can be built as a
> module.
> 
> The above is problematic as nothing prevents the module from being
> unloaded before the release callback has been invoked, which can happen
> asynchronously. One such example is going to be added in subsequent
> patches where devlink will call put_device() from an RCU callback.
> 
> The issue is not theoretical and the reproducer in [1] can reliably
> crash the kernel. The conclusion of this discussion was that the issue
> should be solved in netdevsim, which is what this patch is trying to do.
> 
> Add a reference count that is increased when a device is added to the
> bus and decreased when a device is released. Signal a completion when
> the reference count drops to zero and wait for the completion when
> unloading the module so that the module will not be unloaded before all
> the devices were released. The reference count is initialized to one so
> that completion is only signaled when unloading the module.
> 
> With this patch, the reproducer in [1] no longer crashes the kernel.
> 
> [1] https://lore.kernel.org/netdev/20230619125015.1541143-2-idosch@nvidia.com/
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
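
The counting scheme described in the commit message can be sketched as a single-threaded userspace model. The model_* names are hypothetical, and a boolean stands in for the completion; the real patch would use the kernel's refcount and completion primitives.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the bus-wide count and its completion. */
struct model_bus {
	int refs;
	bool done;	/* stands in for a signaled completion */
};

static void model_bus_init(struct model_bus *b)
{
	b->refs = 1;	/* initial reference, dropped only at module unload */
	b->done = false;
}

static void model_bus_put(struct model_bus *b)
{
	assert(b->refs > 0);
	if (--b->refs == 0)
		b->done = true;
}

/* A device is added to the bus. */
static void model_dev_add(struct model_bus *b)
{
	b->refs++;
}

/* The device's release callback runs (possibly much later). */
static void model_dev_release(struct model_bus *b)
{
	model_bus_put(b);
}

/* Module unload drops the initial reference, then waits for 'done'. */
static void model_bus_exit_begin(struct model_bus *b)
{
	model_bus_put(b);
}
```

Because the count starts at one, releasing every device before unload never signals completion on its own; only the unload path can bring the count to zero, and it does so immediately if no release callbacks are still outstanding.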


* Re: [RFC PATCH net-next 04/12] PCI: Add no PM reset quirk for NVIDIA Spectrum devices
  2023-10-18 19:40   ` Bjorn Helgaas
@ 2023-10-22  8:23     ` Ido Schimmel
  0 siblings, 0 replies; 24+ messages in thread
From: Ido Schimmel @ 2023-10-22  8:23 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: netdev, linux-pci, davem, kuba, pabeni, edumazet, bhelgaas,
	alex.williamson, lukas, petrm, jiri, mlxsw

On Wed, Oct 18, 2023 at 02:40:41PM -0500, Bjorn Helgaas wrote:
> On Tue, Oct 17, 2023 at 10:42:49AM +0300, Ido Schimmel wrote:
> > Spectrum-{1,2,3,4} devices report that a D3hot->D0 transition causes a
> > reset (i.e., they advertise NoSoftRst-). However, this transition seems
> > to have no effect on the device: It continues to be operational and
> > network ports remain up. Advertising this support makes it seem as if a
> > PM reset is viable for these devices. Mark it as unavailable to skip it
> > when testing reset methods.
> > 
> > Before:
> > 
> >  # cat /sys/bus/pci/devices/0000\:03\:00.0/reset_method
> >  pm bus
> > 
> > After:
> > 
> >  # cat /sys/bus/pci/devices/0000\:03\:00.0/reset_method
> >  bus
> > 
> > Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> Hopefully since these are NVIDIA parts and you work at NVIDIA, this is
> stronger than "this transition *seems* to have no effect" :)

Yes. Reworded to "this transition does not have any effect on the
device" and kept your tag.

FYI, new devices will not advertise support for PM reset so I don't
expect this list to grow.


* Re: [RFC PATCH net-next 05/12] PCI: Add device-specific reset for NVIDIA Spectrum devices
  2023-10-18 20:08   ` Bjorn Helgaas
@ 2023-10-25 11:05     ` Ido Schimmel
  0 siblings, 0 replies; 24+ messages in thread
From: Ido Schimmel @ 2023-10-25 11:05 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: netdev, linux-pci, davem, kuba, pabeni, edumazet, bhelgaas,
	alex.williamson, lukas, petrm, jiri, mlxsw

On Wed, Oct 18, 2023 at 03:08:26PM -0500, Bjorn Helgaas wrote:
> On Tue, Oct 17, 2023 at 10:42:50AM +0300, Ido Schimmel wrote:
> > The PCIe specification defines two methods to trigger a hot reset across
> > a link: Bus reset and link disablement (r6.0.1, sec 7.1, sec 6.6.1). In
> > the first method, the Secondary Bus Reset (SBR) bit in the Bridge
> > Control Register of the Downstream Port is asserted for at least 1ms
> > (r6.0.1, sec 7.5.1.3.13). In the second method, the Link Disable bit in
> > the Link Control Register of the Downstream Port is asserted and then
> > cleared to disable and enable the link (r6.0.1, sec 7.5.3.7).
> > 
> > While the two methods are identical from the perspective of the
> > Downstream device, they are different as far as the host is concerned.
> > In the first method, the Link Training and Status State Machine (LTSSM)
> > of the Downstream Port is expected to be in the Hot Reset state as long
> > as the SBR bit is asserted. In the second method, the LTSSM of the
> > Downstream Port is expected to be in the Disabled state as long as the
> > Link Disable bit is asserted.
> > 
> > The above difference is of importance because the specification
> > requires the LTSSM to exit from the Hot Reset state to the Detect state
> > within a 2ms timeout (r6.0.1, sec 4.2.7.11).
> 
> I don't read 4.2.7.11 quite that way.  Here's the text (from r6.0):
> 
>   • Lanes that were directed by a higher Layer to initiate Hot
>     Reset:
> 
>     ◦ All Lanes in the configured Link transmit TS1 Ordered Sets
>       with the Hot Reset bit asserted and the configured Link and
>       Lane numbers.
> 
>     ◦ If two consecutive TS1 Ordered Sets are received on any
>       Lane with the Hot Reset bit asserted and configured Link
>       and Lane numbers, then:
> 
>       ▪ LinkUp = 0b (False)
> 
>       ▪ If no higher Layer is directing the Physical Layer to
>         remain in Hot Reset, the next state is Detect
> 
>       ▪ Otherwise, all Lanes in the configured Link continue to
>         transmit TS1 Ordered Sets with the Hot Reset bit asserted
>         and the configured Link and Lane numbers.
> 
>     ◦ Otherwise, after a 2 ms timeout next state is Detect.
> 
> I assume that SBR being set constitutes a "higher Layer directing the
> Physical Layer to remain in Hot Reset," so I would read this as saying
> the LTSSM stays in Hot Reset as long as SBR is set.  Then, *after* a
> 2 ms timeout (not *within* 2 ms), the next state is Detect.
> 
> > NVIDIA Spectrum devices cannot guarantee it and a host enforcing
> > such a behavior might fail to communicate with the device after
> > issuing a Secondary Bus Reset.
> 
> I don't quite follow this.  What behavior is the host enforcing here?
> I guess you're doing an SBR, and the Spectrum device doesn't respond
> as expected afterwards?
> 
> It looks like pci_reset_secondary_bus() asserts SBR for at least
> 2 ms.  Then pci_bridge_wait_for_secondary_bus() should wait before
> accessing the device, but maybe we don't wait long enough?
> 
> I guess this ends up back at d3cold_delay as suggested by Lukas.

I had a meeting with the PCI team before submitting this patch where I
stated that bus reset works fine (tested over 500 iterations) on the
hosts I have access to. They said that bus reset and link toggling are
identical from the perspective of the downstream device, but that in the
past they saw hosts that fail bus reset because of the time it takes the
downstream device to reach the Detect state. This was with a different
line of products that share the same PCI IP as Spectrum.

Given that I'm unable to reproduce this problem with Spectrum and that
your preference seems to be to reuse bus reset (or bus reset plus the
d3cold_delay quirk), I'll drop this patch for now. We can revisit this
patch in the future, if the problem manifests itself.

Regarding the other two PCI patches, I plan to submit this series after
net-next opens for v6.8. Are you OK with them being merged via net-next?

Thanks


end of thread, other threads:[~2023-10-25 11:05 UTC | newest]

Thread overview: 24+ messages
2023-10-17  7:42 [RFC PATCH net-next 00/12] mlxsw: Add support for new reset flow Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 01/12] netdevsim: Block until all devices are released Ido Schimmel
2023-10-19  0:53   ` Jakub Kicinski
2023-10-17  7:42 ` [RFC PATCH net-next 02/12] devlink: Hold a reference on parent device Ido Schimmel
2023-10-17  7:56   ` Jiri Pirko
2023-10-17  8:11     ` Ido Schimmel
2023-10-17  9:01       ` Jiri Pirko
2023-10-17  7:42 ` [RFC PATCH net-next 03/12] devlink: Acquire device lock during reload Ido Schimmel
2023-10-17  8:04   ` Jiri Pirko
2023-10-17  7:42 ` [RFC PATCH net-next 04/12] PCI: Add no PM reset quirk for NVIDIA Spectrum devices Ido Schimmel
2023-10-18 19:40   ` Bjorn Helgaas
2023-10-22  8:23     ` Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 05/12] PCI: Add device-specific reset " Ido Schimmel
2023-10-17 10:00   ` Lukas Wunner
2023-10-18 20:08   ` Bjorn Helgaas
2023-10-25 11:05     ` Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 06/12] PCI: Add debug print for device ready delay Ido Schimmel
2023-10-18 19:41   ` Bjorn Helgaas
2023-10-17  7:42 ` [RFC PATCH net-next 07/12] mlxsw: Extend MRSR pack() function to support new commands Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 08/12] mlxsw: pci: Rename mlxsw_pci_sw_reset() Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 09/12] mlxsw: pci: Move software reset code to a separate function Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 10/12] mlxsw: pci: Add support for new reset flow Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 11/12] mlxsw: pci: Implement PCI reset handlers Ido Schimmel
2023-10-17  7:42 ` [RFC PATCH net-next 12/12] selftests: mlxsw: Add PCI reset test Ido Schimmel
