linux-crypto.vger.kernel.org archive mirror
* [PATCH v2 0/3] crypto: hisilicon - supports device isolation feature
@ 2022-06-14 12:29 Kai Ye
  2022-06-14 12:29 ` [PATCH v2 1/3] uacce: " Kai Ye
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Kai Ye @ 2022-06-14 12:29 UTC
  To: gregkh, herbert
  Cc: linux-crypto, linux-accelerators, linux-kernel, linuxarm,
	zhangfei.gao, wangzhou1, yekai13

Add the hardware error isolation feature for ACC. Define a driver sysfs
node that is used to configure the hardware error frequency threshold. When
the error frequency exceeds that threshold, the device is isolated. The
isolation strategy can be defined in each driver module. For example, for
ACC, if the AER error frequency exceeds the configured value within a
certain period of time, the device is no longer available to user space. The
VF device uses the PF device's isolation strategy. The isolation strategy
should not be set while the device is in use.

Changes v1->v2:
	1. Deleted the dev_to_uacce API.
	2. Added documentation for the sysfs node.
	3. Moved uacce->ref to the driver.

Kai Ye (3):
  uacce: supports device isolation feature
  Documentation: add a isolation strategy vfs node for uacce
  crypto: hisilicon/qm - defining the device isolation strategy

 Documentation/ABI/testing/sysfs-driver-uacce |  17 ++
 drivers/crypto/hisilicon/qm.c                | 157 +++++++++++++++++--
 drivers/misc/uacce/uacce.c                   |  37 +++++
 include/linux/hisi_acc_qm.h                  |   9 ++
 include/linux/uacce.h                        |  16 +-
 5 files changed, 219 insertions(+), 17 deletions(-)

-- 
2.33.0
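
The split described in the cover letter (the framework only reports the
isolation state, each driver defines its own strategy) can be sketched in
user-space C as an ops table. This is an illustration under assumptions, not
the uacce code: accel_dev, accel_ops, core_show_isolate, core_store_strategy
and the demo_* functions are hypothetical names, and the 0..65535 range only
mirrors MAX_ISOLATE_STRATEGY from patch 3.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

enum dev_state { DEV_NORMAL = 0, DEV_ISOLATE = 1 };

struct accel_dev;

/* Callbacks a driver may provide; either may be left NULL. */
struct accel_ops {
	enum dev_state (*get_isolate_state)(struct accel_dev *dev);
	int (*isolate_strategy_write)(struct accel_dev *dev, const char *buf);
};

struct accel_dev {
	const struct accel_ops *ops;
	unsigned long threshold;	/* would be driver-private data */
	int isolated;
};

/* Framework side: report the state, or -ENODEV without a callback. */
int core_show_isolate(struct accel_dev *dev)
{
	assert(dev != NULL && dev->ops != NULL);
	if (!dev->ops->get_isolate_state)
		return -ENODEV;
	return dev->ops->get_isolate_state(dev);
}

/* Framework side: hand the raw string to the driver for parsing. */
int core_store_strategy(struct accel_dev *dev, const char *buf)
{
	assert(dev != NULL && dev->ops != NULL);
	if (!dev->ops->isolate_strategy_write)
		return -ENODEV;
	return dev->ops->isolate_strategy_write(dev, buf);
}

/* Driver side (hypothetical): report the flag kept in the device. */
static enum dev_state demo_get_state(struct accel_dev *dev)
{
	return dev->isolated ? DEV_ISOLATE : DEV_NORMAL;
}

/* Driver side (hypothetical): interpret the string as a count, 0..65535. */
static int demo_strategy_write(struct accel_dev *dev, const char *buf)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(buf, &end, 0);
	if (errno || end == buf)
		return -EINVAL;
	if (*end == '\n')	/* tolerate one trailing newline */
		end++;
	if (*end != '\0' || val > 65535)
		return -EINVAL;
	dev->threshold = val;
	return 0;
}

const struct accel_ops demo_ops = {
	.get_isolate_state = demo_get_state,
	.isolate_strategy_write = demo_strategy_write,
};
```

Checking each callback for NULL before the call lets a driver that does not
implement isolation leave both members unset; whether the real framework
should return -ENODEV in that case is a design choice assumed here.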



* [PATCH v2 1/3] uacce: supports device isolation feature
  2022-06-14 12:29 [PATCH v2 0/3] crypto: hisilicon - supports device isolation feature Kai Ye
@ 2022-06-14 12:29 ` Kai Ye
  2022-06-14 12:42   ` Greg KH
  2022-06-15  8:52   ` Jonathan Cameron
  2022-06-14 12:29 ` [PATCH v2 2/3] Documentation: add a isolation strategy vfs node for uacce Kai Ye
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 25+ messages in thread
From: Kai Ye @ 2022-06-14 12:29 UTC
  To: gregkh, herbert
  Cc: linux-crypto, linux-accelerators, linux-kernel, linuxarm,
	zhangfei.gao, wangzhou1, yekai13

UACCE adds the hardware error isolation API. Users can configure
the error frequency threshold through this sysfs node. The interface
accepts a user-defined strategy string, which is then parsed inside
the device driver. UACCE only reports the device isolation state.
When the error frequency threshold is exceeded, the device is
isolated. The isolation strategy should be defined in each driver
module.

Signed-off-by: Kai Ye <yekai13@huawei.com>
Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com>
---
 drivers/misc/uacce/uacce.c | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/uacce.h      | 16 +++++++++++++---
 2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index b6219c6bfb48..525623215132 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uacce/uacce.c
@@ -346,12 +346,47 @@ static ssize_t region_dus_size_show(struct device *dev,
 		       uacce->qf_pg_num[UACCE_QFRT_DUS] << PAGE_SHIFT);
 }
 
+static ssize_t isolate_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct uacce_device *uacce = to_uacce_device(dev);
+
+	return sysfs_emit(buf, "%d\n", uacce->ops->get_isolate_state(uacce));
+}
+
+static ssize_t isolate_strategy_show(struct device *dev,
+				     struct device_attribute *attr, char *buf)
+{
+	struct uacce_device *uacce = to_uacce_device(dev);
+
+	return sysfs_emit(buf, "%s\n", uacce->isolate_strategy);
+}
+
+static ssize_t isolate_strategy_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t count)
+{
+	struct uacce_device *uacce = to_uacce_device(dev);
+	int ret;
+
+	if (!buf || count >= UACCE_MAX_ISOLATE_STRATEGY_LEN)
+		return -EINVAL;
+
+	strscpy(uacce->isolate_strategy, buf, sizeof(uacce->isolate_strategy));
+
+	ret = uacce->ops->isolate_strategy_write(uacce, buf);
+
+	return ret ? ret : count;
+}
+
 static DEVICE_ATTR_RO(api);
 static DEVICE_ATTR_RO(flags);
 static DEVICE_ATTR_RO(available_instances);
 static DEVICE_ATTR_RO(algorithms);
 static DEVICE_ATTR_RO(region_mmio_size);
 static DEVICE_ATTR_RO(region_dus_size);
+static DEVICE_ATTR_RO(isolate);
+static DEVICE_ATTR_RW(isolate_strategy);
 
 static struct attribute *uacce_dev_attrs[] = {
 	&dev_attr_api.attr,
@@ -360,6 +395,8 @@ static struct attribute *uacce_dev_attrs[] = {
 	&dev_attr_algorithms.attr,
 	&dev_attr_region_mmio_size.attr,
 	&dev_attr_region_dus_size.attr,
+	&dev_attr_isolate.attr,
+	&dev_attr_isolate_strategy.attr,
 	NULL,
 };
 
diff --git a/include/linux/uacce.h b/include/linux/uacce.h
index 48e319f40275..0f7668bfa645 100644
--- a/include/linux/uacce.h
+++ b/include/linux/uacce.h
@@ -8,6 +8,7 @@
 #define UACCE_NAME		"uacce"
 #define UACCE_MAX_REGION	2
 #define UACCE_MAX_NAME_SIZE	64
+#define UACCE_MAX_ISOLATE_STRATEGY_LEN	256
 
 struct uacce_queue;
 struct uacce_device;
@@ -30,6 +31,8 @@ struct uacce_qfile_region {
  * @is_q_updated: check whether the task is finished
  * @mmap: mmap addresses of queue to user space
  * @ioctl: ioctl for user space users of the queue
+ * @get_isolate_state: get the device state after the isolate strategy is set
+ * @isolate_strategy_write: store the isolate strategy to the device
  */
 struct uacce_ops {
 	int (*get_available_instances)(struct uacce_device *uacce);
@@ -43,6 +46,8 @@ struct uacce_ops {
 		    struct uacce_qfile_region *qfr);
 	long (*ioctl)(struct uacce_queue *q, unsigned int cmd,
 		      unsigned long arg);
+	enum uacce_dev_state (*get_isolate_state)(struct uacce_device *uacce);
+	int (*isolate_strategy_write)(struct uacce_device *uacce, const char *buf);
 };
 
 /**
@@ -57,6 +62,12 @@ struct uacce_interface {
 	const struct uacce_ops *ops;
 };
 
+enum uacce_dev_state {
+	UACCE_DEV_ERR = -1,
+	UACCE_DEV_NORMAL,
+	UACCE_DEV_ISOLATE,
+};
+
 enum uacce_q_state {
 	UACCE_Q_ZOMBIE = 0,
 	UACCE_Q_INIT,
@@ -117,6 +128,7 @@ struct uacce_device {
 	struct list_head queues;
 	struct mutex queues_lock;
 	struct inode *inode;
+	char isolate_strategy[UACCE_MAX_ISOLATE_STRATEGY_LEN];
 };
 
 #if IS_ENABLED(CONFIG_UACCE)
@@ -125,7 +137,7 @@ struct uacce_device *uacce_alloc(struct device *parent,
 				 struct uacce_interface *interface);
 int uacce_register(struct uacce_device *uacce);
 void uacce_remove(struct uacce_device *uacce);
-
+struct uacce_device *dev_to_uacce(struct device *dev);
 #else /* CONFIG_UACCE */
 
 static inline
@@ -140,8 +152,6 @@ static inline int uacce_register(struct uacce_device *uacce)
 	return -EINVAL;
 }
 
-static inline void uacce_remove(struct uacce_device *uacce) {}
-
 #endif /* CONFIG_UACCE */
 
 #endif /* _LINUX_UACCE_H */
-- 
2.33.0
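
A store handler has to bound the copy by the byte count it receives, because
sizeof applied to a pointer yields the pointer size rather than the length of
the data. The following user-space sketch shows one way to do the check and
keep the stored string NUL-terminated. store_strategy and STRATEGY_LEN are
hypothetical names used only for this illustration; STRATEGY_LEN is assumed to
match UACCE_MAX_ISOLATE_STRATEGY_LEN.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define STRATEGY_LEN 256	/* assumed to mirror UACCE_MAX_ISOLATE_STRATEGY_LEN */

/*
 * Copy `count` bytes of `buf` into `dst` (capacity STRATEGY_LEN) and
 * NUL-terminate the result. Returns the number of bytes consumed, or
 * -EINVAL when the input would not fit together with its terminator.
 */
long store_strategy(char *dst, const char *buf, size_t count)
{
	assert(dst != NULL);

	if (!buf || count >= STRATEGY_LEN)
		return -EINVAL;

	memcpy(dst, buf, count);
	dst[count] = '\0';
	return (long)count;
}
```

Rejecting count == STRATEGY_LEN reserves the last byte for the terminator, so
a later "%s" read of the buffer cannot run past its end.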



* [PATCH v2 2/3] Documentation: add a isolation strategy vfs node for uacce
  2022-06-14 12:29 [PATCH v2 0/3] crypto: hisilicon - supports device isolation feature Kai Ye
  2022-06-14 12:29 ` [PATCH v2 1/3] uacce: " Kai Ye
@ 2022-06-14 12:29 ` Kai Ye
  2022-06-14 12:41   ` Greg KH
  2022-06-14 12:29 ` [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy Kai Ye
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Kai Ye @ 2022-06-14 12:29 UTC
  To: gregkh, herbert
  Cc: linux-crypto, linux-accelerators, linux-kernel, linuxarm,
	zhangfei.gao, wangzhou1, yekai13

Update the documentation to describe the sysfs nodes that let users
in user space configure the hardware error frequency threshold.

Signed-off-by: Kai Ye <yekai13@huawei.com>
---
 Documentation/ABI/testing/sysfs-driver-uacce | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
index 08f2591138af..0c4226364182 100644
--- a/Documentation/ABI/testing/sysfs-driver-uacce
+++ b/Documentation/ABI/testing/sysfs-driver-uacce
@@ -19,6 +19,23 @@ Contact:        linux-accelerators@lists.ozlabs.org
 Description:    Available instances left of the device
                 Return -ENODEV if uacce_ops get_available_instances is not provided
 
+What:           /sys/class/uacce/<dev_name>/isolate_strategy
+Date:           Jun 2022
+KernelVersion:  5.19
+Contact:        linux-accelerators@lists.ozlabs.org
+Description:    A sysfs node that is used to configure the hardware
+                error frequency. This frequency is abstract, such as once an
+                hour or once a day. The specific isolation strategy can be
+                defined in each driver module.
+
+What:           /sys/class/uacce/<dev_name>/isolate
+Date:           Jun 2022
+KernelVersion:  5.19
+Contact:        linux-accelerators@lists.ozlabs.org
+Description:    A sysfs node that shows the device isolation state. The value
+                0 means that the device is working. The value 1 means that the
+                device has been isolated.
+
 What:           /sys/class/uacce/<dev_name>/algorithms
 Date:           Feb 2020
 KernelVersion:  5.7
-- 
2.33.0
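
From user space the two documented nodes behave like ordinary text files. A
minimal sketch of how a tool might set the threshold and poll the state is
given below. write_isolate_strategy and read_isolate_state are hypothetical
helper names, and writing a plain decimal number assumes a driver that parses
the strategy string that way, as the hisilicon/qm patch in this series does.

```c
#include <assert.h>
#include <stdio.h>

/* Write a decimal threshold to an attribute file; returns 0 on success. */
int write_isolate_strategy(const char *path, unsigned long threshold)
{
	FILE *f;
	int ok;

	assert(path != NULL);

	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%lu\n", threshold) > 0;
	/* Buffered data is written out on close, so a failed store shows up here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Read the state attribute: 0 = working, 1 = isolated, -1 = read error. */
int read_isolate_state(const char *path)
{
	FILE *f;
	int state;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &state) != 1)
		state = -1;
	fclose(f);
	return state;
}
```

With a real device the paths would be
/sys/class/uacce/<dev_name>/isolate_strategy and
/sys/class/uacce/<dev_name>/isolate.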



* [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy
  2022-06-14 12:29 [PATCH v2 0/3] crypto: hisilicon - supports device isolation feature Kai Ye
  2022-06-14 12:29 ` [PATCH v2 1/3] uacce: " Kai Ye
  2022-06-14 12:29 ` [PATCH v2 2/3] Documentation: add a isolation strategy vfs node for uacce Kai Ye
@ 2022-06-14 12:29 ` Kai Ye
  2022-06-14 12:43   ` Greg KH
                     ` (2 more replies)
  2022-06-14 12:29 ` [PATCH 1/3] uacce: supports device isolation feature Kai Ye
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 25+ messages in thread
From: Kai Ye @ 2022-06-14 12:29 UTC
  To: gregkh, herbert
  Cc: linux-crypto, linux-accelerators, linux-kernel, linuxarm,
	zhangfei.gao, wangzhou1, yekai13

Define the device isolation strategy in the device driver. If the
AER error frequency exceeds the configured value within a certain
period of time, the device is no longer available to user space. The VF
device uses the PF device's isolation strategy. All hardware errors
are processed by the PF driver.

Signed-off-by: Kai Ye <yekai13@huawei.com>
---
 drivers/crypto/hisilicon/qm.c | 157 +++++++++++++++++++++++++++++++---
 include/linux/hisi_acc_qm.h   |   9 ++
 2 files changed, 152 insertions(+), 14 deletions(-)

diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c
index ad83c194d664..47c41fa52693 100644
--- a/drivers/crypto/hisilicon/qm.c
+++ b/drivers/crypto/hisilicon/qm.c
@@ -12,7 +12,6 @@
 #include <linux/pm_runtime.h>
 #include <linux/seq_file.h>
 #include <linux/slab.h>
-#include <linux/uacce.h>
 #include <linux/uaccess.h>
 #include <uapi/misc/uacce/hisi_qm.h>
 #include <linux/hisi_acc_qm.h>
@@ -417,6 +416,16 @@ struct hisi_qm_resource {
 	struct list_head list;
 };
 
+/**
+ * struct qm_hw_err - describes a device hardware error
+ * @list: hardware error list
+ * @tick_stamp: timestamp when the error occurred
+ */
+struct qm_hw_err {
+	struct list_head list;
+	unsigned long long tick_stamp;
+};
+
 struct hisi_qm_hw_ops {
 	int (*get_vft)(struct hisi_qm *qm, u32 *base, u32 *number);
 	void (*qm_db)(struct hisi_qm *qm, u16 qn,
@@ -3278,6 +3287,7 @@ static int hisi_qm_uacce_get_queue(struct uacce_device *uacce,
 	qp->event_cb = qm_qp_event_notifier;
 	qp->pasid = arg;
 	qp->is_in_kernel = false;
+	atomic_inc(&qm->uacce_ref);
 
 	return 0;
 }
@@ -3285,7 +3295,9 @@ static int hisi_qm_uacce_get_queue(struct uacce_device *uacce,
 static void hisi_qm_uacce_put_queue(struct uacce_queue *q)
 {
 	struct hisi_qp *qp = q->priv;
+	struct hisi_qm *qm = qp->qm;
 
+	atomic_dec(&qm->uacce_ref);
 	hisi_qm_cache_wb(qp->qm);
 	hisi_qm_release_qp(qp);
 }
@@ -3410,6 +3422,102 @@ static long hisi_qm_uacce_ioctl(struct uacce_queue *q, unsigned int cmd,
 	return 0;
 }
 
+/**
+ * qm_hw_err_isolate() - Try to isolate the uacce device with its VFs
+ * @qm: The qm which we want to configure.
+ *
+ * Isolation follows the user's configured isolation strategy. Warning: this
+ * API should be called while there is no user on the device, or the users
+ * on this device are suspended by slot resetting preparation of PCI AER.
+ */
+static int qm_hw_err_isolate(struct hisi_qm *qm)
+{
+	struct qm_hw_err *err, *tmp, *hw_err;
+	struct qm_err_isolate *isolate;
+	u32 count = 0;
+
+	isolate = &qm->isolate_data;
+
+#define SECONDS_PER_HOUR	3600
+
+	/* All the hw errs are processed by PF driver */
+	if (qm->uacce->is_vf || atomic_read(&isolate->is_isolate) ||
+		!isolate->hw_err_isolate_hz)
+		return 0;
+
+	hw_err = kzalloc(sizeof(*hw_err), GFP_ATOMIC);
+	if (!hw_err)
+		return -ENOMEM;
+	hw_err->tick_stamp = jiffies;
+	list_for_each_entry_safe(err, tmp, &qm->uacce_hw_errs, list) {
+		if ((hw_err->tick_stamp - err->tick_stamp) / HZ >
+		    SECONDS_PER_HOUR) {
+			list_del(&err->list);
+			kfree(err);
+		} else {
+			count++;
+		}
+	}
+	list_add(&hw_err->list, &qm->uacce_hw_errs);
+
+	if (count >= isolate->hw_err_isolate_hz)
+		atomic_set(&isolate->is_isolate, 1);
+
+	return 0;
+}
+
+static void qm_hw_err_destroy(struct hisi_qm *qm)
+{
+	struct qm_hw_err *err, *tmp;
+
+	list_for_each_entry_safe(err, tmp, &qm->uacce_hw_errs, list) {
+		list_del(&err->list);
+		kfree(err);
+	}
+}
+
+static enum uacce_dev_state hisi_qm_get_isolate_state(struct uacce_device *uacce)
+{
+	struct hisi_qm *qm = uacce->priv;
+	struct hisi_qm *pf_qm;
+
+	if (uacce->is_vf) {
+		pf_qm = pci_get_drvdata(pci_physfn(qm->pdev));
+		qm->isolate_data.is_isolate = pf_qm->isolate_data.is_isolate;
+	}
+
+	return atomic_read(&qm->isolate_data.is_isolate) ?
+			UACCE_DEV_ISOLATE : UACCE_DEV_NORMAL;
+}
+
+static int hisi_qm_isolate_strategy_write(struct uacce_device *uacce,
+						const char *buf)
+{
+	struct hisi_qm *qm = uacce->priv;
+	unsigned long val = 0;
+
+#define MAX_ISOLATE_STRATEGY	65535
+
+	if (atomic_read(&qm->uacce_ref))
+		return -EBUSY;
+
+	/* must be set by PF */
+	if (atomic_read(&qm->isolate_data.is_isolate) || uacce->is_vf)
+		return -EINVAL;
+
+	if (kstrtoul(buf, 0, &val) < 0)
+		return -EINVAL;
+
+	if (val > MAX_ISOLATE_STRATEGY)
+		return -EINVAL;
+
+	qm->isolate_data.hw_err_isolate_hz = val;
+	dev_info(&qm->pdev->dev,
+		"the value of isolate_strategy is set to %lu.\n", val);
+
+	return 0;
+}
+
 static const struct uacce_ops uacce_qm_ops = {
 	.get_available_instances = hisi_qm_get_available_instances,
 	.get_queue = hisi_qm_uacce_get_queue,
@@ -3418,9 +3526,22 @@ static const struct uacce_ops uacce_qm_ops = {
 	.stop_queue = hisi_qm_uacce_stop_queue,
 	.mmap = hisi_qm_uacce_mmap,
 	.ioctl = hisi_qm_uacce_ioctl,
+	.get_isolate_state = hisi_qm_get_isolate_state,
 	.is_q_updated = hisi_qm_is_q_updated,
+	.isolate_strategy_write = hisi_qm_isolate_strategy_write,
 };
 
+static void qm_remove_uacce(struct hisi_qm *qm)
+{
+	struct uacce_device *uacce = qm->uacce;
+
+	if (qm->use_sva) {
+		qm_hw_err_destroy(qm);
+		uacce_remove(uacce);
+		qm->uacce = NULL;
+	}
+}
+
 static int qm_alloc_uacce(struct hisi_qm *qm)
 {
 	struct pci_dev *pdev = qm->pdev;
@@ -3433,6 +3554,7 @@ static int qm_alloc_uacce(struct hisi_qm *qm)
 	};
 	int ret;
 
+	INIT_LIST_HEAD(&qm->uacce_hw_errs);
 	ret = strscpy(interface.name, dev_driver_string(&pdev->dev),
 		      sizeof(interface.name));
 	if (ret < 0)
@@ -3446,8 +3568,7 @@ static int qm_alloc_uacce(struct hisi_qm *qm)
 		qm->use_sva = true;
 	} else {
 		/* only consider sva case */
-		uacce_remove(uacce);
-		qm->uacce = NULL;
+		qm_remove_uacce(qm);
 		return -EINVAL;
 	}
 
@@ -5109,6 +5230,12 @@ static int qm_controller_reset_prepare(struct hisi_qm *qm)
 		return ret;
 	}
 
+	if (qm->use_sva) {
+		ret = qm_hw_err_isolate(qm);
+		if (ret)
+			pci_err(pdev, "failed to isolate hw err!\n");
+	}
+
 	ret = qm_wait_vf_prepare_finish(qm);
 	if (ret)
 		pci_err(pdev, "failed to stop by vfs in soft reset!\n");
@@ -5436,19 +5563,24 @@ static int qm_controller_reset(struct hisi_qm *qm)
 	ret = qm_soft_reset(qm);
 	if (ret) {
 		pci_err(pdev, "Controller reset failed (%d)\n", ret);
-		qm_reset_bit_clear(qm);
-		return ret;
+		goto err_reset;
 	}
 
 	ret = qm_controller_reset_done(qm);
-	if (ret) {
-		qm_reset_bit_clear(qm);
-		return ret;
-	}
+	if (ret)
+		goto err_reset;
 
 	pci_info(pdev, "Controller reset complete\n");
-
 	return 0;
+
+err_reset:
+	pci_err(pdev, "Controller reset failed (%d)\n", ret);
+	qm_reset_bit_clear(qm);
+
+	/* if resetting fails, isolate the device */
+	if (qm->use_sva && !qm->uacce->is_vf)
+		atomic_set(&qm->isolate_data.is_isolate, 1);
+	return ret;
 }
 
 /**
@@ -6246,10 +6378,7 @@ int hisi_qm_init(struct hisi_qm *qm)
 err_free_qm_memory:
 	hisi_qm_memory_uninit(qm);
 err_alloc_uacce:
-	if (qm->use_sva) {
-		uacce_remove(qm->uacce);
-		qm->uacce = NULL;
-	}
+	qm_remove_uacce(qm);
 err_irq_register:
 	qm_irq_unregister(qm);
 err_pci_init:
diff --git a/include/linux/hisi_acc_qm.h b/include/linux/hisi_acc_qm.h
index 116e8bd68c99..c17fd6de8551 100644
--- a/include/linux/hisi_acc_qm.h
+++ b/include/linux/hisi_acc_qm.h
@@ -8,6 +8,7 @@
 #include <linux/iopoll.h>
 #include <linux/module.h>
 #include <linux/pci.h>
+#include <linux/uacce.h>
 
 #define QM_QNUM_V1			4096
 #define QM_QNUM_V2			1024
@@ -271,6 +272,11 @@ struct hisi_qm_poll_data {
 	u16 *qp_finish_id;
 };
 
+struct qm_err_isolate {
+	u32 hw_err_isolate_hz;	/* user cfg freq which triggers isolation */
+	atomic_t is_isolate;
+};
+
 struct hisi_qm {
 	enum qm_hw_ver ver;
 	enum qm_fun_type fun_type;
@@ -335,6 +341,9 @@ struct hisi_qm {
 	struct qm_shaper_factor *factor;
 	u32 mb_qos;
 	u32 type_rate;
+	struct list_head uacce_hw_errs;
+	atomic_t uacce_ref; /* reference of the uacce */
+	struct qm_err_isolate isolate_data;
 };
 
 struct hisi_qp_status {
-- 
2.33.0
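
The decision made in qm_hw_err_isolate() is a sliding-window count: errors
older than one hour are discarded, the remaining ones are counted, and the
device is isolated once that count reaches the configured threshold. The
user-space model below follows the same order of operations. It is a sketch
under assumptions: err_window, err_window_record and MAX_TRACKED are
hypothetical names, timestamps are plain seconds instead of jiffies, and a
fixed array replaces the kernel linked list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WINDOW_SECONDS	3600	/* same one-hour window as SECONDS_PER_HOUR */
#define MAX_TRACKED	64	/* hypothetical cap; the patch uses a list */

struct err_window {
	unsigned long long stamps[MAX_TRACKED];	/* seconds, oldest first */
	size_t n;
	unsigned int threshold;	/* 0 disables isolation, as in the patch */
	bool isolated;
};

/*
 * Record one hardware error at time `now` (in seconds) and return whether
 * the device is isolated afterwards. Errors older than the window are
 * dropped. The new error itself is not counted against the threshold,
 * which matches the order of operations in qm_hw_err_isolate().
 */
bool err_window_record(struct err_window *w, unsigned long long now)
{
	size_t i, kept = 0;

	assert(w != NULL);

	if (w->isolated || w->threshold == 0)
		return w->isolated;

	/* Keep only the errors that are still inside the window. */
	for (i = 0; i < w->n; i++) {
		if (now - w->stamps[i] <= WINDOW_SECONDS)
			w->stamps[kept++] = w->stamps[i];
	}

	if (kept >= w->threshold)
		w->isolated = true;

	/* Make room by discarding the oldest stamp when the array is full. */
	if (kept == MAX_TRACKED) {
		for (i = 1; i < kept; i++)
			w->stamps[i - 1] = w->stamps[i];
		kept--;
	}
	w->stamps[kept++] = now;
	w->n = kept;

	return w->isolated;
}
```

Because the new error is added after the comparison, a threshold of N isolates
the device on the error that follows N earlier errors inside the window.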



* [PATCH 1/3] uacce: supports device isolation feature
  2022-06-14 12:29 [PATCH v2 0/3] crypto: hisilicon - supports device isolation feature Kai Ye
                   ` (2 preceding siblings ...)
  2022-06-14 12:29 ` [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy Kai Ye
@ 2022-06-14 12:29 ` Kai Ye
  2022-06-14 14:14   ` Zhangfei Gao
  2022-06-14 12:29 ` [PATCH 2/3] Documentation: add a isolation strategy vfs node for uacce Kai Ye
  2022-06-14 12:29 ` [PATCH 3/3] crypto: hisilicon/qm - defining the device isolation strategy Kai Ye
  5 siblings, 1 reply; 25+ messages in thread
From: Kai Ye @ 2022-06-14 12:29 UTC
  To: gregkh, herbert
  Cc: linux-crypto, linux-accelerators, linux-kernel, linuxarm,
	zhangfei.gao, wangzhou1, yekai13

UACCE adds the hardware error isolation API. Users can configure
the error frequency threshold through this sysfs node. The interface
accepts a user-defined strategy string, which is then parsed inside
the device driver. UACCE only reports the device isolation state.
When the error frequency threshold is exceeded, the device is
isolated. The isolation strategy should be defined in each driver
module.

Signed-off-by: Kai Ye <yekai13@huawei.com>
Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com>
---
 drivers/misc/uacce/uacce.c | 51 ++++++++++++++++++++++++++++++++++++++
 include/linux/uacce.h      | 15 ++++++++++-
 2 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index b6219c6bfb48..4d9d9aeb145a 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uacce/uacce.c
@@ -12,6 +12,20 @@ static dev_t uacce_devt;
 static DEFINE_MUTEX(uacce_mutex);
 static DEFINE_XARRAY_ALLOC(uacce_xa);
 
+static int cdev_get(struct device *dev, void *data)
+{
+	struct uacce_device *uacce;
+	struct device **t_dev = data;
+
+	uacce = container_of(dev, struct uacce_device, dev);
+	if (uacce->parent == *t_dev) {
+		*t_dev = dev;
+		return 1;
+	}
+
+	return 0;
+}
+
 static int uacce_start_queue(struct uacce_queue *q)
 {
 	int ret = 0;
@@ -346,12 +360,47 @@ static ssize_t region_dus_size_show(struct device *dev,
 		       uacce->qf_pg_num[UACCE_QFRT_DUS] << PAGE_SHIFT);
 }
 
+static ssize_t isolate_show(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	struct uacce_device *uacce = to_uacce_device(dev);
+
+	return sysfs_emit(buf, "%d\n", uacce->ops->get_isolate_state(uacce));
+}
+
+static ssize_t isolate_strategy_show(struct device *dev,
+				     struct device_attribute *attr, char *buf)
+{
+	struct uacce_device *uacce = to_uacce_device(dev);
+
+	return sysfs_emit(buf, "%s\n", uacce->isolate_strategy);
+}
+
+static ssize_t isolate_strategy_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t count)
+{
+	struct uacce_device *uacce = to_uacce_device(dev);
+	int ret;
+
+	if (!buf || count >= UACCE_MAX_ISOLATE_STRATEGY_LEN)
+		return -EINVAL;
+
+	strscpy(uacce->isolate_strategy, buf, sizeof(uacce->isolate_strategy));
+
+	ret = uacce->ops->isolate_strategy_write(uacce, buf);
+
+	return ret ? ret : count;
+}
+
 static DEVICE_ATTR_RO(api);
 static DEVICE_ATTR_RO(flags);
 static DEVICE_ATTR_RO(available_instances);
 static DEVICE_ATTR_RO(algorithms);
 static DEVICE_ATTR_RO(region_mmio_size);
 static DEVICE_ATTR_RO(region_dus_size);
+static DEVICE_ATTR_RO(isolate);
+static DEVICE_ATTR_RW(isolate_strategy);
 
 static struct attribute *uacce_dev_attrs[] = {
 	&dev_attr_api.attr,
@@ -360,6 +409,8 @@ static struct attribute *uacce_dev_attrs[] = {
 	&dev_attr_algorithms.attr,
 	&dev_attr_region_mmio_size.attr,
 	&dev_attr_region_dus_size.attr,
+	&dev_attr_isolate.attr,
+	&dev_attr_isolate_strategy.attr,
 	NULL,
 };
 
diff --git a/include/linux/uacce.h b/include/linux/uacce.h
index 48e319f40275..e00a43a07e4b 100644
--- a/include/linux/uacce.h
+++ b/include/linux/uacce.h
@@ -8,6 +8,7 @@
 #define UACCE_NAME		"uacce"
 #define UACCE_MAX_REGION	2
 #define UACCE_MAX_NAME_SIZE	64
+#define UACCE_MAX_ISOLATE_STRATEGY_LEN	256
 
 struct uacce_queue;
 struct uacce_device;
@@ -30,6 +31,8 @@ struct uacce_qfile_region {
  * @is_q_updated: check whether the task is finished
  * @mmap: mmap addresses of queue to user space
  * @ioctl: ioctl for user space users of the queue
+ * @get_isolate_state: get the device state after the isolate strategy is set
+ * @isolate_strategy_write: store the isolate strategy to the device
  */
 struct uacce_ops {
 	int (*get_available_instances)(struct uacce_device *uacce);
@@ -43,6 +46,8 @@ struct uacce_ops {
 		    struct uacce_qfile_region *qfr);
 	long (*ioctl)(struct uacce_queue *q, unsigned int cmd,
 		      unsigned long arg);
+	enum uacce_dev_state (*get_isolate_state)(struct uacce_device *uacce);
+	int (*isolate_strategy_write)(struct uacce_device *uacce, const char *buf);
 };
 
 /**
@@ -57,6 +62,12 @@ struct uacce_interface {
 	const struct uacce_ops *ops;
 };
 
+enum uacce_dev_state {
+	UACCE_DEV_ERR = -1,
+	UACCE_DEV_NORMAL,
+	UACCE_DEV_ISOLATE,
+};
+
 enum uacce_q_state {
 	UACCE_Q_ZOMBIE = 0,
 	UACCE_Q_INIT,
@@ -99,6 +110,7 @@ struct uacce_queue {
  * @dev: dev of the uacce
  * @priv: private pointer of the uacce
  * @queues: list of queues
+ * @ref: reference of the uacce
  * @queues_lock: lock for queues list
  * @inode: core vfs
  */
@@ -117,6 +129,7 @@ struct uacce_device {
 	struct list_head queues;
 	struct mutex queues_lock;
 	struct inode *inode;
+	char isolate_strategy[UACCE_MAX_ISOLATE_STRATEGY_LEN];
 };
 
 #if IS_ENABLED(CONFIG_UACCE)
@@ -125,7 +138,7 @@ struct uacce_device *uacce_alloc(struct device *parent,
 				 struct uacce_interface *interface);
 int uacce_register(struct uacce_device *uacce);
 void uacce_remove(struct uacce_device *uacce);
-
+struct uacce_device *dev_to_uacce(struct device *dev);
 #else /* CONFIG_UACCE */
 
 static inline
-- 
2.33.0



* [PATCH 2/3] Documentation: add a isolation strategy vfs node for uacce
  2022-06-14 12:29 [PATCH v2 0/3] crypto: hisilicon - supports device isolation feature Kai Ye
                   ` (3 preceding siblings ...)
  2022-06-14 12:29 ` [PATCH 1/3] uacce: supports device isolation feature Kai Ye
@ 2022-06-14 12:29 ` Kai Ye
  2022-06-14 12:29 ` [PATCH 3/3] crypto: hisilicon/qm - defining the device isolation strategy Kai Ye
  5 siblings, 0 replies; 25+ messages in thread
From: Kai Ye @ 2022-06-14 12:29 UTC
  To: gregkh, herbert
  Cc: linux-crypto, linux-accelerators, linux-kernel, linuxarm,
	zhangfei.gao, wangzhou1, yekai13

Update the documentation to describe the sysfs nodes that let users
in user space configure the hardware error frequency threshold.

Signed-off-by: Kai Ye <yekai13@huawei.com>
---
 Documentation/ABI/testing/sysfs-driver-uacce | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
index 08f2591138af..0c4226364182 100644
--- a/Documentation/ABI/testing/sysfs-driver-uacce
+++ b/Documentation/ABI/testing/sysfs-driver-uacce
@@ -19,6 +19,23 @@ Contact:        linux-accelerators@lists.ozlabs.org
 Description:    Available instances left of the device
                 Return -ENODEV if uacce_ops get_available_instances is not provided
 
+What:           /sys/class/uacce/<dev_name>/isolate_strategy
+Date:           Jun 2022
+KernelVersion:  5.19
+Contact:        linux-accelerators@lists.ozlabs.org
+Description:    A sysfs node that is used to configure the hardware
+                error frequency. This frequency is abstract, such as once an
+                hour or once a day. The specific isolation strategy can be
+                defined in each driver module.
+
+What:           /sys/class/uacce/<dev_name>/isolate
+Date:           Jun 2022
+KernelVersion:  5.19
+Contact:        linux-accelerators@lists.ozlabs.org
+Description:    A sysfs node that shows the device isolation state. The value
+                0 means that the device is working. The value 1 means that the
+                device has been isolated.
+
 What:           /sys/class/uacce/<dev_name>/algorithms
 Date:           Feb 2020
 KernelVersion:  5.7
-- 
2.33.0



* [PATCH 3/3] crypto: hisilicon/qm - defining the device isolation strategy
  2022-06-14 12:29 [PATCH v2 0/3] crypto: hisilicon - supports device isolation feature Kai Ye
                   ` (4 preceding siblings ...)
  2022-06-14 12:29 ` [PATCH 2/3] Documentation: add a isolation strategy vfs node for uacce Kai Ye
@ 2022-06-14 12:29 ` Kai Ye
  5 siblings, 0 replies; 25+ messages in thread
From: Kai Ye @ 2022-06-14 12:29 UTC
  To: gregkh, herbert
  Cc: linux-crypto, linux-accelerators, linux-kernel, linuxarm,
	zhangfei.gao, wangzhou1, yekai13

Define the device isolation strategy in the device driver. If the
AER error frequency exceeds the configured value within a certain
period of time, the device is no longer available to user space. The VF
device uses the PF device's isolation strategy. All hardware errors
are processed by the PF driver.

Signed-off-by: Kai Ye <yekai13@huawei.com>
---
 drivers/crypto/hisilicon/qm.c | 155 +++++++++++++++++++++++++++++++---
 include/linux/hisi_acc_qm.h   |   9 ++
 2 files changed, 152 insertions(+), 12 deletions(-)

diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c
index b4ca2eb034d7..391d83929ad2 100644
--- a/drivers/crypto/hisilicon/qm.c
+++ b/drivers/crypto/hisilicon/qm.c
@@ -12,7 +12,6 @@
 #include <linux/pm_runtime.h>
 #include <linux/seq_file.h>
 #include <linux/slab.h>
-#include <linux/uacce.h>
 #include <linux/uaccess.h>
 #include <uapi/misc/uacce/hisi_qm.h>
 #include <linux/hisi_acc_qm.h>
@@ -417,6 +416,16 @@ struct hisi_qm_resource {
 	struct list_head list;
 };
 
+/**
+ * struct qm_hw_err - describes a device hardware error
+ * @list: hardware error list
+ * @tick_stamp: timestamp when the error occurred
+ */
+struct qm_hw_err {
+	struct list_head list;
+	unsigned long long tick_stamp;
+};
+
 struct hisi_qm_hw_ops {
 	int (*get_vft)(struct hisi_qm *qm, u32 *base, u32 *number);
 	void (*qm_db)(struct hisi_qm *qm, u16 qn,
@@ -3265,6 +3274,7 @@ static int hisi_qm_uacce_get_queue(struct uacce_device *uacce,
 	qp->event_cb = qm_qp_event_notifier;
 	qp->pasid = arg;
 	qp->is_in_kernel = false;
+	atomic_inc(&qm->uacce_ref);
 
 	return 0;
 }
@@ -3272,7 +3282,9 @@ static int hisi_qm_uacce_get_queue(struct uacce_device *uacce,
 static void hisi_qm_uacce_put_queue(struct uacce_queue *q)
 {
 	struct hisi_qp *qp = q->priv;
+	struct hisi_qm *qm = qp->qm;
 
+	atomic_dec(&qm->uacce_ref);
 	hisi_qm_cache_wb(qp->qm);
 	hisi_qm_release_qp(qp);
 }
@@ -3397,6 +3409,102 @@ static long hisi_qm_uacce_ioctl(struct uacce_queue *q, unsigned int cmd,
 	return 0;
 }
 
+/**
+ * qm_hw_err_isolate() - Try to isolate the uacce device with its VFs
+ * @qm: The qm which we want to configure.
+ *
+ * Isolation follows the user's configured isolation strategy. Warning: this
+ * API should be called while there is no user on the device, or the users
+ * on this device are suspended by slot resetting preparation of PCI AER.
+ */
+static int qm_hw_err_isolate(struct hisi_qm *qm)
+{
+	struct qm_hw_err *err, *tmp, *hw_err;
+	struct qm_err_isolate *isolate;
+	u32 count = 0;
+
+	isolate = &qm->isolate_data;
+
+#define SECONDS_PER_HOUR	3600
+
+	/* All the hw errs are processed by PF driver */
+	if (qm->uacce->is_vf || atomic_read(&isolate->is_isolate) ||
+		!isolate->hw_err_isolate_hz)
+		return 0;
+
+	hw_err = kzalloc(sizeof(*hw_err), GFP_ATOMIC);
+	if (!hw_err)
+		return -ENOMEM;
+	hw_err->tick_stamp = jiffies;
+	list_for_each_entry_safe(err, tmp, &qm->uacce_hw_errs, list) {
+		if ((hw_err->tick_stamp - err->tick_stamp) / HZ >
+		    SECONDS_PER_HOUR) {
+			list_del(&err->list);
+			kfree(err);
+		} else {
+			count++;
+		}
+	}
+	list_add(&hw_err->list, &qm->uacce_hw_errs);
+
+	if (count >= isolate->hw_err_isolate_hz)
+		atomic_set(&isolate->is_isolate, 1);
+
+	return 0;
+}
+
+static void qm_hw_err_destroy(struct hisi_qm *qm)
+{
+	struct qm_hw_err *err, *tmp;
+
+	list_for_each_entry_safe(err, tmp, &qm->uacce_hw_errs, list) {
+		list_del(&err->list);
+		kfree(err);
+	}
+}
+
+static enum uacce_dev_state hisi_qm_get_isolate_state(struct uacce_device *uacce)
+{
+	struct hisi_qm *qm = uacce->priv;
+	struct hisi_qm *pf_qm;
+
+	if (uacce->is_vf) {
+		pf_qm = pci_get_drvdata(pci_physfn(qm->pdev));
+		qm->isolate_data.is_isolate = pf_qm->isolate_data.is_isolate;
+	}
+
+	return atomic_read(&qm->isolate_data.is_isolate) ?
+			UACCE_DEV_ISOLATE : UACCE_DEV_NORMAL;
+}
+
+static int hisi_qm_isolate_strategy_write(struct uacce_device *uacce,
+						const char *buf)
+{
+	struct hisi_qm *qm = uacce->priv;
+	unsigned long val = 0;
+
+#define MAX_ISOLATE_STRATEGY	65535
+
+	if (atomic_read(&qm->uacce_ref))
+		return -EBUSY;
+
+	/* must be set by PF */
+	if (atomic_read(&qm->isolate_data.is_isolate) || uacce->is_vf)
+		return -EINVAL;
+
+	if (kstrtoul(buf, 0, &val) < 0)
+		return -EINVAL;
+
+	if (val > MAX_ISOLATE_STRATEGY)
+		return -EINVAL;
+
+	qm->isolate_data.hw_err_isolate_hz = val;
+	dev_info(&qm->pdev->dev,
+		"the value of isolate_strategy is set to %lu.\n", val);
+
+	return 0;
+}
+
 static const struct uacce_ops uacce_qm_ops = {
 	.get_available_instances = hisi_qm_get_available_instances,
 	.get_queue = hisi_qm_uacce_get_queue,
@@ -3405,9 +3513,22 @@ static const struct uacce_ops uacce_qm_ops = {
 	.stop_queue = hisi_qm_uacce_stop_queue,
 	.mmap = hisi_qm_uacce_mmap,
 	.ioctl = hisi_qm_uacce_ioctl,
+	.get_isolate_state = hisi_qm_get_isolate_state,
 	.is_q_updated = hisi_qm_is_q_updated,
+	.isolate_strategy_write = hisi_qm_isolate_strategy_write,
 };
 
+static void qm_remove_uacce(struct hisi_qm *qm)
+{
+	struct uacce_device *uacce = qm->uacce;
+
+	if (qm->use_sva) {
+		qm_hw_err_destroy(qm);
+		uacce_remove(uacce);
+		qm->uacce = NULL;
+	}
+}
+
 static int qm_alloc_uacce(struct hisi_qm *qm)
 {
 	struct pci_dev *pdev = qm->pdev;
@@ -3420,6 +3541,7 @@ static int qm_alloc_uacce(struct hisi_qm *qm)
 	};
 	int ret;
 
+	INIT_LIST_HEAD(&qm->uacce_hw_errs);
 	ret = strscpy(interface.name, dev_driver_string(&pdev->dev),
 		      sizeof(interface.name));
 	if (ret < 0)
@@ -3433,8 +3555,7 @@ static int qm_alloc_uacce(struct hisi_qm *qm)
 		qm->use_sva = true;
 	} else {
 		/* only consider sva case */
-		uacce_remove(uacce);
-		qm->uacce = NULL;
+		qm_remove_uacce(qm);
 		return -EINVAL;
 	}
 
@@ -5074,6 +5195,12 @@ static int qm_controller_reset_prepare(struct hisi_qm *qm)
 		return ret;
 	}
 
+	if (qm->use_sva) {
+		ret = qm_hw_err_isolate(qm);
+		if (ret)
+			pci_err(pdev, "failed to isolate hw err!\n");
+	}
+
 	ret = qm_wait_vf_prepare_finish(qm);
 	if (ret)
 		pci_err(pdev, "failed to stop by vfs in soft reset!\n");
@@ -5401,19 +5528,24 @@ static int qm_controller_reset(struct hisi_qm *qm)
 	ret = qm_soft_reset(qm);
 	if (ret) {
 		pci_err(pdev, "Controller reset failed (%d)\n", ret);
-		qm_reset_bit_clear(qm);
-		return ret;
+		goto err_reset;
 	}
 
 	ret = qm_controller_reset_done(qm);
-	if (ret) {
-		qm_reset_bit_clear(qm);
-		return ret;
-	}
+	if (ret)
+		goto err_reset;
 
 	pci_info(pdev, "Controller reset complete\n");
-
 	return 0;
+
+err_reset:
+	pci_err(pdev, "Controller reset failed (%d)\n", ret);
+	qm_reset_bit_clear(qm);
+
+	/* if resetting fails, isolate the device */
+	if (qm->use_sva && !qm->uacce->is_vf)
+		atomic_set(&qm->isolate_data.is_isolate, 1);
+	return ret;
 }
 
 /**
@@ -6186,8 +6318,7 @@ int hisi_qm_init(struct hisi_qm *qm)
 
 err_alloc_uacce:
 	if (qm->use_sva) {
-		uacce_remove(qm->uacce);
-		qm->uacce = NULL;
+		qm_remove_uacce(qm);
 	}
 err_irq_register:
 	qm_irq_unregister(qm);
diff --git a/include/linux/hisi_acc_qm.h b/include/linux/hisi_acc_qm.h
index 6cabafffd0dd..c090aaaf9974 100644
--- a/include/linux/hisi_acc_qm.h
+++ b/include/linux/hisi_acc_qm.h
@@ -8,6 +8,7 @@
 #include <linux/iopoll.h>
 #include <linux/module.h>
 #include <linux/pci.h>
+#include <linux/uacce.h>
 
 #define QM_QNUM_V1			4096
 #define QM_QNUM_V2			1024
@@ -265,6 +266,11 @@ struct hisi_qm_list {
 	void (*unregister_from_crypto)(struct hisi_qm *qm);
 };
 
+struct qm_err_isolate {
+	u32 hw_err_isolate_hz;	/* user cfg freq which triggers isolation */
+	atomic_t is_isolate;
+};
+
 struct hisi_qm {
 	enum qm_hw_ver ver;
 	enum qm_fun_type fun_type;
@@ -329,6 +335,9 @@ struct hisi_qm {
 	struct qm_shaper_factor *factor;
 	u32 mb_qos;
 	u32 type_rate;
+	struct list_head uacce_hw_errs;
+	atomic_t uacce_ref; /* reference of the uacce */
+	struct qm_err_isolate isolate_data;
 };
 
 struct hisi_qp_status {
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread
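
For illustration only (not code from this series), the sliding-window check
that qm_hw_err_isolate() performs can be modelled in plain userspace C. The
sketch assumes timestamps in seconds rather than jiffies and a fixed-size
array rather than a kernel list; err_window, err_window_record and
MAX_TRACKED are hypothetical names, and the threshold is assumed not to
exceed MAX_TRACKED.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_SECONDS 3600u /* one hour, like SECONDS_PER_HOUR in the patch */
#define MAX_TRACKED    64u   /* hypothetical cap so the sketch needs no malloc */

struct err_window {
	uint64_t stamps[MAX_TRACKED]; /* times of recent errors, in seconds */
	size_t   nr;                  /* number of valid entries in stamps[] */
	uint32_t threshold;           /* 0 disables isolation, as in the patch */
	bool     isolated;
};

/* Record one error at time `now`; return true once the device is isolated. */
static bool err_window_record(struct err_window *w, uint64_t now)
{
	size_t i, kept = 0;

	if (w->isolated || w->threshold == 0)
		return w->isolated;

	/* Drop entries older than the window, compacting the rest in place. */
	for (i = 0; i < w->nr; i++) {
		if (now - w->stamps[i] <= WINDOW_SECONDS)
			w->stamps[kept++] = w->stamps[i];
	}

	/* Like the patch, compare the number of earlier errors to the threshold. */
	if (kept >= w->threshold)
		w->isolated = true;

	/* Remember the new error; it is dropped if the array is already full. */
	if (kept < MAX_TRACKED)
		w->stamps[kept++] = now;
	w->nr = kept;

	return w->isolated;
}
```

With threshold = 2, the third error inside one hour sets isolated, because
the comparison only counts the errors recorded before the current one.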

* Re: [PATCH v2 2/3] Documentation: add a isolation strategy vfs node for uacce
  2022-06-14 12:29 ` [PATCH v2 2/3] Documentation: add a isolation strategy vfs node for uacce Kai Ye
@ 2022-06-14 12:41   ` Greg KH
  2022-06-15  8:48     ` Jonathan Cameron
  0 siblings, 1 reply; 25+ messages in thread
From: Greg KH @ 2022-06-14 12:41 UTC (permalink / raw)
  To: Kai Ye
  Cc: herbert, linux-crypto, linux-accelerators, linux-kernel,
	linuxarm, zhangfei.gao, wangzhou1

On Tue, Jun 14, 2022 at 08:29:39PM +0800, Kai Ye wrote:
> Update documentation describing DebugFS that could help to
> configure hard error frequency for users in the user space.
> 
> Signed-off-by: Kai Ye <yekai13@huawei.com>
> ---
>  Documentation/ABI/testing/sysfs-driver-uacce | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
> index 08f2591138af..0c4226364182 100644
> --- a/Documentation/ABI/testing/sysfs-driver-uacce
> +++ b/Documentation/ABI/testing/sysfs-driver-uacce
> @@ -19,6 +19,23 @@ Contact:        linux-accelerators@lists.ozlabs.org
>  Description:    Available instances left of the device
>                  Return -ENODEV if uacce_ops get_available_instances is not provided
>  
> +What:           /sys/class/uacce/<dev_name>/isolate_strategy
> +Date:           Jun 2022
> +KernelVersion:  5.19
> +Contact:        linux-accelerators@lists.ozlabs.org
> +Description:    A vfs node that used to configures the hardware

What is a "vfs node"?

> +                error frequency. This frequency is abstract. Like once an hour
> +                or once a day. The specific isolation strategy can be defined in
> +                each driver module.

No, you need to be specific here and describe the units and the format.
Otherwise it is no description at all :(

> +
> +What:           /sys/class/uacce/<dev_name>/isolate
> +Date:           Jun 2022
> +KernelVersion:  5.19

5.19 will not have this change.

> +Contact:        linux-accelerators@lists.ozlabs.org
> +Description:    A vfs node that show the device isolated state. The value 0
> +                means that the device is working. The value 1 means that the
> +                device has been isolated.

What does "working" or "isolated" mean?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/3] uacce: supports device isolation feature
  2022-06-14 12:29 ` [PATCH v2 1/3] uacce: " Kai Ye
@ 2022-06-14 12:42   ` Greg KH
  2022-06-15  8:52   ` Jonathan Cameron
  1 sibling, 0 replies; 25+ messages in thread
From: Greg KH @ 2022-06-14 12:42 UTC (permalink / raw)
  To: Kai Ye
  Cc: herbert, linux-crypto, linux-accelerators, linux-kernel,
	linuxarm, zhangfei.gao, wangzhou1

On Tue, Jun 14, 2022 at 08:29:38PM +0800, Kai Ye wrote:
> UACCE adds a hardware error isolation API. Users can configure
> the error frequency threshold through this vfs node. The interface
> also accepts a user-defined strategy, which is then parsed inside
> the device driver. UACCE only reports the device isolation state.
> When the error frequency threshold is exceeded, the device
> will be isolated. The isolation strategy should be defined in each
> driver module.
> 
> Signed-off-by: Kai Ye <yekai13@huawei.com>
> Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com>
> ---
>  drivers/misc/uacce/uacce.c | 37 +++++++++++++++++++++++++++++++++++++
>  include/linux/uacce.h      | 16 +++++++++++++---
>  2 files changed, 50 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
> index b6219c6bfb48..525623215132 100644
> --- a/drivers/misc/uacce/uacce.c
> +++ b/drivers/misc/uacce/uacce.c
> @@ -346,12 +346,47 @@ static ssize_t region_dus_size_show(struct device *dev,
>  		       uacce->qf_pg_num[UACCE_QFRT_DUS] << PAGE_SHIFT);
>  }
>  
> +static ssize_t isolate_show(struct device *dev,
> +			    struct device_attribute *attr, char *buf)
> +{
> +	struct uacce_device *uacce = to_uacce_device(dev);
> +
> +	return sysfs_emit(buf, "%d\n", uacce->ops->get_isolate_state(uacce));
> +}
> +
> +static ssize_t isolate_strategy_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	struct uacce_device *uacce = to_uacce_device(dev);
> +
> +	return sysfs_emit(buf, "%s\n", uacce->isolate_strategy);
> +}
> +
> +static ssize_t isolate_strategy_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t count)
> +{
> +	struct uacce_device *uacce = to_uacce_device(dev);
> +	int ret;
> +
> +	if (!buf || sizeof(buf) > UACCE_MAX_ISOLATE_STRATEGY_LEN)
> +		return -EINVAL;
> +
> +	memcpy(uacce->isolate_strategy, buf, strlen(buf));
> +
> +	ret = uacce->ops->isolate_strategy_write(uacce, buf);
> +
> +	return ret ? ret : count;
> +}
> +
>  static DEVICE_ATTR_RO(api);
>  static DEVICE_ATTR_RO(flags);
>  static DEVICE_ATTR_RO(available_instances);
>  static DEVICE_ATTR_RO(algorithms);
>  static DEVICE_ATTR_RO(region_mmio_size);
>  static DEVICE_ATTR_RO(region_dus_size);
> +static DEVICE_ATTR_RO(isolate);
> +static DEVICE_ATTR_RW(isolate_strategy);
>  
>  static struct attribute *uacce_dev_attrs[] = {
>  	&dev_attr_api.attr,
> @@ -360,6 +395,8 @@ static struct attribute *uacce_dev_attrs[] = {
>  	&dev_attr_algorithms.attr,
>  	&dev_attr_region_mmio_size.attr,
>  	&dev_attr_region_dus_size.attr,
> +	&dev_attr_isolate.attr,
> +	&dev_attr_isolate_strategy.attr,
>  	NULL,
>  };
>  
> diff --git a/include/linux/uacce.h b/include/linux/uacce.h
> index 48e319f40275..0f7668bfa645 100644
> --- a/include/linux/uacce.h
> +++ b/include/linux/uacce.h
> @@ -8,6 +8,7 @@
>  #define UACCE_NAME		"uacce"
>  #define UACCE_MAX_REGION	2
>  #define UACCE_MAX_NAME_SIZE	64
> +#define UACCE_MAX_ISOLATE_STRATEGY_LEN	256

So it's a random string of characters?  What format?

>  
>  struct uacce_queue;
>  struct uacce_device;
> @@ -30,6 +31,8 @@ struct uacce_qfile_region {
>   * @is_q_updated: check whether the task is finished
>   * @mmap: mmap addresses of queue to user space
>   * @ioctl: ioctl for user space users of the queue
> + * @get_isolate_state: get the device state after setting the isolate strategy
> + * @isolate_strategy_write: store the isolate strategy to the device
>   */
>  struct uacce_ops {
>  	int (*get_available_instances)(struct uacce_device *uacce);
> @@ -43,6 +46,8 @@ struct uacce_ops {
>  		    struct uacce_qfile_region *qfr);
>  	long (*ioctl)(struct uacce_queue *q, unsigned int cmd,
>  		      unsigned long arg);
> +	enum uacce_dev_state (*get_isolate_state)(struct uacce_device *uacce);
> +	int (*isolate_strategy_write)(struct uacce_device *uacce, const char *buf);

Length of the buffer?

>  };
>  
>  /**
> @@ -57,6 +62,12 @@ struct uacce_interface {
>  	const struct uacce_ops *ops;
>  };
>  
> +enum uacce_dev_state {
> +	UACCE_DEV_ERR = -1,
> +	UACCE_DEV_NORMAL,
> +	UACCE_DEV_ISOLATE,
> +};
> +
>  enum uacce_q_state {
>  	UACCE_Q_ZOMBIE = 0,
>  	UACCE_Q_INIT,
> @@ -117,6 +128,7 @@ struct uacce_device {
>  	struct list_head queues;
>  	struct mutex queues_lock;
>  	struct inode *inode;
> +	char isolate_strategy[UACCE_MAX_ISOLATE_STRATEGY_LEN];
>  };
>  
>  #if IS_ENABLED(CONFIG_UACCE)
> @@ -125,7 +137,7 @@ struct uacce_device *uacce_alloc(struct device *parent,
>  				 struct uacce_interface *interface);
>  int uacce_register(struct uacce_device *uacce);
>  void uacce_remove(struct uacce_device *uacce);
> -
> +struct uacce_device *dev_to_uacce(struct device *dev);

Why is this moved to the .h file yet the function is not exported?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 25+ messages in thread
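
For illustration only (not code from this series): in the quoted
isolate_strategy_store(), sizeof(buf) is the size of a pointer, so it can
never exceed 256, and memcpy() with strlen(buf) does not copy the terminating
NUL. A minimal userspace model of a length-aware store handler, assuming the
byte count is passed in as it is for sysfs store callbacks, might look like
this; strategy_store and STRATEGY_MAX_LEN are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define STRATEGY_MAX_LEN 256 /* mirrors UACCE_MAX_ISOLATE_STRATEGY_LEN */

/*
 * Hypothetical userspace model of a sysfs store handler: `count` is the
 * number of bytes written, which is what the length check has to use.
 * Returns `count` on success or a negative errno value.
 */
static long strategy_store(char dst[STRATEGY_MAX_LEN], const char *buf,
			   size_t count)
{
	/* Leave room for the terminating NUL added below. */
	if (!buf || count == 0 || count >= STRATEGY_MAX_LEN)
		return -EINVAL;

	memcpy(dst, buf, count);
	dst[count] = '\0';

	/* Writes made with `echo` usually end in a newline; drop it. */
	if (dst[count - 1] == '\n')
		dst[count - 1] = '\0';

	return (long)count;
}
```

Passing the length down to the driver callback as well would answer the
"Length of the buffer?" question for isolate_strategy_write().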

* Re: [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy
  2022-06-14 12:29 ` [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy Kai Ye
@ 2022-06-14 12:43   ` Greg KH
  2022-06-14 13:24     ` yekai(A)
  2022-06-14 14:12   ` Zhangfei Gao
  2022-06-15 13:02   ` Jonathan Cameron
  2 siblings, 1 reply; 25+ messages in thread
From: Greg KH @ 2022-06-14 12:43 UTC (permalink / raw)
  To: Kai Ye
  Cc: herbert, linux-crypto, linux-accelerators, linux-kernel,
	linuxarm, zhangfei.gao, wangzhou1

On Tue, Jun 14, 2022 at 08:29:40PM +0800, Kai Ye wrote:
> Define the device isolation strategy in the device driver. If the
> AER error frequency exceeds the configured value within a certain
> period of time, the device will not be available in user space. The VF
> device uses the PF device's isolation strategy. All hardware errors
> are processed by the PF driver.
> 
> Signed-off-by: Kai Ye <yekai13@huawei.com>
> ---
>  drivers/crypto/hisilicon/qm.c | 157 +++++++++++++++++++++++++++++++---
>  include/linux/hisi_acc_qm.h   |   9 ++
>  2 files changed, 152 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c
> index ad83c194d664..47c41fa52693 100644
> --- a/drivers/crypto/hisilicon/qm.c
> +++ b/drivers/crypto/hisilicon/qm.c
> @@ -12,7 +12,6 @@
>  #include <linux/pm_runtime.h>
>  #include <linux/seq_file.h>
>  #include <linux/slab.h>
> -#include <linux/uacce.h>
>  #include <linux/uaccess.h>
>  #include <uapi/misc/uacce/hisi_qm.h>
>  #include <linux/hisi_acc_qm.h>
> @@ -417,6 +416,16 @@ struct hisi_qm_resource {
>  	struct list_head list;
>  };
>  
> +/**
> + * struct qm_hw_err - structure of describes the device err
> + * @list: hardware error list
> + * @tick_stamp: timestamp when the error occurred
> + */
> +struct qm_hw_err {
> +	struct list_head list;
> +	unsigned long long tick_stamp;
> +};
> +
>  struct hisi_qm_hw_ops {
>  	int (*get_vft)(struct hisi_qm *qm, u32 *base, u32 *number);
>  	void (*qm_db)(struct hisi_qm *qm, u16 qn,
> @@ -3278,6 +3287,7 @@ static int hisi_qm_uacce_get_queue(struct uacce_device *uacce,
>  	qp->event_cb = qm_qp_event_notifier;
>  	qp->pasid = arg;
>  	qp->is_in_kernel = false;
> +	atomic_inc(&qm->uacce_ref);
>  
>  	return 0;
>  }
> @@ -3285,7 +3295,9 @@ static int hisi_qm_uacce_get_queue(struct uacce_device *uacce,
>  static void hisi_qm_uacce_put_queue(struct uacce_queue *q)
>  {
>  	struct hisi_qp *qp = q->priv;
> +	struct hisi_qm *qm = qp->qm;
>  
> +	atomic_dec(&qm->uacce_ref);
>  	hisi_qm_cache_wb(qp->qm);
>  	hisi_qm_release_qp(qp);
>  }
> @@ -3410,6 +3422,102 @@ static long hisi_qm_uacce_ioctl(struct uacce_queue *q, unsigned int cmd,
>  	return 0;
>  }
>  
> +/**
> + * qm_hw_err_isolate() - Try to isolate the uacce device with its VFs
> + * @qm: The qm which we want to configure.
> + *
> + * according to user's configuration of isolation strategy. Warning: this
> + * API should be called while there is no user on the device, or the users
> + * on this device are suspended by slot resetting preparation of PCI AER.
> + */
> +static int qm_hw_err_isolate(struct hisi_qm *qm)
> +{
> +	struct qm_hw_err *err, *tmp, *hw_err;
> +	struct qm_err_isolate *isolate;
> +	u32 count = 0;
> +
> +	isolate = &qm->isolate_data;
> +
> +#define SECONDS_PER_HOUR	3600
> +
> +	/* All the hw errs are processed by PF driver */
> +	if (qm->uacce->is_vf || atomic_read(&isolate->is_isolate) ||
> +		!isolate->hw_err_isolate_hz)
> +		return 0;
> +
> +	hw_err = kzalloc(sizeof(*hw_err), GFP_ATOMIC);
> +	if (!hw_err)
> +		return -ENOMEM;
> +	hw_err->tick_stamp = jiffies;
> +	list_for_each_entry_safe(err, tmp, &qm->uacce_hw_errs, list) {
> +		if ((hw_err->tick_stamp - err->tick_stamp) / HZ >
> +		    SECONDS_PER_HOUR) {
> +			list_del(&err->list);
> +			kfree(err);
> +		} else {
> +			count++;
> +		}
> +	}
> +	list_add(&hw_err->list, &qm->uacce_hw_errs);
> +
> +	if (count >= isolate->hw_err_isolate_hz)
> +		atomic_set(&isolate->is_isolate, 1);
> +
> +	return 0;
> +}
> +
> +static void qm_hw_err_destroy(struct hisi_qm *qm)
> +{
> +	struct qm_hw_err *err, *tmp;
> +
> +	list_for_each_entry_safe(err, tmp, &qm->uacce_hw_errs, list) {
> +		list_del(&err->list);
> +		kfree(err);
> +	}
> +}
> +
> +static enum uacce_dev_state hisi_qm_get_isolate_state(struct uacce_device *uacce)
> +{
> +	struct hisi_qm *qm = uacce->priv;
> +	struct hisi_qm *pf_qm;
> +
> +	if (uacce->is_vf) {
> +		pf_qm = pci_get_drvdata(pci_physfn(qm->pdev));
> +		qm->isolate_data.is_isolate = pf_qm->isolate_data.is_isolate;
> +	}
> +
> +	return atomic_read(&qm->isolate_data.is_isolate) ?
> +			UACCE_DEV_ISOLATE : UACCE_DEV_NORMAL;
> +}
> +
> +static int hisi_qm_isolate_strategy_write(struct uacce_device *uacce,
> +						const char *buf)
> +{
> +	struct hisi_qm *qm = uacce->priv;
> +	unsigned long val = 0;
> +
> +#define MAX_ISOLATE_STRATEGY	65535
> +
> +	if (atomic_read(&qm->uacce_ref))
> +		return -EBUSY;
> +
> +	/* must be set by PF */
> +	if (atomic_read(&qm->isolate_data.is_isolate) || uacce->is_vf)
> +		return -EINVAL;
> +
> +	if (kstrtoul(buf, 0, &val) < 0)
> +		return -EINVAL;
> +
> +	if (val > MAX_ISOLATE_STRATEGY)
> +		return -EINVAL;
> +
> +	qm->isolate_data.hw_err_isolate_hz = val;
> +	dev_info(&qm->pdev->dev,
> +		"the value of isolate_strategy is set to %lu.\n", val);
> +
> +	return 0;
> +}
> +
>  static const struct uacce_ops uacce_qm_ops = {
>  	.get_available_instances = hisi_qm_get_available_instances,
>  	.get_queue = hisi_qm_uacce_get_queue,
> @@ -3418,9 +3526,22 @@ static const struct uacce_ops uacce_qm_ops = {
>  	.stop_queue = hisi_qm_uacce_stop_queue,
>  	.mmap = hisi_qm_uacce_mmap,
>  	.ioctl = hisi_qm_uacce_ioctl,
> +	.get_isolate_state = hisi_qm_get_isolate_state,
>  	.is_q_updated = hisi_qm_is_q_updated,
> +	.isolate_strategy_write = hisi_qm_isolate_strategy_write,
>  };
>  
> +static void qm_remove_uacce(struct hisi_qm *qm)
> +{
> +	struct uacce_device *uacce = qm->uacce;
> +
> +	if (qm->use_sva) {
> +		qm_hw_err_destroy(qm);
> +		uacce_remove(uacce);
> +		qm->uacce = NULL;
> +	}
> +}
> +
>  static int qm_alloc_uacce(struct hisi_qm *qm)
>  {
>  	struct pci_dev *pdev = qm->pdev;
> @@ -3433,6 +3554,7 @@ static int qm_alloc_uacce(struct hisi_qm *qm)
>  	};
>  	int ret;
>  
> +	INIT_LIST_HEAD(&qm->uacce_hw_errs);
>  	ret = strscpy(interface.name, dev_driver_string(&pdev->dev),
>  		      sizeof(interface.name));
>  	if (ret < 0)
> @@ -3446,8 +3568,7 @@ static int qm_alloc_uacce(struct hisi_qm *qm)
>  		qm->use_sva = true;
>  	} else {
>  		/* only consider sva case */
> -		uacce_remove(uacce);
> -		qm->uacce = NULL;
> +		qm_remove_uacce(qm);
>  		return -EINVAL;
>  	}
>  
> @@ -5109,6 +5230,12 @@ static int qm_controller_reset_prepare(struct hisi_qm *qm)
>  		return ret;
>  	}
>  
> +	if (qm->use_sva) {
> +		ret = qm_hw_err_isolate(qm);
> +		if (ret)
> +			pci_err(pdev, "failed to isolate hw err!\n");
> +	}
> +
>  	ret = qm_wait_vf_prepare_finish(qm);
>  	if (ret)
>  		pci_err(pdev, "failed to stop by vfs in soft reset!\n");
> @@ -5436,19 +5563,24 @@ static int qm_controller_reset(struct hisi_qm *qm)
>  	ret = qm_soft_reset(qm);
>  	if (ret) {
>  		pci_err(pdev, "Controller reset failed (%d)\n", ret);
> -		qm_reset_bit_clear(qm);
> -		return ret;
> +		goto err_reset;
>  	}
>  
>  	ret = qm_controller_reset_done(qm);
> -	if (ret) {
> -		qm_reset_bit_clear(qm);
> -		return ret;
> -	}
> +	if (ret)
> +		goto err_reset;
>  
>  	pci_info(pdev, "Controller reset complete\n");
> -
>  	return 0;
> +
> +err_reset:
> +	pci_err(pdev, "Controller reset failed (%d)\n", ret);
> +	qm_reset_bit_clear(qm);
> +
> +	/* if resetting fails, isolate the device */
> +	if (qm->use_sva && !qm->uacce->is_vf)
> +		atomic_set(&qm->isolate_data.is_isolate, 1);
> +	return ret;
>  }
>  
>  /**
> @@ -6246,10 +6378,7 @@ int hisi_qm_init(struct hisi_qm *qm)
>  err_free_qm_memory:
>  	hisi_qm_memory_uninit(qm);
>  err_alloc_uacce:
> -	if (qm->use_sva) {
> -		uacce_remove(qm->uacce);
> -		qm->uacce = NULL;
> -	}
> +	qm_remove_uacce(qm);
>  err_irq_register:
>  	qm_irq_unregister(qm);
>  err_pci_init:
> diff --git a/include/linux/hisi_acc_qm.h b/include/linux/hisi_acc_qm.h
> index 116e8bd68c99..c17fd6de8551 100644
> --- a/include/linux/hisi_acc_qm.h
> +++ b/include/linux/hisi_acc_qm.h
> @@ -8,6 +8,7 @@
>  #include <linux/iopoll.h>
>  #include <linux/module.h>
>  #include <linux/pci.h>
> +#include <linux/uacce.h>
>  
>  #define QM_QNUM_V1			4096
>  #define QM_QNUM_V2			1024
> @@ -271,6 +272,11 @@ struct hisi_qm_poll_data {
>  	u16 *qp_finish_id;
>  };
>  
> +struct qm_err_isolate {
> +	u32 hw_err_isolate_hz;	/* user cfg freq which triggers isolation */
> +	atomic_t is_isolate;
> +};
> +
>  struct hisi_qm {
>  	enum qm_hw_ver ver;
>  	enum qm_fun_type fun_type;
> @@ -335,6 +341,9 @@ struct hisi_qm {
>  	struct qm_shaper_factor *factor;
>  	u32 mb_qos;
>  	u32 type_rate;
> +	struct list_head uacce_hw_errs;
> +	atomic_t uacce_ref; /* reference of the uacce */

That is not how reference counts work, sorry.  Please use 'struct kref'
for a real reference count, never roll your own.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 25+ messages in thread
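
For illustration only (not code from this series), the parsing done in the
quoted hisi_qm_isolate_strategy_write() (kstrtoul() followed by a range
check against MAX_ISOLATE_STRATEGY) can be approximated in userspace with
strtoul(). The helper name parse_isolate_threshold is hypothetical, and the
handling of one trailing newline is written out by hand here.

```c
#include <errno.h>
#include <stdlib.h>

#define MAX_ISOLATE_STRATEGY 65535UL /* same upper bound as in the patch */

/* Parse a threshold from `buf`; return 0 on success or -EINVAL. */
static int parse_isolate_threshold(const char *buf, unsigned long *out)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(buf, &end, 0); /* base 0: decimal, 0x hex or 0 octal */
	if (errno || end == buf)
		return -EINVAL;

	/* Allow a single trailing newline, reject anything else. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (val > MAX_ISOLATE_STRATEGY)
		return -EINVAL;

	*out = val;
	return 0;
}
```

A value of 0 is accepted by this check; in the patch a threshold of 0 makes
qm_hw_err_isolate() return early, so it effectively disables isolation.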

* Re: [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy
  2022-06-14 12:43   ` Greg KH
@ 2022-06-14 13:24     ` yekai(A)
  2022-06-14 13:29       ` Greg KH
  0 siblings, 1 reply; 25+ messages in thread
From: yekai(A) @ 2022-06-14 13:24 UTC (permalink / raw)
  To: Greg KH
  Cc: herbert, linux-crypto, linux-accelerators, linux-kernel,
	linuxarm, zhangfei.gao, wangzhou1



On 2022/6/14 20:43, Greg KH wrote:
> On Tue, Jun 14, 2022 at 08:29:40PM +0800, Kai Ye wrote:
>>  struct hisi_qm {
>>  	enum qm_hw_ver ver;
>>  	enum qm_fun_type fun_type;
>> @@ -335,6 +341,9 @@ struct hisi_qm {
>>  	struct qm_shaper_factor *factor;
>>  	u32 mb_qos;
>>  	u32 type_rate;
>> +	struct list_head uacce_hw_errs;
>> +	atomic_t uacce_ref; /* reference of the uacce */
>
> That is not how reference counts work, sorry.  Please use 'struct kref'
> for a real reference count, never roll your own.
>
> thanks,
>
> greg k-h
> .
>

This atomic_t reference is more lightweight than 'struct kref'. The
reference indicates whether a task is running. So would it be better to
use an atomic_t reference?

thanks
Kai

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy
  2022-06-14 13:24     ` yekai(A)
@ 2022-06-14 13:29       ` Greg KH
  2022-06-15  9:10         ` yekai(A)
  0 siblings, 1 reply; 25+ messages in thread
From: Greg KH @ 2022-06-14 13:29 UTC (permalink / raw)
  To: yekai(A)
  Cc: herbert, linux-crypto, linux-accelerators, linux-kernel,
	linuxarm, zhangfei.gao, wangzhou1

On Tue, Jun 14, 2022 at 09:24:08PM +0800, yekai(A) wrote:
> > >  struct hisi_qm {
> > >  	enum qm_hw_ver ver;
> > >  	enum qm_fun_type fun_type;
> > > @@ -335,6 +341,9 @@ struct hisi_qm {
> > >  	struct qm_shaper_factor *factor;
> > >  	u32 mb_qos;
> > >  	u32 type_rate;
> > > +	struct list_head uacce_hw_errs;
> > > +	atomic_t uacce_ref; /* reference of the uacce */
> > 
> > That is not how reference counts work, sorry.  Please use 'struct kref'
> > for a real reference count, never roll your own.
> > 
> > thanks,
> > 
> > greg k-h
> > .
> > 
> 
> this atomic_t reference is more lightweight than 'struct kref',

It's the same size, why would it be "lighter"?  Why do you need it to be
lighter, what performance issue is there with a kref?

> this reference
> indicates whether a task is running. So would it be better to use an
> atomic_t reference?

I do not know, as "running or not running" is a state, not a count or a
reference.  Why does this have to be atomic at all?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 25+ messages in thread
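
For illustration only (not code from this series), the role uacce_ref plays
in the patch can be modelled with C11 atomics: it counts queues currently
held by user space and gates configuration writes with -EBUSY, which is a
usage counter rather than an object-lifetime reference count. The names
dev_model, queue_get, queue_put and set_threshold are hypothetical.

```c
#include <errno.h>
#include <stdatomic.h>

struct dev_model {
	atomic_int   open_queues; /* queues currently held by user space */
	unsigned int threshold;   /* configured isolation threshold */
};

static void queue_get(struct dev_model *d)
{
	atomic_fetch_add(&d->open_queues, 1);
}

static void queue_put(struct dev_model *d)
{
	atomic_fetch_sub(&d->open_queues, 1);
}

/* Refuse to change the threshold while any queue is in use. */
static int set_threshold(struct dev_model *d, unsigned int val)
{
	if (atomic_load(&d->open_queues) != 0)
		return -EBUSY;

	/* A queue may still be taken between the check and this store. */
	d->threshold = val;
	return 0;
}
```

The check and the store are two separate steps, so the atomic counter alone
does not stop a queue from being opened in between; holding a lock across
both, or checking an existing device state under that lock, would close the
gap.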

* Re: [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy
  2022-06-14 12:29 ` [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy Kai Ye
  2022-06-14 12:43   ` Greg KH
@ 2022-06-14 14:12   ` Zhangfei Gao
  2022-06-15 13:02   ` Jonathan Cameron
  2 siblings, 0 replies; 25+ messages in thread
From: Zhangfei Gao @ 2022-06-14 14:12 UTC (permalink / raw)
  To: Kai Ye, gregkh, herbert
  Cc: linuxarm, linux-kernel, wangzhou1, linux-crypto, linux-accelerators



On 2022/6/14 8:29 PM, Kai Ye via Linux-accelerators wrote:
> Define the device isolation strategy in the device driver. If the
> AER error frequency exceeds the configured value within a certain
> period of time, the device will not be available in user space. The VF
> device uses the PF device's isolation strategy. All hardware errors
> are processed by the PF driver.
>
> Signed-off-by: Kai Ye <yekai13@huawei.com>
> ---
>   drivers/crypto/hisilicon/qm.c | 157 +++++++++++++++++++++++++++++++---
>   include/linux/hisi_acc_qm.h   |   9 ++
>   2 files changed, 152 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c
> index ad83c194d664..47c41fa52693 100644
> --- a/drivers/crypto/hisilicon/qm.c
> +++ b/drivers/crypto/hisilicon/qm.c
> @@ -12,7 +12,6 @@
>   #include <linux/pm_runtime.h>
>   #include <linux/seq_file.h>
>   #include <linux/slab.h>
> -#include <linux/uacce.h>
>   #include <linux/uaccess.h>
>   #include <uapi/misc/uacce/hisi_qm.h>
>   #include <linux/hisi_acc_qm.h>
> @@ -417,6 +416,16 @@ struct hisi_qm_resource {
>   	struct list_head list;
>   };
>   
> +/**
> + * struct qm_hw_err - structure of describes the device err
> + * @list: hardware error list
> + * @tick_stamp: timestamp when the error occurred
> + */
> +struct qm_hw_err {
> +	struct list_head list;
> +	unsigned long long tick_stamp;
> +};
> +
>   struct hisi_qm_hw_ops {
>   	int (*get_vft)(struct hisi_qm *qm, u32 *base, u32 *number);
>   	void (*qm_db)(struct hisi_qm *qm, u16 qn,
> @@ -3278,6 +3287,7 @@ static int hisi_qm_uacce_get_queue(struct uacce_device *uacce,
>   	qp->event_cb = qm_qp_event_notifier;
>   	qp->pasid = arg;
>   	qp->is_in_kernel = false;
> +	atomic_inc(&qm->uacce_ref);
>   
>   	return 0;
>   }
> @@ -3285,7 +3295,9 @@ static int hisi_qm_uacce_get_queue(struct uacce_device *uacce,
>   static void hisi_qm_uacce_put_queue(struct uacce_queue *q)
>   {
>   	struct hisi_qp *qp = q->priv;
> +	struct hisi_qm *qm = qp->qm;
>   
> +	atomic_dec(&qm->uacce_ref);

Can we use qm state or qp state instead?

enum qm_state {
         QM_INIT = 0,
         QM_START,
         QM_CLOSE,
         QM_STOP,
};

enum qp_state {
         QP_INIT = 1,
         QP_START,
         QP_STOP,
         QP_CLOSE,
};
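
A minimal userspace sketch of what that could look like, for illustration
only. It assumes that QM_START is the state that means "in use", which is a
driver decision and may not match the real hisi_qm semantics. The struct
demo_qm type and the demo_* helpers are hypothetical, and locking is left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Mirrors the qm_state values quoted above. */
enum qm_state {
	QM_INIT = 0,
	QM_START,
	QM_CLOSE,
	QM_STOP,
};

/* Hypothetical stand-in for struct hisi_qm; only the fields used here. */
struct demo_qm {
	enum qm_state state;
	unsigned int isolate_threshold;
};

/* Assumption: a started qm counts as "in use". */
static bool demo_qm_in_use(const struct demo_qm *qm)
{
	return qm->state == QM_START;
}

/* Refuse to change the threshold while the device is in use. */
static int demo_set_threshold(struct demo_qm *qm, unsigned int val)
{
	if (demo_qm_in_use(qm))
		return -EBUSY;

	qm->isolate_threshold = val;
	return 0;
}
```

The point is only that a state check answers "is it running?" directly, so no
separate counter has to be kept in sync with get_queue/put_queue.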

Thanks


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] uacce: supports device isolation feature
  2022-06-14 12:29 ` [PATCH 1/3] uacce: supports device isolation feature Kai Ye
@ 2022-06-14 14:14   ` Zhangfei Gao
  2022-06-15  1:07     ` yekai(A)
  0 siblings, 1 reply; 25+ messages in thread
From: Zhangfei Gao @ 2022-06-14 14:14 UTC (permalink / raw)
  To: Kai Ye, gregkh, herbert
  Cc: linux-crypto, linux-accelerators, linux-kernel, linuxarm, wangzhou1



On 2022/6/14 8:29 PM, Kai Ye wrote:
> UACCE add the hardware error isolation API. Users can configure
> the error frequency threshold by this vfs node. This API interface
> certainly supports the configuration of user protocol strategy. Then
> parse it inside the device driver. UACCE only reports the device
> isolate state. When the error frequency is exceeded, the device
> will be isolated. The isolation strategy should be defined in each
> driver module.
>
> Signed-off-by: Kai Ye <yekai13@huawei.com>
> Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com>
> ---
>   drivers/misc/uacce/uacce.c | 51 ++++++++++++++++++++++++++++++++++++++
>   include/linux/uacce.h      | 15 ++++++++++-
>   2 files changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
> index b6219c6bfb48..4d9d9aeb145a 100644
> --- a/drivers/misc/uacce/uacce.c
> +++ b/drivers/misc/uacce/uacce.c
> @@ -12,6 +12,20 @@ static dev_t uacce_devt;
>   static DEFINE_MUTEX(uacce_mutex);
>   static DEFINE_XARRAY_ALLOC(uacce_xa);
>   
> +static int cdev_get(struct device *dev, void *data)
> +{
> +	struct uacce_device *uacce;
> +	struct device **t_dev = data;
> +
> +	uacce = container_of(dev, struct uacce_device, dev);
> +	if (uacce->parent == *t_dev) {
> +		*t_dev = dev;
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
>   static int uacce_start_queue(struct uacce_queue *q)
>   {
>   	int ret = 0;
> @@ -346,12 +360,47 @@ static ssize_t region_dus_size_show(struct device *dev,
>   		       uacce->qf_pg_num[UACCE_QFRT_DUS] << PAGE_SHIFT);
>   }
>   
> +static ssize_t isolate_show(struct device *dev,
> +			    struct device_attribute *attr, char *buf)
> +{
> +	struct uacce_device *uacce = to_uacce_device(dev);
> +
> +	return sysfs_emit(buf, "%d\n", uacce->ops->get_isolate_state(uacce));
Are these two isolate ops required or optional?
Do we need to consider a NULL pointer?
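
If they are optional, the show path could guard the call. Below is a minimal
userspace sketch, not the uacce code: struct demo_dev, struct demo_ops and the
demo_* functions are hypothetical, and returning -ENODEV is an assumption
borrowed from how the ABI document describes available_instances.

```c
#include <errno.h>
#include <stddef.h>

struct demo_dev;

/* Hypothetical ops table; the callback is optional and may be NULL. */
struct demo_ops {
	int (*get_isolate_state)(struct demo_dev *dev);
};

struct demo_dev {
	const struct demo_ops *ops;
	int isolated;
};

/* Report the isolate state, or -ENODEV if the driver has no callback. */
static int demo_isolate_show(struct demo_dev *dev)
{
	if (!dev->ops || !dev->ops->get_isolate_state)
		return -ENODEV;

	return dev->ops->get_isolate_state(dev);
}

/* Example driver callback. */
static int demo_get_state(struct demo_dev *dev)
{
	return dev->isolated;
}
```

In the kernel, another option would be to hide the attribute altogether
through the attribute group's is_visible callback when the op is missing.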

Thanks

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] uacce: supports device isolation feature
  2022-06-14 14:14   ` Zhangfei Gao
@ 2022-06-15  1:07     ` yekai(A)
  0 siblings, 0 replies; 25+ messages in thread
From: yekai(A) @ 2022-06-15  1:07 UTC (permalink / raw)
  To: Zhangfei Gao, gregkh, herbert
  Cc: linux-crypto, linux-accelerators, linux-kernel, linuxarm, wangzhou1



On 2022/6/14 22:14, Zhangfei Gao wrote:
>
>
> On 2022/6/14 8:29 PM, Kai Ye wrote:
>> UACCE add the hardware error isolation API. Users can configure
>> the error frequency threshold by this vfs node. This API interface
>> certainly supports the configuration of user protocol strategy. Then
>> parse it inside the device driver. UACCE only reports the device
>> isolate state. When the error frequency is exceeded, the device
>> will be isolated. The isolation strategy should be defined in each
>> driver module.
>>
>> Signed-off-by: Kai Ye <yekai13@huawei.com>
>> Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com>
>> ---
>>   drivers/misc/uacce/uacce.c | 51 ++++++++++++++++++++++++++++++++++++++
>>   include/linux/uacce.h      | 15 ++++++++++-
>>   2 files changed, 65 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
>> index b6219c6bfb48..4d9d9aeb145a 100644
>> --- a/drivers/misc/uacce/uacce.c
>> +++ b/drivers/misc/uacce/uacce.c
>> @@ -12,6 +12,20 @@ static dev_t uacce_devt;
>>   static DEFINE_MUTEX(uacce_mutex);
>>   static DEFINE_XARRAY_ALLOC(uacce_xa);
>>   +static int cdev_get(struct device *dev, void *data)
>> +{
>> +    struct uacce_device *uacce;
>> +    struct device **t_dev = data;
>> +
>> +    uacce = container_of(dev, struct uacce_device, dev);
>> +    if (uacce->parent == *t_dev) {
>> +        *t_dev = dev;
>> +        return 1;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   static int uacce_start_queue(struct uacce_queue *q)
>>   {
>>       int ret = 0;
>> @@ -346,12 +360,47 @@ static ssize_t region_dus_size_show(struct
>> device *dev,
>>                  uacce->qf_pg_num[UACCE_QFRT_DUS] << PAGE_SHIFT);
>>   }
>>   +static ssize_t isolate_show(struct device *dev,
>> +                struct device_attribute *attr, char *buf)
>> +{
>> +    struct uacce_device *uacce = to_uacce_device(dev);
>> +
>> +    return sysfs_emit(buf, "%d\n",
>> uacce->ops->get_isolate_state(uacce));
> Are these two isolate ops required or optional?
> Do we need to consider a NULL pointer?
>
> Thanks
> .
>

Yes, we need to consider the NULL pointer case.

Thanks
kai

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/3] Documentation: add a isolation strategy vfs node for uacce
  2022-06-14 12:41   ` Greg KH
@ 2022-06-15  8:48     ` Jonathan Cameron
  2022-06-15  9:18       ` yekai(A)
  0 siblings, 1 reply; 25+ messages in thread
From: Jonathan Cameron @ 2022-06-15  8:48 UTC (permalink / raw)
  To: Greg KH
  Cc: Kai Ye, herbert, linux-kernel, linuxarm, wangzhou1, linux-crypto,
	zhangfei.gao, linux-accelerators

On Tue, 14 Jun 2022 14:41:52 +0200
Greg KH <gregkh@linuxfoundation.org> wrote:

> On Tue, Jun 14, 2022 at 08:29:39PM +0800, Kai Ye wrote:
> > Update documentation describing DebugFS that could help to
> > configure hard error frequency for users in th user space.
> > 
> > Signed-off-by: Kai Ye <yekai13@huawei.com>
> > ---
> >  Documentation/ABI/testing/sysfs-driver-uacce | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
> > index 08f2591138af..0c4226364182 100644
> > --- a/Documentation/ABI/testing/sysfs-driver-uacce
> > +++ b/Documentation/ABI/testing/sysfs-driver-uacce
> > @@ -19,6 +19,23 @@ Contact:        linux-accelerators@lists.ozlabs.org
> >  Description:    Available instances left of the device
> >                  Return -ENODEV if uacce_ops get_available_instances is not provided
> >  
> > +What:           /sys/class/uacce/<dev_name>/isolate_strategy
> > +Date:           Jun 2022
> > +KernelVersion:  5.19
> > +Contact:        linux-accelerators@lists.ozlabs.org
> > +Description:    A vfs node that used to configures the hardware  
> 
> What is a "vfs node"?
> 
> > +                error frequency. This frequency is abstract. Like once an hour
> > +                or once a day. The specific isolation strategy can be defined in
> > +                each driver module.  
> 
> No, you need to be specific here and describe the units and the format.
> Otherwise it is no description at all :(

Also, rename it.   A frequency isn't a strategy.  Strategy would be something
like:

* First fault
* Faults in moving time window.
* Faults in fixed time window.

some of which would then need separate controls for the threshold and the
time window - those should be in separate sysfs attributes.
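
To make the distinction concrete, here is a rough userspace sketch of the
three strategies with the threshold and window kept as separate controls. All
names (demo_strategy, demo_policy, demo_should_isolate) are made up for
illustration, times are plain seconds, and the fault timestamps are assumed to
be no later than "now".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strategy selector. */
enum demo_strategy {
	DEMO_FIRST_FAULT,
	DEMO_MOVING_WINDOW,
	DEMO_FIXED_WINDOW,
};

/* Threshold and window are separate controls, as suggested above. */
struct demo_policy {
	enum demo_strategy strategy;
	uint32_t threshold;	/* faults per window that trigger isolation */
	uint64_t window_sec;	/* window length in seconds */
};

/* Decide whether the recorded faults should isolate the device at 'now'. */
static bool demo_should_isolate(const struct demo_policy *p,
				const uint64_t *faults, size_t n, uint64_t now)
{
	uint64_t start = 0;
	size_t i, hits = 0;

	if (n == 0)
		return false;
	if (p->strategy == DEMO_FIRST_FAULT)
		return true;
	if (!p->threshold || !p->window_sec)
		return false;	/* treated as "disabled" in this sketch */

	switch (p->strategy) {
	case DEMO_MOVING_WINDOW:
		/* Window slides with 'now'. */
		start = now >= p->window_sec ? now - p->window_sec : 0;
		break;
	case DEMO_FIXED_WINDOW:
		/* Window is aligned to multiples of window_sec. */
		start = now - now % p->window_sec;
		break;
	default:
		return false;
	}

	for (i = 0; i < n; i++)
		if (faults[i] >= start && faults[i] <= now)
			hits++;

	return hits >= p->threshold;
}
```

With faults at 100, 3500 and 3700 seconds, a threshold of 3 and a one hour
window, the moving window isolates at t=3700 while the fixed window (which
restarted at t=3600) does not.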

> 
> > +
> > +What:           /sys/class/uacce/<dev_name>/isolate
> > +Date:           Jun 2022
> > +KernelVersion:  5.19  
> 
> 5.19 will not have this change.
> 
> > +Contact:        linux-accelerators@lists.ozlabs.org
> > +Description:    A vfs node that show the device isolated state. The value 0
> > +                means that the device is working. The value 1 means that the
> > +                device has been isolated.  
> 
> What does "working" or "isolated" mean?
> 
> thanks,
> 
> greg k-h


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/3] uacce: supports device isolation feature
  2022-06-14 12:29 ` [PATCH v2 1/3] uacce: " Kai Ye
  2022-06-14 12:42   ` Greg KH
@ 2022-06-15  8:52   ` Jonathan Cameron
  2022-06-15  9:06     ` yekai(A)
  1 sibling, 1 reply; 25+ messages in thread
From: Jonathan Cameron @ 2022-06-15  8:52 UTC (permalink / raw)
  To: Kai Ye via Linux-accelerators
  Cc: Kai Ye, gregkh, herbert, linuxarm, linux-kernel, wangzhou1,
	linux-crypto, zhangfei.gao

On Tue, 14 Jun 2022 20:29:38 +0800
Kai Ye via Linux-accelerators <linux-accelerators@lists.ozlabs.org> wrote:

> UACCE add the hardware error isolation API. Users can configure
> the error frequency threshold by this vfs node. This API interface
> certainly supports the configuration of user protocol strategy. Then
> parse it inside the device driver. UACCE only reports the device
> isolate state. When the error frequency is exceeded, the device
> will be isolated. The isolation strategy should be defined in each
> driver module.
> 
> Signed-off-by: Kai Ye <yekai13@huawei.com>
> Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com>
> ---
>  drivers/misc/uacce/uacce.c | 37 +++++++++++++++++++++++++++++++++++++
>  include/linux/uacce.h      | 16 +++++++++++++---
>  2 files changed, 50 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
> index b6219c6bfb48..525623215132 100644
> --- a/drivers/misc/uacce/uacce.c
> +++ b/drivers/misc/uacce/uacce.c
> @@ -346,12 +346,47 @@ static ssize_t region_dus_size_show(struct device *dev,
>  		       uacce->qf_pg_num[UACCE_QFRT_DUS] << PAGE_SHIFT);
>  }
>  
> +static ssize_t isolate_show(struct device *dev,
> +			    struct device_attribute *attr, char *buf)
> +{
> +	struct uacce_device *uacce = to_uacce_device(dev);
> +
> +	return sysfs_emit(buf, "%d\n", uacce->ops->get_isolate_state(uacce));
> +}
> +
> +static ssize_t isolate_strategy_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	struct uacce_device *uacce = to_uacce_device(dev);
> +
> +	return sysfs_emit(buf, "%s\n", uacce->isolate_strategy);
> +}
> +
> +static ssize_t isolate_strategy_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t count)
> +{
> +	struct uacce_device *uacce = to_uacce_device(dev);
> +	int ret;
> +
> +	if (!buf || sizeof(buf) > UACCE_MAX_ISOLATE_STRATEGY_LEN)
> +		return -EINVAL;
> +
> +	memcpy(uacce->isolate_strategy, buf, strlen(buf));
What if it's not a valid strategy for the driver?  We shouldn't
store this until we know it's valid.

> +
> +	ret = uacce->ops->isolate_strategy_write(uacce, buf);
Having copied the buf into uacce, why pass it as well?

My preference would be to pass buf and length and not do
the memcpy in here.  Leave that choice to the driver.
If this were a single value, it would be better stored
as an integer than as a string.   Obviously that means
you need an isolate_strategy_read() as well (that also
solves the comment above about not storing what was written
until we know it was valid).
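
Roughly what I have in mind, as a userspace sketch rather than kernel code.
The demo_* names and struct demo_dev are hypothetical, strtoul() stands in for
kstrtoul(), and the 65535 limit is taken from MAX_ISOLATE_STRATEGY in patch 3.
The value is parsed and range checked first, and only stored once it is known
to be valid.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_THRESHOLD	65535UL

/* Hypothetical driver state: the setting lives here as an integer. */
struct demo_dev {
	unsigned long threshold;
};

/* Write callback: takes buf and length, stores nothing until validated. */
static int demo_threshold_write(struct demo_dev *dev, const char *buf,
				size_t len)
{
	char tmp[32];
	char *end;
	unsigned long val;

	if (!buf || len == 0 || len >= sizeof(tmp))
		return -EINVAL;

	memcpy(tmp, buf, len);
	tmp[len] = '\0';

	errno = 0;
	val = strtoul(tmp, &end, 0);
	if (end == tmp || errno)
		return -EINVAL;
	if (*end == '\n')	/* allow one trailing newline from echo */
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (val > DEMO_MAX_THRESHOLD)
		return -EINVAL;

	dev->threshold = val;	/* only reached for a valid value */
	return 0;
}

/* Read callback: report what the driver actually accepted. */
static unsigned long demo_threshold_read(const struct demo_dev *dev)
{
	return dev->threshold;
}
```

A rejected write leaves the previous value in place, so reading the attribute
back always reflects a setting the driver really uses.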

Thanks,

Jonathan



> +
> +	return ret ? ret : count;
> +}
> +
>  static DEVICE_ATTR_RO(api);
>  static DEVICE_ATTR_RO(flags);
>  static DEVICE_ATTR_RO(available_instances);
>  static DEVICE_ATTR_RO(algorithms);
>  static DEVICE_ATTR_RO(region_mmio_size);
>  static DEVICE_ATTR_RO(region_dus_size);
> +static DEVICE_ATTR_RO(isolate);
> +static DEVICE_ATTR_RW(isolate_strategy);
>  
>  static struct attribute *uacce_dev_attrs[] = {
>  	&dev_attr_api.attr,
> @@ -360,6 +395,8 @@ static struct attribute *uacce_dev_attrs[] = {
>  	&dev_attr_algorithms.attr,
>  	&dev_attr_region_mmio_size.attr,
>  	&dev_attr_region_dus_size.attr,
> +	&dev_attr_isolate.attr,
> +	&dev_attr_isolate_strategy.attr,
>  	NULL,
>  };
>  
> diff --git a/include/linux/uacce.h b/include/linux/uacce.h
> index 48e319f40275..0f7668bfa645 100644
> --- a/include/linux/uacce.h
> +++ b/include/linux/uacce.h
> @@ -8,6 +8,7 @@
>  #define UACCE_NAME		"uacce"
>  #define UACCE_MAX_REGION	2
>  #define UACCE_MAX_NAME_SIZE	64
> +#define UACCE_MAX_ISOLATE_STRATEGY_LEN	256
>  
>  struct uacce_queue;
>  struct uacce_device;
> @@ -30,6 +31,8 @@ struct uacce_qfile_region {
>   * @is_q_updated: check whether the task is finished
>   * @mmap: mmap addresses of queue to user space
>   * @ioctl: ioctl for user space users of the queue
> + * @get_isolate_state: get the device state after set the isolate strategy
> + * @isolate_strategy_store: stored the isolate strategy to the device
>   */
>  struct uacce_ops {
>  	int (*get_available_instances)(struct uacce_device *uacce);
> @@ -43,6 +46,8 @@ struct uacce_ops {
>  		    struct uacce_qfile_region *qfr);
>  	long (*ioctl)(struct uacce_queue *q, unsigned int cmd,
>  		      unsigned long arg);
> +	enum uacce_dev_state (*get_isolate_state)(struct uacce_device *uacce);
> +	int (*isolate_strategy_write)(struct uacce_device *uacce, const char *buf);
>  };
>  
>  /**
> @@ -57,6 +62,12 @@ struct uacce_interface {
>  	const struct uacce_ops *ops;
>  };
>  
> +enum uacce_dev_state {
> +	UACCE_DEV_ERR = -1,
> +	UACCE_DEV_NORMAL,
> +	UACCE_DEV_ISOLATE,
> +};
> +
>  enum uacce_q_state {
>  	UACCE_Q_ZOMBIE = 0,
>  	UACCE_Q_INIT,
> @@ -117,6 +128,7 @@ struct uacce_device {
>  	struct list_head queues;
>  	struct mutex queues_lock;
>  	struct inode *inode;
> +	char isolate_strategy[UACCE_MAX_ISOLATE_STRATEGY_LEN];
>  };
>  
>  #if IS_ENABLED(CONFIG_UACCE)
> @@ -125,7 +137,7 @@ struct uacce_device *uacce_alloc(struct device *parent,
>  				 struct uacce_interface *interface);
>  int uacce_register(struct uacce_device *uacce);
>  void uacce_remove(struct uacce_device *uacce);
> -
> +struct uacce_device *dev_to_uacce(struct device *dev);
>  #else /* CONFIG_UACCE */
>  
>  static inline
> @@ -140,8 +152,6 @@ static inline int uacce_register(struct uacce_device *uacce)
>  	return -EINVAL;
>  }
>  
> -static inline void uacce_remove(struct uacce_device *uacce) {}
> -
>  #endif /* CONFIG_UACCE */
>  
>  #endif /* _LINUX_UACCE_H */


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/3] uacce: supports device isolation feature
  2022-06-15  8:52   ` Jonathan Cameron
@ 2022-06-15  9:06     ` yekai(A)
  0 siblings, 0 replies; 25+ messages in thread
From: yekai(A) @ 2022-06-15  9:06 UTC (permalink / raw)
  To: Jonathan Cameron, Kai Ye via Linux-accelerators
  Cc: gregkh, herbert, linuxarm, linux-kernel, wangzhou1, linux-crypto,
	zhangfei.gao



On 2022/6/15 16:52, Jonathan Cameron wrote:
> On Tue, 14 Jun 2022 20:29:38 +0800
> Kai Ye via Linux-accelerators <linux-accelerators@lists.ozlabs.org> wrote:
>
>> UACCE add the hardware error isolation API. Users can configure
>> the error frequency threshold by this vfs node. This API interface
>> certainly supports the configuration of user protocol strategy. Then
>> parse it inside the device driver. UACCE only reports the device
>> isolate state. When the error frequency is exceeded, the device
>> will be isolated. The isolation strategy should be defined in each
>> driver module.
>>
>> Signed-off-by: Kai Ye <yekai13@huawei.com>
>> Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com>
>> ---
>>  drivers/misc/uacce/uacce.c | 37 +++++++++++++++++++++++++++++++++++++
>>  include/linux/uacce.h      | 16 +++++++++++++---
>>  2 files changed, 50 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
>> index b6219c6bfb48..525623215132 100644
>> --- a/drivers/misc/uacce/uacce.c
>> +++ b/drivers/misc/uacce/uacce.c
>> @@ -346,12 +346,47 @@ static ssize_t region_dus_size_show(struct device *dev,
>>  		       uacce->qf_pg_num[UACCE_QFRT_DUS] << PAGE_SHIFT);
>>  }
>>
>> +static ssize_t isolate_show(struct device *dev,
>> +			    struct device_attribute *attr, char *buf)
>> +{
>> +	struct uacce_device *uacce = to_uacce_device(dev);
>> +
>> +	return sysfs_emit(buf, "%d\n", uacce->ops->get_isolate_state(uacce));
>> +}
>> +
>> +static ssize_t isolate_strategy_show(struct device *dev,
>> +				     struct device_attribute *attr, char *buf)
>> +{
>> +	struct uacce_device *uacce = to_uacce_device(dev);
>> +
>> +	return sysfs_emit(buf, "%s\n", uacce->isolate_strategy);
>> +}
>> +
>> +static ssize_t isolate_strategy_store(struct device *dev,
>> +				      struct device_attribute *attr,
>> +				      const char *buf, size_t count)
>> +{
>> +	struct uacce_device *uacce = to_uacce_device(dev);
>> +	int ret;
>> +
>> +	if (!buf || sizeof(buf) > UACCE_MAX_ISOLATE_STRATEGY_LEN)
>> +		return -EINVAL;
>> +
>> +	memcpy(uacce->isolate_strategy, buf, strlen(buf));
> What if it's not a valid strategy for the driver?  We shouldn't
> store this until we know it's valid.
>
>> +
>> +	ret = uacce->ops->isolate_strategy_write(uacce, buf);
> Having copied the buf into uacce, why pass it as well?
>
> My preference would be to pass buf and length and not do
> the memcpy in here.  Leave that choice to the driver.
> If this were a single value, it would be better stored
> as an integer than as a string.   Obviously that means
> you need an isolate_strategy_read() as well (that also
> solves the comment above about not storing what was written
> until we know it was valid).
>
> Thanks,
>
> Jonathan

Good point, I agree. We need an isolate_strategy_read() instead of a copy.

thanks

Kai
>
>
>
>> +
>> +	return ret ? ret : count;
>> +}
>> +
>>  static DEVICE_ATTR_RO(api);
>>  static DEVICE_ATTR_RO(flags);
>>  static DEVICE_ATTR_RO(available_instances);
>>  static DEVICE_ATTR_RO(algorithms);
>>  static DEVICE_ATTR_RO(region_mmio_size);
>>  static DEVICE_ATTR_RO(region_dus_size);
>> +static DEVICE_ATTR_RO(isolate);
>> +static DEVICE_ATTR_RW(isolate_strategy);
>>
>>  static struct attribute *uacce_dev_attrs[] = {
>>  	&dev_attr_api.attr,
>> @@ -360,6 +395,8 @@ static struct attribute *uacce_dev_attrs[] = {
>>  	&dev_attr_algorithms.attr,
>>  	&dev_attr_region_mmio_size.attr,
>>  	&dev_attr_region_dus_size.attr,
>> +	&dev_attr_isolate.attr,
>> +	&dev_attr_isolate_strategy.attr,
>>  	NULL,
>>  };
>>
>> diff --git a/include/linux/uacce.h b/include/linux/uacce.h
>> index 48e319f40275..0f7668bfa645 100644
>> --- a/include/linux/uacce.h
>> +++ b/include/linux/uacce.h
>> @@ -8,6 +8,7 @@
>>  #define UACCE_NAME		"uacce"
>>  #define UACCE_MAX_REGION	2
>>  #define UACCE_MAX_NAME_SIZE	64
>> +#define UACCE_MAX_ISOLATE_STRATEGY_LEN	256
>>
>>  struct uacce_queue;
>>  struct uacce_device;
>> @@ -30,6 +31,8 @@ struct uacce_qfile_region {
>>   * @is_q_updated: check whether the task is finished
>>   * @mmap: mmap addresses of queue to user space
>>   * @ioctl: ioctl for user space users of the queue
>> + * @get_isolate_state: get the device state after set the isolate strategy
>> + * @isolate_strategy_store: stored the isolate strategy to the device
>>   */
>>  struct uacce_ops {
>>  	int (*get_available_instances)(struct uacce_device *uacce);
>> @@ -43,6 +46,8 @@ struct uacce_ops {
>>  		    struct uacce_qfile_region *qfr);
>>  	long (*ioctl)(struct uacce_queue *q, unsigned int cmd,
>>  		      unsigned long arg);
>> +	enum uacce_dev_state (*get_isolate_state)(struct uacce_device *uacce);
>> +	int (*isolate_strategy_write)(struct uacce_device *uacce, const char *buf);
>>  };
>>
>>  /**
>> @@ -57,6 +62,12 @@ struct uacce_interface {
>>  	const struct uacce_ops *ops;
>>  };
>>
>> +enum uacce_dev_state {
>> +	UACCE_DEV_ERR = -1,
>> +	UACCE_DEV_NORMAL,
>> +	UACCE_DEV_ISOLATE,
>> +};
>> +
>>  enum uacce_q_state {
>>  	UACCE_Q_ZOMBIE = 0,
>>  	UACCE_Q_INIT,
>> @@ -117,6 +128,7 @@ struct uacce_device {
>>  	struct list_head queues;
>>  	struct mutex queues_lock;
>>  	struct inode *inode;
>> +	char isolate_strategy[UACCE_MAX_ISOLATE_STRATEGY_LEN];
>>  };
>>
>>  #if IS_ENABLED(CONFIG_UACCE)
>> @@ -125,7 +137,7 @@ struct uacce_device *uacce_alloc(struct device *parent,
>>  				 struct uacce_interface *interface);
>>  int uacce_register(struct uacce_device *uacce);
>>  void uacce_remove(struct uacce_device *uacce);
>> -
>> +struct uacce_device *dev_to_uacce(struct device *dev);
>>  #else /* CONFIG_UACCE */
>>
>>  static inline
>> @@ -140,8 +152,6 @@ static inline int uacce_register(struct uacce_device *uacce)
>>  	return -EINVAL;
>>  }
>>
>> -static inline void uacce_remove(struct uacce_device *uacce) {}
>> -
>>  #endif /* CONFIG_UACCE */
>>
>>  #endif /* _LINUX_UACCE_H */
>
> .
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy
  2022-06-14 13:29       ` Greg KH
@ 2022-06-15  9:10         ` yekai(A)
  0 siblings, 0 replies; 25+ messages in thread
From: yekai(A) @ 2022-06-15  9:10 UTC (permalink / raw)
  To: Greg KH
  Cc: herbert, linux-crypto, linux-accelerators, linux-kernel,
	linuxarm, zhangfei.gao, wangzhou1



On 2022/6/14 21:29, Greg KH wrote:
> On Tue, Jun 14, 2022 at 09:24:08PM +0800, yekai(A) wrote:
>>>>  struct hisi_qm {
>>>>  	enum qm_hw_ver ver;
>>>>  	enum qm_fun_type fun_type;
>>>> @@ -335,6 +341,9 @@ struct hisi_qm {
>>>>  	struct qm_shaper_factor *factor;
>>>>  	u32 mb_qos;
>>>>  	u32 type_rate;
>>>> +	struct list_head uacce_hw_errs;
>>>> +	atomic_t uacce_ref; /* reference of the uacce */
>>>
>>> That is not how reference counts work, sorry.  Please use 'struct kref'
>>> for a real reference count, never roll your own.
>>>
>>> thanks,
>>>
>>> greg k-h
>>> .
>>>
>>
>> this atomic_t reference is more lightweight than 'struct kref',
>
> It's the same size, why would it be "lighter"?  Why do you need it to be
> lighter, what performance issue is there with a kref?
>
>> this reference
>> means whether the task is running. So would it be better to use atomic_t
>> reference?
>
> I do not know, as "running or not running" is a state, not a count or a
> reference.  why does this have to be atomic at all?
>
> thanks,
>
> greg k-h
> .
>

I will use 'qm_state' instead of a reference count, as Zhangfei Gao suggested.

Thanks
Kai

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/3] Documentation: add a isolation strategy vfs node for uacce
  2022-06-15  8:48     ` Jonathan Cameron
@ 2022-06-15  9:18       ` yekai(A)
  0 siblings, 0 replies; 25+ messages in thread
From: yekai(A) @ 2022-06-15  9:18 UTC (permalink / raw)
  To: Jonathan Cameron, Greg KH
  Cc: herbert, linux-kernel, linuxarm, wangzhou1, linux-crypto,
	zhangfei.gao, linux-accelerators



On 2022/6/15 16:48, Jonathan Cameron wrote:
> On Tue, 14 Jun 2022 14:41:52 +0200
> Greg KH <gregkh@linuxfoundation.org> wrote:
>
>> On Tue, Jun 14, 2022 at 08:29:39PM +0800, Kai Ye wrote:
>>> Update documentation describing DebugFS that could help to
>>> configure hard error frequency for users in th user space.
>>>
>>> Signed-off-by: Kai Ye <yekai13@huawei.com>
>>> ---
>>>  Documentation/ABI/testing/sysfs-driver-uacce | 17 +++++++++++++++++
>>>  1 file changed, 17 insertions(+)
>>>
>>> diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
>>> index 08f2591138af..0c4226364182 100644
>>> --- a/Documentation/ABI/testing/sysfs-driver-uacce
>>> +++ b/Documentation/ABI/testing/sysfs-driver-uacce
>>> @@ -19,6 +19,23 @@ Contact:        linux-accelerators@lists.ozlabs.org
>>>  Description:    Available instances left of the device
>>>                  Return -ENODEV if uacce_ops get_available_instances is not provided
>>>
>>> +What:           /sys/class/uacce/<dev_name>/isolate_strategy
>>> +Date:           Jun 2022
>>> +KernelVersion:  5.19
>>> +Contact:        linux-accelerators@lists.ozlabs.org
>>> +Description:    A vfs node that used to configures the hardware
>>
>> What is a "vfs node"?
>>
>>> +                error frequency. This frequency is abstract. Like once an hour
>>> +                or once a day. The specific isolation strategy can be defined in
>>> +                each driver module.
>>
>> No, you need to be specific here and describe the units and the format.
>> Otherwise it is no description at all :(
>
> Also, rename it.   A frequency isn't a strategy.  Strategy would be something
> like:
>
> * First fault
> * Faults in moving time window.
> * Faults in fixed time window.
>
> some of which would then need separate controls for the threshold and the
> time window - those should be in separate sysfs attributes.
>

I will describe the units and the format here.

Thanks

Kai
>>
>>> +
>>> +What:           /sys/class/uacce/<dev_name>/isolate
>>> +Date:           Jun 2022
>>> +KernelVersion:  5.19
>>
>> 5.19 will not have this change.
>>
>>> +Contact:        linux-accelerators@lists.ozlabs.org
>>> +Description:    A vfs node that show the device isolated state. The value 0
>>> +                means that the device is working. The value 1 means that the
>>> +                device has been isolated.
>>
>> What does "working" or "isolated" mean?
>>
>> thanks,
>>
>> greg k-h
>
> .
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy
  2022-06-14 12:29 ` [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy Kai Ye
  2022-06-14 12:43   ` Greg KH
  2022-06-14 14:12   ` Zhangfei Gao
@ 2022-06-15 13:02   ` Jonathan Cameron
  2022-06-16  1:33     ` yekai(A)
  2 siblings, 1 reply; 25+ messages in thread
From: Jonathan Cameron @ 2022-06-15 13:02 UTC (permalink / raw)
  To: Kai Ye via Linux-accelerators
  Cc: Kai Ye, gregkh, herbert, linuxarm, linux-kernel, wangzhou1,
	linux-crypto, zhangfei.gao

On Tue, 14 Jun 2022 20:29:40 +0800
Kai Ye via Linux-accelerators <linux-accelerators@lists.ozlabs.org> wrote:

> Define the device isolation strategy by the device driver. if the
> AER error frequency exceeds the value of setting for a certain
> period of time, The device will not be available in user space. The VF
> device use the PF device isolation strategy. All the hardware errors
> are processed by PF driver.
> 
> Signed-off-by: Kai Ye <yekai13@huawei.com>

I'll try and avoid duplicating Greg's feedback but might well overlap a bit!

> ---
>  drivers/crypto/hisilicon/qm.c | 157 +++++++++++++++++++++++++++++++---
>  include/linux/hisi_acc_qm.h   |   9 ++
>  2 files changed, 152 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c
> index ad83c194d664..47c41fa52693 100644
> --- a/drivers/crypto/hisilicon/qm.c
> +++ b/drivers/crypto/hisilicon/qm.c
> @@ -12,7 +12,6 @@
>  #include <linux/pm_runtime.h>
>  #include <linux/seq_file.h>
>  #include <linux/slab.h>
> -#include <linux/uacce.h>

I assume you do this because you are now relying on hisi_acc_qm.h including
uacce.h?   Generally it is better to include most headers that we use
directly, so if this still uses stuff from uacce.h then keep the include.

>  #include <linux/uaccess.h>
>  #include <uapi/misc/uacce/hisi_qm.h>
>  #include <linux/hisi_acc_qm.h>
> @@ -417,6 +416,16 @@ struct hisi_qm_resource {
>  	struct list_head list;
>  };
>  
> +/**
> + * struct qm_hw_err - structure of describes the device err
> + * @list: hardware error list
> + * @tick_stamp: timestamp when the error occurred

tick?   Perhaps just call it timestamp if that is what it is...


> + */
> +struct qm_hw_err {
> +	struct list_head list;
> +	unsigned long long tick_stamp;
> +};
> +

>  
> +/**
> + * qm_hw_err_isolate() - Try to isolate the uacce device with its VFs
> + * @qm: The qm which we want to configure.
> + *
> + * according to user's configuration of isolation strategy. Warning: this

Rewrite to make it a full sentence.

> + * API should be called while there is no user on the device, or the users
> + * on this device are suspended by slot resetting preparation of PCI AER.
> + */
> +static int qm_hw_err_isolate(struct hisi_qm *qm)
> +{
> +	struct qm_hw_err *err, *tmp, *hw_err;
> +	struct qm_err_isolate *isolate;
> +	u32 count = 0;
> +
> +	isolate = &qm->isolate_data;
> +
> +#define SECONDS_PER_HOUR	3600
> +
> +	/* All the hw errs are processed by PF driver */
> +	if (qm->uacce->is_vf || atomic_read(&isolate->is_isolate) ||
> +		!isolate->hw_err_isolate_hz)
> +		return 0;
> +
> +	hw_err = kzalloc(sizeof(*hw_err), GFP_ATOMIC);
> +	if (!hw_err)
> +		return -ENOMEM;
blank line here to separate error handling from next bit of code.
 
> +	hw_err->tick_stamp = jiffies;
> +	list_for_each_entry_safe(err, tmp, &qm->uacce_hw_errs, list) {

These are ordered (I think). Could take advantage of that by
maintaining count of elements in parallel to the list then walking
list in right direction + stop when you reach last one to need
deleting.
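
Something along these lines, sketched in userspace C with a hand rolled list
instead of list_head. The demo_* names are hypothetical and timestamps are
assumed to be non-decreasing. Entries are kept oldest first, so pruning stops
at the first entry that is still inside the window, and the count is
maintained alongside the list rather than recomputed on every error.

```c
#include <stdint.h>
#include <stdlib.h>

/* One recorded hardware error. */
struct demo_err {
	struct demo_err *newer;	/* next entry towards the newest end */
	uint64_t stamp;
};

/* Time ordered log with a count maintained alongside the list. */
struct demo_err_log {
	struct demo_err *oldest;
	struct demo_err *newest;
	unsigned int count;
};

/*
 * Record a fault at 'now' and drop entries older than 'window'.
 * Returns the number of entries left in the window (including the
 * new one), or -1 if allocation fails.
 */
static int demo_log_fault(struct demo_err_log *log, uint64_t now,
			  uint64_t window)
{
	struct demo_err *e = calloc(1, sizeof(*e));

	if (!e)
		return -1;
	e->stamp = now;

	/* Oldest first: stop at the first entry still inside the window. */
	while (log->oldest && now - log->oldest->stamp > window) {
		struct demo_err *dead = log->oldest;

		log->oldest = dead->newer;
		free(dead);
		log->count--;
	}
	if (!log->oldest)
		log->newest = NULL;

	/* Append the new entry at the newest end. */
	if (log->newest)
		log->newest->newer = e;
	else
		log->oldest = e;
	log->newest = e;
	log->count++;

	return (int)log->count;
}

/* Release every entry, e.g. on device removal. */
static void demo_log_free(struct demo_err_log *log)
{
	while (log->oldest) {
		struct demo_err *dead = log->oldest;

		log->oldest = dead->newer;
		free(dead);
	}
	log->newest = NULL;
	log->count = 0;
}
```

Note that this count includes the fault just recorded, which differs from the
loop in the patch, so the threshold comparison would need adjusting to match.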


> +		if ((hw_err->tick_stamp - err->tick_stamp) / HZ >
> +		    SECONDS_PER_HOUR) {
> +			list_del(&err->list);
> +			kfree(err);
> +		} else {
> +			count++;
> +		}
> +	}
> +	list_add(&hw_err->list, &qm->uacce_hw_errs);
> +
> +	if (count >= isolate->hw_err_isolate_hz)
> +		atomic_set(&isolate->is_isolate, 1);
> +
> +	return 0;
> +}
> +

...

> +static int hisi_qm_isolate_strategy_write(struct uacce_device *uacce,
> +						const char *buf)
> +{
> +	struct hisi_qm *qm = uacce->priv;
> +	unsigned long val = 0;
> +
> +#define MAX_ISOLATE_STRATEGY	65535
> +
> +	if (atomic_read(&qm->uacce_ref))
> +		return -EBUSY;
> +
> +	/* must be set by PF */
> +	if (atomic_read(&qm->isolate_data.is_isolate) || uacce->is_vf)

Why is the file visible on the vf?  Hide it or don't register it for vfs.

> +		return -EINVAL;
> +
> +	if (kstrtoul(buf, 0, &val) < 0)
> +		return -EINVAL;
> +
> +	if (val > MAX_ISOLATE_STRATEGY)
> +		return -EINVAL;
> +
> +	qm->isolate_data.hw_err_isolate_hz = val;
> +	dev_info(&qm->pdev->dev,
> +		"the value of isolate_strategy is set to %lu.\n", val);

This is just noise in the log.  If someone wants to check, they should read
the sysfs file back and it should reflect the new state.

> +
> +	return 0;
> +}
> +

...

>  static int qm_alloc_uacce(struct hisi_qm *qm)
>  {
>  	struct pci_dev *pdev = qm->pdev;
> @@ -3433,6 +3554,7 @@ static int qm_alloc_uacce(struct hisi_qm *qm)
>  	};
>  	int ret;
>  
> +	INIT_LIST_HEAD(&qm->uacce_hw_errs);
>  	ret = strscpy(interface.name, dev_driver_string(&pdev->dev),
>  		      sizeof(interface.name));
>  	if (ret < 0)
> @@ -3446,8 +3568,7 @@ static int qm_alloc_uacce(struct hisi_qm *qm)
>  		qm->use_sva = true;
>  	} else {
>  		/* only consider sva case */
> -		uacce_remove(uacce);
> -		qm->uacce = NULL;
> +		qm_remove_uacce(qm);
>  		return -EINVAL;
>  	}
>  
> @@ -5109,6 +5230,12 @@ static int qm_controller_reset_prepare(struct hisi_qm *qm)
>  		return ret;
>  	}
>  
> +	if (qm->use_sva) {
> +		ret = qm_hw_err_isolate(qm);
> +		if (ret)
> +			pci_err(pdev, "failed to isolate hw err!\n");
> +	}
> +
>  	ret = qm_wait_vf_prepare_finish(qm);
>  	if (ret)
>  		pci_err(pdev, "failed to stop by vfs in soft reset!\n");
> @@ -5436,19 +5563,24 @@ static int qm_controller_reset(struct hisi_qm *qm)
>  	ret = qm_soft_reset(qm);
>  	if (ret) {
>  		pci_err(pdev, "Controller reset failed (%d)\n", ret);

This is printed below as well - probably best to drop this one and then you
can remove the brackets as well.

> -		qm_reset_bit_clear(qm);
> -		return ret;
> +		goto err_reset;
>  	}
>  
>  	ret = qm_controller_reset_done(qm);
> -	if (ret) {
> -		qm_reset_bit_clear(qm);
> -		return ret;
> -	}
> +	if (ret)
> +		goto err_reset;
>  
>  	pci_info(pdev, "Controller reset complete\n");
> -

Avoid noise via white space changes like this.  The white space was
good and generally don't change white space in a patch doing anything else.

>  	return 0;
> +
> +err_reset:
> +	pci_err(pdev, "Controller reset failed (%d)\n", ret);
> +	qm_reset_bit_clear(qm);
> +
> +	/* if resetting fails, isolate the device */
> +	if (qm->use_sva && !qm->uacce->is_vf)
> +		atomic_set(&qm->isolate_data.is_isolate, 1);
> +	return ret;
>  }
>  
>  /**
> @@ -6246,10 +6378,7 @@ int hisi_qm_init(struct hisi_qm *qm)
>  err_free_qm_memory:
>  	hisi_qm_memory_uninit(qm);
>  err_alloc_uacce:
> -	if (qm->use_sva) {
> -		uacce_remove(qm->uacce);
> -		qm->uacce = NULL;
> -	}
> +	qm_remove_uacce(qm);
>  err_irq_register:
>  	qm_irq_unregister(qm);
>  err_pci_init:


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy
  2022-06-15 13:02   ` Jonathan Cameron
@ 2022-06-16  1:33     ` yekai(A)
  2022-06-16 13:45       ` Jonathan Cameron
  0 siblings, 1 reply; 25+ messages in thread
From: yekai(A) @ 2022-06-16  1:33 UTC (permalink / raw)
  To: Jonathan Cameron, Kai Ye via Linux-accelerators
  Cc: gregkh, herbert, linuxarm, linux-kernel, wangzhou1, linux-crypto,
	zhangfei.gao



On 2022/6/15 21:02, Jonathan Cameron wrote:
> On Tue, 14 Jun 2022 20:29:40 +0800
> Kai Ye via Linux-accelerators <linux-accelerators@lists.ozlabs.org> wrote:
>
>> Define the device isolation strategy by the device driver. if the
>> AER error frequency exceeds the value of setting for a certain
>> period of time, The device will not be available in user space. The VF
>> device use the PF device isolation strategy. All the hardware errors
>> are processed by PF driver.
>>
>> Signed-off-by: Kai Ye <yekai13@huawei.com>
>
> I'll try and avoid duplicating Greg's feedback but might well overlap a bit!
>
>> ---
>>  drivers/crypto/hisilicon/qm.c | 157 +++++++++++++++++++++++++++++++---
>>  include/linux/hisi_acc_qm.h   |   9 ++
>>  2 files changed, 152 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c
>> index ad83c194d664..47c41fa52693 100644
>> --- a/drivers/crypto/hisilicon/qm.c
>> +++ b/drivers/crypto/hisilicon/qm.c
>> @@ -12,7 +12,6 @@
>>  #include <linux/pm_runtime.h>
>>  #include <linux/seq_file.h>
>>  #include <linux/slab.h>
>> -#include <linux/uacce.h>
>
> I assume you do this because you are now relying on hisi_acc_qm.h including
> uacce.h?   Generally it is better to include most headers that we use
> directly so it this still uses stuff from uacce.h then keep the include.

Yes
>
>>  #include <linux/uaccess.h>
>>  #include <uapi/misc/uacce/hisi_qm.h>
>>  #include <linux/hisi_acc_qm.h>
>> @@ -417,6 +416,16 @@ struct hisi_qm_resource {
>>  	struct list_head list;
>>  };
>>
>> +/**
>> + * struct qm_hw_err - structure of describes the device err
>> + * @list: hardware error list
>> + * @tick_stamp: timestamp when the error occurred
>
> tick?   Perhaps just call it timestamp if that is what it is...

OK, I will call it 'timestamp' instead of 'tick_stamp'.
>
>
>> + */
>> +struct qm_hw_err {
>> +	struct list_head list;
>> +	unsigned long long tick_stamp;
>> +};
>> +
>
>>
>> +/**
>> + * qm_hw_err_isolate() - Try to isolate the uacce device with its VFs
>> + * @qm: The qm which we want to configure.
>> + *
>> + * according to user's configuration of isolation strategy. Warning: this
>
> Rewrite to make it full sentence.
>
>> + * API should be called while there is no user on the device, or the users
>> + * on this device are suspended by slot resetting preparation of PCI AER.
>> + */
>> +static int qm_hw_err_isolate(struct hisi_qm *qm)
>> +{
>> +	struct qm_hw_err *err, *tmp, *hw_err;
>> +	struct qm_err_isolate *isolate;
>> +	u32 count = 0;
>> +
>> +	isolate = &qm->isolate_data;
>> +
>> +#define SECONDS_PER_HOUR	3600
>> +
>> +	/* All the hw errs are processed by PF driver */
>> +	if (qm->uacce->is_vf || atomic_read(&isolate->is_isolate) ||
>> +		!isolate->hw_err_isolate_hz)
>> +		return 0;
>> +
>> +	hw_err = kzalloc(sizeof(*hw_err), GFP_ATOMIC);
>> +	if (!hw_err)
>> +		return -ENOMEM;
> blank line here to separate error handling from next bit of code.

Yes
>
>> +	hw_err->tick_stamp = jiffies;
>> +	list_for_each_entry_safe(err, tmp, &qm->uacce_hw_errs, list) {
>
> These are ordered (I think). Could take advantage of that by
> maintaining count of elements in parallel to the list then walking
> list in right direction + stop when you reach last one to need
> deleting.
>

Thanks. The current list + jiffies solution seems simpler.
>
>> +		if ((hw_err->tick_stamp - err->tick_stamp) / HZ >
>> +		    SECONDS_PER_HOUR) {
>> +			list_del(&err->list);
>> +			kfree(err);
>> +		} else {
>> +			count++;
>> +		}
>> +	}
>> +	list_add(&hw_err->list, &qm->uacce_hw_errs);
>> +
>> +	if (count >= isolate->hw_err_isolate_hz)
>> +		atomic_set(&isolate->is_isolate, 1);
>> +
>> +	return 0;
>> +}
>> +
>
> ...
>
>> +static int hisi_qm_isolate_strategy_write(struct uacce_device *uacce,
>> +						const char *buf)
>> +{
>> +	struct hisi_qm *qm = uacce->priv;
>> +	unsigned long val = 0;
>> +
>> +#define MAX_ISOLATE_STRATEGY	65535
>> +
>> +	if (atomic_read(&qm->uacce_ref))
>> +		return -EBUSY;
>> +
>> +	/* must be set by PF */
>> +	if (atomic_read(&qm->isolate_data.is_isolate) || uacce->is_vf)
>
> Why is the file visible on the vf?  Hide it or don't register it for vfs.
Because VF devices can also be registered with UACCE, this file node is
visible on the VF. We are not sure whether other devices behave the
same way as qm, so I left this check to the driver: 'isolate_strategy'
must be set by the PF.

>
>> +		return -EINVAL;
>> +
>> +	if (kstrtoul(buf, 0, &val) < 0)
>> +		return -EINVAL;
>> +
>> +	if (val > MAX_ISOLATE_STRATEGY)
>> +		return -EINVAL;
>> +
>> +	qm->isolate_data.hw_err_isolate_hz = val;
>> +	dev_info(&qm->pdev->dev,
>> +		"the value of isolate_strategy is set to %lu.\n", val);
>
> This is just noise in the log.  If someone wants to check they should read
> the sysfs file back and it should reflect the new state.

Yes, I will delete it.
>
>> +
>> +	return 0;
>> +}
>> +
>
> ...
>
>>  static int qm_alloc_uacce(struct hisi_qm *qm)
>>  {
>>  	struct pci_dev *pdev = qm->pdev;
>> @@ -3433,6 +3554,7 @@ static int qm_alloc_uacce(struct hisi_qm *qm)
>>  	};
>>  	int ret;
>>
>> +	INIT_LIST_HEAD(&qm->uacce_hw_errs);
>>  	ret = strscpy(interface.name, dev_driver_string(&pdev->dev),
>>  		      sizeof(interface.name));
>>  	if (ret < 0)
>> @@ -3446,8 +3568,7 @@ static int qm_alloc_uacce(struct hisi_qm *qm)
>>  		qm->use_sva = true;
>>  	} else {
>>  		/* only consider sva case */
>> -		uacce_remove(uacce);
>> -		qm->uacce = NULL;
>> +		qm_remove_uacce(qm);
>>  		return -EINVAL;
>>  	}
>>
>> @@ -5109,6 +5230,12 @@ static int qm_controller_reset_prepare(struct hisi_qm *qm)
>>  		return ret;
>>  	}
>>
>> +	if (qm->use_sva) {
>> +		ret = qm_hw_err_isolate(qm);
>> +		if (ret)
>> +			pci_err(pdev, "failed to isolate hw err!\n");
>> +	}
>> +
>>  	ret = qm_wait_vf_prepare_finish(qm);
>>  	if (ret)
>>  		pci_err(pdev, "failed to stop by vfs in soft reset!\n");
>> @@ -5436,19 +5563,24 @@ static int qm_controller_reset(struct hisi_qm *qm)
>>  	ret = qm_soft_reset(qm);
>>  	if (ret) {
>>  		pci_err(pdev, "Controller reset failed (%d)\n", ret);
>
> This is printed below as well - probably best to drop this one and then you
> can remove the brackets as well.
>
>> -		qm_reset_bit_clear(qm);
>> -		return ret;
>> +		goto err_reset;
>>  	}
>>
>>  	ret = qm_controller_reset_done(qm);
>> -	if (ret) {
>> -		qm_reset_bit_clear(qm);
>> -		return ret;
>> -	}
>> +	if (ret)
>> +		goto err_reset;
>>
>>  	pci_info(pdev, "Controller reset complete\n");
>> -
>
> Avoid noise via white space changes like this.  The white space was
> good and generally don't change white space in a patch doing anything else.
>
>>  	return 0;
>> +
>> +err_reset:
>> +	pci_err(pdev, "Controller reset failed (%d)\n", ret);
>> +	qm_reset_bit_clear(qm);
>> +
>> +	/* if resetting fails, isolate the device */
>> +	if (qm->use_sva && !qm->uacce->is_vf)
>> +		atomic_set(&qm->isolate_data.is_isolate, 1);
>> +	return ret;
>>  }
>>
>>  /**
>> @@ -6246,10 +6378,7 @@ int hisi_qm_init(struct hisi_qm *qm)
>>  err_free_qm_memory:
>>  	hisi_qm_memory_uninit(qm);
>>  err_alloc_uacce:
>> -	if (qm->use_sva) {
>> -		uacce_remove(qm->uacce);
>> -		qm->uacce = NULL;
>> -	}
>> +	qm_remove_uacce(qm);
>>  err_irq_register:
>>  	qm_irq_unregister(qm);
>>  err_pci_init:
>
> .
>

thanks
Kai

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy
  2022-06-16  1:33     ` yekai(A)
@ 2022-06-16 13:45       ` Jonathan Cameron
  2022-06-17  2:07         ` yekai(A)
  0 siblings, 1 reply; 25+ messages in thread
From: Jonathan Cameron @ 2022-06-16 13:45 UTC (permalink / raw)
  To: yekai(A)
  Cc: Kai Ye via Linux-accelerators, gregkh, herbert, linuxarm,
	linux-kernel, wangzhou1, linux-crypto, zhangfei.gao

...

> >  
> >> +	hw_err->tick_stamp = jiffies;
> >> +	list_for_each_entry_safe(err, tmp, &qm->uacce_hw_errs, list) {  
> >
> > These are ordered (I think). Could take advantage of that by
> > maintaining count of elements in parallel to the list then walking
> > list in right direction + stop when you reach last one to need
> > deleting.
> >  
> 
> thanks, The current list + jiffies solution seems more simple.

If list always remains relatively short then that's probably fine.
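
As a rough illustration of the ordered-list idea (not the driver's code):
the userspace sketch below assumes entries are appended oldest-first and
that a running count is kept next to the list, so expiry can stop at the
first entry that is still inside the window. The names err_entry,
err_window and err_window_record are hypothetical, timestamps are plain
seconds rather than jiffies, and unlike the patch a threshold of 0 still
records the error (it just never reports isolation).

```c
#include <assert.h>
#include <stdlib.h>

#define WINDOW_SECONDS	3600ULL

/* One recorded hardware error; entries are linked oldest-first. */
struct err_entry {
	unsigned long long timestamp;	/* seconds, monotonic */
	struct err_entry *next;
};

struct err_window {
	struct err_entry *oldest;
	struct err_entry *newest;
	unsigned int count;		/* entries currently in the list */
};

/*
 * Record an error at time 'now'. Returns 1 if the number of earlier
 * errors still inside the window has reached 'threshold', 0 if not,
 * and -1 if the new entry could not be allocated.
 */
static int err_window_record(struct err_window *win,
			     unsigned long long now, unsigned int threshold)
{
	struct err_entry *e;
	unsigned int prior;

	assert(win);

	/* Expire from the old end and stop at the first entry in range. */
	while (win->oldest && now - win->oldest->timestamp > WINDOW_SECONDS) {
		e = win->oldest;
		win->oldest = e->next;
		if (!win->oldest)
			win->newest = NULL;
		free(e);
		win->count--;
	}

	/* Like the patch, compare against errors seen before this one. */
	prior = win->count;

	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->timestamp = now;
	e->next = NULL;
	if (win->newest)
		win->newest->next = e;
	else
		win->oldest = e;
	win->newest = e;
	win->count++;

	return threshold && prior >= threshold;
}
```

With this layout the cost per error is proportional to the number of
expired entries rather than to the whole list, which only matters if the
list can grow long.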

> >  
> >> +		if ((hw_err->tick_stamp - err->tick_stamp) / HZ >
> >> +		    SECONDS_PER_HOUR) {
> >> +			list_del(&err->list);
> >> +			kfree(err);
> >> +		} else {
> >> +			count++;
> >> +		}
> >> +	}
> >> +	list_add(&hw_err->list, &qm->uacce_hw_errs);
> >> +
> >> +	if (count >= isolate->hw_err_isolate_hz)
> >> +		atomic_set(&isolate->is_isolate, 1);
> >> +
> >> +	return 0;
> >> +}
> >> +  
> >
> > ...
> >  
> >> +static int hisi_qm_isolate_strategy_write(struct uacce_device *uacce,
> >> +						const char *buf)
> >> +{
> >> +	struct hisi_qm *qm = uacce->priv;
> >> +	unsigned long val = 0;
> >> +
> >> +#define MAX_ISOLATE_STRATEGY	65535
> >> +
> >> +	if (atomic_read(&qm->uacce_ref))
> >> +		return -EBUSY;
> >> +
> >> +	/* must be set by PF */
> >> +	if (atomic_read(&qm->isolate_data.is_isolate) || uacce->is_vf)  
> >
> > Why is the file visible on the vf?  Hide it or don't register it for vfs.  
> Because VF devices can be registered with UACCE. So this file node can 
> be visited on the vf. We're not sure if someone else's device is the 
> same as qm. So i configure it this way by driver. the 'isolate_strategy' 
> must be set by pf.
> 

If possible have the uacce registration from the driver provide information
on whether this applies to the VF.  Much better to have no file presented
by the VF than one that always returns an error code.
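
To make the suggestion concrete, here is a small userspace model (a
sketch only, not uacce code). It assumes the driver passes a table of
callbacks at registration time and that the core hides the attribute
when the callbacks are absent. All names here (demo_ops, demo_device,
demo_attr_is_visible) are hypothetical; in the kernel the same pattern
would normally go through the is_visible hook of struct attribute_group.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for per-driver isolation callbacks. */
struct demo_ops {
	int (*isolate_strategy_read)(void);
	int (*isolate_strategy_write)(unsigned long val);
};

struct demo_device {
	int is_vf;
	const struct demo_ops *ops;
};

static unsigned long demo_strategy;

static int demo_read(void)
{
	return (int)demo_strategy;
}

static int demo_write(unsigned long val)
{
	demo_strategy = val;
	return 0;
}

/* The PF registers the callbacks; the VF registers none. */
static const struct demo_ops pf_ops = {
	.isolate_strategy_read = demo_read,
	.isolate_strategy_write = demo_write,
};
static const struct demo_ops vf_ops = { NULL, NULL };

/*
 * Return the file mode to expose for attribute 'name', or 0 to hide it.
 * The isolate_strategy file is hidden when no callback backs it.
 */
static unsigned int demo_attr_is_visible(const struct demo_device *dev,
					 const char *name, unsigned int mode)
{
	assert(dev && dev->ops && name);

	if (!strcmp(name, "isolate_strategy") &&
	    !dev->ops->isolate_strategy_read &&
	    !dev->ops->isolate_strategy_write)
		return 0;

	return mode;
}
```

The VF then has no isolate_strategy file at all, so there is no write
path that has to return -EINVAL for it.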

Jonathan


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy
  2022-06-16 13:45       ` Jonathan Cameron
@ 2022-06-17  2:07         ` yekai(A)
  0 siblings, 0 replies; 25+ messages in thread
From: yekai(A) @ 2022-06-17  2:07 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Kai Ye via Linux-accelerators, gregkh, herbert, linuxarm,
	linux-kernel, wangzhou1, linux-crypto, zhangfei.gao



On 2022/6/16 21:45, Jonathan Cameron wrote:
> ...
>
>>>
>>>> +	hw_err->tick_stamp = jiffies;
>>>> +	list_for_each_entry_safe(err, tmp, &qm->uacce_hw_errs, list) {
>>>
>>> These are ordered (I think). Could take advantage of that by
>>> maintaining count of elements in parallel to the list then walking
>>> list in right direction + stop when you reach last one to need
>>> deleting.
>>>
>>
>> thanks, The current list + jiffies solution seems more simple.
>
> If list always remains relatively short then that's probably fine.
>
>>>
>>>> +		if ((hw_err->tick_stamp - err->tick_stamp) / HZ >
>>>> +		    SECONDS_PER_HOUR) {
>>>> +			list_del(&err->list);
>>>> +			kfree(err);
>>>> +		} else {
>>>> +			count++;
>>>> +		}
>>>> +	}
>>>> +	list_add(&hw_err->list, &qm->uacce_hw_errs);
>>>> +
>>>> +	if (count >= isolate->hw_err_isolate_hz)
>>>> +		atomic_set(&isolate->is_isolate, 1);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>
>>> ...
>>>
>>>> +static int hisi_qm_isolate_strategy_write(struct uacce_device *uacce,
>>>> +						const char *buf)
>>>> +{
>>>> +	struct hisi_qm *qm = uacce->priv;
>>>> +	unsigned long val = 0;
>>>> +
>>>> +#define MAX_ISOLATE_STRATEGY	65535
>>>> +
>>>> +	if (atomic_read(&qm->uacce_ref))
>>>> +		return -EBUSY;
>>>> +
>>>> +	/* must be set by PF */
>>>> +	if (atomic_read(&qm->isolate_data.is_isolate) || uacce->is_vf)
>>>
>>> Why is the file visible on the vf?  Hide it or don't register it for vfs.
>> Because VF devices can be registered with UACCE. So this file node can
>> be visited on the vf. We're not sure if someone else's device is the
>> same as qm. So i configure it this way by driver. the 'isolate_strategy'
>> must be set by pf.
>>
>
> If possible have the uacce registration from the driver provide information
> on whether this applies to the VF.  Much better to have no file presented
> by the VF than one that always returns an error code.
>
> Jonathan
>
> .
>

Yes, I will provide some information here for VF.

thanks

Kai

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 2/3] Documentation: add a isolation strategy vfs node for uacce
  2022-06-14 12:23 [PATCH v2 0/3] crypto: hisilicon - supports device isolation feature Kai Ye
@ 2022-06-14 12:23 ` Kai Ye
  0 siblings, 0 replies; 25+ messages in thread
From: Kai Ye @ 2022-06-14 12:23 UTC (permalink / raw)
  To: gregkh, herbert
  Cc: linux-crypto, linux-accelerators, linux-kernel, linuxarm,
	zhangfei.gao, wangzhou1, yekai13

Update the documentation to describe the sysfs nodes that let users
configure the hardware error frequency from user space.

Signed-off-by: Kai Ye <yekai13@huawei.com>
---
 Documentation/ABI/testing/sysfs-driver-uacce | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
index 08f2591138af..0c4226364182 100644
--- a/Documentation/ABI/testing/sysfs-driver-uacce
+++ b/Documentation/ABI/testing/sysfs-driver-uacce
@@ -19,6 +19,23 @@ Contact:        linux-accelerators@lists.ozlabs.org
 Description:    Available instances left of the device
                 Return -ENODEV if uacce_ops get_available_instances is not provided
 
+What:           /sys/class/uacce/<dev_name>/isolate_strategy
+Date:           Jun 2022
+KernelVersion:  5.19
+Contact:        linux-accelerators@lists.ozlabs.org
+Description:    A vfs node that is used to configure the hardware
+                error frequency. This frequency is abstract, for example once
+                an hour or once a day. The specific isolation strategy can be
+                defined in each driver module.
+
+What:           /sys/class/uacce/<dev_name>/isolate
+Date:           Jun 2022
+KernelVersion:  5.19
+Contact:        linux-accelerators@lists.ozlabs.org
+Description:    A vfs node that shows the device isolation state. The value 0
+                means that the device is working. The value 1 means that the
+                device has been isolated.
+
 What:           /sys/class/uacce/<dev_name>/algorithms
 Date:           Feb 2020
 KernelVersion:  5.7
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2022-06-17  2:08 UTC | newest]

Thread overview: 25+ messages
2022-06-14 12:29 [PATCH v2 0/3] crypto: hisilicon - supports device isolation feature Kai Ye
2022-06-14 12:29 ` [PATCH v2 1/3] uacce: " Kai Ye
2022-06-14 12:42   ` Greg KH
2022-06-15  8:52   ` Jonathan Cameron
2022-06-15  9:06     ` yekai(A)
2022-06-14 12:29 ` [PATCH v2 2/3] Documentation: add a isolation strategy vfs node for uacce Kai Ye
2022-06-14 12:41   ` Greg KH
2022-06-15  8:48     ` Jonathan Cameron
2022-06-15  9:18       ` yekai(A)
2022-06-14 12:29 ` [PATCH v2 3/3] crypto: hisilicon/qm - defining the device isolation strategy Kai Ye
2022-06-14 12:43   ` Greg KH
2022-06-14 13:24     ` yekai(A)
2022-06-14 13:29       ` Greg KH
2022-06-15  9:10         ` yekai(A)
2022-06-14 14:12   ` Zhangfei Gao
2022-06-15 13:02   ` Jonathan Cameron
2022-06-16  1:33     ` yekai(A)
2022-06-16 13:45       ` Jonathan Cameron
2022-06-17  2:07         ` yekai(A)
2022-06-14 12:29 ` [PATCH 1/3] uacce: supports device isolation feature Kai Ye
2022-06-14 14:14   ` Zhangfei Gao
2022-06-15  1:07     ` yekai(A)
2022-06-14 12:29 ` [PATCH 2/3] Documentation: add a isolation strategy vfs node for uacce Kai Ye
2022-06-14 12:29 ` [PATCH 3/3] crypto: hisilicon/qm - defining the device isolation strategy Kai Ye
  -- strict thread matches above, loose matches on Subject: below --
2022-06-14 12:23 [PATCH v2 0/3] crypto: hisilicon - supports device isolation feature Kai Ye
2022-06-14 12:23 ` [PATCH v2 2/3] Documentation: add a isolation strategy vfs node for uacce Kai Ye
