* [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews
@ 2018-11-26  2:48 James Simmons
  2018-11-26  2:48 ` [lustre-devel] [PATCH 01/12] lustre: llite: move CONFIG_SECURITY handling to llite_internal.h James Simmons
                   ` (11 more replies)
  0 siblings, 12 replies; 23+ messages in thread
From: James Simmons @ 2018-11-26  2:48 UTC (permalink / raw)
  To: lustre-devel

New patches from me to address issues with the last batch I sent.
Found a bug in the llite debugfs stats registration code while
testing. Updated the TODO since many of the tasks have been done.
The patch for LU-5461 has been ported, which makes sanity test 244
pass. The patch ported from LU-10218 now makes it possible to run
the sanity 160 changelog test without locking up the client node.
Several patches independent of PFL have been ported from the Lustre
2.10 LTS release.

Alexander Boyko (1):
  lustre: obdclass: obd_device improvement

Dmitry Eremin (1):
  lustre: clio: Introduce parallel tasks framework

James Simmons (6):
  lustre: llite: move CONFIG_SECURITY handling to llite_internal.h
  lustre: lnd: create enum kib_dev_caps
  lustre: lnd: test fpo_fmr_pool pointer instead of special bool
  lustre: llite: remove llite_loop left overs
  lustre: llite: avoid duplicate stats debugfs registration
  lustre: update TODO lustre list

John L. Hammond (2):
  lustre: mdc: propagate changelog errors to readers
  lustre: mdc: use large xattr buffers for old servers

Lai Siyao (1):
  lustre: mdc: don't add to page cache upon failure

Patrick Farrell (1):
  lustre: ldlm: No -EINVAL for canceled != unused

 drivers/staging/lustre/TODO                        |  85 ----
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    |  17 +-
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  13 +-
 drivers/staging/lustre/lustre/include/cl_ptask.h   | 145 ++++++
 drivers/staging/lustre/lustre/include/obd_class.h  |   9 +-
 drivers/staging/lustre/lustre/ldlm/ldlm_resource.c |  16 +-
 drivers/staging/lustre/lustre/llite/file.c         | 107 +----
 .../staging/lustre/lustre/llite/llite_internal.h   |  81 +---
 drivers/staging/lustre/lustre/llite/llite_lib.c    |  11 +-
 drivers/staging/lustre/lustre/llite/lproc_llite.c  |   2 +-
 drivers/staging/lustre/lustre/llite/rw26.c         |  51 +--
 drivers/staging/lustre/lustre/mdc/mdc_changelog.c  |  45 +-
 drivers/staging/lustre/lustre/mdc/mdc_locks.c      |  14 +
 drivers/staging/lustre/lustre/mdc/mdc_request.c    |   5 +-
 drivers/staging/lustre/lustre/obdclass/Makefile    |   3 +-
 drivers/staging/lustre/lustre/obdclass/cl_ptask.c  | 501 +++++++++++++++++++++
 drivers/staging/lustre/lustre/obdclass/genops.c    | 284 ++++++++----
 .../staging/lustre/lustre/obdclass/obd_config.c    | 143 ++----
 drivers/staging/lustre/lustre/obdclass/obd_mount.c |   6 +-
 19 files changed, 1007 insertions(+), 531 deletions(-)
 create mode 100644 drivers/staging/lustre/lustre/include/cl_ptask.h
 create mode 100644 drivers/staging/lustre/lustre/obdclass/cl_ptask.c

-- 
1.8.3.1


* [lustre-devel] [PATCH 01/12] lustre: llite: move CONFIG_SECURITY handling to llite_internal.h
  2018-11-26  2:48 [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews James Simmons
@ 2018-11-26  2:48 ` James Simmons
  2018-11-26  2:48 ` [lustre-devel] [PATCH 02/12] lustre: lnd: create enum kib_dev_caps James Simmons
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: James Simmons @ 2018-11-26  2:48 UTC (permalink / raw)
  To: lustre-devel

For the Linux kernel it is recommended to keep CONFIG_*-wrapped code
in a header file instead of the source files to avoid making the
code more difficult to read. Move the CONFIG_SECURITY wrapped code
to llite_internal.h in this case.
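
As a general illustration of the pattern (not the patch code itself,
and using a hypothetical helper name), a CONFIG_*-dependent static
inline in a header keeps the #ifdef out of every caller:

static inline bool feature_enabled(void)
{
#ifdef CONFIG_SECURITY
	return true;
#else
	return false;
#endif
}

Callers in .c files then simply test feature_enabled() without any
conditional compilation of their own.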

Signed-off-by: James Simmons <uja.ornl@yahoo.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Reviewed-on: https://review.whamcloud.com/33410
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/llite/llite_internal.h | 17 +++++++++++++++++
 drivers/staging/lustre/lustre/llite/llite_lib.c      | 11 ++---------
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index a55b568..8c703e6 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -269,6 +269,23 @@ static inline void ll_layout_version_set(struct ll_inode_info *lli, __u32 gen)
 int ll_xattr_cache_get(struct inode *inode, const char *name,
 		       char *buffer, size_t size, __u64 valid);
 
+static inline bool obd_connect_has_secctx(struct obd_connect_data *data)
+{
+#ifdef CONFIG_SECURITY
+	return data->ocd_connect_flags & OBD_CONNECT_FLAGS2 &&
+	       data->ocd_connect_flags2 & OBD_CONNECT2_FILE_SECCTX;
+#else
+	return false;
+#endif
+}
+
+static inline void obd_connect_set_secctx(struct obd_connect_data *data)
+{
+#ifdef CONFIG_SECURITY
+	data->ocd_connect_flags2 |= OBD_CONNECT2_FILE_SECCTX;
+#endif
+}
+
 int ll_dentry_init_security(struct dentry *dentry, int mode, struct qstr *name,
 			    const char **secctx_name, void **secctx,
 			    u32 *secctx_size);
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index fac6584..2dfeab4 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -148,12 +148,6 @@ static void ll_free_sbi(struct super_block *sb)
 	kfree(sbi);
 }
 
-static inline int obd_connect_has_secctx(struct obd_connect_data *data)
-{
-	return data->ocd_connect_flags & OBD_CONNECT_FLAGS2 &&
-	       data->ocd_connect_flags2 & OBD_CONNECT2_FILE_SECCTX;
-}
-
 static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 {
 	struct inode *root = NULL;
@@ -244,9 +238,8 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	if (sbi->ll_flags & LL_SBI_ALWAYS_PING)
 		data->ocd_connect_flags &= ~OBD_CONNECT_PINGLESS;
 
-#ifdef CONFIG_SECURITY
-	data->ocd_connect_flags2 |= OBD_CONNECT2_FILE_SECCTX;
-#endif
+	obd_connect_set_secctx(data);
+
 	data->ocd_brw_size = MD_MAX_BRW_SIZE;
 
 	err = obd_connect(NULL, &sbi->ll_md_exp, sbi->ll_md_obd,
-- 
1.8.3.1


* [lustre-devel] [PATCH 02/12] lustre: lnd: create enum kib_dev_caps
  2018-11-26  2:48 [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews James Simmons
  2018-11-26  2:48 ` [lustre-devel] [PATCH 01/12] lustre: llite: move CONFIG_SECURITY handling to llite_internal.h James Simmons
@ 2018-11-26  2:48 ` James Simmons
  2018-11-26  2:48 ` [lustre-devel] [PATCH 03/12] lustre: lnd: test fpo_fmr_pool pointer instead of special bool James Simmons
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: James Simmons @ 2018-11-26  2:48 UTC (permalink / raw)
  To: lustre-devel

Clean up the IBLND_DEV_CAPS_* flags by creating enum kib_dev_caps and
using the BIT() macro.
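
Purely for illustration (not part of this patch), flags defined with
BIT() are set and tested like any other bitmask; do_something() below
is a hypothetical placeholder:

	enum kib_dev_caps caps = 0;

	caps |= IBLND_DEV_CAPS_FASTREG_ENABLED;		/* BIT(0) == 0x1 */
	if (caps & IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT)	/* BIT(1) == 0x2 */
		do_something();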

Signed-off-by: James Simmons <uja.ornl@yahoo.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-6142
Reviewed-on: https://review.whamcloud.com/33409
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Doug Oucharek <dougso@me.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c |  5 +++--
 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 12 +++++++-----
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index ecdf4de..281004a 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -656,7 +656,7 @@ static unsigned int kiblnd_send_wrs(struct kib_conn *conn)
 	 * One WR for the LNet message
 	 * And ibc_max_frags for the transfer WRs
 	 */
-	u32 dev_caps = conn->ibc_hdev->ibh_dev->ibd_dev_caps;
+	enum kib_dev_caps dev_caps = conn->ibc_hdev->ibh_dev->ibd_dev_caps;
 	unsigned int ret = 1 + conn->ibc_max_frags;
 
 	/* FastReg needs two extra WRs for map and invalidate */
@@ -1441,7 +1441,8 @@ static int kiblnd_alloc_fmr_pool(struct kib_fmr_poolset *fps, struct kib_fmr_poo
 }
 
 static int kiblnd_alloc_freg_pool(struct kib_fmr_poolset *fps,
-				  struct kib_fmr_pool *fpo, u32 dev_caps)
+				  struct kib_fmr_pool *fpo,
+				  enum kib_dev_caps dev_caps)
 {
 	struct kib_fast_reg_descriptor *frd;
 	int i, rc;
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index 7f81fe2..0994fae 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -73,10 +73,6 @@
 #define IBLND_N_SCHED			2
 #define IBLND_N_SCHED_HIGH		4
 
-#define IBLND_DEV_CAPS_FASTREG_ENABLED		0x1
-#define IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT	0x2
-#define IBLND_DEV_CAPS_FMR_ENABLED		0x4
-
 struct kib_tunables {
 	int *kib_dev_failover;           /* HCA failover */
 	unsigned int *kib_service;       /* IB service number */
@@ -152,6 +148,12 @@ struct kib_tunables {
 #define KIB_IFNAME_SIZE	      256
 #endif
 
+enum kib_dev_caps {
+	IBLND_DEV_CAPS_FASTREG_ENABLED		= BIT(0),
+	IBLND_DEV_CAPS_FASTREG_GAPS_SUPPORT	= BIT(1),
+	IBLND_DEV_CAPS_FMR_ENABLED		= BIT(2),
+};
+
 struct kib_dev {
 	struct list_head   ibd_list;            /* chain on kib_devs */
 	struct list_head   ibd_fail_list;       /* chain on kib_failed_devs */
@@ -167,7 +169,7 @@ struct kib_dev {
 	unsigned int ibd_can_failover; /* IPoIB interface is a bonding master */
 	struct list_head   ibd_nets;
 	struct kib_hca_dev *ibd_hdev;
-	u32			ibd_dev_caps;
+	enum kib_dev_caps	ibd_dev_caps;
 };
 
 struct kib_hca_dev {
-- 
1.8.3.1


* [lustre-devel] [PATCH 03/12] lustre: lnd: test fpo_fmr_pool pointer instead of special bool
  2018-11-26  2:48 [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews James Simmons
  2018-11-26  2:48 ` [lustre-devel] [PATCH 01/12] lustre: llite: move CONFIG_SECURITY handling to llite_internal.h James Simmons
  2018-11-26  2:48 ` [lustre-devel] [PATCH 02/12] lustre: lnd: create enum kib_dev_caps James Simmons
@ 2018-11-26  2:48 ` James Simmons
  2018-11-26  2:48 ` [lustre-devel] [PATCH 04/12] lustre: llite: remove llite_loop left overs James Simmons
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: James Simmons @ 2018-11-26  2:48 UTC (permalink / raw)
  To: lustre-devel

The ko2iblnd driver sets an fpo_is_fmr bool to tell us whether an
FMR pool was allocated. The name fpo_is_fmr is misleading about its
function, and it is a weak test for whether a pool was allocated in
the FMR case. It is much simpler to test the actual FMR pool pointer
than to manually set a bool flag telling us whether the FMR pool is
valid.
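
A rough before/after sketch of the pattern (illustrative only;
use_fmr_pool() is a hypothetical placeholder):

	/* before: rely on a separately maintained flag */
	if (fpo->fpo_is_fmr)
		use_fmr_pool(fpo);

	/* after: test the FMR pool pointer itself */
	if (!IS_ERR_OR_NULL(fpo->fmr.fpo_fmr_pool))
		use_fmr_pool(fpo);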

Signed-off-by: James Simmons <uja.ornl@yahoo.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-11152
Reviewed-on: https://review.whamcloud.com/33408
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 12 ++++--------
 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h |  1 -
 2 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index 281004a..5394c1a 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1356,9 +1356,8 @@ static void kiblnd_destroy_fmr_pool(struct kib_fmr_pool *fpo)
 {
 	LASSERT(!fpo->fpo_map_count);
 
-	if (fpo->fpo_is_fmr) {
-		if (fpo->fmr.fpo_fmr_pool)
-			ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
+	if (!IS_ERR_OR_NULL(fpo->fmr.fpo_fmr_pool)) {
+		ib_destroy_fmr_pool(fpo->fmr.fpo_fmr_pool);
 	} else {
 		struct kib_fast_reg_descriptor *frd;
 		int i = 0;
@@ -1435,7 +1434,6 @@ static int kiblnd_alloc_fmr_pool(struct kib_fmr_poolset *fps, struct kib_fmr_poo
 		else
 			CERROR("FMRs are not supported\n");
 	}
-	fpo->fpo_is_fmr = true;
 
 	return rc;
 }
@@ -1447,8 +1445,6 @@ static int kiblnd_alloc_freg_pool(struct kib_fmr_poolset *fps,
 	struct kib_fast_reg_descriptor *frd;
 	int i, rc;
 
-	fpo->fpo_is_fmr = false;
-
 	INIT_LIST_HEAD(&fpo->fast_reg.fpo_pool_list);
 	fpo->fast_reg.fpo_pool_size = 0;
 	for (i = 0; i < fps->fps_pool_size; i++) {
@@ -1646,7 +1642,7 @@ void kiblnd_fmr_pool_unmap(struct kib_fmr *fmr, int status)
 		return;
 
 	fps = fpo->fpo_owner;
-	if (fpo->fpo_is_fmr) {
+	if (!IS_ERR_OR_NULL(fpo->fmr.fpo_fmr_pool)) {
 		if (fmr->fmr_pfmr) {
 			rc = ib_fmr_pool_unmap(fmr->fmr_pfmr);
 			LASSERT(!rc);
@@ -1708,7 +1704,7 @@ int kiblnd_fmr_pool_map(struct kib_fmr_poolset *fps, struct kib_tx *tx,
 		fpo->fpo_deadline = ktime_get_seconds() + IBLND_POOL_DEADLINE;
 		fpo->fpo_map_count++;
 
-		if (fpo->fpo_is_fmr) {
+		if (!IS_ERR_OR_NULL(fpo->fmr.fpo_fmr_pool)) {
 			struct ib_pool_fmr *pfmr;
 
 			spin_unlock(&fps->fps_lock);
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index 0994fae..2ddd83b 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -288,7 +288,6 @@ struct kib_fmr_pool {
 	time64_t		fpo_deadline;	/* deadline of this pool */
 	int                   fpo_failed;          /* fmr pool is failed */
 	int                   fpo_map_count;       /* # of mapped FMR */
-	bool			fpo_is_fmr;	/* True if FMR pools allocated */
 };
 
 struct kib_fmr {
-- 
1.8.3.1


* [lustre-devel] [PATCH 04/12] lustre: llite: remove llite_loop left overs
  2018-11-26  2:48 [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews James Simmons
                   ` (2 preceding siblings ...)
  2018-11-26  2:48 ` [lustre-devel] [PATCH 03/12] lustre: lnd: test fpo_fmr_pool pointer instead of special bool James Simmons
@ 2018-11-26  2:48 ` James Simmons
  2018-11-26  2:48 ` [lustre-devel] [PATCH 05/12] lustre: llite: avoid duplicate stats debugfs registration James Simmons
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: James Simmons @ 2018-11-26  2:48 UTC (permalink / raw)
  To: lustre-devel

With the removal of llite_loop, several pieces of code that were only
used by the lloop device are still present in the llite layer. We can
remove these no-longer-used pieces.

Signed-off-by: James Simmons <uja.ornl@yahoo.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8958
Reviewed-on: https://review.whamcloud.com/26795
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/llite/file.c         | 107 +--------------------
 .../staging/lustre/lustre/llite/llite_internal.h   |  64 ------------
 drivers/staging/lustre/lustre/llite/rw26.c         |  51 +++-------
 3 files changed, 15 insertions(+), 207 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
index 68f88cf..15910ff 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -56,10 +56,6 @@
 static int ll_lease_close(struct obd_client_handle *och, struct inode *inode,
 			  bool *lease_broken);
 
-static enum llioc_iter
-ll_iocontrol_call(struct inode *inode, struct file *file,
-		  unsigned int cmd, unsigned long arg, int *rcp);
-
 static struct ll_file_data *ll_file_data_get(void)
 {
 	struct ll_file_data *fd;
@@ -2620,17 +2616,10 @@ int ll_ioctl_fssetxattr(struct inode *inode, unsigned int cmd,
 		return ll_ioctl_fssetxattr(inode, cmd, arg);
 	case BLKSSZGET:
 		return put_user(PAGE_SIZE, (int __user *)arg);
-	default: {
-		int err;
-
-		if (ll_iocontrol_call(inode, file, cmd, arg, &err) ==
-		     LLIOC_STOP)
-			return err;
-
+	default:
 		return obd_iocontrol(cmd, ll_i2dtexp(inode), 0, NULL,
 				     (void __user *)arg);
 	}
-	}
 }
 
 static loff_t ll_file_seek(struct file *file, loff_t offset, int origin)
@@ -3543,100 +3532,6 @@ int ll_inode_permission(struct inode *inode, int mask)
 	.get_acl	= ll_get_acl,
 };
 
-/* dynamic ioctl number support routines */
-static struct llioc_ctl_data {
-	struct rw_semaphore	ioc_sem;
-	struct list_head	      ioc_head;
-} llioc = {
-	__RWSEM_INITIALIZER(llioc.ioc_sem),
-	LIST_HEAD_INIT(llioc.ioc_head)
-};
-
-struct llioc_data {
-	struct list_head	      iocd_list;
-	unsigned int	    iocd_size;
-	llioc_callback_t	iocd_cb;
-	unsigned int	    iocd_count;
-	unsigned int	    iocd_cmd[0];
-};
-
-void *ll_iocontrol_register(llioc_callback_t cb, int count, unsigned int *cmd)
-{
-	unsigned int size;
-	struct llioc_data *in_data = NULL;
-
-	if (!cb || !cmd || count > LLIOC_MAX_CMD || count < 0)
-		return NULL;
-
-	size = sizeof(*in_data) + count * sizeof(unsigned int);
-	in_data = kzalloc(size, GFP_KERNEL);
-	if (!in_data)
-		return NULL;
-
-	in_data->iocd_size = size;
-	in_data->iocd_cb = cb;
-	in_data->iocd_count = count;
-	memcpy(in_data->iocd_cmd, cmd, sizeof(unsigned int) * count);
-
-	down_write(&llioc.ioc_sem);
-	list_add_tail(&in_data->iocd_list, &llioc.ioc_head);
-	up_write(&llioc.ioc_sem);
-
-	return in_data;
-}
-EXPORT_SYMBOL(ll_iocontrol_register);
-
-void ll_iocontrol_unregister(void *magic)
-{
-	struct llioc_data *tmp;
-
-	if (!magic)
-		return;
-
-	down_write(&llioc.ioc_sem);
-	list_for_each_entry(tmp, &llioc.ioc_head, iocd_list) {
-		if (tmp == magic) {
-			list_del(&tmp->iocd_list);
-			up_write(&llioc.ioc_sem);
-
-			kfree(tmp);
-			return;
-		}
-	}
-	up_write(&llioc.ioc_sem);
-
-	CWARN("didn't find iocontrol register block with magic: %p\n", magic);
-}
-EXPORT_SYMBOL(ll_iocontrol_unregister);
-
-static enum llioc_iter
-ll_iocontrol_call(struct inode *inode, struct file *file,
-		  unsigned int cmd, unsigned long arg, int *rcp)
-{
-	enum llioc_iter ret = LLIOC_CONT;
-	struct llioc_data *data;
-	int rc = -EINVAL, i;
-
-	down_read(&llioc.ioc_sem);
-	list_for_each_entry(data, &llioc.ioc_head, iocd_list) {
-		for (i = 0; i < data->iocd_count; i++) {
-			if (cmd != data->iocd_cmd[i])
-				continue;
-
-			ret = data->iocd_cb(inode, file, cmd, arg, data, &rc);
-			break;
-		}
-
-		if (ret == LLIOC_STOP)
-			break;
-	}
-	up_read(&llioc.ioc_sem);
-
-	if (rcp)
-		*rcp = rc;
-	return ret;
-}
-
 int ll_layout_conf(struct inode *inode, const struct cl_object_conf *conf)
 {
 	struct ll_inode_info *lli = ll_i2info(inode);
diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index 8c703e6..48424a4 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -1234,73 +1234,9 @@ static inline int ll_glimpse_size(struct inode *inode)
 	return true;
 }
 
-/* llite ioctl register support routine */
-enum llioc_iter {
-	LLIOC_CONT = 0,
-	LLIOC_STOP
-};
-
-#define LLIOC_MAX_CMD	   256
-
-/*
- * Rules to write a callback function:
- *
- * Parameters:
- *  @magic: Dynamic ioctl call routine will feed this value with the pointer
- *      returned to ll_iocontrol_register.  Callback functions should use this
- *      data to check the potential collasion of ioctl cmd. If collasion is
- *      found, callback function should return LLIOC_CONT.
- *  @rcp: The result of ioctl command.
- *
- *  Return values:
- *      If @magic matches the pointer returned by ll_iocontrol_data, the
- *      callback should return LLIOC_STOP; return LLIOC_STOP otherwise.
- */
-typedef enum llioc_iter (*llioc_callback_t)(struct inode *inode,
-		struct file *file, unsigned int cmd, unsigned long arg,
-		void *magic, int *rcp);
-
-/* export functions */
-/* Register ioctl block dynamatically for a regular file.
- *
- * @cmd: the array of ioctl command set
- * @count: number of commands in the @cmd
- * @cb: callback function, it will be called if an ioctl command is found to
- *      belong to the command list @cmd.
- *
- * Return value:
- *      A magic pointer will be returned if success;
- *      otherwise, NULL will be returned.
- */
-void *ll_iocontrol_register(llioc_callback_t cb, int count, unsigned int *cmd);
-void ll_iocontrol_unregister(void *magic);
-
 int cl_sync_file_range(struct inode *inode, loff_t start, loff_t end,
 		       enum cl_fsync_mode mode, int ignore_layout);
 
-/** direct write pages */
-struct ll_dio_pages {
-	/** page array to be written. we don't support
-	 * partial pages except the last one.
-	 */
-	struct page **ldp_pages;
-	/* offset of each page */
-	loff_t       *ldp_offsets;
-	/** if ldp_offsets is NULL, it means a sequential
-	 * pages to be written, then this is the file offset
-	 * of the first page.
-	 */
-	loff_t	ldp_start_offset;
-	/** how many bytes are to be written. */
-	size_t	ldp_size;
-	/** # of pages in the array. */
-	int	   ldp_nr;
-};
-
-ssize_t ll_direct_rw_pages(const struct lu_env *env, struct cl_io *io,
-			   int rw, struct inode *inode,
-			   struct ll_dio_pages *pv);
-
 static inline int ll_file_nolock(const struct file *file)
 {
 	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
diff --git a/drivers/staging/lustre/lustre/llite/rw26.c b/drivers/staging/lustre/lustre/llite/rw26.c
index 722e5ea..9843c9e 100644
--- a/drivers/staging/lustre/lustre/llite/rw26.c
+++ b/drivers/staging/lustre/lustre/llite/rw26.c
@@ -172,32 +172,27 @@ static void ll_free_user_pages(struct page **pages, int npages, int do_dirty)
 	kvfree(pages);
 }
 
-ssize_t ll_direct_rw_pages(const struct lu_env *env, struct cl_io *io,
-			   int rw, struct inode *inode,
-			   struct ll_dio_pages *pv)
+static ssize_t ll_direct_IO_seg(const struct lu_env *env, struct cl_io *io,
+				int rw, struct inode *inode, size_t size,
+				loff_t file_offset, struct page **pages,
+				int page_count)
 {
 	struct cl_page    *clp;
 	struct cl_2queue  *queue;
 	struct cl_object  *obj = io->ci_obj;
 	int i;
 	ssize_t rc = 0;
-	loff_t file_offset  = pv->ldp_start_offset;
-	size_t size = pv->ldp_size;
-	int page_count      = pv->ldp_nr;
-	struct page **pages = pv->ldp_pages;
 	size_t page_size = cl_page_size(obj);
+	size_t orig_size = size;
 	bool do_io;
-	int  io_pages       = 0;
+	int io_pages = 0;
 
 	queue = &io->ci_queue;
 	cl_2queue_init(queue);
 	for (i = 0; i < page_count; i++) {
-		if (pv->ldp_offsets)
-			file_offset = pv->ldp_offsets[i];
-
 		LASSERT(!(file_offset & (page_size - 1)));
 		clp = cl_page_find(env, obj, cl_index(obj, file_offset),
-				   pv->ldp_pages[i], CPT_TRANSIENT);
+				   pages[i], CPT_TRANSIENT);
 		if (IS_ERR(clp)) {
 			rc = PTR_ERR(clp);
 			break;
@@ -274,31 +269,13 @@ ssize_t ll_direct_rw_pages(const struct lu_env *env, struct cl_io *io,
 				       queue, 0);
 	}
 	if (rc == 0)
-		rc = pv->ldp_size;
+		rc = orig_size;
 
 	cl_2queue_discard(env, io, queue);
 	cl_2queue_disown(env, io, queue);
 	cl_2queue_fini(env, queue);
 	return rc;
 }
-EXPORT_SYMBOL(ll_direct_rw_pages);
-
-static ssize_t ll_direct_IO_26_seg(const struct lu_env *env, struct cl_io *io,
-				   int rw, struct inode *inode,
-				   struct address_space *mapping,
-				   size_t size, loff_t file_offset,
-				   struct page **pages, int page_count)
-{
-	struct ll_dio_pages pvec = {
-		.ldp_pages	= pages,
-		.ldp_nr		= page_count,
-		.ldp_size	= size,
-		.ldp_offsets	= NULL,
-		.ldp_start_offset = file_offset
-	};
-
-	return ll_direct_rw_pages(env, io, rw, inode, &pvec);
-}
 
 /* This is the maximum size of a single O_DIRECT request, based on the
  * kmalloc limit.  We need to fit all of the brw_page structs, each one
@@ -308,7 +285,8 @@ static ssize_t ll_direct_IO_26_seg(const struct lu_env *env, struct cl_io *io,
  */
 #define MAX_DIO_SIZE ((KMALLOC_MAX_SIZE / sizeof(struct brw_page) *	  \
 		       PAGE_SIZE) & ~(DT_MAX_BRW_SIZE - 1))
-static ssize_t ll_direct_IO_26(struct kiocb *iocb, struct iov_iter *iter)
+
+static ssize_t ll_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct ll_cl_context *lcc;
 	const struct lu_env *env;
@@ -362,10 +340,9 @@ static ssize_t ll_direct_IO_26(struct kiocb *iocb, struct iov_iter *iter)
 		if (likely(result > 0)) {
 			int n = DIV_ROUND_UP(result + offs, PAGE_SIZE);
 
-			result = ll_direct_IO_26_seg(env, io, iov_iter_rw(iter),
-						     inode, file->f_mapping,
-						     result, file_offset, pages,
-						     n);
+			result = ll_direct_IO_seg(env, io, iov_iter_rw(iter),
+						  inode, result, file_offset,
+						  pages, n);
 			ll_free_user_pages(pages, n, iov_iter_rw(iter) == READ);
 		}
 		if (unlikely(result <= 0)) {
@@ -627,7 +604,7 @@ static int ll_migratepage(struct address_space *mapping,
 
 const struct address_space_operations ll_aops = {
 	.readpage	= ll_readpage,
-	.direct_IO      = ll_direct_IO_26,
+	.direct_IO      = ll_direct_IO,
 	.writepage      = ll_writepage,
 	.writepages     = ll_writepages,
 	.set_page_dirty = __set_page_dirty_nobuffers,
-- 
1.8.3.1


* [lustre-devel] [PATCH 05/12] lustre: llite: avoid duplicate stats debugfs registration
  2018-11-26  2:48 [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews James Simmons
                   ` (3 preceding siblings ...)
  2018-11-26  2:48 ` [lustre-devel] [PATCH 04/12] lustre: llite: remove llite_loop left overs James Simmons
@ 2018-11-26  2:48 ` James Simmons
  2018-11-26  2:48 ` [lustre-devel] [PATCH 06/12] lustre: mdc: don't add to page cache upon failure James Simmons
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: James Simmons @ 2018-11-26  2:48 UTC (permalink / raw)
  To: lustre-devel

The unwinding of debugfs handling in llite introduced a bug. Two
different debugfs files are currently being registered with the
same name "stats". Change the registration of the ll_ra_stats
debugfs file to its proper name "read_ahead_stats".

Fixes: cd514eac8029 ("staging: lustre: remove ldebugfs_register_stats() wrapper function")
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/llite/lproc_llite.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lustre/llite/lproc_llite.c b/drivers/staging/lustre/lustre/llite/lproc_llite.c
index 10dc7a8..8139f84 100644
--- a/drivers/staging/lustre/lustre/llite/lproc_llite.c
+++ b/drivers/staging/lustre/lustre/llite/lproc_llite.c
@@ -1381,7 +1381,7 @@ int ll_debugfs_register_super(struct super_block *sb, const char *name)
 		lprocfs_counter_init(sbi->ll_ra_stats, id, 0,
 				     ra_stat_string[id], "pages");
 
-	debugfs_create_file("stats", 0644, sbi->ll_debugfs_entry,
+	debugfs_create_file("read_ahead_stats", 0644, sbi->ll_debugfs_entry,
 			    sbi->ll_ra_stats, &lprocfs_stats_seq_fops);
 out_ll_kset:
 	/* Yes we also register sysfs mount kset here as well */
-- 
1.8.3.1


* [lustre-devel] [PATCH 06/12] lustre: mdc: don't add to page cache upon failure
  2018-11-26  2:48 [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews James Simmons
                   ` (4 preceding siblings ...)
  2018-11-26  2:48 ` [lustre-devel] [PATCH 05/12] lustre: llite: avoid duplicate stats debugfs registration James Simmons
@ 2018-11-26  2:48 ` James Simmons
  2018-11-27  3:01   ` NeilBrown
  2018-11-26  2:48 ` [lustre-devel] [PATCH 07/12] lustre: ldlm: No -EINVAL for canceled != unused James Simmons
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 23+ messages in thread
From: James Simmons @ 2018-11-26  2:48 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Reading directory pages may fail on the MDS. In this case the client
should not cache a non-up-to-date directory page, because doing so
would cause a later read of the same page to fail.

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-5461
Reviewed-on: http://review.whamcloud.com/11450
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/mdc/mdc_request.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index 1072b66..09b30ef 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -1212,7 +1212,10 @@ static int mdc_read_page_remote(void *data, struct page *page0)
 	}
 
 	rc = mdc_getpage(rp->rp_exp, fid, rp->rp_off, page_pool, npages, &req);
-	if (!rc) {
+	if (rc < 0) {
+		/* page0 is special, which was added into page cache early */
+		delete_from_page_cache(page0);
+	} else {
 		int lu_pgs = req->rq_bulk->bd_nob_transferred;
 
 		rd_pgs = (lu_pgs + PAGE_SIZE - 1) >> PAGE_SHIFT;
-- 
1.8.3.1


* [lustre-devel] [PATCH 07/12] lustre: ldlm: No -EINVAL for canceled != unused
  2018-11-26  2:48 [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews James Simmons
                   ` (5 preceding siblings ...)
  2018-11-26  2:48 ` [lustre-devel] [PATCH 06/12] lustre: mdc: don't add to page cache upon failure James Simmons
@ 2018-11-26  2:48 ` James Simmons
  2018-11-26  2:48 ` [lustre-devel] [PATCH 08/12] lustre: mdc: propagate changelog errors to readers James Simmons
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: James Simmons @ 2018-11-26  2:48 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <paf@cray.com>

If any locks are removed from or added to the LRU, the check of
"number unused vs number canceled" may be wrong. This is fine - do
not return an error or print a debug message in this case.

Signed-off-by: Patrick Farrell <paf@cray.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-7802
Reviewed-on: https://review.whamcloud.com/28560
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/ldlm/ldlm_resource.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
index 5028db7..11c0b88 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
@@ -193,19 +193,9 @@ static ssize_t lru_size_store(struct kobject *kobj, struct attribute *attr,
 		       "dropping all unused locks from namespace %s\n",
 		       ldlm_ns_name(ns));
 		if (ns_connect_lru_resize(ns)) {
-			int canceled, unused  = ns->ns_nr_unused;
-
-			/* Try to cancel all @ns_nr_unused locks. */
-			canceled = ldlm_cancel_lru(ns, unused, 0,
-						   LDLM_LRU_FLAG_PASSED |
-						   LDLM_LRU_FLAG_CLEANUP);
-			if (canceled < unused) {
-				CDEBUG(D_DLMTRACE,
-				       "not all requested locks are canceled, requested: %d, canceled: %d\n",
-				       unused,
-				       canceled);
-				return -EINVAL;
-			}
+			ldlm_cancel_lru(ns, ns->ns_nr_unused, 0,
+					LDLM_LRU_FLAG_PASSED |
+					LDLM_LRU_FLAG_CLEANUP);
 		} else {
 			tmp = ns->ns_max_unused;
 			ns->ns_max_unused = 0;
-- 
1.8.3.1


* [lustre-devel] [PATCH 08/12] lustre: mdc: propagate changelog errors to readers
  2018-11-26  2:48 [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews James Simmons
                   ` (6 preceding siblings ...)
  2018-11-26  2:48 ` [lustre-devel] [PATCH 07/12] lustre: ldlm: No -EINVAL for canceled != unused James Simmons
@ 2018-11-26  2:48 ` James Simmons
  2018-11-27  3:13   ` NeilBrown
  2018-11-26  2:48 ` [lustre-devel] [PATCH 09/12] lustre: obdclass: obd_device improvement James Simmons
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 23+ messages in thread
From: James Simmons @ 2018-11-26  2:48 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

Store errors encountered by the changelog llog reader thread in the
crs_err field of struct chlg_reader_state so that they can be
propagated to changelog readers. In chlg_read() improve the error and
EOF reporting. Return -ERESTARTSYS when a blocked reader is
interrupted. Replace uses of wait_event_idle() with
wait_event_interruptible().
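
For reference, the generic interruptible-wait pattern this moves to
looks roughly like the following (illustrative sketch, not the patch
code itself):

	int rc;

	rc = wait_event_interruptible(waitq, condition);
	if (rc)		/* -ERESTARTSYS: a signal woke us up */
		return rc;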

Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10218
Reviewed-on: https://review.whamcloud.com/30040
Reviewed-by: Quentin Bouget <quentin.bouget@cea.fr>
Reviewed-by: Henri Doreau <henri.doreau@cea.fr>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/mdc/mdc_changelog.c | 45 ++++++++++++++---------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/staging/lustre/lustre/mdc/mdc_changelog.c b/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
index becdee8..811a36a 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
@@ -71,7 +71,7 @@ struct chlg_reader_state {
 	/* Producer thread (if any) */
 	struct task_struct	*crs_prod_task;
 	/* An error occurred that prevents from reading further */
-	bool			 crs_err;
+	int			 crs_err;
 	/* EOF, no more records available */
 	bool			 crs_eof;
 	/* Desired start position */
@@ -148,9 +148,9 @@ static int chlg_read_cat_process_cb(const struct lu_env *env,
 	       PFID(&rec->cr.cr_tfid), PFID(&rec->cr.cr_pfid),
 	       rec->cr.cr_namelen, changelog_rec_name(&rec->cr));
 
-	wait_event_idle(crs->crs_waitq_prod,
-			(crs->crs_rec_count < CDEV_CHLG_MAX_PREFETCH ||
-			 kthread_should_stop()));
+	wait_event_interruptible(crs->crs_waitq_prod,
+				 crs->crs_rec_count < CDEV_CHLG_MAX_PREFETCH ||
+				 kthread_should_stop());
 
 	if (kthread_should_stop())
 		return LLOG_PROC_BREAK;
@@ -231,7 +231,7 @@ static int chlg_load(void *args)
 
 err_out:
 	if (rc < 0)
-		crs->crs_err = true;
+		crs->crs_err = rc;
 
 	wake_up_all(&crs->crs_waitq_cons);
 
@@ -241,7 +241,7 @@ static int chlg_load(void *args)
 	if (ctx)
 		llog_ctxt_put(ctx);
 
-	wait_event_idle(crs->crs_waitq_prod, kthread_should_stop());
+	wait_event_interruptible(crs->crs_waitq_prod, kthread_should_stop());
 
 	return rc;
 }
@@ -264,13 +264,20 @@ static ssize_t chlg_read(struct file *file, char __user *buff, size_t count,
 	struct chlg_reader_state *crs = file->private_data;
 	struct chlg_rec_entry *rec;
 	struct chlg_rec_entry *tmp;
-	ssize_t  written_total = 0;
+	ssize_t written_total = 0;
 	LIST_HEAD(consumed);
+	size_t rc;
+
+	if (file->f_flags & O_NONBLOCK && crs->crs_rec_count == 0) {
+		if (crs->crs_err < 0)
+			return crs->crs_err;
+		else if (crs->crs_eof)
+			return 0;
+		else
+			return -EAGAIN;
+	}
 
-	if (file->f_flags & O_NONBLOCK && crs->crs_rec_count == 0)
-		return -EAGAIN;
-
-	wait_event_idle(crs->crs_waitq_cons,
+	rc = wait_event_interruptible(crs->crs_waitq_cons,
 			crs->crs_rec_count > 0 || crs->crs_eof || crs->crs_err);
 
 	mutex_lock(&crs->crs_lock);
@@ -279,8 +286,7 @@ static ssize_t chlg_read(struct file *file, char __user *buff, size_t count,
 			break;
 
 		if (copy_to_user(buff, rec->enq_record, rec->enq_length)) {
-			if (written_total == 0)
-				written_total = -EFAULT;
+			rc = -EFAULT;
 			break;
 		}
 
@@ -294,15 +300,19 @@ static ssize_t chlg_read(struct file *file, char __user *buff, size_t count,
 	}
 	mutex_unlock(&crs->crs_lock);
 
-	if (written_total > 0)
+	if (written_total > 0) {
+		rc = written_total;
 		wake_up_all(&crs->crs_waitq_prod);
+	} else if (rc == 0) {
+		rc = crs->crs_err;
+	}
 
 	list_for_each_entry_safe(rec, tmp, &consumed, enq_linkage)
 		enq_record_delete(rec);
 
 	*ppos = crs->crs_start_offset;
 
-	return written_total;
+	return rc;
 }
 
 /**
@@ -509,15 +519,16 @@ static int chlg_release(struct inode *inode, struct file *file)
 	struct chlg_reader_state *crs = file->private_data;
 	struct chlg_rec_entry *rec;
 	struct chlg_rec_entry *tmp;
+	int rc = 0;
 
 	if (crs->crs_prod_task)
-		kthread_stop(crs->crs_prod_task);
+		rc = kthread_stop(crs->crs_prod_task);
 
 	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage)
 		enq_record_delete(rec);
 
 	kfree(crs);
-	return 0;
+	return rc;
 }
 
 /**
-- 
1.8.3.1


* [lustre-devel] [PATCH 09/12] lustre: obdclass: obd_device improvement
  2018-11-26  2:48 [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews James Simmons
                   ` (7 preceding siblings ...)
  2018-11-26  2:48 ` [lustre-devel] [PATCH 08/12] lustre: mdc: propagate changelog errors to readers James Simmons
@ 2018-11-26  2:48 ` James Simmons
  2018-11-27  4:01   ` NeilBrown
  2018-11-26  2:48 ` [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework James Simmons
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 23+ messages in thread
From: James Simmons @ 2018-11-26  2:48 UTC (permalink / raw)
  To: lustre-devel

From: Alexander Boyko <c17825@cray.com>

This patch removes self exports from the obd's reference counting,
which avoids having self exports freed by the zombie thread. A pair
of functions, class_register_device()/class_unregister_device(),
makes sure that an obd cannot be referenced again once its refcount
has reached 0.
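
A simplified sketch of the intended obd lifecycle after this change
(error handling omitted; see the patch below for the real code):

	/* attach */
	obd = class_newdev(type, name, uuid);	/* refcount = 1 ("newdev") */
	obd->obd_self_export = class_new_export_self(obd, &obd->obd_uuid);
	class_register_device(obd);		/* now visible in obd_devs[] */

	/* detach */
	class_unregister_device(obd);		/* no new lookups can find it */
	class_decref(obd, "newdev", obd);	/* last ref unlinks the self export */
	/* the final class_export_put() on the self export frees the obd
	 * via class_free_dev(), with no zombie thread involved
	 */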

Signed-off-by: Vladimir Saveliev <c17830@cray.com>
Signed-off-by: Alexey Lyashkov <c17817@cray.com>
Signed-off-by: Alexander Boyko <c17825@cray.com>
Signed-off-by: Yang Sheng <ys@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-4134
Seagate-bug-id: MRP-2139 MRP-3267
Reviewed-on: https://review.whamcloud.com/8045
Reviewed-on: https://review.whamcloud.com/29967
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/include/obd_class.h  |   9 +-
 drivers/staging/lustre/lustre/obdclass/genops.c    | 284 +++++++++++++++------
 .../staging/lustre/lustre/obdclass/obd_config.c    | 143 ++++-------
 drivers/staging/lustre/lustre/obdclass/obd_mount.c |   6 +-
 4 files changed, 261 insertions(+), 181 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/obd_class.h b/drivers/staging/lustre/lustre/include/obd_class.h
index 567189c..cc00915 100644
--- a/drivers/staging/lustre/lustre/include/obd_class.h
+++ b/drivers/staging/lustre/lustre/include/obd_class.h
@@ -65,8 +65,11 @@ int class_register_type(struct obd_ops *dt_ops, struct md_ops *md_ops,
 			const char *name, struct lu_device_type *ldt);
 int class_unregister_type(const char *name);
 
-struct obd_device *class_newdev(const char *type_name, const char *name);
-void class_release_dev(struct obd_device *obd);
+struct obd_device *class_newdev(const char *type_name, const char *name,
+				const char *uuid);
+int class_register_device(struct obd_device *obd);
+void class_unregister_device(struct obd_device *obd);
+void class_free_dev(struct obd_device *obd);
 
 int class_name2dev(const char *name);
 struct obd_device *class_name2obd(const char *name);
@@ -230,6 +233,8 @@ void __class_export_del_lock_ref(struct obd_export *exp,
 void class_export_put(struct obd_export *exp);
 struct obd_export *class_new_export(struct obd_device *obddev,
 				    struct obd_uuid *cluuid);
+struct obd_export *class_new_export_self(struct obd_device *obd,
+					 struct obd_uuid *uuid);
 void class_unlink_export(struct obd_export *exp);
 
 struct obd_import *class_import_get(struct obd_import *imp);
diff --git a/drivers/staging/lustre/lustre/obdclass/genops.c b/drivers/staging/lustre/lustre/obdclass/genops.c
index 59891a8..cdd44f7 100644
--- a/drivers/staging/lustre/lustre/obdclass/genops.c
+++ b/drivers/staging/lustre/lustre/obdclass/genops.c
@@ -38,6 +38,7 @@
 
 #define DEBUG_SUBSYSTEM S_CLASS
 #include <obd_class.h>
+#include <lustre_log.h>
 #include <lprocfs_status.h>
 #include <lustre_kernelcomm.h>
 
@@ -273,21 +274,20 @@ int class_unregister_type(const char *name)
 /**
  * Create a new obd device.
  *
- * Find an empty slot in ::obd_devs[], create a new obd device in it.
+ * Allocate the new obd_device and initialize it.
  *
  * \param[in] type_name obd device type string.
  * \param[in] name      obd device name.
+ * @uuid		obd device UUID.
  *
- * \retval NULL if create fails, otherwise return the obd device
- *	 pointer created.
+ * RETURN newdev	 pointer to created obd_device
+ * RETURN ERR_PTR(errno) on error
  */
-struct obd_device *class_newdev(const char *type_name, const char *name)
+struct obd_device *class_newdev(const char *type_name, const char *name,
+				const char *uuid)
 {
-	struct obd_device *result = NULL;
 	struct obd_device *newdev;
 	struct obd_type *type = NULL;
-	int i;
-	int new_obd_minor = 0;
 
 	if (strlen(name) >= MAX_OBD_NAME) {
 		CERROR("name/uuid must be < %u bytes long\n", MAX_OBD_NAME);
@@ -302,87 +302,167 @@ struct obd_device *class_newdev(const char *type_name, const char *name)
 
 	newdev = obd_device_alloc();
 	if (!newdev) {
-		result = ERR_PTR(-ENOMEM);
-		goto out_type;
+		class_put_type(type);
+		return ERR_PTR(-ENOMEM);
 	}
 
 	LASSERT(newdev->obd_magic == OBD_DEVICE_MAGIC);
+	strncpy(newdev->obd_name, name, sizeof(newdev->obd_name) - 1);
+	newdev->obd_type = type;
+	newdev->obd_minor = -1;
+
+	rwlock_init(&newdev->obd_pool_lock);
+	newdev->obd_pool_limit = 0;
+	newdev->obd_pool_slv = 0;
+
+	INIT_LIST_HEAD(&newdev->obd_exports);
+	INIT_LIST_HEAD(&newdev->obd_unlinked_exports);
+	INIT_LIST_HEAD(&newdev->obd_delayed_exports);
+	spin_lock_init(&newdev->obd_nid_lock);
+	spin_lock_init(&newdev->obd_dev_lock);
+	mutex_init(&newdev->obd_dev_mutex);
+	spin_lock_init(&newdev->obd_osfs_lock);
+	/* newdev->obd_osfs_age must be set to a value in the distant
+	 * past to guarantee a fresh statfs is fetched on mount.
+	 */
+	newdev->obd_osfs_age = get_jiffies_64() - 1000 * HZ;
 
-	write_lock(&obd_dev_lock);
-	for (i = 0; i < class_devno_max(); i++) {
-		struct obd_device *obd = class_num2obd(i);
+	/* XXX belongs in setup not attach  */
+	init_rwsem(&newdev->obd_observer_link_sem);
+	/* recovery data */
+	init_waitqueue_head(&newdev->obd_evict_inprogress_waitq);
 
-		if (obd && (strcmp(name, obd->obd_name) == 0)) {
-			CERROR("Device %s already exists at %d, won't add\n",
-			       name, i);
-			if (result) {
-				LASSERTF(result->obd_magic == OBD_DEVICE_MAGIC,
-					 "%p obd_magic %08x != %08x\n", result,
-					 result->obd_magic, OBD_DEVICE_MAGIC);
-				LASSERTF(result->obd_minor == new_obd_minor,
-					 "%p obd_minor %d != %d\n", result,
-					 result->obd_minor, new_obd_minor);
-
-				obd_devs[result->obd_minor] = NULL;
-				result->obd_name[0] = '\0';
-			 }
-			result = ERR_PTR(-EEXIST);
-			break;
-		}
-		if (!result && !obd) {
-			result = newdev;
-			result->obd_minor = i;
-			new_obd_minor = i;
-			result->obd_type = type;
-			strncpy(result->obd_name, name,
-				sizeof(result->obd_name) - 1);
-			obd_devs[i] = result;
-		}
-	}
-	write_unlock(&obd_dev_lock);
+	llog_group_init(&newdev->obd_olg);
+	/* Detach drops this */
+	atomic_set(&newdev->obd_refcount, 1);
+	lu_ref_init(&newdev->obd_reference);
+	lu_ref_add(&newdev->obd_reference, "newdev", newdev);
 
-	if (!result && i >= class_devno_max()) {
-		CERROR("all %u OBD devices used, increase MAX_OBD_DEVICES\n",
-		       class_devno_max());
-		result = ERR_PTR(-EOVERFLOW);
-		goto out;
-	}
+	newdev->obd_conn_inprogress = 0;
 
-	if (IS_ERR(result))
-		goto out;
+	strncpy(newdev->obd_uuid.uuid, uuid, strlen(uuid));
 
-	CDEBUG(D_IOCTL, "Adding new device %s (%p)\n",
-	       result->obd_name, result);
+	CDEBUG(D_IOCTL, "Allocate new device %s (%p)\n",
+	       newdev->obd_name, newdev);
 
-	return result;
-out:
-	obd_device_free(newdev);
-out_type:
-	class_put_type(type);
-	return result;
+	return newdev;
 }
 
-void class_release_dev(struct obd_device *obd)
+/**
+ * Free obd device.
+ *
+ * @obd obd_device to be freed
+ *
+ * RETURN none
+ */
+void class_free_dev(struct obd_device *obd)
 {
 	struct obd_type *obd_type = obd->obd_type;
 
 	LASSERTF(obd->obd_magic == OBD_DEVICE_MAGIC, "%p obd_magic %08x != %08x\n",
 		 obd, obd->obd_magic, OBD_DEVICE_MAGIC);
-	LASSERTF(obd == obd_devs[obd->obd_minor], "obd %p != obd_devs[%d] %p\n",
+	LASSERTF(obd->obd_minor == -1 || obd == obd_devs[obd->obd_minor],
+		 "obd %p != obd_devs[%d] %p\n",
 		 obd, obd->obd_minor, obd_devs[obd->obd_minor]);
+	LASSERTF(atomic_read(&obd->obd_refcount) == 0,
+		 "obd_refcount should be 0, not %d\n",
+		 atomic_read(&obd->obd_refcount));
 	LASSERT(obd_type);
 
-	CDEBUG(D_INFO, "Release obd device %s at %d obd_type name =%s\n",
-	       obd->obd_name, obd->obd_minor, obd->obd_type->typ_name);
+	CDEBUG(D_INFO, "Release obd device %s obd_type name =%s\n",
+	       obd->obd_name, obd->obd_type->typ_name);
+
+	CDEBUG(D_CONFIG, "finishing cleanup of obd %s (%s)\n",
+	       obd->obd_name, obd->obd_uuid.uuid);
+	if (obd->obd_stopping) {
+		int err;
+
+		/* If we're not stopping, we were never set up */
+		err = obd_cleanup(obd);
+		if (err)
+			CERROR("Cleanup %s returned %d\n",
+			       obd->obd_name, err);
+	}
 
-	write_lock(&obd_dev_lock);
-	obd_devs[obd->obd_minor] = NULL;
-	write_unlock(&obd_dev_lock);
 	obd_device_free(obd);
 
 	class_put_type(obd_type);
 }
 
+/**
+ * Unregister obd device.
+ *
+ * Free slot in obd_dev[] used by \a obd.
+ *
+ * @new_obd	obd_device to be unregistered
+ *
+ * RETURN	none
+ */
+void class_unregister_device(struct obd_device *obd)
+{
+	write_lock(&obd_dev_lock);
+	if (obd->obd_minor >= 0) {
+		LASSERT(obd_devs[obd->obd_minor] == obd);
+		obd_devs[obd->obd_minor] = NULL;
+		obd->obd_minor = -1;
+	}
+	write_unlock(&obd_dev_lock);
+}
+
+/**
+ * Register obd device.
+ *
+ * Find free slot in obd_devs[], fills it with \a new_obd.
+ *
+ * @new_obd	obd_device to be registered
+ *
+ * RETURN	0		success
+ *		-EEXIST		device with this name is registered
+ *		-EOVERFLOW	obd_devs[] is full
+ */
+int class_register_device(struct obd_device *new_obd)
+{
+	int new_obd_minor = 0;
+	bool minor_assign = false;
+	int ret = 0;
+	int i;
+
+	write_lock(&obd_dev_lock);
+	for (i = 0; i < class_devno_max(); i++) {
+		struct obd_device *obd = class_num2obd(i);
+
+		if (obd && (strcmp(new_obd->obd_name, obd->obd_name) == 0)) {
+			CERROR("%s: already exists, won't add\n",
+			       obd->obd_name);
+			/* in case we found a free slot before duplicate */
+			minor_assign = false;
+			ret = -EEXIST;
+			break;
+		}
+		if (!minor_assign && !obd) {
+			new_obd_minor = i;
+			minor_assign = true;
+		}
+	}
+
+	if (minor_assign) {
+		new_obd->obd_minor = new_obd_minor;
+		LASSERTF(!obd_devs[new_obd_minor], "obd_devs[%d] %p\n",
+			 new_obd_minor, obd_devs[new_obd_minor]);
+		obd_devs[new_obd_minor] = new_obd;
+	} else {
+		if (ret == 0) {
+			ret = -EOVERFLOW;
+			CERROR("%s: all %u/%u devices used, increase MAX_OBD_DEVICES: rc = %d\n",
+			       new_obd->obd_name, i, class_devno_max(), ret);
+		}
+	}
+	write_unlock(&obd_dev_lock);
+
+	return ret;
+}
+
+
 int class_name2dev(const char *name)
 {
 	int i;
@@ -677,7 +757,11 @@ static void class_export_destroy(struct obd_export *exp)
 	LASSERT(list_empty(&exp->exp_req_replay_queue));
 	LASSERT(list_empty(&exp->exp_hp_rpcs));
 	obd_destroy_export(exp);
-	class_decref(obd, "export", exp);
+	/* self export doesn't hold a reference to an obd, although it
+	 * exists until freeing of the obd
+	 */
+	if (exp != obd->obd_self_export)
+		class_decref(obd, "export", exp);
 
 	OBD_FREE_RCU(exp, sizeof(*exp), &exp->exp_handle);
 }
@@ -708,11 +792,27 @@ void class_export_put(struct obd_export *exp)
 	       atomic_read(&exp->exp_refcount) - 1);
 
 	if (atomic_dec_and_test(&exp->exp_refcount)) {
-		LASSERT(!list_empty(&exp->exp_obd_chain));
+		struct obd_device *obd = exp->exp_obd;
+
 		CDEBUG(D_IOCTL, "final put %p/%s\n",
 		       exp, exp->exp_client_uuid.uuid);
 
-		obd_zombie_export_add(exp);
+		if (exp == obd->obd_self_export) {
+			/* self export should be destroyed without
+			 * zombie thread as it doesn't hold a
+			 * reference to obd and doesn't hold any
+			 * resources
+			 */
+			class_export_destroy(exp);
+			/* self export is destroyed, no class
+			 * references exist and it is safe to free
+			 * obd
+			 */
+			class_free_dev(obd);
+		} else {
+			LASSERT(!list_empty(&exp->exp_obd_chain));
+			obd_zombie_export_add(exp);
+		}
 	}
 }
 EXPORT_SYMBOL(class_export_put);
@@ -728,8 +828,9 @@ static void obd_zombie_exp_cull(struct work_struct *ws)
  * pointer to it. The refcount is 2: one for the hash reference, and
  * one for the pointer returned by this function.
  */
-struct obd_export *class_new_export(struct obd_device *obd,
-				    struct obd_uuid *cluuid)
+static struct obd_export *__class_new_export(struct obd_device *obd,
+					     struct obd_uuid *cluuid,
+					     bool is_self)
 {
 	struct obd_export *export;
 	int rc = 0;
@@ -739,6 +840,7 @@ struct obd_export *class_new_export(struct obd_device *obd,
 		return ERR_PTR(-ENOMEM);
 
 	export->exp_conn_cnt = 0;
+	/* 2 = class_handle_hash + last */
 	atomic_set(&export->exp_refcount, 2);
 	atomic_set(&export->exp_rpc_count, 0);
 	atomic_set(&export->exp_cb_count, 0);
@@ -767,41 +869,65 @@ struct obd_export *class_new_export(struct obd_device *obd,
 	export->exp_client_uuid = *cluuid;
 	obd_init_export(export);
 
-	spin_lock(&obd->obd_dev_lock);
-	/* shouldn't happen, but might race */
-	if (obd->obd_stopping) {
-		rc = -ENODEV;
-		goto exit_unlock;
-	}
-
 	if (!obd_uuid_equals(cluuid, &obd->obd_uuid)) {
+		spin_lock(&obd->obd_dev_lock);
+		/* shouldn't happen, but might race */
+		if (obd->obd_stopping) {
+			rc = -ENODEV;
+			goto exit_unlock;
+		}
+		spin_unlock(&obd->obd_dev_lock);
+
 		rc = obd_uuid_add(obd, export);
 		if (rc) {
 			LCONSOLE_WARN("%s: denying duplicate export for %s, %d\n",
 				      obd->obd_name, cluuid->uuid, rc);
-			goto exit_unlock;
+			goto exit_err;
 		}
 	}
 
-	class_incref(obd, "export", export);
-	list_add(&export->exp_obd_chain, &export->exp_obd->obd_exports);
-	export->exp_obd->obd_num_exports++;
+	spin_lock(&obd->obd_dev_lock);
+	if (!is_self) {
+		class_incref(obd, "export", export);
+		list_add(&export->exp_obd_chain, &export->exp_obd->obd_exports);
+		obd->obd_num_exports++;
+	} else {
+		INIT_LIST_HEAD(&export->exp_obd_chain);
+	}
 	spin_unlock(&obd->obd_dev_lock);
 	return export;
 
 exit_unlock:
 	spin_unlock(&obd->obd_dev_lock);
+exit_err:
 	class_handle_unhash(&export->exp_handle);
 	obd_destroy_export(export);
 	kfree(export);
 	return ERR_PTR(rc);
 }
+
+struct obd_export *class_new_export(struct obd_device *obd,
+				    struct obd_uuid *uuid)
+{
+	return __class_new_export(obd, uuid, false);
+}
 EXPORT_SYMBOL(class_new_export);
 
+struct obd_export *class_new_export_self(struct obd_device *obd,
+					 struct obd_uuid *uuid)
+{
+	return __class_new_export(obd, uuid, true);
+}
+
 void class_unlink_export(struct obd_export *exp)
 {
 	class_handle_unhash(&exp->exp_handle);
 
+	if (exp->exp_obd->obd_self_export == exp) {
+		class_export_put(exp);
+		return;
+	}
+
 	spin_lock(&exp->exp_obd->obd_dev_lock);
 	/* delete an uuid-export hashitem from hashtables */
 	if (exp != exp->exp_obd->obd_self_export)
diff --git a/drivers/staging/lustre/lustre/obdclass/obd_config.c b/drivers/staging/lustre/lustre/obdclass/obd_config.c
index 9e46eb2..8be8751 100644
--- a/drivers/staging/lustre/lustre/obdclass/obd_config.c
+++ b/drivers/staging/lustre/lustre/obdclass/obd_config.c
@@ -280,6 +280,7 @@ static int class_attach(struct lustre_cfg *lcfg)
 {
 	struct obd_device *obd = NULL;
 	char *typename, *name, *uuid;
+	struct obd_export *exp;
 	int rc, len;
 
 	if (!LUSTRE_CFG_BUFLEN(lcfg, 1)) {
@@ -298,78 +299,50 @@ static int class_attach(struct lustre_cfg *lcfg)
 		CERROR("No UUID passed!\n");
 		return -EINVAL;
 	}
-	uuid = lustre_cfg_string(lcfg, 2);
 
-	CDEBUG(D_IOCTL, "attach type %s name: %s uuid: %s\n",
-	       typename, name, uuid);
+	uuid = lustre_cfg_string(lcfg, 2);
+	len = strlen(uuid);
+	if (len >= sizeof(obd->obd_uuid)) {
+		CERROR("uuid must be < %d bytes long\n",
+		       (int)sizeof(obd->obd_uuid));
+		return -EINVAL;
+	}
 
-	obd = class_newdev(typename, name);
+	obd = class_newdev(typename, name, uuid);
 	if (IS_ERR(obd)) {
 		/* Already exists or out of obds */
 		rc = PTR_ERR(obd);
-		obd = NULL;
 		CERROR("Cannot create device %s of type %s : %d\n",
 		       name, typename, rc);
-		goto out;
+		return rc;
 	}
-	LASSERTF(obd, "Cannot get obd device %s of type %s\n",
-		 name, typename);
 	LASSERTF(obd->obd_magic == OBD_DEVICE_MAGIC,
 		 "obd %p obd_magic %08X != %08X\n",
 		 obd, obd->obd_magic, OBD_DEVICE_MAGIC);
 	LASSERTF(strncmp(obd->obd_name, name, strlen(name)) == 0,
 		 "%p obd_name %s != %s\n", obd, obd->obd_name, name);
 
-	rwlock_init(&obd->obd_pool_lock);
-	obd->obd_pool_limit = 0;
-	obd->obd_pool_slv = 0;
-
-	INIT_LIST_HEAD(&obd->obd_exports);
-	INIT_LIST_HEAD(&obd->obd_unlinked_exports);
-	INIT_LIST_HEAD(&obd->obd_delayed_exports);
-	spin_lock_init(&obd->obd_nid_lock);
-	spin_lock_init(&obd->obd_dev_lock);
-	mutex_init(&obd->obd_dev_mutex);
-	spin_lock_init(&obd->obd_osfs_lock);
-	/* obd->obd_osfs_age must be set to a value in the distant
-	 * past to guarantee a fresh statfs is fetched on mount.
-	 */
-	obd->obd_osfs_age = get_jiffies_64() - 1000 * HZ;
-
-	/* XXX belongs in setup not attach  */
-	init_rwsem(&obd->obd_observer_link_sem);
-	/* recovery data */
-	init_waitqueue_head(&obd->obd_evict_inprogress_waitq);
-
-	llog_group_init(&obd->obd_olg);
+	exp = class_new_export_self(obd, &obd->obd_uuid);
+	if (IS_ERR(exp)) {
+		rc = PTR_ERR(exp);
+		class_free_dev(obd);
+		return rc;
+	}
 
-	obd->obd_conn_inprogress = 0;
+	obd->obd_self_export = exp;
+	class_export_put(exp);
 
-	len = strlen(uuid);
-	if (len >= sizeof(obd->obd_uuid)) {
-		CERROR("uuid must be < %d bytes long\n",
-		       (int)sizeof(obd->obd_uuid));
-		rc = -EINVAL;
-		goto out;
+	rc = class_register_device(obd);
+	if (rc) {
+		class_decref(obd, "newdev", obd);
+		return rc;
 	}
-	memcpy(obd->obd_uuid.uuid, uuid, len);
-
-	/* Detach drops this */
-	spin_lock(&obd->obd_dev_lock);
-	atomic_set(&obd->obd_refcount, 1);
-	spin_unlock(&obd->obd_dev_lock);
-	lu_ref_init(&obd->obd_reference);
-	lu_ref_add(&obd->obd_reference, "attach", obd);
 
 	obd->obd_attached = 1;
 	CDEBUG(D_IOCTL, "OBD: dev %d attached type %s with refcount %d\n",
 	       obd->obd_minor, typename, atomic_read(&obd->obd_refcount));
-	return 0;
- out:
-	if (obd)
-		class_release_dev(obd);
 
-	return rc;
+	return 0;
 }
 
 /** Create hashes, self-export, and call type-specific setup.
@@ -378,7 +351,6 @@ static int class_attach(struct lustre_cfg *lcfg)
 static int class_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 {
 	int err = 0;
-	struct obd_export *exp;
 
 	LASSERT(obd);
 	LASSERTF(obd == class_num2obd(obd->obd_minor),
@@ -420,18 +392,9 @@ static int class_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	if (err)
 		goto err_hash;
 
-	exp = class_new_export(obd, &obd->obd_uuid);
-	if (IS_ERR(exp)) {
-		err = PTR_ERR(exp);
-		goto err_new;
-	}
-
-	obd->obd_self_export = exp;
-	class_export_put(exp);
-
 	err = obd_setup(obd, lcfg);
 	if (err)
-		goto err_exp;
+		goto err_setup;
 
 	obd->obd_set_up = 1;
 
@@ -444,12 +407,7 @@ static int class_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	       obd->obd_name, obd->obd_uuid.uuid);
 
 	return 0;
-err_exp:
-	if (obd->obd_self_export) {
-		class_unlink_export(obd->obd_self_export);
-		obd->obd_self_export = NULL;
-	}
-err_new:
+err_setup:
 	rhashtable_destroy(&obd->obd_uuid_hash);
 err_hash:
 	obd->obd_starting = 0;
@@ -476,10 +434,13 @@ static int class_detach(struct obd_device *obd, struct lustre_cfg *lcfg)
 	obd->obd_attached = 0;
 	spin_unlock(&obd->obd_dev_lock);
 
+	/* cleanup in progress: we don't want this device to be found from now on */
+	class_unregister_device(obd);
+
 	CDEBUG(D_IOCTL, "detach on obd %s (uuid %s)\n",
 	       obd->obd_name, obd->obd_uuid.uuid);
 
-	class_decref(obd, "attach", obd);
+	class_decref(obd, "newdev", obd);
 	return 0;
 }
 
@@ -507,6 +468,10 @@ static int class_cleanup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	}
 	/* Leave this on forever */
 	obd->obd_stopping = 1;
+	/* function can't return error after that point, so clear setup flag
+	 * as early as possible to avoid finding via obd_devs / hash
+	 */
+	obd->obd_set_up = 0;
 	spin_unlock(&obd->obd_dev_lock);
 
 	while (obd->obd_conn_inprogress > 0)
@@ -567,43 +532,27 @@ struct obd_device *class_incref(struct obd_device *obd,
 
 void class_decref(struct obd_device *obd, const char *scope, const void *source)
 {
-	int err;
-	int refs;
+	int last;
 
-	spin_lock(&obd->obd_dev_lock);
-	atomic_dec(&obd->obd_refcount);
-	refs = atomic_read(&obd->obd_refcount);
-	spin_unlock(&obd->obd_dev_lock);
+	CDEBUG(D_INFO, "Decref %s (%p) now %d - %s\n", obd->obd_name, obd,
+	       atomic_read(&obd->obd_refcount), scope);
+
+	LASSERT(obd->obd_num_exports >= 0);
+	last = atomic_dec_and_test(&obd->obd_refcount);
 	lu_ref_del(&obd->obd_reference, scope, source);
 
-	CDEBUG(D_INFO, "Decref %s (%p) now %d\n", obd->obd_name, obd, refs);
+	if (last) {
+		struct obd_export *exp;
 
-	if ((refs == 1) && obd->obd_stopping) {
+		LASSERT(!obd->obd_attached);
 		/* All exports have been destroyed; there should
 		 * be no more in-progress ops by this point.
 		 */
-
-		spin_lock(&obd->obd_self_export->exp_lock);
-		obd->obd_self_export->exp_flags |= exp_flags_from_obd(obd);
-		spin_unlock(&obd->obd_self_export->exp_lock);
-
-		/* note that we'll recurse into class_decref again */
-		class_unlink_export(obd->obd_self_export);
-		return;
-	}
-
-	if (refs == 0) {
-		CDEBUG(D_CONFIG, "finishing cleanup of obd %s (%s)\n",
-		       obd->obd_name, obd->obd_uuid.uuid);
-		LASSERT(!obd->obd_attached);
-		if (obd->obd_stopping) {
-			/* If we're not stopping, we were never set up */
-			err = obd_cleanup(obd);
-			if (err)
-				CERROR("Cleanup %s returned %d\n",
-				       obd->obd_name, err);
+		exp = obd->obd_self_export;
+		if (exp) {
+			exp->exp_flags |= exp_flags_from_obd(obd);
+			class_unlink_export(exp);
 		}
-		class_release_dev(obd);
 	}
 }
 EXPORT_SYMBOL(class_decref);
diff --git a/drivers/staging/lustre/lustre/obdclass/obd_mount.c b/drivers/staging/lustre/lustre/obdclass/obd_mount.c
index 5ed1758..db5e1b5 100644
--- a/drivers/staging/lustre/lustre/obdclass/obd_mount.c
+++ b/drivers/staging/lustre/lustre/obdclass/obd_mount.c
@@ -214,7 +214,7 @@ int lustre_start_mgc(struct super_block *sb)
 	struct lustre_sb_info *lsi = s2lsi(sb);
 	struct obd_device *obd;
 	struct obd_export *exp;
-	struct obd_uuid *uuid;
+	struct obd_uuid *uuid = NULL;
 	class_uuid_t uuidc;
 	lnet_nid_t nid;
 	char nidstr[LNET_NIDSTR_SIZE];
@@ -342,7 +342,6 @@ int lustre_start_mgc(struct super_block *sb)
 	rc = lustre_start_simple(mgcname, LUSTRE_MGC_NAME,
 				 (char *)uuid->uuid, LUSTRE_MGS_OBDNAME,
 				 niduuid, NULL, NULL);
-	kfree(uuid);
 	if (rc)
 		goto out_free;
 
@@ -404,7 +403,7 @@ int lustre_start_mgc(struct super_block *sb)
 	    lsi->lsi_lmd->lmd_flags & LMD_FLG_NOIR)
 		data->ocd_connect_flags &= ~OBD_CONNECT_IMP_RECOV;
 	data->ocd_version = LUSTRE_VERSION_CODE;
-	rc = obd_connect(NULL, &exp, obd, &obd->obd_uuid, data, NULL);
+	rc = obd_connect(NULL, &exp, obd, uuid, data, NULL);
 	if (rc) {
 		CERROR("connect failed %d\n", rc);
 		goto out;
@@ -420,6 +419,7 @@ int lustre_start_mgc(struct super_block *sb)
 out_free:
 	mutex_unlock(&mgc_start_lock);
 
+	kfree(uuid);
 	kfree(data);
 	kfree(mgcname);
 	kfree(niduuid);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework
  2018-11-26  2:48 [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews James Simmons
                   ` (8 preceding siblings ...)
  2018-11-26  2:48 ` [lustre-devel] [PATCH 09/12] lustre: obdclass: obd_device improvement James Simmons
@ 2018-11-26  2:48 ` James Simmons
  2018-11-27  4:20   ` NeilBrown
  2018-11-26  2:48 ` [lustre-devel] [PATCH 11/12] lustre: mdc: use large xattr buffers for old servers James Simmons
  2018-11-26  2:48 ` [lustre-devel] [PATCH 12/12] lustre: update TODO lustre list James Simmons
  11 siblings, 1 reply; 23+ messages in thread
From: James Simmons @ 2018-11-26  2:48 UTC (permalink / raw)
  To: lustre-devel

From: Dmitry Eremin <dmitry.eremin@intel.com>

In this patch a new API for parallel task execution is introduced.
This API is based on the Linux kernel padata API, which is used to perform
encryption and decryption on large numbers of packets without
reordering those packets.

It was adopted for general use in Lustre for parallelization of
various functionality. Its first user is the parallel I/O
implementation.

The first step in using it is to set up a cl_ptask structure to
control how the task is to be run:

    #include <cl_ptask.h>

    int cl_ptask_init(struct cl_ptask *ptask, cl_ptask_cb_t cbfunc,
                      void *cbdata, unsigned int flags, int cpu);

The cbfunc function will be called with the cbdata argument in the process
of getting the task done. The cpu argument specifies which CPU will be used
for the final callback when the task is done.

The submission of a task is done with:

    int cl_ptask_submit(struct cl_ptask *ptask,
                        struct cl_ptask_engine *engine);

The task is submitted to the engine for execution.

In order to wait for the result of task execution you should call:

   int cl_ptask_wait_for(struct cl_ptask *ptask);

Tasks with the PTF_ORDERED flag are executed in parallel but complete
in submission order. So, by waiting for the last ordered task, you can be
sure that all previous tasks completed before it did.

This patch differs from the OpenSFS tree by adding this functionality
to the clio layer instead of libcfs.
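
For illustration only (not part of this patch), a minimal sketch of how
a caller might drive this API.  The my_subtask() callback, struct
my_args, do_one_slice() and the passed-in engine are hypothetical names:

    /* sketch only - assumes #include <cl_ptask.h> */
    static int my_subtask(struct cl_ptask *ptask)
    {
            struct my_args *args = ptask->pt_cbdata;

            /* perform one slice of the work described by args */
            return do_one_slice(args);
    }

    static int run_all_slices(struct cl_ptask_engine *engine,
                              struct cl_ptask *tasks, struct my_args *args,
                              int count, int cb_cpu)
    {
            int rc, i;

            for (i = 0; i < count; i++) {
                    /* PTF_ORDERED: tasks complete in submission order */
                    rc = cl_ptask_init(&tasks[i], my_subtask, &args[i],
                                       PTF_COMPLETE | PTF_ORDERED, cb_cpu);
                    if (rc)
                            return rc;

                    rc = cl_ptask_submit(&tasks[i], engine);
                    if (rc)
                            return rc;
            }

            /* waiting on the last ordered task covers all earlier ones */
            rc = cl_ptask_wait_for(&tasks[count - 1]);
            if (rc)
                    return rc;

            return cl_ptask_result(&tasks[count - 1]);
    }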

Signed-off-by: Dmitry Eremin <dmitry.eremin@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8964
Reviewed-on: https://review.whamcloud.com/24474
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/include/cl_ptask.h  | 145 +++++++
 drivers/staging/lustre/lustre/obdclass/Makefile   |   3 +-
 drivers/staging/lustre/lustre/obdclass/cl_ptask.c | 501 ++++++++++++++++++++++
 3 files changed, 648 insertions(+), 1 deletion(-)
 create mode 100644 drivers/staging/lustre/lustre/include/cl_ptask.h
 create mode 100644 drivers/staging/lustre/lustre/obdclass/cl_ptask.c

diff --git a/drivers/staging/lustre/lustre/include/cl_ptask.h b/drivers/staging/lustre/lustre/include/cl_ptask.h
new file mode 100644
index 0000000..02abd69
--- /dev/null
+++ b/drivers/staging/lustre/lustre/include/cl_ptask.h
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+/*
+ * Copyright (c) 2017, Intel Corporation.
+ * Use is subject to license terms.
+ */
+/*
+ * This file is part of Lustre, http://www.lustre.org/
+ *
+ * parallel task interface
+ */
+#ifndef __CL_LUSTRE_PTASK_H__
+#define __CL_LUSTRE_PTASK_H__
+
+#include <linux/types.h>
+#include <linux/bitops.h>
+#include <linux/kernel.h>
+#include <linux/cpumask.h>
+#include <linux/uaccess.h>
+#include <linux/notifier.h>
+#include <linux/workqueue.h>
+#include <linux/completion.h>
+#ifdef CONFIG_PADATA
+#include <linux/padata.h>
+#else
+struct padata_priv {};
+struct padata_instance {};
+#endif
+
+#define PTF_COMPLETE	BIT(0)
+#define PTF_AUTOFREE	BIT(1)
+#define PTF_ORDERED	BIT(2)
+#define PTF_USER_MM	BIT(3)
+#define PTF_ATOMIC	BIT(4)
+#define PTF_RETRY	BIT(5)
+
+struct cl_ptask_engine {
+	struct padata_instance	*pte_pinst;
+	struct workqueue_struct	*pte_wq;
+	struct notifier_block	 pte_notifier;
+	int			 pte_weight;
+};
+
+struct cl_ptask;
+typedef int (*cl_ptask_cb_t)(struct cl_ptask *);
+
+struct cl_ptask {
+	struct padata_priv	 pt_padata;
+	struct completion	 pt_completion;
+	mm_segment_t		 pt_fs;
+	struct mm_struct	*pt_mm;
+	unsigned int		 pt_flags;
+	int			 pt_cbcpu;
+	cl_ptask_cb_t		 pt_cbfunc;
+	void			*pt_cbdata;
+	int			 pt_result;
+};
+
+static inline
+struct padata_priv *cl_ptask2padata(struct cl_ptask *ptask)
+{
+	return &ptask->pt_padata;
+}
+
+static inline
+struct cl_ptask *cl_padata2ptask(struct padata_priv *padata)
+{
+	return container_of(padata, struct cl_ptask, pt_padata);
+}
+
+static inline
+bool cl_ptask_need_complete(struct cl_ptask *ptask)
+{
+	return ptask->pt_flags & PTF_COMPLETE;
+}
+
+static inline
+bool cl_ptask_is_autofree(struct cl_ptask *ptask)
+{
+	return ptask->pt_flags & PTF_AUTOFREE;
+}
+
+static inline
+bool cl_ptask_is_ordered(struct cl_ptask *ptask)
+{
+	return ptask->pt_flags & PTF_ORDERED;
+}
+
+static inline
+bool cl_ptask_use_user_mm(struct cl_ptask *ptask)
+{
+	return ptask->pt_flags & PTF_USER_MM;
+}
+
+static inline
+bool cl_ptask_is_atomic(struct cl_ptask *ptask)
+{
+	return ptask->pt_flags & PTF_ATOMIC;
+}
+
+static inline
+bool cl_ptask_is_retry(struct cl_ptask *ptask)
+{
+	return ptask->pt_flags & PTF_RETRY;
+}
+
+static inline
+int cl_ptask_result(struct cl_ptask *ptask)
+{
+	return ptask->pt_result;
+}
+
+struct cl_ptask_engine *cl_ptengine_init(const char *name,
+					 const struct cpumask *cpumask);
+void cl_ptengine_fini(struct cl_ptask_engine *engine);
+int cl_ptengine_set_cpumask(struct cl_ptask_engine *engine,
+			    const struct cpumask *cpumask);
+int cl_ptengine_weight(struct cl_ptask_engine *engine);
+
+int cl_ptask_submit(struct cl_ptask *ptask,  struct cl_ptask_engine *engine);
+int cl_ptask_wait_for(struct cl_ptask *ptask);
+int cl_ptask_init(struct cl_ptask *ptask, cl_ptask_cb_t cbfunc, void *cbdata,
+		  unsigned int flags, int cpu);
+
+#endif /* __CL_LUSTRE_PTASK_H__ */
diff --git a/drivers/staging/lustre/lustre/obdclass/Makefile b/drivers/staging/lustre/lustre/obdclass/Makefile
index b1fac48..a705aa0 100644
--- a/drivers/staging/lustre/lustre/obdclass/Makefile
+++ b/drivers/staging/lustre/lustre/obdclass/Makefile
@@ -8,4 +8,5 @@ obdclass-y := llog.o llog_cat.o llog_obd.o llog_swab.o class_obd.o debug.o \
 	      genops.o obd_sysfs.o lprocfs_status.o lprocfs_counters.o \
 	      lustre_handles.o lustre_peer.o statfs_pack.o linkea.o \
 	      obdo.o obd_config.o obd_mount.o lu_object.o lu_ref.o \
-	      cl_object.o cl_page.o cl_lock.o cl_io.o kernelcomm.o
+	      cl_object.o cl_page.o cl_lock.o cl_io.o cl_ptask.o \
+	      kernelcomm.o
diff --git a/drivers/staging/lustre/lustre/obdclass/cl_ptask.c b/drivers/staging/lustre/lustre/obdclass/cl_ptask.c
new file mode 100644
index 0000000..b0df3c4
--- /dev/null
+++ b/drivers/staging/lustre/lustre/obdclass/cl_ptask.c
@@ -0,0 +1,501 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+/*
+ * Copyright (c) 2017, Intel Corporation.
+ * Use is subject to license terms.
+ */
+/*
+ * This file is part of Lustre, http://www.lustre.org/
+ *
+ * parallel task interface
+ */
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/cpumask.h>
+#include <linux/cpu.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/mm.h>
+#include <linux/moduleparam.h>
+#include <linux/mmu_context.h>
+
+#define DEBUG_SUBSYSTEM S_UNDEFINED
+
+#include <linux/libcfs/libcfs.h>
+#include <cl_ptask.h>
+
+/**
+ * This API is based on the Linux kernel padata API which is used to perform
+ * encryption and decryption on large numbers of packets without
+ * reordering those packets.
+ *
+ * It was adopted for general use in Lustre for parallelization of
+ * various functionality.
+ *
+ * The first step in using it is to set up a cl_ptask structure to
+ * control how the task is to be run:
+ *
+ * #include <cl_ptask.h>
+ *
+ * int cl_ptask_init(struct cl_ptask *ptask, cl_ptask_cb_t cbfunc,
+ *                    void *cbdata, unsigned int flags, int cpu);
+ *
+ * The cbfunc function with cbdata argument will be called in the process
+ * of getting the task done. The cpu specifies which CPU will be used for
+ * the final callback when the task is done.
+ *
+ * The submission of task is done with:
+ *
+ * int cl_ptask_submit(struct cl_ptask *ptask, struct cl_ptask_engine *engine);
+ *
+ * The task is submitted to the engine for execution.
+ *
+ * In order to wait for result of task execution you should call:
+ *
+ * int cl_ptask_wait_for(struct cl_ptask *ptask);
+ *
+ * Tasks with the PTF_ORDERED flag are executed in parallel but complete
+ * in submission order. So, by waiting for the last ordered task, you can
+ * be sure that all previous tasks completed before it did.
+ */
+#ifdef CONFIG_PADATA
+static void cl_ptask_complete(struct padata_priv *padata)
+{
+	struct cl_ptask *ptask = cl_padata2ptask(padata);
+
+	if (cl_ptask_need_complete(ptask)) {
+		if (cl_ptask_is_ordered(ptask))
+			complete(&ptask->pt_completion);
+	} else if (cl_ptask_is_autofree(ptask)) {
+		kfree(ptask);
+	}
+}
+
+static void cl_ptask_execute(struct padata_priv *padata)
+{
+	struct cl_ptask *ptask = cl_padata2ptask(padata);
+	mm_segment_t old_fs = get_fs();
+	bool bh_enabled = false;
+
+	if (!cl_ptask_is_atomic(ptask)) {
+		local_bh_enable();
+		bh_enabled = true;
+	}
+
+	if (cl_ptask_use_user_mm(ptask) && ptask->pt_mm) {
+		use_mm(ptask->pt_mm);
+		set_fs(ptask->pt_fs);
+	}
+
+	if (ptask->pt_cbfunc)
+		ptask->pt_result = ptask->pt_cbfunc(ptask);
+	else
+		ptask->pt_result = -ENXIO;
+
+	if (cl_ptask_use_user_mm(ptask) && ptask->pt_mm) {
+		set_fs(old_fs);
+		unuse_mm(ptask->pt_mm);
+		mmput(ptask->pt_mm);
+		ptask->pt_mm = NULL;
+	}
+
+	if (cl_ptask_need_complete(ptask) && !cl_ptask_is_ordered(ptask))
+		complete(&ptask->pt_completion);
+
+	if (bh_enabled)
+		local_bh_disable();
+
+	padata_do_serial(padata);
+}
+
+static int cl_do_parallel(struct cl_ptask_engine *engine,
+			  struct padata_priv *padata)
+{
+	struct cl_ptask *ptask = cl_padata2ptask(padata);
+	int rc;
+
+	if (cl_ptask_need_complete(ptask))
+		reinit_completion(&ptask->pt_completion);
+
+	if (cl_ptask_use_user_mm(ptask)) {
+		ptask->pt_mm = get_task_mm(current);
+		ptask->pt_fs = get_fs();
+	}
+	ptask->pt_result = -EINPROGRESS;
+
+retry:
+	rc = padata_do_parallel(engine->pte_pinst, padata, ptask->pt_cbcpu);
+	if (rc == -EBUSY && cl_ptask_is_retry(ptask)) {
+		/* too many tasks already in queue */
+		schedule_timeout_uninterruptible(1);
+		goto retry;
+	}
+
+	if (rc) {
+		if (cl_ptask_use_user_mm(ptask) && ptask->pt_mm) {
+			mmput(ptask->pt_mm);
+			ptask->pt_mm = NULL;
+		}
+		ptask->pt_result = rc;
+	}
+
+	return rc;
+}
+
+/**
+ * This function submit initialized task for async execution
+ * in engine with specified id.
+ */
+int cl_ptask_submit(struct cl_ptask *ptask, struct cl_ptask_engine *engine)
+{
+	struct padata_priv *padata = cl_ptask2padata(ptask);
+
+	if (IS_ERR_OR_NULL(engine))
+		return -EINVAL;
+
+	memset(padata, 0, sizeof(*padata));
+
+	padata->parallel = cl_ptask_execute;
+	padata->serial   = cl_ptask_complete;
+
+	return cl_do_parallel(engine, padata);
+}
+
+#else  /* !CONFIG_PADATA */
+
+/**
+ * If CONFIG_PADATA is not defined this function just execute
+ * the initialized task in current thread. (emulate async execution)
+ */
+int cl_ptask_submit(struct cl_ptask *ptask, struct cl_ptask_engine *engine)
+{
+	if (IS_ERR_OR_NULL(engine))
+		return -EINVAL;
+
+	if (ptask->pt_cbfunc)
+		ptask->pt_result = ptask->pt_cbfunc(ptask);
+	else
+		ptask->pt_result = -ENXIO;
+
+	if (cl_ptask_need_complete(ptask))
+		complete(&ptask->pt_completion);
+	else if (cl_ptask_is_autofree(ptask))
+		kfree(ptask);
+
+	return 0;
+}
+#endif /* CONFIG_PADATA */
+EXPORT_SYMBOL(cl_ptask_submit);
+
+/**
+ * This function waits until the task completes its async execution.
+ * Tasks with the PTF_ORDERED flag are executed in parallel but complete
+ * in submission order. So, by waiting for the last ordered task, you can
+ * be sure that all previous tasks completed before it did.
+ */
+int cl_ptask_wait_for(struct cl_ptask *ptask)
+{
+	if (!cl_ptask_need_complete(ptask))
+		return -EINVAL;
+
+	wait_for_completion(&ptask->pt_completion);
+
+	return 0;
+}
+EXPORT_SYMBOL(cl_ptask_wait_for);
+
+/**
+ * This function initialize internal members of task and prepare it for
+ * async execution.
+ */
+int cl_ptask_init(struct cl_ptask *ptask, cl_ptask_cb_t cbfunc, void *cbdata,
+		  unsigned int flags, int cpu)
+{
+	memset(ptask, 0, sizeof(*ptask));
+
+	ptask->pt_flags  = flags;
+	ptask->pt_cbcpu  = cpu;
+	ptask->pt_mm     = NULL; /* will be set in cl_do_parallel() */
+	ptask->pt_fs     = get_fs();
+	ptask->pt_cbfunc = cbfunc;
+	ptask->pt_cbdata = cbdata;
+	ptask->pt_result = -EAGAIN;
+
+	if (cl_ptask_need_complete(ptask)) {
+		if (cl_ptask_is_autofree(ptask))
+			return -EINVAL;
+
+		init_completion(&ptask->pt_completion);
+	}
+
+	if (cl_ptask_is_atomic(ptask) && cl_ptask_use_user_mm(ptask))
+		return -EINVAL;
+
+	return 0;
+}
+EXPORT_SYMBOL(cl_ptask_init);
+
+/**
+ * This function set the mask of allowed CPUs for parallel execution
+ * for engine with specified id.
+ */
+int cl_ptengine_set_cpumask(struct cl_ptask_engine *engine,
+			    const struct cpumask *cpumask)
+{
+	int rc = 0;
+
+#ifdef CONFIG_PADATA
+	cpumask_var_t serial_mask;
+	cpumask_var_t parallel_mask;
+
+	if (IS_ERR_OR_NULL(engine))
+		return -EINVAL;
+
+	if (!alloc_cpumask_var(&serial_mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	if (!alloc_cpumask_var(&parallel_mask, GFP_KERNEL)) {
+		free_cpumask_var(serial_mask);
+		return -ENOMEM;
+	}
+
+	cpumask_copy(parallel_mask, cpumask);
+	cpumask_copy(serial_mask, cpu_online_mask);
+
+	rc = padata_set_cpumask(engine->pte_pinst, PADATA_CPU_PARALLEL,
+				parallel_mask);
+	free_cpumask_var(parallel_mask);
+	if (rc)
+		goto out_failed_mask;
+
+	rc = padata_set_cpumask(engine->pte_pinst, PADATA_CPU_SERIAL,
+				serial_mask);
+out_failed_mask:
+	free_cpumask_var(serial_mask);
+#endif /* CONFIG_PADATA */
+
+	return rc;
+}
+EXPORT_SYMBOL(cl_ptengine_set_cpumask);
+
+/**
+ * This function returns the count of allowed CPUs for parallel execution
+ * for engine with specified id.
+ */
+int cl_ptengine_weight(struct cl_ptask_engine *engine)
+{
+	if (IS_ERR_OR_NULL(engine))
+		return -EINVAL;
+
+	return engine->pte_weight;
+}
+EXPORT_SYMBOL(cl_ptengine_weight);
+
+#ifdef CONFIG_PADATA
+static int cl_ptask_cpumask_change_notify(struct notifier_block *self,
+					  unsigned long val, void *data)
+{
+	struct padata_cpumask *padata_cpumask = data;
+	struct cl_ptask_engine *engine;
+
+	engine = container_of(self, struct cl_ptask_engine, pte_notifier);
+
+	if (val & PADATA_CPU_PARALLEL)
+		engine->pte_weight = cpumask_weight(padata_cpumask->pcpu);
+
+	return 0;
+}
+
+static int cl_ptengine_padata_init(struct cl_ptask_engine *engine,
+				   const char *name,
+				   const struct cpumask *cpumask)
+{
+	unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE;
+	char *pa_mask_buff, *cb_mask_buff;
+	cpumask_var_t all_mask;
+	cpumask_var_t par_mask;
+	int rc;
+
+	get_online_cpus();
+
+	engine->pte_wq = alloc_workqueue(name, wq_flags, 1);
+	if (!engine->pte_wq) {
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	if (!alloc_cpumask_var(&all_mask, GFP_KERNEL)) {
+		rc = -ENOMEM;
+		goto err_destroy_workqueue;
+	}
+
+	if (!alloc_cpumask_var(&par_mask, GFP_KERNEL)) {
+		rc = -ENOMEM;
+		goto err_free_all_mask;
+	}
+
+	cpumask_copy(par_mask, cpumask);
+	if (cpumask_empty(par_mask) ||
+	    cpumask_equal(par_mask, cpu_online_mask)) {
+		cpumask_copy(all_mask, cpu_online_mask);
+		cpumask_clear(par_mask);
+		while (!cpumask_empty(all_mask)) {
+			int cpu = cpumask_first(all_mask);
+
+			cpumask_set_cpu(cpu, par_mask);
+			cpumask_andnot(all_mask, all_mask,
+				       topology_sibling_cpumask(cpu));
+		}
+	}
+
+	cpumask_copy(all_mask, cpu_online_mask);
+
+	pa_mask_buff = (char *)__get_free_page(GFP_KERNEL);
+	if (!pa_mask_buff) {
+		rc = -ENOMEM;
+		goto err_free_par_mask;
+	}
+
+	cb_mask_buff = (char *)__get_free_page(GFP_KERNEL);
+	if (!cb_mask_buff) {
+		free_page((unsigned long)pa_mask_buff);
+		rc = -ENOMEM;
+		goto err_free_par_mask;
+	}
+
+	cpumap_print_to_pagebuf(true, pa_mask_buff, par_mask);
+	pa_mask_buff[PAGE_SIZE - 1] = '\0';
+	cpumap_print_to_pagebuf(true, cb_mask_buff, all_mask);
+	cb_mask_buff[PAGE_SIZE - 1] = '\0';
+
+	CDEBUG(D_INFO, "%s weight=%u plist='%s' cblist='%s'\n",
+	       name, cpumask_weight(par_mask),
+	       pa_mask_buff, cb_mask_buff);
+
+	free_page((unsigned long)cb_mask_buff);
+	free_page((unsigned long)pa_mask_buff);
+
+	engine->pte_weight = cpumask_weight(par_mask);
+	engine->pte_pinst  = padata_alloc_possible(engine->pte_wq);
+	if (!engine->pte_pinst) {
+		rc = -ENOMEM;
+		goto err_free_par_mask;
+	}
+
+	engine->pte_notifier.notifier_call = cl_ptask_cpumask_change_notify;
+	rc = padata_register_cpumask_notifier(engine->pte_pinst,
+					      &engine->pte_notifier);
+	if (rc)
+		goto err_free_padata;
+
+	rc = cl_ptengine_set_cpumask(engine, par_mask);
+	if (rc)
+		goto err_unregister;
+
+	rc = padata_start(engine->pte_pinst);
+	if (rc)
+		goto err_unregister;
+
+	free_cpumask_var(par_mask);
+	free_cpumask_var(all_mask);
+
+	put_online_cpus();
+	return 0;
+
+err_unregister:
+	padata_unregister_cpumask_notifier(engine->pte_pinst,
+					   &engine->pte_notifier);
+err_free_padata:
+	padata_free(engine->pte_pinst);
+err_free_par_mask:
+	free_cpumask_var(par_mask);
+err_free_all_mask:
+	free_cpumask_var(all_mask);
+err_destroy_workqueue:
+	destroy_workqueue(engine->pte_wq);
+err:
+	put_online_cpus();
+	return rc;
+}
+
+static void cl_ptengine_padata_fini(struct cl_ptask_engine *engine)
+{
+	padata_stop(engine->pte_pinst);
+	padata_unregister_cpumask_notifier(engine->pte_pinst,
+					   &engine->pte_notifier);
+	padata_free(engine->pte_pinst);
+	destroy_workqueue(engine->pte_wq);
+}
+
+#else  /* !CONFIG_PADATA */
+
+static int cl_ptengine_padata_init(struct cl_ptask_engine *engine,
+				   const char *name,
+				   const struct cpumask *cpumask)
+{
+	engine->pte_weight = 1;
+
+	return 0;
+}
+
+static void cl_ptengine_padata_fini(struct cl_ptask_engine *engine)
+{
+}
+#endif /* CONFIG_PADATA */
+
+struct cl_ptask_engine *cl_ptengine_init(const char *name,
+					 const struct cpumask *cpumask)
+{
+	struct cl_ptask_engine *engine;
+	int rc;
+
+	engine = kzalloc(sizeof(*engine), GFP_KERNEL);
+	if (!engine) {
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	rc = cl_ptengine_padata_init(engine, name, cpumask);
+	if (rc)
+		goto err_free_engine;
+
+	return engine;
+
+err_free_engine:
+	kfree(engine);
+err:
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL(cl_ptengine_init);
+
+void cl_ptengine_fini(struct cl_ptask_engine *engine)
+{
+	if (IS_ERR_OR_NULL(engine))
+		return;
+
+	cl_ptengine_padata_fini(engine);
+	kfree(engine);
+}
+EXPORT_SYMBOL(cl_ptengine_fini);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [lustre-devel] [PATCH 11/12] lustre: mdc: use large xattr buffers for old servers
  2018-11-26  2:48 [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews James Simmons
                   ` (9 preceding siblings ...)
  2018-11-26  2:48 ` [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework James Simmons
@ 2018-11-26  2:48 ` James Simmons
  2018-11-26  2:48 ` [lustre-devel] [PATCH 12/12] lustre: update TODO lustre list James Simmons
  11 siblings, 0 replies; 23+ messages in thread
From: James Simmons @ 2018-11-26  2:48 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

Pre 2.10.1 MDTs will crash when they receive a listxattr (MDS_GETXATTR
with OBD_MD_FLXATTRLS) RPC for an orphan or dead object. So for
clients connected to these older MDTs, try to avoid sending listxattr
RPCs by making the bulk getxattr (MDS_GETXATTR with OBD_MD_FLXATTRALL)
more likely to succeed, thereby reducing the chances of falling
back to listxattr.

Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10912
Reviewed-on: https://review.whamcloud.com/31990
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/mdc/mdc_locks.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/staging/lustre/lustre/mdc/mdc_locks.c b/drivers/staging/lustre/lustre/mdc/mdc_locks.c
index 2cc2378..7d4ba9c 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_locks.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_locks.c
@@ -352,6 +352,20 @@ static void mdc_realloc_openmsg(struct ptlrpc_request *req,
 	lit = req_capsule_client_get(&req->rq_pill, &RMF_LDLM_INTENT);
 	lit->opc = IT_GETXATTR;
 
+	/* If the supplied buffer is too small then the server will
+	 * return -ERANGE and llite will fallback to using non cached
+	 * xattr operations. On servers before 2.10.1 a (non-cached)
+	 * listxattr RPC for an orphan or dead file causes an oops. So
+	 * let's try to avoid sending too small a buffer to too old a
+	 * server. This is effectively undoing the memory conservation
+	 * of LU-9417 when it would be *more* likely to crash the
+	 * server. See LU-9856.
+	 */
+	BUILD_BUG_ON(OBD_OCD_VERSION(3, 0, 53, 0) <= LUSTRE_VERSION_CODE);
+	if (exp->exp_connect_data.ocd_version < OBD_OCD_VERSION(2, 10, 1, 0))
+		ea_vals_buf_size = max_t(u32, ea_vals_buf_size,
+					 exp->exp_connect_data.ocd_max_easize);
+
 	/* pack the intended request */
 	mdc_pack_body(req, &op_data->op_fid1, op_data->op_valid,
 		      ea_vals_buf_size, -1, 0);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [lustre-devel] [PATCH 12/12] lustre: update TODO lustre list
  2018-11-26  2:48 [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews James Simmons
                   ` (10 preceding siblings ...)
  2018-11-26  2:48 ` [lustre-devel] [PATCH 11/12] lustre: mdc: use large xattr buffers for old servers James Simmons
@ 2018-11-26  2:48 ` James Simmons
  11 siblings, 0 replies; 23+ messages in thread
From: James Simmons @ 2018-11-26  2:48 UTC (permalink / raw)
  To: lustre-devel

Many of the items in TODO have been completed, so remove them from
the list.

Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/TODO | 85 ---------------------------------------------
 1 file changed, 85 deletions(-)

diff --git a/drivers/staging/lustre/TODO b/drivers/staging/lustre/TODO
index 942280b..1df7697 100644
--- a/drivers/staging/lustre/TODO
+++ b/drivers/staging/lustre/TODO
@@ -35,91 +35,6 @@ So move to time64_t and ktime_t as much as possible.
 ******************************************************************************
 * Proper IB support for ko2iblnd
 ******************************************************************************
-https://jira.hpdd.intel.com/browse/LU-9179
-
-Poor performance for the ko2iblnd driver. This is related to many of the
-patches below that are missing from the linux client.
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-10394 / LU-10526 / LU-10089
-
-Default to using MEM_REG
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-10459
-
-throttle tx based on queue depth
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-9943
-
-correct WR fast reg accounting
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-10291
-
-remove concurrent_sends tunable
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-10213
-
-calculate qp max_send_wrs properly
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-9810
-
-use less CQ entries for each connection
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-10129 / LU-9180
-
-rework map_on_demand behavior
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-10129
-
-query device capabilities
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-9983
-
-allow for discontiguous fragments
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-9500
-
-Don't Page Align remote_addr with FastReg
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-9448
-
-handle empty CPTs
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-9507
-
-Don't Assert On Reconnect with MultiQP
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-9425
-
-Turn on 2 sges by default
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-5718
-
-multiple sges for work request
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-9094
-
-kill timedout txs from ibp_tx_queue
-------------------------------------------------------------------------------
-
-https://jira.hpdd.intel.com/browse/LU-9094
-
-reconnect peer for REJ_INVALID_SERVICE_ID
-------------------------------------------------------------------------------
 
 https://jira.hpdd.intel.com/browse/LU-8752
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [lustre-devel] [PATCH 06/12] lustre: mdc: don't add to page cache upon failure
  2018-11-26  2:48 ` [lustre-devel] [PATCH 06/12] lustre: mdc: don't add to page cache upon failure James Simmons
@ 2018-11-27  3:01   ` NeilBrown
  2018-11-29 12:06     ` Siyao Lai
  0 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2018-11-27  3:01 UTC (permalink / raw)
  To: lustre-devel

On Sun, Nov 25 2018, James Simmons wrote:

> From: Lai Siyao <lai.siyao@whamcloud.com>
>
> Reading directory pages may fail on MDS, in this case client should
> not cache a non-up-to-date directory page, because it will cause
> a later read on the same page fail.
>
> Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-5461
> Reviewed-on: http://review.whamcloud.com/11450
> Reviewed-by: Fan Yong <fan.yong@intel.com>
> Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lustre/mdc/mdc_request.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
> index 1072b66..09b30ef 100644
> --- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
> +++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
> @@ -1212,7 +1212,10 @@ static int mdc_read_page_remote(void *data, struct page *page0)
>  	}
>  
>  	rc = mdc_getpage(rp->rp_exp, fid, rp->rp_off, page_pool, npages, &req);
> -	if (!rc) {
> +	if (rc < 0) {
> +		/* page0 is special, which was added into page cache early */
> +		delete_from_page_cache(page0);

This looks wrong to me.  We shouldn't need to delete the page from the
page-cache.
It won't be marked up-to-date, so why will that cause an error on a
later read???

Well, because mdc_page_locate() finds a page and if it isn't up-to-date,
it returns -EIO.  Why does it do that?  If it found a PageError() page,
then it might be reasonable to return -EIO.

Why not just return the page that was found, and let the caller check if
it is Uptodate?

Well, because mdc_read_page() handles a successful page return from
mdc_page_locate() as a hash collision.

I guess I need to understand how lustre maps a hash to a directory
block, and then how it handles collisions...

The reason this jumped out at me is that it looks like it might be
racy.  Adding a page and then removing it might leave a window where
some other thread can find the page.  That is not a problem if a
non-up-to-date page just means we should wait for it.  But if it can
cause an error, then maybe the race is a real problem.

But maybe there is some higher level locking...
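
To make that concrete, a rough sketch of the caller-side handling being
suggested (lookup_dir_page() and read_dir_page_remote() are hypothetical
stand-ins, not the actual mdc functions):

    static struct page *find_or_read_dir_page(struct address_space *mapping,
                                              unsigned long hash)
    {
            struct page *page = lookup_dir_page(mapping, hash);

            if (page) {
                    /* another reader may still be filling the page */
                    wait_on_page_locked(page);
                    if (PageUptodate(page))
                            return page;        /* cached copy is usable */
                    /* stale or failed read: drop it and read again,
                     * rather than returning -EIO to the caller
                     */
                    put_page(page);
            }
            /* no usable cached page: issue a fresh MDS_READPAGE request */
            return read_dir_page_remote(mapping, hash);
    }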

NeilBrown

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [lustre-devel] [PATCH 08/12] lustre: mdc: propagate changelog errors to readers
  2018-11-26  2:48 ` [lustre-devel] [PATCH 08/12] lustre: mdc: propagate changelog errors to readers James Simmons
@ 2018-11-27  3:13   ` NeilBrown
  0 siblings, 0 replies; 23+ messages in thread
From: NeilBrown @ 2018-11-27  3:13 UTC (permalink / raw)
  To: lustre-devel

On Sun, Nov 25 2018, James Simmons wrote:

> From: "John L. Hammond" <jhammond@whamcloud.com>
>
> Store errors encountered by the changelog llog reader thread in the
> crs_err field of struct changelog_reader_state so that they can be
> propageted to changelog readers. In chlg_read() improve the error and
> EOF reporting. Return -ERESTARTSYS when the blocked reader is
> interrupted. Replace uses of wait_event_idle() with
> ait_event_interruptible().
>
> Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-10218
> Reviewed-on: https://review.whamcloud.com/30040
> Reviewed-by: Quentin Bouget <quentin.bouget@cea.fr>
> Reviewed-by: Henri Doreau <henri.doreau@cea.fr>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lustre/mdc/mdc_changelog.c | 45 ++++++++++++++---------
>  1 file changed, 28 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/mdc/mdc_changelog.c b/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
> index becdee8..811a36a 100644
> --- a/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
> +++ b/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
> @@ -71,7 +71,7 @@ struct chlg_reader_state {
>  	/* Producer thread (if any) */
>  	struct task_struct	*crs_prod_task;
>  	/* An error occurred that prevents from reading further */
> -	bool			 crs_err;
> +	int			 crs_err;
>  	/* EOF, no more records available */
>  	bool			 crs_eof;
>  	/* Desired start position */
> @@ -148,9 +148,9 @@ static int chlg_read_cat_process_cb(const struct lu_env *env,
>  	       PFID(&rec->cr.cr_tfid), PFID(&rec->cr.cr_pfid),
>  	       rec->cr.cr_namelen, changelog_rec_name(&rec->cr));
>  
> -	wait_event_idle(crs->crs_waitq_prod,
> -			(crs->crs_rec_count < CDEV_CHLG_MAX_PREFETCH ||
> -			 kthread_should_stop()));
> +	wait_event_interruptible(crs->crs_waitq_prod,
> +				 crs->crs_rec_count < CDEV_CHLG_MAX_PREFETCH ||
> +				 kthread_should_stop());

This is wrong.  Not harmful, but wrong.
This is in a kernel thread, so it doesn't expect to receive signals.
So allowing an abort on a signal is pointless and misleading.

So I've removed this chunk from the patch, and also a later chunk which
uses wait_event_interruptible() in a kthread.

Other places where wait_event_interruptible() are used make sense.
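
As a sketch of the split being described, reusing the names from the
patch above (illustration only, not the final code):

    /* producer kthread: no signals are expected here, so wait without
     * the _interruptible variant
     */
    wait_event_idle(crs->crs_waitq_prod,
                    crs->crs_rec_count < CDEV_CHLG_MAX_PREFETCH ||
                    kthread_should_stop());

    /* chlg_read(), user context: an interruptible wait is appropriate,
     * as long as -ERESTARTSYS is propagated back to the reader
     */
    rc = wait_event_interruptible(crs->crs_waitq_cons,
                                  crs->crs_rec_count > 0 ||
                                  crs->crs_eof || crs->crs_err);
    if (rc < 0)
            return rc;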

Thanks,
NeilBrown


>  
>  	if (kthread_should_stop())
>  		return LLOG_PROC_BREAK;
> @@ -231,7 +231,7 @@ static int chlg_load(void *args)
>  
>  err_out:
>  	if (rc < 0)
> -		crs->crs_err = true;
> +		crs->crs_err = rc;
>  
>  	wake_up_all(&crs->crs_waitq_cons);
>  
> @@ -241,7 +241,7 @@ static int chlg_load(void *args)
>  	if (ctx)
>  		llog_ctxt_put(ctx);
>  
> -	wait_event_idle(crs->crs_waitq_prod, kthread_should_stop());
> +	wait_event_interruptible(crs->crs_waitq_prod, kthread_should_stop());
>  
>  	return rc;
>  }
> @@ -264,13 +264,20 @@ static ssize_t chlg_read(struct file *file, char __user *buff, size_t count,
>  	struct chlg_reader_state *crs = file->private_data;
>  	struct chlg_rec_entry *rec;
>  	struct chlg_rec_entry *tmp;
> -	ssize_t  written_total = 0;
> +	ssize_t written_total = 0;
>  	LIST_HEAD(consumed);
> +	size_t rc;
> +
> +	if (file->f_flags & O_NONBLOCK && crs->crs_rec_count == 0) {
> +		if (crs->crs_err < 0)
> +			return crs->crs_err;
> +		else if (crs->crs_eof)
> +			return 0;
> +		else
> +			return -EAGAIN;
> +	}
>  
> -	if (file->f_flags & O_NONBLOCK && crs->crs_rec_count == 0)
> -		return -EAGAIN;
> -
> -	wait_event_idle(crs->crs_waitq_cons,
> +	rc = wait_event_interruptible(crs->crs_waitq_cons,
>  			crs->crs_rec_count > 0 || crs->crs_eof || crs->crs_err);
>  
>  	mutex_lock(&crs->crs_lock);
> @@ -279,8 +286,7 @@ static ssize_t chlg_read(struct file *file, char __user *buff, size_t count,
>  			break;
>  
>  		if (copy_to_user(buff, rec->enq_record, rec->enq_length)) {
> -			if (written_total == 0)
> -				written_total = -EFAULT;
> +			rc = -EFAULT;
>  			break;
>  		}
>  
> @@ -294,15 +300,19 @@ static ssize_t chlg_read(struct file *file, char __user *buff, size_t count,
>  	}
>  	mutex_unlock(&crs->crs_lock);
>  
> -	if (written_total > 0)
> +	if (written_total > 0) {
> +		rc = written_total;
>  		wake_up_all(&crs->crs_waitq_prod);
> +	} else if (rc == 0) {
> +		rc = crs->crs_err;
> +	}
>  
>  	list_for_each_entry_safe(rec, tmp, &consumed, enq_linkage)
>  		enq_record_delete(rec);
>  
>  	*ppos = crs->crs_start_offset;
>  
> -	return written_total;
> +	return rc;
>  }
>  
>  /**
> @@ -509,15 +519,16 @@ static int chlg_release(struct inode *inode, struct file *file)
>  	struct chlg_reader_state *crs = file->private_data;
>  	struct chlg_rec_entry *rec;
>  	struct chlg_rec_entry *tmp;
> +	int rc = 0;
>  
>  	if (crs->crs_prod_task)
> -		kthread_stop(crs->crs_prod_task);
> +		rc = kthread_stop(crs->crs_prod_task);
>  
>  	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage)
>  		enq_record_delete(rec);
>  
>  	kfree(crs);
> -	return 0;
> +	return rc;
>  }
>  
>  /**
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [lustre-devel] [PATCH 09/12] lustre: obdclass: obd_device improvement
  2018-11-26  2:48 ` [lustre-devel] [PATCH 09/12] lustre: obdclass: obd_device improvement James Simmons
@ 2018-11-27  4:01   ` NeilBrown
  0 siblings, 0 replies; 23+ messages in thread
From: NeilBrown @ 2018-11-27  4:01 UTC (permalink / raw)
  To: lustre-devel

On Sun, Nov 25 2018, James Simmons wrote:

> From: Alexander Boyko <c17825@cray.com>
>
> The patch removes self exports from obd's reference counting which
> allows to avoid freeing of self exports by zombie thread.
> A pair of functions class_register_device()/class_unregister_device()
> is to make sure that an obd can not be referenced again once its
> refcount reached 0.
>
> Signed-off-by: Vladimir Saveliev Vladimir Saveliev <c17830@cray.com>
> Signed-off-by: Alexey Lyashkov <c17817@cray.com>
> Signed-off-by: Alexander Boyko <c17825@cray.com>
> Signed-off-by: Yang Sheng <ys@whamcloud.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-4134
> Seagate-bug-id: MRP-2139 MRP-3267
> Reviewed-on: https://review.whamcloud.com/8045
> Reviewed-on: https://review.whamcloud.com/29967
> Reviewed-by: James Simmons <uja.ornl@yahoo.com>
> Reviewed-by: Alexey Lyashkov <c17817@cray.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lustre/include/obd_class.h  |   9 +-
>  drivers/staging/lustre/lustre/obdclass/genops.c    | 284 +++++++++++++++------
>  .../staging/lustre/lustre/obdclass/obd_config.c    | 143 ++++-------
>  drivers/staging/lustre/lustre/obdclass/obd_mount.c |   6 +-
>  4 files changed, 261 insertions(+), 181 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/include/obd_class.h b/drivers/staging/lustre/lustre/include/obd_class.h
> index 567189c..cc00915 100644
> --- a/drivers/staging/lustre/lustre/include/obd_class.h
> +++ b/drivers/staging/lustre/lustre/include/obd_class.h
> @@ -65,8 +65,11 @@ int class_register_type(struct obd_ops *dt_ops, struct md_ops *md_ops,
>  			const char *name, struct lu_device_type *ldt);
>  int class_unregister_type(const char *name);
>  
> -struct obd_device *class_newdev(const char *type_name, const char *name);
> -void class_release_dev(struct obd_device *obd);
> +struct obd_device *class_newdev(const char *type_name, const char *name,
> +				const char *uuid);
> +int class_register_device(struct obd_device *obd);
> +void class_unregister_device(struct obd_device *obd);
> +void class_free_dev(struct obd_device *obd);
>  
>  int class_name2dev(const char *name);
>  struct obd_device *class_name2obd(const char *name);
> @@ -230,6 +233,8 @@ void __class_export_del_lock_ref(struct obd_export *exp,
>  void class_export_put(struct obd_export *exp);
>  struct obd_export *class_new_export(struct obd_device *obddev,
>  				    struct obd_uuid *cluuid);
> +struct obd_export *class_new_export_self(struct obd_device *obd,
> +					 struct obd_uuid *uuid);
>  void class_unlink_export(struct obd_export *exp);
>  
>  struct obd_import *class_import_get(struct obd_import *imp);
> diff --git a/drivers/staging/lustre/lustre/obdclass/genops.c b/drivers/staging/lustre/lustre/obdclass/genops.c
> index 59891a8..cdd44f7 100644
> --- a/drivers/staging/lustre/lustre/obdclass/genops.c
> +++ b/drivers/staging/lustre/lustre/obdclass/genops.c
> @@ -38,6 +38,7 @@
>  
>  #define DEBUG_SUBSYSTEM S_CLASS
>  #include <obd_class.h>
> +#include <lustre_log.h>
>  #include <lprocfs_status.h>
>  #include <lustre_kernelcomm.h>
>  
> @@ -273,21 +274,20 @@ int class_unregister_type(const char *name)
>  /**
>   * Create a new obd device.
>   *
> - * Find an empty slot in ::obd_devs[], create a new obd device in it.
> + * Allocate the new obd_device and initialize it.
>   *
>   * \param[in] type_name obd device type string.
>   * \param[in] name      obd device name.
> + * @uuid		obd device UUID.
>   *
> - * \retval NULL if create fails, otherwise return the obd device
> - *	 pointer created.
> + * RETURN newdev	 pointer to created obd_device
> + * RETURN ERR_PTR(errno) on error
>   */
> -struct obd_device *class_newdev(const char *type_name, const char *name)
> +struct obd_device *class_newdev(const char *type_name, const char *name,
> +				const char *uuid)
>  {
> -	struct obd_device *result = NULL;
>  	struct obd_device *newdev;
>  	struct obd_type *type = NULL;
> -	int i;
> -	int new_obd_minor = 0;
>  
>  	if (strlen(name) >= MAX_OBD_NAME) {
>  		CERROR("name/uuid must be < %u bytes long\n", MAX_OBD_NAME);
> @@ -302,87 +302,167 @@ struct obd_device *class_newdev(const char *type_name, const char *name)
>  
>  	newdev = obd_device_alloc();
>  	if (!newdev) {
> -		result = ERR_PTR(-ENOMEM);
> -		goto out_type;
> +		class_put_type(type);
> +		return ERR_PTR(-ENOMEM);
>  	}
>  
>  	LASSERT(newdev->obd_magic == OBD_DEVICE_MAGIC);
> +	strncpy(newdev->obd_name, name, sizeof(newdev->obd_name) - 1);
> +	newdev->obd_type = type;
> +	newdev->obd_minor = -1;
> +
> +	rwlock_init(&newdev->obd_pool_lock);
> +	newdev->obd_pool_limit = 0;
> +	newdev->obd_pool_slv = 0;
> +
> +	INIT_LIST_HEAD(&newdev->obd_exports);
> +	INIT_LIST_HEAD(&newdev->obd_unlinked_exports);
> +	INIT_LIST_HEAD(&newdev->obd_delayed_exports);
> +	spin_lock_init(&newdev->obd_nid_lock);
> +	spin_lock_init(&newdev->obd_dev_lock);
> +	mutex_init(&newdev->obd_dev_mutex);
> +	spin_lock_init(&newdev->obd_osfs_lock);
> +	/* newdev->obd_osfs_age must be set to a value in the distant
> +	 * past to guarantee a fresh statfs is fetched on mount.
> +	 */
> +	newdev->obd_osfs_age = get_jiffies_64() - 1000 * HZ;
>  
> -	write_lock(&obd_dev_lock);
> -	for (i = 0; i < class_devno_max(); i++) {
> -		struct obd_device *obd = class_num2obd(i);
> +	/* XXX belongs in setup not attach  */
> +	init_rwsem(&newdev->obd_observer_link_sem);
> +	/* recovery data */
> +	init_waitqueue_head(&newdev->obd_evict_inprogress_waitq);
>  
> -		if (obd && (strcmp(name, obd->obd_name) == 0)) {
> -			CERROR("Device %s already exists at %d, won't add\n",
> -			       name, i);
> -			if (result) {
> -				LASSERTF(result->obd_magic == OBD_DEVICE_MAGIC,
> -					 "%p obd_magic %08x != %08x\n", result,
> -					 result->obd_magic, OBD_DEVICE_MAGIC);
> -				LASSERTF(result->obd_minor == new_obd_minor,
> -					 "%p obd_minor %d != %d\n", result,
> -					 result->obd_minor, new_obd_minor);
> -
> -				obd_devs[result->obd_minor] = NULL;
> -				result->obd_name[0] = '\0';
> -			 }
> -			result = ERR_PTR(-EEXIST);
> -			break;
> -		}
> -		if (!result && !obd) {
> -			result = newdev;
> -			result->obd_minor = i;
> -			new_obd_minor = i;
> -			result->obd_type = type;
> -			strncpy(result->obd_name, name,
> -				sizeof(result->obd_name) - 1);
> -			obd_devs[i] = result;
> -		}
> -	}
> -	write_unlock(&obd_dev_lock);
> +	llog_group_init(&newdev->obd_olg);
> +	/* Detach drops this */
> +        atomic_set(&newdev->obd_refcount, 1);
> +        lu_ref_init(&newdev->obd_reference);
> +        lu_ref_add(&newdev->obd_reference, "newdev", newdev);
>  
> -	if (!result && i >= class_devno_max()) {
> -		CERROR("all %u OBD devices used, increase MAX_OBD_DEVICES\n",
> -		       class_devno_max());
> -		result = ERR_PTR(-EOVERFLOW);
> -		goto out;
> -	}
> +	newdev->obd_conn_inprogress = 0;
>  
> -	if (IS_ERR(result))
> -		goto out;
> +	strncpy(newdev->obd_uuid.uuid, uuid, strlen(uuid));
>  
> -	CDEBUG(D_IOCTL, "Adding new device %s (%p)\n",
> -	       result->obd_name, result);
> +	CDEBUG(D_IOCTL, "Allocate new device %s (%p)\n",
> +	       newdev->obd_name, newdev);
>  
> -	return result;
> -out:
> -	obd_device_free(newdev);
> -out_type:
> -	class_put_type(type);
> -	return result;
> +	return newdev;
>  }
>  
> -void class_release_dev(struct obd_device *obd)
> +/**
> + * Free obd device.
> + *
> + * @obd obd_device to be freed
> + *
> + * RETURN none
> + */
> +void class_free_dev(struct obd_device *obd)
>  {
>  	struct obd_type *obd_type = obd->obd_type;
>  
>  	LASSERTF(obd->obd_magic == OBD_DEVICE_MAGIC, "%p obd_magic %08x != %08x\n",
>  		 obd, obd->obd_magic, OBD_DEVICE_MAGIC);
> -	LASSERTF(obd == obd_devs[obd->obd_minor], "obd %p != obd_devs[%d] %p\n",
> +	LASSERTF(obd->obd_minor == -1 || obd == obd_devs[obd->obd_minor],
> +		 "obd %p != obd_devs[%d] %p\n",
>  		 obd, obd->obd_minor, obd_devs[obd->obd_minor]);
> +	LASSERTF(atomic_read(&obd->obd_refcount) == 0,
> +		 "obd_refcount should be 0, not %d\n",
> +		 atomic_read(&obd->obd_refcount));
>  	LASSERT(obd_type);
>  
> -	CDEBUG(D_INFO, "Release obd device %s at %d obd_type name =%s\n",
> -	       obd->obd_name, obd->obd_minor, obd->obd_type->typ_name);
> +	CDEBUG(D_INFO, "Release obd device %s obd_type name =%s\n",
> +	       obd->obd_name, obd->obd_type->typ_name);
> +
> +	CDEBUG(D_CONFIG, "finishing cleanup of obd %s (%s)\n",
> +	       obd->obd_name, obd->obd_uuid.uuid);
> +	if (obd->obd_stopping) {
> +		int err;
> +
> +		/* If we're not stopping, we were never set up */
> +		err = obd_cleanup(obd);
> +		if (err)
> +			CERROR("Cleanup %s returned %d\n",
> +			       obd->obd_name, err);
> +	}
>  
> -	write_lock(&obd_dev_lock);
> -	obd_devs[obd->obd_minor] = NULL;
> -	write_unlock(&obd_dev_lock);
>  	obd_device_free(obd);
>  
>  	class_put_type(obd_type);
>  }
>  
> +/**
> + * Unregister obd device.
> + *
> + * Free slot in obd_dev[] used by \a obd.
> + *
> + * @new_obd	obd_device to be unregistered
> + *
> + * RETURN	none
> + */
> +void class_unregister_device(struct obd_device *obd)
> +{
> +	write_lock(&obd_dev_lock);
> +	if (obd->obd_minor >= 0) {
> +		LASSERT(obd_devs[obd->obd_minor] == obd);
> +		obd_devs[obd->obd_minor] = NULL;
> +		obd->obd_minor = -1;
> +	}
> +	write_unlock(&obd_dev_lock);
> +}
> +
> +/**
> + * Register obd device.
> + *
> + * Find free slot in obd_devs[], fills it with \a new_obd.
> + *
> + * @new_obd	obd_device to be registered
> + *
> + * RETURN	0		success
> + *		-EEXIST		device with this name is registered
> + *		-EOVERFLOW	obd_devs[] is full
> + */
> +int class_register_device(struct obd_device *new_obd)
> +{
> +	int new_obd_minor = 0;
> +	bool minor_assign = false;
> +	int ret = 0;
> +	int i;
> +
> +	write_lock(&obd_dev_lock);
> +	for (i = 0; i < class_devno_max(); i++) {
> +		struct obd_device *obd = class_num2obd(i);
> +
> +		if (obd && (strcmp(new_obd->obd_name, obd->obd_name) == 0)) {
> +			CERROR("%s: already exists, won't add\n",
> +			       obd->obd_name);
> +			/* in case we found a free slot before duplicate */
> +			minor_assign = false;
> +			ret = -EEXIST;
> +			break;
> +		}
> +		if (!minor_assign && !obd) {
> +			new_obd_minor = i;
> +			minor_assign = true;
> +		}
> +	}
> +
> +	if (minor_assign) {
> +		new_obd->obd_minor = new_obd_minor;
> +		LASSERTF(!obd_devs[new_obd_minor], "obd_devs[%d] %p\n",
> +			 new_obd_minor, obd_devs[new_obd_minor]);
> +		obd_devs[new_obd_minor] = new_obd;
> +	} else {
> +		if (ret == 0) {
> +			ret = -EOVERFLOW;
> +			CERROR("%s: all %u/%u devices used, increase MAX_OBD_DEVICES: rc = %d\n",
> +			       new_obd->obd_name, i, class_devno_max(), ret);
> +		}
> +	}
> +	write_unlock(&obd_dev_lock);
> +
> +	return ret;
> +}
> +
> +
>  int class_name2dev(const char *name)
>  {
>  	int i;
> @@ -677,7 +757,11 @@ static void class_export_destroy(struct obd_export *exp)
>  	LASSERT(list_empty(&exp->exp_req_replay_queue));
>  	LASSERT(list_empty(&exp->exp_hp_rpcs));
>  	obd_destroy_export(exp);
> -	class_decref(obd, "export", exp);
> +	/* self export doesn't hold a reference to an obd, although it
> +	 * exists until freeing of the obd
> +	 */
> +	if (exp != obd->obd_self_export)
> +		class_decref(obd, "export", exp);
>  
>  	OBD_FREE_RCU(exp, sizeof(*exp), &exp->exp_handle);
>  }
> @@ -708,11 +792,27 @@ void class_export_put(struct obd_export *exp)
>  	       atomic_read(&exp->exp_refcount) - 1);
>  
>  	if (atomic_dec_and_test(&exp->exp_refcount)) {
> -		LASSERT(!list_empty(&exp->exp_obd_chain));
> +		struct obd_device *obd = exp->exp_obd;
> +
>  		CDEBUG(D_IOCTL, "final put %p/%s\n",
>  		       exp, exp->exp_client_uuid.uuid);
>  
> -		obd_zombie_export_add(exp);
> +		if (exp == obd->obd_self_export) {
> +			/* self export should be destroyed without
> +			 * zombie thread as it doesn't hold a
> +			 * reference to obd and doesn't hold any
> +			 * resources
> +			 */
> +			class_export_destroy(exp);
> +			/* self export is destroyed, no class
> +			 * references exist and it is safe to free
> +			 * obd
> +			 */
> +			class_free_dev(obd);
> +		} else {
> +			LASSERT(!list_empty(&exp->exp_obd_chain));
> +			obd_zombie_export_add(exp);
> +		}
>  	}
>  }
>  EXPORT_SYMBOL(class_export_put);
> @@ -728,8 +828,9 @@ static void obd_zombie_exp_cull(struct work_struct *ws)
>   * pointer to it. The refcount is 2: one for the hash reference, and
>   * one for the pointer returned by this function.
>   */
> -struct obd_export *class_new_export(struct obd_device *obd,
> -				    struct obd_uuid *cluuid)
> +static struct obd_export *__class_new_export(struct obd_device *obd,
> +					     struct obd_uuid *cluuid,
> +					     bool is_self)
>  {
>  	struct obd_export *export;
>  	int rc = 0;
> @@ -739,6 +840,7 @@ struct obd_export *class_new_export(struct obd_device *obd,
>  		return ERR_PTR(-ENOMEM);
>  
>  	export->exp_conn_cnt = 0;
> +	/* 2 = class_handle_hash + last */
>  	atomic_set(&export->exp_refcount, 2);
>  	atomic_set(&export->exp_rpc_count, 0);
>  	atomic_set(&export->exp_cb_count, 0);
> @@ -767,41 +869,65 @@ struct obd_export *class_new_export(struct obd_device *obd,
>  	export->exp_client_uuid = *cluuid;
>  	obd_init_export(export);
>  
> -	spin_lock(&obd->obd_dev_lock);
> -	/* shouldn't happen, but might race */
> -	if (obd->obd_stopping) {
> -		rc = -ENODEV;
> -		goto exit_unlock;
> -	}
> -
>  	if (!obd_uuid_equals(cluuid, &obd->obd_uuid)) {
> +		spin_lock(&obd->obd_dev_lock);
> +		/* shouldn't happen, but might race */
> +		if (obd->obd_stopping) {
> +			rc = -ENODEV;
> +			goto exit_unlock;
> +		}
> +		spin_unlock(&obd->obd_dev_lock);

This is wrong.  The lock previously protected obd_uuid_add(); now it
doesn't.
This probably happened because the locking was simplified when I
converted to rhashtables, so the OpenSFS patch did quite different things
with locking.

I've applied the following incremental patch to fix up the locking.

Thanks,
NeilBrown

diff --git a/drivers/staging/lustre/lustre/obdclass/genops.c b/drivers/staging/lustre/lustre/obdclass/genops.c
index cdd44f72403f..76bc73fd79c8 100644
--- a/drivers/staging/lustre/lustre/obdclass/genops.c
+++ b/drivers/staging/lustre/lustre/obdclass/genops.c
@@ -869,24 +869,22 @@ static struct obd_export *__class_new_export(struct obd_device *obd,
 	export->exp_client_uuid = *cluuid;
 	obd_init_export(export);
 
+	spin_lock(&obd->obd_dev_lock);
 	if (!obd_uuid_equals(cluuid, &obd->obd_uuid)) {
-		spin_lock(&obd->obd_dev_lock);
 		/* shouldn't happen, but might race */
 		if (obd->obd_stopping) {
 			rc = -ENODEV;
 			goto exit_unlock;
 		}
-		spin_unlock(&obd->obd_dev_lock);
 
 		rc = obd_uuid_add(obd, export);
 		if (rc) {
 			LCONSOLE_WARN("%s: denying duplicate export for %s, %d\n",
 				      obd->obd_name, cluuid->uuid, rc);
-			goto exit_err;
+			goto exit_unlock;
 		}
 	}
 
-	spin_lock(&obd->obd_dev_lock);
 	if (!is_self) {
 		class_incref(obd, "export", export);
 		list_add(&export->exp_obd_chain, &export->exp_obd->obd_exports);
@@ -899,7 +897,6 @@ static struct obd_export *__class_new_export(struct obd_device *obd,
 
 exit_unlock:
 	spin_unlock(&obd->obd_dev_lock);
-exit_err:
 	class_handle_unhash(&export->exp_handle);
 	obd_destroy_export(export);
 	kfree(export);

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework
  2018-11-26  2:48 ` [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework James Simmons
@ 2018-11-27  4:20   ` NeilBrown
  2018-11-27  5:08     ` Andreas Dilger
  0 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2018-11-27  4:20 UTC (permalink / raw)
  To: lustre-devel

On Sun, Nov 25 2018, James Simmons wrote:

> From: Dmitry Eremin <dmitry.eremin@intel.com>
>
> In this patch new API for parallel tasks execution is introduced.
> This API based on Linux kernel padata API which is used to perform
> encryption and decryption on large numbers of packets without
> reordering those packets.
>
> It was adopted for general use in Lustre for parallelization of
> various functionality. The first place of its usage is parallel I/O
> implementation.
>
> The first step in using it is to set up a cl_ptask structure to
> control of how this task are to be run:
>
>     #include <cl_ptask.h>
>
>     int cl_ptask_init(struct cl_ptask *ptask, cl_ptask_cb_t cbfunc,
>                       void *cbdata, unsigned int flags, int cpu);
>
> The cbfunc function with cbdata argument will be called in the process
> of getting the task done. The cpu specifies which CPU will be used for
> the final callback when the task is done.
>
> The submission of task is done with:
>
>     int cl_ptask_submit(struct cl_ptask *ptask,
>                         struct cl_ptask_engine *engine);
>
> The task is submitted to the engine for execution.
>
> In order to wait for result of task execution you should call:
>
>    int cl_ptask_wait_for(struct cl_ptask *ptask);
>
> The tasks with flag PTF_ORDERED are executed in parallel but complete
> into submission order. So, waiting for last ordered task you can be sure
> that all previous tasks were done before this task complete.
>
> This patch differs from the OpenSFS tree by adding this functional
> to the clio layer instead of libcfs.
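
As a rough illustration, the interface described above would be used along
the lines of the sketch below.  The callback signature, my_data and
my_engine are placeholders assumed for illustration; they are not taken
from the patch itself:

    #include <cl_ptask.h>

    /* hypothetical callback: assumed here to receive the cl_ptask pointer */
    static int my_cb(struct cl_ptask *ptask)
    {
            /* do one chunk of the work */
            return 0;
    }

    ...
    struct cl_ptask ptask;
    int rc;

    rc = cl_ptask_init(&ptask, my_cb, my_data, PTF_ORDERED,
                       smp_processor_id());
    if (!rc)
            rc = cl_ptask_submit(&ptask, my_engine);
    if (!rc)
            rc = cl_ptask_wait_for(&ptask);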

While you are right that it shouldn't be in libcfs, it actually
shouldn't exist at all.
cfs_ptask_init() is used precisely once in OpenSFS.  There is no point
creating a generic API wrapper like this that is only used once.

cl_io needs to use padata API calls directly.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework
  2018-11-27  4:20   ` NeilBrown
@ 2018-11-27  5:08     ` Andreas Dilger
  2018-11-27 13:51       ` Patrick Farrell
  0 siblings, 1 reply; 23+ messages in thread
From: Andreas Dilger @ 2018-11-27  5:08 UTC (permalink / raw)
  To: lustre-devel

On Nov 26, 2018, at 21:20, NeilBrown <neilb@suse.com> wrote:
> 
> On Sun, Nov 25 2018, James Simmons wrote:
> 
>> From: Dmitry Eremin <dmitry.eremin@intel.com>
>> 
>> In this patch new API for parallel tasks execution is introduced.
>> This API based on Linux kernel padata API which is used to perform
>> encryption and decryption on large numbers of packets without
>> reordering those packets.
>> 
>> It was adopted for general use in Lustre for parallelization of
>> various functionality. The first place of its usage is parallel I/O
>> implementation.
>> 
>> The first step in using it is to set up a cl_ptask structure to
>> control of how this task are to be run:
>> 
>>    #include <cl_ptask.h>
>> 
>>    int cl_ptask_init(struct cl_ptask *ptask, cl_ptask_cb_t cbfunc,
>>                      void *cbdata, unsigned int flags, int cpu);
>> 
>> The cbfunc function with cbdata argument will be called in the process
>> of getting the task done. The cpu specifies which CPU will be used for
>> the final callback when the task is done.
>> 
>> The submission of task is done with:
>> 
>>    int cl_ptask_submit(struct cl_ptask *ptask,
>>                        struct cl_ptask_engine *engine);
>> 
>> The task is submitted to the engine for execution.
>> 
>> In order to wait for result of task execution you should call:
>> 
>>   int cl_ptask_wait_for(struct cl_ptask *ptask);
>> 
>> The tasks with flag PTF_ORDERED are executed in parallel but complete
>> into submission order. So, waiting for last ordered task you can be sure
>> that all previous tasks were done before this task complete.
>> 
>> This patch differs from the OpenSFS tree by adding this functional
>> to the clio layer instead of libcfs.
> 
> While you are right that it shouldn't be in libcfs, it actually
> shouldn't exist at all.
> cfs_ptask_init() is used precisely once in OpenSFS.  There is no point
> creating a generic API wrapper like this that is only used once.
> 
> cl_oi needs to use padata API calls directly.

This infrastructure was also going to be used for parallel readahead, but the patch that implemented that was never landed because the expected performance gains didn't materialize.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework
  2018-11-27  5:08     ` Andreas Dilger
@ 2018-11-27 13:51       ` Patrick Farrell
  2018-11-27 14:01         ` Patrick Farrell
  0 siblings, 1 reply; 23+ messages in thread
From: Patrick Farrell @ 2018-11-27 13:51 UTC (permalink / raw)
  To: lustre-devel

Two notes coming, first about padata.

A major reason is actually the infrastructure itself - it's inappropriate to our kinds of tasks.  I did a quick talk on it a while back, intending then to fix it, but never got the chance (and since had better ideas to improve write performance):

https://www.eofs.eu/_media/events/devsummit17/patrick_farrell_laddevsummit_pio.pdf

padata basically bakes in a set of assumptions that amount to "functionally infinite amount of small work units and a dedicated machine", which fit well with its role in packet encryption but don't sit well for other kinds of parallelization.  (For example, all work is strictly and explicitly bound to a CPU.  No scheduler.  One more as a bonus - it distributes work across all allowed CPUs, but that means if you have a small number of work items (which splitting I/O tends to be because you have to make relatively big chunks) that effectively every work unit starts a worker thread for itself.)

The recent discussion of a new parallelization framework on LWN looked intriguing for future work.  It's expected to fix a number of the limitations.
https://lwn.net/Articles/771169/
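
For reference, the 4.x-era padata calling pattern being described looks roughly like the sketch below (names such as my_req, my_parallel and my_serial are invented, and the signatures are recalled from that kernel generation rather than quoted, so treat the details as approximate).  The explicit cb_cpu argument is the per-CPU binding mentioned above:

	struct my_req {
		struct padata_priv	padata;
		/* per-request state ... */
	};

	static void my_parallel(struct padata_priv *padata)
	{
		/* the heavy work runs here, on whichever CPU padata picked */
		padata_do_serial(padata);	/* hand over to serialization */
	}

	static void my_serial(struct padata_priv *padata)
	{
		/* completions run in submission order on the chosen callback CPU */
	}

	...
	req->padata.parallel = my_parallel;
	req->padata.serial = my_serial;
	/* pinst was created earlier, e.g. with padata_alloc_possible() */
	err = padata_do_parallel(pinst, &req->padata, cb_cpu);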

________________________________
From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Andreas Dilger <adilger@whamcloud.com>
Sent: Monday, November 26, 2018 11:08:45 PM
To: NeilBrown
Cc: Lustre Development List
Subject: Re: [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework

On Nov 26, 2018, at 21:20, NeilBrown <neilb@suse.com> wrote:
>
> On Sun, Nov 25 2018, James Simmons wrote:
>
>> From: Dmitry Eremin <dmitry.eremin@intel.com>
>>
>> In this patch new API for parallel tasks execution is introduced.
>> This API based on Linux kernel padata API which is used to perform
>> encryption and decryption on large numbers of packets without
>> reordering those packets.
>>
>> It was adopted for general use in Lustre for parallelization of
>> various functionality. The first place of its usage is parallel I/O
>> implementation.
>>
>> The first step in using it is to set up a cl_ptask structure to
>> control of how this task are to be run:
>>
>>    #include <cl_ptask.h>
>>
>>    int cl_ptask_init(struct cl_ptask *ptask, cl_ptask_cb_t cbfunc,
>>                      void *cbdata, unsigned int flags, int cpu);
>>
>> The cbfunc function with cbdata argument will be called in the process
>> of getting the task done. The cpu specifies which CPU will be used for
>> the final callback when the task is done.
>>
>> The submission of task is done with:
>>
>>    int cl_ptask_submit(struct cl_ptask *ptask,
>>                        struct cl_ptask_engine *engine);
>>
>> The task is submitted to the engine for execution.
>>
>> In order to wait for result of task execution you should call:
>>
>>   int cl_ptask_wait_for(struct cl_ptask *ptask);
>>
>> The tasks with flag PTF_ORDERED are executed in parallel but complete
>> into submission order. So, waiting for last ordered task you can be sure
>> that all previous tasks were done before this task complete.
>>
>> This patch differs from the OpenSFS tree by adding this functional
>> to the clio layer instead of libcfs.
>
> While you are right that it shouldn't be in libcfs, it actually
> shouldn't exist at all.
> cfs_ptask_init() is used precisely once in OpenSFS.  There is no point
> creating a generic API wrapper like this that is only used once.
>
> cl_oi needs to use padata API calls directly.

This infrastructure was also going to be used for parallel readahead, but the patch that implemented that was never landed because the expected performance gains didn't materialize.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud








^ permalink raw reply	[flat|nested] 23+ messages in thread

* [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework
  2018-11-27 13:51       ` Patrick Farrell
@ 2018-11-27 14:01         ` Patrick Farrell
  2018-11-27 22:27           ` NeilBrown
  0 siblings, 1 reply; 23+ messages in thread
From: Patrick Farrell @ 2018-11-27 14:01 UTC (permalink / raw)
  To: lustre-devel


Second, about pio.

I believe that long term it's headed out of Lustre.  It only improves performance in a limited way in certain circumstances, and harms it in various others.  So it's off by default, and, I suspect, remains completely unused.  A while back I noticed its test framework test didn't activate it correctly, and once fixed, it sometimes deadlocks (race with truncate). There's a patch to fix that, but a problem was found in it and it has since languished.

I would still suggest you take it, Neil, as otherwise you'll complicate a bunch of potentially nasty porting work in the CLIO stack, as you apply the years of patches written with it there.  Instead, I'd suggest we pull it in the OpenSFS branch (Sorry!  It was a promising idea but it hasn't panned out, and the current parallel readahead work isn't going to use it.) and then eventually you could pick that up.

Curious how folks feel about this.  I'd be willing to take a stab at writing a removal patch for 2.13.  It pains me a bit to suggest giving up on it, but Jinshan and I want to do write container type work to improve writes, and there's the older/new again DDN parallel readahead work for reads.

________________________________
From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Patrick Farrell <paf@cray.com>
Sent: Tuesday, November 27, 2018 7:51:02 AM
To: Andreas Dilger; NeilBrown
Cc: Lustre Development List
Subject: Re: [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework

Two notes coming, first about padata.

A major reason is actually the infrastructure itself - it's inappropriate to our kinds of tasks.  I did a quick talk on it a while back, intending then to fix it, but never got the chance (and since had better ideas to improve write performance):

https://www.eofs.eu/_media/events/devsummit17/patrick_farrell_laddevsummit_pio.pdf

padata basically bakes in a set of assumptions that amount to "functionally infinite amount of small work units and a dedicated machine", which fit well with its role in packet encryption but don't sit well for other kinds of parallelization.  (For example, all work is strictly and explicitly bound to a CPU.  No scheduler.  One more as a bonus - it distributes work across all allowed CPUs, but that means if you have a small number of work items (which splitting I/O tends to be because you have to make relatively big chunks) that effectively every work unit starts a worker thread for itself.)

The recent discussion of a new parallelization framework on LWN looked intriguing for future work.  It's expected to fix a number of the limitations.
https://lwn.net/Articles/771169/

________________________________
From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Andreas Dilger <adilger@whamcloud.com>
Sent: Monday, November 26, 2018 11:08:45 PM
To: NeilBrown
Cc: Lustre Development List
Subject: Re: [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework

On Nov 26, 2018, at 21:20, NeilBrown <neilb@suse.com> wrote:
>
> On Sun, Nov 25 2018, James Simmons wrote:
>
>> From: Dmitry Eremin <dmitry.eremin@intel.com>
>>
>> In this patch new API for parallel tasks execution is introduced.
>> This API based on Linux kernel padata API which is used to perform
>> encryption and decryption on large numbers of packets without
>> reordering those packets.
>>
>> It was adopted for general use in Lustre for parallelization of
>> various functionality. The first place of its usage is parallel I/O
>> implementation.
>>
>> The first step in using it is to set up a cl_ptask structure to
>> control of how this task are to be run:
>>
>>    #include <cl_ptask.h>
>>
>>    int cl_ptask_init(struct cl_ptask *ptask, cl_ptask_cb_t cbfunc,
>>                      void *cbdata, unsigned int flags, int cpu);
>>
>> The cbfunc function with cbdata argument will be called in the process
>> of getting the task done. The cpu specifies which CPU will be used for
>> the final callback when the task is done.
>>
>> The submission of task is done with:
>>
>>    int cl_ptask_submit(struct cl_ptask *ptask,
>>                        struct cl_ptask_engine *engine);
>>
>> The task is submitted to the engine for execution.
>>
>> In order to wait for result of task execution you should call:
>>
>>   int cl_ptask_wait_for(struct cl_ptask *ptask);
>>
>> The tasks with flag PTF_ORDERED are executed in parallel but complete
>> into submission order. So, waiting for last ordered task you can be sure
>> that all previous tasks were done before this task complete.
>>
>> This patch differs from the OpenSFS tree by adding this functional
>> to the clio layer instead of libcfs.
>
> While you are right that it shouldn't be in libcfs, it actually
> shouldn't exist at all.
> cfs_ptask_init() is used precisely once in OpenSFS.  There is no point
> creating a generic API wrapper like this that is only used once.
>
> cl_oi needs to use padata API calls directly.

This infrastructure was also going to be used for parallel readahead, but the patch that implemented that was never landed because the expected performance gains didn't materialize.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud








^ permalink raw reply	[flat|nested] 23+ messages in thread

* [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework
  2018-11-27 14:01         ` Patrick Farrell
@ 2018-11-27 22:27           ` NeilBrown
  2018-11-27 22:50             ` Patrick Farrell
  0 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2018-11-27 22:27 UTC (permalink / raw)
  To: lustre-devel

On Tue, Nov 27 2018, Patrick Farrell wrote:

> Second, about pio.
>
> I believe that long term it's headed out of Lustre.  It only improves performance in a limited way in certain circumstances, and harms it in various others.  So it's off by default, and, I suspect, remains completely unused.  A while back I noticed its test framework test didn't activate it correctly, and once fixed, it sometimes deadlocks (race with truncate). There's a patch to fix that, but a problem was found in it and it has since languished.
>
> I would still suggest you take it, Neil, as otherwise you'll complicate a bunch of potentially nasty porting work in the CLIO stack, as you apply the years of patches written with it there.  Instead, I'd suggest we pull it in the OpenSFS branch (Sorry!  It was a promising idea but it hasn't panned out, and the current parallel readahead work isn't going to use it.) and then eventually you could pick that up.

Thanks so much for this background and context - really helpful.

I looked through your slides and got the impression that a simple
work-queue would probably be the best approach - no need to create your
own pool of kthreads as I think you said you had trialed.
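
For what it's worth, a minimal work-queue version (generic kernel API,
invented names, not existing Lustre code) could look something like:

	struct chunk_work {
		struct work_struct	work;
		/* per-chunk I/O state ... */
	};

	static void chunk_fn(struct work_struct *work)
	{
		struct chunk_work *cw = container_of(work, struct chunk_work, work);

		/* issue one chunk of the I/O described by cw here */
	}

	...
	INIT_WORK(&cw->work, chunk_fn);
	queue_work(system_unbound_wq, &cw->work);
	...
	flush_work(&cw->work);	/* wait for that chunk to complete */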

As for the suggestion that I take it anyway, and then remove it later
after it gets removed from OpenSFS, I remain unconvinced.
You mention "years of patches written with it there", but the first
usage of cfs_ptask_init only landed in March 2017 (less than 2 years
ago).  libcfs_ptask is only used in lustre/obdclass/, lustre/llite/ and
lustre/lov/, and the total number of patches in these directories since
it was introduced is 319.  I suspect most of them aren't related to ptask.

So I see no evidence that there will be much "nasty porting work".  I
suspect there will be some, but porting code is what I spend a lot of my
time doing, and doing it helps force me to understand the code.

So while this isn't a "no way, never", it is "I'm not convinced".

Thanks,
NeilBrown


>
> Curious how folks feel about this.  I'd be willing to take a stab at writing a removal patch for 2.13.  It pains me a bit to suggest giving up on it, but Jinshan and I want to do write container type work to improve writes, and there's the older/new again DDN parallel readahead work for reads.
>
> ________________________________
> From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Patrick Farrell <paf@cray.com>
> Sent: Tuesday, November 27, 2018 7:51:02 AM
> To: Andreas Dilger; NeilBrown
> Cc: Lustre Development List
> Subject: Re: [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework
>
> Two notes coming, first about padata.
>
> A major reason is actually the infrastructure itself - it's inappropriate to our kinds of tasks.  I did a quick talk on it a while back, intending then to fix it, but never got the chance (and since had better ideas to improve write performance):
>
> https://www.eofs.eu/_media/events/devsummit17/patrick_farrell_laddevsummit_pio.pdf
>
> padata basically bakes in a set of assumptions that amount to "functionally infinite amount of small work units and a dedicated machine", which fit well with its role in packet encryption but don't sit well for other kinds of parallelization.  (For example, all work is strictly and explicitly bound to a CPU.  No scheduler.  One more as a bonus - it distributes work across all allowed CPUs, but that means if you have a small number of work items (which splitting I/O tends to be because you have to make relatively big chunks) that effectively every work unit starts a worker thread for itself.)
>
> The recent discussion of a new parallelization framework on LWN looked intriguing for future work.  It's expected to fix a number of the limitations.
> https://lwn.net/Articles/771169/
>
> ________________________________
> From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Andreas Dilger <adilger@whamcloud.com>
> Sent: Monday, November 26, 2018 11:08:45 PM
> To: NeilBrown
> Cc: Lustre Development List
> Subject: Re: [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework
>
> On Nov 26, 2018, at 21:20, NeilBrown <neilb@suse.com> wrote:
>>
>> On Sun, Nov 25 2018, James Simmons wrote:
>>
>>> From: Dmitry Eremin <dmitry.eremin@intel.com>
>>>
>>> In this patch new API for parallel tasks execution is introduced.
>>> This API based on Linux kernel padata API which is used to perform
>>> encryption and decryption on large numbers of packets without
>>> reordering those packets.
>>>
>>> It was adopted for general use in Lustre for parallelization of
>>> various functionality. The first place of its usage is parallel I/O
>>> implementation.
>>>
>>> The first step in using it is to set up a cl_ptask structure to
>>> control of how this task are to be run:
>>>
>>>    #include <cl_ptask.h>
>>>
>>>    int cl_ptask_init(struct cl_ptask *ptask, cl_ptask_cb_t cbfunc,
>>>                      void *cbdata, unsigned int flags, int cpu);
>>>
>>> The cbfunc function with cbdata argument will be called in the process
>>> of getting the task done. The cpu specifies which CPU will be used for
>>> the final callback when the task is done.
>>>
>>> The submission of task is done with:
>>>
>>>    int cl_ptask_submit(struct cl_ptask *ptask,
>>>                        struct cl_ptask_engine *engine);
>>>
>>> The task is submitted to the engine for execution.
>>>
>>> In order to wait for result of task execution you should call:
>>>
>>>   int cl_ptask_wait_for(struct cl_ptask *ptask);
>>>
>>> The tasks with flag PTF_ORDERED are executed in parallel but complete
>>> into submission order. So, waiting for last ordered task you can be sure
>>> that all previous tasks were done before this task complete.
>>>
>>> This patch differs from the OpenSFS tree by adding this functional
>>> to the clio layer instead of libcfs.
>>
>> While you are right that it shouldn't be in libcfs, it actually
>> shouldn't exist at all.
>> cfs_ptask_init() is used precisely once in OpenSFS.  There is no point
>> creating a generic API wrapper like this that is only used once.
>>
>> cl_oi needs to use padata API calls directly.
>
> This infrastructure was also going to be used for parallel readahead, but the patch that implemented that was never landed because the expected performance gains didn't materialize.
>
> Cheers, Andreas
> ---
> Andreas Dilger
> Principal Lustre Architect
> Whamcloud
>
>
>
>
>
>
>
> _______________________________________________
> lustre-devel mailing list
> lustre-devel at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework
  2018-11-27 22:27           ` NeilBrown
@ 2018-11-27 22:50             ` Patrick Farrell
  0 siblings, 0 replies; 23+ messages in thread
From: Patrick Farrell @ 2018-11-27 22:50 UTC (permalink / raw)
  To: lustre-devel

Starting from the top:
Yes, a simple work queue would probably work OK.  I had good luck with a simple kthread_run, actually (in a later pass at it).

But if you're thinking of improving it, there are a number of issues with it today, which are non-trivial to resolve.  Not sure which I mentioned in my presentation, but here's a quick attempt:
1. It only works on > 1 stripe files, which isn't ideal
2. It has no limit on the number of threads it will use to do I/O to one file.  In reality, 2-4 threads or so is the maximum which gets a benefit.  More than that actually hurts.
3. I believe it hurts read performance (or maybe it's off for reads even when on for writes?  Can't remember)
4. It has a deadlock with truncate which is not easy to fix.  My attempt to fix it creates a *different* lock inversion and since pio isn't (AFAIK) being used, I gave up.  No one has complained about the deadlock and it happens fairly easily with 'dd', so...

RE: Porting difficulties.  Sorry - Not the *ptask* part.  The changes in the CLIO stack to allow the actual parallel I/O use, which are in another patch.  I tend to run all the LU-8964 patches together in my mind.

This is the change I was suggesting you might not want to skip:

"commit db59ecb5d1d0284fb918def6348a11e0966d7767
Author: Dmitry Eremin <dmitry.eremin@intel.com>
Date:   Thu Mar 30 22:38:56 2017 +0300

    LU-8964 clio: Parallelize generic I/O

    Add parallel version of cl_io_loop() function which use information
    about stripes from LOV layer and process them in parallel.
    This feature is disabled by default. To enable it you should run
    "lctl set_param llite.*.pio=1" command."

" lustre/include/cl_object.h     |  49 ++++++---
 lustre/include/lustre_compat.h | 119 +++++++++++++++++++++
 lustre/include/obd_support.h   |   1 +
 lustre/llite/file.c            | 201 ++++++++++++++++++++++++++++-------
 lustre/llite/llite_internal.h  | 123 +--------------------
 lustre/llite/lproc_llite.c     |  39 ++++++-
 lustre/llite/rw26.c            |   4 +-
 lustre/llite/vvp_internal.h    |   9 +-
 lustre/llite/vvp_io.c          | 235 +++++++++++++++++++++--------------------
 lustre/lov/lov_io.c            |  91 ++++++++++------
 lustre/obdclass/cl_io.c        | 233 +++++++++++++++++++++++++++++++---------
 lustre/obdclass/cl_object.c    |  13 +++
 lustre/osc/osc_io.c            |   4 +-
 lustre/osc/osc_lock.c          |   6 +-
 lustre/tests/sanity.sh         |  11 ++"

It is, of course, up to you, and you are *really* good at porting code.  But as I assume you see, this one is significantly scarier.  The ptask patch itself is no big deal.

- Patrick


On 11/27/18, 4:27 PM, "NeilBrown" <neilb@suse.com> wrote:

    On Tue, Nov 27 2018, Patrick Farrell wrote:
    
    > Second, about pio.
    >
    > I believe that long term it's headed out of Lustre.  It only improves performance in a limited way in certain circumstances, and harms it in various others.  So it's off by default, and, I suspect, remains completely unused.  A while back I noticed its test framework test didn't activate it correctly, and once fixed, it sometimes deadlocks (race with truncate). There's a patch to fix that, but a problem was found in it and it has since languished.
    >
    > I would still suggest you take it, Neil, as otherwise you'll complicate a bunch of potentially nasty porting work in the CLIO stack, as you apply the years of patches written with it there.  Instead, I'd suggest we pull it in the OpenSFS branch (Sorry!  It was a promising idea but it hasn't panned out, and the current parallel readahead work isn't going to use it.) and then eventually you could pick that up.
    
    Thanks so much for this background and context - really helpful.
    
    I looked through your slides and got the impression that a simple
    work-queue would probably be the best approach - no need to create your
    own pool of kthreads as I think you said you had trialed.
    
    As for the suggestion that I take it anyway, and then remove it later
    after it gets removed from OpenSFS, I remain unconvinced.
    You mention "years of patches written with it there", but the first
    usage of cfs_ptask_init only landed in March 2017 (less than 2 years
    ago).  libcfs_ptask is only used in lustre/obdclass/, lustre/llite/ and
    lustre/lov/, and the total number of patches in these directories since
    it was introduced is 319.  I suspect most of them aren't related to ptask.
    
    So I see no evidence that there will be much "nasty porting work".  I
    suspect there will be some, but porting code is what I spend a lot of my
    time doing, and doing it helps force me to understand the code.
    
    So while this isn't a "no way, never", it is "I'm not convinced".
    
    Thanks,
    NeilBrown
    
    
    >
    > Curious how folks feel about this.  I'd be willing to take a stab at writing a removal patch for 2.13.  It pains me a bit to suggest giving up on it, but Jinshan and I want to do write container type work to improve writes, and there's the older/new again DDN parallel readahead work for reads.
    >
    > ________________________________
    > From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Patrick Farrell <paf@cray.com>
    > Sent: Tuesday, November 27, 2018 7:51:02 AM
    > To: Andreas Dilger; NeilBrown
    > Cc: Lustre Development List
    > Subject: Re: [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework
    >
    > Two notes coming, first about padata.
    >
    > A major reason is actually the infrastructure itself - it's inappropriate to our kinds of tasks.  I did a quick talk on it a while back, intending then to fix it, but never got the chance (and since had better ideas to improve write performance):
    >
    > https://www.eofs.eu/_media/events/devsummit17/patrick_farrell_laddevsummit_pio.pdf
    >
    > padata basically bakes in a set of assumptions that amount to "functionally infinite amount of small work units and a dedicated machine", which fit well with its role in packet encryption but don't sit well for other kinds of parallelization.  (For example, all work is strictly and explicitly bound to a CPU.  No scheduler.  One more as a bonus - it distributes work across all allowed CPUs, but that means if you have a small number of work items (which splitting I/O tends to be because you have to make relatively big chunks) that effectively every work unit starts a worker thread for itself.)
    >
    > The recent discussion of a new parallelization framework on LWN looked intriguing for future work.  It's expected to fix a number of the limitations.
    > https://lwn.net/Articles/771169/
    >
    > ________________________________
    > From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Andreas Dilger <adilger@whamcloud.com>
    > Sent: Monday, November 26, 2018 11:08:45 PM
    > To: NeilBrown
    > Cc: Lustre Development List
    > Subject: Re: [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework
    >
    > On Nov 26, 2018, at 21:20, NeilBrown <neilb@suse.com> wrote:
    >>
    >> On Sun, Nov 25 2018, James Simmons wrote:
    >>
    >>> From: Dmitry Eremin <dmitry.eremin@intel.com>
    >>>
    >>> In this patch new API for parallel tasks execution is introduced.
    >>> This API based on Linux kernel padata API which is used to perform
    >>> encryption and decryption on large numbers of packets without
    >>> reordering those packets.
    >>>
    >>> It was adopted for general use in Lustre for parallelization of
    >>> various functionality. The first place of its usage is parallel I/O
    >>> implementation.
    >>>
    >>> The first step in using it is to set up a cl_ptask structure to
    >>> control of how this task are to be run:
    >>>
    >>>    #include <cl_ptask.h>
    >>>
    >>>    int cl_ptask_init(struct cl_ptask *ptask, cl_ptask_cb_t cbfunc,
    >>>                      void *cbdata, unsigned int flags, int cpu);
    >>>
    >>> The cbfunc function with cbdata argument will be called in the process
    >>> of getting the task done. The cpu specifies which CPU will be used for
    >>> the final callback when the task is done.
    >>>
    >>> The submission of task is done with:
    >>>
    >>>    int cl_ptask_submit(struct cl_ptask *ptask,
    >>>                        struct cl_ptask_engine *engine);
    >>>
    >>> The task is submitted to the engine for execution.
    >>>
    >>> In order to wait for result of task execution you should call:
    >>>
    >>>   int cl_ptask_wait_for(struct cl_ptask *ptask);
    >>>
    >>> The tasks with flag PTF_ORDERED are executed in parallel but complete
    >>> into submission order. So, waiting for last ordered task you can be sure
    >>> that all previous tasks were done before this task complete.
    >>>
    >>> This patch differs from the OpenSFS tree by adding this functional
    >>> to the clio layer instead of libcfs.
    >>
    >> While you are right that it shouldn't be in libcfs, it actually
    >> shouldn't exist at all.
    >> cfs_ptask_init() is used precisely once in OpenSFS.  There is no point
    >> creating a generic API wrapper like this that is only used once.
    >>
    >> cl_oi needs to use padata API calls directly.
    >
    > This infrastructure was also going to be used for parallel readahead, but the patch that implemented that was never landed because the expected performance gains didn't materialize.
    >
    > Cheers, Andreas
    > ---
    > Andreas Dilger
    > Principal Lustre Architect
    > Whamcloud
    >
    >
    >
    >
    >
    >
    >
    > _______________________________________________
    > lustre-devel mailing list
    > lustre-devel at lists.lustre.org
    > http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
    

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [lustre-devel] [PATCH 06/12] lustre: mdc: don't add to page cache upon failure
  2018-11-27  3:01   ` NeilBrown
@ 2018-11-29 12:06     ` Siyao Lai
  0 siblings, 0 replies; 23+ messages in thread
From: Siyao Lai @ 2018-11-29 12:06 UTC (permalink / raw)
  To: lustre-devel

You're right, I'll fix it later.

Thanks,
Lai

On 2018/11/27, 11:01 AM, "NeilBrown" <neilb@suse.com> wrote:

    On Sun, Nov 25 2018, James Simmons wrote:
    
    > From: Lai Siyao <lai.siyao@whamcloud.com>
    >
    > Reading directory pages may fail on MDS, in this case client should
    > not cache a non-up-to-date directory page, because it will cause
    > a later read on the same page fail.
    >
    > Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
    > WC-bug-id: https://jira.whamcloud.com/browse/LU-5461
    > Reviewed-on: http://review.whamcloud.com/11450
    > Reviewed-by: Fan Yong <fan.yong@intel.com>
    > Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
    > Reviewed-by: Oleg Drokin <green@whamcloud.com>
    > Signed-off-by: James Simmons <jsimmons@infradead.org>
    > ---
    >  drivers/staging/lustre/lustre/mdc/mdc_request.c | 5 ++++-
    >  1 file changed, 4 insertions(+), 1 deletion(-)
    >
    > diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
    > index 1072b66..09b30ef 100644
    > --- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
    > +++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
    > @@ -1212,7 +1212,10 @@ static int mdc_read_page_remote(void *data, struct page *page0)
    >  	}
    >  
    >  	rc = mdc_getpage(rp->rp_exp, fid, rp->rp_off, page_pool, npages, &req);
    > -	if (!rc) {
    > +	if (rc < 0) {
    > +		/* page0 is special, which was added into page cache early */
    > +		delete_from_page_cache(page0);
    
    This looks wrong to me.  We shouldn't need to delete the page from the
    page-cache.
    It won't be marked up-to-date, so why will that cause an error on a
    later read???
    
    Well, because mdc_page_locate() finds a page and if it isn't up-to-date,
    it returns -EIO.  Why does it do that?  If it found a PageError() page,
    then it might be reasonable to return -EIO.
    
    Why not just return the page that was found, and let the caller check if
    it is Uptodate?
    
    Well, because mdc_read_page() handles a successful page return from
    mdc_page_locate() as a hash collision.
    
    I guess I need to understand how lustre maps a hash to a directory
    block, and then how it handles collisions...
    
    The reason this jumped out at me is that it looks like it might be
    racy.  Adding a page and then removing it might leave a window where
    some other thread can find the page.  That is not a problem if a
    non-up-to-date page just means we should wait for it.  But if it can
    cause an error, then maybe the race is a real problem.
    
    But maybe there is some higher level locking...
    
    NeilBrown
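
The generic pattern being suggested here - hand back whatever page the
cache holds and let the caller act on PageUptodate, rather than turning a
non-up-to-date page into -EIO - would look roughly like the sketch below
(invented names, not the actual mdc code):

	static struct page *dir_page_find(struct address_space *mapping,
					  pgoff_t index)
	{
		struct page *page = find_lock_page(mapping, index);

		if (!page)
			return NULL;		/* caller fetches it from the MDS */
		if (PageUptodate(page))
			return page;		/* valid cached copy, returned locked */

		/* Not (yet) valid: drop it and let the caller re-read it
		 * rather than reporting -EIO. */
		unlock_page(page);
		put_page(page);
		return NULL;
	}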
    

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-11-29 12:06 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-26  2:48 [lustre-devel] [PATCH 00/12] lustre: new patches to address previous reviews James Simmons
2018-11-26  2:48 ` [lustre-devel] [PATCH 01/12] lustre: llite: move CONFIG_SECURITY handling to llite_internal.h James Simmons
2018-11-26  2:48 ` [lustre-devel] [PATCH 02/12] lustre: lnd: create enum kib_dev_caps James Simmons
2018-11-26  2:48 ` [lustre-devel] [PATCH 03/12] lustre: lnd: test fpo_fmr_pool pointer instead of special bool James Simmons
2018-11-26  2:48 ` [lustre-devel] [PATCH 04/12] lustre: llite: remove llite_loop left overs James Simmons
2018-11-26  2:48 ` [lustre-devel] [PATCH 05/12] lustre: llite: avoid duplicate stats debugfs registration James Simmons
2018-11-26  2:48 ` [lustre-devel] [PATCH 06/12] lustre: mdc: don't add to page cache upon failure James Simmons
2018-11-27  3:01   ` NeilBrown
2018-11-29 12:06     ` Siyao Lai
2018-11-26  2:48 ` [lustre-devel] [PATCH 07/12] lustre: ldlm: No -EINVAL for canceled != unused James Simmons
2018-11-26  2:48 ` [lustre-devel] [PATCH 08/12] lustre: mdc: propagate changelog errors to readers James Simmons
2018-11-27  3:13   ` NeilBrown
2018-11-26  2:48 ` [lustre-devel] [PATCH 09/12] lustre: obdclass: obd_device improvement James Simmons
2018-11-27  4:01   ` NeilBrown
2018-11-26  2:48 ` [lustre-devel] [PATCH 10/12] lustre: clio: Introduce parallel tasks framework James Simmons
2018-11-27  4:20   ` NeilBrown
2018-11-27  5:08     ` Andreas Dilger
2018-11-27 13:51       ` Patrick Farrell
2018-11-27 14:01         ` Patrick Farrell
2018-11-27 22:27           ` NeilBrown
2018-11-27 22:50             ` Patrick Farrell
2018-11-26  2:48 ` [lustre-devel] [PATCH 11/12] lustre: mdc: use large xattr buffers for old servers James Simmons
2018-11-26  2:48 ` [lustre-devel] [PATCH 12/12] lustre: update TODO lustre list James Simmons
