* [lustre-devel] [PATCH 0/8] lustre: some old patches from whamcloud tree
@ 2019-07-25  2:43 James Simmons
  2019-07-25  2:44 ` [lustre-devel] [PATCH 1/8] lustre: seq: make seq_proc_write_common() safer James Simmons
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: James Simmons @ 2019-07-25  2:43 UTC (permalink / raw)
  To: lustre-devel

These are some old patches missing from the Linux lustre client. I
suspect these weren't ported since most of them are likely server-side
only, but I want to make sure. We are also missing patch:

http://review.whamcloud.com/20939 from LU-8319, but I don't know
whether it is still needed after the earlier changes.
Feedback welcome.

Andreas Dilger (1):
  lustre: seq: make seq_proc_write_common() safer

Gregoire Pichon (1):
  lustre: tests: testcases for multiple modify RPCs feature

Jeremy Filizetti (1):
  lustre: ldlm: Don't check opcode with NULL rq_reqmsg

Li Wei (1):
  lustre: ptlrpc: Fix an rq_no_reply assertion failure

Wang Di (1):
  lustre: fld: retry fld rpc until the import is closed

wang di (3):
  lustre: fld: resend seq lookup RPC if it is on LWP
  lustre: fld: retry fld rpc even for ESHUTDOWN
  lustre: fld: fld client lookup should retry

 fs/lustre/fid/lproc_fid.c       | 12 +++++++-----
 fs/lustre/fld/fld_request.c     | 42 ++++++++++++++++++++++++++++++++++++++---
 fs/lustre/include/obd_support.h |  2 ++
 fs/lustre/ldlm/ldlm_lib.c       |  7 +++++++
 fs/lustre/ptlrpc/service.c      | 10 ++++++++++
 5 files changed, 65 insertions(+), 8 deletions(-)

-- 
1.8.3.1

* [lustre-devel] [PATCH 1/8] lustre: seq: make seq_proc_write_common() safer
  2019-07-25  2:43 [lustre-devel] [PATCH 0/8] lustre: some old patches from whamcloud tree James Simmons
@ 2019-07-25  2:44 ` James Simmons
  2019-07-25 23:55   ` NeilBrown
  2019-07-25  2:44 ` [lustre-devel] [PATCH 2/8] lustre: ptlrpc: Fix an rq_no_reply assertion failure James Simmons
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: James Simmons @ 2019-07-25  2:44 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Don't allow seq_proc_write_common() to specify arbitrary ranges,
since this can permanently corrupt the sequence controller and/or
sequence server.  That would allow duplicate FID allocation, or
possibly prevent any new files from being created or servers from
being added to the filesystem.

Instead, limit the sequence range that can be written via /proc to
a subset of the sequence range currently allocated to that node.
Add the "clear" keyword to allow dropping the entire local sequence
and force a new one to be fetched from the sequence server.
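
To illustrate, here is a rough user-space sketch of the validation being
enforced (the struct and helper names are invented for the example and
only mirror the logic; they are not the actual Lustre code):

#include <stdio.h>
#include <string.h>
#include <errno.h>

/* hypothetical stand-in for struct lu_seq_range */
struct seq_range { unsigned long long start, end; };

/*
 * Accept "clear", or a "[start - end]" range that is a subset of the
 * range currently allocated to this node; reject everything else.
 */
int validate_seq_write(const char *buf, const struct seq_range *cur,
		       struct seq_range *out, int *do_clear)
{
	struct seq_range tmp = { 0, 0 };

	*do_clear = 0;
	if (strncmp(buf, "clear", 5) == 0) {
		*do_clear = 1;		/* drop the local sequence entirely */
		return 0;
	}
	if (sscanf(buf, "[%llx - %llx]", &tmp.start, &tmp.end) != 2)
		return -EINVAL;
	if (tmp.start >= tmp.end ||	/* empty or reversed range */
	    tmp.start < cur->start || tmp.end > cur->end)
		return -EINVAL;		/* not a subset of the current range */
	*out = tmp;
	return 0;
}

int main(void)
{
	struct seq_range cur = { 0x200000400ULL, 0x200000420ULL };
	struct seq_range out;
	int clear;

	printf("rc = %d\n", validate_seq_write("[0x200000400 - 0x200000410]",
					       &cur, &out, &clear));
	return 0;
}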

WC-bug-id: https://jira.whamcloud.com/browse/LU-3642
Lustre-commit: 05f69f5ee20eeffcc26f643333cedcfb53ba6669
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: http://review.whamcloud.com/7123
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: James Simmons <uja.ornl@gmail.com>
Reviewed-by: Oleg Drokin <green@whamclould.com>
---
 fs/lustre/fid/lproc_fid.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/fid/lproc_fid.c b/fs/lustre/fid/lproc_fid.c
index 94869d4..e2e47df 100644
--- a/fs/lustre/fid/lproc_fid.c
+++ b/fs/lustre/fid/lproc_fid.c
@@ -52,14 +52,18 @@
 /* Format: [0x64BIT_INT - 0x64BIT_INT] + 32 bytes just in case */
 #define MAX_FID_RANGE_STRLEN (32 + 2 * 2 * sizeof(u64))
 /*
- * Note: this function is only used for testing, it is no safe for production
- * use.
+ * Reduce the SEQ range allocated to a node to a strict subset of the range
+ * currently-allocated SEQ range.  If the specified range is "clear", then
+ * drop all allocated sequences and request a new one from the master.
+ *
+ * Note: this function should only be used for testing, it is not necessarily
+ * safe for production use.
  */
 static int
 ldebugfs_fid_write_common(const char __user *buffer, size_t count,
 			  struct lu_seq_range *range)
 {
-	struct lu_seq_range tmp;
+	struct lu_seq_range tmp = { 0, };
 	int rc;
 	char kernbuf[MAX_FID_RANGE_STRLEN];
 
@@ -82,8 +86,6 @@
 	rc = sscanf(kernbuf, "[%llx - %llx]\n",
 		    (unsigned long long *)&tmp.lsr_start,
 		    (unsigned long long *)&tmp.lsr_end);
-	if (rc != 2)
-		return -EINVAL;
 	if (!lu_seq_range_is_sane(&tmp) || lu_seq_range_is_zero(&tmp) ||
 	    tmp.lsr_start < range->lsr_start || tmp.lsr_end > range->lsr_end)
 		return -EINVAL;
-- 
1.8.3.1

* [lustre-devel] [PATCH 2/8] lustre: ptlrpc: Fix an rq_no_reply assertion failure
  2019-07-25  2:43 [lustre-devel] [PATCH 0/8] lustre: some old patches from whamcloud tree James Simmons
  2019-07-25  2:44 ` [lustre-devel] [PATCH 1/8] lustre: seq: make seq_proc_write_common() safer James Simmons
@ 2019-07-25  2:44 ` James Simmons
  2019-08-14 16:58   ` Andreas Dilger
  2019-07-25  2:44 ` [lustre-devel] [PATCH 3/8] lustre: fld: resend seq lookup RPC if it is on LWP James Simmons
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: James Simmons @ 2019-07-25  2:44 UTC (permalink / raw)
  To: lustre-devel

From: Li Wei <wei.g.li@intel.com>

An OSS had an assertion failure:

  LustreError: 5366:0:(ldlm_lib.c:2689:target_bulk_io()) @@@ timeout
  on bulk GET after 0+0s  req at ffff88083a61b400
  x1476486691018500/t0(4300509964)
  o4->8dda3382-83f8-6445-5eea-828fd59e4a06 at 192.168.1.116@o2ib1:0/0
  lens 504/448 e 391470 to 0 dl 1408494729 ref 2 fl Complete:/4/0 rc
  0/0
  LustreError: 5432:0:(niobuf.c:550:ptlrpc_send_reply()) ASSERTION(
  req->rq_no_reply == 0 ) failed:
  Lustre: soaked-OST0000: Bulk IO write error with
  8dda3382-83f8-6445-5eea-828fd59e4a06 (at 192.168.1.116 at o2ib1),
  client will retry: rc -110
  LustreError: 5432:0:(niobuf.c:550:ptlrpc_send_reply()) LBUG
  Pid: 5432, comm: ll_ost_io03_003

  Call Trace:
  [<ffffffffa0641895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
  [<ffffffffa0641e97>] lbug_with_loc+0x47/0xb0 [libcfs]
  [<ffffffffa09cda4c>] ptlrpc_send_reply+0x4ec/0x7f0 [ptlrpc]
  [<ffffffffa09d4aae>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
  [<ffffffffa09e4d75>] ptlrpc_at_check_timed+0xcd5/0x1370 [ptlrpc]
  [<ffffffffa09dc1e9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
  [<ffffffffa09e66f8>] ptlrpc_main+0x12e8/0x1990 [ptlrpc]
  [<ffffffff81069290>] ? pick_next_task_fair+0xd0/0x130
  [<ffffffff81529246>] ? schedule+0x176/0x3b0
  [<ffffffffa09e5410>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
  [<ffffffff8109abf6>] kthread+0x96/0xa0
  [<ffffffff8100c20a>] child_rip+0xa/0x20
  [<ffffffff8109ab60>] ? kthread+0x0/0xa0
  [<ffffffff8100c200>] ? child_rip+0x0/0x20

The thread in tgt_brw_write() had decided not to reply by setting
rq_no_reply, right before another thread tried to send an early reply
for the request.
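
A simplified sketch of the check being added (the type below is an
illustrative stand-in, not the real ptlrpc request structure):

#include <errno.h>

/* illustrative stand-in for the relevant part of struct ptlrpc_request */
struct req_sketch {
	int rq_no_reply;  /* set by tgt_brw_read()/tgt_brw_write() on bulk timeout */
};

/* called from the adaptive-timeout path that builds early replies */
int at_send_early_reply_sketch(struct req_sketch *req)
{
	/*
	 * The handler thread may already have decided not to reply at all;
	 * sending an early reply now would trip the rq_no_reply assertion
	 * in ptlrpc_send_reply(), so bail out instead.
	 */
	if (req->rq_no_reply)
		return -ETIMEDOUT;

	/* ... otherwise pack and send the early reply as before ... */
	return 0;
}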

WC-bug-id: https://jira.whamcloud.com/browse/LU-5537
Lustre-commit: a8d448e4cd5978c546911f98067232bcdd30b651
Signed-off-by: Li Wei <wei.g.li@intel.com>
Reviewed-on: http://review.whamcloud.com/11740
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Johann Lombardi <johann.lombardi@intel.com>
---
 fs/lustre/ptlrpc/service.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index a40e964..c9ab9c3 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -1098,6 +1098,16 @@ static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
 	reqcopy->rq_reqmsg = reqmsg;
 	memcpy(reqmsg, req->rq_reqmsg, req->rq_reqlen);
 
+	/*
+	 * tgt_brw_read() and tgt_brw_write() may have decided not to reply.
+	 * Without this check, we would fail the rq_no_reply assertion in
+	 * ptlrpc_send_reply().
+	 */
+	if (reqcopy->rq_no_reply) {
+		rc = -ETIMEDOUT;
+		goto out;
+	}
+
 	LASSERT(atomic_read(&req->rq_refcount));
 	/** if it is last refcount then early reply isn't needed */
 	if (atomic_read(&req->rq_refcount) == 1) {
-- 
1.8.3.1

* [lustre-devel] [PATCH 3/8] lustre: fld: resend seq lookup RPC if it is on LWP
  2019-07-25  2:43 [lustre-devel] [PATCH 0/8] lustre: some old patches from whamcloud tree James Simmons
  2019-07-25  2:44 ` [lustre-devel] [PATCH 1/8] lustre: seq: make seq_proc_write_common() safer James Simmons
  2019-07-25  2:44 ` [lustre-devel] [PATCH 2/8] lustre: ptlrpc: Fix an rq_no_reply assertion failure James Simmons
@ 2019-07-25  2:44 ` James Simmons
  2019-08-14 16:58   ` Andreas Dilger
  2019-07-25  2:44 ` [lustre-devel] [PATCH 4/8] lustre: fld: retry fld rpc even for ESHUTDOWN James Simmons
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: James Simmons @ 2019-07-25  2:44 UTC (permalink / raw)
  To: lustre-devel

From: wang di <di.wang@intel.com>

Because the Light Weight (LWP) connection might be evicted
after a restart, in-flight RPCs can fail. To avoid this, we
need to resend the seq lookup RPC.

Also remove "-f" from "stop mdt" in sanity test 17m, so that
umount keeps the connection; otherwise the OSP might be
evicted.
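
The resend behaviour added to fld_client_rpc() boils down to the
following sketch (the helpers are trivial stand-ins invented for the
example):

#include <errno.h>

/* trivial stand-ins for the real ptlrpc request handling */
static int build_seq_lookup_request(void) { return 0; }	/* sets rq_allow_replay/rq_no_delay */
static int queue_and_wait(void) { return 0; }			/* ptlrpc_queue_wait() */
static void finish_request(void) { }				/* ptlrpc_req_finished() */

int fld_client_rpc_sketch(void)
{
	int rc;

again:
	rc = build_seq_lookup_request();
	if (rc)
		return rc;

	rc = queue_and_wait();
	if (rc == -EWOULDBLOCK) {
		/*
		 * The no_delay lookup failed because the LWP connection is
		 * being evicted; drop the request and resend rather than
		 * returning an error to the application.
		 */
		finish_request();
		goto again;
	}
	return rc;
}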

WC-bug-id: https://jira.whamcloud.com/browse/LU-4571
Lustre-commit: cf7f66d87e52293535cde6e8cc7386e6c1bdfa46
Signed-off-by: wang di <di.wang@intel.com>
Reviewed-on: http://review.whamcloud.com/9106
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Niu Yawei <yawei.niu@intel.com>
---
 fs/lustre/fld/fld_request.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/fld/fld_request.c b/fs/lustre/fld/fld_request.c
index 248fffa..ec45ea6 100644
--- a/fs/lustre/fld/fld_request.c
+++ b/fs/lustre/fld/fld_request.c
@@ -314,6 +314,7 @@ int fld_client_rpc(struct obd_export *exp,
 
 	LASSERT(exp);
 
+again:
 	imp = class_exp2cliimp(exp);
 	switch (fld_op) {
 	case FLD_QUERY:
@@ -329,8 +330,15 @@ int fld_client_rpc(struct obd_export *exp,
 		op = req_capsule_client_get(&req->rq_pill, &RMF_FLD_OPC);
 		*op = FLD_LOOKUP;
 
-		if (imp->imp_connect_flags_orig & OBD_CONNECT_MDS_MDS)
+		/* For MDS_MDS seq lookup, it will always use LWP connection,
+		 * but LWP will be evicted after restart, so cause the error.
+		 * so we will set no_delay for seq lookup request, once the
+		 * request fails because of the eviction. always retry here
+		 */
+		if (imp->imp_connect_flags_orig & OBD_CONNECT_MDS_MDS) {
 			req->rq_allow_replay = 1;
+			req->rq_no_delay = 1;
+		}
 		break;
 	case FLD_READ:
 		req = ptlrpc_request_alloc_pack(imp, &RQF_FLD_READ,
@@ -358,8 +366,19 @@ int fld_client_rpc(struct obd_export *exp,
 	obd_get_request_slot(&exp->exp_obd->u.cli);
 	rc = ptlrpc_queue_wait(req);
 	obd_put_request_slot(&exp->exp_obd->u.cli);
-	if (rc)
+	if (rc != 0) {
+		if (rc == -EWOULDBLOCK) {
+			/* For no_delay req(see above), EWOULDBLOCK means the
+			 * connection is being evicted, but this seq lookup
+			 * should not return error, since it would cause
+			 * unecessary failure of the application, instead
+			 * it should retry here
+			 */
+			ptlrpc_req_finished(req);
+			goto again;
+		}
 		goto out_req;
+	}
 
 	if (fld_op == FLD_QUERY) {
 		prange = req_capsule_server_get(&req->rq_pill, &RMF_FLD_MDFLD);
-- 
1.8.3.1

* [lustre-devel] [PATCH 4/8] lustre: fld: retry fld rpc even for ESHUTDOWN
  2019-07-25  2:43 [lustre-devel] [PATCH 0/8] lustre: some old patches from whamcloud tree James Simmons
                   ` (2 preceding siblings ...)
  2019-07-25  2:44 ` [lustre-devel] [PATCH 3/8] lustre: fld: resend seq lookup RPC if it is on LWP James Simmons
@ 2019-07-25  2:44 ` James Simmons
  2019-08-14 16:58   ` Andreas Dilger
  2019-08-14 16:58   ` Andreas Dilger
  2019-07-25  2:44 ` [lustre-devel] [PATCH 5/8] lustre: fld: retry fld rpc until the import is closed James Simmons
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 19+ messages in thread
From: James Simmons @ 2019-07-25  2:44 UTC (permalink / raw)
  To: lustre-devel

From: wang di <di.wang@intel.com>

When the LWP connection is being evicted, because it is not
replayable, the request might return ESHUTDOWN or EWOULDBLOCK
instead of failing cleanly, which might cause application
failure. The fld client will now retry the RPC until the
connection is set up again or the LWP is closed.

WC-bug-id: https://jira.whamcloud.com/browse/LU-4420
Lustre-commit: d335e310d4bf490509998ddbb1824e38cff20998
Signed-off-by: wang di <di.wang@intel.com>
Reviewed-on: http://review.whamcloud.com/10285
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Johann Lombardi <johann.lombardi@intel.com>
---
 fs/lustre/fld/fld_request.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/lustre/fld/fld_request.c b/fs/lustre/fld/fld_request.c
index ec45ea6..ba0ef82 100644
--- a/fs/lustre/fld/fld_request.c
+++ b/fs/lustre/fld/fld_request.c
@@ -367,12 +367,12 @@ int fld_client_rpc(struct obd_export *exp,
 	rc = ptlrpc_queue_wait(req);
 	obd_put_request_slot(&exp->exp_obd->u.cli);
 	if (rc != 0) {
-		if (rc == -EWOULDBLOCK) {
-			/* For no_delay req(see above), EWOULDBLOCK means the
-			 * connection is being evicted, but this seq lookup
-			 * should not return error, since it would cause
-			 * unecessary failure of the application, instead
-			 * it should retry here
+		if (rc == -EWOULDBLOCK || rc == -ESHUTDOWN) {
+			/* For no_delay req(see above), EWOULDBLOCK and
+			 * ESHUTDOWN means the connection is being evicted,
+			 * but this seq lookup should not return error,
+			 * since it would cause unecessary failure of the
+			 * application, instead it should retry here
 			 */
 			ptlrpc_req_finished(req);
 			goto again;
-- 
1.8.3.1

* [lustre-devel] [PATCH 5/8] lustre: fld: retry fld rpc until the import is closed
  2019-07-25  2:43 [lustre-devel] [PATCH 0/8] lustre: some old patches from whamcloud tree James Simmons
                   ` (3 preceding siblings ...)
  2019-07-25  2:44 ` [lustre-devel] [PATCH 4/8] lustre: fld: retry fld rpc even for ESHUTDOWN James Simmons
@ 2019-07-25  2:44 ` James Simmons
  2019-08-14 16:58   ` Andreas Dilger
  2019-07-25  2:44 ` [lustre-devel] [PATCH 6/8] lustre: fld: fld client lookup should retry James Simmons
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: James Simmons @ 2019-07-25  2:44 UTC (permalink / raw)
  To: lustre-devel

From: Wang Di <di.wang@intel.com>

Since LWP is not replayable, the fld client should retry the
RPC until the import is closed; otherwise it would cause
unnecessary failure of the application.

WC-bug-id: https://jira.whamcloud.com/browse/LU-5312
Lustre-commit: 07d481957c16832c782bb7d0c1fc436bcb148ea6
Signed-off-by: Wang Di <di.wang@intel.com>
Reviewed-on: http://review.whamcloud.com/11039
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
---
 fs/lustre/fld/fld_request.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/lustre/fld/fld_request.c b/fs/lustre/fld/fld_request.c
index ba0ef82..60e7105 100644
--- a/fs/lustre/fld/fld_request.c
+++ b/fs/lustre/fld/fld_request.c
@@ -367,12 +367,10 @@ int fld_client_rpc(struct obd_export *exp,
 	rc = ptlrpc_queue_wait(req);
 	obd_put_request_slot(&exp->exp_obd->u.cli);
 	if (rc != 0) {
-		if (rc == -EWOULDBLOCK || rc == -ESHUTDOWN) {
-			/* For no_delay req(see above), EWOULDBLOCK and
-			 * ESHUTDOWN means the connection is being evicted,
-			 * but this seq lookup should not return error,
-			 * since it would cause unecessary failure of the
-			 * application, instead it should retry here
+		if (imp->imp_state != LUSTRE_IMP_CLOSED) {
+			/* Since LWP is not replayable, so it will keep
+			 * trying unless umount happens, otherwise it would
+			 * cause unecessary failure of the application.
 			 */
 			ptlrpc_req_finished(req);
 			goto again;
-- 
1.8.3.1

* [lustre-devel] [PATCH 6/8] lustre: fld: fld client lookup should retry
  2019-07-25  2:43 [lustre-devel] [PATCH 0/8] lustre: some old patches from whamcloud tree James Simmons
                   ` (4 preceding siblings ...)
  2019-07-25  2:44 ` [lustre-devel] [PATCH 5/8] lustre: fld: retry fld rpc until the import is closed James Simmons
@ 2019-07-25  2:44 ` James Simmons
  2019-08-14 16:58   ` Andreas Dilger
  2019-07-25  2:44 ` [lustre-devel] [PATCH 7/8] lustre: tests: testcases for multiple modify RPCs feature James Simmons
  2019-07-25  2:44 ` [lustre-devel] [PATCH 8/8] lustre: ldlm: Don't check opcode with NULL rq_reqmsg James Simmons
  7 siblings, 1 reply; 19+ messages in thread
From: James Simmons @ 2019-07-25  2:44 UTC (permalink / raw)
  To: lustre-devel

From: wang di <di.wang@intel.com>

If an FLD client lookup fails because the remote target
is shut down (or deactivated), it should retry another
target; otherwise the application will fail.

The FLD client should also stop retrying if the import
has been deactivated.
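
A self-contained sketch of the retry-next-target behaviour (using a
plain array instead of the kernel linked list; all names and values
are illustrative):

#include <stdio.h>
#include <errno.h>

#define NR_TARGETS 3

/* illustrative FLD lookup: targets 0 and 1 are "shut down" in this demo */
static int fld_lookup_on(int target)
{
	return target < 2 ? -ESHUTDOWN : 0;
}

/* try the preferred target first, then cycle through the others once */
static int fld_lookup_retry(int preferred)
{
	int target = preferred;
	int rc;

	do {
		rc = fld_lookup_on(target);
		if (rc != -ESHUTDOWN)
			return rc;			/* success or a real error */
		target = (target + 1) % NR_TARGETS;	/* next target, wrapping around */
	} while (target != preferred);

	return rc;					/* every target was shut down */
}

int main(void)
{
	printf("lookup rc = %d\n", fld_lookup_retry(0));
	return 0;
}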

WC-bug-id: https://jira.whamcloud.com/browse/LU-6419
Lustre-commit: 3ededde903c92f8485cae0dc9f958f194ff0b140
Signed-off-by: wang di <di.wang@intel.com>
Reviewed-on: http://review.whamcloud.com/14313
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
---
 fs/lustre/fld/fld_request.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/fld/fld_request.c b/fs/lustre/fld/fld_request.c
index 60e7105..dfd4ae9 100644
--- a/fs/lustre/fld/fld_request.c
+++ b/fs/lustre/fld/fld_request.c
@@ -367,7 +367,7 @@ int fld_client_rpc(struct obd_export *exp,
 	rc = ptlrpc_queue_wait(req);
 	obd_put_request_slot(&exp->exp_obd->u.cli);
 	if (rc != 0) {
-		if (imp->imp_state != LUSTRE_IMP_CLOSED) {
+		if (imp->imp_state != LUSTRE_IMP_CLOSED && !imp->imp_deactive) {
 			/* Since LWP is not replayable, so it will keep
 			 * trying unless umount happens, otherwise it would
 			 * cause unecessary failure of the application.
@@ -404,6 +404,7 @@ int fld_client_lookup(struct lu_client_fld *fld, u64 seq, u32 *mds,
 {
 	struct lu_seq_range res = { 0 };
 	struct lu_fld_target *target;
+	struct lu_fld_target *origin;
 	int rc;
 
 	rc = fld_cache_lookup(fld->lcf_cache, seq, &res);
@@ -415,7 +416,8 @@ int fld_client_lookup(struct lu_client_fld *fld, u64 seq, u32 *mds,
 	/* Can not find it in the cache */
 	target = fld_client_get_target(fld, seq);
 	LASSERT(target);
-
+	origin = target;
+again:
 	CDEBUG(D_INFO,
 	       "%s: Lookup fld entry (seq: %#llx) on target %s (idx %llu)\n",
 	       fld->lcf_name, seq, fld_target_name(target), target->ft_idx);
@@ -424,6 +426,23 @@ int fld_client_lookup(struct lu_client_fld *fld, u64 seq, u32 *mds,
 	fld_range_set_type(&res, flags);
 	rc = fld_client_rpc(target->ft_exp, &res, FLD_QUERY, NULL);
 
+	if (rc == -ESHUTDOWN) {
+		/* If fld lookup failed because the target has been shutdown,
+		 * then try next target in the list, until trying all targets
+		 * or fld lookup succeeds
+		 */
+		spin_lock(&fld->lcf_lock);
+		if (target->ft_chain.next == fld->lcf_targets.prev)
+			target = list_entry(fld->lcf_targets.next,
+					    struct lu_fld_target, ft_chain);
+		else
+			target = list_entry(target->ft_chain.next,
+						 struct lu_fld_target,
+						 ft_chain);
+		spin_unlock(&fld->lcf_lock);
+		if (target != origin)
+			goto again;
+	}
 	if (rc == 0) {
 		*mds = res.lsr_index;
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 7/8] lustre: tests: testcases for multiple modify RPCs feature
  2019-07-25  2:43 [lustre-devel] [PATCH 0/8] lustre: some old patches from whamcloud tree James Simmons
                   ` (5 preceding siblings ...)
  2019-07-25  2:44 ` [lustre-devel] [PATCH 6/8] lustre: fld: fld client lookup should retry James Simmons
@ 2019-07-25  2:44 ` James Simmons
  2019-08-14 16:58   ` Andreas Dilger
  2019-07-25  2:44 ` [lustre-devel] [PATCH 8/8] lustre: ldlm: Don't check opcode with NULL rq_reqmsg James Simmons
  7 siblings, 1 reply; 19+ messages in thread
From: James Simmons @ 2019-07-25  2:44 UTC (permalink / raw)
  To: lustre-devel

From: Gregoire Pichon <gregoire.pichon@bull.net>

This patch creates new testcases in the Auster test suite to
test the support of multiple modify RPCs in flight feature.

Two new OBD_FAIL codes are added to allow several failure occurrences
of the reint requests or replies. This is needed because the current
fail checks impose the OBD_FAIL_ONCE flag.

Added testcases are:
- sanity         245
- conf-sanity    90a 90b 90c 90d
- replay-single  102a 102b 102c 102d

This patch also removes replay-single tests 53a and 53b
from the ALWAYS_EXCEPT list.
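
To illustrate why the dedicated fail codes are needed, here is a toy
user-space approximation of the difference between a "fire once" check
and the new "fire every time" checks (this only mimics the behaviour of
the OBD_FAIL macros, it is not the real implementation):

#include <stdio.h>
#include <stdbool.h>

static unsigned int fail_loc;		/* value set by the test script */
static bool fail_once_fired;

static bool fail_check_once(unsigned int code)
{
	if (fail_loc != code || fail_once_fired)
		return false;
	fail_once_fired = true;		/* only the first request is affected */
	return true;
}

static bool fail_check_multi(unsigned int code)
{
	return fail_loc == code;	/* every matching request is affected */
}

int main(void)
{
	int i;

	fail_loc = 0x15a;		/* OBD_FAIL_MDS_REINT_MULTI_NET_REP */
	for (i = 0; i < 3; i++)
		printf("req %d: once=%d multi=%d\n", i,
		       fail_check_once(0x15a), fail_check_multi(0x15a));
	return 0;
}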

WC-bug-id: https://jira.whamcloud.com/browse/LU-5319
Lustre-commit: c2d27a0f12688c0d029880919f8b002e557b540c
Signed-off-by: Gregoire Pichon <gregoire.pichon@bull.net>
Reviewed-on: http://review.whamcloud.com/14861
Reviewed-by: Jian Yu <yujian@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
---
 fs/lustre/include/obd_support.h | 2 ++
 fs/lustre/ldlm/ldlm_lib.c       | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 3e15cac..7999ac6 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -189,6 +189,8 @@
 #define OBD_FAIL_MDS_SWAP_LAYOUTS_NET			0x14f
 #define OBD_FAIL_MDS_HSM_ACTION_NET			0x150
 #define OBD_FAIL_MDS_CHANGELOG_INIT			0x151
+#define OBD_FAIL_MDS_REINT_MULTI_NET			0x159
+#define OBD_FAIL_MDS_REINT_MULTI_NET_REP		0x15a
 
 /* layout lock */
 #define OBD_FAIL_MDS_NO_LL_GETATTR			0x170
diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
index 887507d..a3b8df4 100644
--- a/fs/lustre/ldlm/ldlm_lib.c
+++ b/fs/lustre/ldlm/ldlm_lib.c
@@ -689,6 +689,9 @@ int target_pack_pool_reply(struct ptlrpc_request *req)
 		DEBUG_REQ(D_ERROR, req, "dropping reply");
 		return -ECOMM;
 	}
+	if (unlikely(lustre_msg_get_opc(req->rq_reqmsg) == MDS_REINT &&
+		     OBD_FAIL_CHECK(OBD_FAIL_MDS_REINT_MULTI_NET_REP)))
+		return -ECOMM;
 
 	if (unlikely(rc)) {
 		DEBUG_REQ(D_NET, req, "processing error (%d)", rc);
-- 
1.8.3.1

* [lustre-devel] [PATCH 8/8] lustre: ldlm: Don't check opcode with NULL rq_reqmsg
  2019-07-25  2:43 [lustre-devel] [PATCH 0/8] lustre: some old patches from whamcloud tree James Simmons
                   ` (6 preceding siblings ...)
  2019-07-25  2:44 ` [lustre-devel] [PATCH 7/8] lustre: tests: testcases for multiple modify RPCs feature James Simmons
@ 2019-07-25  2:44 ` James Simmons
  2019-08-14 16:58   ` Andreas Dilger
  7 siblings, 1 reply; 19+ messages in thread
From: James Simmons @ 2019-07-25  2:44 UTC (permalink / raw)
  To: lustre-devel

From: Jeremy Filizetti <jeremy.filizetti@gmail.com>

When GSS is enabled, it's possible to have a NULL rq_reqmsg
if a bad signature or no context is returned during the unwrap
of the request.  Don't check the opcode in this case.
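
The added guard amounts to the following sketch (the types and the
opcode value are illustrative stand-ins, not the real ldlm/ptlrpc
definitions):

#include <stdbool.h>
#include <stddef.h>

#define MDS_REINT_SKETCH 36	/* illustrative opcode value */

struct msg_sketch { int opcode; };
struct req_sketch { struct msg_sketch *rq_reqmsg; };

static bool fail_check(void)	/* stand-in for OBD_FAIL_CHECK(...) */
{
	return false;
}

/* only look at the opcode when rq_reqmsg survived GSS unwrapping */
bool should_drop_reply(const struct req_sketch *req)
{
	return req->rq_reqmsg &&
	       req->rq_reqmsg->opcode == MDS_REINT_SKETCH &&
	       fail_check();
}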

WC-bug-id: https://jira.whamcloud.com/browse/LU-7508
Lustre-commit: 3f4572caef5f25f4a9b5347b2ccf933fdad9db9c
Signed-off-by: Jeremy Filizetti <jeremy.filizetti@gmail.com>
Reviewed-on: http://review.whamcloud.com/17414
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
---
 fs/lustre/ldlm/ldlm_lib.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
index a3b8df4..04982b8 100644
--- a/fs/lustre/ldlm/ldlm_lib.c
+++ b/fs/lustre/ldlm/ldlm_lib.c
@@ -689,8 +689,12 @@ int target_pack_pool_reply(struct ptlrpc_request *req)
 		DEBUG_REQ(D_ERROR, req, "dropping reply");
 		return -ECOMM;
 	}
-	if (unlikely(lustre_msg_get_opc(req->rq_reqmsg) == MDS_REINT &&
-		     OBD_FAIL_CHECK(OBD_FAIL_MDS_REINT_MULTI_NET_REP)))
+	/* We can have a null rq_reqmsg in the event of bad signature or
+	 * no context when unwrapping
+	 */
+	if (req->rq_reqmsg &&
+	    unlikely(lustre_msg_get_opc(req->rq_reqmsg) == MDS_REINT &&
+	    OBD_FAIL_CHECK(OBD_FAIL_MDS_REINT_MULTI_NET_REP)))
 		return -ECOMM;
 
 	if (unlikely(rc)) {
-- 
1.8.3.1

* [lustre-devel] [PATCH 1/8] lustre: seq: make seq_proc_write_common() safer
  2019-07-25  2:44 ` [lustre-devel] [PATCH 1/8] lustre: seq: make seq_proc_write_common() safer James Simmons
@ 2019-07-25 23:55   ` NeilBrown
  2019-07-26  3:31     ` James Simmons
  0 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2019-07-25 23:55 UTC (permalink / raw)
  To: lustre-devel

On Wed, Jul 24 2019, James Simmons wrote:

> From: Andreas Dilger <adilger@whamcloud.com>
>
> Don't allow seq_proc_write_common() to specify arbitrary ranges,
> since this can permanently corrupt the sequence controller and/or
> sequnece server.  That would allow duplicate FID allocation, or
> possibly prevent any new files to be created or servers to be added
> to the filesystem.
>
> Instead, limit the sequence range that can be written via /proc to
> a subset of the sequence range currently allocated to that node.
> Add the "clear" keyword to allow dropping the entire local sequence
> and force a new one to be fetched from the sequence server.
>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-3642
> Lustre-commit: 05f69f5ee20eeffcc26f643333cedcfb53ba6669
> Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
> Reviewed-on: http://review.whamcloud.com/7123
> Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
> Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
> Reviewed-by: James Simmons <uja.ornl@gmail.com>
> Reviewed-by: Oleg Drokin <green@whamclould.com>
> ---
>  fs/lustre/fid/lproc_fid.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)

I already have this is my backport branch.
 a8ab6db57383

NeilBrown


>
> diff --git a/fs/lustre/fid/lproc_fid.c b/fs/lustre/fid/lproc_fid.c
> index 94869d4..e2e47df 100644
> --- a/fs/lustre/fid/lproc_fid.c
> +++ b/fs/lustre/fid/lproc_fid.c
> @@ -52,14 +52,18 @@
>  /* Format: [0x64BIT_INT - 0x64BIT_INT] + 32 bytes just in case */
>  #define MAX_FID_RANGE_STRLEN (32 + 2 * 2 * sizeof(u64))
>  /*
> - * Note: this function is only used for testing, it is no safe for production
> - * use.
> + * Reduce the SEQ range allocated to a node to a strict subset of the range
> + * currently-allocated SEQ range.  If the specified range is "clear", then
> + * drop all allocated sequences and request a new one from the master.
> + *
> + * Note: this function should only be used for testing, it is not necessarily
> + * safe for production use.
>   */
>  static int
>  ldebugfs_fid_write_common(const char __user *buffer, size_t count,
>  			  struct lu_seq_range *range)
>  {
> -	struct lu_seq_range tmp;
> +	struct lu_seq_range tmp = { 0, };
>  	int rc;
>  	char kernbuf[MAX_FID_RANGE_STRLEN];
>  
> @@ -82,8 +86,6 @@
>  	rc = sscanf(kernbuf, "[%llx - %llx]\n",
>  		    (unsigned long long *)&tmp.lsr_start,
>  		    (unsigned long long *)&tmp.lsr_end);
> -	if (rc != 2)
> -		return -EINVAL;
>  	if (!lu_seq_range_is_sane(&tmp) || lu_seq_range_is_zero(&tmp) ||
>  	    tmp.lsr_start < range->lsr_start || tmp.lsr_end > range->lsr_end)
>  		return -EINVAL;
> -- 
> 1.8.3.1

* [lustre-devel] [PATCH 1/8] lustre: seq: make seq_proc_write_common() safer
  2019-07-25 23:55   ` NeilBrown
@ 2019-07-26  3:31     ` James Simmons
  0 siblings, 0 replies; 19+ messages in thread
From: James Simmons @ 2019-07-26  3:31 UTC (permalink / raw)
  To: lustre-devel


> > From: Andreas Dilger <adilger@whamcloud.com>
> >
> > Don't allow seq_proc_write_common() to specify arbitrary ranges,
> > since this can permanently corrupt the sequence controller and/or
> > sequnece server.  That would allow duplicate FID allocation, or
> > possibly prevent any new files to be created or servers to be added
> > to the filesystem.
> >
> > Instead, limit the sequence range that can be written via /proc to
> > a subset of the sequence range currently allocated to that node.
> > Add the "clear" keyword to allow dropping the entire local sequence
> > and force a new one to be fetched from the sequence server.
> >
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-3642
> > Lustre-commit: 05f69f5ee20eeffcc26f643333cedcfb53ba6669
> > Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
> > Reviewed-on: http://review.whamcloud.com/7123
> > Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
> > Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
> > Reviewed-by: James Simmons <uja.ornl@gmail.com>
> > Reviewed-by: Oleg Drokin <green@whamclould.com>
> > ---
> >  fs/lustre/fid/lproc_fid.c | 12 +++++++-----
> >  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> I already have this is my backport branch.
>  a8ab6db57383

I pushed the first batch from your backport branch. I remember
pushing some patches before for fid / fld that ended up being
server-only patches. It's not really clear what is server or
client in that code. I was hoping for some feedback from Andreas
and / or Oleg about this patch collection being server only.

> > diff --git a/fs/lustre/fid/lproc_fid.c b/fs/lustre/fid/lproc_fid.c
> > index 94869d4..e2e47df 100644
> > --- a/fs/lustre/fid/lproc_fid.c
> > +++ b/fs/lustre/fid/lproc_fid.c
> > @@ -52,14 +52,18 @@
> >  /* Format: [0x64BIT_INT - 0x64BIT_INT] + 32 bytes just in case */
> >  #define MAX_FID_RANGE_STRLEN (32 + 2 * 2 * sizeof(u64))
> >  /*
> > - * Note: this function is only used for testing, it is no safe for production
> > - * use.
> > + * Reduce the SEQ range allocated to a node to a strict subset of the range
> > + * currently-allocated SEQ range.  If the specified range is "clear", then
> > + * drop all allocated sequences and request a new one from the master.
> > + *
> > + * Note: this function should only be used for testing, it is not necessarily
> > + * safe for production use.
> >   */
> >  static int
> >  ldebugfs_fid_write_common(const char __user *buffer, size_t count,
> >  			  struct lu_seq_range *range)
> >  {
> > -	struct lu_seq_range tmp;
> > +	struct lu_seq_range tmp = { 0, };
> >  	int rc;
> >  	char kernbuf[MAX_FID_RANGE_STRLEN];
> >  
> > @@ -82,8 +86,6 @@
> >  	rc = sscanf(kernbuf, "[%llx - %llx]\n",
> >  		    (unsigned long long *)&tmp.lsr_start,
> >  		    (unsigned long long *)&tmp.lsr_end);
> > -	if (rc != 2)
> > -		return -EINVAL;
> >  	if (!lu_seq_range_is_sane(&tmp) || lu_seq_range_is_zero(&tmp) ||
> >  	    tmp.lsr_start < range->lsr_start || tmp.lsr_end > range->lsr_end)
> >  		return -EINVAL;
> > -- 
> > 1.8.3.1
> 

* [lustre-devel] [PATCH 5/8] lustre: fld: retry fld rpc until the import is closed
  2019-07-25  2:44 ` [lustre-devel] [PATCH 5/8] lustre: fld: retry fld rpc until the import is closed James Simmons
@ 2019-08-14 16:58   ` Andreas Dilger
  0 siblings, 0 replies; 19+ messages in thread
From: Andreas Dilger @ 2019-08-14 16:58 UTC (permalink / raw)
  To: lustre-devel

You may as well merge this with the previous patch. 

Cheers, Andreas

> On Jul 24, 2019, at 19:44, James Simmons <jsimmons@infradead.org> wrote:
> 
> From: Wang Di <di.wang@intel.com>
> 
> Since LWP is not replayable, it should retry RPC until
> the import is closed, otherwise it would cause unnecessary
> failure of the application.
> 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-5312
> Lustre-commit: 07d481957c16832c782bb7d0c1fc436bcb148ea6
> Signed-off-by: Wang Di <di.wang@intel.com>
> Reviewed-on: http://review.whamcloud.com/11039
> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
> Reviewed-by: Fan Yong <fan.yong@intel.com>
> Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
> ---
> fs/lustre/fld/fld_request.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/lustre/fld/fld_request.c b/fs/lustre/fld/fld_request.c
> index ba0ef82..60e7105 100644
> --- a/fs/lustre/fld/fld_request.c
> +++ b/fs/lustre/fld/fld_request.c
> @@ -367,12 +367,10 @@ int fld_client_rpc(struct obd_export *exp,
>    rc = ptlrpc_queue_wait(req);
>    obd_put_request_slot(&exp->exp_obd->u.cli);
>    if (rc != 0) {
> -        if (rc == -EWOULDBLOCK || rc == -ESHUTDOWN) {
> -            /* For no_delay req(see above), EWOULDBLOCK and
> -             * ESHUTDOWN means the connection is being evicted,
> -             * but this seq lookup should not return error,
> -             * since it would cause unecessary failure of the
> -             * application, instead it should retry here
> +        if (imp->imp_state != LUSTRE_IMP_CLOSED) {
> +            /* Since LWP is not replayable, so it will keep
> +             * trying unless umount happens, otherwise it would
> +             * cause unecessary failure of the application.
>             */
>            ptlrpc_req_finished(req);
>            goto again;
> -- 
> 1.8.3.1
> 

* [lustre-devel] [PATCH 8/8] lustre: ldlm: Don't check opcode with NULL rq_reqmsg
  2019-07-25  2:44 ` [lustre-devel] [PATCH 8/8] lustre: ldlm: Don't check opcode with NULL rq_reqmsg James Simmons
@ 2019-08-14 16:58   ` Andreas Dilger
  0 siblings, 0 replies; 19+ messages in thread
From: Andreas Dilger @ 2019-08-14 16:58 UTC (permalink / raw)
  To: lustre-devel

The ldlm code on the client acts as a "server" for some RPCs from the
OSS and MDS, namely lock cancellation and granting callbacks
(ASTs).

So even if some of this looks like server code, it is also used on the client.

Cheers, Andreas

> On Jul 24, 2019, at 19:44, James Simmons <jsimmons@infradead.org> wrote:
> 
> From: Jeremy Filizetti <jeremy.filizetti@gmail.com>
> 
> When GSS is enabled it's possible to have a NULL rq_reqmsg
> if a bad signature or no context is returned during the unwrap
> of the request.  Don't check the opcode in this case.
> 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-7508
> Lustre-commit: 3f4572caef5f25f4a9b5347b2ccf933fdad9db9c
> Signed-off-by: Jeremy Filizetti <jeremy.filizetti@gmail.com>
> Reviewed-on: http://review.whamcloud.com/17414
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
> Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
> ---
> fs/lustre/ldlm/ldlm_lib.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
> index a3b8df4..04982b8 100644
> --- a/fs/lustre/ldlm/ldlm_lib.c
> +++ b/fs/lustre/ldlm/ldlm_lib.c
> @@ -689,8 +689,12 @@ int target_pack_pool_reply(struct ptlrpc_request *req)
>        DEBUG_REQ(D_ERROR, req, "dropping reply");
>        return -ECOMM;
>    }
> -    if (unlikely(lustre_msg_get_opc(req->rq_reqmsg) == MDS_REINT &&
> -             OBD_FAIL_CHECK(OBD_FAIL_MDS_REINT_MULTI_NET_REP)))
> +    /* We can have a null rq_reqmsg in the event of bad signature or
> +     * no context when unwrapping
> +     */
> +    if (req->rq_reqmsg &&
> +        unlikely(lustre_msg_get_opc(req->rq_reqmsg) == MDS_REINT &&
> +        OBD_FAIL_CHECK(OBD_FAIL_MDS_REINT_MULTI_NET_REP)))
>        return -ECOMM;
> 
>    if (unlikely(rc)) {
> -- 
> 1.8.3.1
> 

* [lustre-devel] [PATCH 4/8] lustre: fld: retry fld rpc even for ESHUTDOWN
  2019-07-25  2:44 ` [lustre-devel] [PATCH 4/8] lustre: fld: retry fld rpc even for ESHUTDOWN James Simmons
@ 2019-08-14 16:58   ` Andreas Dilger
  2019-08-14 16:58   ` Andreas Dilger
  1 sibling, 0 replies; 19+ messages in thread
From: Andreas Dilger @ 2019-08-14 16:58 UTC (permalink / raw)
  To: lustre-devel

This is mostly useful only on the servers, but clients also allocate "sequences" of 1, so it doesn't hurt to apply it to clients as well. 

Cheers, Andreas

> On Jul 24, 2019, at 19:44, James Simmons <jsimmons@infradead.org> wrote:
> 
> From: wang di <di.wang@intel.com>
> 
> when LWP is being evicted, because it is not replayable,
> the request might return ESHUTDOWN or EWOULDBLOCK, instead
> of failed, which might cause application failure, fld
> client will retry RPC, until the connection is being setup
> again or the LWP is being closed.
> 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-4420
> Lustre-commit: d335e310d4bf490509998ddbb1824e38cff20998
> Signed-off-by: wang di <di.wang@intel.com>
> Reviewed-on: http://review.whamcloud.com/10285
> Reviewed-by: Fan Yong <fan.yong@intel.com>
> Reviewed-by: Johann Lombardi <johann.lombardi@intel.com>
> ---
> fs/lustre/fld/fld_request.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/lustre/fld/fld_request.c b/fs/lustre/fld/fld_request.c
> index ec45ea6..ba0ef82 100644
> --- a/fs/lustre/fld/fld_request.c
> +++ b/fs/lustre/fld/fld_request.c
> @@ -367,12 +367,12 @@ int fld_client_rpc(struct obd_export *exp,
>    rc = ptlrpc_queue_wait(req);
>    obd_put_request_slot(&exp->exp_obd->u.cli);
>    if (rc != 0) {
> -        if (rc == -EWOULDBLOCK) {
> -            /* For no_delay req(see above), EWOULDBLOCK means the
> -             * connection is being evicted, but this seq lookup
> -             * should not return error, since it would cause
> -             * unecessary failure of the application, instead
> -             * it should retry here
> +        if (rc == -EWOULDBLOCK || rc == -ESHUTDOWN) {
> +            /* For no_delay req(see above), EWOULDBLOCK and
> +             * ESHUTDOWN means the connection is being evicted,
> +             * but this seq lookup should not return error,
> +             * since it would cause unecessary failure of the
> +             * application, instead it should retry here
>             */
>            ptlrpc_req_finished(req);
>            goto again;
> -- 
> 1.8.3.1
> 

* [lustre-devel] [PATCH 4/8] lustre: fld: retry fld rpc even for ESHUTDOWN
  2019-07-25  2:44 ` [lustre-devel] [PATCH 4/8] lustre: fld: retry fld rpc even for ESHUTDOWN James Simmons
  2019-08-14 16:58   ` Andreas Dilger
@ 2019-08-14 16:58   ` Andreas Dilger
  1 sibling, 0 replies; 19+ messages in thread
From: Andreas Dilger @ 2019-08-14 16:58 UTC (permalink / raw)
  To: lustre-devel

While FLD is used on both client and server, the LWP connections are only used on clients. However, I don't see any harm in keeping this code consistent with the master branch.

Otherwise, you would have to remember to apply this patch at some point in the future when the server code is landed, which is more likely to leave the bug in place.

Cheers, Andreas

> On Jul 24, 2019, at 19:44, James Simmons <jsimmons@infradead.org> wrote:
> 
> From: wang di <di.wang@intel.com>
> 
> when LWP is being evicted, because it is not replayable,
> the request might return ESHUTDOWN or EWOULDBLOCK, instead
> of failed, which might cause application failure, fld
> client will retry RPC, until the connection is being setup
> again or the LWP is being closed.
> 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-4420
> Lustre-commit: d335e310d4bf490509998ddbb1824e38cff20998
> Signed-off-by: wang di <di.wang@intel.com>
> Reviewed-on: http://review.whamcloud.com/10285
> Reviewed-by: Fan Yong <fan.yong@intel.com>
> Reviewed-by: Johann Lombardi <johann.lombardi@intel.com>
> ---
> fs/lustre/fld/fld_request.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/lustre/fld/fld_request.c b/fs/lustre/fld/fld_request.c
> index ec45ea6..ba0ef82 100644
> --- a/fs/lustre/fld/fld_request.c
> +++ b/fs/lustre/fld/fld_request.c
> @@ -367,12 +367,12 @@ int fld_client_rpc(struct obd_export *exp,
>    rc = ptlrpc_queue_wait(req);
>    obd_put_request_slot(&exp->exp_obd->u.cli);
>    if (rc != 0) {
> -        if (rc == -EWOULDBLOCK) {
> -            /* For no_delay req(see above), EWOULDBLOCK means the
> -             * connection is being evicted, but this seq lookup
> -             * should not return error, since it would cause
> -             * unecessary failure of the application, instead
> -             * it should retry here
> +        if (rc == -EWOULDBLOCK || rc == -ESHUTDOWN) {
> +            /* For no_delay req(see above), EWOULDBLOCK and
> +             * ESHUTDOWN means the connection is being evicted,
> +             * but this seq lookup should not return error,
> +             * since it would cause unecessary failure of the
> +             * application, instead it should retry here
>             */
>            ptlrpc_req_finished(req);
>            goto again;
> -- 
> 1.8.3.1
> 

* [lustre-devel] [PATCH 7/8] lustre: tests: testcases for multiple modify RPCs feature
  2019-07-25  2:44 ` [lustre-devel] [PATCH 7/8] lustre: tests: testcases for multiple modify RPCs feature James Simmons
@ 2019-08-14 16:58   ` Andreas Dilger
  0 siblings, 0 replies; 19+ messages in thread
From: Andreas Dilger @ 2019-08-14 16:58 UTC (permalink / raw)
  To: lustre-devel

This is needed on the client for the test cases to work. 

Cheers, Andreas

> On Jul 24, 2019, at 19:44, James Simmons <jsimmons@infradead.org> wrote:
> 
> From: Gregoire Pichon <gregoire.pichon@bull.net>
> 
> This patch creates new testcases in the Auster test suite to
> test the support of multiple modify RPCs in flight feature.
> 
> Two new OBD_FAIL codes are added to allow several failure occurences
> of the reint requests or replies. This is needed because the current
> fail checks impose the OBD_FAIL_ONCE flag.
> 
> Added testcases are :
> - sanity         245
> - conf-sanity    90a 90b 90c 90d
> - replay-single  102a 102b 102c 102d
> 
> This patch also removes replay-single tests 53a and 53b
> from the ALWAYS_EXCEPT list.
> 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-5319
> Lustre-commit: c2d27a0f12688c0d029880919f8b002e557b540c
> Signed-off-by: Gregoire Pichon <gregoire.pichon@bull.net>
> Reviewed-on: http://review.whamcloud.com/14861
> Reviewed-by: Jian Yu <yujian@whamcloud.com>
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> ---
> fs/lustre/include/obd_support.h | 2 ++
> fs/lustre/ldlm/ldlm_lib.c       | 3 +++
> 2 files changed, 5 insertions(+)
> 
> diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
> index 3e15cac..7999ac6 100644
> --- a/fs/lustre/include/obd_support.h
> +++ b/fs/lustre/include/obd_support.h
> @@ -189,6 +189,8 @@
> #define OBD_FAIL_MDS_SWAP_LAYOUTS_NET            0x14f
> #define OBD_FAIL_MDS_HSM_ACTION_NET            0x150
> #define OBD_FAIL_MDS_CHANGELOG_INIT            0x151
> +#define OBD_FAIL_MDS_REINT_MULTI_NET            0x159
> +#define OBD_FAIL_MDS_REINT_MULTI_NET_REP        0x15a
> 
> /* layout lock */
> #define OBD_FAIL_MDS_NO_LL_GETATTR            0x170
> diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
> index 887507d..a3b8df4 100644
> --- a/fs/lustre/ldlm/ldlm_lib.c
> +++ b/fs/lustre/ldlm/ldlm_lib.c
> @@ -689,6 +689,9 @@ int target_pack_pool_reply(struct ptlrpc_request *req)
>        DEBUG_REQ(D_ERROR, req, "dropping reply");
>        return -ECOMM;
>    }
> +    if (unlikely(lustre_msg_get_opc(req->rq_reqmsg) == MDS_REINT &&
> +             OBD_FAIL_CHECK(OBD_FAIL_MDS_REINT_MULTI_NET_REP)))
> +        return -ECOMM;
> 
>    if (unlikely(rc)) {
>        DEBUG_REQ(D_NET, req, "processing error (%d)", rc);
> -- 
> 1.8.3.1
> 

* [lustre-devel] [PATCH 2/8] lustre: ptlrpc: Fix an rq_no_reply assertion failure
  2019-07-25  2:44 ` [lustre-devel] [PATCH 2/8] lustre: ptlrpc: Fix an rq_no_reply assertion failure James Simmons
@ 2019-08-14 16:58   ` Andreas Dilger
  0 siblings, 0 replies; 19+ messages in thread
From: Andreas Dilger @ 2019-08-14 16:58 UTC (permalink / raw)
  To: lustre-devel

This is definitely server code.

Cheers, Andreas

> On Jul 24, 2019, at 19:44, James Simmons <jsimmons@infradead.org> wrote:
> 
> From: Li Wei <wei.g.li@intel.com>
> 
> An OSS had an assertion failure:
> 
>  LustreError: 5366:0:(ldlm_lib.c:2689:target_bulk_io()) @@@ timeout
>  on bulk GET after 0+0s  req at ffff88083a61b400
>  x1476486691018500/t0(4300509964)
>  o4->8dda3382-83f8-6445-5eea-828fd59e4a06 at 192.168.1.116@o2ib1:0/0
>  lens 504/448 e 391470 to 0 dl 1408494729 ref 2 fl Complete:/4/0 rc
>  0/0
>  LustreError: 5432:0:(niobuf.c:550:ptlrpc_send_reply()) ASSERTION(
>  req->rq_no_reply == 0 ) failed:
>  Lustre: soaked-OST0000: Bulk IO write error with
>  8dda3382-83f8-6445-5eea-828fd59e4a06 (at 192.168.1.116 at o2ib1),
>  client will retry: rc -110
>  LustreError: 5432:0:(niobuf.c:550:ptlrpc_send_reply()) LBUG
>  Pid: 5432, comm: ll_ost_io03_003
> 
>  Call Trace:
>  [<ffffffffa0641895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
>  [<ffffffffa0641e97>] lbug_with_loc+0x47/0xb0 [libcfs]
>  [<ffffffffa09cda4c>] ptlrpc_send_reply+0x4ec/0x7f0 [ptlrpc]
>  [<ffffffffa09d4aae>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
>  [<ffffffffa09e4d75>] ptlrpc_at_check_timed+0xcd5/0x1370 [ptlrpc]
>  [<ffffffffa09dc1e9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
>  [<ffffffffa09e66f8>] ptlrpc_main+0x12e8/0x1990 [ptlrpc]
>  [<ffffffff81069290>] ? pick_next_task_fair+0xd0/0x130
>  [<ffffffff81529246>] ? schedule+0x176/0x3b0
>  [<ffffffffa09e5410>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
>  [<ffffffff8109abf6>] kthread+0x96/0xa0
>  [<ffffffff8100c20a>] child_rip+0xa/0x20
>  [<ffffffff8109ab60>] ? kthread+0x0/0xa0
>  [<ffffffff8100c200>] ? child_rip+0x0/0x20
> 
> The thread in tgt_brw_write() had decided not to reply by setting
> rq_no_reply, right before another thread tried to send an early reply
> for the request.
> 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-5537
> Lustre-commit: a8d448e4cd5978c546911f98067232bcdd30b651
> Signed-off-by: Li Wei <wei.g.li@intel.com>
> Reviewed-on: http://review.whamcloud.com/11740
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> Reviewed-by: Johann Lombardi <johann.lombardi@intel.com>
> ---
> fs/lustre/ptlrpc/service.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
> 
> diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
> index a40e964..c9ab9c3 100644
> --- a/fs/lustre/ptlrpc/service.c
> +++ b/fs/lustre/ptlrpc/service.c
> @@ -1098,6 +1098,16 @@ static int ptlrpc_at_send_early_reply(struct ptlrpc_request *req)
>    reqcopy->rq_reqmsg = reqmsg;
>    memcpy(reqmsg, req->rq_reqmsg, req->rq_reqlen);
> 
> +    /*
> +     * tgt_brw_read() and tgt_brw_write() may have decided not to reply.
> +     * Without this check, we would fail the rq_no_reply assertion in
> +     * ptlrpc_send_reply().
> +     */
> +    if (reqcopy->rq_no_reply) {
> +        rc = -ETIMEDOUT;
> +        goto out;
> +    }
> +
>    LASSERT(atomic_read(&req->rq_refcount));
>    /** if it is last refcount then early reply isn't needed */
>    if (atomic_read(&req->rq_refcount) == 1) {
> -- 
> 1.8.3.1
> 

* [lustre-devel] [PATCH 3/8] lustre: fld: resend seq lookup RPC if it is on LWP
  2019-07-25  2:44 ` [lustre-devel] [PATCH 3/8] lustre: fld: resend seq lookup RPC if it is on LWP James Simmons
@ 2019-08-14 16:58   ` Andreas Dilger
  0 siblings, 0 replies; 19+ messages in thread
From: Andreas Dilger @ 2019-08-14 16:58 UTC (permalink / raw)
  To: lustre-devel

This is functionality used only by the server (the LWP connection and also
the MDS-MDS connection flag).  But as I wrote previously, it will be tough
to track this patch so that it is applied only when the server code is
landed. Instead, it would likely just become a hard-to-find bug that needs
to be tracked down and fixed again.

My preference would be to land this and other similar patches in shared
code that is not easily separated into client- and server-only sections.

Cheers, Andreas

> On Jul 24, 2019, at 19:44, James Simmons <jsimmons@infradead.org> wrote:
> 
> From: wang di <di.wang@intel.com>
> 
> Because Light Weight connection might be evicted after
> restart, then cause inflight RPC fails, to avoid this,
> we need resend seq lookup RPC.
> 
> remove "-f" from "stop mdt" in sanity 17m, so umount can
> keep the the connection, and otherwise the OSP might be
> evicted.
> 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-4571
> Lustre-commit: cf7f66d87e52293535cde6e8cc7386e6c1bdfa46
> Signed-off-by: wang di <di.wang@intel.com>
> Reviewed-on: http://review.whamcloud.com/9106
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
> Reviewed-by: Niu Yawei <yawei.niu@intel.com>
> ---
> fs/lustre/fld/fld_request.c | 23 +++++++++++++++++++++--
> 1 file changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/lustre/fld/fld_request.c b/fs/lustre/fld/fld_request.c
> index 248fffa..ec45ea6 100644
> --- a/fs/lustre/fld/fld_request.c
> +++ b/fs/lustre/fld/fld_request.c
> @@ -314,6 +314,7 @@ int fld_client_rpc(struct obd_export *exp,
> 
>    LASSERT(exp);
> 
> +again:
>    imp = class_exp2cliimp(exp);
>    switch (fld_op) {
>    case FLD_QUERY:
> @@ -329,8 +330,15 @@ int fld_client_rpc(struct obd_export *exp,
>        op = req_capsule_client_get(&req->rq_pill, &RMF_FLD_OPC);
>        *op = FLD_LOOKUP;
> 
> -        if (imp->imp_connect_flags_orig & OBD_CONNECT_MDS_MDS)
> +        /* For MDS_MDS seq lookup, it will always use LWP connection,
> +         * but LWP will be evicted after restart, so cause the error.
> +         * so we will set no_delay for seq lookup request, once the
> +         * request fails because of the eviction. always retry here
> +         */
> +        if (imp->imp_connect_flags_orig & OBD_CONNECT_MDS_MDS) {
>            req->rq_allow_replay = 1;
> +            req->rq_no_delay = 1;
> +        }
>        break;
>    case FLD_READ:
>        req = ptlrpc_request_alloc_pack(imp, &RQF_FLD_READ,
> @@ -358,8 +366,19 @@ int fld_client_rpc(struct obd_export *exp,
>    obd_get_request_slot(&exp->exp_obd->u.cli);
>    rc = ptlrpc_queue_wait(req);
>    obd_put_request_slot(&exp->exp_obd->u.cli);
> -    if (rc)
> +    if (rc != 0) {
> +        if (rc == -EWOULDBLOCK) {
> +            /* For no_delay req(see above), EWOULDBLOCK means the
> +             * connection is being evicted, but this seq lookup
> +             * should not return error, since it would cause
> +             * unecessary failure of the application, instead
> +             * it should retry here
> +             */
> +            ptlrpc_req_finished(req);
> +            goto again;
> +        }
>        goto out_req;
> +    }
> 
>    if (fld_op == FLD_QUERY) {
>        prange = req_capsule_server_get(&req->rq_pill, &RMF_FLD_MDFLD);
> -- 
> 1.8.3.1
> 

* [lustre-devel] [PATCH 6/8] lustre: fld: fld client lookup should retry
  2019-07-25  2:44 ` [lustre-devel] [PATCH 6/8] lustre: fld: fld client lookup should retry James Simmons
@ 2019-08-14 16:58   ` Andreas Dilger
  0 siblings, 0 replies; 19+ messages in thread
From: Andreas Dilger @ 2019-08-14 16:58 UTC (permalink / raw)
  To: lustre-devel

This is definitely client code. 

Cheers, Andreas

> On Jul 24, 2019, at 19:44, James Simmons <jsimmons@infradead.org> wrote:
> 
> From: wang di <di.wang@intel.com>
> 
> If FLD client lookup fails because of the remote target
> is shutdown (or deactive), it should retry another target,
> otherwise it will cause the application failure.
> 
> And FLD client should stop retry if the import has
> been deactive.
> 
> WC-bug-id: https://jira.whamcloud.com/browse/LU-6419
> Lustre-commit: 3ededde903c92f8485cae0dc9f958f194ff0b140
> Signed-off-by: wang di <di.wang@intel.com>
> Reviewed-on: http://review.whamcloud.com/14313
> Reviewed-by: Lai Siyao <lai.siyao@intel.com>
> Reviewed-by: Fan Yong <fan.yong@intel.com>
> Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
> Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
> ---
> fs/lustre/fld/fld_request.c | 23 +++++++++++++++++++++--
> 1 file changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/lustre/fld/fld_request.c b/fs/lustre/fld/fld_request.c
> index 60e7105..dfd4ae9 100644
> --- a/fs/lustre/fld/fld_request.c
> +++ b/fs/lustre/fld/fld_request.c
> @@ -367,7 +367,7 @@ int fld_client_rpc(struct obd_export *exp,
>    rc = ptlrpc_queue_wait(req);
>    obd_put_request_slot(&exp->exp_obd->u.cli);
>    if (rc != 0) {
> -        if (imp->imp_state != LUSTRE_IMP_CLOSED) {
> +        if (imp->imp_state != LUSTRE_IMP_CLOSED && !imp->imp_deactive) {
>            /* Since LWP is not replayable, so it will keep
>             * trying unless umount happens, otherwise it would
>             * cause unecessary failure of the application.
> @@ -404,6 +404,7 @@ int fld_client_lookup(struct lu_client_fld *fld, u64 seq, u32 *mds,
> {
>    struct lu_seq_range res = { 0 };
>    struct lu_fld_target *target;
> +    struct lu_fld_target *origin;
>    int rc;
> 
>    rc = fld_cache_lookup(fld->lcf_cache, seq, &res);
> @@ -415,7 +416,8 @@ int fld_client_lookup(struct lu_client_fld *fld, u64 seq, u32 *mds,
>    /* Can not find it in the cache */
>    target = fld_client_get_target(fld, seq);
>    LASSERT(target);
> -
> +    origin = target;
> +again:
>    CDEBUG(D_INFO,
>           "%s: Lookup fld entry (seq: %#llx) on target %s (idx %llu)\n",
>           fld->lcf_name, seq, fld_target_name(target), target->ft_idx);
> @@ -424,6 +426,23 @@ int fld_client_lookup(struct lu_client_fld *fld, u64 seq, u32 *mds,
>    fld_range_set_type(&res, flags);
>    rc = fld_client_rpc(target->ft_exp, &res, FLD_QUERY, NULL);
> 
> +    if (rc == -ESHUTDOWN) {
> +        /* If fld lookup failed because the target has been shutdown,
> +         * then try next target in the list, until trying all targets
> +         * or fld lookup succeeds
> +         */
> +        spin_lock(&fld->lcf_lock);
> +        if (target->ft_chain.next == fld->lcf_targets.prev)
> +            target = list_entry(fld->lcf_targets.next,
> +                        struct lu_fld_target, ft_chain);
> +        else
> +            target = list_entry(target->ft_chain.next,
> +                         struct lu_fld_target,
> +                         ft_chain);
> +        spin_unlock(&fld->lcf_lock);
> +        if (target != origin)
> +            goto again;
> +    }
>    if (rc == 0) {
>        *mds = res.lsr_index;
> 
> -- 
> 1.8.3.1
> 
