lustre-devel-lustre.org archive mirror
* [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021
@ 2021-07-07 19:11 James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 01/15] lustre: osc: Notify server if cache discard takes a long time James Simmons
                   ` (14 more replies)
  0 siblings, 15 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

Sync the native client to current tip of OpenSFS tree.

Alex Zhuravlev (1):
  lustre: mgc: configurable wait-to-reprocess time

Alexander Boyko (1):
  lustre: client: don't panic for mgs evictions

Chris Horn (3):
  lnet: Add health ping stats
  lnet: Ensure ref taken when queueing for discovery
  lnet: Correct distance calculation of local NIDs

James Simmons (1):
  lnet: add netlink infrastructure

Oleg Drokin (3):
  lustre: osc: Notify server if cache discard takes a long time
  lustre: mdt: New connect flag for non-open-by-fid lock request
  lustre: obdclass: Wake up entire queue of requests on close completion

Patrick Farrell (5):
  lustre: osc: Move shrink update to per-write
  lustre: llite: parallelize direct i/o issuance
  lustre: osc: Don't get time for each page
  lustre: clio: Implement real list splice
  lustre: osc: Simplify clipping for transient pages

Serguei Smirnov (1):
  lnet: socklnd: detect link state to set fatal error on ni

 fs/lustre/include/cl_object.h          | 13 +++++-
 fs/lustre/include/lustre_osc.h         |  6 +--
 fs/lustre/include/obd_class.h          | 15 ++++++
 fs/lustre/llite/file.c                 | 51 +++++++++++++++++++--
 fs/lustre/llite/llite_internal.h       |  9 ++++
 fs/lustre/llite/llite_lib.c            |  4 +-
 fs/lustre/llite/lproc_llite.c          | 37 +++++++++++++++
 fs/lustre/llite/namei.c                |  4 +-
 fs/lustre/llite/rw26.c                 | 41 ++++++-----------
 fs/lustre/llite/vvp_io.c               |  1 +
 fs/lustre/mgc/mgc_internal.h           |  8 ++++
 fs/lustre/mgc/mgc_request.c            | 44 +++++++++++++-----
 fs/lustre/obdclass/cl_io.c             | 39 +++++++++++++---
 fs/lustre/obdclass/genops.c            |  6 ++-
 fs/lustre/obdclass/lprocfs_status.c    |  6 +++
 fs/lustre/osc/osc_cache.c              | 42 ++++++++++++++---
 fs/lustre/osc/osc_internal.h           |  1 +
 fs/lustre/osc/osc_io.c                 | 18 ++++++--
 fs/lustre/osc/osc_page.c               | 10 ++--
 fs/lustre/osc/osc_request.c            | 54 +++++++++++++++++-----
 fs/lustre/ptlrpc/import.c              |  5 +-
 fs/lustre/ptlrpc/wiretest.c            |  2 +
 include/linux/lnet/lib-types.h         | 15 ++++++
 include/uapi/linux/lnet/lnet-dlc.h     |  4 ++
 include/uapi/linux/lnet/lnet-nl.h      | 67 +++++++++++++++++++++++++++
 include/uapi/linux/lustre/lustre_idl.h |  1 +
 net/lnet/klnds/socklnd/socklnd.c       | 78 ++++++++++++++++++++++++++++++++
 net/lnet/klnds/socklnd/socklnd.h       |  1 +
 net/lnet/lnet/api-ni.c                 | 83 ++++++++++++++++++++++++++++++++++
 net/lnet/lnet/lib-move.c               | 40 ++++++++++------
 net/lnet/lnet/peer.c                   | 10 ++--
 31 files changed, 611 insertions(+), 104 deletions(-)
 create mode 100644 include/uapi/linux/lnet/lnet-nl.h

-- 
1.8.3.1

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org


* [lustre-devel] [PATCH 01/15] lustre: osc: Notify server if cache discard takes a long time
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 02/15] lustre: osc: Move shrink update to per-write James Simmons
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Oleg Drokin <green@whamcloud.com>

Discarding a large number of pages from a mapping under a
single lock can take a very long time (discarding 750GB takes
over 170s). Since no stream of RPCs is sent to the server, as
there is with read or write, to prolong the DLM lock timeout,
the server may evict the client because it sees no progress
being made.

As such, send periodic "empty" RPCs to the server to show the
client is still alive and working on the pages under the lock.

For compatibility reasons the RPC is formed as a one-byte
OST_READ request with a special flag set to avoid doing
actual IO; older servers, however, still do the one-byte read.

WC-bug-id: https://jira.whamcloud.com/browse/LU-14711
Lustre-commit: 564070343ac4ccf4 ("LU-14711 osc: Notify server if cache discard takes a long time")
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/43857
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Patrick Farrell <farr0186@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h |  3 +++
 fs/lustre/osc/osc_cache.c     | 11 +++++++++
 fs/lustre/osc/osc_internal.h  |  1 +
 fs/lustre/osc/osc_request.c   | 54 +++++++++++++++++++++++++++++++++----------
 4 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index c615091..1495949 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -1919,6 +1919,9 @@ struct cl_io {
 			loff_t			ls_result;
 			int			ls_whence;
 		} ci_lseek;
+		struct cl_misc_io {
+			time64_t		lm_next_rpc_time;
+		} ci_misc;
 	} u;
 	struct cl_2queue	ci_queue;
 	size_t			ci_nob;
diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 8dd12b1..321e9d9 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -3186,6 +3186,15 @@ bool osc_page_gang_lookup(const struct lu_env *env, struct cl_io *io,
 
 		if (!res)
 			break;
+
+		if (io->ci_type == CIT_MISC &&
+		    io->u.ci_misc.lm_next_rpc_time &&
+		    ktime_get_seconds() > io->u.ci_misc.lm_next_rpc_time) {
+			osc_send_empty_rpc(osc, idx << PAGE_SHIFT);
+			io->u.ci_misc.lm_next_rpc_time = ktime_get_seconds() +
+							 5 * obd_timeout / 16;
+		}
+
 		if (need_resched())
 			cond_resched();
 
@@ -3320,6 +3329,8 @@ int osc_lock_discard_pages(const struct lu_env *env, struct osc_object *osc,
 
 	io->ci_obj = cl_object_top(osc2cl(osc));
 	io->ci_ignore_layout = 1;
+	io->u.ci_misc.lm_next_rpc_time = ktime_get_seconds() +
+					 5 * obd_timeout / 16;
 	result = cl_io_init(env, io, CIT_MISC, io->ci_obj);
 	if (result != 0)
 		goto out;
diff --git a/fs/lustre/osc/osc_internal.h b/fs/lustre/osc/osc_internal.h
index 3b65f2d..d174691 100644
--- a/fs/lustre/osc/osc_internal.h
+++ b/fs/lustre/osc/osc_internal.h
@@ -87,6 +87,7 @@ int osc_ladvise_base(struct obd_export *exp, struct obdo *oa,
 int osc_process_config_base(struct obd_device *obd, struct lustre_cfg *cfg);
 int osc_build_rpc(const struct lu_env *env, struct client_obd *cli,
 		  struct list_head *ext_list, int cmd);
+void osc_send_empty_rpc(struct osc_object *osc, pgoff_t start);
 unsigned long osc_lru_reserve(struct client_obd *cli, unsigned long npages);
 void osc_lru_unreserve(struct client_obd *cli, unsigned long npages);
 
diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index 0d590ed..2b2ee83 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -1399,21 +1399,23 @@ static int osc_brw_prep_request(int cmd, struct client_obd *cli,
 	struct brw_page *pg_prev;
 	void *short_io_buf;
 	const char *obd_name = cli->cl_import->imp_obd->obd_name;
-	struct inode *inode;
+	struct inode *inode = NULL;
 	bool directio = false;
 
-	inode = page2inode(pga[0]->pg);
-	if (!inode) {
-		/* Try to get reference to inode from cl_page if we are
-		 * dealing with direct IO, as handled pages are not
-		 * actual page cache pages.
-		 */
-		struct osc_async_page *oap = brw_page2oap(pga[0]);
-		struct cl_page *clpage = oap2cl_page(oap);
+	if (pga[0]->pg) {
+		inode = page2inode(pga[0]->pg);
+		if (!inode) {
+			/* Try to get reference to inode from cl_page if we are
+			 * dealing with direct IO, as handled pages are not
+			 * actual page cache pages.
+			 */
+			struct osc_async_page *oap = brw_page2oap(pga[0]);
+			struct cl_page *clpage = oap2cl_page(oap);
 
-		inode = clpage->cp_inode;
-		if (inode)
-			directio = true;
+			inode = clpage->cp_inode;
+			if (inode)
+				directio = true;
+		}
 	}
 	if (OBD_FAIL_CHECK(OBD_FAIL_OSC_BRW_PREP_REQ))
 		return -ENOMEM; /* Recoverable */
@@ -2666,6 +2668,34 @@ int osc_build_rpc(const struct lu_env *env, struct client_obd *cli,
 	return rc;
 }
 
+/* This is to refresh our lock in face of no RPCs. */
+void osc_send_empty_rpc(struct osc_object *osc, pgoff_t start)
+{
+	struct ptlrpc_request *req;
+	struct obdo oa;
+	struct brw_page bpg = { .off = start, .count = 1};
+	struct brw_page *pga = &bpg;
+	int rc;
+
+	memset(&oa, 0, sizeof(oa));
+	oa.o_oi = osc->oo_oinfo->loi_oi;
+	oa.o_valid = OBD_MD_FLID | OBD_MD_FLGROUP | OBD_MD_FLFLAGS;
+	/* For updated servers - don't do a read */
+	oa.o_flags = OBD_FL_NORPC;
+
+	rc = osc_brw_prep_request(OBD_BRW_READ, osc_cli(osc), &oa, 1, &pga,
+				  &req, 0);
+
+	/* If we succeeded we ship it off, if not there's no point in doing
+	 * anything. Also no resends.
+	 * No interpret callback, no commit callback.
+	 */
+	if (!rc) {
+		req->rq_no_resend = 1;
+		ptlrpcd_add_req(req);
+	}
+}
+
 static int osc_set_lock_data(struct ldlm_lock *lock, void *data)
 {
 	int set = 0;
-- 
1.8.3.1
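
The keepalive timing that PATCH 01 adds to osc_page_gang_lookup() can be
sketched in user space as follows. This is only an illustration under stated
assumptions, not the kernel code: sketch_obd_timeout stands in for
obd_timeout with an assumed value of 100 seconds, the clock is a plain
integer instead of ktime_get_seconds(), and sending the "empty" RPC is
reduced to incrementing a counter. All names below are hypothetical.

```c
#include <stdint.h>

/* Assumed stand-in for obd_timeout, in seconds (hypothetical value). */
static const int64_t sketch_obd_timeout = 100;

struct discard_state {
	int64_t next_rpc_time;	/* mirrors lm_next_rpc_time; 0 disables keepalive */
	unsigned int rpcs_sent;	/* counts simulated "empty" RPCs */
};

/* Same integer expression as the patch: 5 * obd_timeout / 16. */
static int64_t keepalive_interval(void)
{
	return 5 * sketch_obd_timeout / 16;
}

/* Arm the timer when the discard starts, as osc_lock_discard_pages() does. */
static void discard_start(struct discard_state *st, int64_t now)
{
	st->next_rpc_time = now + keepalive_interval();
	st->rpcs_sent = 0;
}

/* Called once per lookup batch: send a keepalive if the deadline has passed. */
static void discard_tick(struct discard_state *st, int64_t now)
{
	if (st->next_rpc_time && now > st->next_rpc_time) {
		st->rpcs_sent++;	/* osc_send_empty_rpc() in the patch */
		st->next_rpc_time = now + keepalive_interval();
	}
}

/* Simulate a discard lasting 'duration' seconds with one batch per second. */
static unsigned int simulate_discard(int64_t duration)
{
	struct discard_state st;
	int64_t now;

	discard_start(&st, 0);
	for (now = 0; now <= duration; now++)
		discard_tick(&st, now);
	return st.rpcs_sent;
}
```

With the assumed timeout of 100 seconds the interval truncates to 31 seconds,
so a discard lasting 170 seconds (the figure quoted in the commit message)
would send a handful of keepalives rather than none.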


* [lustre-devel] [PATCH 02/15] lustre: osc: Move shrink update to per-write
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 01/15] lustre: osc: Notify server if cache discard takes a long time James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 03/15] lustre: client: don't panic for mgs evictions James Simmons
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Patrick Farrell, Lustre Development List

From: Patrick Farrell <farr0186@gmail.com>

Updating the grant shrink interval is currently done for
each page submitted, rather than once per write.  Since
the grant shrink interval is in seconds, this is
unnecessary.

This came up because this function showed up in the perf
traces for https://review.whamcloud.com/#/c/38151/, and
it is called with the cl_loi_list_lock held.

Note that this change makes this access to the grant shrink
interval a 'dirty' access, without locking, but the grant
shrink interval is:
A) already accessed like this in various places, and
B) can safely be out of date or suffer a lost update
without affecting correctness or performance.

IOR performance testing with this test:
mpirun -np 36 $IOR -o $LUSTRE -w -t 1M -b 2G -i 1 -F

No patches:
5942 MiB/s
With 38151:
14950 MiB/s
With 38151+this:
15320 MiB/s

WC-bug-id: https://jira.whamcloud.com/browse/LU-13419
Lustre-commit: c24c25dc1b84912 ("LU-13419 osc: Move shrink update to per-write")
Signed-off-by: Patrick Farrell <farr0186@gmail.com>
Reviewed-on: https://review.whamcloud.com/38214
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_cache.c | 1 -
 fs/lustre/osc/osc_io.c    | 5 +++++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 321e9d9..0f0daa1 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -1426,7 +1426,6 @@ static void osc_consume_write_grant(struct client_obd *cli,
 	pga->flag |= OBD_BRW_FROM_GRANT;
 	CDEBUG(D_CACHE, "using %lu grant credits for brw %p page %p\n",
 	       PAGE_SIZE, pga, pga->pg);
-	osc_update_next_shrink(cli);
 }
 
 /* the companion to osc_consume_write_grant, called when a brw has completed.
diff --git a/fs/lustre/osc/osc_io.c b/fs/lustre/osc/osc_io.c
index de214ba..67fe85b 100644
--- a/fs/lustre/osc/osc_io.c
+++ b/fs/lustre/osc/osc_io.c
@@ -354,6 +354,11 @@ int osc_io_commit_async(const struct lu_env *env,
 			pagevec_reinit(pvec);
 		}
 	}
+	/* The shrink interval is in seconds, so we can update it once per
+	 * write, rather than once per page.
+	 */
+	osc_update_next_shrink(osc_cli(osc));
+
 
 	/* Clean up any partially full pagevecs */
 	if (pagevec_count(pvec) != 0)
-- 
1.8.3.1


* [lustre-devel] [PATCH 03/15] lustre: client: don't panic for mgs evictions
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 01/15] lustre: osc: Notify server if cache discard takes a long time James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 02/15] lustre: osc: Move shrink update to per-write James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 04/15] lnet: Add health ping stats James Simmons
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Ben Evans, Alexander Boyko, Lustre Development List

From: Alexander Boyko <alexander.boyko@hpe.com>

Avoid client panics for MGS evictions.
Create a function to check whether the eviction is coming
from an MGS and, if so, ignore it.

Rework dump_on_eviction and lbug_on_eviction so
all logic is handled in one place.

HPE-bug-id: LUS-197
WC-bug-id: https://jira.whamcloud.com/browse/LU-13811
Lustre-commit: 5d8f6742e65d588d ("LU-13811 client: don't panic for mgs evictions")
Signed-off-by: Alexander Boyko <alexander.boyko@hpe.com>
Signed-off-by: Ben Evans <jevans@cray.com>
Reviewed-on: https://review.whamcloud.com/43655
Reviewed-by: Andriy Skulysh <askulysh@gmail.com>
Reviewed-by: Alexander Zarochentsev <alexander.zarochentsev@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_class.h | 15 +++++++++++++++
 fs/lustre/ptlrpc/import.c     |  5 ++---
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index 2fe4ea2..f2a3d2b 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -1701,6 +1701,21 @@ int class_add_nids_to_uuid(struct obd_uuid *uuid, lnet_nid_t *nids,
 int class_procfs_init(void);
 int class_procfs_clean(void);
 
+extern unsigned int obd_lbug_on_eviction;
+extern unsigned int obd_dump_on_eviction;
+
+static inline bool do_dump_on_eviction(struct obd_device *exp_obd)
+{
+	if (obd_lbug_on_eviction &&
+	    strncmp(exp_obd->obd_type->typ_name, LUSTRE_MGC_NAME,
+		    strlen(LUSTRE_MGC_NAME))) {
+		CERROR("LBUG upon eviction\n");
+		LBUG();
+	}
+
+	return obd_dump_on_eviction;
+}
+
 /* statfs_pack.c */
 struct kstatfs;
 void statfs_pack(struct obd_statfs *osfs, struct kstatfs *sfs);
diff --git a/fs/lustre/ptlrpc/import.c b/fs/lustre/ptlrpc/import.c
index 1f31edb..f28fb68 100644
--- a/fs/lustre/ptlrpc/import.c
+++ b/fs/lustre/ptlrpc/import.c
@@ -1473,13 +1473,12 @@ static int ptlrpc_invalidate_import_thread(void *data)
 	       imp->imp_obd->obd_name, obd2cli_tgt(imp->imp_obd),
 	       imp->imp_connection->c_remote_uuid.uuid);
 
-	ptlrpc_invalidate_import(imp);
-
-	if (obd_dump_on_eviction) {
+	if (do_dump_on_eviction(imp->imp_obd)) {
 		CERROR("dump the log upon eviction\n");
 		libcfs_debug_dumplog();
 	}
 
+	ptlrpc_invalidate_import(imp);
 	import_set_state(imp, LUSTRE_IMP_RECOVER);
 	ptlrpc_import_recovery_state_machine(imp);
 
-- 
1.8.3.1
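
The decision that do_dump_on_eviction() makes in PATCH 03 can be sketched in
user space as below. This is a hedged illustration rather than the kernel
helper: SKETCH_MGC_NAME is an assumed stand-in for LUSTRE_MGC_NAME (its value
here is an assumption), and the function name is hypothetical. The point it
shows is that strncmp() returns non-zero when the device type name does not
start with the MGC name, so the LBUG fires only for non-MGC devices.

```c
#include <stdbool.h>
#include <string.h>

/* Assumed value of LUSTRE_MGC_NAME; hypothetical for this sketch. */
#define SKETCH_MGC_NAME "mgc"

/*
 * Return true when an eviction should trigger the LBUG path:
 * the tunable is set and the evicted device is not an MGC.
 */
static bool sketch_should_lbug(bool lbug_on_eviction, const char *type_name)
{
	return lbug_on_eviction &&
	       strncmp(type_name, SKETCH_MGC_NAME,
		       strlen(SKETCH_MGC_NAME)) != 0;
}
```

Note that the patch also moves the check ahead of ptlrpc_invalidate_import(),
so the optional LBUG or log dump happens before the import is invalidated.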


* [lustre-devel] [PATCH 04/15] lnet: Add health ping stats
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
                   ` (2 preceding siblings ...)
  2021-07-07 19:11 ` [lustre-devel] [PATCH 03/15] lustre: client: don't panic for mgs evictions James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 05/15] lnet: Ensure ref taken when queueing for discovery James Simmons
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Chris Horn, Lustre Development List

From: Chris Horn <chris.horn@hpe.com>

Add the NI and peer NI ping count and next ping timestamp to the
detailed output of the lnetctl peer and net commands.

HPE-bug-id: LUS-9109
WC-bug-id: https://jira.whamcloud.com/browse/LU-13569
Lustre-commit: 4c7e4aa576296603 ("LU-13569 lnet: Add health ping stats")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Reviewed-on: https://review.whamcloud.com/40314
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lnet/lnet-dlc.h | 4 ++++
 net/lnet/lnet/api-ni.c             | 2 ++
 net/lnet/lnet/peer.c               | 7 +++++--
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/lnet/lnet-dlc.h b/include/uapi/linux/lnet/lnet-dlc.h
index b375d0a..c1c063f 100644
--- a/include/uapi/linux/lnet/lnet-dlc.h
+++ b/include/uapi/linux/lnet/lnet-dlc.h
@@ -191,6 +191,8 @@ struct lnet_ioctl_local_ni_hstats {
 	__u32 hlni_local_timeout;
 	__u32 hlni_local_error;
 	__s32 hlni_health_value;
+	__u32 hlni_ping_count;
+	__u64 hlni_next_ping;
 };
 
 struct lnet_ioctl_peer_ni_hstats {
@@ -199,6 +201,8 @@ struct lnet_ioctl_peer_ni_hstats {
 	__u32 hlpni_remote_error;
 	__u32 hlpni_network_timeout;
 	__s32 hlpni_health_value;
+	__u32 hlpni_ping_count;
+	__u64 hlpni_next_ping;
 };
 
 struct lnet_ioctl_element_msg_stats {
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index d6a8c1b..e52bb41 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -3634,6 +3634,8 @@ u32 lnet_get_dlc_seq_locked(void)
 		atomic_read(&ni->ni_hstats.hlt_local_error);
 	stats->hlni_health_value =
 		atomic_read(&ni->ni_healthv);
+	stats->hlni_ping_count = ni->ni_ping_count;
+	stats->hlni_next_ping = ni->ni_next_ping;
 
 unlock:
 	lnet_net_unlock(cpt);
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 2fc784d..76b2d2f 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -3986,6 +3986,8 @@ int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
 			atomic_read(&lpni->lpni_hstats.hlt_remote_error);
 		lpni_hstats->hlpni_health_value =
 			atomic_read(&lpni->lpni_healthv);
+		lpni_hstats->hlpni_ping_count = lpni->lpni_ping_count;
+		lpni_hstats->hlpni_next_ping = lpni->lpni_next_ping;
 		if (copy_to_user(bulk, lpni_hstats, sizeof(*lpni_hstats)))
 			goto out_free_hstats;
 		bulk += sizeof(*lpni_hstats);
@@ -4081,7 +4083,7 @@ int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
 			lnet_net_unlock(LNET_LOCK_EX);
 			return;
 		}
-		atomic_set(&lpni->lpni_healthv, value);
+		lnet_set_lpni_healthv_locked(lpni, value);
 		lnet_peer_ni_add_to_recoveryq_locked(lpni,
 						     &the_lnet.ln_mt_peerNIRecovq, now);
 		lnet_peer_ni_decref_locked(lpni);
@@ -4102,7 +4104,8 @@ int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk)
 					    lpn_peer_nets) {
 				list_for_each_entry(lpni, &lpn->lpn_peer_nis,
 						    lpni_peer_nis) {
-					atomic_set(&lpni->lpni_healthv, value);
+					lnet_set_lpni_healthv_locked(lpni,
+								     value);
 					lnet_peer_ni_add_to_recoveryq_locked(lpni,
 									     &the_lnet.ln_mt_peerNIRecovq,
 									     now);
-- 
1.8.3.1


* [lustre-devel] [PATCH 05/15] lnet: Ensure ref taken when queueing for discovery
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
                   ` (3 preceding siblings ...)
  2021-07-07 19:11 ` [lustre-devel] [PATCH 04/15] lnet: Add health ping stats James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 06/15] lnet: Correct distance calculation of local NIDs James Simmons
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Chris Horn, Lustre Development List

From: Chris Horn <chris.horn@hpe.com>

Call lnet_peer_queue_for_discovery() in
lnet_discovery_event_handler() to ensure that we take a ref on
the peer when forcing it onto the discovery queue. This also ensures
that the peer state has LNET_PEER_DISCOVERING.

Add a test to sanity-lnet.sh that can trigger the refcount loss bug
in discovery.

HPE-bug-id: LUS-7651
WC-bug-id: https://jira.whamcloud.com/browse/LU-14627
Lustre-commit: 2ce6957b69370b0c ("LU-14627 lnet: Ensure ref taken when queueing for discovery")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Reviewed-on: https://review.whamcloud.com/43418
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Stephane Thiell <sthiell@stanford.edu>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 76b2d2f..29c3372 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -2783,7 +2783,8 @@ static void lnet_discovery_event_handler(struct lnet_event *event)
 	/* Put peer back at end of request queue, if discovery not already
 	 * done
 	 */
-	if (rc == LNET_REDISCOVER_PEER && !lnet_peer_is_uptodate(lp)) {
+	if (rc == LNET_REDISCOVER_PEER && !lnet_peer_is_uptodate(lp) &&
+	    lnet_peer_queue_for_discovery(lp)) {
 		list_move_tail(&lp->lp_dc_list, &the_lnet.ln_dc_request);
 		wake_up(&the_lnet.ln_dc_waitq);
 	}
-- 
1.8.3.1


* [lustre-devel] [PATCH 06/15] lnet: Correct distance calculation of local NIDs
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
                   ` (4 preceding siblings ...)
  2021-07-07 19:11 ` [lustre-devel] [PATCH 05/15] lnet: Ensure ref taken when queueing for discovery James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 07/15] lnet: socklnd: detect link state to set fatal error on ni James Simmons
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Chris Horn, Lustre Development List

From: Chris Horn <chris.horn@hpe.com>

Multi-rail peers can have multiple local NIDs on the same net, but
LNetDist() may only identify a NID as local if it is the first one
returned by lnet_get_next_ni_locked().

We need to check all local NIs to find a match for the target NID
in LNetDist().

Add a test to check the LNetDist() calculation of local NIDs for a peer
with multiple NIDs on the same net.

HPE-bug-id: LUS-9964
WC-bug-id: https://jira.whamcloud.com/browse/LU-14649
Lustre-commit: 4d0162037415988b ("LU-14649 lnet: Correct distance calculation of local NIDs")
Signed-off-by: Chris Horn <hornc@cray.com>
Reviewed-on: https://review.whamcloud.com/43498
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-move.c | 40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 3ae0209..33d7e78 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -4981,6 +4981,7 @@ struct lnet_msg *
 	int cpt;
 	u32 order = 2;
 	struct list_head *rn_list;
+	bool matched_dstnet = false;
 
 	/*
 	 * if !local_nid_dist_zero, I don't return a distance of 0 ever
@@ -5007,27 +5008,40 @@ struct lnet_msg *
 			return local_nid_dist_zero ? 0 : 1;
 		}
 
-		if (LNET_NIDNET(ni->ni_nid) == dstnet) {
-			/*
-			 * Check if ni was originally created in
-			 * current net namespace.
-			 * If not, assign order above 0xffff0000,
-			 * to make this ni not a priority.
+		if (!matched_dstnet && LNET_NIDNET(ni->ni_nid) == dstnet) {
+			matched_dstnet = true;
+			/* We matched the destination net, but we may have
+			 * additional local NIs to inspect.
+			 *
+			 * We record the nid and order as appropriate, but
+			 * they may be overwritten if we match local NI above.
 			 */
-			if (current->nsproxy &&
-			    !net_eq(ni->ni_net_ns, current->nsproxy->net_ns))
-				order += 0xffff0000;
 			if (srcnidp)
 				*srcnidp = ni->ni_nid;
-			if (orderp)
-				*orderp = order;
-			lnet_net_unlock(cpt);
-			return 1;
+
+			if (orderp) {
+				/* Check if ni was originally created in
+				 * current net namespace.
+				 * If not, assign order above 0xffff0000,
+				 * to make this ni not a priority.
+				 */
+				if (current->nsproxy &&
+				    !net_eq(ni->ni_net_ns,
+					    current->nsproxy->net_ns))
+					*orderp = order + 0xffff0000;
+				else
+					*orderp = order;
+			}
 		}
 
 		order++;
 	}
 
+	if (matched_dstnet) {
+		lnet_net_unlock(cpt);
+		return 1;
+	}
+
 	rn_list = lnet_net2rnethash(dstnet);
 	list_for_each_entry(rnet, rn_list, lrn_list) {
 		if (rnet->lrn_net == dstnet) {
-- 
1.8.3.1
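
The scan order that PATCH 06 introduces in LNetDist() can be sketched in user
space as follows. This is a simplified illustration under assumptions: NIDs
are modelled as a (net, addr) pair, local_nid_dist_zero is taken to be true,
the order and namespace handling is omitted, and a return of -1 merely stands
in for "fall through to the remote-net lookup". The struct and function names
are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a local NI: network number plus address. */
struct sketch_ni {
	uint32_t net;
	uint32_t addr;
};

/*
 * Return 0 if the target is one of our own NIDs, 1 if it is on a net we
 * have a local NI for, and -1 if no local NI matches. *src_idx receives
 * the index of the NI chosen as the source.
 */
static int sketch_local_dist(const struct sketch_ni *nis, size_t n,
			     uint32_t dst_net, uint32_t dst_addr,
			     size_t *src_idx)
{
	bool matched_dstnet = false;
	size_t i;

	for (i = 0; i < n; i++) {
		/* An exact match wins immediately, wherever it is in the list. */
		if (nis[i].net == dst_net && nis[i].addr == dst_addr) {
			*src_idx = i;
			return 0;
		}
		/* Record the first NI on the target net, but keep scanning. */
		if (!matched_dstnet && nis[i].net == dst_net) {
			matched_dstnet = true;
			*src_idx = i;
		}
	}
	return matched_dstnet ? 1 : -1;
}
```

Before the patch the loop returned as soon as it saw any NI on the target
net, so a second local NID on that net was never compared against the target.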


* [lustre-devel] [PATCH 07/15] lnet: socklnd: detect link state to set fatal error on ni
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
                   ` (5 preceding siblings ...)
  2021-07-07 19:11 ` [lustre-devel] [PATCH 06/15] lnet: Correct distance calculation of local NIDs James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 08/15] lustre: mdt: New connect flag for non-open-by-fid lock request James Simmons
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Serguei Smirnov, Lustre Development List

From: Serguei Smirnov <ssmirnov@whamcloud.com>

To help avoid selecting an LNet NI that corresponds to a downed
Ethernet link for sending, add a mechanism for detecting link
events in socklnd. On link up/down events, find the corresponding
NI and toggle its ni_fatal_error_on flag, similar to what o2iblnd does.

WC-bug-id: https://jira.whamcloud.com/browse/LU-14742
Lustre-commit: fc2df80e96dc5db9f ("LU-14742 socklnd: detect link state to set fatal error on ni")
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/43952
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/socklnd/socklnd.c | 78 ++++++++++++++++++++++++++++++++++++++++
 net/lnet/klnds/socklnd/socklnd.h |  1 +
 2 files changed, 79 insertions(+)

diff --git a/net/lnet/klnds/socklnd/socklnd.c b/net/lnet/klnds/socklnd/socklnd.c
index eb8c736..e15f1c0 100644
--- a/net/lnet/klnds/socklnd/socklnd.c
+++ b/net/lnet/klnds/socklnd/socklnd.c
@@ -1843,6 +1843,78 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 	}
 }
 
+static int ksocknal_get_link_status(struct net_device *dev)
+{
+	int ret = -1;
+
+	LASSERT(dev);
+
+	if (!netif_running(dev))
+		ret = 0;
+	/* Some devices may not be providing link settings */
+	else if (dev->ethtool_ops->get_link)
+		ret = dev->ethtool_ops->get_link(dev);
+
+	return ret;
+}
+
+static int
+ksocknal_handle_link_state_change(struct net_device *dev,
+				  unsigned char operstate)
+{
+	struct lnet_ni *ni;
+	struct ksock_net *net;
+	struct ksock_net *cnxt;
+	int ifindex;
+	unsigned char link_down = !(operstate == IF_OPER_UP);
+
+	ifindex = dev->ifindex;
+
+	if (!ksocknal_data.ksnd_nnets)
+		goto out;
+
+	list_for_each_entry_safe(net, cnxt, &ksocknal_data.ksnd_nets,
+				 ksnn_list) {
+		if (net->ksnn_interface.ksni_index != ifindex)
+			continue;
+		ni = net->ksnn_ni;
+		if (link_down)
+			atomic_set(&ni->ni_fatal_error_on, link_down);
+		else
+			atomic_set(&ni->ni_fatal_error_on,
+				   (ksocknal_get_link_status(dev) == 0));
+	}
+out:
+	return 0;
+}
+
+
+/************************************
+ * Net device notifier event handler
+ ************************************/
+static int ksocknal_device_event(struct notifier_block *unused,
+				 unsigned long event, void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	unsigned char operstate;
+
+	operstate = dev->operstate;
+
+	switch (event) {
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+	case NETDEV_CHANGE:
+		ksocknal_handle_link_state_change(dev, operstate);
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block ksocknal_notifier_block = {
+	.notifier_call = ksocknal_device_event,
+};
+
 static void
 ksocknal_base_shutdown(void)
 {
@@ -1852,6 +1924,9 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 
 	LASSERT(!ksocknal_data.ksnd_nnets);
 
+	if (ksocknal_data.ksnd_init == SOCKNAL_INIT_ALL)
+		unregister_netdevice_notifier(&ksocknal_notifier_block);
+
 	switch (ksocknal_data.ksnd_init) {
 	default:
 		LASSERT(0);
@@ -2015,6 +2090,8 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 		goto failed;
 	}
 
+	register_netdevice_notifier(&ksocknal_notifier_block);
+
 	/* flag everything initialised */
 	ksocknal_data.ksnd_init = SOCKNAL_INIT_ALL;
 
@@ -2297,6 +2374,7 @@ static int ksocknal_push(struct lnet_ni *ni, struct lnet_process_id id)
 	ni->ni_nid = LNET_MKNID(LNET_NIDNET(ni->ni_nid),
 				ntohl(((struct sockaddr_in *)&ksi->ksni_addr)->sin_addr.s_addr));
 	list_add(&net->ksnn_list, &ksocknal_data.ksnd_nets);
+	net->ksnn_ni = ni;
 	ksocknal_data.ksnd_nnets++;
 
 	return 0;
diff --git a/net/lnet/klnds/socklnd/socklnd.h b/net/lnet/klnds/socklnd/socklnd.h
index dac8559..357769a 100644
--- a/net/lnet/klnds/socklnd/socklnd.h
+++ b/net/lnet/klnds/socklnd/socklnd.h
@@ -175,6 +175,7 @@ struct ksock_net {
 	struct list_head	ksnn_list;		/* chain on global list */
 	atomic_t		ksnn_npeers;		/* # peers */
 	struct ksock_interface	ksnn_interface;		/* IP interface */
+	struct lnet_ni		*ksnn_ni;
 };
 
 /* When the ksock_net is shut down, this bias is added to
-- 
1.8.3.1

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org


* [lustre-devel] [PATCH 08/15] lustre: mdt: New connect flag for non-open-by-fid lock request
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
                   ` (6 preceding siblings ...)
  2021-07-07 19:11 ` [lustre-devel] [PATCH 07/15] lnet: socklnd: detect link state to set fatal error on ni James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 09/15] lustre: obdclass: Wake up entire queue of requests on close completion James Simmons
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Oleg Drokin <green@whamcloud.com>

We removed the 2.1 check for open-by-fid when an open lock is
requested, but clients talking to old servers that don't have
that patch get an open error, so introduce a compat connect
flag.
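
As a rough illustration of the gating described above, here is a
minimal userspace sketch, not the kernel code itself. The flag value
is copied from the lustre_idl.h hunk in this patch; SKETCH_OPEN_LOCK
and sketch_open_flags() are hypothetical names standing in for
MDS_OPEN_LOCK and the check in ll_atomic_open().

```c
#include <stdint.h>

/* Value taken from the lustre_idl.h hunk in this patch. */
#define OBD_CONNECT2_ATOMIC_OPEN_LOCK 0x4000000ULL

/* Hypothetical stand-in for MDS_OPEN_LOCK; the real value is not
 * shown in this patch.
 */
#define SKETCH_OPEN_LOCK 0x1ULL

/*
 * Hypothetical helper: request the open lock only when it is wanted
 * on every open (threshold count of 1) AND the server advertised the
 * new connect flag.  An old server never sets the flag, so it never
 * sees the request that it would reject.
 */
static uint64_t sketch_open_flags(uint64_t it_flags,
				  unsigned int thrsh_count,
				  uint64_t negotiated_flags2)
{
	if (thrsh_count == 1 &&
	    (negotiated_flags2 & OBD_CONNECT2_ATOMIC_OPEN_LOCK))
		it_flags |= SKETCH_OPEN_LOCK;

	return it_flags;
}
```

The point of the design is that the client's behaviour depends on
what the server negotiated at connect time, not on a version check.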

Fixes: c9e0538f2b ("lustre: llite: Introduce inode open heat counter")
WC-bug-id: https://jira.whamcloud.com/browse/LU-10948
Lustre-commit: 72c9a6e5fb6e11fca ("LU-10948 mdt: New connect flag for non-open-by-fid lock request")
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/43907
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_lib.c            | 3 ++-
 fs/lustre/llite/namei.c                | 4 +++-
 fs/lustre/obdclass/lprocfs_status.c    | 6 ++++++
 fs/lustre/ptlrpc/wiretest.c            | 2 ++
 include/uapi/linux/lustre/lustre_idl.h | 1 +
 5 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 646bff8..b131edd 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -316,7 +316,8 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 				   OBD_CONNECT2_CRUSH | OBD_CONNECT2_LSEEK |
 				   OBD_CONNECT2_GETATTR_PFID |
 				   OBD_CONNECT2_DOM_LVB |
-				   OBD_CONNECT2_REP_MBITS;
+				   OBD_CONNECT2_REP_MBITS |
+				   OBD_CONNECT2_ATOMIC_OPEN_LOCK;
 
 	if (sbi->ll_flags & LL_SBI_LRU_RESIZE)
 		data->ocd_connect_flags |= OBD_CONNECT_LRU_RESIZE;
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index f42e872..f32aa14 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -1145,7 +1145,9 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 	 * we only need to request open lock if it was requested
 	 * for every open
 	 */
-	if (ll_i2sbi(dir)->ll_oc_thrsh_count == 1)
+	if (ll_i2sbi(dir)->ll_oc_thrsh_count == 1 &&
+	    exp_connect_flags2(ll_i2mdexp(dir)) &
+	    OBD_CONNECT2_ATOMIC_OPEN_LOCK)
 		it->it_flags |= MDS_OPEN_LOCK;
 
 	/* Dentry added to dcache tree in ll_lookup_it */
diff --git a/fs/lustre/obdclass/lprocfs_status.c b/fs/lustre/obdclass/lprocfs_status.c
index 0cad91d..db809f3 100644
--- a/fs/lustre/obdclass/lprocfs_status.c
+++ b/fs/lustre/obdclass/lprocfs_status.c
@@ -131,6 +131,12 @@
 	"lseek",		/* 0x40000 */
 	"dom_lvb",		/* 0x80000 */
 	"reply_mbits",		/* 0x100000 */
+	"mode_convert",		/* 0x200000 */
+	"batch_rpc",		/* 0x400000 */
+	"pcc_ro",		/* 0x800000 */
+	"mne_nid_type",		/* 0x1000000 */
+	"lock_contend",		/* 0x2000000 */
+	"atomic_open_lock",	/* 0x4000000 */
 	NULL
 };
 
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index db97748..9e0eaa7 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1252,6 +1252,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_DOM_LVB);
 	LASSERTF(OBD_CONNECT2_REP_MBITS == 0x100000ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_REP_MBITS);
+	LASSERTF(OBD_CONNECT2_ATOMIC_OPEN_LOCK == 0x4000000ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_ATOMIC_OPEN_LOCK);
 	LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
 		 (unsigned int)OBD_CKSUM_CRC32);
 	LASSERTF(OBD_CKSUM_ADLER == 0x00000002UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 813e4fc..68bb807 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -840,6 +840,7 @@ struct ptlrpc_body_v2 {
 #define OBD_CONNECT2_LSEEK	      0x40000ULL /* SEEK_HOLE/DATA RPC */
 #define OBD_CONNECT2_DOM_LVB	      0x80000ULL /* pack DOM glimpse data in LVB */
 #define OBD_CONNECT2_REP_MBITS	     0x100000ULL /* match reply by mbits, not xid */
+#define OBD_CONNECT2_ATOMIC_OPEN_LOCK 0x4000000ULL/* request lock on 1st open */
 /* XXX README XXX:
  * Please DO NOT add flag values here before first ensuring that this same
  * flag value is not in use on some other branch.  Please clear any such
-- 
1.8.3.1


* [lustre-devel] [PATCH 09/15] lustre: obdclass: Wake up entire queue of requests on close completion
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
                   ` (7 preceding siblings ...)
  2021-07-07 19:11 ` [lustre-devel] [PATCH 08/15] lustre: mdt: New connect flag for non-open-by-fid lock request James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 10/15] lnet: add netlink infrastructure James Simmons
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Oleg Drokin <green@whamcloud.com>

Since close requests can be stuck behind normal requests, and close
requests get more slots, we need to either wake up the entire
accumulated queue waiting for the next modrpc slot or have an
additional waitqueue just for close requests.

This patch goes with the former approach.
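
To make the problem concrete, here is a small userspace model, offered
as a sketch under stated assumptions rather than a description of the
kernel wait queue code. It assumes waiters sleep in FIFO order, that a
plain wake-up reaches only the head of the queue, and that the freed
slot may be one only a close request can use. All names
(sketch_req, sketch_slot_taker) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

enum sketch_req { SKETCH_NORMAL, SKETCH_CLOSE };

/*
 * Return the index of the waiter that takes the freed slot, or -1 if
 * the slot stays idle.  A woken waiter that may not use the slot is
 * modelled as going straight back to sleep.
 */
static int sketch_slot_taker(const enum sketch_req *queue, size_t n,
			     bool wake_all, bool close_only_slot)
{
	size_t woken = wake_all ? n : (n > 0 ? 1 : 0);
	size_t i;

	for (i = 0; i < woken; i++) {
		if (!close_only_slot || queue[i] == SKETCH_CLOSE)
			return (int)i;
	}
	return -1;
}
```

With a queue of two normal requests followed by one close request,
waking only the head leaves a close-only slot idle, while waking the
whole queue lets the close request at index 2 take it.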

Fixes: 7cb15d0448 ("staging: lustre: mdc: manage number of modify RPCs in flight")
WC-bug-id: https://jira.whamcloud.com/browse/LU-14741
Lustre-commit: a4e1567d67559b797 ("LU-14741 obdclass: Wake up entire queue of requests on close completion")
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/43941
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/genops.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/obdclass/genops.c b/fs/lustre/obdclass/genops.c
index bbb63b2..4e89e0a 100644
--- a/fs/lustre/obdclass/genops.c
+++ b/fs/lustre/obdclass/genops.c
@@ -1587,6 +1587,10 @@ void obd_put_mod_rpc_slot(struct client_obd *cli, u32 opc, u16 tag)
 	LASSERT(tag - 1 < OBD_MAX_RIF_MAX);
 	LASSERT(test_and_clear_bit(tag - 1, cli->cl_mod_tag_bitmap) != 0);
 	spin_unlock(&cli->cl_mod_rpcs_lock);
-	wake_up(&cli->cl_mod_rpcs_waitq);
+	/* LU-14741 - to prevent close RPCs stuck behind normal ones */
+	if (close_req)
+		wake_up_all(&cli->cl_mod_rpcs_waitq);
+	else
+		wake_up(&cli->cl_mod_rpcs_waitq);
 }
 EXPORT_SYMBOL(obd_put_mod_rpc_slot);
-- 
1.8.3.1


* [lustre-devel] [PATCH 10/15] lnet: add netlink infrastructure
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
                   ` (8 preceding siblings ...)
  2021-07-07 19:11 ` [lustre-devel] [PATCH 09/15] lustre: obdclass: Wake up entire queue of requests on close completion James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 11/15] lustre: llite: parallelize direct i/o issuance James Simmons
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

Netlink was designed as a successor to ioctl, as described in
RFC 3549. There are several advantages to using netlink over
ioctls or virtual file system interfaces like proc. Collecting
data from proc doesn't scale well, which was seen with power drain
on Android phones. A netlink implementation was developed to remove
this performance hit. Details can be read at:

https://lwn.net/Articles/406975

Besides the scaling gains, the other benefit is flexibility
with API changes. Adding or removing information to be transmitted
doesn't require creating a new interface as ioctls do. Instead
you add new code to handle the stream of attributes read from the
socket. Lastly, you can multiplex data to N listeners with groups
using one request.
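
The flexibility point can be sketched in userspace as follows. This is
an illustration only and does not use the kernel nla_* helpers. It
follows the struct nlattr layout from <linux/netlink.h> (a 16-bit
length that includes the 4-byte header, a 16-bit type, and payload
padded to 4 bytes); the helper names sketch_put_attr() and
sketch_find_attr() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_HDRLEN ((size_t)4)	/* u16 length + u16 type */

/* Netlink pads every attribute to a 4-byte boundary. */
static size_t sketch_align(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/* Append one attribute; return the new offset, or 0 if it won't fit. */
static size_t sketch_put_attr(uint8_t *buf, size_t cap, size_t off,
			      uint16_t type, const void *data,
			      uint16_t dlen)
{
	uint16_t len;
	size_t end;

	if (dlen > UINT16_MAX - SKETCH_HDRLEN)
		return 0;
	len = (uint16_t)(SKETCH_HDRLEN + dlen);
	end = off + sketch_align(len);
	if (end > cap)
		return 0;

	memset(buf + off, 0, end - off);	/* zero the padding */
	memcpy(buf + off, &len, sizeof(len));
	memcpy(buf + off + 2, &type, sizeof(type));
	if (dlen)
		memcpy(buf + off + SKETCH_HDRLEN, data, dlen);
	return end;
}

/*
 * Find the first attribute of the given type.  Attributes of any
 * other type, including ones this reader has never heard of, are
 * simply stepped over.  Returns the payload or NULL.
 */
static const uint8_t *sketch_find_attr(const uint8_t *buf, size_t size,
				       uint16_t type, uint16_t *dlen)
{
	size_t off = 0;

	while (off + SKETCH_HDRLEN <= size) {
		uint16_t len, t;

		memcpy(&len, buf + off, sizeof(len));
		memcpy(&t, buf + off + 2, sizeof(t));
		if (len < SKETCH_HDRLEN || off + len > size)
			return NULL;	/* malformed stream */
		if (t == type) {
			*dlen = (uint16_t)(len - SKETCH_HDRLEN);
			return buf + off + SKETCH_HDRLEN;
		}
		off += sketch_align(len);
	}
	return NULL;
}
```

Because each attribute carries its own type and length, a sender can
add a new attribute without breaking an older reader, which is the
property that an ioctl struct lacks.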

This patch adds netlink handling in a generic way that can be
used by the libyaml library. This greatly lowers the barrier by
only requiring the implementor to understand the libyaml API.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9680
Lustre-commit: 3c39dac19aaf7f3f ("LU-9680 utils: add netlink infrastructure")
Signed-off-by: James Simmons <jsimmons@infradead.org>
Reviewed-on: https://review.whamcloud.com/34230
Reviewed-by: Petros Koutoupis <petros.koutoupis@hpe.com>
Reviewed-by: Ben Evans <beevans@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
---
 include/linux/lnet/lib-types.h    | 15 ++++++++
 include/uapi/linux/lnet/lnet-nl.h | 67 ++++++++++++++++++++++++++++++++
 net/lnet/lnet/api-ni.c            | 81 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 163 insertions(+)
 create mode 100644 include/uapi/linux/lnet/lnet-nl.h

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index cb0a950..64d7472 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -43,7 +43,9 @@
 #include <linux/types.h>
 #include <linux/completion.h>
 #include <linux/kref.h>
+#include <net/genetlink.h>
 
+#include <uapi/linux/lnet/lnet-nl.h>
 #include <uapi/linux/lnet/lnet-types.h>
 #include <uapi/linux/lnet/lnetctl.h>
 #include <uapi/linux/lnet/lnet-dlc.h>
@@ -1280,4 +1282,17 @@ struct lnet {
 	struct list_head		ln_udsp_list;
 };
 
+static const struct nla_policy scalar_attr_policy[LN_SCALAR_CNT + 1] = {
+	[LN_SCALAR_ATTR_LIST]		= { .type = NLA_NESTED },
+	[LN_SCALAR_ATTR_LIST_SIZE]	= { .type = NLA_U16 },
+	[LN_SCALAR_ATTR_INDEX]		= { .type = NLA_U16 },
+	[LN_SCALAR_ATTR_NLA_TYPE]	= { .type = NLA_U16 },
+	[LN_SCALAR_ATTR_VALUE]		= { .type = NLA_STRING },
+	[LN_SCALAR_ATTR_KEY_FORMAT]	= { .type = NLA_U16 },
+};
+
+int lnet_genl_send_scalar_list(struct sk_buff *msg, u32 portid, u32 seq,
+			       const struct genl_family *family, int flags,
+			       u8 cmd, const struct ln_key_list *data[]);
+
 #endif
diff --git a/include/uapi/linux/lnet/lnet-nl.h b/include/uapi/linux/lnet/lnet-nl.h
new file mode 100644
index 0000000..f5bb67c
--- /dev/null
+++ b/include/uapi/linux/lnet/lnet-nl.h
@@ -0,0 +1,67 @@
+/* SPDX-License-Identifier: LGPL-2.0+ WITH Linux-syscall-note */
+/*
+ * LGPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library.
+ *
+ * LGPL HEADER END
+ *
+ */
+/* Copyright (c) 2021,  UT-Battelle, LLC
+ *
+ * Author: James Simmons <jsimmons@infradead.org>
+ */
+
+#ifndef __UAPI_LNET_NL_H__
+#define __UAPI_LNET_NL_H__
+
+#include <linux/types.h>
+
+enum lnet_nl_key_format {
+	/* Is it FLOW or BLOCK */
+	LNKF_FLOW		= 1,
+	/* Is it SEQUENCE or MAPPING */
+	LNKF_MAPPING		= 2,
+	LNKF_SEQUENCE		= 4,
+};
+
+enum lnet_nl_scalar_attrs {
+	LN_SCALAR_ATTR_UNSPEC = 0,
+	LN_SCALAR_ATTR_LIST,
+
+	LN_SCALAR_ATTR_LIST_SIZE,
+	LN_SCALAR_ATTR_INDEX,
+	LN_SCALAR_ATTR_NLA_TYPE,
+	LN_SCALAR_ATTR_VALUE,
+	LN_SCALAR_ATTR_KEY_FORMAT,
+
+	__LN_SCALAR_ATTR_LAST,
+};
+
+#define LN_SCALAR_CNT (__LN_SCALAR_ATTR_LAST - 1)
+
+struct ln_key_props {
+	char			*lkp_values;
+	__u16			lkp_key_format;
+	__u16			lkp_data_type;
+};
+
+struct ln_key_list {
+	__u16			lkl_maxattr;
+	struct ln_key_props	lkl_list[];
+};
+
+#endif /* __UAPI_LNET_NL_H__ */
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index e52bb41..687df3b 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -2572,6 +2572,87 @@ static void lnet_push_target_fini(void)
 	return rc;
 }
 
+static int lnet_genl_parse_list(struct sk_buff *msg,
+				const struct ln_key_list *data[], u16 idx)
+{
+	const struct ln_key_list *list = data[idx];
+	const struct ln_key_props *props;
+	struct nlattr *node;
+	u16 count;
+
+	if (!list)
+		return 0;
+
+	if (!list->lkl_maxattr)
+		return -ERANGE;
+
+	props = list->lkl_list;
+	if (!props)
+		return -EINVAL;
+
+	node = nla_nest_start(msg, LN_SCALAR_ATTR_LIST);
+	if (!node)
+		return -ENOBUFS;
+
+	for (count = 1; count <= list->lkl_maxattr; count++) {
+		struct nlattr *key = nla_nest_start(msg, count);
+
+		if (count == 1)
+			nla_put_u16(msg, LN_SCALAR_ATTR_LIST_SIZE,
+				    list->lkl_maxattr);
+
+		nla_put_u16(msg, LN_SCALAR_ATTR_INDEX, count);
+		if (props[count].lkp_values)
+			nla_put_string(msg, LN_SCALAR_ATTR_VALUE,
+				       props[count].lkp_values);
+		if (props[count].lkp_key_format)
+			nla_put_u16(msg, LN_SCALAR_ATTR_KEY_FORMAT,
+				    props[count].lkp_key_format);
+		nla_put_u16(msg, LN_SCALAR_ATTR_NLA_TYPE,
+			    props[count].lkp_data_type);
+		if (props[count].lkp_data_type == NLA_NESTED) {
+			int rc;
+
+			rc = lnet_genl_parse_list(msg, data, ++idx);
+			if (rc < 0)
+				return rc;
+		}
+
+		nla_nest_end(msg, key);
+	}
+
+	nla_nest_end(msg, node);
+	return 0;
+}
+
+int lnet_genl_send_scalar_list(struct sk_buff *msg, u32 portid, u32 seq,
+			       const struct genl_family *family, int flags,
+			       u8 cmd, const struct ln_key_list *data[])
+{
+	int rc = 0;
+	void *hdr;
+
+	if (!data[0])
+		return -EINVAL;
+
+	hdr = genlmsg_put(msg, portid, seq, family, flags, cmd);
+	if (!hdr) {
+		rc = -EMSGSIZE;
+		goto canceled;
+	}
+
+	rc = lnet_genl_parse_list(msg, data, 0);
+	if (rc < 0)
+		goto canceled;
+
+	genlmsg_end(msg, hdr);
+canceled:
+	if (rc < 0)
+		genlmsg_cancel(msg, hdr);
+	return rc;
+}
+EXPORT_SYMBOL(lnet_genl_send_scalar_list);
+
 /**
  * Initialize LNet library.
  *
-- 
1.8.3.1


* [lustre-devel] [PATCH 11/15] lustre: llite: parallelize direct i/o issuance
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
                   ` (9 preceding siblings ...)
  2021-07-07 19:11 ` [lustre-devel] [PATCH 10/15] lnet: add netlink infrastructure James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 12/15] lustre: osc: Don't get time for each page James Simmons
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Patrick Farrell, Lustre Development List

From: Patrick Farrell <farr0186@gmail.com>

Currently, the direct i/o code issues an i/o to a given
stripe, and then waits for that i/o to complete.  (This is
for i/os from a single process.)  This forces DIO to send
only one RPC at a time, serially.

In the case of multi-stripe files and larger i/os from
userspace, this means that i/o is serialized - so single
thread/single process direct i/o doesn't see any benefit
from the combination of extra stripes & larger i/os.

Using part of the AIO support, it is possible to move this
waiting up a level, so it happens after all the i/o is
issued.  (See LU-4198 for AIO support.)

This means we can issue many RPCs and then wait,
dramatically improving performance vs waiting for each RPC
serially.

This is referred to as 'parallel dio'.

Notes:
AIO is not supported on pipes, so we fall back to the old
sync behavior if the source or destination is a pipe.

Error handling is similar to buffered writes: We do not
wait for individual chunks, so we can get an error on an RPC
in the middle of an i/o.  The solution is to return an
error in this case, because we cannot know how many bytes
were written contiguously.  This is similar to buffered i/o
combined with fsync().
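
The submit-then-wait pattern and its error rule can be modelled in a
few lines of single-threaded userspace C. This is only a sketch: the
sketch_* names are hypothetical, completions are replayed from an
array instead of arriving from RPC callbacks, and resetting the error
on re-arm is an assumption of the sketch, not something this patch
shows.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sketch_anchor {
	int nr;		/* outstanding references, starts at 1 */
	int rc;		/* first error seen, 0 if none */
};

static void sketch_anchor_init(struct sketch_anchor *a)
{
	a->nr = 1;	/* submitter's own reference */
	a->rc = 0;
}

/* One chunk issued: it holds a reference until it completes. */
static void sketch_submit(struct sketch_anchor *a)
{
	a->nr++;
}

/* One reference dropped; remember the first failure. */
static void sketch_note(struct sketch_anchor *a, int ioret)
{
	if (ioret < 0 && a->rc == 0)
		a->rc = ioret;
	a->nr--;
}

/* Drop the submitter's reference, drain completions, then re-arm. */
static int sketch_wait_recycle(struct sketch_anchor *a,
			       const int *results, size_t n)
{
	size_t i;
	int rc;

	sketch_note(a, 0);
	for (i = 0; i < n; i++)	/* stand-in for sleeping on the RPCs */
		sketch_note(a, results[i]);
	assert(a->nr == 0);

	rc = a->rc;
	a->nr = 1;		/* ready for reuse */
	a->rc = 0;		/* assumption of this sketch */
	return rc;
}

/* Issue every chunk first, wait once, report all bytes or an error. */
static long sketch_parallel_dio(const int *results, size_t n,
				long bytes_per_chunk)
{
	struct sketch_anchor a;
	size_t i;
	int rc;

	sketch_anchor_init(&a);
	for (i = 0; i < n; i++)
		sketch_submit(&a);
	rc = sketch_wait_recycle(&a, results, n);

	return rc < 0 ? rc : (long)n * bytes_per_chunk;
}
```

If the middle chunk of three fails with -EIO, the whole call returns
-EIO rather than a partial byte count, which mirrors the rule stated
above.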

The performance improvement from this is dramatic, and
greater at larger sizes.

lfs setstripe -c 8 -S 4M .
mpirun -np 1  $IOR -w -r -t 64M -b 64G -o ./iorfile --posix.odirect
Without the patch:
write     764.85 MiB/s
read      682.87 MiB/s

With patch:
write     4030 MiB/s
read      4468 MiB/s

WC-bug-id: https://jira.whamcloud.com/browse/LU-13798
Lustre-commit: cba07b68f9386b61 ("LU-13798 llite: parallelize direct i/o issuance")
Signed-off-by: Patrick Farrell <farr0186@gmail.com>
Reviewed-on: https://review.whamcloud.com/39436
Reviewed-by: Wang Shilong <wshilong@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h    | 10 +++++++-
 fs/lustre/include/lustre_osc.h   |  2 +-
 fs/lustre/llite/file.c           | 51 ++++++++++++++++++++++++++++++++++++----
 fs/lustre/llite/llite_internal.h |  9 +++++++
 fs/lustre/llite/llite_lib.c      |  1 +
 fs/lustre/llite/lproc_llite.c    | 37 +++++++++++++++++++++++++++++
 fs/lustre/llite/rw26.c           | 38 +++++++++---------------------
 fs/lustre/llite/vvp_io.c         |  1 +
 fs/lustre/obdclass/cl_io.c       | 29 +++++++++++++++++++++++
 fs/lustre/osc/osc_cache.c        | 12 +++++++++-
 10 files changed, 155 insertions(+), 35 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index 1495949..61a14f4 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -1996,7 +1996,13 @@ struct cl_io {
 	/**
 	 * Sequential read hints.
 	 */
-				ci_seq_read:1;
+				ci_seq_read:1,
+	/**
+	 * Do parallel (async) submission of DIO RPCs.  Note DIO is still sync
+	 * to userspace, only the RPCs are submitted async, then waited for at
+	 * the llite layer before returning.
+	 */
+				ci_parallel_dio:1;
 	/**
 	 * Bypass quota check
 	 */
@@ -2585,6 +2591,8 @@ int cl_sync_io_wait(const struct lu_env *env, struct cl_sync_io *anchor,
 		    long timeout);
 void cl_sync_io_note(const struct lu_env *env, struct cl_sync_io *anchor,
 		     int ioret);
+int cl_sync_io_wait_recycle(const struct lu_env *env, struct cl_sync_io *anchor,
+			    long timeout, int ioret);
 struct cl_dio_aio *cl_aio_alloc(struct kiocb *iocb);
 void cl_aio_free(struct cl_dio_aio *aio);
 
diff --git a/fs/lustre/include/lustre_osc.h b/fs/lustre/include/lustre_osc.h
index 0947677..884ea59 100644
--- a/fs/lustre/include/lustre_osc.h
+++ b/fs/lustre/include/lustre_osc.h
@@ -602,7 +602,7 @@ int osc_teardown_async_page(const struct lu_env *env, struct osc_object *obj,
 			    struct osc_page *ops);
 int osc_flush_async_page(const struct lu_env *env, struct cl_io *io,
 			 struct osc_page *ops);
-int osc_queue_sync_pages(const struct lu_env *env, const struct cl_io *io,
+int osc_queue_sync_pages(const struct lu_env *env, struct cl_io *io,
 			 struct osc_object *obj, struct list_head *list,
 			 int brw_flags);
 int osc_cache_truncate_start(const struct lu_env *env, struct osc_object *obj,
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 2dcf25f..54e343f 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1619,12 +1619,15 @@ static void ll_heat_add(struct inode *inode, enum cl_io_type iot,
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
 	struct vvp_io *vio = vvp_env_io(env);
 	struct range_lock range;
+	bool range_locked = false;
 	struct cl_io *io;
 	ssize_t result = 0;
 	int rc = 0;
+	int rc2 = 0;
 	unsigned int retried = 0;
 	unsigned int dio_lock = 0;
 	bool is_aio = false;
+	bool is_parallel_dio = false;
 	struct cl_dio_aio *ci_aio = NULL;
 	size_t per_bytes;
 	bool partial_io = false;
@@ -1642,6 +1645,17 @@ static void ll_heat_add(struct inode *inode, enum cl_io_type iot,
 	if (file->f_flags & O_DIRECT) {
 		if (!is_sync_kiocb(args->u.normal.via_iocb))
 			is_aio = true;
+
+		/* the kernel does not support AIO on pipes, and parallel DIO
+		 * uses part of the AIO path, so we must not do parallel dio
+		 * to pipes
+		 */
+		is_parallel_dio = !iov_iter_is_pipe(args->u.normal.via_iter) &&
+			       !is_aio;
+
+		if (!ll_sbi_has_parallel_dio(sbi))
+			is_parallel_dio = false;
+
 		ci_aio = cl_aio_alloc(args->u.normal.via_iocb);
 		if (!ci_aio) {
 			rc = -ENOMEM;
@@ -1665,10 +1679,9 @@ static void ll_heat_add(struct inode *inode, enum cl_io_type iot,
 	io->ci_aio = ci_aio;
 	io->ci_dio_lock = dio_lock;
 	io->ci_ndelay_tried = retried;
+	io->ci_parallel_dio = is_parallel_dio;
 
 	if (cl_io_rw_init(env, io, iot, *ppos, per_bytes) == 0) {
-		bool range_locked = false;
-
 		if (file->f_flags & O_APPEND)
 			range_lock_init(&range, 0, LUSTRE_EOF);
 		else
@@ -1697,17 +1710,41 @@ static void ll_heat_add(struct inode *inode, enum cl_io_type iot,
 		ll_cl_add(file, env, io, LCC_RW);
 		rc = cl_io_loop(env, io);
 		ll_cl_remove(file, env);
-		if (range_locked) {
+		if (range_locked && !is_parallel_dio) {
 			CDEBUG(D_VFSTRACE, "Range unlock [%llu, %llu]\n",
 			       range.rl_start,
 			       range.rl_last);
 			range_unlock(&lli->lli_write_tree, &range);
+			range_locked = false;
 		}
 	} else {
 		/* cl_io_rw_init() handled IO */
 		rc = io->ci_result;
 	}
 
+	/* N/B: parallel DIO may be disabled during i/o submission;
+	 * if that occurs, async RPCs are resolved before we get here, and this
+	 * wait call completes immediately.
+	 */
+	if (is_parallel_dio) {
+		struct cl_sync_io *anchor = &io->ci_aio->cda_sync;
+
+		/* for dio, EIOCBQUEUED is an implementation detail,
+		 * and we don't return it to userspace
+		 */
+		if (rc == -EIOCBQUEUED)
+			rc = 0;
+
+		rc2 = cl_sync_io_wait_recycle(env, anchor, 0, 0);
+		if (rc2 < 0)
+			rc = rc2;
+
+		if (range_locked) {
+			range_unlock(&lli->lli_write_tree, &range);
+			range_locked = false;
+		}
+	}
+
 	/*
 	 * In order to move forward AIO, ci_nob was increased,
 	 * but that doesn't mean io have been finished, it just
@@ -1717,8 +1754,12 @@ static void ll_heat_add(struct inode *inode, enum cl_io_type iot,
 	 */
 	if (io->ci_nob > 0) {
 		if (!is_aio) {
-			result += io->ci_nob;
-			*ppos = io->u.ci_wr.wr.crw_pos; /* for splice */
+			if (rc2 == 0) {
+				result += io->ci_nob;
+				*ppos = io->u.ci_wr.wr.crw_pos; /* for splice */
+			} else if (rc2) {
+				result = 0;
+			}
 		}
 		count -= io->ci_nob;
 
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 3674af9..a073d6d 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -631,6 +631,9 @@ enum stats_track_type {
 #define LL_SBI_FOREIGN_SYMLINK	0x20000000 /* foreign fake-symlink support */
 /* foreign fake-symlink upcall registered */
 #define LL_SBI_FOREIGN_SYMLINK_UPCALL	0x40000000
+#define LL_SBI_PARALLEL_DIO     0x80000000 /* parallel (async) submission of
+					    * RPCs for DIO
+					    */
 
 #define LL_SBI_FLAGS {	\
 	"nolck",	\
@@ -664,6 +667,7 @@ enum stats_track_type {
 	"noencrypt",	\
 	"foreign_symlink",	\
 	"foreign_symlink_upcall",	\
+	"parallel_dio",	\
 }
 
 /*
@@ -1001,6 +1005,11 @@ static inline bool ll_sbi_has_foreign_symlink(struct ll_sb_info *sbi)
 	return !!(sbi->ll_flags & LL_SBI_FOREIGN_SYMLINK);
 }
 
+static inline bool ll_sbi_has_parallel_dio(struct ll_sb_info *sbi)
+{
+	return !!(sbi->ll_flags & LL_SBI_PARALLEL_DIO);
+}
+
 void ll_ras_enter(struct file *f, loff_t pos, size_t count);
 
 /* llite/lcommon_misc.c */
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index b131edd..153d34e 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -179,6 +179,7 @@ static struct ll_sb_info *ll_init_sbi(void)
 	sbi->ll_flags |= LL_SBI_AGL_ENABLED;
 	sbi->ll_flags |= LL_SBI_FAST_READ;
 	sbi->ll_flags |= LL_SBI_TINY_WRITE;
+	sbi->ll_flags |= LL_SBI_PARALLEL_DIO;
 	ll_sbi_set_encrypt(sbi, true);
 
 	/* root squash */
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index cd8394c..3b4f60c 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -1100,6 +1100,42 @@ static ssize_t tiny_write_store(struct kobject *kobj,
 }
 LUSTRE_RW_ATTR(tiny_write);
 
+static ssize_t parallel_dio_show(struct kobject *kobj,
+				 struct attribute *attr,
+				 char *buf)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n",
+		       !!(sbi->ll_flags & LL_SBI_PARALLEL_DIO));
+}
+
+static ssize_t parallel_dio_store(struct kobject *kobj,
+				  struct attribute *attr,
+				  const char *buffer,
+				  size_t count)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+	bool val;
+	int rc;
+
+	rc = kstrtobool(buffer, &val);
+	if (rc)
+		return rc;
+
+	spin_lock(&sbi->ll_lock);
+	if (val)
+		sbi->ll_flags |= LL_SBI_PARALLEL_DIO;
+	else
+		sbi->ll_flags &= ~LL_SBI_PARALLEL_DIO;
+	spin_unlock(&sbi->ll_lock);
+
+	return count;
+}
+LUSTRE_RW_ATTR(parallel_dio);
+
 static ssize_t max_read_ahead_async_active_show(struct kobject *kobj,
 					       struct attribute *attr,
 					       char *buf)
@@ -1685,6 +1721,7 @@ struct ldebugfs_vars lprocfs_llite_obd_vars[] = {
 	&lustre_attr_xattr_cache.attr,
 	&lustre_attr_fast_read.attr,
 	&lustre_attr_tiny_write.attr,
+	&lustre_attr_parallel_dio.attr,
 	&lustre_attr_file_heat.attr,
 	&lustre_attr_heat_decay_percentage.attr,
 	&lustre_attr_heat_period_second.attr,
diff --git a/fs/lustre/llite/rw26.c b/fs/lustre/llite/rw26.c
index 2de956d..6a1b5bb 100644
--- a/fs/lustre/llite/rw26.c
+++ b/fs/lustre/llite/rw26.c
@@ -404,39 +404,23 @@ static ssize_t ll_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 out:
 	aio->cda_bytes += tot_bytes;
 
-	if (is_sync_kiocb(iocb)) {
-		struct cl_sync_io *anchor = &aio->cda_sync;
-		ssize_t rc2;
+	if (rw == WRITE)
+		vio->u.readwrite.vui_written += tot_bytes;
+	else
+		vio->u.readwrite.vui_read += tot_bytes;
 
-		/**
-		 * @anchor was inited as 1 to prevent end_io to be
-		 * called before we add all pages for IO, so drop
-		 * one extra reference to make sure we could wait
-		 * count to be zero.
-		 */
-		cl_sync_io_note(env, anchor, result);
+	/* If async dio submission is not allowed, we must wait here. */
+	if (is_sync_kiocb(iocb) && !io->ci_parallel_dio) {
+		ssize_t rc2;
 
-		rc2 = cl_sync_io_wait(env, anchor, 0);
+		rc2 = cl_sync_io_wait_recycle(env, &aio->cda_sync, 0, 0);
 		if (result == 0 && rc2)
 			result = rc2;
 
-		/**
-		 * One extra reference again, as if @anchor is
-		 * reused we assume it as 1 before using.
-		 */
-		atomic_add(1, &anchor->csi_sync_nr);
-		if (result == 0) {
-			/* no commit async for direct IO */
-			vio->u.readwrite.vui_written += tot_bytes;
-			result = tot_bytes;
-		}
-	} else {
-		if (rw == WRITE)
-			vio->u.readwrite.vui_written += tot_bytes;
-		else
-			vio->u.readwrite.vui_read += tot_bytes;
 		if (result == 0)
-			result = -EIOCBQUEUED;
+			result = tot_bytes;
+	} else if (result == 0) {
+		result = -EIOCBQUEUED;
 	}
 
 	return result;
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index 12314fd..0e54f46 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -526,6 +526,7 @@ static void vvp_io_advance(const struct lu_env *env,
 	 * of relying on VFS, we move iov iter by ourselves.
 	 */
 	iov_iter_advance(vio->vui_iter, nob);
+	CDEBUG(D_VFSTRACE, "advancing %ld bytes\n", nob);
 	vio->vui_tot_count -= nob;
 	iov_iter_reexpand(vio->vui_iter, vio->vui_tot_count);
 }
diff --git a/fs/lustre/obdclass/cl_io.c b/fs/lustre/obdclass/cl_io.c
index 6c22137..beda7fc 100644
--- a/fs/lustre/obdclass/cl_io.c
+++ b/fs/lustre/obdclass/cl_io.c
@@ -1202,3 +1202,32 @@ void cl_sync_io_note(const struct lu_env *env, struct cl_sync_io *anchor,
 	}
 }
 EXPORT_SYMBOL(cl_sync_io_note);
+
+
+int cl_sync_io_wait_recycle(const struct lu_env *env, struct cl_sync_io *anchor,
+			    long timeout, int ioret)
+{
+	int rc = 0;
+
+	/*
+	 * @anchor was inited as 1 to prevent end_io to be
+	 * called before we add all pages for IO, so drop
+	 * one extra reference to make sure we could wait
+	 * count to be zero.
+	 */
+	cl_sync_io_note(env, anchor, ioret);
+	/* Wait for completion of normal dio.
+	 * This replaces the EIOCBQUEUED return from the DIO/AIO
+	 * path, and this is where AIO and DIO implementations
+	 * split.
+	 */
+	rc = cl_sync_io_wait(env, anchor, timeout);
+	/**
+	 * One extra reference again, as if @anchor is
+	 * reused we assume it as 1 before using.
+	 */
+	atomic_add(1, &anchor->csi_sync_nr);
+
+	return rc;
+}
+EXPORT_SYMBOL(cl_sync_io_wait_recycle);
diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index 0f0daa1..e37c034 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -2640,7 +2640,7 @@ int osc_flush_async_page(const struct lu_env *env, struct cl_io *io,
 	return rc;
 }
 
-int osc_queue_sync_pages(const struct lu_env *env, const struct cl_io *io,
+int osc_queue_sync_pages(const struct lu_env *env, struct cl_io *io,
 			 struct osc_object *obj, struct list_head *list,
 			 int brw_flags)
 {
@@ -2701,6 +2701,7 @@ int osc_queue_sync_pages(const struct lu_env *env, const struct cl_io *io,
 		grants += (1 << cli->cl_chunkbits) *
 			  ((page_count + ppc - 1) / ppc);
 
+		CDEBUG(D_CACHE, "requesting %d bytes grant\n", grants);
 		spin_lock(&cli->cl_loi_list_lock);
 		if (osc_reserve_grant(cli, grants) == 0) {
 			list_for_each_entry(oap, list, oap_pending_item) {
@@ -2710,6 +2711,15 @@ int osc_queue_sync_pages(const struct lu_env *env, const struct cl_io *io,
 			}
 			osc_unreserve_grant_nolock(cli, grants, 0);
 			ext->oe_grants = grants;
+		} else {
+			/* We cannot report ENOSPC correctly if we do parallel
+			 * DIO (async RPC submission), so turn off parallel dio
+			 * if there is not sufficient grant available.  This
+			 * makes individual RPCs synchronous.
+			 */
+			io->ci_parallel_dio = false;
+			CDEBUG(D_CACHE,
+			"not enough grant available, switching to sync for this i/o\n");
 		}
 		spin_unlock(&cli->cl_loi_list_lock);
 	}
-- 
1.8.3.1

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

* [lustre-devel] [PATCH 12/15] lustre: osc: Don't get time for each page
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
                   ` (10 preceding siblings ...)
  2021-07-07 19:11 ` [lustre-devel] [PATCH 11/15] lustre: llite: parallelize direct i/o issuance James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 13/15] lustre: clio: Implement real list splice James Simmons
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Patrick Farrell, Lustre Development List

From: Patrick Farrell <farr0186@gmail.com>

Taking one timestamp when each batch of pages starts is
sufficiently accurate, and ktime_get() accounts for several
percent of the CPU time when doing AIO + DIO.
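
As a rough userspace sketch of the idea (not the Lustre code itself), the
change amounts to reading the clock once per batch and handing the same
value to every page. All names below (demo_page, demo_now,
demo_submit_per_batch, ...) are hypothetical, and clock_gettime() stands in
for the kernel's ktime_get():

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a page being submitted; not a Lustre type. */
struct demo_page {
	struct timespec submit_time;
};

/* Counts clock reads so the saving is visible; illustration only. */
static unsigned long demo_clock_reads;

/* Userspace substitute for ktime_get(): read the monotonic clock. */
static struct timespec demo_now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	demo_clock_reads++;
	return ts;
}

/* Before: one clock read for every page in the batch. */
static void demo_submit_per_page(struct demo_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pages[i].submit_time = demo_now();
}

/* After: one clock read per batch, shared by every page in it. */
static void demo_submit_per_batch(struct demo_page *pages, size_t n)
{
	struct timespec batch_time = demo_now();

	for (size_t i = 0; i < n; i++)
		pages[i].submit_time = batch_time;
}
```

For a batch of n pages the first variant reads the clock n times and the
second reads it once, at the cost of every page in the batch reporting the
same submit time.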

This relies on previous patches in this series.

Measuring this in milliseconds/gigabyte lets us measure the
improvement in absolute terms, rather than just relative
terms.

This patch reduces i/o time in ms/GiB by:
Write: 17 ms/GiB
Read: 6 ms/GiB

Totals:
Write: 237 ms/GiB
Read: 223 ms/GiB

IOR:
mpirun -np 1  $IOR -w -r -t 64M -b 64G -o ./iorfile --posix.odirect
Without the patch:
write     4030 MiB/s
read      4468 MiB/s

With patch:
write     4326 MiB/s
read      4587 MiB/s
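
The ms/GiB figures follow from the IOR rates by simple arithmetic
(1 GiB = 1024 MiB, 1 s = 1000 ms). A small sketch of the conversion, with
hypothetical helper names, reproduces the totals and deltas quoted above
from the four IOR numbers:

```c
#include <assert.h>

/* Time to move 1 GiB, in milliseconds, at a given rate in MiB/s.
 * 1 GiB = 1024 MiB and 1 s = 1000 ms.
 */
static double ms_per_gib(double mib_per_sec)
{
	return 1024.0 * 1000.0 / mib_per_sec;
}

/* Round a non-negative value to the nearest integer. */
static long round_nearest(double v)
{
	return (long)(v + 0.5);
}
```

For example, round_nearest(ms_per_gib(4326.0)) gives the 237 ms/GiB write
total, and the difference between ms_per_gib(4030.0) and ms_per_gib(4326.0)
rounds to the 17 ms/GiB write improvement.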

WC-bug-id: https://jira.whamcloud.com/browse/LU-13799
Lustre-commit: 485976ab451dd6708 ("LU-13799 osc: Don't get time for each page")
Signed-off-by: Patrick Farrell <farr0186@gmail.com>
Reviewed-on: https://review.whamcloud.com/39437
Reviewed-by: Wang Shilong <wshilong@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_osc.h | 2 +-
 fs/lustre/osc/osc_io.c         | 3 ++-
 fs/lustre/osc/osc_page.c       | 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/lustre/include/lustre_osc.h b/fs/lustre/include/lustre_osc.h
index 884ea59..208bb59 100644
--- a/fs/lustre/include/lustre_osc.h
+++ b/fs/lustre/include/lustre_osc.h
@@ -584,7 +584,7 @@ void osc_index2policy(union ldlm_policy_data *policy,
 		      pgoff_t start, pgoff_t end);
 void osc_lru_add_batch(struct client_obd *cli, struct list_head *list);
 void osc_page_submit(const struct lu_env *env, struct osc_page *opg,
-		     enum cl_req_type crt, int brw_flags);
+		     enum cl_req_type crt, int brw_flags, ktime_t submit_time);
 int lru_queue_work(const struct lu_env *env, void *data);
 long osc_lru_shrink(const struct lu_env *env, struct client_obd *cli,
 		    long target, bool force);
diff --git a/fs/lustre/osc/osc_io.c b/fs/lustre/osc/osc_io.c
index 67fe85b..bd92b5d 100644
--- a/fs/lustre/osc/osc_io.c
+++ b/fs/lustre/osc/osc_io.c
@@ -132,6 +132,7 @@ int osc_io_submit(const struct lu_env *env, const struct cl_io_slice *ios,
 	unsigned int max_pages;
 	unsigned int ppc_bits; /* pages per chunk bits */
 	unsigned int ppc;
+	ktime_t submit_time = ktime_get();
 	bool sync_queue = false;
 
 	LASSERT(qin->pl_nr > 0);
@@ -195,7 +196,7 @@ int osc_io_submit(const struct lu_env *env, const struct cl_io_slice *ios,
 		oap->oap_async_flags |= ASYNC_COUNT_STABLE;
 		spin_unlock(&oap->oap_lock);
 
-		osc_page_submit(env, opg, crt, brw_flags);
+		osc_page_submit(env, opg, crt, brw_flags, submit_time);
 		list_add_tail(&oap->oap_pending_item, &list);
 
 		if (page->cp_sync_io)
diff --git a/fs/lustre/osc/osc_page.c b/fs/lustre/osc/osc_page.c
index 94db9d2..0f088fe 100644
--- a/fs/lustre/osc/osc_page.c
+++ b/fs/lustre/osc/osc_page.c
@@ -295,7 +295,7 @@ int osc_page_init(const struct lu_env *env, struct cl_object *obj,
  * transfer (i.e., transferred synchronously).
  */
 void osc_page_submit(const struct lu_env *env, struct osc_page *opg,
-		     enum cl_req_type crt, int brw_flags)
+		     enum cl_req_type crt, int brw_flags, ktime_t submit_time)
 {
 	struct osc_io *oio = osc_env_io(env);
 	struct osc_async_page *oap = &opg->ops_oap;
@@ -316,7 +316,7 @@ void osc_page_submit(const struct lu_env *env, struct osc_page *opg,
 		oap->oap_cmd |= OBD_BRW_NOQUOTA;
 	}
 
-	opg->ops_submit_time = ktime_get();
+	opg->ops_submit_time = submit_time;
 	osc_page_transfer_get(opg, "transfer\0imm");
 	osc_page_transfer_add(env, opg, crt);
 }
-- 
1.8.3.1

* [lustre-devel] [PATCH 13/15] lustre: clio: Implement real list splice
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
                   ` (11 preceding siblings ...)
  2021-07-07 19:11 ` [lustre-devel] [PATCH 12/15] lustre: osc: Don't get time for each page James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 14/15] lustre: osc: Simplify clipping for transient pages James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 15/15] lustre: mgc: configurable wait-to-reprocess time James Simmons
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Patrick Farrell, Lustre Development List

From: Patrick Farrell <farr0186@gmail.com>

Lustre's list_splice is actually just a slightly
depressing list_for_each; let's use a real list_splice.

This saves significant time in AIO/DIO page submission,
giving a performance boost of several percent.
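
To illustrate the difference, here is a minimal userspace sketch of a
counted list with a constant-time splice. It is not the kernel's list.h or
the cl_page_list code; the demo_* names are hypothetical and the list is a
plain circular doubly linked list written out by hand:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical circular doubly linked list node, modelled on list_head. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

/* Hypothetical counted list, loosely modelled on struct cl_page_list. */
struct demo_page_list {
	struct demo_list_head pages;
	unsigned int nr;
};

static void demo_list_init(struct demo_list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void demo_page_list_init(struct demo_page_list *l)
{
	demo_list_init(&l->pages);
	l->nr = 0;
}

/* Append one item at the tail and bump the count. */
static void demo_page_list_add(struct demo_page_list *l,
			       struct demo_list_head *item)
{
	item->prev = l->pages.prev;
	item->next = &l->pages;
	l->pages.prev->next = item;
	l->pages.prev = item;
	l->nr++;
}

/* O(1) splice: relink src's whole chain onto dst's tail and leave src
 * empty, instead of moving the entries one at a time.
 */
static void demo_page_list_splice(struct demo_page_list *src,
				  struct demo_page_list *dst)
{
	if (src->pages.next != &src->pages) {
		struct demo_list_head *first = src->pages.next;
		struct demo_list_head *last = src->pages.prev;
		struct demo_list_head *tail = dst->pages.prev;

		tail->next = first;
		first->prev = tail;
		last->next = &dst->pages;
		dst->pages.prev = last;
		demo_list_init(&src->pages);
	}
	dst->nr += src->nr;
	src->nr = 0;
}
```

The splice touches four pointers and two counters regardless of how many
pages are on the source list, whereas a per-entry move does work
proportional to the list length.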

This patch reduces i/o time in ms/GiB by:
Write: 16 ms/GiB
Read: 14 ms/GiB

Totals:
Write: 220 ms/GiB
Read: 209 ms/GiB

mpirun -np 1  $IOR -w -r -t 64M -b 64G -o ./iorfile --posix.odirect

With previous patches in series:
write     4326 MiB/s
read      4587 MiB/s

With this patch:
write     4647 MiB/s
read      4888 MiB/s

WC-bug-id: https://jira.whamcloud.com/browse/LU-13799
Lustre-commit: dfe2d225b86d4215 ("LU-13799 clio: Implement real list splice")
Signed-off-by: Patrick Farrell <farr0186@gmail.com>
Reviewed-on: https://review.whamcloud.com/39439
Reviewed-by: Wang Shilong <wshilong@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/cl_io.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/lustre/obdclass/cl_io.c b/fs/lustre/obdclass/cl_io.c
index beda7fc..63ce39c 100644
--- a/fs/lustre/obdclass/cl_io.c
+++ b/fs/lustre/obdclass/cl_io.c
@@ -891,13 +891,11 @@ void cl_page_list_move_head(struct cl_page_list *dst, struct cl_page_list *src,
 /**
  * splice the cl_page_list, just as list head does
  */
-void cl_page_list_splice(struct cl_page_list *list, struct cl_page_list *head)
+void cl_page_list_splice(struct cl_page_list *src, struct cl_page_list *dst)
 {
-	struct cl_page *page;
-	struct cl_page *tmp;
-
-	cl_page_list_for_each_safe(page, tmp, list)
-		cl_page_list_move(head, list, page);
+	dst->pl_nr += src->pl_nr;
+	src->pl_nr = 0;
+	list_splice_tail_init(&src->pl_pages, &dst->pl_pages);
 }
 EXPORT_SYMBOL(cl_page_list_splice);
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 14/15] lustre: osc: Simplify clipping for transient pages
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
                   ` (12 preceding siblings ...)
  2021-07-07 19:11 ` [lustre-devel] [PATCH 13/15] lustre: clio: Implement real list splice James Simmons
@ 2021-07-07 19:11 ` James Simmons
  2021-07-07 19:11 ` [lustre-devel] [PATCH 15/15] lustre: mgc: configurable wait-to-reprocess time James Simmons
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Patrick Farrell, Lustre Development List

From: Patrick Farrell <farr0186@gmail.com>

The combination of page clip and page flag setting for
transient pages takes up several % of the time when
submitting them for async DIO.

But neither is required: transient pages do not change
after creation except in limited cases, and in any case
they are only accessible from the submitting thread, so
there is no possibility of parallel access.

So we can set the page flags, etc, at init time.
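
A simplified sketch of that reasoning, with hypothetical demo_* names and a
counter standing in for the per-page spinlock (this is an illustration, not
the osc code): transient pages get their final flags once at init, so the
submit path only has to take the lock for other page types.

```c
#include <assert.h>

/* Hypothetical page types and flags, loosely mirroring the patch. */
enum demo_page_type { DEMO_PAGE_CACHEABLE, DEMO_PAGE_TRANSIENT };

#define DEMO_ASYNC_READY        0x1u
#define DEMO_ASYNC_URGENT       0x2u
#define DEMO_ASYNC_COUNT_STABLE 0x4u
#define DEMO_ASYNC_ALL (DEMO_ASYNC_READY | DEMO_ASYNC_URGENT | \
			DEMO_ASYNC_COUNT_STABLE)

struct demo_page {
	enum demo_page_type type;
	unsigned int flags;
	/* Counts lock/unlock cycles; stands in for a real spinlock. */
	unsigned int lock_cycles;
};

/* Transient pages are private to the submitting thread, so their flags
 * can be fixed here without any locking.
 */
static void demo_page_init(struct demo_page *pg, enum demo_page_type type)
{
	pg->type = type;
	pg->flags = 0;
	pg->lock_cycles = 0;
	if (type == DEMO_PAGE_TRANSIENT)
		pg->flags = DEMO_ASYNC_ALL;
}

/* Only non-transient pages pay for the lock and the flag update. */
static void demo_page_submit(struct demo_page *pg)
{
	if (pg->type != DEMO_PAGE_TRANSIENT) {
		pg->lock_cycles++;	/* lock ... unlock around the update */
		pg->flags = DEMO_ASYNC_ALL;
	}
}
```

Both page types end up with the same flags at submit time; the transient
page simply never enters the locked section.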

This patch improves i/o time in ms/GiB by:
Write: 17 ms/GiB
Read: 22 ms/GiB

Totals:
Write: 204 ms/GiB
Read: 198 ms/GiB

mpirun -np 1  $IOR -w -r -t 64M -b 64G -o ./iorfile --posix.odirect

With previous patches in series:
write     4647 MiB/s
read      4888 MiB/s

Plus this patch:
write     5030 MiB/s
read      5174 MiB/s

WC-bug-id: https://jira.whamcloud.com/browse/LU-13799
Lustre-commit: b64b9646f17b771c ("LU-13799 osc: Simplify clipping for transient pages")
Signed-off-by: Patrick Farrell <farr0186@gmail.com>
Reviewed-on: https://review.whamcloud.com/39440
Reviewed-by: Wang Shilong <wshilong@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_osc.h |  2 +-
 fs/lustre/llite/rw26.c         |  3 ++-
 fs/lustre/osc/osc_cache.c      | 18 +++++++++++++-----
 fs/lustre/osc/osc_io.c         | 10 ++++++----
 fs/lustre/osc/osc_page.c       |  6 ++++--
 5 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/fs/lustre/include/lustre_osc.h b/fs/lustre/include/lustre_osc.h
index 208bb59..13e9363 100644
--- a/fs/lustre/include/lustre_osc.h
+++ b/fs/lustre/include/lustre_osc.h
@@ -593,7 +593,7 @@ long osc_lru_shrink(const struct lu_env *env, struct client_obd *cli,
 int osc_set_async_flags(struct osc_object *obj, struct osc_page *opg,
 			u32 async_flags);
 int osc_prep_async_page(struct osc_object *osc, struct osc_page *ops,
-			struct page *page, loff_t offset);
+			struct cl_page *page, loff_t offset);
 int osc_queue_async_io(const struct lu_env *env, struct cl_io *io,
 		       struct osc_page *ops, cl_commit_cbt cb);
 int osc_page_cache_add(const struct lu_env *env, struct osc_page *opg,
diff --git a/fs/lustre/llite/rw26.c b/fs/lustre/llite/rw26.c
index 6a1b5bb..ba9c070 100644
--- a/fs/lustre/llite/rw26.c
+++ b/fs/lustre/llite/rw26.c
@@ -269,7 +269,8 @@ struct ll_dio_pages {
 		 * Set page clip to tell transfer formation engine
 		 * that page has to be sent even if it is beyond KMS.
 		 */
-		cl_page_clip(env, page, 0, min(size, page_size));
+		if (size < page_size)
+			cl_page_clip(env, page, 0, size);
 		++io_pages;
 
 		/* drop the reference count for cl_page_find */
diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index e37c034..84c6b68 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -2311,10 +2311,11 @@ int __osc_io_unplug(const struct lu_env *env, struct client_obd *cli,
 EXPORT_SYMBOL(__osc_io_unplug);
 
 int osc_prep_async_page(struct osc_object *osc, struct osc_page *ops,
-			struct page *page, loff_t offset)
+			struct cl_page *page, loff_t offset)
 {
 	struct obd_export *exp = osc_export(osc);
 	struct osc_async_page *oap = &ops->ops_oap;
+	struct page *vmpage = page->cp_vmpage;
 
 	if (!page)
 		return -EIO;
@@ -2323,17 +2324,24 @@ int osc_prep_async_page(struct osc_object *osc, struct osc_page *ops,
 	oap->oap_cli = &exp->exp_obd->u.cli;
 	oap->oap_obj = osc;
 
-	oap->oap_page = page;
+	oap->oap_page = vmpage;
 	oap->oap_obj_off = offset;
 	LASSERT(!(offset & ~PAGE_MASK));
 
+	/* Count of transient (direct i/o) pages is always stable by the time
+	 * they're submitted.  Setting this here lets us avoid calling
+	 * cl_page_clip later to set this.
+	 */
+	if (page->cp_type == CPT_TRANSIENT)
+		oap->oap_async_flags |= ASYNC_COUNT_STABLE|ASYNC_URGENT|
+					ASYNC_READY;
+
 	INIT_LIST_HEAD(&oap->oap_pending_item);
 	INIT_LIST_HEAD(&oap->oap_rpc_item);
 
 	spin_lock_init(&oap->oap_lock);
-	CDEBUG(D_INFO, "oap %p page %p obj off %llu\n",
-	       oap, page, oap->oap_obj_off);
-
+	CDEBUG(D_INFO, "oap %p vmpage %p obj off %llu\n",
+	       oap, vmpage, oap->oap_obj_off);
 	return 0;
 }
 EXPORT_SYMBOL(osc_prep_async_page);
diff --git a/fs/lustre/osc/osc_io.c b/fs/lustre/osc/osc_io.c
index bd92b5d..f69f201 100644
--- a/fs/lustre/osc/osc_io.c
+++ b/fs/lustre/osc/osc_io.c
@@ -191,10 +191,12 @@ int osc_io_submit(const struct lu_env *env, const struct cl_io_slice *ios,
 			continue;
 		}
 
-		spin_lock(&oap->oap_lock);
-		oap->oap_async_flags = ASYNC_URGENT | ASYNC_READY;
-		oap->oap_async_flags |= ASYNC_COUNT_STABLE;
-		spin_unlock(&oap->oap_lock);
+		if (page->cp_type != CPT_TRANSIENT) {
+			spin_lock(&oap->oap_lock);
+			oap->oap_async_flags = ASYNC_URGENT | ASYNC_READY;
+			oap->oap_async_flags |= ASYNC_COUNT_STABLE;
+			spin_unlock(&oap->oap_lock);
+		}
 
 		osc_page_submit(env, opg, crt, brw_flags, submit_time);
 		list_add_tail(&oap->oap_pending_item, &list);
diff --git a/fs/lustre/osc/osc_page.c b/fs/lustre/osc/osc_page.c
index 0f088fe..8aa21ee 100644
--- a/fs/lustre/osc/osc_page.c
+++ b/fs/lustre/osc/osc_page.c
@@ -212,6 +212,9 @@ static void osc_page_clip(const struct lu_env *env,
 	opg->ops_from = from;
 	/* argument @to is exclusive, but @ops_to is inclusive */
 	opg->ops_to = to - 1;
+	/* This isn't really necessary for transient pages, but we also don't
+	 * call clip on transient pages often, so it's OK.
+	 */
 	spin_lock(&oap->oap_lock);
 	oap->oap_async_flags |= ASYNC_COUNT_STABLE;
 	spin_unlock(&oap->oap_lock);
@@ -257,8 +260,7 @@ int osc_page_init(const struct lu_env *env, struct cl_object *obj,
 	opg->ops_to = PAGE_SIZE - 1;
 	INIT_LIST_HEAD(&opg->ops_lru);
 
-	result = osc_prep_async_page(osc, opg, cl_page->cp_vmpage,
-				     cl_offset(obj, index));
+	result = osc_prep_async_page(osc, opg, cl_page, cl_offset(obj, index));
 	if (result != 0)
 		return result;
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 15/15] lustre: mgc: configurable wait-to-reprocess time
  2021-07-07 19:11 [lustre-devel] [PATCH 00/15] lustre: updates to OpenSFS tree as of July 7 2021 James Simmons
                   ` (13 preceding siblings ...)
  2021-07-07 19:11 ` [lustre-devel] [PATCH 14/15] lustre: osc: Simplify clipping for transient pages James Simmons
@ 2021-07-07 19:11 ` James Simmons
  14 siblings, 0 replies; 16+ messages in thread
From: James Simmons @ 2021-07-07 19:11 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Alex Zhuravlev <bzzz@whamcloud.com>

Make the wait-to-reprocess time configurable so it can be set
shorter, for testing purposes at least. To change the minimum
wait time, use the MGC module option 'mgc_requeue_timeout_min'
(in seconds). Additionally, a random value of up to
mgc_requeue_timeout_min is added to avoid a flood of config re-read
requests from clients. If mgc_requeue_timeout_min is set to 0,
the random part will be up to 1 second.
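
The wait computation described above can be sketched in userspace as
follows. This is an illustration under assumptions, not the MGC code:
DEMO_HZ is an assumed tick rate, demo_rand_below() is a hypothetical
stand-in for the kernel's prandom_u32_max(), and rand() with a modulo is
used only for brevity (it is slightly biased).

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed tick rate for the sketch; the kernel's HZ is build dependent. */
#define DEMO_HZ 1000u

/* Hypothetical stand-in for prandom_u32_max(): a value in [0, bound). */
static unsigned int demo_rand_below(unsigned int bound)
{
	return bound ? (unsigned int)rand() % bound : 0;
}

/* Wait time in ticks: the fixed minimum plus a random part of up to the
 * same length, or up to one second when the minimum is 0.
 */
static unsigned int demo_requeue_timeout(unsigned int min_sec)
{
	unsigned int rand_sec = min_sec == 0 ? 1 : min_sec;

	return min_sec * DEMO_HZ + demo_rand_below(rand_sec * DEMO_HZ);
}
```

With the default of 5 seconds the wait falls between 5 and 10 seconds;
with 0 it falls between 0 and 1 second.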

ost-pools: before: 5840s, after: 3474s
sanity-flr: before: 1575s, after: 1381s
sanity-quota: before: 10679s, after: 9703s

WC-bug-id: https://jira.whamcloud.com/browse/LU-14516
Lustre-commit: 04b2da6180d3c8eda ("LU-14516 mgc: configurable wait-to-reprocess time")
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/42020
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Aurelien Degremont <degremoa@amazon.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: James Simmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/mgc/mgc_internal.h |  8 ++++++++
 fs/lustre/mgc/mgc_request.c  | 44 +++++++++++++++++++++++++++++++++-----------
 2 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/fs/lustre/mgc/mgc_internal.h b/fs/lustre/mgc/mgc_internal.h
index a2a09d4..91f5fa1 100644
--- a/fs/lustre/mgc/mgc_internal.h
+++ b/fs/lustre/mgc/mgc_internal.h
@@ -43,6 +43,14 @@
 
 int mgc_process_log(struct obd_device *mgc, struct config_llog_data *cld);
 
+/* this timeout represents how many seconds MGC should wait before
+ * requeue config and recover lock to the MGS. We need to randomize this
+ * in order to not flood the MGS.
+ */
+#define MGC_TIMEOUT_MIN_SECONDS		5
+
+extern unsigned int mgc_requeue_timeout_min;
+
 static inline bool cld_is_sptlrpc(struct config_llog_data *cld)
 {
 	return cld->cld_type == MGS_CFG_T_SPTLRPC;
diff --git a/fs/lustre/mgc/mgc_request.c b/fs/lustre/mgc/mgc_request.c
index 1dfc74b..50044aa2 100644
--- a/fs/lustre/mgc/mgc_request.c
+++ b/fs/lustre/mgc/mgc_request.c
@@ -530,13 +530,6 @@ static void do_requeue(struct config_llog_data *cld)
 	up_read(&cld->cld_mgcexp->exp_obd->u.cli.cl_sem);
 }
 
-/* this timeout represents how many seconds MGC should wait before
- * requeue config and recover lock to the MGS. We need to randomize this
- * in order to not flood the MGS.
- */
-#define MGC_TIMEOUT_MIN_SECONDS   5
-#define MGC_TIMEOUT_RAND_CENTISEC 500
-
 static int mgc_requeue_thread(void *data)
 {
 	bool first = true;
@@ -548,7 +541,6 @@ static int mgc_requeue_thread(void *data)
 	rq_state |= RQ_RUNNING;
 	while (!(rq_state & RQ_STOP)) {
 		struct config_llog_data *cld, *cld_prev;
-		int rand = prandom_u32_max(MGC_TIMEOUT_RAND_CENTISEC);
 		int to;
 
 		/* Any new or requeued lostlocks will change the state */
@@ -565,11 +557,11 @@ static int mgc_requeue_thread(void *data)
 		 * random so everyone doesn't try to reconnect at once.
 		 */
 		/* rand is centi-seconds, "to" is in centi-HZ */
-		to = MGC_TIMEOUT_MIN_SECONDS * HZ * 100;
-		to += rand * HZ;
+		to = mgc_requeue_timeout_min == 0 ? 1 : mgc_requeue_timeout_min;
+		to = mgc_requeue_timeout_min * HZ + prandom_u32_max(to * HZ);
 		wait_event_idle_timeout(rq_waitq,
 					rq_state & (RQ_STOP | RQ_PRECLEANUP),
-					to/100);
+					to);
 
 		/*
 		 * iterate & processing through the list. for each cld, process
@@ -1835,6 +1827,36 @@ static int mgc_process_config(struct obd_device *obd, u32 len, void *buf)
 	.process_config	= mgc_process_config,
 };
 
+static int mgc_param_requeue_timeout_min_set(const char *val,
+					     const struct kernel_param *kp)
+{
+	int rc;
+	unsigned int num;
+
+	rc = kstrtouint(val, 0, &num);
+	if (rc < 0)
+		return rc;
+	if (num > 120)
+		return -EINVAL;
+
+	mgc_requeue_timeout_min = num;
+
+	return 0;
+}
+
+static struct kernel_param_ops param_ops_requeue_timeout_min = {
+	.set = mgc_param_requeue_timeout_min_set,
+	.get = param_get_uint,
+};
+
+#define param_check_requeue_timeout_min(name, p) \
+		__param_check(name, p, unsigned int)
+
+unsigned int mgc_requeue_timeout_min = MGC_TIMEOUT_MIN_SECONDS;
+module_param_call(mgc_requeue_timeout_min, mgc_param_requeue_timeout_min_set,
+		  param_get_uint, &param_ops_requeue_timeout_min, 0644);
+MODULE_PARM_DESC(mgc_requeue_timeout_min, "Minimal requeue time to refresh logs");
+
 static int __init mgc_init(void)
 {
 	int rc;
-- 
1.8.3.1
