lustre-devel-lustre.org archive mirror
* [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023
@ 2023-04-09 12:12 James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 01/40] lustre: protocol: basic batching processing framework James Simmons
                   ` (39 more replies)
  0 siblings, 40 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

Merge in the patches that landed in the OpenSFS tree.

Alex Deiter (1):
  lustre: enc: file names encryption when using secure boot

Amir Shehata (2):
  lnet: Lock primary NID logic
  lnet: don't delete peer created by Lustre

Andreas Dilger (1):
  lustre: tgt: skip free inodes in OST weights

Andrew Perepechko (2):
  lustre: lov: fiemap improperly handles fm_extent_count=0
  lustre: llite: SIGBUS is possible on a race with page reclaim

Andriy Skulysh (1):
  lustre: osc: page fault in osc_release_bounce_pages()

Aurelien Degremont (2):
  lustre: llite: fix relatime support
  lustre: ptlrpc: clarify AT error message

Chris Horn (3):
  lnet: Peers added via kernel API should be permanent
  lnet: memory leak in copy_ioc_udsp_descr
  lnet: lnet_parse_route uses wrong loop var

Cyril Bordage (2):
  lnet: o2iblnd: Fix key mismatch issue
  lnet: remove crash with UDSP

Etienne AUJAMES (1):
  lustre: llog: fix processing of a wrapped catalog

Hongchao Zhang (1):
  lustre: quota: fix insane grant quota

James Simmons (1):
  lustre: llite: replace lld_nfs_dentry flag with opencache handling

Lai Siyao (2):
  lustre: llite: match lock in corresponding namespace
  lustre: uapi: add DMV_IMP_INHERIT connect flag

Li Dongyang (1):
  lustre: fid: clean up OBIF_MAX_OID and IDIF_MAX_OID

Mikhail Pershin (1):
  lustre: client: -o network needs add_conn processing

Mr NeilBrown (1):
  lustre: ldlm: remove client_import_find_conn()

Oleg Drokin (1):
  lustre: update version to 2.15.54

Patrick Farrell (1):
  lustre: clio: Remove cl_page_size()

Qian Yingjin (4):
  lustre: protocol: basic batching processing framework
  lustre: readahead: add stats for read-ahead page count
  lustre: llite: check truncated page in ->readpage()
  lustre: llite: check read page past requested

Sebastien Buisson (4):
  lustre: enc: align Base64 encoding with RFC 4648 base64url
  lustre: sec: fid2path for encrypted files
  lustre: sec: Lustre/HSM on enc file with enc key
  lustre: fileset: check fileset for operations by fid

Sergey Cheremencev (2):
  lustre: quota: enforce project quota for root
  lustre: tgt: add qos debug

Serguei Smirnov (1):
  lnet: add 'force' option to lnetctl peer del

Timothy Day (2):
  lnet: libcfs: remove unused hash code
  lustre: ptlrpc: fix clang build errors

Vitaly Fertman (2):
  lustre: ldlm: BL_AST lock cancel still can be batched
  lustre: llite: dir layout inheritance fixes

Yang Sheng (1):
  lustre: ldlm: send the cancel RPC asap

 fs/lustre/include/cl_object.h            |   1 -
 fs/lustre/include/lu_object.h            |  14 +-
 fs/lustre/include/lustre_crypto.h        |   3 +
 fs/lustre/include/lustre_disk.h          |   3 +-
 fs/lustre/include/lustre_export.h        |   5 +
 fs/lustre/include/lustre_fid.h           |   6 +-
 fs/lustre/include/lustre_import.h        |   1 +
 fs/lustre/include/lustre_log.h           |  16 +
 fs/lustre/include/lustre_net.h           |  45 ++-
 fs/lustre/include/lustre_nrs.h           |  11 +-
 fs/lustre/include/lustre_nrs_delay.h     |  14 +-
 fs/lustre/include/lustre_req_layout.h    |  28 +-
 fs/lustre/include/lustre_swab.h          |   3 +
 fs/lustre/include/obd.h                  |  86 ++++-
 fs/lustre/include/obd_class.h            |  48 +++
 fs/lustre/include/obd_support.h          |   7 +-
 fs/lustre/ldlm/ldlm_lib.c                |  44 +--
 fs/lustre/ldlm/ldlm_lockd.c              |   8 +-
 fs/lustre/ldlm/ldlm_request.c            | 122 +++++--
 fs/lustre/llite/crypto.c                 |  35 +-
 fs/lustre/llite/dir.c                    | 122 +++++--
 fs/lustre/llite/file.c                   | 262 ++++++++++++--
 fs/lustre/llite/llite_internal.h         |  44 ++-
 fs/lustre/llite/llite_lib.c              |  10 +-
 fs/lustre/llite/llite_mmap.c             |  19 +
 fs/lustre/llite/llite_nfs.c              |  15 +-
 fs/lustre/llite/lproc_llite.c            |  65 +++-
 fs/lustre/llite/namei.c                  |  25 +-
 fs/lustre/llite/rw.c                     | 105 +++++-
 fs/lustre/llite/rw26.c                   |  16 +-
 fs/lustre/llite/super25.c                |   2 +
 fs/lustre/llite/vvp_io.c                 |  12 +-
 fs/lustre/llite/vvp_page.c               |  37 ++
 fs/lustre/lmv/lmv_internal.h             |  12 +
 fs/lustre/lmv/lmv_obd.c                  | 267 ++++++++++++--
 fs/lustre/lov/lov_object.c               |   7 +-
 fs/lustre/lov/lov_page.c                 |   2 +-
 fs/lustre/mdc/Makefile                   |   2 +-
 fs/lustre/mdc/mdc_batch.c                |  62 ++++
 fs/lustre/mdc/mdc_internal.h             |   3 +
 fs/lustre/mdc/mdc_request.c              |  14 +-
 fs/lustre/obdclass/cl_page.c             |  24 --
 fs/lustre/obdclass/llog.c                |  21 +-
 fs/lustre/obdclass/llog_cat.c            | 128 ++++---
 fs/lustre/obdclass/lu_tgt_descs.c        |  52 +--
 fs/lustre/obdclass/obd_config.c          |  17 -
 fs/lustre/osc/osc_cache.c                |   2 +-
 fs/lustre/osc/osc_quota.c                |   1 +
 fs/lustre/osc/osc_request.c              |   3 +
 fs/lustre/ptlrpc/Makefile                |   2 +-
 fs/lustre/ptlrpc/batch.c                 | 588 +++++++++++++++++++++++++++++++
 fs/lustre/ptlrpc/client.c                |  25 ++
 fs/lustre/ptlrpc/layout.c                | 126 ++++++-
 fs/lustre/ptlrpc/lproc_ptlrpc.c          |   1 +
 fs/lustre/ptlrpc/nrs_delay.c             |   2 +-
 fs/lustre/ptlrpc/pack_generic.c          |  27 +-
 fs/lustre/ptlrpc/sec_config.c            |   2 +-
 fs/lustre/ptlrpc/service.c               |  11 +-
 fs/lustre/ptlrpc/wiretest.c              |  14 +-
 include/linux/libcfs/libcfs_hash.h       |  18 -
 include/linux/lnet/lib-lnet.h            |   7 +-
 include/uapi/linux/lnet/lnet-dlc.h       |   4 +-
 include/uapi/linux/lustre/lustre_idl.h   |  96 ++++-
 include/uapi/linux/lustre/lustre_ostid.h |   4 +-
 include/uapi/linux/lustre/lustre_user.h  |   1 +
 include/uapi/linux/lustre/lustre_ver.h   |   4 +-
 net/lnet/klnds/o2iblnd/o2iblnd.c         |   5 +-
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c      |  17 +-
 net/lnet/lnet/api-ni.c                   |   6 +-
 net/lnet/lnet/config.c                   |   2 +-
 net/lnet/lnet/peer.c                     | 191 +++++++---
 net/lnet/lnet/udsp.c                     |  36 +-
 72 files changed, 2526 insertions(+), 514 deletions(-)
 create mode 100644 fs/lustre/mdc/mdc_batch.c
 create mode 100644 fs/lustre/ptlrpc/batch.c

-- 
1.8.3.1

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org


* [lustre-devel] [PATCH 01/40] lustre: protocol: basic batching processing framework
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 02/40] lustre: lov: fiemap improperly handles fm_extent_count=0 James Simmons
                   ` (38 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Qian Yingjin <qian@ddn.com>

Batch processing can deliver a performance boost. The larger the
batch size, the higher the latency for the entire batch. Although
the latency for the entire batch of operations is higher than the
latency of any single operation, the throughput of the batch of
operations is much higher.

This patch implements the basic batch processing framework for
Lustre. It can be used for future batched statahead and WBC.

A batched RPC does not require that the opcodes of the sub requests
in a batch be the same; each sub request has its own opcode. This
allows batching not only read-only requests but also multiple
modification updates with different opcodes, and even a mixed
workload that contains both read-only requests and modification
updates.
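
Purely as an illustration (this patch adds no callers, and the MDC
opcode packer/interpreter tables below are still empty stubs), the
client-side API introduced here could be driven roughly as follows,
given an obd_export *exp and an already filled-in
struct md_op_item *item:

    struct lu_batch *bh;
    int rc;

    /* start a batch that auto-flushes once 64 sub requests are queued */
    bh = md_batch_create(exp, BATCH_FL_RDONLY, 64);
    if (IS_ERR(bh))
            return PTR_ERR(bh);

    /* queue a sub request; the packer/interpreter is selected by mop_opc */
    item->mop_opc = MD_OP_GETATTR;
    rc = md_batch_add(exp, bh, item);

    /* send whatever is still queued and tear the batch down */
    rc = md_batch_stop(exp, bh);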

For recovery, only the batched RPC carries a client XID; there is
no separate client XID for each sub request. Although the server
generates a transno for each update sub request, the transno is
only stored into the batched RPC (in @ptlrpc_body) when that sub
update request finishes, so the batched RPC only holds the transno
of the last sub update request. Likewise, only the batched RPC
contains the @ptlrpc_body message field; the sub requests in a
batched RPC do not.

A new field named @lrd_batch_idx is added to the client reply data
@lsd_reply_data. It indicates the index of the sub request in a
batched RPC. When the server finishes a sub update request, it
updates @lrd_batch_idx accordingly.
When the server finds that a batched RPC is a resend and the index
of a sub request in the batch is smaller than or equal to
@lrd_batch_idx in the reply data, that sub request has already been
executed and committed, so the server reconstructs the reply for
it; if the index is larger than @lrd_batch_idx, the server
re-executes the sub request in the batched RPC.

To simplify reply and resend handling for batched RPCs, batch
processing stops at the first failure in the current design.
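
The server-side handling is not part of this client backport; the
following is only a sketch of the resend rule described above, where
@lrd_batch_idx is the real reply-data field and the helper names are
hypothetical:

    /* on a resent batched RPC, for the sub request at index sub_idx */
    if (sub_idx <= lrd->lrd_batch_idx)
            reconstruct_sub_reply(sub_idx);  /* already executed and committed */
    else
            execute_sub_request(sub_idx);    /* not applied yet, run it now */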

WC-bug-id: https://jira.whamcloud.com/browse/LU-14393
Lustre-commit: 840274b5c5e95e44a ("LU-14393 protocol: basic batching processing framework")
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/41378
Lustre-commit: 178988d67aa2f83aa ("LU-14393 recovery: reply reconstruction for batched RPCs")
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/48228
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h         |  43 +++
 fs/lustre/include/lustre_req_layout.h  |  28 +-
 fs/lustre/include/lustre_swab.h        |   3 +
 fs/lustre/include/obd.h                |  78 +++++
 fs/lustre/include/obd_class.h          |  48 +++
 fs/lustre/lmv/lmv_internal.h           |  12 +
 fs/lustre/lmv/lmv_obd.c                | 173 ++++++++++
 fs/lustre/mdc/Makefile                 |   2 +-
 fs/lustre/mdc/mdc_batch.c              |  62 ++++
 fs/lustre/mdc/mdc_internal.h           |   3 +
 fs/lustre/mdc/mdc_request.c            |   4 +
 fs/lustre/ptlrpc/Makefile              |   2 +-
 fs/lustre/ptlrpc/batch.c               | 588 +++++++++++++++++++++++++++++++++
 fs/lustre/ptlrpc/client.c              |  25 ++
 fs/lustre/ptlrpc/layout.c              | 126 ++++++-
 fs/lustre/ptlrpc/lproc_ptlrpc.c        |   1 +
 fs/lustre/ptlrpc/pack_generic.c        |  27 +-
 fs/lustre/ptlrpc/wiretest.c            |  12 +-
 include/uapi/linux/lustre/lustre_idl.h |  82 ++++-
 19 files changed, 1280 insertions(+), 39 deletions(-)
 create mode 100644 fs/lustre/mdc/mdc_batch.c
 create mode 100644 fs/lustre/ptlrpc/batch.c

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 1605fcc..1ffe9f7 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -1161,6 +1161,13 @@ struct ptlrpc_bulk_frag_ops {
 			      struct page *page, int pageoffset, int len);
 
 	/**
+	 * Add a fragment to the bulk descriptor @desc.
+	 * The data to transfer in the fragment is pointed to by @frag and
+	 * the size of the fragment is @len.
+	 */
+	int (*add_iov_frag)(struct ptlrpc_bulk_desc *desc, void *frag, int len);
+
+	/**
 	 * Uninitialize and free bulk descriptor @desc.
 	 * Works on bulk descriptors both from server and client side.
 	 */
@@ -1170,6 +1177,42 @@ struct ptlrpc_bulk_frag_ops {
 extern const struct ptlrpc_bulk_frag_ops ptlrpc_bulk_kiov_pin_ops;
 extern const struct ptlrpc_bulk_frag_ops ptlrpc_bulk_kiov_nopin_ops;
 
+static inline bool req_capsule_ptlreq(struct req_capsule *pill)
+{
+	struct ptlrpc_request *req = pill->rc_req;
+
+	return req && pill == &req->rq_pill;
+}
+
+static inline bool req_capsule_subreq(struct req_capsule *pill)
+{
+	struct ptlrpc_request *req = pill->rc_req;
+
+	return !req || pill != &req->rq_pill;
+}
+
+/**
+ * Returns true if request needs to be swabbed into local cpu byteorder
+ */
+static inline bool req_capsule_req_need_swab(struct req_capsule *pill)
+{
+	struct ptlrpc_request *req = pill->rc_req;
+
+	return req && req_capsule_req_swabbed(&req->rq_pill,
+					      MSG_PTLRPC_HEADER_OFF);
+}
+
+/**
+ * Returns true if request reply needs to be swabbed into local cpu byteorder
+ */
+static inline bool req_capsule_rep_need_swab(struct req_capsule *pill)
+{
+	struct ptlrpc_request *req = pill->rc_req;
+
+	return req && req_capsule_rep_swabbed(&req->rq_pill,
+					      MSG_PTLRPC_HEADER_OFF);
+}
+
 /**
  * Definition of bulk descriptor.
  * Bulks are special "Two phase" RPCs where initial request message
diff --git a/fs/lustre/include/lustre_req_layout.h b/fs/lustre/include/lustre_req_layout.h
index 9f22134b..a7ed89b 100644
--- a/fs/lustre/include/lustre_req_layout.h
+++ b/fs/lustre/include/lustre_req_layout.h
@@ -82,7 +82,9 @@ void req_capsule_init(struct req_capsule *pill, struct ptlrpc_request *req,
 void req_capsule_set(struct req_capsule *pill, const struct req_format *fmt);
 size_t req_capsule_filled_sizes(struct req_capsule *pill,
 				enum req_location loc);
-int  req_capsule_server_pack(struct req_capsule *pill);
+int req_capsule_server_pack(struct req_capsule *pill);
+int req_capsule_client_pack(struct req_capsule *pill);
+void req_capsule_set_replen(struct req_capsule *pill);
 
 void *req_capsule_client_get(struct req_capsule *pill,
 			     const struct req_msg_field *field);
@@ -150,22 +152,6 @@ static inline bool req_capsule_rep_swabbed(struct req_capsule *pill,
 }
 
 /**
- * Returns true if request needs to be swabbed into local cpu byteorder
- */
-static inline bool req_capsule_req_need_swab(struct req_capsule *pill)
-{
-	return req_capsule_req_swabbed(pill, MSG_PTLRPC_HEADER_OFF);
-}
-
-/**
- * Returns true if request reply needs to be swabbed into local cpu byteorder
- */
-static inline bool req_capsule_rep_need_swab(struct req_capsule *pill)
-{
-	return req_capsule_rep_swabbed(pill, MSG_PTLRPC_HEADER_OFF);
-}
-
-/**
  * Mark request buffer at offset \a index that it was already swabbed
  */
 static inline void req_capsule_set_req_swabbed(struct req_capsule *pill,
@@ -295,6 +281,14 @@ static inline void req_capsule_set_rep_swabbed(struct req_capsule *pill,
 
 extern struct req_format RQF_CONNECT;
 
+/* Batch UpdaTe req_format */
+extern struct req_format RQF_MDS_BATCH;
+
+/* Batch UpdaTe format */
+extern struct req_msg_field RMF_BUT_REPLY;
+extern struct req_msg_field RMF_BUT_HEADER;
+extern struct req_msg_field RMF_BUT_BUF;
+
 extern struct req_msg_field RMF_GENERIC_DATA;
 extern struct req_msg_field RMF_PTLRPC_BODY;
 extern struct req_msg_field RMF_MDT_BODY;
diff --git a/fs/lustre/include/lustre_swab.h b/fs/lustre/include/lustre_swab.h
index 000e622..eda3532 100644
--- a/fs/lustre/include/lustre_swab.h
+++ b/fs/lustre/include/lustre_swab.h
@@ -96,6 +96,9 @@ void lustre_swab_lov_user_md_objects(struct lov_user_ost_data *lod,
 void lustre_swab_hsm_user_state(struct hsm_user_state *hus);
 void lustre_swab_hsm_user_item(struct hsm_user_item *hui);
 void lustre_swab_hsm_request(struct hsm_request *hr);
+void lustre_swab_but_update_header(struct but_update_header *buh);
+void lustre_swab_but_update_buffer(struct but_update_buffer *bub);
+void lustre_swab_batch_update_reply(struct batch_update_reply *bur);
 void lustre_swab_swap_layouts(struct mdc_swap_layouts *msl);
 void lustre_swab_close_data(struct close_data *data);
 void lustre_swab_close_data_resync_done(struct close_data_resync_done *resync);
diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index e9752a3..a980bf0 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -835,7 +835,14 @@ struct md_readdir_info {
 struct md_op_item;
 typedef int (*md_op_item_cb_t)(struct md_op_item *item, int rc);
 
+enum md_item_opcode {
+	MD_OP_NONE	= 0,
+	MD_OP_GETATTR	= 1,
+	MD_OP_MAX,
+};
+
 struct md_op_item {
+	enum md_item_opcode		 mop_opc;
 	struct md_op_data		 mop_data;
 	struct lookup_intent		 mop_it;
 	struct lustre_handle		 mop_lockh;
@@ -847,6 +854,69 @@ struct md_op_item {
 	struct work_struct		 mop_work;
 };
 
+enum lu_batch_flags {
+	BATCH_FL_NONE	= 0x0,
+	/* All requests in a batch are read-only. */
+	BATCH_FL_RDONLY	= 0x1,
+	/* Will create PTLRPC request set for the batch. */
+	BATCH_FL_RQSET	= 0x2,
+	/* Whether need sync commit. */
+	BATCH_FL_SYNC	= 0x4,
+};
+
+struct lu_batch {
+	struct ptlrpc_request_set	*lbt_rqset;
+	__s32				 lbt_result;
+	__u32				 lbt_flags;
+	/* Max batched SUB requests count in a batch. */
+	__u32				 lbt_max_count;
+};
+
+struct batch_update_head {
+	struct obd_export	*buh_exp;
+	struct lu_batch		*buh_batch;
+	int			 buh_flags;
+	__u32			 buh_count;
+	__u32			 buh_update_count;
+	__u32			 buh_buf_count;
+	__u32			 buh_reqsize;
+	__u32			 buh_repsize;
+	__u32			 buh_batchid;
+	struct list_head	 buh_buf_list;
+	struct list_head	 buh_cb_list;
+};
+
+struct object_update_callback;
+typedef int (*object_update_interpret_t)(struct ptlrpc_request *req,
+					 struct lustre_msg *repmsg,
+					 struct object_update_callback *ouc,
+					 int rc);
+
+struct object_update_callback {
+	struct list_head		 ouc_item;
+	object_update_interpret_t	 ouc_interpret;
+	struct batch_update_head	*ouc_head;
+	void				*ouc_data;
+};
+
+typedef int (*md_update_pack_t)(struct batch_update_head *head,
+				struct lustre_msg *reqmsg,
+				size_t *max_pack_size,
+				struct md_op_item *item);
+
+struct cli_batch {
+	struct lu_batch			  cbh_super;
+	struct batch_update_head	 *cbh_head;
+};
+
+struct lu_batch *cli_batch_create(struct obd_export *exp,
+				  enum lu_batch_flags flags, __u32 max_count);
+int cli_batch_stop(struct obd_export *exp, struct lu_batch *bh);
+int cli_batch_flush(struct obd_export *exp, struct lu_batch *bh, bool wait);
+int cli_batch_add(struct obd_export *exp, struct lu_batch *bh,
+		  struct md_op_item *item, md_update_pack_t packer,
+		  object_update_interpret_t interpreter);
+
 struct obd_ops {
 	struct module *owner;
 	int (*iocontrol)(unsigned int cmd, struct obd_export *exp, int len,
@@ -1086,6 +1156,14 @@ struct md_ops {
 			const union lmv_mds_md *lmv, size_t lmv_size);
 	int (*rmfid)(struct obd_export *exp, struct fid_array *fa, int *rcs,
 		     struct ptlrpc_request_set *set);
+	struct lu_batch *(*batch_create)(struct obd_export *exp,
+					 enum lu_batch_flags flags,
+					 u32 max_count);
+	int (*batch_stop)(struct obd_export *exp, struct lu_batch *bh);
+	int (*batch_flush)(struct obd_export *exp, struct lu_batch *bh,
+			   bool wait);
+	int (*batch_add)(struct obd_export *exp, struct lu_batch *bh,
+			 struct md_op_item *item);
 };
 
 static inline struct md_open_data *obd_mod_alloc(void)
diff --git a/fs/lustre/include/obd_class.h b/fs/lustre/include/obd_class.h
index 81ef59e..e4ad600 100644
--- a/fs/lustre/include/obd_class.h
+++ b/fs/lustre/include/obd_class.h
@@ -1673,6 +1673,54 @@ static inline int md_rmfid(struct obd_export *exp, struct fid_array *fa,
 	return MDP(exp->exp_obd, rmfid)(exp, fa, rcs, set);
 }
 
+static inline struct lu_batch *
+md_batch_create(struct obd_export *exp, enum lu_batch_flags flags,
+		__u32 max_count)
+{
+	int rc;
+
+	rc = exp_check_ops(exp);
+	if (rc)
+		return ERR_PTR(rc);
+
+	return MDP(exp->exp_obd, batch_create)(exp, flags, max_count);
+}
+
+static inline int md_batch_stop(struct obd_export *exp, struct lu_batch *bh)
+{
+	int rc;
+
+	rc = exp_check_ops(exp);
+	if (rc)
+		return rc;
+
+	return MDP(exp->exp_obd, batch_stop)(exp, bh);
+}
+
+static inline int md_batch_flush(struct obd_export *exp, struct lu_batch *bh,
+				 bool wait)
+{
+	int rc;
+
+	rc = exp_check_ops(exp);
+	if (rc)
+		return rc;
+
+	return MDP(exp->exp_obd, batch_flush)(exp, bh, wait);
+}
+
+static inline int md_batch_add(struct obd_export *exp, struct lu_batch *bh,
+			       struct md_op_item *item)
+{
+	int rc;
+
+	rc = exp_check_ops(exp);
+	if (rc)
+		return rc;
+
+	return MDP(exp->exp_obd, batch_add)(exp, bh, item);
+}
+
 /* OBD Metadata Support */
 
 int obd_init_caches(void);
diff --git a/fs/lustre/lmv/lmv_internal.h b/fs/lustre/lmv/lmv_internal.h
index 9e89f88..64ec4ae 100644
--- a/fs/lustre/lmv/lmv_internal.h
+++ b/fs/lustre/lmv/lmv_internal.h
@@ -42,6 +42,18 @@
 #define LL_IT2STR(it)					\
 	((it) ? ldlm_it2str((it)->it_op) : "0")
 
+struct lmvsub_batch {
+	struct lu_batch		*sbh_sub;
+	struct lmv_tgt_desc	*sbh_tgt;
+	struct list_head	 sbh_sub_item;
+};
+
+struct lmv_batch {
+	struct lu_batch			 lbh_super;
+	struct ptlrpc_request_set	*lbh_rqset;
+	struct list_head		 lbh_sub_batch_list;
+};
+
 int lmv_intent_lock(struct obd_export *exp, struct md_op_data *op_data,
 		    struct lookup_intent *it, struct ptlrpc_request **reqp,
 		    ldlm_blocking_callback cb_blocking,
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 3a02cc1..64d16d8 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -3790,6 +3790,175 @@ static int lmv_merge_attr(struct obd_export *exp,
 	return 0;
 }
 
+static struct lu_batch *lmv_batch_create(struct obd_export *exp,
+					 enum lu_batch_flags flags,
+					 __u32 max_count)
+{
+	struct lu_batch *bh;
+	struct lmv_batch *lbh;
+
+	lbh = kzalloc(sizeof(*lbh), GFP_NOFS);
+	if (!lbh)
+		return ERR_PTR(-ENOMEM);
+
+	bh = &lbh->lbh_super;
+	bh->lbt_flags = flags;
+	bh->lbt_max_count = max_count;
+
+	if (flags & BATCH_FL_RQSET) {
+		bh->lbt_rqset = ptlrpc_prep_set();
+		if (!bh->lbt_rqset) {
+			kfree(lbh);
+			return ERR_PTR(-ENOMEM);
+		}
+	}
+
+	INIT_LIST_HEAD(&lbh->lbh_sub_batch_list);
+	return bh;
+}
+
+static int lmv_batch_stop(struct obd_export *exp, struct lu_batch *bh)
+{
+	struct lmv_batch *lbh;
+	struct lmvsub_batch *sub;
+	struct lmvsub_batch *tmp;
+	int rc = 0;
+
+	lbh = container_of(bh, struct lmv_batch, lbh_super);
+	list_for_each_entry_safe(sub, tmp, &lbh->lbh_sub_batch_list,
+				 sbh_sub_item) {
+		list_del(&sub->sbh_sub_item);
+		rc = md_batch_stop(sub->sbh_tgt->ltd_exp, sub->sbh_sub);
+		if (rc < 0) {
+			CERROR("%s: stop batch processing failed: rc = %d\n",
+			       exp->exp_obd->obd_name, rc);
+			if (bh->lbt_result == 0)
+				bh->lbt_result = rc;
+		}
+		kfree(sub);
+	}
+
+	if (bh->lbt_flags & BATCH_FL_RQSET) {
+		rc = ptlrpc_set_wait(NULL, bh->lbt_rqset);
+		ptlrpc_set_destroy(bh->lbt_rqset);
+	}
+
+	kfree(lbh);
+	return rc;
+}
+
+static int lmv_batch_flush(struct obd_export *exp, struct lu_batch *bh,
+			   bool wait)
+{
+	struct lmv_batch *lbh;
+	struct lmvsub_batch *sub;
+	int rc = 0;
+	int rc1;
+
+	lbh = container_of(bh, struct lmv_batch, lbh_super);
+	list_for_each_entry(sub, &lbh->lbh_sub_batch_list, sbh_sub_item) {
+		rc1 = md_batch_flush(sub->sbh_tgt->ltd_exp, sub->sbh_sub, wait);
+		if (rc1 < 0) {
+			CERROR("%s: batch flush failed: rc = %d\n",
+			       exp->exp_obd->obd_name, rc1);
+			if (bh->lbt_result == 0)
+				bh->lbt_result = rc1;
+
+			if (rc == 0)
+				rc = rc1;
+		}
+	}
+
+	if (wait && bh->lbt_flags & BATCH_FL_RQSET) {
+		rc1 = ptlrpc_set_wait(NULL, bh->lbt_rqset);
+		if (rc == 0)
+			rc = rc1;
+	}
+
+	return rc;
+}
+
+static inline struct lmv_tgt_desc *
+lmv_batch_locate_tgt(struct lmv_obd *lmv, struct md_op_item *item)
+{
+	struct lmv_tgt_desc *tgt;
+
+	switch (item->mop_opc) {
+	default:
+		tgt = ERR_PTR(-EOPNOTSUPP);
+	}
+
+	return tgt;
+}
+
+struct lu_batch *lmv_batch_lookup_sub(struct lmv_batch *lbh,
+				      struct lmv_tgt_desc *tgt)
+{
+	struct lmvsub_batch *sub;
+
+	list_for_each_entry(sub, &lbh->lbh_sub_batch_list, sbh_sub_item) {
+		if (sub->sbh_tgt == tgt)
+			return sub->sbh_sub;
+	}
+
+	return NULL;
+}
+
+struct lu_batch *lmv_batch_get_sub(struct lmv_batch *lbh,
+				   struct lmv_tgt_desc *tgt)
+{
+	struct lmvsub_batch *sbh;
+	struct lu_batch *child_bh;
+	struct lu_batch *bh;
+
+	child_bh = lmv_batch_lookup_sub(lbh, tgt);
+	if (child_bh)
+		return child_bh;
+
+	sbh = kzalloc(sizeof(*sbh), GFP_NOFS);
+	if (!sbh)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&sbh->sbh_sub_item);
+	sbh->sbh_tgt = tgt;
+
+	bh = &lbh->lbh_super;
+	child_bh = md_batch_create(tgt->ltd_exp, bh->lbt_flags,
+				   bh->lbt_max_count);
+	if (IS_ERR(child_bh)) {
+		kfree(sbh);
+		return child_bh;
+	}
+
+	child_bh->lbt_rqset = bh->lbt_rqset;
+	sbh->sbh_sub = child_bh;
+	list_add(&sbh->sbh_sub_item, &lbh->lbh_sub_batch_list);
+	return child_bh;
+}
+
+static int lmv_batch_add(struct obd_export *exp, struct lu_batch *bh,
+			 struct md_op_item *item)
+{
+	struct obd_device *obd = exp->exp_obd;
+	struct lmv_obd *lmv = &obd->u.lmv;
+	struct lmv_tgt_desc *tgt;
+	struct lmv_batch *lbh;
+	struct lu_batch *child_bh;
+	int rc;
+
+	tgt = lmv_batch_locate_tgt(lmv, item);
+	if (IS_ERR(tgt))
+		return PTR_ERR(tgt);
+
+	lbh = container_of(bh, struct lmv_batch, lbh_super);
+	child_bh = lmv_batch_get_sub(lbh, tgt);
+	if (IS_ERR(child_bh))
+		return PTR_ERR(child_bh);
+
+	rc = md_batch_add(tgt->ltd_exp, child_bh, item);
+	return rc;
+}
+
 static const struct obd_ops lmv_obd_ops = {
 	.owner			= THIS_MODULE,
 	.setup			= lmv_setup,
@@ -3840,6 +4009,10 @@ static int lmv_merge_attr(struct obd_export *exp,
 	.get_fid_from_lsm	= lmv_get_fid_from_lsm,
 	.unpackmd		= lmv_unpackmd,
 	.rmfid			= lmv_rmfid,
+	.batch_create		= lmv_batch_create,
+	.batch_add		= lmv_batch_add,
+	.batch_stop		= lmv_batch_stop,
+	.batch_flush		= lmv_batch_flush,
 };
 
 static int __init lmv_init(void)
diff --git a/fs/lustre/mdc/Makefile b/fs/lustre/mdc/Makefile
index 1ac97ee..191c400 100644
--- a/fs/lustre/mdc/Makefile
+++ b/fs/lustre/mdc/Makefile
@@ -2,5 +2,5 @@ ccflags-y += -I$(srctree)/$(src)/../include
 
 obj-$(CONFIG_LUSTRE_FS) += mdc.o
 mdc-y := mdc_changelog.o mdc_request.o mdc_reint.o mdc_lib.o mdc_locks.o lproc_mdc.o
-mdc-y += mdc_dev.o
+mdc-y += mdc_dev.o mdc_batch.o
 mdc-$(CONFIG_LUSTRE_FS_POSIX_ACL) += mdc_acl.o
diff --git a/fs/lustre/mdc/mdc_batch.c b/fs/lustre/mdc/mdc_batch.c
new file mode 100644
index 0000000..496d61e3
--- /dev/null
+++ b/fs/lustre/mdc/mdc_batch.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+/*
+ * Copyright (c) 2020, 2022, DDN Storage Corporation.
+ */
+/*
+ * This file is part of Lustre, http://www.lustre.org/
+ */
+/*
+ * lustre/mdc/mdc_batch.c
+ *
+ * Batch Metadata Updating on the client (MDC)
+ *
+ * Author: Qian Yingjin <qian@ddn.com>
+ */
+
+#define DEBUG_SUBSYSTEM S_MDC
+
+#include <linux/module.h>
+#include <lustre_acl.h>
+
+#include "mdc_internal.h"
+
+static md_update_pack_t mdc_update_packers[MD_OP_MAX];
+
+static object_update_interpret_t mdc_update_interpreters[MD_OP_MAX];
+
+int mdc_batch_add(struct obd_export *exp, struct lu_batch *bh,
+		  struct md_op_item *item)
+{
+	enum md_item_opcode opc = item->mop_opc;
+
+	if (opc >= MD_OP_MAX || !mdc_update_packers[opc] ||
+	    !mdc_update_interpreters[opc]) {
+		CERROR("%s: unexpected opcode %d\n",
+		       exp->exp_obd->obd_name, opc);
+		return -EFAULT;
+	}
+
+	return cli_batch_add(exp, bh, item, mdc_update_packers[opc],
+			     mdc_update_interpreters[opc]);
+}
diff --git a/fs/lustre/mdc/mdc_internal.h b/fs/lustre/mdc/mdc_internal.h
index 2416607..ae12a37 100644
--- a/fs/lustre/mdc/mdc_internal.h
+++ b/fs/lustre/mdc/mdc_internal.h
@@ -132,6 +132,9 @@ int mdc_revalidate_lock(struct obd_export *exp, struct lookup_intent *it,
 
 int mdc_intent_getattr_async(struct obd_export *exp, struct md_op_item *item);
 
+int mdc_batch_add(struct obd_export *exp, struct lu_batch *bh,
+		  struct md_op_item *item);
+
 enum ldlm_mode mdc_lock_match(struct obd_export *exp, u64 flags,
 			      const struct lu_fid *fid, enum ldlm_type type,
 			      union ldlm_policy_data *policy,
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index c073da2..643b6ee 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -3023,6 +3023,10 @@ static int mdc_cleanup(struct obd_device *obd)
 	.intent_getattr_async	= mdc_intent_getattr_async,
 	.revalidate_lock	= mdc_revalidate_lock,
 	.rmfid			= mdc_rmfid,
+	.batch_create		= cli_batch_create,
+	.batch_stop		= cli_batch_stop,
+	.batch_flush		= cli_batch_flush,
+	.batch_add		= mdc_batch_add,
 };
 
 dev_t mdc_changelog_dev;
diff --git a/fs/lustre/ptlrpc/Makefile b/fs/lustre/ptlrpc/Makefile
index 3badb05..29287b4 100644
--- a/fs/lustre/ptlrpc/Makefile
+++ b/fs/lustre/ptlrpc/Makefile
@@ -13,7 +13,7 @@ ldlm_objs += $(LDLM)ldlm_pool.o
 ptlrpc_objs := client.o recover.o connection.o niobuf.o pack_generic.o
 ptlrpc_objs += events.o ptlrpc_module.o service.o pinger.o
 ptlrpc_objs += llog_net.o llog_client.o import.o ptlrpcd.o
-ptlrpc_objs += pers.o lproc_ptlrpc.o wiretest.o layout.o
+ptlrpc_objs += pers.o batch.o lproc_ptlrpc.o wiretest.o layout.o
 ptlrpc_objs += sec.o sec_bulk.o sec_gc.o sec_config.o
 ptlrpc_objs += sec_null.o sec_plain.o
 ptlrpc_objs += heap.o nrs.o nrs_fifo.o nrs_delay.o
diff --git a/fs/lustre/ptlrpc/batch.c b/fs/lustre/ptlrpc/batch.c
new file mode 100644
index 0000000..76eb4cf
--- /dev/null
+++ b/fs/lustre/ptlrpc/batch.c
@@ -0,0 +1,588 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+/*
+ * Copyright (c) 2020, 2022, DDN/Whamcloud Storage Corporation.
+ */
+/*
+ * This file is part of Lustre, http://www.lustre.org/
+ */
+/*
+ * lustre/ptlrpc/batch.c
+ *
+ * Batch Metadata Updating on the client
+ *
+ * Author: Qian Yingjin <qian@ddn.com>
+ */
+
+#define DEBUG_SUBSYSTEM S_MDC
+
+#include <linux/module.h>
+#include <uapi/linux/lustre/lustre_idl.h>
+#include <obd.h>
+
+#define OUT_UPDATE_REPLY_SIZE          4096
+
+static inline struct lustre_msg *
+batch_update_repmsg_next(struct batch_update_reply *bur,
+			 struct lustre_msg *repmsg)
+{
+	if (repmsg)
+		return (struct lustre_msg *)((char *)repmsg +
+					     lustre_packed_msg_size(repmsg));
+	else
+		return &bur->burp_repmsg[0];
+}
+
+struct batch_update_buffer {
+	struct batch_update_request	*bub_req;
+	size_t				 bub_size;
+	size_t				 bub_end;
+	struct list_head		 bub_item;
+};
+
+struct batch_update_args {
+	struct batch_update_head	*ba_head;
+};
+
+/**
+ * Prepare inline update request
+ *
+ * Prepare a BUT update ptlrpc inline request; the request usually includes
+ * one update buffer, which does not need bulk transfer.
+ */
+static int batch_prep_inline_update_req(struct batch_update_head *head,
+					struct ptlrpc_request *req,
+					int repsize)
+{
+	struct batch_update_buffer *buf;
+	struct but_update_header *buh;
+	int rc;
+
+	buf = list_entry(head->buh_buf_list.next,
+			  struct batch_update_buffer, bub_item);
+	req_capsule_set_size(&req->rq_pill, &RMF_BUT_HEADER, RCL_CLIENT,
+			     buf->bub_end + sizeof(*buh));
+
+	rc = ptlrpc_request_pack(req, LUSTRE_MDS_VERSION, MDS_BATCH);
+	if (rc != 0)
+		return rc;
+
+	buh = req_capsule_client_get(&req->rq_pill, &RMF_BUT_HEADER);
+	buh->buh_magic = BUT_HEADER_MAGIC;
+	buh->buh_count = 1;
+	buh->buh_inline_length = buf->bub_end;
+	buh->buh_reply_size = repsize;
+	buh->buh_update_count = head->buh_update_count;
+
+	memcpy(buh->buh_inline_data, buf->bub_req, buf->bub_end);
+
+	req_capsule_set_size(&req->rq_pill, &RMF_BUT_REPLY,
+			     RCL_SERVER, repsize);
+
+	ptlrpc_request_set_replen(req);
+	req->rq_request_portal = OUT_PORTAL;
+	req->rq_reply_portal = OSC_REPLY_PORTAL;
+
+	return rc;
+}
+
+static int batch_prep_update_req(struct batch_update_head *head,
+				 struct ptlrpc_request **reqp)
+{
+	struct ptlrpc_request *req;
+	struct ptlrpc_bulk_desc *desc;
+	struct batch_update_buffer *buf;
+	struct but_update_header *buh;
+	struct but_update_buffer *bub;
+	int page_count = 0;
+	int total = 0;
+	int repsize;
+	int rc;
+
+	repsize = head->buh_repsize +
+		  cfs_size_round(offsetof(struct batch_update_reply,
+					  burp_repmsg[0]));
+	if (repsize < OUT_UPDATE_REPLY_SIZE)
+		repsize = OUT_UPDATE_REPLY_SIZE;
+
+	LASSERT(head->buh_buf_count > 0);
+
+	req = ptlrpc_request_alloc(class_exp2cliimp(head->buh_exp),
+				   &RQF_MDS_BATCH);
+	if (!req)
+		return -ENOMEM;
+
+	if (head->buh_buf_count == 1) {
+		buf = list_entry(head->buh_buf_list.next,
+				 struct batch_update_buffer, bub_item);
+
+		/* Check whether it can be packed inline */
+		if (buf->bub_end + sizeof(struct but_update_header) <
+		    OUT_UPDATE_MAX_INLINE_SIZE) {
+			rc = batch_prep_inline_update_req(head, req, repsize);
+			if (rc == 0)
+				*reqp = req;
+			goto out_req;
+		}
+	}
+
+	req_capsule_set_size(&req->rq_pill, &RMF_BUT_HEADER, RCL_CLIENT,
+			     sizeof(struct but_update_header));
+	req_capsule_set_size(&req->rq_pill, &RMF_BUT_BUF, RCL_CLIENT,
+			     head->buh_buf_count * sizeof(*bub));
+
+	rc = ptlrpc_request_pack(req, LUSTRE_MDS_VERSION, MDS_BATCH);
+	if (rc != 0)
+		goto out_req;
+
+	buh = req_capsule_client_get(&req->rq_pill, &RMF_BUT_HEADER);
+	buh->buh_magic = BUT_HEADER_MAGIC;
+	buh->buh_count = head->buh_buf_count;
+	buh->buh_inline_length = 0;
+	buh->buh_reply_size = repsize;
+	buh->buh_update_count = head->buh_update_count;
+	bub = req_capsule_client_get(&req->rq_pill, &RMF_BUT_BUF);
+	list_for_each_entry(buf, &head->buh_buf_list, bub_item) {
+		bub->bub_size = buf->bub_size;
+		bub++;
+		/* First *and* last might be partial pages, hence +1 */
+		page_count += DIV_ROUND_UP(buf->bub_size, PAGE_SIZE) + 1;
+	}
+
+	req->rq_bulk_write = 1;
+	desc = ptlrpc_prep_bulk_imp(req, page_count,
+				    MD_MAX_BRW_SIZE >> LNET_MTU_BITS,
+				    PTLRPC_BULK_GET_SOURCE,
+				    MDS_BULK_PORTAL,
+				    &ptlrpc_bulk_kiov_nopin_ops);
+	if (!desc) {
+		rc = -ENOMEM;
+		goto out_req;
+	}
+
+	list_for_each_entry(buf, &head->buh_buf_list, bub_item) {
+		desc->bd_frag_ops->add_iov_frag(desc, buf->bub_req,
+						buf->bub_size);
+		total += buf->bub_size;
+	}
+	CDEBUG(D_OTHER, "Total %d in %u\n", total, head->buh_update_count);
+
+	req_capsule_set_size(&req->rq_pill, &RMF_BUT_REPLY,
+			     RCL_SERVER, repsize);
+
+	ptlrpc_request_set_replen(req);
+	req->rq_request_portal = OUT_PORTAL;
+	req->rq_reply_portal = OSC_REPLY_PORTAL;
+	*reqp = req;
+
+out_req:
+	if (rc < 0)
+		ptlrpc_req_finished(req);
+
+	return rc;
+}
+
+static struct batch_update_buffer *
+current_batch_update_buffer(struct batch_update_head *head)
+{
+	if (list_empty(&head->buh_buf_list))
+		return NULL;
+
+	return list_entry(head->buh_buf_list.prev, struct batch_update_buffer,
+			  bub_item);
+}
+
+static int batch_update_buffer_create(struct batch_update_head *head,
+				      size_t size)
+{
+	struct batch_update_buffer *buf;
+	struct batch_update_request *bur;
+
+	buf = kzalloc(sizeof(*buf), GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	LASSERT(size > 0);
+	size = round_up(size, PAGE_SIZE);
+	bur = kvzalloc(size, GFP_KERNEL);
+	if (!bur) {
+		kfree(buf);
+		return -ENOMEM;
+	}
+
+	bur->burq_magic = BUT_REQUEST_MAGIC;
+	bur->burq_count = 0;
+	buf->bub_req = bur;
+	buf->bub_size = size;
+	buf->bub_end = sizeof(*bur);
+	INIT_LIST_HEAD(&buf->bub_item);
+	list_add_tail(&buf->bub_item, &head->buh_buf_list);
+	head->buh_buf_count++;
+
+	return 0;
+}
+
+/**
+ * Destroy an @object_update_callback.
+ */
+static void object_update_callback_fini(struct object_update_callback *ouc)
+{
+	LASSERT(list_empty(&ouc->ouc_item));
+
+	kfree(ouc);
+}
+
+/**
+ * Insert an @object_update_callback into the @batch_update_head.
+ *
+ * Usually each update in @batch_update_head will have one corresponding
+ * callback, and these callbacks will be called in ->rq_interpret_reply.
+ */
+static int
+batch_insert_update_callback(struct batch_update_head *head, void *data,
+			     object_update_interpret_t interpret)
+{
+	struct object_update_callback *ouc;
+
+	ouc = kzalloc(sizeof(*ouc), GFP_KERNEL);
+	if (!ouc)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&ouc->ouc_item);
+	ouc->ouc_interpret = interpret;
+	ouc->ouc_head = head;
+	ouc->ouc_data = data;
+	list_add_tail(&ouc->ouc_item, &head->buh_cb_list);
+
+	return 0;
+}
+
+/**
+ * Allocate and initialize batch update request.
+ *
+ * @batch_update_head is used to track updates being executed on
+ * this OBD device. The update buffer is 4KB initially and is increased
+ * if needed.
+ */
+static struct batch_update_head *
+batch_update_request_create(struct obd_export *exp, struct lu_batch *bh)
+{
+	struct batch_update_head *head;
+	int rc;
+
+	head = kzalloc(sizeof(*head), GFP_KERNEL);
+	if (!head)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&head->buh_cb_list);
+	INIT_LIST_HEAD(&head->buh_buf_list);
+	head->buh_exp = exp;
+	head->buh_batch = bh;
+
+	rc = batch_update_buffer_create(head, PAGE_SIZE);
+	if (rc != 0) {
+		kfree(head);
+		return ERR_PTR(rc);
+	}
+
+	return head;
+}
+
+static void batch_update_request_destroy(struct batch_update_head *head)
+{
+	struct batch_update_buffer *bub, *tmp;
+
+	if (!head)
+		return;
+
+	list_for_each_entry_safe(bub, tmp, &head->buh_buf_list, bub_item) {
+		list_del(&bub->bub_item);
+		kvfree(bub->bub_req);
+		kfree(bub);
+	}
+
+	kfree(head);
+}
+
+static int batch_update_request_fini(struct batch_update_head *head,
+				     struct ptlrpc_request *req,
+				     struct batch_update_reply *reply, int rc)
+{
+	struct object_update_callback *ouc, *next;
+	struct lustre_msg *repmsg = NULL;
+	int count = 0;
+	int index = 0;
+
+	if (reply)
+		count = reply->burp_count;
+
+	list_for_each_entry_safe(ouc, next, &head->buh_cb_list, ouc_item) {
+		int rc1 = 0;
+
+		list_del_init(&ouc->ouc_item);
+
+		/*
+		 * The peer may only have handled some requests (indicated by
+		 * @count) in the packed OUT RPC, so we can only get results
+		 * for the handled part.
+		 */
+		if (index < count) {
+			repmsg = batch_update_repmsg_next(reply, repmsg);
+			if (!repmsg)
+				rc1 = -EPROTO;
+			else
+				rc1 = repmsg->lm_result;
+		} else {
+			/*
+			 * The peer did not handle these requests, so let us return
+			 * -ECANCELED to the update interpreter for now.
+			 */
+			repmsg = NULL;
+			rc1 = -ECANCELED;
+		}
+
+		if (ouc->ouc_interpret)
+			ouc->ouc_interpret(req, repmsg, ouc, rc1);
+
+		object_update_callback_fini(ouc);
+		if (rc == 0 && rc1 < 0)
+			rc = rc1;
+	}
+
+	batch_update_request_destroy(head);
+
+	return rc;
+}
+
+static int batch_update_interpret(const struct lu_env *env,
+				  struct ptlrpc_request *req,
+				  void *args, int rc)
+{
+	struct batch_update_args *aa = (struct batch_update_args *)args;
+	struct batch_update_reply *reply = NULL;
+
+	if (!aa->ba_head)
+		return 0;
+
+	ptlrpc_put_mod_rpc_slot(req);
+	/* Unpack the results from the reply message. */
+	if (req->rq_repmsg && req->rq_replied) {
+		reply = req_capsule_server_sized_get(&req->rq_pill,
+						     &RMF_BUT_REPLY,
+						     sizeof(*reply));
+		if ((!reply ||
+		     reply->burp_magic != BUT_REPLY_MAGIC) && rc == 0)
+			rc = -EPROTO;
+	}
+
+	rc = batch_update_request_fini(aa->ba_head, req, reply, rc);
+
+	return rc;
+}
+
+static int batch_send_update_req(const struct lu_env *env,
+				 struct batch_update_head *head)
+{
+	struct lu_batch *bh;
+	struct ptlrpc_request *req = NULL;
+	struct batch_update_args *aa;
+	int rc;
+
+	if (!head)
+		return 0;
+
+	bh = head->buh_batch;
+	rc = batch_prep_update_req(head, &req);
+	if (rc) {
+		rc = batch_update_request_fini(head, NULL, NULL, rc);
+		return rc;
+	}
+
+	aa = ptlrpc_req_async_args(aa, req);
+	aa->ba_head = head;
+	req->rq_interpret_reply = batch_update_interpret;
+
+	/*
+	 * Only acquire modification RPC slot for the batched RPC
+	 * which contains metadata updates.
+	 */
+	if (!(bh->lbt_flags & BATCH_FL_RDONLY))
+		ptlrpc_get_mod_rpc_slot(req);
+
+	if (bh->lbt_flags & BATCH_FL_SYNC) {
+		rc = ptlrpc_queue_wait(req);
+	} else {
+		if ((bh->lbt_flags & (BATCH_FL_RDONLY | BATCH_FL_RQSET)) ==
+		    BATCH_FL_RDONLY) {
+			ptlrpcd_add_req(req);
+		} else if (bh->lbt_flags & BATCH_FL_RQSET) {
+			ptlrpc_set_add_req(bh->lbt_rqset, req);
+			ptlrpc_check_set(env, bh->lbt_rqset);
+		} else {
+			ptlrpcd_add_req(req);
+		}
+		req = NULL;
+	}
+
+	if (req)
+		ptlrpc_req_finished(req);
+
+	return rc;
+}
+
+static int batch_update_request_add(struct batch_update_head **headp,
+				    struct md_op_item *item,
+				    md_update_pack_t packer,
+				    object_update_interpret_t interpreter)
+{
+	struct batch_update_head *head = *headp;
+	struct lu_batch *bh = head->buh_batch;
+	struct batch_update_buffer *buf;
+	struct lustre_msg *reqmsg;
+	size_t max_len;
+	int rc;
+
+	for (; ;) {
+		buf = current_batch_update_buffer(head);
+		LASSERT(buf);
+		max_len = buf->bub_size - buf->bub_end;
+		reqmsg = (struct lustre_msg *)((char *)buf->bub_req +
+						buf->bub_end);
+		rc = packer(head, reqmsg, &max_len, item);
+		if (rc == -E2BIG) {
+			int rc2;
+
+			/* Create new batch object update buffer */
+			rc2 = batch_update_buffer_create(head,
+				max_len + offsetof(struct batch_update_request,
+						   burq_reqmsg[0]) + 1);
+			if (rc2 != 0) {
+				rc = rc2;
+				break;
+			}
+		} else {
+			if (rc == 0) {
+				buf->bub_end += max_len;
+				buf->bub_req->burq_count++;
+				head->buh_update_count++;
+				head->buh_repsize += reqmsg->lm_repsize;
+			}
+			break;
+		}
+	}
+
+	if (rc)
+		goto out;
+
+	rc = batch_insert_update_callback(head, item, interpreter);
+	if (rc)
+		goto out;
+
+	/* Unplug the batch queue if enough update requests have accumulated. */
+	if (bh->lbt_max_count && head->buh_update_count >= bh->lbt_max_count) {
+		rc = batch_send_update_req(NULL, head);
+		*headp = NULL;
+	}
+out:
+	if (rc) {
+		batch_update_request_destroy(head);
+		*headp = NULL;
+	}
+
+	return rc;
+}
+
+struct lu_batch *cli_batch_create(struct obd_export *exp,
+				  enum lu_batch_flags flags, u32 max_count)
+{
+	struct cli_batch *cbh;
+	struct lu_batch *bh;
+
+	cbh = kzalloc(sizeof(*cbh), GFP_KERNEL);
+	if (!cbh)
+		return ERR_PTR(-ENOMEM);
+
+	bh = &cbh->cbh_super;
+	bh->lbt_result = 0;
+	bh->lbt_flags = flags;
+	bh->lbt_max_count = max_count;
+
+	cbh->cbh_head = batch_update_request_create(exp, bh);
+	if (IS_ERR(cbh->cbh_head)) {
+		bh = (struct lu_batch *)cbh->cbh_head;
+		kfree(cbh);
+	}
+
+	return bh;
+}
+EXPORT_SYMBOL(cli_batch_create);
+
+int cli_batch_stop(struct obd_export *exp, struct lu_batch *bh)
+{
+	struct cli_batch *cbh;
+	int rc;
+
+	cbh = container_of(bh, struct cli_batch, cbh_super);
+	rc = batch_send_update_req(NULL, cbh->cbh_head);
+
+	kfree(cbh);
+	return rc;
+}
+EXPORT_SYMBOL(cli_batch_stop);
+
+int cli_batch_flush(struct obd_export *exp, struct lu_batch *bh, bool wait)
+{
+	struct cli_batch *cbh;
+	int rc;
+
+	cbh = container_of(bh, struct cli_batch, cbh_super);
+	if (!cbh->cbh_head)
+		return 0;
+
+	rc = batch_send_update_req(NULL, cbh->cbh_head);
+	cbh->cbh_head = NULL;
+
+	return rc;
+}
+EXPORT_SYMBOL(cli_batch_flush);
+
+int cli_batch_add(struct obd_export *exp, struct lu_batch *bh,
+		  struct md_op_item *item, md_update_pack_t packer,
+		  object_update_interpret_t interpreter)
+{
+	struct cli_batch *cbh;
+	int rc;
+
+	cbh = container_of(bh, struct cli_batch, cbh_super);
+	if (!cbh->cbh_head) {
+		cbh->cbh_head = batch_update_request_create(exp, bh);
+		if (IS_ERR(cbh->cbh_head))
+			return PTR_ERR(cbh->cbh_head);
+	}
+
+	rc = batch_update_request_add(&cbh->cbh_head, item,
+				      packer, interpreter);
+
+	return rc;
+}
+EXPORT_SYMBOL(cli_batch_add);
diff --git a/fs/lustre/ptlrpc/client.c b/fs/lustre/ptlrpc/client.c
index 6c1d98d..c9a8c8f 100644
--- a/fs/lustre/ptlrpc/client.c
+++ b/fs/lustre/ptlrpc/client.c
@@ -70,6 +70,30 @@ static void ptlrpc_release_bulk_page_pin(struct ptlrpc_bulk_desc *desc)
 		put_page(desc->bd_vec[i].bv_page);
 }
 
+static int ptlrpc_prep_bulk_frag_pages(struct ptlrpc_bulk_desc *desc,
+				       void *frag, int len)
+{
+	unsigned int offset = (unsigned long)frag & ~PAGE_MASK;
+
+	while (len > 0) {
+		int page_len = min_t(unsigned int, PAGE_SIZE - offset,
+				     len);
+		struct page *pg;
+
+		if (is_vmalloc_addr(frag))
+			pg = vmalloc_to_page(frag);
+		else
+			pg = virt_to_page(frag);
+
+		ptlrpc_prep_bulk_page_nopin(desc, pg, offset, page_len);
+		offset = 0;
+		len -= page_len;
+		frag += page_len;
+	}
+
+	return desc->bd_nob;
+}
+
 const struct ptlrpc_bulk_frag_ops ptlrpc_bulk_kiov_pin_ops = {
 	.add_kiov_frag		= ptlrpc_prep_bulk_page_pin,
 	.release_frags		= ptlrpc_release_bulk_page_pin,
@@ -79,6 +103,7 @@ static void ptlrpc_release_bulk_page_pin(struct ptlrpc_bulk_desc *desc)
 const struct ptlrpc_bulk_frag_ops ptlrpc_bulk_kiov_nopin_ops = {
 	.add_kiov_frag		= ptlrpc_prep_bulk_page_nopin,
 	.release_frags		= NULL,
+	.add_iov_frag		= ptlrpc_prep_bulk_frag_pages,
 };
 EXPORT_SYMBOL(ptlrpc_bulk_kiov_nopin_ops);
 
diff --git a/fs/lustre/ptlrpc/layout.c b/fs/lustre/ptlrpc/layout.c
index 82ec899..0fe74ff 100644
--- a/fs/lustre/ptlrpc/layout.c
+++ b/fs/lustre/ptlrpc/layout.c
@@ -561,6 +561,17 @@
 	&RMF_CAPA2
 };
 
+static const struct req_msg_field *mds_batch_client[] = {
+	&RMF_PTLRPC_BODY,
+	&RMF_BUT_HEADER,
+	&RMF_BUT_BUF,
+};
+
+static const struct req_msg_field *mds_batch_server[] = {
+	&RMF_PTLRPC_BODY,
+	&RMF_BUT_REPLY,
+};
+
 static const struct req_msg_field *llog_origin_handle_create_client[] = {
 	&RMF_PTLRPC_BODY,
 	&RMF_LLOGD_BODY,
@@ -800,6 +811,7 @@
 	&RQF_LLOG_ORIGIN_HANDLE_PREV_BLOCK,
 	&RQF_LLOG_ORIGIN_HANDLE_READ_HEADER,
 	&RQF_CONNECT,
+	&RQF_MDS_BATCH,
 };
 
 struct req_msg_field {
@@ -1222,6 +1234,20 @@ struct req_msg_field RMF_OST_LADVISE =
 		    lustre_swab_ladvise, NULL);
 EXPORT_SYMBOL(RMF_OST_LADVISE);
 
+struct req_msg_field RMF_BUT_REPLY =
+			DEFINE_MSGF("batch_update_reply", 0, -1,
+				    lustre_swab_batch_update_reply, NULL);
+EXPORT_SYMBOL(RMF_BUT_REPLY);
+
+struct req_msg_field RMF_BUT_HEADER = DEFINE_MSGF("but_update_header", 0,
+				-1, lustre_swab_but_update_header, NULL);
+EXPORT_SYMBOL(RMF_BUT_HEADER);
+
+struct req_msg_field RMF_BUT_BUF = DEFINE_MSGF("but_update_buf",
+			RMF_F_STRUCT_ARRAY, sizeof(struct but_update_buffer),
+			lustre_swab_but_update_buffer, NULL);
+EXPORT_SYMBOL(RMF_BUT_BUF);
+
 /*
  * Request formats.
  */
@@ -1422,6 +1448,11 @@ struct req_format RQF_MDS_GET_INFO =
 			mds_getinfo_server);
 EXPORT_SYMBOL(RQF_MDS_GET_INFO);
 
+struct req_format RQF_MDS_BATCH =
+	DEFINE_REQ_FMT0("MDS_BATCH", mds_batch_client,
+			mds_batch_server);
+EXPORT_SYMBOL(RQF_MDS_BATCH);
+
 struct req_format RQF_LDLM_ENQUEUE =
 	DEFINE_REQ_FMT0("LDLM_ENQUEUE",
 			ldlm_enqueue_client, ldlm_enqueue_lvb_server);
@@ -1849,17 +1880,61 @@ int req_capsule_server_pack(struct req_capsule *pill)
 	LASSERT(fmt);
 
 	count = req_capsule_filled_sizes(pill, RCL_SERVER);
-	rc = lustre_pack_reply(pill->rc_req, count,
-			       pill->rc_area[RCL_SERVER], NULL);
-	if (rc != 0) {
-		DEBUG_REQ(D_ERROR, pill->rc_req,
-			  "Cannot pack %d fields in format '%s'",
-			  count, fmt->rf_name);
+	if (req_capsule_ptlreq(pill)) {
+		rc = lustre_pack_reply(pill->rc_req, count,
+				       pill->rc_area[RCL_SERVER], NULL);
+		if (rc != 0) {
+			DEBUG_REQ(D_ERROR, pill->rc_req,
+				  "Cannot pack %d fields in format '%s'",
+				   count, fmt->rf_name);
+		}
+	} else { /* SUB request */
+		u32 msg_len;
+
+		msg_len = lustre_msg_size_v2(count, pill->rc_area[RCL_SERVER]);
+		if (msg_len > pill->rc_reqmsg->lm_repsize) {
+			/* TODO: Check whether there is enough buffer size */
+			CDEBUG(D_INFO,
+			       "Overflow pack %d fields in format '%s' for the SUB request with message len %u:%u\n",
+			       count, fmt->rf_name, msg_len,
+			       pill->rc_reqmsg->lm_repsize);
+		}
+
+		rc = 0;
+		lustre_init_msg_v2(pill->rc_repmsg, count,
+				   pill->rc_area[RCL_SERVER], NULL);
 	}
+
 	return rc;
 }
 EXPORT_SYMBOL(req_capsule_server_pack);
 
+int req_capsule_client_pack(struct req_capsule *pill)
+{
+	const struct req_format *fmt;
+	int count;
+	int rc = 0;
+
+	LASSERT(pill->rc_loc == RCL_CLIENT);
+	fmt = pill->rc_fmt;
+	LASSERT(fmt);
+
+	count = req_capsule_filled_sizes(pill, RCL_CLIENT);
+	if (req_capsule_ptlreq(pill)) {
+		struct ptlrpc_request *req = pill->rc_req;
+
+		rc = lustre_pack_request(req, req->rq_import->imp_msg_magic,
+					 count, pill->rc_area[RCL_CLIENT],
+					 NULL);
+	} else {
+		/* Sub request in a batch PTLRPC request */
+		lustre_init_msg_v2(pill->rc_reqmsg, count,
+				   pill->rc_area[RCL_CLIENT], NULL);
+	}
+	return rc;
+}
+EXPORT_SYMBOL(req_capsule_client_pack);
+
 /**
  * Returns the PTLRPC request or reply (@loc) buffer offset of a @pill
  * corresponding to the given RMF (@field).
@@ -2050,6 +2125,7 @@ static void *__req_capsule_get(struct req_capsule *pill,
 	value = getter(msg, offset, len);
 
 	if (!value) {
+		LASSERT(pill->rc_req);
 		DEBUG_REQ(D_ERROR, pill->rc_req,
 			  "Wrong buffer for field '%s' (%u of %u) in format '%s', %u vs. %u (%s)",
 			  field->rmf_name, offset, lustre_msg_bufcount(msg),
@@ -2218,10 +2294,18 @@ u32 req_capsule_get_size(const struct req_capsule *pill,
  */
 u32 req_capsule_msg_size(struct req_capsule *pill, enum req_location loc)
 {
-	return lustre_msg_size(pill->rc_req->rq_import->imp_msg_magic,
-			       pill->rc_fmt->rf_fields[loc].nr,
-			       pill->rc_area[loc]);
+	if (req_capsule_ptlreq(pill)) {
+		return lustre_msg_size(pill->rc_req->rq_import->imp_msg_magic,
+				       pill->rc_fmt->rf_fields[loc].nr,
+				       pill->rc_area[loc]);
+	} else { /* SUB request in a batch request */
+		int count;
+
+		count = req_capsule_filled_sizes(pill, loc);
+		return lustre_msg_size_v2(count, pill->rc_area[loc]);
+	}
 }
+EXPORT_SYMBOL(req_capsule_msg_size);
 
 /**
  * While req_capsule_msg_size() computes the size of a PTLRPC request or reply
@@ -2373,16 +2457,32 @@ void req_capsule_shrink(struct req_capsule *pill,
 	LASSERTF(newlen <= len, "%s:%s, oldlen=%u, newlen=%u\n",
 		 fmt->rf_name, field->rmf_name, len, newlen);
 
+	len = lustre_shrink_msg(msg, offset, newlen, 1);
 	if (loc == RCL_CLIENT) {
-		pill->rc_req->rq_reqlen = lustre_shrink_msg(msg, offset, newlen,
-							    1);
+		if (req_capsule_ptlreq(pill))
+			pill->rc_req->rq_reqlen = len;
 	} else {
-		pill->rc_req->rq_replen = lustre_shrink_msg(msg, offset, newlen,
-							    1);
 		/* update also field size in reply lenghts arrays for possible
 		 * reply re-pack due to req_capsule_server_grow() call.
 		 */
 		req_capsule_set_size(pill, field, loc, newlen);
+		if (req_capsule_ptlreq(pill))
+			pill->rc_req->rq_replen = len;
 	}
 }
 EXPORT_SYMBOL(req_capsule_shrink);
+
+void req_capsule_set_replen(struct req_capsule *pill)
+{
+	if (req_capsule_ptlreq(pill)) {
+		ptlrpc_request_set_replen(pill->rc_req);
+	} else { /* SUB request in a batch request */
+		int count;
+
+		count = req_capsule_filled_sizes(pill, RCL_SERVER);
+		pill->rc_reqmsg->lm_repsize =
+			lustre_msg_size_v2(count,
+					   pill->rc_area[RCL_SERVER]);
+	}
+}
+EXPORT_SYMBOL(req_capsule_set_replen);
diff --git a/fs/lustre/ptlrpc/lproc_ptlrpc.c b/fs/lustre/ptlrpc/lproc_ptlrpc.c
index f3f8a71..af83902 100644
--- a/fs/lustre/ptlrpc/lproc_ptlrpc.c
+++ b/fs/lustre/ptlrpc/lproc_ptlrpc.c
@@ -98,6 +98,7 @@
 	{ MDS_HSM_CT_UNREGISTER,		"mds_hsm_ct_unregister" },
 	{ MDS_SWAP_LAYOUTS,			"mds_swap_layouts" },
 	{ MDS_RMFID,				"mds_rmfid" },
+	{ MDS_BATCH,				"mds_batch" },
 	{ LDLM_ENQUEUE,				"ldlm_enqueue" },
 	{ LDLM_CONVERT,				"ldlm_convert" },
 	{ LDLM_CANCEL,				"ldlm_cancel" },
diff --git a/fs/lustre/ptlrpc/pack_generic.c b/fs/lustre/ptlrpc/pack_generic.c
index 3499611..8d58f9b 100644
--- a/fs/lustre/ptlrpc/pack_generic.c
+++ b/fs/lustre/ptlrpc/pack_generic.c
@@ -491,7 +491,7 @@ static int lustre_unpack_msg_v2(struct lustre_msg_v2 *m, int len)
 		__swab32s(&m->lm_repsize);
 		__swab32s(&m->lm_cksum);
 		__swab32s(&m->lm_flags);
-		BUILD_BUG_ON(offsetof(typeof(*m), lm_padding_2) == 0);
+		__swab32s(&m->lm_opc);
 		BUILD_BUG_ON(offsetof(typeof(*m), lm_padding_3) == 0);
 	}
 
@@ -2591,6 +2591,31 @@ void lustre_swab_hsm_request(struct hsm_request *hr)
 	__swab32s(&hr->hr_data_len);
 }
 
+/* TODO: swab each sub reply message. */
+void lustre_swab_batch_update_reply(struct batch_update_reply *bur)
+{
+	__swab32s(&bur->burp_magic);
+	__swab16s(&bur->burp_count);
+	__swab16s(&bur->burp_padding);
+}
+
+void lustre_swab_but_update_header(struct but_update_header *buh)
+{
+	__swab32s(&buh->buh_magic);
+	__swab32s(&buh->buh_count);
+	__swab32s(&buh->buh_inline_length);
+	__swab32s(&buh->buh_reply_size);
+	__swab32s(&buh->buh_update_count);
+}
+EXPORT_SYMBOL(lustre_swab_but_update_header);
+
+void lustre_swab_but_update_buffer(struct but_update_buffer *bub)
+{
+	__swab32s(&bub->bub_size);
+	__swab32s(&bub->bub_padding);
+}
+EXPORT_SYMBOL(lustre_swab_but_update_buffer);
+
 void lustre_swab_swap_layouts(struct mdc_swap_layouts *msl)
 {
 	__swab64s(&msl->msl_flags);
diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 372dc10..2c02430 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -181,7 +181,9 @@ void lustre_assert_wire_constants(void)
 		 (long long)MDS_SWAP_LAYOUTS);
 	LASSERTF(MDS_RMFID == 62, "found %lld\n",
 		 (long long)MDS_RMFID);
-	LASSERTF(MDS_LAST_OPC == 63, "found %lld\n",
+	LASSERTF(MDS_BATCH == 63, "found %lld\n",
+		 (long long)MDS_BATCH);
+	LASSERTF(MDS_LAST_OPC == 64, "found %lld\n",
 		 (long long)MDS_LAST_OPC);
 	LASSERTF(REINT_SETATTR == 1, "found %lld\n",
 		 (long long)REINT_SETATTR);
@@ -661,10 +663,10 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct lustre_msg_v2, lm_flags));
 	LASSERTF((int)sizeof(((struct lustre_msg_v2 *)0)->lm_flags) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct lustre_msg_v2 *)0)->lm_flags));
-	LASSERTF((int)offsetof(struct lustre_msg_v2, lm_padding_2) == 24, "found %lld\n",
-		 (long long)(int)offsetof(struct lustre_msg_v2, lm_padding_2));
-	LASSERTF((int)sizeof(((struct lustre_msg_v2 *)0)->lm_padding_2) == 4, "found %lld\n",
-		 (long long)(int)sizeof(((struct lustre_msg_v2 *)0)->lm_padding_2));
+	LASSERTF((int)offsetof(struct lustre_msg_v2, lm_opc) == 24, "found %lld\n",
+		 (long long)(int)offsetof(struct lustre_msg_v2, lm_opc));
+	LASSERTF((int)sizeof(((struct lustre_msg_v2 *)0)->lm_opc) == 4, "found %lld\n",
+		 (long long)(int)sizeof(((struct lustre_msg_v2 *)0)->lm_opc));
 	LASSERTF((int)offsetof(struct lustre_msg_v2, lm_padding_3) == 28, "found %lld\n",
 		 (long long)(int)offsetof(struct lustre_msg_v2, lm_padding_3));
 	LASSERTF((int)sizeof(((struct lustre_msg_v2 *)0)->lm_padding_3) == 4, "found %lld\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 8cf9323..99735fc 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -544,7 +544,7 @@ struct lustre_msg_v2 {
 	__u32 lm_repsize;	/* size of preallocated reply buffer */
 	__u32 lm_cksum;		/* CRC32 of ptlrpc_body early reply messages */
 	__u32 lm_flags;		/* enum lustre_msghdr MSGHDR_* flags */
-	__u32 lm_padding_2;	/* unused */
+	__u32 lm_opc;		/* SUB request opcode in a batch request */
 	__u32 lm_padding_3;	/* unused */
 	__u32 lm_buflens[0];	/* length of additional buffers in bytes,
 				 * padded to a multiple of 8 bytes.
@@ -555,6 +555,9 @@ struct lustre_msg_v2 {
 	 */
 };
 
+/* The returned result of the SUB request in a batch request */
+#define lm_result	lm_opc
+
 /* ptlrpc_body packet pb_types */
 #define PTL_RPC_MSG_REQUEST	4711	/* normal RPC request message */
 #define PTL_RPC_MSG_ERR		4712	/* error reply if request unprocessed */
@@ -1428,6 +1431,7 @@ enum mds_cmd {
 	MDS_HSM_CT_UNREGISTER	= 60,
 	MDS_SWAP_LAYOUTS	= 61,
 	MDS_RMFID		= 62,
+	MDS_BATCH		= 63,
 	MDS_LAST_OPC
 };
 
@@ -2860,6 +2864,82 @@ struct hsm_progress_kernel {
 	__u64			hpk_padding2;
 } __attribute__((packed));
 
+#define OUT_UPDATE_MAX_INLINE_SIZE	4096
+
+#define BUT_REQUEST_MAGIC	0xBADE0001
+/* Holds batched updates being sent to the remote target in a single RPC */
+struct batch_update_request {
+	/* Magic number: BUT_REQUEST_MAGIC. */
+	__u32			burq_magic;
+	/* Number of sub requests packed in this batched RPC: burq_reqmsg[]. */
+	__u16			burq_count;
+	/* Unused padding field. */
+	__u16			burq_padding;
+	/*
+	 * Sub request message array. As the message field buffers for each
+	 * sub request are packed after the padded lustre_msg.lm_buflens[]
+	 * array, the next request message can be located via the function
+	 * @batch_update_reqmsg_next() in lustre/include/obj_update.h
+	 */
+	struct lustre_msg	burq_reqmsg[0];
+};
+
+#define BUT_HEADER_MAGIC	0xBADF0001
+/* Header for Batched UpdaTes request */
+struct but_update_header {
+	/* Magic number: BUT_HEADER_MAGIC */
+	__u32	buh_magic;
+	/*
+	 * When the total request buffer length is less than MAX_INLINE_SIZE,
+	 * @buh_count is set to 1 and the batched RPC request can be packed
+	 * inline.
+	 * Otherwise, @buh_count indicates the I/O vector count transferred
+	 * in bulk I/O.
+	 */
+	__u32	buh_count;
+	/* inline buffer length when the batched RPC can be packed inline. */
+	__u32	buh_inline_length;
+	/* The reply buffer size the client prepared. */
+	__u32	buh_reply_size;
+	/* Sub request count in this batched RPC. */
+	__u32	buh_update_count;
+	/* Unused padding field. */
+	__u32	buh_padding;
+	/* Inline buffer used when the RPC request can be packed inline. */
+	__u32	buh_inline_data[0];
+};
+
+struct but_update_buffer {
+	__u32	bub_size;
+	__u32	bub_padding;
+};
+
+#define BUT_REPLY_MAGIC	0x00AD0001
+/* Batched reply received from a remote target in a batched RPC. */
+struct batch_update_reply {
+	/* Magic number: BUT_REPLY_MAGIC. */
+	__u32			burp_magic;
+	/* Number of successfully returned sub requests. */
+	__u16			burp_count;
+	/* Unused padding field. */
+	__u16			burp_padding;
+	/*
+	 * Sub reply message array.
+	 * The next reply message buffer can be located via the function
+	 * @batch_update_repmsg_next() in lustre/include/obj_update.h
+	 */
+	struct lustre_msg	burp_repmsg[0];
+};
+
+/**
+ * Batch update opcode.
+ */
+enum batch_update_cmd {
+	BUT_GETATTR	= 1,
+	BUT_LAST_OPC,
+	BUT_FIRST_OPC	= BUT_GETATTR,
+};
+
 /** layout swap request structure
  * fid1 and fid2 are in mdt_body
  */
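
For illustration only, here is a rough sketch of how a sub request
message could be located from the previous one, based on the layout
described in the comments above (header and each message buffer padded
to an 8-byte boundary). The helper name here is hypothetical; the real
iterator is batch_update_reqmsg_next() in lustre/include/obj_update.h
and may differ in detail:

	/* hypothetical sketch, not the actual helper */
	static struct lustre_msg *
	sketch_batch_reqmsg_next(struct batch_update_request *bur,
				 struct lustre_msg *reqmsg)
	{
		__u32 size;
		__u32 i;

		if (!reqmsg)	/* first sub request */
			return &bur->burq_reqmsg[0];

		/* header, including the lm_buflens[] array, padded to 8 */
		size = round_up(offsetof(struct lustre_msg_v2, lm_buflens) +
				reqmsg->lm_bufcount * sizeof(__u32), 8);
		/* each message buffer is padded to 8 bytes as well */
		for (i = 0; i < reqmsg->lm_bufcount; i++)
			size += round_up(reqmsg->lm_buflens[i], 8);

		return (struct lustre_msg *)((char *)reqmsg + size);
	}
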
-- 
1.8.3.1


* [lustre-devel] [PATCH 02/40] lustre: lov: fiemap improperly handles fm_extent_count=0
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 01/40] lustre: protocol: basic batching processing framework James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 03/40] lustre: llite: SIGBUS is possible on a race with page reclaim James Simmons
                   ` (37 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Andrew Perepechko, Lustre Development List

From: Andrew Perepechko <andrew.perepechko@hpe.com>

FIEMAP calls with fm_extent_count=0 are supposed only to
return the number of extents.

lov_object_fiemap() attempts to initialize stripe_last
based on fiemap->fm_extents[0], which is not initialized
in userspace and not even allocated in kernelspace.

Eventually, the call exits with -EINVAL and a "FIEMAP does
not init start entry" kernel log message.
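
As a reminder of the interface (a minimal userspace sketch, not part
of this patch), a FIEMAP call that only queries the number of extents
passes fm_extent_count=0 and never fills fm_extents[]:

	#include <sys/ioctl.h>
	#include <linux/fs.h>
	#include <linux/fiemap.h>

	/* returns the extent count, or -1 on error */
	static int count_extents(int fd)
	{
		struct fiemap fm = {
			.fm_start        = 0,
			.fm_length       = FIEMAP_MAX_OFFSET,
			.fm_extent_count = 0,	/* count only, no extent data */
		};

		if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
			return -1;

		/* fm_extents[0] was never allocated or written, so the
		 * filesystem must not read it in this mode
		 */
		return fm.fm_mapped_extents;
	}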

Fixes: f39704f6e1 ("lustre: lov: FIEMAP support for PFL and FLR file")
HPE-bug-id: LUS-11443
WC-bug-id: https://jira.whamcloud.com/browse/LU-16480
Lustre-commit: 829af7b029d8e4e39 ("LU-16480 lov: fiemap improperly handles fm_extent_count=0")
Signed-off-by: Andrew Perepechko <andrew.perepechko@hpe.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49645
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/lov/lov_object.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/lov/lov_object.c b/fs/lustre/lov/lov_object.c
index 34cb6a0..5d65aab 100644
--- a/fs/lustre/lov/lov_object.c
+++ b/fs/lustre/lov/lov_object.c
@@ -1896,7 +1896,7 @@ static int lov_object_fiemap(const struct lu_env *env, struct cl_object *obj,
 	struct fiemap_state fs = { NULL };
 	struct lu_extent range;
 	int cur_ext;
-	int stripe_last;
+	int stripe_last = 0;
 	int start_stripe = 0;
 	bool resume = false;
 
@@ -1992,9 +1992,10 @@ static int lov_object_fiemap(const struct lu_env *env, struct cl_object *obj,
 	 * the high 16bits of fe_device remember which stripe the last
 	 * call has been arrived, we'd continue from there in this call.
 	 */
-	if (fiemap->fm_extent_count && fiemap->fm_extents[0].fe_logical)
+	if (fiemap->fm_extent_count && fiemap->fm_extents[0].fe_logical) {
 		resume = true;
-	stripe_last = get_fe_stripenr(&fiemap->fm_extents[0]);
+		stripe_last = get_fe_stripenr(&fiemap->fm_extents[0]);
+	}
 	/**
 	 * stripe_last records stripe number we've been processed in the last
 	 * call
-- 
1.8.3.1


* [lustre-devel] [PATCH 03/40] lustre: llite: SIGBUS is possible on a race with page reclaim
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 01/40] lustre: protocol: basic batching processing framework James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 02/40] lustre: lov: fiemap improperly handles fm_extent_count=0 James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 04/40] lustre: osc: page fault in osc_release_bounce_pages() James Simmons
                   ` (36 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Patrick Farrell, Andrew Perepechko, Lustre Development List

From: Andrew Perepechko <andrew.perepechko@hpe.com>

We can restart fault handling if page truncation happens
in parallel with the fault handler.
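
The fix relies on the standard seqlock read/retry idiom; as a detached
sketch (the "sketch_" names are illustrative, not the patch's symbols),
the invalidation side and the fault side cooperate roughly like this:

	#include <linux/seqlock.h>
	#include <linux/mm.h>

	static DEFINE_SEQLOCK(sketch_inv_lock);

	static void sketch_invalidate_page(void)
	{
		write_seqlock(&sketch_inv_lock);
		/* ... delete the page / clear PageUptodate ... */
		write_sequnlock(&sketch_inv_lock);
	}

	static vm_fault_t sketch_fault(struct vm_fault *vmf)
	{
		unsigned int seq;
		vm_fault_t ret;

		/* retry the fault if an invalidation ran concurrently and
		 * the result looks like the race (a spurious SIGBUS)
		 */
		do {
			seq = read_seqbegin(&sketch_inv_lock);
			ret = filemap_fault(vmf);
		} while (read_seqretry(&sketch_inv_lock, seq) &&
			 (ret & VM_FAULT_SIGBUS));

		return ret;
	}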

WC-bug-id: https://jira.whamcloud.com/browse/LU-16160
Lustre-commit: b4da788a819f82d35 ("LU-16160 llite: SIGBUS is possible on a race with page reclaim")
Signed-off-by: Andrew Perepechko <andrew.perepechko@hpe.com>
Signed-off-by: Patrick Farrell <farr0186@gmail.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49647
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h |  4 ++++
 fs/lustre/llite/llite_lib.c      |  1 +
 fs/lustre/llite/llite_mmap.c     | 19 +++++++++++++++++++
 fs/lustre/llite/vvp_page.c       | 37 +++++++++++++++++++++++++++++++++++++
 fs/lustre/obdclass/cl_page.c     | 18 ------------------
 5 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index c42330e..0dac71d 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -47,6 +47,7 @@
 #include <linux/compat.h>
 #include <linux/aio.h>
 #include <linux/parser.h>
+#include <linux/seqlock.h>
 #include <lustre_crypto.h>
 #include <range_lock.h>
 #include <linux/namei.h>
@@ -287,6 +288,7 @@ struct ll_inode_info {
 	struct mutex			lli_xattrs_enq_lock;
 	struct list_head		lli_xattrs; /* ll_xattr_entry->xe_list */
 	struct list_head		lli_lccs; /* list of ll_cl_context */
+	seqlock_t			lli_page_inv_lock;
 };
 
 static inline void ll_trunc_sem_init(struct ll_trunc_sem *sem)
@@ -1834,4 +1836,6 @@ int ll_file_open_encrypt(struct inode *inode, struct file *filp)
 bool ll_foreign_is_openable(struct dentry *dentry, unsigned int flags);
 bool ll_foreign_is_removable(struct dentry *dentry, bool unset);
 
+int ll_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
+
 #endif /* LLITE_INTERNAL_H */
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 30056a6..f84b6f5 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -1213,6 +1213,7 @@ void ll_lli_init(struct ll_inode_info *lli)
 	memset(lli->lli_jobid, 0, sizeof(lli->lli_jobid));
 	/* ll_cl_context initialize */
 	INIT_LIST_HEAD(&lli->lli_lccs);
+	seqlock_init(&lli->lli_page_inv_lock);
 }
 
 int ll_fill_super(struct super_block *sb)
diff --git a/fs/lustre/llite/llite_mmap.c b/fs/lustre/llite/llite_mmap.c
index 4acc7ee..db069de 100644
--- a/fs/lustre/llite/llite_mmap.c
+++ b/fs/lustre/llite/llite_mmap.c
@@ -257,6 +257,25 @@ static inline vm_fault_t to_fault_error(int result)
 	return result;
 }
 
+int ll_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	int ret;
+	unsigned int seq;
+
+	/* this seqlock lets us notice if a page has been deleted on this inode
+	 * during the fault process, allowing us to catch an erroneous SIGBUS.
+	 * See LU-16160
+	 */
+	do {
+		seq = read_seqbegin(&ll_i2info(inode)->lli_page_inv_lock);
+		ret = filemap_fault(vmf);
+	} while (read_seqretry(&ll_i2info(inode)->lli_page_inv_lock, seq) &&
+		 (ret & VM_FAULT_SIGBUS));
+
+	return ret;
+}
+
 /**
  * Lustre implementation of a vm_operations_struct::fault() method, called by
  * VM to server page fault (both in kernel and user space).
diff --git a/fs/lustre/llite/vvp_page.c b/fs/lustre/llite/vvp_page.c
index f359596..30524fd 100644
--- a/fs/lustre/llite/vvp_page.c
+++ b/fs/lustre/llite/vvp_page.c
@@ -63,6 +63,42 @@ static void vvp_page_discard(const struct lu_env *env,
 		ll_ra_stats_inc(vmpage->mapping->host, RA_STAT_DISCARDED);
 }
 
+static void vvp_page_delete(const struct lu_env *env,
+			    const struct cl_page_slice *slice)
+{
+	struct cl_page *cp = slice->cpl_page;
+
+	if (cp->cp_type == CPT_CACHEABLE) {
+		struct page *vmpage = cp->cp_vmpage;
+		struct inode *inode = vmpage->mapping->host;
+
+		LASSERT(PageLocked(vmpage));
+		LASSERT((struct cl_page *)vmpage->private == cp);
+
+		/* Drop the reference count held in vvp_page_init */
+		refcount_dec(&cp->cp_ref);
+
+		ClearPagePrivate(vmpage);
+		vmpage->private = 0;
+
+		/* ClearPageUptodate() prevents the page from being read by the
+		 * kernel after it has been deleted from Lustre, which avoids
+		 * potential stale data reads.  The seqlock allows us to see
+		 * that a page was potentially deleted and catch the resulting
+		 * SIGBUS - see ll_filemap_fault() (LU-16160)
+		 */
+		write_seqlock(&ll_i2info(inode)->lli_page_inv_lock);
+		ClearPageUptodate(vmpage);
+		write_sequnlock(&ll_i2info(inode)->lli_page_inv_lock);
+
+		/*
+		 * The reference from vmpage to cl_page is removed,
+		 * but the reference back is still here. It is removed
+		 * later in cl_page_free().
+		 */
+	}
+}
+
 /**
  * Handles page transfer errors at VM level.
  *
@@ -146,6 +182,7 @@ static void vvp_page_completion_write(const struct lu_env *env,
 }
 
 static const struct cl_page_operations vvp_page_ops = {
+	.cpo_delete		= vvp_page_delete,
 	.cpo_discard		= vvp_page_discard,
 	.io = {
 		[CRT_READ] = {
diff --git a/fs/lustre/obdclass/cl_page.c b/fs/lustre/obdclass/cl_page.c
index 7011235..62d8ee5 100644
--- a/fs/lustre/obdclass/cl_page.c
+++ b/fs/lustre/obdclass/cl_page.c
@@ -704,7 +704,6 @@ void cl_page_discard(const struct lu_env *env,
 static void __cl_page_delete(const struct lu_env *env, struct cl_page *cp)
 {
 	const struct cl_page_slice *slice;
-	struct page *vmpage;
 	int i;
 
 	PASSERT(env, cp, cp->cp_state != CPS_FREEING);
@@ -719,23 +718,6 @@ static void __cl_page_delete(const struct lu_env *env, struct cl_page *cp)
 		if (slice->cpl_ops->cpo_delete)
 			(*slice->cpl_ops->cpo_delete)(env, slice);
 	}
-
-	if (cp->cp_type == CPT_CACHEABLE) {
-		vmpage = cp->cp_vmpage;
-		LASSERT(PageLocked(vmpage));
-		LASSERT((struct cl_page *)vmpage->private == cp);
-
-		/* Drop the reference count held in vvp_page_init */
-		refcount_dec(&cp->cp_ref);
-		ClearPagePrivate(vmpage);
-		vmpage->private = 0;
-
-		/*
-		 * The reference from vmpage to cl_page is removed,
-		 * but the reference back is still here. It is removed
-		 * later in cl_page_free().
-		 */
-	}
 }
 
 /**
-- 
1.8.3.1


* [lustre-devel] [PATCH 04/40] lustre: osc: page fault in osc_release_bounce_pages()
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (2 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 03/40] lustre: llite: SIGBUS is possible on a race with page reclaim James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 05/40] lustre: readahead: add stats for read-ahead page count James Simmons
                   ` (35 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Andriy Skulysh, Lustre Development List

From: Andriy Skulysh <andriy.skulysh@hpe.com>

pga[i] can be uninitialized. It happens after the following
code path in osc_build_rpc():

	oa = kmem_cache_zalloc(osc_obdo_kmem, GFP_NOFS);
	if (!oa) {
		rc = -ENOMEM;
		goto out;
	}

Fixes: ef93d889b4c6 ("lustre: sec: encryption for write path")
HPE-bug-id: LUS-10991
WC-bug-id: https://jira.whamcloud.com/browse/LU-16333
Signed-off-by: Andriy Skulysh <andriy.skulysh@hpe.com>
Reviewed-by: Alexander Zarochentsev <alexander.zarochentsev@hpe.com>
Reviewed-by: Alexander Boyko <c17825@cray.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49210
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/osc/osc_request.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/lustre/osc/osc_request.c b/fs/lustre/osc/osc_request.c
index bd294c5..6ea1db6 100644
--- a/fs/lustre/osc/osc_request.c
+++ b/fs/lustre/osc/osc_request.c
@@ -1463,6 +1463,9 @@ static inline void osc_release_bounce_pages(struct brw_page **pga,
 	struct page **pa = NULL;
 	int i, j = 0;
 
+	if (!pga[0])
+		return;
+
 	if (PageChecked(pga[0]->pg)) {
 		pa = kvmalloc_array(page_count, sizeof(*pa),
 				    GFP_KERNEL | __GFP_ZERO);
-- 
1.8.3.1


* [lustre-devel] [PATCH 05/40] lustre: readahead: add stats for read-ahead page count
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (3 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 04/40] lustre: osc: page fault in osc_release_bounce_pages() James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 06/40] lustre: quota: enforce project quota for root James Simmons
                   ` (34 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Qian Yingjin <qian@ddn.com>

This patch adds the stats for read-ahead page count:

lctl get_param llite.*.read_ahead_stats
llite.lustre-ffff938b7849d000.read_ahead_stats=
snapshot_time           4011.320890492 secs.nsecs
start_time              0.000000000 secs.nsecs
elapsed_time            4011.320890492 secs.nsecs
hits                    4 samples [pages]
misses                  1 samples [pages]
zero_size_window        4 samples [pages]
failed_to_reach_end     1 samples [pages]
failed_to_fast_read     1 samples [pages]
readahead_pages         1 samples [pages] 255 255 255

WC-bug-id: https://jira.whamcloud.com/browse/LU-16338
Lustre-commit: cdcf97e17e73dfdd6 ("LU-16338 readahead: add stats for read-ahead page count")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49224
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <farr0186@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h |  1 +
 fs/lustre/llite/lproc_llite.c    | 15 ++++++++++++---
 fs/lustre/llite/rw.c             | 12 ++++++++++++
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 0dac71d..1d85d0b 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -578,6 +578,7 @@ enum ra_stat {
 	RA_STAT_ASYNC,
 	RA_STAT_FAILED_FAST_READ,
 	RA_STAT_MMAP_RANGE_READ,
+	RA_STAT_READAHEAD_PAGES,
 	_NR_RA_STAT,
 };
 
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 3d64a93..70dbc87 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -1858,6 +1858,7 @@ void ll_stats_ops_tally(struct ll_sb_info *sbi, int op, long count)
 	[RA_STAT_ASYNC]			= "async readahead",
 	[RA_STAT_FAILED_FAST_READ]	= "failed to fast read",
 	[RA_STAT_MMAP_RANGE_READ]	= "mmap range read",
+	[RA_STAT_READAHEAD_PAGES]	= "readahead_pages",
 };
 
 int ll_debugfs_register_super(struct super_block *sb, const char *name)
@@ -1911,9 +1912,17 @@ int ll_debugfs_register_super(struct super_block *sb, const char *name)
 		goto out_stats;
 	}
 
-	for (id = 0; id < ARRAY_SIZE(ra_stat_string); id++)
-		lprocfs_counter_init(sbi->ll_ra_stats, id, LPROCFS_TYPE_PAGES,
-				     ra_stat_string[id]);
+	for (id = 0; id < ARRAY_SIZE(ra_stat_string); id++) {
+		if (id == RA_STAT_READAHEAD_PAGES)
+			lprocfs_counter_init(sbi->ll_ra_stats, id,
+					     LPROCFS_TYPE_PAGES |
+					     LPROCFS_CNTR_AVGMINMAX,
+					     ra_stat_string[id]);
+		else
+			lprocfs_counter_init(sbi->ll_ra_stats, id,
+					     LPROCFS_TYPE_PAGES,
+					     ra_stat_string[id]);
+	}
 
 	debugfs_create_file("read_ahead_stats", 0644, sbi->ll_debugfs_entry,
 			    sbi->ll_ra_stats, &lprocfs_stats_seq_fops);
diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index 2290b31..0b14ea6 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -150,6 +150,14 @@ void ll_ra_stats_inc(struct inode *inode, enum ra_stat which)
 	ll_ra_stats_inc_sbi(sbi, which);
 }
 
+void ll_ra_stats_add(struct inode *inode, enum ra_stat which, long count)
+{
+	struct ll_sb_info *sbi = ll_i2sbi(inode);
+
+	LASSERTF(which < _NR_RA_STAT, "which: %u\n", which);
+	lprocfs_counter_add(sbi->ll_ra_stats, which, count);
+}
+
 #define RAS_CDEBUG(ras) \
 	CDEBUG(D_READA,							     \
 	       "lre %llu cr %lu cb %llu wsi %lu wp %lu nra %lu rpc %lu r %lu csr %lu so %llu sb %llu sl %llu lr %lu\n", \
@@ -528,6 +536,10 @@ static bool ras_inside_ra_window(pgoff_t idx, struct ra_io_arg *ria)
 	}
 	cl_read_ahead_release(env, &ra);
 
+	if (count)
+		ll_ra_stats_add(vvp_object_inode(io->ci_obj),
+				RA_STAT_READAHEAD_PAGES, count);
+
 	return count;
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 06/40] lustre: quota: enforce project quota for root
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (4 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 05/40] lustre: readahead: add stats for read-ahead page count James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 07/40] lustre: ldlm: send the cancel RPC asap James Simmons
                   ` (33 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Sergey Cheremencev, Lustre Development List

From: Sergey Cheremencev <scherementsev@ddn.com>

This patch adds an option to enforce project quotas for root.
It is disabled by default; to enable it, set
osd-ldiskfs.*.quota_slave.root_prj_enable to 1
on each target where you need this option.

WC-bug-id: https://jira.whamcloud.com/browse/LU-16415
Lustre-commit: f147655c33ea61450 ("LU-16415 quota: enforce project quota for root")
Signed-off-by: Sergey Cheremencev <scherementsev@ddn.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49460
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd.h                | 8 ++++----
 fs/lustre/osc/osc_cache.c              | 2 +-
 fs/lustre/osc/osc_quota.c              | 1 +
 include/uapi/linux/lustre/lustre_idl.h | 2 ++
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/include/obd.h b/fs/lustre/include/obd.h
index a980bf0..54bef2e 100644
--- a/fs/lustre/include/obd.h
+++ b/fs/lustre/include/obd.h
@@ -208,8 +208,10 @@ struct client_obd {
 	unsigned int		 cl_checksum:1,	/* 0 = disabled, 1 = enabled */
 				 cl_checksum_dump:1, /* same */
 				 cl_ocd_grant_param:1,
-				 cl_lsom_update:1; /* send LSOM updates */
-	/* supported checksum types that are worked out at connect time */
+				 cl_lsom_update:1, /* send LSOM updates */
+				 cl_root_squash:1, /* if root squash enabled*/
+				 /* check prj quota for root */
+				 cl_root_prjquota:1;
 	enum lustre_sec_part     cl_sp_me;
 	enum lustre_sec_part     cl_sp_to;
 	struct sptlrpc_flavor    cl_flvr_mgc;   /* fixed flavor of mgc->mgs */
@@ -233,8 +235,6 @@ struct client_obd {
 	struct list_head	cl_grant_chain;
 	time64_t		cl_grant_shrink_interval; /* seconds */
 
-	int			cl_root_squash; /* if root squash enabled*/
-
 	/* A chunk is an optimal size used by osc_extent to determine
 	 * the extent size. A chunk is max(PAGE_SIZE, OST block size)
 	 */
diff --git a/fs/lustre/osc/osc_cache.c b/fs/lustre/osc/osc_cache.c
index b339aef..dddf98f 100644
--- a/fs/lustre/osc/osc_cache.c
+++ b/fs/lustre/osc/osc_cache.c
@@ -2366,7 +2366,7 @@ int osc_queue_async_io(const struct lu_env *env, struct cl_io *io,
 	 * we should bypass quota
 	 */
 	if ((!oio->oi_cap_sys_resource ||
-	     cli->cl_root_squash) &&
+	     cli->cl_root_squash || cli->cl_root_prjquota) &&
 	    !io->ci_noquota) {
 		struct cl_object *obj;
 		struct cl_attr *attr;
diff --git a/fs/lustre/osc/osc_quota.c b/fs/lustre/osc/osc_quota.c
index 708ad3c..c48a89f3 100644
--- a/fs/lustre/osc/osc_quota.c
+++ b/fs/lustre/osc/osc_quota.c
@@ -120,6 +120,7 @@ int osc_quota_setdq(struct client_obd *cli, u64 xid, const unsigned int qid[],
 
 	mutex_lock(&cli->cl_quota_mutex);
 	cli->cl_root_squash = !!(flags & OBD_FL_ROOT_SQUASH);
+	cli->cl_root_prjquota = !!(flags & OBD_FL_ROOT_PRJQUOTA);
 	/* still mark the quots is running out for the old request, because it
 	 * could be processed after the new request at OST, the side effect is
 	 * the following request will be processed synchronously, but it will
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index 99735fc..b4185a7 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -961,6 +961,7 @@ enum obdo_flags {
 	OBD_FL_FLUSH		= 0x00200000, /* flush pages on the OST */
 	OBD_FL_SHORT_IO		= 0x00400000, /* short io request */
 	OBD_FL_ROOT_SQUASH	= 0x00800000, /* root squash */
+	OBD_FL_ROOT_PRJQUOTA	= 0x01000000, /* check prj quota for root */
 	/* OBD_FL_LOCAL_MASK = 0xF0000000, was local-only flags until 2.10 */
 
 	/*
@@ -1250,6 +1251,7 @@ struct hsm_state_set {
 				      * it to sync quickly
 				      */
 #define OBD_BRW_OVER_PRJQUOTA 0x8000 /* Running out of project quota */
+#define OBD_BRW_ROOT_PRJQUOTA 0x10000 /* check project quota for root */
 #define OBD_BRW_RDMA_ONLY    0x20000 /* RPC contains RDMA-only pages*/
 #define OBD_BRW_SYS_RESOURCE 0x40000 /* page has CAP_SYS_RESOURCE */
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 07/40] lustre: ldlm: send the cancel RPC asap
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (5 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 06/40] lustre: quota: enforce project quota for root James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 08/40] lustre: enc: align Base64 encoding with RFC 4648 base64url James Simmons
                   ` (32 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Yang Sheng, Lustre Development List

From: Yang Sheng <ys@whamcloud.com>

This patch tries to send the cancel RPC as soon as possible
when a bl_ast is received from the server. The existing
problem is that the lock could already have been added to
the regular cancel queue, for some other reason, before the
bl_ast arrived, which prevents the lock from being canceled
in a timely manner. The other problem is that we collect
many locks into one RPC to save network traffic, but this
process can take a long time while dirty pages are being
flushed.

- The lock cancel is still processed even if the lock was
  added to the bl queue before the bl_ast arrived, unless
  the cancel RPC has already been sent.
- Send the cancel RPC immediately for a bl_ast lock; don't
  try to add more locks in that case.

WC-bug-id: https://jira.whamcloud.com/browse/LU-16285
Lustre-commit: b65374d96b2027213 ("LU-16285 ldlm: send the cancel RPC asap")
Signed-off-by: Yang Sheng <ys@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49527
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Qian Yingjin <qian@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h |   1 +
 fs/lustre/ldlm/ldlm_lockd.c    |   9 ++--
 fs/lustre/ldlm/ldlm_request.c  | 100 ++++++++++++++++++++++++++++-------------
 3 files changed, 75 insertions(+), 35 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index d08c48f..3a4f152 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -593,6 +593,7 @@ enum ldlm_cancel_flags {
 	LCF_BL_AST     = 0x4, /* Cancel locks marked as LDLM_FL_BL_AST
 			       * in the same RPC
 			       */
+	LCF_ONE_LOCK	= 0x8,	/* Cancel locks pack only one lock. */
 };
 
 struct ldlm_flock {
diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index 0ff4e3a..3a085db 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -700,8 +700,7 @@ static int ldlm_callback_handler(struct ptlrpc_request *req)
 		 * we can tell the server we have no lock. Otherwise, we
 		 * should send cancel after dropping the cache.
 		 */
-		if ((ldlm_is_canceling(lock) && ldlm_is_bl_done(lock)) ||
-		    ldlm_is_failed(lock)) {
+		if (ldlm_is_ast_sent(lock) || ldlm_is_failed(lock)) {
 			LDLM_DEBUG(lock,
 				   "callback on lock %#llx - lock disappeared",
 				   dlm_req->lock_handle[0].cookie);
@@ -736,7 +735,7 @@ static int ldlm_callback_handler(struct ptlrpc_request *req)
 
 	switch (lustre_msg_get_opc(req->rq_reqmsg)) {
 	case LDLM_BL_CALLBACK:
-		CDEBUG(D_INODE, "blocking ast\n");
+		LDLM_DEBUG(lock, "blocking ast\n");
 		req_capsule_extend(&req->rq_pill, &RQF_LDLM_BL_CALLBACK);
 		if (!ldlm_is_cancel_on_block(lock)) {
 			rc = ldlm_callback_reply(req, 0);
@@ -748,14 +747,14 @@ static int ldlm_callback_handler(struct ptlrpc_request *req)
 			ldlm_handle_bl_callback(ns, &dlm_req->lock_desc, lock);
 		break;
 	case LDLM_CP_CALLBACK:
-		CDEBUG(D_INODE, "completion ast\n");
+		LDLM_DEBUG(lock, "completion ast\n");
 		req_capsule_extend(&req->rq_pill, &RQF_LDLM_CP_CALLBACK);
 		rc = ldlm_handle_cp_callback(req, ns, dlm_req, lock);
 		if (!OBD_FAIL_CHECK(OBD_FAIL_LDLM_CANCEL_BL_CB_RACE))
 			ldlm_callback_reply(req, rc);
 		break;
 	case LDLM_GL_CALLBACK:
-		CDEBUG(D_INODE, "glimpse ast\n");
+		LDLM_DEBUG(lock, "glimpse ast\n");
 		req_capsule_extend(&req->rq_pill, &RQF_LDLM_GL_CALLBACK);
 		ldlm_handle_gl_callback(req, ns, dlm_req, lock);
 		break;
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index 8b244d7..ef3ad28 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -994,14 +994,34 @@ static u64 ldlm_cli_cancel_local(struct ldlm_lock *lock)
 	return rc;
 }
 
+static inline int __ldlm_pack_lock(struct ldlm_lock *lock,
+				   struct ldlm_request *dlm)
+{
+	LASSERT(lock->l_conn_export);
+	lock_res_and_lock(lock);
+	if (ldlm_is_ast_sent(lock)) {
+		unlock_res_and_lock(lock);
+		return 0;
+	}
+	ldlm_set_ast_sent(lock);
+	unlock_res_and_lock(lock);
+
+	/* Pack the lock handle to the given request buffer. */
+	LDLM_DEBUG(lock, "packing");
+	dlm->lock_handle[dlm->lock_count++] = lock->l_remote_handle;
+
+	return 1;
+}
+#define ldlm_cancel_pack(req, head, count) \
+		_ldlm_cancel_pack(req, NULL, head, count)
+
 /**
  * Pack @count locks in @head into ldlm_request buffer of request @req.
  */
-static void ldlm_cancel_pack(struct ptlrpc_request *req,
+static int _ldlm_cancel_pack(struct ptlrpc_request *req, struct ldlm_lock *lock,
 			     struct list_head *head, int count)
 {
 	struct ldlm_request *dlm;
-	struct ldlm_lock *lock;
 	int max, packed = 0;
 
 	dlm = req_capsule_client_get(&req->rq_pill, &RMF_DLM_REQ);
@@ -1019,24 +1039,23 @@ static void ldlm_cancel_pack(struct ptlrpc_request *req,
 	 * so that the server cancel would call filter_lvbo_update() less
 	 * frequently.
 	 */
-	list_for_each_entry(lock, head, l_bl_ast) {
-		if (!count--)
-			break;
-		LASSERT(lock->l_conn_export);
-		/* Pack the lock handle to the given request buffer. */
-		LDLM_DEBUG(lock, "packing");
-		dlm->lock_handle[dlm->lock_count++] = lock->l_remote_handle;
-		packed++;
+	if (lock) { /* only pack one lock */
+		packed = __ldlm_pack_lock(lock, dlm);
+	} else {
+		list_for_each_entry(lock, head, l_bl_ast) {
+			if (!count--)
+				break;
+			packed += __ldlm_pack_lock(lock, dlm);
+		}
 	}
-	CDEBUG(D_DLMTRACE, "%d locks packed\n", packed);
+	return packed;
 }
 
 /**
  * Prepare and send a batched cancel RPC. It will include @count lock
  * handles of locks given in @cancels list.
  */
-static int ldlm_cli_cancel_req(struct obd_export *exp,
-			       struct list_head *cancels,
+static int ldlm_cli_cancel_req(struct obd_export *exp, void *ptr,
 			       int count, enum ldlm_cancel_flags flags)
 {
 	struct ptlrpc_request *req = NULL;
@@ -1085,7 +1104,15 @@ static int ldlm_cli_cancel_req(struct obd_export *exp,
 		req->rq_reply_portal = LDLM_CANCEL_REPLY_PORTAL;
 		ptlrpc_at_set_req_timeout(req);
 
-		ldlm_cancel_pack(req, cancels, count);
+		if (flags & LCF_ONE_LOCK)
+			rc = _ldlm_cancel_pack(req, ptr, NULL, count);
+		else
+			rc = _ldlm_cancel_pack(req, NULL, ptr, count);
+		if (rc == 0) {
+			ptlrpc_req_finished(req);
+			sent = count;
+			goto out;
+		}
 
 		ptlrpc_request_set_replen(req);
 		if (flags & LCF_ASYNC) {
@@ -1235,10 +1262,10 @@ int ldlm_cli_convert(struct ldlm_lock *lock,
  * Lock must not have any readers or writers by this time.
  */
 int ldlm_cli_cancel(const struct lustre_handle *lockh,
-		    enum ldlm_cancel_flags cancel_flags)
+		    enum ldlm_cancel_flags flags)
 {
 	struct obd_export *exp;
-	int avail, count = 1;
+	int avail, count = 1, bl_ast = 0;
 	u64 rc = 0;
 	struct ldlm_namespace *ns;
 	struct ldlm_lock *lock;
@@ -1253,11 +1280,17 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 	lock_res_and_lock(lock);
 	LASSERT(!ldlm_is_converting(lock));
 
-	/* Lock is being canceled and the caller doesn't want to wait */
-	if (ldlm_is_canceling(lock)) {
+	if (ldlm_is_bl_ast(lock)) {
+		if (ldlm_is_ast_sent(lock)) {
+			unlock_res_and_lock(lock);
+			LDLM_LOCK_RELEASE(lock);
+			return 0;
+		}
+		bl_ast = 1;
+	} else if (ldlm_is_canceling(lock)) {
+		/* Lock is being canceled and the caller doesn't want to wait */
 		unlock_res_and_lock(lock);
-
-		if (!(cancel_flags & LCF_ASYNC))
+		if (flags & LCF_ASYNC)
 			wait_event_idle(lock->l_waitq, is_bl_done(lock));
 
 		LDLM_LOCK_RELEASE(lock);
@@ -1267,24 +1300,30 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 	ldlm_set_canceling(lock);
 	unlock_res_and_lock(lock);
 
-	if (cancel_flags & LCF_LOCAL)
+	if (flags & LCF_LOCAL)
 		OBD_FAIL_TIMEOUT(OBD_FAIL_LDLM_LOCAL_CANCEL_PAUSE,
 				 cfs_fail_val);
 
 	rc = ldlm_cli_cancel_local(lock);
-	if (rc == LDLM_FL_LOCAL_ONLY || cancel_flags & LCF_LOCAL) {
+	if (rc == LDLM_FL_LOCAL_ONLY || flags & LCF_LOCAL) {
 		LDLM_LOCK_RELEASE(lock);
 		return 0;
 	}
-	/*
-	 * Even if the lock is marked as LDLM_FL_BL_AST, this is a LDLM_CANCEL
-	 * RPC which goes to canceld portal, so we can cancel other LRU locks
-	 * here and send them all as one LDLM_CANCEL RPC.
-	 */
-	LASSERT(list_empty(&lock->l_bl_ast));
-	list_add(&lock->l_bl_ast, &cancels);
 
 	exp = lock->l_conn_export;
+	if (bl_ast) { /* Send RPC immediately for LDLM_FL_BL_AST */
+		ldlm_cli_cancel_req(exp, lock, count, flags | LCF_ONE_LOCK);
+		LDLM_LOCK_RELEASE(lock);
+		return 0;
+	}
+
+	LASSERT(list_empty(&lock->l_bl_ast));
+	list_add(&lock->l_bl_ast, &cancels);
+	/*
+	 * This is a LDLM_CANCEL RPC which goes to canceld portal,
+	 * so we can cancel other LRU locks here and send them all
+	 * as one LDLM_CANCEL RPC.
+	 */
 	if (exp_connect_cancelset(exp)) {
 		avail = ldlm_format_handles_avail(class_exp2cliimp(exp),
 						  &RQF_LDLM_CANCEL,
@@ -1295,7 +1334,8 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 		count += ldlm_cancel_lru_local(ns, &cancels, 0, avail - 1,
 					       LCF_BL_AST, 0);
 	}
-	ldlm_cli_cancel_list(&cancels, count, NULL, cancel_flags);
+	ldlm_cli_cancel_list(&cancels, count, NULL, flags);
+
 	return 0;
 }
 EXPORT_SYMBOL(ldlm_cli_cancel);
-- 
1.8.3.1


* [lustre-devel] [PATCH 08/40] lustre: enc: align Base64 encoding with RFC 4648 base64url
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (6 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 07/40] lustre: ldlm: send the cancel RPC asap James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 09/40] lustre: quota: fix insane grant quota James Simmons
                   ` (31 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Sebastien Buisson <sbuisson@ddn.com>

Lustre encryption uses a Base64 encoding to encode no-key filenames
(the filenames that are presented to userspace when a directory is
 listed without its encryption key).
Make this Base64 encoding compliant with RFC 4648 base64url, and use
a leading '+' character to distinguish digested names.

This is adapted from kernel
commit ba47b515f594 ("fscrypt: align Base64 encoding with RFC 4648 base64url")

To maintain compatibility with older clients, a new llite parameter
named 'filename_enc_use_old_base64' is introduced, set to 0 by
default. When set to 0, Lustre uses the new-style base64 encoding; when
set to 1, it uses the old-style base64 encoding.

To set this parameter globally for all clients, do on the MGS:
mgs# lctl set_param -P llite.*.filename_enc_use_old_base64={0,1}
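
For reference, the RFC 4648 base64url alphabet being aligned to is the
classic Base64 set with '+' and '/' replaced by '-' and '_' (the table
below is taken from the RFC, not from this patch). Since '_' is now a
valid encoded character, it can no longer mark digested names, which is
presumably why the marker moves to '+', a character outside the
base64url alphabet:

	static const char base64url_table[65] =
		"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
		"0123456789-_";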

WC-bug-id: https://jira.whamcloud.com/browse/LU-16374
Lustre-commit: 583ee6911b6cac7f2 ("LU-16374 enc: align Base64 encoding with RFC 4648 base64url")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49581
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: jsimmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_crypto.h |  3 +++
 fs/lustre/include/lustre_disk.h   |  3 ++-
 fs/lustre/llite/crypto.c          | 24 ++++++++++++-------
 fs/lustre/llite/llite_lib.c       |  3 +++
 fs/lustre/llite/lproc_llite.c     | 49 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/fs/lustre/include/lustre_crypto.h b/fs/lustre/include/lustre_crypto.h
index 2252798..ced1a191 100644
--- a/fs/lustre/include/lustre_crypto.h
+++ b/fs/lustre/include/lustre_crypto.h
@@ -32,6 +32,9 @@
 
 #include <linux/fscrypt.h>
 
+#define LLCRYPT_DIGESTED_CHAR		'+'
+#define LLCRYPT_DIGESTED_CHAR_OLD	'_'
+
 /* Macro to extract digest from Lustre specific structures */
 #define LLCRYPT_EXTRACT_DIGEST(name, len)			\
 	((name) + round_down((len) - FS_CRYPTO_BLOCK_SIZE - 1,	\
diff --git a/fs/lustre/include/lustre_disk.h b/fs/lustre/include/lustre_disk.h
index 15f94ad8..a8e935e 100644
--- a/fs/lustre/include/lustre_disk.h
+++ b/fs/lustre/include/lustre_disk.h
@@ -136,7 +136,8 @@ struct lustre_sb_info {
 	struct fscrypt_dummy_context lsi_dummy_enc_ctx;
 };
 
-#define LSI_UMOUNT_FAILOVER	0x00200000
+#define LSI_UMOUNT_FAILOVER		0x00200000
+#define LSI_FILENAME_ENC_B64_OLD_CLI    0x01000000 /* use old style base64 */
 
 #define     s2lsi(sb)	((struct lustre_sb_info *)((sb)->s_fs_info))
 #define     s2lsi_nocast(sb) ((sb)->s_fs_info)
diff --git a/fs/lustre/llite/crypto.c b/fs/lustre/llite/crypto.c
index d6750fb..5fb7f4d 100644
--- a/fs/lustre/llite/crypto.c
+++ b/fs/lustre/llite/crypto.c
@@ -227,15 +227,16 @@ int ll_setup_filename(struct inode *dir, const struct qstr *iname,
 	struct qstr dname;
 	int rc;
 
-	if (fid) {
-		fid->f_seq = 0;
-		fid->f_oid = 0;
-		fid->f_ver = 0;
-	}
-
 	if (fid && IS_ENCRYPTED(dir) && !fscrypt_has_encryption_key(dir) &&
-	    iname->name[0] == '_')
-		digested = 1;
+	    !fscrypt_has_encryption_key(dir)) {
+		struct lustre_sb_info *lsi = s2lsi(dir->i_sb);
+
+		if ((!(lsi->lsi_flags & LSI_FILENAME_ENC_B64_OLD_CLI) &&
+		     iname->name[0] == LLCRYPT_DIGESTED_CHAR) ||
+		   ((lsi->lsi_flags & LSI_FILENAME_ENC_B64_OLD_CLI) &&
+		     iname->name[0] == LLCRYPT_DIGESTED_CHAR_OLD))
+			digested = 1;
+	}
 
 	dname.name = iname->name + digested;
 	dname.len = iname->len - digested;
@@ -375,6 +376,8 @@ int ll_fname_disk_to_usr(struct inode *inode,
 		}
 		if (lltr.len > FS_CRYPTO_BLOCK_SIZE * 2 &&
 		    !fscrypt_has_encryption_key(inode)) {
+			struct lustre_sb_info *lsi = s2lsi(inode->i_sb);
+
 			digested = 1;
 			/* Without the key for long names, set the dentry name
 			 * to the representing struct ll_digest_filename. It
@@ -391,7 +394,10 @@ int ll_fname_disk_to_usr(struct inode *inode,
 			lltr.name = (char *)&digest;
 			lltr.len = sizeof(digest);
 
-			oname->name[0] = '_';
+			if (!(lsi->lsi_flags & LSI_FILENAME_ENC_B64_OLD_CLI))
+				oname->name[0] = LLCRYPT_DIGESTED_CHAR;
+			else
+				oname->name[0] = LLCRYPT_DIGESTED_CHAR_OLD;
 			oname->name = oname->name + 1;
 			oname->len--;
 		}
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index f84b6f5..e48bb6c 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -508,10 +508,13 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	}
 
 	if (ll_sbi_has_name_encrypt(sbi) && !obd_connect_has_name_enc(data)) {
+		struct  lustre_sb_info *lsi = s2lsi(sb);
+
 		if (ll_sb_has_test_dummy_encryption(sb))
 			LCONSOLE_WARN("%s: server %s does not support name encryption, not using it.\n",
 				      sbi->ll_fsname,
 				      sbi->ll_md_exp->exp_obd->obd_name);
+		lsi->lsi_flags &= ~LSI_FILENAME_ENC_B64_OLD_CLI;
 		ll_sbi_set_name_encrypt(sbi, false);
 	}
 
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 70dbc87..48d93c6 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -1653,6 +1653,53 @@ static ssize_t ll_nosquash_nids_seq_write(struct file *file,
 
 LDEBUGFS_SEQ_FOPS(ll_nosquash_nids);
 
+static int ll_old_b64_enc_seq_show(struct seq_file *m, void *v)
+{
+	struct super_block *sb = m->private;
+	struct lustre_sb_info *lsi = s2lsi(sb);
+
+	seq_printf(m, "%u\n",
+		   lsi->lsi_flags & LSI_FILENAME_ENC_B64_OLD_CLI ? 1 : 0);
+	return 0;
+}
+
+static ssize_t ll_old_b64_enc_seq_write(struct file *file,
+					const char __user *buffer,
+					size_t count, loff_t *off)
+{
+	struct seq_file *m = file->private_data;
+	struct super_block *sb = m->private;
+	struct lustre_sb_info *lsi = s2lsi(sb);
+	struct ll_sb_info *sbi = ll_s2sbi(sb);
+	bool val;
+	int rc;
+
+	rc = kstrtobool_from_user(buffer, count, &val);
+	if (rc)
+		return rc;
+
+	if (val) {
+		if (!ll_sbi_has_name_encrypt(sbi)) {
+			/* server does not support name encryption,
+			 * so force it to NULL on client
+			 */
+			CDEBUG(D_SEC,
+			       "%s: server does not support name encryption\n",
+			       sbi->ll_fsname);
+			lsi->lsi_flags &= ~LSI_FILENAME_ENC_B64_OLD_CLI;
+			return -EOPNOTSUPP;
+		}
+
+		lsi->lsi_flags |= LSI_FILENAME_ENC_B64_OLD_CLI;
+	} else {
+		lsi->lsi_flags &= ~LSI_FILENAME_ENC_B64_OLD_CLI;
+	}
+
+	return count;
+}
+
+LDEBUGFS_SEQ_FOPS(ll_old_b64_enc);
+
 static int ll_pcc_seq_show(struct seq_file *m, void *v)
 {
 	struct super_block *sb = m->private;
@@ -1709,6 +1756,8 @@ struct ldebugfs_vars lprocfs_llite_obd_vars[] = {
 	  .fops =	&ll_nosquash_nids_fops			},
 	{ .name =	"pcc",
 	  .fops =	&ll_pcc_fops,				},
+	{ .name =	"filename_enc_use_old_base64",
+	  .fops =	&ll_old_b64_enc_fops,			},
 	{ NULL }
 };
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 09/40] lustre: quota: fix insane grant quota
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (7 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 08/40] lustre: enc: align Base64 encoding with RFC 4648 base64url James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 10/40] lustre: llite: check truncated page in ->readpage() James Simmons
                   ` (30 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Hongchao Zhang, Lustre Development List

From: Hongchao Zhang <hongchao@whamcloud.com>

Fix the insane grant value in the quota master/slave index.
The logs often contain content similar to the following:

LustreError: 39815:0:(qmt_handler.c:527:qmt_dqacq0())
$$$ Release too much! uuid:work-MDT0000-lwp-MDT0002_UUID
release:18446744070274413724 granted:18446744070291193856,
total:4118877744 qmt:work-QMT0000 pool:0-dt id:40212 enforced:1
hard:128849018880 soft:12884901888 granted:4118877744 time:0
qunit: 16777216 edquot:0 may_rel:0 revoke:0 default:no

It could be caused by chgrp, which reserves quota before changing
the GID of some file at the MDT, then releases the reserved quota
after the file's GID has been changed on the corresponding OST
(this issue is tracked at LU-5152 and LU-11303).

In some cases, quota could be released even though it was not
reserved correctly, which causes the granted quota to become a
negative value. Because the grant is of type "u64", that value is
seen as an insanely large one, so the normal grant release fails
and the grant field of the affected quota ID in the quota file
(both at the QMT and the QSD) holds an insane value that cannot be
reset correctly.

This patch resets the affected quota by clearing the quota limits
and grant; the grant will then be reported by each QSD when the
quota ID is enforced again, rebuilding the grant at the QMT.

WC-bug-id: https://jira.whamcloud.com/browse/LU-15880
Lustre-commit: a2fd4d3aee9739dcb ("LU-15880 quota: fix insane grant quota")
Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/48981
Reviewed-by: Sergey Cheremencev <scherementsev@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c                   | 1 +
 include/uapi/linux/lustre/lustre_user.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 7dca0fc..56ef1bb 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -1158,6 +1158,7 @@ int quotactl_ioctl(struct super_block *sb, struct if_quotactl *qctl)
 	case LUSTRE_Q_SETINFOPOOL:
 	case LUSTRE_Q_SETDEFAULT_POOL:
 	case LUSTRE_Q_DELETEQID:
+	case LUSTRE_Q_RESETQID:
 		if (!capable(CAP_SYS_ADMIN))
 			return -EPERM;
 
diff --git a/include/uapi/linux/lustre/lustre_user.h b/include/uapi/linux/lustre/lustre_user.h
index 9bbb1c9..68fddcf 100644
--- a/include/uapi/linux/lustre/lustre_user.h
+++ b/include/uapi/linux/lustre/lustre_user.h
@@ -1041,6 +1041,7 @@ static inline void obd_uuid2fsname(char *buf, char *uuid, int buflen)
 #define LUSTRE_Q_GETDEFAULT_POOL	0x800013 /* get default pool quota*/
 #define LUSTRE_Q_SETDEFAULT_POOL	0x800014 /* set default pool quota */
 #define LUSTRE_Q_DELETEQID	0x800015  /* delete quota ID */
+#define LUSTRE_Q_RESETQID	0x800016  /* reset quota ID */
 /* In the current Lustre implementation, the grace time is either the time
  * or the timestamp to be used after some quota ID exceeds the soft limt,
  * 48 bits should be enough, its high 16 bits can be used as quota flags.
-- 
1.8.3.1


* [lustre-devel] [PATCH 10/40] lustre: llite: check truncated page in ->readpage()
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (8 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 09/40] lustre: quota: fix insane grant quota James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 11/40] lnet: o2iblnd: Fix key mismatch issue James Simmons
                   ` (29 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Qian Yingjin <qian@ddn.com>

The page end offset calculation in filemap_get_read_batch() was
off by one. This bug was introduced in commit v5.11-10234-gcbd59c48ae
("mm/filemap: use head pages in generic_file_buffered_read")

When a read is submitted with end offset 1048575, it calculates
an end page index for the read of 256 where it should be 255. This
results in the readpage() call for the page with index 256 being
over the stripe boundary and possibly not covered by a DLM extent
lock.
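
Putting numbers on this (assuming 4 KiB pages, PAGE_SHIFT = 12): end
offset 1048575 is the last byte of a 1 MiB read, and 1048575 >> 12 =
255, so 255 is the last page index the read actually covers; deriving
the end index from 1048576 >> 12 instead yields 256, one page past the
range the DLM extent lock was taken for.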

This happens in a corner race case: filemap_get_read_batch()
batches the page with index 256 for read, but this page is later
removed from the page cache because the lock protecting it is
revoked, while it still holds a reference count from the batch.
As a result, this page in the read path is not covered by any
DLM lock.

The solution is simple. We can check in ->readpage() whether the
page was truncated and removed from the page cache by looking at
the address_space pointer of the page. If it was truncated, return
AOP_TRUNCATED_PAGE to the upper caller.  This will cause the
kernel to retry batching pages, and the truncated page will not
be added as it was already removed from the file's page cache.

WC-bug-id: https://jira.whamcloud.com/browse/LU-16412
Lustre-commit: 209afbe28b5f164bd ("LU-16412 llite: check truncated page in ->readpage()")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49433
Reviewed-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-by: Zhenyu Xu <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/obd_support.h |  6 ++++--
 fs/lustre/llite/rw.c            | 35 +++++++++++++++++++++++++++++++++++
 fs/lustre/llite/rw26.c          |  7 +++++++
 3 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index a2930c8..4ef5c61 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -458,8 +458,8 @@
 /* was	OBD_FAIL_LLOG_CATINFO_NET			0x1309 until 2.3 */
 #define OBD_FAIL_MDS_SYNC_CAPA_SL			0x1310
 #define OBD_FAIL_SEQ_ALLOC				0x1311
-#define OBD_FAIL_PLAIN_RECORDS			    0x1319
-#define OBD_FAIL_CATALOG_FULL_CHECK		    0x131a
+#define OBD_FAIL_PLAIN_RECORDS				0x1319
+#define OBD_FAIL_CATALOG_FULL_CHECK			0x131a
 
 #define OBD_FAIL_LLITE					0x1400
 #define OBD_FAIL_LLITE_FAULT_TRUNC_RACE			0x1401
@@ -488,6 +488,8 @@
 #define OBD_FAIL_LLITE_PAGE_ALLOC			0x1418
 #define OBD_FAIL_LLITE_OPEN_DELAY			0x1419
 #define OBD_FAIL_LLITE_XATTR_PAUSE			0x1420
+#define OBD_FAIL_LLITE_PAGE_INVALIDATE_PAUSE		0x1421
+#define OBD_FAIL_LLITE_READPAGE_PAUSE			0x1422
 
 #define OBD_FAIL_FID_INDIR				0x1501
 #define OBD_FAIL_FID_INLMA				0x1502
diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index 0b14ea6..dea2af1 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -1865,6 +1865,41 @@ int ll_readpage(struct file *file, struct page *vmpage)
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
 	int result;
 
+	if (OBD_FAIL_PRECHECK(OBD_FAIL_LLITE_READPAGE_PAUSE)) {
+		unlock_page(vmpage);
+		OBD_FAIL_TIMEOUT(OBD_FAIL_LLITE_READPAGE_PAUSE, cfs_fail_val);
+		lock_page(vmpage);
+	}
+
+	/*
+	 * The @vmpage got truncated.
+	 * This is a kernel bug introduced in kernel 5.12:
+	 * commit: cbd59c48ae2bcadc4a7599c29cf32fd3f9b78251
+	 * ("mm/filemap: use head pages in generic_file_buffered_read")
+	 *
+	 * The page end offset calculation in filemap_get_read_batch() was off
+	 * by one.  When a read is submitted with end offset 1048575, it
+	 * calculates the end page for read of 256 where it should be 255. This
+	 * results in the readpage() for the page with index 256 being over the
+	 * stripe boundary and possibly not covered by a DLM extent lock.
+	 *
+	 * This happens in a corner race case: filemap_get_read_batch() adds
+	 * the page with index 256 for read which is not in the current read
+	 * I/O context, and this page is being invalidated and will be removed
+	 * from page cache because the lock protecting it is revoked. This
+	 * leaves this page in the read path not covered by any DLM lock.
+	 *
+	 * The solution is simple. Check whether the page was truncated in
+	 * ->readpage(). If so, just return AOP_TRUNCATED_PAGE to the upper
+	 * caller. Then the kernel will retry batching pages, and it will not
+	 * add the truncated page into batches as it was removed from page
+	 * cache of the file.
+	 */
+	if (vmpage->mapping != inode->i_mapping) {
+		unlock_page(vmpage);
+		return AOP_TRUNCATED_PAGE;
+	}
+
 	lcc = ll_cl_find(inode);
 	if (lcc) {
 		env = lcc->lcc_env;
diff --git a/fs/lustre/llite/rw26.c b/fs/lustre/llite/rw26.c
index cadded4..6700717 100644
--- a/fs/lustre/llite/rw26.c
+++ b/fs/lustre/llite/rw26.c
@@ -96,6 +96,13 @@ static void ll_invalidatepage(struct page *vmpage, unsigned int offset,
 		}
 		cl_env_percpu_put(env);
 	}
+
+	if (OBD_FAIL_PRECHECK(OBD_FAIL_LLITE_PAGE_INVALIDATE_PAUSE)) {
+		unlock_page(vmpage);
+		OBD_FAIL_TIMEOUT(OBD_FAIL_LLITE_PAGE_INVALIDATE_PAUSE,
+				 cfs_fail_val);
+		lock_page(vmpage);
+	}
 }
 
 static int ll_releasepage(struct page *vmpage, gfp_t gfp_mask)
-- 
1.8.3.1


* [lustre-devel] [PATCH 11/40] lnet: o2iblnd: Fix key mismatch issue
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (9 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 10/40] lustre: llite: check truncated page in ->readpage() James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 12/40] lustre: sec: fid2path for encrypted files James Simmons
                   ` (28 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Cyril Bordage, Lustre Development List

From: Cyril Bordage <cbordage@whamcloud.com>

If a pool memory region (mr) is mapped then unmapped without being
used, its key becomes out of sync with the RDMA subsystem.

At pool mr map time, the present code will create a local
invalidate work request (wr) using the mr's present key and then
change the mr's key.  When the mr is first used after being mapped,
the local invalidate wr will invalidate the original mr key, and
then a fast register wr is used with the modified key.  The fast
register will update the RDMA subsystem's key for the mr.

The error occurs when the mr is never used.  The next time the mr
is mapped, a local invalidate wr will again be created, but this
time it will use the mr's modified key.  The RDMA subsystem never
saw the original local invalidate, so now the RDMA subsystem's
key for the mr and o2iblnd's key for the mr are out of sync.

Fix the issue by tracking if the invalidate has been used.
Repurpose the boolean frd->frd_valid.  Presently, frd_valid is
always false.  Remove the code that used frd_valid to conditionally
split the invalidate from the fast register.  Instead, use frd_valid
to indicate when a new invalidate needs to be generated.  After a
post, evaluate if the invalidate was successfully used in the post.

These changes are only meaningful to the FRWR code path.  The failure
has only been observed when using Omni-Path Architecture.

WC-bug-id: https://jira.whamcloud.com/browse/LU-16349
Lustre-commit: 0c93919f1375ce16d ("LU-16349 o2iblnd: Fix key mismatch issue")
Signed-off-by: Cyril Bordage <cbordage@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49714
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/klnds/o2iblnd/o2iblnd.c    |  5 +++--
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c | 17 +++++++++++------
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/net/lnet/klnds/o2iblnd/o2iblnd.c b/net/lnet/klnds/o2iblnd/o2iblnd.c
index c1dfbe5..a7a3c79 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1584,7 +1584,8 @@ static int kiblnd_alloc_freg_pool(struct kib_fmr_poolset *fps,
 			goto out_middle;
 		}
 
-		frd->frd_valid = true;
+		/* indicate that the local invalidate needs to be generated */
+		frd->frd_valid = false;
 
 		list_add_tail(&frd->frd_list, &fpo->fast_reg.fpo_pool_list);
 		fpo->fast_reg.fpo_pool_size++;
@@ -1738,7 +1739,6 @@ void kiblnd_fmr_pool_unmap(struct kib_fmr *fmr, int status)
 
 	fps = fpo->fpo_owner;
 	if (frd) {
-		frd->frd_valid = false;
 		frd->frd_posted = false;
 		fmr->fmr_frd = NULL;
 		spin_lock(&fps->fps_lock);
@@ -1800,6 +1800,7 @@ int kiblnd_fmr_pool_map(struct kib_fmr_poolset *fps, struct kib_tx *tx,
 				u32 key = is_rx ? mr->rkey : mr->lkey;
 				struct ib_send_wr *inv_wr;
 
+				frd->frd_valid = true;
 				inv_wr = &frd->frd_inv_wr;
 				memset(inv_wr, 0, sizeof(*inv_wr));
 				inv_wr->opcode = IB_WR_LOCAL_INV;
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 6fc1730..5596fd6b 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -847,12 +847,8 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 		struct ib_send_wr *wrq = &tx->tx_wrq[0].wr;
 
 		if (frd && !frd->frd_posted) {
-			if (!frd->frd_valid) {
-				wrq = &frd->frd_inv_wr;
-				wrq->next = &frd->frd_fastreg_wr.wr;
-			} else {
-				wrq = &frd->frd_fastreg_wr.wr;
-			}
+			wrq = &frd->frd_inv_wr;
+			wrq->next = &frd->frd_fastreg_wr.wr;
 			frd->frd_fastreg_wr.wr.next = &tx->tx_wrq[0].wr;
 		}
 
@@ -866,6 +862,15 @@ static int kiblnd_map_tx(struct lnet_ni *ni, struct kib_tx *tx,
 			rc = -EINVAL;
 		else
 			rc = ib_post_send(conn->ibc_cmid->qp, wrq, &bad);
+
+		if (frd && !frd->frd_posted) {
+			/* The local invalidate becomes invalid (has been
+			 * successfully used) if the post succeeds or the
+			 * failing wr was not the invalidate.
+			 */
+			frd->frd_valid =
+				!(rc == 0 || (bad != &frd->frd_inv_wr));
+		}
 	}
 
 	conn->ibc_last_send = ktime_get();
-- 
1.8.3.1


* [lustre-devel] [PATCH 12/40] lustre: sec: fid2path for encrypted files
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (10 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 11/40] lnet: o2iblnd: Fix key mismatch issue James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 13/40] lustre: sec: Lustre/HSM on enc file with enc key James Simmons
                   ` (27 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Sebastien Buisson <sbuisson@ddn.com>

Add support for fid2path on encrypted files. The server side returns
the raw encrypted path name to the client, which then needs to process
the returned string. This is done from top to bottom, by iteratively
decrypting a parent name and then doing a lookup on it, so that the
child can in turn be decrypted.

For encrypted files that do not have their names encrypted, lookups
can be skipped. Indeed, name decryption is a no-op in this case, which
means it is not necessary to fetch the encryption key associated with
the parent inode.

Without the encryption key, lookups are skipped for the same reason.
But names have to be encoded and/or digested, so the server needs to
insert the FIDs of individual path components in the returned string.
These FIDs are interpreted by the client to build encoded/digested
names.
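
As a self-contained illustration of the client-side processing just
described (not the llite code itself), the sketch below walks a
server-returned string that uses the conventions from this patch: a
leading '/' marks the encrypted case and each component may carry a
"[seq:oid:ver]" FID prefix. The FID values in the example are made up.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* hypothetical server reply for an encrypted file two levels deep */
	char path[] = "/[0x200000007:0x1:0x0]dir1/[0x200000007:0x2:0x0]file1";
	char *cur = path + 1;	/* skip the leading '/' marker */
	char *p;

	while ((p = strsep(&cur, "/")) != NULL) {
		unsigned long long seq = 0;
		unsigned int oid = 0, ver = 0;

		if (sscanf(p, "[%llx:%x:%x]", &seq, &oid, &ver) == 3)
			p = strchr(p, ']') + 1;	/* name follows the FID */

		/* the real client would now decrypt/encode 'p' using the
		 * parent's context (or build a digested name from the FID
		 * when no key is available), then look the component up
		 * to descend one level */
		printf("fid=[%#llx:%#x:%#x] name=%s\n", seq, oid, ver, p);
	}
	return 0;
}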

WC-bug-id: https://jira.whamcloud.com/browse/LU-16205
Lustre-commit: fa9da556ad22b1485 ("LU-16205 sec: fid2path for encrypted files")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/48930
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: jsimmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_export.h |   5 ++
 fs/lustre/llite/file.c            | 160 +++++++++++++++++++++++++++++++++++++-
 fs/lustre/llite/llite_internal.h  |  17 ++++
 fs/lustre/llite/llite_lib.c       |   1 +
 fs/lustre/lmv/lmv_obd.c           |  38 +++++++--
 fs/lustre/mdc/mdc_request.c       |  10 +--
 6 files changed, 214 insertions(+), 17 deletions(-)

diff --git a/fs/lustre/include/lustre_export.h b/fs/lustre/include/lustre_export.h
index 6a59e6c..59f1dea 100644
--- a/fs/lustre/include/lustre_export.h
+++ b/fs/lustre/include/lustre_export.h
@@ -284,6 +284,11 @@ static inline int exp_connect_encrypt(struct obd_export *exp)
 	return !!(exp_connect_flags2(exp) & OBD_CONNECT2_ENCRYPT);
 }
 
+static inline int exp_connect_encrypt_fid2path(struct obd_export *exp)
+{
+	return !!(exp_connect_flags2(exp) & OBD_CONNECT2_ENCRYPT_FID2PATH);
+}
+
 static inline int exp_connect_lseek(struct obd_export *exp)
 {
 	return !!(exp_connect_flags2(exp) & OBD_CONNECT2_LSEEK);
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index aa9c5da..668d544 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -2744,12 +2744,146 @@ static int ll_do_fiemap(struct inode *inode, struct fiemap *fiemap,
 	return rc;
 }
 
+static int fid2path_for_enc_file(struct inode *parent, char *gfpath,
+				 u32 gfpathlen)
+{
+	struct dentry *de = NULL, *de_parent = d_find_any_alias(parent);
+	struct fscrypt_str lltr = FSTR_INIT(NULL, 0);
+	struct fscrypt_str de_name;
+	char *p, *ptr = gfpath;
+	size_t len = 0, len_orig = 0;
+	int enckey = -1, nameenc = -1;
+	int rc = 0;
+
+	gfpath++;
+	while ((p = strsep(&gfpath, "/")) != NULL) {
+		struct lu_fid fid;
+
+		de = NULL;
+		if (!*p) {
+			dput(de_parent);
+			break;
+		}
+		len_orig = strlen(p);
+
+		rc = sscanf(p, "["SFID"]", RFID(&fid));
+		if (rc == 3)
+			p = strchr(p, ']') + 1;
+		else
+			fid_zero(&fid);
+		rc = 0;
+		len = strlen(p);
+
+		if (!IS_ENCRYPTED(parent)) {
+			if (gfpathlen < len + 1) {
+				dput(de_parent);
+				rc = -EOVERFLOW;
+				break;
+			}
+			memmove(ptr, p, len);
+			p = ptr;
+			ptr += len;
+			*(ptr++) = '/';
+			gfpathlen -= len + 1;
+			goto lookup;
+		}
+
+		/* From here, we know parent is encrypted */
+		if (enckey != 0) {
+			rc = fscrypt_get_encryption_info(parent);
+			if (rc && rc != -ENOKEY) {
+				dput(de_parent);
+				break;
+			}
+		}
+
+		if (enckey == -1) {
+			if (fscrypt_has_encryption_key(parent))
+				enckey = 1;
+			else
+				enckey = 0;
+			if (enckey == 1)
+				nameenc = true;
+		}
+
+		/* Even if names are not encrypted, we still need to call
+		 * ll_fname_disk_to_usr in order to decode names as they are
+		 * coming from the wire.
+		 */
+		rc = fscrypt_fname_alloc_buffer(parent, NAME_MAX + 1, &lltr);
+		if (rc < 0) {
+			dput(de_parent);
+			break;
+		}
+
+		de_name.name = p;
+		de_name.len = len;
+		rc = ll_fname_disk_to_usr(parent, 0, 0, &de_name,
+					  &lltr, &fid);
+		if (rc) {
+			fscrypt_fname_free_buffer(&lltr);
+			dput(de_parent);
+			break;
+		}
+		lltr.name[lltr.len] = '\0';
+
+		if (lltr.len <= len_orig && gfpathlen >= lltr.len + 1) {
+			memcpy(ptr, lltr.name, lltr.len);
+			p = ptr;
+			len = lltr.len;
+			ptr += lltr.len;
+			*(ptr++) = '/';
+			gfpathlen -= lltr.len + 1;
+		} else {
+			rc = -EOVERFLOW;
+		}
+		fscrypt_fname_free_buffer(&lltr);
+
+		if (rc == -EOVERFLOW) {
+			dput(de_parent);
+			break;
+		}
+
+lookup:
+		if (!gfpath) {
+			/* We reached the end of the string, which means
+			 * we are dealing with the last component in the path.
+			 * So save a useless lookup and exit.
+			 */
+			dput(de_parent);
+			break;
+		}
+
+		if (enckey == 0 || nameenc == 0)
+			continue;
+
+		inode_lock(parent);
+		de = lookup_one_len(p, de_parent, len);
+		inode_unlock(parent);
+		if (IS_ERR_OR_NULL(de) || !de->d_inode) {
+			dput(de_parent);
+			rc = -ENODATA;
+			break;
+		}
+
+		parent = de->d_inode;
+		dput(de_parent);
+		de_parent = de;
+	}
+
+	if (len)
+		*(ptr - 1) = '\0';
+	if (!IS_ERR_OR_NULL(de))
+		dput(de);
+	return rc;
+}
+
 int ll_fid2path(struct inode *inode, void __user *arg)
 {
 	struct obd_export *exp = ll_i2mdexp(inode);
 	const struct getinfo_fid2path __user *gfin = arg;
 	struct getinfo_fid2path *gfout;
-	u32 pathlen;
+	u32 pathlen, pathlen_orig;
 	size_t outsize;
 	int rc;
 
@@ -2763,7 +2897,9 @@ int ll_fid2path(struct inode *inode, void __user *arg)
 
 	if (pathlen > PATH_MAX)
 		return -EINVAL;
+	pathlen_orig = pathlen;
 
+gf_alloc:
 	outsize = sizeof(*gfout) + pathlen;
 
 	gfout = kzalloc(outsize, GFP_KERNEL);
@@ -2781,17 +2917,37 @@ int ll_fid2path(struct inode *inode, void __user *arg)
 	 * old server without fileset mount support will ignore this.
 	 */
 	*gfout->gf_root_fid = *ll_inode2fid(inode);
+	gfout->gf_pathlen = pathlen;
 
 	/* Call mdc_iocontrol */
 	rc = obd_iocontrol(OBD_IOC_FID2PATH, exp, outsize, gfout, NULL);
 	if (rc != 0)
 		goto gf_free;
 
-	if (copy_to_user(arg, gfout, outsize))
+	if (gfout->gf_pathlen && gfout->gf_path[0] == '/') {
+		/* by convention, server side (mdt_path_current()) puts
+		 * a leading '/' to tell client that we are dealing with
+		 * an encrypted file
+		 */
+		rc = fid2path_for_enc_file(inode, gfout->gf_path,
+					   gfout->gf_pathlen);
+		if (rc)
+			goto gf_free;
+		if (strlen(gfout->gf_path) > gfin->gf_pathlen) {
+			rc = -EOVERFLOW;
+			goto gf_free;
+		}
+	}
+
+	if (copy_to_user(arg, gfout, sizeof(*gfout) + pathlen_orig))
 		rc = -EFAULT;
 
 gf_free:
 	kfree(gfout);
+	if (rc == -ENAMETOOLONG) {
+		pathlen += PATH_MAX;
+		goto gf_alloc;
+	}
 	return rc;
 }
 
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 1d85d0b..2223dbb 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -523,6 +523,23 @@ static inline void obd_connect_set_name_enc(struct obd_connect_data *data)
 #endif
 }
 
+static inline bool obd_connect_has_enc_fid2path(struct obd_connect_data *data)
+{
+#ifdef HAVE_LUSTRE_CRYPTO
+	return data->ocd_connect_flags & OBD_CONNECT_FLAGS2 &&
+		data->ocd_connect_flags2 & OBD_CONNECT2_ENCRYPT_FID2PATH;
+#else
+	return false;
+#endif
+}
+
+static inline void obd_connect_set_enc_fid2path(struct obd_connect_data *data)
+{
+#ifdef HAVE_LUSTRE_CRYPTO
+	data->ocd_connect_flags2 |= OBD_CONNECT2_ENCRYPT_FID2PATH;
+#endif
+}
+
 /*
  * Locking to guarantee consistency of non-atomic updates to long long i_size,
  * consistency between file size and KMS.
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index e48bb6c..3774ca8 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -358,6 +358,7 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 
 	obd_connect_set_secctx(data);
 	if (ll_sbi_has_encrypt(sbi)) {
+		obd_connect_set_enc_fid2path(data);
 		obd_connect_set_name_enc(data);
 		obd_connect_set_enc(data);
 	}
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 64d16d8..99604e8 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -551,6 +551,8 @@ static int lmv_fid2path(struct obd_export *exp, int len, void *karg,
 	struct getinfo_fid2path *remote_gf = NULL;
 	struct lu_fid root_fid;
 	int remote_gf_size = 0;
+	int currentisenc = 0;
+	int globalisenc = 0;
 	int rc;
 
 	tgt = lmv_fid2tgt(lmv, &gf->gf_fid);
@@ -565,11 +567,23 @@ static int lmv_fid2path(struct obd_export *exp, int len, void *karg,
 	if (rc != 0 && rc != -EREMOTE)
 		goto out_fid2path;
 
+	if (gf->gf_path[0] == '/') {
+		/* by convention, server side (mdt_path_current()) puts
+		 * a leading '/' to tell client that we are dealing with
+		 * an encrypted file
+		 */
+		currentisenc = 1;
+		globalisenc = 1;
+	} else {
+		currentisenc = 0;
+	}
+
 	/* If remote_gf != NULL, it means just building the
 	 * path on the remote MDT, copy this path segment to gf
 	 */
 	if (remote_gf) {
 		struct getinfo_fid2path *ori_gf;
+		int oldisenc = 0;
 		char *ptr;
 		int len;
 
@@ -581,14 +595,22 @@ static int lmv_fid2path(struct obd_export *exp, int len, void *karg,
 		}
 
 		ptr = ori_gf->gf_path;
+		oldisenc = ptr[0] == '/';
 
 		len = strlen(gf->gf_path);
-		/* move the current path to the right to release space
-		 * for closer-to-root part
-		 */
-		memmove(ptr + len + 1, ptr, strlen(ori_gf->gf_path));
-		memcpy(ptr, gf->gf_path, len);
-		ptr[len] = '/';
+		if (len) {
+			/* move the current path to the right to release space
+			 * for closer-to-root part
+			 */
+			memmove(ptr + len - currentisenc + 1 + globalisenc,
+				ptr + oldisenc,
+				strlen(ori_gf->gf_path) - oldisenc + 1);
+			if (globalisenc)
+				*(ptr++) = '/';
+			memcpy(ptr, gf->gf_path + currentisenc,
+			       len - currentisenc);
+			ptr[len - currentisenc] = '/';
+		}
 	}
 
 	CDEBUG(D_INFO, "%s: get path %s " DFID " rec: %llu ln: %u\n",
@@ -601,13 +623,13 @@ static int lmv_fid2path(struct obd_export *exp, int len, void *karg,
 
 	/* sigh, has to go to another MDT to do path building further */
 	if (!remote_gf) {
-		remote_gf_size = sizeof(*remote_gf) + PATH_MAX;
+		remote_gf_size = sizeof(*remote_gf) + len - sizeof(*gf);
 		remote_gf = kzalloc(remote_gf_size, GFP_NOFS);
 		if (!remote_gf) {
 			rc = -ENOMEM;
 			goto out_fid2path;
 		}
-		remote_gf->gf_pathlen = PATH_MAX;
+		remote_gf->gf_pathlen = len - sizeof(*gf);
 	}
 
 	if (!fid_is_sane(&gf->gf_fid)) {
diff --git a/fs/lustre/mdc/mdc_request.c b/fs/lustre/mdc/mdc_request.c
index 643b6ee..58ea982 100644
--- a/fs/lustre/mdc/mdc_request.c
+++ b/fs/lustre/mdc/mdc_request.c
@@ -1707,8 +1707,6 @@ static int mdc_ioc_fid2path(struct obd_export *exp, struct getinfo_fid2path *gf)
 	void *key;
 	int rc;
 
-	if (gf->gf_pathlen > PATH_MAX)
-		return -ENAMETOOLONG;
 	if (gf->gf_pathlen < 2)
 		return -EOVERFLOW;
 
@@ -1746,12 +1744,10 @@ static int mdc_ioc_fid2path(struct obd_export *exp, struct getinfo_fid2path *gf)
 		goto out;
 	}
 
-	CDEBUG(D_IOCTL, "path got " DFID " from %llu #%d: %s\n",
+	CDEBUG(D_IOCTL, "path got " DFID " from %llu #%d: %.*s\n",
 	       PFID(&gf->gf_fid), gf->gf_recno, gf->gf_linkno,
-	       gf->gf_pathlen < 512 ? gf->gf_path :
-	       /* only log the last 512 characters of the path */
-	       gf->gf_path + gf->gf_pathlen - 512);
-
+	       /* only log the first 512 characters of the path */
+	       512, gf->gf_path);
 out:
 	kfree(key);
 	return rc;
-- 
1.8.3.1


* [lustre-devel] [PATCH 13/40] lustre: sec: Lustre/HSM on enc file with enc key
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (11 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 12/40] lustre: sec: fid2path for encrypted files James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 14/40] lustre: llite: check read page past requested James Simmons
                   ` (26 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Sebastien Buisson <sbuisson@ddn.com>

Support for Lustre/HSM on encrypted files when the encryption key is
available requires attention similar to what is needed for file
migration.
The volatile file used for HSM restore must have the same encryption
context as the Lustre file being restored, so that file content
remains accessible after the layout swap at the end of the restore
procedure.

Please note that using Lustre/HSM with the encryption key creates
clear text copies of encrypted files on the HSM backend storage.

WC-bug-id: https://jira.whamcloud.com/browse/LU-16310
Lustre-commit: df7a8d92d2378e236 ("LU-16310 sec: Lustre/HSM on enc file with enc key")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49153
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: jsimmons <jsimmons@infradead.org>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Etienne AUJAMES <eaujames@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/crypto.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/llite/crypto.c b/fs/lustre/llite/crypto.c
index 5fb7f4d..61b85c8 100644
--- a/fs/lustre/llite/crypto.c
+++ b/fs/lustre/llite/crypto.c
@@ -246,7 +246,16 @@ int ll_setup_filename(struct inode *dir, const struct qstr *iname,
 		fid->f_oid = 0;
 		fid->f_ver = 0;
 	}
-	rc = fscrypt_setup_filename(dir, &dname, lookup, fname);
+	if (unlikely(filename_is_volatile(iname->name,
+					  iname->len, NULL))) {
+		/* keep volatile name as-is, matters for server side */
+		memset(fname, 0, sizeof(struct fscrypt_name));
+		fname->disk_name.name = (unsigned char *)iname->name;
+		fname->disk_name.len = iname->len;
+		rc = 0;
+	} else {
+		rc = fscrypt_setup_filename(dir, &dname, lookup, fname);
+	}
 	if (rc == -ENOENT && lookup) {
 		if (((is_root_inode(dir) &&
 		     iname->len == strlen(dot_fscrypt_name) &&
-- 
1.8.3.1


* [lustre-devel] [PATCH 14/40] lustre: llite: check read page past requested
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (12 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 13/40] lustre: sec: Lustre/HSM on enc file with enc key James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 15/40] lustre: llite: fix relatime support James Simmons
                   ` (25 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Qian Yingjin <qian@ddn.com>

Due to a kernel bug introduced in 5.12 by commit
cbd59c48ae2bcadc4a7599c29cf32fd3f9b78251
("mm/filemap: use head pages in generic_file_buffered_read"),
if the page immediately after the current read is in cache,
the kernel will try to read it.

This attempts to read a page past the end of the read requested
from userspace, and so that page has not been safely locked by
Lustre.

For a page after the end of the current read, check whether
it is under the protection of a DLM lock. If so, we take a
reference on the DLM lock until the page read has finished
and then release the reference.  If the page is not covered
by a DLM lock, then we are racing with the page being
removed from Lustre.  In that case, we return
AOP_TRUNCATED_PAGE, which makes the kernel release its
reference on the page and retry the page read.  This allows
the page to be removed from cache, so the kernel will not
find it and incorrectly attempt to read it again.

NB: Earlier versions of this description refer to stripe
boundaries, but the locking issue can occur whether or
not the page is on a stripe boundary, because dlmlocks
can cover part of a stripe.  (This is rare, but is
allowed.)
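
For reference, the boundary test at the heart of this change (the
cl_offset() check added to ll_readpage() below) boils down to simple
page arithmetic. A minimal userspace sketch, with cl_offset() reduced
to a plain shift and 4KiB pages assumed:

#include <stdio.h>

#define EX_PAGE_SHIFT 12	/* assume 4KiB pages for the example */

/* is the page at 'index' entirely past the byte range the user asked for? */
static int page_past_read(unsigned long index, long long pos,
			  unsigned long long count)
{
	return ((long long)index << EX_PAGE_SHIFT) >= pos + (long long)count;
}

int main(void)
{
	/* a 6000-byte read at offset 0 covers pages 0 and 1; page 2 is only
	 * touched because of the 5.12 readahead quirk described above */
	printf("page 1 past read: %d\n", page_past_read(1, 0, 6000));
	printf("page 2 past read: %d\n", page_past_read(2, 0, 6000));
	return 0;
}

A page for which this test is true is the one that must either be
covered by a valid dlmlock or re-read after AOP_TRUNCATED_PAGE.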

WC-bug-id: https://jira.whamcloud.com/browse/LU-16412
Lustre-commit: 2f8f38effac3a9519 ("LU-16412 llite: check read page past requested")
Signed-off-by: Qian Yingjin <qian@ddn.com>
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49723
Reviewed-by: Zhenyu Xu <bobijam@hotmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h |  2 ++
 fs/lustre/llite/rw.c             | 58 +++++++++++++++++++++++++++++++++++++---
 fs/lustre/llite/vvp_io.c         | 10 +++++--
 3 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 2223dbb..970b144 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -1371,6 +1371,8 @@ struct ll_cl_context {
 	struct cl_io		*lcc_io;
 	struct cl_page		*lcc_page;
 	enum lcc_type		 lcc_type;
+	struct kiocb		*lcc_iocb;
+	struct iov_iter		*lcc_iter;
 };
 
 struct ll_thread_info {
diff --git a/fs/lustre/llite/rw.c b/fs/lustre/llite/rw.c
index dea2af1..d285ae1 100644
--- a/fs/lustre/llite/rw.c
+++ b/fs/lustre/llite/rw.c
@@ -1858,11 +1858,14 @@ int ll_readpage(struct file *file, struct page *vmpage)
 {
 	struct inode *inode = file_inode(file);
 	struct cl_object *clob = ll_i2info(inode)->lli_clob;
-	struct ll_cl_context *lcc;
+	struct ll_sb_info *sbi = ll_i2sbi(inode);
 	const struct lu_env *env = NULL;
+	struct cl_read_ahead ra = { 0 };
+	struct ll_cl_context *lcc;
 	struct cl_io *io = NULL;
+	struct iov_iter *iter;
 	struct cl_page *page;
-	struct ll_sb_info *sbi = ll_i2sbi(inode);
+	struct kiocb *iocb;
 	int result;
 
 	if (OBD_FAIL_PRECHECK(OBD_FAIL_LLITE_READPAGE_PAUSE)) {
@@ -1911,6 +1914,8 @@ int ll_readpage(struct file *file, struct page *vmpage)
 		struct ll_readahead_state *ras = &fd->fd_ras;
 		struct lu_env *local_env = NULL;
 
+		CDEBUG(D_VFSTRACE, "fast read pgno: %ld\n", vmpage->index);
+
 		result = -ENODATA;
 
 		/*
@@ -1968,6 +1973,47 @@ int ll_readpage(struct file *file, struct page *vmpage)
 		return result;
 	}
 
+	if (lcc && lcc->lcc_type != LCC_MMAP) {
+		iocb = lcc->lcc_iocb;
+		iter = lcc->lcc_iter;
+
+		CDEBUG(D_VFSTRACE, "pgno:%ld, cnt:%ld, pos:%lld\n",
+		       vmpage->index, iter->count, iocb->ki_pos);
+
+		/*
+		 * This handles a kernel bug introduced in kernel 5.12:
+		 * comment: cbd59c48ae2bcadc4a7599c29cf32fd3f9b78251
+		 * ("mm/filemap: use head pages in generic_file_buffered_read")
+		 *
+		 * See above in this function for a full description of the
+		 * bug.  Briefly, the kernel will try to read 1 more page than
+		 * was actually requested *if that page is already in cache*.
+		 *
+		 * Because this page is beyond the boundary of the requested
+		 * read, Lustre does not lock it as part of the read.  This
+		 * means we must check if there is a valid dlmlock on this
+		 * page and reference it before we attempt to read in the
+		 * page.  If there is not a valid dlmlock, then we are racing
+		 * with dlmlock cancellation and the page is being removed
+		 * from the cache.
+		 *
+		 * That means we should return AOP_TRUNCATED_PAGE, which will
+		 * cause the kernel to retry the read, which should allow the
+		 * page to be removed from cache as the lock is cancelled.
+		 *
+		 * This should never occur except in kernels with the bug
+		 * mentioned above.
+		 */
+		if (cl_offset(clob, vmpage->index) >= iter->count + iocb->ki_pos) {
+			result = cl_io_read_ahead(env, io, vmpage->index, &ra);
+			if (result < 0 || vmpage->index > ra.cra_end_idx) {
+				cl_read_ahead_release(env, &ra);
+				unlock_page(vmpage);
+				return AOP_TRUNCATED_PAGE;
+			}
+		}
+	}
+
 	/**
 	 * Direct read can fall back to buffered read, but DIO is done
 	 * with lockless i/o, and buffered requires LDLM locking, so in
@@ -1979,7 +2025,8 @@ int ll_readpage(struct file *file, struct page *vmpage)
 		unlock_page(vmpage);
 		io->ci_dio_lock = 1;
 		io->ci_need_restart = 1;
-		return -ENOLCK;
+		result = -ENOLCK;
+		goto out;
 	}
 
 	page = cl_page_find(env, clob, vmpage->index, vmpage, CPT_CACHEABLE);
@@ -1999,5 +2046,10 @@ int ll_readpage(struct file *file, struct page *vmpage)
 		unlock_page(vmpage);
 		result = PTR_ERR(page);
 	}
+
+out:
+	if (ra.cra_release)
+		cl_read_ahead_release(env, &ra);
+
 	return result;
 }
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index eacb35b..2da74a2 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -806,6 +806,7 @@ static int vvp_io_read_start(const struct lu_env *env,
 	loff_t pos = io->u.ci_rd.rd.crw_pos;
 	size_t cnt = io->u.ci_rd.rd.crw_count;
 	size_t tot = vio->vui_tot_count;
+	struct ll_cl_context *lcc;
 	int exceed = 0;
 	int result;
 	struct iov_iter iter;
@@ -868,9 +869,14 @@ static int vvp_io_read_start(const struct lu_env *env,
 	file_accessed(file);
 	LASSERT(vio->vui_iocb->ki_pos == pos);
 	iter = *vio->vui_iter;
-	result = generic_file_read_iter(vio->vui_iocb, &iter);
-	goto out;
 
+	lcc = ll_cl_find(inode);
+	lcc->lcc_iter = &iter;
+	lcc->lcc_iocb = vio->vui_iocb;
+	CDEBUG(D_VFSTRACE, "cnt:%ld,iocb pos:%lld\n", lcc->lcc_iter->count,
+	       lcc->lcc_iocb->ki_pos);
+
+	result = generic_file_read_iter(vio->vui_iocb, &iter);
 out:
 	if (result >= 0) {
 		if (result < cnt)
-- 
1.8.3.1


* [lustre-devel] [PATCH 15/40] lustre: llite: fix relatime support
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (13 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 14/40] lustre: llite: check read page past requested James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 16/40] lustre: ptlrpc: clarify AT error message James Simmons
                   ` (24 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Aurelien Degremont <degremoa@amazon.com>

relatime behavior is properly managed by the VFS; however, Lustre
also stores acmtime (atime/ctime/mtime) on OST objects, and atime
updates for OST objects should honor relatime behavior.

This patch updates the 'ci_noatime' feature, which was introduced to
properly honor the noatime option for OST objects, so that it also
supports 'relatime'.
The file_is_noatime() code already comes from the upstream
touch_atime(). Add the missing parts from touch_atime() to also
support relatime.

It also forces atime to disk on the MDD if the on-disk atime is older
than the on-disk mtime/ctime, to match relatime (even if relatime is
not enabled).
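
The relatime rule itself, restated from the description above (and from
the relatime_need_update() copy added below), is small enough to show
as a worked example; the timestamps here are arbitrary seconds, not
real inode fields:

#include <stdbool.h>
#include <stdio.h>

static bool need_atime_update(long atime, long mtime, long ctime_, long now)
{
	/* update atime if mtime or ctime is newer, or atime is a day old */
	return mtime >= atime || ctime_ >= atime ||
	       now - atime >= 24 * 60 * 60;
}

int main(void)
{
	/* written after the last read: atime must be refreshed (prints 1) */
	printf("%d\n", need_atime_update(1000, 2000, 2000, 2100));
	/* read again shortly after: the update can be skipped (prints 0) */
	printf("%d\n", need_atime_update(3000, 2000, 2000, 3100));
	return 0;
}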

WC-bug-id: https://jira.whamcloud.com/browse/LU-15728
Lustre-commit: c10c6eeb37dd55316 ("LU-15728 llite: fix relatime support")
Signed-off-by: Aurelien Degremont <degremoa@amazon.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/47017
Reviewed-by: Shaun Tancheff <shaun.tancheff@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/file.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 668d544..18f3302 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -1541,12 +1541,46 @@ void ll_io_set_mirror(struct cl_io *io, const struct file *file)
 	       file->f_path.dentry->d_name.name, io->ci_designated_mirror);
 }
 
+/*
+ * This is relatime_need_update() from Linux 5.17, which is not exported.
+ */
+static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
+				struct timespec64 now)
+{
+	if (!(mnt->mnt_flags & MNT_RELATIME))
+		return 1;
+	/*
+	 * Is mtime younger than atime? If yes, update atime:
+	 */
+	if (timespec64_compare(&inode->i_mtime, &inode->i_atime) >= 0)
+		return 1;
+	/*
+	 * Is ctime younger than atime? If yes, update atime:
+	 */
+	if (timespec64_compare(&inode->i_ctime, &inode->i_atime) >= 0)
+		return 1;
+
+	/*
+	 * Is the previous atime value older than a day? If yes,
+	 * update atime:
+	 */
+	if ((long)(now.tv_sec - inode->i_atime.tv_sec) >= 24*60*60)
+		return 1;
+	/*
+	 * Good, we can skip the atime update:
+	 */
+	return 0;
+}
+
+/*
+ * Very similar to kernel function: !__atime_needs_update()
+ */
 static bool file_is_noatime(const struct file *file)
 {
-	const struct vfsmount *mnt = file->f_path.mnt;
-	const struct inode *inode = file_inode(file);
+	struct vfsmount *mnt = file->f_path.mnt;
+	struct inode *inode = file_inode(file);
+	struct timespec64 now;
 
-	/* Adapted from file_accessed() and touch_atime().*/
 	if (file->f_flags & O_NOATIME)
 		return true;
 
@@ -1565,6 +1599,11 @@ static bool file_is_noatime(const struct file *file)
 	if ((inode->i_sb->s_flags & SB_NODIRATIME) && S_ISDIR(inode->i_mode))
 		return true;
 
+	now = current_time(inode);
+
+	if (!relatime_need_update(mnt, inode, now))
+		return true;
+
 	return false;
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 16/40] lustre: ptlrpc: clarify AT error message
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (14 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 15/40] lustre: llite: fix relatime support James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 17/40] lustre: update version to 2.15.54 James Simmons
                   ` (23 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Aurelien Degremont <degremoa@amazon.com>

Clarify the error message related to a passed deadline
for AT early replies. It indicated that the system was
CPU bound, which is wrong most of the time, as the issue
is usually a communication failure delaying RPC traffic.
This could mislead people into looking for CPU resource
consumption when network traffic is more likely the cause.

Also try to use less cryptic wording that only makes sense
to the feature developer, and not to admins.

WC-bug-id: https://jira.whamcloud.com/browse/LU-930
Lustre-commit: 9ce04000fba07706c ("LU-930 ptlrpc: clarify AT error message")
Signed-off-by: Aurelien Degremont <degremoa@amazon.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49548
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/service.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/lustre/ptlrpc/service.c b/fs/lustre/ptlrpc/service.c
index aaf7529..bf76272 100644
--- a/fs/lustre/ptlrpc/service.c
+++ b/fs/lustre/ptlrpc/service.c
@@ -1303,12 +1303,11 @@ static void ptlrpc_at_check_timed(struct ptlrpc_service_part *svcpt)
 		 * We're already past request deadlines before we even get a
 		 * chance to send early replies
 		 */
-		LCONSOLE_WARN("%s: This server is not able to keep up with request traffic (cpu-bound).\n",
-			      svcpt->scp_service->srv_name);
-		CWARN("earlyQ=%d reqQ=%d recA=%d, svcEst=%d, delay=%lldms\n",
-		      counter, svcpt->scp_nreqs_incoming,
-		      svcpt->scp_nreqs_active,
-		      at_get(&svcpt->scp_at_estimate), delay_ms);
+		LCONSOLE_WARN("'%s' is processing requests too slowly, client may timeout. Late by %ds, missed %d early replies (reqs waiting=%d active=%d, at_estimate=%d, delay=%lldms)\n",
+			      svcpt->scp_service->srv_name, -first, counter,
+			      svcpt->scp_nreqs_incoming,
+			      svcpt->scp_nreqs_active,
+			      at_get(&svcpt->scp_at_estimate), delay_ms);
 	}
 
 	/*
-- 
1.8.3.1


* [lustre-devel] [PATCH 17/40] lustre: update version to 2.15.54
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (15 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 16/40] lustre: ptlrpc: clarify AT error message James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 18/40] lustre: tgt: skip free inodes in OST weights James Simmons
                   ` (22 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Oleg Drokin <green@whamcloud.com>

New tag 2.15.54

Signed-off-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/uapi/linux/lustre/lustre_ver.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/lustre/lustre_ver.h b/include/uapi/linux/lustre/lustre_ver.h
index 96267428..bc7a49c 100644
--- a/include/uapi/linux/lustre/lustre_ver.h
+++ b/include/uapi/linux/lustre/lustre_ver.h
@@ -3,9 +3,9 @@
 
 #define LUSTRE_MAJOR 2
 #define LUSTRE_MINOR 15
-#define LUSTRE_PATCH 53
+#define LUSTRE_PATCH 54
 #define LUSTRE_FIX 0
-#define LUSTRE_VERSION_STRING "2.15.53"
+#define LUSTRE_VERSION_STRING "2.15.54"
 
 #define OBD_OCD_VERSION(major, minor, patch, fix)			\
 	(((major) << 24) + ((minor) << 16) + ((patch) << 8) + (fix))
-- 
1.8.3.1


* [lustre-devel] [PATCH 18/40] lustre: tgt: skip free inodes in OST weights
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (16 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 17/40] lustre: update version to 2.15.54 James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:12 ` [lustre-devel] [PATCH 19/40] lustre: fileset: check fileset for operations by fid James Simmons
                   ` (21 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Andreas Dilger <adilger@whamcloud.com>

In lu_tgt_qos_weight_calc() calculate the target weight consistently
with how the per-OST and per-OSS penalty calculation is done in
ltd_qos_penalties_calc().  Otherwise, the QOS weighting calculations
combine two different units, which incorrectly weights allocations
toward OSTs with more free inodes over those with more free space.
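
A hypothetical example of why mixing the units matters (the formulas
are the old and new ltq_avail expressions from the diff below; the
free-space and inode numbers are made up):

#include <stdio.h>
#include <stdint.h>

/* old OST weight input: (bavail >> 16) * (iavail >> 8)  -- bytes * inodes
 * new OST weight input:  bavail >> 8                    -- bytes only
 * (bavail is os_bavail * os_bsize, i.e. free space in bytes)
 */
static uint64_t avail_old(uint64_t bavail, uint64_t iavail)
{
	return (bavail >> 16) * (iavail >> 8);
}

static uint64_t avail_new(uint64_t bavail)
{
	return bavail >> 8;
}

int main(void)
{
	/* OST A: 10 TiB free but few free inodes; OST B: 1 TiB free, many */
	uint64_t a_bavail = 10ULL << 40, a_iavail = 1ULL << 20;
	uint64_t b_bavail = 1ULL << 40,  b_iavail = 1ULL << 28;

	/* old formula prefers B despite it having 10x less free space */
	printf("old: A=%llu B=%llu\n",
	       (unsigned long long)avail_old(a_bavail, a_iavail),
	       (unsigned long long)avail_old(b_bavail, b_iavail));
	/* new formula weighs OSTs by free space only, so A is preferred */
	printf("new: A=%llu B=%llu\n",
	       (unsigned long long)avail_new(a_bavail),
	       (unsigned long long)avail_new(b_bavail));
	return 0;
}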

Fixes: 1fa303725063 ("lustre: lmv: share object alloc QoS code with LMV")
WC-bug-id: https://jira.whamcloud.com/browse/LU-16501
Lustre-commit: 511bf2f4ccd1482d6 ("LU-16501 tgt: skip free inodes in OST weights")
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49890
Reviewed-by: Artem Blagodarenko <ablagodarenko@ddn.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Sergey Cheremencev <scherementsev@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lu_object.h     | 14 ++++++++++++-
 fs/lustre/lmv/lmv_obd.c           |  4 ++--
 fs/lustre/obdclass/lu_tgt_descs.c | 41 ++++++++++++++++-----------------------
 3 files changed, 32 insertions(+), 27 deletions(-)

diff --git a/fs/lustre/include/lu_object.h b/fs/lustre/include/lu_object.h
index 4e101fa..0562f806 100644
--- a/fs/lustre/include/lu_object.h
+++ b/fs/lustre/include/lu_object.h
@@ -1539,6 +1539,18 @@ struct lu_tgt_desc {
 			ltd_connecting:1;  /* target is connecting */
 };
 
+static inline u64 tgt_statfs_bavail(struct lu_tgt_desc *tgt)
+{
+	struct obd_statfs *statfs = &tgt->ltd_statfs;
+
+	return statfs->os_bavail * statfs->os_bsize;
+}
+
+static inline u64 tgt_statfs_iavail(struct lu_tgt_desc *tgt)
+{
+	return tgt->ltd_statfs.os_ffree;
+}
+
 /* number of pointers at 2nd level */
 #define TGT_PTRS_PER_BLOCK	(PAGE_SIZE / sizeof(void *))
 /* number of pointers at 1st level - only need as many as max OST/MDT count */
@@ -1593,7 +1605,7 @@ struct lu_tgt_descs {
 u64 lu_prandom_u64_max(u64 ep_ro);
 int lu_qos_add_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
 int lu_qos_del_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd);
-void lu_tgt_qos_weight_calc(struct lu_tgt_desc *tgt);
+void lu_tgt_qos_weight_calc(struct lu_tgt_desc *tgt, bool is_mdt);
 
 int lu_tgt_descs_init(struct lu_tgt_descs *ltd, bool is_mdt);
 void lu_tgt_descs_fini(struct lu_tgt_descs *ltd);
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 99604e8..1b6e4aa 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -1512,7 +1512,7 @@ static struct lu_tgt_desc *lmv_locate_tgt_qos(struct lmv_obd *lmv,
 		}
 
 		tgt->ltd_qos.ltq_usable = 1;
-		lu_tgt_qos_weight_calc(tgt);
+		lu_tgt_qos_weight_calc(tgt, true);
 		if (tgt->ltd_index == op_data->op_mds)
 			cur = tgt;
 		total_avail += tgt->ltd_qos.ltq_avail;
@@ -1613,7 +1613,7 @@ static struct lu_tgt_desc *lmv_locate_tgt_lf(struct lmv_obd *lmv)
 		}
 
 		tgt->ltd_qos.ltq_usable = 1;
-		lu_tgt_qos_weight_calc(tgt);
+		lu_tgt_qos_weight_calc(tgt, true);
 		avail += tgt->ltd_qos.ltq_avail;
 		if (!min || min->ltd_qos.ltq_avail > tgt->ltd_qos.ltq_avail)
 			min = tgt;
diff --git a/fs/lustre/obdclass/lu_tgt_descs.c b/fs/lustre/obdclass/lu_tgt_descs.c
index 7394789..35e7c7c 100644
--- a/fs/lustre/obdclass/lu_tgt_descs.c
+++ b/fs/lustre/obdclass/lu_tgt_descs.c
@@ -198,33 +198,26 @@ int lu_qos_del_tgt(struct lu_qos *qos, struct lu_tgt_desc *ltd)
 }
 EXPORT_SYMBOL(lu_qos_del_tgt);
 
-static inline u64 tgt_statfs_bavail(struct lu_tgt_desc *tgt)
-{
-	struct obd_statfs *statfs = &tgt->ltd_statfs;
-
-	return statfs->os_bavail * statfs->os_bsize;
-}
-
-static inline u64 tgt_statfs_iavail(struct lu_tgt_desc *tgt)
-{
-	return tgt->ltd_statfs.os_ffree;
-}
-
 /**
  * Calculate weight for a given tgt.
  *
- * The final tgt weight is bavail >> 16 * iavail >> 8 minus the tgt and server
- * penalties.  See ltd_qos_penalties_calc() for how penalties are calculated.
+ * The final tgt weight uses only free space for OSTs, but combines
+ * both free space and inodes for MDTs, minus tgt and server penalties.
+ * See ltd_qos_penalties_calc() for how penalties are calculated.
  *
  * @tgt		target descriptor
+ * @is_mdt	target table is for MDT selection (use inodes)
  */
-void lu_tgt_qos_weight_calc(struct lu_tgt_desc *tgt)
+void lu_tgt_qos_weight_calc(struct lu_tgt_desc *tgt, bool is_mdt)
 {
 	struct lu_tgt_qos *ltq = &tgt->ltd_qos;
 	u64 penalty;
 
-	ltq->ltq_avail = (tgt_statfs_bavail(tgt) >> 16) *
-			 (tgt_statfs_iavail(tgt) >> 8);
+	if (is_mdt)
+		ltq->ltq_avail = (tgt_statfs_bavail(tgt) >> 16) *
+				 (tgt_statfs_iavail(tgt) >> 8);
+	else
+		ltq->ltq_avail = tgt_statfs_bavail(tgt) >> 8;
 	penalty = ltq->ltq_penalty + ltq->ltq_svr->lsq_penalty;
 	if (ltq->ltq_avail < penalty)
 		ltq->ltq_weight = 0;
@@ -512,11 +505,10 @@ int ltd_qos_penalties_calc(struct lu_tgt_descs *ltd)
 
 		/*
 		 * per-tgt penalty is
-		 * prio * bavail * iavail / (num_tgt - 1) / 2
+		 * prio * bavail * iavail / (num_tgt - 1) / prio_max / 2
 		 */
-		tgt->ltd_qos.ltq_penalty_per_obj = prio_wide * ba * ia >> 8;
+		tgt->ltd_qos.ltq_penalty_per_obj = prio_wide * ba * ia >> 9;
 		do_div(tgt->ltd_qos.ltq_penalty_per_obj, num_active);
-		tgt->ltd_qos.ltq_penalty_per_obj >>= 1;
 
 		age = (now - tgt->ltd_qos.ltq_used) >> 3;
 		if (test_bit(LQ_RESET, &qos->lq_flags) ||
@@ -563,14 +555,11 @@ int ltd_qos_penalties_calc(struct lu_tgt_descs *ltd)
 			svr->lsq_penalty >>= age / desc->ld_qos_maxage;
 	}
 
-	clear_bit(LQ_DIRTY, &qos->lq_flags);
-	clear_bit(LQ_RESET, &qos->lq_flags);
 
 	/*
 	 * If each tgt has almost same free space, do rr allocation for better
 	 * creation performance
 	 */
-	clear_bit(LQ_SAME_SPACE, &qos->lq_flags);
 	if (((ba_max * (QOS_THRESHOLD_MAX - qos->lq_threshold_rr)) /
 	    QOS_THRESHOLD_MAX) < ba_min &&
 	    ((ia_max * (QOS_THRESHOLD_MAX - qos->lq_threshold_rr)) /
@@ -578,7 +567,11 @@ int ltd_qos_penalties_calc(struct lu_tgt_descs *ltd)
 		set_bit(LQ_SAME_SPACE, &qos->lq_flags);
 		/* Reset weights for the next time we enter qos mode */
 		set_bit(LQ_RESET, &qos->lq_flags);
+	} else {
+		clear_bit(LQ_SAME_SPACE, &qos->lq_flags);
+		clear_bit(LQ_RESET, &qos->lq_flags);
 	}
+	clear_bit(LQ_DIRTY, &qos->lq_flags);
 	rc = 0;
 
 out:
@@ -653,7 +646,7 @@ int ltd_qos_update(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt,
 		else
 			ltq->ltq_penalty -= ltq->ltq_penalty_per_obj;
 
-		lu_tgt_qos_weight_calc(tgt);
+		lu_tgt_qos_weight_calc(tgt, ltd->ltd_is_mdt);
 
 		/* Recalc the total weight of usable osts */
 		if (ltq->ltq_usable)
-- 
1.8.3.1


* [lustre-devel] [PATCH 19/40] lustre: fileset: check fileset for operations by fid
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (17 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 18/40] lustre: tgt: skip free inodes in OST weights James Simmons
@ 2023-04-09 12:12 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 20/40] lustre: clio: Remove cl_page_size() James Simmons
                   ` (20 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:12 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Sebastien Buisson <sbuisson@ddn.com>

Some operations by FID, such as lfs rmfid, must be aware of
subdirectory mount (fileset) so that they do not operate on files
that are outside of the namespace currently mounted by the client.

For lfs rmfid, we first perform a fid2path resolution. As fid2path
is already fileset-aware, it fails if a file, or a link to a file, is
outside of the subdirectory mount. So we carry on with rmfid only for
FIDs whose file and all of its links appear under the current fileset.

This new behavior is enabled as soon as we detect that a subdirectory
mount is in use (either directly or imposed by a nodemap fileset).
This means the new behavior does not impact a normal, whole-namespace
client mount.
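
A minimal userspace model of the filtering step described above; the
visibility check is a stand-in for the per-FID fid2path calls, and
plain integers stand in for real FIDs:

#include <errno.h>
#include <stdio.h>

#define NR_FIDS 5

/* stand-in for "are this FID and all of its links visible in the fileset?" */
static int fid_visible(int fid)
{
	return (fid % 2) ? 0 : -ENOENT;
}

int main(void)
{
	int fids[NR_FIDS] = { 1, 2, 3, 4, 5 };
	int out[NR_FIDS], rcs[NR_FIDS] = { 0 };
	int nr_visible = 0, last = NR_FIDS - 1;
	int i;

	/* same partitioning as the ll_rmfid() change below: visible FIDs are
	 * packed at the front and sent to the MDT, hidden ones are parked at
	 * the tail with their error code already filled in */
	for (i = 0; i < NR_FIDS; i++) {
		int rc = fid_visible(fids[i]);

		if (!rc) {
			out[nr_visible++] = fids[i];
		} else {
			out[last] = fids[i];
			rcs[last--] = rc;
		}
	}

	printf("sending %d of %d FIDs to the server\n", nr_visible, NR_FIDS);
	printf("first sent fid=%d, last parked rc=%d\n",
	       out[0], rcs[NR_FIDS - 1]);
	return 0;
}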

WC-bug-id: https://jira.whamcloud.com/browse/LU-16494
Lustre-commit: 9a72c073d33b04542 ("LU-16494 fileset: check fileset for operations by fid")
Signed-off-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49696
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: jsimmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c            | 84 ++++++++++++++++++++++++++++++++++++++++
 fs/lustre/llite/file.c           | 55 ++++++++++++++------------
 fs/lustre/llite/llite_internal.h |  2 +
 3 files changed, 116 insertions(+), 25 deletions(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 56ef1bb..1298bd6 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -1295,6 +1295,7 @@ int ll_rmfid(struct file *file, void __user *arg)
 {
 	const struct fid_array __user *ufa = arg;
 	struct inode *inode = file_inode(file);
+	struct ll_sb_info *sbi = ll_i2sbi(inode);
 	struct fid_array *lfa = NULL;
 	size_t size;
 	unsigned int nr;
@@ -1325,8 +1326,91 @@ int ll_rmfid(struct file *file, void __user *arg)
 		goto free_rcs;
 	}
 
+	/* In case of subdirectory mount, we need to make sure all the files
+	 * for which we want to remove FID are visible in the namespace.
+	 */
+	if (!fid_is_root(&sbi->ll_root_fid)) {
+		struct fid_array *lfa_new = NULL;
+		int path_len = PATH_MAX, linkno;
+		struct getinfo_fid2path *gf;
+		int idx, last_idx = nr - 1;
+
+		lfa_new = kzalloc(size, GFP_NOFS);
+		if (!lfa_new) {
+			rc = -ENOMEM;
+			goto free_rcs;
+		}
+		lfa_new->fa_nr = 0;
+
+		gf = kmalloc(sizeof(*gf) + path_len + 1, GFP_NOFS);
+		if (!gf) {
+			rc = -ENOMEM;
+			goto free_rcs;
+		}
+
+		for (idx = 0; idx < nr; idx++) {
+			linkno = 0;
+			while (1) {
+				memset(gf, 0, sizeof(*gf) + path_len + 1);
+				gf->gf_fid = lfa->fa_fids[idx];
+				gf->gf_pathlen = path_len;
+				gf->gf_linkno = linkno;
+				rc = __ll_fid2path(inode, gf,
+						   sizeof(*gf) + gf->gf_pathlen,
+						   gf->gf_pathlen);
+				if (rc == -ENAMETOOLONG) {
+					struct getinfo_fid2path *tmpgf;
+
+					path_len += PATH_MAX;
+					tmpgf = krealloc(gf,
+						     sizeof(*gf) + path_len + 1,
+						     GFP_NOFS);
+					if (!tmpgf) {
+						kfree(gf);
+						kfree(lfa_new);
+						rc = -ENOMEM;
+						goto free_rcs;
+					}
+					gf = tmpgf;
+					continue;
+				}
+				if (rc)
+					break;
+				if (gf->gf_linkno == linkno)
+					break;
+				linkno = gf->gf_linkno;
+			}
+
+			if (!rc) {
+				/* All the links for this fid are visible in the
+				 * mounted subdir. So add it to the list of fids
+				 * to remove.
+				 */
+				lfa_new->fa_fids[lfa_new->fa_nr++] =
+					lfa->fa_fids[idx];
+			} else {
+				/* At least one link for this fid is not visible
+				 * in the mounted subdir. So add it at the end
+				 * of the list that will be hidden to lower
+				 * layers, and set -ENOENT as ret code.
+				 */
+				lfa_new->fa_fids[last_idx] = lfa->fa_fids[idx];
+				rcs[last_idx--] = rc;
+			}
+		}
+		kfree(gf);
+		kfree(lfa);
+		lfa = lfa_new;
+	}
+
+	if (lfa->fa_nr == 0) {
+		rc = rcs[nr - 1];
+		goto free_rcs;
+	}
+
 	/* Call mdc_iocontrol */
 	rc = md_rmfid(ll_i2mdexp(file_inode(file)), lfa, rcs, NULL);
+	lfa->fa_nr = nr;
 	if (!rc) {
 		for (i = 0; i < nr; i++)
 			if (rcs[i])
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index 18f3302..a9d247c 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -2917,9 +2917,37 @@ static int fid2path_for_enc_file(struct inode *parent, char *gfpath,
 	return rc;
 }
 
-int ll_fid2path(struct inode *inode, void __user *arg)
+int __ll_fid2path(struct inode *inode, struct getinfo_fid2path *gfout,
+		  size_t outsize, __u32 pathlen_orig)
 {
 	struct obd_export *exp = ll_i2mdexp(inode);
+	int rc;
+
+	/* Append root FID after gfout to let MDT know the root FID so that
+	 * it can lookup the correct path, this is mainly for fileset.
+	 * old server without fileset mount support will ignore this.
+	 */
+	*gfout->gf_root_fid = *ll_inode2fid(inode);
+
+	/* Call mdc_iocontrol */
+	rc = obd_iocontrol(OBD_IOC_FID2PATH, exp, outsize, gfout, NULL);
+
+	if (!rc && gfout->gf_pathlen && gfout->gf_path[0] == '/') {
+		/* by convention, server side (mdt_path_current()) puts
+		 * a leading '/' to tell client that we are dealing with
+		 * an encrypted file
+		 */
+		rc = fid2path_for_enc_file(inode, gfout->gf_path,
+					   gfout->gf_pathlen);
+		if (!rc && strlen(gfout->gf_path) > pathlen_orig)
+			rc = -EOVERFLOW;
+	}
+
+	return rc;
+}
+
+int ll_fid2path(struct inode *inode, void __user *arg)
+{
 	const struct getinfo_fid2path __user *gfin = arg;
 	struct getinfo_fid2path *gfout;
 	u32 pathlen, pathlen_orig;
@@ -2950,34 +2978,11 @@ int ll_fid2path(struct inode *inode, void __user *arg)
 		goto gf_free;
 	}
 
-	/*
-	 * append root FID after gfout to let MDT know the root FID so that it
-	 * can lookup the correct path, this is mainly for fileset.
-	 * old server without fileset mount support will ignore this.
-	 */
-	*gfout->gf_root_fid = *ll_inode2fid(inode);
 	gfout->gf_pathlen = pathlen;
-
-	/* Call mdc_iocontrol */
-	rc = obd_iocontrol(OBD_IOC_FID2PATH, exp, outsize, gfout, NULL);
+	rc = __ll_fid2path(inode, gfout, outsize, pathlen_orig);
 	if (rc != 0)
 		goto gf_free;
 
-	if (gfout->gf_pathlen && gfout->gf_path[0] == '/') {
-		/* by convention, server side (mdt_path_current()) puts
-		 * a leading '/' to tell client that we are dealing with
-		 * an encrypted file
-		 */
-		rc = fid2path_for_enc_file(inode, gfout->gf_path,
-					   gfout->gf_pathlen);
-		if (rc)
-			goto gf_free;
-		if (strlen(gfout->gf_path) > gfin->gf_pathlen) {
-			rc = -EOVERFLOW;
-			goto gf_free;
-		}
-	}
-
 	if (copy_to_user(arg, gfout, sizeof(*gfout) + pathlen_orig))
 		rc = -EFAULT;
 
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 970b144..6bbc781 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -1245,6 +1245,8 @@ int ll_dir_getstripe(struct inode *inode, void **plmm, int *plmm_size,
 int ll_fsync(struct file *file, loff_t start, loff_t end, int data);
 int ll_merge_attr(const struct lu_env *env, struct inode *inode);
 int ll_fid2path(struct inode *inode, void __user *arg);
+int __ll_fid2path(struct inode *inode, struct getinfo_fid2path *gfout,
+		  size_t outsize, u32 pathlen_orig);
 int ll_data_version(struct inode *inode, u64 *data_version, int flags);
 int ll_hsm_release(struct inode *inode);
 int ll_hsm_state_set(struct inode *inode, struct hsm_state_set *hss);
-- 
1.8.3.1


* [lustre-devel] [PATCH 20/40] lustre: clio: Remove cl_page_size()
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (18 preceding siblings ...)
  2023-04-09 12:12 ` [lustre-devel] [PATCH 19/40] lustre: fileset: check fileset for operations by fid James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 21/40] lustre: fid: clean up OBIF_MAX_OID and IDIF_MAX_OID James Simmons
                   ` (19 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Patrick Farrell <pfarrell@whamcloud.com>

cl_page_size() is just a function which does:
1 << PAGE_SHIFT

and the kernel provides a macro for that - PAGE_SIZE.
Maybe it didn't when this function was added, but it sure
does now.

So, remove cl_page_size().

WC-bug-id: https://jira.whamcloud.com/browse/LU-16515
Lustre-commit: 19c38f6c94ae161b1 ("LU-16515 clio: Remove cl_page_size()")
Signed-off-by: Patrick Farrell <pfarrell@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49918
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/cl_object.h | 1 -
 fs/lustre/llite/rw26.c        | 9 ++++-----
 fs/lustre/llite/vvp_io.c      | 2 +-
 fs/lustre/lov/lov_page.c      | 2 +-
 fs/lustre/obdclass/cl_page.c  | 6 ------
 5 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/fs/lustre/include/cl_object.h b/fs/lustre/include/cl_object.h
index 41ce0b0..8a98413 100644
--- a/fs/lustre/include/cl_object.h
+++ b/fs/lustre/include/cl_object.h
@@ -2268,7 +2268,6 @@ void cl_page_touch(const struct lu_env *env, const struct cl_page *pg,
 		   size_t to);
 loff_t cl_offset(const struct cl_object *obj, pgoff_t idx);
 pgoff_t cl_index(const struct cl_object *obj, loff_t offset);
-size_t cl_page_size(const struct cl_object *obj);
 int cl_pages_prune(const struct lu_env *env, struct cl_object *obj);
 
 void cl_lock_print(const struct lu_env *env, void *cookie,
diff --git a/fs/lustre/llite/rw26.c b/fs/lustre/llite/rw26.c
index 6700717..6b338b2 100644
--- a/fs/lustre/llite/rw26.c
+++ b/fs/lustre/llite/rw26.c
@@ -218,7 +218,6 @@ static unsigned long ll_iov_iter_alignment(struct iov_iter *i)
 	struct cl_sync_io *anchor = &sdio->csd_sync;
 	loff_t offset = pv->ldp_file_offset;
 	int io_pages = 0;
-	size_t page_size = cl_page_size(obj);
 	int i;
 	ssize_t rc = 0;
 
@@ -257,12 +256,12 @@ static unsigned long ll_iov_iter_alignment(struct iov_iter *i)
 		 * Set page clip to tell transfer formation engine
 		 * that page has to be sent even if it is beyond KMS.
 		 */
-		if (size < page_size)
+		if (size < PAGE_SIZE)
 			cl_page_clip(env, page, 0, size);
 		++io_pages;
 
-		offset += page_size;
-		size -= page_size;
+		offset += PAGE_SIZE;
+		size -= PAGE_SIZE;
 	}
 	if (rc == 0 && io_pages > 0) {
 		int iot = rw == READ ? CRT_READ : CRT_WRITE;
@@ -478,7 +477,7 @@ static int ll_prepare_partial_page(const struct lu_env *env, struct cl_io *io,
 	if (attr->cat_kms <= offset) {
 		char *kaddr = kmap_atomic(pg->cp_vmpage);
 
-		memset(kaddr, 0, cl_page_size(obj));
+		memset(kaddr, 0, PAGE_SIZE);
 		kunmap_atomic(kaddr);
 		result = 0;
 		goto out;
diff --git a/fs/lustre/llite/vvp_io.c b/fs/lustre/llite/vvp_io.c
index 2da74a2..561ce66 100644
--- a/fs/lustre/llite/vvp_io.c
+++ b/fs/lustre/llite/vvp_io.c
@@ -1525,7 +1525,7 @@ static int vvp_io_fault_start(const struct lu_env *env,
 		 */
 		fio->ft_nob = size - cl_offset(obj, fio->ft_index);
 	else
-		fio->ft_nob = cl_page_size(obj);
+		fio->ft_nob = PAGE_SIZE;
 
 	lu_ref_add(&page->cp_reference, "fault", io);
 	fio->ft_page = page;
diff --git a/fs/lustre/lov/lov_page.c b/fs/lustre/lov/lov_page.c
index 6e28e62..e9283aa 100644
--- a/fs/lustre/lov/lov_page.c
+++ b/fs/lustre/lov/lov_page.c
@@ -144,7 +144,7 @@ int lov_page_init_empty(const struct lu_env *env, struct cl_object *obj,
 	page->cp_lov_index = CP_LOV_INDEX_EMPTY;
 
 	addr = kmap(page->cp_vmpage);
-	memset(addr, 0, cl_page_size(obj));
+	memset(addr, 0, PAGE_SIZE);
 	kunmap(page->cp_vmpage);
 	SetPageUptodate(page->cp_vmpage);
 	return 0;
diff --git a/fs/lustre/obdclass/cl_page.c b/fs/lustre/obdclass/cl_page.c
index 62d8ee5..8320293 100644
--- a/fs/lustre/obdclass/cl_page.c
+++ b/fs/lustre/obdclass/cl_page.c
@@ -1035,12 +1035,6 @@ pgoff_t cl_index(const struct cl_object *obj, loff_t offset)
 }
 EXPORT_SYMBOL(cl_index);
 
-size_t cl_page_size(const struct cl_object *obj)
-{
-	return 1UL << PAGE_SHIFT;
-}
-EXPORT_SYMBOL(cl_page_size);
-
 /**
  * Adds page slice to the compound page.
  *
-- 
1.8.3.1


* [lustre-devel] [PATCH 21/40] lustre: fid: clean up OBIF_MAX_OID and IDIF_MAX_OID
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (19 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 20/40] lustre: clio: Remove cl_page_size() James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 22/40] lustre: llog: fix processing of a wrapped catalog James Simmons
                   ` (18 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Li Dongyang, Lustre Development List

From: Li Dongyang <dongyangli@ddn.com>

Define the OBIF_MAX_OID and IDIF_MAX_OID macros as
(1ULL << {OBIF,IDIF}_OID_MAX_BITS) - 1.
Clean up the callers and remove OBIF_OID_MASK and IDIF_OID_MASK,
which are no longer used.
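
For reference, a quick check of what the cleaned-up macros evaluate to
and how the boundary test changes; the definitions are taken from the
diff below:

#include <stdio.h>

#define OBIF_OID_MAX_BITS	32
#define OBIF_MAX_OID		((1ULL << OBIF_OID_MAX_BITS) - 1)
#define IDIF_OID_MAX_BITS	48
#define IDIF_MAX_OID		((1ULL << IDIF_OID_MAX_BITS) - 1)

int main(void)
{
	unsigned long long oid = OBIF_MAX_OID;	/* largest valid 32-bit OID */

	printf("OBIF_MAX_OID=%llu IDIF_MAX_OID=%llu\n",
	       OBIF_MAX_OID, IDIF_MAX_OID);
	/* the checks become "oid > MAX" instead of the old "oid >= MAX",
	 * so the maximum OID itself is now accepted */
	printf("oid %llu rejected for OBIF: %d\n", oid, oid > OBIF_MAX_OID);
	return 0;
}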

WC-bug-id: https://jira.whamcloud.com/browse/LU-11912
Lustre-commit: bb2f0dac868cf1321 ("LU-11912 fid: clean up OBIF_MAX_OID and IDIF_MAX_OID")
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/45659
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Sergey Cheremencev <scherementsev@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_fid.h           | 6 +++---
 include/uapi/linux/lustre/lustre_idl.h   | 6 ++----
 include/uapi/linux/lustre/lustre_ostid.h | 4 ++--
 3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/lustre/include/lustre_fid.h b/fs/lustre/include/lustre_fid.h
index b8a3f2e..88a6061 100644
--- a/fs/lustre/include/lustre_fid.h
+++ b/fs/lustre/include/lustre_fid.h
@@ -481,18 +481,18 @@ static inline int ostid_res_name_eq(const struct ost_id *oi,
 static inline int ostid_set_id(struct ost_id *oi, u64 oid)
 {
 	if (fid_seq_is_mdt0(oi->oi.oi_seq)) {
-		if (oid >= IDIF_MAX_OID)
+		if (oid > IDIF_MAX_OID)
 			return -E2BIG;
 		oi->oi.oi_id = oid;
 	} else if (fid_is_idif(&oi->oi_fid)) {
-		if (oid >= IDIF_MAX_OID)
+		if (oid > IDIF_MAX_OID)
 			return -E2BIG;
 		oi->oi_fid.f_seq = fid_idif_seq(oid,
 						fid_idif_ost_idx(&oi->oi_fid));
 		oi->oi_fid.f_oid = oid;
 		oi->oi_fid.f_ver = oid >> 48;
 	} else {
-		if (oid >= OBIF_MAX_OID)
+		if (oid > OBIF_MAX_OID)
 			return -E2BIG;
 		oi->oi_fid.f_oid = oid;
 	}
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index b4185a7..a752639 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -295,11 +295,9 @@ enum fid_seq {
 };
 
 #define OBIF_OID_MAX_BITS	32
-#define OBIF_MAX_OID		((1ULL << OBIF_OID_MAX_BITS))
-#define OBIF_OID_MASK		((1ULL << OBIF_OID_MAX_BITS) - 1)
+#define OBIF_MAX_OID		((1ULL << OBIF_OID_MAX_BITS) - 1)
 #define IDIF_OID_MAX_BITS	48
-#define IDIF_MAX_OID		((1ULL << IDIF_OID_MAX_BITS))
-#define IDIF_OID_MASK		((1ULL << IDIF_OID_MAX_BITS) - 1)
+#define IDIF_MAX_OID		((1ULL << IDIF_OID_MAX_BITS) - 1)
 
 /** OID for FID_SEQ_SPECIAL */
 enum special_oid {
diff --git a/include/uapi/linux/lustre/lustre_ostid.h b/include/uapi/linux/lustre/lustre_ostid.h
index 90fa213..baf7c8f 100644
--- a/include/uapi/linux/lustre/lustre_ostid.h
+++ b/include/uapi/linux/lustre/lustre_ostid.h
@@ -91,7 +91,7 @@ static inline __u64 ostid_seq(const struct ost_id *ostid)
 static inline __u64 ostid_id(const struct ost_id *ostid)
 {
 	if (fid_seq_is_mdt0(ostid->oi.oi_seq))
-		return ostid->oi.oi_id & IDIF_OID_MASK;
+		return ostid->oi.oi_id & IDIF_MAX_OID;
 
 	if (fid_seq_is_default(ostid->oi.oi_seq))
 		return ostid->oi.oi_id;
@@ -212,7 +212,7 @@ static inline int ostid_to_fid(struct lu_fid *fid, const struct ost_id *ostid,
 		 * been in production for years.  This can handle create rates
 		 * of 1M objects/s/OST for 9 years, or combinations thereof.
 		 */
-		if (oid >= IDIF_MAX_OID)
+		if (oid > IDIF_MAX_OID)
 			return -EBADF;
 
 		fid->f_seq = fid_idif_seq(oid, ost_idx);
-- 
1.8.3.1

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [lustre-devel] [PATCH 22/40] lustre: llog: fix processing of a wrapped catalog
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (20 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 21/40] lustre: fid: clean up OBIF_MAX_OID and IDIF_MAX_OID James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 23/40] lustre: llite: replace lld_nfs_dentry flag with opencache handling James Simmons
                   ` (17 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Etienne AUJAMES, Lustre Development List

From: Etienne AUJAMES <etienne.aujames@cea.fr>

Several issues were found with "lfs changelog --follow" for a wrapped
catalog (llog_cat_process() with startidx):

1/ incorrect lpcd_first_idx value for a wrapped catalog (startcat>0)

The first llog index to process is "lpcd_first_idx + 1". The startidx
represents the last record index processed for a llog plain. The
catalog index of this llog is startcat.
lpcd_first_idx of a catalog should be set to "startcat - 1",
e.g.:
llog_cat_process(... startcat=10, startidx=101) means that
processing will start with the llog plain at index 10 of the
catalog, and the first record to process will be at index 102.

2/ startidx is not reset for an incorrect startcat index

startidx is relevant only for a startcat. So if the corresponding llog
plain is removed or if startcat is out of range, we need to reset
startidx.

This patch removes LLOG_CAT_FIRST, which was really confusing
(LU-14158), and updates osp_sync_thread() for the corrected
llog_cat_process() behavior.

It also modifies llog_cat_retain_cb() to zap empty plain llogs
directly (as llog_cat_size_cb() does), since the current
implementation is not compatible with this patch.
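
Condensed, the corrected index setup in llog_cat_process_or_fork()
looks like the sketch below (taken from the hunk that follows; the
wrapped-catalog second pass and error handling are left out):

    /* default: start from the oldest record */
    d.lpd_startidx = 0;
    d.lpd_startcat = llh->llh_cat_idx + 1;
    cd.lpcd_first_idx = llh->llh_cat_idx;

    if (startcat > 0 && startcat <= llog_max_idx(llh)) {
            /* custom start: record startidx + 1 of the plain llog
             * referenced by catalog index startcat
             */
            d.lpd_startidx = startidx;
            d.lpd_startcat = startcat;
            cd.lpcd_first_idx = startcat - 1;
    } else if (startcat != 0) {
            /* out-of-range startcat is rejected */
            return -EINVAL;
    }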

Fixes: 58239d59 ("lustre: llog: fix processing of a wrapped catalog")
WC-bug-id: https://jira.whamcloud.com/browse/LU-15280
Lustre-commit: 76cf7427145a397a3 ("LU-15280 llog: fix processing of a wrapped catalog")
Signed-off-by: Etienne AUJAMES <eaujames@ddn.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/45708
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: Mikhail Pershin <mpershin@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_log.h         |  16 +++++
 fs/lustre/include/obd_support.h        |   1 +
 fs/lustre/obdclass/llog.c              |  21 +++---
 fs/lustre/obdclass/llog_cat.c          | 128 ++++++++++++++++++++++-----------
 include/uapi/linux/lustre/lustre_idl.h |   5 --
 5 files changed, 115 insertions(+), 56 deletions(-)

diff --git a/fs/lustre/include/lustre_log.h b/fs/lustre/include/lustre_log.h
index dbf3fd6..1586595 100644
--- a/fs/lustre/include/lustre_log.h
+++ b/fs/lustre/include/lustre_log.h
@@ -375,6 +375,15 @@ static inline int llog_next_block(const struct lu_env *env,
 	return rc;
 }
 
+static inline int llog_max_idx(struct llog_log_hdr *lh)
+{
+	if (OBD_FAIL_PRECHECK(OBD_FAIL_CAT_RECORDS) &&
+	    unlikely(lh->llh_flags & LLOG_F_IS_CAT))
+		return cfs_fail_val;
+	else
+		return LLOG_HDR_BITMAP_SIZE(lh) - 1;
+}
+
 /* Determine if a llog plain of a catalog could be skiped based on record
  * custom indexes.
  * This assumes that indexes follow each other. The number of records to skip
@@ -391,6 +400,13 @@ static inline int llog_is_plain_skipable(struct llog_log_hdr *lh,
 	return (LLOG_HDR_BITMAP_SIZE(lh) - rec->lrh_index) < (start - curr);
 }
 
+static inline bool llog_cat_is_wrapped(struct llog_handle *cat)
+{
+	struct llog_log_hdr *llh = cat->lgh_hdr;
+
+	return llh->llh_cat_idx >= cat->lgh_last_idx && llh->llh_count > 1;
+}
+
 /* llog.c */
 int lustre_process_log(struct super_block *sb, char *logname,
 		       struct config_llog_instance *cfg);
diff --git a/fs/lustre/include/obd_support.h b/fs/lustre/include/obd_support.h
index 4ef5c61..55196ce 100644
--- a/fs/lustre/include/obd_support.h
+++ b/fs/lustre/include/obd_support.h
@@ -458,6 +458,7 @@
 /* was	OBD_FAIL_LLOG_CATINFO_NET			0x1309 until 2.3 */
 #define OBD_FAIL_MDS_SYNC_CAPA_SL			0x1310
 #define OBD_FAIL_SEQ_ALLOC				0x1311
+#define OBD_FAIL_CAT_RECORDS				0x1312
 #define OBD_FAIL_PLAIN_RECORDS				0x1319
 #define OBD_FAIL_CATALOG_FULL_CHECK			0x131a
 
diff --git a/fs/lustre/obdclass/llog.c b/fs/lustre/obdclass/llog.c
index 90bb8bd..09fda39 100644
--- a/fs/lustre/obdclass/llog.c
+++ b/fs/lustre/obdclass/llog.c
@@ -247,7 +247,7 @@ int llog_verify_record(const struct llog_handle *llh, struct llog_rec_hdr *rec)
 	else if (rec->lrh_len == 0 || rec->lrh_len > chunk_size)
 		LLOG_ERROR_REC(llh, rec, "bad record len, chunk size is %d",
 			       chunk_size);
-	else if (rec->lrh_index >= LLOG_HDR_BITMAP_SIZE(llh->lgh_hdr))
+	else if (rec->lrh_index > llog_max_idx(llh->lgh_hdr))
 		LLOG_ERROR_REC(llh, rec, "index is too high");
 	else
 		return 0;
@@ -292,16 +292,21 @@ static int llog_process_thread(void *arg)
 		return 0;
 	}
 
+	last_index = llog_max_idx(llh);
 	if (cd) {
-		last_called_index = cd->lpcd_first_idx;
+		if (cd->lpcd_first_idx >= llog_max_idx(llh)) {
+			/* End of the indexes -> Nothing to do */
+			rc = 0;
+			goto out;
+		}
 		index = cd->lpcd_first_idx + 1;
+		last_called_index = cd->lpcd_first_idx;
+		if (cd->lpcd_last_idx > 0 &&
+		    cd->lpcd_last_idx <= llog_max_idx(llh))
+			last_index = cd->lpcd_last_idx;
+		else if (cd->lpcd_read_mode & LLOG_READ_MODE_RAW)
+			last_index = loghandle->lgh_last_idx;
 	}
-	if (cd && cd->lpcd_last_idx)
-		last_index = cd->lpcd_last_idx;
-	else if (cd && (cd->lpcd_read_mode & LLOG_READ_MODE_RAW))
-		last_index = loghandle->lgh_last_idx;
-	else
-		last_index = LLOG_HDR_BITMAP_SIZE(llh) - 1;
 
 	while (rc == 0) {
 		unsigned int buf_offset = 0;
diff --git a/fs/lustre/obdclass/llog_cat.c b/fs/lustre/obdclass/llog_cat.c
index 95bfa65..9d624a7 100644
--- a/fs/lustre/obdclass/llog_cat.c
+++ b/fs/lustre/obdclass/llog_cat.c
@@ -174,19 +174,25 @@ static int llog_cat_process_cb(const struct lu_env *env,
 	struct llog_handle *llh = NULL;
 	int rc;
 
+	/* Skip processing of the logs until startcat */
+	if (rec->lrh_index < d->lpd_startcat)
+		return 0;
+
 	rc = llog_cat_process_common(env, cat_llh, rec, &llh);
 	if (rc)
 		goto out;
 
-	if (rec->lrh_index < d->lpd_startcat)
-		/* Skip processing of the logs until startcat */
-		rc = 0;
-	else if (d->lpd_startidx > 0) {
-		struct llog_process_cat_data cd;
-
-		cd.lpcd_read_mode = LLOG_READ_MODE_NORMAL;
-		cd.lpcd_first_idx = d->lpd_startidx;
-		cd.lpcd_last_idx = 0;
+	if (d->lpd_startidx > 0) {
+		struct llog_process_cat_data cd = {
+			.lpcd_first_idx = 0,
+			.lpcd_last_idx = 0,
+			.lpcd_read_mode = LLOG_READ_MODE_NORMAL,
+		};
+
+		/* startidx is always associated with a catalog index */
+		if (d->lpd_startcat == rec->lrh_index)
+			cd.lpcd_first_idx = d->lpd_startidx;
+
 		rc = llog_process_or_fork(env, llh, d->lpd_cb, d->lpd_data,
 					  &cd, false);
 		/* Continue processing the next log from idx 0 */
@@ -208,57 +214,93 @@ static int llog_cat_process_or_fork(const struct lu_env *env,
 				    void *data, int startcat,
 				    int startidx, bool fork)
 {
-	struct llog_process_data d;
 	struct llog_log_hdr *llh = cat_llh->lgh_hdr;
+	struct llog_process_data d;
+	struct llog_process_cat_data cd;
 	int rc;
 
 	LASSERT(llh->llh_flags & LLOG_F_IS_CAT);
 	d.lpd_data = data;
 	d.lpd_cb = cb;
-	d.lpd_startcat = (startcat == LLOG_CAT_FIRST ? 0 : startcat);
-	d.lpd_startidx = startidx;
 
-	if (llh->llh_cat_idx > cat_llh->lgh_last_idx) {
-		struct llog_process_cat_data cd = {
-			.lpcd_read_mode = LLOG_READ_MODE_NORMAL
-		};
+	/* default: start from the oldest record */
+	d.lpd_startidx = 0;
+	d.lpd_startcat = llh->llh_cat_idx + 1;
+	cd.lpcd_first_idx = llh->llh_cat_idx;
+	cd.lpcd_last_idx = 0;
+	cd.lpcd_read_mode = LLOG_READ_MODE_NORMAL;
+
+	if (startcat > 0 && startcat <= llog_max_idx(llh)) {
+		/* start from a custom catalog/llog plain indexes*/
+		d.lpd_startidx = startidx;
+		d.lpd_startcat = startcat;
+		cd.lpcd_first_idx = startcat - 1;
+	} else if (startcat != 0) {
+		CWARN("%s: startcat %d out of range for catlog "DFID"\n",
+		      loghandle2name(cat_llh), startcat,
+		      PLOGID(&cat_llh->lgh_id));
+		return -EINVAL;
+	}
+
+	startcat = d.lpd_startcat;
+
+	/* if startcat <= lgh_last_idx, we only need to process the first part
+	 * of the catalog (from startcat).
+	 */
+	if (llog_cat_is_wrapped(cat_llh) && startcat > cat_llh->lgh_last_idx) {
+		int cat_idx_origin = llh->llh_cat_idx;
 
 		CWARN("%s: catlog " DFID " crosses index zero\n",
 		      loghandle2name(cat_llh),
 		      PLOGID(&cat_llh->lgh_id));
-		/*startcat = 0 is default value for general processing */
-		if ((startcat != LLOG_CAT_FIRST &&
-		    startcat >= llh->llh_cat_idx) || !startcat) {
-			/* processing the catalog part at the end */
-			cd.lpcd_first_idx = (startcat ? startcat :
-					     llh->llh_cat_idx);
-			cd.lpcd_last_idx = 0;
-			rc = llog_process_or_fork(env, cat_llh, cat_cb,
-						  &d, &cd, fork);
-			/* Reset the startcat because it has already reached
-			 * catalog bottom.
-			 */
-			startcat = 0;
-			d.lpd_startcat = 0;
-			if (rc != 0)
-				return rc;
-		}
-		/* processing the catalog part at the beginning */
-		cd.lpcd_first_idx = (startcat == LLOG_CAT_FIRST) ? 0 : startcat;
-		/* Note, the processing will stop at the lgh_last_idx value,
-		 * and it could be increased during processing. So records
-		 * between current lgh_last_idx and lgh_last_idx in future
-		 * would left unprocessed.
-		 */
-		cd.lpcd_last_idx = cat_llh->lgh_last_idx;
+
+		/* processing the catalog part at the end */
 		rc = llog_process_or_fork(env, cat_llh, cat_cb, &d, &cd, fork);
-	} else {
-		rc = llog_process_or_fork(env, cat_llh, cat_cb, &d, NULL, fork);
+		if (rc)
+			return rc;
+
+		/* Reset the startcat because it has already reached catalog
+		 * bottom.
+		 * lgh_last_idx value could be increased during processing. So
+		 * we process the remaining of catalog entries to be sure.
+		 */
+		d.lpd_startcat = 1;
+		d.lpd_startidx = 0;
+		cd.lpcd_first_idx = 0;
+		cd.lpcd_last_idx = max(cat_idx_origin, cat_llh->lgh_last_idx);
+	} else if (llog_cat_is_wrapped(cat_llh)) {
+		/* only process 1st part -> stop before reaching 2sd part */
+		cd.lpcd_last_idx = llh->llh_cat_idx;
 	}
 
+	/* processing the catalog part at the beginning */
+	rc = llog_process_or_fork(env, cat_llh, cat_cb, &d, &cd, fork);
+
 	return rc;
 }
 
+/**
+ * Process catalog records with a callback
+ *
+ * @note
+ * If "starcat = 0", this is the default processing. "startidx" argument is
+ * ignored and processing begin from the oldest record.
+ * If "startcat > 0", this is a custom starting point. Processing begin with
+ * the llog plain defined in the catalog record at index "startcat". The first
+ * llog plain record to process is at index "startidx + 1".
+ *
+ * @env		Lustre environnement
+ * @cat_llh	Catalog llog handler
+ * @cb		Callback executed for each records (in llog plain files)
+ * @data	Callback data argument
+ * @startcat	Catalog index of the llog plain to start with.
+ * @startidx	Index of the llog plain to start processing. The first
+ *		record to process is at startidx + 1.
+ *
+ * RETURN	0 processing successfully completed
+ *		LLOG_PROC_BREAK processing was stopped by the callback.
+ *		-errno on error.
+ */
 int llog_cat_process(const struct lu_env *env, struct llog_handle *cat_llh,
 		     llog_cb_t cb, void *data, int startcat, int startidx)
 {
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index a752639..d60d1d8 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -2619,11 +2619,6 @@ enum llog_flag {
 			  LLOG_F_EXT_X_OMODE | LLOG_F_EXT_X_XATTR,
 };
 
-/* means first record of catalog */
-enum {
-	LLOG_CAT_FIRST = -1,
-};
-
 /* On-disk header structure of each log object, stored in little endian order */
 #define LLOG_MIN_CHUNK_SIZE	8192
 #define LLOG_HEADER_SIZE	(96)	/* sizeof (llog_log_hdr) +
-- 
1.8.3.1

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [lustre-devel] [PATCH 23/40] lustre: llite: replace lld_nfs_dentry flag with opencache handling
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (21 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 22/40] lustre: llog: fix processing of a wrapped catalog James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 24/40] lustre: llite: match lock in corresponding namespace James Simmons
                   ` (16 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

The lld_nfs_dentry flag was created to cache the open lock
(opencache) when fetching fhandles for NFSv3. The same path is
used by the fhandle APIs. This lighter open changes key behavior,
since the open lock is always cached, which we don't want. Lustre
introduced a way to control open lock caching based on the number
of opens done on a file within a certain span of time. We can
replace the lld_nfs_dentry flag with this new open lock caching.
This way fhandle handling matches the open lock caching behavior
of a normal file open.
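
The resulting decision in ll_file_open() boils down to the sketch
below (condensed from the hunks; the volatile-file and just-closed
cases are omitted, and where the fhandle path sets the per-inode
lli_open_thrsh_count is not shown in this mail, UINT_MAX simply
means "not set"):

    int open_threshold = sbi->ll_oc_thrsh_count;

    /* per-inode override used by the fhandle / NFS path */
    if (lli->lli_open_thrsh_count != UINT_MAX)
            open_threshold = lli->lli_open_thrsh_count;

    if (open_threshold > 0 &&
        lli->lli_open_fd_count >= open_threshold)
            it->it_flags |= MDS_OPEN_LOCK;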

WC-bug-id: https://jira.whamcloud.com/browse/LU-16463
Lustre-commit: d7a85652f4fcb8319 ("LU-16463 llite: replace lld_nfs_dentry flag with opencache handling")
Signed-off-by: James Simmons <jsimmons@infradead.org>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49237
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Etienne AUJAMES <eaujames@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
---
 fs/lustre/llite/file.c           | 26 ++++++++------------------
 fs/lustre/llite/llite_internal.h | 13 +++++++------
 fs/lustre/llite/llite_nfs.c      | 15 +--------------
 fs/lustre/llite/namei.c          |  8 +++++++-
 fs/lustre/llite/super25.c        |  2 ++
 5 files changed, 25 insertions(+), 39 deletions(-)

diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index a9d247c..fb8ede2 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -916,7 +916,7 @@ int ll_file_open(struct inode *inode, struct file *file)
 		if (!it->it_disposition) {
 			struct dentry *dentry = file_dentry(file);
 			struct ll_sb_info *sbi = ll_i2sbi(inode);
-			struct ll_dentry_data *ldd;
+			int open_threshold = sbi->ll_oc_thrsh_count;
 
 			/* We cannot just request lock handle now, new ELC code
 			 * means that one of other OPEN locks for this file
@@ -927,22 +927,20 @@ int ll_file_open(struct inode *inode, struct file *file)
 			mutex_unlock(&lli->lli_och_mutex);
 			/*
 			 * Normally called under two situations:
-			 * 1. NFS export.
+			 * 1. fhandle / NFS export.
 			 * 2. A race/condition on MDS resulting in no open
 			 *    handle to be returned from LOOKUP|OPEN request,
 			 *    for example if the target entry was a symlink.
 			 *
-			 * In NFS path we know there's pathologic behavior
-			 * so we always enable open lock caching when coming
-			 * from there. It's detected by setting a flag in
-			 * ll_iget_for_nfs.
-			 *
 			 * After reaching number of opens of this inode
 			 * we always ask for an open lock on it to handle
 			 * bad userspace actors that open and close files
 			 * in a loop for absolutely no good reason
 			 */
-			ldd = ll_d2d(dentry);
+			/* fhandle / NFS path. */
+			if (lli->lli_open_thrsh_count != UINT_MAX)
+				open_threshold = lli->lli_open_thrsh_count;
+
 			if (filename_is_volatile(dentry->d_name.name,
 						 dentry->d_name.len,
 						 NULL)) {
@@ -951,17 +949,9 @@ int ll_file_open(struct inode *inode, struct file *file)
 				 * We do not want openlock for volatile
 				 * files under any circumstances
 				 */
-			} else if (ldd && ldd->lld_nfs_dentry) {
-				/* NFS path. This also happens to catch
-				 * open by fh files I guess
-				 */
-				it->it_flags |= MDS_OPEN_LOCK;
-				/* clear the flag for future lookups */
-				ldd->lld_nfs_dentry = 0;
-			} else if (sbi->ll_oc_thrsh_count > 0) {
+			} else if (open_threshold > 0) {
 				/* Take MDS_OPEN_LOCK with many opens */
-				if (lli->lli_open_fd_count >=
-				    sbi->ll_oc_thrsh_count)
+				if (lli->lli_open_fd_count >= open_threshold)
 					it->it_flags |= MDS_OPEN_LOCK;
 
 				/* If this is open after we just closed */
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index 6bbc781..cdfc75e 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -72,7 +72,6 @@
 struct ll_dentry_data {
 	unsigned int			lld_sa_generation;
 	unsigned int			lld_invalid:1;
-	unsigned int			lld_nfs_dentry:1;
 	struct rcu_head			lld_rcu_head;
 };
 
@@ -145,11 +144,6 @@ struct ll_inode_info {
 	u64				lli_open_fd_write_count;
 	u64				lli_open_fd_exec_count;
 
-	/* Number of times this inode was opened */
-	u64				lli_open_fd_count;
-	/* When last close was performed on this inode */
-	ktime_t				lli_close_fd_time;
-
 	/* Protects access to och pointers and their usage counters */
 	struct mutex			lli_och_mutex;
 
@@ -162,6 +156,13 @@ struct ll_inode_info {
 	s64				lli_btime;
 	spinlock_t			lli_agl_lock;
 
+	/* inode specific open lock caching threshold */
+	u32				lli_open_thrsh_count;
+	/* Number of times this inode was opened */
+	u64				lli_open_fd_count;
+	/* When last close was performed on this inode */
+	ktime_t				lli_close_fd_time;
+
 	/* Try to make the d::member and f::member are aligned. Before using
 	 * these members, make clear whether it is directory or not.
 	 */
diff --git a/fs/lustre/llite/llite_nfs.c b/fs/lustre/llite/llite_nfs.c
index 3c4c9ef..232b2b3 100644
--- a/fs/lustre/llite/llite_nfs.c
+++ b/fs/lustre/llite/llite_nfs.c
@@ -114,7 +114,6 @@ struct inode *search_inode_for_lustre(struct super_block *sb,
 		struct lu_fid *fid, struct lu_fid *parent)
 {
 	struct inode *inode;
-	struct dentry *result;
 
 	if (!fid_is_sane(fid))
 		return ERR_PTR(-ESTALE);
@@ -131,19 +130,7 @@ struct inode *search_inode_for_lustre(struct super_block *sb,
 		return ERR_PTR(-ESTALE);
 	}
 
-	result = d_obtain_alias(inode);
-	if (IS_ERR(result))
-		return result;
-
-	/*
-	 * Need to signal to the ll_intent_file_open that
-	 * we came from NFS and so opencache needs to be
-	 * enabled for this one
-	 */
-	spin_lock(&result->d_lock);
-	ll_d2d(result)->lld_nfs_dentry = 1;
-	spin_unlock(&result->d_lock);
-	return result;
+	return d_obtain_alias(inode);
 }
 
 /**
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 9314a17..ada539e 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -1144,6 +1144,7 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 	struct ll_sb_info *sbi = NULL;
 	struct pcc_create_attach pca = { NULL, NULL };
 	bool encrypt = false;
+	int open_threshold;
 	int rc = 0;
 
 	CDEBUG(D_VFSTRACE,
@@ -1224,7 +1225,12 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 	 * we only need to request open lock if it was requested
 	 * for every open
 	 */
-	if (ll_i2sbi(dir)->ll_oc_thrsh_count == 1 &&
+	if (ll_i2info(dir)->lli_open_thrsh_count != UINT_MAX)
+		open_threshold = ll_i2info(dir)->lli_open_thrsh_count;
+	else
+		open_threshold = ll_i2sbi(dir)->ll_oc_thrsh_count;
+
+	if (open_threshold == 1 &&
 	    exp_connect_flags2(ll_i2mdexp(dir)) &
 	    OBD_CONNECT2_ATOMIC_OPEN_LOCK)
 		it->it_flags |= MDS_OPEN_LOCK;
diff --git a/fs/lustre/llite/super25.c b/fs/lustre/llite/super25.c
index 5349a25..50272a7 100644
--- a/fs/lustre/llite/super25.c
+++ b/fs/lustre/llite/super25.c
@@ -55,6 +55,8 @@ static struct inode *ll_alloc_inode(struct super_block *sb)
 		return NULL;
 
 	inode_init_once(&lli->lli_vfs_inode);
+	lli->lli_open_thrsh_count = UINT_MAX;
+
 	return &lli->lli_vfs_inode;
 }
 
-- 
1.8.3.1

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [lustre-devel] [PATCH 24/40] lustre: llite: match lock in corresponding namespace
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (22 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 23/40] lustre: llite: replace lld_nfs_dentry flag with opencache handling James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 25/40] lnet: libcfs: remove unused hash code James Simmons
                   ` (15 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lai Siyao, Lustre Development List

From: Lai Siyao <lai.siyao@whamcloud.com>

For a remote object, the LOOKUP lock is held on the parent MDT, so
lmv_lock_match() iterates all MDT namespaces to match locks. This is
needed in places where only the LOOKUP ibit is matched and the lock
namespace is unknown.
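
The visible interface change is that ll_have_md_lock() now takes the
export to match against, so callers can pass the namespace that
actually granted the lock (condensed from the hunks below):

    int ll_have_md_lock(struct obd_export *exp, struct inode *inode,
                        u64 *bits, enum ldlm_mode l_req_mode);

    /* lock cancel path: match in the granting namespace */
    ll_have_md_lock(lock->l_conn_export, inode, &bits,
                    lock->l_req_mode);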

WC-bug-id: https://jira.whamcloud.com/browse/LU-15971
Lustre-commit: 64264dc424ca13d90 ("LU-15971 llite: match lock in corresponding namespace")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/47843
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Qian Yingjin <qian@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c            |  3 ++-
 fs/lustre/llite/file.c           |  6 ++---
 fs/lustre/llite/llite_internal.h |  4 ++--
 fs/lustre/llite/namei.c          |  7 +++---
 fs/lustre/lmv/lmv_obd.c          | 52 ++++++++++++++++++++++++----------------
 5 files changed, 42 insertions(+), 30 deletions(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 1298bd6..0422701 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -347,7 +347,8 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
 			struct inode *parent;
 
 			parent = file_dentry(filp)->d_parent->d_inode;
-			if (ll_have_md_lock(parent, &ibits, LCK_MINMODE))
+			if (ll_have_md_lock(ll_i2mdexp(parent), parent, &ibits,
+					    LCK_MINMODE))
 				pfid = *ll_inode2fid(parent);
 		}
 
diff --git a/fs/lustre/llite/file.c b/fs/lustre/llite/file.c
index fb8ede2..746c18f 100644
--- a/fs/lustre/llite/file.c
+++ b/fs/lustre/llite/file.c
@@ -5198,7 +5198,7 @@ int ll_migrate(struct inode *parent, struct file *file, struct lmv_user_md *lum,
  *
  * Returns:		boolean, true iff all bits are found
  */
-int ll_have_md_lock(struct inode *inode, u64 *bits,
+int ll_have_md_lock(struct obd_export *exp, struct inode *inode, u64 *bits,
 		    enum ldlm_mode l_req_mode)
 {
 	struct lustre_handle lockh;
@@ -5222,8 +5222,8 @@ int ll_have_md_lock(struct inode *inode, u64 *bits,
 		if (policy.l_inodebits.bits == 0)
 			continue;
 
-		if (md_lock_match(ll_i2mdexp(inode), flags, fid, LDLM_IBITS,
-				  &policy, mode, &lockh)) {
+		if (md_lock_match(exp, flags, fid, LDLM_IBITS, &policy, mode,
+				  &lockh)) {
 			struct ldlm_lock *lock;
 
 			lock = ldlm_handle2lock(&lockh);
diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index cdfc75e..b101a71 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -1175,8 +1175,8 @@ static inline void mapping_clear_exiting(struct address_space *mapping)
 /* llite/file.c */
 extern const struct inode_operations ll_file_inode_operations;
 const struct file_operations *ll_select_file_operations(struct ll_sb_info *sbi);
-int ll_have_md_lock(struct inode *inode, u64 *bits,
-		    enum ldlm_mode l_req_mode);
+int ll_have_md_lock(struct obd_export *exp, struct inode *inode,
+		    u64 *bits, enum ldlm_mode l_req_mode);
 enum ldlm_mode ll_take_md_lock(struct inode *inode, u64 bits,
 			       struct lustre_handle *lockh, u64 flags,
 			       enum ldlm_mode mode);
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index ada539e..0c4c8e6 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -256,7 +256,8 @@ static void ll_lock_cancel_bits(struct ldlm_lock *lock, u64 to_cancel)
 	 * LCK_CR, LCK_CW, LCK_PR - bug 22891
 	 */
 	if (bits & MDS_INODELOCK_OPEN)
-		ll_have_md_lock(inode, &bits, lock->l_req_mode);
+		ll_have_md_lock(lock->l_conn_export, inode, &bits,
+				lock->l_req_mode);
 
 	if (bits & MDS_INODELOCK_OPEN) {
 		fmode_t fmode;
@@ -284,7 +285,7 @@ static void ll_lock_cancel_bits(struct ldlm_lock *lock, u64 to_cancel)
 	if (bits & (MDS_INODELOCK_LOOKUP | MDS_INODELOCK_UPDATE |
 		    MDS_INODELOCK_LAYOUT | MDS_INODELOCK_PERM |
 		    MDS_INODELOCK_DOM))
-		ll_have_md_lock(inode, &bits, LCK_MINMODE);
+		ll_have_md_lock(lock->l_conn_export, inode, &bits, LCK_MINMODE);
 
 	if (bits & MDS_INODELOCK_DOM) {
 		rc = ll_dom_lock_cancel(inode, lock);
@@ -435,7 +436,7 @@ int ll_md_need_convert(struct ldlm_lock *lock)
 	unlock_res_and_lock(lock);
 
 	inode = ll_inode_from_resource_lock(lock);
-	ll_have_md_lock(inode, &bits, mode);
+	ll_have_md_lock(lock->l_conn_export, inode, &bits, mode);
 	iput(inode);
 	return !!(bits);
 }
diff --git a/fs/lustre/lmv/lmv_obd.c b/fs/lustre/lmv/lmv_obd.c
index 1b6e4aa..1b95d93 100644
--- a/fs/lustre/lmv/lmv_obd.c
+++ b/fs/lustre/lmv/lmv_obd.c
@@ -3558,39 +3558,49 @@ static enum ldlm_mode lmv_lock_match(struct obd_export *exp, u64 flags,
 {
 	struct obd_device *obd = exp->exp_obd;
 	struct lmv_obd *lmv = &obd->u.lmv;
-	enum ldlm_mode rc;
 	struct lu_tgt_desc *tgt;
-	int i;
+	u64 bits = policy->l_inodebits.bits;
+	enum ldlm_mode rc = LCK_MINMODE;
 	int index;
+	int i;
 
 	CDEBUG(D_INODE, "Lock match for " DFID "\n", PFID(fid));
 
-	/*
-	 * With DNE every object can have two locks in different namespaces:
+	/* only one bit is set */
+	LASSERT(bits && !(bits & (bits - 1)));
+	/* With DNE every object can have two locks in different namespaces:
 	 * lookup lock in space of MDT storing direntry and update/open lock in
 	 * space of MDT storing inode.  Try the MDT that the FID maps to first,
 	 * since this can be easily found, and only try others if that fails.
 	 */
-	for (i = 0, index = lmv_fid2tgt_index(lmv, fid);
-	     i < lmv->lmv_mdt_descs.ltd_tgts_size;
-	     i++, index = (index + 1) % lmv->lmv_mdt_descs.ltd_tgts_size) {
-		if (index < 0) {
-			CDEBUG(D_HA, "%s: " DFID " is inaccessible: rc = %d\n",
-			       obd->obd_name, PFID(fid), index);
-			index = 0;
+	if (bits == MDS_INODELOCK_LOOKUP) {
+		for (i = 0, index = lmv_fid2tgt_index(lmv, fid);
+		     i < lmv->lmv_mdt_descs.ltd_tgts_size; i++,
+		     index = (index + 1) % lmv->lmv_mdt_descs.ltd_tgts_size) {
+			if (index < 0) {
+				CDEBUG(D_HA,
+				       "%s: " DFID " is inaccessible: rc = %d\n",
+				       obd->obd_name, PFID(fid), index);
+				index = 0;
+			}
+			tgt = lmv_tgt(lmv, index);
+			if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
+				continue;
+			rc = md_lock_match(tgt->ltd_exp, flags, fid, type,
+					   policy, mode, lockh);
+			if (rc)
+				break;
 		}
-
-		tgt = lmv_tgt(lmv, index);
-		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
-			continue;
-
-		rc = md_lock_match(tgt->ltd_exp, flags, fid, type, policy, mode,
-				   lockh);
-		if (rc)
-			return rc;
+	} else {
+		tgt = lmv_fid2tgt(lmv, fid);
+		if (!IS_ERR(tgt) && tgt->ltd_exp && tgt->ltd_active)
+			rc = md_lock_match(tgt->ltd_exp, flags, fid, type,
+					   policy, mode, lockh);
 	}
 
-	return 0;
+	CDEBUG(D_INODE, "Lock match for "DFID": %d\n", PFID(fid), rc);
+
+	return rc;
 }
 
 static int lmv_get_lustre_md(struct obd_export *exp,
-- 
1.8.3.1

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [lustre-devel] [PATCH 25/40] lnet: libcfs: remove unused hash code
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (23 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 24/40] lustre: llite: match lock in corresponding namespace James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 26/40] lustre: client: -o network needs add_conn processing James Simmons
                   ` (14 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Timothy Day <timday@amazon.com>

Two unused helper functions that hash a key and then apply a mask
are removed.
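
Callers needing this behaviour can use the generic helpers from
<linux/hash.h> instead. Note this is only an assumption for
illustration, the updated callers are not part of this patch, and
hash_32()/hash_64() return the top bits rather than a masked product,
so the values differ even though the bucket selection is comparable:

    #include <linux/hash.h>

    /* was: cfs_hash_u32_hash(key, mask) / cfs_hash_u64_hash(key, mask) */
    unsigned int bkt32 = hash_32(key32, bits);  /* bits = ilog2(mask + 1) */
    unsigned int bkt64 = hash_64(key64, bits);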

WC-bug-id: https://jira.whamcloud.com/browse/LU-16518
Lustre-commit: 239e826876e5e2040 ("LU-16518 misc: use fixed hash code")
Signed-off-by: Timothy Day <timday@amazon.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49916
Reviewed-by: jsimmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/libcfs/libcfs_hash.h | 18 ------------------
 1 file changed, 18 deletions(-)

diff --git a/include/linux/libcfs/libcfs_hash.h b/include/linux/libcfs/libcfs_hash.h
index d3b4875..d60e002 100644
--- a/include/linux/libcfs/libcfs_hash.h
+++ b/include/linux/libcfs/libcfs_hash.h
@@ -829,24 +829,6 @@ static inline int __cfs_hash_theta(struct cfs_hash *hs)
 	return (hash & mask);
 }
 
-/*
- * Generic u32 hash algorithm.
- */
-static inline unsigned
-cfs_hash_u32_hash(const u32 key, unsigned int mask)
-{
-	return ((key * CFS_GOLDEN_RATIO_PRIME_32) & mask);
-}
-
-/*
- * Generic u64 hash algorithm.
- */
-static inline unsigned
-cfs_hash_u64_hash(const u64 key, unsigned int mask)
-{
-	return ((unsigned int)(key * CFS_GOLDEN_RATIO_PRIME_64) & mask);
-}
-
 /** iterate over all buckets in @bds (array of struct cfs_hash_bd) */
 #define cfs_hash_for_each_bd(bds, n, i)	\
 	for (i = 0; i < n && (bds)[i].bd_bucket != NULL; i++)
-- 
1.8.3.1

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [lustre-devel] [PATCH 26/40] lustre: client: -o network needs add_conn processing
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (24 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 25/40] lnet: libcfs: remove unused hash code James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 27/40] lnet: Lock primary NID logic James Simmons
                   ` (13 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Mikhail Pershin, Lustre Development List

From: Mikhail Pershin <mpershin@whamcloud.com>

The mount option '-o network' restricts a client import to use only
the selected network. It processes connection UUIDs/NIDs while
handling the 'setup' config command, but skips any 'add_conn'
command whose UUID does not mention that network. Meanwhile, a
connection UUID is just a name and may have many NIDs configured,
including NIDs on the restricted network, which get skipped as
well. As a result, the client import configuration misses failover
NIDs on the restricted network.

The patch makes the import save the restricted network information
after 'setup' command processing, so it is applied to every
client_import_add_conn() call. The 'add_conn' command is now always
processed and its NIDs are filtered in the same way as for 'setup'.
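
Schematically (condensed from the hunks below), the fake-connection
trick is replaced by a field saved once at setup time and consulted
for every connection:

    /* client_obd_setup(): remember the restricted net, if any */
    imp->imp_conn_restricted_net = refnet;    /* else LNET_NET_ANY */

    /* import_set_conn(), reached from both 'setup' and 'add_conn':
     * the UUID's NIDs are filtered against the saved network
     */
    refnet = imp->imp_conn_restricted_net;
    ptlrpc_conn = ptlrpc_uuid_to_connection(uuid, refnet);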

WC-bug-id: https://jira.whamcloud.com/browse/LU-16557
Lustre-commit: c508c9426838f1625 ("LU-16557 client: -o network needs add_conn processing")
Signed-off-by: Mikhail Pershin <mpershin@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49986
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Cyril Bordage <cbordage@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_import.h |  1 +
 fs/lustre/ldlm/ldlm_lib.c         | 20 +++++++++-----------
 fs/lustre/obdclass/obd_config.c   | 17 -----------------
 3 files changed, 10 insertions(+), 28 deletions(-)

diff --git a/fs/lustre/include/lustre_import.h b/fs/lustre/include/lustre_import.h
index 3ae05b5..ac46aae 100644
--- a/fs/lustre/include/lustre_import.h
+++ b/fs/lustre/include/lustre_import.h
@@ -340,6 +340,7 @@ struct obd_import {
 
 	struct imp_at			imp_at;	/* adaptive timeout data */
 	time64_t			imp_last_reply_time; /* for health check */
+	u32				imp_conn_restricted_net;
 };
 
 /* import.c : adaptive timeout handling.
diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
index ddedaad..0b8389e 100644
--- a/fs/lustre/ldlm/ldlm_lib.c
+++ b/fs/lustre/ldlm/ldlm_lib.c
@@ -56,7 +56,7 @@ static int import_set_conn(struct obd_import *imp, struct obd_uuid *uuid,
 {
 	struct ptlrpc_connection *ptlrpc_conn;
 	struct obd_import_conn *imp_conn = NULL, *item;
-	u32 refnet = LNET_NET_ANY;
+	u32 refnet = imp->imp_conn_restricted_net;
 	int rc = 0;
 
 	if (!create && !priority) {
@@ -64,10 +64,11 @@ static int import_set_conn(struct obd_import *imp, struct obd_uuid *uuid,
 		return -EINVAL;
 	}
 
-	if (imp->imp_connection &&
-	    imp->imp_connection->c_remote_uuid.uuid[0] == 0)
-		/* refnet is used to restrict network connections */
-		refnet = LNET_NID_NET(&imp->imp_connection->c_self);
+	/* refnet is used to restrict network connections */
+	if (refnet != LNET_NET_ANY)
+		CDEBUG(D_HA, "imp %s: restrict %s to %s net\n",
+		       imp->imp_obd->obd_name, uuid->uuid,
+		       libcfs_net2str(refnet));
 
 	ptlrpc_conn = ptlrpc_uuid_to_connection(uuid, refnet);
 	if (!ptlrpc_conn) {
@@ -296,10 +297,6 @@ int client_obd_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	int rq_portal, rp_portal, connect_op;
 	const char *name = obd->obd_type->typ_name;
 	enum ldlm_ns_type ns_type = LDLM_NS_TYPE_UNKNOWN;
-	struct ptlrpc_connection fake_conn = {
-		.c_self = {},
-		.c_remote_uuid.uuid[0] = 0
-	};
 	int rc;
 
 	/*
@@ -494,8 +491,9 @@ int client_obd_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 			       rc);
 			goto err_import;
 		}
-		lnet_nid4_to_nid(LNET_MKNID(refnet, 0), &fake_conn.c_self);
-		imp->imp_connection = &fake_conn;
+		imp->imp_conn_restricted_net = refnet;
+	} else {
+		imp->imp_conn_restricted_net = LNET_NET_ANY;
 	}
 
 	rc = client_import_add_conn(imp, &server_uuid, 1);
diff --git a/fs/lustre/obdclass/obd_config.c b/fs/lustre/obdclass/obd_config.c
index 953f544..f2173df 100644
--- a/fs/lustre/obdclass/obd_config.c
+++ b/fs/lustre/obdclass/obd_config.c
@@ -1331,23 +1331,6 @@ int class_config_llog_handler(const struct lu_env *env,
 			}
 		}
 
-		/* Skip add_conn command if uuid is not on restricted net */
-		if (clli && clli->cfg_sb && s2lsi(clli->cfg_sb)) {
-			struct lustre_sb_info *lsi = s2lsi(clli->cfg_sb);
-			char *uuid_str = lustre_cfg_string(lcfg, 1);
-
-			if (lcfg->lcfg_command == LCFG_ADD_CONN &&
-			    lsi->lsi_lmd->lmd_nidnet &&
-			    LNET_NIDNET(libcfs_str2nid(uuid_str)) !=
-			    libcfs_str2net(lsi->lsi_lmd->lmd_nidnet)) {
-				CDEBUG(D_CONFIG, "skipping add_conn for %s\n",
-				       uuid_str);
-				rc = 0;
-				/* No processing! */
-				break;
-			}
-		}
-
 		lcfg_len = lustre_cfg_len(bufs.lcfg_bufcount, bufs.lcfg_buflen);
 		lcfg_new = kzalloc(lcfg_len, GFP_NOFS);
 		if (!lcfg_new) {
-- 
1.8.3.1

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [lustre-devel] [PATCH 27/40] lnet: Lock primary NID logic
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (25 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 26/40] lustre: client: -o network needs add_conn processing James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 28/40] lnet: Peers added via kernel API should be permanent James Simmons
                   ` (12 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Amir Shehata, Lustre Development List

From: Amir Shehata <ashehata@whamcloud.com>

If a peer is created by Lustre, make sure to lock that peer's
primary NID. This peer can be discovered in the background.
There is no need to block until discovery is complete, as Lustre
can continue on with the primary NID it provided.

Discovery will populate the peer with the other interfaces the peer
has, but will not change the peer's primary NID. It can also delete
peer NIDs which Lustre told it about (but not the primary NID).

If a peer has been manually discovered via the
   lnetctl discover <nid>
command, then make sure to delete the manually discovered peer and
recreate it with the NID information Lustre provided.
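
The LNetPrimaryNID() side of this is sketched below (condensed from
the hunk; the lpni lookup, locking details and refcounting are
omitted):

    spin_lock(&lp->lp_lock);
    if (!lnet_peer_discovery_disabled &&
        (!(lp->lp_state & LNET_PEER_LOCK_PRIMARY) ||
         !lnet_peer_is_uptodate_locked(lp))) {
            /* pin the primary NID and force a full discovery cycle */
            lp->lp_state |= LNET_PEER_FORCE_PING | LNET_PEER_FORCE_PUSH |
                            LNET_PEER_LOCK_PRIMARY;
            spin_unlock(&lp->lp_lock);

            /* background discovery: do not block the caller on it */
            rc = lnet_discover_peer_locked(lpni, cpt, false);
    } else {
            spin_unlock(&lp->lp_lock);
    }
    *nid = lp->lp_primary_nid;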

WC-bug-id: https://jira.whamcloud.com/browse/LU-14668
Lustre-commit: aacb16191a72bc6db ("LU-14668 lnet: Lock primary NID logic")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50106
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Cyril Bordage <cbordage@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 106 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 86 insertions(+), 20 deletions(-)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index da1f8d4..0539cb4 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -534,6 +534,15 @@ static void lnet_peer_cancel_discovery(struct lnet_peer *lp)
 		}
 	}
 
+	/* If we're asked to lock down the primary NID we shouldn't be
+	 * deleting it
+	 */
+	if (lp->lp_state & LNET_PEER_LOCK_PRIMARY &&
+	    nid_same(&primary_nid, nid)) {
+		rc = -EPERM;
+		goto out;
+	}
+
 	lpni = lnet_peer_ni_find_locked(nid);
 	if (!lpni) {
 		rc = -ENOENT;
@@ -1358,6 +1367,19 @@ struct lnet_peer_ni *
 		if (LNET_NID_IS_ANY(&pnid)) {
 			lnet_nid4_to_nid(nids[i], &pnid);
 			rc = lnet_add_peer_ni(&pnid, &LNET_ANY_NID, mr, true);
+			if (rc == -EALREADY) {
+				struct lnet_peer *lp;
+
+				CDEBUG(D_NET, "A peer exists for NID %s\n",
+				       libcfs_nidstr(&pnid));
+				rc = 0;
+				/* Adds a refcount */
+				lp = lnet_find_peer(&pnid);
+				LASSERT(lp);
+				pnid = lp->lp_primary_nid;
+				/* Drop refcount from lookup */
+				lnet_peer_decref_locked(lp);
+			}
 		} else if (lnet_peer_discovery_disabled) {
 			lnet_nid4_to_nid(nids[i], &nid);
 			rc = lnet_add_peer_ni(&nid, &LNET_ANY_NID, mr, true);
@@ -1405,13 +1427,20 @@ void LNetPrimaryNID(struct lnet_nid *nid)
 	 * down then this discovery can introduce long delays into the mount
 	 * process, so skip it if it isn't necessary.
 	 */
-	while (!lnet_peer_discovery_disabled && !lnet_peer_is_uptodate(lp)) {
-		spin_lock(&lp->lp_lock);
+	spin_lock(&lp->lp_lock);
+	if (!lnet_peer_discovery_disabled &&
+	    (!(lp->lp_state & LNET_PEER_LOCK_PRIMARY) ||
+	     !lnet_peer_is_uptodate_locked(lp))) {
 		/* force a full discovery cycle */
-		lp->lp_state |= LNET_PEER_FORCE_PING | LNET_PEER_FORCE_PUSH;
+		lp->lp_state |= LNET_PEER_FORCE_PING | LNET_PEER_FORCE_PUSH |
+				LNET_PEER_LOCK_PRIMARY;
 		spin_unlock(&lp->lp_lock);
 
-		rc = lnet_discover_peer_locked(lpni, cpt, true);
+		/* start discovery in the background. Messages to that
+		 * peer will not go through until the discovery is
+		 * complete
+		 */
+		rc = lnet_discover_peer_locked(lpni, cpt, false);
 		if (rc)
 			goto out_decref;
 		/* The lpni (or lp) for this NID may have changed and our ref is
@@ -1425,14 +1454,8 @@ void LNetPrimaryNID(struct lnet_nid *nid)
 			goto out_unlock;
 		}
 		lp = lpni->lpni_peer_net->lpn_peer;
-
-		/* If we find that the peer has discovery disabled then we will
-		 * not modify whatever primary NID is currently set for this
-		 * peer. Thus, we can break out of this loop even if the peer
-		 * is not fully up to date.
-		 */
-		if (lnet_is_discovery_disabled(lp))
-			break;
+	} else {
+		spin_unlock(&lp->lp_lock);
 	}
 	*nid = lp->lp_primary_nid;
 out_decref:
@@ -1538,6 +1561,8 @@ struct lnet_peer_net *
 			lnet_peer_clr_non_mr_pref_nids(lp);
 		}
 	}
+	if (flags & LNET_PEER_LOCK_PRIMARY)
+		lp->lp_state |= LNET_PEER_LOCK_PRIMARY;
 	spin_unlock(&lp->lp_lock);
 
 	lp->lp_nnis++;
@@ -1599,13 +1624,28 @@ struct lnet_peer_net *
 			else if ((lp->lp_state ^ flags) & LNET_PEER_MULTI_RAIL)
 				rc = -EPERM;
 			goto out;
-		} else if (!(flags & LNET_PEER_CONFIGURED)) {
+		} else if (lp->lp_state & LNET_PEER_LOCK_PRIMARY) {
 			if (nid_same(&lp->lp_primary_nid, nid)) {
 				rc = -EEXIST;
 				goto out;
 			}
+			/* we're trying to recreate an existing peer which
+			 * has already been created and its primary
+			 * locked. This is likely due to two servers
+			 * existing on the same node. So we'll just refer
+			 * to that node with the primary NID which was
+			 * first added by Lustre
+			 */
+			rc = -EALREADY;
+			goto out;
 		}
-		/* Delete and recreate as a configured peer. */
+		/* Delete and recreate the peer.
+		 * We can get here:
+		 * 1. If the peer is being recreated as a configured NID
+		 * 2. if there already exists a peer which
+		 *    was discovered manually, but is recreated via Lustre
+		 *    with PRIMARY_lock
+		 */
 		rc = lnet_peer_del(lp);
 		if (rc)
 			goto out;
@@ -1695,19 +1735,36 @@ struct lnet_peer_net *
 		}
 		/* If this is the primary NID, destroy the peer. */
 		if (lnet_peer_ni_is_primary(lpni)) {
-			struct lnet_peer *rtr_lp =
+			struct lnet_peer *lp2 =
 				lpni->lpni_peer_net->lpn_peer;
-			int rtr_refcount = rtr_lp->lp_rtr_refcount;
-
+			int rtr_refcount = lp2->lp_rtr_refcount;
+
+			/* If the new peer that this NID belongs to is
+			 * a primary NID for another peer which we're
+			 * suppose to preserve the Primary for then we
+			 * don't want to mess with it. But the
+			 * configuration is wrong at this point, so we
+			 * should flag both of these peers as in a bad
+			 * state
+			 */
+			if (lp2->lp_state & LNET_PEER_LOCK_PRIMARY) {
+				spin_lock(&lp->lp_lock);
+				lp->lp_state |= LNET_PEER_BAD_CONFIG;
+				spin_unlock(&lp->lp_lock);
+				spin_lock(&lp2->lp_lock);
+				lp2->lp_state |= LNET_PEER_BAD_CONFIG;
+				spin_unlock(&lp2->lp_lock);
+				goto out_free_lpni;
+			}
 			/* if we're trying to delete a router it means
 			 * we're moving this peer NI to a new peer so must
 			 * transfer router properties to the new peer
 			 */
 			if (rtr_refcount > 0) {
 				flags |= LNET_PEER_RTR_NI_FORCE_DEL;
-				lnet_rtr_transfer_to_peer(rtr_lp, lp);
+				lnet_rtr_transfer_to_peer(lp2, lp);
 			}
-			lnet_peer_del(lpni->lpni_peer_net->lpn_peer);
+			lnet_peer_del(lp2);
 			lnet_peer_ni_decref_locked(lpni);
 			lpni = lnet_peer_ni_alloc(nid);
 			if (!lpni) {
@@ -1765,7 +1822,8 @@ struct lnet_peer_net *
 	if (nid_same(&lp->lp_primary_nid, nid))
 		goto out;
 
-	lp->lp_primary_nid = *nid;
+	if (!(lp->lp_state & LNET_PEER_LOCK_PRIMARY))
+		lp->lp_primary_nid = *nid;
 
 	rc = lnet_peer_add_nid(lp, nid, flags);
 	if (rc) {
@@ -1773,6 +1831,14 @@ struct lnet_peer_net *
 		goto out;
 	}
 out:
+	/* if this is a configured peer or the primary for that peer has
+	 * been locked, then we don't want to flag this scenario as
+	 * a failure
+	 */
+	if (lp->lp_state & LNET_PEER_CONFIGURED ||
+	    lp->lp_state & LNET_PEER_LOCK_PRIMARY)
+		return 0;
+
 	CDEBUG(D_NET, "peer %s NID %s: %d\n",
 	       libcfs_nidstr(&old), libcfs_nidstr(nid), rc);
 
-- 
1.8.3.1

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [lustre-devel] [PATCH 28/40] lnet: Peers added via kernel API should be permanent
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (26 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 27/40] lnet: Lock primary NID logic James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 29/40] lnet: don't delete peer created by Lustre James Simmons
                   ` (11 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Chris Horn, Amir Shehata, Lustre Development List

From: Chris Horn <chris.horn@hpe.com>

The LNetAddPeer() API allows Lustre to predefine peers for LNet.
Originally these peers would be temporary and potentially re-created
via discovery. Instead, let's make these peers permanent. This allows
Lustre to dictate the primary NID of the peer. LNet makes sure this
primary NID is not changed afterwards.
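
In effect the two entry points now differ only in the flags they
pass down (condensed from the hunks below):

    /* userspace (lnetctl peer add): a configured, user-managed peer */
    int lnet_user_add_peer_ni(struct lnet_nid *prim_nid,
                              struct lnet_nid *nid, bool mr)
    {
            return lnet_add_peer_ni(prim_nid, nid, mr,
                                    LNET_PEER_CONFIGURED);
    }

    /* kernel callers via LNetAddPeer() pin the primary NID instead,
     * so discovery may only add or remove the other NIDs
     */
    rc = lnet_add_peer_ni(&pnid, &nid, mr, LNET_PEER_LOCK_PRIMARY);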

WC-bug-id: https://jira.whamcloud.com/browse/LU-14668
Lustre-commit: 41733dadd8ad0e87e ("LU-14668 lnet: Peers added via kernel API should be permanent")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/43788
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Cyril Bordage <cbordage@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h |  3 +--
 net/lnet/lnet/api-ni.c        |  2 +-
 net/lnet/lnet/peer.c          | 34 +++++++++++++++++-----------------
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index d03dcf8..a8aa924 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -953,8 +953,7 @@ bool lnet_peer_is_pref_rtr_locked(struct lnet_peer_ni *lpni,
 int lnet_peer_add_pref_rtr(struct lnet_peer_ni *lpni, struct lnet_nid *nid);
 int lnet_peer_ni_set_non_mr_pref_nid(struct lnet_peer_ni *lpni,
 				     struct lnet_nid *nid);
-int lnet_add_peer_ni(struct lnet_nid *key_nid, struct lnet_nid *nid, bool mr,
-		     bool temp);
+int lnet_user_add_peer_ni(struct lnet_nid *key_nid, struct lnet_nid *nid, bool mr);
 int lnet_del_peer_ni(struct lnet_nid *key_nid, struct lnet_nid *nid);
 int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk);
 int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index a4fb95f..20093a9 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -4239,7 +4239,7 @@ u32 lnet_get_dlc_seq_locked(void)
 		mutex_lock(&the_lnet.ln_api_mutex);
 		lnet_nid4_to_nid(cfg->prcfg_prim_nid, &prim_nid);
 		lnet_nid4_to_nid(cfg->prcfg_cfg_nid, &nid);
-		rc = lnet_add_peer_ni(&prim_nid, &nid, cfg->prcfg_mr, false);
+		rc = lnet_user_add_peer_ni(&prim_nid, &nid, cfg->prcfg_mr);
 		mutex_unlock(&the_lnet.ln_api_mutex);
 		return rc;
 	}
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 0539cb4..fa2ca54 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -42,6 +42,8 @@
 #define LNET_REDISCOVER_PEER	(1)
 
 static int lnet_peer_queue_for_discovery(struct lnet_peer *lp);
+static int lnet_add_peer_ni(struct lnet_nid *prim_nid, struct lnet_nid *nid, bool mr,
+			    unsigned int flags);
 
 static void
 lnet_peer_remove_from_remote_list(struct lnet_peer_ni *lpni)
@@ -1366,7 +1368,8 @@ struct lnet_peer_ni *
 		lnet_nid4_to_nid(nids[i], &nid);
 		if (LNET_NID_IS_ANY(&pnid)) {
 			lnet_nid4_to_nid(nids[i], &pnid);
-			rc = lnet_add_peer_ni(&pnid, &LNET_ANY_NID, mr, true);
+			rc = lnet_add_peer_ni(&pnid, &LNET_ANY_NID, mr,
+					      LNET_PEER_LOCK_PRIMARY);
 			if (rc == -EALREADY) {
 				struct lnet_peer *lp;
 
@@ -1382,10 +1385,12 @@ struct lnet_peer_ni *
 			}
 		} else if (lnet_peer_discovery_disabled) {
 			lnet_nid4_to_nid(nids[i], &nid);
-			rc = lnet_add_peer_ni(&nid, &LNET_ANY_NID, mr, true);
+			rc = lnet_add_peer_ni(&nid, &LNET_ANY_NID, mr,
+					      LNET_PEER_LOCK_PRIMARY);
 		} else {
 			lnet_nid4_to_nid(nids[i], &nid);
-			rc = lnet_add_peer_ni(&pnid, &nid, mr, true);
+			rc = lnet_add_peer_ni(&pnid, &nid, mr,
+					      LNET_PEER_LOCK_PRIMARY);
 		}
 
 		if (rc && rc != -EEXIST)
@@ -1918,22 +1923,18 @@ struct lnet_peer_net *
  * The caller must hold ln_api_mutex. This prevents the peer from
  * being created/modified/deleted by a different thread.
  */
-int
+static int
 lnet_add_peer_ni(struct lnet_nid *prim_nid, struct lnet_nid *nid, bool mr,
-		 bool temp)
+		 unsigned int flags)
 __must_hold(&the_lnet.ln_api_mutex)
 {
 	struct lnet_peer *lp = NULL;
 	struct lnet_peer_ni *lpni;
-	unsigned int flags = 0;
 
 	/* The prim_nid must always be specified */
 	if (LNET_NID_IS_ANY(prim_nid))
 		return -EINVAL;
 
-	if (!temp)
-		flags = LNET_PEER_CONFIGURED;
-
 	if (mr)
 		flags |= LNET_PEER_MULTI_RAIL;
 
@@ -1951,13 +1952,6 @@ struct lnet_peer_net *
 	lnet_peer_ni_decref_locked(lpni);
 	lp = lpni->lpni_peer_net->lpn_peer;
 
-	/* Peer must have been configured. */
-	if (!temp && !(lp->lp_state & LNET_PEER_CONFIGURED)) {
-		CDEBUG(D_NET, "peer %s was not configured\n",
-		       libcfs_nidstr(prim_nid));
-		return -ENOENT;
-	}
-
 	/* Primary NID must match */
 	if (!nid_same(&lp->lp_primary_nid, prim_nid)) {
 		CDEBUG(D_NET, "prim_nid %s is not primary for peer %s\n",
@@ -1973,7 +1967,8 @@ struct lnet_peer_net *
 		return -EPERM;
 	}
 
-	if (temp && lnet_peer_is_uptodate(lp)) {
+	if ((flags & LNET_PEER_LOCK_PRIMARY) &&
+	    (lnet_peer_is_uptodate(lp) && (lp->lp_state & LNET_PEER_LOCK_PRIMARY))) {
 		CDEBUG(D_NET,
 		       "Don't add temporary peer NI for uptodate peer %s\n",
 		       libcfs_nidstr(&lp->lp_primary_nid));
@@ -1983,6 +1978,11 @@ struct lnet_peer_net *
 	return lnet_peer_add_nid(lp, nid, flags);
 }
 
+int lnet_user_add_peer_ni(struct lnet_nid *prim_nid, struct lnet_nid *nid, bool mr)
+{
+	return lnet_add_peer_ni(prim_nid, nid, mr, LNET_PEER_CONFIGURED);
+}
+
 /*
  * Implementation of IOC_LIBCFS_DEL_PEER_NI.
  *
-- 
1.8.3.1

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [lustre-devel] [PATCH 29/40] lnet: don't delete peer created by Lustre
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (27 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 28/40] lnet: Peers added via kernel API should be permanent James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 30/40] lnet: memory leak in copy_ioc_udsp_descr James Simmons
                   ` (10 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Amir Shehata, Lustre Development List

From: Amir Shehata <ashehata@whamcloud.com>

Peers created by Lustre have their primary NIDs locked.
If such a peer is deleted, it will confuse Lustre. So when manually
deleting a peer using:
   lnetctl peer del --prim_nid ...
we must continue to preserve the primary NID. Therefore we delete
all the constituent NIDs, but keep the primary NID. We then
flag the peer for rediscovery.
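
The deletion path now branches on the lock flag roughly as follows
(condensed from the hunk below):

    if (LNET_NID_IS_ANY(nid) || nid_same(nid, &lp->lp_primary_nid)) {
            if (lp->lp_state & LNET_PEER_LOCK_PRIMARY)
                    /* Lustre-created peer: drop every constituent NID
                     * except the primary and mark it for rediscovery
                     */
                    return lnet_reset_peer(lp);
            return lnet_peer_del(lp);
    }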

WC-bug-id: https://jira.whamcloud.com/browse/LU-14668
Lustre-commit: 7cc5b4329fc2eecbf ("LU-14668 lnet: don't delete peer created by Lustre")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/43565
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Cyril Bordage <cbordage@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/peer.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index fa2ca54..0a5e73a 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -1983,6 +1983,40 @@ int lnet_user_add_peer_ni(struct lnet_nid *prim_nid, struct lnet_nid *nid, bool
 	return lnet_add_peer_ni(prim_nid, nid, mr, LNET_PEER_CONFIGURED);
 }
 
+static int
+lnet_reset_peer(struct lnet_peer *lp)
+{
+	struct lnet_peer_net *lpn, *lpntmp;
+	struct lnet_peer_ni *lpni, *lpnitmp;
+	unsigned int flags;
+	int rc;
+
+	lnet_peer_cancel_discovery(lp);
+
+	flags = LNET_PEER_CONFIGURED;
+	if (lp->lp_state & LNET_PEER_MULTI_RAIL)
+		flags |= LNET_PEER_MULTI_RAIL;
+
+	list_for_each_entry_safe(lpn, lpntmp, &lp->lp_peer_nets, lpn_peer_nets) {
+		list_for_each_entry_safe(lpni, lpnitmp, &lpn->lpn_peer_nis,
+					 lpni_peer_nis) {
+			if (nid_same(&lpni->lpni_nid, &lp->lp_primary_nid))
+				continue;
+
+			rc = lnet_peer_del_nid(lp, &lpni->lpni_nid, flags);
+			if (rc) {
+				CERROR("Failed to delete %s from peer %s\n",
+				       libcfs_nidstr(&lpni->lpni_nid),
+				       libcfs_nidstr(&lp->lp_primary_nid));
+			}
+		}
+	}
+
+	/* mark it for discovery the next time we use it */
+	lp->lp_state &= ~LNET_PEER_NIDS_UPTODATE;
+	return 0;
+}
+
 /*
  * Implementation of IOC_LIBCFS_DEL_PEER_NI.
  *
@@ -2026,8 +2060,15 @@ int lnet_user_add_peer_ni(struct lnet_nid *prim_nid, struct lnet_nid *nid, bool
 	}
 	lnet_net_unlock(LNET_LOCK_EX);
 
-	if (LNET_NID_IS_ANY(nid) || nid_same(nid, &lp->lp_primary_nid))
-		return lnet_peer_del(lp);
+	if (LNET_NID_IS_ANY(nid) || nid_same(nid, &lp->lp_primary_nid)) {
+		if (lp->lp_state & LNET_PEER_LOCK_PRIMARY) {
+			CERROR("peer %s created by Lustre. Must preserve primary NID, but will remove other NIDs\n",
+			       libcfs_nidstr(&lp->lp_primary_nid));
+			return lnet_reset_peer(lp);
+		} else {
+			return lnet_peer_del(lp);
+		}
+	}
 
 	flags = LNET_PEER_CONFIGURED;
 	if (lp->lp_state & LNET_PEER_MULTI_RAIL)
-- 
1.8.3.1


* [lustre-devel] [PATCH 30/40] lnet: memory leak in copy_ioc_udsp_descr
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (28 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 29/40] lnet: don't delete peer created by Lustre James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 31/40] lnet: remove crash with UDSP James Simmons
                   ` (9 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Chris Horn, Lustre Development List

From: Chris Horn <chris.horn@hpe.com>

copy_ioc_udsp_descr() doesn't correctly handle the case where a
net number was not specified. In this case, there isn't any net
number range that needs to be copied into the udsp descriptor.
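
The sizing change below can be sketched in isolation as follows
(a stand-alone illustration; the struct sizes are stand-ins, not
the real libcfs definitions):

  #include <stdio.h>
  #include <stddef.h>

  struct cfs_expr_list  { char pad[48]; };    /* stand-in */
  struct cfs_range_expr { char pad[32]; };    /* stand-in */

  /* One cfs_expr_list per address expression, plus one extra only
   * when a net numeric range was actually supplied. */
  static size_t descr_alloc_size(unsigned int net_range_exprs,
                                 unsigned int expr_count,
                                 unsigned int range_count)
  {
      unsigned int nlists = expr_count + (net_range_exprs ? 1 : 0);

      return sizeof(struct cfs_expr_list) * nlists +
             sizeof(struct cfs_range_expr) * range_count;
  }

  int main(void)
  {
      printf("with net range:    %zu bytes\n", descr_alloc_size(1, 2, 3));
      printf("without net range: %zu bytes\n", descr_alloc_size(0, 2, 3));
      return 0;
  }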

WC-bug-id: https://jira.whamcloud.com/browse/LU-16575
Lustre-commit: f8e129198b002589d ("LU-16575 lnet: memory leak in copy_ioc_udsp_descr")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50081
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Cyril Bordage <cbordage@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/udsp.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/lnet/lnet/udsp.c b/net/lnet/lnet/udsp.c
index 2594df1..deaca51 100644
--- a/net/lnet/lnet/udsp.c
+++ b/net/lnet/lnet/udsp.c
@@ -1485,8 +1485,19 @@ struct lnet_udsp *
 	CDEBUG(D_NET, "%u\n", nid_descr->ud_net_id.udn_net_type);
 
 	/* allocate the total memory required to copy this NID descriptor */
-	alloc_size = (sizeof(struct cfs_expr_list) * (expr_count + 1)) +
-		     (sizeof(struct cfs_range_expr) * (range_count));
+	if (ioc_nid->iud_net.ud_net_num_expr.le_count) {
+		if (ioc_nid->iud_net.ud_net_num_expr.le_count != 1) {
+			CERROR("Unexpected number of net numeric ranges \"%u\". Cannot add UDSP rule.\n",
+			       ioc_nid->iud_net.ud_net_num_expr.le_count);
+			return -EINVAL;
+		}
+		alloc_size = (sizeof(struct cfs_expr_list) * (expr_count + 1)) +
+			     (sizeof(struct cfs_range_expr) * (range_count));
+	} else {
+		alloc_size = (sizeof(struct cfs_expr_list) * (expr_count)) +
+			     (sizeof(struct cfs_range_expr) * (range_count));
+	}
+
 	buf = kzalloc(alloc_size, GFP_KERNEL);
 	if (!buf)
 		return -ENOMEM;
-- 
1.8.3.1


* [lustre-devel] [PATCH 31/40] lnet: remove crash with UDSP
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (29 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 30/40] lnet: memory leak in copy_ioc_udsp_descr James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 32/40] lustre: ptlrpc: fix clang build errors James Simmons
                   ` (8 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Cyril Bordage, Lustre Development List

From: Cyril Bordage <cbordage@whamcloud.com>

The following sequence of commands caused a crash:
  # lnetctl udsp add --dst tcp --prio 1
  # lnetctl discover 192.168.122.60@tcp
The pointer to lnet_peer_net in udsp_info is now checked before
it is used.

WC-bug-id: https://jira.whamcloud.com/browse/LU-15944
Lustre-commit: c56b9455f05f760ae ("LU-15944 lnet: remove crash with UDSP")
Signed-off-by: Cyril Bordage <cbordage@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/48801
Reviewed-by: Chris Horn <chris.horn@hpe.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Frank Sehr <fsehr@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/udsp.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/net/lnet/lnet/udsp.c b/net/lnet/lnet/udsp.c
index deaca51..eb9a614 100644
--- a/net/lnet/lnet/udsp.c
+++ b/net/lnet/lnet/udsp.c
@@ -74,13 +74,17 @@
  *     from the policy list.
  *
  *   Generally, the syntax is as follows
- *     lnetctl policy <add | del | show>
- *      --src:      ip2nets syntax specifying the local NID to match
- *      --dst:      ip2nets syntax specifying the remote NID to match
- *      --rte:      ip2nets syntax specifying the router NID to match
- *      --priority: Priority to apply to rule matches
- *      --idx:      Index of where to insert or delete the rule
- *                  By default add appends to the end of the rule list
+ *     lnetctl udsp add: add a udsp
+ *      --src: ip2nets syntax specifying the local NID to match
+ *      --dst: ip2nets syntax specifying the remote NID to match
+ *      --rte: ip2nets syntax specifying the router NID to match
+ *      --priority: priority value (0 - highest priority)
+ *      --idx: index of where to insert the rule.
+ *             By default, appends to the end of the rule list.
+ *     lnetctl udsp del: delete a udsp
+ *      --idx: index of the Policy.
+ *     lnetctl udsp show: show udsps
+ *       --idx: index of the policy to show.
  *
  * Author: Amir Shehata
  */
@@ -536,7 +540,8 @@ enum udsp_apply {
 
 	/* check if looking for a net match */
 	if (!rc &&
-	    (lnet_get_list_len(&lp_match->ud_addr_range) ||
+	    (!udi->udi_lpn ||
+	     lnet_get_list_len(&lp_match->ud_addr_range) ||
 	     !cfs_match_net(udi->udi_lpn->lpn_net_id,
 			    lp_match->ud_net_id.udn_net_type,
 			    &lp_match->ud_net_id.udn_net_num_range))) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 32/40] lustre: ptlrpc: fix clang build errors
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (30 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 31/40] lnet: remove crash with UDSP James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 33/40] lustre: ldlm: remove client_import_find_conn() James Simmons
                   ` (7 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Timothy Day <timday@amazon.com>

Fix bugs which cause build errors with clang.

The majority of changes involve adding
defines for the 'ptlrpc_nrs_ctl' enum,
which avoids having to explicitly cast
enums from one type to another.

A 'strlcpy' call in 'sptlrpc_process_config'
was copying the wrong number of bytes: it
passed the size of the source buffer 'target'
instead of the destination buffer 'fsname'.
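
To illustrate the enum issue, here is a hypothetical stand-alone
example (not the Lustre definitions): clang's -Wenum-conversion may
warn when an enumerator declared in one enum is passed where a
different enum type is expected, which the old code silenced with
explicit casts.

  #include <stdio.h>

  enum ctl_opc {
      CTL_OPC_POL_SPEC_01 = 0x20,
      CTL_OPC_POL_SPEC_02,
  };

  /* old style: a second enum borrowing values from the first */
  enum delay_opc {
      DELAY_RD_MIN = CTL_OPC_POL_SPEC_01,
      DELAY_WR_MIN,
  };

  /* new style: plain aliases, so the names stay in the ctl_opc enum */
  #define DELAY_RD_MIN_NEW CTL_OPC_POL_SPEC_01
  #define DELAY_WR_MIN_NEW CTL_OPC_POL_SPEC_02

  static void dispatch(enum ctl_opc opc)
  {
      printf("opcode 0x%x\n", (unsigned int)opc);
  }

  int main(void)
  {
      dispatch(DELAY_RD_MIN);      /* may warn: different enum type */
      dispatch(DELAY_WR_MIN_NEW);  /* no conversion, no cast needed */
      return 0;
  }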

WC-bug-id: https://jira.whamcloud.com/browse/LU-16518
Lustre-commit: 50f28f81b5aa8f8ad ("LU-16518 ptlrpc: fix clang build errors")
Signed-off-by: Timothy Day <timday@amazon.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49859
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: jsimmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_nrs.h       | 11 ++++++++++-
 fs/lustre/include/lustre_nrs_delay.h | 14 ++++++--------
 fs/lustre/ptlrpc/nrs_delay.c         |  2 +-
 fs/lustre/ptlrpc/sec_config.c        |  2 +-
 4 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/fs/lustre/include/lustre_nrs.h b/fs/lustre/include/lustre_nrs.h
index 7e0a840..0e0dd73 100644
--- a/fs/lustre/include/lustre_nrs.h
+++ b/fs/lustre/include/lustre_nrs.h
@@ -64,7 +64,16 @@ enum ptlrpc_nrs_ctl {
 	 * Policies can start using opcodes from this value and onwards for
 	 * their own purposes; the assigned value itself is arbitrary.
 	 */
-	PTLRPC_NRS_CTL_1ST_POL_SPEC = 0x20,
+	PTLRPC_NRS_CTL_POL_SPEC_01 = 0x20,
+	PTLRPC_NRS_CTL_POL_SPEC_02,
+	PTLRPC_NRS_CTL_POL_SPEC_03,
+	PTLRPC_NRS_CTL_POL_SPEC_04,
+	PTLRPC_NRS_CTL_POL_SPEC_05,
+	PTLRPC_NRS_CTL_POL_SPEC_06,
+	PTLRPC_NRS_CTL_POL_SPEC_07,
+	PTLRPC_NRS_CTL_POL_SPEC_08,
+	PTLRPC_NRS_CTL_POL_SPEC_09,
+	PTLRPC_NRS_CTL_POL_SPEC_10
 };
 
 /**
diff --git a/fs/lustre/include/lustre_nrs_delay.h b/fs/lustre/include/lustre_nrs_delay.h
index 52c3885..75bf56d 100644
--- a/fs/lustre/include/lustre_nrs_delay.h
+++ b/fs/lustre/include/lustre_nrs_delay.h
@@ -73,14 +73,12 @@ struct nrs_delay_req {
 	time64_t	req_start_time;
 };
 
-enum nrs_ctl_delay {
-	NRS_CTL_DELAY_RD_MIN = PTLRPC_NRS_CTL_1ST_POL_SPEC,
-	NRS_CTL_DELAY_WR_MIN,
-	NRS_CTL_DELAY_RD_MAX,
-	NRS_CTL_DELAY_WR_MAX,
-	NRS_CTL_DELAY_RD_PCT,
-	NRS_CTL_DELAY_WR_PCT,
-};
+#define NRS_CTL_DELAY_RD_MIN PTLRPC_NRS_CTL_POL_SPEC_01
+#define NRS_CTL_DELAY_WR_MIN PTLRPC_NRS_CTL_POL_SPEC_02
+#define NRS_CTL_DELAY_RD_MAX PTLRPC_NRS_CTL_POL_SPEC_03
+#define NRS_CTL_DELAY_WR_MAX PTLRPC_NRS_CTL_POL_SPEC_04
+#define NRS_CTL_DELAY_RD_PCT PTLRPC_NRS_CTL_POL_SPEC_05
+#define NRS_CTL_DELAY_WR_PCT PTLRPC_NRS_CTL_POL_SPEC_06
 
 /** @} delay */
 
diff --git a/fs/lustre/ptlrpc/nrs_delay.c b/fs/lustre/ptlrpc/nrs_delay.c
index 127f00c..b249749 100644
--- a/fs/lustre/ptlrpc/nrs_delay.c
+++ b/fs/lustre/ptlrpc/nrs_delay.c
@@ -322,7 +322,7 @@ static int nrs_delay_ctl(struct ptlrpc_nrs_policy *policy,
 
 	assert_spin_locked(&policy->pol_nrs->nrs_lock);
 
-	switch ((enum nrs_ctl_delay)opc) {
+	switch (opc) {
 	default:
 		return -EINVAL;
 
diff --git a/fs/lustre/ptlrpc/sec_config.c b/fs/lustre/ptlrpc/sec_config.c
index e0ddebd..1b56ef4 100644
--- a/fs/lustre/ptlrpc/sec_config.c
+++ b/fs/lustre/ptlrpc/sec_config.c
@@ -649,7 +649,7 @@ int sptlrpc_process_config(struct lustre_cfg *lcfg)
 	 *	is a actual filesystem.
 	 */
 	if (server_name2fsname(target, fsname, NULL))
-		strlcpy(fsname, target, sizeof(target));
+		strlcpy(fsname, target, sizeof(fsname));
 
 	rc = sptlrpc_parse_rule(param, &rule);
 	if (rc)
-- 
1.8.3.1


* [lustre-devel] [PATCH 33/40] lustre: ldlm: remove client_import_find_conn()
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (31 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 32/40] lustre: ptlrpc: fix clang build errors James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 34/40] lnet: add 'force' option to lnetctl peer del James Simmons
                   ` (6 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lustre Development List

From: Mr NeilBrown <neilb@suse.de>

This function has been unused since
commit 3dd3fe462023 ("lustre: mgc: Use IR for client->MDS/OST connections"),
so remove it.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10360
Lustre-commit: 14544bdca5cc42a3e ("LU-10360 ldlm: remove client_import_find_conn()")
Signed-off-by: Mr NeilBrown <neilb@suse.de>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50000
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: jsimmons <jsimmons@infradead.org>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_net.h |  2 --
 fs/lustre/ldlm/ldlm_lib.c      | 24 ------------------------
 2 files changed, 26 deletions(-)

diff --git a/fs/lustre/include/lustre_net.h b/fs/lustre/include/lustre_net.h
index 1ffe9f7..a305ba3 100644
--- a/fs/lustre/include/lustre_net.h
+++ b/fs/lustre/include/lustre_net.h
@@ -2358,8 +2358,6 @@ int client_import_dyn_add_conn(struct obd_import *imp, struct obd_uuid *uuid,
 int client_import_add_nids_to_conn(struct obd_import *imp, lnet_nid_t *nids,
 				   int nid_count, struct obd_uuid *uuid);
 int client_import_del_conn(struct obd_import *imp, struct obd_uuid *uuid);
-int client_import_find_conn(struct obd_import *imp, lnet_nid_t peer,
-			    struct obd_uuid *uuid);
 int import_set_conn_priority(struct obd_import *imp, struct obd_uuid *uuid);
 void client_destroy_import(struct obd_import *imp);
 /** @} */
diff --git a/fs/lustre/ldlm/ldlm_lib.c b/fs/lustre/ldlm/ldlm_lib.c
index 0b8389e..b1ce0d4 100644
--- a/fs/lustre/ldlm/ldlm_lib.c
+++ b/fs/lustre/ldlm/ldlm_lib.c
@@ -243,30 +243,6 @@ int client_import_del_conn(struct obd_import *imp, struct obd_uuid *uuid)
 }
 EXPORT_SYMBOL(client_import_del_conn);
 
-/**
- * Find conn UUID by peer NID. @peer is a server NID. This function is used
- * to find a conn uuid of @imp which can reach @peer.
- */
-int client_import_find_conn(struct obd_import *imp, lnet_nid_t peer,
-			    struct obd_uuid *uuid)
-{
-	struct obd_import_conn *conn;
-	int rc = -ENOENT;
-
-	spin_lock(&imp->imp_lock);
-	list_for_each_entry(conn, &imp->imp_conn_list, oic_item) {
-		/* Check if conn UUID does have this peer NID. */
-		if (class_check_uuid(&conn->oic_uuid, peer)) {
-			*uuid = conn->oic_uuid;
-			rc = 0;
-			break;
-		}
-	}
-	spin_unlock(&imp->imp_lock);
-	return rc;
-}
-EXPORT_SYMBOL(client_import_find_conn);
-
 void client_destroy_import(struct obd_import *imp)
 {
 	/*
-- 
1.8.3.1


* [lustre-devel] [PATCH 34/40] lnet: add 'force' option to lnetctl peer del
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (32 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 33/40] lustre: ldlm: remove client_import_find_conn() James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 35/40] lustre: ldlm: BL_AST lock cancel still can be batched James Simmons
                   ` (5 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Serguei Smirnov, Lustre Development List

From: Serguei Smirnov <ssmirnov@whamcloud.com>

Add a --force option to the 'lnetctl peer del' command.
If the peer has its primary NID locked, this option allows
the peer to be deleted manually:
  lnetctl peer del --prim_nid <nid> --force

Add a --prim_lock option to the 'lnetctl peer add' command.
If specified, the primary NID of the peer is locked so that
it will be the NID used to identify the peer in communications
with the Lustre layer.
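
For illustration, the two options would typically be combined along
these lines (the NID is a placeholder and the exact command lines are
assumed from the description above):
  # lnetctl peer add --prim_nid 10.0.0.2@tcp --prim_lock
  # lnetctl peer del --prim_nid 10.0.0.2@tcp --force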

WC-bug-id: https://jira.whamcloud.com/browse/LU-14668
Lustre-commit: f1b2d8d60c593a670 ("LU-14668 lnet: add 'force' option to lnetctl peer del")
Signed-off-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50149
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h      |  6 ++++--
 include/uapi/linux/lnet/lnet-dlc.h |  4 +++-
 net/lnet/lnet/api-ni.c             |  6 ++++--
 net/lnet/lnet/peer.c               | 12 ++++++++----
 4 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index a8aa924..e26e150 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -953,8 +953,10 @@ bool lnet_peer_is_pref_rtr_locked(struct lnet_peer_ni *lpni,
 int lnet_peer_add_pref_rtr(struct lnet_peer_ni *lpni, struct lnet_nid *nid);
 int lnet_peer_ni_set_non_mr_pref_nid(struct lnet_peer_ni *lpni,
 				     struct lnet_nid *nid);
-int lnet_user_add_peer_ni(struct lnet_nid *key_nid, struct lnet_nid *nid, bool mr);
-int lnet_del_peer_ni(struct lnet_nid *key_nid, struct lnet_nid *nid);
+int lnet_user_add_peer_ni(struct lnet_nid *key_nid, struct lnet_nid *nid,
+			  bool mr, bool lock_prim);
+int lnet_del_peer_ni(struct lnet_nid *key_nid, struct lnet_nid *nid,
+		     int force);
 int lnet_get_peer_info(struct lnet_ioctl_peer_cfg *cfg, void __user *bulk);
 int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 			  char alivness[LNET_MAX_STR_LEN],
diff --git a/include/uapi/linux/lnet/lnet-dlc.h b/include/uapi/linux/lnet/lnet-dlc.h
index 63578a0..fc1d40c 100644
--- a/include/uapi/linux/lnet/lnet-dlc.h
+++ b/include/uapi/linux/lnet/lnet-dlc.h
@@ -298,7 +298,9 @@ struct lnet_ioctl_peer_cfg {
 	struct libcfs_ioctl_hdr prcfg_hdr;
 	lnet_nid_t prcfg_prim_nid;
 	lnet_nid_t prcfg_cfg_nid;
-	__u32 prcfg_count;
+	__u32 prcfg_count;	/* ADD_PEER_NI: used for 'lock_prim' option
+				 * DEL_PEER_NI: used for 'force' option
+				 */
 	__u32 prcfg_mr;
 	__u32 prcfg_state;
 	__u32 prcfg_size;
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index 20093a9..9095d4e 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -4239,7 +4239,8 @@ u32 lnet_get_dlc_seq_locked(void)
 		mutex_lock(&the_lnet.ln_api_mutex);
 		lnet_nid4_to_nid(cfg->prcfg_prim_nid, &prim_nid);
 		lnet_nid4_to_nid(cfg->prcfg_cfg_nid, &nid);
-		rc = lnet_user_add_peer_ni(&prim_nid, &nid, cfg->prcfg_mr);
+		rc = lnet_user_add_peer_ni(&prim_nid, &nid, cfg->prcfg_mr,
+					   cfg->prcfg_count == 1);
 		mutex_unlock(&the_lnet.ln_api_mutex);
 		return rc;
 	}
@@ -4255,7 +4256,8 @@ u32 lnet_get_dlc_seq_locked(void)
 		lnet_nid4_to_nid(cfg->prcfg_prim_nid, &prim_nid);
 		lnet_nid4_to_nid(cfg->prcfg_cfg_nid, &nid);
 		rc = lnet_del_peer_ni(&prim_nid,
-				      &nid);
+				      &nid,
+				      cfg->prcfg_count);
 		mutex_unlock(&the_lnet.ln_api_mutex);
 		return rc;
 	}
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 0a5e73a..619973b 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -1978,9 +1978,12 @@ struct lnet_peer_net *
 	return lnet_peer_add_nid(lp, nid, flags);
 }
 
-int lnet_user_add_peer_ni(struct lnet_nid *prim_nid, struct lnet_nid *nid, bool mr)
+int lnet_user_add_peer_ni(struct lnet_nid *prim_nid, struct lnet_nid *nid,
+			  bool mr, bool lock_prim)
 {
-	return lnet_add_peer_ni(prim_nid, nid, mr, LNET_PEER_CONFIGURED);
+	int fl = LNET_PEER_CONFIGURED | (LNET_PEER_LOCK_PRIMARY * lock_prim);
+
+	return lnet_add_peer_ni(prim_nid, nid, mr, fl);
 }
 
 static int
@@ -2029,7 +2032,8 @@ int lnet_user_add_peer_ni(struct lnet_nid *prim_nid, struct lnet_nid *nid, bool
  * being modified/deleted by a different thread.
  */
 int
-lnet_del_peer_ni(struct lnet_nid *prim_nid, struct lnet_nid *nid)
+lnet_del_peer_ni(struct lnet_nid *prim_nid, struct lnet_nid *nid,
+		 int force)
 {
 	struct lnet_peer *lp;
 	struct lnet_peer_ni *lpni;
@@ -2061,7 +2065,7 @@ int lnet_user_add_peer_ni(struct lnet_nid *prim_nid, struct lnet_nid *nid, bool
 	lnet_net_unlock(LNET_LOCK_EX);
 
 	if (LNET_NID_IS_ANY(nid) || nid_same(nid, &lp->lp_primary_nid)) {
-		if (lp->lp_state & LNET_PEER_LOCK_PRIMARY) {
+		if (!force && lp->lp_state & LNET_PEER_LOCK_PRIMARY) {
 			CERROR("peer %s created by Lustre. Must preserve primary NID, but will remove other NIDs\n",
 			       libcfs_nidstr(&lp->lp_primary_nid));
 			return lnet_reset_peer(lp);
-- 
1.8.3.1


* [lustre-devel] [PATCH 35/40] lustre: ldlm: BL_AST lock cancel still can be batched
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (33 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 34/40] lnet: add 'force' option to lnetctl peer del James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 36/40] lnet: lnet_parse_route uses wrong loop var James Simmons
                   ` (4 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Vitaly Fertman, Lustre Development List

From: Vitaly Fertman <vitaly.fertman@hpe.com>

The previous patch made BL_AST locks be cancelled separately.
However, the main problem is flushing the data held under the other
batched locks, so a BL_AST cancel can still be batched with locks
that have no data to flush. This can only be optimized for locks
that are not yet CANCELLING; otherwise the lock is already on the
l_bl_ast list.

Fixes: 1ada5c64 ("lustre: ldlm: send the cancel RPC asap")
WC-bug-id: https://jira.whamcloud.com/browse/LU-16285
Lustre-commit: 9d79f92076b6a9ca7 ("LU-16285 ldlm: BL_AST lock cancel still can be batched")
Signed-off-by: Vitaly Fertman <vitaly.fertman@hpe.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50158
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/include/lustre_dlm.h |  1 -
 fs/lustre/ldlm/ldlm_lockd.c    |  3 ++-
 fs/lustre/ldlm/ldlm_request.c  | 42 +++++++++++++++++++++++++-----------------
 3 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/fs/lustre/include/lustre_dlm.h b/fs/lustre/include/lustre_dlm.h
index 3a4f152..d08c48f 100644
--- a/fs/lustre/include/lustre_dlm.h
+++ b/fs/lustre/include/lustre_dlm.h
@@ -593,7 +593,6 @@ enum ldlm_cancel_flags {
 	LCF_BL_AST     = 0x4, /* Cancel locks marked as LDLM_FL_BL_AST
 			       * in the same RPC
 			       */
-	LCF_ONE_LOCK	= 0x8,	/* Cancel locks pack only one lock. */
 };
 
 struct ldlm_flock {
diff --git a/fs/lustre/ldlm/ldlm_lockd.c b/fs/lustre/ldlm/ldlm_lockd.c
index 3a085db..abd853b 100644
--- a/fs/lustre/ldlm/ldlm_lockd.c
+++ b/fs/lustre/ldlm/ldlm_lockd.c
@@ -700,7 +700,8 @@ static int ldlm_callback_handler(struct ptlrpc_request *req)
 		 * we can tell the server we have no lock. Otherwise, we
 		 * should send cancel after dropping the cache.
 		 */
-		if (ldlm_is_ast_sent(lock) || ldlm_is_failed(lock)) {
+		if ((ldlm_is_canceling(lock) && ldlm_is_bl_done(lock)) ||
+		     ldlm_is_failed(lock)) {
 			LDLM_DEBUG(lock,
 				   "callback on lock %#llx - lock disappeared",
 				   dlm_req->lock_handle[0].cookie);
diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index ef3ad28..11071d9 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -1055,8 +1055,9 @@ static int _ldlm_cancel_pack(struct ptlrpc_request *req, struct ldlm_lock *lock,
  * Prepare and send a batched cancel RPC. It will include @count lock
  * handles of locks given in @cancels list.
  */
-static int ldlm_cli_cancel_req(struct obd_export *exp, void *ptr,
-			       int count, enum ldlm_cancel_flags flags)
+static int ldlm_cli_cancel_req(struct obd_export *exp, struct ldlm_lock *lock,
+			       struct list_head *head, int count,
+			       enum ldlm_cancel_flags flags)
 {
 	struct ptlrpc_request *req = NULL;
 	struct obd_import *imp;
@@ -1065,6 +1066,7 @@ static int ldlm_cli_cancel_req(struct obd_export *exp, void *ptr,
 
 	LASSERT(exp);
 	LASSERT(count > 0);
+	LASSERT(!head || !lock);
 
 	CFS_FAIL_TIMEOUT(OBD_FAIL_LDLM_PAUSE_CANCEL, cfs_fail_val);
 
@@ -1104,10 +1106,7 @@ static int ldlm_cli_cancel_req(struct obd_export *exp, void *ptr,
 		req->rq_reply_portal = LDLM_CANCEL_REPLY_PORTAL;
 		ptlrpc_at_set_req_timeout(req);
 
-		if (flags & LCF_ONE_LOCK)
-			rc = _ldlm_cancel_pack(req, ptr, NULL, count);
-		else
-			rc = _ldlm_cancel_pack(req, NULL, ptr, count);
+		rc = _ldlm_cancel_pack(req, lock, head, count);
 		if (rc == 0) {
 			ptlrpc_req_finished(req);
 			sent = count;
@@ -1265,7 +1264,8 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 		    enum ldlm_cancel_flags flags)
 {
 	struct obd_export *exp;
-	int avail, count = 1, bl_ast = 0;
+	int avail, count = 1, separate = 0;
+	enum ldlm_lru_flags lru_flags = 0;
 	u64 rc = 0;
 	struct ldlm_namespace *ns;
 	struct ldlm_lock *lock;
@@ -1286,7 +1286,8 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 			LDLM_LOCK_RELEASE(lock);
 			return 0;
 		}
-		bl_ast = 1;
+		if (ldlm_is_canceling(lock))
+			separate = 1;
 	} else if (ldlm_is_canceling(lock)) {
 		/* Lock is being canceled and the caller doesn't want to wait */
 		unlock_res_and_lock(lock);
@@ -1308,11 +1309,18 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 	if (rc == LDLM_FL_LOCAL_ONLY || flags & LCF_LOCAL) {
 		LDLM_LOCK_RELEASE(lock);
 		return 0;
+	} else if (rc == LDLM_FL_BL_AST) {
+		/* BL_AST lock must not wait. */
+		lru_flags |= LDLM_LRU_FLAG_NO_WAIT;
 	}
 
 	exp = lock->l_conn_export;
-	if (bl_ast) { /* Send RPC immedaitly for LDLM_FL_BL_AST */
-		ldlm_cli_cancel_req(exp, lock, count, flags | LCF_ONE_LOCK);
+	/* If a lock has been taken from lru for a batched cancel and a later
+	 * BL_AST came, send a CANCEL RPC individually for it right away, not
+	 * waiting for the batch to be handled.
+	 */
+	if (separate) {
+		ldlm_cli_cancel_req(exp, lock, NULL, 1, flags);
 		LDLM_LOCK_RELEASE(lock);
 		return 0;
 	}
@@ -1332,7 +1340,7 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 
 		ns = ldlm_lock_to_ns(lock);
 		count += ldlm_cancel_lru_local(ns, &cancels, 0, avail - 1,
-					       LCF_BL_AST, 0);
+					       LCF_BL_AST, lru_flags);
 	}
 	ldlm_cli_cancel_list(&cancels, count, NULL, flags);
 
@@ -1345,7 +1353,7 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
  * Return the number of cancelled locks.
  */
 int ldlm_cli_cancel_list_local(struct list_head *cancels, int count,
-			       enum ldlm_cancel_flags flags)
+			       enum ldlm_cancel_flags cancel_flags)
 {
 	LIST_HEAD(head);
 	struct ldlm_lock *lock, *next;
@@ -1357,7 +1365,7 @@ int ldlm_cli_cancel_list_local(struct list_head *cancels, int count,
 		if (left-- == 0)
 			break;
 
-		if (flags & LCF_LOCAL) {
+		if (cancel_flags & LCF_LOCAL) {
 			rc = LDLM_FL_LOCAL_ONLY;
 			ldlm_lock_cancel(lock);
 		} else {
@@ -1369,7 +1377,7 @@ int ldlm_cli_cancel_list_local(struct list_head *cancels, int count,
 		 * with the LDLM_FL_BL_AST flag in a separate RPC from
 		 * the one being generated now.
 		 */
-		if (!(flags & LCF_BL_AST) && (rc == LDLM_FL_BL_AST)) {
+		if (!(cancel_flags & LCF_BL_AST) && (rc == LDLM_FL_BL_AST)) {
 			LDLM_DEBUG(lock, "Cancel lock separately");
 			list_move(&lock->l_bl_ast, &head);
 			bl_ast++;
@@ -1384,7 +1392,7 @@ int ldlm_cli_cancel_list_local(struct list_head *cancels, int count,
 	}
 	if (bl_ast > 0) {
 		count -= bl_ast;
-		ldlm_cli_cancel_list(&head, bl_ast, NULL, 0);
+		ldlm_cli_cancel_list(&head, bl_ast, NULL, cancel_flags);
 	}
 
 	return count;
@@ -1887,11 +1895,11 @@ int ldlm_cli_cancel_list(struct list_head *cancels, int count,
 				ldlm_cancel_pack(req, cancels, count);
 			else
 				res = ldlm_cli_cancel_req(lock->l_conn_export,
-							  cancels, count,
+							  NULL, cancels, count,
 							  flags);
 		} else {
 			res = ldlm_cli_cancel_req(lock->l_conn_export,
-						  cancels, 1, flags);
+						  NULL, cancels, 1, flags);
 		}
 
 		if (res < 0) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 36/40] lnet: lnet_parse_route uses wrong loop var
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (34 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 35/40] lustre: ldlm: BL_AST lock cancel still can be batched James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 37/40] lustre: tgt: add qos debug James Simmons
                   ` (3 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Chris Horn, Lustre Development List

From: Chris Horn <chris.horn@hpe.com>

When looping over the gateways list, we're referencing the wrong
loop variable to get the gateway nid (ltb instead of ltb2).

Fixes: 1a77031c36 ("lustre: lnet/config: convert list_for_each to list_for_each_entry")
WC-bug-id: https://jira.whamcloud.com/browse/LU-16606
Lustre-commit: 0a414b1077a2f9dbc ("LU-16606 lnet: lnet_parse_route uses wrong loop var")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50173
Reviewed-by: jsimmons <jsimmons@infradead.org>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/config.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/lnet/lnet/config.c b/net/lnet/lnet/config.c
index a54e1db..c239f9c 100644
--- a/net/lnet/lnet/config.c
+++ b/net/lnet/lnet/config.c
@@ -1168,7 +1168,7 @@ struct lnet_ni *
 		LASSERT(net != LNET_NET_ANY);
 
 		list_for_each_entry(ltb2, &gateways, ltb_list) {
-			LASSERT(libcfs_strnid(&nid, ltb->ltb_text) == 0);
+			LASSERT(libcfs_strnid(&nid, ltb2->ltb_text) == 0);
 
 			if (lnet_islocalnid(&nid)) {
 				*im_a_router = 1;
-- 
1.8.3.1


* [lustre-devel] [PATCH 37/40] lustre: tgt: add qos debug
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (35 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 36/40] lnet: lnet_parse_route uses wrong loop var James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 38/40] lustre: enc: file names encryption when using secure boot James Simmons
                   ` (2 subsequent siblings)
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Sergey Cheremencev, Lustre Development List

From: Sergey Cheremencev <scherementsev@ddn.com>

Add several debug lines for the QOS allocator.
The patch also changes the debug subsystem from S_CLASS to S_LOV
in lu_tgt_descs.c so that it can be enabled to capture only
QOS debugging.
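
As an illustration, the new messages could then be captured on their
own with something along these lines (the mask names are assumed from
the D_OTHER debug flag and S_LOV subsystem used here):
  # lctl set_param subsystem_debug=lov
  # lctl set_param debug=+other
  # lctl dk /tmp/qos-debug.log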

WC-bug-id: https://jira.whamcloud.com/browse/LU-16501
Lustre-commit: 5fe45f0ff98064561 ("LU-16501 tgt: add qos debug")
Signed-off-by: Sergey Cheremencev <scherementsev@ddn.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49977
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/obdclass/lu_tgt_descs.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/lustre/obdclass/lu_tgt_descs.c b/fs/lustre/obdclass/lu_tgt_descs.c
index 35e7c7c..d573c12 100644
--- a/fs/lustre/obdclass/lu_tgt_descs.c
+++ b/fs/lustre/obdclass/lu_tgt_descs.c
@@ -31,7 +31,7 @@
  *
  */
 
-#define DEBUG_SUBSYSTEM S_CLASS
+#define DEBUG_SUBSYSTEM S_LOV
 
 #include <linux/module.h>
 #include <linux/list.h>
@@ -219,6 +219,8 @@ void lu_tgt_qos_weight_calc(struct lu_tgt_desc *tgt, bool is_mdt)
 	else
 		ltq->ltq_avail = tgt_statfs_bavail(tgt) >> 8;
 	penalty = ltq->ltq_penalty + ltq->ltq_svr->lsq_penalty;
+	CDEBUG(D_OTHER, "ltq_penalty: %llu lsq_penalty: %llu tgt_bavail: %llu\n",
+		  ltq->ltq_penalty, ltq->ltq_svr->lsq_penalty, ltq->ltq_avail);
 	if (ltq->ltq_avail < penalty)
 		ltq->ltq_weight = 0;
 	else
@@ -623,8 +625,15 @@ int ltd_qos_update(struct lu_tgt_descs *ltd, struct lu_tgt_desc *tgt,
 	/* Set max penalties for this tgt and server */
 	ltq->ltq_penalty += ltq->ltq_penalty_per_obj *
 			    ltd->ltd_lov_desc.ld_active_tgt_count;
+	CDEBUG(D_OTHER, "ltq_penalty: %llu per_obj: %llu tgt_count: %d\n",
+	       ltq->ltq_penalty, ltq->ltq_penalty_per_obj,
+	       ltd->ltd_lov_desc.ld_active_tgt_count);
 	svr->lsq_penalty += svr->lsq_penalty_per_obj *
 			    qos->lq_active_svr_count;
+	CDEBUG(D_OTHER, "lsq_penalty: %llu per_obj: %llu srv_count: %d\n",
+	       svr->lsq_penalty, svr->lsq_penalty_per_obj,
+	       qos->lq_active_svr_count);
+
 
 	/* Decrease all MDS penalties */
 	list_for_each_entry(svr, &qos->lq_svr_list, lsq_svr_list) {
-- 
1.8.3.1


* [lustre-devel] [PATCH 38/40] lustre: enc: file names encryption when using secure boot
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (36 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 37/40] lustre: tgt: add qos debug James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 39/40] lustre: uapi: add DMV_IMP_INHERIT connect flag James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 40/40] lustre: llite: dir layout inheritance fixes James Simmons
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Alex Deiter, Lustre Development List

From: Alex Deiter <alex.deiter@gmail.com>

Secure boot activates lockdown mode in the Linux kernel, and
debugfs is restricted when the kernel is locked down. This patch
moves the file name encryption tunable from debugfs to sysfs.
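
With the attribute in sysfs it should remain reachable on a
locked-down kernel, for example (parameter path assumed from the
attribute added below):
  # lctl get_param llite.*.filename_enc_use_old_base64
  # lctl set_param llite.*.filename_enc_use_old_base64=1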

WC-bug-id: https://jira.whamcloud.com/browse/LU-16621
Lustre-commit: 716675fff642655c4 ("LU-16621 enc: file names encryption when using secure boot")
Signed-off-by: Alex Deiter <alex.deiter@gmail.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50219
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: jsimmons <jsimmons@infradead.org>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/llite_internal.h |  1 +
 fs/lustre/llite/llite_lib.c      |  5 +++--
 fs/lustre/llite/lproc_llite.c    | 35 ++++++++++++++++++-----------------
 3 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/fs/lustre/llite/llite_internal.h b/fs/lustre/llite/llite_internal.h
index b101a71..72de8f7 100644
--- a/fs/lustre/llite/llite_internal.h
+++ b/fs/lustre/llite/llite_internal.h
@@ -737,6 +737,7 @@ struct ll_sb_info {
 	spinlock_t		ll_lock;
 	spinlock_t		ll_pp_extent_lock; /* pp_extent entry*/
 	spinlock_t		ll_process_lock; /* ll_rw_process_info */
+	struct lustre_sb_info	*lsi;
 	struct obd_uuid		ll_sb_uuid;
 	struct obd_export	*ll_md_exp;
 	struct obd_export	*ll_dt_exp;
diff --git a/fs/lustre/llite/llite_lib.c b/fs/lustre/llite/llite_lib.c
index 3774ca8..5a9bc61 100644
--- a/fs/lustre/llite/llite_lib.c
+++ b/fs/lustre/llite/llite_lib.c
@@ -79,7 +79,7 @@ static inline unsigned int ll_get_ra_async_max_active(void)
 	return cfs_cpt_weight(cfs_cpt_tab, CFS_CPT_ANY) >> 1;
 }
 
-static struct ll_sb_info *ll_init_sbi(void)
+static struct ll_sb_info *ll_init_sbi(struct lustre_sb_info *lsi)
 {
 	struct ll_sb_info *sbi = NULL;
 	unsigned long pages;
@@ -99,6 +99,7 @@ static struct ll_sb_info *ll_init_sbi(void)
 	mutex_init(&sbi->ll_lco.lco_lock);
 	spin_lock_init(&sbi->ll_pp_extent_lock);
 	spin_lock_init(&sbi->ll_process_lock);
+	sbi->lsi = lsi;
 	sbi->ll_rw_stats_on = 0;
 	sbi->ll_statfs_max_age = OBD_STATFS_CACHE_SECONDS;
 
@@ -1245,7 +1246,7 @@ int ll_fill_super(struct super_block *sb)
 	}
 
 	/* client additional sb info */
-	sbi = ll_init_sbi();
+	sbi = ll_init_sbi(lsi);
 	lsi->lsi_llsbi = sbi;
 	if (IS_ERR(sbi)) {
 		err = PTR_ERR(sbi);
diff --git a/fs/lustre/llite/lproc_llite.c b/fs/lustre/llite/lproc_llite.c
index 48d93c6..8b6c86f 100644
--- a/fs/lustre/llite/lproc_llite.c
+++ b/fs/lustre/llite/lproc_llite.c
@@ -1653,28 +1653,30 @@ static ssize_t ll_nosquash_nids_seq_write(struct file *file,
 
 LDEBUGFS_SEQ_FOPS(ll_nosquash_nids);
 
-static int ll_old_b64_enc_seq_show(struct seq_file *m, void *v)
+static ssize_t filename_enc_use_old_base64_show(struct kobject *kobj,
+						struct attribute *attr,
+						char *buffer)
 {
-	struct super_block *sb = m->private;
-	struct lustre_sb_info *lsi = s2lsi(sb);
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+	struct lustre_sb_info *lsi = sbi->lsi;
 
-	seq_printf(m, "%u\n",
-		   lsi->lsi_flags & LSI_FILENAME_ENC_B64_OLD_CLI ? 1 : 0);
-	return 0;
+	return scnprintf(buffer, PAGE_SIZE, "%u\n",
+			 lsi->lsi_flags & LSI_FILENAME_ENC_B64_OLD_CLI ? 1 : 0);
 }
 
-static ssize_t ll_old_b64_enc_seq_write(struct file *file,
-					const char __user *buffer,
-					size_t count, loff_t *off)
+static ssize_t filename_enc_use_old_base64_store(struct kobject *kobj,
+						 struct attribute *attr,
+						 const char *buffer,
+						 size_t count)
 {
-	struct seq_file *m = file->private_data;
-	struct super_block *sb = m->private;
-	struct lustre_sb_info *lsi = s2lsi(sb);
-	struct ll_sb_info *sbi = ll_s2sbi(sb);
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+	struct lustre_sb_info *lsi = sbi->lsi;
 	bool val;
 	int rc;
 
-	rc = kstrtobool_from_user(buffer, count, &val);
+	rc = kstrtobool(buffer, &val);
 	if (rc)
 		return rc;
 
@@ -1698,7 +1700,7 @@ static ssize_t ll_old_b64_enc_seq_write(struct file *file,
 	return count;
 }
 
-LDEBUGFS_SEQ_FOPS(ll_old_b64_enc);
+LUSTRE_RW_ATTR(filename_enc_use_old_base64);
 
 static int ll_pcc_seq_show(struct seq_file *m, void *v)
 {
@@ -1756,8 +1758,6 @@ struct ldebugfs_vars lprocfs_llite_obd_vars[] = {
 	  .fops =	&ll_nosquash_nids_fops			},
 	{ .name =	"pcc",
 	  .fops =	&ll_pcc_fops,				},
-	{ .name =	"filename_enc_use_old_base64",
-	  .fops =	&ll_old_b64_enc_fops,			},
 	{ NULL }
 };
 
@@ -1805,6 +1805,7 @@ struct ldebugfs_vars lprocfs_llite_obd_vars[] = {
 	&lustre_attr_opencache_threshold_ms.attr,
 	&lustre_attr_opencache_max_ms.attr,
 	&lustre_attr_inode_cache.attr,
+	&lustre_attr_filename_enc_use_old_base64.attr,
 	NULL,
 };
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 39/40] lustre: uapi: add DMV_IMP_INHERIT connect flag
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (37 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 38/40] lustre: enc: file names encryption when using secure boot James Simmons
@ 2023-04-09 12:13 ` James Simmons
  2023-04-09 12:13 ` [lustre-devel] [PATCH 40/40] lustre: llite: dir layout inheritance fixes James Simmons
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown; +Cc: Lai Siyao, Lustre Development List

From: Lai Siyao <lai.siyao@whamcloud.com>

Add the OBD_CONNECT2_DMV_IMP_INHERIT connect flag, which indicates
that the client handles implicit default LMV (DMV) inheritance.

WC-bug-id: https://jira.whamcloud.com/browse/LU-15971
Lustre-commit: 203745e7b07101bb6 ("LU-15971 uapi: add DMV_IMP_INHERIT connect flag")
Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/47788
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Hongchao Zhang <hongchao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ptlrpc/wiretest.c            | 2 ++
 include/uapi/linux/lustre/lustre_idl.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/lustre/ptlrpc/wiretest.c b/fs/lustre/ptlrpc/wiretest.c
index 2c02430..472d155c 100644
--- a/fs/lustre/ptlrpc/wiretest.c
+++ b/fs/lustre/ptlrpc/wiretest.c
@@ -1245,6 +1245,8 @@ void lustre_assert_wire_constants(void)
 		 OBD_CONNECT2_ATOMIC_OPEN_LOCK);
 	LASSERTF(OBD_CONNECT2_ENCRYPT_NAME == 0x8000000ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_ENCRYPT_NAME);
+	LASSERTF(OBD_CONNECT2_DMV_IMP_INHERIT == 0x20000000ULL, "found 0x%.16llxULL\n",
+		 OBD_CONNECT2_DMV_IMP_INHERIT);
 	LASSERTF(OBD_CONNECT2_ENCRYPT_FID2PATH == 0x40000000ULL, "found 0x%.16llxULL\n",
 		 OBD_CONNECT2_ENCRYPT_FID2PATH);
 	LASSERTF(OBD_CKSUM_CRC32 == 0x00000001UL, "found 0x%.8xUL\n",
diff --git a/include/uapi/linux/lustre/lustre_idl.h b/include/uapi/linux/lustre/lustre_idl.h
index d60d1d8..c979e24 100644
--- a/include/uapi/linux/lustre/lustre_idl.h
+++ b/include/uapi/linux/lustre/lustre_idl.h
@@ -784,6 +784,7 @@ struct ptlrpc_body_v2 {
 #define OBD_CONNECT2_LOCK_CONTENTION	  0x2000000ULL /* contention detect */
 #define OBD_CONNECT2_ATOMIC_OPEN_LOCK	  0x4000000ULL /* lock on first open */
 #define OBD_CONNECT2_ENCRYPT_NAME	  0x8000000ULL /* name encrypt */
+#define OBD_CONNECT2_DMV_IMP_INHERIT	 0x20000000ULL /* client handle DMV inheritance */
 #define OBD_CONNECT2_ENCRYPT_FID2PATH	 0x40000000ULL /* fid2path enc file */
 /* XXX README XXX README XXX README XXX README XXX README XXX README XXX
  * Please DO NOT add OBD_CONNECT flags before first ensuring that this value
-- 
1.8.3.1


* [lustre-devel] [PATCH 40/40] lustre: llite: dir layout inheritance fixes
  2023-04-09 12:12 [lustre-devel] [PATCH 00/40] lustre: backport OpenSFS changes from March XX, 2023 James Simmons
                   ` (38 preceding siblings ...)
  2023-04-09 12:13 ` [lustre-devel] [PATCH 39/40] lustre: uapi: add DMV_IMP_INHERIT connect flag James Simmons
@ 2023-04-09 12:13 ` James Simmons
  39 siblings, 0 replies; 41+ messages in thread
From: James Simmons @ 2023-04-09 12:13 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin, NeilBrown
  Cc: Vitaly Fertman, Lustre Development List

From: Vitaly Fertman <c17818@cray.com>

Fixes for several minor problems:
- it may happen that the depth is not set on a dir; do not consider
  depth == 0 a real depth when checking whether the root default is
  applicable (see the sketch after this list);
- the setdirstripe util implicitly sets max_inherit to 3 for a
  non-striped dir when the -i option is given but -c is not, while 3
  is the default for striped dirs only;
- getdirstripe shows inherited default layouts with max_inherit==0,
  which no longer makes sense; the same for an explicitly set
  default layout on a dir/root with max_inherit==0;
- getdirstripe hides max_inherit_rr when stripe_offset != -1, since it
  makes no sense and is reset to 0, but this leads to user confusion;
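
The depth handling from the first item can be sketched as follows
(a stand-alone illustration with made-up constant values, not the
actual llite code):

  #include <stdbool.h>
  #include <stdio.h>

  #define INHERIT_NONE      0    /* stand-in for LMV_INHERIT_NONE */
  #define INHERIT_UNLIMITED 255  /* stand-in for LMV_INHERIT_UNLIMITED */

  /* Does the filesystem-wide default layout apply at this depth? */
  static bool root_default_applies(unsigned int max_inherit,
                                   unsigned short depth)
  {
      if (max_inherit == INHERIT_NONE)
          return false;
      if (max_inherit == INHERIT_UNLIMITED)
          return true;
      /* depth == 0 means "not initialized" for a non-root dir,
       * so it is no longer treated as a real depth */
      return depth && max_inherit > depth;
  }

  int main(void)
  {
      printf("%d\n", root_default_applies(3, 0));  /* 0: depth unknown */
      printf("%d\n", root_default_applies(3, 2));  /* 1: within limit */
      printf("%d\n", root_default_applies(3, 3));  /* 0: limit reached */
      return 0;
  }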

HPE-bug-id: LUS-11090
WC-bug-id: https://jira.whamcloud.com/browse/LU-16527
Lustre-commit: 6a5a4b49fabcb4c97 ("LU-16527 llite: dir layout inheritance fixes")
Signed-off-by: Vitaly Fertman <c17818@cray.com>
Reviewed-on: https://es-gerrit.dev.cray.com/161035
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: Alexey Lyashkov <alexey.lyashkov@hpe.com>
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/49882
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/llite/dir.c   | 34 +++++++++++++---------------------
 fs/lustre/llite/namei.c | 10 ++++++++--
 2 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/fs/lustre/llite/dir.c b/fs/lustre/llite/dir.c
index 0422701..871dd93 100644
--- a/fs/lustre/llite/dir.c
+++ b/fs/lustre/llite/dir.c
@@ -1695,40 +1695,34 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 
 		/* Get default LMV EA */
 		if (lum.lum_magic == LMV_USER_MAGIC) {
+			struct lmv_user_md *lum;
+			struct ll_inode_info *lli;
+
 			if (lmmsize > sizeof(*ulmv)) {
 				rc = -EINVAL;
 				goto finish_req;
 			}
 
-			if (root_request) {
-				struct lmv_user_md *lum;
-				struct ll_inode_info *lli;
+			lum = (struct lmv_user_md *)lmm;
+			if (lum->lum_max_inherit == LMV_INHERIT_NONE) {
+				rc = -ENODATA;
+				goto finish_req;
+			}
 
-				lum = (struct lmv_user_md *)lmm;
+			if (root_request) {
 				lli = ll_i2info(inode);
 				if (lum->lum_max_inherit !=
 				    LMV_INHERIT_UNLIMITED) {
-					if (lum->lum_max_inherit ==
-						LMV_INHERIT_NONE ||
-					    lum->lum_max_inherit <
+					if (lum->lum_max_inherit <
 						LMV_INHERIT_END ||
 					    lum->lum_max_inherit >
 						LMV_INHERIT_MAX ||
-					    lum->lum_max_inherit <
+					    lum->lum_max_inherit <=
 						lli->lli_dir_depth) {
 						rc = -ENODATA;
 						goto finish_req;
 					}
 
-					if (lum->lum_max_inherit ==
-					    lli->lli_dir_depth) {
-						lum->lum_max_inherit =
-							LMV_INHERIT_NONE;
-						lum->lum_max_inherit_rr =
-							LMV_INHERIT_RR_NONE;
-						goto out_copy;
-					}
-
 					lum->lum_max_inherit -=
 						lli->lli_dir_depth;
 				}
@@ -1748,10 +1742,8 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 						goto out_copy;
 					}
 
-					if (lum->lum_max_inherit_rr >
-					    lli->lli_dir_depth)
-						lum->lum_max_inherit_rr -=
-							lli->lli_dir_depth;
+					lum->lum_max_inherit_rr -=
+						lli->lli_dir_depth;
 				}
 			}
 out_copy:
diff --git a/fs/lustre/llite/namei.c b/fs/lustre/llite/namei.c
index 0c4c8e6..a19e5f7 100644
--- a/fs/lustre/llite/namei.c
+++ b/fs/lustre/llite/namei.c
@@ -1464,8 +1464,10 @@ static void ll_qos_mkdir_prep(struct md_op_data *op_data, struct inode *dir)
 	struct ll_inode_info *rlli = ll_i2info(root);
 	struct ll_inode_info *lli = ll_i2info(dir);
 	struct lmv_stripe_md *lsm;
+	unsigned short depth;
 
 	op_data->op_dir_depth = lli->lli_inherit_depth ?: lli->lli_dir_depth;
+	depth = lli->lli_dir_depth;
 
 	/* parent directory is striped */
 	if (unlikely(lli->lli_lsm_md))
@@ -1492,13 +1494,17 @@ static void ll_qos_mkdir_prep(struct md_op_data *op_data, struct inode *dir)
 	if (lsm->lsm_md_master_mdt_index != LMV_OFFSET_DEFAULT)
 		goto unlock;
 
+	/**
+	 * Check if the fs default is to be applied.
+	 * depth == 0 means 'not inited' for not root dir.
+	 */
 	if (lsm->lsm_md_max_inherit != LMV_INHERIT_NONE &&
 	    (lsm->lsm_md_max_inherit == LMV_INHERIT_UNLIMITED ||
-	     lsm->lsm_md_max_inherit >= lli->lli_dir_depth)) {
+	     (depth && lsm->lsm_md_max_inherit > depth))) {
 		op_data->op_flags |= MF_QOS_MKDIR;
 		if (lsm->lsm_md_max_inherit_rr != LMV_INHERIT_RR_NONE &&
 		    (lsm->lsm_md_max_inherit_rr == LMV_INHERIT_RR_UNLIMITED ||
-		     lsm->lsm_md_max_inherit_rr >= lli->lli_dir_depth))
+		     (depth && lsm->lsm_md_max_inherit_rr > depth)))
 			op_data->op_flags |= MF_RR_MKDIR;
 		CDEBUG(D_INODE, DFID" requests qos mkdir %#x\n",
 		       PFID(&lli->lli_fid), op_data->op_flags);
-- 
1.8.3.1

