* [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10
@ 2018-10-14 18:57 James Simmons
  2018-10-14 18:57 ` [lustre-devel] [PATCH 01/28] lustre: osc: osc_extent_tree_dump0() implementation is suboptimal James Simmons
                   ` (28 more replies)
  0 siblings, 29 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:57 UTC (permalink / raw)
  To: lustre-devel

Another batch of assorted fixes from Lustre 2.10 that are missing in
the Linux client. All of these should be order independent and don't
collide with the PFL work that will land at a later date.

Alex Zhuravlev (2):
  lustre: llite: use security context if it's enabled in the kernel
  lustre: ptlrpc: do not wakeup every second

Andreas Dilger (2):
  lustre: mdc: improve mdc_enqueue() error message
  lustre: obdclass: deprecate OBD_GET_VERSION ioctl

Andrew Perepechko (1):
  lustre: osc: osc_extent_tree_dump0() implementation is suboptimal

Andriy Skulysh (1):
  lustre: ldlm: ELC shouldn't wait on lock flush

Bob Glossman (1):
  lustre: ptlrpc: handle case of epp_free_pages <= PTLRPC_MAX_BRW_PAGES

Doug Oucharek (1):
  lustre: ptlrpc: Do not assert when bd_nob_transferred != 0

Fan Yong (1):
  lustre: llite: control concurrent statahead instances

Frank Zago (1):
  lustre: llite: fix for stat under kthread and X86_X32

Henri Doreau (2):
  lustre: hsm: add kkuc before sending registration RPCs
  lustre: mdc: expose changelog through char devices

Hongchao Zhang (1):
  lustre: llite: IO accounting of page read

James Simmons (5):
  lustre: uapi: add back LUSTRE_MAXFSNAME to lustre_user.h
  lustre: uapi: add missing headers in lustre UAPI headers
  lustre: llite: enhance vvp_dev data structure naming
  lustre: clio: update spare bit handling
  lustre: llite: restore lld_nfs_dentry handling

Jinshan Xiong (2):
  lustre: llite: pipeline readahead better with large I/O
  lustre: ldlm: check lock cancellation in ldlm_cli_cancel()

John L. Hammond (2):
  lustre: llog: fix EOF handling in llog_client_next_block()
  lustre: mdc: set correct body eadatasize for getxattr()

Lai Siyao (3):
  lustre: ptlrpc: missing barrier before wake_up
  lustre: statahead: missing barrier before wake_up
  lustre: llite: disable statahead if starting statahead fail

Patrick Farrell (3):
  lustre: llite: Read ahead should return pages read
  lustre: llite: Update i_nlink on unlink
  lustre: ldlm: Make lru clear always discard read lock pages

 .../lustre/include/uapi/linux/lnet/libcfs_debug.h  |   2 +
 .../lustre/include/uapi/linux/lnet/lnetctl.h       |   1 +
 .../lustre/include/uapi/linux/lnet/nidstr.h        |   1 +
 .../lustre/include/uapi/linux/lustre/lustre_cfg.h  |   1 +
 .../lustre/include/uapi/linux/lustre/lustre_fid.h  |   1 +
 .../include/uapi/linux/lustre/lustre_fiemap.h      |   1 +
 .../include/uapi/linux/lustre/lustre_ioctl.h       |   2 +-
 .../include/uapi/linux/lustre/lustre_kernelcomm.h  |   3 -
 .../include/uapi/linux/lustre/lustre_ostid.h       |   1 +
 .../lustre/include/uapi/linux/lustre/lustre_user.h |   9 +-
 drivers/staging/lustre/lustre/include/cl_object.h  |   2 +-
 .../lustre/lustre/include/lustre_dlm_flags.h       |   2 +-
 drivers/staging/lustre/lustre/include/lustre_net.h |   2 +
 drivers/staging/lustre/lustre/include/obd.h        |  18 +-
 drivers/staging/lustre/lustre/ldlm/ldlm_internal.h |  20 +-
 drivers/staging/lustre/lustre/ldlm/ldlm_lib.c      |   2 +
 drivers/staging/lustre/lustre/ldlm/ldlm_lock.c     |  13 -
 drivers/staging/lustre/lustre/ldlm/ldlm_request.c  |  66 +-
 drivers/staging/lustre/lustre/ldlm/ldlm_resource.c |   6 +-
 drivers/staging/lustre/lustre/llite/dir.c          |  14 +-
 drivers/staging/lustre/lustre/llite/file.c         |  25 +-
 drivers/staging/lustre/lustre/llite/lcommon_cl.c   |   4 +-
 .../staging/lustre/lustre/llite/llite_internal.h   |  39 +-
 drivers/staging/lustre/lustre/llite/llite_lib.c    |  22 +-
 drivers/staging/lustre/lustre/llite/lproc_llite.c  |  37 ++
 drivers/staging/lustre/lustre/llite/namei.c        |  12 +-
 drivers/staging/lustre/lustre/llite/rw.c           |  57 +-
 drivers/staging/lustre/lustre/llite/statahead.c    |  39 +-
 drivers/staging/lustre/lustre/llite/vvp_dev.c      |  54 +-
 drivers/staging/lustre/lustre/lmv/lmv_obd.c        |  57 +-
 drivers/staging/lustre/lustre/mdc/Makefile         |   2 +-
 drivers/staging/lustre/lustre/mdc/mdc_changelog.c  | 722 +++++++++++++++++++++
 drivers/staging/lustre/lustre/mdc/mdc_internal.h   |   4 +
 drivers/staging/lustre/lustre/mdc/mdc_lib.c        |  10 +-
 drivers/staging/lustre/lustre/mdc/mdc_locks.c      |  11 +-
 drivers/staging/lustre/lustre/mdc/mdc_request.c    | 198 +-----
 drivers/staging/lustre/lustre/obdclass/class_obd.c |  18 +-
 drivers/staging/lustre/lustre/obdclass/obd_mount.c |   2 +-
 drivers/staging/lustre/lustre/osc/osc_cache.c      |   7 +-
 .../staging/lustre/lustre/osc/osc_cl_internal.h    |   2 +-
 drivers/staging/lustre/lustre/osc/osc_io.c         |   6 +-
 drivers/staging/lustre/lustre/osc/osc_lock.c       |  10 +-
 drivers/staging/lustre/lustre/osc/osc_object.c     |   2 +-
 drivers/staging/lustre/lustre/ptlrpc/llog_client.c |  24 +-
 drivers/staging/lustre/lustre/ptlrpc/niobuf.c      |   8 +-
 drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c     |   4 +-
 drivers/staging/lustre/lustre/ptlrpc/sec_bulk.c    |  11 +-
 47 files changed, 1150 insertions(+), 404 deletions(-)
 create mode 100644 drivers/staging/lustre/lustre/mdc/mdc_changelog.c

-- 
1.8.3.1

* [lustre-devel] [PATCH 01/28] lustre: osc: osc_extent_tree_dump0() implementation is suboptimal
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
@ 2018-10-14 18:57 ` James Simmons
  2018-10-14 18:57 ` [lustre-devel] [PATCH 02/28] lustre: llite: Read ahead should return pages read James Simmons
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:57 UTC (permalink / raw)
  To: lustre-devel

From: Andrew Perepechko <c17827@cray.com>

Avoid looping in osc_extent_tree_dump0() when the requested debug
level is not enabled, since the dump would print nothing anyway.
This saves some CPU cycles.
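
A minimal user-space sketch of the early-exit pattern, assuming a
single global bit mask for enabled debug levels. debug_mask,
debug_enabled() and dump_extents() are hypothetical stand-ins, not the
libcfs API; debug_enabled() only plays the role cfs_cdebug_show() plays
in the hunk below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the libcfs debug mask: one bit per level. */
static unsigned int debug_mask;

/* Cheap check, playing the role of cfs_cdebug_show() in the patch. */
static bool debug_enabled(unsigned int level)
{
	return (debug_mask & level) != 0;
}

/* Dump every extent at @level; returns how many entries were visited. */
static int dump_extents(unsigned int level, const int *extents, int n)
{
	int visited = 0;

	assert(extents || n == 0);

	/* Bail out before walking the list if nothing would be printed. */
	if (!debug_enabled(level))
		return 0;

	for (int i = 0; i < n; i++) {
		printf("extent %d: %d\n", i, extents[i]);
		visited++;
	}
	return visited;
}
```

With the level disabled, the walk is skipped entirely, so its cost no
longer grows with the number of extents.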

Signed-off-by: Andrew Perepechko <c17827@cray.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9716
Seagate-bug-id: MRP-4469
Reviewed-on: https://review.whamcloud.com/27866
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Alexander Zarochentsev <c17826@cray.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/osc/osc_cache.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/staging/lustre/lustre/osc/osc_cache.c b/drivers/staging/lustre/lustre/osc/osc_cache.c
index 326f663..92d292d 100644
--- a/drivers/staging/lustre/lustre/osc/osc_cache.c
+++ b/drivers/staging/lustre/lustre/osc/osc_cache.c
@@ -1283,6 +1283,9 @@ static void osc_extent_tree_dump0(int level, struct osc_object *obj,
 	struct osc_extent *ext;
 	int cnt;
 
+	if (!cfs_cdebug_show(level, DEBUG_SUBSYSTEM))
+		return;
+
 	CDEBUG(level, "Dump object %p extents at %s:%d, mppr: %u.\n",
 	       obj, func, line, osc_cli(obj)->cl_max_pages_per_rpc);
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 02/28] lustre: llite: Read ahead should return pages read
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
  2018-10-14 18:57 ` [lustre-devel] [PATCH 01/28] lustre: osc: osc_extent_tree_dump0() implementation is suboptimal James Simmons
@ 2018-10-14 18:57 ` James Simmons
  2018-10-14 18:57 ` [lustre-devel] [PATCH 03/28] lustre: ptlrpc: missing barrier before wake_up James Simmons
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:57 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <paf@cray.com>

Commit 198a49a964a0 ("staging: lustre: clio: revise readahead to
support 16MB IO") modified ll_read_ahead_pages() so that it no longer
returns the count of pages read.

This only affects debugging, but having the count printed is very
useful, and several messages still try to print the number of pages
read ahead and always print 0.

Restore this functionality.
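
A simplified sketch of the new calling convention, assuming the
per-page outcome is already known and passed in as an array: the last
processed index goes out through a pointer, while the return value
counts only pages that were really read ahead. read_ahead_pages() and
page_rc[] are hypothetical simplifications of ll_read_ahead_pages().

```c
#include <assert.h>

/*
 * page_rc[i] is a hypothetical per-page result for page index i:
 *   0  -> page was queued for readahead
 *   >0 -> page skipped (e.g. already cached)
 *   <0 -> error, stop scanning
 * Returns the number of pages actually read ahead; *ra_end receives the
 * last index that was processed (left untouched if none was).
 */
static unsigned long read_ahead_pages(const int *page_rc, unsigned long npages,
				      unsigned long *ra_end)
{
	unsigned long count = 0;

	assert(ra_end);
	for (unsigned long idx = 0; idx < npages; idx++) {
		if (page_rc[idx] < 0)
			break;
		*ra_end = idx;
		/* Only count the pages we really did readahead on. */
		if (page_rc[idx] == 0)
			count++;
	}
	return count;
}
```

For results { 0, 1, 0, -5, 0 } this returns 2 and leaves *ra_end at 2.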

Signed-off-by: Patrick Farrell <paf@cray.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9778
Reviewed-on: https://review.whamcloud.com/28052
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Gu Zheng <gzheng@ddn.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/llite/rw.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/rw.c b/drivers/staging/lustre/lustre/llite/rw.c
index 49ac723..82d874a 100644
--- a/drivers/staging/lustre/lustre/llite/rw.c
+++ b/drivers/staging/lustre/lustre/llite/rw.c
@@ -342,12 +342,12 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 static unsigned long
 ll_read_ahead_pages(const struct lu_env *env, struct cl_io *io,
 		    struct cl_page_list *queue, struct ll_readahead_state *ras,
-		    struct ra_io_arg *ria)
+		    struct ra_io_arg *ria, pgoff_t *ra_end)
 {
 	struct cl_read_ahead ra = { 0 };
-	unsigned long ra_end = 0;
 	bool stride_ria;
 	pgoff_t page_idx;
+	int count = 0;
 	int rc;
 
 	LASSERT(ria);
@@ -393,9 +393,14 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 			if (rc < 0)
 				break;
 
-			ra_end = page_idx;
-			if (!rc)
+			*ra_end = page_idx;
+			/* Only subtract from reserve & count the page if we
+			 * really did readahead on that page.
+			 */
+			if (!rc) {
 				ria->ria_reserved--;
+				count++;
+			}
 		} else if (stride_ria) {
 			/* If it is not in the read-ahead window, and it is
 			 * read-ahead mode, then check whether it should skip
@@ -423,7 +428,7 @@ static int ras_inside_ra_window(unsigned long idx, struct ra_io_arg *ria)
 	}
 	cl_read_ahead_release(env, &ra);
 
-	return ra_end;
+	return count;
 }
 
 static int ll_readahead(const struct lu_env *env, struct cl_io *io,
@@ -434,7 +439,7 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 	struct ll_thread_info *lti = ll_env_info(env);
 	struct cl_attr *attr = vvp_env_thread_attr(env);
 	unsigned long len, mlen = 0;
-	pgoff_t ra_end, start = 0, end = 0;
+	pgoff_t ra_end = 0, start = 0, end = 0;
 	struct inode *inode;
 	struct ra_io_arg *ria = &lti->lti_ria;
 	struct cl_object *clob;
@@ -542,7 +547,7 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 	       atomic_read(&ll_i2sbi(inode)->ll_ra_info.ra_cur_pages),
 	       ll_i2sbi(inode)->ll_ra_info.ra_max_pages);
 
-	ra_end = ll_read_ahead_pages(env, io, queue, ras, ria);
+	ret = ll_read_ahead_pages(env, io, queue, ras, ria, &ra_end);
 
 	if (ria->ria_reserved)
 		ll_ra_count_put(ll_i2sbi(inode), ria->ria_reserved);
-- 
1.8.3.1

* [lustre-devel] [PATCH 03/28] lustre: ptlrpc: missing barrier before wake_up
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
  2018-10-14 18:57 ` [lustre-devel] [PATCH 01/28] lustre: osc: osc_extent_tree_dump0() implementation is suboptimal James Simmons
  2018-10-14 18:57 ` [lustre-devel] [PATCH 02/28] lustre: llite: Read ahead should return pages read James Simmons
@ 2018-10-14 18:57 ` James Simmons
  2018-10-17 22:43   ` NeilBrown
  2018-10-14 18:57 ` [lustre-devel] [PATCH 04/28] lustre: ptlrpc: Do not assert when bd_nob_transferred != 0 James Simmons
                   ` (25 subsequent siblings)
  28 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:57 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

ptlrpc_client_wake_req() is missing a memory barrier before the
wakeup, so ptlrpc_register_bulk() may not see rq_resend as set, which
can cause strange errors.
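
A user-space analogue of the store/barrier/wake ordering, written with
C11 fences rather than the kernel's smp_mb() and wait queues. struct
req, wake_req() and waiter_sees_resend() are hypothetical; the
seq_cst fence on each side is what makes the plain "resend" store
visible to a waiter that has observed "woken".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request state shared between a waker and a waiter. */
struct req {
	bool resend;		/* plain field, like rq_resend */
	atomic_int woken;	/* stands in for the wait-queue wakeup */
};

/* Waker: publish the state change, then a full barrier, then wake. */
static void wake_req(struct req *r)
{
	assert(r);
	r->resend = true;
	/* Analogue of smp_mb(): order the store above before the wakeup. */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&r->woken, 1, memory_order_relaxed);
}

/* Waiter: after seeing the wakeup, pair with a barrier before reading. */
static bool waiter_sees_resend(struct req *r)
{
	assert(r);
	if (!atomic_load_explicit(&r->woken, memory_order_relaxed))
		return false;
	atomic_thread_fence(memory_order_seq_cst);
	return r->resend;
}
```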

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8935
Reviewed-on: https://review.whamcloud.com/26583
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/include/lustre_net.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/staging/lustre/lustre/include/lustre_net.h b/drivers/staging/lustre/lustre/include/lustre_net.h
index ce7e98c..468a03e 100644
--- a/drivers/staging/lustre/lustre/include/lustre_net.h
+++ b/drivers/staging/lustre/lustre/include/lustre_net.h
@@ -2211,6 +2211,8 @@ static inline int ptlrpc_status_ntoh(int n)
 static inline void
 ptlrpc_client_wake_req(struct ptlrpc_request *req)
 {
+	/* ensure ptlrpc_register_bulk see rq_resend as set. */
+	smp_mb();
 	if (!req->rq_set)
 		wake_up(&req->rq_reply_waitq);
 	else
-- 
1.8.3.1

* [lustre-devel] [PATCH 04/28] lustre: ptlrpc: Do not assert when bd_nob_transferred != 0
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (2 preceding siblings ...)
  2018-10-14 18:57 ` [lustre-devel] [PATCH 03/28] lustre: ptlrpc: missing barrier before wake_up James Simmons
@ 2018-10-14 18:57 ` James Simmons
  2018-10-17 23:13   ` NeilBrown
  2018-10-14 18:57 ` [lustre-devel] [PATCH 05/28] lustre: uapi: add back LUSTRE_MAXFSNAME to lustre_user.h James Simmons
                   ` (24 subsequent siblings)
  28 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:57 UTC (permalink / raw)
  To: lustre-devel

From: Doug Oucharek <dougso@me.com>

ptlrpc_register_bulk() asserted that bd_nob_transferred == 0 when the
request is not being resent. There is evidence that network errors
can make bd_nob_transferred non-zero in that case, so we should not
be asserting!

This patch replaces that assertion with an -EIO error return.
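
A trimmed-down sketch of the changed check, assuming a descriptor that
carries only the two fields the hunk touches. struct bulk_desc and
register_bulk() are hypothetical; the point is that an unexpected
leftover byte count becomes a recoverable -EIO instead of a crash.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bulk descriptor with just the fields the check touches. */
struct bulk_desc {
	unsigned long nob_transferred;	/* bytes moved by a previous attempt */
	int failure;
};

static int register_bulk(struct bulk_desc *desc, bool resend)
{
	assert(desc);
	if (resend) {
		/* The descriptor is being reused: clear the stale count. */
		desc->nob_transferred = 0;
	} else if (desc->nob_transferred != 0) {
		/* A network failure after the RPC was sent can leave data
		 * here; report it to the caller instead of crashing.
		 */
		return -EIO;
	}
	desc->failure = 0;
	return 0;
}
```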

Signed-off-by: Doug Oucharek <dougso@me.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9828
Reviewed-on: https://review.whamcloud.com/28491
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
index 27eb1c0..7e7db24 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
@@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
 	/* cleanup the state of the bulk for it will be reused */
 	if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY)
 		desc->bd_nob_transferred = 0;
-	else
-		LASSERT(desc->bd_nob_transferred == 0);
+	else if (desc->bd_nob_transferred != 0)
+		/* If the network failed after an RPC was sent, this condition
+		 * could happen.  Rather than assert (was here before), return
+		 * an EIO error.
+		 */
+		return -EIO;
 
 	desc->bd_failure = 0;
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 05/28] lustre: uapi: add back LUSTRE_MAXFSNAME to lustre_user.h
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (3 preceding siblings ...)
  2018-10-14 18:57 ` [lustre-devel] [PATCH 04/28] lustre: ptlrpc: Do not assert when bd_nob_transferred != 0 James Simmons
@ 2018-10-14 18:57 ` James Simmons
  2018-10-14 18:57 ` [lustre-devel] [PATCH 06/28] lustre: ldlm: ELC shouldn't wait on lock flush James Simmons
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:57 UTC (permalink / raw)
  To: lustre-devel

The work to turn lustre_param.h into a proper UAPI header removed
various userland functions used to validate pool names and file
system names. The checks were instead enforced on the kernel side so
that any userland software interfacing directly with the kernel
wouldn't be able to break things badly. When formatting the backend
file system, however, no kernel interaction happens until the
MDT/OST/MGT is mounted, which is very late in the process. So for
this case, add the file system name verification back to userland.
Bringing this back means LUSTRE_MAXFSNAME is needed again, since it
is used by both userland and the kernel. On the kernel side, use
LUSTRE_MAXFSNAME instead of the raw number in server_name2fsname(),
located in obd_mount.c.
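
A user-space sketch of the bounded scan that server_name2fsname()
performs in the obd_mount.c hunk below: start at most LUSTRE_MAXFSNAME
characters in and walk back to the first '-' or ':'. name2fsname() and
its buffer handling are hypothetical; the real function's remaining
arguments and copy logic are not shown in this patch.

```c
#define _POSIX_C_SOURCE 200809L	/* for strnlen() */
#include <assert.h>
#include <errno.h>
#include <string.h>

#define LUSTRE_MAXFSNAME 8

/* Copy the fsname prefix of @svname (e.g. "myfs" from "myfs-OST0007")
 * into @fsname. Fails if no separator is found within the first
 * LUSTRE_MAXFSNAME + 1 characters or if @size is too small.
 */
static int name2fsname(const char *svname, char *fsname, size_t size)
{
	const char *dash;
	size_t len;

	assert(svname && fsname);
	dash = svname + strnlen(svname, LUSTRE_MAXFSNAME);
	for (; dash > svname && *dash != '-' && *dash != ':'; dash--)
		;
	if (dash == svname)
		return -EINVAL;

	len = (size_t)(dash - svname);
	if (len >= size)
		return -EINVAL;
	memcpy(fsname, svname, len);
	fsname[len] = '\0';
	return 0;
}
```

A nine-character name such as "abcdefghi-OST0000" is rejected because
the scan never reaches its '-'.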

Signed-off-by: James Simmons <uja.ornl@yahoo.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9767
Reviewed-on: https://review.whamcloud.com/28070
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h | 2 ++
 drivers/staging/lustre/lustre/obdclass/obd_mount.c             | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
index 4fa7796..b8525e5 100644
--- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
+++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
@@ -515,6 +515,8 @@ static inline char *obd_uuid2str(const struct obd_uuid *uuid)
 	return (char *)(uuid->uuid);
 }
 
+#define LUSTRE_MAXFSNAME 8
+
 /* Extract fsname from uuid (or target name) of a target
  * e.g. (myfs-OST0007_UUID -> myfs)
  * see also deuuidify.
diff --git a/drivers/staging/lustre/lustre/obdclass/obd_mount.c b/drivers/staging/lustre/lustre/obdclass/obd_mount.c
index 33a67fd..5ed1758 100644
--- a/drivers/staging/lustre/lustre/obdclass/obd_mount.c
+++ b/drivers/staging/lustre/lustre/obdclass/obd_mount.c
@@ -599,7 +599,7 @@ static int server_name2fsname(const char *svname, char *fsname,
 {
 	const char *dash;
 
-	dash = svname + strnlen(svname, 8); /* max fsname length is 8 */
+	dash = svname + strnlen(svname, LUSTRE_MAXFSNAME);
 	for (; dash > svname && *dash != '-' && *dash != ':'; dash--)
 		;
 	if (dash == svname)
-- 
1.8.3.1

* [lustre-devel] [PATCH 06/28] lustre: ldlm: ELC shouldn't wait on lock flush
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (4 preceding siblings ...)
  2018-10-14 18:57 ` [lustre-devel] [PATCH 05/28] lustre: uapi: add back LUSTRE_MAXFSNAME to lustre_user.h James Simmons
@ 2018-10-14 18:57 ` James Simmons
  2018-10-17 23:20   ` NeilBrown
  2018-10-14 18:57 ` [lustre-devel] [PATCH 07/28] lustre: llite: pipeline readahead better with large I/O James Simmons
                   ` (22 subsequent siblings)
  28 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:57 UTC (permalink / raw)
  To: lustre-devel

From: Andriy Skulysh <c17819@cray.com>

The commit 08fd034670b5 ("staging: lustre: ldlm: revert the changes
for lock canceling policy") removed the fix for LU-4300 when lru_resize
is disabled.

Introduce ldlm_cancel_aged_no_wait_policy to be used by ELC.
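
A sketch of how the combined flags map to a policy once NO_WAIT is an
independent bit rather than a separate LRUR_NO_WAIT value. The LRU_FLAG_*
bit values and the POL_* names are hypothetical mirrors of the
LDLM_LRU_FLAG_* flags and ldlm_cancel_*_policy callbacks. Note the
parentheses in elc_flags(): '|' binds tighter than '?:', so without
them the whole OR expression becomes the condition and the result is
always the LRUR flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits mirroring the LDLM_LRU_FLAG_* names. */
enum {
	LRU_FLAG_AGED    = 1 << 0,
	LRU_FLAG_PASSED  = 1 << 1,
	LRU_FLAG_SHRINK  = 1 << 2,
	LRU_FLAG_LRUR    = 1 << 3,
	LRU_FLAG_NO_WAIT = 1 << 4,
};

/* Hypothetical tags standing in for the ldlm_cancel_*_policy callbacks. */
enum policy {
	POL_DEFAULT, POL_PASSED, POL_LRUR, POL_LRUR_NO_WAIT,
	POL_AGED, POL_AGED_NO_WAIT, POL_NO_WAIT,
};

/* Flags ELC asks for: always NO_WAIT, plus LRUR or AGED. */
static unsigned int elc_flags(bool lru_resize)
{
	return LRU_FLAG_NO_WAIT | (lru_resize ? LRU_FLAG_LRUR : LRU_FLAG_AGED);
}

static enum policy pick_policy(bool lru_resize, unsigned int f)
{
	if (lru_resize) {
		if (f & LRU_FLAG_SHRINK)
			return POL_PASSED;
		if (f & LRU_FLAG_LRUR)
			return (f & LRU_FLAG_NO_WAIT) ? POL_LRUR_NO_WAIT : POL_LRUR;
		if (f & LRU_FLAG_PASSED)
			return POL_PASSED;
	} else if (f & LRU_FLAG_AGED) {
		return (f & LRU_FLAG_NO_WAIT) ? POL_AGED_NO_WAIT : POL_AGED;
	}
	/* Fallback: plain NO_WAIT, then the default policy. */
	if (f & LRU_FLAG_NO_WAIT)
		return POL_NO_WAIT;
	return POL_DEFAULT;
}
```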

Signed-off-by: Andriy Skulysh <c17819@cray.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8578
Seagate-bug-id: MRP-3662
Reviewed-on: https://review.whamcloud.com/22286
Reviewed-by: Vitaly Fertman <c17818@cray.com>
Reviewed-by: Patrick Farrell <paf@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/ldlm/ldlm_internal.h |  1 -
 drivers/staging/lustre/lustre/ldlm/ldlm_request.c  | 51 +++++++++++++++-------
 2 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h b/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
index 1d7c727..709c527 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
@@ -96,7 +96,6 @@ enum {
 	LDLM_LRU_FLAG_NO_WAIT	= BIT(4), /* Cancel locks w/o blocking (neither
 					   * sending nor waiting for any rpcs)
 					   */
-	LDLM_LRU_FLAG_LRUR_NO_WAIT = BIT(5), /* LRUR + NO_WAIT */
 };
 
 int ldlm_cancel_lru(struct ldlm_namespace *ns, int nr,
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
index 80260b07..3eb5036 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
@@ -579,8 +579,8 @@ int ldlm_prep_elc_req(struct obd_export *exp, struct ptlrpc_request *req,
 		req_capsule_filled_sizes(pill, RCL_CLIENT);
 		avail = ldlm_capsule_handles_avail(pill, RCL_CLIENT, canceloff);
 
-		flags = ns_connect_lru_resize(ns) ?
-			LDLM_LRU_FLAG_LRUR_NO_WAIT : LDLM_LRU_FLAG_AGED;
+		flags = LDLM_LRU_FLAG_NO_WAIT | (ns_connect_lru_resize(ns) ?
+			LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED);
 		to_free = !ns_connect_lru_resize(ns) &&
 			  opc == LDLM_ENQUEUE ? 1 : 0;
 
@@ -1254,6 +1254,20 @@ static enum ldlm_policy_res ldlm_cancel_aged_policy(struct ldlm_namespace *ns,
 	return ldlm_cancel_no_wait_policy(ns, lock, unused, added, count);
 }
 
+static enum ldlm_policy_res
+ldlm_cancel_aged_no_wait_policy(struct ldlm_namespace *ns,
+				struct ldlm_lock *lock,
+				int unused, int added, int count)
+{
+	enum ldlm_policy_res result;
+
+	result = ldlm_cancel_aged_policy(ns, lock, unused, added, count);
+	if (result == LDLM_POLICY_KEEP_LOCK)
+		return result;
+
+	return ldlm_cancel_no_wait_policy(ns, lock, unused, added, count);
+}
+
 /**
  * Callback function for default policy. Makes decision whether to keep \a lock
  * in LRU for current LRU size \a unused, added in current scan \a added and
@@ -1280,26 +1294,32 @@ typedef enum ldlm_policy_res (*ldlm_cancel_lru_policy_t)(
 						      int, int);
 
 static ldlm_cancel_lru_policy_t
-ldlm_cancel_lru_policy(struct ldlm_namespace *ns, int flags)
+ldlm_cancel_lru_policy(struct ldlm_namespace *ns, int lru_flags)
 {
-	if (flags & LDLM_LRU_FLAG_NO_WAIT)
-		return ldlm_cancel_no_wait_policy;
-
 	if (ns_connect_lru_resize(ns)) {
-		if (flags & LDLM_LRU_FLAG_SHRINK)
+		if (lru_flags & LDLM_LRU_FLAG_SHRINK) {
 			/* We kill passed number of old locks. */
 			return ldlm_cancel_passed_policy;
-		else if (flags & LDLM_LRU_FLAG_LRUR)
-			return ldlm_cancel_lrur_policy;
-		else if (flags & LDLM_LRU_FLAG_PASSED)
+		} else if (lru_flags & LDLM_LRU_FLAG_LRUR) {
+			if (lru_flags & LDLM_LRU_FLAG_NO_WAIT)
+				return ldlm_cancel_lrur_no_wait_policy;
+			else
+				return ldlm_cancel_lrur_policy;
+		} else if (lru_flags & LDLM_LRU_FLAG_PASSED) {
 			return ldlm_cancel_passed_policy;
-		else if (flags & LDLM_LRU_FLAG_LRUR_NO_WAIT)
-			return ldlm_cancel_lrur_no_wait_policy;
+		}
 	} else {
-		if (flags & LDLM_LRU_FLAG_AGED)
-			return ldlm_cancel_aged_policy;
+		if (lru_flags & LDLM_LRU_FLAG_AGED) {
+			if (lru_flags & LDLM_LRU_FLAG_NO_WAIT)
+				return ldlm_cancel_aged_no_wait_policy;
+			else
+				return ldlm_cancel_aged_policy;
+		}
 	}
 
+	if (lru_flags & LDLM_LRU_FLAG_NO_WAIT)
+		return ldlm_cancel_no_wait_policy;
+
 	return ldlm_cancel_default_policy;
 }
 
@@ -1344,8 +1364,7 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 	ldlm_cancel_lru_policy_t pf;
 	struct ldlm_lock *lock, *next;
 	int added = 0, unused, remained;
-	int no_wait = flags &
-		(LDLM_LRU_FLAG_NO_WAIT | LDLM_LRU_FLAG_LRUR_NO_WAIT);
+	int no_wait = flags & LDLM_LRU_FLAG_NO_WAIT;
 
 	spin_lock(&ns->ns_lock);
 	unused = ns->ns_nr_unused;
-- 
1.8.3.1

* [lustre-devel] [PATCH 07/28] lustre: llite: pipeline readahead better with large I/O
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (5 preceding siblings ...)
  2018-10-14 18:57 ` [lustre-devel] [PATCH 06/28] lustre: ldlm: ELC shouldn't wait on lock flush James Simmons
@ 2018-10-14 18:57 ` James Simmons
  2018-10-14 18:57 ` [lustre-devel] [PATCH 08/28] lustre: hsm: add kkuc before sending registration RPCs James Simmons
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:57 UTC (permalink / raw)
  To: lustre-devel

From: Jinshan Xiong <jinshan.xiong@gmail.com>

Fix a bug where the next readahead index is not set correctly when
the application issues large I/O, and extend the readahead window
length to at least cover the size of the current I/O.
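
A reduced sketch of the index bookkeeping, assuming the readahead
state is just the next page index to start from. struct ra_state and
the three helpers are hypothetical; in the patch the update is done
on ras_next_readahead under ras_lock, and the window-growth rule is
only described above, not shown in these hunks.

```c
#include <assert.h>

/* Hypothetical, trimmed-down readahead state. */
struct ra_state {
	unsigned long next_readahead;	/* first page index to read ahead */
};

/* After readahead stopped at @ra_end, resume just past it next time. */
static void ra_advance(struct ra_state *ras, unsigned long ra_end)
{
	assert(ras);
	if (ra_end > 0)
		ras->next_readahead = ra_end + 1;
}

/* On a read of page @index, never start readahead at or before the
 * page the application is reading right now.
 */
static void ra_on_read(struct ra_state *ras, unsigned long index)
{
	assert(ras);
	if (index + 1 > ras->next_readahead)
		ras->next_readahead = index + 1;
}

/* Grow the window so it at least covers the current I/O (in pages). */
static unsigned long ra_window_len(unsigned long window_len,
				   unsigned long io_pages)
{
	return window_len < io_pages ? io_pages : window_len;
}
```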

Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9574
Reviewed-on: https://review.whamcloud.com/27388
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Patrick Farrell <paf@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/llite/rw.c | 29 ++++++++++++-----------------
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/rw.c b/drivers/staging/lustre/lustre/llite/rw.c
index 82d874a..9cc0d4fe 100644
--- a/drivers/staging/lustre/lustre/llite/rw.c
+++ b/drivers/staging/lustre/lustre/llite/rw.c
@@ -494,10 +494,8 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 			end = end_index;
 			ria->ria_eof = true;
 		}
-
-		ras->ras_next_readahead = max(end, end + 1);
-		RAS_CDEBUG(ras);
 	}
+
 	ria->ria_start = start;
 	ria->ria_end = end;
 	/* If stride I/O mode is detected, get stride window*/
@@ -518,6 +516,7 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 		return 0;
 	}
 
+	RAS_CDEBUG(ras);
 	CDEBUG(D_READA, DFID ": ria: %lu/%lu, bead: %lu/%lu, hit: %d\n",
 	       PFID(lu_object_fid(&clob->co_lu)),
 	       ria->ria_start, ria->ria_end,
@@ -555,25 +554,20 @@ static int ll_readahead(const struct lu_env *env, struct cl_io *io,
 	if (ra_end == end && ra_end == (kms >> PAGE_SHIFT))
 		ll_ra_stats_inc(inode, RA_STAT_EOF);
 
-	/* if we didn't get to the end of the region we reserved from
-	 * the ras we need to go back and update the ras so that the
-	 * next read-ahead tries from where we left off.  we only do so
-	 * if the region we failed to issue read-ahead on is still ahead
-	 * of the app and behind the next index to start read-ahead from
-	 */
 	CDEBUG(D_READA, "ra_end = %lu end = %lu stride end = %lu pages = %d\n",
 	       ra_end, end, ria->ria_end, ret);
 
-	if (ra_end > 0 && ra_end != end) {
+	if (ra_end != end)
 		ll_ra_stats_inc(inode, RA_STAT_FAILED_REACH_END);
+
+	if (ra_end > 0) {
+		/* update the ras so that the next read-ahead tries from
+		 * where we left off.
+		 */
 		spin_lock(&ras->ras_lock);
-		if (ra_end <= ras->ras_next_readahead &&
-		    index_in_window(ra_end, ras->ras_window_start, 0,
-				    ras->ras_window_len)) {
-			ras->ras_next_readahead = ra_end + 1;
-			RAS_CDEBUG(ras);
-		}
+		ras->ras_next_readahead = ra_end + 1;
 		spin_unlock(&ras->ras_lock);
+		RAS_CDEBUG(ras);
 	}
 
 	return ret;
@@ -857,7 +851,8 @@ static void ras_update(struct ll_sb_info *sbi, struct inode *inode,
 		 * of read-ahead, so we use original offset here,
 		 * instead of ras_window_start, which is RPC aligned
 		 */
-		ras->ras_next_readahead = max(index, ras->ras_next_readahead);
+		ras->ras_next_readahead = max(index + 1,
+					      ras->ras_next_readahead);
 		ras->ras_window_start = max(ras->ras_stride_offset,
 					    ras->ras_window_start);
 	} else {
-- 
1.8.3.1

* [lustre-devel] [PATCH 08/28] lustre: hsm: add kkuc before sending registration RPCs
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (6 preceding siblings ...)
  2018-10-14 18:57 ` [lustre-devel] [PATCH 07/28] lustre: llite: pipeline readahead better with large I/O James Simmons
@ 2018-10-14 18:57 ` James Simmons
  2018-10-14 18:57 ` [lustre-devel] [PATCH 09/28] lustre: mdc: improve mdc_enqueue() error message James Simmons
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:57 UTC (permalink / raw)
  To: lustre-devel

From: Henri Doreau <henri.doreau@cea.fr>

Add the copytool (CT) pipe to the kkuc list before sending the HSM
registration RPCs. This avoids a situation where the registration
completes and the coordinator (CDT) sends HSM actions just before the
kkuc registration happens. In that case the client drops the actions
because there are no CT pipes in the kkuc list.
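
A simulated sketch of the reordered registration and its unwind path,
assuming every MDT failure is fatal (the patch also tolerates
transient errors, which is omitted here). struct ct_sim and
ct_register() are hypothetical; the goto labels mirror err_kkuc_rem
in the lmv_obd.c hunk.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical simulation state standing in for the kkuc list and MDTs. */
struct ct_sim {
	bool kkuc_registered;	/* copytool pipe is in the kkuc list */
	int nr_mdts;
	int fail_at;		/* index of the MDT that fails, or -1 */
	int registered;		/* number of MDTs currently registered */
};

static int ct_register(struct ct_sim *s)
{
	int i, rc;

	assert(s);
	/* Step 1: join the kkuc list first so no early action is dropped. */
	s->kkuc_registered = true;

	/* Step 2: all or nothing across the MDTs. */
	for (i = 0; i < s->nr_mdts; i++) {
		if (i == s->fail_at) {
			rc = -EIO;
			goto err_unregister;
		}
		s->registered++;
	}
	if (s->registered == 0) {
		rc = -ENOTCONN;
		goto err_kkuc_rem;
	}
	return 0;

err_unregister:
	/* Roll back the MDTs registered before the failure. */
	s->registered = 0;
err_kkuc_rem:
	s->kkuc_registered = false;
	return rc;
}
```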

Signed-off-by: Henri Doreau <henri.doreau@cea.fr>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9416
Reviewed-on: https://review.whamcloud.com/28751
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Faccini Bruno <bruno.faccini@intel.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/lmv/lmv_obd.c | 44 +++++++++++++++++------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/drivers/staging/lustre/lustre/lmv/lmv_obd.c b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
index 71bd843..952c68e 100644
--- a/drivers/staging/lustre/lustre/lmv/lmv_obd.c
+++ b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
@@ -784,9 +784,23 @@ static int lmv_hsm_ct_register(struct lmv_obd *lmv, unsigned int cmd, int len,
 {
 	struct file *filp;
 	__u32 i, j;
-	int err, rc = 0;
+	int err;
 	bool any_set = false;
-	struct kkuc_ct_data kcd = { 0 };
+	struct kkuc_ct_data kcd = {
+		.kcd_magic	= KKUC_CT_DATA_MAGIC,
+		.kcd_uuid	= lmv->cluuid,
+		.kcd_archive	= lk->lk_data
+	};
+	int rc = 0;
+
+	filp = fget(lk->lk_wfd);
+	if (!filp)
+		return -EBADF;
+
+	rc = libcfs_kkuc_group_add(filp, lk->lk_uid, lk->lk_group,
+				   &kcd, sizeof(kcd));
+	if (rc)
+		goto err_fput;
 
 	/* All or nothing: try to register to all MDS.
 	 * In case of failure, unregister from previous MDS,
@@ -815,7 +829,7 @@ static int lmv_hsm_ct_register(struct lmv_obd *lmv, unsigned int cmd, int len,
 					obd_iocontrol(cmd, tgt->ltd_exp, len,
 						      lk, uarg);
 				}
-				return rc;
+				goto err_kkuc_rem;
 			}
 			/* else: transient error.
 			 * kuc will register to the missing MDT when it is back
@@ -825,24 +839,18 @@ static int lmv_hsm_ct_register(struct lmv_obd *lmv, unsigned int cmd, int len,
 		}
 	}
 
-	if (!any_set)
+	if (!any_set) {
 		/* no registration done: return error */
-		return -ENOTCONN;
-
-	/* at least one registration done, with no failure */
-	filp = fget(lk->lk_wfd);
-	if (!filp)
-		return -EBADF;
-
-	kcd.kcd_magic = KKUC_CT_DATA_MAGIC;
-	kcd.kcd_uuid = lmv->cluuid;
-	kcd.kcd_archive = lk->lk_data;
+		rc = -ENOTCONN;
+		goto err_kkuc_rem;
+	}
 
-	rc = libcfs_kkuc_group_add(filp, lk->lk_uid, lk->lk_group,
-				   &kcd, sizeof(kcd));
-	if (rc)
-		fput(filp);
+	return 0;
 
+err_kkuc_rem:
+	libcfs_kkuc_group_rem(lk->lk_uid, lk->lk_group);
+err_fput:
+	fput(filp);
 	return rc;
 }
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 09/28] lustre: mdc: improve mdc_enqueue() error message
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (7 preceding siblings ...)
  2018-10-14 18:57 ` [lustre-devel] [PATCH 08/28] lustre: hsm: add kkuc before sending registration RPCs James Simmons
@ 2018-10-14 18:57 ` James Simmons
  2018-10-14 18:58 ` [lustre-devel] [PATCH 10/28] lustre: llite: Update i_nlink on unlink James Simmons
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:57 UTC (permalink / raw)
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Include the parent/child FIDs and name in the mdc_enqueue()
debug message.
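
For illustration, the userspace sketch below builds a message in the
new format. The sketch_fid struct and the SKETCH_DFID/SKETCH_PFID
macros are hypothetical stand-ins, assumed to print a FID as
[seq:oid:ver] in hex the way DFID/PFID do. The device name, FIDs and
file name are made up.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct lu_fid: sequence, object id, version. */
struct sketch_fid {
	unsigned long long f_seq;
	unsigned int f_oid;
	unsigned int f_ver;
};

/* Assumed to mirror the DFID/PFID pair: a format string plus its args. */
#define SKETCH_DFID "[0x%llx:0x%x:0x%x]"
#define SKETCH_PFID(fid) (fid)->f_seq, (fid)->f_oid, (fid)->f_ver

/* Format the enqueue failure the same way as the new CDEBUG() call. */
static int format_enqueue_error(char *buf, size_t len, const char *obd_name,
				const struct sketch_fid *parent,
				const struct sketch_fid *child,
				const char *name, int rc)
{
	/* Standard ternary instead of the GNU "name ?: \"\"" shorthand. */
	return snprintf(buf, len,
			"%s: ldlm_cli_enqueue " SKETCH_DFID ":" SKETCH_DFID
			"=%s failed: rc = %d\n",
			obd_name, SKETCH_PFID(parent), SKETCH_PFID(child),
			name ? name : "", rc);
}
```

With parent [0x200000401:0x1:0x0], child [0x200000401:0x2:0x0], name
"foo" and rc -2, the buffer holds "testfs-MDT0000-mdc: ldlm_cli_enqueue
[0x200000401:0x1:0x0]:[0x200000401:0x2:0x0]=foo failed: rc = -2". A
NULL name prints as an empty string.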

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-4705
Reviewed-on: https://review.whamcloud.com/28978
Reviewed-by: Steve Guminski <stephenx.guminski@intel.com>
Reviewed-by: Yang Sheng <ys@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/mdc/mdc_locks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/lustre/lustre/mdc/mdc_locks.c b/drivers/staging/lustre/lustre/mdc/mdc_locks.c
index cfe917c..5ec5d78 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_locks.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_locks.c
@@ -824,8 +824,10 @@ int mdc_enqueue_base(struct obd_export *exp, struct ldlm_enqueue_info *einfo,
 	mdc_put_mod_rpc_slot(req, it);
 
 	if (rc < 0) {
-		CDEBUG(D_INFO, "%s: ldlm_cli_enqueue failed: rc = %d\n",
-		       obddev->obd_name, rc);
+		CDEBUG(D_INFO,
+		       "%s: ldlm_cli_enqueue " DFID ":" DFID "=%s failed: rc = %d\n",
+		       obddev->obd_name, PFID(&op_data->op_fid1),
+		       PFID(&op_data->op_fid2), op_data->op_name ?: "", rc);
 
 		mdc_clear_replay_flag(req, rc);
 		ptlrpc_req_finished(req);
-- 
1.8.3.1

* [lustre-devel] [PATCH 10/28] lustre: llite: Update i_nlink on unlink
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (8 preceding siblings ...)
  2018-10-14 18:57 ` [lustre-devel] [PATCH 09/28] lustre: mdc: improve mdc_enqueue() error message James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-14 18:58 ` [lustre-devel] [PATCH 11/28] lustre: llite: use security context if it's enabled in the kernel James Simmons
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <paf@cray.com>

Currently, the client inode link count is not updated on
last unlink.  This is fine because the dentries are all
gone and the inode is eligible for reclaim, but it's still
incorrect.  This causes two problems:

1. The inode is not reclaimed immediately.
2. The i_nlink count is > 0 for a fully unlinked file, which
   confuses wrapfs.

On last unlink, the MDT sends back attributes.  Use the
nlink count from these to update the client inode.
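
A minimal sketch of that update, assuming simplified hypothetical
types: sketch_mdt_body carries only a valid mask and a link count, and
SKETCH_MD_FLNLINK stands in for OBD_MD_FLNLINK. The count is applied
only when the reply marks it valid.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical valid-mask bit standing in for OBD_MD_FLNLINK. */
#define SKETCH_MD_FLNLINK (1ULL << 0)

/* Hypothetical subset of the MDT reply body. */
struct sketch_mdt_body {
	uint64_t mbo_valid;	/* which fields the server filled in */
	uint32_t mbo_nlink;	/* link count after the unlink */
};

/* Hypothetical client inode reduced to its link count. */
struct sketch_inode {
	uint32_t i_nlink;
};

/*
 * Copy the server's link count into the inode only if the reply says
 * it is valid. Returns true when the inode was updated.
 */
static bool sketch_update_nlink(struct sketch_inode *inode,
				const struct sketch_mdt_body *body)
{
	if (!(body->mbo_valid & SKETCH_MD_FLNLINK))
		return false;
	inode->i_nlink = body->mbo_nlink;
	return true;
}
```

On the last unlink the reply is expected to carry a link count of 0,
so the client inode drops to 0 as well instead of keeping a stale
value; a reply without the valid bit leaves the inode untouched.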

Remove the NULL check inherited from ll_get_child_fid(), because
the inode should never be NULL on an unlink.

Signed-off-by: Patrick Farrell <paf@cray.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10131
Reviewed-on: https://review.whamcloud.com/29651
Reviewed-by: Ben Evans <bevans@cray.com>
Reviewed-by: Alexey Lyashkov <c17817@cray.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/llite/namei.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/namei.c b/drivers/staging/lustre/lustre/llite/namei.c
index 09cdf02..f2bd57e 100644
--- a/drivers/staging/lustre/lustre/llite/namei.c
+++ b/drivers/staging/lustre/lustre/llite/namei.c
@@ -1073,6 +1073,7 @@ static int ll_unlink(struct inode *dir, struct dentry *dchild)
 {
 	struct ptlrpc_request *request = NULL;
 	struct md_op_data *op_data;
+	struct mdt_body *body;
 	int rc;
 
 	CDEBUG(D_VFSTRACE, "VFS Op:name=%pd,dir=%lu/%u(%p)\n",
@@ -1085,8 +1086,7 @@ static int ll_unlink(struct inode *dir, struct dentry *dchild)
 	if (IS_ERR(op_data))
 		return PTR_ERR(op_data);
 
-	if (dchild->d_inode)
-		op_data->op_fid3 = *ll_inode2fid(dchild->d_inode);
+	op_data->op_fid3 = *ll_inode2fid(dchild->d_inode);
 
 	op_data->op_fid2 = op_data->op_fid3;
 	rc = md_unlink(ll_i2sbi(dir)->ll_md_exp, op_data, &request);
@@ -1094,6 +1094,14 @@ static int ll_unlink(struct inode *dir, struct dentry *dchild)
 	if (rc)
 		goto out;
 
+	/*
+	 * The server puts attributes in on the last unlink, use them to update
+	 * the link count so the inode can be freed immediately.
+	 */
+	body = req_capsule_server_get(&request->rq_pill, &RMF_MDT_BODY);
+	if (body->mbo_valid & OBD_MD_FLNLINK)
+		set_nlink(dchild->d_inode, body->mbo_nlink);
+
 	ll_update_times(request, dir);
 	ll_stats_ops_tally(ll_i2sbi(dir), LPROC_LL_UNLINK, 1);
 
-- 
1.8.3.1

* [lustre-devel] [PATCH 11/28] lustre: llite: use security context if it's enabled in the kernel
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (9 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 10/28] lustre: llite: Update i_nlink on unlink James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-17 23:34   ` NeilBrown
  2018-10-14 18:58 ` [lustre-devel] [PATCH 12/28] lustre: ptlrpc: do not wakeup every second James Simmons
                   ` (17 subsequent siblings)
  28 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

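If security support is disabled in the kernel, Lustre stops working
properly (files cannot be created, etc.), so only request
OBD_CONNECT2_FILE_SECCTX when CONFIG_SECURITY is enabled.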

Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9578
Reviewed-on: https://review.whamcloud.com/27364
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/llite/llite_lib.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index 22b545e..153aa12 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -243,8 +243,9 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
 	if (sbi->ll_flags & LL_SBI_ALWAYS_PING)
 		data->ocd_connect_flags &= ~OBD_CONNECT_PINGLESS;
 
+#ifdef CONFIG_SECURITY
 	data->ocd_connect_flags2 |= OBD_CONNECT2_FILE_SECCTX;
-
+#endif
 	data->ocd_brw_size = MD_MAX_BRW_SIZE;
 
 	err = obd_connect(NULL, &sbi->ll_md_exp, sbi->ll_md_obd,
-- 
1.8.3.1

* [lustre-devel] [PATCH 12/28] lustre: ptlrpc: do not wakeup every second
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (10 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 11/28] lustre: llite: use security context if it's enabled in the kernel James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-29  0:03   ` NeilBrown
  2018-10-14 18:58 ` [lustre-devel] [PATCH 13/28] lustre: ldlm: check lock cancellation in ldlm_cli_cancel() James Simmons
                   ` (16 subsequent siblings)
  28 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

From: Alex Zhuravlev <bzzz@whamcloud.com>

If there are no RPC requests on the set, there is no need to wake up
every second. The thread is woken up when a request is added to the
set or when the STOP bit is set, so it only needs a timed wakeup when
there are requests on the set to worry about.

Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9660
Reviewed-on: https://review.whamcloud.com/28776
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Patrick Farrell <paf@cray.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
index c201a88..5b4977b 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
@@ -371,7 +371,7 @@ static int ptlrpcd_check(struct lu_env *env, struct ptlrpcd_ctl *pc)
 		}
 	}
 
-	return rc;
+	return rc || test_bit(LIOD_STOP, &pc->pc_flags);
 }
 
 /**
@@ -441,7 +441,7 @@ static int ptlrpcd(void *arg)
 		lu_context_enter(env.le_ses);
 		if (wait_event_idle_timeout(set->set_waitq,
 					    ptlrpcd_check(&env, pc),
-					    (timeout ? timeout : 1) * HZ) == 0)
+					    timeout * HZ) == 0)
 			ptlrpc_expired_set(set);
 
 		lu_context_exit(&env.le_ctx);
-- 
1.8.3.1

* [lustre-devel] [PATCH 13/28] lustre: ldlm: check lock cancellation in ldlm_cli_cancel()
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (11 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 12/28] lustre: ptlrpc: do not wakeup every second James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-14 18:58 ` [lustre-devel] [PATCH 14/28] lustre: ptlrpc: handle case of epp_free_pages <= PTLRPC_MAX_BRW_PAGES James Simmons
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

From: Jinshan Xiong <jinshan.xiong@gmail.com>

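If the lock is already being canceled and the caller did not pass
LCF_ASYNC, ldlm_cli_cancel() goes on to cancel it again. In that
case, the assertion 'list_empty(&lock->l_bl_ast)' fails because the
lock is already on a cancel list.

This patch checks beforehand whether the lock is already being
canceled. If it is, the function returns early; without LCF_ASYNC it
first waits until the lock's BL_DONE flag is set.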
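
The new early-return path can be modeled in userspace as below. This
is a hypothetical sketch: a pthread mutex stands in for
lock_res_and_lock(), plain fields stand in for the LDLM flag helpers,
and a yield loop stands in for wait_event_idle() on l_waitq. The
sketch_is_bl_done() helper keeps the same shape as is_bl_done(): a
cheap unlocked check first, then a re-check under the lock.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock model: a mutex and two flags. */
struct sketch_lock {
	pthread_mutex_t mutex;	/* stands in for lock_res_and_lock() */
	bool canceling;		/* stands in for ldlm_is_canceling() */
	atomic_bool bl_done;	/* stands in for ldlm_is_bl_done() */
};

/* Unlocked fast check; only if it says "not done", re-read under lock. */
static bool sketch_is_bl_done(struct sketch_lock *lock)
{
	bool done = true;

	if (!atomic_load_explicit(&lock->bl_done, memory_order_relaxed)) {
		pthread_mutex_lock(&lock->mutex);
		done = atomic_load(&lock->bl_done);
		pthread_mutex_unlock(&lock->mutex);
	}
	return done;
}

enum sketch_cancel_result {
	SKETCH_CANCEL_PROCEED,	/* not canceling yet: caller cancels it */
	SKETCH_ALREADY_ASYNC,	/* already canceling, caller won't wait */
	SKETCH_ALREADY_WAITED,	/* already canceling, waited for bl_done */
};

/* Models the check at the top of ldlm_cli_cancel() after the patch. */
static enum sketch_cancel_result
sketch_cli_cancel(struct sketch_lock *lock, bool async)
{
	pthread_mutex_lock(&lock->mutex);
	if (lock->canceling) {
		pthread_mutex_unlock(&lock->mutex);
		if (!async) {
			/* The kernel sleeps on l_waitq; yielding keeps this short. */
			while (!sketch_is_bl_done(lock))
				sched_yield();
			return SKETCH_ALREADY_WAITED;
		}
		return SKETCH_ALREADY_ASYNC;
	}
	lock->canceling = true;
	pthread_mutex_unlock(&lock->mutex);
	return SKETCH_CANCEL_PROCEED;
}
```

The first call marks the lock as canceling and proceeds. A later
asynchronous call returns at once, and a synchronous call returns only
once bl_done is set, so neither tries to put the lock on a second
cancel list.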

Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9997
Reviewed-on: https://review.whamcloud.com/29080
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/ldlm/ldlm_internal.h | 13 +++++++++++++
 drivers/staging/lustre/lustre/ldlm/ldlm_lock.c     | 13 -------------
 drivers/staging/lustre/lustre/ldlm/ldlm_request.c  |  6 +++++-
 3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h b/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
index 709c527..46b2b64 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
@@ -299,6 +299,19 @@ static inline int is_granted_or_cancelled(struct ldlm_lock *lock)
 	return ret;
 }
 
+static inline bool is_bl_done(struct ldlm_lock *lock)
+{
+	bool bl_done = true;
+
+	if (!ldlm_is_bl_done(lock)) {
+		lock_res_and_lock(lock);
+		bl_done = ldlm_is_bl_done(lock);
+		unlock_res_and_lock(lock);
+	}
+
+	return bl_done;
+}
+
 typedef void (*ldlm_policy_wire_to_local_t)(const union ldlm_wire_policy_data *,
 					    union ldlm_policy_data *);
 
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_lock.c b/drivers/staging/lustre/lustre/ldlm/ldlm_lock.c
index bc6b122..ebdfc11 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_lock.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_lock.c
@@ -1832,19 +1832,6 @@ int ldlm_run_ast_work(struct ldlm_namespace *ns, struct list_head *rpc_list,
 	return rc;
 }
 
-static bool is_bl_done(struct ldlm_lock *lock)
-{
-	bool bl_done = true;
-
-	if (!ldlm_is_bl_done(lock)) {
-		lock_res_and_lock(lock);
-		bl_done = ldlm_is_bl_done(lock);
-		unlock_res_and_lock(lock);
-	}
-
-	return bl_done;
-}
-
 /**
  * Helper function to call blocking AST for LDLM lock \a lock in a
  * "cancelling" mode.
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
index 3eb5036..a208c99 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
@@ -1026,8 +1026,12 @@ int ldlm_cli_cancel(const struct lustre_handle *lockh,
 
 	lock_res_and_lock(lock);
 	/* Lock is being canceled and the caller doesn't want to wait */
-	if (ldlm_is_canceling(lock) && (cancel_flags & LCF_ASYNC)) {
+	if (ldlm_is_canceling(lock)) {
 		unlock_res_and_lock(lock);
+
+		if (!(cancel_flags & LCF_ASYNC))
+			wait_event_idle(lock->l_waitq, is_bl_done(lock));
+
 		LDLM_LOCK_RELEASE(lock);
 		return 0;
 	}
-- 
1.8.3.1

* [lustre-devel] [PATCH 14/28] lustre: ptlrpc: handle case of epp_free_pages <= PTLRPC_MAX_BRW_PAGES
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (12 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 13/28] lustre: ldlm: check lock cancellation in ldlm_cli_cancel() James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-14 18:58 ` [lustre-devel] [PATCH 15/28] lustre: llite: fix for stat under kthread and X86_X32 James Simmons
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

From: Bob Glossman <bob.glossman@intel.com>

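The current code does not correctly handle the case where
page_pools.epp_free_pages is not larger than PTLRPC_MAX_BRW_PAGES:
the subtraction is unsigned, so it wraps around to a huge value
instead of going negative. This patch fixes those instances.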
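
The old max(a - b, 0UL) form cannot clamp anything, because an
unsigned difference is never negative. A small sketch with a
hypothetical SKETCH_MAX_BRW_PAGES of 256 (not the real value of
PTLRPC_MAX_BRW_PAGES) shows the old and new shapes of the
computation:

```c
/* Hypothetical stand-in for PTLRPC_MAX_BRW_PAGES. */
#define SKETCH_MAX_BRW_PAGES 256UL

/* Old shape: the subtraction wraps before the max() ever sees it. */
static unsigned long old_excess(unsigned long free_pages)
{
	unsigned long diff = free_pages - SKETCH_MAX_BRW_PAGES;

	/* Equivalent of max(diff, 0UL): always yields diff. */
	return diff > 0UL ? diff : 0UL;
}

/* New shape: compare first, subtract only when it cannot wrap. */
static unsigned long new_excess(unsigned long free_pages)
{
	return free_pages <= SKETCH_MAX_BRW_PAGES ?
	       0UL : free_pages - SKETCH_MAX_BRW_PAGES;
}
```

With 100 free pages, old_excess() returns (unsigned long)-156, a huge
count to report to the shrinker, while new_excess() returns 0. With
300 free pages both return 44.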

Signed-off-by: Bob Glossman <bob.glossman@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9458
Reviewed-on: https://review.whamcloud.com/27016
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/ptlrpc/sec_bulk.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/lustre/lustre/ptlrpc/sec_bulk.c b/drivers/staging/lustre/lustre/ptlrpc/sec_bulk.c
index 3d336d9..03bc95f 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/sec_bulk.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/sec_bulk.c
@@ -232,7 +232,8 @@ static unsigned long enc_pools_shrink_count(struct shrinker *s,
 	}
 
 	LASSERT(page_pools.epp_idle_idx <= IDLE_IDX_MAX);
-	return max(page_pools.epp_free_pages - PTLRPC_MAX_BRW_PAGES, 0UL) *
+	return (page_pools.epp_free_pages <= PTLRPC_MAX_BRW_PAGES) ? 0 :
+		(page_pools.epp_free_pages - PTLRPC_MAX_BRW_PAGES) *
 		(IDLE_IDX_MAX - page_pools.epp_idle_idx) / IDLE_IDX_MAX;
 }
 
@@ -243,8 +244,12 @@ static unsigned long enc_pools_shrink_scan(struct shrinker *s,
 					   struct shrink_control *sc)
 {
 	spin_lock(&page_pools.epp_lock);
-	sc->nr_to_scan = min_t(unsigned long, sc->nr_to_scan,
-			      page_pools.epp_free_pages - PTLRPC_MAX_BRW_PAGES);
+	if (page_pools.epp_free_pages > PTLRPC_MAX_BRW_PAGES)
+		sc->nr_to_scan = min_t(unsigned long, sc->nr_to_scan,
+				       page_pools.epp_free_pages -
+				       PTLRPC_MAX_BRW_PAGES);
+	else
+		sc->nr_to_scan = 0;
 	if (sc->nr_to_scan > 0) {
 		enc_pools_release_free_pages(sc->nr_to_scan);
 		CDEBUG(D_SEC, "released %ld pages, %ld left\n",
-- 
1.8.3.1

* [lustre-devel] [PATCH 15/28] lustre: llite: fix for stat under kthread and X86_X32
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (13 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 14/28] lustre: ptlrpc: handle case of epp_free_pages <= PTLRPC_MAX_BRW_PAGES James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-18  1:48   ` NeilBrown
  2018-10-14 18:58 ` [lustre-devel] [PATCH 16/28] lustre: statahead: missing barrier before wake_up James Simmons
                   ` (13 subsequent siblings)
  28 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

From: Frank Zago <fzago@cray.com>

Under the following conditions, ll_getattr() will flatten the inode
number when it shouldn't:

 - CONFIG_X86_X32 is enabled, even if the x32 ABI is not actually
   used,
 - ll_getattr() is called from a kernel thread (through vfs_getattr(),
   for instance).

As a result, the same file gets a different inode number depending on
whether it is stat'ed from a kernel thread or from a syscall, for
instance 4198401 vs. 144115205272502273.
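
Both numbers above are consistent with the 64-bit and 32-bit FID
flattening helpers. The sketch below assumes simplified copies of
fid_flatten() and fid_flatten32() (IGIF handling dropped) and assumes
FID_SEQ_START is 0x200000000. The example FID [0x200000401:0x1:0x0]
is also an assumption, chosen because it reproduces both values; it is
not taken from the original report.

```c
#include <stdint.h>

/* Simplified FID with the same three fields as struct lu_fid. */
struct sketch_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

/* Assumed value of FID_SEQ_START, the first normal FID sequence. */
#define SKETCH_FID_SEQ_START 0x200000000ULL

/* 64-bit inode number, modeled on fid_flatten(). */
static uint64_t sketch_flatten64(const struct sketch_fid *fid)
{
	uint64_t seq = fid->f_seq;
	uint64_t ino = (seq << 24) + ((seq >> 24) & 0xffffff0000ULL) +
		       fid->f_oid;

	return ino ? ino : fid->f_oid;
}

/* 32-bit inode number, modeled on fid_flatten32(). */
static uint32_t sketch_flatten32(const struct sketch_fid *fid)
{
	uint64_t seq = fid->f_seq - SKETCH_FID_SEQ_START;
	uint32_t ino;

	/* Spread sequence and OID bits over 32 bits; truncation intended. */
	ino = ((seq & 0x000fffffULL) << 12) + ((seq >> 8) & 0xfffff000) +
	      ((seq >> (64 - (40 - 8))) & 0xffffff00) +
	      (fid->f_oid & 0xff000fff) + ((fid->f_oid & 0x00fff000) << 8);

	return ino ? ino : fid->f_oid;
}
```

For that FID, sketch_flatten64() yields 144115205272502273 and
sketch_flatten32() yields 4198401, so which value a caller sees
depends only on whether ll_need_32bit_api() answers true.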

ll_getattr() calls ll_need_32bit_api() to determine whether the task
is 32 bits. For the kthread+X86_X32 combination, that function
reports that the task is 32 bits, which is incorrect, as the kernel is
64 bits.

The solution is to check whether the call comes from a kernel thread
(which is 64 bits) and act accordingly.

Signed-off-by: Frank Zago <fzago@cray.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9468
Reviewed-on: https://review.whamcloud.com/26992
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/llite/dir.c          |  6 +++---
 drivers/staging/lustre/lustre/llite/lcommon_cl.c   |  2 +-
 .../staging/lustre/lustre/llite/llite_internal.h   | 22 +++++++++++++++++-----
 3 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
index 231b351..19c5e9c 100644
--- a/drivers/staging/lustre/lustre/llite/dir.c
+++ b/drivers/staging/lustre/lustre/llite/dir.c
@@ -202,7 +202,7 @@ int ll_dir_read(struct inode *inode, __u64 *ppos, struct md_op_data *op_data,
 {
 	struct ll_sb_info    *sbi	= ll_i2sbi(inode);
 	__u64		   pos		= *ppos;
-	int		   is_api32 = ll_need_32bit_api(sbi);
+	bool is_api32 = ll_need_32bit_api(sbi);
 	int		   is_hash64 = sbi->ll_flags & LL_SBI_64BIT_HASH;
 	struct page	  *page;
 	bool		   done = false;
@@ -296,7 +296,7 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
 	struct ll_sb_info	*sbi	= ll_i2sbi(inode);
 	__u64 pos = lfd ? lfd->lfd_pos : 0;
 	int			hash64	= sbi->ll_flags & LL_SBI_64BIT_HASH;
-	int			api32	= ll_need_32bit_api(sbi);
+	bool api32 = ll_need_32bit_api(sbi);
 	struct md_op_data *op_data;
 	int			rc;
 
@@ -1674,7 +1674,7 @@ static loff_t ll_dir_seek(struct file *file, loff_t offset, int origin)
 	struct inode *inode = file->f_mapping->host;
 	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
-	int api32 = ll_need_32bit_api(sbi);
+	bool api32 = ll_need_32bit_api(sbi);
 	loff_t ret = -EINVAL;
 
 	switch (origin) {
diff --git a/drivers/staging/lustre/lustre/llite/lcommon_cl.c b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
index 30f17ea..20a3c74 100644
--- a/drivers/staging/lustre/lustre/llite/lcommon_cl.c
+++ b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
@@ -267,7 +267,7 @@ void cl_inode_fini(struct inode *inode)
 /**
  * build inode number from passed @fid
  */
-__u64 cl_fid_build_ino(const struct lu_fid *fid, int api32)
+u64 cl_fid_build_ino(const struct lu_fid *fid, bool api32)
 {
 	if (BITS_PER_LONG == 32 || api32)
 		return fid_flatten32(fid);
diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index dcb2fed..796a8ae 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -651,13 +651,25 @@ static inline struct inode *ll_info2i(struct ll_inode_info *lli)
 __u32 ll_i2suppgid(struct inode *i);
 void ll_i2gids(__u32 *suppgids, struct inode *i1, struct inode *i2);
 
-static inline int ll_need_32bit_api(struct ll_sb_info *sbi)
+static inline bool ll_need_32bit_api(struct ll_sb_info *sbi)
 {
 #if BITS_PER_LONG == 32
-	return 1;
+	return true;
 #elif defined(CONFIG_COMPAT)
-	return unlikely(in_compat_syscall() ||
-			(sbi->ll_flags & LL_SBI_32BIT_API));
+	if (unlikely(sbi->ll_flags & LL_SBI_32BIT_API))
+		return true;
+
+#ifdef CONFIG_X86_X32
+	/* in_compat_syscall() returns true when called from a kthread
+	 * and CONFIG_X86_X32 is enabled, which is wrong. So check
+	 * whether the caller comes from a syscall (ie. not a kthread)
+	 * before calling in_compat_syscall().
+	 */
+	if (current->flags & PF_KTHREAD)
+		return false;
+#endif
+
+	return unlikely(in_compat_syscall());
 #else
 	return unlikely(sbi->ll_flags & LL_SBI_32BIT_API);
 #endif
@@ -1353,7 +1365,7 @@ int cl_setattr_ost(struct cl_object *obj, const struct iattr *attr,
 int cl_file_inode_init(struct inode *inode, struct lustre_md *md);
 void cl_inode_fini(struct inode *inode);
 
-__u64 cl_fid_build_ino(const struct lu_fid *fid, int api32);
+u64 cl_fid_build_ino(const struct lu_fid *fid, bool api32);
 __u32 cl_fid_build_gen(const struct lu_fid *fid);
 
 #endif /* LLITE_INTERNAL_H */
-- 
1.8.3.1

* [lustre-devel] [PATCH 16/28] lustre: statahead: missing barrier before wake_up
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (14 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 15/28] lustre: llite: fix for stat under kthread and X86_X32 James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-18  2:00   ` NeilBrown
  2018-10-14 18:58 ` [lustre-devel] [PATCH 17/28] lustre: ldlm: Make lru clear always discard read lock pages James Simmons
                   ` (12 subsequent siblings)
  28 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

A barrier is missing before wake_up() in ll_statahead_interpret(),
which may cause 'ls' to hang. Under the right conditions, even a basic
'ls' can fail. The debug logs show:

statahead.c:683:ll_statahead_interpret()) sa_entry software rc -13
statahead.c:1666:ll_statahead()) revalidate statahead software: -11.

The statahead failure evidently did not notify the 'ls' process in
time. The mi_cbdata can be stale, so add a barrier before calling
wake_up().

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
Signed-off-by: Bob Glossman <bob.glossman@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9210
Reviewed-on: https://review.whamcloud.com/27330
Reviewed-by: Nathaniel Clark <nclark@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/llite/statahead.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lustre/llite/statahead.c b/drivers/staging/lustre/lustre/llite/statahead.c
index 1ad308c..0174a4c 100644
--- a/drivers/staging/lustre/lustre/llite/statahead.c
+++ b/drivers/staging/lustre/lustre/llite/statahead.c
@@ -680,8 +680,14 @@ static int ll_statahead_interpret(struct ptlrpc_request *req,
 
 	spin_lock(&lli->lli_sa_lock);
 	if (rc) {
-		if (__sa_make_ready(sai, entry, rc))
+		if (__sa_make_ready(sai, entry, rc)) {
+			/* LU-9210 : Under the right conditions even 'ls'
+			 * can cause the statahead to fail. Using a memory
+			 * barrier resolves this issue.
+			 */
+			smp_mb();
 			wake_up(&sai->sai_waitq);
+		}
 	} else {
 		int first = 0;
 		entry->se_minfo = minfo;
-- 
1.8.3.1

* [lustre-devel] [PATCH 17/28] lustre: ldlm: Make lru clear always discard read lock pages
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (15 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 16/28] lustre: statahead: missing barrier before wake_up James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-14 18:58 ` [lustre-devel] [PATCH 18/28] lustre: mdc: expose changelog through char devices James Simmons
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

From: Patrick Farrell <paf@cray.com>

A significant amount of time is sometimes spent during
lru clearing (i.e., echo 'clear' > lru_size) checking
pages to see if they are covered by another read lock.
Since all unused read locks will be destroyed by this
operation, the pages will be freed momentarily anyway,
and this is a waste of time.

This patch sets the LDLM_FL_DISCARD_DATA flag on all the PR
locks which are slated for cancellation by
ldlm_prepare_lru_list when it is called from
ldlm_ns_drop_cache.
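
The decision can be modeled as follows. The types and flag values are
hypothetical stand-ins for the LDLM ones; only the shape of the checks
follows the patch. The CLEANUP flag marks granted PR extent locks with
the discard flag, and a true discard argument then selects discard_cb
over check_and_discard_cb (how osc_dlm_blocking_ast0() derives that
argument from the lock is assumed here, not shown).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for LDLM resource types, modes and LRU flags. */
enum sketch_res_type { SKETCH_RES_PLAIN, SKETCH_RES_EXTENT };
enum sketch_lock_mode { SKETCH_LCK_PR, SKETCH_LCK_PW };
enum sketch_page_cb { SKETCH_CHECK_AND_DISCARD_CB, SKETCH_DISCARD_CB };
#define SKETCH_LRU_FLAG_CLEANUP (1u << 5)

struct sketch_lru_lock {
	enum sketch_res_type res_type;
	enum sketch_lock_mode granted_mode;
	bool discard_data;	/* stands in for LDLM_FL_DISCARD_DATA */
};

/* Same test as the hunk added to ldlm_prepare_lru_list(). */
static void sketch_prepare_for_cancel(struct sketch_lru_lock *lock,
				      unsigned int lru_flags)
{
	if ((lru_flags & SKETCH_LRU_FLAG_CLEANUP) &&
	    lock->res_type == SKETCH_RES_EXTENT &&
	    lock->granted_mode == SKETCH_LCK_PR)
		lock->discard_data = true;
}

/* Same choice as "cb = discard ? discard_cb : check_and_discard_cb". */
static enum sketch_page_cb sketch_pick_page_cb(bool discard)
{
	return discard ? SKETCH_DISCARD_CB : SKETCH_CHECK_AND_DISCARD_CB;
}
```

Without the CLEANUP flag a PR lock keeps the slower per-page check; a
PW lock is never marked by this path.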

The case where another lock covers those pages (and is in
use and so does not get cancelled by lru clear) is safe for
a few reasons:

1. When discarding pages, we wait (discard_cb->cl_page_own)
   until they are in the cached state before invalidating.
   So if they are actively in use, we'll wait until that use
   is done.

2. Removal of pages under a read lock is something that can
   happen due to memory pressure, since these are VFS cache
   pages. If a client reads something which is then removed
   from the cache and goes to read it again, this will simply
   generate a new read request.

This has a performance cost for that reader, but if anyone
is clearing the ldlm lru while actively doing I/O in that
namespace, then they cannot expect good performance.

In the case of many read locks on a single resource, this
improves cleanup time dramatically.  In internal testing at
Cray with ~80,000 read locks on a single file, this improves
cleanup time from ~60 seconds to ~0.5 seconds.  This also
slightly improves cleanup speed in the case of 1 or a few
read locks on a file.

Signed-off-by: Patrick Farrell <paf@cray.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8276
Reviewed-on: https://review.whamcloud.com/20785
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/include/lustre_dlm_flags.h |  2 +-
 drivers/staging/lustre/lustre/ldlm/ldlm_internal.h       |  6 ++++++
 drivers/staging/lustre/lustre/ldlm/ldlm_request.c        |  9 +++++++++
 drivers/staging/lustre/lustre/ldlm/ldlm_resource.c       |  6 ++++--
 drivers/staging/lustre/lustre/osc/osc_cache.c            |  4 ++--
 drivers/staging/lustre/lustre/osc/osc_cl_internal.h      |  2 +-
 drivers/staging/lustre/lustre/osc/osc_lock.c             | 10 +++++-----
 drivers/staging/lustre/lustre/osc/osc_object.c           |  2 +-
 8 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lustre_dlm_flags.h b/drivers/staging/lustre/lustre/include/lustre_dlm_flags.h
index 53db031..487ea17 100644
--- a/drivers/staging/lustre/lustre/include/lustre_dlm_flags.h
+++ b/drivers/staging/lustre/lustre/include/lustre_dlm_flags.h
@@ -95,7 +95,7 @@
 #define ldlm_set_flock_deadlock(_l)     LDLM_SET_FLAG((_l), 1ULL << 15)
 #define ldlm_clear_flock_deadlock(_l)   LDLM_CLEAR_FLAG((_l), 1ULL << 15)
 
-/** discard (no writeback) on cancel */
+/** discard (no writeback) (PW locks) or page retention (PR locks)) on cancel */
 #define LDLM_FL_DISCARD_DATA            0x0000000000010000ULL /* bit 16 */
 #define ldlm_is_discard_data(_l)        LDLM_TEST_FLAG((_l), 1ULL << 16)
 #define ldlm_set_discard_data(_l)       LDLM_SET_FLAG((_l), 1ULL << 16)
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h b/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
index 46b2b64..b64e2be0 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
@@ -96,6 +96,12 @@ enum {
 	LDLM_LRU_FLAG_NO_WAIT	= BIT(4), /* Cancel locks w/o blocking (neither
 					   * sending nor waiting for any rpcs)
 					   */
+	LDLM_LRU_FLAG_CLEANUP	= BIT(5), /* Used when clearing lru, tells
+					   * prepare_lru_list to set discard
+					   * flag on PR extent locks so we
+					   * don't waste time saving pages
+					   * that will be discarded momentarily
+					   */
 };
 
 int ldlm_cancel_lru(struct ldlm_namespace *ns, int nr,
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
index a208c99..ab089e8 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
@@ -1360,6 +1360,10 @@ typedef enum ldlm_policy_res (*ldlm_cancel_lru_policy_t)(
  *				   (typically before replaying locks) w/o
  *				   sending any RPCs or waiting for any
  *				   outstanding RPC to complete.
+ *
+ * flags & LDLM_CANCEL_CLEANUP - when cancelling read locks, do not check for
+ *				 other read locks covering the same pages, just
+ *				 discard those pages.
  */
 static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 				 struct list_head *cancels, int count, int max,
@@ -1487,6 +1491,11 @@ static int ldlm_prepare_lru_list(struct ldlm_namespace *ns,
 		 */
 		lock->l_flags |= LDLM_FL_CBPENDING | LDLM_FL_CANCELING;
 
+		if ((flags & LDLM_LRU_FLAG_CLEANUP) &&
+		    lock->l_resource->lr_type == LDLM_EXTENT &&
+		    lock->l_granted_mode == LCK_PR)
+			ldlm_set_discard_data(lock);
+
 		/* We can't re-add to l_lru as it confuses the
 		 * refcounting in ldlm_lock_remove_from_lru() if an AST
 		 * arrives after we drop lr_lock below. We use l_bl_ast
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
index bd5622d..5028db7 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_resource.c
@@ -197,7 +197,8 @@ static ssize_t lru_size_store(struct kobject *kobj, struct attribute *attr,
 
 			/* Try to cancel all @ns_nr_unused locks. */
 			canceled = ldlm_cancel_lru(ns, unused, 0,
-						   LDLM_LRU_FLAG_PASSED);
+						   LDLM_LRU_FLAG_PASSED |
+						   LDLM_LRU_FLAG_CLEANUP);
 			if (canceled < unused) {
 				CDEBUG(D_DLMTRACE,
 				       "not all requested locks are canceled, requested: %d, canceled: %d\n",
@@ -208,7 +209,8 @@ static ssize_t lru_size_store(struct kobject *kobj, struct attribute *attr,
 		} else {
 			tmp = ns->ns_max_unused;
 			ns->ns_max_unused = 0;
-			ldlm_cancel_lru(ns, 0, 0, LDLM_LRU_FLAG_PASSED);
+			ldlm_cancel_lru(ns, 0, 0, LDLM_LRU_FLAG_PASSED |
+					LDLM_LRU_FLAG_CLEANUP);
 			ns->ns_max_unused = tmp;
 		}
 		return count;
diff --git a/drivers/staging/lustre/lustre/osc/osc_cache.c b/drivers/staging/lustre/lustre/osc/osc_cache.c
index 92d292d..5d09a4f 100644
--- a/drivers/staging/lustre/lustre/osc/osc_cache.c
+++ b/drivers/staging/lustre/lustre/osc/osc_cache.c
@@ -3339,7 +3339,7 @@ static int discard_cb(const struct lu_env *env, struct cl_io *io,
  * behind this being that lock cancellation cannot be delayed indefinitely).
  */
 int osc_lock_discard_pages(const struct lu_env *env, struct osc_object *osc,
-			   pgoff_t start, pgoff_t end, enum cl_lock_mode mode)
+			   pgoff_t start, pgoff_t end, bool discard)
 {
 	struct osc_thread_info *info = osc_env_info(env);
 	struct cl_io *io = &info->oti_io;
@@ -3353,7 +3353,7 @@ int osc_lock_discard_pages(const struct lu_env *env, struct osc_object *osc,
 	if (result != 0)
 		goto out;
 
-	cb = mode == CLM_READ ? check_and_discard_cb : discard_cb;
+	cb = discard ? discard_cb : check_and_discard_cb;
 	info->oti_fn_index = start;
 	info->oti_next_index = start;
 	do {
diff --git a/drivers/staging/lustre/lustre/osc/osc_cl_internal.h b/drivers/staging/lustre/lustre/osc/osc_cl_internal.h
index da04c2c..4b01809 100644
--- a/drivers/staging/lustre/lustre/osc/osc_cl_internal.h
+++ b/drivers/staging/lustre/lustre/osc/osc_cl_internal.h
@@ -670,7 +670,7 @@ int osc_extent_finish(const struct lu_env *env, struct osc_extent *ext,
 void osc_extent_release(const struct lu_env *env, struct osc_extent *ext);
 
 int osc_lock_discard_pages(const struct lu_env *env, struct osc_object *osc,
-			   pgoff_t start, pgoff_t end, enum cl_lock_mode mode);
+			   pgoff_t start, pgoff_t end, bool discard_pages);
 
 typedef int (*osc_page_gang_cbt)(const struct lu_env *, struct cl_io *,
 				 struct osc_page *, void *);
diff --git a/drivers/staging/lustre/lustre/osc/osc_lock.c b/drivers/staging/lustre/lustre/osc/osc_lock.c
index 6059dba..4cc813d 100644
--- a/drivers/staging/lustre/lustre/osc/osc_lock.c
+++ b/drivers/staging/lustre/lustre/osc/osc_lock.c
@@ -380,7 +380,7 @@ static int osc_lock_upcall_agl(void *cookie, struct lustre_handle *lockh,
 }
 
 static int osc_lock_flush(struct osc_object *obj, pgoff_t start, pgoff_t end,
-			  enum cl_lock_mode mode, int discard)
+			  enum cl_lock_mode mode, bool discard)
 {
 	struct lu_env *env;
 	u16 refcheck;
@@ -401,7 +401,7 @@ static int osc_lock_flush(struct osc_object *obj, pgoff_t start, pgoff_t end,
 			rc = 0;
 	}
 
-	rc2 = osc_lock_discard_pages(env, obj, start, end, mode);
+	rc2 = osc_lock_discard_pages(env, obj, start, end, discard);
 	if (rc == 0 && rc2 < 0)
 		rc = rc2;
 
@@ -417,10 +417,10 @@ static int osc_dlm_blocking_ast0(const struct lu_env *env,
 				 struct ldlm_lock *dlmlock,
 				 void *data, int flag)
 {
+	enum cl_lock_mode mode = CLM_READ;
 	struct cl_object *obj = NULL;
 	int result = 0;
-	int discard;
-	enum cl_lock_mode mode = CLM_READ;
+	bool discard;
 
 	LASSERT(flag == LDLM_CB_CANCELING);
 
@@ -1098,7 +1098,7 @@ static void osc_lock_lockless_cancel(const struct lu_env *env,
 
 	LASSERT(!ols->ols_dlmlock);
 	result = osc_lock_flush(osc, descr->cld_start, descr->cld_end,
-				descr->cld_mode, 0);
+				descr->cld_mode, false);
 	if (result)
 		CERROR("Pages for lockless lock %p were not purged(%d)\n",
 		       ols, result);
diff --git a/drivers/staging/lustre/lustre/osc/osc_object.c b/drivers/staging/lustre/lustre/osc/osc_object.c
index a86d4c2..e9ecb77 100644
--- a/drivers/staging/lustre/lustre/osc/osc_object.c
+++ b/drivers/staging/lustre/lustre/osc/osc_object.c
@@ -462,7 +462,7 @@ int osc_object_invalidate(const struct lu_env *env, struct osc_object *osc)
 	osc_cache_truncate_start(env, osc, 0, NULL);
 
 	/* Discard all caching pages */
-	osc_lock_discard_pages(env, osc, 0, CL_PAGE_EOF, CLM_WRITE);
+	osc_lock_discard_pages(env, osc, 0, CL_PAGE_EOF, true);
 
 	/* Clear ast data of dlm lock. Do this after discarding all pages */
 	osc_object_prune(env, osc2cl(osc));
-- 
1.8.3.1


* [lustre-devel] [PATCH 18/28] lustre: mdc: expose changelog through char devices
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (16 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 17/28] lustre: ldlm: Make lru clear always discard read lock pages James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-30  6:41   ` NeilBrown
  2018-10-14 18:58 ` [lustre-devel] [PATCH 19/28] lustre: uapi: add missing headers in lustre UAPI headers James Simmons
                   ` (10 subsequent siblings)
  28 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC
  To: lustre-devel

From: Henri Doreau <henri.doreau@cea.fr>

Register one character device per MDT so that consumers that do not use
llapi can read changelog records, and so that delivery is more efficient.

- open() spawns a thread to prefetch records and enqueue them into a
  local buffer (unless the device is opened in write-only mode).
- lseek() can be used to jump to a specific record, in which case the
  offset is a record number (with SEEK_SET) or a number of records to
  skip (SEEK_CUR). Movement can only be done forward.
- read() copies whole records to userland. Records are never truncated,
  so short reads are likely.
- write() is used to transmit control commands to the device.
  The only available one is changelog_clear, which is done by writing
  "clear:cl<user>:<recno>" into the device.
- close() terminates the prefetch thread if any, and releases resources.
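The write() control path above can be driven from userspace with a small helper. The sketch below makes some assumptions. The device is assumed to appear as /dev/changelog-<fsname>-MDTxxxx, the name built by get_chlg_name(); for example /dev/changelog-testfs-MDT0000, where "testfs" is a placeholder. A changelog user such as cl1 is assumed to be registered already. The helper names chlg_format_clear() and chlg_clear_upto() are hypothetical. Only the "clear:cl<user>:<recno>" format and the 64-byte command limit come from this patch.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Mirrors CHLG_CONTROL_CMD_MAX in mdc_changelog.c; the driver rejects
 * longer writes with -EINVAL and NUL-terminates at byte 63, so keep
 * commands strictly shorter than this. */
#define CHLG_CMD_MAX 64

/* Hypothetical helper: build "clear:cl<user>:<recno>" into buf.
 * Returns the command length, or -1 if it would not fit. */
static int chlg_format_clear(char *buf, size_t len, unsigned int user,
			     unsigned long long recno)
{
	int n = snprintf(buf, len, "clear:cl%u:%llu", user, recno);

	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}

/* Hypothetical helper: ask the changelog device open on fd to clear
 * records up to recno for changelog user cl<user>.
 * Returns 0 on success, a negative errno value on failure. */
static int chlg_clear_upto(int fd, unsigned int user, unsigned long long recno)
{
	char cmd[CHLG_CMD_MAX];
	int n = chlg_format_clear(cmd, sizeof(cmd), user, recno);
	ssize_t w;

	if (n < 0)
		return -EINVAL;
	/* The trailing NUL is not sent; the driver zero-fills its buffer. */
	w = write(fd, cmd, n);
	if (w < 0)
		return -errno;
	return w == n ? 0 : -EIO;
}
```

A typical reader opens the device O_RDONLY and can optionally skip ahead with lseek(fd, recno, SEEK_SET). It then consumes records with read(). Because write() needs a writable descriptor, chlg_clear_upto() would be called on one opened O_WRONLY (which starts no prefetch thread) or O_RDWR, passing the last record number the reader has processed.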

Userspace can poll() the device to be notified when new records are
available to read.
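An event loop built on poll() could use a helper like the hypothetical chlg_wait_records() below, which is only a sketch. It treats POLLIN as "records are queued" and treats POLLERR or POLLHUP without data as "nothing more for now". In this version, chlg_load() sets crs_err whenever llog_cat_process() returns, even on success. A reader should therefore expect POLLERR once the prefetch thread has walked the whole catalog.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper: wait up to timeout_ms for the changelog device
 * on fd to become readable.
 * Returns 1 if data is ready (POLLIN), 0 on timeout, -1 if only
 * POLLERR/POLLHUP was reported, or a negative errno if poll() failed. */
static int chlg_wait_records(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc;

	/* Restart if a signal interrupts the wait. */
	do {
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);

	if (rc < 0)
		return -errno;
	if (rc == 0)
		return 0;
	if (pfd.revents & POLLIN)
		return 1;
	/* POLLERR and POLLHUP are always reported, even if not requested. */
	return -1;
}
```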

Signed-off-by: Henri Doreau <henri.doreau@cea.fr>
WC-bug-id: https://jira.whamcloud.com/browse/LU-7659
Reviewed-on: https://review.whamcloud.com/18900
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../include/uapi/linux/lustre/lustre_ioctl.h       |   2 +-
 .../include/uapi/linux/lustre/lustre_kernelcomm.h  |   3 -
 .../lustre/include/uapi/linux/lustre/lustre_user.h |   7 -
 drivers/staging/lustre/lustre/include/obd.h        |   2 +
 drivers/staging/lustre/lustre/ldlm/ldlm_lib.c      |   2 +
 drivers/staging/lustre/lustre/llite/dir.c          |   8 -
 drivers/staging/lustre/lustre/lmv/lmv_obd.c        |  13 -
 drivers/staging/lustre/lustre/mdc/Makefile         |   2 +-
 drivers/staging/lustre/lustre/mdc/mdc_changelog.c  | 722 +++++++++++++++++++++
 drivers/staging/lustre/lustre/mdc/mdc_internal.h   |   4 +
 drivers/staging/lustre/lustre/mdc/mdc_request.c    | 198 +-----
 11 files changed, 745 insertions(+), 218 deletions(-)
 create mode 100644 drivers/staging/lustre/lustre/mdc/mdc_changelog.c

diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
index 6e4e109..098b6451 100644
--- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
+++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
@@ -172,7 +172,7 @@ static inline __u32 obd_ioctl_packlen(struct obd_ioctl_data *data)
 #define OBD_GET_VERSION		_IOWR('f', 144, OBD_IOC_DATA_TYPE)
 /*	OBD_IOC_GSS_SUPPORT	_IOWR('f', 145, OBD_IOC_DATA_TYPE) */
 /*	OBD_IOC_CLOSE_UUID	_IOWR('f', 147, OBD_IOC_DATA_TYPE) */
-#define OBD_IOC_CHANGELOG_SEND	_IOW('f', 148, OBD_IOC_DATA_TYPE)
+/*	OBD_IOC_CHANGELOG_SEND	_IOW('f', 148, OBD_IOC_DATA_TYPE) */
 #define OBD_IOC_GETDEVICE	_IOWR('f', 149, OBD_IOC_DATA_TYPE)
 #define OBD_IOC_FID2PATH	_IOWR('f', 150, OBD_IOC_DATA_TYPE)
 /*	lustre/lustre_user.h	151-153 */
diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h
index 94dadbe..d84a8fc 100644
--- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h
+++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h
@@ -54,15 +54,12 @@ struct kuc_hdr {
 	__u16 kuc_msglen;
 } __aligned(sizeof(__u64));
 
-#define KUC_CHANGELOG_MSG_MAXSIZE (sizeof(struct kuc_hdr) + CR_MAXSIZE)
-
 #define KUC_MAGIC		0x191C /*Lustre9etLinC */
 
 /* kuc_msgtype values are defined in each transport */
 enum kuc_transport_type {
 	KUC_TRANSPORT_GENERIC	= 1,
 	KUC_TRANSPORT_HSM	= 2,
-	KUC_TRANSPORT_CHANGELOG	= 3,
 };
 
 enum kuc_generic_message_type {
diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
index b8525e5..715f1c5 100644
--- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
+++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
@@ -967,13 +967,6 @@ static inline void changelog_remap_rec(struct changelog_rec *rec,
 	rec->cr_flags = (rec->cr_flags & CLF_FLAGMASK) | crf_wanted;
 }
 
-struct ioc_changelog {
-	__u64 icc_recno;
-	__u32 icc_mdtindex;
-	__u32 icc_id;
-	__u32 icc_flags;
-};
-
 enum changelog_message_type {
 	CL_RECORD = 10, /* message is a changelog_rec */
 	CL_EOF    = 11, /* at end of current changelog */
diff --git a/drivers/staging/lustre/lustre/include/obd.h b/drivers/staging/lustre/lustre/include/obd.h
index 11e7ae8..76ae0b3 100644
--- a/drivers/staging/lustre/lustre/include/obd.h
+++ b/drivers/staging/lustre/lustre/include/obd.h
@@ -345,6 +345,8 @@ struct client_obd {
 	void			*cl_lru_work;
 	/* hash tables for osc_quota_info */
 	struct rhashtable	cl_quota_hash[MAXQUOTAS];
+	/* Links to the global list of registered changelog devices */
+	struct list_head	cl_chg_dev_linkage;
 };
 
 #define obd2cli_tgt(obd) ((char *)(obd)->u.cli.cl_target_uuid.uuid)
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c b/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c
index 32eda4f..732ef3a 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c
@@ -395,6 +395,8 @@ int client_obd_setup(struct obd_device *obddev, struct lustre_cfg *lcfg)
 	init_waitqueue_head(&cli->cl_mod_rpcs_waitq);
 	cli->cl_mod_tag_bitmap = NULL;
 
+	INIT_LIST_HEAD(&cli->cl_chg_dev_linkage);
+
 	if (connect_op == MDS_CONNECT) {
 		cli->cl_max_mod_rpcs_in_flight = cli->cl_max_rpcs_in_flight - 1;
 		cli->cl_mod_tag_bitmap = kcalloc(BITS_TO_LONGS(OBD_MAX_RIF_MAX),
diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
index 19c5e9c..36cea8d 100644
--- a/drivers/staging/lustre/lustre/llite/dir.c
+++ b/drivers/staging/lustre/lustre/llite/dir.c
@@ -1481,14 +1481,6 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 		return obd_iocontrol(cmd, sbi->ll_md_exp, 0, NULL,
 				     (void __user *)arg);
 	}
-	case OBD_IOC_CHANGELOG_SEND:
-	case OBD_IOC_CHANGELOG_CLEAR:
-		if (!capable(CAP_SYS_ADMIN))
-			return -EPERM;
-
-		rc = copy_and_ioctl(cmd, sbi->ll_md_exp, (void __user *)arg,
-				    sizeof(struct ioc_changelog));
-		return rc;
 	case OBD_IOC_FID2PATH:
 		return ll_fid2path(inode, (void __user *)arg);
 	case LL_IOC_GETPARENT:
diff --git a/drivers/staging/lustre/lustre/lmv/lmv_obd.c b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
index 952c68e..32bb9fc 100644
--- a/drivers/staging/lustre/lustre/lmv/lmv_obd.c
+++ b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
@@ -951,19 +951,6 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
 		kfree(oqctl);
 		break;
 	}
-	case OBD_IOC_CHANGELOG_SEND:
-	case OBD_IOC_CHANGELOG_CLEAR: {
-		struct ioc_changelog *icc = karg;
-
-		if (icc->icc_mdtindex >= count)
-			return -ENODEV;
-
-		tgt = lmv->tgts[icc->icc_mdtindex];
-		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
-			return -ENODEV;
-		rc = obd_iocontrol(cmd, tgt->ltd_exp, sizeof(*icc), icc, NULL);
-		break;
-	}
 	case LL_IOC_GET_CONNECT_FLAGS: {
 		tgt = lmv->tgts[0];
 
diff --git a/drivers/staging/lustre/lustre/mdc/Makefile b/drivers/staging/lustre/lustre/mdc/Makefile
index 64cf49e..5f48e91 100644
--- a/drivers/staging/lustre/lustre/mdc/Makefile
+++ b/drivers/staging/lustre/lustre/mdc/Makefile
@@ -2,4 +2,4 @@ ccflags-y += -I$(srctree)/drivers/staging/lustre/include
 ccflags-y += -I$(srctree)/drivers/staging/lustre/lustre/include
 
 obj-$(CONFIG_LUSTRE_FS) += mdc.o
-mdc-y := mdc_request.o mdc_reint.o mdc_lib.o mdc_locks.o lproc_mdc.o
+mdc-y := mdc_changelog.o mdc_request.o mdc_reint.o mdc_lib.o mdc_locks.o lproc_mdc.o
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_changelog.c b/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
new file mode 100644
index 0000000..a5f3c64
--- /dev/null
+++ b/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
@@ -0,0 +1,722 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * GPL HEADER START
+ *
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 only,
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License version 2 for more details (a copy is included
+ * in the LICENSE file that accompanied this code).
+ *
+ * You should have received a copy of the GNU General Public License
+ * version 2 along with this program; If not, see
+ * http://www.gnu.org/licenses/gpl-2.0.html
+ *
+ * GPL HEADER END
+ */
+/*
+ * Copyright (c) 2017, Commissariat a l'Energie Atomique et aux Energies
+ *                     Alternatives.
+ *
+ * Author: Henri Doreau <henri.doreau@cea.fr>
+ */
+
+#define DEBUG_SUBSYSTEM S_MDC
+
+#include <linux/init.h>
+#include <linux/kthread.h>
+#include <linux/poll.h>
+#include <linux/miscdevice.h>
+
+#include <lustre_log.h>
+
+#include "mdc_internal.h"
+
+/*
+ * -- Changelog delivery through character device --
+ */
+
+/**
+ * Mutex to protect chlg_registered_devices below
+ */
+static DEFINE_MUTEX(chlg_registered_dev_lock);
+
+/**
+ * Global linked list of all registered devices (one per MDT).
+ */
+static LIST_HEAD(chlg_registered_devices);
+
+struct chlg_registered_dev {
+	/* Device name of the form "changelog-{MDTNAME}" */
+	char			ced_name[32];
+	/* Misc device descriptor */
+	struct miscdevice	ced_misc;
+	/* OBDs referencing this device (multiple mount points) */
+	struct list_head	ced_obds;
+	/* Reference counter for proper deregistration */
+	struct kref		ced_refs;
+	/* Link within the global chlg_registered_devices */
+	struct list_head	ced_link;
+};
+
+struct chlg_reader_state {
+	/* Shortcut to the corresponding OBD device */
+	struct obd_device	*crs_obd;
+	/* An error occurred that prevents further reading */
+	bool			 crs_err;
+	/* EOF, no more records available */
+	bool			 crs_eof;
+	/* Userland reader closed connection */
+	bool			 crs_closed;
+	/* Desired start position */
+	u64			 crs_start_offset;
+	/* Wait queue for the catalog processing thread */
+	wait_queue_head_t	 crs_waitq_prod;
+	/* Wait queue for the record copy threads */
+	wait_queue_head_t	 crs_waitq_cons;
+	/* Mutex protecting crs_rec_count and crs_rec_queue */
+	struct mutex		 crs_lock;
+	/* Number of items in the list */
+	u64			 crs_rec_count;
+	/* List of prefetched chlg_rec_entry::enq_linkage items */
+	struct list_head	 crs_rec_queue;
+};
+
+struct chlg_rec_entry {
+	/* Link within the chlg_reader_state::crs_rec_queue list */
+	struct list_head	enq_linkage;
+	/* Data (enq_record) field length */
+	u64			enq_length;
+	/* Copy of a changelog record (see struct llog_changelog_rec) */
+	struct changelog_rec	enq_record[];
+};
+
+enum {
+	/* Number of records to prefetch locally. */
+	CDEV_CHLG_MAX_PREFETCH = 1024,
+};
+
+/**
+ * ChangeLog catalog processing callback invoked on each record.
+ * If the current record is eligible to userland delivery, push
+ * it into the crs_rec_queue where the consumer code will fetch it.
+ *
+ * @param[in]     env  (unused)
+ * @param[in]     llh  Client-side handle used to identify the llog
+ * @param[in]     hdr  Header of the current llog record
+ * @param[in,out] data chlg_reader_state passed from caller
+ *
+ * @return 0 or LLOG_PROC_* control code on success, negated error on failure.
+ */
+static int chlg_read_cat_process_cb(const struct lu_env *env,
+				    struct llog_handle *llh,
+				    struct llog_rec_hdr *hdr, void *data)
+{
+	struct llog_changelog_rec *rec;
+	struct chlg_reader_state *crs = data;
+	struct chlg_rec_entry *enq;
+	size_t len;
+	int rc;
+
+	LASSERT(crs);
+	LASSERT(hdr);
+
+	rec = container_of(hdr, struct llog_changelog_rec, cr_hdr);
+
+	if (rec->cr_hdr.lrh_type != CHANGELOG_REC) {
+		rc = -EINVAL;
+		CERROR("%s: not a changelog rec %x/%d in llog : rc = %d\n",
+		       crs->crs_obd->obd_name, rec->cr_hdr.lrh_type,
+		       rec->cr.cr_type, rc);
+		return rc;
+	}
+
+	/* Skip undesired records */
+	if (rec->cr.cr_index < crs->crs_start_offset)
+		return 0;
+
+	CDEBUG(D_HSM, "%llu %02d%-5s %llu 0x%x t=" DFID " p=" DFID " %.*s\n",
+	       rec->cr.cr_index, rec->cr.cr_type,
+	       changelog_type2str(rec->cr.cr_type), rec->cr.cr_time,
+	       rec->cr.cr_flags & CLF_FLAGMASK,
+	       PFID(&rec->cr.cr_tfid), PFID(&rec->cr.cr_pfid),
+	       rec->cr.cr_namelen, changelog_rec_name(&rec->cr));
+
+	wait_event_idle(crs->crs_waitq_prod,
+			(crs->crs_rec_count < CDEV_CHLG_MAX_PREFETCH ||
+			 crs->crs_closed));
+
+	if (crs->crs_closed)
+		return LLOG_PROC_BREAK;
+
+	len = changelog_rec_size(&rec->cr) + rec->cr.cr_namelen;
+	enq = kzalloc(sizeof(*enq) + len, GFP_KERNEL);
+	if (!enq)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&enq->enq_linkage);
+	enq->enq_length = len;
+	memcpy(enq->enq_record, &rec->cr, len);
+
+	mutex_lock(&crs->crs_lock);
+	list_add_tail(&enq->enq_linkage, &crs->crs_rec_queue);
+	crs->crs_rec_count++;
+	mutex_unlock(&crs->crs_lock);
+
+	wake_up_all(&crs->crs_waitq_cons);
+
+	return 0;
+}
+
+/**
+ * Remove record from the list it is attached to and free it.
+ */
+static void enq_record_delete(struct chlg_rec_entry *rec)
+{
+	list_del(&rec->enq_linkage);
+	kfree(rec);
+}
+
+/**
+ * Release resources associated with a chlg_reader_state instance.
+ *
+ * @param  crs  CRS instance to release.
+ */
+static void crs_free(struct chlg_reader_state *crs)
+{
+	struct chlg_rec_entry *rec;
+	struct chlg_rec_entry *tmp;
+
+	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage)
+		enq_record_delete(rec);
+
+	kfree(crs);
+}
+
+/**
+ * Record prefetch thread entry point. Opens the changelog catalog and starts
+ * reading records.
+ *
+ * @param[in,out]  args  chlg_reader_state passed from caller.
+ * @return 0 on success, negated error code on failure.
+ */
+static int chlg_load(void *args)
+{
+	struct chlg_reader_state *crs = args;
+	struct obd_device *obd = crs->crs_obd;
+	struct llog_ctxt *ctx = NULL;
+	struct llog_handle *llh = NULL;
+	int rc;
+
+	ctx = llog_get_context(obd, LLOG_CHANGELOG_REPL_CTXT);
+	if (!ctx) {
+		rc = -ENOENT;
+		goto err_out;
+	}
+
+	rc = llog_open(NULL, ctx, &llh, NULL, CHANGELOG_CATALOG,
+		       LLOG_OPEN_EXISTS);
+	if (rc) {
+		CERROR("%s: fail to open changelog catalog: rc = %d\n",
+		       obd->obd_name, rc);
+		goto err_out;
+	}
+
+	rc = llog_init_handle(NULL, llh, LLOG_F_IS_CAT | LLOG_F_EXT_JOBID,
+			      NULL);
+	if (rc) {
+		CERROR("%s: fail to init llog handle: rc = %d\n",
+		       obd->obd_name, rc);
+		goto err_out;
+	}
+
+	rc = llog_cat_process(NULL, llh, chlg_read_cat_process_cb, crs, 0, 0);
+	if (rc < 0) {
+		CERROR("%s: fail to process llog: rc = %d\n",
+		       obd->obd_name, rc);
+		goto err_out;
+	}
+
+err_out:
+	crs->crs_err = true;
+	wake_up_all(&crs->crs_waitq_cons);
+
+	if (llh)
+		llog_cat_close(NULL, llh);
+
+	if (ctx)
+		llog_ctxt_put(ctx);
+
+	wait_event_idle(crs->crs_waitq_prod, crs->crs_closed);
+	crs_free(crs);
+	return rc;
+}
+
+/**
+ * Read handler, dequeues records from the chlg_reader_state if any.
+ * No partial records are copied to userland so this function can return less
+ * data than required (short read).
+ *
+ * @param[in]   file   File pointer to the character device.
+ * @param[out]  buff   Userland buffer where to copy the records.
+ * @param[in]   count  Userland buffer size.
+ * @param[out]  ppos   File position, updated with the index number of the next
+ *		       record to read.
+ * @return number of copied bytes on success, negated error code on failure.
+ */
+static ssize_t chlg_read(struct file *file, char __user *buff, size_t count,
+			 loff_t *ppos)
+{
+	struct chlg_reader_state *crs = file->private_data;
+	struct chlg_rec_entry *rec;
+	struct chlg_rec_entry *tmp;
+	ssize_t  written_total = 0;
+	LIST_HEAD(consumed);
+
+	if (file->f_flags & O_NONBLOCK && crs->crs_rec_count == 0)
+		return -EAGAIN;
+
+	wait_event_idle(crs->crs_waitq_cons,
+			crs->crs_rec_count > 0 || crs->crs_eof || crs->crs_err);
+
+	mutex_lock(&crs->crs_lock);
+	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage) {
+		if (written_total + rec->enq_length > count)
+			break;
+
+		if (copy_to_user(buff, rec->enq_record, rec->enq_length)) {
+			if (written_total == 0)
+				written_total = -EFAULT;
+			break;
+		}
+
+		buff += rec->enq_length;
+		written_total += rec->enq_length;
+
+		crs->crs_rec_count--;
+		list_move_tail(&rec->enq_linkage, &consumed);
+
+		crs->crs_start_offset = rec->enq_record->cr_index + 1;
+	}
+	mutex_unlock(&crs->crs_lock);
+
+	if (written_total > 0)
+		wake_up_all(&crs->crs_waitq_prod);
+
+	list_for_each_entry_safe(rec, tmp, &consumed, enq_linkage)
+		enq_record_delete(rec);
+
+	*ppos = crs->crs_start_offset;
+
+	return written_total;
+}
+
+/**
+ * Jump to a given record index. Helper for chlg_llseek().
+ *
+ * @param[in,out]  crs     Internal reader state.
+ * @param[in]      offset  Desired offset (index record).
+ * @return 0 on success, negated error code on failure.
+ */
+static int chlg_set_start_offset(struct chlg_reader_state *crs, u64 offset)
+{
+	struct chlg_rec_entry *rec;
+	struct chlg_rec_entry *tmp;
+
+	mutex_lock(&crs->crs_lock);
+	if (offset < crs->crs_start_offset) {
+		mutex_unlock(&crs->crs_lock);
+		return -ERANGE;
+	}
+
+	crs->crs_start_offset = offset;
+	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage) {
+		struct changelog_rec *cr = rec->enq_record;
+
+		if (cr->cr_index >= crs->crs_start_offset)
+			break;
+
+		crs->crs_rec_count--;
+		enq_record_delete(rec);
+	}
+
+	mutex_unlock(&crs->crs_lock);
+	wake_up_all(&crs->crs_waitq_prod);
+	return 0;
+}
+
+/**
+ * Move read pointer to a certain record index, encoded as an offset.
+ *
+ * @param[in,out] file   File pointer to the changelog character device
+ * @param[in]	  off    Offset to skip, actually a record index, not byte count
+ * @param[in]	  whence Relative/Absolute interpretation of the offset
+ * @return the resulting position on success or negated error code on failure.
+ */
+static loff_t chlg_llseek(struct file *file, loff_t off, int whence)
+{
+	struct chlg_reader_state *crs = file->private_data;
+	loff_t pos;
+	int rc;
+
+	switch (whence) {
+	case SEEK_SET:
+		pos = off;
+		break;
+	case SEEK_CUR:
+		pos = file->f_pos + off;
+		break;
+	case SEEK_END:
+	default:
+		return -EINVAL;
+	}
+
+	/* We cannot go backward */
+	if (pos < file->f_pos)
+		return -EINVAL;
+
+	rc = chlg_set_start_offset(crs, pos);
+	if (rc != 0)
+		return rc;
+
+	file->f_pos = pos;
+	return pos;
+}
+
+/**
+ * Clear record range for a given changelog reader.
+ *
+ * @param[in]  crs     Current internal state.
+ * @param[in]  reader  Changelog reader ID (cl1, cl2...)
+ * @param[in]  record  Record index up to which to clear
+ * @return 0 on success, negated error code on failure.
+ */
+static int chlg_clear(struct chlg_reader_state *crs, u32 reader, u64 record)
+{
+	struct obd_device *obd = crs->crs_obd;
+	struct changelog_setinfo cs  = {
+		.cs_recno = record,
+		.cs_id    = reader
+	};
+
+	return obd_set_info_async(NULL, obd->obd_self_export,
+				  strlen(KEY_CHANGELOG_CLEAR),
+				  KEY_CHANGELOG_CLEAR, sizeof(cs), &cs, NULL);
+}
+
+/** Maximum changelog control command size */
+#define CHLG_CONTROL_CMD_MAX	64
+
+/**
+ * Handle write() calls on the changelog character device. write() can be
+ * used to request special control operations.
+ *
+ * @param[in]  file  File pointer to the changelog character device
+ * @param[in]  buff  User supplied data (written data)
+ * @param[in]  count Number of written bytes
+ * @param[in]  off   (unused)
+ * @return number of written bytes on success, negated error code on failure.
+ */
+static ssize_t chlg_write(struct file *file, const char __user *buff,
+			  size_t count, loff_t *off)
+{
+	struct chlg_reader_state *crs = file->private_data;
+	char *kbuf;
+	u64 record;
+	u32 reader;
+	int rc = 0;
+
+	if (count > CHLG_CONTROL_CMD_MAX)
+		return -EINVAL;
+
+	kbuf = kzalloc(CHLG_CONTROL_CMD_MAX, GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	if (copy_from_user(kbuf, buff, count)) {
+		rc = -EFAULT;
+		goto out_kbuf;
+	}
+
+	kbuf[CHLG_CONTROL_CMD_MAX - 1] = '\0';
+
+	if (sscanf(kbuf, "clear:cl%u:%llu", &reader, &record) == 2)
+		rc = chlg_clear(crs, reader, record);
+	else
+		rc = -EINVAL;
+
+out_kbuf:
+	kfree(kbuf);
+	return rc < 0 ? rc : count;
+}
+
+/**
+ * Find the OBD device associated with a changelog character device.
+ * @param[in]  cdev  character device instance descriptor
+ * @return corresponding OBD device or NULL if none was found.
+ */
+static struct obd_device *chlg_obd_get(dev_t cdev)
+{
+	int minor = MINOR(cdev);
+	struct obd_device *obd = NULL;
+	struct chlg_registered_dev *curr;
+
+	mutex_lock(&chlg_registered_dev_lock);
+	list_for_each_entry(curr, &chlg_registered_devices, ced_link) {
+		if (curr->ced_misc.minor == minor) {
+			/* take the first available OBD device attached */
+			obd = list_first_entry(&curr->ced_obds,
+					       struct obd_device,
+					       u.cli.cl_chg_dev_linkage);
+			break;
+		}
+	}
+	mutex_unlock(&chlg_registered_dev_lock);
+	return obd;
+}
+
+/**
+ * Open handler, initialize internal CRS state and spawn prefetch thread if
+ * needed.
+ * @param[in]  inode  Inode struct for the open character device.
+ * @param[in]  file   Corresponding file pointer.
+ * @return 0 on success, negated error code on failure.
+ */
+static int chlg_open(struct inode *inode, struct file *file)
+{
+	struct chlg_reader_state *crs;
+	struct obd_device *obd = chlg_obd_get(inode->i_rdev);
+	struct task_struct *task;
+	int rc;
+
+	if (!obd)
+		return -ENODEV;
+
+	crs = kzalloc(sizeof(*crs), GFP_KERNEL);
+	if (!crs)
+		return -ENOMEM;
+
+	crs->crs_obd = obd;
+	crs->crs_err = false;
+	crs->crs_eof = false;
+	crs->crs_closed = false;
+
+	mutex_init(&crs->crs_lock);
+	INIT_LIST_HEAD(&crs->crs_rec_queue);
+	init_waitqueue_head(&crs->crs_waitq_prod);
+	init_waitqueue_head(&crs->crs_waitq_cons);
+
+	if (file->f_mode & FMODE_READ) {
+		task = kthread_run(chlg_load, crs, "chlg_load_thread");
+		if (IS_ERR(task)) {
+			rc = PTR_ERR(task);
+			CERROR("%s: cannot start changelog thread: rc = %d\n",
+			       obd->obd_name, rc);
+			goto err_crs;
+		}
+	}
+
+	file->private_data = crs;
+	return 0;
+
+err_crs:
+	kfree(crs);
+	return rc;
+}
+
+/**
+ * Close handler, release resources.
+ *
+ * @param[in]  inode  Inode struct for the open character device.
+ * @param[in]  file   Corresponding file pointer.
+ * @return 0 on success, negated error code on failure.
+ */
+static int chlg_release(struct inode *inode, struct file *file)
+{
+	struct chlg_reader_state *crs = file->private_data;
+
+	if (file->f_mode & FMODE_READ) {
+		crs->crs_closed = true;
+		wake_up_all(&crs->crs_waitq_prod);
+	} else {
+		/* No producer thread, release resource ourselves */
+		crs_free(crs);
+	}
+	return 0;
+}
+
+/**
+ * Poll handler, indicates whether the device is readable (new records) and
+ * writable (always).
+ *
+ * @param[in]  file   Device file pointer.
+ * @param[in]  wait   (opaque)
+ * @return combination of the poll status flags.
+ */
+static unsigned int chlg_poll(struct file *file, poll_table *wait)
+{
+	struct chlg_reader_state *crs  = file->private_data;
+	unsigned int mask = 0;
+
+	mutex_lock(&crs->crs_lock);
+	poll_wait(file, &crs->crs_waitq_cons, wait);
+	if (crs->crs_rec_count > 0)
+		mask |= POLLIN | POLLRDNORM;
+	if (crs->crs_err)
+		mask |= POLLERR;
+	if (crs->crs_eof)
+		mask |= POLLHUP;
+	mutex_unlock(&crs->crs_lock);
+	return mask;
+}
+
+static const struct file_operations chlg_fops = {
+	.owner		= THIS_MODULE,
+	.llseek		= chlg_llseek,
+	.read		= chlg_read,
+	.write		= chlg_write,
+	.open		= chlg_open,
+	.release	= chlg_release,
+	.poll		= chlg_poll,
+};
+
+/**
+ * This uses obd_name of the form: "testfs-MDT0000-mdc-ffff88006501600"
+ * and returns a name of the form: "changelog-testfs-MDT0000".
+ */
+static void get_chlg_name(char *name, size_t name_len, struct obd_device *obd)
+{
+	int i;
+
+	snprintf(name, name_len, "changelog-%s", obd->obd_name);
+
+	/* Find the 2nd '-' from the end and truncate on it */
+	for (i = 0; i < 2; i++) {
+		char *p = strrchr(name, '-');
+
+		if (!p)
+			return;
+		*p = '\0';
+	}
+}
+
+/**
+ * Find a changelog character device by name.
+ * All devices registered during MDC setup are listed in a global list with
+ * their names attached.
+ */
+static struct chlg_registered_dev *
+chlg_registered_dev_find_by_name(const char *name)
+{
+	struct chlg_registered_dev *dit;
+
+	list_for_each_entry(dit, &chlg_registered_devices, ced_link)
+		if (strcmp(name, dit->ced_name) == 0)
+			return dit;
+	return NULL;
+}
+
+/**
+ * Find chlg_registered_dev structure for a given OBD device.
+ * This is bad O(n^2) but for each filesystem:
+ *   - N is # of MDTs times # of mount points
+ *   - this only runs at shutdown
+ */
+static struct chlg_registered_dev *
+chlg_registered_dev_find_by_obd(const struct obd_device *obd)
+{
+	struct chlg_registered_dev *dit;
+	struct obd_device *oit;
+
+	list_for_each_entry(dit, &chlg_registered_devices, ced_link)
+		list_for_each_entry(oit, &dit->ced_obds,
+				    u.cli.cl_chg_dev_linkage)
+			if (oit == obd)
+				return dit;
+	return NULL;
+}
+
+/**
+ * Changelog character device initialization.
+ * Register a misc character device with a dynamic minor number, under a name
+ * of the form: 'changelog-fsname-MDTxxxx'. Reference this OBD device with it.
+ *
+ * @param[in] obd  This MDC obd_device.
+ * @return 0 on success, negated error code on failure.
+ */
+int mdc_changelog_cdev_init(struct obd_device *obd)
+{
+	struct chlg_registered_dev *exist;
+	struct chlg_registered_dev *entry;
+	int rc;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return -ENOMEM;
+
+	get_chlg_name(entry->ced_name, sizeof(entry->ced_name), obd);
+
+	entry->ced_misc.minor = MISC_DYNAMIC_MINOR;
+	entry->ced_misc.name  = entry->ced_name;
+	entry->ced_misc.fops  = &chlg_fops;
+
+	kref_init(&entry->ced_refs);
+	INIT_LIST_HEAD(&entry->ced_obds);
+	INIT_LIST_HEAD(&entry->ced_link);
+
+	mutex_lock(&chlg_registered_dev_lock);
+	exist = chlg_registered_dev_find_by_name(entry->ced_name);
+	if (exist) {
+		kref_get(&exist->ced_refs);
+		list_add_tail(&obd->u.cli.cl_chg_dev_linkage, &exist->ced_obds);
+		rc = 0;
+		goto out_unlock;
+	}
+
+	/* Register new character device */
+	rc = misc_register(&entry->ced_misc);
+	if (rc != 0)
+		goto out_unlock;
+
+	list_add_tail(&obd->u.cli.cl_chg_dev_linkage, &entry->ced_obds);
+	list_add_tail(&entry->ced_link, &chlg_registered_devices);
+
+	entry = NULL;	/* prevent it from being freed below */
+
+out_unlock:
+	mutex_unlock(&chlg_registered_dev_lock);
+	kfree(entry);
+	return rc;
+}
+
+/**
+ * Deregister a changelog character device whose refcount has reached zero.
+ */
+static void chlg_dev_clear(struct kref *kref)
+{
+	struct chlg_registered_dev *entry = container_of(kref,
+							 struct chlg_registered_dev,
+							 ced_refs);
+	list_del(&entry->ced_link);
+	misc_deregister(&entry->ced_misc);
+	kfree(entry);
+}
+
+/**
+ * Release OBD, decrease reference count of the corresponding changelog device.
+ */
+void mdc_changelog_cdev_finish(struct obd_device *obd)
+{
+	struct chlg_registered_dev *dev = chlg_registered_dev_find_by_obd(obd);
+
+	mutex_lock(&chlg_registered_dev_lock);
+	list_del_init(&obd->u.cli.cl_chg_dev_linkage);
+	kref_put(&dev->ced_refs, chlg_dev_clear);
+	mutex_unlock(&chlg_registered_dev_lock);
+}
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_internal.h b/drivers/staging/lustre/lustre/mdc/mdc_internal.h
index 941a896..6da9046 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_internal.h
+++ b/drivers/staging/lustre/lustre/mdc/mdc_internal.h
@@ -129,6 +129,10 @@ enum ldlm_mode mdc_lock_match(struct obd_export *exp, __u64 flags,
 			      enum ldlm_mode mode,
 			      struct lustre_handle *lockh);
 
+int mdc_changelog_cdev_init(struct obd_device *obd);
+
+void mdc_changelog_cdev_finish(struct obd_device *obd);
+
 static inline int mdc_prep_elc_req(struct obd_export *exp,
 				   struct ptlrpc_request *req, int opc,
 				   struct list_head *cancels, int count)
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index 8f8e3d2..3692b1c 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -35,7 +35,6 @@
 
 # include <linux/module.h>
 # include <linux/pagemap.h>
-# include <linux/miscdevice.h>
 # include <linux/init.h>
 # include <linux/utsname.h>
 # include <linux/file.h>
@@ -1810,174 +1809,6 @@ static int mdc_ioc_hsm_request(struct obd_export *exp,
 	return rc;
 }
 
-static struct kuc_hdr *changelog_kuc_hdr(char *buf, size_t len, u32 flags)
-{
-	struct kuc_hdr *lh = (struct kuc_hdr *)buf;
-
-	LASSERT(len <= KUC_CHANGELOG_MSG_MAXSIZE);
-
-	lh->kuc_magic = KUC_MAGIC;
-	lh->kuc_transport = KUC_TRANSPORT_CHANGELOG;
-	lh->kuc_flags = flags;
-	lh->kuc_msgtype = CL_RECORD;
-	lh->kuc_msglen = len;
-	return lh;
-}
-
-struct changelog_show {
-	__u64		cs_startrec;
-	enum changelog_send_flag	cs_flags;
-	struct file	*cs_fp;
-	char		*cs_buf;
-	struct obd_device *cs_obd;
-};
-
-static inline char *cs_obd_name(struct changelog_show *cs)
-{
-	return cs->cs_obd->obd_name;
-}
-
-static int changelog_kkuc_cb(const struct lu_env *env, struct llog_handle *llh,
-			     struct llog_rec_hdr *hdr, void *data)
-{
-	struct changelog_show *cs = data;
-	struct llog_changelog_rec *rec = (struct llog_changelog_rec *)hdr;
-	struct kuc_hdr *lh;
-	size_t len;
-	int rc;
-
-	if (rec->cr_hdr.lrh_type != CHANGELOG_REC) {
-		rc = -EINVAL;
-		CERROR("%s: not a changelog rec %x/%d: rc = %d\n",
-		       cs_obd_name(cs), rec->cr_hdr.lrh_type,
-		       rec->cr.cr_type, rc);
-		return rc;
-	}
-
-	if (rec->cr.cr_index < cs->cs_startrec) {
-		/* Skip entries earlier than what we are interested in */
-		CDEBUG(D_HSM, "rec=%llu start=%llu\n",
-		       rec->cr.cr_index, cs->cs_startrec);
-		return 0;
-	}
-
-	CDEBUG(D_HSM, "%llu %02d%-5s %llu 0x%x t=" DFID " p=" DFID
-		" %.*s\n", rec->cr.cr_index, rec->cr.cr_type,
-		changelog_type2str(rec->cr.cr_type), rec->cr.cr_time,
-		rec->cr.cr_flags & CLF_FLAGMASK,
-		PFID(&rec->cr.cr_tfid), PFID(&rec->cr.cr_pfid),
-		rec->cr.cr_namelen, changelog_rec_name(&rec->cr));
-
-	len = sizeof(*lh) + changelog_rec_size(&rec->cr) + rec->cr.cr_namelen;
-
-	/* Set up the message */
-	lh = changelog_kuc_hdr(cs->cs_buf, len, cs->cs_flags);
-	memcpy(lh + 1, &rec->cr, len - sizeof(*lh));
-
-	rc = libcfs_kkuc_msg_put(cs->cs_fp, lh);
-	CDEBUG(D_HSM, "kucmsg fp %p len %zu rc %d\n", cs->cs_fp, len, rc);
-
-	return rc;
-}
-
-static int mdc_changelog_send_thread(void *csdata)
-{
-	enum llog_flag flags = LLOG_F_IS_CAT;
-	struct changelog_show *cs = csdata;
-	struct llog_ctxt *ctxt = NULL;
-	struct llog_handle *llh = NULL;
-	struct kuc_hdr *kuch;
-	int rc;
-
-	CDEBUG(D_HSM, "changelog to fp=%p start %llu\n",
-	       cs->cs_fp, cs->cs_startrec);
-
-	cs->cs_buf = kzalloc(KUC_CHANGELOG_MSG_MAXSIZE, GFP_NOFS);
-	if (!cs->cs_buf) {
-		rc = -ENOMEM;
-		goto out;
-	}
-
-	/* Set up the remote catalog handle */
-	ctxt = llog_get_context(cs->cs_obd, LLOG_CHANGELOG_REPL_CTXT);
-	if (!ctxt) {
-		rc = -ENOENT;
-		goto out;
-	}
-	rc = llog_open(NULL, ctxt, &llh, NULL, CHANGELOG_CATALOG,
-		       LLOG_OPEN_EXISTS);
-	if (rc) {
-		CERROR("%s: fail to open changelog catalog: rc = %d\n",
-		       cs_obd_name(cs), rc);
-		goto out;
-	}
-
-	if (cs->cs_flags & CHANGELOG_FLAG_JOBID)
-		flags |= LLOG_F_EXT_JOBID;
-
-	rc = llog_init_handle(NULL, llh, flags, NULL);
-	if (rc) {
-		CERROR("llog_init_handle failed %d\n", rc);
-		goto out;
-	}
-
-	rc = llog_cat_process(NULL, llh, changelog_kkuc_cb, cs, 0, 0);
-
-	/* Send EOF no matter what our result */
-	kuch = changelog_kuc_hdr(cs->cs_buf, sizeof(*kuch), cs->cs_flags);
-	kuch->kuc_msgtype = CL_EOF;
-	libcfs_kkuc_msg_put(cs->cs_fp, kuch);
-
-out:
-	fput(cs->cs_fp);
-	if (llh)
-		llog_cat_close(NULL, llh);
-	if (ctxt)
-		llog_ctxt_put(ctxt);
-	kfree(cs->cs_buf);
-	kfree(cs);
-	return rc;
-}
-
-static int mdc_ioc_changelog_send(struct obd_device *obd,
-				  struct ioc_changelog *icc)
-{
-	struct changelog_show *cs;
-	struct task_struct *task;
-	int rc;
-
-	/* Freed in mdc_changelog_send_thread */
-	cs = kzalloc(sizeof(*cs), GFP_NOFS);
-	if (!cs)
-		return -ENOMEM;
-
-	cs->cs_obd = obd;
-	cs->cs_startrec = icc->icc_recno;
-	/* matching fput in mdc_changelog_send_thread */
-	cs->cs_fp = fget(icc->icc_id);
-	cs->cs_flags = icc->icc_flags;
-
-	/*
-	 * New thread because we should return to user app before
-	 * writing into our pipe
-	 */
-	task = kthread_run(mdc_changelog_send_thread, cs,
-			   "mdc_clg_send_thread");
-	if (IS_ERR(task)) {
-		rc = PTR_ERR(task);
-		CERROR("%s: can't start changelog thread: rc = %d\n",
-		       cs_obd_name(cs), rc);
-		kfree(cs);
-	} else {
-		rc = 0;
-		CDEBUG(D_HSM, "%s: started changelog thread\n",
-		       cs_obd_name(cs));
-	}
-
-	CERROR("Failed to start changelog thread: %d\n", rc);
-	return rc;
-}
-
 static int mdc_ioc_hsm_ct_start(struct obd_export *exp,
 				struct lustre_kernelcomm *lk);
 
@@ -2087,21 +1918,6 @@ static int mdc_iocontrol(unsigned int cmd, struct obd_export *exp, int len,
 		return -EINVAL;
 	}
 	switch (cmd) {
-	case OBD_IOC_CHANGELOG_SEND:
-		rc = mdc_ioc_changelog_send(obd, karg);
-		goto out;
-	case OBD_IOC_CHANGELOG_CLEAR: {
-		struct ioc_changelog *icc = karg;
-		struct changelog_setinfo cs = {
-			.cs_recno = icc->icc_recno,
-			.cs_id = icc->icc_id
-		};
-
-		rc = obd_set_info_async(NULL, exp, strlen(KEY_CHANGELOG_CLEAR),
-					KEY_CHANGELOG_CLEAR, sizeof(cs), &cs,
-					NULL);
-		goto out;
-	}
 	case OBD_IOC_FID2PATH:
 		rc = mdc_ioc_fid2path(exp, karg);
 		goto out;
@@ -2670,12 +2486,22 @@ static int mdc_setup(struct obd_device *obd, struct lustre_cfg *cfg)
 
 	rc = mdc_llog_init(obd);
 	if (rc) {
-		CERROR("failed to setup llogging subsystems\n");
+		CERROR("%s: failed to setup llogging subsystems: rc = %d\n",
+		       obd->obd_name, rc);
 		goto err_llog_cleanup;
 	}
 
+	rc = mdc_changelog_cdev_init(obd);
+	if (rc) {
+		CERROR("%s: failed to setup changelog char device: rc = %d\n",
+		       obd->obd_name, rc);
+		goto err_changelog_cleanup;
+	}
+
 	return 0;
 
+err_changelog_cleanup:
+	mdc_llog_finish(obd);
 err_llog_cleanup:
 	ldebugfs_free_md_stats(obd);
 	ptlrpc_lprocfs_unregister_obd(obd);
@@ -2714,6 +2540,8 @@ static int mdc_precleanup(struct obd_device *obd)
 	if (obd->obd_type->typ_refcnt <= 1)
 		libcfs_kkuc_group_rem(0, KUC_GRP_HSM);
 
+	mdc_changelog_cdev_finish(obd);
+
 	obd_cleanup_client_import(obd);
 	ptlrpc_lprocfs_unregister_obd(obd);
 	lprocfs_obd_cleanup(obd);
-- 
1.8.3.1

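The tail of this patch shares one changelog character device between every OBD that maps to it. The device is registered with `misc_register()` once, reference counted through `ced_refs`, and `mdc_changelog_cdev_finish()` drops a reference with `kref_put()`. When the count reaches zero, `chlg_dev_clear()` unlinks, deregisters and frees the entry. The user-space sketch below models that lifecycle with a plain counter. `struct chlg_dev`, `chlg_dev_get()`, `chlg_dev_put()` and `chlg_dev_count()` are hypothetical stand-ins, registration is reduced to allocating a list entry, and the `chlg_registered_dev_lock` mutex is omitted, so this is a single-threaded model and not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical user-space model of struct chlg_registered_dev.  The kernel
 * serializes these operations with chlg_registered_dev_lock; this
 * single-threaded sketch omits the lock.
 */
struct chlg_dev {
	char name[32];
	int refs;		/* models struct kref ced_refs */
	struct chlg_dev *next;	/* models the chlg_registered_devices list */
};

static struct chlg_dev *devices;

/*
 * Find-or-create by name.  Creating an entry stands in for misc_register();
 * finding an existing one only takes another reference.
 * Returns NULL on allocation failure.
 */
static struct chlg_dev *chlg_dev_get(const char *name)
{
	struct chlg_dev *d;

	for (d = devices; d; d = d->next) {
		if (strcmp(d->name, name) == 0) {
			d->refs++;
			return d;
		}
	}

	d = calloc(1, sizeof(*d));
	if (!d)
		return NULL;
	strncpy(d->name, name, sizeof(d->name) - 1);
	d->refs = 1;
	d->next = devices;
	devices = d;
	return d;
}

/*
 * Drop one reference.  On the last put, unlink and free the entry: the job
 * chlg_dev_clear() does as the kref release callback (where it also calls
 * misc_deregister()).  Returns true if the entry was freed.
 */
static bool chlg_dev_put(struct chlg_dev *dev)
{
	struct chlg_dev **pp;

	if (--dev->refs > 0)
		return false;

	for (pp = &devices; *pp; pp = &(*pp)->next) {
		if (*pp == dev) {
			*pp = dev->next;
			break;
		}
	}
	free(dev);
	return true;
}

/* Number of live entries, for inspection. */
static int chlg_dev_count(void)
{
	int n = 0;

	for (struct chlg_dev *d = devices; d; d = d->next)
		n++;
	return n;
}
```

In the kernel, `kref_put()` performs the decrement and calls the release callback. Holding `chlg_registered_dev_lock` around it prevents a concurrent registration from finding an entry that is being torn down.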

* [lustre-devel] [PATCH 19/28] lustre: uapi: add missing headers in lustre UAPI headers
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (17 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 18/28] lustre: mdc: expose changelog through char devices James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-14 18:58 ` [lustre-devel] [PATCH 20/28] lustre: obdclass: deprecate OBD_GET_VERSION ioctl James Simmons
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC
  To: lustre-devel

A trial move of the Lustre UAPI headers to their proper place in the
Linux kernel tree produced build errors, mainly because linux/types.h
was not included. Include linux/types.h in the affected headers to
prepare for the move out of staging.

Signed-off-by: James Simmons <uja.ornl@yahoo.com>
WC-bug-id: https://jira.whamcloud.com/browse/
Reviewed-on: https://review.whamcloud.com/31737
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: James Nunez <jnunez@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/include/uapi/linux/lnet/libcfs_debug.h    | 2 ++
 drivers/staging/lustre/include/uapi/linux/lnet/lnetctl.h         | 1 +
 drivers/staging/lustre/include/uapi/linux/lnet/nidstr.h          | 1 +
 drivers/staging/lustre/include/uapi/linux/lustre/lustre_cfg.h    | 1 +
 drivers/staging/lustre/include/uapi/linux/lustre/lustre_fid.h    | 1 +
 drivers/staging/lustre/include/uapi/linux/lustre/lustre_fiemap.h | 1 +
 drivers/staging/lustre/include/uapi/linux/lustre/lustre_ostid.h  | 1 +
 7 files changed, 8 insertions(+)

diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_debug.h b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_debug.h
index c4d9472..2672fe7 100644
--- a/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_debug.h
+++ b/drivers/staging/lustre/include/uapi/linux/lnet/libcfs_debug.h
@@ -38,6 +38,8 @@
 #ifndef __UAPI_LIBCFS_DEBUG_H__
 #define __UAPI_LIBCFS_DEBUG_H__
 
+#include <linux/types.h>
+
 /**
  * Format for debug message headers
  */
diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/lnetctl.h b/drivers/staging/lustre/include/uapi/linux/lnet/lnetctl.h
index cccb32d..9d53c51 100644
--- a/drivers/staging/lustre/include/uapi/linux/lnet/lnetctl.h
+++ b/drivers/staging/lustre/include/uapi/linux/lnet/lnetctl.h
@@ -15,6 +15,7 @@
 #ifndef _LNETCTL_H_
 #define _LNETCTL_H_
 
+#include <linux/types.h>
 #include <uapi/linux/lnet/lnet-types.h>
 
 /** \addtogroup lnet_fault_simulation
diff --git a/drivers/staging/lustre/include/uapi/linux/lnet/nidstr.h b/drivers/staging/lustre/include/uapi/linux/lnet/nidstr.h
index 3354e5a..3c5901d 100644
--- a/drivers/staging/lustre/include/uapi/linux/lnet/nidstr.h
+++ b/drivers/staging/lustre/include/uapi/linux/lnet/nidstr.h
@@ -28,6 +28,7 @@
 #ifndef _LNET_NIDSTRINGS_H
 #define _LNET_NIDSTRINGS_H
 
+#include <linux/types.h>
 #include <uapi/linux/lnet/lnet-types.h>
 
 /**
diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_cfg.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_cfg.h
index 11b51d9..0620e49 100644
--- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_cfg.h
+++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_cfg.h
@@ -35,6 +35,7 @@
 
 #include <linux/errno.h>
 #include <linux/kernel.h>
+#include <linux/types.h>
 #include <uapi/linux/lustre/lustre_user.h>
 
 /** \defgroup cfg cfg
diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_fid.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_fid.h
index 2e7a8d1..746bf7a 100644
--- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_fid.h
+++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_fid.h
@@ -37,6 +37,7 @@
 #ifndef _UAPI_LUSTRE_FID_H_
 #define _UAPI_LUSTRE_FID_H_
 
+#include <linux/types.h>
 #include <uapi/linux/lustre/lustre_idl.h>
 
 /** returns fid object sequence */
diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_fiemap.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_fiemap.h
index d375a47..d24a93e 100644
--- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_fiemap.h
+++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_fiemap.h
@@ -41,6 +41,7 @@
 
 #include <stddef.h>
 #include <linux/fiemap.h>
+#include <linux/types.h>
 
 /* XXX: We use fiemap_extent::fe_reserved[0] */
 #define fe_device	fe_reserved[0]
diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ostid.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ostid.h
index 3343b60..4b5f110 100644
--- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ostid.h
+++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ostid.h
@@ -35,6 +35,7 @@
 #define _UAPI_LUSTRE_OSTID_H_
 
 #include <linux/errno.h>
+#include <linux/types.h>
 #include <uapi/linux/lustre/lustre_fid.h>
 
 static inline __u64 lmm_oi_id(const struct ost_id *oi)
-- 
1.8.3.1

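UAPI headers are compiled by user space on their own, so a header that uses `__u32`/`__u64` must include `<linux/types.h>` itself instead of relying on whatever the including file pulled in first. Below is a minimal sketch of a self-contained header fragment in that style. `demo_uapi.h`, `struct demo_ostid` and `demo_oi_id()` are hypothetical names used only for illustration. The fragment assumes the Linux UAPI headers are installed, as they usually are with the libc development packages.

```c
/* demo_uapi.h: hypothetical UAPI-style header that builds on its own. */
#ifndef _UAPI_DEMO_H_
#define _UAPI_DEMO_H_

#include <linux/types.h>	/* provides __u64; without it this header
				 * fails to build when it is included first */

struct demo_ostid {
	__u64 oi_id;
	__u64 oi_seq;
};

static inline __u64 demo_oi_id(const struct demo_ostid *oi)
{
	return oi->oi_id;
}

#endif /* _UAPI_DEMO_H_ */
```

A simple way to catch this class of error is to include each UAPI header first in an otherwise empty user-space translation unit and check that it compiles.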

* [lustre-devel] [PATCH 20/28] lustre: obdclass: deprecate OBD_GET_VERSION ioctl
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (18 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 19/28] lustre: uapi: add missing headers in lustre UAPI headers James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-18  2:12   ` NeilBrown
  2018-10-14 18:58 ` [lustre-devel] [PATCH 21/28] lustre: llite: enhance vvp_dev data structure naming James Simmons
                   ` (8 subsequent siblings)
  28 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC
  To: lustre-devel

From: Andreas Dilger <adilger@whamcloud.com>

Mark the OBD_GET_VERSION ioctl as deprecated, and add a build-time check
so that it is removed before the 3.1 release.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-5969
Reviewed-on: https://review.whamcloud.com/26440
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/obdclass/class_obd.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/lustre/lustre/obdclass/class_obd.c b/drivers/staging/lustre/lustre/obdclass/class_obd.c
index 2103d2a..c4d820a 100644
--- a/drivers/staging/lustre/lustre/obdclass/class_obd.c
+++ b/drivers/staging/lustre/lustre/obdclass/class_obd.c
@@ -364,7 +364,15 @@ int class_handle_ioctl(unsigned int cmd, unsigned long arg)
 		goto out;
 	}
 
-	case OBD_GET_VERSION:
+	case OBD_GET_VERSION: {
+		static bool warned;
+
+		/* This was the method to pass to user land the lustre version.
+		 * Today that information is in the sysfs tree so we can in the
+		 * future remove this.
+		 */
+		BUILD_BUG_ON(OBD_OCD_VERSION(3, 0, 53, 0) <= LUSTRE_VERSION_CODE);
+
 		if (!data->ioc_inlbuf1) {
 			CERROR("No buffer passed in ioctl\n");
 			err = -EINVAL;
@@ -377,13 +385,19 @@ int class_handle_ioctl(unsigned int cmd, unsigned long arg)
 			goto out;
 		}
 
+		if (!warned) {
+			warned = true;
+			CWARN("%s: ioctl(OBD_GET_VERSION) is deprecated, use llapi_get_version_string() and/or relink\n",
+			      current->comm);
+		}
+
 		memcpy(data->ioc_bulk, LUSTRE_VERSION_STRING,
 		       strlen(LUSTRE_VERSION_STRING) + 1);
 
 		if (copy_to_user((void __user *)arg, data, len))
 			err = -EFAULT;
 		goto out;
-
+	}
 	case OBD_IOC_NAME2DEV: {
 		/* Resolve a device name.  This does not change the
 		 * currently selected device.
-- 
1.8.3.1

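This patch combines two guards. A compile-time `BUILD_BUG_ON()` breaks the build once `LUSTRE_VERSION_CODE` reaches 3.0.53, which forces the ioctl to be removed before 3.1. A `static bool warned` makes sure the run-time deprecation warning is printed only once. The user-space sketch below mirrors both ideas with C11 `_Static_assert` and a static flag. It assumes `OBD_OCD_VERSION()` packs major, minor, patch and fix into one byte each, most significant first, as in the Lustre headers. `DEMO_VERSION_CODE` and `warn_deprecated_once()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumed packing: one byte each for major, minor, patch and fix. */
#define OBD_OCD_VERSION(major, minor, patch, fix) \
	(((major) << 24) + ((minor) << 16) + ((patch) << 8) + (fix))

/* Hypothetical version of the code being built. */
#define DEMO_VERSION_CODE OBD_OCD_VERSION(2, 10, 0, 0)

/*
 * Compile-time counterpart of BUILD_BUG_ON(OBD_OCD_VERSION(3, 0, 53, 0) <=
 * LUSTRE_VERSION_CODE): the build fails once the version reaches 3.0.53.
 */
_Static_assert(DEMO_VERSION_CODE < OBD_OCD_VERSION(3, 0, 53, 0),
	       "OBD_GET_VERSION should have been removed by now");

/*
 * Print the deprecation warning only on the first call, like the
 * "static bool warned" in class_handle_ioctl().  Returns true if it warned.
 */
static bool warn_deprecated_once(const char *comm)
{
	static bool warned;

	if (warned)
		return false;
	warned = true;
	fprintf(stderr,
		"%s: ioctl(OBD_GET_VERSION) is deprecated, use llapi_get_version_string() and/or relink\n",
		comm);
	return true;
}
```

The unsynchronized flag only limits log noise. If two callers race, the warning may be printed twice, which is harmless for a deprecation notice.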

* [lustre-devel] [PATCH 21/28] lustre: llite: enhance vvp_dev data structure naming
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (19 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 20/28] lustre: obdclass: deprecate OBD_GET_VERSION ioctl James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-18  2:15   ` NeilBrown
  2018-10-14 18:58 ` [lustre-devel] [PATCH 22/28] lustre: clio: update spare bit handling James Simmons
                   ` (7 subsequent siblings)
  28 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC
  To: lustre-devel

The code that added struct seq_private to vvp_dev.c uses very generic
naming that doesn't fit the Lustre / kernel style.
See http://wiki.lustre.org/Lustre_Coding_Style_Guidelines for the
naming conventions. Rename struct seq_private to struct vvp_seq_private
and give its fields a vsp_ prefix.

Signed-off-by: James Simmons <uja.ornl@yahoo.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-8066
Reviewed-on: https://review.whamcloud.com/33009
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/llite/vvp_dev.c | 54 ++++++++++++++-------------
 1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/vvp_dev.c b/drivers/staging/lustre/lustre/llite/vvp_dev.c
index 31dc3c0..8cc981b 100644
--- a/drivers/staging/lustre/lustre/llite/vvp_dev.c
+++ b/drivers/staging/lustre/lustre/llite/vvp_dev.c
@@ -391,11 +391,11 @@ struct vvp_pgcache_id {
 	struct lu_object_header *vpi_obj;
 };
 
-struct seq_private {
-	struct ll_sb_info	*sbi;
-	struct lu_env		*env;
-	u16			refcheck;
-	struct cl_object	*clob;
+struct vvp_seq_private {
+	struct ll_sb_info	*vsp_sbi;
+	struct lu_env		*vsp_env;
+	u16			vsp_refcheck;
+	struct cl_object	*vsp_clob;
 };
 
 static void vvp_pgcache_id_unpack(loff_t pos, struct vvp_pgcache_id *id)
@@ -542,52 +542,54 @@ static void vvp_pgcache_page_show(const struct lu_env *env,
 
 static int vvp_pgcache_show(struct seq_file *f, void *v)
 {
-	struct seq_private	*priv = f->private;
+	struct vvp_seq_private *priv = f->private;
 	struct page		*vmpage = v;
 	struct cl_page		*page;
 
 	seq_printf(f, "%8lx@" DFID ": ", vmpage->index,
-		   PFID(lu_object_fid(&priv->clob->co_lu)));
+		   PFID(lu_object_fid(&priv->vsp_clob->co_lu)));
 	lock_page(vmpage);
-	page = cl_vmpage_page(vmpage, priv->clob);
+	page = cl_vmpage_page(vmpage, priv->vsp_clob);
 	unlock_page(vmpage);
 	put_page(vmpage);
 
 	if (page) {
-		vvp_pgcache_page_show(priv->env, f, page);
-		cl_page_put(priv->env, page);
+		vvp_pgcache_page_show(priv->vsp_env, f, page);
+		cl_page_put(priv->vsp_env, page);
 	} else {
 		seq_puts(f, "missing\n");
 	}
-	lu_object_ref_del(&priv->clob->co_lu, "dump", current);
-	cl_object_put(priv->env, priv->clob);
+	lu_object_ref_del(&priv->vsp_clob->co_lu, "dump", current);
+	cl_object_put(priv->vsp_env, priv->vsp_clob);
 
 	return 0;
 }
 
 static void *vvp_pgcache_start(struct seq_file *f, loff_t *pos)
 {
-	struct seq_private	*priv = f->private;
+	struct vvp_seq_private *priv = f->private;
 	struct page *ret;
 
-	if (priv->sbi->ll_site->ls_obj_hash->hs_cur_bits >
+	if (priv->vsp_sbi->ll_site->ls_obj_hash->hs_cur_bits >
 	    64 - PGC_OBJ_SHIFT)
 		ret = ERR_PTR(-EFBIG);
 	else
-		ret = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev,
-				       &priv->clob, pos);
+		ret = vvp_pgcache_find(priv->vsp_env,
+				       &priv->vsp_sbi->ll_cl->cd_lu_dev,
+				       &priv->vsp_clob, pos);
 
 	return ret;
 }
 
 static void *vvp_pgcache_next(struct seq_file *f, void *v, loff_t *pos)
 {
-	struct seq_private *priv = f->private;
+	struct vvp_seq_private *priv = f->private;
 	struct page *ret;
 
 	*pos += 1;
-	ret = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev,
-			       &priv->clob, pos);
+	ret = vvp_pgcache_find(priv->vsp_env,
+			       &priv->vsp_sbi->ll_cl->cd_lu_dev,
+			       &priv->vsp_clob, pos);
 	return ret;
 }
 
@@ -605,16 +607,16 @@ static void vvp_pgcache_stop(struct seq_file *f, void *v)
 
 static int vvp_dump_pgcache_seq_open(struct inode *inode, struct file *filp)
 {
-	struct seq_private *priv;
+	struct vvp_seq_private *priv;
 
 	priv = __seq_open_private(filp, &vvp_pgcache_ops, sizeof(*priv));
 	if (!priv)
 		return -ENOMEM;
 
-	priv->sbi = inode->i_private;
-	priv->env = cl_env_get(&priv->refcheck);
-	if (IS_ERR(priv->env)) {
-		int err = PTR_ERR(priv->env);
+	priv->vsp_sbi = inode->i_private;
+	priv->vsp_env = cl_env_get(&priv->vsp_refcheck);
+	if (IS_ERR(priv->vsp_env)) {
+		int err = PTR_ERR(priv->vsp_env);
 
 		seq_release_private(inode, filp);
 		return err;
@@ -625,9 +627,9 @@ static int vvp_dump_pgcache_seq_open(struct inode *inode, struct file *filp)
 static int vvp_dump_pgcache_seq_release(struct inode *inode, struct file *file)
 {
 	struct seq_file *seq = file->private_data;
-	struct seq_private *priv = seq->private;
+	struct vvp_seq_private *priv = seq->private;
 
-	cl_env_put(priv->env, &priv->refcheck);
+	cl_env_put(priv->vsp_env, &priv->vsp_refcheck);
 	return seq_release_private(inode, file);
 }
 
-- 
1.8.3.1


* [lustre-devel] [PATCH 22/28] lustre: clio: update spare bit handling
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (20 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 21/28] lustre: llite: enhance vvp_dev data structure naming James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-14 18:58 ` [lustre-devel] [PATCH 23/28] lustre: llog: fix EOF handling in llog_client_next_block() James Simmons
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC
  To: lustre-devel

Turn the OP_ATTR_* flags used for op_xvalid into enum op_xvalid and
rename them to OP_XVALID_* to make their purpose clear.

Signed-off-by: James Simmons <uja.ornl@yahoo.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10030
Reviewed-on: https://review.whamcloud.com/32825
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/include/cl_object.h    |  2 +-
 drivers/staging/lustre/lustre/include/obd.h          | 16 +++++++++-------
 drivers/staging/lustre/lustre/llite/file.c           |  4 ++--
 drivers/staging/lustre/lustre/llite/lcommon_cl.c     |  2 +-
 drivers/staging/lustre/lustre/llite/llite_internal.h |  4 ++--
 drivers/staging/lustre/lustre/llite/llite_lib.c      | 18 +++++++++---------
 drivers/staging/lustre/lustre/mdc/mdc_lib.c          | 10 +++++-----
 drivers/staging/lustre/lustre/osc/osc_io.c           |  6 +++---
 8 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/cl_object.h b/drivers/staging/lustre/lustre/include/cl_object.h
index 382bfe8..9ff1ca5 100644
--- a/drivers/staging/lustre/lustre/include/cl_object.h
+++ b/drivers/staging/lustre/lustre/include/cl_object.h
@@ -1778,7 +1778,7 @@ struct cl_io {
 			struct ost_lvb   sa_attr;
 			unsigned int		 sa_attr_flags;
 			unsigned int     sa_avalid;
-			unsigned int     sa_xvalid;
+			unsigned int		sa_xvalid; /* OP_XVALID */
 			int		sa_stripe_index;
 			const struct lu_fid	*sa_parent_fid;
 		} ci_setattr;
diff --git a/drivers/staging/lustre/lustre/include/obd.h b/drivers/staging/lustre/lustre/include/obd.h
index 76ae0b3..cf3dbd6 100644
--- a/drivers/staging/lustre/lustre/include/obd.h
+++ b/drivers/staging/lustre/lustre/include/obd.h
@@ -664,6 +664,14 @@ struct obd_device {
 #define KEY_CACHE_SET		"cache_set"
 #define KEY_CACHE_LRU_SHRINK	"cache_lru_shrink"
 
+/* Flags for op_xvalid */
+enum op_xvalid {
+	OP_XVALID_CTIME_SET	= BIT(0),	/* 0x0001 */
+	OP_XVALID_BLOCKS	= BIT(1),	/* 0x0002 */
+	OP_XVALID_OWNEROVERRIDE	= BIT(2),	/* 0x0004 */
+	OP_XVALID_FLAGS		= BIT(3),	/* 0x0008 */
+};
+
 struct lu_context;
 
 static inline int it_to_lock_mode(struct lookup_intent *it)
@@ -733,7 +741,7 @@ struct md_op_data {
 
 	/* iattr fields and blocks. */
 	struct iattr	    op_attr;
-	unsigned int		op_xvalid; /* eXtra validity flags */
+	enum op_xvalid		op_xvalid;	/* eXtra validity flags */
 	unsigned int	    op_attr_flags;
 	__u64		   op_valid;
 	loff_t		  op_attr_blocks;
@@ -764,12 +772,6 @@ struct md_op_data {
 	__u32			op_default_stripe_offset;
 };
 
-/* Flags for op_xvalid */
-#define OP_ATTR_CTIME_SET	(1 << 0)
-#define OP_ATTR_BLOCKS		(1 << 1)
-#define OP_ATTR_OWNEROVERRIDE	(1 << 2)
-#define OP_ATTR_FLAGS		(1 << 3)
-
 struct md_callback {
 	int (*md_blocking_ast)(struct ldlm_lock *lock,
 			       struct ldlm_lock_desc *desc,
diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
index 3bfc6d84..d80bda4 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -96,7 +96,7 @@ static void ll_prepare_close(struct inode *inode, struct md_op_data *op_data,
 	op_data->op_attr.ia_valid |= (ATTR_MODE | ATTR_ATIME | ATTR_ATIME_SET |
 				      ATTR_MTIME | ATTR_MTIME_SET |
 				      ATTR_CTIME);
-	op_data->op_xvalid |= OP_ATTR_CTIME_SET;
+	op_data->op_xvalid |= OP_XVALID_CTIME_SET;
 	op_data->op_attr_blocks = inode->i_blocks;
 	op_data->op_attr_flags = ll_inode_to_ext_flags(inode->i_flags);
 	op_data->op_handle = och->och_fh;
@@ -163,7 +163,7 @@ static int ll_close_inode_openhandle(struct inode *inode,
 		op_data->op_data_version = *(__u64 *)data;
 		op_data->op_lease_handle = och->och_lease_handle;
 		op_data->op_attr.ia_valid |= ATTR_SIZE;
-		op_data->op_xvalid |= OP_ATTR_BLOCKS;
+		op_data->op_xvalid |= OP_XVALID_BLOCKS;
 		break;
 
 	default:
diff --git a/drivers/staging/lustre/lustre/llite/lcommon_cl.c b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
index 20a3c74..ade3b12 100644
--- a/drivers/staging/lustre/lustre/llite/lcommon_cl.c
+++ b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
@@ -80,7 +80,7 @@
 static DEFINE_MUTEX(cl_inode_fini_guard);
 
 int cl_setattr_ost(struct cl_object *obj, const struct iattr *attr,
-		   unsigned int xvalid, unsigned int attr_flags)
+		   enum op_xvalid xvalid, unsigned int attr_flags)
 {
 	struct lu_env *env;
 	struct cl_io  *io;
diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index 796a8ae..ad380f1 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -843,7 +843,7 @@ int ll_revalidate_it_finish(struct ptlrpc_request *request,
 void ll_dir_clear_lsm_md(struct inode *inode);
 void ll_clear_inode(struct inode *inode);
 int ll_setattr_raw(struct dentry *dentry, struct iattr *attr,
-		   unsigned int xvalid, bool hsm_import);
+		   enum op_xvalid xvalid, bool hsm_import);
 int ll_setattr(struct dentry *de, struct iattr *attr);
 int ll_statfs(struct dentry *de, struct kstatfs *sfs);
 int ll_statfs_internal(struct ll_sb_info *sbi, struct obd_statfs *osfs,
@@ -1357,7 +1357,7 @@ int ll_page_sync_io(const struct lu_env *env, struct cl_io *io,
 
 /* lcommon_cl.c */
 int cl_setattr_ost(struct cl_object *obj, const struct iattr *attr,
-		   unsigned int xvalid, unsigned int attr_flags);
+		   enum op_xvalid xvalid, unsigned int attr_flags);
 
 extern struct lu_env *cl_inode_fini_env;
 extern u16 cl_inode_fini_refcheck;
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index 153aa12..a5e65db 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -1492,7 +1492,7 @@ static int ll_md_setattr(struct dentry *dentry, struct md_op_data *op_data)
  * In case of HSMimport, we only set attr on MDS.
  */
 int ll_setattr_raw(struct dentry *dentry, struct iattr *attr,
-		   unsigned int xvalid, bool hsm_import)
+		   enum op_xvalid xvalid, bool hsm_import)
 {
 	struct inode *inode = d_inode(dentry);
 	struct ll_inode_info *lli = ll_i2info(inode);
@@ -1531,10 +1531,10 @@ int ll_setattr_raw(struct dentry *dentry, struct iattr *attr,
 	}
 
 	/* We mark all of the fields "set" so MDS/OST does not re-set them */
-	if (!(xvalid & OP_ATTR_CTIME_SET) &&
+	if (!(xvalid & OP_XVALID_CTIME_SET) &&
 	    attr->ia_valid & ATTR_CTIME) {
 		attr->ia_ctime = current_time(inode);
-		xvalid |= OP_ATTR_CTIME_SET;
+		xvalid |= OP_XVALID_CTIME_SET;
 	}
 	if (!(attr->ia_valid & ATTR_ATIME_SET) &&
 	    (attr->ia_valid & ATTR_ATIME)) {
@@ -1570,7 +1570,7 @@ int ll_setattr_raw(struct dentry *dentry, struct iattr *attr,
 		 * If we are changing file size, file content is
 		 * modified, flag it.
 		 */
-		xvalid |= OP_ATTR_OWNEROVERRIDE;
+		xvalid |= OP_XVALID_OWNEROVERRIDE;
 		op_data->op_bias |= MDS_DATA_MODIFIED;
 		clear_bit(LLIF_DATA_MODIFIED, &lli->lli_flags);
 	}
@@ -1589,7 +1589,7 @@ int ll_setattr_raw(struct dentry *dentry, struct iattr *attr,
 
 	if (attr->ia_valid & (ATTR_SIZE | ATTR_ATIME | ATTR_ATIME_SET |
 			      ATTR_MTIME | ATTR_MTIME_SET | ATTR_CTIME) ||
-	    xvalid & OP_ATTR_CTIME_SET) {
+	    xvalid & OP_XVALID_CTIME_SET) {
 		/* For truncate and utimes sending attributes to OSTs, setting
 		 * mtime/atime to the past will be performed under PW [0:EOF]
 		 * extent lock (new_size:EOF for truncate).  It may seem
@@ -1655,11 +1655,11 @@ int ll_setattr_raw(struct dentry *dentry, struct iattr *attr,
 int ll_setattr(struct dentry *de, struct iattr *attr)
 {
 	int mode = d_inode(de)->i_mode;
-	unsigned int xvalid = 0;
+	enum op_xvalid xvalid = 0;
 
 	if ((attr->ia_valid & (ATTR_CTIME | ATTR_SIZE | ATTR_MODE)) ==
 			      (ATTR_CTIME | ATTR_SIZE | ATTR_MODE))
-		xvalid |= OP_ATTR_OWNEROVERRIDE;
+		xvalid |= OP_XVALID_OWNEROVERRIDE;
 
 	if (((attr->ia_valid & (ATTR_MODE | ATTR_FORCE | ATTR_SIZE)) ==
 			       (ATTR_SIZE | ATTR_MODE)) &&
@@ -2014,7 +2014,7 @@ int ll_iocontrol(struct inode *inode, struct file *file,
 			return PTR_ERR(op_data);
 
 		op_data->op_attr_flags = flags;
-		op_data->op_xvalid |= OP_ATTR_FLAGS;
+		op_data->op_xvalid |= OP_XVALID_FLAGS;
 		rc = md_setattr(sbi->ll_md_exp, op_data, NULL, 0, &req);
 		ll_finish_md_op_data(op_data);
 		ptlrpc_req_finished(req);
@@ -2031,7 +2031,7 @@ int ll_iocontrol(struct inode *inode, struct file *file,
 		if (!attr)
 			return -ENOMEM;
 
-		rc = cl_setattr_ost(obj, attr, OP_ATTR_FLAGS, flags);
+		rc = cl_setattr_ost(obj, attr, OP_XVALID_FLAGS, flags);
 		kfree(attr);
 		return rc;
 	}
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_lib.c b/drivers/staging/lustre/lustre/mdc/mdc_lib.c
index fc5a51d..1ab1ad2 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_lib.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_lib.c
@@ -266,7 +266,7 @@ void mdc_open_pack(struct ptlrpc_request *req, struct md_op_data *op_data,
 	set_mrc_cr_flags(rec, cr_flags);
 }
 
-static inline __u64 attr_pack(unsigned int ia_valid, unsigned int ia_xvalid)
+static inline u64 attr_pack(unsigned int ia_valid, enum op_xvalid ia_xvalid)
 {
 	__u64 sa_valid = 0;
 
@@ -290,19 +290,19 @@ static inline __u64 attr_pack(unsigned int ia_valid, unsigned int ia_xvalid)
 		sa_valid |= MDS_ATTR_MTIME_SET;
 	if (ia_valid & ATTR_FORCE)
 		sa_valid |= MDS_ATTR_FORCE;
-	if (ia_xvalid & OP_ATTR_FLAGS)
+	if (ia_xvalid & OP_XVALID_FLAGS)
 		sa_valid |= MDS_ATTR_ATTR_FLAG;
 	if (ia_valid & ATTR_KILL_SUID)
 		sa_valid |=  MDS_ATTR_KILL_SUID;
 	if (ia_valid & ATTR_KILL_SGID)
 		sa_valid |= MDS_ATTR_KILL_SGID;
-	if (ia_xvalid & OP_ATTR_CTIME_SET)
+	if (ia_xvalid & OP_XVALID_CTIME_SET)
 		sa_valid |= MDS_ATTR_CTIME_SET;
 	if (ia_valid & ATTR_OPEN)
 		sa_valid |= MDS_ATTR_FROM_OPEN;
-	if (ia_xvalid & OP_ATTR_BLOCKS)
+	if (ia_xvalid & OP_XVALID_BLOCKS)
 		sa_valid |= MDS_ATTR_BLOCKS;
-	if (ia_xvalid & OP_ATTR_OWNEROVERRIDE)
+	if (ia_xvalid & OP_XVALID_OWNEROVERRIDE)
 		/* NFSD hack (see bug 5781) */
 		sa_valid |= MDS_OPEN_OWNEROVERRIDE;
 	return sa_valid;
diff --git a/drivers/staging/lustre/lustre/osc/osc_io.c b/drivers/staging/lustre/lustre/osc/osc_io.c
index e7151ed..dabdf6d 100644
--- a/drivers/staging/lustre/lustre/osc/osc_io.c
+++ b/drivers/staging/lustre/lustre/osc/osc_io.c
@@ -500,7 +500,7 @@ static int osc_io_setattr_start(const struct lu_env *env,
 	struct osc_async_cbargs *cbargs = &oio->oi_cbarg;
 	__u64 size = io->u.ci_setattr.sa_attr.lvb_size;
 	unsigned int ia_avalid = io->u.ci_setattr.sa_avalid;
-	unsigned int ia_xvalid = io->u.ci_setattr.sa_xvalid;
+	enum op_xvalid ia_xvalid = io->u.ci_setattr.sa_xvalid;
 	int result = 0;
 
 	/* truncate cache dirty pages first */
@@ -528,7 +528,7 @@ static int osc_io_setattr_start(const struct lu_env *env,
 				attr->cat_atime = lvb->lvb_atime;
 				cl_valid |= CAT_ATIME;
 			}
-			if (ia_xvalid & OP_ATTR_CTIME_SET) {
+			if (ia_xvalid & OP_XVALID_CTIME_SET) {
 				attr->cat_ctime = lvb->lvb_ctime;
 				cl_valid |= CAT_CTIME;
 			}
@@ -567,7 +567,7 @@ static int osc_io_setattr_start(const struct lu_env *env,
 		} else {
 			LASSERT(oio->oi_lockless == 0);
 		}
-		if (ia_xvalid & OP_ATTR_FLAGS) {
+		if (ia_xvalid & OP_XVALID_FLAGS) {
 			oa->o_flags = io->u.ci_setattr.sa_attr_flags;
 			oa->o_valid |= OBD_MD_FLFLAGS;
 		}
-- 
1.8.3.1

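`attr_pack()` in mdc_lib.c is where the renamed `enum op_xvalid` bits are translated into the `sa_valid` flags sent to the MDS. The sketch below reproduces the enum as added to obd.h and applies the same bit-by-bit translation. The `DEMO_SA_*` output values are hypothetical placeholders, not the real `MDS_ATTR_*` / `MDS_OPEN_OWNEROVERRIDE` constants. `BIT()` is defined locally because it is a kernel macro.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(nr) (1UL << (nr))	/* local stand-in for the kernel macro */

/* Flags for op_xvalid, as added by the patch. */
enum op_xvalid {
	OP_XVALID_CTIME_SET	= BIT(0),	/* 0x0001 */
	OP_XVALID_BLOCKS	= BIT(1),	/* 0x0002 */
	OP_XVALID_OWNEROVERRIDE	= BIT(2),	/* 0x0004 */
	OP_XVALID_FLAGS		= BIT(3),	/* 0x0008 */
};

/* Hypothetical wire flags; the real code uses the MDS_ATTR_* values. */
#define DEMO_SA_CTIME_SET	(1ULL << 16)
#define DEMO_SA_BLOCKS		(1ULL << 17)
#define DEMO_SA_ATTR_FLAG	(1ULL << 18)
#define DEMO_SA_OWNEROVERRIDE	(1ULL << 19)

/* Translate each extra-validity bit independently, as attr_pack() does. */
static uint64_t xvalid_pack(enum op_xvalid xvalid)
{
	uint64_t sa_valid = 0;

	if (xvalid & OP_XVALID_FLAGS)
		sa_valid |= DEMO_SA_ATTR_FLAG;
	if (xvalid & OP_XVALID_CTIME_SET)
		sa_valid |= DEMO_SA_CTIME_SET;
	if (xvalid & OP_XVALID_BLOCKS)
		sa_valid |= DEMO_SA_BLOCKS;
	if (xvalid & OP_XVALID_OWNEROVERRIDE)
		sa_valid |= DEMO_SA_OWNEROVERRIDE;
	return sa_valid;
}
```

Each enumerator keeps the value of the `OP_ATTR_*` define it replaces (`BIT(n)` equals `1 << n`). The change is therefore a pure rename plus stronger typing, with no change in behaviour.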

* [lustre-devel] [PATCH 23/28] lustre: llog: fix EOF handling in llog_client_next_block()
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (21 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 22/28] lustre: clio: update spare bit handling James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-14 18:58 ` [lustre-devel] [PATCH 24/28] lustre: llite: IO accounting of page read James Simmons
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

In llog_client_next_block() update *cur_idx and *cur_offset in the
special case that the handler has returned -EIO after reaching the end
of the log without finding the desired record. This fixes client side
EOF detection in llog_process_thread().

Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10267
Reviewed-on: https://review.whamcloud.com/30313
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/ptlrpc/llog_client.c | 24 ++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/lustre/lustre/ptlrpc/llog_client.c b/drivers/staging/lustre/lustre/ptlrpc/llog_client.c
index 946d538..6ddd93c 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/llog_client.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/llog_client.c
@@ -171,8 +171,21 @@ static int llog_client_next_block(const struct lu_env *env,
 	req_capsule_set_size(&req->rq_pill, &RMF_EADATA, RCL_SERVER, len);
 	ptlrpc_request_set_replen(req);
 	rc = ptlrpc_queue_wait(req);
-	if (rc)
+	/* -EIO has a special meaning here. If llog_osd_next_block()
+	 * reaches the end of the log without finding the desired
+	 * record then it updates *cur_offset and *cur_idx and returns
+	 * -EIO. In llog_process_thread() we use this to detect
+	 * EOF. But we must be careful to distinguish between -EIO
+	 * coming from llog_osd_next_block() and -EIO coming from
+	 * ptlrpc or below.
+	 */
+	if (rc == -EIO) {
+		if (!req->rq_repmsg ||
+		    lustre_msg_get_status(req->rq_repmsg) != -EIO)
+			goto out;
+	} else if (rc < 0) {
 		goto out;
+	}
 
 	body = req_capsule_server_get(&req->rq_pill, &RMF_LLOGD_BODY);
 	if (!body) {
@@ -180,6 +193,12 @@ static int llog_client_next_block(const struct lu_env *env,
 		goto out;
 	}
 
+	*cur_idx = body->lgd_saved_index;
+	*cur_offset = body->lgd_cur_offset;
+
+	if (rc < 0)
+		goto out;
+
 	/* The log records are swabbed as they are processed */
 	ptr = req_capsule_server_get(&req->rq_pill, &RMF_EADATA);
 	if (!ptr) {
@@ -187,9 +206,6 @@ static int llog_client_next_block(const struct lu_env *env,
 		goto out;
 	}
 
-	*cur_idx = body->lgd_saved_index;
-	*cur_offset = body->lgd_cur_offset;
-
 	memcpy(buf, ptr, len);
 out:
 	ptlrpc_req_finished(req);
-- 
1.8.3.1
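The cursor-update rule in this patch can be sketched as a small predicate. This is a simplified illustration, not the Lustre code. The struct and `should_update_cursor()` are hypothetical. The three fields stand in for the return value of `ptlrpc_queue_wait()`, whether `req->rq_repmsg` is present, and `lustre_msg_get_status()` on the reply.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical inputs: the values llog_client_next_block() inspects. */
struct next_block_result {
	int rpc_rc;        /* return value of ptlrpc_queue_wait() */
	bool have_reply;   /* req->rq_repmsg != NULL */
	int reply_status;  /* lustre_msg_get_status(req->rq_repmsg) */
};

/*
 * Returns true when *cur_idx and *cur_offset should be copied from the
 * reply body: on success, or when the server itself returned -EIO to
 * signal that it reached the end of the log.
 */
static bool should_update_cursor(const struct next_block_result *r)
{
	if (r->rpc_rc == -EIO)
		/* Only a server-side -EIO carries a valid cursor. */
		return r->have_reply && r->reply_status == -EIO;
	/* Any other negative rc is a plain failure. */
	return r->rpc_rc >= 0;
}
```

When the RPC returns -EIO and the reply status is also -EIO, the server is signalling end of log. The cursor is updated, and the error is still returned so the caller can detect EOF. When the RPC returns -EIO with no reply, or with a different reply status, the error came from ptlrpc or below, and the cursor is left alone.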

* [lustre-devel] [PATCH 24/28] lustre: llite: IO accounting of page read
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (22 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 23/28] lustre: llog: fix EOF handling in llog_client_next_block() James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-14 18:58 ` [lustre-devel] [PATCH 25/28] lustre: llite: disable statahead if starting statahead fail James Simmons
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

From: Hongchao Zhang <hongchao@whamcloud.com>

When CONFIG_TASK_IO_ACCOUNTING is used with Lustre, writes are
accounted for but reads are not.

The accounting is normally done in the kernel for page writeback
and readahead functionality. Therefore, as Lustre implements its
own readahead, it must also maintain its own accounting on read
(but not for write).

Signed-off-by: Hongchao Zhang <hongchao@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-618
Reviewed-on: https://review.whamcloud.com/1636
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/llite/rw.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/rw.c b/drivers/staging/lustre/lustre/llite/rw.c
index 9cc0d4fe..55d8b31 100644
--- a/drivers/staging/lustre/lustre/llite/rw.c
+++ b/drivers/staging/lustre/lustre/llite/rw.c
@@ -48,6 +48,7 @@
 #include <linux/pagemap.h>
 /* current_is_kswapd() */
 #include <linux/swap.h>
+#include <linux/task_io_accounting_ops.h>
 #include <linux/bvec.h>
 
 #define DEBUG_SUBSYSTEM S_LLITE
@@ -1137,9 +1138,13 @@ static int ll_io_read_page(const struct lu_env *env, struct cl_io *io,
 		       PFID(ll_inode2fid(inode)), rc2, vvp_index(vpg));
 	}
 
-	if (queue->c2_qin.pl_nr > 0)
-		rc = cl_io_submit_rw(env, io, CRT_READ, queue);
+	if (queue->c2_qin.pl_nr > 0) {
+		int count = queue->c2_qin.pl_nr;
 
+		rc = cl_io_submit_rw(env, io, CRT_READ, queue);
+		if (!rc)
+			task_io_account_read(PAGE_SIZE * count);
+	}
 	/*
 	 * Unlock unsent pages in case of error.
 	 */
-- 
1.8.3.1
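The accounting rule added to ll_io_read_page() is: sample the queued page count, submit, and charge `PAGE_SIZE * count` bytes only on success. A minimal userspace sketch of that arithmetic follows. It assumes 4 KiB pages, and it uses a hypothetical global counter as a stand-in for `task_io_account_read()`. The patch samples `pl_nr` before calling `cl_io_submit_rw()`, presumably because submission can move pages off the input queue.

```c
#include <assert.h>

/* Assumption: 4 KiB pages; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Hypothetical per-task counter standing in for task_io_account_read(). */
static unsigned long long task_read_bytes;

static void sketch_account_read(unsigned long long bytes)
{
	task_read_bytes += bytes;
}

/*
 * Same shape as the ll_io_read_page() change: sample the number of
 * queued pages before submission, and charge them only if submission
 * succeeded. submit_rc stands in for cl_io_submit_rw()'s result.
 */
static int sketch_submit_read(int nr_queued, int submit_rc)
{
	if (nr_queued > 0) {
		int count = nr_queued;

		if (!submit_rc)
			sketch_account_read(SKETCH_PAGE_SIZE * count);
	}
	return submit_rc;
}
```

For example, `sketch_submit_read(3, 0)` charges 3 * 4096 = 12288 bytes. A failed submission charges nothing.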

* [lustre-devel] [PATCH 25/28] lustre: llite: disable statahead if starting statahead fail
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (23 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 24/28] lustre: llite: IO accounting of page read James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-14 18:58 ` [lustre-devel] [PATCH 26/28] lustre: mdc: set correct body eadatasize for getxattr() James Simmons
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

From: Lai Siyao <lai.siyao@whamcloud.com>

Once starting the statahead thread fails, statahead should be disabled.
The current code only does this when "sai != NULL"; instead it should
check whether the current process is the one that opened this
directory. That way statahead is not retried in cases such as the
current file not being the first dirent, or sai allocation failing.

Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10165
Reviewed-on: https://review.whamcloud.com/29817
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/llite/statahead.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/statahead.c b/drivers/staging/lustre/lustre/llite/statahead.c
index 0174a4c..8f3ff7f 100644
--- a/drivers/staging/lustre/lustre/llite/statahead.c
+++ b/drivers/staging/lustre/lustre/llite/statahead.c
@@ -1512,6 +1512,9 @@ static int start_statahead_thread(struct inode *dir, struct dentry *dentry)
 	task = kthread_create(ll_statahead_thread, parent, "ll_sa_%u",
 			      lli->lli_opendir_pid);
 	if (IS_ERR(task)) {
+		spin_lock(&lli->lli_sa_lock);
+		lli->lli_sai = NULL;
+		spin_unlock(&lli->lli_sa_lock);
 		rc = PTR_ERR(task);
 		CERROR("can't start ll_sa thread, rc : %d\n", rc);
 		goto out;
@@ -1537,8 +1540,8 @@ static int start_statahead_thread(struct inode *dir, struct dentry *dentry)
 	 * that subsequent stat won't waste time to try it.
 	 */
 	spin_lock(&lli->lli_sa_lock);
-	lli->lli_sa_enabled = 0;
-	lli->lli_sai = NULL;
+	if (lli->lli_opendir_pid == current->pid)
+		lli->lli_sa_enabled = 0;
 	spin_unlock(&lli->lli_sa_lock);
 	if (sai)
 		ll_sai_free(sai);
-- 
1.8.3.1
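The two failure-path changes can be modelled in a few lines. This is a simplified sketch: the struct mirrors only the ll_inode_info fields the patch touches, the function names are hypothetical, and the `lli_sa_lock` spinlock is omitted because the sketch is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical subset of the per-directory state kept in ll_inode_info. */
struct dir_sa_state {
	pid_t opendir_pid;  /* like lli_opendir_pid: the process that opened the dir */
	bool sa_enabled;    /* like lli_sa_enabled */
	void *sai;          /* like lli_sai: non-NULL once statahead info is attached */
};

/* kthread_create() failed after sai was attached: detach it again. */
static void sketch_thread_create_failed(struct dir_sa_state *st)
{
	st->sai = NULL;
}

/*
 * Common failure exit: disable statahead for this directory only if the
 * caller is the process that opened it, so later stats by that process
 * do not retry.
 */
static void sketch_start_failed(struct dir_sa_state *st, pid_t current_pid)
{
	if (st->opendir_pid == current_pid)
		st->sa_enabled = false;
}
```

The check no longer depends on whether `sai` was allocated. Failures before allocation, such as "not the first dirent" or an allocation error, therefore also disable statahead for the owning process.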

* [lustre-devel] [PATCH 26/28] lustre: mdc: set correct body eadatasize for getxattr()
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (24 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 25/28] lustre: llite: disable statahead if starting statahead fail James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-14 18:58 ` [lustre-devel] [PATCH 27/28] lustre: llite: control concurrent statahead instances James Simmons
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

From: "John L. Hammond" <jhammond@whamcloud.com>

In mdc_intent_getxattr_pack() set mbo_eadatasize to the size of the
xattr values buffer rather than the size of the xattr names buffer.
Only the xattr values buffer should be upsized for older MDTs.

Signed-off-by: John L. Hammond <jhammond@whamcloud.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-10912
Reviewed-on: https://review.whamcloud.com/31990
WC-bug-id: https://jira.whamcloud.com/browse/LU-11268
Reviewed-on: https://review.whamcloud.com/33024
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Andrew Perepechko <c17827@cray.com>
Reviewed-by: Fan Yong <fan.yong@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/mdc/mdc_locks.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/lustre/lustre/mdc/mdc_locks.c b/drivers/staging/lustre/lustre/mdc/mdc_locks.c
index 5ec5d78..2cc2378 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_locks.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_locks.c
@@ -331,6 +331,7 @@ static void mdc_realloc_openmsg(struct ptlrpc_request *req,
 			 struct lookup_intent *it,
 			 struct md_op_data *op_data)
 {
+	u32 ea_vals_buf_size = GA_DEFAULT_EA_VAL_LEN * GA_DEFAULT_EA_NUM;
 	struct ptlrpc_request	*req;
 	struct ldlm_intent	*lit;
 	int rc, count = 0;
@@ -353,13 +354,13 @@ static void mdc_realloc_openmsg(struct ptlrpc_request *req,
 
 	/* pack the intended request */
 	mdc_pack_body(req, &op_data->op_fid1, op_data->op_valid,
-		      GA_DEFAULT_EA_NAME_LEN * GA_DEFAULT_EA_NUM, -1, 0);
+		      ea_vals_buf_size, -1, 0);
 
 	req_capsule_set_size(&req->rq_pill, &RMF_EADATA, RCL_SERVER,
 			     GA_DEFAULT_EA_NAME_LEN * GA_DEFAULT_EA_NUM);
 
 	req_capsule_set_size(&req->rq_pill, &RMF_EAVALS, RCL_SERVER,
-			     GA_DEFAULT_EA_NAME_LEN * GA_DEFAULT_EA_NUM);
+			     ea_vals_buf_size);
 
 	req_capsule_set_size(&req->rq_pill, &RMF_EAVALS_LENS, RCL_SERVER,
 			     sizeof(u32) * GA_DEFAULT_EA_NUM);
-- 
1.8.3.1
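The size arithmetic behind this fix can be spelled out as follows. The constants are assumptions: they are how we recall the `GA_DEFAULT_EA_NAME_LEN`, `GA_DEFAULT_EA_VAL_LEN` and `GA_DEFAULT_EA_NUM` defaults in mdc_locks.c (20, 250 and 10), so check them against the tree. The struct and function are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; verify against GA_DEFAULT_EA_* in mdc_locks.c. */
#define SKETCH_EA_NAME_LEN 20u
#define SKETCH_EA_VAL_LEN  250u
#define SKETCH_EA_NUM      10u

struct getxattr_reply_sizes {
	uint32_t names;           /* RMF_EADATA: concatenated xattr names */
	uint32_t vals;            /* RMF_EAVALS: concatenated xattr values */
	uint32_t lens;            /* RMF_EAVALS_LENS: one u32 length per value */
	uint32_t body_eadatasize; /* mbo_eadatasize packed into the body */
};

static struct getxattr_reply_sizes getxattr_sizes(void)
{
	struct getxattr_reply_sizes s;
	uint32_t ea_vals_buf_size = SKETCH_EA_VAL_LEN * SKETCH_EA_NUM;

	s.names = SKETCH_EA_NAME_LEN * SKETCH_EA_NUM;
	s.vals = ea_vals_buf_size;
	s.lens = (uint32_t)sizeof(uint32_t) * SKETCH_EA_NUM;
	/* After the fix the body advertises the values buffer size. */
	s.body_eadatasize = ea_vals_buf_size;
	return s;
}
```

Under these assumptions the patch changes both `mbo_eadatasize` and the RMF_EAVALS size from 200 to 2500 bytes. The names buffer stays at 200 bytes.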

* [lustre-devel] [PATCH 27/28] lustre: llite: control concurrent statahead instances
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (25 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 26/28] lustre: mdc: set correct body eadatasize for getxattr() James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-14 18:58 ` [lustre-devel] [PATCH 28/28] lustre: llite: restore lld_nfs_dentry handling James Simmons
  2018-10-22  4:36 ` [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 NeilBrown
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

From: Fan Yong <fan.yong@intel.com>

It was found that if there are too many concurrent statahead
instances, the related statahead RPCs may accumulate on the
client import (for MDT) RPC lists
(imp_sending_list/imp_delayed_list/imp_unreplied_list), seriously
degrading spin_lock efficiency when the MDT is overloaded or in
recovery. As a temporary solution, restrict the number of concurrent
statahead instances.

To support more concurrent statahead instances, consider
decentralizing the RPC lists attached to the related import.

Signed-off-by: Fan Yong <fan.yong@intel.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-11079
Reviewed-on: https://review.whamcloud.com/32690
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Lai Siyao <lai.siyao@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 .../staging/lustre/lustre/llite/llite_internal.h   | 13 +++++++-
 drivers/staging/lustre/lustre/llite/llite_lib.c    |  1 +
 drivers/staging/lustre/lustre/llite/lproc_llite.c  | 37 ++++++++++++++++++++++
 drivers/staging/lustre/lustre/llite/statahead.c    | 24 ++++++++++----
 4 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index ad380f1..359bd53 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -504,6 +504,9 @@ struct ll_sb_info {
 	int		       ll_rw_stats_on;
 
 	/* metadata stat-ahead */
+	unsigned int		ll_sa_running_max; /* max concurrent
+						    * statahead instances
+						    */
 	unsigned int	      ll_sa_max;     /* max statahead RPCs */
 	atomic_t		  ll_sa_total;   /* statahead thread started
 						  * count
@@ -1063,7 +1066,15 @@ enum ras_update_flags {
 /* statahead.c */
 #define LL_SA_RPC_MIN	   2
 #define LL_SA_RPC_DEF	   32
-#define LL_SA_RPC_MAX	   8192
+#define LL_SA_RPC_MAX		512
+
+/* XXX: If want to support more concurrent statahead instances,
+ *	please consider to decentralize the RPC lists attached
+ *	on related import, such as imp_{sending,delayed}_list.
+ *	LU-11079
+ */
+#define LL_SA_RUNNING_MAX	256
+#define LL_SA_RUNNING_DEF	16
 
 #define LL_SA_CACHE_BIT	 5
 #define LL_SA_CACHE_SIZE	(1 << LL_SA_CACHE_BIT)
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index a5e65db..fae7e50 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -116,6 +116,7 @@ static struct ll_sb_info *ll_init_sbi(void)
 	}
 
 	/* metadata statahead is enabled by default */
+	sbi->ll_sa_running_max = LL_SA_RUNNING_DEF;
 	sbi->ll_sa_max = LL_SA_RPC_DEF;
 	atomic_set(&sbi->ll_sa_total, 0);
 	atomic_set(&sbi->ll_sa_wrong, 0);
diff --git a/drivers/staging/lustre/lustre/llite/lproc_llite.c b/drivers/staging/lustre/lustre/llite/lproc_llite.c
index d8ef090..10dc7a8 100644
--- a/drivers/staging/lustre/lustre/llite/lproc_llite.c
+++ b/drivers/staging/lustre/lustre/llite/lproc_llite.c
@@ -714,6 +714,42 @@ static ssize_t stats_track_gid_store(struct kobject *kobj,
 }
 LUSTRE_RW_ATTR(stats_track_gid);
 
+static ssize_t statahead_running_max_show(struct kobject *kobj,
+					  struct attribute *attr,
+					  char *buf)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+
+	return snprintf(buf, 16, "%u\n", sbi->ll_sa_running_max);
+}
+
+static ssize_t statahead_running_max_store(struct kobject *kobj,
+					   struct attribute *attr,
+					   const char *buffer,
+					   size_t count)
+{
+	struct ll_sb_info *sbi = container_of(kobj, struct ll_sb_info,
+					      ll_kset.kobj);
+	unsigned long val;
+	int rc;
+
+	rc = kstrtoul(buffer, 0, &val);
+	if (rc)
+		return rc;
+
+	if (val <= LL_SA_RUNNING_MAX) {
+		sbi->ll_sa_running_max = val;
+		return count;
+	}
+
+	CERROR("Bad statahead_running_max value %lu. Valid values are in the range [0, %d]\n",
+	       val, LL_SA_RUNNING_MAX);
+
+	return -ERANGE;
+}
+LUSTRE_RW_ATTR(statahead_running_max);
+
 static ssize_t statahead_max_show(struct kobject *kobj,
 				  struct attribute *attr,
 				  char *buf)
@@ -1171,6 +1207,7 @@ static ssize_t ll_nosquash_nids_seq_write(struct file *file,
 	&lustre_attr_stats_track_pid.attr,
 	&lustre_attr_stats_track_ppid.attr,
 	&lustre_attr_stats_track_gid.attr,
+	&lustre_attr_statahead_running_max.attr,
 	&lustre_attr_statahead_max.attr,
 	&lustre_attr_statahead_agl.attr,
 	&lustre_attr_lazystatfs.attr,
diff --git a/drivers/staging/lustre/lustre/llite/statahead.c b/drivers/staging/lustre/lustre/llite/statahead.c
index 8f3ff7f..336f1cf 100644
--- a/drivers/staging/lustre/lustre/llite/statahead.c
+++ b/drivers/staging/lustre/lustre/llite/statahead.c
@@ -1472,23 +1472,34 @@ static int start_statahead_thread(struct inode *dir, struct dentry *dentry)
 	struct ll_statahead_info *sai = NULL;
 	struct task_struct *task;
 	struct dentry *parent = dentry->d_parent;
-	int rc;
+	struct ll_sb_info *sbi = ll_i2sbi(parent->d_inode);
+	int first = LS_FIRST_DE;
+	int rc = 0;
 
 	/* I am the "lli_opendir_pid" owner, only me can set "lli_sai". */
-	rc = is_first_dirent(dir, dentry);
-	if (rc == LS_NOT_FIRST_DE) {
+	first = is_first_dirent(dir, dentry);
+	if (first == LS_NOT_FIRST_DE) {
 		/* It is not "ls -{a}l" operation, no need statahead for it. */
 		rc = -EFAULT;
 		goto out;
 	}
 
+	if (unlikely(atomic_inc_return(&sbi->ll_sa_running) >
+				       sbi->ll_sa_running_max)) {
+		CDEBUG(D_READA,
+		       "Too many concurrent statahead instances, avoid new statahead instance temporarily.\n");
+		rc = -EMFILE;
+		goto out;
+	}
+
+
 	sai = ll_sai_alloc(parent);
 	if (!sai) {
 		rc = -ENOMEM;
 		goto out;
 	}
 
-	sai->sai_ls_all = (rc == LS_FIRST_DOT_DE);
+	sai->sai_ls_all = (first == LS_FIRST_DOT_DE);
 	/*
 	 * if current lli_opendir_key was deauthorized, or dir re-opened by
 	 * another process, don't start statahead, otherwise the newly spawned
@@ -1504,8 +1515,6 @@ static int start_statahead_thread(struct inode *dir, struct dentry *dentry)
 	lli->lli_sai = sai;
 	spin_unlock(&lli->lli_sa_lock);
 
-	atomic_inc(&ll_i2sbi(parent->d_inode)->ll_sa_running);
-
 	CDEBUG(D_READA, "start statahead thread: [pid %d] [parent %pd]\n",
 	       current->pid, parent);
 
@@ -1545,6 +1554,9 @@ static int start_statahead_thread(struct inode *dir, struct dentry *dentry)
 	spin_unlock(&lli->lli_sa_lock);
 	if (sai)
 		ll_sai_free(sai);
+	if (first != LS_NOT_FIRST_DE)
+		atomic_dec(&sbi->ll_sa_running);
+
 	return rc;
 }
 
-- 
1.8.3.1
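The admission check in start_statahead_thread() is an "increment, then compare" counter. A minimal userspace sketch with C11 atomics follows. The struct and functions are hypothetical stand-ins for `ll_sa_running` and `ll_sa_running_max`. Because the patch increments before checking, the `out:` path must decrement whenever `first != LS_NOT_FIRST_DE`, including the -EMFILE case. On success the count is presumably dropped later, when the statahead thread exits; that part is not shown in this hunk.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the ll_sb_info fields used by the patch. */
struct sa_limits {
	atomic_uint running;       /* like sbi->ll_sa_running */
	unsigned int running_max;  /* like sbi->ll_sa_running_max */
};

/*
 * Returns 0 if a new statahead instance may start, -EMFILE otherwise.
 * The counter is bumped before the check, so the caller must drop it on
 * every exit path, including the -EMFILE one.
 */
static int sa_admit(struct sa_limits *l)
{
	/* atomic_fetch_add() returns the old value; +1 mirrors atomic_inc_return() */
	if (atomic_fetch_add(&l->running, 1) + 1 > l->running_max)
		return -EMFILE;
	return 0;
}

static void sa_release(struct sa_limits *l)
{
	atomic_fetch_sub(&l->running, 1);
}
```

One consequence of this shape: because the incremented value is always at least 1, setting `statahead_running_max` to 0 (which the sysfs store accepts) refuses every new instance.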

* [lustre-devel] [PATCH 28/28] lustre: llite: restore lld_nfs_dentry handling
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (26 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 27/28] lustre: llite: control concurrent statahead instances James Simmons
@ 2018-10-14 18:58 ` James Simmons
  2018-10-22  4:36 ` [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 NeilBrown
  28 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-14 18:58 UTC (permalink / raw)
  To: lustre-devel

The port of the patch for LU-3544, which enables open-by-fid by default,
to the Linux Lustre client was done incorrectly. It ended up dropping the
handling of lld_nfs_dentry for the NFS export case. Let's restore it.

Fixes: c1b66fccf986 ("staging: lustre: fid: do open-by-fid by default")
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/llite/file.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
index d80bda4..5df2b87 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -589,6 +589,8 @@ int ll_file_open(struct inode *inode, struct file *file)
 	} else {
 		LASSERT(*och_usecount == 0);
 		if (!it->it_disposition) {
+			struct ll_dentry_data *ldd = ll_d2d(file->f_path.dentry);
+
 			/* We cannot just request lock handle now, new ELC code
 			 * means that one of other OPEN locks for this file
 			 * could be cancelled, and since blocking ast handler
@@ -599,11 +601,24 @@ int ll_file_open(struct inode *inode, struct file *file)
 			/*
 			 * Normally called under two situations:
 			 * 1. NFS export.
-			 * 2. revalidate with IT_OPEN (revalidate doesn't
-			 *    execute this intent any more).
+			 * 2. A race/condition on MDS resulting in no open
+			 *    handle to be returned from LOOKUP|OPEN request,
+			 *    for example if the target entry was a symlink.
 			 *
-			 * Always fetch MDS_OPEN_LOCK if this is not setstripe.
+			 * Only fetch MDS_OPEN_LOCK if this is in NFS path,
+			 * marked by a bit set in ll_iget_for_nfs. Clear the
+			 * bit so that it's not confusing later callers.
 			 *
+			 * NB; when ldd is NULL, it must have come via normal
+			 * lookup path only, since ll_iget_for_nfs always calls
+			 * ll_d_init().
+			 */
+			if (ldd && ldd->lld_nfs_dentry) {
+				ldd->lld_nfs_dentry = 0;
+				it->it_flags |= MDS_OPEN_LOCK;
+			}
+
+			/*
 			 * Always specify MDS_OPEN_BY_FID because we don't want
 			 * to get file with different fid.
 			 */
-- 
1.8.3.1
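The restored handling is a one-shot "test and clear" of a per-dentry marker. A minimal sketch follows. The struct and function are hypothetical, and `SKETCH_MDS_OPEN_LOCK` is a placeholder bit, not the real `MDS_OPEN_LOCK` value. Clearing the marker means only the open that follows the NFS lookup requests the open lock. A NULL `ldd` means the dentry came through the normal lookup path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit for illustration; not the real MDS_OPEN_LOCK value. */
#define SKETCH_MDS_OPEN_LOCK (1u << 0)

/* Hypothetical subset of struct ll_dentry_data. */
struct dentry_data {
	bool nfs_dentry;   /* set by the NFS export path (ll_iget_for_nfs) */
};

/* Consume the NFS marker once: add the open-lock flag, then clear it. */
static unsigned int open_flags_for(struct dentry_data *ldd, unsigned int flags)
{
	if (ldd && ldd->nfs_dentry) {
		ldd->nfs_dentry = false;
		flags |= SKETCH_MDS_OPEN_LOCK;
	}
	return flags;
}
```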

* [lustre-devel] [PATCH 03/28] lustre: ptlrpc: missing barrier before wake_up
  2018-10-14 18:57 ` [lustre-devel] [PATCH 03/28] lustre: ptlrpc: missing barrier before wake_up James Simmons
@ 2018-10-17 22:43   ` NeilBrown
  2018-10-21 22:48     ` James Simmons
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-17 22:43 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

> From: Lai Siyao <lai.siyao@whamcloud.com>
>
> ptlrpc_client_wake_req() misses a memory barrier, which may cause
> strange errors.
>
> Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-8935
> Reviewed-on: https://review.whamcloud.com/26583
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> Reviewed-by: Wang Shilong <wshilong@ddn.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lustre/include/lustre_net.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/staging/lustre/lustre/include/lustre_net.h b/drivers/staging/lustre/lustre/include/lustre_net.h
> index ce7e98c..468a03e 100644
> --- a/drivers/staging/lustre/lustre/include/lustre_net.h
> +++ b/drivers/staging/lustre/lustre/include/lustre_net.h
> @@ -2211,6 +2211,8 @@ static inline int ptlrpc_status_ntoh(int n)
>  static inline void
>  ptlrpc_client_wake_req(struct ptlrpc_request *req)
>  {
> +	/* ensure ptlrpc_register_bulk see rq_resend as set. */
> +	smp_mb();
>  	if (!req->rq_set)
>  		wake_up(&req->rq_reply_waitq);
>  	else

It is good that this memory barrier has a comment, but the comment isn't
very helpful.
There is no matching memory barrier in ptlrpc_register_bulk(), so it
isn't clear what sequencing is important.

And ptl_send_rpc() tests ->rq_resend *before* ptlrpc_register_bulk() is
called (which also tests it).  Presumably these should see the same
value?  So why does the comment refer to ptlrpc_register_bulk() instead
of ptl_send_rpc() ??

It all seems rather confusing, so it is very hard to be sure that the
code is now correct.
Is someone able to explain?

Thanks,
NeilBrown
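The objection above rests on the general rule that a memory barrier only guarantees ordering when the CPU on the other side executes a matching barrier. Below is a minimal userspace sketch of such a pair. It uses C11 fences as stand-ins for the kernel's smp_wmb()/smp_rmb()/smp_mb(). It illustrates the pairing rule only and is not the ptlrpc code. `payload` plays the role of a field like `rq_resend`, and the spin loop stands in for sleeping on a wait queue.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static int payload;            /* plain data, like a request field */
static atomic_bool published;  /* the "wake up" signal */

static void *waker(void *arg)
{
	(void)arg;
	payload = 1;
	/* Writer-side barrier: orders the payload store before the flag store. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&published, true, memory_order_relaxed);
	return NULL;
}

static int waiter(void)
{
	while (!atomic_load_explicit(&published, memory_order_relaxed))
		;  /* spin: stands in for sleeping on a wait queue */
	/*
	 * Reader-side barrier pairing with the fence in waker(). Without it
	 * the payload load is not ordered after the flag load.
	 */
	atomic_thread_fence(memory_order_acquire);
	return payload;
}

static int run_pair(void)
{
	pthread_t t;
	int seen;

	payload = 0;
	atomic_store(&published, false);
	pthread_create(&t, NULL, waker, NULL);
	seen = waiter();
	pthread_join(t, NULL);
	return seen;
}
```

The barrier comment in the patch would be more convincing if it named both halves of such a pair: the store(s) the new smp_mb() orders, and the barrier on the reading side that it pairs with.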



> -- 
> 1.8.3.1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20181018/6b7a3e91/attachment.sig>

* [lustre-devel] [PATCH 04/28] lustre: ptlrpc: Do not assert when bd_nob_transferred != 0
  2018-10-14 18:57 ` [lustre-devel] [PATCH 04/28] lustre: ptlrpc: Do not assert when bd_nob_transferred != 0 James Simmons
@ 2018-10-17 23:13   ` NeilBrown
  2018-10-21 22:44     ` James Simmons
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-17 23:13 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

> From: Doug Oucharek <dougso@me.com>
>
> There is a case in the routine ptlrpc_register_bulk() where we were
> asserting if bd_nob_transferred != 0 when not resending.  There is
> evidence that network errors can create a situation where
> this does happen. So we should not be asserting!
>
> This patch changes that assert to an error return code of -EIO.
>
> Signed-off-by: Doug Oucharek <dougso@me.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9828
> Reviewed-on: https://review.whamcloud.com/28491
> Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> index 27eb1c0..7e7db24 100644
> --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
>  	/* cleanup the state of the bulk for it will be reused */
>  	if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY)
>  		desc->bd_nob_transferred = 0;
> -	else
> -		LASSERT(desc->bd_nob_transferred == 0);
> +	else if (desc->bd_nob_transferred != 0)
> +		/* If the network failed after an RPC was sent, this condition
> +		 * could happen.  Rather than assert (was here before), return
> +		 * an EIO error.
> +		 */
> +		return -EIO;

This looks weird, and the justification is rather lame.
I wonder if this is an attempt to fix the same problem that the smp_mb()
in the previous patch was attempting to fix (and I'm not yet convinced
that either is the correct fix).

NeilBrown


>  
>  	desc->bd_failure = 0;
>  
> -- 
> 1.8.3.1

* [lustre-devel] [PATCH 06/28] lustre: ldlm: ELC shouldn't wait on lock flush
  2018-10-14 18:57 ` [lustre-devel] [PATCH 06/28] lustre: ldlm: ELC shouldn't wait on lock flush James Simmons
@ 2018-10-17 23:20   ` NeilBrown
  2018-10-20 17:09     ` James Simmons
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-17 23:20 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

> From: Andriy Skulysh <c17819@cray.com>
>
> The commit 08fd034670b5 ("staging: lustre: ldlm: revert the changes
> for lock canceling policy") removed the fix for LU-4300 when lru_resize
> is disabled.
>
> Introduce ldlm_cancel_aged_no_wait_policy to be used by ELC.
>
> Signed-off-by: Andriy Skulysh <c17819@cray.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-8578
> Seagate-bug-id: MRP-3662
> Reviewed-on: https://review.whamcloud.com/22286
> Reviewed-by: Vitaly Fertman <c17818@cray.com>
> Reviewed-by: Patrick Farrell <paf@cray.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lustre/ldlm/ldlm_internal.h |  1 -
>  drivers/staging/lustre/lustre/ldlm/ldlm_request.c  | 51 +++++++++++++++-------
>  2 files changed, 35 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h b/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
> index 1d7c727..709c527 100644
> --- a/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
> +++ b/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
> @@ -96,7 +96,6 @@ enum {
>  	LDLM_LRU_FLAG_NO_WAIT	= BIT(4), /* Cancel locks w/o blocking (neither
>  					   * sending nor waiting for any rpcs)
>  					   */
> -	LDLM_LRU_FLAG_LRUR_NO_WAIT = BIT(5), /* LRUR + NO_WAIT */
>  };
>  
>  int ldlm_cancel_lru(struct ldlm_namespace *ns, int nr,
> diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
> index 80260b07..3eb5036 100644
> --- a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
> +++ b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
> @@ -579,8 +579,8 @@ int ldlm_prep_elc_req(struct obd_export *exp, struct ptlrpc_request *req,
>  		req_capsule_filled_sizes(pill, RCL_CLIENT);
>  		avail = ldlm_capsule_handles_avail(pill, RCL_CLIENT, canceloff);
>  
> -		flags = ns_connect_lru_resize(ns) ?
> -			LDLM_LRU_FLAG_LRUR_NO_WAIT : LDLM_LRU_FLAG_AGED;
> +		flags = LDLM_LRU_FLAG_NO_WAIT | ns_connect_lru_resize(ns) ?
> +			LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED;
>  		to_free = !ns_connect_lru_resize(ns) &&
>  			  opc == LDLM_ENQUEUE ? 1 : 0;

Bug.
The commit in SFS-lustre (7ca60f33893) is correct, but you dropped the
parentheses, which introduces a bug.

While the SFS code is correct, it is formatted badly.
It should be

	lru_flags = LDLM_LRU_FLAG_NO_WAIT |
		    (ns_connect_lru_resize(ns) ?
		     LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED);
or similar.

NeilBrown
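The bug is a C precedence issue: `|` binds tighter than `?:`, so the posted expression parses as `(NO_WAIT | lru_resize) ? LRUR : AGED`. Because NO_WAIT is nonzero, the condition is always true. The sketch below reproduces this. `LDLM_LRU_FLAG_NO_WAIT = BIT(4)` is taken from the hunk above. The values for LRUR and AGED are assumptions, but the conclusion holds for any nonzero NO_WAIT.

```c
#include <assert.h>

#define BIT(n) (1u << (n))
/* NO_WAIT as in the hunk above; AGED and LRUR values are assumed. */
#define LDLM_LRU_FLAG_AGED    BIT(0)
#define LDLM_LRU_FLAG_LRUR    BIT(3)
#define LDLM_LRU_FLAG_NO_WAIT BIT(4)

/* As posted: parsed as (NO_WAIT | lru_resize) ? LRUR : AGED. */
static unsigned int flags_as_posted(int lru_resize)
{
	return LDLM_LRU_FLAG_NO_WAIT | lru_resize ?
		LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED;
}

/* As suggested in the review: parenthesize the conditional. */
static unsigned int flags_fixed(int lru_resize)
{
	return LDLM_LRU_FLAG_NO_WAIT |
	       (lru_resize ? LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED);
}
```

With the posted code, a client without lru_resize gets LRUR instead of AGED, and the NO_WAIT bit is lost in both cases. The parenthesized form yields NO_WAIT|AGED and NO_WAIT|LRUR respectively.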

* [lustre-devel] [PATCH 11/28] lustre: llite: use security context if it's enabled in the kernel
  2018-10-14 18:58 ` [lustre-devel] [PATCH 11/28] lustre: llite: use security context if it's enabled in the kernel James Simmons
@ 2018-10-17 23:34   ` NeilBrown
  2018-10-20 17:49     ` James Simmons
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-17 23:34 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

> From: Alex Zhuravlev <bzzz@whamcloud.com>
>
> If it's disabled, then Lustre stops working properly (cannot create
> files, etc.).
>
> Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9578
> Reviewed-on: https://review.whamcloud.com/27364
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> Reviewed-by: Chris Horn <hornc@cray.com>
> Reviewed-by: James Simmons <uja.ornl@yahoo.com>
> Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lustre/llite/llite_lib.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
> index 22b545e..153aa12 100644
> --- a/drivers/staging/lustre/lustre/llite/llite_lib.c
> +++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
> @@ -243,8 +243,9 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
>  	if (sbi->ll_flags & LL_SBI_ALWAYS_PING)
>  		data->ocd_connect_flags &= ~OBD_CONNECT_PINGLESS;
>  
> +#ifdef CONFIG_SECURITY
>  	data->ocd_connect_flags2 |= OBD_CONNECT2_FILE_SECCTX;
> -
> +#endif

Policy is to avoid #ifdef in .c files where possible.
If we put something like
#ifdef CONFIG_SECURITY
 #define OBD_CONNECT2_FILE_SECURITY (OBD_CONNECT2_FILE_SECCTX)
#else
 #define OBD_CONNECT2_FILE_SECURITY (0)
#endif

in a .h file and then used OBD_CONNECT2_FILE_SECURITY both here and in
obd_connect_has_secctx(), the latter could be optimized away by the
compiler.  It wouldn't be a big win, I guess, as it is only used once in
a trivial context.

NeilBrown


>  	data->ocd_brw_size = MD_MAX_BRW_SIZE;
>  
>  	err = obd_connect(NULL, &sbi->ll_md_exp, sbi->ll_md_obd,
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 15/28] lustre: llite: fix for stat under kthread and X86_X32
  2018-10-14 18:58 ` [lustre-devel] [PATCH 15/28] lustre: llite: fix for stat under kthread and X86_X32 James Simmons
@ 2018-10-18  1:48   ` NeilBrown
  2018-10-22  3:58     ` NeilBrown
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-18  1:48 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

> From: Frank Zago <fzago@cray.com>
>
> Under the following conditions, ll_getattr will flatten the inode
> number when it shouldn't:
>
>  - CONFIG_X86_X32 is defined, even if the X32 ABI is never
>    actually used,
>  - ll_getattr is called from a kernel thread (through vfs_getattr, for
>    instance).
>
> As a result, the same file gets different inode numbers depending on
> whether it is stat'ed from a kernel thread or from a syscall, for
> instance 4198401 vs. 144115205272502273.
>
> ll_getattr calls ll_need_32bit_api to determine whether the task is 32
> bits. For the kthread+X86_X32 combination, that function reports that
> the task is 32 bits, which is incorrect, as the kernel is 64 bits.
>
> The solution is to check whether the call is from a kernel thread
> (which is 64 bits) and act accordingly.
>
> Signed-off-by: Frank Zago <fzago@cray.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9468
> Reviewed-on: https://review.whamcloud.com/26992
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lustre/llite/dir.c          |  6 +++---
>  drivers/staging/lustre/lustre/llite/lcommon_cl.c   |  2 +-
>  .../staging/lustre/lustre/llite/llite_internal.h   | 22 +++++++++++++++++-----
>  3 files changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
> index 231b351..19c5e9c 100644
> --- a/drivers/staging/lustre/lustre/llite/dir.c
> +++ b/drivers/staging/lustre/lustre/llite/dir.c
> @@ -202,7 +202,7 @@ int ll_dir_read(struct inode *inode, __u64 *ppos, struct md_op_data *op_data,
>  {
>  	struct ll_sb_info    *sbi	= ll_i2sbi(inode);
>  	__u64		   pos		= *ppos;
> -	int		   is_api32 = ll_need_32bit_api(sbi);
> +	bool is_api32 = ll_need_32bit_api(sbi);
>  	int		   is_hash64 = sbi->ll_flags & LL_SBI_64BIT_HASH;
>  	struct page	  *page;
>  	bool		   done = false;
> @@ -296,7 +296,7 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
>  	struct ll_sb_info	*sbi	= ll_i2sbi(inode);
>  	__u64 pos = lfd ? lfd->lfd_pos : 0;
>  	int			hash64	= sbi->ll_flags & LL_SBI_64BIT_HASH;
> -	int			api32	= ll_need_32bit_api(sbi);
> +	bool api32 = ll_need_32bit_api(sbi);
>  	struct md_op_data *op_data;
>  	int			rc;
>  
> @@ -1674,7 +1674,7 @@ static loff_t ll_dir_seek(struct file *file, loff_t offset, int origin)
>  	struct inode *inode = file->f_mapping->host;
>  	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
>  	struct ll_sb_info *sbi = ll_i2sbi(inode);
> -	int api32 = ll_need_32bit_api(sbi);
> +	bool api32 = ll_need_32bit_api(sbi);
>  	loff_t ret = -EINVAL;
>  
>  	switch (origin) {
> diff --git a/drivers/staging/lustre/lustre/llite/lcommon_cl.c b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
> index 30f17ea..20a3c74 100644
> --- a/drivers/staging/lustre/lustre/llite/lcommon_cl.c
> +++ b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
> @@ -267,7 +267,7 @@ void cl_inode_fini(struct inode *inode)
>  /**
>   * build inode number from passed @fid
>   */
> -__u64 cl_fid_build_ino(const struct lu_fid *fid, int api32)
> +u64 cl_fid_build_ino(const struct lu_fid *fid, bool api32)
>  {
>  	if (BITS_PER_LONG == 32 || api32)
>  		return fid_flatten32(fid);
> diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
> index dcb2fed..796a8ae 100644
> --- a/drivers/staging/lustre/lustre/llite/llite_internal.h
> +++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
> @@ -651,13 +651,25 @@ static inline struct inode *ll_info2i(struct ll_inode_info *lli)
>  __u32 ll_i2suppgid(struct inode *i);
>  void ll_i2gids(__u32 *suppgids, struct inode *i1, struct inode *i2);
>  
> -static inline int ll_need_32bit_api(struct ll_sb_info *sbi)
> +static inline bool ll_need_32bit_api(struct ll_sb_info *sbi)
>  {
>  #if BITS_PER_LONG == 32
> -	return 1;
> +	return true;
>  #elif defined(CONFIG_COMPAT)
> -	return unlikely(in_compat_syscall() ||
> -			(sbi->ll_flags & LL_SBI_32BIT_API));
> +	if (unlikely(sbi->ll_flags & LL_SBI_32BIT_API))
> +		return true;
> +
> +#ifdef CONFIG_X86_X32
> +	/* in_compat_syscall() returns true when called from a kthread
> +	 * and CONFIG_X86_X32 is enabled, which is wrong. So check
> +	 * whether the caller comes from a syscall (ie. not a kthread)
> +	 * before calling in_compat_syscall().
> +	 */
> +	if (current->flags & PF_KTHREAD)
> +		return false;
> +#endif

This is wrong.  We should fix in_compat_syscall(), not work around it
here.
(And then there is the fact that the patch changes 'int' to 'bool'
without explaining that in the change description.)

I've sent a query to some relevant people (Cc:ed to James) to ask about
fixing in_compat_syscall().

NeilBrown


> +
> +	return unlikely(in_compat_syscall());
>  #else
>  	return unlikely(sbi->ll_flags & LL_SBI_32BIT_API);
>  #endif
> @@ -1353,7 +1365,7 @@ int cl_setattr_ost(struct cl_object *obj, const struct iattr *attr,
>  int cl_file_inode_init(struct inode *inode, struct lustre_md *md);
>  void cl_inode_fini(struct inode *inode);
>  
> -__u64 cl_fid_build_ino(const struct lu_fid *fid, int api32);
> +u64 cl_fid_build_ino(const struct lu_fid *fid, bool api32);
>  __u32 cl_fid_build_gen(const struct lu_fid *fid);
>  
>  #endif /* LLITE_INTERNAL_H */
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 16/28] lustre: statahead: missing barrier before wake_up
  2018-10-14 18:58 ` [lustre-devel] [PATCH 16/28] lustre: statahead: missing barrier before wake_up James Simmons
@ 2018-10-18  2:00   ` NeilBrown
  2018-10-21 22:52     ` James Simmons
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-18  2:00 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

> From: Lai Siyao <lai.siyao@whamcloud.com>
>
> A barrier is missing before wake_up() in ll_statahead_interpret(),
> which may cause 'ls' to hang. Under the right conditions, a basic 'ls'
> can fail. The debug logs show:
>
> statahead.c:683:ll_statahead_interpret()) sa_entry software rc -13
> statahead.c:1666:ll_statahead()) revalidate statahead software: -11.
>
> Evidently the statahead failure didn't notify the 'ls' process in
> time. The mi_cbdata can be stale, so add a barrier before calling
> wake_up().
>
> Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
> Signed-off-by: Bob Glossman <bob.glossman@intel.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9210
> Reviewed-on: https://review.whamcloud.com/27330
> Reviewed-by: Nathaniel Clark <nclark@whamcloud.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lustre/llite/statahead.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/staging/lustre/lustre/llite/statahead.c b/drivers/staging/lustre/lustre/llite/statahead.c
> index 1ad308c..0174a4c 100644
> --- a/drivers/staging/lustre/lustre/llite/statahead.c
> +++ b/drivers/staging/lustre/lustre/llite/statahead.c
> @@ -680,8 +680,14 @@ static int ll_statahead_interpret(struct ptlrpc_request *req,
>  
>  	spin_lock(&lli->lli_sa_lock);
>  	if (rc) {
> -		if (__sa_make_ready(sai, entry, rc))
> +		if (__sa_make_ready(sai, entry, rc)) {
> +			/* LU-9210 : Under the right conditions even 'ls'
> +			 * can cause the statahead to fail. Using a memory
> +			 * barrier resolves this issue.
> +			 */
> +			smp_mb();
>  			wake_up(&sai->sai_waitq);
> +		}
>  	} else {
>  		int first = 0;
>  		entry->se_minfo = minfo;
> -- 
> 1.8.3.1

Again, this is a fairly lame comment to justify the smp_mb().
It appears to me that the issue is most likely the value of
entry->se_state.
__sa_make_ready() sets this and revalidate_statahead_dentry tests it
after waiting on sai_waitq.
So I think it would be best if we changed __sa_make_ready() to

	smp_store_release(&entry->se_state, ret < 0 ? SA_ENTRY_INVA : SA_ENTRY_SUCC);

and in revalidate_statahead_dentry() have

	if (smp_load_acquire(&entry->se_state) == SA_ENTRY_SUCC &&
	    entry->se_inode) {

This would make it obvious which variable was important, and would show
the paired synchronization points.

NeilBrown
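A userspace C11 sketch of the release/acquire pairing suggested above, using <stdatomic.h> and pthreads (older toolchains need -pthread). struct sa_entry_model, make_ready(), wait_ready(), run_demo() and the enum values are hypothetical stand-ins for the Lustre code, and the waiter spins instead of sleeping on sai_waitq to keep the sketch short.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Names mirror the review; values are assumptions. */
enum sa_entry_state { SA_ENTRY_INIT = 0, SA_ENTRY_SUCC = 1, SA_ENTRY_INVA = 2 };

struct sa_entry_model {
	_Atomic int se_state;	/* published with release, read with acquire */
	void *se_inode;		/* plain field, ordered by se_state */
};

static int fake_inode = 42;

/* Completion side, in the role of __sa_make_ready(). */
static void *make_ready(void *arg)
{
	struct sa_entry_model *entry = arg;

	entry->se_inode = &fake_inode;	/* plain store ... */
	/* ... guaranteed visible to any thread that acquires SUCC. */
	atomic_store_explicit(&entry->se_state, SA_ENTRY_SUCC,
			      memory_order_release);
	return NULL;
}

/* Waiter side, in the role of revalidate_statahead_dentry(). */
static void *wait_ready(struct sa_entry_model *entry)
{
	int state;

	assert(entry != NULL);
	while ((state = atomic_load_explicit(&entry->se_state,
					     memory_order_acquire)) == SA_ENTRY_INIT)
		;	/* spin until the entry leaves the INIT state */

	/* The acquire load pairs with the release store, so se_inode holds
	 * the value written before SUCC was published. */
	if (state == SA_ENTRY_SUCC && entry->se_inode)
		return entry->se_inode;
	return NULL;
}

/* Runs one producer/consumer round and returns what the waiter saw. */
static void *run_demo(void)
{
	struct sa_entry_model entry = { .se_inode = NULL };
	pthread_t t;
	void *seen;

	atomic_init(&entry.se_state, SA_ENTRY_INIT);
	if (pthread_create(&t, NULL, make_ready, &entry) != 0)
		return NULL;
	seen = wait_ready(&entry);
	pthread_join(t, NULL);
	return seen;
}
```

Compared with a bare smp_mb() before wake_up(), the paired release/acquire names the variable being published and marks both ends of the synchronization.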

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 20/28] lustre: obdclass: deprecate OBD_GET_VERSION ioctl
  2018-10-14 18:58 ` [lustre-devel] [PATCH 20/28] lustre: obdclass: deprecate OBD_GET_VERSION ioctl James Simmons
@ 2018-10-18  2:12   ` NeilBrown
  2018-10-20 18:52     ` James Simmons
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-18  2:12 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:
>  
> +		if (!warned) {
> +			warned = true;
> +			CWARN("%s: ioctl(OBD_GET_VERSION) is deprecated, use llapi_get_version_string() and/or relink\n",
> +			      current->comm);
> +		}

Is there a good reason not to use WARN_ON_ONCE() here?

Thanks,
NeilBrown
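For comparison, a userspace sketch of the "once" pattern. The kernel's WARN_ON_ONCE() and pr_warn_once() both keep a per-call-site static flag; note that WARN_ON_ONCE() also dumps a stack trace, while pr_warn_once() only prints the message. WARN_MSG_ONCE(), warnings_emitted and obd_get_version_ioctl() below are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int warnings_emitted;	/* counts printed messages for the demo */

/* Each expansion gets its own static flag, so every call site warns at
 * most once, the same idea the kernel's *_once helpers rely on. */
#define WARN_MSG_ONCE(fmt, ...)					\
	do {							\
		static bool once_warned;			\
		if (!once_warned) {				\
			once_warned = true;			\
			warnings_emitted++;			\
			fprintf(stderr, fmt, __VA_ARGS__);	\
		}						\
	} while (0)

/* Stand-in for the deprecated OBD_GET_VERSION ioctl path. */
static int obd_get_version_ioctl(const char *comm)
{
	assert(comm != NULL);
	WARN_MSG_ONCE("%s: ioctl(OBD_GET_VERSION) is deprecated, use llapi_get_version_string() and/or relink\n",
		      comm);
	return 0;
}
```

The open-coded 'static bool warned' in the patch does the same thing; a *_once helper simply removes the boilerplate.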

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 21/28] lustre: llite: enhance vvp_dev data structure naming
  2018-10-14 18:58 ` [lustre-devel] [PATCH 21/28] lustre: llite: enhance vvp_dev data structure naming James Simmons
@ 2018-10-18  2:15   ` NeilBrown
  2018-10-20 18:55     ` James Simmons
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-18  2:15 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

> The new code that added struct seq_private to the vvp_dev.c code
> has very generic naming which doesn't fit the lustre / kernel style.
> See http://wiki.lustre.org/Lustre_Coding_Style_Guidelines for the
> naming conventions. Rename the struct seq_private and its fields.

The guidelines say:

  unique member names for global structures, using a prefix to identify
  the parent structure type, helps readability.

As this structure is local to vvp_dev.c, I don't think of it as a
"global structure" and so I don't think the rule applies.

But I don't really care.

NeilBrown


>
> Signed-off-by: James Simmons <uja.ornl@yahoo.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-8066
> Reviewed-on: https://review.whamcloud.com/33009
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lustre/llite/vvp_dev.c | 54 ++++++++++++++-------------
>  1 file changed, 28 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/llite/vvp_dev.c b/drivers/staging/lustre/lustre/llite/vvp_dev.c
> index 31dc3c0..8cc981b 100644
> --- a/drivers/staging/lustre/lustre/llite/vvp_dev.c
> +++ b/drivers/staging/lustre/lustre/llite/vvp_dev.c
> @@ -391,11 +391,11 @@ struct vvp_pgcache_id {
>  	struct lu_object_header *vpi_obj;
>  };
>  
> -struct seq_private {
> -	struct ll_sb_info	*sbi;
> -	struct lu_env		*env;
> -	u16			refcheck;
> -	struct cl_object	*clob;
> +struct vvp_seq_private {
> +	struct ll_sb_info	*vsp_sbi;
> +	struct lu_env		*vsp_env;
> +	u16			vsp_refcheck;
> +	struct cl_object	*vsp_clob;
>  };
>  
>  static void vvp_pgcache_id_unpack(loff_t pos, struct vvp_pgcache_id *id)
> @@ -542,52 +542,54 @@ static void vvp_pgcache_page_show(const struct lu_env *env,
>  
>  static int vvp_pgcache_show(struct seq_file *f, void *v)
>  {
> -	struct seq_private	*priv = f->private;
> +	struct vvp_seq_private *priv = f->private;
>  	struct page		*vmpage = v;
>  	struct cl_page		*page;
>  
>  	seq_printf(f, "%8lx@" DFID ": ", vmpage->index,
> -		   PFID(lu_object_fid(&priv->clob->co_lu)));
> +		   PFID(lu_object_fid(&priv->vsp_clob->co_lu)));
>  	lock_page(vmpage);
> -	page = cl_vmpage_page(vmpage, priv->clob);
> +	page = cl_vmpage_page(vmpage, priv->vsp_clob);
>  	unlock_page(vmpage);
>  	put_page(vmpage);
>  
>  	if (page) {
> -		vvp_pgcache_page_show(priv->env, f, page);
> -		cl_page_put(priv->env, page);
> +		vvp_pgcache_page_show(priv->vsp_env, f, page);
> +		cl_page_put(priv->vsp_env, page);
>  	} else {
>  		seq_puts(f, "missing\n");
>  	}
> -	lu_object_ref_del(&priv->clob->co_lu, "dump", current);
> -	cl_object_put(priv->env, priv->clob);
> +	lu_object_ref_del(&priv->vsp_clob->co_lu, "dump", current);
> +	cl_object_put(priv->vsp_env, priv->vsp_clob);
>  
>  	return 0;
>  }
>  
>  static void *vvp_pgcache_start(struct seq_file *f, loff_t *pos)
>  {
> -	struct seq_private	*priv = f->private;
> +	struct vvp_seq_private *priv = f->private;
>  	struct page *ret;
>  
> -	if (priv->sbi->ll_site->ls_obj_hash->hs_cur_bits >
> +	if (priv->vsp_sbi->ll_site->ls_obj_hash->hs_cur_bits >
>  	    64 - PGC_OBJ_SHIFT)
>  		ret = ERR_PTR(-EFBIG);
>  	else
> -		ret = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev,
> -				       &priv->clob, pos);
> +		ret = vvp_pgcache_find(priv->vsp_env,
> +				       &priv->vsp_sbi->ll_cl->cd_lu_dev,
> +				       &priv->vsp_clob, pos);
>  
>  	return ret;
>  }
>  
>  static void *vvp_pgcache_next(struct seq_file *f, void *v, loff_t *pos)
>  {
> -	struct seq_private *priv = f->private;
> +	struct vvp_seq_private *priv = f->private;
>  	struct page *ret;
>  
>  	*pos += 1;
> -	ret = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev,
> -			       &priv->clob, pos);
> +	ret = vvp_pgcache_find(priv->vsp_env,
> +			       &priv->vsp_sbi->ll_cl->cd_lu_dev,
> +			       &priv->vsp_clob, pos);
>  	return ret;
>  }
>  
> @@ -605,16 +607,16 @@ static void vvp_pgcache_stop(struct seq_file *f, void *v)
>  
>  static int vvp_dump_pgcache_seq_open(struct inode *inode, struct file *filp)
>  {
> -	struct seq_private *priv;
> +	struct vvp_seq_private *priv;
>  
>  	priv = __seq_open_private(filp, &vvp_pgcache_ops, sizeof(*priv));
>  	if (!priv)
>  		return -ENOMEM;
>  
> -	priv->sbi = inode->i_private;
> -	priv->env = cl_env_get(&priv->refcheck);
> -	if (IS_ERR(priv->env)) {
> -		int err = PTR_ERR(priv->env);
> +	priv->vsp_sbi = inode->i_private;
> +	priv->vsp_env = cl_env_get(&priv->vsp_refcheck);
> +	if (IS_ERR(priv->vsp_env)) {
> +		int err = PTR_ERR(priv->vsp_env);
>  
>  		seq_release_private(inode, filp);
>  		return err;
> @@ -625,9 +627,9 @@ static int vvp_dump_pgcache_seq_open(struct inode *inode, struct file *filp)
>  static int vvp_dump_pgcache_seq_release(struct inode *inode, struct file *file)
>  {
>  	struct seq_file *seq = file->private_data;
> -	struct seq_private *priv = seq->private;
> +	struct vvp_seq_private *priv = seq->private;
>  
> -	cl_env_put(priv->env, &priv->refcheck);
> +	cl_env_put(priv->vsp_env, &priv->vsp_refcheck);
>  	return seq_release_private(inode, file);
>  }
>  
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 06/28] lustre: ldlm: ELC shouldn't wait on lock flush
  2018-10-17 23:20   ` NeilBrown
@ 2018-10-20 17:09     ` James Simmons
  2018-10-22  3:44       ` NeilBrown
  0 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-20 17:09 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Oct 14 2018, James Simmons wrote:
> 
> > From: Andriy Skulysh <c17819@cray.com>
> >
> > The commit 08fd034670b5 ("staging: lustre: ldlm: revert the changes
> > for lock canceling policy") removed the fix for LU-4300 when lru_resize
> > is disabled.
> >
> > Introduce ldlm_cancel_aged_no_wait_policy to be used by ELC.
> >
> > Signed-off-by: Andriy Skulysh <c17819@cray.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-8578
> > Seagate-bug-id: MRP-3662
> > Reviewed-on: https://review.whamcloud.com/22286
> > Reviewed-by: Vitaly Fertman <c17818@cray.com>
> > Reviewed-by: Patrick Farrell <paf@cray.com>
> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  drivers/staging/lustre/lustre/ldlm/ldlm_internal.h |  1 -
> >  drivers/staging/lustre/lustre/ldlm/ldlm_request.c  | 51 +++++++++++++++-------
> >  2 files changed, 35 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h b/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
> > index 1d7c727..709c527 100644
> > --- a/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
> > +++ b/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
> > @@ -96,7 +96,6 @@ enum {
> >  	LDLM_LRU_FLAG_NO_WAIT	= BIT(4), /* Cancel locks w/o blocking (neither
> >  					   * sending nor waiting for any rpcs)
> >  					   */
> > -	LDLM_LRU_FLAG_LRUR_NO_WAIT = BIT(5), /* LRUR + NO_WAIT */
> >  };
> >  
> >  int ldlm_cancel_lru(struct ldlm_namespace *ns, int nr,
> > diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
> > index 80260b07..3eb5036 100644
> > --- a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
> > +++ b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
> > @@ -579,8 +579,8 @@ int ldlm_prep_elc_req(struct obd_export *exp, struct ptlrpc_request *req,
> >  		req_capsule_filled_sizes(pill, RCL_CLIENT);
> >  		avail = ldlm_capsule_handles_avail(pill, RCL_CLIENT, canceloff);
> >  
> > -		flags = ns_connect_lru_resize(ns) ?
> > -			LDLM_LRU_FLAG_LRUR_NO_WAIT : LDLM_LRU_FLAG_AGED;
> > +		flags = LDLM_LRU_FLAG_NO_WAIT | ns_connect_lru_resize(ns) ?
> > +			LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED;
> >  		to_free = !ns_connect_lru_resize(ns) &&
> >  			  opc == LDLM_ENQUEUE ? 1 : 0;
> 
> Bug.
> The commit in SFS-lustre (7ca60f33893) is correct, but you dropped the
> parentheses, which introduces a bug.
> 
> While the SFS code is correct, it is formatted badly.
> It should be
> 
> 	lru_flags = LDLM_LRU_FLAG_NO_WAIT |
> 		    (ns_connect_lru_resize(ns) ?
> 		     LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED);
> or similar.

Thanks for finding that. Shall I submit another patch to fix that or will
you fix it up when you apply it to lustre-testing?
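The precedence bug is easy to demonstrate in isolation: '|' binds tighter than '?:', so the posted line computes (NO_WAIT | lru_resize) ? LRUR : AGED. In this sketch LDLM_LRU_FLAG_NO_WAIT = BIT(4) comes from the quoted ldlm_internal.h hunk, while the LRUR and AGED values and both helper functions are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define BIT(n)	(1U << (n))

/* NO_WAIT matches the quoted header; AGED and LRUR values are assumed. */
enum {
	LDLM_LRU_FLAG_AGED	= BIT(0),
	LDLM_LRU_FLAG_LRUR	= BIT(3),
	LDLM_LRU_FLAG_NO_WAIT	= BIT(4),
};

/* How the compiler parses the line as posted: the condition is
 * (NO_WAIT | lru_resize), which is always non-zero. */
static unsigned int elc_flags_as_posted(bool lru_resize)
{
	return (LDLM_LRU_FLAG_NO_WAIT | lru_resize) ?
		LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED;
}

/* With the parentheses restored, NO_WAIT is always set and the
 * condition only chooses between LRUR and AGED. */
static unsigned int elc_flags_fixed(bool lru_resize)
{
	unsigned int flags = LDLM_LRU_FLAG_NO_WAIT |
			     (lru_resize ? LDLM_LRU_FLAG_LRUR :
					   LDLM_LRU_FLAG_AGED);

	assert(flags & LDLM_LRU_FLAG_NO_WAIT);
	return flags;
}
```

As posted, ELC would always use the LRUR policy, even without lru_resize, and would never get NO_WAIT, which defeats the point of the patch.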

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 11/28] lustre: llite: use security context if it's enabled in the kernel
  2018-10-17 23:34   ` NeilBrown
@ 2018-10-20 17:49     ` James Simmons
  2018-10-22  3:47       ` NeilBrown
  0 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-20 17:49 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Oct 14 2018, James Simmons wrote:
> 
> > From: Alex Zhuravlev <bzzz@whamcloud.com>
> >
> > If it's disabled, then Lustre stops working properly (cannot create
> > files, etc.)
> >
> > Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9578
> > Reviewed-on: https://review.whamcloud.com/27364
> > Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> > Reviewed-by: Chris Horn <hornc@cray.com>
> > Reviewed-by: James Simmons <uja.ornl@yahoo.com>
> > Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  drivers/staging/lustre/lustre/llite/llite_lib.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
> > index 22b545e..153aa12 100644
> > --- a/drivers/staging/lustre/lustre/llite/llite_lib.c
> > +++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
> > @@ -243,8 +243,9 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
> >  	if (sbi->ll_flags & LL_SBI_ALWAYS_PING)
> >  		data->ocd_connect_flags &= ~OBD_CONNECT_PINGLESS;
> >  
> > +#ifdef CONFIG_SECURITY
> >  	data->ocd_connect_flags2 |= OBD_CONNECT2_FILE_SECCTX;
> > -
> > +#endif
> 
> Policy is to avoid #ifdef in .c files where possible.
> If we put something like
> #ifdef CONFIG_SECURITY
>  #define OBD_CONNECT2_FILE_SECURITY (OBD_CONNECT2_FILE_SECCTX)
> #else
>  #define OBD_CONNECT2_FILE_SECURITY (0)
> #endif
> 
> in a .h file, then use OBD_CONNECT2_FILE_SECURITY both here and in
> obd_connect_has_secctx(),
> then the latter could be optimized away by the compiler.  It wouldn't
> be a big win, I guess, as it is only used once in a trivial context.

I suggest that we move obd_connect_has_secctx() to llite_internal.h. Also,
that function should return bool. In addition, we could create an inline
function obd_connect_set_secctx() in llite_internal.h. I will submit a
patch for the OpenSFS branch. Shall I redo this patch or submit a cleanup later?

> NeilBrown
> 
> 
> >  	data->ocd_brw_size = MD_MAX_BRW_SIZE;
> >  
> >  	err = obd_connect(NULL, &sbi->ll_md_exp, sbi->ll_md_obd,
> > -- 
> > 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 20/28] lustre: obdclass: deprecate OBD_GET_VERSION ioctl
  2018-10-18  2:12   ` NeilBrown
@ 2018-10-20 18:52     ` James Simmons
  2018-10-22  4:08       ` NeilBrown
  0 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-20 18:52 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Oct 14 2018, James Simmons wrote:
> >  
> > +		if (!warned) {
> > +			warned = true;
> > +			CWARN("%s: ioctl(OBD_GET_VERSION) is deprecated, use llapi_get_version_string() and/or relink\n",
> > +			      current->comm);
> > +		}
> 
> Is there a good reason not to use WARN_ON_ONCE() here?

Oh, that is much nicer. I didn't know about it. Shall I submit a new
patch, or will it be changed when applied to lustre-testing?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 21/28] lustre: llite: enhance vvp_dev data structure naming
  2018-10-18  2:15   ` NeilBrown
@ 2018-10-20 18:55     ` James Simmons
  0 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-20 18:55 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Oct 14 2018, James Simmons wrote:
> 
> > The new code that added struct seq_private to the vvp_dev.c code
> > has very generic naming which doesn't fit the lustre / kernel style.
> > See http://wiki.lustre.org/Lustre_Coding_Style_Guidelines for the
> > naming conventions. Rename the struct seq_private and its fields.
> 
> The guidelines say:
> 
>   unique member names for global structures, using a prefix to identify
>   the parent structure type, helps readability.
> 
> As this structure is local to vvp_dev.c, I don't think of it as a
> "global structure" and so I don't think the rule applies.
> 
> But I don't really care.

The guideline needs to be updated to say "member names for all
structures". Some Lustre developers handle style issues as strictly
as Greg did in staging. It's just a different flavor :-)

Just to let you know, your dump cache tree walk patch in lustre-wip
landed in the OpenSFS tree. I plan to push another set of patches and
can include it with the reviews.

> > Signed-off-by: James Simmons <uja.ornl@yahoo.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-8066
> > Reviewed-on: https://review.whamcloud.com/33009
> > Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> > Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  drivers/staging/lustre/lustre/llite/vvp_dev.c | 54 ++++++++++++++-------------
> >  1 file changed, 28 insertions(+), 26 deletions(-)
> >
> > diff --git a/drivers/staging/lustre/lustre/llite/vvp_dev.c b/drivers/staging/lustre/lustre/llite/vvp_dev.c
> > index 31dc3c0..8cc981b 100644
> > --- a/drivers/staging/lustre/lustre/llite/vvp_dev.c
> > +++ b/drivers/staging/lustre/lustre/llite/vvp_dev.c
> > @@ -391,11 +391,11 @@ struct vvp_pgcache_id {
> >  	struct lu_object_header *vpi_obj;
> >  };
> >  
> > -struct seq_private {
> > -	struct ll_sb_info	*sbi;
> > -	struct lu_env		*env;
> > -	u16			refcheck;
> > -	struct cl_object	*clob;
> > +struct vvp_seq_private {
> > +	struct ll_sb_info	*vsp_sbi;
> > +	struct lu_env		*vsp_env;
> > +	u16			vsp_refcheck;
> > +	struct cl_object	*vsp_clob;
> >  };
> >  
> >  static void vvp_pgcache_id_unpack(loff_t pos, struct vvp_pgcache_id *id)
> > @@ -542,52 +542,54 @@ static void vvp_pgcache_page_show(const struct lu_env *env,
> >  
> >  static int vvp_pgcache_show(struct seq_file *f, void *v)
> >  {
> > -	struct seq_private	*priv = f->private;
> > +	struct vvp_seq_private *priv = f->private;
> >  	struct page		*vmpage = v;
> >  	struct cl_page		*page;
> >  
> >  	seq_printf(f, "%8lx@" DFID ": ", vmpage->index,
> > -		   PFID(lu_object_fid(&priv->clob->co_lu)));
> > +		   PFID(lu_object_fid(&priv->vsp_clob->co_lu)));
> >  	lock_page(vmpage);
> > -	page = cl_vmpage_page(vmpage, priv->clob);
> > +	page = cl_vmpage_page(vmpage, priv->vsp_clob);
> >  	unlock_page(vmpage);
> >  	put_page(vmpage);
> >  
> >  	if (page) {
> > -		vvp_pgcache_page_show(priv->env, f, page);
> > -		cl_page_put(priv->env, page);
> > +		vvp_pgcache_page_show(priv->vsp_env, f, page);
> > +		cl_page_put(priv->vsp_env, page);
> >  	} else {
> >  		seq_puts(f, "missing\n");
> >  	}
> > -	lu_object_ref_del(&priv->clob->co_lu, "dump", current);
> > -	cl_object_put(priv->env, priv->clob);
> > +	lu_object_ref_del(&priv->vsp_clob->co_lu, "dump", current);
> > +	cl_object_put(priv->vsp_env, priv->vsp_clob);
> >  
> >  	return 0;
> >  }
> >  
> >  static void *vvp_pgcache_start(struct seq_file *f, loff_t *pos)
> >  {
> > -	struct seq_private	*priv = f->private;
> > +	struct vvp_seq_private *priv = f->private;
> >  	struct page *ret;
> >  
> > -	if (priv->sbi->ll_site->ls_obj_hash->hs_cur_bits >
> > +	if (priv->vsp_sbi->ll_site->ls_obj_hash->hs_cur_bits >
> >  	    64 - PGC_OBJ_SHIFT)
> >  		ret = ERR_PTR(-EFBIG);
> >  	else
> > -		ret = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev,
> > -				       &priv->clob, pos);
> > +		ret = vvp_pgcache_find(priv->vsp_env,
> > +				       &priv->vsp_sbi->ll_cl->cd_lu_dev,
> > +				       &priv->vsp_clob, pos);
> >  
> >  	return ret;
> >  }
> >  
> >  static void *vvp_pgcache_next(struct seq_file *f, void *v, loff_t *pos)
> >  {
> > -	struct seq_private *priv = f->private;
> > +	struct vvp_seq_private *priv = f->private;
> >  	struct page *ret;
> >  
> >  	*pos += 1;
> > -	ret = vvp_pgcache_find(priv->env, &priv->sbi->ll_cl->cd_lu_dev,
> > -			       &priv->clob, pos);
> > +	ret = vvp_pgcache_find(priv->vsp_env,
> > +			       &priv->vsp_sbi->ll_cl->cd_lu_dev,
> > +			       &priv->vsp_clob, pos);
> >  	return ret;
> >  }
> >  
> > @@ -605,16 +607,16 @@ static void vvp_pgcache_stop(struct seq_file *f, void *v)
> >  
> >  static int vvp_dump_pgcache_seq_open(struct inode *inode, struct file *filp)
> >  {
> > -	struct seq_private *priv;
> > +	struct vvp_seq_private *priv;
> >  
> >  	priv = __seq_open_private(filp, &vvp_pgcache_ops, sizeof(*priv));
> >  	if (!priv)
> >  		return -ENOMEM;
> >  
> > -	priv->sbi = inode->i_private;
> > -	priv->env = cl_env_get(&priv->refcheck);
> > -	if (IS_ERR(priv->env)) {
> > -		int err = PTR_ERR(priv->env);
> > +	priv->vsp_sbi = inode->i_private;
> > +	priv->vsp_env = cl_env_get(&priv->vsp_refcheck);
> > +	if (IS_ERR(priv->vsp_env)) {
> > +		int err = PTR_ERR(priv->vsp_env);
> >  
> >  		seq_release_private(inode, filp);
> >  		return err;
> > @@ -625,9 +627,9 @@ static int vvp_dump_pgcache_seq_open(struct inode *inode, struct file *filp)
> >  static int vvp_dump_pgcache_seq_release(struct inode *inode, struct file *file)
> >  {
> >  	struct seq_file *seq = file->private_data;
> > -	struct seq_private *priv = seq->private;
> > +	struct vvp_seq_private *priv = seq->private;
> >  
> > -	cl_env_put(priv->env, &priv->refcheck);
> > +	cl_env_put(priv->vsp_env, &priv->vsp_refcheck);
> >  	return seq_release_private(inode, file);
> >  }
> >  
> > -- 
> > 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 04/28] lustre: ptlrpc: Do not assert when bd_nob_transferred != 0
  2018-10-17 23:13   ` NeilBrown
@ 2018-10-21 22:44     ` James Simmons
  2018-10-22  3:26       ` NeilBrown
  0 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-21 22:44 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Oct 14 2018, James Simmons wrote:
> 
> > From: Doug Oucharek <dougso@me.com>
> >
> > There is a case in the routine ptlrpc_register_bulk() where we were
> > asserting if bd_nob_transferred != 0 when not resending.  There is
> > evidence that network errors can create a situation where
> > this does happen. So we should not be asserting!
> >
> > This patch changes that assert to an error return code of -EIO.
> >
> > Signed-off-by: Doug Oucharek <dougso@me.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828
> > Reviewed-on: https://review.whamcloud.com/28491
> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> > Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> > index 27eb1c0..7e7db24 100644
> > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
> >  	/* cleanup the state of the bulk for it will be reused */
> >  	if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY)
> >  		desc->bd_nob_transferred = 0;
> > -	else
> > -		LASSERT(desc->bd_nob_transferred == 0);
> > +	else if (desc->bd_nob_transferred != 0)
> > +		/* If the network failed after an RPC was sent, this condition
> > +		 * could happen.  Rather than assert (was here before), return
> > +		 * an EIO error.
> > +		 */
> > +		return -EIO;
> 
> This looks weird, and the justification is rather lame.
> I wonder if this is an attempt to fix the same problem that the smp_mb()
> in the previous patch was attempting to fix (and I'm not yet convinced
> that either is the correct fix).

When the above condition happens, the LASSERT panics the node, which in
turn kills the application running on the cluster. When it is replaced
with an -EIO return, both the node and the job survive. The job might
see an I/O failure, but its overall workflow would not be killed, which
is far more important.

> >  	desc->bd_failure = 0;
> >  
> > -- 
> > 1.8.3.1
> 
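A minimal userspace sketch of the trade-off described above, assuming a
simplified model of the bulk descriptor (struct bulk_state and
check_bulk_state() are hypothetical names, not Lustre code). On a first
send, stale transfer state is reported to the caller as -EIO instead of
aborting; on resend/replay the counter is reset, as in
ptlrpc_register_bulk().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state ptlrpc_register_bulk() inspects;
 * not the real Lustre structures. */
struct bulk_state {
	bool resend;                    /* models rq_resend or replay */
	unsigned long nob_transferred;  /* models bd_nob_transferred */
};

/* Returns 0 if the bulk can be (re)registered, -EIO if stale transfer
 * state is found on a first send. An assert() here would abort the
 * whole process instead (in the kernel, an LASSERT panics the node). */
static int check_bulk_state(struct bulk_state *st)
{
	if (st->resend) {
		/* resend/replay: reset the counter so the bulk can be reused */
		st->nob_transferred = 0;
		return 0;
	}
	if (st->nob_transferred != 0)
		return -EIO;
	return 0;
}
```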

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 03/28] lustre: ptlrpc: missing barrier before wake_up
  2018-10-17 22:43   ` NeilBrown
@ 2018-10-21 22:48     ` James Simmons
  0 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-21 22:48 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Oct 14 2018, James Simmons wrote:
> 
> > From: Lai Siyao <lai.siyao@whamcloud.com>
> >
> > ptlrpc_client_wake_req() misses a memory barrier, which may cause
> > strange errors.
> >
> > Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-8935
> > Reviewed-on: https://review.whamcloud.com/26583
> > Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> > Reviewed-by: Wang Shilong <wshilong@ddn.com>
> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  drivers/staging/lustre/lustre/include/lustre_net.h | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/staging/lustre/lustre/include/lustre_net.h b/drivers/staging/lustre/lustre/include/lustre_net.h
> > index ce7e98c..468a03e 100644
> > --- a/drivers/staging/lustre/lustre/include/lustre_net.h
> > +++ b/drivers/staging/lustre/lustre/include/lustre_net.h
> > @@ -2211,6 +2211,8 @@ static inline int ptlrpc_status_ntoh(int n)
> >  static inline void
> >  ptlrpc_client_wake_req(struct ptlrpc_request *req)
> >  {
> > +	/* ensure ptlrpc_register_bulk see rq_resend as set. */
> > +	smp_mb();
> >  	if (!req->rq_set)
> >  		wake_up(&req->rq_reply_waitq);
> >  	else
> 
> It is good that this memory barrier has a comment, but the comment isn't
> very helpful.
> There is no matching memory barrier in ptlrpc_register_bulk(), so it
> isn't clear what sequencing is important.
> 
> And ptl_send_rpc() tests ->rq_resend *before* ptlrpc_register_bulk() is
> called (which also tests it).  Presumably these should see that same
> value?  So why does the comment refer to ptlrpc_register_bulk() instead
> of ptl_send_rpc() ??
> 
> It all seems rather confusing, so it is very hard to be sure that the
> code is now correct.
> Is someone able to explain?

I didn't have much to go on here. While the Linux kernel expects memory
barriers to be accompanied by a comment, Lustre developers tend never to
leave an explanation of why a barrier was needed. In this case I examined
the original JIRA ticket and found this in one of the comments:

"It looks like ptlrpc_client_wake_req() misses a memory barrier, which may 
cause ptlrpc_resend_req() wake up ptlrpc_send_rpc -> ptlrpc_register_bulk, 
while the latter doesn't see rq_resend set."

I attempted to turn that into the comment. That is all I had to go on.
Lai is CC'd on this email, so maybe he remembers what it was all about.
 
> Thanks,
> NeilBrown
> 
> 
> 
> > -- 
> > 1.8.3.1
> 
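To illustrate the pairing concern, here is a userspace sketch using C11
atomics, assuming a simplified model in which resend stands in for
req->rq_resend and woken for the wake-up (both hypothetical names).
atomic_thread_fence(memory_order_seq_cst) plays the role of smp_mb(). The
writer's fence only guarantees anything because the reader has a
matching fence after it observes the wake-up; a barrier on one side
alone is not enough.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: "resend" stands in for req->rq_resend and
 * "woken" for the wake-up delivered by ptlrpc_client_wake_req(). */
static atomic_bool resend;
static atomic_bool woken;

static void *waker(void *arg)
{
	(void)arg;
	atomic_store_explicit(&resend, true, memory_order_relaxed);
	/* Writer-side barrier, analogous to the smp_mb() in the patch. */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&woken, true, memory_order_relaxed);
	return NULL;
}

/* Returns the value of "resend" observed after the wake-up. */
static bool wait_and_check(void)
{
	pthread_t t;
	bool seen;

	atomic_store(&resend, false);
	atomic_store(&woken, false);
	pthread_create(&t, NULL, waker, NULL);
	while (!atomic_load_explicit(&woken, memory_order_relaxed))
		; /* spin until woken */
	/* Reader-side barrier pairing with the writer's fence. */
	atomic_thread_fence(memory_order_seq_cst);
	seen = atomic_load_explicit(&resend, memory_order_relaxed);
	pthread_join(t, NULL);
	return seen;
}
```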

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 16/28] lustre: statahead: missing barrier before wake_up
  2018-10-18  2:00   ` NeilBrown
@ 2018-10-21 22:52     ` James Simmons
  2018-10-22  4:04       ` NeilBrown
  0 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-21 22:52 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Oct 14 2018, James Simmons wrote:
> 
> > From: Lai Siyao <lai.siyao@whamcloud.com>
> >
> > A barrier is missing before wake_up() in ll_statahead_interpret(),
> > which may cause 'ls' hang. Under the right conditions a basic 'ls'
> > can fail. The debug logs show:
> >
> > statahead.c:683:ll_statahead_interpret()) sa_entry software rc -13
> > statahead.c:1666:ll_statahead()) revalidate statahead software: -11.
> >
> > Obviously statahead failure didn't notify 'ls' process in time.
> > The mi_cbdata can be stale so add a barrier before calling
> > wake_up().
> >
> > Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
> > Signed-off-by: Bob Glossman <bob.glossman@intel.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9210
> > Reviewed-on: https://review.whamcloud.com/27330
> > Reviewed-by: Nathaniel Clark <nclark@whamcloud.com>
> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  drivers/staging/lustre/lustre/llite/statahead.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/staging/lustre/lustre/llite/statahead.c b/drivers/staging/lustre/lustre/llite/statahead.c
> > index 1ad308c..0174a4c 100644
> > --- a/drivers/staging/lustre/lustre/llite/statahead.c
> > +++ b/drivers/staging/lustre/lustre/llite/statahead.c
> > @@ -680,8 +680,14 @@ static int ll_statahead_interpret(struct ptlrpc_request *req,
> >  
> >  	spin_lock(&lli->lli_sa_lock);
> >  	if (rc) {
> > -		if (__sa_make_ready(sai, entry, rc))
> > +		if (__sa_make_ready(sai, entry, rc)) {
> > +			/* LU-9210 : Under the right conditions even 'ls'
> > +			 * can cause the statahead to fail. Using a memory
> > +			 * barrier resolves this issue.
> > +			 */
> > +			smp_mb();
> >  			wake_up(&sai->sai_waitq);
> > +		}
> >  	} else {
> >  		int first = 0;
> >  		entry->se_minfo = minfo;
> > -- 
> > 1.8.3.1
> 
> Again, this is a fairly lame comment to justify the smp_mb().
> It appears to me that the issue is most likely the value of
> entry->se_state.
> __sa_make_ready() sets this and revalidate_statahead_dentry tests it
> after waiting on sai_waitq.
> So I think it would be best if we changed __sa_make_ready() to
> 
> 	smp_store_release(&entry->se_state, ret < 0 ? SA_ENTRY_INVA : SA_ENTRY_SUCC)
> 
> and in ll_statahead_interpret() have
> 
> 	if (smp_load_acquire(&entry->se_state) == SA_ENTRY_SUCC &&
>             entry->se_inode) {
> 
> This would make it obvious which variable was important, and would show
> the paired synchronization points.

If you think this is lame, you should see the JIRA ticket and the original
patch. They had no information, so I attempted to extract what I could from
the ticket. Hopefully Lai can fill in the details. I have no problem fixing
this another way, but I don't see an easy way in the ticket to reproduce the
problem and confirm that the new approach fixes it :-(

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 04/28] lustre: ptlrpc: Do not assert when bd_nob_transferred != 0
  2018-10-21 22:44     ` James Simmons
@ 2018-10-22  3:26       ` NeilBrown
  2018-11-04 21:29         ` James Simmons
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-22  3:26 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 21 2018, James Simmons wrote:

>> On Sun, Oct 14 2018, James Simmons wrote:
>> 
>> > From: Doug Oucharek <dougso@me.com>
>> >
>> > There is a case in the routine ptlrpc_register_bulk() where we were
>> > asserting if bd_nob_transferred != 0 when not resending.  There is
>> > evidence that network errors can create a situation where
>> > this does happen. So we should not be asserting!
>> >
>> > This patch changes that assert to an error return code of -EIO.
>> >
>> > Signed-off-by: Doug Oucharek <dougso@me.com>
>> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828
>> > Reviewed-on: https://review.whamcloud.com/28491
>> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
>> > Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
>> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
>> > Signed-off-by: James Simmons <jsimmons@infradead.org>
>> > ---
>> >  drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++--
>> >  1 file changed, 6 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
>> > index 27eb1c0..7e7db24 100644
>> > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
>> > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
>> > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
>> >  	/* cleanup the state of the bulk for it will be reused */
>> >  	if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY)
>> >  		desc->bd_nob_transferred = 0;
>> > -	else
>> > -		LASSERT(desc->bd_nob_transferred == 0);
>> > +	else if (desc->bd_nob_transferred != 0)
>> > +		/* If the network failed after an RPC was sent, this condition
>> > +		 * could happen.  Rather than assert (was here before), return
>> > +		 * an EIO error.
>> > +		 */
>> > +		return -EIO;
>> 
>> This looks weird, and the justification is rather lame.
>> I wonder if this is an attempt to fix the same problem that the smp_mb()
>> in the previous patch was attempting to fix (and I'm not yet convinced
>> that either is the correct fix).
>
> When the above condition happens the LASSERT ends up taking out the 
> node with a panic which in turn kills the application running on the cluster.
> When replaced with reporting an EIO error the node survives as well as the 
> job. The job might fail at its IO but it wouldn't fail performing its work 
> flow which is way more important.

Yes, a meaningless error is better than a crash, but a proper fix is
better still.  As I said, my guess is that the memory barrier in the
previous patch might have fixed the bug, so the LASSERT can remain.

Doug: is there any chance that this might be the case?

Thanks,
NeilBrown
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20181022/20f6084a/attachment.sig>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 06/28] lustre: ldlm: ELC shouldn't wait on lock flush
  2018-10-20 17:09     ` James Simmons
@ 2018-10-22  3:44       ` NeilBrown
  0 siblings, 0 replies; 69+ messages in thread
From: NeilBrown @ 2018-10-22  3:44 UTC (permalink / raw)
  To: lustre-devel

On Sat, Oct 20 2018, James Simmons wrote:

>> On Sun, Oct 14 2018, James Simmons wrote:
>> 
>> > From: Andriy Skulysh <c17819@cray.com>
>> >
>> > The commit 08fd034670b5 ("staging: lustre: ldlm: revert the changes
>> > for lock canceling policy") removed the fix for LU-4300 when lru_resize
>> > is disabled.
>> >
>> > Introduce ldlm_cancel_aged_no_wait_policy to be used by ELC.
>> >
>> > Signed-off-by: Andriy Skulysh <c17819@cray.com>
>> > WC-bug-id: https://jira.whamcloud.com/browse/LU-8578
>> > Seagate-bug-id: MRP-3662
>> > Reviewed-on: https://review.whamcloud.com/22286
>> > Reviewed-by: Vitaly Fertman <c17818@cray.com>
>> > Reviewed-by: Patrick Farrell <paf@cray.com>
>> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
>> > Signed-off-by: James Simmons <jsimmons@infradead.org>
>> > ---
>> >  drivers/staging/lustre/lustre/ldlm/ldlm_internal.h |  1 -
>> >  drivers/staging/lustre/lustre/ldlm/ldlm_request.c  | 51 +++++++++++++++-------
>> >  2 files changed, 35 insertions(+), 17 deletions(-)
>> >
>> > diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h b/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
>> > index 1d7c727..709c527 100644
>> > --- a/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
>> > +++ b/drivers/staging/lustre/lustre/ldlm/ldlm_internal.h
>> > @@ -96,7 +96,6 @@ enum {
>> >  	LDLM_LRU_FLAG_NO_WAIT	= BIT(4), /* Cancel locks w/o blocking (neither
>> >  					   * sending nor waiting for any rpcs)
>> >  					   */
>> > -	LDLM_LRU_FLAG_LRUR_NO_WAIT = BIT(5), /* LRUR + NO_WAIT */
>> >  };
>> >  
>> >  int ldlm_cancel_lru(struct ldlm_namespace *ns, int nr,
>> > diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
>> > index 80260b07..3eb5036 100644
>> > --- a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
>> > +++ b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
>> > @@ -579,8 +579,8 @@ int ldlm_prep_elc_req(struct obd_export *exp, struct ptlrpc_request *req,
>> >  		req_capsule_filled_sizes(pill, RCL_CLIENT);
>> >  		avail = ldlm_capsule_handles_avail(pill, RCL_CLIENT, canceloff);
>> >  
>> > -		flags = ns_connect_lru_resize(ns) ?
>> > -			LDLM_LRU_FLAG_LRUR_NO_WAIT : LDLM_LRU_FLAG_AGED;
>> > +		flags = LDLM_LRU_FLAG_NO_WAIT | ns_connect_lru_resize(ns) ?
>> > +			LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED;
>> >  		to_free = !ns_connect_lru_resize(ns) &&
>> >  			  opc == LDLM_ENQUEUE ? 1 : 0;
>> 
>> Bug.
>> The commit in SFS-lustre (7ca60f33893) is correct, but you dropped the
>> parentheses which introduces a bug.
>> 
>> While the SFS code is correct, it is formatted badly.
>> It should be
>> 
>> 	lru_flags = LDLM_LRU_FLAG_NO_WAIT |
>>         	(ns_connect_lru_resize(ns) ?
>>                  LDLM_LRU_FLAG_LRUR : LDLM_LRU_FLAG_AGED);
>> or similar.
>
> Thanks for finding that. Shall I submit another patch to fix that or will
> you fix it up when you apply it to lustre-testing?

I've fixed up the patch - thanks.

NeilBrown
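The precedence bug can be reproduced in isolation. In this sketch only
the NO_WAIT bit (BIT(4)) comes from the diff above; the LRU_FLAG_* names
and the LRUR and AGED values are assumptions for illustration. Because |
binds more tightly than ?:, the posted expression parses as
(NO_WAIT | lru_resize) ? LRUR : AGED, which always yields LRUR and drops
NO_WAIT.

```c
#include <assert.h>

/* NO_WAIT matches BIT(4) in ldlm_internal.h; LRUR and AGED are
 * assumed values used only for this illustration. */
#define LRU_FLAG_AGED    (1u << 0)  /* assumed */
#define LRU_FLAG_LRUR    (1u << 1)  /* assumed */
#define LRU_FLAG_NO_WAIT (1u << 4)

/* As posted: (NO_WAIT | lru_resize) is always non-zero, so the
 * conditional always selects LRUR and NO_WAIT is lost. */
static unsigned int elc_flags_buggy(int lru_resize)
{
	return LRU_FLAG_NO_WAIT | lru_resize ? LRU_FLAG_LRUR : LRU_FLAG_AGED;
}

/* Fixed: pick LRUR or AGED first, then OR in NO_WAIT. */
static unsigned int elc_flags_fixed(int lru_resize)
{
	return LRU_FLAG_NO_WAIT |
	       (lru_resize ? LRU_FLAG_LRUR : LRU_FLAG_AGED);
}
```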

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 11/28] lustre: llite: use security context if it's enabled in the kernel
  2018-10-20 17:49     ` James Simmons
@ 2018-10-22  3:47       ` NeilBrown
  2018-10-23 23:07         ` James Simmons
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-22  3:47 UTC (permalink / raw)
  To: lustre-devel

On Sat, Oct 20 2018, James Simmons wrote:

>> On Sun, Oct 14 2018, James Simmons wrote:
>> 
>> > From: Alex Zhuravlev <bzzz@whamcloud.com>
>> >
>> > if it's disabled, then Lustre stop to work properly (can not create
>> > files, etc)
>> >
>> > Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
>> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9578
>> > Reviewed-on: https://review.whamcloud.com/27364
>> > Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
>> > Reviewed-by: Chris Horn <hornc@cray.com>
>> > Reviewed-by: James Simmons <uja.ornl@yahoo.com>
>> > Reviewed-by: Sebastien Buisson <sbuisson@ddn.com>
>> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
>> > Signed-off-by: James Simmons <jsimmons@infradead.org>
>> > ---
>> >  drivers/staging/lustre/lustre/llite/llite_lib.c | 3 ++-
>> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
>> > index 22b545e..153aa12 100644
>> > --- a/drivers/staging/lustre/lustre/llite/llite_lib.c
>> > +++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
>> > @@ -243,8 +243,9 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
>> >  	if (sbi->ll_flags & LL_SBI_ALWAYS_PING)
>> >  		data->ocd_connect_flags &= ~OBD_CONNECT_PINGLESS;
>> >  
>> > +#ifdef CONFIG_SECURITY
>> >  	data->ocd_connect_flags2 |= OBD_CONNECT2_FILE_SECCTX;
>> > -
>> > +#endif
>> 
>> Policy is to avoid #ifdef in .c files where possible.
>> If we put something like
>> #ifdef CONFIG_SECURITY
>>  #define OBD_CONNECT2_FILE_SECURITY (OBD_CONNECT2_FILE_SECCTX)
>> #else
>>  #define OBD_CONNECT2_FILE_SECURITY (0)
>> #endif
>> 
>> in a .h file, then use OBD_CONNECT2_FILE_SECURITY both here and in
>> obd_connect_has_secctx(),
>> then the latter could would be optimized away by the compiler.  Wouldn't
>> be a big win I guess as it is only used once in a trivial context.
>
> I suggest that we move obd_connect_has_secctx() to llite_internal.h. Also
> that function should return bool. Besides this create an inline function
> obd_connect_set_secctx() for llite_internal.h. Will submit a patch for
> OpenSFS branch. Shall I redo this patch or submit a cleanup later?

Sounds like a good plan.  Please submit a cleanup.  I'll commit this
patch as-is (as it isn't exactly "broken", and a resolution has been
agreed).

Thanks,
NeilBrown
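A sketch of the header-side approach discussed above, assuming
CONFIG_SECURITY is set (or not) by the build. struct connect_data is a
hypothetical stand-in for struct obd_connect_data, and both the bit value
used for OBD_CONNECT2_FILE_SECCTX and the signatures of the proposed
obd_connect_set_secctx()/obd_connect_has_secctx() helpers are
assumptions. With CONFIG_SECURITY unset the mask is 0, so the .c file
needs no #ifdef and the test folds to false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit value for illustration; the real definition lives in
 * the Lustre headers. */
#define OBD_CONNECT2_FILE_SECCTX 0x1ULL

/* Header-side switch: 0 when CONFIG_SECURITY is not enabled. */
#ifdef CONFIG_SECURITY
#define OBD_CONNECT2_FILE_SECURITY (OBD_CONNECT2_FILE_SECCTX)
#else
#define OBD_CONNECT2_FILE_SECURITY (0)
#endif

/* Hypothetical stand-in for struct obd_connect_data. */
struct connect_data {
	uint64_t ocd_connect_flags2;
};

static inline void obd_connect_set_secctx(struct connect_data *data)
{
	data->ocd_connect_flags2 |= OBD_CONNECT2_FILE_SECURITY;
}

static inline bool obd_connect_has_secctx(const struct connect_data *data)
{
	/* Constant-folds to false when the mask is 0. */
	return data->ocd_connect_flags2 & OBD_CONNECT2_FILE_SECURITY;
}
```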

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 15/28] lustre: llite: fix for stat under kthread and X86_X32
  2018-10-18  1:48   ` NeilBrown
@ 2018-10-22  3:58     ` NeilBrown
  2018-11-04 21:35       ` James Simmons
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-22  3:58 UTC (permalink / raw)
  To: lustre-devel

On Thu, Oct 18 2018, NeilBrown wrote:

> On Sun, Oct 14 2018, James Simmons wrote:
>
>> From: Frank Zago <fzago@cray.com>
>>
>> Under the following conditions, ll_getattr will flatten the inode
>> number when it shouldn't:
>>
>>  - the X86_X32 architecture is defined CONFIG_X86_X32, and not even
>>    used,
>>  - ll_getattr is called from a kernel thread (though vfs_getattr for
>>    instance.)
>>
>> This has the result that inode numbers are different whether the same
>> file is stat'ed from a kernel thread, or from a syscall. For instance,
>> 4198401 vs. 144115205272502273.
>>
>> ll_getattr calls ll_need_32bit_api to determine whether the task is 32
>> bits. When the combination is kthread+X86_X32, that function returns
>> that the task is 32 bits, which is incorrect, as the kernel is 64
>> bits.
>>
>> The solution is to check whether the call is from a kernel thread
>> (which is 64 bits) and act consequently.
>>
>> Signed-off-by: Frank Zago <fzago@cray.com>
>> WC-bug-id: https://jira.whamcloud.com/browse/LU-9468
>> Reviewed-on: https://review.whamcloud.com/26992
>> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
>> Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
>> Signed-off-by: James Simmons <jsimmons@infradead.org>
>> ---
>>  drivers/staging/lustre/lustre/llite/dir.c          |  6 +++---
>>  drivers/staging/lustre/lustre/llite/lcommon_cl.c   |  2 +-
>>  .../staging/lustre/lustre/llite/llite_internal.h   | 22 +++++++++++++++++-----
>>  3 files changed, 21 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
>> index 231b351..19c5e9c 100644
>> --- a/drivers/staging/lustre/lustre/llite/dir.c
>> +++ b/drivers/staging/lustre/lustre/llite/dir.c
>> @@ -202,7 +202,7 @@ int ll_dir_read(struct inode *inode, __u64 *ppos, struct md_op_data *op_data,
>>  {
>>  	struct ll_sb_info    *sbi	= ll_i2sbi(inode);
>>  	__u64		   pos		= *ppos;
>> -	int		   is_api32 = ll_need_32bit_api(sbi);
>> +	bool is_api32 = ll_need_32bit_api(sbi);
>>  	int		   is_hash64 = sbi->ll_flags & LL_SBI_64BIT_HASH;
>>  	struct page	  *page;
>>  	bool		   done = false;
>> @@ -296,7 +296,7 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
>>  	struct ll_sb_info	*sbi	= ll_i2sbi(inode);
>>  	__u64 pos = lfd ? lfd->lfd_pos : 0;
>>  	int			hash64	= sbi->ll_flags & LL_SBI_64BIT_HASH;
>> -	int			api32	= ll_need_32bit_api(sbi);
>> +	bool api32 = ll_need_32bit_api(sbi);
>>  	struct md_op_data *op_data;
>>  	int			rc;
>>  
>> @@ -1674,7 +1674,7 @@ static loff_t ll_dir_seek(struct file *file, loff_t offset, int origin)
>>  	struct inode *inode = file->f_mapping->host;
>>  	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
>>  	struct ll_sb_info *sbi = ll_i2sbi(inode);
>> -	int api32 = ll_need_32bit_api(sbi);
>> +	bool api32 = ll_need_32bit_api(sbi);
>>  	loff_t ret = -EINVAL;
>>  
>>  	switch (origin) {
>> diff --git a/drivers/staging/lustre/lustre/llite/lcommon_cl.c b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
>> index 30f17ea..20a3c74 100644
>> --- a/drivers/staging/lustre/lustre/llite/lcommon_cl.c
>> +++ b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
>> @@ -267,7 +267,7 @@ void cl_inode_fini(struct inode *inode)
>>  /**
>>   * build inode number from passed @fid
>>   */
>> -__u64 cl_fid_build_ino(const struct lu_fid *fid, int api32)
>> +u64 cl_fid_build_ino(const struct lu_fid *fid, bool api32)
>>  {
>>  	if (BITS_PER_LONG == 32 || api32)
>>  		return fid_flatten32(fid);
>> diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
>> index dcb2fed..796a8ae 100644
>> --- a/drivers/staging/lustre/lustre/llite/llite_internal.h
>> +++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
>> @@ -651,13 +651,25 @@ static inline struct inode *ll_info2i(struct ll_inode_info *lli)
>>  __u32 ll_i2suppgid(struct inode *i);
>>  void ll_i2gids(__u32 *suppgids, struct inode *i1, struct inode *i2);
>>  
>> -static inline int ll_need_32bit_api(struct ll_sb_info *sbi)
>> +static inline bool ll_need_32bit_api(struct ll_sb_info *sbi)
>>  {
>>  #if BITS_PER_LONG == 32
>> -	return 1;
>> +	return true;
>>  #elif defined(CONFIG_COMPAT)
>> -	return unlikely(in_compat_syscall() ||
>> -			(sbi->ll_flags & LL_SBI_32BIT_API));
>> +	if (unlikely(sbi->ll_flags & LL_SBI_32BIT_API))
>> +		return true;
>> +
>> +#ifdef CONFIG_X86_X32
>> +	/* in_compat_syscall() returns true when called from a kthread
>> +	 * and CONFIG_X86_X32 is enabled, which is wrong. So check
>> +	 * whether the caller comes from a syscall (ie. not a kthread)
>> +	 * before calling in_compat_syscall().
>> +	 */
>> +	if (current->flags & PF_KTHREAD)
>> +		return false;
>> +#endif
>
> This is wrong.  We should fix in_compat_syscall(), not work around it
> here.
> (and then there is that fact that the patch changes 'int' to 'bool'
> without explaining that in the change description).
>
> I've sent a query to some relevant people (Cc:ed to James) to ask about
> fixing in_compat_syscall().

Upstream says in_compat_syscall() should only be called from within a
syscall, so I have changed the patch to the version below.

We probably need to remove this in_compat_syscall() call before going
mainline.

NeilBrown

-static inline int ll_need_32bit_api(struct ll_sb_info *sbi)
+static inline bool ll_need_32bit_api(struct ll_sb_info *sbi)
 {
 #if BITS_PER_LONG == 32
-	return 1;
-#elif defined(CONFIG_COMPAT)
-	return unlikely(in_compat_syscall() ||
-			(sbi->ll_flags & LL_SBI_32BIT_API));
+	return true;
+#else
+	if (unlikely(sbi->ll_flags & LL_SBI_32BIT_API))
+		return true;
+
+#if defined(CONFIG_COMPAT)
+	/* in_compat_syscall() is only meaningful inside a syscall.
+	 * As this can be called from a kthread (e.g. nfsd), we
+	 * need to catch that case first.  kthreads never need the
+	 * 32bit api.
+	 */
+	if (current->flags & PF_KTHREAD)
+		return false;
+
+	return unlikely(in_compat_syscall());
 #else
-	return unlikely(sbi->ll_flags & LL_SBI_32BIT_API);
+	return false;
+#endif
 #endif
 }
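The order of checks in the revised ll_need_32bit_api() can be modelled as
a pure function for testing. This is a sketch, not the kernel code: the
struct api32_inputs fields are hypothetical stand-ins for BITS_PER_LONG,
LL_SBI_32BIT_API, CONFIG_COMPAT, PF_KTHREAD and in_compat_syscall(). The
kthread case covers the situation where in_compat_syscall() would report
true outside a syscall.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs standing in for the kernel-side conditions. */
struct api32_inputs {
	bool kernel_is_32bit;    /* BITS_PER_LONG == 32 */
	bool mount_forces_32bit; /* sbi->ll_flags & LL_SBI_32BIT_API */
	bool compat_enabled;     /* CONFIG_COMPAT */
	bool is_kthread;         /* current->flags & PF_KTHREAD */
	bool in_compat_syscall;  /* in_compat_syscall() */
};

/* Mirrors the order of checks in the revised ll_need_32bit_api(). */
static bool need_32bit_api(const struct api32_inputs *in)
{
	if (in->kernel_is_32bit)
		return true;
	if (in->mount_forces_32bit)
		return true;
	if (!in->compat_enabled)
		return false;
	/* kthreads (e.g. nfsd) are not in a syscall: never 32-bit */
	if (in->is_kthread)
		return false;
	return in->in_compat_syscall;
}
```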

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 16/28] lustre: statahead: missing barrier before wake_up
  2018-10-21 22:52     ` James Simmons
@ 2018-10-22  4:04       ` NeilBrown
  2018-11-04 20:52         ` James Simmons
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-22  4:04 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 21 2018, James Simmons wrote:

>> On Sun, Oct 14 2018, James Simmons wrote:
>> 
>> > From: Lai Siyao <lai.siyao@whamcloud.com>
>> >
>> > A barrier is missing before wake_up() in ll_statahead_interpret(),
>> > which may cause 'ls' hang. Under the right conditions a basic 'ls'
>> > can fail. The debug logs show:
>> >
>> > statahead.c:683:ll_statahead_interpret()) sa_entry software rc -13
>> > statahead.c:1666:ll_statahead()) revalidate statahead software: -11.
>> >
>> > Obviously statahead failure didn't notify 'ls' process in time.
>> > The mi_cbdata can be stale so add a barrier before calling
>> > wake_up().
>> >
>> > Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
>> > Signed-off-by: Bob Glossman <bob.glossman@intel.com>
>> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9210
>> > Reviewed-on: https://review.whamcloud.com/27330
>> > Reviewed-by: Nathaniel Clark <nclark@whamcloud.com>
>> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
>> > Signed-off-by: James Simmons <jsimmons@infradead.org>
>> > ---
>> >  drivers/staging/lustre/lustre/llite/statahead.c | 8 +++++++-
>> >  1 file changed, 7 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/staging/lustre/lustre/llite/statahead.c b/drivers/staging/lustre/lustre/llite/statahead.c
>> > index 1ad308c..0174a4c 100644
>> > --- a/drivers/staging/lustre/lustre/llite/statahead.c
>> > +++ b/drivers/staging/lustre/lustre/llite/statahead.c
>> > @@ -680,8 +680,14 @@ static int ll_statahead_interpret(struct ptlrpc_request *req,
>> >  
>> >  	spin_lock(&lli->lli_sa_lock);
>> >  	if (rc) {
>> > -		if (__sa_make_ready(sai, entry, rc))
>> > +		if (__sa_make_ready(sai, entry, rc)) {
>> > +			/* LU-9210 : Under the right conditions even 'ls'
>> > +			 * can cause the statahead to fail. Using a memory
>> > +			 * barrier resolves this issue.
>> > +			 */
>> > +			smp_mb();
>> >  			wake_up(&sai->sai_waitq);
>> > +		}
>> >  	} else {
>> >  		int first = 0;
>> >  		entry->se_minfo = minfo;
>> > -- 
>> > 1.8.3.1
>> 
>> Again, this is a fairly lame comment to justify the smp_mb().
>> It appears to me that the issue is most likely the value of
>> entry->se_state.
>> __sa_make_ready() sets this and revalidate_statahead_dentry tests it
>> after waiting on sai_waitq.
>> So I think it would be best if we changed __sa_make_ready() to
>> 
>> 	smp_store_release(&entry->se_state, ret < 0 ? SA_ENTRY_INVA : SA_ENTRY_SUCC)
>> 
>> and in ll_statahead_interpret() have
>> 
>> 	if (smp_load_acquire(&entry->se_state) == SA_ENTRY_SUCC &&
>>             entry->se_inode) {
>> 
>> This would make it obvious which variable was important, and would show
>> the paired synchronization points.
>
> If you think this is lame you should be the JIRA ticket and the original 
> patch. It had zero info so I attempted to extract what I could out of the
> ticket. Hopefully Lai can fill in the details. I have no problems fixing 
> this another way. I don't see a way in the ticket to easily reproduce this
> problem to see the new approach would fix it :-(

I have imposed the version that I think is correct.  See below.

Thanks,
NeilBrown

--- a/drivers/staging/lustre/lustre/llite/statahead.c
+++ b/drivers/staging/lustre/lustre/llite/statahead.c
@@ -322,7 +322,11 @@ __sa_make_ready(struct ll_statahead_info *sai, struct sa_entry *entry, int ret)
 		}
 	}
 	list_add(&entry->se_list, pos);
-	entry->se_state = ret < 0 ? SA_ENTRY_INVA : SA_ENTRY_SUCC;
+	/*
+	 * LU-9210: ll_statahead_interpet must be able to see this before
+	 * we wake it up
+	 */
+	smp_store_release(&entry->se_state, ret < 0 ? SA_ENTRY_INVA : SA_ENTRY_SUCC);
 
 	return (index == sai->sai_index_wait);
 }
@@ -1390,7 +1394,12 @@ static int revalidate_statahead_dentry(struct inode *dir,
 		}
 	}
 
-	if (entry->se_state == SA_ENTRY_SUCC && entry->se_inode) {
+	/*
+	 * We need to see the value that was set immediately before we
+	 * were woken up.
+	 */
+	if (smp_load_acquire(&entry->se_state) == SA_ENTRY_SUCC &&
+	    entry->se_inode) {
 		struct inode *inode = entry->se_inode;
 		struct lookup_intent it = { .it_op = IT_GETATTR,
 					    .it_lock_handle = entry->se_handle };

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 20/28] lustre: obdclass: deprecate OBD_GET_VERSION ioctl
  2018-10-20 18:52     ` James Simmons
@ 2018-10-22  4:08       ` NeilBrown
  0 siblings, 0 replies; 69+ messages in thread
From: NeilBrown @ 2018-10-22  4:08 UTC (permalink / raw)
  To: lustre-devel

On Sat, Oct 20 2018, James Simmons wrote:

>> On Sun, Oct 14 2018, James Simmons wrote:
>> >  
>> > +		if (!warned) {
>> > +			warned = true;
>> > +			CWARN("%s: ioctl(OBD_GET_VERSION) is deprecated, use llapi_get_version_string() and/or relink\n",
>> > +			      current->comm);
>> > +		}
>> 
>> Is there a good reason not to use WARN_ON_ONCE() here?
>
> Oh that is much nicer. Didn't know about it. Shall I submit a new
> patch or will it be changed when applied to lustre-testing?

I've made the change directly - no need to resubmit.

Thanks,
NeilBrown
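For reference, a userspace approximation of the print-once pattern,
assuming a per-call-site static flag similar to what the kernel's *_ONCE
helpers use; warn_once() and get_version_ioctl() are hypothetical names.
The sketch only models the print-once behaviour, not the backtrace that
WARN_ON_ONCE() also produces, and it is not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int warnings_printed;

/* Each expansion gets its own static flag, so each call site warns
 * at most once. */
#define warn_once(fmt, ...)					\
	do {							\
		static bool warned_;				\
		if (!warned_) {					\
			warned_ = true;				\
			warnings_printed++;			\
			fprintf(stderr, fmt, __VA_ARGS__);	\
		}						\
	} while (0)

static void get_version_ioctl(const char *comm)
{
	warn_once("%s: ioctl(OBD_GET_VERSION) is deprecated\n", comm);
}
```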


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10
  2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
                   ` (27 preceding siblings ...)
  2018-10-14 18:58 ` [lustre-devel] [PATCH 28/28] lustre: llite: restore lld_nfs_dentry handling James Simmons
@ 2018-10-22  4:36 ` NeilBrown
  2018-10-23 22:34   ` [lustre-devel] [PATCH] lustre: lu_object: fix possible hang waiting for LCS_LEAVING NeilBrown
  28 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-22  4:36 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

> Another bacth of assorted fixes missing in the linux client from
> lustre 2.10. All of these should be order independent and don't
> collide with the PFL work that will land at a later date.

Thanks, I've applied these and the other series, and pushed it all out.
I ran a test and something went wrong; I haven't had a chance to look
properly yet. I'll make sure tests mostly pass before moving it out
of lustre-testing.

Thanks a lot,
NeilBrown

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH] lustre: lu_object: fix possible hang waiting for LCS_LEAVING
  2018-10-22  4:36 ` [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 NeilBrown
@ 2018-10-23 22:34   ` NeilBrown
  2018-10-29  3:31     ` James Simmons
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-23 22:34 UTC (permalink / raw)
  To: lustre-devel


As lu_context_key_quiesce() spins waiting for LCS_LEAVING to
change, it is important that we set it and then clear it within a
non-preemptible region.  If the spinning thread preempts the
thread that sets and clears the state while the state is LCS_LEAVING,
it can spin indefinitely, particularly on a single-CPU machine.

Also update the comment to explain this dependency.

Fixes: ac3f8fd6e61b ("staging: lustre: remove locking from lu_context_exit()")
---

This is the cause of the "something" that went wrong in my recent
testing that I mentioned.  I wonder if preempt_enable() has recently
been enhanced to encourage a preempt, to make this sort of bug easier to
see.


 drivers/staging/lustre/lustre/obdclass/lu_object.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/lustre/lustre/obdclass/lu_object.c b/drivers/staging/lustre/lustre/obdclass/lu_object.c
index cb57abf03644..51497c144dd6 100644
--- a/drivers/staging/lustre/lustre/obdclass/lu_object.c
+++ b/drivers/staging/lustre/lustre/obdclass/lu_object.c
@@ -1654,17 +1654,20 @@ void lu_context_exit(struct lu_context *ctx)
 	unsigned int i;
 
 	LINVRNT(ctx->lc_state == LCS_ENTERED);
-	/*
-	 * Ensure lu_context_key_quiesce() sees LCS_LEAVING
-	 * or we see LCT_QUIESCENT
-	 */
-	smp_store_mb(ctx->lc_state, LCS_LEAVING);
 	/*
 	 * Disable preempt to ensure we get a warning if
 	 * any lct_exit ever tries to sleep.  That would hurt
 	 * lu_context_key_quiesce() which spins waiting for us.
+	 * This also ensures we aren't preempted while the state
+	 * is LCS_LEAVING, as that too would cause problems for
+	 * lu_context_key_quiesce().
 	 */
 	preempt_disable();
+	/*
+	 * Ensure lu_context_key_quiesce() sees LCS_LEAVING
+	 * or we see LCT_QUIESCENT
+	 */
+	smp_store_mb(ctx->lc_state, LCS_LEAVING);
 	if (ctx->lc_tags & LCT_HAS_EXIT && ctx->lc_value) {
 		for (i = 0; i < ARRAY_SIZE(lu_keys); ++i) {
 			struct lu_context_key *key;
@@ -1677,8 +1680,8 @@ void lu_context_exit(struct lu_context *ctx)
 		}
 	}
 
-	preempt_enable();
 	smp_store_release(&ctx->lc_state, LCS_LEFT);
+	preempt_enable();
 }
 EXPORT_SYMBOL(lu_context_exit);
 
-- 
2.14.0.rc0.dirty

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 11/28] lustre: llite: use security context if it's enabled in the kernel
  2018-10-22  3:47       ` NeilBrown
@ 2018-10-23 23:07         ` James Simmons
  0 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-10-23 23:07 UTC (permalink / raw)
  To: lustre-devel


> >> > diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
> >> > index 22b545e..153aa12 100644
> >> > --- a/drivers/staging/lustre/lustre/llite/llite_lib.c
> >> > +++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
> >> > @@ -243,8 +243,9 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt)
> >> >  	if (sbi->ll_flags & LL_SBI_ALWAYS_PING)
> >> >  		data->ocd_connect_flags &= ~OBD_CONNECT_PINGLESS;
> >> >  
> >> > +#ifdef CONFIG_SECURITY
> >> >  	data->ocd_connect_flags2 |= OBD_CONNECT2_FILE_SECCTX;
> >> > -
> >> > +#endif
> >> 
> >> Policy is to avoid #ifdef in .c files where possible.
> >> If we put something like
> >> #ifdef CONFIG_SECURITY
> >>  #define OBD_CONNECT2_FILE_SECURITY (OBD_CONNECT2_FILE_SECCTX)
> >> #else
> >>  #define OBD_CONNECT2_FILE_SECURITY (0)
> >> #endif
> >> 
> >> in a .h file, then use OBD_CONNECT2_FILE_SECURITY both here and in
> >> obd_connect_has_secctx(),
> >> then the latter would be optimized away by the compiler.  Wouldn't
> >> be a big win I guess as it is only used once in a trivial context.
> >
> > I suggest that we move obd_connect_has_secctx() to llite_internal.h. Also
> > that function should return bool. Besides this create an inline function
> > obd_connect_set_secctx() for llite_internal.h. Will submit a patch for
> > OpenSFS branch. Shall I redo this patch or submit a cleanup later?
> 
> Sounds like a good plan.  Please submit a cleanup.  I'll commit this
> patch as-is (as it isn't exactly "broken", and a resolution has been
> agreed).

The new patch is at https://review.whamcloud.com/#/c/33410. It will be
coming to the Linux client soon.
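
Here is a sketch of the #ifdef-free pattern discussed above. It assumes the helpers end up as inline functions in a header such as llite_internal.h. Several pieces are assumptions:

- model_connect_data is a hypothetical stand-in for struct obd_connect_data.
- The OBD_CONNECT2_FILE_SECCTX value is a placeholder.
- The helper signatures are guesses, not the code in the review.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder value for illustration; the real bit is in the UAPI headers. */
#define OBD_CONNECT2_FILE_SECCTX 0x1ULL

/* The only CONFIG_SECURITY test lives in the header; .c files use this name. */
#ifdef CONFIG_SECURITY
#define OBD_CONNECT2_FILE_SECURITY (OBD_CONNECT2_FILE_SECCTX)
#else
#define OBD_CONNECT2_FILE_SECURITY (0ULL)
#endif

/* Hypothetical minimal stand-in for struct obd_connect_data. */
struct model_connect_data {
	uint64_t ocd_connect_flags2;
};

static inline void obd_connect_set_secctx(struct model_connect_data *data)
{
	/* A no-op OR with 0 when CONFIG_SECURITY is unset. */
	data->ocd_connect_flags2 |= OBD_CONNECT2_FILE_SECURITY;
}

static inline bool obd_connect_has_secctx(const struct model_connect_data *data)
{
	/* Constant-folds to false when CONFIG_SECURITY is unset. */
	return data->ocd_connect_flags2 & OBD_CONNECT2_FILE_SECURITY;
}
```

With this arrangement, client_common_fill_super() would call obd_connect_set_secctx(data) unconditionally. The compiler then drops the dead branches on kernels built without CONFIG_SECURITY.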

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 12/28] lustre: ptlrpc: do not wakeup every second
  2018-10-14 18:58 ` [lustre-devel] [PATCH 12/28] lustre: ptlrpc: do not wakeup every second James Simmons
@ 2018-10-29  0:03   ` NeilBrown
  2018-10-29  1:35     ` Patrick Farrell
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-29  0:03 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

> From: Alex Zhuravlev <bzzz@whamcloud.com>
>
> Even if there are no RPC requests on the set, there is no need to
> wake up every second. The thread is woken up when a request is added
> to the set or when the STOP bit is set, so it is sufficient to only
> wake up when there are requests on the set to worry about.
>
> Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9660
> Reviewed-on: https://review.whamcloud.com/28776
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> Reviewed-by: Patrick Farrell <paf@cray.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
> index c201a88..5b4977b 100644
> --- a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
> +++ b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
> @@ -371,7 +371,7 @@ static int ptlrpcd_check(struct lu_env *env, struct ptlrpcd_ctl *pc)
>  		}
>  	}
>  
> -	return rc;
> +	return rc || test_bit(LIOD_STOP, &pc->pc_flags);
>  }
>  
>  /**
> @@ -441,7 +441,7 @@ static int ptlrpcd(void *arg)
>  		lu_context_enter(env.le_ses);
>  		if (wait_event_idle_timeout(set->set_waitq,
>  					    ptlrpcd_check(&env, pc),
> -					    (timeout ? timeout : 1) * HZ) == 0)
> +					    timeout * HZ) == 0)
>  			ptlrpc_expired_set(set);

This is incorrect.
A timeout of zero means the timeout happens after zero jiffies
(immediately); it doesn't mean there is no timeout.
If we want a "timeout" of zero to mean "Wait forever", we need something
like:

  wait_event_idle_timeout(.....,
                          timeout ? (timeout * HZ) : MAX_SCHEDULE_TIMEOUT) == 0

I've updated the patch accordingly.
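
Here is a sketch of the conversion described above, written as a hypothetical helper; ptlrpcd_wait_jiffies() is not in the Lustre tree. It maps a ptlrpcd timeout in seconds, where 0 means "wait forever", to the jiffies argument of wait_event_idle_timeout(), where 0 means "expire immediately". In the kernel, MAX_SCHEDULE_TIMEOUT is LONG_MAX; the hz parameter stands in for HZ.

```c
#include <limits.h>

/* The kernel defines MAX_SCHEDULE_TIMEOUT as LONG_MAX. */
#define MAX_SCHEDULE_TIMEOUT LONG_MAX

/*
 * Hypothetical helper: convert a ptlrpcd timeout in seconds, where 0
 * means "no timeout", into a jiffies value for wait_event_idle_timeout(),
 * where 0 would mean "expire immediately".
 */
static long ptlrpcd_wait_jiffies(long timeout_sec, long hz)
{
	return timeout_sec ? timeout_sec * hz : MAX_SCHEDULE_TIMEOUT;
}
```

ptlrpcd_wait_jiffies(0, HZ) therefore yields MAX_SCHEDULE_TIMEOUT, the same value as the expression in the reply.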

Thanks,
NeilBrown

>  
>  		lu_context_exit(&env.le_ctx);
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 12/28] lustre: ptlrpc: do not wakeup every second
  2018-10-29  0:03   ` NeilBrown
@ 2018-10-29  1:35     ` Patrick Farrell
  2018-10-29  2:41       ` NeilBrown
  0 siblings, 1 reply; 69+ messages in thread
From: Patrick Farrell @ 2018-10-29  1:35 UTC (permalink / raw)
  To: lustre-devel

Neil,

Does your statement imply this would spin?  It definitely doesn't just spin (that behavior in a main "wait for work" spot of a (depending on settings) ~per-CPU daemon would render systems unusable, and this patch has been in testing for a while).  So what is the detailed behavior of a "timeout that expires immediately"?

- Patrick


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 12/28] lustre: ptlrpc: do not wakeup every second
  2018-10-29  1:35     ` Patrick Farrell
@ 2018-10-29  2:41       ` NeilBrown
  2018-10-29  3:42         ` James Simmons
  2018-11-04 20:53         ` James Simmons
  0 siblings, 2 replies; 69+ messages in thread
From: NeilBrown @ 2018-10-29  2:41 UTC (permalink / raw)
  To: lustre-devel

On Mon, Oct 29 2018, Patrick Farrell wrote:

> Neil,
>
> Does your statement imply this would spin?  It definitely doesn't just
> spin (that behavior in a main "wait for work" spot of a (depending on
> settings) ~per-CPU daemon would render systems unusable and this patch
> has been in testing for a while).  So what is the detailed behavior of
> a "timeout that expires immediately"?

Hi Patrick,
 it definitely spins for me.

 I should have clarified that the SFS patch

   e81847bd0651 LU-9660 ptlrpc: do not wakeup every second

 is correct, as __l_wait_event() treats a timeout value of 0 as meaning an
 indefinite timeout.
 The error was in the conversion to wait_event_idle_timeout().  The
 various wait_event*timeout() functions treat 0 literally, as one jiffy
 less than 1, so the timeout expires immediately.
 If you want to not have a timeout, you need to not use the *_timeout()
 version.
 If a timeout is undesirable rather than fatal, then
 MAX_SCHEDULE_TIMEOUT can be used.  In this case, that seemed best.
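
To see why a zero timeout spins rather than sleeps, here is a userspace model of the check that the wait_event*_timeout() macros make before sleeping. As we understand it, the real check is ___wait_cond_timeout() in include/linux/wait.h. The model_* names are hypothetical, and the wakeups and signal handling of the real macros are not modelled.

```c
#include <stdbool.h>

/* Model of ___wait_cond_timeout(); 'ret' starts as the caller's timeout. */
static bool model_wait_cond_timeout(bool cond, long *ret)
{
	if (cond && !*ret)
		*ret = 1;		/* condition true with no time left: success */
	return cond || !*ret;		/* stop waiting if true OR time is up */
}

struct model_wait_result {
	bool would_sleep;	/* would the real macro block on the waitqueue? */
	long ret;		/* value returned when it does not sleep */
};

static struct model_wait_result model_wait_event_timeout(bool cond, long timeout)
{
	struct model_wait_result r = { .would_sleep = false, .ret = timeout };

	if (!model_wait_cond_timeout(cond, &r.ret))
		r.would_sleep = true;	/* kernel sleeps for up to 'timeout' jiffies */
	return r;
}
```

With cond false and timeout 0, the model returns 0 without sleeping. ptlrpcd() treats that as an expiry, calls ptlrpc_expired_set() and loops straight back, which is the spin described above.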

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH] lustre: lu_object: fix possible hang waiting for LCS_LEAVING
  2018-10-23 22:34   ` [lustre-devel] [PATCH] lustre: lu_object: fix possible hang waiting for LCS_LEAVING NeilBrown
@ 2018-10-29  3:31     ` James Simmons
  2018-10-29  4:31       ` NeilBrown
  0 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-29  3:31 UTC (permalink / raw)
  To: lustre-devel


> As lu_context_key_quiesce() spins waiting for LCS_LEAVING to
> change, it is important the we set and then clear in within a
> non-preemptible region.  If the thread that spins pre-empty the
> thread that sets-and-clears the state while the state is LCS_LEAVING,
> then it can spin indefinitely, particularly on a single-CPU machine.
> 
> Also update the comment to explain this dependency.
> 
> Fixes: ac3f8fd6e61b ("staging: lustre: remove locking from lu_context_exit()")
> ---
> 
> This is the cause of the "something" that went wrong in my recent
> testing that I mentioned.  I wonder if preempt_enable() has recently
> been enhanced to encourage a preempt, to make this sort of bug easier to
> see.
> 

That reduced my CPU load :-)

Reviewed-by: James Simmons <jsimmons@infradead.org>
 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 12/28] lustre: ptlrpc: do not wakeup every second
  2018-10-29  2:41       ` NeilBrown
@ 2018-10-29  3:42         ` James Simmons
  2018-10-29 14:17           ` Patrick Farrell
  2018-11-04 20:53         ` James Simmons
  1 sibling, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-10-29  3:42 UTC (permalink / raw)
  To: lustre-devel


> > Neil,
> >
> > Does your statement imply this would spin?  It definitely doesn't just
> > spin (that behavior in a main "wait for work" spot of a (depending on
> > settings) ~per-CPU daemon would render systems unusable and this patch
> > has been in testing for a while).  So what is the detailed behavior of
> > a "timeout that expires immediately"?
> 
> Hi Patrick,
>  it definitely spins for me.

Ah, that is where the high CPU load is coming from.
 
>  I should have clarified that the SFS patch
> 
>    e81847bd0651 LU-9660 ptlrpc: do not wakeup every second
> 
>  is correct, as __l_wait_event() treats a timeout value of 0 as meaning an
>  indefinite timeout.
>  The error was in the conversion to wait_event_idle_timeout().  The
>  various wait_event*timeout() functions treat 0 as 1 less than 1.
>  If you want to not have a timeout, you need to not use the *_timeout()
>  version.
>  If a timeout is undesirable rather than fatal, then
>  MAX_SCHEDULE_TIMEOUT can be used.  In this case, that seemed best.

I missed that in the conversion. Now that I know the mapping, I will
do the future server code port correctly ;-)


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH] lustre: lu_object: fix possible hang waiting for LCS_LEAVING
  2018-10-29  3:31     ` James Simmons
@ 2018-10-29  4:31       ` NeilBrown
  0 siblings, 0 replies; 69+ messages in thread
From: NeilBrown @ 2018-10-29  4:31 UTC (permalink / raw)
  To: lustre-devel

On Mon, Oct 29 2018, James Simmons wrote:

>> As lu_context_key_quiesce() spins waiting for LCS_LEAVING to
>> change, it is important the we set and then clear in within a
>> non-preemptible region.  If the thread that spins pre-empty the
>> thread that sets-and-clears the state while the state is LCS_LEAVING,
>> then it can spin indefinitely, particularly on a single-CPU machine.
>> 
>> Also update the comment to explain this dependency.
>> 
>> Fixes: ac3f8fd6e61b ("staging: lustre: remove locking from lu_context_exit()")
>> ---
>> 
>> This is the cause of the "something" that went wrong in my recent
>> testing that I mentioned.  I wonder if preempt_enable() has recently
>> been enhanced to encourage a preempt, to make this sort of bug easier to
>> see.
>> 
>
> Reduced my cpu load :-)
>
> Reviewed-by: James Simmons <jsimmons@infradead.org>

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 12/28] lustre: ptlrpc: do not wakeup every second
  2018-10-29  3:42         ` James Simmons
@ 2018-10-29 14:17           ` Patrick Farrell
  0 siblings, 0 replies; 69+ messages in thread
From: Patrick Farrell @ 2018-10-29 14:17 UTC (permalink / raw)
  To: lustre-devel

Ah, OK!

Thanks, Neil.  That makes sense.  I just knew there was no way that had been missed in the WC branch, just as I'm sure you two noticed it immediately in your testing.


Also useful to know about the timeout functions and 0.


Regards,

Patrick

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 18/28] lustre: mdc: expose changelog through char devices
  2018-10-14 18:58 ` [lustre-devel] [PATCH 18/28] lustre: mdc: expose changelog through char devices James Simmons
@ 2018-10-30  6:41   ` NeilBrown
  2018-11-04 21:31     ` James Simmons
  0 siblings, 1 reply; 69+ messages in thread
From: NeilBrown @ 2018-10-30  6:41 UTC (permalink / raw)
  To: lustre-devel

On Sun, Oct 14 2018, James Simmons wrote:

> From: Henri Doreau <henri.doreau@cea.fr>
>
> Register one character device per MDT in order to allow non-llapi to
> read them and to make delivery more efficient.
>
> - open() spawns a thread to prefetch records and enqueue them into a
>   local buffer (unless the device is open in write-only mode).
> - lseek() can be used to jump to a specific record, in which case the
>   offset is a record number (with SEEK_SET) or a number of records to
>   skip (SEEK_CUR). Movement can only be done forward.
> - read() copies records to userland. No truncation happens, so short
>   reads are likely.
> - write() is used to transmit control commands to the device.
>   The only available one is changelog_clear, which is done by writing
>   "clear:cl<user>:<recno>" into the device.
> - close() terminates the prefetch thread if any, and releases resources.
>
> It is possible to poll() on the device to get notified when new records
> are available for read.
>
> Signed-off-by: Henri Doreau <henri.doreau@cea.fr>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-7659
> Reviewed-on: https://review.whamcloud.com/18900
> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>

This patch causes problems around sanity test 161a.
If you run -only '160h 160i 161a', it hangs.

Adding commit 89e52326b5bd ("LU-10166 mdc: invalid free in changelog reader")
seems to fix the problem, so I'll port that into the series immediately
after this patch.

NeilBrown
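
Here is a userspace sketch of the character-device interface described in the quoted commit message, for illustration only. Several details are assumptions:

- The device path is a parameter because this thread does not give the device naming.
- The helper names are hypothetical.
- Records are left unparsed, since the struct changelog_rec layout is not shown here.
- The one-second poll() timeout is arbitrary.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Format the write() control command "clear:cl<user>:<recno>". */
static int changelog_clear_cmd(char *buf, size_t len, unsigned int user,
			       unsigned long long recno)
{
	int n = snprintf(buf, len, "clear:cl%u:%llu", user, recno);

	/* Reject encoding errors and truncated commands. */
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Clear records up to 'recno' for changelog user cl<user>. */
static int changelog_clear(const char *dev, unsigned int user,
			   unsigned long long recno)
{
	char cmd[64];
	int fd, n, rc = -1;

	n = changelog_clear_cmd(cmd, sizeof(cmd), user, recno);
	if (n < 0)
		return -1;
	fd = open(dev, O_WRONLY);	/* write-only: no prefetch thread */
	if (fd < 0)
		return -1;
	if (write(fd, cmd, n) == n)
		rc = 0;
	close(fd);
	return rc;
}

/* Read raw records starting at record 'startrec'; returns bytes, 0 or -1. */
static ssize_t changelog_read_from(const char *dev, off_t startrec,
				   char *buf, size_t len)
{
	struct pollfd pfd = { .events = POLLIN };
	ssize_t n = -1;
	int fd = open(dev, O_RDONLY);	/* readable open starts the prefetch */

	if (fd < 0)
		return -1;
	/* With SEEK_SET the offset is a record number, not a byte offset. */
	if (lseek(fd, startrec, SEEK_SET) < 0)
		goto out;
	pfd.fd = fd;
	n = poll(&pfd, 1, 1000);	/* wait up to 1s for new records */
	if (n > 0)
		n = read(fd, buf, len);	/* whole records only: short reads */
out:
	close(fd);
	return n;
}
```

For example, changelog_clear(dev, 1, 42) writes "clear:cl1:42", following the "clear:cl<user>:<recno>" format from the commit message.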


> ---
>  .../include/uapi/linux/lustre/lustre_ioctl.h       |   2 +-
>  .../include/uapi/linux/lustre/lustre_kernelcomm.h  |   3 -
>  .../lustre/include/uapi/linux/lustre/lustre_user.h |   7 -
>  drivers/staging/lustre/lustre/include/obd.h        |   2 +
>  drivers/staging/lustre/lustre/ldlm/ldlm_lib.c      |   2 +
>  drivers/staging/lustre/lustre/llite/dir.c          |   8 -
>  drivers/staging/lustre/lustre/lmv/lmv_obd.c        |  13 -
>  drivers/staging/lustre/lustre/mdc/Makefile         |   2 +-
>  drivers/staging/lustre/lustre/mdc/mdc_changelog.c  | 722 +++++++++++++++++++++
>  drivers/staging/lustre/lustre/mdc/mdc_internal.h   |   4 +
>  drivers/staging/lustre/lustre/mdc/mdc_request.c    | 198 +-----
>  11 files changed, 745 insertions(+), 218 deletions(-)
>  create mode 100644 drivers/staging/lustre/lustre/mdc/mdc_changelog.c
>
> diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
> index 6e4e109..098b6451 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
> @@ -172,7 +172,7 @@ static inline __u32 obd_ioctl_packlen(struct obd_ioctl_data *data)
>  #define OBD_GET_VERSION		_IOWR('f', 144, OBD_IOC_DATA_TYPE)
>  /*	OBD_IOC_GSS_SUPPORT	_IOWR('f', 145, OBD_IOC_DATA_TYPE) */
>  /*	OBD_IOC_CLOSE_UUID	_IOWR('f', 147, OBD_IOC_DATA_TYPE) */
> -#define OBD_IOC_CHANGELOG_SEND	_IOW('f', 148, OBD_IOC_DATA_TYPE)
> +/*	OBD_IOC_CHANGELOG_SEND	_IOW('f', 148, OBD_IOC_DATA_TYPE) */
>  #define OBD_IOC_GETDEVICE	_IOWR('f', 149, OBD_IOC_DATA_TYPE)
>  #define OBD_IOC_FID2PATH	_IOWR('f', 150, OBD_IOC_DATA_TYPE)
>  /*	lustre/lustre_user.h	151-153 */
> diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h
> index 94dadbe..d84a8fc 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h
> @@ -54,15 +54,12 @@ struct kuc_hdr {
>  	__u16 kuc_msglen;
>  } __aligned(sizeof(__u64));
>  
> -#define KUC_CHANGELOG_MSG_MAXSIZE (sizeof(struct kuc_hdr) + CR_MAXSIZE)
> -
>  #define KUC_MAGIC		0x191C /*Lustre9etLinC */
>  
>  /* kuc_msgtype values are defined in each transport */
>  enum kuc_transport_type {
>  	KUC_TRANSPORT_GENERIC	= 1,
>  	KUC_TRANSPORT_HSM	= 2,
> -	KUC_TRANSPORT_CHANGELOG	= 3,
>  };
>  
>  enum kuc_generic_message_type {
> diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
> index b8525e5..715f1c5 100644
> --- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
> +++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
> @@ -967,13 +967,6 @@ static inline void changelog_remap_rec(struct changelog_rec *rec,
>  	rec->cr_flags = (rec->cr_flags & CLF_FLAGMASK) | crf_wanted;
>  }
>  
> -struct ioc_changelog {
> -	__u64 icc_recno;
> -	__u32 icc_mdtindex;
> -	__u32 icc_id;
> -	__u32 icc_flags;
> -};
> -
>  enum changelog_message_type {
>  	CL_RECORD = 10, /* message is a changelog_rec */
>  	CL_EOF    = 11, /* at end of current changelog */
> diff --git a/drivers/staging/lustre/lustre/include/obd.h b/drivers/staging/lustre/lustre/include/obd.h
> index 11e7ae8..76ae0b3 100644
> --- a/drivers/staging/lustre/lustre/include/obd.h
> +++ b/drivers/staging/lustre/lustre/include/obd.h
> @@ -345,6 +345,8 @@ struct client_obd {
>  	void			*cl_lru_work;
>  	/* hash tables for osc_quota_info */
>  	struct rhashtable	cl_quota_hash[MAXQUOTAS];
> +	/* Links to the global list of registered changelog devices */
> +	struct list_head	cl_chg_dev_linkage;
>  };
>  
>  #define obd2cli_tgt(obd) ((char *)(obd)->u.cli.cl_target_uuid.uuid)
> diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c b/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c
> index 32eda4f..732ef3a 100644
> --- a/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c
> +++ b/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c
> @@ -395,6 +395,8 @@ int client_obd_setup(struct obd_device *obddev, struct lustre_cfg *lcfg)
>  	init_waitqueue_head(&cli->cl_mod_rpcs_waitq);
>  	cli->cl_mod_tag_bitmap = NULL;
>  
> +	INIT_LIST_HEAD(&cli->cl_chg_dev_linkage);
> +
>  	if (connect_op == MDS_CONNECT) {
>  		cli->cl_max_mod_rpcs_in_flight = cli->cl_max_rpcs_in_flight - 1;
>  		cli->cl_mod_tag_bitmap = kcalloc(BITS_TO_LONGS(OBD_MAX_RIF_MAX),
> diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
> index 19c5e9c..36cea8d 100644
> --- a/drivers/staging/lustre/lustre/llite/dir.c
> +++ b/drivers/staging/lustre/lustre/llite/dir.c
> @@ -1481,14 +1481,6 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  		return obd_iocontrol(cmd, sbi->ll_md_exp, 0, NULL,
>  				     (void __user *)arg);
>  	}
> -	case OBD_IOC_CHANGELOG_SEND:
> -	case OBD_IOC_CHANGELOG_CLEAR:
> -		if (!capable(CAP_SYS_ADMIN))
> -			return -EPERM;
> -
> -		rc = copy_and_ioctl(cmd, sbi->ll_md_exp, (void __user *)arg,
> -				    sizeof(struct ioc_changelog));
> -		return rc;
>  	case OBD_IOC_FID2PATH:
>  		return ll_fid2path(inode, (void __user *)arg);
>  	case LL_IOC_GETPARENT:
> diff --git a/drivers/staging/lustre/lustre/lmv/lmv_obd.c b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
> index 952c68e..32bb9fc 100644
> --- a/drivers/staging/lustre/lustre/lmv/lmv_obd.c
> +++ b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
> @@ -951,19 +951,6 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
>  		kfree(oqctl);
>  		break;
>  	}
> -	case OBD_IOC_CHANGELOG_SEND:
> -	case OBD_IOC_CHANGELOG_CLEAR: {
> -		struct ioc_changelog *icc = karg;
> -
> -		if (icc->icc_mdtindex >= count)
> -			return -ENODEV;
> -
> -		tgt = lmv->tgts[icc->icc_mdtindex];
> -		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
> -			return -ENODEV;
> -		rc = obd_iocontrol(cmd, tgt->ltd_exp, sizeof(*icc), icc, NULL);
> -		break;
> -	}
>  	case LL_IOC_GET_CONNECT_FLAGS: {
>  		tgt = lmv->tgts[0];
>  
> diff --git a/drivers/staging/lustre/lustre/mdc/Makefile b/drivers/staging/lustre/lustre/mdc/Makefile
> index 64cf49e..5f48e91 100644
> --- a/drivers/staging/lustre/lustre/mdc/Makefile
> +++ b/drivers/staging/lustre/lustre/mdc/Makefile
> @@ -2,4 +2,4 @@ ccflags-y += -I$(srctree)/drivers/staging/lustre/include
>  ccflags-y += -I$(srctree)/drivers/staging/lustre/lustre/include
>  
>  obj-$(CONFIG_LUSTRE_FS) += mdc.o
> -mdc-y := mdc_request.o mdc_reint.o mdc_lib.o mdc_locks.o lproc_mdc.o
> +mdc-y := mdc_changelog.o mdc_request.o mdc_reint.o mdc_lib.o mdc_locks.o lproc_mdc.o
> diff --git a/drivers/staging/lustre/lustre/mdc/mdc_changelog.c b/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
> new file mode 100644
> index 0000000..a5f3c64
> --- /dev/null
> +++ b/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
> @@ -0,0 +1,722 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * GPL HEADER START
> + *
> + * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 only,
> + * as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License version 2 for more details (a copy is included
> + * in the LICENSE file that accompanied this code).
> + *
> + * You should have received a copy of the GNU General Public License
> + * version 2 along with this program; If not, see
> + * http://www.gnu.org/licenses/gpl-2.0.html
> + *
> + * GPL HEADER END
> + */
> +/*
> + * Copyright (c) 2017, Commissariat a l'Energie Atomique et aux Energies
> + *                     Alternatives.
> + *
> + * Author: Henri Doreau <henri.doreau@cea.fr>
> + */
> +
> +#define DEBUG_SUBSYSTEM S_MDC
> +
> +#include <linux/init.h>
> +#include <linux/kthread.h>
> +#include <linux/poll.h>
> +#include <linux/miscdevice.h>
> +
> +#include <lustre_log.h>
> +
> +#include "mdc_internal.h"
> +
> +/*
> + * -- Changelog delivery through character device --
> + */
> +
> +/**
> + * Mutex to protect chlg_registered_devices below
> + */
> +static DEFINE_MUTEX(chlg_registered_dev_lock);
> +
> +/**
> + * Global linked list of all registered devices (one per MDT).
> + */
> +static LIST_HEAD(chlg_registered_devices);
> +
> +struct chlg_registered_dev {
> +	/* Device name of the form "changelog-{MDTNAME}" */
> +	char			ced_name[32];
> +	/* Misc device descriptor */
> +	struct miscdevice	ced_misc;
> +	/* OBDs referencing this device (multiple mount points) */
> +	struct list_head	ced_obds;
> +	/* Reference counter for proper deregistration */
> +	struct kref		ced_refs;
> +	/* Link within the global chlg_registered_devices */
> +	struct list_head	ced_link;
> +};
> +
> +struct chlg_reader_state {
> +	/* Shortcut to the corresponding OBD device */
> +	struct obd_device	*crs_obd;
> +	/* An error occurred that prevents further reading */
> +	bool			 crs_err;
> +	/* EOF, no more records available */
> +	bool			 crs_eof;
> +	/* Userland reader closed connection */
> +	bool			 crs_closed;
> +	/* Desired start position */
> +	u64			 crs_start_offset;
> +	/* Wait queue for the catalog processing thread */
> +	wait_queue_head_t	 crs_waitq_prod;
> +	/* Wait queue for the record copy threads */
> +	wait_queue_head_t	 crs_waitq_cons;
> +	/* Mutex protecting crs_rec_count and crs_rec_queue */
> +	struct mutex		 crs_lock;
> +	/* Number of items in the list */
> +	u64			 crs_rec_count;
> +	/* List of prefetched chlg_rec_entry::enq_linkage items */
> +	struct list_head	 crs_rec_queue;
> +};
> +
> +struct chlg_rec_entry {
> +	/* Link within the chlg_reader_state::crs_rec_queue list */
> +	struct list_head	enq_linkage;
> +	/* Data (enq_record) field length */
> +	u64			enq_length;
> +	/* Copy of a changelog record (see struct llog_changelog_rec) */
> +	struct changelog_rec	enq_record[];
> +};
> +
> +enum {
> +	/* Number of records to prefetch locally. */
> +	CDEV_CHLG_MAX_PREFETCH = 1024,
> +};
> +
> +/**
> + * ChangeLog catalog processing callback invoked on each record.
> + * If the current record is eligible to userland delivery, push
> + * it into the crs_rec_queue where the consumer code will fetch it.
> + *
> + * @param[in]     env  (unused)
> + * @param[in]     llh  Client-side handle used to identify the llog
> + * @param[in]     hdr  Header of the current llog record
> + * @param[in,out] data chlg_reader_state passed from caller
> + *
> + * @return 0 or LLOG_PROC_* control code on success, negated error on failure.
> + */
> +static int chlg_read_cat_process_cb(const struct lu_env *env,
> +				    struct llog_handle *llh,
> +				    struct llog_rec_hdr *hdr, void *data)
> +{
> +	struct llog_changelog_rec *rec;
> +	struct chlg_reader_state *crs = data;
> +	struct chlg_rec_entry *enq;
> +	size_t len;
> +	int rc;
> +
> +	LASSERT(crs);
> +	LASSERT(hdr);
> +
> +	rec = container_of(hdr, struct llog_changelog_rec, cr_hdr);
> +
> +	if (rec->cr_hdr.lrh_type != CHANGELOG_REC) {
> +		rc = -EINVAL;
> +		CERROR("%s: not a changelog rec %x/%d in llog : rc = %d\n",
> +		       crs->crs_obd->obd_name, rec->cr_hdr.lrh_type,
> +		       rec->cr.cr_type, rc);
> +		return rc;
> +	}
> +
> +	/* Skip undesired records */
> +	if (rec->cr.cr_index < crs->crs_start_offset)
> +		return 0;
> +
> +	CDEBUG(D_HSM, "%llu %02d%-5s %llu 0x%x t=" DFID " p=" DFID " %.*s\n",
> +	       rec->cr.cr_index, rec->cr.cr_type,
> +	       changelog_type2str(rec->cr.cr_type), rec->cr.cr_time,
> +	       rec->cr.cr_flags & CLF_FLAGMASK,
> +	       PFID(&rec->cr.cr_tfid), PFID(&rec->cr.cr_pfid),
> +	       rec->cr.cr_namelen, changelog_rec_name(&rec->cr));
> +
> +	wait_event_idle(crs->crs_waitq_prod,
> +			(crs->crs_rec_count < CDEV_CHLG_MAX_PREFETCH ||
> +			 crs->crs_closed));
> +
> +	if (crs->crs_closed)
> +		return LLOG_PROC_BREAK;
> +
> +	len = changelog_rec_size(&rec->cr) + rec->cr.cr_namelen;
> +	enq = kzalloc(sizeof(*enq) + len, GFP_KERNEL);
> +	if (!enq)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&enq->enq_linkage);
> +	enq->enq_length = len;
> +	memcpy(enq->enq_record, &rec->cr, len);
> +
> +	mutex_lock(&crs->crs_lock);
> +	list_add_tail(&enq->enq_linkage, &crs->crs_rec_queue);
> +	crs->crs_rec_count++;
> +	mutex_unlock(&crs->crs_lock);
> +
> +	wake_up_all(&crs->crs_waitq_cons);
> +
> +	return 0;
> +}
> +
> +/**
> + * Remove record from the list it is attached to and free it.
> + */
> +static void enq_record_delete(struct chlg_rec_entry *rec)
> +{
> +	list_del(&rec->enq_linkage);
> +	kfree(rec);
> +}
> +
> +/**
> + * Release resources associated with a chlg_reader_state instance.
> + *
> + * @param  crs  CRS instance to release.
> + */
> +static void crs_free(struct chlg_reader_state *crs)
> +{
> +	struct chlg_rec_entry *rec;
> +	struct chlg_rec_entry *tmp;
> +
> +	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage)
> +		enq_record_delete(rec);
> +
> +	kfree(crs);
> +}
> +
> +/**
> + * Record prefetch thread entry point. Opens the changelog catalog and starts
> + * reading records.
> + *
> + * @param[in,out]  args  chlg_reader_state passed from caller.
> + * @return 0 on success, negated error code on failure.
> + */
> +static int chlg_load(void *args)
> +{
> +	struct chlg_reader_state *crs = args;
> +	struct obd_device *obd = crs->crs_obd;
> +	struct llog_ctxt *ctx = NULL;
> +	struct llog_handle *llh = NULL;
> +	int rc;
> +
> +	ctx = llog_get_context(obd, LLOG_CHANGELOG_REPL_CTXT);
> +	if (!ctx) {
> +		rc = -ENOENT;
> +		goto err_out;
> +	}
> +
> +	rc = llog_open(NULL, ctx, &llh, NULL, CHANGELOG_CATALOG,
> +		       LLOG_OPEN_EXISTS);
> +	if (rc) {
> +		CERROR("%s: fail to open changelog catalog: rc = %d\n",
> +		       obd->obd_name, rc);
> +		goto err_out;
> +	}
> +
> +	rc = llog_init_handle(NULL, llh, LLOG_F_IS_CAT | LLOG_F_EXT_JOBID,
> +			      NULL);
> +	if (rc) {
> +		CERROR("%s: fail to init llog handle: rc = %d\n",
> +		       obd->obd_name, rc);
> +		goto err_out;
> +	}
> +
> +	rc = llog_cat_process(NULL, llh, chlg_read_cat_process_cb, crs, 0, 0);
> +	if (rc < 0) {
> +		CERROR("%s: fail to process llog: rc = %d\n",
> +		       obd->obd_name, rc);
> +		goto err_out;
> +	}
> +
> +err_out:
> +	crs->crs_err = true;
> +	wake_up_all(&crs->crs_waitq_cons);
> +
> +	if (llh)
> +		llog_cat_close(NULL, llh);
> +
> +	if (ctx)
> +		llog_ctxt_put(ctx);
> +
> +	wait_event_idle(crs->crs_waitq_prod, crs->crs_closed);
> +	crs_free(crs);
> +	return rc;
> +}
> +
> +/**
> + * Read handler, dequeues records from the chlg_reader_state if any.
> + * No partial records are copied to userland so this function can return less
> + * data than required (short read).
> + *
> + * @param[in]   file   File pointer to the character device.
> + * @param[out]  buff   Userland buffer where to copy the records.
> + * @param[in]   count  Userland buffer size.
> + * @param[out]  ppos   File position, updated with the index number of the next
> + *		       record to read.
> + * @return number of copied bytes on success, negated error code on failure.
> + */
> +static ssize_t chlg_read(struct file *file, char __user *buff, size_t count,
> +			 loff_t *ppos)
> +{
> +	struct chlg_reader_state *crs = file->private_data;
> +	struct chlg_rec_entry *rec;
> +	struct chlg_rec_entry *tmp;
> +	ssize_t  written_total = 0;
> +	LIST_HEAD(consumed);
> +
> +	if (file->f_flags & O_NONBLOCK && crs->crs_rec_count == 0)
> +		return -EAGAIN;
> +
> +	wait_event_idle(crs->crs_waitq_cons,
> +			crs->crs_rec_count > 0 || crs->crs_eof || crs->crs_err);
> +
> +	mutex_lock(&crs->crs_lock);
> +	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage) {
> +		if (written_total + rec->enq_length > count)
> +			break;
> +
> +		if (copy_to_user(buff, rec->enq_record, rec->enq_length)) {
> +			if (written_total == 0)
> +				written_total = -EFAULT;
> +			break;
> +		}
> +
> +		buff += rec->enq_length;
> +		written_total += rec->enq_length;
> +
> +		crs->crs_rec_count--;
> +		list_move_tail(&rec->enq_linkage, &consumed);
> +
> +		crs->crs_start_offset = rec->enq_record->cr_index + 1;
> +	}
> +	mutex_unlock(&crs->crs_lock);
> +
> +	if (written_total > 0)
> +		wake_up_all(&crs->crs_waitq_prod);
> +
> +	list_for_each_entry_safe(rec, tmp, &consumed, enq_linkage)
> +		enq_record_delete(rec);
> +
> +	*ppos = crs->crs_start_offset;
> +
> +	return written_total;
> +}
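A userspace model of the copy loop in chlg_read(), useful for reasoning about the "no partial records" rule; chlg_whole_records() is a hypothetical helper that works on record lengths only. One consequence worth checking: if the caller's buffer is smaller than the first queued record, nothing is copied and chlg_read() returns 0, which userland will treat as EOF.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk queued record lengths in order and stop at the first record that
 * would overflow a buffer of count bytes, as chlg_read() does.
 * Returns the number of whole records copied; *bytes gets the byte total.
 */
static size_t chlg_whole_records(const size_t *lens, size_t n, size_t count,
				 size_t *bytes)
{
	size_t total = 0;
	size_t i;

	assert(lens != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (total + lens[i] > count)
			break;	/* never split a record: short read */
		total += lens[i];
	}
	*bytes = total;
	return i;
}
```

With three 40-byte records and a 100-byte buffer, two records (80 bytes) are returned and the third stays queued for the next read().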
> +
> +/**
> + * Jump to a given record index. Helper for chlg_llseek().
> + *
> + * @param[in,out]  crs     Internal reader state.
> + * @param[in]      offset  Desired offset (index record).
> + * @return 0 on success, negated error code on failure.
> + */
> +static int chlg_set_start_offset(struct chlg_reader_state *crs, u64 offset)
> +{
> +	struct chlg_rec_entry *rec;
> +	struct chlg_rec_entry *tmp;
> +
> +	mutex_lock(&crs->crs_lock);
> +	if (offset < crs->crs_start_offset) {
> +		mutex_unlock(&crs->crs_lock);
> +		return -ERANGE;
> +	}
> +
> +	crs->crs_start_offset = offset;
> +	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage) {
> +		struct changelog_rec *cr = rec->enq_record;
> +
> +		if (cr->cr_index >= crs->crs_start_offset)
> +			break;
> +
> +		crs->crs_rec_count--;
> +		enq_record_delete(rec);
> +	}
> +
> +	mutex_unlock(&crs->crs_lock);
> +	wake_up_all(&crs->crs_waitq_prod);
> +	return 0;
> +}
> +
> +/**
> + * Move read pointer to a certain record index, encoded as an offset.
> + *
> + * @param[in,out] file   File pointer to the changelog character device
> + * @param[in]	  off    Offset to skip, actually a record index, not byte count
> + * @param[in]	  whence Relative/Absolute interpretation of the offset
> + * @return the resulting position on success or negated error code on failure.
> + */
> +static loff_t chlg_llseek(struct file *file, loff_t off, int whence)
> +{
> +	struct chlg_reader_state *crs = file->private_data;
> +	loff_t pos;
> +	int rc;
> +
> +	switch (whence) {
> +	case SEEK_SET:
> +		pos = off;
> +		break;
> +	case SEEK_CUR:
> +		pos = file->f_pos + off;
> +		break;
> +	case SEEK_END:
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	/* We cannot go backward */
> +	if (pos < file->f_pos)
> +		return -EINVAL;
> +
> +	rc = chlg_set_start_offset(crs, pos);
> +	if (rc != 0)
> +		return rc;
> +
> +	file->f_pos = pos;
> +	return pos;
> +}
> +
> +/**
> + * Clear record range for a given changelog reader.
> + *
> + * @param[in]  crs     Current internal state.
> + * @param[in]  reader  Changelog reader ID (cl1, cl2...)
> + * @param[in]  record  Record index up to which to clear
> + * @return 0 on success, negated error code on failure.
> + */
> +static int chlg_clear(struct chlg_reader_state *crs, u32 reader, u64 record)
> +{
> +	struct obd_device *obd = crs->crs_obd;
> +	struct changelog_setinfo cs  = {
> +		.cs_recno = record,
> +		.cs_id    = reader
> +	};
> +
> +	return obd_set_info_async(NULL, obd->obd_self_export,
> +				  strlen(KEY_CHANGELOG_CLEAR),
> +				  KEY_CHANGELOG_CLEAR, sizeof(cs), &cs, NULL);
> +}
> +
> +/** Maximum changelog control command size */
> +#define CHLG_CONTROL_CMD_MAX	64
> +
> +/**
> + * Handle write() calls on the changelog character device. write() can be used
> + * to request special control operations.
> + *
> + * @param[in]  file  File pointer to the changelog character device
> + * @param[in]  buff  User supplied data (written data)
> + * @param[in]  count Number of written bytes
> + * @param[in]  off   (unused)
> + * @return number of written bytes on success, negated error code on failure.
> + */
> +static ssize_t chlg_write(struct file *file, const char __user *buff,
> +			  size_t count, loff_t *off)
> +{
> +	struct chlg_reader_state *crs = file->private_data;
> +	char *kbuf;
> +	u64 record;
> +	u32 reader;
> +	int rc = 0;
> +
> +	if (count > CHLG_CONTROL_CMD_MAX)
> +		return -EINVAL;
> +
> +	kbuf = kzalloc(CHLG_CONTROL_CMD_MAX, GFP_KERNEL);
> +	if (!kbuf)
> +		return -ENOMEM;
> +
> +	if (copy_from_user(kbuf, buff, count)) {
> +		rc = -EFAULT;
> +		goto out_kbuf;
> +	}
> +
> +	kbuf[CHLG_CONTROL_CMD_MAX - 1] = '\0';
> +
> +	if (sscanf(kbuf, "clear:cl%u:%llu", &reader, &record) == 2)
> +		rc = chlg_clear(crs, reader, record);
> +	else
> +		rc = -EINVAL;
> +
> +out_kbuf:
> +	kfree(kbuf);
> +	return rc < 0 ? rc : count;
> +}
> +
> +/**
> + * Find the OBD device associated with a changelog character device.
> + * @param[in]  cdev  character device instance descriptor
> + * @return corresponding OBD device or NULL if none was found.
> + */
> +static struct obd_device *chlg_obd_get(dev_t cdev)
> +{
> +	int minor = MINOR(cdev);
> +	struct obd_device *obd = NULL;
> +	struct chlg_registered_dev *curr;
> +
> +	mutex_lock(&chlg_registered_dev_lock);
> +	list_for_each_entry(curr, &chlg_registered_devices, ced_link) {
> +		if (curr->ced_misc.minor == minor) {
> +			/* take the first available OBD device attached */
> +			obd = list_first_entry(&curr->ced_obds,
> +					       struct obd_device,
> +					       u.cli.cl_chg_dev_linkage);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&chlg_registered_dev_lock);
> +	return obd;
> +}
> +
> +/**
> + * Open handler, initialize internal CRS state and spawn prefetch thread if
> + * needed.
> + * @param[in]  inode  Inode struct for the open character device.
> + * @param[in]  file   Corresponding file pointer.
> + * @return 0 on success, negated error code on failure.
> + */
> +static int chlg_open(struct inode *inode, struct file *file)
> +{
> +	struct chlg_reader_state *crs;
> +	struct obd_device *obd = chlg_obd_get(inode->i_rdev);
> +	struct task_struct *task;
> +	int rc;
> +
> +	if (!obd)
> +		return -ENODEV;
> +
> +	crs = kzalloc(sizeof(*crs), GFP_KERNEL);
> +	if (!crs)
> +		return -ENOMEM;
> +
> +	crs->crs_obd = obd;
> +	crs->crs_err = false;
> +	crs->crs_eof = false;
> +	crs->crs_closed = false;
> +
> +	mutex_init(&crs->crs_lock);
> +	INIT_LIST_HEAD(&crs->crs_rec_queue);
> +	init_waitqueue_head(&crs->crs_waitq_prod);
> +	init_waitqueue_head(&crs->crs_waitq_cons);
> +
> +	if (file->f_mode & FMODE_READ) {
> +		task = kthread_run(chlg_load, crs, "chlg_load_thread");
> +		if (IS_ERR(task)) {
> +			rc = PTR_ERR(task);
> +			CERROR("%s: cannot start changelog thread: rc = %d\n",
> +			       obd->obd_name, rc);
> +			goto err_crs;
> +		}
> +	}
> +
> +	file->private_data = crs;
> +	return 0;
> +
> +err_crs:
> +	kfree(crs);
> +	return rc;
> +}
> +
> +/**
> + * Close handler, release resources.
> + *
> + * @param[in]  inode  Inode struct for the open character device.
> + * @param[in]  file   Corresponding file pointer.
> + * @return 0 on success, negated error code on failure.
> + */
> +static int chlg_release(struct inode *inode, struct file *file)
> +{
> +	struct chlg_reader_state *crs = file->private_data;
> +
> +	if (file->f_mode & FMODE_READ) {
> +		crs->crs_closed = true;
> +		wake_up_all(&crs->crs_waitq_prod);
> +	} else {
> +		/* No producer thread, release resource ourselves */
> +		crs_free(crs);
> +	}
> +	return 0;
> +}
> +
> +/**
> + * Poll handler, indicates whether the device is readable (new records) and
> + * writable (always).
> + *
> + * @param[in]  file   Device file pointer.
> + * @param[in]  wait   (opaque)
> + * @return combination of the poll status flags.
> + */
> +static unsigned int chlg_poll(struct file *file, poll_table *wait)
> +{
> +	struct chlg_reader_state *crs  = file->private_data;
> +	unsigned int mask = 0;
> +
> +	mutex_lock(&crs->crs_lock);
> +	poll_wait(file, &crs->crs_waitq_cons, wait);
> +	if (crs->crs_rec_count > 0)
> +		mask |= POLLIN | POLLRDNORM;
> +	if (crs->crs_err)
> +		mask |= POLLERR;
> +	if (crs->crs_eof)
> +		mask |= POLLHUP;
> +	mutex_unlock(&crs->crs_lock);
> +	return mask;
> +}
> +
> +static const struct file_operations chlg_fops = {
> +	.owner		= THIS_MODULE,
> +	.llseek		= chlg_llseek,
> +	.read		= chlg_read,
> +	.write		= chlg_write,
> +	.open		= chlg_open,
> +	.release	= chlg_release,
> +	.poll		= chlg_poll,
> +};
> +
> +/**
> + * This uses obd_name of the form: "testfs-MDT0000-mdc-ffff88006501600"
> + * and returns a name of the form: "changelog-testfs-MDT0000".
> + */
> +static void get_chlg_name(char *name, size_t name_len, struct obd_device *obd)
> +{
> +	int i;
> +
> +	snprintf(name, name_len, "changelog-%s", obd->obd_name);
> +
> +	/* Find the 2nd '-' from the end and truncate on it */
> +	for (i = 0; i < 2; i++) {
> +		char *p = strrchr(name, '-');
> +
> +		if (!p)
> +			return;
> +		*p = '\0';
> +	}
> +}
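A userspace copy of get_chlg_name() for checking the naming rule; chlg_name_from_obd() is a hypothetical name. Since ced_name is 32 bytes, snprintf() already truncates the "-mdc-<instance>" suffix, and the two strrchr() cuts still land correctly as long as the fsname is at most 8 characters (LUSTRE_MAXFSNAME) and the MDT index has four digits: "changelog-" plus an 8-character fsname plus "-MDT0000-mdc-" is exactly 31 bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "changelog-<fsname>-MDTxxxx" from an MDC obd_name such as
 * "testfs-MDT0000-mdc-ffff88006501600" by prefixing "changelog-" and
 * truncating at the second '-' from the end.
 */
static void chlg_name_from_obd(char *name, size_t name_len,
			       const char *obd_name)
{
	int i;

	assert(name != NULL && name_len > 0);
	snprintf(name, name_len, "changelog-%s", obd_name);
	for (i = 0; i < 2; i++) {
		char *p = strrchr(name, '-');

		if (!p)
			return;
		*p = '\0';
	}
}
```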
> +
> +/**
> + * Find a changelog character device by name.
> + * All devices registered during MDC setup are listed in a global list with
> + * their names attached.
> + */
> +static struct chlg_registered_dev *
> +chlg_registered_dev_find_by_name(const char *name)
> +{
> +	struct chlg_registered_dev *dit;
> +
> +	list_for_each_entry(dit, &chlg_registered_devices, ced_link)
> +		if (strcmp(name, dit->ced_name) == 0)
> +			return dit;
> +	return NULL;
> +}
> +
> +/**
> + * Find chlg_registered_dev structure for a given OBD device.
> + * This is bad O(n^2) but for each filesystem:
> + *   - N is # of MDTs times # of mount points
> + *   - this only runs at shutdown
> + */
> +static struct chlg_registered_dev *
> +chlg_registered_dev_find_by_obd(const struct obd_device *obd)
> +{
> +	struct chlg_registered_dev *dit;
> +	struct obd_device *oit;
> +
> +	list_for_each_entry(dit, &chlg_registered_devices, ced_link)
> +		list_for_each_entry(oit, &dit->ced_obds,
> +				    u.cli.cl_chg_dev_linkage)
> +			if (oit == obd)
> +				return dit;
> +	return NULL;
> +}
> +
> +/**
> + * Changelog character device initialization.
> + * Register a misc character device with a dynamic minor number, under a name
> + * of the form: 'changelog-fsname-MDTxxxx'. Reference this OBD device with it.
> + *
> + * @param[in] obd  This MDC obd_device.
> + * @return 0 on success, negated error code on failure.
> + */
> +int mdc_changelog_cdev_init(struct obd_device *obd)
> +{
> +	struct chlg_registered_dev *exist;
> +	struct chlg_registered_dev *entry;
> +	int rc;
> +
> +	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
> +	if (!entry)
> +		return -ENOMEM;
> +
> +	get_chlg_name(entry->ced_name, sizeof(entry->ced_name), obd);
> +
> +	entry->ced_misc.minor = MISC_DYNAMIC_MINOR;
> +	entry->ced_misc.name  = entry->ced_name;
> +	entry->ced_misc.fops  = &chlg_fops;
> +
> +	kref_init(&entry->ced_refs);
> +	INIT_LIST_HEAD(&entry->ced_obds);
> +	INIT_LIST_HEAD(&entry->ced_link);
> +
> +	mutex_lock(&chlg_registered_dev_lock);
> +	exist = chlg_registered_dev_find_by_name(entry->ced_name);
> +	if (exist) {
> +		kref_get(&exist->ced_refs);
> +		list_add_tail(&obd->u.cli.cl_chg_dev_linkage, &exist->ced_obds);
> +		rc = 0;
> +		goto out_unlock;
> +	}
> +
> +	/* Register new character device */
> +	rc = misc_register(&entry->ced_misc);
> +	if (rc != 0)
> +		goto out_unlock;
> +
> +	list_add_tail(&obd->u.cli.cl_chg_dev_linkage, &entry->ced_obds);
> +	list_add_tail(&entry->ced_link, &chlg_registered_devices);
> +
> +	entry = NULL;	/* prevent it from being freed below */
> +
> +out_unlock:
> +	mutex_unlock(&chlg_registered_dev_lock);
> +	kfree(entry);
> +	return rc;
> +}
> +
> +/**
> + * Deregister a changelog character device whose refcount has reached zero.
> + */
> +static void chlg_dev_clear(struct kref *kref)
> +{
> +	struct chlg_registered_dev *entry = container_of(kref,
> +							 struct chlg_registered_dev,
> +							 ced_refs);
> +	list_del(&entry->ced_link);
> +	misc_deregister(&entry->ced_misc);
> +	kfree(entry);
> +}
> +
> +/**
> + * Release OBD, decrease reference count of the corresponding changelog device.
> + */
> +void mdc_changelog_cdev_finish(struct obd_device *obd)
> +{
> +	struct chlg_registered_dev *dev = chlg_registered_dev_find_by_obd(obd);
> +
> +	mutex_lock(&chlg_registered_dev_lock);
> +	list_del_init(&obd->u.cli.cl_chg_dev_linkage);
> +	kref_put(&dev->ced_refs, chlg_dev_clear);
> +	mutex_unlock(&chlg_registered_dev_lock);
> +}
> diff --git a/drivers/staging/lustre/lustre/mdc/mdc_internal.h b/drivers/staging/lustre/lustre/mdc/mdc_internal.h
> index 941a896..6da9046 100644
> --- a/drivers/staging/lustre/lustre/mdc/mdc_internal.h
> +++ b/drivers/staging/lustre/lustre/mdc/mdc_internal.h
> @@ -129,6 +129,10 @@ enum ldlm_mode mdc_lock_match(struct obd_export *exp, __u64 flags,
>  			      enum ldlm_mode mode,
>  			      struct lustre_handle *lockh);
>  
> +int mdc_changelog_cdev_init(struct obd_device *obd);
> +
> +void mdc_changelog_cdev_finish(struct obd_device *obd);
> +
>  static inline int mdc_prep_elc_req(struct obd_export *exp,
>  				   struct ptlrpc_request *req, int opc,
>  				   struct list_head *cancels, int count)
> diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
> index 8f8e3d2..3692b1c 100644
> --- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
> +++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
> @@ -35,7 +35,6 @@
>  
>  # include <linux/module.h>
>  # include <linux/pagemap.h>
> -# include <linux/miscdevice.h>
>  # include <linux/init.h>
>  # include <linux/utsname.h>
>  # include <linux/file.h>
> @@ -1810,174 +1809,6 @@ static int mdc_ioc_hsm_request(struct obd_export *exp,
>  	return rc;
>  }
>  
> -static struct kuc_hdr *changelog_kuc_hdr(char *buf, size_t len, u32 flags)
> -{
> -	struct kuc_hdr *lh = (struct kuc_hdr *)buf;
> -
> -	LASSERT(len <= KUC_CHANGELOG_MSG_MAXSIZE);
> -
> -	lh->kuc_magic = KUC_MAGIC;
> -	lh->kuc_transport = KUC_TRANSPORT_CHANGELOG;
> -	lh->kuc_flags = flags;
> -	lh->kuc_msgtype = CL_RECORD;
> -	lh->kuc_msglen = len;
> -	return lh;
> -}
> -
> -struct changelog_show {
> -	__u64		cs_startrec;
> -	enum changelog_send_flag	cs_flags;
> -	struct file	*cs_fp;
> -	char		*cs_buf;
> -	struct obd_device *cs_obd;
> -};
> -
> -static inline char *cs_obd_name(struct changelog_show *cs)
> -{
> -	return cs->cs_obd->obd_name;
> -}
> -
> -static int changelog_kkuc_cb(const struct lu_env *env, struct llog_handle *llh,
> -			     struct llog_rec_hdr *hdr, void *data)
> -{
> -	struct changelog_show *cs = data;
> -	struct llog_changelog_rec *rec = (struct llog_changelog_rec *)hdr;
> -	struct kuc_hdr *lh;
> -	size_t len;
> -	int rc;
> -
> -	if (rec->cr_hdr.lrh_type != CHANGELOG_REC) {
> -		rc = -EINVAL;
> -		CERROR("%s: not a changelog rec %x/%d: rc = %d\n",
> -		       cs_obd_name(cs), rec->cr_hdr.lrh_type,
> -		       rec->cr.cr_type, rc);
> -		return rc;
> -	}
> -
> -	if (rec->cr.cr_index < cs->cs_startrec) {
> -		/* Skip entries earlier than what we are interested in */
> -		CDEBUG(D_HSM, "rec=%llu start=%llu\n",
> -		       rec->cr.cr_index, cs->cs_startrec);
> -		return 0;
> -	}
> -
> -	CDEBUG(D_HSM, "%llu %02d%-5s %llu 0x%x t=" DFID " p=" DFID
> -		" %.*s\n", rec->cr.cr_index, rec->cr.cr_type,
> -		changelog_type2str(rec->cr.cr_type), rec->cr.cr_time,
> -		rec->cr.cr_flags & CLF_FLAGMASK,
> -		PFID(&rec->cr.cr_tfid), PFID(&rec->cr.cr_pfid),
> -		rec->cr.cr_namelen, changelog_rec_name(&rec->cr));
> -
> -	len = sizeof(*lh) + changelog_rec_size(&rec->cr) + rec->cr.cr_namelen;
> -
> -	/* Set up the message */
> -	lh = changelog_kuc_hdr(cs->cs_buf, len, cs->cs_flags);
> -	memcpy(lh + 1, &rec->cr, len - sizeof(*lh));
> -
> -	rc = libcfs_kkuc_msg_put(cs->cs_fp, lh);
> -	CDEBUG(D_HSM, "kucmsg fp %p len %zu rc %d\n", cs->cs_fp, len, rc);
> -
> -	return rc;
> -}
> -
> -static int mdc_changelog_send_thread(void *csdata)
> -{
> -	enum llog_flag flags = LLOG_F_IS_CAT;
> -	struct changelog_show *cs = csdata;
> -	struct llog_ctxt *ctxt = NULL;
> -	struct llog_handle *llh = NULL;
> -	struct kuc_hdr *kuch;
> -	int rc;
> -
> -	CDEBUG(D_HSM, "changelog to fp=%p start %llu\n",
> -	       cs->cs_fp, cs->cs_startrec);
> -
> -	cs->cs_buf = kzalloc(KUC_CHANGELOG_MSG_MAXSIZE, GFP_NOFS);
> -	if (!cs->cs_buf) {
> -		rc = -ENOMEM;
> -		goto out;
> -	}
> -
> -	/* Set up the remote catalog handle */
> -	ctxt = llog_get_context(cs->cs_obd, LLOG_CHANGELOG_REPL_CTXT);
> -	if (!ctxt) {
> -		rc = -ENOENT;
> -		goto out;
> -	}
> -	rc = llog_open(NULL, ctxt, &llh, NULL, CHANGELOG_CATALOG,
> -		       LLOG_OPEN_EXISTS);
> -	if (rc) {
> -		CERROR("%s: fail to open changelog catalog: rc = %d\n",
> -		       cs_obd_name(cs), rc);
> -		goto out;
> -	}
> -
> -	if (cs->cs_flags & CHANGELOG_FLAG_JOBID)
> -		flags |= LLOG_F_EXT_JOBID;
> -
> -	rc = llog_init_handle(NULL, llh, flags, NULL);
> -	if (rc) {
> -		CERROR("llog_init_handle failed %d\n", rc);
> -		goto out;
> -	}
> -
> -	rc = llog_cat_process(NULL, llh, changelog_kkuc_cb, cs, 0, 0);
> -
> -	/* Send EOF no matter what our result */
> -	kuch = changelog_kuc_hdr(cs->cs_buf, sizeof(*kuch), cs->cs_flags);
> -	kuch->kuc_msgtype = CL_EOF;
> -	libcfs_kkuc_msg_put(cs->cs_fp, kuch);
> -
> -out:
> -	fput(cs->cs_fp);
> -	if (llh)
> -		llog_cat_close(NULL, llh);
> -	if (ctxt)
> -		llog_ctxt_put(ctxt);
> -	kfree(cs->cs_buf);
> -	kfree(cs);
> -	return rc;
> -}
> -
> -static int mdc_ioc_changelog_send(struct obd_device *obd,
> -				  struct ioc_changelog *icc)
> -{
> -	struct changelog_show *cs;
> -	struct task_struct *task;
> -	int rc;
> -
> -	/* Freed in mdc_changelog_send_thread */
> -	cs = kzalloc(sizeof(*cs), GFP_NOFS);
> -	if (!cs)
> -		return -ENOMEM;
> -
> -	cs->cs_obd = obd;
> -	cs->cs_startrec = icc->icc_recno;
> -	/* matching fput in mdc_changelog_send_thread */
> -	cs->cs_fp = fget(icc->icc_id);
> -	cs->cs_flags = icc->icc_flags;
> -
> -	/*
> -	 * New thread because we should return to user app before
> -	 * writing into our pipe
> -	 */
> -	task = kthread_run(mdc_changelog_send_thread, cs,
> -			   "mdc_clg_send_thread");
> -	if (IS_ERR(task)) {
> -		rc = PTR_ERR(task);
> -		CERROR("%s: can't start changelog thread: rc = %d\n",
> -		       cs_obd_name(cs), rc);
> -		kfree(cs);
> -	} else {
> -		rc = 0;
> -		CDEBUG(D_HSM, "%s: started changelog thread\n",
> -		       cs_obd_name(cs));
> -	}
> -
> -	CERROR("Failed to start changelog thread: %d\n", rc);
> -	return rc;
> -}
> -
>  static int mdc_ioc_hsm_ct_start(struct obd_export *exp,
>  				struct lustre_kernelcomm *lk);
>  
> @@ -2087,21 +1918,6 @@ static int mdc_iocontrol(unsigned int cmd, struct obd_export *exp, int len,
>  		return -EINVAL;
>  	}
>  	switch (cmd) {
> -	case OBD_IOC_CHANGELOG_SEND:
> -		rc = mdc_ioc_changelog_send(obd, karg);
> -		goto out;
> -	case OBD_IOC_CHANGELOG_CLEAR: {
> -		struct ioc_changelog *icc = karg;
> -		struct changelog_setinfo cs = {
> -			.cs_recno = icc->icc_recno,
> -			.cs_id = icc->icc_id
> -		};
> -
> -		rc = obd_set_info_async(NULL, exp, strlen(KEY_CHANGELOG_CLEAR),
> -					KEY_CHANGELOG_CLEAR, sizeof(cs), &cs,
> -					NULL);
> -		goto out;
> -	}
>  	case OBD_IOC_FID2PATH:
>  		rc = mdc_ioc_fid2path(exp, karg);
>  		goto out;
> @@ -2670,12 +2486,22 @@ static int mdc_setup(struct obd_device *obd, struct lustre_cfg *cfg)
>  
>  	rc = mdc_llog_init(obd);
>  	if (rc) {
> -		CERROR("failed to setup llogging subsystems\n");
> +		CERROR("%s: failed to setup llogging subsystems: rc = %d\n",
> +		       obd->obd_name, rc);
>  		goto err_llog_cleanup;
>  	}
>  
> +	rc = mdc_changelog_cdev_init(obd);
> +	if (rc) {
> +		CERROR("%s: failed to setup changelog char device: rc = %d\n",
> +		       obd->obd_name, rc);
> +		goto err_changelog_cleanup;
> +	}
> +
>  	return 0;
>  
> +err_changelog_cleanup:
> +	mdc_llog_finish(obd);
>  err_llog_cleanup:
>  	ldebugfs_free_md_stats(obd);
>  	ptlrpc_lprocfs_unregister_obd(obd);
> @@ -2714,6 +2540,8 @@ static int mdc_precleanup(struct obd_device *obd)
>  	if (obd->obd_type->typ_refcnt <= 1)
>  		libcfs_kkuc_group_rem(0, KUC_GRP_HSM);
>  
> +	mdc_changelog_cdev_finish(obd);
> +
>  	obd_cleanup_client_import(obd);
>  	ptlrpc_lprocfs_unregister_obd(obd);
>  	lprocfs_obd_cleanup(obd);
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [lustre-devel] [PATCH 16/28] lustre: statahead: missing barrier before wake_up
  2018-10-22  4:04       ` NeilBrown
@ 2018-11-04 20:52         ` James Simmons
  0 siblings, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-11-04 20:52 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Oct 21 2018, James Simmons wrote:
> 
> >> On Sun, Oct 14 2018, James Simmons wrote:
> >> 
> >> > From: Lai Siyao <lai.siyao@whamcloud.com>
> >> >
> >> > A barrier is missing before wake_up() in ll_statahead_interpret(),
> >> > which may cause 'ls' hang. Under the right conditions a basic 'ls'
> >> > can fail. The debug logs show:
> >> >
> >> > statahead.c:683:ll_statahead_interpret()) sa_entry software rc -13
> >> > statahead.c:1666:ll_statahead()) revalidate statahead software: -11.
> >> >
> >> > Obviously statahead failure didn't notify 'ls' process in time.
> >> > The mi_cbdata can be stale so add a barrier before calling
> >> > wake_up().
> >> >
> >> > Signed-off-by: Lai Siyao <lai.siyao@whamcloud.com>
> >> > Signed-off-by: Bob Glossman <bob.glossman@intel.com>
> >> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9210
> >> > Reviewed-on: https://review.whamcloud.com/27330
> >> > Reviewed-by: Nathaniel Clark <nclark@whamcloud.com>
> >> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> >> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> >> > ---
> >> >  drivers/staging/lustre/lustre/llite/statahead.c | 8 +++++++-
> >> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/drivers/staging/lustre/lustre/llite/statahead.c b/drivers/staging/lustre/lustre/llite/statahead.c
> >> > index 1ad308c..0174a4c 100644
> >> > --- a/drivers/staging/lustre/lustre/llite/statahead.c
> >> > +++ b/drivers/staging/lustre/lustre/llite/statahead.c
> >> > @@ -680,8 +680,14 @@ static int ll_statahead_interpret(struct ptlrpc_request *req,
> >> >  
> >> >  	spin_lock(&lli->lli_sa_lock);
> >> >  	if (rc) {
> >> > -		if (__sa_make_ready(sai, entry, rc))
> >> > +		if (__sa_make_ready(sai, entry, rc)) {
> >> > +			/* LU-9210 : Under the right conditions even 'ls'
> >> > +			 * can cause the statahead to fail. Using a memory
> >> > +			 * barrier resolves this issue.
> >> > +			 */
> >> > +			smp_mb();
> >> >  			wake_up(&sai->sai_waitq);
> >> > +		}
> >> >  	} else {
> >> >  		int first = 0;
> >> >  		entry->se_minfo = minfo;
> >> > -- 
> >> > 1.8.3.1
> >> 
> >> Again, this is a fairly lame comment to justify the smp_mb().
> >> It appears to me that the issue is most likely the value of
> >> entry->se_state.
> >> __sa_make_ready() sets this and revalidate_statahead_dentry tests it
> >> after waiting on sai_waitq.
> >> So I think it would be best if we changed __sa_make_ready() to
> >> 
> >> 	smp_store_release(&entry->se_state, ret < 0 ? SA_ENTRY_INVA : SA_ENTRY_SUCC)
> >> 
> >> and in revalidate_statahead_dentry() have
> >> 
> >> 	if (smp_load_acquire(&entry->se_state) == SA_ENTRY_SUCC &&
> >>             entry->se_inode) {
> >> 
> >> This would make it obvious which variable was important, and would show
> >> the paired synchronization points.
> >
> > If you think this is lame you should see the JIRA ticket and the original
> > patch. It had zero info so I attempted to extract what I could out of the
> > ticket. Hopefully Lai can fill in the details. I have no problems fixing
> > this another way. I don't see a way in the ticket to easily reproduce this
> > problem to see if the new approach would fix it :-(
> 
> I have imposed the version that I think is correct.  See below.

I opened a ticket - https://jira.whamcloud.com/browse/LU-11616 and pushed 
this to the OpenSFS branch for full testing. The patch is at:

https://review.whamcloud.com/#/c/33571

BTW I did not know about total store order (TSO) platforms and how some
architectures don't provide this property. An audit of the smp barriers
might be a good idea for Lustre. For people not aware of TSO, a good
article on this is at:

https://lwn.net/Articles/576486
 
> Thanks,
> NeilBrown
> 
> --- a/drivers/staging/lustre/lustre/llite/statahead.c
> +++ b/drivers/staging/lustre/lustre/llite/statahead.c
> @@ -322,7 +322,11 @@ __sa_make_ready(struct ll_statahead_info *sai, struct sa_entry *entry, int ret)
>  		}
>  	}
>  	list_add(&entry->se_list, pos);
> -	entry->se_state = ret < 0 ? SA_ENTRY_INVA : SA_ENTRY_SUCC;
> +	/*
> +	 * LU-9210: revalidate_statahead_dentry() must be able to see this before
> +	 * we wake it up
> +	 */
> +	smp_store_release(&entry->se_state, ret < 0 ? SA_ENTRY_INVA : SA_ENTRY_SUCC);
>  
>  	return (index == sai->sai_index_wait);
>  }
> @@ -1390,7 +1394,12 @@ static int revalidate_statahead_dentry(struct inode *dir,
>  		}
>  	}
>  
> -	if (entry->se_state == SA_ENTRY_SUCC && entry->se_inode) {
> +	/*
> +	 * We need to see the value that was set immediately before we
> +	 * were woken up.
> +	 */
> +	if (smp_load_acquire(&entry->se_state) == SA_ENTRY_SUCC &&
> +	    entry->se_inode) {
>  		struct inode *inode = entry->se_inode;
>  		struct lookup_intent it = { .it_op = IT_GETATTR,
>  					    .it_lock_handle = entry->se_handle };
> 
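For readers less familiar with the kernel primitives: smp_store_release() and smp_load_acquire() pair up much like C11 release stores and acquire loads. The userspace sketch below is only an analogy, not Lustre code. demo_entry, demo_make_ready(), demo_wait_and_read() and demo_publish_and_wait() are hypothetical names standing in for sa_entry, __sa_make_ready() and revalidate_statahead_dentry(), and a busy-wait replaces sleeping on sai_waitq. It shows the property the patch relies on: once the reader's acquire load observes the published state, it also observes every write the writer made before its release store.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

enum { SA_STATE_INIT = 0, SA_STATE_SUCC = 1, SA_STATE_INVA = 2 };

/* Hypothetical stand-in for struct sa_entry. */
struct demo_entry {
	int payload;		/* plays the role of entry->se_inode */
	atomic_int state;	/* plays the role of entry->se_state */
};

/*
 * Writer side, like __sa_make_ready(): fill in the payload first, then
 * publish it with a release store so the payload write cannot be
 * reordered after the state change.
 */
static void *demo_make_ready(void *arg)
{
	struct demo_entry *e = arg;

	e->payload = 42;
	atomic_store_explicit(&e->state, SA_STATE_SUCC, memory_order_release);
	return NULL;
}

/*
 * Reader side, like revalidate_statahead_dentry(): an acquire load that
 * observes SA_STATE_SUCC guarantees the payload write is visible as well.
 */
static int demo_wait_and_read(struct demo_entry *e)
{
	int s;

	/* Busy-wait stands in for sleeping on sai_waitq. */
	while ((s = atomic_load_explicit(&e->state, memory_order_acquire)) ==
	       SA_STATE_INIT) {
		/* spin */
	}
	assert(s == SA_STATE_SUCC || s == SA_STATE_INVA);
	return s == SA_STATE_SUCC ? e->payload : -1;
}

/* Runs one writer thread against the reader and returns what it saw. */
static int demo_publish_and_wait(void)
{
	struct demo_entry e;
	pthread_t t;
	int v;

	e.payload = 0;
	atomic_init(&e.state, SA_STATE_INIT);
	if (pthread_create(&t, NULL, demo_make_ready, &e) != 0)
		return -2;
	v = demo_wait_and_read(&e);
	pthread_join(t, NULL);
	return v;
}
```

Without the release/acquire pair (plain int accesses), a weakly ordered CPU could let the reader see SA_STATE_SUCC while still reading a stale payload, which is the kind of failure LU-9210 describes.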


* [lustre-devel] [PATCH 12/28] lustre: ptlrpc: do not wakeup every second
  2018-10-29  2:41       ` NeilBrown
  2018-10-29  3:42         ` James Simmons
@ 2018-11-04 20:53         ` James Simmons
  1 sibling, 0 replies; 69+ messages in thread
From: James Simmons @ 2018-11-04 20:53 UTC (permalink / raw)
  To: lustre-devel


> > Neil,
> >
> > Does your statement imply this would spin?  It definitely doesn't just
> > spin (that behavior in a main "wait for work" spot of a (depending on
> > settings) ~per-CPU daemon would render systems unusable and this patch
> > has been in testing for a while).  So what is the detailed behavior of
> > a "timeout that expires immediately"?
> 
> Hi Patrick,
>  it definitely spins for me.
> 
>  I should have clarified that the SFS patch
> 
>    e81847bd0651 LU-9660 ptlrpc: do not wakeup every second
> 
>  is correct, as __l_wait_event() treats a timeout value of 0 as meaning an
>  indefinite timeout.
>  The error was in the conversion to wait_event_idle_timeout().  The
>  various wait_event*timeout() functions treat 0 literally, as zero
>  jiffies, so the wait expires immediately.
>  If you want to not have a timeout, you need to not use the *_timeout()
>  version.
>  If a timeout is undesirable rather than fatal, then
>  MAX_SCHEDULE_TIMEOUT can be used.  In this case, that seemed best.
> 
> Thanks,
> NeilBrown
> 
> 
> >
> > - Patrick
> >
> >
> > ________________________________
> > From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of NeilBrown <neilb@suse.com>
> > Sent: Sunday, October 28, 2018 7:03:02 PM
> > To: James Simmons; Andreas Dilger; Oleg Drokin
> > Cc: Lustre Development List
> > Subject: Re: [lustre-devel] [PATCH 12/28] lustre: ptlrpc: do not wakeup every second
> >
> > On Sun, Oct 14 2018, James Simmons wrote:
> >
> >> From: Alex Zhuravlev <bzzz@whamcloud.com>
> >>
> >> Even if there are no RPC requests on the set, there is no need to
> >> wake up every second. The thread is woken up when a request is added
> >> to the set or when the STOP bit is set, so it is sufficient to only
> >> wake up when there are requests on the set to worry about.
> >>
> >> Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
> >> WC-bug-id: https://jira.whamcloud.com/browse/LU-9660
> >> Reviewed-on: https://review.whamcloud.com/28776
> >> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> >> Reviewed-by: Patrick Farrell <paf@cray.com>
> >> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> >> Signed-off-by: James Simmons <jsimmons@infradead.org>
> >> ---
> >>  drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c | 4 ++--
> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
> >> index c201a88..5b4977b 100644
> >> --- a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
> >> +++ b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
> >> @@ -371,7 +371,7 @@ static int ptlrpcd_check(struct lu_env *env, struct ptlrpcd_ctl *pc)
> >>                }
> >>        }
> >>
> >> -     return rc;
> >> +     return rc || test_bit(LIOD_STOP, &pc->pc_flags);
> >>  }
> >>
> >>  /**
> >> @@ -441,7 +441,7 @@ static int ptlrpcd(void *arg)
> >>                lu_context_enter(env.le_ses);
> >>                if (wait_event_idle_timeout(set->set_waitq,
> >>                                            ptlrpcd_check(&env, pc),
> >> -                                         (timeout ? timeout : 1) * HZ) == 0)
> >> +                                         timeout * HZ) == 0)
> >>                        ptlrpc_expired_set(set);
> >
> > This is incorrect.
> > A timeout of zero means the timeout happens after zero jiffies
> > (immediately), it doesn't mean there is no timeout.
> > If we want a "timeout" of zero to mean "Wait forever", we need something
> > like:
> >
> >   wait_event_idle_timeout(.....,
> >                           timeout ? (timeout * HZ) : MAX_SCHEDULE_TIMEOUT) == 0
> >
> > I've updated the patch accordingly.

I made that change locally as well, and my CPU load problem went away.
Thanks for figuring it out. Will be more careful in the future.
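A minimal sketch of the timeout conversion discussed above. HZ and MAX_SCHEDULE_TIMEOUT are redefined locally as stand-ins for the kernel constants (the kernel defines MAX_SCHEDULE_TIMEOUT as LONG_MAX; the HZ value here is arbitrary), and ptlrpcd_wait_jiffies() is a hypothetical helper, not an existing Lustre function. It maps a timeout of 0 seconds to "wait forever" instead of "expire immediately", which is what stops the loop from spinning.

```c
#include <assert.h>
#include <limits.h>

/* Local stand-ins for kernel constants (HZ value is an assumption). */
#define HZ 250L
#define MAX_SCHEDULE_TIMEOUT LONG_MAX

/*
 * Hypothetical helper: convert a ptlrpcd timeout in seconds into the
 * jiffies argument for a wait_event*_timeout() call. Passing 0 to those
 * macros makes the wait expire at once, so "no timeout" has to become
 * MAX_SCHEDULE_TIMEOUT.
 */
static long ptlrpcd_wait_jiffies(long timeout_sec)
{
	assert(timeout_sec >= 0);
	return timeout_sec ? timeout_sec * HZ : MAX_SCHEDULE_TIMEOUT;
}
```

In ptlrpcd() the result would be the last argument of wait_event_idle_timeout(), matching the `timeout ? (timeout * HZ) : MAX_SCHEDULE_TIMEOUT` expression in Neil's updated patch.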


* [lustre-devel] [PATCH 04/28] lustre: ptlrpc: Do not assert when bd_nob_transferred != 0
  2018-10-22  3:26       ` NeilBrown
@ 2018-11-04 21:29         ` James Simmons
  2018-11-04 23:59           ` NeilBrown
  0 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-11-04 21:29 UTC (permalink / raw)
  To: lustre-devel


> >> On Sun, Oct 14 2018, James Simmons wrote:
> >> 
> >> > From: Doug Oucharek <dougso@me.com>
> >> >
> >> > There is a case in the routine ptlrpc_register_bulk() where we were
> >> > asserting if bd_nob_transferred != 0 when not resending.  There is
> >> > evidence that network errors can create a situation where
> >> > this does happen. So we should not be asserting!
> >> >
> >> > This patch changes that assert to an error return code of -EIO.
> >> >
> >> > Signed-off-by: Doug Oucharek <dougso@me.com>
> >> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828
> >> > Reviewed-on: https://review.whamcloud.com/28491
> >> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> >> > Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
> >> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> >> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> >> > ---
> >> >  drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++--
> >> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> >> > index 27eb1c0..7e7db24 100644
> >> > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> >> > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> >> > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
> >> >  	/* cleanup the state of the bulk for it will be reused */
> >> >  	if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY)
> >> >  		desc->bd_nob_transferred = 0;
> >> > -	else
> >> > -		LASSERT(desc->bd_nob_transferred == 0);
> >> > +	else if (desc->bd_nob_transferred != 0)
> >> > +		/* If the network failed after an RPC was sent, this condition
> >> > +		 * could happen.  Rather than assert (was here before), return
> >> > +		 * an EIO error.
> >> > +		 */
> >> > +		return -EIO;
> >> 
> >> This looks weird, and the justification is rather lame.
> >> I wonder if this is an attempt to fix the same problem that the smp_mb()
> >> in the previous patch was attempting to fix (and I'm not yet convinced
> >> that either is the correct fix).
> >
> > When the above condition happens, the LASSERT takes out the node
> > with a panic, which in turn kills the application running on the cluster.
> > When the assert is replaced with an EIO error, both the node and the job
> > survive. The job might see an I/O error, but it would not fail in the
> > middle of its workflow, which is far more important.
> 
> Yes, a meaningless error is better than a crash, but a proper fix is
> better still.  As I said, my guess is that the memory barrier in the
> previous patch might have fixed the bug, so the LASSERT can remain.
> 
> Doug: is there any chance that this might be the case?

I got hold of Doug and discussed this issue. The original logs used to
track down the problem no longer exist, so finding its root cause can't
be done at this point. Would you be okay with a version of this patch
that calls dump_stack(), treated as a debug patch? We really need to
collect logs to figure out the real problem. I will push a debug patch
to the OpenSFS branch since it is more widely used.
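To make the proposal concrete, here is a userspace model of the check in ptlrpc_register_bulk() under the debug approach: reset the counter on resend or replay, and otherwise report the unexpected value and return -EIO instead of asserting. struct bulk_state and check_bulk_state() are hypothetical simplifications of the request and bulk descriptor, and the fprintf() call only marks where dump_stack() would go in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, trimmed-down model of the request/bulk state. */
struct bulk_state {
	bool resend_or_replay;	/* rq_resend || rq_send_state == LUSTRE_IMP_REPLAY */
	long nob_transferred;	/* desc->bd_nob_transferred */
};

/*
 * Model of the debug variant: clean up the counter when the bulk is being
 * reused; otherwise report an unexpected non-zero count and fail with -EIO
 * rather than LASSERT()ing and panicking the node.
 */
static int check_bulk_state(struct bulk_state *bs)
{
	assert(bs);

	if (bs->resend_or_replay) {
		bs->nob_transferred = 0;
		return 0;
	}
	if (bs->nob_transferred != 0) {
		fprintf(stderr, "unexpected bd_nob_transferred = %ld\n",
			bs->nob_transferred);
		/* In the kernel, dump_stack() would be called here. */
		return -EIO;
	}
	return 0;
}
```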


* [lustre-devel] [PATCH 18/28] lustre: mdc: expose changelog through char devices
  2018-10-30  6:41   ` NeilBrown
@ 2018-11-04 21:31     ` James Simmons
  2018-11-05  0:13       ` NeilBrown
  0 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-11-04 21:31 UTC (permalink / raw)
  To: lustre-devel


> On Sun, Oct 14 2018, James Simmons wrote:
> 
> > From: Henri Doreau <henri.doreau@cea.fr>
> >
> > Register one character device per MDT in order to allow non-llapi
> > consumers to read changelogs and to make delivery more efficient.
> >
> > - open() spawns a thread to prefetch records and enqueue them into a
> >   local buffer (unless the device is open in write-only mode).
> > - lseek() can be used to jump to a specific record, in which case the
> >   offset is a record number (with SEEK_SET) or a number of records to
> >   skip (SEEK_CUR). Movement can only be done forward.
> > - read() copies records to userland. No truncation happens, so short
> >   reads are likely.
> > - write() is used to transmit control commands to the device.
> >   The only available one is changelog_clear, which is done by writing
> >   "clear:cl<user>:<recno>" into the device.
> > - close() terminates the prefetch thread if any, and releases resources.
> >
> > It is possible to poll() on the device to get notified when new records
> > are available for read.
> >
> > Signed-off-by: Henri Doreau <henri.doreau@cea.fr>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-7659
> > Reviewed-on: https://review.whamcloud.com/18900
> > Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> > Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> 
> This patch causes problems around sanity test 161a.
> If you run -only '160h 160i 161a' it hangs.
> 
> Adding
> Commit: 89e52326b5bd ("LU-10166 mdc: invalid free in changelog reader")
> seems to fix the problem, so I'll port that into the series immediately
> after this patch.

I was planning to push that fix in the next batch. In that case I won't
push it, and will just wait for it to show up in lustre-testing.
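As an illustration of the character device interface described in the commit message, a reader might look roughly like the sketch below. It assumes a device node named after ced_name (for example /dev/changelog-lustre-MDT0000, which is a guessed path), and it leaves the bytes it reads unparsed, since decoding them needs struct changelog_rec from lustre_user.h. chlg_format_clear() and chlg_read_and_clear() are hypothetical helpers; only the "clear:cl<user>:<recno>" command format and the "SEEK_SET takes a record number" behaviour come from the commit message.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Build the "clear:cl<user>:<recno>" control command; -1 if it won't fit. */
static int chlg_format_clear(char *buf, size_t len, unsigned int user,
			     unsigned long long recno)
{
	int n = snprintf(buf, len, "clear:cl%u:%llu", user, recno);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Hypothetical reader: open the device, jump to record number `start`
 * (SEEK_SET offsets are record numbers), read one chunk of raw records,
 * then clear records up to `clear_to` for changelog user `user`.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t chlg_read_and_clear(const char *path, off_t start,
				   unsigned int user,
				   unsigned long long clear_to,
				   char *buf, size_t buflen)
{
	char cmd[64];
	ssize_t nread;
	int n;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return -1;
	if (lseek(fd, start, SEEK_SET) < 0) {
		close(fd);
		return -1;
	}
	/* Records are never truncated, so expect short reads. */
	nread = read(fd, buf, buflen);
	n = chlg_format_clear(cmd, sizeof(cmd), user, clear_to);
	if (nread >= 0 && (n < 0 || write(fd, cmd, n) != n))
		nread = -1;
	close(fd);
	return nread;
}
```

A real consumer would loop on read(), possibly poll() the descriptor for new records, and only clear records it has actually processed.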


> >  .../include/uapi/linux/lustre/lustre_kernelcomm.h  |   3 -
> >  .../lustre/include/uapi/linux/lustre/lustre_user.h |   7 -
> >  drivers/staging/lustre/lustre/include/obd.h        |   2 +
> >  drivers/staging/lustre/lustre/ldlm/ldlm_lib.c      |   2 +
> >  drivers/staging/lustre/lustre/llite/dir.c          |   8 -
> >  drivers/staging/lustre/lustre/lmv/lmv_obd.c        |  13 -
> >  drivers/staging/lustre/lustre/mdc/Makefile         |   2 +-
> >  drivers/staging/lustre/lustre/mdc/mdc_changelog.c  | 722 +++++++++++++++++++++
> >  drivers/staging/lustre/lustre/mdc/mdc_internal.h   |   4 +
> >  drivers/staging/lustre/lustre/mdc/mdc_request.c    | 198 +-----
> >  11 files changed, 745 insertions(+), 218 deletions(-)
> >  create mode 100644 drivers/staging/lustre/lustre/mdc/mdc_changelog.c
> >
> > diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
> > index 6e4e109..098b6451 100644
> > --- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
> > +++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
> > @@ -172,7 +172,7 @@ static inline __u32 obd_ioctl_packlen(struct obd_ioctl_data *data)
> >  #define OBD_GET_VERSION		_IOWR('f', 144, OBD_IOC_DATA_TYPE)
> >  /*	OBD_IOC_GSS_SUPPORT	_IOWR('f', 145, OBD_IOC_DATA_TYPE) */
> >  /*	OBD_IOC_CLOSE_UUID	_IOWR('f', 147, OBD_IOC_DATA_TYPE) */
> > -#define OBD_IOC_CHANGELOG_SEND	_IOW('f', 148, OBD_IOC_DATA_TYPE)
> > +/*	OBD_IOC_CHANGELOG_SEND	_IOW('f', 148, OBD_IOC_DATA_TYPE) */
> >  #define OBD_IOC_GETDEVICE	_IOWR('f', 149, OBD_IOC_DATA_TYPE)
> >  #define OBD_IOC_FID2PATH	_IOWR('f', 150, OBD_IOC_DATA_TYPE)
> >  /*	lustre/lustre_user.h	151-153 */
> > diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h
> > index 94dadbe..d84a8fc 100644
> > --- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h
> > +++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h
> > @@ -54,15 +54,12 @@ struct kuc_hdr {
> >  	__u16 kuc_msglen;
> >  } __aligned(sizeof(__u64));
> >  
> > -#define KUC_CHANGELOG_MSG_MAXSIZE (sizeof(struct kuc_hdr) + CR_MAXSIZE)
> > -
> >  #define KUC_MAGIC		0x191C /*Lustre9etLinC */
> >  
> >  /* kuc_msgtype values are defined in each transport */
> >  enum kuc_transport_type {
> >  	KUC_TRANSPORT_GENERIC	= 1,
> >  	KUC_TRANSPORT_HSM	= 2,
> > -	KUC_TRANSPORT_CHANGELOG	= 3,
> >  };
> >  
> >  enum kuc_generic_message_type {
> > diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
> > index b8525e5..715f1c5 100644
> > --- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
> > +++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
> > @@ -967,13 +967,6 @@ static inline void changelog_remap_rec(struct changelog_rec *rec,
> >  	rec->cr_flags = (rec->cr_flags & CLF_FLAGMASK) | crf_wanted;
> >  }
> >  
> > -struct ioc_changelog {
> > -	__u64 icc_recno;
> > -	__u32 icc_mdtindex;
> > -	__u32 icc_id;
> > -	__u32 icc_flags;
> > -};
> > -
> >  enum changelog_message_type {
> >  	CL_RECORD = 10, /* message is a changelog_rec */
> >  	CL_EOF    = 11, /* at end of current changelog */
> > diff --git a/drivers/staging/lustre/lustre/include/obd.h b/drivers/staging/lustre/lustre/include/obd.h
> > index 11e7ae8..76ae0b3 100644
> > --- a/drivers/staging/lustre/lustre/include/obd.h
> > +++ b/drivers/staging/lustre/lustre/include/obd.h
> > @@ -345,6 +345,8 @@ struct client_obd {
> >  	void			*cl_lru_work;
> >  	/* hash tables for osc_quota_info */
> >  	struct rhashtable	cl_quota_hash[MAXQUOTAS];
> > +	/* Links to the global list of registered changelog devices */
> > +	struct list_head	cl_chg_dev_linkage;
> >  };
> >  
> >  #define obd2cli_tgt(obd) ((char *)(obd)->u.cli.cl_target_uuid.uuid)
> > diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c b/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c
> > index 32eda4f..732ef3a 100644
> > --- a/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c
> > +++ b/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c
> > @@ -395,6 +395,8 @@ int client_obd_setup(struct obd_device *obddev, struct lustre_cfg *lcfg)
> >  	init_waitqueue_head(&cli->cl_mod_rpcs_waitq);
> >  	cli->cl_mod_tag_bitmap = NULL;
> >  
> > +	INIT_LIST_HEAD(&cli->cl_chg_dev_linkage);
> > +
> >  	if (connect_op == MDS_CONNECT) {
> >  		cli->cl_max_mod_rpcs_in_flight = cli->cl_max_rpcs_in_flight - 1;
> >  		cli->cl_mod_tag_bitmap = kcalloc(BITS_TO_LONGS(OBD_MAX_RIF_MAX),
> > diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
> > index 19c5e9c..36cea8d 100644
> > --- a/drivers/staging/lustre/lustre/llite/dir.c
> > +++ b/drivers/staging/lustre/lustre/llite/dir.c
> > @@ -1481,14 +1481,6 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >  		return obd_iocontrol(cmd, sbi->ll_md_exp, 0, NULL,
> >  				     (void __user *)arg);
> >  	}
> > -	case OBD_IOC_CHANGELOG_SEND:
> > -	case OBD_IOC_CHANGELOG_CLEAR:
> > -		if (!capable(CAP_SYS_ADMIN))
> > -			return -EPERM;
> > -
> > -		rc = copy_and_ioctl(cmd, sbi->ll_md_exp, (void __user *)arg,
> > -				    sizeof(struct ioc_changelog));
> > -		return rc;
> >  	case OBD_IOC_FID2PATH:
> >  		return ll_fid2path(inode, (void __user *)arg);
> >  	case LL_IOC_GETPARENT:
> > diff --git a/drivers/staging/lustre/lustre/lmv/lmv_obd.c b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
> > index 952c68e..32bb9fc 100644
> > --- a/drivers/staging/lustre/lustre/lmv/lmv_obd.c
> > +++ b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
> > @@ -951,19 +951,6 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
> >  		kfree(oqctl);
> >  		break;
> >  	}
> > -	case OBD_IOC_CHANGELOG_SEND:
> > -	case OBD_IOC_CHANGELOG_CLEAR: {
> > -		struct ioc_changelog *icc = karg;
> > -
> > -		if (icc->icc_mdtindex >= count)
> > -			return -ENODEV;
> > -
> > -		tgt = lmv->tgts[icc->icc_mdtindex];
> > -		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
> > -			return -ENODEV;
> > -		rc = obd_iocontrol(cmd, tgt->ltd_exp, sizeof(*icc), icc, NULL);
> > -		break;
> > -	}
> >  	case LL_IOC_GET_CONNECT_FLAGS: {
> >  		tgt = lmv->tgts[0];
> >  
> > diff --git a/drivers/staging/lustre/lustre/mdc/Makefile b/drivers/staging/lustre/lustre/mdc/Makefile
> > index 64cf49e..5f48e91 100644
> > --- a/drivers/staging/lustre/lustre/mdc/Makefile
> > +++ b/drivers/staging/lustre/lustre/mdc/Makefile
> > @@ -2,4 +2,4 @@ ccflags-y += -I$(srctree)/drivers/staging/lustre/include
> >  ccflags-y += -I$(srctree)/drivers/staging/lustre/lustre/include
> >  
> >  obj-$(CONFIG_LUSTRE_FS) += mdc.o
> > -mdc-y := mdc_request.o mdc_reint.o mdc_lib.o mdc_locks.o lproc_mdc.o
> > +mdc-y := mdc_changelog.o mdc_request.o mdc_reint.o mdc_lib.o mdc_locks.o lproc_mdc.o
> > diff --git a/drivers/staging/lustre/lustre/mdc/mdc_changelog.c b/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
> > new file mode 100644
> > index 0000000..a5f3c64
> > --- /dev/null
> > +++ b/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
> > @@ -0,0 +1,722 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * GPL HEADER START
> > + *
> > + * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 only,
> > + * as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful, but
> > + * WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License version 2 for more details (a copy is included
> > + * in the LICENSE file that accompanied this code).
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * version 2 along with this program; If not, see
> > + * http://www.gnu.org/licenses/gpl-2.0.html
> > + *
> > + * GPL HEADER END
> > + */
> > +/*
> > + * Copyright (c) 2017, Commissariat a l'Energie Atomique et aux Energies
> > + *                     Alternatives.
> > + *
> > + * Author: Henri Doreau <henri.doreau@cea.fr>
> > + */
> > +
> > +#define DEBUG_SUBSYSTEM S_MDC
> > +
> > +#include <linux/init.h>
> > +#include <linux/kthread.h>
> > +#include <linux/poll.h>
> > +#include <linux/miscdevice.h>
> > +
> > +#include <lustre_log.h>
> > +
> > +#include "mdc_internal.h"
> > +
> > +/*
> > + * -- Changelog delivery through character device --
> > + */
> > +
> > +/**
> > + * Mutex to protect chlg_registered_devices below
> > + */
> > +static DEFINE_MUTEX(chlg_registered_dev_lock);
> > +
> > +/**
> > + * Global linked list of all registered devices (one per MDT).
> > + */
> > +static LIST_HEAD(chlg_registered_devices);
> > +
> > +struct chlg_registered_dev {
> > +	/* Device name of the form "changelog-{MDTNAME}" */
> > +	char			ced_name[32];
> > +	/* Misc device descriptor */
> > +	struct miscdevice	ced_misc;
> > +	/* OBDs referencing this device (multiple mount point) */
> > +	struct list_head	ced_obds;
> > +	/* Reference counter for proper deregistration */
> > +	struct kref		ced_refs;
> > +	/* Link within the global chlg_registered_devices */
> > +	struct list_head	ced_link;
> > +};
> > +
> > +struct chlg_reader_state {
> > +	/* Shortcut to the corresponding OBD device */
> > +	struct obd_device	*crs_obd;
> > +	/* An error occurred that prevents from reading further */
> > +	bool			 crs_err;
> > +	/* EOF, no more records available */
> > +	bool			 crs_eof;
> > +	/* Userland reader closed connection */
> > +	bool			 crs_closed;
> > +	/* Desired start position */
> > +	u64			 crs_start_offset;
> > +	/* Wait queue for the catalog processing thread */
> > +	wait_queue_head_t	 crs_waitq_prod;
> > +	/* Wait queue for the record copy threads */
> > +	wait_queue_head_t	 crs_waitq_cons;
> > +	/* Mutex protecting crs_rec_count and crs_rec_queue */
> > +	struct mutex		 crs_lock;
> > +	/* Number of items in the list */
> > +	u64			 crs_rec_count;
> > +	/* List of prefetched enqueued_record::enq_linkage_items */
> > +	struct list_head	 crs_rec_queue;
> > +};
> > +
> > +struct chlg_rec_entry {
> > +	/* Link within the chlg_reader_state::crs_rec_queue list */
> > +	struct list_head	enq_linkage;
> > +	/* Data (enq_record) field length */
> > +	u64			enq_length;
> > +	/* Copy of a changelog record (see struct llog_changelog_rec) */
> > +	struct changelog_rec	enq_record[];
> > +};
> > +
> > +enum {
> > +	/* Number of records to prefetch locally. */
> > +	CDEV_CHLG_MAX_PREFETCH = 1024,
> > +};
> > +
> > +/**
> > + * ChangeLog catalog processing callback invoked on each record.
> > + * If the current record is eligible to userland delivery, push
> > + * it into the crs_rec_queue where the consumer code will fetch it.
> > + *
> > + * @param[in]     env  (unused)
> > + * @param[in]     llh  Client-side handle used to identify the llog
> > + * @param[in]     hdr  Header of the current llog record
> > + * @param[in,out] data chlg_reader_state passed from caller
> > + *
> > + * @return 0 or LLOG_PROC_* control code on success, negated error on failure.
> > + */
> > +static int chlg_read_cat_process_cb(const struct lu_env *env,
> > +				    struct llog_handle *llh,
> > +				    struct llog_rec_hdr *hdr, void *data)
> > +{
> > +	struct llog_changelog_rec *rec;
> > +	struct chlg_reader_state *crs = data;
> > +	struct chlg_rec_entry *enq;
> > +	size_t len;
> > +	int rc;
> > +
> > +	LASSERT(crs);
> > +	LASSERT(hdr);
> > +
> > +	rec = container_of(hdr, struct llog_changelog_rec, cr_hdr);
> > +
> > +	if (rec->cr_hdr.lrh_type != CHANGELOG_REC) {
> > +		rc = -EINVAL;
> > +		CERROR("%s: not a changelog rec %x/%d in llog : rc = %d\n",
> > +		       crs->crs_obd->obd_name, rec->cr_hdr.lrh_type,
> > +		       rec->cr.cr_type, rc);
> > +		return rc;
> > +	}
> > +
> > +	/* Skip undesired records */
> > +	if (rec->cr.cr_index < crs->crs_start_offset)
> > +		return 0;
> > +
> > +	CDEBUG(D_HSM, "%llu %02d%-5s %llu 0x%x t=" DFID " p=" DFID " %.*s\n",
> > +	       rec->cr.cr_index, rec->cr.cr_type,
> > +	       changelog_type2str(rec->cr.cr_type), rec->cr.cr_time,
> > +	       rec->cr.cr_flags & CLF_FLAGMASK,
> > +	       PFID(&rec->cr.cr_tfid), PFID(&rec->cr.cr_pfid),
> > +	       rec->cr.cr_namelen, changelog_rec_name(&rec->cr));
> > +
> > +	wait_event_idle(crs->crs_waitq_prod,
> > +			(crs->crs_rec_count < CDEV_CHLG_MAX_PREFETCH ||
> > +			 crs->crs_closed));
> > +
> > +	if (crs->crs_closed)
> > +		return LLOG_PROC_BREAK;
> > +
> > +	len = changelog_rec_size(&rec->cr) + rec->cr.cr_namelen;
> > +	enq = kzalloc(sizeof(*enq) + len, GFP_KERNEL);
> > +	if (!enq)
> > +		return -ENOMEM;
> > +
> > +	INIT_LIST_HEAD(&enq->enq_linkage);
> > +	enq->enq_length = len;
> > +	memcpy(enq->enq_record, &rec->cr, len);
> > +
> > +	mutex_lock(&crs->crs_lock);
> > +	list_add_tail(&enq->enq_linkage, &crs->crs_rec_queue);
> > +	crs->crs_rec_count++;
> > +	mutex_unlock(&crs->crs_lock);
> > +
> > +	wake_up_all(&crs->crs_waitq_cons);
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * Remove record from the list it is attached to and free it.
> > + */
> > +static void enq_record_delete(struct chlg_rec_entry *rec)
> > +{
> > +	list_del(&rec->enq_linkage);
> > +	kfree(rec);
> > +}
> > +
> > +/**
> > + * Release resources associated to a changelog_reader_state instance.
> > + *
> > + * @param  crs  CRS instance to release.
> > + */
> > +static void crs_free(struct chlg_reader_state *crs)
> > +{
> > +	struct chlg_rec_entry *rec;
> > +	struct chlg_rec_entry *tmp;
> > +
> > +	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage)
> > +		enq_record_delete(rec);
> > +
> > +	kfree(crs);
> > +}
> > +
> > +/**
> > + * Record prefetch thread entry point. Opens the changelog catalog and starts
> > + * reading records.
> > + *
> > + * @param[in,out]  args  chlg_reader_state passed from caller.
> > + * @return 0 on success, negated error code on failure.
> > + */
> > +static int chlg_load(void *args)
> > +{
> > +	struct chlg_reader_state *crs = args;
> > +	struct obd_device *obd = crs->crs_obd;
> > +	struct llog_ctxt *ctx = NULL;
> > +	struct llog_handle *llh = NULL;
> > +	int rc;
> > +
> > +	ctx = llog_get_context(obd, LLOG_CHANGELOG_REPL_CTXT);
> > +	if (!ctx) {
> > +		rc = -ENOENT;
> > +		goto err_out;
> > +	}
> > +
> > +	rc = llog_open(NULL, ctx, &llh, NULL, CHANGELOG_CATALOG,
> > +		       LLOG_OPEN_EXISTS);
> > +	if (rc) {
> > +		CERROR("%s: fail to open changelog catalog: rc = %d\n",
> > +		       obd->obd_name, rc);
> > +		goto err_out;
> > +	}
> > +
> > +	rc = llog_init_handle(NULL, llh, LLOG_F_IS_CAT | LLOG_F_EXT_JOBID,
> > +			      NULL);
> > +	if (rc) {
> > +		CERROR("%s: fail to init llog handle: rc = %d\n",
> > +		       obd->obd_name, rc);
> > +		goto err_out;
> > +	}
> > +
> > +	rc = llog_cat_process(NULL, llh, chlg_read_cat_process_cb, crs, 0, 0);
> > +	if (rc < 0) {
> > +		CERROR("%s: fail to process llog: rc = %d\n",
> > +		       obd->obd_name, rc);
> > +		goto err_out;
> > +	}
> > +
> > +err_out:
> > +	crs->crs_err = true;
> > +	wake_up_all(&crs->crs_waitq_cons);
> > +
> > +	if (llh)
> > +		llog_cat_close(NULL, llh);
> > +
> > +	if (ctx)
> > +		llog_ctxt_put(ctx);
> > +
> > +	wait_event_idle(crs->crs_waitq_prod, crs->crs_closed);
> > +	crs_free(crs);
> > +	return rc;
> > +}
> > +
> > +/**
> > + * Read handler, dequeues records from the chlg_reader_state if any.
> > + * No partial records are copied to userland so this function can return less
> > + * data than required (short read).
> > + *
> > + * @param[in]   file   File pointer to the character device.
> > + * @param[out]  buff   Userland buffer where to copy the records.
> > + * @param[in]   count  Userland buffer size.
> > + * @param[out]  ppos   File position, updated with the index number of the next
> > + *		       record to read.
> > + * @return number of copied bytes on success, negated error code on failure.
> > + */
> > +static ssize_t chlg_read(struct file *file, char __user *buff, size_t count,
> > +			 loff_t *ppos)
> > +{
> > +	struct chlg_reader_state *crs = file->private_data;
> > +	struct chlg_rec_entry *rec;
> > +	struct chlg_rec_entry *tmp;
> > +	ssize_t  written_total = 0;
> > +	LIST_HEAD(consumed);
> > +
> > +	if (file->f_flags & O_NONBLOCK && crs->crs_rec_count == 0)
> > +		return -EAGAIN;
> > +
> > +	wait_event_idle(crs->crs_waitq_cons,
> > +			crs->crs_rec_count > 0 || crs->crs_eof || crs->crs_err);
> > +
> > +	mutex_lock(&crs->crs_lock);
> > +	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage) {
> > +		if (written_total + rec->enq_length > count)
> > +			break;
> > +
> > +		if (copy_to_user(buff, rec->enq_record, rec->enq_length)) {
> > +			if (written_total == 0)
> > +				written_total = -EFAULT;
> > +			break;
> > +		}
> > +
> > +		buff += rec->enq_length;
> > +		written_total += rec->enq_length;
> > +
> > +		crs->crs_rec_count--;
> > +		list_move_tail(&rec->enq_linkage, &consumed);
> > +
> > +		crs->crs_start_offset = rec->enq_record->cr_index + 1;
> > +	}
> > +	mutex_unlock(&crs->crs_lock);
> > +
> > +	if (written_total > 0)
> > +		wake_up_all(&crs->crs_waitq_prod);
> > +
> > +	list_for_each_entry_safe(rec, tmp, &consumed, enq_linkage)
> > +		enq_record_delete(rec);
> > +
> > +	*ppos = crs->crs_start_offset;
> > +
> > +	return written_total;
> > +}
> > +
> > +/**
> > + * Jump to a given record index. Helper for chlg_llseek().
> > + *
> > + * @param[in,out]  crs     Internal reader state.
> > + * @param[in]      offset  Desired offset (index record).
> > + * @return 0 on success, negated error code on failure.
> > + */
> > +static int chlg_set_start_offset(struct chlg_reader_state *crs, u64 offset)
> > +{
> > +	struct chlg_rec_entry *rec;
> > +	struct chlg_rec_entry *tmp;
> > +
> > +	mutex_lock(&crs->crs_lock);
> > +	if (offset < crs->crs_start_offset) {
> > +		mutex_unlock(&crs->crs_lock);
> > +		return -ERANGE;
> > +	}
> > +
> > +	crs->crs_start_offset = offset;
> > +	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage) {
> > +		struct changelog_rec *cr = rec->enq_record;
> > +
> > +		if (cr->cr_index >= crs->crs_start_offset)
> > +			break;
> > +
> > +		crs->crs_rec_count--;
> > +		enq_record_delete(rec);
> > +	}
> > +
> > +	mutex_unlock(&crs->crs_lock);
> > +	wake_up_all(&crs->crs_waitq_prod);
> > +	return 0;
> > +}
> > +
> > +/**
> > + * Move read pointer to a certain record index, encoded as an offset.
> > + *
> > + * @param[in,out] file   File pointer to the changelog character device
> > + * @param[in]	  off    Offset to skip, actually a record index, not byte count
> > + * @param[in]	  whence Relative/Absolute interpretation of the offset
> > + * @return the resulting position on success or negated error code on failure.
> > + */
> > +static loff_t chlg_llseek(struct file *file, loff_t off, int whence)
> > +{
> > +	struct chlg_reader_state *crs = file->private_data;
> > +	loff_t pos;
> > +	int rc;
> > +
> > +	switch (whence) {
> > +	case SEEK_SET:
> > +		pos = off;
> > +		break;
> > +	case SEEK_CUR:
> > +		pos = file->f_pos + off;
> > +		break;
> > +	case SEEK_END:
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +
> > +	/* We cannot go backward */
> > +	if (pos < file->f_pos)
> > +		return -EINVAL;
> > +
> > +	rc = chlg_set_start_offset(crs, pos);
> > +	if (rc != 0)
> > +		return rc;
> > +
> > +	file->f_pos = pos;
> > +	return pos;
> > +}
> > +
> > +/**
> > + * Clear record range for a given changelog reader.
> > + *
> > + * @param[in]  crs     Current internal state.
> > + * @param[in]  reader  Changelog reader ID (cl1, cl2...)
> > + * @param[in]  record  Record index up which to clear
> > + * @return 0 on success, negated error code on failure.
> > + */
> > +static int chlg_clear(struct chlg_reader_state *crs, u32 reader, u64 record)
> > +{
> > +	struct obd_device *obd = crs->crs_obd;
> > +	struct changelog_setinfo cs  = {
> > +		.cs_recno = record,
> > +		.cs_id    = reader
> > +	};
> > +
> > +	return obd_set_info_async(NULL, obd->obd_self_export,
> > +				  strlen(KEY_CHANGELOG_CLEAR),
> > +				  KEY_CHANGELOG_CLEAR, sizeof(cs), &cs, NULL);
> > +}
> > +
> > +/** Maximum changelog control command size */
> > +#define CHLG_CONTROL_CMD_MAX	64
> > +
> > +/**
> > + * Handle writes() into the changelog character device. Write() can be used
> > + * to request special control operations.
> > + *
> > + * @param[in]  file  File pointer to the changelog character device
> > + * @param[in]  buff  User supplied data (written data)
> > + * @param[in]  count Number of written bytes
> > + * @param[in]  off   (unused)
> > + * @return number of written bytes on success, negated error code on failure.
> > + */
> > +static ssize_t chlg_write(struct file *file, const char __user *buff,
> > +			  size_t count, loff_t *off)
> > +{
> > +	struct chlg_reader_state *crs = file->private_data;
> > +	char *kbuf;
> > +	u64 record;
> > +	u32 reader;
> > +	int rc = 0;
> > +
> > +	if (count > CHLG_CONTROL_CMD_MAX)
> > +		return -EINVAL;
> > +
> > +	kbuf = kzalloc(CHLG_CONTROL_CMD_MAX, GFP_KERNEL);
> > +	if (!kbuf)
> > +		return -ENOMEM;
> > +
> > +	if (copy_from_user(kbuf, buff, count)) {
> > +		rc = -EFAULT;
> > +		goto out_kbuf;
> > +	}
> > +
> > +	kbuf[CHLG_CONTROL_CMD_MAX - 1] = '\0';
> > +
> > +	if (sscanf(kbuf, "clear:cl%u:%llu", &reader, &record) == 2)
> > +		rc = chlg_clear(crs, reader, record);
> > +	else
> > +		rc = -EINVAL;
> > +
> > +out_kbuf:
> > +	kfree(kbuf);
> > +	return rc < 0 ? rc : count;
> > +}
> > +
> > +/**
> > + * Find the OBD device associated to a changelog character device.
> > + * @param[in]  cdev  character device instance descriptor
> > + * @return corresponding OBD device or NULL if none was found.
> > + */
> > +static struct obd_device *chlg_obd_get(dev_t cdev)
> > +{
> > +	int minor = MINOR(cdev);
> > +	struct obd_device *obd = NULL;
> > +	struct chlg_registered_dev *curr;
> > +
> > +	mutex_lock(&chlg_registered_dev_lock);
> > +	list_for_each_entry(curr, &chlg_registered_devices, ced_link) {
> > +		if (curr->ced_misc.minor == minor) {
> > +			/* take the first available OBD device attached */
> > +			obd = list_first_entry(&curr->ced_obds,
> > +					       struct obd_device,
> > +					       u.cli.cl_chg_dev_linkage);
> > +			break;
> > +		}
> > +	}
> > +	mutex_unlock(&chlg_registered_dev_lock);
> > +	return obd;
> > +}
> > +
> > +/**
> > + * Open handler, initialize internal CRS state and spawn prefetch thread if
> > + * needed.
> > + * @param[in]  inode  Inode struct for the open character device.
> > + * @param[in]  file   Corresponding file pointer.
> > + * @return 0 on success, negated error code on failure.
> > + */
> > +static int chlg_open(struct inode *inode, struct file *file)
> > +{
> > +	struct chlg_reader_state *crs;
> > +	struct obd_device *obd = chlg_obd_get(inode->i_rdev);
> > +	struct task_struct *task;
> > +	int rc;
> > +
> > +	if (!obd)
> > +		return -ENODEV;
> > +
> > +	crs = kzalloc(sizeof(*crs), GFP_KERNEL);
> > +	if (!crs)
> > +		return -ENOMEM;
> > +
> > +	crs->crs_obd = obd;
> > +	crs->crs_err = false;
> > +	crs->crs_eof = false;
> > +	crs->crs_closed = false;
> > +
> > +	mutex_init(&crs->crs_lock);
> > +	INIT_LIST_HEAD(&crs->crs_rec_queue);
> > +	init_waitqueue_head(&crs->crs_waitq_prod);
> > +	init_waitqueue_head(&crs->crs_waitq_cons);
> > +
> > +	if (file->f_mode & FMODE_READ) {
> > +		task = kthread_run(chlg_load, crs, "chlg_load_thread");
> > +		if (IS_ERR(task)) {
> > +			rc = PTR_ERR(task);
> > +			CERROR("%s: cannot start changelog thread: rc = %d\n",
> > +			       obd->obd_name, rc);
> > +			goto err_crs;
> > +		}
> > +	}
> > +
> > +	file->private_data = crs;
> > +	return 0;
> > +
> > +err_crs:
> > +	kfree(crs);
> > +	return rc;
> > +}
> > +
> > +/**
> > + * Close handler, release resources.
> > + *
> > + * @param[in]  inode  Inode struct for the open character device.
> > + * @param[in]  file   Corresponding file pointer.
> > + * @return 0 on success, negated error code on failure.
> > + */
> > +static int chlg_release(struct inode *inode, struct file *file)
> > +{
> > +	struct chlg_reader_state *crs = file->private_data;
> > +
> > +	if (file->f_mode & FMODE_READ) {
> > +		crs->crs_closed = true;
> > +		wake_up_all(&crs->crs_waitq_prod);
> > +	} else {
> > +		/* No producer thread, release resource ourselves */
> > +		crs_free(crs);
> > +	}
> > +	return 0;
> > +}
> > +
> > +/**
> > + * Poll handler, indicates whether the device is readable (new records) and
> > + * writable (always).
> > + *
> > + * @param[in]  file   Device file pointer.
> > + * @param[in]  wait   (opaque)
> > + * @return combination of the poll status flags.
> > + */
> > +static unsigned int chlg_poll(struct file *file, poll_table *wait)
> > +{
> > +	struct chlg_reader_state *crs  = file->private_data;
> > +	unsigned int mask = 0;
> > +
> > +	mutex_lock(&crs->crs_lock);
> > +	poll_wait(file, &crs->crs_waitq_cons, wait);
> > +	if (crs->crs_rec_count > 0)
> > +		mask |= POLLIN | POLLRDNORM;
> > +	if (crs->crs_err)
> > +		mask |= POLLERR;
> > +	if (crs->crs_eof)
> > +		mask |= POLLHUP;
> > +	mutex_unlock(&crs->crs_lock);
> > +	return mask;
> > +}
> > +
> > +static const struct file_operations chlg_fops = {
> > +	.owner		= THIS_MODULE,
> > +	.llseek		= chlg_llseek,
> > +	.read		= chlg_read,
> > +	.write		= chlg_write,
> > +	.open		= chlg_open,
> > +	.release	= chlg_release,
> > +	.poll		= chlg_poll,
> > +};
> > +
> > +/**
> > + * This uses obd_name of the form: "testfs-MDT0000-mdc-ffff88006501600"
> > + * and returns a name of the form: "changelog-testfs-MDT0000".
> > + */
> > +static void get_chlg_name(char *name, size_t name_len, struct obd_device *obd)
> > +{
> > +	int i;
> > +
> > +	snprintf(name, name_len, "changelog-%s", obd->obd_name);
> > +
> > +	/* Find the 2nd '-' from the end and truncate on it */
> > +	for (i = 0; i < 2; i++) {
> > +		char *p = strrchr(name, '-');
> > +
> > +		if (!p)
> > +			return;
> > +		*p = '\0';
> > +	}
> > +}
> > +
> > +/**
> > + * Find a changelog character device by name.
> > + * All devices registered during MDC setup are listed in a global list with
> > + * their names attached.
> > + */
> > +static struct chlg_registered_dev *
> > +chlg_registered_dev_find_by_name(const char *name)
> > +{
> > +	struct chlg_registered_dev *dit;
> > +
> > +	list_for_each_entry(dit, &chlg_registered_devices, ced_link)
> > +		if (strcmp(name, dit->ced_name) == 0)
> > +			return dit;
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * Find chlg_registered_dev structure for a given OBD device.
> > + * This is bad O(n^2) but for each filesystem:
> > + *   - N is # of MDTs times # of mount points
> > + *   - this only runs at shutdown
> > + */
> > +static struct chlg_registered_dev *
> > +chlg_registered_dev_find_by_obd(const struct obd_device *obd)
> > +{
> > +	struct chlg_registered_dev *dit;
> > +	struct obd_device *oit;
> > +
> > +	list_for_each_entry(dit, &chlg_registered_devices, ced_link)
> > +		list_for_each_entry(oit, &dit->ced_obds,
> > +				    u.cli.cl_chg_dev_linkage)
> > +			if (oit == obd)
> > +				return dit;
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * Changelog character device initialization.
> > + * Register a misc character device with a dynamic minor number, under a name
> > + * of the form: 'changelog-fsname-MDTxxxx'. Reference this OBD device with it.
> > + *
> > + * @param[in] obd  This MDC obd_device.
> > + * @return 0 on success, negated error code on failure.
> > + */
> > +int mdc_changelog_cdev_init(struct obd_device *obd)
> > +{
> > +	struct chlg_registered_dev *exist;
> > +	struct chlg_registered_dev *entry;
> > +	int rc;
> > +
> > +	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
> > +	if (!entry)
> > +		return -ENOMEM;
> > +
> > +	get_chlg_name(entry->ced_name, sizeof(entry->ced_name), obd);
> > +
> > +	entry->ced_misc.minor = MISC_DYNAMIC_MINOR;
> > +	entry->ced_misc.name  = entry->ced_name;
> > +	entry->ced_misc.fops  = &chlg_fops;
> > +
> > +	kref_init(&entry->ced_refs);
> > +	INIT_LIST_HEAD(&entry->ced_obds);
> > +	INIT_LIST_HEAD(&entry->ced_link);
> > +
> > +	mutex_lock(&chlg_registered_dev_lock);
> > +	exist = chlg_registered_dev_find_by_name(entry->ced_name);
> > +	if (exist) {
> > +		kref_get(&exist->ced_refs);
> > +		list_add_tail(&obd->u.cli.cl_chg_dev_linkage, &exist->ced_obds);
> > +		rc = 0;
> > +		goto out_unlock;
> > +	}
> > +
> > +	/* Register new character device */
> > +	rc = misc_register(&entry->ced_misc);
> > +	if (rc != 0)
> > +		goto out_unlock;
> > +
> > +	list_add_tail(&obd->u.cli.cl_chg_dev_linkage, &entry->ced_obds);
> > +	list_add_tail(&entry->ced_link, &chlg_registered_devices);
> > +
> > +	entry = NULL;	/* prevent it from being freed below */
> > +
> > +out_unlock:
> > +	mutex_unlock(&chlg_registered_dev_lock);
> > +	kfree(entry);
> > +	return rc;
> > +}
> > +
> > +/**
> > + * Deregister a changelog character device whose refcount has reached zero.
> > + */
> > +static void chlg_dev_clear(struct kref *kref)
> > +{
> > +	struct chlg_registered_dev *entry = container_of(kref,
> > +							 struct chlg_registered_dev,
> > +							 ced_refs);
> > +	list_del(&entry->ced_link);
> > +	misc_deregister(&entry->ced_misc);
> > +	kfree(entry);
> > +}
> > +
> > +/**
> > + * Release OBD, decrease reference count of the corresponding changelog device.
> > + */
> > +void mdc_changelog_cdev_finish(struct obd_device *obd)
> > +{
> > +	struct chlg_registered_dev *dev = chlg_registered_dev_find_by_obd(obd);
> > +
> > +	mutex_lock(&chlg_registered_dev_lock);
> > +	list_del_init(&obd->u.cli.cl_chg_dev_linkage);
> > +	kref_put(&dev->ced_refs, chlg_dev_clear);
> > +	mutex_unlock(&chlg_registered_dev_lock);
> > +}
> > diff --git a/drivers/staging/lustre/lustre/mdc/mdc_internal.h b/drivers/staging/lustre/lustre/mdc/mdc_internal.h
> > index 941a896..6da9046 100644
> > --- a/drivers/staging/lustre/lustre/mdc/mdc_internal.h
> > +++ b/drivers/staging/lustre/lustre/mdc/mdc_internal.h
> > @@ -129,6 +129,10 @@ enum ldlm_mode mdc_lock_match(struct obd_export *exp, __u64 flags,
> >  			      enum ldlm_mode mode,
> >  			      struct lustre_handle *lockh);
> >  
> > +int mdc_changelog_cdev_init(struct obd_device *obd);
> > +
> > +void mdc_changelog_cdev_finish(struct obd_device *obd);
> > +
> >  static inline int mdc_prep_elc_req(struct obd_export *exp,
> >  				   struct ptlrpc_request *req, int opc,
> >  				   struct list_head *cancels, int count)
> > diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
> > index 8f8e3d2..3692b1c 100644
> > --- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
> > +++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
> > @@ -35,7 +35,6 @@
> >  
> >  # include <linux/module.h>
> >  # include <linux/pagemap.h>
> > -# include <linux/miscdevice.h>
> >  # include <linux/init.h>
> >  # include <linux/utsname.h>
> >  # include <linux/file.h>
> > @@ -1810,174 +1809,6 @@ static int mdc_ioc_hsm_request(struct obd_export *exp,
> >  	return rc;
> >  }
> >  
> > -static struct kuc_hdr *changelog_kuc_hdr(char *buf, size_t len, u32 flags)
> > -{
> > -	struct kuc_hdr *lh = (struct kuc_hdr *)buf;
> > -
> > -	LASSERT(len <= KUC_CHANGELOG_MSG_MAXSIZE);
> > -
> > -	lh->kuc_magic = KUC_MAGIC;
> > -	lh->kuc_transport = KUC_TRANSPORT_CHANGELOG;
> > -	lh->kuc_flags = flags;
> > -	lh->kuc_msgtype = CL_RECORD;
> > -	lh->kuc_msglen = len;
> > -	return lh;
> > -}
> > -
> > -struct changelog_show {
> > -	__u64		cs_startrec;
> > -	enum changelog_send_flag	cs_flags;
> > -	struct file	*cs_fp;
> > -	char		*cs_buf;
> > -	struct obd_device *cs_obd;
> > -};
> > -
> > -static inline char *cs_obd_name(struct changelog_show *cs)
> > -{
> > -	return cs->cs_obd->obd_name;
> > -}
> > -
> > -static int changelog_kkuc_cb(const struct lu_env *env, struct llog_handle *llh,
> > -			     struct llog_rec_hdr *hdr, void *data)
> > -{
> > -	struct changelog_show *cs = data;
> > -	struct llog_changelog_rec *rec = (struct llog_changelog_rec *)hdr;
> > -	struct kuc_hdr *lh;
> > -	size_t len;
> > -	int rc;
> > -
> > -	if (rec->cr_hdr.lrh_type != CHANGELOG_REC) {
> > -		rc = -EINVAL;
> > -		CERROR("%s: not a changelog rec %x/%d: rc = %d\n",
> > -		       cs_obd_name(cs), rec->cr_hdr.lrh_type,
> > -		       rec->cr.cr_type, rc);
> > -		return rc;
> > -	}
> > -
> > -	if (rec->cr.cr_index < cs->cs_startrec) {
> > -		/* Skip entries earlier than what we are interested in */
> > -		CDEBUG(D_HSM, "rec=%llu start=%llu\n",
> > -		       rec->cr.cr_index, cs->cs_startrec);
> > -		return 0;
> > -	}
> > -
> > -	CDEBUG(D_HSM, "%llu %02d%-5s %llu 0x%x t=" DFID " p=" DFID
> > -		" %.*s\n", rec->cr.cr_index, rec->cr.cr_type,
> > -		changelog_type2str(rec->cr.cr_type), rec->cr.cr_time,
> > -		rec->cr.cr_flags & CLF_FLAGMASK,
> > -		PFID(&rec->cr.cr_tfid), PFID(&rec->cr.cr_pfid),
> > -		rec->cr.cr_namelen, changelog_rec_name(&rec->cr));
> > -
> > -	len = sizeof(*lh) + changelog_rec_size(&rec->cr) + rec->cr.cr_namelen;
> > -
> > -	/* Set up the message */
> > -	lh = changelog_kuc_hdr(cs->cs_buf, len, cs->cs_flags);
> > -	memcpy(lh + 1, &rec->cr, len - sizeof(*lh));
> > -
> > -	rc = libcfs_kkuc_msg_put(cs->cs_fp, lh);
> > -	CDEBUG(D_HSM, "kucmsg fp %p len %zu rc %d\n", cs->cs_fp, len, rc);
> > -
> > -	return rc;
> > -}
> > -
> > -static int mdc_changelog_send_thread(void *csdata)
> > -{
> > -	enum llog_flag flags = LLOG_F_IS_CAT;
> > -	struct changelog_show *cs = csdata;
> > -	struct llog_ctxt *ctxt = NULL;
> > -	struct llog_handle *llh = NULL;
> > -	struct kuc_hdr *kuch;
> > -	int rc;
> > -
> > -	CDEBUG(D_HSM, "changelog to fp=%p start %llu\n",
> > -	       cs->cs_fp, cs->cs_startrec);
> > -
> > -	cs->cs_buf = kzalloc(KUC_CHANGELOG_MSG_MAXSIZE, GFP_NOFS);
> > -	if (!cs->cs_buf) {
> > -		rc = -ENOMEM;
> > -		goto out;
> > -	}
> > -
> > -	/* Set up the remote catalog handle */
> > -	ctxt = llog_get_context(cs->cs_obd, LLOG_CHANGELOG_REPL_CTXT);
> > -	if (!ctxt) {
> > -		rc = -ENOENT;
> > -		goto out;
> > -	}
> > -	rc = llog_open(NULL, ctxt, &llh, NULL, CHANGELOG_CATALOG,
> > -		       LLOG_OPEN_EXISTS);
> > -	if (rc) {
> > -		CERROR("%s: fail to open changelog catalog: rc = %d\n",
> > -		       cs_obd_name(cs), rc);
> > -		goto out;
> > -	}
> > -
> > -	if (cs->cs_flags & CHANGELOG_FLAG_JOBID)
> > -		flags |= LLOG_F_EXT_JOBID;
> > -
> > -	rc = llog_init_handle(NULL, llh, flags, NULL);
> > -	if (rc) {
> > -		CERROR("llog_init_handle failed %d\n", rc);
> > -		goto out;
> > -	}
> > -
> > -	rc = llog_cat_process(NULL, llh, changelog_kkuc_cb, cs, 0, 0);
> > -
> > -	/* Send EOF no matter what our result */
> > -	kuch = changelog_kuc_hdr(cs->cs_buf, sizeof(*kuch), cs->cs_flags);
> > -	kuch->kuc_msgtype = CL_EOF;
> > -	libcfs_kkuc_msg_put(cs->cs_fp, kuch);
> > -
> > -out:
> > -	fput(cs->cs_fp);
> > -	if (llh)
> > -		llog_cat_close(NULL, llh);
> > -	if (ctxt)
> > -		llog_ctxt_put(ctxt);
> > -	kfree(cs->cs_buf);
> > -	kfree(cs);
> > -	return rc;
> > -}
> > -
> > -static int mdc_ioc_changelog_send(struct obd_device *obd,
> > -				  struct ioc_changelog *icc)
> > -{
> > -	struct changelog_show *cs;
> > -	struct task_struct *task;
> > -	int rc;
> > -
> > -	/* Freed in mdc_changelog_send_thread */
> > -	cs = kzalloc(sizeof(*cs), GFP_NOFS);
> > -	if (!cs)
> > -		return -ENOMEM;
> > -
> > -	cs->cs_obd = obd;
> > -	cs->cs_startrec = icc->icc_recno;
> > -	/* matching fput in mdc_changelog_send_thread */
> > -	cs->cs_fp = fget(icc->icc_id);
> > -	cs->cs_flags = icc->icc_flags;
> > -
> > -	/*
> > -	 * New thread because we should return to user app before
> > -	 * writing into our pipe
> > -	 */
> > -	task = kthread_run(mdc_changelog_send_thread, cs,
> > -			   "mdc_clg_send_thread");
> > -	if (IS_ERR(task)) {
> > -		rc = PTR_ERR(task);
> > -		CERROR("%s: can't start changelog thread: rc = %d\n",
> > -		       cs_obd_name(cs), rc);
> > -		kfree(cs);
> > -	} else {
> > -		rc = 0;
> > -		CDEBUG(D_HSM, "%s: started changelog thread\n",
> > -		       cs_obd_name(cs));
> > -	}
> > -
> > -	CERROR("Failed to start changelog thread: %d\n", rc);
> > -	return rc;
> > -}
> > -
> >  static int mdc_ioc_hsm_ct_start(struct obd_export *exp,
> >  				struct lustre_kernelcomm *lk);
> >  
> > @@ -2087,21 +1918,6 @@ static int mdc_iocontrol(unsigned int cmd, struct obd_export *exp, int len,
> >  		return -EINVAL;
> >  	}
> >  	switch (cmd) {
> > -	case OBD_IOC_CHANGELOG_SEND:
> > -		rc = mdc_ioc_changelog_send(obd, karg);
> > -		goto out;
> > -	case OBD_IOC_CHANGELOG_CLEAR: {
> > -		struct ioc_changelog *icc = karg;
> > -		struct changelog_setinfo cs = {
> > -			.cs_recno = icc->icc_recno,
> > -			.cs_id = icc->icc_id
> > -		};
> > -
> > -		rc = obd_set_info_async(NULL, exp, strlen(KEY_CHANGELOG_CLEAR),
> > -					KEY_CHANGELOG_CLEAR, sizeof(cs), &cs,
> > -					NULL);
> > -		goto out;
> > -	}
> >  	case OBD_IOC_FID2PATH:
> >  		rc = mdc_ioc_fid2path(exp, karg);
> >  		goto out;
> > @@ -2670,12 +2486,22 @@ static int mdc_setup(struct obd_device *obd, struct lustre_cfg *cfg)
> >  
> >  	rc = mdc_llog_init(obd);
> >  	if (rc) {
> > -		CERROR("failed to setup llogging subsystems\n");
> > +		CERROR("%s: failed to setup llogging subsystems: rc = %d\n",
> > +		       obd->obd_name, rc);
> >  		goto err_llog_cleanup;
> >  	}
> >  
> > +	rc = mdc_changelog_cdev_init(obd);
> > +	if (rc) {
> > +		CERROR("%s: failed to setup changelog char device: rc = %d\n",
> > +		       obd->obd_name, rc);
> > +		goto err_changelog_cleanup;
> > +	}
> > +
> >  	return 0;
> >  
> > +err_changelog_cleanup:
> > +	mdc_llog_finish(obd);
> >  err_llog_cleanup:
> >  	ldebugfs_free_md_stats(obd);
> >  	ptlrpc_lprocfs_unregister_obd(obd);
> > @@ -2714,6 +2540,8 @@ static int mdc_precleanup(struct obd_device *obd)
> >  	if (obd->obd_type->typ_refcnt <= 1)
> >  		libcfs_kkuc_group_rem(0, KUC_GRP_HSM);
> >  
> > +	mdc_changelog_cdev_finish(obd);
> > +
> >  	obd_cleanup_client_import(obd);
> >  	ptlrpc_lprocfs_unregister_obd(obd);
> >  	lprocfs_obd_cleanup(obd);
> > -- 
> > 1.8.3.1
> 
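The patch defines how userspace finds and drives the new device. get_chlg_name() derives the device name from the MDC obd_name. chlg_write() accepts a control command of the form "clear:cl<reader>:<record>", limited to CHLG_CONTROL_CMD_MAX (64) bytes. The userspace sketch below mirrors both. The helper names chlg_dev_name(), chlg_clear_cmd() and chlg_parse_clear() are hypothetical and not part of Lustre; only the format strings and the truncation logic are taken from the patch.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Userspace mirror of get_chlg_name() from the patch: prefix the OBD name
 * with "changelog-" and truncate at the 2nd '-' counting from the end. */
static void chlg_dev_name(char *name, size_t name_len, const char *obd_name)
{
	int i;

	snprintf(name, name_len, "changelog-%s", obd_name);
	for (i = 0; i < 2; i++) {
		char *p = strrchr(name, '-');

		if (!p)
			return;
		*p = '\0';
	}
}

/* Build a "clear:cl<reader>:<record>" control command for write().
 * Returns the command length as reported by snprintf(). */
static int chlg_clear_cmd(char *buf, size_t len, unsigned int reader,
			  unsigned long long record)
{
	return snprintf(buf, len, "clear:cl%u:%llu", reader, record);
}

/* Parse a command with the same format string chlg_write() uses. */
static bool chlg_parse_clear(const char *cmd, unsigned int *reader,
			     unsigned long long *record)
{
	return sscanf(cmd, "clear:cl%u:%llu", reader, record) == 2;
}
```

For "testfs-MDT0000-mdc-ffff88006501600", chlg_dev_name() yields "changelog-testfs-MDT0000". A consumer would then open the corresponding device node and read() whole records, which may be a short read. It would lseek() forward by record index and write() the string built by chlg_clear_cmd() to clear records. This sketch assumes udev creates that node as /dev/changelog-testfs-MDT0000 from the misc device name.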


* [lustre-devel] [PATCH 15/28] lustre: llite: fix for stat under kthread and X86_X32
  2018-10-22  3:58     ` NeilBrown
@ 2018-11-04 21:35       ` James Simmons
  2018-11-05  0:03         ` NeilBrown
  0 siblings, 1 reply; 69+ messages in thread
From: James Simmons @ 2018-11-04 21:35 UTC (permalink / raw)
  To: lustre-devel


> On Thu, Oct 18 2018, NeilBrown wrote:
> 
> > On Sun, Oct 14 2018, James Simmons wrote:
> >
> >> From: Frank Zago <fzago@cray.com>
> >>
> >> Under the following conditions, ll_getattr will flatten the inode
> >> number when it shouldn't:
> >>
> >>  - CONFIG_X86_X32 is enabled, even if the X32 ABI is not actually
> >>    used,
> >>  - ll_getattr is called from a kernel thread (through vfs_getattr, for
> >>    instance).
> >>
> >> As a result, the same file reports different inode numbers depending
> >> on whether it is stat'ed from a kernel thread or from a syscall. For
> >> instance, 4198401 vs. 144115205272502273.
> >>
> >> ll_getattr calls ll_need_32bit_api to determine whether the task is
> >> 32-bit. With the kthread+X86_X32 combination, that function reports
> >> that the task is 32-bit, which is incorrect, as the kernel is
> >> 64-bit.
> >>
> >> The solution is to check whether the call comes from a kernel thread
> >> (which is 64-bit) and act accordingly.
> >>
> >> Signed-off-by: Frank Zago <fzago@cray.com>
> >> WC-bug-id: https://jira.whamcloud.com/browse/LU-9468
> >> Reviewed-on: https://review.whamcloud.com/26992
> >> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
> >> Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> >> Signed-off-by: James Simmons <jsimmons@infradead.org>
> >> ---
> >>  drivers/staging/lustre/lustre/llite/dir.c          |  6 +++---
> >>  drivers/staging/lustre/lustre/llite/lcommon_cl.c   |  2 +-
> >>  .../staging/lustre/lustre/llite/llite_internal.h   | 22 +++++++++++++++++-----
> >>  3 files changed, 21 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
> >> index 231b351..19c5e9c 100644
> >> --- a/drivers/staging/lustre/lustre/llite/dir.c
> >> +++ b/drivers/staging/lustre/lustre/llite/dir.c
> >> @@ -202,7 +202,7 @@ int ll_dir_read(struct inode *inode, __u64 *ppos, struct md_op_data *op_data,
> >>  {
> >>  	struct ll_sb_info    *sbi	= ll_i2sbi(inode);
> >>  	__u64		   pos		= *ppos;
> >> -	int		   is_api32 = ll_need_32bit_api(sbi);
> >> +	bool is_api32 = ll_need_32bit_api(sbi);
> >>  	int		   is_hash64 = sbi->ll_flags & LL_SBI_64BIT_HASH;
> >>  	struct page	  *page;
> >>  	bool		   done = false;
> >> @@ -296,7 +296,7 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
> >>  	struct ll_sb_info	*sbi	= ll_i2sbi(inode);
> >>  	__u64 pos = lfd ? lfd->lfd_pos : 0;
> >>  	int			hash64	= sbi->ll_flags & LL_SBI_64BIT_HASH;
> >> -	int			api32	= ll_need_32bit_api(sbi);
> >> +	bool api32 = ll_need_32bit_api(sbi);
> >>  	struct md_op_data *op_data;
> >>  	int			rc;
> >>  
> >> @@ -1674,7 +1674,7 @@ static loff_t ll_dir_seek(struct file *file, loff_t offset, int origin)
> >>  	struct inode *inode = file->f_mapping->host;
> >>  	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
> >>  	struct ll_sb_info *sbi = ll_i2sbi(inode);
> >> -	int api32 = ll_need_32bit_api(sbi);
> >> +	bool api32 = ll_need_32bit_api(sbi);
> >>  	loff_t ret = -EINVAL;
> >>  
> >>  	switch (origin) {
> >> diff --git a/drivers/staging/lustre/lustre/llite/lcommon_cl.c b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
> >> index 30f17ea..20a3c74 100644
> >> --- a/drivers/staging/lustre/lustre/llite/lcommon_cl.c
> >> +++ b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
> >> @@ -267,7 +267,7 @@ void cl_inode_fini(struct inode *inode)
> >>  /**
> >>   * build inode number from passed @fid
> >>   */
> >> -__u64 cl_fid_build_ino(const struct lu_fid *fid, int api32)
> >> +u64 cl_fid_build_ino(const struct lu_fid *fid, bool api32)
> >>  {
> >>  	if (BITS_PER_LONG == 32 || api32)
> >>  		return fid_flatten32(fid);
> >> diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
> >> index dcb2fed..796a8ae 100644
> >> --- a/drivers/staging/lustre/lustre/llite/llite_internal.h
> >> +++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
> >> @@ -651,13 +651,25 @@ static inline struct inode *ll_info2i(struct ll_inode_info *lli)
> >>  __u32 ll_i2suppgid(struct inode *i);
> >>  void ll_i2gids(__u32 *suppgids, struct inode *i1, struct inode *i2);
> >>  
> >> -static inline int ll_need_32bit_api(struct ll_sb_info *sbi)
> >> +static inline bool ll_need_32bit_api(struct ll_sb_info *sbi)
> >>  {
> >>  #if BITS_PER_LONG == 32
> >> -	return 1;
> >> +	return true;
> >>  #elif defined(CONFIG_COMPAT)
> >> -	return unlikely(in_compat_syscall() ||
> >> -			(sbi->ll_flags & LL_SBI_32BIT_API));
> >> +	if (unlikely(sbi->ll_flags & LL_SBI_32BIT_API))
> >> +		return true;
> >> +
> >> +#ifdef CONFIG_X86_X32
> >> +	/* in_compat_syscall() returns true when called from a kthread
> >> +	 * and CONFIG_X86_X32 is enabled, which is wrong. So check
> >> +	 * whether the caller comes from a syscall (ie. not a kthread)
> >> +	 * before calling in_compat_syscall().
> >> +	 */
> >> +	if (current->flags & PF_KTHREAD)
> >> +		return false;
> >> +#endif
> >
> > This is wrong.  We should fix in_compat_syscall(), not work around it
> > here.
> > (and then there is the fact that the patch changes 'int' to 'bool'
> > without explaining that in the change description).
> >
> > I've sent a query to some relevant people (Cc:ed to James) to ask about
> > fixing in_compat_syscall().
> 
> Upstream says in_compat_syscall() should only be called from a syscall,
> so I have changed the patch to the version below.
> 
> We probably need to remove this in_compat_syscall() before going
> mainline.

Do you know the progress of this work?
 
> NeilBrown
> 
> -static inline int ll_need_32bit_api(struct ll_sb_info *sbi)
> +static inline bool ll_need_32bit_api(struct ll_sb_info *sbi)
>  {
>  #if BITS_PER_LONG == 32
> -	return 1;
> -#elif defined(CONFIG_COMPAT)
> -	return unlikely(in_compat_syscall() ||
> -			(sbi->ll_flags & LL_SBI_32BIT_API));
> +	return true;
> +#else
> +	if (unlikely(sbi->ll_flags & LL_SBI_32BIT_API))
> +		return true;
> +
> +#if defined(CONFIG_COMPAT)
> +	/* in_compat_syscall() is only meaningful inside a syscall.
> +	 * As this can be called from a kthread (e.g. nfsd), we
> +	 * need to catch that case first.  kthreads never need the
> +	 * 32bit api.
> +	 */
> +	if (current->flags & PF_KTHREAD)
> +		return false;
> +
> +	return unlikely(in_compat_syscall());
>  #else
> -	return unlikely(sbi->ll_flags & LL_SBI_32BIT_API);
> +	return false;
> +#endif
>  #endif
>  }
> 
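The decision order of the revised ll_need_32bit_api() can be modeled in userspace as a pure function over the conditions it tests. This is only a sketch. struct api32_state, need_32bit_api() and ino_fits_32bit() are hypothetical stand-ins for the kernel state (BITS_PER_LONG, LL_SBI_32BIT_API, CONFIG_COMPAT, PF_KTHREAD, in_compat_syscall()) and are not Lustre code. ino_fits_32bit() shows why the answer matters. Of the two inode numbers quoted in the commit message, only 4198401 fits in 32 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the state ll_need_32bit_api() consults. */
struct api32_state {
	bool long_is_32bit;	/* BITS_PER_LONG == 32 */
	bool mount_32bit_api;	/* sbi->ll_flags & LL_SBI_32BIT_API */
	bool config_compat;	/* CONFIG_COMPAT is set */
	bool is_kthread;	/* current->flags & PF_KTHREAD */
	bool compat_syscall;	/* what in_compat_syscall() would return */
};

/* Same decision order as the revised ll_need_32bit_api() above. */
static bool need_32bit_api(const struct api32_state *s)
{
	if (s->long_is_32bit)
		return true;
	if (s->mount_32bit_api)
		return true;
	if (!s->config_compat)
		return false;
	/* kthreads never need the 32-bit API, and in_compat_syscall() is
	 * only meaningful inside a syscall, so test this first. */
	if (s->is_kthread)
		return false;
	return s->compat_syscall;
}

/* True if an inode number can be reported through a 32-bit field. */
static bool ino_fits_32bit(uint64_t ino)
{
	return ino <= UINT32_MAX;
}
```

The case from the bug report is a kthread for which in_compat_syscall() misreports true. In this model that case now yields false, and the LL_SBI_32BIT_API mount flag still forces the 32-bit API even from a kthread.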


* [lustre-devel] [PATCH 04/28] lustre: ptlrpc: Do not assert when bd_nob_transferred != 0
  2018-11-04 21:29         ` James Simmons
@ 2018-11-04 23:59           ` NeilBrown
  0 siblings, 0 replies; 69+ messages in thread
From: NeilBrown @ 2018-11-04 23:59 UTC (permalink / raw)
  To: lustre-devel

On Sun, Nov 04 2018, James Simmons wrote:

>> >> On Sun, Oct 14 2018, James Simmons wrote:
>> >> 
>> >> > From: Doug Oucharek <dougso@me.com>
>> >> >
>> >> > There is a case in the routine ptlrpc_register_bulk() where we were
>> >> > asserting if bd_nob_transferred != 0 when not resending.  There is
>> >> > evidence that network errors can create a situation where
>> >> > this does happen. So we should not be asserting!
>> >> >
>> >> > This patch changes that assert to an error return code of -EIO.
>> >> >
>> >> > Signed-off-by: Doug Oucharek <dougso@me.com>
>> >> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828
>> >> > Reviewed-on: https://review.whamcloud.com/28491
>> >> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
>> >> > Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
>> >> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
>> >> > Signed-off-by: James Simmons <jsimmons@infradead.org>
>> >> > ---
>> >> >  drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++--
>> >> >  1 file changed, 6 insertions(+), 2 deletions(-)
>> >> >
>> >> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
>> >> > index 27eb1c0..7e7db24 100644
>> >> > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
>> >> > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
>> >> > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
>> >> >  	/* cleanup the state of the bulk for it will be reused */
>> >> >  	if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY)
>> >> >  		desc->bd_nob_transferred = 0;
>> >> > -	else
>> >> > -		LASSERT(desc->bd_nob_transferred == 0);
>> >> > +	else if (desc->bd_nob_transferred != 0)
>> >> > +		/* If the network failed after an RPC was sent, this condition
>> >> > +		 * could happen.  Rather than assert (was here before), return
>> >> > +		 * an EIO error.
>> >> > +		 */
>> >> > +		return -EIO;
>> >> 
>> >> This looks weird, and the justification is rather lame.
>> >> I wonder if this is an attempt to fix the same problem that the smp_mb()
>> >> in the previous patch was attempting to fix (and I'm not yet convinced
>> >> that either is the correct fix).
>> >
>> > When the above condition happens, the LASSERT takes out the node with a
>> > panic, which in turn kills the application running on the cluster.
>> > When it is replaced with an EIO error, both the node and the job
>> > survive. The job might see an I/O failure, but it would not fail
>> > outright, which is far more important.
>> 
>> Yes, a meaningless error is better than a crash, but a proper fix is
>> better still.  As I said, my guess is that the memory barrier in the
>> previous patch might have fixed the bug, so the LASSERT can remain.
>> 
>> Doug: is there any chance that this might be the case?
>
> I got hold of Doug and discussed this issue. The original logs used to
> track down the problem no longer exist, so finding its root cause can't
> be done at this point. Would you be okay with a version of this patch
> that calls dump_stack(), treated as a debug patch? We really need to
> collect logs to figure out the real problem. I will push a debug patch
> to the OpenSFS branch since it is more widely used.

Yes.
   else if (WARN_ON(desc->bd_nob_transferred != 0))
       /* comment explaining what we know */
       return -EIO;

would be perfectly appropriate.
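The suggested form can be tried out in userspace with simplified stand-in types. model_req, model_bulk_desc, and MODEL_WARN_ON() are hypothetical. This MODEL_WARN_ON() only prints a message, whereas the kernel's WARN_ON() also dumps a backtrace. Like the kernel macro, it yields its condition, so the check can drive the branch directly.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace stand-in for WARN_ON(): report, then return the condition. */
static int model_warn_on(int cond, const char *expr, int line)
{
	if (cond)
		fprintf(stderr, "WARNING at line %d: %s\n", line, expr);
	return cond;
}
#define MODEL_WARN_ON(cond) model_warn_on(!!(cond), #cond, __LINE__)

/* Hypothetical, trimmed-down versions of the ptlrpc structures. */
struct model_bulk_desc {
	unsigned int bd_nob_transferred;
};

struct model_req {
	bool rq_resend;
	bool rq_replay;  /* models rq_send_state == LUSTRE_IMP_REPLAY */
	struct model_bulk_desc *desc;
};

static int model_register_bulk(struct model_req *req)
{
	struct model_bulk_desc *desc = req->desc;

	/* cleanup the state of the bulk for it will be reused */
	if (req->rq_resend || req->rq_replay)
		desc->bd_nob_transferred = 0;
	else if (MODEL_WARN_ON(desc->bd_nob_transferred != 0))
		/* Stale byte count on a fresh send, believed to follow a
		 * network failure; root cause still unknown, so warn and
		 * fail the I/O instead of panicking the node.
		 */
		return -EIO;

	return 0;
}
```

The warning leaves a trace in the logs for later debugging while the node stays up, which is the compromise discussed above.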

Thanks,
NeilBrown


* [lustre-devel] [PATCH 15/28] lustre: llite: fix for stat under kthread and X86_X32
  2018-11-04 21:35       ` James Simmons
@ 2018-11-05  0:03         ` NeilBrown
From: NeilBrown @ 2018-11-05  0:03 UTC (permalink / raw)
  To: lustre-devel

On Sun, Nov 04 2018, James Simmons wrote:

>> On Thu, Oct 18 2018, NeilBrown wrote:
>> 
>> > On Sun, Oct 14 2018, James Simmons wrote:
>> >
>> >> From: Frank Zago <fzago@cray.com>
>> >>
>> >> Under the following conditions, ll_getattr will flatten the inode
>> >> number when it shouldn't:
>> >>
>> >>  - CONFIG_X86_X32 is defined, even if the X32 ABI is never actually
>> >>    used,
>> >>  - ll_getattr is called from a kernel thread (through vfs_getattr, for
>> >>    instance).
>> >>
>> >> As a result, the same file reports different inode numbers depending
>> >> on whether it is stat'ed from a kernel thread or from a syscall. For
>> >> instance, 4198401 vs. 144115205272502273.
>> >>
>> >> ll_getattr calls ll_need_32bit_api to determine whether the task is 32
>> >> bits. For the kthread+X86_X32 combination, that function reports that
>> >> the task is 32 bits, which is incorrect, as the kernel is 64 bits.
>> >>
>> >> The solution is to check whether the call is from a kernel thread
>> >> (which is 64 bits) and act accordingly.
>> >>
>> >> Signed-off-by: Frank Zago <fzago@cray.com>
>> >> WC-bug-id: https://jira.whamcloud.com/browse/LU-9468
>> >> Reviewed-on: https://review.whamcloud.com/26992
>> >> Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
>> >> Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
>> >> Signed-off-by: James Simmons <jsimmons@infradead.org>
>> >> ---
>> >>  drivers/staging/lustre/lustre/llite/dir.c          |  6 +++---
>> >>  drivers/staging/lustre/lustre/llite/lcommon_cl.c   |  2 +-
>> >>  .../staging/lustre/lustre/llite/llite_internal.h   | 22 +++++++++++++++++-----
>> >>  3 files changed, 21 insertions(+), 9 deletions(-)
>> >>
>> >> diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
>> >> index 231b351..19c5e9c 100644
>> >> --- a/drivers/staging/lustre/lustre/llite/dir.c
>> >> +++ b/drivers/staging/lustre/lustre/llite/dir.c
>> >> @@ -202,7 +202,7 @@ int ll_dir_read(struct inode *inode, __u64 *ppos, struct md_op_data *op_data,
>> >>  {
>> >>  	struct ll_sb_info    *sbi	= ll_i2sbi(inode);
>> >>  	__u64		   pos		= *ppos;
>> >> -	int		   is_api32 = ll_need_32bit_api(sbi);
>> >> +	bool is_api32 = ll_need_32bit_api(sbi);
>> >>  	int		   is_hash64 = sbi->ll_flags & LL_SBI_64BIT_HASH;
>> >>  	struct page	  *page;
>> >>  	bool		   done = false;
>> >> @@ -296,7 +296,7 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
>> >>  	struct ll_sb_info	*sbi	= ll_i2sbi(inode);
>> >>  	__u64 pos = lfd ? lfd->lfd_pos : 0;
>> >>  	int			hash64	= sbi->ll_flags & LL_SBI_64BIT_HASH;
>> >> -	int			api32	= ll_need_32bit_api(sbi);
>> >> +	bool api32 = ll_need_32bit_api(sbi);
>> >>  	struct md_op_data *op_data;
>> >>  	int			rc;
>> >>  
>> >> @@ -1674,7 +1674,7 @@ static loff_t ll_dir_seek(struct file *file, loff_t offset, int origin)
>> >>  	struct inode *inode = file->f_mapping->host;
>> >>  	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
>> >>  	struct ll_sb_info *sbi = ll_i2sbi(inode);
>> >> -	int api32 = ll_need_32bit_api(sbi);
>> >> +	bool api32 = ll_need_32bit_api(sbi);
>> >>  	loff_t ret = -EINVAL;
>> >>  
>> >>  	switch (origin) {
>> >> diff --git a/drivers/staging/lustre/lustre/llite/lcommon_cl.c b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
>> >> index 30f17ea..20a3c74 100644
>> >> --- a/drivers/staging/lustre/lustre/llite/lcommon_cl.c
>> >> +++ b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
>> >> @@ -267,7 +267,7 @@ void cl_inode_fini(struct inode *inode)
>> >>  /**
>> >>   * build inode number from passed @fid
>> >>   */
>> >> -__u64 cl_fid_build_ino(const struct lu_fid *fid, int api32)
>> >> +u64 cl_fid_build_ino(const struct lu_fid *fid, bool api32)
>> >>  {
>> >>  	if (BITS_PER_LONG == 32 || api32)
>> >>  		return fid_flatten32(fid);
>> >> diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
>> >> index dcb2fed..796a8ae 100644
>> >> --- a/drivers/staging/lustre/lustre/llite/llite_internal.h
>> >> +++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
>> >> @@ -651,13 +651,25 @@ static inline struct inode *ll_info2i(struct ll_inode_info *lli)
>> >>  __u32 ll_i2suppgid(struct inode *i);
>> >>  void ll_i2gids(__u32 *suppgids, struct inode *i1, struct inode *i2);
>> >>  
>> >> -static inline int ll_need_32bit_api(struct ll_sb_info *sbi)
>> >> +static inline bool ll_need_32bit_api(struct ll_sb_info *sbi)
>> >>  {
>> >>  #if BITS_PER_LONG == 32
>> >> -	return 1;
>> >> +	return true;
>> >>  #elif defined(CONFIG_COMPAT)
>> >> -	return unlikely(in_compat_syscall() ||
>> >> -			(sbi->ll_flags & LL_SBI_32BIT_API));
>> >> +	if (unlikely(sbi->ll_flags & LL_SBI_32BIT_API))
>> >> +		return true;
>> >> +
>> >> +#ifdef CONFIG_X86_X32
>> >> +	/* in_compat_syscall() returns true when called from a kthread
>> >> +	 * and CONFIG_X86_X32 is enabled, which is wrong. So check
>> >> +	 * whether the caller comes from a syscall (ie. not a kthread)
>> >> +	 * before calling in_compat_syscall().
>> >> +	 */
>> >> +	if (current->flags & PF_KTHREAD)
>> >> +		return false;
>> >> +#endif
>> >
>> > This is wrong.  We should fix in_compat_syscall(), not work around it
>> > here.
>> > (and then there is that fact that the patch changes 'int' to 'bool'
>> > without explaining that in the change description).
>> >
>> > I've sent a query to some relevant people (Cc:ed to James) to ask about
>> > fixing in_compat_syscall().
>> 
>> Upstream says in_compat_syscall() should only be called from a syscall,
>> so I have changed the patch to the one below.
>> 
>> We probably need to remove this in_compat_syscall() before going
>> mainline.
>
> Do you know the progress of this work?
>  

Below is the code that I currently have.
It avoids making a special case of X86_X32 - I should update
the comment to reflect that.

I haven't given much thought yet to removing the need for
in_compat_syscall().  I need to make sure I understand exactly why
it is currently needed first.

NeilBrown


commit c69633550256d1a68306caf4f67a7d58ba8763e8
Author: Frank Zago <fzago@cray.com>
Date:   Sun Oct 14 14:58:05 2018 -0400

    lustre: llite: fix for stat under kthread and X86_X32
    
    Under the following conditions, ll_getattr will flatten the inode
    number when it shouldn't:
    
     - CONFIG_X86_X32 is defined, even if the X32 ABI is never actually
       used,
     - ll_getattr is called from a kernel thread (through vfs_getattr, for
       instance).
    
    As a result, the same file reports different inode numbers depending
    on whether it is stat'ed from a kernel thread or from a syscall. For
    instance, 4198401 vs. 144115205272502273.
    
    ll_getattr calls ll_need_32bit_api to determine whether the task is 32
    bits. For the kthread+X86_X32 combination, that function reports that
    the task is 32 bits, which is incorrect, as the kernel is 64 bits.
    
    The solution is to check whether the call is from a kernel thread
    (which is 64 bits) and act accordingly.
    
    Signed-off-by: Frank Zago <fzago@cray.com>
    WC-bug-id: https://jira.whamcloud.com/browse/LU-9468
    Reviewed-on: https://review.whamcloud.com/26992
    Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
    Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
    Signed-off-by: James Simmons <jsimmons@infradead.org>
    Signed-off-by: NeilBrown <neilb@suse.com>

diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
index 231b351536bf..19c5e9cee3f9 100644
--- a/drivers/staging/lustre/lustre/llite/dir.c
+++ b/drivers/staging/lustre/lustre/llite/dir.c
@@ -202,7 +202,7 @@ int ll_dir_read(struct inode *inode, __u64 *ppos, struct md_op_data *op_data,
 {
 	struct ll_sb_info    *sbi	= ll_i2sbi(inode);
 	__u64		   pos		= *ppos;
-	int		   is_api32 = ll_need_32bit_api(sbi);
+	bool is_api32 = ll_need_32bit_api(sbi);
 	int		   is_hash64 = sbi->ll_flags & LL_SBI_64BIT_HASH;
 	struct page	  *page;
 	bool		   done = false;
@@ -296,7 +296,7 @@ static int ll_readdir(struct file *filp, struct dir_context *ctx)
 	struct ll_sb_info	*sbi	= ll_i2sbi(inode);
 	__u64 pos = lfd ? lfd->lfd_pos : 0;
 	int			hash64	= sbi->ll_flags & LL_SBI_64BIT_HASH;
-	int			api32	= ll_need_32bit_api(sbi);
+	bool api32 = ll_need_32bit_api(sbi);
 	struct md_op_data *op_data;
 	int			rc;
 
@@ -1674,7 +1674,7 @@ static loff_t ll_dir_seek(struct file *file, loff_t offset, int origin)
 	struct inode *inode = file->f_mapping->host;
 	struct ll_file_data *fd = LUSTRE_FPRIVATE(file);
 	struct ll_sb_info *sbi = ll_i2sbi(inode);
-	int api32 = ll_need_32bit_api(sbi);
+	bool api32 = ll_need_32bit_api(sbi);
 	loff_t ret = -EINVAL;
 
 	switch (origin) {
diff --git a/drivers/staging/lustre/lustre/llite/lcommon_cl.c b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
index 30f17eaa6b2c..20a3c749f085 100644
--- a/drivers/staging/lustre/lustre/llite/lcommon_cl.c
+++ b/drivers/staging/lustre/lustre/llite/lcommon_cl.c
@@ -267,7 +267,7 @@ void cl_inode_fini(struct inode *inode)
 /**
  * build inode number from passed @fid
  */
-__u64 cl_fid_build_ino(const struct lu_fid *fid, int api32)
+u64 cl_fid_build_ino(const struct lu_fid *fid, bool api32)
 {
 	if (BITS_PER_LONG == 32 || api32)
 		return fid_flatten32(fid);
diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index dcb2fed7a350..26c35f5d28a6 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -651,15 +651,27 @@ static inline struct inode *ll_info2i(struct ll_inode_info *lli)
 __u32 ll_i2suppgid(struct inode *i);
 void ll_i2gids(__u32 *suppgids, struct inode *i1, struct inode *i2);
 
-static inline int ll_need_32bit_api(struct ll_sb_info *sbi)
+static inline bool ll_need_32bit_api(struct ll_sb_info *sbi)
 {
 #if BITS_PER_LONG == 32
-	return 1;
-#elif defined(CONFIG_COMPAT)
-	return unlikely(in_compat_syscall() ||
-			(sbi->ll_flags & LL_SBI_32BIT_API));
+	return true;
+#else
+	if (unlikely(sbi->ll_flags & LL_SBI_32BIT_API))
+		return true;
+
+#if defined(CONFIG_COMPAT)
+	/* in_compat_syscall() is only meaningful inside a syscall.
+	 * As this can be called from a kthread (e.g. nfsd), we
+	 * need to catch that case first.  kthreads never need the
+	 * 32bit api.
+	 */
+	if (current->flags & PF_KTHREAD)
+		return false;
+
+	return unlikely(in_compat_syscall());
 #else
-	return unlikely(sbi->ll_flags & LL_SBI_32BIT_API);
+	return false;
+#endif
 #endif
 }
 
@@ -1353,7 +1365,7 @@ extern u16 cl_inode_fini_refcheck;
 int cl_file_inode_init(struct inode *inode, struct lustre_md *md);
 void cl_inode_fini(struct inode *inode);
 
-__u64 cl_fid_build_ino(const struct lu_fid *fid, int api32);
+u64 cl_fid_build_ino(const struct lu_fid *fid, bool api32);
 __u32 cl_fid_build_gen(const struct lu_fid *fid);
 
 #endif /* LLITE_INTERNAL_H */
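The two inode numbers quoted in the commit message are consistent with a single FID, [0x200000401:0x1:0x0], flattened once in 64-bit mode and once in 32-bit mode. The sketch below reproduces that arithmetic. It approximates Lustre's fid_flatten() and fid_flatten32() helpers with IGIF FID handling omitted, and assumes FID_SEQ_START is 0x200000000. All model_* names are hypothetical.

```c
#include <stdint.h>

/* Minimal stand-in for struct lu_fid. */
struct model_fid {
	uint64_t f_seq;
	uint32_t f_oid;
	uint32_t f_ver;
};

/* Assumed value of FID_SEQ_START. */
#define MODEL_FID_SEQ_START 0x200000000ULL

/* 64-bit inode number: used when api32 is false. */
static uint64_t model_fid_flatten(const struct model_fid *fid)
{
	uint64_t seq = fid->f_seq;
	uint64_t ino = (seq << 24) + ((seq >> 24) & 0xffffff0000ULL) +
		       fid->f_oid;

	return ino ? ino : fid->f_oid;
}

/* 32-bit inode number: used when cl_fid_build_ino() gets api32 == true.
 * Mixes low SEQ bits and OID bits so that nearby FIDs rarely collide. */
static uint32_t model_fid_flatten32(const struct model_fid *fid)
{
	uint64_t seq = fid->f_seq - MODEL_FID_SEQ_START;
	uint32_t oid = fid->f_oid;
	/* The sum is truncated to 32 bits, as in the original helper. */
	uint32_t ino = (uint32_t)(((seq & 0x000fffffULL) << 12) +
				  ((seq >> 8) & 0xfffff000ULL) +
				  ((seq >> 32) & 0xffffff00ULL) +
				  (oid & 0xff000fffU) +
				  ((uint64_t)(oid & 0x00fff000U) << 8));

	return ino ? ino : oid;
}
```

So a kthread that wrongly takes the 32-bit path would report 4198401, while a native 64-bit stat() of the same file reports 144115205272502273.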


* [lustre-devel] [PATCH 18/28] lustre: mdc: expose changelog through char devices
  2018-11-04 21:31     ` James Simmons
@ 2018-11-05  0:13       ` NeilBrown
From: NeilBrown @ 2018-11-05  0:13 UTC (permalink / raw)
  To: lustre-devel

On Sun, Nov 04 2018, James Simmons wrote:

>> On Sun, Oct 14 2018, James Simmons wrote:
>> 
>> > From: Henri Doreau <henri.doreau@cea.fr>
>> >
>> > Register one character device per MDT in order to allow non-llapi
>> > consumers to read changelogs and to make delivery more efficient.
>> >
>> > - open() spawns a thread to prefetch records and enqueue them into a
>> >   local buffer (unless the device is open in write-only mode).
>> > - lseek() can be used to jump to a specific record, in which case the
>> >   offset is a record number (with SEEK_SET) or a number of records to
>> >   skip (SEEK_CUR). Movement can only be done forward.
>> > - read() copies records to userland. No truncation happens, so short
>> >   reads are likely.
>> > - write() is used to transmit control commands to the device.
>> >   The only available one is changelog_clear, which is done by writing
>> >   "clear:cl<user>:<recno>" into the device.
>> > - close() terminates the prefetch thread if any, and releases resources.
>> >
>> > It is possible to poll() on the device to get notified when new records
>> > are available for read.
>> >
>> > Signed-off-by: Henri Doreau <henri.doreau@cea.fr>
>> > WC-bug-id: https://jira.whamcloud.com/browse/LU-7659
>> > Reviewed-on: https://review.whamcloud.com/18900
>> > Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
>> > Reviewed-by: John L. Hammond <jhammond@whamcloud.com>
>> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
>> > Signed-off-by: James Simmons <jsimmons@infradead.org>
>> 
>> This patches causes problems around sanity test 161a.
>> If you run -only '160h 160i 161a' it hangs.
>> 
>> Adding
>> Commit: 89e52326b5bd ("LU-10166 mdc: invalid free in changelog reader")
>> seems to fix the problem, so I'll port that into the series immediately
>> after this patch.
>
> I was planning to push this in the next batch. In that case I won't push
> it, and will just wait for it to show up in lustre-testing.
>

It turns out that it didn't fix my problem; I don't know why it seemed
to. Still, I'll keep it, and hope to push out a new lustre-testing today.

For now, I've disabled 160h and 160i as they always lead to problems
at about 161a.  Messages like

[  226.614732] Lustre: 4149:0:(client.c:2064:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1540794670/real 1540794670]  req@000000008eaae33d x1615640091691072/t0(0) o250->MGC192.168.20.11@tcp@192.168.20.11@tcp:26/25 lens 520/544 e 0 to 1 dl 1540794676 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1

keep appearing, and no progress is made.
It is definitely related to the new changelog code, but I have no ideas
beyond that.

Thanks,
NeilBrown
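A userspace consumer of the changelog character device described in the commit message might look like the sketch below. The device path (for example /dev/changelog-lustre-MDT0000, derived from the "changelog-{MDTNAME}" misc device name) and all helper names are assumptions. The semantics follow the commit message:
- lseek(SEEK_SET) takes a record number, and SEEK_CUR takes a number of records to skip, forward only.
- Reads may be short because records are never truncated.
- poll() signals new records.
- A write-only open does not start the prefetch thread.

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Format the only control command accepted on write():
 * "clear:cl<user>:<recno>". Returns its length, or -1 if @len is
 * too small. */
int chlg_format_clear(char *buf, size_t len, unsigned int user,
		      unsigned long long recno)
{
	int n = snprintf(buf, len, "clear:cl%u:%llu", user, recno);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Clear records up to @recno for changelog user cl<user>.
 * Opening write-only avoids starting the prefetch thread. */
int chlg_clear(const char *path, unsigned int user, unsigned long long recno)
{
	char cmd[64];
	ssize_t written;
	int fd;

	if (chlg_format_clear(cmd, sizeof(cmd), user, recno) < 0)
		return -1;
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	written = write(fd, cmd, strlen(cmd));
	close(fd);
	return written == (ssize_t)strlen(cmd) ? 0 : -1;
}

/* Open for reading, positioned at record @start_rec. With SEEK_SET the
 * offset is a record number, not a byte offset. */
int chlg_open_at(const char *path, off_t start_rec)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (start_rec > 0 && lseek(fd, start_rec, SEEK_SET) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Wait up to @timeout_ms for records, then read what is available.
 * Returns bytes read, 0 on timeout, -1 on error. */
ssize_t chlg_read_wait(int fd, char *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc = poll(&pfd, 1, timeout_ms);

	if (rc <= 0)
		return rc;
	/* Records are never split, so a short read is normal. */
	return read(fd, buf, len);
}
```

A typical loop would open the device once with chlg_open_at(), call chlg_read_wait() repeatedly while parsing struct changelog_rec entries out of the buffer, and periodically call chlg_clear() with the last processed record number.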

>
>> >  .../include/uapi/linux/lustre/lustre_kernelcomm.h  |   3 -
>> >  .../lustre/include/uapi/linux/lustre/lustre_user.h |   7 -
>> >  drivers/staging/lustre/lustre/include/obd.h        |   2 +
>> >  drivers/staging/lustre/lustre/ldlm/ldlm_lib.c      |   2 +
>> >  drivers/staging/lustre/lustre/llite/dir.c          |   8 -
>> >  drivers/staging/lustre/lustre/lmv/lmv_obd.c        |  13 -
>> >  drivers/staging/lustre/lustre/mdc/Makefile         |   2 +-
>> >  drivers/staging/lustre/lustre/mdc/mdc_changelog.c  | 722 +++++++++++++++++++++
>> >  drivers/staging/lustre/lustre/mdc/mdc_internal.h   |   4 +
>> >  drivers/staging/lustre/lustre/mdc/mdc_request.c    | 198 +-----
>> >  11 files changed, 745 insertions(+), 218 deletions(-)
>> >  create mode 100644 drivers/staging/lustre/lustre/mdc/mdc_changelog.c
>> >
>> > diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
>> > index 6e4e109..098b6451 100644
>> > --- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
>> > +++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_ioctl.h
>> > @@ -172,7 +172,7 @@ static inline __u32 obd_ioctl_packlen(struct obd_ioctl_data *data)
>> >  #define OBD_GET_VERSION		_IOWR('f', 144, OBD_IOC_DATA_TYPE)
>> >  /*	OBD_IOC_GSS_SUPPORT	_IOWR('f', 145, OBD_IOC_DATA_TYPE) */
>> >  /*	OBD_IOC_CLOSE_UUID	_IOWR('f', 147, OBD_IOC_DATA_TYPE) */
>> > -#define OBD_IOC_CHANGELOG_SEND	_IOW('f', 148, OBD_IOC_DATA_TYPE)
>> > +/*	OBD_IOC_CHANGELOG_SEND	_IOW('f', 148, OBD_IOC_DATA_TYPE) */
>> >  #define OBD_IOC_GETDEVICE	_IOWR('f', 149, OBD_IOC_DATA_TYPE)
>> >  #define OBD_IOC_FID2PATH	_IOWR('f', 150, OBD_IOC_DATA_TYPE)
>> >  /*	lustre/lustre_user.h	151-153 */
>> > diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h
>> > index 94dadbe..d84a8fc 100644
>> > --- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h
>> > +++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_kernelcomm.h
>> > @@ -54,15 +54,12 @@ struct kuc_hdr {
>> >  	__u16 kuc_msglen;
>> >  } __aligned(sizeof(__u64));
>> >  
>> > -#define KUC_CHANGELOG_MSG_MAXSIZE (sizeof(struct kuc_hdr) + CR_MAXSIZE)
>> > -
>> >  #define KUC_MAGIC		0x191C /*Lustre9etLinC */
>> >  
>> >  /* kuc_msgtype values are defined in each transport */
>> >  enum kuc_transport_type {
>> >  	KUC_TRANSPORT_GENERIC	= 1,
>> >  	KUC_TRANSPORT_HSM	= 2,
>> > -	KUC_TRANSPORT_CHANGELOG	= 3,
>> >  };
>> >  
>> >  enum kuc_generic_message_type {
>> > diff --git a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
>> > index b8525e5..715f1c5 100644
>> > --- a/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
>> > +++ b/drivers/staging/lustre/include/uapi/linux/lustre/lustre_user.h
>> > @@ -967,13 +967,6 @@ static inline void changelog_remap_rec(struct changelog_rec *rec,
>> >  	rec->cr_flags = (rec->cr_flags & CLF_FLAGMASK) | crf_wanted;
>> >  }
>> >  
>> > -struct ioc_changelog {
>> > -	__u64 icc_recno;
>> > -	__u32 icc_mdtindex;
>> > -	__u32 icc_id;
>> > -	__u32 icc_flags;
>> > -};
>> > -
>> >  enum changelog_message_type {
>> >  	CL_RECORD = 10, /* message is a changelog_rec */
>> >  	CL_EOF    = 11, /* at end of current changelog */
>> > diff --git a/drivers/staging/lustre/lustre/include/obd.h b/drivers/staging/lustre/lustre/include/obd.h
>> > index 11e7ae8..76ae0b3 100644
>> > --- a/drivers/staging/lustre/lustre/include/obd.h
>> > +++ b/drivers/staging/lustre/lustre/include/obd.h
>> > @@ -345,6 +345,8 @@ struct client_obd {
>> >  	void			*cl_lru_work;
>> >  	/* hash tables for osc_quota_info */
>> >  	struct rhashtable	cl_quota_hash[MAXQUOTAS];
>> > +	/* Links to the global list of registered changelog devices */
>> > +	struct list_head	cl_chg_dev_linkage;
>> >  };
>> >  
>> >  #define obd2cli_tgt(obd) ((char *)(obd)->u.cli.cl_target_uuid.uuid)
>> > diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c b/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c
>> > index 32eda4f..732ef3a 100644
>> > --- a/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c
>> > +++ b/drivers/staging/lustre/lustre/ldlm/ldlm_lib.c
>> > @@ -395,6 +395,8 @@ int client_obd_setup(struct obd_device *obddev, struct lustre_cfg *lcfg)
>> >  	init_waitqueue_head(&cli->cl_mod_rpcs_waitq);
>> >  	cli->cl_mod_tag_bitmap = NULL;
>> >  
>> > +	INIT_LIST_HEAD(&cli->cl_chg_dev_linkage);
>> > +
>> >  	if (connect_op == MDS_CONNECT) {
>> >  		cli->cl_max_mod_rpcs_in_flight = cli->cl_max_rpcs_in_flight - 1;
>> >  		cli->cl_mod_tag_bitmap = kcalloc(BITS_TO_LONGS(OBD_MAX_RIF_MAX),
>> > diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
>> > index 19c5e9c..36cea8d 100644
>> > --- a/drivers/staging/lustre/lustre/llite/dir.c
>> > +++ b/drivers/staging/lustre/lustre/llite/dir.c
>> > @@ -1481,14 +1481,6 @@ static long ll_dir_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>> >  		return obd_iocontrol(cmd, sbi->ll_md_exp, 0, NULL,
>> >  				     (void __user *)arg);
>> >  	}
>> > -	case OBD_IOC_CHANGELOG_SEND:
>> > -	case OBD_IOC_CHANGELOG_CLEAR:
>> > -		if (!capable(CAP_SYS_ADMIN))
>> > -			return -EPERM;
>> > -
>> > -		rc = copy_and_ioctl(cmd, sbi->ll_md_exp, (void __user *)arg,
>> > -				    sizeof(struct ioc_changelog));
>> > -		return rc;
>> >  	case OBD_IOC_FID2PATH:
>> >  		return ll_fid2path(inode, (void __user *)arg);
>> >  	case LL_IOC_GETPARENT:
>> > diff --git a/drivers/staging/lustre/lustre/lmv/lmv_obd.c b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
>> > index 952c68e..32bb9fc 100644
>> > --- a/drivers/staging/lustre/lustre/lmv/lmv_obd.c
>> > +++ b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
>> > @@ -951,19 +951,6 @@ static int lmv_iocontrol(unsigned int cmd, struct obd_export *exp,
>> >  		kfree(oqctl);
>> >  		break;
>> >  	}
>> > -	case OBD_IOC_CHANGELOG_SEND:
>> > -	case OBD_IOC_CHANGELOG_CLEAR: {
>> > -		struct ioc_changelog *icc = karg;
>> > -
>> > -		if (icc->icc_mdtindex >= count)
>> > -			return -ENODEV;
>> > -
>> > -		tgt = lmv->tgts[icc->icc_mdtindex];
>> > -		if (!tgt || !tgt->ltd_exp || !tgt->ltd_active)
>> > -			return -ENODEV;
>> > -		rc = obd_iocontrol(cmd, tgt->ltd_exp, sizeof(*icc), icc, NULL);
>> > -		break;
>> > -	}
>> >  	case LL_IOC_GET_CONNECT_FLAGS: {
>> >  		tgt = lmv->tgts[0];
>> >  
>> > diff --git a/drivers/staging/lustre/lustre/mdc/Makefile b/drivers/staging/lustre/lustre/mdc/Makefile
>> > index 64cf49e..5f48e91 100644
>> > --- a/drivers/staging/lustre/lustre/mdc/Makefile
>> > +++ b/drivers/staging/lustre/lustre/mdc/Makefile
>> > @@ -2,4 +2,4 @@ ccflags-y += -I$(srctree)/drivers/staging/lustre/include
>> >  ccflags-y += -I$(srctree)/drivers/staging/lustre/lustre/include
>> >  
>> >  obj-$(CONFIG_LUSTRE_FS) += mdc.o
>> > -mdc-y := mdc_request.o mdc_reint.o mdc_lib.o mdc_locks.o lproc_mdc.o
>> > +mdc-y := mdc_changelog.o mdc_request.o mdc_reint.o mdc_lib.o mdc_locks.o lproc_mdc.o
>> > diff --git a/drivers/staging/lustre/lustre/mdc/mdc_changelog.c b/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
>> > new file mode 100644
>> > index 0000000..a5f3c64
>> > --- /dev/null
>> > +++ b/drivers/staging/lustre/lustre/mdc/mdc_changelog.c
>> > @@ -0,0 +1,722 @@
>> > +// SPDX-License-Identifier: GPL-2.0
>> > +/*
>> > + * GPL HEADER START
>> > + *
>> > + * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
>> > + *
>> > + * This program is free software; you can redistribute it and/or modify
>> > + * it under the terms of the GNU General Public License version 2 only,
>> > + * as published by the Free Software Foundation.
>> > + *
>> > + * This program is distributed in the hope that it will be useful, but
>> > + * WITHOUT ANY WARRANTY; without even the implied warranty of
>> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> > + * General Public License version 2 for more details (a copy is included
>> > + * in the LICENSE file that accompanied this code).
>> > + *
>> > + * You should have received a copy of the GNU General Public License
>> > + * version 2 along with this program; If not, see
>> > + * http://www.gnu.org/licenses/gpl-2.0.html
>> > + *
>> > + * GPL HEADER END
>> > + */
>> > +/*
>> > + * Copyright (c) 2017, Commissariat a l'Energie Atomique et aux Energies
>> > + *                     Alternatives.
>> > + *
>> > + * Author: Henri Doreau <henri.doreau@cea.fr>
>> > + */
>> > +
>> > +#define DEBUG_SUBSYSTEM S_MDC
>> > +
>> > +#include <linux/init.h>
>> > +#include <linux/kthread.h>
>> > +#include <linux/poll.h>
>> > +#include <linux/miscdevice.h>
>> > +
>> > +#include <lustre_log.h>
>> > +
>> > +#include "mdc_internal.h"
>> > +
>> > +/*
>> > + * -- Changelog delivery through character device --
>> > + */
>> > +
>> > +/**
>> > + * Mutex to protect chlg_registered_devices below
>> > + */
>> > +static DEFINE_MUTEX(chlg_registered_dev_lock);
>> > +
>> > +/**
>> > + * Global linked list of all registered devices (one per MDT).
>> > + */
>> > +static LIST_HEAD(chlg_registered_devices);
>> > +
>> > +struct chlg_registered_dev {
>> > +	/* Device name of the form "changelog-{MDTNAME}" */
>> > +	char			ced_name[32];
>> > +	/* Misc device descriptor */
>> > +	struct miscdevice	ced_misc;
>> > +	/* OBDs referencing this device (multiple mount point) */
>> > +	struct list_head	ced_obds;
>> > +	/* Reference counter for proper deregistration */
>> > +	struct kref		ced_refs;
>> > +	/* Link within the global chlg_registered_devices */
>> > +	struct list_head	ced_link;
>> > +};
>> > +
>> > +struct chlg_reader_state {
>> > +	/* Shortcut to the corresponding OBD device */
>> > +	struct obd_device	*crs_obd;
>> > +	/* An error occurred that prevents reading further */
>> > +	bool			 crs_err;
>> > +	/* EOF, no more records available */
>> > +	bool			 crs_eof;
>> > +	/* Userland reader closed connection */
>> > +	bool			 crs_closed;
>> > +	/* Desired start position */
>> > +	u64			 crs_start_offset;
>> > +	/* Wait queue for the catalog processing thread */
>> > +	wait_queue_head_t	 crs_waitq_prod;
>> > +	/* Wait queue for the record copy threads */
>> > +	wait_queue_head_t	 crs_waitq_cons;
>> > +	/* Mutex protecting crs_rec_count and crs_rec_queue */
>> > +	struct mutex		 crs_lock;
>> > +	/* Number of items in the list */
>> > +	u64			 crs_rec_count;
>> > +	/* List of prefetched chlg_rec_entry::enq_linkage items */
>> > +	struct list_head	 crs_rec_queue;
>> > +};
>> > +
>> > +struct chlg_rec_entry {
>> > +	/* Link within the chlg_reader_state::crs_rec_queue list */
>> > +	struct list_head	enq_linkage;
>> > +	/* Data (enq_record) field length */
>> > +	u64			enq_length;
>> > +	/* Copy of a changelog record (see struct llog_changelog_rec) */
>> > +	struct changelog_rec	enq_record[];
>> > +};
>> > +
>> > +enum {
>> > +	/* Number of records to prefetch locally. */
>> > +	CDEV_CHLG_MAX_PREFETCH = 1024,
>> > +};
>> > +
>> > +/**
>> > + * ChangeLog catalog processing callback invoked on each record.
>> > + * If the current record is eligible to userland delivery, push
>> > + * it into the crs_rec_queue where the consumer code will fetch it.
>> > + *
>> > + * @param[in]     env  (unused)
>> > + * @param[in]     llh  Client-side handle used to identify the llog
>> > + * @param[in]     hdr  Header of the current llog record
>> > + * @param[in,out] data chlg_reader_state passed from caller
>> > + *
>> > + * @return 0 or LLOG_PROC_* control code on success, negated error on failure.
>> > + */
>> > +static int chlg_read_cat_process_cb(const struct lu_env *env,
>> > +				    struct llog_handle *llh,
>> > +				    struct llog_rec_hdr *hdr, void *data)
>> > +{
>> > +	struct llog_changelog_rec *rec;
>> > +	struct chlg_reader_state *crs = data;
>> > +	struct chlg_rec_entry *enq;
>> > +	size_t len;
>> > +	int rc;
>> > +
>> > +	LASSERT(crs);
>> > +	LASSERT(hdr);
>> > +
>> > +	rec = container_of(hdr, struct llog_changelog_rec, cr_hdr);
>> > +
>> > +	if (rec->cr_hdr.lrh_type != CHANGELOG_REC) {
>> > +		rc = -EINVAL;
>> > +		CERROR("%s: not a changelog rec %x/%d in llog : rc = %d\n",
>> > +		       crs->crs_obd->obd_name, rec->cr_hdr.lrh_type,
>> > +		       rec->cr.cr_type, rc);
>> > +		return rc;
>> > +	}
>> > +
>> > +	/* Skip undesired records */
>> > +	if (rec->cr.cr_index < crs->crs_start_offset)
>> > +		return 0;
>> > +
>> > +	CDEBUG(D_HSM, "%llu %02d%-5s %llu 0x%x t=" DFID " p=" DFID " %.*s\n",
>> > +	       rec->cr.cr_index, rec->cr.cr_type,
>> > +	       changelog_type2str(rec->cr.cr_type), rec->cr.cr_time,
>> > +	       rec->cr.cr_flags & CLF_FLAGMASK,
>> > +	       PFID(&rec->cr.cr_tfid), PFID(&rec->cr.cr_pfid),
>> > +	       rec->cr.cr_namelen, changelog_rec_name(&rec->cr));
>> > +
>> > +	wait_event_idle(crs->crs_waitq_prod,
>> > +			(crs->crs_rec_count < CDEV_CHLG_MAX_PREFETCH ||
>> > +			 crs->crs_closed));
>> > +
>> > +	if (crs->crs_closed)
>> > +		return LLOG_PROC_BREAK;
>> > +
>> > +	len = changelog_rec_size(&rec->cr) + rec->cr.cr_namelen;
>> > +	enq = kzalloc(sizeof(*enq) + len, GFP_KERNEL);
>> > +	if (!enq)
>> > +		return -ENOMEM;
>> > +
>> > +	INIT_LIST_HEAD(&enq->enq_linkage);
>> > +	enq->enq_length = len;
>> > +	memcpy(enq->enq_record, &rec->cr, len);
>> > +
>> > +	mutex_lock(&crs->crs_lock);
>> > +	list_add_tail(&enq->enq_linkage, &crs->crs_rec_queue);
>> > +	crs->crs_rec_count++;
>> > +	mutex_unlock(&crs->crs_lock);
>> > +
>> > +	wake_up_all(&crs->crs_waitq_cons);
>> > +
>> > +	return 0;
>> > +}
>> > +
>> > +/**
>> > + * Remove record from the list it is attached to and free it.
>> > + */
>> > +static void enq_record_delete(struct chlg_rec_entry *rec)
>> > +{
>> > +	list_del(&rec->enq_linkage);
>> > +	kfree(rec);
>> > +}
>> > +
>> > +/**
>> > + * Release resources associated with a chlg_reader_state instance.
>> > + *
>> > + * @param  crs  CRS instance to release.
>> > + */
>> > +static void crs_free(struct chlg_reader_state *crs)
>> > +{
>> > +	struct chlg_rec_entry *rec;
>> > +	struct chlg_rec_entry *tmp;
>> > +
>> > +	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage)
>> > +		enq_record_delete(rec);
>> > +
>> > +	kfree(crs);
>> > +}
>> > +
>> > +/**
>> > + * Record prefetch thread entry point. Opens the changelog catalog and starts
>> > + * reading records.
>> > + *
>> > + * @param[in,out]  args  chlg_reader_state passed from caller.
>> > + * @return 0 on success, negated error code on failure.
>> > + */
>> > +static int chlg_load(void *args)
>> > +{
>> > +	struct chlg_reader_state *crs = args;
>> > +	struct obd_device *obd = crs->crs_obd;
>> > +	struct llog_ctxt *ctx = NULL;
>> > +	struct llog_handle *llh = NULL;
>> > +	int rc;
>> > +
>> > +	ctx = llog_get_context(obd, LLOG_CHANGELOG_REPL_CTXT);
>> > +	if (!ctx) {
>> > +		rc = -ENOENT;
>> > +		goto err_out;
>> > +	}
>> > +
>> > +	rc = llog_open(NULL, ctx, &llh, NULL, CHANGELOG_CATALOG,
>> > +		       LLOG_OPEN_EXISTS);
>> > +	if (rc) {
>> > +		CERROR("%s: fail to open changelog catalog: rc = %d\n",
>> > +		       obd->obd_name, rc);
>> > +		goto err_out;
>> > +	}
>> > +
>> > +	rc = llog_init_handle(NULL, llh, LLOG_F_IS_CAT | LLOG_F_EXT_JOBID,
>> > +			      NULL);
>> > +	if (rc) {
>> > +		CERROR("%s: fail to init llog handle: rc = %d\n",
>> > +		       obd->obd_name, rc);
>> > +		goto err_out;
>> > +	}
>> > +
>> > +	rc = llog_cat_process(NULL, llh, chlg_read_cat_process_cb, crs, 0, 0);
>> > +	if (rc < 0) {
>> > +		CERROR("%s: fail to process llog: rc = %d\n",
>> > +		       obd->obd_name, rc);
>> > +		goto err_out;
>> > +	}
>> > +
>> > +err_out:
>> > +	crs->crs_err = true;
>> > +	wake_up_all(&crs->crs_waitq_cons);
>> > +
>> > +	if (llh)
>> > +		llog_cat_close(NULL, llh);
>> > +
>> > +	if (ctx)
>> > +		llog_ctxt_put(ctx);
>> > +
>> > +	wait_event_idle(crs->crs_waitq_prod, crs->crs_closed);
>> > +	crs_free(crs);
>> > +	return rc;
>> > +}
>> > +
>> > +/**
>> > + * Read handler, dequeues records from the chlg_reader_state if any.
>> > + * No partial records are copied to userland so this function can return less
>> > + * data than required (short read).
>> > + *
>> > + * @param[in]   file   File pointer to the character device.
>> > + * @param[out]  buff   Userland buffer where to copy the records.
>> > + * @param[in]   count  Userland buffer size.
>> > + * @param[out]  ppos   File position, updated with the index number of the next
>> > + *		       record to read.
>> > + * @return number of copied bytes on success, negated error code on failure.
>> > + */
>> > +static ssize_t chlg_read(struct file *file, char __user *buff, size_t count,
>> > +			 loff_t *ppos)
>> > +{
>> > +	struct chlg_reader_state *crs = file->private_data;
>> > +	struct chlg_rec_entry *rec;
>> > +	struct chlg_rec_entry *tmp;
>> > +	ssize_t  written_total = 0;
>> > +	LIST_HEAD(consumed);
>> > +
>> > +	if (file->f_flags & O_NONBLOCK && crs->crs_rec_count == 0)
>> > +		return -EAGAIN;
>> > +
>> > +	wait_event_idle(crs->crs_waitq_cons,
>> > +			crs->crs_rec_count > 0 || crs->crs_eof || crs->crs_err);
>> > +
>> > +	mutex_lock(&crs->crs_lock);
>> > +	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage) {
>> > +		if (written_total + rec->enq_length > count)
>> > +			break;
>> > +
>> > +		if (copy_to_user(buff, rec->enq_record, rec->enq_length)) {
>> > +			if (written_total == 0)
>> > +				written_total = -EFAULT;
>> > +			break;
>> > +		}
>> > +
>> > +		buff += rec->enq_length;
>> > +		written_total += rec->enq_length;
>> > +
>> > +		crs->crs_rec_count--;
>> > +		list_move_tail(&rec->enq_linkage, &consumed);
>> > +
>> > +		crs->crs_start_offset = rec->enq_record->cr_index + 1;
>> > +	}
>> > +	mutex_unlock(&crs->crs_lock);
>> > +
>> > +	if (written_total > 0)
>> > +		wake_up_all(&crs->crs_waitq_prod);
>> > +
>> > +	list_for_each_entry_safe(rec, tmp, &consumed, enq_linkage)
>> > +		enq_record_delete(rec);
>> > +
>> > +	*ppos = crs->crs_start_offset;
>> > +
>> > +	return written_total;
>> > +}
>> > +
>> > +/**
>> > + * Jump to a given record index. Helper for chlg_llseek().
>> > + *
>> > + * @param[in,out]  crs     Internal reader state.
>> > + * @param[in]      offset  Desired offset (index record).
>> > + * @return 0 on success, negated error code on failure.
>> > + */
>> > +static int chlg_set_start_offset(struct chlg_reader_state *crs, u64 offset)
>> > +{
>> > +	struct chlg_rec_entry *rec;
>> > +	struct chlg_rec_entry *tmp;
>> > +
>> > +	mutex_lock(&crs->crs_lock);
>> > +	if (offset < crs->crs_start_offset) {
>> > +		mutex_unlock(&crs->crs_lock);
>> > +		return -ERANGE;
>> > +	}
>> > +
>> > +	crs->crs_start_offset = offset;
>> > +	list_for_each_entry_safe(rec, tmp, &crs->crs_rec_queue, enq_linkage) {
>> > +		struct changelog_rec *cr = rec->enq_record;
>> > +
>> > +		if (cr->cr_index >= crs->crs_start_offset)
>> > +			break;
>> > +
>> > +		crs->crs_rec_count--;
>> > +		enq_record_delete(rec);
>> > +	}
>> > +
>> > +	mutex_unlock(&crs->crs_lock);
>> > +	wake_up_all(&crs->crs_waitq_prod);
>> > +	return 0;
>> > +}
>> > +
>> > +/**
>> > + * Move read pointer to a certain record index, encoded as an offset.
>> > + *
>> > + * @param[in,out] file   File pointer to the changelog character device
>> > + * @param[in]	  off    Offset to skip, actually a record index, not byte count
>> > + * @param[in]	  whence Relative/Absolute interpretation of the offset
>> > + * @return the resulting position on success or negated error code on failure.
>> > + */
>> > +static loff_t chlg_llseek(struct file *file, loff_t off, int whence)
>> > +{
>> > +	struct chlg_reader_state *crs = file->private_data;
>> > +	loff_t pos;
>> > +	int rc;
>> > +
>> > +	switch (whence) {
>> > +	case SEEK_SET:
>> > +		pos = off;
>> > +		break;
>> > +	case SEEK_CUR:
>> > +		pos = file->f_pos + off;
>> > +		break;
>> > +	case SEEK_END:
>> > +	default:
>> > +		return -EINVAL;
>> > +	}
>> > +
>> > +	/* We cannot go backward */
>> > +	if (pos < file->f_pos)
>> > +		return -EINVAL;
>> > +
>> > +	rc = chlg_set_start_offset(crs, pos);
>> > +	if (rc != 0)
>> > +		return rc;
>> > +
>> > +	file->f_pos = pos;
>> > +	return pos;
>> > +}
>> > +
>> > +/**
>> > + * Clear record range for a given changelog reader.
>> > + *
>> > + * @param[in]  crs     Current internal state.
>> > + * @param[in]  reader  Changelog reader ID (cl1, cl2...)
>> > + * @param[in]  record  Record index up to which to clear
>> > + * @return 0 on success, negated error code on failure.
>> > + */
>> > +static int chlg_clear(struct chlg_reader_state *crs, u32 reader, u64 record)
>> > +{
>> > +	struct obd_device *obd = crs->crs_obd;
>> > +	struct changelog_setinfo cs  = {
>> > +		.cs_recno = record,
>> > +		.cs_id    = reader
>> > +	};
>> > +
>> > +	return obd_set_info_async(NULL, obd->obd_self_export,
>> > +				  strlen(KEY_CHANGELOG_CLEAR),
>> > +				  KEY_CHANGELOG_CLEAR, sizeof(cs), &cs, NULL);
>> > +}
>> > +
>> > +/** Maximum changelog control command size */
>> > +#define CHLG_CONTROL_CMD_MAX	64
>> > +
>> > +/**
>> > + * Handle writes() into the changelog character device. Write() can be used
>> > + * to request special control operations.
>> > + *
>> > + * @param[in]  file  File pointer to the changelog character device
>> > + * @param[in]  buff  User supplied data (written data)
>> > + * @param[in]  count Number of written bytes
>> > + * @param[in]  off   (unused)
>> > + * @return number of written bytes on success, negated error code on failure.
>> > + */
>> > +static ssize_t chlg_write(struct file *file, const char __user *buff,
>> > +			  size_t count, loff_t *off)
>> > +{
>> > +	struct chlg_reader_state *crs = file->private_data;
>> > +	char *kbuf;
>> > +	u64 record;
>> > +	u32 reader;
>> > +	int rc = 0;
>> > +
>> > +	if (count > CHLG_CONTROL_CMD_MAX)
>> > +		return -EINVAL;
>> > +
>> > +	kbuf = kzalloc(CHLG_CONTROL_CMD_MAX, GFP_KERNEL);
>> > +	if (!kbuf)
>> > +		return -ENOMEM;
>> > +
>> > +	if (copy_from_user(kbuf, buff, count)) {
>> > +		rc = -EFAULT;
>> > +		goto out_kbuf;
>> > +	}
>> > +
>> > +	kbuf[CHLG_CONTROL_CMD_MAX - 1] = '\0';
>> > +
>> > +	if (sscanf(kbuf, "clear:cl%u:%llu", &reader, &record) == 2)
>> > +		rc = chlg_clear(crs, reader, record);
>> > +	else
>> > +		rc = -EINVAL;
>> > +
>> > +out_kbuf:
>> > +	kfree(kbuf);
>> > +	return rc < 0 ? rc : count;
>> > +}
>> > +
>> > +/**
>> > + * Find the OBD device associated with a changelog character device.
>> > + * @param[in]  cdev  character device instance descriptor
>> > + * @return corresponding OBD device or NULL if none was found.
>> > + */
>> > +static struct obd_device *chlg_obd_get(dev_t cdev)
>> > +{
>> > +	int minor = MINOR(cdev);
>> > +	struct obd_device *obd = NULL;
>> > +	struct chlg_registered_dev *curr;
>> > +
>> > +	mutex_lock(&chlg_registered_dev_lock);
>> > +	list_for_each_entry(curr, &chlg_registered_devices, ced_link) {
>> > +		if (curr->ced_misc.minor == minor) {
>> > +			/* take the first available OBD device attached */
>> > +			obd = list_first_entry(&curr->ced_obds,
>> > +					       struct obd_device,
>> > +					       u.cli.cl_chg_dev_linkage);
>> > +			break;
>> > +		}
>> > +	}
>> > +	mutex_unlock(&chlg_registered_dev_lock);
>> > +	return obd;
>> > +}
>> > +
>> > +/**
>> > + * Open handler, initialize internal CRS state and spawn prefetch thread if
>> > + * needed.
>> > + * @param[in]  inode  Inode struct for the open character device.
>> > + * @param[in]  file   Corresponding file pointer.
>> > + * @return 0 on success, negated error code on failure.
>> > + */
>> > +static int chlg_open(struct inode *inode, struct file *file)
>> > +{
>> > +	struct chlg_reader_state *crs;
>> > +	struct obd_device *obd = chlg_obd_get(inode->i_rdev);
>> > +	struct task_struct *task;
>> > +	int rc;
>> > +
>> > +	if (!obd)
>> > +		return -ENODEV;
>> > +
>> > +	crs = kzalloc(sizeof(*crs), GFP_KERNEL);
>> > +	if (!crs)
>> > +		return -ENOMEM;
>> > +
>> > +	crs->crs_obd = obd;
>> > +	crs->crs_err = false;
>> > +	crs->crs_eof = false;
>> > +	crs->crs_closed = false;
>> > +
>> > +	mutex_init(&crs->crs_lock);
>> > +	INIT_LIST_HEAD(&crs->crs_rec_queue);
>> > +	init_waitqueue_head(&crs->crs_waitq_prod);
>> > +	init_waitqueue_head(&crs->crs_waitq_cons);
>> > +
>> > +	if (file->f_mode & FMODE_READ) {
>> > +		task = kthread_run(chlg_load, crs, "chlg_load_thread");
>> > +		if (IS_ERR(task)) {
>> > +			rc = PTR_ERR(task);
>> > +			CERROR("%s: cannot start changelog thread: rc = %d\n",
>> > +			       obd->obd_name, rc);
>> > +			goto err_crs;
>> > +		}
>> > +	}
>> > +
>> > +	file->private_data = crs;
>> > +	return 0;
>> > +
>> > +err_crs:
>> > +	kfree(crs);
>> > +	return rc;
>> > +}
>> > +
>> > +/**
>> > + * Close handler, release resources.
>> > + *
>> > + * @param[in]  inode  Inode struct for the open character device.
>> > + * @param[in]  file   Corresponding file pointer.
>> > + * @return 0 on success, negated error code on failure.
>> > + */
>> > +static int chlg_release(struct inode *inode, struct file *file)
>> > +{
>> > +	struct chlg_reader_state *crs = file->private_data;
>> > +
>> > +	if (file->f_mode & FMODE_READ) {
>> > +		crs->crs_closed = true;
>> > +		wake_up_all(&crs->crs_waitq_prod);
>> > +	} else {
>> > +		/* No producer thread, release resource ourselves */
>> > +		crs_free(crs);
>> > +	}
>> > +	return 0;
>> > +}
>> > +
>> > +/**
>> > + * Poll handler, indicates whether the device is readable (new records) and
>> > + * writable (always).
>> > + *
>> > + * @param[in]  file   Device file pointer.
>> > + * @param[in]  wait   (opaque)
>> > + * @return combination of the poll status flags.
>> > + */
>> > +static unsigned int chlg_poll(struct file *file, poll_table *wait)
>> > +{
>> > +	struct chlg_reader_state *crs  = file->private_data;
>> > +	unsigned int mask = 0;
>> > +
>> > +	mutex_lock(&crs->crs_lock);
>> > +	poll_wait(file, &crs->crs_waitq_cons, wait);
>> > +	if (crs->crs_rec_count > 0)
>> > +		mask |= POLLIN | POLLRDNORM;
>> > +	if (crs->crs_err)
>> > +		mask |= POLLERR;
>> > +	if (crs->crs_eof)
>> > +		mask |= POLLHUP;
>> > +	mutex_unlock(&crs->crs_lock);
>> > +	return mask;
>> > +}
>> > +
>> > +static const struct file_operations chlg_fops = {
>> > +	.owner		= THIS_MODULE,
>> > +	.llseek		= chlg_llseek,
>> > +	.read		= chlg_read,
>> > +	.write		= chlg_write,
>> > +	.open		= chlg_open,
>> > +	.release	= chlg_release,
>> > +	.poll		= chlg_poll,
>> > +};
>> > +
>> > +/**
>> > + * This uses obd_name of the form: "testfs-MDT0000-mdc-ffff88006501600"
>> > + * and returns a name of the form: "changelog-testfs-MDT0000".
>> > + */
>> > +static void get_chlg_name(char *name, size_t name_len, struct obd_device *obd)
>> > +{
>> > +	int i;
>> > +
>> > +	snprintf(name, name_len, "changelog-%s", obd->obd_name);
>> > +
>> > +	/* Find the 2nd '-' from the end and truncate on it */
>> > +	for (i = 0; i < 2; i++) {
>> > +		char *p = strrchr(name, '-');
>> > +
>> > +		if (!p)
>> > +			return;
>> > +		*p = '\0';
>> > +	}
>> > +}
>> > +
>> > +/**
>> > + * Find a changelog character device by name.
>> > + * All devices registered during MDC setup are listed in a global list with
>> > + * their names attached.
>> > + */
>> > +static struct chlg_registered_dev *
>> > +chlg_registered_dev_find_by_name(const char *name)
>> > +{
>> > +	struct chlg_registered_dev *dit;
>> > +
>> > +	list_for_each_entry(dit, &chlg_registered_devices, ced_link)
>> > +		if (strcmp(name, dit->ced_name) == 0)
>> > +			return dit;
>> > +	return NULL;
>> > +}
>> > +
>> > +/**
>> > + * Find chlg_registered_dev structure for a given OBD device.
>> > + * This is bad O(n^2) but for each filesystem:
>> > + *   - N is # of MDTs times # of mount points
>> > + *   - this only runs at shutdown
>> > + */
>> > +static struct chlg_registered_dev *
>> > +chlg_registered_dev_find_by_obd(const struct obd_device *obd)
>> > +{
>> > +	struct chlg_registered_dev *dit;
>> > +	struct obd_device *oit;
>> > +
>> > +	list_for_each_entry(dit, &chlg_registered_devices, ced_link)
>> > +		list_for_each_entry(oit, &dit->ced_obds,
>> > +				    u.cli.cl_chg_dev_linkage)
>> > +			if (oit == obd)
>> > +				return dit;
>> > +	return NULL;
>> > +}
>> > +
>> > +/**
>> > + * Changelog character device initialization.
>> > + * Register a misc character device with a dynamic minor number, under a name
>> > + * of the form: 'changelog-fsname-MDTxxxx'. Reference this OBD device with it.
>> > + *
>> > + * @param[in] obd  This MDC obd_device.
>> > + * @return 0 on success, negated error code on failure.
>> > + */
>> > +int mdc_changelog_cdev_init(struct obd_device *obd)
>> > +{
>> > +	struct chlg_registered_dev *exist;
>> > +	struct chlg_registered_dev *entry;
>> > +	int rc;
>> > +
>> > +	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
>> > +	if (!entry)
>> > +		return -ENOMEM;
>> > +
>> > +	get_chlg_name(entry->ced_name, sizeof(entry->ced_name), obd);
>> > +
>> > +	entry->ced_misc.minor = MISC_DYNAMIC_MINOR;
>> > +	entry->ced_misc.name  = entry->ced_name;
>> > +	entry->ced_misc.fops  = &chlg_fops;
>> > +
>> > +	kref_init(&entry->ced_refs);
>> > +	INIT_LIST_HEAD(&entry->ced_obds);
>> > +	INIT_LIST_HEAD(&entry->ced_link);
>> > +
>> > +	mutex_lock(&chlg_registered_dev_lock);
>> > +	exist = chlg_registered_dev_find_by_name(entry->ced_name);
>> > +	if (exist) {
>> > +		kref_get(&exist->ced_refs);
>> > +		list_add_tail(&obd->u.cli.cl_chg_dev_linkage, &exist->ced_obds);
>> > +		rc = 0;
>> > +		goto out_unlock;
>> > +	}
>> > +
>> > +	/* Register new character device */
>> > +	rc = misc_register(&entry->ced_misc);
>> > +	if (rc != 0)
>> > +		goto out_unlock;
>> > +
>> > +	list_add_tail(&obd->u.cli.cl_chg_dev_linkage, &entry->ced_obds);
>> > +	list_add_tail(&entry->ced_link, &chlg_registered_devices);
>> > +
>> > +	entry = NULL;	/* prevent it from being freed below */
>> > +
>> > +out_unlock:
>> > +	mutex_unlock(&chlg_registered_dev_lock);
>> > +	kfree(entry);
>> > +	return rc;
>> > +}
>> > +
>> > +/**
>> > + * Deregister a changelog character device whose refcount has reached zero.
>> > + */
>> > +static void chlg_dev_clear(struct kref *kref)
>> > +{
>> > +	struct chlg_registered_dev *entry = container_of(kref,
>> > +							 struct chlg_registered_dev,
>> > +							 ced_refs);
>> > +	list_del(&entry->ced_link);
>> > +	misc_deregister(&entry->ced_misc);
>> > +	kfree(entry);
>> > +}
>> > +
>> > +/**
>> > + * Release OBD, decrease reference count of the corresponding changelog device.
>> > + */
>> > +void mdc_changelog_cdev_finish(struct obd_device *obd)
>> > +{
>> > +	struct chlg_registered_dev *dev = chlg_registered_dev_find_by_obd(obd);
>> > +
>> > +	mutex_lock(&chlg_registered_dev_lock);
>> > +	list_del_init(&obd->u.cli.cl_chg_dev_linkage);
>> > +	kref_put(&dev->ced_refs, chlg_dev_clear);
>> > +	mutex_unlock(&chlg_registered_dev_lock);
>> > +}
>> > diff --git a/drivers/staging/lustre/lustre/mdc/mdc_internal.h b/drivers/staging/lustre/lustre/mdc/mdc_internal.h
>> > index 941a896..6da9046 100644
>> > --- a/drivers/staging/lustre/lustre/mdc/mdc_internal.h
>> > +++ b/drivers/staging/lustre/lustre/mdc/mdc_internal.h
>> > @@ -129,6 +129,10 @@ enum ldlm_mode mdc_lock_match(struct obd_export *exp, __u64 flags,
>> >  			      enum ldlm_mode mode,
>> >  			      struct lustre_handle *lockh);
>> >  
>> > +int mdc_changelog_cdev_init(struct obd_device *obd);
>> > +
>> > +void mdc_changelog_cdev_finish(struct obd_device *obd);
>> > +
>> >  static inline int mdc_prep_elc_req(struct obd_export *exp,
>> >  				   struct ptlrpc_request *req, int opc,
>> >  				   struct list_head *cancels, int count)
>> > diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
>> > index 8f8e3d2..3692b1c 100644
>> > --- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
>> > +++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
>> > @@ -35,7 +35,6 @@
>> >  
>> >  # include <linux/module.h>
>> >  # include <linux/pagemap.h>
>> > -# include <linux/miscdevice.h>
>> >  # include <linux/init.h>
>> >  # include <linux/utsname.h>
>> >  # include <linux/file.h>
>> > @@ -1810,174 +1809,6 @@ static int mdc_ioc_hsm_request(struct obd_export *exp,
>> >  	return rc;
>> >  }
>> >  
>> > -static struct kuc_hdr *changelog_kuc_hdr(char *buf, size_t len, u32 flags)
>> > -{
>> > -	struct kuc_hdr *lh = (struct kuc_hdr *)buf;
>> > -
>> > -	LASSERT(len <= KUC_CHANGELOG_MSG_MAXSIZE);
>> > -
>> > -	lh->kuc_magic = KUC_MAGIC;
>> > -	lh->kuc_transport = KUC_TRANSPORT_CHANGELOG;
>> > -	lh->kuc_flags = flags;
>> > -	lh->kuc_msgtype = CL_RECORD;
>> > -	lh->kuc_msglen = len;
>> > -	return lh;
>> > -}
>> > -
>> > -struct changelog_show {
>> > -	__u64		cs_startrec;
>> > -	enum changelog_send_flag	cs_flags;
>> > -	struct file	*cs_fp;
>> > -	char		*cs_buf;
>> > -	struct obd_device *cs_obd;
>> > -};
>> > -
>> > -static inline char *cs_obd_name(struct changelog_show *cs)
>> > -{
>> > -	return cs->cs_obd->obd_name;
>> > -}
>> > -
>> > -static int changelog_kkuc_cb(const struct lu_env *env, struct llog_handle *llh,
>> > -			     struct llog_rec_hdr *hdr, void *data)
>> > -{
>> > -	struct changelog_show *cs = data;
>> > -	struct llog_changelog_rec *rec = (struct llog_changelog_rec *)hdr;
>> > -	struct kuc_hdr *lh;
>> > -	size_t len;
>> > -	int rc;
>> > -
>> > -	if (rec->cr_hdr.lrh_type != CHANGELOG_REC) {
>> > -		rc = -EINVAL;
>> > -		CERROR("%s: not a changelog rec %x/%d: rc = %d\n",
>> > -		       cs_obd_name(cs), rec->cr_hdr.lrh_type,
>> > -		       rec->cr.cr_type, rc);
>> > -		return rc;
>> > -	}
>> > -
>> > -	if (rec->cr.cr_index < cs->cs_startrec) {
>> > -		/* Skip entries earlier than what we are interested in */
>> > -		CDEBUG(D_HSM, "rec=%llu start=%llu\n",
>> > -		       rec->cr.cr_index, cs->cs_startrec);
>> > -		return 0;
>> > -	}
>> > -
>> > -	CDEBUG(D_HSM, "%llu %02d%-5s %llu 0x%x t=" DFID " p=" DFID
>> > -		" %.*s\n", rec->cr.cr_index, rec->cr.cr_type,
>> > -		changelog_type2str(rec->cr.cr_type), rec->cr.cr_time,
>> > -		rec->cr.cr_flags & CLF_FLAGMASK,
>> > -		PFID(&rec->cr.cr_tfid), PFID(&rec->cr.cr_pfid),
>> > -		rec->cr.cr_namelen, changelog_rec_name(&rec->cr));
>> > -
>> > -	len = sizeof(*lh) + changelog_rec_size(&rec->cr) + rec->cr.cr_namelen;
>> > -
>> > -	/* Set up the message */
>> > -	lh = changelog_kuc_hdr(cs->cs_buf, len, cs->cs_flags);
>> > -	memcpy(lh + 1, &rec->cr, len - sizeof(*lh));
>> > -
>> > -	rc = libcfs_kkuc_msg_put(cs->cs_fp, lh);
>> > -	CDEBUG(D_HSM, "kucmsg fp %p len %zu rc %d\n", cs->cs_fp, len, rc);
>> > -
>> > -	return rc;
>> > -}
>> > -
>> > -static int mdc_changelog_send_thread(void *csdata)
>> > -{
>> > -	enum llog_flag flags = LLOG_F_IS_CAT;
>> > -	struct changelog_show *cs = csdata;
>> > -	struct llog_ctxt *ctxt = NULL;
>> > -	struct llog_handle *llh = NULL;
>> > -	struct kuc_hdr *kuch;
>> > -	int rc;
>> > -
>> > -	CDEBUG(D_HSM, "changelog to fp=%p start %llu\n",
>> > -	       cs->cs_fp, cs->cs_startrec);
>> > -
>> > -	cs->cs_buf = kzalloc(KUC_CHANGELOG_MSG_MAXSIZE, GFP_NOFS);
>> > -	if (!cs->cs_buf) {
>> > -		rc = -ENOMEM;
>> > -		goto out;
>> > -	}
>> > -
>> > -	/* Set up the remote catalog handle */
>> > -	ctxt = llog_get_context(cs->cs_obd, LLOG_CHANGELOG_REPL_CTXT);
>> > -	if (!ctxt) {
>> > -		rc = -ENOENT;
>> > -		goto out;
>> > -	}
>> > -	rc = llog_open(NULL, ctxt, &llh, NULL, CHANGELOG_CATALOG,
>> > -		       LLOG_OPEN_EXISTS);
>> > -	if (rc) {
>> > -		CERROR("%s: fail to open changelog catalog: rc = %d\n",
>> > -		       cs_obd_name(cs), rc);
>> > -		goto out;
>> > -	}
>> > -
>> > -	if (cs->cs_flags & CHANGELOG_FLAG_JOBID)
>> > -		flags |= LLOG_F_EXT_JOBID;
>> > -
>> > -	rc = llog_init_handle(NULL, llh, flags, NULL);
>> > -	if (rc) {
>> > -		CERROR("llog_init_handle failed %d\n", rc);
>> > -		goto out;
>> > -	}
>> > -
>> > -	rc = llog_cat_process(NULL, llh, changelog_kkuc_cb, cs, 0, 0);
>> > -
>> > -	/* Send EOF no matter what our result */
>> > -	kuch = changelog_kuc_hdr(cs->cs_buf, sizeof(*kuch), cs->cs_flags);
>> > -	kuch->kuc_msgtype = CL_EOF;
>> > -	libcfs_kkuc_msg_put(cs->cs_fp, kuch);
>> > -
>> > -out:
>> > -	fput(cs->cs_fp);
>> > -	if (llh)
>> > -		llog_cat_close(NULL, llh);
>> > -	if (ctxt)
>> > -		llog_ctxt_put(ctxt);
>> > -	kfree(cs->cs_buf);
>> > -	kfree(cs);
>> > -	return rc;
>> > -}
>> > -
>> > -static int mdc_ioc_changelog_send(struct obd_device *obd,
>> > -				  struct ioc_changelog *icc)
>> > -{
>> > -	struct changelog_show *cs;
>> > -	struct task_struct *task;
>> > -	int rc;
>> > -
>> > -	/* Freed in mdc_changelog_send_thread */
>> > -	cs = kzalloc(sizeof(*cs), GFP_NOFS);
>> > -	if (!cs)
>> > -		return -ENOMEM;
>> > -
>> > -	cs->cs_obd = obd;
>> > -	cs->cs_startrec = icc->icc_recno;
>> > -	/* matching fput in mdc_changelog_send_thread */
>> > -	cs->cs_fp = fget(icc->icc_id);
>> > -	cs->cs_flags = icc->icc_flags;
>> > -
>> > -	/*
>> > -	 * New thread because we should return to user app before
>> > -	 * writing into our pipe
>> > -	 */
>> > -	task = kthread_run(mdc_changelog_send_thread, cs,
>> > -			   "mdc_clg_send_thread");
>> > -	if (IS_ERR(task)) {
>> > -		rc = PTR_ERR(task);
>> > -		CERROR("%s: can't start changelog thread: rc = %d\n",
>> > -		       cs_obd_name(cs), rc);
>> > -		kfree(cs);
>> > -	} else {
>> > -		rc = 0;
>> > -		CDEBUG(D_HSM, "%s: started changelog thread\n",
>> > -		       cs_obd_name(cs));
>> > -	}
>> > -
>> > -	CERROR("Failed to start changelog thread: %d\n", rc);
>> > -	return rc;
>> > -}
>> > -
>> >  static int mdc_ioc_hsm_ct_start(struct obd_export *exp,
>> >  				struct lustre_kernelcomm *lk);
>> >  
>> > @@ -2087,21 +1918,6 @@ static int mdc_iocontrol(unsigned int cmd, struct obd_export *exp, int len,
>> >  		return -EINVAL;
>> >  	}
>> >  	switch (cmd) {
>> > -	case OBD_IOC_CHANGELOG_SEND:
>> > -		rc = mdc_ioc_changelog_send(obd, karg);
>> > -		goto out;
>> > -	case OBD_IOC_CHANGELOG_CLEAR: {
>> > -		struct ioc_changelog *icc = karg;
>> > -		struct changelog_setinfo cs = {
>> > -			.cs_recno = icc->icc_recno,
>> > -			.cs_id = icc->icc_id
>> > -		};
>> > -
>> > -		rc = obd_set_info_async(NULL, exp, strlen(KEY_CHANGELOG_CLEAR),
>> > -					KEY_CHANGELOG_CLEAR, sizeof(cs), &cs,
>> > -					NULL);
>> > -		goto out;
>> > -	}
>> >  	case OBD_IOC_FID2PATH:
>> >  		rc = mdc_ioc_fid2path(exp, karg);
>> >  		goto out;
>> > @@ -2670,12 +2486,22 @@ static int mdc_setup(struct obd_device *obd, struct lustre_cfg *cfg)
>> >  
>> >  	rc = mdc_llog_init(obd);
>> >  	if (rc) {
>> > -		CERROR("failed to setup llogging subsystems\n");
>> > +		CERROR("%s: failed to setup llogging subsystems: rc = %d\n",
>> > +		       obd->obd_name, rc);
>> >  		goto err_llog_cleanup;
>> >  	}
>> >  
>> > +	rc = mdc_changelog_cdev_init(obd);
>> > +	if (rc) {
>> > +		CERROR("%s: failed to setup changelog char device: rc = %d\n",
>> > +		       obd->obd_name, rc);
>> > +		goto err_changelog_cleanup;
>> > +	}
>> > +
>> >  	return 0;
>> >  
>> > +err_changelog_cleanup:
>> > +	mdc_llog_finish(obd);
>> >  err_llog_cleanup:
>> >  	ldebugfs_free_md_stats(obd);
>> >  	ptlrpc_lprocfs_unregister_obd(obd);
>> > @@ -2714,6 +2540,8 @@ static int mdc_precleanup(struct obd_device *obd)
>> >  	if (obd->obd_type->typ_refcnt <= 1)
>> >  		libcfs_kkuc_group_rem(0, KUC_GRP_HSM);
>> >  
>> > +	mdc_changelog_cdev_finish(obd);
>> > +
>> >  	obd_cleanup_client_import(obd);
>> >  	ptlrpc_lprocfs_unregister_obd(obd);
>> >  	lprocfs_obd_cleanup(obd);
>> > -- 
>> > 1.8.3.1
>> 
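A minimal userspace sketch of the naming and control conventions in the patch above, assuming CHLG_CONTROL_CMD_MAX stays at 64. The helpers chlg_dev_name(), chlg_clear_cmd() and chlg_parse_clear() are hypothetical: they mirror the truncation logic of get_chlg_name() and the sscanf() format used by chlg_write(), and are not part of any Lustre API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Mirrored from the patch: chlg_write() rejects writes longer than 64 bytes
 * and forces kbuf[63] to NUL, so a usable command is at most 63 characters.
 */
#define CHLG_CONTROL_CMD_MAX	64

/*
 * Userspace mirror of get_chlg_name():
 * "testfs-MDT0000-mdc-ffff..." -> "changelog-testfs-MDT0000",
 * by truncating at the last '-' twice.
 */
static void chlg_dev_name(char *name, size_t name_len, const char *obd_name)
{
	int i;

	assert(name_len > 0);
	snprintf(name, name_len, "changelog-%s", obd_name);
	for (i = 0; i < 2; i++) {
		char *p = strrchr(name, '-');

		if (!p)
			return;
		*p = '\0';
	}
}

/*
 * Build the "clear:cl<reader>:<record>" command that chlg_write() parses.
 * Returns the command length, or -1 if it does not fit.
 */
static int chlg_clear_cmd(char *buf, size_t len, unsigned int reader,
			  unsigned long long record)
{
	int n = snprintf(buf, len, "clear:cl%u:%llu", reader, record);

	if (n < 0 || (size_t)n >= len || n > CHLG_CONTROL_CMD_MAX - 1)
		return -1;
	return n;
}

/* Parse a command the same way chlg_write() does; 0 on success, -1 if not. */
static int chlg_parse_clear(const char *cmd, unsigned int *reader,
			    unsigned long long *record)
{
	return sscanf(cmd, "clear:cl%u:%llu", reader, record) == 2 ? 0 : -1;
}
```

A consumer would write() the string produced by chlg_clear_cmd() to the device node, for example /dev/changelog-testfs-MDT0000 (assuming udev creates the node under the misc device name). Because chlg_write() only checks that sscanf() matched both fields, trailing text such as a newline is tolerated.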
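The read and seek semantics of the new device are easy to misread from the diff: file offsets are record indices rather than byte counts, reads return only whole records, and seeking backward fails. The in-memory model below is a sketch of that behaviour; struct model_rec, struct model_reader, model_read() and model_seek() are hypothetical names, and the model deliberately leaves out locking, blocking and the prefetch thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one entry of the kernel's crs_rec_queue. */
struct model_rec {
	unsigned long long index;	/* cr_index of the record */
	size_t len;			/* enq_length in the kernel */
	const char *data;
};

struct model_reader {
	const struct model_rec *recs;
	size_t nrecs;
	size_t head;			/* first unconsumed record */
	unsigned long long start_offset;	/* mirrors crs_start_offset */
};

/* Copy whole records only, like chlg_read(); returns bytes copied. */
static size_t model_read(struct model_reader *r, char *buf, size_t count)
{
	size_t written = 0;

	assert(r->head <= r->nrecs);
	while (r->head < r->nrecs) {
		const struct model_rec *rec = &r->recs[r->head];

		if (written + rec->len > count)
			break;		/* short read: never split a record */
		memcpy(buf + written, rec->data, rec->len);
		written += rec->len;
		r->start_offset = rec->index + 1;	/* next index to read */
		r->head++;
	}
	return written;
}

/* Forward-only jump to a record index, like chlg_set_start_offset(). */
static int model_seek(struct model_reader *r, unsigned long long offset)
{
	if (offset < r->start_offset)
		return -1;	/* the kernel helper returns -ERANGE here */
	r->start_offset = offset;
	/* Drop queued records below the new offset. */
	while (r->head < r->nrecs && r->recs[r->head].index < offset)
		r->head++;
	return 0;
}
```

One consequence visible in both the model and chlg_read(): if the buffer is smaller than the next queued record, the read returns 0 and consumes nothing, which a naive reader could mistake for end of file, so buffers should be sized for the largest record.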

end of thread

Thread overview: 69+ messages
2018-10-14 18:57 [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 James Simmons
2018-10-14 18:57 ` [lustre-devel] [PATCH 01/28] lustre: osc: osc_extent_tree_dump0() implementation is suboptimal James Simmons
2018-10-14 18:57 ` [lustre-devel] [PATCH 02/28] lustre: llite: Read ahead should return pages read James Simmons
2018-10-14 18:57 ` [lustre-devel] [PATCH 03/28] lustre: ptlrpc: missing barrier before wake_up James Simmons
2018-10-17 22:43   ` NeilBrown
2018-10-21 22:48     ` James Simmons
2018-10-14 18:57 ` [lustre-devel] [PATCH 04/28] lustre: ptlrpc: Do not assert when bd_nob_transferred != 0 James Simmons
2018-10-17 23:13   ` NeilBrown
2018-10-21 22:44     ` James Simmons
2018-10-22  3:26       ` NeilBrown
2018-11-04 21:29         ` James Simmons
2018-11-04 23:59           ` NeilBrown
2018-10-14 18:57 ` [lustre-devel] [PATCH 05/28] lustre: uapi: add back LUSTRE_MAXFSNAME to lustre_user.h James Simmons
2018-10-14 18:57 ` [lustre-devel] [PATCH 06/28] lustre: ldlm: ELC shouldn't wait on lock flush James Simmons
2018-10-17 23:20   ` NeilBrown
2018-10-20 17:09     ` James Simmons
2018-10-22  3:44       ` NeilBrown
2018-10-14 18:57 ` [lustre-devel] [PATCH 07/28] lustre: llite: pipeline readahead better with large I/O James Simmons
2018-10-14 18:57 ` [lustre-devel] [PATCH 08/28] lustre: hsm: add kkuc before sending registration RPCs James Simmons
2018-10-14 18:57 ` [lustre-devel] [PATCH 09/28] lustre: mdc: improve mdc_enqueue() error message James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 10/28] lustre: llite: Update i_nlink on unlink James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 11/28] lustre: llite: use security context if it's enabled in the kernel James Simmons
2018-10-17 23:34   ` NeilBrown
2018-10-20 17:49     ` James Simmons
2018-10-22  3:47       ` NeilBrown
2018-10-23 23:07         ` James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 12/28] lustre: ptlrpc: do not wakeup every second James Simmons
2018-10-29  0:03   ` NeilBrown
2018-10-29  1:35     ` Patrick Farrell
2018-10-29  2:41       ` NeilBrown
2018-10-29  3:42         ` James Simmons
2018-10-29 14:17           ` Patrick Farrell
2018-11-04 20:53         ` James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 13/28] lustre: ldlm: check lock cancellation in ldlm_cli_cancel() James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 14/28] lustre: ptlrpc: handle case of epp_free_pages <= PTLRPC_MAX_BRW_PAGES James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 15/28] lustre: llite: fix for stat under kthread and X86_X32 James Simmons
2018-10-18  1:48   ` NeilBrown
2018-10-22  3:58     ` NeilBrown
2018-11-04 21:35       ` James Simmons
2018-11-05  0:03         ` NeilBrown
2018-10-14 18:58 ` [lustre-devel] [PATCH 16/28] lustre: statahead: missing barrier before wake_up James Simmons
2018-10-18  2:00   ` NeilBrown
2018-10-21 22:52     ` James Simmons
2018-10-22  4:04       ` NeilBrown
2018-11-04 20:52         ` James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 17/28] lustre: ldlm: Make lru clear always discard read lock pages James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 18/28] lustre: mdc: expose changelog through char devices James Simmons
2018-10-30  6:41   ` NeilBrown
2018-11-04 21:31     ` James Simmons
2018-11-05  0:13       ` NeilBrown
2018-10-14 18:58 ` [lustre-devel] [PATCH 19/28] lustre: uapi: add missing headers in lustre UAPI headers James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 20/28] lustre: obdclass: deprecate OBD_GET_VERSION ioctl James Simmons
2018-10-18  2:12   ` NeilBrown
2018-10-20 18:52     ` James Simmons
2018-10-22  4:08       ` NeilBrown
2018-10-14 18:58 ` [lustre-devel] [PATCH 21/28] lustre: llite: enhance vvp_dev data structure naming James Simmons
2018-10-18  2:15   ` NeilBrown
2018-10-20 18:55     ` James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 22/28] lustre: clio: update spare bit handling James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 23/28] lustre: llog: fix EOF handling in llog_client_next_block() James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 24/28] lustre: llite: IO accounting of page read James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 25/28] lustre: llite: disable statahead if starting statahead fail James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 26/28] lustre: mdc: set correct body eadatasize for getxattr() James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 27/28] lustre: llite: control concurrent statahead instances James Simmons
2018-10-14 18:58 ` [lustre-devel] [PATCH 28/28] lustre: llite: restore lld_nfs_dentry handling James Simmons
2018-10-22  4:36 ` [lustre-devel] [PATCH 00/28] lustre: more assorted fixes for lustre 2.10 NeilBrown
2018-10-23 22:34   ` [lustre-devel] [PATCH] lustre: lu_object: fix possible hang waiting for LCS_LEAVING NeilBrown
2018-10-29  3:31     ` James Simmons
2018-10-29  4:31       ` NeilBrown
