* [PATCH v7 0/9] ceph: size handling for the fscrypt
@ 2021-11-05 14:22 xiubli
  2021-11-05 14:22 ` [PATCH v7 1/9] libceph: add CEPH_OSD_OP_ASSERT_VER support xiubli
                   ` (9 more replies)
  0 siblings, 10 replies; 25+ messages in thread
From: xiubli @ 2021-11-05 14:22 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

This patch series is based on the "wip-fscrypt-fnames" branch in
repo https://github.com/ceph/ceph-client.git.

I have also picked up 5 patches from the "ceph-fscrypt-size-experimental"
branch in repo
https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git.

====

This approach follows the discussion on V1 and V2: the client passes
the encrypted contents of the last block to the MDS along with the
truncate request.

The last block is sent only when truncating to a smaller size and the
new size is not aligned to BLOCK SIZE.
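
As a rough illustration of the condition above, here is a minimal
userspace sketch. The helper name and the BLOCK_SIZE_SKETCH macro are
made up for this example (the value is assumed to match
CEPH_FSCRYPT_BLOCK_SIZE from patch 2/9); this is not code from the
series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match CEPH_FSCRYPT_BLOCK_SIZE (1 << 12) from patch 2/9. */
#define BLOCK_SIZE_SKETCH 4096ULL

/*
 * Hypothetical helper, not part of the series: returns true when a
 * truncate from old_size to new_size would need the encrypted last
 * block attached to the request.
 */
static bool truncate_needs_last_block(uint64_t old_size, uint64_t new_size)
{
	/* The mask arithmetic below only works for power-of-two sizes. */
	assert((BLOCK_SIZE_SKETCH & (BLOCK_SIZE_SKETCH - 1)) == 0);

	if (new_size >= old_size)
		return false;	/* not shrinking */

	/* Shrinking to a block boundary leaves no partial block behind. */
	return (new_size & (BLOCK_SIZE_SKETCH - 1)) != 0;
}
```

For example, shrinking from 8192 to 5000 bytes would need the last
block, while shrinking from 8192 to 4096 would not.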

The MDS side patch is raised in PR
https://github.com/ceph/ceph/pull/43588, which is also based on Jeff's
previous great work in PR https://github.com/ceph/ceph/pull/41284.

The MDS will use filer.write_trunc(), which can update and truncate
the file in one shot, instead of filer.truncate().

This assumes that kclient won't support the inline data feature, which
will be removed soon. For more detail, please see:
https://tracker.ceph.com/issues/52916

Changed in V7:
- Fixed the sparse check warnings.
- Removed the include/linux/ceph/crypto.h header file.

Changed in V6:
- Fixed the file hole bug, and updated the MDS side PR accordingly.
- Added object version support for sync read in #8.


Changed in V5:
- Rebase to "wip-fscrypt-fnames" branch in ceph-client.git repo.
- Pick up 5 patches from Jeff's "ceph-fscrypt-size-experimental" branch
  in linux.git repo.
- Add "i_truncate_pagecache_size" member support in ceph_inode_info
  struct. This is used to truncate the pagecache on the kclient side
  only, because "i_truncate_size" will always be aligned to BLOCK
  SIZE. In the fscrypt case we need to use the real size to truncate
  the pagecache.


Changed in V4:
- Retry the truncate request up to 20 times before failing it with -EAGAIN.
- Remove the "fill_last_block" label and move the code to else branch.
- Remove the #3 patch, which has already been sent out separately, in
  V3 series.
- Improve some comments in the code.


Changed in V3:
- Fix possible file corruption when another client updates the file
  just before the MDS acquires the xlock for the FILE lock.
- Flush the pagecache buffer before reading the last block when
  filling the truncate request.
- Some other minor fixes.



Jeff Layton (5):
  libceph: add CEPH_OSD_OP_ASSERT_VER support
  ceph: size handling for encrypted inodes in cap updates
  ceph: fscrypt_file field handling in MClientRequest messages
  ceph: get file size from fscrypt_file when present in inode traces
  ceph: handle fscrypt fields in cap messages from MDS

Xiubo Li (4):
  ceph: add __ceph_get_caps helper support
  ceph: add __ceph_sync_read helper support
  ceph: add object version support for sync read
  ceph: add truncate size handling support for fscrypt

 fs/ceph/caps.c                  | 136 ++++++++++++++----
 fs/ceph/crypto.h                |  25 ++++
 fs/ceph/dir.c                   |   3 +
 fs/ceph/file.c                  |  76 ++++++++--
 fs/ceph/inode.c                 | 244 +++++++++++++++++++++++++++++---
 fs/ceph/mds_client.c            |   9 +-
 fs/ceph/mds_client.h            |   2 +
 fs/ceph/super.h                 |  25 ++++
 include/linux/ceph/osd_client.h |   6 +-
 include/linux/ceph/rados.h      |   4 +
 net/ceph/osd_client.c           |   5 +
 11 files changed, 475 insertions(+), 60 deletions(-)

-- 
2.27.0



* [PATCH v7 1/9] libceph: add CEPH_OSD_OP_ASSERT_VER support
  2021-11-05 14:22 [PATCH v7 0/9] ceph: size handling for the fscrypt xiubli
@ 2021-11-05 14:22 ` xiubli
  2021-11-05 14:22 ` [PATCH v7 2/9] ceph: size handling for encrypted inodes in cap updates xiubli
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: xiubli @ 2021-11-05 14:22 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel

From: Jeff Layton <jlayton@kernel.org>

...and record the user_version in the reply in a new field in
ceph_osd_request, so we can populate the assert_ver appropriately.
Shuffle the fields a bit too so that the new field fits in an
existing hole on x86_64.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 include/linux/ceph/osd_client.h | 6 +++++-
 include/linux/ceph/rados.h      | 4 ++++
 net/ceph/osd_client.c           | 5 +++++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 83fa08a06507..7ee1684d3edc 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -145,6 +145,9 @@ struct ceph_osd_req_op {
 			u32 src_fadvise_flags;
 			struct ceph_osd_data osd_data;
 		} copy_from;
+		struct {
+			u64 ver;
+		} assert_ver;
 	};
 };
 
@@ -199,6 +202,7 @@ struct ceph_osd_request {
 	struct ceph_osd_client *r_osdc;
 	struct kref       r_kref;
 	bool              r_mempool;
+	bool		  r_linger;           /* don't resend on failure */
 	struct completion r_completion;       /* private to osd_client.c */
 	ceph_osdc_callback_t r_callback;
 
@@ -211,9 +215,9 @@ struct ceph_osd_request {
 	struct ceph_snap_context *r_snapc;    /* for writes */
 	struct timespec64 r_mtime;            /* ditto */
 	u64 r_data_offset;                    /* ditto */
-	bool r_linger;                        /* don't resend on failure */
 
 	/* internal */
+	u64 r_version;			      /* data version sent in reply */
 	unsigned long r_stamp;                /* jiffies, send or check time */
 	unsigned long r_start_stamp;          /* jiffies */
 	ktime_t r_start_latency;              /* ktime_t */
diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
index 43a7a1573b51..73c3efbec36c 100644
--- a/include/linux/ceph/rados.h
+++ b/include/linux/ceph/rados.h
@@ -523,6 +523,10 @@ struct ceph_osd_op {
 		struct {
 			__le64 cookie;
 		} __attribute__ ((packed)) notify;
+		struct {
+			__le64 unused;
+			__le64 ver;
+		} __attribute__ ((packed)) assert_ver;
 		struct {
 			__le64 offset, length;
 			__le64 src_offset;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index ff8624a7c964..f3a9af012123 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1038,6 +1038,10 @@ static u32 osd_req_encode_op(struct ceph_osd_op *dst,
 		dst->copy_from.src_fadvise_flags =
 			cpu_to_le32(src->copy_from.src_fadvise_flags);
 		break;
+	case CEPH_OSD_OP_ASSERT_VER:
+		dst->assert_ver.unused = cpu_to_le64(0);
+		dst->assert_ver.ver = cpu_to_le64(src->assert_ver.ver);
+		break;
 	default:
 		pr_err("unsupported osd opcode %s\n",
 			ceph_osd_op_name(src->op));
@@ -3763,6 +3767,7 @@ static void handle_reply(struct ceph_osd *osd, struct ceph_msg *msg)
 	 * one (type of) reply back.
 	 */
 	WARN_ON(!(m.flags & CEPH_OSD_FLAG_ONDISK));
+	req->r_version = m.user_version;
 	req->r_result = m.result ?: data_len;
 	finish_request(req);
 	mutex_unlock(&osd->lock);
-- 
2.27.0
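
For readers following the wire format, here is a minimal userspace
sketch of the 16-byte assert_ver payload added to struct ceph_osd_op in
the diff above (an unused __le64 followed by a __le64 version). The
helper names are hypothetical, and the rest of the op (opcode, flags,
payload length) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 64-bit value in little-endian byte order on any host. */
static void put_le64(uint8_t *dst, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		dst[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Hypothetical stand-in for the CEPH_OSD_OP_ASSERT_VER case in
 * osd_req_encode_op(): writes the (unused, ver) pair into buf and
 * returns the number of bytes written.
 */
static size_t encode_assert_ver(uint8_t *buf, size_t buflen, uint64_t ver)
{
	assert(buflen >= 16);
	put_le64(buf, 0);	/* "unused" field, always zero */
	put_le64(buf + 8, ver);	/* version carried by the op */
	return 16;
}
```

The version passed in would presumably be the r_version value recorded
from an earlier reply, as the commit message describes.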



* [PATCH v7 2/9] ceph: size handling for encrypted inodes in cap updates
  2021-11-05 14:22 [PATCH v7 0/9] ceph: size handling for the fscrypt xiubli
  2021-11-05 14:22 ` [PATCH v7 1/9] libceph: add CEPH_OSD_OP_ASSERT_VER support xiubli
@ 2021-11-05 14:22 ` xiubli
  2021-11-05 14:22 ` [PATCH v7 3/9] ceph: fscrypt_file field handling in MClientRequest messages xiubli
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: xiubli @ 2021-11-05 14:22 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel

From: Jeff Layton <jlayton@kernel.org>

Transmit the rounded-up size as the normal size, and fill out the
fscrypt_file field with the real file size.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c   | 43 +++++++++++++++++++++++++------------------
 fs/ceph/crypto.h |  4 ++++
 2 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 80f521dd7254..fc367f42536a 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1215,10 +1215,9 @@ struct cap_msg_args {
 	umode_t			mode;
 	bool			inline_data;
 	bool			wake;
+	bool			encrypted;
 	u32			fscrypt_auth_len;
-	u32			fscrypt_file_len;
 	u8			fscrypt_auth[sizeof(struct ceph_fscrypt_auth)]; // for context
-	u8			fscrypt_file[sizeof(u64)]; // for size
 };
 
 /* Marshal up the cap msg to the MDS */
@@ -1253,7 +1252,12 @@ static void encode_cap_msg(struct ceph_msg *msg, struct cap_msg_args *arg)
 	fc->ino = cpu_to_le64(arg->ino);
 	fc->snap_follows = cpu_to_le64(arg->follows);
 
-	fc->size = cpu_to_le64(arg->size);
+#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
+	if (arg->encrypted)
+		fc->size = cpu_to_le64(round_up(arg->size, CEPH_FSCRYPT_BLOCK_SIZE));
+	else
+#endif
+		fc->size = cpu_to_le64(arg->size);
 	fc->max_size = cpu_to_le64(arg->max_size);
 	ceph_encode_timespec64(&fc->mtime, &arg->mtime);
 	ceph_encode_timespec64(&fc->atime, &arg->atime);
@@ -1313,11 +1317,17 @@ static void encode_cap_msg(struct ceph_msg *msg, struct cap_msg_args *arg)
 	ceph_encode_64(&p, 0);
 
 #if IS_ENABLED(CONFIG_FS_ENCRYPTION)
-	/* fscrypt_auth and fscrypt_file (version 12) */
+	/*
+	 * fscrypt_auth and fscrypt_file (version 12)
+	 *
+	 * fscrypt_auth holds the crypto context (if any). fscrypt_file
+	 * tracks the real i_size as an __le64 field (and we use a rounded-up
+	 * i_size in the traditional size field).
+	 */
 	ceph_encode_32(&p, arg->fscrypt_auth_len);
 	ceph_encode_copy(&p, arg->fscrypt_auth, arg->fscrypt_auth_len);
-	ceph_encode_32(&p, arg->fscrypt_file_len);
-	ceph_encode_copy(&p, arg->fscrypt_file, arg->fscrypt_file_len);
+	ceph_encode_32(&p, sizeof(__le64));
+	ceph_encode_64(&p, arg->size);
 #else /* CONFIG_FS_ENCRYPTION */
 	ceph_encode_32(&p, 0);
 	ceph_encode_32(&p, 0);
@@ -1389,7 +1399,6 @@ static void __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
 	arg->follows = flushing ? ci->i_head_snapc->seq : 0;
 	arg->flush_tid = flush_tid;
 	arg->oldest_flush_tid = oldest_flush_tid;
-
 	arg->size = i_size_read(inode);
 	ci->i_reported_size = arg->size;
 	arg->max_size = ci->i_wanted_max_size;
@@ -1443,6 +1452,7 @@ static void __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
 		}
 	}
 	arg->flags = flags;
+	arg->encrypted = IS_ENCRYPTED(inode);
 #if IS_ENABLED(CONFIG_FS_ENCRYPTION)
 	if (ci->fscrypt_auth_len &&
 	    WARN_ON_ONCE(ci->fscrypt_auth_len != sizeof(struct ceph_fscrypt_auth))) {
@@ -1453,21 +1463,21 @@ static void __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
 		memcpy(arg->fscrypt_auth, ci->fscrypt_auth,
 			min_t(size_t, ci->fscrypt_auth_len, sizeof(arg->fscrypt_auth)));
 	}
-	/* FIXME: use this to track "real" size */
-	arg->fscrypt_file_len = 0;
 #endif /* CONFIG_FS_ENCRYPTION */
 }
 
+#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
 #define CAP_MSG_FIXED_FIELDS (sizeof(struct ceph_mds_caps) + \
-		      4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 4 + 8 + 8 + 4 + 4)
+		      4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 4 + 8 + 8 + 4 + 4 + 8)
 
-#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
 static inline int cap_msg_size(struct cap_msg_args *arg)
 {
-	return CAP_MSG_FIXED_FIELDS + arg->fscrypt_auth_len +
-			arg->fscrypt_file_len;
+	return CAP_MSG_FIXED_FIELDS + arg->fscrypt_auth_len;
 }
 #else
+#define CAP_MSG_FIXED_FIELDS (sizeof(struct ceph_mds_caps) + \
+		      4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 4 + 8 + 8 + 4 + 4)
+
 static inline int cap_msg_size(struct cap_msg_args *arg)
 {
 	return CAP_MSG_FIXED_FIELDS;
@@ -1546,13 +1556,10 @@ static inline int __send_flush_snap(struct inode *inode,
 	arg.inline_data = capsnap->inline_data;
 	arg.flags = 0;
 	arg.wake = false;
+	arg.encrypted = IS_ENCRYPTED(inode);
 
-	/*
-	 * No fscrypt_auth changes from a capsnap. It will need
-	 * to update fscrypt_file on size changes (TODO).
-	 */
+	/* No fscrypt_auth changes from a capsnap. */
 	arg.fscrypt_auth_len = 0;
-	arg.fscrypt_file_len = 0;
 
 	msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, cap_msg_size(&arg),
 			   GFP_NOFS, false);
diff --git a/fs/ceph/crypto.h b/fs/ceph/crypto.h
index c2e0cbb5667b..ab27a7ed62c3 100644
--- a/fs/ceph/crypto.h
+++ b/fs/ceph/crypto.h
@@ -9,6 +9,10 @@
 #include <crypto/sha2.h>
 #include <linux/fscrypt.h>
 
+#define CEPH_FSCRYPT_BLOCK_SHIFT   12
+#define CEPH_FSCRYPT_BLOCK_SIZE    (_AC(1,UL) << CEPH_FSCRYPT_BLOCK_SHIFT)
+#define CEPH_FSCRYPT_BLOCK_MASK	   (~(CEPH_FSCRYPT_BLOCK_SIZE-1))
+
 struct ceph_fs_client;
 struct ceph_acl_sec_ctx;
 struct ceph_mds_request;
-- 
2.27.0
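
The size split in this patch can be sketched in userspace as follows.
The struct and function names are invented for illustration, and the
block size is assumed to equal CEPH_FSCRYPT_BLOCK_SIZE (1 << 12); the
kernel code uses round_up() rather than the local helper shown here.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLOCK_SHIFT 12
#define SKETCH_BLOCK_SIZE  (1ULL << SKETCH_BLOCK_SHIFT)

/* Round x up to the next multiple of a power-of-two alignment. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Hypothetical pair of values carried in one cap message. */
struct reported_sizes {
	uint64_t size;		/* goes in the traditional size field */
	uint64_t fscrypt_file;	/* real i_size, goes in fscrypt_file */
};

static struct reported_sizes sizes_for_cap_msg(uint64_t i_size, int encrypted)
{
	struct reported_sizes r;

	/* Only encrypted inodes report a block-aligned size. */
	r.size = encrypted ? round_up_pow2(i_size, SKETCH_BLOCK_SIZE) : i_size;
	r.fscrypt_file = i_size;
	return r;
}
```

With a real size of 5000 bytes, an encrypted inode would report
size 8192 and fscrypt_file 5000.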



* [PATCH v7 3/9] ceph: fscrypt_file field handling in MClientRequest messages
  2021-11-05 14:22 [PATCH v7 0/9] ceph: size handling for the fscrypt xiubli
  2021-11-05 14:22 ` [PATCH v7 1/9] libceph: add CEPH_OSD_OP_ASSERT_VER support xiubli
  2021-11-05 14:22 ` [PATCH v7 2/9] ceph: size handling for encrypted inodes in cap updates xiubli
@ 2021-11-05 14:22 ` xiubli
  2021-11-08  5:09   ` Xiubo Li
  2021-11-05 14:22 ` [PATCH v7 4/9] ceph: get file size from fscrypt_file when present in inode traces xiubli
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: xiubli @ 2021-11-05 14:22 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel

From: Jeff Layton <jlayton@kernel.org>

For encrypted inodes, transmit a rounded-up size to the MDS as the
normal file size and send the real inode size in fscrypt_file field.

Also, fix up creates and truncates to also transmit fscrypt_file.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/dir.c        |  3 +++
 fs/ceph/file.c       |  2 ++
 fs/ceph/inode.c      | 18 ++++++++++++++++--
 fs/ceph/mds_client.c |  9 ++++++++-
 fs/ceph/mds_client.h |  2 ++
 5 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 37c9c589ee27..987c1579614c 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -916,6 +916,9 @@ static int ceph_mknod(struct user_namespace *mnt_userns, struct inode *dir,
 		goto out_req;
 	}
 
+	if (S_ISREG(mode) && IS_ENCRYPTED(dir))
+		set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
+
 	req->r_dentry = dget(dentry);
 	req->r_num_caps = 2;
 	req->r_parent = dir;
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 126d2d80686c..8c0b9ed7f48b 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -715,6 +715,8 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	req->r_args.open.mask = cpu_to_le32(mask);
 	req->r_parent = dir;
 	ihold(dir);
+	if (IS_ENCRYPTED(dir))
+		set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
 
 	if (flags & O_CREAT) {
 		struct ceph_file_layout lo;
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index d24d42c94d43..4a7b2b0d88f7 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2383,11 +2383,25 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 			}
 		} else if ((issued & CEPH_CAP_FILE_SHARED) == 0 ||
 			   attr->ia_size != isize) {
-			req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
-			req->r_args.setattr.old_size = cpu_to_le64(isize);
 			mask |= CEPH_SETATTR_SIZE;
 			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
 				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
+			if (IS_ENCRYPTED(inode)) {
+				set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
+				mask |= CEPH_SETATTR_FSCRYPT_FILE;
+				req->r_args.setattr.size =
+					cpu_to_le64(round_up(attr->ia_size,
+							     CEPH_FSCRYPT_BLOCK_SIZE));
+				req->r_args.setattr.old_size =
+					cpu_to_le64(round_up(isize,
+							     CEPH_FSCRYPT_BLOCK_SIZE));
+				req->r_fscrypt_file = attr->ia_size;
+				/* FIXME: client must zero out any partial blocks! */
+			} else {
+				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
+				req->r_args.setattr.old_size = cpu_to_le64(isize);
+				req->r_fscrypt_file = 0;
+			}
 		}
 	}
 	if (ia_valid & ATTR_MTIME) {
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 69caea1d2444..e2d1b98c61fc 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2653,7 +2653,12 @@ static void encode_mclientrequest_tail(void **p, const struct ceph_mds_request *
 	} else {
 		ceph_encode_32(p, 0);
 	}
-	ceph_encode_32(p, 0); // fscrypt_file for now
+	if (test_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags)) {
+		ceph_encode_32(p, sizeof(__le64));
+		ceph_encode_64(p, req->r_fscrypt_file);
+	} else {
+		ceph_encode_32(p, 0);
+	}
 }
 
 /*
@@ -2739,6 +2744,8 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
 
 	/* fscrypt_file */
 	len += sizeof(u32);
+	if (test_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags))
+		len += sizeof(__le64);
 
 	msg = ceph_msg_new2(CEPH_MSG_CLIENT_REQUEST, len, 1, GFP_NOFS, false);
 	if (!msg) {
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 6a2ac489e06e..149a3a828472 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -276,6 +276,7 @@ struct ceph_mds_request {
 #define CEPH_MDS_R_DID_PREPOPULATE	(6) /* prepopulated readdir */
 #define CEPH_MDS_R_PARENT_LOCKED	(7) /* is r_parent->i_rwsem wlocked? */
 #define CEPH_MDS_R_ASYNC		(8) /* async request */
+#define CEPH_MDS_R_FSCRYPT_FILE		(9) /* must marshal fscrypt_file field */
 	unsigned long	r_req_flags;
 
 	struct mutex r_fill_mutex;
@@ -283,6 +284,7 @@ struct ceph_mds_request {
 	union ceph_mds_request_args r_args;
 
 	struct ceph_fscrypt_auth *r_fscrypt_auth;
+	u64	r_fscrypt_file;
 
 	u8 *r_altname;		    /* fscrypt binary crypttext for long filenames */
 	u32 r_altname_len;	    /* length of r_altname */
-- 
2.27.0
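
A minimal sketch of the fscrypt_file tail encoding in this patch: a
32-bit length, followed by the 64-bit real size only when the request
is flagged. The function names are hypothetical, and a plain bool
stands in for the CEPH_MDS_R_FSCRYPT_FILE request flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Store the low nbytes of v in little-endian byte order. */
static void put_le(uint8_t *dst, uint64_t v, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
		dst[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Hypothetical mirror of the fscrypt_file part of
 * encode_mclientrequest_tail(). Returns the number of bytes written,
 * which is also what the length accounting must reserve.
 */
static size_t encode_fscrypt_file(uint8_t *buf, size_t buflen,
				  bool have_file, uint64_t real_size)
{
	size_t need = 4 + (have_file ? 8 : 0);

	assert(buflen >= need);
	put_le(buf, have_file ? 8 : 0, 4);	/* length prefix */
	if (have_file)
		put_le(buf + 4, real_size, 8);	/* real inode size */
	return need;
}
```

The return value corresponds to the two-step accounting in
create_request_message(): sizeof(u32) always, plus sizeof(__le64) when
the flag is set.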



* [PATCH v7 4/9] ceph: get file size from fscrypt_file when present in inode traces
  2021-11-05 14:22 [PATCH v7 0/9] ceph: size handling for the fscrypt xiubli
                   ` (2 preceding siblings ...)
  2021-11-05 14:22 ` [PATCH v7 3/9] ceph: fscrypt_file field handling in MClientRequest messages xiubli
@ 2021-11-05 14:22 ` xiubli
  2021-11-05 14:22 ` [PATCH v7 5/9] ceph: handle fscrypt fields in cap messages from MDS xiubli
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: xiubli @ 2021-11-05 14:22 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel

From: Jeff Layton <jlayton@kernel.org>

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 4a7b2b0d88f7..15c2fb1e2c8a 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -978,6 +978,16 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 		     from_kgid(&init_user_ns, inode->i_gid));
 		ceph_decode_timespec64(&ci->i_btime, &iinfo->btime);
 		ceph_decode_timespec64(&ci->i_snap_btime, &iinfo->snap_btime);
+
+#ifdef CONFIG_FS_ENCRYPTION
+		if (iinfo->fscrypt_auth_len && !ci->fscrypt_auth) {
+			ci->fscrypt_auth_len = iinfo->fscrypt_auth_len;
+			ci->fscrypt_auth = iinfo->fscrypt_auth;
+			iinfo->fscrypt_auth = NULL;
+			iinfo->fscrypt_auth_len = 0;
+			inode_set_flags(inode, S_ENCRYPTED, S_ENCRYPTED);
+		}
+#endif
 	}
 
 	if ((new_version || (new_issued & CEPH_CAP_LINK_SHARED)) &&
@@ -1001,6 +1011,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 
 	if (new_version ||
 	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
+		u64 size = info->size;
 		s64 old_pool = ci->i_layout.pool_id;
 		struct ceph_string *old_ns;
 
@@ -1014,10 +1025,17 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 
 		pool_ns = old_ns;
 
+		if (IS_ENCRYPTED(inode) && size &&
+		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
+			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
+			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
+				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
+		}
+
 		queue_trunc = ceph_fill_file_size(inode, issued,
 					le32_to_cpu(info->truncate_seq),
 					le64_to_cpu(info->truncate_size),
-					le64_to_cpu(info->size));
+					le64_to_cpu(size));
 		/* only update max_size on auth cap */
 		if ((info->cap.flags & CEPH_CAP_FLAG_AUTH) &&
 		    ci->i_max_size != le64_to_cpu(info->max_size)) {
@@ -1057,16 +1075,6 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 		xattr_blob = NULL;
 	}
 
-#ifdef CONFIG_FS_ENCRYPTION
-	if (iinfo->fscrypt_auth_len && !ci->fscrypt_auth) {
-		ci->fscrypt_auth_len = iinfo->fscrypt_auth_len;
-		ci->fscrypt_auth = iinfo->fscrypt_auth;
-		iinfo->fscrypt_auth = NULL;
-		iinfo->fscrypt_auth_len = 0;
-		inode_set_flags(inode, S_ENCRYPTED, S_ENCRYPTED);
-	}
-#endif
-
 	/* finally update i_version */
 	if (le64_to_cpu(info->version) > ci->i_version)
 		ci->i_version = le64_to_cpu(info->version);
-- 
2.27.0
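
Since this patch has no commit message, here is a hedged userspace
sketch of the size selection it adds to ceph_fill_inode(). All names
are hypothetical, the block size is assumed to be 4096, and the sketch
converts each wire value to host byte order exactly once before
comparing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLOCK_SIZE 4096ULL

/* Read a little-endian 64-bit value from a byte buffer. */
static uint64_t get_le64(const uint8_t *src)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | src[i];
	return v;
}

/*
 * Hypothetical helper: mds_size is the (host-order) size from the
 * inode trace, file/file_len is the raw fscrypt_file blob.
 */
static uint64_t pick_inode_size(uint64_t mds_size, bool encrypted,
				const uint8_t *file, size_t file_len,
				bool *mismatch)
{
	uint64_t real;

	assert(mismatch);
	*mismatch = false;

	/* Unencrypted, empty, or no usable blob: keep the MDS size. */
	if (!encrypted || !mds_size || file_len != sizeof(uint64_t))
		return mds_size;

	real = get_le64(file);
	/* The MDS size should be the real size rounded up to a block. */
	if (mds_size != ((real + SKETCH_BLOCK_SIZE - 1) &
			 ~(SKETCH_BLOCK_SIZE - 1)))
		*mismatch = true;	/* the patch only warns here */
	return real;
}
```

As in the diff, a mismatch is reported but the fscrypt_file value is
still used.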



* [PATCH v7 5/9] ceph: handle fscrypt fields in cap messages from MDS
  2021-11-05 14:22 [PATCH v7 0/9] ceph: size handling for the fscrypt xiubli
                   ` (3 preceding siblings ...)
  2021-11-05 14:22 ` [PATCH v7 4/9] ceph: get file size from fscrypt_file when present in inode traces xiubli
@ 2021-11-05 14:22 ` xiubli
  2021-11-05 14:22 ` [PATCH v7 6/9] ceph: add __ceph_get_caps helper support xiubli
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: xiubli @ 2021-11-05 14:22 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel

From: Jeff Layton <jlayton@kernel.org>

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index fc367f42536a..c9f1ac3ad2f3 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3329,6 +3329,9 @@ struct cap_extra_info {
 	/* currently issued */
 	int issued;
 	struct timespec64 btime;
+	u8 *fscrypt_auth;
+	u32 fscrypt_auth_len;
+	u64 fscrypt_file_size;
 };
 
 /*
@@ -3361,6 +3364,14 @@ static void handle_cap_grant(struct inode *inode,
 	bool deleted_inode = false;
 	bool fill_inline = false;
 
+	/*
+	 * If there is at least one crypto block then we'll trust fscrypt_file_size.
+	 * If the real length of the file is 0, then ignore it (it has probably been
+	 * truncated down to 0 by the MDS).
+	 */
+	if (IS_ENCRYPTED(inode) && size)
+		size = extra_info->fscrypt_file_size;
+
 	dout("handle_cap_grant inode %p cap %p mds%d seq %d %s\n",
 	     inode, cap, session->s_mds, seq, ceph_cap_string(newcaps));
 	dout(" size %llu max_size %llu, i_size %llu\n", size, max_size,
@@ -3839,7 +3850,8 @@ static void handle_cap_flushsnap_ack(struct inode *inode, u64 flush_tid,
  */
 static bool handle_cap_trunc(struct inode *inode,
 			     struct ceph_mds_caps *trunc,
-			     struct ceph_mds_session *session)
+			     struct ceph_mds_session *session,
+			     struct cap_extra_info *extra_info)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	int mds = session->s_mds;
@@ -3856,6 +3868,14 @@ static bool handle_cap_trunc(struct inode *inode,
 
 	issued |= implemented | dirty;
 
+	/*
+	 * If there is at least one crypto block then we'll trust fscrypt_file_size.
+	 * If the real length of the file is 0, then ignore it (it has probably been
+	 * truncated down to 0 by the MDS).
+	 */
+	if (IS_ENCRYPTED(inode) && size)
+		size = extra_info->fscrypt_file_size;
+
 	dout("handle_cap_trunc inode %p mds%d seq %d to %lld seq %d\n",
 	     inode, mds, seq, truncate_size, truncate_seq);
 	queue_trunc = ceph_fill_file_size(inode, issued,
@@ -4074,6 +4094,48 @@ static void handle_cap_import(struct ceph_mds_client *mdsc,
 	*target_cap = cap;
 }
 
+#ifdef CONFIG_FS_ENCRYPTION
+static int parse_fscrypt_fields(void **p, void *end, struct cap_extra_info *extra)
+{
+	u32 len;
+
+	ceph_decode_32_safe(p, end, extra->fscrypt_auth_len, bad);
+	if (extra->fscrypt_auth_len) {
+		ceph_decode_need(p, end, extra->fscrypt_auth_len, bad);
+		extra->fscrypt_auth = kmalloc(extra->fscrypt_auth_len, GFP_KERNEL);
+		if (!extra->fscrypt_auth)
+			return -ENOMEM;
+		ceph_decode_copy_safe(p, end, extra->fscrypt_auth,
+					extra->fscrypt_auth_len, bad);
+	}
+
+	ceph_decode_32_safe(p, end, len, bad);
+	if (len == sizeof(u64))
+		ceph_decode_64_safe(p, end, extra->fscrypt_file_size, bad);
+	else
+		ceph_decode_skip_n(p, end, len, bad);
+	return 0;
+bad:
+	return -EIO;
+}
+#else
+static int parse_fscrypt_fields(void **p, void *end, struct cap_extra_info *extra)
+{
+	u32 len;
+
+	/* Don't care about these fields unless we're encryption-capable */
+	ceph_decode_32_safe(p, end, len, bad);
+	if (len)
+		ceph_decode_skip_n(p, end, len, bad);
+	ceph_decode_32_safe(p, end, len, bad);
+	if (len)
+		ceph_decode_skip_n(p, end, len, bad);
+	return 0;
+bad:
+	return -EIO;
+}
+#endif
+
 /*
  * Handle a caps message from the MDS.
  *
@@ -4192,6 +4254,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		ceph_decode_64_safe(&p, end, extra_info.nsubdirs, bad);
 	}
 
+	if (msg_version >= 12) {
+		int ret = parse_fscrypt_fields(&p, end, &extra_info);
+		if (ret)
+			goto bad;
+	}
+
 	/* lookup ino */
 	inode = ceph_find_inode(mdsc->fsc->sb, vino);
 	ci = ceph_inode(inode);
@@ -4288,7 +4356,8 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		break;
 
 	case CEPH_CAP_OP_TRUNC:
-		queue_trunc = handle_cap_trunc(inode, h, session);
+		queue_trunc = handle_cap_trunc(inode, h, session,
+						&extra_info);
 		spin_unlock(&ci->i_ceph_lock);
 		if (queue_trunc)
 			ceph_queue_vmtruncate(inode);
@@ -4306,6 +4375,7 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 	iput(inode);
 out:
 	ceph_put_string(extra_info.pool_ns);
+	kfree(extra_info.fscrypt_auth);
 	return;
 
 flush_cap_releases:
-- 
2.27.0
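
The decoding done by parse_fscrypt_fields() can be approximated in
userspace as below. This is a sketch under stated assumptions: the
cursor struct and helper names are invented, the auth blob is copied
into a caller-supplied buffer instead of a kmalloc() allocation, and -1
stands in for -EIO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct cursor {
	const uint8_t *p, *end;	/* current position and end of message */
};

/* Decode a little-endian u32, failing if fewer than 4 bytes remain. */
static int get_u32(struct cursor *c, uint32_t *out)
{
	if ((size_t)(c->end - c->p) < 4)
		return -1;
	*out = (uint32_t)c->p[0] | (uint32_t)c->p[1] << 8 |
	       (uint32_t)c->p[2] << 16 | (uint32_t)c->p[3] << 24;
	c->p += 4;
	return 0;
}

static uint64_t get_le64(const uint8_t *src)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | src[i];
	return v;
}

/* Hypothetical analogue of parse_fscrypt_fields(); -1 on short input. */
static int parse_fscrypt_fields_sketch(struct cursor *c, uint8_t *auth,
				       size_t auth_cap, uint32_t *auth_len,
				       uint64_t *file_size)
{
	uint32_t len;

	assert(c && c->p <= c->end);

	/* fscrypt_auth: length prefix, then that many bytes. */
	if (get_u32(c, auth_len))
		return -1;
	if (*auth_len > (size_t)(c->end - c->p) || *auth_len > auth_cap)
		return -1;
	memcpy(auth, c->p, *auth_len);
	c->p += *auth_len;

	/* fscrypt_file: only an 8-byte payload is interpreted as a size. */
	if (get_u32(c, &len))
		return -1;
	if (len > (size_t)(c->end - c->p))
		return -1;
	if (len == sizeof(uint64_t))
		*file_size = get_le64(c->p);
	c->p += len;	/* any other length is skipped */
	return 0;
}
```

Every read is checked against the end of the buffer first, which is the
role the ceph_decode_*_safe() macros play in the kernel code.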



* [PATCH v7 6/9] ceph: add __ceph_get_caps helper support
  2021-11-05 14:22 [PATCH v7 0/9] ceph: size handling for the fscrypt xiubli
                   ` (4 preceding siblings ...)
  2021-11-05 14:22 ` [PATCH v7 5/9] ceph: handle fscrypt fields in cap messages from MDS xiubli
@ 2021-11-05 14:22 ` xiubli
  2021-11-05 14:22 ` [PATCH v7 7/9] ceph: add __ceph_sync_read " xiubli
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: xiubli @ 2021-11-05 14:22 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 fs/ceph/caps.c  | 19 +++++++++++++------
 fs/ceph/super.h |  2 ++
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index c9f1ac3ad2f3..c15c5dd36747 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2911,10 +2911,9 @@ int ceph_try_get_caps(struct inode *inode, int need, int want,
  * due to a small max_size, make sure we check_max_size (and possibly
  * ask the mds) so we don't get hung up indefinitely.
  */
-int ceph_get_caps(struct file *filp, int need, int want, loff_t endoff, int *got)
+int __ceph_get_caps(struct inode *inode, struct ceph_file_info *fi, int need,
+		    int want, loff_t endoff, int *got)
 {
-	struct ceph_file_info *fi = filp->private_data;
-	struct inode *inode = file_inode(filp);
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
 	int ret, _got, flags;
@@ -2923,7 +2922,7 @@ int ceph_get_caps(struct file *filp, int need, int want, loff_t endoff, int *got
 	if (ret < 0)
 		return ret;
 
-	if ((fi->fmode & CEPH_FILE_MODE_WR) &&
+	if (fi && (fi->fmode & CEPH_FILE_MODE_WR) &&
 	    fi->filp_gen != READ_ONCE(fsc->filp_gen))
 		return -EBADF;
 
@@ -2931,7 +2930,7 @@ int ceph_get_caps(struct file *filp, int need, int want, loff_t endoff, int *got
 
 	while (true) {
 		flags &= CEPH_FILE_MODE_MASK;
-		if (atomic_read(&fi->num_locks))
+		if (fi && atomic_read(&fi->num_locks))
 			flags |= CHECK_FILELOCK;
 		_got = 0;
 		ret = try_get_cap_refs(inode, need, want, endoff,
@@ -2976,7 +2975,7 @@ int ceph_get_caps(struct file *filp, int need, int want, loff_t endoff, int *got
 				continue;
 		}
 
-		if ((fi->fmode & CEPH_FILE_MODE_WR) &&
+		if (fi && (fi->fmode & CEPH_FILE_MODE_WR) &&
 		    fi->filp_gen != READ_ONCE(fsc->filp_gen)) {
 			if (ret >= 0 && _got)
 				ceph_put_cap_refs(ci, _got);
@@ -3039,6 +3038,14 @@ int ceph_get_caps(struct file *filp, int need, int want, loff_t endoff, int *got
 	return 0;
 }
 
+int ceph_get_caps(struct file *filp, int need, int want, loff_t endoff, int *got)
+{
+	struct ceph_file_info *fi = filp->private_data;
+	struct inode *inode = file_inode(filp);
+
+	return __ceph_get_caps(inode, fi, need, want, endoff, got);
+}
+
 /*
  * Take cap refs.  Caller must already know we hold at least one ref
  * on the caps in question or we don't know this is safe.
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index ea95c958202f..403918a4cdb3 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1225,6 +1225,8 @@ extern int ceph_encode_dentry_release(void **p, struct dentry *dn,
 				      struct inode *dir,
 				      int mds, int drop, int unless);
 
+extern int __ceph_get_caps(struct inode *inode, struct ceph_file_info *fi,
+			   int need, int want, loff_t endoff, int *got);
 extern int ceph_get_caps(struct file *filp, int need, int want,
 			 loff_t endoff, int *got);
 extern int ceph_try_get_caps(struct inode *inode,
-- 
2.27.0
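
The refactoring in this patch follows a common pattern: the inner
helper accepts an optional per-file context and skips the per-file
checks when it is NULL, while the original entry point becomes a thin
wrapper. A reduced sketch, with an invented struct and mode flag and
only the stale-generation check kept:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_MODE_WR 0x2	/* hypothetical stand-in for CEPH_FILE_MODE_WR */

struct file_info_sketch {
	int fmode;
	unsigned int filp_gen;
};

/* Inner helper: fi may be NULL when there is no open file to consult. */
static int get_caps_inner(const struct file_info_sketch *fi,
			  unsigned int cur_gen)
{
	if (fi && (fi->fmode & SKETCH_MODE_WR) && fi->filp_gen != cur_gen)
		return -EBADF;	/* write handle is stale */
	return 0;
}

/* Wrapper for callers that hold an open file; fi is assumed non-NULL. */
static int get_caps(const struct file_info_sketch *fi, unsigned int cur_gen)
{
	assert(fi);
	return get_caps_inner(fi, cur_gen);
}
```

Callers that only have an inode would call the inner helper with NULL,
which appears to be what the later truncate patch in this series needs.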



* [PATCH v7 7/9] ceph: add __ceph_sync_read helper support
  2021-11-05 14:22 [PATCH v7 0/9] ceph: size handling for the fscrypt xiubli
                   ` (5 preceding siblings ...)
  2021-11-05 14:22 ` [PATCH v7 6/9] ceph: add __ceph_get_caps helper support xiubli
@ 2021-11-05 14:22 ` xiubli
  2021-11-05 14:22 ` [PATCH v7 8/9] ceph: add object version support for sync read xiubli
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: xiubli @ 2021-11-05 14:22 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 fs/ceph/file.c  | 34 ++++++++++++++++++++++------------
 fs/ceph/super.h |  2 ++
 2 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 8c0b9ed7f48b..129f6a642f8e 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -870,21 +870,18 @@ enum {
  * If we get a short result from the OSD, check against i_size; we need to
  * only return a short read to the caller if we hit EOF.
  */
-static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
-			      int *retry_op)
+ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
+			 struct iov_iter *to, int *retry_op)
 {
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file_inode(file);
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
 	struct ceph_osd_client *osdc = &fsc->client->osdc;
 	ssize_t ret;
-	u64 off = iocb->ki_pos;
+	u64 off = *ki_pos;
 	u64 len = iov_iter_count(to);
 	u64 i_size;
 
-	dout("sync_read on file %p %llu~%u %s\n", file, off, (unsigned)len,
-	     (file->f_flags & O_DIRECT) ? "O_DIRECT" : "");
+	dout("sync_read on inode %p %llu~%u\n", inode, *ki_pos, (unsigned)len);
 
 	if (!len)
 		return 0;
@@ -986,14 +983,14 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
 			break;
 	}
 
-	if (off > iocb->ki_pos) {
+	if (off > *ki_pos) {
 		if (off >= i_size) {
 			*retry_op = CHECK_EOF;
-			ret = i_size - iocb->ki_pos;
-			iocb->ki_pos = i_size;
+			ret = i_size - *ki_pos;
+			*ki_pos = i_size;
 		} else {
-			ret = off - iocb->ki_pos;
-			iocb->ki_pos = off;
+			ret = off - *ki_pos;
+			*ki_pos = off;
 		}
 	}
 
@@ -1001,6 +998,19 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
 	return ret;
 }
 
+static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
+			      int *retry_op)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+
+	dout("sync_read on file %p %llu~%u %s\n", file, iocb->ki_pos,
+	     (unsigned)iov_iter_count(to),
+	     (file->f_flags & O_DIRECT) ? "O_DIRECT" : "");
+
+	return __ceph_sync_read(inode, &iocb->ki_pos, to, retry_op);
+}
+
 struct ceph_aio_request {
 	struct kiocb *iocb;
 	size_t total_len;
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 403918a4cdb3..2362d758af97 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1253,6 +1253,8 @@ extern int ceph_renew_caps(struct inode *inode, int fmode);
 extern int ceph_open(struct inode *inode, struct file *file);
 extern int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 			    struct file *file, unsigned flags, umode_t mode);
+extern ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
+				struct iov_iter *to, int *retry_op);
 extern int ceph_release(struct inode *inode, struct file *filp);
 extern void ceph_fill_inline_data(struct inode *inode, struct page *locked_page,
 				  char *data, size_t len);
-- 
2.27.0



* [PATCH v7 8/9] ceph: add object version support for sync read
  2021-11-05 14:22 [PATCH v7 0/9] ceph: size handling for the fscrypt xiubli
                   ` (6 preceding siblings ...)
  2021-11-05 14:22 ` [PATCH v7 7/9] ceph: add __ceph_sync_read " xiubli
@ 2021-11-05 14:22 ` xiubli
  2021-11-05 14:22 ` [PATCH v7 9/9] ceph: add truncate size handling support for fscrypt xiubli
  2021-11-05 18:36 ` [PATCH v7 0/9] ceph: size handling for the fscrypt Jeff Layton
  9 siblings, 0 replies; 25+ messages in thread
From: xiubli @ 2021-11-05 14:22 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

A sync read may be split into several osdc requests, each of which may
target a different Rados object, so record the object version returned
for each request.
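For illustration only, the bookkeeping described above could look like
the following userspace sketch: one (offset, length, version) record per
request, stored in an array that doubles when full. The sketch_* names
and the "capacity" field are assumptions for this example and are not
part of the patch; realloc() stands in for the kernel allocator. The
reallocation goes through a temporary so the old array is not lost on
failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* One record per request: file range [offset, offset + length). */
struct sketch_object_ver {
	uint64_t offset;
	uint64_t length;
	uint64_t objver;
};

struct sketch_object_vers {
	uint32_t count;     /* records in use */
	uint32_t capacity;  /* records allocated (sketch-only field) */
	struct sketch_object_ver *objvers;
};

/* Append one record, doubling the array when it is full. */
int sketch_objvers_add(struct sketch_object_vers *v, uint64_t off,
		       uint64_t len, uint64_t ver)
{
	assert(v != NULL);

	if (v->count == v->capacity) {
		uint32_t new_cap = v->capacity ? v->capacity * 2 : 8;
		struct sketch_object_ver *tmp;

		/* Reallocate the array itself, not the container. */
		tmp = realloc(v->objvers, new_cap * sizeof(*tmp));
		if (!tmp)
			return -ENOMEM; /* old array is still valid */
		v->objvers = tmp;
		v->capacity = new_cap;
	}

	v->objvers[v->count].offset = off;
	v->objvers[v->count].length = len;
	v->objvers[v->count].objver = ver;
	v->count++;
	return 0;
}

void sketch_objvers_free(struct sketch_object_vers *v)
{
	free(v->objvers);
	v->objvers = NULL;
	v->count = 0;
	v->capacity = 0;
}
```

A caller that expects a single request, as the truncate path in the next
patch does, would check that count is 1 and read objvers[0].objver.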

Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 fs/ceph/file.c  | 44 ++++++++++++++++++++++++++++++++++++++++++--
 fs/ceph/super.h | 18 +++++++++++++++++-
 2 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 129f6a642f8e..cedd86a6058d 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -871,7 +871,8 @@ enum {
  * only return a short read to the caller if we hit EOF.
  */
 ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
-			 struct iov_iter *to, int *retry_op)
+			 struct iov_iter *to, int *retry_op,
+			 struct ceph_object_vers *objvers)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
@@ -880,6 +881,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
 	u64 off = *ki_pos;
 	u64 len = iov_iter_count(to);
 	u64 i_size;
+	u32 object_count = 8;
 
 	dout("sync_read on inode %p %llu~%u\n", inode, *ki_pos, (unsigned)len);
 
@@ -896,6 +898,15 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
 	if (ret < 0)
 		return ret;
 
+	if (objvers) {
+		objvers->count = 0;
+		objvers->objvers = kcalloc(object_count,
+					   sizeof(struct ceph_object_ver),
+					   GFP_KERNEL);
+		if (!objvers->objvers)
+			return -ENOMEM;
+	}
+
 	ret = 0;
 	while ((len = iov_iter_count(to)) > 0) {
 		struct ceph_osd_request *req;
@@ -938,6 +949,30 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
 					 req->r_end_latency,
 					 len, ret);
 
+		if (objvers) {
+			u32 ind = objvers->count;
+
+			if (objvers->count >= object_count) {
+				int ov_size;
+
+				object_count *= 2;
+				ov_size = sizeof(struct ceph_object_ver);
+				objvers->objvers = krealloc_array(objvers->objvers,
+								  object_count,
+								  ov_size,
+								  GFP_KERNEL);
+				if (!objvers->objvers) {
+					objvers->count = 0;
+					ret = -ENOMEM;
+					break;
+				}
+			}
+
+			objvers->objvers[ind].offset = off;
+			objvers->objvers[ind].length = len;
+			objvers->objvers[ind].objver = req->r_version;
+			objvers->count++;
+		}
 		ceph_osdc_put_request(req);
 
 		i_size = i_size_read(inode);
@@ -995,6 +1030,11 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
 	}
 
 	dout("sync_read result %zd retry_op %d\n", ret, *retry_op);
+	if (ret < 0 && objvers) {
+		objvers->count = 0;
+		kfree(objvers->objvers);
+		objvers->objvers = NULL;
+	}
 	return ret;
 }
 
@@ -1008,7 +1048,7 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
 	     (unsigned)iov_iter_count(to),
 	     (file->f_flags & O_DIRECT) ? "O_DIRECT" : "");
 
-	return __ceph_sync_read(inode, &iocb->ki_pos, to, retry_op);
+	return __ceph_sync_read(inode, &iocb->ki_pos, to, retry_op, NULL);
 }
 
 struct ceph_aio_request {
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 2362d758af97..b347b12e86a9 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -451,6 +451,21 @@ struct ceph_inode_info {
 	struct inode vfs_inode; /* at end */
 };
 
+/*
+ * The version of an object which contains the
+ * file range of [offset, offset + length).
+ */
+struct ceph_object_ver {
+	u64 offset;
+	u64 length;
+	u64 objver;
+};
+
+struct ceph_object_vers {
+	u32 count;
+	struct ceph_object_ver *objvers;
+};
+
 static inline struct ceph_inode_info *
 ceph_inode(const struct inode *inode)
 {
@@ -1254,7 +1269,8 @@ extern int ceph_open(struct inode *inode, struct file *file);
 extern int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 			    struct file *file, unsigned flags, umode_t mode);
 extern ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
-				struct iov_iter *to, int *retry_op);
+				struct iov_iter *to, int *retry_op,
+				struct ceph_object_vers *objvers);
 extern int ceph_release(struct inode *inode, struct file *filp);
 extern void ceph_fill_inline_data(struct inode *inode, struct page *locked_page,
 				  char *data, size_t len);
-- 
2.27.0



* [PATCH v7 9/9] ceph: add truncate size handling support for fscrypt
  2021-11-05 14:22 [PATCH v7 0/9] ceph: size handling for the fscrypt xiubli
                   ` (7 preceding siblings ...)
  2021-11-05 14:22 ` [PATCH v7 8/9] ceph: add object version support for sync read xiubli
@ 2021-11-05 14:22 ` xiubli
  2021-11-08 11:42   ` Xiubo Li
  2021-11-08 12:49   ` Xiubo Li
  2021-11-05 18:36 ` [PATCH v7 0/9] ceph: size handling for the fscrypt Jeff Layton
  9 siblings, 2 replies; 25+ messages in thread
From: xiubli @ 2021-11-05 14:22 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

Transfer the encrypted last block contents to the MDS along with the
truncate request, but only when the new size is smaller than the
current size and is not aligned to the fscrypt BLOCK size. When the
last block is located in a file hole, the truncate request will
contain only the header.
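As a rough userspace illustration of the rule above (not the kernel
code): the sketch below decides whether the read-modify-write is needed
and fills a header mirroring the layout this patch adds. The sketch_*
names and the 4096 block size are assumptions; fields are kept
host-endian here, whereas the patch uses little-endian wire types. An
object version of 0 is treated as "hole", following the comment in the
diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value; the series uses CEPH_FSCRYPT_BLOCK_SIZE. */
#define SKETCH_BLOCK_SIZE 4096u

/* Userspace mirror of the packed header (host-endian in this sketch). */
struct sketch_truncate_header {
	uint8_t  ver;
	uint8_t  compat;
	uint32_t data_len;    /* bytes that follow data_len */
	uint64_t assert_ver;
	uint64_t file_offset;
	uint32_t block_size;
} __attribute__((packed));

/* RMW is needed only when shrinking to a non block-aligned size. */
bool sketch_needs_rmw(bool encrypted, uint64_t old_size, uint64_t new_size)
{
	return encrypted && new_size < old_size &&
	       (new_size % SKETCH_BLOCK_SIZE) != 0;
}

/* Fill the header; objver == 0 means the last block lies in a hole. */
void sketch_fill_header(struct sketch_truncate_header *h,
			uint64_t new_size, uint64_t objver)
{
	/* assert_ver + file_offset + block_size */
	const uint32_t fixed = 8 + 8 + 4;

	/* Caller is expected to have checked sketch_needs_rmw(). */
	assert(new_size % SKETCH_BLOCK_SIZE != 0);

	h->ver = 1;
	h->compat = 1;
	if (!objver) {
		/* Hole: header only, no block payload follows. */
		h->data_len = fixed;
		h->assert_ver = 0;
		h->file_offset = 0;
		h->block_size = 0;
	} else {
		/* Header plus one full block of payload. */
		h->data_len = fixed + SKETCH_BLOCK_SIZE;
		h->assert_ver = objver;
		h->file_offset = new_size - (new_size % SKETCH_BLOCK_SIZE);
		h->block_size = SKETCH_BLOCK_SIZE;
	}
}
```

For example, truncating an encrypted 8192-byte file to 5000 bytes would
need the RMW, with file_offset 4096; truncating it to 4096 would not.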

The MDS may fail the truncate with -EAGAIN if another client or
process has already updated the Rados object that contains the last
block, in which case the kclient needs to retry. The RMW takes
around 50ms, so retry up to 20 times for now.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 fs/ceph/crypto.h |  21 +++++
 fs/ceph/inode.c  | 210 +++++++++++++++++++++++++++++++++++++++++++----
 fs/ceph/super.h  |   5 ++
 3 files changed, 222 insertions(+), 14 deletions(-)

diff --git a/fs/ceph/crypto.h b/fs/ceph/crypto.h
index ab27a7ed62c3..393c308e8fc2 100644
--- a/fs/ceph/crypto.h
+++ b/fs/ceph/crypto.h
@@ -25,6 +25,27 @@ struct ceph_fname {
 	u32		ctext_len;	// length of crypttext
 };
 
+/*
+ * Header for the encrypted file when truncating the size. This
+ * will be sent to the MDS, and the MDS will update the encrypted
+ * last block and then truncate the size.
+ */
+struct ceph_fscrypt_truncate_size_header {
+       __u8  ver;
+       __u8  compat;
+
+       /*
+	* It will be sizeof(assert_ver + file_offset + block_size)
+	* if the last block is empty because it is located in a file
+	* hole. Otherwise data_len also includes CEPH_FSCRYPT_BLOCK_SIZE.
+	*/
+       __le32 data_len;
+
+       __le64 assert_ver;
+       __le64 file_offset;
+       __le32 block_size;
+} __packed;
+
 struct ceph_fscrypt_auth {
 	__le32	cfa_version;
 	__le32	cfa_blob_len;
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 15c2fb1e2c8a..eebbd0296004 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -586,6 +586,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 	ci->i_truncate_seq = 0;
 	ci->i_truncate_size = 0;
 	ci->i_truncate_pending = 0;
+	ci->i_truncate_pagecache_size = 0;
 
 	ci->i_max_size = 0;
 	ci->i_reported_size = 0;
@@ -751,6 +752,10 @@ int ceph_fill_file_size(struct inode *inode, int issued,
 		dout("truncate_size %lld -> %llu\n", ci->i_truncate_size,
 		     truncate_size);
 		ci->i_truncate_size = truncate_size;
+		if (IS_ENCRYPTED(inode))
+			ci->i_truncate_pagecache_size = size;
+		else
+			ci->i_truncate_pagecache_size = truncate_size;
 	}
 
 	if (queue_trunc)
@@ -1011,7 +1016,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 
 	if (new_version ||
 	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
-		u64 size = info->size;
+		u64 size = le64_to_cpu(info->size);
 		s64 old_pool = ci->i_layout.pool_id;
 		struct ceph_string *old_ns;
 
@@ -1026,16 +1031,20 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 		pool_ns = old_ns;
 
 		if (IS_ENCRYPTED(inode) && size &&
-		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
-			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
-			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
-				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
+		    (iinfo->fscrypt_file_len >= sizeof(__le64))) {
+			u64 fsize = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
+			if (fsize) {
+				size = fsize;
+				if (le64_to_cpu(info->size) !=
+				    round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
+					pr_warn("size=%llu fscrypt_file=%llu\n",
+						info->size, size);
+			}
 		}
 
 		queue_trunc = ceph_fill_file_size(inode, issued,
 					le32_to_cpu(info->truncate_seq),
-					le64_to_cpu(info->truncate_size),
-					le64_to_cpu(size));
+					le64_to_cpu(info->truncate_size), size);
 		/* only update max_size on auth cap */
 		if ((info->cap.flags & CEPH_CAP_FLAG_AUTH) &&
 		    ci->i_max_size != le64_to_cpu(info->max_size)) {
@@ -2142,7 +2151,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
 	/* there should be no reader or writer */
 	WARN_ON_ONCE(ci->i_rd_ref || ci->i_wr_ref);
 
-	to = ci->i_truncate_size;
+	to = ci->i_truncate_pagecache_size;
 	wrbuffer_refs = ci->i_wrbuffer_ref;
 	dout("__do_pending_vmtruncate %p (%d) to %lld\n", inode,
 	     ci->i_truncate_pending, to);
@@ -2151,7 +2160,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
 	truncate_pagecache(inode, to);
 
 	spin_lock(&ci->i_ceph_lock);
-	if (to == ci->i_truncate_size) {
+	if (to == ci->i_truncate_pagecache_size) {
 		ci->i_truncate_pending = 0;
 		finish = 1;
 	}
@@ -2232,6 +2241,141 @@ static const struct inode_operations ceph_encrypted_symlink_iops = {
 	.listxattr = ceph_listxattr,
 };
 
+/*
+ * Transfer the encrypted last block to the MDS and the MDS
+ * will help update it when truncating to a smaller size.
+ *
+ * We don't support a PAGE_SIZE that is smaller than the
+ * CEPH_FSCRYPT_BLOCK_SIZE.
+ */
+static int fill_fscrypt_truncate(struct inode *inode,
+				 struct ceph_mds_request *req,
+				 struct iattr *attr)
+{
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	int boff = attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE;
+	loff_t pos, orig_pos = round_down(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE);
+#if 0
+	u64 block = orig_pos >> CEPH_FSCRYPT_BLOCK_SHIFT;
+#endif
+	struct ceph_pagelist *pagelist = NULL;
+	struct kvec iov;
+	struct iov_iter iter;
+	struct page *page = NULL;
+	struct ceph_fscrypt_truncate_size_header header;
+	int retry_op = 0;
+	int len = CEPH_FSCRYPT_BLOCK_SIZE;
+	loff_t i_size = i_size_read(inode);
+	struct ceph_object_vers objvers = {0, NULL};
+	int got, ret, issued;
+
+	ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got);
+	if (ret < 0)
+		return ret;
+
+	issued = __ceph_caps_issued(ci, NULL);
+
+	dout("%s size %lld -> %lld got cap refs on %s, issued %s\n", __func__,
+	     i_size, attr->ia_size, ceph_cap_string(got),
+	     ceph_cap_string(issued));
+
+	/* Try to writeback the dirty pagecaches */
+	if (issued & (CEPH_CAP_FILE_BUFFER))
+		filemap_fdatawrite(&inode->i_data);
+
+	page = __page_cache_alloc(GFP_KERNEL);
+	if (page == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
+	if (!pagelist) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	iov.iov_base = kmap_local_page(page);
+	iov.iov_len = len;
+	iov_iter_kvec(&iter, READ, &iov, 1, len);
+
+	pos = orig_pos;
+	ret = __ceph_sync_read(inode, &pos, &iter, &retry_op, &objvers);
+	ceph_put_cap_refs(ci, got);
+	if (ret < 0)
+		goto out;
+
+	WARN_ON_ONCE(objvers.count != 1);
+
+	/* Insert the header first */
+	header.ver = 1;
+	header.compat = 1;
+
+	/*
+	 * If we hit a hole here, we should just skip filling
+	 * the fscrypt for the request, because once the fscrypt
+	 * is enabled, the file will be split into many blocks
+	 * with the size of CEPH_FSCRYPT_BLOCK_SIZE. If there
+	 * is a hole, the hole size should be a multiple of the
+	 * block size.
+	 *
+	 * If the Rados object doesn't exist, its version will be 0.
+	 */
+	if (!objvers.objvers[0].objver) {
+		dout("%s hit hole, ppos %lld < size %lld\n", __func__,
+		     pos, i_size);
+
+		header.data_len = cpu_to_le32(8 + 8 + 4);
+		header.assert_ver = 0;
+		header.file_offset = 0;
+		header.block_size = 0;
+		ret = 0;
+	} else {
+		header.data_len = cpu_to_le32(8 + 8 + 4 + CEPH_FSCRYPT_BLOCK_SIZE);
+		header.assert_ver = cpu_to_le64(objvers.objvers[0].objver);
+		header.file_offset = cpu_to_le64(orig_pos);
+		header.block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
+
+		/* truncate and zero out the extra contents for the last block */
+		memset(iov.iov_base + boff, 0, PAGE_SIZE - boff);
+
+#if 0 // Uncomment this when the fscrypt is enabled globally in kceph
+
+		/* encrypt the last block */
+		ret = fscrypt_encrypt_block_inplace(inode, page,
+						    CEPH_FSCRYPT_BLOCK_SIZE,
+						    0, block,
+						    GFP_KERNEL);
+		if (ret)
+			goto out;
+#endif
+	}
+
+	/* Insert the header */
+	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
+	if (ret)
+		goto out;
+
+	if (header.block_size) {
+		/* Append the last block contents to pagelist */
+		ret = ceph_pagelist_append(pagelist, iov.iov_base,
+					   CEPH_FSCRYPT_BLOCK_SIZE);
+		if (ret)
+			goto out;
+	}
+	req->r_pagelist = pagelist;
+out:
+	dout("%s %p size dropping cap refs on %s\n", __func__,
+	     inode, ceph_cap_string(got));
+	kunmap_local(iov.iov_base);
+	if (page)
+		__free_pages(page, 0);
+	if (ret && pagelist)
+		ceph_pagelist_release(pagelist);
+	kfree(objvers.objvers);
+	return ret;
+}
+
 int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *cia)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
@@ -2239,12 +2383,15 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 	struct ceph_mds_request *req;
 	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
 	struct ceph_cap_flush *prealloc_cf;
+	loff_t isize = i_size_read(inode);
 	int issued;
 	int release = 0, dirtied = 0;
 	int mask = 0;
 	int err = 0;
 	int inode_dirty_flags = 0;
 	bool lock_snap_rwsem = false;
+	bool fill_fscrypt;
+	int truncate_retry = 20; /* The RMW will take around 50ms */
 
 	prealloc_cf = ceph_alloc_cap_flush();
 	if (!prealloc_cf)
@@ -2257,6 +2404,8 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 		return PTR_ERR(req);
 	}
 
+retry:
+	fill_fscrypt = false;
 	spin_lock(&ci->i_ceph_lock);
 	issued = __ceph_caps_issued(ci, NULL);
 
@@ -2378,10 +2527,27 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 		}
 	}
 	if (ia_valid & ATTR_SIZE) {
-		loff_t isize = i_size_read(inode);
-
 		dout("setattr %p size %lld -> %lld\n", inode, isize, attr->ia_size);
-		if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
+		/*
+		 * Only when the new size is smaller and not aligned to
+		 * CEPH_FSCRYPT_BLOCK_SIZE is the RMW needed.
+		 */
+		if (IS_ENCRYPTED(inode) && attr->ia_size < isize &&
+		    (attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE)) {
+			mask |= CEPH_SETATTR_SIZE;
+			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
+				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
+			set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
+			mask |= CEPH_SETATTR_FSCRYPT_FILE;
+			req->r_args.setattr.size =
+				cpu_to_le64(round_up(attr->ia_size,
+						     CEPH_FSCRYPT_BLOCK_SIZE));
+			req->r_args.setattr.old_size =
+				cpu_to_le64(round_up(isize,
+						     CEPH_FSCRYPT_BLOCK_SIZE));
+			req->r_fscrypt_file = attr->ia_size;
+			fill_fscrypt = true;
+		} else if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
 			if (attr->ia_size > isize) {
 				i_size_write(inode, attr->ia_size);
 				inode->i_blocks = calc_inode_blocks(attr->ia_size);
@@ -2404,7 +2570,6 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 					cpu_to_le64(round_up(isize,
 							     CEPH_FSCRYPT_BLOCK_SIZE));
 				req->r_fscrypt_file = attr->ia_size;
-				/* FIXME: client must zero out any partial blocks! */
 			} else {
 				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
 				req->r_args.setattr.old_size = cpu_to_le64(isize);
@@ -2476,7 +2641,6 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 	if (inode_dirty_flags)
 		__mark_inode_dirty(inode, inode_dirty_flags);
 
-
 	if (mask) {
 		req->r_inode = inode;
 		ihold(inode);
@@ -2484,7 +2648,25 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 		req->r_args.setattr.mask = cpu_to_le32(mask);
 		req->r_num_caps = 1;
 		req->r_stamp = attr->ia_ctime;
+		if (fill_fscrypt) {
+			err = fill_fscrypt_truncate(inode, req, attr);
+			if (err)
+				goto out;
+		}
+
+		/*
+		 * The truncate request will return -EAGAIN when the
+		 * last block has been updated just before the MDS
+		 * successfully gets the xlock for the FILE lock. To
+		 * avoid corrupting the file contents we need to retry
+		 * it.
+		 */
 		err = ceph_mdsc_do_request(mdsc, NULL, req);
+		if (err == -EAGAIN && truncate_retry--) {
+			dout("setattr %p result=%d (%s locally, %d remote), retry it!\n",
+			     inode, err, ceph_cap_string(dirtied), mask);
+			goto retry;
+		}
 	}
 out:
 	dout("setattr %p result=%d (%s locally, %d remote)\n", inode, err,
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index b347b12e86a9..071857bb59d8 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -408,6 +408,11 @@ struct ceph_inode_info {
 	u32 i_truncate_seq;        /* last truncate to smaller size */
 	u64 i_truncate_size;       /*  and the size we last truncated down to */
 	int i_truncate_pending;    /*  still need to call vmtruncate */
+	/*
+	 * For the non-fscrypt case it equals i_truncate_size;
+	 * otherwise it equals fscrypt_file_size.
+	 */
+	u64 i_truncate_pagecache_size;
 
 	u64 i_max_size;            /* max file size authorized by mds */
 	u64 i_reported_size; /* (max_)size reported to or requested of mds */
-- 
2.27.0



* Re: [PATCH v7 0/9] ceph: size handling for the fscrypt
  2021-11-05 14:22 [PATCH v7 0/9] ceph: size handling for the fscrypt xiubli
                   ` (8 preceding siblings ...)
  2021-11-05 14:22 ` [PATCH v7 9/9] ceph: add truncate size handling support for fscrypt xiubli
@ 2021-11-05 18:36 ` Jeff Layton
  2021-11-05 20:46   ` Jeff Layton
  9 siblings, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2021-11-05 18:36 UTC (permalink / raw)
  To: xiubli; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel

On Fri, 2021-11-05 at 22:22 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> This patch series is based on the "wip-fscrypt-fnames" branch in
> repo https://github.com/ceph/ceph-client.git.
> 
> And I have picked up 5 patches from the "ceph-fscrypt-size-experimental"
> branch in repo
> https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git.
> 
> ====
> 
> This approach is based on the discussion from V1 and V2, which will
> pass the encrypted last block contents to MDS along with the truncate
> request.
> 
> This will send the encrypted last block contents to MDS along with
> the truncate request when truncating to a smaller size and at the
> same time new size does not align to BLOCK SIZE.
> 
> The MDS side patch is raised in PR
> https://github.com/ceph/ceph/pull/43588, which is also based Jeff's
> previous great work in PR https://github.com/ceph/ceph/pull/41284.
> 
> The MDS will use the filer.write_trunc(), which could update and
> truncate the file in one shot, instead of filer.truncate().
> 
> This just assume kclient won't support the inline data feature, which
> will be remove soon, more detail please see:
> https://tracker.ceph.com/issues/52916
> 
> Changed in V7:
> - Fixed the sparse check warnings.
> - Removed the include/linux/ceph/crypto.h header file.
> 
> Changed in V6:
> - Fixed the file hole bug, also have updated the MDS side PR.
> - Add add object version support for sync read in #8.
> 
> 
> Changed in V5:
> - Rebase to "wip-fscrypt-fnames" branch in ceph-client.git repo.
> - Pick up 5 patches from Jeff's "ceph-fscrypt-size-experimental" branch
>   in linux.git repo.
> - Add "i_truncate_pagecache_size" member support in ceph_inode_info
>   struct, this will be used to truncate the pagecache only in kclient
>   side, because the "i_truncate_size" will always be aligned to BLOCK
>   SIZE. In fscrypt case we need to use the real size to truncate the
>   pagecache.
> 
> 
> Changed in V4:
> - Retry the truncate request by 20 times before fail it with -EAGAIN.
> - Remove the "fill_last_block" label and move the code to else branch.
> - Remove the #3 patch, which has already been sent out separately, in
>   V3 series.
> - Improve some comments in the code.
> 
> 
> Changed in V3:
> - Fix possibly corrupting the file just before the MDS acquires the
>   xlock for FILE lock, another client has updated it.
> - Flush the pagecache buffer before reading the last block for the
>   when filling the truncate request.
> - Some other minore fixes.
> 
> 
> 
> Jeff Layton (5):
>   libceph: add CEPH_OSD_OP_ASSERT_VER support
>   ceph: size handling for encrypted inodes in cap updates
>   ceph: fscrypt_file field handling in MClientRequest messages
>   ceph: get file size from fscrypt_file when present in inode traces
>   ceph: handle fscrypt fields in cap messages from MDS
> 
> Xiubo Li (4):
>   ceph: add __ceph_get_caps helper support
>   ceph: add __ceph_sync_read helper support
>   ceph: add object version support for sync read
>   ceph: add truncate size handling support for fscrypt
> 
>  fs/ceph/caps.c                  | 136 ++++++++++++++----
>  fs/ceph/crypto.h                |  25 ++++
>  fs/ceph/dir.c                   |   3 +
>  fs/ceph/file.c                  |  76 ++++++++--
>  fs/ceph/inode.c                 | 244 +++++++++++++++++++++++++++++---
>  fs/ceph/mds_client.c            |   9 +-
>  fs/ceph/mds_client.h            |   2 +
>  fs/ceph/super.h                 |  25 ++++
>  include/linux/ceph/osd_client.h |   6 +-
>  include/linux/ceph/rados.h      |   4 +
>  net/ceph/osd_client.c           |   5 +
>  11 files changed, 475 insertions(+), 60 deletions(-)
> 

Thanks Xiubo.

I hit this today after some more testing (generic/014 again):

[ 1674.146843] libceph: mon0 (2)192.168.1.81:3300 session established
[ 1674.150902] libceph: client54432 fsid 4e286176-3d8b-11ec-bece-52540031ba78
[ 1674.153791] ceph: test_dummy_encryption mode enabled
[ 1719.254308] run fstests generic/014 at 2021-11-05 13:36:26
[ 1727.157974] 
[ 1727.158334] =====================================
[ 1727.159219] WARNING: bad unlock balance detected!
[ 1727.160707] 5.15.0-rc6+ #53 Tainted: G           OE    
[ 1727.162248] -------------------------------------
[ 1727.171918] truncfile/7800 is trying to release lock (&mdsc->snap_rwsem) at:
[ 1727.180836] [<ffffffffc127438e>] __ceph_setattr+0x85e/0x1270 [ceph]
[ 1727.192788] but there are no more locks to release!
[ 1727.203450] 
[ 1727.203450] other info that might help us debug this:
[ 1727.220766] 2 locks held by truncfile/7800:
[ 1727.225548]  #0: ffff888116dd2460 (sb_writers#15){.+.+}-{0:0}, at: do_syscall_64+0x3b/0x90
[ 1727.234851]  #1: ffff8882d8dac3d0 (&sb->s_type->i_mutex_key#20){++++}-{3:3}, at: do_truncate+0xbe/0x140
[ 1727.240027] 
[ 1727.240027] stack backtrace:
[ 1727.247863] CPU: 3 PID: 7800 Comm: truncfile Tainted: G           OE     5.15.0-rc6+ #53
[ 1727.252508] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
[ 1727.257303] Call Trace:
[ 1727.261503]  dump_stack_lvl+0x57/0x72
[ 1727.265492]  lock_release.cold+0x49/0x4e
[ 1727.269499]  ? __ceph_setattr+0x85e/0x1270 [ceph]
[ 1727.273802]  ? lock_downgrade+0x390/0x390
[ 1727.277913]  ? preempt_count_sub+0x14/0xc0
[ 1727.281883]  ? _raw_spin_unlock+0x29/0x40
[ 1727.285725]  ? __ceph_mark_dirty_caps+0x29f/0x450 [ceph]
[ 1727.289959]  up_read+0x17/0x20
[ 1727.293852]  __ceph_setattr+0x85e/0x1270 [ceph]
[ 1727.297827]  ? ceph_inode_work+0x460/0x460 [ceph]
[ 1727.301765]  ceph_setattr+0x12d/0x1c0 [ceph]
[ 1727.305839]  notify_change+0x4e9/0x720
[ 1727.309762]  ? do_truncate+0xcf/0x140
[ 1727.313504]  do_truncate+0xcf/0x140
[ 1727.317092]  ? file_open_root+0x1e0/0x1e0
[ 1727.321022]  ? lock_release+0x410/0x410
[ 1727.324769]  ? lock_is_held_type+0xfb/0x130
[ 1727.328699]  do_sys_ftruncate+0x306/0x350
[ 1727.332449]  do_syscall_64+0x3b/0x90
[ 1727.336127]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1727.340303] RIP: 0033:0x7f7f38be356b
[ 1727.344445] Code: 77 05 c3 0f 1f 40 00 48 8b 15 09 99 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 d9 98 0c 00 f7 d8
[ 1727.354258] RSP: 002b:00007ffdee35cd18 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
[ 1727.358964] RAX: ffffffffffffffda RBX: 000000000c8a9d62 RCX: 00007f7f38be356b
[ 1727.363836] RDX: 000000000c8a9d62 RSI: 000000000c8a9d62 RDI: 0000000000000003
[ 1727.368467] RBP: 0000000000000003 R08: 000000000000005a R09: 00007f7f38cada60
[ 1727.373285] R10: 00007f7f38af3700 R11: 0000000000000202 R12: 0000000061856b9d
[ 1727.377870] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1727.382578] ------------[ cut here ]------------
[ 1727.391761] DEBUG_RWSEMS_WARN_ON(tmp < 0): count = 0xffffffffffffff00, magic = 0xffff88812f062220, owner = 0x1, curr 0xffff88810095b280, list empty
[ 1727.419497] WARNING: CPU: 14 PID: 7800 at kernel/locking/rwsem.c:1297 __up_read+0x404/0x430
[ 1727.432752] Modules linked in: ceph(OE) libceph(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) bridge(E) ip_set(E) stp(E) llc(E) rfkill(E) nf_tables(E) nfnetlink(E) cachefiles(E) fscache(E) netfs(E) sunrpc(E) iTCO_wdt(E) intel_pmc_bxt(E) iTCO_vendor_support(E) intel_rapl_msr(E) lpc_ich(E) joydev(E) i2c_i801(E) i2c_smbus(E) virtio_balloon(E) intel_rapl_common(E) fuse(E) zram(E) ip_tables(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) virtio_gpu(E) virtio_blk(E) ghash_clmulni_intel(E) virtio_dma_buf(E) virtio_console(E) serio_raw(E) virtio_net(E) drm_kms_helper(E) net_failover(E) cec(E) failover(E) drm(E) qemu_fw_cfg(E)
[ 1727.506081] CPU: 1 PID: 7800 Comm: truncfile Tainted: G           OE     5.15.0-rc6+ #53
[ 1727.512691] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
[ 1727.521863] RIP: 0010:__up_read+0x404/0x430
[ 1727.528011] Code: 48 8b 55 00 4d 89 f0 4c 89 e1 53 48 c7 c6 e0 db 89 b1 48 c7 c7 00 d9 89 b1 65 4c 8b 3c 25 80 fe 01 00 4d 89 f9 e8 a2 4b 08 01 <0f> 0b 5a e9 b4 fd ff ff be 08 00 00 00 4c 89 e7 e8 57 4d 33 00 f0
[ 1727.540864] RSP: 0018:ffff888118a07bb0 EFLAGS: 00010286
[ 1727.556265] RAX: 0000000000000000 RBX: ffffffffb189d840 RCX: 0000000000000000
[ 1727.571003] RDX: 0000000000000001 RSI: ffffffffb1aa6380 RDI: ffffed1023140f6c
[ 1727.580837] RBP: ffff88812f062220 R08: ffffffffb0185284 R09: ffff8884209ad7c7
[ 1727.593908] R10: ffffed1084135af8 R11: 0000000000000001 R12: ffff88812f062220
[ 1727.605431] R13: 1ffff11023140f79 R14: 0000000000000001 R15: ffff88810095b280
[ 1727.615457] FS:  00007f7f38add740(0000) GS:ffff888420640000(0000) knlGS:0000000000000000
[ 1727.623346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1727.631138] CR2: 00007fa7810e8000 CR3: 000000012124a000 CR4: 00000000003506e0
[ 1727.639503] Call Trace:
[ 1727.649160]  ? _raw_spin_unlock+0x29/0x40
[ 1727.655583]  ? up_write+0x270/0x270
[ 1727.661769]  __ceph_setattr+0x85e/0x1270 [ceph]
[ 1727.670914]  ? ceph_inode_work+0x460/0x460 [ceph]
[ 1727.677397]  ceph_setattr+0x12d/0x1c0 [ceph]
[ 1727.685205]  notify_change+0x4e9/0x720
[ 1727.694598]  ? do_truncate+0xcf/0x140
[ 1727.705737]  do_truncate+0xcf/0x140
[ 1727.712680]  ? file_open_root+0x1e0/0x1e0
[ 1727.720447]  ? lock_release+0x410/0x410
[ 1727.727851]  ? lock_is_held_type+0xfb/0x130
[ 1727.734045]  do_sys_ftruncate+0x306/0x350
[ 1727.740636]  do_syscall_64+0x3b/0x90
[ 1727.748675]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1727.755634] RIP: 0033:0x7f7f38be356b
[ 1727.763575] Code: 77 05 c3 0f 1f 40 00 48 8b 15 09 99 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 d9 98 0c 00 f7 d8
[ 1727.777868] RSP: 002b:00007ffdee35cd18 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
[ 1727.792610] RAX: ffffffffffffffda RBX: 000000000c8a9d62 RCX: 00007f7f38be356b
[ 1727.807383] RDX: 000000000c8a9d62 RSI: 000000000c8a9d62 RDI: 0000000000000003
[ 1727.821520] RBP: 0000000000000003 R08: 000000000000005a R09: 00007f7f38cada60
[ 1727.829368] R10: 00007f7f38af3700 R11: 0000000000000202 R12: 0000000061856b9d
[ 1727.837356] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1727.849767] irq event stamp: 549109
[ 1727.863878] hardirqs last  enabled at (549109): [<ffffffffb12f8b54>] _raw_spin_unlock_irq+0x24/0x50
[ 1727.879034] hardirqs last disabled at (549108): [<ffffffffb12f8d04>] _raw_spin_lock_irq+0x54/0x60
[ 1727.897434] softirqs last  enabled at (548984): [<ffffffffb013d097>] __irq_exit_rcu+0x157/0x1b0
[ 1727.913276] softirqs last disabled at (548975): [<ffffffffb013d097>] __irq_exit_rcu+0x157/0x1b0
[ 1727.933182] ---[ end trace a89de5333b156523 ]---



I think this patch should fix it:

[PATCH] SQUASH: ensure we unset lock_snap_rwsem after unlocking it

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index eebbd0296004..cb0ad0faee45 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2635,8 +2635,10 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 
 	release &= issued;
 	spin_unlock(&ci->i_ceph_lock);
-	if (lock_snap_rwsem)
+	if (lock_snap_rwsem) {
 		up_read(&mdsc->snap_rwsem);
+		lock_snap_rwsem = false;
+	}
 
 	if (inode_dirty_flags)
 		__mark_inode_dirty(inode, inode_dirty_flags);
-- 
2.33.1
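
To illustrate why clearing the flag matters, here is a minimal userspace sketch of the pattern. It is not the ceph code: the sketch_* names are hypothetical, the "rwsem" is only a reader counter, and it assumes that a later path in the function releases the lock again whenever the flag is still set (that later path is not shown in the hunk above).

```c
#include <stdbool.h>

/* Hypothetical stand-in for an rwsem: it only counts active readers. */
struct sketch_rwsem {
	int readers;      /* readers currently holding the lock */
	int bad_unlocks;  /* releases attempted while no reader held it */
};

static void sketch_down_read(struct sketch_rwsem *sem)
{
	sem->readers++;
}

static void sketch_up_read(struct sketch_rwsem *sem)
{
	if (sem->readers == 0) {
		/* Roughly what lockdep flags as "bad unlock balance". */
		sem->bad_unlocks++;
		return;
	}
	sem->readers--;
}

/*
 * Models a function that drops the lock early and later reaches a
 * second release that is guarded by the same flag (an assumption of
 * this sketch). Returns the number of unbalanced releases.
 */
static int sketch_setattr(bool clear_flag_after_unlock)
{
	struct sketch_rwsem sem = { 0, 0 };
	bool lock_held = true;

	sketch_down_read(&sem);

	/* Early release, mirroring the shape of the hunk above. */
	if (lock_held) {
		sketch_up_read(&sem);
		if (clear_flag_after_unlock)
			lock_held = false;
	}

	/* Later path: releases again if the flag was left stale. */
	if (lock_held)
		sketch_up_read(&sem);

	return sem.bad_unlocks;
}
```

In this model, sketch_setattr(false) records one unbalanced release and sketch_setattr(true) records none, which is the behaviour the SQUASH patch is aiming for.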




* Re: [PATCH v7 0/9] ceph: size handling for the fscrypt
  2021-11-05 18:36 ` [PATCH v7 0/9] ceph: size handling for the fscrypt Jeff Layton
@ 2021-11-05 20:46   ` Jeff Layton
  2021-11-06  1:35     ` Xiubo Li
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: xiubli; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel

On Fri, 2021-11-05 at 14:36 -0400, Jeff Layton wrote:
> On Fri, 2021-11-05 at 22:22 +0800, xiubli@redhat.com wrote:
> > From: Xiubo Li <xiubli@redhat.com>
> > 
> > This patch series is based on the "wip-fscrypt-fnames" branch in
> > repo https://github.com/ceph/ceph-client.git.
> > 
> > And I have picked up 5 patches from the "ceph-fscrypt-size-experimental"
> > branch in repo
> > https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git.
> > 
> > ====
> > 
> > This approach is based on the discussion from V1 and V2, which will
> > pass the encrypted last block contents to MDS along with the truncate
> > request.
> > 
> > This will send the encrypted last block contents to MDS along with
> > the truncate request when truncating to a smaller size and at the
> > same time new size does not align to BLOCK SIZE.
> > 
> > The MDS side patch is raised in PR
> > https://github.com/ceph/ceph/pull/43588, which is also based Jeff's
> > previous great work in PR https://github.com/ceph/ceph/pull/41284.
> > 
> > The MDS will use the filer.write_trunc(), which could update and
> > truncate the file in one shot, instead of filer.truncate().
> > 
> > This just assume kclient won't support the inline data feature, which
> > will be remove soon, more detail please see:
> > https://tracker.ceph.com/issues/52916
> > 
> > Changed in V7:
> > - Fixed the sparse check warnings.
> > - Removed the include/linux/ceph/crypto.h header file.
> > 
> > Changed in V6:
> > - Fixed the file hole bug, also have updated the MDS side PR.
> > - Add add object version support for sync read in #8.
> > 
> > 
> > Changed in V5:
> > - Rebase to "wip-fscrypt-fnames" branch in ceph-client.git repo.
> > - Pick up 5 patches from Jeff's "ceph-fscrypt-size-experimental" branch
> >   in linux.git repo.
> > - Add "i_truncate_pagecache_size" member support in ceph_inode_info
> >   struct, this will be used to truncate the pagecache only in kclient
> >   side, because the "i_truncate_size" will always be aligned to BLOCK
> >   SIZE. In fscrypt case we need to use the real size to truncate the
> >   pagecache.
> > 
> > 
> > Changed in V4:
> > - Retry the truncate request by 20 times before fail it with -EAGAIN.
> > - Remove the "fill_last_block" label and move the code to else branch.
> > - Remove the #3 patch, which has already been sent out separately, in
> >   V3 series.
> > - Improve some comments in the code.
> > 
> > 
> > Changed in V3:
> > - Fix possibly corrupting the file just before the MDS acquires the
> >   xlock for FILE lock, another client has updated it.
> > - Flush the pagecache buffer before reading the last block for the
> >   when filling the truncate request.
> > - Some other minore fixes.
> > 
> > 
> > 
> > Jeff Layton (5):
> >   libceph: add CEPH_OSD_OP_ASSERT_VER support
> >   ceph: size handling for encrypted inodes in cap updates
> >   ceph: fscrypt_file field handling in MClientRequest messages
> >   ceph: get file size from fscrypt_file when present in inode traces
> >   ceph: handle fscrypt fields in cap messages from MDS
> > 
> > Xiubo Li (4):
> >   ceph: add __ceph_get_caps helper support
> >   ceph: add __ceph_sync_read helper support
> >   ceph: add object version support for sync read
> >   ceph: add truncate size handling support for fscrypt
> > 
> >  fs/ceph/caps.c                  | 136 ++++++++++++++----
> >  fs/ceph/crypto.h                |  25 ++++
> >  fs/ceph/dir.c                   |   3 +
> >  fs/ceph/file.c                  |  76 ++++++++--
> >  fs/ceph/inode.c                 | 244 +++++++++++++++++++++++++++++---
> >  fs/ceph/mds_client.c            |   9 +-
> >  fs/ceph/mds_client.h            |   2 +
> >  fs/ceph/super.h                 |  25 ++++
> >  include/linux/ceph/osd_client.h |   6 +-
> >  include/linux/ceph/rados.h      |   4 +
> >  net/ceph/osd_client.c           |   5 +
> >  11 files changed, 475 insertions(+), 60 deletions(-)
> > 
> 
> Thanks Xiubo.
> 
> I hit this today after some more testing (generic/014 again):
> 
> [ 1674.146843] libceph: mon0 (2)192.168.1.81:3300 session established
> [ 1674.150902] libceph: client54432 fsid 4e286176-3d8b-11ec-bece-52540031ba78
> [ 1674.153791] ceph: test_dummy_encryption mode enabled
> [ 1719.254308] run fstests generic/014 at 2021-11-05 13:36:26
> [ 1727.157974] 
> [ 1727.158334] =====================================
> [ 1727.159219] WARNING: bad unlock balance detected!
> [ 1727.160707] 5.15.0-rc6+ #53 Tainted: G           OE    
> [ 1727.162248] -------------------------------------
> [ 1727.171918] truncfile/7800 is trying to release lock (&mdsc->snap_rwsem) at:
> [ 1727.180836] [<ffffffffc127438e>] __ceph_setattr+0x85e/0x1270 [ceph]
> [ 1727.192788] but there are no more locks to release!
> [ 1727.203450] 
> [ 1727.203450] other info that might help us debug this:
> [ 1727.220766] 2 locks held by truncfile/7800:
> [ 1727.225548]  #0: ffff888116dd2460 (sb_writers#15){.+.+}-{0:0}, at: do_syscall_64+0x3b/0x90
> [ 1727.234851]  #1: ffff8882d8dac3d0 (&sb->s_type->i_mutex_key#20){++++}-{3:3}, at: do_truncate+0xbe/0x140
> [ 1727.240027] 
> [ 1727.240027] stack backtrace:
> [ 1727.247863] CPU: 3 PID: 7800 Comm: truncfile Tainted: G           OE     5.15.0-rc6+ #53
> [ 1727.252508] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
> [ 1727.257303] Call Trace:
> [ 1727.261503]  dump_stack_lvl+0x57/0x72
> [ 1727.265492]  lock_release.cold+0x49/0x4e
> [ 1727.269499]  ? __ceph_setattr+0x85e/0x1270 [ceph]
> [ 1727.273802]  ? lock_downgrade+0x390/0x390
> [ 1727.277913]  ? preempt_count_sub+0x14/0xc0
> [ 1727.281883]  ? _raw_spin_unlock+0x29/0x40
> [ 1727.285725]  ? __ceph_mark_dirty_caps+0x29f/0x450 [ceph]
> [ 1727.289959]  up_read+0x17/0x20
> [ 1727.293852]  __ceph_setattr+0x85e/0x1270 [ceph]
> [ 1727.297827]  ? ceph_inode_work+0x460/0x460 [ceph]
> [ 1727.301765]  ceph_setattr+0x12d/0x1c0 [ceph]
> [ 1727.305839]  notify_change+0x4e9/0x720
> [ 1727.309762]  ? do_truncate+0xcf/0x140
> [ 1727.313504]  do_truncate+0xcf/0x140
> [ 1727.317092]  ? file_open_root+0x1e0/0x1e0
> [ 1727.321022]  ? lock_release+0x410/0x410
> [ 1727.324769]  ? lock_is_held_type+0xfb/0x130
> [ 1727.328699]  do_sys_ftruncate+0x306/0x350
> [ 1727.332449]  do_syscall_64+0x3b/0x90
> [ 1727.336127]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 1727.340303] RIP: 0033:0x7f7f38be356b
> [ 1727.344445] Code: 77 05 c3 0f 1f 40 00 48 8b 15 09 99 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 d9 98 0c 00 f7 d8
> [ 1727.354258] RSP: 002b:00007ffdee35cd18 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> [ 1727.358964] RAX: ffffffffffffffda RBX: 000000000c8a9d62 RCX: 00007f7f38be356b
> [ 1727.363836] RDX: 000000000c8a9d62 RSI: 000000000c8a9d62 RDI: 0000000000000003
> [ 1727.368467] RBP: 0000000000000003 R08: 000000000000005a R09: 00007f7f38cada60
> [ 1727.373285] R10: 00007f7f38af3700 R11: 0000000000000202 R12: 0000000061856b9d
> [ 1727.377870] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 1727.382578] ------------[ cut here ]------------
> [ 1727.391761] DEBUG_RWSEMS_WARN_ON(tmp < 0): count = 0xffffffffffffff00, magic = 0xffff88812f062220, owner = 0x1, curr 0xffff88810095b280, list empty
> [ 1727.419497] WARNING: CPU: 14 PID: 7800 at kernel/locking/rwsem.c:1297 __up_read+0x404/0x430
> [ 1727.432752] Modules linked in: ceph(OE) libceph(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) bridge(E) ip_set(E) stp(E) llc(E) rfkill(E) nf_tables(E) nfnetlink(E) cachefiles(E) fscache(E) netfs(E) sunrpc(E) iTCO_wdt(E) intel_pmc_bxt(E) iTCO_vendor_support(E) intel_rapl_msr(E) lpc_ich(E) joydev(E) i2c_i801(E) i2c_smbus(E) virtio_balloon(E) intel_rapl_common(E) fuse(E) zram(E) ip_tables(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) virtio_gpu(E) virtio_blk(E) ghash_clmulni_intel(E) virtio_dma_buf(E) virtio_console(E) serio_raw(E) virtio_net(E) drm_kms_helper(E) net_failover(E) cec(E) failover(E) drm(E) qemu_fw_cfg(E)
> [ 1727.506081] CPU: 1 PID: 7800 Comm: truncfile Tainted: G           OE     5.15.0-rc6+ #53
> [ 1727.512691] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
> [ 1727.521863] RIP: 0010:__up_read+0x404/0x430
> [ 1727.528011] Code: 48 8b 55 00 4d 89 f0 4c 89 e1 53 48 c7 c6 e0 db 89 b1 48 c7 c7 00 d9 89 b1 65 4c 8b 3c 25 80 fe 01 00 4d 89 f9 e8 a2 4b 08 01 <0f> 0b 5a e9 b4 fd ff ff be 08 00 00 00 4c 89 e7 e8 57 4d 33 00 f0
> [ 1727.540864] RSP: 0018:ffff888118a07bb0 EFLAGS: 00010286
> [ 1727.556265] RAX: 0000000000000000 RBX: ffffffffb189d840 RCX: 0000000000000000
> [ 1727.571003] RDX: 0000000000000001 RSI: ffffffffb1aa6380 RDI: ffffed1023140f6c
> [ 1727.580837] RBP: ffff88812f062220 R08: ffffffffb0185284 R09: ffff8884209ad7c7
> [ 1727.593908] R10: ffffed1084135af8 R11: 0000000000000001 R12: ffff88812f062220
> [ 1727.605431] R13: 1ffff11023140f79 R14: 0000000000000001 R15: ffff88810095b280
> [ 1727.615457] FS:  00007f7f38add740(0000) GS:ffff888420640000(0000) knlGS:0000000000000000
> [ 1727.623346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1727.631138] CR2: 00007fa7810e8000 CR3: 000000012124a000 CR4: 00000000003506e0
> [ 1727.639503] Call Trace:
> [ 1727.649160]  ? _raw_spin_unlock+0x29/0x40
> [ 1727.655583]  ? up_write+0x270/0x270
> [ 1727.661769]  __ceph_setattr+0x85e/0x1270 [ceph]
> [ 1727.670914]  ? ceph_inode_work+0x460/0x460 [ceph]
> [ 1727.677397]  ceph_setattr+0x12d/0x1c0 [ceph]
> [ 1727.685205]  notify_change+0x4e9/0x720
> [ 1727.694598]  ? do_truncate+0xcf/0x140
> [ 1727.705737]  do_truncate+0xcf/0x140
> [ 1727.712680]  ? file_open_root+0x1e0/0x1e0
> [ 1727.720447]  ? lock_release+0x410/0x410
> [ 1727.727851]  ? lock_is_held_type+0xfb/0x130
> [ 1727.734045]  do_sys_ftruncate+0x306/0x350
> [ 1727.740636]  do_syscall_64+0x3b/0x90
> [ 1727.748675]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 1727.755634] RIP: 0033:0x7f7f38be356b
> [ 1727.763575] Code: 77 05 c3 0f 1f 40 00 48 8b 15 09 99 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 d9 98 0c 00 f7 d8
> [ 1727.777868] RSP: 002b:00007ffdee35cd18 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> [ 1727.792610] RAX: ffffffffffffffda RBX: 000000000c8a9d62 RCX: 00007f7f38be356b
> [ 1727.807383] RDX: 000000000c8a9d62 RSI: 000000000c8a9d62 RDI: 0000000000000003
> [ 1727.821520] RBP: 0000000000000003 R08: 000000000000005a R09: 00007f7f38cada60
> [ 1727.829368] R10: 00007f7f38af3700 R11: 0000000000000202 R12: 0000000061856b9d
> [ 1727.837356] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 1727.849767] irq event stamp: 549109
> [ 1727.863878] hardirqs last  enabled at (549109): [<ffffffffb12f8b54>] _raw_spin_unlock_irq+0x24/0x50
> [ 1727.879034] hardirqs last disabled at (549108): [<ffffffffb12f8d04>] _raw_spin_lock_irq+0x54/0x60
> [ 1727.897434] softirqs last  enabled at (548984): [<ffffffffb013d097>] __irq_exit_rcu+0x157/0x1b0
> [ 1727.913276] softirqs last disabled at (548975): [<ffffffffb013d097>] __irq_exit_rcu+0x157/0x1b0
> [ 1727.933182] ---[ end trace a89de5333b156523 ]---
> 
> 
> 
> I think this patch should fix it:
> 
> [PATCH] SQUASH: ensure we unset lock_snap_rwsem after unlocking it
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/inode.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index eebbd0296004..cb0ad0faee45 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -2635,8 +2635,10 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>  
>  	release &= issued;
>  	spin_unlock(&ci->i_ceph_lock);
> -	if (lock_snap_rwsem)
> +	if (lock_snap_rwsem) {
>  		up_read(&mdsc->snap_rwsem);
> +		lock_snap_rwsem = false;
> +	}
>  
>  	if (inode_dirty_flags)
>  		__mark_inode_dirty(inode, inode_dirty_flags);

Testing with that patch on top of your latest series looks pretty good
so far. I see some xfstests failures that need to be investigated
(generic/075, in particular). I'll take a harder look at that next week.

For now, I've gone ahead and updated wip-fscrypt-fnames to the latest
fnames branch, and also pushed a new wip-fscrypt-size branch that has
all of your patches, with the above SQUASH patch folded into #9.

I'll continue the testing next week, but I think the -size branch is
probably a good place to work from for now.

Thanks!
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v7 0/9] ceph: size handling for the fscrypt
  2021-11-05 20:46   ` Jeff Layton
@ 2021-11-06  1:35     ` Xiubo Li
  2021-11-06 10:50       ` Jeff Layton
  0 siblings, 1 reply; 25+ messages in thread
From: Xiubo Li @ 2021-11-06  1:35 UTC (permalink / raw)
  To: Jeff Layton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel


On 11/6/21 4:46 AM, Jeff Layton wrote:
> On Fri, 2021-11-05 at 14:36 -0400, Jeff Layton wrote:
>> On Fri, 2021-11-05 at 22:22 +0800, xiubli@redhat.com wrote:
>>> From: Xiubo Li <xiubli@redhat.com>
>>>
>>> This patch series is based on the "wip-fscrypt-fnames" branch in
>>> repo https://github.com/ceph/ceph-client.git.
>>>
>>> And I have picked up 5 patches from the "ceph-fscrypt-size-experimental"
>>> branch in repo
>>> https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git.
>>>
>>> ====
>>>
>>> This approach is based on the discussion from V1 and V2, which will
>>> pass the encrypted last block contents to MDS along with the truncate
>>> request.
>>>
>>> This will send the encrypted last block contents to MDS along with
>>> the truncate request when truncating to a smaller size and at the
>>> same time new size does not align to BLOCK SIZE.
>>>
>>> The MDS side patch is raised in PR
>>> https://github.com/ceph/ceph/pull/43588, which is also based Jeff's
>>> previous great work in PR https://github.com/ceph/ceph/pull/41284.
>>>
>>> The MDS will use the filer.write_trunc(), which could update and
>>> truncate the file in one shot, instead of filer.truncate().
>>>
>>> This just assume kclient won't support the inline data feature, which
>>> will be remove soon, more detail please see:
>>> https://tracker.ceph.com/issues/52916
>>>
>>> Changed in V7:
>>> - Fixed the sparse check warnings.
>>> - Removed the include/linux/ceph/crypto.h header file.
>>>
>>> Changed in V6:
>>> - Fixed the file hole bug, also have updated the MDS side PR.
>>> - Add add object version support for sync read in #8.
>>>
>>>
>>> Changed in V5:
>>> - Rebase to "wip-fscrypt-fnames" branch in ceph-client.git repo.
>>> - Pick up 5 patches from Jeff's "ceph-fscrypt-size-experimental" branch
>>>    in linux.git repo.
>>> - Add "i_truncate_pagecache_size" member support in ceph_inode_info
>>>    struct, this will be used to truncate the pagecache only in kclient
>>>    side, because the "i_truncate_size" will always be aligned to BLOCK
>>>    SIZE. In fscrypt case we need to use the real size to truncate the
>>>    pagecache.
>>>
>>>
>>> Changed in V4:
>>> - Retry the truncate request by 20 times before fail it with -EAGAIN.
>>> - Remove the "fill_last_block" label and move the code to else branch.
>>> - Remove the #3 patch, which has already been sent out separately, in
>>>    V3 series.
>>> - Improve some comments in the code.
>>>
>>>
>>> Changed in V3:
>>> - Fix possibly corrupting the file just before the MDS acquires the
>>>    xlock for FILE lock, another client has updated it.
>>> - Flush the pagecache buffer before reading the last block for the
>>>    when filling the truncate request.
>>> - Some other minore fixes.
>>>
>>>
>>>
>>> Jeff Layton (5):
>>>    libceph: add CEPH_OSD_OP_ASSERT_VER support
>>>    ceph: size handling for encrypted inodes in cap updates
>>>    ceph: fscrypt_file field handling in MClientRequest messages
>>>    ceph: get file size from fscrypt_file when present in inode traces
>>>    ceph: handle fscrypt fields in cap messages from MDS
>>>
>>> Xiubo Li (4):
>>>    ceph: add __ceph_get_caps helper support
>>>    ceph: add __ceph_sync_read helper support
>>>    ceph: add object version support for sync read
>>>    ceph: add truncate size handling support for fscrypt
>>>
>>>   fs/ceph/caps.c                  | 136 ++++++++++++++----
>>>   fs/ceph/crypto.h                |  25 ++++
>>>   fs/ceph/dir.c                   |   3 +
>>>   fs/ceph/file.c                  |  76 ++++++++--
>>>   fs/ceph/inode.c                 | 244 +++++++++++++++++++++++++++++---
>>>   fs/ceph/mds_client.c            |   9 +-
>>>   fs/ceph/mds_client.h            |   2 +
>>>   fs/ceph/super.h                 |  25 ++++
>>>   include/linux/ceph/osd_client.h |   6 +-
>>>   include/linux/ceph/rados.h      |   4 +
>>>   net/ceph/osd_client.c           |   5 +
>>>   11 files changed, 475 insertions(+), 60 deletions(-)
>>>
>> Thanks Xiubo.
>>
>> I hit this today after some more testing (generic/014 again):
>>
>> [ 1674.146843] libceph: mon0 (2)192.168.1.81:3300 session established
>> [ 1674.150902] libceph: client54432 fsid 4e286176-3d8b-11ec-bece-52540031ba78
>> [ 1674.153791] ceph: test_dummy_encryption mode enabled
>> [ 1719.254308] run fstests generic/014 at 2021-11-05 13:36:26
>> [ 1727.157974]
>> [ 1727.158334] =====================================
>> [ 1727.159219] WARNING: bad unlock balance detected!
>> [ 1727.160707] 5.15.0-rc6+ #53 Tainted: G           OE
>> [ 1727.162248] -------------------------------------
>> [ 1727.171918] truncfile/7800 is trying to release lock (&mdsc->snap_rwsem) at:
>> [ 1727.180836] [<ffffffffc127438e>] __ceph_setattr+0x85e/0x1270 [ceph]
>> [ 1727.192788] but there are no more locks to release!
>> [ 1727.203450]
>> [ 1727.203450] other info that might help us debug this:
>> [ 1727.220766] 2 locks held by truncfile/7800:
>> [ 1727.225548]  #0: ffff888116dd2460 (sb_writers#15){.+.+}-{0:0}, at: do_syscall_64+0x3b/0x90
>> [ 1727.234851]  #1: ffff8882d8dac3d0 (&sb->s_type->i_mutex_key#20){++++}-{3:3}, at: do_truncate+0xbe/0x140
>> [ 1727.240027]
>> [ 1727.240027] stack backtrace:
>> [ 1727.247863] CPU: 3 PID: 7800 Comm: truncfile Tainted: G           OE     5.15.0-rc6+ #53
>> [ 1727.252508] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
>> [ 1727.257303] Call Trace:
>> [ 1727.261503]  dump_stack_lvl+0x57/0x72
>> [ 1727.265492]  lock_release.cold+0x49/0x4e
>> [ 1727.269499]  ? __ceph_setattr+0x85e/0x1270 [ceph]
>> [ 1727.273802]  ? lock_downgrade+0x390/0x390
>> [ 1727.277913]  ? preempt_count_sub+0x14/0xc0
>> [ 1727.281883]  ? _raw_spin_unlock+0x29/0x40
>> [ 1727.285725]  ? __ceph_mark_dirty_caps+0x29f/0x450 [ceph]
>> [ 1727.289959]  up_read+0x17/0x20
>> [ 1727.293852]  __ceph_setattr+0x85e/0x1270 [ceph]
>> [ 1727.297827]  ? ceph_inode_work+0x460/0x460 [ceph]
>> [ 1727.301765]  ceph_setattr+0x12d/0x1c0 [ceph]
>> [ 1727.305839]  notify_change+0x4e9/0x720
>> [ 1727.309762]  ? do_truncate+0xcf/0x140
>> [ 1727.313504]  do_truncate+0xcf/0x140
>> [ 1727.317092]  ? file_open_root+0x1e0/0x1e0
>> [ 1727.321022]  ? lock_release+0x410/0x410
>> [ 1727.324769]  ? lock_is_held_type+0xfb/0x130
>> [ 1727.328699]  do_sys_ftruncate+0x306/0x350
>> [ 1727.332449]  do_syscall_64+0x3b/0x90
>> [ 1727.336127]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [ 1727.340303] RIP: 0033:0x7f7f38be356b
>> [ 1727.344445] Code: 77 05 c3 0f 1f 40 00 48 8b 15 09 99 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 d9 98 0c 00 f7 d8
>> [ 1727.354258] RSP: 002b:00007ffdee35cd18 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
>> [ 1727.358964] RAX: ffffffffffffffda RBX: 000000000c8a9d62 RCX: 00007f7f38be356b
>> [ 1727.363836] RDX: 000000000c8a9d62 RSI: 000000000c8a9d62 RDI: 0000000000000003
>> [ 1727.368467] RBP: 0000000000000003 R08: 000000000000005a R09: 00007f7f38cada60
>> [ 1727.373285] R10: 00007f7f38af3700 R11: 0000000000000202 R12: 0000000061856b9d
>> [ 1727.377870] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>> [ 1727.382578] ------------[ cut here ]------------
>> [ 1727.391761] DEBUG_RWSEMS_WARN_ON(tmp < 0): count = 0xffffffffffffff00, magic = 0xffff88812f062220, owner = 0x1, curr 0xffff88810095b280, list empty
>> [ 1727.419497] WARNING: CPU: 14 PID: 7800 at kernel/locking/rwsem.c:1297 __up_read+0x404/0x430
>> [ 1727.432752] Modules linked in: ceph(OE) libceph(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) bridge(E) ip_set(E) stp(E) llc(E) rfkill(E) nf_tables(E) nfnetlink(E) cachefiles(E) fscache(E) netfs(E) sunrpc(E) iTCO_wdt(E) intel_pmc_bxt(E) iTCO_vendor_support(E) intel_rapl_msr(E) lpc_ich(E) joydev(E) i2c_i801(E) i2c_smbus(E) virtio_balloon(E) intel_rapl_common(E) fuse(E) zram(E) ip_tables(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) virtio_gpu(E) virtio_blk(E) ghash_clmulni_intel(E) virtio_dma_buf(E) virtio_console(E) serio_raw(E) virtio_net(E) drm_kms_helper(E) net_failover(E) cec(E) failover(E) drm(E) qemu_fw_cfg(E)
>> [ 1727.506081] CPU: 1 PID: 7800 Comm: truncfile Tainted: G           OE     5.15.0-rc6+ #53
>> [ 1727.512691] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
>> [ 1727.521863] RIP: 0010:__up_read+0x404/0x430
>> [ 1727.528011] Code: 48 8b 55 00 4d 89 f0 4c 89 e1 53 48 c7 c6 e0 db 89 b1 48 c7 c7 00 d9 89 b1 65 4c 8b 3c 25 80 fe 01 00 4d 89 f9 e8 a2 4b 08 01 <0f> 0b 5a e9 b4 fd ff ff be 08 00 00 00 4c 89 e7 e8 57 4d 33 00 f0
>> [ 1727.540864] RSP: 0018:ffff888118a07bb0 EFLAGS: 00010286
>> [ 1727.556265] RAX: 0000000000000000 RBX: ffffffffb189d840 RCX: 0000000000000000
>> [ 1727.571003] RDX: 0000000000000001 RSI: ffffffffb1aa6380 RDI: ffffed1023140f6c
>> [ 1727.580837] RBP: ffff88812f062220 R08: ffffffffb0185284 R09: ffff8884209ad7c7
>> [ 1727.593908] R10: ffffed1084135af8 R11: 0000000000000001 R12: ffff88812f062220
>> [ 1727.605431] R13: 1ffff11023140f79 R14: 0000000000000001 R15: ffff88810095b280
>> [ 1727.615457] FS:  00007f7f38add740(0000) GS:ffff888420640000(0000) knlGS:0000000000000000
>> [ 1727.623346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1727.631138] CR2: 00007fa7810e8000 CR3: 000000012124a000 CR4: 00000000003506e0
>> [ 1727.639503] Call Trace:
>> [ 1727.649160]  ? _raw_spin_unlock+0x29/0x40
>> [ 1727.655583]  ? up_write+0x270/0x270
>> [ 1727.661769]  __ceph_setattr+0x85e/0x1270 [ceph]
>> [ 1727.670914]  ? ceph_inode_work+0x460/0x460 [ceph]
>> [ 1727.677397]  ceph_setattr+0x12d/0x1c0 [ceph]
>> [ 1727.685205]  notify_change+0x4e9/0x720
>> [ 1727.694598]  ? do_truncate+0xcf/0x140
>> [ 1727.705737]  do_truncate+0xcf/0x140
>> [ 1727.712680]  ? file_open_root+0x1e0/0x1e0
>> [ 1727.720447]  ? lock_release+0x410/0x410
>> [ 1727.727851]  ? lock_is_held_type+0xfb/0x130
>> [ 1727.734045]  do_sys_ftruncate+0x306/0x350
>> [ 1727.740636]  do_syscall_64+0x3b/0x90
>> [ 1727.748675]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [ 1727.755634] RIP: 0033:0x7f7f38be356b
>> [ 1727.763575] Code: 77 05 c3 0f 1f 40 00 48 8b 15 09 99 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 d9 98 0c 00 f7 d8
>> [ 1727.777868] RSP: 002b:00007ffdee35cd18 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
>> [ 1727.792610] RAX: ffffffffffffffda RBX: 000000000c8a9d62 RCX: 00007f7f38be356b
>> [ 1727.807383] RDX: 000000000c8a9d62 RSI: 000000000c8a9d62 RDI: 0000000000000003
>> [ 1727.821520] RBP: 0000000000000003 R08: 000000000000005a R09: 00007f7f38cada60
>> [ 1727.829368] R10: 00007f7f38af3700 R11: 0000000000000202 R12: 0000000061856b9d
>> [ 1727.837356] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>> [ 1727.849767] irq event stamp: 549109
>> [ 1727.863878] hardirqs last  enabled at (549109): [<ffffffffb12f8b54>] _raw_spin_unlock_irq+0x24/0x50
>> [ 1727.879034] hardirqs last disabled at (549108): [<ffffffffb12f8d04>] _raw_spin_lock_irq+0x54/0x60
>> [ 1727.897434] softirqs last  enabled at (548984): [<ffffffffb013d097>] __irq_exit_rcu+0x157/0x1b0
>> [ 1727.913276] softirqs last disabled at (548975): [<ffffffffb013d097>] __irq_exit_rcu+0x157/0x1b0
>> [ 1727.933182] ---[ end trace a89de5333b156523 ]---
>>
>>
>>
>> I think this patch should fix it:
>>
>> [PATCH] SQUASH: ensure we unset lock_snap_rwsem after unlocking it
>>
>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>> ---
>>   fs/ceph/inode.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> index eebbd0296004..cb0ad0faee45 100644
>> --- a/fs/ceph/inode.c
>> +++ b/fs/ceph/inode.c
>> @@ -2635,8 +2635,10 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>   
>>   	release &= issued;
>>   	spin_unlock(&ci->i_ceph_lock);
>> -	if (lock_snap_rwsem)
>> +	if (lock_snap_rwsem) {
>>   		up_read(&mdsc->snap_rwsem);
>> +		lock_snap_rwsem = false;
>> +	}
>>   
>>   	if (inode_dirty_flags)
>>   		__mark_inode_dirty(inode, inode_dirty_flags);
> Testing with that patch on top of your latest series looks pretty good
> so far.

Cool.

>   I see some xfstests failures that need to be investigated
> (generic/075, in particular). I'll take a harder look at that next week.
I will also try this.
> For now, I've gone ahead and updated wip-fscrypt-fnames to the latest
> fnames branch, and also pushed a new wip-fscrypt-size branch that has
> all of your patches, with the above SQUASH patch folded into #9.
>
> I'll continue the testing next week, but I think the -size branch is
> probably a good place to work from for now.

BTW, what's your test script for xfstests? I may be missing something important.

BRs

Thanks


>
> Thanks!



* Re: [PATCH v7 0/9] ceph: size handling for the fscrypt
  2021-11-06  1:35     ` Xiubo Li
@ 2021-11-06 10:50       ` Jeff Layton
  2021-11-06 10:51         ` Jeff Layton
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2021-11-06 10:50 UTC (permalink / raw)
  To: Xiubo Li; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel

On Sat, 2021-11-06 at 09:35 +0800, Xiubo Li wrote:
> On 11/6/21 4:46 AM, Jeff Layton wrote:
> > On Fri, 2021-11-05 at 14:36 -0400, Jeff Layton wrote:
> > > On Fri, 2021-11-05 at 22:22 +0800, xiubli@redhat.com wrote:
> > > > From: Xiubo Li <xiubli@redhat.com>
> > > > 
> > > > This patch series is based on the "wip-fscrypt-fnames" branch in
> > > > repo https://github.com/ceph/ceph-client.git.
> > > > 
> > > > And I have picked up 5 patches from the "ceph-fscrypt-size-experimental"
> > > > branch in repo
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git.
> > > > 
> > > > ====
> > > > 
> > > > This approach is based on the discussion from V1 and V2, which will
> > > > pass the encrypted last block contents to MDS along with the truncate
> > > > request.
> > > > 
> > > > This will send the encrypted last block contents to MDS along with
> > > > the truncate request when truncating to a smaller size and at the
> > > > same time new size does not align to BLOCK SIZE.
> > > > 
> > > > The MDS side patch is raised in PR
> > > > https://github.com/ceph/ceph/pull/43588, which is also based Jeff's
> > > > previous great work in PR https://github.com/ceph/ceph/pull/41284.
> > > > 
> > > > The MDS will use the filer.write_trunc(), which could update and
> > > > truncate the file in one shot, instead of filer.truncate().
> > > > 
> > > > This just assume kclient won't support the inline data feature, which
> > > > will be remove soon, more detail please see:
> > > > https://tracker.ceph.com/issues/52916
> > > > 
> > > > Changed in V7:
> > > > - Fixed the sparse check warnings.
> > > > - Removed the include/linux/ceph/crypto.h header file.
> > > > 
> > > > Changed in V6:
> > > > - Fixed the file hole bug, also have updated the MDS side PR.
> > > > - Add add object version support for sync read in #8.
> > > > 
> > > > 
> > > > Changed in V5:
> > > > - Rebase to "wip-fscrypt-fnames" branch in ceph-client.git repo.
> > > > - Pick up 5 patches from Jeff's "ceph-fscrypt-size-experimental" branch
> > > >    in linux.git repo.
> > > > - Add "i_truncate_pagecache_size" member support in ceph_inode_info
> > > >    struct, this will be used to truncate the pagecache only in kclient
> > > >    side, because the "i_truncate_size" will always be aligned to BLOCK
> > > >    SIZE. In fscrypt case we need to use the real size to truncate the
> > > >    pagecache.
> > > > 
> > > > 
> > > > Changed in V4:
> > > > - Retry the truncate request by 20 times before fail it with -EAGAIN.
> > > > - Remove the "fill_last_block" label and move the code to else branch.
> > > > - Remove the #3 patch, which has already been sent out separately, in
> > > >    V3 series.
> > > > - Improve some comments in the code.
> > > > 
> > > > 
> > > > Changed in V3:
> > > > - Fix possibly corrupting the file just before the MDS acquires the
> > > >    xlock for FILE lock, another client has updated it.
> > > > - Flush the pagecache buffer before reading the last block for the
> > > >    when filling the truncate request.
> > > > - Some other minore fixes.
> > > > 
> > > > 
> > > > 
> > > > Jeff Layton (5):
> > > >    libceph: add CEPH_OSD_OP_ASSERT_VER support
> > > >    ceph: size handling for encrypted inodes in cap updates
> > > >    ceph: fscrypt_file field handling in MClientRequest messages
> > > >    ceph: get file size from fscrypt_file when present in inode traces
> > > >    ceph: handle fscrypt fields in cap messages from MDS
> > > > 
> > > > Xiubo Li (4):
> > > >    ceph: add __ceph_get_caps helper support
> > > >    ceph: add __ceph_sync_read helper support
> > > >    ceph: add object version support for sync read
> > > >    ceph: add truncate size handling support for fscrypt
> > > > 
> > > >   fs/ceph/caps.c                  | 136 ++++++++++++++----
> > > >   fs/ceph/crypto.h                |  25 ++++
> > > >   fs/ceph/dir.c                   |   3 +
> > > >   fs/ceph/file.c                  |  76 ++++++++--
> > > >   fs/ceph/inode.c                 | 244 +++++++++++++++++++++++++++++---
> > > >   fs/ceph/mds_client.c            |   9 +-
> > > >   fs/ceph/mds_client.h            |   2 +
> > > >   fs/ceph/super.h                 |  25 ++++
> > > >   include/linux/ceph/osd_client.h |   6 +-
> > > >   include/linux/ceph/rados.h      |   4 +
> > > >   net/ceph/osd_client.c           |   5 +
> > > >   11 files changed, 475 insertions(+), 60 deletions(-)
> > > > 
> > > Thanks Xiubo.
> > > 
> > > I hit this today after some more testing (generic/014 again):
> > > 
> > > [ 1674.146843] libceph: mon0 (2)192.168.1.81:3300 session established
> > > [ 1674.150902] libceph: client54432 fsid 4e286176-3d8b-11ec-bece-52540031ba78
> > > [ 1674.153791] ceph: test_dummy_encryption mode enabled
> > > [ 1719.254308] run fstests generic/014 at 2021-11-05 13:36:26
> > > [ 1727.157974]
> > > [ 1727.158334] =====================================
> > > [ 1727.159219] WARNING: bad unlock balance detected!
> > > [ 1727.160707] 5.15.0-rc6+ #53 Tainted: G           OE
> > > [ 1727.162248] -------------------------------------
> > > [ 1727.171918] truncfile/7800 is trying to release lock (&mdsc->snap_rwsem) at:
> > > [ 1727.180836] [<ffffffffc127438e>] __ceph_setattr+0x85e/0x1270 [ceph]
> > > [ 1727.192788] but there are no more locks to release!
> > > [ 1727.203450]
> > > [ 1727.203450] other info that might help us debug this:
> > > [ 1727.220766] 2 locks held by truncfile/7800:
> > > [ 1727.225548]  #0: ffff888116dd2460 (sb_writers#15){.+.+}-{0:0}, at: do_syscall_64+0x3b/0x90
> > > [ 1727.234851]  #1: ffff8882d8dac3d0 (&sb->s_type->i_mutex_key#20){++++}-{3:3}, at: do_truncate+0xbe/0x140
> > > [ 1727.240027]
> > > [ 1727.240027] stack backtrace:
> > > [ 1727.247863] CPU: 3 PID: 7800 Comm: truncfile Tainted: G           OE     5.15.0-rc6+ #53
> > > [ 1727.252508] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
> > > [ 1727.257303] Call Trace:
> > > [ 1727.261503]  dump_stack_lvl+0x57/0x72
> > > [ 1727.265492]  lock_release.cold+0x49/0x4e
> > > [ 1727.269499]  ? __ceph_setattr+0x85e/0x1270 [ceph]
> > > [ 1727.273802]  ? lock_downgrade+0x390/0x390
> > > [ 1727.277913]  ? preempt_count_sub+0x14/0xc0
> > > [ 1727.281883]  ? _raw_spin_unlock+0x29/0x40
> > > [ 1727.285725]  ? __ceph_mark_dirty_caps+0x29f/0x450 [ceph]
> > > [ 1727.289959]  up_read+0x17/0x20
> > > [ 1727.293852]  __ceph_setattr+0x85e/0x1270 [ceph]
> > > [ 1727.297827]  ? ceph_inode_work+0x460/0x460 [ceph]
> > > [ 1727.301765]  ceph_setattr+0x12d/0x1c0 [ceph]
> > > [ 1727.305839]  notify_change+0x4e9/0x720
> > > [ 1727.309762]  ? do_truncate+0xcf/0x140
> > > [ 1727.313504]  do_truncate+0xcf/0x140
> > > [ 1727.317092]  ? file_open_root+0x1e0/0x1e0
> > > [ 1727.321022]  ? lock_release+0x410/0x410
> > > [ 1727.324769]  ? lock_is_held_type+0xfb/0x130
> > > [ 1727.328699]  do_sys_ftruncate+0x306/0x350
> > > [ 1727.332449]  do_syscall_64+0x3b/0x90
> > > [ 1727.336127]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [ 1727.340303] RIP: 0033:0x7f7f38be356b
> > > [ 1727.344445] Code: 77 05 c3 0f 1f 40 00 48 8b 15 09 99 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 d9 98 0c 00 f7 d8
> > > [ 1727.354258] RSP: 002b:00007ffdee35cd18 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> > > [ 1727.358964] RAX: ffffffffffffffda RBX: 000000000c8a9d62 RCX: 00007f7f38be356b
> > > [ 1727.363836] RDX: 000000000c8a9d62 RSI: 000000000c8a9d62 RDI: 0000000000000003
> > > [ 1727.368467] RBP: 0000000000000003 R08: 000000000000005a R09: 00007f7f38cada60
> > > [ 1727.373285] R10: 00007f7f38af3700 R11: 0000000000000202 R12: 0000000061856b9d
> > > [ 1727.377870] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > [ 1727.382578] ------------[ cut here ]------------
> > > [ 1727.391761] DEBUG_RWSEMS_WARN_ON(tmp < 0): count = 0xffffffffffffff00, magic = 0xffff88812f062220, owner = 0x1, curr 0xffff88810095b280, list empty
> > > [ 1727.419497] WARNING: CPU: 14 PID: 7800 at kernel/locking/rwsem.c:1297 __up_read+0x404/0x430
> > > [ 1727.432752] Modules linked in: ceph(OE) libceph(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) bridge(E) ip_set(E) stp(E) llc(E) rfkill(E) nf_tables(E) nfnetlink(E) cachefiles(E) fscache(E) netfs(E) sunrpc(E) iTCO_wdt(E) intel_pmc_bxt(E) iTCO_vendor_support(E) intel_rapl_msr(E) lpc_ich(E) joydev(E) i2c_i801(E) i2c_smbus(E) virtio_balloon(E) intel_rapl_common(E) fuse(E) zram(E) ip_tables(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) virtio_gpu(E) virtio_blk(E) ghash_clmulni_intel(E) virtio_dma_buf(E) virtio_console(E) serio_raw(E) virtio_net(E) drm_kms_helper(E) net_failover(E) cec(E) failover(E) drm(E) qemu_fw_cfg(E)
> > > [ 1727.506081] CPU: 1 PID: 7800 Comm: truncfile Tainted: G           OE     5.15.0-rc6+ #53
> > > [ 1727.512691] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
> > > [ 1727.521863] RIP: 0010:__up_read+0x404/0x430
> > > [ 1727.528011] Code: 48 8b 55 00 4d 89 f0 4c 89 e1 53 48 c7 c6 e0 db 89 b1 48 c7 c7 00 d9 89 b1 65 4c 8b 3c 25 80 fe 01 00 4d 89 f9 e8 a2 4b 08 01 <0f> 0b 5a e9 b4 fd ff ff be 08 00 00 00 4c 89 e7 e8 57 4d 33 00 f0
> > > [ 1727.540864] RSP: 0018:ffff888118a07bb0 EFLAGS: 00010286
> > > [ 1727.556265] RAX: 0000000000000000 RBX: ffffffffb189d840 RCX: 0000000000000000
> > > [ 1727.571003] RDX: 0000000000000001 RSI: ffffffffb1aa6380 RDI: ffffed1023140f6c
> > > [ 1727.580837] RBP: ffff88812f062220 R08: ffffffffb0185284 R09: ffff8884209ad7c7
> > > [ 1727.593908] R10: ffffed1084135af8 R11: 0000000000000001 R12: ffff88812f062220
> > > [ 1727.605431] R13: 1ffff11023140f79 R14: 0000000000000001 R15: ffff88810095b280
> > > [ 1727.615457] FS:  00007f7f38add740(0000) GS:ffff888420640000(0000) knlGS:0000000000000000
> > > [ 1727.623346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 1727.631138] CR2: 00007fa7810e8000 CR3: 000000012124a000 CR4: 00000000003506e0
> > > [ 1727.639503] Call Trace:
> > > [ 1727.649160]  ? _raw_spin_unlock+0x29/0x40
> > > [ 1727.655583]  ? up_write+0x270/0x270
> > > [ 1727.661769]  __ceph_setattr+0x85e/0x1270 [ceph]
> > > [ 1727.670914]  ? ceph_inode_work+0x460/0x460 [ceph]
> > > [ 1727.677397]  ceph_setattr+0x12d/0x1c0 [ceph]
> > > [ 1727.685205]  notify_change+0x4e9/0x720
> > > [ 1727.694598]  ? do_truncate+0xcf/0x140
> > > [ 1727.705737]  do_truncate+0xcf/0x140
> > > [ 1727.712680]  ? file_open_root+0x1e0/0x1e0
> > > [ 1727.720447]  ? lock_release+0x410/0x410
> > > [ 1727.727851]  ? lock_is_held_type+0xfb/0x130
> > > [ 1727.734045]  do_sys_ftruncate+0x306/0x350
> > > [ 1727.740636]  do_syscall_64+0x3b/0x90
> > > [ 1727.748675]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [ 1727.755634] RIP: 0033:0x7f7f38be356b
> > > [ 1727.763575] Code: 77 05 c3 0f 1f 40 00 48 8b 15 09 99 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 d9 98 0c 00 f7 d8
> > > [ 1727.777868] RSP: 002b:00007ffdee35cd18 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> > > [ 1727.792610] RAX: ffffffffffffffda RBX: 000000000c8a9d62 RCX: 00007f7f38be356b
> > > [ 1727.807383] RDX: 000000000c8a9d62 RSI: 000000000c8a9d62 RDI: 0000000000000003
> > > [ 1727.821520] RBP: 0000000000000003 R08: 000000000000005a R09: 00007f7f38cada60
> > > [ 1727.829368] R10: 00007f7f38af3700 R11: 0000000000000202 R12: 0000000061856b9d
> > > [ 1727.837356] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > [ 1727.849767] irq event stamp: 549109
> > > [ 1727.863878] hardirqs last  enabled at (549109): [<ffffffffb12f8b54>] _raw_spin_unlock_irq+0x24/0x50
> > > [ 1727.879034] hardirqs last disabled at (549108): [<ffffffffb12f8d04>] _raw_spin_lock_irq+0x54/0x60
> > > [ 1727.897434] softirqs last  enabled at (548984): [<ffffffffb013d097>] __irq_exit_rcu+0x157/0x1b0
> > > [ 1727.913276] softirqs last disabled at (548975): [<ffffffffb013d097>] __irq_exit_rcu+0x157/0x1b0
> > > [ 1727.933182] ---[ end trace a89de5333b156523 ]---
> > > 
> > > 
> > > 
> > > I think this patch should fix it:
> > > 
> > > [PATCH] SQUASH: ensure we unset lock_snap_rwsem after unlocking it
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >   fs/ceph/inode.c | 4 +++-
> > >   1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > > index eebbd0296004..cb0ad0faee45 100644
> > > --- a/fs/ceph/inode.c
> > > +++ b/fs/ceph/inode.c
> > > @@ -2635,8 +2635,10 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
> > >   
> > >   	release &= issued;
> > >   	spin_unlock(&ci->i_ceph_lock);
> > > -	if (lock_snap_rwsem)
> > > +	if (lock_snap_rwsem) {
> > >   		up_read(&mdsc->snap_rwsem);
> > > +		lock_snap_rwsem = false;
> > > +	}
> > >   
> > >   	if (inode_dirty_flags)
> > >   		__mark_inode_dirty(inode, inode_dirty_flags);
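
For anyone following along, here is a minimal userspace sketch of why
clearing the flag matters. It is not the real __ceph_setattr code:
toy_rwsem, toy_down_read, toy_up_read and model_setattr are made-up
names, and I'm assuming the function has a later path that checks
lock_snap_rwsem again before unlocking.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy reader count standing in for mdsc->snap_rwsem (not the kernel API). */
struct toy_rwsem {
	int readers;
	int bad_unlocks;	/* releases attempted with no reader held */
};

static void toy_down_read(struct toy_rwsem *sem)
{
	sem->readers++;
}

static void toy_up_read(struct toy_rwsem *sem)
{
	if (sem->readers == 0) {
		/* The case lockdep reports as "bad unlock balance". */
		sem->bad_unlocks++;
		return;
	}
	sem->readers--;
}

/*
 * Hypothetical shape of a setattr-style function: it drops the lock
 * early and later reaches an exit path that also checks the flag.
 * Returns the number of unbalanced unlocks observed.
 */
static int model_setattr(bool clear_flag_after_unlock)
{
	struct toy_rwsem sem = { 0, 0 };
	bool lock_snap_rwsem = false;

	toy_down_read(&sem);
	lock_snap_rwsem = true;

	/* Early release, as in the hunk above. */
	if (lock_snap_rwsem) {
		toy_up_read(&sem);
		if (clear_flag_after_unlock)
			lock_snap_rwsem = false;
	}

	/* Later exit path: releases only if the flag says we still hold it. */
	if (lock_snap_rwsem)
		toy_up_read(&sem);

	/* The refused release never drives the count below zero. */
	assert(sem.readers == 0);
	return sem.bad_unlocks;
}
```

Calling model_setattr(false) models the code before the SQUASH patch and
records one unbalanced release; model_setattr(true) models it with the
flag reset and records none.
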
> > Testing with that patch on top of your latest series looks pretty good
> > so far.
> 
> Cool.
> 
> >   I see some xfstests failures that need to be investigated
> > (generic/075, in particular). I'll take a harder look at that next week.
> I will also try this.
> > For now, I've gone ahead and updated wip-fscrypt-fnames to the latest
> > fnames branch, and also pushed a new wip-fscrypt-size branch that has
> > all of your patches, with the above SQUASH patch folded into #9.
> > 
> > I'll continue the testing next week, but I think the -size branch is
> > probably a good place to work from for now.
> 
> BTW, what's your test script for xfstests? I may be missing something important.
> 

I'm mainly running:

    $ sudo ./check -g quick -E ./ceph.exclude

...and ceph.exclude has:

ceph/001
generic/003
generic/531
generic/538

...most of the exclusions are because they take a long time to run.
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v7 0/9] ceph: size handling for the fscrypt
  2021-11-06 10:50       ` Jeff Layton
@ 2021-11-06 10:51         ` Jeff Layton
  2021-11-07  9:44           ` Xiubo Li
  2021-11-08  3:22           ` Xiubo Li
  0 siblings, 2 replies; 25+ messages in thread
From: Jeff Layton @ 2021-11-06 10:51 UTC (permalink / raw)
  To: Xiubo Li; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel

On Sat, 2021-11-06 at 06:50 -0400, Jeff Layton wrote:
> On Sat, 2021-11-06 at 09:35 +0800, Xiubo Li wrote:
> > On 11/6/21 4:46 AM, Jeff Layton wrote:
> > > On Fri, 2021-11-05 at 14:36 -0400, Jeff Layton wrote:
> > > > On Fri, 2021-11-05 at 22:22 +0800, xiubli@redhat.com wrote:
> > > > > From: Xiubo Li <xiubli@redhat.com>
> > > > > 
> > > > > This patch series is based on the "wip-fscrypt-fnames" branch in
> > > > > repo https://github.com/ceph/ceph-client.git.
> > > > > 
> > > > > And I have picked up 5 patches from the "ceph-fscrypt-size-experimental"
> > > > > branch in repo
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git.
> > > > > 
> > > > > ====
> > > > > 
> > > > > This approach is based on the discussion from V1 and V2, which will
> > > > > pass the encrypted last block contents to MDS along with the truncate
> > > > > request.
> > > > > 
> > > > > This will send the encrypted last block contents to MDS along with
> > > > > the truncate request when truncating to a smaller size and at the
> > > > > same time new size does not align to BLOCK SIZE.
> > > > > 
> > > > > The MDS side patch is raised in PR
> > > > > > https://github.com/ceph/ceph/pull/43588, which is also based on Jeff's
> > > > > previous great work in PR https://github.com/ceph/ceph/pull/41284.
> > > > > 
> > > > > The MDS will use the filer.write_trunc(), which could update and
> > > > > truncate the file in one shot, instead of filer.truncate().
> > > > > 
> > > > > > This assumes that kclient won't support the inline data feature, which
> > > > > > will be removed soon. For more detail, please see:
> > > > > https://tracker.ceph.com/issues/52916
> > > > > 
> > > > > Changed in V7:
> > > > > - Fixed the sparse check warnings.
> > > > > - Removed the include/linux/ceph/crypto.h header file.
> > > > > 
> > > > > Changed in V6:
> > > > > - Fixed the file hole bug, also have updated the MDS side PR.
> > > > > > - Add object version support for sync read in #8.
> > > > > 
> > > > > 
> > > > > Changed in V5:
> > > > > - Rebase to "wip-fscrypt-fnames" branch in ceph-client.git repo.
> > > > > - Pick up 5 patches from Jeff's "ceph-fscrypt-size-experimental" branch
> > > > >    in linux.git repo.
> > > > > - Add "i_truncate_pagecache_size" member support in ceph_inode_info
> > > > >    struct, this will be used to truncate the pagecache only in kclient
> > > > >    side, because the "i_truncate_size" will always be aligned to BLOCK
> > > > >    SIZE. In fscrypt case we need to use the real size to truncate the
> > > > >    pagecache.
> > > > > 
> > > > > 
> > > > > Changed in V4:
> > > > > > - Retry the truncate request up to 20 times before failing it with -EAGAIN.
> > > > > - Remove the "fill_last_block" label and move the code to else branch.
> > > > > - Remove the #3 patch, which has already been sent out separately, in
> > > > >    V3 series.
> > > > > - Improve some comments in the code.
> > > > > 
> > > > > 
> > > > > Changed in V3:
> > > > > > - Fix possible file corruption when another client updates the file
> > > > > >    just before the MDS acquires the xlock for the FILE lock.
> > > > > > - Flush the pagecache buffer before reading the last block when
> > > > > >    filling the truncate request.
> > > > > > - Some other minor fixes.
> > > > > 
> > > > > 
> > > > > 
> > > > > Jeff Layton (5):
> > > > >    libceph: add CEPH_OSD_OP_ASSERT_VER support
> > > > >    ceph: size handling for encrypted inodes in cap updates
> > > > >    ceph: fscrypt_file field handling in MClientRequest messages
> > > > >    ceph: get file size from fscrypt_file when present in inode traces
> > > > >    ceph: handle fscrypt fields in cap messages from MDS
> > > > > 
> > > > > Xiubo Li (4):
> > > > >    ceph: add __ceph_get_caps helper support
> > > > >    ceph: add __ceph_sync_read helper support
> > > > >    ceph: add object version support for sync read
> > > > >    ceph: add truncate size handling support for fscrypt
> > > > > 
> > > > >   fs/ceph/caps.c                  | 136 ++++++++++++++----
> > > > >   fs/ceph/crypto.h                |  25 ++++
> > > > >   fs/ceph/dir.c                   |   3 +
> > > > >   fs/ceph/file.c                  |  76 ++++++++--
> > > > >   fs/ceph/inode.c                 | 244 +++++++++++++++++++++++++++++---
> > > > >   fs/ceph/mds_client.c            |   9 +-
> > > > >   fs/ceph/mds_client.h            |   2 +
> > > > >   fs/ceph/super.h                 |  25 ++++
> > > > >   include/linux/ceph/osd_client.h |   6 +-
> > > > >   include/linux/ceph/rados.h      |   4 +
> > > > >   net/ceph/osd_client.c           |   5 +
> > > > >   11 files changed, 475 insertions(+), 60 deletions(-)
> > > > > 
> > > > Thanks Xiubo.
> > > > 
> > > > I hit this today after some more testing (generic/014 again):
> > > > 
> > > > [ 1674.146843] libceph: mon0 (2)192.168.1.81:3300 session established
> > > > [ 1674.150902] libceph: client54432 fsid 4e286176-3d8b-11ec-bece-52540031ba78
> > > > [ 1674.153791] ceph: test_dummy_encryption mode enabled
> > > > [ 1719.254308] run fstests generic/014 at 2021-11-05 13:36:26
> > > > [ 1727.157974]
> > > > [ 1727.158334] =====================================
> > > > [ 1727.159219] WARNING: bad unlock balance detected!
> > > > [ 1727.160707] 5.15.0-rc6+ #53 Tainted: G           OE
> > > > [ 1727.162248] -------------------------------------
> > > > [ 1727.171918] truncfile/7800 is trying to release lock (&mdsc->snap_rwsem) at:
> > > > [ 1727.180836] [<ffffffffc127438e>] __ceph_setattr+0x85e/0x1270 [ceph]
> > > > [ 1727.192788] but there are no more locks to release!
> > > > [ 1727.203450]
> > > > [ 1727.203450] other info that might help us debug this:
> > > > [ 1727.220766] 2 locks held by truncfile/7800:
> > > > [ 1727.225548]  #0: ffff888116dd2460 (sb_writers#15){.+.+}-{0:0}, at: do_syscall_64+0x3b/0x90
> > > > [ 1727.234851]  #1: ffff8882d8dac3d0 (&sb->s_type->i_mutex_key#20){++++}-{3:3}, at: do_truncate+0xbe/0x140
> > > > [ 1727.240027]
> > > > [ 1727.240027] stack backtrace:
> > > > [ 1727.247863] CPU: 3 PID: 7800 Comm: truncfile Tainted: G           OE     5.15.0-rc6+ #53
> > > > [ 1727.252508] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
> > > > [ 1727.257303] Call Trace:
> > > > [ 1727.261503]  dump_stack_lvl+0x57/0x72
> > > > [ 1727.265492]  lock_release.cold+0x49/0x4e
> > > > [ 1727.269499]  ? __ceph_setattr+0x85e/0x1270 [ceph]
> > > > [ 1727.273802]  ? lock_downgrade+0x390/0x390
> > > > [ 1727.277913]  ? preempt_count_sub+0x14/0xc0
> > > > [ 1727.281883]  ? _raw_spin_unlock+0x29/0x40
> > > > [ 1727.285725]  ? __ceph_mark_dirty_caps+0x29f/0x450 [ceph]
> > > > [ 1727.289959]  up_read+0x17/0x20
> > > > [ 1727.293852]  __ceph_setattr+0x85e/0x1270 [ceph]
> > > > [ 1727.297827]  ? ceph_inode_work+0x460/0x460 [ceph]
> > > > [ 1727.301765]  ceph_setattr+0x12d/0x1c0 [ceph]
> > > > [ 1727.305839]  notify_change+0x4e9/0x720
> > > > [ 1727.309762]  ? do_truncate+0xcf/0x140
> > > > [ 1727.313504]  do_truncate+0xcf/0x140
> > > > [ 1727.317092]  ? file_open_root+0x1e0/0x1e0
> > > > [ 1727.321022]  ? lock_release+0x410/0x410
> > > > [ 1727.324769]  ? lock_is_held_type+0xfb/0x130
> > > > [ 1727.328699]  do_sys_ftruncate+0x306/0x350
> > > > [ 1727.332449]  do_syscall_64+0x3b/0x90
> > > > [ 1727.336127]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > > [ 1727.340303] RIP: 0033:0x7f7f38be356b
> > > > [ 1727.344445] Code: 77 05 c3 0f 1f 40 00 48 8b 15 09 99 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 d9 98 0c 00 f7 d8
> > > > [ 1727.354258] RSP: 002b:00007ffdee35cd18 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> > > > [ 1727.358964] RAX: ffffffffffffffda RBX: 000000000c8a9d62 RCX: 00007f7f38be356b
> > > > [ 1727.363836] RDX: 000000000c8a9d62 RSI: 000000000c8a9d62 RDI: 0000000000000003
> > > > [ 1727.368467] RBP: 0000000000000003 R08: 000000000000005a R09: 00007f7f38cada60
> > > > [ 1727.373285] R10: 00007f7f38af3700 R11: 0000000000000202 R12: 0000000061856b9d
> > > > [ 1727.377870] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > [ 1727.382578] ------------[ cut here ]------------
> > > > [ 1727.391761] DEBUG_RWSEMS_WARN_ON(tmp < 0): count = 0xffffffffffffff00, magic = 0xffff88812f062220, owner = 0x1, curr 0xffff88810095b280, list empty
> > > > [ 1727.419497] WARNING: CPU: 14 PID: 7800 at kernel/locking/rwsem.c:1297 __up_read+0x404/0x430
> > > > [ 1727.432752] Modules linked in: ceph(OE) libceph(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) bridge(E) ip_set(E) stp(E) llc(E) rfkill(E) nf_tables(E) nfnetlink(E) cachefiles(E) fscache(E) netfs(E) sunrpc(E) iTCO_wdt(E) intel_pmc_bxt(E) iTCO_vendor_support(E) intel_rapl_msr(E) lpc_ich(E) joydev(E) i2c_i801(E) i2c_smbus(E) virtio_balloon(E) intel_rapl_common(E) fuse(E) zram(E) ip_tables(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) virtio_gpu(E) virtio_blk(E) ghash_clmulni_intel(E) virtio_dma_buf(E) virtio_console(E) serio_raw(E) virtio_net(E) drm_kms_helper(E) net_failover(E) cec(E) failover(E) drm(E) qemu_fw_cfg(E)
> > > > [ 1727.506081] CPU: 1 PID: 7800 Comm: truncfile Tainted: G           OE     5.15.0-rc6+ #53
> > > > [ 1727.512691] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
> > > > [ 1727.521863] RIP: 0010:__up_read+0x404/0x430
> > > > [ 1727.528011] Code: 48 8b 55 00 4d 89 f0 4c 89 e1 53 48 c7 c6 e0 db 89 b1 48 c7 c7 00 d9 89 b1 65 4c 8b 3c 25 80 fe 01 00 4d 89 f9 e8 a2 4b 08 01 <0f> 0b 5a e9 b4 fd ff ff be 08 00 00 00 4c 89 e7 e8 57 4d 33 00 f0
> > > > [ 1727.540864] RSP: 0018:ffff888118a07bb0 EFLAGS: 00010286
> > > > [ 1727.556265] RAX: 0000000000000000 RBX: ffffffffb189d840 RCX: 0000000000000000
> > > > [ 1727.571003] RDX: 0000000000000001 RSI: ffffffffb1aa6380 RDI: ffffed1023140f6c
> > > > [ 1727.580837] RBP: ffff88812f062220 R08: ffffffffb0185284 R09: ffff8884209ad7c7
> > > > [ 1727.593908] R10: ffffed1084135af8 R11: 0000000000000001 R12: ffff88812f062220
> > > > [ 1727.605431] R13: 1ffff11023140f79 R14: 0000000000000001 R15: ffff88810095b280
> > > > [ 1727.615457] FS:  00007f7f38add740(0000) GS:ffff888420640000(0000) knlGS:0000000000000000
> > > > [ 1727.623346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [ 1727.631138] CR2: 00007fa7810e8000 CR3: 000000012124a000 CR4: 00000000003506e0
> > > > [ 1727.639503] Call Trace:
> > > > [ 1727.649160]  ? _raw_spin_unlock+0x29/0x40
> > > > [ 1727.655583]  ? up_write+0x270/0x270
> > > > [ 1727.661769]  __ceph_setattr+0x85e/0x1270 [ceph]
> > > > [ 1727.670914]  ? ceph_inode_work+0x460/0x460 [ceph]
> > > > [ 1727.677397]  ceph_setattr+0x12d/0x1c0 [ceph]
> > > > [ 1727.685205]  notify_change+0x4e9/0x720
> > > > [ 1727.694598]  ? do_truncate+0xcf/0x140
> > > > [ 1727.705737]  do_truncate+0xcf/0x140
> > > > [ 1727.712680]  ? file_open_root+0x1e0/0x1e0
> > > > [ 1727.720447]  ? lock_release+0x410/0x410
> > > > [ 1727.727851]  ? lock_is_held_type+0xfb/0x130
> > > > [ 1727.734045]  do_sys_ftruncate+0x306/0x350
> > > > [ 1727.740636]  do_syscall_64+0x3b/0x90
> > > > [ 1727.748675]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > > [ 1727.755634] RIP: 0033:0x7f7f38be356b
> > > > [ 1727.763575] Code: 77 05 c3 0f 1f 40 00 48 8b 15 09 99 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 d9 98 0c 00 f7 d8
> > > > [ 1727.777868] RSP: 002b:00007ffdee35cd18 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
> > > > [ 1727.792610] RAX: ffffffffffffffda RBX: 000000000c8a9d62 RCX: 00007f7f38be356b
> > > > [ 1727.807383] RDX: 000000000c8a9d62 RSI: 000000000c8a9d62 RDI: 0000000000000003
> > > > [ 1727.821520] RBP: 0000000000000003 R08: 000000000000005a R09: 00007f7f38cada60
> > > > [ 1727.829368] R10: 00007f7f38af3700 R11: 0000000000000202 R12: 0000000061856b9d
> > > > [ 1727.837356] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > > > [ 1727.849767] irq event stamp: 549109
> > > > [ 1727.863878] hardirqs last  enabled at (549109): [<ffffffffb12f8b54>] _raw_spin_unlock_irq+0x24/0x50
> > > > [ 1727.879034] hardirqs last disabled at (549108): [<ffffffffb12f8d04>] _raw_spin_lock_irq+0x54/0x60
> > > > [ 1727.897434] softirqs last  enabled at (548984): [<ffffffffb013d097>] __irq_exit_rcu+0x157/0x1b0
> > > > [ 1727.913276] softirqs last disabled at (548975): [<ffffffffb013d097>] __irq_exit_rcu+0x157/0x1b0
> > > > [ 1727.933182] ---[ end trace a89de5333b156523 ]---
> > > > 
> > > > 
> > > > 
> > > > I think this patch should fix it:
> > > > 
> > > > [PATCH] SQUASH: ensure we unset lock_snap_rwsem after unlocking it
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >   fs/ceph/inode.c | 4 +++-
> > > >   1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > > > index eebbd0296004..cb0ad0faee45 100644
> > > > --- a/fs/ceph/inode.c
> > > > +++ b/fs/ceph/inode.c
> > > > @@ -2635,8 +2635,10 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
> > > >   
> > > >   	release &= issued;
> > > >   	spin_unlock(&ci->i_ceph_lock);
> > > > -	if (lock_snap_rwsem)
> > > > +	if (lock_snap_rwsem) {
> > > >   		up_read(&mdsc->snap_rwsem);
> > > > +		lock_snap_rwsem = false;
> > > > +	}
> > > >   
> > > >   	if (inode_dirty_flags)
> > > >   		__mark_inode_dirty(inode, inode_dirty_flags);
> > > Testing with that patch on top of your latest series looks pretty good
> > > so far.
> > 
> > Cool.
> > 
> > >   I see some xfstests failures that need to be investigated
> > > (generic/075, in particular). I'll take a harder look at that next week.
> > I will also try this.
> > > For now, I've gone ahead and updated wip-fscrypt-fnames to the latest
> > > fnames branch, and also pushed a new wip-fscrypt-size branch that has
> > > all of your patches, with the above SQUASH patch folded into #9.
> > > 
> > > I'll continue the testing next week, but I think the -size branch is
> > > probably a good place to work from for now.
> > 
> > BTW, what's your test script for xfstests? I may be missing something important.
> > 
> 
> I'm mainly running:
> 
>     $ sudo ./check -g quick -E ./ceph.exclude
> 
> ...and ceph.exclude has:
> 
> ceph/001
> generic/003
> generic/531
> generic/538
> 
> ...most of the exclusions are because they take a long time to run.

Oh and I should say...most of the failures I've seen with this patchset
are intermittent. I suspect there is some race condition we haven't
addressed yet.

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v7 0/9] ceph: size handling for the fscrypt
  2021-11-06 10:51         ` Jeff Layton
@ 2021-11-07  9:44           ` Xiubo Li
  2021-11-08  3:22           ` Xiubo Li
  1 sibling, 0 replies; 25+ messages in thread
From: Xiubo Li @ 2021-11-07  9:44 UTC (permalink / raw)
  To: Jeff Layton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel

[...]

>>>>> I think this patch should fix it:
>>>>>
>>>>> [PATCH] SQUASH: ensure we unset lock_snap_rwsem after unlocking it
>>>>>
>>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>>>> ---
>>>>>    fs/ceph/inode.c | 4 +++-
>>>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>>>>> index eebbd0296004..cb0ad0faee45 100644
>>>>> --- a/fs/ceph/inode.c
>>>>> +++ b/fs/ceph/inode.c
>>>>> @@ -2635,8 +2635,10 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>>>>    
>>>>>    	release &= issued;
>>>>>    	spin_unlock(&ci->i_ceph_lock);
>>>>> -	if (lock_snap_rwsem)
>>>>> +	if (lock_snap_rwsem) {
>>>>>    		up_read(&mdsc->snap_rwsem);
>>>>> +		lock_snap_rwsem = false;
>>>>> +	}
>>>>>    
>>>>>    	if (inode_dirty_flags)
>>>>>    		__mark_inode_dirty(inode, inode_dirty_flags);
>>>> Testing with that patch on top of your latest series looks pretty good
>>>> so far.
>>> Cool.
>>>
>>>>    I see some xfstests failures that need to be investigated
>>>> (generic/075, in particular). I'll take a harder look at that next week.
>>> I will also try this.
>>>> For now, I've gone ahead and updated wip-fscrypt-fnames to the latest
>>>> fnames branch, and also pushed a new wip-fscrypt-size branch that has
>>>> all of your patches, with the above SQUASH patch folded into #9.
>>>>
>>>> I'll continue the testing next week, but I think the -size branch is
>>>> probably a good place to work from for now.
>>> BTW, what's your test script for xfstests? I may be missing something important.
>>>
>> I'm mainly running:
>>
>>      $ sudo ./check -g quick -E ./ceph.exclude
>>
>> ...and ceph.exclude has:
>>
>> ceph/001
>> generic/003
>> generic/531
>> generic/538
>>
>> ...most of the exclusions are because they take a long time to run.
> Oh and I should say...most of the failures I've seen with this patchset
> are intermittent. I suspect there is some race condition we haven't
> addressed yet.

Okay, my test was stuck, and I finally found that it had just run out of
disk space.

I have run the truncate-related tests, and they have all worked well so far.

I will keep testing this.

Thanks,

> Thanks,



* Re: [PATCH v7 0/9] ceph: size handling for the fscrypt
  2021-11-06 10:51         ` Jeff Layton
  2021-11-07  9:44           ` Xiubo Li
@ 2021-11-08  3:22           ` Xiubo Li
  2021-11-08  6:04             ` Xiubo Li
  1 sibling, 1 reply; 25+ messages in thread
From: Xiubo Li @ 2021-11-08  3:22 UTC (permalink / raw)
  To: Jeff Layton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel


On 11/6/21 6:51 PM, Jeff Layton wrote:
> On Sat, 2021-11-06 at 06:50 -0400, Jeff Layton wrote:
>> On Sat, 2021-11-06 at 09:35 +0800, Xiubo Li wrote:
>>> On 11/6/21 4:46 AM, Jeff Layton wrote:
>>>> On Fri, 2021-11-05 at 14:36 -0400, Jeff Layton wrote:
>>>>> On Fri, 2021-11-05 at 22:22 +0800, xiubli@redhat.com wrote:
>>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>>
>>>>>> This patch series is based on the "wip-fscrypt-fnames" branch in
>>>>>> repo https://github.com/ceph/ceph-client.git.
>>>>>>
>>>>>> And I have picked up 5 patches from the "ceph-fscrypt-size-experimental"
>>>>>> branch in repo
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git.
>>>>>>
>>>>>> ====
>>>>>>
>>>>>> This approach is based on the discussion from V1 and V2, which will
>>>>>> pass the encrypted last block contents to MDS along with the truncate
>>>>>> request.
>>>>>>
>>>>>> This will send the encrypted last block contents to MDS along with
>>>>>> the truncate request when truncating to a smaller size and at the
>>>>>> same time new size does not align to BLOCK SIZE.
>>>>>>
>>>>>> The MDS side patch is raised in PR
>>>>>> https://github.com/ceph/ceph/pull/43588, which is also based on Jeff's
>>>>>> previous great work in PR https://github.com/ceph/ceph/pull/41284.
>>>>>>
>>>>>> The MDS will use the filer.write_trunc(), which could update and
>>>>>> truncate the file in one shot, instead of filer.truncate().
>>>>>>
>>>>>> This assumes that kclient won't support the inline data feature, which
>>>>>> will be removed soon. For more detail, please see:
>>>>>> https://tracker.ceph.com/issues/52916
>>>>>>
>>>>>> Changed in V7:
>>>>>> - Fixed the sparse check warnings.
>>>>>> - Removed the include/linux/ceph/crypto.h header file.
>>>>>>
>>>>>> Changed in V6:
>>>>>> - Fixed the file hole bug, also have updated the MDS side PR.
>>>>>> - Add object version support for sync read in #8.
>>>>>>
>>>>>>
>>>>>> Changed in V5:
>>>>>> - Rebase to "wip-fscrypt-fnames" branch in ceph-client.git repo.
>>>>>> - Pick up 5 patches from Jeff's "ceph-fscrypt-size-experimental" branch
>>>>>>     in linux.git repo.
>>>>>> - Add "i_truncate_pagecache_size" member support in ceph_inode_info
>>>>>>     struct, this will be used to truncate the pagecache only in kclient
>>>>>>     side, because the "i_truncate_size" will always be aligned to BLOCK
>>>>>>     SIZE. In fscrypt case we need to use the real size to truncate the
>>>>>>     pagecache.
>>>>>>
>>>>>>
>>>>>> Changed in V4:
>>>>>> - Retry the truncate request up to 20 times before failing it with -EAGAIN.
>>>>>> - Remove the "fill_last_block" label and move the code to else branch.
>>>>>> - Remove the #3 patch, which has already been sent out separately, in
>>>>>>     V3 series.
>>>>>> - Improve some comments in the code.
>>>>>>
>>>>>>
>>>>>> Changed in V3:
>>>>>> - Fix possible file corruption when another client updates the file
>>>>>>     just before the MDS acquires the xlock for the FILE lock.
>>>>>> - Flush the pagecache buffer before reading the last block when
>>>>>>     filling the truncate request.
>>>>>> - Some other minor fixes.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Jeff Layton (5):
>>>>>>     libceph: add CEPH_OSD_OP_ASSERT_VER support
>>>>>>     ceph: size handling for encrypted inodes in cap updates
>>>>>>     ceph: fscrypt_file field handling in MClientRequest messages
>>>>>>     ceph: get file size from fscrypt_file when present in inode traces
>>>>>>     ceph: handle fscrypt fields in cap messages from MDS
>>>>>>
>>>>>> Xiubo Li (4):
>>>>>>     ceph: add __ceph_get_caps helper support
>>>>>>     ceph: add __ceph_sync_read helper support
>>>>>>     ceph: add object version support for sync read
>>>>>>     ceph: add truncate size handling support for fscrypt
>>>>>>
>>>>>>    fs/ceph/caps.c                  | 136 ++++++++++++++----
>>>>>>    fs/ceph/crypto.h                |  25 ++++
>>>>>>    fs/ceph/dir.c                   |   3 +
>>>>>>    fs/ceph/file.c                  |  76 ++++++++--
>>>>>>    fs/ceph/inode.c                 | 244 +++++++++++++++++++++++++++++---
>>>>>>    fs/ceph/mds_client.c            |   9 +-
>>>>>>    fs/ceph/mds_client.h            |   2 +
>>>>>>    fs/ceph/super.h                 |  25 ++++
>>>>>>    include/linux/ceph/osd_client.h |   6 +-
>>>>>>    include/linux/ceph/rados.h      |   4 +
>>>>>>    net/ceph/osd_client.c           |   5 +
>>>>>>    11 files changed, 475 insertions(+), 60 deletions(-)
>>>>>>
>>>>> Thanks Xiubo.
>>>>>
>>>>> I hit this today after some more testing (generic/014 again):
>>>>>
>>>>> [ 1674.146843] libceph: mon0 (2)192.168.1.81:3300 session established
>>>>> [ 1674.150902] libceph: client54432 fsid 4e286176-3d8b-11ec-bece-52540031ba78
>>>>> [ 1674.153791] ceph: test_dummy_encryption mode enabled
>>>>> [ 1719.254308] run fstests generic/014 at 2021-11-05 13:36:26
>>>>> [ 1727.157974]
>>>>> [ 1727.158334] =====================================
>>>>> [ 1727.159219] WARNING: bad unlock balance detected!
>>>>> [ 1727.160707] 5.15.0-rc6+ #53 Tainted: G           OE
>>>>> [ 1727.162248] -------------------------------------
>>>>> [ 1727.171918] truncfile/7800 is trying to release lock (&mdsc->snap_rwsem) at:
>>>>> [ 1727.180836] [<ffffffffc127438e>] __ceph_setattr+0x85e/0x1270 [ceph]
>>>>> [ 1727.192788] but there are no more locks to release!
>>>>> [ 1727.203450]
>>>>> [ 1727.203450] other info that might help us debug this:
>>>>> [ 1727.220766] 2 locks held by truncfile/7800:
>>>>> [ 1727.225548]  #0: ffff888116dd2460 (sb_writers#15){.+.+}-{0:0}, at: do_syscall_64+0x3b/0x90
>>>>> [ 1727.234851]  #1: ffff8882d8dac3d0 (&sb->s_type->i_mutex_key#20){++++}-{3:3}, at: do_truncate+0xbe/0x140
>>>>> [ 1727.240027]
>>>>> [ 1727.240027] stack backtrace:
>>>>> [ 1727.247863] CPU: 3 PID: 7800 Comm: truncfile Tainted: G           OE     5.15.0-rc6+ #53
>>>>> [ 1727.252508] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
>>>>> [ 1727.257303] Call Trace:
>>>>> [ 1727.261503]  dump_stack_lvl+0x57/0x72
>>>>> [ 1727.265492]  lock_release.cold+0x49/0x4e
>>>>> [ 1727.269499]  ? __ceph_setattr+0x85e/0x1270 [ceph]
>>>>> [ 1727.273802]  ? lock_downgrade+0x390/0x390
>>>>> [ 1727.277913]  ? preempt_count_sub+0x14/0xc0
>>>>> [ 1727.281883]  ? _raw_spin_unlock+0x29/0x40
>>>>> [ 1727.285725]  ? __ceph_mark_dirty_caps+0x29f/0x450 [ceph]
>>>>> [ 1727.289959]  up_read+0x17/0x20
>>>>> [ 1727.293852]  __ceph_setattr+0x85e/0x1270 [ceph]
>>>>> [ 1727.297827]  ? ceph_inode_work+0x460/0x460 [ceph]
>>>>> [ 1727.301765]  ceph_setattr+0x12d/0x1c0 [ceph]
>>>>> [ 1727.305839]  notify_change+0x4e9/0x720
>>>>> [ 1727.309762]  ? do_truncate+0xcf/0x140
>>>>> [ 1727.313504]  do_truncate+0xcf/0x140
>>>>> [ 1727.317092]  ? file_open_root+0x1e0/0x1e0
>>>>> [ 1727.321022]  ? lock_release+0x410/0x410
>>>>> [ 1727.324769]  ? lock_is_held_type+0xfb/0x130
>>>>> [ 1727.328699]  do_sys_ftruncate+0x306/0x350
>>>>> [ 1727.332449]  do_syscall_64+0x3b/0x90
>>>>> [ 1727.336127]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> [ 1727.340303] RIP: 0033:0x7f7f38be356b
>>>>> [ 1727.344445] Code: 77 05 c3 0f 1f 40 00 48 8b 15 09 99 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 d9 98 0c 00 f7 d8
>>>>> [ 1727.354258] RSP: 002b:00007ffdee35cd18 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
>>>>> [ 1727.358964] RAX: ffffffffffffffda RBX: 000000000c8a9d62 RCX: 00007f7f38be356b
>>>>> [ 1727.363836] RDX: 000000000c8a9d62 RSI: 000000000c8a9d62 RDI: 0000000000000003
>>>>> [ 1727.368467] RBP: 0000000000000003 R08: 000000000000005a R09: 00007f7f38cada60
>>>>> [ 1727.373285] R10: 00007f7f38af3700 R11: 0000000000000202 R12: 0000000061856b9d
>>>>> [ 1727.377870] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>>>> [ 1727.382578] ------------[ cut here ]------------
>>>>> [ 1727.391761] DEBUG_RWSEMS_WARN_ON(tmp < 0): count = 0xffffffffffffff00, magic = 0xffff88812f062220, owner = 0x1, curr 0xffff88810095b280, list empty
>>>>> [ 1727.419497] WARNING: CPU: 14 PID: 7800 at kernel/locking/rwsem.c:1297 __up_read+0x404/0x430
>>>>> [ 1727.432752] Modules linked in: ceph(OE) libceph(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) bridge(E) ip_set(E) stp(E) llc(E) rfkill(E) nf_tables(E) nfnetlink(E) cachefiles(E) fscache(E) netfs(E) sunrpc(E) iTCO_wdt(E) intel_pmc_bxt(E) iTCO_vendor_support(E) intel_rapl_msr(E) lpc_ich(E) joydev(E) i2c_i801(E) i2c_smbus(E) virtio_balloon(E) intel_rapl_common(E) fuse(E) zram(E) ip_tables(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) virtio_gpu(E) virtio_blk(E) ghash_clmulni_intel(E) virtio_dma_buf(E) virtio_console(E) serio_raw(E) virtio_net(E) drm_kms_helper(E) net_failover(E) cec(E) failover(E) drm(E) qemu_fw_cfg(E)
>>>>> [ 1727.506081] CPU: 1 PID: 7800 Comm: truncfile Tainted: G           OE     5.15.0-rc6+ #53
>>>>> [ 1727.512691] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
>>>>> [ 1727.521863] RIP: 0010:__up_read+0x404/0x430
>>>>> [ 1727.528011] Code: 48 8b 55 00 4d 89 f0 4c 89 e1 53 48 c7 c6 e0 db 89 b1 48 c7 c7 00 d9 89 b1 65 4c 8b 3c 25 80 fe 01 00 4d 89 f9 e8 a2 4b 08 01 <0f> 0b 5a e9 b4 fd ff ff be 08 00 00 00 4c 89 e7 e8 57 4d 33 00 f0
>>>>> [ 1727.540864] RSP: 0018:ffff888118a07bb0 EFLAGS: 00010286
>>>>> [ 1727.556265] RAX: 0000000000000000 RBX: ffffffffb189d840 RCX: 0000000000000000
>>>>> [ 1727.571003] RDX: 0000000000000001 RSI: ffffffffb1aa6380 RDI: ffffed1023140f6c
>>>>> [ 1727.580837] RBP: ffff88812f062220 R08: ffffffffb0185284 R09: ffff8884209ad7c7
>>>>> [ 1727.593908] R10: ffffed1084135af8 R11: 0000000000000001 R12: ffff88812f062220
>>>>> [ 1727.605431] R13: 1ffff11023140f79 R14: 0000000000000001 R15: ffff88810095b280
>>>>> [ 1727.615457] FS:  00007f7f38add740(0000) GS:ffff888420640000(0000) knlGS:0000000000000000
>>>>> [ 1727.623346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ 1727.631138] CR2: 00007fa7810e8000 CR3: 000000012124a000 CR4: 00000000003506e0
>>>>> [ 1727.639503] Call Trace:
>>>>> [ 1727.649160]  ? _raw_spin_unlock+0x29/0x40
>>>>> [ 1727.655583]  ? up_write+0x270/0x270
>>>>> [ 1727.661769]  __ceph_setattr+0x85e/0x1270 [ceph]
>>>>> [ 1727.670914]  ? ceph_inode_work+0x460/0x460 [ceph]
>>>>> [ 1727.677397]  ceph_setattr+0x12d/0x1c0 [ceph]
>>>>> [ 1727.685205]  notify_change+0x4e9/0x720
>>>>> [ 1727.694598]  ? do_truncate+0xcf/0x140
>>>>> [ 1727.705737]  do_truncate+0xcf/0x140
>>>>> [ 1727.712680]  ? file_open_root+0x1e0/0x1e0
>>>>> [ 1727.720447]  ? lock_release+0x410/0x410
>>>>> [ 1727.727851]  ? lock_is_held_type+0xfb/0x130
>>>>> [ 1727.734045]  do_sys_ftruncate+0x306/0x350
>>>>> [ 1727.740636]  do_syscall_64+0x3b/0x90
>>>>> [ 1727.748675]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> [ 1727.755634] RIP: 0033:0x7f7f38be356b
>>>>> [ 1727.763575] Code: 77 05 c3 0f 1f 40 00 48 8b 15 09 99 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 d9 98 0c 00 f7 d8
>>>>> [ 1727.777868] RSP: 002b:00007ffdee35cd18 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
>>>>> [ 1727.792610] RAX: ffffffffffffffda RBX: 000000000c8a9d62 RCX: 00007f7f38be356b
>>>>> [ 1727.807383] RDX: 000000000c8a9d62 RSI: 000000000c8a9d62 RDI: 0000000000000003
>>>>> [ 1727.821520] RBP: 0000000000000003 R08: 000000000000005a R09: 00007f7f38cada60
>>>>> [ 1727.829368] R10: 00007f7f38af3700 R11: 0000000000000202 R12: 0000000061856b9d
>>>>> [ 1727.837356] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>>>> [ 1727.849767] irq event stamp: 549109
>>>>> [ 1727.863878] hardirqs last  enabled at (549109): [<ffffffffb12f8b54>] _raw_spin_unlock_irq+0x24/0x50
>>>>> [ 1727.879034] hardirqs last disabled at (549108): [<ffffffffb12f8d04>] _raw_spin_lock_irq+0x54/0x60
>>>>> [ 1727.897434] softirqs last  enabled at (548984): [<ffffffffb013d097>] __irq_exit_rcu+0x157/0x1b0
>>>>> [ 1727.913276] softirqs last disabled at (548975): [<ffffffffb013d097>] __irq_exit_rcu+0x157/0x1b0
>>>>> [ 1727.933182] ---[ end trace a89de5333b156523 ]---
>>>>>
>>>>>
>>>>>
>>>>> I think this patch should fix it:
>>>>>
>>>>> [PATCH] SQUASH: ensure we unset lock_snap_rwsem after unlocking it
>>>>>
>>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>>>> ---
>>>>>    fs/ceph/inode.c | 4 +++-
>>>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>>>>> index eebbd0296004..cb0ad0faee45 100644
>>>>> --- a/fs/ceph/inode.c
>>>>> +++ b/fs/ceph/inode.c
>>>>> @@ -2635,8 +2635,10 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>>>>    
>>>>>    	release &= issued;
>>>>>    	spin_unlock(&ci->i_ceph_lock);
>>>>> -	if (lock_snap_rwsem)
>>>>> +	if (lock_snap_rwsem) {
>>>>>    		up_read(&mdsc->snap_rwsem);
>>>>> +		lock_snap_rwsem = false;
>>>>> +	}
>>>>>    
>>>>>    	if (inode_dirty_flags)
>>>>>    		__mark_inode_dirty(inode, inode_dirty_flags);
>>>> Testing with that patch on top of your latest series looks pretty good
>>>> so far.
>>> Cool.
>>>
>>>>    I see some xfstests failures that need to be investigated
>>>> (generic/075, in particular). I'll take a harder look at that next week.
>>> I will also try this.
>>>> For now, I've gone ahead and updated wip-fscrypt-fnames to the latest
>>>> fnames branch, and also pushed a new wip-fscrypt-size branch that has
>>>> all of your patches, with the above SQUASH patch folded into #9.
>>>>
>>>> I'll continue the testing next week, but I think the -size branch is
>>>> probably a good place to work from for now.
>>> BTW, what's your test script for the xfstests ? I may miss some important.
>>>
>> I'm mainly running:
>>
>>      $ sudo ./check -g quick -E ./ceph.exclude
>>
>> ...and ceph.exclude has:
>>
>> ceph/001
>> generic/003
>> generic/531
>> generic/538
>>
>> ...most of the exclusions are because they take a long time to run.
> Oh and I should say...most of the failures I've seen with this patchset
> are intermittent. I suspect there is some race condition we haven't
> addressed yet.

The "generic/075" test failed:

[root@lxbceph1 xfstests]# ./check generic/075
FSTYP         -- ceph
PLATFORM      -- Linux/x86_64 lxbceph1 5.15.0-rc6+

generic/075     [failed, exit status 1] - output mismatch (see /mnt/kcephfs/xfstests/results//generic/075.out.bad)
     --- tests/generic/075.out    2021-11-08 08:38:19.756822587 +0800
     +++ /mnt/kcephfs/xfstests/results//generic/075.out.bad 2021-11-08 09:19:14.570013209 +0800
     @@ -4,15 +4,4 @@
      -----------------------------------------------
      fsx.0 : -d -N numops -S 0
      -----------------------------------------------
     -
     ------------------------------------------------
     -fsx.1 : -d -N numops -S 0 -x
     ------------------------------------------------
     ...
     (Run 'diff -u tests/generic/075.out /mnt/kcephfs/xfstests/results//generic/075.out.bad'  to see the entire diff)
Ran: generic/075
Failures: generic/075
Failed 1 of 1 tests


From '075.0.fsxlog':


  84 122 trunc       from 0x40000 to 0x3ffd3
  85 123 mapread     0x2794d thru    0x2cb8c (0x5240 bytes)
  86 124 read        0x37b86 thru    0x3dc7b (0x60f6 bytes)
  87 READ BAD DATA: offset = 0x37b86, size = 0x60f6, fname = 075.0
  88 OFFSET  GOOD    BAD     RANGE
  89 0x38fc0 0x79b2  0x0000  0x00000
  90 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
  91 0x38fc1 0xb279  0x0000  0x00001
  92 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
  93 0x38fc2 0x791e  0x0000  0x00002
  94 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
  95 0x38fc3 0x1e79  0x0000  0x00003
  96 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
  97 0x38fc4 0x79e0  0x0000  0x00004
  98 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
  99 0x38fc5 0xe079  0x0000  0x00005
100 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
101 0x38fc6 0x790b  0x0000  0x00006
102 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
103 0x38fc7 0x0b79  0x0000  0x00007
104 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
105 0x38fc8 0x7966  0x0000  0x00008
106 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
107 0x38fc9 0x6679  0x0000  0x00009
108 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
109 0x38fca 0x79ff  0x0000  0x0000a
110 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
111 0x38fcb 0xff79  0x0000  0x0000b
112 operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
113 0x38fcc 0x7996  0x0000  0x0000c
...


I have dumped '075.0.full'; it is the same as '075.out.bad'.

Diffing '075.0.good' against '075.0.bad' shows that from file offset
0x38fc0 up to i_size the contents are all zero in '075.0.bad', while in
'075.0.good' they are not.

From the '/proc/kmsg' output:

18715 <7>[61484.334994] ceph:  fill_fscrypt_truncate size 262144 -> 262099 got cap refs on Fr, issued pAsxLsXsxFsxcrwb
18716 <7>[61484.335010] ceph:  writepages_start 000000003e6c8932 (mode=ALL)
18717 <7>[61484.335021] ceph:   head snapc 000000003195bf7d has 8 dirty pages
18718 <7>[61484.335030] ceph:   oldest snapc is 000000003195bf7d seq 1 (0 snaps)
18719 <7>[61484.335041] ceph:   not cyclic, 0 to 2251799813685247
18720 <7>[61484.335054] ceph:  pagevec_lookup_range_tag got 8
18721 <7>[61484.335063] ceph:  ? 000000007350de9f idx 56
18722 <7>[61484.335139] ceph:  000000003e6c8932 will write page 000000007350de9f idx 56
18723 <7>[61484.335151] ceph:  ? 00000000db5774fb idx 57
18724 <7>[61484.335162] ceph:  000000003e6c8932 will write page 00000000db5774fb idx 57
18725 <7>[61484.335173] ceph:  ? 000000008bc9ea57 idx 58
18726 <7>[61484.335183] ceph:  000000003e6c8932 will write page 000000008bc9ea57 idx 58
18727 <7>[61484.335194] ceph:  ? 00000000be4c1d25 idx 59
18728 <7>[61484.335204] ceph:  000000003e6c8932 will write page 00000000be4c1d25 idx 59
18729 <7>[61484.335215] ceph:  ? 0000000051d6fed1 idx 60
18730 <7>[61484.335225] ceph:  000000003e6c8932 will write page 0000000051d6fed1 idx 60
18731 <7>[61484.335237] ceph:  ? 00000000f40c8a7a idx 61
18732 <7>[61484.335254] ceph:  000000003e6c8932 will write page 00000000f40c8a7a idx 61
18733 <7>[61484.335274] ceph:  ? 00000000c7da9df6 idx 62
18734 <7>[61484.335291] ceph:  000000003e6c8932 will write page 00000000c7da9df6 idx 62
18735 <7>[61484.335312] ceph:  ? 00000000646abb31 idx 63
18736 <7>[61484.335330] ceph:  000000003e6c8932 will write page 00000000646abb31 idx 63
18737 <7>[61484.335344] ceph:  reached end pvec, trying for more
18738 <7>[61484.335352] ceph:  pagevec_lookup_range_tag got 0
18739 <7>[61484.336008] ceph:  writepages got pages at 229376~32768
18740 <7>[61484.336136] ceph:  pagevec_release on 0 pages (0000000000000000)
18741 <7>[61484.336157] ceph:  pagevec_lookup_range_tag got 0
18742 <7>[61484.336172] ceph:  writepages dend - startone, rc = 0
18743 <7>[61484.348123] ceph:  writepages_finish 000000003e6c8932 rc 0

...
18760 <7>[61485.386715] ceph:  sync_read on inode 000000003e6c8932 258048~4096
18761 <7>[61485.386784] ceph:  client4220 send metrics to mds0
18762 <7>[61485.389512] ceph:  sync_read 258048~4096 got 4096 i_size 262144
18763 <7>[61485.389569] ceph:  sync_read result 4096 retry_op 2
18764 <7>[61485.389581] ceph:  put_cap_refs 000000003e6c8932 had Fr last


I see that in fill_fscrypt_truncate(), just before reading the last
block, it has already triggered the flush and successfully written the
dirty pages to the OSD, but it seems the contents of those 8 pages are
zero.

Is it possible that those 8 pages had not been dirtied yet when we
flushed them in fill_fscrypt_truncate()?
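
To double check the offsets in the logs above, here is a minimal sketch
of the arithmetic, assuming 4096-byte pages and fscrypt blocks (the
258048~4096 sync_read suggests this, but it is an assumption). The names
sketch_page_index() and sketch_last_block_start() are made up for
illustration and are not kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: pages and fscrypt blocks are both 4096 bytes. */
#define SKETCH_BLOCK_SIZE 4096ULL

/* Index of the page (or block) that contains byte offset 'off'. */
static uint64_t sketch_page_index(uint64_t off)
{
	return off / SKETCH_BLOCK_SIZE;
}

/*
 * Start offset of the block holding the last byte that survives a
 * truncate to 'new_size'. Only meaningful for a non-zero size.
 */
static uint64_t sketch_last_block_start(uint64_t new_size)
{
	assert(new_size > 0);
	return ((new_size - 1) / SKETCH_BLOCK_SIZE) * SKETCH_BLOCK_SIZE;
}
```

With these, the first bad offset 0x38fc0 falls in page 56 and the last
byte before the old i_size 0x40000 falls in page 63, which matches the 8
pages (idx 56..63) flushed by writepages. Truncating to 0x3ffd3 puts the
start of the last block at 258048, which matches the sync_read range.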

Thanks

BRs






> Thanks,


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 3/9] ceph: fscrypt_file field handling in MClientRequest messages
  2021-11-05 14:22 ` [PATCH v7 3/9] ceph: fscrypt_file field handling in MClientRequest messages xiubli
@ 2021-11-08  5:09   ` Xiubo Li
  0 siblings, 0 replies; 25+ messages in thread
From: Xiubo Li @ 2021-11-08  5:09 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel


On 11/5/21 10:22 PM, xiubli@redhat.com wrote:
> From: Jeff Layton <jlayton@kernel.org>
>
> For encrypted inodes, transmit a rounded-up size to the MDS as the
> normal file size and send the real inode size in fscrypt_file field.
>
> Also, fix up creates and truncates to also transmit fscrypt_file.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/dir.c        |  3 +++
>   fs/ceph/file.c       |  2 ++
>   fs/ceph/inode.c      | 18 ++++++++++++++++--
>   fs/ceph/mds_client.c |  9 ++++++++-
>   fs/ceph/mds_client.h |  2 ++
>   5 files changed, 31 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 37c9c589ee27..987c1579614c 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -916,6 +916,9 @@ static int ceph_mknod(struct user_namespace *mnt_userns, struct inode *dir,
>   		goto out_req;
>   	}
>   
> +	if (S_ISREG(mode) && IS_ENCRYPTED(dir))
> +		set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
> +
>   	req->r_dentry = dget(dentry);
>   	req->r_num_caps = 2;
>   	req->r_parent = dir;
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 126d2d80686c..8c0b9ed7f48b 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -715,6 +715,8 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   	req->r_args.open.mask = cpu_to_le32(mask);
>   	req->r_parent = dir;
>   	ihold(dir);
> +	if (IS_ENCRYPTED(dir))
> +		set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
>   
>   	if (flags & O_CREAT) {
>   		struct ceph_file_layout lo;
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index d24d42c94d43..4a7b2b0d88f7 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -2383,11 +2383,25 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   			}
>   		} else if ((issued & CEPH_CAP_FILE_SHARED) == 0 ||
>   			   attr->ia_size != isize) {
> -			req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
> -			req->r_args.setattr.old_size = cpu_to_le64(isize);
>   			mask |= CEPH_SETATTR_SIZE;
>   			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
>   				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
> +			if (IS_ENCRYPTED(inode)) {
It should be "if (IS_ENCRYPTED(inode) && attr->ia_size) {".

If the new size is 0, there is no need to round it up to the BLOCK SIZE.
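
For reference, a minimal sketch of the two sizes this hunk sends for a
truncate on an encrypted inode. It assumes CEPH_FSCRYPT_BLOCK_SIZE is
4096, which is an assumption here; the struct and function names are
hypothetical and only mirror the setattr.size and r_fscrypt_file fields
in the hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: stands in for CEPH_FSCRYPT_BLOCK_SIZE. */
#define SKETCH_FSCRYPT_BLOCK_SIZE 4096ULL

/* Hypothetical holder for the two sizes carried by the request. */
struct sketch_setattr_sizes {
	uint64_t wire_size;    /* rounded-up size, like setattr.size */
	uint64_t fscrypt_file; /* real size, like r_fscrypt_file */
};

/* Round 'v' up to a multiple of 'block', which must be a power of two. */
static uint64_t sketch_round_up(uint64_t v, uint64_t block)
{
	assert(block != 0 && (block & (block - 1)) == 0);
	return (v + block - 1) & ~(block - 1);
}

/* Compute both sizes for a truncate to 'new_size'. */
static struct sketch_setattr_sizes sketch_truncate_sizes(uint64_t new_size)
{
	struct sketch_setattr_sizes s;

	s.wire_size = sketch_round_up(new_size, SKETCH_FSCRYPT_BLOCK_SIZE);
	s.fscrypt_file = new_size;
	return s;
}
```

Truncating to 262099 would send 262144 as the size and 262099 in
fscrypt_file. A size of 0 is already block aligned, so rounding leaves it
at 0.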


> +				set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
> +				mask |= CEPH_SETATTR_FSCRYPT_FILE;
> +				req->r_args.setattr.size =
> +					cpu_to_le64(round_up(attr->ia_size,
> +							     CEPH_FSCRYPT_BLOCK_SIZE));
> +				req->r_args.setattr.old_size =
> +					cpu_to_le64(round_up(isize,
> +							     CEPH_FSCRYPT_BLOCK_SIZE));
> +				req->r_fscrypt_file = attr->ia_size;
> +				/* FIXME: client must zero out any partial blocks! */
> +			} else {
> +				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
> +				req->r_args.setattr.old_size = cpu_to_le64(isize);
> +				req->r_fscrypt_file = 0;
> +			}
>   		}
>   	}
>   	if (ia_valid & ATTR_MTIME) {
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 69caea1d2444..e2d1b98c61fc 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2653,7 +2653,12 @@ static void encode_mclientrequest_tail(void **p, const struct ceph_mds_request *
>   	} else {
>   		ceph_encode_32(p, 0);
>   	}
> -	ceph_encode_32(p, 0); // fscrypt_file for now
> +	if (test_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags)) {
> +		ceph_encode_32(p, sizeof(__le64));
> +		ceph_encode_64(p, req->r_fscrypt_file);
> +	} else {
> +		ceph_encode_32(p, 0);
> +	}
>   }
>   
>   /*
> @@ -2739,6 +2744,8 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
>   
>   	/* fscrypt_file */
>   	len += sizeof(u32);
> +	if (test_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags))
> +		len += sizeof(__le64);
>   
>   	msg = ceph_msg_new2(CEPH_MSG_CLIENT_REQUEST, len, 1, GFP_NOFS, false);
>   	if (!msg) {
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 6a2ac489e06e..149a3a828472 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -276,6 +276,7 @@ struct ceph_mds_request {
>   #define CEPH_MDS_R_DID_PREPOPULATE	(6) /* prepopulated readdir */
>   #define CEPH_MDS_R_PARENT_LOCKED	(7) /* is r_parent->i_rwsem wlocked? */
>   #define CEPH_MDS_R_ASYNC		(8) /* async request */
> +#define CEPH_MDS_R_FSCRYPT_FILE		(9) /* must marshal fscrypt_file field */
>   	unsigned long	r_req_flags;
>   
>   	struct mutex r_fill_mutex;
> @@ -283,6 +284,7 @@ struct ceph_mds_request {
>   	union ceph_mds_request_args r_args;
>   
>   	struct ceph_fscrypt_auth *r_fscrypt_auth;
> +	u64	r_fscrypt_file;
>   
>   	u8 *r_altname;		    /* fscrypt binary crypttext for long filenames */
>   	u32 r_altname_len;	    /* length of r_altname */


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/9] ceph: size handling for the fscrypt
  2021-11-08  3:22           ` Xiubo Li
@ 2021-11-08  6:04             ` Xiubo Li
  2021-11-08  8:24               ` Xiubo Li
  0 siblings, 1 reply; 25+ messages in thread
From: Xiubo Li @ 2021-11-08  6:04 UTC (permalink / raw)
  To: Jeff Layton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel


On 11/8/21 11:22 AM, Xiubo Li wrote:
...
>>>
>>>      $ sudo ./check -g quick -E ./ceph.exclude
>>>
>>> ...and ceph.exclude has:
>>>
>>> ceph/001
>>> generic/003
>>> generic/531
>>> generic/538
>>>
>>> ...most of the exclusions are because they take a long time to run.
>> Oh and I should say...most of the failures I've seen with this patchset
>> are intermittent. I suspect there is some race condition we haven't
>> addressed yet.
>
> The "generic/075" failed:
>
> [root@lxbceph1 xfstests]# ./check generic/075
> FSTYP         -- ceph
> PLATFORM      -- Linux/x86_64 lxbceph1 5.15.0-rc6+
>
> generic/075     [failed, exit status 1] - output mismatch (see 
> /mnt/kcephfs/xfstests/results//generic/075.out.bad)
>     --- tests/generic/075.out    2021-11-08 08:38:19.756822587 +0800
>     +++ /mnt/kcephfs/xfstests/results//generic/075.out.bad 2021-11-08 
> 09:19:14.570013209 +0800
>     @@ -4,15 +4,4 @@
>      -----------------------------------------------
>      fsx.0 : -d -N numops -S 0
>      -----------------------------------------------
>     -
>     ------------------------------------------------
>     -fsx.1 : -d -N numops -S 0 -x
>     ------------------------------------------------
>     ...
>     (Run 'diff -u tests/generic/075.out 
> /mnt/kcephfs/xfstests/results//generic/075.out.bad'  to see the entire 
> diff)
> Ran: generic/075
> Failures: generic/075
> Failed 1 of 1 tests
>
>
> From '075.0.fsxlog':
>
>
>  84 122 trunc       from 0x40000 to 0x3ffd3
>  85 123 mapread     0x2794d thru    0x2cb8c (0x5240 bytes)
>  86 124 read        0x37b86 thru    0x3dc7b (0x60f6 bytes)
>  87 READ BAD DATA: offset = 0x37b86, size = 0x60f6, fname = 075.0
>  88 OFFSET  GOOD    BAD     RANGE
>  89 0x38fc0 0x79b2  0x0000  0x00000
>  90 operation# (mod 256) for the bad data unknown, check HOLE and 
> EXTEND ops
>  91 0x38fc1 0xb279  0x0000  0x00001
>  92 operation# (mod 256) for the bad data unknown, check HOLE and 
> EXTEND ops
>  93 0x38fc2 0x791e  0x0000  0x00002
>  94 operation# (mod 256) for the bad data unknown, check HOLE and 
> EXTEND ops
>  95 0x38fc3 0x1e79  0x0000  0x00003
>  96 operation# (mod 256) for the bad data unknown, check HOLE and 
> EXTEND ops
>  97 0x38fc4 0x79e0  0x0000  0x00004
>  98 operation# (mod 256) for the bad data unknown, check HOLE and 
> EXTEND ops
>  99 0x38fc5 0xe079  0x0000  0x00005
> 100 operation# (mod 256) for the bad data unknown, check HOLE and 
> EXTEND ops
> 101 0x38fc6 0x790b  0x0000  0x00006
> 102 operation# (mod 256) for the bad data unknown, check HOLE and 
> EXTEND ops
> 103 0x38fc7 0x0b79  0x0000  0x00007
> 104 operation# (mod 256) for the bad data unknown, check HOLE and 
> EXTEND ops
> 105 0x38fc8 0x7966  0x0000  0x00008
> 106 operation# (mod 256) for the bad data unknown, check HOLE and 
> EXTEND ops
> 107 0x38fc9 0x6679  0x0000  0x00009
> 108 operation# (mod 256) for the bad data unknown, check HOLE and 
> EXTEND ops
> 109 0x38fca 0x79ff  0x0000  0x0000a
> 110 operation# (mod 256) for the bad data unknown, check HOLE and 
> EXTEND ops
> 111 0x38fcb 0xff79  0x0000  0x0000b
> 112 operation# (mod 256) for the bad data unknown, check HOLE and 
> EXTEND ops
> 113 0x38fcc 0x7996  0x0000  0x0000c
> ...
>
>
> I have dumped the '075.0.full', it's the same with the '075.out.bad'.
>
> Checked the diff '075.0.good' and '075.0.bad', it shows that from the 
> file offset 0x038fc0~i_size the contents are all zero in the 
> '075.0.bad'. The '075.0.good is not.
>
> From the '/proc/kmsg' output:
>
> 18715 <7>[61484.334994] ceph:  fill_fscrypt_truncate size 262144 -> 
> 262099 got cap refs on Fr, issued pAsxLsXsxFsxcrwb
> 18716 <7>[61484.335010] ceph:  writepages_start 000000003e6c8932 
> (mode=ALL)
> 18717 <7>[61484.335021] ceph:   head snapc 000000003195bf7d has 8 
> dirty pages
> 18718 <7>[61484.335030] ceph:   oldest snapc is 000000003195bf7d seq 1 
> (0 snaps)
> 18719 <7>[61484.335041] ceph:   not cyclic, 0 to 2251799813685247
> 18720 <7>[61484.335054] ceph:  pagevec_lookup_range_tag got 8
> 18721 <7>[61484.335063] ceph:  ? 000000007350de9f idx 56
> 18722 <7>[61484.335139] ceph:  000000003e6c8932 will write page 
> 000000007350de9f idx 56
> 18723 <7>[61484.335151] ceph:  ? 00000000db5774fb idx 57
> 18724 <7>[61484.335162] ceph:  000000003e6c8932 will write page 
> 00000000db5774fb idx 57
> 18725 <7>[61484.335173] ceph:  ? 000000008bc9ea57 idx 58
> 18726 <7>[61484.335183] ceph:  000000003e6c8932 will write page 
> 000000008bc9ea57 idx 58
> 18727 <7>[61484.335194] ceph:  ? 00000000be4c1d25 idx 59
> 18728 <7>[61484.335204] ceph:  000000003e6c8932 will write page 
> 00000000be4c1d25 idx 59
> 18729 <7>[61484.335215] ceph:  ? 0000000051d6fed1 idx 60
> 18730 <7>[61484.335225] ceph:  000000003e6c8932 will write page 
> 0000000051d6fed1 idx 60
> 18731 <7>[61484.335237] ceph:  ? 00000000f40c8a7a idx 61
> 18732 <7>[61484.335254] ceph:  000000003e6c8932 will write page 
> 00000000f40c8a7a idx 61
> 18733 <7>[61484.335274] ceph:  ? 00000000c7da9df6 idx 62
> 18734 <7>[61484.335291] ceph:  000000003e6c8932 will write page 
> 00000000c7da9df6 idx 62
> 18735 <7>[61484.335312] ceph:  ? 00000000646abb31 idx 63
> 18736 <7>[61484.335330] ceph:  000000003e6c8932 will write page 
> 00000000646abb31 idx 63
> 18737 <7>[61484.335344] ceph:  reached end pvec, trying for more
> 18738 <7>[61484.335352] ceph:  pagevec_lookup_range_tag got 0
> 18739 <7>[61484.336008] ceph:  writepages got pages at 229376~32768
> 18740 <7>[61484.336136] ceph:  pagevec_release on 0 pages 
> (0000000000000000)
> 18741 <7>[61484.336157] ceph:  pagevec_lookup_range_tag got 0
> 18742 <7>[61484.336172] ceph:  writepages dend - startone, rc = 0
> 18743 <7>[61484.348123] ceph:  writepages_finish 000000003e6c8932 rc 0
>
Before this, I can see one aio_write that updates the file and
writes/dirties the above 8 pages:

30766 <7>[72062.257479] ceph:  aio_write 00000000457286fe 1000000b1b7.fffffffffffffffe 233408~28736 getting caps. i_size 53014
30767 <7>[72062.257491] ceph:  get_cap_refs 00000000457286fe need Fw want Fb
30768 <7>[72062.257499] ceph:  __ceph_caps_issued 00000000457286fe cap 0000000075fd8906 issued pAsxLsXsxFscb
30769 <7>[72062.257507] ceph:  get_cap_refs 00000000457286fe have pAsxLsXsxFscb need Fw
...

30795 <7>[72062.267240] ceph:  aio_write 00000000457286fe 1000000b1b7.fffffffffffffffe 233408~28736 got cap refs on Fwb
30796 <7>[72062.267248] ceph:  __unregister_request 00000000cce16c34 tid 24
30797 <7>[72062.267254] ceph:  got safe reply 24, mds0
30798 <7>[72062.267272] ceph:  write_end file 00000000b0595dbb inode 00000000457286fe page 000000007350de9f 233408~64 (64)
30799 <7>[72062.267287] ceph:  set_size 00000000457286fe 53014 -> 233472
30800 <7>[72062.267297] ceph:  00000000457286fe set_page_dirty 00000000d20754ba idx 56 head 0/0 -> 1/1 snapc 00000000f69ffd89 seq 1 (0 snaps)
30801 <7>[72062.267322] ceph:  write_end file 00000000b0595dbb inode 00000000457286fe page 00000000db5774fb 233472~4096 (4096)
30802 <7>[72062.267335] ceph:  set_size 00000000457286fe 233472 -> 237568
30803 <7>[72062.267344] ceph:  00000000457286fe set_page_dirty 00000000cf1abc39 idx 57 head 1/1 -> 2/2 snapc 00000000f69ffd89 seq 1 (0 snaps)
30804 <7>[72062.267380] ceph:  write_end file 00000000b0595dbb inode 00000000457286fe page 000000008bc9ea57 237568~4096 (4096)
30805 <7>[72062.267393] ceph:  set_size 00000000457286fe 237568 -> 241664
30806 <7>[72062.267401] ceph:  00000000457286fe set_page_dirty 00000000b55a5d0e idx 58 head 2/2 -> 3/3 snapc 00000000f69ffd89 seq 1 (0 snaps)
30807 <7>[72062.267417] ceph:  put_cap_refs 00000000457286fe had p
30808 <7>[72062.267423] ceph:  write_end file 00000000b0595dbb inode 00000000457286fe page 00000000be4c1d25 241664~4096 (4096)
30809 <7>[72062.267435] ceph:  set_size 00000000457286fe 241664 -> 245760
30810 <7>[72062.267444] ceph:  00000000457286fe set_page_dirty 00000000810c0300 idx 59 head 3/3 -> 4/4 snapc 00000000f69ffd89 seq 1 (0 snaps)
30811 <7>[72062.267473] ceph:  write_end file 00000000b0595dbb inode 00000000457286fe page 0000000051d6fed1 245760~4096 (4096)
30812 <7>[72062.267492] ceph:  set_size 00000000457286fe 245760 -> 249856
30813 <7>[72062.267506] ceph:  00000000457286fe set_page_dirty 00000000b113b082 idx 60 head 4/4 -> 5/5 snapc 00000000f69ffd89 seq 1 (0 snaps)
30814 <7>[72062.267542] ceph:  write_end file 00000000b0595dbb inode 00000000457286fe page 00000000f40c8a7a 249856~4096 (4096)
30815 <7>[72062.267563] ceph:  set_size 00000000457286fe 249856 -> 253952
30816 <7>[72062.267577] ceph:  00000000457286fe set_page_dirty 00000000e52c4518 idx 61 head 5/5 -> 6/6 snapc 00000000f69ffd89 seq 1 (0 snaps)
30817 <7>[72062.267610] ceph:  write_end file 00000000b0595dbb inode 00000000457286fe page 00000000c7da9df6 253952~4096 (4096)
30818 <7>[72062.267626] ceph:  set_size 00000000457286fe 253952 -> 258048
30819 <7>[72062.267635] ceph:  00000000457286fe set_page_dirty 00000000b81992fe idx 62 head 6/6 -> 7/7 snapc 00000000f69ffd89 seq 1 (0 snaps)
30820 <7>[72062.267660] ceph:  write_end file 00000000b0595dbb inode 00000000457286fe page 00000000646abb31 258048~4096 (4096)
30821 <7>[72062.267672] ceph:  set_size 00000000457286fe 258048 -> 262144
30822 <7>[72062.267680] ceph:  00000000457286fe set_page_dirty 00000000111e20f4 idx 63 head 7/7 -> 8/8 snapc 00000000f69ffd89 seq 1 (0 snaps)
30823 <7>[72062.267697] ceph:  __mark_dirty_caps 00000000457286fe Fw dirty - -> Fw

But I am still not sure why those 8 dirty pages end up writing zeros to
the file.
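
As a cross-check of the write_end lines above, here is a minimal sketch
that splits a buffered write into per-page chunks, assuming 4096-byte
pages. sketch_split_write() and struct sketch_chunk are hypothetical
names for illustration and are not the kernel's write path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4096-byte pages. */
#define SKETCH_PAGE_SIZE 4096ULL

/* One per-page piece of a write, in the "pos~len" style of the log. */
struct sketch_chunk {
	uint64_t idx; /* page index */
	uint64_t pos; /* file offset where the chunk starts */
	uint64_t len; /* number of bytes written into this page */
};

/*
 * Split the byte range [pos, pos + len) at page boundaries.
 * Stores at most 'max' chunks in 'out' and returns how many were stored.
 */
static size_t sketch_split_write(uint64_t pos, uint64_t len,
				 struct sketch_chunk *out, size_t max)
{
	size_t n = 0;

	assert(out != NULL);
	while (len > 0 && n < max) {
		/* Bytes left between 'pos' and the end of its page. */
		uint64_t chunk = SKETCH_PAGE_SIZE - (pos % SKETCH_PAGE_SIZE);

		if (chunk > len)
			chunk = len;
		out[n].idx = pos / SKETCH_PAGE_SIZE;
		out[n].pos = pos;
		out[n].len = chunk;
		n++;
		pos += chunk;
		len -= chunk;
	}
	return n;
}
```

For the aio_write at 233408~28736 this gives 8 chunks: 233408~64 in page
56, then full 4096-byte chunks for pages 57 through 63, which lines up
with the write_end and set_page_dirty entries in the log.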



> ...
> 18760 <7>[61485.386715] ceph:  sync_read on inode 000000003e6c8932 
> 258048~4096
> 18761 <7>[61485.386784] ceph:  client4220 send metrics to mds0
> 18762 <7>[61485.389512] ceph:  sync_read 258048~4096 got 4096 i_size 
> 262144
> 18763 <7>[61485.389569] ceph:  sync_read result 4096 retry_op 2
> 18764 <7>[61485.389581] ceph:  put_cap_refs 000000003e6c8932 had Fr last
>
>
> I see in fill_fscrypt_truncate() just before reading the last block it 
> has already trigerred and successfully flushed the dirty pages to the 
> OSD, but it seems those 8 pages' contents are zero.
>
> Is that possibly those 8 pages are not dirtied yet when we are 
> flushing it in fill_fscrypt_truncate() ?
>
> Thanks
>
> BRs
>
>
>
>
>
>
>> Thanks,


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v7 0/9] ceph: size handling for the fscrypt
  2021-11-08  6:04             ` Xiubo Li
@ 2021-11-08  8:24               ` Xiubo Li
  0 siblings, 0 replies; 25+ messages in thread
From: Xiubo Li @ 2021-11-08  8:24 UTC (permalink / raw)
  To: Jeff Layton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel


On 11/8/21 2:04 PM, Xiubo Li wrote:
>
> On 11/8/21 11:22 AM, Xiubo Li wrote:
> ...
>>>>
>>>>      $ sudo ./check -g quick -E ./ceph.exclude
>>>>
>>>> ...and ceph.exclude has:
>>>>
>>>> ceph/001
>>>> generic/003
>>>> generic/531
>>>> generic/538
>>>>
>>>> ...most of the exclusions are because they take a long time to run.
>>> Oh and I should say...most of the failures I've seen with this patchset
>>> are intermittent. I suspect there is some race condition we haven't
>>> addressed yet.
>>
>> The "generic/075" failed:
>>
>> [root@lxbceph1 xfstests]# ./check generic/075
>> FSTYP         -- ceph
>> PLATFORM      -- Linux/x86_64 lxbceph1 5.15.0-rc6+
>>
>> generic/075     [failed, exit status 1] - output mismatch (see 
>> /mnt/kcephfs/xfstests/results//generic/075.out.bad)
>>     --- tests/generic/075.out    2021-11-08 08:38:19.756822587 +0800
>>     +++ /mnt/kcephfs/xfstests/results//generic/075.out.bad 2021-11-08 
>> 09:19:14.570013209 +0800
>>     @@ -4,15 +4,4 @@
>>      -----------------------------------------------
>>      fsx.0 : -d -N numops -S 0
>>      -----------------------------------------------
>>     -
>>     ------------------------------------------------
>>     -fsx.1 : -d -N numops -S 0 -x
>>     ------------------------------------------------
>>     ...
>>     (Run 'diff -u tests/generic/075.out 
>> /mnt/kcephfs/xfstests/results//generic/075.out.bad'  to see the 
>> entire diff)
>> Ran: generic/075
>> Failures: generic/075
>> Failed 1 of 1 tests
>>
>>
>> From '075.0.fsxlog':
>>
>>
>>  84 122 trunc       from 0x40000 to 0x3ffd3
>>  85 123 mapread     0x2794d thru    0x2cb8c (0x5240 bytes)
>>  86 124 read        0x37b86 thru    0x3dc7b (0x60f6 bytes)
>>  87 READ BAD DATA: offset = 0x37b86, size = 0x60f6, fname = 075.0
>>  88 OFFSET  GOOD    BAD     RANGE
>>  89 0x38fc0 0x79b2  0x0000  0x00000
>>  90 operation# (mod 256) for the bad data unknown, check HOLE and 
>> EXTEND ops
>>  91 0x38fc1 0xb279  0x0000  0x00001
>>  92 operation# (mod 256) for the bad data unknown, check HOLE and 
>> EXTEND ops
>>  93 0x38fc2 0x791e  0x0000  0x00002
>>  94 operation# (mod 256) for the bad data unknown, check HOLE and 
>> EXTEND ops
>>  95 0x38fc3 0x1e79  0x0000  0x00003
>>  96 operation# (mod 256) for the bad data unknown, check HOLE and 
>> EXTEND ops
>>  97 0x38fc4 0x79e0  0x0000  0x00004
>>  98 operation# (mod 256) for the bad data unknown, check HOLE and 
>> EXTEND ops
>>  99 0x38fc5 0xe079  0x0000  0x00005
>> 100 operation# (mod 256) for the bad data unknown, check HOLE and 
>> EXTEND ops
>> 101 0x38fc6 0x790b  0x0000  0x00006
>> 102 operation# (mod 256) for the bad data unknown, check HOLE and 
>> EXTEND ops
>> 103 0x38fc7 0x0b79  0x0000  0x00007
>> 104 operation# (mod 256) for the bad data unknown, check HOLE and 
>> EXTEND ops
>> 105 0x38fc8 0x7966  0x0000  0x00008
>> 106 operation# (mod 256) for the bad data unknown, check HOLE and 
>> EXTEND ops
>> 107 0x38fc9 0x6679  0x0000  0x00009
>> 108 operation# (mod 256) for the bad data unknown, check HOLE and 
>> EXTEND ops
>> 109 0x38fca 0x79ff  0x0000  0x0000a
>> 110 operation# (mod 256) for the bad data unknown, check HOLE and 
>> EXTEND ops
>> 111 0x38fcb 0xff79  0x0000  0x0000b
>> 112 operation# (mod 256) for the bad data unknown, check HOLE and 
>> EXTEND ops
>> 113 0x38fcc 0x7996  0x0000  0x0000c
>> ...
>>
>>
>> I have dumped the '075.0.full', it's the same with the '075.out.bad'.
>>
>> Checked the diff '075.0.good' and '075.0.bad', it shows that from the 
>> file offset 0x038fc0~i_size the contents are all zero in the 
>> '075.0.bad'. The '075.0.good is not.
>>
>> From the '/proc/kmsg' output:
>>
>> 18715 <7>[61484.334994] ceph:  fill_fscrypt_truncate size 262144 -> 
>> 262099 got cap refs on Fr, issued pAsxLsXsxFsxcrwb
>> 18716 <7>[61484.335010] ceph:  writepages_start 000000003e6c8932 
>> (mode=ALL)
>> 18717 <7>[61484.335021] ceph:   head snapc 000000003195bf7d has 8 
>> dirty pages
>> 18718 <7>[61484.335030] ceph:   oldest snapc is 000000003195bf7d seq 
>> 1 (0 snaps)
>> 18719 <7>[61484.335041] ceph:   not cyclic, 0 to 2251799813685247
>> 18720 <7>[61484.335054] ceph:  pagevec_lookup_range_tag got 8
>> 18721 <7>[61484.335063] ceph:  ? 000000007350de9f idx 56
>> 18722 <7>[61484.335139] ceph:  000000003e6c8932 will write page 
>> 000000007350de9f idx 56
>> 18723 <7>[61484.335151] ceph:  ? 00000000db5774fb idx 57
>> 18724 <7>[61484.335162] ceph:  000000003e6c8932 will write page 
>> 00000000db5774fb idx 57
>> 18725 <7>[61484.335173] ceph:  ? 000000008bc9ea57 idx 58
>> 18726 <7>[61484.335183] ceph:  000000003e6c8932 will write page 
>> 000000008bc9ea57 idx 58
>> 18727 <7>[61484.335194] ceph:  ? 00000000be4c1d25 idx 59
>> 18728 <7>[61484.335204] ceph:  000000003e6c8932 will write page 
>> 00000000be4c1d25 idx 59
>> 18729 <7>[61484.335215] ceph:  ? 0000000051d6fed1 idx 60
>> 18730 <7>[61484.335225] ceph:  000000003e6c8932 will write page 
>> 0000000051d6fed1 idx 60
>> 18731 <7>[61484.335237] ceph:  ? 00000000f40c8a7a idx 61
>> 18732 <7>[61484.335254] ceph:  000000003e6c8932 will write page 
>> 00000000f40c8a7a idx 61
>> 18733 <7>[61484.335274] ceph:  ? 00000000c7da9df6 idx 62
>> 18734 <7>[61484.335291] ceph:  000000003e6c8932 will write page 
>> 00000000c7da9df6 idx 62
>> 18735 <7>[61484.335312] ceph:  ? 00000000646abb31 idx 63
>> 18736 <7>[61484.335330] ceph:  000000003e6c8932 will write page 
>> 00000000646abb31 idx 63
>> 18737 <7>[61484.335344] ceph:  reached end pvec, trying for more
>> 18738 <7>[61484.335352] ceph:  pagevec_lookup_range_tag got 0
>> 18739 <7>[61484.336008] ceph:  writepages got pages at 229376~32768
>> 18740 <7>[61484.336136] ceph:  pagevec_release on 0 pages 
>> (0000000000000000)
>> 18741 <7>[61484.336157] ceph:  pagevec_lookup_range_tag got 0
>> 18742 <7>[61484.336172] ceph:  writepages dend - startone, rc = 0
>> 18743 <7>[61484.348123] ceph:  writepages_finish 000000003e6c8932 rc 0
>>
> Before this I can see there has one aio_write will update the file and 
> write/dirty the above 8 pages:
>
> 30766 <7>[72062.257479] ceph:  aio_write 00000000457286fe 
> 1000000b1b7.fffffffffffffffe 233408~28736 getting caps. i_size 53014
> 30767 <7>[72062.257491] ceph:  get_cap_refs 00000000457286fe need Fw 
> want Fb
> 30768 <7>[72062.257499] ceph:  __ceph_caps_issued 00000000457286fe cap 
> 0000000075fd8906 issued pAsxLsXsxFscb
> 30769 <7>[72062.257507] ceph:  get_cap_refs 00000000457286fe have 
> pAsxLsXsxFscb need Fw
> ...
>
> 30795 <7>[72062.267240] ceph:  aio_write 00000000457286fe 
> 1000000b1b7.fffffffffffffffe 233408~28736 got cap refs on Fwb
> 30796 <7>[72062.267248] ceph:  __unregister_request 00000000cce16c34 
> tid 24
> 30797 <7>[72062.267254] ceph:  got safe reply 24, mds0
> 30798 <7>[72062.267272] ceph:  write_end file 00000000b0595dbb inode 
> 00000000457286fe page 000000007350de9f 233408~64 (64)
> 30799 <7>[72062.267287] ceph:  set_size 00000000457286fe 53014 -> 233472
> 30800 <7>[72062.267297] ceph:  00000000457286fe set_page_dirty 
> 00000000d20754ba idx 56 head 0/0 -> 1/1 snapc 00000000f69ffd89 seq 1 
> (0 snaps)
> 30801 <7>[72062.267322] ceph:  write_end file 00000000b0595dbb inode 
> 00000000457286fe page 00000000db5774fb 233472~4096 (4096)
> 30802 <7>[72062.267335] ceph:  set_size 00000000457286fe 233472 -> 237568
> 30803 <7>[72062.267344] ceph:  00000000457286fe set_page_dirty 
> 00000000cf1abc39 idx 57 head 1/1 -> 2/2 snapc 00000000f69ffd89 seq 1 
> (0 snaps)
> 30804 <7>[72062.267380] ceph:  write_end file 00000000b0595dbb inode 
> 00000000457286fe page 000000008bc9ea57 237568~4096 (4096)
> 30805 <7>[72062.267393] ceph:  set_size 00000000457286fe 237568 -> 241664
> 30806 <7>[72062.267401] ceph:  00000000457286fe set_page_dirty 
> 00000000b55a5d0e idx 58 head 2/2 -> 3/3 snapc 00000000f69ffd89 seq 1 
> (0 snaps)
> 30807 <7>[72062.267417] ceph:  put_cap_refs 00000000457286fe had p
> 30808 <7>[72062.267423] ceph:  write_end file 00000000b0595dbb inode 
> 00000000457286fe page 00000000be4c1d25 241664~4096 (4096)
> 30809 <7>[72062.267435] ceph:  set_size 00000000457286fe 241664 -> 245760
> 30810 <7>[72062.267444] ceph:  00000000457286fe set_page_dirty 
> 00000000810c0300 idx 59 head 3/3 -> 4/4 snapc 00000000f69ffd89 seq 1 
> (0 snaps)
> 30811 <7>[72062.267473] ceph:  write_end file 00000000b0595dbb inode 
> 00000000457286fe page 0000000051d6fed1 245760~4096 (4096)
> 30812 <7>[72062.267492] ceph:  set_size 00000000457286fe 245760 -> 249856
> 30813 <7>[72062.267506] ceph:  00000000457286fe set_page_dirty 
> 00000000b113b082 idx 60 head 4/4 -> 5/5 snapc 00000000f69ffd89 seq 1 
> (0 snaps)
> 30814 <7>[72062.267542] ceph:  write_end file 00000000b0595dbb inode 
> 00000000457286fe page 00000000f40c8a7a 249856~4096 (4096)
> 30815 <7>[72062.267563] ceph:  set_size 00000000457286fe 249856 -> 253952
> 30816 <7>[72062.267577] ceph:  00000000457286fe set_page_dirty 
> 00000000e52c4518 idx 61 head 5/5 -> 6/6 snapc 00000000f69ffd89 seq 1 
> (0 snaps)
> 30817 <7>[72062.267610] ceph:  write_end file 00000000b0595dbb inode 
> 00000000457286fe page 00000000c7da9df6 253952~4096 (4096)
> 30818 <7>[72062.267626] ceph:  set_size 00000000457286fe 253952 -> 258048
> 30819 <7>[72062.267635] ceph:  00000000457286fe set_page_dirty 
> 00000000b81992fe idx 62 head 6/6 -> 7/7 snapc 00000000f69ffd89 seq 1 
> (0 snaps)
> 30820 <7>[72062.267660] ceph:  write_end file 00000000b0595dbb inode 
> 00000000457286fe page 00000000646abb31 258048~4096 (4096)
> 30821 <7>[72062.267672] ceph:  set_size 00000000457286fe 258048 -> 262144
> 30822 <7>[72062.267680] ceph:  00000000457286fe set_page_dirty 
> 00000000111e20f4 idx 63 head 7/7 -> 8/8 snapc 00000000f69ffd89 seq 1 
> (0 snaps)
> 30823 <7>[72062.267697] ceph:  __mark_dirty_caps 00000000457286fe Fw 
> dirty - -> Fw
>
> But still not sure why those 8 dirty pages still writing 0 to the files.
>
>
Please ignore the above issue; I was using an old branch of the Ceph MDS code.

BRs

Thanks

>
>> ...
>> 18760 <7>[61485.386715] ceph:  sync_read on inode 000000003e6c8932 
>> 258048~4096
>> 18761 <7>[61485.386784] ceph:  client4220 send metrics to mds0
>> 18762 <7>[61485.389512] ceph:  sync_read 258048~4096 got 4096 i_size 
>> 262144
>> 18763 <7>[61485.389569] ceph:  sync_read result 4096 retry_op 2
>> 18764 <7>[61485.389581] ceph:  put_cap_refs 000000003e6c8932 had Fr last
>>
>>
>> I see in fill_fscrypt_truncate() just before reading the last block 
>> it has already trigerred and successfully flushed the dirty pages to 
>> the OSD, but it seems those 8 pages' contents are zero.
>>
>> Is that possibly those 8 pages are not dirtied yet when we are 
>> flushing it in fill_fscrypt_truncate() ?
>>
>> Thanks
>>
>> BRs
>>
>>
>>
>>
>>
>>
>>> Thanks,



* Re: [PATCH v7 9/9] ceph: add truncate size handling support for fscrypt
  2021-11-05 14:22 ` [PATCH v7 9/9] ceph: add truncate size handling support for fscrypt xiubli
@ 2021-11-08 11:42   ` Xiubo Li
  2021-11-08 12:49   ` Xiubo Li
  1 sibling, 0 replies; 25+ messages in thread
From: Xiubo Li @ 2021-11-08 11:42 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel


On 11/5/21 10:22 PM, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
>
> This will transfer the encrypted last block contents to the MDS
> along with the truncate request only when the new size is smaller
> and not aligned to the fscrypt BLOCK size. When the last block is
> located in the file hole, the truncate request will only contain
> the header.
>
> The MDS could fail to do the truncate if there has another client
> or process has already updated the Rados object which contains
> the last block, and will return -EAGAIN, then the kclient needs
> to retry it. The RMW will take around 50ms, and will let it retry
> 20 times for now.
>
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>   fs/ceph/crypto.h |  21 +++++
>   fs/ceph/inode.c  | 210 +++++++++++++++++++++++++++++++++++++++++++----
>   fs/ceph/super.h  |   5 ++
>   3 files changed, 222 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ceph/crypto.h b/fs/ceph/crypto.h
> index ab27a7ed62c3..393c308e8fc2 100644
> --- a/fs/ceph/crypto.h
> +++ b/fs/ceph/crypto.h
> @@ -25,6 +25,27 @@ struct ceph_fname {
>   	u32		ctext_len;	// length of crypttext
>   };
>   
> +/*
> + * Header for the crypted file when truncating the size, this
> + * will be sent to MDS, and the MDS will update the encrypted
> + * last block and then truncate the size.
> + */
> +struct ceph_fscrypt_truncate_size_header {
> +       __u8  ver;
> +       __u8  compat;
> +
> +       /*
> +	* It will be sizeof(assert_ver + file_offset + block_size)
> +	* if the last block is empty when it's located in a file
> +	* hole. Or the data_len will plus CEPH_FSCRYPT_BLOCK_SIZE.
> +	*/
> +       __le32 data_len;
> +
> +       __le64 assert_ver;
> +       __le64 file_offset;
> +       __le32 block_size;
> +} __packed;
> +
>   struct ceph_fscrypt_auth {
>   	__le32	cfa_version;
>   	__le32	cfa_blob_len;
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 15c2fb1e2c8a..eebbd0296004 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -586,6 +586,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
>   	ci->i_truncate_seq = 0;
>   	ci->i_truncate_size = 0;
>   	ci->i_truncate_pending = 0;
> +	ci->i_truncate_pagecache_size = 0;
>   
>   	ci->i_max_size = 0;
>   	ci->i_reported_size = 0;
> @@ -751,6 +752,10 @@ int ceph_fill_file_size(struct inode *inode, int issued,
>   		dout("truncate_size %lld -> %llu\n", ci->i_truncate_size,
>   		     truncate_size);
>   		ci->i_truncate_size = truncate_size;
> +		if (IS_ENCRYPTED(inode))
> +			ci->i_truncate_pagecache_size = size;
> +		else
> +			ci->i_truncate_pagecache_size = truncate_size;
>   	}
>   
>   	if (queue_trunc)
> @@ -1011,7 +1016,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>   
>   	if (new_version ||
>   	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
> -		u64 size = info->size;
> +		u64 size = le64_to_cpu(info->size);
>   		s64 old_pool = ci->i_layout.pool_id;
>   		struct ceph_string *old_ns;
>   
> @@ -1026,16 +1031,20 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>   		pool_ns = old_ns;
>   
>   		if (IS_ENCRYPTED(inode) && size &&
> -		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
> -			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
> -			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
> -				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
> +		    (iinfo->fscrypt_file_len >= sizeof(__le64))) {
> +			u64 fsize = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
> +			if (fsize) {
> +				size = fsize;
> +				if (le64_to_cpu(info->size) !=
> +				    round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
> +					pr_warn("size=%llu fscrypt_file=%llu\n",
> +						info->size, size);
> +			}
>   		}
>   
>   		queue_trunc = ceph_fill_file_size(inode, issued,
>   					le32_to_cpu(info->truncate_seq),
> -					le64_to_cpu(info->truncate_size),
> -					le64_to_cpu(size));
> +					le64_to_cpu(info->truncate_size), size);
>   		/* only update max_size on auth cap */
>   		if ((info->cap.flags & CEPH_CAP_FLAG_AUTH) &&
>   		    ci->i_max_size != le64_to_cpu(info->max_size)) {
> @@ -2142,7 +2151,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
>   	/* there should be no reader or writer */
>   	WARN_ON_ONCE(ci->i_rd_ref || ci->i_wr_ref);
>   
> -	to = ci->i_truncate_size;
> +	to = ci->i_truncate_pagecache_size;
>   	wrbuffer_refs = ci->i_wrbuffer_ref;
>   	dout("__do_pending_vmtruncate %p (%d) to %lld\n", inode,
>   	     ci->i_truncate_pending, to);
> @@ -2151,7 +2160,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
>   	truncate_pagecache(inode, to);
>   
>   	spin_lock(&ci->i_ceph_lock);
> -	if (to == ci->i_truncate_size) {
> +	if (to == ci->i_truncate_pagecache_size) {
>   		ci->i_truncate_pending = 0;
>   		finish = 1;
>   	}
> @@ -2232,6 +2241,141 @@ static const struct inode_operations ceph_encrypted_symlink_iops = {
>   	.listxattr = ceph_listxattr,
>   };
>   
> +/*
> + * Transfer the encrypted last block to the MDS and the MDS
> + * will help update it when truncating a smaller size.
> + *
> + * We don't support a PAGE_SIZE that is smaller than the
> + * CEPH_FSCRYPT_BLOCK_SIZE.
> + */
> +static int fill_fscrypt_truncate(struct inode *inode,
> +				 struct ceph_mds_request *req,
> +				 struct iattr *attr)
> +{
> +	struct ceph_inode_info *ci = ceph_inode(inode);
> +	int boff = attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE;
> +	loff_t pos, orig_pos = round_down(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE);
> +#if 0
> +	u64 block = orig_pos >> CEPH_FSCRYPT_BLOCK_SHIFT;
> +#endif
> +	struct ceph_pagelist *pagelist = NULL;
> +	struct kvec iov;
> +	struct iov_iter iter;
> +	struct page *page = NULL;
> +	struct ceph_fscrypt_truncate_size_header header;
> +	int retry_op = 0;
> +	int len = CEPH_FSCRYPT_BLOCK_SIZE;
> +	loff_t i_size = i_size_read(inode);
> +	struct ceph_object_vers objvers = {0, NULL};
> +	int got, ret, issued;
> +
> +	ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got);
> +	if (ret < 0)
> +		return ret;
> +
> +	issued = __ceph_caps_issued(ci, NULL);
> +
> +	dout("%s size %lld -> %lld got cap refs on %s, issued %s\n", __func__,
> +	     i_size, attr->ia_size, ceph_cap_string(got),
> +	     ceph_cap_string(issued));
> +
> +	/* Try to writeback the dirty pagecaches */
> +	if (issued & (CEPH_CAP_FILE_BUFFER))
> +		filemap_fdatawrite(&inode->i_data);
> +
> +	page = __page_cache_alloc(GFP_KERNEL);
> +	if (page == NULL) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
> +	if (!pagelist) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	iov.iov_base = kmap_local_page(page);
> +	iov.iov_len = len;
> +	iov_iter_kvec(&iter, READ, &iov, 1, len);
> +
> +	pos = orig_pos;
> +	ret = __ceph_sync_read(inode, &pos, &iter, &retry_op, &objvers);
> +	ceph_put_cap_refs(ci, got);
> +	if (ret < 0)
> +		goto out;
> +
> +	WARN_ON_ONCE(objvers.count != 1);
> +
> +	/* Insert the header first */
> +	header.ver = 1;
> +	header.compat = 1;
> +
> +	/*
> +	 * If we hit a hole here, we should just skip filling
> +	 * the fscrypt for the request, because once the fscrypt
> +	 * is enabled, the file will be split into many blocks
> +	 * with the size of CEPH_FSCRYPT_BLOCK_SIZE, if there
> +	 * has a hole, the hole size should be multiple of block
> +	 * size.
> +	 *
> +	 * If the Rados object doesn't exist, it will be set 0.
> +	 */
> +	if (!objvers.objvers[0].objver) {
> +		dout("%s hit hole, ppos %lld < size %lld\n", __func__,
> +		     pos, i_size);
> +
> +		header.data_len = cpu_to_le32(8 + 8 + 4);
> +		header.assert_ver = 0;
> +		header.file_offset = 0;
> +		header.block_size = 0;
> +		ret = 0;
> +	} else {
> +		header.data_len = cpu_to_le32(8 + 8 + 4 + CEPH_FSCRYPT_BLOCK_SIZE);
> +		header.assert_ver = cpu_to_le64(objvers.objvers[0].objver);
> +		header.file_offset = cpu_to_le64(orig_pos);
> +		header.block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
> +
> +		/* truncate and zero out the extra contents for the last block */
> +		memset(iov.iov_base + boff, 0, PAGE_SIZE - boff);
> +
> +#if 0 // Uncomment this when the fscrypt is enabled globally in kceph
> +
> +		/* encrypt the last block */
> +		ret = fscrypt_encrypt_block_inplace(inode, page,
> +						    CEPH_FSCRYPT_BLOCK_SIZE,
> +						    0, block,
> +						    GFP_KERNEL);
> +		if (ret)
> +			goto out;
> +#endif
> +	}
> +
> +	/* Insert the header */
> +	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
> +	if (ret)
> +		goto out;
> +
> +	if (header.block_size) {
> +		/* Append the last block contents to pagelist */
> +		ret = ceph_pagelist_append(pagelist, iov.iov_base,
> +					   CEPH_FSCRYPT_BLOCK_SIZE);
> +		if (ret)
> +			goto out;
> +	}
> +	req->r_pagelist = pagelist;
> +out:
> +	dout("%s %p size dropping cap refs on %s\n", __func__,
> +	     inode, ceph_cap_string(got));
> +	kunmap_local(iov.iov_base);
> +	if (page)
> +		__free_pages(page, 0);
> +	if (ret && pagelist)
> +		ceph_pagelist_release(pagelist);
> +	kfree(objvers.objvers);
> +	return ret;
> +}
> +
>   int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *cia)
>   {
>   	struct ceph_inode_info *ci = ceph_inode(inode);
> @@ -2239,12 +2383,15 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   	struct ceph_mds_request *req;
>   	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
>   	struct ceph_cap_flush *prealloc_cf;
> +	loff_t isize = i_size_read(inode);
>   	int issued;
>   	int release = 0, dirtied = 0;
>   	int mask = 0;
>   	int err = 0;
>   	int inode_dirty_flags = 0;
>   	bool lock_snap_rwsem = false;
> +	bool fill_fscrypt;
> +	int truncate_retry = 20; /* The RMW will take around 50ms */
>   
>   	prealloc_cf = ceph_alloc_cap_flush();
>   	if (!prealloc_cf)
> @@ -2257,6 +2404,8 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   		return PTR_ERR(req);
>   	}
>   
> +retry:
> +	fill_fscrypt = false;
>   	spin_lock(&ci->i_ceph_lock);
>   	issued = __ceph_caps_issued(ci, NULL);
>   
> @@ -2378,10 +2527,27 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   		}
>   	}
>   	if (ia_valid & ATTR_SIZE) {
> -		loff_t isize = i_size_read(inode);
> -
>   		dout("setattr %p size %lld -> %lld\n", inode, isize, attr->ia_size);
> -		if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
> +		/*
> +		 * Only when the new size is smaller and not aligned to
> +		 * CEPH_FSCRYPT_BLOCK_SIZE will the RMW is needed.
> +		 */
> +		if (IS_ENCRYPTED(inode) && attr->ia_size < isize &&
> +		    (attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE)) {
> +			mask |= CEPH_SETATTR_SIZE;
> +			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
> +				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
> +			set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
> +			mask |= CEPH_SETATTR_FSCRYPT_FILE;
> +			req->r_args.setattr.size =
> +				cpu_to_le64(round_up(attr->ia_size,
> +						     CEPH_FSCRYPT_BLOCK_SIZE));
> +			req->r_args.setattr.old_size =
> +				cpu_to_le64(round_up(isize,
> +						     CEPH_FSCRYPT_BLOCK_SIZE));
> +			req->r_fscrypt_file = attr->ia_size;
> +			fill_fscrypt = true;
> +		} else if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
>   			if (attr->ia_size > isize) {
>   				i_size_write(inode, attr->ia_size);
>   				inode->i_blocks = calc_inode_blocks(attr->ia_size);
> @@ -2404,7 +2570,6 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   					cpu_to_le64(round_up(isize,
>   							     CEPH_FSCRYPT_BLOCK_SIZE));
>   				req->r_fscrypt_file = attr->ia_size;
> -				/* FIXME: client must zero out any partial blocks! */
>   			} else {
>   				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>   				req->r_args.setattr.old_size = cpu_to_le64(isize);
> @@ -2476,7 +2641,6 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   	if (inode_dirty_flags)
>   		__mark_inode_dirty(inode, inode_dirty_flags);
>   
> -
>   	if (mask) {
>   		req->r_inode = inode;
>   		ihold(inode);
> @@ -2484,7 +2648,25 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   		req->r_args.setattr.mask = cpu_to_le32(mask);
>   		req->r_num_caps = 1;
>   		req->r_stamp = attr->ia_ctime;
> +		if (fill_fscrypt) {
> +			err = fill_fscrypt_truncate(inode, req, attr);
> +			if (err)
> +				goto out;
> +		}
> +
> +		/*
> +		 * The truncate request will return -EAGAIN when the
> +		 * last block has been updated just before the MDS
> +		 * successfully gets the xlock for the FILE lock. To
> +		 * avoid corrupting the file contents we need to retry
> +		 * it.
> +		 */
>   		err = ceph_mdsc_do_request(mdsc, NULL, req);
> +		if (err == -EAGAIN && truncate_retry--) {
> +			dout("setattr %p result=%d (%s locally, %d remote), retry it!\n",
> +			     inode, err, ceph_cap_string(dirtied), mask);
> +			goto retry;

Here, before retrying, we should put the request and release 'prealloc_cf'.
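
To illustrate the point, here is a minimal userspace sketch of a bounded
retry loop that drops both per-attempt objects before going around
again. This is not the kernel code: obj_alloc(), obj_put(),
fake_do_request() and setattr_with_retry() are hypothetical stand-ins
for the request and 'prealloc_cf' handling, and the live_objects counter
only exists so that a leak would be visible:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int live_objects; /* allocations that have not been released yet */

/* Hypothetical stand-in for allocating a request or a prealloc_cf. */
static void *obj_alloc(size_t sz)
{
	void *p = calloc(1, sz);

	if (p)
		live_objects++;
	return p;
}

/* Hypothetical stand-in for putting the request / freeing prealloc_cf. */
static void obj_put(void *p)
{
	if (p) {
		free(p);
		live_objects--;
	}
}

/* Stand-in for the MDS round trip: returns -EAGAIN '*fail_budget' times. */
static int fake_do_request(int *fail_budget)
{
	if (*fail_budget > 0) {
		(*fail_budget)--;
		return -EAGAIN;
	}
	return 0;
}

/*
 * Each attempt allocates fresh objects and releases them before a
 * retry, so nothing from a failed attempt is leaked or reused.
 * At most 'max_retry' retries follow the first attempt.
 */
static int setattr_with_retry(int fail_first, int max_retry)
{
	int err;

	for (;;) {
		void *prealloc = obj_alloc(16);
		void *req;

		if (!prealloc)
			return -ENOMEM;
		req = obj_alloc(32);
		if (!req) {
			obj_put(prealloc);
			return -ENOMEM;
		}

		err = fake_do_request(&fail_first);

		/* Release both objects on every path, including retry. */
		obj_put(req);
		obj_put(prealloc);

		if (err != -EAGAIN || max_retry-- == 0)
			break;
	}
	return err;
}
```

With a budget of 20 retries, as in the patch, the sketch gives up with
-EAGAIN only after the 21st failed attempt, and live_objects is back to
zero on every exit path.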


> +		}
>   	}
>   out:
>   	dout("setattr %p result=%d (%s locally, %d remote)\n", inode, err,
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index b347b12e86a9..071857bb59d8 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -408,6 +408,11 @@ struct ceph_inode_info {
>   	u32 i_truncate_seq;        /* last truncate to smaller size */
>   	u64 i_truncate_size;       /*  and the size we last truncated down to */
>   	int i_truncate_pending;    /*  still need to call vmtruncate */
> +	/*
> +	 * For none fscrypt case it equals to i_truncate_size or it will
> +	 * equals to fscrypt_file_size
> +	 */
> +	u64 i_truncate_pagecache_size;
>   
>   	u64 i_max_size;            /* max file size authorized by mds */
>   	u64 i_reported_size; /* (max_)size reported to or requested of mds */



* Re: [PATCH v7 9/9] ceph: add truncate size handling support for fscrypt
  2021-11-05 14:22 ` [PATCH v7 9/9] ceph: add truncate size handling support for fscrypt xiubli
  2021-11-08 11:42   ` Xiubo Li
@ 2021-11-08 12:49   ` Xiubo Li
  2021-11-08 13:02     ` Jeff Layton
  1 sibling, 1 reply; 25+ messages in thread
From: Xiubo Li @ 2021-11-08 12:49 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel


On 11/5/21 10:22 PM, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
>
> This will transfer the encrypted last block contents to the MDS
> along with the truncate request only when the new size is smaller
> and not aligned to the fscrypt BLOCK size. When the last block is
> located in the file hole, the truncate request will only contain
> the header.
>
> The MDS could fail to do the truncate if there has another client
> or process has already updated the Rados object which contains
> the last block, and will return -EAGAIN, then the kclient needs
> to retry it. The RMW will take around 50ms, and will let it retry
> 20 times for now.
>
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>   fs/ceph/crypto.h |  21 +++++
>   fs/ceph/inode.c  | 210 +++++++++++++++++++++++++++++++++++++++++++----
>   fs/ceph/super.h  |   5 ++
>   3 files changed, 222 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ceph/crypto.h b/fs/ceph/crypto.h
> index ab27a7ed62c3..393c308e8fc2 100644
> --- a/fs/ceph/crypto.h
> +++ b/fs/ceph/crypto.h
> @@ -25,6 +25,27 @@ struct ceph_fname {
>   	u32		ctext_len;	// length of crypttext
>   };
>   
> +/*
> + * Header for the crypted file when truncating the size, this
> + * will be sent to MDS, and the MDS will update the encrypted
> + * last block and then truncate the size.
> + */
> +struct ceph_fscrypt_truncate_size_header {
> +       __u8  ver;
> +       __u8  compat;
> +
> +       /*
> +	* It will be sizeof(assert_ver + file_offset + block_size)
> +	* if the last block is empty when it's located in a file
> +	* hole. Or the data_len will plus CEPH_FSCRYPT_BLOCK_SIZE.
> +	*/
> +       __le32 data_len;
> +
> +       __le64 assert_ver;
> +       __le64 file_offset;
> +       __le32 block_size;
> +} __packed;
> +
>   struct ceph_fscrypt_auth {
>   	__le32	cfa_version;
>   	__le32	cfa_blob_len;
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 15c2fb1e2c8a..eebbd0296004 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -586,6 +586,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
>   	ci->i_truncate_seq = 0;
>   	ci->i_truncate_size = 0;
>   	ci->i_truncate_pending = 0;
> +	ci->i_truncate_pagecache_size = 0;
>   
>   	ci->i_max_size = 0;
>   	ci->i_reported_size = 0;
> @@ -751,6 +752,10 @@ int ceph_fill_file_size(struct inode *inode, int issued,
>   		dout("truncate_size %lld -> %llu\n", ci->i_truncate_size,
>   		     truncate_size);
>   		ci->i_truncate_size = truncate_size;
> +		if (IS_ENCRYPTED(inode))
> +			ci->i_truncate_pagecache_size = size;
> +		else
> +			ci->i_truncate_pagecache_size = truncate_size;
>   	}
>   
>   	if (queue_trunc)
> @@ -1011,7 +1016,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>   
>   	if (new_version ||
>   	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
> -		u64 size = info->size;
> +		u64 size = le64_to_cpu(info->size);
>   		s64 old_pool = ci->i_layout.pool_id;
>   		struct ceph_string *old_ns;
>   
> @@ -1026,16 +1031,20 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>   		pool_ns = old_ns;
>   
>   		if (IS_ENCRYPTED(inode) && size &&
> -		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
> -			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
> -			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
> -				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
> +		    (iinfo->fscrypt_file_len >= sizeof(__le64))) {
> +			u64 fsize = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
> +			if (fsize) {
> +				size = fsize;
> +				if (le64_to_cpu(info->size) !=
> +				    round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
> +					pr_warn("size=%llu fscrypt_file=%llu\n",
> +						info->size, size);
> +			}
>   		}
>   
>   		queue_trunc = ceph_fill_file_size(inode, issued,
>   					le32_to_cpu(info->truncate_seq),
> -					le64_to_cpu(info->truncate_size),
> -					le64_to_cpu(size));
> +					le64_to_cpu(info->truncate_size), size);
>   		/* only update max_size on auth cap */
>   		if ((info->cap.flags & CEPH_CAP_FLAG_AUTH) &&
>   		    ci->i_max_size != le64_to_cpu(info->max_size)) {
> @@ -2142,7 +2151,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
>   	/* there should be no reader or writer */
>   	WARN_ON_ONCE(ci->i_rd_ref || ci->i_wr_ref);
>   
> -	to = ci->i_truncate_size;
> +	to = ci->i_truncate_pagecache_size;
>   	wrbuffer_refs = ci->i_wrbuffer_ref;
>   	dout("__do_pending_vmtruncate %p (%d) to %lld\n", inode,
>   	     ci->i_truncate_pending, to);
> @@ -2151,7 +2160,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
>   	truncate_pagecache(inode, to);
>   
>   	spin_lock(&ci->i_ceph_lock);
> -	if (to == ci->i_truncate_size) {
> +	if (to == ci->i_truncate_pagecache_size) {
>   		ci->i_truncate_pending = 0;
>   		finish = 1;
>   	}
> @@ -2232,6 +2241,141 @@ static const struct inode_operations ceph_encrypted_symlink_iops = {
>   	.listxattr = ceph_listxattr,
>   };
>   
> +/*
> + * Transfer the encrypted last block to the MDS and the MDS
> + * will help update it when truncating a smaller size.
> + *
> + * We don't support a PAGE_SIZE that is smaller than the
> + * CEPH_FSCRYPT_BLOCK_SIZE.
> + */
> +static int fill_fscrypt_truncate(struct inode *inode,
> +				 struct ceph_mds_request *req,
> +				 struct iattr *attr)
> +{
> +	struct ceph_inode_info *ci = ceph_inode(inode);
> +	int boff = attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE;
> +	loff_t pos, orig_pos = round_down(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE);
> +#if 0
> +	u64 block = orig_pos >> CEPH_FSCRYPT_BLOCK_SHIFT;
> +#endif
> +	struct ceph_pagelist *pagelist = NULL;
> +	struct kvec iov;
> +	struct iov_iter iter;
> +	struct page *page = NULL;
> +	struct ceph_fscrypt_truncate_size_header header;
> +	int retry_op = 0;
> +	int len = CEPH_FSCRYPT_BLOCK_SIZE;
> +	loff_t i_size = i_size_read(inode);
> +	struct ceph_object_vers objvers = {0, NULL};
> +	int got, ret, issued;
> +
> +	ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got);
> +	if (ret < 0)
> +		return ret;
> +
> +	issued = __ceph_caps_issued(ci, NULL);
> +
> +	dout("%s size %lld -> %lld got cap refs on %s, issued %s\n", __func__,
> +	     i_size, attr->ia_size, ceph_cap_string(got),
> +	     ceph_cap_string(issued));
> +
> +	/* Try to writeback the dirty pagecaches */
> +	if (issued & (CEPH_CAP_FILE_BUFFER))
> +		filemap_fdatawrite(&inode->i_data);

We may need to wait here to make sure the dirty pages have all been
written back to the OSD before we do the RMW. Otherwise, if there is too
much data to write back, the writeback may finish just after the
truncate. Would the dirty data be lost in that case?


> +
> +	page = __page_cache_alloc(GFP_KERNEL);
> +	if (page == NULL) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
> +	if (!pagelist) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	iov.iov_base = kmap_local_page(page);
> +	iov.iov_len = len;
> +	iov_iter_kvec(&iter, READ, &iov, 1, len);
> +
> +	pos = orig_pos;
> +	ret = __ceph_sync_read(inode, &pos, &iter, &retry_op, &objvers);
> +	ceph_put_cap_refs(ci, got);
> +	if (ret < 0)
> +		goto out;
> +
> +	WARN_ON_ONCE(objvers.count != 1);
> +
> +	/* Insert the header first */
> +	header.ver = 1;
> +	header.compat = 1;
> +
> +	/*
> +	 * If we hit a hole here, we should just skip filling
> +	 * the fscrypt for the request, because once the fscrypt
> +	 * is enabled, the file will be split into many blocks
> +	 * with the size of CEPH_FSCRYPT_BLOCK_SIZE, if there
> +	 * has a hole, the hole size should be multiple of block
> +	 * size.
> +	 *
> +	 * If the Rados object doesn't exist, it will be set 0.
> +	 */
> +	if (!objvers.objvers[0].objver) {
> +		dout("%s hit hole, ppos %lld < size %lld\n", __func__,
> +		     pos, i_size);
> +
> +		header.data_len = cpu_to_le32(8 + 8 + 4);
> +		header.assert_ver = 0;
> +		header.file_offset = 0;
> +		header.block_size = 0;
> +		ret = 0;
> +	} else {
> +		header.data_len = cpu_to_le32(8 + 8 + 4 + CEPH_FSCRYPT_BLOCK_SIZE);
> +		header.assert_ver = cpu_to_le64(objvers.objvers[0].objver);
> +		header.file_offset = cpu_to_le64(orig_pos);
> +		header.block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
> +
> +		/* truncate and zero out the extra contents for the last block */
> +		memset(iov.iov_base + boff, 0, PAGE_SIZE - boff);
> +
> +#if 0 // Uncomment this when the fscrypt is enabled globally in kceph
> +
> +		/* encrypt the last block */
> +		ret = fscrypt_encrypt_block_inplace(inode, page,
> +						    CEPH_FSCRYPT_BLOCK_SIZE,
> +						    0, block,
> +						    GFP_KERNEL);
> +		if (ret)
> +			goto out;
> +#endif
> +	}
> +
> +	/* Insert the header */
> +	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
> +	if (ret)
> +		goto out;
> +
> +	if (header.block_size) {
> +		/* Append the last block contents to pagelist */
> +		ret = ceph_pagelist_append(pagelist, iov.iov_base,
> +					   CEPH_FSCRYPT_BLOCK_SIZE);
> +		if (ret)
> +			goto out;
> +	}
> +	req->r_pagelist = pagelist;
> +out:
> +	dout("%s %p size dropping cap refs on %s\n", __func__,
> +	     inode, ceph_cap_string(got));
> +	kunmap_local(iov.iov_base);
> +	if (page)
> +		__free_pages(page, 0);
> +	if (ret && pagelist)
> +		ceph_pagelist_release(pagelist);
> +	kfree(objvers.objvers);
> +	return ret;
> +}
> +
>   int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *cia)
>   {
>   	struct ceph_inode_info *ci = ceph_inode(inode);
> @@ -2239,12 +2383,15 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   	struct ceph_mds_request *req;
>   	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
>   	struct ceph_cap_flush *prealloc_cf;
> +	loff_t isize = i_size_read(inode);
>   	int issued;
>   	int release = 0, dirtied = 0;
>   	int mask = 0;
>   	int err = 0;
>   	int inode_dirty_flags = 0;
>   	bool lock_snap_rwsem = false;
> +	bool fill_fscrypt;
> +	int truncate_retry = 20; /* The RMW will take around 50ms */
>   
>   	prealloc_cf = ceph_alloc_cap_flush();
>   	if (!prealloc_cf)
> @@ -2257,6 +2404,8 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   		return PTR_ERR(req);
>   	}
>   
> +retry:
> +	fill_fscrypt = false;
>   	spin_lock(&ci->i_ceph_lock);
>   	issued = __ceph_caps_issued(ci, NULL);
>   
> @@ -2378,10 +2527,27 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   		}
>   	}
>   	if (ia_valid & ATTR_SIZE) {
> -		loff_t isize = i_size_read(inode);
> -
>   		dout("setattr %p size %lld -> %lld\n", inode, isize, attr->ia_size);
> -		if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
> +		/*
> +		 * Only when the new size is smaller and not aligned to
> +		 * CEPH_FSCRYPT_BLOCK_SIZE will the RMW is needed.
> +		 */
> +		if (IS_ENCRYPTED(inode) && attr->ia_size < isize &&
> +		    (attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE)) {
> +			mask |= CEPH_SETATTR_SIZE;
> +			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
> +				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
> +			set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
> +			mask |= CEPH_SETATTR_FSCRYPT_FILE;
> +			req->r_args.setattr.size =
> +				cpu_to_le64(round_up(attr->ia_size,
> +						     CEPH_FSCRYPT_BLOCK_SIZE));
> +			req->r_args.setattr.old_size =
> +				cpu_to_le64(round_up(isize,
> +						     CEPH_FSCRYPT_BLOCK_SIZE));
> +			req->r_fscrypt_file = attr->ia_size;
> +			fill_fscrypt = true;
> +		} else if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
>   			if (attr->ia_size > isize) {
>   				i_size_write(inode, attr->ia_size);
>   				inode->i_blocks = calc_inode_blocks(attr->ia_size);
> @@ -2404,7 +2570,6 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   					cpu_to_le64(round_up(isize,
>   							     CEPH_FSCRYPT_BLOCK_SIZE));
>   				req->r_fscrypt_file = attr->ia_size;
> -				/* FIXME: client must zero out any partial blocks! */
>   			} else {
>   				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>   				req->r_args.setattr.old_size = cpu_to_le64(isize);
> @@ -2476,7 +2641,6 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   	if (inode_dirty_flags)
>   		__mark_inode_dirty(inode, inode_dirty_flags);
>   
> -
>   	if (mask) {
>   		req->r_inode = inode;
>   		ihold(inode);
> @@ -2484,7 +2648,25 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   		req->r_args.setattr.mask = cpu_to_le32(mask);
>   		req->r_num_caps = 1;
>   		req->r_stamp = attr->ia_ctime;
> +		if (fill_fscrypt) {
> +			err = fill_fscrypt_truncate(inode, req, attr);
> +			if (err)
> +				goto out;
> +		}
> +
> +		/*
> +		 * The truncate request will return -EAGAIN when the
> +		 * last block has been updated just before the MDS
> +		 * successfully gets the xlock for the FILE lock. To
> +		 * avoid corrupting the file contents we need to retry
> +		 * it.
> +		 */
>   		err = ceph_mdsc_do_request(mdsc, NULL, req);
> +		if (err == -EAGAIN && truncate_retry--) {
> +			dout("setattr %p result=%d (%s locally, %d remote), retry it!\n",
> +			     inode, err, ceph_cap_string(dirtied), mask);
> +			goto retry;
> +		}
>   	}
>   out:
>   	dout("setattr %p result=%d (%s locally, %d remote)\n", inode, err,
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index b347b12e86a9..071857bb59d8 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -408,6 +408,11 @@ struct ceph_inode_info {
>   	u32 i_truncate_seq;        /* last truncate to smaller size */
>   	u64 i_truncate_size;       /*  and the size we last truncated down to */
>   	int i_truncate_pending;    /*  still need to call vmtruncate */
> +	/*
> +	 * For none fscrypt case it equals to i_truncate_size or it will
> +	 * equals to fscrypt_file_size
> +	 */
> +	u64 i_truncate_pagecache_size;
>   
>   	u64 i_max_size;            /* max file size authorized by mds */
>   	u64 i_reported_size; /* (max_)size reported to or requested of mds */



* Re: [PATCH v7 9/9] ceph: add truncate size handling support for fscrypt
  2021-11-08 12:49   ` Xiubo Li
@ 2021-11-08 13:02     ` Jeff Layton
  2021-11-08 13:11       ` Xiubo Li
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2021-11-08 13:02 UTC (permalink / raw)
  To: Xiubo Li; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel

On Mon, 2021-11-08 at 20:49 +0800, Xiubo Li wrote:
> On 11/5/21 10:22 PM, xiubli@redhat.com wrote:
> > From: Xiubo Li <xiubli@redhat.com>
> > 
> > This will transfer the encrypted last block contents to the MDS
> > along with the truncate request only when the new size is smaller
> > and not aligned to the fscrypt BLOCK size. When the last block is
> > located in the file hole, the truncate request will only contain
> > the header.
> > 
> > The MDS could fail to do the truncate if there has another client
> > or process has already updated the Rados object which contains
> > the last block, and will return -EAGAIN, then the kclient needs
> > to retry it. The RMW will take around 50ms, and will let it retry
> > 20 times for now.
> > 
> > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > ---
> >   fs/ceph/crypto.h |  21 +++++
> >   fs/ceph/inode.c  | 210 +++++++++++++++++++++++++++++++++++++++++++----
> >   fs/ceph/super.h  |   5 ++
> >   3 files changed, 222 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/ceph/crypto.h b/fs/ceph/crypto.h
> > index ab27a7ed62c3..393c308e8fc2 100644
> > --- a/fs/ceph/crypto.h
> > +++ b/fs/ceph/crypto.h
> > @@ -25,6 +25,27 @@ struct ceph_fname {
> >   	u32		ctext_len;	// length of crypttext
> >   };
> >   
> > +/*
> > + * Header for the crypted file when truncating the size, this
> > + * will be sent to MDS, and the MDS will update the encrypted
> > + * last block and then truncate the size.
> > + */
> > +struct ceph_fscrypt_truncate_size_header {
> > +       __u8  ver;
> > +       __u8  compat;
> > +
> > +       /*
> > +	* It will be sizeof(assert_ver + file_offset + block_size)
> > +	* if the last block is empty when it's located in a file
> > +	* hole. Or the data_len will plus CEPH_FSCRYPT_BLOCK_SIZE.
> > +	*/
> > +       __le32 data_len;
> > +
> > +       __le64 assert_ver;
> > +       __le64 file_offset;
> > +       __le32 block_size;
> > +} __packed;
> > +
> >   struct ceph_fscrypt_auth {
> >   	__le32	cfa_version;
> >   	__le32	cfa_blob_len;
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index 15c2fb1e2c8a..eebbd0296004 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -586,6 +586,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
> >   	ci->i_truncate_seq = 0;
> >   	ci->i_truncate_size = 0;
> >   	ci->i_truncate_pending = 0;
> > +	ci->i_truncate_pagecache_size = 0;
> >   
> >   	ci->i_max_size = 0;
> >   	ci->i_reported_size = 0;
> > @@ -751,6 +752,10 @@ int ceph_fill_file_size(struct inode *inode, int issued,
> >   		dout("truncate_size %lld -> %llu\n", ci->i_truncate_size,
> >   		     truncate_size);
> >   		ci->i_truncate_size = truncate_size;
> > +		if (IS_ENCRYPTED(inode))
> > +			ci->i_truncate_pagecache_size = size;
> > +		else
> > +			ci->i_truncate_pagecache_size = truncate_size;
> >   	}
> >   
> >   	if (queue_trunc)
> > @@ -1011,7 +1016,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
> >   
> >   	if (new_version ||
> >   	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
> > -		u64 size = info->size;
> > +		u64 size = le64_to_cpu(info->size);
> >   		s64 old_pool = ci->i_layout.pool_id;
> >   		struct ceph_string *old_ns;
> >   
> > @@ -1026,16 +1031,20 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
> >   		pool_ns = old_ns;
> >   
> >   		if (IS_ENCRYPTED(inode) && size &&
> > -		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
> > -			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
> > -			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
> > -				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
> > +		    (iinfo->fscrypt_file_len >= sizeof(__le64))) {
> > +			u64 fsize = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
> > +			if (fsize) {
> > +				size = fsize;
> > +				if (le64_to_cpu(info->size) !=
> > +				    round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
> > +					pr_warn("size=%llu fscrypt_file=%llu\n",
> > +						info->size, size);
> > +			}
> >   		}
> >   
> >   		queue_trunc = ceph_fill_file_size(inode, issued,
> >   					le32_to_cpu(info->truncate_seq),
> > -					le64_to_cpu(info->truncate_size),
> > -					le64_to_cpu(size));
> > +					le64_to_cpu(info->truncate_size), size);
> >   		/* only update max_size on auth cap */
> >   		if ((info->cap.flags & CEPH_CAP_FLAG_AUTH) &&
> >   		    ci->i_max_size != le64_to_cpu(info->max_size)) {
> > @@ -2142,7 +2151,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
> >   	/* there should be no reader or writer */
> >   	WARN_ON_ONCE(ci->i_rd_ref || ci->i_wr_ref);
> >   
> > -	to = ci->i_truncate_size;
> > +	to = ci->i_truncate_pagecache_size;
> >   	wrbuffer_refs = ci->i_wrbuffer_ref;
> >   	dout("__do_pending_vmtruncate %p (%d) to %lld\n", inode,
> >   	     ci->i_truncate_pending, to);
> > @@ -2151,7 +2160,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
> >   	truncate_pagecache(inode, to);
> >   
> >   	spin_lock(&ci->i_ceph_lock);
> > -	if (to == ci->i_truncate_size) {
> > +	if (to == ci->i_truncate_pagecache_size) {
> >   		ci->i_truncate_pending = 0;
> >   		finish = 1;
> >   	}
> > @@ -2232,6 +2241,141 @@ static const struct inode_operations ceph_encrypted_symlink_iops = {
> >   	.listxattr = ceph_listxattr,
> >   };
> >   
> > +/*
> > + * Transfer the encrypted last block to the MDS and the MDS
> > + * will help update it when truncating a smaller size.
> > + *
> > + * We don't support a PAGE_SIZE that is smaller than the
> > + * CEPH_FSCRYPT_BLOCK_SIZE.
> > + */
> > +static int fill_fscrypt_truncate(struct inode *inode,
> > +				 struct ceph_mds_request *req,
> > +				 struct iattr *attr)
> > +{
> > +	struct ceph_inode_info *ci = ceph_inode(inode);
> > +	int boff = attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE;
> > +	loff_t pos, orig_pos = round_down(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE);
> > +#if 0
> > +	u64 block = orig_pos >> CEPH_FSCRYPT_BLOCK_SHIFT;
> > +#endif
> > +	struct ceph_pagelist *pagelist = NULL;
> > +	struct kvec iov;
> > +	struct iov_iter iter;
> > +	struct page *page = NULL;
> > +	struct ceph_fscrypt_truncate_size_header header;
> > +	int retry_op = 0;
> > +	int len = CEPH_FSCRYPT_BLOCK_SIZE;
> > +	loff_t i_size = i_size_read(inode);
> > +	struct ceph_object_vers objvers = {0, NULL};
> > +	int got, ret, issued;
> > +
> > +	ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	issued = __ceph_caps_issued(ci, NULL);
> > +
> > +	dout("%s size %lld -> %lld got cap refs on %s, issued %s\n", __func__,
> > +	     i_size, attr->ia_size, ceph_cap_string(got),
> > +	     ceph_cap_string(issued));
> > +
> > +	/* Try to writeback the dirty pagecaches */
> > +	if (issued & (CEPH_CAP_FILE_BUFFER))
> > +		filemap_fdatawrite(&inode->i_data);
> 
> We may need to wait here to make sure all the dirty pages have been
> written back to the OSDs before we do the RMW. Otherwise, if there is a
> lot of data to write back, the writeback may finish just after the
> truncate. Would the dirty data be lost in that case?
> 
> 

Yes, that should probably be:

    filemap_write_and_wait(inode->i_mapping);

(For silly reasons, i_mapping usually points at i_data, but in some
filesystems (e.g. coda or with DAX) it can change. We don't do that in
ceph, but it's still better to use i_mapping here since that's the
convention).
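
As a rough illustration of the i_mapping vs. i_data point, here is a
simplified userspace sketch. The toy_* structures and helpers are hypothetical
stand-ins, not the real kernel definitions; the only thing taken from the
above is that i_mapping normally points at the inode's embedded i_data but can
be redirected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified versions of the kernel structures. */
struct toy_mapping {
	int nr_dirty; /* number of dirty pages tracked by this mapping */
};

struct toy_inode {
	struct toy_mapping *i_mapping; /* mapping that actually holds the pages */
	struct toy_mapping i_data;     /* mapping embedded in the inode itself */
};

/* Default setup: i_mapping points at the inode's own embedded i_data. */
static void toy_inode_init(struct toy_inode *inode)
{
	inode->i_data.nr_dirty = 0;
	inode->i_mapping = &inode->i_data;
}

/* Model of a filesystem that redirects i_mapping to another mapping. */
static void toy_inode_redirect(struct toy_inode *inode,
			       struct toy_mapping *other)
{
	assert(other != NULL);
	inode->i_mapping = other;
}

/* Flush helper that follows the convention of going through i_mapping. */
static int toy_flush(struct toy_inode *inode)
{
	int flushed = inode->i_mapping->nr_dirty;

	inode->i_mapping->nr_dirty = 0;
	return flushed;
}
```

A helper that went through &inode->i_data instead would find nothing to flush
once the pointer has been redirected, which is why going through i_mapping is
the safer habit even where the two are always the same.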

It would probably be good to send an updated patch with that and the fix
for the other req leak you spotted earlier.

Thanks,

> > +
> > +	page = __page_cache_alloc(GFP_KERNEL);
> > +	if (page == NULL) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
> > +	if (!pagelist) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	iov.iov_base = kmap_local_page(page);
> > +	iov.iov_len = len;
> > +	iov_iter_kvec(&iter, READ, &iov, 1, len);
> > +
> > +	pos = orig_pos;
> > +	ret = __ceph_sync_read(inode, &pos, &iter, &retry_op, &objvers);
> > +	ceph_put_cap_refs(ci, got);
> > +	if (ret < 0)
> > +		goto out;
> > +
> > +	WARN_ON_ONCE(objvers.count != 1);
> > +
> > +	/* Insert the header first */
> > +	header.ver = 1;
> > +	header.compat = 1;
> > +
> > +	/*
> > +	 * If we hit a hole here, we should just skip filling
> > +	 * the fscrypt for the request, because once the fscrypt
> > +	 * is enabled, the file will be split into many blocks
> > +	 * with the size of CEPH_FSCRYPT_BLOCK_SIZE, if there
> > +	 * has a hole, the hole size should be multiple of block
> > +	 * size.
> > +	 *
> > +	 * If the Rados object doesn't exist, it will be set 0.
> > +	 */
> > +	if (!objvers.objvers[0].objver) {
> > +		dout("%s hit hole, ppos %lld < size %lld\n", __func__,
> > +		     pos, i_size);
> > +
> > +		header.data_len = cpu_to_le32(8 + 8 + 4);
> > +		header.assert_ver = 0;
> > +		header.file_offset = 0;
> > +		header.block_size = 0;
> > +		ret = 0;
> > +	} else {
> > +		header.data_len = cpu_to_le32(8 + 8 + 4 + CEPH_FSCRYPT_BLOCK_SIZE);
> > +		header.assert_ver = cpu_to_le64(objvers.objvers[0].objver);
> > +		header.file_offset = cpu_to_le64(orig_pos);
> > +		header.block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
> > +
> > +		/* truncate and zero out the extra contents for the last block */
> > +		memset(iov.iov_base + boff, 0, PAGE_SIZE - boff);
> > +
> > +#if 0 // Uncomment this when the fscrypt is enabled globally in kceph
> > +
> > +		/* encrypt the last block */
> > +		ret = fscrypt_encrypt_block_inplace(inode, page,
> > +						    CEPH_FSCRYPT_BLOCK_SIZE,
> > +						    0, block,
> > +						    GFP_KERNEL);
> > +		if (ret)
> > +			goto out;
> > +#endif
> > +	}
> > +
> > +	/* Insert the header */
> > +	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
> > +	if (ret)
> > +		goto out;
> > +
> > +	if (header.block_size) {
> > +		/* Append the last block contents to pagelist */
> > +		ret = ceph_pagelist_append(pagelist, iov.iov_base,
> > +					   CEPH_FSCRYPT_BLOCK_SIZE);
> > +		if (ret)
> > +			goto out;
> > +	}
> > +	req->r_pagelist = pagelist;
> > +out:
> > +	dout("%s %p size dropping cap refs on %s\n", __func__,
> > +	     inode, ceph_cap_string(got));
> > +	kunmap_local(iov.iov_base);
> > +	if (page)
> > +		__free_pages(page, 0);
> > +	if (ret && pagelist)
> > +		ceph_pagelist_release(pagelist);
> > +	kfree(objvers.objvers);
> > +	return ret;
> > +}
> > +
> >   int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *cia)
> >   {
> >   	struct ceph_inode_info *ci = ceph_inode(inode);
> > @@ -2239,12 +2383,15 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
> >   	struct ceph_mds_request *req;
> >   	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
> >   	struct ceph_cap_flush *prealloc_cf;
> > +	loff_t isize = i_size_read(inode);
> >   	int issued;
> >   	int release = 0, dirtied = 0;
> >   	int mask = 0;
> >   	int err = 0;
> >   	int inode_dirty_flags = 0;
> >   	bool lock_snap_rwsem = false;
> > +	bool fill_fscrypt;
> > +	int truncate_retry = 20; /* The RMW will take around 50ms */
> >   
> >   	prealloc_cf = ceph_alloc_cap_flush();
> >   	if (!prealloc_cf)
> > @@ -2257,6 +2404,8 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
> >   		return PTR_ERR(req);
> >   	}
> >   
> > +retry:
> > +	fill_fscrypt = false;
> >   	spin_lock(&ci->i_ceph_lock);
> >   	issued = __ceph_caps_issued(ci, NULL);
> >   
> > @@ -2378,10 +2527,27 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
> >   		}
> >   	}
> >   	if (ia_valid & ATTR_SIZE) {
> > -		loff_t isize = i_size_read(inode);
> > -
> >   		dout("setattr %p size %lld -> %lld\n", inode, isize, attr->ia_size);
> > -		if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
> > +		/*
> > +		 * Only when the new size is smaller and not aligned to
> > +		 * CEPH_FSCRYPT_BLOCK_SIZE will the RMW is needed.
> > +		 */
> > +		if (IS_ENCRYPTED(inode) && attr->ia_size < isize &&
> > +		    (attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE)) {
> > +			mask |= CEPH_SETATTR_SIZE;
> > +			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
> > +				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
> > +			set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
> > +			mask |= CEPH_SETATTR_FSCRYPT_FILE;
> > +			req->r_args.setattr.size =
> > +				cpu_to_le64(round_up(attr->ia_size,
> > +						     CEPH_FSCRYPT_BLOCK_SIZE));
> > +			req->r_args.setattr.old_size =
> > +				cpu_to_le64(round_up(isize,
> > +						     CEPH_FSCRYPT_BLOCK_SIZE));
> > +			req->r_fscrypt_file = attr->ia_size;
> > +			fill_fscrypt = true;
> > +		} else if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
> >   			if (attr->ia_size > isize) {
> >   				i_size_write(inode, attr->ia_size);
> >   				inode->i_blocks = calc_inode_blocks(attr->ia_size);
> > @@ -2404,7 +2570,6 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
> >   					cpu_to_le64(round_up(isize,
> >   							     CEPH_FSCRYPT_BLOCK_SIZE));
> >   				req->r_fscrypt_file = attr->ia_size;
> > -				/* FIXME: client must zero out any partial blocks! */
> >   			} else {
> >   				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
> >   				req->r_args.setattr.old_size = cpu_to_le64(isize);
> > @@ -2476,7 +2641,6 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
> >   	if (inode_dirty_flags)
> >   		__mark_inode_dirty(inode, inode_dirty_flags);
> >   
> > -
> >   	if (mask) {
> >   		req->r_inode = inode;
> >   		ihold(inode);
> > @@ -2484,7 +2648,25 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
> >   		req->r_args.setattr.mask = cpu_to_le32(mask);
> >   		req->r_num_caps = 1;
> >   		req->r_stamp = attr->ia_ctime;
> > +		if (fill_fscrypt) {
> > +			err = fill_fscrypt_truncate(inode, req, attr);
> > +			if (err)
> > +				goto out;
> > +		}
> > +
> > +		/*
> > +		 * The truncate request will return -EAGAIN when the
> > +		 * last block has been updated just before the MDS
> > +		 * successfully gets the xlock for the FILE lock. To
> > +		 * avoid corrupting the file contents we need to retry
> > +		 * it.
> > +		 */
> >   		err = ceph_mdsc_do_request(mdsc, NULL, req);
> > +		if (err == -EAGAIN && truncate_retry--) {
> > +			dout("setattr %p result=%d (%s locally, %d remote), retry it!\n",
> > +			     inode, err, ceph_cap_string(dirtied), mask);
> > +			goto retry;
> > +		}
> >   	}
> >   out:
> >   	dout("setattr %p result=%d (%s locally, %d remote)\n", inode, err,
> > diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> > index b347b12e86a9..071857bb59d8 100644
> > --- a/fs/ceph/super.h
> > +++ b/fs/ceph/super.h
> > @@ -408,6 +408,11 @@ struct ceph_inode_info {
> >   	u32 i_truncate_seq;        /* last truncate to smaller size */
> >   	u64 i_truncate_size;       /*  and the size we last truncated down to */
> >   	int i_truncate_pending;    /*  still need to call vmtruncate */
> > +	/*
> > +	 * For none fscrypt case it equals to i_truncate_size or it will
> > +	 * equals to fscrypt_file_size
> > +	 */
> > +	u64 i_truncate_pagecache_size;
> >   
> >   	u64 i_max_size;            /* max file size authorized by mds */
> >   	u64 i_reported_size; /* (max_)size reported to or requested of mds */
> 

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v7 9/9] ceph: add truncate size handling support for fscrypt
  2021-11-08 13:02     ` Jeff Layton
@ 2021-11-08 13:11       ` Xiubo Li
  0 siblings, 0 replies; 25+ messages in thread
From: Xiubo Li @ 2021-11-08 13:11 UTC (permalink / raw)
  To: Jeff Layton; +Cc: idryomov, vshankar, pdonnell, khiremat, ceph-devel


On 11/8/21 9:02 PM, Jeff Layton wrote:
> On Mon, 2021-11-08 at 20:49 +0800, Xiubo Li wrote:
>> On 11/5/21 10:22 PM, xiubli@redhat.com wrote:
>>> From: Xiubo Li <xiubli@redhat.com>
>>>
>>> This will transfer the encrypted last block contents to the MDS
>>> along with the truncate request only when the new size is smaller
>>> and not aligned to the fscrypt BLOCK size. When the last block is
>>> located in the file hole, the truncate request will only contain
>>> the header.
>>>
>>> The MDS could fail to do the truncate if there has another client
>>> or process has already updated the Rados object which contains
>>> the last block, and will return -EAGAIN, then the kclient needs
>>> to retry it. The RMW will take around 50ms, and will let it retry
>>> 20 times for now.
>>>
>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>> ---
>>>    fs/ceph/crypto.h |  21 +++++
>>>    fs/ceph/inode.c  | 210 +++++++++++++++++++++++++++++++++++++++++++----
>>>    fs/ceph/super.h  |   5 ++
>>>    3 files changed, 222 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/fs/ceph/crypto.h b/fs/ceph/crypto.h
>>> index ab27a7ed62c3..393c308e8fc2 100644
>>> --- a/fs/ceph/crypto.h
>>> +++ b/fs/ceph/crypto.h
>>> @@ -25,6 +25,27 @@ struct ceph_fname {
>>>    	u32		ctext_len;	// length of crypttext
>>>    };
>>>    
>>> +/*
>>> + * Header for the crypted file when truncating the size, this
>>> + * will be sent to MDS, and the MDS will update the encrypted
>>> + * last block and then truncate the size.
>>> + */
>>> +struct ceph_fscrypt_truncate_size_header {
>>> +       __u8  ver;
>>> +       __u8  compat;
>>> +
>>> +       /*
>>> +	* It will be sizeof(assert_ver + file_offset + block_size)
>>> +	* if the last block is empty when it's located in a file
>>> +	* hole. Or the data_len will plus CEPH_FSCRYPT_BLOCK_SIZE.
>>> +	*/
>>> +       __le32 data_len;
>>> +
>>> +       __le64 assert_ver;
>>> +       __le64 file_offset;
>>> +       __le32 block_size;
>>> +} __packed;
>>> +
>>>    struct ceph_fscrypt_auth {
>>>    	__le32	cfa_version;
>>>    	__le32	cfa_blob_len;
>>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>>> index 15c2fb1e2c8a..eebbd0296004 100644
>>> --- a/fs/ceph/inode.c
>>> +++ b/fs/ceph/inode.c
>>> @@ -586,6 +586,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
>>>    	ci->i_truncate_seq = 0;
>>>    	ci->i_truncate_size = 0;
>>>    	ci->i_truncate_pending = 0;
>>> +	ci->i_truncate_pagecache_size = 0;
>>>    
>>>    	ci->i_max_size = 0;
>>>    	ci->i_reported_size = 0;
>>> @@ -751,6 +752,10 @@ int ceph_fill_file_size(struct inode *inode, int issued,
>>>    		dout("truncate_size %lld -> %llu\n", ci->i_truncate_size,
>>>    		     truncate_size);
>>>    		ci->i_truncate_size = truncate_size;
>>> +		if (IS_ENCRYPTED(inode))
>>> +			ci->i_truncate_pagecache_size = size;
>>> +		else
>>> +			ci->i_truncate_pagecache_size = truncate_size;
>>>    	}
>>>    
>>>    	if (queue_trunc)
>>> @@ -1011,7 +1016,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>>>    
>>>    	if (new_version ||
>>>    	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
>>> -		u64 size = info->size;
>>> +		u64 size = le64_to_cpu(info->size);
>>>    		s64 old_pool = ci->i_layout.pool_id;
>>>    		struct ceph_string *old_ns;
>>>    
>>> @@ -1026,16 +1031,20 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>>>    		pool_ns = old_ns;
>>>    
>>>    		if (IS_ENCRYPTED(inode) && size &&
>>> -		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
>>> -			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
>>> -			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
>>> -				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
>>> +		    (iinfo->fscrypt_file_len >= sizeof(__le64))) {
>>> +			u64 fsize = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
>>> +			if (fsize) {
>>> +				size = fsize;
>>> +				if (le64_to_cpu(info->size) !=
>>> +				    round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
>>> +					pr_warn("size=%llu fscrypt_file=%llu\n",
>>> +						info->size, size);
>>> +			}
>>>    		}
>>>    
>>>    		queue_trunc = ceph_fill_file_size(inode, issued,
>>>    					le32_to_cpu(info->truncate_seq),
>>> -					le64_to_cpu(info->truncate_size),
>>> -					le64_to_cpu(size));
>>> +					le64_to_cpu(info->truncate_size), size);
>>>    		/* only update max_size on auth cap */
>>>    		if ((info->cap.flags & CEPH_CAP_FLAG_AUTH) &&
>>>    		    ci->i_max_size != le64_to_cpu(info->max_size)) {
>>> @@ -2142,7 +2151,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
>>>    	/* there should be no reader or writer */
>>>    	WARN_ON_ONCE(ci->i_rd_ref || ci->i_wr_ref);
>>>    
>>> -	to = ci->i_truncate_size;
>>> +	to = ci->i_truncate_pagecache_size;
>>>    	wrbuffer_refs = ci->i_wrbuffer_ref;
>>>    	dout("__do_pending_vmtruncate %p (%d) to %lld\n", inode,
>>>    	     ci->i_truncate_pending, to);
>>> @@ -2151,7 +2160,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
>>>    	truncate_pagecache(inode, to);
>>>    
>>>    	spin_lock(&ci->i_ceph_lock);
>>> -	if (to == ci->i_truncate_size) {
>>> +	if (to == ci->i_truncate_pagecache_size) {
>>>    		ci->i_truncate_pending = 0;
>>>    		finish = 1;
>>>    	}
>>> @@ -2232,6 +2241,141 @@ static const struct inode_operations ceph_encrypted_symlink_iops = {
>>>    	.listxattr = ceph_listxattr,
>>>    };
>>>    
>>> +/*
>>> + * Transfer the encrypted last block to the MDS and the MDS
>>> + * will help update it when truncating a smaller size.
>>> + *
>>> + * We don't support a PAGE_SIZE that is smaller than the
>>> + * CEPH_FSCRYPT_BLOCK_SIZE.
>>> + */
>>> +static int fill_fscrypt_truncate(struct inode *inode,
>>> +				 struct ceph_mds_request *req,
>>> +				 struct iattr *attr)
>>> +{
>>> +	struct ceph_inode_info *ci = ceph_inode(inode);
>>> +	int boff = attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE;
>>> +	loff_t pos, orig_pos = round_down(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE);
>>> +#if 0
>>> +	u64 block = orig_pos >> CEPH_FSCRYPT_BLOCK_SHIFT;
>>> +#endif
>>> +	struct ceph_pagelist *pagelist = NULL;
>>> +	struct kvec iov;
>>> +	struct iov_iter iter;
>>> +	struct page *page = NULL;
>>> +	struct ceph_fscrypt_truncate_size_header header;
>>> +	int retry_op = 0;
>>> +	int len = CEPH_FSCRYPT_BLOCK_SIZE;
>>> +	loff_t i_size = i_size_read(inode);
>>> +	struct ceph_object_vers objvers = {0, NULL};
>>> +	int got, ret, issued;
>>> +
>>> +	ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got);
>>> +	if (ret < 0)
>>> +		return ret;
>>> +
>>> +	issued = __ceph_caps_issued(ci, NULL);
>>> +
>>> +	dout("%s size %lld -> %lld got cap refs on %s, issued %s\n", __func__,
>>> +	     i_size, attr->ia_size, ceph_cap_string(got),
>>> +	     ceph_cap_string(issued));
>>> +
>>> +	/* Try to writeback the dirty pagecaches */
>>> +	if (issued & (CEPH_CAP_FILE_BUFFER))
>>> +		filemap_fdatawrite(&inode->i_data);
>> We may need to wait here to make sure all the dirty pages have been
>> written back to the OSDs before we do the RMW. Otherwise, if there is a
>> lot of data to write back, the writeback may finish just after the
>> truncate. Would the dirty data be lost in that case?
>>
>>
> Yes, that should probably be:
>
>      filemap_write_and_wait(inode->i_mapping);
>
> (For silly reasons, i_mapping usually points at i_data, but in some
> filesystems (e.g. coda or with DAX) it can change. We don't do that in
> ceph, but it's still better to use i_mapping here since that's the
> convention).
Okay, got it.
>
> It would probably be good to send an updated patch with that and the fix
> for the other req leak you spotted earlier.

Yeah, sure.

Thanks.

>
> Thanks,
>
>>> +
>>> +	page = __page_cache_alloc(GFP_KERNEL);
>>> +	if (page == NULL) {
>>> +		ret = -ENOMEM;
>>> +		goto out;
>>> +	}
>>> +
>>> +	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
>>> +	if (!pagelist) {
>>> +		ret = -ENOMEM;
>>> +		goto out;
>>> +	}
>>> +
>>> +	iov.iov_base = kmap_local_page(page);
>>> +	iov.iov_len = len;
>>> +	iov_iter_kvec(&iter, READ, &iov, 1, len);
>>> +
>>> +	pos = orig_pos;
>>> +	ret = __ceph_sync_read(inode, &pos, &iter, &retry_op, &objvers);
>>> +	ceph_put_cap_refs(ci, got);
>>> +	if (ret < 0)
>>> +		goto out;
>>> +
>>> +	WARN_ON_ONCE(objvers.count != 1);
>>> +
>>> +	/* Insert the header first */
>>> +	header.ver = 1;
>>> +	header.compat = 1;
>>> +
>>> +	/*
>>> +	 * If we hit a hole here, we should just skip filling
>>> +	 * the fscrypt for the request, because once the fscrypt
>>> +	 * is enabled, the file will be split into many blocks
>>> +	 * with the size of CEPH_FSCRYPT_BLOCK_SIZE, if there
>>> +	 * has a hole, the hole size should be multiple of block
>>> +	 * size.
>>> +	 *
>>> +	 * If the Rados object doesn't exist, it will be set 0.
>>> +	 */
>>> +	if (!objvers.objvers[0].objver) {
>>> +		dout("%s hit hole, ppos %lld < size %lld\n", __func__,
>>> +		     pos, i_size);
>>> +
>>> +		header.data_len = cpu_to_le32(8 + 8 + 4);
>>> +		header.assert_ver = 0;
>>> +		header.file_offset = 0;
>>> +		header.block_size = 0;
>>> +		ret = 0;
>>> +	} else {
>>> +		header.data_len = cpu_to_le32(8 + 8 + 4 + CEPH_FSCRYPT_BLOCK_SIZE);
>>> +		header.assert_ver = cpu_to_le64(objvers.objvers[0].objver);
>>> +		header.file_offset = cpu_to_le64(orig_pos);
>>> +		header.block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
>>> +
>>> +		/* truncate and zero out the extra contents for the last block */
>>> +		memset(iov.iov_base + boff, 0, PAGE_SIZE - boff);
>>> +
>>> +#if 0 // Uncomment this when the fscrypt is enabled globally in kceph
>>> +
>>> +		/* encrypt the last block */
>>> +		ret = fscrypt_encrypt_block_inplace(inode, page,
>>> +						    CEPH_FSCRYPT_BLOCK_SIZE,
>>> +						    0, block,
>>> +						    GFP_KERNEL);
>>> +		if (ret)
>>> +			goto out;
>>> +#endif
>>> +	}
>>> +
>>> +	/* Insert the header */
>>> +	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	if (header.block_size) {
>>> +		/* Append the last block contents to pagelist */
>>> +		ret = ceph_pagelist_append(pagelist, iov.iov_base,
>>> +					   CEPH_FSCRYPT_BLOCK_SIZE);
>>> +		if (ret)
>>> +			goto out;
>>> +	}
>>> +	req->r_pagelist = pagelist;
>>> +out:
>>> +	dout("%s %p size dropping cap refs on %s\n", __func__,
>>> +	     inode, ceph_cap_string(got));
>>> +	kunmap_local(iov.iov_base);
>>> +	if (page)
>>> +		__free_pages(page, 0);
>>> +	if (ret && pagelist)
>>> +		ceph_pagelist_release(pagelist);
>>> +	kfree(objvers.objvers);
>>> +	return ret;
>>> +}
>>> +
>>>    int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *cia)
>>>    {
>>>    	struct ceph_inode_info *ci = ceph_inode(inode);
>>> @@ -2239,12 +2383,15 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>>    	struct ceph_mds_request *req;
>>>    	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
>>>    	struct ceph_cap_flush *prealloc_cf;
>>> +	loff_t isize = i_size_read(inode);
>>>    	int issued;
>>>    	int release = 0, dirtied = 0;
>>>    	int mask = 0;
>>>    	int err = 0;
>>>    	int inode_dirty_flags = 0;
>>>    	bool lock_snap_rwsem = false;
>>> +	bool fill_fscrypt;
>>> +	int truncate_retry = 20; /* The RMW will take around 50ms */
>>>    
>>>    	prealloc_cf = ceph_alloc_cap_flush();
>>>    	if (!prealloc_cf)
>>> @@ -2257,6 +2404,8 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>>    		return PTR_ERR(req);
>>>    	}
>>>    
>>> +retry:
>>> +	fill_fscrypt = false;
>>>    	spin_lock(&ci->i_ceph_lock);
>>>    	issued = __ceph_caps_issued(ci, NULL);
>>>    
>>> @@ -2378,10 +2527,27 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>>    		}
>>>    	}
>>>    	if (ia_valid & ATTR_SIZE) {
>>> -		loff_t isize = i_size_read(inode);
>>> -
>>>    		dout("setattr %p size %lld -> %lld\n", inode, isize, attr->ia_size);
>>> -		if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
>>> +		/*
>>> +		 * Only when the new size is smaller and not aligned to
>>> +		 * CEPH_FSCRYPT_BLOCK_SIZE will the RMW is needed.
>>> +		 */
>>> +		if (IS_ENCRYPTED(inode) && attr->ia_size < isize &&
>>> +		    (attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE)) {
>>> +			mask |= CEPH_SETATTR_SIZE;
>>> +			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
>>> +				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
>>> +			set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
>>> +			mask |= CEPH_SETATTR_FSCRYPT_FILE;
>>> +			req->r_args.setattr.size =
>>> +				cpu_to_le64(round_up(attr->ia_size,
>>> +						     CEPH_FSCRYPT_BLOCK_SIZE));
>>> +			req->r_args.setattr.old_size =
>>> +				cpu_to_le64(round_up(isize,
>>> +						     CEPH_FSCRYPT_BLOCK_SIZE));
>>> +			req->r_fscrypt_file = attr->ia_size;
>>> +			fill_fscrypt = true;
>>> +		} else if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
>>>    			if (attr->ia_size > isize) {
>>>    				i_size_write(inode, attr->ia_size);
>>>    				inode->i_blocks = calc_inode_blocks(attr->ia_size);
>>> @@ -2404,7 +2570,6 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>>    					cpu_to_le64(round_up(isize,
>>>    							     CEPH_FSCRYPT_BLOCK_SIZE));
>>>    				req->r_fscrypt_file = attr->ia_size;
>>> -				/* FIXME: client must zero out any partial blocks! */
>>>    			} else {
>>>    				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>>>    				req->r_args.setattr.old_size = cpu_to_le64(isize);
>>> @@ -2476,7 +2641,6 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>>    	if (inode_dirty_flags)
>>>    		__mark_inode_dirty(inode, inode_dirty_flags);
>>>    
>>> -
>>>    	if (mask) {
>>>    		req->r_inode = inode;
>>>    		ihold(inode);
>>> @@ -2484,7 +2648,25 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>>    		req->r_args.setattr.mask = cpu_to_le32(mask);
>>>    		req->r_num_caps = 1;
>>>    		req->r_stamp = attr->ia_ctime;
>>> +		if (fill_fscrypt) {
>>> +			err = fill_fscrypt_truncate(inode, req, attr);
>>> +			if (err)
>>> +				goto out;
>>> +		}
>>> +
>>> +		/*
>>> +		 * The truncate request will return -EAGAIN when the
>>> +		 * last block has been updated just before the MDS
>>> +		 * successfully gets the xlock for the FILE lock. To
>>> +		 * avoid corrupting the file contents we need to retry
>>> +		 * it.
>>> +		 */
>>>    		err = ceph_mdsc_do_request(mdsc, NULL, req);
>>> +		if (err == -EAGAIN && truncate_retry--) {
>>> +			dout("setattr %p result=%d (%s locally, %d remote), retry it!\n",
>>> +			     inode, err, ceph_cap_string(dirtied), mask);
>>> +			goto retry;
>>> +		}
>>>    	}
>>>    out:
>>>    	dout("setattr %p result=%d (%s locally, %d remote)\n", inode, err,
>>> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
>>> index b347b12e86a9..071857bb59d8 100644
>>> --- a/fs/ceph/super.h
>>> +++ b/fs/ceph/super.h
>>> @@ -408,6 +408,11 @@ struct ceph_inode_info {
>>>    	u32 i_truncate_seq;        /* last truncate to smaller size */
>>>    	u64 i_truncate_size;       /*  and the size we last truncated down to */
>>>    	int i_truncate_pending;    /*  still need to call vmtruncate */
>>> +	/*
>>> +	 * For none fscrypt case it equals to i_truncate_size or it will
>>> +	 * equals to fscrypt_file_size
>>> +	 */
>>> +	u64 i_truncate_pagecache_size;
>>>    
>>>    	u64 i_max_size;            /* max file size authorized by mds */
>>>    	u64 i_reported_size; /* (max_)size reported to or requested of mds */



* [PATCH v7 9/9] ceph: add truncate size handling support for fscrypt
  2021-12-08 12:45 xiubli
@ 2021-12-08 12:45 ` xiubli
  0 siblings, 0 replies; 25+ messages in thread
From: xiubli @ 2021-12-08 12:45 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, khiremat, ceph-devel, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

Transfer the encrypted last block contents to the MDS along with
the truncate request, but only when the new size is smaller than
the current size and not aligned to the fscrypt block size. When
the last block is located in a file hole, the truncate request
contains only the header.

The MDS can fail the truncate if another client or process has
already updated the RADOS object that contains the last block.
In that case it returns -EAGAIN and the kclient needs to retry.
The RMW takes around 50ms, so let it retry up to 20 times for now.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
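
For reference, the size arithmetic described above can be sketched in
userspace C. This is only a minimal sketch under stated assumptions, not
the kernel code: SKETCH_BLOCK_SIZE (4096) is an assumed stand-in for
CEPH_FSCRYPT_BLOCK_SIZE, the sketch_* names are hypothetical, and the
packed attribute is the GCC/Clang spelling of __packed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4096 stands in for CEPH_FSCRYPT_BLOCK_SIZE. */
#define SKETCH_BLOCK_SIZE 4096ULL

/* Userspace mirror of struct ceph_fscrypt_truncate_size_header. */
struct sketch_truncate_header {
	uint8_t  ver;
	uint8_t  compat;
	uint32_t data_len;    /* little-endian on the wire */
	uint64_t assert_ver;  /* 0 means the last block is in a hole */
	uint64_t file_offset;
	uint32_t block_size;
} __attribute__((packed));

static_assert(sizeof(struct sketch_truncate_header) == 26,
	      "1 + 1 + 4 + 8 + 8 + 4 bytes, no padding");

/* Hypothetical summary of what a truncate request would carry. */
struct sketch_plan {
	bool     need_rmw;    /* must the last block be read, zeroed and sent? */
	uint64_t wire_size;   /* new size rounded up to the block size */
	uint64_t block_start; /* file offset of the last block (orig_pos) */
	uint32_t boff;        /* bytes kept in the last block */
	uint32_t data_len;    /* value for the header's data_len field */
};

static struct sketch_plan sketch_plan_truncate(uint64_t old_size,
					       uint64_t new_size,
					       bool encrypted, bool hole)
{
	struct sketch_plan p = {0};
	uint32_t boff = (uint32_t)(new_size % SKETCH_BLOCK_SIZE);

	/* RMW only for encrypted inodes shrinking to an unaligned size. */
	if (!encrypted || new_size >= old_size || boff == 0)
		return p;

	p.need_rmw = true;
	p.boff = boff;
	p.block_start = new_size - boff;                 /* round_down() */
	p.wire_size = p.block_start + SKETCH_BLOCK_SIZE; /* round_up() */

	/* assert_ver + file_offset + block_size, plus the block if present. */
	p.data_len = 8 + 8 + 4;
	if (!hole)
		p.data_len += (uint32_t)SKETCH_BLOCK_SIZE;
	return p;
}
```

With these assumptions, truncating an encrypted 10000-byte file to 5000
bytes keeps 904 bytes of the block starting at offset 4096, and the size
on the wire is rounded up to 8192. Truncating to 8192 or growing the file
needs no RMW at all.
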
 fs/ceph/crypto.h |  21 +++++
 fs/ceph/inode.c  | 223 +++++++++++++++++++++++++++++++++++++++++++----
 fs/ceph/super.h  |   5 ++
 3 files changed, 234 insertions(+), 15 deletions(-)

diff --git a/fs/ceph/crypto.h b/fs/ceph/crypto.h
index ab27a7ed62c3..393c308e8fc2 100644
--- a/fs/ceph/crypto.h
+++ b/fs/ceph/crypto.h
@@ -25,6 +25,27 @@ struct ceph_fname {
 	u32		ctext_len;	// length of crypttext
 };
 
+/*
+ * Header for an encrypted file when truncating its size. It is
+ * sent to the MDS, which updates the encrypted last block and
+ * then truncates the file.
+ */
+struct ceph_fscrypt_truncate_size_header {
+       __u8  ver;
+       __u8  compat;
+
+       /*
+	* This is sizeof(assert_ver) + sizeof(file_offset) +
+	* sizeof(block_size) when the last block lies in a file hole
+	* and is therefore empty; otherwise add CEPH_FSCRYPT_BLOCK_SIZE.
+	*/
+       __le32 data_len;
+
+       __le64 assert_ver;
+       __le64 file_offset;
+       __le32 block_size;
+} __packed;
+
 struct ceph_fscrypt_auth {
 	__le32	cfa_version;
 	__le32	cfa_blob_len;
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 213a9ca875ab..2492e1cb4497 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -586,6 +586,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 	ci->i_truncate_seq = 0;
 	ci->i_truncate_size = 0;
 	ci->i_truncate_pending = 0;
+	ci->i_truncate_pagecache_size = 0;
 
 	ci->i_max_size = 0;
 	ci->i_reported_size = 0;
@@ -751,6 +752,10 @@ int ceph_fill_file_size(struct inode *inode, int issued,
 		dout("truncate_size %lld -> %llu\n", ci->i_truncate_size,
 		     truncate_size);
 		ci->i_truncate_size = truncate_size;
+		if (IS_ENCRYPTED(inode))
+			ci->i_truncate_pagecache_size = size;
+		else
+			ci->i_truncate_pagecache_size = truncate_size;
 	}
 
 	if (queue_trunc)
@@ -1011,7 +1016,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 
 	if (new_version ||
 	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
-		u64 size = info->size;
+		u64 size = le64_to_cpu(info->size);
 		s64 old_pool = ci->i_layout.pool_id;
 		struct ceph_string *old_ns;
 
@@ -1026,16 +1031,20 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 		pool_ns = old_ns;
 
 		if (IS_ENCRYPTED(inode) && size &&
-		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
-			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
-			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
-				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
+		    (iinfo->fscrypt_file_len >= sizeof(__le64))) {
+			u64 fsize = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
+			if (fsize) {
+				size = fsize;
+				if (le64_to_cpu(info->size) !=
+				    round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
+					pr_warn("size=%llu fscrypt_file=%llu\n",
+						info->size, size);
+			}
 		}
 
 		queue_trunc = ceph_fill_file_size(inode, issued,
 					le32_to_cpu(info->truncate_seq),
-					le64_to_cpu(info->truncate_size),
-					le64_to_cpu(size));
+					le64_to_cpu(info->truncate_size), size);
 		/* only update max_size on auth cap */
 		if ((info->cap.flags & CEPH_CAP_FLAG_AUTH) &&
 		    ci->i_max_size != le64_to_cpu(info->max_size)) {
@@ -2143,7 +2152,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
 	/* there should be no reader or writer */
 	WARN_ON_ONCE(ci->i_rd_ref || ci->i_wr_ref);
 
-	to = ci->i_truncate_size;
+	to = ci->i_truncate_pagecache_size;
 	wrbuffer_refs = ci->i_wrbuffer_ref;
 	dout("__do_pending_vmtruncate %p (%d) to %lld\n", inode,
 	     ci->i_truncate_pending, to);
@@ -2152,7 +2161,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
 	truncate_pagecache(inode, to);
 
 	spin_lock(&ci->i_ceph_lock);
-	if (to == ci->i_truncate_size) {
+	if (to == ci->i_truncate_pagecache_size) {
 		ci->i_truncate_pending = 0;
 		finish = 1;
 	}
@@ -2233,6 +2242,148 @@ static const struct inode_operations ceph_encrypted_symlink_iops = {
 	.listxattr = ceph_listxattr,
 };
 
+/*
+ * Transfer the encrypted last block to the MDS, which will
+ * update it when truncating to a smaller size.
+ *
+ * We don't support a PAGE_SIZE that is smaller than the
+ * CEPH_FSCRYPT_BLOCK_SIZE.
+ */
+static int fill_fscrypt_truncate(struct inode *inode,
+				 struct ceph_mds_request *req,
+				 struct iattr *attr)
+{
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	int boff = attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE;
+	loff_t pos, orig_pos = round_down(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE);
+#if 0
+	u64 block = orig_pos >> CEPH_FSCRYPT_BLOCK_SHIFT;
+#endif
+	struct ceph_pagelist *pagelist = NULL;
+	struct kvec iov;
+	struct iov_iter iter;
+	struct page *page = NULL;
+	struct ceph_fscrypt_truncate_size_header header;
+	int retry_op = 0;
+	int len = CEPH_FSCRYPT_BLOCK_SIZE;
+	loff_t i_size = i_size_read(inode);
+	int got, ret, issued;
+	u64 objver;
+
+	ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got);
+	if (ret < 0)
+		return ret;
+
+	issued = __ceph_caps_issued(ci, NULL);
+
+	dout("%s size %lld -> %lld got cap refs on %s, issued %s\n", __func__,
+	     i_size, attr->ia_size, ceph_cap_string(got),
+	     ceph_cap_string(issued));
+
+	/* Try to write back the dirty page cache */
+	if (issued & (CEPH_CAP_FILE_BUFFER))
+		filemap_write_and_wait(inode->i_mapping);
+
+	page = __page_cache_alloc(GFP_KERNEL);
+	if (page == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
+	if (!pagelist) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	iov.iov_base = kmap_local_page(page);
+	iov.iov_len = len;
+	iov_iter_kvec(&iter, READ, &iov, 1, len);
+
+	pos = orig_pos;
+	ret = __ceph_sync_read(inode, &pos, &iter, &retry_op, &objver);
+	ceph_put_cap_refs(ci, got);
+	if (ret < 0)
+		goto out;
+
+	/* Insert the header first */
+	header.ver = 1;
+	header.compat = 1;
+
+	/*
+	 * Always set the block_size to CEPH_FSCRYPT_BLOCK_SIZE,
+	 * because in MDS it may need this to do the truncate.
+	 */
+	header.block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
+
+	/*
+	 * If we hit a hole here, we should just skip filling
+	 * in the fscrypt data for the request: once fscrypt
+	 * is enabled, the file is split into many blocks of
+	 * CEPH_FSCRYPT_BLOCK_SIZE bytes each, so if there is
+	 * a hole, its size should be a multiple of the block
+	 * size.
+	 *
+	 * If the RADOS object doesn't exist, objver is set to 0.
+	 */
+	if (!objver) {
+		dout("%s hit hole, ppos %lld < size %lld\n", __func__,
+		     pos, i_size);
+
+		header.data_len = cpu_to_le32(8 + 8 + 4);
+
+		/*
+		 * An "assert_ver" of 0 means we hit a hole; the MDS
+		 * uses this field to check whether the last block is
+		 * in a hole or not.
+		 */
+		header.assert_ver = 0;
+		header.file_offset = 0;
+		ret = 0;
+	} else {
+		header.data_len = cpu_to_le32(8 + 8 + 4 + CEPH_FSCRYPT_BLOCK_SIZE);
+		header.assert_ver = cpu_to_le64(objver);
+		header.file_offset = cpu_to_le64(orig_pos);
+
+		/* truncate and zero out the extra contents for the last block */
+		memset(iov.iov_base + boff, 0, PAGE_SIZE - boff);
+
+#if 0 // Uncomment this when the fscrypt is enabled globally in kceph
+
+		/* encrypt the last block */
+		ret = fscrypt_encrypt_block_inplace(inode, page,
+						    CEPH_FSCRYPT_BLOCK_SIZE,
+						    0, block,
+						    GFP_KERNEL);
+		if (ret)
+			goto out;
+#endif
+	}
+
+	/* Insert the header */
+	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
+	if (ret)
+		goto out;
+
+	if (header.block_size) {
+		/* Append the last block contents to pagelist */
+		ret = ceph_pagelist_append(pagelist, iov.iov_base,
+					   CEPH_FSCRYPT_BLOCK_SIZE);
+		if (ret)
+			goto out;
+	}
+	req->r_pagelist = pagelist;
+out:
+	dout("%s %p size dropping cap refs on %s\n", __func__,
+	     inode, ceph_cap_string(got));
+	kunmap_local(iov.iov_base);
+	if (page)
+		__free_pages(page, 0);
+	if (ret && pagelist)
+		ceph_pagelist_release(pagelist);
+	return ret;
+}
+
 int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *cia)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
@@ -2240,13 +2391,17 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 	struct ceph_mds_request *req;
 	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
 	struct ceph_cap_flush *prealloc_cf;
+	loff_t isize = i_size_read(inode);
 	int issued;
 	int release = 0, dirtied = 0;
 	int mask = 0;
 	int err = 0;
 	int inode_dirty_flags = 0;
 	bool lock_snap_rwsem = false;
+	bool fill_fscrypt;
+	int truncate_retry = 20; /* The RMW will take around 50ms */
 
+retry:
 	prealloc_cf = ceph_alloc_cap_flush();
 	if (!prealloc_cf)
 		return -ENOMEM;
@@ -2258,6 +2413,7 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 		return PTR_ERR(req);
 	}
 
+	fill_fscrypt = false;
 	spin_lock(&ci->i_ceph_lock);
 	issued = __ceph_caps_issued(ci, NULL);
 
@@ -2379,10 +2535,27 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 		}
 	}
 	if (ia_valid & ATTR_SIZE) {
-		loff_t isize = i_size_read(inode);
-
 		dout("setattr %p size %lld -> %lld\n", inode, isize, attr->ia_size);
-		if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
+		/*
+		 * The RMW is needed only when the new size is smaller
+		 * and not aligned to CEPH_FSCRYPT_BLOCK_SIZE.
+		 */
+		if (IS_ENCRYPTED(inode) && attr->ia_size < isize &&
+		    (attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE)) {
+			mask |= CEPH_SETATTR_SIZE;
+			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
+				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
+			set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
+			mask |= CEPH_SETATTR_FSCRYPT_FILE;
+			req->r_args.setattr.size =
+				cpu_to_le64(round_up(attr->ia_size,
+						     CEPH_FSCRYPT_BLOCK_SIZE));
+			req->r_args.setattr.old_size =
+				cpu_to_le64(round_up(isize,
+						     CEPH_FSCRYPT_BLOCK_SIZE));
+			req->r_fscrypt_file = attr->ia_size;
+			fill_fscrypt = true;
+		} else if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
 			if (attr->ia_size > isize) {
 				i_size_write(inode, attr->ia_size);
 				inode->i_blocks = calc_inode_blocks(attr->ia_size);
@@ -2405,7 +2578,6 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 					cpu_to_le64(round_up(isize,
 							     CEPH_FSCRYPT_BLOCK_SIZE));
 				req->r_fscrypt_file = attr->ia_size;
-				/* FIXME: client must zero out any partial blocks! */
 			} else {
 				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
 				req->r_args.setattr.old_size = cpu_to_le64(isize);
@@ -2471,13 +2643,14 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 
 	release &= issued;
 	spin_unlock(&ci->i_ceph_lock);
-	if (lock_snap_rwsem)
+	if (lock_snap_rwsem) {
 		up_read(&mdsc->snap_rwsem);
+		lock_snap_rwsem = false;
+	}
 
 	if (inode_dirty_flags)
 		__mark_inode_dirty(inode, inode_dirty_flags);
 
-
 	if (mask) {
 		req->r_inode = inode;
 		ihold(inode);
@@ -2485,7 +2658,27 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 		req->r_args.setattr.mask = cpu_to_le32(mask);
 		req->r_num_caps = 1;
 		req->r_stamp = attr->ia_ctime;
+		if (fill_fscrypt) {
+			err = fill_fscrypt_truncate(inode, req, attr);
+			if (err)
+				goto out;
+		}
+
+		/*
+		 * The truncate request will return -EAGAIN when the
+		 * last block has been updated just before the MDS
+		 * successfully gets the xlock for the FILE lock. To
+		 * avoid corrupting the file contents we need to retry
+		 * it.
+		 */
 		err = ceph_mdsc_do_request(mdsc, NULL, req);
+		if (err == -EAGAIN && truncate_retry--) {
+			dout("setattr %p result=%d (%s locally, %d remote), retry it!\n",
+			     inode, err, ceph_cap_string(dirtied), mask);
+			ceph_mdsc_put_request(req);
+			ceph_free_cap_flush(prealloc_cf);
+			goto retry;
+		}
 	}
 out:
 	dout("setattr %p result=%d (%s locally, %d remote)\n", inode, err,
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index a7bdb28af595..1f61274d381b 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -409,6 +409,11 @@ struct ceph_inode_info {
 	u32 i_truncate_seq;        /* last truncate to smaller size */
 	u64 i_truncate_size;       /*  and the size we last truncated down to */
 	int i_truncate_pending;    /*  still need to call vmtruncate */
+	/*
+	 * In the non-fscrypt case this equals i_truncate_size;
+	 * otherwise it equals fscrypt_file_size.
+	 */
+	u64 i_truncate_pagecache_size;
 
 	u64 i_max_size;            /* max file size authorized by mds */
 	u64 i_reported_size; /* (max_)size reported to or requested of mds */
-- 
2.27.0


