All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/4] ceph: size handling for the fscrypt
@ 2021-10-20 13:28 xiubli
  2021-10-20 13:28 ` [PATCH v2 1/4] ceph: add __ceph_get_caps helper support xiubli
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: xiubli @ 2021-10-20 13:28 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

This patch series is based on the fscrypt_size_handling branch in
https://github.com/lxbsz/linux.git, which is based Jeff's
ceph-fscrypt-content-experimental branch in
https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git,
has reverted one useless commit and added some upstream commits.

I will keep this patch set as simple as possible to review since
this is still one framework code. It works and still in developing
and need some feedbacks and suggestions for two corner cases below.

====

This approach is based on the discussion from V1, which will pass
the encrypted last block contents to MDS along with the truncate
request.

This will send the encrypted last block contents to MDS along with
the truncate request when truncating to a smaller size and at the
same time new size does not align to BLOCK SIZE.

The MDS side patch is raised in PR
https://github.com/ceph/ceph/pull/43588, which is also based Jeff's
previous great work in PR https://github.com/ceph/ceph/pull/41284.

The MDS will use the filer.write_trunc(), which could update and
truncate the file in one shot, instead of filer.truncate().

I have removed the inline data related code since we are remove
this feature, more detail please see:
https://tracker.ceph.com/issues/52916


Note: There still has two CORNER cases we need to deal with:

1), If a truncate request with the last block is sent to the MDS and
just before the MDS has acquired the xlock for FILE lock, if another
client has updated that last block content, we will over write the
last block with old data.

For this case we could send the old encrypted last block data along
with the truncate request and in MDS side read it and then do compare
just before updating it, if the comparasion fails, then fail the
truncate and let the kclient retry it.

2), If another client has buffered the last block, we should flush
it first. I am still thinking how to do this ? Any idea is welcome.

Thanks.


Xiubo Li (4):
  ceph: add __ceph_get_caps helper support
  ceph: add __ceph_sync_read helper support
  ceph: return the real size readed when hit EOF
  ceph: add truncate size handling support for fscrypt

 fs/ceph/caps.c  |  28 ++++---
 fs/ceph/file.c  |  41 ++++++----
 fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------
 fs/ceph/super.h |   4 +
 4 files changed, 234 insertions(+), 49 deletions(-)

-- 
2.27.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/4] ceph: add __ceph_get_caps helper support
  2021-10-20 13:28 [PATCH v2 0/4] ceph: size handling for the fscrypt xiubli
@ 2021-10-20 13:28 ` xiubli
  2021-10-20 13:28 ` [PATCH v2 2/4] ceph: add __ceph_sync_read " xiubli
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: xiubli @ 2021-10-20 13:28 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 fs/ceph/caps.c  | 19 +++++++++++++------
 fs/ceph/super.h |  2 ++
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d628dcdbf869..4e2a588465c5 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2876,10 +2876,9 @@ int ceph_try_get_caps(struct inode *inode, int need, int want,
  * due to a small max_size, make sure we check_max_size (and possibly
  * ask the mds) so we don't get hung up indefinitely.
  */
-int ceph_get_caps(struct file *filp, int need, int want, loff_t endoff, int *got)
+int __ceph_get_caps(struct inode *inode, struct ceph_file_info *fi, int need,
+		    int want, loff_t endoff, int *got)
 {
-	struct ceph_file_info *fi = filp->private_data;
-	struct inode *inode = file_inode(filp);
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
 	int ret, _got, flags;
@@ -2888,7 +2887,7 @@ int ceph_get_caps(struct file *filp, int need, int want, loff_t endoff, int *got
 	if (ret < 0)
 		return ret;
 
-	if ((fi->fmode & CEPH_FILE_MODE_WR) &&
+	if (fi && (fi->fmode & CEPH_FILE_MODE_WR) &&
 	    fi->filp_gen != READ_ONCE(fsc->filp_gen))
 		return -EBADF;
 
@@ -2896,7 +2895,7 @@ int ceph_get_caps(struct file *filp, int need, int want, loff_t endoff, int *got
 
 	while (true) {
 		flags &= CEPH_FILE_MODE_MASK;
-		if (atomic_read(&fi->num_locks))
+		if (fi && atomic_read(&fi->num_locks))
 			flags |= CHECK_FILELOCK;
 		_got = 0;
 		ret = try_get_cap_refs(inode, need, want, endoff,
@@ -2941,7 +2940,7 @@ int ceph_get_caps(struct file *filp, int need, int want, loff_t endoff, int *got
 				continue;
 		}
 
-		if ((fi->fmode & CEPH_FILE_MODE_WR) &&
+		if (fi && (fi->fmode & CEPH_FILE_MODE_WR) &&
 		    fi->filp_gen != READ_ONCE(fsc->filp_gen)) {
 			if (ret >= 0 && _got)
 				ceph_put_cap_refs(ci, _got);
@@ -3004,6 +3003,14 @@ int ceph_get_caps(struct file *filp, int need, int want, loff_t endoff, int *got
 	return 0;
 }
 
+int ceph_get_caps(struct file *filp, int need, int want, loff_t endoff, int *got)
+{
+	struct ceph_file_info *fi = filp->private_data;
+	struct inode *inode = file_inode(filp);
+
+	return __ceph_get_caps(inode, fi, need, want, endoff, got);
+}
+
 /*
  * Take cap refs.  Caller must already know we hold at least one ref
  * on the caps in question or we don't know this is safe.
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 7f3976b3319d..027d5f579ba0 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1208,6 +1208,8 @@ extern int ceph_encode_dentry_release(void **p, struct dentry *dn,
 				      struct inode *dir,
 				      int mds, int drop, int unless);
 
+extern int __ceph_get_caps(struct inode *inode, struct ceph_file_info *fi,
+			   int need, int want, loff_t endoff, int *got);
 extern int ceph_get_caps(struct file *filp, int need, int want,
 			 loff_t endoff, int *got);
 extern int ceph_try_get_caps(struct inode *inode,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 2/4] ceph: add __ceph_sync_read helper support
  2021-10-20 13:28 [PATCH v2 0/4] ceph: size handling for the fscrypt xiubli
  2021-10-20 13:28 ` [PATCH v2 1/4] ceph: add __ceph_get_caps helper support xiubli
@ 2021-10-20 13:28 ` xiubli
  2021-10-20 13:28 ` [PATCH v2 3/4] ceph: return the real size readed when hit EOF xiubli
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: xiubli @ 2021-10-20 13:28 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 fs/ceph/file.c  | 31 +++++++++++++++++++++----------
 fs/ceph/super.h |  2 ++
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 6e677b40410e..74db403a4c35 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -901,20 +901,17 @@ static inline void fscrypt_adjust_off_and_len(struct inode *inode, u64 *off, u64
  * If we get a short result from the OSD, check against i_size; we need to
  * only return a short read to the caller if we hit EOF.
  */
-static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
-			      int *retry_op)
+ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
+			 struct iov_iter *to, int *retry_op)
 {
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file_inode(file);
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
 	struct ceph_osd_client *osdc = &fsc->client->osdc;
 	ssize_t ret;
-	u64 off = iocb->ki_pos;
+	u64 off = *ki_pos;
 	u64 len = iov_iter_count(to);
 
-	dout("sync_read on file %p %llu~%u %s\n", file, off, (unsigned)len,
-	     (file->f_flags & O_DIRECT) ? "O_DIRECT" : "");
+	dout("sync_read on inode %p %llu~%u\n", inode, *ki_pos, (unsigned)len);
 
 	if (!len)
 		return 0;
@@ -1058,18 +1055,32 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
 			break;
 	}
 
-	if (off > iocb->ki_pos) {
+	if (off > *ki_pos) {
 		if (ret >= 0 &&
 		    iov_iter_count(to) > 0 && off >= i_size_read(inode))
 			*retry_op = CHECK_EOF;
-		ret = off - iocb->ki_pos;
-		iocb->ki_pos = off;
+		ret = off - *ki_pos;
+		*ki_pos = off;
 	}
 out:
 	dout("sync_read result %zd retry_op %d\n", ret, *retry_op);
 	return ret;
 }
 
+static ssize_t ceph_sync_read(struct kiocb *iocb, struct iov_iter *to,
+			      int *retry_op)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+
+	dout("sync_read on file %p %llu~%u %s\n", file, iocb->ki_pos,
+	     (unsigned)iov_iter_count(to),
+	     (file->f_flags & O_DIRECT) ? "O_DIRECT" : "");
+
+	return __ceph_sync_read(inode, &iocb->ki_pos, to, retry_op);
+
+}
+
 struct ceph_aio_request {
 	struct kiocb *iocb;
 	size_t total_len;
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 027d5f579ba0..57bc952c54e1 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1235,6 +1235,8 @@ extern int ceph_renew_caps(struct inode *inode, int fmode);
 extern int ceph_open(struct inode *inode, struct file *file);
 extern int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 			    struct file *file, unsigned flags, umode_t mode);
+extern ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
+				struct iov_iter *to, int *retry_op);
 extern int ceph_release(struct inode *inode, struct file *filp);
 extern void ceph_fill_inline_data(struct inode *inode, struct page *locked_page,
 				  char *data, size_t len);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 3/4] ceph: return the real size readed when hit EOF
  2021-10-20 13:28 [PATCH v2 0/4] ceph: size handling for the fscrypt xiubli
  2021-10-20 13:28 ` [PATCH v2 1/4] ceph: add __ceph_get_caps helper support xiubli
  2021-10-20 13:28 ` [PATCH v2 2/4] ceph: add __ceph_sync_read " xiubli
@ 2021-10-20 13:28 ` xiubli
  2021-10-25 19:05   ` Jeff Layton
  2021-10-20 13:28 ` [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt xiubli
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: xiubli @ 2021-10-20 13:28 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 fs/ceph/file.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 74db403a4c35..1988e75ad4a2 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -910,6 +910,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
 	ssize_t ret;
 	u64 off = *ki_pos;
 	u64 len = iov_iter_count(to);
+	u64 i_size = i_size_read(inode);
 
 	dout("sync_read on inode %p %llu~%u\n", inode, *ki_pos, (unsigned)len);
 
@@ -933,7 +934,6 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
 		struct page **pages;
 		int num_pages;
 		size_t page_off;
-		u64 i_size;
 		bool more;
 		int idx;
 		size_t left;
@@ -980,7 +980,6 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
 
 		ceph_osdc_put_request(req);
 
-		i_size = i_size_read(inode);
 		dout("sync_read %llu~%llu got %zd i_size %llu%s\n",
 		     off, len, ret, i_size, (more ? " MORE" : ""));
 
@@ -1056,11 +1055,14 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
 	}
 
 	if (off > *ki_pos) {
-		if (ret >= 0 &&
-		    iov_iter_count(to) > 0 && off >= i_size_read(inode))
+		if (off >= i_size) {
 			*retry_op = CHECK_EOF;
-		ret = off - *ki_pos;
-		*ki_pos = off;
+			ret = i_size - *ki_pos;
+			*ki_pos = i_size;
+		} else {
+			ret = off - *ki_pos;
+			*ki_pos = off;
+		}
 	}
 out:
 	dout("sync_read result %zd retry_op %d\n", ret, *retry_op);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt
  2021-10-20 13:28 [PATCH v2 0/4] ceph: size handling for the fscrypt xiubli
                   ` (2 preceding siblings ...)
  2021-10-20 13:28 ` [PATCH v2 3/4] ceph: return the real size readed when hit EOF xiubli
@ 2021-10-20 13:28 ` xiubli
  2021-10-25 20:01   ` Jeff Layton
  2021-10-20 15:32 ` [PATCH v2 0/4] ceph: size handling for the fscrypt Jeff Layton
  2021-10-25 20:13 ` Jeff Layton
  5 siblings, 1 reply; 17+ messages in thread
From: xiubli @ 2021-10-20 13:28 UTC (permalink / raw)
  To: jlayton; +Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel, Xiubo Li

From: Xiubo Li <xiubli@redhat.com>

This will transfer the encrypted last block contents to the MDS
along with the truncate request only when new size is smaller and
not aligned to the fscrypt BLOCK size.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 fs/ceph/caps.c  |   9 +--
 fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 190 insertions(+), 29 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 4e2a588465c5..1a36f0870d89 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1296,16 +1296,13 @@ static void encode_cap_msg(struct ceph_msg *msg, struct cap_msg_args *arg)
 	/*
 	 * fscrypt_auth and fscrypt_file (version 12)
 	 *
-	 * fscrypt_auth holds the crypto context (if any). fscrypt_file
-	 * tracks the real i_size as an __le64 field (and we use a rounded-up
-	 * i_size in * the traditional size field).
-	 *
-	 * FIXME: should we encrypt fscrypt_file field?
+	 * fscrypt_auth holds the crypto context (if any). fscrypt_file will
+	 * always be zero here.
 	 */
 	ceph_encode_32(&p, arg->fscrypt_auth_len);
 	ceph_encode_copy(&p, arg->fscrypt_auth, arg->fscrypt_auth_len);
 	ceph_encode_32(&p, sizeof(__le64));
-	ceph_encode_64(&p, arg->size);
+	ceph_encode_64(&p, 0);
 }
 
 /*
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 9b798690fdc9..924a69bc074d 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1035,9 +1035,13 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 
 		if (IS_ENCRYPTED(inode) && size &&
 		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
-			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
-			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
-				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
+			u64 fsize = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
+			if (fsize) {
+				size = fsize;
+				if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
+					pr_warn("size=%llu fscrypt_file=%llu\n",
+						info->size, size);
+			}
 		}
 
 		queue_trunc = ceph_fill_file_size(inode, issued,
@@ -2229,6 +2233,157 @@ static const struct inode_operations ceph_encrypted_symlink_iops = {
 	.listxattr = ceph_listxattr,
 };
 
+struct ceph_fscrypt_header {
+	__u8  ver;
+	__u8  compat;
+	__le32 data_len; /* length of sizeof(file_offset + block_size + BLOCK SIZE) */
+	__le64 file_offset;
+	__le64 block_size;
+} __packed;
+
+/*
+ * Transfer the encrypted last block to the MDS and the MDS
+ * will update the file when truncating a smaller size.
+ */
+static int fill_request_for_fscrypt(struct inode *inode,
+				    struct ceph_mds_request *req,
+				    struct iattr *attr)
+{
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	int boff = attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE;
+	loff_t pos, orig_pos = round_down(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE);
+	size_t blen = min_t(size_t, CEPH_FSCRYPT_BLOCK_SIZE, PAGE_SIZE);
+	struct ceph_pagelist *pagelist = NULL;
+	struct kvec *iovs = NULL;
+	struct iov_iter iter;
+	struct page **pages = NULL;
+	struct ceph_fscrypt_header header;
+	int num_pages = 0;
+	int retry_op = 0;
+	int iov_off, iov_idx, len = 0;
+	loff_t i_size = i_size_read(inode);
+	bool fill_header_only = false;
+	int ret, i;
+	int got;
+
+	/*
+	 * Do not support the inline data case, which will be
+	 * removed soon
+	 */
+	if (ci->i_inline_version != CEPH_INLINE_NONE)
+		return -EINVAL;
+
+	ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got);
+	if (ret < 0)
+		return ret;
+
+	dout("%s size %lld -> %lld got cap refs on %s\n", __func__,
+	     i_size, attr->ia_size, ceph_cap_string(got));
+
+	/* Should we consider the tiny page in 1K case ? */
+	num_pages = (CEPH_FSCRYPT_BLOCK_SIZE + PAGE_SIZE -1) / PAGE_SIZE;
+	pages = ceph_alloc_page_vector(num_pages, GFP_NOFS);
+	if (IS_ERR(pages)) {
+		ret = PTR_ERR(pages);
+		goto out;
+	}
+
+	iovs = kcalloc(num_pages, sizeof(struct kvec), GFP_NOFS);
+	if (!iovs) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	for (i = 0; i < num_pages; i++) {
+		iovs[i].iov_base = kmap_local_page(pages[i]);
+		iovs[i].iov_len = PAGE_SIZE;
+		len += iovs[i].iov_len;
+	}
+	iov_iter_kvec(&iter, READ, iovs, num_pages, len);
+
+	pos = orig_pos;
+	ret = __ceph_sync_read(inode, &pos, &iter, &retry_op);
+
+	/*
+	 * If we hit a hole here, we should just skip filling
+	 * the fscrypt for the request, because once the fscrypt
+	 * is enabled, the file will be split into many blocks
+	 * with the size of CEPH_FSCRYPT_BLOCK_SIZE, if there
+	 * has a hole, the hole size should be multiple of block
+	 * size.
+	 */
+	if (pos < i_size && ret < len) {
+		dout("%s hit hole, ppos %lld < size %lld\n",
+		     __func__, pos, i_size);
+
+		ret = 0;
+		fill_header_only = true;
+		goto fill_last_block;
+	}
+
+	/* truncate and zero out the extra contents for the last block */
+	iov_idx = boff / PAGE_SIZE;
+	iov_off = boff % PAGE_SIZE;
+	memset(iovs[iov_idx].iov_base + iov_off, 0, PAGE_SIZE - iov_off);
+
+	/* encrypt the last block */
+	for (i = 0; i < num_pages; i++) {
+		u32 shift = CEPH_FSCRYPT_BLOCK_SIZE > PAGE_SIZE ?
+			    PAGE_SHIFT : CEPH_FSCRYPT_BLOCK_SHIFT;
+		u64 block = orig_pos >> shift;
+
+		ret = fscrypt_encrypt_block_inplace(inode, pages[i],
+						    blen, 0, block,
+						    GFP_KERNEL);
+		if (ret)
+			goto out;
+	}
+
+fill_last_block:
+	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
+	if (!pagelist)
+		return -ENOMEM;
+
+	/* Insert the header first */
+	header.ver = 1;
+	header.compat = 1;
+	/* sizeof(file_offset) + sizeof(block_size) + blen */
+	header.data_len = cpu_to_le32(8 + 8 + CEPH_FSCRYPT_BLOCK_SIZE);
+	header.file_offset = cpu_to_le64(orig_pos);
+	if (fill_header_only) {
+		header.file_offset = cpu_to_le64(0);
+		header.block_size = cpu_to_le64(0);
+	} else {
+		header.file_offset = cpu_to_le64(orig_pos);
+		header.block_size = cpu_to_le64(CEPH_FSCRYPT_BLOCK_SIZE);
+	}
+	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
+	if (ret)
+		goto out;
+
+	if (!fill_header_only) {
+		/* Append the last block contents to pagelist */
+		for (i = 0; i < num_pages; i++) {
+			ret = ceph_pagelist_append(pagelist, iovs[i].iov_base,
+						   blen);
+			if (ret)
+				goto out;
+		}
+	}
+	req->r_pagelist = pagelist;
+out:
+	dout("%s %p size dropping cap refs on %s\n", __func__,
+	     inode, ceph_cap_string(got));
+	for (i = 0; iovs && i < num_pages; i++)
+		kunmap_local(iovs[i].iov_base);
+	kfree(iovs);
+	if (pages)
+		ceph_release_page_vector(pages, num_pages);
+	if (ret && pagelist)
+		ceph_pagelist_release(pagelist);
+	ceph_put_cap_refs(ci, got);
+	return ret;
+}
+
 int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *cia)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
@@ -2236,6 +2391,7 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 	struct ceph_mds_request *req;
 	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
 	struct ceph_cap_flush *prealloc_cf;
+	loff_t isize = i_size_read(inode);
 	int issued;
 	int release = 0, dirtied = 0;
 	int mask = 0;
@@ -2367,10 +2523,31 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 		}
 	}
 	if (ia_valid & ATTR_SIZE) {
-		loff_t isize = i_size_read(inode);
-
 		dout("setattr %p size %lld -> %lld\n", inode, isize, attr->ia_size);
-		if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
+		/*
+		 * Only when the new size is smaller and not aligned to
+		 * CEPH_FSCRYPT_BLOCK_SIZE will the RMW is needed.
+		 */
+		if (IS_ENCRYPTED(inode) && attr->ia_size < isize &&
+		    (attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE)) {
+			mask |= CEPH_SETATTR_SIZE;
+			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
+				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
+			set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
+			mask |= CEPH_SETATTR_FSCRYPT_FILE;
+			req->r_args.setattr.size =
+				cpu_to_le64(round_up(attr->ia_size,
+						     CEPH_FSCRYPT_BLOCK_SIZE));
+			req->r_args.setattr.old_size =
+				cpu_to_le64(round_up(isize,
+						     CEPH_FSCRYPT_BLOCK_SIZE));
+			req->r_fscrypt_file = attr->ia_size;
+			spin_unlock(&ci->i_ceph_lock);
+			err = fill_request_for_fscrypt(inode, req, attr);
+			spin_lock(&ci->i_ceph_lock);
+			if (err)
+				goto out;
+		} else if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
 			if (attr->ia_size > isize) {
 				i_size_write(inode, attr->ia_size);
 				inode->i_blocks = calc_inode_blocks(attr->ia_size);
@@ -2382,23 +2559,10 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 			   attr->ia_size != isize) {
 			mask |= CEPH_SETATTR_SIZE;
 			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
-				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
-			if (IS_ENCRYPTED(inode)) {
-				set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
-				mask |= CEPH_SETATTR_FSCRYPT_FILE;
-				req->r_args.setattr.size =
-					cpu_to_le64(round_up(attr->ia_size,
-							     CEPH_FSCRYPT_BLOCK_SIZE));
-				req->r_args.setattr.old_size =
-					cpu_to_le64(round_up(isize,
-							     CEPH_FSCRYPT_BLOCK_SIZE));
-				req->r_fscrypt_file = attr->ia_size;
-				/* FIXME: client must zero out any partial blocks! */
-			} else {
-				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
-				req->r_args.setattr.old_size = cpu_to_le64(isize);
-				req->r_fscrypt_file = 0;
-			}
+			           CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
+			req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
+			req->r_args.setattr.old_size = cpu_to_le64(isize);
+			req->r_fscrypt_file = 0;
 		}
 	}
 	if (ia_valid & ATTR_MTIME) {
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/4] ceph: size handling for the fscrypt
  2021-10-20 13:28 [PATCH v2 0/4] ceph: size handling for the fscrypt xiubli
                   ` (3 preceding siblings ...)
  2021-10-20 13:28 ` [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt xiubli
@ 2021-10-20 15:32 ` Jeff Layton
  2021-10-25 20:13 ` Jeff Layton
  5 siblings, 0 replies; 17+ messages in thread
From: Jeff Layton @ 2021-10-20 15:32 UTC (permalink / raw)
  To: xiubli; +Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel

On Wed, 2021-10-20 at 21:28 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> This patch series is based on the fscrypt_size_handling branch in
> https://github.com/lxbsz/linux.git, which is based Jeff's
> ceph-fscrypt-content-experimental branch in
> https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git,
> has reverted one useless commit and added some upstream commits.
> 
> I will keep this patch set as simple as possible to review since
> this is still one framework code. It works and still in developing
> and need some feedbacks and suggestions for two corner cases below.
> 
> ====
> 
> This approach is based on the discussion from V1, which will pass
> the encrypted last block contents to MDS along with the truncate
> request.
> 
> This will send the encrypted last block contents to MDS along with
> the truncate request when truncating to a smaller size and at the
> same time new size does not align to BLOCK SIZE.
> 
> The MDS side patch is raised in PR
> https://github.com/ceph/ceph/pull/43588, which is also based Jeff's
> previous great work in PR https://github.com/ceph/ceph/pull/41284.
> 
> The MDS will use the filer.write_trunc(), which could update and
> truncate the file in one shot, instead of filer.truncate().
> 
> I have removed the inline data related code since we are remove
> this feature, more detail please see:
> https://tracker.ceph.com/issues/52916
> 
> 
> Note: There still has two CORNER cases we need to deal with:
> 
> 1), If a truncate request with the last block is sent to the MDS and
> just before the MDS has acquired the xlock for FILE lock, if another
> client has updated that last block content, we will over write the
> last block with old data.
> 
> For this case we could send the old encrypted last block data along
> with the truncate request and in MDS side read it and then do compare
> just before updating it, if the comparasion fails, then fail the
> truncate and let the kclient retry it.

Right -- this is the tricky bit. We're doing a truncate with a read-
modify-write cycle for the last block rolled in. We _must_ gate the
truncate+write vs. intervening changes to that extent.

You may be able to use the object version instead of comparing the old
block. The ceph-fscrypt-content branch has a patch that adds support for
CEPH_OSD_OP_ASSERT_VER, but I seem to recall that the OSD supports a way
to assert that an extent hasn't changed.

So, basically my thinking was something like:

client reads the data from the object and fetches the object version
send the object version along with the last block, and then the MDS's
write+truncate operation could assert on that version.

The catch here is that tracking those object versions is sort of nasty.
Having it do comparisons of the extent contents might be simpler.

> 2), If another client has buffered the last block, we should flush
> it first. I am still thinking how to do this ? Any idea is welcome.
> 

I think by asserting on the contents of the last block or the object
version, this problem is also solved.

> Thanks.
> 
> 
> Xiubo Li (4):
>   ceph: add __ceph_get_caps helper support
>   ceph: add __ceph_sync_read helper support
>   ceph: return the real size readed when hit EOF
>   ceph: add truncate size handling support for fscrypt
> 
>  fs/ceph/caps.c  |  28 ++++---
>  fs/ceph/file.c  |  41 ++++++----
>  fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------
>  fs/ceph/super.h |   4 +
>  4 files changed, 234 insertions(+), 49 deletions(-)
> 

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/4] ceph: return the real size readed when hit EOF
  2021-10-20 13:28 ` [PATCH v2 3/4] ceph: return the real size readed when hit EOF xiubli
@ 2021-10-25 19:05   ` Jeff Layton
  2021-10-26  3:12     ` Xiubo Li
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2021-10-25 19:05 UTC (permalink / raw)
  To: xiubli; +Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel

On Wed, 2021-10-20 at 21:28 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>  fs/ceph/file.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 74db403a4c35..1988e75ad4a2 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -910,6 +910,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>  	ssize_t ret;
>  	u64 off = *ki_pos;
>  	u64 len = iov_iter_count(to);
> +	u64 i_size = i_size_read(inode);
>  
>  	dout("sync_read on inode %p %llu~%u\n", inode, *ki_pos, (unsigned)len);
>  
> @@ -933,7 +934,6 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>  		struct page **pages;
>  		int num_pages;
>  		size_t page_off;
> -		u64 i_size;
>  		bool more;
>  		int idx;
>  		size_t left;
> @@ -980,7 +980,6 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>  
>  		ceph_osdc_put_request(req);
>  
> -		i_size = i_size_read(inode);
>  		dout("sync_read %llu~%llu got %zd i_size %llu%s\n",
>  		     off, len, ret, i_size, (more ? " MORE" : ""));
>  
> @@ -1056,11 +1055,14 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>  	}
>  
>  	if (off > *ki_pos) {
> -		if (ret >= 0 &&
> -		    iov_iter_count(to) > 0 && off >= i_size_read(inode))
> +		if (off >= i_size) {
>  			*retry_op = CHECK_EOF;
> -		ret = off - *ki_pos;
> -		*ki_pos = off;
> +			ret = i_size - *ki_pos;
> +			*ki_pos = i_size;
> +		} else {
> +			ret = off - *ki_pos;
> +			*ki_pos = off;
> +		}
>  	}
>  out:
>  	dout("sync_read result %zd retry_op %d\n", ret, *retry_op);

I'm guessing this is fixing a bug? Did you hit this in testing or did
you just notice by inspection? Should we merge this in advance of the
rest of the set?

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt
  2021-10-20 13:28 ` [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt xiubli
@ 2021-10-25 20:01   ` Jeff Layton
  2021-10-26  3:41     ` Xiubo Li
  2021-10-27  5:12     ` Xiubo Li
  0 siblings, 2 replies; 17+ messages in thread
From: Jeff Layton @ 2021-10-25 20:01 UTC (permalink / raw)
  To: xiubli; +Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel

On Wed, 2021-10-20 at 21:28 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> This will transfer the encrypted last block contents to the MDS
> along with the truncate request only when new size is smaller and
> not aligned to the fscrypt BLOCK size.
> 
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>  fs/ceph/caps.c  |   9 +--
>  fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 190 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 4e2a588465c5..1a36f0870d89 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -1296,16 +1296,13 @@ static void encode_cap_msg(struct ceph_msg *msg, struct cap_msg_args *arg)
>  	/*
>  	 * fscrypt_auth and fscrypt_file (version 12)
>  	 *
> -	 * fscrypt_auth holds the crypto context (if any). fscrypt_file
> -	 * tracks the real i_size as an __le64 field (and we use a rounded-up
> -	 * i_size in * the traditional size field).
> -	 *
> -	 * FIXME: should we encrypt fscrypt_file field?
> +	 * fscrypt_auth holds the crypto context (if any). fscrypt_file will
> +	 * always be zero here.
>  	 */
>  	ceph_encode_32(&p, arg->fscrypt_auth_len);
>  	ceph_encode_copy(&p, arg->fscrypt_auth, arg->fscrypt_auth_len);
>  	ceph_encode_32(&p, sizeof(__le64));
> -	ceph_encode_64(&p, arg->size);
> +	ceph_encode_64(&p, 0);
>  }
>  
>  /*
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 9b798690fdc9..924a69bc074d 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1035,9 +1035,13 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>  
>  		if (IS_ENCRYPTED(inode) && size &&
>  		    (iinfo->fscrypt_file_len == sizeof(__le64))) {

Hmm...testing for == sizeof(__le64) is probably too restrictive here. We
should test for >= (to accomodate other fields later if we add them).

> -			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
> -			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
> -				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
> +			u64 fsize = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
> +			if (fsize) {
> +				size = fsize;
> +				if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
> +					pr_warn("size=%llu fscrypt_file=%llu\n",
> +						info->size, size);
> +			}
>  		}
>  
>  		queue_trunc = ceph_fill_file_size(inode, issued,
> @@ -2229,6 +2233,157 @@ static const struct inode_operations ceph_encrypted_symlink_iops = {
>  	.listxattr = ceph_listxattr,
>  };
>  
> +struct ceph_fscrypt_header {

This is a pretty generic name. I'd make it more descriptive. This is for
a size change, so it probably needs to indicate that.

It also seems like this ought to go into a header file since it's
defining part of the protocol. All the existing cephfs headers that hold
those live in include/ so it may be better to put it in there, or add it
to crypto.h.

> +	__u8  ver;
> +	__u8  compat;
> +	__le32 data_len; /* length of sizeof(file_offset + block_size + BLOCK SIZE) */
> +	__le64 file_offset;
> +	__le64 block_size;

I don't forsee us needing an __le64 for block_size, particularly not
when we only have an __le32 for data_len. I'd make block_size __le32.

> +} __packed;
> +
> +/*
> + * Transfer the encrypted last block to the MDS and the MDS
> + * will update the file when truncating a smaller size.
> + */
> +static int fill_request_for_fscrypt(struct inode *inode,
> +				    struct ceph_mds_request *req,
> +				    struct iattr *attr)

nit: maybe call this fill_fscrypt_truncate() or something? That name is
too generic.

> +{
> +	struct ceph_inode_info *ci = ceph_inode(inode);
> +	int boff = attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE;
> +	loff_t pos, orig_pos = round_down(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE);
> +	size_t blen = min_t(size_t, CEPH_FSCRYPT_BLOCK_SIZE, PAGE_SIZE);
> +	struct ceph_pagelist *pagelist = NULL;
> +	struct kvec *iovs = NULL;
> +	struct iov_iter iter;
> +	struct page **pages = NULL;
> +	struct ceph_fscrypt_header header;
> +	int num_pages = 0;
> +	int retry_op = 0;
> +	int iov_off, iov_idx, len = 0;
> +	loff_t i_size = i_size_read(inode);
> +	bool fill_header_only = false;
> +	int ret, i;
> +	int got;
> +
> +	/*
> +	 * Do not support the inline data case, which will be
> +	 * removed soon
> +	 */
> +	if (ci->i_inline_version != CEPH_INLINE_NONE)
> +		return -EINVAL;
> +

I don't think we need this check.

We only call this in the fscrypt case, and we probably need to ensure
that we just never allow an inlined inode to be encrypted in the first
place. I don't see the point in enforcing this specially in truncate
operations.

If you do want to do this, then maybe turn it into a WARN_ON_ONCE too,
since it means that something is very wrong.

> +	ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got);
> +	if (ret < 0)
> +		return ret;
> +
> +	dout("%s size %lld -> %lld got cap refs on %s\n", __func__,
> +	     i_size, attr->ia_size, ceph_cap_string(got));
> +
> +	/* Should we consider the tiny page in 1K case ? */
> +	num_pages = (CEPH_FSCRYPT_BLOCK_SIZE + PAGE_SIZE -1) / PAGE_SIZE;

An earlier patch already adds this:

     BUILD_BUG_ON(CEPH_FSCRYPT_BLOCK_SHIFT > PAGE_SHIFT);

We don't support a PAGE_SIZE that is smaller than the
CEPH_FSCRYPT_BLOCK_SIZE and it will fail to build at all in that case.
You can safely assume that num_pages will always be 1 here.

If anything, the kernel as a whole is moving to larger page sizes (and
variable ones).

> +	pages = ceph_alloc_page_vector(num_pages, GFP_NOFS);
> +	if (IS_ERR(pages)) {
> +		ret = PTR_ERR(pages);
> +		goto out;
> +	}
> +
> +	iovs = kcalloc(num_pages, sizeof(struct kvec), GFP_NOFS);
> +	if (!iovs) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	for (i = 0; i < num_pages; i++) {
> +		iovs[i].iov_base = kmap_local_page(pages[i]);
> +		iovs[i].iov_len = PAGE_SIZE;
> +		len += iovs[i].iov_len;
> +	}
> +	iov_iter_kvec(&iter, READ, iovs, num_pages, len);
> +
> +	pos = orig_pos;
> +	ret = __ceph_sync_read(inode, &pos, &iter, &retry_op);
> +
> +	/*
> +	 * If we hit a hole here, we should just skip filling
> +	 * the fscrypt for the request, because once the fscrypt
> +	 * is enabled, the file will be split into many blocks
> +	 * with the size of CEPH_FSCRYPT_BLOCK_SIZE, if there
> +	 * has a hole, the hole size should be multiple of block
> +	 * size.
> +	 */
> +	if (pos < i_size && ret < len) {
> +		dout("%s hit hole, ppos %lld < size %lld\n",
> +		     __func__, pos, i_size);
> +
> +		ret = 0;
> +		fill_header_only = true;
> +		goto fill_last_block;
> +	}
> +
> +	/* truncate and zero out the extra contents for the last block */
> +	iov_idx = boff / PAGE_SIZE;
> +	iov_off = boff % PAGE_SIZE;
> +	memset(iovs[iov_idx].iov_base + iov_off, 0, PAGE_SIZE - iov_off);
> +
> +	/* encrypt the last block */
> +	for (i = 0; i < num_pages; i++) {
> +		u32 shift = CEPH_FSCRYPT_BLOCK_SIZE > PAGE_SIZE ?
> +			    PAGE_SHIFT : CEPH_FSCRYPT_BLOCK_SHIFT;
> +		u64 block = orig_pos >> shift;
> +
> +		ret = fscrypt_encrypt_block_inplace(inode, pages[i],
> +						    blen, 0, block,
> +						    GFP_KERNEL);
> +		if (ret)
> +			goto out;
> +	}
> +
> +fill_last_block:
> +	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
> +	if (!pagelist)
> +		return -ENOMEM;
> +
> +	/* Insert the header first */
> +	header.ver = 1;
> +	header.compat = 1;
> +	/* sizeof(file_offset) + sizeof(block_size) + blen */
> +	header.data_len = cpu_to_le32(8 + 8 + CEPH_FSCRYPT_BLOCK_SIZE);
> +	header.file_offset = cpu_to_le64(orig_pos);
> +	if (fill_header_only) {
> +		header.file_offset = cpu_to_le64(0);
> +		header.block_size = cpu_to_le64(0);
> +	} else {
> +		header.file_offset = cpu_to_le64(orig_pos);
> +		header.block_size = cpu_to_le64(CEPH_FSCRYPT_BLOCK_SIZE);
> +	}
> +	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
> +	if (ret)
> +		goto out;
> 
> 

Note that you're doing a read/modify/write cycle, and you must ensure
that the object remains consistent between the read and write or you may
end up with data corruption. This means that you probably need to
transmit an object version as part of the write. See this patch in the
stack:

    libceph: add CEPH_OSD_OP_ASSERT_VER support

That op tells the OSD to stop processing the request if the version is
wrong.

You'll want to grab the "ver" from the __ceph_sync_read call, and then
send that along with the updated last block. Then, when the MDS is
truncating, it can use a CEPH_OSD_OP_ASSERT_VER op with that version to
ensure that the object hasn't changed when doing it. If the assertion
trips, then the MDS should send back EAGAIN or something similar to the
client to tell it to retry.

It's also possible (though I've never used it) to make an OSD request
assert that the contents of an extent haven't changed, so you could
instead send along the old contents along with the new ones, etc.

That may end up being more efficient if the object is getting hammered
with I/O in other fscrypt blocks within the same object. It may be worth
exploring that avenue as well.

> +
> +	if (!fill_header_only) {
> +		/* Append the last block contents to pagelist */
> +		for (i = 0; i < num_pages; i++) {
> +			ret = ceph_pagelist_append(pagelist, iovs[i].iov_base,
> +						   blen);
> +			if (ret)
> +				goto out;
> +		}
> +	}
> +	req->r_pagelist = pagelist;
> +out:
> +	dout("%s %p size dropping cap refs on %s\n", __func__,
> +	     inode, ceph_cap_string(got));
> +	for (i = 0; iovs && i < num_pages; i++)
> +		kunmap_local(iovs[i].iov_base);
> +	kfree(iovs);
> +	if (pages)
> +		ceph_release_page_vector(pages, num_pages);
> +	if (ret && pagelist)
> +		ceph_pagelist_release(pagelist);
> +	ceph_put_cap_refs(ci, got);
> +	return ret;
> +}
> +
>  int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *cia)
>  {
>  	struct ceph_inode_info *ci = ceph_inode(inode);
> @@ -2236,6 +2391,7 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>  	struct ceph_mds_request *req;
>  	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
>  	struct ceph_cap_flush *prealloc_cf;
> +	loff_t isize = i_size_read(inode);
>  	int issued;
>  	int release = 0, dirtied = 0;
>  	int mask = 0;
> @@ -2367,10 +2523,31 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>  		}
>  	}
>  	if (ia_valid & ATTR_SIZE) {
> -		loff_t isize = i_size_read(inode);
> -
>  		dout("setattr %p size %lld -> %lld\n", inode, isize, attr->ia_size);
> -		if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
> +		/*
> +		 * Only when the new size is smaller and not aligned to
> +		 * CEPH_FSCRYPT_BLOCK_SIZE will the RMW is needed.
> +		 */
> +		if (IS_ENCRYPTED(inode) && attr->ia_size < isize &&
> +		    (attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE)) {
> +			mask |= CEPH_SETATTR_SIZE;
> +			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
> +				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
> +			set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
> +			mask |= CEPH_SETATTR_FSCRYPT_FILE;
> +			req->r_args.setattr.size =
> +				cpu_to_le64(round_up(attr->ia_size,
> +						     CEPH_FSCRYPT_BLOCK_SIZE));
> +			req->r_args.setattr.old_size =
> +				cpu_to_le64(round_up(isize,
> +						     CEPH_FSCRYPT_BLOCK_SIZE));
> +			req->r_fscrypt_file = attr->ia_size;
> +			spin_unlock(&ci->i_ceph_lock);
> +			err = fill_request_for_fscrypt(inode, req, attr);

It'd be better to just defer the call to fill_request_for_fscrypt until
after you've dropped the spinlock, later in the function.


> +			spin_lock(&ci->i_ceph_lock);
> +			if (err)
> +				goto out;
> +		} else if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
>  			if (attr->ia_size > isize) {
>  				i_size_write(inode, attr->ia_size);
>  				inode->i_blocks = calc_inode_blocks(attr->ia_size);
> @@ -2382,23 +2559,10 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>  			   attr->ia_size != isize) {
>  			mask |= CEPH_SETATTR_SIZE;
>  			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
> -				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
> -			if (IS_ENCRYPTED(inode)) {
> -				set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
> -				mask |= CEPH_SETATTR_FSCRYPT_FILE;
> -				req->r_args.setattr.size =
> -					cpu_to_le64(round_up(attr->ia_size,
> -							     CEPH_FSCRYPT_BLOCK_SIZE));
> -				req->r_args.setattr.old_size =
> -					cpu_to_le64(round_up(isize,
> -							     CEPH_FSCRYPT_BLOCK_SIZE));
> -				req->r_fscrypt_file = attr->ia_size;
> -				/* FIXME: client must zero out any partial blocks! */
> -			} else {
> -				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
> -				req->r_args.setattr.old_size = cpu_to_le64(isize);
> -				req->r_fscrypt_file = 0;
> -			}
> +			           CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
> +			req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
> +			req->r_args.setattr.old_size = cpu_to_le64(isize);
> +			req->r_fscrypt_file = 0;
>  		}
>  	}
>  	if (ia_valid & ATTR_MTIME) {

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/4] ceph: size handling for the fscrypt
  2021-10-20 13:28 [PATCH v2 0/4] ceph: size handling for the fscrypt xiubli
                   ` (4 preceding siblings ...)
  2021-10-20 15:32 ` [PATCH v2 0/4] ceph: size handling for the fscrypt Jeff Layton
@ 2021-10-25 20:13 ` Jeff Layton
  5 siblings, 0 replies; 17+ messages in thread
From: Jeff Layton @ 2021-10-25 20:13 UTC (permalink / raw)
  To: xiubli; +Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel

On Wed, 2021-10-20 at 21:28 +0800, xiubli@redhat.com wrote:
> From: Xiubo Li <xiubli@redhat.com>
> 
> This patch series is based on the fscrypt_size_handling branch in
> https://github.com/lxbsz/linux.git, which is based Jeff's
> ceph-fscrypt-content-experimental branch in
> https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git,
> has reverted one useless commit and added some upstream commits.
> 
> I will keep this patch set as simple as possible to review since
> this is still one framework code. It works and still in developing
> and need some feedbacks and suggestions for two corner cases below.
> 

Ok, I think you're on a good start. It's a bit hard to review patchsets
that rely on other out-of-tree patches. The ceph-fscrypt-content-
experimental branch is _really_ experimental, and you start the series
by reverting one of the patches in there.

The goal here is for you to send a complete series that we'll want to
merge, without all of the interim churn of reverting patches and
changing the same code several times.

I think what would be best here is for you to put together and send a
complete patchset that is based on top of the
ceph-client/wip-fscrypt-fnames branch. I've just rebased that branch
onto the latest ceph-client/testing branch, but it has been fairly
stable for the last few weeks, and any changes we make in there probably
won't affect in the areas you're working as much.

Feel free to cherry-pick any patches from my ceph-fscrypt-content-
experimental branch that you think should part of the set. Also feel
free to modify any of those patches. We don't want buggy patches partway
through the merge of the series.

Does that sound OK?

> ====
> 
> This approach is based on the discussion from V1, which will pass
> the encrypted last block contents to MDS along with the truncate
> request.
> 
> This will send the encrypted last block contents to MDS along with
> the truncate request when truncating to a smaller size and at the
> same time new size does not align to BLOCK SIZE.
> 
> The MDS side patch is raised in PR
> https://github.com/ceph/ceph/pull/43588, which is also based Jeff's
> previous great work in PR https://github.com/ceph/ceph/pull/41284.
> 
> The MDS will use the filer.write_trunc(), which could update and
> truncate the file in one shot, instead of filer.truncate().
> 
> I have removed the inline data related code since we are remove
> this feature, more detail please see:
> https://tracker.ceph.com/issues/52916
> 
> 
> Note: There still has two CORNER cases we need to deal with:
> 
> 1), If a truncate request with the last block is sent to the MDS and
> just before the MDS has acquired the xlock for FILE lock, if another
> client has updated that last block content, we will over write the
> last block with old data.
> 
> For this case we could send the old encrypted last block data along
> with the truncate request and in MDS side read it and then do compare
> just before updating it, if the comparasion fails, then fail the
> truncate and let the kclient retry it.
> 
> 2), If another client has buffered the last block, we should flush
> it first. I am still thinking how to do this ? Any idea is welcome.
> 
> Thanks.
> 
> 
> Xiubo Li (4):
>   ceph: add __ceph_get_caps helper support
>   ceph: add __ceph_sync_read helper support
>   ceph: return the real size readed when hit EOF
>   ceph: add truncate size handling support for fscrypt
> 
>  fs/ceph/caps.c  |  28 ++++---
>  fs/ceph/file.c  |  41 ++++++----
>  fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------
>  fs/ceph/super.h |   4 +
>  4 files changed, 234 insertions(+), 49 deletions(-)
> 

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/4] ceph: return the real size readed when hit EOF
  2021-10-25 19:05   ` Jeff Layton
@ 2021-10-26  3:12     ` Xiubo Li
  0 siblings, 0 replies; 17+ messages in thread
From: Xiubo Li @ 2021-10-26  3:12 UTC (permalink / raw)
  To: Jeff Layton; +Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel


On 10/26/21 3:05 AM, Jeff Layton wrote:
> On Wed, 2021-10-20 at 21:28 +0800, xiubli@redhat.com wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>   fs/ceph/file.c | 14 ++++++++------
>>   1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 74db403a4c35..1988e75ad4a2 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -910,6 +910,7 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>>   	ssize_t ret;
>>   	u64 off = *ki_pos;
>>   	u64 len = iov_iter_count(to);
>> +	u64 i_size = i_size_read(inode);
>>   
>>   	dout("sync_read on inode %p %llu~%u\n", inode, *ki_pos, (unsigned)len);
>>   
>> @@ -933,7 +934,6 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>>   		struct page **pages;
>>   		int num_pages;
>>   		size_t page_off;
>> -		u64 i_size;
>>   		bool more;
>>   		int idx;
>>   		size_t left;
>> @@ -980,7 +980,6 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>>   
>>   		ceph_osdc_put_request(req);
>>   
>> -		i_size = i_size_read(inode);
>>   		dout("sync_read %llu~%llu got %zd i_size %llu%s\n",
>>   		     off, len, ret, i_size, (more ? " MORE" : ""));
>>   
>> @@ -1056,11 +1055,14 @@ ssize_t __ceph_sync_read(struct inode *inode, loff_t *ki_pos,
>>   	}
>>   
>>   	if (off > *ki_pos) {
>> -		if (ret >= 0 &&
>> -		    iov_iter_count(to) > 0 && off >= i_size_read(inode))
>> +		if (off >= i_size) {
>>   			*retry_op = CHECK_EOF;
>> -		ret = off - *ki_pos;
>> -		*ki_pos = off;
>> +			ret = i_size - *ki_pos;
>> +			*ki_pos = i_size;
>> +		} else {
>> +			ret = off - *ki_pos;
>> +			*ki_pos = off;
>> +		}
>>   	}
>>   out:
>>   	dout("sync_read result %zd retry_op %d\n", ret, *retry_op);
> I'm guessing this is fixing a bug? Did you hit this in testing or did
> you just notice by inspection? Should we merge this in advance of the
> rest of the set?

This is one bug, when testing the fscrypt feature and I can reproduce it 
every time, in theory this bug should be seen in none fscrypt case.

I just send the V2 here in this series to show in which use case I hit 
this bug.

If it looks good I will post the V3 based the upstream code and add more 
comments on it.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt
  2021-10-25 20:01   ` Jeff Layton
@ 2021-10-26  3:41     ` Xiubo Li
  2021-10-27 23:23       ` Xiubo Li
  2021-10-27  5:12     ` Xiubo Li
  1 sibling, 1 reply; 17+ messages in thread
From: Xiubo Li @ 2021-10-26  3:41 UTC (permalink / raw)
  To: Jeff Layton; +Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel


On 10/26/21 4:01 AM, Jeff Layton wrote:
> On Wed, 2021-10-20 at 21:28 +0800, xiubli@redhat.com wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> This will transfer the encrypted last block contents to the MDS
>> along with the truncate request only when new size is smaller and
>> not aligned to the fscrypt BLOCK size.
>>
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>   fs/ceph/caps.c  |   9 +--
>>   fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------
>>   2 files changed, 190 insertions(+), 29 deletions(-)
>>
>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> index 4e2a588465c5..1a36f0870d89 100644
>> --- a/fs/ceph/caps.c
>> +++ b/fs/ceph/caps.c
>> @@ -1296,16 +1296,13 @@ static void encode_cap_msg(struct ceph_msg *msg, struct cap_msg_args *arg)
>>   	/*
>>   	 * fscrypt_auth and fscrypt_file (version 12)
>>   	 *
>> -	 * fscrypt_auth holds the crypto context (if any). fscrypt_file
>> -	 * tracks the real i_size as an __le64 field (and we use a rounded-up
>> -	 * i_size in * the traditional size field).
>> -	 *
>> -	 * FIXME: should we encrypt fscrypt_file field?
>> +	 * fscrypt_auth holds the crypto context (if any). fscrypt_file will
>> +	 * always be zero here.
>>   	 */
>>   	ceph_encode_32(&p, arg->fscrypt_auth_len);
>>   	ceph_encode_copy(&p, arg->fscrypt_auth, arg->fscrypt_auth_len);
>>   	ceph_encode_32(&p, sizeof(__le64));
>> -	ceph_encode_64(&p, arg->size);
>> +	ceph_encode_64(&p, 0);
>>   }
>>   
>>   /*
>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> index 9b798690fdc9..924a69bc074d 100644
>> --- a/fs/ceph/inode.c
>> +++ b/fs/ceph/inode.c
>> @@ -1035,9 +1035,13 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>>   
>>   		if (IS_ENCRYPTED(inode) && size &&
>>   		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
> Hmm...testing for == sizeof(__le64) is probably too restrictive here. We
> should test for >= (to accomodate other fields later if we add them).
Okay, will fix it.
>> -			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
>> -			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
>> -				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
>> +			u64 fsize = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
>> +			if (fsize) {
>> +				size = fsize;
>> +				if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
>> +					pr_warn("size=%llu fscrypt_file=%llu\n",
>> +						info->size, size);
>> +			}
>>   		}
>>   
>>   		queue_trunc = ceph_fill_file_size(inode, issued,
>> @@ -2229,6 +2233,157 @@ static const struct inode_operations ceph_encrypted_symlink_iops = {
>>   	.listxattr = ceph_listxattr,
>>   };
>>   
>> +struct ceph_fscrypt_header {
> This is a pretty generic name. I'd make it more descriptive. This is for
> a size change, so it probably needs to indicate that.
>
> It also seems like this ought to go into a header file since it's
> defining part of the protocol. All the existing cephfs headers that hold
> those live in include/ so it may be better to put it in there, or add it
> to crypto.h.

Sounds reasonable and will fix it.


>
>> +	__u8  ver;
>> +	__u8  compat;
>> +	__le32 data_len; /* length of sizeof(file_offset + block_size + BLOCK SIZE) */
>> +	__le64 file_offset;
>> +	__le64 block_size;
> I don't forsee us needing an __le64 for block_size, particularly not
> when we only have an __le32 for data_len. I'd make block_size __le32.

Okay.


>> +} __packed;
>> +
>> +/*
>> + * Transfer the encrypted last block to the MDS and the MDS
>> + * will update the file when truncating a smaller size.
>> + */
>> +static int fill_request_for_fscrypt(struct inode *inode,
>> +				    struct ceph_mds_request *req,
>> +				    struct iattr *attr)
> nit: maybe call this fill_fscrypt_truncate() or something? That name is
> too generic.

Sure.


>> +{
>> +	struct ceph_inode_info *ci = ceph_inode(inode);
>> +	int boff = attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE;
>> +	loff_t pos, orig_pos = round_down(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE);
>> +	size_t blen = min_t(size_t, CEPH_FSCRYPT_BLOCK_SIZE, PAGE_SIZE);
>> +	struct ceph_pagelist *pagelist = NULL;
>> +	struct kvec *iovs = NULL;
>> +	struct iov_iter iter;
>> +	struct page **pages = NULL;
>> +	struct ceph_fscrypt_header header;
>> +	int num_pages = 0;
>> +	int retry_op = 0;
>> +	int iov_off, iov_idx, len = 0;
>> +	loff_t i_size = i_size_read(inode);
>> +	bool fill_header_only = false;
>> +	int ret, i;
>> +	int got;
>> +
>> +	/*
>> +	 * Do not support the inline data case, which will be
>> +	 * removed soon
>> +	 */
>> +	if (ci->i_inline_version != CEPH_INLINE_NONE)
>> +		return -EINVAL;
>> +
> I don't think we need this check.
>
> We only call this in the fscrypt case, and we probably need to ensure
> that we just never allow an inlined inode to be encrypted in the first
> place. I don't see the point in enforcing this specially in truncate
> operations.

I just assume will adjust this code based the inline data removal patch 
set later.

I can remove this temporarily.

> If you do want to do this, then maybe turn it into a WARN_ON_ONCE too,
> since it means that something is very wrong.
>
>> +	ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	dout("%s size %lld -> %lld got cap refs on %s\n", __func__,
>> +	     i_size, attr->ia_size, ceph_cap_string(got));
>> +
>> +	/* Should we consider the tiny page in 1K case ? */
>> +	num_pages = (CEPH_FSCRYPT_BLOCK_SIZE + PAGE_SIZE -1) / PAGE_SIZE;
> An earlier patch already adds this:
>
>       BUILD_BUG_ON(CEPH_FSCRYPT_BLOCK_SHIFT > PAGE_SHIFT);
>
> We don't support a PAGE_SIZE that is smaller than the
> CEPH_FSCRYPT_BLOCK_SIZE and it will fail to build at all in that case.
> You can safely assume that num_pages will always be 1 here.
>
> If anything, the kernel as a whole is moving to larger page sizes (and
> variable ones).

Sure, I didn't notice that check. Will fix it.


>> +	pages = ceph_alloc_page_vector(num_pages, GFP_NOFS);
>> +	if (IS_ERR(pages)) {
>> +		ret = PTR_ERR(pages);
>> +		goto out;
>> +	}
>> +
>> +	iovs = kcalloc(num_pages, sizeof(struct kvec), GFP_NOFS);
>> +	if (!iovs) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +	for (i = 0; i < num_pages; i++) {
>> +		iovs[i].iov_base = kmap_local_page(pages[i]);
>> +		iovs[i].iov_len = PAGE_SIZE;
>> +		len += iovs[i].iov_len;
>> +	}
>> +	iov_iter_kvec(&iter, READ, iovs, num_pages, len);
>> +
>> +	pos = orig_pos;
>> +	ret = __ceph_sync_read(inode, &pos, &iter, &retry_op);
>> +
>> +	/*
>> +	 * If we hit a hole here, we should just skip filling
>> +	 * the fscrypt for the request, because once the fscrypt
>> +	 * is enabled, the file will be split into many blocks
>> +	 * with the size of CEPH_FSCRYPT_BLOCK_SIZE, if there
>> +	 * has a hole, the hole size should be multiple of block
>> +	 * size.
>> +	 */
>> +	if (pos < i_size && ret < len) {
>> +		dout("%s hit hole, ppos %lld < size %lld\n",
>> +		     __func__, pos, i_size);
>> +
>> +		ret = 0;
>> +		fill_header_only = true;
>> +		goto fill_last_block;
>> +	}
>> +
>> +	/* truncate and zero out the extra contents for the last block */
>> +	iov_idx = boff / PAGE_SIZE;
>> +	iov_off = boff % PAGE_SIZE;
>> +	memset(iovs[iov_idx].iov_base + iov_off, 0, PAGE_SIZE - iov_off);
>> +
>> +	/* encrypt the last block */
>> +	for (i = 0; i < num_pages; i++) {
>> +		u32 shift = CEPH_FSCRYPT_BLOCK_SIZE > PAGE_SIZE ?
>> +			    PAGE_SHIFT : CEPH_FSCRYPT_BLOCK_SHIFT;
>> +		u64 block = orig_pos >> shift;
>> +
>> +		ret = fscrypt_encrypt_block_inplace(inode, pages[i],
>> +						    blen, 0, block,
>> +						    GFP_KERNEL);
>> +		if (ret)
>> +			goto out;
>> +	}
>> +
>> +fill_last_block:
>> +	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
>> +	if (!pagelist)
>> +		return -ENOMEM;
>> +
>> +	/* Insert the header first */
>> +	header.ver = 1;
>> +	header.compat = 1;
>> +	/* sizeof(file_offset) + sizeof(block_size) + blen */
>> +	header.data_len = cpu_to_le32(8 + 8 + CEPH_FSCRYPT_BLOCK_SIZE);
>> +	header.file_offset = cpu_to_le64(orig_pos);
>> +	if (fill_header_only) {
>> +		header.file_offset = cpu_to_le64(0);
>> +		header.block_size = cpu_to_le64(0);
>> +	} else {
>> +		header.file_offset = cpu_to_le64(orig_pos);
>> +		header.block_size = cpu_to_le64(CEPH_FSCRYPT_BLOCK_SIZE);
>> +	}
>> +	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
>> +	if (ret)
>> +		goto out;
>>
>>
> Note that you're doing a read/modify/write cycle, and you must ensure
> that the object remains consistent between the read and write or you may
> end up with data corruption. This means that you probably need to
> transmit an object version as part of the write. See this patch in the
> stack:
>
>      libceph: add CEPH_OSD_OP_ASSERT_VER support
>
> That op tells the OSD to stop processing the request if the version is
> wrong.
>
> You'll want to grab the "ver" from the __ceph_sync_read call, and then
> send that along with the updated last block. Then, when the MDS is
> truncating, it can use a CEPH_OSD_OP_ASSERT_VER op with that version to
> ensure that the object hasn't changed when doing it. If the assertion
> trips, then the MDS should send back EAGAIN or something similar to the
> client to tell it to retry.

Yeah, this is one issues I have mentioned in the cover-letter, I found 
the cover-letter wasn't successfully sent to the mail list.

Except this, there also has another isssue, I will pasted some important 
sections here from the cover-letter:

======

This approach is based on the discussion from V1, which will pass
the encrypted last block contents to MDS along with the truncate
request.

This will send the encrypted last block contents to MDS along with
the truncate request when truncating to a smaller size and at the
same time new size does not align to BLOCK SIZE.

The MDS side patch is raised in PR
https://github.com/ceph/ceph/pull/43588, which is also based Jeff's
previous great work in PR https://github.com/ceph/ceph/pull/41284.

The MDS will use the filer.write_trunc(), which could update and
truncate the file in one shot, instead of filer.truncate().

I have removed the inline data related code since we are remove
this feature, more detail please see:
https://tracker.ceph.com/issues/52916


Note: There still has two CORNER cases we need to deal with:

1), If a truncate request with the last block is sent to the MDS and
just before the MDS has acquired the xlock for FILE lock, if another
client has updated that last block content, we will over write the
last block with old data.

For this case we could send the old encrypted last block data along
with the truncate request and in MDS side read it and then do compare
just before updating it, if the comparasion fails, then fail the
truncate and let the kclient retry it.

2), If another client has buffered the last block, we should flush
it first. I am still thinking how to do this ? Any idea is welcome.

======

For the first issue, I will check the "CEPH_OSD_OP_ASSERT_VER".

For the second issue, I think the "CEPH_OSD_OP_ASSERT_VER" could work 
too. When the MDS has successfully xlocked the FILE lock, we could 
notify the kclients to flush the buffer first, and then try to 
write/truncate the file, if it fails due to the "CEPH_OSD_OP_ASSERT_VER" 
check, then let kclient retry the truncate request.

NOTE: All the xlock and xlock related interim states allowed the clients 
continue to hold the Fcb caps which are issued from previous states, 
except that the LOCK_XLOCKSNAP only allow clients to hold Fc.


>
> It's also possible (though I've never used it) to make an OSD request
> assert that the contents of an extent haven't changed, so you could
> instead send along the old contents along with the new ones, etc.
>
> That may end up being more efficient if the object is getting hammered
> with I/O in other fscrypt blocks within the same object. It may be worth
> exploring that avenue as well.
>
>> +
>> +	if (!fill_header_only) {
>> +		/* Append the last block contents to pagelist */
>> +		for (i = 0; i < num_pages; i++) {
>> +			ret = ceph_pagelist_append(pagelist, iovs[i].iov_base,
>> +						   blen);
>> +			if (ret)
>> +				goto out;
>> +		}
>> +	}
>> +	req->r_pagelist = pagelist;
>> +out:
>> +	dout("%s %p size dropping cap refs on %s\n", __func__,
>> +	     inode, ceph_cap_string(got));
>> +	for (i = 0; iovs && i < num_pages; i++)
>> +		kunmap_local(iovs[i].iov_base);
>> +	kfree(iovs);
>> +	if (pages)
>> +		ceph_release_page_vector(pages, num_pages);
>> +	if (ret && pagelist)
>> +		ceph_pagelist_release(pagelist);
>> +	ceph_put_cap_refs(ci, got);
>> +	return ret;
>> +}
>> +
>>   int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *cia)
>>   {
>>   	struct ceph_inode_info *ci = ceph_inode(inode);
>> @@ -2236,6 +2391,7 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>   	struct ceph_mds_request *req;
>>   	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
>>   	struct ceph_cap_flush *prealloc_cf;
>> +	loff_t isize = i_size_read(inode);
>>   	int issued;
>>   	int release = 0, dirtied = 0;
>>   	int mask = 0;
>> @@ -2367,10 +2523,31 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>   		}
>>   	}
>>   	if (ia_valid & ATTR_SIZE) {
>> -		loff_t isize = i_size_read(inode);
>> -
>>   		dout("setattr %p size %lld -> %lld\n", inode, isize, attr->ia_size);
>> -		if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
>> +		/*
>> +		 * Only when the new size is smaller and not aligned to
>> +		 * CEPH_FSCRYPT_BLOCK_SIZE will the RMW is needed.
>> +		 */
>> +		if (IS_ENCRYPTED(inode) && attr->ia_size < isize &&
>> +		    (attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE)) {
>> +			mask |= CEPH_SETATTR_SIZE;
>> +			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
>> +				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
>> +			set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
>> +			mask |= CEPH_SETATTR_FSCRYPT_FILE;
>> +			req->r_args.setattr.size =
>> +				cpu_to_le64(round_up(attr->ia_size,
>> +						     CEPH_FSCRYPT_BLOCK_SIZE));
>> +			req->r_args.setattr.old_size =
>> +				cpu_to_le64(round_up(isize,
>> +						     CEPH_FSCRYPT_BLOCK_SIZE));
>> +			req->r_fscrypt_file = attr->ia_size;
>> +			spin_unlock(&ci->i_ceph_lock);
>> +			err = fill_request_for_fscrypt(inode, req, attr);
> It'd be better to just defer the call to fill_request_for_fscrypt until
> after you've dropped the spinlock, later in the function.
>
Sound good and will do it.


>> +			spin_lock(&ci->i_ceph_lock);
>> +			if (err)
>> +				goto out;
>> +		} else if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
>>   			if (attr->ia_size > isize) {
>>   				i_size_write(inode, attr->ia_size);
>>   				inode->i_blocks = calc_inode_blocks(attr->ia_size);
>> @@ -2382,23 +2559,10 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>>   			   attr->ia_size != isize) {
>>   			mask |= CEPH_SETATTR_SIZE;
>>   			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
>> -				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
>> -			if (IS_ENCRYPTED(inode)) {
>> -				set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
>> -				mask |= CEPH_SETATTR_FSCRYPT_FILE;
>> -				req->r_args.setattr.size =
>> -					cpu_to_le64(round_up(attr->ia_size,
>> -							     CEPH_FSCRYPT_BLOCK_SIZE));
>> -				req->r_args.setattr.old_size =
>> -					cpu_to_le64(round_up(isize,
>> -							     CEPH_FSCRYPT_BLOCK_SIZE));
>> -				req->r_fscrypt_file = attr->ia_size;
>> -				/* FIXME: client must zero out any partial blocks! */
>> -			} else {
>> -				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>> -				req->r_args.setattr.old_size = cpu_to_le64(isize);
>> -				req->r_fscrypt_file = 0;
>> -			}
>> +			           CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
>> +			req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>> +			req->r_args.setattr.old_size = cpu_to_le64(isize);
>> +			req->r_fscrypt_file = 0;
>>   		}
>>   	}
>>   	if (ia_valid & ATTR_MTIME) {


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt
  2021-10-25 20:01   ` Jeff Layton
  2021-10-26  3:41     ` Xiubo Li
@ 2021-10-27  5:12     ` Xiubo Li
  2021-10-27 12:17       ` Jeff Layton
  1 sibling, 1 reply; 17+ messages in thread
From: Xiubo Li @ 2021-10-27  5:12 UTC (permalink / raw)
  To: Jeff Layton; +Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel


On 10/26/21 4:01 AM, Jeff Layton wrote:
> On Wed, 2021-10-20 at 21:28 +0800, xiubli@redhat.com wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> This will transfer the encrypted last block contents to the MDS
>> along with the truncate request only when new size is smaller and
>> not aligned to the fscrypt BLOCK size.
>>
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>   fs/ceph/caps.c  |   9 +--
>>   fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------
>>   2 files changed, 190 insertions(+), 29 deletions(-)
>>
>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> index 4e2a588465c5..1a36f0870d89 100644
>> --- a/fs/ceph/caps.c
>> +++ b/fs/ceph/caps.c
...
>> +fill_last_block:
>> +	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
>> +	if (!pagelist)
>> +		return -ENOMEM;
>> +
>> +	/* Insert the header first */
>> +	header.ver = 1;
>> +	header.compat = 1;
>> +	/* sizeof(file_offset) + sizeof(block_size) + blen */
>> +	header.data_len = cpu_to_le32(8 + 8 + CEPH_FSCRYPT_BLOCK_SIZE);
>> +	header.file_offset = cpu_to_le64(orig_pos);
>> +	if (fill_header_only) {
>> +		header.file_offset = cpu_to_le64(0);
>> +		header.block_size = cpu_to_le64(0);
>> +	} else {
>> +		header.file_offset = cpu_to_le64(orig_pos);
>> +		header.block_size = cpu_to_le64(CEPH_FSCRYPT_BLOCK_SIZE);
>> +	}
>> +	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
>> +	if (ret)
>> +		goto out;
>>
>>
> Note that you're doing a read/modify/write cycle, and you must ensure
> that the object remains consistent between the read and write or you may
> end up with data corruption. This means that you probably need to
> transmit an object version as part of the write. See this patch in the
> stack:
>
>      libceph: add CEPH_OSD_OP_ASSERT_VER support
>
> That op tells the OSD to stop processing the request if the version is
> wrong.
>
> You'll want to grab the "ver" from the __ceph_sync_read call, and then
> send that along with the updated last block. Then, when the MDS is
> truncating, it can use a CEPH_OSD_OP_ASSERT_VER op with that version to
> ensure that the object hasn't changed when doing it. If the assertion
> trips, then the MDS should send back EAGAIN or something similar to the
> client to tell it to retry.
>
> It's also possible (though I've never used it) to make an OSD request
> assert that the contents of an extent haven't changed, so you could
> instead send along the old contents along with the new ones, etc.
>
> That may end up being more efficient if the object is getting hammered
> with I/O in other fscrypt blocks within the same object. It may be worth
> exploring that avenue as well.

Hi Jeff,

One questions about this:

Should we consider that the FSCRYPT BLOCK will cross two different Rados 
objects ? As default the Rados object size is 4MB.

In case the FSCRYPT BLOCK size is 4K, when the object size is 3K or 5K ?

Or the object size should always be multiple of FSCRYPT BLOCK size ?


Thanks




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt
  2021-10-27  5:12     ` Xiubo Li
@ 2021-10-27 12:17       ` Jeff Layton
  2021-10-27 13:57         ` Xiubo Li
  2021-10-27 15:06         ` Luís Henriques
  0 siblings, 2 replies; 17+ messages in thread
From: Jeff Layton @ 2021-10-27 12:17 UTC (permalink / raw)
  To: Xiubo Li
  Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel, Luis Henriques

On Wed, 2021-10-27 at 13:12 +0800, Xiubo Li wrote:
> On 10/26/21 4:01 AM, Jeff Layton wrote:
> > On Wed, 2021-10-20 at 21:28 +0800, xiubli@redhat.com wrote:
> > > From: Xiubo Li <xiubli@redhat.com>
> > > 
> > > This will transfer the encrypted last block contents to the MDS
> > > along with the truncate request only when new size is smaller and
> > > not aligned to the fscrypt BLOCK size.
> > > 
> > > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > ---
> > >   fs/ceph/caps.c  |   9 +--
> > >   fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------
> > >   2 files changed, 190 insertions(+), 29 deletions(-)
> > > 
> > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > index 4e2a588465c5..1a36f0870d89 100644
> > > --- a/fs/ceph/caps.c
> > > +++ b/fs/ceph/caps.c
> ...
> > > +fill_last_block:
> > > +	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
> > > +	if (!pagelist)
> > > +		return -ENOMEM;
> > > +
> > > +	/* Insert the header first */
> > > +	header.ver = 1;
> > > +	header.compat = 1;
> > > +	/* sizeof(file_offset) + sizeof(block_size) + blen */
> > > +	header.data_len = cpu_to_le32(8 + 8 + CEPH_FSCRYPT_BLOCK_SIZE);
> > > +	header.file_offset = cpu_to_le64(orig_pos);
> > > +	if (fill_header_only) {
> > > +		header.file_offset = cpu_to_le64(0);
> > > +		header.block_size = cpu_to_le64(0);
> > > +	} else {
> > > +		header.file_offset = cpu_to_le64(orig_pos);
> > > +		header.block_size = cpu_to_le64(CEPH_FSCRYPT_BLOCK_SIZE);
> > > +	}
> > > +	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
> > > +	if (ret)
> > > +		goto out;
> > > 
> > > 
> > Note that you're doing a read/modify/write cycle, and you must ensure
> > that the object remains consistent between the read and write or you may
> > end up with data corruption. This means that you probably need to
> > transmit an object version as part of the write. See this patch in the
> > stack:
> > 
> >      libceph: add CEPH_OSD_OP_ASSERT_VER support
> > 
> > That op tells the OSD to stop processing the request if the version is
> > wrong.
> > 
> > You'll want to grab the "ver" from the __ceph_sync_read call, and then
> > send that along with the updated last block. Then, when the MDS is
> > truncating, it can use a CEPH_OSD_OP_ASSERT_VER op with that version to
> > ensure that the object hasn't changed when doing it. If the assertion
> > trips, then the MDS should send back EAGAIN or something similar to the
> > client to tell it to retry.
> > 
> > It's also possible (though I've never used it) to make an OSD request
> > assert that the contents of an extent haven't changed, so you could
> > instead send along the old contents along with the new ones, etc.
> > 
> > That may end up being more efficient if the object is getting hammered
> > with I/O in other fscrypt blocks within the same object. It may be worth
> > exploring that avenue as well.
> 
> Hi Jeff,
> 
> One questions about this:
> 
> Should we consider that the FSCRYPT BLOCK will cross two different Rados 
> objects ? As default the Rados object size is 4MB.
> 
> In case the FSCRYPT BLOCK size is 4K, when the object size is 3K or 5K ?
> 
> Or the object size should always be multiple of FSCRYPT BLOCK size ?
> 

The danger here is that it's very hard to ensure atomicity in RADOS
across two different objects. If your crypto blocks can span objects,
then you can end up with torn writes, and a torn write on a crypto block
turns it into garbage.

So, I think we want to forbid:

1/ custom file layouts on encrypted files, to ensure that we don't end
up with weird object sizes. Luis' patch from August does this, but I
think we might want the MDS to also vet this.

2/ block sizes larger than the object size

3/ non-power-of-two crypto block sizes (so no 3k or 5k blocks, but you
could do 1k, 2k, 4k, 8k, etc...)

...with that we should be able to ensure that they never span objects.
Eventually we may be able to relax some of these constraints, but I
don't think most users will have a problem with these constraints.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt
  2021-10-27 12:17       ` Jeff Layton
@ 2021-10-27 13:57         ` Xiubo Li
  2021-10-27 15:06         ` Luís Henriques
  1 sibling, 0 replies; 17+ messages in thread
From: Xiubo Li @ 2021-10-27 13:57 UTC (permalink / raw)
  To: Jeff Layton
  Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel, Luis Henriques


On 10/27/21 8:17 PM, Jeff Layton wrote:
> On Wed, 2021-10-27 at 13:12 +0800, Xiubo Li wrote:
>> On 10/26/21 4:01 AM, Jeff Layton wrote:
>>> On Wed, 2021-10-20 at 21:28 +0800, xiubli@redhat.com wrote:
>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>
>>>> This will transfer the encrypted last block contents to the MDS
>>>> along with the truncate request only when new size is smaller and
>>>> not aligned to the fscrypt BLOCK size.
>>>>
>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>> ---
>>>>    fs/ceph/caps.c  |   9 +--
>>>>    fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------
>>>>    2 files changed, 190 insertions(+), 29 deletions(-)
>>>>
>>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>>> index 4e2a588465c5..1a36f0870d89 100644
>>>> --- a/fs/ceph/caps.c
>>>> +++ b/fs/ceph/caps.c
>> ...
>>>> +fill_last_block:
>>>> +	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
>>>> +	if (!pagelist)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	/* Insert the header first */
>>>> +	header.ver = 1;
>>>> +	header.compat = 1;
>>>> +	/* sizeof(file_offset) + sizeof(block_size) + blen */
>>>> +	header.data_len = cpu_to_le32(8 + 8 + CEPH_FSCRYPT_BLOCK_SIZE);
>>>> +	header.file_offset = cpu_to_le64(orig_pos);
>>>> +	if (fill_header_only) {
>>>> +		header.file_offset = cpu_to_le64(0);
>>>> +		header.block_size = cpu_to_le64(0);
>>>> +	} else {
>>>> +		header.file_offset = cpu_to_le64(orig_pos);
>>>> +		header.block_size = cpu_to_le64(CEPH_FSCRYPT_BLOCK_SIZE);
>>>> +	}
>>>> +	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
>>>> +	if (ret)
>>>> +		goto out;
>>>>
>>>>
>>> Note that you're doing a read/modify/write cycle, and you must ensure
>>> that the object remains consistent between the read and write or you may
>>> end up with data corruption. This means that you probably need to
>>> transmit an object version as part of the write. See this patch in the
>>> stack:
>>>
>>>       libceph: add CEPH_OSD_OP_ASSERT_VER support
>>>
>>> That op tells the OSD to stop processing the request if the version is
>>> wrong.
>>>
>>> You'll want to grab the "ver" from the __ceph_sync_read call, and then
>>> send that along with the updated last block. Then, when the MDS is
>>> truncating, it can use a CEPH_OSD_OP_ASSERT_VER op with that version to
>>> ensure that the object hasn't changed when doing it. If the assertion
>>> trips, then the MDS should send back EAGAIN or something similar to the
>>> client to tell it to retry.
>>>
>>> It's also possible (though I've never used it) to make an OSD request
>>> assert that the contents of an extent haven't changed, so you could
>>> instead send along the old contents along with the new ones, etc.
>>>
>>> That may end up being more efficient if the object is getting hammered
>>> with I/O in other fscrypt blocks within the same object. It may be worth
>>> exploring that avenue as well.
>> Hi Jeff,
>>
>> One questions about this:
>>
>> Should we consider that the FSCRYPT BLOCK will cross two different Rados
>> objects ? As default the Rados object size is 4MB.
>>
>> In case the FSCRYPT BLOCK size is 4K, when the object size is 3K or 5K ?
>>
>> Or the object size should always be multiple of FSCRYPT BLOCK size ?
>>
> The danger here is that it's very hard to ensure atomicity in RADOS
> across two different objects. If your crypto blocks can span objects,
> then you can end up with torn writes, and a torn write on a crypto block
> turns it into garbage.
>
> So, I think we want to forbid:
>
> 1/ custom file layouts on encrypted files, to ensure that we don't end
> up with weird object sizes. Luis' patch from August does this, but I
> think we might want the MDS to also vet this.
>
> 2/ block sizes larger than the object size
>
> 3/ non-power-of-two crypto block sizes (so no 3k or 5k blocks, but you
> could do 1k, 2k, 4k, 8k, etc...)
>
> ...with that we should be able to ensure that they never span objects.
> Eventually we may be able to relax some of these constraints, but I
> don't think most users will have a problem with these constraints.

Yeah, this sounds reasonable. I will code based on these assumptions.

Thanks.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt
  2021-10-27 12:17       ` Jeff Layton
  2021-10-27 13:57         ` Xiubo Li
@ 2021-10-27 15:06         ` Luís Henriques
  2021-10-27 23:08           ` Xiubo Li
  1 sibling, 1 reply; 17+ messages in thread
From: Luís Henriques @ 2021-10-27 15:06 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Xiubo Li, idryomov, vshankar, khiremat, pdonnell, ceph-devel

On Wed, Oct 27, 2021 at 08:17:55AM -0400, Jeff Layton wrote:
> On Wed, 2021-10-27 at 13:12 +0800, Xiubo Li wrote:
> > On 10/26/21 4:01 AM, Jeff Layton wrote:
> > > On Wed, 2021-10-20 at 21:28 +0800, xiubli@redhat.com wrote:
> > > > From: Xiubo Li <xiubli@redhat.com>
> > > > 
> > > > This will transfer the encrypted last block contents to the MDS
> > > > along with the truncate request only when new size is smaller and
> > > > not aligned to the fscrypt BLOCK size.
> > > > 
> > > > Signed-off-by: Xiubo Li <xiubli@redhat.com>
> > > > ---
> > > >   fs/ceph/caps.c  |   9 +--
> > > >   fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------
> > > >   2 files changed, 190 insertions(+), 29 deletions(-)
> > > > 
> > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > index 4e2a588465c5..1a36f0870d89 100644
> > > > --- a/fs/ceph/caps.c
> > > > +++ b/fs/ceph/caps.c
> > ...
> > > > +fill_last_block:
> > > > +	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
> > > > +	if (!pagelist)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	/* Insert the header first */
> > > > +	header.ver = 1;
> > > > +	header.compat = 1;
> > > > +	/* sizeof(file_offset) + sizeof(block_size) + blen */
> > > > +	header.data_len = cpu_to_le32(8 + 8 + CEPH_FSCRYPT_BLOCK_SIZE);
> > > > +	header.file_offset = cpu_to_le64(orig_pos);
> > > > +	if (fill_header_only) {
> > > > +		header.file_offset = cpu_to_le64(0);
> > > > +		header.block_size = cpu_to_le64(0);
> > > > +	} else {
> > > > +		header.file_offset = cpu_to_le64(orig_pos);
> > > > +		header.block_size = cpu_to_le64(CEPH_FSCRYPT_BLOCK_SIZE);
> > > > +	}
> > > > +	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
> > > > +	if (ret)
> > > > +		goto out;
> > > > 
> > > > 
> > > Note that you're doing a read/modify/write cycle, and you must ensure
> > > that the object remains consistent between the read and write or you may
> > > end up with data corruption. This means that you probably need to
> > > transmit an object version as part of the write. See this patch in the
> > > stack:
> > > 
> > >      libceph: add CEPH_OSD_OP_ASSERT_VER support
> > > 
> > > That op tells the OSD to stop processing the request if the version is
> > > wrong.
> > > 
> > > You'll want to grab the "ver" from the __ceph_sync_read call, and then
> > > send that along with the updated last block. Then, when the MDS is
> > > truncating, it can use a CEPH_OSD_OP_ASSERT_VER op with that version to
> > > ensure that the object hasn't changed when doing it. If the assertion
> > > trips, then the MDS should send back EAGAIN or something similar to the
> > > client to tell it to retry.
> > > 
> > > It's also possible (though I've never used it) to make an OSD request
> > > assert that the contents of an extent haven't changed, so you could
> > > instead send along the old contents along with the new ones, etc.
> > > 
> > > That may end up being more efficient if the object is getting hammered
> > > with I/O in other fscrypt blocks within the same object. It may be worth
> > > exploring that avenue as well.
> > 
> > Hi Jeff,
> > 
> > One questions about this:
> > 
> > Should we consider that the FSCRYPT BLOCK will cross two different Rados 
> > objects ? As default the Rados object size is 4MB.
> > 
> > In case the FSCRYPT BLOCK size is 4K, when the object size is 3K or 5K ?
> > 
> > Or the object size should always be multiple of FSCRYPT BLOCK size ?
> > 
> 
> The danger here is that it's very hard to ensure atomicity in RADOS
> across two different objects. If your crypto blocks can span objects,
> then you can end up with torn writes, and a torn write on a crypto block
> turns it into garbage.
> 
> So, I think we want to forbid:
> 
> 1/ custom file layouts on encrypted files, to ensure that we don't end
> up with weird object sizes. Luis' patch from August does this, but I
> think we might want the MDS to also vet this.
> 
> 2/ block sizes larger than the object size

I believe that object sizes have a minimum of 65k, defined by
CEPH_MIN_STRIPE_UNIT.  So, maybe I'm oversimplifying, but I think we just
need to ensure (a build-time check?) that

  CEPH_FSCRYPT_BLOCK_SIZE <= CEPH_MIN_STRIPE_UNIT

Cheers,
--
Luís

> 3/ non-power-of-two crypto block sizes (so no 3k or 5k blocks, but you
> could do 1k, 2k, 4k, 8k, etc...)
> 
> ...with that we should be able to ensure that they never span objects.
> Eventually we may be able to relax some of these constraints, but I
> don't think most users will have a problem with these constraints.
> 
> -- 
> Jeff Layton <jlayton@kernel.org>
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt
  2021-10-27 15:06         ` Luís Henriques
@ 2021-10-27 23:08           ` Xiubo Li
  0 siblings, 0 replies; 17+ messages in thread
From: Xiubo Li @ 2021-10-27 23:08 UTC (permalink / raw)
  To: Luís Henriques, Jeff Layton
  Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel


On 10/27/21 11:06 PM, Luís Henriques wrote:
> On Wed, Oct 27, 2021 at 08:17:55AM -0400, Jeff Layton wrote:
>> On Wed, 2021-10-27 at 13:12 +0800, Xiubo Li wrote:
>>> On 10/26/21 4:01 AM, Jeff Layton wrote:
>>>> On Wed, 2021-10-20 at 21:28 +0800, xiubli@redhat.com wrote:
>>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>>
>>>>> This will transfer the encrypted last block contents to the MDS
>>>>> along with the truncate request only when new size is smaller and
>>>>> not aligned to the fscrypt BLOCK size.
>>>>>
>>>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>>>> ---
>>>>>    fs/ceph/caps.c  |   9 +--
>>>>>    fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------
>>>>>    2 files changed, 190 insertions(+), 29 deletions(-)
>>>>>
>>>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>>>> index 4e2a588465c5..1a36f0870d89 100644
>>>>> --- a/fs/ceph/caps.c
>>>>> +++ b/fs/ceph/caps.c
>>> ...
>>>>> +fill_last_block:
>>>>> +	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
>>>>> +	if (!pagelist)
>>>>> +		return -ENOMEM;
>>>>> +
>>>>> +	/* Insert the header first */
>>>>> +	header.ver = 1;
>>>>> +	header.compat = 1;
>>>>> +	/* sizeof(file_offset) + sizeof(block_size) + blen */
>>>>> +	header.data_len = cpu_to_le32(8 + 8 + CEPH_FSCRYPT_BLOCK_SIZE);
>>>>> +	header.file_offset = cpu_to_le64(orig_pos);
>>>>> +	if (fill_header_only) {
>>>>> +		header.file_offset = cpu_to_le64(0);
>>>>> +		header.block_size = cpu_to_le64(0);
>>>>> +	} else {
>>>>> +		header.file_offset = cpu_to_le64(orig_pos);
>>>>> +		header.block_size = cpu_to_le64(CEPH_FSCRYPT_BLOCK_SIZE);
>>>>> +	}
>>>>> +	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
>>>>> +	if (ret)
>>>>> +		goto out;
>>>>>
>>>>>
>>>> Note that you're doing a read/modify/write cycle, and you must ensure
>>>> that the object remains consistent between the read and write or you may
>>>> end up with data corruption. This means that you probably need to
>>>> transmit an object version as part of the write. See this patch in the
>>>> stack:
>>>>
>>>>       libceph: add CEPH_OSD_OP_ASSERT_VER support
>>>>
>>>> That op tells the OSD to stop processing the request if the version is
>>>> wrong.
>>>>
>>>> You'll want to grab the "ver" from the __ceph_sync_read call, and then
>>>> send that along with the updated last block. Then, when the MDS is
>>>> truncating, it can use a CEPH_OSD_OP_ASSERT_VER op with that version to
>>>> ensure that the object hasn't changed when doing it. If the assertion
>>>> trips, then the MDS should send back EAGAIN or something similar to the
>>>> client to tell it to retry.
>>>>
>>>> It's also possible (though I've never used it) to make an OSD request
>>>> assert that the contents of an extent haven't changed, so you could
>>>> instead send along the old contents along with the new ones, etc.
>>>>
>>>> That may end up being more efficient if the object is getting hammered
>>>> with I/O in other fscrypt blocks within the same object. It may be worth
>>>> exploring that avenue as well.
>>> Hi Jeff,
>>>
>>> One questions about this:
>>>
>>> Should we consider that the FSCRYPT BLOCK will cross two different Rados
>>> objects ? As default the Rados object size is 4MB.
>>>
>>> In case the FSCRYPT BLOCK size is 4K, when the object size is 3K or 5K ?
>>>
>>> Or the object size should always be multiple of FSCRYPT BLOCK size ?
>>>
>> The danger here is that it's very hard to ensure atomicity in RADOS
>> across two different objects. If your crypto blocks can span objects,
>> then you can end up with torn writes, and a torn write on a crypto block
>> turns it into garbage.
>>
>> So, I think we want to forbid:
>>
>> 1/ custom file layouts on encrypted files, to ensure that we don't end
>> up with weird object sizes. Luis' patch from August does this, but I
>> think we might want the MDS to also vet this.
>>
>> 2/ block sizes larger than the object size
> I believe that object sizes have a minimum of 65k, defined by
> CEPH_MIN_STRIPE_UNIT.

It's 64KB.

> So, maybe I'm oversimplifying, but I think we just
> need to ensure (a build-time check?) that
>
>    CEPH_FSCRYPT_BLOCK_SIZE <= CEPH_MIN_STRIPE_UNIT

I think you are right, and IMO this makes sense.

Thanks

-- Xiubo

> Cheers,
> --
> Luís
>
>> 3/ non-power-of-two crypto block sizes (so no 3k or 5k blocks, but you
>> could do 1k, 2k, 4k, 8k, etc...)
>>
>> ...with that we should be able to ensure that they never span objects.
>> Eventually we may be able to relax some of these constraints, but I
>> don't think most users will have a problem with these constraints.
>>
>> -- 
>> Jeff Layton <jlayton@kernel.org>
>>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt
  2021-10-26  3:41     ` Xiubo Li
@ 2021-10-27 23:23       ` Xiubo Li
  0 siblings, 0 replies; 17+ messages in thread
From: Xiubo Li @ 2021-10-27 23:23 UTC (permalink / raw)
  To: Jeff Layton; +Cc: idryomov, vshankar, khiremat, pdonnell, ceph-devel


On 10/26/21 11:41 AM, Xiubo Li wrote:
>
> On 10/26/21 4:01 AM, Jeff Layton wrote:
>> On Wed, 2021-10-20 at 21:28 +0800, xiubli@redhat.com wrote:
>>> From: Xiubo Li <xiubli@redhat.com>
>>>
>>> This will transfer the encrypted last block contents to the MDS
>>> along with the truncate request only when new size is smaller and
>>> not aligned to the fscrypt BLOCK size.
>>>
>>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>>> ---
>>>   fs/ceph/caps.c  |   9 +--
>>>   fs/ceph/inode.c | 210 
>>> ++++++++++++++++++++++++++++++++++++++++++------
>>>   2 files changed, 190 insertions(+), 29 deletions(-)
>>>
...
>>> +fill_last_block:
>>> +    pagelist = ceph_pagelist_alloc(GFP_KERNEL);
>>> +    if (!pagelist)
>>> +        return -ENOMEM;
>>> +
>>> +    /* Insert the header first */
>>> +    header.ver = 1;
>>> +    header.compat = 1;
>>> +    /* sizeof(file_offset) + sizeof(block_size) + blen */
>>> +    header.data_len = cpu_to_le32(8 + 8 + CEPH_FSCRYPT_BLOCK_SIZE);
>>> +    header.file_offset = cpu_to_le64(orig_pos);
>>> +    if (fill_header_only) {
>>> +        header.file_offset = cpu_to_le64(0);
>>> +        header.block_size = cpu_to_le64(0);
>>> +    } else {
>>> +        header.file_offset = cpu_to_le64(orig_pos);
>>> +        header.block_size = cpu_to_le64(CEPH_FSCRYPT_BLOCK_SIZE);
>>> +    }
>>> +    ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
>>> +    if (ret)
>>> +        goto out;
>>>
>>>
>> Note that you're doing a read/modify/write cycle, and you must ensure
>> that the object remains consistent between the read and write or you may
>> end up with data corruption. This means that you probably need to
>> transmit an object version as part of the write. See this patch in the
>> stack:
>>
>>      libceph: add CEPH_OSD_OP_ASSERT_VER support
>>
>> That op tells the OSD to stop processing the request if the version is
>> wrong.
>>
>> You'll want to grab the "ver" from the __ceph_sync_read call, and then
>> send that along with the updated last block. Then, when the MDS is
>> truncating, it can use a CEPH_OSD_OP_ASSERT_VER op with that version to
>> ensure that the object hasn't changed when doing it. If the assertion
>> trips, then the MDS should send back EAGAIN or something similar to the
>> client to tell it to retry.
>
> Yeah, this is one issues I have mentioned in the cover-letter, I found 
> the cover-letter wasn't successfully sent to the mail list.
>
> Except this, there also has another isssue, I will pasted some 
> important sections here from the cover-letter:
>
> ======
>
> This approach is based on the discussion from V1, which will pass
> the encrypted last block contents to MDS along with the truncate
> request.
>
> This will send the encrypted last block contents to MDS along with
> the truncate request when truncating to a smaller size and at the
> same time new size does not align to BLOCK SIZE.
>
> The MDS side patch is raised in PR
> https://github.com/ceph/ceph/pull/43588, which is also based Jeff's
> previous great work in PR https://github.com/ceph/ceph/pull/41284.
>
> The MDS will use the filer.write_trunc(), which could update and
> truncate the file in one shot, instead of filer.truncate().
>
> I have removed the inline data related code since we are remove
> this feature, more detail please see:
> https://tracker.ceph.com/issues/52916
>
>
> Note: There still has two CORNER cases we need to deal with:
>
> 1), If a truncate request with the last block is sent to the MDS and
> just before the MDS has acquired the xlock for FILE lock, if another
> client has updated that last block content, we will over write the
> last block with old data.
>
> For this case we could send the old encrypted last block data along
> with the truncate request and in MDS side read it and then do compare
> just before updating it, if the comparasion fails, then fail the
> truncate and let the kclient retry it.
>
> 2), If another client has buffered the last block, we should flush
> it first. I am still thinking how to do this ? Any idea is welcome.
>
Checked more about the MDS locker related code, once we get the Fr caps 
in the kclient which is doing the truncate, the Fb caps will be revoked 
by the MDS already. It's not allowed that one client hold Fr cap, and 
another client will hold the Fb. So we don't worry about the last block 
will still be buffered by another kclient once we get the Fr caps.

But it's possible that if currently kclient is a loner, it could have 
already buffered the last block in pagecache, since we need the 
assert_ver from the Rados for the truncate, so we couldn't get the 
contents from the pagecache and we need to flush the pagecache first and 
read it again from Rados.

BRs

-- Xiubo


> ======
>
> For the first issue, I will check the "CEPH_OSD_OP_ASSERT_VER".
>
> For the second issue, I think the "CEPH_OSD_OP_ASSERT_VER" could work 
> too. When the MDS has successfully xlocked the FILE lock, we could 
> notify the kclients to flush the buffer first, and then try to 
> write/truncate the file, if it fails due to the 
> "CEPH_OSD_OP_ASSERT_VER" check, then let kclient retry the truncate 
> request.
>
> NOTE: All the xlock and xlock related interim states allowed the 
> clients continue to hold the Fcb caps which are issued from previous 
> states, except that the LOCK_XLOCKSNAP only allow clients to hold Fc.
>
>
>>
>> It's also possible (though I've never used it) to make an OSD request
>> assert that the contents of an extent haven't changed, so you could
>> instead send along the old contents along with the new ones, etc.
>>
>> That may end up being more efficient if the object is getting hammered
>> with I/O in other fscrypt blocks within the same object. It may be worth
>> exploring that avenue as well.
>>
>>> +
>>> +    if (!fill_header_only) {
>>> +        /* Append the last block contents to pagelist */
>>> +        for (i = 0; i < num_pages; i++) {
>>> +            ret = ceph_pagelist_append(pagelist, iovs[i].iov_base,
>>> +                           blen);
>>> +            if (ret)
>>> +                goto out;
>>> +        }
>>> +    }
>>> +    req->r_pagelist = pagelist;
>>> +out:
>>> +    dout("%s %p size dropping cap refs on %s\n", __func__,
>>> +         inode, ceph_cap_string(got));
>>> +    for (i = 0; iovs && i < num_pages; i++)
>>> +        kunmap_local(iovs[i].iov_base);
>>> +    kfree(iovs);
>>> +    if (pages)
>>> +        ceph_release_page_vector(pages, num_pages);
>>> +    if (ret && pagelist)
>>> +        ceph_pagelist_release(pagelist);
>>> +    ceph_put_cap_refs(ci, got);
>>> +    return ret;
>>> +}
>>> +
>>>   int __ceph_setattr(struct inode *inode, struct iattr *attr, struct 
>>> ceph_iattr *cia)
>>>   {
>>>       struct ceph_inode_info *ci = ceph_inode(inode);
>>> @@ -2236,6 +2391,7 @@ int __ceph_setattr(struct inode *inode, struct 
>>> iattr *attr, struct ceph_iattr *c
>>>       struct ceph_mds_request *req;
>>>       struct ceph_mds_client *mdsc = 
>>> ceph_sb_to_client(inode->i_sb)->mdsc;
>>>       struct ceph_cap_flush *prealloc_cf;
>>> +    loff_t isize = i_size_read(inode);
>>>       int issued;
>>>       int release = 0, dirtied = 0;
>>>       int mask = 0;
>>> @@ -2367,10 +2523,31 @@ int __ceph_setattr(struct inode *inode, 
>>> struct iattr *attr, struct ceph_iattr *c
>>>           }
>>>       }
>>>       if (ia_valid & ATTR_SIZE) {
>>> -        loff_t isize = i_size_read(inode);
>>> -
>>>           dout("setattr %p size %lld -> %lld\n", inode, isize, 
>>> attr->ia_size);
>>> -        if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
>>> +        /*
>>> +         * Only when the new size is smaller and not aligned to
>>> +         * CEPH_FSCRYPT_BLOCK_SIZE will the RMW is needed.
>>> +         */
>>> +        if (IS_ENCRYPTED(inode) && attr->ia_size < isize &&
>>> +            (attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE)) {
>>> +            mask |= CEPH_SETATTR_SIZE;
>>> +            release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
>>> +                   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
>>> +            set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
>>> +            mask |= CEPH_SETATTR_FSCRYPT_FILE;
>>> +            req->r_args.setattr.size =
>>> +                cpu_to_le64(round_up(attr->ia_size,
>>> +                             CEPH_FSCRYPT_BLOCK_SIZE));
>>> +            req->r_args.setattr.old_size =
>>> +                cpu_to_le64(round_up(isize,
>>> +                             CEPH_FSCRYPT_BLOCK_SIZE));
>>> +            req->r_fscrypt_file = attr->ia_size;
>>> +            spin_unlock(&ci->i_ceph_lock);
>>> +            err = fill_request_for_fscrypt(inode, req, attr);
>> It'd be better to just defer the call to fill_request_for_fscrypt until
>> after you've dropped the spinlock, later in the function.
>>
> Sound good and will do it.
>
>
>>> + spin_lock(&ci->i_ceph_lock);
>>> +            if (err)
>>> +                goto out;
>>> +        } else if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size 
>>> >= isize) {
>>>               if (attr->ia_size > isize) {
>>>                   i_size_write(inode, attr->ia_size);
>>>                   inode->i_blocks = calc_inode_blocks(attr->ia_size);
>>> @@ -2382,23 +2559,10 @@ int __ceph_setattr(struct inode *inode, 
>>> struct iattr *attr, struct ceph_iattr *c
>>>                  attr->ia_size != isize) {
>>>               mask |= CEPH_SETATTR_SIZE;
>>>               release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
>>> -                   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
>>> -            if (IS_ENCRYPTED(inode)) {
>>> -                set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
>>> -                mask |= CEPH_SETATTR_FSCRYPT_FILE;
>>> -                req->r_args.setattr.size =
>>> -                    cpu_to_le64(round_up(attr->ia_size,
>>> -                                 CEPH_FSCRYPT_BLOCK_SIZE));
>>> -                req->r_args.setattr.old_size =
>>> -                    cpu_to_le64(round_up(isize,
>>> -                                 CEPH_FSCRYPT_BLOCK_SIZE));
>>> -                req->r_fscrypt_file = attr->ia_size;
>>> -                /* FIXME: client must zero out any partial blocks! */
>>> -            } else {
>>> -                req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>>> -                req->r_args.setattr.old_size = cpu_to_le64(isize);
>>> -                req->r_fscrypt_file = 0;
>>> -            }
>>> +                       CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
>>> +            req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>>> +            req->r_args.setattr.old_size = cpu_to_le64(isize);
>>> +            req->r_fscrypt_file = 0;
>>>           }
>>>       }
>>>       if (ia_valid & ATTR_MTIME) {


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-10-27 23:23 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-20 13:28 [PATCH v2 0/4] ceph: size handling for the fscrypt xiubli
2021-10-20 13:28 ` [PATCH v2 1/4] ceph: add __ceph_get_caps helper support xiubli
2021-10-20 13:28 ` [PATCH v2 2/4] ceph: add __ceph_sync_read " xiubli
2021-10-20 13:28 ` [PATCH v2 3/4] ceph: return the real size readed when hit EOF xiubli
2021-10-25 19:05   ` Jeff Layton
2021-10-26  3:12     ` Xiubo Li
2021-10-20 13:28 ` [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt xiubli
2021-10-25 20:01   ` Jeff Layton
2021-10-26  3:41     ` Xiubo Li
2021-10-27 23:23       ` Xiubo Li
2021-10-27  5:12     ` Xiubo Li
2021-10-27 12:17       ` Jeff Layton
2021-10-27 13:57         ` Xiubo Li
2021-10-27 15:06         ` Luís Henriques
2021-10-27 23:08           ` Xiubo Li
2021-10-20 15:32 ` [PATCH v2 0/4] ceph: size handling for the fscrypt Jeff Layton
2021-10-25 20:13 ` Jeff Layton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.