* [RFC PATCH v2 00/10] ceph: asynchronous file create support
@ 2020-01-15 20:59 Jeff Layton
  2020-01-15 20:59 ` [RFC PATCH v2 01/10] libceph: export ceph_file_layout_is_valid Jeff Layton
                   ` (13 more replies)
  0 siblings, 14 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-15 20:59 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell, xiubli

v2:
- move cached layout to dedicated field in inode
- protect cached layout with i_ceph_lock
- wipe cached layout in __check_cap_issue
- set max_size of file to layout.stripe_unit
- set truncate_size to (u64)-1
- use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
- set cap_id to 1 in async created inode
- allocate inode number before submitting request
- rework the prep for an async create to be more efficient
- don't allow MDS or cap messages involving an inode until we get async
  create reply

A lot of changes in this set, mostly based on Zheng and Xiubo's
comments. Performance is pretty similar to the previous set:

Untarring a kernel tarball into a cephfs takes about 98s with
async dirops disabled. With them enabled, it takes around 78s,
roughly a 20% reduction in wall time (about a 25% speedup).

This is not quite ready for merge. Error handling could still be
improved. With xfstest generic/531, I see some messages like this pop
up in the ring buffer:

    [ 7331.393110] ceph: ceph_async_create_cb: inode number mismatch! err=0 deleg_ino=0x100001232d9 target=0x100001232b9

Basically, we went to do an async create and got a different inode
number back than expected. That still needs investigation, but I
didn't see any test failures due to it.

Jeff Layton (10):
  libceph: export ceph_file_layout_is_valid
  ceph: make ceph_fill_inode non-static
  ceph: make dentry_lease_is_valid non-static
  ceph: make __take_cap_refs a public function
  ceph: decode interval_sets for delegated inos
  ceph: add flag to designate that a request is asynchronous
  ceph: add infrastructure for waiting for async create to complete
  ceph: add new MDS req field to hold delegated inode number
  ceph: cache layout in parent dir on first sync create
  ceph: attempt to do async create when possible

 fs/ceph/caps.c               |  34 ++++--
 fs/ceph/dir.c                |  13 ++-
 fs/ceph/file.c               | 218 +++++++++++++++++++++++++++++++++--
 fs/ceph/inode.c              |  50 ++++----
 fs/ceph/mds_client.c         | 126 ++++++++++++++++++--
 fs/ceph/mds_client.h         |   9 +-
 fs/ceph/super.h              |  16 ++-
 include/linux/ceph/ceph_fs.h |   8 +-
 net/ceph/ceph_fs.c           |   1 +
 9 files changed, 410 insertions(+), 65 deletions(-)

-- 
2.24.1

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC PATCH v2 01/10] libceph: export ceph_file_layout_is_valid
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
@ 2020-01-15 20:59 ` Jeff Layton
  2020-01-15 20:59 ` [RFC PATCH v2 02/10] ceph: make ceph_fill_inode non-static Jeff Layton
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-15 20:59 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell, xiubli

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 net/ceph/ceph_fs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ceph/ceph_fs.c b/net/ceph/ceph_fs.c
index 756a2dc10d27..11a2e3c61b04 100644
--- a/net/ceph/ceph_fs.c
+++ b/net/ceph/ceph_fs.c
@@ -27,6 +27,7 @@ int ceph_file_layout_is_valid(const struct ceph_file_layout *layout)
 		return 0;
 	return 1;
 }
+EXPORT_SYMBOL(ceph_file_layout_is_valid);
 
 void ceph_file_layout_from_legacy(struct ceph_file_layout *fl,
 				  struct ceph_file_layout_legacy *legacy)
-- 
2.24.1

* [RFC PATCH v2 02/10] ceph: make ceph_fill_inode non-static
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
  2020-01-15 20:59 ` [RFC PATCH v2 01/10] libceph: export ceph_file_layout_is_valid Jeff Layton
@ 2020-01-15 20:59 ` Jeff Layton
  2020-01-15 20:59 ` [RFC PATCH v2 03/10] ceph: make dentry_lease_is_valid non-static Jeff Layton
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-15 20:59 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell, xiubli

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c | 47 ++++++++++++++++++++++++-----------------------
 fs/ceph/super.h |  8 ++++++++
 2 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index aee7a24bf1bc..79bb1e6af090 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -728,11 +728,11 @@ void ceph_fill_file_time(struct inode *inode, int issued,
  * Populate an inode based on info from mds.  May be called on new or
  * existing inodes.
  */
-static int fill_inode(struct inode *inode, struct page *locked_page,
-		      struct ceph_mds_reply_info_in *iinfo,
-		      struct ceph_mds_reply_dirfrag *dirinfo,
-		      struct ceph_mds_session *session, int cap_fmode,
-		      struct ceph_cap_reservation *caps_reservation)
+int ceph_fill_inode(struct inode *inode, struct page *locked_page,
+		    struct ceph_mds_reply_info_in *iinfo,
+		    struct ceph_mds_reply_dirfrag *dirinfo,
+		    struct ceph_mds_session *session, int cap_fmode,
+		    struct ceph_cap_reservation *caps_reservation)
 {
 	struct ceph_mds_client *mdsc = ceph_inode_to_client(inode)->mdsc;
 	struct ceph_mds_reply_inode *info = iinfo->in;
@@ -749,7 +749,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	bool new_version = false;
 	bool fill_inline = false;
 
-	dout("fill_inode %p ino %llx.%llx v %llu had %llu\n",
+	dout("%s %p ino %llx.%llx v %llu had %llu\n", __func__,
 	     inode, ceph_vinop(inode), le64_to_cpu(info->version),
 	     ci->i_version);
 
@@ -770,7 +770,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	if (iinfo->xattr_len > 4) {
 		xattr_blob = ceph_buffer_new(iinfo->xattr_len, GFP_NOFS);
 		if (!xattr_blob)
-			pr_err("fill_inode ENOMEM xattr blob %d bytes\n",
+			pr_err("%s ENOMEM xattr blob %d bytes\n", __func__,
 			       iinfo->xattr_len);
 	}
 
@@ -933,8 +933,9 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 			spin_unlock(&ci->i_ceph_lock);
 
 			if (symlen != i_size_read(inode)) {
-				pr_err("fill_inode %llx.%llx BAD symlink "
-					"size %lld\n", ceph_vinop(inode),
+				pr_err("%s %llx.%llx BAD symlink "
+					"size %lld\n", __func__,
+					ceph_vinop(inode),
 					i_size_read(inode));
 				i_size_write(inode, symlen);
 				inode->i_blocks = calc_inode_blocks(symlen);
@@ -958,7 +959,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 		inode->i_fop = &ceph_dir_fops;
 		break;
 	default:
-		pr_err("fill_inode %llx.%llx BAD mode 0%o\n",
+		pr_err("%s %llx.%llx BAD mode 0%o\n", __func__,
 		       ceph_vinop(inode), inode->i_mode);
 	}
 
@@ -1246,10 +1247,9 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 		struct inode *dir = req->r_parent;
 
 		if (dir) {
-			err = fill_inode(dir, NULL,
-					 &rinfo->diri, rinfo->dirfrag,
-					 session, -1,
-					 &req->r_caps_reservation);
+			err = ceph_fill_inode(dir, NULL, &rinfo->diri,
+					      rinfo->dirfrag, session, -1,
+					      &req->r_caps_reservation);
 			if (err < 0)
 				goto done;
 		} else {
@@ -1314,13 +1314,13 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 			goto done;
 		}
 
-		err = fill_inode(in, req->r_locked_page, &rinfo->targeti, NULL,
-				session,
+		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
+				NULL, session,
 				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
 				 rinfo->head->result == 0) ?  req->r_fmode : -1,
 				&req->r_caps_reservation);
 		if (err < 0) {
-			pr_err("fill_inode badness %p %llx.%llx\n",
+			pr_err("ceph_fill_inode badness %p %llx.%llx\n",
 				in, ceph_vinop(in));
 			if (in->i_state & I_NEW)
 				discard_new_inode(in);
@@ -1507,10 +1507,11 @@ static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req,
 			dout("new_inode badness got %d\n", err);
 			continue;
 		}
-		rc = fill_inode(in, NULL, &rde->inode, NULL, session,
-				-1, &req->r_caps_reservation);
+		rc = ceph_fill_inode(in, NULL, &rde->inode, NULL, session,
+				     -1, &req->r_caps_reservation);
 		if (rc < 0) {
-			pr_err("fill_inode badness on %p got %d\n", in, rc);
+			pr_err("ceph_fill_inode badness on %p got %d\n",
+			       in, rc);
 			err = rc;
 			if (in->i_state & I_NEW) {
 				ihold(in);
@@ -1714,10 +1715,10 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 			}
 		}
 
-		ret = fill_inode(in, NULL, &rde->inode, NULL, session,
-				 -1, &req->r_caps_reservation);
+		ret = ceph_fill_inode(in, NULL, &rde->inode, NULL, session,
+				      -1, &req->r_caps_reservation);
 		if (ret < 0) {
-			pr_err("fill_inode badness on %p\n", in);
+			pr_err("ceph_fill_inode badness on %p\n", in);
 			if (d_really_is_negative(dn)) {
 				/* avoid calling iput_final() in mds
 				 * dispatch threads */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 3ef17dd6491e..ec4d66d7c261 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -893,6 +893,9 @@ static inline bool __ceph_have_pending_cap_snap(struct ceph_inode_info *ci)
 }
 
 /* inode.c */
+struct ceph_mds_reply_info_in;
+struct ceph_mds_reply_dirfrag;
+
 extern const struct inode_operations ceph_file_iops;
 
 extern struct inode *ceph_alloc_inode(struct super_block *sb);
@@ -908,6 +911,11 @@ extern void ceph_fill_file_time(struct inode *inode, int issued,
 				u64 time_warp_seq, struct timespec64 *ctime,
 				struct timespec64 *mtime,
 				struct timespec64 *atime);
+extern int ceph_fill_inode(struct inode *inode, struct page *locked_page,
+		    struct ceph_mds_reply_info_in *iinfo,
+		    struct ceph_mds_reply_dirfrag *dirinfo,
+		    struct ceph_mds_session *session, int cap_fmode,
+		    struct ceph_cap_reservation *caps_reservation);
 extern int ceph_fill_trace(struct super_block *sb,
 			   struct ceph_mds_request *req);
 extern int ceph_readdir_prepopulate(struct ceph_mds_request *req,
-- 
2.24.1

* [RFC PATCH v2 03/10] ceph: make dentry_lease_is_valid non-static
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
  2020-01-15 20:59 ` [RFC PATCH v2 01/10] libceph: export ceph_file_layout_is_valid Jeff Layton
  2020-01-15 20:59 ` [RFC PATCH v2 02/10] ceph: make ceph_fill_inode non-static Jeff Layton
@ 2020-01-15 20:59 ` Jeff Layton
  2020-01-15 20:59 ` [RFC PATCH v2 04/10] ceph: make __take_cap_refs a public function Jeff Layton
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-15 20:59 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell, xiubli

...and move a comment so that it sits above the function it actually
describes.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/dir.c   | 10 +++++-----
 fs/ceph/super.h |  1 +
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 10294f07f5f0..9d2eca67985a 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1477,10 +1477,6 @@ void ceph_invalidate_dentry_lease(struct dentry *dentry)
 	spin_unlock(&dentry->d_lock);
 }
 
-/*
- * Check if dentry lease is valid.  If not, delete the lease.  Try to
- * renew if the least is more than half up.
- */
 static bool __dentry_lease_is_valid(struct ceph_dentry_info *di)
 {
 	struct ceph_mds_session *session;
@@ -1507,7 +1503,11 @@ static bool __dentry_lease_is_valid(struct ceph_dentry_info *di)
 	return false;
 }
 
-static int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags)
+/*
+ * Check if dentry lease is valid.  If not, delete the lease.  Try to
+ * renew if the least is more than half up.
+ */
+int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags)
 {
 	struct ceph_dentry_info *di;
 	struct ceph_mds_session *session = NULL;
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index ec4d66d7c261..f27b2bf9a3f5 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1121,6 +1121,7 @@ extern int ceph_handle_snapdir(struct ceph_mds_request *req,
 extern struct dentry *ceph_finish_lookup(struct ceph_mds_request *req,
 					 struct dentry *dentry, int err);
 
+extern int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags);
 extern void __ceph_dentry_lease_touch(struct ceph_dentry_info *di);
 extern void __ceph_dentry_dir_lease_touch(struct ceph_dentry_info *di);
 extern void ceph_invalidate_dentry_lease(struct dentry *dentry);
-- 
2.24.1

* [RFC PATCH v2 04/10] ceph: make __take_cap_refs a public function
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (2 preceding siblings ...)
  2020-01-15 20:59 ` [RFC PATCH v2 03/10] ceph: make dentry_lease_is_valid non-static Jeff Layton
@ 2020-01-15 20:59 ` Jeff Layton
  2020-01-15 20:59 ` [RFC PATCH v2 05/10] ceph: decode interval_sets for delegated inos Jeff Layton
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-15 20:59 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell, xiubli

Rename it to ceph_take_cap_refs and make it available to other files.
Also replace a comment with a lockdep assertion.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c  | 12 ++++++------
 fs/ceph/super.h |  2 ++
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 7fc87b693ba4..c983990acb75 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2512,12 +2512,12 @@ static void kick_flushing_inode_caps(struct ceph_mds_client *mdsc,
 /*
  * Take references to capabilities we hold, so that we don't release
  * them to the MDS prematurely.
- *
- * Protected by i_ceph_lock.
  */
-static void __take_cap_refs(struct ceph_inode_info *ci, int got,
+void ceph_take_cap_refs(struct ceph_inode_info *ci, int got,
 			    bool snap_rwsem_locked)
 {
+	lockdep_assert_held(&ci->i_ceph_lock);
+
 	if (got & CEPH_CAP_PIN)
 		ci->i_pin_ref++;
 	if (got & CEPH_CAP_FILE_RD)
@@ -2538,7 +2538,7 @@ static void __take_cap_refs(struct ceph_inode_info *ci, int got,
 		if (ci->i_wb_ref == 0)
 			ihold(&ci->vfs_inode);
 		ci->i_wb_ref++;
-		dout("__take_cap_refs %p wb %d -> %d (?)\n",
+		dout("%s %p wb %d -> %d (?)\n", __func__,
 		     &ci->vfs_inode, ci->i_wb_ref-1, ci->i_wb_ref);
 	}
 }
@@ -2664,7 +2664,7 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
 			    (need & CEPH_CAP_FILE_RD) &&
 			    !(*got & CEPH_CAP_FILE_CACHE))
 				ceph_disable_fscache_readpage(ci);
-			__take_cap_refs(ci, *got, true);
+			ceph_take_cap_refs(ci, *got, true);
 			ret = 1;
 		}
 	} else {
@@ -2896,7 +2896,7 @@ int ceph_get_caps(struct file *filp, int need, int want,
 void ceph_get_cap_refs(struct ceph_inode_info *ci, int caps)
 {
 	spin_lock(&ci->i_ceph_lock);
-	__take_cap_refs(ci, caps, false);
+	ceph_take_cap_refs(ci, caps, false);
 	spin_unlock(&ci->i_ceph_lock);
 }
 
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index f27b2bf9a3f5..cb7b549f0995 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1062,6 +1062,8 @@ extern void ceph_kick_flushing_caps(struct ceph_mds_client *mdsc,
 				    struct ceph_mds_session *session);
 extern struct ceph_cap *ceph_get_cap_for_mds(struct ceph_inode_info *ci,
 					     int mds);
+extern void ceph_take_cap_refs(struct ceph_inode_info *ci, int caps,
+				bool snap_rwsem_locked);
 extern void ceph_get_cap_refs(struct ceph_inode_info *ci, int caps);
 extern void ceph_put_cap_refs(struct ceph_inode_info *ci, int had);
 extern void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
-- 
2.24.1

* [RFC PATCH v2 05/10] ceph: decode interval_sets for delegated inos
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (3 preceding siblings ...)
  2020-01-15 20:59 ` [RFC PATCH v2 04/10] ceph: make __take_cap_refs a public function Jeff Layton
@ 2020-01-15 20:59 ` Jeff Layton
  2020-01-16 14:32   ` Yan, Zheng
  2020-01-15 20:59 ` [RFC PATCH v2 06/10] ceph: add flag to designate that a request is asynchronous Jeff Layton
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2020-01-15 20:59 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell, xiubli

Starting in Octopus, the MDS will hand out caps that allow the client
to do asynchronous file creates under certain conditions. As part of
that, the MDS will delegate ranges of inode numbers to the client.

Add the infrastructure to decode these ranges, and stuff them into an
xarray for later consumption by the async creation code.

Because the xarray code currently only handles unsigned long indexes,
and those are 32 bits on 32-bit arches, we only enable the decoding when
running on a 64-bit arch.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/mds_client.c | 109 +++++++++++++++++++++++++++++++++++++++----
 fs/ceph/mds_client.h |   7 ++-
 2 files changed, 106 insertions(+), 10 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 8263f75badfc..19bd71eb5733 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -415,21 +415,110 @@ static int parse_reply_info_filelock(void **p, void *end,
 	return -EIO;
 }
 
+
+#if BITS_PER_LONG == 64
+
+#define DELEGATED_INO_AVAILABLE		xa_mk_value(1)
+
+static int ceph_parse_deleg_inos(void **p, void *end,
+				 struct ceph_mds_session *s)
+{
+	u32 sets;
+
+	ceph_decode_32_safe(p, end, sets, bad);
+	dout("got %u sets of delegated inodes\n", sets);
+	while (sets--) {
+		u64 start, len, ino;
+
+		ceph_decode_64_safe(p, end, start, bad);
+		ceph_decode_64_safe(p, end, len, bad);
+		while (len--) {
+			int err = xa_insert(&s->s_delegated_inos, ino = start++,
+					    DELEGATED_INO_AVAILABLE,
+					    GFP_KERNEL);
+			if (!err) {
+				dout("added delegated inode 0x%llx\n",
+				     start - 1);
+			} else if (err == -EBUSY) {
+				pr_warn("ceph: MDS delegated inode 0x%llx more than once.\n",
+					start - 1);
+			} else {
+				return err;
+			}
+		}
+	}
+	return 0;
+bad:
+	return -EIO;
+}
+
+unsigned long ceph_get_deleg_ino(struct ceph_mds_session *s)
+{
+	unsigned long ino;
+	void *val;
+
+	xa_for_each(&s->s_delegated_inos, ino, val) {
+		val = xa_erase(&s->s_delegated_inos, ino);
+		if (val == DELEGATED_INO_AVAILABLE)
+			return ino;
+	}
+	return 0;
+}
+#else /* BITS_PER_LONG == 64 */
+/*
+ * FIXME: xarrays can't handle 64-bit indexes on a 32-bit arch. For now, just
+ * ignore delegated_inos on 32 bit arch. Maybe eventually add xarrays for top
+ * and bottom words?
+ */
+static int ceph_parse_deleg_inos(void **p, void *end,
+				 struct ceph_mds_session *s)
+{
+	u32 sets;
+
+	ceph_decode_32_safe(p, end, sets, bad);
+	if (sets)
+		ceph_decode_skip_n(p, end, sets * 2 * sizeof(__le64), bad);
+	return 0;
+bad:
+	return -EIO;
+}
+
+unsigned long ceph_get_deleg_ino(struct ceph_mds_session *s)
+{
+	return 0;
+}
+#endif /* BITS_PER_LONG == 64 */
+
 /*
  * parse create results
  */
 static int parse_reply_info_create(void **p, void *end,
 				  struct ceph_mds_reply_info_parsed *info,
-				  u64 features)
+				  u64 features, struct ceph_mds_session *s)
 {
+	int ret;
+
 	if (features == (u64)-1 ||
 	    (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {
-		/* Malformed reply? */
 		if (*p == end) {
+			/* Malformed reply? */
 			info->has_create_ino = false;
-		} else {
+		} else if (test_bit(CEPHFS_FEATURE_DELEG_INO, &s->s_features)) {
+			u8 struct_v, struct_compat;
+			u32 len;
+
 			info->has_create_ino = true;
+			ceph_decode_8_safe(p, end, struct_v, bad);
+			ceph_decode_8_safe(p, end, struct_compat, bad);
+			ceph_decode_32_safe(p, end, len, bad);
+			ceph_decode_64_safe(p, end, info->ino, bad);
+			ret = ceph_parse_deleg_inos(p, end, s);
+			if (ret)
+				return ret;
+		} else {
+			/* legacy */
 			ceph_decode_64_safe(p, end, info->ino, bad);
+			info->has_create_ino = true;
 		}
 	} else {
 		if (*p != end)
@@ -448,7 +537,7 @@ static int parse_reply_info_create(void **p, void *end,
  */
 static int parse_reply_info_extra(void **p, void *end,
 				  struct ceph_mds_reply_info_parsed *info,
-				  u64 features)
+				  u64 features, struct ceph_mds_session *s)
 {
 	u32 op = le32_to_cpu(info->head->op);
 
@@ -457,7 +546,7 @@ static int parse_reply_info_extra(void **p, void *end,
 	else if (op == CEPH_MDS_OP_READDIR || op == CEPH_MDS_OP_LSSNAP)
 		return parse_reply_info_readdir(p, end, info, features);
 	else if (op == CEPH_MDS_OP_CREATE)
-		return parse_reply_info_create(p, end, info, features);
+		return parse_reply_info_create(p, end, info, features, s);
 	else
 		return -EIO;
 }
@@ -465,7 +554,7 @@ static int parse_reply_info_extra(void **p, void *end,
 /*
  * parse entire mds reply
  */
-static int parse_reply_info(struct ceph_msg *msg,
+static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
 			    struct ceph_mds_reply_info_parsed *info,
 			    u64 features)
 {
@@ -490,7 +579,7 @@ static int parse_reply_info(struct ceph_msg *msg,
 	ceph_decode_32_safe(&p, end, len, bad);
 	if (len > 0) {
 		ceph_decode_need(&p, end, len, bad);
-		err = parse_reply_info_extra(&p, p+len, info, features);
+		err = parse_reply_info_extra(&p, p+len, info, features, s);
 		if (err < 0)
 			goto out_bad;
 	}
@@ -558,6 +647,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
 	if (refcount_dec_and_test(&s->s_ref)) {
 		if (s->s_auth.authorizer)
 			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
+		xa_destroy(&s->s_delegated_inos);
 		kfree(s);
 	}
 }
@@ -645,6 +735,7 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
 	refcount_set(&s->s_ref, 1);
 	INIT_LIST_HEAD(&s->s_waiting);
 	INIT_LIST_HEAD(&s->s_unsafe);
+	xa_init(&s->s_delegated_inos);
 	s->s_num_cap_releases = 0;
 	s->s_cap_reconnect = 0;
 	s->s_cap_iterator = NULL;
@@ -2947,9 +3038,9 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
 	dout("handle_reply tid %lld result %d\n", tid, result);
 	rinfo = &req->r_reply_info;
 	if (test_bit(CEPHFS_FEATURE_REPLY_ENCODING, &session->s_features))
-		err = parse_reply_info(msg, rinfo, (u64)-1);
+		err = parse_reply_info(session, msg, rinfo, (u64)-1);
 	else
-		err = parse_reply_info(msg, rinfo, session->s_con.peer_features);
+		err = parse_reply_info(session, msg, rinfo, session->s_con.peer_features);
 	mutex_unlock(&mdsc->mutex);
 
 	mutex_lock(&session->s_mutex);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 27a7446e10d3..30fb60ba2580 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -23,8 +23,9 @@ enum ceph_feature_type {
 	CEPHFS_FEATURE_RECLAIM_CLIENT,
 	CEPHFS_FEATURE_LAZY_CAP_WANTED,
 	CEPHFS_FEATURE_MULTI_RECONNECT,
+	CEPHFS_FEATURE_DELEG_INO,
 
-	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_MULTI_RECONNECT,
+	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_DELEG_INO,
 };
 
 /*
@@ -37,6 +38,7 @@ enum ceph_feature_type {
 	CEPHFS_FEATURE_REPLY_ENCODING,		\
 	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
 	CEPHFS_FEATURE_MULTI_RECONNECT,		\
+	CEPHFS_FEATURE_DELEG_INO,		\
 						\
 	CEPHFS_FEATURE_MAX,			\
 }
@@ -201,6 +203,7 @@ struct ceph_mds_session {
 
 	struct list_head  s_waiting;  /* waiting requests */
 	struct list_head  s_unsafe;   /* unsafe requests */
+	struct xarray	  s_delegated_inos;
 };
 
 /*
@@ -537,4 +540,6 @@ extern void ceph_mdsc_open_export_target_sessions(struct ceph_mds_client *mdsc,
 extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
 			  struct ceph_mds_session *session,
 			  int max_caps);
+
+extern unsigned long ceph_get_deleg_ino(struct ceph_mds_session *session);
 #endif
-- 
2.24.1

* [RFC PATCH v2 06/10] ceph: add flag to designate that a request is asynchronous
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (4 preceding siblings ...)
  2020-01-15 20:59 ` [RFC PATCH v2 05/10] ceph: decode interval_sets for delegated inos Jeff Layton
@ 2020-01-15 20:59 ` Jeff Layton
  2020-01-15 20:59 ` [RFC PATCH v2 07/10] ceph: add infrastructure for waiting for async create to complete Jeff Layton
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-15 20:59 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell, xiubli

The MDS needs to know that a request is asynchronous.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/dir.c                | 1 +
 fs/ceph/inode.c              | 1 +
 fs/ceph/mds_client.c         | 2 ++
 fs/ceph/mds_client.h         | 1 +
 include/linux/ceph/ceph_fs.h | 5 +++--
 5 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 9d2eca67985a..0d97c2962314 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1116,6 +1116,7 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
 	    get_caps_for_async_unlink(dir, dentry)) {
 		dout("ceph: Async unlink on %lu/%.*s", dir->i_ino,
 		     dentry->d_name.len, dentry->d_name.name);
+		set_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
 		req->r_callback = ceph_async_unlink_cb;
 		req->r_old_inode = d_inode(dentry);
 		ihold(req->r_old_inode);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 79bb1e6af090..4056c7968b86 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1317,6 +1317,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
 				NULL, session,
 				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
+				 !test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags) &&
 				 rinfo->head->result == 0) ?  req->r_fmode : -1,
 				&req->r_caps_reservation);
 		if (err < 0) {
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 19bd71eb5733..f06496bb5705 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2620,6 +2620,8 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
 	rhead->oldest_client_tid = cpu_to_le64(__get_oldest_tid(mdsc));
 	if (test_bit(CEPH_MDS_R_GOT_UNSAFE, &req->r_req_flags))
 		flags |= CEPH_MDS_FLAG_REPLAY;
+	if (test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags))
+		flags |= CEPH_MDS_FLAG_ASYNC;
 	if (req->r_parent)
 		flags |= CEPH_MDS_FLAG_WANT_DENTRY;
 	rhead->flags = cpu_to_le32(flags);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 30fb60ba2580..2a32afa15eb6 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -258,6 +258,7 @@ struct ceph_mds_request {
 #define CEPH_MDS_R_GOT_RESULT		(5) /* got a result */
 #define CEPH_MDS_R_DID_PREPOPULATE	(6) /* prepopulated readdir */
 #define CEPH_MDS_R_PARENT_LOCKED	(7) /* is r_parent->i_rwsem wlocked? */
+#define CEPH_MDS_R_ASYNC		(8) /* async request */
 	unsigned long	r_req_flags;
 
 	struct mutex r_fill_mutex;
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index a099f60feb7b..91d09cf37649 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -444,8 +444,9 @@ union ceph_mds_request_args {
 	} __attribute__ ((packed)) lookupino;
 } __attribute__ ((packed));
 
-#define CEPH_MDS_FLAG_REPLAY        1  /* this is a replayed op */
-#define CEPH_MDS_FLAG_WANT_DENTRY   2  /* want dentry in reply */
+#define CEPH_MDS_FLAG_REPLAY		1 /* this is a replayed op */
+#define CEPH_MDS_FLAG_WANT_DENTRY	2 /* want dentry in reply */
+#define CEPH_MDS_FLAG_ASYNC		4 /* request is asynchronous */
 
 struct ceph_mds_request_head {
 	__le64 oldest_client_tid;
-- 
2.24.1

* [RFC PATCH v2 07/10] ceph: add infrastructure for waiting for async create to complete
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (5 preceding siblings ...)
  2020-01-15 20:59 ` [RFC PATCH v2 06/10] ceph: add flag to designate that a request is asynchronous Jeff Layton
@ 2020-01-15 20:59 ` Jeff Layton
  2020-01-17 15:00   ` Ilya Dryomov
  2020-01-15 20:59 ` [RFC PATCH v2 08/10] ceph: add new MDS req field to hold delegated inode number Jeff Layton
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2020-01-15 20:59 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell, xiubli

When we issue an async create, we must ensure that any later on-the-wire
requests involving it wait for the create reply.

Expand i_ceph_flags to be an unsigned long, and add a new bit that
MDS requests can wait on. If the bit is set on the inode when sending
caps, don't send the cap message and just report that it has been
delayed.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c       |  9 ++++++++-
 fs/ceph/dir.c        |  2 +-
 fs/ceph/mds_client.c | 12 +++++++++++-
 fs/ceph/super.h      |  4 +++-
 4 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index c983990acb75..9d1a3d6831f7 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -511,7 +511,7 @@ static void __cap_delay_requeue(struct ceph_mds_client *mdsc,
 				struct ceph_inode_info *ci,
 				bool set_timeout)
 {
-	dout("__cap_delay_requeue %p flags %d at %lu\n", &ci->vfs_inode,
+	dout("__cap_delay_requeue %p flags 0x%lx at %lu\n", &ci->vfs_inode,
 	     ci->i_ceph_flags, ci->i_hold_caps_max);
 	if (!mdsc->stopping) {
 		spin_lock(&mdsc->cap_delay_lock);
@@ -1298,6 +1298,13 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 	int delayed = 0;
 	int ret;
 
+	/* Don't send anything if it's still being created. Return delayed */
+	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
+		spin_unlock(&ci->i_ceph_lock);
+		dout("%s async create in flight for %p\n", __func__, inode);
+		return 1;
+	}
+
 	held = cap->issued | cap->implemented;
 	revoking = cap->implemented & ~cap->issued;
 	retain &= ~revoking;
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 0d97c2962314..b2bcd01ab4e9 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -752,7 +752,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
 		struct ceph_dentry_info *di = ceph_dentry(dentry);
 
 		spin_lock(&ci->i_ceph_lock);
-		dout(" dir %p flags are %d\n", dir, ci->i_ceph_flags);
+		dout(" dir %p flags are 0x%lx\n", dir, ci->i_ceph_flags);
 		if (strncmp(dentry->d_name.name,
 			    fsc->mount_options->snapdir_name,
 			    dentry->d_name.len) &&
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index f06496bb5705..e49ca0533df1 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2806,14 +2806,24 @@ static void kick_requests(struct ceph_mds_client *mdsc, int mds)
 	}
 }
 
+static int ceph_wait_on_async_create(struct inode *inode)
+{
+	struct ceph_inode_info *ci = ceph_inode(inode);
+
+	return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
+			   TASK_INTERRUPTIBLE);
+}
+
 int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
 			      struct ceph_mds_request *req)
 {
 	int err;
 
 	/* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
-	if (req->r_inode)
+	if (req->r_inode) {
+		ceph_wait_on_async_create(req->r_inode);
 		ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
+	}
 	if (req->r_parent) {
 		ceph_get_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
 		ihold(req->r_parent);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index cb7b549f0995..86dd4a2163e0 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -319,7 +319,7 @@ struct ceph_inode_info {
 	u64 i_inline_version;
 	u32 i_time_warp_seq;
 
-	unsigned i_ceph_flags;
+	unsigned long i_ceph_flags;
 	atomic64_t i_release_count;
 	atomic64_t i_ordered_count;
 	atomic64_t i_complete_seq[2];
@@ -527,6 +527,8 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 #define CEPH_I_ERROR_WRITE	(1 << 10) /* have seen write errors */
 #define CEPH_I_ERROR_FILELOCK	(1 << 11) /* have seen file lock errors */
 #define CEPH_I_ODIRECT		(1 << 12) /* inode in direct I/O mode */
+#define CEPH_ASYNC_CREATE_BIT	(13)	  /* async create in flight for this */
+#define CEPH_I_ASYNC_CREATE	(1 << CEPH_ASYNC_CREATE_BIT)
 
 /*
  * Masks of ceph inode work.
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH v2 08/10] ceph: add new MDS req field to hold delegated inode number
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (6 preceding siblings ...)
  2020-01-15 20:59 ` [RFC PATCH v2 07/10] ceph: add infrastructure for waiting for async create to complete Jeff Layton
@ 2020-01-15 20:59 ` Jeff Layton
  2020-01-17 14:47   ` Ilya Dryomov
  2020-01-15 20:59 ` [RFC PATCH v2 09/10] ceph: cache layout in parent dir on first sync create Jeff Layton
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2020-01-15 20:59 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell, xiubli

Add a new request field to hold the delegated inode number. Encode that
into the message when it's set.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/mds_client.c | 3 +--
 fs/ceph/mds_client.h | 1 +
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index e49ca0533df1..b8070e8c4686 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2466,7 +2466,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc,
 	head->op = cpu_to_le32(req->r_op);
 	head->caller_uid = cpu_to_le32(from_kuid(&init_user_ns, req->r_uid));
 	head->caller_gid = cpu_to_le32(from_kgid(&init_user_ns, req->r_gid));
-	head->ino = 0;
+	head->ino = cpu_to_le64(req->r_deleg_ino);
 	head->args = req->r_args;
 
 	ceph_encode_filepath(&p, end, ino1, path1);
@@ -2627,7 +2627,6 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
 	rhead->flags = cpu_to_le32(flags);
 	rhead->num_fwd = req->r_num_fwd;
 	rhead->num_retry = req->r_attempts - 1;
-	rhead->ino = 0;
 
 	dout(" r_parent = %p\n", req->r_parent);
 	return 0;
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 2a32afa15eb6..0811543ffd79 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -308,6 +308,7 @@ struct ceph_mds_request {
 	int               r_num_fwd;    /* number of forward attempts */
 	int               r_resend_mds; /* mds to resend to next, if any*/
 	u32               r_sent_on_mseq; /* cap mseq request was sent at*/
+	unsigned long	  r_deleg_ino;
 
 	struct list_head  r_wait;
 	struct completion r_completion;
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH v2 09/10] ceph: cache layout in parent dir on first sync create
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (7 preceding siblings ...)
  2020-01-15 20:59 ` [RFC PATCH v2 08/10] ceph: add new MDS req field to hold delegated inode number Jeff Layton
@ 2020-01-15 20:59 ` Jeff Layton
  2020-01-15 20:59 ` [RFC PATCH v2 10/10] ceph: attempt to do async create when possible Jeff Layton
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-15 20:59 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell, xiubli

It doesn't do much good to do an asynchronous create unless we can do
I/O to the new file before the create reply comes in. That means we need
layout info for the new file before we've gotten the response from the MDS.

All files created in a directory will initially inherit the same layout,
so copy off the requisite info from the first synchronous create in the
directory, and save it in a new i_cached_layout field. Zero out the
layout when we lose Dc caps in the dir.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c  | 13 ++++++++++---
 fs/ceph/file.c  | 22 +++++++++++++++++++++-
 fs/ceph/inode.c |  2 ++
 fs/ceph/super.h |  1 +
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 9d1a3d6831f7..70bf50f00c27 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -561,14 +561,14 @@ static void __cap_delay_cancel(struct ceph_mds_client *mdsc,
 	spin_unlock(&mdsc->cap_delay_lock);
 }
 
-/*
- * Common issue checks for add_cap, handle_cap_grant.
- */
+/* Common issue checks for add_cap, handle_cap_grant. */
 static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
 			      unsigned issued)
 {
 	unsigned had = __ceph_caps_issued(ci, NULL);
 
+	lockdep_assert_held(&ci->i_ceph_lock);
+
 	/*
 	 * Each time we receive FILE_CACHE anew, we increment
 	 * i_rdcache_gen.
@@ -593,6 +593,13 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
 			__ceph_dir_clear_complete(ci);
 		}
 	}
+
+	/* Wipe saved layout if we're losing DIR_CREATE caps */
+	if (S_ISDIR(ci->vfs_inode.i_mode) && (had & CEPH_CAP_DIR_CREATE) &&
+		!(issued & CEPH_CAP_DIR_CREATE)) {
+	     ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
+	     memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
+	}
 }
 
 /*
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 1e6cdf2dfe90..b44ccbc85fe4 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -430,6 +430,23 @@ int ceph_open(struct inode *inode, struct file *file)
 	return err;
 }
 
+/* Clone the layout from a synchronous create, if the dir now has Dc caps */
+static void
+cache_file_layout(struct inode *dst, struct inode *src)
+{
+	struct ceph_inode_info *cdst = ceph_inode(dst);
+	struct ceph_inode_info *csrc = ceph_inode(src);
+
+	spin_lock(&cdst->i_ceph_lock);
+	if ((__ceph_caps_issued(cdst, NULL) & CEPH_CAP_DIR_CREATE) &&
+	    !ceph_file_layout_is_valid(&cdst->i_cached_layout)) {
+		memcpy(&cdst->i_cached_layout, &csrc->i_layout,
+			sizeof(cdst->i_cached_layout));
+		rcu_assign_pointer(cdst->i_cached_layout.pool_ns,
+				   ceph_try_get_string(csrc->i_layout.pool_ns));
+	}
+	spin_unlock(&cdst->i_ceph_lock);
+}
 
 /*
  * Do a lookup + open with a single request.  If we get a non-existent
@@ -518,7 +535,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	} else {
 		dout("atomic_open finish_open on dn %p\n", dn);
 		if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
-			ceph_init_inode_acls(d_inode(dentry), &as_ctx);
+			struct inode *newino = d_inode(dentry);
+
+			cache_file_layout(dir, newino);
+			ceph_init_inode_acls(newino, &as_ctx);
 			file->f_mode |= FMODE_CREATED;
 		}
 		err = finish_open(file, dentry, ceph_open);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 4056c7968b86..73f986efb1fd 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -447,6 +447,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 	ci->i_max_files = 0;
 
 	memset(&ci->i_dir_layout, 0, sizeof(ci->i_dir_layout));
+	memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
 	RCU_INIT_POINTER(ci->i_layout.pool_ns, NULL);
 
 	ci->i_fragtree = RB_ROOT;
@@ -587,6 +588,7 @@ void ceph_evict_inode(struct inode *inode)
 		ceph_buffer_put(ci->i_xattrs.prealloc_blob);
 
 	ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
+	ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
 }
 
 static inline blkcnt_t calc_inode_blocks(u64 size)
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 86dd4a2163e0..09bd4b71de91 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -326,6 +326,7 @@ struct ceph_inode_info {
 
 	struct ceph_dir_layout i_dir_layout;
 	struct ceph_file_layout i_layout;
+	struct ceph_file_layout i_cached_layout;	// for async creates
 	char *i_symlink;
 
 	/* for dirs */
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH v2 10/10] ceph: attempt to do async create when possible
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (8 preceding siblings ...)
  2020-01-15 20:59 ` [RFC PATCH v2 09/10] ceph: cache layout in parent dir on first sync create Jeff Layton
@ 2020-01-15 20:59 ` Jeff Layton
  2020-01-16 15:09   ` Yan, Zheng
  2020-01-17 13:28   ` Yan, Zheng
  2020-01-15 21:30 ` [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-15 20:59 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell, xiubli

With the Octopus release, the MDS will hand out directory create caps.

If we have Fxc caps on the directory, and complete directory information
or a known negative dentry, then we can return without waiting on the
reply, allowing the open() call to return very quickly to userland.

We use the normal ceph_fill_inode() routine to fill in the inode, so we
have to gin up some reply inode information with what we'd expect the
newly-created inode to have. The client assumes that it has a full set
of caps on the new inode, and that the MDS will revoke them when there
is conflicting access.

This functionality is gated on the enable_async_dirops module option,
along with async unlinks, and on the server supporting the necessary
CephFS feature bit.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/file.c               | 196 +++++++++++++++++++++++++++++++++--
 include/linux/ceph/ceph_fs.h |   3 +
 2 files changed, 190 insertions(+), 9 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index b44ccbc85fe4..2742417fa5ec 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -448,6 +448,169 @@ cache_file_layout(struct inode *dst, struct inode *src)
 	spin_unlock(&cdst->i_ceph_lock);
 }
 
+/*
+ * Try to set up an async create. We need caps, a file layout, an inode number,
+ * and either a lease on the dentry or complete dir info. If any of those
+ * criteria are not satisfied, then return false and the caller can go
+ * synchronous.
+ */
+static bool try_prep_async_create(struct inode *dir, struct dentry *dentry,
+				  struct ceph_file_layout *lo,
+				  unsigned long *pino)
+{
+	struct ceph_inode_info *ci = ceph_inode(dir);
+	bool ret = false;
+	unsigned long ino;
+
+	spin_lock(&ci->i_ceph_lock);
+	/* No auth cap means no chance for Dc caps */
+	if (!ci->i_auth_cap)
+		goto no_async;
+
+	/* Any delegated inos? */
+	if (xa_empty(&ci->i_auth_cap->session->s_delegated_inos))
+		goto no_async;
+
+	if (!ceph_file_layout_is_valid(&ci->i_cached_layout))
+		goto no_async;
+
+	/* Use LOOKUP_RCU since we're under i_ceph_lock */
+	if (!__ceph_dir_is_complete(ci) &&
+	    !dentry_lease_is_valid(dentry, LOOKUP_RCU))
+		goto no_async;
+
+	if (!(__ceph_caps_issued(ci, NULL) &
+	      (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE)))
+		goto no_async;
+
+	ino = ceph_get_deleg_ino(ci->i_auth_cap->session);
+	if (!ino)
+		goto no_async;
+
+	*pino = ino;
+	ceph_take_cap_refs(ci, CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE, false);
+	memcpy(lo, &ci->i_cached_layout, sizeof(*lo));
+	rcu_assign_pointer(lo->pool_ns,
+			   ceph_try_get_string(ci->i_cached_layout.pool_ns));
+	ret = true;
+no_async:
+	spin_unlock(&ci->i_ceph_lock);
+	return ret;
+}
+
+static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
+                                 struct ceph_mds_request *req)
+{
+	mapping_set_error(req->r_parent->i_mapping, req->r_err);
+
+	if (req->r_target_inode) {
+		struct ceph_inode_info *ci = ceph_inode(req->r_target_inode);
+		u64 ino = ceph_vino(req->r_target_inode).ino;
+
+		if (req->r_deleg_ino != ino)
+			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%lx target=0x%llx\n",
+				__func__, req->r_err, req->r_deleg_ino, ino);
+		mapping_set_error(req->r_target_inode->i_mapping, req->r_err);
+
+		spin_lock(&ci->i_ceph_lock);
+		if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
+			ci->i_ceph_flags &= ~CEPH_I_ASYNC_CREATE;
+			wake_up_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT);
+		}
+		spin_unlock(&ci->i_ceph_lock);
+	} else {
+		pr_warn("%s: no req->r_target_inode for 0x%lx\n", __func__,
+			req->r_deleg_ino);
+	}
+	ceph_put_cap_refs(ceph_inode(req->r_parent),
+			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
+}
+
+static int ceph_finish_async_create(struct inode *dir, struct dentry *dentry,
+				    struct file *file, umode_t mode,
+				    struct ceph_mds_request *req,
+				    struct ceph_acl_sec_ctx *as_ctx,
+				    struct ceph_file_layout *lo)
+{
+	int ret;
+	char xattr_buf[4];
+	struct ceph_mds_reply_inode in = { };
+	struct ceph_mds_reply_info_in iinfo = { .in = &in };
+	struct ceph_inode_info *ci = ceph_inode(dir);
+	struct inode *inode;
+	struct timespec64 now;
+	struct ceph_vino vino = { .ino = req->r_deleg_ino,
+				  .snap = CEPH_NOSNAP };
+
+	ktime_get_real_ts64(&now);
+
+	inode = ceph_get_inode(dentry->d_sb, vino);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	iinfo.inline_version = CEPH_INLINE_NONE;
+	iinfo.change_attr = 1;
+	ceph_encode_timespec64(&iinfo.btime, &now);
+
+	iinfo.xattr_len = ARRAY_SIZE(xattr_buf);
+	iinfo.xattr_data = xattr_buf;
+	memset(iinfo.xattr_data, 0, iinfo.xattr_len);
+
+	in.ino = cpu_to_le64(vino.ino);
+	in.snapid = cpu_to_le64(CEPH_NOSNAP);
+	in.version = cpu_to_le64(1);	// ???
+	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
+	in.cap.cap_id = cpu_to_le64(1);
+	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
+	in.cap.flags = CEPH_CAP_FLAG_AUTH;
+	in.ctime = in.mtime = in.atime = iinfo.btime;
+	in.mode = cpu_to_le32((u32)mode);
+	in.truncate_seq = cpu_to_le32(1);
+	in.truncate_size = cpu_to_le64(-1ULL);
+	in.xattr_version = cpu_to_le64(1);
+	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
+	in.gid = cpu_to_le32(from_kgid(&init_user_ns, dir->i_mode & S_ISGID ?
+				dir->i_gid : current_fsgid()));
+	in.nlink = cpu_to_le32(1);
+	in.max_size = cpu_to_le64(lo->stripe_unit);
+
+	ceph_file_layout_to_legacy(lo, &in.layout);
+
+	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
+			      req->r_fmode, NULL);
+	if (ret) {
+		dout("%s failed to fill inode: %d\n", __func__, ret);
+		if (inode->i_state & I_NEW)
+			discard_new_inode(inode);
+	} else {
+		struct dentry *dn;
+
+		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
+			vino.ino, dir->i_ino, dentry->d_name.name);
+		ceph_dir_clear_ordered(dir);
+		ceph_init_inode_acls(inode, as_ctx);
+		if (inode->i_state & I_NEW) {
+			/*
+			 * If it's not I_NEW, then someone created this before
+			 * we got here. Assume the server is aware of it at
+			 * that point and don't worry about setting
+			 * CEPH_I_ASYNC_CREATE.
+			 */
+			ceph_inode(inode)->i_ceph_flags = CEPH_I_ASYNC_CREATE;
+			unlock_new_inode(inode);
+		}
+		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
+			if (!d_unhashed(dentry))
+				d_drop(dentry);
+			dn = d_splice_alias(inode, dentry);
+			WARN_ON_ONCE(dn && dn != dentry);
+		}
+		file->f_mode |= FMODE_CREATED;
+		ret = finish_open(file, dentry, ceph_open);
+	}
+	return ret;
+}
+
 /*
  * Do a lookup + open with a single request.  If we get a non-existent
  * file or symlink, return 1 so the VFS can retry.
@@ -460,6 +623,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	struct ceph_mds_request *req;
 	struct dentry *dn;
 	struct ceph_acl_sec_ctx as_ctx = {};
+	bool try_async = enable_async_dirops;
 	int mask;
 	int err;
 
@@ -492,28 +656,41 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	}
 	req->r_dentry = dget(dentry);
 	req->r_num_caps = 2;
+	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
+	if (ceph_security_xattr_wanted(dir))
+		mask |= CEPH_CAP_XATTR_SHARED;
+	req->r_args.open.mask = cpu_to_le32(mask);
+	req->r_parent = dir;
+
 	if (flags & O_CREAT) {
+		struct ceph_file_layout lo;
+
 		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
 		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
 		if (as_ctx.pagelist) {
 			req->r_pagelist = as_ctx.pagelist;
 			as_ctx.pagelist = NULL;
 		}
+		if (try_async && try_prep_async_create(dir, dentry, &lo,
+						       &req->r_deleg_ino)) {
+			set_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
+			req->r_callback = ceph_async_create_cb;
+			err = ceph_mdsc_submit_request(mdsc, dir, req);
+			if (!err)
+				err = ceph_finish_async_create(dir, dentry,
+							file, mode, req,
+							&as_ctx, &lo);
+			goto out_req;
+		}
 	}
 
-       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
-       if (ceph_security_xattr_wanted(dir))
-               mask |= CEPH_CAP_XATTR_SHARED;
-       req->r_args.open.mask = cpu_to_le32(mask);
-
-	req->r_parent = dir;
 	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
 	err = ceph_mdsc_do_request(mdsc,
 				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
 				   req);
 	err = ceph_handle_snapdir(req, dentry, err);
 	if (err)
-		goto out_req;
+		goto out_fmode;
 
 	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
 		err = ceph_handle_notrace_create(dir, dentry);
@@ -527,7 +704,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 		dn = NULL;
 	}
 	if (err)
-		goto out_req;
+		goto out_fmode;
 	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
 		/* make vfs retry on splice, ENOENT, or symlink */
 		dout("atomic_open finish_no_open on dn %p\n", dn);
@@ -543,9 +720,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 		}
 		err = finish_open(file, dentry, ceph_open);
 	}
-out_req:
+out_fmode:
 	if (!req->r_err && req->r_target_inode)
 		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
+out_req:
 	ceph_mdsc_put_request(req);
 out_ctx:
 	ceph_release_acl_sec_ctx(&as_ctx);
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index 91d09cf37649..e035c5194005 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -659,6 +659,9 @@ int ceph_flags_to_mode(int flags);
 #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
 			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
 			   CEPH_CAP_PIN)
+#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
+			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
+			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
 
 #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
 			CEPH_LOCK_IXATTR)
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 00/10] ceph: asynchronous file create support
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (9 preceding siblings ...)
  2020-01-15 20:59 ` [RFC PATCH v2 10/10] ceph: attempt to do async create when possible Jeff Layton
@ 2020-01-15 21:30 ` Jeff Layton
  2020-01-16  8:57 ` Xiubo Li
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-15 21:30 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell, xiubli

On Wed, 2020-01-15 at 15:59 -0500, Jeff Layton wrote:
> v2:
> - move cached layout to dedicated field in inode
> - protect cached layout with i_ceph_lock
> - wipe cached layout in __check_cap_issue
> - set max_size of file to layout.stripe_unit
> - set truncate_size to (u64)-1
> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> - set cap_id to 1 in async created inode
> - allocate inode number before submitting request
> - rework the prep for an async create to be more efficient
> - don't allow MDS or cap messages involving an inode until we get async
>   create reply
> 
> A lot of changes in this set, mostly based on Zheng and Xiubo's
> comments. Performance is pretty similar to the previous set:
> 
> Untarring a kernel tarball into a cephfs takes about 98s with
> async dirops disabled. With them enabled, it takes around 78s,
> which is about a 25% improvement.
> 
> This is not quite ready for merge. Error handling could still be
> improved. With xfstest generic/531, I see some messages like this pop
> up in the ring buffer:
> 
>     [ 7331.393110] ceph: ceph_async_create_cb: inode number mismatch! err=0 deleg_ino=0x100001232d9 target=0x100001232b9
> 
> Basically, we went to do an async create and got a different inode
> number back than expected. That still needs investigation, but I
> didn't see any test failures due to it.
> 
> Jeff Layton (10):
>   libceph: export ceph_file_layout_is_valid
>   ceph: make ceph_fill_inode non-static
>   ceph: make dentry_lease_is_valid non-static
>   ceph: make __take_cap_refs a public function
>   ceph: decode interval_sets for delegated inos
>   ceph: add flag to designate that a request is asynchronous
>   ceph: add infrastructure for waiting for async create to complete
>   ceph: add new MDS req field to hold delegated inode number
>   ceph: cache layout in parent dir on first sync create
>   ceph: attempt to do async create when possible
> 
>  fs/ceph/caps.c               |  34 ++++--
>  fs/ceph/dir.c                |  13 ++-
>  fs/ceph/file.c               | 218 +++++++++++++++++++++++++++++++++--
>  fs/ceph/inode.c              |  50 ++++----
>  fs/ceph/mds_client.c         | 126 ++++++++++++++++++--
>  fs/ceph/mds_client.h         |   9 +-
>  fs/ceph/super.h              |  16 ++-
>  include/linux/ceph/ceph_fs.h |   8 +-
>  net/ceph/ceph_fs.c           |   1 +
>  9 files changed, 410 insertions(+), 65 deletions(-)
> 

I forgot to mention that I went ahead and merged a few patches from the
first pile into testing, so this series is based on top of
ceph-client/testing as of today.

Cheers,
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 00/10] ceph: asynchronous file create support
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (10 preceding siblings ...)
  2020-01-15 21:30 ` [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
@ 2020-01-16  8:57 ` Xiubo Li
  2020-01-16 13:10 ` Yan, Zheng
  2020-01-20 13:20 ` Yan, Zheng
  13 siblings, 0 replies; 33+ messages in thread
From: Xiubo Li @ 2020-01-16  8:57 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: zyan, sage, idryomov, pdonnell

On 2020/1/16 4:59, Jeff Layton wrote:
> v2:
> - move cached layout to dedicated field in inode
> - protect cached layout with i_ceph_lock
> - wipe cached layout in __check_cap_issue
> - set max_size of file to layout.stripe_unit
> - set truncate_size to (u64)-1
> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> - set cap_id to 1 in async created inode
> - allocate inode number before submitting request
> - rework the prep for an async create to be more efficient
> - don't allow MDS or cap messages involving an inode until we get async
>    create reply
>
> A lot of changes in this set, mostly based on Zheng and Xiubo's
> comments. Performance is pretty similar to the previous set:
>
> Untarring a kernel tarball into a cephfs takes about 98s with
> async dirops disabled. With them enabled, it takes around 78s,
> which is about a 25% improvement.
>
> This is not quite ready for merge. Error handling could still be
> improved. With xfstest generic/531, I see some messages like this pop
> up in the ring buffer:
>
>      [ 7331.393110] ceph: ceph_async_create_cb: inode number mismatch! err=0 deleg_ino=0x100001232d9 target=0x100001232b9
>
> Basically, we went to do an async create and got a different inode
> number back than expected. That still needs investigation, but I
> didn't see any test failures due to it.
>
> Jeff Layton (10):
>    libceph: export ceph_file_layout_is_valid
>    ceph: make ceph_fill_inode non-static
>    ceph: make dentry_lease_is_valid non-static
>    ceph: make __take_cap_refs a public function
>    ceph: decode interval_sets for delegated inos
>    ceph: add flag to designate that a request is asynchronous
>    ceph: add infrastructure for waiting for async create to complete
>    ceph: add new MDS req field to hold delegated inode number
>    ceph: cache layout in parent dir on first sync create
>    ceph: attempt to do async create when possible
>
>   fs/ceph/caps.c               |  34 ++++--
>   fs/ceph/dir.c                |  13 ++-
>   fs/ceph/file.c               | 218 +++++++++++++++++++++++++++++++++--
>   fs/ceph/inode.c              |  50 ++++----
>   fs/ceph/mds_client.c         | 126 ++++++++++++++++++--
>   fs/ceph/mds_client.h         |   9 +-
>   fs/ceph/super.h              |  16 ++-
>   include/linux/ceph/ceph_fs.h |   8 +-
>   net/ceph/ceph_fs.c           |   1 +
>   9 files changed, 410 insertions(+), 65 deletions(-)
>
This looks good to me.

Thanks.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 00/10] ceph: asynchronous file create support
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (11 preceding siblings ...)
  2020-01-16  8:57 ` Xiubo Li
@ 2020-01-16 13:10 ` Yan, Zheng
  2020-01-16 14:15   ` Jeff Layton
  2020-01-20 13:20 ` Yan, Zheng
  13 siblings, 1 reply; 33+ messages in thread
From: Yan, Zheng @ 2020-01-16 13:10 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell, xiubli

On 1/16/20 4:59 AM, Jeff Layton wrote:
> v2:
> - move cached layout to dedicated field in inode
> - protect cached layout with i_ceph_lock
> - wipe cached layout in __check_cap_issue
> - set max_size of file to layout.stripe_unit
> - set truncate_size to (u64)-1
> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> - set cap_id to 1 in async created inode
> - allocate inode number before submitting request
> - rework the prep for an async create to be more efficient
> - don't allow MDS or cap messages involving an inode until we get async
>    create reply
> 
> A lot of changes in this set, mostly based on Zheng and Xiubo's
> comments. Performance is pretty similar to the previous set:
> 
> Untarring a kernel tarball into a cephfs takes about 98s with
> async dirops disabled. With them enabled, it takes around 78s,
> which is about a 25% improvement.
> 
> This is not quite ready for merge. Error handling could still be
> improved. With xfstest generic/531, I see some messages like this pop
> up in the ring buffer:
> 
>      [ 7331.393110] ceph: ceph_async_create_cb: inode number mismatch! err=0 deleg_ino=0x100001232d9 target=0x100001232b9
> 

How about always setting the O_EXCL flag in async create requests? It may
help to debug this issue.

> Basically, we went to do an async create and got a different inode
> number back than expected. That still needs investigation, but I
> didn't see any test failures due to it.
> 
> Jeff Layton (10):
>    libceph: export ceph_file_layout_is_valid
>    ceph: make ceph_fill_inode non-static
>    ceph: make dentry_lease_is_valid non-static
>    ceph: make __take_cap_refs a public function
>    ceph: decode interval_sets for delegated inos
>    ceph: add flag to designate that a request is asynchronous
>    ceph: add infrastructure for waiting for async create to complete
>    ceph: add new MDS req field to hold delegated inode number
>    ceph: cache layout in parent dir on first sync create
>    ceph: attempt to do async create when possible
> 
>   fs/ceph/caps.c               |  34 ++++--
>   fs/ceph/dir.c                |  13 ++-
>   fs/ceph/file.c               | 218 +++++++++++++++++++++++++++++++++--
>   fs/ceph/inode.c              |  50 ++++----
>   fs/ceph/mds_client.c         | 126 ++++++++++++++++++--
>   fs/ceph/mds_client.h         |   9 +-
>   fs/ceph/super.h              |  16 ++-
>   include/linux/ceph/ceph_fs.h |   8 +-
>   net/ceph/ceph_fs.c           |   1 +
>   9 files changed, 410 insertions(+), 65 deletions(-)
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 00/10] ceph: asynchronous file create support
  2020-01-16 13:10 ` Yan, Zheng
@ 2020-01-16 14:15   ` Jeff Layton
  0 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-16 14:15 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell, xiubli

On Thu, 2020-01-16 at 21:10 +0800, Yan, Zheng wrote:
> On 1/16/20 4:59 AM, Jeff Layton wrote:
> > v2:
> > - move cached layout to dedicated field in inode
> > - protect cached layout with i_ceph_lock
> > - wipe cached layout in __check_cap_issue
> > - set max_size of file to layout.stripe_unit
> > - set truncate_size to (u64)-1
> > - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> > - set cap_id to 1 in async created inode
> > - allocate inode number before submitting request
> > - rework the prep for an async create to be more efficient
> > - don't allow MDS or cap messages involving an inode until we get async
> >    create reply
> > 
> > A lot of changes in this set, mostly based on Zheng and Xiubo's
> > comments. Performance is pretty similar to the previous set:
> > 
> > Untarring a kernel tarball into a cephfs takes about 98s with
> > async dirops disabled. With them enabled, it takes around 78s,
> > which is about a 25% improvement.
> > 
> > This is not quite ready for merge. Error handling could still be
> > improved. With xfstest generic/531, I see some messages like this pop
> > up in the ring buffer:
> > 
> >      [ 7331.393110] ceph: ceph_async_create_cb: inode number mismatch! err=0 deleg_ino=0x100001232d9 target=0x100001232b9
> > 
> 
> how about always set O_EXCL flag in async create request. It may help to 
> debug this issue.
> 

I was just thinking the same thing yesterday. I'll do that and we can
see what that turns up. Thanks!

> > Basically, we went to do an async create and got a different inode
> > number back than expected. That still needs investigation, but I
> > didn't see any test failures due to it.
> > 
> > Jeff Layton (10):
> >    libceph: export ceph_file_layout_is_valid
> >    ceph: make ceph_fill_inode non-static
> >    ceph: make dentry_lease_is_valid non-static
> >    ceph: make __take_cap_refs a public function
> >    ceph: decode interval_sets for delegated inos
> >    ceph: add flag to designate that a request is asynchronous
> >    ceph: add infrastructure for waiting for async create to complete
> >    ceph: add new MDS req field to hold delegated inode number
> >    ceph: cache layout in parent dir on first sync create
> >    ceph: attempt to do async create when possible
> > 
> >   fs/ceph/caps.c               |  34 ++++--
> >   fs/ceph/dir.c                |  13 ++-
> >   fs/ceph/file.c               | 218 +++++++++++++++++++++++++++++++++--
> >   fs/ceph/inode.c              |  50 ++++----
> >   fs/ceph/mds_client.c         | 126 ++++++++++++++++++--
> >   fs/ceph/mds_client.h         |   9 +-
> >   fs/ceph/super.h              |  16 ++-
> >   include/linux/ceph/ceph_fs.h |   8 +-
> >   net/ceph/ceph_fs.c           |   1 +
> >   9 files changed, 410 insertions(+), 65 deletions(-)
> > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2 05/10] ceph: decode interval_sets for delegated inos
  2020-01-15 20:59 ` [RFC PATCH v2 05/10] ceph: decode interval_sets for delegated inos Jeff Layton
@ 2020-01-16 14:32   ` Yan, Zheng
  2020-01-16 15:37     ` Jeff Layton
  0 siblings, 1 reply; 33+ messages in thread
From: Yan, Zheng @ 2020-01-16 14:32 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell, xiubli

On 1/16/20 4:59 AM, Jeff Layton wrote:
> Starting in Octopus, the MDS will hand out caps that allow the client
> to do asynchronous file creates under certain conditions. As part of
> that, the MDS will delegate ranges of inode numbers to the client.
> 
> Add the infrastructure to decode these ranges, and stuff them into an
> xarray for later consumption by the async creation code.
> 
> Because the xarray code currently only handles unsigned long indexes,
> and those are 32-bits on 32-bit arches, we only enable the decoding when
> running on a 64-bit arch.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/mds_client.c | 109 +++++++++++++++++++++++++++++++++++++++----
>   fs/ceph/mds_client.h |   7 ++-
>   2 files changed, 106 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 8263f75badfc..19bd71eb5733 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -415,21 +415,110 @@ static int parse_reply_info_filelock(void **p, void *end,
>   	return -EIO;
>   }
>   
> +
> +#if BITS_PER_LONG == 64
> +
> +#define DELEGATED_INO_AVAILABLE		xa_mk_value(1)
> +
> +static int ceph_parse_deleg_inos(void **p, void *end,
> +				 struct ceph_mds_session *s)
> +{
> +	u32 sets;
> +
> +	ceph_decode_32_safe(p, end, sets, bad);
> +	dout("got %u sets of delegated inodes\n", sets);
> +	while (sets--) {
> +		u64 start, len, ino;
> +
> +		ceph_decode_64_safe(p, end, start, bad);
> +		ceph_decode_64_safe(p, end, len, bad);
> +		while (len--) {
> +			int err = xa_insert(&s->s_delegated_inos, ino = start++,
> +					    DELEGATED_INO_AVAILABLE,
> +					    GFP_KERNEL);
> +			if (!err) {
> +				dout("added delegated inode 0x%llx\n",
> +				     start - 1);
> +			} else if (err == -EBUSY) {
> +				pr_warn("ceph: MDS delegated inode 0x%llx more than once.\n",
> +					start - 1);
> +			} else {
> +				return err;
> +			}
> +		}
> +	}
> +	return 0;
> +bad:
> +	return -EIO;
> +}
> +
> +unsigned long ceph_get_deleg_ino(struct ceph_mds_session *s)
> +{
> +	unsigned long ino;
> +	void *val;
> +
> +	xa_for_each(&s->s_delegated_inos, ino, val) {
> +		val = xa_erase(&s->s_delegated_inos, ino);
> +		if (val == DELEGATED_INO_AVAILABLE)
> +			return ino;
> +	}
> +	return 0;

Do we need to protect s_delegated_inos? ceph_get_deleg_ino() and 
ceph_parse_deleg_inos() can be executed at the same time, and multiple 
threads may call ceph_parse_deleg_inos() concurrently.

> +}
> +#else /* BITS_PER_LONG == 64 */
> +/*
> + * FIXME: xarrays can't handle 64-bit indexes on a 32-bit arch. For now, just
> + * ignore delegated_inos on 32 bit arch. Maybe eventually add xarrays for top
> + * and bottom words?
> + */
> +static int ceph_parse_deleg_inos(void **p, void *end,
> +				 struct ceph_mds_session *s)
> +{
> +	u32 sets;
> +
> +	ceph_decode_32_safe(p, end, sets, bad);
> +	if (sets)
> +		ceph_decode_skip_n(p, end, sets * 2 * sizeof(__le64), bad);
> +	return 0;
> +bad:
> +	return -EIO;
> +}
> +
> +unsigned long ceph_get_deleg_ino(struct ceph_mds_session *s)
> +{
> +	return 0;
> +}
> +#endif /* BITS_PER_LONG == 64 */
> +
>   /*
>    * parse create results
>    */
>   static int parse_reply_info_create(void **p, void *end,
>   				  struct ceph_mds_reply_info_parsed *info,
> -				  u64 features)
> +				  u64 features, struct ceph_mds_session *s)
>   {
> +	int ret;
> +
>   	if (features == (u64)-1 ||
>   	    (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {
> -		/* Malformed reply? */
>   		if (*p == end) {
> +			/* Malformed reply? */
>   			info->has_create_ino = false;
> -		} else {
> +		} else if (test_bit(CEPHFS_FEATURE_DELEG_INO, &s->s_features)) {
> +			u8 struct_v, struct_compat;
> +			u32 len;
> +
>   			info->has_create_ino = true;
> +			ceph_decode_8_safe(p, end, struct_v, bad);
> +			ceph_decode_8_safe(p, end, struct_compat, bad);
> +			ceph_decode_32_safe(p, end, len, bad);
> +			ceph_decode_64_safe(p, end, info->ino, bad);
> +			ret = ceph_parse_deleg_inos(p, end, s);
> +			if (ret)
> +				return ret;
> +		} else {
> +			/* legacy */
>   			ceph_decode_64_safe(p, end, info->ino, bad);
> +			info->has_create_ino = true;
>   		}
>   	} else {
>   		if (*p != end)
> @@ -448,7 +537,7 @@ static int parse_reply_info_create(void **p, void *end,
>    */
>   static int parse_reply_info_extra(void **p, void *end,
>   				  struct ceph_mds_reply_info_parsed *info,
> -				  u64 features)
> +				  u64 features, struct ceph_mds_session *s)
>   {
>   	u32 op = le32_to_cpu(info->head->op);
>   
> @@ -457,7 +546,7 @@ static int parse_reply_info_extra(void **p, void *end,
>   	else if (op == CEPH_MDS_OP_READDIR || op == CEPH_MDS_OP_LSSNAP)
>   		return parse_reply_info_readdir(p, end, info, features);
>   	else if (op == CEPH_MDS_OP_CREATE)
> -		return parse_reply_info_create(p, end, info, features);
> +		return parse_reply_info_create(p, end, info, features, s);
>   	else
>   		return -EIO;
>   }
> @@ -465,7 +554,7 @@ static int parse_reply_info_extra(void **p, void *end,
>   /*
>    * parse entire mds reply
>    */
> -static int parse_reply_info(struct ceph_msg *msg,
> +static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
>   			    struct ceph_mds_reply_info_parsed *info,
>   			    u64 features)
>   {
> @@ -490,7 +579,7 @@ static int parse_reply_info(struct ceph_msg *msg,
>   	ceph_decode_32_safe(&p, end, len, bad);
>   	if (len > 0) {
>   		ceph_decode_need(&p, end, len, bad);
> -		err = parse_reply_info_extra(&p, p+len, info, features);
> +		err = parse_reply_info_extra(&p, p+len, info, features, s);
>   		if (err < 0)
>   			goto out_bad;
>   	}
> @@ -558,6 +647,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
>   	if (refcount_dec_and_test(&s->s_ref)) {
>   		if (s->s_auth.authorizer)
>   			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
> +		xa_destroy(&s->s_delegated_inos);
>   		kfree(s);
>   	}
>   }
> @@ -645,6 +735,7 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
>   	refcount_set(&s->s_ref, 1);
>   	INIT_LIST_HEAD(&s->s_waiting);
>   	INIT_LIST_HEAD(&s->s_unsafe);
> +	xa_init(&s->s_delegated_inos);
>   	s->s_num_cap_releases = 0;
>   	s->s_cap_reconnect = 0;
>   	s->s_cap_iterator = NULL;
> @@ -2947,9 +3038,9 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>   	dout("handle_reply tid %lld result %d\n", tid, result);
>   	rinfo = &req->r_reply_info;
>   	if (test_bit(CEPHFS_FEATURE_REPLY_ENCODING, &session->s_features))
> -		err = parse_reply_info(msg, rinfo, (u64)-1);
> +		err = parse_reply_info(session, msg, rinfo, (u64)-1);
>   	else
> -		err = parse_reply_info(msg, rinfo, session->s_con.peer_features);
> +		err = parse_reply_info(session, msg, rinfo, session->s_con.peer_features);
>   	mutex_unlock(&mdsc->mutex);
>   
>   	mutex_lock(&session->s_mutex);
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 27a7446e10d3..30fb60ba2580 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -23,8 +23,9 @@ enum ceph_feature_type {
>   	CEPHFS_FEATURE_RECLAIM_CLIENT,
>   	CEPHFS_FEATURE_LAZY_CAP_WANTED,
>   	CEPHFS_FEATURE_MULTI_RECONNECT,
> +	CEPHFS_FEATURE_DELEG_INO,
>   
> -	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_MULTI_RECONNECT,
> +	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_DELEG_INO,
>   };
>   
>   /*
> @@ -37,6 +38,7 @@ enum ceph_feature_type {
>   	CEPHFS_FEATURE_REPLY_ENCODING,		\
>   	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
>   	CEPHFS_FEATURE_MULTI_RECONNECT,		\
> +	CEPHFS_FEATURE_DELEG_INO,		\
>   						\
>   	CEPHFS_FEATURE_MAX,			\
>   }
> @@ -201,6 +203,7 @@ struct ceph_mds_session {
>   
>   	struct list_head  s_waiting;  /* waiting requests */
>   	struct list_head  s_unsafe;   /* unsafe requests */
> +	struct xarray	  s_delegated_inos;
>   };
>   
>   /*
> @@ -537,4 +540,6 @@ extern void ceph_mdsc_open_export_target_sessions(struct ceph_mds_client *mdsc,
>   extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
>   			  struct ceph_mds_session *session,
>   			  int max_caps);
> +
> +extern unsigned long ceph_get_deleg_ino(struct ceph_mds_session *session);
>   #endif
> 

* Re: [RFC PATCH v2 10/10] ceph: attempt to do async create when possible
  2020-01-15 20:59 ` [RFC PATCH v2 10/10] ceph: attempt to do async create when possible Jeff Layton
@ 2020-01-16 15:09   ` Yan, Zheng
  2020-01-16 16:21     ` Jeff Layton
  2020-01-17 13:28   ` Yan, Zheng
  1 sibling, 1 reply; 33+ messages in thread
From: Yan, Zheng @ 2020-01-16 15:09 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell, xiubli

On 1/16/20 4:59 AM, Jeff Layton wrote:
> With the Octopus release, the MDS will hand out directory create caps.
> 
> If we have Fxc caps on the directory, and complete directory information
> or a known negative dentry, then we can return without waiting on the
> reply, allowing the open() call to return very quickly to userland.
> 
> We use the normal ceph_fill_inode() routine to fill in the inode, so we
> have to gin up some reply inode information with what we'd expect the
> newly-created inode to have. The client assumes that it has a full set
> of caps on the new inode, and that the MDS will revoke them when there
> is conflicting access.
> 
> This functionality is gated on the enable_async_dirops module option,
> along with async unlinks, and on the server supporting the necessary
> CephFS feature bit.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/file.c               | 196 +++++++++++++++++++++++++++++++++--
>   include/linux/ceph/ceph_fs.h |   3 +
>   2 files changed, 190 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index b44ccbc85fe4..2742417fa5ec 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -448,6 +448,169 @@ cache_file_layout(struct inode *dst, struct inode *src)
>   	spin_unlock(&cdst->i_ceph_lock);
>   }
>   
> +/*
> + * Try to set up an async create. We need caps, a file layout, and inode number,
> + * and either a lease on the dentry or complete dir info. If any of those
> + * criteria are not satisfied, then return false and the caller can go
> + * synchronous.
> + */
> +static bool try_prep_async_create(struct inode *dir, struct dentry *dentry,
> +				  struct ceph_file_layout *lo,
> +				  unsigned long *pino)
> +{
> +	struct ceph_inode_info *ci = ceph_inode(dir);
> +	bool ret = false;
> +	unsigned long ino;
> +
> +	spin_lock(&ci->i_ceph_lock);
> +	/* No auth cap means no chance for Dc caps */
> +	if (!ci->i_auth_cap)
> +		goto no_async;
> +
> +	/* Any delegated inos? */
> +	if (xa_empty(&ci->i_auth_cap->session->s_delegated_inos))
> +		goto no_async;
> +
> +	if (!ceph_file_layout_is_valid(&ci->i_cached_layout))
> +		goto no_async;
> +
> +	/* Use LOOKUP_RCU since we're under i_ceph_lock */
> +	if (!__ceph_dir_is_complete(ci) &&
> +	    !dentry_lease_is_valid(dentry, LOOKUP_RCU))
> +		goto no_async;
> +
> +	if (!(__ceph_caps_issued(ci, NULL) &
> +	      (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE)))
> +		goto no_async;
> +

(__ceph_caps_issued(ci, NULL) &
 (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE)) ==
(CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE)

> +	ino = ceph_get_deleg_ino(ci->i_auth_cap->session);
> +	if (!ino)
> +		goto no_async;
> +
> +	*pino = ino;
> +	ceph_take_cap_refs(ci, CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE, false);
> +	memcpy(lo, &ci->i_cached_layout, sizeof(*lo));
> +	rcu_assign_pointer(lo->pool_ns,
> +			   ceph_try_get_string(ci->i_cached_layout.pool_ns));
> +	ret = true;
> +no_async:
> +	spin_unlock(&ci->i_ceph_lock);
> +	return ret;
> +}
> +
> +static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
> +                                 struct ceph_mds_request *req)
> +{
> +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
> +
> +	if (req->r_target_inode) {
> +		struct ceph_inode_info *ci = ceph_inode(req->r_target_inode);
> +		u64 ino = ceph_vino(req->r_target_inode).ino;
> +
> +		if (req->r_deleg_ino != ino)
> +			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%lx target=0x%llx\n",
> +				__func__, req->r_err, req->r_deleg_ino, ino);
> +		mapping_set_error(req->r_target_inode->i_mapping, req->r_err);
> +
> +		spin_lock(&ci->i_ceph_lock);
> +		if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
> +			ci->i_ceph_flags &= ~CEPH_I_ASYNC_CREATE;
> +			wake_up_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT);
> +		}
> +		spin_unlock(&ci->i_ceph_lock);
> +	} else {
> +		pr_warn("%s: no req->r_target_inode for 0x%lx\n", __func__,
> +			req->r_deleg_ino);
> +	}
> +	ceph_put_cap_refs(ceph_inode(req->r_parent),
> +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
> +}
> +
> +static int ceph_finish_async_create(struct inode *dir, struct dentry *dentry,
> +				    struct file *file, umode_t mode,
> +				    struct ceph_mds_request *req,
> +				    struct ceph_acl_sec_ctx *as_ctx,
> +				    struct ceph_file_layout *lo)
> +{
> +	int ret;
> +	char xattr_buf[4];
> +	struct ceph_mds_reply_inode in = { };
> +	struct ceph_mds_reply_info_in iinfo = { .in = &in };
> +	struct ceph_inode_info *ci = ceph_inode(dir);
> +	struct inode *inode;
> +	struct timespec64 now;
> +	struct ceph_vino vino = { .ino = req->r_deleg_ino,
> +				  .snap = CEPH_NOSNAP };
> +
> +	ktime_get_real_ts64(&now);
> +
> +	inode = ceph_get_inode(dentry->d_sb, vino);
> +	if (IS_ERR(inode))
> +		return PTR_ERR(inode);
> +
> +	iinfo.inline_version = CEPH_INLINE_NONE;
> +	iinfo.change_attr = 1;
> +	ceph_encode_timespec64(&iinfo.btime, &now);
> +
> +	iinfo.xattr_len = ARRAY_SIZE(xattr_buf);
> +	iinfo.xattr_data = xattr_buf;
> +	memset(iinfo.xattr_data, 0, iinfo.xattr_len);
> +
> +	in.ino = cpu_to_le64(vino.ino);
> +	in.snapid = cpu_to_le64(CEPH_NOSNAP);
> +	in.version = cpu_to_le64(1);	// ???
> +	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
> +	in.cap.cap_id = cpu_to_le64(1);
> +	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
> +	in.cap.flags = CEPH_CAP_FLAG_AUTH;
> +	in.ctime = in.mtime = in.atime = iinfo.btime;
> +	in.mode = cpu_to_le32((u32)mode);
> +	in.truncate_seq = cpu_to_le32(1);
> +	in.truncate_size = cpu_to_le64(-1ULL);
> +	in.xattr_version = cpu_to_le64(1);
> +	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
> +	in.gid = cpu_to_le32(from_kgid(&init_user_ns, dir->i_mode & S_ISGID ?
> +				dir->i_gid : current_fsgid()));
> +	in.nlink = cpu_to_le32(1);
> +	in.max_size = cpu_to_le64(lo->stripe_unit);
> +
> +	ceph_file_layout_to_legacy(lo, &in.layout);
> +
> +	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
> +			      req->r_fmode, NULL);
> +	if (ret) {
> +		dout("%s failed to fill inode: %d\n", __func__, ret);
> +		if (inode->i_state & I_NEW)
> +			discard_new_inode(inode);
> +	} else {
> +		struct dentry *dn;
> +
> +		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
> +			vino.ino, dir->i_ino, dentry->d_name.name);
> +		ceph_dir_clear_ordered(dir);
> +		ceph_init_inode_acls(inode, as_ctx);
> +		if (inode->i_state & I_NEW) {
> +			/*
> +			 * If it's not I_NEW, then someone created this before
> +			 * we got here. Assume the server is aware of it at
> +			 * that point and don't worry about setting
> +			 * CEPH_I_ASYNC_CREATE.
> +			 */
> +			ceph_inode(inode)->i_ceph_flags = CEPH_I_ASYNC_CREATE;
> +			unlock_new_inode(inode);
> +		}
> +		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
> +			if (!d_unhashed(dentry))
> +				d_drop(dentry);
> +			dn = d_splice_alias(inode, dentry);
> +			WARN_ON_ONCE(dn && dn != dentry);
> +		}
> +		file->f_mode |= FMODE_CREATED;
> +		ret = finish_open(file, dentry, ceph_open);
> +	}
> +	return ret;
> +}
> +
>   /*
>    * Do a lookup + open with a single request.  If we get a non-existent
>    * file or symlink, return 1 so the VFS can retry.
> @@ -460,6 +623,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   	struct ceph_mds_request *req;
>   	struct dentry *dn;
>   	struct ceph_acl_sec_ctx as_ctx = {};
> +	bool try_async = enable_async_dirops;
>   	int mask;
>   	int err;
>   
> @@ -492,28 +656,41 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   	}
>   	req->r_dentry = dget(dentry);
>   	req->r_num_caps = 2;
> +	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> +	if (ceph_security_xattr_wanted(dir))
> +		mask |= CEPH_CAP_XATTR_SHARED;
> +	req->r_args.open.mask = cpu_to_le32(mask);
> +	req->r_parent = dir;
> +
>   	if (flags & O_CREAT) {
> +		struct ceph_file_layout lo;
> +
>   		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
>   		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
>   		if (as_ctx.pagelist) {
>   			req->r_pagelist = as_ctx.pagelist;
>   			as_ctx.pagelist = NULL;
>   		}
> +		if (try_async && try_prep_async_create(dir, dentry, &lo,
> +						       &req->r_deleg_ino)) {
> +			set_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
> +			req->r_callback = ceph_async_create_cb;
> +			err = ceph_mdsc_submit_request(mdsc, dir, req);
> +			if (!err)
> +				err = ceph_finish_async_create(dir, dentry,
> +							file, mode, req,
> +							&as_ctx, &lo);
> +			goto out_req;
> +		}
>   	}
>   
> -       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> -       if (ceph_security_xattr_wanted(dir))
> -               mask |= CEPH_CAP_XATTR_SHARED;
> -       req->r_args.open.mask = cpu_to_le32(mask);
> -
> -	req->r_parent = dir;
>   	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
>   	err = ceph_mdsc_do_request(mdsc,
>   				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
>   				   req);
>   	err = ceph_handle_snapdir(req, dentry, err);
>   	if (err)
> -		goto out_req;
> +		goto out_fmode;
>   
>   	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
>   		err = ceph_handle_notrace_create(dir, dentry);
> @@ -527,7 +704,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   		dn = NULL;
>   	}
>   	if (err)
> -		goto out_req;
> +		goto out_fmode;
>   	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
>   		/* make vfs retry on splice, ENOENT, or symlink */
>   		dout("atomic_open finish_no_open on dn %p\n", dn);
> @@ -543,9 +720,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   		}
>   		err = finish_open(file, dentry, ceph_open);
>   	}
> -out_req:
> +out_fmode:
>   	if (!req->r_err && req->r_target_inode)
>   		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
> +out_req:
>   	ceph_mdsc_put_request(req);
>   out_ctx:
>   	ceph_release_acl_sec_ctx(&as_ctx);
> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> index 91d09cf37649..e035c5194005 100644
> --- a/include/linux/ceph/ceph_fs.h
> +++ b/include/linux/ceph/ceph_fs.h
> @@ -659,6 +659,9 @@ int ceph_flags_to_mode(int flags);
>   #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
>   			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
>   			   CEPH_CAP_PIN)
> +#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
> +			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
> +			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
>   
>   #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
>   			CEPH_LOCK_IXATTR)
> 

* Re: [RFC PATCH v2 05/10] ceph: decode interval_sets for delegated inos
  2020-01-16 14:32   ` Yan, Zheng
@ 2020-01-16 15:37     ` Jeff Layton
  0 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-16 15:37 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell, xiubli

On Thu, 2020-01-16 at 22:32 +0800, Yan, Zheng wrote:
> On 1/16/20 4:59 AM, Jeff Layton wrote:
> > Starting in Octopus, the MDS will hand out caps that allow the client
> > to do asynchronous file creates under certain conditions. As part of
> > that, the MDS will delegate ranges of inode numbers to the client.
> > 
> > Add the infrastructure to decode these ranges, and stuff them into an
> > xarray for later consumption by the async creation code.
> > 
> > Because the xarray code currently only handles unsigned long indexes,
> > and those are 32-bits on 32-bit arches, we only enable the decoding when
> > running on a 64-bit arch.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >   fs/ceph/mds_client.c | 109 +++++++++++++++++++++++++++++++++++++++----
> >   fs/ceph/mds_client.h |   7 ++-
> >   2 files changed, 106 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 8263f75badfc..19bd71eb5733 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -415,21 +415,110 @@ static int parse_reply_info_filelock(void **p, void *end,
> >   	return -EIO;
> >   }
> >   
> > +
> > +#if BITS_PER_LONG == 64
> > +
> > +#define DELEGATED_INO_AVAILABLE		xa_mk_value(1)
> > +
> > +static int ceph_parse_deleg_inos(void **p, void *end,
> > +				 struct ceph_mds_session *s)
> > +{
> > +	u32 sets;
> > +
> > +	ceph_decode_32_safe(p, end, sets, bad);
> > +	dout("got %u sets of delegated inodes\n", sets);
> > +	while (sets--) {
> > +		u64 start, len, ino;
> > +
> > +		ceph_decode_64_safe(p, end, start, bad);
> > +		ceph_decode_64_safe(p, end, len, bad);
> > +		while (len--) {
> > +			int err = xa_insert(&s->s_delegated_inos, ino = start++,
> > +					    DELEGATED_INO_AVAILABLE,
> > +					    GFP_KERNEL);
> > +			if (!err) {
> > +				dout("added delegated inode 0x%llx\n",
> > +				     start - 1);
> > +			} else if (err == -EBUSY) {
> > +				pr_warn("ceph: MDS delegated inode 0x%llx more than once.\n",
> > +					start - 1);
> > +			} else {
> > +				return err;
> > +			}
> > +		}
> > +	}
> > +	return 0;
> > +bad:
> > +	return -EIO;
> > +}
> > +
> > +unsigned long ceph_get_deleg_ino(struct ceph_mds_session *s)
> > +{
> > +	unsigned long ino;
> > +	void *val;
> > +
> > +	xa_for_each(&s->s_delegated_inos, ino, val) {
> > +		val = xa_erase(&s->s_delegated_inos, ino);
> > +		if (val == DELEGATED_INO_AVAILABLE)
> > +			return ino;
> > +	}
> > +	return 0;
> 
> Do we need to protect s_delegated_inos? ceph_get_deleg_ino() and 
> ceph_parse_deleg_inos() can be executed at the same time, and multiple 
> threads may call ceph_parse_deleg_inos() concurrently.
> 

No. Xarrays have their own locking, and we're using the "simple" API
here (which does it implicitly).

> > +}
> > +#else /* BITS_PER_LONG == 64 */
> > +/*
> > + * FIXME: xarrays can't handle 64-bit indexes on a 32-bit arch. For now, just
> > + * ignore delegated_inos on 32 bit arch. Maybe eventually add xarrays for top
> > + * and bottom words?
> > + */
> > +static int ceph_parse_deleg_inos(void **p, void *end,
> > +				 struct ceph_mds_session *s)
> > +{
> > +	u32 sets;
> > +
> > +	ceph_decode_32_safe(p, end, sets, bad);
> > +	if (sets)
> > +		ceph_decode_skip_n(p, end, sets * 2 * sizeof(__le64), bad);
> > +	return 0;
> > +bad:
> > +	return -EIO;
> > +}
> > +
> > +unsigned long ceph_get_deleg_ino(struct ceph_mds_session *s)
> > +{
> > +	return 0;
> > +}
> > +#endif /* BITS_PER_LONG == 64 */
> > +
> >   /*
> >    * parse create results
> >    */
> >   static int parse_reply_info_create(void **p, void *end,
> >   				  struct ceph_mds_reply_info_parsed *info,
> > -				  u64 features)
> > +				  u64 features, struct ceph_mds_session *s)
> >   {
> > +	int ret;
> > +
> >   	if (features == (u64)-1 ||
> >   	    (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {
> > -		/* Malformed reply? */
> >   		if (*p == end) {
> > +			/* Malformed reply? */
> >   			info->has_create_ino = false;
> > -		} else {
> > +		} else if (test_bit(CEPHFS_FEATURE_DELEG_INO, &s->s_features)) {
> > +			u8 struct_v, struct_compat;
> > +			u32 len;
> > +
> >   			info->has_create_ino = true;
> > +			ceph_decode_8_safe(p, end, struct_v, bad);
> > +			ceph_decode_8_safe(p, end, struct_compat, bad);
> > +			ceph_decode_32_safe(p, end, len, bad);
> > +			ceph_decode_64_safe(p, end, info->ino, bad);
> > +			ret = ceph_parse_deleg_inos(p, end, s);
> > +			if (ret)
> > +				return ret;
> > +		} else {
> > +			/* legacy */
> >   			ceph_decode_64_safe(p, end, info->ino, bad);
> > +			info->has_create_ino = true;
> >   		}
> >   	} else {
> >   		if (*p != end)
> > @@ -448,7 +537,7 @@ static int parse_reply_info_create(void **p, void *end,
> >    */
> >   static int parse_reply_info_extra(void **p, void *end,
> >   				  struct ceph_mds_reply_info_parsed *info,
> > -				  u64 features)
> > +				  u64 features, struct ceph_mds_session *s)
> >   {
> >   	u32 op = le32_to_cpu(info->head->op);
> >   
> > @@ -457,7 +546,7 @@ static int parse_reply_info_extra(void **p, void *end,
> >   	else if (op == CEPH_MDS_OP_READDIR || op == CEPH_MDS_OP_LSSNAP)
> >   		return parse_reply_info_readdir(p, end, info, features);
> >   	else if (op == CEPH_MDS_OP_CREATE)
> > -		return parse_reply_info_create(p, end, info, features);
> > +		return parse_reply_info_create(p, end, info, features, s);
> >   	else
> >   		return -EIO;
> >   }
> > @@ -465,7 +554,7 @@ static int parse_reply_info_extra(void **p, void *end,
> >   /*
> >    * parse entire mds reply
> >    */
> > -static int parse_reply_info(struct ceph_msg *msg,
> > +static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
> >   			    struct ceph_mds_reply_info_parsed *info,
> >   			    u64 features)
> >   {
> > @@ -490,7 +579,7 @@ static int parse_reply_info(struct ceph_msg *msg,
> >   	ceph_decode_32_safe(&p, end, len, bad);
> >   	if (len > 0) {
> >   		ceph_decode_need(&p, end, len, bad);
> > -		err = parse_reply_info_extra(&p, p+len, info, features);
> > +		err = parse_reply_info_extra(&p, p+len, info, features, s);
> >   		if (err < 0)
> >   			goto out_bad;
> >   	}
> > @@ -558,6 +647,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
> >   	if (refcount_dec_and_test(&s->s_ref)) {
> >   		if (s->s_auth.authorizer)
> >   			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
> > +		xa_destroy(&s->s_delegated_inos);
> >   		kfree(s);
> >   	}
> >   }
> > @@ -645,6 +735,7 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
> >   	refcount_set(&s->s_ref, 1);
> >   	INIT_LIST_HEAD(&s->s_waiting);
> >   	INIT_LIST_HEAD(&s->s_unsafe);
> > +	xa_init(&s->s_delegated_inos);
> >   	s->s_num_cap_releases = 0;
> >   	s->s_cap_reconnect = 0;
> >   	s->s_cap_iterator = NULL;
> > @@ -2947,9 +3038,9 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
> >   	dout("handle_reply tid %lld result %d\n", tid, result);
> >   	rinfo = &req->r_reply_info;
> >   	if (test_bit(CEPHFS_FEATURE_REPLY_ENCODING, &session->s_features))
> > -		err = parse_reply_info(msg, rinfo, (u64)-1);
> > +		err = parse_reply_info(session, msg, rinfo, (u64)-1);
> >   	else
> > -		err = parse_reply_info(msg, rinfo, session->s_con.peer_features);
> > +		err = parse_reply_info(session, msg, rinfo, session->s_con.peer_features);
> >   	mutex_unlock(&mdsc->mutex);
> >   
> >   	mutex_lock(&session->s_mutex);
> > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > index 27a7446e10d3..30fb60ba2580 100644
> > --- a/fs/ceph/mds_client.h
> > +++ b/fs/ceph/mds_client.h
> > @@ -23,8 +23,9 @@ enum ceph_feature_type {
> >   	CEPHFS_FEATURE_RECLAIM_CLIENT,
> >   	CEPHFS_FEATURE_LAZY_CAP_WANTED,
> >   	CEPHFS_FEATURE_MULTI_RECONNECT,
> > +	CEPHFS_FEATURE_DELEG_INO,
> >   
> > -	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_MULTI_RECONNECT,
> > +	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_DELEG_INO,
> >   };
> >   
> >   /*
> > @@ -37,6 +38,7 @@ enum ceph_feature_type {
> >   	CEPHFS_FEATURE_REPLY_ENCODING,		\
> >   	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
> >   	CEPHFS_FEATURE_MULTI_RECONNECT,		\
> > +	CEPHFS_FEATURE_DELEG_INO,		\
> >   						\
> >   	CEPHFS_FEATURE_MAX,			\
> >   }
> > @@ -201,6 +203,7 @@ struct ceph_mds_session {
> >   
> >   	struct list_head  s_waiting;  /* waiting requests */
> >   	struct list_head  s_unsafe;   /* unsafe requests */
> > +	struct xarray	  s_delegated_inos;
> >   };
> >   
> >   /*
> > @@ -537,4 +540,6 @@ extern void ceph_mdsc_open_export_target_sessions(struct ceph_mds_client *mdsc,
> >   extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
> >   			  struct ceph_mds_session *session,
> >   			  int max_caps);
> > +
> > +extern unsigned long ceph_get_deleg_ino(struct ceph_mds_session *session);
> >   #endif
> > 

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [RFC PATCH v2 10/10] ceph: attempt to do async create when possible
  2020-01-16 15:09   ` Yan, Zheng
@ 2020-01-16 16:21     ` Jeff Layton
  0 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-16 16:21 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell, xiubli

On Thu, 2020-01-16 at 23:09 +0800, Yan, Zheng wrote:
> On 1/16/20 4:59 AM, Jeff Layton wrote:
> > With the Octopus release, the MDS will hand out directory create caps.
> > 
> > If we have Fxc caps on the directory, and complete directory information
> > or a known negative dentry, then we can return without waiting on the
> > reply, allowing the open() call to return very quickly to userland.
> > 
> > We use the normal ceph_fill_inode() routine to fill in the inode, so we
> > have to gin up some reply inode information with what we'd expect the
> > newly-created inode to have. The client assumes that it has a full set
> > of caps on the new inode, and that the MDS will revoke them when there
> > is conflicting access.
> > 
> > This functionality is gated on the enable_async_dirops module option,
> > along with async unlinks, and on the server supporting the necessary
> > CephFS feature bit.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >   fs/ceph/file.c               | 196 +++++++++++++++++++++++++++++++++--
> >   include/linux/ceph/ceph_fs.h |   3 +
> >   2 files changed, 190 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index b44ccbc85fe4..2742417fa5ec 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -448,6 +448,169 @@ cache_file_layout(struct inode *dst, struct inode *src)
> >   	spin_unlock(&cdst->i_ceph_lock);
> >   }
> >   
> > +/*
> > + * Try to set up an async create. We need caps, a file layout, and inode number,
> > + * and either a lease on the dentry or complete dir info. If any of those
> > + * criteria are not satisfied, then return false and the caller can go
> > + * synchronous.
> > + */
> > +static bool try_prep_async_create(struct inode *dir, struct dentry *dentry,
> > +				  struct ceph_file_layout *lo,
> > +				  unsigned long *pino)
> > +{
> > +	struct ceph_inode_info *ci = ceph_inode(dir);
> > +	bool ret = false;
> > +	unsigned long ino;
> > +
> > +	spin_lock(&ci->i_ceph_lock);
> > +	/* No auth cap means no chance for Dc caps */
> > +	if (!ci->i_auth_cap)
> > +		goto no_async;
> > +
> > +	/* Any delegated inos? */
> > +	if (xa_empty(&ci->i_auth_cap->session->s_delegated_inos))
> > +		goto no_async;
> > +
> > +	if (!ceph_file_layout_is_valid(&ci->i_cached_layout))
> > +		goto no_async;
> > +
> > +	/* Use LOOKUP_RCU since we're under i_ceph_lock */
> > +	if (!__ceph_dir_is_complete(ci) &&
> > +	    !dentry_lease_is_valid(dentry, LOOKUP_RCU))
> > +		goto no_async;
> > +
> > +	if (!(__ceph_caps_issued(ci, NULL) &
> > +	      (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE)))
> > +		goto no_async;
> > +
> 
> (__ceph_caps_issued(ci, NULL) &
>  (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE)) ==
> (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE)
> 

Good catch! Fixed in my tree. Retesting now and will fold that one in.

> > +	ino = ceph_get_deleg_ino(ci->i_auth_cap->session);
> > +	if (!ino)
> > +		goto no_async;
> > +
> > +	*pino = ino;
> > +	ceph_take_cap_refs(ci, CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE, false);
> > +	memcpy(lo, &ci->i_cached_layout, sizeof(*lo));
> > +	rcu_assign_pointer(lo->pool_ns,
> > +			   ceph_try_get_string(ci->i_cached_layout.pool_ns));
> > +	ret = true;
> > +no_async:
> > +	spin_unlock(&ci->i_ceph_lock);
> > +	return ret;
> > +}
> > +
> > +static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
> > +                                 struct ceph_mds_request *req)
> > +{
> > +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
> > +
> > +	if (req->r_target_inode) {
> > +		struct ceph_inode_info *ci = ceph_inode(req->r_target_inode);
> > +		u64 ino = ceph_vino(req->r_target_inode).ino;
> > +
> > +		if (req->r_deleg_ino != ino)
> > +			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%lx target=0x%llx\n",
> > +				__func__, req->r_err, req->r_deleg_ino, ino);
> > +		mapping_set_error(req->r_target_inode->i_mapping, req->r_err);
> > +
> > +		spin_lock(&ci->i_ceph_lock);
> > +		if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
> > +			ci->i_ceph_flags &= ~CEPH_I_ASYNC_CREATE;
> > +			wake_up_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT);
> > +		}
> > +		spin_unlock(&ci->i_ceph_lock);
> > +	} else {
> > +		pr_warn("%s: no req->r_target_inode for 0x%lx\n", __func__,
> > +			req->r_deleg_ino);
> > +	}
> > +	ceph_put_cap_refs(ceph_inode(req->r_parent),
> > +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
> > +}
> > +
> > +static int ceph_finish_async_create(struct inode *dir, struct dentry *dentry,
> > +				    struct file *file, umode_t mode,
> > +				    struct ceph_mds_request *req,
> > +				    struct ceph_acl_sec_ctx *as_ctx,
> > +				    struct ceph_file_layout *lo)
> > +{
> > +	int ret;
> > +	char xattr_buf[4];
> > +	struct ceph_mds_reply_inode in = { };
> > +	struct ceph_mds_reply_info_in iinfo = { .in = &in };
> > +	struct ceph_inode_info *ci = ceph_inode(dir);
> > +	struct inode *inode;
> > +	struct timespec64 now;
> > +	struct ceph_vino vino = { .ino = req->r_deleg_ino,
> > +				  .snap = CEPH_NOSNAP };
> > +
> > +	ktime_get_real_ts64(&now);
> > +
> > +	inode = ceph_get_inode(dentry->d_sb, vino);
> > +	if (IS_ERR(inode))
> > +		return PTR_ERR(inode);
> > +
> > +	iinfo.inline_version = CEPH_INLINE_NONE;
> > +	iinfo.change_attr = 1;
> > +	ceph_encode_timespec64(&iinfo.btime, &now);
> > +
> > +	iinfo.xattr_len = ARRAY_SIZE(xattr_buf);
> > +	iinfo.xattr_data = xattr_buf;
> > +	memset(iinfo.xattr_data, 0, iinfo.xattr_len);
> > +
> > +	in.ino = cpu_to_le64(vino.ino);
> > +	in.snapid = cpu_to_le64(CEPH_NOSNAP);
> > +	in.version = cpu_to_le64(1);	// ???
> > +	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
> > +	in.cap.cap_id = cpu_to_le64(1);
> > +	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
> > +	in.cap.flags = CEPH_CAP_FLAG_AUTH;
> > +	in.ctime = in.mtime = in.atime = iinfo.btime;
> > +	in.mode = cpu_to_le32((u32)mode);
> > +	in.truncate_seq = cpu_to_le32(1);
> > +	in.truncate_size = cpu_to_le64(-1ULL);
> > +	in.xattr_version = cpu_to_le64(1);
> > +	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
> > +	in.gid = cpu_to_le32(from_kgid(&init_user_ns, dir->i_mode & S_ISGID ?
> > +				dir->i_gid : current_fsgid()));
> > +	in.nlink = cpu_to_le32(1);
> > +	in.max_size = cpu_to_le64(lo->stripe_unit);
> > +
> > +	ceph_file_layout_to_legacy(lo, &in.layout);
> > +
> > +	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
> > +			      req->r_fmode, NULL);
> > +	if (ret) {
> > +		dout("%s failed to fill inode: %d\n", __func__, ret);
> > +		if (inode->i_state & I_NEW)
> > +			discard_new_inode(inode);
> > +	} else {
> > +		struct dentry *dn;
> > +
> > +		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
> > +			vino.ino, dir->i_ino, dentry->d_name.name);
> > +		ceph_dir_clear_ordered(dir);
> > +		ceph_init_inode_acls(inode, as_ctx);
> > +		if (inode->i_state & I_NEW) {
> > +			/*
> > +			 * If it's not I_NEW, then someone created this before
> > +			 * we got here. Assume the server is aware of it at
> > +			 * that point and don't worry about setting
> > +			 * CEPH_I_ASYNC_CREATE.
> > +			 */
> > +			ceph_inode(inode)->i_ceph_flags = CEPH_I_ASYNC_CREATE;
> > +			unlock_new_inode(inode);
> > +		}
> > +		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
> > +			if (!d_unhashed(dentry))
> > +				d_drop(dentry);
> > +			dn = d_splice_alias(inode, dentry);
> > +			WARN_ON_ONCE(dn && dn != dentry);
> > +		}
> > +		file->f_mode |= FMODE_CREATED;
> > +		ret = finish_open(file, dentry, ceph_open);
> > +	}
> > +	return ret;
> > +}
> > +
> >   /*
> >    * Do a lookup + open with a single request.  If we get a non-existent
> >    * file or symlink, return 1 so the VFS can retry.
> > @@ -460,6 +623,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   	struct ceph_mds_request *req;
> >   	struct dentry *dn;
> >   	struct ceph_acl_sec_ctx as_ctx = {};
> > +	bool try_async = enable_async_dirops;
> >   	int mask;
> >   	int err;
> >   
> > @@ -492,28 +656,41 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   	}
> >   	req->r_dentry = dget(dentry);
> >   	req->r_num_caps = 2;
> > +	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> > +	if (ceph_security_xattr_wanted(dir))
> > +		mask |= CEPH_CAP_XATTR_SHARED;
> > +	req->r_args.open.mask = cpu_to_le32(mask);
> > +	req->r_parent = dir;
> > +
> >   	if (flags & O_CREAT) {
> > +		struct ceph_file_layout lo;
> > +
> >   		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
> >   		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
> >   		if (as_ctx.pagelist) {
> >   			req->r_pagelist = as_ctx.pagelist;
> >   			as_ctx.pagelist = NULL;
> >   		}
> > +		if (try_async && try_prep_async_create(dir, dentry, &lo,
> > +						       &req->r_deleg_ino)) {
> > +			set_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
> > +			req->r_callback = ceph_async_create_cb;
> > +			err = ceph_mdsc_submit_request(mdsc, dir, req);
> > +			if (!err)
> > +				err = ceph_finish_async_create(dir, dentry,
> > +							file, mode, req,
> > +							&as_ctx, &lo);
> > +			goto out_req;
> > +		}
> >   	}
> >   
> > -       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> > -       if (ceph_security_xattr_wanted(dir))
> > -               mask |= CEPH_CAP_XATTR_SHARED;
> > -       req->r_args.open.mask = cpu_to_le32(mask);
> > -
> > -	req->r_parent = dir;
> >   	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> >   	err = ceph_mdsc_do_request(mdsc,
> >   				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
> >   				   req);
> >   	err = ceph_handle_snapdir(req, dentry, err);
> >   	if (err)
> > -		goto out_req;
> > +		goto out_fmode;
> >   
> >   	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
> >   		err = ceph_handle_notrace_create(dir, dentry);
> > @@ -527,7 +704,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   		dn = NULL;
> >   	}
> >   	if (err)
> > -		goto out_req;
> > +		goto out_fmode;
> >   	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
> >   		/* make vfs retry on splice, ENOENT, or symlink */
> >   		dout("atomic_open finish_no_open on dn %p\n", dn);
> > @@ -543,9 +720,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   		}
> >   		err = finish_open(file, dentry, ceph_open);
> >   	}
> > -out_req:
> > +out_fmode:
> >   	if (!req->r_err && req->r_target_inode)
> >   		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
> > +out_req:
> >   	ceph_mdsc_put_request(req);
> >   out_ctx:
> >   	ceph_release_acl_sec_ctx(&as_ctx);
> > diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> > index 91d09cf37649..e035c5194005 100644
> > --- a/include/linux/ceph/ceph_fs.h
> > +++ b/include/linux/ceph/ceph_fs.h
> > @@ -659,6 +659,9 @@ int ceph_flags_to_mode(int flags);
> >   #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
> >   			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
> >   			   CEPH_CAP_PIN)
> > +#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
> > +			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
> > +			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
> >   
> >   #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
> >   			CEPH_LOCK_IXATTR)
> > 

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [RFC PATCH v2 10/10] ceph: attempt to do async create when possible
  2020-01-15 20:59 ` [RFC PATCH v2 10/10] ceph: attempt to do async create when possible Jeff Layton
  2020-01-16 15:09   ` Yan, Zheng
@ 2020-01-17 13:28   ` Yan, Zheng
  2020-01-17 17:40     ` Jeff Layton
  1 sibling, 1 reply; 33+ messages in thread
From: Yan, Zheng @ 2020-01-17 13:28 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell, xiubli

On 1/16/20 4:59 AM, Jeff Layton wrote:
> With the Octopus release, the MDS will hand out directory create caps.
> 
> If we have Fxc caps on the directory, and complete directory information
> or a known negative dentry, then we can return without waiting on the
> reply, allowing the open() call to return very quickly to userland.
> 
> We use the normal ceph_fill_inode() routine to fill in the inode, so we
> have to gin up some reply inode information with what we'd expect the
> newly-created inode to have. The client assumes that it has a full set
> of caps on the new inode, and that the MDS will revoke them when there
> is conflicting access.
> 
> This functionality is gated on the enable_async_dirops module option,
> along with async unlinks, and on the server supporting the necessary
> CephFS feature bit.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/file.c               | 196 +++++++++++++++++++++++++++++++++--
>   include/linux/ceph/ceph_fs.h |   3 +
>   2 files changed, 190 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index b44ccbc85fe4..2742417fa5ec 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -448,6 +448,169 @@ cache_file_layout(struct inode *dst, struct inode *src)
>   	spin_unlock(&cdst->i_ceph_lock);
>   }
>   
> +/*
> + * Try to set up an async create. We need caps, a file layout, and inode number,
> + * and either a lease on the dentry or complete dir info. If any of those
> + * criteria are not satisfied, then return false and the caller can go
> + * synchronous.
> + */
> +static bool try_prep_async_create(struct inode *dir, struct dentry *dentry,
> +				  struct ceph_file_layout *lo,
> +				  unsigned long *pino)
> +{
> +	struct ceph_inode_info *ci = ceph_inode(dir);
> +	bool ret = false;
> +	unsigned long ino;
> +
> +	spin_lock(&ci->i_ceph_lock);
> +	/* No auth cap means no chance for Dc caps */
> +	if (!ci->i_auth_cap)
> +		goto no_async;
> +
> +	/* Any delegated inos? */
> +	if (xa_empty(&ci->i_auth_cap->session->s_delegated_inos))
> +		goto no_async;
> +
> +	if (!ceph_file_layout_is_valid(&ci->i_cached_layout))
> +		goto no_async;
> +
> +	/* Use LOOKUP_RCU since we're under i_ceph_lock */
> +	if (!__ceph_dir_is_complete(ci) &&
> +	    !dentry_lease_is_valid(dentry, LOOKUP_RCU))
> +		goto no_async;

dentry_lease_is_valid() checks the dentry lease. When the directory inode
has Fsx caps, the MDS does not issue leases for individual dentries. The
check here should be something like dir_lease_is_valid()

> +
> +	if (!(__ceph_caps_issued(ci, NULL) &
> +	      (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE)))
> +		goto no_async;
> +
> +	ino = ceph_get_deleg_ino(ci->i_auth_cap->session);
> +	if (!ino)
> +		goto no_async;
> +
> +	*pino = ino;
> +	ceph_take_cap_refs(ci, CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE, false);
> +	memcpy(lo, &ci->i_cached_layout, sizeof(*lo));
> +	rcu_assign_pointer(lo->pool_ns,
> +			   ceph_try_get_string(ci->i_cached_layout.pool_ns));
> +	ret = true;
> +no_async:
> +	spin_unlock(&ci->i_ceph_lock);
> +	return ret;
> +}
> +
> +static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
> +                                 struct ceph_mds_request *req)
> +{
> +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
> +
> +	if (req->r_target_inode) {
> +		struct ceph_inode_info *ci = ceph_inode(req->r_target_inode);
> +		u64 ino = ceph_vino(req->r_target_inode).ino;
> +
> +		if (req->r_deleg_ino != ino)
> +			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%lx target=0x%llx\n",
> +				__func__, req->r_err, req->r_deleg_ino, ino);
> +		mapping_set_error(req->r_target_inode->i_mapping, req->r_err);
> +
> +		spin_lock(&ci->i_ceph_lock);
> +		if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
> +			ci->i_ceph_flags &= ~CEPH_I_ASYNC_CREATE;
> +			wake_up_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT);
> +		}
> +		spin_unlock(&ci->i_ceph_lock);
> +	} else {
> +		pr_warn("%s: no req->r_target_inode for 0x%lx\n", __func__,
> +			req->r_deleg_ino);
> +	}
> +	ceph_put_cap_refs(ceph_inode(req->r_parent),
> +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
> +}
> +
> +static int ceph_finish_async_create(struct inode *dir, struct dentry *dentry,
> +				    struct file *file, umode_t mode,
> +				    struct ceph_mds_request *req,
> +				    struct ceph_acl_sec_ctx *as_ctx,
> +				    struct ceph_file_layout *lo)
> +{
> +	int ret;
> +	char xattr_buf[4];
> +	struct ceph_mds_reply_inode in = { };
> +	struct ceph_mds_reply_info_in iinfo = { .in = &in };
> +	struct ceph_inode_info *ci = ceph_inode(dir);
> +	struct inode *inode;
> +	struct timespec64 now;
> +	struct ceph_vino vino = { .ino = req->r_deleg_ino,
> +				  .snap = CEPH_NOSNAP };
> +
> +	ktime_get_real_ts64(&now);
> +
> +	inode = ceph_get_inode(dentry->d_sb, vino);
> +	if (IS_ERR(inode))
> +		return PTR_ERR(inode);
> +
> +	iinfo.inline_version = CEPH_INLINE_NONE;
> +	iinfo.change_attr = 1;
> +	ceph_encode_timespec64(&iinfo.btime, &now);
> +
> +	iinfo.xattr_len = ARRAY_SIZE(xattr_buf);
> +	iinfo.xattr_data = xattr_buf;
> +	memset(iinfo.xattr_data, 0, iinfo.xattr_len);
> +
> +	in.ino = cpu_to_le64(vino.ino);
> +	in.snapid = cpu_to_le64(CEPH_NOSNAP);
> +	in.version = cpu_to_le64(1);	// ???
> +	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
> +	in.cap.cap_id = cpu_to_le64(1);
> +	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
> +	in.cap.flags = CEPH_CAP_FLAG_AUTH;
> +	in.ctime = in.mtime = in.atime = iinfo.btime;
> +	in.mode = cpu_to_le32((u32)mode);
> +	in.truncate_seq = cpu_to_le32(1);
> +	in.truncate_size = cpu_to_le64(-1ULL);
> +	in.xattr_version = cpu_to_le64(1);
> +	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
> +	in.gid = cpu_to_le32(from_kgid(&init_user_ns, dir->i_mode & S_ISGID ?
> +				dir->i_gid : current_fsgid()));
> +	in.nlink = cpu_to_le32(1);
> +	in.max_size = cpu_to_le64(lo->stripe_unit);
> +
> +	ceph_file_layout_to_legacy(lo, &in.layout);
> +
> +	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
> +			      req->r_fmode, NULL);
> +	if (ret) {
> +		dout("%s failed to fill inode: %d\n", __func__, ret);
> +		if (inode->i_state & I_NEW)
> +			discard_new_inode(inode);
> +	} else {
> +		struct dentry *dn;
> +
> +		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
> +			vino.ino, dir->i_ino, dentry->d_name.name);
> +		ceph_dir_clear_ordered(dir);
> +		ceph_init_inode_acls(inode, as_ctx);
> +		if (inode->i_state & I_NEW) {
> +			/*
> +			 * If it's not I_NEW, then someone created this before
> +			 * we got here. Assume the server is aware of it at
> +			 * that point and don't worry about setting
> +			 * CEPH_I_ASYNC_CREATE.
> +			 */
> +			ceph_inode(inode)->i_ceph_flags = CEPH_I_ASYNC_CREATE;
> +			unlock_new_inode(inode);
> +		}
> +		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
> +			if (!d_unhashed(dentry))
> +				d_drop(dentry);
> +			dn = d_splice_alias(inode, dentry);
> +			WARN_ON_ONCE(dn && dn != dentry);
> +		}
> +		file->f_mode |= FMODE_CREATED;
> +		ret = finish_open(file, dentry, ceph_open);
> +	}
> +	return ret;
> +}
> +
>   /*
>    * Do a lookup + open with a single request.  If we get a non-existent
>    * file or symlink, return 1 so the VFS can retry.
> @@ -460,6 +623,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   	struct ceph_mds_request *req;
>   	struct dentry *dn;
>   	struct ceph_acl_sec_ctx as_ctx = {};
> +	bool try_async = enable_async_dirops;
>   	int mask;
>   	int err;
>   
> @@ -492,28 +656,41 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   	}
>   	req->r_dentry = dget(dentry);
>   	req->r_num_caps = 2;
> +	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> +	if (ceph_security_xattr_wanted(dir))
> +		mask |= CEPH_CAP_XATTR_SHARED;
> +	req->r_args.open.mask = cpu_to_le32(mask);
> +	req->r_parent = dir;
> +
>   	if (flags & O_CREAT) {
> +		struct ceph_file_layout lo;
> +
>   		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
>   		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
>   		if (as_ctx.pagelist) {
>   			req->r_pagelist = as_ctx.pagelist;
>   			as_ctx.pagelist = NULL;
>   		}
> +		if (try_async && try_prep_async_create(dir, dentry, &lo,
> +						       &req->r_deleg_ino)) {
> +			set_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
> +			req->r_callback = ceph_async_create_cb;
> +			err = ceph_mdsc_submit_request(mdsc, dir, req);
> +			if (!err)
> +				err = ceph_finish_async_create(dir, dentry,
> +							file, mode, req,
> +							&as_ctx, &lo);
> +			goto out_req;
> +		}
>   	}
>   
> -       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> -       if (ceph_security_xattr_wanted(dir))
> -               mask |= CEPH_CAP_XATTR_SHARED;
> -       req->r_args.open.mask = cpu_to_le32(mask);
> -
> -	req->r_parent = dir;
>   	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
>   	err = ceph_mdsc_do_request(mdsc,
>   				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
>   				   req);
>   	err = ceph_handle_snapdir(req, dentry, err);
>   	if (err)
> -		goto out_req;
> +		goto out_fmode;
>   
>   	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
>   		err = ceph_handle_notrace_create(dir, dentry);
> @@ -527,7 +704,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   		dn = NULL;
>   	}
>   	if (err)
> -		goto out_req;
> +		goto out_fmode;
>   	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
>   		/* make vfs retry on splice, ENOENT, or symlink */
>   		dout("atomic_open finish_no_open on dn %p\n", dn);
> @@ -543,9 +720,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   		}
>   		err = finish_open(file, dentry, ceph_open);
>   	}
> -out_req:
> +out_fmode:
>   	if (!req->r_err && req->r_target_inode)
>   		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
> +out_req:
>   	ceph_mdsc_put_request(req);
>   out_ctx:
>   	ceph_release_acl_sec_ctx(&as_ctx);
> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> index 91d09cf37649..e035c5194005 100644
> --- a/include/linux/ceph/ceph_fs.h
> +++ b/include/linux/ceph/ceph_fs.h
> @@ -659,6 +659,9 @@ int ceph_flags_to_mode(int flags);
>   #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
>   			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
>   			   CEPH_CAP_PIN)
> +#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
> +			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
> +			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
>   
>   #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
>   			CEPH_LOCK_IXATTR)
> 

* Re: [RFC PATCH v2 08/10] ceph: add new MDS req field to hold delegated inode number
  2020-01-15 20:59 ` [RFC PATCH v2 08/10] ceph: add new MDS req field to hold delegated inode number Jeff Layton
@ 2020-01-17 14:47   ` Ilya Dryomov
  2020-01-17 16:53     ` Jeff Layton
  0 siblings, 1 reply; 33+ messages in thread
From: Ilya Dryomov @ 2020-01-17 14:47 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Ceph Development, Yan, Zheng, Sage Weil, Patrick Donnelly, Xiubo Li

On Wed, Jan 15, 2020 at 9:59 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> Add new request field to hold the delegated inode number. Encode that
> into the message when it's set.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/mds_client.c | 3 +--
>  fs/ceph/mds_client.h | 1 +
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index e49ca0533df1..b8070e8c4686 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2466,7 +2466,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc,
>         head->op = cpu_to_le32(req->r_op);
>         head->caller_uid = cpu_to_le32(from_kuid(&init_user_ns, req->r_uid));
>         head->caller_gid = cpu_to_le32(from_kgid(&init_user_ns, req->r_gid));
> -       head->ino = 0;
> +       head->ino = cpu_to_le64(req->r_deleg_ino);
>         head->args = req->r_args;
>
>         ceph_encode_filepath(&p, end, ino1, path1);
> @@ -2627,7 +2627,6 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
>         rhead->flags = cpu_to_le32(flags);
>         rhead->num_fwd = req->r_num_fwd;
>         rhead->num_retry = req->r_attempts - 1;
> -       rhead->ino = 0;
>
>         dout(" r_parent = %p\n", req->r_parent);
>         return 0;
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 2a32afa15eb6..0811543ffd79 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -308,6 +308,7 @@ struct ceph_mds_request {
>         int               r_num_fwd;    /* number of forward attempts */
>         int               r_resend_mds; /* mds to resend to next, if any*/
>         u32               r_sent_on_mseq; /* cap mseq request was sent at*/
> +       unsigned long     r_deleg_ino;

u64, as head->ino is __le64?

Thanks,

                Ilya

* Re: [RFC PATCH v2 07/10] ceph: add infrastructure for waiting for async create to complete
  2020-01-15 20:59 ` [RFC PATCH v2 07/10] ceph: add infrastructure for waiting for async create to complete Jeff Layton
@ 2020-01-17 15:00   ` Ilya Dryomov
  2020-01-17 16:49     ` Jeff Layton
  0 siblings, 1 reply; 33+ messages in thread
From: Ilya Dryomov @ 2020-01-17 15:00 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Ceph Development, Yan, Zheng, Sage Weil, Patrick Donnelly, Xiubo Li

On Wed, Jan 15, 2020 at 9:59 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> When we issue an async create, we must ensure that any later on-the-wire
> requests involving it wait for the create reply.
>
> Expand i_ceph_flags to be an unsigned long, and add a new bit that
> MDS requests can wait on. If the bit is set in the inode when sending
> caps, then don't send it and just return that it has been delayed.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/caps.c       |  9 ++++++++-
>  fs/ceph/dir.c        |  2 +-
>  fs/ceph/mds_client.c | 12 +++++++++++-
>  fs/ceph/super.h      |  4 +++-
>  4 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index c983990acb75..9d1a3d6831f7 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -511,7 +511,7 @@ static void __cap_delay_requeue(struct ceph_mds_client *mdsc,
>                                 struct ceph_inode_info *ci,
>                                 bool set_timeout)
>  {
> -       dout("__cap_delay_requeue %p flags %d at %lu\n", &ci->vfs_inode,
> +       dout("__cap_delay_requeue %p flags 0x%lx at %lu\n", &ci->vfs_inode,
>              ci->i_ceph_flags, ci->i_hold_caps_max);
>         if (!mdsc->stopping) {
>                 spin_lock(&mdsc->cap_delay_lock);
> @@ -1298,6 +1298,13 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
>         int delayed = 0;
>         int ret;
>
> +       /* Don't send anything if it's still being created. Return delayed */
> +       if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
> +               spin_unlock(&ci->i_ceph_lock);
> +               dout("%s async create in flight for %p\n", __func__, inode);
> +               return 1;
> +       }
> +
>         held = cap->issued | cap->implemented;
>         revoking = cap->implemented & ~cap->issued;
>         retain &= ~revoking;
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 0d97c2962314..b2bcd01ab4e9 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -752,7 +752,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
>                 struct ceph_dentry_info *di = ceph_dentry(dentry);
>
>                 spin_lock(&ci->i_ceph_lock);
> -               dout(" dir %p flags are %d\n", dir, ci->i_ceph_flags);
> +               dout(" dir %p flags are 0x%lx\n", dir, ci->i_ceph_flags);
>                 if (strncmp(dentry->d_name.name,
>                             fsc->mount_options->snapdir_name,
>                             dentry->d_name.len) &&
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index f06496bb5705..e49ca0533df1 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2806,14 +2806,24 @@ static void kick_requests(struct ceph_mds_client *mdsc, int mds)
>         }
>  }
>
> +static int ceph_wait_on_async_create(struct inode *inode)
> +{
> +       struct ceph_inode_info *ci = ceph_inode(inode);
> +
> +       return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
> +                          TASK_INTERRUPTIBLE);
> +}
> +
>  int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
>                               struct ceph_mds_request *req)
>  {
>         int err;
>
>         /* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
> -       if (req->r_inode)
> +       if (req->r_inode) {
> +               ceph_wait_on_async_create(req->r_inode);

This is waiting interruptibly, but ignoring the distinction between
CEPH_ASYNC_CREATE_BIT getting cleared and a signal.  Do we care?  If
not, it deserves a comment (or should ceph_wait_on_async_create() be
void?).

Thanks,

                Ilya

* Re: [RFC PATCH v2 07/10] ceph: add infrastructure for waiting for async create to complete
  2020-01-17 15:00   ` Ilya Dryomov
@ 2020-01-17 16:49     ` Jeff Layton
  0 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-17 16:49 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Ceph Development, Yan, Zheng, Sage Weil, Patrick Donnelly, Xiubo Li

On Fri, 2020-01-17 at 16:00 +0100, Ilya Dryomov wrote:
> On Wed, Jan 15, 2020 at 9:59 PM Jeff Layton <jlayton@kernel.org> wrote:
> > When we issue an async create, we must ensure that any later on-the-wire
> > requests involving it wait for the create reply.
> > 
> > Expand i_ceph_flags to be an unsigned long, and add a new bit that
> > MDS requests can wait on. If the bit is set in the inode when sending
> > caps, then don't send it and just return that it has been delayed.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/ceph/caps.c       |  9 ++++++++-
> >  fs/ceph/dir.c        |  2 +-
> >  fs/ceph/mds_client.c | 12 +++++++++++-
> >  fs/ceph/super.h      |  4 +++-
> >  4 files changed, 23 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index c983990acb75..9d1a3d6831f7 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -511,7 +511,7 @@ static void __cap_delay_requeue(struct ceph_mds_client *mdsc,
> >                                 struct ceph_inode_info *ci,
> >                                 bool set_timeout)
> >  {
> > -       dout("__cap_delay_requeue %p flags %d at %lu\n", &ci->vfs_inode,
> > +       dout("__cap_delay_requeue %p flags 0x%lx at %lu\n", &ci->vfs_inode,
> >              ci->i_ceph_flags, ci->i_hold_caps_max);
> >         if (!mdsc->stopping) {
> >                 spin_lock(&mdsc->cap_delay_lock);
> > @@ -1298,6 +1298,13 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
> >         int delayed = 0;
> >         int ret;
> > 
> > +       /* Don't send anything if it's still being created. Return delayed */
> > +       if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
> > +               spin_unlock(&ci->i_ceph_lock);
> > +               dout("%s async create in flight for %p\n", __func__, inode);
> > +               return 1;
> > +       }
> > +
> >         held = cap->issued | cap->implemented;
> >         revoking = cap->implemented & ~cap->issued;
> >         retain &= ~revoking;
> > diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> > index 0d97c2962314..b2bcd01ab4e9 100644
> > --- a/fs/ceph/dir.c
> > +++ b/fs/ceph/dir.c
> > @@ -752,7 +752,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
> >                 struct ceph_dentry_info *di = ceph_dentry(dentry);
> > 
> >                 spin_lock(&ci->i_ceph_lock);
> > -               dout(" dir %p flags are %d\n", dir, ci->i_ceph_flags);
> > +               dout(" dir %p flags are 0x%lx\n", dir, ci->i_ceph_flags);
> >                 if (strncmp(dentry->d_name.name,
> >                             fsc->mount_options->snapdir_name,
> >                             dentry->d_name.len) &&
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index f06496bb5705..e49ca0533df1 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -2806,14 +2806,24 @@ static void kick_requests(struct ceph_mds_client *mdsc, int mds)
> >         }
> >  }
> > 
> > +static int ceph_wait_on_async_create(struct inode *inode)
> > +{
> > +       struct ceph_inode_info *ci = ceph_inode(inode);
> > +
> > +       return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
> > +                          TASK_INTERRUPTIBLE);
> > +}
> > +
> >  int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
> >                               struct ceph_mds_request *req)
> >  {
> >         int err;
> > 
> >         /* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
> > -       if (req->r_inode)
> > +       if (req->r_inode) {
> > +               ceph_wait_on_async_create(req->r_inode);
> 
> This is waiting interruptibly, but ignoring the distinction between
> CEPH_ASYNC_CREATE_BIT getting cleared and a signal.  Do we care?  If
> not, it deserves a comment (or should ceph_wait_on_async_create() be
> void?).
> 

You're absolutely right -- we do need to catch and handle signals here,
I think. I'll fix that for the next version.

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [RFC PATCH v2 08/10] ceph: add new MDS req field to hold delegated inode number
  2020-01-17 14:47   ` Ilya Dryomov
@ 2020-01-17 16:53     ` Jeff Layton
  2020-01-17 17:42       ` Ilya Dryomov
  0 siblings, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2020-01-17 16:53 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Ceph Development, Yan, Zheng, Sage Weil, Patrick Donnelly, Xiubo Li

On Fri, 2020-01-17 at 15:47 +0100, Ilya Dryomov wrote:
> On Wed, Jan 15, 2020 at 9:59 PM Jeff Layton <jlayton@kernel.org> wrote:
> > Add new request field to hold the delegated inode number. Encode that
> > into the message when it's set.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/ceph/mds_client.c | 3 +--
> >  fs/ceph/mds_client.h | 1 +
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index e49ca0533df1..b8070e8c4686 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -2466,7 +2466,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc,
> >         head->op = cpu_to_le32(req->r_op);
> >         head->caller_uid = cpu_to_le32(from_kuid(&init_user_ns, req->r_uid));
> >         head->caller_gid = cpu_to_le32(from_kgid(&init_user_ns, req->r_gid));
> > -       head->ino = 0;
> > +       head->ino = cpu_to_le64(req->r_deleg_ino);
> >         head->args = req->r_args;
> > 
> >         ceph_encode_filepath(&p, end, ino1, path1);
> > @@ -2627,7 +2627,6 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
> >         rhead->flags = cpu_to_le32(flags);
> >         rhead->num_fwd = req->r_num_fwd;
> >         rhead->num_retry = req->r_attempts - 1;
> > -       rhead->ino = 0;
> > 
> >         dout(" r_parent = %p\n", req->r_parent);
> >         return 0;
> > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > index 2a32afa15eb6..0811543ffd79 100644
> > --- a/fs/ceph/mds_client.h
> > +++ b/fs/ceph/mds_client.h
> > @@ -308,6 +308,7 @@ struct ceph_mds_request {
> >         int               r_num_fwd;    /* number of forward attempts */
> >         int               r_resend_mds; /* mds to resend to next, if any*/
> >         u32               r_sent_on_mseq; /* cap mseq request was sent at*/
> > +       unsigned long     r_deleg_ino;
> 
> u64, as head->ino is __le64?
> 

Does that actually matter? It should get promoted to 64 bit when we do
the encoding since we're passing by value, and this will never be larger
than 32 bits on a 32 bit box.

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [RFC PATCH v2 10/10] ceph: attempt to do async create when possible
  2020-01-17 13:28   ` Yan, Zheng
@ 2020-01-17 17:40     ` Jeff Layton
  2020-01-18  2:42       ` Yan, Zheng
  0 siblings, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2020-01-17 17:40 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell, xiubli

On Fri, 2020-01-17 at 21:28 +0800, Yan, Zheng wrote:
> On 1/16/20 4:59 AM, Jeff Layton wrote:
> > With the Octopus release, the MDS will hand out directory create caps.
> > 
> > If we have Fxc caps on the directory, and complete directory information
> > or a known negative dentry, then we can return without waiting on the
> > reply, allowing the open() call to return very quickly to userland.
> > 
> > We use the normal ceph_fill_inode() routine to fill in the inode, so we
> > have to gin up some reply inode information with what we'd expect the
> > newly-created inode to have. The client assumes that it has a full set
> > of caps on the new inode, and that the MDS will revoke them when there
> > is conflicting access.
> > 
> > This functionality is gated on the enable_async_dirops module option,
> > along with async unlinks, and on the server supporting the necessary
> > CephFS feature bit.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >   fs/ceph/file.c               | 196 +++++++++++++++++++++++++++++++++--
> >   include/linux/ceph/ceph_fs.h |   3 +
> >   2 files changed, 190 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index b44ccbc85fe4..2742417fa5ec 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -448,6 +448,169 @@ cache_file_layout(struct inode *dst, struct inode *src)
> >   	spin_unlock(&cdst->i_ceph_lock);
> >   }
> >   
> > +/*
> > + * Try to set up an async create. We need caps, a file layout, and inode number,
> > + * and either a lease on the dentry or complete dir info. If any of those
> > + * criteria are not satisfied, then return false and the caller can go
> > + * synchronous.
> > + */
> > +static bool try_prep_async_create(struct inode *dir, struct dentry *dentry,
> > +				  struct ceph_file_layout *lo,
> > +				  unsigned long *pino)
> > +{
> > +	struct ceph_inode_info *ci = ceph_inode(dir);
> > +	bool ret = false;
> > +	unsigned long ino;
> > +
> > +	spin_lock(&ci->i_ceph_lock);
> > +	/* No auth cap means no chance for Dc caps */
> > +	if (!ci->i_auth_cap)
> > +		goto no_async;
> > +
> > +	/* Any delegated inos? */
> > +	if (xa_empty(&ci->i_auth_cap->session->s_delegated_inos))
> > +		goto no_async;
> > +
> > +	if (!ceph_file_layout_is_valid(&ci->i_cached_layout))
> > +		goto no_async;
> > +
> > +	/* Use LOOKUP_RCU since we're under i_ceph_lock */
> > +	if (!__ceph_dir_is_complete(ci) &&
> > +	    !dentry_lease_is_valid(dentry, LOOKUP_RCU))
> > +		goto no_async;
> 
> dentry_lease_is_valid() checks the dentry lease. When the directory inode
> has Fsx caps, the MDS does not issue leases for individual dentries. The
> check here should be something like dir_lease_is_valid()

Ok, I think I get it. The catch here is that we're calling this from
atomic_open, so we may be dealing with a dentry that is brand new and
has never had a lookup. I think we have to handle those two cases
differently.

This is what I'm thinking:

---
 fs/ceph/file.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 7b14dba92266..a3eb38fac68a 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -459,6 +459,7 @@ static bool try_prep_async_create(struct inode *dir, struct dentry *dentry,
 				  unsigned long *pino)
 {
 	struct ceph_inode_info *ci = ceph_inode(dir);
+	struct ceph_dentry_info *di = ceph_dentry(dentry);
 	bool ret = false;
 	unsigned long ino;
 
@@ -474,16 +475,19 @@ static bool try_prep_async_create(struct inode *dir, struct dentry *dentry,
 	if (!ceph_file_layout_is_valid(&ci->i_cached_layout))
 		goto no_async;
 
-	/* Use LOOKUP_RCU since we're under i_ceph_lock */
-	if (!__ceph_dir_is_complete(ci) &&
-	    !dentry_lease_is_valid(dentry, LOOKUP_RCU))
-		goto no_async;
-
 	if ((__ceph_caps_issued(ci, NULL) &
 	     (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE)) !=
 	    (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE))
 		goto no_async;
 
+	if (d_in_lookup(dentry)) {
+		if (!__ceph_dir_is_complete(ci))
+			goto no_async;
+	} else if (atomic_read(&ci->i_shared_gen) !=
+		   READ_ONCE(di->lease_shared_gen)) {
+		goto no_async;
+	}
+
 	ino = ceph_get_deleg_ino(ci->i_auth_cap->session);
 	if (!ino)
 		goto no_async;
-- 
2.24.1
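The two-case logic in the hunk above can be modeled with a small userspace sketch (the names below are illustrative stand-ins, not ceph.ko symbols): a dentry still in lookup has never been validated against the MDS, so only complete directory info proves the name is absent, while an already-instantiated negative dentry stays trustworthy as long as its lease_shared_gen still matches the directory's i_shared_gen.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the d_in_lookup() split in the patch above.
 * in_lookup        - dentry is brand new, never looked up
 * dir_complete     - we hold complete directory info (Fsx-backed)
 * *_shared_gen     - stand-ins for i_shared_gen / lease_shared_gen */
static bool can_async_create(bool in_lookup, bool dir_complete,
			     unsigned dir_shared_gen,
			     unsigned dentry_shared_gen)
{
	if (in_lookup)
		return dir_complete;		/* no lease to rely on */
	return dir_shared_gen == dentry_shared_gen; /* lease gen match */
}
```

Under this model, a fresh dentry qualifies only via completeness, and a stale shared-gen disqualifies an existing dentry even when the directory is complete.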


* Re: [RFC PATCH v2 08/10] ceph: add new MDS req field to hold delegated inode number
  2020-01-17 16:53     ` Jeff Layton
@ 2020-01-17 17:42       ` Ilya Dryomov
  2020-01-17 18:31         ` Jeff Layton
  0 siblings, 1 reply; 33+ messages in thread
From: Ilya Dryomov @ 2020-01-17 17:42 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Ceph Development, Yan, Zheng, Sage Weil, Patrick Donnelly, Xiubo Li

On Fri, Jan 17, 2020 at 5:53 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Fri, 2020-01-17 at 15:47 +0100, Ilya Dryomov wrote:
> > On Wed, Jan 15, 2020 at 9:59 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > Add new request field to hold the delegated inode number. Encode that
> > > into the message when it's set.
> > >
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/ceph/mds_client.c | 3 +--
> > >  fs/ceph/mds_client.h | 1 +
> > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > index e49ca0533df1..b8070e8c4686 100644
> > > --- a/fs/ceph/mds_client.c
> > > +++ b/fs/ceph/mds_client.c
> > > @@ -2466,7 +2466,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc,
> > >         head->op = cpu_to_le32(req->r_op);
> > >         head->caller_uid = cpu_to_le32(from_kuid(&init_user_ns, req->r_uid));
> > >         head->caller_gid = cpu_to_le32(from_kgid(&init_user_ns, req->r_gid));
> > > -       head->ino = 0;
> > > +       head->ino = cpu_to_le64(req->r_deleg_ino);
> > >         head->args = req->r_args;
> > >
> > >         ceph_encode_filepath(&p, end, ino1, path1);
> > > @@ -2627,7 +2627,6 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
> > >         rhead->flags = cpu_to_le32(flags);
> > >         rhead->num_fwd = req->r_num_fwd;
> > >         rhead->num_retry = req->r_attempts - 1;
> > > -       rhead->ino = 0;
> > >
> > >         dout(" r_parent = %p\n", req->r_parent);
> > >         return 0;
> > > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > > index 2a32afa15eb6..0811543ffd79 100644
> > > --- a/fs/ceph/mds_client.h
> > > +++ b/fs/ceph/mds_client.h
> > > @@ -308,6 +308,7 @@ struct ceph_mds_request {
> > >         int               r_num_fwd;    /* number of forward attempts */
> > >         int               r_resend_mds; /* mds to resend to next, if any*/
> > >         u32               r_sent_on_mseq; /* cap mseq request was sent at*/
> > > +       unsigned long     r_deleg_ino;
> >
> > u64, as head->ino is __le64?
> >
>
> Does that actually matter? It should get promoted to 64 bit when we do
> the encoding since we're passing by value, and this will never be larger
> than 32 bits on a 32 bit box.

It raises eyebrows -- one needs to remember that inode numbers
fit into unsigned long when looking at the code.  We are using u64
for inode numbers throughout ceph.ko: ceph_vino, ino parameters to
various functions, etc -- not just in the wire format definitions.
I think sticking to u64 is more clear and consistent.

Thanks,

                Ilya
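Both sides of this exchange can be checked with a small userspace sketch (illustrative only; these helpers are not ceph.ko symbols, and the sample value reuses the delegated ino from the cover letter's dmesg line). Jeff's point holds: integer promotion widens the argument at the cpu_to_le64()-style encoding step. Ilya's concern is the step before that: a 64-bit inode number stored into a field that is only 32 bits wide (unsigned long on a 32-bit arch) is truncated before it ever reaches the encoder.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model, not ceph.ko code.
 * store32() mimics keeping an ino in a 32-bit-wide field
 * (unsigned long on a 32-bit box); store64() mimics a u64 field.
 * encode64() mimics the cpu_to_le64()-style encoding step: the
 * argument is promoted to 64 bits whatever type the caller passed. */
static inline uint32_t store32(uint64_t ino) { return (uint32_t)ino; }
static inline uint64_t store64(uint64_t ino) { return ino; }
static inline uint64_t encode64(uint64_t v)  { return v; }
```

With a u64 field the value survives end to end; with a 32-bit field the high bits of an ino like 0x100001232d9 are silently dropped before encoding.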


* Re: [RFC PATCH v2 08/10] ceph: add new MDS req field to hold delegated inode number
  2020-01-17 17:42       ` Ilya Dryomov
@ 2020-01-17 18:31         ` Jeff Layton
  2020-01-20  9:41           ` Ilya Dryomov
  0 siblings, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2020-01-17 18:31 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Ceph Development, Yan, Zheng, Sage Weil, Patrick Donnelly, Xiubo Li

On Fri, 2020-01-17 at 18:42 +0100, Ilya Dryomov wrote:
> On Fri, Jan 17, 2020 at 5:53 PM Jeff Layton <jlayton@kernel.org> wrote:
> > On Fri, 2020-01-17 at 15:47 +0100, Ilya Dryomov wrote:
> > > On Wed, Jan 15, 2020 at 9:59 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > Add new request field to hold the delegated inode number. Encode that
> > > > into the message when it's set.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  fs/ceph/mds_client.c | 3 +--
> > > >  fs/ceph/mds_client.h | 1 +
> > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > > index e49ca0533df1..b8070e8c4686 100644
> > > > --- a/fs/ceph/mds_client.c
> > > > +++ b/fs/ceph/mds_client.c
> > > > @@ -2466,7 +2466,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc,
> > > >         head->op = cpu_to_le32(req->r_op);
> > > >         head->caller_uid = cpu_to_le32(from_kuid(&init_user_ns, req->r_uid));
> > > >         head->caller_gid = cpu_to_le32(from_kgid(&init_user_ns, req->r_gid));
> > > > -       head->ino = 0;
> > > > +       head->ino = cpu_to_le64(req->r_deleg_ino);
> > > >         head->args = req->r_args;
> > > > 
> > > >         ceph_encode_filepath(&p, end, ino1, path1);
> > > > @@ -2627,7 +2627,6 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
> > > >         rhead->flags = cpu_to_le32(flags);
> > > >         rhead->num_fwd = req->r_num_fwd;
> > > >         rhead->num_retry = req->r_attempts - 1;
> > > > -       rhead->ino = 0;
> > > > 
> > > >         dout(" r_parent = %p\n", req->r_parent);
> > > >         return 0;
> > > > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > > > index 2a32afa15eb6..0811543ffd79 100644
> > > > --- a/fs/ceph/mds_client.h
> > > > +++ b/fs/ceph/mds_client.h
> > > > @@ -308,6 +308,7 @@ struct ceph_mds_request {
> > > >         int               r_num_fwd;    /* number of forward attempts */
> > > >         int               r_resend_mds; /* mds to resend to next, if any*/
> > > >         u32               r_sent_on_mseq; /* cap mseq request was sent at*/
> > > > +       unsigned long     r_deleg_ino;
> > > 
> > > u64, as head->ino is __le64?
> > > 
> > 
> > Does that actually matter? It should get promoted to 64 bit when we do
> > the encoding since we're passing by value, and this will never be larger
> > than 32 bits on a 32 bit box.
> 
> It raises eyebrows -- one needs to remember that inode numbers
> fit into unsigned long when looking at the code.  We are using u64
> for inode numbers throughout ceph.ko: ceph_vino, ino parameters to
> various functions, etc -- not just in the wire format definitions.
> I think sticking to u64 is more clear and consistent.
> 

Yeah, now that I think about it, you're right. This is going to be an
inode number presented to the MDS, so it will always need to be 64-bit.
Fixed in my tree.

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [RFC PATCH v2 10/10] ceph: attempt to do async create when possible
  2020-01-17 17:40     ` Jeff Layton
@ 2020-01-18  2:42       ` Yan, Zheng
  0 siblings, 0 replies; 33+ messages in thread
From: Yan, Zheng @ 2020-01-18  2:42 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell, xiubli

On 1/18/20 1:40 AM, Jeff Layton wrote:
> On Fri, 2020-01-17 at 21:28 +0800, Yan, Zheng wrote:
>> On 1/16/20 4:59 AM, Jeff Layton wrote:
>>> With the Octopus release, the MDS will hand out directory create caps.
>>>
>>> If we have Fxc caps on the directory, and complete directory information
>>> or a known negative dentry, then we can return without waiting on the
>>> reply, allowing the open() call to return very quickly to userland.
>>>
>>> We use the normal ceph_fill_inode() routine to fill in the inode, so we
>>> have to gin up some reply inode information with what we'd expect the
>>> newly-created inode to have. The client assumes that it has a full set
>>> of caps on the new inode, and that the MDS will revoke them when there
>>> is conflicting access.
>>>
>>> This functionality is gated on the enable_async_dirops module option,
>>> along with async unlinks, and on the server supporting the necessary
>>> CephFS feature bit.
>>>
>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>> ---
>>>    fs/ceph/file.c               | 196 +++++++++++++++++++++++++++++++++--
>>>    include/linux/ceph/ceph_fs.h |   3 +
>>>    2 files changed, 190 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index b44ccbc85fe4..2742417fa5ec 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -448,6 +448,169 @@ cache_file_layout(struct inode *dst, struct inode *src)
>>>    	spin_unlock(&cdst->i_ceph_lock);
>>>    }
>>>    
>>> +/*
>>> + * Try to set up an async create. We need caps, a file layout, and inode number,
>>> + * and either a lease on the dentry or complete dir info. If any of those
>>> + * criteria are not satisfied, then return false and the caller can go
>>> + * synchronous.
>>> + */
>>> +static bool try_prep_async_create(struct inode *dir, struct dentry *dentry,
>>> +				  struct ceph_file_layout *lo,
>>> +				  unsigned long *pino)
>>> +{
>>> +	struct ceph_inode_info *ci = ceph_inode(dir);
>>> +	bool ret = false;
>>> +	unsigned long ino;
>>> +
>>> +	spin_lock(&ci->i_ceph_lock);
>>> +	/* No auth cap means no chance for Dc caps */
>>> +	if (!ci->i_auth_cap)
>>> +		goto no_async;
>>> +
>>> +	/* Any delegated inos? */
>>> +	if (xa_empty(&ci->i_auth_cap->session->s_delegated_inos))
>>> +		goto no_async;
>>> +
>>> +	if (!ceph_file_layout_is_valid(&ci->i_cached_layout))
>>> +		goto no_async;
>>> +
>>> +	/* Use LOOKUP_RCU since we're under i_ceph_lock */
>>> +	if (!__ceph_dir_is_complete(ci) &&
>>> +	    !dentry_lease_is_valid(dentry, LOOKUP_RCU))
>>> +		goto no_async;
>>
>> dentry_lease_is_valid() checks the dentry lease. When the directory inode
>> has Fsx caps, the MDS does not issue leases for individual dentries. The
>> check here should be something like dir_lease_is_valid()
> 
> Ok, I think I get it. The catch here is that we're calling this from
> atomic_open, so we may be dealing with a dentry that is brand new and
> has never had a lookup. I think we have to handle those two cases
> differently.
> 
> This is what I'm thinking:
> 
> ---
>   fs/ceph/file.c | 14 +++++++++-----
>   1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 7b14dba92266..a3eb38fac68a 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -459,6 +459,7 @@ static bool try_prep_async_create(struct inode *dir, struct dentry *dentry,
>   				  unsigned long *pino)
>   {
>   	struct ceph_inode_info *ci = ceph_inode(dir);
> +	struct ceph_dentry_info *di = ceph_dentry(dentry);
>   	bool ret = false;
>   	unsigned long ino;
>   
> @@ -474,16 +475,19 @@ static bool try_prep_async_create(struct inode *dir, struct dentry *dentry,
>   	if (!ceph_file_layout_is_valid(&ci->i_cached_layout))
>   		goto no_async;
>   
> -	/* Use LOOKUP_RCU since we're under i_ceph_lock */
> -	if (!__ceph_dir_is_complete(ci) &&
> -	    !dentry_lease_is_valid(dentry, LOOKUP_RCU))
> -		goto no_async;
> -
>   	if ((__ceph_caps_issued(ci, NULL) &
>   	     (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE)) !=
>   	    (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE))
>   		goto no_async;
>   
> +	if (d_in_lookup(dentry)) {
> +		if (!__ceph_dir_is_complete(ci))
> +			goto no_async;
> +	} else if (atomic_read(&ci->i_shared_gen) !=
> +		   READ_ONCE(di->lease_shared_gen)) {
> +		goto no_async;
> 

Makes sense


>   	ino = ceph_get_deleg_ino(ci->i_auth_cap->session);
>   	if (!ino)
>   		goto no_async;
> 


* Re: [RFC PATCH v2 08/10] ceph: add new MDS req field to hold delegated inode number
  2020-01-17 18:31         ` Jeff Layton
@ 2020-01-20  9:41           ` Ilya Dryomov
  0 siblings, 0 replies; 33+ messages in thread
From: Ilya Dryomov @ 2020-01-20  9:41 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Ceph Development, Yan, Zheng, Sage Weil, Patrick Donnelly, Xiubo Li

On Fri, Jan 17, 2020 at 7:31 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Fri, 2020-01-17 at 18:42 +0100, Ilya Dryomov wrote:
> > On Fri, Jan 17, 2020 at 5:53 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > On Fri, 2020-01-17 at 15:47 +0100, Ilya Dryomov wrote:
> > > > On Wed, Jan 15, 2020 at 9:59 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > Add new request field to hold the delegated inode number. Encode that
> > > > > into the message when it's set.
> > > > >
> > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > ---
> > > > >  fs/ceph/mds_client.c | 3 +--
> > > > >  fs/ceph/mds_client.h | 1 +
> > > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > > > index e49ca0533df1..b8070e8c4686 100644
> > > > > --- a/fs/ceph/mds_client.c
> > > > > +++ b/fs/ceph/mds_client.c
> > > > > @@ -2466,7 +2466,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc,
> > > > >         head->op = cpu_to_le32(req->r_op);
> > > > >         head->caller_uid = cpu_to_le32(from_kuid(&init_user_ns, req->r_uid));
> > > > >         head->caller_gid = cpu_to_le32(from_kgid(&init_user_ns, req->r_gid));
> > > > > -       head->ino = 0;
> > > > > +       head->ino = cpu_to_le64(req->r_deleg_ino);
> > > > >         head->args = req->r_args;
> > > > >
> > > > >         ceph_encode_filepath(&p, end, ino1, path1);
> > > > > @@ -2627,7 +2627,6 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
> > > > >         rhead->flags = cpu_to_le32(flags);
> > > > >         rhead->num_fwd = req->r_num_fwd;
> > > > >         rhead->num_retry = req->r_attempts - 1;
> > > > > -       rhead->ino = 0;
> > > > >
> > > > >         dout(" r_parent = %p\n", req->r_parent);
> > > > >         return 0;
> > > > > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > > > > index 2a32afa15eb6..0811543ffd79 100644
> > > > > --- a/fs/ceph/mds_client.h
> > > > > +++ b/fs/ceph/mds_client.h
> > > > > @@ -308,6 +308,7 @@ struct ceph_mds_request {
> > > > >         int               r_num_fwd;    /* number of forward attempts */
> > > > >         int               r_resend_mds; /* mds to resend to next, if any*/
> > > > >         u32               r_sent_on_mseq; /* cap mseq request was sent at*/
> > > > > +       unsigned long     r_deleg_ino;
> > > >
> > > > u64, as head->ino is __le64?
> > > >
> > >
> > > Does that actually matter? It should get promoted to 64 bit when we do
> > > the encoding since we're passing by value, and this will never be larger
> > > than 32 bits on a 32 bit box.
> >
> > It raises eyebrows -- one needs to remember that inode numbers
> > fit into unsigned long when looking at the code.  We are using u64
> > for inode numbers throughout ceph.ko: ceph_vino, ino parameters to
> > various functions, etc -- not just in the wire format definitions.
> > I think sticking to u64 is more clear and consistent.
> >
>
> Yeah, now that I think about it, you're right. This is going to be an
> inode number presented to the MDS, so it will always need to be 64-bit.
> Fixed in my tree.

Other patches in this set introduce function parameters and local
variables that are unsigned long and used for inode numbers.  I think
they need to be fixed up as well.

Thanks,

                Ilya


* Re: [RFC PATCH v2 00/10] ceph: asynchronous file create support
  2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (12 preceding siblings ...)
  2020-01-16 13:10 ` Yan, Zheng
@ 2020-01-20 13:20 ` Yan, Zheng
  2020-01-21 10:56   ` Jeff Layton
  13 siblings, 1 reply; 33+ messages in thread
From: Yan, Zheng @ 2020-01-20 13:20 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell, xiubli

On 1/16/20 4:59 AM, Jeff Layton wrote:
> v2:
> - move cached layout to dedicated field in inode
> - protect cached layout with i_ceph_lock
> - wipe cached layout in __check_cap_issue
> - set max_size of file to layout.stripe_unit
> - set truncate_size to (u64)-1
> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> - set cap_id to 1 in async created inode
> - allocate inode number before submitting request
> - rework the prep for an async create to be more efficient
> - don't allow MDS or cap messages involving an inode until we get async
>    create reply
> 
> A lot of changes in this set, mostly based on Zheng and Xiubo's
> comments. Performance is pretty similar to the previous set:
> 
> Untarring a kernel tarball into a cephfs takes about 98s with
> async dirops disabled. With them enabled, it takes around 78s,
> which is about a 25% improvement.
> 
> This is not quite ready for merge. Error handling could still be
> improved. With xfstest generic/531, I see some messages like this pop
> up in the ring buffer:
> 
>      [ 7331.393110] ceph: ceph_async_create_cb: inode number mismatch! err=0 deleg_ino=0x100001232d9 target=0x100001232b9
> 
> Basically, we went to do an async create and got a different inode
> number back than expected. That still needs investigation, but I
> didn't see any test failures due to it.
> 
> Jeff Layton (10):
>    libceph: export ceph_file_layout_is_valid
>    ceph: make ceph_fill_inode non-static
>    ceph: make dentry_lease_is_valid non-static
>    ceph: make __take_cap_refs a public function
>    ceph: decode interval_sets for delegated inos
>    ceph: add flag to designate that a request is asynchronous
>    ceph: add infrastructure for waiting for async create to complete
>    ceph: add new MDS req field to hold delegated inode number
>    ceph: cache layout in parent dir on first sync create
>    ceph: attempt to do async create when possible
> 
>   fs/ceph/caps.c               |  34 ++++--
>   fs/ceph/dir.c                |  13 ++-
>   fs/ceph/file.c               | 218 +++++++++++++++++++++++++++++++++--
>   fs/ceph/inode.c              |  50 ++++----
>   fs/ceph/mds_client.c         | 126 ++++++++++++++++++--
>   fs/ceph/mds_client.h         |   9 +-
>   fs/ceph/super.h              |  16 ++-
>   include/linux/ceph/ceph_fs.h |   8 +-
>   net/ceph/ceph_fs.c           |   1 +
>   9 files changed, 410 insertions(+), 65 deletions(-)
> 

An issue with the kernel async unlink/create implementation is that
get_caps_for_async_unlink()/try_prep_async_create() are called before
ceph_mdsc_submit_request(). There is a small window in which the
session's state may change. If the session is in the wrong state,
ceph_mdsc_submit_request() may wait rather than send the request
immediately.
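The window being described can be modeled with a toy state machine (none of these names are real ceph.ko symbols): the prep step samples the session state and succeeds, the state changes underneath it, and the later submit step no longer sends immediately even though the prep work was already done.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the check-then-submit window, not ceph.ko code. */
enum session_state { SESSION_OPEN, SESSION_RECONNECTING };

static enum session_state state = SESSION_OPEN;

/* Like try_prep_async_create(): samples state, grabs resources. */
static bool prep_async(void)
{
	return state == SESSION_OPEN;
}

/* The race window: session state changes between prep and submit. */
static void session_state_change(void)
{
	state = SESSION_RECONNECTING;
}

/* Like ceph_mdsc_submit_request(): re-checks state at send time. */
static bool submit_sends_immediately(void)
{
	return state == SESSION_OPEN;
}
```

In this model a request can be successfully prepped and still end up queued behind a session state change, which is exactly the error-handling wrinkle discussed below.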


* Re: [RFC PATCH v2 00/10] ceph: asynchronous file create support
  2020-01-20 13:20 ` Yan, Zheng
@ 2020-01-21 10:56   ` Jeff Layton
  2020-01-21 13:20     ` Yan, Zheng
  0 siblings, 1 reply; 33+ messages in thread
From: Jeff Layton @ 2020-01-21 10:56 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell, xiubli

On Mon, 2020-01-20 at 21:20 +0800, Yan, Zheng wrote:
> On 1/16/20 4:59 AM, Jeff Layton wrote:
> > v2:
> > - move cached layout to dedicated field in inode
> > - protect cached layout with i_ceph_lock
> > - wipe cached layout in __check_cap_issue
> > - set max_size of file to layout.stripe_unit
> > - set truncate_size to (u64)-1
> > - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> > - set cap_id to 1 in async created inode
> > - allocate inode number before submitting request
> > - rework the prep for an async create to be more efficient
> > - don't allow MDS or cap messages involving an inode until we get async
> >    create reply
> > 
> > A lot of changes in this set, mostly based on Zheng and Xiubo's
> > comments. Performance is pretty similar to the previous set:
> > 
> > Untarring a kernel tarball into a cephfs takes about 98s with
> > async dirops disabled. With them enabled, it takes around 78s,
> > which is about a 25% improvement.
> > 
> > This is not quite ready for merge. Error handling could still be
> > improved. With xfstest generic/531, I see some messages like this pop
> > up in the ring buffer:
> > 
> >      [ 7331.393110] ceph: ceph_async_create_cb: inode number mismatch! err=0 deleg_ino=0x100001232d9 target=0x100001232b9
> > 
> > Basically, we went to do an async create and got a different inode
> > number back than expected. That still needs investigation, but I
> > didn't see any test failures due to it.
> > 
> > Jeff Layton (10):
> >    libceph: export ceph_file_layout_is_valid
> >    ceph: make ceph_fill_inode non-static
> >    ceph: make dentry_lease_is_valid non-static
> >    ceph: make __take_cap_refs a public function
> >    ceph: decode interval_sets for delegated inos
> >    ceph: add flag to designate that a request is asynchronous
> >    ceph: add infrastructure for waiting for async create to complete
> >    ceph: add new MDS req field to hold delegated inode number
> >    ceph: cache layout in parent dir on first sync create
> >    ceph: attempt to do async create when possible
> > 
> >   fs/ceph/caps.c               |  34 ++++--
> >   fs/ceph/dir.c                |  13 ++-
> >   fs/ceph/file.c               | 218 +++++++++++++++++++++++++++++++++--
> >   fs/ceph/inode.c              |  50 ++++----
> >   fs/ceph/mds_client.c         | 126 ++++++++++++++++++--
> >   fs/ceph/mds_client.h         |   9 +-
> >   fs/ceph/super.h              |  16 ++-
> >   include/linux/ceph/ceph_fs.h |   8 +-
> >   net/ceph/ceph_fs.c           |   1 +
> >   9 files changed, 410 insertions(+), 65 deletions(-)
> > 
> 
> An issue with the kernel async unlink/create implementation is that
> get_caps_for_async_unlink()/try_prep_async_create() are called before
> ceph_mdsc_submit_request(). There is a small window in which the
> session's state may change. If the session is in the wrong state,
> ceph_mdsc_submit_request() may wait rather than send the request
> immediately.
> 

Is that a real issue (other than performance)?

We hold cap references, so assuming that the session can be reconnected
and that we keep the caps, everything should still work correctly, no?
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [RFC PATCH v2 00/10] ceph: asynchronous file create support
  2020-01-21 10:56   ` Jeff Layton
@ 2020-01-21 13:20     ` Yan, Zheng
  2020-01-21 14:37       ` Jeff Layton
  0 siblings, 1 reply; 33+ messages in thread
From: Yan, Zheng @ 2020-01-21 13:20 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell, xiubli

On 1/21/20 6:56 PM, Jeff Layton wrote:
> On Mon, 2020-01-20 at 21:20 +0800, Yan, Zheng wrote:
>> On 1/16/20 4:59 AM, Jeff Layton wrote:
>>> v2:
>>> - move cached layout to dedicated field in inode
>>> - protect cached layout with i_ceph_lock
>>> - wipe cached layout in __check_cap_issue
>>> - set max_size of file to layout.stripe_unit
>>> - set truncate_size to (u64)-1
>>> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
>>> - set cap_id to 1 in async created inode
>>> - allocate inode number before submitting request
>>> - rework the prep for an async create to be more efficient
>>> - don't allow MDS or cap messages involving an inode until we get async
>>>     create reply
>>>
>>> A lot of changes in this set, mostly based on Zheng and Xiubo's
>>> comments. Performance is pretty similar to the previous set:
>>>
>>> Untarring a kernel tarball into a cephfs takes about 98s with
>>> async dirops disabled. With them enabled, it takes around 78s,
>>> which is about a 25% improvement.
>>>
>>> This is not quite ready for merge. Error handling could still be
>>> improved. With xfstest generic/531, I see some messages like this pop
>>> up in the ring buffer:
>>>
>>>       [ 7331.393110] ceph: ceph_async_create_cb: inode number mismatch! err=0 deleg_ino=0x100001232d9 target=0x100001232b9
>>>
>>> Basically, we went to do an async create and got a different inode
>>> number back than expected. That still needs investigation, but I
>>> didn't see any test failures due to it.
>>>
>>> Jeff Layton (10):
>>>     libceph: export ceph_file_layout_is_valid
>>>     ceph: make ceph_fill_inode non-static
>>>     ceph: make dentry_lease_is_valid non-static
>>>     ceph: make __take_cap_refs a public function
>>>     ceph: decode interval_sets for delegated inos
>>>     ceph: add flag to designate that a request is asynchronous
>>>     ceph: add infrastructure for waiting for async create to complete
>>>     ceph: add new MDS req field to hold delegated inode number
>>>     ceph: cache layout in parent dir on first sync create
>>>     ceph: attempt to do async create when possible
>>>
>>>    fs/ceph/caps.c               |  34 ++++--
>>>    fs/ceph/dir.c                |  13 ++-
>>>    fs/ceph/file.c               | 218 +++++++++++++++++++++++++++++++++--
>>>    fs/ceph/inode.c              |  50 ++++----
>>>    fs/ceph/mds_client.c         | 126 ++++++++++++++++++--
>>>    fs/ceph/mds_client.h         |   9 +-
>>>    fs/ceph/super.h              |  16 ++-
>>>    include/linux/ceph/ceph_fs.h |   8 +-
>>>    net/ceph/ceph_fs.c           |   1 +
>>>    9 files changed, 410 insertions(+), 65 deletions(-)
>>>
>>
>> An issue with the kernel async unlink/create implementation is that
>> get_caps_for_async_unlink()/try_prep_async_create() are called before
>> ceph_mdsc_submit_request(). There is a small window in which the
>> session's state may change. If the session is in the wrong state,
>> ceph_mdsc_submit_request() may wait rather than send the request
>> immediately.
>>
> 
> Is that a real issue (other than performance)?
> 
> We hold cap references, so assuming that the session can be reconnected
> and that we keep the caps, everything should still work correctly, no?
> 

I think it may cause more trouble for error handling (for the case
where the client fails to reconnect the session).

Regards
Yan, Zheng


* Re: [RFC PATCH v2 00/10] ceph: asynchronous file create support
  2020-01-21 13:20     ` Yan, Zheng
@ 2020-01-21 14:37       ` Jeff Layton
  0 siblings, 0 replies; 33+ messages in thread
From: Jeff Layton @ 2020-01-21 14:37 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell, xiubli

On Tue, 2020-01-21 at 21:20 +0800, Yan, Zheng wrote:
> On 1/21/20 6:56 PM, Jeff Layton wrote:
> > On Mon, 2020-01-20 at 21:20 +0800, Yan, Zheng wrote:
> > > On 1/16/20 4:59 AM, Jeff Layton wrote:
> > > > v2:
> > > > - move cached layout to dedicated field in inode
> > > > - protect cached layout with i_ceph_lock
> > > > - wipe cached layout in __check_cap_issue
> > > > - set max_size of file to layout.stripe_unit
> > > > - set truncate_size to (u64)-1
> > > > - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> > > > - set cap_id to 1 in async created inode
> > > > - allocate inode number before submitting request
> > > > - rework the prep for an async create to be more efficient
> > > > - don't allow MDS or cap messages involving an inode until we get async
> > > >     create reply
> > > > 
> > > > A lot of changes in this set, mostly based on Zheng and Xiubo's
> > > > comments. Performance is pretty similar to the previous set:
> > > > 
> > > > Untarring a kernel tarball into a cephfs takes about 98s with
> > > > async dirops disabled. With them enabled, it takes around 78s,
> > > > which is about a 25% improvement.
> > > > 
> > > > This is not quite ready for merge. Error handling could still be
> > > > improved. With xfstest generic/531, I see some messages like this pop
> > > > up in the ring buffer:
> > > > 
> > > >       [ 7331.393110] ceph: ceph_async_create_cb: inode number mismatch! err=0 deleg_ino=0x100001232d9 target=0x100001232b9
> > > > 
> > > > Basically, we went to do an async create and got a different inode
> > > > number back than expected. That still needs investigation, but I
> > > > didn't see any test failures due to it.
> > > > 
> > > > Jeff Layton (10):
> > > >     libceph: export ceph_file_layout_is_valid
> > > >     ceph: make ceph_fill_inode non-static
> > > >     ceph: make dentry_lease_is_valid non-static
> > > >     ceph: make __take_cap_refs a public function
> > > >     ceph: decode interval_sets for delegated inos
> > > >     ceph: add flag to designate that a request is asynchronous
> > > >     ceph: add infrastructure for waiting for async create to complete
> > > >     ceph: add new MDS req field to hold delegated inode number
> > > >     ceph: cache layout in parent dir on first sync create
> > > >     ceph: attempt to do async create when possible
> > > > 
> > > >    fs/ceph/caps.c               |  34 ++++--
> > > >    fs/ceph/dir.c                |  13 ++-
> > > >    fs/ceph/file.c               | 218 +++++++++++++++++++++++++++++++++--
> > > >    fs/ceph/inode.c              |  50 ++++----
> > > >    fs/ceph/mds_client.c         | 126 ++++++++++++++++++--
> > > >    fs/ceph/mds_client.h         |   9 +-
> > > >    fs/ceph/super.h              |  16 ++-
> > > >    include/linux/ceph/ceph_fs.h |   8 +-
> > > >    net/ceph/ceph_fs.c           |   1 +
> > > >    9 files changed, 410 insertions(+), 65 deletions(-)
> > > > 
> > > 
> > > An issue with the kernel async unlink/create implementation is that
> > > get_caps_for_async_unlink/try_prep_async_create are called before
> > > calling ceph_mdsc_submit_request. There is a small window in which the
> > > session's state may change. If the session is in the wrong state,
> > > ceph_mdsc_submit_request() may wait and not send the request immediately.
> > > 
> > 
> > Is that a real issue (other than performance)?
> > 
> > We hold cap references, so assuming that the session can be reconnected
> > and that we keep the caps, everything should still work correctly, no?
> > 
> 
> I think it may cause more trouble for error handling (for the case where
> the client fails to reconnect the session).
> 

Yeah, error handling is a real Achilles heel in this whole thing. I'm
still considering how best to deal with errors when this occurs. There
may not be any good answers for that w/o adding new interfaces.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2020-01-21 14:37 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-15 20:59 [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
2020-01-15 20:59 ` [RFC PATCH v2 01/10] libceph: export ceph_file_layout_is_valid Jeff Layton
2020-01-15 20:59 ` [RFC PATCH v2 02/10] ceph: make ceph_fill_inode non-static Jeff Layton
2020-01-15 20:59 ` [RFC PATCH v2 03/10] ceph: make dentry_lease_is_valid non-static Jeff Layton
2020-01-15 20:59 ` [RFC PATCH v2 04/10] ceph: make __take_cap_refs a public function Jeff Layton
2020-01-15 20:59 ` [RFC PATCH v2 05/10] ceph: decode interval_sets for delegated inos Jeff Layton
2020-01-16 14:32   ` Yan, Zheng
2020-01-16 15:37     ` Jeff Layton
2020-01-15 20:59 ` [RFC PATCH v2 06/10] ceph: add flag to designate that a request is asynchronous Jeff Layton
2020-01-15 20:59 ` [RFC PATCH v2 07/10] ceph: add infrastructure for waiting for async create to complete Jeff Layton
2020-01-17 15:00   ` Ilya Dryomov
2020-01-17 16:49     ` Jeff Layton
2020-01-15 20:59 ` [RFC PATCH v2 08/10] ceph: add new MDS req field to hold delegated inode number Jeff Layton
2020-01-17 14:47   ` Ilya Dryomov
2020-01-17 16:53     ` Jeff Layton
2020-01-17 17:42       ` Ilya Dryomov
2020-01-17 18:31         ` Jeff Layton
2020-01-20  9:41           ` Ilya Dryomov
2020-01-15 20:59 ` [RFC PATCH v2 09/10] ceph: cache layout in parent dir on first sync create Jeff Layton
2020-01-15 20:59 ` [RFC PATCH v2 10/10] ceph: attempt to do async create when possible Jeff Layton
2020-01-16 15:09   ` Yan, Zheng
2020-01-16 16:21     ` Jeff Layton
2020-01-17 13:28   ` Yan, Zheng
2020-01-17 17:40     ` Jeff Layton
2020-01-18  2:42       ` Yan, Zheng
2020-01-15 21:30 ` [RFC PATCH v2 00/10] ceph: asynchronous file create support Jeff Layton
2020-01-16  8:57 ` Xiubo Li
2020-01-16 13:10 ` Yan, Zheng
2020-01-16 14:15   ` Jeff Layton
2020-01-20 13:20 ` Yan, Zheng
2020-01-21 10:56   ` Jeff Layton
2020-01-21 13:20     ` Yan, Zheng
2020-01-21 14:37       ` Jeff Layton
