All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/9] ceph: add asynchronous create functionality
@ 2020-01-10 20:56 Jeff Layton
  2020-01-10 20:56 ` [RFC PATCH 1/9] ceph: ensure we have a new cap before continuing in fill_inode Jeff Layton
                   ` (9 more replies)
  0 siblings, 10 replies; 28+ messages in thread
From: Jeff Layton @ 2020-01-10 20:56 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell

I recently sent a patchset that allows the client to do an asynchronous
UNLINK call to the MDS when it has the appropriate caps and dentry info.
This set adds the corresponding functionality for creates.

When the client has the appropriate caps on the parent directory and
dentry information, and a delegated inode number, it can satisfy a
request locally without contacting the server. This allows the kernel
client to return very quickly from an O_CREAT open, so it can get on
with doing other things.

These numbers are based on my personal test rig, which is a KVM client
vs a vstart cluster running on my workstation (nothing scientific here).

A simple benchmark (with the cephfs mounted at /mnt/cephfs):
-------------------8<-------------------
#!/bin/sh

TESTDIR=/mnt/cephfs/test-dirops.$$

mkdir $TESTDIR
stat $TESTDIR
echo "Creating files in $TESTDIR"
time for i in `seq 1 10000`; do
    echo "foobarbaz" > $TESTDIR/$i
done
-------------------8<-------------------

With async dirops disabled:

real	0m9.865s
user	0m0.353s
sys	0m0.888s

With async dirops enabled:

real	0m5.272s
user	0m0.104s
sys	0m0.454s

That workload is a bit synthetic though. One workload we're interested
in improving is untar. Untarring a deep directory tree (random kernel
tarball I had laying around):

Disabled:
$ time tar xf ~/linux-4.18.0-153.el8.jlayton.006.tar

real	1m35.774s
user	0m0.835s
sys	0m7.410s

Enabled:
$ time tar xf ~/linux-4.18.0-153.el8.jlayton.006.tar

real	1m32.182s
user	0m0.783s
sys	0m6.830s

Not a huge win there. I suspect at this point that synchronous mkdir
may be serializing behind the async creates.

It needs a lot more performance tuning and analysis, but it's now at the
point where it's basically usable. To enable it, turn on the
ceph.enable_async_dirops module option.

There are some places that need further work:

1) The MDS patchset to delegate inodes to the client is not yet merged:

    https://github.com/ceph/ceph/pull/31817

2) this is 64-bit arch only for the moment. I'm using an xarray to track
the delegated inode numbers, and those don't do 64-bit indexes on
32-bit machines. Is anyone using 32-bit ceph clients? We could probably
build an xarray of xarrays if needed.

3) The error handling is still pretty lame. If the create fails, it'll
set a writeback error on the parent dir and the inode itself, but the
client could end up writing a bunch before it notices, if it even
bothers to check. We probably need to do better here. I'm open to
suggestions on this bit especially.

Jeff Layton (9):
  ceph: ensure we have a new cap before continuing in fill_inode
  ceph: print name of xattr being set in set/getxattr dout message
  ceph: close some holes in struct ceph_mds_request
  ceph: make ceph_fill_inode non-static
  libceph: export ceph_file_layout_is_valid
  ceph: decode interval_sets for delegated inos
  ceph: add flag to delegate an inode number for async create
  ceph: copy layout, max_size and truncate_size on successful sync
    create
  ceph: attempt to do async create when possible

 fs/ceph/caps.c               |  31 +++++-
 fs/ceph/file.c               | 202 +++++++++++++++++++++++++++++++++--
 fs/ceph/inode.c              |  57 +++++-----
 fs/ceph/mds_client.c         | 130 ++++++++++++++++++++--
 fs/ceph/mds_client.h         |  12 ++-
 fs/ceph/super.h              |  10 ++
 fs/ceph/xattr.c              |   5 +-
 include/linux/ceph/ceph_fs.h |   8 +-
 net/ceph/ceph_fs.c           |   1 +
 9 files changed, 396 insertions(+), 60 deletions(-)

-- 
2.24.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH 1/9] ceph: ensure we have a new cap before continuing in fill_inode
  2020-01-10 20:56 [RFC PATCH 0/9] ceph: add asynchronous create functionality Jeff Layton
@ 2020-01-10 20:56 ` Jeff Layton
  2020-01-10 20:56 ` [RFC PATCH 2/9] ceph: print name of xattr being set in set/getxattr dout message Jeff Layton
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Jeff Layton @ 2020-01-10 20:56 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell

If the caller passes in a NULL cap_reservation, and we can't allocate
one then ensure that we fail gracefully.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index ffef475af72b..aee7a24bf1bc 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -756,8 +756,11 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	info_caps = le32_to_cpu(info->cap.caps);
 
 	/* prealloc new cap struct */
-	if (info_caps && ceph_snap(inode) == CEPH_NOSNAP)
+	if (info_caps && ceph_snap(inode) == CEPH_NOSNAP) {
 		new_cap = ceph_get_cap(mdsc, caps_reservation);
+		if (!new_cap)
+			return -ENOMEM;
+	}
 
 	/*
 	 * prealloc xattr data, if it looks like we'll need it.  only
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 2/9] ceph: print name of xattr being set in set/getxattr dout message
  2020-01-10 20:56 [RFC PATCH 0/9] ceph: add asynchronous create functionality Jeff Layton
  2020-01-10 20:56 ` [RFC PATCH 1/9] ceph: ensure we have a new cap before continuing in fill_inode Jeff Layton
@ 2020-01-10 20:56 ` Jeff Layton
  2020-01-10 20:56 ` [RFC PATCH 3/9] ceph: close some holes in struct ceph_mds_request Jeff Layton
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Jeff Layton @ 2020-01-10 20:56 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/xattr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 98a9a3101cda..d58fa14c1f01 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -851,7 +851,7 @@ ssize_t __ceph_getxattr(struct inode *inode, const char *name, void *value,
 	req_mask = __get_request_mask(inode);
 
 	spin_lock(&ci->i_ceph_lock);
-	dout("getxattr %p ver=%lld index_ver=%lld\n", inode,
+	dout("getxattr %p name=%s ver=%lld index_ver=%lld\n", inode, name,
 	     ci->i_xattrs.version, ci->i_xattrs.index_version);
 
 	if (ci->i_xattrs.version == 0 ||
@@ -1078,7 +1078,8 @@ int __ceph_setxattr(struct inode *inode, const char *name,
 		}
 	}
 
-	dout("setxattr %p issued %s\n", inode, ceph_cap_string(issued));
+	dout("setxattr %p name %s issued %s\n", inode, name,
+			ceph_cap_string(issued));
 	__build_xattrs(inode);
 
 	required_blob_size = __get_required_blob_size(ci, name_len, val_len);
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 3/9] ceph: close some holes in struct ceph_mds_request
  2020-01-10 20:56 [RFC PATCH 0/9] ceph: add asynchronous create functionality Jeff Layton
  2020-01-10 20:56 ` [RFC PATCH 1/9] ceph: ensure we have a new cap before continuing in fill_inode Jeff Layton
  2020-01-10 20:56 ` [RFC PATCH 2/9] ceph: print name of xattr being set in set/getxattr dout message Jeff Layton
@ 2020-01-10 20:56 ` Jeff Layton
  2020-01-10 20:56 ` [RFC PATCH 4/9] ceph: make ceph_fill_inode non-static Jeff Layton
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Jeff Layton @ 2020-01-10 20:56 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/mds_client.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index df18a29f9587..27a7446e10d3 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -234,6 +234,7 @@ struct ceph_mds_request {
 	struct rb_node r_node;
 	struct ceph_mds_client *r_mdsc;
 
+	struct kref       r_kref;
 	int r_op;                    /* mds op code */
 
 	/* operation on what? */
@@ -304,7 +305,6 @@ struct ceph_mds_request {
 	int               r_resend_mds; /* mds to resend to next, if any*/
 	u32               r_sent_on_mseq; /* cap mseq request was sent at*/
 
-	struct kref       r_kref;
 	struct list_head  r_wait;
 	struct completion r_completion;
 	struct completion r_safe_completion;
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 4/9] ceph: make ceph_fill_inode non-static
  2020-01-10 20:56 [RFC PATCH 0/9] ceph: add asynchronous create functionality Jeff Layton
                   ` (2 preceding siblings ...)
  2020-01-10 20:56 ` [RFC PATCH 3/9] ceph: close some holes in struct ceph_mds_request Jeff Layton
@ 2020-01-10 20:56 ` Jeff Layton
  2020-01-10 20:56 ` [RFC PATCH 5/9] libceph: export ceph_file_layout_is_valid Jeff Layton
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Jeff Layton @ 2020-01-10 20:56 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c | 47 ++++++++++++++++++++++++-----------------------
 fs/ceph/super.h |  8 ++++++++
 2 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index aee7a24bf1bc..79bb1e6af090 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -728,11 +728,11 @@ void ceph_fill_file_time(struct inode *inode, int issued,
  * Populate an inode based on info from mds.  May be called on new or
  * existing inodes.
  */
-static int fill_inode(struct inode *inode, struct page *locked_page,
-		      struct ceph_mds_reply_info_in *iinfo,
-		      struct ceph_mds_reply_dirfrag *dirinfo,
-		      struct ceph_mds_session *session, int cap_fmode,
-		      struct ceph_cap_reservation *caps_reservation)
+int ceph_fill_inode(struct inode *inode, struct page *locked_page,
+		    struct ceph_mds_reply_info_in *iinfo,
+		    struct ceph_mds_reply_dirfrag *dirinfo,
+		    struct ceph_mds_session *session, int cap_fmode,
+		    struct ceph_cap_reservation *caps_reservation)
 {
 	struct ceph_mds_client *mdsc = ceph_inode_to_client(inode)->mdsc;
 	struct ceph_mds_reply_inode *info = iinfo->in;
@@ -749,7 +749,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	bool new_version = false;
 	bool fill_inline = false;
 
-	dout("fill_inode %p ino %llx.%llx v %llu had %llu\n",
+	dout("%s %p ino %llx.%llx v %llu had %llu\n", __func__,
 	     inode, ceph_vinop(inode), le64_to_cpu(info->version),
 	     ci->i_version);
 
@@ -770,7 +770,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	if (iinfo->xattr_len > 4) {
 		xattr_blob = ceph_buffer_new(iinfo->xattr_len, GFP_NOFS);
 		if (!xattr_blob)
-			pr_err("fill_inode ENOMEM xattr blob %d bytes\n",
+			pr_err("%s ENOMEM xattr blob %d bytes\n", __func__,
 			       iinfo->xattr_len);
 	}
 
@@ -933,8 +933,9 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 			spin_unlock(&ci->i_ceph_lock);
 
 			if (symlen != i_size_read(inode)) {
-				pr_err("fill_inode %llx.%llx BAD symlink "
-					"size %lld\n", ceph_vinop(inode),
+				pr_err("%s %llx.%llx BAD symlink "
+					"size %lld\n", __func__,
+					ceph_vinop(inode),
 					i_size_read(inode));
 				i_size_write(inode, symlen);
 				inode->i_blocks = calc_inode_blocks(symlen);
@@ -958,7 +959,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 		inode->i_fop = &ceph_dir_fops;
 		break;
 	default:
-		pr_err("fill_inode %llx.%llx BAD mode 0%o\n",
+		pr_err("%s %llx.%llx BAD mode 0%o\n", __func__,
 		       ceph_vinop(inode), inode->i_mode);
 	}
 
@@ -1246,10 +1247,9 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 		struct inode *dir = req->r_parent;
 
 		if (dir) {
-			err = fill_inode(dir, NULL,
-					 &rinfo->diri, rinfo->dirfrag,
-					 session, -1,
-					 &req->r_caps_reservation);
+			err = ceph_fill_inode(dir, NULL, &rinfo->diri,
+					      rinfo->dirfrag, session, -1,
+					      &req->r_caps_reservation);
 			if (err < 0)
 				goto done;
 		} else {
@@ -1314,13 +1314,13 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 			goto done;
 		}
 
-		err = fill_inode(in, req->r_locked_page, &rinfo->targeti, NULL,
-				session,
+		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
+				NULL, session,
 				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
 				 rinfo->head->result == 0) ?  req->r_fmode : -1,
 				&req->r_caps_reservation);
 		if (err < 0) {
-			pr_err("fill_inode badness %p %llx.%llx\n",
+			pr_err("ceph_fill_inode badness %p %llx.%llx\n",
 				in, ceph_vinop(in));
 			if (in->i_state & I_NEW)
 				discard_new_inode(in);
@@ -1507,10 +1507,11 @@ static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req,
 			dout("new_inode badness got %d\n", err);
 			continue;
 		}
-		rc = fill_inode(in, NULL, &rde->inode, NULL, session,
-				-1, &req->r_caps_reservation);
+		rc = ceph_fill_inode(in, NULL, &rde->inode, NULL, session,
+				     -1, &req->r_caps_reservation);
 		if (rc < 0) {
-			pr_err("fill_inode badness on %p got %d\n", in, rc);
+			pr_err("ceph_fill_inode badness on %p got %d\n",
+			       in, rc);
 			err = rc;
 			if (in->i_state & I_NEW) {
 				ihold(in);
@@ -1714,10 +1715,10 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 			}
 		}
 
-		ret = fill_inode(in, NULL, &rde->inode, NULL, session,
-				 -1, &req->r_caps_reservation);
+		ret = ceph_fill_inode(in, NULL, &rde->inode, NULL, session,
+				      -1, &req->r_caps_reservation);
 		if (ret < 0) {
-			pr_err("fill_inode badness on %p\n", in);
+			pr_err("ceph_fill_inode badness on %p\n", in);
 			if (d_really_is_negative(dn)) {
 				/* avoid calling iput_final() in mds
 				 * dispatch threads */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 3ef17dd6491e..ec4d66d7c261 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -893,6 +893,9 @@ static inline bool __ceph_have_pending_cap_snap(struct ceph_inode_info *ci)
 }
 
 /* inode.c */
+struct ceph_mds_reply_info_in;
+struct ceph_mds_reply_dirfrag;
+
 extern const struct inode_operations ceph_file_iops;
 
 extern struct inode *ceph_alloc_inode(struct super_block *sb);
@@ -908,6 +911,11 @@ extern void ceph_fill_file_time(struct inode *inode, int issued,
 				u64 time_warp_seq, struct timespec64 *ctime,
 				struct timespec64 *mtime,
 				struct timespec64 *atime);
+extern int ceph_fill_inode(struct inode *inode, struct page *locked_page,
+		    struct ceph_mds_reply_info_in *iinfo,
+		    struct ceph_mds_reply_dirfrag *dirinfo,
+		    struct ceph_mds_session *session, int cap_fmode,
+		    struct ceph_cap_reservation *caps_reservation);
 extern int ceph_fill_trace(struct super_block *sb,
 			   struct ceph_mds_request *req);
 extern int ceph_readdir_prepopulate(struct ceph_mds_request *req,
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 5/9] libceph: export ceph_file_layout_is_valid
  2020-01-10 20:56 [RFC PATCH 0/9] ceph: add asynchronous create functionality Jeff Layton
                   ` (3 preceding siblings ...)
  2020-01-10 20:56 ` [RFC PATCH 4/9] ceph: make ceph_fill_inode non-static Jeff Layton
@ 2020-01-10 20:56 ` Jeff Layton
  2020-01-10 20:56 ` [RFC PATCH 6/9] ceph: decode interval_sets for delegated inos Jeff Layton
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Jeff Layton @ 2020-01-10 20:56 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 net/ceph/ceph_fs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ceph/ceph_fs.c b/net/ceph/ceph_fs.c
index 756a2dc10d27..11a2e3c61b04 100644
--- a/net/ceph/ceph_fs.c
+++ b/net/ceph/ceph_fs.c
@@ -27,6 +27,7 @@ int ceph_file_layout_is_valid(const struct ceph_file_layout *layout)
 		return 0;
 	return 1;
 }
+EXPORT_SYMBOL(ceph_file_layout_is_valid);
 
 void ceph_file_layout_from_legacy(struct ceph_file_layout *fl,
 				  struct ceph_file_layout_legacy *legacy)
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 6/9] ceph: decode interval_sets for delegated inos
  2020-01-10 20:56 [RFC PATCH 0/9] ceph: add asynchronous create functionality Jeff Layton
                   ` (4 preceding siblings ...)
  2020-01-10 20:56 ` [RFC PATCH 5/9] libceph: export ceph_file_layout_is_valid Jeff Layton
@ 2020-01-10 20:56 ` Jeff Layton
  2020-01-10 20:56 ` [RFC PATCH 7/9] ceph: add flag to delegate an inode number for async create Jeff Layton
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Jeff Layton @ 2020-01-10 20:56 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell

Starting in Octopus, the MDS will hand out caps that allow the client
to do asynchronous file creates under certain conditions. As part of
that, the MDS will delegate ranges of inode numbers to the client.

Add the infrastructure to decode these ranges, and stuff them into an
xarray for later consumption by the async creation code.

Because the xarray code currently only handles unsigned long indexes,
and those are 32-bits on 32-bit arches, we only enable the decoding when
running on a 64-bit arch.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/mds_client.c | 107 +++++++++++++++++++++++++++++++++++++++----
 fs/ceph/mds_client.h |   5 +-
 2 files changed, 102 insertions(+), 10 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 8263f75badfc..852c46550d96 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -415,21 +415,108 @@ static int parse_reply_info_filelock(void **p, void *end,
 	return -EIO;
 }
 
+
+#if BITS_PER_LONG == 64
+
+#define DELEGATED_INO_AVAILABLE		xa_mk_value(1)
+
+static int parse_delegated_inos(void **p, void *end, struct ceph_mds_session *s)
+{
+	u32 sets;
+
+	ceph_decode_32_safe(p, end, sets, bad);
+	dout("got %u sets of delegated inodes\n", sets);
+	while (sets--) {
+		u64 start, len, ino;
+
+		ceph_decode_64_safe(p, end, start, bad);
+		ceph_decode_64_safe(p, end, len, bad);
+		while (len--) {
+			int err = xa_insert(&s->s_delegated_inos, ino = start++,
+					    DELEGATED_INO_AVAILABLE,
+					    GFP_KERNEL);
+			if (!err) {
+				dout("added delegated inode 0x%llx\n",
+				     start - 1);
+			} else if (err == -EBUSY) {
+				pr_warn("ceph: MDS delegated inode 0x%llx more than once.\n",
+					start - 1);
+			} else {
+				return err;
+			}
+		}
+	}
+	return 0;
+bad:
+	return -EIO;
+}
+
+static unsigned long get_delegated_ino(struct ceph_mds_session *s)
+{
+	unsigned long ino;
+	void *val;
+
+	xa_for_each(&s->s_delegated_inos, ino, val) {
+		val = xa_erase(&s->s_delegated_inos, ino);
+		if (val == DELEGATED_INO_AVAILABLE)
+			return ino;
+	}
+	return 0;
+}
+#else /* BITS_PER_LONG == 64 */
+/*
+ * FIXME: xarrays can't handle 64-bit indexes on a 32-bit arch. For now, just
+ * ignore delegated_inos on 32 bit arch. Maybe eventually add xarrays for top
+ * and bottom words?
+ */
+static int parse_delegated_inos(void **p, void *end, struct ceph_mds_session *s)
+{
+	u32 sets;
+
+	ceph_decode_32_safe(p, end, sets, bad);
+	if (sets)
+		ceph_decode_skip_n(p, end, sets * 2 * sizeof(__le64), bad);
+	return 0;
+bad:
+	return -EIO;
+}
+
+static inline unsigned long get_delegated_ino(struct ceph_mds_session *s)
+{
+	return 0;
+}
+#endif /* BITS_PER_LONG == 64 */
+
 /*
  * parse create results
  */
 static int parse_reply_info_create(void **p, void *end,
 				  struct ceph_mds_reply_info_parsed *info,
-				  u64 features)
+				  u64 features, struct ceph_mds_session *s)
 {
+	int ret;
+
 	if (features == (u64)-1 ||
 	    (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {
-		/* Malformed reply? */
 		if (*p == end) {
+			/* Malformed reply? */
 			info->has_create_ino = false;
-		} else {
+		} else if (test_bit(CEPHFS_FEATURE_OCTOPUS, &s->s_features)) {
+			u8 struct_v, struct_compat;
+			u32 len;
+
 			info->has_create_ino = true;
+			ceph_decode_8_safe(p, end, struct_v, bad);
+			ceph_decode_8_safe(p, end, struct_compat, bad);
+			ceph_decode_32_safe(p, end, len, bad);
+			ceph_decode_64_safe(p, end, info->ino, bad);
+			ret = parse_delegated_inos(p, end, s);
+			if (ret)
+				return ret;
+		} else {
+			/* legacy */
 			ceph_decode_64_safe(p, end, info->ino, bad);
+			info->has_create_ino = true;
 		}
 	} else {
 		if (*p != end)
@@ -448,7 +535,7 @@ static int parse_reply_info_create(void **p, void *end,
  */
 static int parse_reply_info_extra(void **p, void *end,
 				  struct ceph_mds_reply_info_parsed *info,
-				  u64 features)
+				  u64 features, struct ceph_mds_session *s)
 {
 	u32 op = le32_to_cpu(info->head->op);
 
@@ -457,7 +544,7 @@ static int parse_reply_info_extra(void **p, void *end,
 	else if (op == CEPH_MDS_OP_READDIR || op == CEPH_MDS_OP_LSSNAP)
 		return parse_reply_info_readdir(p, end, info, features);
 	else if (op == CEPH_MDS_OP_CREATE)
-		return parse_reply_info_create(p, end, info, features);
+		return parse_reply_info_create(p, end, info, features, s);
 	else
 		return -EIO;
 }
@@ -465,7 +552,7 @@ static int parse_reply_info_extra(void **p, void *end,
 /*
  * parse entire mds reply
  */
-static int parse_reply_info(struct ceph_msg *msg,
+static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
 			    struct ceph_mds_reply_info_parsed *info,
 			    u64 features)
 {
@@ -490,7 +577,7 @@ static int parse_reply_info(struct ceph_msg *msg,
 	ceph_decode_32_safe(&p, end, len, bad);
 	if (len > 0) {
 		ceph_decode_need(&p, end, len, bad);
-		err = parse_reply_info_extra(&p, p+len, info, features);
+		err = parse_reply_info_extra(&p, p+len, info, features, s);
 		if (err < 0)
 			goto out_bad;
 	}
@@ -558,6 +645,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
 	if (refcount_dec_and_test(&s->s_ref)) {
 		if (s->s_auth.authorizer)
 			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
+		xa_destroy(&s->s_delegated_inos);
 		kfree(s);
 	}
 }
@@ -645,6 +733,7 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
 	refcount_set(&s->s_ref, 1);
 	INIT_LIST_HEAD(&s->s_waiting);
 	INIT_LIST_HEAD(&s->s_unsafe);
+	xa_init(&s->s_delegated_inos);
 	s->s_num_cap_releases = 0;
 	s->s_cap_reconnect = 0;
 	s->s_cap_iterator = NULL;
@@ -2947,9 +3036,9 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
 	dout("handle_reply tid %lld result %d\n", tid, result);
 	rinfo = &req->r_reply_info;
 	if (test_bit(CEPHFS_FEATURE_REPLY_ENCODING, &session->s_features))
-		err = parse_reply_info(msg, rinfo, (u64)-1);
+		err = parse_reply_info(session, msg, rinfo, (u64)-1);
 	else
-		err = parse_reply_info(msg, rinfo, session->s_con.peer_features);
+		err = parse_reply_info(session, msg, rinfo, session->s_con.peer_features);
 	mutex_unlock(&mdsc->mutex);
 
 	mutex_lock(&session->s_mutex);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 27a7446e10d3..3db7ef47e1c9 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -23,8 +23,10 @@ enum ceph_feature_type {
 	CEPHFS_FEATURE_RECLAIM_CLIENT,
 	CEPHFS_FEATURE_LAZY_CAP_WANTED,
 	CEPHFS_FEATURE_MULTI_RECONNECT,
+	CEPHFS_FEATURE_NAUTILUS = CEPHFS_FEATURE_MULTI_RECONNECT,
+	CEPHFS_FEATURE_OCTOPUS,
 
-	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_MULTI_RECONNECT,
+	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_OCTOPUS,
 };
 
 /*
@@ -201,6 +203,7 @@ struct ceph_mds_session {
 
 	struct list_head  s_waiting;  /* waiting requests */
 	struct list_head  s_unsafe;   /* unsafe requests */
+	struct xarray	  s_delegated_inos;
 };
 
 /*
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 7/9] ceph: add flag to delegate an inode number for async create
  2020-01-10 20:56 [RFC PATCH 0/9] ceph: add asynchronous create functionality Jeff Layton
                   ` (5 preceding siblings ...)
  2020-01-10 20:56 ` [RFC PATCH 6/9] ceph: decode interval_sets for delegated inos Jeff Layton
@ 2020-01-10 20:56 ` Jeff Layton
  2020-01-13  9:17   ` Yan, Zheng
  2020-01-10 20:56 ` [RFC PATCH 8/9] ceph: copy layout, max_size and truncate_size on successful sync create Jeff Layton
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Jeff Layton @ 2020-01-10 20:56 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell

In order to issue an async create request, we need to send an inode
number when we do the request, but we don't know which to which MDS
we'll be issuing the request.

Add a new r_req_flag that tells the request sending machinery to
grab an inode number from the delegated set, and encode it into the
request. If it can't get one then have it return -ECHILD. The
requestor can then reissue a synchronous request.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c      |  1 +
 fs/ceph/mds_client.c | 19 ++++++++++++++++++-
 fs/ceph/mds_client.h |  2 ++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 79bb1e6af090..9cfc093fd273 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1317,6 +1317,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
 				NULL, session,
 				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
+				 !test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags) &&
 				 rinfo->head->result == 0) ?  req->r_fmode : -1,
 				&req->r_caps_reservation);
 		if (err < 0) {
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 852c46550d96..9e7492b21b50 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2623,7 +2623,10 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
 	rhead->flags = cpu_to_le32(flags);
 	rhead->num_fwd = req->r_num_fwd;
 	rhead->num_retry = req->r_attempts - 1;
-	rhead->ino = 0;
+	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags))
+		rhead->ino = cpu_to_le64(req->r_deleg_ino);
+	else
+		rhead->ino = 0;
 
 	dout(" r_parent = %p\n", req->r_parent);
 	return 0;
@@ -2736,6 +2739,20 @@ static void __do_request(struct ceph_mds_client *mdsc,
 		goto out_session;
 	}
 
+	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags) &&
+	    !req->r_deleg_ino) {
+		req->r_deleg_ino = get_delegated_ino(req->r_session);
+
+		if (!req->r_deleg_ino) {
+			/*
+			 * If we can't get a deleg ino, exit with -ECHILD,
+			 * so the caller can reissue a sync request
+			 */
+			err = -ECHILD;
+			goto out_session;
+		}
+	}
+
 	/* send request */
 	req->r_resend_mds = -1;   /* forget any previous mds hint */
 
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 3db7ef47e1c9..e0b36be7c44f 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -258,6 +258,7 @@ struct ceph_mds_request {
 #define CEPH_MDS_R_GOT_RESULT		(5) /* got a result */
 #define CEPH_MDS_R_DID_PREPOPULATE	(6) /* prepopulated readdir */
 #define CEPH_MDS_R_PARENT_LOCKED	(7) /* is r_parent->i_rwsem wlocked? */
+#define CEPH_MDS_R_DELEG_INO		(8) /* attempt to get r_deleg_ino */
 	unsigned long	r_req_flags;
 
 	struct mutex r_fill_mutex;
@@ -307,6 +308,7 @@ struct ceph_mds_request {
 	int               r_num_fwd;    /* number of forward attempts */
 	int               r_resend_mds; /* mds to resend to next, if any*/
 	u32               r_sent_on_mseq; /* cap mseq request was sent at*/
+	unsigned long	  r_deleg_ino;
 
 	struct list_head  r_wait;
 	struct completion r_completion;
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 8/9] ceph: copy layout, max_size and truncate_size on successful sync create
  2020-01-10 20:56 [RFC PATCH 0/9] ceph: add asynchronous create functionality Jeff Layton
                   ` (6 preceding siblings ...)
  2020-01-10 20:56 ` [RFC PATCH 7/9] ceph: add flag to delegate an inode number for async create Jeff Layton
@ 2020-01-10 20:56 ` Jeff Layton
  2020-01-13  3:51   ` Yan, Zheng
  2020-01-13  9:01   ` Yan, Zheng
  2020-01-10 20:56 ` [RFC PATCH 9/9] ceph: attempt to do async create when possible Jeff Layton
  2020-01-13 11:07 ` [RFC PATCH 0/9] ceph: add asynchronous create functionality Yan, Zheng
  9 siblings, 2 replies; 28+ messages in thread
From: Jeff Layton @ 2020-01-10 20:56 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell

It doesn't do much good to do an asynchronous create unless we can do
I/O to it before the create reply comes in. That means we need layout
info the new file before we've gotten the response from the MDS.

All files created in a directory will initially inherit the same layout,
so copy off the requisite info from the first synchronous create in the
directory. Save it in the same fields in the directory inode, as those
are otherwise unsed for dir inodes. This means we need to be a bit
careful about only updating layout info on non-dir inodes.

Also, zero out the layout when we drop Dc caps in the dir.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c  | 24 ++++++++++++++++++++----
 fs/ceph/file.c  | 24 +++++++++++++++++++++++-
 fs/ceph/inode.c |  4 ++--
 3 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 7fc87b693ba4..b96fb1378479 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2847,7 +2847,7 @@ int ceph_get_caps(struct file *filp, int need, int want,
 			return ret;
 		}
 
-		if (S_ISREG(ci->vfs_inode.i_mode) &&
+		if (!S_ISDIR(ci->vfs_inode.i_mode) &&
 		    ci->i_inline_version != CEPH_INLINE_NONE &&
 		    (_got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) &&
 		    i_size_read(inode) > 0) {
@@ -2944,9 +2944,17 @@ void ceph_put_cap_refs(struct ceph_inode_info *ci, int had)
 	if (had & CEPH_CAP_FILE_RD)
 		if (--ci->i_rd_ref == 0)
 			last++;
-	if (had & CEPH_CAP_FILE_CACHE)
-		if (--ci->i_rdcache_ref == 0)
+	if (had & CEPH_CAP_FILE_CACHE) {
+		if (--ci->i_rdcache_ref == 0) {
 			last++;
+			/* Zero out layout if we lost CREATE caps */
+			if (S_ISDIR(inode->i_mode) &&
+			    !(__ceph_caps_issued(ci, NULL) & CEPH_CAP_DIR_CREATE)) {
+				ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
+				memset(&ci->i_layout, 0, sizeof(ci->i_layout));
+			}
+		}
+	}
 	if (had & CEPH_CAP_FILE_EXCL)
 		if (--ci->i_fx_ref == 0)
 			last++;
@@ -3264,7 +3272,8 @@ static void handle_cap_grant(struct inode *inode,
 		ci->i_subdirs = extra_info->nsubdirs;
 	}
 
-	if (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)) {
+	if (!S_ISDIR(inode->i_mode) &&
+	    (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
 		/* file layout may have changed */
 		s64 old_pool = ci->i_layout.pool_id;
 		struct ceph_string *old_ns;
@@ -3336,6 +3345,13 @@ static void handle_cap_grant(struct inode *inode,
 		     ceph_cap_string(cap->issued),
 		     ceph_cap_string(newcaps),
 		     ceph_cap_string(revoking));
+
+		if (S_ISDIR(inode->i_mode) &&
+		    (revoking & CEPH_CAP_DIR_CREATE) && !ci->i_rdcache_ref) {
+			ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
+			memset(&ci->i_layout, 0, sizeof(ci->i_layout));
+		}
+
 		if (S_ISREG(inode->i_mode) &&
 		    (revoking & used & CEPH_CAP_FILE_BUFFER))
 			writeback = true;  /* initiate writeback; will delay ack */
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 1e6cdf2dfe90..d4d7a277faf1 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -430,6 +430,25 @@ int ceph_open(struct inode *inode, struct file *file)
 	return err;
 }
 
+/* Clone the layout from a synchronous create, if the dir now has Dc caps */
+static void
+copy_file_layout(struct inode *dst, struct inode *src)
+{
+	struct ceph_inode_info *cdst = ceph_inode(dst);
+	struct ceph_inode_info *csrc = ceph_inode(src);
+
+	spin_lock(&cdst->i_ceph_lock);
+	if ((__ceph_caps_issued(cdst, NULL) & CEPH_CAP_DIR_CREATE) &&
+	    !ceph_file_layout_is_valid(&cdst->i_layout)) {
+		memcpy(&cdst->i_layout, &csrc->i_layout,
+			sizeof(cdst->i_layout));
+		rcu_assign_pointer(cdst->i_layout.pool_ns,
+				   ceph_try_get_string(csrc->i_layout.pool_ns));
+		cdst->i_max_size = csrc->i_max_size;
+		cdst->i_truncate_size = csrc->i_truncate_size;
+	}
+	spin_unlock(&cdst->i_ceph_lock);
+}
 
 /*
  * Do a lookup + open with a single request.  If we get a non-existent
@@ -518,7 +537,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	} else {
 		dout("atomic_open finish_open on dn %p\n", dn);
 		if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
-			ceph_init_inode_acls(d_inode(dentry), &as_ctx);
+			struct inode *newino = d_inode(dentry);
+
+			copy_file_layout(dir, newino);
+			ceph_init_inode_acls(newino, &as_ctx);
 			file->f_mode |= FMODE_CREATED;
 		}
 		err = finish_open(file, dentry, ceph_open);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 9cfc093fd273..8b51051b79b0 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -848,8 +848,8 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
 		ci->i_subdirs = le64_to_cpu(info->subdirs);
 	}
 
-	if (new_version ||
-	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
+	if (!S_ISDIR(inode->i_mode) && (new_version ||
+	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)))) {
 		s64 old_pool = ci->i_layout.pool_id;
 		struct ceph_string *old_ns;
 
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 9/9] ceph: attempt to do async create when possible
  2020-01-10 20:56 [RFC PATCH 0/9] ceph: add asynchronous create functionality Jeff Layton
                   ` (7 preceding siblings ...)
  2020-01-10 20:56 ` [RFC PATCH 8/9] ceph: copy layout, max_size and truncate_size on successful sync create Jeff Layton
@ 2020-01-10 20:56 ` Jeff Layton
  2020-01-13  1:43   ` Xiubo Li
  2020-01-13 10:53   ` Yan, Zheng
  2020-01-13 11:07 ` [RFC PATCH 0/9] ceph: add asynchronous create functionality Yan, Zheng
  9 siblings, 2 replies; 28+ messages in thread
From: Jeff Layton @ 2020-01-10 20:56 UTC (permalink / raw)
  To: ceph-devel; +Cc: zyan, sage, idryomov, pdonnell

With the Octopus release, the MDS will hand out directoy create caps.
If we have Fxc caps on the directory, and complete directory information
or a known negative dentry, then we can return without waiting on the
reply, allowing the open() call to return very quickly to userland.

We use the normal ceph_fill_inode() routine to fill in the inode, so we
have to gin up some reply inode information with what we'd expect a
newly-created inode to have. The client assumes that it has a full set
of caps on the new inode, and that the MDS will revoke them when there
is conflicting access.

This functionality is gated on the enable_async_dirops module option,
along with async unlinks, and on the server supporting the Octopus
CephFS feature bit.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c               |   7 +-
 fs/ceph/file.c               | 178 +++++++++++++++++++++++++++++++++--
 fs/ceph/mds_client.c         |  12 ++-
 fs/ceph/mds_client.h         |   3 +-
 fs/ceph/super.h              |   2 +
 include/linux/ceph/ceph_fs.h |   8 +-
 6 files changed, 191 insertions(+), 19 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index b96fb1378479..21a8a2ddc94b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -654,6 +654,10 @@ void ceph_add_cap(struct inode *inode,
 		session->s_nr_caps++;
 		spin_unlock(&session->s_cap_lock);
 	} else {
+		/* Did an async create race with the reply? */
+		if (cap_id == CEPH_CAP_ID_TBD && cap->issued == issued)
+			return;
+
 		spin_lock(&session->s_cap_lock);
 		list_move_tail(&cap->session_caps, &session->s_caps);
 		spin_unlock(&session->s_cap_lock);
@@ -672,7 +676,8 @@ void ceph_add_cap(struct inode *inode,
 		 */
 		if (ceph_seq_cmp(seq, cap->seq) <= 0) {
 			WARN_ON(cap != ci->i_auth_cap);
-			WARN_ON(cap->cap_id != cap_id);
+			WARN_ON(cap_id != CEPH_CAP_ID_TBD &&
+				cap->cap_id != cap_id);
 			seq = cap->seq;
 			mseq = cap->mseq;
 			issued |= cap->issued;
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d4d7a277faf1..706abd71b731 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -450,6 +450,141 @@ copy_file_layout(struct inode *dst, struct inode *src)
 	spin_unlock(&cdst->i_ceph_lock);
 }
 
+static bool get_caps_for_async_create(struct inode *dir, struct dentry *dentry)
+{
+	struct ceph_inode_info *ci = ceph_inode(dir);
+	int ret, want, got;
+
+	/*
+	 * We can do an async create if we either have a valid negative dentry
+	 * or the complete contents of the directory. Do a quick check without
+	 * cap refs.
+	 */
+	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
+	    !ceph_file_layout_is_valid(&ci->i_layout))
+		return false;
+
+	/* Try to get caps */
+	want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE;
+	ret = ceph_try_get_caps(dir, 0, want, true, &got);
+	dout("Fx on %p ret=%d got=%d\n", dir, ret, got);
+	if (ret != 1)
+		return false;
+	if (got != want) {
+		ceph_put_cap_refs(ci, got);
+		return false;
+	}
+
+	/* Check again, now that we hold cap refs */
+	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
+	    !ceph_file_layout_is_valid(&ci->i_layout)) {
+		ceph_put_cap_refs(ci, got);
+		return false;
+	}
+
+	return true;
+}
+
+static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
+                                 struct ceph_mds_request *req)
+{
+	/* If we never sent anything then nothing to clean up */
+	if (req->r_err == -ECHILD)
+		goto out;
+
+	mapping_set_error(req->r_parent->i_mapping, req->r_err);
+
+	if (req->r_target_inode) {
+		u64 ino = ceph_vino(req->r_target_inode).ino;
+
+		if (req->r_deleg_ino != ino)
+			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%lx target=0x%llx\n",
+				__func__, req->r_err, req->r_deleg_ino, ino);
+		mapping_set_error(req->r_target_inode->i_mapping, req->r_err);
+	} else {
+		pr_warn("%s: no req->r_target_inode for 0x%lx\n", __func__,
+			req->r_deleg_ino);
+	}
+out:
+	ceph_put_cap_refs(ceph_inode(req->r_parent),
+			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
+}
+
+static int ceph_finish_async_open(struct inode *dir, struct dentry *dentry,
+				  struct file *file, umode_t mode,
+				  struct ceph_mds_request *req,
+				  struct ceph_acl_sec_ctx *as_ctx)
+{
+	int ret;
+	struct ceph_mds_reply_inode in = { };
+	struct ceph_mds_reply_info_in iinfo = { .in = &in };
+	struct ceph_inode_info *ci = ceph_inode(dir);
+	struct inode *inode;
+	struct timespec64 now;
+	struct ceph_vino vino = { .ino = req->r_deleg_ino,
+				  .snap = CEPH_NOSNAP };
+
+	ktime_get_real_ts64(&now);
+
+	inode = ceph_get_inode(dentry->d_sb, vino);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	/* If we can't get a buffer, just carry on */
+	iinfo.xattr_data = kzalloc(4, GFP_NOFS);
+	if (iinfo.xattr_data)
+		iinfo.xattr_len = 4;
+
+	iinfo.inline_version = CEPH_INLINE_NONE;
+	iinfo.change_attr = 1;
+	ceph_encode_timespec64(&iinfo.btime, &now);
+
+	in.ino = cpu_to_le64(vino.ino);
+	in.snapid = cpu_to_le64(CEPH_NOSNAP);
+	in.version = cpu_to_le64(1);	// ???
+	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
+	in.cap.cap_id = cpu_to_le64(CEPH_CAP_ID_TBD);
+	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
+	in.cap.flags = CEPH_CAP_FLAG_AUTH;
+	in.ctime = in.mtime = in.atime = iinfo.btime;
+	in.mode = cpu_to_le32((u32)mode);
+	in.truncate_seq = cpu_to_le32(1);
+	in.truncate_size = cpu_to_le64(ci->i_truncate_size);
+	in.max_size = cpu_to_le64(ci->i_max_size);
+	in.xattr_version = cpu_to_le64(1);
+	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
+	in.gid = cpu_to_le32(from_kgid(&init_user_ns, current_fsgid()));
+	in.nlink = cpu_to_le32(1);
+
+	ceph_file_layout_to_legacy(&ci->i_layout, &in.layout);
+
+	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
+			      req->r_fmode, NULL);
+	if (ret) {
+		dout("%s failed to fill inode: %d\n", __func__, ret);
+		if (inode->i_state & I_NEW)
+			discard_new_inode(inode);
+	} else {
+		struct dentry *dn;
+
+		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
+			vino.ino, dir->i_ino, dentry->d_name.name);
+		ceph_dir_clear_ordered(dir);
+		ceph_init_inode_acls(inode, as_ctx);
+		if (inode->i_state & I_NEW)
+			unlock_new_inode(inode);
+		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
+			if (!d_unhashed(dentry))
+				d_drop(dentry);
+			dn = d_splice_alias(inode, dentry);
+			WARN_ON_ONCE(dn && dn != dentry);
+		}
+		file->f_mode |= FMODE_CREATED;
+		ret = finish_open(file, dentry, ceph_open);
+	}
+	return ret;
+}
+
 /*
  * Do a lookup + open with a single request.  If we get a non-existent
  * file or symlink, return 1 so the VFS can retry.
@@ -462,6 +597,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	struct ceph_mds_request *req;
 	struct dentry *dn;
 	struct ceph_acl_sec_ctx as_ctx = {};
+	bool try_async = enable_async_dirops;
 	int mask;
 	int err;
 
@@ -486,6 +622,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 		return -ENOENT;
 	}
 
+retry:
 	/* do the open */
 	req = prepare_open_request(dir->i_sb, flags, mode);
 	if (IS_ERR(req)) {
@@ -494,6 +631,12 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	}
 	req->r_dentry = dget(dentry);
 	req->r_num_caps = 2;
+	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
+	if (ceph_security_xattr_wanted(dir))
+		mask |= CEPH_CAP_XATTR_SHARED;
+	req->r_args.open.mask = cpu_to_le32(mask);
+	req->r_parent = dir;
+
 	if (flags & O_CREAT) {
 		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
 		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
@@ -501,21 +644,37 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 			req->r_pagelist = as_ctx.pagelist;
 			as_ctx.pagelist = NULL;
 		}
+		if (try_async && get_caps_for_async_create(dir, dentry)) {
+			set_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags);
+			req->r_callback = ceph_async_create_cb;
+			err = ceph_mdsc_submit_request(mdsc, dir, req);
+			switch (err) {
+			case 0:
+				/* set up inode, dentry and return */
+				err = ceph_finish_async_open(dir, dentry, file,
+							mode, req, &as_ctx);
+				goto out_req;
+			case -ECHILD:
+				/* do a sync create */
+				try_async = false;
+				as_ctx.pagelist = req->r_pagelist;
+				req->r_pagelist = NULL;
+				ceph_mdsc_put_request(req);
+				goto retry;
+			default:
+				/* Hard error, give up */
+				goto out_req;
+			}
+		}
 	}
 
-       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
-       if (ceph_security_xattr_wanted(dir))
-               mask |= CEPH_CAP_XATTR_SHARED;
-       req->r_args.open.mask = cpu_to_le32(mask);
-
-	req->r_parent = dir;
 	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
 	err = ceph_mdsc_do_request(mdsc,
 				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
 				   req);
 	err = ceph_handle_snapdir(req, dentry, err);
 	if (err)
-		goto out_req;
+		goto out_fmode;
 
 	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
 		err = ceph_handle_notrace_create(dir, dentry);
@@ -529,7 +688,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 		dn = NULL;
 	}
 	if (err)
-		goto out_req;
+		goto out_fmode;
 	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
 		/* make vfs retry on splice, ENOENT, or symlink */
 		dout("atomic_open finish_no_open on dn %p\n", dn);
@@ -545,9 +704,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 		}
 		err = finish_open(file, dentry, ceph_open);
 	}
-out_req:
+out_fmode:
 	if (!req->r_err && req->r_target_inode)
 		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
+out_req:
 	ceph_mdsc_put_request(req);
 out_ctx:
 	ceph_release_acl_sec_ctx(&as_ctx);
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 9e7492b21b50..c76d6e7f8136 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2620,14 +2620,16 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
 		flags |= CEPH_MDS_FLAG_REPLAY;
 	if (req->r_parent)
 		flags |= CEPH_MDS_FLAG_WANT_DENTRY;
-	rhead->flags = cpu_to_le32(flags);
-	rhead->num_fwd = req->r_num_fwd;
-	rhead->num_retry = req->r_attempts - 1;
-	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags))
+	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags)) {
 		rhead->ino = cpu_to_le64(req->r_deleg_ino);
-	else
+		flags |= CEPH_MDS_FLAG_ASYNC;
+	} else {
 		rhead->ino = 0;
+	}
 
+	rhead->flags = cpu_to_le32(flags);
+	rhead->num_fwd = req->r_num_fwd;
+	rhead->num_retry = req->r_attempts - 1;
 	dout(" r_parent = %p\n", req->r_parent);
 	return 0;
 }
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index e0b36be7c44f..49e6cd5a07a2 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -39,8 +39,7 @@ enum ceph_feature_type {
 	CEPHFS_FEATURE_REPLY_ENCODING,		\
 	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
 	CEPHFS_FEATURE_MULTI_RECONNECT,		\
-						\
-	CEPHFS_FEATURE_MAX,			\
+	CEPHFS_FEATURE_OCTOPUS,			\
 }
 #define CEPHFS_FEATURES_CLIENT_REQUIRED {}
 
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index ec4d66d7c261..33e03fbba888 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -136,6 +136,8 @@ struct ceph_fs_client {
 #endif
 };
 
+/* Special placeholder value for a cap_id during an asynchronous create. */
+#define        CEPH_CAP_ID_TBD         -1ULL
 
 /*
  * File i/o capability.  This tracks shared state with the metadata
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index a099f60feb7b..b127563e21a1 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -444,8 +444,9 @@ union ceph_mds_request_args {
 	} __attribute__ ((packed)) lookupino;
 } __attribute__ ((packed));
 
-#define CEPH_MDS_FLAG_REPLAY        1  /* this is a replayed op */
-#define CEPH_MDS_FLAG_WANT_DENTRY   2  /* want dentry in reply */
+#define CEPH_MDS_FLAG_REPLAY		1  /* this is a replayed op */
+#define CEPH_MDS_FLAG_WANT_DENTRY	2  /* want dentry in reply */
+#define CEPH_MDS_FLAG_ASYNC		4  /* request is asynchronous */
 
 struct ceph_mds_request_head {
 	__le64 oldest_client_tid;
@@ -658,6 +659,9 @@ int ceph_flags_to_mode(int flags);
 #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
 			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
 			   CEPH_CAP_PIN)
+#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
+			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
+			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
 
 #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
 			CEPH_LOCK_IXATTR)
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 9/9] ceph: attempt to do async create when possible
  2020-01-10 20:56 ` [RFC PATCH 9/9] ceph: attempt to do async create when possible Jeff Layton
@ 2020-01-13  1:43   ` Xiubo Li
  2020-01-13 13:16     ` Jeff Layton
  2020-01-13 10:53   ` Yan, Zheng
  1 sibling, 1 reply; 28+ messages in thread
From: Xiubo Li @ 2020-01-13  1:43 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: zyan, sage, idryomov, pdonnell

On 2020/1/11 4:56, Jeff Layton wrote:
> With the Octopus release, the MDS will hand out directoy create caps.
> If we have Fxc caps on the directory, and complete directory information
> or a known negative dentry, then we can return without waiting on the
> reply, allowing the open() call to return very quickly to userland.
>
> We use the normal ceph_fill_inode() routine to fill in the inode, so we
> have to gin up some reply inode information with what we'd expect a
> newly-created inode to have. The client assumes that it has a full set
> of caps on the new inode, and that the MDS will revoke them when there
> is conflicting access.
>
> This functionality is gated on the enable_async_dirops module option,
> along with async unlinks, and on the server supporting the Octopus
> CephFS feature bit.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/caps.c               |   7 +-
>   fs/ceph/file.c               | 178 +++++++++++++++++++++++++++++++++--
>   fs/ceph/mds_client.c         |  12 ++-
>   fs/ceph/mds_client.h         |   3 +-
>   fs/ceph/super.h              |   2 +
>   include/linux/ceph/ceph_fs.h |   8 +-
>   6 files changed, 191 insertions(+), 19 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index b96fb1378479..21a8a2ddc94b 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -654,6 +654,10 @@ void ceph_add_cap(struct inode *inode,
>   		session->s_nr_caps++;
>   		spin_unlock(&session->s_cap_lock);
>   	} else {
> +		/* Did an async create race with the reply? */
> +		if (cap_id == CEPH_CAP_ID_TBD && cap->issued == issued)
> +			return;
> +
>   		spin_lock(&session->s_cap_lock);
>   		list_move_tail(&cap->session_caps, &session->s_caps);
>   		spin_unlock(&session->s_cap_lock);
> @@ -672,7 +676,8 @@ void ceph_add_cap(struct inode *inode,
>   		 */
>   		if (ceph_seq_cmp(seq, cap->seq) <= 0) {
>   			WARN_ON(cap != ci->i_auth_cap);
> -			WARN_ON(cap->cap_id != cap_id);
> +			WARN_ON(cap_id != CEPH_CAP_ID_TBD &&
> +				cap->cap_id != cap_id);
>   			seq = cap->seq;
>   			mseq = cap->mseq;
>   			issued |= cap->issued;
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index d4d7a277faf1..706abd71b731 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -450,6 +450,141 @@ copy_file_layout(struct inode *dst, struct inode *src)
>   	spin_unlock(&cdst->i_ceph_lock);
>   }
>   
> +static bool get_caps_for_async_create(struct inode *dir, struct dentry *dentry)
> +{
> +	struct ceph_inode_info *ci = ceph_inode(dir);
> +	int ret, want, got;
> +
> +	/*
> +	 * We can do an async create if we either have a valid negative dentry
> +	 * or the complete contents of the directory. Do a quick check without
> +	 * cap refs.
> +	 */
> +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
> +	    !ceph_file_layout_is_valid(&ci->i_layout))
> +		return false;
> +
> +	/* Try to get caps */
> +	want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE;
> +	ret = ceph_try_get_caps(dir, 0, want, true, &got);
> +	dout("Fx on %p ret=%d got=%d\n", dir, ret, got);
> +	if (ret != 1)
> +		return false;
> +	if (got != want) {
> +		ceph_put_cap_refs(ci, got);
> +		return false;
> +	}
> +
> +	/* Check again, now that we hold cap refs */
> +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
> +	    !ceph_file_layout_is_valid(&ci->i_layout)) {
> +		ceph_put_cap_refs(ci, got);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
> +                                 struct ceph_mds_request *req)
> +{
> +	/* If we never sent anything then nothing to clean up */
> +	if (req->r_err == -ECHILD)
> +		goto out;
> +
> +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
> +
> +	if (req->r_target_inode) {
> +		u64 ino = ceph_vino(req->r_target_inode).ino;
> +
> +		if (req->r_deleg_ino != ino)
> +			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%lx target=0x%llx\n",
> +				__func__, req->r_err, req->r_deleg_ino, ino);
> +		mapping_set_error(req->r_target_inode->i_mapping, req->r_err);
> +	} else {
> +		pr_warn("%s: no req->r_target_inode for 0x%lx\n", __func__,
> +			req->r_deleg_ino);
> +	}
> +out:
> +	ceph_put_cap_refs(ceph_inode(req->r_parent),
> +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
> +}
> +
> +static int ceph_finish_async_open(struct inode *dir, struct dentry *dentry,
> +				  struct file *file, umode_t mode,
> +				  struct ceph_mds_request *req,
> +				  struct ceph_acl_sec_ctx *as_ctx)
> +{
> +	int ret;
> +	struct ceph_mds_reply_inode in = { };
> +	struct ceph_mds_reply_info_in iinfo = { .in = &in };
> +	struct ceph_inode_info *ci = ceph_inode(dir);
> +	struct inode *inode;
> +	struct timespec64 now;
> +	struct ceph_vino vino = { .ino = req->r_deleg_ino,
> +				  .snap = CEPH_NOSNAP };
> +
> +	ktime_get_real_ts64(&now);
> +
> +	inode = ceph_get_inode(dentry->d_sb, vino);
> +	if (IS_ERR(inode))
> +		return PTR_ERR(inode);
> +
> +	/* If we can't get a buffer, just carry on */
> +	iinfo.xattr_data = kzalloc(4, GFP_NOFS);
> +	if (iinfo.xattr_data)
> +		iinfo.xattr_len = 4;
> +
> +	iinfo.inline_version = CEPH_INLINE_NONE;
> +	iinfo.change_attr = 1;
> +	ceph_encode_timespec64(&iinfo.btime, &now);
> +
> +	in.ino = cpu_to_le64(vino.ino);
> +	in.snapid = cpu_to_le64(CEPH_NOSNAP);
> +	in.version = cpu_to_le64(1);	// ???
> +	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
> +	in.cap.cap_id = cpu_to_le64(CEPH_CAP_ID_TBD);
> +	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
> +	in.cap.flags = CEPH_CAP_FLAG_AUTH;
> +	in.ctime = in.mtime = in.atime = iinfo.btime;
> +	in.mode = cpu_to_le32((u32)mode);
> +	in.truncate_seq = cpu_to_le32(1);
> +	in.truncate_size = cpu_to_le64(ci->i_truncate_size);
> +	in.max_size = cpu_to_le64(ci->i_max_size);
> +	in.xattr_version = cpu_to_le64(1);
> +	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
> +	in.gid = cpu_to_le32(from_kgid(&init_user_ns, current_fsgid()));
> +	in.nlink = cpu_to_le32(1);
> +
> +	ceph_file_layout_to_legacy(&ci->i_layout, &in.layout);
> +
> +	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
> +			      req->r_fmode, NULL);
> +	if (ret) {
> +		dout("%s failed to fill inode: %d\n", __func__, ret);
> +		if (inode->i_state & I_NEW)
> +			discard_new_inode(inode);
> +	} else {
> +		struct dentry *dn;
> +
> +		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
> +			vino.ino, dir->i_ino, dentry->d_name.name);
> +		ceph_dir_clear_ordered(dir);
> +		ceph_init_inode_acls(inode, as_ctx);
> +		if (inode->i_state & I_NEW)
> +			unlock_new_inode(inode);
> +		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
> +			if (!d_unhashed(dentry))
> +				d_drop(dentry);
> +			dn = d_splice_alias(inode, dentry);
> +			WARN_ON_ONCE(dn && dn != dentry);
> +		}
> +		file->f_mode |= FMODE_CREATED;
> +		ret = finish_open(file, dentry, ceph_open);
> +	}
> +	return ret;
> +}
> +
>   /*
>    * Do a lookup + open with a single request.  If we get a non-existent
>    * file or symlink, return 1 so the VFS can retry.
> @@ -462,6 +597,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   	struct ceph_mds_request *req;
>   	struct dentry *dn;
>   	struct ceph_acl_sec_ctx as_ctx = {};
> +	bool try_async = enable_async_dirops;
>   	int mask;
>   	int err;
>   
> @@ -486,6 +622,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   		return -ENOENT;
>   	}
>   
> +retry:
>   	/* do the open */
>   	req = prepare_open_request(dir->i_sb, flags, mode);
>   	if (IS_ERR(req)) {
> @@ -494,6 +631,12 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   	}
>   	req->r_dentry = dget(dentry);
>   	req->r_num_caps = 2;
> +	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> +	if (ceph_security_xattr_wanted(dir))
> +		mask |= CEPH_CAP_XATTR_SHARED;
> +	req->r_args.open.mask = cpu_to_le32(mask);
> +	req->r_parent = dir;
> +
>   	if (flags & O_CREAT) {
>   		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
>   		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
> @@ -501,21 +644,37 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   			req->r_pagelist = as_ctx.pagelist;
>   			as_ctx.pagelist = NULL;
>   		}
> +		if (try_async && get_caps_for_async_create(dir, dentry)) {
> +			set_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags);
> +			req->r_callback = ceph_async_create_cb;
> +			err = ceph_mdsc_submit_request(mdsc, dir, req);
> +			switch (err) {
> +			case 0:
> +				/* set up inode, dentry and return */
> +				err = ceph_finish_async_open(dir, dentry, file,
> +							mode, req, &as_ctx);
> +				goto out_req;
> +			case -ECHILD:
> +				/* do a sync create */
> +				try_async = false;
> +				as_ctx.pagelist = req->r_pagelist;
> +				req->r_pagelist = NULL;
> +				ceph_mdsc_put_request(req);
> +				goto retry;
> +			default:
> +				/* Hard error, give up */
> +				goto out_req;
> +			}
> +		}
>   	}
>   
> -       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> -       if (ceph_security_xattr_wanted(dir))
> -               mask |= CEPH_CAP_XATTR_SHARED;
> -       req->r_args.open.mask = cpu_to_le32(mask);
> -
> -	req->r_parent = dir;
>   	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
>   	err = ceph_mdsc_do_request(mdsc,
>   				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
>   				   req);
>   	err = ceph_handle_snapdir(req, dentry, err);
>   	if (err)
> -		goto out_req;
> +		goto out_fmode;
>   
>   	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
>   		err = ceph_handle_notrace_create(dir, dentry);
> @@ -529,7 +688,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   		dn = NULL;
>   	}
>   	if (err)
> -		goto out_req;
> +		goto out_fmode;
>   	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
>   		/* make vfs retry on splice, ENOENT, or symlink */
>   		dout("atomic_open finish_no_open on dn %p\n", dn);
> @@ -545,9 +704,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   		}
>   		err = finish_open(file, dentry, ceph_open);
>   	}
> -out_req:
> +out_fmode:
>   	if (!req->r_err && req->r_target_inode)
>   		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
> +out_req:
>   	ceph_mdsc_put_request(req);
>   out_ctx:
>   	ceph_release_acl_sec_ctx(&as_ctx);
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 9e7492b21b50..c76d6e7f8136 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2620,14 +2620,16 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
>   		flags |= CEPH_MDS_FLAG_REPLAY;
>   	if (req->r_parent)
>   		flags |= CEPH_MDS_FLAG_WANT_DENTRY;
> -	rhead->flags = cpu_to_le32(flags);
> -	rhead->num_fwd = req->r_num_fwd;
> -	rhead->num_retry = req->r_attempts - 1;
> -	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags))
> +	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags)) {
>   		rhead->ino = cpu_to_le64(req->r_deleg_ino);
> -	else
> +		flags |= CEPH_MDS_FLAG_ASYNC;
> +	} else {
>   		rhead->ino = 0;
> +	}
>   
> +	rhead->flags = cpu_to_le32(flags);
> +	rhead->num_fwd = req->r_num_fwd;
> +	rhead->num_retry = req->r_attempts - 1;
>   	dout(" r_parent = %p\n", req->r_parent);
>   	return 0;
>   }
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index e0b36be7c44f..49e6cd5a07a2 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -39,8 +39,7 @@ enum ceph_feature_type {
>   	CEPHFS_FEATURE_REPLY_ENCODING,		\
>   	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
>   	CEPHFS_FEATURE_MULTI_RECONNECT,		\
> -						\
> -	CEPHFS_FEATURE_MAX,			\
> +	CEPHFS_FEATURE_OCTOPUS,			\

We should always have the CEPHFS_FEATURE_MAX as the last element of the 
array here, though the _MAX equals to _OCTOPUS and the _OCTOPUS will be 
traversed twice when encoding. The _MAX here is just a guard bit when 
counting the feature bits when encoding.

The change here should be:

         CEPHFS_FEATURE_REPLY_ENCODING,		\
         CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
         CEPHFS_FEATURE_MULTI_RECONNECT,		\
+       CEPHFS_FEATURE_OCTOPUS,			\
						\
         CEPHFS_FEATURE_MAX,			\

Then we won't have to worry about the previous _FEATURE_ bits' order.

>   }
>   #define CEPHFS_FEATURES_CLIENT_REQUIRED {}
>   
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index ec4d66d7c261..33e03fbba888 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -136,6 +136,8 @@ struct ceph_fs_client {
>   #endif
>   };
>   
> +/* Special placeholder value for a cap_id during an asynchronous create. */
> +#define        CEPH_CAP_ID_TBD         -1ULL
>   
>   /*
>    * File i/o capability.  This tracks shared state with the metadata
> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> index a099f60feb7b..b127563e21a1 100644
> --- a/include/linux/ceph/ceph_fs.h
> +++ b/include/linux/ceph/ceph_fs.h
> @@ -444,8 +444,9 @@ union ceph_mds_request_args {
>   	} __attribute__ ((packed)) lookupino;
>   } __attribute__ ((packed));
>   
> -#define CEPH_MDS_FLAG_REPLAY        1  /* this is a replayed op */
> -#define CEPH_MDS_FLAG_WANT_DENTRY   2  /* want dentry in reply */
> +#define CEPH_MDS_FLAG_REPLAY		1  /* this is a replayed op */
> +#define CEPH_MDS_FLAG_WANT_DENTRY	2  /* want dentry in reply */
> +#define CEPH_MDS_FLAG_ASYNC		4  /* request is asynchronous */
>   
>   struct ceph_mds_request_head {
>   	__le64 oldest_client_tid;
> @@ -658,6 +659,9 @@ int ceph_flags_to_mode(int flags);
>   #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
>   			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
>   			   CEPH_CAP_PIN)
> +#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
> +			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
> +			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
>   
>   #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
>   			CEPH_LOCK_IXATTR)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 8/9] ceph: copy layout, max_size and truncate_size on successful sync create
  2020-01-10 20:56 ` [RFC PATCH 8/9] ceph: copy layout, max_size and truncate_size on successful sync create Jeff Layton
@ 2020-01-13  3:51   ` Yan, Zheng
  2020-01-13 13:26     ` Jeff Layton
  2020-01-13  9:01   ` Yan, Zheng
  1 sibling, 1 reply; 28+ messages in thread
From: Yan, Zheng @ 2020-01-13  3:51 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell

On 1/11/20 4:56 AM, Jeff Layton wrote:
> It doesn't do much good to do an asynchronous create unless we can do
> I/O to it before the create reply comes in. That means we need layout
> info the new file before we've gotten the response from the MDS.
> 
> All files created in a directory will initially inherit the same layout,
> so copy off the requisite info from the first synchronous create in the
> directory. Save it in the same fields in the directory inode, as those
> are otherwise unsed for dir inodes. This means we need to be a bit
> careful about only updating layout info on non-dir inodes.
> 
> Also, zero out the layout when we drop Dc caps in the dir.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/caps.c  | 24 ++++++++++++++++++++----
>   fs/ceph/file.c  | 24 +++++++++++++++++++++++-
>   fs/ceph/inode.c |  4 ++--
>   3 files changed, 45 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 7fc87b693ba4..b96fb1378479 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -2847,7 +2847,7 @@ int ceph_get_caps(struct file *filp, int need, int want,
>   			return ret;
>   		}
>   
> -		if (S_ISREG(ci->vfs_inode.i_mode) &&
> +		if (!S_ISDIR(ci->vfs_inode.i_mode) &&
>   		    ci->i_inline_version != CEPH_INLINE_NONE &&
>   		    (_got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) &&
>   		    i_size_read(inode) > 0) {
> @@ -2944,9 +2944,17 @@ void ceph_put_cap_refs(struct ceph_inode_info *ci, int had)
>   	if (had & CEPH_CAP_FILE_RD)
>   		if (--ci->i_rd_ref == 0)
>   			last++;
> -	if (had & CEPH_CAP_FILE_CACHE)
> -		if (--ci->i_rdcache_ref == 0)
> +	if (had & CEPH_CAP_FILE_CACHE) {
> +		if (--ci->i_rdcache_ref == 0) {
>   			last++;
> +			/* Zero out layout if we lost CREATE caps */
> +			if (S_ISDIR(inode->i_mode) &&
> +			    !(__ceph_caps_issued(ci, NULL) & CEPH_CAP_DIR_CREATE)) {
> +				ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
> +				memset(&ci->i_layout, 0, sizeof(ci->i_layout));
> +			}
> +		}
> +	}

should do this in __check_cap_issue

>   	if (had & CEPH_CAP_FILE_EXCL)
>   		if (--ci->i_fx_ref == 0)
>   			last++;
> @@ -3264,7 +3272,8 @@ static void handle_cap_grant(struct inode *inode,
>   		ci->i_subdirs = extra_info->nsubdirs;
>   	}
>   
> -	if (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)) {
> +	if (!S_ISDIR(inode->i_mode) &&
> +	    (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
>   		/* file layout may have changed */
>   		s64 old_pool = ci->i_layout.pool_id;
>   		struct ceph_string *old_ns;
> @@ -3336,6 +3345,13 @@ static void handle_cap_grant(struct inode *inode,
>   		     ceph_cap_string(cap->issued),
>   		     ceph_cap_string(newcaps),
>   		     ceph_cap_string(revoking));
> +
> +		if (S_ISDIR(inode->i_mode) &&
> +		    (revoking & CEPH_CAP_DIR_CREATE) && !ci->i_rdcache_ref) {
> +			ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
> +			memset(&ci->i_layout, 0, sizeof(ci->i_layout));
> +		}


same here

> +
>   		if (S_ISREG(inode->i_mode) &&
>   		    (revoking & used & CEPH_CAP_FILE_BUFFER))
>   			writeback = true;  /* initiate writeback; will delay ack */
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 1e6cdf2dfe90..d4d7a277faf1 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -430,6 +430,25 @@ int ceph_open(struct inode *inode, struct file *file)
>   	return err;
>   }
>   
> +/* Clone the layout from a synchronous create, if the dir now has Dc caps */
> +static void
> +copy_file_layout(struct inode *dst, struct inode *src)
> +{
> +	struct ceph_inode_info *cdst = ceph_inode(dst);
> +	struct ceph_inode_info *csrc = ceph_inode(src);
> +
> +	spin_lock(&cdst->i_ceph_lock);
> +	if ((__ceph_caps_issued(cdst, NULL) & CEPH_CAP_DIR_CREATE) &&
> +	    !ceph_file_layout_is_valid(&cdst->i_layout)) {
> +		memcpy(&cdst->i_layout, &csrc->i_layout,
> +			sizeof(cdst->i_layout));
> +		rcu_assign_pointer(cdst->i_layout.pool_ns,
> +				   ceph_try_get_string(csrc->i_layout.pool_ns));
> +		cdst->i_max_size = csrc->i_max_size;
> +		cdst->i_truncate_size = csrc->i_truncate_size;
> +	}
> +	spin_unlock(&cdst->i_ceph_lock);
> +}
>   
>   /*
>    * Do a lookup + open with a single request.  If we get a non-existent
> @@ -518,7 +537,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   	} else {
>   		dout("atomic_open finish_open on dn %p\n", dn);
>   		if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
> -			ceph_init_inode_acls(d_inode(dentry), &as_ctx);
> +			struct inode *newino = d_inode(dentry);
> +
> +			copy_file_layout(dir, newino);
> +			ceph_init_inode_acls(newino, &as_ctx);
>   			file->f_mode |= FMODE_CREATED;
>   		}
>   		err = finish_open(file, dentry, ceph_open);
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 9cfc093fd273..8b51051b79b0 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -848,8 +848,8 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>   		ci->i_subdirs = le64_to_cpu(info->subdirs);
>   	}
>   
> -	if (new_version ||
> -	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
> +	if (!S_ISDIR(inode->i_mode) && (new_version ||
> +	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)))) {
>   		s64 old_pool = ci->i_layout.pool_id;
>   		struct ceph_string *old_ns;
>   
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 8/9] ceph: copy layout, max_size and truncate_size on successful sync create
  2020-01-10 20:56 ` [RFC PATCH 8/9] ceph: copy layout, max_size and truncate_size on successful sync create Jeff Layton
  2020-01-13  3:51   ` Yan, Zheng
@ 2020-01-13  9:01   ` Yan, Zheng
  2020-01-13 13:29     ` Jeff Layton
  1 sibling, 1 reply; 28+ messages in thread
From: Yan, Zheng @ 2020-01-13  9:01 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell

On 1/11/20 4:56 AM, Jeff Layton wrote:
> It doesn't do much good to do an asynchronous create unless we can do
> I/O to it before the create reply comes in. That means we need layout
> info the new file before we've gotten the response from the MDS.
> 
> All files created in a directory will initially inherit the same layout,
> so copy off the requisite info from the first synchronous create in the
> directory. Save it in the same fields in the directory inode, as those
> are otherwise unsed for dir inodes. This means we need to be a bit
> careful about only updating layout info on non-dir inodes.
> 
> Also, zero out the layout when we drop Dc caps in the dir.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/caps.c  | 24 ++++++++++++++++++++----
>   fs/ceph/file.c  | 24 +++++++++++++++++++++++-
>   fs/ceph/inode.c |  4 ++--
>   3 files changed, 45 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 7fc87b693ba4..b96fb1378479 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -2847,7 +2847,7 @@ int ceph_get_caps(struct file *filp, int need, int want,
>   			return ret;
>   		}
>   
> -		if (S_ISREG(ci->vfs_inode.i_mode) &&
> +		if (!S_ISDIR(ci->vfs_inode.i_mode) &&
>   		    ci->i_inline_version != CEPH_INLINE_NONE &&
>   		    (_got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) &&
>   		    i_size_read(inode) > 0) {
> @@ -2944,9 +2944,17 @@ void ceph_put_cap_refs(struct ceph_inode_info *ci, int had)
>   	if (had & CEPH_CAP_FILE_RD)
>   		if (--ci->i_rd_ref == 0)
>   			last++;
> -	if (had & CEPH_CAP_FILE_CACHE)
> -		if (--ci->i_rdcache_ref == 0)
> +	if (had & CEPH_CAP_FILE_CACHE) {
> +		if (--ci->i_rdcache_ref == 0) {
>   			last++;
> +			/* Zero out layout if we lost CREATE caps */
> +			if (S_ISDIR(inode->i_mode) &&
> +			    !(__ceph_caps_issued(ci, NULL) & CEPH_CAP_DIR_CREATE)) {
> +				ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
> +				memset(&ci->i_layout, 0, sizeof(ci->i_layout));
> +			}
> +		}
> +	}
>   	if (had & CEPH_CAP_FILE_EXCL)
>   		if (--ci->i_fx_ref == 0)
>   			last++;
> @@ -3264,7 +3272,8 @@ static void handle_cap_grant(struct inode *inode,
>   		ci->i_subdirs = extra_info->nsubdirs;
>   	}
>   
> -	if (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)) {
> +	if (!S_ISDIR(inode->i_mode) &&
> +	    (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
>   		/* file layout may have changed */
>   		s64 old_pool = ci->i_layout.pool_id;
>   		struct ceph_string *old_ns;
> @@ -3336,6 +3345,13 @@ static void handle_cap_grant(struct inode *inode,
>   		     ceph_cap_string(cap->issued),
>   		     ceph_cap_string(newcaps),
>   		     ceph_cap_string(revoking));
> +
> +		if (S_ISDIR(inode->i_mode) &&
> +		    (revoking & CEPH_CAP_DIR_CREATE) && !ci->i_rdcache_ref) {
> +			ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
> +			memset(&ci->i_layout, 0, sizeof(ci->i_layout));
> +		}
> +
>   		if (S_ISREG(inode->i_mode) &&
>   		    (revoking & used & CEPH_CAP_FILE_BUFFER))
>   			writeback = true;  /* initiate writeback; will delay ack */
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 1e6cdf2dfe90..d4d7a277faf1 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -430,6 +430,25 @@ int ceph_open(struct inode *inode, struct file *file)
>   	return err;
>   }
>   
> +/* Clone the layout from a synchronous create, if the dir now has Dc caps */
> +static void
> +copy_file_layout(struct inode *dst, struct inode *src)
> +{
> +	struct ceph_inode_info *cdst = ceph_inode(dst);
> +	struct ceph_inode_info *csrc = ceph_inode(src);
> +
> +	spin_lock(&cdst->i_ceph_lock);
> +	if ((__ceph_caps_issued(cdst, NULL) & CEPH_CAP_DIR_CREATE) &&
> +	    !ceph_file_layout_is_valid(&cdst->i_layout)) {
> +		memcpy(&cdst->i_layout, &csrc->i_layout,
> +			sizeof(cdst->i_layout));

directory's i_layout is used for other purpose. shouldn't modify it.

> +		rcu_assign_pointer(cdst->i_layout.pool_ns,
> +				   ceph_try_get_string(csrc->i_layout.pool_ns));
> +		cdst->i_max_size = csrc->i_max_size; > +		cdst->i_truncate_size = csrc->i_truncate_size;
no need to save above two. just set truncate size of new file to 
(u64)-1, set max_size of new file to its layout.stripe_unit;

max_size == layout.strip_unit ensure that client only write to the first 
object before its writeable range is persistent.

> +	}
> +	spin_unlock(&cdst->i_ceph_lock);
> +}
>   
>   /*
>    * Do a lookup + open with a single request.  If we get a non-existent
> @@ -518,7 +537,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   	} else {
>   		dout("atomic_open finish_open on dn %p\n", dn);
>   		if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
> -			ceph_init_inode_acls(d_inode(dentry), &as_ctx);
> +			struct inode *newino = d_inode(dentry);
> +
> +			copy_file_layout(dir, newino);
> +			ceph_init_inode_acls(newino, &as_ctx);
>   			file->f_mode |= FMODE_CREATED;
>   		}
>   		err = finish_open(file, dentry, ceph_open);
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 9cfc093fd273..8b51051b79b0 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -848,8 +848,8 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>   		ci->i_subdirs = le64_to_cpu(info->subdirs);
>   	}
>   
> -	if (new_version ||
> -	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
> +	if (!S_ISDIR(inode->i_mode) && (new_version ||
> +	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)))) {
>   		s64 old_pool = ci->i_layout.pool_id;
>   		struct ceph_string *old_ns;
>   
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 7/9] ceph: add flag to delegate an inode number for async create
  2020-01-10 20:56 ` [RFC PATCH 7/9] ceph: add flag to delegate an inode number for async create Jeff Layton
@ 2020-01-13  9:17   ` Yan, Zheng
  2020-01-13 13:31     ` Jeff Layton
  0 siblings, 1 reply; 28+ messages in thread
From: Yan, Zheng @ 2020-01-13  9:17 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell

On 1/11/20 4:56 AM, Jeff Layton wrote:
> In order to issue an async create request, we need to send an inode
> number when we do the request, but we don't know which to which MDS
> we'll be issuing the request.
> 

the request should be sent to auth mds (dir_ci->i_auth_cap->session) of 
directory. I think grabing inode number in get_caps_for_async_create() 
is simpler.

> Add a new r_req_flag that tells the request sending machinery to
> grab an inode number from the delegated set, and encode it into the
> request. If it can't get one then have it return -ECHILD. The
> requestor can then reissue a synchronous request.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/inode.c      |  1 +
>   fs/ceph/mds_client.c | 19 ++++++++++++++++++-
>   fs/ceph/mds_client.h |  2 ++
>   3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 79bb1e6af090..9cfc093fd273 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1317,6 +1317,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
>   		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
>   				NULL, session,
>   				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
> +				 !test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags) &&
>   				 rinfo->head->result == 0) ?  req->r_fmode : -1,
>   				&req->r_caps_reservation);
>   		if (err < 0) {
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 852c46550d96..9e7492b21b50 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2623,7 +2623,10 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
>   	rhead->flags = cpu_to_le32(flags);
>   	rhead->num_fwd = req->r_num_fwd;
>   	rhead->num_retry = req->r_attempts - 1;
> -	rhead->ino = 0;
> +	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags))
> +		rhead->ino = cpu_to_le64(req->r_deleg_ino);
> +	else
> +		rhead->ino = 0;
>   
>   	dout(" r_parent = %p\n", req->r_parent);
>   	return 0;
> @@ -2736,6 +2739,20 @@ static void __do_request(struct ceph_mds_client *mdsc,
>   		goto out_session;
>   	}
>   
> +	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags) &&
> +	    !req->r_deleg_ino) {
> +		req->r_deleg_ino = get_delegated_ino(req->r_session);
> +
> +		if (!req->r_deleg_ino) {
> +			/*
> +			 * If we can't get a deleg ino, exit with -ECHILD,
> +			 * so the caller can reissue a sync request
> +			 */
> +			err = -ECHILD;
> +			goto out_session;
> +		}
> +	}
> +
>   	/* send request */
>   	req->r_resend_mds = -1;   /* forget any previous mds hint */
>   
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 3db7ef47e1c9..e0b36be7c44f 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -258,6 +258,7 @@ struct ceph_mds_request {
>   #define CEPH_MDS_R_GOT_RESULT		(5) /* got a result */
>   #define CEPH_MDS_R_DID_PREPOPULATE	(6) /* prepopulated readdir */
>   #define CEPH_MDS_R_PARENT_LOCKED	(7) /* is r_parent->i_rwsem wlocked? */
> +#define CEPH_MDS_R_DELEG_INO		(8) /* attempt to get r_deleg_ino */
>   	unsigned long	r_req_flags;
>   
>   	struct mutex r_fill_mutex;
> @@ -307,6 +308,7 @@ struct ceph_mds_request {
>   	int               r_num_fwd;    /* number of forward attempts */
>   	int               r_resend_mds; /* mds to resend to next, if any*/
>   	u32               r_sent_on_mseq; /* cap mseq request was sent at*/
> +	unsigned long	  r_deleg_ino;
>   
>   	struct list_head  r_wait;
>   	struct completion r_completion;
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 9/9] ceph: attempt to do async create when possible
  2020-01-10 20:56 ` [RFC PATCH 9/9] ceph: attempt to do async create when possible Jeff Layton
  2020-01-13  1:43   ` Xiubo Li
@ 2020-01-13 10:53   ` Yan, Zheng
  2020-01-13 13:44     ` Jeff Layton
  1 sibling, 1 reply; 28+ messages in thread
From: Yan, Zheng @ 2020-01-13 10:53 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell

On 1/11/20 4:56 AM, Jeff Layton wrote:
> With the Octopus release, the MDS will hand out directoy create caps.
> If we have Fxc caps on the directory, and complete directory information
> or a known negative dentry, then we can return without waiting on the
> reply, allowing the open() call to return very quickly to userland.
> 
> We use the normal ceph_fill_inode() routine to fill in the inode, so we
> have to gin up some reply inode information with what we'd expect a
> newly-created inode to have. The client assumes that it has a full set
> of caps on the new inode, and that the MDS will revoke them when there
> is conflicting access.
> 
> This functionality is gated on the enable_async_dirops module option,
> along with async unlinks, and on the server supporting the Octopus
> CephFS feature bit.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/caps.c               |   7 +-
>   fs/ceph/file.c               | 178 +++++++++++++++++++++++++++++++++--
>   fs/ceph/mds_client.c         |  12 ++-
>   fs/ceph/mds_client.h         |   3 +-
>   fs/ceph/super.h              |   2 +
>   include/linux/ceph/ceph_fs.h |   8 +-
>   6 files changed, 191 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index b96fb1378479..21a8a2ddc94b 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -654,6 +654,10 @@ void ceph_add_cap(struct inode *inode,
>   		session->s_nr_caps++;
>   		spin_unlock(&session->s_cap_lock);
>   	} else {
> +		/* Did an async create race with the reply? */
> +		if (cap_id == CEPH_CAP_ID_TBD && cap->issued == issued)
> +			return;
> +
>   		spin_lock(&session->s_cap_lock);
>   		list_move_tail(&cap->session_caps, &session->s_caps);
>   		spin_unlock(&session->s_cap_lock);
> @@ -672,7 +676,8 @@ void ceph_add_cap(struct inode *inode,
>   		 */
>   		if (ceph_seq_cmp(seq, cap->seq) <= 0) {
>   			WARN_ON(cap != ci->i_auth_cap);
> -			WARN_ON(cap->cap_id != cap_id);
> +			WARN_ON(cap_id != CEPH_CAP_ID_TBD &&
> +				cap->cap_id != cap_id);
>   			seq = cap->seq;
>   			mseq = cap->mseq;
>   			issued |= cap->issued;
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index d4d7a277faf1..706abd71b731 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -450,6 +450,141 @@ copy_file_layout(struct inode *dst, struct inode *src)
>   	spin_unlock(&cdst->i_ceph_lock);
>   }
>   
> +static bool get_caps_for_async_create(struct inode *dir, struct dentry *dentry)
> +{
> +	struct ceph_inode_info *ci = ceph_inode(dir);
> +	int ret, want, got;
> +
> +	/*
> +	 * We can do an async create if we either have a valid negative dentry
> +	 * or the complete contents of the directory. Do a quick check without
> +	 * cap refs.
> +	 */
> +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||

what does (d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) mean?

I think we can async create if dentry is negative and its 
lease_shared_gen == ci->i_shared_gen.

> +	    !ceph_file_layout_is_valid(&ci->i_layout))
> +		return false;
> +
> +	/* Try to get caps */
> +	want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE;
> +	ret = ceph_try_get_caps(dir, 0, want, true, &got);
> +	dout("Fx on %p ret=%d got=%d\n", dir, ret, got);
> +	if (ret != 1)
> +		return false;
> +	if (got != want) {
> +		ceph_put_cap_refs(ci, got);
> +		return false;
> +	}
> +
> +	/* Check again, now that we hold cap refs */
> +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
> +	    !ceph_file_layout_is_valid(&ci->i_layout)) {
> +		ceph_put_cap_refs(ci, got);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
> +                                 struct ceph_mds_request *req)
> +{
> +	/* If we never sent anything then nothing to clean up */
> +	if (req->r_err == -ECHILD)
> +		goto out;
> +
> +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
> +
> +	if (req->r_target_inode) {
> +		u64 ino = ceph_vino(req->r_target_inode).ino;
> +
> +		if (req->r_deleg_ino != ino)
> +			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%lx target=0x%llx\n",
> +				__func__, req->r_err, req->r_deleg_ino, ino);
> +		mapping_set_error(req->r_target_inode->i_mapping, req->r_err);
> +	} else {
> +		pr_warn("%s: no req->r_target_inode for 0x%lx\n", __func__,
> +			req->r_deleg_ino);
> +	}
> +out:
> +	ceph_put_cap_refs(ceph_inode(req->r_parent),
> +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
> +}
> +
> +static int ceph_finish_async_open(struct inode *dir, struct dentry *dentry,
> +				  struct file *file, umode_t mode,
> +				  struct ceph_mds_request *req,
> +				  struct ceph_acl_sec_ctx *as_ctx)
> +{
> +	int ret;
> +	struct ceph_mds_reply_inode in = { };
> +	struct ceph_mds_reply_info_in iinfo = { .in = &in };
> +	struct ceph_inode_info *ci = ceph_inode(dir);
> +	struct inode *inode;
> +	struct timespec64 now;
> +	struct ceph_vino vino = { .ino = req->r_deleg_ino,
> +				  .snap = CEPH_NOSNAP };
> +
> +	ktime_get_real_ts64(&now);
> +
> +	inode = ceph_get_inode(dentry->d_sb, vino);
> +	if (IS_ERR(inode))
> +		return PTR_ERR(inode);
> +
> +	/* If we can't get a buffer, just carry on */
> +	iinfo.xattr_data = kzalloc(4, GFP_NOFS);
> +	if (iinfo.xattr_data)
> +		iinfo.xattr_len = 4;

??

I think we should decode req->r_pagelist into xattrs


> +
> +	iinfo.inline_version = CEPH_INLINE_NONE;
> +	iinfo.change_attr = 1;
> +	ceph_encode_timespec64(&iinfo.btime, &now);
> +
> +	in.ino = cpu_to_le64(vino.ino);
> +	in.snapid = cpu_to_le64(CEPH_NOSNAP);
> +	in.version = cpu_to_le64(1);	// ???
> +	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
> +	in.cap.cap_id = cpu_to_le64(CEPH_CAP_ID_TBD);
> +	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
> +	in.cap.flags = CEPH_CAP_FLAG_AUTH;
> +	in.ctime = in.mtime = in.atime = iinfo.btime;
> +	in.mode = cpu_to_le32((u32)mode);
> +	in.truncate_seq = cpu_to_le32(1);
> +	in.truncate_size = cpu_to_le64(ci->i_truncate_size);
> +	in.max_size = cpu_to_le64(ci->i_max_size);
> +	in.xattr_version = cpu_to_le64(1);
> +	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
> +	in.gid = cpu_to_le32(from_kgid(&init_user_ns, current_fsgid()));

if dir has S_ISGID, new file's gid should be inherit from dir's gid


> +	in.nlink = cpu_to_le32(1);
> +
> +	ceph_file_layout_to_legacy(&ci->i_layout, &in.layout);
> +
> +	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
> +			      req->r_fmode, NULL);
> +	if (ret) {
> +		dout("%s failed to fill inode: %d\n", __func__, ret);
> +		if (inode->i_state & I_NEW)
> +			discard_new_inode(inode);
> +	} else {
> +		struct dentry *dn;
> +
> +		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
> +			vino.ino, dir->i_ino, dentry->d_name.name);
> +		ceph_dir_clear_ordered(dir);
> +		ceph_init_inode_acls(inode, as_ctx);
> +		if (inode->i_state & I_NEW)
> +			unlock_new_inode(inode);
> +		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
> +			if (!d_unhashed(dentry))
> +				d_drop(dentry);
> +			dn = d_splice_alias(inode, dentry);
> +			WARN_ON_ONCE(dn && dn != dentry);
> +		}
> +		file->f_mode |= FMODE_CREATED;
> +		ret = finish_open(file, dentry, ceph_open);
> +	}
> +	return ret;
> +}
> +
>   /*
>    * Do a lookup + open with a single request.  If we get a non-existent
>    * file or symlink, return 1 so the VFS can retry.
> @@ -462,6 +597,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   	struct ceph_mds_request *req;
>   	struct dentry *dn;
>   	struct ceph_acl_sec_ctx as_ctx = {};
> +	bool try_async = enable_async_dirops;
>   	int mask;
>   	int err;
>   
> @@ -486,6 +622,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   		return -ENOENT;
>   	}
>   
> +retry:
>   	/* do the open */
>   	req = prepare_open_request(dir->i_sb, flags, mode);
>   	if (IS_ERR(req)) {
> @@ -494,6 +631,12 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   	}
>   	req->r_dentry = dget(dentry);
>   	req->r_num_caps = 2;
> +	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> +	if (ceph_security_xattr_wanted(dir))
> +		mask |= CEPH_CAP_XATTR_SHARED;
> +	req->r_args.open.mask = cpu_to_le32(mask);
> +	req->r_parent = dir;
> +
>   	if (flags & O_CREAT) {
>   		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
>   		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
> @@ -501,21 +644,37 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   			req->r_pagelist = as_ctx.pagelist;
>   			as_ctx.pagelist = NULL;
>   		}
> +		if (try_async && get_caps_for_async_create(dir, dentry)) {
> +			set_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags);
> +			req->r_callback = ceph_async_create_cb;
> +			err = ceph_mdsc_submit_request(mdsc, dir, req);
> +			switch (err) {
> +			case 0:
> +				/* set up inode, dentry and return */
> +				err = ceph_finish_async_open(dir, dentry, file,
> +							mode, req, &as_ctx);
> +				goto out_req;
> +			case -ECHILD:
> +				/* do a sync create */
> +				try_async = false;
> +				as_ctx.pagelist = req->r_pagelist;
> +				req->r_pagelist = NULL;
> +				ceph_mdsc_put_request(req);
> +				goto retry;
> +			default:
> +				/* Hard error, give up */
> +				goto out_req;
> +			}
> +		}
>   	}
>   
> -       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> -       if (ceph_security_xattr_wanted(dir))
> -               mask |= CEPH_CAP_XATTR_SHARED;
> -       req->r_args.open.mask = cpu_to_le32(mask);
> -
> -	req->r_parent = dir;
>   	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
>   	err = ceph_mdsc_do_request(mdsc,
>   				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
>   				   req);
>   	err = ceph_handle_snapdir(req, dentry, err);
>   	if (err)
> -		goto out_req;
> +		goto out_fmode;
>   
>   	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
>   		err = ceph_handle_notrace_create(dir, dentry);
> @@ -529,7 +688,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   		dn = NULL;
>   	}
>   	if (err)
> -		goto out_req;
> +		goto out_fmode;
>   	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
>   		/* make vfs retry on splice, ENOENT, or symlink */
>   		dout("atomic_open finish_no_open on dn %p\n", dn);
> @@ -545,9 +704,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>   		}
>   		err = finish_open(file, dentry, ceph_open);
>   	}
> -out_req:
> +out_fmode:
>   	if (!req->r_err && req->r_target_inode)
>   		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
> +out_req:
>   	ceph_mdsc_put_request(req);
>   out_ctx:
>   	ceph_release_acl_sec_ctx(&as_ctx);
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 9e7492b21b50..c76d6e7f8136 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2620,14 +2620,16 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
>   		flags |= CEPH_MDS_FLAG_REPLAY;
>   	if (req->r_parent)
>   		flags |= CEPH_MDS_FLAG_WANT_DENTRY;
> -	rhead->flags = cpu_to_le32(flags);
> -	rhead->num_fwd = req->r_num_fwd;
> -	rhead->num_retry = req->r_attempts - 1;
> -	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags))
> +	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags)) {
>   		rhead->ino = cpu_to_le64(req->r_deleg_ino);
> -	else
> +		flags |= CEPH_MDS_FLAG_ASYNC;
> +	} else {
>   		rhead->ino = 0;
> +	}
>   
> +	rhead->flags = cpu_to_le32(flags);
> +	rhead->num_fwd = req->r_num_fwd;
> +	rhead->num_retry = req->r_attempts - 1;
>   	dout(" r_parent = %p\n", req->r_parent);
>   	return 0;
>   }
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index e0b36be7c44f..49e6cd5a07a2 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -39,8 +39,7 @@ enum ceph_feature_type {
>   	CEPHFS_FEATURE_REPLY_ENCODING,		\
>   	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
>   	CEPHFS_FEATURE_MULTI_RECONNECT,		\
> -						\
> -	CEPHFS_FEATURE_MAX,			\
> +	CEPHFS_FEATURE_OCTOPUS,			\
>   }
>   #define CEPHFS_FEATURES_CLIENT_REQUIRED {}
>   
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index ec4d66d7c261..33e03fbba888 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -136,6 +136,8 @@ struct ceph_fs_client {
>   #endif
>   };
>   
> +/* Special placeholder value for a cap_id during an asynchronous create. */
> +#define        CEPH_CAP_ID_TBD         -1ULL
>   
>   /*
>    * File i/o capability.  This tracks shared state with the metadata
> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> index a099f60feb7b..b127563e21a1 100644
> --- a/include/linux/ceph/ceph_fs.h
> +++ b/include/linux/ceph/ceph_fs.h
> @@ -444,8 +444,9 @@ union ceph_mds_request_args {
>   	} __attribute__ ((packed)) lookupino;
>   } __attribute__ ((packed));
>   
> -#define CEPH_MDS_FLAG_REPLAY        1  /* this is a replayed op */
> -#define CEPH_MDS_FLAG_WANT_DENTRY   2  /* want dentry in reply */
> +#define CEPH_MDS_FLAG_REPLAY		1  /* this is a replayed op */
> +#define CEPH_MDS_FLAG_WANT_DENTRY	2  /* want dentry in reply */
> +#define CEPH_MDS_FLAG_ASYNC		4  /* request is asynchronous */
>   
>   struct ceph_mds_request_head {
>   	__le64 oldest_client_tid;
> @@ -658,6 +659,9 @@ int ceph_flags_to_mode(int flags);
>   #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
>   			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
>   			   CEPH_CAP_PIN)
> +#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
> +			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
> +			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
>   
>   #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
>   			CEPH_LOCK_IXATTR)
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/9] ceph: add asynchronous create functionality
  2020-01-10 20:56 [RFC PATCH 0/9] ceph: add asynchronous create functionality Jeff Layton
                   ` (8 preceding siblings ...)
  2020-01-10 20:56 ` [RFC PATCH 9/9] ceph: attempt to do async create when possible Jeff Layton
@ 2020-01-13 11:07 ` Yan, Zheng
  9 siblings, 0 replies; 28+ messages in thread
From: Yan, Zheng @ 2020-01-13 11:07 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell

On 1/11/20 4:56 AM, Jeff Layton wrote:
> I recently sent a patchset that allows the client to do an asynchronous
> UNLINK call to the MDS when it has the appropriate caps and dentry info.
> This set adds the corresponding functionality for creates.
> 
> When the client has the appropriate caps on the parent directory and
> dentry information, and a delegated inode number, it can satisfy a
> request locally without contacting the server. This allows the kernel
> client to return very quickly from an O_CREAT open, so it can get on
> with doing other things.
> 
> These numbers are based on my personal test rig, which is a KVM client
> vs a vstart cluster running on my workstation (nothing scientific here).
> 
> A simple benchmark (with the cephfs mounted at /mnt/cephfs):
> -------------------8<-------------------
> #!/bin/sh
> 
> TESTDIR=/mnt/cephfs/test-dirops.$$
> 
> mkdir $TESTDIR
> stat $TESTDIR
> echo "Creating files in $TESTDIR"
> time for i in `seq 1 10000`; do
>      echo "foobarbaz" > $TESTDIR/$i
> done
> -------------------8<-------------------
> 
> With async dirops disabled:
> 
> real	0m9.865s
> user	0m0.353s
> sys	0m0.888s
> 
> With async dirops enabled:
> 
> real	0m5.272s
> user	0m0.104s
> sys	0m0.454s
> 
> That workload is a bit synthetic though. One workload we're interested
> in improving is untar. Untarring a deep directory tree (random kernel
> tarball I had laying around):
> 
> Disabled:
> $ time tar xf ~/linux-4.18.0-153.el8.jlayton.006.tar
> 
> real	1m35.774s
> user	0m0.835s
> sys	0m7.410s
> 
> Enabled:
> $ time tar xf ~/linux-4.18.0-153.el8.jlayton.006.tar
> 
> real	1m32.182s
> user	0m0.783s
> sys	0m6.830s
> 
> Not a huge win there. I suspect at this point that synchronous mkdir
> may be serializing behind the async creates.
> 
> It needs a lot more performance tuning and analysis, but it's now at the
> point where it's basically usable. To enable it, turn on the
> ceph.enable_async_dirops module option.
> 
> There are some places that need further work:
> 
> 1) The MDS patchset to delegate inodes to the client is not yet merged:
> 
>      https://github.com/ceph/ceph/pull/31817
> 
> 2) this is 64-bit arch only for the moment. I'm using an xarray to track
> the delegated inode numbers, and those don't do 64-bit indexes on
> 32-bit machines. Is anyone using 32-bit ceph clients? We could probably
> build an xarray of xarrays if needed.
> 
> 3) The error handling is still pretty lame. If the create fails, it'll
> set a writeback error on the parent dir and the inode itself, but the
> client could end up writing a bunch before it notices, if it even
> bothers to check. We probably need to do better here. I'm open to
> suggestions on this bit especially.
> 
> Jeff Layton (9):
>    ceph: ensure we have a new cap before continuing in fill_inode
>    ceph: print name of xattr being set in set/getxattr dout message
>    ceph: close some holes in struct ceph_mds_request
>    ceph: make ceph_fill_inode non-static
>    libceph: export ceph_file_layout_is_valid
>    ceph: decode interval_sets for delegated inos
>    ceph: add flag to delegate an inode number for async create
>    ceph: copy layout, max_size and truncate_size on successful sync
>      create
>    ceph: attempt to do async create when possible
> 
>   fs/ceph/caps.c               |  31 +++++-
>   fs/ceph/file.c               | 202 +++++++++++++++++++++++++++++++++--
>   fs/ceph/inode.c              |  57 +++++-----
>   fs/ceph/mds_client.c         | 130 ++++++++++++++++++++--
>   fs/ceph/mds_client.h         |  12 ++-
>   fs/ceph/super.h              |  10 ++
>   fs/ceph/xattr.c              |   5 +-
>   include/linux/ceph/ceph_fs.h |   8 +-
>   net/ceph/ceph_fs.c           |   1 +
>   9 files changed, 396 insertions(+), 60 deletions(-)
> 

client should wait for reply of aysnc create, before sending cap message 
or request (which operates on the creating inode) to mds


see commit "client: wait for async creating before sending request or 
cap message" in https://github.com/ceph/ceph/pull/32576

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 9/9] ceph: attempt to do async create when possible
  2020-01-13  1:43   ` Xiubo Li
@ 2020-01-13 13:16     ` Jeff Layton
  0 siblings, 0 replies; 28+ messages in thread
From: Jeff Layton @ 2020-01-13 13:16 UTC (permalink / raw)
  To: Xiubo Li, ceph-devel; +Cc: zyan, sage, idryomov, pdonnell

On Mon, 2020-01-13 at 09:43 +0800, Xiubo Li wrote:
> On 2020/1/11 4:56, Jeff Layton wrote:
> > With the Octopus release, the MDS will hand out directoy create caps.
> > If we have Fxc caps on the directory, and complete directory information
> > or a known negative dentry, then we can return without waiting on the
> > reply, allowing the open() call to return very quickly to userland.
> > 
> > We use the normal ceph_fill_inode() routine to fill in the inode, so we
> > have to gin up some reply inode information with what we'd expect a
> > newly-created inode to have. The client assumes that it has a full set
> > of caps on the new inode, and that the MDS will revoke them when there
> > is conflicting access.
> > 
> > This functionality is gated on the enable_async_dirops module option,
> > along with async unlinks, and on the server supporting the Octopus
> > CephFS feature bit.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >   fs/ceph/caps.c               |   7 +-
> >   fs/ceph/file.c               | 178 +++++++++++++++++++++++++++++++++--
> >   fs/ceph/mds_client.c         |  12 ++-
> >   fs/ceph/mds_client.h         |   3 +-
> >   fs/ceph/super.h              |   2 +
> >   include/linux/ceph/ceph_fs.h |   8 +-
> >   6 files changed, 191 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index b96fb1378479..21a8a2ddc94b 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -654,6 +654,10 @@ void ceph_add_cap(struct inode *inode,
> >   		session->s_nr_caps++;
> >   		spin_unlock(&session->s_cap_lock);
> >   	} else {
> > +		/* Did an async create race with the reply? */
> > +		if (cap_id == CEPH_CAP_ID_TBD && cap->issued == issued)
> > +			return;
> > +
> >   		spin_lock(&session->s_cap_lock);
> >   		list_move_tail(&cap->session_caps, &session->s_caps);
> >   		spin_unlock(&session->s_cap_lock);
> > @@ -672,7 +676,8 @@ void ceph_add_cap(struct inode *inode,
> >   		 */
> >   		if (ceph_seq_cmp(seq, cap->seq) <= 0) {
> >   			WARN_ON(cap != ci->i_auth_cap);
> > -			WARN_ON(cap->cap_id != cap_id);
> > +			WARN_ON(cap_id != CEPH_CAP_ID_TBD &&
> > +				cap->cap_id != cap_id);
> >   			seq = cap->seq;
> >   			mseq = cap->mseq;
> >   			issued |= cap->issued;
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index d4d7a277faf1..706abd71b731 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -450,6 +450,141 @@ copy_file_layout(struct inode *dst, struct inode *src)
> >   	spin_unlock(&cdst->i_ceph_lock);
> >   }
> >   
> > +static bool get_caps_for_async_create(struct inode *dir, struct dentry *dentry)
> > +{
> > +	struct ceph_inode_info *ci = ceph_inode(dir);
> > +	int ret, want, got;
> > +
> > +	/*
> > +	 * We can do an async create if we either have a valid negative dentry
> > +	 * or the complete contents of the directory. Do a quick check without
> > +	 * cap refs.
> > +	 */
> > +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
> > +	    !ceph_file_layout_is_valid(&ci->i_layout))
> > +		return false;
> > +
> > +	/* Try to get caps */
> > +	want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE;
> > +	ret = ceph_try_get_caps(dir, 0, want, true, &got);
> > +	dout("Fx on %p ret=%d got=%d\n", dir, ret, got);
> > +	if (ret != 1)
> > +		return false;
> > +	if (got != want) {
> > +		ceph_put_cap_refs(ci, got);
> > +		return false;
> > +	}
> > +
> > +	/* Check again, now that we hold cap refs */
> > +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
> > +	    !ceph_file_layout_is_valid(&ci->i_layout)) {
> > +		ceph_put_cap_refs(ci, got);
> > +		return false;
> > +	}
> > +
> > +	return true;
> > +}
> > +
> > +static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
> > +                                 struct ceph_mds_request *req)
> > +{
> > +	/* If we never sent anything then nothing to clean up */
> > +	if (req->r_err == -ECHILD)
> > +		goto out;
> > +
> > +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
> > +
> > +	if (req->r_target_inode) {
> > +		u64 ino = ceph_vino(req->r_target_inode).ino;
> > +
> > +		if (req->r_deleg_ino != ino)
> > +			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%lx target=0x%llx\n",
> > +				__func__, req->r_err, req->r_deleg_ino, ino);
> > +		mapping_set_error(req->r_target_inode->i_mapping, req->r_err);
> > +	} else {
> > +		pr_warn("%s: no req->r_target_inode for 0x%lx\n", __func__,
> > +			req->r_deleg_ino);
> > +	}
> > +out:
> > +	ceph_put_cap_refs(ceph_inode(req->r_parent),
> > +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
> > +}
> > +
> > +static int ceph_finish_async_open(struct inode *dir, struct dentry *dentry,
> > +				  struct file *file, umode_t mode,
> > +				  struct ceph_mds_request *req,
> > +				  struct ceph_acl_sec_ctx *as_ctx)
> > +{
> > +	int ret;
> > +	struct ceph_mds_reply_inode in = { };
> > +	struct ceph_mds_reply_info_in iinfo = { .in = &in };
> > +	struct ceph_inode_info *ci = ceph_inode(dir);
> > +	struct inode *inode;
> > +	struct timespec64 now;
> > +	struct ceph_vino vino = { .ino = req->r_deleg_ino,
> > +				  .snap = CEPH_NOSNAP };
> > +
> > +	ktime_get_real_ts64(&now);
> > +
> > +	inode = ceph_get_inode(dentry->d_sb, vino);
> > +	if (IS_ERR(inode))
> > +		return PTR_ERR(inode);
> > +
> > +	/* If we can't get a buffer, just carry on */
> > +	iinfo.xattr_data = kzalloc(4, GFP_NOFS);
> > +	if (iinfo.xattr_data)
> > +		iinfo.xattr_len = 4;
> > +
> > +	iinfo.inline_version = CEPH_INLINE_NONE;
> > +	iinfo.change_attr = 1;
> > +	ceph_encode_timespec64(&iinfo.btime, &now);
> > +
> > +	in.ino = cpu_to_le64(vino.ino);
> > +	in.snapid = cpu_to_le64(CEPH_NOSNAP);
> > +	in.version = cpu_to_le64(1);	// ???
> > +	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
> > +	in.cap.cap_id = cpu_to_le64(CEPH_CAP_ID_TBD);
> > +	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
> > +	in.cap.flags = CEPH_CAP_FLAG_AUTH;
> > +	in.ctime = in.mtime = in.atime = iinfo.btime;
> > +	in.mode = cpu_to_le32((u32)mode);
> > +	in.truncate_seq = cpu_to_le32(1);
> > +	in.truncate_size = cpu_to_le64(ci->i_truncate_size);
> > +	in.max_size = cpu_to_le64(ci->i_max_size);
> > +	in.xattr_version = cpu_to_le64(1);
> > +	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
> > +	in.gid = cpu_to_le32(from_kgid(&init_user_ns, current_fsgid()));
> > +	in.nlink = cpu_to_le32(1);
> > +
> > +	ceph_file_layout_to_legacy(&ci->i_layout, &in.layout);
> > +
> > +	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
> > +			      req->r_fmode, NULL);
> > +	if (ret) {
> > +		dout("%s failed to fill inode: %d\n", __func__, ret);
> > +		if (inode->i_state & I_NEW)
> > +			discard_new_inode(inode);
> > +	} else {
> > +		struct dentry *dn;
> > +
> > +		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
> > +			vino.ino, dir->i_ino, dentry->d_name.name);
> > +		ceph_dir_clear_ordered(dir);
> > +		ceph_init_inode_acls(inode, as_ctx);
> > +		if (inode->i_state & I_NEW)
> > +			unlock_new_inode(inode);
> > +		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
> > +			if (!d_unhashed(dentry))
> > +				d_drop(dentry);
> > +			dn = d_splice_alias(inode, dentry);
> > +			WARN_ON_ONCE(dn && dn != dentry);
> > +		}
> > +		file->f_mode |= FMODE_CREATED;
> > +		ret = finish_open(file, dentry, ceph_open);
> > +	}
> > +	return ret;
> > +}
> > +
> >   /*
> >    * Do a lookup + open with a single request.  If we get a non-existent
> >    * file or symlink, return 1 so the VFS can retry.
> > @@ -462,6 +597,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   	struct ceph_mds_request *req;
> >   	struct dentry *dn;
> >   	struct ceph_acl_sec_ctx as_ctx = {};
> > +	bool try_async = enable_async_dirops;
> >   	int mask;
> >   	int err;
> >   
> > @@ -486,6 +622,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   		return -ENOENT;
> >   	}
> >   
> > +retry:
> >   	/* do the open */
> >   	req = prepare_open_request(dir->i_sb, flags, mode);
> >   	if (IS_ERR(req)) {
> > @@ -494,6 +631,12 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   	}
> >   	req->r_dentry = dget(dentry);
> >   	req->r_num_caps = 2;
> > +	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> > +	if (ceph_security_xattr_wanted(dir))
> > +		mask |= CEPH_CAP_XATTR_SHARED;
> > +	req->r_args.open.mask = cpu_to_le32(mask);
> > +	req->r_parent = dir;
> > +
> >   	if (flags & O_CREAT) {
> >   		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
> >   		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
> > @@ -501,21 +644,37 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   			req->r_pagelist = as_ctx.pagelist;
> >   			as_ctx.pagelist = NULL;
> >   		}
> > +		if (try_async && get_caps_for_async_create(dir, dentry)) {
> > +			set_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags);
> > +			req->r_callback = ceph_async_create_cb;
> > +			err = ceph_mdsc_submit_request(mdsc, dir, req);
> > +			switch (err) {
> > +			case 0:
> > +				/* set up inode, dentry and return */
> > +				err = ceph_finish_async_open(dir, dentry, file,
> > +							mode, req, &as_ctx);
> > +				goto out_req;
> > +			case -ECHILD:
> > +				/* do a sync create */
> > +				try_async = false;
> > +				as_ctx.pagelist = req->r_pagelist;
> > +				req->r_pagelist = NULL;
> > +				ceph_mdsc_put_request(req);
> > +				goto retry;
> > +			default:
> > +				/* Hard error, give up */
> > +				goto out_req;
> > +			}
> > +		}
> >   	}
> >   
> > -       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> > -       if (ceph_security_xattr_wanted(dir))
> > -               mask |= CEPH_CAP_XATTR_SHARED;
> > -       req->r_args.open.mask = cpu_to_le32(mask);
> > -
> > -	req->r_parent = dir;
> >   	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> >   	err = ceph_mdsc_do_request(mdsc,
> >   				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
> >   				   req);
> >   	err = ceph_handle_snapdir(req, dentry, err);
> >   	if (err)
> > -		goto out_req;
> > +		goto out_fmode;
> >   
> >   	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
> >   		err = ceph_handle_notrace_create(dir, dentry);
> > @@ -529,7 +688,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   		dn = NULL;
> >   	}
> >   	if (err)
> > -		goto out_req;
> > +		goto out_fmode;
> >   	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
> >   		/* make vfs retry on splice, ENOENT, or symlink */
> >   		dout("atomic_open finish_no_open on dn %p\n", dn);
> > @@ -545,9 +704,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   		}
> >   		err = finish_open(file, dentry, ceph_open);
> >   	}
> > -out_req:
> > +out_fmode:
> >   	if (!req->r_err && req->r_target_inode)
> >   		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
> > +out_req:
> >   	ceph_mdsc_put_request(req);
> >   out_ctx:
> >   	ceph_release_acl_sec_ctx(&as_ctx);
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 9e7492b21b50..c76d6e7f8136 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -2620,14 +2620,16 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
> >   		flags |= CEPH_MDS_FLAG_REPLAY;
> >   	if (req->r_parent)
> >   		flags |= CEPH_MDS_FLAG_WANT_DENTRY;
> > -	rhead->flags = cpu_to_le32(flags);
> > -	rhead->num_fwd = req->r_num_fwd;
> > -	rhead->num_retry = req->r_attempts - 1;
> > -	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags))
> > +	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags)) {
> >   		rhead->ino = cpu_to_le64(req->r_deleg_ino);
> > -	else
> > +		flags |= CEPH_MDS_FLAG_ASYNC;
> > +	} else {
> >   		rhead->ino = 0;
> > +	}
> >   
> > +	rhead->flags = cpu_to_le32(flags);
> > +	rhead->num_fwd = req->r_num_fwd;
> > +	rhead->num_retry = req->r_attempts - 1;
> >   	dout(" r_parent = %p\n", req->r_parent);
> >   	return 0;
> >   }
> > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > index e0b36be7c44f..49e6cd5a07a2 100644
> > --- a/fs/ceph/mds_client.h
> > +++ b/fs/ceph/mds_client.h
> > @@ -39,8 +39,7 @@ enum ceph_feature_type {
> >   	CEPHFS_FEATURE_REPLY_ENCODING,		\
> >   	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
> >   	CEPHFS_FEATURE_MULTI_RECONNECT,		\
> > -						\
> > -	CEPHFS_FEATURE_MAX,			\
> > +	CEPHFS_FEATURE_OCTOPUS,			\
> 
> We should always have the CEPHFS_FEATURE_MAX as the last element of the 
> array here, though the _MAX equals to _OCTOPUS and the _OCTOPUS will be 
> traversed twice when encoding. The _MAX here is just a guard bit when 
> counting the feature bits when encoding.
> 
> The change here should be:
> 
>          CEPHFS_FEATURE_REPLY_ENCODING,		\
>          CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
>          CEPHFS_FEATURE_MULTI_RECONNECT,		\
> +       CEPHFS_FEATURE_OCTOPUS,			\
> 						\
>          CEPHFS_FEATURE_MAX,			\
> 
> Then we won't have to worry about the previous _FEATURE_ bits' order.
> 

Definitely. That was just me being sloppy. Also, it sounds like we may
need to have this on a separate bit from Octopus so this part may change
anyway.

> >   }
> >   #define CEPHFS_FEATURES_CLIENT_REQUIRED {}
> >   
> > diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> > index ec4d66d7c261..33e03fbba888 100644
> > --- a/fs/ceph/super.h
> > +++ b/fs/ceph/super.h
> > @@ -136,6 +136,8 @@ struct ceph_fs_client {
> >   #endif
> >   };
> >   
> > +/* Special placeholder value for a cap_id during an asynchronous create. */
> > +#define        CEPH_CAP_ID_TBD         -1ULL
> >   
> >   /*
> >    * File i/o capability.  This tracks shared state with the metadata
> > diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> > index a099f60feb7b..b127563e21a1 100644
> > --- a/include/linux/ceph/ceph_fs.h
> > +++ b/include/linux/ceph/ceph_fs.h
> > @@ -444,8 +444,9 @@ union ceph_mds_request_args {
> >   	} __attribute__ ((packed)) lookupino;
> >   } __attribute__ ((packed));
> >   
> > -#define CEPH_MDS_FLAG_REPLAY        1  /* this is a replayed op */
> > -#define CEPH_MDS_FLAG_WANT_DENTRY   2  /* want dentry in reply */
> > +#define CEPH_MDS_FLAG_REPLAY		1  /* this is a replayed op */
> > +#define CEPH_MDS_FLAG_WANT_DENTRY	2  /* want dentry in reply */
> > +#define CEPH_MDS_FLAG_ASYNC		4  /* request is asynchronous */
> >   
> >   struct ceph_mds_request_head {
> >   	__le64 oldest_client_tid;
> > @@ -658,6 +659,9 @@ int ceph_flags_to_mode(int flags);
> >   #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
> >   			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
> >   			   CEPH_CAP_PIN)
> > +#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
> > +			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
> > +			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
> >   
> >   #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
> >   			CEPH_LOCK_IXATTR)
> 
> 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 8/9] ceph: copy layout, max_size and truncate_size on successful sync create
  2020-01-13  3:51   ` Yan, Zheng
@ 2020-01-13 13:26     ` Jeff Layton
  2020-01-13 14:56       ` Yan, Zheng
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Layton @ 2020-01-13 13:26 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell

On Mon, 2020-01-13 at 11:51 +0800, Yan, Zheng wrote:
> On 1/11/20 4:56 AM, Jeff Layton wrote:
> > It doesn't do much good to do an asynchronous create unless we can do
> > I/O to it before the create reply comes in. That means we need layout
> > info the new file before we've gotten the response from the MDS.
> > 
> > All files created in a directory will initially inherit the same layout,
> > so copy off the requisite info from the first synchronous create in the
> > directory. Save it in the same fields in the directory inode, as those
> > are otherwise unsed for dir inodes. This means we need to be a bit
> > careful about only updating layout info on non-dir inodes.
> > 
> > Also, zero out the layout when we drop Dc caps in the dir.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >   fs/ceph/caps.c  | 24 ++++++++++++++++++++----
> >   fs/ceph/file.c  | 24 +++++++++++++++++++++++-
> >   fs/ceph/inode.c |  4 ++--
> >   3 files changed, 45 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index 7fc87b693ba4..b96fb1378479 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -2847,7 +2847,7 @@ int ceph_get_caps(struct file *filp, int need, int want,
> >   			return ret;
> >   		}
> >   
> > -		if (S_ISREG(ci->vfs_inode.i_mode) &&
> > +		if (!S_ISDIR(ci->vfs_inode.i_mode) &&
> >   		    ci->i_inline_version != CEPH_INLINE_NONE &&
> >   		    (_got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) &&
> >   		    i_size_read(inode) > 0) {
> > @@ -2944,9 +2944,17 @@ void ceph_put_cap_refs(struct ceph_inode_info *ci, int had)
> >   	if (had & CEPH_CAP_FILE_RD)
> >   		if (--ci->i_rd_ref == 0)
> >   			last++;
> > -	if (had & CEPH_CAP_FILE_CACHE)
> > -		if (--ci->i_rdcache_ref == 0)
> > +	if (had & CEPH_CAP_FILE_CACHE) {
> > +		if (--ci->i_rdcache_ref == 0) {
> >   			last++;
> > +			/* Zero out layout if we lost CREATE caps */
> > +			if (S_ISDIR(inode->i_mode) &&
> > +			    !(__ceph_caps_issued(ci, NULL) & CEPH_CAP_DIR_CREATE)) {
> > +				ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
> > +				memset(&ci->i_layout, 0, sizeof(ci->i_layout));
> > +			}
> > +		}
> > +	}
> 
> should do this in __check_cap_issue
> 

That function doesn't get called from the put codepath. Suppose one task
is setting up an async create and a Dc (DIR_CREATE) cap revoke races in.
We zero out the layout and then the inode has a bogus layout set in it.

We can't wipe the cached layout until all of the Dc references are put.

> >   	if (had & CEPH_CAP_FILE_EXCL)
> >   		if (--ci->i_fx_ref == 0)
> >   			last++;
> > @@ -3264,7 +3272,8 @@ static void handle_cap_grant(struct inode *inode,
> >   		ci->i_subdirs = extra_info->nsubdirs;
> >   	}
> >   
> > -	if (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)) {
> > +	if (!S_ISDIR(inode->i_mode) &&
> > +	    (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
> >   		/* file layout may have changed */
> >   		s64 old_pool = ci->i_layout.pool_id;
> >   		struct ceph_string *old_ns;
> > @@ -3336,6 +3345,13 @@ static void handle_cap_grant(struct inode *inode,
> >   		     ceph_cap_string(cap->issued),
> >   		     ceph_cap_string(newcaps),
> >   		     ceph_cap_string(revoking));
> > +
> > +		if (S_ISDIR(inode->i_mode) &&
> > +		    (revoking & CEPH_CAP_DIR_CREATE) && !ci->i_rdcache_ref) {
> > +			ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
> > +			memset(&ci->i_layout, 0, sizeof(ci->i_layout));
> > +		}
> 
> same here
> 
> > +
> >   		if (S_ISREG(inode->i_mode) &&
> >   		    (revoking & used & CEPH_CAP_FILE_BUFFER))
> >   			writeback = true;  /* initiate writeback; will delay ack */
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 1e6cdf2dfe90..d4d7a277faf1 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -430,6 +430,25 @@ int ceph_open(struct inode *inode, struct file *file)
> >   	return err;
> >   }
> >   
> > +/* Clone the layout from a synchronous create, if the dir now has Dc caps */
> > +static void
> > +copy_file_layout(struct inode *dst, struct inode *src)
> > +{
> > +	struct ceph_inode_info *cdst = ceph_inode(dst);
> > +	struct ceph_inode_info *csrc = ceph_inode(src);
> > +
> > +	spin_lock(&cdst->i_ceph_lock);
> > +	if ((__ceph_caps_issued(cdst, NULL) & CEPH_CAP_DIR_CREATE) &&
> > +	    !ceph_file_layout_is_valid(&cdst->i_layout)) {
> > +		memcpy(&cdst->i_layout, &csrc->i_layout,
> > +			sizeof(cdst->i_layout));
> > +		rcu_assign_pointer(cdst->i_layout.pool_ns,
> > +				   ceph_try_get_string(csrc->i_layout.pool_ns));
> > +		cdst->i_max_size = csrc->i_max_size;
> > +		cdst->i_truncate_size = csrc->i_truncate_size;
> > +	}
> > +	spin_unlock(&cdst->i_ceph_lock);
> > +}
> >   
> >   /*
> >    * Do a lookup + open with a single request.  If we get a non-existent
> > @@ -518,7 +537,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   	} else {
> >   		dout("atomic_open finish_open on dn %p\n", dn);
> >   		if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
> > -			ceph_init_inode_acls(d_inode(dentry), &as_ctx);
> > +			struct inode *newino = d_inode(dentry);
> > +
> > +			copy_file_layout(dir, newino);
> > +			ceph_init_inode_acls(newino, &as_ctx);
> >   			file->f_mode |= FMODE_CREATED;
> >   		}
> >   		err = finish_open(file, dentry, ceph_open);
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index 9cfc093fd273..8b51051b79b0 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -848,8 +848,8 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
> >   		ci->i_subdirs = le64_to_cpu(info->subdirs);
> >   	}
> >   
> > -	if (new_version ||
> > -	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
> > +	if (!S_ISDIR(inode->i_mode) && (new_version ||
> > +	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)))) {
> >   		s64 old_pool = ci->i_layout.pool_id;
> >   		struct ceph_string *old_ns;
> >   
> > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 8/9] ceph: copy layout, max_size and truncate_size on successful sync create
  2020-01-13  9:01   ` Yan, Zheng
@ 2020-01-13 13:29     ` Jeff Layton
  0 siblings, 0 replies; 28+ messages in thread
From: Jeff Layton @ 2020-01-13 13:29 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell

On Mon, 2020-01-13 at 17:01 +0800, Yan, Zheng wrote:
> On 1/11/20 4:56 AM, Jeff Layton wrote:
> > It doesn't do much good to do an asynchronous create unless we can do
> > I/O to it before the create reply comes in. That means we need layout
> > info the new file before we've gotten the response from the MDS.
> > 
> > All files created in a directory will initially inherit the same layout,
> > so copy off the requisite info from the first synchronous create in the
> > directory. Save it in the same fields in the directory inode, as those
> > are otherwise unsed for dir inodes. This means we need to be a bit
> > careful about only updating layout info on non-dir inodes.
> > 
> > Also, zero out the layout when we drop Dc caps in the dir.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >   fs/ceph/caps.c  | 24 ++++++++++++++++++++----
> >   fs/ceph/file.c  | 24 +++++++++++++++++++++++-
> >   fs/ceph/inode.c |  4 ++--
> >   3 files changed, 45 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index 7fc87b693ba4..b96fb1378479 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -2847,7 +2847,7 @@ int ceph_get_caps(struct file *filp, int need, int want,
> >   			return ret;
> >   		}
> >   
> > -		if (S_ISREG(ci->vfs_inode.i_mode) &&
> > +		if (!S_ISDIR(ci->vfs_inode.i_mode) &&
> >   		    ci->i_inline_version != CEPH_INLINE_NONE &&
> >   		    (_got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) &&
> >   		    i_size_read(inode) > 0) {
> > @@ -2944,9 +2944,17 @@ void ceph_put_cap_refs(struct ceph_inode_info *ci, int had)
> >   	if (had & CEPH_CAP_FILE_RD)
> >   		if (--ci->i_rd_ref == 0)
> >   			last++;
> > -	if (had & CEPH_CAP_FILE_CACHE)
> > -		if (--ci->i_rdcache_ref == 0)
> > +	if (had & CEPH_CAP_FILE_CACHE) {
> > +		if (--ci->i_rdcache_ref == 0) {
> >   			last++;
> > +			/* Zero out layout if we lost CREATE caps */
> > +			if (S_ISDIR(inode->i_mode) &&
> > +			    !(__ceph_caps_issued(ci, NULL) & CEPH_CAP_DIR_CREATE)) {
> > +				ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
> > +				memset(&ci->i_layout, 0, sizeof(ci->i_layout));
> > +			}
> > +		}
> > +	}
> >   	if (had & CEPH_CAP_FILE_EXCL)
> >   		if (--ci->i_fx_ref == 0)
> >   			last++;
> > @@ -3264,7 +3272,8 @@ static void handle_cap_grant(struct inode *inode,
> >   		ci->i_subdirs = extra_info->nsubdirs;
> >   	}
> >   
> > -	if (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)) {
> > +	if (!S_ISDIR(inode->i_mode) &&
> > +	    (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
> >   		/* file layout may have changed */
> >   		s64 old_pool = ci->i_layout.pool_id;
> >   		struct ceph_string *old_ns;
> > @@ -3336,6 +3345,13 @@ static void handle_cap_grant(struct inode *inode,
> >   		     ceph_cap_string(cap->issued),
> >   		     ceph_cap_string(newcaps),
> >   		     ceph_cap_string(revoking));
> > +
> > +		if (S_ISDIR(inode->i_mode) &&
> > +		    (revoking & CEPH_CAP_DIR_CREATE) && !ci->i_rdcache_ref) {
> > +			ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
> > +			memset(&ci->i_layout, 0, sizeof(ci->i_layout));
> > +		}
> > +
> >   		if (S_ISREG(inode->i_mode) &&
> >   		    (revoking & used & CEPH_CAP_FILE_BUFFER))
> >   			writeback = true;  /* initiate writeback; will delay ack */
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 1e6cdf2dfe90..d4d7a277faf1 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -430,6 +430,25 @@ int ceph_open(struct inode *inode, struct file *file)
> >   	return err;
> >   }
> >   
> > +/* Clone the layout from a synchronous create, if the dir now has Dc caps */
> > +static void
> > +copy_file_layout(struct inode *dst, struct inode *src)
> > +{
> > +	struct ceph_inode_info *cdst = ceph_inode(dst);
> > +	struct ceph_inode_info *csrc = ceph_inode(src);
> > +
> > +	spin_lock(&cdst->i_ceph_lock);
> > +	if ((__ceph_caps_issued(cdst, NULL) & CEPH_CAP_DIR_CREATE) &&
> > +	    !ceph_file_layout_is_valid(&cdst->i_layout)) {
> > +		memcpy(&cdst->i_layout, &csrc->i_layout,
> > +			sizeof(cdst->i_layout));
> 
> directory's i_layout is used for other purpose. shouldn't modify it.
> 

Yeah, I realized that just recently. I'll add a new field for this.
Maybe we can at least union it with something that is unused in
directories.

> > +		rcu_assign_pointer(cdst->i_layout.pool_ns,
> > +				   ceph_try_get_string(csrc->i_layout.pool_ns));
> > +		cdst->i_max_size = csrc->i_max_size; > +		cdst->i_truncate_size = csrc->i_truncate_size;
> no need to save above two. just set truncate size of new file to 
> (u64)-1, set max_size of new file to its layout.stripe_unit;
> 
> max_size == layout.strip_unit ensure that client only write to the first 
> object before its writeable range is persistent.
> 

ACK, I saw that in your userland client patchset. I'll fix that up.

Thanks,

> > +	}
> > +	spin_unlock(&cdst->i_ceph_lock);
> > +}
> >   
> >   /*
> >    * Do a lookup + open with a single request.  If we get a non-existent
> > @@ -518,7 +537,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   	} else {
> >   		dout("atomic_open finish_open on dn %p\n", dn);
> >   		if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
> > -			ceph_init_inode_acls(d_inode(dentry), &as_ctx);
> > +			struct inode *newino = d_inode(dentry);
> > +
> > +			copy_file_layout(dir, newino);
> > +			ceph_init_inode_acls(newino, &as_ctx);
> >   			file->f_mode |= FMODE_CREATED;
> >   		}
> >   		err = finish_open(file, dentry, ceph_open);
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index 9cfc093fd273..8b51051b79b0 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -848,8 +848,8 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
> >   		ci->i_subdirs = le64_to_cpu(info->subdirs);
> >   	}
> >   
> > -	if (new_version ||
> > -	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
> > +	if (!S_ISDIR(inode->i_mode) && (new_version ||
> > +	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)))) {
> >   		s64 old_pool = ci->i_layout.pool_id;
> >   		struct ceph_string *old_ns;
> >   
> > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 7/9] ceph: add flag to delegate an inode number for async create
  2020-01-13  9:17   ` Yan, Zheng
@ 2020-01-13 13:31     ` Jeff Layton
  2020-01-13 14:51       ` Yan, Zheng
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Layton @ 2020-01-13 13:31 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell

On Mon, 2020-01-13 at 17:17 +0800, Yan, Zheng wrote:
> On 1/11/20 4:56 AM, Jeff Layton wrote:
> > In order to issue an async create request, we need to send an inode
> > number when we do the request, but we don't know which to which MDS
> > we'll be issuing the request.
> > 
> 
> the request should be sent to auth mds (dir_ci->i_auth_cap->session) of 
> directory. I think grabing inode number in get_caps_for_async_create() 
> is simpler.
> 

That would definitely be simpler. I didn't know whether we could count
on having an i_auth_cap in that case.

Will a non-auth MDS ever hand out DIR_CREATE/DIR_UNLINK caps? I'm a
little fuzzy on what the rules are for caps being handed out by non-auth 
MDS's.

> > Add a new r_req_flag that tells the request sending machinery to
> > grab an inode number from the delegated set, and encode it into the
> > request. If it can't get one then have it return -ECHILD. The
> > requestor can then reissue a synchronous request.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >   fs/ceph/inode.c      |  1 +
> >   fs/ceph/mds_client.c | 19 ++++++++++++++++++-
> >   fs/ceph/mds_client.h |  2 ++
> >   3 files changed, 21 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index 79bb1e6af090..9cfc093fd273 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -1317,6 +1317,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
> >   		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
> >   				NULL, session,
> >   				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
> > +				 !test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags) &&
> >   				 rinfo->head->result == 0) ?  req->r_fmode : -1,
> >   				&req->r_caps_reservation);
> >   		if (err < 0) {
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 852c46550d96..9e7492b21b50 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -2623,7 +2623,10 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
> >   	rhead->flags = cpu_to_le32(flags);
> >   	rhead->num_fwd = req->r_num_fwd;
> >   	rhead->num_retry = req->r_attempts - 1;
> > -	rhead->ino = 0;
> > +	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags))
> > +		rhead->ino = cpu_to_le64(req->r_deleg_ino);
> > +	else
> > +		rhead->ino = 0;
> >   
> >   	dout(" r_parent = %p\n", req->r_parent);
> >   	return 0;
> > @@ -2736,6 +2739,20 @@ static void __do_request(struct ceph_mds_client *mdsc,
> >   		goto out_session;
> >   	}
> >   
> > +	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags) &&
> > +	    !req->r_deleg_ino) {
> > +		req->r_deleg_ino = get_delegated_ino(req->r_session);
> > +
> > +		if (!req->r_deleg_ino) {
> > +			/*
> > +			 * If we can't get a deleg ino, exit with -ECHILD,
> > +			 * so the caller can reissue a sync request
> > +			 */
> > +			err = -ECHILD;
> > +			goto out_session;
> > +		}
> > +	}
> > +
> >   	/* send request */
> >   	req->r_resend_mds = -1;   /* forget any previous mds hint */
> >   
> > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > index 3db7ef47e1c9..e0b36be7c44f 100644
> > --- a/fs/ceph/mds_client.h
> > +++ b/fs/ceph/mds_client.h
> > @@ -258,6 +258,7 @@ struct ceph_mds_request {
> >   #define CEPH_MDS_R_GOT_RESULT		(5) /* got a result */
> >   #define CEPH_MDS_R_DID_PREPOPULATE	(6) /* prepopulated readdir */
> >   #define CEPH_MDS_R_PARENT_LOCKED	(7) /* is r_parent->i_rwsem wlocked? */
> > +#define CEPH_MDS_R_DELEG_INO		(8) /* attempt to get r_deleg_ino */
> >   	unsigned long	r_req_flags;
> >   
> >   	struct mutex r_fill_mutex;
> > @@ -307,6 +308,7 @@ struct ceph_mds_request {
> >   	int               r_num_fwd;    /* number of forward attempts */
> >   	int               r_resend_mds; /* mds to resend to next, if any*/
> >   	u32               r_sent_on_mseq; /* cap mseq request was sent at*/
> > +	unsigned long	  r_deleg_ino;
> >   
> >   	struct list_head  r_wait;
> >   	struct completion r_completion;
> > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 9/9] ceph: attempt to do async create when possible
  2020-01-13 10:53   ` Yan, Zheng
@ 2020-01-13 13:44     ` Jeff Layton
  2020-01-13 14:48       ` Yan, Zheng
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Layton @ 2020-01-13 13:44 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell

On Mon, 2020-01-13 at 18:53 +0800, Yan, Zheng wrote:
> On 1/11/20 4:56 AM, Jeff Layton wrote:
> > With the Octopus release, the MDS will hand out directoy create caps.
> > If we have Fxc caps on the directory, and complete directory information
> > or a known negative dentry, then we can return without waiting on the
> > reply, allowing the open() call to return very quickly to userland.
> > 
> > We use the normal ceph_fill_inode() routine to fill in the inode, so we
> > have to gin up some reply inode information with what we'd expect a
> > newly-created inode to have. The client assumes that it has a full set
> > of caps on the new inode, and that the MDS will revoke them when there
> > is conflicting access.
> > 
> > This functionality is gated on the enable_async_dirops module option,
> > along with async unlinks, and on the server supporting the Octopus
> > CephFS feature bit.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >   fs/ceph/caps.c               |   7 +-
> >   fs/ceph/file.c               | 178 +++++++++++++++++++++++++++++++++--
> >   fs/ceph/mds_client.c         |  12 ++-
> >   fs/ceph/mds_client.h         |   3 +-
> >   fs/ceph/super.h              |   2 +
> >   include/linux/ceph/ceph_fs.h |   8 +-
> >   6 files changed, 191 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index b96fb1378479..21a8a2ddc94b 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -654,6 +654,10 @@ void ceph_add_cap(struct inode *inode,
> >   		session->s_nr_caps++;
> >   		spin_unlock(&session->s_cap_lock);
> >   	} else {
> > +		/* Did an async create race with the reply? */
> > +		if (cap_id == CEPH_CAP_ID_TBD && cap->issued == issued)
> > +			return;
> > +
> >   		spin_lock(&session->s_cap_lock);
> >   		list_move_tail(&cap->session_caps, &session->s_caps);
> >   		spin_unlock(&session->s_cap_lock);
> > @@ -672,7 +676,8 @@ void ceph_add_cap(struct inode *inode,
> >   		 */
> >   		if (ceph_seq_cmp(seq, cap->seq) <= 0) {
> >   			WARN_ON(cap != ci->i_auth_cap);
> > -			WARN_ON(cap->cap_id != cap_id);
> > +			WARN_ON(cap_id != CEPH_CAP_ID_TBD &&
> > +				cap->cap_id != cap_id);
> >   			seq = cap->seq;
> >   			mseq = cap->mseq;
> >   			issued |= cap->issued;
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index d4d7a277faf1..706abd71b731 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -450,6 +450,141 @@ copy_file_layout(struct inode *dst, struct inode *src)
> >   	spin_unlock(&cdst->i_ceph_lock);
> >   }
> >   
> > +static bool get_caps_for_async_create(struct inode *dir, struct dentry *dentry)
> > +{
> > +	struct ceph_inode_info *ci = ceph_inode(dir);
> > +	int ret, want, got;
> > +
> > +	/*
> > +	 * We can do an async create if we either have a valid negative dentry
> > +	 * or the complete contents of the directory. Do a quick check without
> > +	 * cap refs.
> > +	 */
> > +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
> 
> what does (d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) mean?
> 


If d_in_lookup returns false, then we have a dentry that is known to be
negative (IOW, it passed d_revalidate). If d_in_lookup returns true,
then this is the initial lookup for the dentry.

So if that returns true and the directory isn't complete then we can't
do an async create since we don't have enough information about the
namespace.

That probably deserves a better comment. I'll try to make that clear.


> I think we can async create if dentry is negative and its 
> lease_shared_gen == ci->i_shared_gen.
> 
> > +	    !ceph_file_layout_is_valid(&ci->i_layout))
> > +		return false;
> > +
> > +	/* Try to get caps */
> > +	want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE;
> > +	ret = ceph_try_get_caps(dir, 0, want, true, &got);
> > +	dout("Fx on %p ret=%d got=%d\n", dir, ret, got);
> > +	if (ret != 1)
> > +		return false;
> > +	if (got != want) {
> > +		ceph_put_cap_refs(ci, got);
> > +		return false;
> > +	}
> > +
> > +	/* Check again, now that we hold cap refs */
> > +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
> > +	    !ceph_file_layout_is_valid(&ci->i_layout)) {
> > +		ceph_put_cap_refs(ci, got);
> > +		return false;
> > +	}
> > +
> > +	return true;
> > +}
> > +
> > +static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
> > +                                 struct ceph_mds_request *req)
> > +{
> > +	/* If we never sent anything then nothing to clean up */
> > +	if (req->r_err == -ECHILD)
> > +		goto out;
> > +
> > +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
> > +
> > +	if (req->r_target_inode) {
> > +		u64 ino = ceph_vino(req->r_target_inode).ino;
> > +
> > +		if (req->r_deleg_ino != ino)
> > +			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%lx target=0x%llx\n",
> > +				__func__, req->r_err, req->r_deleg_ino, ino);
> > +		mapping_set_error(req->r_target_inode->i_mapping, req->r_err);
> > +	} else {
> > +		pr_warn("%s: no req->r_target_inode for 0x%lx\n", __func__,
> > +			req->r_deleg_ino);
> > +	}
> > +out:
> > +	ceph_put_cap_refs(ceph_inode(req->r_parent),
> > +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
> > +}
> > +
> > +static int ceph_finish_async_open(struct inode *dir, struct dentry *dentry,
> > +				  struct file *file, umode_t mode,
> > +				  struct ceph_mds_request *req,
> > +				  struct ceph_acl_sec_ctx *as_ctx)
> > +{
> > +	int ret;
> > +	struct ceph_mds_reply_inode in = { };
> > +	struct ceph_mds_reply_info_in iinfo = { .in = &in };
> > +	struct ceph_inode_info *ci = ceph_inode(dir);
> > +	struct inode *inode;
> > +	struct timespec64 now;
> > +	struct ceph_vino vino = { .ino = req->r_deleg_ino,
> > +				  .snap = CEPH_NOSNAP };
> > +
> > +	ktime_get_real_ts64(&now);
> > +
> > +	inode = ceph_get_inode(dentry->d_sb, vino);
> > +	if (IS_ERR(inode))
> > +		return PTR_ERR(inode);
> > +
> > +	/* If we can't get a buffer, just carry on */
> > +	iinfo.xattr_data = kzalloc(4, GFP_NOFS);
> > +	if (iinfo.xattr_data)
> > +		iinfo.xattr_len = 4;
> 
> ??
> 
> I think we should decode req->r_pagelist into xattrs
> 
> 

I'm not sure I follow what you're suggesting here. At this point,
r_pagelist may be set to as_ctx.pagelist. It would be nice to avoid an
allocation here though. I guess I could pass in an on-stack buffer. It
is only 4 bytes after all.

> > +
> > +	iinfo.inline_version = CEPH_INLINE_NONE;
> > +	iinfo.change_attr = 1;
> > +	ceph_encode_timespec64(&iinfo.btime, &now);
> > +
> > +	in.ino = cpu_to_le64(vino.ino);
> > +	in.snapid = cpu_to_le64(CEPH_NOSNAP);
> > +	in.version = cpu_to_le64(1);	// ???
> > +	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
> > +	in.cap.cap_id = cpu_to_le64(CEPH_CAP_ID_TBD);
> > +	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
> > +	in.cap.flags = CEPH_CAP_FLAG_AUTH;
> > +	in.ctime = in.mtime = in.atime = iinfo.btime;
> > +	in.mode = cpu_to_le32((u32)mode);
> > +	in.truncate_seq = cpu_to_le32(1);
> > +	in.truncate_size = cpu_to_le64(ci->i_truncate_size);
> > +	in.max_size = cpu_to_le64(ci->i_max_size);
> > +	in.xattr_version = cpu_to_le64(1);
> > +	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
> > +	in.gid = cpu_to_le32(from_kgid(&init_user_ns, current_fsgid()));
> 
> if dir has S_ISGID, new file's gid should be inherit from dir's gid
> 
> 

Good catch. I'll fix that up for the next iteration.

> > +	in.nlink = cpu_to_le32(1);
> > +
> > +	ceph_file_layout_to_legacy(&ci->i_layout, &in.layout);
> > +
> > +	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
> > +			      req->r_fmode, NULL);
> > +	if (ret) {
> > +		dout("%s failed to fill inode: %d\n", __func__, ret);
> > +		if (inode->i_state & I_NEW)
> > +			discard_new_inode(inode);
> > +	} else {
> > +		struct dentry *dn;
> > +
> > +		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
> > +			vino.ino, dir->i_ino, dentry->d_name.name);
> > +		ceph_dir_clear_ordered(dir);
> > +		ceph_init_inode_acls(inode, as_ctx);
> > +		if (inode->i_state & I_NEW)
> > +			unlock_new_inode(inode);
> > +		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
> > +			if (!d_unhashed(dentry))
> > +				d_drop(dentry);
> > +			dn = d_splice_alias(inode, dentry);
> > +			WARN_ON_ONCE(dn && dn != dentry);
> > +		}
> > +		file->f_mode |= FMODE_CREATED;
> > +		ret = finish_open(file, dentry, ceph_open);
> > +	}
> > +	return ret;
> > +}
> > +
> >   /*
> >    * Do a lookup + open with a single request.  If we get a non-existent
> >    * file or symlink, return 1 so the VFS can retry.
> > @@ -462,6 +597,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   	struct ceph_mds_request *req;
> >   	struct dentry *dn;
> >   	struct ceph_acl_sec_ctx as_ctx = {};
> > +	bool try_async = enable_async_dirops;
> >   	int mask;
> >   	int err;
> >   
> > @@ -486,6 +622,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   		return -ENOENT;
> >   	}
> >   
> > +retry:
> >   	/* do the open */
> >   	req = prepare_open_request(dir->i_sb, flags, mode);
> >   	if (IS_ERR(req)) {
> > @@ -494,6 +631,12 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   	}
> >   	req->r_dentry = dget(dentry);
> >   	req->r_num_caps = 2;
> > +	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> > +	if (ceph_security_xattr_wanted(dir))
> > +		mask |= CEPH_CAP_XATTR_SHARED;
> > +	req->r_args.open.mask = cpu_to_le32(mask);
> > +	req->r_parent = dir;
> > +
> >   	if (flags & O_CREAT) {
> >   		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
> >   		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
> > @@ -501,21 +644,37 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   			req->r_pagelist = as_ctx.pagelist;
> >   			as_ctx.pagelist = NULL;
> >   		}
> > +		if (try_async && get_caps_for_async_create(dir, dentry)) {
> > +			set_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags);
> > +			req->r_callback = ceph_async_create_cb;
> > +			err = ceph_mdsc_submit_request(mdsc, dir, req);
> > +			switch (err) {
> > +			case 0:
> > +				/* set up inode, dentry and return */
> > +				err = ceph_finish_async_open(dir, dentry, file,
> > +							mode, req, &as_ctx);
> > +				goto out_req;
> > +			case -ECHILD:
> > +				/* do a sync create */
> > +				try_async = false;
> > +				as_ctx.pagelist = req->r_pagelist;
> > +				req->r_pagelist = NULL;
> > +				ceph_mdsc_put_request(req);
> > +				goto retry;
> > +			default:
> > +				/* Hard error, give up */
> > +				goto out_req;
> > +			}
> > +		}
> >   	}
> >   
> > -       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> > -       if (ceph_security_xattr_wanted(dir))
> > -               mask |= CEPH_CAP_XATTR_SHARED;
> > -       req->r_args.open.mask = cpu_to_le32(mask);
> > -
> > -	req->r_parent = dir;
> >   	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> >   	err = ceph_mdsc_do_request(mdsc,
> >   				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
> >   				   req);
> >   	err = ceph_handle_snapdir(req, dentry, err);
> >   	if (err)
> > -		goto out_req;
> > +		goto out_fmode;
> >   
> >   	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
> >   		err = ceph_handle_notrace_create(dir, dentry);
> > @@ -529,7 +688,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   		dn = NULL;
> >   	}
> >   	if (err)
> > -		goto out_req;
> > +		goto out_fmode;
> >   	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
> >   		/* make vfs retry on splice, ENOENT, or symlink */
> >   		dout("atomic_open finish_no_open on dn %p\n", dn);
> > @@ -545,9 +704,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> >   		}
> >   		err = finish_open(file, dentry, ceph_open);
> >   	}
> > -out_req:
> > +out_fmode:
> >   	if (!req->r_err && req->r_target_inode)
> >   		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
> > +out_req:
> >   	ceph_mdsc_put_request(req);
> >   out_ctx:
> >   	ceph_release_acl_sec_ctx(&as_ctx);
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 9e7492b21b50..c76d6e7f8136 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -2620,14 +2620,16 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
> >   		flags |= CEPH_MDS_FLAG_REPLAY;
> >   	if (req->r_parent)
> >   		flags |= CEPH_MDS_FLAG_WANT_DENTRY;
> > -	rhead->flags = cpu_to_le32(flags);
> > -	rhead->num_fwd = req->r_num_fwd;
> > -	rhead->num_retry = req->r_attempts - 1;
> > -	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags))
> > +	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags)) {
> >   		rhead->ino = cpu_to_le64(req->r_deleg_ino);
> > -	else
> > +		flags |= CEPH_MDS_FLAG_ASYNC;
> > +	} else {
> >   		rhead->ino = 0;
> > +	}
> >   
> > +	rhead->flags = cpu_to_le32(flags);
> > +	rhead->num_fwd = req->r_num_fwd;
> > +	rhead->num_retry = req->r_attempts - 1;
> >   	dout(" r_parent = %p\n", req->r_parent);
> >   	return 0;
> >   }
> > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > index e0b36be7c44f..49e6cd5a07a2 100644
> > --- a/fs/ceph/mds_client.h
> > +++ b/fs/ceph/mds_client.h
> > @@ -39,8 +39,7 @@ enum ceph_feature_type {
> >   	CEPHFS_FEATURE_REPLY_ENCODING,		\
> >   	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
> >   	CEPHFS_FEATURE_MULTI_RECONNECT,		\
> > -						\
> > -	CEPHFS_FEATURE_MAX,			\
> > +	CEPHFS_FEATURE_OCTOPUS,			\
> >   }
> >   #define CEPHFS_FEATURES_CLIENT_REQUIRED {}
> >   
> > diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> > index ec4d66d7c261..33e03fbba888 100644
> > --- a/fs/ceph/super.h
> > +++ b/fs/ceph/super.h
> > @@ -136,6 +136,8 @@ struct ceph_fs_client {
> >   #endif
> >   };
> >   
> > +/* Special placeholder value for a cap_id during an asynchronous create. */
> > +#define        CEPH_CAP_ID_TBD         -1ULL
> >   
> >   /*
> >    * File i/o capability.  This tracks shared state with the metadata
> > diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> > index a099f60feb7b..b127563e21a1 100644
> > --- a/include/linux/ceph/ceph_fs.h
> > +++ b/include/linux/ceph/ceph_fs.h
> > @@ -444,8 +444,9 @@ union ceph_mds_request_args {
> >   	} __attribute__ ((packed)) lookupino;
> >   } __attribute__ ((packed));
> >   
> > -#define CEPH_MDS_FLAG_REPLAY        1  /* this is a replayed op */
> > -#define CEPH_MDS_FLAG_WANT_DENTRY   2  /* want dentry in reply */
> > +#define CEPH_MDS_FLAG_REPLAY		1  /* this is a replayed op */
> > +#define CEPH_MDS_FLAG_WANT_DENTRY	2  /* want dentry in reply */
> > +#define CEPH_MDS_FLAG_ASYNC		4  /* request is asynchronous */
> >   
> >   struct ceph_mds_request_head {
> >   	__le64 oldest_client_tid;
> > @@ -658,6 +659,9 @@ int ceph_flags_to_mode(int flags);
> >   #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
> >   			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
> >   			   CEPH_CAP_PIN)
> > +#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
> > +			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
> > +			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
> >   
> >   #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
> >   			CEPH_LOCK_IXATTR)
> > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 9/9] ceph: attempt to do async create when possible
  2020-01-13 13:44     ` Jeff Layton
@ 2020-01-13 14:48       ` Yan, Zheng
  2020-01-13 15:20         ` Jeff Layton
  0 siblings, 1 reply; 28+ messages in thread
From: Yan, Zheng @ 2020-01-13 14:48 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell

On 1/13/20 9:44 PM, Jeff Layton wrote:
> On Mon, 2020-01-13 at 18:53 +0800, Yan, Zheng wrote:
>> On 1/11/20 4:56 AM, Jeff Layton wrote:
>>> With the Octopus release, the MDS will hand out directoy create caps.
>>> If we have Fxc caps on the directory, and complete directory information
>>> or a known negative dentry, then we can return without waiting on the
>>> reply, allowing the open() call to return very quickly to userland.
>>>
>>> We use the normal ceph_fill_inode() routine to fill in the inode, so we
>>> have to gin up some reply inode information with what we'd expect a
>>> newly-created inode to have. The client assumes that it has a full set
>>> of caps on the new inode, and that the MDS will revoke them when there
>>> is conflicting access.
>>>
>>> This functionality is gated on the enable_async_dirops module option,
>>> along with async unlinks, and on the server supporting the Octopus
>>> CephFS feature bit.
>>>
>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>> ---
>>>    fs/ceph/caps.c               |   7 +-
>>>    fs/ceph/file.c               | 178 +++++++++++++++++++++++++++++++++--
>>>    fs/ceph/mds_client.c         |  12 ++-
>>>    fs/ceph/mds_client.h         |   3 +-
>>>    fs/ceph/super.h              |   2 +
>>>    include/linux/ceph/ceph_fs.h |   8 +-
>>>    6 files changed, 191 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>> index b96fb1378479..21a8a2ddc94b 100644
>>> --- a/fs/ceph/caps.c
>>> +++ b/fs/ceph/caps.c
>>> @@ -654,6 +654,10 @@ void ceph_add_cap(struct inode *inode,
>>>    		session->s_nr_caps++;
>>>    		spin_unlock(&session->s_cap_lock);
>>>    	} else {
>>> +		/* Did an async create race with the reply? */
>>> +		if (cap_id == CEPH_CAP_ID_TBD && cap->issued == issued)
>>> +			return;
>>> +
>>>    		spin_lock(&session->s_cap_lock);
>>>    		list_move_tail(&cap->session_caps, &session->s_caps);
>>>    		spin_unlock(&session->s_cap_lock);
>>> @@ -672,7 +676,8 @@ void ceph_add_cap(struct inode *inode,
>>>    		 */
>>>    		if (ceph_seq_cmp(seq, cap->seq) <= 0) {
>>>    			WARN_ON(cap != ci->i_auth_cap);
>>> -			WARN_ON(cap->cap_id != cap_id);
>>> +			WARN_ON(cap_id != CEPH_CAP_ID_TBD &&
>>> +				cap->cap_id != cap_id);
>>>    			seq = cap->seq;
>>>    			mseq = cap->mseq;
>>>    			issued |= cap->issued;
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index d4d7a277faf1..706abd71b731 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -450,6 +450,141 @@ copy_file_layout(struct inode *dst, struct inode *src)
>>>    	spin_unlock(&cdst->i_ceph_lock);
>>>    }
>>>    
>>> +static bool get_caps_for_async_create(struct inode *dir, struct dentry *dentry)
>>> +{
>>> +	struct ceph_inode_info *ci = ceph_inode(dir);
>>> +	int ret, want, got;
>>> +
>>> +	/*
>>> +	 * We can do an async create if we either have a valid negative dentry
>>> +	 * or the complete contents of the directory. Do a quick check without
>>> +	 * cap refs.
>>> +	 */
>>> +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
>>
>> what does (d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) mean?
>>
> 
> 
> If d_in_lookup returns false, then we have a dentry that is known to be
> negative (IOW, it passed d_revalidate). If d_in_lookup returns true,
> then this is the initial lookup for the dentry.
> 
> So if that returns true and the directory isn't complete then we can't
> do an async create since we don't have enough information about the
> namespace.
> 
> That probably deserves a better comment. I'll try to make that clear.

if directory is not complete and d_in_lookup() return false. we know the 
dentry is negative in dcache, but it does not guarantee that 
corresponding dentry in mds is negative. Between d_revalidate and this
function, mds may reovked dentry's lease and issued Fsx caps
> 
> 
>> I think we can async create if dentry is negative and its
>> lease_shared_gen == ci->i_shared_gen.
>>
>>> +	    !ceph_file_layout_is_valid(&ci->i_layout))
>>> +		return false;
>>> +
>>> +	/* Try to get caps */
>>> +	want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE;
>>> +	ret = ceph_try_get_caps(dir, 0, want, true, &got);
>>> +	dout("Fx on %p ret=%d got=%d\n", dir, ret, got);
>>> +	if (ret != 1)
>>> +		return false;
>>> +	if (got != want) {
>>> +		ceph_put_cap_refs(ci, got);
>>> +		return false;
>>> +	}
>>> +
>>> +	/* Check again, now that we hold cap refs */
>>> +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
>>> +	    !ceph_file_layout_is_valid(&ci->i_layout)) {
>>> +		ceph_put_cap_refs(ci, got);
>>> +		return false;
>>> +	}
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
>>> +                                 struct ceph_mds_request *req)
>>> +{
>>> +	/* If we never sent anything then nothing to clean up */
>>> +	if (req->r_err == -ECHILD)
>>> +		goto out;
>>> +
>>> +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
>>> +
>>> +	if (req->r_target_inode) {
>>> +		u64 ino = ceph_vino(req->r_target_inode).ino;
>>> +
>>> +		if (req->r_deleg_ino != ino)
>>> +			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%lx target=0x%llx\n",
>>> +				__func__, req->r_err, req->r_deleg_ino, ino);
>>> +		mapping_set_error(req->r_target_inode->i_mapping, req->r_err);
>>> +	} else {
>>> +		pr_warn("%s: no req->r_target_inode for 0x%lx\n", __func__,
>>> +			req->r_deleg_ino);
>>> +	}
>>> +out:
>>> +	ceph_put_cap_refs(ceph_inode(req->r_parent),
>>> +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
>>> +}
>>> +
>>> +static int ceph_finish_async_open(struct inode *dir, struct dentry *dentry,
>>> +				  struct file *file, umode_t mode,
>>> +				  struct ceph_mds_request *req,
>>> +				  struct ceph_acl_sec_ctx *as_ctx)
>>> +{
>>> +	int ret;
>>> +	struct ceph_mds_reply_inode in = { };
>>> +	struct ceph_mds_reply_info_in iinfo = { .in = &in };
>>> +	struct ceph_inode_info *ci = ceph_inode(dir);
>>> +	struct inode *inode;
>>> +	struct timespec64 now;
>>> +	struct ceph_vino vino = { .ino = req->r_deleg_ino,
>>> +				  .snap = CEPH_NOSNAP };
>>> +
>>> +	ktime_get_real_ts64(&now);
>>> +
>>> +	inode = ceph_get_inode(dentry->d_sb, vino);
>>> +	if (IS_ERR(inode))
>>> +		return PTR_ERR(inode);
>>> +
>>> +	/* If we can't get a buffer, just carry on */
>>> +	iinfo.xattr_data = kzalloc(4, GFP_NOFS);
>>> +	if (iinfo.xattr_data)
>>> +		iinfo.xattr_len = 4;
>>
>> ??
>>
>> I think we should decode req->r_pagelist into xattrs
>>
>>
> 
> I'm not sure I follow what you're suggesting here. At this point,
> r_pagelist may be set to as_ctx.pagelist. It would be nice to avoid an
> allocation here though. I guess I could pass in an on-stack buffer. It
> is only 4 bytes after all.
> 
>>> +
>>> +	iinfo.inline_version = CEPH_INLINE_NONE;
>>> +	iinfo.change_attr = 1;
>>> +	ceph_encode_timespec64(&iinfo.btime, &now);
>>> +
>>> +	in.ino = cpu_to_le64(vino.ino);
>>> +	in.snapid = cpu_to_le64(CEPH_NOSNAP);
>>> +	in.version = cpu_to_le64(1);	// ???
>>> +	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
>>> +	in.cap.cap_id = cpu_to_le64(CEPH_CAP_ID_TBD);
>>> +	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
>>> +	in.cap.flags = CEPH_CAP_FLAG_AUTH;
>>> +	in.ctime = in.mtime = in.atime = iinfo.btime;
>>> +	in.mode = cpu_to_le32((u32)mode);
>>> +	in.truncate_seq = cpu_to_le32(1);
>>> +	in.truncate_size = cpu_to_le64(ci->i_truncate_size);
>>> +	in.max_size = cpu_to_le64(ci->i_max_size);
>>> +	in.xattr_version = cpu_to_le64(1);
>>> +	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
>>> +	in.gid = cpu_to_le32(from_kgid(&init_user_ns, current_fsgid()));
>>
>> if dir has S_ISGID, new file's gid should be inherit from dir's gid
>>
>>
> 
> Good catch. I'll fix that up for the next iteration.
> 
>>> +	in.nlink = cpu_to_le32(1);
>>> +
>>> +	ceph_file_layout_to_legacy(&ci->i_layout, &in.layout);
>>> +
>>> +	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
>>> +			      req->r_fmode, NULL);
>>> +	if (ret) {
>>> +		dout("%s failed to fill inode: %d\n", __func__, ret);
>>> +		if (inode->i_state & I_NEW)
>>> +			discard_new_inode(inode);
>>> +	} else {
>>> +		struct dentry *dn;
>>> +
>>> +		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
>>> +			vino.ino, dir->i_ino, dentry->d_name.name);
>>> +		ceph_dir_clear_ordered(dir);
>>> +		ceph_init_inode_acls(inode, as_ctx);
>>> +		if (inode->i_state & I_NEW)
>>> +			unlock_new_inode(inode);
>>> +		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
>>> +			if (!d_unhashed(dentry))
>>> +				d_drop(dentry);
>>> +			dn = d_splice_alias(inode, dentry);
>>> +			WARN_ON_ONCE(dn && dn != dentry);
>>> +		}
>>> +		file->f_mode |= FMODE_CREATED;
>>> +		ret = finish_open(file, dentry, ceph_open);
>>> +	}
>>> +	return ret;
>>> +}
>>> +
>>>    /*
>>>     * Do a lookup + open with a single request.  If we get a non-existent
>>>     * file or symlink, return 1 so the VFS can retry.
>>> @@ -462,6 +597,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>    	struct ceph_mds_request *req;
>>>    	struct dentry *dn;
>>>    	struct ceph_acl_sec_ctx as_ctx = {};
>>> +	bool try_async = enable_async_dirops;
>>>    	int mask;
>>>    	int err;
>>>    
>>> @@ -486,6 +622,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>    		return -ENOENT;
>>>    	}
>>>    
>>> +retry:
>>>    	/* do the open */
>>>    	req = prepare_open_request(dir->i_sb, flags, mode);
>>>    	if (IS_ERR(req)) {
>>> @@ -494,6 +631,12 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>    	}
>>>    	req->r_dentry = dget(dentry);
>>>    	req->r_num_caps = 2;
>>> +	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
>>> +	if (ceph_security_xattr_wanted(dir))
>>> +		mask |= CEPH_CAP_XATTR_SHARED;
>>> +	req->r_args.open.mask = cpu_to_le32(mask);
>>> +	req->r_parent = dir;
>>> +
>>>    	if (flags & O_CREAT) {
>>>    		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
>>>    		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
>>> @@ -501,21 +644,37 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>    			req->r_pagelist = as_ctx.pagelist;
>>>    			as_ctx.pagelist = NULL;
>>>    		}
>>> +		if (try_async && get_caps_for_async_create(dir, dentry)) {
>>> +			set_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags);
>>> +			req->r_callback = ceph_async_create_cb;
>>> +			err = ceph_mdsc_submit_request(mdsc, dir, req);
>>> +			switch (err) {
>>> +			case 0:
>>> +				/* set up inode, dentry and return */
>>> +				err = ceph_finish_async_open(dir, dentry, file,
>>> +							mode, req, &as_ctx);
>>> +				goto out_req;
>>> +			case -ECHILD:
>>> +				/* do a sync create */
>>> +				try_async = false;
>>> +				as_ctx.pagelist = req->r_pagelist;
>>> +				req->r_pagelist = NULL;
>>> +				ceph_mdsc_put_request(req);
>>> +				goto retry;
>>> +			default:
>>> +				/* Hard error, give up */
>>> +				goto out_req;
>>> +			}
>>> +		}
>>>    	}
>>>    
>>> -       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
>>> -       if (ceph_security_xattr_wanted(dir))
>>> -               mask |= CEPH_CAP_XATTR_SHARED;
>>> -       req->r_args.open.mask = cpu_to_le32(mask);
>>> -
>>> -	req->r_parent = dir;
>>>    	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
>>>    	err = ceph_mdsc_do_request(mdsc,
>>>    				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
>>>    				   req);
>>>    	err = ceph_handle_snapdir(req, dentry, err);
>>>    	if (err)
>>> -		goto out_req;
>>> +		goto out_fmode;
>>>    
>>>    	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
>>>    		err = ceph_handle_notrace_create(dir, dentry);
>>> @@ -529,7 +688,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>    		dn = NULL;
>>>    	}
>>>    	if (err)
>>> -		goto out_req;
>>> +		goto out_fmode;
>>>    	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
>>>    		/* make vfs retry on splice, ENOENT, or symlink */
>>>    		dout("atomic_open finish_no_open on dn %p\n", dn);
>>> @@ -545,9 +704,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>    		}
>>>    		err = finish_open(file, dentry, ceph_open);
>>>    	}
>>> -out_req:
>>> +out_fmode:
>>>    	if (!req->r_err && req->r_target_inode)
>>>    		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
>>> +out_req:
>>>    	ceph_mdsc_put_request(req);
>>>    out_ctx:
>>>    	ceph_release_acl_sec_ctx(&as_ctx);
>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>> index 9e7492b21b50..c76d6e7f8136 100644
>>> --- a/fs/ceph/mds_client.c
>>> +++ b/fs/ceph/mds_client.c
>>> @@ -2620,14 +2620,16 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
>>>    		flags |= CEPH_MDS_FLAG_REPLAY;
>>>    	if (req->r_parent)
>>>    		flags |= CEPH_MDS_FLAG_WANT_DENTRY;
>>> -	rhead->flags = cpu_to_le32(flags);
>>> -	rhead->num_fwd = req->r_num_fwd;
>>> -	rhead->num_retry = req->r_attempts - 1;
>>> -	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags))
>>> +	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags)) {
>>>    		rhead->ino = cpu_to_le64(req->r_deleg_ino);
>>> -	else
>>> +		flags |= CEPH_MDS_FLAG_ASYNC;
>>> +	} else {
>>>    		rhead->ino = 0;
>>> +	}
>>>    
>>> +	rhead->flags = cpu_to_le32(flags);
>>> +	rhead->num_fwd = req->r_num_fwd;
>>> +	rhead->num_retry = req->r_attempts - 1;
>>>    	dout(" r_parent = %p\n", req->r_parent);
>>>    	return 0;
>>>    }
>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>> index e0b36be7c44f..49e6cd5a07a2 100644
>>> --- a/fs/ceph/mds_client.h
>>> +++ b/fs/ceph/mds_client.h
>>> @@ -39,8 +39,7 @@ enum ceph_feature_type {
>>>    	CEPHFS_FEATURE_REPLY_ENCODING,		\
>>>    	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
>>>    	CEPHFS_FEATURE_MULTI_RECONNECT,		\
>>> -						\
>>> -	CEPHFS_FEATURE_MAX,			\
>>> +	CEPHFS_FEATURE_OCTOPUS,			\
>>>    }
>>>    #define CEPHFS_FEATURES_CLIENT_REQUIRED {}
>>>    
>>> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
>>> index ec4d66d7c261..33e03fbba888 100644
>>> --- a/fs/ceph/super.h
>>> +++ b/fs/ceph/super.h
>>> @@ -136,6 +136,8 @@ struct ceph_fs_client {
>>>    #endif
>>>    };
>>>    
>>> +/* Special placeholder value for a cap_id during an asynchronous create. */
>>> +#define        CEPH_CAP_ID_TBD         -1ULL
>>>    
>>>    /*
>>>     * File i/o capability.  This tracks shared state with the metadata
>>> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
>>> index a099f60feb7b..b127563e21a1 100644
>>> --- a/include/linux/ceph/ceph_fs.h
>>> +++ b/include/linux/ceph/ceph_fs.h
>>> @@ -444,8 +444,9 @@ union ceph_mds_request_args {
>>>    	} __attribute__ ((packed)) lookupino;
>>>    } __attribute__ ((packed));
>>>    
>>> -#define CEPH_MDS_FLAG_REPLAY        1  /* this is a replayed op */
>>> -#define CEPH_MDS_FLAG_WANT_DENTRY   2  /* want dentry in reply */
>>> +#define CEPH_MDS_FLAG_REPLAY		1  /* this is a replayed op */
>>> +#define CEPH_MDS_FLAG_WANT_DENTRY	2  /* want dentry in reply */
>>> +#define CEPH_MDS_FLAG_ASYNC		4  /* request is asynchronous */
>>>    
>>>    struct ceph_mds_request_head {
>>>    	__le64 oldest_client_tid;
>>> @@ -658,6 +659,9 @@ int ceph_flags_to_mode(int flags);
>>>    #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
>>>    			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
>>>    			   CEPH_CAP_PIN)
>>> +#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
>>> +			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
>>> +			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
>>>    
>>>    #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
>>>    			CEPH_LOCK_IXATTR)
>>>
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 7/9] ceph: add flag to delegate an inode number for async create
  2020-01-13 13:31     ` Jeff Layton
@ 2020-01-13 14:51       ` Yan, Zheng
  0 siblings, 0 replies; 28+ messages in thread
From: Yan, Zheng @ 2020-01-13 14:51 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell

On 1/13/20 9:31 PM, Jeff Layton wrote:
> On Mon, 2020-01-13 at 17:17 +0800, Yan, Zheng wrote:
>> On 1/11/20 4:56 AM, Jeff Layton wrote:
>>> In order to issue an async create request, we need to send an inode
>>> number when we do the request, but we don't know which to which MDS
>>> we'll be issuing the request.
>>>
>>
>> the request should be sent to auth mds (dir_ci->i_auth_cap->session) of
>> directory. I think grabing inode number in get_caps_for_async_create()
>> is simpler.
>>
> 
> That would definitely be simpler. I didn't know whether we could count
> on having an i_auth_cap in that case.
> 
> Will a non-auth MDS ever hand out DIR_CREATE/DIR_UNLINK caps? I'm a
> little fuzzy on what the rules are for caps being handed out by non-auth
> MDS's.

only auth mds can issue DIR_CREATE/DIR_UNLINK caps (any write caps). 
non-auth mds can only issue CEPH_CAP_FOO_SHARED caps.

> 
>>> Add a new r_req_flag that tells the request sending machinery to
>>> grab an inode number from the delegated set, and encode it into the
>>> request. If it can't get one then have it return -ECHILD. The
>>> requestor can then reissue a synchronous request.
>>>
>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>> ---
>>>    fs/ceph/inode.c      |  1 +
>>>    fs/ceph/mds_client.c | 19 ++++++++++++++++++-
>>>    fs/ceph/mds_client.h |  2 ++
>>>    3 files changed, 21 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>>> index 79bb1e6af090..9cfc093fd273 100644
>>> --- a/fs/ceph/inode.c
>>> +++ b/fs/ceph/inode.c
>>> @@ -1317,6 +1317,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
>>>    		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
>>>    				NULL, session,
>>>    				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
>>> +				 !test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags) &&
>>>    				 rinfo->head->result == 0) ?  req->r_fmode : -1,
>>>    				&req->r_caps_reservation);
>>>    		if (err < 0) {
>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>> index 852c46550d96..9e7492b21b50 100644
>>> --- a/fs/ceph/mds_client.c
>>> +++ b/fs/ceph/mds_client.c
>>> @@ -2623,7 +2623,10 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
>>>    	rhead->flags = cpu_to_le32(flags);
>>>    	rhead->num_fwd = req->r_num_fwd;
>>>    	rhead->num_retry = req->r_attempts - 1;
>>> -	rhead->ino = 0;
>>> +	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags))
>>> +		rhead->ino = cpu_to_le64(req->r_deleg_ino);
>>> +	else
>>> +		rhead->ino = 0;
>>>    
>>>    	dout(" r_parent = %p\n", req->r_parent);
>>>    	return 0;
>>> @@ -2736,6 +2739,20 @@ static void __do_request(struct ceph_mds_client *mdsc,
>>>    		goto out_session;
>>>    	}
>>>    
>>> +	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags) &&
>>> +	    !req->r_deleg_ino) {
>>> +		req->r_deleg_ino = get_delegated_ino(req->r_session);
>>> +
>>> +		if (!req->r_deleg_ino) {
>>> +			/*
>>> +			 * If we can't get a deleg ino, exit with -ECHILD,
>>> +			 * so the caller can reissue a sync request
>>> +			 */
>>> +			err = -ECHILD;
>>> +			goto out_session;
>>> +		}
>>> +	}
>>> +
>>>    	/* send request */
>>>    	req->r_resend_mds = -1;   /* forget any previous mds hint */
>>>    
>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>> index 3db7ef47e1c9..e0b36be7c44f 100644
>>> --- a/fs/ceph/mds_client.h
>>> +++ b/fs/ceph/mds_client.h
>>> @@ -258,6 +258,7 @@ struct ceph_mds_request {
>>>    #define CEPH_MDS_R_GOT_RESULT		(5) /* got a result */
>>>    #define CEPH_MDS_R_DID_PREPOPULATE	(6) /* prepopulated readdir */
>>>    #define CEPH_MDS_R_PARENT_LOCKED	(7) /* is r_parent->i_rwsem wlocked? */
>>> +#define CEPH_MDS_R_DELEG_INO		(8) /* attempt to get r_deleg_ino */
>>>    	unsigned long	r_req_flags;
>>>    
>>>    	struct mutex r_fill_mutex;
>>> @@ -307,6 +308,7 @@ struct ceph_mds_request {
>>>    	int               r_num_fwd;    /* number of forward attempts */
>>>    	int               r_resend_mds; /* mds to resend to next, if any*/
>>>    	u32               r_sent_on_mseq; /* cap mseq request was sent at*/
>>> +	unsigned long	  r_deleg_ino;
>>>    
>>>    	struct list_head  r_wait;
>>>    	struct completion r_completion;
>>>
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 8/9] ceph: copy layout, max_size and truncate_size on successful sync create
  2020-01-13 13:26     ` Jeff Layton
@ 2020-01-13 14:56       ` Yan, Zheng
  2020-01-13 15:13         ` Jeff Layton
  0 siblings, 1 reply; 28+ messages in thread
From: Yan, Zheng @ 2020-01-13 14:56 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell

On 1/13/20 9:26 PM, Jeff Layton wrote:
> On Mon, 2020-01-13 at 11:51 +0800, Yan, Zheng wrote:
>> On 1/11/20 4:56 AM, Jeff Layton wrote:
>>> It doesn't do much good to do an asynchronous create unless we can do
>>> I/O to it before the create reply comes in. That means we need layout
>>> info the new file before we've gotten the response from the MDS.
>>>
>>> All files created in a directory will initially inherit the same layout,
>>> so copy off the requisite info from the first synchronous create in the
>>> directory. Save it in the same fields in the directory inode, as those
>>> are otherwise unsed for dir inodes. This means we need to be a bit
>>> careful about only updating layout info on non-dir inodes.
>>>
>>> Also, zero out the layout when we drop Dc caps in the dir.
>>>
>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>> ---
>>>    fs/ceph/caps.c  | 24 ++++++++++++++++++++----
>>>    fs/ceph/file.c  | 24 +++++++++++++++++++++++-
>>>    fs/ceph/inode.c |  4 ++--
>>>    3 files changed, 45 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>> index 7fc87b693ba4..b96fb1378479 100644
>>> --- a/fs/ceph/caps.c
>>> +++ b/fs/ceph/caps.c
>>> @@ -2847,7 +2847,7 @@ int ceph_get_caps(struct file *filp, int need, int want,
>>>    			return ret;
>>>    		}
>>>    
>>> -		if (S_ISREG(ci->vfs_inode.i_mode) &&
>>> +		if (!S_ISDIR(ci->vfs_inode.i_mode) &&
>>>    		    ci->i_inline_version != CEPH_INLINE_NONE &&
>>>    		    (_got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) &&
>>>    		    i_size_read(inode) > 0) {
>>> @@ -2944,9 +2944,17 @@ void ceph_put_cap_refs(struct ceph_inode_info *ci, int had)
>>>    	if (had & CEPH_CAP_FILE_RD)
>>>    		if (--ci->i_rd_ref == 0)
>>>    			last++;
>>> -	if (had & CEPH_CAP_FILE_CACHE)
>>> -		if (--ci->i_rdcache_ref == 0)
>>> +	if (had & CEPH_CAP_FILE_CACHE) {
>>> +		if (--ci->i_rdcache_ref == 0) {
>>>    			last++;
>>> +			/* Zero out layout if we lost CREATE caps */
>>> +			if (S_ISDIR(inode->i_mode) &&
>>> +			    !(__ceph_caps_issued(ci, NULL) & CEPH_CAP_DIR_CREATE)) {
>>> +				ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
>>> +				memset(&ci->i_layout, 0, sizeof(ci->i_layout));
>>> +			}
>>> +		}
>>> +	}
>>
>> should do this in __check_cap_issue
>>
> 
> That function doesn't get called from the put codepath. Suppose one task
> is setting up an async create and a Dc (DIR_CREATE) cap revoke races in.
> We zero out the layout and then the inode has a bogus layout set in it.
> 
> We can't wipe the cached layout until all of the Dc references are put.

how about:

get_caps_for_async_create() return the layout.
pass the returned layout into ceph_finish_async_open()



> 
>>>    	if (had & CEPH_CAP_FILE_EXCL)
>>>    		if (--ci->i_fx_ref == 0)
>>>    			last++;
>>> @@ -3264,7 +3272,8 @@ static void handle_cap_grant(struct inode *inode,
>>>    		ci->i_subdirs = extra_info->nsubdirs;
>>>    	}
>>>    
>>> -	if (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)) {
>>> +	if (!S_ISDIR(inode->i_mode) &&
>>> +	    (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
>>>    		/* file layout may have changed */
>>>    		s64 old_pool = ci->i_layout.pool_id;
>>>    		struct ceph_string *old_ns;
>>> @@ -3336,6 +3345,13 @@ static void handle_cap_grant(struct inode *inode,
>>>    		     ceph_cap_string(cap->issued),
>>>    		     ceph_cap_string(newcaps),
>>>    		     ceph_cap_string(revoking));
>>> +
>>> +		if (S_ISDIR(inode->i_mode) &&
>>> +		    (revoking & CEPH_CAP_DIR_CREATE) && !ci->i_rdcache_ref) {
>>> +			ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
>>> +			memset(&ci->i_layout, 0, sizeof(ci->i_layout));
>>> +		}
>>
>> same here
>>
>>> +
>>>    		if (S_ISREG(inode->i_mode) &&
>>>    		    (revoking & used & CEPH_CAP_FILE_BUFFER))
>>>    			writeback = true;  /* initiate writeback; will delay ack */
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 1e6cdf2dfe90..d4d7a277faf1 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -430,6 +430,25 @@ int ceph_open(struct inode *inode, struct file *file)
>>>    	return err;
>>>    }
>>>    
>>> +/* Clone the layout from a synchronous create, if the dir now has Dc caps */
>>> +static void
>>> +copy_file_layout(struct inode *dst, struct inode *src)
>>> +{
>>> +	struct ceph_inode_info *cdst = ceph_inode(dst);
>>> +	struct ceph_inode_info *csrc = ceph_inode(src);
>>> +
>>> +	spin_lock(&cdst->i_ceph_lock);
>>> +	if ((__ceph_caps_issued(cdst, NULL) & CEPH_CAP_DIR_CREATE) &&
>>> +	    !ceph_file_layout_is_valid(&cdst->i_layout)) {
>>> +		memcpy(&cdst->i_layout, &csrc->i_layout,
>>> +			sizeof(cdst->i_layout));
>>> +		rcu_assign_pointer(cdst->i_layout.pool_ns,
>>> +				   ceph_try_get_string(csrc->i_layout.pool_ns));
>>> +		cdst->i_max_size = csrc->i_max_size;
>>> +		cdst->i_truncate_size = csrc->i_truncate_size;
>>> +	}
>>> +	spin_unlock(&cdst->i_ceph_lock);
>>> +}
>>>    
>>>    /*
>>>     * Do a lookup + open with a single request.  If we get a non-existent
>>> @@ -518,7 +537,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>    	} else {
>>>    		dout("atomic_open finish_open on dn %p\n", dn);
>>>    		if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
>>> -			ceph_init_inode_acls(d_inode(dentry), &as_ctx);
>>> +			struct inode *newino = d_inode(dentry);
>>> +
>>> +			copy_file_layout(dir, newino);
>>> +			ceph_init_inode_acls(newino, &as_ctx);
>>>    			file->f_mode |= FMODE_CREATED;
>>>    		}
>>>    		err = finish_open(file, dentry, ceph_open);
>>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>>> index 9cfc093fd273..8b51051b79b0 100644
>>> --- a/fs/ceph/inode.c
>>> +++ b/fs/ceph/inode.c
>>> @@ -848,8 +848,8 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>>>    		ci->i_subdirs = le64_to_cpu(info->subdirs);
>>>    	}
>>>    
>>> -	if (new_version ||
>>> -	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
>>> +	if (!S_ISDIR(inode->i_mode) && (new_version ||
>>> +	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)))) {
>>>    		s64 old_pool = ci->i_layout.pool_id;
>>>    		struct ceph_string *old_ns;
>>>    
>>>
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 8/9] ceph: copy layout, max_size and truncate_size on successful sync create
  2020-01-13 14:56       ` Yan, Zheng
@ 2020-01-13 15:13         ` Jeff Layton
  2020-01-13 16:37           ` Yan, Zheng
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Layton @ 2020-01-13 15:13 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell

On Mon, 2020-01-13 at 22:56 +0800, Yan, Zheng wrote:
> On 1/13/20 9:26 PM, Jeff Layton wrote:
> > On Mon, 2020-01-13 at 11:51 +0800, Yan, Zheng wrote:
> > > On 1/11/20 4:56 AM, Jeff Layton wrote:
> > > > It doesn't do much good to do an asynchronous create unless we can do
> > > > I/O to it before the create reply comes in. That means we need layout
> > > > info the new file before we've gotten the response from the MDS.
> > > > 
> > > > All files created in a directory will initially inherit the same layout,
> > > > so copy off the requisite info from the first synchronous create in the
> > > > directory. Save it in the same fields in the directory inode, as those
> > > > are otherwise unsed for dir inodes. This means we need to be a bit
> > > > careful about only updating layout info on non-dir inodes.
> > > > 
> > > > Also, zero out the layout when we drop Dc caps in the dir.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >    fs/ceph/caps.c  | 24 ++++++++++++++++++++----
> > > >    fs/ceph/file.c  | 24 +++++++++++++++++++++++-
> > > >    fs/ceph/inode.c |  4 ++--
> > > >    3 files changed, 45 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > index 7fc87b693ba4..b96fb1378479 100644
> > > > --- a/fs/ceph/caps.c
> > > > +++ b/fs/ceph/caps.c
> > > > @@ -2847,7 +2847,7 @@ int ceph_get_caps(struct file *filp, int need, int want,
> > > >    			return ret;
> > > >    		}
> > > >    
> > > > -		if (S_ISREG(ci->vfs_inode.i_mode) &&
> > > > +		if (!S_ISDIR(ci->vfs_inode.i_mode) &&
> > > >    		    ci->i_inline_version != CEPH_INLINE_NONE &&
> > > >    		    (_got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) &&
> > > >    		    i_size_read(inode) > 0) {
> > > > @@ -2944,9 +2944,17 @@ void ceph_put_cap_refs(struct ceph_inode_info *ci, int had)
> > > >    	if (had & CEPH_CAP_FILE_RD)
> > > >    		if (--ci->i_rd_ref == 0)
> > > >    			last++;
> > > > -	if (had & CEPH_CAP_FILE_CACHE)
> > > > -		if (--ci->i_rdcache_ref == 0)
> > > > +	if (had & CEPH_CAP_FILE_CACHE) {
> > > > +		if (--ci->i_rdcache_ref == 0) {
> > > >    			last++;
> > > > +			/* Zero out layout if we lost CREATE caps */
> > > > +			if (S_ISDIR(inode->i_mode) &&
> > > > +			    !(__ceph_caps_issued(ci, NULL) & CEPH_CAP_DIR_CREATE)) {
> > > > +				ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
> > > > +				memset(&ci->i_layout, 0, sizeof(ci->i_layout));
> > > > +			}
> > > > +		}
> > > > +	}
> > > 
> > > should do this in __check_cap_issue
> > > 
> > 
> > That function doesn't get called from the put codepath. Suppose one task
> > is setting up an async create and a Dc (DIR_CREATE) cap revoke races in.
> > We zero out the layout and then the inode has a bogus layout set in it.
> > 
> > We can't wipe the cached layout until all of the Dc references are put.
> 
> how about:
> 
> get_caps_for_async_create() return the layout.
> pass the returned layout into ceph_finish_async_open()
> 

That still sounds racy.

What guarantees the stability of the cached layout while we're copying
it in get_caps_for_async_create()? I guess we could protect the new
cached layout field with the i_ceph_lock.

Is that a real improvement? I'm not sure.

> 
> 
> > > >    	if (had & CEPH_CAP_FILE_EXCL)
> > > >    		if (--ci->i_fx_ref == 0)
> > > >    			last++;
> > > > @@ -3264,7 +3272,8 @@ static void handle_cap_grant(struct inode *inode,
> > > >    		ci->i_subdirs = extra_info->nsubdirs;
> > > >    	}
> > > >    
> > > > -	if (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)) {
> > > > +	if (!S_ISDIR(inode->i_mode) &&
> > > > +	    (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
> > > >    		/* file layout may have changed */
> > > >    		s64 old_pool = ci->i_layout.pool_id;
> > > >    		struct ceph_string *old_ns;
> > > > @@ -3336,6 +3345,13 @@ static void handle_cap_grant(struct inode *inode,
> > > >    		     ceph_cap_string(cap->issued),
> > > >    		     ceph_cap_string(newcaps),
> > > >    		     ceph_cap_string(revoking));
> > > > +
> > > > +		if (S_ISDIR(inode->i_mode) &&
> > > > +		    (revoking & CEPH_CAP_DIR_CREATE) && !ci->i_rdcache_ref) {
> > > > +			ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
> > > > +			memset(&ci->i_layout, 0, sizeof(ci->i_layout));
> > > > +		}
> > > 
> > > same here
> > > 
> > > > +
> > > >    		if (S_ISREG(inode->i_mode) &&
> > > >    		    (revoking & used & CEPH_CAP_FILE_BUFFER))
> > > >    			writeback = true;  /* initiate writeback; will delay ack */
> > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > index 1e6cdf2dfe90..d4d7a277faf1 100644
> > > > --- a/fs/ceph/file.c
> > > > +++ b/fs/ceph/file.c
> > > > @@ -430,6 +430,25 @@ int ceph_open(struct inode *inode, struct file *file)
> > > >    	return err;
> > > >    }
> > > >    
> > > > +/* Clone the layout from a synchronous create, if the dir now has Dc caps */
> > > > +static void
> > > > +copy_file_layout(struct inode *dst, struct inode *src)
> > > > +{
> > > > +	struct ceph_inode_info *cdst = ceph_inode(dst);
> > > > +	struct ceph_inode_info *csrc = ceph_inode(src);
> > > > +
> > > > +	spin_lock(&cdst->i_ceph_lock);
> > > > +	if ((__ceph_caps_issued(cdst, NULL) & CEPH_CAP_DIR_CREATE) &&
> > > > +	    !ceph_file_layout_is_valid(&cdst->i_layout)) {
> > > > +		memcpy(&cdst->i_layout, &csrc->i_layout,
> > > > +			sizeof(cdst->i_layout));
> > > > +		rcu_assign_pointer(cdst->i_layout.pool_ns,
> > > > +				   ceph_try_get_string(csrc->i_layout.pool_ns));
> > > > +		cdst->i_max_size = csrc->i_max_size;
> > > > +		cdst->i_truncate_size = csrc->i_truncate_size;
> > > > +	}
> > > > +	spin_unlock(&cdst->i_ceph_lock);
> > > > +}
> > > >    
> > > >    /*
> > > >     * Do a lookup + open with a single request.  If we get a non-existent
> > > > @@ -518,7 +537,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> > > >    	} else {
> > > >    		dout("atomic_open finish_open on dn %p\n", dn);
> > > >    		if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
> > > > -			ceph_init_inode_acls(d_inode(dentry), &as_ctx);
> > > > +			struct inode *newino = d_inode(dentry);
> > > > +
> > > > +			copy_file_layout(dir, newino);
> > > > +			ceph_init_inode_acls(newino, &as_ctx);
> > > >    			file->f_mode |= FMODE_CREATED;
> > > >    		}
> > > >    		err = finish_open(file, dentry, ceph_open);
> > > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > > > index 9cfc093fd273..8b51051b79b0 100644
> > > > --- a/fs/ceph/inode.c
> > > > +++ b/fs/ceph/inode.c
> > > > @@ -848,8 +848,8 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
> > > >    		ci->i_subdirs = le64_to_cpu(info->subdirs);
> > > >    	}
> > > >    
> > > > -	if (new_version ||
> > > > -	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
> > > > +	if (!S_ISDIR(inode->i_mode) && (new_version ||
> > > > +	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)))) {
> > > >    		s64 old_pool = ci->i_layout.pool_id;
> > > >    		struct ceph_string *old_ns;
> > > >    
> > > > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 9/9] ceph: attempt to do async create when possible
  2020-01-13 14:48       ` Yan, Zheng
@ 2020-01-13 15:20         ` Jeff Layton
  2020-01-14  2:08           ` Yan, Zheng
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Layton @ 2020-01-13 15:20 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: sage, idryomov, pdonnell

On Mon, 2020-01-13 at 22:48 +0800, Yan, Zheng wrote:
> On 1/13/20 9:44 PM, Jeff Layton wrote:
> > On Mon, 2020-01-13 at 18:53 +0800, Yan, Zheng wrote:
> > > On 1/11/20 4:56 AM, Jeff Layton wrote:
> > > > With the Octopus release, the MDS will hand out directoy create caps.
> > > > If we have Fxc caps on the directory, and complete directory information
> > > > or a known negative dentry, then we can return without waiting on the
> > > > reply, allowing the open() call to return very quickly to userland.
> > > > 
> > > > We use the normal ceph_fill_inode() routine to fill in the inode, so we
> > > > have to gin up some reply inode information with what we'd expect a
> > > > newly-created inode to have. The client assumes that it has a full set
> > > > of caps on the new inode, and that the MDS will revoke them when there
> > > > is conflicting access.
> > > > 
> > > > This functionality is gated on the enable_async_dirops module option,
> > > > along with async unlinks, and on the server supporting the Octopus
> > > > CephFS feature bit.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >    fs/ceph/caps.c               |   7 +-
> > > >    fs/ceph/file.c               | 178 +++++++++++++++++++++++++++++++++--
> > > >    fs/ceph/mds_client.c         |  12 ++-
> > > >    fs/ceph/mds_client.h         |   3 +-
> > > >    fs/ceph/super.h              |   2 +
> > > >    include/linux/ceph/ceph_fs.h |   8 +-
> > > >    6 files changed, 191 insertions(+), 19 deletions(-)
> > > > 
> > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > index b96fb1378479..21a8a2ddc94b 100644
> > > > --- a/fs/ceph/caps.c
> > > > +++ b/fs/ceph/caps.c
> > > > @@ -654,6 +654,10 @@ void ceph_add_cap(struct inode *inode,
> > > >    		session->s_nr_caps++;
> > > >    		spin_unlock(&session->s_cap_lock);
> > > >    	} else {
> > > > +		/* Did an async create race with the reply? */
> > > > +		if (cap_id == CEPH_CAP_ID_TBD && cap->issued == issued)
> > > > +			return;
> > > > +
> > > >    		spin_lock(&session->s_cap_lock);
> > > >    		list_move_tail(&cap->session_caps, &session->s_caps);
> > > >    		spin_unlock(&session->s_cap_lock);
> > > > @@ -672,7 +676,8 @@ void ceph_add_cap(struct inode *inode,
> > > >    		 */
> > > >    		if (ceph_seq_cmp(seq, cap->seq) <= 0) {
> > > >    			WARN_ON(cap != ci->i_auth_cap);
> > > > -			WARN_ON(cap->cap_id != cap_id);
> > > > +			WARN_ON(cap_id != CEPH_CAP_ID_TBD &&
> > > > +				cap->cap_id != cap_id);
> > > >    			seq = cap->seq;
> > > >    			mseq = cap->mseq;
> > > >    			issued |= cap->issued;
> > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > index d4d7a277faf1..706abd71b731 100644
> > > > --- a/fs/ceph/file.c
> > > > +++ b/fs/ceph/file.c
> > > > @@ -450,6 +450,141 @@ copy_file_layout(struct inode *dst, struct inode *src)
> > > >    	spin_unlock(&cdst->i_ceph_lock);
> > > >    }
> > > >    
> > > > +static bool get_caps_for_async_create(struct inode *dir, struct dentry *dentry)
> > > > +{
> > > > +	struct ceph_inode_info *ci = ceph_inode(dir);
> > > > +	int ret, want, got;
> > > > +
> > > > +	/*
> > > > +	 * We can do an async create if we either have a valid negative dentry
> > > > +	 * or the complete contents of the directory. Do a quick check without
> > > > +	 * cap refs.
> > > > +	 */
> > > > +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
> > > 
> > > what does (d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) mean?
> > > 
> > 
> > If d_in_lookup returns false, then we have a dentry that is known to be
> > negative (IOW, it passed d_revalidate). If d_in_lookup returns true,
> > then this is the initial lookup for the dentry.
> > 
> > So if that returns true and the directory isn't complete then we can't
> > do an async create since we don't have enough information about the
> > namespace.
> > 
> > That probably deserves a better comment. I'll try to make that clear.
> 
> if directory is not complete and d_in_lookup() return false. we know the 
> dentry is negative in dcache, but it does not guarantee that 
> corresponding dentry in mds is negative. Between d_revalidate and this
> function, mds may reovked dentry's lease and issued Fsx caps


Nothing prevents that occurring while in this function in that case. I
guess then we need to test for an actual dentry lease here then after
ensuring that we have the correct caps?

> > 
> > > I think we can async create if dentry is negative and its
> > > lease_shared_gen == ci->i_shared_gen.
> > > 
> > > > +	    !ceph_file_layout_is_valid(&ci->i_layout))
> > > > +		return false;
> > > > +
> > > > +	/* Try to get caps */
> > > > +	want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE;
> > > > +	ret = ceph_try_get_caps(dir, 0, want, true, &got);
> > > > +	dout("Fx on %p ret=%d got=%d\n", dir, ret, got);
> > > > +	if (ret != 1)
> > > > +		return false;
> > > > +	if (got != want) {
> > > > +		ceph_put_cap_refs(ci, got);
> > > > +		return false;
> > > > +	}
> > > > +
> > > > +	/* Check again, now that we hold cap refs */
> > > > +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
> > > > +	    !ceph_file_layout_is_valid(&ci->i_layout)) {
> > > > +		ceph_put_cap_refs(ci, got);
> > > > +		return false;
> > > > +	}
> > > > +
> > > > +	return true;
> > > > +}
> > > > +
> > > > +static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
> > > > +                                 struct ceph_mds_request *req)
> > > > +{
> > > > +	/* If we never sent anything then nothing to clean up */
> > > > +	if (req->r_err == -ECHILD)
> > > > +		goto out;
> > > > +
> > > > +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
> > > > +
> > > > +	if (req->r_target_inode) {
> > > > +		u64 ino = ceph_vino(req->r_target_inode).ino;
> > > > +
> > > > +		if (req->r_deleg_ino != ino)
> > > > +			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%lx target=0x%llx\n",
> > > > +				__func__, req->r_err, req->r_deleg_ino, ino);
> > > > +		mapping_set_error(req->r_target_inode->i_mapping, req->r_err);
> > > > +	} else {
> > > > +		pr_warn("%s: no req->r_target_inode for 0x%lx\n", __func__,
> > > > +			req->r_deleg_ino);
> > > > +	}
> > > > +out:
> > > > +	ceph_put_cap_refs(ceph_inode(req->r_parent),
> > > > +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
> > > > +}
> > > > +
> > > > +static int ceph_finish_async_open(struct inode *dir, struct dentry *dentry,
> > > > +				  struct file *file, umode_t mode,
> > > > +				  struct ceph_mds_request *req,
> > > > +				  struct ceph_acl_sec_ctx *as_ctx)
> > > > +{
> > > > +	int ret;
> > > > +	struct ceph_mds_reply_inode in = { };
> > > > +	struct ceph_mds_reply_info_in iinfo = { .in = &in };
> > > > +	struct ceph_inode_info *ci = ceph_inode(dir);
> > > > +	struct inode *inode;
> > > > +	struct timespec64 now;
> > > > +	struct ceph_vino vino = { .ino = req->r_deleg_ino,
> > > > +				  .snap = CEPH_NOSNAP };
> > > > +
> > > > +	ktime_get_real_ts64(&now);
> > > > +
> > > > +	inode = ceph_get_inode(dentry->d_sb, vino);
> > > > +	if (IS_ERR(inode))
> > > > +		return PTR_ERR(inode);
> > > > +
> > > > +	/* If we can't get a buffer, just carry on */
> > > > +	iinfo.xattr_data = kzalloc(4, GFP_NOFS);
> > > > +	if (iinfo.xattr_data)
> > > > +		iinfo.xattr_len = 4;
> > > 
> > > ??
> > > 
> > > I think we should decode req->r_pagelist into xattrs
> > > 
> > > 
> > 
> > I'm not sure I follow what you're suggesting here. At this point,
> > r_pagelist may be set to as_ctx.pagelist. It would be nice to avoid an
> > allocation here though. I guess I could pass in an on-stack buffer. It
> > is only 4 bytes after all.
> > 
> > > > +
> > > > +	iinfo.inline_version = CEPH_INLINE_NONE;
> > > > +	iinfo.change_attr = 1;
> > > > +	ceph_encode_timespec64(&iinfo.btime, &now);
> > > > +
> > > > +	in.ino = cpu_to_le64(vino.ino);
> > > > +	in.snapid = cpu_to_le64(CEPH_NOSNAP);
> > > > +	in.version = cpu_to_le64(1);	// ???
> > > > +	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
> > > > +	in.cap.cap_id = cpu_to_le64(CEPH_CAP_ID_TBD);
> > > > +	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
> > > > +	in.cap.flags = CEPH_CAP_FLAG_AUTH;
> > > > +	in.ctime = in.mtime = in.atime = iinfo.btime;
> > > > +	in.mode = cpu_to_le32((u32)mode);
> > > > +	in.truncate_seq = cpu_to_le32(1);
> > > > +	in.truncate_size = cpu_to_le64(ci->i_truncate_size);
> > > > +	in.max_size = cpu_to_le64(ci->i_max_size);
> > > > +	in.xattr_version = cpu_to_le64(1);
> > > > +	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
> > > > +	in.gid = cpu_to_le32(from_kgid(&init_user_ns, current_fsgid()));
> > > 
> > > if dir has S_ISGID, new file's gid should be inherit from dir's gid
> > > 
> > > 
> > 
> > Good catch. I'll fix that up for the next iteration.
> > 
> > > > +	in.nlink = cpu_to_le32(1);
> > > > +
> > > > +	ceph_file_layout_to_legacy(&ci->i_layout, &in.layout);
> > > > +
> > > > +	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
> > > > +			      req->r_fmode, NULL);
> > > > +	if (ret) {
> > > > +		dout("%s failed to fill inode: %d\n", __func__, ret);
> > > > +		if (inode->i_state & I_NEW)
> > > > +			discard_new_inode(inode);
> > > > +	} else {
> > > > +		struct dentry *dn;
> > > > +
> > > > +		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
> > > > +			vino.ino, dir->i_ino, dentry->d_name.name);
> > > > +		ceph_dir_clear_ordered(dir);
> > > > +		ceph_init_inode_acls(inode, as_ctx);
> > > > +		if (inode->i_state & I_NEW)
> > > > +			unlock_new_inode(inode);
> > > > +		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
> > > > +			if (!d_unhashed(dentry))
> > > > +				d_drop(dentry);
> > > > +			dn = d_splice_alias(inode, dentry);
> > > > +			WARN_ON_ONCE(dn && dn != dentry);
> > > > +		}
> > > > +		file->f_mode |= FMODE_CREATED;
> > > > +		ret = finish_open(file, dentry, ceph_open);
> > > > +	}
> > > > +	return ret;
> > > > +}
> > > > +
> > > >    /*
> > > >     * Do a lookup + open with a single request.  If we get a non-existent
> > > >     * file or symlink, return 1 so the VFS can retry.
> > > > @@ -462,6 +597,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> > > >    	struct ceph_mds_request *req;
> > > >    	struct dentry *dn;
> > > >    	struct ceph_acl_sec_ctx as_ctx = {};
> > > > +	bool try_async = enable_async_dirops;
> > > >    	int mask;
> > > >    	int err;
> > > >    
> > > > @@ -486,6 +622,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> > > >    		return -ENOENT;
> > > >    	}
> > > >    
> > > > +retry:
> > > >    	/* do the open */
> > > >    	req = prepare_open_request(dir->i_sb, flags, mode);
> > > >    	if (IS_ERR(req)) {
> > > > @@ -494,6 +631,12 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> > > >    	}
> > > >    	req->r_dentry = dget(dentry);
> > > >    	req->r_num_caps = 2;
> > > > +	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> > > > +	if (ceph_security_xattr_wanted(dir))
> > > > +		mask |= CEPH_CAP_XATTR_SHARED;
> > > > +	req->r_args.open.mask = cpu_to_le32(mask);
> > > > +	req->r_parent = dir;
> > > > +
> > > >    	if (flags & O_CREAT) {
> > > >    		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
> > > >    		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
> > > > @@ -501,21 +644,37 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> > > >    			req->r_pagelist = as_ctx.pagelist;
> > > >    			as_ctx.pagelist = NULL;
> > > >    		}
> > > > +		if (try_async && get_caps_for_async_create(dir, dentry)) {
> > > > +			set_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags);
> > > > +			req->r_callback = ceph_async_create_cb;
> > > > +			err = ceph_mdsc_submit_request(mdsc, dir, req);
> > > > +			switch (err) {
> > > > +			case 0:
> > > > +				/* set up inode, dentry and return */
> > > > +				err = ceph_finish_async_open(dir, dentry, file,
> > > > +							mode, req, &as_ctx);
> > > > +				goto out_req;
> > > > +			case -ECHILD:
> > > > +				/* do a sync create */
> > > > +				try_async = false;
> > > > +				as_ctx.pagelist = req->r_pagelist;
> > > > +				req->r_pagelist = NULL;
> > > > +				ceph_mdsc_put_request(req);
> > > > +				goto retry;
> > > > +			default:
> > > > +				/* Hard error, give up */
> > > > +				goto out_req;
> > > > +			}
> > > > +		}
> > > >    	}
> > > >    
> > > > -       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> > > > -       if (ceph_security_xattr_wanted(dir))
> > > > -               mask |= CEPH_CAP_XATTR_SHARED;
> > > > -       req->r_args.open.mask = cpu_to_le32(mask);
> > > > -
> > > > -	req->r_parent = dir;
> > > >    	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> > > >    	err = ceph_mdsc_do_request(mdsc,
> > > >    				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
> > > >    				   req);
> > > >    	err = ceph_handle_snapdir(req, dentry, err);
> > > >    	if (err)
> > > > -		goto out_req;
> > > > +		goto out_fmode;
> > > >    
> > > >    	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
> > > >    		err = ceph_handle_notrace_create(dir, dentry);
> > > > @@ -529,7 +688,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> > > >    		dn = NULL;
> > > >    	}
> > > >    	if (err)
> > > > -		goto out_req;
> > > > +		goto out_fmode;
> > > >    	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
> > > >    		/* make vfs retry on splice, ENOENT, or symlink */
> > > >    		dout("atomic_open finish_no_open on dn %p\n", dn);
> > > > @@ -545,9 +704,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> > > >    		}
> > > >    		err = finish_open(file, dentry, ceph_open);
> > > >    	}
> > > > -out_req:
> > > > +out_fmode:
> > > >    	if (!req->r_err && req->r_target_inode)
> > > >    		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
> > > > +out_req:
> > > >    	ceph_mdsc_put_request(req);
> > > >    out_ctx:
> > > >    	ceph_release_acl_sec_ctx(&as_ctx);
> > > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > > index 9e7492b21b50..c76d6e7f8136 100644
> > > > --- a/fs/ceph/mds_client.c
> > > > +++ b/fs/ceph/mds_client.c
> > > > @@ -2620,14 +2620,16 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
> > > >    		flags |= CEPH_MDS_FLAG_REPLAY;
> > > >    	if (req->r_parent)
> > > >    		flags |= CEPH_MDS_FLAG_WANT_DENTRY;
> > > > -	rhead->flags = cpu_to_le32(flags);
> > > > -	rhead->num_fwd = req->r_num_fwd;
> > > > -	rhead->num_retry = req->r_attempts - 1;
> > > > -	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags))
> > > > +	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags)) {
> > > >    		rhead->ino = cpu_to_le64(req->r_deleg_ino);
> > > > -	else
> > > > +		flags |= CEPH_MDS_FLAG_ASYNC;
> > > > +	} else {
> > > >    		rhead->ino = 0;
> > > > +	}
> > > >    
> > > > +	rhead->flags = cpu_to_le32(flags);
> > > > +	rhead->num_fwd = req->r_num_fwd;
> > > > +	rhead->num_retry = req->r_attempts - 1;
> > > >    	dout(" r_parent = %p\n", req->r_parent);
> > > >    	return 0;
> > > >    }
> > > > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > > > index e0b36be7c44f..49e6cd5a07a2 100644
> > > > --- a/fs/ceph/mds_client.h
> > > > +++ b/fs/ceph/mds_client.h
> > > > @@ -39,8 +39,7 @@ enum ceph_feature_type {
> > > >    	CEPHFS_FEATURE_REPLY_ENCODING,		\
> > > >    	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
> > > >    	CEPHFS_FEATURE_MULTI_RECONNECT,		\
> > > > -						\
> > > > -	CEPHFS_FEATURE_MAX,			\
> > > > +	CEPHFS_FEATURE_OCTOPUS,			\
> > > >    }
> > > >    #define CEPHFS_FEATURES_CLIENT_REQUIRED {}
> > > >    
> > > > diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> > > > index ec4d66d7c261..33e03fbba888 100644
> > > > --- a/fs/ceph/super.h
> > > > +++ b/fs/ceph/super.h
> > > > @@ -136,6 +136,8 @@ struct ceph_fs_client {
> > > >    #endif
> > > >    };
> > > >    
> > > > +/* Special placeholder value for a cap_id during an asynchronous create. */
> > > > +#define        CEPH_CAP_ID_TBD         -1ULL
> > > >    
> > > >    /*
> > > >     * File i/o capability.  This tracks shared state with the metadata
> > > > diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> > > > index a099f60feb7b..b127563e21a1 100644
> > > > --- a/include/linux/ceph/ceph_fs.h
> > > > +++ b/include/linux/ceph/ceph_fs.h
> > > > @@ -444,8 +444,9 @@ union ceph_mds_request_args {
> > > >    	} __attribute__ ((packed)) lookupino;
> > > >    } __attribute__ ((packed));
> > > >    
> > > > -#define CEPH_MDS_FLAG_REPLAY        1  /* this is a replayed op */
> > > > -#define CEPH_MDS_FLAG_WANT_DENTRY   2  /* want dentry in reply */
> > > > +#define CEPH_MDS_FLAG_REPLAY		1  /* this is a replayed op */
> > > > +#define CEPH_MDS_FLAG_WANT_DENTRY	2  /* want dentry in reply */
> > > > +#define CEPH_MDS_FLAG_ASYNC		4  /* request is asynchronous */
> > > >    
> > > >    struct ceph_mds_request_head {
> > > >    	__le64 oldest_client_tid;
> > > > @@ -658,6 +659,9 @@ int ceph_flags_to_mode(int flags);
> > > >    #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
> > > >    			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
> > > >    			   CEPH_CAP_PIN)
> > > > +#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
> > > > +			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
> > > > +			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
> > > >    
> > > >    #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
> > > >    			CEPH_LOCK_IXATTR)
> > > > 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 8/9] ceph: copy layout, max_size and truncate_size on successful sync create
  2020-01-13 15:13         ` Jeff Layton
@ 2020-01-13 16:37           ` Yan, Zheng
  0 siblings, 0 replies; 28+ messages in thread
From: Yan, Zheng @ 2020-01-13 16:37 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell

On 1/13/20 11:13 PM, Jeff Layton wrote:
> On Mon, 2020-01-13 at 22:56 +0800, Yan, Zheng wrote:
>> On 1/13/20 9:26 PM, Jeff Layton wrote:
>>> On Mon, 2020-01-13 at 11:51 +0800, Yan, Zheng wrote:
>>>> On 1/11/20 4:56 AM, Jeff Layton wrote:
>>>>> It doesn't do much good to do an asynchronous create unless we can do
>>>>> I/O to it before the create reply comes in. That means we need layout
>>>>> info the new file before we've gotten the response from the MDS.
>>>>>
>>>>> All files created in a directory will initially inherit the same layout,
>>>>> so copy off the requisite info from the first synchronous create in the
>>>>> directory. Save it in the same fields in the directory inode, as those
>>>>> are otherwise unsed for dir inodes. This means we need to be a bit
>>>>> careful about only updating layout info on non-dir inodes.
>>>>>
>>>>> Also, zero out the layout when we drop Dc caps in the dir.
>>>>>
>>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>>>> ---
>>>>>     fs/ceph/caps.c  | 24 ++++++++++++++++++++----
>>>>>     fs/ceph/file.c  | 24 +++++++++++++++++++++++-
>>>>>     fs/ceph/inode.c |  4 ++--
>>>>>     3 files changed, 45 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>>>> index 7fc87b693ba4..b96fb1378479 100644
>>>>> --- a/fs/ceph/caps.c
>>>>> +++ b/fs/ceph/caps.c
>>>>> @@ -2847,7 +2847,7 @@ int ceph_get_caps(struct file *filp, int need, int want,
>>>>>     			return ret;
>>>>>     		}
>>>>>     
>>>>> -		if (S_ISREG(ci->vfs_inode.i_mode) &&
>>>>> +		if (!S_ISDIR(ci->vfs_inode.i_mode) &&
>>>>>     		    ci->i_inline_version != CEPH_INLINE_NONE &&
>>>>>     		    (_got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) &&
>>>>>     		    i_size_read(inode) > 0) {
>>>>> @@ -2944,9 +2944,17 @@ void ceph_put_cap_refs(struct ceph_inode_info *ci, int had)
>>>>>     	if (had & CEPH_CAP_FILE_RD)
>>>>>     		if (--ci->i_rd_ref == 0)
>>>>>     			last++;
>>>>> -	if (had & CEPH_CAP_FILE_CACHE)
>>>>> -		if (--ci->i_rdcache_ref == 0)
>>>>> +	if (had & CEPH_CAP_FILE_CACHE) {
>>>>> +		if (--ci->i_rdcache_ref == 0) {
>>>>>     			last++;
>>>>> +			/* Zero out layout if we lost CREATE caps */
>>>>> +			if (S_ISDIR(inode->i_mode) &&
>>>>> +			    !(__ceph_caps_issued(ci, NULL) & CEPH_CAP_DIR_CREATE)) {
>>>>> +				ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
>>>>> +				memset(&ci->i_layout, 0, sizeof(ci->i_layout));
>>>>> +			}
>>>>> +		}
>>>>> +	}
>>>>
>>>> should do this in __check_cap_issue
>>>>
>>>
>>> That function doesn't get called from the put codepath. Suppose one task
>>> is setting up an async create and a Dc (DIR_CREATE) cap revoke races in.
>>> We zero out the layout and then the inode has a bogus layout set in it.
>>>
>>> We can't wipe the cached layout until all of the Dc references are put.
>>
>> how about:
>>
>> get_caps_for_async_create() return the layout.
>> pass the returned layout into ceph_finish_async_open()
>>
> 
> That still sounds racy.
> 
> What guarantees the stability of the cached layout while we're copying
> it in get_caps_for_async_create()? I guess we could protect the new
> cached layout field with the i_ceph_lock.
> 

Yes. copy cached layout while holding i_ceph_lock after getting Fsc caps.

Besides, we need to make sure ceph_finish_async_open get called before 
dropping Fsc.

> Is that a real improvement? I'm not sure.
> 
I think it's more clean.







>>
>>
>>>>>     	if (had & CEPH_CAP_FILE_EXCL)
>>>>>     		if (--ci->i_fx_ref == 0)
>>>>>     			last++;
>>>>> @@ -3264,7 +3272,8 @@ static void handle_cap_grant(struct inode *inode,
>>>>>     		ci->i_subdirs = extra_info->nsubdirs;
>>>>>     	}
>>>>>     
>>>>> -	if (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)) {
>>>>> +	if (!S_ISDIR(inode->i_mode) &&
>>>>> +	    (newcaps & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
>>>>>     		/* file layout may have changed */
>>>>>     		s64 old_pool = ci->i_layout.pool_id;
>>>>>     		struct ceph_string *old_ns;
>>>>> @@ -3336,6 +3345,13 @@ static void handle_cap_grant(struct inode *inode,
>>>>>     		     ceph_cap_string(cap->issued),
>>>>>     		     ceph_cap_string(newcaps),
>>>>>     		     ceph_cap_string(revoking));
>>>>> +
>>>>> +		if (S_ISDIR(inode->i_mode) &&
>>>>> +		    (revoking & CEPH_CAP_DIR_CREATE) && !ci->i_rdcache_ref) {
>>>>> +			ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
>>>>> +			memset(&ci->i_layout, 0, sizeof(ci->i_layout));
>>>>> +		}
>>>>
>>>> same here
>>>>
>>>>> +
>>>>>     		if (S_ISREG(inode->i_mode) &&
>>>>>     		    (revoking & used & CEPH_CAP_FILE_BUFFER))
>>>>>     			writeback = true;  /* initiate writeback; will delay ack */
>>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>>> index 1e6cdf2dfe90..d4d7a277faf1 100644
>>>>> --- a/fs/ceph/file.c
>>>>> +++ b/fs/ceph/file.c
>>>>> @@ -430,6 +430,25 @@ int ceph_open(struct inode *inode, struct file *file)
>>>>>     	return err;
>>>>>     }
>>>>>     
>>>>> +/* Clone the layout from a synchronous create, if the dir now has Dc caps */
>>>>> +static void
>>>>> +copy_file_layout(struct inode *dst, struct inode *src)
>>>>> +{
>>>>> +	struct ceph_inode_info *cdst = ceph_inode(dst);
>>>>> +	struct ceph_inode_info *csrc = ceph_inode(src);
>>>>> +
>>>>> +	spin_lock(&cdst->i_ceph_lock);
>>>>> +	if ((__ceph_caps_issued(cdst, NULL) & CEPH_CAP_DIR_CREATE) &&
>>>>> +	    !ceph_file_layout_is_valid(&cdst->i_layout)) {
>>>>> +		memcpy(&cdst->i_layout, &csrc->i_layout,
>>>>> +			sizeof(cdst->i_layout));
>>>>> +		rcu_assign_pointer(cdst->i_layout.pool_ns,
>>>>> +				   ceph_try_get_string(csrc->i_layout.pool_ns));
>>>>> +		cdst->i_max_size = csrc->i_max_size;
>>>>> +		cdst->i_truncate_size = csrc->i_truncate_size;
>>>>> +	}
>>>>> +	spin_unlock(&cdst->i_ceph_lock);
>>>>> +}
>>>>>     
>>>>>     /*
>>>>>      * Do a lookup + open with a single request.  If we get a non-existent
>>>>> @@ -518,7 +537,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>>>     	} else {
>>>>>     		dout("atomic_open finish_open on dn %p\n", dn);
>>>>>     		if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
>>>>> -			ceph_init_inode_acls(d_inode(dentry), &as_ctx);
>>>>> +			struct inode *newino = d_inode(dentry);
>>>>> +
>>>>> +			copy_file_layout(dir, newino);
>>>>> +			ceph_init_inode_acls(newino, &as_ctx);
>>>>>     			file->f_mode |= FMODE_CREATED;
>>>>>     		}
>>>>>     		err = finish_open(file, dentry, ceph_open);
>>>>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>>>>> index 9cfc093fd273..8b51051b79b0 100644
>>>>> --- a/fs/ceph/inode.c
>>>>> +++ b/fs/ceph/inode.c
>>>>> @@ -848,8 +848,8 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>>>>>     		ci->i_subdirs = le64_to_cpu(info->subdirs);
>>>>>     	}
>>>>>     
>>>>> -	if (new_version ||
>>>>> -	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
>>>>> +	if (!S_ISDIR(inode->i_mode) && (new_version ||
>>>>> +	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)))) {
>>>>>     		s64 old_pool = ci->i_layout.pool_id;
>>>>>     		struct ceph_string *old_ns;
>>>>>     
>>>>>
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 9/9] ceph: attempt to do async create when possible
  2020-01-13 15:20         ` Jeff Layton
@ 2020-01-14  2:08           ` Yan, Zheng
  0 siblings, 0 replies; 28+ messages in thread
From: Yan, Zheng @ 2020-01-14  2:08 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: sage, idryomov, pdonnell

On 1/13/20 11:20 PM, Jeff Layton wrote:
> On Mon, 2020-01-13 at 22:48 +0800, Yan, Zheng wrote:
>> On 1/13/20 9:44 PM, Jeff Layton wrote:
>>> On Mon, 2020-01-13 at 18:53 +0800, Yan, Zheng wrote:
>>>> On 1/11/20 4:56 AM, Jeff Layton wrote:
>>>>> With the Octopus release, the MDS will hand out directoy create caps.
>>>>> If we have Fxc caps on the directory, and complete directory information
>>>>> or a known negative dentry, then we can return without waiting on the
>>>>> reply, allowing the open() call to return very quickly to userland.
>>>>>
>>>>> We use the normal ceph_fill_inode() routine to fill in the inode, so we
>>>>> have to gin up some reply inode information with what we'd expect a
>>>>> newly-created inode to have. The client assumes that it has a full set
>>>>> of caps on the new inode, and that the MDS will revoke them when there
>>>>> is conflicting access.
>>>>>
>>>>> This functionality is gated on the enable_async_dirops module option,
>>>>> along with async unlinks, and on the server supporting the Octopus
>>>>> CephFS feature bit.
>>>>>
>>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>>>> ---
>>>>>     fs/ceph/caps.c               |   7 +-
>>>>>     fs/ceph/file.c               | 178 +++++++++++++++++++++++++++++++++--
>>>>>     fs/ceph/mds_client.c         |  12 ++-
>>>>>     fs/ceph/mds_client.h         |   3 +-
>>>>>     fs/ceph/super.h              |   2 +
>>>>>     include/linux/ceph/ceph_fs.h |   8 +-
>>>>>     6 files changed, 191 insertions(+), 19 deletions(-)
>>>>>
>>>>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>>>>> index b96fb1378479..21a8a2ddc94b 100644
>>>>> --- a/fs/ceph/caps.c
>>>>> +++ b/fs/ceph/caps.c
>>>>> @@ -654,6 +654,10 @@ void ceph_add_cap(struct inode *inode,
>>>>>     		session->s_nr_caps++;
>>>>>     		spin_unlock(&session->s_cap_lock);
>>>>>     	} else {
>>>>> +		/* Did an async create race with the reply? */
>>>>> +		if (cap_id == CEPH_CAP_ID_TBD && cap->issued == issued)
>>>>> +			return;
>>>>> +
>>>>>     		spin_lock(&session->s_cap_lock);
>>>>>     		list_move_tail(&cap->session_caps, &session->s_caps);
>>>>>     		spin_unlock(&session->s_cap_lock);
>>>>> @@ -672,7 +676,8 @@ void ceph_add_cap(struct inode *inode,
>>>>>     		 */
>>>>>     		if (ceph_seq_cmp(seq, cap->seq) <= 0) {
>>>>>     			WARN_ON(cap != ci->i_auth_cap);
>>>>> -			WARN_ON(cap->cap_id != cap_id);
>>>>> +			WARN_ON(cap_id != CEPH_CAP_ID_TBD &&
>>>>> +				cap->cap_id != cap_id);
>>>>>     			seq = cap->seq;
>>>>>     			mseq = cap->mseq;
>>>>>     			issued |= cap->issued;
>>>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>>>> index d4d7a277faf1..706abd71b731 100644
>>>>> --- a/fs/ceph/file.c
>>>>> +++ b/fs/ceph/file.c
>>>>> @@ -450,6 +450,141 @@ copy_file_layout(struct inode *dst, struct inode *src)
>>>>>     	spin_unlock(&cdst->i_ceph_lock);
>>>>>     }
>>>>>     
>>>>> +static bool get_caps_for_async_create(struct inode *dir, struct dentry *dentry)
>>>>> +{
>>>>> +	struct ceph_inode_info *ci = ceph_inode(dir);
>>>>> +	int ret, want, got;
>>>>> +
>>>>> +	/*
>>>>> +	 * We can do an async create if we either have a valid negative dentry
>>>>> +	 * or the complete contents of the directory. Do a quick check without
>>>>> +	 * cap refs.
>>>>> +	 */
>>>>> +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
>>>>
>>>> what does (d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) mean?
>>>>
>>>
>>> If d_in_lookup returns false, then we have a dentry that is known to be
>>> negative (IOW, it passed d_revalidate). If d_in_lookup returns true,
>>> then this is the initial lookup for the dentry.
>>>
>>> So if that returns true and the directory isn't complete then we can't
>>> do an async create since we don't have enough information about the
>>> namespace.
>>>
>>> That probably deserves a better comment. I'll try to make that clear.
>>
>> if directory is not complete and d_in_lookup() return false. we know the
>> dentry is negative in dcache, but it does not guarantee that
>> corresponding dentry in mds is negative. Between d_revalidate and this
>> function, mds may reovked dentry's lease and issued Fsx caps
> 
> 
> Nothing prevents that occurring while in this function in that case. I
> guess then we need to test for an actual dentry lease here then after
> ensuring that we have the correct caps?
> 

yes, I think so.

>>>
>>>> I think we can async create if dentry is negative and its
>>>> lease_shared_gen == ci->i_shared_gen.
>>>>
>>>>> +	    !ceph_file_layout_is_valid(&ci->i_layout))
>>>>> +		return false;
>>>>> +
>>>>> +	/* Try to get caps */
>>>>> +	want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE;
>>>>> +	ret = ceph_try_get_caps(dir, 0, want, true, &got);
>>>>> +	dout("Fx on %p ret=%d got=%d\n", dir, ret, got);
>>>>> +	if (ret != 1)
>>>>> +		return false;
>>>>> +	if (got != want) {
>>>>> +		ceph_put_cap_refs(ci, got);
>>>>> +		return false;
>>>>> +	}
>>>>> +
>>>>> +	/* Check again, now that we hold cap refs */
>>>>> +	if ((d_in_lookup(dentry) && !__ceph_dir_is_complete(ci)) ||
>>>>> +	    !ceph_file_layout_is_valid(&ci->i_layout)) {
>>>>> +		ceph_put_cap_refs(ci, got);
>>>>> +		return false;
>>>>> +	}
>>>>> +
>>>>> +	return true;
>>>>> +}
>>>>> +
>>>>> +static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
>>>>> +                                 struct ceph_mds_request *req)
>>>>> +{
>>>>> +	/* If we never sent anything then nothing to clean up */
>>>>> +	if (req->r_err == -ECHILD)
>>>>> +		goto out;
>>>>> +
>>>>> +	mapping_set_error(req->r_parent->i_mapping, req->r_err);
>>>>> +
>>>>> +	if (req->r_target_inode) {
>>>>> +		u64 ino = ceph_vino(req->r_target_inode).ino;
>>>>> +
>>>>> +		if (req->r_deleg_ino != ino)
>>>>> +			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%lx target=0x%llx\n",
>>>>> +				__func__, req->r_err, req->r_deleg_ino, ino);
>>>>> +		mapping_set_error(req->r_target_inode->i_mapping, req->r_err);
>>>>> +	} else {
>>>>> +		pr_warn("%s: no req->r_target_inode for 0x%lx\n", __func__,
>>>>> +			req->r_deleg_ino);
>>>>> +	}
>>>>> +out:
>>>>> +	ceph_put_cap_refs(ceph_inode(req->r_parent),
>>>>> +			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
>>>>> +}
>>>>> +
>>>>> +static int ceph_finish_async_open(struct inode *dir, struct dentry *dentry,
>>>>> +				  struct file *file, umode_t mode,
>>>>> +				  struct ceph_mds_request *req,
>>>>> +				  struct ceph_acl_sec_ctx *as_ctx)
>>>>> +{
>>>>> +	int ret;
>>>>> +	struct ceph_mds_reply_inode in = { };
>>>>> +	struct ceph_mds_reply_info_in iinfo = { .in = &in };
>>>>> +	struct ceph_inode_info *ci = ceph_inode(dir);
>>>>> +	struct inode *inode;
>>>>> +	struct timespec64 now;
>>>>> +	struct ceph_vino vino = { .ino = req->r_deleg_ino,
>>>>> +				  .snap = CEPH_NOSNAP };
>>>>> +
>>>>> +	ktime_get_real_ts64(&now);
>>>>> +
>>>>> +	inode = ceph_get_inode(dentry->d_sb, vino);
>>>>> +	if (IS_ERR(inode))
>>>>> +		return PTR_ERR(inode);
>>>>> +
>>>>> +	/* If we can't get a buffer, just carry on */
>>>>> +	iinfo.xattr_data = kzalloc(4, GFP_NOFS);
>>>>> +	if (iinfo.xattr_data)
>>>>> +		iinfo.xattr_len = 4;
>>>>
>>>> ??
>>>>
>>>> I think we should decode req->r_pagelist into xattrs
>>>>
>>>>
>>>
>>> I'm not sure I follow what you're suggesting here. At this point,
>>> r_pagelist may be set to as_ctx.pagelist. It would be nice to avoid an
>>> allocation here though. I guess I could pass in an on-stack buffer. It
>>> is only 4 bytes after all.
>>>
>>>>> +
>>>>> +	iinfo.inline_version = CEPH_INLINE_NONE;
>>>>> +	iinfo.change_attr = 1;
>>>>> +	ceph_encode_timespec64(&iinfo.btime, &now);
>>>>> +
>>>>> +	in.ino = cpu_to_le64(vino.ino);
>>>>> +	in.snapid = cpu_to_le64(CEPH_NOSNAP);
>>>>> +	in.version = cpu_to_le64(1);	// ???
>>>>> +	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
>>>>> +	in.cap.cap_id = cpu_to_le64(CEPH_CAP_ID_TBD);
>>>>> +	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
>>>>> +	in.cap.flags = CEPH_CAP_FLAG_AUTH;
>>>>> +	in.ctime = in.mtime = in.atime = iinfo.btime;
>>>>> +	in.mode = cpu_to_le32((u32)mode);
>>>>> +	in.truncate_seq = cpu_to_le32(1);
>>>>> +	in.truncate_size = cpu_to_le64(ci->i_truncate_size);
>>>>> +	in.max_size = cpu_to_le64(ci->i_max_size);
>>>>> +	in.xattr_version = cpu_to_le64(1);
>>>>> +	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
>>>>> +	in.gid = cpu_to_le32(from_kgid(&init_user_ns, current_fsgid()));
>>>>
>>>> if dir has S_ISGID, new file's gid should be inherit from dir's gid
>>>>
>>>>
>>>
>>> Good catch. I'll fix that up for the next iteration.
>>>
>>>>> +	in.nlink = cpu_to_le32(1);
>>>>> +
>>>>> +	ceph_file_layout_to_legacy(&ci->i_layout, &in.layout);
>>>>> +
>>>>> +	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
>>>>> +			      req->r_fmode, NULL);
>>>>> +	if (ret) {
>>>>> +		dout("%s failed to fill inode: %d\n", __func__, ret);
>>>>> +		if (inode->i_state & I_NEW)
>>>>> +			discard_new_inode(inode);
>>>>> +	} else {
>>>>> +		struct dentry *dn;
>>>>> +
>>>>> +		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
>>>>> +			vino.ino, dir->i_ino, dentry->d_name.name);
>>>>> +		ceph_dir_clear_ordered(dir);
>>>>> +		ceph_init_inode_acls(inode, as_ctx);
>>>>> +		if (inode->i_state & I_NEW)
>>>>> +			unlock_new_inode(inode);
>>>>> +		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
>>>>> +			if (!d_unhashed(dentry))
>>>>> +				d_drop(dentry);
>>>>> +			dn = d_splice_alias(inode, dentry);
>>>>> +			WARN_ON_ONCE(dn && dn != dentry);
>>>>> +		}
>>>>> +		file->f_mode |= FMODE_CREATED;
>>>>> +		ret = finish_open(file, dentry, ceph_open);
>>>>> +	}
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>>     /*
>>>>>      * Do a lookup + open with a single request.  If we get a non-existent
>>>>>      * file or symlink, return 1 so the VFS can retry.
>>>>> @@ -462,6 +597,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>>>     	struct ceph_mds_request *req;
>>>>>     	struct dentry *dn;
>>>>>     	struct ceph_acl_sec_ctx as_ctx = {};
>>>>> +	bool try_async = enable_async_dirops;
>>>>>     	int mask;
>>>>>     	int err;
>>>>>     
>>>>> @@ -486,6 +622,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>>>     		return -ENOENT;
>>>>>     	}
>>>>>     
>>>>> +retry:
>>>>>     	/* do the open */
>>>>>     	req = prepare_open_request(dir->i_sb, flags, mode);
>>>>>     	if (IS_ERR(req)) {
>>>>> @@ -494,6 +631,12 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>>>     	}
>>>>>     	req->r_dentry = dget(dentry);
>>>>>     	req->r_num_caps = 2;
>>>>> +	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
>>>>> +	if (ceph_security_xattr_wanted(dir))
>>>>> +		mask |= CEPH_CAP_XATTR_SHARED;
>>>>> +	req->r_args.open.mask = cpu_to_le32(mask);
>>>>> +	req->r_parent = dir;
>>>>> +
>>>>>     	if (flags & O_CREAT) {
>>>>>     		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
>>>>>     		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
>>>>> @@ -501,21 +644,37 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>>>     			req->r_pagelist = as_ctx.pagelist;
>>>>>     			as_ctx.pagelist = NULL;
>>>>>     		}
>>>>> +		if (try_async && get_caps_for_async_create(dir, dentry)) {
>>>>> +			set_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags);
>>>>> +			req->r_callback = ceph_async_create_cb;
>>>>> +			err = ceph_mdsc_submit_request(mdsc, dir, req);
>>>>> +			switch (err) {
>>>>> +			case 0:
>>>>> +				/* set up inode, dentry and return */
>>>>> +				err = ceph_finish_async_open(dir, dentry, file,
>>>>> +							mode, req, &as_ctx);
>>>>> +				goto out_req;
>>>>> +			case -ECHILD:
>>>>> +				/* do a sync create */
>>>>> +				try_async = false;
>>>>> +				as_ctx.pagelist = req->r_pagelist;
>>>>> +				req->r_pagelist = NULL;
>>>>> +				ceph_mdsc_put_request(req);
>>>>> +				goto retry;
>>>>> +			default:
>>>>> +				/* Hard error, give up */
>>>>> +				goto out_req;
>>>>> +			}
>>>>> +		}
>>>>>     	}
>>>>>     
>>>>> -       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
>>>>> -       if (ceph_security_xattr_wanted(dir))
>>>>> -               mask |= CEPH_CAP_XATTR_SHARED;
>>>>> -       req->r_args.open.mask = cpu_to_le32(mask);
>>>>> -
>>>>> -	req->r_parent = dir;
>>>>>     	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
>>>>>     	err = ceph_mdsc_do_request(mdsc,
>>>>>     				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
>>>>>     				   req);
>>>>>     	err = ceph_handle_snapdir(req, dentry, err);
>>>>>     	if (err)
>>>>> -		goto out_req;
>>>>> +		goto out_fmode;
>>>>>     
>>>>>     	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
>>>>>     		err = ceph_handle_notrace_create(dir, dentry);
>>>>> @@ -529,7 +688,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>>>     		dn = NULL;
>>>>>     	}
>>>>>     	if (err)
>>>>> -		goto out_req;
>>>>> +		goto out_fmode;
>>>>>     	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
>>>>>     		/* make vfs retry on splice, ENOENT, or symlink */
>>>>>     		dout("atomic_open finish_no_open on dn %p\n", dn);
>>>>> @@ -545,9 +704,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>>>>>     		}
>>>>>     		err = finish_open(file, dentry, ceph_open);
>>>>>     	}
>>>>> -out_req:
>>>>> +out_fmode:
>>>>>     	if (!req->r_err && req->r_target_inode)
>>>>>     		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
>>>>> +out_req:
>>>>>     	ceph_mdsc_put_request(req);
>>>>>     out_ctx:
>>>>>     	ceph_release_acl_sec_ctx(&as_ctx);
>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>> index 9e7492b21b50..c76d6e7f8136 100644
>>>>> --- a/fs/ceph/mds_client.c
>>>>> +++ b/fs/ceph/mds_client.c
>>>>> @@ -2620,14 +2620,16 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
>>>>>     		flags |= CEPH_MDS_FLAG_REPLAY;
>>>>>     	if (req->r_parent)
>>>>>     		flags |= CEPH_MDS_FLAG_WANT_DENTRY;
>>>>> -	rhead->flags = cpu_to_le32(flags);
>>>>> -	rhead->num_fwd = req->r_num_fwd;
>>>>> -	rhead->num_retry = req->r_attempts - 1;
>>>>> -	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags))
>>>>> +	if (test_bit(CEPH_MDS_R_DELEG_INO, &req->r_req_flags)) {
>>>>>     		rhead->ino = cpu_to_le64(req->r_deleg_ino);
>>>>> -	else
>>>>> +		flags |= CEPH_MDS_FLAG_ASYNC;
>>>>> +	} else {
>>>>>     		rhead->ino = 0;
>>>>> +	}
>>>>>     
>>>>> +	rhead->flags = cpu_to_le32(flags);
>>>>> +	rhead->num_fwd = req->r_num_fwd;
>>>>> +	rhead->num_retry = req->r_attempts - 1;
>>>>>     	dout(" r_parent = %p\n", req->r_parent);
>>>>>     	return 0;
>>>>>     }
>>>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
>>>>> index e0b36be7c44f..49e6cd5a07a2 100644
>>>>> --- a/fs/ceph/mds_client.h
>>>>> +++ b/fs/ceph/mds_client.h
>>>>> @@ -39,8 +39,7 @@ enum ceph_feature_type {
>>>>>     	CEPHFS_FEATURE_REPLY_ENCODING,		\
>>>>>     	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
>>>>>     	CEPHFS_FEATURE_MULTI_RECONNECT,		\
>>>>> -						\
>>>>> -	CEPHFS_FEATURE_MAX,			\
>>>>> +	CEPHFS_FEATURE_OCTOPUS,			\
>>>>>     }
>>>>>     #define CEPHFS_FEATURES_CLIENT_REQUIRED {}
>>>>>     
>>>>> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
>>>>> index ec4d66d7c261..33e03fbba888 100644
>>>>> --- a/fs/ceph/super.h
>>>>> +++ b/fs/ceph/super.h
>>>>> @@ -136,6 +136,8 @@ struct ceph_fs_client {
>>>>>     #endif
>>>>>     };
>>>>>     
>>>>> +/* Special placeholder value for a cap_id during an asynchronous create. */
>>>>> +#define        CEPH_CAP_ID_TBD         -1ULL
>>>>>     
>>>>>     /*
>>>>>      * File i/o capability.  This tracks shared state with the metadata
>>>>> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
>>>>> index a099f60feb7b..b127563e21a1 100644
>>>>> --- a/include/linux/ceph/ceph_fs.h
>>>>> +++ b/include/linux/ceph/ceph_fs.h
>>>>> @@ -444,8 +444,9 @@ union ceph_mds_request_args {
>>>>>     	} __attribute__ ((packed)) lookupino;
>>>>>     } __attribute__ ((packed));
>>>>>     
>>>>> -#define CEPH_MDS_FLAG_REPLAY        1  /* this is a replayed op */
>>>>> -#define CEPH_MDS_FLAG_WANT_DENTRY   2  /* want dentry in reply */
>>>>> +#define CEPH_MDS_FLAG_REPLAY		1  /* this is a replayed op */
>>>>> +#define CEPH_MDS_FLAG_WANT_DENTRY	2  /* want dentry in reply */
>>>>> +#define CEPH_MDS_FLAG_ASYNC		4  /* request is asynchronous */
>>>>>     
>>>>>     struct ceph_mds_request_head {
>>>>>     	__le64 oldest_client_tid;
>>>>> @@ -658,6 +659,9 @@ int ceph_flags_to_mode(int flags);
>>>>>     #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
>>>>>     			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
>>>>>     			   CEPH_CAP_PIN)
>>>>> +#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
>>>>> +			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
>>>>> +			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
>>>>>     
>>>>>     #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
>>>>>     			CEPH_LOCK_IXATTR)
>>>>>
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-01-14  2:08 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-10 20:56 [RFC PATCH 0/9] ceph: add asynchronous create functionality Jeff Layton
2020-01-10 20:56 ` [RFC PATCH 1/9] ceph: ensure we have a new cap before continuing in fill_inode Jeff Layton
2020-01-10 20:56 ` [RFC PATCH 2/9] ceph: print name of xattr being set in set/getxattr dout message Jeff Layton
2020-01-10 20:56 ` [RFC PATCH 3/9] ceph: close some holes in struct ceph_mds_request Jeff Layton
2020-01-10 20:56 ` [RFC PATCH 4/9] ceph: make ceph_fill_inode non-static Jeff Layton
2020-01-10 20:56 ` [RFC PATCH 5/9] libceph: export ceph_file_layout_is_valid Jeff Layton
2020-01-10 20:56 ` [RFC PATCH 6/9] ceph: decode interval_sets for delegated inos Jeff Layton
2020-01-10 20:56 ` [RFC PATCH 7/9] ceph: add flag to delegate an inode number for async create Jeff Layton
2020-01-13  9:17   ` Yan, Zheng
2020-01-13 13:31     ` Jeff Layton
2020-01-13 14:51       ` Yan, Zheng
2020-01-10 20:56 ` [RFC PATCH 8/9] ceph: copy layout, max_size and truncate_size on successful sync create Jeff Layton
2020-01-13  3:51   ` Yan, Zheng
2020-01-13 13:26     ` Jeff Layton
2020-01-13 14:56       ` Yan, Zheng
2020-01-13 15:13         ` Jeff Layton
2020-01-13 16:37           ` Yan, Zheng
2020-01-13  9:01   ` Yan, Zheng
2020-01-13 13:29     ` Jeff Layton
2020-01-10 20:56 ` [RFC PATCH 9/9] ceph: attempt to do async create when possible Jeff Layton
2020-01-13  1:43   ` Xiubo Li
2020-01-13 13:16     ` Jeff Layton
2020-01-13 10:53   ` Yan, Zheng
2020-01-13 13:44     ` Jeff Layton
2020-01-13 14:48       ` Yan, Zheng
2020-01-13 15:20         ` Jeff Layton
2020-01-14  2:08           ` Yan, Zheng
2020-01-13 11:07 ` [RFC PATCH 0/9] ceph: add asynchronous create functionality Yan, Zheng

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.