* [RFC PATCH v3 00/10] ceph: asynchronous file create support
@ 2020-01-21 19:29 Jeff Layton
  2020-01-21 19:29 ` [RFC PATCH v3 01/10] ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c Jeff Layton
                   ` (11 more replies)
  0 siblings, 12 replies; 20+ messages in thread
From: Jeff Layton @ 2020-01-21 19:29 UTC
  To: ceph-devel; +Cc: idridryomov, sage, zyan

v3:
- move some cephfs-specific code into ceph.ko
- present and track inode numbers as u64 values
- fix up check for dentry and cap eligibility checks
- set O_CEPH_EXCL on async creates
- attempt to handle errors better on async create (invalidate dentries
  and dir completeness).
- ensure that fsync waits for async create to complete

v2:
- move cached layout to dedicated field in inode
- protect cached layout with i_ceph_lock
- wipe cached layout in __check_cap_issue
- set max_size of file to layout.stripe_unit
- set truncate_size to (u64)-1
- use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
- set cap_id to 1 in async created inode
- allocate inode number before submitting request
- rework the prep for an async create to be more efficient
- don't allow MDS or cap messages involving an inode until we get async
  create reply

Still not quite ready for merge, but I've cleaned up a number of warts
in the v2 set. Performance numbers still look about the same.

There is definitely still a race of some sort that causes the client to
try to asynchronously create a dentry that already exists. I'm still
working on tracking that down.
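
For anyone who wants to hammer that path while chasing the race, a
trivial userspace loop along these lines (hypothetical test code, not
part of this series) drives the same open(O_CREAT) fast path the set
optimizes; an unexpected EEXIST from O_CREAT|O_EXCL on a fresh name is
one symptom to watch for:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char name[4096];
	int i, fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <cephfs dir>\n", argv[0]);
		return 1;
	}
	for (i = 0; i < 100000; i++) {
		snprintf(name, sizeof(name), "%s/f.%d", argv[1], i);
		/* fresh name every iteration: EEXIST would be bogus */
		fd = open(name, O_CREAT | O_EXCL | O_WRONLY, 0644);
		if (fd < 0) {
			perror(name);
			return 1;
		}
		close(fd);
	}
	return 0;
}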

Jeff Layton (10):
  ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
  ceph: make ceph_fill_inode non-static
  ceph: make dentry_lease_is_valid non-static
  ceph: make __take_cap_refs non-static
  ceph: decode interval_sets for delegated inos
  ceph: add flag to designate that a request is asynchronous
  ceph: add infrastructure for waiting for async create to complete
  ceph: add new MDS req field to hold delegated inode number
  ceph: cache layout in parent dir on first sync create
  ceph: attempt to do async create when possible

 fs/ceph/Makefile                     |   2 +-
 fs/ceph/caps.c                       |  38 +++--
 fs/ceph/dir.c                        |  13 +-
 fs/ceph/file.c                       | 240 +++++++++++++++++++++++++--
 fs/ceph/inode.c                      |  50 +++---
 fs/ceph/mds_client.c                 | 123 ++++++++++++--
 fs/ceph/mds_client.h                 |  17 +-
 fs/ceph/super.h                      |  16 +-
 net/ceph/ceph_fs.c => fs/ceph/util.c |   4 -
 include/linux/ceph/ceph_fs.h         |   8 +-
 net/ceph/Makefile                    |   2 +-
 11 files changed, 443 insertions(+), 70 deletions(-)
 rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)

-- 
2.24.1

* [RFC PATCH v3 01/10] ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
  2020-01-21 19:29 [RFC PATCH v3 00/10] ceph: asynchronous file create support Jeff Layton
@ 2020-01-21 19:29 ` Jeff Layton
  2020-01-24 19:04   ` Jeff Layton
  2020-01-21 19:29 ` [RFC PATCH v3 02/10] ceph: make ceph_fill_inode non-static Jeff Layton
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2020-01-21 19:29 UTC
  To: ceph-devel; +Cc: idridryomov, sage, zyan

All of these functions are only called from CephFS, so move them into
ceph.ko, and drop the exports.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/Makefile                     | 2 +-
 net/ceph/ceph_fs.c => fs/ceph/util.c | 4 ----
 net/ceph/Makefile                    | 2 +-
 3 files changed, 2 insertions(+), 6 deletions(-)
 rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)

diff --git a/fs/ceph/Makefile b/fs/ceph/Makefile
index c1da294418d1..0a0823d378db 100644
--- a/fs/ceph/Makefile
+++ b/fs/ceph/Makefile
@@ -8,7 +8,7 @@ obj-$(CONFIG_CEPH_FS) += ceph.o
 ceph-y := super.o inode.o dir.o file.o locks.o addr.o ioctl.o \
 	export.o caps.o snap.o xattr.o quota.o io.o \
 	mds_client.o mdsmap.o strings.o ceph_frag.o \
-	debugfs.o
+	debugfs.o util.o
 
 ceph-$(CONFIG_CEPH_FSCACHE) += cache.o
 ceph-$(CONFIG_CEPH_FS_POSIX_ACL) += acl.o
diff --git a/net/ceph/ceph_fs.c b/fs/ceph/util.c
similarity index 94%
rename from net/ceph/ceph_fs.c
rename to fs/ceph/util.c
index 756a2dc10d27..2c34875675bf 100644
--- a/net/ceph/ceph_fs.c
+++ b/fs/ceph/util.c
@@ -39,7 +39,6 @@ void ceph_file_layout_from_legacy(struct ceph_file_layout *fl,
 	    fl->stripe_count == 0 && fl->object_size == 0)
 		fl->pool_id = -1;
 }
-EXPORT_SYMBOL(ceph_file_layout_from_legacy);
 
 void ceph_file_layout_to_legacy(struct ceph_file_layout *fl,
 				struct ceph_file_layout_legacy *legacy)
@@ -52,7 +51,6 @@ void ceph_file_layout_to_legacy(struct ceph_file_layout *fl,
 	else
 		legacy->fl_pg_pool = 0;
 }
-EXPORT_SYMBOL(ceph_file_layout_to_legacy);
 
 int ceph_flags_to_mode(int flags)
 {
@@ -82,7 +80,6 @@ int ceph_flags_to_mode(int flags)
 
 	return mode;
 }
-EXPORT_SYMBOL(ceph_flags_to_mode);
 
 int ceph_caps_for_mode(int mode)
 {
@@ -101,4 +98,3 @@ int ceph_caps_for_mode(int mode)
 
 	return caps;
 }
-EXPORT_SYMBOL(ceph_caps_for_mode);
diff --git a/net/ceph/Makefile b/net/ceph/Makefile
index 59d0ba2072de..ce09bb4fb249 100644
--- a/net/ceph/Makefile
+++ b/net/ceph/Makefile
@@ -13,5 +13,5 @@ libceph-y := ceph_common.o messenger.o msgpool.o buffer.o pagelist.o \
 	auth.o auth_none.o \
 	crypto.o armor.o \
 	auth_x.o \
-	ceph_fs.o ceph_strings.o ceph_hash.o \
+	ceph_strings.o ceph_hash.o \
 	pagevec.o snapshot.o string_table.o
-- 
2.24.1

* [RFC PATCH v3 02/10] ceph: make ceph_fill_inode non-static
  2020-01-21 19:29 [RFC PATCH v3 00/10] ceph: asynchronous file create support Jeff Layton
  2020-01-21 19:29 ` [RFC PATCH v3 01/10] ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c Jeff Layton
@ 2020-01-21 19:29 ` Jeff Layton
  2020-01-21 19:29 ` [RFC PATCH v3 03/10] ceph: make dentry_lease_is_valid non-static Jeff Layton
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2020-01-21 19:29 UTC
  To: ceph-devel; +Cc: idridryomov, sage, zyan

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c | 47 ++++++++++++++++++++++++-----------------------
 fs/ceph/super.h |  8 ++++++++
 2 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index aee7a24bf1bc..79bb1e6af090 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -728,11 +728,11 @@ void ceph_fill_file_time(struct inode *inode, int issued,
  * Populate an inode based on info from mds.  May be called on new or
  * existing inodes.
  */
-static int fill_inode(struct inode *inode, struct page *locked_page,
-		      struct ceph_mds_reply_info_in *iinfo,
-		      struct ceph_mds_reply_dirfrag *dirinfo,
-		      struct ceph_mds_session *session, int cap_fmode,
-		      struct ceph_cap_reservation *caps_reservation)
+int ceph_fill_inode(struct inode *inode, struct page *locked_page,
+		    struct ceph_mds_reply_info_in *iinfo,
+		    struct ceph_mds_reply_dirfrag *dirinfo,
+		    struct ceph_mds_session *session, int cap_fmode,
+		    struct ceph_cap_reservation *caps_reservation)
 {
 	struct ceph_mds_client *mdsc = ceph_inode_to_client(inode)->mdsc;
 	struct ceph_mds_reply_inode *info = iinfo->in;
@@ -749,7 +749,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	bool new_version = false;
 	bool fill_inline = false;
 
-	dout("fill_inode %p ino %llx.%llx v %llu had %llu\n",
+	dout("%s %p ino %llx.%llx v %llu had %llu\n", __func__,
 	     inode, ceph_vinop(inode), le64_to_cpu(info->version),
 	     ci->i_version);
 
@@ -770,7 +770,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	if (iinfo->xattr_len > 4) {
 		xattr_blob = ceph_buffer_new(iinfo->xattr_len, GFP_NOFS);
 		if (!xattr_blob)
-			pr_err("fill_inode ENOMEM xattr blob %d bytes\n",
+			pr_err("%s ENOMEM xattr blob %d bytes\n", __func__,
 			       iinfo->xattr_len);
 	}
 
@@ -933,8 +933,9 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 			spin_unlock(&ci->i_ceph_lock);
 
 			if (symlen != i_size_read(inode)) {
-				pr_err("fill_inode %llx.%llx BAD symlink "
-					"size %lld\n", ceph_vinop(inode),
+				pr_err("%s %llx.%llx BAD symlink "
+					"size %lld\n", __func__,
+					ceph_vinop(inode),
 					i_size_read(inode));
 				i_size_write(inode, symlen);
 				inode->i_blocks = calc_inode_blocks(symlen);
@@ -958,7 +959,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 		inode->i_fop = &ceph_dir_fops;
 		break;
 	default:
-		pr_err("fill_inode %llx.%llx BAD mode 0%o\n",
+		pr_err("%s %llx.%llx BAD mode 0%o\n", __func__,
 		       ceph_vinop(inode), inode->i_mode);
 	}
 
@@ -1246,10 +1247,9 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 		struct inode *dir = req->r_parent;
 
 		if (dir) {
-			err = fill_inode(dir, NULL,
-					 &rinfo->diri, rinfo->dirfrag,
-					 session, -1,
-					 &req->r_caps_reservation);
+			err = ceph_fill_inode(dir, NULL, &rinfo->diri,
+					      rinfo->dirfrag, session, -1,
+					      &req->r_caps_reservation);
 			if (err < 0)
 				goto done;
 		} else {
@@ -1314,13 +1314,13 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 			goto done;
 		}
 
-		err = fill_inode(in, req->r_locked_page, &rinfo->targeti, NULL,
-				session,
+		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
+				NULL, session,
 				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
 				 rinfo->head->result == 0) ?  req->r_fmode : -1,
 				&req->r_caps_reservation);
 		if (err < 0) {
-			pr_err("fill_inode badness %p %llx.%llx\n",
+			pr_err("ceph_fill_inode badness %p %llx.%llx\n",
 				in, ceph_vinop(in));
 			if (in->i_state & I_NEW)
 				discard_new_inode(in);
@@ -1507,10 +1507,11 @@ static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req,
 			dout("new_inode badness got %d\n", err);
 			continue;
 		}
-		rc = fill_inode(in, NULL, &rde->inode, NULL, session,
-				-1, &req->r_caps_reservation);
+		rc = ceph_fill_inode(in, NULL, &rde->inode, NULL, session,
+				     -1, &req->r_caps_reservation);
 		if (rc < 0) {
-			pr_err("fill_inode badness on %p got %d\n", in, rc);
+			pr_err("ceph_fill_inode badness on %p got %d\n",
+			       in, rc);
 			err = rc;
 			if (in->i_state & I_NEW) {
 				ihold(in);
@@ -1714,10 +1715,10 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 			}
 		}
 
-		ret = fill_inode(in, NULL, &rde->inode, NULL, session,
-				 -1, &req->r_caps_reservation);
+		ret = ceph_fill_inode(in, NULL, &rde->inode, NULL, session,
+				      -1, &req->r_caps_reservation);
 		if (ret < 0) {
-			pr_err("fill_inode badness on %p\n", in);
+			pr_err("ceph_fill_inode badness on %p\n", in);
 			if (d_really_is_negative(dn)) {
 				/* avoid calling iput_final() in mds
 				 * dispatch threads */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 3ef17dd6491e..ec4d66d7c261 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -893,6 +893,9 @@ static inline bool __ceph_have_pending_cap_snap(struct ceph_inode_info *ci)
 }
 
 /* inode.c */
+struct ceph_mds_reply_info_in;
+struct ceph_mds_reply_dirfrag;
+
 extern const struct inode_operations ceph_file_iops;
 
 extern struct inode *ceph_alloc_inode(struct super_block *sb);
@@ -908,6 +911,11 @@ extern void ceph_fill_file_time(struct inode *inode, int issued,
 				u64 time_warp_seq, struct timespec64 *ctime,
 				struct timespec64 *mtime,
 				struct timespec64 *atime);
+extern int ceph_fill_inode(struct inode *inode, struct page *locked_page,
+		    struct ceph_mds_reply_info_in *iinfo,
+		    struct ceph_mds_reply_dirfrag *dirinfo,
+		    struct ceph_mds_session *session, int cap_fmode,
+		    struct ceph_cap_reservation *caps_reservation);
 extern int ceph_fill_trace(struct super_block *sb,
 			   struct ceph_mds_request *req);
 extern int ceph_readdir_prepopulate(struct ceph_mds_request *req,
-- 
2.24.1

* [RFC PATCH v3 03/10] ceph: make dentry_lease_is_valid non-static
  2020-01-21 19:29 [RFC PATCH v3 00/10] ceph: asynchronous file create support Jeff Layton
  2020-01-21 19:29 ` [RFC PATCH v3 01/10] ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c Jeff Layton
  2020-01-21 19:29 ` [RFC PATCH v3 02/10] ceph: make ceph_fill_inode non-static Jeff Layton
@ 2020-01-21 19:29 ` Jeff Layton
  2020-01-22  7:20   ` Yan, Zheng
  2020-01-21 19:29 ` [RFC PATCH v3 04/10] ceph: make __take_cap_refs non-static Jeff Layton
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2020-01-21 19:29 UTC
  To: ceph-devel; +Cc: idridryomov, sage, zyan

...and move a comment over the proper function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/dir.c   | 10 +++++-----
 fs/ceph/super.h |  1 +
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 10294f07f5f0..9d2eca67985a 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1477,10 +1477,6 @@ void ceph_invalidate_dentry_lease(struct dentry *dentry)
 	spin_unlock(&dentry->d_lock);
 }
 
-/*
- * Check if dentry lease is valid.  If not, delete the lease.  Try to
- * renew if the least is more than half up.
- */
 static bool __dentry_lease_is_valid(struct ceph_dentry_info *di)
 {
 	struct ceph_mds_session *session;
@@ -1507,7 +1503,11 @@ static bool __dentry_lease_is_valid(struct ceph_dentry_info *di)
 	return false;
 }
 
-static int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags)
+/*
+ * Check if dentry lease is valid.  If not, delete the lease.  Try to
+ * renew if the least is more than half up.
+ */
+int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags)
 {
 	struct ceph_dentry_info *di;
 	struct ceph_mds_session *session = NULL;
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index ec4d66d7c261..f27b2bf9a3f5 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1121,6 +1121,7 @@ extern int ceph_handle_snapdir(struct ceph_mds_request *req,
 extern struct dentry *ceph_finish_lookup(struct ceph_mds_request *req,
 					 struct dentry *dentry, int err);
 
+extern int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags);
 extern void __ceph_dentry_lease_touch(struct ceph_dentry_info *di);
 extern void __ceph_dentry_dir_lease_touch(struct ceph_dentry_info *di);
 extern void ceph_invalidate_dentry_lease(struct dentry *dentry);
-- 
2.24.1

* [RFC PATCH v3 04/10] ceph: make __take_cap_refs non-static
  2020-01-21 19:29 [RFC PATCH v3 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (2 preceding siblings ...)
  2020-01-21 19:29 ` [RFC PATCH v3 03/10] ceph: make dentry_lease_is_valid non-static Jeff Layton
@ 2020-01-21 19:29 ` Jeff Layton
  2020-01-21 19:29 ` [RFC PATCH v3 05/10] ceph: decode interval_sets for delegated inos Jeff Layton
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2020-01-21 19:29 UTC
  To: ceph-devel; +Cc: idridryomov, sage, zyan

Rename it to ceph_take_cap_refs and make it available to other files.
Also replace a comment with a lockdep assertion.
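
For context, the conversion follows the stock kernel pattern; a
minimal sketch (illustrative only — struct foo and foo_bump are
made-up names, not the ceph code):

#include <linux/lockdep.h>
#include <linux/spinlock.h>

struct foo {
	spinlock_t lock;
	int count;
};

/* Before: a "caller must hold foo->lock" comment that nothing checks.
 * After: lockdep verifies the contract on every call when enabled. */
static void foo_bump(struct foo *f)
{
	lockdep_assert_held(&f->lock);
	f->count++;
}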

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c  | 12 ++++++------
 fs/ceph/super.h |  2 ++
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 7fc87b693ba4..c983990acb75 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2512,12 +2512,12 @@ static void kick_flushing_inode_caps(struct ceph_mds_client *mdsc,
 /*
  * Take references to capabilities we hold, so that we don't release
  * them to the MDS prematurely.
- *
- * Protected by i_ceph_lock.
  */
-static void __take_cap_refs(struct ceph_inode_info *ci, int got,
+void ceph_take_cap_refs(struct ceph_inode_info *ci, int got,
 			    bool snap_rwsem_locked)
 {
+	lockdep_assert_held(&ci->i_ceph_lock);
+
 	if (got & CEPH_CAP_PIN)
 		ci->i_pin_ref++;
 	if (got & CEPH_CAP_FILE_RD)
@@ -2538,7 +2538,7 @@ static void __take_cap_refs(struct ceph_inode_info *ci, int got,
 		if (ci->i_wb_ref == 0)
 			ihold(&ci->vfs_inode);
 		ci->i_wb_ref++;
-		dout("__take_cap_refs %p wb %d -> %d (?)\n",
+		dout("%s %p wb %d -> %d (?)\n", __func__,
 		     &ci->vfs_inode, ci->i_wb_ref-1, ci->i_wb_ref);
 	}
 }
@@ -2664,7 +2664,7 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
 			    (need & CEPH_CAP_FILE_RD) &&
 			    !(*got & CEPH_CAP_FILE_CACHE))
 				ceph_disable_fscache_readpage(ci);
-			__take_cap_refs(ci, *got, true);
+			ceph_take_cap_refs(ci, *got, true);
 			ret = 1;
 		}
 	} else {
@@ -2896,7 +2896,7 @@ int ceph_get_caps(struct file *filp, int need, int want,
 void ceph_get_cap_refs(struct ceph_inode_info *ci, int caps)
 {
 	spin_lock(&ci->i_ceph_lock);
-	__take_cap_refs(ci, caps, false);
+	ceph_take_cap_refs(ci, caps, false);
 	spin_unlock(&ci->i_ceph_lock);
 }
 
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index f27b2bf9a3f5..cb7b549f0995 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1062,6 +1062,8 @@ extern void ceph_kick_flushing_caps(struct ceph_mds_client *mdsc,
 				    struct ceph_mds_session *session);
 extern struct ceph_cap *ceph_get_cap_for_mds(struct ceph_inode_info *ci,
 					     int mds);
+extern void ceph_take_cap_refs(struct ceph_inode_info *ci, int caps,
+				bool snap_rwsem_locked);
 extern void ceph_get_cap_refs(struct ceph_inode_info *ci, int caps);
 extern void ceph_put_cap_refs(struct ceph_inode_info *ci, int had);
 extern void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
-- 
2.24.1

* [RFC PATCH v3 05/10] ceph: decode interval_sets for delegated inos
  2020-01-21 19:29 [RFC PATCH v3 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (3 preceding siblings ...)
  2020-01-21 19:29 ` [RFC PATCH v3 04/10] ceph: make __take_cap_refs non-static Jeff Layton
@ 2020-01-21 19:29 ` Jeff Layton
  2020-01-21 19:29 ` [RFC PATCH v3 06/10] ceph: add flag to designate that a request is asynchronous Jeff Layton
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2020-01-21 19:29 UTC
  To: ceph-devel; +Cc: idridryomov, sage, zyan

Starting in Octopus, the MDS will hand out caps that allow the client
to do asynchronous file creates under certain conditions. As part of
that, the MDS will delegate ranges of inode numbers to the client.

Add the infrastructure to decode these ranges, and stuff them into an
xarray for later consumption by the async creation code.

Because the xarray code currently only handles unsigned long indexes,
and those are 32 bits wide on 32-bit arches, we only enable the decoding
when
running on a 64-bit arch.
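
For reference, the payload being decoded here is just a __le32 count
of sets followed by that many (start, length) pairs of __le64s. A
standalone userspace sketch of the same decode (names are illustrative,
and unlike the kernel's ceph_decode_*_safe helpers this does no bounds
checking):

#include <stdint.h>
#include <stdio.h>

static uint64_t get_le64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

static void dump_deleg_inos(const uint8_t *p)
{
	uint32_t sets = p[0] | p[1] << 8 | p[2] << 16 | (uint32_t)p[3] << 24;

	p += 4;
	while (sets--) {
		uint64_t start = get_le64(p);
		uint64_t len = get_le64(p + 8);

		printf("delegated inos 0x%llx + %llu\n",
		       (unsigned long long)start, (unsigned long long)len);
		p += 16;
	}
}

int main(void)
{
	uint8_t buf[20] = { 0 };

	buf[0] = 1;	/* one set */
	buf[9] = 1;	/* start = 1ULL << 40 (little-endian byte 5) */
	buf[12] = 16;	/* len = 16 */
	dump_deleg_inos(buf);
	return 0;
}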

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/mds_client.c | 109 +++++++++++++++++++++++++++++++++++++++----
 fs/ceph/mds_client.h |   7 ++-
 2 files changed, 106 insertions(+), 10 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 8263f75badfc..d0ceb3f46570 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -415,21 +415,110 @@ static int parse_reply_info_filelock(void **p, void *end,
 	return -EIO;
 }
 
+
+#if BITS_PER_LONG == 64
+
+#define DELEGATED_INO_AVAILABLE		xa_mk_value(1)
+
+static int ceph_parse_deleg_inos(void **p, void *end,
+				 struct ceph_mds_session *s)
+{
+	u32 sets;
+
+	ceph_decode_32_safe(p, end, sets, bad);
+	dout("got %u sets of delegated inodes\n", sets);
+	while (sets--) {
+		u64 start, len, ino;
+
+		ceph_decode_64_safe(p, end, start, bad);
+		ceph_decode_64_safe(p, end, len, bad);
+		while (len--) {
+			int err = xa_insert(&s->s_delegated_inos, ino = start++,
+					    DELEGATED_INO_AVAILABLE,
+					    GFP_KERNEL);
+			if (!err) {
+				dout("added delegated inode 0x%llx\n",
+				     start - 1);
+			} else if (err == -EBUSY) {
+				pr_warn("ceph: MDS delegated inode 0x%llx more than once.\n",
+					start - 1);
+			} else {
+				return err;
+			}
+		}
+	}
+	return 0;
+bad:
+	return -EIO;
+}
+
+u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
+{
+	unsigned long ino;
+	void *val;
+
+	xa_for_each(&s->s_delegated_inos, ino, val) {
+		val = xa_erase(&s->s_delegated_inos, ino);
+		if (val == DELEGATED_INO_AVAILABLE)
+			return ino;
+	}
+	return 0;
+}
+#else /* BITS_PER_LONG == 64 */
+/*
+ * FIXME: xarrays can't handle 64-bit indexes on a 32-bit arch. For now, just
+ * ignore delegated_inos on 32 bit arch. Maybe eventually add xarrays for top
+ * and bottom words?
+ */
+static int ceph_parse_deleg_inos(void **p, void *end,
+				 struct ceph_mds_session *s)
+{
+	u32 sets;
+
+	ceph_decode_32_safe(p, end, sets, bad);
+	if (sets)
+		ceph_decode_skip_n(p, end, sets * 2 * sizeof(__le64), bad);
+	return 0;
+bad:
+	return -EIO;
+}
+
+u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
+{
+	return 0;
+}
+#endif /* BITS_PER_LONG == 64 */
+
 /*
  * parse create results
  */
 static int parse_reply_info_create(void **p, void *end,
 				  struct ceph_mds_reply_info_parsed *info,
-				  u64 features)
+				  u64 features, struct ceph_mds_session *s)
 {
+	int ret;
+
 	if (features == (u64)-1 ||
 	    (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {
-		/* Malformed reply? */
 		if (*p == end) {
+			/* Malformed reply? */
 			info->has_create_ino = false;
-		} else {
+		} else if (test_bit(CEPHFS_FEATURE_DELEG_INO, &s->s_features)) {
+			u8 struct_v, struct_compat;
+			u32 len;
+
 			info->has_create_ino = true;
+			ceph_decode_8_safe(p, end, struct_v, bad);
+			ceph_decode_8_safe(p, end, struct_compat, bad);
+			ceph_decode_32_safe(p, end, len, bad);
+			ceph_decode_64_safe(p, end, info->ino, bad);
+			ret = ceph_parse_deleg_inos(p, end, s);
+			if (ret)
+				return ret;
+		} else {
+			/* legacy */
 			ceph_decode_64_safe(p, end, info->ino, bad);
+			info->has_create_ino = true;
 		}
 	} else {
 		if (*p != end)
@@ -448,7 +537,7 @@ static int parse_reply_info_create(void **p, void *end,
  */
 static int parse_reply_info_extra(void **p, void *end,
 				  struct ceph_mds_reply_info_parsed *info,
-				  u64 features)
+				  u64 features, struct ceph_mds_session *s)
 {
 	u32 op = le32_to_cpu(info->head->op);
 
@@ -457,7 +546,7 @@ static int parse_reply_info_extra(void **p, void *end,
 	else if (op == CEPH_MDS_OP_READDIR || op == CEPH_MDS_OP_LSSNAP)
 		return parse_reply_info_readdir(p, end, info, features);
 	else if (op == CEPH_MDS_OP_CREATE)
-		return parse_reply_info_create(p, end, info, features);
+		return parse_reply_info_create(p, end, info, features, s);
 	else
 		return -EIO;
 }
@@ -465,7 +554,7 @@ static int parse_reply_info_extra(void **p, void *end,
 /*
  * parse entire mds reply
  */
-static int parse_reply_info(struct ceph_msg *msg,
+static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
 			    struct ceph_mds_reply_info_parsed *info,
 			    u64 features)
 {
@@ -490,7 +579,7 @@ static int parse_reply_info(struct ceph_msg *msg,
 	ceph_decode_32_safe(&p, end, len, bad);
 	if (len > 0) {
 		ceph_decode_need(&p, end, len, bad);
-		err = parse_reply_info_extra(&p, p+len, info, features);
+		err = parse_reply_info_extra(&p, p+len, info, features, s);
 		if (err < 0)
 			goto out_bad;
 	}
@@ -558,6 +647,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
 	if (refcount_dec_and_test(&s->s_ref)) {
 		if (s->s_auth.authorizer)
 			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
+		xa_destroy(&s->s_delegated_inos);
 		kfree(s);
 	}
 }
@@ -645,6 +735,7 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
 	refcount_set(&s->s_ref, 1);
 	INIT_LIST_HEAD(&s->s_waiting);
 	INIT_LIST_HEAD(&s->s_unsafe);
+	xa_init(&s->s_delegated_inos);
 	s->s_num_cap_releases = 0;
 	s->s_cap_reconnect = 0;
 	s->s_cap_iterator = NULL;
@@ -2947,9 +3038,9 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
 	dout("handle_reply tid %lld result %d\n", tid, result);
 	rinfo = &req->r_reply_info;
 	if (test_bit(CEPHFS_FEATURE_REPLY_ENCODING, &session->s_features))
-		err = parse_reply_info(msg, rinfo, (u64)-1);
+		err = parse_reply_info(session, msg, rinfo, (u64)-1);
 	else
-		err = parse_reply_info(msg, rinfo, session->s_con.peer_features);
+		err = parse_reply_info(session, msg, rinfo, session->s_con.peer_features);
 	mutex_unlock(&mdsc->mutex);
 
 	mutex_lock(&session->s_mutex);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 27a7446e10d3..54df093a7db0 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -23,8 +23,9 @@ enum ceph_feature_type {
 	CEPHFS_FEATURE_RECLAIM_CLIENT,
 	CEPHFS_FEATURE_LAZY_CAP_WANTED,
 	CEPHFS_FEATURE_MULTI_RECONNECT,
+	CEPHFS_FEATURE_DELEG_INO,
 
-	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_MULTI_RECONNECT,
+	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_DELEG_INO,
 };
 
 /*
@@ -37,6 +38,7 @@ enum ceph_feature_type {
 	CEPHFS_FEATURE_REPLY_ENCODING,		\
 	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
 	CEPHFS_FEATURE_MULTI_RECONNECT,		\
+	CEPHFS_FEATURE_DELEG_INO,		\
 						\
 	CEPHFS_FEATURE_MAX,			\
 }
@@ -201,6 +203,7 @@ struct ceph_mds_session {
 
 	struct list_head  s_waiting;  /* waiting requests */
 	struct list_head  s_unsafe;   /* unsafe requests */
+	struct xarray	  s_delegated_inos;
 };
 
 /*
@@ -537,4 +540,6 @@ extern void ceph_mdsc_open_export_target_sessions(struct ceph_mds_client *mdsc,
 extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
 			  struct ceph_mds_session *session,
 			  int max_caps);
+
+extern u64 ceph_get_deleg_ino(struct ceph_mds_session *session);
 #endif
-- 
2.24.1

* [RFC PATCH v3 06/10] ceph: add flag to designate that a request is asynchronous
  2020-01-21 19:29 [RFC PATCH v3 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (4 preceding siblings ...)
  2020-01-21 19:29 ` [RFC PATCH v3 05/10] ceph: decode interval_sets for delegated inos Jeff Layton
@ 2020-01-21 19:29 ` Jeff Layton
  2020-01-21 19:29 ` [RFC PATCH v3 07/10] ceph: add infrastructure for waiting for async create to complete Jeff Layton
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2020-01-21 19:29 UTC
  To: ceph-devel; +Cc: idridryomov, sage, zyan

The MDS needs to know when a request is asynchronous.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/dir.c                | 1 +
 fs/ceph/inode.c              | 1 +
 fs/ceph/mds_client.c         | 2 ++
 fs/ceph/mds_client.h         | 1 +
 include/linux/ceph/ceph_fs.h | 5 +++--
 5 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 9d2eca67985a..0d97c2962314 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1116,6 +1116,7 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
 	    get_caps_for_async_unlink(dir, dentry)) {
 		dout("ceph: Async unlink on %lu/%.*s", dir->i_ino,
 		     dentry->d_name.len, dentry->d_name.name);
+		set_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
 		req->r_callback = ceph_async_unlink_cb;
 		req->r_old_inode = d_inode(dentry);
 		ihold(req->r_old_inode);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 79bb1e6af090..4056c7968b86 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1317,6 +1317,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
 				NULL, session,
 				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
+				 !test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags) &&
 				 rinfo->head->result == 0) ?  req->r_fmode : -1,
 				&req->r_caps_reservation);
 		if (err < 0) {
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index d0ceb3f46570..4ca43348bfd7 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2620,6 +2620,8 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
 	rhead->oldest_client_tid = cpu_to_le64(__get_oldest_tid(mdsc));
 	if (test_bit(CEPH_MDS_R_GOT_UNSAFE, &req->r_req_flags))
 		flags |= CEPH_MDS_FLAG_REPLAY;
+	if (test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags))
+		flags |= CEPH_MDS_FLAG_ASYNC;
 	if (req->r_parent)
 		flags |= CEPH_MDS_FLAG_WANT_DENTRY;
 	rhead->flags = cpu_to_le32(flags);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 54df093a7db0..31f68897bc87 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -258,6 +258,7 @@ struct ceph_mds_request {
 #define CEPH_MDS_R_GOT_RESULT		(5) /* got a result */
 #define CEPH_MDS_R_DID_PREPOPULATE	(6) /* prepopulated readdir */
 #define CEPH_MDS_R_PARENT_LOCKED	(7) /* is r_parent->i_rwsem wlocked? */
+#define CEPH_MDS_R_ASYNC		(8) /* async request */
 	unsigned long	r_req_flags;
 
 	struct mutex r_fill_mutex;
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index a099f60feb7b..91d09cf37649 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -444,8 +444,9 @@ union ceph_mds_request_args {
 	} __attribute__ ((packed)) lookupino;
 } __attribute__ ((packed));
 
-#define CEPH_MDS_FLAG_REPLAY        1  /* this is a replayed op */
-#define CEPH_MDS_FLAG_WANT_DENTRY   2  /* want dentry in reply */
+#define CEPH_MDS_FLAG_REPLAY		1 /* this is a replayed op */
+#define CEPH_MDS_FLAG_WANT_DENTRY	2 /* want dentry in reply */
+#define CEPH_MDS_FLAG_ASYNC		4 /* request is asynchronous */
 
 struct ceph_mds_request_head {
 	__le64 oldest_client_tid;
-- 
2.24.1

* [RFC PATCH v3 07/10] ceph: add infrastructure for waiting for async create to complete
  2020-01-21 19:29 [RFC PATCH v3 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (5 preceding siblings ...)
  2020-01-21 19:29 ` [RFC PATCH v3 06/10] ceph: add flag to designate that a request is asynchronous Jeff Layton
@ 2020-01-21 19:29 ` Jeff Layton
  2020-01-21 19:29 ` [RFC PATCH v3 08/10] ceph: add new MDS req field to hold delegated inode number Jeff Layton
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2020-01-21 19:29 UTC
  To: ceph-devel; +Cc: idridryomov, sage, zyan

When we issue an async create, we must ensure that any later on-the-wire
requests involving it wait for the create reply.

Expand i_ceph_flags to be an unsigned long, and add a new bit that
MDS requests can wait on. If the bit is set on an inode when we go to
send caps, don't send the cap message; just report that it has been
delayed.
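
The two halves of the handshake, condensed (the wait side is the
inline helper added below; the clear-and-wake side arrives with the
async create callback in the final patch):

	/* waiter: anyone about to put this inode on the wire */
	wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
		    TASK_INTERRUPTIBLE);

	/* waker: the async create reply handler */
	spin_lock(&ci->i_ceph_lock);
	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
		ci->i_ceph_flags &= ~CEPH_I_ASYNC_CREATE;
		wake_up_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT);
	}
	spin_unlock(&ci->i_ceph_lock);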

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c       | 13 ++++++++++++-
 fs/ceph/dir.c        |  2 +-
 fs/ceph/mds_client.c |  9 +++++++++
 fs/ceph/mds_client.h |  8 ++++++++
 fs/ceph/super.h      |  4 +++-
 5 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index c983990acb75..869e2102e827 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -511,7 +511,7 @@ static void __cap_delay_requeue(struct ceph_mds_client *mdsc,
 				struct ceph_inode_info *ci,
 				bool set_timeout)
 {
-	dout("__cap_delay_requeue %p flags %d at %lu\n", &ci->vfs_inode,
+	dout("__cap_delay_requeue %p flags 0x%lx at %lu\n", &ci->vfs_inode,
 	     ci->i_ceph_flags, ci->i_hold_caps_max);
 	if (!mdsc->stopping) {
 		spin_lock(&mdsc->cap_delay_lock);
@@ -1298,6 +1298,13 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 	int delayed = 0;
 	int ret;
 
+	/* Don't send anything if it's still being created. Return delayed */
+	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
+		spin_unlock(&ci->i_ceph_lock);
+		dout("%s async create in flight for %p\n", __func__, inode);
+		return 1;
+	}
+
 	held = cap->issued | cap->implemented;
 	revoking = cap->implemented & ~cap->issued;
 	retain &= ~revoking;
@@ -2257,6 +2264,10 @@ int ceph_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	if (datasync)
 		goto out;
 
+	ret = ceph_wait_on_async_create(inode);
+	if (ret)
+		goto out;
+
 	dirty = try_flush_caps(inode, &flush_tid);
 	dout("fsync dirty caps are %s\n", ceph_cap_string(dirty));
 
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 0d97c2962314..b2bcd01ab4e9 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -752,7 +752,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
 		struct ceph_dentry_info *di = ceph_dentry(dentry);
 
 		spin_lock(&ci->i_ceph_lock);
-		dout(" dir %p flags are %d\n", dir, ci->i_ceph_flags);
+		dout(" dir %p flags are 0x%lx\n", dir, ci->i_ceph_flags);
 		if (strncmp(dentry->d_name.name,
 			    fsc->mount_options->snapdir_name,
 			    dentry->d_name.len) &&
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 4ca43348bfd7..ea9cfed6c027 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2822,6 +2822,15 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
 		ceph_get_cap_refs(ceph_inode(req->r_old_dentry_dir),
 				  CEPH_CAP_PIN);
 
+	if (req->r_inode) {
+		err = ceph_wait_on_async_create(req->r_inode);
+		if (err) {
+			dout("%s: wait for async create returned: %d\n",
+			     __func__, err);
+			return err;
+		}
+	}
+
 	dout("submit_request on %p for inode %p\n", req, dir);
 	mutex_lock(&mdsc->mutex);
 	__register_request(mdsc, req, dir);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 31f68897bc87..acad9adca0af 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -543,4 +543,12 @@ extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
 			  int max_caps);
 
 extern u64 ceph_get_deleg_ino(struct ceph_mds_session *session);
+
+static inline int ceph_wait_on_async_create(struct inode *inode)
+{
+	struct ceph_inode_info *ci = ceph_inode(inode);
+
+	return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
+			   TASK_INTERRUPTIBLE);
+}
 #endif
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index cb7b549f0995..86dd4a2163e0 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -319,7 +319,7 @@ struct ceph_inode_info {
 	u64 i_inline_version;
 	u32 i_time_warp_seq;
 
-	unsigned i_ceph_flags;
+	unsigned long i_ceph_flags;
 	atomic64_t i_release_count;
 	atomic64_t i_ordered_count;
 	atomic64_t i_complete_seq[2];
@@ -527,6 +527,8 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 #define CEPH_I_ERROR_WRITE	(1 << 10) /* have seen write errors */
 #define CEPH_I_ERROR_FILELOCK	(1 << 11) /* have seen file lock errors */
 #define CEPH_I_ODIRECT		(1 << 12) /* inode in direct I/O mode */
+#define CEPH_ASYNC_CREATE_BIT	(13)	  /* async create in flight for this */
+#define CEPH_I_ASYNC_CREATE	(1 << CEPH_ASYNC_CREATE_BIT)
 
 /*
  * Masks of ceph inode work.
-- 
2.24.1

* [RFC PATCH v3 08/10] ceph: add new MDS req field to hold delegated inode number
  2020-01-21 19:29 [RFC PATCH v3 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (6 preceding siblings ...)
  2020-01-21 19:29 ` [RFC PATCH v3 07/10] ceph: add infrastructure for waiting for async create to complete Jeff Layton
@ 2020-01-21 19:29 ` Jeff Layton
  2020-01-21 19:29 ` [RFC PATCH v3 09/10] ceph: cache layout in parent dir on first sync create Jeff Layton
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2020-01-21 19:29 UTC
  To: ceph-devel; +Cc: idridryomov, sage, zyan

Add a new request field to hold the delegated inode number. Encode that
into the message when it's set.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/mds_client.c | 3 +--
 fs/ceph/mds_client.h | 1 +
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index ea9cfed6c027..e9de37691ce8 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2466,7 +2466,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc,
 	head->op = cpu_to_le32(req->r_op);
 	head->caller_uid = cpu_to_le32(from_kuid(&init_user_ns, req->r_uid));
 	head->caller_gid = cpu_to_le32(from_kgid(&init_user_ns, req->r_gid));
-	head->ino = 0;
+	head->ino = cpu_to_le64(req->r_deleg_ino);
 	head->args = req->r_args;
 
 	ceph_encode_filepath(&p, end, ino1, path1);
@@ -2627,7 +2627,6 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
 	rhead->flags = cpu_to_le32(flags);
 	rhead->num_fwd = req->r_num_fwd;
 	rhead->num_retry = req->r_attempts - 1;
-	rhead->ino = 0;
 
 	dout(" r_parent = %p\n", req->r_parent);
 	return 0;
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index acad9adca0af..4f25fd6df3f9 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -308,6 +308,7 @@ struct ceph_mds_request {
 	int               r_num_fwd;    /* number of forward attempts */
 	int               r_resend_mds; /* mds to resend to next, if any*/
 	u32               r_sent_on_mseq; /* cap mseq request was sent at*/
+	u64		  r_deleg_ino;
 
 	struct list_head  r_wait;
 	struct completion r_completion;
-- 
2.24.1

* [RFC PATCH v3 09/10] ceph: cache layout in parent dir on first sync create
  2020-01-21 19:29 [RFC PATCH v3 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (7 preceding siblings ...)
  2020-01-21 19:29 ` [RFC PATCH v3 08/10] ceph: add new MDS req field to hold delegated inode number Jeff Layton
@ 2020-01-21 19:29 ` Jeff Layton
  2020-01-21 19:29 ` [RFC PATCH v3 10/10] ceph: attempt to do async create when possible Jeff Layton
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2020-01-21 19:29 UTC
  To: ceph-devel; +Cc: idridryomov, sage, zyan

It doesn't do much good to do an asynchronous create unless we can do
I/O to it before the create reply comes in. That means we need layout
info for the new file before we've gotten the response from the MDS.

All files created in a directory will initially inherit the same layout,
so copy off the requisite info from the first synchronous create in the
directory, and save it in a new i_cached_layout field. Zero out the
layout when we lose Dc caps in the dir.
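
The caching step itself is small; condensed from the
cache_file_layout() hunk below (cap and validity checks elided):

	spin_lock(&cdst->i_ceph_lock);
	/* the struct copy brings over csrc's pool_ns pointer too... */
	memcpy(&cdst->i_cached_layout, &csrc->i_layout,
	       sizeof(cdst->i_cached_layout));
	/* ...so immediately replace it with a properly-referenced copy */
	rcu_assign_pointer(cdst->i_cached_layout.pool_ns,
			   ceph_try_get_string(csrc->i_layout.pool_ns));
	spin_unlock(&cdst->i_ceph_lock);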

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c  | 13 ++++++++++---
 fs/ceph/file.c  | 22 +++++++++++++++++++++-
 fs/ceph/inode.c |  2 ++
 fs/ceph/super.h |  1 +
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 869e2102e827..0c95a7c9c7c1 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -561,14 +561,14 @@ static void __cap_delay_cancel(struct ceph_mds_client *mdsc,
 	spin_unlock(&mdsc->cap_delay_lock);
 }
 
-/*
- * Common issue checks for add_cap, handle_cap_grant.
- */
+/* Common issue checks for add_cap, handle_cap_grant. */
 static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
 			      unsigned issued)
 {
 	unsigned had = __ceph_caps_issued(ci, NULL);
 
+	lockdep_assert_held(&ci->i_ceph_lock);
+
 	/*
 	 * Each time we receive FILE_CACHE anew, we increment
 	 * i_rdcache_gen.
@@ -593,6 +593,13 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
 			__ceph_dir_clear_complete(ci);
 		}
 	}
+
+	/* Wipe saved layout if we're losing DIR_CREATE caps */
+	if (S_ISDIR(ci->vfs_inode.i_mode) && (had & CEPH_CAP_DIR_CREATE) &&
+		!(issued & CEPH_CAP_DIR_CREATE)) {
+	     ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
+	     memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
+	}
 }
 
 /*
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 1e6cdf2dfe90..b44ccbc85fe4 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -430,6 +430,23 @@ int ceph_open(struct inode *inode, struct file *file)
 	return err;
 }
 
+/* Clone the layout from a synchronous create, if the dir now has Dc caps */
+static void
+cache_file_layout(struct inode *dst, struct inode *src)
+{
+	struct ceph_inode_info *cdst = ceph_inode(dst);
+	struct ceph_inode_info *csrc = ceph_inode(src);
+
+	spin_lock(&cdst->i_ceph_lock);
+	if ((__ceph_caps_issued(cdst, NULL) & CEPH_CAP_DIR_CREATE) &&
+	    !ceph_file_layout_is_valid(&cdst->i_cached_layout)) {
+		memcpy(&cdst->i_cached_layout, &csrc->i_layout,
+			sizeof(cdst->i_cached_layout));
+		rcu_assign_pointer(cdst->i_cached_layout.pool_ns,
+				   ceph_try_get_string(csrc->i_layout.pool_ns));
+	}
+	spin_unlock(&cdst->i_ceph_lock);
+}
 
 /*
  * Do a lookup + open with a single request.  If we get a non-existent
@@ -518,7 +535,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	} else {
 		dout("atomic_open finish_open on dn %p\n", dn);
 		if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
-			ceph_init_inode_acls(d_inode(dentry), &as_ctx);
+			struct inode *newino = d_inode(dentry);
+
+			cache_file_layout(dir, newino);
+			ceph_init_inode_acls(newino, &as_ctx);
 			file->f_mode |= FMODE_CREATED;
 		}
 		err = finish_open(file, dentry, ceph_open);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 4056c7968b86..73f986efb1fd 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -447,6 +447,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 	ci->i_max_files = 0;
 
 	memset(&ci->i_dir_layout, 0, sizeof(ci->i_dir_layout));
+	memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
 	RCU_INIT_POINTER(ci->i_layout.pool_ns, NULL);
 
 	ci->i_fragtree = RB_ROOT;
@@ -587,6 +588,7 @@ void ceph_evict_inode(struct inode *inode)
 		ceph_buffer_put(ci->i_xattrs.prealloc_blob);
 
 	ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
+	ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
 }
 
 static inline blkcnt_t calc_inode_blocks(u64 size)
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 86dd4a2163e0..09bd4b71de91 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -326,6 +326,7 @@ struct ceph_inode_info {
 
 	struct ceph_dir_layout i_dir_layout;
 	struct ceph_file_layout i_layout;
+	struct ceph_file_layout i_cached_layout;	// for async creates
 	char *i_symlink;
 
 	/* for dirs */
-- 
2.24.1

* [RFC PATCH v3 10/10] ceph: attempt to do async create when possible
  2020-01-21 19:29 [RFC PATCH v3 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (8 preceding siblings ...)
  2020-01-21 19:29 ` [RFC PATCH v3 09/10] ceph: cache layout in parent dir on first sync create Jeff Layton
@ 2020-01-21 19:29 ` Jeff Layton
  2020-01-22  7:29 ` [RFC PATCH v3 00/10] ceph: asynchronous file create support Yan, Zheng
  2020-01-22 17:04 ` Yan, Zheng
  11 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2020-01-21 19:29 UTC
  To: ceph-devel; +Cc: idridryomov, sage, zyan

With the Octopus release, the MDS will hand out directory create caps.

If we have Fxc caps on the directory, and complete directory information
or a known negative dentry, then we can return without waiting on the
reply, allowing the open() call to return very quickly to userland.

We use the normal ceph_fill_inode() routine to fill in the inode, so we
have to gin up some reply inode information with what we'd expect the
newly-created inode to have. The client assumes that it has a full set
of caps on the new inode, and that the MDS will revoke them when there
is conflicting access.

This functionality is gated on the enable_async_dirops module option,
along with async unlinks, and on the server supporting the necessary
CephFS feature bit.
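
To see the effect from userland, time a burst of creates in a cephfs
directory; with async creates working, open(O_CREAT) should no longer
pay a synchronous MDS round trip per file. A hypothetical measurement
sketch (not from this posting):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct timespec a, b;
	char name[64];
	int i, fd;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < 1000; i++) {
		snprintf(name, sizeof(name), "f.%d", i);
		fd = open(name, O_CREAT | O_EXCL | O_WRONLY, 0644);
		if (fd < 0) {
			perror(name);
			return 1;
		}
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &b);
	printf("1000 creates: %.3f ms\n",
	       (b.tv_sec - a.tv_sec) * 1e3 +
	       (b.tv_nsec - a.tv_nsec) / 1e6);
	return 0;
}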

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/file.c               | 218 +++++++++++++++++++++++++++++++++--
 include/linux/ceph/ceph_fs.h |   3 +
 2 files changed, 212 insertions(+), 9 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index b44ccbc85fe4..170bbb18f33e 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -448,6 +448,190 @@ cache_file_layout(struct inode *dst, struct inode *src)
 	spin_unlock(&cdst->i_ceph_lock);
 }
 
+/*
+ * Try to set up an async create. We need caps, a file layout, and inode number,
+ * and either a lease on the dentry or complete dir info. If any of those
+ * criteria are not satisfied, then return false and the caller can go
+ * synchronous.
+ */
+static bool try_prep_async_create(struct inode *dir, struct dentry *dentry,
+				  struct ceph_file_layout *lo, u64 *pino)
+{
+	struct ceph_inode_info *ci = ceph_inode(dir);
+	struct ceph_dentry_info *di = ceph_dentry(dentry);
+	bool ret = false;
+	u64 ino;
+
+	spin_lock(&ci->i_ceph_lock);
+	/* No auth cap means no chance for Dc caps */
+	if (!ci->i_auth_cap)
+		goto no_async;
+
+	/* Any delegated inos? */
+	if (xa_empty(&ci->i_auth_cap->session->s_delegated_inos))
+		goto no_async;
+
+	if (!ceph_file_layout_is_valid(&ci->i_cached_layout))
+		goto no_async;
+
+	if ((__ceph_caps_issued(ci, NULL) &
+	     (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE)) !=
+	    (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE))
+		goto no_async;
+
+	if (d_in_lookup(dentry)) {
+		if (!__ceph_dir_is_complete(ci))
+			goto no_async;
+	} else if (atomic_read(&ci->i_shared_gen) !=
+		   READ_ONCE(di->lease_shared_gen)) {
+		goto no_async;
+	}
+
+	ino = ceph_get_deleg_ino(ci->i_auth_cap->session);
+	if (!ino)
+		goto no_async;
+
+	*pino = ino;
+	ceph_take_cap_refs(ci, CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE, false);
+	memcpy(lo, &ci->i_cached_layout, sizeof(*lo));
+	rcu_assign_pointer(lo->pool_ns,
+			   ceph_try_get_string(ci->i_cached_layout.pool_ns));
+	ret = true;
+no_async:
+	spin_unlock(&ci->i_ceph_lock);
+	return ret;
+}
+
+static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
+                                 struct ceph_mds_request *req)
+{
+	int result = req->r_err ? req->r_err :
+			le32_to_cpu(req->r_reply_info.head->result);
+
+	if (result) {
+		struct dentry *dentry = req->r_dentry;
+
+		ceph_dir_clear_complete(req->r_parent);
+		if (!d_unhashed(dentry))
+			d_drop(dentry);
+
+		/* FIXME: start returning I/O errors on all accesses? */
+		pr_warn("%s: async create failure result=%d!\n", __func__, result);
+	}
+
+	mapping_set_error(req->r_parent->i_mapping, result);
+
+	if (req->r_target_inode) {
+		struct ceph_inode_info *ci = ceph_inode(req->r_target_inode);
+		u64 ino = ceph_vino(req->r_target_inode).ino;
+
+		if (req->r_deleg_ino != ino)
+			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%llx target=0x%llx\n",
+				__func__, req->r_err, req->r_deleg_ino, ino);
+		mapping_set_error(req->r_target_inode->i_mapping, result);
+
+		spin_lock(&ci->i_ceph_lock);
+		if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
+			ci->i_ceph_flags &= ~CEPH_I_ASYNC_CREATE;
+			wake_up_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT);
+		}
+		spin_unlock(&ci->i_ceph_lock);
+	} else {
+		pr_warn("%s: no req->r_target_inode for 0x%llx\n", __func__,
+			req->r_deleg_ino);
+	}
+	ceph_put_cap_refs(ceph_inode(req->r_parent),
+			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
+}
+
+static int ceph_finish_async_create(struct inode *dir, struct dentry *dentry,
+				    struct file *file, umode_t mode,
+				    struct ceph_mds_request *req,
+				    struct ceph_acl_sec_ctx *as_ctx,
+				    struct ceph_file_layout *lo)
+{
+	int ret;
+	char xattr_buf[4];
+	struct ceph_mds_reply_inode in = { };
+	struct ceph_mds_reply_info_in iinfo = { .in = &in };
+	struct ceph_inode_info *ci = ceph_inode(dir);
+	struct inode *inode;
+	struct timespec64 now;
+	struct ceph_vino vino = { .ino = req->r_deleg_ino,
+				  .snap = CEPH_NOSNAP };
+
+	ktime_get_real_ts64(&now);
+
+	inode = ceph_get_inode(dentry->d_sb, vino);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	iinfo.inline_version = CEPH_INLINE_NONE;
+	iinfo.change_attr = 1;
+	ceph_encode_timespec64(&iinfo.btime, &now);
+
+	iinfo.xattr_len = ARRAY_SIZE(xattr_buf);
+	iinfo.xattr_data = xattr_buf;
+	memset(iinfo.xattr_data, 0, iinfo.xattr_len);
+
+	in.ino = cpu_to_le64(vino.ino);
+	in.snapid = cpu_to_le64(CEPH_NOSNAP);
+	in.version = cpu_to_le64(1);	// ???
+	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
+	in.cap.cap_id = cpu_to_le64(1);
+	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
+	in.cap.flags = CEPH_CAP_FLAG_AUTH;
+	in.ctime = in.mtime = in.atime = iinfo.btime;
+	in.mode = cpu_to_le32((u32)mode);
+	in.truncate_seq = cpu_to_le32(1);
+	in.truncate_size = cpu_to_le64(-1ULL);
+	in.xattr_version = cpu_to_le64(1);
+	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
+	in.gid = cpu_to_le32(from_kgid(&init_user_ns, dir->i_mode & S_ISGID ?
+				dir->i_gid : current_fsgid()));
+	in.nlink = cpu_to_le32(1);
+	in.max_size = cpu_to_le64(lo->stripe_unit);
+
+	ceph_file_layout_to_legacy(lo, &in.layout);
+
+	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
+			      req->r_fmode, NULL);
+	if (ret) {
+		dout("%s failed to fill inode: %d\n", __func__, ret);
+		ceph_dir_clear_complete(dir);
+		if (!d_unhashed(dentry))
+			d_drop(dentry);
+		if (inode->i_state & I_NEW)
+			discard_new_inode(inode);
+	} else {
+		struct dentry *dn;
+
+		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
+			vino.ino, dir->i_ino, dentry->d_name.name);
+		ceph_dir_clear_ordered(dir);
+		ceph_init_inode_acls(inode, as_ctx);
+		if (inode->i_state & I_NEW) {
+			/*
+			 * If it's not I_NEW, then someone created this before
+			 * we got here. Assume the server is aware of it at
+			 * that point and don't worry about setting
+			 * CEPH_I_ASYNC_CREATE.
+			 */
+			ceph_inode(inode)->i_ceph_flags = CEPH_I_ASYNC_CREATE;
+			unlock_new_inode(inode);
+		}
+		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
+			if (!d_unhashed(dentry))
+				d_drop(dentry);
+			dn = d_splice_alias(inode, dentry);
+			WARN_ON_ONCE(dn && dn != dentry);
+		}
+		file->f_mode |= FMODE_CREATED;
+		ret = finish_open(file, dentry, ceph_open);
+	}
+	return ret;
+}
+
 /*
  * Do a lookup + open with a single request.  If we get a non-existent
  * file or symlink, return 1 so the VFS can retry.
@@ -460,6 +644,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	struct ceph_mds_request *req;
 	struct dentry *dn;
 	struct ceph_acl_sec_ctx as_ctx = {};
+	bool try_async = enable_async_dirops;
 	int mask;
 	int err;
 
@@ -492,28 +677,42 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	}
 	req->r_dentry = dget(dentry);
 	req->r_num_caps = 2;
+	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
+	if (ceph_security_xattr_wanted(dir))
+		mask |= CEPH_CAP_XATTR_SHARED;
+	req->r_args.open.mask = cpu_to_le32(mask);
+	req->r_parent = dir;
+
 	if (flags & O_CREAT) {
+		struct ceph_file_layout lo;
+
 		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
 		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
 		if (as_ctx.pagelist) {
 			req->r_pagelist = as_ctx.pagelist;
 			as_ctx.pagelist = NULL;
 		}
+		if (try_async && try_prep_async_create(dir, dentry, &lo,
+						       &req->r_deleg_ino)) {
+			set_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
+			req->r_args.open.flags |= cpu_to_le32(CEPH_O_EXCL);
+			req->r_callback = ceph_async_create_cb;
+			err = ceph_mdsc_submit_request(mdsc, dir, req);
+			if (!err)
+				err = ceph_finish_async_create(dir, dentry,
+							file, mode, req,
+							&as_ctx, &lo);
+			goto out_req;
+		}
 	}
 
-       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
-       if (ceph_security_xattr_wanted(dir))
-               mask |= CEPH_CAP_XATTR_SHARED;
-       req->r_args.open.mask = cpu_to_le32(mask);
-
-	req->r_parent = dir;
 	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
 	err = ceph_mdsc_do_request(mdsc,
 				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
 				   req);
 	err = ceph_handle_snapdir(req, dentry, err);
 	if (err)
-		goto out_req;
+		goto out_fmode;
 
 	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
 		err = ceph_handle_notrace_create(dir, dentry);
@@ -527,7 +726,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 		dn = NULL;
 	}
 	if (err)
-		goto out_req;
+		goto out_fmode;
 	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
 		/* make vfs retry on splice, ENOENT, or symlink */
 		dout("atomic_open finish_no_open on dn %p\n", dn);
@@ -543,9 +742,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 		}
 		err = finish_open(file, dentry, ceph_open);
 	}
-out_req:
+out_fmode:
 	if (!req->r_err && req->r_target_inode)
 		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
+out_req:
 	ceph_mdsc_put_request(req);
 out_ctx:
 	ceph_release_acl_sec_ctx(&as_ctx);
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index 91d09cf37649..e035c5194005 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -659,6 +659,9 @@ int ceph_flags_to_mode(int flags);
 #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
 			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
 			   CEPH_CAP_PIN)
+#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
+			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
+			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
 
 #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
 			CEPH_LOCK_IXATTR)
-- 
2.24.1

* Re: [RFC PATCH v3 03/10] ceph: make dentry_lease_is_valid non-static
  2020-01-21 19:29 ` [RFC PATCH v3 03/10] ceph: make dentry_lease_is_valid non-static Jeff Layton
@ 2020-01-22  7:20   ` Yan, Zheng
  2020-01-22 11:26     ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Yan, Zheng @ 2020-01-22  7:20 UTC
  To: Jeff Layton; +Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan

On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> ...and move a comment over the proper function.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/dir.c   | 10 +++++-----
>  fs/ceph/super.h |  1 +
>  2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 10294f07f5f0..9d2eca67985a 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -1477,10 +1477,6 @@ void ceph_invalidate_dentry_lease(struct dentry *dentry)
>         spin_unlock(&dentry->d_lock);
>  }
>
> -/*
> - * Check if dentry lease is valid.  If not, delete the lease.  Try to
> - * renew if the least is more than half up.
> - */
>  static bool __dentry_lease_is_valid(struct ceph_dentry_info *di)
>  {
>         struct ceph_mds_session *session;
> @@ -1507,7 +1503,11 @@ static bool __dentry_lease_is_valid(struct ceph_dentry_info *di)
>         return false;
>  }
>
> -static int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags)
> +/*
> + * Check if dentry lease is valid.  If not, delete the lease.  Try to
> + * renew if the least is more than half up.
> + */
> +int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags)
>  {
>         struct ceph_dentry_info *di;
>         struct ceph_mds_session *session = NULL;
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index ec4d66d7c261..f27b2bf9a3f5 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -1121,6 +1121,7 @@ extern int ceph_handle_snapdir(struct ceph_mds_request *req,
>  extern struct dentry *ceph_finish_lookup(struct ceph_mds_request *req,
>                                          struct dentry *dentry, int err);
>
> +extern int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags);
>  extern void __ceph_dentry_lease_touch(struct ceph_dentry_info *di);
>  extern void __ceph_dentry_dir_lease_touch(struct ceph_dentry_info *di);
>  extern void ceph_invalidate_dentry_lease(struct dentry *dentry);
> --
> 2.24.1
>

This change is not needed

* Re: [RFC PATCH v3 00/10] ceph: asynchronous file create support
  2020-01-21 19:29 [RFC PATCH v3 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (9 preceding siblings ...)
  2020-01-21 19:29 ` [RFC PATCH v3 10/10] ceph: attempt to do async create when possible Jeff Layton
@ 2020-01-22  7:29 ` Yan, Zheng
  2020-01-22 17:04 ` Yan, Zheng
  11 siblings, 0 replies; 20+ messages in thread
From: Yan, Zheng @ 2020-01-22  7:29 UTC
  To: Jeff Layton, ceph-devel; +Cc: idridryomov, sage

On 1/22/20 3:29 AM, Jeff Layton wrote:
> v3:
> - move some cephfs-specific code into ceph.ko
> - present and track inode numbers as u64 values
> - fix up check for dentry and cap eligibility checks
> - set O_CEPH_EXCL on async creates
> - attempt to handle errors better on async create (invalidate dentries
>    and dir completeness).
> - ensure that fsync waits for async create to complete
> 
> v2:
> - move cached layout to dedicated field in inode
> - protect cached layout with i_ceph_lock
> - wipe cached layout in __check_cap_issue
> - set max_size of file to layout.stripe_unit
> - set truncate_size to (u64)-1
> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> - set cap_id to 1 in async created inode
> - allocate inode number before submitting request
> - rework the prep for an async create to be more efficient
> - don't allow MDS or cap messages involving an inode until we get async
>    create reply
> 
> Still not quite ready for merge, but I've cleaned up a number of warts
> in the v2 set. Performance numbers still look about the same.
> 
> There is definitely still a race of some sort that causes the client to
> try to asynchronously create a dentry that already exists. I'm still
> working on tracking that down.
> 
> Jeff Layton (10):
>    ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
>    ceph: make ceph_fill_inode non-static
>    ceph: make dentry_lease_is_valid non-static
>    ceph: make __take_cap_refs non-static
>    ceph: decode interval_sets for delegated inos
>    ceph: add flag to designate that a request is asynchronous
>    ceph: add infrastructure for waiting for async create to complete
>    ceph: add new MDS req field to hold delegated inode number
>    ceph: cache layout in parent dir on first sync create
>    ceph: attempt to do async create when possible
> 
>   fs/ceph/Makefile                     |   2 +-
>   fs/ceph/caps.c                       |  38 +++--
>   fs/ceph/dir.c                        |  13 +-
>   fs/ceph/file.c                       | 240 +++++++++++++++++++++++++--
>   fs/ceph/inode.c                      |  50 +++---
>   fs/ceph/mds_client.c                 | 123 ++++++++++++--
>   fs/ceph/mds_client.h                 |  17 +-
>   fs/ceph/super.h                      |  16 +-
>   net/ceph/ceph_fs.c => fs/ceph/util.c |   4 -
>   include/linux/ceph/ceph_fs.h         |   8 +-
>   net/ceph/Makefile                    |   2 +-
>   11 files changed, 443 insertions(+), 70 deletions(-)
>   rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)
> 

Series

Reviewed-By: "Yan, Zheng" <zyan@redhat.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 03/10] ceph: make dentry_lease_is_valid non-static
  2020-01-22  7:20   ` Yan, Zheng
@ 2020-01-22 11:26     ` Jeff Layton
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2020-01-22 11:26 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan

On Wed, 2020-01-22 at 15:20 +0800, Yan, Zheng wrote:
> On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote:
> > ...and move a comment over the proper function.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/ceph/dir.c   | 10 +++++-----
> >  fs/ceph/super.h |  1 +
> >  2 files changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> > index 10294f07f5f0..9d2eca67985a 100644
> > --- a/fs/ceph/dir.c
> > +++ b/fs/ceph/dir.c
> > @@ -1477,10 +1477,6 @@ void ceph_invalidate_dentry_lease(struct dentry *dentry)
> >         spin_unlock(&dentry->d_lock);
> >  }
> > 
> > -/*
> > - * Check if dentry lease is valid.  If not, delete the lease.  Try to
> > - * renew if the least is more than half up.
> > - */
> >  static bool __dentry_lease_is_valid(struct ceph_dentry_info *di)
> >  {
> >         struct ceph_mds_session *session;
> > @@ -1507,7 +1503,11 @@ static bool __dentry_lease_is_valid(struct ceph_dentry_info *di)
> >         return false;
> >  }
> > 
> > -static int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags)
> > +/*
> > + * Check if dentry lease is valid.  If not, delete the lease.  Try to
> > + * renew if the lease is more than half up.
> > + */
> > +int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags)
> >  {
> >         struct ceph_dentry_info *di;
> >         struct ceph_mds_session *session = NULL;
> > diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> > index ec4d66d7c261..f27b2bf9a3f5 100644
> > --- a/fs/ceph/super.h
> > +++ b/fs/ceph/super.h
> > @@ -1121,6 +1121,7 @@ extern int ceph_handle_snapdir(struct ceph_mds_request *req,
> >  extern struct dentry *ceph_finish_lookup(struct ceph_mds_request *req,
> >                                          struct dentry *dentry, int err);
> > 
> > +extern int dentry_lease_is_valid(struct dentry *dentry, unsigned int flags);
> >  extern void __ceph_dentry_lease_touch(struct ceph_dentry_info *di);
> >  extern void __ceph_dentry_dir_lease_touch(struct ceph_dentry_info *di);
> >  extern void ceph_invalidate_dentry_lease(struct dentry *dentry);
> > --
> > 2.24.1
> > 
> 
> This change is not needed

Quite right. I had done this for an earlier version of the series, but
I'll drop this now.

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 00/10] ceph: asynchronous file create support
  2020-01-21 19:29 [RFC PATCH v3 00/10] ceph: asynchronous file create support Jeff Layton
                   ` (10 preceding siblings ...)
  2020-01-22  7:29 ` [RFC PATCH v3 00/10] ceph: asynchronous file create support Yan, Zheng
@ 2020-01-22 17:04 ` Yan, Zheng
  2020-01-24 17:19   ` Jeff Layton
  11 siblings, 1 reply; 20+ messages in thread
From: Yan, Zheng @ 2020-01-22 17:04 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan

On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> v3:
> - move some cephfs-specific code into ceph.ko
> - present and track inode numbers as u64 values
> - fix up check for dentry and cap eligibility checks
> - set O_CEPH_EXCL on async creates
> - attempt to handle errors better on async create (invalidate dentries
>   and dir completeness).
> - ensure that fsync waits for async create to complete
>
> v2:
> - move cached layout to dedicated field in inode
> - protect cached layout with i_ceph_lock
> - wipe cached layout in __check_cap_issue
> - set max_size of file to layout.stripe_unit
> - set truncate_size to (u64)-1
> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> - set cap_id to 1 in async created inode
> - allocate inode number before submitting request
> - rework the prep for an async create to be more efficient
> - don't allow MDS or cap messages involving an inode until we get async
>   create reply
>
> Still not quite ready for merge, but I've cleaned up a number of warts
> in the v2 set. Performance numbers still look about the same.
>
> There is definitely still a race of some sort that causes the client to
> try to asynchronously create a dentry that already exists. I'm still
> working on tracking that down.
>
> Jeff Layton (10):
>   ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
>   ceph: make ceph_fill_inode non-static
>   ceph: make dentry_lease_is_valid non-static
>   ceph: make __take_cap_refs non-static
>   ceph: decode interval_sets for delegated inos
>   ceph: add flag to designate that a request is asynchronous
>   ceph: add infrastructure for waiting for async create to complete
>   ceph: add new MDS req field to hold delegated inode number
>   ceph: cache layout in parent dir on first sync create
>   ceph: attempt to do async create when possible
>
>  fs/ceph/Makefile                     |   2 +-
>  fs/ceph/caps.c                       |  38 +++--
>  fs/ceph/dir.c                        |  13 +-
>  fs/ceph/file.c                       | 240 +++++++++++++++++++++++++--
>  fs/ceph/inode.c                      |  50 +++---
>  fs/ceph/mds_client.c                 | 123 ++++++++++++--
>  fs/ceph/mds_client.h                 |  17 +-
>  fs/ceph/super.h                      |  16 +-
>  net/ceph/ceph_fs.c => fs/ceph/util.c |   4 -
>  include/linux/ceph/ceph_fs.h         |   8 +-
>  net/ceph/Makefile                    |   2 +-
>  11 files changed, 443 insertions(+), 70 deletions(-)
>  rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)
>
> --
> 2.24.1
>

I realized that there are still two issues:
- we need to clear delegated inos when the MDS fails over
- we need to clear caps for async dir operations when reconnecting to
the MDS (see the last commit of https://github.com/ceph/ceph/pull/32576)

Regards
Yan, Zheng

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 00/10] ceph: asynchronous file create support
  2020-01-22 17:04 ` Yan, Zheng
@ 2020-01-24 17:19   ` Jeff Layton
  2020-01-25  1:58     ` Yan, Zheng
  2020-01-25  1:58     ` Yan, Zheng
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff Layton @ 2020-01-24 17:19 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan

On Thu, 2020-01-23 at 01:04 +0800, Yan, Zheng wrote:
> On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote:
> > v3:
> > - move some cephfs-specific code into ceph.ko
> > - present and track inode numbers as u64 values
> > - fix up check for dentry and cap eligibility checks
> > - set O_CEPH_EXCL on async creates
> > - attempt to handle errors better on async create (invalidate dentries
> >   and dir completeness).
> > - ensure that fsync waits for async create to complete
> > 
> > v2:
> > - move cached layout to dedicated field in inode
> > - protect cached layout with i_ceph_lock
> > - wipe cached layout in __check_cap_issue
> > - set max_size of file to layout.stripe_unit
> > - set truncate_size to (u64)-1
> > - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
> > - set cap_id to 1 in async created inode
> > - allocate inode number before submitting request
> > - rework the prep for an async create to be more efficient
> > - don't allow MDS or cap messages involving an inode until we get async
> >   create reply
> > 
> > Still not quite ready for merge, but I've cleaned up a number of warts
> > in the v2 set. Performance numbers still look about the same.
> > 
> > There is definitely still a race of some sort that causes the client to
> > try to asynchronously create a dentry that already exists. I'm still
> > working on tracking that down.
> > 
> > Jeff Layton (10):
> >   ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
> >   ceph: make ceph_fill_inode non-static
> >   ceph: make dentry_lease_is_valid non-static
> >   ceph: make __take_cap_refs non-static
> >   ceph: decode interval_sets for delegated inos
> >   ceph: add flag to designate that a request is asynchronous
> >   ceph: add infrastructure for waiting for async create to complete
> >   ceph: add new MDS req field to hold delegated inode number
> >   ceph: cache layout in parent dir on first sync create
> >   ceph: attempt to do async create when possible
> > 
> >  fs/ceph/Makefile                     |   2 +-
> >  fs/ceph/caps.c                       |  38 +++--
> >  fs/ceph/dir.c                        |  13 +-
> >  fs/ceph/file.c                       | 240 +++++++++++++++++++++++++--
> >  fs/ceph/inode.c                      |  50 +++---
> >  fs/ceph/mds_client.c                 | 123 ++++++++++++--
> >  fs/ceph/mds_client.h                 |  17 +-
> >  fs/ceph/super.h                      |  16 +-
> >  net/ceph/ceph_fs.c => fs/ceph/util.c |   4 -
> >  include/linux/ceph/ceph_fs.h         |   8 +-
> >  net/ceph/Makefile                    |   2 +-
> >  11 files changed, 443 insertions(+), 70 deletions(-)
> >  rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)
> > 
> > --
> > 2.24.1
> > 
> 
> I realized that there are still two issues:
> - we need to clear delegated inos when the MDS fails over

I'm guessing we need to do this whenever any session is reconnected,
right? I think we may have bigger problems here though:

The issue is that with this set we assign out ino_t's prior to
submitting the request. We could get deep into the submission path and
find that we're reconnecting the session. At that point, that ino_t is
no longer valid for the session.
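
As a strawman, wiping the delegation on reconnect could look something
like the sketch below (untested; s_delegated_inos is a made-up field
name, assuming we end up tracking the delegated inos in a per-session
xarray and call this from the session reconnect path):

static void wipe_delegated_inos(struct ceph_mds_session *session)
{
	/*
	 * Drop every ino the old MDS instance handed us. A failed-over
	 * MDS may delegate the same range to a different client, so
	 * none of these are safe to use after a reconnect.
	 */
	xa_destroy(&session->s_delegated_inos);
}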

> - we need to clear caps for async dir operations when reconnecting to
> the MDS (see the last commit of https://github.com/ceph/ceph/pull/32576)
> 

I guess we can use ceph_iterate_session_caps to do this. That said,
looking at your patch:

        if (in->is_dir()) {                                                     
          // remove caps for async dir operations                               
          cap.implemented &= (CEPH_CAP_ANY_SHARED | CEPH_CAP_ANY_EXCL);         
        }                                                                       

We remove all but Fsx here. That seems like quite a subtle difference
from how FILE caps work.

Given the way that we're doing the inode delegation and handling dir
caps, I think we may need to rework things such that async requests
never get queued to the s_waiting list, and instead return some sort of
distinct error that cues the caller to fall back to a synchronous
request.

That would help prevent some of the potential races here.
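
Something like the following, perhaps (untested sketch; the helper name
is made up and the errno is arbitrary, it just needs to be something
the caller can map to a synchronous retry):

static int ceph_submit_async_request(struct ceph_mds_client *mdsc,
				     struct ceph_mds_session *session,
				     struct ceph_mds_request *req)
{
	/*
	 * Never park an async request on s_waiting. If the session
	 * isn't fully open, punt back to the caller so it can redrive
	 * the operation as a normal synchronous request.
	 */
	if (session->s_state != CEPH_MDS_SESSION_OPEN)
		return -EAGAIN;

	return ceph_mdsc_submit_request(mdsc, req->r_parent, req);
}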
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 01/10] ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
  2020-01-21 19:29 ` [RFC PATCH v3 01/10] ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c Jeff Layton
@ 2020-01-24 19:04   ` Jeff Layton
  2020-01-25  8:08     ` Ilya Dryomov
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2020-01-24 19:04 UTC (permalink / raw)
  To: ceph-devel; +Cc: idridryomov, sage, zyan

On Tue, 2020-01-21 at 14:29 -0500, Jeff Layton wrote:
> All of these functions are only called from CephFS, so move them into
> ceph.ko, and drop the exports.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/Makefile                     | 2 +-
>  net/ceph/ceph_fs.c => fs/ceph/util.c | 4 ----
>  net/ceph/Makefile                    | 2 +-
>  3 files changed, 2 insertions(+), 6 deletions(-)
>  rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)
> 
> diff --git a/fs/ceph/Makefile b/fs/ceph/Makefile
> index c1da294418d1..0a0823d378db 100644
> --- a/fs/ceph/Makefile
> +++ b/fs/ceph/Makefile
> @@ -8,7 +8,7 @@ obj-$(CONFIG_CEPH_FS) += ceph.o
>  ceph-y := super.o inode.o dir.o file.o locks.o addr.o ioctl.o \
>  	export.o caps.o snap.o xattr.o quota.o io.o \
>  	mds_client.o mdsmap.o strings.o ceph_frag.o \
> -	debugfs.o
> +	debugfs.o util.o
>  
>  ceph-$(CONFIG_CEPH_FSCACHE) += cache.o
>  ceph-$(CONFIG_CEPH_FS_POSIX_ACL) += acl.o
> diff --git a/net/ceph/ceph_fs.c b/fs/ceph/util.c
> similarity index 94%
> rename from net/ceph/ceph_fs.c
> rename to fs/ceph/util.c
> index 756a2dc10d27..2c34875675bf 100644
> --- a/net/ceph/ceph_fs.c
> +++ b/fs/ceph/util.c
> @@ -39,7 +39,6 @@ void ceph_file_layout_from_legacy(struct ceph_file_layout *fl,
>  	    fl->stripe_count == 0 && fl->object_size == 0)
>  		fl->pool_id = -1;
>  }
> -EXPORT_SYMBOL(ceph_file_layout_from_legacy);
>  
>  void ceph_file_layout_to_legacy(struct ceph_file_layout *fl,
>  				struct ceph_file_layout_legacy *legacy)
> @@ -52,7 +51,6 @@ void ceph_file_layout_to_legacy(struct ceph_file_layout *fl,
>  	else
>  		legacy->fl_pg_pool = 0;
>  }
> -EXPORT_SYMBOL(ceph_file_layout_to_legacy);
>  
>  int ceph_flags_to_mode(int flags)
>  {
> @@ -82,7 +80,6 @@ int ceph_flags_to_mode(int flags)
>  
>  	return mode;
>  }
> -EXPORT_SYMBOL(ceph_flags_to_mode);
>  
>  int ceph_caps_for_mode(int mode)
>  {
> @@ -101,4 +98,3 @@ int ceph_caps_for_mode(int mode)
>  
>  	return caps;
>  }
> -EXPORT_SYMBOL(ceph_caps_for_mode);
> diff --git a/net/ceph/Makefile b/net/ceph/Makefile
> index 59d0ba2072de..ce09bb4fb249 100644
> --- a/net/ceph/Makefile
> +++ b/net/ceph/Makefile
> @@ -13,5 +13,5 @@ libceph-y := ceph_common.o messenger.o msgpool.o buffer.o pagelist.o \
>  	auth.o auth_none.o \
>  	crypto.o armor.o \
>  	auth_x.o \
> -	ceph_fs.o ceph_strings.o ceph_hash.o \
> +	ceph_strings.o ceph_hash.o \
>  	pagevec.o snapshot.o string_table.o

I've gone ahead and merged this patch into testing, as I think it makes
sense on its own and it was becoming a hassle when testing.

Cheers,
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 00/10] ceph: asynchronous file create support
  2020-01-24 17:19   ` Jeff Layton
@ 2020-01-25  1:58     ` Yan, Zheng
  2020-01-25  1:58     ` Yan, Zheng
  1 sibling, 0 replies; 20+ messages in thread
From: Yan, Zheng @ 2020-01-25  1:58 UTC (permalink / raw)
  To: Jeff Layton, Yan, Zheng; +Cc: ceph-devel, idridryomov, Sage Weil

On 1/25/20 1:19 AM, Jeff Layton wrote:
> On Thu, 2020-01-23 at 01:04 +0800, Yan, Zheng wrote:
>> On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote:
>>> v3:
>>> - move some cephfs-specific code into ceph.ko
>>> - present and track inode numbers as u64 values
>>> - fix up check for dentry and cap eligibility checks
>>> - set O_CEPH_EXCL on async creates
>>> - attempt to handle errors better on async create (invalidate dentries
>>>    and dir completeness).
>>> - ensure that fsync waits for async create to complete
>>>
>>> v2:
>>> - move cached layout to dedicated field in inode
>>> - protect cached layout with i_ceph_lock
>>> - wipe cached layout in __check_cap_issue
>>> - set max_size of file to layout.stripe_unit
>>> - set truncate_size to (u64)-1
>>> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
>>> - set cap_id to 1 in async created inode
>>> - allocate inode number before submitting request
>>> - rework the prep for an async create to be more efficient
>>> - don't allow MDS or cap messages involving an inode until we get async
>>>    create reply
>>>
>>> Still not quite ready for merge, but I've cleaned up a number of warts
>>> in the v2 set. Performance numbers still look about the same.
>>>
>>> There is definitely still a race of some sort that causes the client to
>>> try to asynchronously create a dentry that already exists. I'm still
>>> working on tracking that down.
>>>
>>> Jeff Layton (10):
>>>    ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
>>>    ceph: make ceph_fill_inode non-static
>>>    ceph: make dentry_lease_is_valid non-static
>>>    ceph: make __take_cap_refs non-static
>>>    ceph: decode interval_sets for delegated inos
>>>    ceph: add flag to designate that a request is asynchronous
>>>    ceph: add infrastructure for waiting for async create to complete
>>>    ceph: add new MDS req field to hold delegated inode number
>>>    ceph: cache layout in parent dir on first sync create
>>>    ceph: attempt to do async create when possible
>>>
>>>   fs/ceph/Makefile                     |   2 +-
>>>   fs/ceph/caps.c                       |  38 +++--
>>>   fs/ceph/dir.c                        |  13 +-
>>>   fs/ceph/file.c                       | 240 +++++++++++++++++++++++++--
>>>   fs/ceph/inode.c                      |  50 +++---
>>>   fs/ceph/mds_client.c                 | 123 ++++++++++++--
>>>   fs/ceph/mds_client.h                 |  17 +-
>>>   fs/ceph/super.h                      |  16 +-
>>>   net/ceph/ceph_fs.c => fs/ceph/util.c |   4 -
>>>   include/linux/ceph/ceph_fs.h         |   8 +-
>>>   net/ceph/Makefile                    |   2 +-
>>>   11 files changed, 443 insertions(+), 70 deletions(-)
>>>   rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)
>>>
>>> --
>>> 2.24.1
>>>
>>
>> I realized that there are still two issues:
>> - we need to clear delegated inos when the MDS fails over
> 
> I'm guessing we need to do this whenever any session is reconnected,
> right? I think we may have bigger problems here though:
> 
> The issue is that with this set we assign out ino_t's prior to
> submitting the request. We could get deep into the submission path and
> find that we're reconnecting the session. At that point, that ino_t is
> no longer valid for the session.

Yes, that's a problem for the current kclient implementation.


> 
>> - we need to clear caps for async dir operations when reconnecting to
>> the MDS (see the last commit of https://github.com/ceph/ceph/pull/32576)
>>
> 
> I guess we can use ceph_iterate_session_caps to do this. That said,
> looking at your patch:
> 
>          if (in->is_dir()) {
>            // remove caps for async dir operations
>            cap.implemented &= (CEPH_CAP_ANY_SHARED | CEPH_CAP_ANY_EXCL);
>          }
> 
> We remove all but Fsx here. That seems like quite a subtle difference
> from how FILE caps work.

Only FILE caps have more than two bits; this removes all but AsxLsxXsxFsx.
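
To spell out the effect of the mask (illustration only, not code from
the PR; values per include/linux/ceph/ceph_fs.h):

	int before = CEPH_CAP_PIN | CEPH_CAP_AUTH_SHARED |
		     CEPH_CAP_AUTH_EXCL | CEPH_CAP_FILE_SHARED |
		     CEPH_CAP_FILE_EXCL | CEPH_CAP_FILE_CACHE; /* pAsxFsxc */
	int after = before & (CEPH_CAP_ANY_SHARED | CEPH_CAP_ANY_EXCL);
	/*
	 * after == AsxFsx: the SHARED/EXCL bit of each lock survives,
	 * while the PIN bit and the extra FILE bits (here Fc) are gone.
	 */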

> 
> Given the way that we're doing the inode delegation and handling dir
> caps, I think we may need to rework things such that async requests
> never get queued to the s_waiting list, and instead return some sort of
> distinct error that cues the caller to fall back to a synchronous
> request.
> 
> That would help prevent some of the potential races here.

My libcephfs implementation checks whether a request can be async just
before sending the request.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 00/10] ceph: asynchronous file create support
  2020-01-24 17:19   ` Jeff Layton
  2020-01-25  1:58     ` Yan, Zheng
@ 2020-01-25  1:58     ` Yan, Zheng
  1 sibling, 0 replies; 20+ messages in thread
From: Yan, Zheng @ 2020-01-25  1:58 UTC (permalink / raw)
  To: Jeff Layton, Yan, Zheng; +Cc: ceph-devel, idridryomov, Sage Weil

On 1/25/20 1:19 AM, Jeff Layton wrote:
> On Thu, 2020-01-23 at 01:04 +0800, Yan, Zheng wrote:
>> On Wed, Jan 22, 2020 at 3:31 AM Jeff Layton <jlayton@kernel.org> wrote:
>>> v3:
>>> - move some cephfs-specific code into ceph.ko
>>> - present and track inode numbers as u64 values
>>> - fix up check for dentry and cap eligibility checks
>>> - set O_CEPH_EXCL on async creates
>>> - attempt to handle errors better on async create (invalidate dentries
>>>    and dir completeness).
>>> - ensure that fsync waits for async create to complete
>>>
>>> v2:
>>> - move cached layout to dedicated field in inode
>>> - protect cached layout with i_ceph_lock
>>> - wipe cached layout in __check_cap_issue
>>> - set max_size of file to layout.stripe_unit
>>> - set truncate_size to (u64)-1
>>> - use dedicated CephFS feature bit instead of CEPHFS_FEATURE_OCTOPUS
>>> - set cap_id to 1 in async created inode
>>> - allocate inode number before submitting request
>>> - rework the prep for an async create to be more efficient
>>> - don't allow MDS or cap messages involving an inode until we get async
>>>    create reply
>>>
>>> Still not quite ready for merge, but I've cleaned up a number of warts
>>> in the v2 set. Performance numbers still look about the same.
>>>
>>> There is definitely still a race of some sort that causes the client to
>>> try to asynchronously create a dentry that already exists. I'm still
>>> working on tracking that down.
>>>
>>> Jeff Layton (10):
>>>    ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
>>>    ceph: make ceph_fill_inode non-static
>>>    ceph: make dentry_lease_is_valid non-static
>>>    ceph: make __take_cap_refs non-static
>>>    ceph: decode interval_sets for delegated inos
>>>    ceph: add flag to designate that a request is asynchronous
>>>    ceph: add infrastructure for waiting for async create to complete
>>>    ceph: add new MDS req field to hold delegated inode number
>>>    ceph: cache layout in parent dir on first sync create
>>>    ceph: attempt to do async create when possible
>>>
>>>   fs/ceph/Makefile                     |   2 +-
>>>   fs/ceph/caps.c                       |  38 +++--
>>>   fs/ceph/dir.c                        |  13 +-
>>>   fs/ceph/file.c                       | 240 +++++++++++++++++++++++++--
>>>   fs/ceph/inode.c                      |  50 +++---
>>>   fs/ceph/mds_client.c                 | 123 ++++++++++++--
>>>   fs/ceph/mds_client.h                 |  17 +-
>>>   fs/ceph/super.h                      |  16 +-
>>>   net/ceph/ceph_fs.c => fs/ceph/util.c |   4 -
>>>   include/linux/ceph/ceph_fs.h         |   8 +-
>>>   net/ceph/Makefile                    |   2 +-
>>>   11 files changed, 443 insertions(+), 70 deletions(-)
>>>   rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)
>>>
>>> --
>>> 2.24.1
>>>
>>
>> I realized that there are still two issues:
>> - we need to clear delegated inos when the MDS fails over
> 
> I'm guessing we need to do this whenever any session is reconnected,
> right? I think we may have bigger problems here though:
> 
> The issue is that with this set we assign out ino_t's prior to
> submitting the request. We could get deep into the submission path and
> find that we're reconnecting the session. At that point, that ino_t is
> no longer valid for the session.

Yes, that's a problem for the current kclient implementation.


> 
>> - we need to clear caps for async dir operations when reconnecting to
>> the MDS (see the last commit of https://github.com/ceph/ceph/pull/32576)
>>
> 
> I guess we can use ceph_iterate_session_caps to do this. That said,
> looking at your patch:
> 
>          if (in->is_dir()) {
>            // remove caps for async dir operations
>            cap.implemented &= (CEPH_CAP_ANY_SHARED | CEPH_CAP_ANY_EXCL);
>          }
> 
> We remove all but Fsx here. That seems like quite a subtle difference
> from how FILE caps work.

Only FILE caps have more than two bits; this removes all caps other than
AsxLsxXsxFsx.

> 
> Given the way that we're doing the inode delegation and handling dir
> caps, I think we may need to rework things such that async requests
> never get queued to the s_waiting list, and instead return some sort of
> distinct error that cues the caller to fall back to a synchronous
> request.
> 
> That would help prevent some of the potential races here.

My libcephfs implementation checks whether a request can be async just
before sending the request.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v3 01/10] ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
  2020-01-24 19:04   ` Jeff Layton
@ 2020-01-25  8:08     ` Ilya Dryomov
  0 siblings, 0 replies; 20+ messages in thread
From: Ilya Dryomov @ 2020-01-25  8:08 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Ceph Development, idridryomov, Sage Weil, Yan, Zheng

On Fri, Jan 24, 2020 at 9:59 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Tue, 2020-01-21 at 14:29 -0500, Jeff Layton wrote:
> > All of these functions are only called from CephFS, so move them into
> > ceph.ko, and drop the exports.
> >
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/ceph/Makefile                     | 2 +-
> >  net/ceph/ceph_fs.c => fs/ceph/util.c | 4 ----
> >  net/ceph/Makefile                    | 2 +-
> >  3 files changed, 2 insertions(+), 6 deletions(-)
> >  rename net/ceph/ceph_fs.c => fs/ceph/util.c (94%)
> >
> > diff --git a/fs/ceph/Makefile b/fs/ceph/Makefile
> > index c1da294418d1..0a0823d378db 100644
> > --- a/fs/ceph/Makefile
> > +++ b/fs/ceph/Makefile
> > @@ -8,7 +8,7 @@ obj-$(CONFIG_CEPH_FS) += ceph.o
> >  ceph-y := super.o inode.o dir.o file.o locks.o addr.o ioctl.o \
> >       export.o caps.o snap.o xattr.o quota.o io.o \
> >       mds_client.o mdsmap.o strings.o ceph_frag.o \
> > -     debugfs.o
> > +     debugfs.o util.o
> >
> >  ceph-$(CONFIG_CEPH_FSCACHE) += cache.o
> >  ceph-$(CONFIG_CEPH_FS_POSIX_ACL) += acl.o
> > diff --git a/net/ceph/ceph_fs.c b/fs/ceph/util.c
> > similarity index 94%
> > rename from net/ceph/ceph_fs.c
> > rename to fs/ceph/util.c
> > index 756a2dc10d27..2c34875675bf 100644
> > --- a/net/ceph/ceph_fs.c
> > +++ b/fs/ceph/util.c
> > @@ -39,7 +39,6 @@ void ceph_file_layout_from_legacy(struct ceph_file_layout *fl,
> >           fl->stripe_count == 0 && fl->object_size == 0)
> >               fl->pool_id = -1;
> >  }
> > -EXPORT_SYMBOL(ceph_file_layout_from_legacy);
> >
> >  void ceph_file_layout_to_legacy(struct ceph_file_layout *fl,
> >                               struct ceph_file_layout_legacy *legacy)
> > @@ -52,7 +51,6 @@ void ceph_file_layout_to_legacy(struct ceph_file_layout *fl,
> >       else
> >               legacy->fl_pg_pool = 0;
> >  }
> > -EXPORT_SYMBOL(ceph_file_layout_to_legacy);
> >
> >  int ceph_flags_to_mode(int flags)
> >  {
> > @@ -82,7 +80,6 @@ int ceph_flags_to_mode(int flags)
> >
> >       return mode;
> >  }
> > -EXPORT_SYMBOL(ceph_flags_to_mode);
> >
> >  int ceph_caps_for_mode(int mode)
> >  {
> > @@ -101,4 +98,3 @@ int ceph_caps_for_mode(int mode)
> >
> >       return caps;
> >  }
> > -EXPORT_SYMBOL(ceph_caps_for_mode);
> > diff --git a/net/ceph/Makefile b/net/ceph/Makefile
> > index 59d0ba2072de..ce09bb4fb249 100644
> > --- a/net/ceph/Makefile
> > +++ b/net/ceph/Makefile
> > @@ -13,5 +13,5 @@ libceph-y := ceph_common.o messenger.o msgpool.o buffer.o pagelist.o \
> >       auth.o auth_none.o \
> >       crypto.o armor.o \
> >       auth_x.o \
> > -     ceph_fs.o ceph_strings.o ceph_hash.o \
> > +     ceph_strings.o ceph_hash.o \
> >       pagevec.o snapshot.o string_table.o
>
> I've gone ahead and merged this patch into testing, as I think it makes
> sense on its own and it was becoming a hassle when testing.

Yup, sounds good.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread

Thread overview: 20+ messages
2020-01-21 19:29 [RFC PATCH v3 00/10] ceph: asynchronous file create support Jeff Layton
2020-01-21 19:29 ` [RFC PATCH v3 01/10] ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c Jeff Layton
2020-01-24 19:04   ` Jeff Layton
2020-01-25  8:08     ` Ilya Dryomov
2020-01-21 19:29 ` [RFC PATCH v3 02/10] ceph: make ceph_fill_inode non-static Jeff Layton
2020-01-21 19:29 ` [RFC PATCH v3 03/10] ceph: make dentry_lease_is_valid non-static Jeff Layton
2020-01-22  7:20   ` Yan, Zheng
2020-01-22 11:26     ` Jeff Layton
2020-01-21 19:29 ` [RFC PATCH v3 04/10] ceph: make __take_cap_refs non-static Jeff Layton
2020-01-21 19:29 ` [RFC PATCH v3 05/10] ceph: decode interval_sets for delegated inos Jeff Layton
2020-01-21 19:29 ` [RFC PATCH v3 06/10] ceph: add flag to designate that a request is asynchronous Jeff Layton
2020-01-21 19:29 ` [RFC PATCH v3 07/10] ceph: add infrastructure for waiting for async create to complete Jeff Layton
2020-01-21 19:29 ` [RFC PATCH v3 08/10] ceph: add new MDS req field to hold delegated inode number Jeff Layton
2020-01-21 19:29 ` [RFC PATCH v3 09/10] ceph: cache layout in parent dir on first sync create Jeff Layton
2020-01-21 19:29 ` [RFC PATCH v3 10/10] ceph: attempt to do async create when possible Jeff Layton
2020-01-22  7:29 ` [RFC PATCH v3 00/10] ceph: asynchronous file create support Yan, Zheng
2020-01-22 17:04 ` Yan, Zheng
2020-01-24 17:19   ` Jeff Layton
2020-01-25  1:58     ` Yan, Zheng
2020-01-25  1:58     ` Yan, Zheng
