* [PATCH v4 0/9] ceph: add support for asynchronous directory operations
@ 2020-02-12 17:27 Jeff Layton
  2020-02-12 17:27 ` [PATCH v4 1/9] ceph: add flag to designate that a request is asynchronous Jeff Layton
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Jeff Layton @ 2020-02-12 17:27 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

I've dropped the async unlink patch from the testing branch and am
resubmitting it here along with the rest of the create patches.

Zheng pointed out that the DIR_* caps should be cleared when the session
is reconnected. The underlying submission code needed changes to handle
that, so it got a bit of rework (along with the create code).

Since v3:
- rework async request submission to never queue the request when the
  session isn't open
- clean out DIR_* caps, layouts and delegated inodes when session goes down
- better ordering for dependent requests
- new mount options (wsync/nowsync) instead of module option
- more comprehensive error handling

Jeff Layton (9):
  ceph: add flag to designate that a request is asynchronous
  ceph: perform asynchronous unlink if we have sufficient caps
  ceph: make ceph_fill_inode non-static
  ceph: make __take_cap_refs non-static
  ceph: decode interval_sets for delegated inos
  ceph: add infrastructure for waiting for async create to complete
  ceph: add new MDS req field to hold delegated inode number
  ceph: cache layout in parent dir on first sync create
  ceph: attempt to do async create when possible

 fs/ceph/caps.c               |  73 +++++++---
 fs/ceph/dir.c                | 101 +++++++++++++-
 fs/ceph/file.c               | 253 +++++++++++++++++++++++++++++++++--
 fs/ceph/inode.c              |  58 ++++----
 fs/ceph/mds_client.c         | 156 +++++++++++++++++++--
 fs/ceph/mds_client.h         |  17 ++-
 fs/ceph/super.c              |  20 +++
 fs/ceph/super.h              |  21 ++-
 include/linux/ceph/ceph_fs.h |  17 ++-
 9 files changed, 637 insertions(+), 79 deletions(-)

-- 
2.24.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v4 1/9] ceph: add flag to designate that a request is asynchronous
  2020-02-12 17:27 [PATCH v4 0/9] ceph: add support for asynchronous directory operations Jeff Layton
@ 2020-02-12 17:27 ` Jeff Layton
  2020-02-13  9:29   ` Yan, Zheng
  2020-02-12 17:27 ` [PATCH v4 2/9] ceph: perform asynchronous unlink if we have sufficient caps Jeff Layton
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Jeff Layton @ 2020-02-12 17:27 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

...and ensure that such requests are never queued. The MDS needs to
know when a request is asynchronous, so add a flag and the supporting
infrastructure for that.

Also, delegated inode numbers and directory caps are associated with the
session, so ensure that async requests are always transmitted on the
first attempt and are never queued to wait for session reestablishment.

If it looks like the request will need to be queued, have it return
-EJUKEBOX so the caller can retry with a synchronous request.
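
The fallback described above amounts to a simple retry pattern. Here is
a minimal sketch in plain C; the function names are illustrative, and
only the EJUKEBOX value (528) matches the kernel's include/linux/errno.h:

```c
/* EJUKEBOX is 528 in the kernel's include/linux/errno.h; defined here
 * so the sketch is self-contained. */
#define EJUKEBOX 528

/* Simulated submission: async requests are never queued, so they fail
 * with -EJUKEBOX when the session is not open (as in __do_request()). */
static int submit_request(int session_open, int async)
{
	if (async && !session_open)
		return -EJUKEBOX;
	return 0;	/* sent (or queued, for sync requests) */
}

/* Caller-side pattern: try async first, fall back to sync on -EJUKEBOX. */
static int do_op(int session_open, int want_async)
{
	int err = submit_request(session_open, want_async);

	if (err == -EJUKEBOX)
		err = submit_request(session_open, 0);
	return err;
}
```

A real caller also has to drop the request it built for the async
attempt before retrying synchronously, which this sketch omits.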

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c              |  1 +
 fs/ceph/mds_client.c         | 11 +++++++++++
 fs/ceph/mds_client.h         |  1 +
 include/linux/ceph/ceph_fs.h |  5 +++--
 4 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 094b8fc37787..9869ec101e88 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1311,6 +1311,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 		err = fill_inode(in, req->r_locked_page, &rinfo->targeti, NULL,
 				session,
 				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
+				 !test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags) &&
 				 rinfo->head->result == 0) ?  req->r_fmode : -1,
 				&req->r_caps_reservation);
 		if (err < 0) {
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 2980e57ca7b9..9f2aeb6908b2 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2527,6 +2527,8 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
 	rhead->oldest_client_tid = cpu_to_le64(__get_oldest_tid(mdsc));
 	if (test_bit(CEPH_MDS_R_GOT_UNSAFE, &req->r_req_flags))
 		flags |= CEPH_MDS_FLAG_REPLAY;
+	if (test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags))
+		flags |= CEPH_MDS_FLAG_ASYNC;
 	if (req->r_parent)
 		flags |= CEPH_MDS_FLAG_WANT_DENTRY;
 	rhead->flags = cpu_to_le32(flags);
@@ -2634,6 +2636,15 @@ static void __do_request(struct ceph_mds_client *mdsc,
 			err = -EACCES;
 			goto out_session;
 		}
+		/*
+		 * We cannot queue async requests since the caps and delegated
+		 * inodes are bound to the session. Just return -EJUKEBOX and
+		 * let the caller retry a sync request in that case.
+		 */
+		if (test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags)) {
+			err = -EJUKEBOX;
+			goto out_session;
+		}
 		if (session->s_state == CEPH_MDS_SESSION_NEW ||
 		    session->s_state == CEPH_MDS_SESSION_CLOSING) {
 			__open_session(mdsc, session);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 27a7446e10d3..0327974d0763 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -255,6 +255,7 @@ struct ceph_mds_request {
 #define CEPH_MDS_R_GOT_RESULT		(5) /* got a result */
 #define CEPH_MDS_R_DID_PREPOPULATE	(6) /* prepopulated readdir */
 #define CEPH_MDS_R_PARENT_LOCKED	(7) /* is r_parent->i_rwsem wlocked? */
+#define CEPH_MDS_R_ASYNC		(8) /* async request */
 	unsigned long	r_req_flags;
 
 	struct mutex r_fill_mutex;
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index cb21c5cf12c3..9f747a1b8788 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -444,8 +444,9 @@ union ceph_mds_request_args {
 	} __attribute__ ((packed)) lookupino;
 } __attribute__ ((packed));
 
-#define CEPH_MDS_FLAG_REPLAY        1  /* this is a replayed op */
-#define CEPH_MDS_FLAG_WANT_DENTRY   2  /* want dentry in reply */
+#define CEPH_MDS_FLAG_REPLAY		1 /* this is a replayed op */
+#define CEPH_MDS_FLAG_WANT_DENTRY	2 /* want dentry in reply */
+#define CEPH_MDS_FLAG_ASYNC		4 /* request is asynchronous */
 
 struct ceph_mds_request_head {
 	__le64 oldest_client_tid;
-- 
2.24.1


* [PATCH v4 2/9] ceph: perform asynchronous unlink if we have sufficient caps
  2020-02-12 17:27 [PATCH v4 0/9] ceph: add support for asynchronous directory operations Jeff Layton
  2020-02-12 17:27 ` [PATCH v4 1/9] ceph: add flag to designate that a request is asynchronous Jeff Layton
@ 2020-02-12 17:27 ` Jeff Layton
  2020-02-13 12:06   ` Yan, Zheng
  2020-02-12 17:27 ` [PATCH v4 3/9] ceph: make ceph_fill_inode non-static Jeff Layton
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Jeff Layton @ 2020-02-12 17:27 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

The MDS is getting a new lock-caching facility that will allow it
to cache the necessary locks to allow asynchronous directory operations.
Since the CEPH_CAP_FILE_* caps are currently unused on directories,
we can repurpose those bits for this purpose.

When performing an unlink, if we have Fx on the parent directory and
CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being removed
is the primary link, then we can fire off an unlink request immediately
and don't need to wait on the reply before returning.

In that situation, just fix up the dcache and link count and return
immediately after issuing the call to the MDS. This does mean that we
need to hold an extra reference to the inode being unlinked, and extra
references to the caps to avoid races. Those references are put and
error handling is done in the r_callback routine.

If the operation ends up failing, set a writeback error on the directory
inode and on the unlinked inode itself; the error can be fetched later
by an fsync on the dir.

The behavior of dir caps is slightly different from caps on normal
files. Because these are just considered an optimization, if the
session is reconnected, we will not automatically reclaim them. They
are instead considered lost until we do another synchronous op in the
parent directory.

Async dirops are enabled via the "nowsync" mount option, which is
patterned after the xfs "wsync" mount option. For now, the default
is "wsync", but eventually we may flip that.
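
The cap check that gates an async unlink is just a bitmask test. A
hedged sketch follows; the bit values here are made up, and the real
ones come from the CEPH_CAP_* definitions in ceph_fs.h (where
DIR_UNLINK aliases FILE_RD):

```c
/* Illustrative bit values only; the kernel's come from ceph_fs.h. */
#define CAP_FILE_EXCL	0x2	/* Fx */
#define CAP_FILE_RD	0x8	/* repurposed as DIR_UNLINK on directories */
#define CAP_DIR_UNLINK	CAP_FILE_RD

/* Mirrors get_caps_for_async_unlink(): we need Fx on the parent dir
 * plus the DIR_UNLINK bit, and the dentry must be the primary link. */
static int can_async_unlink(unsigned int held, int primary_link)
{
	unsigned int want = CAP_FILE_EXCL | CAP_DIR_UNLINK;

	return (held & want) == want && primary_link;
}
```

The real function must also take references on the caps it finds (and
drop them again if the primary-link check fails), which this sketch
leaves out.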

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/caps.c               | 35 +++++++++----
 fs/ceph/dir.c                | 99 ++++++++++++++++++++++++++++++++++--
 fs/ceph/inode.c              |  8 ++-
 fs/ceph/mds_client.c         |  8 ++-
 fs/ceph/super.c              | 20 ++++++++
 fs/ceph/super.h              |  6 ++-
 include/linux/ceph/ceph_fs.h |  9 ++++
 7 files changed, 166 insertions(+), 19 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d05717397c2a..7fc87b693ba4 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -992,7 +992,11 @@ int __ceph_caps_file_wanted(struct ceph_inode_info *ci)
 int __ceph_caps_wanted(struct ceph_inode_info *ci)
 {
 	int w = __ceph_caps_file_wanted(ci) | __ceph_caps_used(ci);
-	if (!S_ISDIR(ci->vfs_inode.i_mode)) {
+	if (S_ISDIR(ci->vfs_inode.i_mode)) {
+		/* we want EXCL if holding caps of dir ops */
+		if (w & CEPH_CAP_ANY_DIR_OPS)
+			w |= CEPH_CAP_FILE_EXCL;
+	} else {
 		/* we want EXCL if dirty data */
 		if (w & CEPH_CAP_FILE_BUFFER)
 			w |= CEPH_CAP_FILE_EXCL;
@@ -1883,10 +1887,13 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
 			 * revoking the shared cap on every create/unlink
 			 * operation.
 			 */
-			if (IS_RDONLY(inode))
+			if (IS_RDONLY(inode)) {
 				want = CEPH_CAP_ANY_SHARED;
-			else
-				want = CEPH_CAP_ANY_SHARED | CEPH_CAP_FILE_EXCL;
+			} else {
+				want = CEPH_CAP_ANY_SHARED |
+				       CEPH_CAP_FILE_EXCL |
+				       CEPH_CAP_ANY_DIR_OPS;
+			}
 			retain |= want;
 		} else {
 
@@ -2649,7 +2656,10 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
 				}
 				snap_rwsem_locked = true;
 			}
-			*got = need | (have & want);
+			if ((have & want) == want)
+				*got = need | want;
+			else
+				*got = need;
 			if (S_ISREG(inode->i_mode) &&
 			    (need & CEPH_CAP_FILE_RD) &&
 			    !(*got & CEPH_CAP_FILE_CACHE))
@@ -2739,13 +2749,16 @@ int ceph_try_get_caps(struct inode *inode, int need, int want,
 	int ret;
 
 	BUG_ON(need & ~CEPH_CAP_FILE_RD);
-	BUG_ON(want & ~(CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO|CEPH_CAP_FILE_SHARED));
-	ret = ceph_pool_perm_check(inode, need);
-	if (ret < 0)
-		return ret;
+	if (need) {
+		ret = ceph_pool_perm_check(inode, need);
+		if (ret < 0)
+			return ret;
+	}
 
-	ret = try_get_cap_refs(inode, need, want, 0,
-			       (nonblock ? NON_BLOCKING : 0), got);
+	BUG_ON(want & ~(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO |
+			CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
+			CEPH_CAP_ANY_DIR_OPS));
+	ret = try_get_cap_refs(inode, need, want, 0, nonblock, got);
 	return ret == -EAGAIN ? 0 : ret;
 }
 
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index d0cd0aba5843..46314ccf48c5 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1036,6 +1036,69 @@ static int ceph_link(struct dentry *old_dentry, struct inode *dir,
 	return err;
 }
 
+static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
+				 struct ceph_mds_request *req)
+{
+	int result = req->r_err ? req->r_err :
+			le32_to_cpu(req->r_reply_info.head->result);
+
+	/* If op failed, mark everyone involved for errors */
+	if (result) {
+		int pathlen;
+		u64 base;
+		char *path = ceph_mdsc_build_path(req->r_dentry, &pathlen,
+						  &base, 0);
+
+		/* mark error on parent + clear complete */
+		mapping_set_error(req->r_parent->i_mapping, result);
+		ceph_dir_clear_complete(req->r_parent);
+
+		/* drop the dentry -- we don't know its status */
+		if (!d_unhashed(req->r_dentry))
+			d_drop(req->r_dentry);
+
+		/* mark inode itself for an error (since metadata is bogus) */
+		mapping_set_error(req->r_old_inode->i_mapping, result);
+
+		pr_warn("ceph: async unlink failure path=(%llx)%s result=%d!\n",
+			base, IS_ERR(path) ? "<<bad>>" : path, result);
+		ceph_mdsc_free_path(path, pathlen);
+	}
+
+	ceph_put_cap_refs(ceph_inode(req->r_parent),
+			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_UNLINK);
+	iput(req->r_old_inode);
+}
+
+static bool get_caps_for_async_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct ceph_inode_info *ci = ceph_inode(dir);
+	struct ceph_dentry_info *di;
+	int ret, want, got = 0;
+
+	want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_UNLINK;
+	ret = ceph_try_get_caps(dir, 0, want, true, &got);
+	dout("FxDu on %p ret=%d got=%s\n", dir, ret, ceph_cap_string(got));
+	if (ret != 1 || got != want)
+		return false;
+
+        spin_lock(&dentry->d_lock);
+        di = ceph_dentry(dentry);
+	/* - We are holding CEPH_CAP_FILE_EXCL, which implies
+	 * CEPH_CAP_FILE_SHARED.
+	 * - Only support async unlink for primary linkage */
+	if (atomic_read(&ci->i_shared_gen) != di->lease_shared_gen ||
+	    !(di->flags & CEPH_DENTRY_PRIMARY_LINK))
+		ret = 0;
+        spin_unlock(&dentry->d_lock);
+
+	if (!ret) {
+		ceph_put_cap_refs(ci, got);
+		return false;
+	}
+	return true;
+}
+
 /*
  * rmdir and unlink are differ only by the metadata op code
  */
@@ -1045,6 +1108,7 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
 	struct ceph_mds_client *mdsc = fsc->mdsc;
 	struct inode *inode = d_inode(dentry);
 	struct ceph_mds_request *req;
+	bool try_async = ceph_test_mount_opt(fsc, ASYNC_DIROPS);
 	int err = -EROFS;
 	int op;
 
@@ -1059,6 +1123,7 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
 			CEPH_MDS_OP_RMDIR : CEPH_MDS_OP_UNLINK;
 	} else
 		goto out;
+retry:
 	req = ceph_mdsc_create_request(mdsc, op, USE_AUTH_MDS);
 	if (IS_ERR(req)) {
 		err = PTR_ERR(req);
@@ -1067,13 +1132,38 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
 	req->r_dentry = dget(dentry);
 	req->r_num_caps = 2;
 	req->r_parent = dir;
-	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
 	req->r_dentry_drop = CEPH_CAP_FILE_SHARED;
 	req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
 	req->r_inode_drop = ceph_drop_caps_for_unlink(inode);
-	err = ceph_mdsc_do_request(mdsc, dir, req);
-	if (!err && !req->r_reply_info.head->is_dentry)
-		d_delete(dentry);
+
+	if (try_async && op == CEPH_MDS_OP_UNLINK &&
+	    get_caps_for_async_unlink(dir, dentry)) {
+		dout("ceph: Async unlink on %lu/%.*s", dir->i_ino,
+		     dentry->d_name.len, dentry->d_name.name);
+		set_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
+		req->r_callback = ceph_async_unlink_cb;
+		req->r_old_inode = d_inode(dentry);
+		ihold(req->r_old_inode);
+		err = ceph_mdsc_submit_request(mdsc, dir, req);
+		if (!err) {
+			/*
+			 * We have enough caps, so we assume that the unlink
+			 * will succeed. Fix up the target inode and dcache.
+			 */
+			drop_nlink(inode);
+			d_delete(dentry);
+		} else if (err == -EJUKEBOX) {
+			try_async = false;
+			ceph_mdsc_put_request(req);
+			goto retry;
+		}
+	} else {
+		set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
+		err = ceph_mdsc_do_request(mdsc, dir, req);
+		if (!err && !req->r_reply_info.head->is_dentry)
+			d_delete(dentry);
+	}
+
 	ceph_mdsc_put_request(req);
 out:
 	return err;
@@ -1411,6 +1501,7 @@ void ceph_invalidate_dentry_lease(struct dentry *dentry)
 	spin_lock(&dentry->d_lock);
 	di->time = jiffies;
 	di->lease_shared_gen = 0;
+	di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
 	__dentry_lease_unlist(di);
 	spin_unlock(&dentry->d_lock);
 }
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 9869ec101e88..7478bd0283c1 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1051,6 +1051,7 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
 				  struct ceph_mds_session **old_lease_session)
 {
 	struct ceph_dentry_info *di = ceph_dentry(dentry);
+	unsigned mask = le16_to_cpu(lease->mask);
 	long unsigned duration = le32_to_cpu(lease->duration_ms);
 	long unsigned ttl = from_time + (duration * HZ) / 1000;
 	long unsigned half_ttl = from_time + (duration * HZ / 2) / 1000;
@@ -1062,8 +1063,13 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
 	if (ceph_snap(dir) != CEPH_NOSNAP)
 		return;
 
+	if (mask & CEPH_LEASE_PRIMARY_LINK)
+		di->flags |= CEPH_DENTRY_PRIMARY_LINK;
+	else
+		di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
+
 	di->lease_shared_gen = atomic_read(&ceph_inode(dir)->i_shared_gen);
-	if (duration == 0) {
+	if (!(mask & CEPH_LEASE_VALID)) {
 		__ceph_dentry_dir_lease_touch(di);
 		return;
 	}
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 9f2aeb6908b2..f0ea32f4cdb9 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3370,7 +3370,7 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state)
 /*
  * Encode information about a cap for a reconnect with the MDS.
  */
-static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
+static int reconnect_caps_cb(struct inode *inode, struct ceph_cap *cap,
 			  void *arg)
 {
 	union {
@@ -3393,6 +3393,10 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
 	cap->mseq = 0;       /* and migrate_seq */
 	cap->cap_gen = cap->session->s_cap_gen;
 
+	/* These are lost when the session goes away */
+	if (S_ISDIR(inode->i_mode))
+		cap->issued &= ~(CEPH_CAP_DIR_CREATE|CEPH_CAP_DIR_UNLINK);
+
 	if (recon_state->msg_version >= 2) {
 		rec.v2.cap_id = cpu_to_le64(cap->cap_id);
 		rec.v2.wanted = cpu_to_le32(__ceph_caps_wanted(ci));
@@ -3689,7 +3693,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
 		recon_state.msg_version = 2;
 	}
 	/* traverse this session's caps */
-	err = ceph_iterate_session_caps(session, encode_caps_cb, &recon_state);
+	err = ceph_iterate_session_caps(session, reconnect_caps_cb, &recon_state);
 
 	spin_lock(&session->s_cap_lock);
 	session->s_cap_reconnect = 0;
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index c7f150686a53..58d64805c9e3 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -155,6 +155,7 @@ enum {
 	Opt_acl,
 	Opt_quotadf,
 	Opt_copyfrom,
+	Opt_wsync,
 };
 
 enum ceph_recover_session_mode {
@@ -194,6 +195,7 @@ static const struct fs_parameter_spec ceph_mount_parameters[] = {
 	fsparam_string	("snapdirname",			Opt_snapdirname),
 	fsparam_string	("source",			Opt_source),
 	fsparam_u32	("wsize",			Opt_wsize),
+	fsparam_flag_no	("wsync",			Opt_wsync),
 	{}
 };
 
@@ -444,6 +446,12 @@ static int ceph_parse_mount_param(struct fs_context *fc,
 			fc->sb_flags &= ~SB_POSIXACL;
 		}
 		break;
+	case Opt_wsync:
+		if (!result.negated)
+			fsopt->flags &= ~CEPH_MOUNT_OPT_ASYNC_DIROPS;
+		else
+			fsopt->flags |= CEPH_MOUNT_OPT_ASYNC_DIROPS;
+		break;
 	default:
 		BUG();
 	}
@@ -567,6 +575,9 @@ static int ceph_show_options(struct seq_file *m, struct dentry *root)
 	if (fsopt->flags & CEPH_MOUNT_OPT_CLEANRECOVER)
 		seq_show_option(m, "recover_session", "clean");
 
+	if (fsopt->flags & CEPH_MOUNT_OPT_ASYNC_DIROPS)
+		seq_puts(m, ",nowsync");
+
 	if (fsopt->wsize != CEPH_MAX_WRITE_SIZE)
 		seq_printf(m, ",wsize=%u", fsopt->wsize);
 	if (fsopt->rsize != CEPH_MAX_READ_SIZE)
@@ -1107,6 +1118,15 @@ static void ceph_free_fc(struct fs_context *fc)
 
 static int ceph_reconfigure_fc(struct fs_context *fc)
 {
+	struct ceph_parse_opts_ctx *pctx = fc->fs_private;
+	struct ceph_mount_options *fsopt = pctx->opts;
+	struct ceph_fs_client *fsc = ceph_sb_to_client(fc->root->d_sb);
+
+	if (fsopt->flags & CEPH_MOUNT_OPT_ASYNC_DIROPS)
+		ceph_set_mount_opt(fsc, ASYNC_DIROPS);
+	else
+		ceph_clear_mount_opt(fsc, ASYNC_DIROPS);
+
 	sync_filesystem(fc->root->d_sb);
 	return 0;
 }
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 37dc1ac8f6c3..540393ba861b 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -43,13 +43,16 @@
 #define CEPH_MOUNT_OPT_MOUNTWAIT       (1<<12) /* mount waits if no mds is up */
 #define CEPH_MOUNT_OPT_NOQUOTADF       (1<<13) /* no root dir quota in statfs */
 #define CEPH_MOUNT_OPT_NOCOPYFROM      (1<<14) /* don't use RADOS 'copy-from' op */
+#define CEPH_MOUNT_OPT_ASYNC_DIROPS    (1<<15) /* allow async directory ops */
 
 #define CEPH_MOUNT_OPT_DEFAULT			\
 	(CEPH_MOUNT_OPT_DCACHE |		\
 	 CEPH_MOUNT_OPT_NOCOPYFROM)
 
 #define ceph_set_mount_opt(fsc, opt) \
-	(fsc)->mount_options->flags |= CEPH_MOUNT_OPT_##opt;
+	(fsc)->mount_options->flags |= CEPH_MOUNT_OPT_##opt
+#define ceph_clear_mount_opt(fsc, opt) \
+	(fsc)->mount_options->flags &= ~CEPH_MOUNT_OPT_##opt
 #define ceph_test_mount_opt(fsc, opt) \
 	(!!((fsc)->mount_options->flags & CEPH_MOUNT_OPT_##opt))
 
@@ -284,6 +287,7 @@ struct ceph_dentry_info {
 #define CEPH_DENTRY_REFERENCED		1
 #define CEPH_DENTRY_LEASE_LIST		2
 #define CEPH_DENTRY_SHRINK_LIST		4
+#define CEPH_DENTRY_PRIMARY_LINK	8
 
 struct ceph_inode_xattrs_info {
 	/*
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index 9f747a1b8788..91d09cf37649 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -531,6 +531,9 @@ struct ceph_mds_reply_lease {
 	__le32 seq;
 } __attribute__ ((packed));
 
+#define CEPH_LEASE_VALID        (1 | 2) /* old and new bit values */
+#define CEPH_LEASE_PRIMARY_LINK 4       /* primary linkage */
+
 struct ceph_mds_reply_dirfrag {
 	__le32 frag;            /* fragment */
 	__le32 auth;            /* auth mds, if this is a delegation point */
@@ -660,6 +663,12 @@ int ceph_flags_to_mode(int flags);
 #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
 			CEPH_LOCK_IXATTR)
 
+/* cap masks async dir operations */
+#define CEPH_CAP_DIR_CREATE	CEPH_CAP_FILE_CACHE
+#define CEPH_CAP_DIR_UNLINK	CEPH_CAP_FILE_RD
+#define CEPH_CAP_ANY_DIR_OPS	(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_RD | \
+				 CEPH_CAP_FILE_WREXTEND | CEPH_CAP_FILE_LAZYIO)
+
 int ceph_caps_for_mode(int mode);
 
 enum {
-- 
2.24.1


* [PATCH v4 3/9] ceph: make ceph_fill_inode non-static
  2020-02-12 17:27 [PATCH v4 0/9] ceph: add support for asynchronous directory operations Jeff Layton
  2020-02-12 17:27 ` [PATCH v4 1/9] ceph: add flag to designate that a request is asynchronous Jeff Layton
  2020-02-12 17:27 ` [PATCH v4 2/9] ceph: perform asynchronous unlink if we have sufficient caps Jeff Layton
@ 2020-02-12 17:27 ` Jeff Layton
  2020-02-12 17:27 ` [PATCH v4 4/9] ceph: make __take_cap_refs non-static Jeff Layton
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2020-02-12 17:27 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c | 47 ++++++++++++++++++++++++-----------------------
 fs/ceph/super.h |  8 ++++++++
 2 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 7478bd0283c1..4056c7968b86 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -728,11 +728,11 @@ void ceph_fill_file_time(struct inode *inode, int issued,
  * Populate an inode based on info from mds.  May be called on new or
  * existing inodes.
  */
-static int fill_inode(struct inode *inode, struct page *locked_page,
-		      struct ceph_mds_reply_info_in *iinfo,
-		      struct ceph_mds_reply_dirfrag *dirinfo,
-		      struct ceph_mds_session *session, int cap_fmode,
-		      struct ceph_cap_reservation *caps_reservation)
+int ceph_fill_inode(struct inode *inode, struct page *locked_page,
+		    struct ceph_mds_reply_info_in *iinfo,
+		    struct ceph_mds_reply_dirfrag *dirinfo,
+		    struct ceph_mds_session *session, int cap_fmode,
+		    struct ceph_cap_reservation *caps_reservation)
 {
 	struct ceph_mds_client *mdsc = ceph_inode_to_client(inode)->mdsc;
 	struct ceph_mds_reply_inode *info = iinfo->in;
@@ -749,7 +749,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	bool new_version = false;
 	bool fill_inline = false;
 
-	dout("fill_inode %p ino %llx.%llx v %llu had %llu\n",
+	dout("%s %p ino %llx.%llx v %llu had %llu\n", __func__,
 	     inode, ceph_vinop(inode), le64_to_cpu(info->version),
 	     ci->i_version);
 
@@ -770,7 +770,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	if (iinfo->xattr_len > 4) {
 		xattr_blob = ceph_buffer_new(iinfo->xattr_len, GFP_NOFS);
 		if (!xattr_blob)
-			pr_err("fill_inode ENOMEM xattr blob %d bytes\n",
+			pr_err("%s ENOMEM xattr blob %d bytes\n", __func__,
 			       iinfo->xattr_len);
 	}
 
@@ -933,8 +933,9 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 			spin_unlock(&ci->i_ceph_lock);
 
 			if (symlen != i_size_read(inode)) {
-				pr_err("fill_inode %llx.%llx BAD symlink "
-					"size %lld\n", ceph_vinop(inode),
+				pr_err("%s %llx.%llx BAD symlink "
+					"size %lld\n", __func__,
+					ceph_vinop(inode),
 					i_size_read(inode));
 				i_size_write(inode, symlen);
 				inode->i_blocks = calc_inode_blocks(symlen);
@@ -958,7 +959,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 		inode->i_fop = &ceph_dir_fops;
 		break;
 	default:
-		pr_err("fill_inode %llx.%llx BAD mode 0%o\n",
+		pr_err("%s %llx.%llx BAD mode 0%o\n", __func__,
 		       ceph_vinop(inode), inode->i_mode);
 	}
 
@@ -1246,10 +1247,9 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 		struct inode *dir = req->r_parent;
 
 		if (dir) {
-			err = fill_inode(dir, NULL,
-					 &rinfo->diri, rinfo->dirfrag,
-					 session, -1,
-					 &req->r_caps_reservation);
+			err = ceph_fill_inode(dir, NULL, &rinfo->diri,
+					      rinfo->dirfrag, session, -1,
+					      &req->r_caps_reservation);
 			if (err < 0)
 				goto done;
 		} else {
@@ -1314,14 +1314,14 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 			goto done;
 		}
 
-		err = fill_inode(in, req->r_locked_page, &rinfo->targeti, NULL,
-				session,
+		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
+				NULL, session,
 				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
 				 !test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags) &&
 				 rinfo->head->result == 0) ?  req->r_fmode : -1,
 				&req->r_caps_reservation);
 		if (err < 0) {
-			pr_err("fill_inode badness %p %llx.%llx\n",
+			pr_err("ceph_fill_inode badness %p %llx.%llx\n",
 				in, ceph_vinop(in));
 			if (in->i_state & I_NEW)
 				discard_new_inode(in);
@@ -1508,10 +1508,11 @@ static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req,
 			dout("new_inode badness got %d\n", err);
 			continue;
 		}
-		rc = fill_inode(in, NULL, &rde->inode, NULL, session,
-				-1, &req->r_caps_reservation);
+		rc = ceph_fill_inode(in, NULL, &rde->inode, NULL, session,
+				     -1, &req->r_caps_reservation);
 		if (rc < 0) {
-			pr_err("fill_inode badness on %p got %d\n", in, rc);
+			pr_err("ceph_fill_inode badness on %p got %d\n",
+			       in, rc);
 			err = rc;
 			if (in->i_state & I_NEW) {
 				ihold(in);
@@ -1715,10 +1716,10 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 			}
 		}
 
-		ret = fill_inode(in, NULL, &rde->inode, NULL, session,
-				 -1, &req->r_caps_reservation);
+		ret = ceph_fill_inode(in, NULL, &rde->inode, NULL, session,
+				      -1, &req->r_caps_reservation);
 		if (ret < 0) {
-			pr_err("fill_inode badness on %p\n", in);
+			pr_err("ceph_fill_inode badness on %p\n", in);
 			if (d_really_is_negative(dn)) {
 				/* avoid calling iput_final() in mds
 				 * dispatch threads */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 540393ba861b..3192e506ad5e 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -893,6 +893,9 @@ static inline bool __ceph_have_pending_cap_snap(struct ceph_inode_info *ci)
 }
 
 /* inode.c */
+struct ceph_mds_reply_info_in;
+struct ceph_mds_reply_dirfrag;
+
 extern const struct inode_operations ceph_file_iops;
 
 extern struct inode *ceph_alloc_inode(struct super_block *sb);
@@ -908,6 +911,11 @@ extern void ceph_fill_file_time(struct inode *inode, int issued,
 				u64 time_warp_seq, struct timespec64 *ctime,
 				struct timespec64 *mtime,
 				struct timespec64 *atime);
+extern int ceph_fill_inode(struct inode *inode, struct page *locked_page,
+		    struct ceph_mds_reply_info_in *iinfo,
+		    struct ceph_mds_reply_dirfrag *dirinfo,
+		    struct ceph_mds_session *session, int cap_fmode,
+		    struct ceph_cap_reservation *caps_reservation);
 extern int ceph_fill_trace(struct super_block *sb,
 			   struct ceph_mds_request *req);
 extern int ceph_readdir_prepopulate(struct ceph_mds_request *req,
-- 
2.24.1


* [PATCH v4 4/9] ceph: make __take_cap_refs non-static
  2020-02-12 17:27 [PATCH v4 0/9] ceph: add support for asynchronous directory operations Jeff Layton
                   ` (2 preceding siblings ...)
  2020-02-12 17:27 ` [PATCH v4 3/9] ceph: make ceph_fill_inode non-static Jeff Layton
@ 2020-02-12 17:27 ` Jeff Layton
  2020-02-12 17:27 ` [PATCH v4 5/9] ceph: decode interval_sets for delegated inos Jeff Layton
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2020-02-12 17:27 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

Rename it to ceph_take_cap_refs and make it available to other files.
Also replace a comment with a lockdep assertion.
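
A toy illustration of the comment-to-assertion conversion, with a bool
standing in for lockdep (the kernel itself calls
lockdep_assert_held(&ci->i_ceph_lock)):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy inode: a flag stands in for i_ceph_lock being held. */
struct toy_inode {
	bool locked;
	int pin_ref;
};

/* Before this patch the locking rule lived in a comment ("Protected by
 * i_ceph_lock"); asserting it instead turns a silent misuse into an
 * immediate, debuggable failure. */
static void take_cap_refs(struct toy_inode *ci, int got)
{
	assert(ci->locked);	/* kernel: lockdep_assert_held() */
	ci->pin_ref += got;
}

static int locked_take(struct toy_inode *ci, int got)
{
	ci->locked = true;	/* spin_lock(&ci->i_ceph_lock) */
	take_cap_refs(ci, got);
	ci->locked = false;	/* spin_unlock(&ci->i_ceph_lock) */
	return ci->pin_ref;
}
```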

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c  | 12 ++++++------
 fs/ceph/super.h |  2 ++
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 7fc87b693ba4..c983990acb75 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2512,12 +2512,12 @@ static void kick_flushing_inode_caps(struct ceph_mds_client *mdsc,
 /*
  * Take references to capabilities we hold, so that we don't release
  * them to the MDS prematurely.
- *
- * Protected by i_ceph_lock.
  */
-static void __take_cap_refs(struct ceph_inode_info *ci, int got,
+void ceph_take_cap_refs(struct ceph_inode_info *ci, int got,
 			    bool snap_rwsem_locked)
 {
+	lockdep_assert_held(&ci->i_ceph_lock);
+
 	if (got & CEPH_CAP_PIN)
 		ci->i_pin_ref++;
 	if (got & CEPH_CAP_FILE_RD)
@@ -2538,7 +2538,7 @@ static void __take_cap_refs(struct ceph_inode_info *ci, int got,
 		if (ci->i_wb_ref == 0)
 			ihold(&ci->vfs_inode);
 		ci->i_wb_ref++;
-		dout("__take_cap_refs %p wb %d -> %d (?)\n",
+		dout("%s %p wb %d -> %d (?)\n", __func__,
 		     &ci->vfs_inode, ci->i_wb_ref-1, ci->i_wb_ref);
 	}
 }
@@ -2664,7 +2664,7 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
 			    (need & CEPH_CAP_FILE_RD) &&
 			    !(*got & CEPH_CAP_FILE_CACHE))
 				ceph_disable_fscache_readpage(ci);
-			__take_cap_refs(ci, *got, true);
+			ceph_take_cap_refs(ci, *got, true);
 			ret = 1;
 		}
 	} else {
@@ -2896,7 +2896,7 @@ int ceph_get_caps(struct file *filp, int need, int want,
 void ceph_get_cap_refs(struct ceph_inode_info *ci, int caps)
 {
 	spin_lock(&ci->i_ceph_lock);
-	__take_cap_refs(ci, caps, false);
+	ceph_take_cap_refs(ci, caps, false);
 	spin_unlock(&ci->i_ceph_lock);
 }
 
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 3192e506ad5e..ea68eef977ef 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1062,6 +1062,8 @@ extern void ceph_kick_flushing_caps(struct ceph_mds_client *mdsc,
 				    struct ceph_mds_session *session);
 extern struct ceph_cap *ceph_get_cap_for_mds(struct ceph_inode_info *ci,
 					     int mds);
+extern void ceph_take_cap_refs(struct ceph_inode_info *ci, int caps,
+				bool snap_rwsem_locked);
 extern void ceph_get_cap_refs(struct ceph_inode_info *ci, int caps);
 extern void ceph_put_cap_refs(struct ceph_inode_info *ci, int had);
 extern void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
-- 
2.24.1


* [PATCH v4 5/9] ceph: decode interval_sets for delegated inos
  2020-02-12 17:27 [PATCH v4 0/9] ceph: add support for asynchronous directory operations Jeff Layton
                   ` (3 preceding siblings ...)
  2020-02-12 17:27 ` [PATCH v4 4/9] ceph: make __take_cap_refs non-static Jeff Layton
@ 2020-02-12 17:27 ` Jeff Layton
  2020-02-12 17:27 ` [PATCH v4 6/9] ceph: add infrastructure for waiting for async create to complete Jeff Layton
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2020-02-12 17:27 UTC (permalink / raw)
  To: ceph-devel; +Cc: idridryomov, sage, zyan, pdonnell

Starting in Octopus, the MDS will hand out caps that allow the client
to do asynchronous file creates under certain conditions. As part of
that, the MDS will delegate ranges of inode numbers to the client.

Add the infrastructure to decode these ranges, and stuff them into an
xarray for later consumption by the async creation code.

Because the xarray code currently only handles unsigned long indexes,
and those are 32 bits wide on 32-bit arches, we only enable the decoding
when running on a 64-bit arch.
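
For reference, the interval_set wire format decoded here is a u32 count
of sets, each set a (start, len) pair of 64-bit inode numbers. Below is
a minimal userspace sketch of that expansion (illustrative names only,
little-endian host assumed; this is not the kernel decoder, which uses
the ceph_decode_*_safe macros and stores the inos in an xarray):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy decoder: read a u32 count of interval sets, each a
 * (u64 start, u64 len) pair, and record every inode number in each
 * range. Returns the number of inodes recorded, or -1 on a truncated
 * buffer (analogous to the kernel decoder bailing out with -EIO).
 */
static long decode_deleg_inos(const uint8_t *p, size_t len,
			      uint64_t *out, size_t max)
{
	size_t off = 0, n = 0;
	uint32_t sets;
	uint64_t start, cnt, i;

	if (len < sizeof(sets))
		return -1;
	memcpy(&sets, p, sizeof(sets));	/* assumes little-endian host */
	off = sizeof(sets);
	while (sets--) {
		if (len - off < 2 * sizeof(uint64_t))
			return -1;	/* short buffer: malformed reply */
		memcpy(&start, p + off, sizeof(start));
		memcpy(&cnt, p + off + sizeof(start), sizeof(cnt));
		off += 2 * sizeof(uint64_t);
		for (i = 0; i < cnt && n < max; i++)
			out[n++] = start + i;	/* one entry per delegated ino */
	}
	return (long)n;
}
```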

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/mds_client.c | 111 +++++++++++++++++++++++++++++++++++++++----
 fs/ceph/mds_client.h |   7 ++-
 2 files changed, 108 insertions(+), 10 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index f0ea32f4cdb9..91c5f999da7d 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -415,21 +415,110 @@ static int parse_reply_info_filelock(void **p, void *end,
 	return -EIO;
 }
 
+
+#if BITS_PER_LONG == 64
+
+#define DELEGATED_INO_AVAILABLE		xa_mk_value(1)
+
+static int ceph_parse_deleg_inos(void **p, void *end,
+				 struct ceph_mds_session *s)
+{
+	u32 sets;
+
+	ceph_decode_32_safe(p, end, sets, bad);
+	dout("got %u sets of delegated inodes\n", sets);
+	while (sets--) {
+		u64 start, len, ino;
+
+		ceph_decode_64_safe(p, end, start, bad);
+		ceph_decode_64_safe(p, end, len, bad);
+		while (len--) {
+			int err = xa_insert(&s->s_delegated_inos, ino = start++,
+					    DELEGATED_INO_AVAILABLE,
+					    GFP_KERNEL);
+			if (!err) {
+				dout("added delegated inode 0x%llx\n",
+				     ino);
+			} else if (err == -EBUSY) {
+				pr_warn("ceph: MDS delegated inode 0x%llx more than once.\n",
+					ino);
+			} else {
+				return err;
+			}
+		}
+	}
+	return 0;
+bad:
+	return -EIO;
+}
+
+u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
+{
+	unsigned long ino;
+	void *val;
+
+	xa_for_each(&s->s_delegated_inos, ino, val) {
+		val = xa_erase(&s->s_delegated_inos, ino);
+		if (val == DELEGATED_INO_AVAILABLE)
+			return ino;
+	}
+	return 0;
+}
+#else /* BITS_PER_LONG == 64 */
+/*
+ * FIXME: xarrays can't handle 64-bit indexes on a 32-bit arch. For now, just
+ * ignore delegated_inos on 32 bit arch. Maybe eventually add xarrays for top
+ * and bottom words?
+ */
+static int ceph_parse_deleg_inos(void **p, void *end,
+				 struct ceph_mds_session *s)
+{
+	u32 sets;
+
+	ceph_decode_32_safe(p, end, sets, bad);
+	if (sets)
+		ceph_decode_skip_n(p, end, sets * 2 * sizeof(__le64), bad);
+	return 0;
+bad:
+	return -EIO;
+}
+
+u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
+{
+	return 0;
+}
+#endif /* BITS_PER_LONG == 64 */
+
 /*
  * parse create results
  */
 static int parse_reply_info_create(void **p, void *end,
 				  struct ceph_mds_reply_info_parsed *info,
-				  u64 features)
+				  u64 features, struct ceph_mds_session *s)
 {
+	int ret;
+
 	if (features == (u64)-1 ||
 	    (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {
-		/* Malformed reply? */
 		if (*p == end) {
+			/* Malformed reply? */
 			info->has_create_ino = false;
-		} else {
+		} else if (test_bit(CEPHFS_FEATURE_DELEG_INO, &s->s_features)) {
+			u8 struct_v, struct_compat;
+			u32 len;
+
 			info->has_create_ino = true;
+			ceph_decode_8_safe(p, end, struct_v, bad);
+			ceph_decode_8_safe(p, end, struct_compat, bad);
+			ceph_decode_32_safe(p, end, len, bad);
 			ceph_decode_64_safe(p, end, info->ino, bad);
+			ret = ceph_parse_deleg_inos(p, end, s);
+			if (ret)
+				return ret;
+		} else {
+			/* legacy */
+			ceph_decode_64_safe(p, end, info->ino, bad);
+			info->has_create_ino = true;
 		}
 	} else {
 		if (*p != end)
@@ -448,7 +537,7 @@ static int parse_reply_info_create(void **p, void *end,
  */
 static int parse_reply_info_extra(void **p, void *end,
 				  struct ceph_mds_reply_info_parsed *info,
-				  u64 features)
+				  u64 features, struct ceph_mds_session *s)
 {
 	u32 op = le32_to_cpu(info->head->op);
 
@@ -457,7 +546,7 @@ static int parse_reply_info_extra(void **p, void *end,
 	else if (op == CEPH_MDS_OP_READDIR || op == CEPH_MDS_OP_LSSNAP)
 		return parse_reply_info_readdir(p, end, info, features);
 	else if (op == CEPH_MDS_OP_CREATE)
-		return parse_reply_info_create(p, end, info, features);
+		return parse_reply_info_create(p, end, info, features, s);
 	else
 		return -EIO;
 }
@@ -465,7 +554,7 @@ static int parse_reply_info_extra(void **p, void *end,
 /*
  * parse entire mds reply
  */
-static int parse_reply_info(struct ceph_msg *msg,
+static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
 			    struct ceph_mds_reply_info_parsed *info,
 			    u64 features)
 {
@@ -490,7 +579,7 @@ static int parse_reply_info(struct ceph_msg *msg,
 	ceph_decode_32_safe(&p, end, len, bad);
 	if (len > 0) {
 		ceph_decode_need(&p, end, len, bad);
-		err = parse_reply_info_extra(&p, p+len, info, features);
+		err = parse_reply_info_extra(&p, p+len, info, features, s);
 		if (err < 0)
 			goto out_bad;
 	}
@@ -558,6 +647,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
 	if (refcount_dec_and_test(&s->s_ref)) {
 		if (s->s_auth.authorizer)
 			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
+		xa_destroy(&s->s_delegated_inos);
 		kfree(s);
 	}
 }
@@ -645,6 +735,7 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
 	refcount_set(&s->s_ref, 1);
 	INIT_LIST_HEAD(&s->s_waiting);
 	INIT_LIST_HEAD(&s->s_unsafe);
+	xa_init(&s->s_delegated_inos);
 	s->s_num_cap_releases = 0;
 	s->s_cap_reconnect = 0;
 	s->s_cap_iterator = NULL;
@@ -2956,9 +3047,9 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
 	dout("handle_reply tid %lld result %d\n", tid, result);
 	rinfo = &req->r_reply_info;
 	if (test_bit(CEPHFS_FEATURE_REPLY_ENCODING, &session->s_features))
-		err = parse_reply_info(msg, rinfo, (u64)-1);
+		err = parse_reply_info(session, msg, rinfo, (u64)-1);
 	else
-		err = parse_reply_info(msg, rinfo, session->s_con.peer_features);
+		err = parse_reply_info(session, msg, rinfo, session->s_con.peer_features);
 	mutex_unlock(&mdsc->mutex);
 
 	mutex_lock(&session->s_mutex);
@@ -3638,6 +3729,8 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
 	if (!reply)
 		goto fail_nomsg;
 
+	xa_destroy(&session->s_delegated_inos);
+
 	mutex_lock(&session->s_mutex);
 	session->s_state = CEPH_MDS_SESSION_RECONNECTING;
 	session->s_seq = 0;
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 0327974d0763..31f68897bc87 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -23,8 +23,9 @@ enum ceph_feature_type {
 	CEPHFS_FEATURE_RECLAIM_CLIENT,
 	CEPHFS_FEATURE_LAZY_CAP_WANTED,
 	CEPHFS_FEATURE_MULTI_RECONNECT,
+	CEPHFS_FEATURE_DELEG_INO,
 
-	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_MULTI_RECONNECT,
+	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_DELEG_INO,
 };
 
 /*
@@ -37,6 +38,7 @@ enum ceph_feature_type {
 	CEPHFS_FEATURE_REPLY_ENCODING,		\
 	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
 	CEPHFS_FEATURE_MULTI_RECONNECT,		\
+	CEPHFS_FEATURE_DELEG_INO,		\
 						\
 	CEPHFS_FEATURE_MAX,			\
 }
@@ -201,6 +203,7 @@ struct ceph_mds_session {
 
 	struct list_head  s_waiting;  /* waiting requests */
 	struct list_head  s_unsafe;   /* unsafe requests */
+	struct xarray	  s_delegated_inos;
 };
 
 /*
@@ -538,4 +541,6 @@ extern void ceph_mdsc_open_export_target_sessions(struct ceph_mds_client *mdsc,
 extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
 			  struct ceph_mds_session *session,
 			  int max_caps);
+
+extern u64 ceph_get_deleg_ino(struct ceph_mds_session *session);
 #endif
-- 
2.24.1


* [PATCH v4 6/9] ceph: add infrastructure for waiting for async create to complete
  2020-02-12 17:27 [PATCH v4 0/9] ceph: add support for asynchronous directory operations Jeff Layton
                   ` (4 preceding siblings ...)
  2020-02-12 17:27 ` [PATCH v4 5/9] ceph: decode interval_sets for delegated inos Jeff Layton
@ 2020-02-12 17:27 ` Jeff Layton
  2020-02-13 12:15   ` Yan, Zheng
  2020-02-12 17:27 ` [PATCH v4 7/9] ceph: add new MDS req field to hold delegated inode number Jeff Layton
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Jeff Layton @ 2020-02-12 17:27 UTC (permalink / raw)
  To: ceph-devel; +Cc: idridryomov, sage, zyan, pdonnell

When we issue an async create, we must ensure that any later on-the-wire
requests involving it wait for the create reply.

Expand i_ceph_flags to an unsigned long, and add a new bit that MDS
requests can wait on. If the bit is set on an inode when sending caps,
don't send the cap message and just report that it has been delayed.
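
The gating behavior can be sketched in a few lines of userspace C
(struct and function names here are stand-ins, and the real kernel code
sleeps via wait_on_bit()/wake_up_bit() on i_ceph_flags rather than
polling a plain flag word):

```c
#include <assert.h>

/*
 * Illustrative userspace analogue: the flag word is an unsigned long
 * because the bit-wait API operates on unsigned long words, and the
 * async-create bit indexes into that word.
 */
#define CEPH_ASYNC_CREATE_BIT	13
#define CEPH_I_ASYNC_CREATE	(1UL << CEPH_ASYNC_CREATE_BIT)

struct fake_inode {
	unsigned long i_ceph_flags;
};

/*
 * Mirrors the check added to __send_cap() above: while an async create
 * is still in flight for the inode, don't put a cap message on the
 * wire; just report that the send has been delayed.
 */
static int send_cap(struct fake_inode *ci)
{
	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE)
		return 1;	/* delayed: create reply not seen yet */
	return 0;		/* cap message would be sent */
}
```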

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c       | 13 ++++++++++++-
 fs/ceph/dir.c        |  2 +-
 fs/ceph/mds_client.c | 18 ++++++++++++++++++
 fs/ceph/mds_client.h |  8 ++++++++
 fs/ceph/super.h      |  4 +++-
 5 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index c983990acb75..869e2102e827 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -511,7 +511,7 @@ static void __cap_delay_requeue(struct ceph_mds_client *mdsc,
 				struct ceph_inode_info *ci,
 				bool set_timeout)
 {
-	dout("__cap_delay_requeue %p flags %d at %lu\n", &ci->vfs_inode,
+	dout("__cap_delay_requeue %p flags 0x%lx at %lu\n", &ci->vfs_inode,
 	     ci->i_ceph_flags, ci->i_hold_caps_max);
 	if (!mdsc->stopping) {
 		spin_lock(&mdsc->cap_delay_lock);
@@ -1298,6 +1298,13 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 	int delayed = 0;
 	int ret;
 
+	/* Don't send anything if it's still being created. Return delayed */
+	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
+		spin_unlock(&ci->i_ceph_lock);
+		dout("%s async create in flight for %p\n", __func__, inode);
+		return 1;
+	}
+
 	held = cap->issued | cap->implemented;
 	revoking = cap->implemented & ~cap->issued;
 	retain &= ~revoking;
@@ -2257,6 +2264,10 @@ int ceph_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	if (datasync)
 		goto out;
 
+	ret = ceph_wait_on_async_create(inode);
+	if (ret)
+		goto out;
+
 	dirty = try_flush_caps(inode, &flush_tid);
 	dout("fsync dirty caps are %s\n", ceph_cap_string(dirty));
 
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 46314ccf48c5..4e695f2a9347 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -752,7 +752,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
 		struct ceph_dentry_info *di = ceph_dentry(dentry);
 
 		spin_lock(&ci->i_ceph_lock);
-		dout(" dir %p flags are %d\n", dir, ci->i_ceph_flags);
+		dout(" dir %p flags are 0x%lx\n", dir, ci->i_ceph_flags);
 		if (strncmp(dentry->d_name.name,
 			    fsc->mount_options->snapdir_name,
 			    dentry->d_name.len) &&
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 91c5f999da7d..314dd0f6f5a9 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2829,6 +2829,24 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
 		ceph_get_cap_refs(ceph_inode(req->r_old_dentry_dir),
 				  CEPH_CAP_PIN);
 
+	if (req->r_inode) {
+		err = ceph_wait_on_async_create(req->r_inode);
+		if (err) {
+			dout("%s: wait for async create returned: %d\n",
+			     __func__, err);
+			return err;
+		}
+	}
+
+	if (req->r_old_inode) {
+		err = ceph_wait_on_async_create(req->r_old_inode);
+		if (err) {
+			dout("%s: wait for async create returned: %d\n",
+			     __func__, err);
+			return err;
+		}
+	}
+
 	dout("submit_request on %p for inode %p\n", req, dir);
 	mutex_lock(&mdsc->mutex);
 	__register_request(mdsc, req, dir);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 31f68897bc87..acad9adca0af 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -543,4 +543,12 @@ extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
 			  int max_caps);
 
 extern u64 ceph_get_deleg_ino(struct ceph_mds_session *session);
+
+static inline int ceph_wait_on_async_create(struct inode *inode)
+{
+	struct ceph_inode_info *ci = ceph_inode(inode);
+
+	return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
+			   TASK_INTERRUPTIBLE);
+}
 #endif
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index ea68eef977ef..47fb6e022339 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -319,7 +319,7 @@ struct ceph_inode_info {
 	u64 i_inline_version;
 	u32 i_time_warp_seq;
 
-	unsigned i_ceph_flags;
+	unsigned long i_ceph_flags;
 	atomic64_t i_release_count;
 	atomic64_t i_ordered_count;
 	atomic64_t i_complete_seq[2];
@@ -527,6 +527,8 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 #define CEPH_I_ERROR_WRITE	(1 << 10) /* have seen write errors */
 #define CEPH_I_ERROR_FILELOCK	(1 << 11) /* have seen file lock errors */
 #define CEPH_I_ODIRECT		(1 << 12) /* inode in direct I/O mode */
+#define CEPH_ASYNC_CREATE_BIT	(13)	  /* async create in flight for this */
+#define CEPH_I_ASYNC_CREATE	(1 << CEPH_ASYNC_CREATE_BIT)
 
 /*
  * Masks of ceph inode work.
-- 
2.24.1


* [PATCH v4 7/9] ceph: add new MDS req field to hold delegated inode number
  2020-02-12 17:27 [PATCH v4 0/9] ceph: add support for asynchronous directory operations Jeff Layton
                   ` (5 preceding siblings ...)
  2020-02-12 17:27 ` [PATCH v4 6/9] ceph: add infrastructure for waiting for async create to complete Jeff Layton
@ 2020-02-12 17:27 ` Jeff Layton
  2020-02-12 17:27 ` [PATCH v4 8/9] ceph: cache layout in parent dir on first sync create Jeff Layton
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2020-02-12 17:27 UTC (permalink / raw)
  To: ceph-devel; +Cc: idridryomov, sage, zyan, pdonnell

Add new request field to hold the delegated inode number. Encode that
into the message when it's set.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/mds_client.c | 3 +--
 fs/ceph/mds_client.h | 1 +
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 314dd0f6f5a9..2321c955439b 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2464,7 +2464,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc,
 	head->op = cpu_to_le32(req->r_op);
 	head->caller_uid = cpu_to_le32(from_kuid(&init_user_ns, req->r_uid));
 	head->caller_gid = cpu_to_le32(from_kgid(&init_user_ns, req->r_gid));
-	head->ino = 0;
+	head->ino = cpu_to_le64(req->r_deleg_ino);
 	head->args = req->r_args;
 
 	ceph_encode_filepath(&p, end, ino1, path1);
@@ -2625,7 +2625,6 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
 	rhead->flags = cpu_to_le32(flags);
 	rhead->num_fwd = req->r_num_fwd;
 	rhead->num_retry = req->r_attempts - 1;
-	rhead->ino = 0;
 
 	dout(" r_parent = %p\n", req->r_parent);
 	return 0;
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index acad9adca0af..4f25fd6df3f9 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -308,6 +308,7 @@ struct ceph_mds_request {
 	int               r_num_fwd;    /* number of forward attempts */
 	int               r_resend_mds; /* mds to resend to next, if any*/
 	u32               r_sent_on_mseq; /* cap mseq request was sent at*/
+	u64		  r_deleg_ino;
 
 	struct list_head  r_wait;
 	struct completion r_completion;
-- 
2.24.1


* [PATCH v4 8/9] ceph: cache layout in parent dir on first sync create
  2020-02-12 17:27 [PATCH v4 0/9] ceph: add support for asynchronous directory operations Jeff Layton
                   ` (6 preceding siblings ...)
  2020-02-12 17:27 ` [PATCH v4 7/9] ceph: add new MDS req field to hold delegated inode number Jeff Layton
@ 2020-02-12 17:27 ` Jeff Layton
  2020-02-12 17:27 ` [PATCH v4 9/9] ceph: attempt to do async create when possible Jeff Layton
  2020-02-13 13:05 ` [PATCH v4 0/9] ceph: add support for asynchronous directory operations Yan, Zheng
  9 siblings, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2020-02-12 17:27 UTC (permalink / raw)
  To: ceph-devel; +Cc: idridryomov, sage, zyan, pdonnell

If a create is done, then typically we'll end up writing to the file
soon afterward. With an async create we don't want to wait for the
reply before starting those writes, which means we need the layout for
the new file before the MDS response has arrived.

All files created in a directory will initially inherit the same layout,
so copy off the requisite info from the first synchronous create in the
directory, and save it in a new i_cached_layout field. Zero out the
layout when we lose Dc caps in the dir.
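
The caching rule amounts to: remember the first child's layout while we
hold Dc, and forget it when Dc goes away. A userspace sketch under
those assumptions (field and function names are illustrative, not the
kernel's, and the real code takes i_ceph_lock around these updates):

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for struct ceph_file_layout. */
struct layout {
	unsigned stripe_unit;	/* zero means "no valid cached layout" */
	unsigned stripe_count;
	unsigned object_size;
};

struct dir_state {
	int has_dir_create_cap;	/* do we hold Dc on this directory? */
	struct layout cached;
};

/* Copy off a child's layout on the first synchronous create only. */
static void cache_child_layout(struct dir_state *dir,
			       const struct layout *child)
{
	if (dir->has_dir_create_cap && dir->cached.stripe_unit == 0)
		dir->cached = *child;	/* first sync create wins */
}

/* Losing Dc invalidates whatever we cached. */
static void lose_dir_create_cap(struct dir_state *dir)
{
	dir->has_dir_create_cap = 0;
	memset(&dir->cached, 0, sizeof(dir->cached));
}
```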

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c       | 13 ++++++++++---
 fs/ceph/file.c       | 22 +++++++++++++++++++++-
 fs/ceph/inode.c      |  2 ++
 fs/ceph/mds_client.c |  7 ++++++-
 fs/ceph/super.h      |  1 +
 5 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 869e2102e827..0c95a7c9c7c1 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -561,14 +561,14 @@ static void __cap_delay_cancel(struct ceph_mds_client *mdsc,
 	spin_unlock(&mdsc->cap_delay_lock);
 }
 
-/*
- * Common issue checks for add_cap, handle_cap_grant.
- */
+/* Common issue checks for add_cap, handle_cap_grant. */
 static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
 			      unsigned issued)
 {
 	unsigned had = __ceph_caps_issued(ci, NULL);
 
+	lockdep_assert_held(&ci->i_ceph_lock);
+
 	/*
 	 * Each time we receive FILE_CACHE anew, we increment
 	 * i_rdcache_gen.
@@ -593,6 +593,13 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
 			__ceph_dir_clear_complete(ci);
 		}
 	}
+
+	/* Wipe saved layout if we're losing DIR_CREATE caps */
+	if (S_ISDIR(ci->vfs_inode.i_mode) && (had & CEPH_CAP_DIR_CREATE) &&
+	    !(issued & CEPH_CAP_DIR_CREATE)) {
+		ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
+		memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
+	}
 }
 
 /*
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 7e0190b1f821..472d90ccdf44 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -430,6 +430,23 @@ int ceph_open(struct inode *inode, struct file *file)
 	return err;
 }
 
+/* Clone the layout from a synchronous create, if the dir now has Dc caps */
+static void
+cache_file_layout(struct inode *dst, struct inode *src)
+{
+	struct ceph_inode_info *cdst = ceph_inode(dst);
+	struct ceph_inode_info *csrc = ceph_inode(src);
+
+	spin_lock(&cdst->i_ceph_lock);
+	if ((__ceph_caps_issued(cdst, NULL) & CEPH_CAP_DIR_CREATE) &&
+	    !ceph_file_layout_is_valid(&cdst->i_cached_layout)) {
+		memcpy(&cdst->i_cached_layout, &csrc->i_layout,
+			sizeof(cdst->i_cached_layout));
+		rcu_assign_pointer(cdst->i_cached_layout.pool_ns,
+				   ceph_try_get_string(csrc->i_layout.pool_ns));
+	}
+	spin_unlock(&cdst->i_ceph_lock);
+}
 
 /*
  * Do a lookup + open with a single request.  If we get a non-existent
@@ -518,7 +535,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	} else {
 		dout("atomic_open finish_open on dn %p\n", dn);
 		if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
-			ceph_init_inode_acls(d_inode(dentry), &as_ctx);
+			struct inode *newino = d_inode(dentry);
+
+			cache_file_layout(dir, newino);
+			ceph_init_inode_acls(newino, &as_ctx);
 			file->f_mode |= FMODE_CREATED;
 		}
 		err = finish_open(file, dentry, ceph_open);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 4056c7968b86..73f986efb1fd 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -447,6 +447,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 	ci->i_max_files = 0;
 
 	memset(&ci->i_dir_layout, 0, sizeof(ci->i_dir_layout));
+	memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
 	RCU_INIT_POINTER(ci->i_layout.pool_ns, NULL);
 
 	ci->i_fragtree = RB_ROOT;
@@ -587,6 +588,7 @@ void ceph_evict_inode(struct inode *inode)
 		ceph_buffer_put(ci->i_xattrs.prealloc_blob);
 
 	ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
+	ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
 }
 
 static inline blkcnt_t calc_inode_blocks(u64 size)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 2321c955439b..09d5301b036c 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3502,8 +3502,13 @@ static int reconnect_caps_cb(struct inode *inode, struct ceph_cap *cap,
 	cap->cap_gen = cap->session->s_cap_gen;
 
 	/* These are lost when the session goes away */
-	if (S_ISDIR(inode->i_mode))
+	if (S_ISDIR(inode->i_mode)) {
+		if (cap->issued & CEPH_CAP_DIR_CREATE) {
+			ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
+			memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
+		}
 		cap->issued &= ~(CEPH_CAP_DIR_CREATE|CEPH_CAP_DIR_UNLINK);
+	}
 
 	if (recon_state->msg_version >= 2) {
 		rec.v2.cap_id = cpu_to_le64(cap->cap_id);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 47fb6e022339..60701a2e36b3 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -326,6 +326,7 @@ struct ceph_inode_info {
 
 	struct ceph_dir_layout i_dir_layout;
 	struct ceph_file_layout i_layout;
+	struct ceph_file_layout i_cached_layout;	// for async creates
 	char *i_symlink;
 
 	/* for dirs */
-- 
2.24.1


* [PATCH v4 9/9] ceph: attempt to do async create when possible
  2020-02-12 17:27 [PATCH v4 0/9] ceph: add support for asynchronous directory operations Jeff Layton
                   ` (7 preceding siblings ...)
  2020-02-12 17:27 ` [PATCH v4 8/9] ceph: cache layout in parent dir on first sync create Jeff Layton
@ 2020-02-12 17:27 ` Jeff Layton
  2020-02-13 12:44   ` Yan, Zheng
  2020-02-13 13:05 ` [PATCH v4 0/9] ceph: add support for asynchronous directory operations Yan, Zheng
  9 siblings, 1 reply; 22+ messages in thread
From: Jeff Layton @ 2020-02-12 17:27 UTC (permalink / raw)
  To: ceph-devel; +Cc: idridryomov, sage, zyan, pdonnell

With the Octopus release, the MDS will hand out directory create caps.

If we have Fxc caps on the directory, and complete directory information
or a known negative dentry, then we can return without waiting on the
reply, allowing the open() call to return very quickly to userland.

We use the normal ceph_fill_inode() routine to fill in the inode, so we
have to gin up some reply inode information with what we'd expect the
newly-created inode to have. The client assumes that it has a full set
of caps on the new inode, and that the MDS will revoke them when there
is conflicting access.

This functionality is gated on the wsync/nowsync mount options.
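
The conditions under which the client may go async can be distilled
into a single predicate. This userspace sketch mirrors the checks made
by try_prep_async_create() in the patch below, using stand-in boolean
fields rather than the kernel's cap and dentry structures:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Snapshot of the directory state consulted before attempting an async
 * create. Every condition must hold; otherwise the create falls back
 * to a normal synchronous round trip to the MDS.
 */
struct dir_snapshot {
	bool have_auth_cap;	/* auth cap on the dir (needed for Dc) */
	bool have_deleg_inos;	/* session holds delegated inode numbers */
	bool layout_valid;	/* layout cached from a prior sync create */
	bool have_fx_and_dc;	/* Fx + Dc caps issued on the directory */
	bool dentry_known;	/* complete dir info, or a valid lease on
				   a known negative dentry */
};

static bool can_async_create(const struct dir_snapshot *d)
{
	return d->have_auth_cap && d->have_deleg_inos &&
	       d->layout_valid && d->have_fx_and_dc && d->dentry_known;
}
```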

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/file.c               | 231 +++++++++++++++++++++++++++++++++--
 include/linux/ceph/ceph_fs.h |   3 +
 2 files changed, 224 insertions(+), 10 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 472d90ccdf44..814ff435832c 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -448,6 +448,196 @@ cache_file_layout(struct inode *dst, struct inode *src)
 	spin_unlock(&cdst->i_ceph_lock);
 }
 
+/*
+ * Try to set up an async create. We need caps, a file layout, and inode number,
+ * and either a lease on the dentry or complete dir info. If any of those
+ * criteria are not satisfied, then return false and the caller can go
+ * synchronous.
+ */
+static bool try_prep_async_create(struct inode *dir, struct dentry *dentry,
+				  struct ceph_file_layout *lo, u64 *pino)
+{
+	struct ceph_inode_info *ci = ceph_inode(dir);
+	struct ceph_dentry_info *di = ceph_dentry(dentry);
+	bool ret = false;
+	u64 ino;
+
+	spin_lock(&ci->i_ceph_lock);
+	/* No auth cap means no chance for Dc caps */
+	if (!ci->i_auth_cap)
+		goto no_async;
+
+	/* Any delegated inos? */
+	if (xa_empty(&ci->i_auth_cap->session->s_delegated_inos))
+		goto no_async;
+
+	if (!ceph_file_layout_is_valid(&ci->i_cached_layout))
+		goto no_async;
+
+	if ((__ceph_caps_issued(ci, NULL) &
+	     (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE)) !=
+	    (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE))
+		goto no_async;
+
+	if (d_in_lookup(dentry)) {
+		if (!__ceph_dir_is_complete(ci))
+			goto no_async;
+	} else if (atomic_read(&ci->i_shared_gen) !=
+		   READ_ONCE(di->lease_shared_gen)) {
+		goto no_async;
+	}
+
+	ino = ceph_get_deleg_ino(ci->i_auth_cap->session);
+	if (!ino)
+		goto no_async;
+
+	*pino = ino;
+	ceph_take_cap_refs(ci, CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE, false);
+	memcpy(lo, &ci->i_cached_layout, sizeof(*lo));
+	rcu_assign_pointer(lo->pool_ns,
+			   ceph_try_get_string(ci->i_cached_layout.pool_ns));
+	ret = true;
+no_async:
+	spin_unlock(&ci->i_ceph_lock);
+	return ret;
+}
+
+static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
+                                 struct ceph_mds_request *req)
+{
+	int result = req->r_err ? req->r_err :
+			le32_to_cpu(req->r_reply_info.head->result);
+
+	mapping_set_error(req->r_parent->i_mapping, result);
+
+	if (result) {
+		struct dentry *dentry = req->r_dentry;
+		int pathlen;
+		u64 base;
+		char *path = ceph_mdsc_build_path(req->r_dentry, &pathlen,
+						  &base, 0);
+
+		ceph_dir_clear_complete(req->r_parent);
+		if (!d_unhashed(dentry))
+			d_drop(dentry);
+
+		/* FIXME: start returning I/O errors on all accesses? */
+		pr_warn("ceph: async create failure path=(%llx)%s result=%d!\n",
+			base, IS_ERR(path) ? "<<bad>>" : path, result);
+		ceph_mdsc_free_path(path, pathlen);
+	}
+
+	if (req->r_target_inode) {
+		struct ceph_inode_info *ci = ceph_inode(req->r_target_inode);
+		u64 ino = ceph_vino(req->r_target_inode).ino;
+
+		if (req->r_deleg_ino != ino)
+			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%llx target=0x%llx\n",
+				__func__, req->r_err, req->r_deleg_ino, ino);
+		mapping_set_error(req->r_target_inode->i_mapping, result);
+
+		spin_lock(&ci->i_ceph_lock);
+		if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
+			ci->i_ceph_flags &= ~CEPH_I_ASYNC_CREATE;
+			wake_up_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT);
+		}
+		spin_unlock(&ci->i_ceph_lock);
+	} else {
+		pr_warn("%s: no req->r_target_inode for 0x%llx\n", __func__,
+			req->r_deleg_ino);
+	}
+	ceph_put_cap_refs(ceph_inode(req->r_parent),
+			  CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
+}
+
+static int ceph_finish_async_create(struct inode *dir, struct dentry *dentry,
+				    struct file *file, umode_t mode,
+				    struct ceph_mds_request *req,
+				    struct ceph_acl_sec_ctx *as_ctx,
+				    struct ceph_file_layout *lo)
+{
+	int ret;
+	char xattr_buf[4];
+	struct ceph_mds_reply_inode in = { };
+	struct ceph_mds_reply_info_in iinfo = { .in = &in };
+	struct ceph_inode_info *ci = ceph_inode(dir);
+	struct inode *inode;
+	struct timespec64 now;
+	struct ceph_vino vino = { .ino = req->r_deleg_ino,
+				  .snap = CEPH_NOSNAP };
+
+	ktime_get_real_ts64(&now);
+
+	inode = ceph_get_inode(dentry->d_sb, vino);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	iinfo.inline_version = CEPH_INLINE_NONE;
+	iinfo.change_attr = 1;
+	ceph_encode_timespec64(&iinfo.btime, &now);
+
+	iinfo.xattr_len = ARRAY_SIZE(xattr_buf);
+	iinfo.xattr_data = xattr_buf;
+	memset(iinfo.xattr_data, 0, iinfo.xattr_len);
+
+	in.ino = cpu_to_le64(vino.ino);
+	in.snapid = cpu_to_le64(CEPH_NOSNAP);
+	in.version = cpu_to_le64(1);	// ???
+	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
+	in.cap.cap_id = cpu_to_le64(1);
+	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
+	in.cap.flags = CEPH_CAP_FLAG_AUTH;
+	in.ctime = in.mtime = in.atime = iinfo.btime;
+	in.mode = cpu_to_le32((u32)mode);
+	in.truncate_seq = cpu_to_le32(1);
+	in.truncate_size = cpu_to_le64(-1ULL);
+	in.xattr_version = cpu_to_le64(1);
+	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
+	in.gid = cpu_to_le32(from_kgid(&init_user_ns, dir->i_mode & S_ISGID ?
+				dir->i_gid : current_fsgid()));
+	in.nlink = cpu_to_le32(1);
+	in.max_size = cpu_to_le64(lo->stripe_unit);
+
+	ceph_file_layout_to_legacy(lo, &in.layout);
+
+	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
+			      req->r_fmode, NULL);
+	if (ret) {
+		dout("%s failed to fill inode: %d\n", __func__, ret);
+		ceph_dir_clear_complete(dir);
+		if (!d_unhashed(dentry))
+			d_drop(dentry);
+		if (inode->i_state & I_NEW)
+			discard_new_inode(inode);
+	} else {
+		struct dentry *dn;
+
+		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
+			vino.ino, dir->i_ino, dentry->d_name.name);
+		ceph_dir_clear_ordered(dir);
+		ceph_init_inode_acls(inode, as_ctx);
+		if (inode->i_state & I_NEW) {
+			/*
+			 * If it's not I_NEW, then someone created this before
+			 * we got here. Assume the server is aware of it at
+			 * that point and don't worry about setting
+			 * CEPH_I_ASYNC_CREATE.
+			 */
+			ceph_inode(inode)->i_ceph_flags = CEPH_I_ASYNC_CREATE;
+			unlock_new_inode(inode);
+		}
+		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
+			if (!d_unhashed(dentry))
+				d_drop(dentry);
+			dn = d_splice_alias(inode, dentry);
+			WARN_ON_ONCE(dn && dn != dentry);
+		}
+		file->f_mode |= FMODE_CREATED;
+		ret = finish_open(file, dentry, ceph_open);
+	}
+	return ret;
+}
+
 /*
  * Do a lookup + open with a single request.  If we get a non-existent
  * file or symlink, return 1 so the VFS can retry.
@@ -460,6 +650,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	struct ceph_mds_request *req;
 	struct dentry *dn;
 	struct ceph_acl_sec_ctx as_ctx = {};
+	bool try_async = ceph_test_mount_opt(fsc, ASYNC_DIROPS);
 	int mask;
 	int err;
 
@@ -483,7 +674,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 		/* If it's not being looked up, it's negative */
 		return -ENOENT;
 	}
-
+retry:
 	/* do the open */
 	req = prepare_open_request(dir->i_sb, flags, mode);
 	if (IS_ERR(req)) {
@@ -492,28 +683,47 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	}
 	req->r_dentry = dget(dentry);
 	req->r_num_caps = 2;
+	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
+	if (ceph_security_xattr_wanted(dir))
+		mask |= CEPH_CAP_XATTR_SHARED;
+	req->r_args.open.mask = cpu_to_le32(mask);
+	req->r_parent = dir;
+
 	if (flags & O_CREAT) {
+		struct ceph_file_layout lo;
+
 		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
 		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
 		if (as_ctx.pagelist) {
 			req->r_pagelist = as_ctx.pagelist;
 			as_ctx.pagelist = NULL;
 		}
+		if (try_async && try_prep_async_create(dir, dentry, &lo,
+						       &req->r_deleg_ino)) {
+			set_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
+			req->r_args.open.flags |= cpu_to_le32(CEPH_O_EXCL);
+			req->r_callback = ceph_async_create_cb;
+			err = ceph_mdsc_submit_request(mdsc, dir, req);
+			if (!err) {
+				err = ceph_finish_async_create(dir, dentry,
+							file, mode, req,
+							&as_ctx, &lo);
+			} else if (err == -EJUKEBOX) {
+				ceph_mdsc_put_request(req);
+				try_async = false;
+				goto retry;
+			}
+			goto out_req;
+		}
 	}
 
-       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
-       if (ceph_security_xattr_wanted(dir))
-               mask |= CEPH_CAP_XATTR_SHARED;
-       req->r_args.open.mask = cpu_to_le32(mask);
-
-	req->r_parent = dir;
 	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
 	err = ceph_mdsc_do_request(mdsc,
 				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
 				   req);
 	err = ceph_handle_snapdir(req, dentry, err);
 	if (err)
-		goto out_req;
+		goto out_fmode;
 
 	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
 		err = ceph_handle_notrace_create(dir, dentry);
@@ -527,7 +737,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 		dn = NULL;
 	}
 	if (err)
-		goto out_req;
+		goto out_fmode;
 	if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
 		/* make vfs retry on splice, ENOENT, or symlink */
 		dout("atomic_open finish_no_open on dn %p\n", dn);
@@ -543,9 +753,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 		}
 		err = finish_open(file, dentry, ceph_open);
 	}
-out_req:
+out_fmode:
 	if (!req->r_err && req->r_target_inode)
 		ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
+out_req:
 	ceph_mdsc_put_request(req);
 out_ctx:
 	ceph_release_acl_sec_ctx(&as_ctx);
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index 91d09cf37649..e035c5194005 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -659,6 +659,9 @@ int ceph_flags_to_mode(int flags);
 #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
 			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
 			   CEPH_CAP_PIN)
+#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
+			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
+			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
 
 #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
 			CEPH_LOCK_IXATTR)
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 1/9] ceph: add flag to designate that a request is asynchronous
  2020-02-12 17:27 ` [PATCH v4 1/9] ceph: add flag to designate that a request is asynchronous Jeff Layton
@ 2020-02-13  9:29   ` Yan, Zheng
  2020-02-13 11:35     ` Jeff Layton
  0 siblings, 1 reply; 22+ messages in thread
From: Yan, Zheng @ 2020-02-13  9:29 UTC (permalink / raw)
  To: Jeff Layton
  Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan, Patrick Donnelly

On Thu, Feb 13, 2020 at 1:29 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> ...and ensure that such requests are never queued. The MDS needs to
> know that a request is asynchronous, so add flags and proper
> infrastructure for that.
>
> Also, delegated inode numbers and directory caps are associated with the
> session, so ensure that async requests are always transmitted on the
> first attempt and are never queued to wait for session reestablishment.
>
> If it does end up looking like we'll need to queue the request, then
> have it return -EJUKEBOX so the caller can reattempt with a synchronous
> request.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/inode.c              |  1 +
>  fs/ceph/mds_client.c         | 11 +++++++++++
>  fs/ceph/mds_client.h         |  1 +
>  include/linux/ceph/ceph_fs.h |  5 +++--
>  4 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 094b8fc37787..9869ec101e88 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1311,6 +1311,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
>                 err = fill_inode(in, req->r_locked_page, &rinfo->targeti, NULL,
>                                 session,
>                                 (!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
> +                                !test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags) &&
>                                  rinfo->head->result == 0) ?  req->r_fmode : -1,
>                                 &req->r_caps_reservation);
>                 if (err < 0) {
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 2980e57ca7b9..9f2aeb6908b2 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2527,6 +2527,8 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
>         rhead->oldest_client_tid = cpu_to_le64(__get_oldest_tid(mdsc));
>         if (test_bit(CEPH_MDS_R_GOT_UNSAFE, &req->r_req_flags))
>                 flags |= CEPH_MDS_FLAG_REPLAY;
> +       if (test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags))
> +               flags |= CEPH_MDS_FLAG_ASYNC;
>         if (req->r_parent)
>                 flags |= CEPH_MDS_FLAG_WANT_DENTRY;
>         rhead->flags = cpu_to_le32(flags);
> @@ -2634,6 +2636,15 @@ static void __do_request(struct ceph_mds_client *mdsc,
>                         err = -EACCES;
>                         goto out_session;
>                 }
> +               /*
> +                * We cannot queue async requests since the caps and delegated
> +                * inodes are bound to the session. Just return -EJUKEBOX and
> +                * let the caller retry a sync request in that case.
> +                */
> +               if (test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags)) {
> +                       err = -EJUKEBOX;
> +                       goto out_session;
> +               }

the code near __choose_mds can also queue the request


>                 if (session->s_state == CEPH_MDS_SESSION_NEW ||
>                     session->s_state == CEPH_MDS_SESSION_CLOSING) {
>                         __open_session(mdsc, session);
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 27a7446e10d3..0327974d0763 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -255,6 +255,7 @@ struct ceph_mds_request {
>  #define CEPH_MDS_R_GOT_RESULT          (5) /* got a result */
>  #define CEPH_MDS_R_DID_PREPOPULATE     (6) /* prepopulated readdir */
>  #define CEPH_MDS_R_PARENT_LOCKED       (7) /* is r_parent->i_rwsem wlocked? */
> +#define CEPH_MDS_R_ASYNC               (8) /* async request */
>         unsigned long   r_req_flags;
>
>         struct mutex r_fill_mutex;
> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> index cb21c5cf12c3..9f747a1b8788 100644
> --- a/include/linux/ceph/ceph_fs.h
> +++ b/include/linux/ceph/ceph_fs.h
> @@ -444,8 +444,9 @@ union ceph_mds_request_args {
>         } __attribute__ ((packed)) lookupino;
>  } __attribute__ ((packed));
>
> -#define CEPH_MDS_FLAG_REPLAY        1  /* this is a replayed op */
> -#define CEPH_MDS_FLAG_WANT_DENTRY   2  /* want dentry in reply */
> +#define CEPH_MDS_FLAG_REPLAY           1 /* this is a replayed op */
> +#define CEPH_MDS_FLAG_WANT_DENTRY      2 /* want dentry in reply */
> +#define CEPH_MDS_FLAG_ASYNC            4 /* request is asynchronous */
>
>  struct ceph_mds_request_head {
>         __le64 oldest_client_tid;
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 1/9] ceph: add flag to designate that a request is asynchronous
  2020-02-13  9:29   ` Yan, Zheng
@ 2020-02-13 11:35     ` Jeff Layton
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2020-02-13 11:35 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan, Patrick Donnelly

On Thu, 2020-02-13 at 17:29 +0800, Yan, Zheng wrote:
> On Thu, Feb 13, 2020 at 1:29 AM Jeff Layton <jlayton@kernel.org> wrote:
> > ...and ensure that such requests are never queued. The MDS needs to
> > know that a request is asynchronous, so add flags and proper
> > infrastructure for that.
> > 
> > Also, delegated inode numbers and directory caps are associated with the
> > session, so ensure that async requests are always transmitted on the
> > first attempt and are never queued to wait for session reestablishment.
> > 
> > If it does end up looking like we'll need to queue the request, then
> > have it return -EJUKEBOX so the caller can reattempt with a synchronous
> > request.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/ceph/inode.c              |  1 +
> >  fs/ceph/mds_client.c         | 11 +++++++++++
> >  fs/ceph/mds_client.h         |  1 +
> >  include/linux/ceph/ceph_fs.h |  5 +++--
> >  4 files changed, 16 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index 094b8fc37787..9869ec101e88 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -1311,6 +1311,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
> >                 err = fill_inode(in, req->r_locked_page, &rinfo->targeti, NULL,
> >                                 session,
> >                                 (!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
> > +                                !test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags) &&
> >                                  rinfo->head->result == 0) ?  req->r_fmode : -1,
> >                                 &req->r_caps_reservation);
> >                 if (err < 0) {
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 2980e57ca7b9..9f2aeb6908b2 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -2527,6 +2527,8 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
> >         rhead->oldest_client_tid = cpu_to_le64(__get_oldest_tid(mdsc));
> >         if (test_bit(CEPH_MDS_R_GOT_UNSAFE, &req->r_req_flags))
> >                 flags |= CEPH_MDS_FLAG_REPLAY;
> > +       if (test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags))
> > +               flags |= CEPH_MDS_FLAG_ASYNC;
> >         if (req->r_parent)
> >                 flags |= CEPH_MDS_FLAG_WANT_DENTRY;
> >         rhead->flags = cpu_to_le32(flags);
> > @@ -2634,6 +2636,15 @@ static void __do_request(struct ceph_mds_client *mdsc,
> >                         err = -EACCES;
> >                         goto out_session;
> >                 }
> > +               /*
> > +                * We cannot queue async requests since the caps and delegated
> > +                * inodes are bound to the session. Just return -EJUKEBOX and
> > +                * let the caller retry a sync request in that case.
> > +                */
> > +               if (test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags)) {
> > +                       err = -EJUKEBOX;
> > +                       goto out_session;
> > +               }
> 
> the code near __choose_mds can also queue the request
> 
> 

Ahh, right. Something like this maybe:

[PATCH] SQUASH: don't allow async req to be queued waiting for mdsmap

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/mds_client.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 09d5301b036c..ac5bd58bb971 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2702,6 +2702,10 @@ static void __do_request(struct ceph_mds_client *mdsc,
 	mds = __choose_mds(mdsc, req, &random);
 	if (mds < 0 ||
 	    ceph_mdsmap_get_state(mdsc->mdsmap, mds) < CEPH_MDS_STATE_ACTIVE) {
+		if (test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags)) {
+			err = -EJUKEBOX;
+			goto finish;
+		}
 		dout("do_request no mds or not active, waiting for map\n");
 		list_add(&req->r_wait, &mdsc->waiting_for_map);
 		return;
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 2/9] ceph: perform asynchronous unlink if we have sufficient caps
  2020-02-12 17:27 ` [PATCH v4 2/9] ceph: perform asynchronous unlink if we have sufficient caps Jeff Layton
@ 2020-02-13 12:06   ` Yan, Zheng
  2020-02-13 12:22     ` Jeff Layton
  0 siblings, 1 reply; 22+ messages in thread
From: Yan, Zheng @ 2020-02-13 12:06 UTC (permalink / raw)
  To: Jeff Layton
  Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan, Patrick Donnelly

On Thu, Feb 13, 2020 at 1:29 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> The MDS is getting a new lock-caching facility that will allow it
> to cache the necessary locks to allow asynchronous directory operations.
> Since the CEPH_CAP_FILE_* caps are currently unused on directories,
> we can repurpose those bits for this purpose.
>
> When performing an unlink, if we have Fx on the parent directory,
> and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being
> removed is the primary link, then we can fire off an unlink request
> immediately and don't need to wait for the reply before returning.
>
> In that situation, just fix up the dcache and link count and return
> immediately after issuing the call to the MDS. This does mean that we
> need to hold an extra reference to the inode being unlinked, and extra
> references to the caps to avoid races. Those references are put and
> error handling is done in the r_callback routine.
>
> If the operation ends up failing, then set a writeback error on the
> directory inode, and the inode itself that can be fetched later by
> an fsync on the dir.
>
> The behavior of dir caps is slightly different from caps on normal
> files. Because these are just considered an optimization, if the
> session is reconnected, we will not automatically reclaim them. They
> are instead considered lost until we do another synchronous op in the
> parent directory.
>
> Async dirops are enabled via the "nowsync" mount option, which is
> patterned after the xfs "wsync" mount option. For now, the default
> is "wsync", but eventually we may flip that.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
> ---
>  fs/ceph/caps.c               | 35 +++++++++----
>  fs/ceph/dir.c                | 99 ++++++++++++++++++++++++++++++++++--
>  fs/ceph/inode.c              |  8 ++-
>  fs/ceph/mds_client.c         |  8 ++-
>  fs/ceph/super.c              | 20 ++++++++
>  fs/ceph/super.h              |  6 ++-
>  include/linux/ceph/ceph_fs.h |  9 ++++
>  7 files changed, 166 insertions(+), 19 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index d05717397c2a..7fc87b693ba4 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -992,7 +992,11 @@ int __ceph_caps_file_wanted(struct ceph_inode_info *ci)
>  int __ceph_caps_wanted(struct ceph_inode_info *ci)
>  {
>         int w = __ceph_caps_file_wanted(ci) | __ceph_caps_used(ci);
> -       if (!S_ISDIR(ci->vfs_inode.i_mode)) {
> +       if (S_ISDIR(ci->vfs_inode.i_mode)) {
> +               /* we want EXCL if holding caps of dir ops */
> +               if (w & CEPH_CAP_ANY_DIR_OPS)
> +                       w |= CEPH_CAP_FILE_EXCL;
> +       } else {
>                 /* we want EXCL if dirty data */
>                 if (w & CEPH_CAP_FILE_BUFFER)
>                         w |= CEPH_CAP_FILE_EXCL;
> @@ -1883,10 +1887,13 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
>                          * revoking the shared cap on every create/unlink
>                          * operation.
>                          */
> -                       if (IS_RDONLY(inode))
> +                       if (IS_RDONLY(inode)) {
>                                 want = CEPH_CAP_ANY_SHARED;
> -                       else
> -                               want = CEPH_CAP_ANY_SHARED | CEPH_CAP_FILE_EXCL;
> +                       } else {
> +                               want = CEPH_CAP_ANY_SHARED |
> +                                      CEPH_CAP_FILE_EXCL |
> +                                      CEPH_CAP_ANY_DIR_OPS;
> +                       }
>                         retain |= want;
>                 } else {
>
> @@ -2649,7 +2656,10 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
>                                 }
>                                 snap_rwsem_locked = true;
>                         }
> -                       *got = need | (have & want);
> +                       if ((have & want) == want)
> +                               *got = need | want;
> +                       else
> +                               *got = need;
>                         if (S_ISREG(inode->i_mode) &&
>                             (need & CEPH_CAP_FILE_RD) &&
>                             !(*got & CEPH_CAP_FILE_CACHE))
> @@ -2739,13 +2749,16 @@ int ceph_try_get_caps(struct inode *inode, int need, int want,
>         int ret;
>
>         BUG_ON(need & ~CEPH_CAP_FILE_RD);
> -       BUG_ON(want & ~(CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO|CEPH_CAP_FILE_SHARED));
> -       ret = ceph_pool_perm_check(inode, need);
> -       if (ret < 0)
> -               return ret;
> +       if (need) {
> +               ret = ceph_pool_perm_check(inode, need);
> +               if (ret < 0)
> +                       return ret;
> +       }
>
> -       ret = try_get_cap_refs(inode, need, want, 0,
> -                              (nonblock ? NON_BLOCKING : 0), got);
> +       BUG_ON(want & ~(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO |
> +                       CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
> +                       CEPH_CAP_ANY_DIR_OPS));
> +       ret = try_get_cap_refs(inode, need, want, 0, nonblock, got);

should keep (nonblock ? NON_BLOCKING : 0)

>         return ret == -EAGAIN ? 0 : ret;
>  }
>
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index d0cd0aba5843..46314ccf48c5 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -1036,6 +1036,69 @@ static int ceph_link(struct dentry *old_dentry, struct inode *dir,
>         return err;
>  }
>
> +static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
> +                                struct ceph_mds_request *req)
> +{
> +       int result = req->r_err ? req->r_err :
> +                       le32_to_cpu(req->r_reply_info.head->result);
> +
> +       /* If op failed, mark everyone involved for errors */
> +       if (result) {
> +               int pathlen;
> +               u64 base;
> +               char *path = ceph_mdsc_build_path(req->r_dentry, &pathlen,
> +                                                 &base, 0);
> +
> +               /* mark error on parent + clear complete */
> +               mapping_set_error(req->r_parent->i_mapping, result);
> +               ceph_dir_clear_complete(req->r_parent);
> +
> +               /* drop the dentry -- we don't know its status */
> +               if (!d_unhashed(req->r_dentry))
> +                       d_drop(req->r_dentry);
> +
> +               /* mark inode itself for an error (since metadata is bogus) */
> +               mapping_set_error(req->r_old_inode->i_mapping, result);
> +
> +               pr_warn("ceph: async unlink failure path=(%llx)%s result=%d!\n",
> +                       base, IS_ERR(path) ? "<<bad>>" : path, result);
> +               ceph_mdsc_free_path(path, pathlen);
> +       }
> +
> +       ceph_put_cap_refs(ceph_inode(req->r_parent),
> +                         CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_UNLINK);
> +       iput(req->r_old_inode);
> +}
> +
> +static bool get_caps_for_async_unlink(struct inode *dir, struct dentry *dentry)
> +{
> +       struct ceph_inode_info *ci = ceph_inode(dir);
> +       struct ceph_dentry_info *di;
> +       int ret, want, got = 0;
> +
> +       want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_UNLINK;
> +       ret = ceph_try_get_caps(dir, 0, want, true, &got);
> +       dout("FxDu on %p ret=%d got=%s\n", dir, ret, ceph_cap_string(got));
> +       if (ret != 1 || got != want)
> +               return false;
> +
> +        spin_lock(&dentry->d_lock);
> +        di = ceph_dentry(dentry);
> +       /* - We are holding CEPH_CAP_FILE_EXCL, which implies
> +        * CEPH_CAP_FILE_SHARED.
> +        * - Only support async unlink for primary linkage */
> +       if (atomic_read(&ci->i_shared_gen) != di->lease_shared_gen ||
> +           !(di->flags & CEPH_DENTRY_PRIMARY_LINK))
> +               ret = 0;
> +        spin_unlock(&dentry->d_lock);
> +
> +       if (!ret) {
> +               ceph_put_cap_refs(ci, got);
> +               return false;
> +       }
> +       return true;
> +}
> +
>  /*
>   * rmdir and unlink are differ only by the metadata op code
>   */
> @@ -1045,6 +1108,7 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
>         struct ceph_mds_client *mdsc = fsc->mdsc;
>         struct inode *inode = d_inode(dentry);
>         struct ceph_mds_request *req;
> +       bool try_async = ceph_test_mount_opt(fsc, ASYNC_DIROPS);
>         int err = -EROFS;
>         int op;
>
> @@ -1059,6 +1123,7 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
>                         CEPH_MDS_OP_RMDIR : CEPH_MDS_OP_UNLINK;
>         } else
>                 goto out;
> +retry:
>         req = ceph_mdsc_create_request(mdsc, op, USE_AUTH_MDS);
>         if (IS_ERR(req)) {
>                 err = PTR_ERR(req);
> @@ -1067,13 +1132,38 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
>         req->r_dentry = dget(dentry);
>         req->r_num_caps = 2;
>         req->r_parent = dir;
> -       set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
>         req->r_dentry_drop = CEPH_CAP_FILE_SHARED;
>         req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
>         req->r_inode_drop = ceph_drop_caps_for_unlink(inode);
> -       err = ceph_mdsc_do_request(mdsc, dir, req);
> -       if (!err && !req->r_reply_info.head->is_dentry)
> -               d_delete(dentry);
> +
> +       if (try_async && op == CEPH_MDS_OP_UNLINK &&
> +           get_caps_for_async_unlink(dir, dentry)) {
> +               dout("ceph: Async unlink on %lu/%.*s", dir->i_ino,
> +                    dentry->d_name.len, dentry->d_name.name);
> +               set_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
> +               req->r_callback = ceph_async_unlink_cb;
> +               req->r_old_inode = d_inode(dentry);
> +               ihold(req->r_old_inode);
> +               err = ceph_mdsc_submit_request(mdsc, dir, req);
> +               if (!err) {
> +                       /*
> +                        * We have enough caps, so we assume that the unlink
> +                        * will succeed. Fix up the target inode and dcache.
> +                        */
> +                       drop_nlink(inode);
> +                       d_delete(dentry);
> +               } else if (err == -EJUKEBOX) {
> +                       try_async = false;
> +                       ceph_mdsc_put_request(req);
> +                       goto retry;
> +               }
> +       } else {
> +               set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> +               err = ceph_mdsc_do_request(mdsc, dir, req);
> +               if (!err && !req->r_reply_info.head->is_dentry)
> +                       d_delete(dentry);
> +       }
> +
>         ceph_mdsc_put_request(req);
>  out:
>         return err;
> @@ -1411,6 +1501,7 @@ void ceph_invalidate_dentry_lease(struct dentry *dentry)
>         spin_lock(&dentry->d_lock);
>         di->time = jiffies;
>         di->lease_shared_gen = 0;
> +       di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
>         __dentry_lease_unlist(di);
>         spin_unlock(&dentry->d_lock);
>  }
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 9869ec101e88..7478bd0283c1 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1051,6 +1051,7 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
>                                   struct ceph_mds_session **old_lease_session)
>  {
>         struct ceph_dentry_info *di = ceph_dentry(dentry);
> +       unsigned mask = le16_to_cpu(lease->mask);
>         long unsigned duration = le32_to_cpu(lease->duration_ms);
>         long unsigned ttl = from_time + (duration * HZ) / 1000;
>         long unsigned half_ttl = from_time + (duration * HZ / 2) / 1000;
> @@ -1062,8 +1063,13 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
>         if (ceph_snap(dir) != CEPH_NOSNAP)
>                 return;
>
> +       if (mask & CEPH_LEASE_PRIMARY_LINK)
> +               di->flags |= CEPH_DENTRY_PRIMARY_LINK;
> +       else
> +               di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
> +
>         di->lease_shared_gen = atomic_read(&ceph_inode(dir)->i_shared_gen);
> -       if (duration == 0) {
> +       if (!(mask & CEPH_LEASE_VALID)) {
>                 __ceph_dentry_dir_lease_touch(di);
>                 return;
>         }
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 9f2aeb6908b2..f0ea32f4cdb9 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -3370,7 +3370,7 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state)
>  /*
>   * Encode information about a cap for a reconnect with the MDS.
>   */
> -static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
> +static int reconnect_caps_cb(struct inode *inode, struct ceph_cap *cap,
>                           void *arg)
>  {
>         union {
> @@ -3393,6 +3393,10 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
>         cap->mseq = 0;       /* and migrate_seq */
>         cap->cap_gen = cap->session->s_cap_gen;
>
> +       /* These are lost when the session goes away */
> +       if (S_ISDIR(inode->i_mode))
> +               cap->issued &= ~(CEPH_CAP_DIR_CREATE|CEPH_CAP_DIR_UNLINK);
> +
>         if (recon_state->msg_version >= 2) {
>                 rec.v2.cap_id = cpu_to_le64(cap->cap_id);
>                 rec.v2.wanted = cpu_to_le32(__ceph_caps_wanted(ci));
> @@ -3689,7 +3693,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
>                 recon_state.msg_version = 2;
>         }
>         /* trsaverse this session's caps */
> -       err = ceph_iterate_session_caps(session, encode_caps_cb, &recon_state);
> +       err = ceph_iterate_session_caps(session, reconnect_caps_cb, &recon_state);
>
>         spin_lock(&session->s_cap_lock);
>         session->s_cap_reconnect = 0;
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index c7f150686a53..58d64805c9e3 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -155,6 +155,7 @@ enum {
>         Opt_acl,
>         Opt_quotadf,
>         Opt_copyfrom,
> +       Opt_wsync,
>  };
>
>  enum ceph_recover_session_mode {
> @@ -194,6 +195,7 @@ static const struct fs_parameter_spec ceph_mount_parameters[] = {
>         fsparam_string  ("snapdirname",                 Opt_snapdirname),
>         fsparam_string  ("source",                      Opt_source),
>         fsparam_u32     ("wsize",                       Opt_wsize),
> +       fsparam_flag_no ("wsync",                       Opt_wsync),
>         {}
>  };
>
> @@ -444,6 +446,12 @@ static int ceph_parse_mount_param(struct fs_context *fc,
>                         fc->sb_flags &= ~SB_POSIXACL;
>                 }
>                 break;
> +       case Opt_wsync:
> +               if (!result.negated)
> +                       fsopt->flags &= ~CEPH_MOUNT_OPT_ASYNC_DIROPS;
> +               else
> +                       fsopt->flags |= CEPH_MOUNT_OPT_ASYNC_DIROPS;
> +               break;
>         default:
>                 BUG();
>         }
> @@ -567,6 +575,9 @@ static int ceph_show_options(struct seq_file *m, struct dentry *root)
>         if (fsopt->flags & CEPH_MOUNT_OPT_CLEANRECOVER)
>                 seq_show_option(m, "recover_session", "clean");
>
> +       if (fsopt->flags & CEPH_MOUNT_OPT_ASYNC_DIROPS)
> +               seq_puts(m, ",nowsync");
> +
>         if (fsopt->wsize != CEPH_MAX_WRITE_SIZE)
>                 seq_printf(m, ",wsize=%u", fsopt->wsize);
>         if (fsopt->rsize != CEPH_MAX_READ_SIZE)
> @@ -1107,6 +1118,15 @@ static void ceph_free_fc(struct fs_context *fc)
>
>  static int ceph_reconfigure_fc(struct fs_context *fc)
>  {
> +       struct ceph_parse_opts_ctx *pctx = fc->fs_private;
> +       struct ceph_mount_options *fsopt = pctx->opts;
> +       struct ceph_fs_client *fsc = ceph_sb_to_client(fc->root->d_sb);
> +
> +       if (fsopt->flags & CEPH_MOUNT_OPT_ASYNC_DIROPS)
> +               ceph_set_mount_opt(fsc, ASYNC_DIROPS);
> +       else
> +               ceph_clear_mount_opt(fsc, ASYNC_DIROPS);
> +
>         sync_filesystem(fc->root->d_sb);
>         return 0;
>  }
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 37dc1ac8f6c3..540393ba861b 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -43,13 +43,16 @@
>  #define CEPH_MOUNT_OPT_MOUNTWAIT       (1<<12) /* mount waits if no mds is up */
>  #define CEPH_MOUNT_OPT_NOQUOTADF       (1<<13) /* no root dir quota in statfs */
>  #define CEPH_MOUNT_OPT_NOCOPYFROM      (1<<14) /* don't use RADOS 'copy-from' op */
> +#define CEPH_MOUNT_OPT_ASYNC_DIROPS    (1<<15) /* allow async directory ops */
>
>  #define CEPH_MOUNT_OPT_DEFAULT                 \
>         (CEPH_MOUNT_OPT_DCACHE |                \
>          CEPH_MOUNT_OPT_NOCOPYFROM)
>
>  #define ceph_set_mount_opt(fsc, opt) \
> -       (fsc)->mount_options->flags |= CEPH_MOUNT_OPT_##opt;
> +       (fsc)->mount_options->flags |= CEPH_MOUNT_OPT_##opt
> +#define ceph_clear_mount_opt(fsc, opt) \
> +       (fsc)->mount_options->flags &= ~CEPH_MOUNT_OPT_##opt
>  #define ceph_test_mount_opt(fsc, opt) \
>         (!!((fsc)->mount_options->flags & CEPH_MOUNT_OPT_##opt))
>
> @@ -284,6 +287,7 @@ struct ceph_dentry_info {
>  #define CEPH_DENTRY_REFERENCED         1
>  #define CEPH_DENTRY_LEASE_LIST         2
>  #define CEPH_DENTRY_SHRINK_LIST                4
> +#define CEPH_DENTRY_PRIMARY_LINK       8
>
>  struct ceph_inode_xattrs_info {
>         /*
> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> index 9f747a1b8788..91d09cf37649 100644
> --- a/include/linux/ceph/ceph_fs.h
> +++ b/include/linux/ceph/ceph_fs.h
> @@ -531,6 +531,9 @@ struct ceph_mds_reply_lease {
>         __le32 seq;
>  } __attribute__ ((packed));
>
> +#define CEPH_LEASE_VALID        (1 | 2) /* old and new bit values */
> +#define CEPH_LEASE_PRIMARY_LINK 4       /* primary linkage */
> +
>  struct ceph_mds_reply_dirfrag {
>         __le32 frag;            /* fragment */
>         __le32 auth;            /* auth mds, if this is a delegation point */
> @@ -660,6 +663,12 @@ int ceph_flags_to_mode(int flags);
>  #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
>                         CEPH_LOCK_IXATTR)
>
> +/* cap masks async dir operations */
> +#define CEPH_CAP_DIR_CREATE    CEPH_CAP_FILE_CACHE
> +#define CEPH_CAP_DIR_UNLINK    CEPH_CAP_FILE_RD
> +#define CEPH_CAP_ANY_DIR_OPS   (CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_RD | \
> +                                CEPH_CAP_FILE_WREXTEND | CEPH_CAP_FILE_LAZYIO)
> +
>  int ceph_caps_for_mode(int mode);
>
>  enum {
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 6/9] ceph: add infrastructure for waiting for async create to complete
  2020-02-12 17:27 ` [PATCH v4 6/9] ceph: add infrastructure for waiting for async create to complete Jeff Layton
@ 2020-02-13 12:15   ` Yan, Zheng
  2020-02-13 12:22     ` Jeff Layton
  0 siblings, 1 reply; 22+ messages in thread
From: Yan, Zheng @ 2020-02-13 12:15 UTC (permalink / raw)
  To: Jeff Layton
  Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan, Patrick Donnelly

On Thu, Feb 13, 2020 at 1:30 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> When we issue an async create, we must ensure that any later on-the-wire
> requests involving it wait for the create reply.
>
> Expand i_ceph_flags to be an unsigned long, and add a new bit that
> MDS requests can wait on. If the bit is set in the inode when sending
> caps, then don't send it and just return that it has been delayed.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/caps.c       | 13 ++++++++++++-
>  fs/ceph/dir.c        |  2 +-
>  fs/ceph/mds_client.c | 18 ++++++++++++++++++
>  fs/ceph/mds_client.h |  8 ++++++++
>  fs/ceph/super.h      |  4 +++-
>  5 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index c983990acb75..869e2102e827 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -511,7 +511,7 @@ static void __cap_delay_requeue(struct ceph_mds_client *mdsc,
>                                 struct ceph_inode_info *ci,
>                                 bool set_timeout)
>  {
> -       dout("__cap_delay_requeue %p flags %d at %lu\n", &ci->vfs_inode,
> +       dout("__cap_delay_requeue %p flags 0x%lx at %lu\n", &ci->vfs_inode,
>              ci->i_ceph_flags, ci->i_hold_caps_max);
>         if (!mdsc->stopping) {
>                 spin_lock(&mdsc->cap_delay_lock);
> @@ -1298,6 +1298,13 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
>         int delayed = 0;
>         int ret;
>
> +       /* Don't send anything if it's still being created. Return delayed */
> +       if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
> +               spin_unlock(&ci->i_ceph_lock);
> +               dout("%s async create in flight for %p\n", __func__, inode);
> +               return 1;
> +       }
> +
>         held = cap->issued | cap->implemented;
>         revoking = cap->implemented & ~cap->issued;
>         retain &= ~revoking;
> @@ -2257,6 +2264,10 @@ int ceph_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>         if (datasync)
>                 goto out;
>
> +       ret = ceph_wait_on_async_create(inode);
> +       if (ret)
> +               goto out;
> +

fsync on directory does not consider async create/unlink?

>         dirty = try_flush_caps(inode, &flush_tid);
>         dout("fsync dirty caps are %s\n", ceph_cap_string(dirty));
>
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 46314ccf48c5..4e695f2a9347 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -752,7 +752,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
>                 struct ceph_dentry_info *di = ceph_dentry(dentry);
>
>                 spin_lock(&ci->i_ceph_lock);
> -               dout(" dir %p flags are %d\n", dir, ci->i_ceph_flags);
> +               dout(" dir %p flags are 0x%lx\n", dir, ci->i_ceph_flags);
>                 if (strncmp(dentry->d_name.name,
>                             fsc->mount_options->snapdir_name,
>                             dentry->d_name.len) &&
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 91c5f999da7d..314dd0f6f5a9 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2829,6 +2829,24 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
>                 ceph_get_cap_refs(ceph_inode(req->r_old_dentry_dir),
>                                   CEPH_CAP_PIN);
>
> +       if (req->r_inode) {
> +               err = ceph_wait_on_async_create(req->r_inode);
> +               if (err) {
> +                       dout("%s: wait for async create returned: %d\n",
> +                            __func__, err);
> +                       return err;
> +               }
> +       }
> +
> +       if (req->r_old_inode) {
> +               err = ceph_wait_on_async_create(req->r_old_inode);
> +               if (err) {
> +                       dout("%s: wait for async create returned: %d\n",
> +                            __func__, err);
> +                       return err;
> +               }
> +       }
> +
>         dout("submit_request on %p for inode %p\n", req, dir);
>         mutex_lock(&mdsc->mutex);
>         __register_request(mdsc, req, dir);
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 31f68897bc87..acad9adca0af 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -543,4 +543,12 @@ extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
>                           int max_caps);
>
>  extern u64 ceph_get_deleg_ino(struct ceph_mds_session *session);
> +
> +static inline int ceph_wait_on_async_create(struct inode *inode)
> +{
> +       struct ceph_inode_info *ci = ceph_inode(inode);
> +
> +       return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
> +                          TASK_INTERRUPTIBLE);
> +}
>  #endif
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index ea68eef977ef..47fb6e022339 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -319,7 +319,7 @@ struct ceph_inode_info {
>         u64 i_inline_version;
>         u32 i_time_warp_seq;
>
> -       unsigned i_ceph_flags;
> +       unsigned long i_ceph_flags;
>         atomic64_t i_release_count;
>         atomic64_t i_ordered_count;
>         atomic64_t i_complete_seq[2];
> @@ -527,6 +527,8 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
>  #define CEPH_I_ERROR_WRITE     (1 << 10) /* have seen write errors */
>  #define CEPH_I_ERROR_FILELOCK  (1 << 11) /* have seen file lock errors */
>  #define CEPH_I_ODIRECT         (1 << 12) /* inode in direct I/O mode */
> +#define CEPH_ASYNC_CREATE_BIT  (13)      /* async create in flight for this */
> +#define CEPH_I_ASYNC_CREATE    (1 << CEPH_ASYNC_CREATE_BIT)
>
>  /*
>   * Masks of ceph inode work.
> --
> 2.24.1
>


* Re: [PATCH v4 6/9] ceph: add infrastructure for waiting for async create to complete
  2020-02-13 12:15   ` Yan, Zheng
@ 2020-02-13 12:22     ` Jeff Layton
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2020-02-13 12:22 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan, Patrick Donnelly

On Thu, 2020-02-13 at 20:15 +0800, Yan, Zheng wrote:
> On Thu, Feb 13, 2020 at 1:30 AM Jeff Layton <jlayton@kernel.org> wrote:
> > When we issue an async create, we must ensure that any later on-the-wire
> > requests involving it wait for the create reply.
> > 
> > Expand i_ceph_flags to be an unsigned long, and add a new bit that
> > MDS requests can wait on. If the bit is set in the inode when sending
> > caps, then don't send it and just return that it has been delayed.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/ceph/caps.c       | 13 ++++++++++++-
> >  fs/ceph/dir.c        |  2 +-
> >  fs/ceph/mds_client.c | 18 ++++++++++++++++++
> >  fs/ceph/mds_client.h |  8 ++++++++
> >  fs/ceph/super.h      |  4 +++-
> >  5 files changed, 42 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index c983990acb75..869e2102e827 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -511,7 +511,7 @@ static void __cap_delay_requeue(struct ceph_mds_client *mdsc,
> >                                 struct ceph_inode_info *ci,
> >                                 bool set_timeout)
> >  {
> > -       dout("__cap_delay_requeue %p flags %d at %lu\n", &ci->vfs_inode,
> > +       dout("__cap_delay_requeue %p flags 0x%lx at %lu\n", &ci->vfs_inode,
> >              ci->i_ceph_flags, ci->i_hold_caps_max);
> >         if (!mdsc->stopping) {
> >                 spin_lock(&mdsc->cap_delay_lock);
> > @@ -1298,6 +1298,13 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
> >         int delayed = 0;
> >         int ret;
> > 
> > +       /* Don't send anything if it's still being created. Return delayed */
> > +       if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
> > +               spin_unlock(&ci->i_ceph_lock);
> > +               dout("%s async create in flight for %p\n", __func__, inode);
> > +               return 1;
> > +       }
> > +
> >         held = cap->issued | cap->implemented;
> >         revoking = cap->implemented & ~cap->issued;
> >         retain &= ~revoking;
> > @@ -2257,6 +2264,10 @@ int ceph_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> >         if (datasync)
> >                 goto out;
> > 
> > +       ret = ceph_wait_on_async_create(inode);
> > +       if (ret)
> > +               goto out;
> > +
> 
> fsync on directory does not consider async create/unlink?
> 

No, it does. A little lower than this hunk, we have this call:

    err = unsafe_request_wait(inode);

...which waits on all of the i_unsafe_dirops requests.


> >         dirty = try_flush_caps(inode, &flush_tid);
> >         dout("fsync dirty caps are %s\n", ceph_cap_string(dirty));
> > 
> > diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> > index 46314ccf48c5..4e695f2a9347 100644
> > --- a/fs/ceph/dir.c
> > +++ b/fs/ceph/dir.c
> > @@ -752,7 +752,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
> >                 struct ceph_dentry_info *di = ceph_dentry(dentry);
> > 
> >                 spin_lock(&ci->i_ceph_lock);
> > -               dout(" dir %p flags are %d\n", dir, ci->i_ceph_flags);
> > +               dout(" dir %p flags are 0x%lx\n", dir, ci->i_ceph_flags);
> >                 if (strncmp(dentry->d_name.name,
> >                             fsc->mount_options->snapdir_name,
> >                             dentry->d_name.len) &&
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 91c5f999da7d..314dd0f6f5a9 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -2829,6 +2829,24 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
> >                 ceph_get_cap_refs(ceph_inode(req->r_old_dentry_dir),
> >                                   CEPH_CAP_PIN);
> > 
> > +       if (req->r_inode) {
> > +               err = ceph_wait_on_async_create(req->r_inode);
> > +               if (err) {
> > +                       dout("%s: wait for async create returned: %d\n",
> > +                            __func__, err);
> > +                       return err;
> > +               }
> > +       }
> > +
> > +       if (req->r_old_inode) {
> > +               err = ceph_wait_on_async_create(req->r_old_inode);
> > +               if (err) {
> > +                       dout("%s: wait for async create returned: %d\n",
> > +                            __func__, err);
> > +                       return err;
> > +               }
> > +       }
> > +
> >         dout("submit_request on %p for inode %p\n", req, dir);
> >         mutex_lock(&mdsc->mutex);
> >         __register_request(mdsc, req, dir);
> > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > index 31f68897bc87..acad9adca0af 100644
> > --- a/fs/ceph/mds_client.h
> > +++ b/fs/ceph/mds_client.h
> > @@ -543,4 +543,12 @@ extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
> >                           int max_caps);
> > 
> >  extern u64 ceph_get_deleg_ino(struct ceph_mds_session *session);
> > +
> > +static inline int ceph_wait_on_async_create(struct inode *inode)
> > +{
> > +       struct ceph_inode_info *ci = ceph_inode(inode);
> > +
> > +       return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
> > +                          TASK_INTERRUPTIBLE);
> > +}
> >  #endif
> > diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> > index ea68eef977ef..47fb6e022339 100644
> > --- a/fs/ceph/super.h
> > +++ b/fs/ceph/super.h
> > @@ -319,7 +319,7 @@ struct ceph_inode_info {
> >         u64 i_inline_version;
> >         u32 i_time_warp_seq;
> > 
> > -       unsigned i_ceph_flags;
> > +       unsigned long i_ceph_flags;
> >         atomic64_t i_release_count;
> >         atomic64_t i_ordered_count;
> >         atomic64_t i_complete_seq[2];
> > @@ -527,6 +527,8 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
> >  #define CEPH_I_ERROR_WRITE     (1 << 10) /* have seen write errors */
> >  #define CEPH_I_ERROR_FILELOCK  (1 << 11) /* have seen file lock errors */
> >  #define CEPH_I_ODIRECT         (1 << 12) /* inode in direct I/O mode */
> > +#define CEPH_ASYNC_CREATE_BIT  (13)      /* async create in flight for this */
> > +#define CEPH_I_ASYNC_CREATE    (1 << CEPH_ASYNC_CREATE_BIT)
> > 
> >  /*
> >   * Masks of ceph inode work.
> > --
> > 2.24.1
> > 

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v4 2/9] ceph: perform asynchronous unlink if we have sufficient caps
  2020-02-13 12:06   ` Yan, Zheng
@ 2020-02-13 12:22     ` Jeff Layton
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2020-02-13 12:22 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan, Patrick Donnelly

On Thu, 2020-02-13 at 20:06 +0800, Yan, Zheng wrote:
> On Thu, Feb 13, 2020 at 1:29 AM Jeff Layton <jlayton@kernel.org> wrote:
> > The MDS is getting a new lock-caching facility that will allow it
> > to cache the necessary locks to allow asynchronous directory operations.
> > Since the CEPH_CAP_FILE_* caps are currently unused on directories,
> > we can repurpose those bits for this purpose.
> > 
> > When performing an unlink, if we have Fx on the parent directory,
> > and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being
> > removed is the primary link, then we can fire off an unlink
> > request immediately and don't need to wait on the reply before returning.
> > 
> > In that situation, just fix up the dcache and link count and return
> > immediately after issuing the call to the MDS. This does mean that we
> > need to hold an extra reference to the inode being unlinked, and extra
> > references to the caps to avoid races. Those references are put and
> > error handling is done in the r_callback routine.
> > 
> > If the operation ends up failing, then set a writeback error on the
> > directory inode, and the inode itself that can be fetched later by
> > an fsync on the dir.
> > 
> > The behavior of dir caps is slightly different from caps on normal
> > files. Because these are just considered an optimization, if the
> > session is reconnected, we will not automatically reclaim them. They
> > are instead considered lost until we do another synchronous op in the
> > parent directory.
> > 
> > Async dirops are enabled via the "nowsync" mount option, which is
> > patterned after the xfs "wsync" mount option. For now, the default
> > is "wsync", but eventually we may flip that.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
> > ---
> >  fs/ceph/caps.c               | 35 +++++++++----
> >  fs/ceph/dir.c                | 99 ++++++++++++++++++++++++++++++++++--
> >  fs/ceph/inode.c              |  8 ++-
> >  fs/ceph/mds_client.c         |  8 ++-
> >  fs/ceph/super.c              | 20 ++++++++
> >  fs/ceph/super.h              |  6 ++-
> >  include/linux/ceph/ceph_fs.h |  9 ++++
> >  7 files changed, 166 insertions(+), 19 deletions(-)
> > 
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index d05717397c2a..7fc87b693ba4 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -992,7 +992,11 @@ int __ceph_caps_file_wanted(struct ceph_inode_info *ci)
> >  int __ceph_caps_wanted(struct ceph_inode_info *ci)
> >  {
> >         int w = __ceph_caps_file_wanted(ci) | __ceph_caps_used(ci);
> > -       if (!S_ISDIR(ci->vfs_inode.i_mode)) {
> > +       if (S_ISDIR(ci->vfs_inode.i_mode)) {
> > +               /* we want EXCL if holding caps of dir ops */
> > +               if (w & CEPH_CAP_ANY_DIR_OPS)
> > +                       w |= CEPH_CAP_FILE_EXCL;
> > +       } else {
> >                 /* we want EXCL if dirty data */
> >                 if (w & CEPH_CAP_FILE_BUFFER)
> >                         w |= CEPH_CAP_FILE_EXCL;
> > @@ -1883,10 +1887,13 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
> >                          * revoking the shared cap on every create/unlink
> >                          * operation.
> >                          */
> > -                       if (IS_RDONLY(inode))
> > +                       if (IS_RDONLY(inode)) {
> >                                 want = CEPH_CAP_ANY_SHARED;
> > -                       else
> > -                               want = CEPH_CAP_ANY_SHARED | CEPH_CAP_FILE_EXCL;
> > +                       } else {
> > +                               want = CEPH_CAP_ANY_SHARED |
> > +                                      CEPH_CAP_FILE_EXCL |
> > +                                      CEPH_CAP_ANY_DIR_OPS;
> > +                       }
> >                         retain |= want;
> >                 } else {
> > 
> > @@ -2649,7 +2656,10 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
> >                                 }
> >                                 snap_rwsem_locked = true;
> >                         }
> > -                       *got = need | (have & want);
> > +                       if ((have & want) == want)
> > +                               *got = need | want;
> > +                       else
> > +                               *got = need;
> >                         if (S_ISREG(inode->i_mode) &&
> >                             (need & CEPH_CAP_FILE_RD) &&
> >                             !(*got & CEPH_CAP_FILE_CACHE))
> > @@ -2739,13 +2749,16 @@ int ceph_try_get_caps(struct inode *inode, int need, int want,
> >         int ret;
> > 
> >         BUG_ON(need & ~CEPH_CAP_FILE_RD);
> > -       BUG_ON(want & ~(CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO|CEPH_CAP_FILE_SHARED));
> > -       ret = ceph_pool_perm_check(inode, need);
> > -       if (ret < 0)
> > -               return ret;
> > +       if (need) {
> > +               ret = ceph_pool_perm_check(inode, need);
> > +               if (ret < 0)
> > +                       return ret;
> > +       }
> > 
> > -       ret = try_get_cap_refs(inode, need, want, 0,
> > -                              (nonblock ? NON_BLOCKING : 0), got);
> > +       BUG_ON(want & ~(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO |
> > +                       CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
> > +                       CEPH_CAP_ANY_DIR_OPS));
> > +       ret = try_get_cap_refs(inode, need, want, 0, nonblock, got);
> 
> should keep (nonblock ? NON_BLOCKING : 0)
> 

Good catch. Fixed in my tree.

-- 
Jeff Layton <jlayton@kernel.org>



* Re: [PATCH v4 9/9] ceph: attempt to do async create when possible
  2020-02-12 17:27 ` [PATCH v4 9/9] ceph: attempt to do async create when possible Jeff Layton
@ 2020-02-13 12:44   ` Yan, Zheng
  0 siblings, 0 replies; 22+ messages in thread
From: Yan, Zheng @ 2020-02-13 12:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan, Patrick Donnelly

On Thu, Feb 13, 2020 at 1:30 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> With the Octopus release, the MDS will hand out directory create caps.
>
> If we have Fxc caps on the directory, and complete directory information
> or a known negative dentry, then we can return without waiting on the
> reply, allowing the open() call to return very quickly to userland.
>
> We use the normal ceph_fill_inode() routine to fill in the inode, so we
> have to gin up some reply inode information with what we'd expect the
> newly-created inode to have. The client assumes that it has a full set
> of caps on the new inode, and that the MDS will revoke them when there
> is conflicting access.
>
> This functionality is gated on the wsync/nowsync mount options.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/file.c               | 231 +++++++++++++++++++++++++++++++++--
>  include/linux/ceph/ceph_fs.h |   3 +
>  2 files changed, 224 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 472d90ccdf44..814ff435832c 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -448,6 +448,196 @@ cache_file_layout(struct inode *dst, struct inode *src)
>         spin_unlock(&cdst->i_ceph_lock);
>  }
>
> +/*
> + * Try to set up an async create. We need caps, a file layout, and inode number,
> + * and either a lease on the dentry or complete dir info. If any of those
> + * criteria are not satisfied, then return false and the caller can go
> + * synchronous.
> + */
> +static bool try_prep_async_create(struct inode *dir, struct dentry *dentry,
> +                                 struct ceph_file_layout *lo, u64 *pino)
> +{
> +       struct ceph_inode_info *ci = ceph_inode(dir);
> +       struct ceph_dentry_info *di = ceph_dentry(dentry);
> +       bool ret = false;
> +       u64 ino;
> +
> +       spin_lock(&ci->i_ceph_lock);
> +       /* No auth cap means no chance for Dc caps */
> +       if (!ci->i_auth_cap)
> +               goto no_async;
> +
> +       /* Any delegated inos? */
> +       if (xa_empty(&ci->i_auth_cap->session->s_delegated_inos))
> +               goto no_async;
> +
> +       if (!ceph_file_layout_is_valid(&ci->i_cached_layout))
> +               goto no_async;
> +
> +       if ((__ceph_caps_issued(ci, NULL) &
> +            (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE)) !=
> +           (CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE))
> +               goto no_async;
> +
> +       if (d_in_lookup(dentry)) {
> +               if (!__ceph_dir_is_complete(ci))
> +                       goto no_async;
> +       } else if (atomic_read(&ci->i_shared_gen) !=
> +                  READ_ONCE(di->lease_shared_gen)) {
> +               goto no_async;
> +       }
> +
> +       ino = ceph_get_deleg_ino(ci->i_auth_cap->session);
> +       if (!ino)
> +               goto no_async;
> +
> +       *pino = ino;
> +       ceph_take_cap_refs(ci, CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE, false);
> +       memcpy(lo, &ci->i_cached_layout, sizeof(*lo));
> +       rcu_assign_pointer(lo->pool_ns,
> +                          ceph_try_get_string(ci->i_cached_layout.pool_ns));
> +       ret = true;
> +no_async:
> +       spin_unlock(&ci->i_ceph_lock);
> +       return ret;
> +}
> +
> +static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
> +                                 struct ceph_mds_request *req)
> +{
> +       int result = req->r_err ? req->r_err :
> +                       le32_to_cpu(req->r_reply_info.head->result);
> +
> +       mapping_set_error(req->r_parent->i_mapping, result);
> +
> +       if (result) {
> +               struct dentry *dentry = req->r_dentry;
> +               int pathlen;
> +               u64 base;
> +               char *path = ceph_mdsc_build_path(req->r_dentry, &pathlen,
> +                                                 &base, 0);
> +
> +               ceph_dir_clear_complete(req->r_parent);
> +               if (!d_unhashed(dentry))
> +                       d_drop(dentry);
> +
> +               /* FIXME: start returning I/O errors on all accesses? */
> +               pr_warn("ceph: async create failure path=(%llx)%s result=%d!\n",
> +                       base, IS_ERR(path) ? "<<bad>>" : path, result);
> +               ceph_mdsc_free_path(path, pathlen);
> +       }
> +
> +       if (req->r_target_inode) {
> +               struct ceph_inode_info *ci = ceph_inode(req->r_target_inode);
> +               u64 ino = ceph_vino(req->r_target_inode).ino;
> +
> +               if (req->r_deleg_ino != ino)
> +                       pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%llx target=0x%llx\n",
> +                               __func__, req->r_err, req->r_deleg_ino, ino);
> +               mapping_set_error(req->r_target_inode->i_mapping, result);
> +
> +               spin_lock(&ci->i_ceph_lock);
> +               if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
> +                       ci->i_ceph_flags &= ~CEPH_I_ASYNC_CREATE;
> +                       wake_up_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT);
> +               }
> +               spin_unlock(&ci->i_ceph_lock);
> +       } else {
> +               pr_warn("%s: no req->r_target_inode for 0x%llx\n", __func__,
> +                       req->r_deleg_ino);
> +       }
> +       ceph_put_cap_refs(ceph_inode(req->r_parent),
> +                         CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE);
> +}
> +
> +static int ceph_finish_async_create(struct inode *dir, struct dentry *dentry,
> +                                   struct file *file, umode_t mode,
> +                                   struct ceph_mds_request *req,
> +                                   struct ceph_acl_sec_ctx *as_ctx,
> +                                   struct ceph_file_layout *lo)
> +{
> +       int ret;
> +       char xattr_buf[4];
> +       struct ceph_mds_reply_inode in = { };
> +       struct ceph_mds_reply_info_in iinfo = { .in = &in };
> +       struct ceph_inode_info *ci = ceph_inode(dir);
> +       struct inode *inode;
> +       struct timespec64 now;
> +       struct ceph_vino vino = { .ino = req->r_deleg_ino,
> +                                 .snap = CEPH_NOSNAP };
> +
> +       ktime_get_real_ts64(&now);
> +
> +       inode = ceph_get_inode(dentry->d_sb, vino);
> +       if (IS_ERR(inode))
> +               return PTR_ERR(inode);
> +
> +       iinfo.inline_version = CEPH_INLINE_NONE;
> +       iinfo.change_attr = 1;
> +       ceph_encode_timespec64(&iinfo.btime, &now);
> +
> +       iinfo.xattr_len = ARRAY_SIZE(xattr_buf);
> +       iinfo.xattr_data = xattr_buf;
> +       memset(iinfo.xattr_data, 0, iinfo.xattr_len);
> +
> +       in.ino = cpu_to_le64(vino.ino);
> +       in.snapid = cpu_to_le64(CEPH_NOSNAP);
> +       in.version = cpu_to_le64(1);    // ???
> +       in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
> +       in.cap.cap_id = cpu_to_le64(1);
> +       in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
> +       in.cap.flags = CEPH_CAP_FLAG_AUTH;
> +       in.ctime = in.mtime = in.atime = iinfo.btime;
> +       in.mode = cpu_to_le32((u32)mode);
> +       in.truncate_seq = cpu_to_le32(1);
> +       in.truncate_size = cpu_to_le64(-1ULL);
> +       in.xattr_version = cpu_to_le64(1);
> +       in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
> +       in.gid = cpu_to_le32(from_kgid(&init_user_ns, dir->i_mode & S_ISGID ?
> +                               dir->i_gid : current_fsgid()));
> +       in.nlink = cpu_to_le32(1);
> +       in.max_size = cpu_to_le64(lo->stripe_unit);
> +
> +       ceph_file_layout_to_legacy(lo, &in.layout);
> +
> +       ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
> +                             req->r_fmode, NULL);
> +       if (ret) {
> +               dout("%s failed to fill inode: %d\n", __func__, ret);
> +               ceph_dir_clear_complete(dir);
> +               if (!d_unhashed(dentry))
> +                       d_drop(dentry);
> +               if (inode->i_state & I_NEW)
> +                       discard_new_inode(inode);
> +       } else {
> +               struct dentry *dn;
> +
> +               dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
> +                       vino.ino, dir->i_ino, dentry->d_name.name);
> +               ceph_dir_clear_ordered(dir);
> +               ceph_init_inode_acls(inode, as_ctx);
> +               if (inode->i_state & I_NEW) {
> +                       /*
> +                        * If it's not I_NEW, then someone created this before
> +                        * we got here. Assume the server is aware of it at
> +                        * that point and don't worry about setting
> +                        * CEPH_I_ASYNC_CREATE.
> +                        */
> +                       ceph_inode(inode)->i_ceph_flags = CEPH_I_ASYNC_CREATE;
> +                       unlock_new_inode(inode);
> +               }
> +               if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
> +                       if (!d_unhashed(dentry))
> +                               d_drop(dentry);
> +                       dn = d_splice_alias(inode, dentry);
> +                       WARN_ON_ONCE(dn && dn != dentry);
> +               }
> +               file->f_mode |= FMODE_CREATED;
> +               ret = finish_open(file, dentry, ceph_open);
> +       }
> +       return ret;
> +}
> +
>  /*
>   * Do a lookup + open with a single request.  If we get a non-existent
>   * file or symlink, return 1 so the VFS can retry.
> @@ -460,6 +650,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>         struct ceph_mds_request *req;
>         struct dentry *dn;
>         struct ceph_acl_sec_ctx as_ctx = {};
> +       bool try_async = ceph_test_mount_opt(fsc, ASYNC_DIROPS);
>         int mask;
>         int err;
>
> @@ -483,7 +674,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>                 /* If it's not being looked up, it's negative */
>                 return -ENOENT;
>         }
> -
> +retry:
>         /* do the open */
>         req = prepare_open_request(dir->i_sb, flags, mode);
>         if (IS_ERR(req)) {
> @@ -492,28 +683,47 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>         }
>         req->r_dentry = dget(dentry);
>         req->r_num_caps = 2;
> +       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> +       if (ceph_security_xattr_wanted(dir))
> +               mask |= CEPH_CAP_XATTR_SHARED;
> +       req->r_args.open.mask = cpu_to_le32(mask);
> +       req->r_parent = dir;
> +
>         if (flags & O_CREAT) {
> +               struct ceph_file_layout lo;
> +
>                 req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
>                 req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
>                 if (as_ctx.pagelist) {
>                         req->r_pagelist = as_ctx.pagelist;
>                         as_ctx.pagelist = NULL;
>                 }
> +               if (try_async && try_prep_async_create(dir, dentry, &lo,
> +                                                      &req->r_deleg_ino)) {
> +                       set_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
> +                       req->r_args.open.flags |= cpu_to_le32(CEPH_O_EXCL);
> +                       req->r_callback = ceph_async_create_cb;
> +                       err = ceph_mdsc_submit_request(mdsc, dir, req);
> +                       if (!err) {
> +                               err = ceph_finish_async_create(dir, dentry,
> +                                                       file, mode, req,
> +                                                       &as_ctx, &lo);
> +                       } else if (err == -EJUKEBOX) {
> +                               ceph_mdsc_put_request(req);
> +                               try_async = false;
> +                               goto retry;

put ino number back to s_delegated_inos

> +                       }
> +                       goto out_req;
> +               }
>         }
>
> -       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
> -       if (ceph_security_xattr_wanted(dir))
> -               mask |= CEPH_CAP_XATTR_SHARED;
> -       req->r_args.open.mask = cpu_to_le32(mask);
> -
> -       req->r_parent = dir;
>         set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
>         err = ceph_mdsc_do_request(mdsc,
>                                    (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
>                                    req);
>         err = ceph_handle_snapdir(req, dentry, err);
>         if (err)
> -               goto out_req;
> +               goto out_fmode;
>
>         if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
>                 err = ceph_handle_notrace_create(dir, dentry);
> @@ -527,7 +737,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>                 dn = NULL;
>         }
>         if (err)
> -               goto out_req;
> +               goto out_fmode;
>         if (dn || d_really_is_negative(dentry) || d_is_symlink(dentry)) {
>                 /* make vfs retry on splice, ENOENT, or symlink */
>                 dout("atomic_open finish_no_open on dn %p\n", dn);
> @@ -543,9 +753,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
>                 }
>                 err = finish_open(file, dentry, ceph_open);
>         }
> -out_req:
> +out_fmode:
>         if (!req->r_err && req->r_target_inode)
>                 ceph_put_fmode(ceph_inode(req->r_target_inode), req->r_fmode);
> +out_req:
>         ceph_mdsc_put_request(req);
>  out_ctx:
>         ceph_release_acl_sec_ctx(&as_ctx);
> diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
> index 91d09cf37649..e035c5194005 100644
> --- a/include/linux/ceph/ceph_fs.h
> +++ b/include/linux/ceph/ceph_fs.h
> @@ -659,6 +659,9 @@ int ceph_flags_to_mode(int flags);
>  #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
>                            CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
>                            CEPH_CAP_PIN)
> +#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
> +                          CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
> +                          CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
>
>  #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
>                         CEPH_LOCK_IXATTR)
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 0/9] ceph: add support for asynchronous directory operations
  2020-02-12 17:27 [PATCH v4 0/9] ceph: add support for asynchronous directory operations Jeff Layton
                   ` (8 preceding siblings ...)
  2020-02-12 17:27 ` [PATCH v4 9/9] ceph: attempt to do async create when possible Jeff Layton
@ 2020-02-13 13:05 ` Yan, Zheng
  2020-02-13 13:20   ` Jeff Layton
  9 siblings, 1 reply; 22+ messages in thread
From: Yan, Zheng @ 2020-02-13 13:05 UTC (permalink / raw)
  To: Jeff Layton
  Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan, Patrick Donnelly

On Thu, Feb 13, 2020 at 1:29 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> I've dropped the async unlink patch from testing branch and am
> resubmitting it here along with the rest of the create patches.
>
> Zheng had pointed out that DIR_* caps should be cleared when the session
> is reconnected. The underlying submission code needed changes to
> handle that so it needed a bit of rework (along with the create code).
>
> Since v3:
> - rework async request submission to never queue the request when the
>   session isn't open
> - clean out DIR_* caps, layouts and delegated inodes when session goes down
> - better ordering for dependent requests
> - new mount options (wsync/nowsync) instead of module option
> - more comprehensive error handling
>
> Jeff Layton (9):
>   ceph: add flag to designate that a request is asynchronous
>   ceph: perform asynchronous unlink if we have sufficient caps
>   ceph: make ceph_fill_inode non-static
>   ceph: make __take_cap_refs non-static
>   ceph: decode interval_sets for delegated inos
>   ceph: add infrastructure for waiting for async create to complete
>   ceph: add new MDS req field to hold delegated inode number
>   ceph: cache layout in parent dir on first sync create
>   ceph: attempt to do async create when possible
>
>  fs/ceph/caps.c               |  73 +++++++---
>  fs/ceph/dir.c                | 101 +++++++++++++-
>  fs/ceph/file.c               | 253 +++++++++++++++++++++++++++++++++--
>  fs/ceph/inode.c              |  58 ++++----
>  fs/ceph/mds_client.c         | 156 +++++++++++++++++++--
>  fs/ceph/mds_client.h         |  17 ++-
>  fs/ceph/super.c              |  20 +++
>  fs/ceph/super.h              |  21 ++-
>  include/linux/ceph/ceph_fs.h |  17 ++-
>  9 files changed, 637 insertions(+), 79 deletions(-)
>

Please implement something like
https://github.com/ceph/ceph/pull/32576/commits/e9aa5ec062fab8324e13020ff2f583537e326a0b.
The MDS may revoke Fx when replaying unsafe/async requests. Making the
MDS not do this would be quite complex.

> --
> 2.24.1
>


* Re: [PATCH v4 0/9] ceph: add support for asynchronous directory operations
  2020-02-13 13:05 ` [PATCH v4 0/9] ceph: add support for asynchronous directory operations Yan, Zheng
@ 2020-02-13 13:20   ` Jeff Layton
  2020-02-13 14:43     ` Yan, Zheng
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff Layton @ 2020-02-13 13:20 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan, Patrick Donnelly

On Thu, 2020-02-13 at 21:05 +0800, Yan, Zheng wrote:
> On Thu, Feb 13, 2020 at 1:29 AM Jeff Layton <jlayton@kernel.org> wrote:
> > I've dropped the async unlink patch from testing branch and am
> > resubmitting it here along with the rest of the create patches.
> > 
> > Zheng had pointed out that DIR_* caps should be cleared when the session
> > is reconnected. The underlying submission code needed changes to
> > handle that so it needed a bit of rework (along with the create code).
> > 
> > Since v3:
> > - rework async request submission to never queue the request when the
> >   session isn't open
> > - clean out DIR_* caps, layouts and delegated inodes when session goes down
> > - better ordering for dependent requests
> > - new mount options (wsync/nowsync) instead of module option
> > - more comprehensive error handling
> > 
> > Jeff Layton (9):
> >   ceph: add flag to designate that a request is asynchronous
> >   ceph: perform asynchronous unlink if we have sufficient caps
> >   ceph: make ceph_fill_inode non-static
> >   ceph: make __take_cap_refs non-static
> >   ceph: decode interval_sets for delegated inos
> >   ceph: add infrastructure for waiting for async create to complete
> >   ceph: add new MDS req field to hold delegated inode number
> >   ceph: cache layout in parent dir on first sync create
> >   ceph: attempt to do async create when possible
> > 
> >  fs/ceph/caps.c               |  73 +++++++---
> >  fs/ceph/dir.c                | 101 +++++++++++++-
> >  fs/ceph/file.c               | 253 +++++++++++++++++++++++++++++++++--
> >  fs/ceph/inode.c              |  58 ++++----
> >  fs/ceph/mds_client.c         | 156 +++++++++++++++++++--
> >  fs/ceph/mds_client.h         |  17 ++-
> >  fs/ceph/super.c              |  20 +++
> >  fs/ceph/super.h              |  21 ++-
> >  include/linux/ceph/ceph_fs.h |  17 ++-
> >  9 files changed, 637 insertions(+), 79 deletions(-)
> > 
> 
> Please implement something like
> https://github.com/ceph/ceph/pull/32576/commits/e9aa5ec062fab8324e13020ff2f583537e326a0b.
> MDS may revoke Fx when replaying unsafe/async requests. Make mds not
> do this is quite complex.
> 

I added this in reconnect_caps_cb in the latest set:

        /* These are lost when the session goes away */
        if (S_ISDIR(inode->i_mode)) {
                if (cap->issued & CEPH_CAP_DIR_CREATE) {
                        ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
                        memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
                }
                cap->issued &= ~(CEPH_CAP_DIR_CREATE|CEPH_CAP_DIR_UNLINK);
        }

Basically, wipe out the layout and Duc caps when we reconnect the
session. Outstanding references to the caps will be put when the call
completes. Is that not sufficient?
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v4 0/9] ceph: add support for asynchronous directory operations
  2020-02-13 13:20   ` Jeff Layton
@ 2020-02-13 14:43     ` Yan, Zheng
  2020-02-13 15:09       ` Jeff Layton
  0 siblings, 1 reply; 22+ messages in thread
From: Yan, Zheng @ 2020-02-13 14:43 UTC (permalink / raw)
  To: Jeff Layton
  Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan, Patrick Donnelly

On Thu, Feb 13, 2020 at 9:20 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Thu, 2020-02-13 at 21:05 +0800, Yan, Zheng wrote:
> > On Thu, Feb 13, 2020 at 1:29 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > I've dropped the async unlink patch from testing branch and am
> > > resubmitting it here along with the rest of the create patches.
> > >
> > > Zheng had pointed out that DIR_* caps should be cleared when the session
> > > is reconnected. The underlying submission code needed changes to
> > > handle that so it needed a bit of rework (along with the create code).
> > >
> > > Since v3:
> > > - rework async request submission to never queue the request when the
> > >   session isn't open
> > > - clean out DIR_* caps, layouts and delegated inodes when session goes down
> > > - better ordering for dependent requests
> > > - new mount options (wsync/nowsync) instead of module option
> > > - more comprehensive error handling
> > >
> > > Jeff Layton (9):
> > >   ceph: add flag to designate that a request is asynchronous
> > >   ceph: perform asynchronous unlink if we have sufficient caps
> > >   ceph: make ceph_fill_inode non-static
> > >   ceph: make __take_cap_refs non-static
> > >   ceph: decode interval_sets for delegated inos
> > >   ceph: add infrastructure for waiting for async create to complete
> > >   ceph: add new MDS req field to hold delegated inode number
> > >   ceph: cache layout in parent dir on first sync create
> > >   ceph: attempt to do async create when possible
> > >
> > >  fs/ceph/caps.c               |  73 +++++++---
> > >  fs/ceph/dir.c                | 101 +++++++++++++-
> > >  fs/ceph/file.c               | 253 +++++++++++++++++++++++++++++++++--
> > >  fs/ceph/inode.c              |  58 ++++----
> > >  fs/ceph/mds_client.c         | 156 +++++++++++++++++++--
> > >  fs/ceph/mds_client.h         |  17 ++-
> > >  fs/ceph/super.c              |  20 +++
> > >  fs/ceph/super.h              |  21 ++-
> > >  include/linux/ceph/ceph_fs.h |  17 ++-
> > >  9 files changed, 637 insertions(+), 79 deletions(-)
> > >
> >
> > Please implement something like
> > https://github.com/ceph/ceph/pull/32576/commits/e9aa5ec062fab8324e13020ff2f583537e326a0b.
> > The MDS may revoke Fx when replaying unsafe/async requests. Making the
> > MDS not do this would be quite complex.
> >
>
> I added this in reconnect_caps_cb in the latest set:
>
>         /* These are lost when the session goes away */
>         if (S_ISDIR(inode->i_mode)) {
>                 if (cap->issued & CEPH_CAP_DIR_CREATE) {
>                         ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
>                         memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
>                 }
>                 cap->issued &= ~(CEPH_CAP_DIR_CREATE|CEPH_CAP_DIR_UNLINK);
>         }
>

It's not enough.  For async create/unlink, we need to call

ceph_put_cap_refs(..., CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_FOO)

to release the caps.


> Basically, wipe out the layout and Duc caps when we reconnect the
> session. Outstanding references to the caps will be put when the call
> completes. Is that not sufficient?
> --
> Jeff Layton <jlayton@kernel.org>
>


* Re: [PATCH v4 0/9] ceph: add support for asynchronous directory operations
  2020-02-13 14:43     ` Yan, Zheng
@ 2020-02-13 15:09       ` Jeff Layton
  2020-02-14  2:10         ` Yan, Zheng
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff Layton @ 2020-02-13 15:09 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan, Patrick Donnelly

On Thu, 2020-02-13 at 22:43 +0800, Yan, Zheng wrote:
> On Thu, Feb 13, 2020 at 9:20 PM Jeff Layton <jlayton@kernel.org> wrote:
> > On Thu, 2020-02-13 at 21:05 +0800, Yan, Zheng wrote:
> > > On Thu, Feb 13, 2020 at 1:29 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > I've dropped the async unlink patch from testing branch and am
> > > > resubmitting it here along with the rest of the create patches.
> > > > 
> > > > Zheng had pointed out that DIR_* caps should be cleared when the session
> > > > is reconnected. The underlying submission code needed changes to
> > > > handle that so it needed a bit of rework (along with the create code).
> > > > 
> > > > Since v3:
> > > > - rework async request submission to never queue the request when the
> > > >   session isn't open
> > > > - clean out DIR_* caps, layouts and delegated inodes when session goes down
> > > > - better ordering for dependent requests
> > > > - new mount options (wsync/nowsync) instead of module option
> > > > - more comprehensive error handling
> > > > 
> > > > Jeff Layton (9):
> > > >   ceph: add flag to designate that a request is asynchronous
> > > >   ceph: perform asynchronous unlink if we have sufficient caps
> > > >   ceph: make ceph_fill_inode non-static
> > > >   ceph: make __take_cap_refs non-static
> > > >   ceph: decode interval_sets for delegated inos
> > > >   ceph: add infrastructure for waiting for async create to complete
> > > >   ceph: add new MDS req field to hold delegated inode number
> > > >   ceph: cache layout in parent dir on first sync create
> > > >   ceph: attempt to do async create when possible
> > > > 
> > > >  fs/ceph/caps.c               |  73 +++++++---
> > > >  fs/ceph/dir.c                | 101 +++++++++++++-
> > > >  fs/ceph/file.c               | 253 +++++++++++++++++++++++++++++++++--
> > > >  fs/ceph/inode.c              |  58 ++++----
> > > >  fs/ceph/mds_client.c         | 156 +++++++++++++++++++--
> > > >  fs/ceph/mds_client.h         |  17 ++-
> > > >  fs/ceph/super.c              |  20 +++
> > > >  fs/ceph/super.h              |  21 ++-
> > > >  include/linux/ceph/ceph_fs.h |  17 ++-
> > > >  9 files changed, 637 insertions(+), 79 deletions(-)
> > > > 
> > > 
> > > Please implement something like
> > > https://github.com/ceph/ceph/pull/32576/commits/e9aa5ec062fab8324e13020ff2f583537e326a0b.
> > > The MDS may revoke Fx when replaying unsafe/async requests. Making the
> > > MDS not do this would be quite complex.
> > > 
> > 
> > I added this in reconnect_caps_cb in the latest set:
> > 
> >         /* These are lost when the session goes away */
> >         if (S_ISDIR(inode->i_mode)) {
> >                 if (cap->issued & CEPH_CAP_DIR_CREATE) {
> >                         ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
> >                         memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
> >                 }
> >                 cap->issued &= ~(CEPH_CAP_DIR_CREATE|CEPH_CAP_DIR_UNLINK);
> >         }
> > 
> 
> It's not enough.  For async create/unlink, we need to call
> 
> ceph_put_cap_refs(..., CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_FOO)
> 
> to release the caps.
> 

That sounds really wrong.

The call holds references to these caps. We can't just drop them here,
as we could be racing with reply handling.

What exactly is the problem with waiting until r_callback fires to drop
the references? We're clearing them out of the "issued" field in the
cap, so we won't be handing out any new references. The fact that there
are still outstanding references doesn't seem like it ought to cause any
problem.

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH v4 0/9] ceph: add support for asynchronous directory operations
  2020-02-13 15:09       ` Jeff Layton
@ 2020-02-14  2:10         ` Yan, Zheng
  0 siblings, 0 replies; 22+ messages in thread
From: Yan, Zheng @ 2020-02-14  2:10 UTC (permalink / raw)
  To: Jeff Layton
  Cc: ceph-devel, idridryomov, Sage Weil, Zheng Yan, Patrick Donnelly

On Thu, Feb 13, 2020 at 11:09 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Thu, 2020-02-13 at 22:43 +0800, Yan, Zheng wrote:
> > On Thu, Feb 13, 2020 at 9:20 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > On Thu, 2020-02-13 at 21:05 +0800, Yan, Zheng wrote:
> > > > On Thu, Feb 13, 2020 at 1:29 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > > I've dropped the async unlink patch from testing branch and am
> > > > > resubmitting it here along with the rest of the create patches.
> > > > >
> > > > > Zheng had pointed out that DIR_* caps should be cleared when the session
> > > > > is reconnected. The underlying submission code needed changes to
> > > > > handle that so it needed a bit of rework (along with the create code).
> > > > >
> > > > > Since v3:
> > > > > - rework async request submission to never queue the request when the
> > > > >   session isn't open
> > > > > - clean out DIR_* caps, layouts and delegated inodes when session goes down
> > > > > - better ordering for dependent requests
> > > > > - new mount options (wsync/nowsync) instead of module option
> > > > > - more comprehensive error handling
> > > > >
> > > > > Jeff Layton (9):
> > > > >   ceph: add flag to designate that a request is asynchronous
> > > > >   ceph: perform asynchronous unlink if we have sufficient caps
> > > > >   ceph: make ceph_fill_inode non-static
> > > > >   ceph: make __take_cap_refs non-static
> > > > >   ceph: decode interval_sets for delegated inos
> > > > >   ceph: add infrastructure for waiting for async create to complete
> > > > >   ceph: add new MDS req field to hold delegated inode number
> > > > >   ceph: cache layout in parent dir on first sync create
> > > > >   ceph: attempt to do async create when possible
> > > > >
> > > > >  fs/ceph/caps.c               |  73 +++++++---
> > > > >  fs/ceph/dir.c                | 101 +++++++++++++-
> > > > >  fs/ceph/file.c               | 253 +++++++++++++++++++++++++++++++++--
> > > > >  fs/ceph/inode.c              |  58 ++++----
> > > > >  fs/ceph/mds_client.c         | 156 +++++++++++++++++++--
> > > > >  fs/ceph/mds_client.h         |  17 ++-
> > > > >  fs/ceph/super.c              |  20 +++
> > > > >  fs/ceph/super.h              |  21 ++-
> > > > >  include/linux/ceph/ceph_fs.h |  17 ++-
> > > > >  9 files changed, 637 insertions(+), 79 deletions(-)
> > > > >
> > > >
> > > > Please implement something like
> > > > https://github.com/ceph/ceph/pull/32576/commits/e9aa5ec062fab8324e13020ff2f583537e326a0b.
> > > > The MDS may revoke Fx when replaying unsafe/async requests. Making the
> > > > MDS not do this would be quite complex.
> > > >
> > >
> > > I added this in reconnect_caps_cb in the latest set:
> > >
> > >         /* These are lost when the session goes away */
> > >         if (S_ISDIR(inode->i_mode)) {
> > >                 if (cap->issued & CEPH_CAP_DIR_CREATE) {
> > >                         ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
> > >                         memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
> > >                 }
> > >                 cap->issued &= ~(CEPH_CAP_DIR_CREATE|CEPH_CAP_DIR_UNLINK);
> > >         }
> > >
> >
> > It's not enough.  For async create/unlink, we need to call
> > 
> > ceph_put_cap_refs(..., CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_FOO)
> > 
> > to release the caps.
> >
>
> That sounds really wrong.
>
> The call holds references to these caps. We can't just drop them here,
> as we could be racing with reply handling.
>
> What exactly is the problem with waiting until r_callback fires to drop
> the references? We're clearing them out of the "issued" field in the
> cap, so we won't be handing out any new references. The fact that there
> are still outstanding references doesn't seem like it ought to cause any
> problem.
>

see https://github.com/ceph/ceph/pull/32576/commits/e9aa5ec062fab8324e13020ff2f583537e326a0b.

We also need to make r_callback not release the cap refs if they were
already released at reconnect.  The problem is that the MDS may want to
revoke Fx when replaying unsafe/async requests (the same reason we can't
send a getattr to fetch inline data while holding the Fr cap).

> --
> Jeff Layton <jlayton@kernel.org>
>


end of thread, other threads:[~2020-02-14  2:10 UTC | newest]

Thread overview: 22+ messages
2020-02-12 17:27 [PATCH v4 0/9] ceph: add support for asynchronous directory operations Jeff Layton
2020-02-12 17:27 ` [PATCH v4 1/9] ceph: add flag to designate that a request is asynchronous Jeff Layton
2020-02-13  9:29   ` Yan, Zheng
2020-02-13 11:35     ` Jeff Layton
2020-02-12 17:27 ` [PATCH v4 2/9] ceph: perform asynchronous unlink if we have sufficient caps Jeff Layton
2020-02-13 12:06   ` Yan, Zheng
2020-02-13 12:22     ` Jeff Layton
2020-02-12 17:27 ` [PATCH v4 3/9] ceph: make ceph_fill_inode non-static Jeff Layton
2020-02-12 17:27 ` [PATCH v4 4/9] ceph: make __take_cap_refs non-static Jeff Layton
2020-02-12 17:27 ` [PATCH v4 5/9] ceph: decode interval_sets for delegated inos Jeff Layton
2020-02-12 17:27 ` [PATCH v4 6/9] ceph: add infrastructure for waiting for async create to complete Jeff Layton
2020-02-13 12:15   ` Yan, Zheng
2020-02-13 12:22     ` Jeff Layton
2020-02-12 17:27 ` [PATCH v4 7/9] ceph: add new MDS req field to hold delegated inode number Jeff Layton
2020-02-12 17:27 ` [PATCH v4 8/9] ceph: cache layout in parent dir on first sync create Jeff Layton
2020-02-12 17:27 ` [PATCH v4 9/9] ceph: attempt to do async create when possible Jeff Layton
2020-02-13 12:44   ` Yan, Zheng
2020-02-13 13:05 ` [PATCH v4 0/9] ceph: add support for asynchronous directory operations Yan, Zheng
2020-02-13 13:20   ` Jeff Layton
2020-02-13 14:43     ` Yan, Zheng
2020-02-13 15:09       ` Jeff Layton
2020-02-14  2:10         ` Yan, Zheng
