* [PATCH v6 00/13] ceph: async directory operations support
@ 2020-03-02 14:14 Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 01/13] ceph: make kick_flushing_inode_caps non-static Jeff Layton
                   ` (13 more replies)
  0 siblings, 14 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

v6: move handling of CEPH_I_ASYNC_CREATE from __send_cap into callers
    also issue ceph_mdsc_release_dir_caps() in complete_request
    properly handle -EJUKEBOX return in async callbacks
    
I previously pulled the async unlink patch from ceph-client/testing, so
this set includes a revised version of that as well, and orders it
after some other changes.

The main change from v5 is to rework the callers of __send_cap to either
skip sending or wait if the create reply hasn't come in yet.

We may not actually need patch #7 here. Zheng had that delta in one
of the earlier patches, but I'm not sure it's really needed now. It
may make sense to just take it on its own merits though.

Jeff Layton (12):
  ceph: make kick_flushing_inode_caps non-static
  ceph: add flag to designate that a request is asynchronous
  ceph: track primary dentry link
  ceph: add infrastructure for waiting for async create to complete
  ceph: make __take_cap_refs non-static
  ceph: cap tracking for async directory operations
  ceph: perform asynchronous unlink if we have sufficient caps
  ceph: make ceph_fill_inode non-static
  ceph: decode interval_sets for delegated inos
  ceph: add new MDS req field to hold delegated inode number
  ceph: cache layout in parent dir on first sync create
  ceph: attempt to do async create when possible

Yan, Zheng (1):
  ceph: don't take refs to want mask unless we have all bits

 fs/ceph/caps.c               |  91 ++++++++----
 fs/ceph/dir.c                | 111 ++++++++++++++-
 fs/ceph/file.c               | 269 +++++++++++++++++++++++++++++++++--
 fs/ceph/inode.c              |  58 ++++----
 fs/ceph/mds_client.c         | 196 ++++++++++++++++++++++---
 fs/ceph/mds_client.h         |  24 +++-
 fs/ceph/super.c              |  20 +++
 fs/ceph/super.h              |  23 ++-
 include/linux/ceph/ceph_fs.h |  17 ++-
 9 files changed, 724 insertions(+), 85 deletions(-)

-- 
2.24.1

* [PATCH v6 01/13] ceph: make kick_flushing_inode_caps non-static
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
@ 2020-03-02 14:14 ` Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 02/13] ceph: add flag to designate that a request is asynchronous Jeff Layton
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

We'll need this to kick any flushing caps once the create reply
comes in.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c  | 6 +++---
 fs/ceph/super.h | 2 ++
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index b34f9be29622..553fd1d52456 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2427,8 +2427,8 @@ void ceph_kick_flushing_caps(struct ceph_mds_client *mdsc,
 	}
 }
 
-static void kick_flushing_inode_caps(struct ceph_mds_session *session,
-				     struct ceph_inode_info *ci)
+void ceph_kick_flushing_inode_caps(struct ceph_mds_session *session,
+				   struct ceph_inode_info *ci)
 {
 	struct ceph_mds_client *mdsc = session->s_mdsc;
 	struct ceph_cap *cap = ci->i_auth_cap;
@@ -3325,7 +3325,7 @@ static void handle_cap_grant(struct inode *inode,
 	if (le32_to_cpu(grant->op) == CEPH_CAP_OP_IMPORT) {
 		if (newcaps & ~extra_info->issued)
 			wake = true;
-		kick_flushing_inode_caps(session, ci);
+		ceph_kick_flushing_inode_caps(session, ci);
 		spin_unlock(&ci->i_ceph_lock);
 		up_read(&session->s_mdsc->snap_rwsem);
 	} else {
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index e586cff3dfd5..d10513c6f0a1 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1051,6 +1051,8 @@ extern void ceph_early_kick_flushing_caps(struct ceph_mds_client *mdsc,
 					  struct ceph_mds_session *session);
 extern void ceph_kick_flushing_caps(struct ceph_mds_client *mdsc,
 				    struct ceph_mds_session *session);
+void ceph_kick_flushing_inode_caps(struct ceph_mds_session *session,
+				   struct ceph_inode_info *ci);
 extern struct ceph_cap *ceph_get_cap_for_mds(struct ceph_inode_info *ci,
 					     int mds);
 extern void ceph_get_cap_refs(struct ceph_inode_info *ci, int caps);
-- 
2.24.1

* [PATCH v6 02/13] ceph: add flag to designate that a request is asynchronous
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 01/13] ceph: make kick_flushing_inode_caps non-static Jeff Layton
@ 2020-03-02 14:14 ` Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 03/13] ceph: track primary dentry link Jeff Layton
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

...and ensure that such requests are never queued. The MDS needs to
know that a request is asynchronous, so add flags and the proper
infrastructure for that.

Also, delegated inode numbers and directory caps are associated with the
session, so ensure that async requests are always transmitted on the
first attempt and are never queued to wait for session reestablishment.

If it does end up looking like we'll need to queue the request, then
have it return -EJUKEBOX so the caller can reattempt with a synchronous
request.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c              |  1 +
 fs/ceph/mds_client.c         | 15 +++++++++++++++
 fs/ceph/mds_client.h         |  1 +
 include/linux/ceph/ceph_fs.h |  5 +++--
 4 files changed, 20 insertions(+), 2 deletions(-)
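
For readers following along, here is a minimal userspace sketch of the
fallback described above. It is not the patch's code: the sketch_* names,
the two-state session enum, and the return values 0 and 1 are all
hypothetical, and EJUKEBOX is defined locally because userspace headers
do not provide this kernel-internal errno.

```c
#include <stdbool.h>

/* Kernel-internal errno (include/linux/errno.h); userspace headers lack it. */
#ifndef EJUKEBOX
#define EJUKEBOX 528
#endif

/* Hypothetical stand-in for the MDS session state. */
enum sketch_session_state { SKETCH_SESSION_OPEN, SKETCH_SESSION_RECONNECTING };

/* Hypothetical request: only the field this sketch needs. */
struct sketch_request {
	bool async;	/* models CEPH_MDS_R_ASYNC in r_req_flags */
};

/*
 * Models the decision in __do_request(): an async request is never
 * queued. If it cannot be sent right now, fail with -EJUKEBOX.
 * Returns 0 when sent, 1 when a sync request was queued.
 */
static int sketch_do_request(enum sketch_session_state state,
			     const struct sketch_request *req)
{
	if (state == SKETCH_SESSION_OPEN)
		return 0;		/* transmitted on the first attempt */
	if (req->async)
		return -EJUKEBOX;	/* caller must retry synchronously */
	return 1;			/* sync requests may wait for the session */
}

/* Caller-side fallback: try async first, then reissue as a sync request. */
static int sketch_submit_with_fallback(enum sketch_session_state state)
{
	struct sketch_request req = { .async = true };
	int err = sketch_do_request(state, &req);

	if (err == -EJUKEBOX) {
		req.async = false;
		err = sketch_do_request(state, &req);
	}
	return err;
}
```

The point of the design is that the async path never blocks on session
reestablishment; only the synchronous retry is allowed to wait.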

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 896d30820035..6004ea0d2ef1 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1312,6 +1312,7 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 		err = fill_inode(in, req->r_locked_page, &rinfo->targeti, NULL,
 				session,
 				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
+				 !test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags) &&
 				 rinfo->head->result == 0) ?  req->r_fmode : -1,
 				&req->r_caps_reservation);
 		if (err < 0) {
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index baf801ba34d9..5d6959c0cf33 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2523,6 +2523,8 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
 	rhead->oldest_client_tid = cpu_to_le64(__get_oldest_tid(mdsc));
 	if (test_bit(CEPH_MDS_R_GOT_UNSAFE, &req->r_req_flags))
 		flags |= CEPH_MDS_FLAG_REPLAY;
+	if (test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags))
+		flags |= CEPH_MDS_FLAG_ASYNC;
 	if (req->r_parent)
 		flags |= CEPH_MDS_FLAG_WANT_DENTRY;
 	rhead->flags = cpu_to_le32(flags);
@@ -2606,6 +2608,10 @@ static void __do_request(struct ceph_mds_client *mdsc,
 	mds = __choose_mds(mdsc, req, &random);
 	if (mds < 0 ||
 	    ceph_mdsmap_get_state(mdsc->mdsmap, mds) < CEPH_MDS_STATE_ACTIVE) {
+		if (test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags)) {
+			err = -EJUKEBOX;
+			goto finish;
+		}
 		dout("do_request no mds or not active, waiting for map\n");
 		list_add(&req->r_wait, &mdsc->waiting_for_map);
 		return;
@@ -2630,6 +2636,15 @@ static void __do_request(struct ceph_mds_client *mdsc,
 			err = -EACCES;
 			goto out_session;
 		}
+		/*
+		 * We cannot queue async requests since the caps and delegated
+		 * inodes are bound to the session. Just return -EJUKEBOX and
+		 * let the caller retry a sync request in that case.
+		 */
+		if (test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags)) {
+			err = -EJUKEBOX;
+			goto out_session;
+		}
 		if (session->s_state == CEPH_MDS_SESSION_NEW ||
 		    session->s_state == CEPH_MDS_SESSION_CLOSING) {
 			__open_session(mdsc, session);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index a0918d00117c..95ac00e59e66 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -255,6 +255,7 @@ struct ceph_mds_request {
 #define CEPH_MDS_R_GOT_RESULT		(5) /* got a result */
 #define CEPH_MDS_R_DID_PREPOPULATE	(6) /* prepopulated readdir */
 #define CEPH_MDS_R_PARENT_LOCKED	(7) /* is r_parent->i_rwsem wlocked? */
+#define CEPH_MDS_R_ASYNC		(8) /* async request */
 	unsigned long	r_req_flags;
 
 	struct mutex r_fill_mutex;
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index 8017130a08a1..81d934dae129 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -444,8 +444,9 @@ union ceph_mds_request_args {
 	} __attribute__ ((packed)) lookupino;
 } __attribute__ ((packed));
 
-#define CEPH_MDS_FLAG_REPLAY        1  /* this is a replayed op */
-#define CEPH_MDS_FLAG_WANT_DENTRY   2  /* want dentry in reply */
+#define CEPH_MDS_FLAG_REPLAY		1 /* this is a replayed op */
+#define CEPH_MDS_FLAG_WANT_DENTRY	2 /* want dentry in reply */
+#define CEPH_MDS_FLAG_ASYNC		4 /* request is asynchronous */
 
 struct ceph_mds_request_head {
 	__le64 oldest_client_tid;
-- 
2.24.1

* [PATCH v6 03/13] ceph: track primary dentry link
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 01/13] ceph: make kick_flushing_inode_caps non-static Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 02/13] ceph: add flag to designate that a request is asynchronous Jeff Layton
@ 2020-03-02 14:14 ` Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 04/13] ceph: add infrastructure for waiting for async create to complete Jeff Layton
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

Newer versions of the MDS will flag a dentry as "primary". In later
patches, we'll need to consult this info, so track it in di->flags.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/dir.c                | 1 +
 fs/ceph/inode.c              | 8 +++++++-
 fs/ceph/super.h              | 1 +
 include/linux/ceph/ceph_fs.h | 3 +++
 4 files changed, 12 insertions(+), 1 deletion(-)
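
A small userspace sketch of the flag handling, in case it helps review.
The three constants are copied from this patch; struct sketch_dentry_info
and sketch_update_from_lease() are hypothetical reductions of
ceph_dentry_info and __update_dentry_lease(), kept only to show how the
lease mask maps onto di->flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the patch. */
#define CEPH_LEASE_VALID		(1 | 2)	/* old and new bit values */
#define CEPH_LEASE_PRIMARY_LINK		4	/* primary linkage */
#define CEPH_DENTRY_PRIMARY_LINK	8

/* Hypothetical reduced dentry info: only the flags word. */
struct sketch_dentry_info {
	unsigned int flags;
};

/*
 * Mirror the primary-link bit from the lease mask into di->flags.
 * Returns true if the mask also carries a valid lease.
 */
static bool sketch_update_from_lease(struct sketch_dentry_info *di,
				     uint16_t mask)
{
	if (mask & CEPH_LEASE_PRIMARY_LINK)
		di->flags |= CEPH_DENTRY_PRIMARY_LINK;
	else
		di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;

	return (mask & CEPH_LEASE_VALID) != 0;
}
```

Note that the primary-link bit is updated even when no valid lease is
present, which matches the ordering of the two checks in the diff below.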

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index d0cd0aba5843..a87274935a09 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1411,6 +1411,7 @@ void ceph_invalidate_dentry_lease(struct dentry *dentry)
 	spin_lock(&dentry->d_lock);
 	di->time = jiffies;
 	di->lease_shared_gen = 0;
+	di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
 	__dentry_lease_unlist(di);
 	spin_unlock(&dentry->d_lock);
 }
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 6004ea0d2ef1..e9a29aae9466 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1052,6 +1052,7 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
 				  struct ceph_mds_session **old_lease_session)
 {
 	struct ceph_dentry_info *di = ceph_dentry(dentry);
+	unsigned mask = le16_to_cpu(lease->mask);
 	long unsigned duration = le32_to_cpu(lease->duration_ms);
 	long unsigned ttl = from_time + (duration * HZ) / 1000;
 	long unsigned half_ttl = from_time + (duration * HZ / 2) / 1000;
@@ -1063,8 +1064,13 @@ static void __update_dentry_lease(struct inode *dir, struct dentry *dentry,
 	if (ceph_snap(dir) != CEPH_NOSNAP)
 		return;
 
+	if (mask & CEPH_LEASE_PRIMARY_LINK)
+		di->flags |= CEPH_DENTRY_PRIMARY_LINK;
+	else
+		di->flags &= ~CEPH_DENTRY_PRIMARY_LINK;
+
 	di->lease_shared_gen = atomic_read(&ceph_inode(dir)->i_shared_gen);
-	if (duration == 0) {
+	if (!(mask & CEPH_LEASE_VALID)) {
 		__ceph_dentry_dir_lease_touch(di);
 		return;
 	}
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index d10513c6f0a1..6acecb7cf6d2 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -284,6 +284,7 @@ struct ceph_dentry_info {
 #define CEPH_DENTRY_REFERENCED		1
 #define CEPH_DENTRY_LEASE_LIST		2
 #define CEPH_DENTRY_SHRINK_LIST		4
+#define CEPH_DENTRY_PRIMARY_LINK	8
 
 struct ceph_inode_xattrs_info {
 	/*
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index 81d934dae129..a45b1c5605b8 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -531,6 +531,9 @@ struct ceph_mds_reply_lease {
 	__le32 seq;
 } __attribute__ ((packed));
 
+#define CEPH_LEASE_VALID        (1 | 2) /* old and new bit values */
+#define CEPH_LEASE_PRIMARY_LINK 4       /* primary linkage */
+
 struct ceph_mds_reply_dirfrag {
 	__le32 frag;            /* fragment */
 	__le32 auth;            /* auth mds, if this is a delegation point */
-- 
2.24.1

* [PATCH v6 04/13] ceph: add infrastructure for waiting for async create to complete
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
                   ` (2 preceding siblings ...)
  2020-03-02 14:14 ` [PATCH v6 03/13] ceph: track primary dentry link Jeff Layton
@ 2020-03-02 14:14 ` Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 05/13] ceph: make __take_cap_refs non-static Jeff Layton
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

When we issue an async create, we must ensure that any later on-the-wire
requests involving it wait for the create reply.

Expand i_ceph_flags to be an unsigned long, and add a new bit that
MDS requests can wait on. If the bit is set in the inode when sending
caps, then don't send them and just report that the send was delayed.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c       | 28 ++++++++++++++++++++++++----
 fs/ceph/dir.c        |  2 +-
 fs/ceph/mds_client.c | 20 +++++++++++++++++++-
 fs/ceph/mds_client.h |  7 +++++++
 fs/ceph/super.h      |  4 +++-
 5 files changed, 54 insertions(+), 7 deletions(-)
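
To illustrate the "skip and requeue" behaviour, here is a rough userspace
model. The two CEPH_* macros are copied from this patch; struct
sketch_inode, its counters, and both sketch_* functions are hypothetical.
In particular, sketch_create_reply() only assumes that something clears
the bit once the create reply arrives; that code is not part of this
patch, and the real waiters sleep in wait_on_bit() rather than polling.

```c
#include <stdbool.h>

/* Values copied from the patch. */
#define CEPH_ASYNC_CREATE_BIT	(12)
#define CEPH_I_ASYNC_CREATE	(1 << CEPH_ASYNC_CREATE_BIT)

/* Hypothetical reduced inode: flags word plus two counters for the demo. */
struct sketch_inode {
	unsigned long i_ceph_flags;
	int caps_sent;
	int requeued;
};

/*
 * Models the early exit added to ceph_check_caps(): while the create
 * reply is outstanding, requeue instead of sending caps.
 * Returns true if the caps were sent.
 */
static bool sketch_check_caps(struct sketch_inode *ci)
{
	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
		ci->requeued++;
		return false;
	}
	ci->caps_sent++;
	return true;
}

/* Assumed behaviour on create reply: clear the bit so senders can proceed. */
static void sketch_create_reply(struct sketch_inode *ci)
{
	ci->i_ceph_flags &= ~(unsigned long)CEPH_I_ASYNC_CREATE;
}
```

Keeping a separate bit number and mask is what lets the same flag be
tested with a plain "&" under i_ceph_lock and waited on by bit number.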

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 553fd1d52456..3fff4945f10e 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -507,7 +507,7 @@ static void __cap_set_timeouts(struct ceph_mds_client *mdsc,
 static void __cap_delay_requeue(struct ceph_mds_client *mdsc,
 				struct ceph_inode_info *ci)
 {
-	dout("__cap_delay_requeue %p flags %d at %lu\n", &ci->vfs_inode,
+	dout("%s %p flags 0x%lx at %lu\n", __func__, &ci->vfs_inode,
 	     ci->i_ceph_flags, ci->i_hold_caps_max);
 	if (!mdsc->stopping) {
 		spin_lock(&mdsc->cap_delay_lock);
@@ -1843,6 +1843,14 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
 	bool tried_invalidate = false;
 
 	spin_lock(&ci->i_ceph_lock);
+
+	/* Just requeue it until create reply comes in */
+	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
+		__cap_delay_requeue(mdsc, ci);
+		spin_unlock(&ci->i_ceph_lock);
+		return;
+	}
+
 	if (ci->i_ceph_flags & CEPH_I_FLUSH)
 		flags |= CHECK_CAPS_FLUSH;
 
@@ -2080,6 +2088,7 @@ static int try_flush_caps(struct inode *inode, u64 *ptid)
 
 retry:
 	spin_lock(&ci->i_ceph_lock);
+	WARN_ON_ONCE(ci->i_ceph_flags & CEPH_I_ASYNC_CREATE);
 retry_locked:
 	if (ci->i_dirty_caps && ci->i_auth_cap) {
 		struct ceph_cap *cap = ci->i_auth_cap;
@@ -2212,6 +2221,10 @@ int ceph_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	if (datasync)
 		goto out;
 
+	ret = ceph_wait_on_async_create(inode);
+	if (ret)
+		goto out;
+
 	dirty = try_flush_caps(inode, &flush_tid);
 	dout("fsync dirty caps are %s\n", ceph_cap_string(dirty));
 
@@ -2259,10 +2272,13 @@ int ceph_write_inode(struct inode *inode, struct writeback_control *wbc)
 
 	dout("write_inode %p wait=%d\n", inode, wait);
 	if (wait) {
-		dirty = try_flush_caps(inode, &flush_tid);
-		if (dirty)
-			err = wait_event_interruptible(ci->i_cap_wq,
+		err = ceph_wait_on_async_create(inode);
+		if (!err) {
+			dirty = try_flush_caps(inode, &flush_tid);
+			if (dirty)
+				err = wait_event_interruptible(ci->i_cap_wq,
 				       caps_are_flushed(inode, flush_tid));
+		}
 	} else {
 		struct ceph_mds_client *mdsc =
 			ceph_sb_to_client(inode->i_sb)->mdsc;
@@ -2289,6 +2305,10 @@ static void __kick_flushing_caps(struct ceph_mds_client *mdsc,
 	u64 first_tid = 0;
 	u64 last_snap_flush = 0;
 
+	/* Can't flush an inode that's not created yet */
+	if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE)
+		return;
+
 	ci->i_ceph_flags &= ~CEPH_I_KICK_FLUSH;
 
 	list_for_each_entry_reverse(cf, &ci->i_cap_flush_list, i_list) {
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index a87274935a09..5b83bda57056 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -752,7 +752,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
 		struct ceph_dentry_info *di = ceph_dentry(dentry);
 
 		spin_lock(&ci->i_ceph_lock);
-		dout(" dir %p flags are %d\n", dir, ci->i_ceph_flags);
+		dout(" dir %p flags are 0x%lx\n", dir, ci->i_ceph_flags);
 		if (strncmp(dentry->d_name.name,
 			    fsc->mount_options->snapdir_name,
 			    dentry->d_name.len) &&
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 5d6959c0cf33..2fbe505e5b2e 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2725,7 +2725,7 @@ static void kick_requests(struct ceph_mds_client *mdsc, int mds)
 int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
 			      struct ceph_mds_request *req)
 {
-	int err;
+	int err = 0;
 
 	/* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
 	if (req->r_inode)
@@ -2738,6 +2738,24 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
 		ceph_get_cap_refs(ceph_inode(req->r_old_dentry_dir),
 				  CEPH_CAP_PIN);
 
+	if (req->r_inode) {
+		err = ceph_wait_on_async_create(req->r_inode);
+		if (err) {
+			dout("%s: wait for async create returned: %d\n",
+			     __func__, err);
+			return err;
+		}
+	}
+
+	if (!err && req->r_old_inode) {
+		err = ceph_wait_on_async_create(req->r_old_inode);
+		if (err) {
+			dout("%s: wait for async create returned: %d\n",
+			     __func__, err);
+			return err;
+		}
+	}
+
 	dout("submit_request on %p for inode %p\n", req, dir);
 	mutex_lock(&mdsc->mutex);
 	__register_request(mdsc, req, dir);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 95ac00e59e66..8043f2b439b1 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -538,4 +538,11 @@ extern void ceph_mdsc_open_export_target_sessions(struct ceph_mds_client *mdsc,
 extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
 			  struct ceph_mds_session *session,
 			  int max_caps);
+static inline int ceph_wait_on_async_create(struct inode *inode)
+{
+	struct ceph_inode_info *ci = ceph_inode(inode);
+
+	return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
+			   TASK_INTERRUPTIBLE);
+}
 #endif
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 6acecb7cf6d2..8cfacee5e856 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -316,7 +316,7 @@ struct ceph_inode_info {
 	u64 i_inline_version;
 	u32 i_time_warp_seq;
 
-	unsigned i_ceph_flags;
+	unsigned long i_ceph_flags;
 	atomic64_t i_release_count;
 	atomic64_t i_ordered_count;
 	atomic64_t i_complete_seq[2];
@@ -523,6 +523,8 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
 #define CEPH_I_ERROR_WRITE	(1 << 9)  /* have seen write errors */
 #define CEPH_I_ERROR_FILELOCK	(1 << 10) /* have seen file lock errors */
 #define CEPH_I_ODIRECT		(1 << 11) /* inode in direct I/O mode */
+#define CEPH_ASYNC_CREATE_BIT	(12)	  /* async create in flight for this */
+#define CEPH_I_ASYNC_CREATE	(1 << CEPH_ASYNC_CREATE_BIT)
 
 /*
  * Masks of ceph inode work.
-- 
2.24.1

* [PATCH v6 05/13] ceph: make __take_cap_refs non-static
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
                   ` (3 preceding siblings ...)
  2020-03-02 14:14 ` [PATCH v6 04/13] ceph: add infrastructure for waiting for async create to complete Jeff Layton
@ 2020-03-02 14:14 ` Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 06/13] ceph: cap tracking for async directory operations Jeff Layton
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

Rename it to ceph_take_cap_refs and make it available to other files.
Also replace a comment with a lockdep assertion.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c  | 12 ++++++------
 fs/ceph/super.h |  2 ++
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 3fff4945f10e..526743d65244 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2474,12 +2474,12 @@ void ceph_kick_flushing_inode_caps(struct ceph_mds_session *session,
 /*
  * Take references to capabilities we hold, so that we don't release
  * them to the MDS prematurely.
- *
- * Protected by i_ceph_lock.
  */
-static void __take_cap_refs(struct ceph_inode_info *ci, int got,
+void ceph_take_cap_refs(struct ceph_inode_info *ci, int got,
 			    bool snap_rwsem_locked)
 {
+	lockdep_assert_held(&ci->i_ceph_lock);
+
 	if (got & CEPH_CAP_PIN)
 		ci->i_pin_ref++;
 	if (got & CEPH_CAP_FILE_RD)
@@ -2500,7 +2500,7 @@ static void __take_cap_refs(struct ceph_inode_info *ci, int got,
 		if (ci->i_wb_ref == 0)
 			ihold(&ci->vfs_inode);
 		ci->i_wb_ref++;
-		dout("__take_cap_refs %p wb %d -> %d (?)\n",
+		dout("%s %p wb %d -> %d (?)\n", __func__,
 		     &ci->vfs_inode, ci->i_wb_ref-1, ci->i_wb_ref);
 	}
 }
@@ -2614,7 +2614,7 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
 			    (need & CEPH_CAP_FILE_RD) &&
 			    !(*got & CEPH_CAP_FILE_CACHE))
 				ceph_disable_fscache_readpage(ci);
-			__take_cap_refs(ci, *got, true);
+			ceph_take_cap_refs(ci, *got, true);
 			ret = 1;
 		}
 	} else {
@@ -2862,7 +2862,7 @@ int ceph_get_caps(struct file *filp, int need, int want,
 void ceph_get_cap_refs(struct ceph_inode_info *ci, int caps)
 {
 	spin_lock(&ci->i_ceph_lock);
-	__take_cap_refs(ci, caps, false);
+	ceph_take_cap_refs(ci, caps, false);
 	spin_unlock(&ci->i_ceph_lock);
 }
 
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 8cfacee5e856..2d3b88f674ca 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1058,6 +1058,8 @@ void ceph_kick_flushing_inode_caps(struct ceph_mds_session *session,
 				   struct ceph_inode_info *ci);
 extern struct ceph_cap *ceph_get_cap_for_mds(struct ceph_inode_info *ci,
 					     int mds);
+extern void ceph_take_cap_refs(struct ceph_inode_info *ci, int caps,
+				bool snap_rwsem_locked);
 extern void ceph_get_cap_refs(struct ceph_inode_info *ci, int caps);
 extern void ceph_put_cap_refs(struct ceph_inode_info *ci, int had);
 extern void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
-- 
2.24.1

* [PATCH v6 06/13] ceph: cap tracking for async directory operations
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
                   ` (4 preceding siblings ...)
  2020-03-02 14:14 ` [PATCH v6 05/13] ceph: make __take_cap_refs non-static Jeff Layton
@ 2020-03-02 14:14 ` Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 07/13] ceph: don't take refs to want mask unless we have all bits Jeff Layton
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

Track and correctly handle directory caps for asynchronous operations.
Add aliases for Frc caps that we now designate as Dcu caps (when dealing
with directories).

Unlike file caps, we don't reclaim these when the session goes away, and
instead preemptively release them. In-flight async dirops are instead
handled during the reconnect phase. The client needs to redo a
synchronous operation in order to get the directory caps back.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c               | 27 +++++++++++++++++++--------
 fs/ceph/mds_client.c         | 31 ++++++++++++++++++++++++++-----
 fs/ceph/mds_client.h         |  6 +++++-
 include/linux/ceph/ceph_fs.h |  6 ++++++
 4 files changed, 56 insertions(+), 14 deletions(-)
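
Since ceph_mdsc_release_dir_caps() can be reached from more than one
path, the release has to be idempotent. The sketch below models that with
C11 atomic_exchange() standing in for the kernel's xchg(). Everything
named sketch_* is hypothetical, and the single cap_refs counter is a
simplification of what ceph_put_cap_refs() really tracks.

```c
#include <stdatomic.h>

/* Hypothetical reduced parent inode for the demo. */
struct sketch_parent {
	int cap_refs;		/* outstanding references on dir caps */
};

/* Hypothetical reduced request. */
struct sketch_request {
	atomic_int r_dir_caps;	/* cap bits held for this request, 0 if none */
	struct sketch_parent *parent;
};

/*
 * Models ceph_mdsc_release_dir_caps(): atomically swap the field to 0
 * and drop the reference only if something was actually held, so that
 * calling this from several paths releases the caps exactly once.
 * Returns the cap bits that were released.
 */
static int sketch_release_dir_caps(struct sketch_request *req)
{
	int dcaps = atomic_exchange(&req->r_dir_caps, 0);

	if (dcaps)
		req->parent->cap_refs--;
	return dcaps;
}
```

A second call sees 0 in r_dir_caps and does nothing, which is why the
release can safely be issued both on replay and on final request teardown.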

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 526743d65244..51483ba572b3 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1011,7 +1011,11 @@ int __ceph_caps_file_wanted(struct ceph_inode_info *ci)
 int __ceph_caps_wanted(struct ceph_inode_info *ci)
 {
 	int w = __ceph_caps_file_wanted(ci) | __ceph_caps_used(ci);
-	if (!S_ISDIR(ci->vfs_inode.i_mode)) {
+	if (S_ISDIR(ci->vfs_inode.i_mode)) {
+		/* we want EXCL if holding caps of dir ops */
+		if (w & CEPH_CAP_ANY_DIR_OPS)
+			w |= CEPH_CAP_FILE_EXCL;
+	} else {
 		/* we want EXCL if dirty data */
 		if (w & CEPH_CAP_FILE_BUFFER)
 			w |= CEPH_CAP_FILE_EXCL;
@@ -1877,10 +1881,13 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
 			 * revoking the shared cap on every create/unlink
 			 * operation.
 			 */
-			if (IS_RDONLY(inode))
+			if (IS_RDONLY(inode)) {
 				want = CEPH_CAP_ANY_SHARED;
-			else
-				want = CEPH_CAP_ANY_SHARED | CEPH_CAP_FILE_EXCL;
+			} else {
+				want = CEPH_CAP_ANY_SHARED |
+				       CEPH_CAP_FILE_EXCL |
+				       CEPH_CAP_ANY_DIR_OPS;
+			}
 			retain |= want;
 		} else {
 
@@ -2708,10 +2715,14 @@ int ceph_try_get_caps(struct inode *inode, int need, int want,
 	int ret, flags;
 
 	BUG_ON(need & ~CEPH_CAP_FILE_RD);
-	BUG_ON(want & ~(CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO|CEPH_CAP_FILE_SHARED));
-	ret = ceph_pool_perm_check(inode, need);
-	if (ret < 0)
-		return ret;
+	BUG_ON(want & ~(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO |
+			CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
+			CEPH_CAP_ANY_DIR_OPS));
+	if (need) {
+		ret = ceph_pool_perm_check(inode, need);
+		if (ret < 0)
+			return ret;
+	}
 
 	flags = get_used_fmode(need | want);
 	if (nonblock)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 2fbe505e5b2e..db8304447f35 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -699,6 +699,7 @@ void ceph_mdsc_release_request(struct kref *kref)
 	struct ceph_mds_request *req = container_of(kref,
 						    struct ceph_mds_request,
 						    r_kref);
+	ceph_mdsc_release_dir_caps(req);
 	destroy_reply_info(&req->r_reply_info);
 	if (req->r_request)
 		ceph_msg_put(req->r_request);
@@ -3275,6 +3276,17 @@ static void handle_session(struct ceph_mds_session *session,
 	return;
 }
 
+void ceph_mdsc_release_dir_caps(struct ceph_mds_request *req)
+{
+	int dcaps;
+
+	dcaps = xchg(&req->r_dir_caps, 0);
+	if (dcaps) {
+		dout("releasing r_dir_caps=%s\n", ceph_cap_string(dcaps));
+		ceph_put_cap_refs(ceph_inode(req->r_parent), dcaps);
+	}
+}
+
 /*
  * called under session->mutex.
  */
@@ -3302,9 +3314,14 @@ static void replay_unsafe_requests(struct ceph_mds_client *mdsc,
 			continue;
 		if (req->r_attempts == 0)
 			continue; /* only old requests */
-		if (req->r_session &&
-		    req->r_session->s_mds == session->s_mds)
-			__send_request(mdsc, session, req, true);
+		if (!req->r_session)
+			continue;
+		if (req->r_session->s_mds != session->s_mds)
+			continue;
+
+		ceph_mdsc_release_dir_caps(req);
+
+		__send_request(mdsc, session, req, true);
 	}
 	mutex_unlock(&mdsc->mutex);
 }
@@ -3388,7 +3405,7 @@ static int send_reconnect_partial(struct ceph_reconnect_state *recon_state)
 /*
  * Encode information about a cap for a reconnect with the MDS.
  */
-static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
+static int reconnect_caps_cb(struct inode *inode, struct ceph_cap *cap,
 			  void *arg)
 {
 	union {
@@ -3411,6 +3428,10 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
 	cap->mseq = 0;       /* and migrate_seq */
 	cap->cap_gen = cap->session->s_cap_gen;
 
+	/* These are lost when the session goes away */
+	if (S_ISDIR(inode->i_mode))
+		cap->issued &= ~CEPH_CAP_ANY_DIR_OPS;
+
 	if (recon_state->msg_version >= 2) {
 		rec.v2.cap_id = cpu_to_le64(cap->cap_id);
 		rec.v2.wanted = cpu_to_le32(__ceph_caps_wanted(ci));
@@ -3707,7 +3728,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
 		recon_state.msg_version = 2;
 	}
 	/* trsaverse this session's caps */
-	err = ceph_iterate_session_caps(session, encode_caps_cb, &recon_state);
+	err = ceph_iterate_session_caps(session, reconnect_caps_cb, &recon_state);
 
 	spin_lock(&session->s_cap_lock);
 	session->s_cap_reconnect = 0;
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 8043f2b439b1..f10d342ea585 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -284,8 +284,11 @@ struct ceph_mds_request {
 	struct ceph_msg  *r_request;  /* original request */
 	struct ceph_msg  *r_reply;
 	struct ceph_mds_reply_info_parsed r_reply_info;
-	struct page *r_locked_page;
 	int r_err;
+
+
+	struct page *r_locked_page;
+	int r_dir_caps;
 	int r_num_caps;
 	u32               r_readdir_offset;
 
@@ -489,6 +492,7 @@ extern int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc,
 extern int ceph_mdsc_do_request(struct ceph_mds_client *mdsc,
 				struct inode *dir,
 				struct ceph_mds_request *req);
+extern void ceph_mdsc_release_dir_caps(struct ceph_mds_request *req);
 static inline void ceph_mdsc_get_request(struct ceph_mds_request *req)
 {
 	kref_get(&req->r_kref);
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index a45b1c5605b8..e63a5c0b6d62 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -664,6 +664,12 @@ int ceph_flags_to_mode(int flags);
 #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
 			CEPH_LOCK_IXATTR)
 
+/* cap masks async dir operations */
+#define CEPH_CAP_DIR_CREATE	CEPH_CAP_FILE_CACHE
+#define CEPH_CAP_DIR_UNLINK	CEPH_CAP_FILE_RD
+#define CEPH_CAP_ANY_DIR_OPS	(CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_RD | \
+				 CEPH_CAP_FILE_WREXTEND | CEPH_CAP_FILE_LAZYIO)
+
 int ceph_caps_for_mode(int mode);
 
 enum {
-- 
2.24.1

* [PATCH v6 07/13] ceph: don't take refs to want mask unless we have all bits
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
                   ` (5 preceding siblings ...)
  2020-03-02 14:14 ` [PATCH v6 06/13] ceph: cap tracking for async directory operations Jeff Layton
@ 2020-03-02 14:14 ` Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 08/13] ceph: perform asynchronous unlink if we have sufficient caps Jeff Layton
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

From: "Yan, Zheng" <ukernel@gmail.com>

If we don't have all of the cap bits for the want mask in
try_get_cap_refs, then just take refs on the need bits.

Signed-off-by: "Yan, Zheng" <ukernel@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Zheng,

I broke this patch out on its own as I wasn't sure it was still
needed with the latest iteration of the code. We can fold it into
the previous one if we do want it, or just drop it.

Thanks,
Jeff
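
For comparison, the old and new mask computations can be sketched in
isolation as below. The function names are hypothetical; the two
expressions are the ones in the diff, and the cap values used when
calling them are arbitrary bit patterns, not real cap masks.

```c
/*
 * New behaviour: compute the caps to take references on.
 * need: bits the caller cannot proceed without
 * want: optional bits, taken only as a complete set
 * have: bits currently issued
 */
static int sketch_caps_to_take(int need, int want, int have)
{
	if ((have & want) == want)
		return need | want;
	return need;
}

/* Previous behaviour: take whatever subset of want happens to be held. */
static int sketch_caps_to_take_old(int need, int want, int have)
{
	return need | (have & want);
}
```

With need=0x1, want=0x6 and have=0x3, the old expression yields 0x3 (a
partial want set), while the new one falls back to 0x1.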

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 51483ba572b3..c60b28304c50 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2616,7 +2616,10 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
 				}
 				snap_rwsem_locked = true;
 			}
-			*got = need | (have & want);
+			if ((have & want) == want)
+				*got = need | want;
+			else
+				*got = need;
 			if (S_ISREG(inode->i_mode) &&
 			    (need & CEPH_CAP_FILE_RD) &&
 			    !(*got & CEPH_CAP_FILE_CACHE))
-- 
2.24.1


* [PATCH v6 08/13] ceph: perform asynchronous unlink if we have sufficient caps
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
                   ` (6 preceding siblings ...)
  2020-03-02 14:14 ` [PATCH v6 07/13] ceph: don't take refs to want mask unless we have all bits Jeff Layton
@ 2020-03-02 14:14 ` Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 09/13] ceph: make ceph_fill_inode non-static Jeff Layton
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

The MDS is getting a new lock-caching facility that will allow it
to cache the necessary locks to allow asynchronous directory operations.
Since the CEPH_CAP_FILE_* caps are currently unused on directories,
we can repurpose those bits for this.

When performing an unlink, if we have Fx and CEPH_CAP_DIR_UNLINK (aka Fr)
on the parent directory, and we know that the dentry being removed is the
primary link, then we can fire off an unlink request immediately and don't
need to wait on the reply before returning.

In that situation, just fix up the dcache and link count and return
immediately after issuing the call to the MDS. This does mean that we
need to hold an extra reference to the inode being unlinked, and extra
references to the caps to avoid races. Those references are put and
error handling is done in the r_callback routine.

If the operation ends up failing, then set a writeback error on the
directory inode and on the unlinked inode itself. The error can be
fetched later by an fsync on the dir.

The behavior of dir caps is slightly different from caps on normal
files. Because these are just considered an optimization, if the
session is reconnected, we will not automatically reclaim them. They
are instead considered lost until we do another synchronous op in the
parent directory.

Async dirops are enabled via the "nowsync" mount option, which is
patterned after the xfs "wsync" mount option. For now, the default
is "wsync", but eventually we may flip that.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/dir.c   | 108 ++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ceph/super.c |  20 +++++++++
 fs/ceph/super.h |   5 ++-
 3 files changed, 128 insertions(+), 5 deletions(-)
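
Since the option is registered as a negatable flag, the sense of the test
in ceph_parse_mount_param is easy to misread. A minimal userspace sketch
of the mapping follows; FAKE_OPT_ASYNC_DIROPS and apply_wsync() are
illustrative names only, and "negated" stands in for result.negated.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for CEPH_MOUNT_OPT_ASYNC_DIROPS (same bit position as in this patch) */
#define FAKE_OPT_ASYNC_DIROPS (1u << 15)

/* "wsync" (negated == false) clears the flag; "nowsync" (negated == true) sets it */
static unsigned int apply_wsync(unsigned int flags, bool negated)
{
	if (!negated)
		flags &= ~FAKE_OPT_ASYNC_DIROPS;
	else
		flags |= FAKE_OPT_ASYNC_DIROPS;
	return flags;
}

/* With neither option given the flag stays clear, i.e. "wsync" is the default */
static void example(void)
{
	unsigned int flags = 0;

	assert(!(flags & FAKE_OPT_ASYNC_DIROPS));
	flags = apply_wsync(flags, true);	/* -o nowsync */
	assert(flags & FAKE_OPT_ASYNC_DIROPS);
	flags = apply_wsync(flags, false);	/* -o wsync */
	assert(!(flags & FAKE_OPT_ASYNC_DIROPS));
}
```

The reconfigure path then copies that bit into the live mount options, so
a remount can flip between the two modes.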

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 5b83bda57056..ee6b319e5481 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1036,6 +1036,78 @@ static int ceph_link(struct dentry *old_dentry, struct inode *dir,
 	return err;
 }
 
+static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
+				 struct ceph_mds_request *req)
+{
+	int result = req->r_err ? req->r_err :
+			le32_to_cpu(req->r_reply_info.head->result);
+
+	if (result == -EJUKEBOX)
+		goto out;
+
+	/* If op failed, mark everyone involved for errors */
+	if (result) {
+		int pathlen;
+		u64 base;
+		char *path = ceph_mdsc_build_path(req->r_dentry, &pathlen,
+						  &base, 0);
+
+		/* mark error on parent + clear complete */
+		mapping_set_error(req->r_parent->i_mapping, result);
+		ceph_dir_clear_complete(req->r_parent);
+
+		/* drop the dentry -- we don't know its status */
+		if (!d_unhashed(req->r_dentry))
+			d_drop(req->r_dentry);
+
+		/* mark inode itself for an error (since metadata is bogus) */
+		mapping_set_error(req->r_old_inode->i_mapping, result);
+
+		pr_warn("ceph: async unlink failure path=(%llx)%s result=%d!\n",
+			base, IS_ERR(path) ? "<<bad>>" : path, result);
+		ceph_mdsc_free_path(path, pathlen);
+	}
+out:
+	iput(req->r_old_inode);
+	ceph_mdsc_release_dir_caps(req);
+}
+
+static int get_caps_for_async_unlink(struct inode *dir, struct dentry *dentry)
+{
+	struct ceph_inode_info *ci = ceph_inode(dir);
+	struct ceph_dentry_info *di;
+	int got = 0, want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_UNLINK;
+
+	spin_lock(&ci->i_ceph_lock);
+	if ((__ceph_caps_issued(ci, NULL) & want) == want) {
+		ceph_take_cap_refs(ci, want, false);
+		got = want;
+	}
+	spin_unlock(&ci->i_ceph_lock);
+
+	/* If we didn't get anything, return 0 */
+	if (!got)
+		return 0;
+
+	spin_lock(&dentry->d_lock);
+	di = ceph_dentry(dentry);
+	/*
+	 * - We are holding Fx, which implies Fs caps.
+	 * - Only support async unlink for primary linkage
+	 */
+	if (atomic_read(&ci->i_shared_gen) != di->lease_shared_gen ||
+	    !(di->flags & CEPH_DENTRY_PRIMARY_LINK))
+		want = 0;
+	spin_unlock(&dentry->d_lock);
+
+	/* Do we still want what we've got? */
+	if (want == got)
+		return got;
+
+	ceph_put_cap_refs(ci, got);
+	return 0;
+}
+
 /*
  * rmdir and unlink are differ only by the metadata op code
  */
@@ -1045,6 +1117,7 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
 	struct ceph_mds_client *mdsc = fsc->mdsc;
 	struct inode *inode = d_inode(dentry);
 	struct ceph_mds_request *req;
+	bool try_async = ceph_test_mount_opt(fsc, ASYNC_DIROPS);
 	int err = -EROFS;
 	int op;
 
@@ -1059,6 +1132,7 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
 			CEPH_MDS_OP_RMDIR : CEPH_MDS_OP_UNLINK;
 	} else
 		goto out;
+retry:
 	req = ceph_mdsc_create_request(mdsc, op, USE_AUTH_MDS);
 	if (IS_ERR(req)) {
 		err = PTR_ERR(req);
@@ -1067,13 +1141,39 @@ static int ceph_unlink(struct inode *dir, struct dentry *dentry)
 	req->r_dentry = dget(dentry);
 	req->r_num_caps = 2;
 	req->r_parent = dir;
-	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
 	req->r_dentry_drop = CEPH_CAP_FILE_SHARED;
 	req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
 	req->r_inode_drop = ceph_drop_caps_for_unlink(inode);
-	err = ceph_mdsc_do_request(mdsc, dir, req);
-	if (!err && !req->r_reply_info.head->is_dentry)
-		d_delete(dentry);
+
+	if (try_async && op == CEPH_MDS_OP_UNLINK &&
+	    (req->r_dir_caps = get_caps_for_async_unlink(dir, dentry))) {
+		dout("async unlink on %lu/%.*s caps=%s", dir->i_ino,
+		     dentry->d_name.len, dentry->d_name.name,
+		     ceph_cap_string(req->r_dir_caps));
+		set_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
+		req->r_callback = ceph_async_unlink_cb;
+		req->r_old_inode = d_inode(dentry);
+		ihold(req->r_old_inode);
+		err = ceph_mdsc_submit_request(mdsc, dir, req);
+		if (!err) {
+			/*
+			 * We have enough caps, so we assume that the unlink
+			 * will succeed. Fix up the target inode and dcache.
+			 */
+			drop_nlink(inode);
+			d_delete(dentry);
+		} else if (err == -EJUKEBOX) {
+			try_async = false;
+			ceph_mdsc_put_request(req);
+			goto retry;
+		}
+	} else {
+		set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
+		err = ceph_mdsc_do_request(mdsc, dir, req);
+		if (!err && !req->r_reply_info.head->is_dentry)
+			d_delete(dentry);
+	}
+
 	ceph_mdsc_put_request(req);
 out:
 	return err;
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index b1329cd5388a..c9784eb1159a 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -155,6 +155,7 @@ enum {
 	Opt_acl,
 	Opt_quotadf,
 	Opt_copyfrom,
+	Opt_wsync,
 };
 
 enum ceph_recover_session_mode {
@@ -194,6 +195,7 @@ static const struct fs_parameter_spec ceph_mount_parameters[] = {
 	fsparam_string	("snapdirname",			Opt_snapdirname),
 	fsparam_string	("source",			Opt_source),
 	fsparam_u32	("wsize",			Opt_wsize),
+	fsparam_flag_no	("wsync",			Opt_wsync),
 	{}
 };
 
@@ -444,6 +446,12 @@ static int ceph_parse_mount_param(struct fs_context *fc,
 			fc->sb_flags &= ~SB_POSIXACL;
 		}
 		break;
+	case Opt_wsync:
+		if (!result.negated)
+			fsopt->flags &= ~CEPH_MOUNT_OPT_ASYNC_DIROPS;
+		else
+			fsopt->flags |= CEPH_MOUNT_OPT_ASYNC_DIROPS;
+		break;
 	default:
 		BUG();
 	}
@@ -567,6 +575,9 @@ static int ceph_show_options(struct seq_file *m, struct dentry *root)
 	if (fsopt->flags & CEPH_MOUNT_OPT_CLEANRECOVER)
 		seq_show_option(m, "recover_session", "clean");
 
+	if (fsopt->flags & CEPH_MOUNT_OPT_ASYNC_DIROPS)
+		seq_puts(m, ",nowsync");
+
 	if (fsopt->wsize != CEPH_MAX_WRITE_SIZE)
 		seq_printf(m, ",wsize=%u", fsopt->wsize);
 	if (fsopt->rsize != CEPH_MAX_READ_SIZE)
@@ -1115,6 +1126,15 @@ static void ceph_free_fc(struct fs_context *fc)
 
 static int ceph_reconfigure_fc(struct fs_context *fc)
 {
+	struct ceph_parse_opts_ctx *pctx = fc->fs_private;
+	struct ceph_mount_options *fsopt = pctx->opts;
+	struct ceph_fs_client *fsc = ceph_sb_to_client(fc->root->d_sb);
+
+	if (fsopt->flags & CEPH_MOUNT_OPT_ASYNC_DIROPS)
+		ceph_set_mount_opt(fsc, ASYNC_DIROPS);
+	else
+		ceph_clear_mount_opt(fsc, ASYNC_DIROPS);
+
 	sync_filesystem(fc->root->d_sb);
 	return 0;
 }
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 2d3b88f674ca..d2c718143a04 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -43,13 +43,16 @@
 #define CEPH_MOUNT_OPT_MOUNTWAIT       (1<<12) /* mount waits if no mds is up */
 #define CEPH_MOUNT_OPT_NOQUOTADF       (1<<13) /* no root dir quota in statfs */
 #define CEPH_MOUNT_OPT_NOCOPYFROM      (1<<14) /* don't use RADOS 'copy-from' op */
+#define CEPH_MOUNT_OPT_ASYNC_DIROPS    (1<<15) /* allow async directory ops */
 
 #define CEPH_MOUNT_OPT_DEFAULT			\
 	(CEPH_MOUNT_OPT_DCACHE |		\
 	 CEPH_MOUNT_OPT_NOCOPYFROM)
 
 #define ceph_set_mount_opt(fsc, opt) \
-	(fsc)->mount_options->flags |= CEPH_MOUNT_OPT_##opt;
+	(fsc)->mount_options->flags |= CEPH_MOUNT_OPT_##opt
+#define ceph_clear_mount_opt(fsc, opt) \
+	(fsc)->mount_options->flags &= ~CEPH_MOUNT_OPT_##opt
 #define ceph_test_mount_opt(fsc, opt) \
 	(!!((fsc)->mount_options->flags & CEPH_MOUNT_OPT_##opt))
 
-- 
2.24.1


* [PATCH v6 09/13] ceph: make ceph_fill_inode non-static
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
                   ` (7 preceding siblings ...)
  2020-03-02 14:14 ` [PATCH v6 08/13] ceph: perform asynchronous unlink if we have sufficient caps Jeff Layton
@ 2020-03-02 14:14 ` Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 10/13] ceph: decode interval_sets for delegated inos Jeff Layton
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c | 47 ++++++++++++++++++++++++-----------------------
 fs/ceph/super.h |  8 ++++++++
 2 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index e9a29aae9466..0e2653b734c3 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -728,11 +728,11 @@ void ceph_fill_file_time(struct inode *inode, int issued,
  * Populate an inode based on info from mds.  May be called on new or
  * existing inodes.
  */
-static int fill_inode(struct inode *inode, struct page *locked_page,
-		      struct ceph_mds_reply_info_in *iinfo,
-		      struct ceph_mds_reply_dirfrag *dirinfo,
-		      struct ceph_mds_session *session, int cap_fmode,
-		      struct ceph_cap_reservation *caps_reservation)
+int ceph_fill_inode(struct inode *inode, struct page *locked_page,
+		    struct ceph_mds_reply_info_in *iinfo,
+		    struct ceph_mds_reply_dirfrag *dirinfo,
+		    struct ceph_mds_session *session, int cap_fmode,
+		    struct ceph_cap_reservation *caps_reservation)
 {
 	struct ceph_mds_client *mdsc = ceph_inode_to_client(inode)->mdsc;
 	struct ceph_mds_reply_inode *info = iinfo->in;
@@ -749,7 +749,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	bool new_version = false;
 	bool fill_inline = false;
 
-	dout("fill_inode %p ino %llx.%llx v %llu had %llu\n",
+	dout("%s %p ino %llx.%llx v %llu had %llu\n", __func__,
 	     inode, ceph_vinop(inode), le64_to_cpu(info->version),
 	     ci->i_version);
 
@@ -770,7 +770,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	if (iinfo->xattr_len > 4) {
 		xattr_blob = ceph_buffer_new(iinfo->xattr_len, GFP_NOFS);
 		if (!xattr_blob)
-			pr_err("fill_inode ENOMEM xattr blob %d bytes\n",
+			pr_err("%s ENOMEM xattr blob %d bytes\n", __func__,
 			       iinfo->xattr_len);
 	}
 
@@ -933,8 +933,9 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 			spin_unlock(&ci->i_ceph_lock);
 
 			if (symlen != i_size_read(inode)) {
-				pr_err("fill_inode %llx.%llx BAD symlink "
-					"size %lld\n", ceph_vinop(inode),
+				pr_err("%s %llx.%llx BAD symlink "
+					"size %lld\n", __func__,
+					ceph_vinop(inode),
 					i_size_read(inode));
 				i_size_write(inode, symlen);
 				inode->i_blocks = calc_inode_blocks(symlen);
@@ -958,7 +959,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 		inode->i_fop = &ceph_dir_fops;
 		break;
 	default:
-		pr_err("fill_inode %llx.%llx BAD mode 0%o\n",
+		pr_err("%s %llx.%llx BAD mode 0%o\n", __func__,
 		       ceph_vinop(inode), inode->i_mode);
 	}
 
@@ -1247,10 +1248,9 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 		struct inode *dir = req->r_parent;
 
 		if (dir) {
-			err = fill_inode(dir, NULL,
-					 &rinfo->diri, rinfo->dirfrag,
-					 session, -1,
-					 &req->r_caps_reservation);
+			err = ceph_fill_inode(dir, NULL, &rinfo->diri,
+					      rinfo->dirfrag, session, -1,
+					      &req->r_caps_reservation);
 			if (err < 0)
 				goto done;
 		} else {
@@ -1315,14 +1315,14 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 			goto done;
 		}
 
-		err = fill_inode(in, req->r_locked_page, &rinfo->targeti, NULL,
-				session,
+		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
+				NULL, session,
 				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
 				 !test_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags) &&
 				 rinfo->head->result == 0) ?  req->r_fmode : -1,
 				&req->r_caps_reservation);
 		if (err < 0) {
-			pr_err("fill_inode badness %p %llx.%llx\n",
+			pr_err("ceph_fill_inode badness %p %llx.%llx\n",
 				in, ceph_vinop(in));
 			if (in->i_state & I_NEW)
 				discard_new_inode(in);
@@ -1509,10 +1509,11 @@ static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req,
 			dout("new_inode badness got %d\n", err);
 			continue;
 		}
-		rc = fill_inode(in, NULL, &rde->inode, NULL, session,
-				-1, &req->r_caps_reservation);
+		rc = ceph_fill_inode(in, NULL, &rde->inode, NULL, session,
+				     -1, &req->r_caps_reservation);
 		if (rc < 0) {
-			pr_err("fill_inode badness on %p got %d\n", in, rc);
+			pr_err("ceph_fill_inode badness on %p got %d\n",
+			       in, rc);
 			err = rc;
 			if (in->i_state & I_NEW) {
 				ihold(in);
@@ -1716,10 +1717,10 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 			}
 		}
 
-		ret = fill_inode(in, NULL, &rde->inode, NULL, session,
-				 -1, &req->r_caps_reservation);
+		ret = ceph_fill_inode(in, NULL, &rde->inode, NULL, session,
+				      -1, &req->r_caps_reservation);
 		if (ret < 0) {
-			pr_err("fill_inode badness on %p\n", in);
+			pr_err("ceph_fill_inode badness on %p\n", in);
 			if (d_really_is_negative(dn)) {
 				/* avoid calling iput_final() in mds
 				 * dispatch threads */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index d2c718143a04..97f57fa0f42c 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -898,6 +898,9 @@ static inline bool __ceph_have_pending_cap_snap(struct ceph_inode_info *ci)
 }
 
 /* inode.c */
+struct ceph_mds_reply_info_in;
+struct ceph_mds_reply_dirfrag;
+
 extern const struct inode_operations ceph_file_iops;
 
 extern struct inode *ceph_alloc_inode(struct super_block *sb);
@@ -913,6 +916,11 @@ extern void ceph_fill_file_time(struct inode *inode, int issued,
 				u64 time_warp_seq, struct timespec64 *ctime,
 				struct timespec64 *mtime,
 				struct timespec64 *atime);
+extern int ceph_fill_inode(struct inode *inode, struct page *locked_page,
+		    struct ceph_mds_reply_info_in *iinfo,
+		    struct ceph_mds_reply_dirfrag *dirinfo,
+		    struct ceph_mds_session *session, int cap_fmode,
+		    struct ceph_cap_reservation *caps_reservation);
 extern int ceph_fill_trace(struct super_block *sb,
 			   struct ceph_mds_request *req);
 extern int ceph_readdir_prepopulate(struct ceph_mds_request *req,
-- 
2.24.1


* [PATCH v6 10/13] ceph: decode interval_sets for delegated inos
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
                   ` (8 preceding siblings ...)
  2020-03-02 14:14 ` [PATCH v6 09/13] ceph: make ceph_fill_inode non-static Jeff Layton
@ 2020-03-02 14:14 ` Jeff Layton
  2020-03-05 11:45   ` Luis Henriques
  2020-03-02 14:14 ` [PATCH v6 11/13] ceph: add new MDS req field to hold delegated inode number Jeff Layton
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

Starting in Octopus, the MDS will hand out caps that allow the client
to do asynchronous file creates under certain conditions. As part of
that, the MDS will delegate ranges of inode numbers to the client.

Add the infrastructure to decode these ranges, and stuff them into an
xarray for later consumption by the async creation code.

Because the xarray code currently only handles unsigned long indexes,
and those are only 32 bits wide on 32-bit arches, we only enable the
decoding when running on a 64-bit arch.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/mds_client.c | 122 +++++++++++++++++++++++++++++++++++++++----
 fs/ceph/mds_client.h |   9 +++-
 2 files changed, 121 insertions(+), 10 deletions(-)
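
To make the wire format concrete, here is a rough userspace sketch of the
decode as this patch reads it: a little-endian u32 count of sets, followed
by a (u64 start, u64 len) pair per set. get_le() and decode_deleg_inos()
are hypothetical helpers written for this note. The kernel code stores the
inode numbers in an xarray rather than a flat array, and the flat array
here is only to keep the sketch self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read an n-byte little-endian integer and advance *p; returns -1 if short */
static int get_le(const uint8_t **p, const uint8_t *end, size_t n, uint64_t *out)
{
	uint64_t v = 0;
	size_t i;

	assert(n <= sizeof(v));
	if ((size_t)(end - *p) < n)
		return -1;
	for (i = 0; i < n; i++)
		v |= (uint64_t)(*p)[i] << (8 * i);
	*p += n;
	*out = v;
	return 0;
}

/*
 * Expand the encoded interval sets into individual inode numbers.
 * Returns the number of entries stored in inos[], or -1 on a short
 * buffer or if more than 'max' entries would be needed.
 */
static long decode_deleg_inos(const uint8_t *p, const uint8_t *end,
			      uint64_t *inos, size_t max)
{
	uint64_t sets, start, len;
	size_t n = 0;

	if (get_le(&p, end, 4, &sets))
		return -1;
	while (sets--) {
		if (get_le(&p, end, 8, &start) || get_le(&p, end, 8, &len))
			return -1;
		while (len--) {
			if (n == max)
				return -1;
			inos[n++] = start++;
		}
	}
	return (long)n;
}
```

Expanding each range into one entry per inode number mirrors the xa_insert
loop in the patch. It is simple, but the cost grows with the size of the
delegated ranges rather than with the number of sets.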

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index db8304447f35..87f75d05b004 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -415,21 +415,121 @@ static int parse_reply_info_filelock(void **p, void *end,
 	return -EIO;
 }
 
+
+#if BITS_PER_LONG == 64
+
+#define DELEGATED_INO_AVAILABLE		xa_mk_value(1)
+
+static int ceph_parse_deleg_inos(void **p, void *end,
+				 struct ceph_mds_session *s)
+{
+	u32 sets;
+
+	ceph_decode_32_safe(p, end, sets, bad);
+	dout("got %u sets of delegated inodes\n", sets);
+	while (sets--) {
+		u64 start, len, ino;
+
+		ceph_decode_64_safe(p, end, start, bad);
+		ceph_decode_64_safe(p, end, len, bad);
+		while (len--) {
+			int err = xa_insert(&s->s_delegated_inos, ino = start++,
+					    DELEGATED_INO_AVAILABLE,
+					    GFP_KERNEL);
+			if (!err) {
+				dout("added delegated inode 0x%llx\n",
+				     start - 1);
+			} else if (err == -EBUSY) {
+				pr_warn("ceph: MDS delegated inode 0x%llx more than once.\n",
+					start - 1);
+			} else {
+				return err;
+			}
+		}
+	}
+	return 0;
+bad:
+	return -EIO;
+}
+
+u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
+{
+	unsigned long ino;
+	void *val;
+
+	xa_for_each(&s->s_delegated_inos, ino, val) {
+		val = xa_erase(&s->s_delegated_inos, ino);
+		if (val == DELEGATED_INO_AVAILABLE)
+			return ino;
+	}
+	return 0;
+}
+
+int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
+{
+	return xa_insert(&s->s_delegated_inos, ino, DELEGATED_INO_AVAILABLE,
+			 GFP_KERNEL);
+}
+#else /* BITS_PER_LONG == 64 */
+/*
+ * FIXME: xarrays can't handle 64-bit indexes on a 32-bit arch. For now, just
+ * ignore delegated_inos on 32 bit arch. Maybe eventually add xarrays for top
+ * and bottom words?
+ */
+static int ceph_parse_deleg_inos(void **p, void *end,
+				 struct ceph_mds_session *s)
+{
+	u32 sets;
+
+	ceph_decode_32_safe(p, end, sets, bad);
+	if (sets)
+		ceph_decode_skip_n(p, end, sets * 2 * sizeof(__le64), bad);
+	return 0;
+bad:
+	return -EIO;
+}
+
+u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
+{
+	return 0;
+}
+
+int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
+{
+	return 0;
+}
+#endif /* BITS_PER_LONG == 64 */
+
 /*
  * parse create results
  */
 static int parse_reply_info_create(void **p, void *end,
 				  struct ceph_mds_reply_info_parsed *info,
-				  u64 features)
+				  u64 features, struct ceph_mds_session *s)
 {
+	int ret;
+
 	if (features == (u64)-1 ||
 	    (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {
-		/* Malformed reply? */
 		if (*p == end) {
+			/* Malformed reply? */
 			info->has_create_ino = false;
-		} else {
+		} else if (test_bit(CEPHFS_FEATURE_DELEG_INO, &s->s_features)) {
+			u8 struct_v, struct_compat;
+			u32 len;
+
 			info->has_create_ino = true;
+			ceph_decode_8_safe(p, end, struct_v, bad);
+			ceph_decode_8_safe(p, end, struct_compat, bad);
+			ceph_decode_32_safe(p, end, len, bad);
+			ceph_decode_64_safe(p, end, info->ino, bad);
+			ret = ceph_parse_deleg_inos(p, end, s);
+			if (ret)
+				return ret;
+		} else {
+			/* legacy */
 			ceph_decode_64_safe(p, end, info->ino, bad);
+			info->has_create_ino = true;
 		}
 	} else {
 		if (*p != end)
@@ -448,7 +548,7 @@ static int parse_reply_info_create(void **p, void *end,
  */
 static int parse_reply_info_extra(void **p, void *end,
 				  struct ceph_mds_reply_info_parsed *info,
-				  u64 features)
+				  u64 features, struct ceph_mds_session *s)
 {
 	u32 op = le32_to_cpu(info->head->op);
 
@@ -457,7 +557,7 @@ static int parse_reply_info_extra(void **p, void *end,
 	else if (op == CEPH_MDS_OP_READDIR || op == CEPH_MDS_OP_LSSNAP)
 		return parse_reply_info_readdir(p, end, info, features);
 	else if (op == CEPH_MDS_OP_CREATE)
-		return parse_reply_info_create(p, end, info, features);
+		return parse_reply_info_create(p, end, info, features, s);
 	else
 		return -EIO;
 }
@@ -465,7 +565,7 @@ static int parse_reply_info_extra(void **p, void *end,
 /*
  * parse entire mds reply
  */
-static int parse_reply_info(struct ceph_msg *msg,
+static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
 			    struct ceph_mds_reply_info_parsed *info,
 			    u64 features)
 {
@@ -490,7 +590,7 @@ static int parse_reply_info(struct ceph_msg *msg,
 	ceph_decode_32_safe(&p, end, len, bad);
 	if (len > 0) {
 		ceph_decode_need(&p, end, len, bad);
-		err = parse_reply_info_extra(&p, p+len, info, features);
+		err = parse_reply_info_extra(&p, p+len, info, features, s);
 		if (err < 0)
 			goto out_bad;
 	}
@@ -558,6 +658,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
 	if (refcount_dec_and_test(&s->s_ref)) {
 		if (s->s_auth.authorizer)
 			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
+		xa_destroy(&s->s_delegated_inos);
 		kfree(s);
 	}
 }
@@ -645,6 +746,7 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
 	refcount_set(&s->s_ref, 1);
 	INIT_LIST_HEAD(&s->s_waiting);
 	INIT_LIST_HEAD(&s->s_unsafe);
+	xa_init(&s->s_delegated_inos);
 	s->s_num_cap_releases = 0;
 	s->s_cap_reconnect = 0;
 	s->s_cap_iterator = NULL;
@@ -2975,9 +3077,9 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
 	dout("handle_reply tid %lld result %d\n", tid, result);
 	rinfo = &req->r_reply_info;
 	if (test_bit(CEPHFS_FEATURE_REPLY_ENCODING, &session->s_features))
-		err = parse_reply_info(msg, rinfo, (u64)-1);
+		err = parse_reply_info(session, msg, rinfo, (u64)-1);
 	else
-		err = parse_reply_info(msg, rinfo, session->s_con.peer_features);
+		err = parse_reply_info(session, msg, rinfo, session->s_con.peer_features);
 	mutex_unlock(&mdsc->mutex);
 
 	mutex_lock(&session->s_mutex);
@@ -3673,6 +3775,8 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
 	if (!reply)
 		goto fail_nomsg;
 
+	xa_destroy(&session->s_delegated_inos);
+
 	mutex_lock(&session->s_mutex);
 	session->s_state = CEPH_MDS_SESSION_RECONNECTING;
 	session->s_seq = 0;
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index f10d342ea585..4c3b71707470 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -23,8 +23,9 @@ enum ceph_feature_type {
 	CEPHFS_FEATURE_RECLAIM_CLIENT,
 	CEPHFS_FEATURE_LAZY_CAP_WANTED,
 	CEPHFS_FEATURE_MULTI_RECONNECT,
+	CEPHFS_FEATURE_DELEG_INO,
 
-	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_MULTI_RECONNECT,
+	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_DELEG_INO,
 };
 
 /*
@@ -37,6 +38,7 @@ enum ceph_feature_type {
 	CEPHFS_FEATURE_REPLY_ENCODING,		\
 	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
 	CEPHFS_FEATURE_MULTI_RECONNECT,		\
+	CEPHFS_FEATURE_DELEG_INO,		\
 						\
 	CEPHFS_FEATURE_MAX,			\
 }
@@ -201,6 +203,7 @@ struct ceph_mds_session {
 
 	struct list_head  s_waiting;  /* waiting requests */
 	struct list_head  s_unsafe;   /* unsafe requests */
+	struct xarray	  s_delegated_inos;
 };
 
 /*
@@ -542,6 +545,7 @@ extern void ceph_mdsc_open_export_target_sessions(struct ceph_mds_client *mdsc,
 extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
 			  struct ceph_mds_session *session,
 			  int max_caps);
+
 static inline int ceph_wait_on_async_create(struct inode *inode)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
@@ -549,4 +553,7 @@ static inline int ceph_wait_on_async_create(struct inode *inode)
 	return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
 			   TASK_INTERRUPTIBLE);
 }
+
+extern u64 ceph_get_deleg_ino(struct ceph_mds_session *session);
+extern int ceph_restore_deleg_ino(struct ceph_mds_session *session, u64 ino);
 #endif
-- 
2.24.1


* [PATCH v6 11/13] ceph: add new MDS req field to hold delegated inode number
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
                   ` (9 preceding siblings ...)
  2020-03-02 14:14 ` [PATCH v6 10/13] ceph: decode interval_sets for delegated inos Jeff Layton
@ 2020-03-02 14:14 ` Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 12/13] ceph: cache layout in parent dir on first sync create Jeff Layton
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

Add a new request field to hold the delegated inode number. Encode that
into the message when it's set.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/mds_client.c | 3 +--
 fs/ceph/mds_client.h | 1 +
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 87f75d05b004..e4ef3b47e2db 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2472,7 +2472,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc,
 	head->op = cpu_to_le32(req->r_op);
 	head->caller_uid = cpu_to_le32(from_kuid(&init_user_ns, req->r_uid));
 	head->caller_gid = cpu_to_le32(from_kgid(&init_user_ns, req->r_gid));
-	head->ino = 0;
+	head->ino = cpu_to_le64(req->r_deleg_ino);
 	head->args = req->r_args;
 
 	ceph_encode_filepath(&p, end, ino1, path1);
@@ -2633,7 +2633,6 @@ static int __prepare_send_request(struct ceph_mds_client *mdsc,
 	rhead->flags = cpu_to_le32(flags);
 	rhead->num_fwd = req->r_num_fwd;
 	rhead->num_retry = req->r_attempts - 1;
-	rhead->ino = 0;
 
 	dout(" r_parent = %p\n", req->r_parent);
 	return 0;
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 4c3b71707470..4e5be79bf080 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -313,6 +313,7 @@ struct ceph_mds_request {
 	int               r_num_fwd;    /* number of forward attempts */
 	int               r_resend_mds; /* mds to resend to next, if any*/
 	u32               r_sent_on_mseq; /* cap mseq request was sent at*/
+	u64		  r_deleg_ino;
 
 	struct list_head  r_wait;
 	struct completion r_completion;
-- 
2.24.1


* [PATCH v6 12/13] ceph: cache layout in parent dir on first sync create
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
                   ` (10 preceding siblings ...)
  2020-03-02 14:14 ` [PATCH v6 11/13] ceph: add new MDS req field to hold delegated inode number Jeff Layton
@ 2020-03-02 14:14 ` Jeff Layton
  2020-03-02 14:14 ` [PATCH v6 13/13] ceph: attempt to do async create when possible Jeff Layton
  2020-03-02 16:22 ` [PATCH v6 00/13] ceph: async directory operations support Yan, Zheng
  13 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

If a create is done, then typically we'll end up writing to the file
soon afterward. We don't want to wait for the reply before doing that
when doing an async create, so that means we need the layout for the
new file before we've gotten the response from the MDS.

All files created in a directory will initially inherit the same layout,
so copy off the requisite info from the first synchronous create in the
directory, and save it in a new i_cached_layout field. Zero out the
layout when we lose Dc caps in the dir.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c       | 13 ++++++++++---
 fs/ceph/file.c       | 22 +++++++++++++++++++++-
 fs/ceph/inode.c      |  2 ++
 fs/ceph/mds_client.c |  7 ++++++-
 fs/ceph/super.h      |  1 +
 5 files changed, 40 insertions(+), 5 deletions(-)
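
The intended lifecycle of the cached layout can be sketched in userspace
roughly as follows. Everything here is a simplified stand-in: the fake_*
types, the single-bit cap value, and the "stripe_unit != 0" validity test
are assumptions for illustration, and locking and the pool_ns string
refcounting are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define FAKE_CAP_DIR_CREATE 0x1	/* hypothetical bit value */

struct fake_layout {
	unsigned int stripe_unit;
	unsigned int object_size;
};

struct fake_dir {
	unsigned int issued;		/* caps currently issued on the dir */
	struct fake_layout cached;	/* all-zero means "nothing cached" */
};

/* Simplified validity check (assumption: a zeroed layout is invalid) */
static bool layout_valid(const struct fake_layout *l)
{
	return l->stripe_unit != 0;
}

/* After a sync create: remember the new file's layout if we hold the create cap */
static void cache_layout(struct fake_dir *dir, const struct fake_layout *newfile)
{
	if ((dir->issued & FAKE_CAP_DIR_CREATE) && !layout_valid(&dir->cached))
		dir->cached = *newfile;
}

/* On a cap update: wipe the cached layout if the create cap is being lost */
static void update_issued(struct fake_dir *dir, unsigned int issued)
{
	if ((dir->issued & FAKE_CAP_DIR_CREATE) &&
	    !(issued & FAKE_CAP_DIR_CREATE))
		memset(&dir->cached, 0, sizeof(dir->cached));
	dir->issued = issued;
}
```

Only the first create after gaining the cap populates the cache; later
creates leave it alone until the cap is lost and regained.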

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index c60b28304c50..fb00120a22df 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -556,14 +556,14 @@ static void __cap_delay_cancel(struct ceph_mds_client *mdsc,
 	spin_unlock(&mdsc->cap_delay_lock);
 }
 
-/*
- * Common issue checks for add_cap, handle_cap_grant.
- */
+/* Common issue checks for add_cap, handle_cap_grant. */
 static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
 			      unsigned issued)
 {
 	unsigned had = __ceph_caps_issued(ci, NULL);
 
+	lockdep_assert_held(&ci->i_ceph_lock);
+
 	/*
 	 * Each time we receive FILE_CACHE anew, we increment
 	 * i_rdcache_gen.
@@ -588,6 +588,13 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
 			__ceph_dir_clear_complete(ci);
 		}
 	}
+
+	/* Wipe saved layout if we're losing DIR_CREATE caps */
+	if (S_ISDIR(ci->vfs_inode.i_mode) && (had & CEPH_CAP_DIR_CREATE) &&
+	    !(issued & CEPH_CAP_DIR_CREATE)) {
+		ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
+		memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
+	}
 }
 
 /*
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5d300a41ea08..57a3960ffeb7 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -423,6 +423,23 @@ int ceph_open(struct inode *inode, struct file *file)
 	return err;
 }
 
+/* Clone the layout from a synchronous create, if the dir now has Dc caps */
+static void
+cache_file_layout(struct inode *dst, struct inode *src)
+{
+	struct ceph_inode_info *cdst = ceph_inode(dst);
+	struct ceph_inode_info *csrc = ceph_inode(src);
+
+	spin_lock(&cdst->i_ceph_lock);
+	if ((__ceph_caps_issued(cdst, NULL) & CEPH_CAP_DIR_CREATE) &&
+	    !ceph_file_layout_is_valid(&cdst->i_cached_layout)) {
+		memcpy(&cdst->i_cached_layout, &csrc->i_layout,
+			sizeof(cdst->i_cached_layout));
+		rcu_assign_pointer(cdst->i_cached_layout.pool_ns,
+				   ceph_try_get_string(csrc->i_layout.pool_ns));
+	}
+	spin_unlock(&cdst->i_ceph_lock);
+}
 
 /*
  * Do a lookup + open with a single request.  If we get a non-existent
@@ -511,7 +528,10 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	} else {
 		dout("atomic_open finish_open on dn %p\n", dn);
 		if (req->r_op == CEPH_MDS_OP_CREATE && req->r_reply_info.has_create_ino) {
-			ceph_init_inode_acls(d_inode(dentry), &as_ctx);
+			struct inode *newino = d_inode(dentry);
+
+			cache_file_layout(dir, newino);
+			ceph_init_inode_acls(newino, &as_ctx);
 			file->f_mode |= FMODE_CREATED;
 		}
 		err = finish_open(file, dentry, ceph_open);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 0e2653b734c3..14f339dec490 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -447,6 +447,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 	ci->i_max_files = 0;
 
 	memset(&ci->i_dir_layout, 0, sizeof(ci->i_dir_layout));
+	memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
 	RCU_INIT_POINTER(ci->i_layout.pool_ns, NULL);
 
 	ci->i_fragtree = RB_ROOT;
@@ -587,6 +588,7 @@ void ceph_evict_inode(struct inode *inode)
 		ceph_buffer_put(ci->i_xattrs.prealloc_blob);
 
 	ceph_put_string(rcu_dereference_raw(ci->i_layout.pool_ns));
+	ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
 }
 
 static inline blkcnt_t calc_inode_blocks(u64 size)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index e4ef3b47e2db..68b8afded466 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3530,8 +3530,13 @@ static int reconnect_caps_cb(struct inode *inode, struct ceph_cap *cap,
 	cap->cap_gen = cap->session->s_cap_gen;
 
 	/* These are lost when the session goes away */
-	if (S_ISDIR(inode->i_mode))
+	if (S_ISDIR(inode->i_mode)) {
+		if (cap->issued & CEPH_CAP_DIR_CREATE) {
+			ceph_put_string(rcu_dereference_raw(ci->i_cached_layout.pool_ns));
+			memset(&ci->i_cached_layout, 0, sizeof(ci->i_cached_layout));
+		}
 		cap->issued &= ~CEPH_CAP_ANY_DIR_OPS;
+	}
 
 	if (recon_state->msg_version >= 2) {
 		rec.v2.cap_id = cpu_to_le64(cap->cap_id);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 97f57fa0f42c..98d4fbf5e553 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -326,6 +326,7 @@ struct ceph_inode_info {
 
 	struct ceph_dir_layout i_dir_layout;
 	struct ceph_file_layout i_layout;
+	struct ceph_file_layout i_cached_layout;	// for async creates
 	char *i_symlink;
 
 	/* for dirs */
-- 
2.24.1

* [PATCH v6 13/13] ceph: attempt to do async create when possible
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
                   ` (11 preceding siblings ...)
  2020-03-02 14:14 ` [PATCH v6 12/13] ceph: cache layout in parent dir on first sync create Jeff Layton
@ 2020-03-02 14:14 ` Jeff Layton
  2020-03-02 16:22 ` [PATCH v6 00/13] ceph: async directory operations support Yan, Zheng
  13 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 14:14 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, sage, zyan, pdonnell

With the Octopus release, the MDS will hand out directory create caps.

If we have Fxc caps on the directory, and complete directory information
or a known negative dentry, then we can return without waiting on the
reply, allowing the open() call to return very quickly to userland.

We use the normal ceph_fill_inode() routine to fill in the inode, so we
have to gin up some reply inode information with what we'd expect the
newly-created inode to have. The client assumes that it has a full set
of caps on the new inode, and that the MDS will revoke them when there
is conflicting access.

This functionality is gated on the wsync/nowsync mount options.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/file.c               | 247 ++++++++++++++++++++++++++++++++++-
 include/linux/ceph/ceph_fs.h |   3 +
 2 files changed, 243 insertions(+), 7 deletions(-)
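
For readers skimming the thread, the eligibility rules from the commit
message can be sketched roughly as below. This is only an illustration:
struct dir_state and can_async_create() are hypothetical names that do
not appear in the patch, and the flags stand in for kernel state that the
real try_prep_async_create() inspects under i_ceph_lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the parent directory; not a kernel structure. */
struct dir_state {
	bool has_auth_cap;      /* client holds the auth cap for the dir */
	bool have_deleg_ino;    /* session has a delegated inode number left */
	bool layout_cached;     /* a valid file layout was cached earlier */
	bool has_fx_and_dc;     /* both Fx and Dc caps are currently issued */
	bool dentry_in_lookup;  /* dentry is still being looked up */
	bool dir_complete;      /* directory contents are fully known */
	bool lease_gen_matches; /* dentry lease generation matches the dir */
};

/* Returns true when an async create may be attempted (illustrative only). */
static bool can_async_create(const struct dir_state *d)
{
	assert(d != NULL);

	if (!d->has_auth_cap || !d->have_deleg_ino)
		return false;
	if (!d->layout_cached || !d->has_fx_and_dc)
		return false;
	/* A dentry under lookup needs complete dir info to prove absence. */
	if (d->dentry_in_lookup)
		return d->dir_complete;
	/* Otherwise rely on a still-valid lease on the negative dentry. */
	return d->lease_gen_matches;
}
```

If any check fails, the caller simply falls back to the synchronous
create path, which is what the patch does as well.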

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 57a3960ffeb7..d6653e385adb 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -441,6 +441,216 @@ cache_file_layout(struct inode *dst, struct inode *src)
 	spin_unlock(&cdst->i_ceph_lock);
 }
 
+/*
+ * Try to set up an async create. We need caps, a file layout, and inode number,
+ * and either a lease on the dentry or complete dir info. If any of those
+ * criteria are not satisfied, then return false and the caller can go
+ * synchronous.
+ */
+static int try_prep_async_create(struct inode *dir, struct dentry *dentry,
+				 struct ceph_file_layout *lo, u64 *pino)
+{
+	struct ceph_inode_info *ci = ceph_inode(dir);
+	struct ceph_dentry_info *di = ceph_dentry(dentry);
+	int got = 0, want = CEPH_CAP_FILE_EXCL | CEPH_CAP_DIR_CREATE;
+	u64 ino;
+
+	spin_lock(&ci->i_ceph_lock);
+	/* No auth cap means no chance for Dc caps */
+	if (!ci->i_auth_cap)
+		goto no_async;
+
+	/* Any delegated inos? */
+	if (xa_empty(&ci->i_auth_cap->session->s_delegated_inos))
+		goto no_async;
+
+	if (!ceph_file_layout_is_valid(&ci->i_cached_layout))
+		goto no_async;
+
+	if ((__ceph_caps_issued(ci, NULL) & want) != want)
+		goto no_async;
+
+	if (d_in_lookup(dentry)) {
+		if (!__ceph_dir_is_complete(ci))
+			goto no_async;
+	} else if (atomic_read(&ci->i_shared_gen) !=
+		   READ_ONCE(di->lease_shared_gen)) {
+		goto no_async;
+	}
+
+	ino = ceph_get_deleg_ino(ci->i_auth_cap->session);
+	if (!ino)
+		goto no_async;
+
+	*pino = ino;
+	ceph_take_cap_refs(ci, want, false);
+	memcpy(lo, &ci->i_cached_layout, sizeof(*lo));
+	rcu_assign_pointer(lo->pool_ns,
+			   ceph_try_get_string(ci->i_cached_layout.pool_ns));
+	got = want;
+no_async:
+	spin_unlock(&ci->i_ceph_lock);
+	return got;
+}
+
+static void restore_deleg_ino(struct inode *dir, u64 ino)
+{
+	struct ceph_inode_info *ci = ceph_inode(dir);
+	struct ceph_mds_session *s = NULL;
+
+	spin_lock(&ci->i_ceph_lock);
+	if (ci->i_auth_cap)
+		s = ceph_get_mds_session(ci->i_auth_cap->session);
+	spin_unlock(&ci->i_ceph_lock);
+	if (s) {
+		int err = ceph_restore_deleg_ino(s, ino);
+		if (err)
+			pr_warn("ceph: unable to restore delegated ino 0x%llx to session: %d\n",
+				ino, err);
+		ceph_put_mds_session(s);
+	}
+}
+
+static void ceph_async_create_cb(struct ceph_mds_client *mdsc,
+                                 struct ceph_mds_request *req)
+{
+	int result = req->r_err ? req->r_err :
+			le32_to_cpu(req->r_reply_info.head->result);
+
+	if (result == -EJUKEBOX)
+		goto out;
+
+	mapping_set_error(req->r_parent->i_mapping, result);
+
+	if (result) {
+		struct dentry *dentry = req->r_dentry;
+		int pathlen;
+		u64 base;
+		char *path = ceph_mdsc_build_path(req->r_dentry, &pathlen,
+						  &base, 0);
+
+		ceph_dir_clear_complete(req->r_parent);
+		if (!d_unhashed(dentry))
+			d_drop(dentry);
+
+		/* FIXME: start returning I/O errors on all accesses? */
+		pr_warn("ceph: async create failure path=(%llx)%s result=%d!\n",
+			base, IS_ERR(path) ? "<<bad>>" : path, result);
+		ceph_mdsc_free_path(path, pathlen);
+	}
+
+	if (req->r_target_inode) {
+		struct ceph_inode_info *ci = ceph_inode(req->r_target_inode);
+		u64 ino = ceph_vino(req->r_target_inode).ino;
+
+		if (req->r_deleg_ino != ino)
+			pr_warn("%s: inode number mismatch! err=%d deleg_ino=0x%llx target=0x%llx\n",
+				__func__, req->r_err, req->r_deleg_ino, ino);
+		mapping_set_error(req->r_target_inode->i_mapping, result);
+
+		spin_lock(&ci->i_ceph_lock);
+		WARN_ON_ONCE(!(ci->i_ceph_flags & CEPH_I_ASYNC_CREATE));
+		ci->i_ceph_flags &= ~CEPH_I_ASYNC_CREATE;
+		wake_up_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT);
+		spin_unlock(&ci->i_ceph_lock);
+
+		ceph_check_caps(ci, 0, req->r_session);
+	} else {
+		pr_warn("%s: no req->r_target_inode for 0x%llx\n", __func__,
+			req->r_deleg_ino);
+	}
+out:
+	ceph_mdsc_release_dir_caps(req);
+}
+
+static int ceph_finish_async_create(struct inode *dir, struct dentry *dentry,
+				    struct file *file, umode_t mode,
+				    struct ceph_mds_request *req,
+				    struct ceph_acl_sec_ctx *as_ctx,
+				    struct ceph_file_layout *lo)
+{
+	int ret;
+	char xattr_buf[4];
+	struct ceph_mds_reply_inode in = { };
+	struct ceph_mds_reply_info_in iinfo = { .in = &in };
+	struct ceph_inode_info *ci = ceph_inode(dir);
+	struct inode *inode;
+	struct timespec64 now;
+	struct ceph_vino vino = { .ino = req->r_deleg_ino,
+				  .snap = CEPH_NOSNAP };
+
+	ktime_get_real_ts64(&now);
+
+	inode = ceph_get_inode(dentry->d_sb, vino);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	iinfo.inline_version = CEPH_INLINE_NONE;
+	iinfo.change_attr = 1;
+	ceph_encode_timespec64(&iinfo.btime, &now);
+
+	iinfo.xattr_len = ARRAY_SIZE(xattr_buf);
+	iinfo.xattr_data = xattr_buf;
+	memset(iinfo.xattr_data, 0, iinfo.xattr_len);
+
+	in.ino = cpu_to_le64(vino.ino);
+	in.snapid = cpu_to_le64(CEPH_NOSNAP);
+	in.version = cpu_to_le64(1);	// ???
+	in.cap.caps = in.cap.wanted = cpu_to_le32(CEPH_CAP_ALL_FILE);
+	in.cap.cap_id = cpu_to_le64(1);
+	in.cap.realm = cpu_to_le64(ci->i_snap_realm->ino);
+	in.cap.flags = CEPH_CAP_FLAG_AUTH;
+	in.ctime = in.mtime = in.atime = iinfo.btime;
+	in.mode = cpu_to_le32((u32)mode);
+	in.truncate_seq = cpu_to_le32(1);
+	in.truncate_size = cpu_to_le64(-1ULL);
+	in.xattr_version = cpu_to_le64(1);
+	in.uid = cpu_to_le32(from_kuid(&init_user_ns, current_fsuid()));
+	in.gid = cpu_to_le32(from_kgid(&init_user_ns, dir->i_mode & S_ISGID ?
+				dir->i_gid : current_fsgid()));
+	in.nlink = cpu_to_le32(1);
+	in.max_size = cpu_to_le64(lo->stripe_unit);
+
+	ceph_file_layout_to_legacy(lo, &in.layout);
+
+	ret = ceph_fill_inode(inode, NULL, &iinfo, NULL, req->r_session,
+			      req->r_fmode, NULL);
+	if (ret) {
+		dout("%s failed to fill inode: %d\n", __func__, ret);
+		ceph_dir_clear_complete(dir);
+		if (!d_unhashed(dentry))
+			d_drop(dentry);
+		if (inode->i_state & I_NEW)
+			discard_new_inode(inode);
+	} else {
+		struct dentry *dn;
+
+		dout("%s d_adding new inode 0x%llx to 0x%lx/%s\n", __func__,
+			vino.ino, dir->i_ino, dentry->d_name.name);
+		ceph_dir_clear_ordered(dir);
+		ceph_init_inode_acls(inode, as_ctx);
+		if (inode->i_state & I_NEW) {
+			/*
+			 * If it's not I_NEW, then someone created this before
+			 * we got here. Assume the server is aware of it at
+			 * that point and don't worry about setting
+			 * CEPH_I_ASYNC_CREATE.
+			 */
+			ceph_inode(inode)->i_ceph_flags = CEPH_I_ASYNC_CREATE;
+			unlock_new_inode(inode);
+		}
+		if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
+			if (!d_unhashed(dentry))
+				d_drop(dentry);
+			dn = d_splice_alias(inode, dentry);
+			WARN_ON_ONCE(dn && dn != dentry);
+		}
+		file->f_mode |= FMODE_CREATED;
+		ret = finish_open(file, dentry, ceph_open);
+	}
+	return ret;
+}
+
 /*
  * Do a lookup + open with a single request.  If we get a non-existent
  * file or symlink, return 1 so the VFS can retry.
@@ -453,6 +663,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	struct ceph_mds_request *req;
 	struct dentry *dn;
 	struct ceph_acl_sec_ctx as_ctx = {};
+	bool try_async = ceph_test_mount_opt(fsc, ASYNC_DIROPS);
 	int mask;
 	int err;
 
@@ -476,7 +687,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 		/* If it's not being looked up, it's negative */
 		return -ENOENT;
 	}
-
+retry:
 	/* do the open */
 	req = prepare_open_request(dir->i_sb, flags, mode);
 	if (IS_ERR(req)) {
@@ -485,21 +696,43 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	}
 	req->r_dentry = dget(dentry);
 	req->r_num_caps = 2;
+	mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
+	if (ceph_security_xattr_wanted(dir))
+		mask |= CEPH_CAP_XATTR_SHARED;
+	req->r_args.open.mask = cpu_to_le32(mask);
+	req->r_parent = dir;
+
 	if (flags & O_CREAT) {
+		struct ceph_file_layout lo;
+
 		req->r_dentry_drop = CEPH_CAP_FILE_SHARED | CEPH_CAP_AUTH_EXCL;
 		req->r_dentry_unless = CEPH_CAP_FILE_EXCL;
 		if (as_ctx.pagelist) {
 			req->r_pagelist = as_ctx.pagelist;
 			as_ctx.pagelist = NULL;
 		}
+		if (try_async &&
+		    (req->r_dir_caps =
+		      try_prep_async_create(dir, dentry, &lo,
+					    &req->r_deleg_ino))) {
+			set_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
+			req->r_args.open.flags |= cpu_to_le32(CEPH_O_EXCL);
+			req->r_callback = ceph_async_create_cb;
+			err = ceph_mdsc_submit_request(mdsc, dir, req);
+			if (!err) {
+				err = ceph_finish_async_create(dir, dentry,
+							file, mode, req,
+							&as_ctx, &lo);
+			} else if (err == -EJUKEBOX) {
+				restore_deleg_ino(dir, req->r_deleg_ino);
+				ceph_mdsc_put_request(req);
+				try_async = false;
+				goto retry;
+			}
+			goto out_req;
+		}
 	}
 
-       mask = CEPH_STAT_CAP_INODE | CEPH_CAP_AUTH_SHARED;
-       if (ceph_security_xattr_wanted(dir))
-               mask |= CEPH_CAP_XATTR_SHARED;
-       req->r_args.open.mask = cpu_to_le32(mask);
-
-	req->r_parent = dir;
 	set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
 	err = ceph_mdsc_do_request(mdsc,
 				   (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
diff --git a/include/linux/ceph/ceph_fs.h b/include/linux/ceph/ceph_fs.h
index e63a5c0b6d62..ebf5ba62b772 100644
--- a/include/linux/ceph/ceph_fs.h
+++ b/include/linux/ceph/ceph_fs.h
@@ -660,6 +660,9 @@ int ceph_flags_to_mode(int flags);
 #define CEPH_CAP_ANY      (CEPH_CAP_ANY_RD | CEPH_CAP_ANY_EXCL | \
 			   CEPH_CAP_ANY_FILE_WR | CEPH_CAP_FILE_LAZYIO | \
 			   CEPH_CAP_PIN)
+#define CEPH_CAP_ALL_FILE (CEPH_CAP_PIN | CEPH_CAP_ANY_SHARED | \
+			   CEPH_CAP_AUTH_EXCL | CEPH_CAP_XATTR_EXCL | \
+			   CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)
 
 #define CEPH_CAP_LOCKS (CEPH_LOCK_IFILE | CEPH_LOCK_IAUTH | CEPH_LOCK_ILINK | \
 			CEPH_LOCK_IXATTR)
-- 
2.24.1

* Re: [PATCH v6 00/13] ceph: async directory operations support
  2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
                   ` (12 preceding siblings ...)
  2020-03-02 14:14 ` [PATCH v6 13/13] ceph: attempt to do async create when possible Jeff Layton
@ 2020-03-02 16:22 ` Yan, Zheng
  2020-03-02 21:07   ` Jeff Layton
  13 siblings, 1 reply; 21+ messages in thread
From: Yan, Zheng @ 2020-03-02 16:22 UTC (permalink / raw)
  To: Jeff Layton, ceph-devel; +Cc: idryomov, sage, pdonnell

On 3/2/20 10:14 PM, Jeff Layton wrote:
> v6: move handling of CEPH_I_ASYNC_CREATE from __send_cap into callers
>      also issue ceph_mdsc_release_dir_caps() in complete_request
>      properly handle -EJUKEBOX return in async callbacks
>      
> I previously pulled the async unlink patch from ceph-client/testing, so
> this set includes a revised version of that as well, and orders it
> some other changes.
> 
> The main change from v5 is to rework the callers of __send_cap to either
> skip sending or wait if the create reply hasn't come in yet.
> 
> We may not actually need patch #7 here. Zheng had that delta in one
> of the earlier patches, but I'm not sure it's really needed now. It
> may make sense to just take it on its own merits though.
> 
> Jeff Layton (12):
>    ceph: make kick_flushing_inode_caps non-static
>    ceph: add flag to designate that a request is asynchronous
>    ceph: track primary dentry link
>    ceph: add infrastructure for waiting for async create to complete
>    ceph: make __take_cap_refs non-static
>    ceph: cap tracking for async directory operations
>    ceph: perform asynchronous unlink if we have sufficient caps
>    ceph: make ceph_fill_inode non-static
>    ceph: decode interval_sets for delegated inos
>    ceph: add new MDS req field to hold delegated inode number
>    ceph: cache layout in parent dir on first sync create
>    ceph: attempt to do async create when possible
> 
> Yan, Zheng (1):
>    ceph: don't take refs to want mask unless we have all bits
> 
>   fs/ceph/caps.c               |  91 ++++++++----
>   fs/ceph/dir.c                | 111 ++++++++++++++-
>   fs/ceph/file.c               | 269 +++++++++++++++++++++++++++++++++--
>   fs/ceph/inode.c              |  58 ++++----
>   fs/ceph/mds_client.c         | 196 ++++++++++++++++++++++---
>   fs/ceph/mds_client.h         |  24 +++-
>   fs/ceph/super.c              |  20 +++
>   fs/ceph/super.h              |  23 ++-
>   include/linux/ceph/ceph_fs.h |  17 ++-
>   9 files changed, 724 insertions(+), 85 deletions(-)
> 


series

Reviewed-by: "Yan, Zheng" <zyan@redhat.com>

* Re: [PATCH v6 00/13] ceph: async directory operations support
  2020-03-02 16:22 ` [PATCH v6 00/13] ceph: async directory operations support Yan, Zheng
@ 2020-03-02 21:07   ` Jeff Layton
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-02 21:07 UTC (permalink / raw)
  To: Yan, Zheng, ceph-devel; +Cc: idryomov, sage, pdonnell

On Tue, 2020-03-03 at 00:22 +0800, Yan, Zheng wrote:
> Reviewed-by: "Yan, Zheng" <zyan@redhat.com>

Thanks Zheng, merged into "testing".
-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH v6 10/13] ceph: decode interval_sets for delegated inos
  2020-03-02 14:14 ` [PATCH v6 10/13] ceph: decode interval_sets for delegated inos Jeff Layton
@ 2020-03-05 11:45   ` Luis Henriques
  2020-03-05 12:02     ` Jeff Layton
  0 siblings, 1 reply; 21+ messages in thread
From: Luis Henriques @ 2020-03-05 11:45 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel, idryomov, sage, zyan, pdonnell

On Mon, Mar 02, 2020 at 09:14:31AM -0500, Jeff Layton wrote:
> Starting in Octopus, the MDS will hand out caps that allow the client
> to do asynchronous file creates under certain conditions. As part of
> that, the MDS will delegate ranges of inode numbers to the client.
> 
> Add the infrastructure to decode these ranges, and stuff them into an
> xarray for later consumption by the async creation code.
> 
> Because the xarray code currently only handles unsigned long indexes,
> and those are 32-bits on 32-bit arches, we only enable the decoding when
> running on a 64-bit arch.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/mds_client.c | 122 +++++++++++++++++++++++++++++++++++++++----
>  fs/ceph/mds_client.h |   9 +++-
>  2 files changed, 121 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index db8304447f35..87f75d05b004 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -415,21 +415,121 @@ static int parse_reply_info_filelock(void **p, void *end,
>  	return -EIO;
>  }
>  
> +
> +#if BITS_PER_LONG == 64
> +
> +#define DELEGATED_INO_AVAILABLE		xa_mk_value(1)
> +
> +static int ceph_parse_deleg_inos(void **p, void *end,
> +				 struct ceph_mds_session *s)
> +{
> +	u32 sets;
> +
> +	ceph_decode_32_safe(p, end, sets, bad);
> +	dout("got %u sets of delegated inodes\n", sets);
> +	while (sets--) {
> +		u64 start, len, ino;
> +
> +		ceph_decode_64_safe(p, end, start, bad);
> +		ceph_decode_64_safe(p, end, len, bad);
> +		while (len--) {
> +			int err = xa_insert(&s->s_delegated_inos, ino = start++,
> +					    DELEGATED_INO_AVAILABLE,
> +					    GFP_KERNEL);
> +			if (!err) {
> +				dout("added delegated inode 0x%llx\n",
> +				     start - 1);
> +			} else if (err == -EBUSY) {
> +				pr_warn("ceph: MDS delegated inode 0x%llx more than once.\n",
> +					start - 1);
> +			} else {
> +				return err;
> +			}
> +		}
> +	}
> +	return 0;
> +bad:
> +	return -EIO;
> +}
> +
> +u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
> +{
> +	unsigned long ino;
> +	void *val;
> +
> +	xa_for_each(&s->s_delegated_inos, ino, val) {
> +		val = xa_erase(&s->s_delegated_inos, ino);
> +		if (val == DELEGATED_INO_AVAILABLE)
> +			return ino;
> +	}
> +	return 0;
> +}
> +
> +int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
> +{
> +	return xa_insert(&s->s_delegated_inos, ino, DELEGATED_INO_AVAILABLE,
> +			 GFP_KERNEL);
> +}
> +#else /* BITS_PER_LONG == 64 */
> +/*
> + * FIXME: xarrays can't handle 64-bit indexes on a 32-bit arch. For now, just
> + * ignore delegated_inos on 32 bit arch. Maybe eventually add xarrays for top
> + * and bottom words?
> + */
> +static int ceph_parse_deleg_inos(void **p, void *end,
> +				 struct ceph_mds_session *s)
> +{
> +	u32 sets;
> +
> +	ceph_decode_32_safe(p, end, sets, bad);
> +	if (sets)
> +		ceph_decode_skip_n(p, end, sets * 2 * sizeof(__le64), bad);
> +	return 0;
> +bad:
> +	return -EIO;
> +}
> +
> +u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
> +{
> +	return 0;
> +}
> +
> +int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
> +{
> +	return 0;
> +}
> +#endif /* BITS_PER_LONG == 64 */
> +
>  /*
>   * parse create results
>   */
>  static int parse_reply_info_create(void **p, void *end,
>  				  struct ceph_mds_reply_info_parsed *info,
> -				  u64 features)
> +				  u64 features, struct ceph_mds_session *s)
>  {
> +	int ret;
> +
>  	if (features == (u64)-1 ||
>  	    (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {
> -		/* Malformed reply? */
>  		if (*p == end) {
> +			/* Malformed reply? */
>  			info->has_create_ino = false;
> -		} else {
> +		} else if (test_bit(CEPHFS_FEATURE_DELEG_INO, &s->s_features)) {
> +			u8 struct_v, struct_compat;
> +			u32 len;
> +
>  			info->has_create_ino = true;
> +			ceph_decode_8_safe(p, end, struct_v, bad);
> +			ceph_decode_8_safe(p, end, struct_compat, bad);
> +			ceph_decode_32_safe(p, end, len, bad);
> +			ceph_decode_64_safe(p, end, info->ino, bad);

I've done a quick test in current 'testing' branch and it seems that it's
currently broken.  A bisect identified this commit as 'bad' and it's
failing at this point.

I'm running an old (a few weeks old) 'master' vstart cluster, so it doesn't
have the bits needed for this DELEG_INO feature.  Running xfstests
generic/001 results in:

   ceph: mds parse_reply err -5
   ceph: mdsc_handle_reply got corrupt reply mds0(tid:9)
   ...

s->s_features does have the CEPHFS_FEATURE_DELEG_INO bit set;
'features' is -1 (0xffffffffffffffff) and s->s_features is 0x3fff.  Maybe
the issue is actually somewhere else (the cephfs feature handling code),
but I'm still looking.

Cheers,
--
Luís
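
To make the failure mode concrete, here is a minimal userspace sketch of
why taking the new branch against a legacy reply ends in -EIO (-5). It
assumes the older MDS still sends only the bare 64-bit inode number as
the extra payload; take(), decode_legacy() and decode_versioned() are
made-up helpers, not the kernel's ceph_decode_*_safe macros, and
byte-order conversion is left out for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes out of the buffer, failing with -EIO on a short read. */
static int take(const uint8_t **p, const uint8_t *end, void *dst, size_t n)
{
	assert(*p <= end);
	if ((size_t)(end - *p) < n)
		return -EIO;
	memcpy(dst, *p, n);
	*p += n;
	return 0;
}

/* Legacy layout: the extra payload is just a 64-bit inode number. */
static int decode_legacy(const uint8_t *buf, size_t len, uint64_t *ino)
{
	const uint8_t *p = buf;

	return take(&p, buf + len, ino, sizeof(*ino));
}

/* Newer layout: struct_v, struct_compat, length, then the inode number. */
static int decode_versioned(const uint8_t *buf, size_t len, uint64_t *ino)
{
	const uint8_t *p = buf, *end = buf + len;
	uint8_t struct_v, struct_compat;
	uint32_t struct_len;
	int err;

	err = take(&p, end, &struct_v, sizeof(struct_v));
	if (err)
		return err;
	err = take(&p, end, &struct_compat, sizeof(struct_compat));
	if (err)
		return err;
	err = take(&p, end, &struct_len, sizeof(struct_len));
	if (err)
		return err;
	/* Only 2 of the 8 legacy bytes remain here, so this read fails. */
	return take(&p, end, ino, sizeof(*ino));
}
```

With an 8-byte legacy payload, decode_legacy() succeeds while
decode_versioned() consumes 6 bytes of header and then runs out of data,
which would match the "parse_reply err -5" seen above.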

> +			ret = ceph_parse_deleg_inos(p, end, s);
> +			if (ret)
> +				return ret;
> +		} else {
> +			/* legacy */
>  			ceph_decode_64_safe(p, end, info->ino, bad);
> +			info->has_create_ino = true;
>  		}
>  	} else {
>  		if (*p != end)
> @@ -448,7 +548,7 @@ static int parse_reply_info_create(void **p, void *end,
>   */
>  static int parse_reply_info_extra(void **p, void *end,
>  				  struct ceph_mds_reply_info_parsed *info,
> -				  u64 features)
> +				  u64 features, struct ceph_mds_session *s)
>  {
>  	u32 op = le32_to_cpu(info->head->op);
>  
> @@ -457,7 +557,7 @@ static int parse_reply_info_extra(void **p, void *end,
>  	else if (op == CEPH_MDS_OP_READDIR || op == CEPH_MDS_OP_LSSNAP)
>  		return parse_reply_info_readdir(p, end, info, features);
>  	else if (op == CEPH_MDS_OP_CREATE)
> -		return parse_reply_info_create(p, end, info, features);
> +		return parse_reply_info_create(p, end, info, features, s);
>  	else
>  		return -EIO;
>  }
> @@ -465,7 +565,7 @@ static int parse_reply_info_extra(void **p, void *end,
>  /*
>   * parse entire mds reply
>   */
> -static int parse_reply_info(struct ceph_msg *msg,
> +static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
>  			    struct ceph_mds_reply_info_parsed *info,
>  			    u64 features)
>  {
> @@ -490,7 +590,7 @@ static int parse_reply_info(struct ceph_msg *msg,
>  	ceph_decode_32_safe(&p, end, len, bad);
>  	if (len > 0) {
>  		ceph_decode_need(&p, end, len, bad);
> -		err = parse_reply_info_extra(&p, p+len, info, features);
> +		err = parse_reply_info_extra(&p, p+len, info, features, s);
>  		if (err < 0)
>  			goto out_bad;
>  	}
> @@ -558,6 +658,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
>  	if (refcount_dec_and_test(&s->s_ref)) {
>  		if (s->s_auth.authorizer)
>  			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
> +		xa_destroy(&s->s_delegated_inos);
>  		kfree(s);
>  	}
>  }
> @@ -645,6 +746,7 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
>  	refcount_set(&s->s_ref, 1);
>  	INIT_LIST_HEAD(&s->s_waiting);
>  	INIT_LIST_HEAD(&s->s_unsafe);
> +	xa_init(&s->s_delegated_inos);
>  	s->s_num_cap_releases = 0;
>  	s->s_cap_reconnect = 0;
>  	s->s_cap_iterator = NULL;
> @@ -2975,9 +3077,9 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>  	dout("handle_reply tid %lld result %d\n", tid, result);
>  	rinfo = &req->r_reply_info;
>  	if (test_bit(CEPHFS_FEATURE_REPLY_ENCODING, &session->s_features))
> -		err = parse_reply_info(msg, rinfo, (u64)-1);
> +		err = parse_reply_info(session, msg, rinfo, (u64)-1);
>  	else
> -		err = parse_reply_info(msg, rinfo, session->s_con.peer_features);
> +		err = parse_reply_info(session, msg, rinfo, session->s_con.peer_features);
>  	mutex_unlock(&mdsc->mutex);
>  
>  	mutex_lock(&session->s_mutex);
> @@ -3673,6 +3775,8 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
>  	if (!reply)
>  		goto fail_nomsg;
>  
> +	xa_destroy(&session->s_delegated_inos);
> +
>  	mutex_lock(&session->s_mutex);
>  	session->s_state = CEPH_MDS_SESSION_RECONNECTING;
>  	session->s_seq = 0;
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index f10d342ea585..4c3b71707470 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -23,8 +23,9 @@ enum ceph_feature_type {
>  	CEPHFS_FEATURE_RECLAIM_CLIENT,
>  	CEPHFS_FEATURE_LAZY_CAP_WANTED,
>  	CEPHFS_FEATURE_MULTI_RECONNECT,
> +	CEPHFS_FEATURE_DELEG_INO,
>  
> -	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_MULTI_RECONNECT,
> +	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_DELEG_INO,
>  };
>  
>  /*
> @@ -37,6 +38,7 @@ enum ceph_feature_type {
>  	CEPHFS_FEATURE_REPLY_ENCODING,		\
>  	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
>  	CEPHFS_FEATURE_MULTI_RECONNECT,		\
> +	CEPHFS_FEATURE_DELEG_INO,		\
>  						\
>  	CEPHFS_FEATURE_MAX,			\
>  }
> @@ -201,6 +203,7 @@ struct ceph_mds_session {
>  
>  	struct list_head  s_waiting;  /* waiting requests */
>  	struct list_head  s_unsafe;   /* unsafe requests */
> +	struct xarray	  s_delegated_inos;
>  };
>  
>  /*
> @@ -542,6 +545,7 @@ extern void ceph_mdsc_open_export_target_sessions(struct ceph_mds_client *mdsc,
>  extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
>  			  struct ceph_mds_session *session,
>  			  int max_caps);
> +
>  static inline int ceph_wait_on_async_create(struct inode *inode)
>  {
>  	struct ceph_inode_info *ci = ceph_inode(inode);
> @@ -549,4 +553,7 @@ static inline int ceph_wait_on_async_create(struct inode *inode)
>  	return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
>  			   TASK_INTERRUPTIBLE);
>  }
> +
> +extern u64 ceph_get_deleg_ino(struct ceph_mds_session *session);
> +extern int ceph_restore_deleg_ino(struct ceph_mds_session *session, u64 ino);
>  #endif
> -- 
> 2.24.1
> 

* Re: [PATCH v6 10/13] ceph: decode interval_sets for delegated inos
  2020-03-05 11:45   ` Luis Henriques
@ 2020-03-05 12:02     ` Jeff Layton
  2020-03-05 12:20       ` Luis Henriques
  2020-03-05 13:36       ` Ilya Dryomov
  0 siblings, 2 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-05 12:02 UTC (permalink / raw)
  To: Luis Henriques; +Cc: ceph-devel, idryomov, sage, zyan, pdonnell

On Thu, 2020-03-05 at 11:45 +0000, Luis Henriques wrote:
> On Mon, Mar 02, 2020 at 09:14:31AM -0500, Jeff Layton wrote:
> > Starting in Octopus, the MDS will hand out caps that allow the client
> > to do asynchronous file creates under certain conditions. As part of
> > that, the MDS will delegate ranges of inode numbers to the client.
> > 
> > Add the infrastructure to decode these ranges, and stuff them into an
> > xarray for later consumption by the async creation code.
> > 
> > Because the xarray code currently only handles unsigned long indexes,
> > and those are 32-bits on 32-bit arches, we only enable the decoding when
> > running on a 64-bit arch.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/ceph/mds_client.c | 122 +++++++++++++++++++++++++++++++++++++++----
> >  fs/ceph/mds_client.h |   9 +++-
> >  2 files changed, 121 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index db8304447f35..87f75d05b004 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -415,21 +415,121 @@ static int parse_reply_info_filelock(void **p, void *end,
> >  	return -EIO;
> >  }
> >  
> > +
> > +#if BITS_PER_LONG == 64
> > +
> > +#define DELEGATED_INO_AVAILABLE		xa_mk_value(1)
> > +
> > +static int ceph_parse_deleg_inos(void **p, void *end,
> > +				 struct ceph_mds_session *s)
> > +{
> > +	u32 sets;
> > +
> > +	ceph_decode_32_safe(p, end, sets, bad);
> > +	dout("got %u sets of delegated inodes\n", sets);
> > +	while (sets--) {
> > +		u64 start, len, ino;
> > +
> > +		ceph_decode_64_safe(p, end, start, bad);
> > +		ceph_decode_64_safe(p, end, len, bad);
> > +		while (len--) {
> > +			int err = xa_insert(&s->s_delegated_inos, ino = start++,
> > +					    DELEGATED_INO_AVAILABLE,
> > +					    GFP_KERNEL);
> > +			if (!err) {
> > +				dout("added delegated inode 0x%llx\n",
> > +				     start - 1);
> > +			} else if (err == -EBUSY) {
> > +				pr_warn("ceph: MDS delegated inode 0x%llx more than once.\n",
> > +					start - 1);
> > +			} else {
> > +				return err;
> > +			}
> > +		}
> > +	}
> > +	return 0;
> > +bad:
> > +	return -EIO;
> > +}
> > +
> > +u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
> > +{
> > +	unsigned long ino;
> > +	void *val;
> > +
> > +	xa_for_each(&s->s_delegated_inos, ino, val) {
> > +		val = xa_erase(&s->s_delegated_inos, ino);
> > +		if (val == DELEGATED_INO_AVAILABLE)
> > +			return ino;
> > +	}
> > +	return 0;
> > +}
> > +
> > +int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
> > +{
> > +	return xa_insert(&s->s_delegated_inos, ino, DELEGATED_INO_AVAILABLE,
> > +			 GFP_KERNEL);
> > +}
> > +#else /* BITS_PER_LONG == 64 */
> > +/*
> > + * FIXME: xarrays can't handle 64-bit indexes on a 32-bit arch. For now, just
> > + * ignore delegated_inos on 32 bit arch. Maybe eventually add xarrays for top
> > + * and bottom words?
> > + */
> > +static int ceph_parse_deleg_inos(void **p, void *end,
> > +				 struct ceph_mds_session *s)
> > +{
> > +	u32 sets;
> > +
> > +	ceph_decode_32_safe(p, end, sets, bad);
> > +	if (sets)
> > +		ceph_decode_skip_n(p, end, sets * 2 * sizeof(__le64), bad);
> > +	return 0;
> > +bad:
> > +	return -EIO;
> > +}
> > +
> > +u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
> > +{
> > +	return 0;
> > +}
> > +
> > +int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
> > +{
> > +	return 0;
> > +}
> > +#endif /* BITS_PER_LONG == 64 */
> > +
> >  /*
> >   * parse create results
> >   */
> >  static int parse_reply_info_create(void **p, void *end,
> >  				  struct ceph_mds_reply_info_parsed *info,
> > -				  u64 features)
> > +				  u64 features, struct ceph_mds_session *s)
> >  {
> > +	int ret;
> > +
> >  	if (features == (u64)-1 ||
> >  	    (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {
> > -		/* Malformed reply? */
> >  		if (*p == end) {
> > +			/* Malformed reply? */
> >  			info->has_create_ino = false;
> > -		} else {
> > +		} else if (test_bit(CEPHFS_FEATURE_DELEG_INO, &s->s_features)) {
> > +			u8 struct_v, struct_compat;
> > +			u32 len;
> > +
> >  			info->has_create_ino = true;
> > +			ceph_decode_8_safe(p, end, struct_v, bad);
> > +			ceph_decode_8_safe(p, end, struct_compat, bad);
> > +			ceph_decode_32_safe(p, end, len, bad);
> > +			ceph_decode_64_safe(p, end, info->ino, bad);
> 
> I've done a quick test in current 'testing' branch and it seems that it's
> currently broken.  A bisect identified this commit as 'bad' and it's
> failing at this point.
> 
> I'm running an old (a few weeks old) 'master' vstart cluster, so it doesn't
> have the bits needed for this DELEG_INO feature.  Running xfstests
> generic/001 results in:
> 
>    ceph: mds parse_reply err -5
>    ceph: mdsc_handle_reply got corrupt reply mds0(tid:9)
>    ...
> 
> s->s_features does have the CEPHFS_FEATURE_DELEG_INO bit set;
> 'features' is -1 (0xffffffffffffffff) and s->s_features is 0x3fff.  Maybe
> the issue is actually somewhere else (the cephfs feature handling code),
> but I'm still looking.
> 

From the patch that added this feature in userland ceph code (commit
2bcf4b62643b5):

--- a/src/mds/cephfs_features.h
+++ b/src/mds/cephfs_features.h
@@ -32,6 +32,7 @@
 #define CEPHFS_FEATURE_LAZY_CAP_WANTED  11
 #define CEPHFS_FEATURE_MULTI_RECONNECT  12
 #define CEPHFS_FEATURE_NAUTILUS         12
+#define CEPHFS_FEATURE_DELEG_INO        13
 #define CEPHFS_FEATURE_OCTOPUS          13
 
 #define CEPHFS_FEATURES_ALL {          \
@@ -45,6 +46,7 @@
   CEPHFS_FEATURE_LAZY_CAP_WANTED,      \
   CEPHFS_FEATURE_MULTI_RECONNECT,      \
   CEPHFS_FEATURE_NAUTILUS,              \
+  CEPHFS_FEATURE_DELEG_INO,             \
   CEPHFS_FEATURE_OCTOPUS,               \
 }

...this feature was added under the aegis of the
CEPHFS_FEATURE_DELEG_INO flag, but that bit is shared with
CEPHFS_FEATURE_OCTOPUS, which was already enabled in octopus before we
ever added it (back on April 1st 2019).

Any version of the MDS that has commit 49930ad8a3402 but does not have 
2bcf4b62643b5 will not work properly with newer kernels. Personally, I
don't see that as a problem per se, as that should only be the case with
bleeding-edge MDS builds. Official releases should never see this issue.

Going forward, I think commit 49930ad8a3402 was probably a bad idea. We
really should not add "release" cephfs feature bits to the mask until
just before an official release, and should just make it alias the last
"real" feature bit. That should help ensure that we don't hit this
problem in the future.
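
To make the aliasing concrete, here is a minimal userspace sketch (not
kernel code). The bit numbers come from the hunk above and 0x3fff is the
s_features value Luis reported; feature_test() and
client_uses_deleg_decoding() are names made up purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers copied from the cephfs_features.h hunk quoted above. */
#define CEPHFS_FEATURE_MULTI_RECONNECT	12
#define CEPHFS_FEATURE_DELEG_INO	13
#define CEPHFS_FEATURE_OCTOPUS		13	/* "release" bit, same number */

/* Userspace stand-in for the kernel's test_bit() on a single 64-bit word. */
static bool feature_test(uint64_t features, unsigned int bit)
{
	assert(bit < 64);	/* shifting by 64 or more would be undefined */
	return (features >> bit) & 1;
}

/*
 * Hypothetical helper: would the client choose the new (versioned)
 * create-reply decoding for an MDS that advertises this feature mask?
 */
static bool client_uses_deleg_decoding(uint64_t mds_features)
{
	return feature_test(mds_features, CEPHFS_FEATURE_DELEG_INO);
}
```

With the reported mask (0x3fff) the check returns true, because bit 13
is set on behalf of OCTOPUS, so the client takes the versioned decoding
path even if that MDS still sends the legacy reply. With 0x1fff (bit 13
clear) it takes the legacy path.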

> > +			ret = ceph_parse_deleg_inos(p, end, s);
> > +			if (ret)
> > +				return ret;
> > +		} else {
> > +			/* legacy */
> >  			ceph_decode_64_safe(p, end, info->ino, bad);
> > +			info->has_create_ino = true;
> >  		}
> >  	} else {
> >  		if (*p != end)
> > @@ -448,7 +548,7 @@ static int parse_reply_info_create(void **p, void *end,
> >   */
> >  static int parse_reply_info_extra(void **p, void *end,
> >  				  struct ceph_mds_reply_info_parsed *info,
> > -				  u64 features)
> > +				  u64 features, struct ceph_mds_session *s)
> >  {
> >  	u32 op = le32_to_cpu(info->head->op);
> >  
> > @@ -457,7 +557,7 @@ static int parse_reply_info_extra(void **p, void *end,
> >  	else if (op == CEPH_MDS_OP_READDIR || op == CEPH_MDS_OP_LSSNAP)
> >  		return parse_reply_info_readdir(p, end, info, features);
> >  	else if (op == CEPH_MDS_OP_CREATE)
> > -		return parse_reply_info_create(p, end, info, features);
> > +		return parse_reply_info_create(p, end, info, features, s);
> >  	else
> >  		return -EIO;
> >  }
> > @@ -465,7 +565,7 @@ static int parse_reply_info_extra(void **p, void *end,
> >  /*
> >   * parse entire mds reply
> >   */
> > -static int parse_reply_info(struct ceph_msg *msg,
> > +static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
> >  			    struct ceph_mds_reply_info_parsed *info,
> >  			    u64 features)
> >  {
> > @@ -490,7 +590,7 @@ static int parse_reply_info(struct ceph_msg *msg,
> >  	ceph_decode_32_safe(&p, end, len, bad);
> >  	if (len > 0) {
> >  		ceph_decode_need(&p, end, len, bad);
> > -		err = parse_reply_info_extra(&p, p+len, info, features);
> > +		err = parse_reply_info_extra(&p, p+len, info, features, s);
> >  		if (err < 0)
> >  			goto out_bad;
> >  	}
> > @@ -558,6 +658,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
> >  	if (refcount_dec_and_test(&s->s_ref)) {
> >  		if (s->s_auth.authorizer)
> >  			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
> > +		xa_destroy(&s->s_delegated_inos);
> >  		kfree(s);
> >  	}
> >  }
> > @@ -645,6 +746,7 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
> >  	refcount_set(&s->s_ref, 1);
> >  	INIT_LIST_HEAD(&s->s_waiting);
> >  	INIT_LIST_HEAD(&s->s_unsafe);
> > +	xa_init(&s->s_delegated_inos);
> >  	s->s_num_cap_releases = 0;
> >  	s->s_cap_reconnect = 0;
> >  	s->s_cap_iterator = NULL;
> > @@ -2975,9 +3077,9 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
> >  	dout("handle_reply tid %lld result %d\n", tid, result);
> >  	rinfo = &req->r_reply_info;
> >  	if (test_bit(CEPHFS_FEATURE_REPLY_ENCODING, &session->s_features))
> > -		err = parse_reply_info(msg, rinfo, (u64)-1);
> > +		err = parse_reply_info(session, msg, rinfo, (u64)-1);
> >  	else
> > -		err = parse_reply_info(msg, rinfo, session->s_con.peer_features);
> > +		err = parse_reply_info(session, msg, rinfo, session->s_con.peer_features);
> >  	mutex_unlock(&mdsc->mutex);
> >  
> >  	mutex_lock(&session->s_mutex);
> > @@ -3673,6 +3775,8 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> >  	if (!reply)
> >  		goto fail_nomsg;
> >  
> > +	xa_destroy(&session->s_delegated_inos);
> > +
> >  	mutex_lock(&session->s_mutex);
> >  	session->s_state = CEPH_MDS_SESSION_RECONNECTING;
> >  	session->s_seq = 0;
> > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > index f10d342ea585..4c3b71707470 100644
> > --- a/fs/ceph/mds_client.h
> > +++ b/fs/ceph/mds_client.h
> > @@ -23,8 +23,9 @@ enum ceph_feature_type {
> >  	CEPHFS_FEATURE_RECLAIM_CLIENT,
> >  	CEPHFS_FEATURE_LAZY_CAP_WANTED,
> >  	CEPHFS_FEATURE_MULTI_RECONNECT,
> > +	CEPHFS_FEATURE_DELEG_INO,
> >  
> > -	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_MULTI_RECONNECT,
> > +	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_DELEG_INO,
> >  };
> >  
> >  /*
> > @@ -37,6 +38,7 @@ enum ceph_feature_type {
> >  	CEPHFS_FEATURE_REPLY_ENCODING,		\
> >  	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
> >  	CEPHFS_FEATURE_MULTI_RECONNECT,		\
> > +	CEPHFS_FEATURE_DELEG_INO,		\
> >  						\
> >  	CEPHFS_FEATURE_MAX,			\
> >  }
> > @@ -201,6 +203,7 @@ struct ceph_mds_session {
> >  
> >  	struct list_head  s_waiting;  /* waiting requests */
> >  	struct list_head  s_unsafe;   /* unsafe requests */
> > +	struct xarray	  s_delegated_inos;
> >  };
> >  
> >  /*
> > @@ -542,6 +545,7 @@ extern void ceph_mdsc_open_export_target_sessions(struct ceph_mds_client *mdsc,
> >  extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
> >  			  struct ceph_mds_session *session,
> >  			  int max_caps);
> > +
> >  static inline int ceph_wait_on_async_create(struct inode *inode)
> >  {
> >  	struct ceph_inode_info *ci = ceph_inode(inode);
> > @@ -549,4 +553,7 @@ static inline int ceph_wait_on_async_create(struct inode *inode)
> >  	return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
> >  			   TASK_INTERRUPTIBLE);
> >  }
> > +
> > +extern u64 ceph_get_deleg_ino(struct ceph_mds_session *session);
> > +extern int ceph_restore_deleg_ino(struct ceph_mds_session *session, u64 ino);
> >  #endif
> > -- 
> > 2.24.1
> > 

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH v6 10/13] ceph: decode interval_sets for delegated inos
  2020-03-05 12:02     ` Jeff Layton
@ 2020-03-05 12:20       ` Luis Henriques
  2020-03-05 13:36       ` Ilya Dryomov
  1 sibling, 0 replies; 21+ messages in thread
From: Luis Henriques @ 2020-03-05 12:20 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel, idryomov, sage, zyan, pdonnell

On Thu, Mar 05, 2020 at 07:02:59AM -0500, Jeff Layton wrote:
> On Thu, 2020-03-05 at 11:45 +0000, Luis Henriques wrote:
> > On Mon, Mar 02, 2020 at 09:14:31AM -0500, Jeff Layton wrote:
> > > Starting in Octopus, the MDS will hand out caps that allow the client
> > > to do asynchronous file creates under certain conditions. As part of
> > > that, the MDS will delegate ranges of inode numbers to the client.
> > > 
> > > Add the infrastructure to decode these ranges, and stuff them into an
> > > xarray for later consumption by the async creation code.
> > > 
> > > Because the xarray code currently only handles unsigned long indexes,
> > > and those are 32-bits on 32-bit arches, we only enable the decoding when
> > > running on a 64-bit arch.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/ceph/mds_client.c | 122 +++++++++++++++++++++++++++++++++++++++----
> > >  fs/ceph/mds_client.h |   9 +++-
> > >  2 files changed, 121 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > index db8304447f35..87f75d05b004 100644
> > > --- a/fs/ceph/mds_client.c
> > > +++ b/fs/ceph/mds_client.c
> > > @@ -415,21 +415,121 @@ static int parse_reply_info_filelock(void **p, void *end,
> > >  	return -EIO;
> > >  }
> > >  
> > > +
> > > +#if BITS_PER_LONG == 64
> > > +
> > > +#define DELEGATED_INO_AVAILABLE		xa_mk_value(1)
> > > +
> > > +static int ceph_parse_deleg_inos(void **p, void *end,
> > > +				 struct ceph_mds_session *s)
> > > +{
> > > +	u32 sets;
> > > +
> > > +	ceph_decode_32_safe(p, end, sets, bad);
> > > +	dout("got %u sets of delegated inodes\n", sets);
> > > +	while (sets--) {
> > > +		u64 start, len, ino;
> > > +
> > > +		ceph_decode_64_safe(p, end, start, bad);
> > > +		ceph_decode_64_safe(p, end, len, bad);
> > > +		while (len--) {
> > > +			int err = xa_insert(&s->s_delegated_inos, ino = start++,
> > > +					    DELEGATED_INO_AVAILABLE,
> > > +					    GFP_KERNEL);
> > > +			if (!err) {
> > > +				dout("added delegated inode 0x%llx\n",
> > > +				     start - 1);
> > > +			} else if (err == -EBUSY) {
> > > +				pr_warn("ceph: MDS delegated inode 0x%llx more than once.\n",
> > > +					start - 1);
> > > +			} else {
> > > +				return err;
> > > +			}
> > > +		}
> > > +	}
> > > +	return 0;
> > > +bad:
> > > +	return -EIO;
> > > +}
> > > +
> > > +u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
> > > +{
> > > +	unsigned long ino;
> > > +	void *val;
> > > +
> > > +	xa_for_each(&s->s_delegated_inos, ino, val) {
> > > +		val = xa_erase(&s->s_delegated_inos, ino);
> > > +		if (val == DELEGATED_INO_AVAILABLE)
> > > +			return ino;
> > > +	}
> > > +	return 0;
> > > +}
> > > +
> > > +int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
> > > +{
> > > +	return xa_insert(&s->s_delegated_inos, ino, DELEGATED_INO_AVAILABLE,
> > > +			 GFP_KERNEL);
> > > +}
> > > +#else /* BITS_PER_LONG == 64 */
> > > +/*
> > > + * FIXME: xarrays can't handle 64-bit indexes on a 32-bit arch. For now, just
> > > + * ignore delegated_inos on 32 bit arch. Maybe eventually add xarrays for top
> > > + * and bottom words?
> > > + */
> > > +static int ceph_parse_deleg_inos(void **p, void *end,
> > > +				 struct ceph_mds_session *s)
> > > +{
> > > +	u32 sets;
> > > +
> > > +	ceph_decode_32_safe(p, end, sets, bad);
> > > +	if (sets)
> > > +		ceph_decode_skip_n(p, end, sets * 2 * sizeof(__le64), bad);
> > > +	return 0;
> > > +bad:
> > > +	return -EIO;
> > > +}
> > > +
> > > +u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > > +int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
> > > +{
> > > +	return 0;
> > > +}
> > > +#endif /* BITS_PER_LONG == 64 */
> > > +
> > >  /*
> > >   * parse create results
> > >   */
> > >  static int parse_reply_info_create(void **p, void *end,
> > >  				  struct ceph_mds_reply_info_parsed *info,
> > > -				  u64 features)
> > > +				  u64 features, struct ceph_mds_session *s)
> > >  {
> > > +	int ret;
> > > +
> > >  	if (features == (u64)-1 ||
> > >  	    (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {
> > > -		/* Malformed reply? */
> > >  		if (*p == end) {
> > > +			/* Malformed reply? */
> > >  			info->has_create_ino = false;
> > > -		} else {
> > > +		} else if (test_bit(CEPHFS_FEATURE_DELEG_INO, &s->s_features)) {
> > > +			u8 struct_v, struct_compat;
> > > +			u32 len;
> > > +
> > >  			info->has_create_ino = true;
> > > +			ceph_decode_8_safe(p, end, struct_v, bad);
> > > +			ceph_decode_8_safe(p, end, struct_compat, bad);
> > > +			ceph_decode_32_safe(p, end, len, bad);
> > > +			ceph_decode_64_safe(p, end, info->ino, bad);
> > 
> > I've done a quick test in current 'testing' branch and it seems that it's
> > currently broken.  A bisect identified this commit as 'bad' and it's
> > failing at this point.
> > 
> > I'm running an old (a few weeks) 'master' vstart cluster, so I don't have
> > the needed bits for using this DELEG_INO feature.  Running xfstest
> > generic/001 results in:
> > 
> >    ceph: mds parse_reply err -5
> >    ceph: mdsc_handle_reply got corrupt reply mds0(tid:9)
> >    ...
> > 
> > s->s_features does include the CEPHFS_FEATURE_DELEG_INO bit set;
> > 'features' is -1 (0xffffffffffffffff) and s->s_features is 0x3fff.  Maybe
> > the issue is actually somewhere else (the cephfs feature handling code),
> > but I'm still looking.
> > 
> 
> From the patch that added this feature in userland ceph code (commit
> 2bcf4b62643b5):
> 
> --- a/src/mds/cephfs_features.h
> +++ b/src/mds/cephfs_features.h
> @@ -32,6 +32,7 @@
>  #define CEPHFS_FEATURE_LAZY_CAP_WANTED  11
>  #define CEPHFS_FEATURE_MULTI_RECONNECT  12
>  #define CEPHFS_FEATURE_NAUTILUS         12
> +#define CEPHFS_FEATURE_DELEG_INO        13
>  #define CEPHFS_FEATURE_OCTOPUS          13
>  
>  #define CEPHFS_FEATURES_ALL {          \
> @@ -45,6 +46,7 @@
>    CEPHFS_FEATURE_LAZY_CAP_WANTED,      \
>    CEPHFS_FEATURE_MULTI_RECONNECT,      \
>    CEPHFS_FEATURE_NAUTILUS,              \
> +  CEPHFS_FEATURE_DELEG_INO,             \
>    CEPHFS_FEATURE_OCTOPUS,               \
>  }
> 
> ...this feature was added under the aegis of the
> CEPHFS_FEATURE_DELEG_INO flag, but that bit is shared with
> CEPHFS_FEATURE_OCTOPUS, which was already enabled in octopus before we
> ever added it (back on April 1st 2019).
> 
> Any version of the MDS that has commit 49930ad8a3402 but does not have 
> 2bcf4b62643b5 will not work properly with newer kernels. Personally, I
> don't see that as a problem per se, as that should only be the case with
> bleeding-edge MDS builds. Official releases should never see this issue.
> 
> Going forward, I think commit 49930ad8a3402 was probably a bad idea. We
> really should not add "release" cephfs feature bits to the mask until
> just before an official release, and should just make it alias the last
> "real" feature bit. That should help ensure that we don't hit this
> problem in the future.

Doh!  Thanks a lot for figuring this out, Jeff!  You may have saved me a
few hours with this explanation!

Cheers,
--
Luís


> 
> > > +			ret = ceph_parse_deleg_inos(p, end, s);
> > > +			if (ret)
> > > +				return ret;
> > > +		} else {
> > > +			/* legacy */
> > >  			ceph_decode_64_safe(p, end, info->ino, bad);
> > > +			info->has_create_ino = true;
> > >  		}
> > >  	} else {
> > >  		if (*p != end)
> > > @@ -448,7 +548,7 @@ static int parse_reply_info_create(void **p, void *end,
> > >   */
> > >  static int parse_reply_info_extra(void **p, void *end,
> > >  				  struct ceph_mds_reply_info_parsed *info,
> > > -				  u64 features)
> > > +				  u64 features, struct ceph_mds_session *s)
> > >  {
> > >  	u32 op = le32_to_cpu(info->head->op);
> > >  
> > > @@ -457,7 +557,7 @@ static int parse_reply_info_extra(void **p, void *end,
> > >  	else if (op == CEPH_MDS_OP_READDIR || op == CEPH_MDS_OP_LSSNAP)
> > >  		return parse_reply_info_readdir(p, end, info, features);
> > >  	else if (op == CEPH_MDS_OP_CREATE)
> > > -		return parse_reply_info_create(p, end, info, features);
> > > +		return parse_reply_info_create(p, end, info, features, s);
> > >  	else
> > >  		return -EIO;
> > >  }
> > > @@ -465,7 +565,7 @@ static int parse_reply_info_extra(void **p, void *end,
> > >  /*
> > >   * parse entire mds reply
> > >   */
> > > -static int parse_reply_info(struct ceph_msg *msg,
> > > +static int parse_reply_info(struct ceph_mds_session *s, struct ceph_msg *msg,
> > >  			    struct ceph_mds_reply_info_parsed *info,
> > >  			    u64 features)
> > >  {
> > > @@ -490,7 +590,7 @@ static int parse_reply_info(struct ceph_msg *msg,
> > >  	ceph_decode_32_safe(&p, end, len, bad);
> > >  	if (len > 0) {
> > >  		ceph_decode_need(&p, end, len, bad);
> > > -		err = parse_reply_info_extra(&p, p+len, info, features);
> > > +		err = parse_reply_info_extra(&p, p+len, info, features, s);
> > >  		if (err < 0)
> > >  			goto out_bad;
> > >  	}
> > > @@ -558,6 +658,7 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
> > >  	if (refcount_dec_and_test(&s->s_ref)) {
> > >  		if (s->s_auth.authorizer)
> > >  			ceph_auth_destroy_authorizer(s->s_auth.authorizer);
> > > +		xa_destroy(&s->s_delegated_inos);
> > >  		kfree(s);
> > >  	}
> > >  }
> > > @@ -645,6 +746,7 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc,
> > >  	refcount_set(&s->s_ref, 1);
> > >  	INIT_LIST_HEAD(&s->s_waiting);
> > >  	INIT_LIST_HEAD(&s->s_unsafe);
> > > +	xa_init(&s->s_delegated_inos);
> > >  	s->s_num_cap_releases = 0;
> > >  	s->s_cap_reconnect = 0;
> > >  	s->s_cap_iterator = NULL;
> > > @@ -2975,9 +3077,9 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
> > >  	dout("handle_reply tid %lld result %d\n", tid, result);
> > >  	rinfo = &req->r_reply_info;
> > >  	if (test_bit(CEPHFS_FEATURE_REPLY_ENCODING, &session->s_features))
> > > -		err = parse_reply_info(msg, rinfo, (u64)-1);
> > > +		err = parse_reply_info(session, msg, rinfo, (u64)-1);
> > >  	else
> > > -		err = parse_reply_info(msg, rinfo, session->s_con.peer_features);
> > > +		err = parse_reply_info(session, msg, rinfo, session->s_con.peer_features);
> > >  	mutex_unlock(&mdsc->mutex);
> > >  
> > >  	mutex_lock(&session->s_mutex);
> > > @@ -3673,6 +3775,8 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> > >  	if (!reply)
> > >  		goto fail_nomsg;
> > >  
> > > +	xa_destroy(&session->s_delegated_inos);
> > > +
> > >  	mutex_lock(&session->s_mutex);
> > >  	session->s_state = CEPH_MDS_SESSION_RECONNECTING;
> > >  	session->s_seq = 0;
> > > diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> > > index f10d342ea585..4c3b71707470 100644
> > > --- a/fs/ceph/mds_client.h
> > > +++ b/fs/ceph/mds_client.h
> > > @@ -23,8 +23,9 @@ enum ceph_feature_type {
> > >  	CEPHFS_FEATURE_RECLAIM_CLIENT,
> > >  	CEPHFS_FEATURE_LAZY_CAP_WANTED,
> > >  	CEPHFS_FEATURE_MULTI_RECONNECT,
> > > +	CEPHFS_FEATURE_DELEG_INO,
> > >  
> > > -	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_MULTI_RECONNECT,
> > > +	CEPHFS_FEATURE_MAX = CEPHFS_FEATURE_DELEG_INO,
> > >  };
> > >  
> > >  /*
> > > @@ -37,6 +38,7 @@ enum ceph_feature_type {
> > >  	CEPHFS_FEATURE_REPLY_ENCODING,		\
> > >  	CEPHFS_FEATURE_LAZY_CAP_WANTED,		\
> > >  	CEPHFS_FEATURE_MULTI_RECONNECT,		\
> > > +	CEPHFS_FEATURE_DELEG_INO,		\
> > >  						\
> > >  	CEPHFS_FEATURE_MAX,			\
> > >  }
> > > @@ -201,6 +203,7 @@ struct ceph_mds_session {
> > >  
> > >  	struct list_head  s_waiting;  /* waiting requests */
> > >  	struct list_head  s_unsafe;   /* unsafe requests */
> > > +	struct xarray	  s_delegated_inos;
> > >  };
> > >  
> > >  /*
> > > @@ -542,6 +545,7 @@ extern void ceph_mdsc_open_export_target_sessions(struct ceph_mds_client *mdsc,
> > >  extern int ceph_trim_caps(struct ceph_mds_client *mdsc,
> > >  			  struct ceph_mds_session *session,
> > >  			  int max_caps);
> > > +
> > >  static inline int ceph_wait_on_async_create(struct inode *inode)
> > >  {
> > >  	struct ceph_inode_info *ci = ceph_inode(inode);
> > > @@ -549,4 +553,7 @@ static inline int ceph_wait_on_async_create(struct inode *inode)
> > >  	return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
> > >  			   TASK_INTERRUPTIBLE);
> > >  }
> > > +
> > > +extern u64 ceph_get_deleg_ino(struct ceph_mds_session *session);
> > > +extern int ceph_restore_deleg_ino(struct ceph_mds_session *session, u64 ino);
> > >  #endif
> > > -- 
> > > 2.24.1
> > > 
> 
> -- 
> Jeff Layton <jlayton@kernel.org>
> 

* Re: [PATCH v6 10/13] ceph: decode interval_sets for delegated inos
  2020-03-05 12:02     ` Jeff Layton
  2020-03-05 12:20       ` Luis Henriques
@ 2020-03-05 13:36       ` Ilya Dryomov
  2020-03-05 13:44         ` Jeff Layton
  1 sibling, 1 reply; 21+ messages in thread
From: Ilya Dryomov @ 2020-03-05 13:36 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Luis Henriques, Ceph Development, Sage Weil, Yan, Zheng,
	Patrick Donnelly

On Thu, Mar 5, 2020 at 1:03 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Thu, 2020-03-05 at 11:45 +0000, Luis Henriques wrote:
> > On Mon, Mar 02, 2020 at 09:14:31AM -0500, Jeff Layton wrote:
> > > Starting in Octopus, the MDS will hand out caps that allow the client
> > > to do asynchronous file creates under certain conditions. As part of
> > > that, the MDS will delegate ranges of inode numbers to the client.
> > >
> > > Add the infrastructure to decode these ranges, and stuff them into an
> > > xarray for later consumption by the async creation code.
> > >
> > > Because the xarray code currently only handles unsigned long indexes,
> > > and those are 32-bits on 32-bit arches, we only enable the decoding when
> > > running on a 64-bit arch.
> > >
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/ceph/mds_client.c | 122 +++++++++++++++++++++++++++++++++++++++----
> > >  fs/ceph/mds_client.h |   9 +++-
> > >  2 files changed, 121 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > index db8304447f35..87f75d05b004 100644
> > > --- a/fs/ceph/mds_client.c
> > > +++ b/fs/ceph/mds_client.c
> > > @@ -415,21 +415,121 @@ static int parse_reply_info_filelock(void **p, void *end,
> > >     return -EIO;
> > >  }
> > >
> > > +
> > > +#if BITS_PER_LONG == 64
> > > +
> > > +#define DELEGATED_INO_AVAILABLE            xa_mk_value(1)
> > > +
> > > +static int ceph_parse_deleg_inos(void **p, void *end,
> > > +                            struct ceph_mds_session *s)
> > > +{
> > > +   u32 sets;
> > > +
> > > +   ceph_decode_32_safe(p, end, sets, bad);
> > > +   dout("got %u sets of delegated inodes\n", sets);
> > > +   while (sets--) {
> > > +           u64 start, len, ino;
> > > +
> > > +           ceph_decode_64_safe(p, end, start, bad);
> > > +           ceph_decode_64_safe(p, end, len, bad);
> > > +           while (len--) {
> > > +                   int err = xa_insert(&s->s_delegated_inos, ino = start++,
> > > +                                       DELEGATED_INO_AVAILABLE,
> > > +                                       GFP_KERNEL);
> > > +                   if (!err) {
> > > +                           dout("added delegated inode 0x%llx\n",
> > > +                                start - 1);
> > > +                   } else if (err == -EBUSY) {
> > > +                           pr_warn("ceph: MDS delegated inode 0x%llx more than once.\n",
> > > +                                   start - 1);
> > > +                   } else {
> > > +                           return err;
> > > +                   }
> > > +           }
> > > +   }
> > > +   return 0;
> > > +bad:
> > > +   return -EIO;
> > > +}
> > > +
> > > +u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
> > > +{
> > > +   unsigned long ino;
> > > +   void *val;
> > > +
> > > +   xa_for_each(&s->s_delegated_inos, ino, val) {
> > > +           val = xa_erase(&s->s_delegated_inos, ino);
> > > +           if (val == DELEGATED_INO_AVAILABLE)
> > > +                   return ino;
> > > +   }
> > > +   return 0;
> > > +}
> > > +
> > > +int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
> > > +{
> > > +   return xa_insert(&s->s_delegated_inos, ino, DELEGATED_INO_AVAILABLE,
> > > +                    GFP_KERNEL);
> > > +}
> > > +#else /* BITS_PER_LONG == 64 */
> > > +/*
> > > + * FIXME: xarrays can't handle 64-bit indexes on a 32-bit arch. For now, just
> > > + * ignore delegated_inos on 32 bit arch. Maybe eventually add xarrays for top
> > > + * and bottom words?
> > > + */
> > > +static int ceph_parse_deleg_inos(void **p, void *end,
> > > +                            struct ceph_mds_session *s)
> > > +{
> > > +   u32 sets;
> > > +
> > > +   ceph_decode_32_safe(p, end, sets, bad);
> > > +   if (sets)
> > > +           ceph_decode_skip_n(p, end, sets * 2 * sizeof(__le64), bad);
> > > +   return 0;
> > > +bad:
> > > +   return -EIO;
> > > +}
> > > +
> > > +u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
> > > +{
> > > +   return 0;
> > > +}
> > > +
> > > +int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
> > > +{
> > > +   return 0;
> > > +}
> > > +#endif /* BITS_PER_LONG == 64 */
> > > +
> > >  /*
> > >   * parse create results
> > >   */
> > >  static int parse_reply_info_create(void **p, void *end,
> > >                               struct ceph_mds_reply_info_parsed *info,
> > > -                             u64 features)
> > > +                             u64 features, struct ceph_mds_session *s)
> > >  {
> > > +   int ret;
> > > +
> > >     if (features == (u64)-1 ||
> > >         (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {
> > > -           /* Malformed reply? */
> > >             if (*p == end) {
> > > +                   /* Malformed reply? */
> > >                     info->has_create_ino = false;
> > > -           } else {
> > > +           } else if (test_bit(CEPHFS_FEATURE_DELEG_INO, &s->s_features)) {
> > > +                   u8 struct_v, struct_compat;
> > > +                   u32 len;
> > > +
> > >                     info->has_create_ino = true;
> > > +                   ceph_decode_8_safe(p, end, struct_v, bad);
> > > +                   ceph_decode_8_safe(p, end, struct_compat, bad);
> > > +                   ceph_decode_32_safe(p, end, len, bad);
> > > +                   ceph_decode_64_safe(p, end, info->ino, bad);
> >
> > I've done a quick test in current 'testing' branch and it seems that it's
> > currently broken.  A bisect identified this commit as 'bad' and it's
> > failing at this point.
> >
> > I'm running an old (a few weeks) 'master' vstart cluster, so I don't have
> > the needed bits for using this DELEG_INO feature.  Running xfstest
> > generic/001 results in:
> >
> >    ceph: mds parse_reply err -5
> >    ceph: mdsc_handle_reply got corrupt reply mds0(tid:9)
> >    ...
> >
> > s->s_features does include the CEPHFS_FEATURE_DELEG_INO bit set;
> > 'features' is -1 (0xffffffffffffffff) and s->s_features is 0x3fff.  Maybe
> > the issue is actually somewhere else (the cephfs feature handling code),
> > but I'm still looking.
> >
>
> From the patch that added this feature in userland ceph code (commit
> 2bcf4b62643b5):
>
> --- a/src/mds/cephfs_features.h
> +++ b/src/mds/cephfs_features.h
> @@ -32,6 +32,7 @@
>  #define CEPHFS_FEATURE_LAZY_CAP_WANTED  11
>  #define CEPHFS_FEATURE_MULTI_RECONNECT  12
>  #define CEPHFS_FEATURE_NAUTILUS         12
> +#define CEPHFS_FEATURE_DELEG_INO        13
>  #define CEPHFS_FEATURE_OCTOPUS          13
>
>  #define CEPHFS_FEATURES_ALL {          \
> @@ -45,6 +46,7 @@
>    CEPHFS_FEATURE_LAZY_CAP_WANTED,      \
>    CEPHFS_FEATURE_MULTI_RECONNECT,      \
>    CEPHFS_FEATURE_NAUTILUS,              \
> +  CEPHFS_FEATURE_DELEG_INO,             \
>    CEPHFS_FEATURE_OCTOPUS,               \
>  }
>
> ...this feature was added under the aegis of the
> CEPHFS_FEATURE_DELEG_INO flag, but that bit is shared with
> CEPHFS_FEATURE_OCTOPUS, which was already enabled in octopus before we
> ever added it (back on April 1st 2019).
>
> Any version of the MDS that has commit 49930ad8a3402 but does not have
> 2bcf4b62643b5 will not work properly with newer kernels. Personally, I
> don't see that as a problem per se, as that should only be the case with
> bleeding-edge MDS builds. Official releases should never see this issue.
>
> Going forward, I think commit 49930ad8a3402 was probably a bad idea. We
> really should not add "release" cephfs feature bits to the mask until
> just before an official release, and should just make it alias the last
> "real" feature bit. That should help ensure that we don't hit this
> problem in the future.

We should avoid "release" feature bits altogether, as discussed in
the last CDM.  They appeared because we were running out of free bits
for RADOS features (a 64-bit field).  CephFS features are encoded in
a bit vector that can grow as needed.

Feature masks for groups of related feature bits make sense, but they
should be handled at a higher level.  For example, we could define the
set of feature bits that a client must support for inline data to work
properly, and flip that set to required when the user runs "ceph fs set
<fsname> inline_data true".  The actual feature bits in the bit vector
should all be distinct -- no overlaps.
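
As a rough userspace sketch of that split (distinct bits in a growable
vector, with groups checked one level up), something like the following
would do.  The struct and both helpers are hypothetical names for
illustration only, not existing kernel or MDS interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A feature bit vector that can span as many 64-bit words as needed. */
struct feature_vec {
	const uint64_t *words;
	size_t nwords;
};

/* Test one distinct feature bit; bits past the end count as unset. */
static bool feature_vec_test(const struct feature_vec *v, unsigned int bit)
{
	size_t word = bit / 64;

	assert(v != NULL);
	if (word >= v->nwords)
		return false;
	return (v->words[word] >> (bit % 64)) & 1;
}

/*
 * Higher-level check: a "group" is just a list of distinct bit numbers,
 * and it is supported only if every one of them is set in the vector.
 */
static bool feature_group_supported(const struct feature_vec *v,
				    const unsigned int *group, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!feature_vec_test(v, group[i]))
			return false;
	return true;
}
```

Since a group is only a list of bit numbers, adding or retiring one
never consumes a bit in the vector itself, so no two features end up
sharing a number.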

Thanks,

                Ilya

* Re: [PATCH v6 10/13] ceph: decode interval_sets for delegated inos
  2020-03-05 13:36       ` Ilya Dryomov
@ 2020-03-05 13:44         ` Jeff Layton
  0 siblings, 0 replies; 21+ messages in thread
From: Jeff Layton @ 2020-03-05 13:44 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Luis Henriques, Ceph Development, Sage Weil, Yan, Zheng,
	Patrick Donnelly

On Thu, 2020-03-05 at 14:36 +0100, Ilya Dryomov wrote:
> On Thu, Mar 5, 2020 at 1:03 PM Jeff Layton <jlayton@kernel.org> wrote:
> > On Thu, 2020-03-05 at 11:45 +0000, Luis Henriques wrote:
> > > On Mon, Mar 02, 2020 at 09:14:31AM -0500, Jeff Layton wrote:
> > > > Starting in Octopus, the MDS will hand out caps that allow the client
> > > > to do asynchronous file creates under certain conditions. As part of
> > > > that, the MDS will delegate ranges of inode numbers to the client.
> > > > 
> > > > Add the infrastructure to decode these ranges, and stuff them into an
> > > > xarray for later consumption by the async creation code.
> > > > 
> > > > Because the xarray code currently only handles unsigned long indexes,
> > > > and those are 32-bits on 32-bit arches, we only enable the decoding when
> > > > running on a 64-bit arch.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  fs/ceph/mds_client.c | 122 +++++++++++++++++++++++++++++++++++++++----
> > > >  fs/ceph/mds_client.h |   9 +++-
> > > >  2 files changed, 121 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > > index db8304447f35..87f75d05b004 100644
> > > > --- a/fs/ceph/mds_client.c
> > > > +++ b/fs/ceph/mds_client.c
> > > > @@ -415,21 +415,121 @@ static int parse_reply_info_filelock(void **p, void *end,
> > > >     return -EIO;
> > > >  }
> > > > 
> > > > +
> > > > +#if BITS_PER_LONG == 64
> > > > +
> > > > +#define DELEGATED_INO_AVAILABLE            xa_mk_value(1)
> > > > +
> > > > +static int ceph_parse_deleg_inos(void **p, void *end,
> > > > +                            struct ceph_mds_session *s)
> > > > +{
> > > > +   u32 sets;
> > > > +
> > > > +   ceph_decode_32_safe(p, end, sets, bad);
> > > > +   dout("got %u sets of delegated inodes\n", sets);
> > > > +   while (sets--) {
> > > > +           u64 start, len, ino;
> > > > +
> > > > +           ceph_decode_64_safe(p, end, start, bad);
> > > > +           ceph_decode_64_safe(p, end, len, bad);
> > > > +           while (len--) {
> > > > +                   int err = xa_insert(&s->s_delegated_inos, ino = start++,
> > > > +                                       DELEGATED_INO_AVAILABLE,
> > > > +                                       GFP_KERNEL);
> > > > +                   if (!err) {
> > > > +                           dout("added delegated inode 0x%llx\n",
> > > > +                                start - 1);
> > > > +                   } else if (err == -EBUSY) {
> > > > +                           pr_warn("ceph: MDS delegated inode 0x%llx more than once.\n",
> > > > +                                   start - 1);
> > > > +                   } else {
> > > > +                           return err;
> > > > +                   }
> > > > +           }
> > > > +   }
> > > > +   return 0;
> > > > +bad:
> > > > +   return -EIO;
> > > > +}
> > > > +
> > > > +u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
> > > > +{
> > > > +   unsigned long ino;
> > > > +   void *val;
> > > > +
> > > > +   xa_for_each(&s->s_delegated_inos, ino, val) {
> > > > +           val = xa_erase(&s->s_delegated_inos, ino);
> > > > +           if (val == DELEGATED_INO_AVAILABLE)
> > > > +                   return ino;
> > > > +   }
> > > > +   return 0;
> > > > +}
> > > > +
> > > > +int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
> > > > +{
> > > > +   return xa_insert(&s->s_delegated_inos, ino, DELEGATED_INO_AVAILABLE,
> > > > +                    GFP_KERNEL);
> > > > +}
> > > > +#else /* BITS_PER_LONG == 64 */
> > > > +/*
> > > > + * FIXME: xarrays can't handle 64-bit indexes on a 32-bit arch. For now, just
> > > > + * ignore delegated_inos on 32 bit arch. Maybe eventually add xarrays for top
> > > > + * and bottom words?
> > > > + */
> > > > +static int ceph_parse_deleg_inos(void **p, void *end,
> > > > +                            struct ceph_mds_session *s)
> > > > +{
> > > > +   u32 sets;
> > > > +
> > > > +   ceph_decode_32_safe(p, end, sets, bad);
> > > > +   if (sets)
> > > > +           ceph_decode_skip_n(p, end, sets * 2 * sizeof(__le64), bad);
> > > > +   return 0;
> > > > +bad:
> > > > +   return -EIO;
> > > > +}
> > > > +
> > > > +u64 ceph_get_deleg_ino(struct ceph_mds_session *s)
> > > > +{
> > > > +   return 0;
> > > > +}
> > > > +
> > > > +int ceph_restore_deleg_ino(struct ceph_mds_session *s, u64 ino)
> > > > +{
> > > > +   return 0;
> > > > +}
> > > > +#endif /* BITS_PER_LONG == 64 */
> > > > +
> > > >  /*
> > > >   * parse create results
> > > >   */
> > > >  static int parse_reply_info_create(void **p, void *end,
> > > >                               struct ceph_mds_reply_info_parsed *info,
> > > > -                             u64 features)
> > > > +                             u64 features, struct ceph_mds_session *s)
> > > >  {
> > > > +   int ret;
> > > > +
> > > >     if (features == (u64)-1 ||
> > > >         (features & CEPH_FEATURE_REPLY_CREATE_INODE)) {
> > > > -           /* Malformed reply? */
> > > >             if (*p == end) {
> > > > +                   /* Malformed reply? */
> > > >                     info->has_create_ino = false;
> > > > -           } else {
> > > > +           } else if (test_bit(CEPHFS_FEATURE_DELEG_INO, &s->s_features)) {
> > > > +                   u8 struct_v, struct_compat;
> > > > +                   u32 len;
> > > > +
> > > >                     info->has_create_ino = true;
> > > > +                   ceph_decode_8_safe(p, end, struct_v, bad);
> > > > +                   ceph_decode_8_safe(p, end, struct_compat, bad);
> > > > +                   ceph_decode_32_safe(p, end, len, bad);
> > > > +                   ceph_decode_64_safe(p, end, info->ino, bad);
> > > 
> > > I've done a quick test in the current 'testing' branch and it seems that it's
> > > currently broken.  A bisect identified this commit as 'bad' and it's
> > > failing at this point.
> > > 
> > > I'm running an old (a few weeks old) 'master' vstart cluster, so I don't have
> > > the needed bits for using this DELEG_INO feature.  Running xfstest
> > > generic/001 results in:
> > > 
> > >    ceph: mds parse_reply err -5
> > >    ceph: mdsc_handle_reply got corrupt reply mds0(tid:9)
> > >    ...
> > > 
> > > s->s_features does have the CEPHFS_FEATURE_DELEG_INO bit set;
> > > 'features' is -1 (0xffffffffffffffff) and s->s_features is 0x3fff.  Maybe
> > > the issue is actually somewhere else (the cephfs feature handling code),
> > > but I'm still looking.
> > > 
> > 
> > From the patch that added this feature in userland ceph code (commit
> > 2bcf4b62643b5):
> > 
> > --- a/src/mds/cephfs_features.h
> > +++ b/src/mds/cephfs_features.h
> > @@ -32,6 +32,7 @@
> >  #define CEPHFS_FEATURE_LAZY_CAP_WANTED  11
> >  #define CEPHFS_FEATURE_MULTI_RECONNECT  12
> >  #define CEPHFS_FEATURE_NAUTILUS         12
> > +#define CEPHFS_FEATURE_DELEG_INO        13
> >  #define CEPHFS_FEATURE_OCTOPUS          13
> > 
> >  #define CEPHFS_FEATURES_ALL {          \
> > @@ -45,6 +46,7 @@
> >    CEPHFS_FEATURE_LAZY_CAP_WANTED,      \
> >    CEPHFS_FEATURE_MULTI_RECONNECT,      \
> >    CEPHFS_FEATURE_NAUTILUS,              \
> > +  CEPHFS_FEATURE_DELEG_INO,             \
> >    CEPHFS_FEATURE_OCTOPUS,               \
> >  }
> > 
> > ...this feature was added under the aegis of the
> > CEPHFS_FEATURE_DELEG_INO flag, but that bit is shared with
> > CEPHFS_FEATURE_OCTOPUS, which was already enabled in octopus before we
> > ever added it (back on April 1st 2019).
> > 
> > Any version of the MDS that has commit 49930ad8a3402 but does not have
> > 2bcf4b62643b5 will not work properly with newer kernels. Personally, I
> > don't see that as a problem per se, as that should only be the case with
> > bleeding-edge MDS builds. Official releases should never see this issue.
> > 
> > Going forward, I think commit 49930ad8a3402 was probably a bad idea. We
> > really should not add "release" cephfs feature bits to the mask until
> > just before an official release, and should just make it alias the last
> > "real" feature bit. That should help ensure that we don't hit this
> > problem in the future.
> 
> We should avoid "release" feature bits altogether, as discussed in
> the last CDM.  They appeared because we were running out of free bits
> for RADOS features (a 64-bit field).  CephFS features are encoded in
> a bit vector that can grow as needed.
> 
> Feature masks for groups of related feature bits make sense, but they
> should be handled at a higher level.  E.g. a set of feature bits that
> a client should support in order for inline data to work properly that
> we flip to required if the user runs "ceph fs set <fsname> inline_data
> true".  The actual feature bits in the bit vector should all be
> distinct -- no overlaps.
> 

Yes, sorry, that was what we discussed, and that sounds fine to me.

The only reason those tags exist at all is that userland ceph has a
"min_compat_client" setting that disallows clients that don't support
release "X" from mounting.
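
As a rough illustration of the alternative discussed above, a release-style
minimum could be expressed as a set of distinct feature bits that a client
must support, rather than as a single "release" bit. The sketch below is not
taken from the ceph tree: the helper name and the contents of the required
set are hypothetical, and a single 64-bit word stands in for the growable
CephFS feature bit vector to keep it short.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as quoted earlier from src/mds/cephfs_features.h. */
#define CEPHFS_FEATURE_LAZY_CAP_WANTED  11
#define CEPHFS_FEATURE_MULTI_RECONNECT  12

#define FEATURE_MASK(bit) (UINT64_C(1) << (bit))

/*
 * Hypothetical required set: the individual features a client must
 * support before it is allowed to mount. The choice of bits here is
 * purely illustrative.
 */
#define HYPOTHETICAL_REQUIRED_FEATURES \
	(FEATURE_MASK(CEPHFS_FEATURE_LAZY_CAP_WANTED) | \
	 FEATURE_MASK(CEPHFS_FEATURE_MULTI_RECONNECT))

/* Hypothetical helper: true if the client supports every required bit. */
static bool client_meets_requirements(uint64_t client_features,
				      uint64_t required)
{
	return (client_features & required) == required;
}
```

With this shape, a client advertising bits 0..12 would be admitted, while one
that stops at bit 11 would be refused, and no bit has to double as a release
marker.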

I don't recall whether we had a plan to remove that setting from the
userland ceph code. Did we? It seems like if we're moving away from
declaring a particular release feature bit, then we should deprecate
min_compat_client.
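
For reference, the aliasing problem from earlier in the thread can be
reproduced in a few lines of userspace C. This is only a sketch: the bit
numbers are the ones quoted from cephfs_features.h, feature_test_bit() is a
stand-in for the kernel's test_bit() on a single 64-bit word, and
HYPOTHETICAL_DELEG_INO_DISTINCT is an invented bit number used only to show
what a non-overlapping layout would look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as quoted earlier from src/mds/cephfs_features.h. */
#define CEPHFS_FEATURE_MULTI_RECONNECT  12
#define CEPHFS_FEATURE_NAUTILUS         12
#define CEPHFS_FEATURE_DELEG_INO        13
#define CEPHFS_FEATURE_OCTOPUS          13

/* Hypothetical: a layout in which DELEG_INO has a bit of its own. */
#define HYPOTHETICAL_DELEG_INO_DISTINCT 14

/* Userspace stand-in for test_bit() on one 64-bit feature word. */
static bool feature_test_bit(unsigned int bit, uint64_t features)
{
	assert(bit < 64);
	return (features >> bit) & 1;
}

/*
 * Features advertised by an MDS that already sets the OCTOPUS release
 * bit but predates the delegated-inode encoding: bits 0..13 are set,
 * which is the 0x3fff value from the report above.
 */
static uint64_t pre_deleg_ino_mds_features(void)
{
	return (UINT64_C(1) << (CEPHFS_FEATURE_OCTOPUS + 1)) - 1;
}
```

Testing CEPHFS_FEATURE_DELEG_INO against that mask succeeds even though the
MDS does not encode delegated inodes, so the client goes on to decode fields
that are not in the reply. With a distinct bit, the same test would fail and
the client would fall back to the old decoding.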

-- 
Jeff Layton <jlayton@kernel.org>

Thread overview: 21+ messages
2020-03-02 14:14 [PATCH v6 00/13] ceph: async directory operations support Jeff Layton
2020-03-02 14:14 ` [PATCH v6 01/13] ceph: make kick_flushing_inode_caps non-static Jeff Layton
2020-03-02 14:14 ` [PATCH v6 02/13] ceph: add flag to designate that a request is asynchronous Jeff Layton
2020-03-02 14:14 ` [PATCH v6 03/13] ceph: track primary dentry link Jeff Layton
2020-03-02 14:14 ` [PATCH v6 04/13] ceph: add infrastructure for waiting for async create to complete Jeff Layton
2020-03-02 14:14 ` [PATCH v6 05/13] ceph: make __take_cap_refs non-static Jeff Layton
2020-03-02 14:14 ` [PATCH v6 06/13] ceph: cap tracking for async directory operations Jeff Layton
2020-03-02 14:14 ` [PATCH v6 07/13] ceph: don't take refs to want mask unless we have all bits Jeff Layton
2020-03-02 14:14 ` [PATCH v6 08/13] ceph: perform asynchronous unlink if we have sufficient caps Jeff Layton
2020-03-02 14:14 ` [PATCH v6 09/13] ceph: make ceph_fill_inode non-static Jeff Layton
2020-03-02 14:14 ` [PATCH v6 10/13] ceph: decode interval_sets for delegated inos Jeff Layton
2020-03-05 11:45   ` Luis Henriques
2020-03-05 12:02     ` Jeff Layton
2020-03-05 12:20       ` Luis Henriques
2020-03-05 13:36       ` Ilya Dryomov
2020-03-05 13:44         ` Jeff Layton
2020-03-02 14:14 ` [PATCH v6 11/13] ceph: add new MDS req field to hold delegated inode number Jeff Layton
2020-03-02 14:14 ` [PATCH v6 12/13] ceph: cache layout in parent dir on first sync create Jeff Layton
2020-03-02 14:14 ` [PATCH v6 13/13] ceph: attempt to do async create when possible Jeff Layton
2020-03-02 16:22 ` [PATCH v6 00/13] ceph: async directory operations support Yan, Zheng
2020-03-02 21:07   ` Jeff Layton
