* [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode
@ 2016-11-04 11:34 Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 01/10] ceph: fix minor typo in unsafe_request_wait Jeff Layton
                   ` (9 more replies)
  0 siblings, 10 replies; 24+ messages in thread
From: Jeff Layton @ 2016-11-04 11:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, zyan, sage

This is the companion kernel patchset to this ceph userland pull req:

    https://github.com/ceph/ceph/pull/11710

The problem is that fsync can be very slow on ceph, as it waits for a
cap flush ack. Cap flushes, however, are generally considered by the MDS
to be background activity, so they don't entail a journal flush on their
own.

The idea here is to add a new flag to cap requests to inform the MDS
that the client is waiting on the reply and that it shouldn't delay it.

In addition, this adds support for the birthtime and change attribute in
cephfs. This is necessary since the new sync flag comes after those
fields.
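
For reference, here is how the tail of the version 9 cap message
encoding ends up ordered (a sketch distilled from patches 7 and 8
below, not a complete message):

    /* ...earlier fields: flock len, inline version/data,
     * osd_epoch_barrier, oldest_flush_tid, caller uid/gid,
     * pool namespace... */
    ceph_encode_timespec(p, &arg->btime);   /* btime (v9) */
    p += sizeof(struct ceph_timespec);
    ceph_encode_64(&p, arg->change_attr);   /* change_attr (v9) */
    ceph_encode_8(&p, arg->sync);           /* sync flag (v9) */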

In current mainline ceph, the btime and change_attr share a feature
flag with addr2 support. In order to test this, I had to move addr2 to
a new feature flag, since the kernel doesn't have that support yet.

For now, this is just an RFC set until the userland parts are in place.

Jeff Layton (10):
  ceph: fix minor typo in unsafe_request_wait
  ceph: move xattr initialization before the encoding past the
    ceph_mds_caps
  ceph: initialize i_version to 0 in new ceph inodes
  ceph: save off btime and change_attr when we get an InodeStat
  ceph: handle btime and change_attr updates in cap messages
  ceph: define new argument structure for send_cap_msg
  ceph: update cap message struct version to 9
  ceph: add sync parameter to send_cap_msg
  ceph: plumb "sync" parameter into __send_cap
  ceph: turn on btime and change_attr support

 fs/ceph/caps.c                     | 307 ++++++++++++++++++++++---------------
 fs/ceph/inode.c                    |  11 +-
 fs/ceph/mds_client.c               |  10 ++
 fs/ceph/mds_client.h               |   2 +
 fs/ceph/snap.c                     |   3 +
 fs/ceph/super.c                    |   3 +-
 fs/ceph/super.h                    |   5 +
 include/linux/ceph/ceph_features.h |   2 +
 8 files changed, 219 insertions(+), 124 deletions(-)

-- 
2.7.4



* [RFC PATCH 01/10] ceph: fix minor typo in unsafe_request_wait
  2016-11-04 11:34 [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode Jeff Layton
@ 2016-11-04 11:34 ` Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 02/10] ceph: move xattr initialization before the encoding past the ceph_mds_caps Jeff Layton
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2016-11-04 11:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, zyan, sage

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/caps.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 16e6ded0b7f2..d37a04b87d2c 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1996,7 +1996,7 @@ static int unsafe_request_wait(struct inode *inode)
 	}
 	spin_unlock(&ci->i_unsafe_lock);
 
-	dout("unsafe_requeset_wait %p wait on tid %llu %llu\n",
+	dout("unsafe_request_wait %p wait on tid %llu %llu\n",
 	     inode, req1 ? req1->r_tid : 0ULL, req2 ? req2->r_tid : 0ULL);
 	if (req1) {
 		ret = !wait_for_completion_timeout(&req1->r_safe_completion,
-- 
2.7.4



* [RFC PATCH 02/10] ceph: move xattr initialization before the encoding past the ceph_mds_caps
  2016-11-04 11:34 [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 01/10] ceph: fix minor typo in unsafe_request_wait Jeff Layton
@ 2016-11-04 11:34 ` Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 03/10] ceph: initialize i_version to 0 in new ceph inodes Jeff Layton
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2016-11-04 11:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, zyan, sage

Just for clarity. This part lives inside the fixed-layout header
(struct ceph_mds_caps) at the front of the message, so it makes sense
to group it with the rest of the fields that are set there.
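
For context, the front of a cap message looks roughly like this (a
sketch based on the code below):

    [ struct ceph_mds_caps: fixed layout, including xattr_version
      and xattr_len fields                                        ]
    [ trailing encoded fields: flock len, inline version, inline
      data size, osd_epoch_barrier, oldest_flush_tid, ...         ]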

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/caps.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d37a04b87d2c..2a5caee2fb17 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1057,6 +1057,13 @@ static int send_cap_msg(struct ceph_mds_session *session,
 	fc->gid = cpu_to_le32(from_kgid(&init_user_ns, gid));
 	fc->mode = cpu_to_le32(mode);
 
+	fc->xattr_version = cpu_to_le64(xattr_version);
+	if (xattrs_buf) {
+		msg->middle = ceph_buffer_get(xattrs_buf);
+		fc->xattr_len = cpu_to_le32(xattrs_buf->vec.iov_len);
+		msg->hdr.middle_len = cpu_to_le32(xattrs_buf->vec.iov_len);
+	}
+
 	p = fc + 1;
 	/* flock buffer size */
 	ceph_encode_32(&p, 0);
@@ -1069,13 +1076,6 @@ static int send_cap_msg(struct ceph_mds_session *session,
 	/* oldest_flush_tid */
 	ceph_encode_64(&p, oldest_flush_tid);
 
-	fc->xattr_version = cpu_to_le64(xattr_version);
-	if (xattrs_buf) {
-		msg->middle = ceph_buffer_get(xattrs_buf);
-		fc->xattr_len = cpu_to_le32(xattrs_buf->vec.iov_len);
-		msg->hdr.middle_len = cpu_to_le32(xattrs_buf->vec.iov_len);
-	}
-
 	ceph_con_send(&session->s_con, msg);
 	return 0;
 }
-- 
2.7.4



* [RFC PATCH 03/10] ceph: initialize i_version to 0 in new ceph inodes
  2016-11-04 11:34 [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 01/10] ceph: fix minor typo in unsafe_request_wait Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 02/10] ceph: move xattr initialization before the encoding past the ceph_mds_caps Jeff Layton
@ 2016-11-04 11:34 ` Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 04/10] ceph: save off btime and change_attr when we get an InodeStat Jeff Layton
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2016-11-04 11:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, zyan, sage

Currently, i_version is not always initialized when inodes are
allocated, so it's possible to end up "inheriting" an i_version from a
completely different inode.

In most cases, this is not an issue, as the stale value will end up
being clobbered when the inode metadata is set up.

Ceph is a little different here, though, as we always want to keep the
maximum value seen for i_version.

For now, just initialize this to 0 in ceph_alloc_inode, but we may
eventually want to move this into inode_init_always and do it for all
filesystems.
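
For context, a later patch in this series has fill_inode take the
larger of the two values, so a garbage inherited i_version could
otherwise win out over a legitimate change_attr from the MDS:

    /* from patch 04: always take a larger change attr */
    inode->i_version = max(inode->i_version, iinfo->change_attr);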

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/inode.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index ef4d04647325..f03dc579e0ec 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -514,6 +514,8 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 
 	ceph_fscache_inode_init(ci);
 
+	// FIXME: merge initialization into inode_init_always ?
+	ci->vfs_inode.i_version = 0;
 	return &ci->vfs_inode;
 }
 
-- 
2.7.4



* [RFC PATCH 04/10] ceph: save off btime and change_attr when we get an InodeStat
  2016-11-04 11:34 [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode Jeff Layton
                   ` (2 preceding siblings ...)
  2016-11-04 11:34 ` [RFC PATCH 03/10] ceph: initialize i_version to 0 in new ceph inodes Jeff Layton
@ 2016-11-04 11:34 ` Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 05/10] ceph: handle btime and change_attr updates in cap messages Jeff Layton
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2016-11-04 11:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, zyan, sage

Needed so we can send the proper values in cap flushes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/inode.c                    |  9 ++++++++-
 fs/ceph/mds_client.c               | 10 ++++++++++
 fs/ceph/mds_client.h               |  2 ++
 fs/ceph/super.h                    |  2 ++
 include/linux/ceph/ceph_features.h |  2 ++
 5 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index f03dc579e0ec..f7a3ec6d7152 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -436,6 +436,9 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 	ci->i_inline_version = 0;
 	ci->i_time_warp_seq = 0;
 	ci->i_ceph_flags = 0;
+
+	memset(&ci->i_btime, 0, sizeof(ci->i_btime));
+
 	atomic64_set(&ci->i_ordered_count, 1);
 	atomic64_set(&ci->i_release_count, 1);
 	atomic64_set(&ci->i_complete_seq[0], 0);
@@ -798,7 +801,10 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 
 	/* update inode */
 	ci->i_version = le64_to_cpu(info->version);
-	inode->i_version++;
+
+	/* Always take a larger change attr */
+	inode->i_version = max(inode->i_version, iinfo->change_attr);
+
 	inode->i_rdev = le32_to_cpu(info->rdev);
 	inode->i_blkbits = fls(le32_to_cpu(info->layout.fl_stripe_unit)) - 1;
 
@@ -807,6 +813,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 		inode->i_mode = le32_to_cpu(info->mode);
 		inode->i_uid = make_kuid(&init_user_ns, le32_to_cpu(info->uid));
 		inode->i_gid = make_kgid(&init_user_ns, le32_to_cpu(info->gid));
+		ci->i_btime = iinfo->btime;
 		dout("%p mode 0%o uid.gid %d.%d\n", inode, inode->i_mode,
 		     from_kuid(&init_user_ns, inode->i_uid),
 		     from_kgid(&init_user_ns, inode->i_gid));
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 815acd1a56d4..7217404f0f7c 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -111,6 +111,16 @@ static int parse_reply_info_in(void **p, void *end,
 		}
 	}
 
+	if (features & CEPH_FEATURE_FS_BTIME) {
+		ceph_decode_need(p, end, sizeof(struct ceph_timespec), bad);
+		ceph_decode_timespec(&info->btime, *p);
+		*p += sizeof(struct ceph_timespec);
+		ceph_decode_64_safe(p, end, info->change_attr, bad);
+	} else {
+		memset(&info->btime, 0, sizeof(info->btime));
+		info->change_attr = 0;
+	}
+
 	return 0;
 bad:
 	return err;
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 3c6f77b7bb02..e217a3dd3f19 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -46,6 +46,8 @@ struct ceph_mds_reply_info_in {
 	char *inline_data;
 	u32 pool_ns_len;
 	char *pool_ns_data;
+	struct timespec btime;
+	u64 change_attr;
 };
 
 struct ceph_mds_reply_dir_entry {
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 3e3fa9163059..244fd8dbff31 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -288,6 +288,8 @@ struct ceph_inode_info {
 	struct ceph_file_layout i_layout;
 	char *i_symlink;
 
+	struct timespec i_btime;
+
 	/* for dirs */
 	struct timespec i_rctime;
 	u64 i_rbytes, i_rfiles, i_rsubdirs;
diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
index ae2f66833762..95b174a676d5 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -76,6 +76,8 @@
 // duplicated since it was introduced at the same time as CEPH_FEATURE_CRUSH_TUNABLES5
 #define CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING   (1ULL<<58) /* New, v7 encoding */
 #define CEPH_FEATURE_FS_FILE_LAYOUT_V2       (1ULL<<58) /* file_layout_t */
+#define CEPH_FEATURE_FS_BTIME			(1ULL<<59) /* btime */
+#define CEPH_FEATURE_FS_CHANGE_ATTR		(1ULL<<59) /* change_attr */
 
 /*
  * The introduction of CEPH_FEATURE_OSD_SNAPMAPPER caused the feature
-- 
2.7.4



* [RFC PATCH 05/10] ceph: handle btime and change_attr updates in cap messages
  2016-11-04 11:34 [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode Jeff Layton
                   ` (3 preceding siblings ...)
  2016-11-04 11:34 ` [RFC PATCH 04/10] ceph: save off btime and change_attr when we get an InodeStat Jeff Layton
@ 2016-11-04 11:34 ` Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 06/10] ceph: define new argument structure for send_cap_msg Jeff Layton
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2016-11-04 11:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, zyan, sage

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/caps.c  | 31 ++++++++++++++++++++++++-------
 fs/ceph/snap.c  |  3 +++
 fs/ceph/super.h |  3 +++
 3 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 2a5caee2fb17..dcf1eff06f85 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2798,7 +2798,8 @@ static void handle_cap_grant(struct ceph_mds_client *mdsc,
 			     void *inline_data, u32 inline_len,
 			     struct ceph_buffer *xattr_buf,
 			     struct ceph_mds_session *session,
-			     struct ceph_cap *cap, int issued)
+			     struct ceph_cap *cap, int issued,
+			     struct timespec *btime, u64 change_attr)
 	__releases(ci->i_ceph_lock)
 	__releases(mdsc->snap_rwsem)
 {
@@ -2864,11 +2865,14 @@ static void handle_cap_grant(struct ceph_mds_client *mdsc,
 
 	__check_cap_issue(ci, cap, newcaps);
 
+	inode->i_version = max(change_attr, inode->i_version);
+
 	if ((newcaps & CEPH_CAP_AUTH_SHARED) &&
 	    (issued & CEPH_CAP_AUTH_EXCL) == 0) {
 		inode->i_mode = le32_to_cpu(grant->mode);
 		inode->i_uid = make_kuid(&init_user_ns, le32_to_cpu(grant->uid));
 		inode->i_gid = make_kgid(&init_user_ns, le32_to_cpu(grant->gid));
+		ci->i_btime = *btime;
 		dout("%p mode 0%o uid.gid %d.%d\n", inode, inode->i_mode,
 		     from_kuid(&init_user_ns, inode->i_uid),
 		     from_kgid(&init_user_ns, inode->i_gid));
@@ -3480,6 +3484,9 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 	void *snaptrace;
 	size_t snaptrace_len;
 	void *p, *end;
+	u16 hdr_version;
+	struct timespec btime = {};
+	u64 change_attr = 0;
 
 	dout("handle_caps from mds%d\n", mds);
 
@@ -3499,7 +3506,8 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 	snaptrace_len = le32_to_cpu(h->snap_trace_len);
 	p = snaptrace + snaptrace_len;
 
-	if (le16_to_cpu(msg->hdr.version) >= 2) {
+	hdr_version = le16_to_cpu(msg->hdr.version);
+	if (hdr_version >= 2) {
 		u32 flock_len;
 		ceph_decode_32_safe(&p, end, flock_len, bad);
 		if (p + flock_len > end)
@@ -3507,7 +3515,7 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		p += flock_len;
 	}
 
-	if (le16_to_cpu(msg->hdr.version) >= 3) {
+	if (hdr_version >= 3) {
 		if (op == CEPH_CAP_OP_IMPORT) {
 			if (p + sizeof(*peer) > end)
 				goto bad;
@@ -3519,7 +3527,7 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		}
 	}
 
-	if (le16_to_cpu(msg->hdr.version) >= 4) {
+	if (hdr_version >= 4) {
 		ceph_decode_64_safe(&p, end, inline_version, bad);
 		ceph_decode_32_safe(&p, end, inline_len, bad);
 		if (p + inline_len > end)
@@ -3528,7 +3536,7 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		p += inline_len;
 	}
 
-	if (le16_to_cpu(msg->hdr.version) >= 8) {
+	if (hdr_version >= 8) {
 		u64 flush_tid;
 		u32 caller_uid, caller_gid;
 		u32 osd_epoch_barrier;
@@ -3549,6 +3557,13 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		}
 	}
 
+	if (hdr_version >= 9) {
+		ceph_decode_need(&p, end, sizeof(struct ceph_timespec), bad);
+		ceph_decode_timespec(&btime, p);
+		p += sizeof(struct ceph_timespec);
+		ceph_decode_64_safe(&p, end, change_attr, bad);
+	}
+
 	/* lookup ino */
 	inode = ceph_find_inode(sb, vino);
 	ci = ceph_inode(inode);
@@ -3604,7 +3619,8 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 				  &cap, &issued);
 		handle_cap_grant(mdsc, inode, h, &pool_ns,
 				 inline_version, inline_data, inline_len,
-				 msg->middle, session, cap, issued);
+				 msg->middle, session, cap, issued, &btime,
+				 change_attr);
 		if (realm)
 			ceph_put_snap_realm(mdsc, realm);
 		goto done_unlocked;
@@ -3628,7 +3644,8 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 		issued |= __ceph_caps_dirty(ci);
 		handle_cap_grant(mdsc, inode, h, &pool_ns,
 				 inline_version, inline_data, inline_len,
-				 msg->middle, session, cap, issued);
+				 msg->middle, session, cap, issued, &btime,
+				 change_attr);
 		goto done_unlocked;
 
 	case CEPH_CAP_OP_FLUSH_ACK:
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 9ff5219d849e..5cfdab5d366b 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -529,6 +529,9 @@ void ceph_queue_cap_snap(struct ceph_inode_info *ci)
 	capsnap->mode = inode->i_mode;
 	capsnap->uid = inode->i_uid;
 	capsnap->gid = inode->i_gid;
+	capsnap->btime = ci->i_btime;
+
+	capsnap->change_attr = inode->i_version;
 
 	if (dirty & CEPH_CAP_XATTR_EXCL) {
 		__ceph_build_xattrs_blob(ci);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 244fd8dbff31..c264d823bc58 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -173,10 +173,13 @@ struct ceph_cap_snap {
 	umode_t mode;
 	kuid_t uid;
 	kgid_t gid;
+	struct timespec btime;
 
 	struct ceph_buffer *xattr_blob;
 	u64 xattr_version;
 
+	u64 change_attr;
+
 	u64 size;
 	struct timespec mtime, atime, ctime;
 	u64 time_warp_seq;
-- 
2.7.4



* [RFC PATCH 06/10] ceph: define new argument structure for send_cap_msg
  2016-11-04 11:34 [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode Jeff Layton
                   ` (4 preceding siblings ...)
  2016-11-04 11:34 ` [RFC PATCH 05/10] ceph: handle btime and change_attr updates in cap messages Jeff Layton
@ 2016-11-04 11:34 ` Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 07/10] ceph: update cap message struct version to 9 Jeff Layton
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2016-11-04 11:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, zyan, sage

send_cap_msg is already at 25 arguments, with more needed, and at that
point positional parameters become very hard to work with.

Define a new args structure and pass a pointer to it to send_cap_msg.
Eventually it might make sense to embed one of these inside
ceph_cap_snap instead of tracking individual fields.
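
As the hunks below show, each call site then just fills a cap_msg_args
on the stack and passes its address, e.g. (a condensed sketch):

    struct cap_msg_args arg;

    arg.session = cap->session;
    arg.ino = ceph_vino(inode).ino;
    /* ... fill in the remaining fields ... */
    ret = send_cap_msg(&arg);

Named fields are also much harder to accidentally transpose than a
couple dozen positional arguments of mostly identical integer types.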

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/caps.c | 225 ++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 126 insertions(+), 99 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index dcf1eff06f85..6e99866b1946 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -987,22 +987,27 @@ void __ceph_remove_cap(struct ceph_cap *cap, bool queue_release)
 		__cap_delay_cancel(mdsc, ci);
 }
 
+struct cap_msg_args {
+	struct ceph_mds_session	*session;
+	u64			ino, cid, follows;
+	u64			flush_tid, oldest_flush_tid, size, max_size;
+	u64			xattr_version;
+	struct ceph_buffer	*xattr_buf;
+	struct timespec		atime, mtime, ctime;
+	int			op, caps, wanted, dirty;
+	u32			seq, issue_seq, mseq, time_warp_seq;
+	kuid_t			uid;
+	kgid_t			gid;
+	umode_t			mode;
+	bool			inline_data;
+};
+
 /*
  * Build and send a cap message to the given MDS.
  *
  * Caller should be holding s_mutex.
  */
-static int send_cap_msg(struct ceph_mds_session *session,
-			u64 ino, u64 cid, int op,
-			int caps, int wanted, int dirty,
-			u32 seq, u64 flush_tid, u64 oldest_flush_tid,
-			u32 issue_seq, u32 mseq, u64 size, u64 max_size,
-			struct timespec *mtime, struct timespec *atime,
-			struct timespec *ctime, u32 time_warp_seq,
-			kuid_t uid, kgid_t gid, umode_t mode,
-			u64 xattr_version,
-			struct ceph_buffer *xattrs_buf,
-			u64 follows, bool inline_data)
+static int send_cap_msg(struct cap_msg_args *arg)
 {
 	struct ceph_mds_caps *fc;
 	struct ceph_msg *msg;
@@ -1011,12 +1016,13 @@ static int send_cap_msg(struct ceph_mds_session *session,
 
 	dout("send_cap_msg %s %llx %llx caps %s wanted %s dirty %s"
 	     " seq %u/%u tid %llu/%llu mseq %u follows %lld size %llu/%llu"
-	     " xattr_ver %llu xattr_len %d\n", ceph_cap_op_name(op),
-	     cid, ino, ceph_cap_string(caps), ceph_cap_string(wanted),
-	     ceph_cap_string(dirty),
-	     seq, issue_seq, flush_tid, oldest_flush_tid,
-	     mseq, follows, size, max_size,
-	     xattr_version, xattrs_buf ? (int)xattrs_buf->vec.iov_len : 0);
+	     " xattr_ver %llu xattr_len %d\n", ceph_cap_op_name(arg->op),
+	     arg->cid, arg->ino, ceph_cap_string(arg->caps),
+	     ceph_cap_string(arg->wanted), ceph_cap_string(arg->dirty),
+	     arg->seq, arg->issue_seq, arg->flush_tid, arg->oldest_flush_tid,
+	     arg->mseq, arg->follows, arg->size, arg->max_size,
+	     arg->xattr_version,
+	     arg->xattr_buf ? (int)arg->xattr_buf->vec.iov_len : 0);
 
 	/* flock buffer size + inline version + inline data size +
 	 * osd_epoch_barrier + oldest_flush_tid */
@@ -1027,56 +1033,53 @@ static int send_cap_msg(struct ceph_mds_session *session,
 		return -ENOMEM;
 
 	msg->hdr.version = cpu_to_le16(6);
-	msg->hdr.tid = cpu_to_le64(flush_tid);
+	msg->hdr.tid = cpu_to_le64(arg->flush_tid);
 
 	fc = msg->front.iov_base;
 	memset(fc, 0, sizeof(*fc));
 
-	fc->cap_id = cpu_to_le64(cid);
-	fc->op = cpu_to_le32(op);
-	fc->seq = cpu_to_le32(seq);
-	fc->issue_seq = cpu_to_le32(issue_seq);
-	fc->migrate_seq = cpu_to_le32(mseq);
-	fc->caps = cpu_to_le32(caps);
-	fc->wanted = cpu_to_le32(wanted);
-	fc->dirty = cpu_to_le32(dirty);
-	fc->ino = cpu_to_le64(ino);
-	fc->snap_follows = cpu_to_le64(follows);
-
-	fc->size = cpu_to_le64(size);
-	fc->max_size = cpu_to_le64(max_size);
-	if (mtime)
-		ceph_encode_timespec(&fc->mtime, mtime);
-	if (atime)
-		ceph_encode_timespec(&fc->atime, atime);
-	if (ctime)
-		ceph_encode_timespec(&fc->ctime, ctime);
-	fc->time_warp_seq = cpu_to_le32(time_warp_seq);
-
-	fc->uid = cpu_to_le32(from_kuid(&init_user_ns, uid));
-	fc->gid = cpu_to_le32(from_kgid(&init_user_ns, gid));
-	fc->mode = cpu_to_le32(mode);
-
-	fc->xattr_version = cpu_to_le64(xattr_version);
-	if (xattrs_buf) {
-		msg->middle = ceph_buffer_get(xattrs_buf);
-		fc->xattr_len = cpu_to_le32(xattrs_buf->vec.iov_len);
-		msg->hdr.middle_len = cpu_to_le32(xattrs_buf->vec.iov_len);
+	fc->cap_id = cpu_to_le64(arg->cid);
+	fc->op = cpu_to_le32(arg->op);
+	fc->seq = cpu_to_le32(arg->seq);
+	fc->issue_seq = cpu_to_le32(arg->issue_seq);
+	fc->migrate_seq = cpu_to_le32(arg->mseq);
+	fc->caps = cpu_to_le32(arg->caps);
+	fc->wanted = cpu_to_le32(arg->wanted);
+	fc->dirty = cpu_to_le32(arg->dirty);
+	fc->ino = cpu_to_le64(arg->ino);
+	fc->snap_follows = cpu_to_le64(arg->follows);
+
+	fc->size = cpu_to_le64(arg->size);
+	fc->max_size = cpu_to_le64(arg->max_size);
+	ceph_encode_timespec(&fc->mtime, &arg->mtime);
+	ceph_encode_timespec(&fc->atime, &arg->atime);
+	ceph_encode_timespec(&fc->ctime, &arg->ctime);
+	fc->time_warp_seq = cpu_to_le32(arg->time_warp_seq);
+
+	fc->uid = cpu_to_le32(from_kuid(&init_user_ns, arg->uid));
+	fc->gid = cpu_to_le32(from_kgid(&init_user_ns, arg->gid));
+	fc->mode = cpu_to_le32(arg->mode);
+
+	fc->xattr_version = cpu_to_le64(arg->xattr_version);
+	if (arg->xattr_buf) {
+		msg->middle = ceph_buffer_get(arg->xattr_buf);
+		fc->xattr_len = cpu_to_le32(arg->xattr_buf->vec.iov_len);
+		msg->hdr.middle_len = cpu_to_le32(arg->xattr_buf->vec.iov_len);
 	}
 
 	p = fc + 1;
 	/* flock buffer size */
 	ceph_encode_32(&p, 0);
 	/* inline version */
-	ceph_encode_64(&p, inline_data ? 0 : CEPH_INLINE_NONE);
+	ceph_encode_64(&p, arg->inline_data ? 0 : CEPH_INLINE_NONE);
 	/* inline data size */
 	ceph_encode_32(&p, 0);
 	/* osd_epoch_barrier */
 	ceph_encode_32(&p, 0);
 	/* oldest_flush_tid */
-	ceph_encode_64(&p, oldest_flush_tid);
+	ceph_encode_64(&p, arg->oldest_flush_tid);
 
-	ceph_con_send(&session->s_con, msg);
+	ceph_con_send(&arg->session->s_con, msg);
 	return 0;
 }
 
@@ -1121,21 +1124,11 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 {
 	struct ceph_inode_info *ci = cap->ci;
 	struct inode *inode = &ci->vfs_inode;
-	u64 cap_id = cap->cap_id;
-	int held, revoking, dropping, keep;
-	u64 follows, size, max_size;
-	u32 seq, issue_seq, mseq, time_warp_seq;
-	struct timespec mtime, atime, ctime;
+	struct cap_msg_args arg;
+	int held, revoking, dropping;
 	int wake = 0;
-	umode_t mode;
-	kuid_t uid;
-	kgid_t gid;
-	struct ceph_mds_session *session;
-	u64 xattr_version = 0;
-	struct ceph_buffer *xattr_blob = NULL;
 	int delayed = 0;
 	int ret;
-	bool inline_data;
 
 	held = cap->issued | cap->implemented;
 	revoking = cap->implemented & ~cap->issued;
@@ -1148,7 +1141,7 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 	     ceph_cap_string(revoking));
 	BUG_ON((retain & CEPH_CAP_PIN) == 0);
 
-	session = cap->session;
+	arg.session = cap->session;
 
 	/* don't release wanted unless we've waited a bit. */
 	if ((ci->i_ceph_flags & CEPH_I_NODELAY) == 0 &&
@@ -1177,40 +1170,48 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 	cap->implemented &= cap->issued | used;
 	cap->mds_wanted = want;
 
-	follows = flushing ? ci->i_head_snapc->seq : 0;
-
-	keep = cap->implemented;
-	seq = cap->seq;
-	issue_seq = cap->issue_seq;
-	mseq = cap->mseq;
-	size = inode->i_size;
-	ci->i_reported_size = size;
-	max_size = ci->i_wanted_max_size;
-	ci->i_requested_max_size = max_size;
-	mtime = inode->i_mtime;
-	atime = inode->i_atime;
-	ctime = inode->i_ctime;
-	time_warp_seq = ci->i_time_warp_seq;
-	uid = inode->i_uid;
-	gid = inode->i_gid;
-	mode = inode->i_mode;
+	arg.ino = ceph_vino(inode).ino;
+	arg.cid = cap->cap_id;
+	arg.follows = flushing ? ci->i_head_snapc->seq : 0;
+	arg.flush_tid = flush_tid;
+	arg.oldest_flush_tid = oldest_flush_tid;
+
+	arg.size = inode->i_size;
+	ci->i_reported_size = arg.size;
+	arg.max_size = ci->i_wanted_max_size;
+	ci->i_requested_max_size = arg.max_size;
 
 	if (flushing & CEPH_CAP_XATTR_EXCL) {
 		__ceph_build_xattrs_blob(ci);
-		xattr_blob = ci->i_xattrs.blob;
-		xattr_version = ci->i_xattrs.version;
+		arg.xattr_version = ci->i_xattrs.version;
+		arg.xattr_buf = ci->i_xattrs.blob;
+	} else {
+		arg.xattr_buf = NULL;
 	}
 
-	inline_data = ci->i_inline_version != CEPH_INLINE_NONE;
+	arg.mtime = inode->i_mtime;
+	arg.atime = inode->i_atime;
+	arg.ctime = inode->i_ctime;
+
+	arg.op = op;
+	arg.caps = cap->implemented;
+	arg.wanted = want;
+	arg.dirty = flushing;
+
+	arg.seq = cap->seq;
+	arg.issue_seq = cap->issue_seq;
+	arg.mseq = cap->mseq;
+	arg.time_warp_seq = ci->i_time_warp_seq;
+
+	arg.uid = inode->i_uid;
+	arg.gid = inode->i_gid;
+	arg.mode = inode->i_mode;
+
+	arg.inline_data = ci->i_inline_version != CEPH_INLINE_NONE;
 
 	spin_unlock(&ci->i_ceph_lock);
 
-	ret = send_cap_msg(session, ceph_vino(inode).ino, cap_id,
-		op, keep, want, flushing, seq,
-		flush_tid, oldest_flush_tid, issue_seq, mseq,
-		size, max_size, &mtime, &atime, &ctime, time_warp_seq,
-		uid, gid, mode, xattr_version, xattr_blob,
-		follows, inline_data);
+	ret = send_cap_msg(&arg);
 	if (ret < 0) {
 		dout("error sending cap msg, must requeue %p\n", inode);
 		delayed = 1;
@@ -1227,15 +1228,41 @@ static inline int __send_flush_snap(struct inode *inode,
 				    struct ceph_cap_snap *capsnap,
 				    u32 mseq, u64 oldest_flush_tid)
 {
-	return send_cap_msg(session, ceph_vino(inode).ino, 0,
-			CEPH_CAP_OP_FLUSHSNAP, capsnap->issued, 0,
-			capsnap->dirty, 0, capsnap->cap_flush.tid,
-			oldest_flush_tid, 0, mseq, capsnap->size, 0,
-			&capsnap->mtime, &capsnap->atime,
-			&capsnap->ctime, capsnap->time_warp_seq,
-			capsnap->uid, capsnap->gid, capsnap->mode,
-			capsnap->xattr_version, capsnap->xattr_blob,
-			capsnap->follows, capsnap->inline_data);
+	struct cap_msg_args	arg;
+
+	arg.session = session;
+	arg.ino = ceph_vino(inode).ino;
+	arg.cid = 0;
+	arg.follows = capsnap->follows;
+	arg.flush_tid = capsnap->cap_flush.tid;
+	arg.oldest_flush_tid = oldest_flush_tid;
+
+	arg.size = capsnap->size;
+	arg.max_size = 0;
+	arg.xattr_version = capsnap->xattr_version;
+	arg.xattr_buf = capsnap->xattr_blob;
+
+	arg.atime = capsnap->atime;
+	arg.mtime = capsnap->mtime;
+	arg.ctime = capsnap->ctime;
+
+	arg.op = CEPH_CAP_OP_FLUSHSNAP;
+	arg.caps = capsnap->issued;
+	arg.wanted = 0;
+	arg.dirty = capsnap->dirty;
+
+	arg.seq = 0;
+	arg.issue_seq = 0;
+	arg.mseq = mseq;
+	arg.time_warp_seq = capsnap->time_warp_seq;
+
+	arg.uid = capsnap->uid;
+	arg.gid = capsnap->gid;
+	arg.mode = capsnap->mode;
+
+	arg.inline_data = capsnap->inline_data;
+
+	return send_cap_msg(&arg);
 }
 
 /*
-- 
2.7.4



* [RFC PATCH 07/10] ceph: update cap message struct version to 9
  2016-11-04 11:34 [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode Jeff Layton
                   ` (5 preceding siblings ...)
  2016-11-04 11:34 ` [RFC PATCH 06/10] ceph: define new argument structure for send_cap_msg Jeff Layton
@ 2016-11-04 11:34 ` Jeff Layton
  2016-11-04 12:57   ` Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 08/10] ceph: add sync parameter to send_cap_msg Jeff Layton
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2016-11-04 11:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, zyan, sage

The userland ceph has MClientCaps at struct version 9. This brings the
kernel up to the same version.

With this change, we have to start tracking the btime and change_attr,
so that the client can pass back sane values in cap messages. The
client doesn't care about the btime at all, so this is just passed
around, but the change_attr is used when ceph is exported via NFS.

For now, the new "sync" parm is left at 0, to preserve the existing
behavior of the client.
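
For reference, the new extra_len breaks down per field like this (a
sketch; a ceph_timespec encodes as two 32-bit values, i.e. 8 bytes):

    extra_len = 4           /* flock buffer size (v2) */
              + 8 + 4       /* inline version + inline data size (v4) */
              + 4           /* osd_epoch_barrier (v5) */
              + 8           /* oldest_flush_tid (v6) */
              + 4 + 4       /* caller_uid + caller_gid (v7) */
              + 4           /* pool namespace length (v8) */
              + 8 + 8 + 1;  /* btime + change_attr + sync flag (v9) */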

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/caps.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 6e99866b1946..452f5024589f 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -991,9 +991,9 @@ struct cap_msg_args {
 	struct ceph_mds_session	*session;
 	u64			ino, cid, follows;
 	u64			flush_tid, oldest_flush_tid, size, max_size;
-	u64			xattr_version;
+	u64			xattr_version, change_attr;
 	struct ceph_buffer	*xattr_buf;
-	struct timespec		atime, mtime, ctime;
+	struct timespec		atime, mtime, ctime, btime;
 	int			op, caps, wanted, dirty;
 	u32			seq, issue_seq, mseq, time_warp_seq;
 	kuid_t			uid;
@@ -1026,13 +1026,13 @@ static int send_cap_msg(struct cap_msg_args *arg)
 
 	/* flock buffer size + inline version + inline data size +
 	 * osd_epoch_barrier + oldest_flush_tid */
-	extra_len = 4 + 8 + 4 + 4 + 8;
+	extra_len = 4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 1;
 	msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc) + extra_len,
 			   GFP_NOFS, false);
 	if (!msg)
 		return -ENOMEM;
 
-	msg->hdr.version = cpu_to_le16(6);
+	msg->hdr.version = cpu_to_le16(9);
 	msg->hdr.tid = cpu_to_le64(arg->flush_tid);
 
 	fc = msg->front.iov_base;
@@ -1068,17 +1068,30 @@ static int send_cap_msg(struct cap_msg_args *arg)
 	}
 
 	p = fc + 1;
-	/* flock buffer size */
+	/* flock buffer size (version 2) */
 	ceph_encode_32(&p, 0);
-	/* inline version */
+	/* inline version (version 4) */
 	ceph_encode_64(&p, arg->inline_data ? 0 : CEPH_INLINE_NONE);
 	/* inline data size */
 	ceph_encode_32(&p, 0);
-	/* osd_epoch_barrier */
+	/* osd_epoch_barrier (version 5) */
 	ceph_encode_32(&p, 0);
-	/* oldest_flush_tid */
+	/* oldest_flush_tid (version 6) */
 	ceph_encode_64(&p, arg->oldest_flush_tid);
 
+	/* caller_uid/caller_gid (version 7) */
+	ceph_encode_32(&p, (u32)-1);
+	ceph_encode_32(&p, (u32)-1);
+
+	/* pool namespace (version 8) */
+	ceph_encode_32(&p, 0);
+
+	/* btime, change_attr, sync (version 9) */
+	ceph_encode_timespec(p, &arg->btime);
+	p += sizeof(struct ceph_timespec);
+	ceph_encode_64(&p, arg->change_attr);
+	ceph_encode_8(&p, 0);
+
 	ceph_con_send(&arg->session->s_con, msg);
 	return 0;
 }
@@ -1189,9 +1202,11 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 		arg.xattr_buf = NULL;
 	}
 
+	arg.change_attr = inode->i_version;
 	arg.mtime = inode->i_mtime;
 	arg.atime = inode->i_atime;
 	arg.ctime = inode->i_ctime;
+	arg.btime = ci->i_btime;
 
 	arg.op = op;
 	arg.caps = cap->implemented;
@@ -1241,10 +1256,12 @@ static inline int __send_flush_snap(struct inode *inode,
 	arg.max_size = 0;
 	arg.xattr_version = capsnap->xattr_version;
 	arg.xattr_buf = capsnap->xattr_blob;
+	arg.change_attr = capsnap->change_attr;
 
 	arg.atime = capsnap->atime;
 	arg.mtime = capsnap->mtime;
 	arg.ctime = capsnap->ctime;
+	arg.btime = capsnap->btime;
 
 	arg.op = CEPH_CAP_OP_FLUSHSNAP;
 	arg.caps = capsnap->issued;
-- 
2.7.4



* [RFC PATCH 08/10] ceph: add sync parameter to send_cap_msg
  2016-11-04 11:34 [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode Jeff Layton
                   ` (6 preceding siblings ...)
  2016-11-04 11:34 ` [RFC PATCH 07/10] ceph: update cap message struct version to 9 Jeff Layton
@ 2016-11-04 11:34 ` Jeff Layton
  2016-11-07  8:32   ` Yan, Zheng
  2016-11-04 11:34 ` [RFC PATCH 09/10] ceph: plumb "sync" parameter into __send_cap Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 10/10] ceph: turn on btime and change_attr support Jeff Layton
  9 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2016-11-04 11:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, zyan, sage

So we can request an MDS log flush on a cap message when we know that
we'll be waiting on the result.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/caps.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 452f5024589f..e92c6ce53af6 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -999,7 +999,7 @@ struct cap_msg_args {
 	kuid_t			uid;
 	kgid_t			gid;
 	umode_t			mode;
-	bool			inline_data;
+	bool			inline_data, sync;
 };
 
 /*
@@ -1090,7 +1090,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
 	ceph_encode_timespec(p, &arg->btime);
 	p += sizeof(struct ceph_timespec);
 	ceph_encode_64(&p, arg->change_attr);
-	ceph_encode_8(&p, 0);
+	ceph_encode_8(&p, arg->sync);
 
 	ceph_con_send(&arg->session->s_con, msg);
 	return 0;
@@ -1223,6 +1223,7 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 	arg.mode = inode->i_mode;
 
 	arg.inline_data = ci->i_inline_version != CEPH_INLINE_NONE;
+	arg.sync = false;
 
 	spin_unlock(&ci->i_ceph_lock);
 
@@ -1278,6 +1279,7 @@ static inline int __send_flush_snap(struct inode *inode,
 	arg.mode = capsnap->mode;
 
 	arg.inline_data = capsnap->inline_data;
+	arg.sync = false;
 
 	return send_cap_msg(&arg);
 }
-- 
2.7.4



* [RFC PATCH 09/10] ceph: plumb "sync" parameter into __send_cap
  2016-11-04 11:34 [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode Jeff Layton
                   ` (7 preceding siblings ...)
  2016-11-04 11:34 ` [RFC PATCH 08/10] ceph: add sync parameter to send_cap_msg Jeff Layton
@ 2016-11-04 11:34 ` Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 10/10] ceph: turn on btime and change_attr support Jeff Layton
  9 siblings, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2016-11-04 11:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, zyan, sage

We set it to false everywhere except in try_flush_caps. The callers of
that function generally wait synchronously on the result, so it's
beneficial to ask the server to expedite it.
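
As a rough sketch of the path this is meant to speed up (the
fsync/write_inode entry points are assumed from the cover letter;
simplified):

    ceph_fsync() / ceph_write_inode()
      -> try_flush_caps()
           -> __send_cap(mdsc, cap, CEPH_CAP_OP_FLUSH, true, ...)
                -> send_cap_msg(&arg)    /* arg.sync == true */
      -> wait for the cap flush ack (now expedited by an MDS log flush)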

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/caps.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index e92c6ce53af6..7f6aa1e42f5a 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1131,8 +1131,8 @@ void ceph_queue_caps_release(struct inode *inode)
  * caller should hold snap_rwsem (read), s_mutex.
  */
 static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
-		      int op, int used, int want, int retain, int flushing,
-		      u64 flush_tid, u64 oldest_flush_tid)
+		      int op, bool sync, int used, int want, int retain,
+		      int flushing, u64 flush_tid, u64 oldest_flush_tid)
 	__releases(cap->ci->i_ceph_lock)
 {
 	struct ceph_inode_info *ci = cap->ci;
@@ -1223,7 +1223,7 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
 	arg.mode = inode->i_mode;
 
 	arg.inline_data = ci->i_inline_version != CEPH_INLINE_NONE;
-	arg.sync = false;
+	arg.sync = sync;
 
 	spin_unlock(&ci->i_ceph_lock);
 
@@ -1904,9 +1904,9 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags,
 		sent++;
 
 		/* __send_cap drops i_ceph_lock */
-		delayed += __send_cap(mdsc, cap, CEPH_CAP_OP_UPDATE, cap_used,
-				      want, retain, flushing,
-				      flush_tid, oldest_flush_tid);
+		delayed += __send_cap(mdsc, cap, CEPH_CAP_OP_UPDATE, false,
+				cap_used, want, retain, flushing,
+				flush_tid, oldest_flush_tid);
 		goto retry; /* retake i_ceph_lock and restart our cap scan. */
 	}
 
@@ -1970,9 +1970,9 @@ static int try_flush_caps(struct inode *inode, u64 *ptid)
 						&flush_tid, &oldest_flush_tid);
 
 		/* __send_cap drops i_ceph_lock */
-		delayed = __send_cap(mdsc, cap, CEPH_CAP_OP_FLUSH, used, want,
-				     (cap->issued | cap->implemented),
-				     flushing, flush_tid, oldest_flush_tid);
+		delayed = __send_cap(mdsc, cap, CEPH_CAP_OP_FLUSH, true,
+				used, want, (cap->issued | cap->implemented),
+				flushing, flush_tid, oldest_flush_tid);
 
 		if (delayed) {
 			spin_lock(&ci->i_ceph_lock);
@@ -2165,7 +2165,7 @@ static void __kick_flushing_caps(struct ceph_mds_client *mdsc,
 			     inode, cap, cf->tid, ceph_cap_string(cf->caps));
 			ci->i_ceph_flags |= CEPH_I_NODELAY;
 			ret = __send_cap(mdsc, cap, CEPH_CAP_OP_FLUSH,
-					  __ceph_caps_used(ci),
+					  false, __ceph_caps_used(ci),
 					  __ceph_caps_wanted(ci),
 					  cap->issued | cap->implemented,
 					  cf->caps, cf->tid, oldest_flush_tid);
-- 
2.7.4



* [RFC PATCH 10/10] ceph: turn on btime and change_attr support
  2016-11-04 11:34 [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode Jeff Layton
                   ` (8 preceding siblings ...)
  2016-11-04 11:34 ` [RFC PATCH 09/10] ceph: plumb "sync" parameter into __send_cap Jeff Layton
@ 2016-11-04 11:34 ` Jeff Layton
  9 siblings, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2016-11-04 11:34 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, zyan, sage

All of the necessary plumbing is in place.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ceph/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index b382e5910eea..35355ee88744 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -536,7 +536,8 @@ static struct ceph_fs_client *create_fs_client(struct ceph_mount_options *fsopt,
 	struct ceph_fs_client *fsc;
 	const u64 supported_features =
 		CEPH_FEATURE_FLOCK | CEPH_FEATURE_DIRLAYOUTHASH |
-		CEPH_FEATURE_MDSENC | CEPH_FEATURE_MDS_INLINE_DATA;
+		CEPH_FEATURE_MDSENC | CEPH_FEATURE_MDS_INLINE_DATA |
+		CEPH_FEATURE_FS_BTIME;
 	const u64 required_features = 0;
 	int page_count;
 	size_t size;
-- 
2.7.4



* Re: [RFC PATCH 07/10] ceph: update cap message struct version to 9
  2016-11-04 11:34 ` [RFC PATCH 07/10] ceph: update cap message struct version to 9 Jeff Layton
@ 2016-11-04 12:57   ` Jeff Layton
  2016-11-07  8:43     ` Yan, Zheng
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2016-11-04 12:57 UTC (permalink / raw)
  To: ceph-devel; +Cc: idryomov, zyan, sage

On Fri, 2016-11-04 at 07:34 -0400, Jeff Layton wrote:
> The userland ceph has MClientCaps at struct version 9. This brings the
> kernel up to the same version.
> 
> With this change, we have to start tracking the btime and change_attr,
> so that the client can pass back sane values in cap messages. The
> client doesn't care about the btime at all, so this is just passed
> around, but the change_attr is used when ceph is exported via NFS.
> 
> For now, the new "sync" parm is left at 0, to preserve the existing
> behavior of the client.
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/ceph/caps.c | 33 +++++++++++++++++++++++++--------
>  1 file changed, 25 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 6e99866b1946..452f5024589f 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -991,9 +991,9 @@ struct cap_msg_args {
>  	struct ceph_mds_session	*session;
>  	u64			ino, cid, follows;
>  	u64			flush_tid, oldest_flush_tid, size, max_size;
> -	u64			xattr_version;
> +	u64			xattr_version, change_attr;
>  	struct ceph_buffer	*xattr_buf;
> -	struct timespec		atime, mtime, ctime;
> +	struct timespec		atime, mtime, ctime, btime;
>  	int			op, caps, wanted, dirty;
>  	u32			seq, issue_seq, mseq, time_warp_seq;
>  	kuid_t			uid;
> @@ -1026,13 +1026,13 @@ static int send_cap_msg(struct cap_msg_args *arg)
>  
>  	/* flock buffer size + inline version + inline data size +
>  	 * osd_epoch_barrier + oldest_flush_tid */
> -	extra_len = 4 + 8 + 4 + 4 + 8;
> +	extra_len = 4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 1;
>  	msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc) + extra_len,
>  			   GFP_NOFS, false);
>  	if (!msg)
>  		return -ENOMEM;
>  
> -	msg->hdr.version = cpu_to_le16(6);
> +	msg->hdr.version = cpu_to_le16(9);
>  	msg->hdr.tid = cpu_to_le64(arg->flush_tid);
>  
>  	fc = msg->front.iov_base;
> @@ -1068,17 +1068,30 @@ static int send_cap_msg(struct cap_msg_args *arg)
>  	}
>  
>  	p = fc + 1;
> -	/* flock buffer size */
> +	/* flock buffer size (version 2) */
>  	ceph_encode_32(&p, 0);
> -	/* inline version */
> +	/* inline version (version 4) */
>  	ceph_encode_64(&p, arg->inline_data ? 0 : CEPH_INLINE_NONE);
>  	/* inline data size */
>  	ceph_encode_32(&p, 0);
> -	/* osd_epoch_barrier */
> +	/* osd_epoch_barrier (version 5) */
>  	ceph_encode_32(&p, 0);
> -	/* oldest_flush_tid */
> +	/* oldest_flush_tid (version 6) */
>  	ceph_encode_64(&p, arg->oldest_flush_tid);
>  
> +	/* caller_uid/caller_gid (version 7) */
> +	ceph_encode_32(&p, (u32)-1);
> +	ceph_encode_32(&p, (u32)-1);

A bit of self-review...

Not sure if we want to set the above to something else -- maybe 0 or to
current's creds? That may not always make sense though (during e.g.
writeback).

> +
> +	/* pool namespace (version 8) */
> +	ceph_encode_32(&p, 0);
> +

I'm a little unclear on how the above should be set, but I'll look over
the userland code and ape what it does.

> +	/* btime, change_attr, sync (version 9) */
> +	ceph_encode_timespec(p, &arg->btime);
> +	p += sizeof(struct ceph_timespec);
> +	ceph_encode_64(&p, arg->change_attr);
> +	ceph_encode_8(&p, 0);
> +
>  	ceph_con_send(&arg->session->s_con, msg);
>  	return 0;
>  }
> @@ -1189,9 +1202,11 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
>  		arg.xattr_buf = NULL;
>  	}
>  
> +	arg.change_attr = inode->i_version;
>  	arg.mtime = inode->i_mtime;
>  	arg.atime = inode->i_atime;
>  	arg.ctime = inode->i_ctime;
> +	arg.btime = ci->i_btime;
>  
>  	arg.op = op;
>  	arg.caps = cap->implemented;
> @@ -1241,10 +1256,12 @@ static inline int __send_flush_snap(struct inode *inode,
>  	arg.max_size = 0;
>  	arg.xattr_version = capsnap->xattr_version;
>  	arg.xattr_buf = capsnap->xattr_blob;
> +	arg.change_attr = capsnap->change_attr;
>  
>  	arg.atime = capsnap->atime;
>  	arg.mtime = capsnap->mtime;
>  	arg.ctime = capsnap->ctime;
> +	arg.btime = capsnap->btime;
>  
>  	arg.op = CEPH_CAP_OP_FLUSHSNAP;
>  	arg.caps = capsnap->issued;

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [RFC PATCH 08/10] ceph: add sync parameter to send_cap_msg
  2016-11-04 11:34 ` [RFC PATCH 08/10] ceph: add sync parameter to send_cap_msg Jeff Layton
@ 2016-11-07  8:32   ` Yan, Zheng
  2016-11-07 10:51     ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: Yan, Zheng @ 2016-11-07  8:32 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel, Ilya Dryomov, Zheng Yan, Sage Weil

On Fri, Nov 4, 2016 at 7:34 PM, Jeff Layton <jlayton@redhat.com> wrote:
> So we can request an MDS log flush on a cap message when we know that
> we'll be waiting on the result.
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/ceph/caps.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index 452f5024589f..e92c6ce53af6 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -999,7 +999,7 @@ struct cap_msg_args {
>         kuid_t                  uid;
>         kgid_t                  gid;
>         umode_t                 mode;
> -       bool                    inline_data;
> +       bool                    inline_data, sync;
>  };
>
>  /*
> @@ -1090,7 +1090,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
>         ceph_encode_timespec(p, &arg->btime);
>         p += sizeof(struct ceph_timespec);
>         ceph_encode_64(&p, arg->change_attr);
> -       ceph_encode_8(&p, 0);
> +       ceph_encode_8(&p, arg->sync);

this is the 'flag' field? In MClientCaps, it's 'unsigned int'

>
>         ceph_con_send(&arg->session->s_con, msg);
>         return 0;
> @@ -1223,6 +1223,7 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
>         arg.mode = inode->i_mode;
>
>         arg.inline_data = ci->i_inline_version != CEPH_INLINE_NONE;
> +       arg.sync = false;
>
>         spin_unlock(&ci->i_ceph_lock);
>
> @@ -1278,6 +1279,7 @@ static inline int __send_flush_snap(struct inode *inode,
>         arg.mode = capsnap->mode;
>
>         arg.inline_data = capsnap->inline_data;
> +       arg.sync = false;
>
>         return send_cap_msg(&arg);
>  }
> --
> 2.7.4
>


* Re: [RFC PATCH 07/10] ceph: update cap message struct version to 9
  2016-11-04 12:57   ` Jeff Layton
@ 2016-11-07  8:43     ` Yan, Zheng
  2016-11-07 11:21       ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: Yan, Zheng @ 2016-11-07  8:43 UTC (permalink / raw)
  To: Jeff Layton; +Cc: ceph-devel, Ilya Dryomov, Zheng Yan, Sage Weil

On Fri, Nov 4, 2016 at 8:57 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Fri, 2016-11-04 at 07:34 -0400, Jeff Layton wrote:
>> The userland ceph has MClientCaps at struct version 9. This brings the
>> kernel up to the same version.
>>
>> With this change, we have to start tracking the btime and change_attr,
>> so that the client can pass back sane values in cap messages. The
>> client doesn't care about the btime at all, so this is just passed
>> around, but the change_attr is used when ceph is exported via NFS.
>>
>> For now, the new "sync" parm is left at 0, to preserve the existing
>> behavior of the client.
>>
>> Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> ---
>>  fs/ceph/caps.c | 33 +++++++++++++++++++++++++--------
>>  1 file changed, 25 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> index 6e99866b1946..452f5024589f 100644
>> --- a/fs/ceph/caps.c
>> +++ b/fs/ceph/caps.c
>> @@ -991,9 +991,9 @@ struct cap_msg_args {
>>       struct ceph_mds_session *session;
>>       u64                     ino, cid, follows;
>>       u64                     flush_tid, oldest_flush_tid, size, max_size;
>> -     u64                     xattr_version;
>> +     u64                     xattr_version, change_attr;
>>       struct ceph_buffer      *xattr_buf;
>> -     struct timespec         atime, mtime, ctime;
>> +     struct timespec         atime, mtime, ctime, btime;
>>       int                     op, caps, wanted, dirty;
>>       u32                     seq, issue_seq, mseq, time_warp_seq;
>>       kuid_t                  uid;
>> @@ -1026,13 +1026,13 @@ static int send_cap_msg(struct cap_msg_args *arg)
>>
>>       /* flock buffer size + inline version + inline data size +
>>        * osd_epoch_barrier + oldest_flush_tid */
>> -     extra_len = 4 + 8 + 4 + 4 + 8;
>> +     extra_len = 4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 1;
>>       msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc) + extra_len,
>>                          GFP_NOFS, false);
>>       if (!msg)
>>               return -ENOMEM;
>>
>> -     msg->hdr.version = cpu_to_le16(6);
>> +     msg->hdr.version = cpu_to_le16(9);
>>       msg->hdr.tid = cpu_to_le64(arg->flush_tid);
>>
>>       fc = msg->front.iov_base;
>> @@ -1068,17 +1068,30 @@ static int send_cap_msg(struct cap_msg_args *arg)
>>       }
>>
>>       p = fc + 1;
>> -     /* flock buffer size */
>> +     /* flock buffer size (version 2) */
>>       ceph_encode_32(&p, 0);
>> -     /* inline version */
>> +     /* inline version (version 4) */
>>       ceph_encode_64(&p, arg->inline_data ? 0 : CEPH_INLINE_NONE);
>>       /* inline data size */
>>       ceph_encode_32(&p, 0);
>> -     /* osd_epoch_barrier */
>> +     /* osd_epoch_barrier (version 5) */
>>       ceph_encode_32(&p, 0);
>> -     /* oldest_flush_tid */
>> +     /* oldest_flush_tid (version 6) */
>>       ceph_encode_64(&p, arg->oldest_flush_tid);
>>
>> +     /* caller_uid/caller_gid (version 7) */
>> +     ceph_encode_32(&p, (u32)-1);
>> +     ceph_encode_32(&p, (u32)-1);
>
> A bit of self-review...
>
> Not sure if we want to set the above to something else -- maybe 0 or to
> current's creds? That may not always make sense though (during e.g.
> writeback).
>
>> +
>> +     /* pool namespace (version 8) */
>> +     ceph_encode_32(&p, 0);
>> +
>
> I'm a little unclear on how the above should be set, but I'll look over
> the userland code and ape what it does.

The pool namespace is useless in a client->mds cap message; setting its
length to 0 should be OK.

>
>> +     /* btime, change_attr, sync (version 9) */
>> +     ceph_encode_timespec(p, &arg->btime);
>> +     p += sizeof(struct ceph_timespec);
>> +     ceph_encode_64(&p, arg->change_attr);
>> +     ceph_encode_8(&p, 0);
>> +
>>       ceph_con_send(&arg->session->s_con, msg);
>>       return 0;
>>  }
>> @@ -1189,9 +1202,11 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
>>               arg.xattr_buf = NULL;
>>       }
>>
>> +     arg.change_attr = inode->i_version;
>>       arg.mtime = inode->i_mtime;
>>       arg.atime = inode->i_atime;
>>       arg.ctime = inode->i_ctime;
>> +     arg.btime = ci->i_btime;
>>
>>       arg.op = op;
>>       arg.caps = cap->implemented;
>> @@ -1241,10 +1256,12 @@ static inline int __send_flush_snap(struct inode *inode,
>>       arg.max_size = 0;
>>       arg.xattr_version = capsnap->xattr_version;
>>       arg.xattr_buf = capsnap->xattr_blob;
>> +     arg.change_attr = capsnap->change_attr;
>>
>>       arg.atime = capsnap->atime;
>>       arg.mtime = capsnap->mtime;
>>       arg.ctime = capsnap->ctime;
>> +     arg.btime = capsnap->btime;
>>
>>       arg.op = CEPH_CAP_OP_FLUSHSNAP;
>>       arg.caps = capsnap->issued;
>
> --
> Jeff Layton <jlayton@redhat.com>


* Re: [RFC PATCH 08/10] ceph: add sync parameter to send_cap_msg
  2016-11-07  8:32   ` Yan, Zheng
@ 2016-11-07 10:51     ` Jeff Layton
  0 siblings, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2016-11-07 10:51 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, Ilya Dryomov, Zheng Yan, Sage Weil

On Mon, 2016-11-07 at 16:32 +0800, Yan, Zheng wrote:
> On Fri, Nov 4, 2016 at 7:34 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > So we can request an MDS log flush on a cap message when we know that
> > we'll be waiting on the result.
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> >  fs/ceph/caps.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index 452f5024589f..e92c6ce53af6 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -999,7 +999,7 @@ struct cap_msg_args {
> >         kuid_t                  uid;
> >         kgid_t                  gid;
> >         umode_t                 mode;
> > -       bool                    inline_data;
> > +       bool                    inline_data, sync;
> >  };
> > 
> >  /*
> > @@ -1090,7 +1090,7 @@ static int send_cap_msg(struct cap_msg_args *arg)
> >         ceph_encode_timespec(p, &arg->btime);
> >         p += sizeof(struct ceph_timespec);
> >         ceph_encode_64(&p, arg->change_attr);
> > -       ceph_encode_8(&p, 0);
> > +       ceph_encode_8(&p, arg->sync);
> 
> this is the 'flag' field? In MClientCaps, it's 'unsigned int'
> 

Thanks for the review.

Yes, the first version of the userland patch series had this as a bool,
but I changed it to a flags field based on review of that set.

I'll be posting a newer kernel set in the next day or two.

> > 
> > 
> >         ceph_con_send(&arg->session->s_con, msg);
> >         return 0;
> > @@ -1223,6 +1223,7 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
> >         arg.mode = inode->i_mode;
> > 
> >         arg.inline_data = ci->i_inline_version != CEPH_INLINE_NONE;
> > +       arg.sync = false;
> > 
> >         spin_unlock(&ci->i_ceph_lock);
> > 
> > @@ -1278,6 +1279,7 @@ static inline int __send_flush_snap(struct inode *inode,
> >         arg.mode = capsnap->mode;
> > 
> >         arg.inline_data = capsnap->inline_data;
> > +       arg.sync = false;
> > 
> >         return send_cap_msg(&arg);
> >  }
> > --
> > 2.7.4
> > 

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [RFC PATCH 07/10] ceph: update cap message struct version to 9
  2016-11-07  8:43     ` Yan, Zheng
@ 2016-11-07 11:21       ` Jeff Layton
  2016-11-07 14:05         ` Sage Weil
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2016-11-07 11:21 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: ceph-devel, Ilya Dryomov, Zheng Yan, Sage Weil

On Mon, 2016-11-07 at 16:43 +0800, Yan, Zheng wrote:
> On Fri, Nov 4, 2016 at 8:57 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > On Fri, 2016-11-04 at 07:34 -0400, Jeff Layton wrote:
> > > 
> > > The userland ceph has MClientCaps at struct version 9. This brings the
> > > kernel up to the same version.
> > > 
> > > With this change, we have to start tracking the btime and change_attr,
> > > so that the client can pass back sane values in cap messages. The
> > > client doesn't care about the btime at all, so this is just passed
> > > around, but the change_attr is used when ceph is exported via NFS.
> > > 
> > > For now, the new "sync" parm is left at 0, to preserve the existing
> > > behavior of the client.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > ---
> > >  fs/ceph/caps.c | 33 +++++++++++++++++++++++++--------
> > >  1 file changed, 25 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > index 6e99866b1946..452f5024589f 100644
> > > --- a/fs/ceph/caps.c
> > > +++ b/fs/ceph/caps.c
> > > @@ -991,9 +991,9 @@ struct cap_msg_args {
> > >       struct ceph_mds_session *session;
> > >       u64                     ino, cid, follows;
> > >       u64                     flush_tid, oldest_flush_tid, size, max_size;
> > > -     u64                     xattr_version;
> > > +     u64                     xattr_version, change_attr;
> > >       struct ceph_buffer      *xattr_buf;
> > > -     struct timespec         atime, mtime, ctime;
> > > +     struct timespec         atime, mtime, ctime, btime;
> > >       int                     op, caps, wanted, dirty;
> > >       u32                     seq, issue_seq, mseq, time_warp_seq;
> > >       kuid_t                  uid;
> > > @@ -1026,13 +1026,13 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > 
> > >       /* flock buffer size + inline version + inline data size +
> > >        * osd_epoch_barrier + oldest_flush_tid */
> > > -     extra_len = 4 + 8 + 4 + 4 + 8;
> > > +     extra_len = 4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 1;
> > >       msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc) + extra_len,
> > >                          GFP_NOFS, false);
> > >       if (!msg)
> > >               return -ENOMEM;
> > > 
> > > -     msg->hdr.version = cpu_to_le16(6);
> > > +     msg->hdr.version = cpu_to_le16(9);
> > >       msg->hdr.tid = cpu_to_le64(arg->flush_tid);
> > > 
> > >       fc = msg->front.iov_base;
> > > @@ -1068,17 +1068,30 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > >       }
> > > 
> > >       p = fc + 1;
> > > -     /* flock buffer size */
> > > +     /* flock buffer size (version 2) */
> > >       ceph_encode_32(&p, 0);
> > > -     /* inline version */
> > > +     /* inline version (version 4) */
> > >       ceph_encode_64(&p, arg->inline_data ? 0 : CEPH_INLINE_NONE);
> > >       /* inline data size */
> > >       ceph_encode_32(&p, 0);
> > > -     /* osd_epoch_barrier */
> > > +     /* osd_epoch_barrier (version 5) */
> > >       ceph_encode_32(&p, 0);
> > > -     /* oldest_flush_tid */
> > > +     /* oldest_flush_tid (version 6) */
> > >       ceph_encode_64(&p, arg->oldest_flush_tid);
> > > 
> > > +     /* caller_uid/caller_gid (version 7) */
> > > +     ceph_encode_32(&p, (u32)-1);
> > > +     ceph_encode_32(&p, (u32)-1);
> > 
> > A bit of self-review...
> > 
> > Not sure if we want to set the above to something else -- maybe 0 or to
> > current's creds? That may not always make sense though (during e.g.
> > writeback).
> > 

Looking further, I'm not quite sure I understand why we send creds at
all in cap messages. Can you clarify where that matters?

The way I look at it, caps are something like a more granular NFS
delegation or SMB oplock.

In that light, a cap flush is just the client sending updated attrs for
the exclusive caps that it has already been granted. Is there a
situation where we would ever want to refuse that update?

Note that nothing ever checks the return code for _do_cap_update in the
userland code. If the permissions check fails, then we'll end up
silently dropping the updated attrs on the floor.

> > > 
> > > +
> > > +     /* pool namespace (version 8) */
> > > +     ceph_encode_32(&p, 0);
> > > +
> > 
> > I'm a little unclear on how the above should be set, but I'll look over
> > the userland code and ape what it does.
> 
> pool namespace is not used in client->mds cap messages; setting its
> length to 0 should be OK.
> 

Thanks. I went ahead and added a comment to that effect in the updated
set I'm testing now.
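
i.e. something like this in send_cap_msg (just the comment; no change
in behavior):

        /* pool namespace (version 8) -- unused in client->mds cap
         * messages, so always encode a zero length */
        ceph_encode_32(&p, 0);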

> > 
> > 
> > > 
> > > +     /* btime, change_attr, sync (version 9) */
> > > +     ceph_encode_timespec(p, &arg->btime);
> > > +     p += sizeof(struct ceph_timespec);
> > > +     ceph_encode_64(&p, arg->change_attr);
> > > +     ceph_encode_8(&p, 0);
> > > +
> > >       ceph_con_send(&arg->session->s_con, msg);
> > >       return 0;
> > >  }
> > > @@ -1189,9 +1202,11 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
> > >               arg.xattr_buf = NULL;
> > >       }
> > > 
> > > +     arg.change_attr = inode->i_version;
> > >       arg.mtime = inode->i_mtime;
> > >       arg.atime = inode->i_atime;
> > >       arg.ctime = inode->i_ctime;
> > > +     arg.btime = ci->i_btime;
> > > 
> > >       arg.op = op;
> > >       arg.caps = cap->implemented;
> > > @@ -1241,10 +1256,12 @@ static inline int __send_flush_snap(struct inode *inode,
> > >       arg.max_size = 0;
> > >       arg.xattr_version = capsnap->xattr_version;
> > >       arg.xattr_buf = capsnap->xattr_blob;
> > > +     arg.change_attr = capsnap->change_attr;
> > > 
> > >       arg.atime = capsnap->atime;
> > >       arg.mtime = capsnap->mtime;
> > >       arg.ctime = capsnap->ctime;
> > > +     arg.btime = capsnap->btime;
> > > 
> > >       arg.op = CEPH_CAP_OP_FLUSHSNAP;
> > >       arg.caps = capsnap->issued;
> > 
> > --
> > Jeff Layton <jlayton@redhat.com>

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 07/10] ceph: update cap message struct version to 9
  2016-11-07 11:21       ` Jeff Layton
@ 2016-11-07 14:05         ` Sage Weil
  2016-11-07 14:22           ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: Sage Weil @ 2016-11-07 14:05 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, ceph-devel, Ilya Dryomov, Zheng Yan

On Mon, 7 Nov 2016, Jeff Layton wrote:
> On Mon, 2016-11-07 at 16:43 +0800, Yan, Zheng wrote:
> > On Fri, Nov 4, 2016 at 8:57 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > 
> > > On Fri, 2016-11-04 at 07:34 -0400, Jeff Layton wrote:
> > > > 
> > > > The userland ceph has MClientCaps at struct version 9. This brings the
> > > > kernel up to the same version.
> > > > 
> > > > With this change, we have to start tracking the btime and change_attr,
> > > > so that the client can pass back sane values in cap messages. The
> > > > client doesn't care about the btime at all, so this is just passed
> > > > around, but the change_attr is used when ceph is exported via NFS.
> > > > 
> > > > For now, the new "sync" parm is left at 0, to preserve the existing
> > > > behavior of the client.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > ---
> > > >  fs/ceph/caps.c | 33 +++++++++++++++++++++++++--------
> > > >  1 file changed, 25 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > index 6e99866b1946..452f5024589f 100644
> > > > --- a/fs/ceph/caps.c
> > > > +++ b/fs/ceph/caps.c
> > > > @@ -991,9 +991,9 @@ struct cap_msg_args {
> > > >       struct ceph_mds_session *session;
> > > >       u64                     ino, cid, follows;
> > > >       u64                     flush_tid, oldest_flush_tid, size, max_size;
> > > > -     u64                     xattr_version;
> > > > +     u64                     xattr_version, change_attr;
> > > >       struct ceph_buffer      *xattr_buf;
> > > > -     struct timespec         atime, mtime, ctime;
> > > > +     struct timespec         atime, mtime, ctime, btime;
> > > >       int                     op, caps, wanted, dirty;
> > > >       u32                     seq, issue_seq, mseq, time_warp_seq;
> > > >       kuid_t                  uid;
> > > > @@ -1026,13 +1026,13 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > > 
> > > >       /* flock buffer size + inline version + inline data size +
> > > >        * osd_epoch_barrier + oldest_flush_tid */
> > > > -     extra_len = 4 + 8 + 4 + 4 + 8;
> > > > +     extra_len = 4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 1;
> > > >       msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc) + extra_len,
> > > >                          GFP_NOFS, false);
> > > >       if (!msg)
> > > >               return -ENOMEM;
> > > > 
> > > > -     msg->hdr.version = cpu_to_le16(6);
> > > > +     msg->hdr.version = cpu_to_le16(9);
> > > >       msg->hdr.tid = cpu_to_le64(arg->flush_tid);
> > > > 
> > > >       fc = msg->front.iov_base;
> > > > @@ -1068,17 +1068,30 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > >       }
> > > > 
> > > >       p = fc + 1;
> > > > -     /* flock buffer size */
> > > > +     /* flock buffer size (version 2) */
> > > >       ceph_encode_32(&p, 0);
> > > > -     /* inline version */
> > > > +     /* inline version (version 4) */
> > > >       ceph_encode_64(&p, arg->inline_data ? 0 : CEPH_INLINE_NONE);
> > > >       /* inline data size */
> > > >       ceph_encode_32(&p, 0);
> > > > -     /* osd_epoch_barrier */
> > > > +     /* osd_epoch_barrier (version 5) */
> > > >       ceph_encode_32(&p, 0);
> > > > -     /* oldest_flush_tid */
> > > > +     /* oldest_flush_tid (version 6) */
> > > >       ceph_encode_64(&p, arg->oldest_flush_tid);
> > > > 
> > > > +     /* caller_uid/caller_gid (version 7) */
> > > > +     ceph_encode_32(&p, (u32)-1);
> > > > +     ceph_encode_32(&p, (u32)-1);
> > > 
> > > A bit of self-review...
> > > 
> > > Not sure if we want to set the above to something else -- maybe 0 or to
> > > current's creds? That may not always make sense though (during e.g.
> > > writeback).
> > > 
> 
> Looking further, I'm not quite sure I understand why we send creds at
> all in cap messages. Can you clarify where that matters?
> 
> The way I look at it, caps are something like a more granular NFS
> delegation or SMB oplock.
> 
> In that light, a cap flush is just the client sending updated attrs for
> the exclusive caps that it has already been granted. Is there a
> situation where we would ever want to refuse that update?

A chmod or chown can be done locally if you have excl caps and flushed 
back to the MDS via a caps message.  We need to verify the user has 
permission to make the change.

> Note that nothing ever checks the return code for _do_cap_update in the
> userland code. If the permissions check fails, then we'll end up
> silently dropping the updated attrs on the floor.

Yeah.  This was mainly for expediency... the protocol assumes that flushes
don't fail.  Given that the client does its own permissions check, I
think the way to improve this is to have it prevent the flush in the first
place, so that it's only nefarious clients that are affected (and who
cares if they get confused).  I don't think we have a particularly good
way to tell the client it can't, say, sudo chown 0:0 a file, though.

sage


> 
> > > > 
> > > > +
> > > > +     /* pool namespace (version 8) */
> > > > +     ceph_encode_32(&p, 0);
> > > > +
> > > 
> > > I'm a little unclear on how the above should be set, but I'll look over
> > > the userland code and ape what it does.
> > 
> > pool namespace is not used in client->mds cap messages; setting its
> > length to 0 should be OK.
> > 
> 
> Thanks. I went ahead and added a comment to that effect in the updated
> set I'm testing now.
> 
> > > 
> > > 
> > > > 
> > > > +     /* btime, change_attr, sync (version 9) */
> > > > +     ceph_encode_timespec(p, &arg->btime);
> > > > +     p += sizeof(struct ceph_timespec);
> > > > +     ceph_encode_64(&p, arg->change_attr);
> > > > +     ceph_encode_8(&p, 0);
> > > > +
> > > >       ceph_con_send(&arg->session->s_con, msg);
> > > >       return 0;
> > > >  }
> > > > @@ -1189,9 +1202,11 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
> > > >               arg.xattr_buf = NULL;
> > > >       }
> > > > 
> > > > +     arg.change_attr = inode->i_version;
> > > >       arg.mtime = inode->i_mtime;
> > > >       arg.atime = inode->i_atime;
> > > >       arg.ctime = inode->i_ctime;
> > > > +     arg.btime = ci->i_btime;
> > > > 
> > > >       arg.op = op;
> > > >       arg.caps = cap->implemented;
> > > > @@ -1241,10 +1256,12 @@ static inline int __send_flush_snap(struct inode *inode,
> > > >       arg.max_size = 0;
> > > >       arg.xattr_version = capsnap->xattr_version;
> > > >       arg.xattr_buf = capsnap->xattr_blob;
> > > > +     arg.change_attr = capsnap->change_attr;
> > > > 
> > > >       arg.atime = capsnap->atime;
> > > >       arg.mtime = capsnap->mtime;
> > > >       arg.ctime = capsnap->ctime;
> > > > +     arg.btime = capsnap->btime;
> > > > 
> > > >       arg.op = CEPH_CAP_OP_FLUSHSNAP;
> > > >       arg.caps = capsnap->issued;
> > > 
> > > --
> > > Jeff Layton <jlayton@redhat.com>
> 
> -- 
> Jeff Layton <jlayton@redhat.com>
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 07/10] ceph: update cap message struct version to 9
  2016-11-07 14:05         ` Sage Weil
@ 2016-11-07 14:22           ` Jeff Layton
  2016-11-07 14:36             ` Sage Weil
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2016-11-07 14:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: Yan, Zheng, ceph-devel, Ilya Dryomov, Zheng Yan

On Mon, 2016-11-07 at 14:05 +0000, Sage Weil wrote:
> On Mon, 7 Nov 2016, Jeff Layton wrote:
> > On Mon, 2016-11-07 at 16:43 +0800, Yan, Zheng wrote:
> > > On Fri, Nov 4, 2016 at 8:57 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > 
> > > > On Fri, 2016-11-04 at 07:34 -0400, Jeff Layton wrote:
> > > > > 
> > > > > The userland ceph has MClientCaps at struct version 9. This brings the
> > > > > kernel up to the same version.
> > > > > 
> > > > > With this change, we have to start tracking the btime and change_attr,
> > > > > so that the client can pass back sane values in cap messages. The
> > > > > client doesn't care about the btime at all, so this is just passed
> > > > > around, but the change_attr is used when ceph is exported via NFS.
> > > > > 
> > > > > For now, the new "sync" parm is left at 0, to preserve the existing
> > > > > behavior of the client.
> > > > > 
> > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > ---
> > > > >  fs/ceph/caps.c | 33 +++++++++++++++++++++++++--------
> > > > >  1 file changed, 25 insertions(+), 8 deletions(-)
> > > > > 
> > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > > index 6e99866b1946..452f5024589f 100644
> > > > > --- a/fs/ceph/caps.c
> > > > > +++ b/fs/ceph/caps.c
> > > > > @@ -991,9 +991,9 @@ struct cap_msg_args {
> > > > >       struct ceph_mds_session *session;
> > > > >       u64                     ino, cid, follows;
> > > > >       u64                     flush_tid, oldest_flush_tid, size, max_size;
> > > > > -     u64                     xattr_version;
> > > > > +     u64                     xattr_version, change_attr;
> > > > >       struct ceph_buffer      *xattr_buf;
> > > > > -     struct timespec         atime, mtime, ctime;
> > > > > +     struct timespec         atime, mtime, ctime, btime;
> > > > >       int                     op, caps, wanted, dirty;
> > > > >       u32                     seq, issue_seq, mseq, time_warp_seq;
> > > > >       kuid_t                  uid;
> > > > > @@ -1026,13 +1026,13 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > > > 
> > > > >       /* flock buffer size + inline version + inline data size +
> > > > >        * osd_epoch_barrier + oldest_flush_tid */
> > > > > -     extra_len = 4 + 8 + 4 + 4 + 8;
> > > > > +     extra_len = 4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 1;
> > > > >       msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc) + extra_len,
> > > > >                          GFP_NOFS, false);
> > > > >       if (!msg)
> > > > >               return -ENOMEM;
> > > > > 
> > > > > -     msg->hdr.version = cpu_to_le16(6);
> > > > > +     msg->hdr.version = cpu_to_le16(9);
> > > > >       msg->hdr.tid = cpu_to_le64(arg->flush_tid);
> > > > > 
> > > > >       fc = msg->front.iov_base;
> > > > > @@ -1068,17 +1068,30 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > > >       }
> > > > > 
> > > > >       p = fc + 1;
> > > > > -     /* flock buffer size */
> > > > > +     /* flock buffer size (version 2) */
> > > > >       ceph_encode_32(&p, 0);
> > > > > -     /* inline version */
> > > > > +     /* inline version (version 4) */
> > > > >       ceph_encode_64(&p, arg->inline_data ? 0 : CEPH_INLINE_NONE);
> > > > >       /* inline data size */
> > > > >       ceph_encode_32(&p, 0);
> > > > > -     /* osd_epoch_barrier */
> > > > > +     /* osd_epoch_barrier (version 5) */
> > > > >       ceph_encode_32(&p, 0);
> > > > > -     /* oldest_flush_tid */
> > > > > +     /* oldest_flush_tid (version 6) */
> > > > >       ceph_encode_64(&p, arg->oldest_flush_tid);
> > > > > 
> > > > > +     /* caller_uid/caller_gid (version 7) */
> > > > > +     ceph_encode_32(&p, (u32)-1);
> > > > > +     ceph_encode_32(&p, (u32)-1);
> > > > 
> > > > A bit of self-review...
> > > > 
> > > > Not sure if we want to set the above to something else -- maybe 0 or to
> > > > current's creds? That may not always make sense though (during e.g.
> > > > writeback).
> > > > 
> > 
> > Looking further, I'm not quite sure I understand why we send creds at
> > all in cap messages. Can you clarify where that matters?
> > 
> > The way I look at it, caps are something like a more granular NFS
> > delegation or SMB oplock.
> > 
> > In that light, a cap flush is just the client sending updated attrs for
> > the exclusive caps that it has already been granted. Is there a
> > situation where we would ever want to refuse that update?
> 
> A chmod or chown can be done locally if you have excl caps and flushed 
> back to the MDS via a caps message.  We need to verify the user has 
> permission to make the change.
> 

My take is that once the MDS has delegated Ax to the client, then it's
effectively trusting the client to handle permissions enforcement
correctly. I don't see why we should second guess that.

> > Note that nothing ever checks the return code for _do_cap_update in the
> > userland code. If the permissions check fails, then we'll end up
> > silently dropping the updated attrs on the floor.
> 
> Yeah.  This was mainly for expediency... the protocol assumes that flushes
> don't fail.  Given that the client does its own permissions check, I
> think the way to improve this is to have it prevent the flush in the first
> place, so that it's only nefarious clients that are affected (and who
> cares if they get confused).  I don't think we have a particularly good
> way to tell the client it can't, say, sudo chown 0:0 a file, though.
> 

Sorry, I don't quite follow. How would we prevent the flush from a
nefarious client (which is not something we can really control)?

In any case...ISTM that the permissions check in _do_cap_update ought to
be replaced by a cephx key check. IOW, what we really want to know is
whether the client is truly the one to which we delegated the caps. If
so, then we sort of have to trust that it's doing the right thing with
respect to permissions checking here.

Does that make sense?

> 
> > 
> > > > > 
> > > > > +
> > > > > +     /* pool namespace (version 8) */
> > > > > +     ceph_encode_32(&p, 0);
> > > > > +
> > > > 
> > > > I'm a little unclear on how the above should be set, but I'll look over
> > > > the userland code and ape what it does.
> > > 
> > > pool namespace is not used in client->mds cap messages; setting its
> > > length to 0 should be OK.
> > > 
> > 
> > Thanks. I went ahead and added a comment to that effect in the updated
> > set I'm testing now.
> > 
> > > > 
> > > > 
> > > > > 
> > > > > +     /* btime, change_attr, sync (version 9) */
> > > > > +     ceph_encode_timespec(p, &arg->btime);
> > > > > +     p += sizeof(struct ceph_timespec);
> > > > > +     ceph_encode_64(&p, arg->change_attr);
> > > > > +     ceph_encode_8(&p, 0);
> > > > > +
> > > > >       ceph_con_send(&arg->session->s_con, msg);
> > > > >       return 0;
> > > > >  }
> > > > > @@ -1189,9 +1202,11 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
> > > > >               arg.xattr_buf = NULL;
> > > > >       }
> > > > > 
> > > > > +     arg.change_attr = inode->i_version;
> > > > >       arg.mtime = inode->i_mtime;
> > > > >       arg.atime = inode->i_atime;
> > > > >       arg.ctime = inode->i_ctime;
> > > > > +     arg.btime = ci->i_btime;
> > > > > 
> > > > >       arg.op = op;
> > > > >       arg.caps = cap->implemented;
> > > > > @@ -1241,10 +1256,12 @@ static inline int __send_flush_snap(struct inode *inode,
> > > > >       arg.max_size = 0;
> > > > >       arg.xattr_version = capsnap->xattr_version;
> > > > >       arg.xattr_buf = capsnap->xattr_blob;
> > > > > +     arg.change_attr = capsnap->change_attr;
> > > > > 
> > > > >       arg.atime = capsnap->atime;
> > > > >       arg.mtime = capsnap->mtime;
> > > > >       arg.ctime = capsnap->ctime;
> > > > > +     arg.btime = capsnap->btime;
> > > > > 
> > > > >       arg.op = CEPH_CAP_OP_FLUSHSNAP;
> > > > >       arg.caps = capsnap->issued;
> > > > 
> > > > --
> > > > Jeff Layton <jlayton@redhat.com>
> > 
> > -- 
> > Jeff Layton <jlayton@redhat.com>
> > 
> > 

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 07/10] ceph: update cap message struct version to 9
  2016-11-07 14:22           ` Jeff Layton
@ 2016-11-07 14:36             ` Sage Weil
  2016-11-07 18:39               ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: Sage Weil @ 2016-11-07 14:36 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Yan, Zheng, ceph-devel, Ilya Dryomov, Zheng Yan

On Mon, 7 Nov 2016, Jeff Layton wrote:
> On Mon, 2016-11-07 at 14:05 +0000, Sage Weil wrote:
> > On Mon, 7 Nov 2016, Jeff Layton wrote:
> > > On Mon, 2016-11-07 at 16:43 +0800, Yan, Zheng wrote:
> > > > On Fri, Nov 4, 2016 at 8:57 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > > 
> > > > > On Fri, 2016-11-04 at 07:34 -0400, Jeff Layton wrote:
> > > > > > 
> > > > > > The userland ceph has MClientCaps at struct version 9. This brings the
> > > > > > kernel up to the same version.
> > > > > > 
> > > > > > With this change, we have to start tracking the btime and change_attr,
> > > > > > so that the client can pass back sane values in cap messages. The
> > > > > > client doesn't care about the btime at all, so this is just passed
> > > > > > around, but the change_attr is used when ceph is exported via NFS.
> > > > > > 
> > > > > > For now, the new "sync" parm is left at 0, to preserve the existing
> > > > > > behavior of the client.
> > > > > > 
> > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > > ---
> > > > > >  fs/ceph/caps.c | 33 +++++++++++++++++++++++++--------
> > > > > >  1 file changed, 25 insertions(+), 8 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > > > index 6e99866b1946..452f5024589f 100644
> > > > > > --- a/fs/ceph/caps.c
> > > > > > +++ b/fs/ceph/caps.c
> > > > > > @@ -991,9 +991,9 @@ struct cap_msg_args {
> > > > > >       struct ceph_mds_session *session;
> > > > > >       u64                     ino, cid, follows;
> > > > > >       u64                     flush_tid, oldest_flush_tid, size, max_size;
> > > > > > -     u64                     xattr_version;
> > > > > > +     u64                     xattr_version, change_attr;
> > > > > >       struct ceph_buffer      *xattr_buf;
> > > > > > -     struct timespec         atime, mtime, ctime;
> > > > > > +     struct timespec         atime, mtime, ctime, btime;
> > > > > >       int                     op, caps, wanted, dirty;
> > > > > >       u32                     seq, issue_seq, mseq, time_warp_seq;
> > > > > >       kuid_t                  uid;
> > > > > > @@ -1026,13 +1026,13 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > > > > 
> > > > > >       /* flock buffer size + inline version + inline data size +
> > > > > >        * osd_epoch_barrier + oldest_flush_tid */
> > > > > > -     extra_len = 4 + 8 + 4 + 4 + 8;
> > > > > > +     extra_len = 4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 1;
> > > > > >       msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc) + extra_len,
> > > > > >                          GFP_NOFS, false);
> > > > > >       if (!msg)
> > > > > >               return -ENOMEM;
> > > > > > 
> > > > > > -     msg->hdr.version = cpu_to_le16(6);
> > > > > > +     msg->hdr.version = cpu_to_le16(9);
> > > > > >       msg->hdr.tid = cpu_to_le64(arg->flush_tid);
> > > > > > 
> > > > > >       fc = msg->front.iov_base;
> > > > > > @@ -1068,17 +1068,30 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > > > >       }
> > > > > > 
> > > > > >       p = fc + 1;
> > > > > > -     /* flock buffer size */
> > > > > > +     /* flock buffer size (version 2) */
> > > > > >       ceph_encode_32(&p, 0);
> > > > > > -     /* inline version */
> > > > > > +     /* inline version (version 4) */
> > > > > >       ceph_encode_64(&p, arg->inline_data ? 0 : CEPH_INLINE_NONE);
> > > > > >       /* inline data size */
> > > > > >       ceph_encode_32(&p, 0);
> > > > > > -     /* osd_epoch_barrier */
> > > > > > +     /* osd_epoch_barrier (version 5) */
> > > > > >       ceph_encode_32(&p, 0);
> > > > > > -     /* oldest_flush_tid */
> > > > > > +     /* oldest_flush_tid (version 6) */
> > > > > >       ceph_encode_64(&p, arg->oldest_flush_tid);
> > > > > > 
> > > > > > +     /* caller_uid/caller_gid (version 7) */
> > > > > > +     ceph_encode_32(&p, (u32)-1);
> > > > > > +     ceph_encode_32(&p, (u32)-1);
> > > > > 
> > > > > A bit of self-review...
> > > > > 
> > > > > Not sure if we want to set the above to something else -- maybe 0 or to
> > > > > current's creds? That may not always make sense though (during e.g.
> > > > > writeback).
> > > > > 
> > > 
> > > Looking further, I'm not quite sure I understand why we send creds at
> > > all in cap messages. Can you clarify where that matters?
> > > 
> > > The way I look at it, caps are something like a more granular NFS
> > > delegation or SMB oplock.
> > > 
> > > In that light, a cap flush is just the client sending updated attrs for
> > > the exclusive caps that it has already been granted. Is there a
> > > situation where we would ever want to refuse that update?
> > 
> > A chmod or chown can be done locally if you have excl caps and flushed 
> > back to the MDS via a caps message.  We need to verify the user has 
> > permission to make the change.
> > 
> 
> My take is that once the MDS has delegated Ax to the client, then it's
> effectively trusting the client to handle permissions enforcement
> correctly. I don't see why we should second guess that.
> 
> > > Note that nothing ever checks the return code for _do_cap_update in the
> > > userland code. If the permissions check fails, then we'll end up
> > > silently dropping the updated attrs on the floor.
> > 
> > Yeah.  This was mainly for expediency... the protocol assumes that flushes
> > don't fail.  Given that the client does its own permissions check, I
> > think the way to improve this is to have it prevent the flush in the first
> > place, so that it's only nefarious clients that are affected (and who
> > cares if they get confused).  I don't think we have a particularly good
> > way to tell the client it can't, say, sudo chown 0:0 a file, though.
> > 
> 
> Sorry, I don't quite follow. How would we prevent the flush from a
> nefarious client (which is not something we can really control)?
> 
> In any case...ISTM that the permissions check in _do_cap_update ought to
> be replaced by a cephx key check. IOW, what we really want to know is
> whether the client is truly the one to which we delegated the caps. If
> so, then we sort of have to trust that it's doing the right thing with
> respect to permissions checking here.

The capability can say "you are allowed to be uid 1000 or uid 1020." We 
want to delegate the EXCL caps to the client so that a create + chmod + 
chown + write can all happen efficiently, but we still need to ensure that 
the values they set are legal (a permitted uid/gid combo).

A common example would be user workstations that are allowed access to 
/home/user and restricted via their mds caps to their uid/gid.  We need to 
prevent them from doing a 'sudo chown 0:0 foo'...
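
For example, such a workstation's mds capability might look something
like this (values are illustrative):

   "allow rw path=/home/user uid=1000 gids=1000"

and the MDS then checks any uid/gid the client flushes back against
that list.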

sage




> 
> Does that make sense?
> 
> > 
> > > 
> > > > > > 
> > > > > > +
> > > > > > +     /* pool namespace (version 8) */
> > > > > > +     ceph_encode_32(&p, 0);
> > > > > > +
> > > > > 
> > > > > I'm a little unclear on how the above should be set, but I'll look over
> > > > > the userland code and ape what it does.
> > > > 
> > > > pool namespace is not used in client->mds cap messages; setting its
> > > > length to 0 should be OK.
> > > > 
> > > 
> > > Thanks. I went ahead and added a comment to that effect in the updated
> > > set I'm testing now.
> > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > +     /* btime, change_attr, sync (version 9) */
> > > > > > +     ceph_encode_timespec(p, &arg->btime);
> > > > > > +     p += sizeof(struct ceph_timespec);
> > > > > > +     ceph_encode_64(&p, arg->change_attr);
> > > > > > +     ceph_encode_8(&p, 0);
> > > > > > +
> > > > > >       ceph_con_send(&arg->session->s_con, msg);
> > > > > >       return 0;
> > > > > >  }
> > > > > > @@ -1189,9 +1202,11 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap,
> > > > > >               arg.xattr_buf = NULL;
> > > > > >       }
> > > > > > 
> > > > > > +     arg.change_attr = inode->i_version;
> > > > > >       arg.mtime = inode->i_mtime;
> > > > > >       arg.atime = inode->i_atime;
> > > > > >       arg.ctime = inode->i_ctime;
> > > > > > +     arg.btime = ci->i_btime;
> > > > > > 
> > > > > >       arg.op = op;
> > > > > >       arg.caps = cap->implemented;
> > > > > > @@ -1241,10 +1256,12 @@ static inline int __send_flush_snap(struct inode *inode,
> > > > > >       arg.max_size = 0;
> > > > > >       arg.xattr_version = capsnap->xattr_version;
> > > > > >       arg.xattr_buf = capsnap->xattr_blob;
> > > > > > +     arg.change_attr = capsnap->change_attr;
> > > > > > 
> > > > > >       arg.atime = capsnap->atime;
> > > > > >       arg.mtime = capsnap->mtime;
> > > > > >       arg.ctime = capsnap->ctime;
> > > > > > +     arg.btime = capsnap->btime;
> > > > > > 
> > > > > >       arg.op = CEPH_CAP_OP_FLUSHSNAP;
> > > > > >       arg.caps = capsnap->issued;
> > > > > 
> > > > > --
> > > > > Jeff Layton <jlayton@redhat.com>
> > > 
> > > -- 
> > > Jeff Layton <jlayton@redhat.com>
> > > 
> > > 
> 
> -- 
> Jeff Layton <jlayton@redhat.com>
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 07/10] ceph: update cap message struct version to 9
  2016-11-07 14:36             ` Sage Weil
@ 2016-11-07 18:39               ` Jeff Layton
  2016-11-07 19:15                 ` Sage Weil
  2016-11-07 19:53                 ` Gregory Farnum
  0 siblings, 2 replies; 24+ messages in thread
From: Jeff Layton @ 2016-11-07 18:39 UTC (permalink / raw)
  To: Sage Weil; +Cc: Yan, Zheng, ceph-devel, Ilya Dryomov, Zheng Yan, Gregory Farnum

On Mon, 2016-11-07 at 14:36 +0000, Sage Weil wrote:
> On Mon, 7 Nov 2016, Jeff Layton wrote:
> > On Mon, 2016-11-07 at 14:05 +0000, Sage Weil wrote:
> > > On Mon, 7 Nov 2016, Jeff Layton wrote:
> > > > On Mon, 2016-11-07 at 16:43 +0800, Yan, Zheng wrote:
> > > > > On Fri, Nov 4, 2016 at 8:57 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > > > 
> > > > > > On Fri, 2016-11-04 at 07:34 -0400, Jeff Layton wrote:
> > > > > > > 
> > > > > > > The userland ceph has MClientCaps at struct version 9. This brings the
> > > > > > > kernel up to the same version.
> > > > > > > 
> > > > > > > With this change, we have to start tracking the btime and change_attr,
> > > > > > > so that the client can pass back sane values in cap messages. The
> > > > > > > client doesn't care about the btime at all, so this is just passed
> > > > > > > around, but the change_attr is used when ceph is exported via NFS.
> > > > > > > 
> > > > > > > For now, the new "sync" parm is left at 0, to preserve the existing
> > > > > > > behavior of the client.
> > > > > > > 
> > > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > > > ---
> > > > > > >  fs/ceph/caps.c | 33 +++++++++++++++++++++++++--------
> > > > > > >  1 file changed, 25 insertions(+), 8 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > > > > index 6e99866b1946..452f5024589f 100644
> > > > > > > --- a/fs/ceph/caps.c
> > > > > > > +++ b/fs/ceph/caps.c
> > > > > > > @@ -991,9 +991,9 @@ struct cap_msg_args {
> > > > > > >       struct ceph_mds_session *session;
> > > > > > >       u64                     ino, cid, follows;
> > > > > > >       u64                     flush_tid, oldest_flush_tid, size, max_size;
> > > > > > > -     u64                     xattr_version;
> > > > > > > +     u64                     xattr_version, change_attr;
> > > > > > >       struct ceph_buffer      *xattr_buf;
> > > > > > > -     struct timespec         atime, mtime, ctime;
> > > > > > > +     struct timespec         atime, mtime, ctime, btime;
> > > > > > >       int                     op, caps, wanted, dirty;
> > > > > > >       u32                     seq, issue_seq, mseq, time_warp_seq;
> > > > > > >       kuid_t                  uid;
> > > > > > > @@ -1026,13 +1026,13 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > > > > > 
> > > > > > >       /* flock buffer size + inline version + inline data size +
> > > > > > >        * osd_epoch_barrier + oldest_flush_tid */
> > > > > > > -     extra_len = 4 + 8 + 4 + 4 + 8;
> > > > > > > +     extra_len = 4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 1;
> > > > > > >       msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc) + extra_len,
> > > > > > >                          GFP_NOFS, false);
> > > > > > >       if (!msg)
> > > > > > >               return -ENOMEM;
> > > > > > > 
> > > > > > > -     msg->hdr.version = cpu_to_le16(6);
> > > > > > > +     msg->hdr.version = cpu_to_le16(9);
> > > > > > >       msg->hdr.tid = cpu_to_le64(arg->flush_tid);
> > > > > > > 
> > > > > > >       fc = msg->front.iov_base;
> > > > > > > @@ -1068,17 +1068,30 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > > > > >       }
> > > > > > > 
> > > > > > >       p = fc + 1;
> > > > > > > -     /* flock buffer size */
> > > > > > > +     /* flock buffer size (version 2) */
> > > > > > >       ceph_encode_32(&p, 0);
> > > > > > > -     /* inline version */
> > > > > > > +     /* inline version (version 4) */
> > > > > > >       ceph_encode_64(&p, arg->inline_data ? 0 : CEPH_INLINE_NONE);
> > > > > > >       /* inline data size */
> > > > > > >       ceph_encode_32(&p, 0);
> > > > > > > -     /* osd_epoch_barrier */
> > > > > > > +     /* osd_epoch_barrier (version 5) */
> > > > > > >       ceph_encode_32(&p, 0);
> > > > > > > -     /* oldest_flush_tid */
> > > > > > > +     /* oldest_flush_tid (version 6) */
> > > > > > >       ceph_encode_64(&p, arg->oldest_flush_tid);
> > > > > > > 
> > > > > > > +     /* caller_uid/caller_gid (version 7) */
> > > > > > > +     ceph_encode_32(&p, (u32)-1);
> > > > > > > +     ceph_encode_32(&p, (u32)-1);
> > > > > > 
> > > > > > A bit of self-review...
> > > > > > 
> > > > > > Not sure if we want to set the above to something else -- maybe 0 or to
> > > > > > current's creds? That may not always make sense though (during e.g.
> > > > > > writeback).
> > > > > > 
> > > > 
> > > > Looking further, I'm not quite sure I understand why we send creds at
> > > > all in cap messages. Can you clarify where that matters?
> > > > 
> > > > The way I look at it, caps are something like a more granular NFS
> > > > delegation or SMB oplock.
> > > > 
> > > > In that light, a cap flush is just the client sending updated attrs for
> > > > the exclusive caps that it has already been granted. Is there a
> > > > situation where we would ever want to refuse that update?
> > > 
> > > A chmod or chown can be done locally if you have excl caps and flushed 
> > > back to the MDS via a caps message.  We need to verify the user has 
> > > permission to make the change.
> > > 
> > 
> > My take is that once the MDS has delegated Ax to the client, then it's
> > effectively trusting the client to handle permissions enforcement
> > correctly. I don't see why we should second guess that.
> > 
> > > > Note that nothing ever checks the return code for _do_cap_update in the
> > > > userland code. If the permissions check fails, then we'll end up
> > > > silently dropping the updated attrs on the floor.
> > > 
> > > Yeah.  This was mainly for expediency... the protocol assumes that flushes
> > > don't fail.  Given that the client does its own permissions check, I
> > > think the way to improve this is to have it prevent the flush in the first
> > > place, so that it's only nefarious clients that are affected (and who
> > > cares if they get confused).  I don't think we have a particularly good
> > > way to tell the client it can't, say, sudo chown 0:0 a file, though.
> > > 
> > 
> > Sorry, I don't quite follow. How would we prevent the flush from a
> > nefarious client (which is not something we can really control)?
> > 
> > In any case...ISTM that the permissions check in _do_cap_update ought to
> > be replaced by a cephx key check. IOW, what we really want to know is
> > whether the client is truly the one to which we delegated the caps. If
> > so, then we sort of have to trust that it's doing the right thing with
> > respect to permissions checking here.
> 
> The capability can say "you are allowed to be uid 1000 or uid 1020." We 
> want to delegate the EXCL caps to the client so that a create + chmod + 
> chown + write can all happen efficiently, but we still need to ensure that 
> the values they set are legal (a permitted uid/gid combo).
> 
> A common example would be user workstations that are allowed access to 
> /home/user and restricted via their mds caps to their uid/gid.  We need to 
> prevent them from doing a 'sudo chown 0:0 foo'...
> 
> 


On what basis do you make such a decision though? For instance, NFS does
root-squashing which is (generally) a per-export+per-client thing.
It sounds like you're saying that ceph has different semantics here?

(cc'ing Greg here)

Also, chown (at least under POSIX) is reserved for superuser only, and
now that I look, I think this check in MDSAuthCaps::is_capable may be
wrong:

      // chown/chgrp
      if (mask & MAY_CHOWN) {
        if (new_uid != caller_uid ||   // you can't chown to someone else
            inode_uid != caller_uid) { // you can't chown from someone else
          continue;
        }
      }

Shouldn't this just be a check for whether the caller_uid is 0 (or
whatever the correct check for the equivalent to the kernel's CAP_CHOWN
would be)?
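
Something like this, in other words (an untested sketch; whether a
bare uid 0 check is the right CAP_CHOWN analogue is exactly what I'm
asking):

      // chown/chgrp
      if (mask & MAY_CHOWN) {
        if (caller_uid != 0) { // caller lacks a CAP_CHOWN equivalent
          continue;
        }
      }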

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 07/10] ceph: update cap message struct version to 9
  2016-11-07 18:39               ` Jeff Layton
@ 2016-11-07 19:15                 ` Sage Weil
  2016-11-07 19:53                 ` Gregory Farnum
  1 sibling, 0 replies; 24+ messages in thread
From: Sage Weil @ 2016-11-07 19:15 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Yan, Zheng, ceph-devel, Ilya Dryomov, Zheng Yan, Gregory Farnum

On Mon, 7 Nov 2016, Jeff Layton wrote:
> On Mon, 2016-11-07 at 14:36 +0000, Sage Weil wrote:
> > On Mon, 7 Nov 2016, Jeff Layton wrote:
> > > On Mon, 2016-11-07 at 14:05 +0000, Sage Weil wrote:
> > > > On Mon, 7 Nov 2016, Jeff Layton wrote:
> > > > > On Mon, 2016-11-07 at 16:43 +0800, Yan, Zheng wrote:
> > > > > > On Fri, Nov 4, 2016 at 8:57 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > > > > 
> > > > > > > On Fri, 2016-11-04 at 07:34 -0400, Jeff Layton wrote:
> > > > > > > > 
> > > > > > > > The userland ceph has MClientCaps at struct version 9. This brings the
> > > > > > > > kernel up to the same version.
> > > > > > > > 
> > > > > > > > With this change, we have to start tracking the btime and change_attr,
> > > > > > > > so that the client can pass back sane values in cap messages. The
> > > > > > > > client doesn't care about the btime at all, so this is just passed
> > > > > > > > around, but the change_attr is used when ceph is exported via NFS.
> > > > > > > > 
> > > > > > > > For now, the new "sync" parm is left at 0, to preserve the existing
> > > > > > > > behavior of the client.
> > > > > > > > 
> > > > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > > > > ---
> > > > > > > >  fs/ceph/caps.c | 33 +++++++++++++++++++++++++--------
> > > > > > > >  1 file changed, 25 insertions(+), 8 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > > > > > index 6e99866b1946..452f5024589f 100644
> > > > > > > > --- a/fs/ceph/caps.c
> > > > > > > > +++ b/fs/ceph/caps.c
> > > > > > > > @@ -991,9 +991,9 @@ struct cap_msg_args {
> > > > > > > >       struct ceph_mds_session *session;
> > > > > > > >       u64                     ino, cid, follows;
> > > > > > > >       u64                     flush_tid, oldest_flush_tid, size, max_size;
> > > > > > > > -     u64                     xattr_version;
> > > > > > > > +     u64                     xattr_version, change_attr;
> > > > > > > >       struct ceph_buffer      *xattr_buf;
> > > > > > > > -     struct timespec         atime, mtime, ctime;
> > > > > > > > +     struct timespec         atime, mtime, ctime, btime;
> > > > > > > >       int                     op, caps, wanted, dirty;
> > > > > > > >       u32                     seq, issue_seq, mseq, time_warp_seq;
> > > > > > > >       kuid_t                  uid;
> > > > > > > > @@ -1026,13 +1026,13 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > > > > > > 
> > > > > > > >       /* flock buffer size + inline version + inline data size +
> > > > > > > >        * osd_epoch_barrier + oldest_flush_tid */
> > > > > > > > -     extra_len = 4 + 8 + 4 + 4 + 8;
> > > > > > > > +     extra_len = 4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 1;
> > > > > > > >       msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc) + extra_len,
> > > > > > > >                          GFP_NOFS, false);
> > > > > > > >       if (!msg)
> > > > > > > >               return -ENOMEM;
> > > > > > > > 
> > > > > > > > -     msg->hdr.version = cpu_to_le16(6);
> > > > > > > > +     msg->hdr.version = cpu_to_le16(9);
> > > > > > > >       msg->hdr.tid = cpu_to_le64(arg->flush_tid);
> > > > > > > > 
> > > > > > > >       fc = msg->front.iov_base;
> > > > > > > > @@ -1068,17 +1068,30 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > > > > > >       }
> > > > > > > > 
> > > > > > > >       p = fc + 1;
> > > > > > > > -     /* flock buffer size */
> > > > > > > > +     /* flock buffer size (version 2) */
> > > > > > > >       ceph_encode_32(&p, 0);
> > > > > > > > -     /* inline version */
> > > > > > > > +     /* inline version (version 4) */
> > > > > > > >       ceph_encode_64(&p, arg->inline_data ? 0 : CEPH_INLINE_NONE);
> > > > > > > >       /* inline data size */
> > > > > > > >       ceph_encode_32(&p, 0);
> > > > > > > > -     /* osd_epoch_barrier */
> > > > > > > > +     /* osd_epoch_barrier (version 5) */
> > > > > > > >       ceph_encode_32(&p, 0);
> > > > > > > > -     /* oldest_flush_tid */
> > > > > > > > +     /* oldest_flush_tid (version 6) */
> > > > > > > >       ceph_encode_64(&p, arg->oldest_flush_tid);
> > > > > > > > 
> > > > > > > > +     /* caller_uid/caller_gid (version 7) */
> > > > > > > > +     ceph_encode_32(&p, (u32)-1);
> > > > > > > > +     ceph_encode_32(&p, (u32)-1);
> > > > > > > 
> > > > > > > A bit of self-review...
> > > > > > > 
> > > > > > > Not sure if we want to set the above to something else -- maybe 0 or to
> > > > > > > current's creds? That may not always make sense though (during e.g.
> > > > > > > writeback).
> > > > > > > 
> > > > > 
> > > > > Looking further, I'm not quite sure I understand why we send creds at
> > > > > all in cap messages. Can you clarify where that matters?
> > > > > 
> > > > > The way I look at it, caps are something like a more granular NFS
> > > > > delegation or SMB oplock.
> > > > > 
> > > > > In that light, a cap flush is just the client sending updated attrs for
> > > > > the exclusive caps that it has already been granted. Is there a
> > > > > situation where we would ever want to refuse that update?
> > > > 
> > > > A chmod or chown can be done locally if you have excl caps and flushed 
> > > > back to the MDS via a caps message.  We need to verify the user has 
> > > > permission to make the change.
> > > > 
> > > 
> > > My take is that once the MDS has delegated Ax to the client, then it's
> > > effectively trusting the client to handle permissions enforcement
> > > correctly. I don't see why we should second guess that.
> > > 
> > > > > Note that nothing ever checks the return code for _do_cap_update in the
> > > > > userland code. If the permissions check fails, then we'll end up
> > > > > silently dropping the updated attrs on the floor.
> > > > 
> > > > Yeah.  This was mainly for expediency... the protocol assumes that flushes
> > > > don't fail.  Given that the client does its own permissions check, I
> > > > think the way to improve this is to have it prevent the flush in the first
> > > > place, so that it's only nefarious clients that are affected (and who
> > > > cares if they get confused).  I don't think we have a particularly good
> > > > way to tell the client it can't, say, sudo chown 0:0 a file, though.
> > > > 
> > > 
> > > Sorry, I don't quite follow. How would we prevent the flush from a
> > > nefarious client (which is not something we can really control)?
> > > 
> > > In any case...ISTM that the permissions check in _do_cap_update ought to
> > > be replaced by a cephx key check. IOW, what we really want to know is
> > > whether the client is truly the one to which we delegated the caps. If
> > > so, then we sort of have to trust that it's doing the right thing with
> > > respect to permissions checking here.
> > 
> > The capability can say "you are allowed to be uid 1000 or uid 1020." We 
> > want to delegate the EXCL caps to the client so that a create + chmod + 
> > chown + write can all happen efficiently, but we still need to ensure that 
> > the values they set are legal (a permitted uid/gid combo).
> > 
> > A common example would be user workstations that are allowed access to 
> > /home/user and restricted via their mds caps to their uid/gid.  We need to 
> > prevent them from doing a 'sudo chown 0:0 foo'...
> > 
> > 
> 
> 
> On what basis do you make such a decision though? For instance, NFS does
> root-squashing which is (generally) a per-export+per-client thing.
> It sounds like you're saying that ceph has different semantics here?

I don't remember all the specifics :(, but I do remember a long discussion 
about whether root_squash made sense and ultimately deciding that the 
semantics were weak enough to not bother implementing.  Instead, we 
explicitly enumerate the uids/gids that are allowed and just match against 
that.
 
> (cc'ing Greg here)
> 
> Also, chown (at least under POSIX) is reserved for superuser only, and
> now that I look, I think this check in MDSAuthCaps::is_capable may be
> wrong:
> 
>       // chown/chgrp
>       if (mask & MAY_CHOWN) {
>         if (new_uid != caller_uid ||   // you can't chown to someone else
>             inode_uid != caller_uid) { // you can't chown from someone else
>           continue;
>         }
>       }
> 
> Shouldn't this just be a check for whether the caller_uid is 0 (or
> whatever the correct check for the equivalent to the kernel's CAP_CHOWN
> would be)?

In the single-user case root_squash would be sufficient, provided that we 
also somehow ensured that whatever uid the new file was assigned was set 
to the "correct" value.  It would have been a kludgey and/or limited 
solution, though.  And that only addresses the UID portion...

Instead, we validate whatever values come back against what is in the cap 
(uid *and* gid) to ensure it is allowed.  Note that the capability 
can take a list of gids, e.g.,

   "allow rw path=/foo uid=1 gids=1,2,3"

This is also laying some of the groundwork for the future world in which
the client *host* gets a capability saying something like "I trust you to
be an honest broker of kerberos identities" and to pass per-user tickets
to the MDS as users come and go from the host, such that a given request
will be validated against some specific user's session (kerberos
identity).  Of course, in such a situation a malicious host could allow
one user to impersonate some other user that is also logged into the host,
but that is still a much stronger model than simply trusting the
host/client completely (ala AUTH_UNIX in NFS-land).  The idea is that
eventually check_caps() would validate against a dynamic "user" session
as well as the client session capability, which would have per-user
context pulled from kerberos or LDAP or whatever.

In any case, the core issue is that it wouldn't be sufficient to, say,
refuse to issue EXCL caps on a file to the client unless we are prepared
to trust any values they send back.  Or I guess it would, but it would
rule out lots of common scenarios where there are huge benefits
(performance-wise) to issuing the EXCL caps.  That's why we have
to validate the contents of the metadata in the cap flush messages...

Does that make sense?

sage

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 07/10] ceph: update cap message struct version to 9
  2016-11-07 18:39               ` Jeff Layton
  2016-11-07 19:15                 ` Sage Weil
@ 2016-11-07 19:53                 ` Gregory Farnum
  2016-11-07 20:09                   ` Sage Weil
  1 sibling, 1 reply; 24+ messages in thread
From: Gregory Farnum @ 2016-11-07 19:53 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Sage Weil, Yan, Zheng, ceph-devel, Ilya Dryomov, Zheng Yan

On Mon, Nov 7, 2016 at 10:39 AM, Jeff Layton <jlayton@redhat.com> wrote:
> On Mon, 2016-11-07 at 14:36 +0000, Sage Weil wrote:
>> On Mon, 7 Nov 2016, Jeff Layton wrote:
>> > On Mon, 2016-11-07 at 14:05 +0000, Sage Weil wrote:
>> > > On Mon, 7 Nov 2016, Jeff Layton wrote:
>> > > > On Mon, 2016-11-07 at 16:43 +0800, Yan, Zheng wrote:
>> > > > > On Fri, Nov 4, 2016 at 8:57 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > > > > >
>> > > > > > On Fri, 2016-11-04 at 07:34 -0400, Jeff Layton wrote:
>> > > > > > >
>> > > > > > > The userland ceph has MClientCaps at struct version 9. This brings the
>> > > > > > > kernel up to the same version.
>> > > > > > >
>> > > > > > > With this change, we have to start tracking the btime and change_attr,
>> > > > > > > so that the client can pass back sane values in cap messages. The
>> > > > > > > client doesn't care about the btime at all, so this is just passed
>> > > > > > > around, but the change_attr is used when ceph is exported via NFS.
>> > > > > > >
>> > > > > > > For now, the new "sync" parm is left at 0, to preserve the existing
>> > > > > > > behavior of the client.
>> > > > > > >
>> > > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> > > > > > > ---
>> > > > > > >  fs/ceph/caps.c | 33 +++++++++++++++++++++++++--------
>> > > > > > >  1 file changed, 25 insertions(+), 8 deletions(-)
>> > > > > > >
>> > > > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> > > > > > > index 6e99866b1946..452f5024589f 100644
>> > > > > > > --- a/fs/ceph/caps.c
>> > > > > > > +++ b/fs/ceph/caps.c
>> > > > > > > @@ -991,9 +991,9 @@ struct cap_msg_args {
>> > > > > > >       struct ceph_mds_session *session;
>> > > > > > >       u64                     ino, cid, follows;
>> > > > > > >       u64                     flush_tid, oldest_flush_tid, size, max_size;
>> > > > > > > -     u64                     xattr_version;
>> > > > > > > +     u64                     xattr_version, change_attr;
>> > > > > > >       struct ceph_buffer      *xattr_buf;
>> > > > > > > -     struct timespec         atime, mtime, ctime;
>> > > > > > > +     struct timespec         atime, mtime, ctime, btime;
>> > > > > > >       int                     op, caps, wanted, dirty;
>> > > > > > >       u32                     seq, issue_seq, mseq, time_warp_seq;
>> > > > > > >       kuid_t                  uid;
>> > > > > > > @@ -1026,13 +1026,13 @@ static int send_cap_msg(struct cap_msg_args *arg)
>> > > > > > >
>> > > > > > >       /* flock buffer size + inline version + inline data size +
>> > > > > > >        * osd_epoch_barrier + oldest_flush_tid */
>> > > > > > > -     extra_len = 4 + 8 + 4 + 4 + 8;
>> > > > > > > +     extra_len = 4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 1;
>> > > > > > >       msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc) + extra_len,
>> > > > > > >                          GFP_NOFS, false);
>> > > > > > >       if (!msg)
>> > > > > > >               return -ENOMEM;
>> > > > > > >
>> > > > > > > -     msg->hdr.version = cpu_to_le16(6);
>> > > > > > > +     msg->hdr.version = cpu_to_le16(9);
>> > > > > > >       msg->hdr.tid = cpu_to_le64(arg->flush_tid);
>> > > > > > >
>> > > > > > >       fc = msg->front.iov_base;
>> > > > > > > @@ -1068,17 +1068,30 @@ static int send_cap_msg(struct cap_msg_args *arg)
>> > > > > > >       }
>> > > > > > >
>> > > > > > >       p = fc + 1;
>> > > > > > > -     /* flock buffer size */
>> > > > > > > +     /* flock buffer size (version 2) */
>> > > > > > >       ceph_encode_32(&p, 0);
>> > > > > > > -     /* inline version */
>> > > > > > > +     /* inline version (version 4) */
>> > > > > > >       ceph_encode_64(&p, arg->inline_data ? 0 : CEPH_INLINE_NONE);
>> > > > > > >       /* inline data size */
>> > > > > > >       ceph_encode_32(&p, 0);
>> > > > > > > -     /* osd_epoch_barrier */
>> > > > > > > +     /* osd_epoch_barrier (version 5) */
>> > > > > > >       ceph_encode_32(&p, 0);
>> > > > > > > -     /* oldest_flush_tid */
>> > > > > > > +     /* oldest_flush_tid (version 6) */
>> > > > > > >       ceph_encode_64(&p, arg->oldest_flush_tid);
>> > > > > > >
>> > > > > > > +     /* caller_uid/caller_gid (version 7) */
>> > > > > > > +     ceph_encode_32(&p, (u32)-1);
>> > > > > > > +     ceph_encode_32(&p, (u32)-1);
>> > > > > >
>> > > > > > A bit of self-review...
>> > > > > >
>> > > > > > Not sure if we want to set the above to something else -- maybe 0 or to
>> > > > > > current's creds? That may not always make sense though (during e.g.
>> > > > > > writeback).
>> > > > > >
>> > > >
>> > > > Looking further, I'm not quite sure I understand why we send creds at
>> > > > all in cap messages. Can you clarify where that matters?
>> > > >
>> > > > The way I look at it, caps are something like a more granular NFS
>> > > > delegation or SMB oplock.
>> > > >
>> > > > In that light, a cap flush is just the client sending updated attrs for
>> > > > the exclusive caps that it has already been granted. Is there a
>> > > > situation where we would ever want to refuse that update?
>> > >
>> > > A chmod or chown can be done locally if you have excl caps and flushed
>> > > back to the MDS via a caps message.  We need to verify the user has
>> > > permission to make the change.
>> > >
>> >
>> > My take is that once the MDS has delegated Ax to the client, then it's
>> > effectively trusting the client to handle permissions enforcement
>> > correctly. I don't see why we should second guess that.
>> >
>> > > > Note that nothing ever checks the return code for _do_cap_update in the
>> > > > userland code. If the permissions check fails, then we'll end up
>> > > > silently dropping the updated attrs on the floor.
>> > >
>> > > Yeah.  This was mainly for expediency... the protocol assumes that flushes
>> > > don't fail.  Given that the client does its own permissions check, I
>> > > think the way to improve this is to have it prevent the flush in the first
>> > > place, so that it's only nefarious clients that are affected (and who
>> > > cares if they get confused).  I don't think we have a particularly good
>> > > way to tell the client it can't, say, sudo chown 0:0 a file, though.
>> > >
>> >
>> > Sorry, I don't quite follow. How would we prevent the flush from a
>> > nefarious client (which is not something we can really control)?
>> >
>> > In any case...ISTM that the permissions check in _do_cap_update ought to
>> > be replaced by a cephx key check. IOW, what we really want to know is
>> > whether the client is truly the one to which we delegated the caps. If
>> > so, then we sort of have to trust that it's doing the right thing with
>> > respect to permissions checking here.
>>
>> The capability can say "you are allowed to be uid 1000 or uid 1020." We
>> want to delegate the EXCL caps to the client so that a create + chmod +
>> chown + write can all happen efficiently, but we still need to ensure that
>> the values they set are legal (a permitted uid/gid combo).
>>
>> A common example would be user workstations that are allowed access to
>> /home/user and restricted via their mds caps to their uid/gid.  We need to
>> prevent them from doing a 'sudo chown 0:0 foo'...
>>
>>
>
>
> On what basis do you make such a decision though? For instance, NFS does
> root-squashing which is (generally) a per-export+per-client thing.
> It sounds like you're saying that ceph has different semantics here?
>
> (cc'ing Greg here)

As Sage says, we definitely avoid the root squash semantics. We
discussed them last year and concluded they were an inappropriate
match for Ceph's permission model.

>
> Also, chown (at least under POSIX) is reserved for superuser only, and
> now that I look, I think this check in MDSAuthCaps::is_capable may be
> wrong:
>
>       // chown/chgrp
>       if (mask & MAY_CHOWN) {
>         if (new_uid != caller_uid ||   // you can't chown to someone else
>             inode_uid != caller_uid) { // you can't chown from someone else
>           continue;
>         }
>       }
>
> Shouldn't this just be a check for whether the caller_uid is 0 (or
> whatever the correct check for the equivalent to the kernel's CAP_CHOWN
> would be)?

Without context, this does look a little weird — does it allow *any*
change, given caller_uid needs to match both new and inode uid?
In the common case, though, the admin cap gets hit toward the
beginning of the function and just allows the change without ever
reaching this point.
-Greg

>
> --
> Jeff Layton <jlayton@redhat.com>
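
To make the question above concrete, here is a minimal standalone model of
the quoted branch (hypothetical names; this is not the actual MDSAuthCaps
code, and the surrounding grant loop is elided, its 'continue' becoming a
false return). It shows that the only chown the branch ever admits is one
where the owner does not actually change:

  #include <stdbool.h>
  #include <stdio.h>
  #include <sys/types.h>

  /* model of the quoted check: all three uids must be equal */
  static bool grant_allows_chown(uid_t inode_uid, uid_t new_uid,
                                 uid_t caller_uid)
  {
          if (new_uid != caller_uid ||    /* can't chown to someone else */
              inode_uid != caller_uid)    /* can't chown from someone else */
                  return false;
          return true;
  }

  int main(void)
  {
          printf("%d\n", grant_allows_chown(1000, 1000, 1000)); /* 1: a no-op */
          printf("%d\n", grant_allows_chown(1000, 0, 0));       /* 0: root denied */
          printf("%d\n", grant_allows_chown(1000, 1020, 1000)); /* 0 */
          return 0;
  }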

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 07/10] ceph: update cap message struct version to 9
  2016-11-07 19:53                 ` Gregory Farnum
@ 2016-11-07 20:09                   ` Sage Weil
  2016-11-07 21:16                     ` Jeff Layton
  0 siblings, 1 reply; 24+ messages in thread
From: Sage Weil @ 2016-11-07 20:09 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Jeff Layton, Yan, Zheng, ceph-devel, Ilya Dryomov, Zheng Yan

On Mon, 7 Nov 2016, Gregory Farnum wrote:
> On Mon, Nov 7, 2016 at 10:39 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > [...]
> >
> > Also, chown (at least under POSIX) is reserved for superuser only, and
> > now that I look, I think this check in MDSAuthCaps::is_capable may be
> > wrong:
> >
> >       // chown/chgrp
> >       if (mask & MAY_CHOWN) {
> >         if (new_uid != caller_uid ||   // you can't chown to someone else
> >             inode_uid != caller_uid) { // you can't chown from someone else
> >           continue;
> >         }
> >       }
> >
> > Shouldn't this just be a check for whether the caller_uid is 0 (or
> > whatever the correct check for the equivalent to the kernel's CAP_CHOWN
> > would be)?

Oops, I skipped over this part ^
 
> Without context, this does look a little weird — does it allow *any*
> change, given caller_uid needs to match both new and inode uid?
> In the common case, though, the admin cap gets hit toward the
> beginning of the function and just allows the change without ever
> reaching this point.

Yeah, the check is a bit weird.  It looks like

1- A normal cap that specifies a uid can't ever change the uid.  This 
conditional could be simplified/clarified...

2- If you have a pair of caps, like

  allow * uid=1, allow * uid=2

we still don't let you chown between uid 1 and 2.  Well, not as caller_uid 
1 or 2 (which is fine), but

3- Jeff is right, we don't allow root to chown between allowed uids.  
Like if you had

  allow * uid=0

shouldn't that let you chown anything?  I didn't really consider this 
case since most users would just do

  allow *

which can do anything (including chown).  But probably the 'allow * uid=0' 
case should be handled properly.

sage
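
A sketch of a branch covering all three cases (hypothetical names, not the
actual MDSAuthCaps code; GRANT_UID_ANY stands in for a grant with no uid
restriction, i.e. a plain 'allow *'):

  #include <assert.h>
  #include <stdbool.h>
  #include <sys/types.h>

  #define GRANT_UID_ANY ((uid_t)-1)   /* hypothetical: no uid pin on the grant */

  static bool grant_allows_chown(uid_t grant_uid, uid_t caller_uid)
  {
          if (grant_uid == GRANT_UID_ANY)            /* 'allow *' */
                  return true;
          return grant_uid == 0 && caller_uid == 0;  /* 'allow * uid=0' */
  }

  int main(void)
  {
          assert(grant_allows_chown(GRANT_UID_ANY, 1000)); /* allow *: anything */
          assert(!grant_allows_chown(1, 1));   /* case 1: pinned uid, never */
          assert(!grant_allows_chown(2, 2));   /* case 2: other grant of the pair */
          assert(grant_allows_chown(0, 0));    /* case 3: allow * uid=0 as root */
          return 0;
  }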

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 07/10] ceph: update cap message struct version to 9
  2016-11-07 20:09                   ` Sage Weil
@ 2016-11-07 21:16                     ` Jeff Layton
  0 siblings, 0 replies; 24+ messages in thread
From: Jeff Layton @ 2016-11-07 21:16 UTC (permalink / raw)
  To: Sage Weil, Gregory Farnum; +Cc: Yan, Zheng, ceph-devel, Ilya Dryomov, Zheng Yan

On Mon, 2016-11-07 at 20:09 +0000, Sage Weil wrote:
> [...]
>
> Yeah, the check is a bit weird.  It looks like
> 
> 1- A normal cap that specifies a uid can't ever change the uid.  This 
> conditional could be simplified/clarified...
> 
> 2- If you have a pair of caps, like
> 
>   allow * uid=1, allow * uid=2
> 
> we still don't let you chown between uid 1 and 2.  Well, not as caller_uid 
> 1 or 2 (which is fine), but
> 
> 3- Jeff is right, we don't allow root to chown between allowed uids.  
> Like if you had
> 
>   allow * uid=0
> 
> shouldn't that let you chown anything?  I didn't really consider this 
> case since most users would just do
> 
>   allow *
> 
> which can do anything (including chown).  But probably the 'allow * uid=0' 
> case should be handled properly.
> 
> sage

It still seems to me like that should just be a check for superuser
status. Something like:

      // chown/chgrp
      if (mask & MAY_CHOWN) {
        // only root can chown
        if (i->match.uid != 0 || caller_uid != 0)
          continue;
      }

i.e. only allow chown if the capability has a uid of 0 and the
caller_uid is also 0.

I don't think we want to ever grant an unprivileged user the ability to
chown, do we?

-- 
Jeff Layton <jlayton@redhat.com>
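
One practical note alongside this: an MDS-side caller_uid check can only
test what the client encodes, and the kernel hunk earlier in the thread
currently sends (u32)-1 for both fields. A sketch of encoding real creds
instead (an assumption, not part of the series; as noted earlier in the
thread, writeback has no meaningful 'current' to draw creds from):

	/* caller_uid/caller_gid (version 7): send the flushing task's
	 * fs creds rather than (u32)-1, so a check like the one proposed
	 * above has real values to test */
	ceph_encode_32(&p, from_kuid(&init_user_ns, current_fsuid()));
	ceph_encode_32(&p, from_kgid(&init_user_ns, current_fsgid()));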

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2016-11-07 21:16 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-04 11:34 [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 01/10] ceph: fix minor typo in unsafe_request_wait Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 02/10] ceph: move xattr initialzation before the encoding past the ceph_mds_caps Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 03/10] ceph: initialize i_version to 0 in new ceph inodes Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 04/10] ceph: save off btime and change_attr when we get an InodeStat Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 05/10] ceph: handle btime and change_attr updates in cap messages Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 06/10] ceph: define new argument structure for send_cap_msg Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 07/10] ceph: update cap message struct version to 9 Jeff Layton
2016-11-04 12:57   ` Jeff Layton
2016-11-07  8:43     ` Yan, Zheng
2016-11-07 11:21       ` Jeff Layton
2016-11-07 14:05         ` Sage Weil
2016-11-07 14:22           ` Jeff Layton
2016-11-07 14:36             ` Sage Weil
2016-11-07 18:39               ` Jeff Layton
2016-11-07 19:15                 ` Sage Weil
2016-11-07 19:53                 ` Gregory Farnum
2016-11-07 20:09                   ` Sage Weil
2016-11-07 21:16                     ` Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 08/10] ceph: add sync parameter to send_cap_msg Jeff Layton
2016-11-07  8:32   ` Yan, Zheng
2016-11-07 10:51     ` Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 09/10] ceph: plumb "sync" parameter into __send_cap Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 10/10] ceph: turn on btime and change_attr support Jeff Layton
