linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs
@ 2019-11-14 10:57 Luis Henriques
  2019-11-14 10:57 ` [RFC PATCH v2 1/4] ceph: add support for TYPE_MSGR2 address decode Luis Henriques
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Luis Henriques @ 2019-11-14 10:57 UTC (permalink / raw)
  To: Jeff Layton, Sage Weil, Ilya Dryomov, Yan, Zheng
  Cc: ceph-devel, linux-kernel, Luis Henriques

Hi!

So, after the feedback I got from v1 [1] I've sent out a pull-request
for the OSDs [2] which encodes require_osd_release into the OSDMap
client data.  This allows the client to figure out which ceph release
the OSDs cluster is running and decide whether or not it's safe to use
the copy-from Op for copy_file_range.

This new patchset I'm sending simply adds enough functionality to the
kernel client so that it can take advantage of this OSD patch:

0001 - adds the ability to decode TYPE_MSGR2 addresses.  This is a
       required functionality for enabling SERVER_NAUTILUS in the
       client.  I hope I got the new format right, as I couldn't figure
       out what the hard-coded values (see comments) really mean.

0002 - allows the client to retrieve the new require_osd_release field
       from the OSDMap if available.  This patch also adds SERVER_MIMIC,
       SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
       which TBH I'm not sure if that's a safe thing to do -- the only
       issue I've seen was that Nautilus requires the ability to decode
       TYPE_MSGR2 address, but I may have missed others.

0003 - debug code to add require_osd_release to the osdmap debugfs file.

0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
       if the OSDs are >= Octopus.

Also note that, as suggested by Ilya, I've dropped the patch that would
change the default mount options to 'copyfrom'.

These patches have been tested with the xfstests generic test suite, and
with a couple of other (local) tests that exercise the cephfs
copy_file_range syscall.  I didn't saw any issues, but as I said above,
I'm not really sure if adding the SERVER_* flags to the supported
features have other side effects.

[1] https://lore.kernel.org/lkml/20191108141555.31176-1-lhenriques@suse.com/
[2] https://github.com/ceph/ceph/pull/31611

Cheers,
--
Luis

Luis Henriques (4):
  ceph: add support for TYPE_MSGR2 address decode
  ceph: get the require_osd_release field from the osdmap
  ceph: add require_osd_release field to osdmap debugfs
  ceph: add support for sending truncate_{seq,size} in 'copy-from' Op

 fs/ceph/file.c                     | 10 +++++++-
 include/linux/ceph/ceph_features.h | 10 ++++++--
 include/linux/ceph/decode.h        |  3 ++-
 include/linux/ceph/osd_client.h    |  1 +
 include/linux/ceph/osdmap.h        |  1 +
 include/linux/ceph/rados.h         | 23 ++++++++++++++++++
 net/ceph/ceph_strings.c            | 38 ++++++++++++++++++++++++++++++
 net/ceph/debugfs.c                 |  2 ++
 net/ceph/decode.c                  | 33 ++++++++++++++++++++++++--
 net/ceph/osd_client.c              |  7 +++++-
 net/ceph/osdmap.c                  | 21 +++++++++++++++++
 11 files changed, 142 insertions(+), 7 deletions(-)


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 1/4] ceph: add support for TYPE_MSGR2 address decode
  2019-11-14 10:57 [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs Luis Henriques
@ 2019-11-14 10:57 ` Luis Henriques
  2019-11-14 12:18   ` Jeff Layton
  2019-11-14 10:57 ` [RFC PATCH v2 2/4] ceph: get the require_osd_release field from the osdmap Luis Henriques
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Luis Henriques @ 2019-11-14 10:57 UTC (permalink / raw)
  To: Jeff Layton, Sage Weil, Ilya Dryomov, Yan, Zheng
  Cc: ceph-devel, linux-kernel, Luis Henriques

The new format actually includes two addresses: one the new messenger v2,
and other for the legacy v1, which is the only one currently understood
by kernel clients.  Add code to pick the legacy address and ignore the v2
one.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
---
 include/linux/ceph/decode.h |  3 ++-
 net/ceph/decode.c           | 33 +++++++++++++++++++++++++++++++--
 2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/include/linux/ceph/decode.h b/include/linux/ceph/decode.h
index 450384fe487c..2a2f07dfb39c 100644
--- a/include/linux/ceph/decode.h
+++ b/include/linux/ceph/decode.h
@@ -219,7 +219,8 @@ static inline void ceph_encode_timespec64(struct ceph_timespec *tv,
  * sockaddr_storage <-> ceph_sockaddr
  */
 #define CEPH_ENTITY_ADDR_TYPE_NONE	0
-#define CEPH_ENTITY_ADDR_TYPE_LEGACY	__cpu_to_le32(1)
+#define CEPH_ENTITY_ADDR_TYPE_LEGACY	__cpu_to_le32(1) /* legacy msgr1 */
+#define CEPH_ENTITY_ADDR_TYPE_MSGR2	__cpu_to_le32(2) /* msgr2 protocol */
 
 static inline void ceph_encode_banner_addr(struct ceph_entity_addr *a)
 {
diff --git a/net/ceph/decode.c b/net/ceph/decode.c
index eea529595a7a..613a2bc6f805 100644
--- a/net/ceph/decode.c
+++ b/net/ceph/decode.c
@@ -67,16 +67,45 @@ ceph_decode_entity_addr_legacy(void **p, void *end,
 	return ret;
 }
 
+static int
+ceph_decode_entity_addr_versioned_msgr2(void **p, void *end,
+					struct ceph_entity_addr *addr)
+{
+	struct ceph_entity_addr tmp_addr;
+	struct ceph_entity_addr *paddr = addr;
+	int ret = -EINVAL;
+
+	ceph_decode_skip_32(p, end, bad); /* hard-coded '2' */
+	ceph_decode_skip_8(p, end, bad);  /* hard-coded '1' */
+
+	ret = ceph_decode_entity_addr_versioned(p, end, paddr);
+	if (ret)
+		goto bad;
+	/* If we already have a v1 address, simply skip over the other address */
+	if (paddr->type == CEPH_ENTITY_ADDR_TYPE_LEGACY)
+		paddr = &tmp_addr;
+
+	ceph_decode_skip_8(p, end, bad);  /* hard-coded '1' */
+
+	ret = ceph_decode_entity_addr_versioned(p, end, paddr);
+
+bad:
+	return ret;
+}
+
 int
 ceph_decode_entity_addr(void **p, void *end, struct ceph_entity_addr *addr)
 {
 	u8 marker;
 
 	ceph_decode_8_safe(p, end, marker, bad);
-	if (marker == 1)
+	if (marker == CEPH_ENTITY_ADDR_TYPE_MSGR2)
+		return ceph_decode_entity_addr_versioned_msgr2(p, end, addr);
+	else if (marker == CEPH_ENTITY_ADDR_TYPE_LEGACY)
 		return ceph_decode_entity_addr_versioned(p, end, addr);
-	else if (marker == 0)
+	else if (marker == CEPH_ENTITY_ADDR_TYPE_NONE)
 		return ceph_decode_entity_addr_legacy(p, end, addr);
+
 bad:
 	return -EINVAL;
 }

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 2/4] ceph: get the require_osd_release field from the osdmap
  2019-11-14 10:57 [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs Luis Henriques
  2019-11-14 10:57 ` [RFC PATCH v2 1/4] ceph: add support for TYPE_MSGR2 address decode Luis Henriques
@ 2019-11-14 10:57 ` Luis Henriques
  2019-11-14 13:00   ` Jeff Layton
  2019-11-14 10:57 ` [RFC PATCH v2 3/4] ceph: add require_osd_release field to osdmap debugfs Luis Henriques
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Luis Henriques @ 2019-11-14 10:57 UTC (permalink / raw)
  To: Jeff Layton, Sage Weil, Ilya Dryomov, Yan, Zheng
  Cc: ceph-devel, linux-kernel, Luis Henriques

Since Ceph Octopus, OSDs are encoding require_osd_release into the client
data part of the osdmap.  This patch adds code to pick this extra field.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
---
 include/linux/ceph/ceph_features.h | 10 ++++++++--
 include/linux/ceph/osdmap.h        |  1 +
 net/ceph/osdmap.c                  | 21 +++++++++++++++++++++
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
index 39e6f4c57580..f329d1907dd7 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -9,6 +9,7 @@
  */
 #define CEPH_FEATURE_INCARNATION_1 (0ull)
 #define CEPH_FEATURE_INCARNATION_2 (1ull<<57) // CEPH_FEATURE_SERVER_JEWEL
+#define CEPH_FEATURE_INCARNATION_3 ((1ull<<57)|(1ull<<28)) // SERVER_MIMIC
 
 #define DEFINE_CEPH_FEATURE(bit, incarnation, name)			\
 	static const uint64_t CEPH_FEATURE_##name = (1ULL<<bit);		\
@@ -76,6 +77,7 @@ DEFINE_CEPH_FEATURE( 0, 1, UID)
 DEFINE_CEPH_FEATURE( 1, 1, NOSRCADDR)
 DEFINE_CEPH_FEATURE_RETIRED( 2, 1, MONCLOCKCHECK, JEWEL, LUMINOUS)
 
+DEFINE_CEPH_FEATURE( 2, 3, SERVER_NAUTILUS)
 DEFINE_CEPH_FEATURE( 3, 1, FLOCK)
 DEFINE_CEPH_FEATURE( 4, 1, SUBSCRIBE2)
 DEFINE_CEPH_FEATURE( 5, 1, MONNAMES)
@@ -92,6 +94,7 @@ DEFINE_CEPH_FEATURE(14, 2, SERVER_KRAKEN)
 DEFINE_CEPH_FEATURE(15, 1, MONENC)
 DEFINE_CEPH_FEATURE_RETIRED(16, 1, QUERY_T, JEWEL, LUMINOUS)
 
+DEFINE_CEPH_FEATURE(16, 3, SERVER_OCTOPUS)
 DEFINE_CEPH_FEATURE_RETIRED(17, 1, INDEP_PG_MAP, JEWEL, LUMINOUS)
 
 DEFINE_CEPH_FEATURE(18, 1, CRUSH_TUNABLES)
@@ -114,7 +117,7 @@ DEFINE_CEPH_FEATURE(25, 1, CRUSH_TUNABLES2)
 DEFINE_CEPH_FEATURE(26, 1, CREATEPOOLID)
 DEFINE_CEPH_FEATURE(27, 1, REPLY_CREATE_INODE)
 DEFINE_CEPH_FEATURE_RETIRED(28, 1, OSD_HBMSGS, HAMMER, JEWEL)
-DEFINE_CEPH_FEATURE(28, 2, SERVER_M)
+DEFINE_CEPH_FEATURE(28, 2, SERVER_MIMIC)
 DEFINE_CEPH_FEATURE(29, 1, MDSENC)
 DEFINE_CEPH_FEATURE(30, 1, OSDHASHPSPOOL)
 DEFINE_CEPH_FEATURE(31, 1, MON_SINGLE_PAXOS)  // deprecate me
@@ -212,7 +215,10 @@ DEFINE_CEPH_FEATURE_DEPRECATED(63, 1, RESERVED_BROKEN, LUMINOUS) // client-facin
 	 CEPH_FEATURE_CRUSH_TUNABLES5 |		\
 	 CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING |	\
 	 CEPH_FEATURE_MSG_ADDR2 |		\
-	 CEPH_FEATURE_CEPHX_V2)
+	 CEPH_FEATURE_CEPHX_V2 |		\
+	 CEPH_FEATURE_SERVER_MIMIC |		\
+	 CEPH_FEATURE_SERVER_NAUTILUS |		\
+	 CEPH_FEATURE_SERVER_OCTOPUS)
 
 #define CEPH_FEATURES_REQUIRED_DEFAULT	0
 
diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
index e081b56f1c1d..0d8e7f5e3478 100644
--- a/include/linux/ceph/osdmap.h
+++ b/include/linux/ceph/osdmap.h
@@ -160,6 +160,7 @@ struct ceph_osdmap {
 
 	u32 flags;         /* CEPH_OSDMAP_* */
 
+	u8 require_osd_release;
 	u32 max_osd;       /* size of osd_state, _offload, _addr arrays */
 	u32 *osd_state;    /* CEPH_OSD_* */
 	u32 *osd_weight;   /* 0 = failed, 0x10000 = 100% normal */
diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 4e0de14f80bb..29526fd61983 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -1582,6 +1582,27 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
 		WARN_ON(!RB_EMPTY_ROOT(&map->pg_upmap_items));
 	}
 
+	if (struct_v >= 6)
+		/* crush version */
+		ceph_decode_skip_32(p, end, e_inval);
+	if (struct_v >= 7) {
+		/*
+		 * skip removed_snaps and purged_snaps
+		 * (snap_interval_set_t = 8 + 8)
+		 */
+		ceph_decode_skip_set(p, end, 16, e_inval);
+		ceph_decode_skip_set(p, end, 16, e_inval);
+	}
+	if (struct_v >= 9) {
+		struct ceph_timespec ts;
+
+		/* last_up_change and last_in_change */
+		ceph_decode_copy_safe(p, end, &ts, sizeof(ts), e_inval);
+		ceph_decode_copy_safe(p, end, &ts, sizeof(ts), e_inval);
+	}
+	if (struct_v >= 10)
+		ceph_decode_8_safe(p, end, map->require_osd_release, e_inval);
+
 	/* ignore the rest */
 	*p = end;
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 3/4] ceph: add require_osd_release field to osdmap debugfs
  2019-11-14 10:57 [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs Luis Henriques
  2019-11-14 10:57 ` [RFC PATCH v2 1/4] ceph: add support for TYPE_MSGR2 address decode Luis Henriques
  2019-11-14 10:57 ` [RFC PATCH v2 2/4] ceph: get the require_osd_release field from the osdmap Luis Henriques
@ 2019-11-14 10:57 ` Luis Henriques
  2019-11-14 10:57 ` [RFC PATCH v2 4/4] ceph: add support for sending truncate_{seq,size} in 'copy-from' Op Luis Henriques
  2019-11-14 13:15 ` [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs Jeff Layton
  4 siblings, 0 replies; 14+ messages in thread
From: Luis Henriques @ 2019-11-14 10:57 UTC (permalink / raw)
  To: Jeff Layton, Sage Weil, Ilya Dryomov, Yan, Zheng
  Cc: ceph-devel, linux-kernel, Luis Henriques

Add the require_osd_release information to debugfs.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
---
 include/linux/ceph/rados.h | 22 ++++++++++++++++++++++
 net/ceph/ceph_strings.c    | 38 ++++++++++++++++++++++++++++++++++++++
 net/ceph/debugfs.c         |  2 ++
 3 files changed, 62 insertions(+)

diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
index 3eb0e55665b4..68bc65f971b4 100644
--- a/include/linux/ceph/rados.h
+++ b/include/linux/ceph/rados.h
@@ -164,6 +164,28 @@ extern const char *ceph_osd_state_name(int s);
 #define CEPH_OSDMAP_REQUIRE_LUMINOUS (1<<18) /* require l for booting osds */
 #define CEPH_OSDMAP_RECOVERY_DELETES (1<<19) /* deletes performed during recovery instead of peering */
 
+/*
+ * major ceph release numbers
+ */
+#define CEPH_RELEASE_ARGONAUT	1
+#define CEPH_RELEASE_BOBTAIL	2
+#define CEPH_RELEASE_CUTTLEFISH	3
+#define CEPH_RELEASE_DUMPLING	4
+#define CEPH_RELEASE_EMPEROR	5
+#define CEPH_RELEASE_FIREFLY	6
+#define CEPH_RELEASE_GIANT	7
+#define CEPH_RELEASE_HAMMER	8
+#define CEPH_RELEASE_INFERNALIS	9
+#define CEPH_RELEASE_JEWEL	10
+#define CEPH_RELEASE_KRAKEN	11
+#define CEPH_RELEASE_LUMINOUS	12
+#define CEPH_RELEASE_MIMIC	13
+#define CEPH_RELEASE_NAUTILUS	14
+#define CEPH_RELEASE_OCTOPUS	15
+#define CEPH_RELEASE_MAX	16 /* highest + 1 */
+
+extern const char *ceph_release_name(int r);
+
 /*
  * The error code to return when an OSD can't handle a write
  * because it is too large.
diff --git a/net/ceph/ceph_strings.c b/net/ceph/ceph_strings.c
index 10e01494993c..3f280f17bbcb 100644
--- a/net/ceph/ceph_strings.c
+++ b/net/ceph/ceph_strings.c
@@ -60,3 +60,41 @@ const char *ceph_osd_state_name(int s)
 		return "???";
 	}
 }
+
+const char *ceph_release_name(int r)
+{
+	switch (r) {
+	case CEPH_RELEASE_ARGONAUT:
+		return "argonaut";
+	case CEPH_RELEASE_BOBTAIL:
+		return "bobtail";
+	case CEPH_RELEASE_CUTTLEFISH:
+		return "cuttlefish";
+	case CEPH_RELEASE_DUMPLING:
+		return "dumpling";
+	case CEPH_RELEASE_EMPEROR:
+		return "emperor";
+	case CEPH_RELEASE_FIREFLY:
+		return "firefly";
+	case CEPH_RELEASE_GIANT:
+		return "giant";
+	case CEPH_RELEASE_HAMMER:
+		return "hammer";
+	case CEPH_RELEASE_INFERNALIS:
+		return "infernalis";
+	case CEPH_RELEASE_JEWEL:
+		return "jewel";
+	case CEPH_RELEASE_KRAKEN:
+		return "kraken";
+	case CEPH_RELEASE_LUMINOUS:
+		return "luminous";
+	case CEPH_RELEASE_MIMIC:
+		return "mimic";
+	case CEPH_RELEASE_NAUTILUS:
+		return "nautilus";
+	case CEPH_RELEASE_OCTOPUS:
+		return "octopus";
+	default:
+		return "unknown";
+	}
+}
diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
index 7cb992e55475..d42071f6ab57 100644
--- a/net/ceph/debugfs.c
+++ b/net/ceph/debugfs.c
@@ -65,6 +65,8 @@ static int osdmap_show(struct seq_file *s, void *p)
 	down_read(&osdc->lock);
 	seq_printf(s, "epoch %u barrier %u flags 0x%x\n", map->epoch,
 			osdc->epoch_barrier, map->flags);
+	seq_printf(s, "require_osd_release: %s\n",
+		   ceph_release_name(map->require_osd_release));
 
 	for (n = rb_first(&map->pg_pools); n; n = rb_next(n)) {
 		struct ceph_pg_pool_info *pi =

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH v2 4/4] ceph: add support for sending truncate_{seq,size} in 'copy-from' Op
  2019-11-14 10:57 [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs Luis Henriques
                   ` (2 preceding siblings ...)
  2019-11-14 10:57 ` [RFC PATCH v2 3/4] ceph: add require_osd_release field to osdmap debugfs Luis Henriques
@ 2019-11-14 10:57 ` Luis Henriques
  2019-11-14 13:15 ` [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs Jeff Layton
  4 siblings, 0 replies; 14+ messages in thread
From: Luis Henriques @ 2019-11-14 10:57 UTC (permalink / raw)
  To: Jeff Layton, Sage Weil, Ilya Dryomov, Yan, Zheng
  Cc: ceph-devel, linux-kernel, Luis Henriques

Doing an object copy in Ceph will result in not only the data being
copied but also the truncate_seq value.  This may make sense in generic
RADOS object copies, but for the specific case of performing a file copy
will result in data corruption in the destination file.

In order to fix this, the 'copy-from' operation had to be modified so
that it could receive the two extra parameters for the destination
object truncate_seq and truncate_size.  This patch adds support for
these extra parameters to the kernel client.  Unfortunately, this
operation modification is available in Ceph Octopus only, so it is
necessary to ensure that the OSD doing the copy does indeed support this
feature.

Link: https://tracker.ceph.com/issues/37378
Signed-off-by: Luis Henriques <lhenriques@suse.com>
---
 fs/ceph/file.c                  | 10 +++++++++-
 include/linux/ceph/osd_client.h |  1 +
 include/linux/ceph/rados.h      |  1 +
 net/ceph/osd_client.c           |  7 ++++++-
 4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index bd77adb64bfd..f45bb3837a31 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1928,6 +1928,7 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 	struct ceph_fs_client *src_fsc = ceph_inode_to_client(src_inode);
 	struct ceph_object_locator src_oloc, dst_oloc;
 	struct ceph_object_id src_oid, dst_oid;
+	struct ceph_osdmap *map = src_fsc->client->osdc.osdmap;
 	loff_t endoff = 0, size;
 	ssize_t ret = -EIO;
 	u64 src_objnum, dst_objnum, src_objoff, dst_objoff;
@@ -1958,6 +1959,11 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 
 	if (ceph_test_mount_opt(src_fsc, NOCOPYFROM))
 		return -EOPNOTSUPP;
+	if (map->require_osd_release < CEPH_RELEASE_OCTOPUS) {
+		pr_warn_once("copy_file_range not supported in '%s' release\n",
+			     ceph_release_name(map->require_osd_release));
+		return -EOPNOTSUPP;
+	}
 
 	/*
 	 * Striped file layouts require that we copy partial objects, but the
@@ -2086,7 +2092,9 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
 			CEPH_OSD_OP_FLAG_FADVISE_NOCACHE,
 			&dst_oid, &dst_oloc,
 			CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL |
-			CEPH_OSD_OP_FLAG_FADVISE_DONTNEED, 0);
+			CEPH_OSD_OP_FLAG_FADVISE_DONTNEED,
+			dst_ci->i_truncate_seq, dst_ci->i_truncate_size,
+			CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ);
 		if (err) {
 			dout("ceph_osdc_copy_from returned %d\n", err);
 			if (!ret)
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index eaffbdddf89a..5a62dbd3f4c2 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -534,6 +534,7 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
 			struct ceph_object_id *dst_oid,
 			struct ceph_object_locator *dst_oloc,
 			u32 dst_fadvise_flags,
+			u32 truncate_seq, u64 truncate_size,
 			u8 copy_from_flags);
 
 /* watch/notify */
diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
index 68bc65f971b4..318da211bb79 100644
--- a/include/linux/ceph/rados.h
+++ b/include/linux/ceph/rados.h
@@ -468,6 +468,7 @@ enum {
 	CEPH_OSD_COPY_FROM_FLAG_MAP_SNAP_CLONE = 8, /* map snap direct to
 						     * cloneid */
 	CEPH_OSD_COPY_FROM_FLAG_RWORDERED = 16,     /* order with write */
+	CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ = 32,  /* send truncate_{seq,size} */
 };
 
 enum {
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index ba45b074a362..02abf2790e99 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -5315,6 +5315,7 @@ static int osd_req_op_copy_from_init(struct ceph_osd_request *req,
 				     struct ceph_object_locator *src_oloc,
 				     u32 src_fadvise_flags,
 				     u32 dst_fadvise_flags,
+				     u32 truncate_seq, u64 truncate_size,
 				     u8 copy_from_flags)
 {
 	struct ceph_osd_req_op *op;
@@ -5335,6 +5336,8 @@ static int osd_req_op_copy_from_init(struct ceph_osd_request *req,
 	end = p + PAGE_SIZE;
 	ceph_encode_string(&p, end, src_oid->name, src_oid->name_len);
 	encode_oloc(&p, end, src_oloc);
+	ceph_encode_32(&p, truncate_seq);
+	ceph_encode_64(&p, truncate_size);
 	op->indata_len = PAGE_SIZE - (end - p);
 
 	ceph_osd_data_pages_init(&op->copy_from.osd_data, pages,
@@ -5350,6 +5353,7 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
 			struct ceph_object_id *dst_oid,
 			struct ceph_object_locator *dst_oloc,
 			u32 dst_fadvise_flags,
+			u32 truncate_seq, u64 truncate_size,
 			u8 copy_from_flags)
 {
 	struct ceph_osd_request *req;
@@ -5366,7 +5370,8 @@ int ceph_osdc_copy_from(struct ceph_osd_client *osdc,
 
 	ret = osd_req_op_copy_from_init(req, src_snapid, src_version, src_oid,
 					src_oloc, src_fadvise_flags,
-					dst_fadvise_flags, copy_from_flags);
+					dst_fadvise_flags, truncate_seq,
+					truncate_size, copy_from_flags);
 	if (ret)
 		goto out;
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 1/4] ceph: add support for TYPE_MSGR2 address decode
  2019-11-14 10:57 ` [RFC PATCH v2 1/4] ceph: add support for TYPE_MSGR2 address decode Luis Henriques
@ 2019-11-14 12:18   ` Jeff Layton
  0 siblings, 0 replies; 14+ messages in thread
From: Jeff Layton @ 2019-11-14 12:18 UTC (permalink / raw)
  To: Luis Henriques, Sage Weil, Ilya Dryomov, Yan, Zheng
  Cc: ceph-devel, linux-kernel

On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> The new format actually includes two addresses: one the new messenger v2,
> and other for the legacy v1, which is the only one currently understood
> by kernel clients.  Add code to pick the legacy address and ignore the v2
> one.
> 
> Signed-off-by: Luis Henriques <lhenriques@suse.com>
> ---
>  include/linux/ceph/decode.h |  3 ++-
>  net/ceph/decode.c           | 33 +++++++++++++++++++++++++++++++--
>  2 files changed, 33 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/ceph/decode.h b/include/linux/ceph/decode.h
> index 450384fe487c..2a2f07dfb39c 100644
> --- a/include/linux/ceph/decode.h
> +++ b/include/linux/ceph/decode.h
> @@ -219,7 +219,8 @@ static inline void ceph_encode_timespec64(struct ceph_timespec *tv,
>   * sockaddr_storage <-> ceph_sockaddr
>   */
>  #define CEPH_ENTITY_ADDR_TYPE_NONE	0
> -#define CEPH_ENTITY_ADDR_TYPE_LEGACY	__cpu_to_le32(1)
> +#define CEPH_ENTITY_ADDR_TYPE_LEGACY	__cpu_to_le32(1) /* legacy msgr1 */
> +#define CEPH_ENTITY_ADDR_TYPE_MSGR2	__cpu_to_le32(2) /* msgr2 protocol */
>  
>  static inline void ceph_encode_banner_addr(struct ceph_entity_addr *a)
>  {
> diff --git a/net/ceph/decode.c b/net/ceph/decode.c
> index eea529595a7a..613a2bc6f805 100644
> --- a/net/ceph/decode.c
> +++ b/net/ceph/decode.c
> @@ -67,16 +67,45 @@ ceph_decode_entity_addr_legacy(void **p, void *end,
>  	return ret;
>  }
>  
> +static int
> +ceph_decode_entity_addr_versioned_msgr2(void **p, void *end,
> +					struct ceph_entity_addr *addr)
> +{
> +	struct ceph_entity_addr tmp_addr;
> +	struct ceph_entity_addr *paddr = addr;
> +	int ret = -EINVAL;
> +
> +	ceph_decode_skip_32(p, end, bad); /* hard-coded '2' */
> +	ceph_decode_skip_8(p, end, bad);  /* hard-coded '1' */
> +
> +	ret = ceph_decode_entity_addr_versioned(p, end, paddr);
> +	if (ret)
> +		goto bad;
> +	/* If we already have a v1 address, simply skip over the other address */
> +	if (paddr->type == CEPH_ENTITY_ADDR_TYPE_LEGACY)
> +		paddr = &tmp_addr;
> +
> +	ceph_decode_skip_8(p, end, bad);  /* hard-coded '1' */
> +
> +	ret = ceph_decode_entity_addr_versioned(p, end, paddr);
> +
> +bad:
> +	return ret;
> +}
> +
>  int
>  ceph_decode_entity_addr(void **p, void *end, struct ceph_entity_addr *addr)
>  {
>  	u8 marker;
>  
>  	ceph_decode_8_safe(p, end, marker, bad);
> -	if (marker == 1)
> +	if (marker == CEPH_ENTITY_ADDR_TYPE_MSGR2)
> +		return ceph_decode_entity_addr_versioned_msgr2(p, end, addr);
> +	else if (marker == CEPH_ENTITY_ADDR_TYPE_LEGACY)
>  		return ceph_decode_entity_addr_versioned(p, end, addr);
> -	else if (marker == 0)
> +	else if (marker == CEPH_ENTITY_ADDR_TYPE_NONE)
>  		return ceph_decode_entity_addr_legacy(p, end, addr);

You're decoding a byte into "marker" and then comparing that to a __le32
value. They almost certainly won't match correctly on a BE arch.

> +
>  bad:
>  	return -EINVAL;
>  }

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 2/4] ceph: get the require_osd_release field from the osdmap
  2019-11-14 10:57 ` [RFC PATCH v2 2/4] ceph: get the require_osd_release field from the osdmap Luis Henriques
@ 2019-11-14 13:00   ` Jeff Layton
  0 siblings, 0 replies; 14+ messages in thread
From: Jeff Layton @ 2019-11-14 13:00 UTC (permalink / raw)
  To: Luis Henriques, Sage Weil, Ilya Dryomov, Yan, Zheng
  Cc: ceph-devel, linux-kernel

On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> Since Ceph Octopus, OSDs are encoding require_osd_release into the client
> data part of the osdmap.  This patch adds code to pick this extra field.
> 
> Signed-off-by: Luis Henriques <lhenriques@suse.com>
> ---
>  include/linux/ceph/ceph_features.h | 10 ++++++++--
>  include/linux/ceph/osdmap.h        |  1 +
>  net/ceph/osdmap.c                  | 21 +++++++++++++++++++++
>  3 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
> index 39e6f4c57580..f329d1907dd7 100644
> --- a/include/linux/ceph/ceph_features.h
> +++ b/include/linux/ceph/ceph_features.h
> @@ -9,6 +9,7 @@
>   */
>  #define CEPH_FEATURE_INCARNATION_1 (0ull)
>  #define CEPH_FEATURE_INCARNATION_2 (1ull<<57) // CEPH_FEATURE_SERVER_JEWEL
> +#define CEPH_FEATURE_INCARNATION_3 ((1ull<<57)|(1ull<<28)) // SERVER_MIMIC
>  
>  #define DEFINE_CEPH_FEATURE(bit, incarnation, name)			\
>  	static const uint64_t CEPH_FEATURE_##name = (1ULL<<bit);		\
> @@ -76,6 +77,7 @@ DEFINE_CEPH_FEATURE( 0, 1, UID)
>  DEFINE_CEPH_FEATURE( 1, 1, NOSRCADDR)
>  DEFINE_CEPH_FEATURE_RETIRED( 2, 1, MONCLOCKCHECK, JEWEL, LUMINOUS)
>  
> +DEFINE_CEPH_FEATURE( 2, 3, SERVER_NAUTILUS)
>  DEFINE_CEPH_FEATURE( 3, 1, FLOCK)
>  DEFINE_CEPH_FEATURE( 4, 1, SUBSCRIBE2)
>  DEFINE_CEPH_FEATURE( 5, 1, MONNAMES)
> @@ -92,6 +94,7 @@ DEFINE_CEPH_FEATURE(14, 2, SERVER_KRAKEN)
>  DEFINE_CEPH_FEATURE(15, 1, MONENC)
>  DEFINE_CEPH_FEATURE_RETIRED(16, 1, QUERY_T, JEWEL, LUMINOUS)
>  
> +DEFINE_CEPH_FEATURE(16, 3, SERVER_OCTOPUS)
>  DEFINE_CEPH_FEATURE_RETIRED(17, 1, INDEP_PG_MAP, JEWEL, LUMINOUS)
>  
>  DEFINE_CEPH_FEATURE(18, 1, CRUSH_TUNABLES)
> @@ -114,7 +117,7 @@ DEFINE_CEPH_FEATURE(25, 1, CRUSH_TUNABLES2)
>  DEFINE_CEPH_FEATURE(26, 1, CREATEPOOLID)
>  DEFINE_CEPH_FEATURE(27, 1, REPLY_CREATE_INODE)
>  DEFINE_CEPH_FEATURE_RETIRED(28, 1, OSD_HBMSGS, HAMMER, JEWEL)
> -DEFINE_CEPH_FEATURE(28, 2, SERVER_M)
> +DEFINE_CEPH_FEATURE(28, 2, SERVER_MIMIC)
>  DEFINE_CEPH_FEATURE(29, 1, MDSENC)
>  DEFINE_CEPH_FEATURE(30, 1, OSDHASHPSPOOL)
>  DEFINE_CEPH_FEATURE(31, 1, MON_SINGLE_PAXOS)  // deprecate me
> @@ -212,7 +215,10 @@ DEFINE_CEPH_FEATURE_DEPRECATED(63, 1, RESERVED_BROKEN, LUMINOUS) // client-facin
>  	 CEPH_FEATURE_CRUSH_TUNABLES5 |		\
>  	 CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING |	\
>  	 CEPH_FEATURE_MSG_ADDR2 |		\
> -	 CEPH_FEATURE_CEPHX_V2)
> +	 CEPH_FEATURE_CEPHX_V2 |		\
> +	 CEPH_FEATURE_SERVER_MIMIC |		\
> +	 CEPH_FEATURE_SERVER_NAUTILUS |		\
> +	 CEPH_FEATURE_SERVER_OCTOPUS)
>  

As you mentioned in the cover letter, that doesn't seem likely to be
safe to just enable like this. I'm pretty sure the kclient is missing
some things that are encompassed by these bits.

Unfortunately none of that seems to be clearly documented anywhere, so
you're probably stuck walking through the ceph tree to see why the
server daemons are checking these flags.

>  #define CEPH_FEATURES_REQUIRED_DEFAULT	0
>  
> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
> index e081b56f1c1d..0d8e7f5e3478 100644
> --- a/include/linux/ceph/osdmap.h
> +++ b/include/linux/ceph/osdmap.h
> @@ -160,6 +160,7 @@ struct ceph_osdmap {
>  
>  	u32 flags;         /* CEPH_OSDMAP_* */
>  
> +	u8 require_osd_release;
>  	u32 max_osd;       /* size of osd_state, _offload, _addr arrays */
>  	u32 *osd_state;    /* CEPH_OSD_* */
>  	u32 *osd_weight;   /* 0 = failed, 0x10000 = 100% normal */
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index 4e0de14f80bb..29526fd61983 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -1582,6 +1582,27 @@ static int osdmap_decode(void **p, void *end, struct ceph_osdmap *map)
>  		WARN_ON(!RB_EMPTY_ROOT(&map->pg_upmap_items));
>  	}
>  
> +	if (struct_v >= 6)
> +		/* crush version */
> +		ceph_decode_skip_32(p, end, e_inval);
> +	if (struct_v >= 7) {
> +		/*
> +		 * skip removed_snaps and purged_snaps
> +		 * (snap_interval_set_t = 8 + 8)
> +		 */
> +		ceph_decode_skip_set(p, end, 16, e_inval);
> +		ceph_decode_skip_set(p, end, 16, e_inval);
> +	}
> +	if (struct_v >= 9) {
> +		struct ceph_timespec ts;
> +
> +		/* last_up_change and last_in_change */
> +		ceph_decode_copy_safe(p, end, &ts, sizeof(ts), e_inval);
> +		ceph_decode_copy_safe(p, end, &ts, sizeof(ts), e_inval);
> +	}
> +	if (struct_v >= 10)
> +		ceph_decode_8_safe(p, end, map->require_osd_release, e_inval);
> +
>  	/* ignore the rest */
>  	*p = end;
>  

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs
  2019-11-14 10:57 [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs Luis Henriques
                   ` (3 preceding siblings ...)
  2019-11-14 10:57 ` [RFC PATCH v2 4/4] ceph: add support for sending truncate_{seq,size} in 'copy-from' Op Luis Henriques
@ 2019-11-14 13:15 ` Jeff Layton
  2019-11-14 13:28   ` Sage Weil
  4 siblings, 1 reply; 14+ messages in thread
From: Jeff Layton @ 2019-11-14 13:15 UTC (permalink / raw)
  To: Luis Henriques, Sage Weil, Ilya Dryomov, Yan, Zheng
  Cc: ceph-devel, linux-kernel

On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> Hi!
> 
> So, after the feedback I got from v1 [1] I've sent out a pull-request
> for the OSDs [2] which encodes require_osd_release into the OSDMap
> client data.  This allows the client to figure out which ceph release
> the OSDs cluster is running and decide whether or not it's safe to use
> the copy-from Op for copy_file_range.
> 
> This new patchset I'm sending simply adds enough functionality to the
> kernel client so that it can take advantage of this OSD patch:
> 
> 0001 - adds the ability to decode TYPE_MSGR2 addresses.  This is a
>        required functionality for enabling SERVER_NAUTILUS in the
>        client.  I hope I got the new format right, as I couldn't figure
>        out what the hard-coded values (see comments) really mean.
> 

nit: the first 3 patch subject lines should probably be prefixed with
"libceph:"

> 0002 - allows the client to retrieve the new require_osd_release field
>        from the OSDMap if available.  This patch also adds SERVER_MIMIC,
>        SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
>        which TBH I'm not sure if that's a safe thing to do -- the only
>        issue I've seen was that Nautilus requires the ability to decode
>        TYPE_MSGR2 address, but I may have missed others.
> 

Yes, this needs to be done with care. We have to ensure that the server
side isn't assuming that the client supports something that it doesn't.
I think that means just trawling through the code and verifying whether
this is safe.

> 0003 - debug code to add require_osd_release to the osdmap debugfs file.
> 
> 0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
>        if the OSDs are >= Octopus.
> 
> Also note that, as suggested by Ilya, I've dropped the patch that would
> change the default mount options to 'copyfrom'.
> 
> These patches have been tested with the xfstests generic test suite, and
> with a couple of other (local) tests that exercise the cephfs
> copy_file_range syscall.  I didn't saw any issues, but as I said above,
> I'm not really sure if adding the SERVER_* flags to the supported
> features have other side effects.
> 
> [1] https://lore.kernel.org/lkml/20191108141555.31176-1-lhenriques@suse.com/
> [2] https://github.com/ceph/ceph/pull/31611
> 

I'm just getting caught up on the discussion here, but why was it
decided to do it this way instead of just adding a new OSD
"copy-from-no-truncseq" operation? Once you tried it once and an OSD
didn't support it, you could just give up on using it any longer? That
seems a lot simpler than trying to monkey with feature bits.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs
  2019-11-14 13:15 ` [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs Jeff Layton
@ 2019-11-14 13:28   ` Sage Weil
  2019-11-14 14:13     ` Ilya Dryomov
  2019-11-14 18:28     ` Gregory Farnum
  0 siblings, 2 replies; 14+ messages in thread
From: Sage Weil @ 2019-11-14 13:28 UTC (permalink / raw)
  To: Jeff Layton, gfarnum
  Cc: Luis Henriques, Ilya Dryomov, Yan, Zheng, ceph-devel, linux-kernel

On Thu, 14 Nov 2019, Jeff Layton wrote:
> On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> > Hi!
> > 
> > So, after the feedback I got from v1 [1] I've sent out a pull-request
> > for the OSDs [2] which encodes require_osd_release into the OSDMap
> > client data.  This allows the client to figure out which ceph release
> > the OSDs cluster is running and decide whether or not it's safe to use
> > the copy-from Op for copy_file_range.
> > 
> > This new patchset I'm sending simply adds enough functionality to the
> > kernel client so that it can take advantage of this OSD patch:
> > 
> > 0001 - adds the ability to decode TYPE_MSGR2 addresses.  This is a
> >        required functionality for enabling SERVER_NAUTILUS in the
> >        client.  I hope I got the new format right, as I couldn't figure
> >        out what the hard-coded values (see comments) really mean.
> > 
> 
> nit: the first 3 patch subject lines should probably be prefixed with
> "libceph:"
> 
> > 0002 - allows the client to retrieve the new require_osd_release field
> >        from the OSDMap if available.  This patch also adds SERVER_MIMIC,
> >        SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
> >        which TBH I'm not sure if that's a safe thing to do -- the only
> >        issue I've seen was that Nautilus requires the ability to decode
> >        TYPE_MSGR2 address, but I may have missed others.
> > 
> 
> Yes, this needs to be done with care. We have to ensure that the server
> side isn't assuming that the client supports something that it doesn't.
> I think that means just trawling through the code and verifying whether
> this is safe.
> 
> > 0003 - debug code to add require_osd_release to the osdmap debugfs file.
> > 
> > 0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
> >        if the OSDs are >= Octopus.
> > 
> > Also note that, as suggested by Ilya, I've dropped the patch that would
> > change the default mount options to 'copyfrom'.
> > 
> > These patches have been tested with the xfstests generic test suite, and
> > with a couple of other (local) tests that exercise the cephfs
> > copy_file_range syscall.  I didn't saw any issues, but as I said above,
> > I'm not really sure if adding the SERVER_* flags to the supported
> > features have other side effects.
> > 
> > [1] https://lore.kernel.org/lkml/20191108141555.31176-1-lhenriques@suse.com/
> > [2] https://github.com/ceph/ceph/pull/31611
> > 
> 
> I'm just getting caught up on the discussion here, but why was it
> decided to do it this way instead of just adding a new OSD
> "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> didn't support it, you could just give up on using it any longer? That
> seems a lot simpler than trying to monkey with feature bits.

I don't remember the original discussion either, but in retrospect that 
does seem much simpler--especially since hte client is conditioning 
sending this based on the the require_osd_release.  It seems like passing 
a flag to the copy-from op would be more reasonable instead of conditional 
feature-based behavior.

Greg, do you remember why we ended up here?

sage


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs
  2019-11-14 13:28   ` Sage Weil
@ 2019-11-14 14:13     ` Ilya Dryomov
  2019-11-14 14:17       ` Sage Weil
  2019-11-14 18:28     ` Gregory Farnum
  1 sibling, 1 reply; 14+ messages in thread
From: Ilya Dryomov @ 2019-11-14 14:13 UTC (permalink / raw)
  To: Sage Weil
  Cc: Jeff Layton, Gregory Farnum, Luis Henriques, Yan, Zheng,
	Ceph Development, LKML

On Thu, Nov 14, 2019 at 2:28 PM Sage Weil <sweil@redhat.com> wrote:
>
> On Thu, 14 Nov 2019, Jeff Layton wrote:
> > On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> > > Hi!
> > >
> > > So, after the feedback I got from v1 [1] I've sent out a pull-request
> > > for the OSDs [2] which encodes require_osd_release into the OSDMap
> > > client data.  This allows the client to figure out which ceph release
> > > the OSDs cluster is running and decide whether or not it's safe to use
> > > the copy-from Op for copy_file_range.
> > >
> > > This new patchset I'm sending simply adds enough functionality to the
> > > kernel client so that it can take advantage of this OSD patch:
> > >
> > > 0001 - adds the ability to decode TYPE_MSGR2 addresses.  This is a
> > >        required functionality for enabling SERVER_NAUTILUS in the
> > >        client.  I hope I got the new format right, as I couldn't figure
> > >        out what the hard-coded values (see comments) really mean.
> > >
> >
> > nit: the first 3 patch subject lines should probably be prefixed with
> > "libceph:"
> >
> > > 0002 - allows the client to retrieve the new require_osd_release field
> > >        from the OSDMap if available.  This patch also adds SERVER_MIMIC,
> > >        SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
> > >        which TBH I'm not sure if that's a safe thing to do -- the only
> > >        issue I've seen was that Nautilus requires the ability to decode
> > >        TYPE_MSGR2 address, but I may have missed others.
> > >
> >
> > Yes, this needs to be done with care. We have to ensure that the server
> > side isn't assuming that the client supports something that it doesn't.
> > I think that means just trawling through the code and verifying whether
> > this is safe.
> >
> > > 0003 - debug code to add require_osd_release to the osdmap debugfs file.
> > >
> > > 0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
> > >        if the OSDs are >= Octopus.
> > >
> > > Also note that, as suggested by Ilya, I've dropped the patch that would
> > > change the default mount options to 'copyfrom'.
> > >
> > > These patches have been tested with the xfstests generic test suite, and
> > > with a couple of other (local) tests that exercise the cephfs
> > > copy_file_range syscall.  I didn't saw any issues, but as I said above,
> > > I'm not really sure if adding the SERVER_* flags to the supported
> > > features have other side effects.
> > >
> > > [1] https://lore.kernel.org/lkml/20191108141555.31176-1-lhenriques@suse.com/
> > > [2] https://github.com/ceph/ceph/pull/31611
> > >
> >
> > I'm just getting caught up on the discussion here, but why was it
> > decided to do it this way instead of just adding a new OSD
> > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > didn't support it, you could just give up on using it any longer? That
> > seems a lot simpler than trying to monkey with feature bits.
>
> I don't remember the original discussion either, but in retrospect that
> does seem much simpler--especially since hte client is conditioning
> sending this based on the the require_osd_release.  It seems like passing
> a flag to the copy-from op would be more reasonable instead of conditional
> feature-based behavior.

Yeah, I suggested adding require_osd_release to the client portion just
because we are running into it more and more: Objecter relies on it for
RESEND_ON_SPLIT for example.  It needs to be accessible so that patches
like that can be carried over to the kernel client without workarounds.

copy-from in its existing form is another example.  AFAIU the problem
is that copy-from op doesn't reject unknown flags.  Luis added a flag
in https://github.com/ceph/ceph/pull/25374, but it is simply ignored on
nautilus and older releases, potentially leading to data corruption.

Adding a new op that would be an alias for CEPH_OSD_OP_COPY_FROM with
CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ like Jeff is suggesting, or a new
copy-from2 op that would behave just like copy-from, but reject unknown
flags to avoid similar compatibility issues in the future is probably
the best thing we can do from the client perspective.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs
  2019-11-14 14:13     ` Ilya Dryomov
@ 2019-11-14 14:17       ` Sage Weil
  2019-11-14 15:24         ` Luis Henriques
  0 siblings, 1 reply; 14+ messages in thread
From: Sage Weil @ 2019-11-14 14:17 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Jeff Layton, Gregory Farnum, Luis Henriques, Yan, Zheng,
	Ceph Development, LKML

On Thu, 14 Nov 2019, Ilya Dryomov wrote:
> > > I'm just getting caught up on the discussion here, but why was it
> > > decided to do it this way instead of just adding a new OSD
> > > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > > didn't support it, you could just give up on using it any longer? That
> > > seems a lot simpler than trying to monkey with feature bits.
> >
> > I don't remember the original discussion either, but in retrospect that
> > does seem much simpler--especially since hte client is conditioning
> > sending this based on the the require_osd_release.  It seems like passing
> > a flag to the copy-from op would be more reasonable instead of conditional
> > feature-based behavior.
> 
> Yeah, I suggested adding require_osd_release to the client portion just
> because we are running into it more and more: Objecter relies on it for
> RESEND_ON_SPLIT for example.  It needs to be accessible so that patches
> like that can be carried over to the kernel client without workarounds.
> 
> copy-from in its existing form is another example.  AFAIU the problem
> is that copy-from op doesn't reject unknown flags.  Luis added a flag
> in https://github.com/ceph/ceph/pull/25374, but it is simply ignored on
> nautilus and older releases, potentially leading to data corruption.
> 
> Adding a new op that would be an alias for CEPH_OSD_OP_COPY_FROM with
> CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ like Jeff is suggesting, or a new
> copy-from2 op that would behave just like copy-from, but reject unknown
> flags to avoid similar compatibility issues in the future is probably
> the best thing we can do from the client perspective.

Yeah, I think copy-from2 is the best path.  I think that means we should 
revert what we merged to ceph.git a few weeks back, Luis!  Sorry we didn't 
put all the pieces together before...

sage


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs
  2019-11-14 14:17       ` Sage Weil
@ 2019-11-14 15:24         ` Luis Henriques
  2019-11-14 15:47           ` Ilya Dryomov
  0 siblings, 1 reply; 14+ messages in thread
From: Luis Henriques @ 2019-11-14 15:24 UTC (permalink / raw)
  To: Sage Weil
  Cc: Ilya Dryomov, Jeff Layton, Gregory Farnum, Yan, Zheng,
	Ceph Development, LKML

On Thu, Nov 14, 2019 at 02:17:44PM +0000, Sage Weil wrote:
> On Thu, 14 Nov 2019, Ilya Dryomov wrote:
> > > > I'm just getting caught up on the discussion here, but why was it
> > > > decided to do it this way instead of just adding a new OSD
> > > > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > > > didn't support it, you could just give up on using it any longer? That
> > > > seems a lot simpler than trying to monkey with feature bits.
> > >
> > > I don't remember the original discussion either, but in retrospect that
> > > does seem much simpler--especially since hte client is conditioning
> > > sending this based on the the require_osd_release.  It seems like passing
> > > a flag to the copy-from op would be more reasonable instead of conditional
> > > feature-based behavior.
> > 
> > Yeah, I suggested adding require_osd_release to the client portion just
> > because we are running into it more and more: Objecter relies on it for
> > RESEND_ON_SPLIT for example.  It needs to be accessible so that patches
> > like that can be carried over to the kernel client without workarounds.
> > 
> > copy-from in its existing form is another example.  AFAIU the problem
> > is that copy-from op doesn't reject unknown flags.  Luis added a flag
> > in https://github.com/ceph/ceph/pull/25374, but it is simply ignored on
> > nautilus and older releases, potentially leading to data corruption.
> > 
> > Adding a new op that would be an alias for CEPH_OSD_OP_COPY_FROM with
> > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ like Jeff is suggesting, or a new
> > copy-from2 op that would behave just like copy-from, but reject unknown
> > flags to avoid similar compatibility issues in the future is probably
> > the best thing we can do from the client perspective.
> 
> Yeah, I think copy-from2 is the best path.  I think that means we should 
> revert what we merged to ceph.git a few weeks back, Luis!  Sorry we didn't 
> put all the pieces together before...

Well, that's an unexpected turn.  I'm not disagreeing with that decision
but since my initial pull request was done one year ago (almost to the
day!), it's a bit disappointing to see that in the end I'm back to
square one :-)

I guess that the PR I mentioned in the cover letter can also be dropped,
as it's not really usable by the kernel client (at least not until it
fully supports all the features up to SERVER_OCTOPUS).

Anyway, thanks everyone for the review.

Cheers,
--
Luís

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs
  2019-11-14 15:24         ` Luis Henriques
@ 2019-11-14 15:47           ` Ilya Dryomov
  0 siblings, 0 replies; 14+ messages in thread
From: Ilya Dryomov @ 2019-11-14 15:47 UTC (permalink / raw)
  To: Luis Henriques
  Cc: Sage Weil, Jeff Layton, Gregory Farnum, Yan, Zheng,
	Ceph Development, LKML

On Thu, Nov 14, 2019 at 4:24 PM Luis Henriques <lhenriques@suse.com> wrote:
>
> On Thu, Nov 14, 2019 at 02:17:44PM +0000, Sage Weil wrote:
> > On Thu, 14 Nov 2019, Ilya Dryomov wrote:
> > > > > I'm just getting caught up on the discussion here, but why was it
> > > > > decided to do it this way instead of just adding a new OSD
> > > > > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > > > > didn't support it, you could just give up on using it any longer? That
> > > > > seems a lot simpler than trying to monkey with feature bits.
> > > >
> > > > I don't remember the original discussion either, but in retrospect that
> > > > does seem much simpler--especially since hte client is conditioning
> > > > sending this based on the the require_osd_release.  It seems like passing
> > > > a flag to the copy-from op would be more reasonable instead of conditional
> > > > feature-based behavior.
> > >
> > > Yeah, I suggested adding require_osd_release to the client portion just
> > > because we are running into it more and more: Objecter relies on it for
> > > RESEND_ON_SPLIT for example.  It needs to be accessible so that patches
> > > like that can be carried over to the kernel client without workarounds.
> > >
> > > copy-from in its existing form is another example.  AFAIU the problem
> > > is that copy-from op doesn't reject unknown flags.  Luis added a flag
> > > in https://github.com/ceph/ceph/pull/25374, but it is simply ignored on
> > > nautilus and older releases, potentially leading to data corruption.
> > >
> > > Adding a new op that would be an alias for CEPH_OSD_OP_COPY_FROM with
> > > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ like Jeff is suggesting, or a new
> > > copy-from2 op that would behave just like copy-from, but reject unknown
> > > flags to avoid similar compatibility issues in the future is probably
> > > the best thing we can do from the client perspective.
> >
> > Yeah, I think copy-from2 is the best path.  I think that means we should
> > revert what we merged to ceph.git a few weeks back, Luis!  Sorry we didn't
> > put all the pieces together before...
>
> Well, that's an unexpected turn.  I'm not disagreeing with that decision
> but since my initial pull request was done one year ago (almost to the
> day!), it's a bit disappointing to see that in the end I'm back to
> square one :-)

Well, I think literally every line from that PR will still go in, just
wrapped in a new OSD op.  Backwards compatibility is hard...

>
> I guess that the PR I mentioned in the cover letter can also be dropped,
> as it's not really usable by the kernel client (at least not until it
> fully supports all the features up to SERVER_OCTOPUS).

No, some form of https://github.com/ceph/ceph/pull/31611 should go in.
I'm pretty certain it will come up at some point in the future, even if
the new field isn't immediately usable today.  Someone porting a change
to the kernel client a couple years from now will thank you for it ;)

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs
  2019-11-14 13:28   ` Sage Weil
  2019-11-14 14:13     ` Ilya Dryomov
@ 2019-11-14 18:28     ` Gregory Farnum
  1 sibling, 0 replies; 14+ messages in thread
From: Gregory Farnum @ 2019-11-14 18:28 UTC (permalink / raw)
  To: Sage Weil
  Cc: Jeff Layton, Luis Henriques, Ilya Dryomov, Yan, Zheng,
	ceph-devel, linux-kernel

On Thu, Nov 14, 2019 at 5:28 AM Sage Weil <sweil@redhat.com> wrote:
>
> On Thu, 14 Nov 2019, Jeff Layton wrote:
> > On Thu, 2019-11-14 at 10:57 +0000, Luis Henriques wrote:
> > > Hi!
> > >
> > > So, after the feedback I got from v1 [1] I've sent out a pull-request
> > > for the OSDs [2] which encodes require_osd_release into the OSDMap
> > > client data.  This allows the client to figure out which ceph release
> > > the OSDs cluster is running and decide whether or not it's safe to use
> > > the copy-from Op for copy_file_range.
> > >
> > > This new patchset I'm sending simply adds enough functionality to the
> > > kernel client so that it can take advantage of this OSD patch:
> > >
> > > 0001 - adds the ability to decode TYPE_MSGR2 addresses.  This is a
> > >        required functionality for enabling SERVER_NAUTILUS in the
> > >        client.  I hope I got the new format right, as I couldn't figure
> > >        out what the hard-coded values (see comments) really mean.
> > >
> >
> > nit: the first 3 patch subject lines should probably be prefixed with
> > "libceph:"
> >
> > > 0002 - allows the client to retrieve the new require_osd_release field
> > >        from the OSDMap if available.  This patch also adds SERVER_MIMIC,
> > >        SERVER_NAUTILUS and SERVER_OCTOPUS to the supported features,
> > >        which TBH I'm not sure if that's a safe thing to do -- the only
> > >        issue I've seen was that Nautilus requires the ability to decode
> > >        TYPE_MSGR2 address, but I may have missed others.
> > >
> >
> > Yes, this needs to be done with care. We have to ensure that the server
> > side isn't assuming that the client supports something that it doesn't.
> > I think that means just trawling through the code and verifying whether
> > this is safe.
> >
> > > 0003 - debug code to add require_osd_release to the osdmap debugfs file.
> > >
> > > 0004 - adds the truncate_{seq,size} fields to the 'copy-from' operation
> > >        if the OSDs are >= Octopus.
> > >
> > > Also note that, as suggested by Ilya, I've dropped the patch that would
> > > change the default mount options to 'copyfrom'.
> > >
> > > These patches have been tested with the xfstests generic test suite, and
> > > with a couple of other (local) tests that exercise the cephfs
> > > copy_file_range syscall.  I didn't saw any issues, but as I said above,
> > > I'm not really sure if adding the SERVER_* flags to the supported
> > > features have other side effects.
> > >
> > > [1] https://lore.kernel.org/lkml/20191108141555.31176-1-lhenriques@suse.com/
> > > [2] https://github.com/ceph/ceph/pull/31611
> > >
> >
> > I'm just getting caught up on the discussion here, but why was it
> > decided to do it this way instead of just adding a new OSD
> > "copy-from-no-truncseq" operation? Once you tried it once and an OSD
> > didn't support it, you could just give up on using it any longer? That
> > seems a lot simpler than trying to monkey with feature bits.
>
> I don't remember the original discussion either, but in retrospect that
> does seem much simpler--especially since hte client is conditioning
> sending this based on the the require_osd_release.  It seems like passing
> a flag to the copy-from op would be more reasonable instead of conditional
> feature-based behavior.
>
> Greg, do you remember why we ended up here?

Well, I can look it up. We discussed it being a different op in
February ("OSD 'copy-from' operation and truncate_seq value") and...it
looks like that conversation ended with that being the plan?

I can't see why it changed in the making though, and everyone seems to
have forgotten about it at the next pass.
-Greg

>
> sage
>


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-11-14 18:28 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-14 10:57 [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs Luis Henriques
2019-11-14 10:57 ` [RFC PATCH v2 1/4] ceph: add support for TYPE_MSGR2 address decode Luis Henriques
2019-11-14 12:18   ` Jeff Layton
2019-11-14 10:57 ` [RFC PATCH v2 2/4] ceph: get the require_osd_release field from the osdmap Luis Henriques
2019-11-14 13:00   ` Jeff Layton
2019-11-14 10:57 ` [RFC PATCH v2 3/4] ceph: add require_osd_release field to osdmap debugfs Luis Henriques
2019-11-14 10:57 ` [RFC PATCH v2 4/4] ceph: add support for sending truncate_{seq,size} in 'copy-from' Op Luis Henriques
2019-11-14 13:15 ` [RFC PATCH v2 0/4] ceph: safely use 'copy-from' Op on Octopus OSDs Jeff Layton
2019-11-14 13:28   ` Sage Weil
2019-11-14 14:13     ` Ilya Dryomov
2019-11-14 14:17       ` Sage Weil
2019-11-14 15:24         ` Luis Henriques
2019-11-14 15:47           ` Ilya Dryomov
2019-11-14 18:28     ` Gregory Farnum

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).