* [PATCH 00/16] libceph: messenger: send/recv data at one go
@ 2020-04-21 13:18 Roman Penyaev
  2020-04-21 13:18 ` [PATCH 01/16] libceph: remove unused ceph_pagelist_cursor Roman Penyaev
                   ` (16 more replies)
  0 siblings, 17 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

Hi folks,

While experimenting with the messenger code in userspace [1] I noticed
that send and receive socket calls always operate on 4k, even if the
bvec length is larger (for example when the bvec is constructed from a
bio, where multi-page bvecs are used for big IOs). This is an attempt
to speed up send and receive for large IO.

The first 3 patches are cleanups. I remove unused code and get rid of
the ceph_osd_data structure. I found that ceph_osd_data duplicates
ceph_msg_data, and a unified API seems better for such similar things.

In the following patches ceph_msg_data_cursor is switched to iov_iter,
which seems more suitable for this kind of work (where we basically do
socket IO). This makes it possible to use the whole iov_iter for
sendmsg() and recvmsg() calls instead of iterating page by page. The
sendpage() call also benefits from this, because if a bvec is
constructed from a multi-page, we can now 0-copy the whole bvec in one
go.
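
To illustrate the idea, here is a minimal sketch (not the exact code
from this series; the helper name and surrounding details are made up)
of pushing everything an iov_iter covers with a single sendmsg() call:

  #include <linux/net.h>
  #include <linux/socket.h>
  #include <linux/uio.h>

  /* Illustrative only: push everything covered by @iter at once. */
  static int send_iter_at_once(struct socket *sock, struct iov_iter *iter,
                               bool more)
  {
          struct msghdr msg = {
                  .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL |
                               (more ? MSG_MORE : 0),
                  .msg_iter  = *iter,   /* may cover many pages at once */
          };

          /* the caller advances the cursor by the returned byte count */
          return sock_sendmsg(sock, &msg);
  }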

I also allowed myself to get rid of the ->last_piece and ->need_crc
members and the ceph_msg_data_next() call. Now CRC is calculated not on
a per-page basis but according to the size of the processed chunk. I
found ceph_msg_data_next() a bit redundant, since we can always set the
next cursor chunk on cursor init or on advance.
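
The per-chunk CRC update is then conceptually as simple as the sketch
below (assuming the kernel crc32c() helper; the function name here is
illustrative, not the one used in the series):

  #include <linux/crc32c.h>

  /*
   * Illustrative only: one CRC update per processed chunk, whatever
   * its size, instead of one update per 4k page.
   */
  static u32 data_crc_update(u32 crc, const void *chunk, size_t len)
  {
          return crc32c(crc, chunk, len);
  }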

How did I test the performance? I used an rbd fio load against 1 OSD
in memory with the following fio configuration:

  direct=1
  time_based=1
  runtime=10
  ioengine=io_uring
  size=256m

  rw=rand{read|write}
  numjobs=32
  iodepth=32

  [job1]
  filename=/dev/rbd0

The RBD device is mapped with the 'nocrc' option set.  For writes the
OSD completes requests immediately without touching memory, simulating
a null block device; that's why write throughput in my results is much
higher than for reads.

I tested on the loopback interface only, in a VM, and have not yet set
up a cluster on real machines, so sendpage() on a big multi-page indeed
shows good results, as expected. But I found an interesting comment in
drivers/infiniband/sw/siw/siw_qp_tx.c:siw_tcp_sendpages(), which
says:

 "Using sendpage to push page by page appears to be less efficient
  than using sendmsg, even if data are copied.
 
  A general performance limitation might be the extra four bytes
  trailer checksum segment to be pushed after user data."

I could not prove or disprove this, since I have tested on the
loopback interface only.  So it might be that sendmsg() in one go is
faster than sendpage() for bvecs with many segments.

Here is the output of the rbd fio load for various block sizes:

=== WRITE ===

current master, rw=randwrite, numjobs=32 iodepth=32

  4k  IOPS=92.7k, BW=362MiB/s, Lat=11033.30usec
  8k  IOPS=85.6k, BW=669MiB/s, Lat=11956.74usec
 16k  IOPS=76.8k, BW=1200MiB/s, Lat=13318.24usec
 32k  IOPS=56.7k, BW=1770MiB/s, Lat=18056.92usec
 64k  IOPS=34.0k, BW=2186MiB/s, Lat=29.23msec
128k  IOPS=21.8k, BW=2720MiB/s, Lat=46.96msec
256k  IOPS=14.4k, BW=3596MiB/s, Lat=71.03msec
512k  IOPS=8726, BW=4363MiB/s, Lat=116.34msec
  1m  IOPS=4799, BW=4799MiB/s, Lat=211.15msec

this patchset,  rw=randwrite, numjobs=32 iodepth=32

  4k  IOPS=94.7k, BW=370MiB/s, Lat=10802.43usec
  8k  IOPS=91.2k, BW=712MiB/s, Lat=11221.00usec
 16k  IOPS=80.4k, BW=1257MiB/s, Lat=12715.56usec
 32k  IOPS=61.2k, BW=1912MiB/s, Lat=16721.33usec
 64k  IOPS=40.9k, BW=2554MiB/s, Lat=24993.31usec
128k  IOPS=25.7k, BW=3216MiB/s, Lat=39.72msec
256k  IOPS=17.3k, BW=4318MiB/s, Lat=59.15msec
512k  IOPS=11.1k, BW=5559MiB/s, Lat=91.39msec
  1m  IOPS=6696, BW=6696MiB/s, Lat=151.25msec


=== READ ===

current master, rw=randread, numjobs=32 iodepth=32

  4k  IOPS=62.5k, BW=244MiB/s, Lat=16.38msec
  8k  IOPS=55.5k, BW=433MiB/s, Lat=18.44msec
 16k  IOPS=40.6k, BW=635MiB/s, Lat=25.18msec
 32k  IOPS=24.6k, BW=768MiB/s, Lat=41.61msec
 64k  IOPS=14.8k, BW=925MiB/s, Lat=69.06msec
128k  IOPS=8687, BW=1086MiB/s, Lat=117.59msec
256k  IOPS=4733, BW=1183MiB/s, Lat=214.76msec
512k  IOPS=3156, BW=1578MiB/s, Lat=320.54msec
  1m  IOPS=1901, BW=1901MiB/s, Lat=528.22msec

this patchset,  rw=randread, numjobs=32 iodepth=32

  4k  IOPS=62.6k, BW=244MiB/s, Lat=16342.89usec
  8k  IOPS=55.5k, BW=434MiB/s, Lat=18.42msec
 16k  IOPS=43.2k, BW=675MiB/s, Lat=23.68msec
 32k  IOPS=28.4k, BW=887MiB/s, Lat=36.04msec
 64k  IOPS=20.2k, BW=1263MiB/s, Lat=50.54msec
128k  IOPS=11.7k, BW=1465MiB/s, Lat=87.01msec
256k  IOPS=6813, BW=1703MiB/s, Lat=149.30msec
512k  IOPS=5363, BW=2682MiB/s, Lat=189.37msec
  1m  IOPS=2220, BW=2221MiB/s, Lat=453.92msec


Results for small blocks are not interesting, since there should not
be any difference. But starting from 32k blocks, the benefit of doing
IO for the whole message at once starts to prevail.

I'm open to testing any other loads; I just usually stick to fio rbd,
since it is pretty simple and pumps out IOs quite well.

[1] https://github.com/rouming/pech

Roman Penyaev (16):
  libceph: remove unused ceph_pagelist_cursor
  libceph: extend ceph_msg_data API in order to switch on it
  libceph,rbd,cephfs: switch from ceph_osd_data to ceph_msg_data
  libceph: remove ceph_osd_data completely
  libceph: remove unused last_piece out parameter from
    ceph_msg_data_next()
  libceph: switch data cursor from page to iov_iter for messenger
  libceph: use new tcp_sendiov() instead of tcp_sendmsg() for messenger
  libceph: remove unused tcp wrappers, now iov_iter is used for
    messenger
  libceph: no need for cursor->need_crc for messenger
  libceph: remove ->last_piece member for message data cursor
  libceph: remove not necessary checks on doing advance on bio and bvecs
    cursor
  libceph: switch bvecs cursor to iov_iter for messenger
  libceph: switch bio cursor to iov_iter for messenger
  libceph: switch pages cursor to iov_iter for messenger
  libceph: switch pageslist cursor to iov_iter for messenger
  libceph: remove ceph_msg_data_*_next() from messenger

 drivers/block/rbd.c             |   4 +-
 fs/ceph/addr.c                  |  10 +-
 fs/ceph/file.c                  |   4 +-
 include/linux/ceph/messenger.h  |  42 ++-
 include/linux/ceph/osd_client.h |  58 +---
 include/linux/ceph/pagelist.h   |  12 -
 net/ceph/messenger.c            | 558 +++++++++++++++-----------------
 net/ceph/osd_client.c           | 251 ++++----------
 net/ceph/pagelist.c             |  38 ---
 9 files changed, 390 insertions(+), 587 deletions(-)

-- 
2.24.1


* [PATCH 01/16] libceph: remove unused ceph_pagelist_cursor
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 02/16] libceph: extend ceph_msg_data API in order to switch on it Roman Penyaev
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 include/linux/ceph/pagelist.h | 12 -----------
 net/ceph/pagelist.c           | 38 -----------------------------------
 2 files changed, 50 deletions(-)

diff --git a/include/linux/ceph/pagelist.h b/include/linux/ceph/pagelist.h
index 5dead8486fd8..879bec0863aa 100644
--- a/include/linux/ceph/pagelist.h
+++ b/include/linux/ceph/pagelist.h
@@ -17,12 +17,6 @@ struct ceph_pagelist {
 	refcount_t refcnt;
 };
 
-struct ceph_pagelist_cursor {
-	struct ceph_pagelist *pl;   /* pagelist, for error checking */
-	struct list_head *page_lru; /* page in list */
-	size_t room;		    /* room remaining to reset to */
-};
-
 struct ceph_pagelist *ceph_pagelist_alloc(gfp_t gfp_flags);
 
 extern void ceph_pagelist_release(struct ceph_pagelist *pl);
@@ -33,12 +27,6 @@ extern int ceph_pagelist_reserve(struct ceph_pagelist *pl, size_t space);
 
 extern int ceph_pagelist_free_reserve(struct ceph_pagelist *pl);
 
-extern void ceph_pagelist_set_cursor(struct ceph_pagelist *pl,
-				     struct ceph_pagelist_cursor *c);
-
-extern int ceph_pagelist_truncate(struct ceph_pagelist *pl,
-				  struct ceph_pagelist_cursor *c);
-
 static inline int ceph_pagelist_encode_64(struct ceph_pagelist *pl, u64 v)
 {
 	__le64 ev = cpu_to_le64(v);
diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c
index 65e34f78b05d..87074a74d35f 100644
--- a/net/ceph/pagelist.c
+++ b/net/ceph/pagelist.c
@@ -131,41 +131,3 @@ int ceph_pagelist_free_reserve(struct ceph_pagelist *pl)
 	return 0;
 }
 EXPORT_SYMBOL(ceph_pagelist_free_reserve);
-
-/* Create a truncation point. */
-void ceph_pagelist_set_cursor(struct ceph_pagelist *pl,
-			      struct ceph_pagelist_cursor *c)
-{
-	c->pl = pl;
-	c->page_lru = pl->head.prev;
-	c->room = pl->room;
-}
-EXPORT_SYMBOL(ceph_pagelist_set_cursor);
-
-/* Truncate a pagelist to the given point. Move extra pages to reserve.
- * This won't sleep.
- * Returns: 0 on success,
- *          -EINVAL if the pagelist doesn't match the trunc point pagelist
- */
-int ceph_pagelist_truncate(struct ceph_pagelist *pl,
-			   struct ceph_pagelist_cursor *c)
-{
-	struct page *page;
-
-	if (pl != c->pl)
-		return -EINVAL;
-	ceph_pagelist_unmap_tail(pl);
-	while (pl->head.prev != c->page_lru) {
-		page = list_entry(pl->head.prev, struct page, lru);
-		/* move from pagelist to reserve */
-		list_move_tail(&page->lru, &pl->free_list);
-		++pl->num_pages_free;
-	}
-	pl->room = c->room;
-	if (!list_empty(&pl->head)) {
-		page = list_entry(pl->head.prev, struct page, lru);
-		pl->mapped_tail = kmap(page);
-	}
-	return 0;
-}
-EXPORT_SYMBOL(ceph_pagelist_truncate);
-- 
2.24.1


* [PATCH 02/16] libceph: extend ceph_msg_data API in order to switch on it
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
  2020-04-21 13:18 ` [PATCH 01/16] libceph: remove unused ceph_pagelist_cursor Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 03/16] libceph,rbd,cephfs: switch from ceph_osd_data to ceph_msg_data Roman Penyaev
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

The osd_client has a similar msg data interface built around the
structure named ceph_osd_data.  This is the first patch towards API
unification, i.e. the ceph_msg_data API will be used for all cases.
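
For reference, the intended call pattern with the extended API looks
roughly like the sketch below (the wrapper function and its arguments
are illustrative and not part of this patch):

  #include <linux/ceph/messenger.h>

  /* Illustrative helper: describe a page vector, then attach it to a msg. */
  static void example_attach_pages(struct ceph_msg *msg, struct page **pages,
                                   u64 len)
  {
          struct ceph_msg_data data;

          ceph_msg_data_init(&data);
          ceph_msg_data_pages_init(&data, pages, len, 0 /* alignment */,
                                   false /* pages_from_pool */,
                                   false /* own_pages */);
          ceph_msg_data_add(msg, &data);
  }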

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 include/linux/ceph/messenger.h |  34 +++++++-
 net/ceph/messenger.c           | 145 ++++++++++++++++++++++++++++-----
 net/ceph/osd_client.c          |   8 +-
 3 files changed, 158 insertions(+), 29 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 76371aaae2d1..424f9f1989b7 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -173,11 +173,15 @@ struct ceph_msg_data {
 			u32			bio_length;
 		};
 #endif /* CONFIG_BLOCK */
-		struct ceph_bvec_iter	bvec_pos;
+		struct {
+			struct ceph_bvec_iter	bvec_pos;
+			u32			num_bvecs;
+		};
 		struct {
 			struct page	**pages;
 			size_t		length;		/* total # bytes */
 			unsigned int	alignment;	/* first page */
+			bool		pages_from_pool;
 			bool		own_pages;
 		};
 		struct ceph_pagelist	*pagelist;
@@ -357,8 +361,29 @@ extern void ceph_con_keepalive(struct ceph_connection *con);
 extern bool ceph_con_keepalive_expired(struct ceph_connection *con,
 				       unsigned long interval);
 
-void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
-			     size_t length, size_t alignment, bool own_pages);
+extern void ceph_msg_data_init(struct ceph_msg_data *data);
+extern void ceph_msg_data_release(struct ceph_msg_data *data);
+extern size_t ceph_msg_data_length(struct ceph_msg_data *data);
+
+extern void ceph_msg_data_pages_init(struct ceph_msg_data *data,
+				     struct page **pages, u64 length,
+				     u32 alignment, bool pages_from_pool,
+				     bool own_pages);
+extern void ceph_msg_data_pagelist_init(struct ceph_msg_data *data,
+					struct ceph_pagelist *pagelist);
+#ifdef CONFIG_BLOCK
+extern void ceph_msg_data_bio_init(struct ceph_msg_data *data,
+				   struct ceph_bio_iter *bio_pos,
+				   u32 bio_length);
+#endif /* CONFIG_BLOCK */
+extern void ceph_msg_data_bvecs_init(struct ceph_msg_data *data,
+				     struct ceph_bvec_iter *bvec_pos,
+				     u32 num_bvecs);
+extern void ceph_msg_data_add(struct ceph_msg *msg, struct ceph_msg_data *data);
+
+extern void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
+				    size_t length, size_t alignment,
+				    bool pages_from_pool, bool own_pages);
 extern void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
 				struct ceph_pagelist *pagelist);
 #ifdef CONFIG_BLOCK
@@ -366,7 +391,8 @@ void ceph_msg_data_add_bio(struct ceph_msg *msg, struct ceph_bio_iter *bio_pos,
 			   u32 length);
 #endif /* CONFIG_BLOCK */
 void ceph_msg_data_add_bvecs(struct ceph_msg *msg,
-			     struct ceph_bvec_iter *bvec_pos);
+			     struct ceph_bvec_iter *bvec_pos,
+			     u32 num_bvecs);
 
 struct ceph_msg *ceph_msg_new2(int type, int front_len, int max_data_items,
 			       gfp_t flags, bool can_fail);
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index f8ca5edc5f2c..8f35ed01a576 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -3240,13 +3240,20 @@ bool ceph_con_keepalive_expired(struct ceph_connection *con,
 	return false;
 }
 
-static struct ceph_msg_data *ceph_msg_data_add(struct ceph_msg *msg)
+static struct ceph_msg_data *ceph_msg_data_get_next(struct ceph_msg *msg)
 {
 	BUG_ON(msg->num_data_items >= msg->max_data_items);
 	return &msg->data[msg->num_data_items++];
 }
 
-static void ceph_msg_data_destroy(struct ceph_msg_data *data)
+void ceph_msg_data_init(struct ceph_msg_data *data)
+{
+	memset(data, 0, sizeof(*data));
+	data->type = CEPH_MSG_DATA_NONE;
+}
+EXPORT_SYMBOL(ceph_msg_data_init);
+
+void ceph_msg_data_release(struct ceph_msg_data *data)
 {
 	if (data->type == CEPH_MSG_DATA_PAGES && data->own_pages) {
 		int num_pages = calc_pages_for(data->alignment, data->length);
@@ -3254,23 +3261,120 @@ static void ceph_msg_data_destroy(struct ceph_msg_data *data)
 	} else if (data->type == CEPH_MSG_DATA_PAGELIST) {
 		ceph_pagelist_release(data->pagelist);
 	}
+	ceph_msg_data_init(data);
 }
+EXPORT_SYMBOL(ceph_msg_data_release);
 
-void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
-			     size_t length, size_t alignment, bool own_pages)
+/*
+ * Consumes @pages if @own_pages is true.
+ */
+void ceph_msg_data_pages_init(struct ceph_msg_data *data,
+			      struct page **pages, u64 length, u32 alignment,
+			      bool pages_from_pool, bool own_pages)
 {
-	struct ceph_msg_data *data;
-
-	BUG_ON(!pages);
-	BUG_ON(!length);
-
-	data = ceph_msg_data_add(msg);
 	data->type = CEPH_MSG_DATA_PAGES;
 	data->pages = pages;
 	data->length = length;
 	data->alignment = alignment & ~PAGE_MASK;
+	data->pages_from_pool = pages_from_pool;
 	data->own_pages = own_pages;
+}
+EXPORT_SYMBOL(ceph_msg_data_pages_init);
+
+/*
+ * Consumes a ref on @pagelist.
+ */
+void ceph_msg_data_pagelist_init(struct ceph_msg_data *data,
+				 struct ceph_pagelist *pagelist)
+{
+	data->type = CEPH_MSG_DATA_PAGELIST;
+	data->pagelist = pagelist;
+}
+EXPORT_SYMBOL(ceph_msg_data_pagelist_init);
+
+#ifdef CONFIG_BLOCK
+void ceph_msg_data_bio_init(struct ceph_msg_data *data,
+			    struct ceph_bio_iter *bio_pos,
+			    u32 bio_length)
+{
+	data->type = CEPH_MSG_DATA_BIO;
+	data->bio_pos = *bio_pos;
+	data->bio_length = bio_length;
+}
+EXPORT_SYMBOL(ceph_msg_data_bio_init);
+#endif /* CONFIG_BLOCK */
 
+void ceph_msg_data_bvecs_init(struct ceph_msg_data *data,
+			      struct ceph_bvec_iter *bvec_pos,
+			      u32 num_bvecs)
+{
+	data->type = CEPH_MSG_DATA_BVECS;
+	data->bvec_pos = *bvec_pos;
+	data->num_bvecs = num_bvecs;
+}
+EXPORT_SYMBOL(ceph_msg_data_bvecs_init);
+
+size_t ceph_msg_data_length(struct ceph_msg_data *data)
+{
+	switch (data->type) {
+	case CEPH_MSG_DATA_NONE:
+		return 0;
+	case CEPH_MSG_DATA_PAGES:
+		return data->length;
+	case CEPH_MSG_DATA_PAGELIST:
+		return data->pagelist->length;
+#ifdef CONFIG_BLOCK
+	case CEPH_MSG_DATA_BIO:
+		return data->bio_length;
+#endif /* CONFIG_BLOCK */
+	case CEPH_MSG_DATA_BVECS:
+		return data->bvec_pos.iter.bi_size;
+	default:
+		WARN(true, "unrecognized data type %d\n", (int)data->type);
+		return 0;
+	}
+}
+EXPORT_SYMBOL(ceph_msg_data_length);
+
+void ceph_msg_data_add(struct ceph_msg *msg, struct ceph_msg_data *data)
+{
+	u64 length = ceph_msg_data_length(data);
+
+	if (data->type == CEPH_MSG_DATA_PAGES) {
+		BUG_ON(length > (u64)SIZE_MAX);
+		if (likely(length))
+			ceph_msg_data_add_pages(msg, data->pages,
+						length, data->alignment,
+						data->pages_from_pool,
+						false);
+	} else if (data->type == CEPH_MSG_DATA_PAGELIST) {
+		BUG_ON(!length);
+		ceph_msg_data_add_pagelist(msg, data->pagelist);
+#ifdef CONFIG_BLOCK
+	} else if (data->type == CEPH_MSG_DATA_BIO) {
+		ceph_msg_data_add_bio(msg, &data->bio_pos, length);
+#endif
+	} else if (data->type == CEPH_MSG_DATA_BVECS) {
+		ceph_msg_data_add_bvecs(msg, &data->bvec_pos,
+					data->num_bvecs);
+	} else {
+		BUG_ON(data->type != CEPH_MSG_DATA_NONE);
+	}
+}
+EXPORT_SYMBOL(ceph_msg_data_add);
+
+void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
+			     size_t length, size_t alignment,
+			     bool pages_from_pool, bool own_pages)
+{
+	struct ceph_msg_data *data;
+
+	BUG_ON(!pages);
+	BUG_ON(!length);
+
+	data = ceph_msg_data_get_next(msg);
+	ceph_msg_data_pages_init(data, pages, length, alignment,
+				 pages_from_pool, own_pages);
 	msg->data_length += length;
 }
 EXPORT_SYMBOL(ceph_msg_data_add_pages);
@@ -3283,10 +3387,9 @@ void ceph_msg_data_add_pagelist(struct ceph_msg *msg,
 	BUG_ON(!pagelist);
 	BUG_ON(!pagelist->length);
 
-	data = ceph_msg_data_add(msg);
-	data->type = CEPH_MSG_DATA_PAGELIST;
+	data = ceph_msg_data_get_next(msg);
+	ceph_msg_data_pagelist_init(data, pagelist);
 	refcount_inc(&pagelist->refcnt);
-	data->pagelist = pagelist;
 
 	msg->data_length += pagelist->length;
 }
@@ -3298,10 +3401,8 @@ void ceph_msg_data_add_bio(struct ceph_msg *msg, struct ceph_bio_iter *bio_pos,
 {
 	struct ceph_msg_data *data;
 
-	data = ceph_msg_data_add(msg);
-	data->type = CEPH_MSG_DATA_BIO;
-	data->bio_pos = *bio_pos;
-	data->bio_length = length;
+	data = ceph_msg_data_get_next(msg);
+	ceph_msg_data_bio_init(data, bio_pos, length);
 
 	msg->data_length += length;
 }
@@ -3309,13 +3410,13 @@ EXPORT_SYMBOL(ceph_msg_data_add_bio);
 #endif	/* CONFIG_BLOCK */
 
 void ceph_msg_data_add_bvecs(struct ceph_msg *msg,
-			     struct ceph_bvec_iter *bvec_pos)
+			     struct ceph_bvec_iter *bvec_pos,
+			     u32 num_bvecs)
 {
 	struct ceph_msg_data *data;
 
-	data = ceph_msg_data_add(msg);
-	data->type = CEPH_MSG_DATA_BVECS;
-	data->bvec_pos = *bvec_pos;
+	data = ceph_msg_data_get_next(msg);
+	ceph_msg_data_bvecs_init(data, bvec_pos, num_bvecs);
 
 	msg->data_length += bvec_pos->iter.bi_size;
 }
@@ -3502,7 +3603,7 @@ static void ceph_msg_release(struct kref *kref)
 	}
 
 	for (i = 0; i < m->num_data_items; i++)
-		ceph_msg_data_destroy(&m->data[i]);
+		ceph_msg_data_release(&m->data[i]);
 
 	if (m->pool)
 		ceph_msgpool_put(m->pool, m);
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 998e26b75a78..efe3d87b75f2 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -962,7 +962,8 @@ static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
 		BUG_ON(length > (u64) SIZE_MAX);
 		if (length)
 			ceph_msg_data_add_pages(msg, osd_data->pages,
-					length, osd_data->alignment, false);
+					length, osd_data->alignment,
+					osd_data->pages_from_pool, false);
 	} else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) {
 		BUG_ON(!length);
 		ceph_msg_data_add_pagelist(msg, osd_data->pagelist);
@@ -971,7 +972,8 @@ static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
 		ceph_msg_data_add_bio(msg, &osd_data->bio_pos, length);
 #endif
 	} else if (osd_data->type == CEPH_OSD_DATA_TYPE_BVECS) {
-		ceph_msg_data_add_bvecs(msg, &osd_data->bvec_pos);
+		ceph_msg_data_add_bvecs(msg, &osd_data->bvec_pos,
+					osd_data->num_bvecs);
 	} else {
 		BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_NONE);
 	}
@@ -5443,7 +5445,7 @@ static struct ceph_msg *alloc_msg_with_page_vector(struct ceph_msg_header *hdr)
 			return NULL;
 		}
 
-		ceph_msg_data_add_pages(m, pages, data_len, 0, true);
+		ceph_msg_data_add_pages(m, pages, data_len, 0, false, true);
 	}
 
 	return m;
-- 
2.24.1


* [PATCH 03/16] libceph,rbd,cephfs: switch from ceph_osd_data to ceph_msg_data
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
  2020-04-21 13:18 ` [PATCH 01/16] libceph: remove unused ceph_pagelist_cursor Roman Penyaev
  2020-04-21 13:18 ` [PATCH 02/16] libceph: extend ceph_msg_data API in order to switch on it Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 04/16] libceph: remove ceph_osd_data completely Roman Penyaev
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

This is just a blind replacement of the ceph_osd_data API with
ceph_msg_data.  In the next patch ceph_osd_data will be removed.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 drivers/block/rbd.c             |   4 +-
 fs/ceph/addr.c                  |  10 +--
 fs/ceph/file.c                  |   4 +-
 include/linux/ceph/osd_client.h |  24 +++---
 net/ceph/osd_client.c           | 145 ++++++++++++++++----------------
 5 files changed, 95 insertions(+), 92 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 67d65ac785e9..eddde641a615 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1971,7 +1971,7 @@ static int rbd_object_map_update_finish(struct rbd_obj_request *obj_req,
 					struct ceph_osd_request *osd_req)
 {
 	struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev;
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 	u64 objno;
 	u8 state, new_state, uninitialized_var(current_state);
 	bool has_current_state;
@@ -1991,7 +1991,7 @@ static int rbd_object_map_update_finish(struct rbd_obj_request *obj_req,
 	 */
 	rbd_assert(osd_req->r_num_ops == 2);
 	osd_data = osd_req_op_data(osd_req, 1, cls, request_data);
-	rbd_assert(osd_data->type == CEPH_OSD_DATA_TYPE_PAGES);
+	rbd_assert(osd_data->type == CEPH_MSG_DATA_PAGES);
 
 	p = page_address(osd_data->pages[0]);
 	objno = ceph_decode_64(&p);
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6f4678d98df7..6021364233ba 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -299,7 +299,7 @@ static int ceph_readpage(struct file *filp, struct page *page)
 static void finish_read(struct ceph_osd_request *req)
 {
 	struct inode *inode = req->r_inode;
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 	int rc = req->r_result <= 0 ? req->r_result : 0;
 	int bytes = req->r_result >= 0 ? req->r_result : 0;
 	int num_pages;
@@ -311,7 +311,7 @@ static void finish_read(struct ceph_osd_request *req)
 
 	/* unlock all pages, zeroing any data we didn't read */
 	osd_data = osd_req_op_extent_osd_data(req, 0);
-	BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_PAGES);
+	BUG_ON(osd_data->type != CEPH_MSG_DATA_PAGES);
 	num_pages = calc_pages_for((u64)osd_data->alignment,
 					(u64)osd_data->length);
 	for (i = 0; i < num_pages; i++) {
@@ -774,7 +774,7 @@ static void writepages_finish(struct ceph_osd_request *req)
 {
 	struct inode *inode = req->r_inode;
 	struct ceph_inode_info *ci = ceph_inode(inode);
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 	struct page *page;
 	int num_pages, total_pages = 0;
 	int i, j;
@@ -809,7 +809,7 @@ static void writepages_finish(struct ceph_osd_request *req)
 			break;
 
 		osd_data = osd_req_op_extent_osd_data(req, i);
-		BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_PAGES);
+		BUG_ON(osd_data->type != CEPH_MSG_DATA_PAGES);
 		num_pages = calc_pages_for((u64)osd_data->alignment,
 					   (u64)osd_data->length);
 		total_pages += num_pages;
@@ -836,7 +836,7 @@ static void writepages_finish(struct ceph_osd_request *req)
 
 			unlock_page(page);
 		}
-		dout("writepages_finish %p wrote %llu bytes cleaned %d pages\n",
+		dout("writepages_finish %p wrote %zu bytes cleaned %d pages\n",
 		     inode, osd_data->length, rc >= 0 ? num_pages : 0);
 
 		release_pages(osd_data->pages, num_pages);
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index afdfca965a7f..49b35fa39bb6 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1043,9 +1043,9 @@ static void ceph_aio_complete_req(struct ceph_osd_request *req)
 	int rc = req->r_result;
 	struct inode *inode = req->r_inode;
 	struct ceph_aio_request *aio_req = req->r_priv;
-	struct ceph_osd_data *osd_data = osd_req_op_extent_osd_data(req, 0);
+	struct ceph_msg_data *osd_data = osd_req_op_extent_osd_data(req, 0);
 
-	BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_BVECS);
+	BUG_ON(osd_data->type != CEPH_MSG_DATA_BVECS);
 	BUG_ON(!osd_data->num_bvecs);
 
 	dout("ceph_aio_complete_req %p rc %d bytes %u\n",
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 9d9f745b98a1..b1ec10c8a408 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -92,26 +92,26 @@ struct ceph_osd_req_op {
 	s32 rval;
 
 	union {
-		struct ceph_osd_data raw_data_in;
+		struct ceph_msg_data raw_data_in;
 		struct {
 			u64 offset, length;
 			u64 truncate_size;
 			u32 truncate_seq;
-			struct ceph_osd_data osd_data;
+			struct ceph_msg_data osd_data;
 		} extent;
 		struct {
 			u32 name_len;
 			u32 value_len;
 			__u8 cmp_op;       /* CEPH_OSD_CMPXATTR_OP_* */
 			__u8 cmp_mode;     /* CEPH_OSD_CMPXATTR_MODE_* */
-			struct ceph_osd_data osd_data;
+			struct ceph_msg_data osd_data;
 		} xattr;
 		struct {
 			const char *class_name;
 			const char *method_name;
-			struct ceph_osd_data request_info;
-			struct ceph_osd_data request_data;
-			struct ceph_osd_data response_data;
+			struct ceph_msg_data request_info;
+			struct ceph_msg_data request_data;
+			struct ceph_msg_data response_data;
 			__u8 class_len;
 			__u8 method_len;
 			u32 indata_len;
@@ -122,15 +122,15 @@ struct ceph_osd_req_op {
 			u32 gen;
 		} watch;
 		struct {
-			struct ceph_osd_data request_data;
+			struct ceph_msg_data request_data;
 		} notify_ack;
 		struct {
 			u64 cookie;
-			struct ceph_osd_data request_data;
-			struct ceph_osd_data response_data;
+			struct ceph_msg_data request_data;
+			struct ceph_msg_data response_data;
 		} notify;
 		struct {
-			struct ceph_osd_data response_data;
+			struct ceph_msg_data response_data;
 		} list_watchers;
 		struct {
 			u64 expected_object_size;
@@ -141,7 +141,7 @@ struct ceph_osd_req_op {
 			u64 src_version;
 			u8 flags;
 			u32 src_fadvise_flags;
-			struct ceph_osd_data osd_data;
+			struct ceph_msg_data osd_data;
 		} copy_from;
 	};
 };
@@ -417,7 +417,7 @@ extern void osd_req_op_extent_update(struct ceph_osd_request *osd_req,
 extern void osd_req_op_extent_dup_last(struct ceph_osd_request *osd_req,
 				       unsigned int which, u64 offset_inc);
 
-extern struct ceph_osd_data *osd_req_op_extent_osd_data(
+extern struct ceph_msg_data *osd_req_op_extent_osd_data(
 					struct ceph_osd_request *osd_req,
 					unsigned int which);
 
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index efe3d87b75f2..56a4d5f196b3 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -129,6 +129,7 @@ static void ceph_osd_data_init(struct ceph_osd_data *osd_data)
 /*
  * Consumes @pages if @own_pages is true.
  */
+__attribute__((unused))
 static void ceph_osd_data_pages_init(struct ceph_osd_data *osd_data,
 			struct page **pages, u64 length, u32 alignment,
 			bool pages_from_pool, bool own_pages)
@@ -144,6 +145,7 @@ static void ceph_osd_data_pages_init(struct ceph_osd_data *osd_data,
 /*
  * Consumes a ref on @pagelist.
  */
+__attribute__((unused))
 static void ceph_osd_data_pagelist_init(struct ceph_osd_data *osd_data,
 			struct ceph_pagelist *pagelist)
 {
@@ -152,6 +154,7 @@ static void ceph_osd_data_pagelist_init(struct ceph_osd_data *osd_data,
 }
 
 #ifdef CONFIG_BLOCK
+__attribute__((unused))
 static void ceph_osd_data_bio_init(struct ceph_osd_data *osd_data,
 				   struct ceph_bio_iter *bio_pos,
 				   u32 bio_length)
@@ -162,6 +165,7 @@ static void ceph_osd_data_bio_init(struct ceph_osd_data *osd_data,
 }
 #endif /* CONFIG_BLOCK */
 
+__attribute__((unused))
 static void ceph_osd_data_bvecs_init(struct ceph_osd_data *osd_data,
 				     struct ceph_bvec_iter *bvec_pos,
 				     u32 num_bvecs)
@@ -171,7 +175,7 @@ static void ceph_osd_data_bvecs_init(struct ceph_osd_data *osd_data,
 	osd_data->num_bvecs = num_bvecs;
 }
 
-static struct ceph_osd_data *
+static struct ceph_msg_data *
 osd_req_op_raw_data_in(struct ceph_osd_request *osd_req, unsigned int which)
 {
 	BUG_ON(which >= osd_req->r_num_ops);
@@ -179,7 +183,7 @@ osd_req_op_raw_data_in(struct ceph_osd_request *osd_req, unsigned int which)
 	return &osd_req->r_ops[which].raw_data_in;
 }
 
-struct ceph_osd_data *
+struct ceph_msg_data *
 osd_req_op_extent_osd_data(struct ceph_osd_request *osd_req,
 			unsigned int which)
 {
@@ -192,11 +196,11 @@ void osd_req_op_raw_data_in_pages(struct ceph_osd_request *osd_req,
 			u64 length, u32 alignment,
 			bool pages_from_pool, bool own_pages)
 {
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 
 	osd_data = osd_req_op_raw_data_in(osd_req, which);
-	ceph_osd_data_pages_init(osd_data, pages, length, alignment,
-				pages_from_pool, own_pages);
+	ceph_msg_data_pages_init(osd_data, pages, length, alignment,
+				 pages_from_pool, own_pages);
 }
 EXPORT_SYMBOL(osd_req_op_raw_data_in_pages);
 
@@ -205,21 +209,21 @@ void osd_req_op_extent_osd_data_pages(struct ceph_osd_request *osd_req,
 			u64 length, u32 alignment,
 			bool pages_from_pool, bool own_pages)
 {
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 
 	osd_data = osd_req_op_data(osd_req, which, extent, osd_data);
-	ceph_osd_data_pages_init(osd_data, pages, length, alignment,
-				pages_from_pool, own_pages);
+	ceph_msg_data_pages_init(osd_data, pages, length, alignment,
+				 pages_from_pool, own_pages);
 }
 EXPORT_SYMBOL(osd_req_op_extent_osd_data_pages);
 
 void osd_req_op_extent_osd_data_pagelist(struct ceph_osd_request *osd_req,
 			unsigned int which, struct ceph_pagelist *pagelist)
 {
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 
 	osd_data = osd_req_op_data(osd_req, which, extent, osd_data);
-	ceph_osd_data_pagelist_init(osd_data, pagelist);
+	ceph_msg_data_pagelist_init(osd_data, pagelist);
 }
 EXPORT_SYMBOL(osd_req_op_extent_osd_data_pagelist);
 
@@ -229,10 +233,10 @@ void osd_req_op_extent_osd_data_bio(struct ceph_osd_request *osd_req,
 				    struct ceph_bio_iter *bio_pos,
 				    u32 bio_length)
 {
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 
 	osd_data = osd_req_op_data(osd_req, which, extent, osd_data);
-	ceph_osd_data_bio_init(osd_data, bio_pos, bio_length);
+	ceph_msg_data_bio_init(osd_data, bio_pos, bio_length);
 }
 EXPORT_SYMBOL(osd_req_op_extent_osd_data_bio);
 #endif /* CONFIG_BLOCK */
@@ -242,14 +246,14 @@ void osd_req_op_extent_osd_data_bvecs(struct ceph_osd_request *osd_req,
 				      struct bio_vec *bvecs, u32 num_bvecs,
 				      u32 bytes)
 {
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 	struct ceph_bvec_iter it = {
 		.bvecs = bvecs,
 		.iter = { .bi_size = bytes },
 	};
 
 	osd_data = osd_req_op_data(osd_req, which, extent, osd_data);
-	ceph_osd_data_bvecs_init(osd_data, &it, num_bvecs);
+	ceph_msg_data_bvecs_init(osd_data, &it, num_bvecs);
 }
 EXPORT_SYMBOL(osd_req_op_extent_osd_data_bvecs);
 
@@ -257,10 +261,10 @@ void osd_req_op_extent_osd_data_bvec_pos(struct ceph_osd_request *osd_req,
 					 unsigned int which,
 					 struct ceph_bvec_iter *bvec_pos)
 {
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 
 	osd_data = osd_req_op_data(osd_req, which, extent, osd_data);
-	ceph_osd_data_bvecs_init(osd_data, bvec_pos, 0);
+	ceph_msg_data_bvecs_init(osd_data, bvec_pos, 0);
 }
 EXPORT_SYMBOL(osd_req_op_extent_osd_data_bvec_pos);
 
@@ -268,20 +272,20 @@ static void osd_req_op_cls_request_info_pagelist(
 			struct ceph_osd_request *osd_req,
 			unsigned int which, struct ceph_pagelist *pagelist)
 {
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 
 	osd_data = osd_req_op_data(osd_req, which, cls, request_info);
-	ceph_osd_data_pagelist_init(osd_data, pagelist);
+	ceph_msg_data_pagelist_init(osd_data, pagelist);
 }
 
 void osd_req_op_cls_request_data_pagelist(
 			struct ceph_osd_request *osd_req,
 			unsigned int which, struct ceph_pagelist *pagelist)
 {
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 
 	osd_data = osd_req_op_data(osd_req, which, cls, request_data);
-	ceph_osd_data_pagelist_init(osd_data, pagelist);
+	ceph_msg_data_pagelist_init(osd_data, pagelist);
 	osd_req->r_ops[which].cls.indata_len += pagelist->length;
 	osd_req->r_ops[which].indata_len += pagelist->length;
 }
@@ -291,11 +295,11 @@ void osd_req_op_cls_request_data_pages(struct ceph_osd_request *osd_req,
 			unsigned int which, struct page **pages, u64 length,
 			u32 alignment, bool pages_from_pool, bool own_pages)
 {
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 
 	osd_data = osd_req_op_data(osd_req, which, cls, request_data);
-	ceph_osd_data_pages_init(osd_data, pages, length, alignment,
-				pages_from_pool, own_pages);
+	ceph_msg_data_pages_init(osd_data, pages, length, alignment,
+				 pages_from_pool, own_pages);
 	osd_req->r_ops[which].cls.indata_len += length;
 	osd_req->r_ops[which].indata_len += length;
 }
@@ -306,14 +310,14 @@ void osd_req_op_cls_request_data_bvecs(struct ceph_osd_request *osd_req,
 				       struct bio_vec *bvecs, u32 num_bvecs,
 				       u32 bytes)
 {
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 	struct ceph_bvec_iter it = {
 		.bvecs = bvecs,
 		.iter = { .bi_size = bytes },
 	};
 
 	osd_data = osd_req_op_data(osd_req, which, cls, request_data);
-	ceph_osd_data_bvecs_init(osd_data, &it, num_bvecs);
+	ceph_msg_data_bvecs_init(osd_data, &it, num_bvecs);
 	osd_req->r_ops[which].cls.indata_len += bytes;
 	osd_req->r_ops[which].indata_len += bytes;
 }
@@ -323,11 +327,11 @@ void osd_req_op_cls_response_data_pages(struct ceph_osd_request *osd_req,
 			unsigned int which, struct page **pages, u64 length,
 			u32 alignment, bool pages_from_pool, bool own_pages)
 {
-	struct ceph_osd_data *osd_data;
+	struct ceph_msg_data *osd_data;
 
 	osd_data = osd_req_op_data(osd_req, which, cls, response_data);
-	ceph_osd_data_pages_init(osd_data, pages, length, alignment,
-				pages_from_pool, own_pages);
+	ceph_msg_data_pages_init(osd_data, pages, length, alignment,
+				 pages_from_pool, own_pages);
 }
 EXPORT_SYMBOL(osd_req_op_cls_response_data_pages);
 
@@ -352,6 +356,7 @@ static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data)
 	}
 }
 
+__attribute__((unused))
 static void ceph_osd_data_release(struct ceph_osd_data *osd_data)
 {
 	if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES && osd_data->own_pages) {
@@ -378,32 +383,32 @@ static void osd_req_op_data_release(struct ceph_osd_request *osd_req,
 	case CEPH_OSD_OP_READ:
 	case CEPH_OSD_OP_WRITE:
 	case CEPH_OSD_OP_WRITEFULL:
-		ceph_osd_data_release(&op->extent.osd_data);
+		ceph_msg_data_release(&op->extent.osd_data);
 		break;
 	case CEPH_OSD_OP_CALL:
-		ceph_osd_data_release(&op->cls.request_info);
-		ceph_osd_data_release(&op->cls.request_data);
-		ceph_osd_data_release(&op->cls.response_data);
+		ceph_msg_data_release(&op->cls.request_info);
+		ceph_msg_data_release(&op->cls.request_data);
+		ceph_msg_data_release(&op->cls.response_data);
 		break;
 	case CEPH_OSD_OP_SETXATTR:
 	case CEPH_OSD_OP_CMPXATTR:
-		ceph_osd_data_release(&op->xattr.osd_data);
+		ceph_msg_data_release(&op->xattr.osd_data);
 		break;
 	case CEPH_OSD_OP_STAT:
-		ceph_osd_data_release(&op->raw_data_in);
+		ceph_msg_data_release(&op->raw_data_in);
 		break;
 	case CEPH_OSD_OP_NOTIFY_ACK:
-		ceph_osd_data_release(&op->notify_ack.request_data);
+		ceph_msg_data_release(&op->notify_ack.request_data);
 		break;
 	case CEPH_OSD_OP_NOTIFY:
-		ceph_osd_data_release(&op->notify.request_data);
-		ceph_osd_data_release(&op->notify.response_data);
+		ceph_msg_data_release(&op->notify.request_data);
+		ceph_msg_data_release(&op->notify.response_data);
 		break;
 	case CEPH_OSD_OP_LIST_WATCHERS:
-		ceph_osd_data_release(&op->list_watchers.response_data);
+		ceph_msg_data_release(&op->list_watchers.response_data);
 		break;
 	case CEPH_OSD_OP_COPY_FROM2:
-		ceph_osd_data_release(&op->copy_from.osd_data);
+		ceph_msg_data_release(&op->copy_from.osd_data);
 		break;
 	default:
 		break;
@@ -908,7 +913,7 @@ int osd_req_op_xattr_init(struct ceph_osd_request *osd_req, unsigned int which,
 	op->xattr.cmp_op = cmp_op;
 	op->xattr.cmp_mode = cmp_mode;
 
-	ceph_osd_data_pagelist_init(&op->xattr.osd_data, pagelist);
+	ceph_msg_data_pagelist_init(&op->xattr.osd_data, pagelist);
 	op->indata_len = payload_len;
 	return 0;
 
@@ -953,6 +958,7 @@ void osd_req_op_alloc_hint_init(struct ceph_osd_request *osd_req,
 }
 EXPORT_SYMBOL(osd_req_op_alloc_hint_init);
 
+__attribute__((unused))
 static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
 				struct ceph_osd_data *osd_data)
 {
@@ -1954,37 +1960,35 @@ static void setup_request_data(struct ceph_osd_request *req)
 		case CEPH_OSD_OP_WRITE:
 		case CEPH_OSD_OP_WRITEFULL:
 			WARN_ON(op->indata_len != op->extent.length);
-			ceph_osdc_msg_data_add(request_msg,
-					       &op->extent.osd_data);
+			ceph_msg_data_add(request_msg,
+					  &op->extent.osd_data);
 			break;
 		case CEPH_OSD_OP_SETXATTR:
 		case CEPH_OSD_OP_CMPXATTR:
 			WARN_ON(op->indata_len != op->xattr.name_len +
 						  op->xattr.value_len);
-			ceph_osdc_msg_data_add(request_msg,
-					       &op->xattr.osd_data);
+			ceph_msg_data_add(request_msg,
+					  &op->xattr.osd_data);
 			break;
 		case CEPH_OSD_OP_NOTIFY_ACK:
-			ceph_osdc_msg_data_add(request_msg,
-					       &op->notify_ack.request_data);
+			ceph_msg_data_add(request_msg,
+					  &op->notify_ack.request_data);
 			break;
 		case CEPH_OSD_OP_COPY_FROM2:
-			ceph_osdc_msg_data_add(request_msg,
-					       &op->copy_from.osd_data);
+			ceph_msg_data_add(request_msg,
+					  &op->copy_from.osd_data);
 			break;
 
 		/* reply */
 		case CEPH_OSD_OP_STAT:
-			ceph_osdc_msg_data_add(reply_msg,
-					       &op->raw_data_in);
+			ceph_msg_data_add(reply_msg, &op->raw_data_in);
 			break;
 		case CEPH_OSD_OP_READ:
-			ceph_osdc_msg_data_add(reply_msg,
-					       &op->extent.osd_data);
+			ceph_msg_data_add(reply_msg, &op->extent.osd_data);
 			break;
 		case CEPH_OSD_OP_LIST_WATCHERS:
-			ceph_osdc_msg_data_add(reply_msg,
-					       &op->list_watchers.response_data);
+			ceph_msg_data_add(reply_msg,
+					  &op->list_watchers.response_data);
 			break;
 
 		/* both */
@@ -1992,20 +1996,19 @@ static void setup_request_data(struct ceph_osd_request *req)
 			WARN_ON(op->indata_len != op->cls.class_len +
 						  op->cls.method_len +
 						  op->cls.indata_len);
-			ceph_osdc_msg_data_add(request_msg,
-					       &op->cls.request_info);
+			ceph_msg_data_add(request_msg, &op->cls.request_info);
 			/* optional, can be NONE */
-			ceph_osdc_msg_data_add(request_msg,
-					       &op->cls.request_data);
+			ceph_msg_data_add(request_msg,
+					  &op->cls.request_data);
 			/* optional, can be NONE */
-			ceph_osdc_msg_data_add(reply_msg,
-					       &op->cls.response_data);
+			ceph_msg_data_add(reply_msg,
+					  &op->cls.response_data);
 			break;
 		case CEPH_OSD_OP_NOTIFY:
-			ceph_osdc_msg_data_add(request_msg,
-					       &op->notify.request_data);
-			ceph_osdc_msg_data_add(reply_msg,
-					       &op->notify.response_data);
+			ceph_msg_data_add(request_msg,
+					  &op->notify.request_data);
+			ceph_msg_data_add(reply_msg,
+					  &op->notify.response_data);
 			break;
 		}
 	}
@@ -2944,12 +2947,12 @@ static void linger_commit_cb(struct ceph_osd_request *req)
 	lreq->committed = true;
 
 	if (!lreq->is_watch) {
-		struct ceph_osd_data *osd_data =
+		struct ceph_msg_data *osd_data =
 		    osd_req_op_data(req, 0, notify, response_data);
 		void *p = page_address(osd_data->pages[0]);
 
 		WARN_ON(req->r_ops[0].op != CEPH_OSD_OP_NOTIFY ||
-			osd_data->type != CEPH_OSD_DATA_TYPE_PAGES);
+			osd_data->type != CEPH_MSG_DATA_PAGES);
 
 		/* make note of the notify_id */
 		if (req->r_ops[0].outdata_len >= sizeof(u64)) {
@@ -4730,7 +4733,7 @@ static int osd_req_op_notify_ack_init(struct ceph_osd_request *req, int which,
 		return -ENOMEM;
 	}
 
-	ceph_osd_data_pagelist_init(&op->notify_ack.request_data, pl);
+	ceph_msg_data_pagelist_init(&op->notify_ack.request_data, pl);
 	op->indata_len = pl->length;
 	return 0;
 }
@@ -4796,7 +4799,7 @@ static int osd_req_op_notify_init(struct ceph_osd_request *req, int which,
 		return -ENOMEM;
 	}
 
-	ceph_osd_data_pagelist_init(&op->notify.request_data, pl);
+	ceph_msg_data_pagelist_init(&op->notify.request_data, pl);
 	op->indata_len = pl->length;
 	return 0;
 }
@@ -4860,7 +4863,7 @@ int ceph_osdc_notify(struct ceph_osd_client *osdc,
 		ret = PTR_ERR(pages);
 		goto out_put_lreq;
 	}
-	ceph_osd_data_pages_init(osd_req_op_data(lreq->reg_req, 0, notify,
+	ceph_msg_data_pages_init(osd_req_op_data(lreq->reg_req, 0, notify,
 						 response_data),
 				 pages, PAGE_SIZE, 0, false, true);
 
@@ -5007,7 +5010,7 @@ int ceph_osdc_list_watchers(struct ceph_osd_client *osdc,
 	}
 
 	osd_req_op_init(req, 0, CEPH_OSD_OP_LIST_WATCHERS, 0);
-	ceph_osd_data_pages_init(osd_req_op_data(req, 0, list_watchers,
+	ceph_msg_data_pages_init(osd_req_op_data(req, 0, list_watchers,
 						 response_data),
 				 pages, PAGE_SIZE, 0, false, true);
 
@@ -5259,7 +5262,7 @@ static int osd_req_op_copy_from_init(struct ceph_osd_request *req,
 	ceph_encode_64(&p, truncate_size);
 	op->indata_len = PAGE_SIZE - (end - p);
 
-	ceph_osd_data_pages_init(&op->copy_from.osd_data, pages,
+	ceph_msg_data_pages_init(&op->copy_from.osd_data, pages,
 				 op->indata_len, 0, false, true);
 	return 0;
 }
-- 
2.24.1


* [PATCH 04/16] libceph: remove ceph_osd_data completely
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (2 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 03/16] libceph,rbd,cephfs: switch from ceph_osd_data to ceph_msg_data Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 05/16] libceph: remove unused last_piece out parameter from ceph_msg_data_next() Roman Penyaev
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

From now on the ceph_msg_data API from the messenger should be used.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 include/linux/ceph/osd_client.h |  34 ---------
 net/ceph/osd_client.c           | 118 --------------------------------
 2 files changed, 152 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index b1ec10c8a408..cddbb3e35859 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -50,40 +50,6 @@ struct ceph_osd {
 #define CEPH_OSD_SLAB_OPS	2
 #define CEPH_OSD_MAX_OPS	16
 
-enum ceph_osd_data_type {
-	CEPH_OSD_DATA_TYPE_NONE = 0,
-	CEPH_OSD_DATA_TYPE_PAGES,
-	CEPH_OSD_DATA_TYPE_PAGELIST,
-#ifdef CONFIG_BLOCK
-	CEPH_OSD_DATA_TYPE_BIO,
-#endif /* CONFIG_BLOCK */
-	CEPH_OSD_DATA_TYPE_BVECS,
-};
-
-struct ceph_osd_data {
-	enum ceph_osd_data_type	type;
-	union {
-		struct {
-			struct page	**pages;
-			u64		length;
-			u32		alignment;
-			bool		pages_from_pool;
-			bool		own_pages;
-		};
-		struct ceph_pagelist	*pagelist;
-#ifdef CONFIG_BLOCK
-		struct {
-			struct ceph_bio_iter	bio_pos;
-			u32			bio_length;
-		};
-#endif /* CONFIG_BLOCK */
-		struct {
-			struct ceph_bvec_iter	bvec_pos;
-			u32			num_bvecs;
-		};
-	};
-};
-
 struct ceph_osd_req_op {
 	u16 op;           /* CEPH_OSD_OP_* */
 	u32 flags;        /* CEPH_OSD_OP_FLAG_* */
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 56a4d5f196b3..5725e46d83e8 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -120,61 +120,6 @@ static int calc_layout(struct ceph_file_layout *layout, u64 off, u64 *plen,
 	return 0;
 }
 
-static void ceph_osd_data_init(struct ceph_osd_data *osd_data)
-{
-	memset(osd_data, 0, sizeof (*osd_data));
-	osd_data->type = CEPH_OSD_DATA_TYPE_NONE;
-}
-
-/*
- * Consumes @pages if @own_pages is true.
- */
-__attribute__((unused))
-static void ceph_osd_data_pages_init(struct ceph_osd_data *osd_data,
-			struct page **pages, u64 length, u32 alignment,
-			bool pages_from_pool, bool own_pages)
-{
-	osd_data->type = CEPH_OSD_DATA_TYPE_PAGES;
-	osd_data->pages = pages;
-	osd_data->length = length;
-	osd_data->alignment = alignment;
-	osd_data->pages_from_pool = pages_from_pool;
-	osd_data->own_pages = own_pages;
-}
-
-/*
- * Consumes a ref on @pagelist.
- */
-__attribute__((unused))
-static void ceph_osd_data_pagelist_init(struct ceph_osd_data *osd_data,
-			struct ceph_pagelist *pagelist)
-{
-	osd_data->type = CEPH_OSD_DATA_TYPE_PAGELIST;
-	osd_data->pagelist = pagelist;
-}
-
-#ifdef CONFIG_BLOCK
-__attribute__((unused))
-static void ceph_osd_data_bio_init(struct ceph_osd_data *osd_data,
-				   struct ceph_bio_iter *bio_pos,
-				   u32 bio_length)
-{
-	osd_data->type = CEPH_OSD_DATA_TYPE_BIO;
-	osd_data->bio_pos = *bio_pos;
-	osd_data->bio_length = bio_length;
-}
-#endif /* CONFIG_BLOCK */
-
-__attribute__((unused))
-static void ceph_osd_data_bvecs_init(struct ceph_osd_data *osd_data,
-				     struct ceph_bvec_iter *bvec_pos,
-				     u32 num_bvecs)
-{
-	osd_data->type = CEPH_OSD_DATA_TYPE_BVECS;
-	osd_data->bvec_pos = *bvec_pos;
-	osd_data->num_bvecs = num_bvecs;
-}
-
 static struct ceph_msg_data *
 osd_req_op_raw_data_in(struct ceph_osd_request *osd_req, unsigned int which)
 {
@@ -335,42 +280,6 @@ void osd_req_op_cls_response_data_pages(struct ceph_osd_request *osd_req,
 }
 EXPORT_SYMBOL(osd_req_op_cls_response_data_pages);
 
-static u64 ceph_osd_data_length(struct ceph_osd_data *osd_data)
-{
-	switch (osd_data->type) {
-	case CEPH_OSD_DATA_TYPE_NONE:
-		return 0;
-	case CEPH_OSD_DATA_TYPE_PAGES:
-		return osd_data->length;
-	case CEPH_OSD_DATA_TYPE_PAGELIST:
-		return (u64)osd_data->pagelist->length;
-#ifdef CONFIG_BLOCK
-	case CEPH_OSD_DATA_TYPE_BIO:
-		return (u64)osd_data->bio_length;
-#endif /* CONFIG_BLOCK */
-	case CEPH_OSD_DATA_TYPE_BVECS:
-		return osd_data->bvec_pos.iter.bi_size;
-	default:
-		WARN(true, "unrecognized data type %d\n", (int)osd_data->type);
-		return 0;
-	}
-}
-
-__attribute__((unused))
-static void ceph_osd_data_release(struct ceph_osd_data *osd_data)
-{
-	if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES && osd_data->own_pages) {
-		int num_pages;
-
-		num_pages = calc_pages_for((u64)osd_data->alignment,
-						(u64)osd_data->length);
-		ceph_release_page_vector(osd_data->pages, num_pages);
-	} else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) {
-		ceph_pagelist_release(osd_data->pagelist);
-	}
-	ceph_osd_data_init(osd_data);
-}
-
 static void osd_req_op_data_release(struct ceph_osd_request *osd_req,
 			unsigned int which)
 {
@@ -958,33 +867,6 @@ void osd_req_op_alloc_hint_init(struct ceph_osd_request *osd_req,
 }
 EXPORT_SYMBOL(osd_req_op_alloc_hint_init);
 
-__attribute__((unused))
-static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
-				struct ceph_osd_data *osd_data)
-{
-	u64 length = ceph_osd_data_length(osd_data);
-
-	if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGES) {
-		BUG_ON(length > (u64) SIZE_MAX);
-		if (length)
-			ceph_msg_data_add_pages(msg, osd_data->pages,
-					length, osd_data->alignment,
-					osd_data->pages_from_pool, false);
-	} else if (osd_data->type == CEPH_OSD_DATA_TYPE_PAGELIST) {
-		BUG_ON(!length);
-		ceph_msg_data_add_pagelist(msg, osd_data->pagelist);
-#ifdef CONFIG_BLOCK
-	} else if (osd_data->type == CEPH_OSD_DATA_TYPE_BIO) {
-		ceph_msg_data_add_bio(msg, &osd_data->bio_pos, length);
-#endif
-	} else if (osd_data->type == CEPH_OSD_DATA_TYPE_BVECS) {
-		ceph_msg_data_add_bvecs(msg, &osd_data->bvec_pos,
-					osd_data->num_bvecs);
-	} else {
-		BUG_ON(osd_data->type != CEPH_OSD_DATA_TYPE_NONE);
-	}
-}
-
 static u32 osd_req_encode_op(struct ceph_osd_op *dst,
 			     const struct ceph_osd_req_op *src)
 {
-- 
2.24.1


* [PATCH 05/16] libceph: remove unused last_piece out parameter from ceph_msg_data_next()
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (3 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 04/16] libceph: remove ceph_osd_data completely Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 06/16] libceph: switch data cursor from page to iov_iter for messenger Roman Penyaev
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

Remove it as it is not used anywhere.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 net/ceph/messenger.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 8f35ed01a576..08786d75b990 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1137,11 +1137,9 @@ static void ceph_msg_data_cursor_init(struct ceph_msg *msg, size_t length)
 /*
  * Return the page containing the next piece to process for a given
  * data item, and supply the page offset and length of that piece.
- * Indicate whether this is the last piece in this data item.
  */
 static struct page *ceph_msg_data_next(struct ceph_msg_data_cursor *cursor,
-					size_t *page_offset, size_t *length,
-					bool *last_piece)
+					size_t *page_offset, size_t *length)
 {
 	struct page *page;
 
@@ -1170,8 +1168,6 @@ static struct page *ceph_msg_data_next(struct ceph_msg_data_cursor *cursor,
 	BUG_ON(*page_offset + *length > PAGE_SIZE);
 	BUG_ON(!*length);
 	BUG_ON(*length > cursor->resid);
-	if (last_piece)
-		*last_piece = cursor->last_piece;
 
 	return page;
 }
@@ -1589,7 +1585,7 @@ static int write_partial_message_data(struct ceph_connection *con)
 			continue;
 		}
 
-		page = ceph_msg_data_next(cursor, &page_offset, &length, NULL);
+		page = ceph_msg_data_next(cursor, &page_offset, &length);
 		if (length == cursor->total_resid)
 			more = MSG_MORE;
 		ret = ceph_tcp_sendpage(con->sock, page, page_offset, length,
@@ -2336,7 +2332,7 @@ static int read_partial_msg_data(struct ceph_connection *con)
 			continue;
 		}
 
-		page = ceph_msg_data_next(cursor, &page_offset, &length, NULL);
+		page = ceph_msg_data_next(cursor, &page_offset, &length);
 		ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
 		if (ret <= 0) {
 			if (do_datacrc)
-- 
2.24.1


* [PATCH 06/16] libceph: switch data cursor from page to iov_iter for messenger
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (4 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 05/16] libceph: remove unused last_piece out parameter from ceph_msg_data_next() Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 07/16] libceph: use new tcp_sendiov() instead of tcp_sendmsg() " Roman Penyaev
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

The first reason is performance. Why not pass the whole iov_iter to
the socket read/write functions and let the socket API handle
everything at once, instead of doing IO page by page?  So it is better
to make the data cursor an iov_iter, which is generic for many API
calls.

The second reason is future support for kvec, i.e. the case when we do
not have a page at hand, but a plain buffer.

So this patch is a preparation, the first iteration: users of the data
cursor do not see pages, but use cursor->iter instead.  Internally the
cursor still uses pages; that will be avoided in the next patches.

We are still able to use sendpage() for 0-copy and get the performance
benefit from multi-pages, i.e. if a bvec in the iter is a multi-page,
then we pass the whole multi-page to sendpage() and not only 4k.

It is important to mention that for sendpage() MSG_SENDPAGE_NOTLAST is
always set if the @more flag is true.  We know that the footer of a
message will follow, @more will be false and all data will be
pushed out of the socket.
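
The resulting write-path flow is roughly the following (a simplified
sketch only: error handling, CRC and cursor advancement are omitted,
and the wrapper function itself is made up; ceph_msg_data_next() and
ceph_tcp_sendiov() are the helpers introduced below):

  /* Illustrative only: iterate the message data via cursor->iter. */
  static int example_write_msg_data(struct ceph_connection *con)
  {
          struct ceph_msg_data_cursor *cursor = &con->out_msg->cursor;
          int ret;

          while (cursor->total_resid) {
                  /* set up cursor->iter for the next chunk */
                  ceph_msg_data_next(cursor);
                  /* sendpage() for bvecs, sendmsg() otherwise */
                  ret = ceph_tcp_sendiov(con->sock, &cursor->iter,
                                         true /* footer still follows */);
                  if (ret <= 0)
                          return ret;
                  /* advance the cursor by 'ret' bytes here (not shown) */
          }
          return 1;
  }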

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 include/linux/ceph/messenger.h |   3 +
 net/ceph/messenger.c           | 141 ++++++++++++++++++++++-----------
 2 files changed, 97 insertions(+), 47 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 424f9f1989b7..044c74333c27 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -192,6 +192,9 @@ struct ceph_msg_data_cursor {
 	size_t			total_resid;	/* across all data items */
 
 	struct ceph_msg_data	*data;		/* current data item */
+	struct iov_iter         iter;           /* iterator for current data */
+	struct bio_vec          it_bvec;        /* used as an addition to it */
+	unsigned int            direction;      /* data direction */
 	size_t			resid;		/* bytes not yet consumed */
 	bool			last_piece;	/* current is last piece */
 	bool			need_crc;	/* crc update needed */
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 08786d75b990..709d9f26f755 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -523,6 +523,22 @@ static int ceph_tcp_recvmsg(struct socket *sock, void *buf, size_t len)
 	return r;
 }
 
+static int ceph_tcp_recviov(struct socket *sock, struct iov_iter *iter)
+{
+	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL,
+			      .msg_iter = *iter };
+	int r;
+
+	if (!iter->count)
+		msg.msg_flags |= MSG_TRUNC;
+
+	r = sock_recvmsg(sock, &msg, msg.msg_flags);
+	if (r == -EAGAIN)
+		r = 0;
+	return r;
+}
+
+__attribute__((unused))
 static int ceph_tcp_recvpage(struct socket *sock, struct page *page,
 		     int page_offset, size_t length)
 {
@@ -594,6 +610,42 @@ static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
 	return ret;
 }
 
+/**
+ * ceph_tcp_sendiov() - either does sendmsg() or 0-copy sendpage()
+ *
+ * @more is true if caller will be sending more data shortly.
+ */
+static int ceph_tcp_sendiov(struct socket *sock, struct iov_iter *iter,
+			    bool more)
+{
+	if (iov_iter_is_bvec(iter)) {
+		const struct bio_vec *bvec = &iter->bvec[0];
+		int flags = more ? MSG_MORE | MSG_SENDPAGE_NOTLAST : 0;
+
+		/* Do 0-copy instead of sendmsg */
+
+		return ceph_tcp_sendpage(sock, bvec->bv_page,
+					 iter->iov_offset + bvec->bv_offset,
+					 bvec->bv_len - iter->iov_offset,
+					 flags);
+	} else {
+		struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL,
+				      .msg_iter = *iter };
+		int r;
+
+		if (more)
+			msg.msg_flags |= MSG_MORE;
+		else
+			/* superfluous, but what the hell */
+			msg.msg_flags |= MSG_EOR;
+
+		r = sock_sendmsg(sock, &msg);
+		if (r == -EAGAIN)
+			r = 0;
+		return r;
+	}
+}
+
 /*
  * Shutdown/close the socket for the given connection.
  */
@@ -1086,12 +1138,7 @@ static bool ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
 }
 
 /*
- * Message data is handled (sent or received) in pieces, where each
- * piece resides on a single page.  The network layer might not
- * consume an entire piece at once.  A data item's cursor keeps
- * track of which piece is next to process and how much remains to
- * be processed in that piece.  It also tracks whether the current
- * piece is the last one in the data item.
+ * Message data is iterated (sent or received) by internal iov_iter.
  */
 static void __ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor)
 {
@@ -1120,7 +1167,8 @@ static void __ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor)
 	cursor->need_crc = true;
 }
 
-static void ceph_msg_data_cursor_init(struct ceph_msg *msg, size_t length)
+static void ceph_msg_data_cursor_init(unsigned int dir, struct ceph_msg *msg,
+				      size_t length)
 {
 	struct ceph_msg_data_cursor *cursor = &msg->cursor;
 
@@ -1130,33 +1178,33 @@ static void ceph_msg_data_cursor_init(struct ceph_msg *msg, size_t length)
 
 	cursor->total_resid = length;
 	cursor->data = msg->data;
+	cursor->direction = dir;
 
 	__ceph_msg_data_cursor_init(cursor);
 }
 
 /*
- * Return the page containing the next piece to process for a given
- * data item, and supply the page offset and length of that piece.
+ * Setups cursor->iter for the next piece to process.
  */
-static struct page *ceph_msg_data_next(struct ceph_msg_data_cursor *cursor,
-					size_t *page_offset, size_t *length)
+static void ceph_msg_data_next(struct ceph_msg_data_cursor *cursor)
 {
 	struct page *page;
+	size_t off, len;
 
 	switch (cursor->data->type) {
 	case CEPH_MSG_DATA_PAGELIST:
-		page = ceph_msg_data_pagelist_next(cursor, page_offset, length);
+		page = ceph_msg_data_pagelist_next(cursor, &off, &len);
 		break;
 	case CEPH_MSG_DATA_PAGES:
-		page = ceph_msg_data_pages_next(cursor, page_offset, length);
+		page = ceph_msg_data_pages_next(cursor, &off, &len);
 		break;
 #ifdef CONFIG_BLOCK
 	case CEPH_MSG_DATA_BIO:
-		page = ceph_msg_data_bio_next(cursor, page_offset, length);
+		page = ceph_msg_data_bio_next(cursor, &off, &len);
 		break;
 #endif /* CONFIG_BLOCK */
 	case CEPH_MSG_DATA_BVECS:
-		page = ceph_msg_data_bvecs_next(cursor, page_offset, length);
+		page = ceph_msg_data_bvecs_next(cursor, &off, &len);
 		break;
 	case CEPH_MSG_DATA_NONE:
 	default:
@@ -1165,11 +1213,16 @@ static struct page *ceph_msg_data_next(struct ceph_msg_data_cursor *cursor,
 	}
 
 	BUG_ON(!page);
-	BUG_ON(*page_offset + *length > PAGE_SIZE);
-	BUG_ON(!*length);
-	BUG_ON(*length > cursor->resid);
+	BUG_ON(off + len > PAGE_SIZE);
+	BUG_ON(!len);
+	BUG_ON(len > cursor->resid);
+
+	cursor->it_bvec.bv_page = page;
+	cursor->it_bvec.bv_len = len;
+	cursor->it_bvec.bv_offset = off;
 
-	return page;
+	iov_iter_bvec(&cursor->iter, cursor->direction,
+		      &cursor->it_bvec, 1, len);
 }
 
 /*
@@ -1220,11 +1273,12 @@ static size_t sizeof_footer(struct ceph_connection *con)
 	    sizeof(struct ceph_msg_footer_old);
 }
 
-static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
+static void prepare_message_data(unsigned int dir, struct ceph_msg *msg,
+				 u32 data_len)
 {
 	/* Initialize data cursor */
 
-	ceph_msg_data_cursor_init(msg, (size_t)data_len);
+	ceph_msg_data_cursor_init(dir, msg, (size_t)data_len);
 }
 
 /*
@@ -1331,7 +1385,7 @@ static void prepare_write_message(struct ceph_connection *con)
 	/* is there a data payload? */
 	con->out_msg->footer.data_crc = 0;
 	if (m->data_length) {
-		prepare_message_data(con->out_msg, m->data_length);
+		prepare_message_data(WRITE, con->out_msg, m->data_length);
 		con->out_more = 1;  /* data + footer will follow */
 	} else {
 		/* no, queue up footer too and be done */
@@ -1532,16 +1586,19 @@ static int write_partial_kvec(struct ceph_connection *con)
 	return ret;  /* done! */
 }
 
-static u32 ceph_crc32c_page(u32 crc, struct page *page,
-				unsigned int page_offset,
-				unsigned int length)
+static int crc32c_kvec(struct kvec *vec, void *p)
 {
-	char *kaddr;
+	u32 *crc = p;
 
-	kaddr = kmap(page);
-	BUG_ON(kaddr == NULL);
-	crc = crc32c(crc, kaddr + page_offset, length);
-	kunmap(page);
+	*crc = crc32c(*crc, vec->iov_base, vec->iov_len);
+
+	return 0;
+}
+
+static u32 ceph_crc32c_iov(u32 crc, struct iov_iter *iter,
+			   unsigned int length)
+{
+	iov_iter_for_each_range(iter, length, crc32c_kvec, &crc);
 
 	return crc;
 }
@@ -1557,7 +1614,6 @@ static int write_partial_message_data(struct ceph_connection *con)
 	struct ceph_msg *msg = con->out_msg;
 	struct ceph_msg_data_cursor *cursor = &msg->cursor;
 	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
-	int more = MSG_MORE | MSG_SENDPAGE_NOTLAST;
 	u32 crc;
 
 	dout("%s %p msg %p\n", __func__, con, msg);
@@ -1575,9 +1631,6 @@ static int write_partial_message_data(struct ceph_connection *con)
 	 */
 	crc = do_datacrc ? le32_to_cpu(msg->footer.data_crc) : 0;
 	while (cursor->total_resid) {
-		struct page *page;
-		size_t page_offset;
-		size_t length;
 		int ret;
 
 		if (!cursor->resid) {
@@ -1585,11 +1638,8 @@ static int write_partial_message_data(struct ceph_connection *con)
 			continue;
 		}
 
-		page = ceph_msg_data_next(cursor, &page_offset, &length);
-		if (length == cursor->total_resid)
-			more = MSG_MORE;
-		ret = ceph_tcp_sendpage(con->sock, page, page_offset, length,
-					more);
+		ceph_msg_data_next(cursor);
+		ret = ceph_tcp_sendiov(con->sock, &cursor->iter, true);
 		if (ret <= 0) {
 			if (do_datacrc)
 				msg->footer.data_crc = cpu_to_le32(crc);
@@ -1597,7 +1647,7 @@ static int write_partial_message_data(struct ceph_connection *con)
 			return ret;
 		}
 		if (do_datacrc && cursor->need_crc)
-			crc = ceph_crc32c_page(crc, page, page_offset, length);
+			crc = ceph_crc32c_iov(crc, &cursor->iter, ret);
 		ceph_msg_data_advance(cursor, (size_t)ret);
 	}
 
@@ -2315,9 +2365,6 @@ static int read_partial_msg_data(struct ceph_connection *con)
 	struct ceph_msg *msg = con->in_msg;
 	struct ceph_msg_data_cursor *cursor = &msg->cursor;
 	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
-	struct page *page;
-	size_t page_offset;
-	size_t length;
 	u32 crc = 0;
 	int ret;
 
@@ -2332,8 +2379,8 @@ static int read_partial_msg_data(struct ceph_connection *con)
 			continue;
 		}
 
-		page = ceph_msg_data_next(cursor, &page_offset, &length);
-		ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
+		ceph_msg_data_next(cursor);
+		ret = ceph_tcp_recviov(con->sock, &cursor->iter);
 		if (ret <= 0) {
 			if (do_datacrc)
 				con->in_data_crc = crc;
@@ -2342,7 +2389,7 @@ static int read_partial_msg_data(struct ceph_connection *con)
 		}
 
 		if (do_datacrc)
-			crc = ceph_crc32c_page(crc, page, page_offset, ret);
+			crc = ceph_crc32c_iov(crc, &cursor->iter, ret);
 		ceph_msg_data_advance(cursor, (size_t)ret);
 	}
 	if (do_datacrc)
@@ -2443,7 +2490,7 @@ static int read_partial_message(struct ceph_connection *con)
 		/* prepare for data payload, if any */
 
 		if (data_len)
-			prepare_message_data(con->in_msg, data_len);
+			prepare_message_data(READ, con->in_msg, data_len);
 	}
 
 	/* front */
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 07/16] libceph: use new tcp_sendiov() instead of tcp_sendmsg() for messenger
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (5 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 06/16] libceph: switch data cursor from page to iov_iter for messenger Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 08/16] libceph: remove unused tcp wrappers, now iov_iter is used " Roman Penyaev
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 net/ceph/messenger.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 709d9f26f755..b8ea6ce91a27 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -562,6 +562,7 @@ static int ceph_tcp_recvpage(struct socket *sock, struct page *page,
  * write something.  @more is true if caller will be sending more data
  * shortly.
  */
+__attribute__((unused))
 static int ceph_tcp_sendmsg(struct socket *sock, struct kvec *iov,
 			    size_t kvlen, size_t len, bool more)
 {
@@ -1552,13 +1553,14 @@ static int prepare_write_connect(struct ceph_connection *con)
  */
 static int write_partial_kvec(struct ceph_connection *con)
 {
+	struct iov_iter it;
 	int ret;
 
 	dout("write_partial_kvec %p %d left\n", con, con->out_kvec_bytes);
 	while (con->out_kvec_bytes > 0) {
-		ret = ceph_tcp_sendmsg(con->sock, con->out_kvec_cur,
-				       con->out_kvec_left, con->out_kvec_bytes,
-				       con->out_more);
+		iov_iter_kvec(&it, WRITE, con->out_kvec_cur,
+			      con->out_kvec_left, con->out_kvec_bytes);
+		ret = ceph_tcp_sendiov(con->sock, &it, con->out_more);
 		if (ret <= 0)
 			goto out;
 		con->out_kvec_bytes -= ret;
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 08/16] libceph: remove unused tcp wrappers, now iov_iter is used for messenger
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (6 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 07/16] libceph: use new tcp_sendiov() instead of tcp_sendmsg() " Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 09/16] libceph: no need for cursor->need_crc " Roman Penyaev
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 net/ceph/messenger.c | 42 ------------------------------------------
 1 file changed, 42 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index b8ea6ce91a27..8f867c8dc481 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -538,48 +538,6 @@ static int ceph_tcp_recviov(struct socket *sock, struct iov_iter *iter)
 	return r;
 }
 
-__attribute__((unused))
-static int ceph_tcp_recvpage(struct socket *sock, struct page *page,
-		     int page_offset, size_t length)
-{
-	struct bio_vec bvec = {
-		.bv_page = page,
-		.bv_offset = page_offset,
-		.bv_len = length
-	};
-	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL };
-	int r;
-
-	BUG_ON(page_offset + length > PAGE_SIZE);
-	iov_iter_bvec(&msg.msg_iter, READ, &bvec, 1, length);
-	r = sock_recvmsg(sock, &msg, msg.msg_flags);
-	if (r == -EAGAIN)
-		r = 0;
-	return r;
-}
-
-/*
- * write something.  @more is true if caller will be sending more data
- * shortly.
- */
-__attribute__((unused))
-static int ceph_tcp_sendmsg(struct socket *sock, struct kvec *iov,
-			    size_t kvlen, size_t len, bool more)
-{
-	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL };
-	int r;
-
-	if (more)
-		msg.msg_flags |= MSG_MORE;
-	else
-		msg.msg_flags |= MSG_EOR;  /* superfluous, but what the hell */
-
-	r = kernel_sendmsg(sock, &msg, iov, kvlen, len);
-	if (r == -EAGAIN)
-		r = 0;
-	return r;
-}
-
 /*
  * @more: either or both of MSG_MORE and MSG_SENDPAGE_NOTLAST
  */
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 09/16] libceph: no need for cursor->need_crc for messenger
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (7 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 08/16] libceph: remove unused tcp wrappers, now iov_iter is used " Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 10/16] libceph: remove ->last_piece member for message data cursor Roman Penyaev
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

I want to simplify the cursor and switch to iov_iter.  Here I get rid
of ->need_crc: now the CRC is calculated not for one page at a time,
but over exactly the number of bytes written to the socket.  So
new_piece and ->need_crc can be removed completely.
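
This relies on crc32c() folding over arbitrary chunk sizes.  As an
illustrative snippet ('buf' is assumed to be a u8 buffer that was sent
in two chunks of ret1 and ret2 bytes, <linux/crc32c.h> is needed):

  u32 crc = 0;

  /* accumulate per sendmsg()/recvmsg() return value ... */
  crc = crc32c(crc, buf, ret1);
  crc = crc32c(crc, buf + ret1, ret2);

  /* ... gives the same result as one pass over the whole buffer */
  WARN_ON(crc != crc32c(0, buf, ret1 + ret2));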

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 include/linux/ceph/messenger.h |  1 -
 net/ceph/messenger.c           | 55 +++++++++++++---------------------
 2 files changed, 20 insertions(+), 36 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 044c74333c27..82a7fb0018e3 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -197,7 +197,6 @@ struct ceph_msg_data_cursor {
 	unsigned int            direction;      /* data direction */
 	size_t			resid;		/* bytes not yet consumed */
 	bool			last_piece;	/* current is last piece */
-	bool			need_crc;	/* crc update needed */
 	union {
 #ifdef CONFIG_BLOCK
 		struct ceph_bio_iter	bio_iter;
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 8f867c8dc481..6423edf5cf65 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -849,8 +849,8 @@ static struct page *ceph_msg_data_bio_next(struct ceph_msg_data_cursor *cursor,
 	return bv.bv_page;
 }
 
-static bool ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
-					size_t bytes)
+static void ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
+				      size_t bytes)
 {
 	struct ceph_bio_iter *it = &cursor->bio_iter;
 	struct page *page = bio_iter_page(it->bio, it->iter);
@@ -862,12 +862,12 @@ static bool ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
 
 	if (!cursor->resid) {
 		BUG_ON(!cursor->last_piece);
-		return false;   /* no more data */
+		return;   /* no more data */
 	}
 
 	if (!bytes || (it->iter.bi_size && it->iter.bi_bvec_done &&
 		       page == bio_iter_page(it->bio, it->iter)))
-		return false;	/* more bytes to process in this segment */
+		return;	/* more bytes to process in this segment */
 
 	if (!it->iter.bi_size) {
 		it->bio = it->bio->bi_next;
@@ -879,7 +879,6 @@ static bool ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
 	BUG_ON(cursor->last_piece);
 	BUG_ON(cursor->resid < bio_iter_len(it->bio, it->iter));
 	cursor->last_piece = cursor->resid == bio_iter_len(it->bio, it->iter);
-	return true;
 }
 #endif /* CONFIG_BLOCK */
 
@@ -910,7 +909,7 @@ static struct page *ceph_msg_data_bvecs_next(struct ceph_msg_data_cursor *cursor
 	return bv.bv_page;
 }
 
-static bool ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor,
+static void ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor,
 					size_t bytes)
 {
 	struct bio_vec *bvecs = cursor->data->bvec_pos.bvecs;
@@ -923,18 +922,17 @@ static bool ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor,
 
 	if (!cursor->resid) {
 		BUG_ON(!cursor->last_piece);
-		return false;   /* no more data */
+		return;   /* no more data */
 	}
 
 	if (!bytes || (cursor->bvec_iter.bi_bvec_done &&
 		       page == bvec_iter_page(bvecs, cursor->bvec_iter)))
-		return false;	/* more bytes to process in this segment */
+		return;	/* more bytes to process in this segment */
 
 	BUG_ON(cursor->last_piece);
 	BUG_ON(cursor->resid < bvec_iter_len(bvecs, cursor->bvec_iter));
 	cursor->last_piece =
 	    cursor->resid == bvec_iter_len(bvecs, cursor->bvec_iter);
-	return true;
 }
 
 /*
@@ -982,8 +980,8 @@ ceph_msg_data_pages_next(struct ceph_msg_data_cursor *cursor,
 	return data->pages[cursor->page_index];
 }
 
-static bool ceph_msg_data_pages_advance(struct ceph_msg_data_cursor *cursor,
-						size_t bytes)
+static void ceph_msg_data_pages_advance(struct ceph_msg_data_cursor *cursor,
+					size_t bytes)
 {
 	BUG_ON(cursor->data->type != CEPH_MSG_DATA_PAGES);
 
@@ -994,18 +992,16 @@ static bool ceph_msg_data_pages_advance(struct ceph_msg_data_cursor *cursor,
 	cursor->resid -= bytes;
 	cursor->page_offset = (cursor->page_offset + bytes) & ~PAGE_MASK;
 	if (!bytes || cursor->page_offset)
-		return false;	/* more bytes to process in the current page */
+		return;	/* more bytes to process in the current page */
 
 	if (!cursor->resid)
-		return false;   /* no more data */
+		return;   /* no more data */
 
 	/* Move on to the next page; offset is already at 0 */
 
 	BUG_ON(cursor->page_index >= cursor->page_count);
 	cursor->page_index++;
 	cursor->last_piece = cursor->resid <= PAGE_SIZE;
-
-	return true;
 }
 
 /*
@@ -1062,8 +1058,8 @@ ceph_msg_data_pagelist_next(struct ceph_msg_data_cursor *cursor,
 	return cursor->page;
 }
 
-static bool ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
-						size_t bytes)
+static void ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
+					   size_t bytes)
 {
 	struct ceph_msg_data *data = cursor->data;
 	struct ceph_pagelist *pagelist;
@@ -1082,18 +1078,16 @@ static bool ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
 	cursor->offset += bytes;
 	/* offset of first page in pagelist is always 0 */
 	if (!bytes || cursor->offset & ~PAGE_MASK)
-		return false;	/* more bytes to process in the current page */
+		return;	/* more bytes to process in the current page */
 
 	if (!cursor->resid)
-		return false;   /* no more data */
+		return;   /* no more data */
 
 	/* Move on to the next page */
 
 	BUG_ON(list_is_last(&cursor->page->lru, &pagelist->head));
 	cursor->page = list_next_entry(cursor->page, lru);
 	cursor->last_piece = cursor->resid <= PAGE_SIZE;
-
-	return true;
 }
 
 /*
@@ -1123,7 +1117,6 @@ static void __ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor)
 		/* BUG(); */
 		break;
 	}
-	cursor->need_crc = true;
 }
 
 static void ceph_msg_data_cursor_init(unsigned int dir, struct ceph_msg *msg,
@@ -1184,30 +1177,24 @@ static void ceph_msg_data_next(struct ceph_msg_data_cursor *cursor)
 		      &cursor->it_bvec, 1, len);
 }
 
-/*
- * Returns true if the result moves the cursor on to the next piece
- * of the data item.
- */
 static void ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor,
 				  size_t bytes)
 {
-	bool new_piece;
-
 	BUG_ON(bytes > cursor->resid);
 	switch (cursor->data->type) {
 	case CEPH_MSG_DATA_PAGELIST:
-		new_piece = ceph_msg_data_pagelist_advance(cursor, bytes);
+		ceph_msg_data_pagelist_advance(cursor, bytes);
 		break;
 	case CEPH_MSG_DATA_PAGES:
-		new_piece = ceph_msg_data_pages_advance(cursor, bytes);
+		ceph_msg_data_pages_advance(cursor, bytes);
 		break;
 #ifdef CONFIG_BLOCK
 	case CEPH_MSG_DATA_BIO:
-		new_piece = ceph_msg_data_bio_advance(cursor, bytes);
+		ceph_msg_data_bio_advance(cursor, bytes);
 		break;
 #endif /* CONFIG_BLOCK */
 	case CEPH_MSG_DATA_BVECS:
-		new_piece = ceph_msg_data_bvecs_advance(cursor, bytes);
+		ceph_msg_data_bvecs_advance(cursor, bytes);
 		break;
 	case CEPH_MSG_DATA_NONE:
 	default:
@@ -1220,9 +1207,7 @@ static void ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor,
 		WARN_ON(!cursor->last_piece);
 		cursor->data++;
 		__ceph_msg_data_cursor_init(cursor);
-		new_piece = true;
 	}
-	cursor->need_crc = new_piece;
 }
 
 static size_t sizeof_footer(struct ceph_connection *con)
@@ -1606,7 +1591,7 @@ static int write_partial_message_data(struct ceph_connection *con)
 
 			return ret;
 		}
-		if (do_datacrc && cursor->need_crc)
+		if (do_datacrc)
 			crc = ceph_crc32c_iov(crc, &cursor->iter, ret);
 		ceph_msg_data_advance(cursor, (size_t)ret);
 	}
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 10/16] libceph: remove ->last_piece member for message data cursor
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (8 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 09/16] libceph: no need for cursor->need_crc " Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 11/16] libceph: remove not necessary checks on doing advance on bio and bvecs cursor Roman Penyaev
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

No need to keep this strange member: it is a) not used and b) can
always be calculated by comparing the offset with PAGE_SIZE.
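
For reference, the removed member is equivalent to the following check
done on demand (a sketch; the helper name is made up):

  static inline bool cursor_on_last_piece(size_t page_offset, size_t resid)
  {
  	/* true if the remaining bytes fit into the current page */
  	return page_offset + resid <= PAGE_SIZE;
  }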

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 include/linux/ceph/messenger.h |   1 -
 net/ceph/messenger.c           | 101 +++++++++++----------------------
 2 files changed, 33 insertions(+), 69 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 82a7fb0018e3..bc25f5f0e729 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -196,7 +196,6 @@ struct ceph_msg_data_cursor {
 	struct bio_vec          it_bvec;        /* used as an addition to it */
 	unsigned int            direction;      /* data direction */
 	size_t			resid;		/* bytes not yet consumed */
-	bool			last_piece;	/* current is last piece */
 	union {
 #ifdef CONFIG_BLOCK
 		struct ceph_bio_iter	bio_iter;
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 6423edf5cf65..3f8a47de62c7 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -815,6 +815,18 @@ static int con_out_kvec_skip(struct ceph_connection *con)
 	return skip;
 }
 
+static void ceph_msg_data_set_iter(struct ceph_msg_data_cursor *cursor,
+				   struct page *page, size_t offset,
+				   size_t length)
+{
+	cursor->it_bvec.bv_page = page;
+	cursor->it_bvec.bv_len = length;
+	cursor->it_bvec.bv_offset = offset;
+
+	iov_iter_bvec(&cursor->iter, cursor->direction,
+		      &cursor->it_bvec, 1, length);
+}
+
 #ifdef CONFIG_BLOCK
 
 /*
@@ -834,19 +846,15 @@ static void ceph_msg_data_bio_cursor_init(struct ceph_msg_data_cursor *cursor,
 		it->iter.bi_size = cursor->resid;
 
 	BUG_ON(cursor->resid < bio_iter_len(it->bio, it->iter));
-	cursor->last_piece = cursor->resid == bio_iter_len(it->bio, it->iter);
 }
 
-static struct page *ceph_msg_data_bio_next(struct ceph_msg_data_cursor *cursor,
-						size_t *page_offset,
-						size_t *length)
+static void ceph_msg_data_bio_next(struct ceph_msg_data_cursor *cursor)
 {
 	struct bio_vec bv = bio_iter_iovec(cursor->bio_iter.bio,
 					   cursor->bio_iter.iter);
 
-	*page_offset = bv.bv_offset;
-	*length = bv.bv_len;
-	return bv.bv_page;
+	ceph_msg_data_set_iter(cursor, bv.bv_page,
+			       bv.bv_offset, bv.bv_len);
 }
 
 static void ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
@@ -861,7 +869,6 @@ static void ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
 	bio_advance_iter(it->bio, &it->iter, bytes);
 
 	if (!cursor->resid) {
-		BUG_ON(!cursor->last_piece);
 		return;   /* no more data */
 	}
 
@@ -876,9 +883,7 @@ static void ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
 			it->iter.bi_size = cursor->resid;
 	}
 
-	BUG_ON(cursor->last_piece);
 	BUG_ON(cursor->resid < bio_iter_len(it->bio, it->iter));
-	cursor->last_piece = cursor->resid == bio_iter_len(it->bio, it->iter);
 }
 #endif /* CONFIG_BLOCK */
 
@@ -893,20 +898,15 @@ static void ceph_msg_data_bvecs_cursor_init(struct ceph_msg_data_cursor *cursor,
 	cursor->bvec_iter.bi_size = cursor->resid;
 
 	BUG_ON(cursor->resid < bvec_iter_len(bvecs, cursor->bvec_iter));
-	cursor->last_piece =
-	    cursor->resid == bvec_iter_len(bvecs, cursor->bvec_iter);
 }
 
-static struct page *ceph_msg_data_bvecs_next(struct ceph_msg_data_cursor *cursor,
-						size_t *page_offset,
-						size_t *length)
+static void ceph_msg_data_bvecs_next(struct ceph_msg_data_cursor *cursor)
 {
 	struct bio_vec bv = bvec_iter_bvec(cursor->data->bvec_pos.bvecs,
 					   cursor->bvec_iter);
 
-	*page_offset = bv.bv_offset;
-	*length = bv.bv_len;
-	return bv.bv_page;
+	ceph_msg_data_set_iter(cursor, bv.bv_page,
+			       bv.bv_offset, bv.bv_len);
 }
 
 static void ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor,
@@ -921,7 +921,6 @@ static void ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor,
 	bvec_iter_advance(bvecs, &cursor->bvec_iter, bytes);
 
 	if (!cursor->resid) {
-		BUG_ON(!cursor->last_piece);
 		return;   /* no more data */
 	}
 
@@ -929,10 +928,7 @@ static void ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor,
 		       page == bvec_iter_page(bvecs, cursor->bvec_iter)))
 		return;	/* more bytes to process in this segment */
 
-	BUG_ON(cursor->last_piece);
 	BUG_ON(cursor->resid < bvec_iter_len(bvecs, cursor->bvec_iter));
-	cursor->last_piece =
-	    cursor->resid == bvec_iter_len(bvecs, cursor->bvec_iter);
 }
 
 /*
@@ -957,12 +953,9 @@ static void ceph_msg_data_pages_cursor_init(struct ceph_msg_data_cursor *cursor,
 	BUG_ON(page_count > (int)USHRT_MAX);
 	cursor->page_count = (unsigned short)page_count;
 	BUG_ON(length > SIZE_MAX - cursor->page_offset);
-	cursor->last_piece = cursor->page_offset + cursor->resid <= PAGE_SIZE;
 }
 
-static struct page *
-ceph_msg_data_pages_next(struct ceph_msg_data_cursor *cursor,
-					size_t *page_offset, size_t *length)
+static void ceph_msg_data_pages_next(struct ceph_msg_data_cursor *cursor)
 {
 	struct ceph_msg_data *data = cursor->data;
 
@@ -971,13 +964,10 @@ ceph_msg_data_pages_next(struct ceph_msg_data_cursor *cursor,
 	BUG_ON(cursor->page_index >= cursor->page_count);
 	BUG_ON(cursor->page_offset >= PAGE_SIZE);
 
-	*page_offset = cursor->page_offset;
-	if (cursor->last_piece)
-		*length = cursor->resid;
-	else
-		*length = PAGE_SIZE - *page_offset;
-
-	return data->pages[cursor->page_index];
+	ceph_msg_data_set_iter(cursor, data->pages[cursor->page_index],
+			       cursor->page_offset,
+			       min(PAGE_SIZE - cursor->page_offset,
+				   cursor->resid));
 }
 
 static void ceph_msg_data_pages_advance(struct ceph_msg_data_cursor *cursor,
@@ -1001,7 +991,6 @@ static void ceph_msg_data_pages_advance(struct ceph_msg_data_cursor *cursor,
 
 	BUG_ON(cursor->page_index >= cursor->page_count);
 	cursor->page_index++;
-	cursor->last_piece = cursor->resid <= PAGE_SIZE;
 }
 
 /*
@@ -1030,12 +1019,9 @@ ceph_msg_data_pagelist_cursor_init(struct ceph_msg_data_cursor *cursor,
 	cursor->resid = min(length, pagelist->length);
 	cursor->page = page;
 	cursor->offset = 0;
-	cursor->last_piece = cursor->resid <= PAGE_SIZE;
 }
 
-static struct page *
-ceph_msg_data_pagelist_next(struct ceph_msg_data_cursor *cursor,
-				size_t *page_offset, size_t *length)
+static void ceph_msg_data_pagelist_next(struct ceph_msg_data_cursor *cursor)
 {
 	struct ceph_msg_data *data = cursor->data;
 	struct ceph_pagelist *pagelist;
@@ -1048,14 +1034,10 @@ ceph_msg_data_pagelist_next(struct ceph_msg_data_cursor *cursor,
 	BUG_ON(!cursor->page);
 	BUG_ON(cursor->offset + cursor->resid != pagelist->length);
 
-	/* offset of first page in pagelist is always 0 */
-	*page_offset = cursor->offset & ~PAGE_MASK;
-	if (cursor->last_piece)
-		*length = cursor->resid;
-	else
-		*length = PAGE_SIZE - *page_offset;
-
-	return cursor->page;
+	ceph_msg_data_set_iter(cursor, cursor->page,
+			       cursor->offset & ~PAGE_MASK,
+			       min(PAGE_SIZE - cursor->offset,
+				   cursor->resid));
 }
 
 static void ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
@@ -1087,7 +1069,6 @@ static void ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
 
 	BUG_ON(list_is_last(&cursor->page->lru, &pagelist->head));
 	cursor->page = list_next_entry(cursor->page, lru);
-	cursor->last_piece = cursor->resid <= PAGE_SIZE;
 }
 
 /*
@@ -1140,41 +1121,26 @@ static void ceph_msg_data_cursor_init(unsigned int dir, struct ceph_msg *msg,
  */
 static void ceph_msg_data_next(struct ceph_msg_data_cursor *cursor)
 {
-	struct page *page;
-	size_t off, len;
-
 	switch (cursor->data->type) {
 	case CEPH_MSG_DATA_PAGELIST:
-		page = ceph_msg_data_pagelist_next(cursor, &off, &len);
+		ceph_msg_data_pagelist_next(cursor);
 		break;
 	case CEPH_MSG_DATA_PAGES:
-		page = ceph_msg_data_pages_next(cursor, &off, &len);
+		ceph_msg_data_pages_next(cursor);
 		break;
 #ifdef CONFIG_BLOCK
 	case CEPH_MSG_DATA_BIO:
-		page = ceph_msg_data_bio_next(cursor, &off, &len);
+		ceph_msg_data_bio_next(cursor);
 		break;
 #endif /* CONFIG_BLOCK */
 	case CEPH_MSG_DATA_BVECS:
-		page = ceph_msg_data_bvecs_next(cursor, &off, &len);
+		ceph_msg_data_bvecs_next(cursor);
 		break;
 	case CEPH_MSG_DATA_NONE:
 	default:
-		page = NULL;
+		BUG();
 		break;
 	}
-
-	BUG_ON(!page);
-	BUG_ON(off + len > PAGE_SIZE);
-	BUG_ON(!len);
-	BUG_ON(len > cursor->resid);
-
-	cursor->it_bvec.bv_page = page;
-	cursor->it_bvec.bv_len = len;
-	cursor->it_bvec.bv_offset = off;
-
-	iov_iter_bvec(&cursor->iter, cursor->direction,
-		      &cursor->it_bvec, 1, len);
 }
 
 static void ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor,
@@ -1204,7 +1170,6 @@ static void ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor,
 	cursor->total_resid -= bytes;
 
 	if (!cursor->resid && cursor->total_resid) {
-		WARN_ON(!cursor->last_piece);
 		cursor->data++;
 		__ceph_msg_data_cursor_init(cursor);
 	}
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 11/16] libceph: remove not necessary checks on doing advance on bio and bvecs cursor
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (9 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 10/16] libceph: remove ->last_piece member for message data cursor Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 12/16] libceph: switch bvecs cursor to iov_iter for messenger Roman Penyaev
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

These checks were used to return false, indicating that we are still
on the same page.  The caller is no longer interested in that, so just
remove them.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 net/ceph/messenger.c | 17 +----------------
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 3f8a47de62c7..7465039da9f5 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -861,20 +861,14 @@ static void ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
 				      size_t bytes)
 {
 	struct ceph_bio_iter *it = &cursor->bio_iter;
-	struct page *page = bio_iter_page(it->bio, it->iter);
 
 	BUG_ON(bytes > cursor->resid);
 	BUG_ON(bytes > bio_iter_len(it->bio, it->iter));
 	cursor->resid -= bytes;
 	bio_advance_iter(it->bio, &it->iter, bytes);
 
-	if (!cursor->resid) {
+	if (!bytes || !cursor->resid)
 		return;   /* no more data */
-	}
-
-	if (!bytes || (it->iter.bi_size && it->iter.bi_bvec_done &&
-		       page == bio_iter_page(it->bio, it->iter)))
-		return;	/* more bytes to process in this segment */
 
 	if (!it->iter.bi_size) {
 		it->bio = it->bio->bi_next;
@@ -913,21 +907,12 @@ static void ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor,
 					size_t bytes)
 {
 	struct bio_vec *bvecs = cursor->data->bvec_pos.bvecs;
-	struct page *page = bvec_iter_page(bvecs, cursor->bvec_iter);
 
 	BUG_ON(bytes > cursor->resid);
 	BUG_ON(bytes > bvec_iter_len(bvecs, cursor->bvec_iter));
 	cursor->resid -= bytes;
 	bvec_iter_advance(bvecs, &cursor->bvec_iter, bytes);
 
-	if (!cursor->resid) {
-		return;   /* no more data */
-	}
-
-	if (!bytes || (cursor->bvec_iter.bi_bvec_done &&
-		       page == bvec_iter_page(bvecs, cursor->bvec_iter)))
-		return;	/* more bytes to process in this segment */
-
 	BUG_ON(cursor->resid < bvec_iter_len(bvecs, cursor->bvec_iter));
 }
 
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 12/16] libceph: switch bvecs cursor to iov_iter for messenger
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (10 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 11/16] libceph: remove not necessary checks on doing advance on bio and bvecs cursor Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 13/16] libceph: switch bio " Roman Penyaev
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

Now pages are not visible to the bvecs data handling and the iov_iter
API is used instead.
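
In other words, the whole bvec array is handed to the iterator and
iov_iter_advance() does the per-segment bookkeeping.  Roughly (a
sketch with made-up variables, reusing ceph_tcp_sendiov() from this
series):

  struct iov_iter it;
  int ret;

  iov_iter_bvec(&it, WRITE, bvecs, num_bvecs, total_len);
  while (iov_iter_count(&it)) {
  	ret = ceph_tcp_sendiov(sock, &it, true);
  	if (ret <= 0)
  		break;
  	/* advance crosses bvec boundaries by itself */
  	iov_iter_advance(&it, ret);
  }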

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 include/linux/ceph/messenger.h |  1 -
 net/ceph/messenger.c           | 20 +++++---------------
 2 files changed, 5 insertions(+), 16 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index bc25f5f0e729..89874fe7153b 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -200,7 +200,6 @@ struct ceph_msg_data_cursor {
 #ifdef CONFIG_BLOCK
 		struct ceph_bio_iter	bio_iter;
 #endif /* CONFIG_BLOCK */
-		struct bvec_iter	bvec_iter;
 		struct {				/* pages */
 			unsigned int	page_offset;	/* offset in page */
 			unsigned short	page_index;	/* index in array */
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 7465039da9f5..19f85bb85340 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -885,35 +885,25 @@ static void ceph_msg_data_bvecs_cursor_init(struct ceph_msg_data_cursor *cursor,
 					size_t length)
 {
 	struct ceph_msg_data *data = cursor->data;
-	struct bio_vec *bvecs = data->bvec_pos.bvecs;
 
 	cursor->resid = min_t(size_t, length, data->bvec_pos.iter.bi_size);
-	cursor->bvec_iter = data->bvec_pos.iter;
-	cursor->bvec_iter.bi_size = cursor->resid;
 
-	BUG_ON(cursor->resid < bvec_iter_len(bvecs, cursor->bvec_iter));
+	iov_iter_bvec(&cursor->iter, cursor->direction, data->bvec_pos.bvecs,
+		      data->num_bvecs, cursor->resid);
 }
 
 static void ceph_msg_data_bvecs_next(struct ceph_msg_data_cursor *cursor)
 {
-	struct bio_vec bv = bvec_iter_bvec(cursor->data->bvec_pos.bvecs,
-					   cursor->bvec_iter);
-
-	ceph_msg_data_set_iter(cursor, bv.bv_page,
-			       bv.bv_offset, bv.bv_len);
+	/* Nothing here */
 }
 
 static void ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor,
 					size_t bytes)
 {
-	struct bio_vec *bvecs = cursor->data->bvec_pos.bvecs;
-
 	BUG_ON(bytes > cursor->resid);
-	BUG_ON(bytes > bvec_iter_len(bvecs, cursor->bvec_iter));
+	BUG_ON(bytes > iov_iter_count(&cursor->iter));
 	cursor->resid -= bytes;
-	bvec_iter_advance(bvecs, &cursor->bvec_iter, bytes);
-
-	BUG_ON(cursor->resid < bvec_iter_len(bvecs, cursor->bvec_iter));
+	iov_iter_advance(&cursor->iter, bytes);
 }
 
 /*
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 13/16] libceph: switch bio cursor to iov_iter for messenger
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (11 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 12/16] libceph: switch bvecs cursor to iov_iter for messenger Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 14/16] libceph: switch pages " Roman Penyaev
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

The data cursor of bio type now uses bio->bi_io_vec directly.
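
For context: since the multi-page bvec work a single bi_io_vec entry
may cover several contiguous pages.  A quick way to see that is
bio_for_each_bvec(), which walks the vector as stored, unlike the
per-page bio_for_each_segment().  Illustrative sketch only, 'bio' is
assumed to be a filled-in struct bio pointer:

  struct bio_vec bv;
  struct bvec_iter iter;

  /* multi-page segments, exactly as stored in bio->bi_io_vec */
  bio_for_each_bvec(bv, bio, iter)
  	pr_debug("bvec: %u bytes at offset %u\n", bv.bv_len, bv.bv_offset);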

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 net/ceph/messenger.c | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 19f85bb85340..ea91f94096f1 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -829,6 +829,22 @@ static void ceph_msg_data_set_iter(struct ceph_msg_data_cursor *cursor,
 
 #ifdef CONFIG_BLOCK
 
+static void set_bio_iter_to_iov_iter(struct ceph_msg_data_cursor *cursor)
+{
+	struct ceph_bio_iter *it = &cursor->bio_iter;
+
+	iov_iter_bvec(&cursor->iter, cursor->direction,
+		      it->bio->bi_io_vec + it->iter.bi_idx,
+		      it->bio->bi_vcnt - it->iter.bi_idx,
+		      it->iter.bi_size);
+	/*
+	 * Careful here: use multipage offset, because we need an offset
+	 * in the whole bvec, not in a page
+	 */
+	cursor->iter.iov_offset =
+		mp_bvec_iter_offset(cursor->iter.bvec, it->iter);
+}
+
 /*
  * For a bio data item, a piece is whatever remains of the next
  * entry in the current bio iovec, or the first entry in the next
@@ -846,15 +862,12 @@ static void ceph_msg_data_bio_cursor_init(struct ceph_msg_data_cursor *cursor,
 		it->iter.bi_size = cursor->resid;
 
 	BUG_ON(cursor->resid < bio_iter_len(it->bio, it->iter));
+	set_bio_iter_to_iov_iter(cursor);
 }
 
 static void ceph_msg_data_bio_next(struct ceph_msg_data_cursor *cursor)
 {
-	struct bio_vec bv = bio_iter_iovec(cursor->bio_iter.bio,
-					   cursor->bio_iter.iter);
-
-	ceph_msg_data_set_iter(cursor, bv.bv_page,
-			       bv.bv_offset, bv.bv_len);
+	/* Nothing here */
 }
 
 static void ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
@@ -863,21 +876,23 @@ static void ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
 	struct ceph_bio_iter *it = &cursor->bio_iter;
 
 	BUG_ON(bytes > cursor->resid);
-	BUG_ON(bytes > bio_iter_len(it->bio, it->iter));
+	BUG_ON(bytes > iov_iter_count(&cursor->iter));
 	cursor->resid -= bytes;
-	bio_advance_iter(it->bio, &it->iter, bytes);
+	iov_iter_advance(&cursor->iter, bytes);
 
 	if (!bytes || !cursor->resid)
 		return;   /* no more data */
 
-	if (!it->iter.bi_size) {
+	if (!iov_iter_count(&cursor->iter)) {
 		it->bio = it->bio->bi_next;
 		it->iter = it->bio->bi_iter;
 		if (cursor->resid < it->iter.bi_size)
 			it->iter.bi_size = cursor->resid;
+
+		set_bio_iter_to_iov_iter(cursor);
 	}
 
-	BUG_ON(cursor->resid < bio_iter_len(it->bio, it->iter));
+	BUG_ON(cursor->resid != iov_iter_count(&cursor->iter));
 }
 #endif /* CONFIG_BLOCK */
 
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 14/16] libceph: switch pages cursor to iov_iter for messenger
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (12 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 13/16] libceph: switch bio " Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 15/16] libceph: switch pagelist " Roman Penyaev
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

Though it still uses pages, ceph_msg_data_pages_next() is a noop now.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 include/linux/ceph/messenger.h |  1 -
 net/ceph/messenger.c           | 33 ++++++++++++++-------------------
 2 files changed, 14 insertions(+), 20 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 89874fe7153b..822182ac4386 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -201,7 +201,6 @@ struct ceph_msg_data_cursor {
 		struct ceph_bio_iter	bio_iter;
 #endif /* CONFIG_BLOCK */
 		struct {				/* pages */
-			unsigned int	page_offset;	/* offset in page */
 			unsigned short	page_index;	/* index in array */
 			unsigned short	page_count;	/* pages in array */
 		};
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index ea91f94096f1..288f3c66a4d1 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -929,6 +929,7 @@ static void ceph_msg_data_pages_cursor_init(struct ceph_msg_data_cursor *cursor,
 					size_t length)
 {
 	struct ceph_msg_data *data = cursor->data;
+	unsigned int page_offset;
 	int page_count;
 
 	BUG_ON(data->type != CEPH_MSG_DATA_PAGES);
@@ -938,26 +939,20 @@ static void ceph_msg_data_pages_cursor_init(struct ceph_msg_data_cursor *cursor,
 
 	cursor->resid = min(length, data->length);
 	page_count = calc_pages_for(data->alignment, (u64)data->length);
-	cursor->page_offset = data->alignment & ~PAGE_MASK;
+	page_offset = data->alignment & ~PAGE_MASK;
 	cursor->page_index = 0;
 	BUG_ON(page_count > (int)USHRT_MAX);
 	cursor->page_count = (unsigned short)page_count;
-	BUG_ON(length > SIZE_MAX - cursor->page_offset);
+	BUG_ON(length > SIZE_MAX - page_offset);
+
+	ceph_msg_data_set_iter(cursor, data->pages[cursor->page_index],
+			       page_offset, min(PAGE_SIZE - page_offset,
+						cursor->resid));
 }
 
 static void ceph_msg_data_pages_next(struct ceph_msg_data_cursor *cursor)
 {
-	struct ceph_msg_data *data = cursor->data;
-
-	BUG_ON(data->type != CEPH_MSG_DATA_PAGES);
-
-	BUG_ON(cursor->page_index >= cursor->page_count);
-	BUG_ON(cursor->page_offset >= PAGE_SIZE);
-
-	ceph_msg_data_set_iter(cursor, data->pages[cursor->page_index],
-			       cursor->page_offset,
-			       min(PAGE_SIZE - cursor->page_offset,
-				   cursor->resid));
+	/* Nothing here */
 }
 
 static void ceph_msg_data_pages_advance(struct ceph_msg_data_cursor *cursor,
@@ -965,13 +960,10 @@ static void ceph_msg_data_pages_advance(struct ceph_msg_data_cursor *cursor,
 {
 	BUG_ON(cursor->data->type != CEPH_MSG_DATA_PAGES);
 
-	BUG_ON(cursor->page_offset + bytes > PAGE_SIZE);
-
-	/* Advance the cursor page offset */
-
+	/* Advance the cursor iter */
 	cursor->resid -= bytes;
-	cursor->page_offset = (cursor->page_offset + bytes) & ~PAGE_MASK;
-	if (!bytes || cursor->page_offset)
+	iov_iter_advance(&cursor->iter, bytes);
+	if (!bytes || iov_iter_count(&cursor->iter))
 		return;	/* more bytes to process in the current page */
 
 	if (!cursor->resid)
@@ -981,6 +973,9 @@ static void ceph_msg_data_pages_advance(struct ceph_msg_data_cursor *cursor,
 
 	BUG_ON(cursor->page_index >= cursor->page_count);
 	cursor->page_index++;
+
+	ceph_msg_data_set_iter(cursor, cursor->data->pages[cursor->page_index],
+			       0, min(PAGE_SIZE, cursor->resid));
 }
 
 /*
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 15/16] libceph: switch pagelist cursor to iov_iter for messenger
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (13 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 14/16] libceph: switch pages " Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 13:18 ` [PATCH 16/16] libceph: remove ceph_msg_data_*_next() from messenger Roman Penyaev
  2020-04-21 15:51 ` [PATCH 00/16] libceph: messenger: send/recv data at one go Ilya Dryomov
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

Though it still uses pages, ceph_msg_data_pagelist_next() is a noop now.

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 include/linux/ceph/messenger.h |  1 -
 net/ceph/messenger.c           | 32 +++++++++-----------------------
 2 files changed, 9 insertions(+), 24 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 822182ac4386..ef5b0064f515 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -206,7 +206,6 @@ struct ceph_msg_data_cursor {
 		};
 		struct {				/* pagelist */
 			struct page	*page;		/* page from list */
-			size_t		offset;		/* bytes from list */
 		};
 	};
 };
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 288f3c66a4d1..c001f3c551bd 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1003,26 +1003,13 @@ ceph_msg_data_pagelist_cursor_init(struct ceph_msg_data_cursor *cursor,
 
 	cursor->resid = min(length, pagelist->length);
 	cursor->page = page;
-	cursor->offset = 0;
+
+	ceph_msg_data_set_iter(cursor, page, 0, min(PAGE_SIZE, cursor->resid));
 }
 
 static void ceph_msg_data_pagelist_next(struct ceph_msg_data_cursor *cursor)
 {
-	struct ceph_msg_data *data = cursor->data;
-	struct ceph_pagelist *pagelist;
-
-	BUG_ON(data->type != CEPH_MSG_DATA_PAGELIST);
-
-	pagelist = data->pagelist;
-	BUG_ON(!pagelist);
-
-	BUG_ON(!cursor->page);
-	BUG_ON(cursor->offset + cursor->resid != pagelist->length);
-
-	ceph_msg_data_set_iter(cursor, cursor->page,
-			       cursor->offset & ~PAGE_MASK,
-			       min(PAGE_SIZE - cursor->offset,
-				   cursor->resid));
+	/* Nothing here */
 }
 
 static void ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
@@ -1036,15 +1023,11 @@ static void ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
 	pagelist = data->pagelist;
 	BUG_ON(!pagelist);
 
-	BUG_ON(cursor->offset + cursor->resid != pagelist->length);
-	BUG_ON((cursor->offset & ~PAGE_MASK) + bytes > PAGE_SIZE);
-
-	/* Advance the cursor offset */
+	/* Advance the cursor iter */
 
 	cursor->resid -= bytes;
-	cursor->offset += bytes;
-	/* offset of first page in pagelist is always 0 */
-	if (!bytes || cursor->offset & ~PAGE_MASK)
+	iov_iter_advance(&cursor->iter, bytes);
+	if (!bytes || iov_iter_count(&cursor->iter))
 		return;	/* more bytes to process in the current page */
 
 	if (!cursor->resid)
@@ -1054,6 +1037,9 @@ static void ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
 
 	BUG_ON(list_is_last(&cursor->page->lru, &pagelist->head));
 	cursor->page = list_next_entry(cursor->page, lru);
+
+	ceph_msg_data_set_iter(cursor, cursor->page, 0,
+			       min(PAGE_SIZE, cursor->resid));
 }
 
 /*
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 16/16] libceph: remove ceph_msg_data_*_next() from messenger
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (14 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 15/16] libceph: switch pagelist " Roman Penyaev
@ 2020-04-21 13:18 ` Roman Penyaev
  2020-04-21 15:51 ` [PATCH 00/16] libceph: messenger: send/recv data at one go Ilya Dryomov
  16 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 13:18 UTC (permalink / raw)
  Cc: Ilya Dryomov, Jeff Layton, ceph-devel, Roman Penyaev

None of the cursor types needs the next operation any more; advance
handles everything, so just remove it.
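
After this patch the data write loop is reduced to roughly the
following (a sketch, CRC handling omitted):

  int ret;

  while (cursor->total_resid) {
  	if (!cursor->resid) {
  		/* current data item is consumed, switch to the next one */
  		ceph_msg_data_advance(cursor, 0);
  		continue;
  	}
  	ret = ceph_tcp_sendiov(con->sock, &cursor->iter, true);
  	if (ret <= 0)
  		return ret;
  	ceph_msg_data_advance(cursor, (size_t)ret);
  }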

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
---
 net/ceph/messenger.c | 49 --------------------------------------------
 1 file changed, 49 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index c001f3c551bd..3facb1a8c5d5 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -865,11 +865,6 @@ static void ceph_msg_data_bio_cursor_init(struct ceph_msg_data_cursor *cursor,
 	set_bio_iter_to_iov_iter(cursor);
 }
 
-static void ceph_msg_data_bio_next(struct ceph_msg_data_cursor *cursor)
-{
-	/* Nothing here */
-}
-
 static void ceph_msg_data_bio_advance(struct ceph_msg_data_cursor *cursor,
 				      size_t bytes)
 {
@@ -907,11 +902,6 @@ static void ceph_msg_data_bvecs_cursor_init(struct ceph_msg_data_cursor *cursor,
 		      data->num_bvecs, cursor->resid);
 }
 
-static void ceph_msg_data_bvecs_next(struct ceph_msg_data_cursor *cursor)
-{
-	/* Nothing here */
-}
-
 static void ceph_msg_data_bvecs_advance(struct ceph_msg_data_cursor *cursor,
 					size_t bytes)
 {
@@ -950,11 +940,6 @@ static void ceph_msg_data_pages_cursor_init(struct ceph_msg_data_cursor *cursor,
 						cursor->resid));
 }
 
-static void ceph_msg_data_pages_next(struct ceph_msg_data_cursor *cursor)
-{
-	/* Nothing here */
-}
-
 static void ceph_msg_data_pages_advance(struct ceph_msg_data_cursor *cursor,
 					size_t bytes)
 {
@@ -1007,11 +992,6 @@ ceph_msg_data_pagelist_cursor_init(struct ceph_msg_data_cursor *cursor,
 	ceph_msg_data_set_iter(cursor, page, 0, min(PAGE_SIZE, cursor->resid));
 }
 
-static void ceph_msg_data_pagelist_next(struct ceph_msg_data_cursor *cursor)
-{
-	/* Nothing here */
-}
-
 static void ceph_msg_data_pagelist_advance(struct ceph_msg_data_cursor *cursor,
 					   size_t bytes)
 {
@@ -1087,33 +1067,6 @@ static void ceph_msg_data_cursor_init(unsigned int dir, struct ceph_msg *msg,
 	__ceph_msg_data_cursor_init(cursor);
 }
 
-/*
- * Setups cursor->iter for the next piece to process.
- */
-static void ceph_msg_data_next(struct ceph_msg_data_cursor *cursor)
-{
-	switch (cursor->data->type) {
-	case CEPH_MSG_DATA_PAGELIST:
-		ceph_msg_data_pagelist_next(cursor);
-		break;
-	case CEPH_MSG_DATA_PAGES:
-		ceph_msg_data_pages_next(cursor);
-		break;
-#ifdef CONFIG_BLOCK
-	case CEPH_MSG_DATA_BIO:
-		ceph_msg_data_bio_next(cursor);
-		break;
-#endif /* CONFIG_BLOCK */
-	case CEPH_MSG_DATA_BVECS:
-		ceph_msg_data_bvecs_next(cursor);
-		break;
-	case CEPH_MSG_DATA_NONE:
-	default:
-		BUG();
-		break;
-	}
-}
-
 static void ceph_msg_data_advance(struct ceph_msg_data_cursor *cursor,
 				  size_t bytes)
 {
@@ -1519,7 +1472,6 @@ static int write_partial_message_data(struct ceph_connection *con)
 			continue;
 		}
 
-		ceph_msg_data_next(cursor);
 		ret = ceph_tcp_sendiov(con->sock, &cursor->iter, true);
 		if (ret <= 0) {
 			if (do_datacrc)
@@ -2260,7 +2212,6 @@ static int read_partial_msg_data(struct ceph_connection *con)
 			continue;
 		}
 
-		ceph_msg_data_next(cursor);
 		ret = ceph_tcp_recviov(con->sock, &cursor->iter);
 		if (ret <= 0) {
 			if (do_datacrc)
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 00/16] libceph: messenger: send/recv data at one go
  2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
                   ` (15 preceding siblings ...)
  2020-04-21 13:18 ` [PATCH 16/16] libceph: remove ceph_msg_data_*_next() from messenger Roman Penyaev
@ 2020-04-21 15:51 ` Ilya Dryomov
  2020-04-21 16:28   ` Roman Penyaev
  16 siblings, 1 reply; 19+ messages in thread
From: Ilya Dryomov @ 2020-04-21 15:51 UTC (permalink / raw)
  To: Roman Penyaev; +Cc: Jeff Layton, Ceph Development

On Tue, Apr 21, 2020 at 3:18 PM Roman Penyaev <rpenyaev@suse.de> wrote:
>
> Hi folks,
>
> While experimenting with messenger code in userspace [1] I noticed
> that send and receive socket calls always operate with 4k, even bvec
> length is larger (for example when bvec is contructed from bio, where
> multi-page is used for big IOs). This is an attempt to speed up send
> and receive for large IO.
>
> First 3 patches are cleanups. I remove unused code and get rid of
> ceph_osd_data structure. I found that ceph_osd_data duplicates
> ceph_msg_data and it seems unified API looks better for similar
> things.
>
> In the following patches ceph_msg_data_cursor is switched to iov_iter,
> which seems is more suitable for such kind of things (when we
> basically do socket IO). This gives us the possibility to use the
> whole iov_iter for sendmsg() and recvmsg() calls instead of iterating
> page by page. sendpage() call also benefits from this, because now if
> bvec is constructed from multi-page, then we can 0-copy the whole
> bvec in one go.

Hi Roman,

I'm in the process of rewriting the kernel messenger to support msgr2
(i.e. encryption) and noticed the same things.  The switch to iov_iter
was the first thing I implemented ;)  Among other things there is support for
multipage bvecs and explicit socket corking.  I haven't benchmarked any
of it though -- it just seemed like a sensible thing to do, especially
since the sendmsg/sendpage infrastructure needed changes for encryption
anyway.
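
By explicit corking I mean something along these lines (just a sketch,
the final plumbing may look different):

static void ceph_tcp_cork(struct socket *sock, bool cork)
{
        int val = cork;

        kernel_setsockopt(sock, SOL_TCP, TCP_CORK, (char *)&val, sizeof(val));
}

/* cork, queue header/front/middle/data, then uncork to flush it all */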

Support for kvecs isn't implemented yet, but will be in order to get
rid of all those "allocate a page just to process 16 bytes" sites.
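
The idea is to end up with something like this for the small fixed-size
chunks (a sketch, reusing ceph_tcp_recviov() from your series) instead
of bouncing the data through a page:

        struct ceph_msg_header hdr;
        struct kvec kv = { .iov_base = &hdr, .iov_len = sizeof(hdr) };
        struct iov_iter it;
        int ret;

        iov_iter_kvec(&it, READ, &kv, 1, sizeof(hdr));
        ret = ceph_tcp_recviov(con->sock, &it);  /* reads into the stack buffer */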

Unfortunately I got distracted by some higher priority issues with the
userspace messenger, so the kernel messenger is in a bit of a state of
disarray at the moment.  Here is the excerpt from the send path:

#define CEPH_MSG_FLAGS (MSG_DONTWAIT | MSG_NOSIGNAL)

static int do_sendmsg(struct ceph_connection *con, struct iov_iter *it)
{
        struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
        int ret;

        msg.msg_iter = *it;
        while (iov_iter_count(it)) {
                ret = do_one_sendmsg(con, &msg);
                if (ret <= 0) {
                        if (ret == -EAGAIN)
                                ret = 0;
                        return ret;
                }

                iov_iter_advance(it, ret);
        }

        BUG_ON(msg_data_left(&msg));
        return 1;
}

static int do_sendpage(struct ceph_connection *con, struct iov_iter *it)
{
        ssize_t ret;

        BUG_ON(!iov_iter_is_bvec(it));
        while (iov_iter_count(it)) {
                struct page *page = it->bvec->bv_page;
                int offset = it->bvec->bv_offset + it->iov_offset;
                size_t size = min(it->count,
                                  it->bvec->bv_len - it->iov_offset);

                /*
                 * sendpage cannot properly handle pages with
                 * page_count == 0, we need to fall back to sendmsg if
                 * that's the case.
                 *
                 * Same goes for slab pages: skb_can_coalesce() allows
                 * coalescing neighboring slab objects into a single frag
                 * which triggers one of hardened usercopy checks.
                 */
                if (page_count(page) >= 1 && !PageSlab(page)) {
                        ret = do_one_sendpage(con, page, offset, size,
                                              CEPH_MSG_FLAGS);
                } else {
                        struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
                        struct bio_vec bv = {
                                .bv_page = page,
                                .bv_offset = offset,
                                .bv_len = size,
                        };

                        iov_iter_bvec(&msg.msg_iter, WRITE, &bv, 1, size);
                        ret = do_one_sendmsg(con, &msg);
                }
                if (ret <= 0) {
                        if (ret == -EAGAIN)
                                ret = 0;
                        return ret;
                }

                iov_iter_advance(it, ret);
        }

        return 1;
}

/*
 * Write as much as possible.  The socket is expected to be corked,
 * so we don't bother with MSG_MORE/MSG_SENDPAGE_NOTLAST here.
 *
 * Return:
 *   1 - done, nothing else to write
 *   0 - socket is full, need to wait
 *  <0 - error
 */
int ceph_tcp_send(struct ceph_connection *con)
{
        bool is_kvec = iov_iter_is_kvec(&con->out_iter);
        int ret;

        dout("%s con %p have %zu is_kvec %d\n", __func__, con,
             iov_iter_count(&con->out_iter), is_kvec);
        if (is_kvec)
                ret = do_sendmsg(con, &con->out_iter);
        else
                ret = do_sendpage(con, &con->out_iter);

        dout("%s con %p ret %d left %zu\n", __func__, con, ret,
             iov_iter_count(&con->out_iter));
        return ret;
}

I'll make sure to CC you on my patches, should be in a few weeks.

Getting rid of ceph_osd_data is probably a good idea.  FWIW I never
liked it, but not strongly enough to bother with removing it.

>
> I also allowed myself to get rid of ->last_piece and ->need_crc
> members and ceph_msg_data_next() call. Now CRC is calculated not on
> page basis, but according to the size of processed chunk.  I found
> ceph_msg_data_next() is a bit redundant, since we always can set the
> next cursor chunk on cursor init or on advance.
>
> How I tested the performance? I used rbd.fio load on 1 OSD in memory
> with the following fio configuration:
>
>   direct=1
>   time_based=1
>   runtime=10
>   ioengine=io_uring
>   size=256m
>
>   rw=rand{read|write}
>   numjobs=32
>   iodepth=32
>
>   [job1]
>   filename=/dev/rbd0
>
> RBD device is mapped with 'nocrc' option set.  For writes OSD completes
> requests immediately, without touching the memory simulating null block
> device, that's why write throughput in my results is much higher than
> for reads.
>
> I tested on loopback interface only, in Vm, have not yet setup the
> cluster on real machines, so sendpage() on a big multi-page shows
> indeed good results, as expected. But I found an interesting comment
> in drivers/infiniband/sw/siw/siw_qp_tc.c:siw_tcp_sendpages(), which
> says:
>
>  "Using sendpage to push page by page appears to be less efficient
>   than using sendmsg, even if data are copied.
>
>   A general performance limitation might be the extra four bytes
>   trailer checksum segment to be pushed after user data."
>
> I could not prove or disprove since have tested on loopback interface
> only.  So it might be that sendmsg() in on go is faster than
> sendpage() for bvecs with many segments.

Please share any further findings.  We have been using sendpage for
the data section of the message since forever and I remember hearing
about a performance regression when someone inadvertently disabled the
sendpage path (can't recall the subsystem -- iSCSI?).  If you discover
that sendpage is actually slower, that would be very interesting.

>
> Here is the output of the rbd fio load for various block sizes:
>
> ==== WRITE ===
>
> current master, rw=randwrite, numjobs=32 iodepth=32
>
>   4k  IOPS=92.7k, BW=362MiB/s, Lat=11033.30usec
>   8k  IOPS=85.6k, BW=669MiB/s, Lat=11956.74usec
>  16k  IOPS=76.8k, BW=1200MiB/s, Lat=13318.24usec
>  32k  IOPS=56.7k, BW=1770MiB/s, Lat=18056.92usec
>  64k  IOPS=34.0k, BW=2186MiB/s, Lat=29.23msec
> 128k  IOPS=21.8k, BW=2720MiB/s, Lat=46.96msec
> 256k  IOPS=14.4k, BW=3596MiB/s, Lat=71.03msec
> 512k  IOPS=8726, BW=4363MiB/s, Lat=116.34msec
>   1m  IOPS=4799, BW=4799MiB/s, Lat=211.15msec
>
> this patchset,  rw=randwrite, numjobs=32 iodepth=32
>
>   4k  IOPS=94.7k, BW=370MiB/s, Lat=10802.43usec
>   8k  IOPS=91.2k, BW=712MiB/s, Lat=11221.00usec
>  16k  IOPS=80.4k, BW=1257MiB/s, Lat=12715.56usec
>  32k  IOPS=61.2k, BW=1912MiB/s, Lat=16721.33usec
>  64k  IOPS=40.9k, BW=2554MiB/s, Lat=24993.31usec
> 128k  IOPS=25.7k, BW=3216MiB/s, Lat=39.72msec
> 256k  IOPS=17.3k, BW=4318MiB/s, Lat=59.15msec
> 512k  IOPS=11.1k, BW=5559MiB/s, Lat=91.39msec
>   1m  IOPS=6696, BW=6696MiB/s, Lat=151.25msec
>
>
> === READ ===
>
> current master, rw=randread, numjobs=32 iodepth=32
>
>   4k  IOPS=62.5k, BW=244MiB/s, Lat=16.38msec
>   8k  IOPS=55.5k, BW=433MiB/s, Lat=18.44msec
>  16k  IOPS=40.6k, BW=635MiB/s, Lat=25.18msec
>  32k  IOPS=24.6k, BW=768MiB/s, Lat=41.61msec
>  64k  IOPS=14.8k, BW=925MiB/s, Lat=69.06msec
> 128k  IOPS=8687, BW=1086MiB/s, Lat=117.59msec
> 256k  IOPS=4733, BW=1183MiB/s, Lat=214.76msec
> 512k  IOPS=3156, BW=1578MiB/s, Lat=320.54msec
>   1m  IOPS=1901, BW=1901MiB/s, Lat=528.22msec
>
> this patchset,  rw=randread, numjobs=32 iodepth=32
>
>   4k  IOPS=62.6k, BW=244MiB/s, Lat=16342.89usec
>   8k  IOPS=55.5k, BW=434MiB/s, Lat=18.42msec
>  16k  IOPS=43.2k, BW=675MiB/s, Lat=23.68msec
>  32k  IOPS=28.4k, BW=887MiB/s, Lat=36.04msec
>  64k  IOPS=20.2k, BW=1263MiB/s, Lat=50.54msec
> 128k  IOPS=11.7k, BW=1465MiB/s, Lat=87.01msec
> 256k  IOPS=6813, BW=1703MiB/s, Lat=149.30msec
> 512k  IOPS=5363, BW=2682MiB/s, Lat=189.37msec
>   1m  IOPS=2220, BW=2221MiB/s, Lat=453.92msec
>
>
> Results for small blocks are not interesting, since there should not
> be any difference. But starting from 32k block benefits of doing IO
> for the whole message at once starts to prevail.

It's not really the whole message, just the header, front and middle
sections, right?  The data section is still per-bvec, it's just that
bvec is no longer limited to a single page but may encompass several
physically contiguous pages.  These are not that easy to come by on
a heavily loaded system, but they do result in nice numbers.

Thanks,

                Ilya


* Re: [PATCH 00/16] libceph: messenger: send/recv data at one go
  2020-04-21 15:51 ` [PATCH 00/16] libceph: messenger: send/recv data at one go Ilya Dryomov
@ 2020-04-21 16:28   ` Roman Penyaev
  0 siblings, 0 replies; 19+ messages in thread
From: Roman Penyaev @ 2020-04-21 16:28 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Jeff Layton, Ceph Development


Hi Ilya,

On 2020-04-21 17:51, Ilya Dryomov wrote:
> On Tue, Apr 21, 2020 at 3:18 PM Roman Penyaev <rpenyaev@suse.de> wrote:
>> 
>> Hi folks,
>> 
>> While experimenting with messenger code in userspace [1] I noticed
>> that send and receive socket calls always operate with 4k, even bvec
>> length is larger (for example when bvec is contructed from bio, where
>> multi-page is used for big IOs). This is an attempt to speed up send
>> and receive for large IO.
>> 
>> First 3 patches are cleanups. I remove unused code and get rid of
>> ceph_osd_data structure. I found that ceph_osd_data duplicates
>> ceph_msg_data and it seems unified API looks better for similar
>> things.
>> 
>> In the following patches ceph_msg_data_cursor is switched to iov_iter,
>> which seems is more suitable for such kind of things (when we
>> basically do socket IO). This gives us the possibility to use the
>> whole iov_iter for sendmsg() and recvmsg() calls instead of iterating
>> page by page. sendpage() call also benefits from this, because now if
>> bvec is constructed from multi-page, then we can 0-copy the whole
>> bvec in one go.
> 
> Hi Roman,
> 
> I'm in the process of rewriting the kernel messenger to support msgr2
> (i.e. encryption) and noticed the same things.  The switch to iov_iter
> was the first thing I implemented ;)  Among other things is support for
> multipage bvecs and explicit socket corking.

Ah, ok, good to know. This patchset came from the userspace variant of
the kernel messenger. These changes also show nice numbers on the
userspace side (of course without the sendpage() variant, and with some
caching on receive, which I also implemented for the kernel side but
left aside, since it did not show any interesting results for the rbd
load).

> I haven't benchmarked any
> of it though -- it just seemed like a sensible thing to do, especially
> since the sendmsg/sendpage infrastructure needed changes for encryption
> anyway.

I can benchmark on my localhost setup easily. Just add me to CC when
you are done.

> 
> Support for kvecs isn't implemented yet, but will be in order to get
> rid of all those "allocate a page just to process 16 bytes" sites.
> 
> Unfortunately I got distracted by some higher priority issues with the
> userspace messenger, so the kernel messenger is in a bit of a state of
> disarray at the moment.  Here is the excerpt from the send path:
> 
> #define CEPH_MSG_FLAGS (MSG_DONTWAIT | MSG_NOSIGNAL)
> 
> static int do_sendmsg(struct ceph_connection *con, struct iov_iter *it)
> {
>         struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
>         int ret;
> 
>         msg.msg_iter = *it;
>         while (iov_iter_count(it)) {
>                 ret = do_one_sendmsg(con, &msg);
>                 if (ret <= 0) {
>                         if (ret == -EAGAIN)
>                                 ret = 0;
>                         return ret;
>                 }
> 
>                 iov_iter_advance(it, ret);
>         }
> 
>         BUG_ON(msg_data_left(&msg));
>         return 1;
> }
> 
> static int do_sendpage(struct ceph_connection *con, struct iov_iter *it)
> {
>         ssize_t ret;
> 
>         BUG_ON(!iov_iter_is_bvec(it));
>         while (iov_iter_count(it)) {
>                 struct page *page = it->bvec->bv_page;
>                 int offset = it->bvec->bv_offset + it->iov_offset;
>                 size_t size = min(it->count,
>                                   it->bvec->bv_len - it->iov_offset);
> 
>                 /*
>                  * sendpage cannot properly handle pages with
>                  * page_count == 0, we need to fall back to sendmsg if
>                  * that's the case.
>                  *
>                  * Same goes for slab pages: skb_can_coalesce() allows
>                  * coalescing neighboring slab objects into a single frag
>                  * which triggers one of hardened usercopy checks.
>                  */
>                 if (page_count(page) >= 1 && !PageSlab(page)) {
>                         ret = do_one_sendpage(con, page, offset, size,
>                                               CEPH_MSG_FLAGS);
>                 } else {
>                         struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
>                         struct bio_vec bv = {
>                                 .bv_page = page,
>                                 .bv_offset = offset,
>                                 .bv_len = size,
>                         };
> 
>                         iov_iter_bvec(&msg.msg_iter, WRITE, &bv, 1, size);
>                         ret = do_one_sendmsg(con, &msg);
>                 }
>                 if (ret <= 0) {
>                         if (ret == -EAGAIN)
>                                 ret = 0;
>                         return ret;
>                 }
> 
>                 iov_iter_advance(it, ret);
>         }
> 
>         return 1;
> }
> 
> /*
>  * Write as much as possible.  The socket is expected to be corked,
>  * so we don't bother with MSG_MORE/MSG_SENDPAGE_NOTLAST here.
>  *
>  * Return:
>  *   1 - done, nothing else to write
>  *   0 - socket is full, need to wait
>  *  <0 - error
>  */
> int ceph_tcp_send(struct ceph_connection *con)
> {
>         bool is_kvec = iov_iter_is_kvec(&con->out_iter);
>         int ret;
> 
>         dout("%s con %p have %zu is_kvec %d\n", __func__, con,
>              iov_iter_count(&con->out_iter), is_kvec);
>         if (is_kvec)
>                 ret = do_sendmsg(con, &con->out_iter);
>         else
>                 ret = do_sendpage(con, &con->out_iter);
> 
>         dout("%s con %p ret %d left %zu\n", __func__, con, ret,
>              iov_iter_count(&con->out_iter));
>         return ret;
> }

Ha! Nice! That is almost exactly what I do in the current patchset,
except for the corking. I still bother with MSG_MORE/MSG_SENDPAGE_NOTLAST :)
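
Roughly what I mean by that (just a sketch with an illustrative
send_data_chunk() helper, not the exact code from the patchset): keep
hinting to TCP that more data follows until the last chunk of the
message goes out.

/* Sketch: set the 'more data follows' hints on every chunk except the
 * last one, instead of corking the socket. */
static int send_data_chunk(struct socket *sock, struct page *page,
                           int off, size_t len, bool last)
{
        int flags = MSG_DONTWAIT | MSG_NOSIGNAL;

        if (!last)
                flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;

        return kernel_sendpage(sock, page, off, len, flags);
}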

BTW, kvecs can also be sendpaged: since there is no userspace iovec,
the underlying page can easily be taken.  But that is a minor point.
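
A sketch of what I mean (assuming the kvec points at directly mapped
kernel memory, e.g. a kmalloc'ed buffer, and does not cross a page
boundary; send_kvec_as_page() is just an illustrative name):

/* Sketch: turn a kvec element into page + offset and push it with
 * sendpage.  Only valid for directly mapped memory that stays within
 * a single page. */
static int send_kvec_as_page(struct socket *sock, const struct kvec *kv)
{
        struct page *page = virt_to_page(kv->iov_base);
        int off = offset_in_page(kv->iov_base);

        return kernel_sendpage(sock, page, off, kv->iov_len,
                               MSG_DONTWAIT | MSG_NOSIGNAL);
}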

> I'll make sure to CC you on my patches, should be in a few weeks.

Yes, please.  I can help with benchmarking and reviewing.

> 
> Getting rid of ceph_osd_data is probably a good idea.  FWIW I never
> liked it, but not strongly enough to bother with removing it.
> 
>> 
>> I also allowed myself to get rid of ->last_piece and ->need_crc
>> members and ceph_msg_data_next() call. Now CRC is calculated not on
>> page basis, but according to the size of processed chunk.  I found
>> ceph_msg_data_next() is a bit redundant, since we always can set the
>> next cursor chunk on cursor init or on advance.
>> 
>> How I tested the performance? I used rbd.fio load on 1 OSD in memory
>> with the following fio configuration:
>> 
>>   direct=1
>>   time_based=1
>>   runtime=10
>>   ioengine=io_uring
>>   size=256m
>> 
>>   rw=rand{read|write}
>>   numjobs=32
>>   iodepth=32
>> 
>>   [job1]
>>   filename=/dev/rbd0
>> 
>> RBD device is mapped with 'nocrc' option set.  For writes OSD 
>> completes
>> requests immediately, without touching the memory simulating null 
>> block
>> device, that's why write throughput in my results is much higher than
>> for reads.
>> 
>> I tested on loopback interface only, in Vm, have not yet setup the
>> cluster on real machines, so sendpage() on a big multi-page shows
>> indeed good results, as expected. But I found an interesting comment
>> in drivers/infiniband/sw/siw/siw_qp_tc.c:siw_tcp_sendpages(), which
>> says:
>> 
>>  "Using sendpage to push page by page appears to be less efficient
>>   than using sendmsg, even if data are copied.
>> 
>>   A general performance limitation might be the extra four bytes
>>   trailer checksum segment to be pushed after user data."
>> 
>> I could not prove or disprove since have tested on loopback interface
>> only.  So it might be that sendmsg() in on go is faster than
>> sendpage() for bvecs with many segments.
> 
> Please share any further findings.  We have been using sendpage for
> the data section of the message since forever and I remember hearing
> about a performance regression when someone inadvertently disabled the
> sendpage path (can't recall the subsystem -- iSCSI?).  If you discover
> that sendpage is actually slower, that would be very interesting.

I will try to find a good machine with a fat network for that.
It should be easy to benchmark.

>> Here is the output of the rbd fio load for various block sizes:
>> 
>> ==== WRITE ===
>> 
>> current master, rw=randwrite, numjobs=32 iodepth=32
>> 
>>   4k  IOPS=92.7k, BW=362MiB/s, Lat=11033.30usec
>>   8k  IOPS=85.6k, BW=669MiB/s, Lat=11956.74usec
>>  16k  IOPS=76.8k, BW=1200MiB/s, Lat=13318.24usec
>>  32k  IOPS=56.7k, BW=1770MiB/s, Lat=18056.92usec
>>  64k  IOPS=34.0k, BW=2186MiB/s, Lat=29.23msec
>> 128k  IOPS=21.8k, BW=2720MiB/s, Lat=46.96msec
>> 256k  IOPS=14.4k, BW=3596MiB/s, Lat=71.03msec
>> 512k  IOPS=8726, BW=4363MiB/s, Lat=116.34msec
>>   1m  IOPS=4799, BW=4799MiB/s, Lat=211.15msec
>> 
>> this patchset,  rw=randwrite, numjobs=32 iodepth=32
>> 
>>   4k  IOPS=94.7k, BW=370MiB/s, Lat=10802.43usec
>>   8k  IOPS=91.2k, BW=712MiB/s, Lat=11221.00usec
>>  16k  IOPS=80.4k, BW=1257MiB/s, Lat=12715.56usec
>>  32k  IOPS=61.2k, BW=1912MiB/s, Lat=16721.33usec
>>  64k  IOPS=40.9k, BW=2554MiB/s, Lat=24993.31usec
>> 128k  IOPS=25.7k, BW=3216MiB/s, Lat=39.72msec
>> 256k  IOPS=17.3k, BW=4318MiB/s, Lat=59.15msec
>> 512k  IOPS=11.1k, BW=5559MiB/s, Lat=91.39msec
>>   1m  IOPS=6696, BW=6696MiB/s, Lat=151.25msec
>> 
>> 
>> === READ ===
>> 
>> current master, rw=randread, numjobs=32 iodepth=32
>> 
>>   4k  IOPS=62.5k, BW=244MiB/s, Lat=16.38msec
>>   8k  IOPS=55.5k, BW=433MiB/s, Lat=18.44msec
>>  16k  IOPS=40.6k, BW=635MiB/s, Lat=25.18msec
>>  32k  IOPS=24.6k, BW=768MiB/s, Lat=41.61msec
>>  64k  IOPS=14.8k, BW=925MiB/s, Lat=69.06msec
>> 128k  IOPS=8687, BW=1086MiB/s, Lat=117.59msec
>> 256k  IOPS=4733, BW=1183MiB/s, Lat=214.76msec
>> 512k  IOPS=3156, BW=1578MiB/s, Lat=320.54msec
>>   1m  IOPS=1901, BW=1901MiB/s, Lat=528.22msec
>> 
>> this patchset,  rw=randread, numjobs=32 iodepth=32
>> 
>>   4k  IOPS=62.6k, BW=244MiB/s, Lat=16342.89usec
>>   8k  IOPS=55.5k, BW=434MiB/s, Lat=18.42msec
>>  16k  IOPS=43.2k, BW=675MiB/s, Lat=23.68msec
>>  32k  IOPS=28.4k, BW=887MiB/s, Lat=36.04msec
>>  64k  IOPS=20.2k, BW=1263MiB/s, Lat=50.54msec
>> 128k  IOPS=11.7k, BW=1465MiB/s, Lat=87.01msec
>> 256k  IOPS=6813, BW=1703MiB/s, Lat=149.30msec
>> 512k  IOPS=5363, BW=2682MiB/s, Lat=189.37msec
>>   1m  IOPS=2220, BW=2221MiB/s, Lat=453.92msec
>> 
>> 
>> Results for small blocks are not interesting, since there should not
>> be any difference. But starting from 32k block benefits of doing IO
>> for the whole message at once starts to prevail.
> 
> It's not really the whole message, just the header, front and middle
> sections, right?

No, that is the whole message; this is the output of the fio rbd load.
Or maybe I did not get your question.

> The data section is still per-bvec, it's just that
> bvec is no longer limited to a single page but may encompass several
> physically contiguous pages.

True. The data section is a bvec, which is taken from a bio, which in
turn carries one big physically contiguous multipage, when such a slice
of physically contiguous memory is available, of course.
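
(To illustrate the distinction, a small sketch with an illustrative
count_bvecs() helper: bio_for_each_bvec() walks the multi-page bvecs,
while bio_for_each_segment() would split the same bio into single-page
segments.)

/* Sketch: count the multi-page bvecs of a bio.  One entry here may
 * cover a large physically contiguous range spanning many pages. */
static unsigned int count_bvecs(struct bio *bio)
{
        struct bio_vec bv;
        struct bvec_iter iter;
        unsigned int n = 0;

        bio_for_each_bvec(bv, bio, iter)
                n++;

        return n;
}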

> These are not that easy to come by on
> a heavily loaded system, but they do result in nice numbers.

Yeah, sometimes there is quite a big number of segments for a big IO,
and for that case it seems to make sense to do buffering, i.e. calling
sendmsg().

So it probably makes sense to call sendpage() only when

    nr_segs < N && bvec->bv_len > M * 4k

where N and M are some magic numbers which help to reduce the cost
of calling do_tcp_sendpages() in a loooong loop. But this is
pure speculation.
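
Something along these lines, with made-up knobs (prefer_sendpage(),
SENDPAGE_MAX_SEGS and SENDPAGE_MIN_LEN are illustrative names, not
anything from the patchset):

/* Pure speculation: take the sendpage path only when the bvec iterator
 * has few segments and the first segment is reasonably large, otherwise
 * fall back to sendmsg() and let it copy.  N and M are made-up tuning
 * knobs, not measured values. */
#define SENDPAGE_MAX_SEGS   8                   /* "N" */
#define SENDPAGE_MIN_LEN    (4 * PAGE_SIZE)     /* "M * 4k" */

static bool prefer_sendpage(const struct iov_iter *it)
{
        return iov_iter_is_bvec(it) &&
               it->nr_segs <= SENDPAGE_MAX_SEGS &&
               it->bvec->bv_len >= SENDPAGE_MIN_LEN;
}

Then ceph_tcp_send() could route to do_sendpage() only when
prefer_sendpage() returns true, and to do_sendmsg() otherwise.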

--
Roman



Thread overview: 19+ messages
2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
2020-04-21 13:18 ` [PATCH 01/16] libceph: remove unused ceph_pagelist_cursor Roman Penyaev
2020-04-21 13:18 ` [PATCH 02/16] libceph: extend ceph_msg_data API in order to switch on it Roman Penyaev
2020-04-21 13:18 ` [PATCH 03/16] libceph,rbd,cephfs: switch from ceph_osd_data to ceph_msg_data Roman Penyaev
2020-04-21 13:18 ` [PATCH 04/16] libceph: remove ceph_osd_data completely Roman Penyaev
2020-04-21 13:18 ` [PATCH 05/16] libceph: remove unused last_piece out parameter from ceph_msg_data_next() Roman Penyaev
2020-04-21 13:18 ` [PATCH 06/16] libceph: switch data cursor from page to iov_iter for messenger Roman Penyaev
2020-04-21 13:18 ` [PATCH 07/16] libceph: use new tcp_sendiov() instead of tcp_sendmsg() " Roman Penyaev
2020-04-21 13:18 ` [PATCH 08/16] libceph: remove unused tcp wrappers, now iov_iter is used " Roman Penyaev
2020-04-21 13:18 ` [PATCH 09/16] libceph: no need for cursor->need_crc " Roman Penyaev
2020-04-21 13:18 ` [PATCH 10/16] libceph: remove ->last_piece member for message data cursor Roman Penyaev
2020-04-21 13:18 ` [PATCH 11/16] libceph: remove not necessary checks on doing advance on bio and bvecs cursor Roman Penyaev
2020-04-21 13:18 ` [PATCH 12/16] libceph: switch bvecs cursor to iov_iter for messenger Roman Penyaev
2020-04-21 13:18 ` [PATCH 13/16] libceph: switch bio " Roman Penyaev
2020-04-21 13:18 ` [PATCH 14/16] libceph: switch pages " Roman Penyaev
2020-04-21 13:18 ` [PATCH 15/16] libceph: switch pageslist " Roman Penyaev
2020-04-21 13:18 ` [PATCH 16/16] libceph: remove ceph_msg_data_*_next() from messenger Roman Penyaev
2020-04-21 15:51 ` [PATCH 00/16] libceph: messenger: send/recv data at one go Ilya Dryomov
2020-04-21 16:28   ` Roman Penyaev
