All of
 help / color / mirror / Atom feed
From: Ilya Dryomov <>
To: Roman Penyaev <>
Cc: Jeff Layton <>,
	Ceph Development <>
Subject: Re: [PATCH 00/16] libceph: messenger: send/recv data at one go
Date: Tue, 21 Apr 2020 17:51:37 +0200	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On Tue, Apr 21, 2020 at 3:18 PM Roman Penyaev <> wrote:
> Hi folks,
> While experimenting with messenger code in userspace [1] I noticed
> that send and receive socket calls always operate with 4k, even bvec
> length is larger (for example when bvec is contructed from bio, where
> multi-page is used for big IOs). This is an attempt to speed up send
> and receive for large IO.
> First 3 patches are cleanups. I remove unused code and get rid of
> ceph_osd_data structure. I found that ceph_osd_data duplicates
> ceph_msg_data and it seems unified API looks better for similar
> things.
> In the following patches ceph_msg_data_cursor is switched to iov_iter,
> which seems is more suitable for such kind of things (when we
> basically do socket IO). This gives us the possibility to use the
> whole iov_iter for sendmsg() and recvmsg() calls instead of iterating
> page by page. sendpage() call also benefits from this, because now if
> bvec is constructed from multi-page, then we can 0-copy the whole
> bvec in one go.

Hi Roman,

I'm in the process of rewriting the kernel messenger to support msgr2
(i.e. encryption) and noticed the same things.  The switch to iov_iter
was the first thing I implemented ;)  Among other things is support for
multipage bvecs and explicit socket corking.  I haven't benchmarked any
of it though -- it just seemed like a sensible thing to do, especially
since the sendmsg/sendpage infrastructure needed changes for encryption

Support for kvecs isn't implemented yet, but will be in order to get
rid of all those "allocate a page just to process 16 bytes" sites.

Unfortunately I got distracted by some higher priority issues with the
userspace messenger, so the kernel messenger is in a bit of a state of
disarray at the moment.  Here is the excerpt from the send path:


static int do_sendmsg(struct ceph_connection *con, struct iov_iter *it)
        struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
        int ret;

        msg.msg_iter = *it;
        while (iov_iter_count(it)) {
                ret = do_one_sendmsg(con, &msg);
                if (ret <= 0) {
                        if (ret == -EAGAIN)
                                ret = 0;
                        return ret;

                iov_iter_advance(it, ret);

        return 1;

static int do_sendpage(struct ceph_connection *con, struct iov_iter *it)
        ssize_t ret;

        while (iov_iter_count(it)) {
                struct page *page = it->bvec->bv_page;
                int offset = it->bvec->bv_offset + it->iov_offset;
                size_t size = min(it->count,
                                  it->bvec->bv_len - it->iov_offset);

                 * sendpage cannot properly handle pages with
                 * page_count == 0, we need to fall back to sendmsg if
                 * that's the case.
                 * Same goes for slab pages: skb_can_coalesce() allows
                 * coalescing neighboring slab objects into a single frag
                 * which triggers one of hardened usercopy checks.
                if (page_count(page) >= 1 && !PageSlab(page)) {
                        ret = do_one_sendpage(con, page, offset, size,
                } else {
                        struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
                        struct bio_vec bv = {
                                .bv_page = page,
                                .bv_offset = offset,
                                .bv_len = size,

                        iov_iter_bvec(&msg.msg_iter, WRITE, &bv, 1, size);
                        ret = do_one_sendmsg(con, &msg);
                if (ret <= 0) {
                        if (ret == -EAGAIN)
                                ret = 0;
                        return ret;

                iov_iter_advance(it, ret);

        return 1;

 * Write as much as possible.  The socket is expected to be corked,
 * so we don't bother with MSG_MORE/MSG_SENDPAGE_NOTLAST here.
 * Return:
 *   1 - done, nothing else to write
 *   0 - socket is full, need to wait
 *  <0 - error
int ceph_tcp_send(struct ceph_connection *con)
        bool is_kvec = iov_iter_is_kvec(&con->out_iter);
        int ret;

        dout("%s con %p have %zu is_kvec %d\n", __func__, con,
             iov_iter_count(&con->out_iter), is_kvec);
        if (is_kvec)
                ret = do_sendmsg(con, &con->out_iter);
                ret = do_sendpage(con, &con->out_iter);

        dout("%s con %p ret %d left %zu\n", __func__, con, ret,
        return ret;

I'll make sure to CC you on my patches, should be in a few weeks.

Getting rid of ceph_osd_data is probably a good idea.  FWIW I never
liked it, but not strong enough to bother with removing it.

> I also allowed myself to get rid of ->last_piece and ->need_crc
> members and ceph_msg_data_next() call. Now CRC is calculated not on
> page basis, but according to the size of processed chunk.  I found
> ceph_msg_data_next() is a bit redundant, since we always can set the
> next cursor chunk on cursor init or on advance.
> How I tested the performance? I used rbd.fio load on 1 OSD in memory
> with the following fio configuration:
>   direct=1
>   time_based=1
>   runtime=10
>   ioengine=io_uring
>   size=256m
>   rw=rand{read|write}
>   numjobs=32
>   iodepth=32
>   [job1]
>   filename=/dev/rbd0
> RBD device is mapped with 'nocrc' option set.  For writes OSD completes
> requests immediately, without touching the memory simulating null block
> device, that's why write throughput in my results is much higher than
> for reads.
> I tested on loopback interface only, in Vm, have not yet setup the
> cluster on real machines, so sendpage() on a big multi-page shows
> indeed good results, as expected. But I found an interesting comment
> in drivers/infiniband/sw/siw/siw_qp_tc.c:siw_tcp_sendpages(), which
> says:
>  "Using sendpage to push page by page appears to be less efficient
>   than using sendmsg, even if data are copied.
>   A general performance limitation might be the extra four bytes
>   trailer checksum segment to be pushed after user data."
> I could not prove or disprove since have tested on loopback interface
> only.  So it might be that sendmsg() in on go is faster than
> sendpage() for bvecs with many segments.

Please share any further findings.  We have been using sendpage for
the data section of the message since forever and I remember hearing
about a performance regression when someone inadvertently disabled the
sendpage path (can't recall the subsystem -- iSCSI?).  If you discover
that sendpage is actually slower, that would be very interesting.

> Here is the output of the rbd fio load for various block sizes:
> ==== WRITE ===
> current master, rw=randwrite, numjobs=32 iodepth=32
>   4k  IOPS=92.7k, BW=362MiB/s, Lat=11033.30usec
>   8k  IOPS=85.6k, BW=669MiB/s, Lat=11956.74usec
>  16k  IOPS=76.8k, BW=1200MiB/s, Lat=13318.24usec
>  32k  IOPS=56.7k, BW=1770MiB/s, Lat=18056.92usec
>  64k  IOPS=34.0k, BW=2186MiB/s, Lat=29.23msec
> 128k  IOPS=21.8k, BW=2720MiB/s, Lat=46.96msec
> 256k  IOPS=14.4k, BW=3596MiB/s, Lat=71.03msec
> 512k  IOPS=8726, BW=4363MiB/s, Lat=116.34msec
>   1m  IOPS=4799, BW=4799MiB/s, Lat=211.15msec
> this patchset,  rw=randwrite, numjobs=32 iodepth=32
>   4k  IOPS=94.7k, BW=370MiB/s, Lat=10802.43usec
>   8k  IOPS=91.2k, BW=712MiB/s, Lat=11221.00usec
>  16k  IOPS=80.4k, BW=1257MiB/s, Lat=12715.56usec
>  32k  IOPS=61.2k, BW=1912MiB/s, Lat=16721.33usec
>  64k  IOPS=40.9k, BW=2554MiB/s, Lat=24993.31usec
> 128k  IOPS=25.7k, BW=3216MiB/s, Lat=39.72msec
> 256k  IOPS=17.3k, BW=4318MiB/s, Lat=59.15msec
> 512k  IOPS=11.1k, BW=5559MiB/s, Lat=91.39msec
>   1m  IOPS=6696, BW=6696MiB/s, Lat=151.25msec
> === READ ===
> current master, rw=randread, numjobs=32 iodepth=32
>   4k  IOPS=62.5k, BW=244MiB/s, Lat=16.38msec
>   8k  IOPS=55.5k, BW=433MiB/s, Lat=18.44msec
>  16k  IOPS=40.6k, BW=635MiB/s, Lat=25.18msec
>  32k  IOPS=24.6k, BW=768MiB/s, Lat=41.61msec
>  64k  IOPS=14.8k, BW=925MiB/s, Lat=69.06msec
> 128k  IOPS=8687, BW=1086MiB/s, Lat=117.59msec
> 256k  IOPS=4733, BW=1183MiB/s, Lat=214.76msec
> 512k  IOPS=3156, BW=1578MiB/s, Lat=320.54msec
>   1m  IOPS=1901, BW=1901MiB/s, Lat=528.22msec
> this patchset,  rw=randread, numjobs=32 iodepth=32
>   4k  IOPS=62.6k, BW=244MiB/s, Lat=16342.89usec
>   8k  IOPS=55.5k, BW=434MiB/s, Lat=18.42msec
>  16k  IOPS=43.2k, BW=675MiB/s, Lat=23.68msec
>  32k  IOPS=28.4k, BW=887MiB/s, Lat=36.04msec
>  64k  IOPS=20.2k, BW=1263MiB/s, Lat=50.54msec
> 128k  IOPS=11.7k, BW=1465MiB/s, Lat=87.01msec
> 256k  IOPS=6813, BW=1703MiB/s, Lat=149.30msec
> 512k  IOPS=5363, BW=2682MiB/s, Lat=189.37msec
>   1m  IOPS=2220, BW=2221MiB/s, Lat=453.92msec
> Results for small blocks are not interesting, since there should not
> be any difference. But starting from 32k block benefits of doing IO
> for the whole message at once starts to prevail.

It's not really the whole message, just the header, front and middle
sections, right?  The data section is still per-bvec, it's just that
bvec is no longer limited to a single page but may encompass several
physically contiguous pages.  These are not that easy to come by on
a heavily loaded system, but they do result in nice numbers.



  parent reply	other threads:[~2020-04-21 15:51 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-21 13:18 [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev
2020-04-21 13:18 ` [PATCH 01/16] libceph: remove unused ceph_pagelist_cursor Roman Penyaev
2020-04-21 13:18 ` [PATCH 02/16] libceph: extend ceph_msg_data API in order to switch on it Roman Penyaev
2020-04-21 13:18 ` [PATCH 03/16] libceph,rbd,cephfs: switch from ceph_osd_data to ceph_msg_data Roman Penyaev
2020-04-21 13:18 ` [PATCH 04/16] libceph: remove ceph_osd_data completely Roman Penyaev
2020-04-21 13:18 ` [PATCH 05/16] libceph: remove unused last_piece out parameter from ceph_msg_data_next() Roman Penyaev
2020-04-21 13:18 ` [PATCH 06/16] libceph: switch data cursor from page to iov_iter for messenger Roman Penyaev
2020-04-21 13:18 ` [PATCH 07/16] libceph: use new tcp_sendiov() instead of tcp_sendmsg() " Roman Penyaev
2020-04-21 13:18 ` [PATCH 08/16] libceph: remove unused tcp wrappers, now iov_iter is used " Roman Penyaev
2020-04-21 13:18 ` [PATCH 09/16] libceph: no need for cursor->need_crc " Roman Penyaev
2020-04-21 13:18 ` [PATCH 10/16] libceph: remove ->last_piece member for message data cursor Roman Penyaev
2020-04-21 13:18 ` [PATCH 11/16] libceph: remove not necessary checks on doing advance on bio and bvecs cursor Roman Penyaev
2020-04-21 13:18 ` [PATCH 12/16] libceph: switch bvecs cursor to iov_iter for messenger Roman Penyaev
2020-04-21 13:18 ` [PATCH 13/16] libceph: switch bio " Roman Penyaev
2020-04-21 13:18 ` [PATCH 14/16] libceph: switch pages " Roman Penyaev
2020-04-21 13:18 ` [PATCH 15/16] libceph: switch pageslist " Roman Penyaev
2020-04-21 13:18 ` [PATCH 16/16] libceph: remove ceph_msg_data_*_next() from messenger Roman Penyaev
2020-04-21 15:51 ` Ilya Dryomov [this message]
2020-04-21 16:28   ` [PATCH 00/16] libceph: messenger: send/recv data at one go Roman Penyaev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='' \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.