From: Walker, Benjamin
Subject: Re: [SPDK] NBD with SPDK
Date: Thu, 05 Sep 2019 21:22:38 +0000
To: spdk@lists.01.org

On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
> Hi Paul,
>
> Rather than put the effort into a formalized document, here is a brief
> description of the solution I have been investigating, just to get an
> opinion on its feasibility or even workability.
>
> Some background and a reiteration of the problem to set things up. I
> apologize for reiterating anything and for including details that some may
> already know.
>
> We are looking for a solution that allows us to write a custom bdev for the
> SPDK bdev layer that distributes I/O between different NVMe-oF targets that
> we have attached, and then present that to our application as either a raw
> block device or a filesystem mountpoint.
>
> This is normally (as I understand it) done by exposing a device via QEMU to
> a VM using the vhost target. This SPDK target has implemented the
> virtio-scsi (among others) device according to this spec:
>
> https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021
>
> The VM kernel then uses a virtio-scsi module to attach said device into its
> SCSI mid-layer and then has the device enumerated as a /dev/sd[a-z]+ device.
>
> The problem is that QEMU virtualizes a PCIe bus for the guest kernel
> virtio-pci driver to discover the virtio devices and bind them to the
> virtio-scsi driver. There really is no other way (other than platform MMIO
> type devices) to attach a device to the virtio-scsi driver.
>
> SPDK exposes the virtio device to the VM via QEMU, which has written a
> "user space" version of the vhost bus. This driver then translates the API
> into the virtio-pci specification:
>
> https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst
>
> This uses an eventfd descriptor for interrupting the non-polling side of
> the queue and a UNIX domain socket to set up (and control) the shared
> memory which contains the I/O buffers and virtio queues. This is documented
> in SPDK's own documentation and diagrammed here:
>
> https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
>
> If we could implement this vhost-user QEMU target as a virtio driver in the
> kernel as an alternative to the virtio-pci driver, it could bind an SPDK
> vhost into the host kernel as a virtio device and have it enumerated in the
> /dev/sd[a-z]+ tree for our containers to bind. Attached is a draft block
> diagram.

If you think of QEMU as just another user-space process, and the SPDK vhost
target as a user-space process, then it's clear that vhost-user is simply a
cross-process IPC mechanism based on shared memory. The "shared memory" part
is the critical part of that description - QEMU pre-registers all of the
memory that will be used for I/O buffers (in fact, all of the memory that is
mapped into the guest) with the SPDK process by sending fds across a Unix
domain socket.
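As a rough illustration of that hand-off (the names below are illustrative,
not QEMU or SPDK code): the sender creates an fd-backed region, passes the fd
over the Unix domain socket as SCM_RIGHTS ancillary data, and the receiver
mmap()s it so both processes see the same pages.

/* Minimal sketch: share a memory region the way vhost-user does - create an
 * fd-backed mapping and pass the fd over a Unix domain socket as SCM_RIGHTS
 * ancillary data. Illustrative only. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

static int send_region_fd(int sock, int memfd, size_t len)
{
    /* A real vhost-user message carries a memory-region table here; this
     * sketch just sends the region length as the payload. */
    struct iovec iov = { .iov_base = &len, .iov_len = sizeof(len) };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;              /* pass the fd itself */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &memfd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

int main(void)
{
    size_t len = 2 * 1024 * 1024;
    int sv[2];
    /* memfd_create() gives an anonymous fd-backed region; hugetlbfs or a
     * tmpfs file works the same way. */
    int memfd = memfd_create("io-buffers", 0);

    if (memfd < 0 || ftruncate(memfd, len) < 0 ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        perror("setup");
        return 1;
    }
    /* The receiver (e.g. the vhost target) recvmsg()s the fd and mmap()s it;
     * after that, both processes see the same physical pages. */
    if (send_region_fd(sv[0], memfd, len) != 0)
        perror("send_region_fd");
    return 0;
}

The vhost-user message body then describes each region's guest-physical
address, size, and mmap offset so the target can translate the addresses it
finds in the virtqueues.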
If you move this code into the kernel, you have to solve two issues:

1) What memory is it registering with the SPDK process? The kernel driver has
no idea which application process may route I/O to it - in fact the
application process may not even exist yet - so it isn't memory allocated to
the application process. Maybe you have a pool of kernel buffers that get
mapped into the SPDK process, and when the application process performs I/O
the kernel copies into those buffers prior to telling SPDK about them? That
would work, but now you're back to doing a data copy. I do think you can get
it down to 1 data copy instead of 2 with a scheme like this.

2) One of the big performance problems you're seeing is syscall overhead in
NBD. If you still have a kernel block device that routes messages up to the
SPDK process, the application process is making the same syscalls because
it's still interacting with a block device in the kernel, but you're right
that the backend SPDK implementation could be polling on shared memory rings
and potentially run more efficiently.

> Since we will not have a real bus to signal for the driver to probe for new
> devices, we can use a sysfs interface for the application to notify the
> driver of a new socket and eventfd pair to set up a new virtio-scsi
> instance. Otherwise the design simply moves the vhost-user driver from the
> QEMU application into the host kernel itself.
>
> It's my understanding that this will avoid a lot more system calls and
> copies compared to exposing an iSCSI device or NBD device as we're
> currently discussing. Does this seem feasible?

What you really want is a "block device in user space" solution that's higher
performance than NBD, and while that's been tried many, many times in the
past I do think there is a great opportunity here for someone. I'm not sure
that the interface between the block device process and the kernel is best
done as a modification of NBD or a wholesale replacement by vhost-user-scsi,
but I'd like to throw in a third option to consider - use NVMe queues in
shared memory as the interface instead. The NVMe queues are going to be much
more efficient than virtqueues for storage commands.
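To make that third option a bit more concrete, here is a rough sketch of what
"NVMe queues in shared memory" could look like. This is purely illustrative -
it is not an existing SPDK or kernel interface - but it shows why the model is
attractive: submission queue entries are fixed 64-byte commands, and both
sides make progress by polling head/tail indices in the shared region rather
than paying a syscall per I/O.

/* Rough sketch of an "NVMe queues in shared memory" interface. Purely
 * illustrative - not an existing SPDK or kernel interface. The structure
 * below would live in memory shared between the kernel (or application)
 * side and the SPDK process, e.g. via the fd-passing shown earlier. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SQ_DEPTH 256                  /* power of two */

/* Same size and rough layout as a real 64-byte NVMe submission queue entry. */
struct sq_entry {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t cid;                     /* command identifier */
    uint32_t nsid;
    uint64_t rsvd2;                   /* cdw2-3, unused here */
    uint64_t mptr;                    /* metadata pointer, unused here */
    uint64_t prp1;                    /* data buffer address */
    uint64_t prp2;
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
};

struct shared_sq {
    _Atomic uint32_t tail;            /* advanced by the submitter */
    _Atomic uint32_t head;            /* advanced by the consumer (SPDK) */
    struct sq_entry  ring[SQ_DEPTH];
};

/* Submitter side: post one command without any syscall. */
static bool sq_submit(struct shared_sq *sq, const struct sq_entry *cmd)
{
    uint32_t tail = atomic_load_explicit(&sq->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&sq->head, memory_order_acquire);

    if (tail - head >= SQ_DEPTH)
        return false;                 /* queue full */
    sq->ring[tail % SQ_DEPTH] = *cmd;
    /* Publish the entry before moving the tail ("ringing the doorbell"). */
    atomic_store_explicit(&sq->tail, tail + 1, memory_order_release);
    return true;
}

/* SPDK side: poll for new commands instead of sleeping in read(). */
static bool sq_poll(struct shared_sq *sq, struct sq_entry *out)
{
    uint32_t head = atomic_load_explicit(&sq->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&sq->tail, memory_order_acquire);

    if (head == tail)
        return false;                 /* nothing pending */
    *out = sq->ring[head % SQ_DEPTH];
    atomic_store_explicit(&sq->head, head + 1, memory_order_release);
    return true;
}

The completion queue would be the mirror image (16-byte completion entries
flowing the other way), with an eventfd only as a fallback for when one side
stops polling.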
>
> Thanks,
> Brian
>
> On 9/5/19, 12:32 PM, "Mittal, Rishabh" wrote:
>
> Hi Paul.
>
> Thanks for investigating it.
>
> We have one more idea floating around. Brian is going to send you a
> proposal shortly. If the other proposal seems feasible to you, we can
> evaluate the work required for both proposals.
>
> Thanks
> Rishabh Mittal
>
> On 9/5/19, 11:09 AM, "Luse, Paul E" wrote:
>
> Hi,
>
> So I was able to perform the same steps here, and I think one of the keys
> to really seeing what's going on is to start perf top like this:
>
> "perf top --sort comm,dso,symbol -C 0" to get a more focused view by
> sorting on command, shared object and symbol.
>
> Attached are 2 snapshots, one with a NULL back end for nbd and one with
> libaio/nvme. Some notes after chatting with Ben a bit, please read through
> and let us know what you think:
>
> * in both cases the vast majority of the highest-overhead activities are
>   kernel
> * the "copy_user_enhanced" symbol shows up on the NULL case (it shows up on
>   the other as well but you have to scroll way down to see it) and is the
>   user/kernel space copy; nothing SPDK can do about that
> * the syscalls that dominate in both cases are likely something that can be
>   improved on by changing how SPDK interacts with nbd. Ben had a couple of
>   ideas including (a) using libaio to interact with the nbd fd as opposed
>   to interacting with the nbd socket, and (b) "batching" wherever possible,
>   for example on writes to nbd investigate not ack'ing them until some
>   number have completed
> * the kernel slab* commands are likely nbd kernel driver allocations/frees
>   in the IO path; one possibility would be to look at optimizing the nbd
>   kernel driver for this one
> * the libc item on the NULL chart also shows up on the libaio profile,
>   however it is again way down the scroll so it didn't make the screenshot
>   :) This could be a zeroing of something somewhere in the SPDK nbd driver
>
> It looks like this data supports what Ben had suspected a while back: much
> of the overhead we're looking at is kernel nbd. Anyway, let us know what
> you think, whether you want to explore any of the ideas above any further,
> or if you see something else in the data that looks worthy of note.
>
> Thx
> Paul
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
> Sent: Wednesday, September 4, 2019 4:27 PM
> To: Mittal, Rishabh; Walker, Benjamin <benjamin.walker(a)intel.com>;
> Harris, James R; spdk(a)lists.01.org
> Cc: Chen, Xiaoxi; Szmyd, Brian; Kadayam, Hari
> Subject: Re: [SPDK] NBD with SPDK
>
> Cool, thanks for sending this. I will try to repro tomorrow here and see
> what kind of results I get.
>
> Thx
> Paul
>
> -----Original Message-----
> From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> Sent: Wednesday, September 4, 2019 4:23 PM
> To: Luse, Paul E; Walker, Benjamin <benjamin.walker(a)intel.com>;
> Harris, James R; spdk(a)lists.01.org
> Cc: Chen, Xiaoxi; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian
> Subject: Re: [SPDK] NBD with SPDK
>
> Avg CPU utilization is very low when I am running this.
>
> 09/04/2019 04:21:40 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            2.59    0.00    2.57    0.00    0.00   94.84
>
> Device    r/s       w/s   rkB/s      wkB/s  rrqm/s    wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
> sda      0.00      0.20    0.00       0.80    0.00      0.00   0.00   0.00     0.00     0.00    0.00      0.00      4.00   0.00   0.00
> sdb      0.00      0.00    0.00       0.00    0.00      0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
> sdc      0.00  28846.80    0.00  191555.20    0.00  18211.00   0.00  38.70     0.00     1.03   29.64      0.00      6.64   0.03 100.00
> nb0      0.00  47297.00    0.00  191562.40    0.00    593.60   0.00   1.24     0.00     1.32   61.83      0.00      4.05   0
>
> On 9/4/19, 4:19 PM, "Mittal, Rishabh" wrote:
>
> I am using this command:
>
> fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write
> --rwmixread=0 --bsrange=4k-4k --direct=1 --filename=/dev/nbd0 --numjobs=8
> --runtime 120 --time_based --group_reporting
>
> I have created the device by using these commands:
> 1. ./root/spdk/app/vhost
> 2. ./rpc.py bdev_aio_create /dev/sdc aio0
> 3. ./rpc.py start_nbd_disk aio0 /dev/nbd0
>
> I am using "perf top" to get the performance.
>
> On 9/4/19, 4:03 PM, "Luse, Paul E" wrote:
>
> Hi Rishabh,
>
> Maybe it would help (me at least) if you described the complete & exact
> steps for your test - both setup of the env & test, and the command to
> profile. Can you send that out?
>
> Thx
> Paul
>
> -----Original Message-----
> From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> Sent: Wednesday, September 4, 2019 2:45 PM
> To: Walker, Benjamin; Harris, James R; spdk(a)lists.01.org; Luse, Paul E
> <paul.e.luse(a)intel.com>
> Cc: Chen, Xiaoxi; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian
> Subject: Re: [SPDK] NBD with SPDK
>
> Yes, I am using 64 queue depth with one thread in fio. I am using AIO.
> This profiling is for the entire system. I don't know why the SPDK threads
> are idle.
>
> On 9/4/19, 11:08 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
>
> On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
> > I got the run again. It is with 4k write.
> >
> >     13.16%  vhost  [.] spdk_ring_dequeue
> >      6.08%  vhost  [.] rte_rdtsc
> >      4.77%  vhost  [.] spdk_thread_poll
> >      2.85%  vhost  [.] _spdk_reactor_run
>
> You're doing high queue depth for at least 30 seconds while the trace
> runs, right? Using fio with the libaio engine on the NBD device is
> probably the way to go. Are you limiting the profiling to just the core
> where the main SPDK process is pinned? I'm asking because SPDK still
> appears to be mostly idle, and I suspect the time is being spent in some
> other thread (in the kernel). Consider capturing a profile for the entire
> system. It will have fio stuff in it, but the expensive stuff should still
> generally bubble up to the top.
>
> Thanks,
> Ben
>
> > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> >
> > I got the profile with the first run.
> >
> >     27.91%  vhost               [.] spdk_ring_dequeue
> >     12.94%  vhost               [.] rte_rdtsc
> >     11.00%  vhost               [.] spdk_thread_poll
> >      6.15%  vhost               [.] _spdk_reactor_run
> >      4.35%  [kernel]            [k] syscall_return_via_sysret
> >      3.91%  vhost               [.] _spdk_msg_queue_run_batch
> >      3.38%  vhost               [.] _spdk_event_queue_run_batch
> >      2.83%  [unknown]           [k] 0xfffffe000000601b
> >      1.45%  vhost               [.] spdk_thread_get_from_ctx
> >      1.20%  [kernel]            [k] __fget
> >      1.14%  libpthread-2.27.so  [.] __libc_read
> >      1.00%  libc-2.27.so        [.] 0x000000000018ef76
> >      0.99%  libc-2.27.so        [.] 0x000000000018ef79
> >
> > Thanks
> > Rishabh Mittal
> >
> > On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
> >
> > That's great. Keep an eye out for the items Ben mentions below - at least
> > the first one should be quick to implement and compare both profile data
> > and measured performance.
> >
> > Don't forget about the community meetings either, a great place to chat
> > about these kinds of things.
> > https://spdk.io/community/
> > The next one is tomorrow morning US time.
> >
> > Thx
> > Paul
> >
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal,
> > Rishabh via SPDK
> > Sent: Thursday, August 15, 2019 6:50 PM
> > To: Harris, James R; Walker, Benjamin <benjamin.walker(a)intel.com>;
> > spdk(a)lists.01.org
> > Cc: Mittal, Rishabh; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian;
> > Kadayam, Hari <hkadayam(a)ebay.com>
> > Subject: Re: [SPDK] NBD with SPDK
> >
> > Thanks. I will get the profiling by next week.
> >
> > On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> >
> > On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> >
> > Hi Jim,
> >
> > What tool do you use for profiling?
> >
> > Hi Rishabh,
> >
> > Mostly I just use "perf top".
> >
> > -Jim
> >
> > Thanks
> > Rishabh Mittal
> >
> > On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> >
> > On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com>
> > wrote:
> >
> > When an I/O is performed in the process initiating the I/O to a file, the
> > data goes into the OS page cache buffers at a layer far above the bio
> > stack (somewhere up in VFS). If SPDK were to reserve some memory and hand
> > it off to your kernel driver, your kernel driver would still need to copy
> > it to that location out of the page cache buffers. We can't safely share
> > the page cache buffers with a user space process.
> >
> > I think Rishabh was suggesting that SPDK reserve the virtual address
> > space only. Then the kernel could map the page cache buffers into that
> > virtual address space. That would not require a data copy, but would
> > require the mapping operations.
> >
> > I think the profiling data would be really helpful - to quantify how much
> > of the 50 us is due to copying the 4 KB of data. That can help drive next
> > steps on how to optimize the SPDK NBD module.
> >
> > Thanks,
> >
> > -Jim
> >
> > As Paul said, I'm skeptical that the memcpy is significant in the overall
> > performance you're measuring. I encourage you to go look at some
> > profiling data and confirm that the memcpy is really showing up. I
> > suspect the overhead is instead primarily in these spots:
> >
> > 1) Dynamic buffer allocation in the SPDK NBD backend.
> >
> > As Paul indicated, the NBD target is dynamically allocating memory for
> > each I/O. The NBD backend wasn't designed to be fast - it was designed to
> > be simple. Pooling would be a lot faster and is something fairly easy to
> > implement (a rough sketch follows after this list).
> >
> > 2) The way SPDK does the syscalls when it implements the NBD backend.
> >
> > Again, the code was designed to be simple, not high performance. It
> > simply calls read() and write() on the socket for each command. There are
> > much higher performance ways of doing this, they're just more complex to
> > implement.
> >
> > 3) The lack of multi-queue support in NBD.
> >
> > Every I/O is funneled through a single sockpair up to user space. That
> > means there is locking going on. I believe this is just a limitation of
> > NBD today - it doesn't plug into the block-mq stuff in the kernel and
> > expose multiple sockpairs. But someone more knowledgeable on the kernel
> > stack would need to take a look.
> >
> > Thanks,
> > Ben
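On point 1 above (pooling instead of per-I/O allocation), the idea can be as
simple as a preallocated freelist of request contexts and data buffers sized
to the queue depth. This is a hypothetical sketch, not the actual SPDK NBD
code; the names and sizes are made up for illustration.

/* Hypothetical sketch of a buffer/context pool for an NBD-style backend -
 * not the actual SPDK NBD code. Instead of allocating memory for every I/O,
 * preallocate a fixed set of contexts at startup and recycle them. */
#include <stdlib.h>

#define POOL_SIZE 128               /* >= the NBD queue depth */
#define BUF_SIZE  (128 * 1024)      /* largest I/O we expect to see */

struct io_ctx {
    struct io_ctx *next;            /* freelist link */
    void          *buf;             /* preallocated data buffer */
};

struct io_pool {
    struct io_ctx *free_head;
    struct io_ctx  ctx[POOL_SIZE];
};

static int io_pool_init(struct io_pool *p)
{
    p->free_head = NULL;
    for (int i = 0; i < POOL_SIZE; i++) {
        /* A real SPDK backend would want pinned, DMA-able memory here;
         * plain aligned allocations keep the sketch self-contained. */
        if (posix_memalign(&p->ctx[i].buf, 4096, BUF_SIZE) != 0)
            return -1;
        p->ctx[i].next = p->free_head;
        p->free_head = &p->ctx[i];
    }
    return 0;
}

/* Get a context for a new command; NULL means "queue full, back off". */
static struct io_ctx *io_ctx_get(struct io_pool *p)
{
    struct io_ctx *c = p->free_head;

    if (c != NULL)
        p->free_head = c->next;
    return c;
}

/* Return a context when the command completes. */
static void io_ctx_put(struct io_pool *p, struct io_ctx *c)
{
    c->next = p->free_head;
    p->free_head = c;
}

If the backend is only touched from a single SPDK thread, the freelist needs
no locking; otherwise something like SPDK's spdk_mempool (or a lock) would be
the natural choice.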
> > > Couple of things that I am not really sure about in this flow:
> > > 1. How memory registration is going to work with the RDMA driver.
> > > 2. What changes are required in SPDK memory management.
> > >
> > > Thanks
> > > Rishabh Mittal
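On Rishabh's first question (memory registration), SPDK's env layer exposes
spdk_mem_register()/spdk_mem_unregister() for making memory that SPDK did not
allocate itself visible to its address translation maps, which is what
transports such as RDMA build their registrations on top of (as I understand
it). A rough sketch, with hypothetical wrapper names and a plain aligned
allocation standing in for whatever pinned memory the kernel side would
actually hand over:

/* Rough sketch only - assumes the env-layer spdk_mem_register() /
 * spdk_mem_unregister() APIs; the wrapper functions and sizes here are
 * hypothetical. */
#include <stdio.h>
#include <stdlib.h>

#include "spdk/env.h"

#define REGION_SIZE (2 * 1024 * 1024)

int register_external_buffers(void **out_buf)
{
    void *buf = NULL;

    /* 2 MB alignment so the region maps cleanly onto hugepage-sized chunks
     * of SPDK's memory map. */
    if (posix_memalign(&buf, 0x200000, REGION_SIZE) != 0)
        return -1;

    /* After this call SPDK's translation maps (and the transports layered
     * on them) know about the region, so it can be used as an I/O buffer. */
    if (spdk_mem_register(buf, REGION_SIZE) != 0) {
        fprintf(stderr, "spdk_mem_register failed\n");
        free(buf);
        return -1;
    }

    *out_buf = buf;
    return 0;
}

void unregister_external_buffers(void *buf)
{
    spdk_mem_unregister(buf, REGION_SIZE);
    free(buf);
}

Registering regions up front and reusing them, rather than per I/O, keeps the
registration cost off the I/O path, which matters for RDMA in particular.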