From: Szmyd, Brian
Subject: Re: [SPDK] NBD with SPDK
Date: Fri, 06 Sep 2019 02:14:15 +0000
In-Reply-To: CAE7zfpP3yWUJCzFMN5w61KkoA1fozF4uyXsfR_fiHNC16Q=vGw@mail.gmail.com
To: spdk@lists.01.org

I believe this option has the same number of copies, since you're still sharing
the memory with the Kata VM kernel, not the application itself. This is an
option that the development of a virtio-vhost-user driver does not prevent;
it's merely an option to allow non-Kata containers to also use the same device.

I will note that a virtio-vhost-user driver also allows one to project other
device types than just block devices into the kernel device stack. One could
also write a user application that exposed an input, network, console, gpu or
socket device as well. Not that I have any interest in these...

On 9/5/19, 8:08 PM, "Huang Zhiteng" wrote:

    Since this SPDK bdev is intended to be consumed by a user application
    running inside a container, we do have the possibility to run the user
    application inside a Kata container instead. A Kata container does
    introduce a layer of I/O virtualization, therefore we convert a user-space
    block device on the host into a kernel block device inside the VM, but
    with fewer memory copies than NBD thanks to SPDK vhost. A Kata container
    might impose higher overhead than a plain container, but hopefully it's
    lightweight enough that the overhead is negligible.

    On Fri, Sep 6, 2019 at 5:22 AM Walker, Benjamin wrote:
    >
    > On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
    > > Hi Paul,
    > >
    > > Rather than put the effort into a formalized document, here is a brief
    > > description of the solution I have been investigating, just to get an
    > > opinion on its feasibility or even workability.
    > >
    > > Some background and a reiteration of the problem to set things up. I
    > > apologize for reiterating anything and including details that some may
    > > already know.
    > >
    > > We are looking for a solution that allows us to write a custom bdev for
    > > the SPDK bdev layer that distributes I/O between different NVMe-oF
    > > targets that we have attached, and then present that to our application
    > > as either a raw block device or a filesystem mountpoint.
    > >
    > > This is normally (as I understand it) done by exposing a device via QEMU
    > > to a VM using the vhost target. This SPDK target has implemented the
    > > virtio-scsi device (among others) according to this spec:
    > >
    > > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021
    > >
    > > The VM kernel then uses a virtio-scsi module to attach said device into
    > > its SCSI mid-layer and have the device enumerated as a /dev/sd[a-z]+
    > > device.
    > >
    > > The problem is that QEMU virtualizes a PCIe bus for the guest kernel
    > > virtio-pci driver to discover the virtio devices and bind them to the
    > > virtio-scsi driver. There really is no other way (other than platform
    > > MMIO-type devices) to attach a device to the virtio-scsi driver.
    > >
    > > SPDK exposes the virtio device to the VM via QEMU, which has written a
    > > "user space" version of the vhost bus. This driver then translates the
    > > API into the virtio-pci specification:
    > >
    > > https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst
    > >
    > > This uses an eventfd descriptor for interrupting the non-polling side of
    > > the queue and a UNIX domain socket to set up (and control) the shared
    > > memory, which contains the I/O buffers and virtio queues. This is
    > > documented in SPDK's own documentation and diagrammed here:
    > >
    > > https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
    > >
    > > If we could implement this vhost-user QEMU target as a virtio driver in
    > > the kernel as an alternative to the virtio-pci driver, it could bind an
    > > SPDK vhost into the host kernel as a virtio device, enumerated in the
    > > /dev/sd[a-z]+ tree for our containers to bind. Attached is a draft block
    > > diagram.
    >
    > If you think of QEMU as just another user-space process, and the SPDK
    > vhost target as a user-space process, then it's clear that vhost-user is
    > simply a cross-process IPC mechanism based on shared memory. The "shared
    > memory" part is the critical part of that description - QEMU pre-registers
    > all of the memory that will be used for I/O buffers (in fact, all of the
    > memory that is mapped into the guest) with the SPDK process by sending fds
    > across a Unix domain socket.
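For illustration, here is a minimal C sketch of that style of control plane:
one process hands a shared-memory region and a "kick" eventfd to a peer by
sending the file descriptors over a Unix domain socket with SCM_RIGHTS. It is
not code from QEMU, SPDK, or the vhost-user spec - the socket path, region
size, and framing are invented for the example.

    /* Illustrative only: share one memory region and one eventfd with a peer
     * process over a Unix domain socket, vhost-user style. Paths and sizes
     * are made up for the example. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <sys/un.h>
    #include <sys/eventfd.h>
    #include <sys/mman.h>

    static int send_two_fds(int sock, int fd0, int fd1)
    {
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union {                                 /* aligned control buffer */
            char buf[CMSG_SPACE(2 * sizeof(int))];
            struct cmsghdr align;
        } u = { { 0 } };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
        int fds[2] = { fd0, fd1 };

        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type  = SCM_RIGHTS;             /* transfer descriptors */
        c->cmsg_len   = CMSG_LEN(sizeof(fds));
        memcpy(CMSG_DATA(c), fds, sizeof(fds));
        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }

    int main(void)
    {
        /* Region that would hold the queues and I/O buffers. */
        int memfd = memfd_create("io-region", 0);
        ftruncate(memfd, 2 * 1024 * 1024);

        /* Doorbell ("kick") for the non-polling side. */
        int kickfd = eventfd(0, 0);

        int sock = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        strncpy(addr.sun_path, "/tmp/ctrl.sock", sizeof(addr.sun_path) - 1);

        if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            perror("connect");
            return 1;
        }
        if (send_two_fds(sock, memfd, kickfd) != 0)
            perror("sendmsg");
        return 0;
    }

The receiving process would recvmsg() the descriptors, mmap() the region to
reach the queues and buffers, and then either poll that memory or wait on the
eventfd.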
    >
    > If you move this code into the kernel, you have to solve two issues:
    >
    > 1) What memory is it registering with the SPDK process? The kernel driver
    > has no idea which application process may route I/O to it - in fact, the
    > application process may not even exist yet - so it isn't memory allocated
    > to the application process. Maybe you have a pool of kernel buffers that
    > get mapped into the SPDK process, and when the application process
    > performs I/O the kernel copies into those buffers prior to telling SPDK
    > about them? That would work, but now you're back to doing a data copy. I
    > do think you can get it down to 1 data copy instead of 2 with a scheme
    > like this.
    >
    > 2) One of the big performance problems you're seeing is syscall overhead
    > in NBD. If you still have a kernel block device that routes messages up to
    > the SPDK process, the application process is making the same syscalls
    > because it's still interacting with a block device in the kernel, but
    > you're right that the backend SPDK implementation could be polling on
    > shared memory rings and potentially run more efficiently.
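As a rough sketch of what "polling on shared memory rings" can look like on
the backend side, here is a single-producer/single-consumer ring that a
consumer drains without a syscall per command. The descriptor layout, ring
size, and field names are invented for illustration and are not NBD's or
SPDK's actual formats.

    /* Illustrative SPSC ring living in a shared memory region. The producer
     * (kernel driver or application) advances 'head'; the backend advances
     * 'tail' while polling, with no syscall per command. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SIZE 256                    /* power of two, illustrative */

    struct io_desc {                         /* hypothetical command descriptor */
        uint64_t offset;
        uint32_t length;
        uint32_t opcode;                     /* e.g. 0 = read, 1 = write */
    };

    struct io_ring {                         /* placed in the shared region */
        _Atomic uint32_t head;               /* written by the producer */
        _Atomic uint32_t tail;               /* written by the consumer */
        struct io_desc slots[RING_SIZE];
    };

    /* Poll once: returns true if a descriptor was consumed. */
    static bool ring_poll(struct io_ring *r, struct io_desc *out)
    {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

        if (tail == head)
            return false;                    /* nothing new; keep polling */

        *out = r->slots[tail % RING_SIZE];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

    /* Typical backend loop: busy-poll the ring, dispatch, complete. */
    void backend_poll_loop(struct io_ring *r)
    {
        struct io_desc d;

        for (;;) {
            if (ring_poll(r, &d)) {
                /* hand d to the bdev layer, then post a completion */
            }
            /* optionally arm an eventfd and sleep after N empty polls */
        }
    }

A real design would pair this with a completion ring going the other way, and
usually an eventfd so the poller can sleep once the ring has been idle for a
while.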
    > >
    > > Since we will not have a real bus to signal the driver to probe for new
    > > devices, we can use a sysfs interface for the application to notify the
    > > driver of a new socket and eventfd pair to set up a new virtio-scsi
    > > instance. Otherwise the design simply moves the vhost-user driver from
    > > the QEMU application into the host kernel itself.
    > >
    > > It's my understanding that this will avoid a lot more system calls and
    > > copies compared to exposing an iSCSI device or NBD device as we're
    > > currently discussing. Does this seem feasible?
    >
    > What you really want is a "block device in user space" solution that's
    > higher performance than NBD, and while that's been tried many, many times
    > in the past, I do think there is a great opportunity here for someone. I'm
    > not sure that the interface between the block device process and the
    > kernel is best done as a modification of NBD or a wholesale replacement by
    > vhost-user-scsi, but I'd like to throw in a third option to consider - use
    > NVMe queues in shared memory as the interface instead. The NVMe queues are
    > going to be much more efficient than virtqueues for storage commands.
    >
    > >
    > > Thanks,
    > > Brian
    > >
    > > On 9/5/19, 12:32 PM, "Mittal, Rishabh" wrote:
    > >
    > >     Hi Paul.
    > >
    > >     Thanks for investigating it.
    > >
    > >     We have one more idea floating around. Brian is going to send you a
    > >     proposal shortly. If the other proposal seems feasible to you, we
    > >     can evaluate the work required for both proposals.
    > >
    > >     Thanks
    > >     Rishabh Mittal
    > >
    > >     On 9/5/19, 11:09 AM, "Luse, Paul E" wrote:
    > >
    > >         Hi,
    > >
    > >         So I was able to perform the same steps here, and I think one
    > >         of the keys to really seeing what's going on is to start perf
    > >         top like this:
    > >
    > >             "perf top --sort comm,dso,symbol -C 0"
    > >
    > >         to get a more focused view by sorting on command, shared object
    > >         and symbol.
    > >
    > >         Attached are 2 snapshots, one with a NULL back end for nbd and
    > >         one with libaio/nvme. Some notes after chatting with Ben a bit;
    > >         please read through and let us know what you think:
    > >
    > >         * in both cases the vast majority of the highest-overhead
    > >           activities are kernel
    > >         * the "copy_user_enhanced" symbol on the NULL case (it shows up
    > >           on the other as well, but you have to scroll way down to see
    > >           it) is the user/kernel space copy; nothing SPDK can do about
    > >           that
    > >         * the syscalls that dominate in both cases are likely something
    > >           that can be improved on by changing how SPDK interacts with
    > >           nbd. Ben had a couple of ideas, including (a) using libaio to
    > >           interact with the nbd fd as opposed to interacting with the
    > >           nbd socket, and (b) "batching" wherever possible, for example
    > >           on writes to nbd, investigate not ack'ing them until some
    > >           number have completed
    > >         * the kernel slab* commands are likely nbd kernel driver
    > >           allocations/frees in the I/O path; one possibility would be
    > >           to look at optimizing the nbd kernel driver for this one
    > >         * the libc item on the NULL chart also shows up on the libaio
    > >           profile, however it is again way down the scroll so it didn't
    > >           make the screenshot :) This could be a zeroing of something
    > >           somewhere in the SPDK nbd driver
    > >
    > >         It looks like this data supports what Ben had suspected a while
    > >         back: much of the overhead we're looking at is kernel nbd.
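Regarding idea (a) in the list above - using libaio rather than a blocking
read()/write() per command - the following is a generic, hedged skeleton of
the submit/reap pattern with batching. The target path, queue depth, and block
size are placeholders; this is not the actual SPDK nbd code.

    /* Generic libaio skeleton (link with -laio): submit a batch of reads with
     * one io_submit() call and reap completions with io_getevents(), instead
     * of issuing one blocking call per command. */
    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BATCH 8
    #define BLK   4096

    int main(int argc, char **argv)
    {
        io_context_t ctx = 0;
        struct iocb iocbs[BATCH], *ptrs[BATCH];
        struct io_event events[BATCH];
        void *bufs[BATCH];
        int fd, i, done = 0;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <file-or-device>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0 || io_setup(BATCH, &ctx) != 0) {
            fprintf(stderr, "setup failed\n");
            return 1;
        }

        for (i = 0; i < BATCH; i++) {
            if (posix_memalign(&bufs[i], BLK, BLK) != 0)  /* O_DIRECT alignment */
                return 1;
            io_prep_pread(&iocbs[i], fd, bufs[i], BLK, (long long)i * BLK);
            ptrs[i] = &iocbs[i];
        }

        /* One syscall submits the whole batch... */
        if (io_submit(ctx, BATCH, ptrs) != BATCH) {
            fprintf(stderr, "io_submit failed\n");
            return 1;
        }

        /* ...and a few syscalls reap all the completions. */
        while (done < BATCH) {
            int n = io_getevents(ctx, 1, BATCH - done, events, NULL);
            if (n <= 0)
                break;
            done += n;
        }
        printf("completed %d of %d I/Os\n", done, BATCH);

        io_destroy(ctx);
        close(fd);
        return 0;
    }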
    > >         Anyway, let us know what you think and if you want to explore
    > >         any of the ideas above any further, or if you see something
    > >         else in the data that looks worthy of note.
    > >
    > >         Thx
    > >         Paul
    > >
    > >         -----Original Message-----
    > >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
    > >         Sent: Wednesday, September 4, 2019 4:27 PM
    > >         To: Mittal, Rishabh ; Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R ; spdk(a)lists.01.org
    > >         Cc: Chen, Xiaoxi ; Szmyd, Brian ; Kadayam, Hari
    > >         Subject: Re: [SPDK] NBD with SPDK
    > >
    > >         Cool, thanks for sending this. I will try and repro tomorrow
    > >         here and see what kind of results I get.
    > >
    > >         Thx
    > >         Paul
    > >
    > >         -----Original Message-----
    > >         From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
    > >         Sent: Wednesday, September 4, 2019 4:23 PM
    > >         To: Luse, Paul E ; Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R ; spdk(a)lists.01.org
    > >         Cc: Chen, Xiaoxi ; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian
    > >         Subject: Re: [SPDK] NBD with SPDK
    > >
    > >         Avg CPU utilization is very low when I am running this.
    > >
    > >         09/04/2019 04:21:40 PM
    > >         avg-cpu:  %user   %nice %system %iowait  %steal   %idle
    > >                    2.59    0.00    2.57    0.00    0.00   94.84
    > >
    > >         Device     r/s       w/s   rkB/s      wkB/s  rrqm/s    wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm   %util
    > >         sda       0.00      0.20    0.00       0.80    0.00      0.00   0.00   0.00     0.00     0.00    0.00      0.00      4.00    0.00    0.00
    > >         sdb       0.00      0.00    0.00       0.00    0.00      0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00    0.00    0.00
    > >         sdc       0.00  28846.80    0.00  191555.20    0.00  18211.00   0.00  38.70     0.00     1.03   29.64      0.00      6.64    0.03  100.00
    > >         nb0       0.00  47297.00    0.00  191562.40    0.00    593.60   0.00   1.24     0.00     1.32   61.83      0.00      4.05    0
    > >
    > >         On 9/4/19, 4:19 PM, "Mittal, Rishabh" wrote:
    > >
    > >             I am using this command:
    > >
    > >             fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --rwmixread=0 --bsrange=4k-4k --direct=1 --filename=/dev/nbd0 --numjobs=8 --runtime 120 --time_based --group_reporting
    > >
    > >             I have created the device by using these commands:
    > >             1. ./root/spdk/app/vhost
    > >             2. ./rpc.py bdev_aio_create /dev/sdc aio0
    > >             3. ./rpc.py start_nbd_disk aio0 /dev/nbd0
    > >
    > >             I am using "perf top" to get the performance.
    > >
    > >             On 9/4/19, 4:03 PM, "Luse, Paul E" wrote:
    > >
    > >                 Hi Rishabh,
    > >
    > >                 Maybe it would help (me at least) if you described the
    > >                 complete & exact steps for your test - both setup of
    > >                 the env & test, and the command to profile. Can you
    > >                 send that out?
    > >
    > >                 Thx
    > >                 Paul
    > >
    > >                 -----Original Message-----
    > >                 From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
    > >                 Sent: Wednesday, September 4, 2019 2:45 PM
    > >                 To: Walker, Benjamin ; Harris, James R ; spdk(a)lists.01.org; Luse, Paul E <paul.e.luse(a)intel.com>
    > >                 Cc: Chen, Xiaoxi ; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian
    > >                 Subject: Re: [SPDK] NBD with SPDK
    > >
    > >                 Yes, I am using 64 queue depth with one thread in fio.
    > >                 I am using AIO. This profiling is for the entire
    > >                 system. I don't know why the spdk threads are idle.
    > >
    > >                 On 9/4/19, 11:08 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
    > >
    > >                     On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
    > >                     > I got the run again. It is with 4k write.
    > >                     >
    > >                     > 13.16%  vhost  [.] spdk_ring_dequeue
    > >                     >  6.08%  vhost  [.] rte_rdtsc
    > >                     >  4.77%  vhost  [.] spdk_thread_poll
    > >                     >  2.85%  vhost  [.] _spdk_reactor_run
    > >
    > >                     You're doing high queue depth for at least 30
    > >                     seconds while the trace runs, right? Using fio with
    > >                     the libaio engine on the NBD device is probably the
    > >                     way to go. Are you limiting the profiling to just
    > >                     the core where the main SPDK process is pinned? I'm
    > >                     asking because SPDK still appears to be mostly idle,
    > >                     and I suspect the time is being spent in some other
    > >                     thread (in the kernel). Consider capturing a profile
    > >                     for the entire system. It will have fio stuff in it,
    > >                     but the expensive stuff still should generally
    > >                     bubble up to the top.
    > >
    > >                     Thanks,
    > >                     Ben
    > >
    > >                     >
    > >                     > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
    > >                     >
    > >                     >     I got the profile with the first run.
    > >                     >
    > >                     >     27.91%  vhost               [.] spdk_ring_dequeue
    > >                     >     12.94%  vhost               [.] rte_rdtsc
    > >                     >     11.00%  vhost               [.] spdk_thread_poll
    > >                     >      6.15%  vhost               [.] _spdk_reactor_run
    > >                     >      4.35%  [kernel]            [k] syscall_return_via_sysret
    > >                     >      3.91%  vhost               [.] _spdk_msg_queue_run_batch
    > >                     >      3.38%  vhost               [.] _spdk_event_queue_run_batch
    > >                     >      2.83%  [unknown]           [k] 0xfffffe000000601b
    > >                     >      1.45%  vhost               [.] spdk_thread_get_from_ctx
    > >                     >      1.20%  [kernel]            [k] __fget
    > >                     >      1.14%  libpthread-2.27.so  [.] __libc_read
    > >                     >      1.00%  libc-2.27.so        [.] 0x000000000018ef76
    > >                     >      0.99%  libc-2.27.so        [.] 0x000000000018ef79
    > >                     >
    > >                     >     Thanks
    > >                     >     Rishabh Mittal
    > >                     >
    > >                     >     On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
    > >                     >
    > >                     >         That's great. Keep an eye out for the
    > >                     >         items Ben mentions below - at least the
    > >                     >         first one should be quick to implement,
    > >                     >         and compare both profile data and
    > >                     >         measured performance.
    > >                     >
    > >                     >         Don't forget about the community meetings
    > >                     >         either, a great place to chat about these
    > >                     >         kinds of things:
    > >                     >         https://spdk.io/community/
    > >                     >         The next one is tomorrow morning US time.
    > >                     >
    > >                     >         Thx
    > >                     >         Paul
    > >                     >
    > >                     >         -----Original Message-----
    > >                     >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal, Rishabh via SPDK
    > >                     >         Sent: Thursday, August 15, 2019 6:50 PM
    > >                     >         To: Harris, James R ; Walker, Benjamin <benjamin.walker(a)intel.com>; spdk(a)lists.01.org
    > >                     >         Cc: Mittal, Rishabh ; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian ; Kadayam, Hari <hkadayam(a)ebay.com>
    > >                     >         Subject: Re: [SPDK] NBD with SPDK
    > >                     >
    > >                     >         Thanks. I will get the profiling by next week.
    > >                     >
    > >                     >         On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
    > >                     >
    > >                     >             On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
    > >                     >
    > >                     >                 Hi Jim
    > >                     >
    > >                     >                 What tool do you use for profiling?
    > >                     >
    > >                     >             Hi Rishabh,
    > >                     >
    > >                     >             Mostly I just use "perf top".
    > >                     >
    > >                     >             -Jim
    > >                     >
    > >                     >                 Thanks
    > >                     >                 Rishabh Mittal
    > >                     >
    > >                     >                 On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
    > >                     >
    > >                     >                     On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
    > >                     >
    > >                     >                         When an I/O is performed in the process
    > >                     >                         initiating the I/O to a file, the data goes
    > >                     >                         into the OS page cache buffers at a layer far
    > >                     >                         above the bio stack (somewhere up in VFS). If
    > >                     >                         SPDK were to reserve some memory and hand it
    > >                     >                         off to your kernel driver, your kernel driver
    > >                     >                         would still need to copy it to that location
    > >                     >                         out of the page cache buffers. We can't safely
    > >                     >                         share the page cache buffers with a user space
    > >                     >                         process.
    > >                     >
    > >                     >                     I think Rishabh was suggesting that SPDK reserve
    > >                     >                     the virtual address space only. Then the kernel
    > >                     >                     could map the page cache buffers into that virtual
    > >                     >                     address space. That would not require a data copy,
    > >                     >                     but would require the mapping operations.
    > >                     >
    > >                     >                     I think the profiling data would be really helpful
    > >                     >                     - to quantify how much of the 50us is due to
    > >                     >                     copying the 4KB of data. That can help drive next
    > >                     >                     steps on how to optimize the SPDK NBD module.
    > >                     >
    > >                     >                     Thanks,
    > >                     >
    > >                     >                     -Jim
    > >                     >
    > >                     >                         As Paul said, I'm skeptical that the memcpy is
    > >                     >                         significant in the overall performance you're
    > >                     >                         measuring. I encourage you to go look at some
    > >                     >                         profiling data and confirm that the memcpy is
    > >                     >                         really showing up. I suspect the overhead is
    > >                     >                         instead primarily in these spots:
    > >                     >
    > >                     >                         1) Dynamic buffer allocation in the SPDK NBD
    > >                     >                         backend.
    > >                     >
    > >                     >                         As Paul indicated, the NBD target is
    > >                     >                         dynamically allocating memory for each I/O.
    > >                     >                         The NBD backend wasn't designed to be fast -
    > >                     >                         it was designed to be simple. Pooling would be
    > >                     >                         a lot faster and is something fairly easy to
    > >                     >                         implement.
    > >                     >
    > >                     >                         2) The way SPDK does the syscalls when it
    > >                     >                         implements the NBD backend.
    > >                     >
    > >                     >                         Again, the code was designed to be simple, not
    > >                     >                         high performance. It simply calls read() and
    > >                     >                         write() on the socket for each command. There
    > >                     >                         are much higher performance ways of doing
    > >                     >                         this, they're just more complex to implement.
    > >                     >
    > >                     >                         3) The lack of multi-queue support in NBD.
    > >                     >
    > >                     >                         Every I/O is funneled through a single
    > >                     >                         sockpair up to user space. That means there is
    > >                     >                         locking going on. I believe this is just a
    > >                     >                         limitation of NBD today - it doesn't plug into
    > >                     >                         the block-mq stuff in the kernel and expose
    > >                     >                         multiple sockpairs. But someone more
    > >                     >                         knowledgeable on the kernel stack would need
    > >                     >                         to take a look.
    > >                     >
    > >                     >                         Thanks,
    > >                     >                         Ben
    > >                     >
    > >                     >                         > A couple of things that I am not really sure
    > >                     >                         > about in this flow are:
    > >                     >                         > 1. How is memory registration going to work
    > >                     >                         >    with the RDMA driver?
    > >                     >                         > 2. What changes are required in SPDK memory
    > >                     >                         >    management?
    > >                     >                         >
    > >                     >                         > Thanks
    > >                     >                         > Rishabh Mittal
    >
    > _______________________________________________
    > SPDK mailing list
    > SPDK(a)lists.01.org
    > https://lists.01.org/mailman/listinfo/spdk

    --
    Regards
    Huang Zhiteng
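Finally, as a hedged illustration of point 1) in Ben's list of suspected
overheads above (dynamic buffer allocation in the NBD backend), here is what
pre-allocated pooling can look like with SPDK's mempool API. The pool name,
element count, and buffer size are invented, and the exact signatures and
constants should be checked against the SPDK version in use.

    /* Sketch: replace per-I/O malloc/free with a pre-allocated pool.
     * Uses SPDK's mempool API; names, counts, and sizes are illustrative
     * and should be verified against your SPDK headers. */
    #include "spdk/stdinc.h"
    #include "spdk/env.h"

    #define NBD_BUF_COUNT 1024               /* illustrative pool depth */
    #define NBD_BUF_SIZE  (128 * 1024)       /* illustrative max I/O size */

    static struct spdk_mempool *g_io_buf_pool;

    static int io_buf_pool_init(void)
    {
        g_io_buf_pool = spdk_mempool_create("nbd_io_bufs",
                                            NBD_BUF_COUNT, NBD_BUF_SIZE,
                                            SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
                                            SPDK_ENV_SOCKET_ID_ANY);
        return g_io_buf_pool ? 0 : -1;
    }

    /* In the I/O path: take a buffer from the pool instead of malloc()... */
    static void *io_buf_get(void)
    {
        return spdk_mempool_get(g_io_buf_pool);
    }

    /* ...and return it on completion instead of free(). */
    static void io_buf_put(void *buf)
    {
        spdk_mempool_put(g_io_buf_pool, buf);
    }

The same pattern applies to any per-command bookkeeping the backend currently
allocates on the heap for each I/O.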