From: Szmyd, Brian
Subject: Re: [SPDK] NBD with SPDK
Date: Thu, 05 Sep 2019 22:00:20 +0000
Message-ID: <2E478AD2-075B-4E67-A9F6-83E545AD7072@ebay.com>
In-Reply-To: <049b94758aa1830f66a4069eacbdd12c85476a0b.camel@intel.com>
To: spdk@lists.01.org

> What memory is it registering with the SPDK process?

Only the kernel and the SPDK application would share memory. Writes from the
application will most likely be going through the VFS, so I don't think it's
feasible to share the buffers directly with the SPDK app. Yes, there would be
a copy from the application's write buffers into the shared memory region,
done by the kernel. This is how it works already with virtio-pci under QEMU,
right? I'm not trying to optimize that path; as you say, it's a removal of a
copy on the other side.

> If you still have a kernel block device that routes messages up to the SPDK
> process, the application process is making the same syscalls because it's
> still interacting with a block device in the kernel.

Correct, there is no intention to remove this syscall from the application
into the kernel, since the write will also be accompanied by VFS operations
on the block device that only the kernel can provide. It was my impression
that most of the added latency we are concerned with comes from transforming
said writes/reads to and from NBD messages forwarded to the SPDK application
over a normal socket.

> use NVMe queues in shared memory as the interface instead

You could be correct that this is more efficient. It would involve
implementing something I assumed would end up quite similar to the virtio
spec, since we don't want to use TCP messages (NVMe-oF) or act as a PCIe
device.

On 9/5/19, 3:22 PM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:

On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
> Hi Paul,
>
> Rather than put the effort into a formalized document, here is a brief
> description of the solution I have been investigating, just to get an
> opinion on feasibility or even workability.
>
> Some background and a reiteration of the problem to set things up. I
> apologize for reiterating anything and including details that some may
> already know.
>
> We are looking for a solution that allows us to write a custom bdev for
> the SPDK bdev layer that distributes I/O between the different NVMe-oF
> targets we have attached and then presents that to our application as
> either a raw block device or a filesystem mountpoint.
>
> This is normally (as I understand it) done by exposing a device via QEMU
> to a VM using the vhost target. This SPDK target has implemented the
> virtio-scsi (among others) device according to this spec:
>
> https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021
>
> The VM kernel then uses a virtio-scsi module to attach said device into
> its SCSI mid-layer and then has the device enumerated as a /dev/sd[a-z]+
> device.
> The problem is that QEMU virtualizes a PCIe bus for the guest kernel
> virtio-pci driver to discover the virtio devices and bind them to the
> virtio-scsi driver. There really is no other way (other than platform
> MMIO type devices) to attach a device to the virtio-scsi driver.
>
> SPDK exposes the virtio device to the VM via QEMU, which has written a
> "user space" version of the vhost bus. This driver then translates the
> API into the virtio-pci specification:
>
> https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst
>
> This uses an eventfd descriptor for interrupting the non-polling side of
> the queue and a UNIX domain socket to set up (and control) the shared
> memory which contains the I/O buffers and virtio queues. This is
> documented in SPDK's own documentation and diagrammed here:
>
> https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
>
> If we could implement this vhost-user QEMU target as a virtio driver in
> the kernel as an alternative to the virtio-pci driver, it could bind an
> SPDK vhost into the host kernel as a virtio device enumerated in the
> /dev/sd[a-z]+ tree for our containers to bind. Attached is a draft block
> diagram.

If you think of QEMU as just another user-space process, and the SPDK vhost
target as a user-space process, then it's clear that vhost-user is simply a
cross-process IPC mechanism based on shared memory. The "shared memory" part
is the critical part of that description - QEMU pre-registers all of the
memory that will be used for I/O buffers (in fact, all of the memory that is
mapped into the guest) with the SPDK process by sending fds across a Unix
domain socket.

If you move this code into the kernel, you have to solve two issues:

1) What memory is it registering with the SPDK process? The kernel driver has
no idea which application process may route I/O to it - in fact the
application process may not even exist yet - so it isn't memory allocated to
the application process. Maybe you have a pool of kernel buffers that get
mapped into the SPDK process, and when the application process performs I/O
the kernel copies into those buffers prior to telling SPDK about them? That
would work, but now you're back to doing a data copy. I do think you can get
it down to 1 data copy instead of 2 with a scheme like this.

2) One of the big performance problems you're seeing is syscall overhead in
NBD. If you still have a kernel block device that routes messages up to the
SPDK process, the application process is making the same syscalls because
it's still interacting with a block device in the kernel, but you're right
that the backend SPDK implementation could be polling on shared memory rings
and potentially run more efficiently.
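To make the shared-memory hand-off described above concrete, below is a
minimal, self-contained C sketch of the underlying primitive: one side
creates an anonymous memory region and passes its file descriptor to a peer
over a Unix domain socket with SCM_RIGHTS, and the peer maps the same pages.
This only illustrates the mechanism vhost-user's memory-table messages build
on; the region name, size, and use of a socketpair inside a single process
are arbitrary choices for the demo, not SPDK or QEMU code.

/*
 * fd_share.c - pass a shared-memory fd over a Unix domain socket.
 * Requires Linux and glibc >= 2.27 for memfd_create().
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one fd (with a single byte of payload) over a connected Unix socket. */
static int send_fd(int sock, int fd)
{
    char payload = 'M';
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from a connected Unix socket. */
static int recv_fd(int sock)
{
    char payload;
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    int fd = -1;
    if (recvmsg(sock, &msg, 0) == 1 && CMSG_FIRSTHDR(&msg) != NULL)
        memcpy(&fd, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(int));
    return fd;
}

int main(void)
{
    const size_t region_sz = 2 * 1024 * 1024;   /* one 2 MiB region */
    int sv[2];

    /* Stand-in for the vhost-user control socket. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 1;

    /* "Driver" side: allocate the I/O buffer region and register it. */
    int memfd = memfd_create("io-buffers", 0);
    if (memfd < 0 || ftruncate(memfd, region_sz) != 0)
        return 1;
    char *writer = mmap(NULL, region_sz, PROT_READ | PROT_WRITE,
                        MAP_SHARED, memfd, 0);
    if (writer == MAP_FAILED || send_fd(sv[0], memfd) != 0)
        return 1;
    strcpy(writer, "hello from the registering side");

    /* "Target" side: receive the fd and map the same physical pages. */
    int peer_fd = recv_fd(sv[1]);
    char *reader = mmap(NULL, region_sz, PROT_READ, MAP_SHARED, peer_fd, 0);
    if (reader == MAP_FAILED)
        return 1;

    printf("target sees: %s\n", reader);    /* no data copy involved */
    return 0;
}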
> Since we will not have a real bus to signal for the driver to probe for
> new devices, we can use a sysfs interface for the application to notify
> the driver of a new socket and eventfd pair to set up a new virtio-scsi
> instance. Otherwise the design simply moves the vhost-user driver from
> the QEMU application into the host kernel itself.
>
> It's my understanding that this will avoid a lot more system calls and
> copies compared to exposing an iSCSI device or NBD device as we're
> currently discussing. Does this seem feasible?

What you really want is a "block device in user space" solution that's
higher performance than NBD, and while that's been tried many, many times in
the past, I do think there is a great opportunity here for someone. I'm not
sure that the interface between the block device process and the kernel is
best done as a modification of NBD or a wholesale replacement by
vhost-user-scsi, but I'd like to throw in a third option to consider - use
NVMe queues in shared memory as the interface instead. The NVMe queues are
going to be much more efficient than virtqueues for storage commands.

>
> Thanks,
> Brian
>
> On 9/5/19, 12:32 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
>
> Hi Paul.
>
> Thanks for investigating it.
>
> We have one more idea floating around. Brian is going to send you a
> proposal shortly. If the other proposal seems feasible to you, then we can
> evaluate the work required for both proposals.
>
> Thanks
> Rishabh Mittal
>
> On 9/5/19, 11:09 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
>
> Hi,
>
> So I was able to perform the same steps here, and I think one of the keys
> to really seeing what's going on is to start perf top like this:
>
> "perf top --sort comm,dso,symbol -C 0" to get a more focused view by
> sorting on command, shared object and symbol.
>
> Attached are 2 snapshots, one with a NULL back end for nbd and one with
> libaio/nvme. Some notes after chatting with Ben a bit, please read through
> and let us know what you think:
>
> * In both cases the vast majority of the highest-overhead activities are
>   kernel.
> * The "copy_user_enhanced" symbol on the NULL case (it shows up on the
>   other as well, but you have to scroll way down to see it) is the
>   user/kernel space copy; nothing SPDK can do about that.
> * The syscalls that dominate in both cases are likely something that can
>   be improved on by changing how SPDK interacts with nbd. Ben had a couple
>   of ideas, including (a) using libaio to interact with the nbd fd as
>   opposed to interacting with the nbd socket, and (b) "batching" wherever
>   possible, for example on writes to nbd investigate not ack'ing them
>   until some number have completed.
> * The kernel slab* commands are likely nbd kernel driver allocations/frees
>   in the I/O path; one possibility would be to look at optimizing the nbd
>   kernel driver for this one.
> * The libc item on the NULL chart also shows up on the libaio profile but
>   is again way down the scroll, so it didn't make the screenshot :) This
>   could be a zeroing of something somewhere in the SPDK nbd driver.
>
> It looks like this data supports what Ben had suspected a while back: much
> of the overhead we're looking at is kernel nbd. Anyway, let us know what
> you think and if you want to explore any of the ideas above any further,
> or see something else in the data that looks worthy to note.
>
> Thx
> Paul
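As a concrete illustration of ideas (a) and (b) above, here is a generic
libaio sketch (not the SPDK nbd code) showing how a batch of commands can be
submitted with a single io_submit() and reaped with a single io_getevents(),
instead of paying one blocking read()/write() syscall per command. The file
name, queue depth and block size are arbitrary assumptions for the demo;
build with "gcc aio_batch.c -laio".

/*
 * aio_batch.c - batch several I/Os per syscall with libaio.
 */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QD 8        /* commands batched per syscall */
#define BS 4096     /* 4 KiB blocks, matching the fio job later in the thread */

int main(void)
{
    io_context_t ctx = 0;
    struct iocb iocbs[QD], *ptrs[QD];
    struct io_event events[QD];
    void *bufs[QD];

    int fd = open("aio_batch.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0 || io_setup(QD, &ctx) != 0) {
        fprintf(stderr, "setup failed\n");
        return 1;
    }

    /* Prepare QD writes at consecutive offsets. */
    for (int i = 0; i < QD; i++) {
        if (posix_memalign(&bufs[i], BS, BS) != 0)
            return 1;
        memset(bufs[i], 'A' + i, BS);
        io_prep_pwrite(&iocbs[i], fd, bufs[i], BS, (long long)i * BS);
        ptrs[i] = &iocbs[i];
    }

    /* One syscall submits the whole batch... */
    if (io_submit(ctx, QD, ptrs) != QD) {
        fprintf(stderr, "io_submit failed\n");
        return 1;
    }

    /* ...and one syscall reaps all the completions. */
    int done = io_getevents(ctx, QD, QD, events, NULL);
    printf("completed %d of %d I/Os\n", done, QD);

    io_destroy(ctx);
    close(fd);
    return 0;
}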
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
> Sent: Wednesday, September 4, 2019 4:27 PM
> To: Mittal, Rishabh <rimittal(a)ebay.com>; Walker, Benjamin
> <benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>;
> spdk(a)lists.01.org
> Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
> Kadayam, Hari <hkadayam(a)ebay.com>
> Subject: Re: [SPDK] NBD with SPDK
>
> Cool, thanks for sending this. I will try and repro tomorrow here and see
> what kind of results I get.
>
> Thx
> Paul
>
> -----Original Message-----
> From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> Sent: Wednesday, September 4, 2019 4:23 PM
> To: Luse, Paul E <paul.e.luse(a)intel.com>; Walker, Benjamin
> <benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>;
> spdk(a)lists.01.org
> Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>;
> Szmyd, Brian <bszmyd(a)ebay.com>
> Subject: Re: [SPDK] NBD with SPDK
>
> Avg CPU utilization is very low when I am running this.
>
> 09/04/2019 04:21:40 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            2.59    0.00    2.57    0.00    0.00   94.84
>
> Device      r/s       w/s    rkB/s      wkB/s  rrqm/s    wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm   %util
> sda        0.00      0.20     0.00       0.80    0.00      0.00   0.00   0.00     0.00     0.00    0.00      0.00      4.00   0.00    0.00
> sdb        0.00      0.00     0.00       0.00    0.00      0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00    0.00
> sdc        0.00  28846.80     0.00  191555.20    0.00  18211.00   0.00  38.70     0.00     1.03   29.64      0.00      6.64   0.03  100.00
> nb0        0.00  47297.00     0.00  191562.40    0.00    593.60   0.00   1.24     0.00     1.32   61.83      0.00      4.05   0
>
> On 9/4/19, 4:19 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
>
> I am using this command
>
> fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write
> --rwmixread=0 --bsrange=4k-4k --direct=1 --filename=/dev/nbd0 --numjobs=8
> --runtime 120 --time_based --group_reporting
>
> I have created the device by using these commands
> 1. ./root/spdk/app/vhost
> 2. ./rpc.py bdev_aio_create /dev/sdc aio0
> 3. ./rpc.py start_nbd_disk aio0 /dev/nbd0
>
> I am using "perf top" to get the performance
>
> On 9/4/19, 4:03 PM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
>
> Hi Rishabh,
>
> Maybe it would help (me at least) if you described the complete & exact
> steps for your test - both setup of the env & test and the command used to
> profile. Can you send that out?
>
> Thx
> Paul
>
> -----Original Message-----
> From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> Sent: Wednesday, September 4, 2019 2:45 PM
> To: Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R
> <james.r.harris(a)intel.com>; spdk(a)lists.01.org; Luse, Paul E
> <paul.e.luse(a)intel.com>
> Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>;
> Szmyd, Brian <bszmyd(a)ebay.com>
> Subject: Re: [SPDK] NBD with SPDK
>
> Yes, I am using 64 q depth with one thread in fio. I am using AIO. This
> profiling is for the entire system. I don't know why the SPDK threads are
> idle.
>
> On 9/4/19, 11:08 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
>
> On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
> > I got the run again. It is with 4k write.
> >
> >   13.16%  vhost  [.] spdk_ring_dequeue
> >    6.08%  vhost  [.] rte_rdtsc
> >    4.77%  vhost  [.] spdk_thread_poll
> >    2.85%  vhost  [.] _spdk_reactor_run
> >
>
> You're doing high queue depth for at least 30 seconds while the trace
> runs, right? Using fio with the libaio engine on the NBD device is
> probably the way to go. Are you limiting the profiling to just the core
> where the main SPDK process is pinned?
> I'm asking because SPDK still appears to be mostly idle, and I suspect
> the time is being spent in some other thread (in the kernel). Consider
> capturing a profile for the entire system. It will have fio stuff in it,
> but the expensive stuff still should generally bubble up to the top.
>
> Thanks,
> Ben
>
> > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> >
> > I got the profile with the first run.
> >
> >   27.91%  vhost               [.] spdk_ring_dequeue
> >   12.94%  vhost               [.] rte_rdtsc
> >   11.00%  vhost               [.] spdk_thread_poll
> >    6.15%  vhost               [.] _spdk_reactor_run
> >    4.35%  [kernel]            [k] syscall_return_via_sysret
> >    3.91%  vhost               [.] _spdk_msg_queue_run_batch
> >    3.38%  vhost               [.] _spdk_event_queue_run_batch
> >    2.83%  [unknown]           [k] 0xfffffe000000601b
> >    1.45%  vhost               [.] spdk_thread_get_from_ctx
> >    1.20%  [kernel]            [k] __fget
> >    1.14%  libpthread-2.27.so  [.] __libc_read
> >    1.00%  libc-2.27.so        [.] 0x000000000018ef76
> >    0.99%  libc-2.27.so        [.] 0x000000000018ef79
> >
> > Thanks
> > Rishabh Mittal
> >
> > On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
> >
> > That's great. Keep an eye out for the items Ben mentions below - at
> > least the first one should be quick to implement and compare both
> > profile data and measured performance.
> >
> > Don't forget about the community meetings either, great place to chat
> > about these kinds of things.
> > https://spdk.io/community/
> > Next one is tomorrow morn US time.
> >
> > Thx
> > Paul
> >
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal,
> > Rishabh via SPDK
> > Sent: Thursday, August 15, 2019 6:50 PM
> > To: Harris, James R <james.r.harris(a)intel.com>; Walker, Benjamin
> > <benjamin.walker(a)intel.com>; spdk(a)lists.01.org
> > Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen, Xiaoxi
> > <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>; Kadayam, Hari
> > <hkadayam(a)ebay.com>
> > Subject: Re: [SPDK] NBD with SPDK
> >
> > Thanks. I will get the profiling by next week.
> >
> > On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> >
> > On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> >
> > Hi Jim
> >
> > What tool do you use for profiling?
> >
> > Hi Rishabh,
> >
> > Mostly I just use "perf top".
> >
> > -Jim
> >
> > Thanks
> > Rishabh Mittal
> >
> > On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> >
> > On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com>
> > wrote:
> >
> > When an I/O is performed in the process initiating the I/O to a file,
> > the data goes into the OS page cache buffers at a layer far above the
> > bio stack (somewhere up in VFS). If SPDK were to reserve some memory
> > and hand it off to your kernel driver, your kernel driver would still
> > need to copy it to that location out of the page cache buffers.
> > We can't safely share the page cache buffers with a user space process.
> >
> > I think Rishabh was suggesting that SPDK reserve the virtual address
> > space only. Then the kernel could map the page cache buffers into that
> > virtual address space. That would not require a data copy, but would
> > require the mapping operations.
> >
> > I think the profiling data would be really helpful - to quantify how
> > much of the 50us is due to copying the 4KB of data. That can help drive
> > next steps on how to optimize the SPDK NBD module.
> >
> > Thanks,
> >
> > -Jim
> >
> > As Paul said, I'm skeptical that the memcpy is significant in the
> > overall performance you're measuring. I encourage you to go look at
> > some profiling data and confirm that the memcpy is really showing up. I
> > suspect the overhead is instead primarily in these spots:
> >
> > 1) Dynamic buffer allocation in the SPDK NBD backend.
> >
> > As Paul indicated, the NBD target is dynamically allocating memory for
> > each I/O. The NBD backend wasn't designed to be fast - it was designed
> > to be simple. Pooling would be a lot faster and is something fairly
> > easy to implement.
> >
> > 2) The way SPDK does the syscalls when it implements the NBD backend.
> >
> > Again, the code was designed to be simple, not high performance. It
> > simply calls read() and write() on the socket for each command. There
> > are much higher performance ways of doing this, they're just more
> > complex to implement.
> >
> > 3) The lack of multi-queue support in NBD.
> >
> > Every I/O is funneled through a single sockpair up to user space. That
> > means there is locking going on. I believe this is just a limitation of
> > NBD today - it doesn't plug into the block-mq stuff in the kernel and
> > expose multiple sockpairs. But someone more knowledgeable on the kernel
> > stack would need to take a look.
> >
> > Thanks,
> > Ben
> >
> > > Couple of things that I am not really sure about in this flow:
> > > 1. How memory registration is going to work with the RDMA driver.
> > > 2. What changes are required in SPDK memory management.
> > >
> > > Thanks
> > > Rishabh Mittal
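To make Ben's first point concrete, here is a minimal, single-threaded C
sketch of the pooling idea: pre-allocate a fixed set of I/O buffers once at
startup and recycle them through a free list, so the hot path never calls
the allocator. This is generic illustrative C, not the SPDK nbd module or
SPDK's mempool API, and the pool size and buffer size are arbitrary
assumptions.

/*
 * buf_pool.c - recycle pre-allocated I/O buffers instead of per-I/O malloc.
 */
#include <stdio.h>
#include <stdlib.h>

#define POOL_SIZE 128           /* number of pre-allocated buffers */
#define BUF_SIZE  (128 * 1024)  /* large enough for one max-sized request */

struct buf {
    struct buf *next;           /* free-list link, only valid while free */
    char data[BUF_SIZE];
};

static struct buf *free_list;

/* One-time cost at startup: build the free list. */
static int pool_init(void)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        struct buf *b = malloc(sizeof(*b));
        if (b == NULL)
            return -1;
        b->next = free_list;
        free_list = b;
    }
    return 0;
}

/* O(1) get/put in the I/O path; no allocator calls per command. */
static struct buf *pool_get(void)
{
    struct buf *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;                   /* NULL means pool exhausted: queue or retry */
}

static void pool_put(struct buf *b)
{
    b->next = free_list;
    free_list = b;
}

int main(void)
{
    if (pool_init() != 0)
        return 1;
    struct buf *b = pool_get();                 /* per-I/O: grab a buffer... */
    snprintf(b->data, BUF_SIZE, "payload for one command");
    printf("%s\n", b->data);
    pool_put(b);                                /* ...and return it on completion */
    return 0;
}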