From: Walker, Benjamin
Subject: Re: [SPDK] NBD with SPDK
Date: Thu, 05 Sep 2019 21:22:38 +0000
To: spdk@lists.01.org

On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
> Hi Paul,
>
> Rather than put the effort into a formalized document, here is a brief
> description of the solution I have been investigating, just to get an
> opinion on its feasibility or even workability.
>
> Some background and a reiteration of the problem to set things up. I
> apologize for reiterating anything and for including details that some may
> already know.
>
> We are looking for a solution that allows us to write a custom bdev for the
> SPDK bdev layer that distributes I/O between different NVMe-oF targets that
> we have attached, and then present that to our application as either a raw
> block device or a filesystem mountpoint.
>
> This is normally (as I understand it) done by exposing a device via QEMU to
> a VM using the vhost target. This SPDK target has implemented the
> virtio-scsi (among others) device according to this spec:
>
> https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021
>
> The VM kernel then uses a virtio-scsi module to attach said device into its
> SCSI mid-layer and then has the device enumerated as a /dev/sd[a-z]+ device.
>
> The problem is that QEMU virtualizes a PCIe bus for the guest kernel
> virtio-pci driver to discover the virtio devices and bind them to the
> virtio-scsi driver. There really is no other way (other than platform MMIO
> type devices) to attach a device to the virtio-scsi driver.
>
> SPDK exposes the virtio device to the VM via QEMU, which has written a
> "user space" version of the vhost bus. This driver then translates the API
> into the virtio-pci specification:
>
> https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst
>
> This uses an eventfd descriptor for interrupting the non-polling side of
> the queue and a UNIX domain socket to set up (and control) the shared
> memory which contains the I/O buffers and virtio queues. This is documented
> in SPDK's own documentation and diagrammed here:
>
> https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
>
> If we could implement this vhost-user QEMU target as a virtio driver in the
> kernel as an alternative to the virtio-pci driver, it could bind an SPDK
> vhost into the host kernel as a virtio device and have it enumerated in the
> /dev/sd[a-z]+ tree for our containers to bind. Attached is a draft block
> diagram.

If you think of QEMU as just another user-space process, and the SPDK vhost
target as a user-space process, then it's clear that vhost-user is simply a
cross-process IPC mechanism based on shared memory. The "shared memory" part
is the critical part of that description - QEMU pre-registers all of the
memory that will be used for I/O buffers (in fact, all of the memory that is
mapped into the guest) with the SPDK process by sending fds across a Unix
domain socket.
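As a rough illustration of that hand-off (the names below are illustrative,
not QEMU or SPDK code): the sender creates an fd-backed region, passes the fd
over the Unix domain socket as SCM_RIGHTS ancillary data, and the receiver
mmap()s it so both processes see the same pages.

/* Minimal sketch: share a memory region the way vhost-user does - create an
 * fd-backed mapping and pass the fd over a Unix domain socket as SCM_RIGHTS
 * ancillary data. Illustrative only. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

static int send_region_fd(int sock, int memfd, size_t len)
{
    /* A real vhost-user message carries a memory-region table here; this
     * sketch just sends the region length as the payload. */
    struct iovec iov = { .iov_base = &len, .iov_len = sizeof(len) };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;              /* pass the fd itself */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &memfd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

int main(void)
{
    size_t len = 2 * 1024 * 1024;
    int sv[2];
    /* memfd_create() gives an anonymous fd-backed region; hugetlbfs or a
     * tmpfs file works the same way. */
    int memfd = memfd_create("io-buffers", 0);

    if (memfd < 0 || ftruncate(memfd, len) < 0 ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        perror("setup");
        return 1;
    }
    /* The receiver (e.g. the vhost target) recvmsg()s the fd and mmap()s it;
     * after that, both processes see the same physical pages. */
    if (send_region_fd(sv[0], memfd, len) != 0)
        perror("send_region_fd");
    return 0;
}

The vhost-user message body then describes each region's guest-physical
address, size, and mmap offset so the target can translate the addresses it
finds in the virtqueues.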
If you move this code into the kernel, you have to solve two issues:

1) What memory is it registering with the SPDK process? The kernel driver has
no idea which application process may route I/O to it - in fact the
application process may not even exist yet - so it isn't memory allocated to
the application process. Maybe you have a pool of kernel buffers that get
mapped into the SPDK process, and when the application process performs I/O
the kernel copies into those buffers prior to telling SPDK about them? That
would work, but now you're back to doing a data copy. I do think you can get
it down to 1 data copy instead of 2 with a scheme like this.

2) One of the big performance problems you're seeing is syscall overhead in
NBD. If you still have a kernel block device that routes messages up to the
SPDK process, the application process is making the same syscalls because
it's still interacting with a block device in the kernel, but you're right
that the backend SPDK implementation could be polling on shared memory rings
and potentially run more efficiently.

> Since we will not have a real bus to signal for the driver to probe for new
> devices, we can use a sysfs interface for the application to notify the
> driver of a new socket and eventfd pair to set up a new virtio-scsi
> instance. Otherwise the design simply moves the vhost-user driver from the
> QEMU application into the host kernel itself.
>
> It's my understanding that this will avoid a lot more system calls and
> copies compared to exposing an iSCSI device or NBD device as we're
> currently discussing. Does this seem feasible?

What you really want is a "block device in user space" solution that's higher
performance than NBD, and while that's been tried many, many times in the
past I do think there is a great opportunity here for someone. I'm not sure
that the interface between the block device process and the kernel is best
done as a modification of NBD or a wholesale replacement by vhost-user-scsi,
but I'd like to throw in a third option to consider - use NVMe queues in
shared memory as the interface instead. The NVMe queues are going to be much
more efficient than virtqueues for storage commands.
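To make that third option a bit more concrete, here is a rough sketch of what
"NVMe queues in shared memory" could look like. This is purely illustrative -
it is not an existing SPDK or kernel interface - but it shows why the model is
attractive: submission queue entries are fixed 64-byte commands, and both
sides make progress by polling head/tail indices in the shared region rather
than paying a syscall per I/O.

/* Rough sketch of an "NVMe queues in shared memory" interface. Purely
 * illustrative - not an existing SPDK or kernel interface. The structure
 * below would live in memory shared between the kernel (or application)
 * side and the SPDK process, e.g. via the fd-passing shown earlier. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SQ_DEPTH 256                  /* power of two */

/* Same size and rough layout as a real 64-byte NVMe submission queue entry. */
struct sq_entry {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t cid;                     /* command identifier */
    uint32_t nsid;
    uint64_t rsvd2;                   /* cdw2-3, unused here */
    uint64_t mptr;                    /* metadata pointer, unused here */
    uint64_t prp1;                    /* data buffer address */
    uint64_t prp2;
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
};

struct shared_sq {
    _Atomic uint32_t tail;            /* advanced by the submitter */
    _Atomic uint32_t head;            /* advanced by the consumer (SPDK) */
    struct sq_entry  ring[SQ_DEPTH];
};

/* Submitter side: post one command without any syscall. */
static bool sq_submit(struct shared_sq *sq, const struct sq_entry *cmd)
{
    uint32_t tail = atomic_load_explicit(&sq->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&sq->head, memory_order_acquire);

    if (tail - head >= SQ_DEPTH)
        return false;                 /* queue full */
    sq->ring[tail % SQ_DEPTH] = *cmd;
    /* Publish the entry before moving the tail ("ringing the doorbell"). */
    atomic_store_explicit(&sq->tail, tail + 1, memory_order_release);
    return true;
}

/* SPDK side: poll for new commands instead of sleeping in read(). */
static bool sq_poll(struct shared_sq *sq, struct sq_entry *out)
{
    uint32_t head = atomic_load_explicit(&sq->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&sq->tail, memory_order_acquire);

    if (head == tail)
        return false;                 /* nothing pending */
    *out = sq->ring[head % SQ_DEPTH];
    atomic_store_explicit(&sq->head, head + 1, memory_order_release);
    return true;
}

The completion queue would be the mirror image (16-byte completion entries
flowing the other way), with an eventfd only as a fallback for when one side
stops polling.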
>
> Thanks,
> Brian
>
> On 9/5/19, 12:32 PM, "Mittal, Rishabh" wrote:
>
> Hi Paul.
>
> Thanks for investigating it.
>
> We have one more idea floating around. Brian is going to send you a
> proposal shortly. If the other proposal seems feasible to you, we can
> evaluate the work required for both proposals.
>
> Thanks
> Rishabh Mittal
>
> On 9/5/19, 11:09 AM, "Luse, Paul E" wrote:
>
> Hi,
>
> So I was able to perform the same steps here, and I think one of the keys
> to really seeing what's going on is to start perf top like this:
>
> "perf top --sort comm,dso,symbol -C 0" to get a more focused view by
> sorting on command, shared object and symbol.
>
> Attached are 2 snapshots, one with a NULL back end for nbd and one with
> libaio/nvme. Some notes after chatting with Ben a bit, please read through
> and let us know what you think:
>
> * in both cases the vast majority of the highest-overhead activities are
>   kernel
> * the "copy_user_enhanced" symbol shows up on the NULL case (it shows up on
>   the other as well but you have to scroll way down to see it) and is the
>   user/kernel space copy; nothing SPDK can do about that
> * the syscalls that dominate in both cases are likely something that can be
>   improved on by changing how SPDK interacts with nbd. Ben had a couple of
>   ideas including (a) using libaio to interact with the nbd fd as opposed
>   to interacting with the nbd socket, and (b) "batching" wherever possible,
>   for example on writes to nbd investigate not ack'ing them until some
>   number have completed
> * the kernel slab* commands are likely nbd kernel driver allocations/frees
>   in the IO path; one possibility would be to look at optimizing the nbd
>   kernel driver for this one
> * the libc item on the NULL chart also shows up on the libaio profile,
>   however it is again way down the scroll so it didn't make the screenshot
>   :) This could be a zeroing of something somewhere in the SPDK nbd driver
>
> It looks like this data supports what Ben had suspected a while back: much
> of the overhead we're looking at is kernel nbd. Anyway, let us know what
> you think, whether you want to explore any of the ideas above any further,
> or if you see something else in the data that looks worthy of note.
>
> Thx
> Paul
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
> Sent: Wednesday, September 4, 2019 4:27 PM
> To: Mittal, Rishabh; Walker, Benjamin <benjamin.walker(a)intel.com>;
> Harris, James R; spdk(a)lists.01.org
> Cc: Chen, Xiaoxi; Szmyd, Brian; Kadayam, Hari
> Subject: Re: [SPDK] NBD with SPDK
>
> Cool, thanks for sending this. I will try to repro tomorrow here and see
> what kind of results I get.
>
> Thx
> Paul
>
> -----Original Message-----
> From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> Sent: Wednesday, September 4, 2019 4:23 PM
> To: Luse, Paul E; Walker, Benjamin <benjamin.walker(a)intel.com>;
> Harris, James R; spdk(a)lists.01.org
> Cc: Chen, Xiaoxi; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian
> Subject: Re: [SPDK] NBD with SPDK
>
> Avg CPU utilization is very low when I am running this.
>
> 09/04/2019 04:21:40 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            2.59    0.00    2.57    0.00    0.00   94.84
>
> Device    r/s       w/s   rkB/s      wkB/s  rrqm/s    wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
> sda      0.00      0.20    0.00       0.80    0.00      0.00   0.00   0.00     0.00     0.00    0.00      0.00      4.00   0.00   0.00
> sdb      0.00      0.00    0.00       0.00    0.00      0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00   0.00
> sdc      0.00  28846.80    0.00  191555.20    0.00  18211.00   0.00  38.70     0.00     1.03   29.64      0.00      6.64   0.03 100.00
> nb0      0.00  47297.00    0.00  191562.40    0.00    593.60   0.00   1.24     0.00     1.32   61.83      0.00      4.05   0
>
> On 9/4/19, 4:19 PM, "Mittal, Rishabh" wrote:
>
> I am using this command:
>
> fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write
> --rwmixread=0 --bsrange=4k-4k --direct=1 --filename=/dev/nbd0 --numjobs=8
> --runtime 120 --time_based --group_reporting
>
> I have created the device by using these commands:
> 1. ./root/spdk/app/vhost
> 2. ./rpc.py bdev_aio_create /dev/sdc aio0
> 3. ./rpc.py start_nbd_disk aio0 /dev/nbd0
>
> I am using "perf top" to get the performance.
>
> On 9/4/19, 4:03 PM, "Luse, Paul E" wrote:
>
> Hi Rishabh,
>
> Maybe it would help (me at least) if you described the complete & exact
> steps for your test - both setup of the env & test, and the command to
> profile. Can you send that out?
>
> Thx
> Paul
>
> -----Original Message-----
> From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> Sent: Wednesday, September 4, 2019 2:45 PM
> To: Walker, Benjamin; Harris, James R; spdk(a)lists.01.org; Luse, Paul E
> <paul.e.luse(a)intel.com>
> Cc: Chen, Xiaoxi; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian
> Subject: Re: [SPDK] NBD with SPDK
>
> Yes, I am using 64 queue depth with one thread in fio. I am using AIO.
> This profiling is for the entire system. I don't know why the SPDK threads
> are idle.
>
> On 9/4/19, 11:08 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
>
> On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
> > I got the run again. It is with 4k write.
> >
> >     13.16%  vhost  [.] spdk_ring_dequeue
> >      6.08%  vhost  [.] rte_rdtsc
> >      4.77%  vhost  [.] spdk_thread_poll
> >      2.85%  vhost  [.] _spdk_reactor_run
>
> You're doing high queue depth for at least 30 seconds while the trace
> runs, right? Using fio with the libaio engine on the NBD device is
> probably the way to go. Are you limiting the profiling to just the core
> where the main SPDK process is pinned? I'm asking because SPDK still
> appears to be mostly idle, and I suspect the time is being spent in some
> other thread (in the kernel). Consider capturing a profile for the entire
> system. It will have fio stuff in it, but the expensive stuff should still
> generally bubble up to the top.
>
> Thanks,
> Ben
>
> > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> >
> > I got the profile with the first run.
> >
> >     27.91%  vhost               [.] spdk_ring_dequeue
> >     12.94%  vhost               [.] rte_rdtsc
> >     11.00%  vhost               [.] spdk_thread_poll
> >      6.15%  vhost               [.] _spdk_reactor_run
> >      4.35%  [kernel]            [k] syscall_return_via_sysret
> >      3.91%  vhost               [.] _spdk_msg_queue_run_batch
> >      3.38%  vhost               [.] _spdk_event_queue_run_batch
> >      2.83%  [unknown]           [k] 0xfffffe000000601b
> >      1.45%  vhost               [.] spdk_thread_get_from_ctx
> >      1.20%  [kernel]            [k] __fget
> >      1.14%  libpthread-2.27.so  [.] __libc_read
> >      1.00%  libc-2.27.so        [.] 0x000000000018ef76
> >      0.99%  libc-2.27.so        [.] 0x000000000018ef79
> >
> > Thanks
> > Rishabh Mittal
> >
> > On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
> >
> > That's great. Keep an eye out for the items Ben mentions below - at least
> > the first one should be quick to implement and compare both profile data
> > and measured performance.
> >
> > Don't forget about the community meetings either, a great place to chat
> > about these kinds of things.
> > https://spdk.io/community/
> > The next one is tomorrow morning US time.
> >
> > Thx
> > Paul
> >
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal,
> > Rishabh via SPDK
> > Sent: Thursday, August 15, 2019 6:50 PM
> > To: Harris, James R; Walker, Benjamin <benjamin.walker(a)intel.com>;
> > spdk(a)lists.01.org
> > Cc: Mittal, Rishabh; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian;
> > Kadayam, Hari <hkadayam(a)ebay.com>
> > Subject: Re: [SPDK] NBD with SPDK
> >
> > Thanks. I will get the profiling by next week.
> >
> > On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> >
> > On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> >
> > Hi Jim,
> >
> > What tool do you use for profiling?
> >
> > Hi Rishabh,
> >
> > Mostly I just use "perf top".
> >
> > -Jim
> >
> > Thanks
> > Rishabh Mittal
> >
> > On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> >
> > On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com>
> > wrote:
> >
> > When an I/O is performed in the process initiating the I/O to a file, the
> > data goes into the OS page cache buffers at a layer far above the bio
> > stack (somewhere up in VFS). If SPDK were to reserve some memory and hand
> > it off to your kernel driver, your kernel driver would still need to copy
> > it to that location out of the page cache buffers. We can't safely share
> > the page cache buffers with a user space process.
> >
> > I think Rishabh was suggesting that SPDK reserve the virtual address
> > space only. Then the kernel could map the page cache buffers into that
> > virtual address space. That would not require a data copy, but would
> > require the mapping operations.
> >
> > I think the profiling data would be really helpful - to quantify how much
> > of the 50 us is due to copying the 4 KB of data. That can help drive next
> > steps on how to optimize the SPDK NBD module.
> >
> > Thanks,
> >
> > -Jim
> >
> > As Paul said, I'm skeptical that the memcpy is significant in the overall
> > performance you're measuring. I encourage you to go look at some
> > profiling data and confirm that the memcpy is really showing up. I
> > suspect the overhead is instead primarily in these spots:
> >
> > 1) Dynamic buffer allocation in the SPDK NBD backend.
> >
> > As Paul indicated, the NBD target is dynamically allocating memory for
> > each I/O. The NBD backend wasn't designed to be fast - it was designed to
> > be simple. Pooling would be a lot faster and is something fairly easy to
> > implement (a rough sketch follows after this list).
> >
> > 2) The way SPDK does the syscalls when it implements the NBD backend.
> >
> > Again, the code was designed to be simple, not high performance. It
> > simply calls read() and write() on the socket for each command. There are
> > much higher performance ways of doing this, they're just more complex to
> > implement.
> >
> > 3) The lack of multi-queue support in NBD.
> >
> > Every I/O is funneled through a single sockpair up to user space. That
> > means there is locking going on. I believe this is just a limitation of
> > NBD today - it doesn't plug into the block-mq stuff in the kernel and
> > expose multiple sockpairs. But someone more knowledgeable on the kernel
> > stack would need to take a look.
> >
> > Thanks,
> > Ben
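On point 1 above (pooling instead of per-I/O allocation), the idea can be as
simple as a preallocated freelist of request contexts and data buffers sized
to the queue depth. This is a hypothetical sketch, not the actual SPDK NBD
code; the names and sizes are made up for illustration.

/* Hypothetical sketch of a buffer/context pool for an NBD-style backend -
 * not the actual SPDK NBD code. Instead of allocating memory for every I/O,
 * preallocate a fixed set of contexts at startup and recycle them. */
#include <stdlib.h>

#define POOL_SIZE 128               /* >= the NBD queue depth */
#define BUF_SIZE  (128 * 1024)      /* largest I/O we expect to see */

struct io_ctx {
    struct io_ctx *next;            /* freelist link */
    void          *buf;             /* preallocated data buffer */
};

struct io_pool {
    struct io_ctx *free_head;
    struct io_ctx  ctx[POOL_SIZE];
};

static int io_pool_init(struct io_pool *p)
{
    p->free_head = NULL;
    for (int i = 0; i < POOL_SIZE; i++) {
        /* A real SPDK backend would want pinned, DMA-able memory here;
         * plain aligned allocations keep the sketch self-contained. */
        if (posix_memalign(&p->ctx[i].buf, 4096, BUF_SIZE) != 0)
            return -1;
        p->ctx[i].next = p->free_head;
        p->free_head = &p->ctx[i];
    }
    return 0;
}

/* Get a context for a new command; NULL means "queue full, back off". */
static struct io_ctx *io_ctx_get(struct io_pool *p)
{
    struct io_ctx *c = p->free_head;

    if (c != NULL)
        p->free_head = c->next;
    return c;
}

/* Return a context when the command completes. */
static void io_ctx_put(struct io_pool *p, struct io_ctx *c)
{
    c->next = p->free_head;
    p->free_head = c;
}

If the backend is only touched from a single SPDK thread, the freelist needs
no locking; otherwise something like SPDK's spdk_mempool (or a lock) would be
the natural choice.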
> > > Couple of things that I am not really sure about in this flow:
> > > 1. How memory registration is going to work with the RDMA driver.
> > > 2. What changes are required in SPDK memory management.
> > >
> > > Thanks
> > > Rishabh Mittal
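On Rishabh's first question (memory registration), SPDK's env layer exposes
spdk_mem_register()/spdk_mem_unregister() for making memory that SPDK did not
allocate itself visible to its address translation maps, which is what
transports such as RDMA build their registrations on top of (as I understand
it). A rough sketch, with hypothetical wrapper names and a plain aligned
allocation standing in for whatever pinned memory the kernel side would
actually hand over:

/* Rough sketch only - assumes the env-layer spdk_mem_register() /
 * spdk_mem_unregister() APIs; the wrapper functions and sizes here are
 * hypothetical. */
#include <stdio.h>
#include <stdlib.h>

#include "spdk/env.h"

#define REGION_SIZE (2 * 1024 * 1024)

int register_external_buffers(void **out_buf)
{
    void *buf = NULL;

    /* 2 MB alignment so the region maps cleanly onto hugepage-sized chunks
     * of SPDK's memory map. */
    if (posix_memalign(&buf, 0x200000, REGION_SIZE) != 0)
        return -1;

    /* After this call SPDK's translation maps (and the transports layered
     * on them) know about the region, so it can be used as an I/O buffer. */
    if (spdk_mem_register(buf, REGION_SIZE) != 0) {
        fprintf(stderr, "spdk_mem_register failed\n");
        free(buf);
        return -1;
    }

    *out_buf = buf;
    return 0;
}

void unregister_external_buffers(void *buf)
{
    spdk_mem_unregister(buf, REGION_SIZE);
    free(buf);
}

Registering regions up front and reusing them, rather than per I/O, keeps the
registration cost off the I/O path, which matters for RDMA in particular.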