From: Mittal, Rishabh
Subject: Re: [SPDK] NBD with SPDK
Date: Fri, 06 Sep 2019 17:13:13 +0000
To: spdk@lists.01.org

I am summarizing all the options. We have three options:

1. SPDK with NBD: It needs a few optimizations in spdk-nbd to reduce the system call overhead. We can also explore using multiple sockets, since that is supported in NBD. One disadvantage is that there will be a bcopy for every I/O, from the kernel to SPDK for writes and vice versa for reads. The overhead of the bcopy compared to end-to-end latency is very low for a 4k workload, but we need to see its impact for larger read/write sizes.

2. SPDK with virtio: It doesn't require any changes in SPDK (assuming that the SPDK virtio code is written for performance), but we need a customized kernel module that can work with the SPDK virtio target. Its obvious advantage is that the kernel buffer cache will be shared with SPDK, so there will be no copy from the kernel to SPDK. Another advantage is that there will be minimal system calls to ring the doorbell, since it will use a shared ring queue. My only concern here is that memory protection will be lost, as entire kernel buffers will be shared with SPDK.

3. SPDK with Kata containers: It doesn't require many changes (Xiaoxi can comment more on this). But our concern is that apps will not be moved to Kata containers, which will slow down its adoption rate.

Please feel free to add pros/cons of any approach if I missed anything. It will help us decide.

Thanks
Rishabh Mittal

On 9/5/19, 7:14 PM, "Szmyd, Brian" wrote:

I believe this option has the same number of copies, since you're still sharing the memory with the Kata VM kernel, not the application itself. This is an option that the development of a virtio-vhost-user driver does not prevent; it's merely an option to allow non-Kata containers to also use the same device.

I will note that doing a virtio-vhost-user driver also allows one to project other device types than just block devices into the kernel device stack. One could also write a user application that exposed an input, network, console, gpu or socket device as well. Not that I have any interest in these...

On 9/5/19, 8:08 PM, "Huang Zhiteng" wrote:

Since this SPDK bdev is intended to be consumed by a user application running inside a container, we do have the possibility to run the user application inside a Kata container instead. A Kata container does introduce a layer of I/O virtualization, but it converts a user-space block device on the host into a kernel block device inside the VM with fewer memory copies than NBD, thanks to SPDK vhost. A Kata container might impose higher overhead than a plain container, but hopefully it's lightweight enough that the overhead is negligible.

On Fri, Sep 6, 2019 at 5:22 AM Walker, Benjamin wrote:
>
> On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
> > Hi Paul,
> >
> > Rather than put the effort into a formalized document, here is a brief description of the solution I have been investigating, just to get an opinion of feasibility or even workability.
> >
> > Some background and a reiteration of the problem to set things up. I apologize for reiterating things and including details that some may already know.
> >
> > We are looking for a solution that allows us to write a custom bdev for the SPDK bdev layer that distributes I/O between different NVMe-oF targets that we have attached, and then present that to our application as either a raw block device or a filesystem mountpoint.
> >
> > This is normally (as I understand it) done by exposing a device via QEMU to a VM using the vhost target. This SPDK target has implemented the virtio-scsi (among others) device according to this spec:
> >
> > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021
> >
> > The VM kernel then uses a virtio-scsi module to attach said device into its SCSI mid-layer and then has the device enumerated as a /dev/sd[a-z]+ device.
> >
> > The problem is that QEMU virtualizes a PCIe bus for the guest kernel virtio-pci driver to discover the virtio devices and bind them to the virtio-scsi driver. There really is no other way (other than platform MMIO type devices) to attach a device to the virtio-scsi driver.
> >
> > SPDK exposes the virtio device to the VM via QEMU, which has written a "user space" version of the vhost bus. This driver then translates the API into the virtio-pci specification:
> >
> > https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst
> >
> > This uses an eventfd descriptor for interrupting the non-polling side of the queue and a UNIX domain socket to set up (and control) the shared memory, which contains the I/O buffers and virtio queues. This is documented in SPDK's own documentation and diagrammed here:
> >
> > https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
> >
> > If we could implement this vhost-user QEMU target as a virtio driver in the kernel as an alternative to the virtio-pci driver, it could bind an SPDK vhost into the host kernel as a virtio device and enumerate it in the /dev/sd[a-z]+ tree for our containers to bind. Attached is a draft block diagram.
>
> If you think of QEMU as just another user-space process, and the SPDK vhost target as a user-space process, then it's clear that vhost-user is simply a cross-process IPC mechanism based on shared memory.
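For concreteness, the primitive underneath that kind of shared-memory IPC is plain Unix fd passing: one process creates a mappable region, hands its file descriptor to the peer over an AF_UNIX socket with SCM_RIGHTS, and the peer mmap()s it so both sides see the same pages. A minimal sketch of the sending side (the single-region layout and names are illustrative assumptions, not the actual vhost-user message format):

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>    /* memfd_create() */
    #include <sys/socket.h>  /* sendmsg(), SCM_RIGHTS */
    #include <sys/uio.h>     /* struct iovec */
    #include <unistd.h>      /* ftruncate() */

    /* Create a shareable I/O-buffer region and hand its fd to the peer process
     * over an already-connected AF_UNIX socket. Returns the fd so the sender
     * can mmap() the same region itself, or -1 on error. */
    static int send_shared_region(int unix_sock, size_t len)
    {
        int memfd = memfd_create("io-buffers", 0);
        if (memfd < 0 || ftruncate(memfd, (off_t)len) < 0)
            return -1;

        uint64_t region_len = len;   /* tiny payload: advertise the region size */
        struct iovec iov = { .iov_base = &region_len, .iov_len = sizeof(region_len) };

        char cbuf[CMSG_SPACE(sizeof(int))];
        memset(cbuf, 0, sizeof(cbuf));
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type  = SCM_RIGHTS;   /* kernel duplicates the fd into the peer */
        cm->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cm), &memfd, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : memfd;
    }

The receiver recvmsg()s the descriptor and mmap()s the advertised length; vhost-user layers its memory-table messages (multiple regions with their address mappings) on top of exactly this mechanism.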
The "shared m= emory" part is > the critical part of that description - QEMU pre-registers all of= the memory > that will be used for I/O buffers (in fact, all of the memory tha= t is mapped > into the guest) with the SPDK process by sending fds across a Uni= x domain > socket. > > If you move this code into the kernel, you have to solve two issu= es: > > 1) What memory is it registering with the SPDK process? The kerne= l driver has no > idea which application process may route I/O to it - in fact the = application > process may not even exist yet - so it isn't memory allocated to = the application > process. Maybe you have a pool of kernel buffers that get mapped = into the SPDK > process, and when the application process performs I/O the kernel= copies into > those buffers prior to telling SPDK about them? That would work, = but now you're > back to doing a data copy. I do think you can get it down to 1 da= ta copy instead > of 2 with a scheme like this. > > 2) One of the big performance problems you're seeing is syscall o= verhead in NBD. > If you still have a kernel block device that routes messages up t= o the SPDK > process, the application process is making the same syscalls beca= use it's still > interacting with a block device in the kernel, but you're right t= hat the backend > SPDK implementation could be polling on shared memory rings and p= otentially run > more efficiently. > > > > > Since we will not have a real bus to signal for the driver to p= robe for new > > devices we can use a sysfs interface for the application to not= ify the driver > > of a new socket and eventfd pair to setup a new virtio-scsi ins= tance. > > Otherwise the design simply moves the vhost-user driver from th= e QEMU > > application into the Host kernel itself. > > > > It's my understanding that this will avoid a lot more system ca= lls and copies > > compared to what exposing an iSCSI device or NBD device as we'r= e currently > > discussing. Does this seem feasible? > > What you really want is a "block device in user space" solution t= hat's higher > performance than NBD, and while that's been tried many, many time= s in the past I > do think there is a great opportunity here for someone. I'm not s= ure that the > interface between the block device process and the kernel is best= done as a > modification of NBD or a wholesale replacement by vhost-user-scsi= , but I'd like > to throw in a third option to consider - use NVMe queues in share= d memory as the > interface instead. The NVMe queues are going to be much more effi= cient than > virtqueues for storage commands. > > > > > Thanks, > > Brian > > > > On 9/5/19, 12:32 PM, "Mittal, Rishabh" wr= ote: > > > > Hi Paul. > > > > Thanks for investigating it. > > > > We have one more idea floating around. Brian is going to se= nd you a > > proposal shortly. If other proposal seems feasible to you that = we can evaluate > > the work required in both the proposals. > > > > Thanks > > Rishabh Mittal > > > > On 9/5/19, 11:09 AM, "Luse, Paul E" wrote: > > > > Hi, > > > > So I was able to perform the same steps here and I thin= k one of the > > keys to really seeing what's going on is to start perftop like = this: > > > > =E2=80=9Cperf top --sort comm,dso,symbol -C 0=E2=80=9D= to get a more focused view by > > sorting on command, shared object and symbol > > > > Attached are 2 snapshots, one with a NULL back end for = nbd and one > > with libaio/nvme. 
> > Some notes after chatting with Ben a bit, please read through and let us know what you think:
> >
> > * In both cases the vast majority of the highest-overhead activities are kernel.
> > * The "copy_user_enhanced" symbol on the NULL case (it shows up on the other as well, but you have to scroll way down to see it) is the user/kernel space copy; nothing SPDK can do about that.
> > * The syscalls that dominate in both cases are likely something that can be improved on by changing how SPDK interacts with nbd. Ben had a couple of ideas, including (a) using libaio to interact with the nbd fd as opposed to interacting with the nbd socket, and (b) "batching" wherever possible, for example on writes to nbd investigate not ack'ing them until some number have completed.
> > * The kernel slab* commands are likely nbd kernel driver allocations/frees in the I/O path; one possibility would be to look at optimizing the nbd kernel driver for this one.
> > * The libc item on the NULL chart also shows up on the libaio profile, however it is again way down the scroll so it didn't make the screenshot :) This could be a zeroing of something somewhere in the SPDK nbd driver.
> >
> > It looks like this data supports what Ben had suspected a while back: much of the overhead we're looking at is kernel nbd. Anyway, let us know what you think, whether you want to explore any of the ideas above further, or whether you see something else in the data that looks worthy to note.
> >
> > Thx
> > Paul
> >
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
> > Sent: Wednesday, September 4, 2019 4:27 PM
> > To: Mittal, Rishabh; Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R; spdk(a)lists.01.org
> > Cc: Chen, Xiaoxi; Szmyd, Brian <bszmyd(a)ebay.com>; Kadayam, Hari
> > Subject: Re: [SPDK] NBD with SPDK
> >
> > Cool, thanks for sending this. I will try and repro tomorrow here and see what kind of results I get.
> >
> > Thx
> > Paul
> >
> > -----Original Message-----
> > From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> > Sent: Wednesday, September 4, 2019 4:23 PM
> > To: Luse, Paul E; Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R; spdk(a)lists.01.org
> > Cc: Chen, Xiaoxi; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian
> > Subject: Re: [SPDK] NBD with SPDK
> >
> > Avg CPU utilization is very low when I am running this.
> >
> > 09/04/2019 04:21:40 PM
> > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
> >            2.59   0.00     2.57     0.00    0.00  94.84
> >
> > Device  r/s   w/s       rkB/s  wkB/s      rrqm/s  wrqm/s    %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
> > sda     0.00  0.20      0.00   0.80       0.00    0.00      0.00   0.00   0.00     0.00     0.00    0.00      4.00      0.00   0.00
> > sdb     0.00  0.00      0.00   0.00       0.00    0.00      0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00
> > sdc     0.00  28846.80  0.00   191555.20  0.00    18211.00  0.00   38.70  0.00     1.03     29.64   0.00      6.64      0.03   100.00
> > nb0     0.00  47297.00  0.00   191562.40  0.00    593.60    0.00   1.24   0.00     1.32     61.83   0.00      4.05      0
> >
> > On 9/4/19, 4:19 PM, "Mittal, Rishabh" wrote:
> >
> > I am using this command:
> >
> > fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --rwmixread=0 --bsrange=4k-4k --direct=1 --filename=/dev/nbd0 --numjobs=8 --runtime 120 --time_based --group_reporting
> >
> > I have created the device by using these commands:
> > 1. ./root/spdk/app/vhost
> > 2. ./rpc.py bdev_aio_create /dev/sdc aio0
> > 3. ./rpc.py start_nbd_disk aio0 /dev/nbd0
> >
> > I am using "perf top" to get the performance.
> >
> > On 9/4/19, 4:03 PM, "Luse, Paul E" wrote:
> >
> > Hi Rishabh,
> >
> > Maybe it would help (me at least) if you described the complete & exact steps for your test - both setup of the env & test, and the command to profile. Can you send that out?
> >
> > Thx
> > Paul
> >
> > -----Original Message-----
> > From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> > Sent: Wednesday, September 4, 2019 2:45 PM
> > To: Walker, Benjamin; Harris, James R; spdk(a)lists.01.org; Luse, Paul E <paul.e.luse(a)intel.com>
> > Cc: Chen, Xiaoxi; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian
> > Subject: Re: [SPDK] NBD with SPDK
> >
> > Yes, I am using 64 queue depth with one thread in fio. I am using AIO. This profiling is for the entire system. I don't know why the spdk threads are idle.
> >
> > On 9/4/19, 11:08 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
> >
> > On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
> > > I got the run again. It is with 4k write.
> > >
> > >   13.16%  vhost  [.]  spdk_ring_dequeue
> > >    6.08%  vhost  [.]  rte_rdtsc
> > >    4.77%  vhost  [.]  spdk_thread_poll
> > >    2.85%  vhost  [.]  _spdk_reactor_run
> >
> > You're doing high queue depth for at least 30 seconds while the trace runs, right? Using fio with the libaio engine on the NBD device is probably the way to go. Are you limiting the profiling to just the core where the main SPDK process is pinned? I'm asking because SPDK still appears to be mostly idle, and I suspect the time is being spent in some other thread (in the kernel). Consider capturing a profile for the entire system. It will have fio stuff in it, but the expensive stuff still should generally bubble up to the top.
> >
> > Thanks,
> > Ben
> >
> > > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> > >
> > > I got the profile with the first run.
> > >
> > >   27.91%  vhost               [.]  spdk_ring_dequeue
> > >   12.94%  vhost               [.]  rte_rdtsc
> > >   11.00%  vhost               [.]  spdk_thread_poll
> > >    6.15%  vhost               [.]  _spdk_reactor_run
> > >    4.35%  [kernel]            [k]  syscall_return_via_sysret
> > >    3.91%  vhost               [.]  _spdk_msg_queue_run_batch
> > >    3.38%  vhost               [.]  _spdk_event_queue_run_batch
> > >    2.83%  [unknown]           [k]  0xfffffe000000601b
> > >    1.45%  vhost               [.]  spdk_thread_get_from_ctx
> > >    1.20%  [kernel]            [k]  __fget
> > >    1.14%  libpthread-2.27.so  [.]  __libc_read
> > >    1.00%  libc-2.27.so        [.]  0x000000000018ef76
> > >    0.99%  libc-2.27.so        [.]  0x000000000018ef79
> > >
> > > Thanks
> > > Rishabh Mittal
> > >
> > > On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
> > >
> > > That's great. Keep an eye out for the items Ben mentions below - at least the first one should be quick to implement and compare both profile data and measured performance.
> > >
> > > Don't forget about the community meetings either, great place to chat about these kinds of things.
> > > https://spdk.io/community/
> > > Next one is tomorrow morn US time.
> > >
> > > Thx
> > > Paul
> > >
> > > -----Original Message-----
> > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal, Rishabh via SPDK
> > > Sent: Thursday, August 15, 2019 6:50 PM
> > > To: Harris, James R; Walker, Benjamin <benjamin.walker(a)intel.com>; spdk(a)lists.01.org
> > > Cc: Mittal, Rishabh; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian; Kadayam, Hari <hkadayam(a)ebay.com>
> > > Subject: Re: [SPDK] NBD with SPDK
> > >
> > > Thanks. I will get the profiling by next week.
> > >
> > > On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> > >
> > > On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> > >
> > > Hi Jim
> > >
> > > What tool do you use for profiling?
> > >
> > > Hi Rishabh,
> > >
> > > Mostly I just use "perf top".
> > >
> > > -Jim
> > >
> > > Thanks
> > > Rishabh Mittal
> > >
> > > On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> > >
> > > On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
> > >
> > > When an I/O is performed in the process initiating the I/O to a file, the data goes into the OS page cache buffers at a layer far above the bio stack (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to your kernel driver, your kernel driver would still need to copy it to that location out of the page cache buffers. We can't safely share the page cache buffers with a user space process.
> > >
> > > I think Rishabh was suggesting the SPDK reserve the virtual address space only. Then the kernel could map the page cache buffers into that virtual address space. That would not require a data copy, but would require the mapping operations.
> > >
> > > I think the profiling data would be really helpful - to quantify how much of the 50us is due to copying the 4KB of data. That can help drive next steps on how to optimize the SPDK NBD module.
> > >
> > > Thanks,
> > >
> > > -Jim
> > >
> > > As Paul said, I'm skeptical that the memcpy is significant in the overall performance you're measuring. I encourage you to go look at some profiling data and confirm that the memcpy is really showing up. I suspect the overhead is instead primarily in these spots:
> > >
> > > 1) Dynamic buffer allocation in the SPDK NBD backend.
> > >
> > > As Paul indicated, the NBD target is dynamically allocating memory for each I/O. The NBD backend wasn't designed to be fast - it was designed to be simple. Pooling would be a lot faster and is something fairly easy to implement.
> > >
> > > 2) The way SPDK does the syscalls when it implements the NBD backend.
> > >
> > > Again, the code was designed to be simple, not high performance.
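One illustration of what a less simple, higher-performance reply path could look like: collect the replies that complete during a single poll pass and push them out with one writev() call, rather than issuing a separate write() per command. A rough sketch, using a hypothetical completed_io bookkeeping structure rather than anything from SPDK's actual nbd module:

    #include <arpa/inet.h>   /* htonl() */
    #include <linux/nbd.h>   /* struct nbd_reply, NBD_REPLY_MAGIC */
    #include <stdint.h>
    #include <string.h>
    #include <sys/uio.h>     /* writev() */

    #define MAX_BATCH 64

    /* Hypothetical bookkeeping for a command the backend has finished. */
    struct completed_io {
        char     handle[8];   /* opaque handle echoed back to the kernel */
        uint32_t error;       /* 0 on success */
    };

    /* Flush up to MAX_BATCH completions with a single writev() instead of one
     * write() per command. Returns the number of replies sent, or -1 on error.
     * Read commands would also need their data buffers appended as extra iovec
     * entries, and short writes would need handling; both omitted for brevity. */
    static int flush_replies(int nbd_sock, const struct completed_io *done, int ndone)
    {
        struct nbd_reply hdrs[MAX_BATCH];
        struct iovec iov[MAX_BATCH];
        int n = ndone < MAX_BATCH ? ndone : MAX_BATCH;

        for (int i = 0; i < n; i++) {
            hdrs[i].magic = htonl(NBD_REPLY_MAGIC);
            hdrs[i].error = htonl(done[i].error);
            memcpy(hdrs[i].handle, done[i].handle, sizeof(hdrs[i].handle));
            iov[i].iov_base = &hdrs[i];
            iov[i].iov_len  = sizeof(hdrs[i]);
        }
        return writev(nbd_sock, iov, n) < 0 ? -1 : n;
    }

The batching logic costs a little complexity, but the syscall count on the completion path drops from one per command to one per batch.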
> > > The current code simply calls read() and write() on the socket for each command. There are much higher performance ways of doing this, they're just more complex to implement.
> > >
> > > 3) The lack of multi-queue support in NBD
> > >
> > > Every I/O is funneled through a single sockpair up to user space. That means there is locking going on. I believe this is just a limitation of NBD today - it doesn't plug into the block-mq stuff in the kernel and expose multiple sockpairs. But someone more knowledgeable on the kernel stack would need to take a look.
> > >
> > > Thanks,
> > > Ben
> > >
> > > > A couple of things that I am not really sure about in this flow: 1. How memory registration is going to work with the RDMA driver. 2. What changes are required in SPDK memory management.
> > > >
> > > > Thanks
> > > > Rishabh Mittal
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

--
Regards
Huang Zhiteng