From: Kadayam, Hari
Subject: Re: [SPDK] NBD with SPDK
Date: Fri, 06 Sep 2019 20:31:28 +0000
To: spdk@lists.01.org

Kata containers add another layer of indirection compared to Docker, which potentially affects performance, right?

Also, the memory protection concern with virtio is valid, but we could possibly look at containing the accessible memory. In any case, I think an SPDK application wouldn't be accessing any buffer other than the IO buffer from that space.

On 9/6/19, 10:13 AM, "Mittal, Rishabh" wrote:

I am summarizing all the options. We have three options:

1. SPDK with NBD: It needs a few optimizations in the SPDK-NBD path to reduce system call overhead. We can also explore using multiple sockets, since that is supported by NBD. One disadvantage is that there will be a bcopy for every IO from the kernel to SPDK (and vice versa for reads). The overhead of the bcopy compared to end-to-end latency is very low for a 4k workload, but we need to see its impact for larger read/write sizes.

2. SPDK with virtio: It doesn't require any changes in SPDK (assuming that the SPDK virtio target is written for performance), but we need a customized kernel module that can work with the SPDK virtio target. Its obvious advantage is that the kernel buffer cache will be shared with SPDK, so there will be no copy from the kernel to SPDK. Another advantage is that there will be minimal system calls to ring the doorbell, since it uses a shared ring queue. My only concern here is that memory protection will be lost, as entire kernel buffers will be shared with SPDK.

3. SPDK with Kata containers: It doesn't require many changes (Xiaoxi can comment more on this). But our concern is that apps will not be moved to Kata containers, which will slow down its adoption rate.

Please feel free to add pros/cons of any approach if I missed anything. It will help us decide.

Thanks
Rishabh Mittal

On 9/5/19, 7:14 PM, "Szmyd, Brian" wrote:

I believe this option has the same number of copies, since you're still sharing the memory with the Kata VM kernel, not the application itself. This is an option that the development of a virtio-vhost-user driver does not prevent; it's merely an option to allow non-Kata containers to also use the same device.

I will note that doing a virtio-vhost-user driver also allows one to project other device types than just block devices into the kernel device stack. One could also write a user application that exposed an input, network, console, gpu or socket device as well.

Not that I have any interest in these...

On 9/5/19, 8:08 PM, "Huang Zhiteng" wrote:

Since this SPDK bdev is intended to be consumed by a user application running inside a container, we do have the possibility to run the user application inside a Kata container instead. A Kata container does introduce a layer of IO virtualization; therefore we convert a user-space block device on the host to a kernel block device inside the VM, but with fewer memory copies than NBD thanks to SPDK vhost.
A Kata container might impose higher overhead than a plain container, but hopefully it's lightweight enough that the overhead is negligible.

On Fri, Sep 6, 2019 at 5:22 AM Walker, Benjamin wrote:
>
> On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
> > Hi Paul,
> >
> > Rather than put the effort into a formalized document, here is a brief
> > description of the solution I have been investigating, just to get an opinion
> > on feasibility or even workability.
> >
> > Some background and a reiteration of the problem to set things up. I apologize
> > for reiterating anything and including details that some may already know.
> >
> > We are looking for a solution that allows us to write a custom bdev for the
> > SPDK bdev layer that distributes I/O between different NVMe-oF targets that we
> > have attached, and then present that to our application as either a raw block
> > device or a filesystem mountpoint.
> >
> > This is normally (as I understand it) done by exposing a device via QEMU to
> > a VM using the vhost target. This SPDK target has implemented the virtio-scsi
> > (among others) device according to this spec:
> >
> > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021
> >
> > The VM kernel then uses a virtio-scsi module to attach said device into its
> > SCSI mid-layer and have the device enumerated as a /dev/sd[a-z]+ device.
> >
> > The problem is that QEMU virtualizes a PCIe bus for the guest kernel virtio-pci
> > driver to discover the virtio devices and bind them to the virtio-scsi
> > driver. There really is no other way (other than platform MMIO type devices)
> > to attach a device to the virtio-scsi driver.
> >
> > SPDK exposes the virtio device to the VM via QEMU, which has written a "user
> > space" version of the vhost bus. This driver then translates the API into the
> > virtio-pci specification:
> >
> > https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst
> >
> > This uses an eventfd descriptor for interrupting the non-polling side of the
> > queue and a UNIX domain socket to set up (and control) the shared memory which
> > contains the I/O buffers and virtio queues. This is documented in SPDK's own
> > documentation and diagrammed here:
> >
> > https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
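For concreteness: the way that UNIX domain socket hands the shared memory to the vhost-user backend is by passing file descriptors as SCM_RIGHTS ancillary data, which the peer then mmap()s. Below is a minimal, self-contained sketch of just that mechanism; it is not QEMU or SPDK code, and the memfd region and names are illustrative only.

/* fd_pass.c - illustrative only: pass a memory fd over a UNIX domain
 * socket with SCM_RIGHTS, the same mechanism vhost-user relies on to
 * hand its I/O memory regions to the backend process.
 *
 * Build: cc -o fd_pass fd_pass.c
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

static int send_fd(int sock, int fd)
{
    struct msghdr msg = {0};
    struct iovec iov;
    char payload = 'M';                 /* at least one byte of real data */
    char cbuf[CMSG_SPACE(sizeof(int))];

    iov.iov_base = &payload;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;       /* ancillary data carries the fd */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) { perror("socketpair"); return 1; }

    /* Stand-in for a guest/I/O memory region: a 1 MiB memfd-backed area
     * that the peer can mmap() once it has received the fd. */
    int memfd = memfd_create("io-region", 0);
    if (memfd < 0 || ftruncate(memfd, 1 << 20) < 0) { perror("memfd"); return 1; }

    if (send_fd(sv[0], memfd) < 0) { perror("send_fd"); return 1; }

    /* Receiver side (normally a different process, e.g. the vhost target). */
    struct msghdr msg = {0};
    struct iovec iov;
    char byte;
    char cbuf[CMSG_SPACE(sizeof(int))];
    iov.iov_base = &byte;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);
    if (recvmsg(sv[1], &msg, 0) != 1) { perror("recvmsg"); return 1; }

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    int peer_fd;
    memcpy(&peer_fd, CMSG_DATA(cmsg), sizeof(int));

    /* The receiver now maps the very same pages the sender allocated. */
    void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE, MAP_SHARED, peer_fd, 0);
    printf("received fd %d, mapped at %p\n", peer_fd, p);
    return 0;
}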
> > If we could implement this vhost-user QEMU target as a virtio driver in the
> > kernel as an alternative to the virtio-pci driver, it could bind an SPDK vhost
> > into the host kernel as a virtio device enumerated in the /dev/sd[a-z]+
> > tree for our containers to bind. Attached is a draft block diagram.
>
> If you think of QEMU as just another user-space process, and the SPDK vhost
> target as a user-space process, then it's clear that vhost-user is simply a
> cross-process IPC mechanism based on shared memory. The "shared memory" part is
> the critical part of that description - QEMU pre-registers all of the memory
> that will be used for I/O buffers (in fact, all of the memory that is mapped
> into the guest) with the SPDK process by sending fds across a Unix domain
> socket.
>
> If you move this code into the kernel, you have to solve two issues:
>
> 1) What memory is it registering with the SPDK process? The kernel driver has no
> idea which application process may route I/O to it - in fact the application
> process may not even exist yet - so it isn't memory allocated to the application
> process. Maybe you have a pool of kernel buffers that get mapped into the SPDK
> process, and when the application process performs I/O the kernel copies into
> those buffers prior to telling SPDK about them? That would work, but now you're
> back to doing a data copy. I do think you can get it down to 1 data copy instead
> of 2 with a scheme like this.
>
> 2) One of the big performance problems you're seeing is syscall overhead in NBD.
> If you still have a kernel block device that routes messages up to the SPDK
> process, the application process is making the same syscalls because it's still
> interacting with a block device in the kernel, but you're right that the backend
> SPDK implementation could be polling on shared memory rings and potentially run
> more efficiently.
>
> > Since we will not have a real bus to signal for the driver to probe for new
> > devices, we can use a sysfs interface for the application to notify the driver
> > of a new socket and eventfd pair to set up a new virtio-scsi instance.
> > Otherwise the design simply moves the vhost-user driver from the QEMU
> > application into the host kernel itself.
> >
> > It's my understanding that this will avoid a lot more system calls and copies
> > compared to exposing an iSCSI device or NBD device as we're currently
> > discussing. Does this seem feasible?
>
> What you really want is a "block device in user space" solution that's higher
> performance than NBD, and while that's been tried many, many times in the past, I
> do think there is a great opportunity here for someone. I'm not sure that the
> interface between the block device process and the kernel is best done as a
> modification of NBD or a wholesale replacement by vhost-user-scsi, but I'd like
> to throw in a third option to consider - use NVMe queues in shared memory as the
> interface instead. The NVMe queues are going to be much more efficient than
> virtqueues for storage commands.
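Both point 2 and the NVMe-queues-in-shared-memory idea come down to replacing a syscall per command with a producer/consumer ring that the SPDK side polls. The following is a minimal single-producer/single-consumer sketch of that pattern; it is illustrative only, not SPDK's or NVMe's actual queue layout, and a real design would also need a doorbell or eventfd for idle periods.

/* ring_poll.c - sketch of "poll a shared-memory ring instead of a
 * syscall per command". Single producer, single consumer, C11 atomics.
 *
 * Build: cc -O2 -pthread -o ring_poll ring_poll.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 256                      /* power of two */

struct cmd {                               /* stand-in for an NVMe-style SQ entry */
    uint64_t lba;
    uint32_t len;
    uint32_t opcode;
};

struct ring {
    _Atomic uint32_t head;                 /* written by producer */
    _Atomic uint32_t tail;                 /* written by consumer */
    struct cmd slots[RING_SIZE];
};

/* In the real scheme this structure would live in memory shared between
 * the kernel (or application) and the SPDK process, e.g. handed over via
 * an fd as in the earlier sketch. Here it is just a global. */
static struct ring g_ring;

static void *consumer(void *arg)
{
    (void)arg;
    uint32_t done = 0;
    while (done < 1000) {
        uint32_t head = atomic_load_explicit(&g_ring.head, memory_order_acquire);
        uint32_t tail = atomic_load_explicit(&g_ring.tail, memory_order_relaxed);
        while (tail != head) {             /* drain everything published so far */
            struct cmd *c = &g_ring.slots[tail % RING_SIZE];
            /* ...a real backend would submit c to the bdev layer here... */
            (void)c;
            tail++;
            done++;
        }
        atomic_store_explicit(&g_ring.tail, tail, memory_order_release);
        /* No syscall: just keep polling. A real implementation would back
         * off or sleep on an eventfd when the ring stays empty. */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);

    for (uint32_t i = 0; i < 1000; i++) {
        uint32_t head = atomic_load_explicit(&g_ring.head, memory_order_relaxed);
        /* Wait for a free slot (consumer lags by at most RING_SIZE entries). */
        while (head - atomic_load_explicit(&g_ring.tail, memory_order_acquire) >= RING_SIZE)
            ;
        g_ring.slots[head % RING_SIZE] = (struct cmd){ .lba = i, .len = 4096, .opcode = 1 };
        atomic_store_explicit(&g_ring.head, head + 1, memory_order_release);
    }

    pthread_join(t, NULL);
    puts("all commands consumed without a per-command syscall");
    return 0;
}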
> >
> > Thanks,
> > Brian
> >
> > On 9/5/19, 12:32 PM, "Mittal, Rishabh" wrote:
> >
> > Hi Paul.
> >
> > Thanks for investigating it.
> >
> > We have one more idea floating around. Brian is going to send you a
> > proposal shortly. If the other proposal seems feasible to you, we can evaluate
> > the work required for both proposals.
> >
> > Thanks
> > Rishabh Mittal
> >
> > On 9/5/19, 11:09 AM, "Luse, Paul E" wrote:
> >
> > Hi,
> >
> > So I was able to perform the same steps here, and I think one of the
> > keys to really seeing what's going on is to start perf top like this:
> >
> > "perf top --sort comm,dso,symbol -C 0" to get a more focused view by
> > sorting on command, shared object and symbol.
> >
> > Attached are 2 snapshots, one with a NULL back end for nbd and one
> > with libaio/nvme. Some notes after chatting with Ben a bit; please read
> > through and let us know what you think:
> >
> > * In both cases the vast majority of the highest-overhead activities
> > are kernel.
> > * The "copy_user_enhanced" symbol on the NULL case (it shows up on the
> > other as well, but you have to scroll way down to see it) is the
> > user/kernel space copy; nothing SPDK can do about that.
> > * The syscalls that dominate in both cases are likely something that
> > can be improved by changing how SPDK interacts with nbd. Ben had a couple
> > of ideas including (a) using libaio to interact with the nbd fd as opposed to
> > interacting with the nbd socket, and (b) "batching" wherever possible, for example
> > on writes to nbd investigate not ack'ing them until some number have completed.
> > * The kernel slab* commands are likely nbd kernel driver
> > allocations/frees in the IO path; one possibility would be to look at
> > optimizing the nbd kernel driver for this one.
> > * The libc item on the NULL chart also shows up on the libaio profile,
> > however it is again way down the scroll so it didn't make the screenshot :) This
> > could be a zeroing of something somewhere in the SPDK nbd driver.
> >
> > It looks like this data supports what Ben had suspected a while back:
> > much of the overhead we're looking at is kernel nbd. Anyway, let us know what
> > you think and if you want to explore any of the ideas above any further or see
> > something else in the data that looks worthy to note.
> >
> > Thx
> > Paul
> >
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
> > Sent: Wednesday, September 4, 2019 4:27 PM
> > To: Mittal, Rishabh; Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R;
> > spdk(a)lists.01.org
> > Cc: Chen, Xiaoxi; Szmyd, Brian; Kadayam, Hari
> > Subject: Re: [SPDK] NBD with SPDK
> >
> > Cool, thanks for sending this. I will try and repro tomorrow here and
> > see what kind of results I get.
> >
> > Thx
> > Paul
> >
> > -----Original Message-----
> > From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> > Sent: Wednesday, September 4, 2019 4:23 PM
> > To: Luse, Paul E; Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R;
> > spdk(a)lists.01.org
> > Cc: Chen, Xiaoxi; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian
> > Subject: Re: [SPDK] NBD with SPDK
> >
> > Avg CPU utilization is very low when I am running this.
> >
> > 09/04/2019 04:21:40 PM
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >            2.59    0.00    2.57    0.00    0.00   94.84
> >
> > Device    r/s       w/s    rkB/s       wkB/s    rrqm/s    wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm   %util
> > sda      0.00      0.20     0.00        0.80      0.00      0.00   0.00   0.00     0.00     0.00    0.00      0.00      4.00   0.00    0.00
> > sdb      0.00      0.00     0.00        0.00      0.00      0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00    0.00
> > sdc      0.00  28846.80     0.00   191555.20      0.00  18211.00   0.00  38.70     0.00     1.03   29.64      0.00      6.64   0.03  100.00
> > nb0      0.00  47297.00     0.00   191562.40      0.00    593.60   0.00   1.24     0.00     1.32   61.83      0.00      4.05   0
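Paul's idea (a) above - driving the nbd fd with libaio instead of a read()/write() per command - and idea (b), batching, would look roughly like the following. This is only a sketch of the libaio API itself (io_setup/io_submit/io_getevents), not of SPDK's nbd module; the device path and sizes are placeholders.

/* aio_batch.c - illustrative only: batched async I/O against a block
 * device fd via libaio. Not SPDK code; /dev/nbd0 and the sizes below
 * are placeholders.
 *
 * Build: cc -O2 -o aio_batch aio_batch.c -laio
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QD     8            /* batch size: one io_submit() covers 8 commands */
#define BLKSZ  4096

int main(void)
{
    int fd = open("/dev/nbd0", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(QD, &ctx) < 0) { perror("io_setup"); return 1; }

    struct iocb iocbs[QD], *iocbp[QD];
    void *bufs[QD];

    for (int i = 0; i < QD; i++) {
        /* O_DIRECT requires aligned buffers. */
        if (posix_memalign(&bufs[i], BLKSZ, BLKSZ)) { perror("memalign"); return 1; }
        io_prep_pread(&iocbs[i], fd, bufs[i], BLKSZ, (long long)i * BLKSZ);
        iocbp[i] = &iocbs[i];
    }

    /* One syscall submits the whole batch instead of QD read() calls. */
    int submitted = io_submit(ctx, QD, iocbp);
    if (submitted < 0) { fprintf(stderr, "io_submit: %d\n", submitted); return 1; }

    /* Reaping completions is also a single syscall for up to QD events. */
    struct io_event events[QD];
    int done = io_getevents(ctx, submitted, QD, events, NULL);
    printf("submitted %d, completed %d\n", submitted, done);

    io_destroy(ctx);
    close(fd);
    return 0;
}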
> > On 9/4/19, 4:19 PM, "Mittal, Rishabh" wrote:
> >
> > I am using this command:
> >
> > fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --rwmixread=0 --bsrange=4k-4k --direct=1 -filename=/dev/nbd0 --numjobs=8 --runtime 120 --time_based --group_reporting
> >
> > I have created the device by using these commands:
> > 1. ./root/spdk/app/vhost
> > 2. ./rpc.py bdev_aio_create /dev/sdc aio0
> > 3. ./rpc.py start_nbd_disk aio0 /dev/nbd0
> >
> > I am using "perf top" to get the performance.
> >
> > On 9/4/19, 4:03 PM, "Luse, Paul E" wrote:
> >
> > Hi Rishabh,
> >
> > Maybe it would help (me at least) if you described the
> > complete & exact steps for your test - both setup of the env & test and
> > the command to profile. Can you send that out?
> >
> > Thx
> > Paul
> >
> > -----Original Message-----
> > From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> > Sent: Wednesday, September 4, 2019 2:45 PM
> > To: Walker, Benjamin; Harris, James R; spdk(a)lists.01.org; Luse, Paul E <paul.e.luse(a)intel.com>
> > Cc: Chen, Xiaoxi; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian
> > Subject: Re: [SPDK] NBD with SPDK
> >
> > Yes, I am using 64 q depth with one thread in fio. I am using
> > AIO. This profiling is for the entire system. I don't know why spdk threads
> > are idle.
> >
> > On 9/4/19, 11:08 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
> >
> > On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
> > > I got the run again. It is with 4k write.
> > >
> > >   13.16%  vhost  [.] spdk_ring_dequeue
> > >    6.08%  vhost  [.] rte_rdtsc
> > >    4.77%  vhost  [.] spdk_thread_poll
> > >    2.85%  vhost  [.] _spdk_reactor_run
> > >
> > You're doing high queue depth for at least 30 seconds while the trace runs,
> > right? Using fio with the libaio engine on the NBD device is probably the way to
> > go. Are you limiting the profiling to just the core where the main SPDK process
> > is pinned? I'm asking because SPDK still appears to be mostly idle, and I
> > suspect the time is being spent in some other thread (in the kernel). Consider
> > capturing a profile for the entire system. It will have fio stuff in it, but the
> > expensive stuff still should generally bubble up to the top.
> >
> > Thanks,
> > Ben
> >
> > > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> > >
> > > I got the profile with the first run.
> > >
> > >   27.91%  vhost              [.] spdk_ring_dequeue
> > >   12.94%  vhost              [.] rte_rdtsc
> > >   11.00%  vhost              [.] spdk_thread_poll
> > >    6.15%  vhost              [.] _spdk_reactor_run
> > >    4.35%  [kernel]           [k] syscall_return_via_sysret
> > >    3.91%  vhost              [.] _spdk_msg_queue_run_batch
> > >    3.38%  vhost              [.] _spdk_event_queue_run_batch
> > >    2.83%  [unknown]          [k] 0xfffffe000000601b
> > >    1.45%  vhost              [.] spdk_thread_get_from_ctx
> > >    1.20%  [kernel]           [k] __fget
> > >    1.14%  libpthread-2.27.so [.] __libc_read
> > >    1.00%  libc-2.27.so       [.] 0x000000000018ef76
> > >    0.99%  libc-2.27.so       [.] 0x000000000018ef79
> > >
> > > Thanks
> > > Rishabh Mittal
> > >
> > > On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
> > >
> > > That's great. Keep an eye out for the items Ben mentions below - at
> > > least the first one should be quick to implement and compare both profile data
> > > and measured performance.
> > >
> > > Don't forget about the community meetings either, great place to chat
> > > about these kinds of things.
> > > https://spdk.io/community/
> > > The next one is tomorrow morning US time.
> > >
> > > Thx
> > > Paul
> > >
> > > -----Original Message-----
> > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal, Rishabh via SPDK
> > > Sent: Thursday, August 15, 2019 6:50 PM
> > > To: Harris, James R; Walker, Benjamin <benjamin.walker(a)intel.com>; spdk(a)lists.01.org
> > > Cc: Mittal, Rishabh; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
> > > Kadayam, Hari <hkadayam(a)ebay.com>
> > > Subject: Re: [SPDK] NBD with SPDK
> > >
> > > Thanks. I will get the profiling by next week.
> > >
> > > On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> > >
> > >     On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> > >
> > >         Hi Jim
> > >
> > >         What tool do you use for profiling?
> > >
> > >     Hi Rishabh,
> > >
> > >     Mostly I just use "perf top".
> > >
> > >     -Jim
> > >
> > >         Thanks
> > >         Rishabh Mittal
> > >
> > >         On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> > >
> > >             On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
> > >
> > >                 When an I/O is performed in the process initiating the
> > >                 I/O to a file, the data goes into the OS page cache
> > >                 buffers at a layer far above the bio stack
> > >                 (somewhere up in VFS). If SPDK were to reserve some
> > >                 memory and hand it off to your kernel driver, your kernel
> > >                 driver would still need to copy it to that
> > >                 location out of the page cache buffers. We can't
> > >                 safely share the page cache buffers with a user space process.
> > >
> > >             I think Rishabh was suggesting the SPDK reserve the
> > >             virtual address space only. Then the kernel could map the page
> > >             cache buffers into that virtual address space.
> > >             That would not require a data copy, but would require the
> > >             mapping operations.
> > >
> > >             I think the profiling data would be really helpful - to
> > >             quantify how much of the 50us is due to copying the 4KB of
> > >             data. That can help drive next steps on how to optimize
> > >             the SPDK NBD module.
> > >
> > >             Thanks,
> > >
> > >             -Jim
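Jim's question - how much of the ~50us per I/O the 4KB copy can account for - can be bounded with a quick standalone microbenchmark like the one below. It is illustrative only: the numbers are machine dependent, and a real cross-address-space copy such as copy_user_enhanced behaves somewhat differently than an in-process memcpy.

/* memcpy_4k.c - rough microbenchmark for the cost of a 4 KB copy.
 *
 * Build: cc -O2 -o memcpy_4k memcpy_4k.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SZ   4096
#define ITERS    1000000L

int main(void)
{
    /* Two page-aligned 4 KB buffers, like an I/O buffer and its copy target. */
    char *src = aligned_alloc(4096, BUF_SZ);
    char *dst = aligned_alloc(4096, BUF_SZ);
    if (!src || !dst) return 1;
    memset(src, 0xab, BUF_SZ);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++) {
        memcpy(dst, src, BUF_SZ);
        /* Keep the compiler from optimizing the copy loop away. */
        __asm__ __volatile__("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg 4 KB memcpy: %.1f ns (%.3f us)\n", ns / ITERS, ns / ITERS / 1000.0);
    /* Compare this against the ~50 us end-to-end latency to see what share
     * the copy could plausibly contribute. */
    return 0;
}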
> > >                 As Paul said, I'm skeptical that the memcpy is
> > >                 significant in the overall performance you're measuring. I
> > >                 encourage you to go look at some profiling data
> > >                 and confirm that the memcpy is really showing up. I
> > >                 suspect the overhead is instead primarily in these spots:
> > >
> > >                 1) Dynamic buffer allocation in the SPDK NBD backend.
> > >
> > >                 As Paul indicated, the NBD target is dynamically
> > >                 allocating memory for each I/O. The NBD backend wasn't designed
> > >                 to be fast - it was designed to be simple.
> > >                 Pooling would be a lot faster and is something fairly
> > >                 easy to implement.
> > >
> > >                 2) The way SPDK does the syscalls when it implements
> > >                 the NBD backend.
> > >
> > >                 Again, the code was designed to be simple, not high
> > >                 performance. It simply calls read() and write() on the socket
> > >                 for each command. There are much higher
> > >                 performance ways of doing this, they're just more
> > >                 complex to implement.
> > >
> > >                 3) The lack of multi-queue support in NBD.
> > >
> > >                 Every I/O is funneled through a single sockpair up to
> > >                 user space. That means there is locking going on. I
> > >                 believe this is just a limitation of NBD today - it
> > >                 doesn't plug into the block-mq stuff in the kernel and
> > >                 expose multiple sockpairs. But someone more
> > >                 knowledgeable on the kernel stack would need to take a look.
> > >
> > >                 Thanks,
> > >                 Ben
> > >
> > >                     A couple of things that I am not really sure about in this
> > >                     flow: 1. How memory registration is going to work with the RDMA driver.
> > >                     2. What changes are required in SPDK memory management.
> > >
> > >                     Thanks
> > >                     Rishabh Mittal
> >
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

--
Regards
Huang Zhiteng
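On Ben's point (1) above - replacing the per-I/O allocation in the NBD backend with pooling - the shape of the fix is a free list filled once at startup. The sketch below is hedged and single-threaded, not the SPDK nbd module's actual code; SPDK has its own mempool facilities that a real change would use instead.

/* buf_pool.c - illustrative only: pre-allocated buffer pool as an
 * alternative to a malloc()/free() per I/O. Single-threaded sketch.
 *
 * Build: cc -O2 -o buf_pool buf_pool.c
 */
#include <stdio.h>
#include <stdlib.h>

#define POOL_DEPTH 128                 /* enough for the expected queue depth */
#define IO_BUF_SZ  (128 * 1024)        /* max I/O size this pool supports */

struct io_buf {
    struct io_buf *next;               /* free-list link */
    char data[IO_BUF_SZ];
};

static struct io_buf *g_free_list;

static int pool_init(void)
{
    /* Pay the allocation cost once at startup instead of per I/O. */
    for (int i = 0; i < POOL_DEPTH; i++) {
        struct io_buf *buf = malloc(sizeof(*buf));
        if (!buf)
            return -1;
        buf->next = g_free_list;
        g_free_list = buf;
    }
    return 0;
}

static struct io_buf *pool_get(void)
{
    struct io_buf *buf = g_free_list;
    if (buf)
        g_free_list = buf->next;
    return buf;                        /* NULL means queue the I/O until one frees up */
}

static void pool_put(struct io_buf *buf)
{
    buf->next = g_free_list;
    g_free_list = buf;
}

int main(void)
{
    if (pool_init()) return 1;

    /* The per-I/O hot path becomes a pointer pop/push, no allocator calls. */
    struct io_buf *buf = pool_get();
    printf("got buffer %p from the pool\n", (void *)buf);
    pool_put(buf);
    return 0;
}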