I am summarizing all the options. We have three:

1. SPDK with NBD: It needs a few optimizations in SPDK-NBD to reduce the system call overhead. We can also explore using multiple sockets, since NBD supports that. One disadvantage is that there will be a bcopy for every I/O from the kernel to SPDK, or vice versa for reads. The overhead of the bcopy compared to end-to-end latency is very low for a 4K workload, but we need to see its impact at larger read/write sizes.

2. SPDK with virtio: It doesn't require any changes in SPDK (assuming the SPDK virtio code is written for performance), but we need a customized kernel module that can work with the SPDK virtio target. Its obvious advantage is that the kernel buffer cache will be shared with SPDK, so there is no copy from kernel to SPDK. Another advantage is that there will be minimal system calls to ring the doorbell, since it will use a shared ring queue. My only concern here is that memory protection is lost, because the kernel buffers are shared with SPDK in their entirety.

3. SPDK with Kata containers: It doesn't require many changes (Xiaoxi can comment more on this). But our concern is that apps will not be moved to Kata containers, which will slow down its adoption rate.

Please feel free to add pros/cons of any approach if I missed anything. It will help us decide.

Thanks
Rishabh Mittal

On 9/5/19, 7:14 PM, "Szmyd, Brian" wrote:

I believe this option has the same number of copies, since you are still sharing the memory with the Kata VM kernel, not with the application itself. This is an option that the development of a virtio-vhost-user driver does not prevent; it's merely a way to allow non-Kata containers to also use the same device.

I will note that doing a virtio-vhost-user driver also allows one to project other device types than just block devices into the kernel device stack. One could also write a user application that exposes an input, network, console, GPU or socket device as well. Not that I have any interest in these...

On 9/5/19, 8:08 PM, "Huang Zhiteng" wrote:

Since this SPDK bdev is intended to be consumed by a user application running inside a container, we do have the possibility of running the user application inside a Kata container instead. A Kata container does introduce a layer of I/O virtualization; we therefore convert a user-space block device on the host into a kernel block device inside the VM, but with fewer memory copies than NBD thanks to SPDK vhost. A Kata container might impose higher overhead than a plain container, but hopefully it is lightweight enough that the overhead is negligible.

On Fri, Sep 6, 2019 at 5:22 AM Walker, Benjamin wrote:
>
> On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
> > Hi Paul,
> >
> > Rather than put the effort into a formalized document, here is a brief description of the solution I have been investigating, just to get an opinion on its feasibility or even workability.
> >
> > Some background and a reiteration of the problem to set things up. I apologize for reiterating anything and including details that some may already know.
> >
> > We are looking for a solution that allows us to write a custom bdev for the SPDK bdev layer that distributes I/O between different NVMe-oF targets that we have attached, and then presents that to our application as either a raw block device or a filesystem mountpoint.
> >
> > This is normally (as I understand it) done by exposing a device via QEMU to a VM using the vhost target. This SPDK target has implemented the virtio-scsi (among others) device according to this spec:
> >
> > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021
> >
> > The VM kernel then uses a virtio-scsi module to attach said device into its SCSI mid-layer and have the device enumerated as a /dev/sd[a-z]+ device.
> >
> > The problem is that QEMU virtualizes a PCIe bus for the guest kernel virtio-pci driver to discover the virtio devices and bind them to the virtio-scsi driver. There really is no other way (other than platform MMIO type devices) to attach a device to the virtio-scsi driver.
> >
> > SPDK exposes the virtio device to the VM via QEMU, which has written a "user space" version of the vhost bus. This driver then translates the API into the virtio-pci specification:
> >
> > https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst
> >
> > This uses an eventfd descriptor for interrupting the non-polling side of the queue, and a UNIX domain socket to set up (and control) the shared memory which contains the I/O buffers and virtio queues. This is documented in SPDK's own documentation and diagrammed here:
> >
> > https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
> >
> > If we could implement this vhost-user QEMU target as a virtio driver in the kernel as an alternative to the virtio-pci driver, it could bind an SPDK vhost into the host kernel as a virtio device and have it enumerated in the /dev/sd[a-z]+ tree for our containers to bind. Attached is a draft block diagram.
>
> If you think of QEMU as just another user-space process, and the SPDK vhost target as a user-space process, then it's clear that vhost-user is simply a cross-process IPC mechanism based on shared memory. The "shared memory" part is the critical part of that description - QEMU pre-registers all of the memory that will be used for I/O buffers (in fact, all of the memory that is mapped into the guest) with the SPDK process by sending fds across a Unix domain socket.
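For reference, the fd-passing Ben describes is the standard SCM_RIGHTS mechanism on an AF_UNIX socket. Below is a minimal, self-contained sketch of handing a shared-memory region to a peer process - illustrative only, since the real vhost-user protocol adds its own message framing on top, and memfd_create here merely stands in for however the guest or application memory is actually backed:

    /* Sketch: create a shared-memory region and pass its fd to a peer over an
     * AF_UNIX socket using SCM_RIGHTS ancillary data. Error handling trimmed. */
    #define _GNU_SOURCE
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static int send_shm_fd(int unix_sock, int shm_fd, uint64_t region_size)
    {
        struct iovec iov = { .iov_base = &region_size, .iov_len = sizeof(region_size) };
        char cmsgbuf[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                              .msg_control = cmsgbuf, .msg_controllen = sizeof(cmsgbuf) };

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;               /* pass a file descriptor */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &shm_fd, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
    }

    int main(void)
    {
        /* Region standing in for the I/O buffer memory to be shared. */
        int shm_fd = memfd_create("io-buffers", 0);
        ftruncate(shm_fd, 1 << 20);

        int sv[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);    /* stand-in for the vhost-user socket */
        send_shm_fd(sv[0], shm_fd, 1 << 20);

        /* The receiver would recvmsg() the fd and mmap() the same pages, which is
         * how the SPDK process ends up with direct access to the I/O buffers. */
        return 0;
    }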
>
> If you move this code into the kernel, you have to solve two issues:
>
> 1) What memory is it registering with the SPDK process? The kernel driver has no idea which application process may route I/O to it - in fact the application process may not even exist yet - so it isn't memory allocated to the application process. Maybe you have a pool of kernel buffers that get mapped into the SPDK process, and when the application process performs I/O the kernel copies into those buffers prior to telling SPDK about them? That would work, but now you're back to doing a data copy. I do think you can get it down to 1 data copy instead of 2 with a scheme like this.
>
> 2) One of the big performance problems you're seeing is syscall overhead in NBD. If you still have a kernel block device that routes messages up to the SPDK process, the application process is making the same syscalls because it's still interacting with a block device in the kernel, but you're right that the backend SPDK implementation could be polling on shared memory rings and potentially run more efficiently.
>
> > Since we will not have a real bus to signal for the driver to probe for new devices, we can use a sysfs interface for the application to notify the driver of a new socket and eventfd pair to set up a new virtio-scsi instance. Otherwise the design simply moves the vhost-user driver from the QEMU application into the host kernel itself.
> >
> > It's my understanding that this will avoid a lot more system calls and copies compared to exposing an iSCSI device or NBD device as we're currently discussing. Does this seem feasible?
>
> What you really want is a "block device in user space" solution that's higher performance than NBD, and while that's been tried many, many times in the past I do think there is a great opportunity here for someone. I'm not sure that the interface between the block device process and the kernel is best done as a modification of NBD or a wholesale replacement by vhost-user-scsi, but I'd like to throw in a third option to consider - use NVMe queues in shared memory as the interface instead. The NVMe queues are going to be much more efficient than virtqueues for storage commands.
> >
> > Thanks,
> > Brian
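To make the third option above a bit more concrete, here is a rough sketch of what "NVMe queues in shared memory" could look like: 64-byte submission entries and 16-byte completion entries in rings placed in a shared region, with the tail index acting as the doorbell. This is purely an illustration of the idea, not a proposed ABI:

    /* Illustrative only: an NVMe-style submission/completion queue pair laid out
     * in shared memory, so the producer (kernel side) and consumer (SPDK) can
     * exchange commands without a syscall on the fast path. */
    #include <stdint.h>

    struct sq_entry {                  /* mirrors the 64-byte NVMe SQE layout */
        uint8_t  opcode;
        uint8_t  flags;
        uint16_t cid;                  /* command identifier */
        uint32_t nsid;
        uint64_t rsvd;
        uint64_t mptr;
        uint64_t prp1;                 /* address of the data buffer */
        uint64_t prp2;
        uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
    };

    struct cq_entry {                  /* mirrors the 16-byte NVMe CQE layout */
        uint32_t result;
        uint32_t rsvd;
        uint16_t sq_head;
        uint16_t sq_id;
        uint16_t cid;
        uint16_t status;               /* low bit doubles as the phase bit */
    };

    #define QUEUE_DEPTH 256

    struct shm_queue_pair {
        volatile uint32_t sq_tail;     /* bumped by the producer ("doorbell") */
        volatile uint32_t sq_head;     /* advanced by SPDK as it consumes */
        volatile uint32_t cq_tail;
        volatile uint32_t cq_head;
        struct sq_entry   sq[QUEUE_DEPTH];
        struct cq_entry   cq[QUEUE_DEPTH];
    };

    /* Producer side: post one command. A real implementation needs a queue-full
     * check and memory barriers appropriate to the architecture. */
    static inline void submit(struct shm_queue_pair *qp, const struct sq_entry *cmd)
    {
        uint32_t tail = qp->sq_tail;
        qp->sq[tail % QUEUE_DEPTH] = *cmd;
        __atomic_store_n(&qp->sq_tail, tail + 1, __ATOMIC_RELEASE);
    }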
> >
> > On 9/5/19, 12:32 PM, "Mittal, Rishabh" wrote:
> >
> > Hi Paul.
> >
> > Thanks for investigating it.
> >
> > We have one more idea floating around. Brian is going to send you a proposal shortly. If the other proposal seems feasible to you, then we can evaluate the work required in both proposals.
> >
> > Thanks
> > Rishabh Mittal
> >
> > On 9/5/19, 11:09 AM, "Luse, Paul E" wrote:
> >
> > Hi,
> >
> > So I was able to perform the same steps here, and I think one of the keys to really seeing what's going on is to start perf top like this:
> >
> > "perf top --sort comm,dso,symbol -C 0" to get a more focused view by sorting on command, shared object and symbol.
> >
> > Attached are 2 snapshots, one with a NULL back end for nbd and one with libaio/nvme. Some notes after chatting with Ben a bit, please read through and let us know what you think:
> >
> > * in both cases the vast majority of the highest overhead activities are kernel
> > * the "copy_user_enhanced" symbol on the NULL case (it shows up on the other as well but you have to scroll way down to see it) is the user/kernel space copy, nothing SPDK can do about that
> > * the syscalls that dominate in both cases are likely something that can be improved on by changing how SPDK interacts with nbd. Ben had a couple of ideas, including (a) using libaio to interact with the nbd fd as opposed to interacting with the nbd socket (see the sketch after this message), and (b) "batching" wherever possible, for example on writes to nbd investigate not ack'ing them until some number have completed
> > * the kernel slab* commands are likely nbd kernel driver allocations/frees in the IO path, one possibility would be to look at optimizing the nbd kernel driver for this one
> > * the libc item on the NULL chart also shows up on the libaio profile, however it is again way down the scroll so it didn't make the screenshot :) This could be a zeroing of something somewhere in the SPDK nbd driver
> >
> > It looks like this data supports what Ben had suspected a while back: much of the overhead we're looking at is kernel nbd. Anyway, let us know what you think and if you want to explore any of the ideas above any further or see something else in the data that looks worthy to note.
> >
> > Thx
> > Paul
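As a rough illustration of idea (a) above - driving the fd with libaio so that many operations can be submitted and reaped with a single syscall each, rather than one read()/write() per command - consider the sketch below. It is not the actual SPDK nbd code, and /dev/nbd0 is just a placeholder target (link with -laio):

    /* Minimal libaio sketch of batched submission and completion. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BATCH 8
    #define BLK   4096

    int main(void)
    {
        io_context_t ctx = 0;
        struct iocb iocbs[BATCH], *iocbps[BATCH];
        struct io_event events[BATCH];
        void *bufs[BATCH];

        int fd = open("/dev/nbd0", O_RDWR | O_DIRECT);   /* placeholder device */
        if (fd < 0 || io_setup(BATCH, &ctx) != 0)
            return 1;

        for (int i = 0; i < BATCH; i++) {
            posix_memalign(&bufs[i], BLK, BLK);
            /* Queue BATCH 4K writes at consecutive offsets. */
            io_prep_pwrite(&iocbs[i], fd, bufs[i], BLK, (long long)i * BLK);
            iocbps[i] = &iocbs[i];
        }

        int submitted = io_submit(ctx, BATCH, iocbps);              /* one syscall for the batch */
        int completed = io_getevents(ctx, 1, BATCH, events, NULL);  /* one syscall to reap */
        printf("submitted=%d completed=%d\n", submitted, completed);

        io_destroy(ctx);
        return 0;
    }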
> >
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
> > Sent: Wednesday, September 4, 2019 4:27 PM
> > To: Mittal, Rishabh ; Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R ; spdk(a)lists.01.org
> > Cc: Chen, Xiaoxi ; Szmyd, Brian ; Kadayam, Hari
> > Subject: Re: [SPDK] NBD with SPDK
> >
> > Cool, thanks for sending this. I will try and repro tomorrow here and see what kind of results I get.
> >
> > Thx
> > Paul
> >
> > -----Original Message-----
> > From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> > Sent: Wednesday, September 4, 2019 4:23 PM
> > To: Luse, Paul E ; Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R ; spdk(a)lists.01.org
> > Cc: Chen, Xiaoxi ; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian
> > Subject: Re: [SPDK] NBD with SPDK
> >
> > Avg CPU utilization is very low when I am running this.
> >
> > 09/04/2019 04:21:40 PM
> > avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
> >            2.59   0.00     2.57     0.00    0.00  94.84
> >
> > Device  r/s   w/s       rkB/s  wkB/s      rrqm/s  wrqm/s    %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
> > sda     0.00  0.20      0.00   0.80       0.00    0.00      0.00   0.00   0.00     0.00     0.00    0.00      4.00      0.00   0.00
> > sdb     0.00  0.00      0.00   0.00       0.00    0.00      0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00
> > sdc     0.00  28846.80  0.00   191555.20  0.00    18211.00  0.00   38.70  0.00     1.03     29.64   0.00      6.64      0.03   100.00
> > nb0     0.00  47297.00  0.00   191562.40  0.00    593.60    0.00   1.24   0.00     1.32     61.83   0.00      4.05      0
> >
> > On 9/4/19, 4:19 PM, "Mittal, Rishabh" wrote:
> >
> > I am using this command
> >
> > fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --rwmixread=0 --bsrange=4k-4k --direct=1 -filename=/dev/nbd0 --numjobs=8 --runtime 120 --time_based --group_reporting
> >
> > I have created the device by using these commands
> > 1. ./root/spdk/app/vhost
> > 2. ./rpc.py bdev_aio_create /dev/sdc aio0
> > 3. /rpc.py start_nbd_disk aio0 /dev/nbd0
> >
> > I am using "perf top" to get the performance
> >
> > On 9/4/19, 4:03 PM, "Luse, Paul E" wrote:
> >
> > Hi Rishabh,
> >
> > Maybe it would help (me at least) if you described the complete & exact steps for your test - both setup of the env & test and the command to profile. Can you send that out?
> >
> > Thx
> > Paul
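For context on what the start_nbd_disk step above ends up setting up on the kernel side, here is a rough sketch of how a user-space server is generally wired to /dev/nbdX through the kernel's legacy NBD ioctl interface. This is illustrative only - the actual SPDK nbd code and its RPC plumbing differ, and the sizes are made up:

    /* Sketch: attach one end of a socket pair to /dev/nbd0 via the NBD ioctls.
     * The other end is where the user-space server reads struct nbd_request
     * messages and writes struct nbd_reply (plus read payloads) back -- the
     * single socket funnel discussed elsewhere in this thread. */
    #include <fcntl.h>
    #include <linux/nbd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
            return 1;

        int nbd = open("/dev/nbd0", O_RDWR);
        if (nbd < 0)
            return 1;

        ioctl(nbd, NBD_SET_BLKSIZE, 4096UL);
        ioctl(nbd, NBD_SET_SIZE_BLOCKS, 262144UL);   /* 1 GiB device, made-up size */
        ioctl(nbd, NBD_SET_SOCK, sv[1]);             /* kernel end of the socket pair */

        if (fork() == 0) {
            /* Child: a real server loops here, reading requests from sv[0],
             * servicing them (in SPDK's case against a bdev), and writing
             * replies back. Omitted in this sketch. */
            _exit(0);
        }

        /* NBD_DO_IT blocks for the lifetime of the device, pumping block-layer
         * requests onto the socket. */
        ioctl(nbd, NBD_DO_IT);
        ioctl(nbd, NBD_CLEAR_SOCK);
        close(nbd);
        return 0;
    }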
> >
> > -----Original Message-----
> > From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> > Sent: Wednesday, September 4, 2019 2:45 PM
> > To: Walker, Benjamin ; Harris, James R ; spdk(a)lists.01.org; Luse, Paul E <paul.e.luse(a)intel.com>
> > Cc: Chen, Xiaoxi ; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian
> > Subject: Re: [SPDK] NBD with SPDK
> >
> > Yes, I am using 64 queue depth with one thread in fio. I am using AIO. This profiling is for the entire system. I don't know why the spdk threads are idle.
> >
> > On 9/4/19, 11:08 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
> >
> > On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
> > > I got the run again. It is with 4k write.
> > >
> > >   13.16%  vhost  [.] spdk_ring_dequeue
> > >    6.08%  vhost  [.] rte_rdtsc
> > >    4.77%  vhost  [.] spdk_thread_poll
> > >    2.85%  vhost  [.] _spdk_reactor_run
> >
> > You're doing high queue depth for at least 30 seconds while the trace runs, right? Using fio with the libaio engine on the NBD device is probably the way to go. Are you limiting the profiling to just the core where the main SPDK process is pinned? I'm asking because SPDK still appears to be mostly idle, and I suspect the time is being spent in some other thread (in the kernel). Consider capturing a profile for the entire system. It will have fio stuff in it, but the expensive stuff still should generally bubble up to the top.
> >
> > Thanks,
> > Ben
> >
> > > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> > >
> > > I got the profile with the first run.
> > >
> > >   27.91%  vhost               [.] spdk_ring_dequeue
> > >   12.94%  vhost               [.] rte_rdtsc
> > >   11.00%  vhost               [.] spdk_thread_poll
> > >    6.15%  vhost               [.] _spdk_reactor_run
> > >    4.35%  [kernel]            [k] syscall_return_via_sysret
> > >    3.91%  vhost               [.] _spdk_msg_queue_run_batch
> > >    3.38%  vhost               [.] _spdk_event_queue_run_batch
> > >    2.83%  [unknown]           [k] 0xfffffe000000601b
> > >    1.45%  vhost               [.] spdk_thread_get_from_ctx
> > >    1.20%  [kernel]            [k] __fget
> > >    1.14%  libpthread-2.27.so  [.] __libc_read
> > >    1.00%  libc-2.27.so        [.] 0x000000000018ef76
> > >    0.99%  libc-2.27.so        [.] 0x000000000018ef79
> > >
> > > Thanks
> > > Rishabh Mittal
> > >
> > > On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
> > >
> > > That's great. Keep an eye out for the items Ben mentions below - at least the first one should be quick to implement and compare both profile data and measured performance.
> > >
> > > Don't forget about the community meetings either, great place to chat about these kinds of things. https://spdk.io/community/ Next one is tomorrow morning US time.
> > >
> > > Thx
> > > Paul
> > >
> > > -----Original Message-----
> > > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal, Rishabh via SPDK
> > > Sent: Thursday, August 15, 2019 6:50 PM
> > > To: Harris, James R ; Walker, Benjamin <benjamin.walker(a)intel.com>; spdk(a)lists.01.org
> > > Cc: Mittal, Rishabh ; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian ; Kadayam, Hari <hkadayam(a)ebay.com>
> > > Subject: Re: [SPDK] NBD with SPDK
> > >
> > > Thanks. I will get the profiling by next week.
> > >
> > > On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> > >
> > > On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> > >
> > > Hi Jim
> > >
> > > What tool do you use for profiling?
> > >
> > > Hi Rishabh,
> > >
> > > Mostly I just use "perf top".
> > >
> > > -Jim
> > >
> > > Thanks
> > > Rishabh Mittal
> > >
> > > On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> > >
> > > On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
> > >
> > > When an I/O is performed in the process initiating the I/O to a file, the data goes into the OS page cache buffers at a layer far above the bio stack (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to your kernel driver, your kernel driver would still need to copy it to that location out of the page cache buffers. We can't safely share the page cache buffers with a user space process.
> > >
> > > I think Rishabh was suggesting that SPDK reserve the virtual address space only. Then the kernel could map the page cache buffers into that virtual address space. That would not require a data copy, but would require the mapping operations.
> > >
> > > I think the profiling data would be really helpful - to quantify how much of the 50us is due to copying the 4KB of data. That can help drive next steps on how to optimize the SPDK NBD module.
> > >
> > > Thanks,
> > >
> > > -Jim
> > >
> > > As Paul said, I'm skeptical that the memcpy is significant in the overall performance you're measuring. I encourage you to go look at some profiling data and confirm that the memcpy is really showing up. I suspect the overhead is instead primarily in these spots:
> > >
> > > 1) Dynamic buffer allocation in the SPDK NBD backend.
> > >
> > > As Paul indicated, the NBD target is dynamically allocating memory for each I/O. The NBD backend wasn't designed to be fast - it was designed to be simple. Pooling would be a lot faster and is something fairly easy to implement (a rough sketch follows below this message).
> > >
> > > 2) The way SPDK does the syscalls when it implements the NBD backend.
> > >
> > > Again, the code was designed to be simple, not high performance. It simply calls read() and write() on the socket for each command. There are much higher performance ways of doing this, they're just more complex to implement.
> > >
> > > 3) The lack of multi-queue support in NBD.
> > >
> > > Every I/O is funneled through a single sockpair up to user space. That means there is locking going on. I believe this is just a limitation of NBD today - it doesn't plug into the block-mq stuff in the kernel and expose multiple sockpairs. But someone more knowledgeable on the kernel stack would need to take a look.
> > >
> > > Thanks,
> > > Ben
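As a sketch of the pooling Ben suggests in point 1 - preallocating the I/O buffers once and recycling them instead of allocating and freeing per request - something along these lines would do. It is shown as a plain free-list to stay self-contained; inside SPDK the natural fit would presumably be its own mempool facility, and the counts and sizes here are arbitrary:

    /* Illustrative fixed-size buffer pool: allocate everything up front, then
     * get/put buffers in O(1) on the I/O path. Single-threaded sketch; a real
     * pool needs locking or per-core caches. */
    #include <stdlib.h>

    #define POOL_COUNT 1024
    #define BUF_SIZE   (128 * 1024)

    struct buf_pool {
        void *free_list[POOL_COUNT];
        int   free_top;
    };

    static int pool_init(struct buf_pool *p)
    {
        p->free_top = 0;
        for (int i = 0; i < POOL_COUNT; i++) {
            if (posix_memalign(&p->free_list[p->free_top], 4096, BUF_SIZE) != 0)
                return -1;
            p->free_top++;
        }
        return 0;
    }

    static void *pool_get(struct buf_pool *p)           /* called per I/O instead of malloc() */
    {
        return p->free_top > 0 ? p->free_list[--p->free_top] : NULL;
    }

    static void pool_put(struct buf_pool *p, void *buf) /* called at I/O completion */
    {
        p->free_list[p->free_top++] = buf;
    }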
> > > >
> > > > A couple of things that I am not really sure about in this flow:
> > > > 1. How is memory registration going to work with the RDMA driver?
> > > > 2. What changes are required in SPDK memory management?
> > > >
> > > > Thanks
> > > > Rishabh Mittal
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

--
Regards
Huang Zhiteng