From: Kadayam, Hari
Subject: Re: [SPDK] NBD with SPDK
Date: Wed, 14 Aug 2019 17:05:40 +0000
To: spdk@lists.01.org

Hi Ben,

I agree we need to profile this and improve wherever we are seeing the
bottlenecks. The possible improvements you suggested are certainly very
useful to look into and a good place to start. Having said that, for large
writes, isn't the memcpy surely going to add latency and cost more CPU?

Regarding your comment:

> We can't safely share the page cache buffers with a user space process.

The thought process here is that the driver, in the SPDK thread context, does
a remap using something like phys_to_virt() or mmaps the buffers, which means
the page cache buffer(s) can be accessed from the user space process. Of
course, we also have concerns regarding the safety of a user space process
accessing the page cache.

Regards,
Hari

On 8/14/19, 9:19 AM, "Walker, Benjamin" wrote:

On Wed, 2019-08-14 at 14:28 +0000, Luse, Paul E wrote:
> So I think there's still a feeling amongst most involved in the discussion
> that eliminating the memcpy is likely not worth it, especially without
> profiling data to prove it. Ben and I were talking about some other much
> simpler things that might be worth experimenting with. One example would be
> in spdk_nbd_io_recv_internal(): look at how spdk_malloc() is called for
> every I/O. Creating a pre-allocated pool and pulling from there would be a
> quick change and may yield some positive results. Again though, profiling
> will actually tell you where the most time is being spent and where the
> best bang for your buck is in terms of making changes.
>
> Thx
> Paul
>
> From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
>
> Back end device is malloc0, which is a memory device running in the “vhost”
> application address space. It is not over NVMe-oF.
>
> I guess that bio pages are already pinned because the same buffers are sent
> to lower layers to do DMA. Let's say we have written a lightweight eBay
> block driver in the kernel. This would be the flow:
>
> 1. SPDK reserves the virtual space and passes it to the eBay block driver
>    to do mmap. This step happens once during startup.
> 2. For every I/O, the eBay block driver maps buffers to virtual memory and
>    passes the I/O information to SPDK through shared queues.
> 3. SPDK reads it from the shared queue and passes the same virtual address
>    to do RDMA.

When an I/O is performed in the process initiating the I/O to a file, the data
goes into the OS page cache buffers at a layer far above the bio stack
(somewhere up in VFS). If SPDK were to reserve some memory and hand it off to
your kernel driver, your kernel driver would still need to copy it to that
location out of the page cache buffers. We can't safely share the page cache
buffers with a user space process.

As Paul said, I'm skeptical that the memcpy is significant in the overall
performance you're measuring. I encourage you to go look at some profiling
data and confirm that the memcpy is really showing up.
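To make the pre-allocated pool idea concrete (Paul's suggestion above, and
point 1 below), here is a rough sketch using SPDK's public spdk_mempool API.
This is not the actual nbd code; the pool name, element size, pool depth, and
helper names are purely illustrative:

    #include <errno.h>
    #include "spdk/env.h"

    #define NBD_BUF_POOL_SIZE 512            /* illustrative: max in-flight I/Os   */
    #define NBD_BUF_SIZE      (128 * 1024)   /* illustrative: largest expected I/O */

    static struct spdk_mempool *g_nbd_buf_pool;

    /* Called once at startup, instead of spdk_malloc() on every I/O. */
    static int
    nbd_buf_pool_init(void)
    {
        g_nbd_buf_pool = spdk_mempool_create("nbd_io_bufs", NBD_BUF_POOL_SIZE,
                                             NBD_BUF_SIZE,
                                             SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
                                             SPDK_ENV_SOCKET_ID_ANY);
        return g_nbd_buf_pool != NULL ? 0 : -ENOMEM;
    }

    /* Per-I/O receive path: pull a buffer from the pool ... */
    static void *
    nbd_buf_get(void)
    {
        return spdk_mempool_get(g_nbd_buf_pool);   /* NULL if the pool is empty */
    }

    /* ... and return it once the bdev I/O completes. */
    static void
    nbd_buf_put(void *buf)
    {
        spdk_mempool_put(g_nbd_buf_pool, buf);
    }

Since spdk_mempool elements come out of the same pinned, hugepage-backed
memory as spdk_malloc(), they should remain usable for DMA by the backing
bdev. Whether the allocation cost actually matters is exactly what the
profiling should confirm first.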
I suspect the overhead is instead primarily in these spots:

1) Dynamic buffer allocation in the SPDK NBD backend.

As Paul indicated, the NBD target is dynamically allocating memory for each
I/O. The NBD backend wasn't designed to be fast - it was designed to be
simple. Pooling would be a lot faster and is something fairly easy to
implement.

2) The way SPDK does the syscalls when it implements the NBD backend.

Again, the code was designed to be simple, not high performance. It simply
calls read() and write() on the socket for each command. There are much higher
performance ways of doing this, they're just more complex to implement (see
the writev() sketch at the end of this mail).

3) The lack of multi-queue support in NBD.

Every I/O is funneled through a single sockpair up to user space. That means
there is locking going on. I believe this is just a limitation of NBD today -
it doesn't plug into the block-mq stuff in the kernel and expose multiple
sockpairs. But someone more knowledgeable on the kernel stack would need to
take a look.

Thanks,
Ben

> Couple of things that I am not really sure about in this flow:
> 1. How memory registration is going to work with the RDMA driver.
> 2. What changes are required in SPDK memory management.
>
> Thanks
> Rishabh Mittal
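Regarding point 2 above (one read()/write() syscall per command), a minimal
sketch of what batching completions into a single writev() could look like.
The struct and function names here are hypothetical, not actual SPDK nbd code,
and a real implementation would also have to handle short writes and EAGAIN:

    #include <sys/types.h>
    #include <sys/uio.h>

    #define NBD_MAX_BATCH 32   /* illustrative batch size */

    /* Hypothetical completed-command state: NBD reply header plus payload. */
    struct nbd_reply {
        struct iovec hdr;       /* reply header             */
        struct iovec payload;   /* read data, may be empty  */
    };

    /*
     * Instead of one write() per completed command, gather every reply that
     * is ready and push them to the NBD socket in a single writev() call.
     */
    static ssize_t
    nbd_flush_replies(int sock_fd, const struct nbd_reply *replies, int n)
    {
        struct iovec iov[NBD_MAX_BATCH * 2];
        int iovcnt = 0;

        for (int i = 0; i < n && iovcnt + 2 <= NBD_MAX_BATCH * 2; i++) {
            iov[iovcnt++] = replies[i].hdr;
            if (replies[i].payload.iov_len > 0) {
                iov[iovcnt++] = replies[i].payload;
            }
        }

        /* Short writes and EAGAIN left out of this sketch. */
        return writev(sock_fd, iov, iovcnt);
    }

The same idea applies on the receive side with readv(), and the multi-queue
limitation in point 3 would attack the remaining per-command syscall and
locking overhead from the kernel side.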