* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-19 21:16 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-19 21:16 UTC (permalink / raw)
  To: spdk


Hi JD,

What issue specifically did you see? If there is something measurable happening (other than the error message) then I think it should be high priority to get a more permanent workaround into upstream SPDK.

Thanks,

Seth

-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com] 
Sent: Monday, August 19, 2019 2:03 PM
To: Howell, Seth <seth.howell(a)intel.com>
Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan Richardson <jonathan.richardson(a)broadcom.com>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

 > Unfortunately, the only way to protect fully against this happening is
 > by using the DPDK flag --match-allocations which was introduced in DPDK 19.02.

Then I need to use DPDK 19.02. Do I need to enable this flag explicitly when moving to DPDK 19.02?

 > The good news is that the SPDK target will skip these buffers without
 > bricking, causing data corruption or doing any otherwise bad things.

Unfortunately, this is not what I saw. It appeared that SPDK gave up on this split buffer, but it still caused issues, maybe because it was retried too many times(?).

Currently, I have to use DPDK 18.11, so I added a couple of workarounds to prevent the split buffer from being used before reaching fill_buffers(). I did a little trick there: call spdk_mempool_get() but not spdk_mempool_put() later, so that the buffer stays marked as "allocated" in the mempool and will not be tried again and again. It does look like a small memory leak, though. We usually see 2-3 split buffers during an overnight run, btw.

This seems to be working OK.
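A minimal sketch of that quarantine trick, assuming a hypothetical helper buffer_spans_two_mrs() standing in for the spdk_mem_map_translate() length check (an illustration of the idea, not the actual Broadcom patch):

    /* Pop a buffer from the transport's shared pool; if it straddles two RDMA
     * MRs, intentionally never spdk_mempool_put() it, so the pool keeps it
     * marked "allocated" and never hands it out again. The "leak" is bounded
     * by the handful of split buffers seen per run. */
    void *buf = spdk_mempool_get(rtransport->transport.data_buf_pool);
    if (buf != NULL && buffer_spans_two_mrs(device, buf, io_unit_size)) {
            /* Quarantine: skip spdk_mempool_put(buf) on purpose. */
            return NULL;   /* caller retries and gets a different buffer */
    }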

For sure, I will measure performance later.

Thanks,
JD


On 8/19/19 1:12 PM, Howell, Seth wrote:
> Hi JD,
> 
> Thanks for performing that experiment. With this new information, I 
> think we can be pretty much 100% sure that the problem is related to 
> the mempool being split over two DPDK memzones. Unfortunately, the 
> only way to protect fully against this happening is by using the DPDK 
> flag --match-allocations which was introduced in DPDK 19.02. Jim 
> helped advocate for this flag specifically because of this problem 
> with mempools and RDMA.
> 
> The good news is that the SPDK target will skip these buffers without 
> bricking, causing data corruption or doing any otherwise bad things.
> What ends up happening is that the nvmf_rdma_fill_buffers function 
> will print the error message and then return NULL which will trigger 
> the target to retry the I/O again. By that time, there will be another 
> buffer there for the request to use and it won’t fail the second time 
> around. So the code currently handles the problem in a technically 
> correct way, i.e., it's not going to brick the target or initiator by 
> trying to use a buffer that spans multiple Memory Regions. Instead, it 
> properly recognizes that it is trying to use a bad buffer and 
> reschedules the request buffer parsing.
> 
> However, I am a little worried over the fact that these buffers remain 
> in the mempool and can be repeatedly used by the application. I can 
> picture a scenario where this could possibly have a  performance impact.
> Take for example a mempool with 128 entries in it in which one of them 
> is split over a memzone. Since this split buffer will never find its 
> way into a request, it’s possible that this split buffer gets pulled 
> up into requests more often than other buffers and subsequently fails 
> in nvmf_rdma_fill_buffers causing requests to have to be rescheduled 
> to the next time the poller runs. Depending on how frequently this 
> happens, the performance impact *could possibly* add up.
> 
> I have as yet been unable to replicate the split buffer error. One 
> thing you could try to see if there is any measurable performance 
> impact is try starting the NVMe-oF target with DPDK legacy memory mode 
> which will move all memory allocations to startup and prevent you from 
> splitting buffers. Then run a benchmark with a lot of connections at 
> high queue depth and see what the performance looks like compared to 
> the dynamic memory model. If there is a significant performance 
> impact, we may have to modify how we handle this error case.
> 
> Thanks,
> 
> Seth
> 
> *From:*JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> *Sent:* Monday, August 12, 2019 4:17 PM
> *To:* Howell, Seth <seth.howell(a)intel.com>
> *Cc:* Storage Performance Development Kit <spdk(a)lists.01.org>; 
> Jonathan Richardson <jonathan.richardson(a)broadcom.com>
> *Subject:* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over 
> multiple RDMA Memory Regions
> 
> + Jonathan
> 
> Hi Seth,
> 
> We finally got a chance to test with more logs enabled. You are correct 
> that the problematic buffer does sit on 2 registered memory regions:
> 
> The problematic buffer is "Buffer address: 200019bfeb00", the actual used 
> buffer pointer is "200019bff000" (SPDK makes it 4 KiB aligned), and the size is 
> 8 KiB (0x2000), so it does sit on 2 registered memory regions.
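For reference, the 4 KiB alignment mentioned here comes from rounding the 64-byte-aligned mempool object up to the data buffer alignment; roughly (paraphrasing the rdma.c request-fill code, so treat field names as approximate):

    buf = spdk_mempool_get(rtransport->transport.data_buf_pool);  /* e.g. 0x200019bfeb00 */
    /* Align up to NVMF_DATA_BUFFER_ALIGNMENT (4 KiB):
     * 0x200019bfeb00 rounds up to 0x200019bff000, matching the log above. */
    rdma_req->req.iov[iovcnt].iov_base =
            (void *)(((uintptr_t)buf + NVMF_DATA_BUFFER_MASK) & ~(uintptr_t)NVMF_DATA_BUFFER_MASK);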
> 
> However, it looks like SPDK/DPDK allocates buffers starting from the end of a 
> region and going up, but due to the extra room and alignment of each 
> buffer, there is a chance that one buffer can exceed a memory region 
> boundary?
> 
> In this case, the buffers are between 0x200019997800 and 0x200019c5320, 
> so the last buffer exceeds one region and goes into the next one.
> 
> Some logs for your information:
> 
> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start: 
> 200019800000, memory region length: 400000
> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start: 
> 200019c00000, memory region length: 400000
> 
> ...
> 
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
> 0x200019bfeb00 27(32)
> 
> ...
> 
> rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer 
> address: 200019bfeb00 iov_base address 200019bff000
> 
> Thanks,
> 
> JD
> 
> On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com 
> <mailto:seth.howell(a)intel.com>> wrote:
> 
>     There are two different assignments that you need to look at. I'll
>     detail the cases below based on line numbers from the latest master.
> 
>     Memory.c:656 *size = spdk_min(*size, cur_size):
>              This assignment is inside of the conditional "if(size ==
>     NULL || map->ops.are_contiguous == NULL)"
>              So in other words, at the offset, we figure out how much
>     space we have left in the current translation. Then, if there is no
>     callback to tell us whether the next translation will be contiguous
>     to this one, we fill the size variable with the remaining length of
>     that 2 MiB buffer.
> 
>     Memory.c:682 *size = spdk_min(*size, cur_size):
>              This assignment comes after the while loop guarded by the
>     condition "while (cur_size < *size)". This while loop assumes that
>     we have supplied some desired length for our buffer. This is true in
>     the RDMA case. Now this while loop will only break on two
>     conditions. 1. Cur_size becomes larger than *size, or the
>     are_contiguous function returns false, meaning that the two
>     translations cannot be considered together. In the case of the RDMA
>     memory map, the only time are_contiguous returns false is when the
>     two memory regions correspond to two distinct RDMA MRs. Notice that
>     in this case - the one where are_contiguous is defined and we
>     supplied a size variable - the *size variable is not overwritten
>     with cur_size until 1. cur_size is >= *size or 2. The are_contiguous
>     check fails.
> 
>     In the second case detailed above, you can see how one could  pass
>     in a buffer that spanned a 2 MiB page and still get a translation
>     value equal to the size of the buffer. This second case is the one
>     that the rdma.c code should be using since we have a registered
>     are_contiguous function with the NIC and we have supplied a size
>     pointer filled with the length of our buffer.
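Paraphrasing the two cases above as pseudocode (a sketch of the logic being described, not the literal memory.c source):

    cur_size = VALUE_2MB - _2MB_OFFSET(vaddr);
    if (size == NULL || map->ops.are_contiguous == NULL) {
            /* Case 1: no contiguity callback -- clamp to the current 2 MiB chunk. */
            if (size != NULL) {
                    *size = spdk_min(*size, cur_size);
            }
    } else {
            /* Case 2 (RDMA): keep extending across 2 MiB chunks as long as the
             * callback says the two translations (i.e. the two MRs) match. */
            while (cur_size < *size &&
                   map->ops.are_contiguous(prev_translation, cur_translation)) {
                    cur_size += VALUE_2MB;
                    /* ...advance prev_translation/cur_translation to the next entry... */
            }
            *size = spdk_min(*size, cur_size);
    }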
> 
>     -----Original Message-----
>     From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>]
>     Sent: Thursday, August 1, 2019 2:01 PM
>     To: Howell, Seth <seth.howell(a)intel.com
>     <mailto:seth.howell(a)intel.com>>; Storage Performance Development Kit
>     <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>     Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple RDMA Memory Regions
> 
>     Hi Seth,
> 
>       > Just because a buffer extends past a 2 MiB boundary doesn't mean
>     that it exists in two different Memory Regions. It also won't fail
>     the translation for being over two memory regions.
> 
>     This makes sense. However, spdk_mem_map_translate() does following
>     to calculate translation_len:
> 
>     cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
>     *size = spdk_min(*size, cur_size); // *size is the translation_len
>     from caller nvmf_rdma_fill_buffers()
> 
>     In nvmf_rdma_fill_buffers(),
> 
>     if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>                              SPDK_ERRLOG("Data buffer split over
>     multiple RDMA Memory Regions\n");
>                              return -EINVAL;
>                      }
> 
>     This just checks whether the buffer sits on 2 2MB pages, not whether it
>     spans 2 RDMA memory regions. Is my understanding correct?
> 
>     I still need some time to test. I will update you with the result with -s
>     as well.
> 
>     Thanks,
>     JD
> 
> 
>     On 8/1/19 1:28 PM, Howell, Seth wrote:
>      > Hi JD,
>      >
>      > The 2 MiB check is just because we always do memory registrations
>     at at least 2 MiB granularity (the minimum hugepage size). Just
>     because a buffer extends past a 2 MiB boundary doesn't mean that it
>     exists in two different Memory Regions. It also won't fail the
>     translation for being over two memory regions.
>      >
>      > If you look at the definition of spdk_mem_map_translate we call
>     map->ops->are_contiguous every time we cross a 2 MiB boundary. For
>     RDMA, this function is registered to
>     spdk_nvmf_rdma_check_contiguous_entries. If this function returns
>     true, then even if the buffer crosses a 2 MiB boundary, the
>     translation will still be valid.
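For reference, that callback is essentially a comparison of the stored translations, which for this map are the ibv_mr handles registered in spdk_nvmf_rdma_mem_notify; roughly (quoted from memory, so treat it as a sketch):

    static int
    spdk_nvmf_rdma_check_contiguous_entries(uint64_t addr_1, uint64_t addr_2)
    {
            /* Two 2 MiB chunks are "contiguous" only if both translate to the
             * same registered MR. */
            return addr_1 == addr_2;
    }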
>      > The problem you are running into is not related to the buffer
>     alignment, it is related to the fact that the two pages across which
>     the buffer is split are registered to two different MRs in the NIC.
>     This can only happen if those two pages are allocated independently
>     and trigger two distinct memory event callbacks.
>      >
>      > That is why I am so interested in seeing the results from the
>     noticelog above ibv_reg_mr. It will tell me how your target
>     application is allocating memory. Also, when you start the SPDK
>     target, are you using the -s option? Something like
>     ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know
>     if it'll make a difference, it's more of a curiosity thing for me)?
>      >
>      > Thanks,
>      >
>      > Seth
>      >
>      > -----Original Message-----
>      > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>]
>      > Sent: Thursday, August 1, 2019 11:24 AM
>      > To: Howell, Seth <seth.howell(a)intel.com
>     <mailto:seth.howell(a)intel.com>>; Storage Performance
>      > Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>      > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple
>      > RDMA Memory Regions
>      >
>      > Hi Seth,
>      >
>      > Thanks for the detailed description; now I understand the reason
>     behind the check. But I have a question: why check against
>     2MiB? Is it because DPDK uses a 2MiB page size by default, so that one
>     RDMA memory region should not cross 2 pages?
>      >
>      >   > Once I see what your memory registrations look like and what
>     addresses you're failing on, it will help me understand what is
>     going on better.
>      >
>      > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7
>     +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport
>     *rtransport,
>      >                   remaining_length -=
>      > rdma_req->req.iov[iovcnt].iov_len;
>      >
>      >                   if (translation_len <
>     rdma_req->req.iov[iovcnt].iov_len) {
>      > -                       SPDK_ERRLOG("Data buffer split over multiple
>      > RDMA Memory Regions\n");
>      > +                       SPDK_ERRLOG("Data buffer split over multiple
>      > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
>     rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
>     translation_len, rdma_req->req.iov[iovcnt].iov_len);
>      >                           return -EINVAL;
>      >                   }
>      >
>      > With this I can see which buffer failed the checking.
>      > For example, when SPDK initializes the memory pool, one of the
>     buffers starts with 0x2000193feb00, and when it failed, I got the following:
>      >
>      > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
>      > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376)
>     (8192)
>      >
>      > This buffer has 5376B on one 2MB page and the rest of it
>      > (8192-5376=2816B) is on another page.
>      >
>      > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to
>     use the iov base should make it better, as the iov base is 4KiB aligned. In
>     the above case, iov_base is 0x2000193feb00 & ~0xfff = 0x2000193fe000 and
>     it should pass the check.
>      > However, another buffer in the pool is 0x2000192010c0 and its
>     iov_base is 0x200019201000, which would fail the check because it
>     is only 4KiB to the 2MB boundary and IOUnitSize is 8KiB.
>      >
>      > I will add the change from
>      > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the
>     test to get more information.
>      >
>      > I also attached the conf file. The cmd line is "nvmf_tgt -m 0xff
>      > -j
>      > 0x90000000:0x20000000 -c 16disk_1ns.conf"
>      >
>      > Thanks,
>      > JD
>      >
>      >
>      > On 8/1/19 7:52 AM, Howell, Seth wrote:
>      >> Hi JD,
>      >>
>      >> I was doing a little bit of digging in the dpdk documentation
>     around this process, and I have a little bit more information. We
>     were pretty worried about the whole dynamic memory allocations thing
>     a few releases ago, so Jim helped add a flag into DPDK that
>     prevented allocations from being allocated and freed in different
>     granularities. This flag also prevents malloc heap allocations from
>     spanning multiple memory events. However, this flag didn't make it
>     into DPDK until 19.02 (More documentation at
>     https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer
>     if you're interested). We have some code in the SPDK environment
>     layer that tries to deal with that (see
>     lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that
>     function is entirely capable of handling the heap allocations
>     spanning multiple memory events part of the problem.
>      >> Since you are using dpdk 18.11, the memory callback inside of
>     lib/env_dpdk looks like a good candidate for our issue. My best
>     guess is that somehow a heap allocation from the buffer mempool is
>     hitting across addresses from two dynamic memory allocation events.
>     I'd still appreciate it if you could send me the information in my
>     last e-mail, but I think we're onto something here.
>      >>
>      >> Thanks,
>      >>
>      >> Seth
>      >>
>      >> -----Original Message-----
>      >> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>     <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of Howell,
>      >> Seth
>      >> Sent: Thursday, August 1, 2019 5:26 AM
>      >> To: JD Zheng <jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>>; Storage Performance
>      >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>      >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple
>      >> RDMA Memory Regions
>      >>
>      >> Hi JD,
>      >>
>      >> Thanks for doing that. Yeah, I am mainly looking to see how the
>     mempool addresses are mapped into the NIC with ibv_reg_mr.
>      >>
>      >> I think it's odd that we are using the buffer base for the memory
>      >> check, we should be using the iov base, but I don't believe that
>      >> would cause the issue you are seeing. Pushed a change to modify
>     that
>      >> behavior anyways though:
>      >> https://review.gerrithub.io/c/spdk/spdk/+/463893
>      >>
>      >> There was one registration that I wasn't able to catch from your
>     last log. Sorry about that, I forgot there wasn’t a debug log for
>     it. Can you try it again with this change, which adds noticelogs for
>     the relevant registrations?
>     https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able
>     to run your test without the -Lrdma argument this time to avoid the
>     extra bloat in the logs.
>      >>
>      >> The underlying assumption of the code is that any given object
>     is not going to cross a dynamic memory allocation from DPDK. For a
>     little background, when the mempool gets created, the dpdk code
>     allocates some number of memzones to accommodate those buffer
>     objects. Then it passes those memzones down one at a time and places
>     objects inside the mempool from the given memzone until the memzone
>     is exhausted. Then it goes back and grabs another memzone. This
>     process continues until all objects are accounted for.
>      >> This only works if each memzone corresponds to a single memory
>     event when using dynamic memory allocation. My understanding was
>     that this was always the case, but this error makes me think that
>     it's possible that that's not true.
>      >>
>      >> Once I see what your memory registrations look like and what
>     addresses you're failing on, it will help me understand what is
>     going on better.
>      >>
>      >> Can you also provide the command line you are using to start the
>     nvmf_tgt application and attach your configuration file?
>      >>
>      >> Thanks,
>      >>
>      >> Seth
>      >> -----Original Message-----
>      >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>]
>      >> Sent: Wednesday, July 31, 2019 3:13 PM
>      >> To: Howell, Seth <seth.howell(a)intel.com
>     <mailto:seth.howell(a)intel.com>>; Storage Performance
>      >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>      >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple
>      >> RDMA Memory Regions
>      >>
>      >> Hi Seth,
>      >>
>      >> After I enabled debug and ran nvmf_tgt with -L rdma, I got some
>     logs like:
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x2000084bf000 Length: 40000 LKey: e601
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x200008621000 Length: 10000 LKey: e701
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x200018600000 Length: 1000000 LKey: e801
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x20000847e000 Length: 40000 LKey: e701
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x20000846d000 Length: 10000 LKey: e801
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x200019800000 Length: 1000000 LKey: e901
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x200016ebb000 Length: 40000 LKey: e801
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x20000845c000 Length: 10000 LKey: e901
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x20001aa00000 Length: 1000000 LKey: ea01
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x200016e7a000 Length: 40000 LKey: e901
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x20000844b000 Length: 10000 LKey: ea01
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>      >>
>      >> Is this what you are looking for as the memory regions registered for the NIC?
>      >>
>      >> I attached the complete log.
>      >>
>      >> Thanks,
>      >> JD
>      >>
>      >> On 7/30/19 5:28 PM, JD Zheng wrote:
>      >>> Hi Seth,
>      >>>
>      >>> Thanks for the prompt reply!
>      >>>
>      >>> Please find answers inline.
>      >>>
>      >>> JD
>      >>>
>      >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>      >>>> Hi JD,
>      >>>>
>      >>>> Thanks for the report. I want to ask a few questions to start
>      >>>> getting to the bottom of this. Since this issue doesn't currently
>      >>>> reproduce on our per-patch or nightly tests, I would like to
>      >>>> understand what's unique about your setup so that we can
>     replicate
>      >>>> it in a per patch test to prevent future regressions.
>      >>> I am running it on an aarch64 platform. I tried an x86 platform and I
>     can
>      >>> see the same buffer alignment in the memory pool but can't run the real
>     test
>      >>> to reproduce it due to other missing pieces.
>      >>>
>      >>>>
>      >>>> What options are you passing when you create the rdma transport?
>      >>>> Are you creating it over RPC or in a configuration file?
>      >>> I am using a conf file. Please let me know if you'd like to look
>     into the conf file.
>      >>>
>      >>>>
>      >>>> Are you using the current DPDK submodule as your environment
>      >>>> abstraction layer?
>      >>> No. Our project uses a specific version of DPDK, which is v18.11. I
>      >>> did a quick test using the latest SPDK and DPDK submodule on x86, and the
>      >>> buffer alignment is the same, i.e. 64B aligned.
>      >>>
>      >>>>
>      >>>> I notice that your error log is printing from
>      >>>> spdk_nvmf_transport_poll_group_create, which value exactly are
>     you
>      >>>> printing out?
>      >>> Here is the patch to add the dbg print. Please note that the SPDK version is
>     v19.04.
>      >>>
>      >>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>      >>>                                    SPDK_NOTICELOG("Unable to
>     reserve
>      >>> the full number of buffers for the pg buffer cache.\n");
>      >>>                                    break;
>      >>>                            }
>      >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>      >>> group->buf_cache_count, group->buf_cache_size);
>      >>>                            STAILQ_INSERT_HEAD(&group->buf_cache,
>      >>> buf, link);
>      >>>                            group->buf_cache_count++;
>      >>>                    }
>      >>>
>      >>>>
>      >>>> Can you run your target with the -L rdma option to get a dump of
>      >>>> the memory regions registered with the NIC?
>      >>> Let me test and get back to you soon.
>      >>>
>      >>>>
>      >>>> We made a couple of changes to this code when dynamic memory
>      >>>> allocations were added to DPDK. There were some safeguards
>     that we
>      >>>> added to try and make sure this case wouldn't hit, so I'd like to
>      >>>> make sure you are running on the latest DPDK submodule as well as
>      >>>> the latest SPDK to narrow down where we need to look.
>      >>> Unfortunately, I can't easily update DPDK because another team
>      >>> maintains it internally. But if it can be reproduced and fixed in the
>     latest,
>      >>> I will try to pull in the fix.
>      >>>
>      >>>>
>      >>>> Thanks,
>      >>>>
>      >>>> Seth
>      >>>>
>      >>>> -----Original Message-----
>      >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>     <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of JD Zheng
>      >>>> via SPDK
>      >>>> Sent: Wednesday, July 31, 2019 3:00 AM
>      >>>> To: spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>
>      >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>>
>      >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>      >>>> RDMA Memory Regions
>      >>>>
>      >>>> Hello,
>      >>>>
>      >>>> When I run nvmf_tgt over RDMA using the latest SPDK code, I
>      >>>> occasionally run into this error:
>      >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split
>      >>>> over multiple RDMA Memory Regions"
>      >>>>
>      >>>> After digging into the code, I found that
>     nvmf_rdma_fill_buffers()
>      >>>> calls spdk_mem_map_translate() to check if a data buffer sits on 2
>      >>>> 2MB pages, and if that is the case, it reports this error.
>      >>>>
>      >>>> The following commit added a change to use the data buffer start
>     address
>      >>>> to calculate the size between the buffer start address and the 2MB
>     boundary.
>      >>>> The caller nvmf_rdma_fill_buffers() compares that size with the
>      >>>> IO Unit size (which is 8KB in my conf) to determine if the buffer
>      >>>> crosses a 2MB boundary.
>      >>>>
>      >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>      >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>     <mailto:dariusz.stojaczyk(a)intel.com>>
>      >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>      >>>>
>      >>>>         memory: fix contiguous memory calculation for unaligned
>      >>>> buffers
>      >>>>
>      >>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool,
>     a new
>      >>>> request uses a free buffer from that pool, and the buffer start
>      >>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
>      >>>> these buffers are neither 2MB aligned nor IOUnitSize aligned (8KB
>      >>>> in my
>      >>>> case); instead, they are 64-byte aligned, so some
>     buffers
>      >>>> will fail the check, which leads to this problem.
>      >>>>
>      >>>> The corresponding code snippet is as follows:
>      >>>> spdk_nvmf_transport_create()
>      >>>> {
>      >>>> ...
>      >>>>         transport->data_buf_pool = spdk_mempool_create(spdk_mempool_name,
>      >>>>                                    opts->num_shared_buffers,
>      >>>>                                    opts->io_unit_size + NVMF_DATA_BUFFER_ALIGNMENT,
>      >>>>                                    SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>      >>>>                                    SPDK_ENV_SOCKET_ID_ANY);
>      >>>> ...
>      >>>> }
>      >>>>
>      >>>> Also, some debug prints I added show the start addresses of the
>     buffers:
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019258800 0(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x2000192557c0 1(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019252780 2(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x20001924f740 3(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x20001924c700 4(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x2000192496c0 5(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019246680 6(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019243640 7(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019240600 8(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x20001923d5c0 9(32)
>      >>>> ...
>      >>>>
>      >>>> It looks like either the buffer allocation has an alignment issue or
>      >>>> the check is not correct.
>      >>>>
>      >>>> Please advise how to fix this problem.
>      >>>>
>      >>>> Thanks,
>      >>>> JD Zheng
>      >>>> _______________________________________________
>      >>>> SPDK mailing list
>      >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>      >>>> https://lists.01.org/mailman/listinfo/spdk
>      >>>>
>      >> _______________________________________________
>      >> SPDK mailing list
>      >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>      >> https://lists.01.org/mailman/listinfo/spdk
>      >>
> 


* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-21 13:15 Sasha Kotchubievsky
  0 siblings, 0 replies; 20+ messages in thread
From: Sasha Kotchubievsky @ 2019-08-21 13:15 UTC (permalink / raw)
  To: spdk


Seth,

Thanks for the update!

Sasha

On 20-Aug-19 5:15 PM, Howell, Seth wrote:
> Hi,
>
>> I think dynamic memory allocation doesn't really work for the RDMA case.
> Dynamic memory allocation does work for the RDMA case on the latest master. That is specifically why we use the match-allocations flag in DPDK. The Broadcom case is distinct from stock SPDK in that they are using an older version of DPDK than the submodule which doesn't support this flag and has to use mitigations such as the one you mentioned below to attempt to work around the problems we faced before DPDK was updated.
>
>> Commit 9cec99b84b9a08e9122ada4f4455172e40ff6c06 already removes memory "free" for dynamically allocated memory.
> True, but that is a mitigation for DPDK submodules between 18.05 and 19.02 which don't support the match-allocations flag. If you look at the preprocessor directives around this flag on master, it is only applicable if the RTE_VERSION is >= 18.05 and < 19.02.
>
> Anyone using the stock SPDK with the DPDK submodule should be able to rely on DPDK dynamic allocations with the RDMA case.
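A rough sketch of the version guards being described, paraphrasing the SPDK env_dpdk initialization (helper names like push_arg/_sprintf_alloc are approximations of internal helpers, not a public API):

    #if RTE_VERSION >= RTE_VERSION_NUM(19, 2, 0, 0)
            /* DPDK guarantees memory is freed back in the same granularity it was
             * allocated, so a mempool object can never straddle two memory events. */
            args = push_arg(args, &argcount, _sprintf_alloc("--match-allocations"));
    #elif RTE_VERSION >= RTE_VERSION_NUM(18, 5, 0, 0)
            /* 18.05 <= DPDK < 19.02: the mitigation from commit 9cec99b8 applies
             * instead -- dynamically allocated memory is simply never released. */
    #endif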
>
> Thanks,
>
> Seth
>
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha Kotchubievsky
> Sent: Tuesday, August 20, 2019 5:22 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
>
> Hi,
>
> I think dynamic memory allocation doesn't really work for the RDMA case.
>
> Commit 9cec99b84b9a08e9122ada4f4455172e40ff6c06 already removes memory "free" for dynamically allocated memory.
>
> I'd suggest pre-allocating enough memory for the nvmf target using the "-s" option.
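For example, something along these lines (illustrative size only; -s takes MB):

    ./app/nvmf_tgt/nvmf_tgt -s 4096 -m 0xff -c 16disk_1ns.conf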
>
>
> Best regards
>
> Sasha
>
> On 20-Aug-19 12:42 AM, JD Zheng via SPDK wrote:
>> Hi Seth,
>>
>> It sometimes triggered a seg fault, but I couldn't get a backtrace due to a
>> likely corrupted stack. With my workaround, this is no longer seen.
>>
>> Let me submit my change as an RFC to Gerrit. It probably isn't necessary
>> to upstream it, as DPDK 19.02 should fix this problem properly.
>>
>> Thanks,
>> JD
>>
>>>>        >>>>
>>>>        >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>>>        >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>>>>       <mailto:dariusz.stojaczyk(a)intel.com>>
>>>>        >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>>        >>>>
>>>>        >>>>         memory: fix contiguous memory calculation for
>>>> unaligned
>>>>        >>>> buffers
>>>>        >>>>
>>>>        >>>> In nvmf_tgt, the buffers are pre-allocated as a memory
>>>> pool
>>>>       and new
>>>>        >>>> request will use free buffer from that pool and the
>>>> buffer start
>>>>        >>>> address is passed to nvmf_rdma_fill_buffers(). But I
>>>> found that
>>>>        >>>> these buffers are not 2MB aligned and not IOUnitSize
>>>> aligned (8KB
>>>>        >>>> in my
>>>>        >>>> case) either, instead, they are 64Byte aligned so that
>>>> some
>>>>       buffers
>>>>        >>>> will fail the checking and leads to this problem.
>>>>        >>>>
>>>>        >>>> The corresponding code snippets are as following:
>>>>        >>>> spdk_nvmf_transport_create()
>>>>        >>>> {
>>>>        >>>> ...
>>>>        >>>>         transport->data_buf_pool =
>>>>        >>>> pdk_mempool_create(spdk_mempool_name,
>>>>        >>>>  opts->num_shared_buffers,
>>>>        >>>>  opts->io_unit_size +
>>>>        >>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>>        >>>>
>>>>         SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>>        >>>>  SPDK_ENV_SOCKET_ID_ANY); ...
>>>>        >>>> }
>>>>        >>>>
>>>>        >>>> Also some debug print I added shows the start address of
>>>> the
>>>>       buffers:
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x200019258800 0(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x2000192557c0 1(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x200019252780 2(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x20001924f740 3(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x20001924c700 4(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x2000192496c0 5(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x200019246680 6(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x200019243640 7(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x200019240600 8(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x20001923d5c0 9(32)
>>>>        >>>> ...
>>>>        >>>>
>>>>        >>>> It looks like either the buffer allocation has alignment
>>>> issue or
>>>>        >>>> the checking is not correct.
>>>>        >>>>
>>>>        >>>> Please advice how to fix this problem.
>>>>        >>>>
>>>>        >>>> Thanks,
>>>>        >>>> JD Zheng
>>>>        >>>> _______________________________________________
>>>>        >>>> SPDK mailing list
>>>>        >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>>>        >>>> https://lists.01.org/mailman/listinfo/spdk
>>>>        >>>>
>>>>        >> _______________________________________________
>>>>        >> SPDK mailing list
>>>>        >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>>>        >> https://lists.01.org/mailman/listinfo/spdk
>>>>        >>
>>>>
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-20 14:39 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-20 14:39 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 32105 bytes --]

Hmm. OK. Would you be willing to share the test script you are running to get the segfault? I would really like to get to the bottom of this. It's one thing if we are just tossing an error message from time to time, but if running SPDK against an older DPDK submodule can cause it to brick, I think we need to fully squash this.

-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com] 
Sent: Monday, August 19, 2019 2:42 PM
To: Howell, Seth <seth.howell(a)intel.com>
Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan Richardson <jonathan.richardson(a)broadcom.com>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

It sometimes triggered a segfault, but I couldn't get a backtrace because the stack was likely corrupted. With my workaround, this is no longer seen.

Let me submit my change to Gerrit as an RFC. It probably isn't necessary to upstream it, since DPDK 19.02 should fix this problem properly.
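
Roughly, the workaround looks like the sketch below. This is only an
illustration of the idea, not the actual change I will push to Gerrit; the
helper name and the hard-coded 4 KiB / 2 MiB constants are assumptions from
my setup, and it conservatively drops every buffer whose I/O unit would
cross a 2 MiB boundary, which is stricter than the real MR check:

#include <stdint.h>
#include <stdlib.h>
#include "spdk/env.h"

/* Drain the transport's data_buf_pool once and simply never return the
 * buffers whose (4 KiB aligned) I/O region would cross a 2 MiB boundary.
 * They stay "allocated" in the mempool, so the fill-buffers path can
 * never pick them up again. */
static void
quarantine_split_buffers(struct spdk_mempool *pool, size_t io_unit_size)
{
        void **good = calloc(spdk_mempool_count(pool), sizeof(void *));
        size_t num_good = 0;
        void *buf;

        if (good == NULL) {
                return;
        }

        while ((buf = spdk_mempool_get(pool)) != NULL) {
                /* Same 4 KiB alignment the transport applies to iov_base. */
                uintptr_t iov_base = ((uintptr_t)buf + 0xFFF) & ~(uintptr_t)0xFFF;
                uintptr_t first_2mb = iov_base & ~(uintptr_t)0x1FFFFF;
                uintptr_t last_2mb = (iov_base + io_unit_size - 1) & ~(uintptr_t)0x1FFFFF;

                if (first_2mb == last_2mb) {
                        good[num_good++] = buf;
                } /* else: intentionally leave it "allocated" in the mempool. */
        }

        while (num_good > 0) {
                spdk_mempool_put(pool, good[--num_good]);
        }
        free(good);
}

My actual hack does this lazily, when a split buffer is first seen before
reaching fill_buffers(), but the effect is the same: the 2-3 split buffers
we see overnight become a small, bounded leak instead of a recurring error.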

Thanks,
JD

On 8/19/19 2:16 PM, Howell, Seth wrote:
> Hi JD,
> 
> What issue specifically did you see? If there is something measurable happening (other than the error message) then I think it should be high priority to get a more permanent workaround into upstream SPDK.
> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Monday, August 19, 2019 2:03 PM
> To: Howell, Seth <seth.howell(a)intel.com>
> Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan 
> Richardson <jonathan.richardson(a)broadcom.com>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
> RDMA Memory Regions
> 
> Hi Seth,
> 
>   > Unfortunately, the only way
>   > to protect fully against this happening is by using the DPDK flag  > --match-allocations which was introduced in DPDK 19.02.
> 
> Then I need to use DPDK 19.02. Do I need to enable this flag explicitly when moving DPDK 19.02?
> 
>   > The good news is that the SPDK target will skip these buffers without  > bricking, causing data corruption or doing any otherwise bad things.
> 
> Unfortunately this is not what I saw. It appeared that SPDK gave up this split buffer, but it still causes issue, maybe because it was tried too many times(?).
> 
> Currently, I have to use DPDK 18.11 so that I added a couple of workarounds to prevent the split buffer from being used before reaching fill_buffers(). I did a little trick there to call spdk_mempool_get() but not mempool_put later, so that this buffer is set as "allocated" in mempool and will not be tried again and again. It does look like small memory leak though. We can usually see 2-3 split buffers during overnight run, btw.
> 
> This seems working OK.
> 
> For sure, I will measure performance later.
> 
> Thanks,
> JD
> 
> 
> On 8/19/19 1:12 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> Thanks for performing that experiment. With this new information, I 
>> think we can be pretty much 100% sure that the problem is related to 
>> the mempool being split over two DPDK memzones. Unfortunately, the 
>> only way to protect fully against this happening is by using the DPDK 
>> flag --match-allocations which was introduced in DPDK 19.02. Jim 
>> helped advocate for this flag specifically because of this problem 
>> with mempools and RPMA.
>>
>> The good news is that the SPDK target will skip these buffers without 
>> bricking, causing data corruption or doing any otherwise bad things.
>> What ends up happening is that the nvmf_rdma_fill_buffers function 
>> will print the error message and then return NULL which will trigger 
>> the target to retry the I/O again. By that time, there will be 
>> another buffer there for the request to use and it won’t fail the 
>> second time around. So the code currently handles the problem in a 
>> technically correct way i.e. It’s not going to brick the target or 
>> initiator by trying to use a buffer that spans multiple Memory 
>> Regions. Instead, it properly recognizes that it is trying to use a 
>> bad buffer and reschedules the request buffer parsing.
>>
>> However, I am a little worried over the fact that these buffers 
>> remain in the mempool and can be repeatedly used by the application. 
>> I can picture a scenario where this could possibly have a  performance impact.
>> Take for example a mempool with 128 entries in it in which one of 
>> them is split over a memzone. Since this split buffer will never find 
>> its way into a request, it’s possible that this split buffer gets 
>> pulled up into requests more often than other buffers and 
>> subsequently fails in nvmf_rdma_fill_buffers causing requests to have 
>> to be rescheduled to the next time the poller runs. Depending on how 
>> frequently this happens, the performance impact *could possibly* add up.
>>
>> I have as yet been unable to replicate the split buffer error. One 
>> thing you could try to see if there is any measurable performance 
>> impact is try starting the NVMe-oF target with DPDK legacy memory 
>> mode which will move all memory allocations to startup and prevent 
>> you from splitting buffers. Then run a benchmark with a lot of 
>> connections at high queue depth and see what the performance looks 
>> like compared to the dynamic memory model. If there is a significant 
>> performance impact, we may have to modify how we handle this error case.
>>
>> Thanks,
>>
>> Seth
>>
>> *From:*JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> *Sent:* Monday, August 12, 2019 4:17 PM
>> *To:* Howell, Seth <seth.howell(a)intel.com>
>> *Cc:* Storage Performance Development Kit <spdk(a)lists.01.org>; 
>> Jonathan Richardson <jonathan.richardson(a)broadcom.com>
>> *Subject:* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over 
>> multiple RDMA Memory Regions
>>
>> + Jonanthan
>>
>> Hi Seth,
>>
>> We finally got chance to test with more logs enabled. You are correct 
>> that that problematic buffer does sit on 2 registered memory regions:
>>
>> The problematic buffer is "Buffer address:*200019bfeb00", *actual 
>> used buffer pointer is "*200019bff000*" (SPDK makes it 4KiB aligned), 
>> size is
>> 8KiB(0x2000) so it does sit on 2 registered memory region.
>>
>> However, looks like SPDK/DPDK allocates buffers starting from end of 
>> a region and going up, but due to the extra room and alignment of 
>> each buffer and there is chance that one buffer can exceed memory 
>> region boundary?
>>
>> In this case, the buffers are between 0x200019997800 and 
>> 0x200019c5320 so that last buffer exceeds one region and goes to next one.
>>
>> Some logs for your information:
>>
>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>> 200019800000, memory region length: 400000
>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>> 200019c00000, memory region length: 400000
>>
>> ...
>>
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019bfeb00 27(32)
>>
>> ...
>>
>> rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer 
>> address:**200019bfeb00**iov_base address 200019bff000
>>
>> Thanks,
>>
>> JD
>>
>> On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com 
>> <mailto:seth.howell(a)intel.com>> wrote:
>>
>>      There are two different assignments that you need to look at. I'll
>>      detail the cases below based on line numbers from the latest master.
>>
>>      Memory.c:656 *size = spdk_min(*size, cur_size):
>>               This assignment is inside of the conditional "if(size ==
>>      NULL || map->ops.are_contiguous == NULL)"
>>               So in other words, at the offset, we figure out how much
>>      space we have left in the current translation. Then, if there is no
>>      callback to tell us whether the next translation will be contiguous
>>      to this one, we fill the size variable with the remaining length of
>>      that 2 MiB buffer.
>>
>>      Memory.c:682 *size = spdk_min(*size, cur_size):
>>               This assignment comes after the while loop guarded by the
>>      condition "while (cur_size < *size)". This while loop assumes that
>>      we have supplied some desired length for our buffer. This is true in
>>      the RDMA case. Now this while loop will only break on two
>>      conditions. 1. Cur_size becomes larger than *size, or the
>>      are_contiguous function returns false, meaning that the two
>>      translations cannot be considered together. In the case of the RDMA
>>      memory map, the only time are_contiguous returns false is when the
>>      two memory regions correspond to two distinct RDMA MRs. Notice that
>>      in this case - the one where are_contiguous is defined and we
>>      supplied a size variable - the *size variable is not overwritten
>>      with cur_size until 1. cur_size is >= *size or 2. The are_contiguous
>>      check fails.
>>
>>      In the second case detailed above, you can see how one could  pass
>>      in a buffer that spanned a 2 MiB page and still get a translation
>>      value equal to the size of the buffer. This second case is the one
>>      that the rdma.c code should be using since we have a registered
>>      are_contiguous function with the NIC and we have supplied a size
>>      pointer filled with the length of our buffer.
>>
>>      -----Original Message-----
>>      From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>      Sent: Thursday, August 1, 2019 2:01 PM
>>      To: Howell, Seth <seth.howell(a)intel.com
>>      <mailto:seth.howell(a)intel.com>>; Storage Performance Development Kit
>>      <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>      Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple RDMA Memory Regions
>>
>>      Hi Seth,
>>
>>        > Just because a buffer extends past a 2 MiB boundary doesn't mean
>>      that it exists in two different Memory Regions. It also won't fail
>>      the translation for being over two memory regions.
>>
>>      This makes sense. However, spdk_mem_map_translate() does following
>>      to calculate translation_len:
>>
>>      cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
>>      *size = spdk_min(*size, cur_size); // *size is the translation_len
>>      from caller nvmf_rdma_fill_buffers()
>>
>>      In nvmf_rdma_fill_buffers(),
>>
>>      if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>>                               SPDK_ERRLOG("Data buffer split over
>>      multiple RDMA Memory Regions\n");
>>                               return -EINVAL;
>>                       }
>>
>>      This just checks if buffer sits on 2 2MB pages, not about 2 RDMA
>>      memory regions. Is my understanding correct?
>>
>>      I still need some time to test. I will update you the result with -s
>>      as well.
>>
>>      Thanks,
>>      JD
>>
>>
>>      On 8/1/19 1:28 PM, Howell, Seth wrote:
>>       > Hi JD,
>>       >
>>       > The 2 MiB check is just because we always do memory registrations
>>      at at least 2 MiB granularity (the minimum hugepage size). Just
>>      because a buffer extends past a 2 MiB boundary doesn't mean that it
>>      exists in two different Memory Regions. It also won't fail the
>>      translation for being over two memory regions.
>>       >
>>       > If you look at the definition of spdk_mem_map_translate we call
>>      map->ops->are_contiguous every time we cross a 2 MiB boundary. For
>>      RDMA, this function is registered to
>>      spdk_nvmf_rdma_check_contiguous_entries. IF this function returns
>>      true, then even if the buffer crosses a 2 MiB boundary, the
>>      translation will still be valid.
>>       > The problem you are running into is not related to the buffer
>>      alignment, it is related to the fact that the two pages across which
>>      the buffer is split are registered to two different MRs in the NIC.
>>      This can only happen if those two pages are allocated independently
>>      and trigger two distinct memory event callbacks.
>>       >
>>       > That is why I am so interested in seeing the results from the
>>      noticelog above ibv_reg_mr. It will tell me how your target
>>      application is allocating memory. Also, when you start the SPDK
>>      target, are you using the -s option? Something like
>>      ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know
>>      if it'll make a difference, it's more of a curiosity thing for me)?
>>       >
>>       > Thanks,
>>       >
>>       > Seth
>>       >
>>       > -----Original Message-----
>>       > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>       > Sent: Thursday, August 1, 2019 11:24 AM
>>       > To: Howell, Seth <seth.howell(a)intel.com
>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>       > Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>       > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple
>>       > RDMA Memory Regions
>>       >
>>       > Hi Seth,
>>       >
>>       > Thanks for the detailed description, now I understand the reason
>>      behind the checking. But I have a question, why checking against
>>      2MiB? Is it because DPDK uses 2MiB page size by default so that one
>>      RDMA memory region should not cross 2 pages?
>>       >
>>       >   > Once I see what your memory registrations look like and what
>>      addresses you're failing on, it will help me understand what is
>>      going on better.
>>       >
>>       > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7
>>      +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport
>>      *rtransport,
>>       >                   remaining_length -=
>>       > rdma_req->req.iov[iovcnt].iov_len;
>>       >
>>       >                   if (translation_len <
>>      rdma_req->req.iov[iovcnt].iov_len) {
>>       > -                       SPDK_ERRLOG("Data buffer split over multiple
>>       > RDMA Memory Regions\n");
>>       > +                       SPDK_ERRLOG("Data buffer split over multiple
>>       > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
>>      rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
>>      translation_len, rdma_req->req.iov[iovcnt].iov_len);
>>       >                           return -EINVAL;
>>       >                   }
>>       >
>>       > With this I can see which buffer failed the checking.
>>       > For example, when SPKD initializes the memory pool, one of the
>>      buffers starts with 0x2000193feb00, and when failed, I got following:
>>       >
>>       > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
>>       > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376)
>>      (8192)
>>       >
>>       > This buffer has 5376B on one 2MB page and the rest of it
>>       > (8192-5376=2816B) is on another page.
>>       >
>>       > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to
>>      use iov base should make it better as iov base is 4KiB aligned. In
>>      above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 and
>>      it should pass the checking.
>>       > However, another buffer in the pool is 0x2000192010c0 and
>>      iov_base is 0x200019201000, which would fail the checking because it
>>      is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
>>       >
>>       > I will add the change from
>>       > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the
>>      test to get more information.
>>       >
>>       > I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff
>>       > -j
>>       > 0x90000000:0x20000000 -c 16disk_1ns.conf"
>>       >
>>       > Thanks,
>>       > JD
>>       >
>>       >
>>       > On 8/1/19 7:52 AM, Howell, Seth wrote:
>>       >> Hi JD,
>>       >>
>>       >> I was doing a little bit of digging in the dpdk documentation
>>      around this process, and I have a little bit more information. We
>>      were pretty worried about the whole dynamic memory allocations thing
>>      a few releases ago, so Jim helped add a flag into DPDK that
>>      prevented allocations from being allocated and freed in different
>>      granularities. This flag also prevents malloc heap allocations from
>>      spanning multiple memory events. However, this flag didn't make it
>>      into DPDK until 19.02 (More documentation at
>>      https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer
>>      if you're interested). We have some code in the SPDK environment
>>      layer that tries to deal with that (see
>>      lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that
>>      function is entirely capable of handling the heap allocations
>>      spanning multiple memory events part of the problem.
>>       >> Since you are using dpdk 18.11, the memory callback inside of
>>      lib/env_dpdk looks like a good candidate for our issue. My best
>>      guess is that somehow a heap allocation from the buffer mempool is
>>      hitting across addresses from two dynamic memory allocation events.
>>      I'd still appreciate it if you could send me the information in my
>>      last e-mail, but I think we're onto something here.
>>       >>
>>       >> Thanks,
>>       >>
>>       >> Seth
>>       >>
>>       >> -----Original Message-----
>>       >> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of Howell,
>>       >> Seth
>>       >> Sent: Thursday, August 1, 2019 5:26 AM
>>       >> To: JD Zheng <jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>>; Storage Performance
>>       >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple
>>       >> RDMA Memory Regions
>>       >>
>>       >> Hi JD,
>>       >>
>>       >> Thanks for doing that. Yeah, I am mainly looking to see how the
>>      mempool addresses are mapped into the NIC with ibv_reg_mr.
>>       >>
>>       >> I think it's odd that we are using the buffer base for the memory
>>       >> check, we should be using the iov base, but I don't believe that
>>       >> would cause the issue you are seeing. Pushed a change to modify
>>      that
>>       >> behavior anyways though:
>>       >> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>       >>
>>       >> There was one registration that I wasn't able to catch from your
>>      last log. Sorry about that, I forgot there wasn’t a debug log for
>>      it. Can you try it again with this change which adds noticelogs for
>>      the relevant registrations.
>>      https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able
>>      to run your test without the -Lrdma argument this time to avoid the
>>      extra bloat in the logs.
>>       >>
>>       >> The underlying assumption of the code is that any given object
>>      is not going to cross a dynamic memory allocation from DPDK. For a
>>      little background, when the mempool gets created, the dpdk code
>>      allocates some number of memzones to accommodate those buffer
>>      objects. Then it passes those memzones down one at a time and places
>>      objects inside the mempool from the given memzone until the memzone
>>      is exhausted. Then it goes back and grabs another memzone. This
>>      process continues until all objects are accounted for.
>>       >> This only works if each memzone corresponds to a single memory
>>      event when using dynamic memory allocation. My understanding was
>>      that this was always the case, but this error makes me think that
>>      it's possible that that's not true.
>>       >>
>>       >> Once I see what your memory registrations look like and what
>>      addresses you're failing on, it will help me understand what is
>>      going on better.
>>       >>
>>       >> Can you also provide the command line you are using to start the
>>      nvmf_tgt application and attach your configuration file?
>>       >>
>>       >> Thanks,
>>       >>
>>       >> Seth
>>       >> -----Original Message-----
>>       >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>       >> Sent: Wednesday, July 31, 2019 3:13 PM
>>       >> To: Howell, Seth <seth.howell(a)intel.com
>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>       >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple
>>       >> RDMA Memory Regions
>>       >>
>>       >> Hi Seth,
>>       >>
>>       >> After I enabled debug and ran nvmf_tgt with -L rdma, I got some
>>      logs like:
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x2000084bf000 Length: 40000 LKey: e601
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x200008621000 Length: 10000 LKey: e701
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x200018600000 Length: 1000000 LKey: e801
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x20000847e000 Length: 40000 LKey: e701
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x20000846d000 Length: 10000 LKey: e801
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x200019800000 Length: 1000000 LKey: e901
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x200016ebb000 Length: 40000 LKey: e801
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x20000845c000 Length: 10000 LKey: e901
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x20001aa00000 Length: 1000000 LKey: ea01
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x200016e7a000 Length: 40000 LKey: e901
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x20000844b000 Length: 10000 LKey: ea01
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>       >>
>>       >> Is this you are look for as memory regions registered for NIC?
>>       >>
>>       >> I attached the complete log.
>>       >>
>>       >> Thanks,
>>       >> JD
>>       >>
>>       >> On 7/30/19 5:28 PM, JD Zheng wrote:
>>       >>> Hi Seth,
>>       >>>
>>       >>> Thanks for the prompt reply!
>>       >>>
>>       >>> Please find answers inline.
>>       >>>
>>       >>> JD
>>       >>>
>>       >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>       >>>> Hi JD,
>>       >>>>
>>       >>>> Thanks for the report. I want to ask a few questions to start
>>       >>>> getting to the bottom of this. Since this issue doesn't currently
>>       >>>> reproduce on our per-patch or nightly tests, I would like to
>>       >>>> understand what's unique about your setup so that we can
>>      replicate
>>       >>>> it in a per patch test to prevent future regressions.
>>       >>> I am running it on aarch64 platform. I tried x86 platform and I
>>      can
>>       >>> see same buffer alignment in memory pool but can't run the real
>>      test
>>       >>> to reproduce it due to other missing pieces.
>>       >>>
>>       >>>>
>>       >>>> What options are you passing when you create the rdma transport?
>>       >>>> Are you creating it over RPC or in a configuration file?
>>       >>> I am using conf file. Pls let me know if you'd like to look
>>      into conf file.
>>       >>>
>>       >>>>
>>       >>>> Are you using the current DPDK submodule as your environment
>>       >>>> abstraction layer?
>>       >>> No. Our project uses specific version of DPDK, which is v18.11. I
>>       >>> did quick test using latest and DPDK submodule on x86, and the
>>       >>> buffer alignment is the same, i.e. 64B aligned.
>>       >>>
>>       >>>>
>>       >>>> I notice that your error log is printing from
>>       >>>> spdk_nvmf_transport_poll_group_create, which value exactly are
>>      you
>>       >>>> printing out?
>>       >>> Here is patch to add dbg print. Pls note that SPDK version is
>>      v19.04
>>       >>>
>>       >>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>       >>>                                    SPDK_NOTICELOG("Unable to
>>      reserve
>>       >>> the full number of buffers for the pg buffer cache.\n");
>>       >>>                                    break;
>>       >>>                            }
>>       >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>       >>> group->buf_cache_count, group->buf_cache_size);
>>       >>>                            STAILQ_INSERT_HEAD(&group->buf_cache,
>>       >>> buf, link);
>>       >>>                            group->buf_cache_count++;
>>       >>>                    }
>>       >>>
>>       >>>>
>>       >>>> Can you run your target with the -L rdma option to get a dump of
>>       >>>> the memory regions registered with the NIC?
>>       >>> Let me test and get back to you soon.
>>       >>>
>>       >>>>
>>       >>>> We made a couple of changes to this code when dynamic memory
>>       >>>> allocations were added to DPDK. There were some safeguards
>>      that we
>>       >>>> added to try and make sure this case wouldn't hit, so I'd like to
>>       >>>> make sure you are running on the latest DPDK submodule as well as
>>       >>>> the latest SPDK to narrow down where we need to look.
>>       >>> Unfortunately I can't easily update DPDK because other team
>>       >>> maintains it internally. But if it can be repro and fixed in
>>      latest,
>>       >>> I will try to pull in the fix.
>>       >>>
>>       >>>>
>>       >>>> Thanks,
>>       >>>>
>>       >>>> Seth
>>       >>>>
>>       >>>> -----Original Message-----
>>       >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of JD Zheng
>>       >>>> via SPDK
>>       >>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>       >>>> To: spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>
>>       >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>>
>>       >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>>       >>>> RDMA Memory Regions
>>       >>>>
>>       >>>> Hello,
>>       >>>>
>>       >>>> When I run nvmf_tgt over RDMA using latest SPDK code, I
>>       >>>> occasionally ran into this errors:
>>       >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split
>>       >>>> over multiple RDMA Memory Regions"
>>       >>>>
>>       >>>> After digging into the code, I found that
>>      nvmf_rdma_fill_buffers()
>>       >>>> calls spdk_mem_map_translate() to check if a data buffer sit on 2
>>       >>>> 2MB pages, and if it is the case, it reports this error.
>>       >>>>
>>       >>>> The following commit added change to use data buffer start
>>      address
>>       >>>> to calculate the size between buffer start address and 2MB
>>      boundary.
>>       >>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with
>>       >>>> IO Unit size (which is 8KB in my conf) to determine if the buffer
>>       >>>> passes 2MB boundary.
>>       >>>>
>>       >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>       >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>>      <mailto:dariusz.stojaczyk(a)intel.com>>
>>       >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>       >>>>
>>       >>>>         memory: fix contiguous memory calculation for unaligned
>>       >>>> buffers
>>       >>>>
>>       >>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool
>>      and new
>>       >>>> request will use free buffer from that pool and the buffer start
>>       >>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
>>       >>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB
>>       >>>> in my
>>       >>>> case) either, instead, they are 64Byte aligned so that some
>>      buffers
>>       >>>> will fail the checking and leads to this problem.
>>       >>>>
>>       >>>> The corresponding code snippets are as following:
>>       >>>> spdk_nvmf_transport_create()
>>       >>>> {
>>       >>>> ...
>>       >>>>         transport->data_buf_pool =
>>       >>>> pdk_mempool_create(spdk_mempool_name,
>>       >>>>                                    opts->num_shared_buffers,
>>       >>>>                                    opts->io_unit_size +
>>       >>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>       >>>>
>>        SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>       >>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
>>       >>>> }
>>       >>>>
>>       >>>> Also some debug print I added shows the start address of the
>>      buffers:
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019258800 0(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x2000192557c0 1(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019252780 2(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x20001924f740 3(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x20001924c700 4(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x2000192496c0 5(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019246680 6(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019243640 7(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019240600 8(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x20001923d5c0 9(32)
>>       >>>> ...
>>       >>>>
>>       >>>> It looks like either the buffer allocation has alignment issue or
>>       >>>> the checking is not correct.
>>       >>>>
>>       >>>> Please advice how to fix this problem.
>>       >>>>
>>       >>>> Thanks,
>>       >>>> JD Zheng
>>       >>>> _______________________________________________
>>       >>>> SPDK mailing list
>>       >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>       >>>> https://lists.01.org/mailman/listinfo/spdk
>>       >>>>
>>       >> _______________________________________________
>>       >> SPDK mailing list
>>       >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>       >> https://lists.01.org/mailman/listinfo/spdk
>>       >>
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-20 14:15 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-20 14:15 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 36827 bytes --]

Hi,

> I think dynamic memory allocation doesn't really work for the RDMA case.

Dynamic memory allocation does work for the RDMA case on the latest master; that is specifically why we use the match-allocations flag in DPDK. The Broadcom case is distinct from stock SPDK in that they are pinned to a DPDK version older than the submodule, which doesn't support this flag, so they have to use mitigations such as the one you mention below to work around the problems we faced before DPDK was updated.

> Commit 9cec99b84b9a08e9122ada4f4455172e40ff6c06 already removes memory "free" for dynamically allocated memory.

True, but that is a mitigation for DPDK versions between 18.05 and 19.02, which don't support the match-allocations flag. If you look at the preprocessor directives around that mitigation on master, it only applies when RTE_VERSION is >= 18.05 and < 19.02.

Anyone using stock SPDK with the DPDK submodule should be able to rely on DPDK dynamic allocations in the RDMA case.
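
To make the two regimes concrete, here is roughly what that version gating
looks like from an application's point of view. This is only an
illustration; the real logic lives in lib/env_dpdk, and the helper below is
made up:

#include <rte_eal.h>
#include <rte_version.h>

static int
init_eal_for_rdma(void)
{
        char *eal_argv[] = {
                (char *)"nvmf_tgt",
#if RTE_VERSION >= RTE_VERSION_NUM(19, 2, 0, 0)
                /* 19.02+: memory is released back to the OS only in the same
                 * granularity it was allocated in, so a mempool object can
                 * never straddle two memory events (and therefore two MRs). */
                (char *)"--match-allocations",
#endif
        };
        int eal_argc = (int)(sizeof(eal_argv) / sizeof(eal_argv[0]));

#if RTE_VERSION >= RTE_VERSION_NUM(18, 5, 0, 0) && RTE_VERSION < RTE_VERSION_NUM(19, 2, 0, 0)
        /* No --match-allocations here: the fallback is the mitigation above,
         * i.e. never free dynamically allocated hugepage memory, which
         * narrows but does not fully close the split-buffer window. */
#endif

        return rte_eal_init(eal_argc, eal_argv);
}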

Thanks,

Seth


-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha Kotchubievsky
Sent: Tuesday, August 20, 2019 5:22 AM
To: spdk(a)lists.01.org
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi,

I think dynamic memory allocation doesn't really work for the RDMA case.

Commit 9cec99b84b9a08e9122ada4f4455172e40ff6c06 already removes memory "free" for dynamically allocated memory.

I'd suggest pre-allocating enough memory for the nvmf target using the "-s" option.
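
For example, something like (the 4096 MB below is just a placeholder; it
has to be large enough to cover the shared buffer pool plus all
per-connection resources, otherwise allocations still fall back to dynamic
hugepage requests at runtime):

./app/nvmf_tgt/nvmf_tgt -s 4096 -c 16disk_1ns.conf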


Best regards

Sasha

On 20-Aug-19 12:42 AM, JD Zheng via SPDK wrote:
> Hi Seth,
>
> It sometimes triggered a segfault, but I couldn't get a backtrace because 
> the stack was likely corrupted. With my workaround, this is no longer seen.
>
> Let me submit my change to Gerrit as an RFC. It probably isn't necessary 
> to upstream it, since DPDK 19.02 should fix this problem properly.
>
> Thanks,
> JD
>
> On 8/19/19 2:16 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> What issue specifically did you see? If there is something measurable 
>> happening (other than the error message) then I think it should be 
>> high priority to get a more permanent workaround into upstream SPDK.
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> Sent: Monday, August 19, 2019 2:03 PM
>> To: Howell, Seth <seth.howell(a)intel.com>
>> Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan 
>> Richardson <jonathan.richardson(a)broadcom.com>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>> RDMA Memory Regions
>>
>> Hi Seth,
>>
>>   > Unfortunately, the only way
>>   > to protect fully against this happening is by using the DPDK flag  
>> > --match-allocations which was introduced in DPDK 19.02.
>>
>> Then I need to use DPDK 19.02. Do I need to enable this flag 
>> explicitly when moving DPDK 19.02?
>>
>>   > The good news is that the SPDK target will skip these buffers 
>> without  > bricking, causing data corruption or doing any otherwise 
>> bad things.
>>
>> Unfortunately this is not what I saw. It appeared that SPDK gave up 
>> this split buffer, but it still causes issue, maybe because it was 
>> tried too many times(?).
>>
>> Currently, I have to use DPDK 18.11 so that I added a couple of 
>> workarounds to prevent the split buffer from being used before 
>> reaching fill_buffers(). I did a little trick there to call
>> spdk_mempool_get() but not mempool_put later, so that this buffer is 
>> set as "allocated" in mempool and will not be tried again and again.
>> It does look like small memory leak though. We can usually see 2-3 
>> split buffers during overnight run, btw.
>>
>> This seems working OK.
>>
>> For sure, I will measure performance later.
>>
>> Thanks,
>> JD
>>
>>
>> On 8/19/19 1:12 PM, Howell, Seth wrote:
>>> Hi JD,
>>>
>>> Thanks for performing that experiment. With this new information, I 
>>> think we can be pretty much 100% sure that the problem is related to 
>>> the mempool being split over two DPDK memzones. Unfortunately, the 
>>> only way to protect fully against this happening is by using the 
>>> DPDK flag --match-allocations which was introduced in DPDK 19.02. 
>>> Jim helped advocate for this flag specifically because of this 
>>> problem with mempools and RPMA.
>>>
>>> The good news is that the SPDK target will skip these buffers 
>>> without bricking, causing data corruption or doing any otherwise bad things.
>>> What ends up happening is that the nvmf_rdma_fill_buffers function 
>>> will print the error message and then return NULL which will trigger 
>>> the target to retry the I/O again. By that time, there will be 
>>> another buffer there for the request to use and it won’t fail the 
>>> second time around. So the code currently handles the problem in a 
>>> technically correct way i.e. It’s not going to brick the target or 
>>> initiator by trying to use a buffer that spans multiple Memory 
>>> Regions. Instead, it properly recognizes that it is trying to use a 
>>> bad buffer and reschedules the request buffer parsing.
>>>
>>> However, I am a little worried over the fact that these buffers 
>>> remain in the mempool and can be repeatedly used by the application. 
>>> I can picture a scenario where this could possibly have a  
>>> performance impact.
>>> Take for example a mempool with 128 entries in it in which one of 
>>> them is split over a memzone. Since this split buffer will never 
>>> find its way into a request, it’s possible that this split buffer 
>>> gets pulled up into requests more often than other buffers and 
>>> subsequently fails in nvmf_rdma_fill_buffers causing requests to 
>>> have to be rescheduled to the next time the poller runs. Depending 
>>> on how frequently this happens, the performance impact *could possibly* add up.
>>>
>>> I have as yet been unable to replicate the split buffer error. One 
>>> thing you could try to see if there is any measurable performance 
>>> impact is try starting the NVMe-oF target with DPDK legacy memory 
>>> mode which will move all memory allocations to startup and prevent 
>>> you from splitting buffers. Then run a benchmark with a lot of 
>>> connections at high queue depth and see what the performance looks 
>>> like compared to the dynamic memory model. If there is a significant 
>>> performance impact, we may have to modify how we handle this error case.
>>>
>>> Thanks,
>>>
>>> Seth
>>>
>>> *From:*JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>>> *Sent:* Monday, August 12, 2019 4:17 PM
>>> *To:* Howell, Seth <seth.howell(a)intel.com>
>>> *Cc:* Storage Performance Development Kit <spdk(a)lists.01.org>; 
>>> Jonathan Richardson <jonathan.richardson(a)broadcom.com>
>>> *Subject:* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over 
>>> multiple RDMA Memory Regions
>>>
>>> + Jonanthan
>>>
>>> Hi Seth,
>>>
>>> We finally got chance to test with more logs enabled. You are 
>>> correct that that problematic buffer does sit on 2 registered memory regions:
>>>
>>> The problematic buffer is "Buffer address:*200019bfeb00", *actual 
>>> used buffer pointer is "*200019bff000*" (SPDK makes it 4KiB 
>>> aligned), size is
>>> 8KiB(0x2000) so it does sit on 2 registered memory region.
>>>
>>> However, looks like SPDK/DPDK allocates buffers starting from end of 
>>> a region and going up, but due to the extra room and alignment of 
>>> each buffer and there is chance that one buffer can exceed memory 
>>> region boundary?
>>>
>>> In this case, the buffers are between 0x200019997800 and 
>>> 0x200019c5320 so that last buffer exceeds one region and goes to next one.
>>>
>>> Some logs for your information:
>>>
>>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>>> 200019800000, memory region length: 400000
>>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>>> 200019c00000, memory region length: 400000
>>>
>>> ...
>>>
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019bfeb00 27(32)
>>>
>>> ...
>>>
>>> rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer 
>>> address:**200019bfeb00**iov_base address 200019bff000
>>>
>>> Thanks,
>>>
>>> JD
>>>
>>> On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com 
>>> <mailto:seth.howell(a)intel.com>> wrote:
>>>
>>>      There are two different assignments that you need to look at. 
>>> I'll
>>>      detail the cases below based on line numbers from the latest 
>>> master.
>>>
>>>      Memory.c:656 *size = spdk_min(*size, cur_size):
>>>               This assignment is inside of the conditional "if(size 
>>> ==
>>>      NULL || map->ops.are_contiguous == NULL)"
>>>               So in other words, at the offset, we figure out how 
>>> much
>>>      space we have left in the current translation. Then, if there 
>>> is no
>>>      callback to tell us whether the next translation will be 
>>> contiguous
>>>      to this one, we fill the size variable with the remaining 
>>> length of
>>>      that 2 MiB buffer.
>>>
>>>      Memory.c:682 *size = spdk_min(*size, cur_size):
>>>               This assignment comes after the while loop guarded by 
>>> the
>>>      condition "while (cur_size < *size)". This while loop assumes 
>>> that
>>>      we have supplied some desired length for our buffer. This is 
>>> true in
>>>      the RDMA case. Now this while loop will only break on two
>>>      conditions. 1. Cur_size becomes larger than *size, or the
>>>      are_contiguous function returns false, meaning that the two
>>>      translations cannot be considered together. In the case of the 
>>> RDMA
>>>      memory map, the only time are_contiguous returns false is when 
>>> the
>>>      two memory regions correspond to two distinct RDMA MRs. Notice 
>>> that
>>>      in this case - the one where are_contiguous is defined and we
>>>      supplied a size variable - the *size variable is not 
>>> overwritten
>>>      with cur_size until 1. cur_size is >= *size or 2. The 
>>> are_contiguous
>>>      check fails.
>>>
>>>      In the second case detailed above, you can see how one could  
>>> pass
>>>      in a buffer that spanned a 2 MiB page and still get a 
>>> translation
>>>      value equal to the size of the buffer. This second case is the 
>>> one
>>>      that the rdma.c code should be using since we have a registered
>>>      are_contiguous function with the NIC and we have supplied a 
>>> size
>>>      pointer filled with the length of our buffer.
>>>
>>>      -----Original Message-----
>>>      From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>>      Sent: Thursday, August 1, 2019 2:01 PM
>>>      To: Howell, Seth <seth.howell(a)intel.com
>>>      <mailto:seth.howell(a)intel.com>>; Storage Performance 
>>> Development Kit
>>>      <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>>      Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>>      multiple RDMA Memory Regions
>>>
>>>      Hi Seth,
>>>
>>>        > Just because a buffer extends past a 2 MiB boundary doesn't 
>>> mean
>>>      that it exists in two different Memory Regions. It also won't 
>>> fail
>>>      the translation for being over two memory regions.
>>>
>>>      This makes sense. However, spdk_mem_map_translate() does 
>>> following
>>>      to calculate translation_len:
>>>
>>>      cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
>>>      *size = spdk_min(*size, cur_size); // *size is the 
>>> translation_len
>>>      from caller nvmf_rdma_fill_buffers()
>>>
>>>      In nvmf_rdma_fill_buffers(),
>>>
>>>      if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>>>                               SPDK_ERRLOG("Data buffer split over
>>>      multiple RDMA Memory Regions\n");
>>>                               return -EINVAL;
>>>                       }
>>>
>>>      This just checks if buffer sits on 2 2MB pages, not about 2 
>>> RDMA
>>>      memory regions. Is my understanding correct?
>>>
>>>      I still need some time to test. I will update you the result 
>>> with -s
>>>      as well.
>>>
>>>      Thanks,
>>>      JD
>>>
>>>
>>>      On 8/1/19 1:28 PM, Howell, Seth wrote:
>>>       > Hi JD,
>>>       >
>>>       > The 2 MiB check is just because we always do memory 
>>> registrations
>>>      at at least 2 MiB granularity (the minimum hugepage size). Just
>>>      because a buffer extends past a 2 MiB boundary doesn't mean 
>>> that it
>>>      exists in two different Memory Regions. It also won't fail the
>>>      translation for being over two memory regions.
>>>       >
>>>       > If you look at the definition of spdk_mem_map_translate we 
>>> call
>>>      map->ops->are_contiguous every time we cross a 2 MiB boundary. 
>>> For
>>>      RDMA, this function is registered to
>>>      spdk_nvmf_rdma_check_contiguous_entries. IF this function 
>>> returns
>>>      true, then even if the buffer crosses a 2 MiB boundary, the
>>>      translation will still be valid.
>>>       > The problem you are running into is not related to the 
>>> buffer
>>>      alignment, it is related to the fact that the two pages across 
>>> which
>>>      the buffer is split are registered to two different MRs in the 
>>> NIC.
>>>      This can only happen if those two pages are allocated 
>>> independently
>>>      and trigger two distinct memory event callbacks.
>>>       >
>>>       > That is why I am so interested in seeing the results from 
>>> the
>>>      noticelog above ibv_reg_mr. It will tell me how your target
>>>      application is allocating memory. Also, when you start the SPDK
>>>      target, are you using the -s option? Something like
>>>      ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't 
>>> know
>>>      if it'll make a difference, it's more of a curiosity thing for 
>>> me)?
>>>       >
>>>       > Thanks,
>>>       >
>>>       > Seth
>>>       >
>>>       > -----Original Message-----
>>>       > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>>       > Sent: Thursday, August 1, 2019 11:24 AM
>>>       > To: Howell, Seth <seth.howell(a)intel.com
>>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>>       > Development Kit <spdk(a)lists.01.org 
>>> <mailto:spdk(a)lists.01.org>>
>>>       > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>>      multiple
>>>       > RDMA Memory Regions
>>>       >
>>>       > Hi Seth,
>>>       >
>>>       > Thanks for the detailed description, now I understand the 
>>> reason
>>>      behind the checking. But I have a question, why checking 
>>> against
>>>      2MiB? Is it because DPDK uses 2MiB page size by default so that 
>>> one
>>>      RDMA memory region should not cross 2 pages?
>>>       >
>>>       >   > Once I see what your memory registrations look like and 
>>> what
>>>      addresses you're failing on, it will help me understand what is
>>>      going on better.
>>>       >
>>>       > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7
>>>      +1503,11 @@ nvmf_rdma_fill_buffers(struct 
>>> spdk_nvmf_rdma_transport
>>>      *rtransport,
>>>       >                   remaining_length -=
>>>       > rdma_req->req.iov[iovcnt].iov_len;
>>>       >
>>>       >                   if (translation_len <
>>>      rdma_req->req.iov[iovcnt].iov_len) {
>>>       > -                       SPDK_ERRLOG("Data buffer split over 
>>> multiple
>>>       > RDMA Memory Regions\n");
>>>       > +                       SPDK_ERRLOG("Data buffer split over 
>>> multiple
>>>       > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
>>>      rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
>>>      translation_len, rdma_req->req.iov[iovcnt].iov_len);
>>>       >                           return -EINVAL;
>>>       >                   }
>>>       >
>>>       > With this I can see which buffer failed the checking.
>>>       > For example, when SPKD initializes the memory pool, one of 
>>> the
>>>      buffers starts with 0x2000193feb00, and when failed, I got
>>> following:
>>>       >
>>>       > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer 
>>> split over
>>>       > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) 
>>> (5376)
>>>      (8192)
>>>       >
>>>       > This buffer has 5376B on one 2MB page and the rest of it
>>>       > (8192-5376=2816B) is on another page.
>>>       >
>>>       > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 
>>> to
>>>      use iov base should make it better as iov base is 4KiB aligned. 
>>> In
>>>      above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 
>>> and
>>>      it should pass the checking.
>>>       > However, another buffer in the pool is 0x2000192010c0 and
>>>      iov_base is 0x200019201000, which would fail the checking 
>>> because it
>>>      is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
>>>       >
>>>       > I will add the change from
>>>       > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun 
>>> the
>>>      test to get more information.
>>>       >
>>>       > I also attached the conf file too. The cmd line is "nvmf_tgt 
>>> -m 0xff
>>>       > -j
>>>       > 0x90000000:0x20000000 -c 16disk_1ns.conf"
>>>       >
>>>       > Thanks,
>>>       > JD
>>>       >
>>>       >
>>>       > On 8/1/19 7:52 AM, Howell, Seth wrote:
>>>       >> Hi JD,
>>>       >>
>>>       >> I was doing a little bit of digging in the dpdk 
>>> documentation
>>>      around this process, and I have a little bit more information. 
>>> We
>>>      were pretty worried about the whole dynamic memory allocations 
>>> thing
>>>      a few releases ago, so Jim helped add a flag into DPDK that
>>>      prevented allocations from being allocated and freed in 
>>> different
>>>      granularities. This flag also prevents malloc heap allocations 
>>> from
>>>      spanning multiple memory events. However, this flag didn't make 
>>> it
>>>      into DPDK until 19.02 (More documentation at 
>>> https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#en
>>> vironment-abstraction-layer
>>>      if you're interested). We have some code in the SPDK 
>>> environment
>>>      layer that tries to deal with that (see
>>>      lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that 
>>> that
>>>      function is entirely capable of handling the heap allocations
>>>      spanning multiple memory events part of the problem.
>>>       >> Since you are using dpdk 18.11, the memory callback inside 
>>> of
>>>      lib/env_dpdk looks like a good candidate for our issue. My best
>>>      guess is that somehow a heap allocation from the buffer mempool 
>>> is
>>>      hitting across addresses from two dynamic memory allocation 
>>> events.
>>>      I'd still appreciate it if you could send me the information in 
>>> my
>>>      last e-mail, but I think we're onto something here.
>>>       >>
>>>       >> Thanks,
>>>       >>
>>>       >> Seth
>>>       >>
>>>       >> -----Original Message-----
>>>       >> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of Howell,
>>>       >> Seth
>>>       >> Sent: Thursday, August 1, 2019 5:26 AM
>>>       >> To: JD Zheng <jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>>; Storage Performance
>>>       >> Development Kit <spdk(a)lists.01.org 
>>> <mailto:spdk(a)lists.01.org>>
>>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split 
>>> over
>>>      multiple
>>>       >> RDMA Memory Regions
>>>       >>
>>>       >> Hi JD,
>>>       >>
>>>       >> Thanks for doing that. Yeah, I am mainly looking to see how 
>>> the
>>>      mempool addresses are mapped into the NIC with ibv_reg_mr.
>>>       >>
>>>       >> I think it's odd that we are using the buffer base for the 
>>> memory
>>>       >> check, we should be using the iov base, but I don't believe 
>>> that
>>>       >> would cause the issue you are seeing. Pushed a change to 
>>> modify
>>>      that
>>>       >> behavior anyways though:
>>>       >> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>>       >>
>>>       >> There was one registration that I wasn't able to catch from 
>>> your
>>>      last log. Sorry about that, I forgot there wasn’t a debug log 
>>> for
>>>      it. Can you try it again with this change which adds noticelogs 
>>> for
>>>      the relevant registrations.
>>>      https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be 
>>> able
>>>      to run your test without the -Lrdma argument this time to avoid 
>>> the
>>>      extra bloat in the logs.
>>>       >>
>>>       >> The underlying assumption of the code is that any given 
>>> object
>>>      is not going to cross a dynamic memory allocation from DPDK. 
>>> For a
>>>      little background, when the mempool gets created, the dpdk code
>>>      allocates some number of memzones to accommodate those buffer
>>>      objects. Then it passes those memzones down one at a time and 
>>> places
>>>      objects inside the mempool from the given memzone until the 
>>> memzone
>>>      is exhausted. Then it goes back and grabs another memzone. This
>>>      process continues until all objects are accounted for.
>>>       >> This only works if each memzone corresponds to a single 
>>> memory
>>>      event when using dynamic memory allocation. My understanding 
>>> was
>>>      that this was always the case, but this error makes me think 
>>> that
>>>      it's possible that that's not true.
>>>       >>
>>>       >> Once I see what your memory registrations look like and 
>>> what
>>>      addresses you're failing on, it will help me understand what is
>>>      going on better.
>>>       >>
>>>       >> Can you also provide the command line you are using to 
>>> start the
>>>      nvmf_tgt application and attach your configuration file?
>>>       >>
>>>       >> Thanks,
>>>       >>
>>>       >> Seth
>>>       >> -----Original Message-----
>>>       >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>>       >> Sent: Wednesday, July 31, 2019 3:13 PM
>>>       >> To: Howell, Seth <seth.howell(a)intel.com
>>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>>       >> Development Kit <spdk(a)lists.01.org 
>>> <mailto:spdk(a)lists.01.org>>
>>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split 
>>> over
>>>      multiple
>>>       >> RDMA Memory Regions
>>>       >>
>>>       >> Hi Seth,
>>>       >>
>>>       >> After I enabled debug and ran nvmf_tgt with -L rdma, I got 
>>> some
>>>      logs like:
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command
>>> Array:
>>>       >> 0x2000084bf000 Length: 40000 LKey: e601
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion
>>> Array:
>>>       >> 0x200008621000 Length: 10000 LKey: e701
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x200018600000 Length: 1000000 LKey: e801
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command
>>> Array:
>>>       >> 0x20000847e000 Length: 40000 LKey: e701
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion
>>> Array:
>>>       >> 0x20000846d000 Length: 10000 LKey: e801
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x200019800000 Length: 1000000 LKey: e901
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command
>>> Array:
>>>       >> 0x200016ebb000 Length: 40000 LKey: e801
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion
>>> Array:
>>>       >> 0x20000845c000 Length: 10000 LKey: e901
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x20001aa00000 Length: 1000000 LKey: ea01
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command
>>> Array:
>>>       >> 0x200016e7a000 Length: 40000 LKey: e901
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion
>>> Array:
>>>       >> 0x20000844b000 Length: 10000 LKey: ea01
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>>       >>
>>>       >> Is this what you are looking for as the memory regions registered with the NIC?
>>>       >>
>>>       >> I attached the complete log.
>>>       >>
>>>       >> Thanks,
>>>       >> JD
>>>       >>
>>>       >> On 7/30/19 5:28 PM, JD Zheng wrote:
>>>       >>> Hi Seth,
>>>       >>>
>>>       >>> Thanks for the prompt reply!
>>>       >>>
>>>       >>> Please find answers inline.
>>>       >>>
>>>       >>> JD
>>>       >>>
>>>       >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>>       >>>> Hi JD,
>>>       >>>>
>>>       >>>> Thanks for the report. I want to ask a few questions to 
>>> start
>>>       >>>> getting to the bottom of this. Since this issue doesn't 
>>> currently
>>>       >>>> reproduce on our per-patch or nightly tests, I would like 
>>> to
>>>       >>>> understand what's unique about your setup so that we can
>>>      replicate
>>>       >>>> it in a per patch test to prevent future regressions.
>>>       >>> I am running it on aarch64 platform. I tried x86 platform 
>>> and I
>>>      can
>>>       >>> see same buffer alignment in memory pool but can't run the 
>>> real
>>>      test
>>>       >>> to reproduce it due to other missing pieces.
>>>       >>>
>>>       >>>>
>>>       >>>> What options are you passing when you create the rdma 
>>> transport?
>>>       >>>> Are you creating it over RPC or in a configuration file?
>>>       >>> I am using conf file. Pls let me know if you'd like to 
>>> look
>>>      into conf file.
>>>       >>>
>>>       >>>>
>>>       >>>> Are you using the current DPDK submodule as your 
>>> environment
>>>       >>>> abstraction layer?
>>>       >>> No. Our project uses specific version of DPDK, which is 
>>> v18.11. I
>>>       >>> did quick test using latest and DPDK submodule on x86, and 
>>> the
>>>       >>> buffer alignment is the same, i.e. 64B aligned.
>>>       >>>
>>>       >>>>
>>>       >>>> I notice that your error log is printing from
>>>       >>>> spdk_nvmf_transport_poll_group_create, which value 
>>> exactly are
>>>      you
>>>       >>>> printing out?
>>>       >>> Here is patch to add dbg print. Pls note that SPDK version 
>>> is
>>>      v19.04
>>>       >>>
>>>       >>> @@ -215,6 +222,7 @@ 
>>> spdk_nvmf_transport_poll_group_create(st
>>>       >>> SPDK_NOTICELOG("Unable to
>>>      reserve
>>>       >>> the full number of buffers for the pg buffer cache.\n");
>>>       >>>                                    break;
>>>       >>>                            }
>>>       >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>>       >>> group->buf_cache_count, group->buf_cache_size);
>>>       >>> STAILQ_INSERT_HEAD(&group->buf_cache,
>>>       >>> buf, link);
>>>       >>> group->buf_cache_count++;
>>>       >>>                    }
>>>       >>>
>>>       >>>>
>>>       >>>> Can you run your target with the -L rdma option to get a 
>>> dump of
>>>       >>>> the memory regions registered with the NIC?
>>>       >>> Let me test and get back to you soon.
>>>       >>>
>>>       >>>>
>>>       >>>> We made a couple of changes to this code when dynamic 
>>> memory
>>>       >>>> allocations were added to DPDK. There were some 
>>> safeguards
>>>      that we
>>>       >>>> added to try and make sure this case wouldn't hit, so I'd 
>>> like to
>>>       >>>> make sure you are running on the latest DPDK submodule as 
>>> well as
>>>       >>>> the latest SPDK to narrow down where we need to look.
>>>       >>> Unfortunately I can't easily update DPDK because other 
>>> team
>>>       >>> maintains it internally. But if it can be repro and fixed 
>>> in
>>>      latest,
>>>       >>> I will try to pull in the fix.
>>>       >>>
>>>       >>>>
>>>       >>>> Thanks,
>>>       >>>>
>>>       >>>> Seth
>>>       >>>>
>>>       >>>> -----Original Message-----
>>>       >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of JD Zheng
>>>       >>>> via SPDK
>>>       >>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>>       >>>> To: spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>
>>>       >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>>
>>>       >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over 
>>> multiple
>>>       >>>> RDMA Memory Regions
>>>       >>>>
>>>       >>>> Hello,
>>>       >>>>
>>>       >>>> When I run nvmf_tgt over RDMA using latest SPDK code, I
>>>       >>>> occasionally ran into this error:
>>>       >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer 
>>> split
>>>       >>>> over multiple RDMA Memory Regions"
>>>       >>>>
>>>       >>>> After digging into the code, I found that
>>>      nvmf_rdma_fill_buffers()
>>>       >>>> calls spdk_mem_map_translate() to check if a data buffer 
>>> sits on 2
>>>       >>>> 2MB pages, and if it is the case, it reports this error.
>>>       >>>>
>>>       >>>> The following commit added change to use data buffer 
>>> start
>>>      address
>>>       >>>> to calculate the size between buffer start address and 
>>> 2MB
>>>      boundary.
>>>       >>>> The caller nvmf_rdma_fill_buffers() uses the size to 
>>> compare with
>>>       >>>> IO Unit size (which is 8KB in my conf) to determine if 
>>> the buffer
>>>       >>>> passes 2MB boundary.
>>>       >>>>
>>>       >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>>       >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>>>      <mailto:dariusz.stojaczyk(a)intel.com>>
>>>       >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>       >>>>
>>>       >>>>         memory: fix contiguous memory calculation for 
>>> unaligned
>>>       >>>> buffers
>>>       >>>>
>>>       >>>> In nvmf_tgt, the buffers are pre-allocated as a memory 
>>> pool
>>>      and new
>>>       >>>> request will use free buffer from that pool and the 
>>> buffer start
>>>       >>>> address is passed to nvmf_rdma_fill_buffers(). But I 
>>> found that
>>>       >>>> these buffers are not 2MB aligned and not IOUnitSize 
>>> aligned (8KB
>>>       >>>> in my
>>>       >>>> case) either, instead, they are 64Byte aligned so that 
>>> some
>>>      buffers
>>>       >>>> will fail the checking and leads to this problem.
>>>       >>>>
>>>       >>>> The corresponding code snippets are as following:
>>>       >>>> spdk_nvmf_transport_create()
>>>       >>>> {
>>>       >>>> ...
>>>       >>>>         transport->data_buf_pool =
>>>       >>>> spdk_mempool_create(spdk_mempool_name,
>>>       >>>>  opts->num_shared_buffers,
>>>       >>>>  opts->io_unit_size +
>>>       >>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>       >>>>
>>>        SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>       >>>>  SPDK_ENV_SOCKET_ID_ANY); ...
>>>       >>>> }
>>>       >>>>
>>>       >>>> Also some debug print I added shows the start address of 
>>> the
>>>      buffers:
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019258800 0(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x2000192557c0 1(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019252780 2(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x20001924f740 3(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x20001924c700 4(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x2000192496c0 5(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019246680 6(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019243640 7(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019240600 8(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x20001923d5c0 9(32)
>>>       >>>> ...
>>>       >>>>
>>>       >>>> It looks like either the buffer allocation has alignment 
>>> issue or
>>>       >>>> the checking is not correct.
>>>       >>>>
>>>       >>>> Please advise how to fix this problem.
>>>       >>>>
>>>       >>>> Thanks,
>>>       >>>> JD Zheng
>>>       >>>> _______________________________________________
>>>       >>>> SPDK mailing list
>>>       >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>>       >>>> https://lists.01.org/mailman/listinfo/spdk
>>>       >>>>
>>>       >> _______________________________________________
>>>       >> SPDK mailing list
>>>       >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>>       >> https://lists.01.org/mailman/listinfo/spdk
>>>       >>
>>>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-20 12:22 Sasha Kotchubievsky
  0 siblings, 0 replies; 20+ messages in thread
From: Sasha Kotchubievsky @ 2019-08-20 12:22 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 34987 bytes --]

Hi,

I think dynamic memory allocation doesn't really work for the RDMA case.

Commit 9cec99b84b9a08e9122ada4f4455172e40ff6c06 already removes the memory 
"free" for dynamically allocated memory.

I'd suggest pre-allocating enough memory for the nvmf target using the "-s" option.
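
For example (the -s value below is only a placeholder, and the remaining flags
are simply the ones already used earlier in this thread):

  ./app/nvmf_tgt/nvmf_tgt -s 2048 -m 0xff -c 16disk_1ns.conf

The intent is that the shared data buffer pool then comes out of hugepage
memory reserved at startup rather than out of later dynamic allocations.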


Best regards

Sasha

On 20-Aug-19 12:42 AM, JD Zheng via SPDK wrote:
> Hi Seth,
>
> It sometimes triggered seg fault but I couldn't get backtrace due to 
> likely corrupted stack. With my workaround, this is no longer seen.
>
> Let me submit my change as RFC to gerrit. It probably isn't necessary 
> to upstream as DPDK 19.02 should fix this problem properly.
>
> Thanks,
> JD
>
> On 8/19/19 2:16 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> What issue specifically did you see? If there is something measurable 
>> happening (other than the error message) then I think it should be 
>> high priority to get a more permanent workaround into upstream SPDK.
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> Sent: Monday, August 19, 2019 2:03 PM
>> To: Howell, Seth <seth.howell(a)intel.com>
>> Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan 
>> Richardson <jonathan.richardson(a)broadcom.com>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>> RDMA Memory Regions
>>
>> Hi Seth,
>>
>>   > Unfortunately, the only way
>>   > to protect fully against this happening is by using the DPDK 
>> flag  > --match-allocations which was introduced in DPDK 19.02.
>>
>> Then I need to use DPDK 19.02. Do I need to enable this flag 
>> explicitly when moving DPDK 19.02?
>>
>>   > The good news is that the SPDK target will skip these buffers 
>> without  > bricking, causing data corruption or doing any otherwise 
>> bad things.
>>
>> Unfortunately this is not what I saw. It appeared that SPDK gave up 
>> this split buffer, but it still causes issue, maybe because it was 
>> tried too many times(?).
>>
>> Currently, I have to use DPDK 18.11 so that I added a couple of 
>> workarounds to prevent the split buffer from being used before 
>> reaching fill_buffers(). I did a little trick there to call 
>> spdk_mempool_get() but not mempool_put later, so that this buffer is 
>> set as "allocated" in mempool and will not be tried again and again. 
>> It does look like small memory leak though. We can usually see 2-3 
>> split buffers during overnight run, btw.
>>
>> This seems working OK.
>>
>> For sure, I will measure performance later.
>>
>> Thanks,
>> JD
>>
>>
>> On 8/19/19 1:12 PM, Howell, Seth wrote:
>>> Hi JD,
>>>
>>> Thanks for performing that experiment. With this new information, I
>>> think we can be pretty much 100% sure that the problem is related to
>>> the mempool being split over two DPDK memzones. Unfortunately, the
>>> only way to protect fully against this happening is by using the DPDK
>>> flag --match-allocations which was introduced in DPDK 19.02. Jim
>>> helped advocate for this flag specifically because of this problem
>>> with mempools and RDMA.
>>>
>>> The good news is that the SPDK target will skip these buffers without
>>> bricking, causing data corruption or doing any otherwise bad things.
>>> What ends up happening is that the nvmf_rdma_fill_buffers function
>>> will print the error message and then return NULL which will trigger
>>> the target to retry the I/O again. By that time, there will be another
>>> buffer there for the request to use and it won’t fail the second time
>>> around. So the code currently handles the problem in a technically
>>> correct way i.e. It’s not going to brick the target or initiator by
>>> trying to use a buffer that spans multiple Memory Regions. Instead, it
>>> properly recognizes that it is trying to use a bad buffer and
>>> reschedules the request buffer parsing.
>>>
>>> However, I am a little worried over the fact that these buffers remain
>>> in the mempool and can be repeatedly used by the application. I can
>>> picture a scenario where this could possibly have a  performance 
>>> impact.
>>> Take for example a mempool with 128 entries in it in which one of them
>>> is split over a memzone. Since this split buffer will never find its
>>> way into a request, it’s possible that this split buffer gets pulled
>>> up into requests more often than other buffers and subsequently fails
>>> in nvmf_rdma_fill_buffers causing requests to have to be rescheduled
>>> to the next time the poller runs. Depending on how frequently this
>>> happens, the performance impact *could possibly* add up.
>>>
>>> I have as yet been unable to replicate the split buffer error. One
>>> thing you could try to see if there is any measurable performance
>>> impact is try starting the NVMe-oF target with DPDK legacy memory mode
>>> which will move all memory allocations to startup and prevent you from
>>> splitting buffers. Then run a benchmark with a lot of connections at
>>> high queue depth and see what the performance looks like compared to
>>> the dynamic memory model. If there is a significant performance
>>> impact, we may have to modify how we handle this error case.
>>>
>>> Thanks,
>>>
>>> Seth
>>>
>>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>>> Sent: Monday, August 12, 2019 4:17 PM
>>> To: Howell, Seth <seth.howell(a)intel.com>
>>> Cc: Storage Performance Development Kit <spdk(a)lists.01.org>;
>>> Jonathan Richardson <jonathan.richardson(a)broadcom.com>
>>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>> multiple RDMA Memory Regions
>>>
>>> + Jonanthan
>>>
>>> Hi Seth,
>>>
>>> We finally got chance to test with more logs enabled. You are correct
>>> that that problematic buffer does sit on 2 registered memory regions:
>>>
>>> The problematic buffer is "Buffer address: 200019bfeb00"; the actual buffer
>>> pointer used is "200019bff000" (SPDK makes it 4KiB aligned), and its size is
>>> 8KiB (0x2000), so it does sit on 2 registered memory regions.
>>>
>>> However, it looks like SPDK/DPDK allocates buffers starting from the end of
>>> a region and going up, and due to the extra room and alignment of each
>>> buffer there is a chance that one buffer can exceed the memory region
>>> boundary?
>>>
>>> In this case, the buffers are between 0x200019997800 and 0x200019c5320,
>>> so the last buffer exceeds one region and goes into the next one.
>>>
>>> Some logs for your information:
>>>
>>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>>> 200019800000, memory region length: 400000
>>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>>> 200019c00000, memory region length: 400000
>>>
>>> ...
>>>
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019bfeb00 27(32)
>>>
>>> ...
>>>
>>> rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer
>>> address: 200019bfeb00 iov_base address 200019bff000
>>>
>>> Thanks,
>>>
>>> JD
>>>
>>> On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com
>>> <mailto:seth.howell(a)intel.com>> wrote:
>>>
>>>      There are two different assignments that you need to look at. I'll
>>>      detail the cases below based on line numbers from the latest 
>>> master.
>>>
>>>      Memory.c:656 *size = spdk_min(*size, cur_size):
>>>               This assignment is inside of the conditional "if(size ==
>>>      NULL || map->ops.are_contiguous == NULL)"
>>>               So in other words, at the offset, we figure out how much
>>>      space we have left in the current translation. Then, if there 
>>> is no
>>>      callback to tell us whether the next translation will be 
>>> contiguous
>>>      to this one, we fill the size variable with the remaining 
>>> length of
>>>      that 2 MiB buffer.
>>>
>>>      Memory.c:682 *size = spdk_min(*size, cur_size):
>>>               This assignment comes after the while loop guarded by the
>>>      condition "while (cur_size < *size)". This while loop assumes that
>>>      we have supplied some desired length for our buffer. This is 
>>> true in
>>>      the RDMA case. Now this while loop will only break on two
>>>      conditions. 1. Cur_size becomes larger than *size, or the
>>>      are_contiguous function returns false, meaning that the two
>>>      translations cannot be considered together. In the case of the 
>>> RDMA
>>>      memory map, the only time are_contiguous returns false is when the
>>>      two memory regions correspond to two distinct RDMA MRs. Notice 
>>> that
>>>      in this case - the one where are_contiguous is defined and we
>>>      supplied a size variable - the *size variable is not overwritten
>>>      with cur_size until 1. cur_size is >= *size or 2. The 
>>> are_contiguous
>>>      check fails.
>>>
>>>      In the second case detailed above, you can see how one could  pass
>>>      in a buffer that spanned a 2 MiB page and still get a translation
>>>      value equal to the size of the buffer. This second case is the one
>>>      that the rdma.c code should be using since we have a registered
>>>      are_contiguous function with the NIC and we have supplied a size
>>>      pointer filled with the length of our buffer.
>>>
>>>      -----Original Message-----
>>>      From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>>      Sent: Thursday, August 1, 2019 2:01 PM
>>>      To: Howell, Seth <seth.howell(a)intel.com
>>>      <mailto:seth.howell(a)intel.com>>; Storage Performance 
>>> Development Kit
>>>      <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>>      Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>>      multiple RDMA Memory Regions
>>>
>>>      Hi Seth,
>>>
>>>        > Just because a buffer extends past a 2 MiB boundary doesn't 
>>> mean
>>>      that it exists in two different Memory Regions. It also won't fail
>>>      the translation for being over two memory regions.
>>>
>>>      This makes sense. However, spdk_mem_map_translate() does following
>>>      to calculate translation_len:
>>>
>>>      cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
>>>      *size = spdk_min(*size, cur_size); // *size is the translation_len
>>>      from caller nvmf_rdma_fill_buffers()
>>>
>>>      In nvmf_rdma_fill_buffers(),
>>>
>>>      if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>>>                               SPDK_ERRLOG("Data buffer split over
>>>      multiple RDMA Memory Regions\n");
>>>                               return -EINVAL;
>>>                       }
>>>
>>>      This just checks if buffer sits on 2 2MB pages, not about 2 RDMA
>>>      memory regions. Is my understanding correct?
>>>
>>>      I still need some time to test. I will update you the result 
>>> with -s
>>>      as well.
>>>
>>>      Thanks,
>>>      JD
>>>
>>>
>>>      On 8/1/19 1:28 PM, Howell, Seth wrote:
>>>       > Hi JD,
>>>       >
>>>       > The 2 MiB check is just because we always do memory 
>>> registrations
>>>      at at least 2 MiB granularity (the minimum hugepage size). Just
>>>      because a buffer extends past a 2 MiB boundary doesn't mean 
>>> that it
>>>      exists in two different Memory Regions. It also won't fail the
>>>      translation for being over two memory regions.
>>>       >
>>>       > If you look at the definition of spdk_mem_map_translate we call
>>>      map->ops->are_contiguous every time we cross a 2 MiB boundary. For
>>>      RDMA, this function is registered to
>>>      spdk_nvmf_rdma_check_contiguous_entries. IF this function returns
>>>      true, then even if the buffer crosses a 2 MiB boundary, the
>>>      translation will still be valid.
>>>       > The problem you are running into is not related to the buffer
>>>      alignment, it is related to the fact that the two pages across 
>>> which
>>>      the buffer is split are registered to two different MRs in the 
>>> NIC.
>>>      This can only happen if those two pages are allocated 
>>> independently
>>>      and trigger two distinct memory event callbacks.
>>>       >
>>>       > That is why I am so interested in seeing the results from the
>>>      noticelog above ibv_reg_mr. It will tell me how your target
>>>      application is allocating memory. Also, when you start the SPDK
>>>      target, are you using the -s option? Something like
>>>      ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't 
>>> know
>>>      if it'll make a difference, it's more of a curiosity thing for 
>>> me)?
>>>       >
>>>       > Thanks,
>>>       >
>>>       > Seth
>>>       >
>>>       > -----Original Message-----
>>>       > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>>       > Sent: Thursday, August 1, 2019 11:24 AM
>>>       > To: Howell, Seth <seth.howell(a)intel.com
>>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>>       > Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>>       > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>>      multiple
>>>       > RDMA Memory Regions
>>>       >
>>>       > Hi Seth,
>>>       >
>>>       > Thanks for the detailed description, now I understand the 
>>> reason
>>>      behind the checking. But I have a question, why checking against
>>>      2MiB? Is it because DPDK uses 2MiB page size by default so that 
>>> one
>>>      RDMA memory region should not cross 2 pages?
>>>       >
>>>       >   > Once I see what your memory registrations look like and 
>>> what
>>>      addresses you're failing on, it will help me understand what is
>>>      going on better.
>>>       >
>>>       > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7
>>>      +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport
>>>      *rtransport,
>>>       >                   remaining_length -=
>>>       > rdma_req->req.iov[iovcnt].iov_len;
>>>       >
>>>       >                   if (translation_len <
>>>      rdma_req->req.iov[iovcnt].iov_len) {
>>>       > -                       SPDK_ERRLOG("Data buffer split over 
>>> multiple
>>>       > RDMA Memory Regions\n");
>>>       > +                       SPDK_ERRLOG("Data buffer split over 
>>> multiple
>>>       > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
>>>      rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
>>>      translation_len, rdma_req->req.iov[iovcnt].iov_len);
>>>       >                           return -EINVAL;
>>>       >                   }
>>>       >
>>>       > With this I can see which buffer failed the checking.
>>>       > For example, when SPDK initializes the memory pool, one of the
>>>      buffers starts with 0x2000193feb00, and when failed, I got 
>>> following:
>>>       >
>>>       > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer 
>>> split over
>>>       > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376)
>>>      (8192)
>>>       >
>>>       > This buffer has 5376B on one 2MB page and the rest of it
>>>       > (8192-5376=2816B) is on another page.
>>>       >
>>>       > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to
>>>      use iov base should make it better as iov base is 4KiB aligned. In
>>>      above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 
>>> and
>>>      it should pass the checking.
>>>       > However, another buffer in the pool is 0x2000192010c0 and
>>>      iov_base is 0x200019201000, which would fail the checking 
>>> because it
>>>      is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
>>>       >
>>>       > I will add the change from
>>>       > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the
>>>      test to get more information.
>>>       >
>>>       > I also attached the conf file too. The cmd line is "nvmf_tgt 
>>> -m 0xff
>>>       > -j
>>>       > 0x90000000:0x20000000 -c 16disk_1ns.conf"
>>>       >
>>>       > Thanks,
>>>       > JD
>>>       >
>>>       >
>>>       > On 8/1/19 7:52 AM, Howell, Seth wrote:
>>>       >> Hi JD,
>>>       >>
>>>       >> I was doing a little bit of digging in the dpdk documentation
>>>      around this process, and I have a little bit more information. We
>>>      were pretty worried about the whole dynamic memory allocations 
>>> thing
>>>      a few releases ago, so Jim helped add a flag into DPDK that
>>>      prevented allocations from being allocated and freed in different
>>>      granularities. This flag also prevents malloc heap allocations 
>>> from
>>>      spanning multiple memory events. However, this flag didn't make it
>>>      into DPDK until 19.02 (More documentation at
>>> https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer
>>>      if you're interested). We have some code in the SPDK environment
>>>      layer that tries to deal with that (see
>>>      lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that 
>>> that
>>>      function is entirely capable of handling the heap allocations
>>>      spanning multiple memory events part of the problem.
>>>       >> Since you are using dpdk 18.11, the memory callback inside of
>>>      lib/env_dpdk looks like a good candidate for our issue. My best
>>>      guess is that somehow a heap allocation from the buffer mempool is
>>>      hitting across addresses from two dynamic memory allocation 
>>> events.
>>>      I'd still appreciate it if you could send me the information in my
>>>      last e-mail, but I think we're onto something here.
>>>       >>
>>>       >> Thanks,
>>>       >>
>>>       >> Seth
>>>       >>
>>>       >> -----Original Message-----
>>>       >> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of Howell,
>>>       >> Seth
>>>       >> Sent: Thursday, August 1, 2019 5:26 AM
>>>       >> To: JD Zheng <jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>>; Storage Performance
>>>       >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>>      multiple
>>>       >> RDMA Memory Regions
>>>       >>
>>>       >> Hi JD,
>>>       >>
>>>       >> Thanks for doing that. Yeah, I am mainly looking to see how 
>>> the
>>>      mempool addresses are mapped into the NIC with ibv_reg_mr.
>>>       >>
>>>       >> I think it's odd that we are using the buffer base for the 
>>> memory
>>>       >> check, we should be using the iov base, but I don't believe 
>>> that
>>>       >> would cause the issue you are seeing. Pushed a change to 
>>> modify
>>>      that
>>>       >> behavior anyways though:
>>>       >> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>>       >>
>>>       >> There was one registration that I wasn't able to catch from 
>>> your
>>>      last log. Sorry about that, I forgot there wasn’t a debug log for
>>>      it. Can you try it again with this change which adds noticelogs 
>>> for
>>>      the relevant registrations.
>>>      https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be 
>>> able
>>>      to run your test without the -Lrdma argument this time to avoid 
>>> the
>>>      extra bloat in the logs.
>>>       >>
>>>       >> The underlying assumption of the code is that any given object
>>>      is not going to cross a dynamic memory allocation from DPDK. For a
>>>      little background, when the mempool gets created, the dpdk code
>>>      allocates some number of memzones to accommodate those buffer
>>>      objects. Then it passes those memzones down one at a time and 
>>> places
>>>      objects inside the mempool from the given memzone until the 
>>> memzone
>>>      is exhausted. Then it goes back and grabs another memzone. This
>>>      process continues until all objects are accounted for.
>>>       >> This only works if each memzone corresponds to a single memory
>>>      event when using dynamic memory allocation. My understanding was
>>>      that this was always the case, but this error makes me think that
>>>      it's possible that that's not true.
>>>       >>
>>>       >> Once I see what your memory registrations look like and what
>>>      addresses you're failing on, it will help me understand what is
>>>      going on better.
>>>       >>
>>>       >> Can you also provide the command line you are using to 
>>> start the
>>>      nvmf_tgt application and attach your configuration file?
>>>       >>
>>>       >> Thanks,
>>>       >>
>>>       >> Seth
>>>       >> -----Original Message-----
>>>       >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>>       >> Sent: Wednesday, July 31, 2019 3:13 PM
>>>       >> To: Howell, Seth <seth.howell(a)intel.com
>>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>>       >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>>      multiple
>>>       >> RDMA Memory Regions
>>>       >>
>>>       >> Hi Seth,
>>>       >>
>>>       >> After I enabled debug and ran nvmf_tgt with -L rdma, I got 
>>> some
>>>      logs like:
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command 
>>> Array:
>>>       >> 0x2000084bf000 Length: 40000 LKey: e601
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion 
>>> Array:
>>>       >> 0x200008621000 Length: 10000 LKey: e701
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x200018600000 Length: 1000000 LKey: e801
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command 
>>> Array:
>>>       >> 0x20000847e000 Length: 40000 LKey: e701
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion 
>>> Array:
>>>       >> 0x20000846d000 Length: 10000 LKey: e801
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x200019800000 Length: 1000000 LKey: e901
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command 
>>> Array:
>>>       >> 0x200016ebb000 Length: 40000 LKey: e801
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion 
>>> Array:
>>>       >> 0x20000845c000 Length: 10000 LKey: e901
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x20001aa00000 Length: 1000000 LKey: ea01
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command 
>>> Array:
>>>       >> 0x200016e7a000 Length: 40000 LKey: e901
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion 
>>> Array:
>>>       >> 0x20000844b000 Length: 10000 LKey: ea01
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>>       >>
>>>       >> Is this what you are looking for as the memory regions registered with the NIC?
>>>       >>
>>>       >> I attached the complete log.
>>>       >>
>>>       >> Thanks,
>>>       >> JD
>>>       >>
>>>       >> On 7/30/19 5:28 PM, JD Zheng wrote:
>>>       >>> Hi Seth,
>>>       >>>
>>>       >>> Thanks for the prompt reply!
>>>       >>>
>>>       >>> Please find answers inline.
>>>       >>>
>>>       >>> JD
>>>       >>>
>>>       >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>>       >>>> Hi JD,
>>>       >>>>
>>>       >>>> Thanks for the report. I want to ask a few questions to 
>>> start
>>>       >>>> getting to the bottom of this. Since this issue doesn't 
>>> currently
>>>       >>>> reproduce on our per-patch or nightly tests, I would like to
>>>       >>>> understand what's unique about your setup so that we can
>>>      replicate
>>>       >>>> it in a per patch test to prevent future regressions.
>>>       >>> I am running it on aarch64 platform. I tried x86 platform 
>>> and I
>>>      can
>>>       >>> see same buffer alignment in memory pool but can't run the 
>>> real
>>>      test
>>>       >>> to reproduce it due to other missing pieces.
>>>       >>>
>>>       >>>>
>>>       >>>> What options are you passing when you create the rdma 
>>> transport?
>>>       >>>> Are you creating it over RPC or in a configuration file?
>>>       >>> I am using conf file. Pls let me know if you'd like to look
>>>      into conf file.
>>>       >>>
>>>       >>>>
>>>       >>>> Are you using the current DPDK submodule as your environment
>>>       >>>> abstraction layer?
>>>       >>> No. Our project uses specific version of DPDK, which is 
>>> v18.11. I
>>>       >>> did quick test using latest and DPDK submodule on x86, and 
>>> the
>>>       >>> buffer alignment is the same, i.e. 64B aligned.
>>>       >>>
>>>       >>>>
>>>       >>>> I notice that your error log is printing from
>>>       >>>> spdk_nvmf_transport_poll_group_create, which value 
>>> exactly are
>>>      you
>>>       >>>> printing out?
>>>       >>> Here is patch to add dbg print. Pls note that SPDK version is
>>>      v19.04
>>>       >>>
>>>       >>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>>       >>> SPDK_NOTICELOG("Unable to
>>>      reserve
>>>       >>> the full number of buffers for the pg buffer cache.\n");
>>>       >>>                                    break;
>>>       >>>                            }
>>>       >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>>       >>> group->buf_cache_count, group->buf_cache_size);
>>>       >>> STAILQ_INSERT_HEAD(&group->buf_cache,
>>>       >>> buf, link);
>>>       >>> group->buf_cache_count++;
>>>       >>>                    }
>>>       >>>
>>>       >>>>
>>>       >>>> Can you run your target with the -L rdma option to get a 
>>> dump of
>>>       >>>> the memory regions registered with the NIC?
>>>       >>> Let me test and get back to you soon.
>>>       >>>
>>>       >>>>
>>>       >>>> We made a couple of changes to this code when dynamic memory
>>>       >>>> allocations were added to DPDK. There were some safeguards
>>>      that we
>>>       >>>> added to try and make sure this case wouldn't hit, so I'd 
>>> like to
>>>       >>>> make sure you are running on the latest DPDK submodule as 
>>> well as
>>>       >>>> the latest SPDK to narrow down where we need to look.
>>>       >>> Unfortunately I can't easily update DPDK because other team
>>>       >>> maintains it internally. But if it can be repro and fixed in
>>>      latest,
>>>       >>> I will try to pull in the fix.
>>>       >>>
>>>       >>>>
>>>       >>>> Thanks,
>>>       >>>>
>>>       >>>> Seth
>>>       >>>>
>>>       >>>> -----Original Message-----
>>>       >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of JD Zheng
>>>       >>>> via SPDK
>>>       >>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>>       >>>> To: spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>
>>>       >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>>
>>>       >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over 
>>> multiple
>>>       >>>> RDMA Memory Regions
>>>       >>>>
>>>       >>>> Hello,
>>>       >>>>
>>>       >>>> When I run nvmf_tgt over RDMA using latest SPDK code, I
>>>       >>>> occasionally ran into this error:
>>>       >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer 
>>> split
>>>       >>>> over multiple RDMA Memory Regions"
>>>       >>>>
>>>       >>>> After digging into the code, I found that
>>>      nvmf_rdma_fill_buffers()
>>>       >>>> calls spdk_mem_map_translate() to check if a data buffer 
>>> sits on 2
>>>       >>>> 2MB pages, and if it is the case, it reports this error.
>>>       >>>>
>>>       >>>> The following commit added change to use data buffer start
>>>      address
>>>       >>>> to calculate the size between buffer start address and 2MB
>>>      boundary.
>>>       >>>> The caller nvmf_rdma_fill_buffers() uses the size to 
>>> compare with
>>>       >>>> IO Unit size (which is 8KB in my conf) to determine if 
>>> the buffer
>>>       >>>> passes 2MB boundary.
>>>       >>>>
>>>       >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>>       >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>>>      <mailto:dariusz.stojaczyk(a)intel.com>>
>>>       >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>       >>>>
>>>       >>>>         memory: fix contiguous memory calculation for 
>>> unaligned
>>>       >>>> buffers
>>>       >>>>
>>>       >>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool
>>>      and new
>>>       >>>> request will use free buffer from that pool and the 
>>> buffer start
>>>       >>>> address is passed to nvmf_rdma_fill_buffers(). But I 
>>> found that
>>>       >>>> these buffers are not 2MB aligned and not IOUnitSize 
>>> aligned (8KB
>>>       >>>> in my
>>>       >>>> case) either, instead, they are 64Byte aligned so that some
>>>      buffers
>>>       >>>> will fail the checking and leads to this problem.
>>>       >>>>
>>>       >>>> The corresponding code snippets are as following:
>>>       >>>> spdk_nvmf_transport_create()
>>>       >>>> {
>>>       >>>> ...
>>>       >>>>         transport->data_buf_pool =
>>>       >>>> spdk_mempool_create(spdk_mempool_name,
>>>       >>>>  opts->num_shared_buffers,
>>>       >>>>  opts->io_unit_size +
>>>       >>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>       >>>>
>>>        SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>       >>>>  SPDK_ENV_SOCKET_ID_ANY); ...
>>>       >>>> }
>>>       >>>>
>>>       >>>> Also some debug print I added shows the start address of the
>>>      buffers:
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019258800 0(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x2000192557c0 1(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019252780 2(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x20001924f740 3(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x20001924c700 4(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x2000192496c0 5(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019246680 6(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019243640 7(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019240600 8(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x20001923d5c0 9(32)
>>>       >>>> ...
>>>       >>>>
>>>       >>>> It looks like either the buffer allocation has alignment 
>>> issue or
>>>       >>>> the checking is not correct.
>>>       >>>>
>>>       >>>> Please advise how to fix this problem.
>>>       >>>>
>>>       >>>> Thanks,
>>>       >>>> JD Zheng
>>>       >>>> _______________________________________________
>>>       >>>> SPDK mailing list
>>>       >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>>       >>>> https://lists.01.org/mailman/listinfo/spdk
>>>       >>>>
>>>       >> _______________________________________________
>>>       >> SPDK mailing list
>>>       >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>>       >> https://lists.01.org/mailman/listinfo/spdk
>>>       >>
>>>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-19 21:42 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-08-19 21:42 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 31342 bytes --]

Hi Seth,

It sometimes triggered a seg fault, but I couldn't get a backtrace due to a 
likely corrupted stack. With my workaround, this is no longer seen.

Let me submit my change as an RFC to gerrit. It probably isn't necessary to 
upstream it, as DPDK 19.02 should fix this problem properly.
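
Roughly, the idea of the workaround looks like the sketch below (illustrative
only, not the actual change; the helper name, its exact call site, and the
surrounding wiring are assumptions):

#include <stdbool.h>
#include <stdint.h>
#include "spdk/env.h"

/*
 * Illustrative helper: returns true if the usable range of a shared data
 * buffer would span two RDMA memory regions, i.e. the translation length
 * that spdk_mem_map_translate() hands back is shorter than the buffer itself.
 */
static bool
request_buffer_is_split(struct spdk_mem_map *map, void *iov_base, uint64_t iov_len)
{
        uint64_t translation_len = iov_len;

        /* The returned lkey is ignored; only the clamped length matters. */
        (void)spdk_mem_map_translate(map, (uint64_t)(uintptr_t)iov_base,
                                     &translation_len);

        return translation_len < iov_len;
}

/*
 * The workaround itself: when such a buffer is pulled from the shared
 * mempool, never hand it back with spdk_mempool_put().  The mempool keeps
 * treating it as allocated and never gives it out again: a small, bounded,
 * intentional leak (a few buffers per overnight run, per this thread).
 */

The trade-off is that handful of deliberately leaked buffers per run;
--match-allocations in DPDK 19.02+ makes the whole workaround unnecessary, as
noted elsewhere in this thread.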

Thanks,
JD

On 8/19/19 2:16 PM, Howell, Seth wrote:
> Hi JD,
> 
> What issue specifically did you see? If there is something measurable happening (other than the error message) then I think it should be high priority to get a more permanent workaround into upstream SPDK.
> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Monday, August 19, 2019 2:03 PM
> To: Howell, Seth <seth.howell(a)intel.com>
> Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan Richardson <jonathan.richardson(a)broadcom.com>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
> 
> Hi Seth,
> 
>   > Unfortunately, the only way
>   > to protect fully against this happening is by using the DPDK flag  > --match-allocations which was introduced in DPDK 19.02.
> 
> Then I need to use DPDK 19.02. Do I need to enable this flag explicitly when moving DPDK 19.02?
> 
>   > The good news is that the SPDK target will skip these buffers without  > bricking, causing data corruption or doing any otherwise bad things.
> 
> Unfortunately this is not what I saw. It appeared that SPDK gave up this split buffer, but it still causes issue, maybe because it was tried too many times(?).
> 
> Currently, I have to use DPDK 18.11 so that I added a couple of workarounds to prevent the split buffer from being used before reaching fill_buffers(). I did a little trick there to call spdk_mempool_get() but not mempool_put later, so that this buffer is set as "allocated" in mempool and will not be tried again and again. It does look like small memory leak though. We can usually see 2-3 split buffers during overnight run, btw.
> 
> This seems working OK.
> 
> For sure, I will measure performance later.
> 
> Thanks,
> JD
> 
> 
> On 8/19/19 1:12 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> Thanks for performing that experiment. With this new information, I
>> think we can be pretty much 100% sure that the problem is related to
>> the mempool being split over two DPDK memzones. Unfortunately, the
>> only way to protect fully against this happening is by using the DPDK
>> flag --match-allocations which was introduced in DPDK 19.02. Jim
>> helped advocate for this flag specifically because of this problem
>> with mempools and RDMA.
>>
>> The good news is that the SPDK target will skip these buffers without
>> bricking, causing data corruption or doing any otherwise bad things.
>> What ends up happening is that the nvmf_rdma_fill_buffers function
>> will print the error message and then return NULL which will trigger
>> the target to retry the I/O again. By that time, there will be another
>> buffer there for the request to use and it won’t fail the second time
>> around. So the code currently handles the problem in a technically
>> correct way i.e. It’s not going to brick the target or initiator by
>> trying to use a buffer that spans multiple Memory Regions. Instead, it
>> properly recognizes that it is trying to use a bad buffer and
>> reschedules the request buffer parsing.
>>
>> However, I am a little worried over the fact that these buffers remain
>> in the mempool and can be repeatedly used by the application. I can
>> picture a scenario where this could possibly have a  performance impact.
>> Take for example a mempool with 128 entries in it in which one of them
>> is split over a memzone. Since this split buffer will never find its
>> way into a request, it’s possible that this split buffer gets pulled
>> up into requests more often than other buffers and subsequently fails
>> in nvmf_rdma_fill_buffers causing requests to have to be rescheduled
>> to the next time the poller runs. Depending on how frequently this
>> happens, the performance impact *could possibly* add up.
>>
>> I have as yet been unable to replicate the split buffer error. One
>> thing you could try to see if there is any measurable performance
>> impact is try starting the NVMe-oF target with DPDK legacy memory mode
>> which will move all memory allocations to startup and prevent you from
>> splitting buffers. Then run a benchmark with a lot of connections at
>> high queue depth and see what the performance looks like compared to
>> the dynamic memory model. If there is a significant performance
>> impact, we may have to modify how we handle this error case.
>>
>> Thanks,
>>
>> Seth
>>
>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> Sent: Monday, August 12, 2019 4:17 PM
>> To: Howell, Seth <seth.howell(a)intel.com>
>> Cc: Storage Performance Development Kit <spdk(a)lists.01.org>;
>> Jonathan Richardson <jonathan.richardson(a)broadcom.com>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>> multiple RDMA Memory Regions
>>
>> + Jonanthan
>>
>> Hi Seth,
>>
>> We finally got chance to test with more logs enabled. You are correct
>> that that problematic buffer does sit on 2 registered memory regions:
>>
>> The problematic buffer is "Buffer address: 200019bfeb00"; the actual buffer
>> pointer used is "200019bff000" (SPDK makes it 4KiB aligned), and its size is
>> 8KiB (0x2000), so it does sit on 2 registered memory regions.
>>
>> However, it looks like SPDK/DPDK allocates buffers starting from the end of
>> a region and going up, and due to the extra room and alignment of each
>> buffer there is a chance that one buffer can exceed the memory region
>> boundary?
>>
>> In this case, the buffers are between 0x200019997800 and 0x200019c5320,
>> so the last buffer exceeds one region and goes into the next one.
>>
>> Some logs for your information:
>>
>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>> 200019800000, memory region length: 400000
>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>> 200019c00000, memory region length: 400000
>>
>> ...
>>
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019bfeb00 27(32)
>>
>> ...
>>
>> rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer
>> address: 200019bfeb00 iov_base address 200019bff000
>>
>> Thanks,
>>
>> JD
>>
>> On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com
>> <mailto:seth.howell(a)intel.com>> wrote:
>>
>>      There are two different assignments that you need to look at. I'll
>>      detail the cases below based on line numbers from the latest master.
>>
>>      Memory.c:656 *size = spdk_min(*size, cur_size):
>>               This assignment is inside of the conditional "if(size ==
>>      NULL || map->ops.are_contiguous == NULL)"
>>               So in other words, at the offset, we figure out how much
>>      space we have left in the current translation. Then, if there is no
>>      callback to tell us whether the next translation will be contiguous
>>      to this one, we fill the size variable with the remaining length of
>>      that 2 MiB buffer.
>>
>>      Memory.c:682 *size = spdk_min(*size, cur_size):
>>               This assignment comes after the while loop guarded by the
>>      condition "while (cur_size < *size)". This while loop assumes that
>>      we have supplied some desired length for our buffer. This is true in
>>      the RDMA case. Now this while loop will only break on two
>>      conditions. 1. Cur_size becomes larger than *size, or the
>>      are_contiguous function returns false, meaning that the two
>>      translations cannot be considered together. In the case of the RDMA
>>      memory map, the only time are_contiguous returns false is when the
>>      two memory regions correspond to two distinct RDMA MRs. Notice that
>>      in this case - the one where are_contiguous is defined and we
>>      supplied a size variable - the *size variable is not overwritten
>>      with cur_size until 1. cur_size is >= *size or 2. The are_contiguous
>>      check fails.
>>
>>      In the second case detailed above, you can see how one could  pass
>>      in a buffer that spanned a 2 MiB page and still get a translation
>>      value equal to the size of the buffer. This second case is the one
>>      that the rdma.c code should be using since we have a registered
>>      are_contiguous function with the NIC and we have supplied a size
>>      pointer filled with the length of our buffer.
>>
>>      -----Original Message-----
>>      From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>      Sent: Thursday, August 1, 2019 2:01 PM
>>      To: Howell, Seth <seth.howell(a)intel.com
>>      <mailto:seth.howell(a)intel.com>>; Storage Performance Development Kit
>>      <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>      Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple RDMA Memory Regions
>>
>>      Hi Seth,
>>
>>        > Just because a buffer extends past a 2 MiB boundary doesn't mean
>>      that it exists in two different Memory Regions. It also won't fail
>>      the translation for being over two memory regions.
>>
>>      This makes sense. However, spdk_mem_map_translate() does following
>>      to calculate translation_len:
>>
>>      cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
>>      *size = spdk_min(*size, cur_size); // *size is the translation_len
>>      from caller nvmf_rdma_fill_buffers()
>>
>>      In nvmf_rdma_fill_buffers(),
>>
>>      if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>>                               SPDK_ERRLOG("Data buffer split over
>>      multiple RDMA Memory Regions\n");
>>                               return -EINVAL;
>>                       }
>>
>>      This just checks if buffer sits on 2 2MB pages, not about 2 RDMA
>>      memory regions. Is my understanding correct?
>>
>>      I still need some time to test. I will update you the result with -s
>>      as well.
>>
>>      Thanks,
>>      JD
>>
>>
>>      On 8/1/19 1:28 PM, Howell, Seth wrote:
>>       > Hi JD,
>>       >
>>       > The 2 MiB check is just because we always do memory registrations
>>      at at least 2 MiB granularity (the minimum hugepage size). Just
>>      because a buffer extends past a 2 MiB boundary doesn't mean that it
>>      exists in two different Memory Regions. It also won't fail the
>>      translation for being over two memory regions.
>>       >
>>       > If you look at the definition of spdk_mem_map_translate we call
>>      map->ops->are_contiguous every time we cross a 2 MiB boundary. For
>>      RDMA, this function is registered to
>>      spdk_nvmf_rdma_check_contiguous_entries. IF this function returns
>>      true, then even if the buffer crosses a 2 MiB boundary, the
>>      translation will still be valid.
>>       > The problem you are running into is not related to the buffer
>>      alignment, it is related to the fact that the two pages across which
>>      the buffer is split are registered to two different MRs in the NIC.
>>      This can only happen if those two pages are allocated independently
>>      and trigger two distinct memory event callbacks.
>>       >
>>       > That is why I am so interested in seeing the results from the
>>      noticelog above ibv_reg_mr. It will tell me how your target
>>      application is allocating memory. Also, when you start the SPDK
>>      target, are you using the -s option? Something like
>>      ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know
>>      if it'll make a difference, it's more of a curiosity thing for me)?
>>       >
>>       > Thanks,
>>       >
>>       > Seth
>>       >
>>       > -----Original Message-----
>>       > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>       > Sent: Thursday, August 1, 2019 11:24 AM
>>       > To: Howell, Seth <seth.howell(a)intel.com
>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>       > Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>       > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple
>>       > RDMA Memory Regions
>>       >
>>       > Hi Seth,
>>       >
>>       > Thanks for the detailed description, now I understand the reason
>>      behind the checking. But I have a question, why checking against
>>      2MiB? Is it because DPDK uses 2MiB page size by default so that one
>>      RDMA memory region should not cross 2 pages?
>>       >
>>       >   > Once I see what your memory registrations look like and what
>>      addresses you're failing on, it will help me understand what is
>>      going on better.
>>       >
>>       > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7
>>      +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport
>>      *rtransport,
>>       >                   remaining_length -=
>>       > rdma_req->req.iov[iovcnt].iov_len;
>>       >
>>       >                   if (translation_len <
>>      rdma_req->req.iov[iovcnt].iov_len) {
>>       > -                       SPDK_ERRLOG("Data buffer split over multiple
>>       > RDMA Memory Regions\n");
>>       > +                       SPDK_ERRLOG("Data buffer split over multiple
>>       > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
>>      rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
>>      translation_len, rdma_req->req.iov[iovcnt].iov_len);
>>       >                           return -EINVAL;
>>       >                   }
>>       >
>>       > With this I can see which buffer failed the checking.
>>       > For example, when SPKD initializes the memory pool, one of the
>>      buffers starts with 0x2000193feb00, and when failed, I got following:
>>       >
>>       > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
>>       > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376)
>>      (8192)
>>       >
>>       > This buffer has 5376B on one 2MB page and the rest of it
>>       > (8192-5376=2816B) is on another page.
>>       >
>>       > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to
>>      use iov base should make it better as iov base is 4KiB aligned. In
>>      above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 and
>>      it should pass the checking.
>>       > However, another buffer in the pool is 0x2000192010c0 and
>>      iov_base is 0x200019201000, which would fail the checking because it
>>      is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
>>       >
>>       > I will add the change from
>>       > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the
>>      test to get more information.
>>       >
>>       > I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff
>>       > -j
>>       > 0x90000000:0x20000000 -c 16disk_1ns.conf"
>>       >
>>       > Thanks,
>>       > JD
>>       >
>>       >
>>       > On 8/1/19 7:52 AM, Howell, Seth wrote:
>>       >> Hi JD,
>>       >>
>>       >> I was doing a little bit of digging in the dpdk documentation
>>      around this process, and I have a little bit more information. We
>>      were pretty worried about the whole dynamic memory allocations thing
>>      a few releases ago, so Jim helped add a flag into DPDK that
>>      prevented allocations from being allocated and freed in different
>>      granularities. This flag also prevents malloc heap allocations from
>>      spanning multiple memory events. However, this flag didn't make it
>>      into DPDK until 19.02 (More documentation at
>>      https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer
>>      if you're interested). We have some code in the SPDK environment
>>      layer that tries to deal with that (see
>>      lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that
>>      function is entirely capable of handling the heap allocations
>>      spanning multiple memory events part of the problem.
>>       >> Since you are using dpdk 18.11, the memory callback inside of
>>      lib/env_dpdk looks like a good candidate for our issue. My best
>>      guess is that somehow a heap allocation from the buffer mempool is
>>      hitting across addresses from two dynamic memory allocation events.
>>      I'd still appreciate it if you could send me the information in my
>>      last e-mail, but I think we're onto something here.
>>       >>
>>       >> Thanks,
>>       >>
>>       >> Seth
>>       >>
>>       >> -----Original Message-----
>>       >> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of Howell,
>>       >> Seth
>>       >> Sent: Thursday, August 1, 2019 5:26 AM
>>       >> To: JD Zheng <jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>>; Storage Performance
>>       >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple
>>       >> RDMA Memory Regions
>>       >>
>>       >> Hi JD,
>>       >>
>>       >> Thanks for doing that. Yeah, I am mainly looking to see how the
>>      mempool addresses are mapped into the NIC with ibv_reg_mr.
>>       >>
>>       >> I think it's odd that we are using the buffer base for the memory
>>       >> check, we should be using the iov base, but I don't believe that
>>       >> would cause the issue you are seeing. Pushed a change to modify
>>      that
>>       >> behavior anyways though:
>>       >> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>       >>
>>       >> There was one registration that I wasn't able to catch from your
>>      last log. Sorry about that, I forgot there wasn’t a debug log for
>>      it. Can you try it again with this change which adds noticelogs for
>>      the relevant registrations.
>>      https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able
>>      to run your test without the -Lrdma argument this time to avoid the
>>      extra bloat in the logs.
>>       >>
>>       >> The underlying assumption of the code is that any given object
>>      is not going to cross a dynamic memory allocation from DPDK. For a
>>      little background, when the mempool gets created, the dpdk code
>>      allocates some number of memzones to accommodate those buffer
>>      objects. Then it passes those memzones down one at a time and places
>>      objects inside the mempool from the given memzone until the memzone
>>      is exhausted. Then it goes back and grabs another memzone. This
>>      process continues until all objects are accounted for.
>>       >> This only works if each memzone corresponds to a single memory
>>      event when using dynamic memory allocation. My understanding was
>>      that this was always the case, but this error makes me think that
>>      it's possible that that's not true.
>>       >>
>>       >> Once I see what your memory registrations look like and what
>>      addresses you're failing on, it will help me understand what is
>>      going on better.
>>       >>
>>       >> Can you also provide the command line you are using to start the
>>      nvmf_tgt application and attach your configuration file?
>>       >>
>>       >> Thanks,
>>       >>
>>       >> Seth
>>       >> -----Original Message-----
>>       >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>       >> Sent: Wednesday, July 31, 2019 3:13 PM
>>       >> To: Howell, Seth <seth.howell(a)intel.com
>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>       >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple
>>       >> RDMA Memory Regions
>>       >>
>>       >> Hi Seth,
>>       >>
>>       >> After I enabled debug and ran nvmf_tgt with -L rdma, I got some
>>      logs like:
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x2000084bf000 Length: 40000 LKey: e601
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x200008621000 Length: 10000 LKey: e701
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x200018600000 Length: 1000000 LKey: e801
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x20000847e000 Length: 40000 LKey: e701
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x20000846d000 Length: 10000 LKey: e801
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x200019800000 Length: 1000000 LKey: e901
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x200016ebb000 Length: 40000 LKey: e801
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x20000845c000 Length: 10000 LKey: e901
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x20001aa00000 Length: 1000000 LKey: ea01
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x200016e7a000 Length: 40000 LKey: e901
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x20000844b000 Length: 10000 LKey: ea01
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>       >>
>>       >> Is this what you are looking for as memory regions registered for the NIC?
>>       >>
>>       >> I attached the complete log.
>>       >>
>>       >> Thanks,
>>       >> JD
>>       >>
>>       >> On 7/30/19 5:28 PM, JD Zheng wrote:
>>       >>> Hi Seth,
>>       >>>
>>       >>> Thanks for the prompt reply!
>>       >>>
>>       >>> Please find answers inline.
>>       >>>
>>       >>> JD
>>       >>>
>>       >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>       >>>> Hi JD,
>>       >>>>
>>       >>>> Thanks for the report. I want to ask a few questions to start
>>       >>>> getting to the bottom of this. Since this issue doesn't currently
>>       >>>> reproduce on our per-patch or nightly tests, I would like to
>>       >>>> understand what's unique about your setup so that we can
>>      replicate
>>       >>>> it in a per patch test to prevent future regressions.
>>       >>> I am running it on aarch64 platform. I tried x86 platform and I
>>      can
>>       >>> see same buffer alignment in memory pool but can't run the real
>>      test
>>       >>> to reproduce it due to other missing pieces.
>>       >>>
>>       >>>>
>>       >>>> What options are you passing when you create the rdma transport?
>>       >>>> Are you creating it over RPC or in a configuration file?
>>       >>> I am using conf file. Pls let me know if you'd like to look
>>      into conf file.
>>       >>>
>>       >>>>
>>       >>>> Are you using the current DPDK submodule as your environment
>>       >>>> abstraction layer?
>>       >>> No. Our project uses specific version of DPDK, which is v18.11. I
>>       >>> did quick test using latest and DPDK submodule on x86, and the
>>       >>> buffer alignment is the same, i.e. 64B aligned.
>>       >>>
>>       >>>>
>>       >>>> I notice that your error log is printing from
>>       >>>> spdk_nvmf_transport_poll_group_create, which value exactly are
>>      you
>>       >>>> printing out?
>>       >>> Here is patch to add dbg print. Pls note that SPDK version is
>>      v19.04
>>       >>>
>>       >>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>       >>>                                    SPDK_NOTICELOG("Unable to
>>      reserve
>>       >>> the full number of buffers for the pg buffer cache.\n");
>>       >>>                                    break;
>>       >>>                            }
>>       >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>       >>> group->buf_cache_count, group->buf_cache_size);
>>       >>>                            STAILQ_INSERT_HEAD(&group->buf_cache,
>>       >>> buf, link);
>>       >>>                            group->buf_cache_count++;
>>       >>>                    }
>>       >>>
>>       >>>>
>>       >>>> Can you run your target with the -L rdma option to get a dump of
>>       >>>> the memory regions registered with the NIC?
>>       >>> Let me test and get back to you soon.
>>       >>>
>>       >>>>
>>       >>>> We made a couple of changes to this code when dynamic memory
>>       >>>> allocations were added to DPDK. There were some safeguards
>>      that we
>>       >>>> added to try and make sure this case wouldn't hit, so I'd like to
>>       >>>> make sure you are running on the latest DPDK submodule as well as
>>       >>>> the latest SPDK to narrow down where we need to look.
>>       >>> Unfortunately I can't easily update DPDK because other team
>>       >>> maintains it internally. But if it can be repro and fixed in
>>      latest,
>>       >>> I will try to pull in the fix.
>>       >>>
>>       >>>>
>>       >>>> Thanks,
>>       >>>>
>>       >>>> Seth
>>       >>>>
>>       >>>> -----Original Message-----
>>       >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of JD Zheng
>>       >>>> via SPDK
>>       >>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>       >>>> To: spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>
>>       >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>>
>>       >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>>       >>>> RDMA Memory Regions
>>       >>>>
>>       >>>> Hello,
>>       >>>>
>>       >>>> When I run nvmf_tgt over RDMA using the latest SPDK code, I
>>       >>>> occasionally ran into this error:
>>       >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split
>>       >>>> over multiple RDMA Memory Regions"
>>       >>>>
>>       >>>> After digging into the code, I found that
>>      nvmf_rdma_fill_buffers()
>>       >>>> calls spdk_mem_map_translate() to check if a data buffer sit on 2
>>       >>>> 2MB pages, and if it is the case, it reports this error.
>>       >>>>
>>       >>>> The following commit added change to use data buffer start
>>      address
>>       >>>> to calculate the size between buffer start address and 2MB
>>      boundary.
>>       >>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with
>>       >>>> IO Unit size (which is 8KB in my conf) to determine if the buffer
>>       >>>> passes 2MB boundary.
>>       >>>>
>>       >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>       >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>>      <mailto:dariusz.stojaczyk(a)intel.com>>
>>       >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>       >>>>
>>       >>>>         memory: fix contiguous memory calculation for unaligned
>>       >>>> buffers
>>       >>>>
>>       >>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool
>>      and new
>>       >>>> request will use free buffer from that pool and the buffer start
>>       >>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
>>       >>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB
>>       >>>> in my
>>       >>>> case) either, instead, they are 64Byte aligned so that some
>>      buffers
>>       >>>> will fail the checking and leads to this problem.
>>       >>>>
>>       >>>> The corresponding code snippets are as following:
>>       >>>> spdk_nvmf_transport_create()
>>       >>>> {
>>       >>>> ...
>>       >>>>         transport->data_buf_pool =
>>       >>>> spdk_mempool_create(spdk_mempool_name,
>>       >>>>                                    opts->num_shared_buffers,
>>       >>>>                                    opts->io_unit_size +
>>       >>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>       >>>>
>>        SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>       >>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
>>       >>>> }
>>       >>>>
>>       >>>> Also some debug print I added shows the start address of the
>>      buffers:
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019258800 0(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x2000192557c0 1(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019252780 2(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x20001924f740 3(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x20001924c700 4(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x2000192496c0 5(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019246680 6(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019243640 7(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019240600 8(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x20001923d5c0 9(32)
>>       >>>> ...
>>       >>>>
>>       >>>> It looks like either the buffer allocation has alignment issue or
>>       >>>> the checking is not correct.
>>       >>>>
>>       >>>> Please advise how to fix this problem.
>>       >>>>
>>       >>>> Thanks,
>>       >>>> JD Zheng
>>       >>>> _______________________________________________
>>       >>>> SPDK mailing list
>>       >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>       >>>> https://lists.01.org/mailman/listinfo/spdk
>>       >>>>
>>       >> _______________________________________________
>>       >> SPDK mailing list
>>       >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>       >> https://lists.01.org/mailman/listinfo/spdk
>>       >>
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-19 21:02 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-08-19 21:02 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 29364 bytes --]

Hi Seth,

 > Unfortunately, the only way
 > to protect fully against this happening is by using the DPDK flag
 > --match-allocations which was introduced in DPDK 19.02.

Then I need to use DPDK 19.02. Do I need to enable this flag explicitly 
when moving DPDK 19.02?
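
For reference, --match-allocations is a DPDK EAL command-line option; SPDK's env_dpdk layer should add it automatically when built against DPDK >= 19.02, though that is worth verifying in lib/env_dpdk for the version in use. A minimal sketch, assuming an application that initializes the EAL directly rather than through the SPDK app framework:

#include <rte_eal.h>
#include <stdio.h>

int main(void)
{
    /* Sketch only: pass --match-allocations (DPDK >= 19.02) so hugepages are
     * freed back in exactly the granularity they were allocated in, which
     * keeps a mempool from straddling two memory events. */
    char arg0[] = "nvmf_tgt";
    char arg1[] = "--match-allocations";
    char *eal_argv[] = { arg0, arg1 };
    int eal_argc = 2;

    if (rte_eal_init(eal_argc, eal_argv) < 0) {
        fprintf(stderr, "rte_eal_init failed\n");
        return 1;
    }

    /* ... rest of target setup ... */
    return 0;
}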

 > The good news is that the SPDK target will skip these buffers without
 > bricking, causing data corruption or doing any otherwise bad things.

Unfortunately this is not what I saw. It appeared that SPDK gave up this 
split buffer, but it still causes issue, maybe because it was tried too 
many times(?).

Currently, I have to use DPDK 18.11 so that I added a couple of 
workarounds to prevent the split buffer from being used before reaching 
fill_buffers(). I did a little trick there to call spdk_mempool_get() 
but not mempool_put later, so that this buffer is set as "allocated" in 
mempool and will not be tried again and again. It does look like small 
memory leak though. We can usually see 2-3 split buffers during 
overnight run, btw.

This seems working OK.
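
A minimal sketch of that quarantine trick. The helper names are made up, io_unit_size here is the 8 KiB from the config above, and the 2 MiB-boundary test is only a conservative stand-in for the real failing condition (a buffer whose pages land in two different RDMA MRs):

#include "spdk/env.h"
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define HUGEPAGE_SIZE  0x200000ULL   /* 2 MiB */
#define IOV_ALIGNMENT  0x1000ULL     /* iov_base is aligned up to 4 KiB */

/* Conservative check: would the 4 KiB-aligned data area cross a 2 MiB line? */
static bool
buf_crosses_2mb_boundary(void *buf, size_t io_unit_size)
{
    uint64_t start = ((uintptr_t)buf + IOV_ALIGNMENT - 1) & ~(IOV_ALIGNMENT - 1);
    uint64_t end = start + io_unit_size - 1;

    return (start / HUGEPAGE_SIZE) != (end / HUGEPAGE_SIZE);
}

/* Drain the shared buffer pool once at startup and only return the "good"
 * buffers; split ones stay checked out forever (a deliberate, small leak). */
static void
quarantine_split_buffers(struct spdk_mempool *pool, size_t io_unit_size)
{
    size_t i, count = spdk_mempool_count(pool);
    void **bufs = calloc(count, sizeof(void *));

    if (bufs == NULL) {
        return;
    }
    for (i = 0; i < count; i++) {
        bufs[i] = spdk_mempool_get(pool);
    }
    for (i = 0; i < count; i++) {
        if (bufs[i] != NULL &&
            !buf_crosses_2mb_boundary(bufs[i], io_unit_size)) {
            spdk_mempool_put(pool, bufs[i]);
        }
    }
    free(bufs);
}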

For sure, I will measure performance later.

Thanks,
JD


On 8/19/19 1:12 PM, Howell, Seth wrote:
> Hi JD,
> 
> Thanks for performing that experiment. With this new information, I 
> think we can be pretty much 100% sure that the problem is related to the 
> mempool being split over two DPDK memzones. Unfortunately, the only way 
> to protect fully against this happening is by using the DPDK flag 
> --match-allocations which was introduced in DPDK 19.02. Jim helped 
> advocate for this flag specifically because of this problem with 
> mempools and RPMA.
> 
> The good news is that the SPDK target will skip these buffers without 
> bricking, causing data corruption or doing any otherwise bad things. 
> What ends up happening is that the nvmf_rdma_fill_buffers function will 
> print the error message and then return NULL which will trigger the 
> target to retry the I/O again. By that time, there will be another 
> buffer there for the request to use and it won’t fail the second time 
> around. So the code currently handles the problem in a technically 
> correct way i.e. It’s not going to brick the target or initiator by 
> trying to use a buffer that spans multiple Memory Regions. Instead, it 
> properly recognizes that it is trying to use a bad buffer and 
> reschedules the request buffer parsing.
> 
> However, I am a little worried over the fact that these buffers remain 
> in the mempool and can be repeatedly used by the application. I can 
> picture a scenario where this could possibly have a  performance impact. 
> Take for example a mempool with 128 entries in it in which one of them 
> is split over a memzone. Since this split buffer will never find its way 
> into a request, it’s possible that this split buffer gets pulled up into 
> requests more often than other buffers and subsequently fails in 
> nvmf_rdma_fill_buffers causing requests to have to be rescheduled to the 
> next time the poller runs. Depending on how frequently this happens, the 
> performance impact *could possibly* add up.
> 
> I have as yet been unable to replicate the split buffer error. One thing 
> you could try to see if there is any measurable performance impact is 
> try starting the NVMe-oF target with DPDK legacy memory mode which will 
> move all memory allocations to startup and prevent you from splitting 
> buffers. Then run a benchmark with a lot of connections at high queue 
> depth and see what the performance looks like compared to the dynamic 
> memory model. If there is a significant performance impact, we may have 
> to modify how we handle this error case.
> 
> Thanks,
> 
> Seth
> 
> *From:*JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> *Sent:* Monday, August 12, 2019 4:17 PM
> *To:* Howell, Seth <seth.howell(a)intel.com>
> *Cc:* Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan 
> Richardson <jonathan.richardson(a)broadcom.com>
> *Subject:* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
> RDMA Memory Regions
> 
> + Jonathan
> 
> Hi Seth,
> 
> We finally got chance to test with more logs enabled. You are correct 
> that that problematic buffer does sit on 2 registered memory regions:
> 
> The problematic buffer is "Buffer address: 200019bfeb00", actual used
> buffer pointer is "200019bff000" (SPDK makes it 4KiB aligned), size is
> 8KiB(0x2000) so it does sit on 2 registered memory regions.
> 
> However, it looks like SPDK/DPDK allocates buffers starting from the end
> of a region and going up, and due to the extra room and alignment of each
> buffer there is a chance that one buffer can exceed a memory region
> boundary?
> 
> In this case, the buffers are between 0x200019997800 and 0x200019c5320 
> so that last buffer exceeds one region and goes to next one.
> 
> Some logs for your information:
> 
> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start: 
> 200019800000, memory region length: 400000
> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start: 
> 200019c00000, memory region length: 400000
> 
> ...
> 
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
> 0x200019bfeb00 27(32)
> 
> ...
> 
> rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer
> address: 200019bfeb00 iov_base address 200019bff000
> 
> Thanks,
> 
> JD
> 
> On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com 
> <mailto:seth.howell(a)intel.com>> wrote:
> 
>     There are two different assignments that you need to look at. I'll
>     detail the cases below based on line numbers from the latest master.
> 
>     Memory.c:656 *size = spdk_min(*size, cur_size):
>              This assignment is inside of the conditional "if(size ==
>     NULL || map->ops.are_contiguous == NULL)"
>              So in other words, at the offset, we figure out how much
>     space we have left in the current translation. Then, if there is no
>     callback to tell us whether the next translation will be contiguous
>     to this one, we fill the size variable with the remaining length of
>     that 2 MiB buffer.
> 
>     Memory.c:682 *size = spdk_min(*size, cur_size):
>              This assignment comes after the while loop guarded by the
>     condition "while (cur_size < *size)". This while loop assumes that
>     we have supplied some desired length for our buffer. This is true in
>     the RDMA case. Now this while loop will only break on two
>     conditions. 1. Cur_size becomes larger than *size, or the
>     are_contiguous function returns false, meaning that the two
>     translations cannot be considered together. In the case of the RDMA
>     memory map, the only time are_contiguous returns false is when the
>     two memory regions correspond to two distinct RDMA MRs. Notice that
>     in this case - the one where are_contiguous is defined and we
>     supplied a size variable - the *size variable is not overwritten
>     with cur_size until 1. cur_size is >= *size or 2. The are_contiguous
>     check fails.
> 
>     In the second case detailed above, you can see how one could  pass
>     in a buffer that spanned a 2 MiB page and still get a translation
>     value equal to the size of the buffer. This second case is the one
>     that the rdma.c code should be using since we have a registered
>     are_contiguous function with the NIC and we have supplied a size
>     pointer filled with the length of our buffer.
> 
>     -----Original Message-----
>     From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>]
>     Sent: Thursday, August 1, 2019 2:01 PM
>     To: Howell, Seth <seth.howell(a)intel.com
>     <mailto:seth.howell(a)intel.com>>; Storage Performance Development Kit
>     <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>     Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple RDMA Memory Regions
> 
>     Hi Seth,
> 
>       > Just because a buffer extends past a 2 MiB boundary doesn't mean
>     that it exists in two different Memory Regions. It also won't fail
>     the translation for being over two memory regions.
> 
>     This makes sense. However, spdk_mem_map_translate() does following
>     to calculate translation_len:
> 
>     cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
>     *size = spdk_min(*size, cur_size); // *size is the translation_len
>     from caller nvmf_rdma_fill_buffers()
> 
>     In nvmf_rdma_fill_buffers(),
> 
>     if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>                              SPDK_ERRLOG("Data buffer split over
>     multiple RDMA Memory Regions\n");
>                              return -EINVAL;
>                      }
> 
>     This just checks if buffer sits on 2 2MB pages, not about 2 RDMA
>     memory regions. Is my understanding correct?
> 
>     I still need some time to test. I will update you the result with -s
>     as well.
> 
>     Thanks,
>     JD
> 
> 
>     On 8/1/19 1:28 PM, Howell, Seth wrote:
>      > Hi JD,
>      >
>      > The 2 MiB check is just because we always do memory registrations
>     at at least 2 MiB granularity (the minimum hugepage size). Just
>     because a buffer extends past a 2 MiB boundary doesn't mean that it
>     exists in two different Memory Regions. It also won't fail the
>     translation for being over two memory regions.
>      >
>      > If you look at the definition of spdk_mem_map_translate we call
>     map->ops->are_contiguous every time we cross a 2 MiB boundary. For
>     RDMA, this function is registered to
>     spdk_nvmf_rdma_check_contiguous_entries. IF this function returns
>     true, then even if the buffer crosses a 2 MiB boundary, the
>     translation will still be valid.
>      > The problem you are running into is not related to the buffer
>     alignment, it is related to the fact that the two pages across which
>     the buffer is split are registered to two different MRs in the NIC.
>     This can only happen if those two pages are allocated independently
>     and trigger two distinct memory event callbacks.
>      >
>      > That is why I am so interested in seeing the results from the
>     noticelog above ibv_reg_mr. It will tell me how your target
>     application is allocating memory. Also, when you start the SPDK
>     target, are you using the -s option? Something like
>     ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know
>     if it'll make a difference, it's more of a curiosity thing for me)?
>      >
>      > Thanks,
>      >
>      > Seth
>      >
>      > -----Original Message-----
>      > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>]
>      > Sent: Thursday, August 1, 2019 11:24 AM
>      > To: Howell, Seth <seth.howell(a)intel.com
>     <mailto:seth.howell(a)intel.com>>; Storage Performance
>      > Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>      > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple
>      > RDMA Memory Regions
>      >
>      > Hi Seth,
>      >
>      > Thanks for the detailed description, now I understand the reason
>     behind the checking. But I have a question, why checking against
>     2MiB? Is it because DPDK uses 2MiB page size by default so that one
>     RDMA memory region should not cross 2 pages?
>      >
>      >   > Once I see what your memory registrations look like and what
>     addresses you're failing on, it will help me understand what is
>     going on better.
>      >
>      > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7
>     +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport
>     *rtransport,
>      >                   remaining_length -=
>      > rdma_req->req.iov[iovcnt].iov_len;
>      >
>      >                   if (translation_len <
>     rdma_req->req.iov[iovcnt].iov_len) {
>      > -                       SPDK_ERRLOG("Data buffer split over multiple
>      > RDMA Memory Regions\n");
>      > +                       SPDK_ERRLOG("Data buffer split over multiple
>      > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
>     rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
>     translation_len, rdma_req->req.iov[iovcnt].iov_len);
>      >                           return -EINVAL;
>      >                   }
>      >
>      > With this I can see which buffer failed the checking.
>      > For example, when SPKD initializes the memory pool, one of the
>     buffers starts with 0x2000193feb00, and when failed, I got following:
>      >
>      > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
>      > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376)
>     (8192)
>      >
>      > This buffer has 5376B on one 2MB page and the rest of it
>      > (8192-5376=2816B) is on another page.
>      >
>      > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to
>     use iov base should make it better as iov base is 4KiB aligned. In
>     above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 and
>     it should pass the checking.
>      > However, another buffer in the pool is 0x2000192010c0 and
>     iov_base is 0x200019201000, which would fail the checking because it
>     is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
>      >
>      > I will add the change from
>      > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the
>     test to get more information.
>      >
>      > I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff
>      > -j
>      > 0x90000000:0x20000000 -c 16disk_1ns.conf"
>      >
>      > Thanks,
>      > JD
>      >
>      >
>      > On 8/1/19 7:52 AM, Howell, Seth wrote:
>      >> Hi JD,
>      >>
>      >> I was doing a little bit of digging in the dpdk documentation
>     around this process, and I have a little bit more information. We
>     were pretty worried about the whole dynamic memory allocations thing
>     a few releases ago, so Jim helped add a flag into DPDK that
>     prevented allocations from being allocated and freed in different
>     granularities. This flag also prevents malloc heap allocations from
>     spanning multiple memory events. However, this flag didn't make it
>     into DPDK until 19.02 (More documentation at
>     https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer
>     if you're interested). We have some code in the SPDK environment
>     layer that tries to deal with that (see
>     lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that
>     function is entirely capable of handling the heap allocations
>     spanning multiple memory events part of the problem.
>      >> Since you are using dpdk 18.11, the memory callback inside of
>     lib/env_dpdk looks like a good candidate for our issue. My best
>     guess is that somehow a heap allocation from the buffer mempool is
>     hitting across addresses from two dynamic memory allocation events.
>     I'd still appreciate it if you could send me the information in my
>     last e-mail, but I think we're onto something here.
>      >>
>      >> Thanks,
>      >>
>      >> Seth
>      >>
>      >> -----Original Message-----
>      >> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>     <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of Howell,
>      >> Seth
>      >> Sent: Thursday, August 1, 2019 5:26 AM
>      >> To: JD Zheng <jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>>; Storage Performance
>      >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>      >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple
>      >> RDMA Memory Regions
>      >>
>      >> Hi JD,
>      >>
>      >> Thanks for doing that. Yeah, I am mainly looking to see how the
>     mempool addresses are mapped into the NIC with ibv_reg_mr.
>      >>
>      >> I think it's odd that we are using the buffer base for the memory
>      >> check, we should be using the iov base, but I don't believe that
>      >> would cause the issue you are seeing. Pushed a change to modify
>     that
>      >> behavior anyways though:
>      >> https://review.gerrithub.io/c/spdk/spdk/+/463893
>      >>
>      >> There was one registration that I wasn't able to catch from your
>     last log. Sorry about that, I forgot there wasn’t a debug log for
>     it. Can you try it again with this change which adds noticelogs for
>     the relevant registrations.
>     https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able
>     to run your test without the -Lrdma argument this time to avoid the
>     extra bloat in the logs.
>      >>
>      >> The underlying assumption of the code is that any given object
>     is not going to cross a dynamic memory allocation from DPDK. For a
>     little background, when the mempool gets created, the dpdk code
>     allocates some number of memzones to accommodate those buffer
>     objects. Then it passes those memzones down one at a time and places
>     objects inside the mempool from the given memzone until the memzone
>     is exhausted. Then it goes back and grabs another memzone. This
>     process continues until all objects are accounted for.
>      >> This only works if each memzone corresponds to a single memory
>     event when using dynamic memory allocation. My understanding was
>     that this was always the case, but this error makes me think that
>     it's possible that that's not true.
>      >>
>      >> Once I see what your memory registrations look like and what
>     addresses you're failing on, it will help me understand what is
>     going on better.
>      >>
>      >> Can you also provide the command line you are using to start the
>     nvmf_tgt application and attach your configuration file?
>      >>
>      >> Thanks,
>      >>
>      >> Seth
>      >> -----Original Message-----
>      >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>]
>      >> Sent: Wednesday, July 31, 2019 3:13 PM
>      >> To: Howell, Seth <seth.howell(a)intel.com
>     <mailto:seth.howell(a)intel.com>>; Storage Performance
>      >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>      >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple
>      >> RDMA Memory Regions
>      >>
>      >> Hi Seth,
>      >>
>      >> After I enabled debug and ran nvmf_tgt with -L rdma, I got some
>     logs like:
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x2000084bf000 Length: 40000 LKey: e601
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x200008621000 Length: 10000 LKey: e701
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x200018600000 Length: 1000000 LKey: e801
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x20000847e000 Length: 40000 LKey: e701
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x20000846d000 Length: 10000 LKey: e801
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x200019800000 Length: 1000000 LKey: e901
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x200016ebb000 Length: 40000 LKey: e801
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x20000845c000 Length: 10000 LKey: e901
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x20001aa00000 Length: 1000000 LKey: ea01
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x200016e7a000 Length: 40000 LKey: e901
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x20000844b000 Length: 10000 LKey: ea01
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>      >>
>      >> Is this what you are looking for as memory regions registered for the NIC?
>      >>
>      >> I attached the complete log.
>      >>
>      >> Thanks,
>      >> JD
>      >>
>      >> On 7/30/19 5:28 PM, JD Zheng wrote:
>      >>> Hi Seth,
>      >>>
>      >>> Thanks for the prompt reply!
>      >>>
>      >>> Please find answers inline.
>      >>>
>      >>> JD
>      >>>
>      >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>      >>>> Hi JD,
>      >>>>
>      >>>> Thanks for the report. I want to ask a few questions to start
>      >>>> getting to the bottom of this. Since this issue doesn't currently
>      >>>> reproduce on our per-patch or nightly tests, I would like to
>      >>>> understand what's unique about your setup so that we can
>     replicate
>      >>>> it in a per patch test to prevent future regressions.
>      >>> I am running it on aarch64 platform. I tried x86 platform and I
>     can
>      >>> see same buffer alignment in memory pool but can't run the real
>     test
>      >>> to reproduce it due to other missing pieces.
>      >>>
>      >>>>
>      >>>> What options are you passing when you create the rdma transport?
>      >>>> Are you creating it over RPC or in a configuration file?
>      >>> I am using conf file. Pls let me know if you'd like to look
>     into conf file.
>      >>>
>      >>>>
>      >>>> Are you using the current DPDK submodule as your environment
>      >>>> abstraction layer?
>      >>> No. Our project uses specific version of DPDK, which is v18.11. I
>      >>> did quick test using latest and DPDK submodule on x86, and the
>      >>> buffer alignment is the same, i.e. 64B aligned.
>      >>>
>      >>>>
>      >>>> I notice that your error log is printing from
>      >>>> spdk_nvmf_transport_poll_group_create, which value exactly are
>     you
>      >>>> printing out?
>      >>> Here is patch to add dbg print. Pls note that SPDK version is
>     v19.04
>      >>>
>      >>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>      >>>                                    SPDK_NOTICELOG("Unable to
>     reserve
>      >>> the full number of buffers for the pg buffer cache.\n");
>      >>>                                    break;
>      >>>                            }
>      >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>      >>> group->buf_cache_count, group->buf_cache_size);
>      >>>                            STAILQ_INSERT_HEAD(&group->buf_cache,
>      >>> buf, link);
>      >>>                            group->buf_cache_count++;
>      >>>                    }
>      >>>
>      >>>>
>      >>>> Can you run your target with the -L rdma option to get a dump of
>      >>>> the memory regions registered with the NIC?
>      >>> Let me test and get back to you soon.
>      >>>
>      >>>>
>      >>>> We made a couple of changes to this code when dynamic memory
>      >>>> allocations were added to DPDK. There were some safeguards
>     that we
>      >>>> added to try and make sure this case wouldn't hit, so I'd like to
>      >>>> make sure you are running on the latest DPDK submodule as well as
>      >>>> the latest SPDK to narrow down where we need to look.
>      >>> Unfortunately I can't easily update DPDK because other team
>      >>> maintains it internally. But if it can be repro and fixed in
>     latest,
>      >>> I will try to pull in the fix.
>      >>>
>      >>>>
>      >>>> Thanks,
>      >>>>
>      >>>> Seth
>      >>>>
>      >>>> -----Original Message-----
>      >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>     <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of JD Zheng
>      >>>> via SPDK
>      >>>> Sent: Wednesday, July 31, 2019 3:00 AM
>      >>>> To: spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>
>      >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>>
>      >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>      >>>> RDMA Memory Regions
>      >>>>
>      >>>> Hello,
>      >>>>
>      >>>> When I run nvmf_tgt over RDMA using the latest SPDK code, I
>      >>>> occasionally ran into this error:
>      >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split
>      >>>> over multiple RDMA Memory Regions"
>      >>>>
>      >>>> After digging into the code, I found that
>     nvmf_rdma_fill_buffers()
>      >>>> calls spdk_mem_map_translate() to check if a data buffer sits on 2
>      >>>> 2MB pages, and if it is the case, it reports this error.
>      >>>>
>      >>>> The following commit added change to use data buffer start
>     address
>      >>>> to calculate the size between buffer start address and 2MB
>     boundary.
>      >>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with
>      >>>> IO Unit size (which is 8KB in my conf) to determine if the buffer
>      >>>> passes 2MB boundary.
>      >>>>
>      >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>      >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>     <mailto:dariusz.stojaczyk(a)intel.com>>
>      >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>      >>>>
>      >>>>         memory: fix contiguous memory calculation for unaligned
>      >>>> buffers
>      >>>>
>      >>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool
>     and new
>      >>>> request will use free buffer from that pool and the buffer start
>      >>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
>      >>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB
>      >>>> in my
>      >>>> case) either, instead, they are 64Byte aligned so that some
>     buffers
>      >>>> will fail the checking and leads to this problem.
>      >>>>
>      >>>> The corresponding code snippets are as following:
>      >>>> spdk_nvmf_transport_create()
>      >>>> {
>      >>>> ...
>      >>>>         transport->data_buf_pool =
>      >>>> spdk_mempool_create(spdk_mempool_name,
>      >>>>                                    opts->num_shared_buffers,
>      >>>>                                    opts->io_unit_size +
>      >>>> NVMF_DATA_BUFFER_ALIGNMENT,
>      >>>>                                    SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>      >>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
>      >>>> }
>      >>>>
>      >>>> Also some debug print I added shows the start address of the
>     buffers:
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019258800 0(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x2000192557c0 1(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019252780 2(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x20001924f740 3(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x20001924c700 4(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x2000192496c0 5(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019246680 6(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019243640 7(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019240600 8(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x20001923d5c0 9(32)
>      >>>> ...
>      >>>>
>      >>>> It looks like either the buffer allocation has alignment issue or
>      >>>> the checking is not correct.
>      >>>>
>      >>>> Please advise how to fix this problem.
>      >>>>
>      >>>> Thanks,
>      >>>> JD Zheng
>      >>>> _______________________________________________
>      >>>> SPDK mailing list
>      >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>      >>>> https://lists.01.org/mailman/listinfo/spdk
>      >>>>
>      >> _______________________________________________
>      >> SPDK mailing list
>      >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>      >> https://lists.01.org/mailman/listinfo/spdk
>      >>
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-19 20:12 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-19 20:12 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 24117 bytes --]

Hi JD,

Thanks for performing that experiment. With this new information, I think we can be pretty much 100% sure that the problem is related to the mempool being split over two DPDK memzones. Unfortunately, the only way to protect fully against this happening is by using the DPDK flag --match-allocations which was introduced in DPDK 19.02. Jim helped advocate for this flag specifically because of this problem with mempools and RPMA.

The good news is that the SPDK target will skip these buffers without bricking, causing data corruption or doing any otherwise bad things. What ends up happening is that the nvmf_rdma_fill_buffers function will print the error message and then return NULL which will trigger the target to retry the I/O again. By that time, there will be another buffer there for the request to use and it won’t fail the second time around. So the code currently handles the problem in a technically correct way i.e. It’s not going to brick the target or initiator by trying to use a buffer that spans multiple Memory Regions. Instead, it properly recognizes that it is trying to use a bad buffer and reschedules the request buffer parsing.

However, I am a little worried over the fact that these buffers remain in the mempool and can be repeatedly used by the application. I can picture a scenario where this could possibly have a  performance impact. Take for example a mempool with 128 entries in it in which one of them is split over a memzone. Since this split buffer will never find its way into a request, it’s possible that this split buffer gets pulled up into requests more often than other buffers and subsequently fails in nvmf_rdma_fill_buffers causing requests to have to be rescheduled to the next time the poller runs. Depending on how frequently this happens, the performance impact *could possibly* add up.

I have as yet been unable to replicate the split buffer error. One thing you could try to see if there is any measurable performance impact is try starting the NVMe-oF target with DPDK legacy memory mode which will move all memory allocations to startup and prevent you from splitting buffers. Then run a benchmark with a lot of connections at high queue depth and see what the performance looks like compared to the dynamic memory model. If there is a significant performance impact, we may have to modify how we handle this error case.

Thanks,

Seth

From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
Sent: Monday, August 12, 2019 4:17 PM
To: Howell, Seth <seth.howell(a)intel.com>
Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan Richardson <jonathan.richardson(a)broadcom.com>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

+ Jonathan

Hi Seth,

We finally got chance to test with more logs enabled. You are correct that that problematic buffer does sit on 2 registered memory regions:

The problematic buffer is "Buffer address: 200019bfeb00", actual used buffer pointer is "200019bff000" (SPDK makes it 4KiB aligned), size is 8KiB(0x2000) so it does sit on 2 registered memory regions.

However, it looks like SPDK/DPDK allocates buffers starting from the end of a region and going up, and due to the extra room and alignment of each buffer there is a chance that one buffer can exceed a memory region boundary?

In this case, the buffers are between 0x200019997800 and 0x200019c5320 so that last buffer exceeds one region and goes to next one.

Some logs for your information:

rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start: 200019800000, memory region length: 400000
rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start: 200019c00000, memory region length: 400000
...
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 0x200019bfeb00 27(32)
...
rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer address: 200019bfeb00 iov_base address 200019bff000

Thanks,
JD

On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com<mailto:seth.howell(a)intel.com>> wrote:
There are two different assignments that you need to look at. I'll detail the cases below based on line numbers from the latest master.

Memory.c:656 *size = spdk_min(*size, cur_size):
        This assignment is inside of the conditional "if(size == NULL || map->ops.are_contiguous == NULL)"
        So in other words, at the offset, we figure out how much space we have left in the current translation. Then, if there is no callback to tell us whether the next translation will be contiguous to this one, we fill the size variable with the remaining length of that 2 MiB buffer.

Memory.c:682 *size = spdk_min(*size, cur_size):
        This assignment comes after the while loop guarded by the condition "while (cur_size < *size)". This while loop assumes that we have supplied some desired length for our buffer. This is true in the RDMA case. Now this while loop will only break on two conditions. 1. Cur_size becomes larger than *size, or the are_contiguous function returns false, meaning that the two translations cannot be considered together. In the case of the RDMA memory map, the only time are_contiguous returns false is when the two memory regions correspond to two distinct RDMA MRs. Notice that in this case - the one where are_contiguous is defined and we supplied a size variable - the *size variable is not overwritten with cur_size until 1. cur_size is >= *size or 2. The are_contiguous check fails.

In the second case detailed above, you can see how one could  pass in a buffer that spanned a 2 MiB page and still get a translation value equal to the size of the buffer. This second case is the one that the rdma.c code should be using since we have a registered are_contiguous function with the NIC and we have supplied a size pointer filled with the length of our buffer.
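
A rough sketch of that second case, to make the calling pattern concrete. The helper name is made up, and storing the MR's lkey as the translation value is simplified from what rdma.c actually does:

#include "spdk/env.h"
#include <errno.h>

/* Ask for a translation covering the whole buffer; with an are_contiguous
 * callback registered, *size only shrinks when the buffer really spans two
 * distinct registrations (e.g. two RDMA MRs). */
static int
translate_whole_buffer(const struct spdk_mem_map *map, void *iov_base,
                       size_t iov_len, uint32_t *lkey)
{
    uint64_t translation_len = iov_len;

    *lkey = (uint32_t)spdk_mem_map_translate(map, (uint64_t)(uintptr_t)iov_base,
                                             &translation_len);

    if (translation_len < iov_len) {
        /* Buffer straddles two memory regions: defer and retry the request. */
        return -EINVAL;
    }
    return 0;
}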

-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com<mailto:jiandong.zheng(a)broadcom.com>]
Sent: Thursday, August 1, 2019 2:01 PM
To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

 > Just because a buffer extends past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions. It also won't fail the translation for being over two memory regions.

This makes sense. However, spdk_mem_map_translate() does following to calculate translation_len:

cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
*size = spdk_min(*size, cur_size); // *size is the translation_len from caller nvmf_rdma_fill_buffers()

In nvmf_rdma_fill_buffers(),

if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
                        SPDK_ERRLOG("Data buffer split over multiple RDMA Memory Regions\n");
                        return -EINVAL;
                }

This just checks whether the buffer sits on 2 2MB pages, not whether it spans 2 RDMA memory regions. Is my understanding correct?

I still need some time to test. I will update you the result with -s as well.

Thanks,
JD


On 8/1/19 1:28 PM, Howell, Seth wrote:
> Hi JD,
>
> The 2 MiB check is just because we always do memory registrations at at least 2 MiB granularity (the minimum hugepage size). Just because a buffer extends past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions. It also won't fail the translation for being over two memory regions.
>
> If you look at the definition of spdk_mem_map_translate we call map->ops->are_contiguous every time we cross a 2 MiB boundary. For RDMA, this function is registered to spdk_nvmf_rdma_check_contiguous_entries. IF this function returns true, then even if the buffer crosses a 2 MiB boundary, the translation will still be valid.
> The problem you are running into is not related to the buffer alignment, it is related to the fact that the two pages across which the buffer is split are registered to two different MRs in the NIC. This can only happen if those two pages are allocated independently and trigger two distinct memory event callbacks.
>
> That is why I am so interested in seeing the results from the noticelog above ibv_reg_mr. It will tell me how your target application is allocating memory. Also, when you start the SPDK target, are you using the -s option? Something like ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know if it'll make a difference, it's more of a curiosity thing for me)?
>
> Thanks,
>
> Seth
>
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Thursday, August 1, 2019 11:24 AM
> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance
> Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
> RDMA Memory Regions
>
> Hi Seth,
>
> Thanks for the detailed description, now I understand the reason behind the checking. But I have a question, why checking against 2MiB? Is it because DPDK uses 2MiB page size by default so that one RDMA memory region should not cross 2 pages?
>
>   > Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
>
> I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7 +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport *rtransport,
>                   remaining_length -=
> rdma_req->req.iov[iovcnt].iov_len;
>
>                   if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
> -                       SPDK_ERRLOG("Data buffer split over multiple
> RDMA Memory Regions\n");
> +                       SPDK_ERRLOG("Data buffer split over multiple
> RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n", rdma_req->buffers[iovcnt], iovcnt, length, remaining_length, translation_len, rdma_req->req.iov[iovcnt].iov_len);
>                           return -EINVAL;
>                   }
>
> With this I can see which buffer failed the checking.
> For example, when SPDK initializes the memory pool, one of the buffers starts with 0x2000193feb00, and when it failed, I got the following:
>
> rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
> multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376) (8192)
>
> This buffer has 5376B on one 2MB page and the rest of it
> (8192-5376=2816B) is on another page.
>
> The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to use iov base should make it better as iov base is 4KiB aligned. In the above case, iov_base is 0x2000193feb00 & ~0xfff = 0x2000193fe000 and it should pass the checking.
> However, another buffer in the pool is 0x2000192010c0 and iov_base is 0x200019201000, which would fail the checking because it is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
>
> I will add the change from
> https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the test to get more information.
>
> I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff
> -j
> 0x90000000:0x20000000 -c 16disk_1ns.conf"
>
> Thanks,
> JD
>
>
> On 8/1/19 7:52 AM, Howell, Seth wrote:
>> Hi JD,
>>
>> I was doing a little bit of digging in the dpdk documentation around this process, and I have a little bit more information. We were pretty worried about the whole dynamic memory allocations thing a few releases ago, so Jim helped add a flag into DPDK that prevented allocations from being allocated and freed in different granularities. This flag also prevents malloc heap allocations from spanning multiple memory events. However, this flag didn't make it into DPDK until 19.02 (More documentation at https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer if you're interested). We have some code in the SPDK environment layer that tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that function is entirely capable of handling the heap allocations spanning multiple memory events part of the problem.
>> Since you are using dpdk 18.11, the memory callback inside of lib/env_dpdk looks like a good candidate for our issue. My best guess is that somehow a heap allocation from the buffer mempool is hitting across addresses from two dynamic memory allocation events. I'd still appreciate it if you could send me the information in my last e-mail, but I think we're onto something here.
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell,
>> Seth
>> Sent: Thursday, August 1, 2019 5:26 AM
>> To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance
>> Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>> RDMA Memory Regions
>>
>> Hi JD,
>>
>> Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.
>>
>> I think it's odd that we are using the buffer base for the memory
>> check, we should be using the iov base, but I don't believe that
>> would cause the issue you are seeing. Pushed a change to modify that
>> behavior anyways though:
>> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>
>> There was one registration that I wasn't able to catch from your last log. Sorry about that, I forgot there wasn’t a debug log for it. Can you try it again with this change which adds noticelogs for the relevant registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -Lrdma argument this time to avoid the extra bloat in the logs.
>>
>> The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the dpdk code allocates some number of memzones to accommodate those buffer objects. Then it passes those memzones down one at a time and places objects inside the mempool from the given memzone until the memzone is exhausted. Then it goes back and grabs another memzone. This process continues until all objects are accounted for.
>> This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that it's possible that that's not true.
>>
>> Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
>>
>> Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?
>>
>> Thanks,
>>
>> Seth
>> -----Original Message-----
>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> Sent: Wednesday, July 31, 2019 3:13 PM
>> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance
>> Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>> RDMA Memory Regions
>>
>> Hi Seth,
>>
>> After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x2000084bf000 Length: 40000 LKey: e601
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x200008621000 Length: 10000 LKey: e701
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x200018600000 Length: 1000000 LKey: e801
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x20000847e000 Length: 40000 LKey: e701
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000846d000 Length: 10000 LKey: e801
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x200019800000 Length: 1000000 LKey: e901
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x200016ebb000 Length: 40000 LKey: e801
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000845c000 Length: 10000 LKey: e901
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x20001aa00000 Length: 1000000 LKey: ea01
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x200016e7a000 Length: 40000 LKey: e901
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000844b000 Length: 10000 LKey: ea01
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>
>> Is this what you are looking for as the memory regions registered for the NIC?
>>
>> I attached the complete log.
>>
>> Thanks,
>> JD
>>
>> On 7/30/19 5:28 PM, JD Zheng wrote:
>>> Hi Seth,
>>>
>>> Thanks for the prompt reply!
>>>
>>> Please find answers inline.
>>>
>>> JD
>>>
>>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>>> Hi JD,
>>>>
>>>> Thanks for the report. I want to ask a few questions to start
>>>> getting to the bottom of this. Since this issue doesn't currently
>>>> reproduce on our per-patch or nightly tests, I would like to
>>>> understand what's unique about your setup so that we can replicate
>>>> it in a per patch test to prevent future regressions.
>>> I am running it on aarch64 platform. I tried x86 platform and I can
>>> see same buffer alignment in memory pool but can't run the real test
>>> to reproduce it due to other missing pieces.
>>>
>>>>
>>>> What options are you passing when you create the rdma transport?
>>>> Are you creating it over RPC or in a configuration file?
>>> I am using conf file. Pls let me know if you'd like to look into conf file.
>>>
>>>>
>>>> Are you using the current DPDK submodule as your environment
>>>> abstraction layer?
>>> No. Our project uses specific version of DPDK, which is v18.11. I
>>> did quick test using latest and DPDK submodule on x86, and the
>>> buffer alignment is the same, i.e. 64B aligned.
>>>
>>>>
>>>> I notice that your error log is printing from
>>>> spdk_nvmf_transport_poll_group_create, which value exactly are you
>>>> printing out?
>>> Here is patch to add dbg print. Pls note that SPDK version is v19.04
>>>
>>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>>                                    SPDK_NOTICELOG("Unable to reserve
>>> the full number of buffers for the pg buffer cache.\n");
>>>                                    break;
>>>                            }
>>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>> group->buf_cache_count, group->buf_cache_size);
>>>                            STAILQ_INSERT_HEAD(&group->buf_cache,
>>> buf, link);
>>>                            group->buf_cache_count++;
>>>                    }
>>>
>>>>
>>>> Can you run your target with the -L rdma option to get a dump of
>>>> the memory regions registered with the NIC?
>>> Let me test and get back to you soon.
>>>
>>>>
>>>> We made a couple of changes to this code when dynamic memory
>>>> allocations were added to DPDK. There were some safeguards that we
>>>> added to try and make sure this case wouldn't hit, so I'd like to
>>>> make sure you are running on the latest DPDK submodule as well as
>>>> the latest SPDK to narrow down where we need to look.
>>> Unfortunately I can't easily update DPDK because other team
>>> maintains it internally. But if it can be repro and fixed in latest,
>>> I will try to pull in the fix.
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Seth
>>>>
>>>> -----Original Message-----
>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng
>>>> via SPDK
>>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>>> To: spdk(a)lists.01.org
>>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>>>> RDMA Memory Regions
>>>>
>>>> Hello,
>>>>
>>>> When I run nvmf_tgt over RDMA using latest SPDK code, I
>>>> occasionally ran into this error:
>>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split
>>>> over multiple RDMA Memory Regions"
>>>>
>>>> After digging into the code, I found that nvmf_rdma_fill_buffers()
>>>> calls spdk_mem_map_translate() to check if a data buffer sits on 2
>>>> 2MB pages, and if it is the case, it reports this error.
>>>>
>>>> The following commit added change to use data buffer start address
>>>> to calculate the size between buffer start address and 2MB boundary.
>>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with
>>>> IO Unit size (which is 8KB in my conf) to determine if the buffer
>>>> passes 2MB boundary.
>>>>
>>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>>
>>>>         memory: fix contiguous memory calculation for unaligned
>>>> buffers
>>>>
>>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new
>>>> request will use free buffer from that pool and the buffer start
>>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
>>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB
>>>> in my
>>>> case) either, instead, they are 64Byte aligned so that some buffers
>>>> will fail the checking and lead to this problem.
>>>>
>>>> The corresponding code snippets are as following:
>>>> spdk_nvmf_transport_create()
>>>> {
>>>> ...
>>>>         transport->data_buf_pool =
>>>> spdk_mempool_create(spdk_mempool_name,
>>>>                                    opts->num_shared_buffers,
>>>>                                    opts->io_unit_size +
>>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>>                                    SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
>>>> }
>>>>
>>>> Also some debug print I added shows the start address of the buffers:
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019258800 0(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x2000192557c0 1(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019252780 2(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001924f740 3(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001924c700 4(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x2000192496c0 5(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019246680 6(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019243640 7(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019240600 8(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001923d5c0 9(32)
>>>> ...
>>>>
>>>> It looks like either the buffer allocation has an alignment issue or
>>>> the checking is not correct.
>>>>
>>>> Please advise how to fix this problem.
>>>>
>>>> Thanks,
>>>> JD Zheng
>>>> _______________________________________________
>>>> SPDK mailing list
>>>> SPDK(a)lists.01.org
>>>> https://lists.01.org/mailman/listinfo/spdk
>>>>
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-12 23:17 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-08-12 23:17 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 21726 bytes --]

+ Jonathan

Hi Seth,

We finally got a chance to test with more logs enabled. You are correct
that the problematic buffer does sit on 2 registered memory regions:

The problematic buffer is "Buffer address: 200019bfeb00", the actual used
buffer pointer is "200019bff000" (SPDK makes it 4KiB aligned), and its size
is 8KiB (0x2000), so it does sit on 2 registered memory regions.

However, it looks like SPDK/DPDK allocates buffers starting from the end of
a region and going up, and due to the extra room and alignment of each
buffer there is a chance that one buffer can exceed the memory region
boundary?

In this case, the buffers are between 0x200019997800 and 0x200019c5320, so
the last buffer exceeds one region and goes into the next one.

Some logs for your information:

rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
200019800000, memory region length: 400000
rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
200019c00000, memory region length: 400000
...
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
0x200019bfeb00 27(32)
...
rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer address:
200019bfeb00 iov_base
address 200019bff000

Thanks,
JD

On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com> wrote:

> There are two different assignments that you need to look at. I'll detail
> the cases below based on line numbers from the latest master.
>
> Memory.c:656 *size = spdk_min(*size, cur_size):
>         This assignment is inside of the conditional "if(size == NULL ||
> map->ops.are_contiguous == NULL)"
>         So in other words, at the offset, we figure out how much space we
> have left in the current translation. Then, if there is no callback to tell
> us whether the next translation will be contiguous to this one, we fill the
> size variable with the remaining length of that 2 MiB buffer.
>
> Memory.c:682 *size = spdk_min(*size, cur_size):
>         This assignment comes after the while loop guarded by the
> condition "while (cur_size < *size)". This while loop assumes that we have
> supplied some desired length for our buffer. This is true in the RDMA case.
> Now this while loop will only break on two conditions: 1. cur_size becomes
> larger than *size, or 2. the are_contiguous function returns false, meaning
> that the two translations cannot be considered together. In the case of the
> RDMA memory map, the only time are_contiguous returns false is when the two
> memory regions correspond to two distinct RDMA MRs. Notice that in this
> case - the one where are_contiguous is defined and we supplied a size
> variable - the *size variable is not overwritten with cur_size until 1.
> cur_size is >= *size or 2. The are_contiguous check fails.
>
> In the second case detailed above, you can see how one could  pass in a
> buffer that spanned a 2 MiB page and still get a translation value equal to
> the size of the buffer. This second case is the one that the rdma.c code
> should be using since we have a registered are_contiguous function with the
> NIC and we have supplied a size pointer filled with the length of our
> buffer.
>
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Thursday, August 1, 2019 2:01 PM
> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development
> Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA
> Memory Regions
>
> Hi Seth,
>
>  > Just because a buffer extends past a 2 MiB boundary doesn't mean that
> it exists in two different Memory Regions. It also won't fail the
> translation for being over two memory regions.
>
> This makes sense. However, spdk_mem_map_translate() does following to
> calculate translation_len:
>
> cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
> *size = spdk_min(*size, cur_size); // *size is the translation_len from
> caller nvmf_rdma_fill_buffers()
>
> In nvmf_rdma_fill_buffers(),
>
> if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>                         SPDK_ERRLOG("Data buffer split over multiple RDMA
> Memory Regions\n");
>                         return -EINVAL;
>                 }
>
> This just checks whether the buffer sits on 2 2MB pages, not whether it
> spans 2 RDMA memory regions. Is my understanding correct?
>
> I still need some time to test. I will update you the result with -s as
> well.
>
> Thanks,
> JD
>
>
> On 8/1/19 1:28 PM, Howell, Seth wrote:
> > Hi JD,
> >
> > The 2 MiB check is just because we always do memory registrations at at
> least 2 MiB granularity (the minimum hugepage size). Just because a buffer
> extends past a 2 MiB boundary doesn't mean that it exists in two different
> Memory Regions. It also won't fail the translation for being over two
> memory regions.
> >
> > If you look at the definition of spdk_mem_map_translate we call
> map->ops->are_contiguous every time we cross a 2 MiB boundary. For RDMA,
> this function is registered to spdk_nvmf_rdma_check_contiguous_entries. IF
> this function returns true, then even if the buffer crosses a 2 MiB
> boundary, the translation will still be valid.
> > The problem you are running into is not related to the buffer alignment,
> it is related to the fact that the two pages across which the buffer is
> split are registered to two different MRs in the NIC. This can only happen
> if those two pages are allocated independently and trigger two distinct
> memory event callbacks.
> >
> > That is why I am so interested in seeing the results from the noticelog
> above ibv_reg_mr. It will tell me how your target application is allocating
> memory. Also, when you start the SPDK target, are you using the -s option?
> Something like ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I
> don't know if it'll make a difference, it's more of a curiosity thing for
> me)?
> >
> > Thanks,
> >
> > Seth
> >
> > -----Original Message-----
> > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> > Sent: Thursday, August 1, 2019 11:24 AM
> > To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance
> > Development Kit <spdk(a)lists.01.org>
> > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
> > RDMA Memory Regions
> >
> > Hi Seth,
> >
> > Thanks for the detailed description, now I understand the reason behind
> the checking. But I have a question, why checking against 2MiB? Is it
> because DPDK uses 2MiB page size by default so that one RDMA memory region
> should not cross 2 pages?
> >
> >   > Once I see what your memory registrations look like and what
> addresses you're failing on, it will help me understand what is going on
> better.
> >
> > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7 +1503,11 @@
> nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport *rtransport,
> >                   remaining_length -=
> > rdma_req->req.iov[iovcnt].iov_len;
> >
> >                   if (translation_len <
> rdma_req->req.iov[iovcnt].iov_len) {
> > -                       SPDK_ERRLOG("Data buffer split over multiple
> > RDMA Memory Regions\n");
> > +                       SPDK_ERRLOG("Data buffer split over multiple
> > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
> rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
> translation_len, rdma_req->req.iov[iovcnt].iov_len);
> >                           return -EINVAL;
> >                   }
> >
> > With this I can see which buffer failed the checking.
> > For example, when SPDK initializes the memory pool, one of the buffers
> starts with 0x2000193feb00, and when it failed, I got the following:
> >
> > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
> > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376) (8192)
> >
> > This buffer has 5376B on one 2MB page and the rest of it
> > (8192-5376=2816B) is on another page.
> >
> > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to use iov
> base should make it better as iov base is 4KiB aligned. In the above case,
> iov_base is 0x2000193feb00 & ~0xfff = 0x2000193fe000 and it should pass the
> checking.
> > However, another buffer in the pool is 0x2000192010c0 and iov_base is
> 0x200019201000, which would fail the checking because it is only 4KiB to
> 2MB boundary and IOUnitSize is 8KiB.
> >
> > I will add the change from
> > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the test to
> get more information.
> >
> > I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff
> > -j
> > 0x90000000:0x20000000 -c 16disk_1ns.conf"
> >
> > Thanks,
> > JD
> >
> >
> > On 8/1/19 7:52 AM, Howell, Seth wrote:
> >> Hi JD,
> >>
> >> I was doing a little bit of digging in the dpdk documentation around
> this process, and I have a little bit more information. We were pretty
> worried about the whole dynamic memory allocations thing a few releases
> ago, so Jim helped add a flag into DPDK that prevented allocations from
> being allocated and freed in different granularities. This flag also
> prevents malloc heap allocations from spanning multiple memory events.
> However, this flag didn't make it into DPDK until 19.02 (More documentation
> at
> https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer
> if you're interested). We have some code in the SPDK environment layer that
> tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb) but I
> don't know that that function is entirely capable of handling the heap
> allocations spanning multiple memory events part of the problem.
> >> Since you are using dpdk 18.11, the memory callback inside of
> lib/env_dpdk looks like a good candidate for our issue. My best guess is
> that somehow a heap allocation from the buffer mempool is hitting across
> addresses from two dynamic memory allocation events. I'd still appreciate
> it if you could send me the information in my last e-mail, but I think
> we're onto something here.
> >>
> >> Thanks,
> >>
> >> Seth
> >>
> >> -----Original Message-----
> >> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell,
> >> Seth
> >> Sent: Thursday, August 1, 2019 5:26 AM
> >> To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance
> >> Development Kit <spdk(a)lists.01.org>
> >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
> >> RDMA Memory Regions
> >>
> >> Hi JD,
> >>
> >> Thanks for doing that. Yeah, I am mainly looking to see how the mempool
> addresses are mapped into the NIC with ibv_reg_mr.
> >>
> >> I think it's odd that we are using the buffer base for the memory
> >> check, we should be using the iov base, but I don't believe that
> >> would cause the issue you are seeing. Pushed a change to modify that
> >> behavior anyways though:
> >> https://review.gerrithub.io/c/spdk/spdk/+/463893
> >>
> >> There was one registration that I wasn't able to catch from your last
> log. Sorry about that, I forgot there wasn’t a debug log for it. Can you
> try it again with this change which adds noticelogs for the relevant
> registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You
> should be able to run your test without the -Lrdma argument this time to
> avoid the extra bloat in the logs.
> >>
> >> The underlying assumption of the code is that any given object is not
> going to cross a dynamic memory allocation from DPDK. For a little
> background, when the mempool gets created, the dpdk code allocates some
> number of memzones to accommodate those buffer objects. Then it passes
> those memzones down one at a time and places objects inside the mempool
> from the given memzone until the memzone is exhausted. Then it goes back
> and grabs another memzone. This process continues until all objects are
> accounted for.
> >> This only works if each memzone corresponds to a single memory event
> when using dynamic memory allocation. My understanding was that this was
> always the case, but this error makes me think that it's possible that
> that's not true.
> >>
> >> Once I see what your memory registrations look like and what addresses
> you're failing on, it will help me understand what is going on better.
> >>
> >> Can you also provide the command line you are using to start the
> nvmf_tgt application and attach your configuration file?
> >>
> >> Thanks,
> >>
> >> Seth
> >> -----Original Message-----
> >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> >> Sent: Wednesday, July 31, 2019 3:13 PM
> >> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance
> >> Development Kit <spdk(a)lists.01.org>
> >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
> >> RDMA Memory Regions
> >>
> >> Hi Seth,
> >>
> >> After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs
> like:
> >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> >> 0x2000084bf000 Length: 40000 LKey: e601
> >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> >> 0x200008621000 Length: 10000 LKey: e701
> >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> >> 0x200018600000 Length: 1000000 LKey: e801
> >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> >> 0x20000847e000 Length: 40000 LKey: e701
> >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> >> 0x20000846d000 Length: 10000 LKey: e801
> >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> >> 0x200019800000 Length: 1000000 LKey: e901
> >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> >> 0x200016ebb000 Length: 40000 LKey: e801
> >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> >> 0x20000845c000 Length: 10000 LKey: e901
> >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> >> 0x20001aa00000 Length: 1000000 LKey: ea01
> >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> >> 0x200016e7a000 Length: 40000 LKey: e901
> >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> >> 0x20000844b000 Length: 10000 LKey: ea01
> >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
> >>
> >> Is this what you are looking for as the memory regions registered for the NIC?
> >>
> >> I attached the complete log.
> >>
> >> Thanks,
> >> JD
> >>
> >> On 7/30/19 5:28 PM, JD Zheng wrote:
> >>> Hi Seth,
> >>>
> >>> Thanks for the prompt reply!
> >>>
> >>> Please find answers inline.
> >>>
> >>> JD
> >>>
> >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
> >>>> Hi JD,
> >>>>
> >>>> Thanks for the report. I want to ask a few questions to start
> >>>> getting to the bottom of this. Since this issue doesn't currently
> >>>> reproduce on our per-patch or nightly tests, I would like to
> >>>> understand what's unique about your setup so that we can replicate
> >>>> it in a per patch test to prevent future regressions.
> >>> I am running it on aarch64 platform. I tried x86 platform and I can
> >>> see same buffer alignment in memory pool but can't run the real test
> >>> to reproduce it due to other missing pieces.
> >>>
> >>>>
> >>>> What options are you passing when you create the rdma transport?
> >>>> Are you creating it over RPC or in a configuration file?
> >>> I am using conf file. Pls let me know if you'd like to look into conf
> file.
> >>>
> >>>>
> >>>> Are you using the current DPDK submodule as your environment
> >>>> abstraction layer?
> >>> No. Our project uses specific version of DPDK, which is v18.11. I
> >>> did quick test using latest and DPDK submodule on x86, and the
> >>> buffer alignment is the same, i.e. 64B aligned.
> >>>
> >>>>
> >>>> I notice that your error log is printing from
> >>>> spdk_nvmf_transport_poll_group_create, which value exactly are you
> >>>> printing out?
> >>> Here is patch to add dbg print. Pls note that SPDK version is v19.04
> >>>
> >>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
> >>>                                    SPDK_NOTICELOG("Unable to reserve
> >>> the full number of buffers for the pg buffer cache.\n");
> >>>                                    break;
> >>>                            }
> >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
> >>> group->buf_cache_count, group->buf_cache_size);
> >>>                            STAILQ_INSERT_HEAD(&group->buf_cache,
> >>> buf, link);
> >>>                            group->buf_cache_count++;
> >>>                    }
> >>>
> >>>>
> >>>> Can you run your target with the -L rdma option to get a dump of
> >>>> the memory regions registered with the NIC?
> >>> Let me test and get back to you soon.
> >>>
> >>>>
> >>>> We made a couple of changes to this code when dynamic memory
> >>>> allocations were added to DPDK. There were some safeguards that we
> >>>> added to try and make sure this case wouldn't hit, so I'd like to
> >>>> make sure you are running on the latest DPDK submodule as well as
> >>>> the latest SPDK to narrow down where we need to look.
> >>> Unfortunately I can't easily update DPDK because other team
> >>> maintains it internally. But if it can be repro and fixed in latest,
> >>> I will try to pull in the fix.
> >>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Seth
> >>>>
> >>>> -----Original Message-----
> >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng
> >>>> via SPDK
> >>>> Sent: Wednesday, July 31, 2019 3:00 AM
> >>>> To: spdk(a)lists.01.org
> >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
> >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
> >>>> RDMA Memory Regions
> >>>>
> >>>> Hello,
> >>>>
> >>>> When I run nvmf_tgt over RDMA using latest SPDK code, I
> >>>> occasionally ran into this error:
> >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split
> >>>> over multiple RDMA Memory Regions"
> >>>>
> >>>> After digging into the code, I found that nvmf_rdma_fill_buffers()
> >>>> calls spdk_mem_map_translate() to check if a data buffer sits on 2
> >>>> 2MB pages, and if it is the case, it reports this error.
> >>>>
> >>>> The following commit added change to use data buffer start address
> >>>> to calculate the size between buffer start address and 2MB boundary.
> >>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with
> >>>> IO Unit size (which is 8KB in my conf) to determine if the buffer
> >>>> passes 2MB boundary.
> >>>>
> >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
> >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
> >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
> >>>>
> >>>>         memory: fix contiguous memory calculation for unaligned
> >>>> buffers
> >>>>
> >>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new
> >>>> request will use free buffer from that pool and the buffer start
> >>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
> >>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB
> >>>> in my
> >>>> case) either, instead, they are 64Byte aligned so that some buffers
> >>>> will fail the checking and lead to this problem.
> >>>>
> >>>> The corresponding code snippets are as following:
> >>>> spdk_nvmf_transport_create()
> >>>> {
> >>>> ...
> >>>>         transport->data_buf_pool =
> >>>> spdk_mempool_create(spdk_mempool_name,
> >>>>                                    opts->num_shared_buffers,
> >>>>                                    opts->io_unit_size +
> >>>> NVMF_DATA_BUFFER_ALIGNMENT,
> >>>>                                    SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
> >>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
> >>>> }
> >>>>
> >>>> Also some debug print I added shows the start address of the buffers:
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x200019258800 0(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x2000192557c0 1(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x200019252780 2(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x20001924f740 3(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x20001924c700 4(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x2000192496c0 5(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x200019246680 6(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x200019243640 7(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x200019240600 8(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x20001923d5c0 9(32)
> >>>> ...
> >>>>
> >>>> It looks like either the buffer allocation has an alignment issue or
> >>>> the checking is not correct.
> >>>>
> >>>> Please advise how to fix this problem.
> >>>>
> >>>> Thanks,
> >>>> JD Zheng
> >>>> _______________________________________________
> >>>> SPDK mailing list
> >>>> SPDK(a)lists.01.org
> >>>> https://lists.01.org/mailman/listinfo/spdk
> >>>>
> >> _______________________________________________
> >> SPDK mailing list
> >> SPDK(a)lists.01.org
> >> https://lists.01.org/mailman/listinfo/spdk
> >>
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-01 21:22 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-01 21:22 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 19668 bytes --]

There are two different assignments that you need to look at. I'll detail the cases below based on line numbers from the latest master.

Memory.c:656 *size = spdk_min(*size, cur_size):
	This assignment is inside of the conditional "if(size == NULL || map->ops.are_contiguous == NULL)"
	So in other words, at the offset, we figure out how much space we have left in the current translation. Then, if there is no callback to tell us whether the next translation will be contiguous to this one, we fill the size variable with the remaining length of that 2 MiB buffer.

Memory.c:682 *size = spdk_min(*size, cur_size):
	This assignment comes after the while loop guarded by the condition "while (cur_size < *size)". This while loop assumes that we have supplied some desired length for our buffer. This is true in the RDMA case. Now this while loop will only break on two conditions: 1. cur_size becomes larger than *size, or 2. the are_contiguous function returns false, meaning that the two translations cannot be considered together. In the case of the RDMA memory map, the only time are_contiguous returns false is when the two memory regions correspond to two distinct RDMA MRs. Notice that in this case - the one where are_contiguous is defined and we supplied a size variable - the *size variable is not overwritten with cur_size until 1. cur_size is >= *size or 2. The are_contiguous check fails.

In the second case detailed above, you can see how one could  pass in a buffer that spanned a 2 MiB page and still get a translation value equal to the size of the buffer. This second case is the one that the rdma.c code should be using since we have a registered are_contiguous function with the NIC and we have supplied a size pointer filled with the length of our buffer.
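
To make those two cases a little more concrete, here is a simplified, self-contained paraphrase of that logic (illustration only, not the actual memory.c source; the 4 MiB "registration" size, the lookup helper and the example addresses are all invented for the sketch):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_2MB 0x200000ULL
#define SKETCH_MR  0x400000ULL	/* pretend every registration covers 4 MiB */

/* Stand-ins for the real mem map internals: the "translation" of a page is
 * just the id of the 4 MiB registration it lives in. */
static uint64_t lookup_translation(uint64_t vaddr) { return vaddr / SKETCH_MR; }
static int are_contiguous(uint64_t a, uint64_t b) { return a == b; }

static void
translate_sketch(uint64_t vaddr, uint64_t *size)
{
	uint64_t cur_size = SKETCH_2MB - (vaddr & (SKETCH_2MB - 1));
	uint64_t prev = lookup_translation(vaddr);

	/* Case 2 above: a size and an are_contiguous callback are supplied,
	 * so keep extending across 2 MiB pages while the neighbouring
	 * translations resolve to the same registration.  Case 1 would just
	 * clip *size to cur_size immediately. */
	while (cur_size < *size) {
		uint64_t next = lookup_translation(vaddr + cur_size);

		if (!are_contiguous(prev, next)) {
			break;	/* distinct registrations: stop here */
		}
		cur_size += SKETCH_2MB;
		prev = next;
	}
	if (cur_size < *size) {
		*size = cur_size;
	}
}

int main(void)
{
	uint64_t len_ok = 0x2000, len_split = 0x2000;

	translate_sketch(0x2000199ff000ULL, &len_ok);	 /* crosses a 2 MiB page inside one registration */
	translate_sketch(0x200019bff000ULL, &len_split); /* crosses into the next registration */
	printf("same registration: %" PRIu64 " bytes, split: %" PRIu64 " bytes\n",
	       len_ok, len_split);
	return 0;
}

With these example values, the first call keeps the full 8KiB translation even though the buffer crosses a 2 MiB boundary, while the second call is clipped to 4KiB at the registration boundary, which is exactly the case the error message flags.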

-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com] 
Sent: Thursday, August 1, 2019 2:01 PM
To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

 > Just because a buffer extends past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions. It also won't fail the translation for being over two memory regions.

This makes sense. However, spdk_mem_map_translate() does following to calculate translation_len:

cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
*size = spdk_min(*size, cur_size); // *size is the translation_len from caller nvmf_rdma_fill_buffers()

In nvmf_rdma_fill_buffers(),

if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
			SPDK_ERRLOG("Data buffer split over multiple RDMA Memory Regions\n");
			return -EINVAL;
		}

This just checks if the buffer sits on 2 2MB pages, not whether it spans 2 RDMA memory regions. Is my understanding correct?

I still need some time to test. I will update you the result with -s as well.

Thanks,
JD


On 8/1/19 1:28 PM, Howell, Seth wrote:
> Hi JD,
> 
> The 2 MiB check is just because we always do memory registrations at at least 2 MiB granularity (the minimum hugepage size). Just because a buffer extends past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions. It also won't fail the translation for being over two memory regions.
> 
> If you look at the definition of spdk_mem_map_translate we call map->ops->are_contiguous every time we cross a 2 MiB boundary. For RDMA, this function is registered to spdk_nvmf_rdma_check_contiguous_entries. IF this function returns true, then even if the buffer crosses a 2 MiB boundary, the translation will still be valid.
> The problem you are running into is not related to the buffer alignment, it is related to the fact that the two pages across which the buffer is split are registered to two different MRs in the NIC. This can only happen if those two pages are allocated independently and trigger two distinct memory event callbacks.
> 
> That is why I am so interested in seeing the results from the noticelog above ibv_reg_mr. It will tell me how your target application is allocating memory. Also, when you start the SPDK target, are you using the -s option? Something like ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know if it'll make a difference, it's more of a curiosity thing for me)?
> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Thursday, August 1, 2019 11:24 AM
> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance 
> Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
> RDMA Memory Regions
> 
> Hi Seth,
> 
> Thanks for the detailed description, now I understand the reason behind the checking. But I have a question, why checking against 2MiB? Is it because DPDK uses 2MiB page size by default so that one RDMA memory region should not cross 2 pages?
> 
>   > Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
> 
> I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7 +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport *rtransport,
>                   remaining_length -= 
> rdma_req->req.iov[iovcnt].iov_len;
> 
>                   if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
> -                       SPDK_ERRLOG("Data buffer split over multiple
> RDMA Memory Regions\n");
> +                       SPDK_ERRLOG("Data buffer split over multiple
> RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n", rdma_req->buffers[iovcnt], iovcnt, length, remaining_length, translation_len, rdma_req->req.iov[iovcnt].iov_len);
>                           return -EINVAL;
>                   }
> 
> With this I can see which buffer failed the checking.
> For example, when SPDK initializes the memory pool, one of the buffers starts with 0x2000193feb00, and when it failed, I got the following:
> 
> rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over 
> multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376) (8192)
> 
> This buffer has 5376B on one 2MB page and the rest of it
> (8192-5376=2816B) is on another page.
> 
> The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to use iov base should make it better as iov base is 4KiB aligned. In the above case, iov_base is 0x2000193feb00 & ~0xfff = 0x2000193fe000 and it should pass the checking.
> However, another buffer in the pool is 0x2000192010c0 and iov_base is 0x200019201000, which would fail the checking because it is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
> 
> I will add the change from
> https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the test to get more information.
> 
> I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff 
> -j
> 0x90000000:0x20000000 -c 16disk_1ns.conf"
> 
> Thanks,
> JD
> 
> 
> On 8/1/19 7:52 AM, Howell, Seth wrote:
>> Hi JD,
>>
>> I was doing a little bit of digging in the dpdk documentation around this process, and I have a little bit more information. We were pretty worried about the whole dynamic memory allocations thing a few releases ago, so Jim helped add a flag into DPDK that prevented allocations from being allocated and freed in different granularities. This flag also prevents malloc heap allocations from spanning multiple memory events. However, this flag didn't make it into DPDK until 19.02 (More documentation at https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer if you're interested). We have some code in the SPDK environment layer that tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that function is entirely capable of handling the heap allocations spanning multiple memory events part of the problem.
>> Since you are using dpdk 18.11, the memory callback inside of lib/env_dpdk looks like a good candidate for our issue. My best guess is that somehow a heap allocation from the buffer mempool is hitting across addresses from two dynamic memory allocation events. I'd still appreciate it if you could send me the information in my last e-mail, but I think we're onto something here.
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell, 
>> Seth
>> Sent: Thursday, August 1, 2019 5:26 AM
>> To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance 
>> Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>> RDMA Memory Regions
>>
>> Hi JD,
>>
>> Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.
>>
>> I think it's odd that we are using the buffer base for the memory 
>> check, we should be using the iov base, but I don't believe that 
>> would cause the issue you are seeing. Pushed a change to modify that 
>> behavior anyways though:
>> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>
>> There was one registration that I wasn't able to catch from your last log. Sorry about that, I forgot there wasn’t a debug log for it. Can you try it again with this change which adds noticelogs for the relevant registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -Lrdma argument this time to avoid the extra bloat in the logs.
>>
>> The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the dpdk code allocates some number of memzones to accommodate those buffer objects. Then it passes those memzones down one at a time and places objects inside the mempool from the given memzone until the memzone is exhausted. Then it goes back and grabs another memzone. This process continues until all objects are accounted for.
>> This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that it's possible that that's not true.
>>
>> Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
>>
>> Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?
>>
>> Thanks,
>>
>> Seth
>> -----Original Message-----
>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> Sent: Wednesday, July 31, 2019 3:13 PM
>> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance 
>> Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>> RDMA Memory Regions
>>
>> Hi Seth,
>>
>> After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x2000084bf000 Length: 40000 LKey: e601
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x200008621000 Length: 10000 LKey: e701
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x200018600000 Length: 1000000 LKey: e801
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x20000847e000 Length: 40000 LKey: e701
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000846d000 Length: 10000 LKey: e801
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x200019800000 Length: 1000000 LKey: e901
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x200016ebb000 Length: 40000 LKey: e801
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000845c000 Length: 10000 LKey: e901
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x20001aa00000 Length: 1000000 LKey: ea01
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x200016e7a000 Length: 40000 LKey: e901
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000844b000 Length: 10000 LKey: ea01
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>
>> Is this what you are looking for as the memory regions registered for the NIC?
>>
>> I attached the complete log.
>>
>> Thanks,
>> JD
>>
>> On 7/30/19 5:28 PM, JD Zheng wrote:
>>> Hi Seth,
>>>
>>> Thanks for the prompt reply!
>>>
>>> Please find answers inline.
>>>
>>> JD
>>>
>>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>>> Hi JD,
>>>>
>>>> Thanks for the report. I want to ask a few questions to start 
>>>> getting to the bottom of this. Since this issue doesn't currently 
>>>> reproduce on our per-patch or nightly tests, I would like to 
>>>> understand what's unique about your setup so that we can replicate 
>>>> it in a per patch test to prevent future regressions.
>>> I am running it on aarch64 platform. I tried x86 platform and I can 
>>> see same buffer alignment in memory pool but can't run the real test 
>>> to reproduce it due to other missing pieces.
>>>
>>>>
>>>> What options are you passing when you create the rdma transport? 
>>>> Are you creating it over RPC or in a configuration file?
>>> I am using conf file. Pls let me know if you'd like to look into conf file.
>>>
>>>>
>>>> Are you using the current DPDK submodule as your environment 
>>>> abstraction layer?
>>> No. Our project uses specific version of DPDK, which is v18.11. I 
>>> did quick test using latest and DPDK submodule on x86, and the 
>>> buffer alignment is the same, i.e. 64B aligned.
>>>
>>>>
>>>> I notice that your error log is printing from 
>>>> spdk_nvmf_transport_poll_group_create, which value exactly are you 
>>>> printing out?
>>> Here is patch to add dbg print. Pls note that SPDK version is v19.04
>>>
>>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>>                                    SPDK_NOTICELOG("Unable to reserve 
>>> the full number of buffers for the pg buffer cache.\n");
>>>                                    break;
>>>                            }
>>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>> group->buf_cache_count, group->buf_cache_size);
>>>                            STAILQ_INSERT_HEAD(&group->buf_cache, 
>>> buf, link);
>>>                            group->buf_cache_count++;
>>>                    }
>>>
>>>>
>>>> Can you run your target with the -L rdma option to get a dump of 
>>>> the memory regions registered with the NIC?
>>> Let me test and get back to you soon.
>>>
>>>>
>>>> We made a couple of changes to this code when dynamic memory 
>>>> allocations were added to DPDK. There were some safeguards that we 
>>>> added to try and make sure this case wouldn't hit, so I'd like to 
>>>> make sure you are running on the latest DPDK submodule as well as 
>>>> the latest SPDK to narrow down where we need to look.
>>> Unfortunately I can't easily update DPDK because other team 
>>> maintains it internally. But if it can be repro and fixed in latest, 
>>> I will try to pull in the fix.
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Seth
>>>>
>>>> -----Original Message-----
>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng 
>>>> via SPDK
>>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>>> To: spdk(a)lists.01.org
>>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>>>> RDMA Memory Regions
>>>>
>>>> Hello,
>>>>
>>>> When I run nvmf_tgt over RDMA using latest SPDK code, I 
>>>> occasionally ran into this error:
>>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split 
>>>> over multiple RDMA Memory Regions"
>>>>
>>>> After digging into the code, I found that nvmf_rdma_fill_buffers() 
>>>> calls spdk_mem_map_translate() to check if a data buffer sits on 2
>>>> 2MB pages, and if it is the case, it reports this error.
>>>>
>>>> The following commit added change to use data buffer start address 
>>>> to calculate the size between buffer start address and 2MB boundary.
>>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with 
>>>> IO Unit size (which is 8KB in my conf) to determine if the buffer 
>>>> passes 2MB boundary.
>>>>
>>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>>
>>>>         memory: fix contiguous memory calculation for unaligned 
>>>> buffers
>>>>
>>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new 
>>>> request will use free buffer from that pool and the buffer start 
>>>> address is passed to nvmf_rdma_fill_buffers(). But I found that 
>>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB 
>>>> in my
>>>> case) either, instead, they are 64Byte aligned so that some buffers 
>>>> will fail the checking and lead to this problem.
>>>>
>>>> The corresponding code snippets are as following:
>>>> spdk_nvmf_transport_create()
>>>> {
>>>> ...
>>>>         transport->data_buf_pool =
>>>> spdk_mempool_create(spdk_mempool_name,
>>>>                                    opts->num_shared_buffers,
>>>>                                    opts->io_unit_size + 
>>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>>                                    SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
>>>> }
>>>>
>>>> Also some debug print I added shows the start address of the buffers:
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019258800 0(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x2000192557c0 1(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019252780 2(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001924f740 3(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001924c700 4(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x2000192496c0 5(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019246680 6(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019243640 7(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019240600 8(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001923d5c0 9(32)
>>>> ...
>>>>
>>>> It looks like either the buffer allocation has alignment issue or 
>>>> the checking is not correct.
>>>>
>>>> Please advice how to fix this problem.
>>>>
>>>> Thanks,
>>>> JD Zheng
>>>> _______________________________________________
>>>> SPDK mailing list
>>>> SPDK(a)lists.01.org
>>>> https://lists.01.org/mailman/listinfo/spdk
>>>>
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-01 21:00 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-08-01 21:00 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 17509 bytes --]

Hi Seth,

 > Just because a buffer extends past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions. It also won't fail the translation for being over two memory regions.

This makes sense. However, spdk_mem_map_translate() does the following to
calculate translation_len:

cur_size = VALUE_2MB - _2MB_OFFSET(vaddr);
...
*size = spdk_min(*size, cur_size); // *size is the translation_len from the caller, nvmf_rdma_fill_buffers()

In nvmf_rdma_fill_buffers(),

if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
        SPDK_ERRLOG("Data buffer split over multiple RDMA Memory Regions\n");
        return -EINVAL;
}

This just checks whether the buffer sits across two 2MB pages, not whether 
it sits across two RDMA memory regions. Is my understanding correct?
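
To make sure I am reading it right, here is a small standalone sketch of that math (my own throwaway code, not the SPDK implementation, and it ignores the are_contiguous extension you describe below). It uses the buffer address and the 8KB IOUnitSize I reported earlier in this thread:

#include <stdint.h>
#include <stdio.h>

#define VALUE_2MB      (1ULL << 21)
#define _2MB_OFFSET(v) ((uint64_t)(v) & (VALUE_2MB - 1))

int main(void)
{
	uint64_t vaddr = 0x2000193feb00ULL;  /* buffer start reported earlier in this thread */
	uint64_t io_unit_size = 8192;        /* IOUnitSize in my conf */

	/* What spdk_mem_map_translate() computes before clamping *size. */
	uint64_t cur_size = VALUE_2MB - _2MB_OFFSET(vaddr);  /* 5376 for this address */

	/* Mirrors the translation_len < iov_len check in nvmf_rdma_fill_buffers(). */
	if (cur_size < io_unit_size) {
		printf("split: %llu bytes before the 2MB boundary, %llu after\n",
		       (unsigned long long)cur_size,
		       (unsigned long long)(io_unit_size - cur_size));
	}
	return 0;
}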

I still need some time to test. I will update you with the results, 
including a run with the -s option.

Thanks,
JD


On 8/1/19 1:28 PM, Howell, Seth wrote:
> Hi JD,
> 
> The 2 MiB check is just because we always do memory registrations at at least 2 MiB granularity (the minimum hugepage size). Just because a buffer extends past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions. It also won't fail the translation for being over two memory regions.
> 
> If you look at the definition of spdk_mem_map_translate we call map->ops->are_contiguous every time we cross a 2 MiB boundary. For RDMA, this function is registered to spdk_nvmf_rdma_check_contiguous_entries. IF this function returns true, then even if the buffer crosses a 2 MiB boundary, the translation will still be valid.
> The problem you are running into is not related to the buffer alignment, it is related to the fact that the two pages across which the buffer is split are registered to two different MRs in the NIC. This can only happen if those two pages are allocated independently and trigger two distinct memory event callbacks.
> 
> That is why I am so interested in seeing the results from the noticelog above ibv_reg_mr. It will tell me how your target application is allocating memory. Also, when you start the SPDK target, are you using the -s option? Something like ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know if it'll make a difference, it's more of a curiosity thing for me)?
> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Thursday, August 1, 2019 11:24 AM
> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
> 
> Hi Seth,
> 
> Thanks for the detailed description, now I understand the reason behind the checking. But I have a question, why checking against 2MiB? Is it because DPDK uses 2MiB page size by default so that one RDMA memory region should not cross 2 pages?
> 
>   > Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
> 
> I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7 +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport *rtransport,
>                   remaining_length -= rdma_req->req.iov[iovcnt].iov_len;
> 
>                   if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
> -                       SPDK_ERRLOG("Data buffer split over multiple
> RDMA Memory Regions\n");
> +                       SPDK_ERRLOG("Data buffer split over multiple
> RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n", rdma_req->buffers[iovcnt], iovcnt, length, remaining_length, translation_len, rdma_req->req.iov[iovcnt].iov_len);
>                           return -EINVAL;
>                   }
> 
> With this I can see which buffer failed the checking.
> For example, when SPKD initializes the memory pool, one of the buffers starts with 0x2000193feb00, and when failed, I got following:
> 
> rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376) (8192)
> 
> This buffer has 5376B on one 2MB page and the rest of it
> (8192-5376=2816B) is on another page.
> 
> The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to use iov base should make it better as iov base is 4KiB aligned. In above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 and it should pass the checking.
> However, another buffer in the pool is 0x2000192010c0 and iov_base is 0x200019201000, which would fail the checking because it is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
> 
> I will add the change from
> https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the test to get more information.
> 
> I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff -j
> 0x90000000:0x20000000 -c 16disk_1ns.conf"
> 
> Thanks,
> JD
> 
> 
> On 8/1/19 7:52 AM, Howell, Seth wrote:
>> Hi JD,
>>
>> I was doing a little bit of digging in the dpdk documentation around this process, and I have a little bit more information. We were pretty worried about the whole dynamic memory allocations thing a few releases ago, so Jim helped add a flag into DPDK that prevented allocations from being allocated and freed in different granularities. This flag also prevents malloc heap allocations from spanning multiple memory events. However, this flag didn't make it into DPDK until 19.02 (More documentation at https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer if you're interested). We have some code in the SPDK environment layer that tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that function is entirely capable of handling the heap allocations spanning multiple memory events part of the problem.
>> Since you are using dpdk 18.11, the memory callback inside of lib/env_dpdk looks like a good candidate for our issue. My best guess is that somehow a heap allocation from the buffer mempool is hitting across addresses from two dynamic memory allocation events. I'd still appreciate it if you could send me the information in my last e-mail, but I think we're onto something here.
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell,
>> Seth
>> Sent: Thursday, August 1, 2019 5:26 AM
>> To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance
>> Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>> RDMA Memory Regions
>>
>> Hi JD,
>>
>> Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.
>>
>> I think it's odd that we are using the buffer base for the memory
>> check, we should be using the iov base, but I don't believe that would
>> cause the issue you are seeing. Pushed a change to modify that
>> behavior anyways though:
>> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>
>> There was one registration that I wasn't able to catch from your last log. Sorry about that, I forgot there wasn’t a debug log for it. Can you try it again with this change which adds noticelogs for the relevant registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -Lrdma argument this time to avoid the extra bloat in the logs.
>>
>> The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the dpdk code allocates some number of memzones to accommodate those buffer objects. Then it passes those memzones down one at a time and places objects inside the mempool from the given memzone until the memzone is exhausted. Then it goes back and grabs another memzone. This process continues until all objects are accounted for.
>> This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that it's possible that that's not true.
>>
>> Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
>>
>> Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?
>>
>> Thanks,
>>
>> Seth
>> -----Original Message-----
>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> Sent: Wednesday, July 31, 2019 3:13 PM
>> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance
>> Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>> RDMA Memory Regions
>>
>> Hi Seth,
>>
>> After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x2000084bf000 Length: 40000 LKey: e601
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x200008621000 Length: 10000 LKey: e701
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x200018600000 Length: 1000000 LKey: e801
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x20000847e000 Length: 40000 LKey: e701
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000846d000 Length: 10000 LKey: e801
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x200019800000 Length: 1000000 LKey: e901
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x200016ebb000 Length: 40000 LKey: e801
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000845c000 Length: 10000 LKey: e901
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x20001aa00000 Length: 1000000 LKey: ea01
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x200016e7a000 Length: 40000 LKey: e901
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000844b000 Length: 10000 LKey: ea01
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>
>> Is this you are look for as memory regions registered for NIC?
>>
>> I attached the complete log.
>>
>> Thanks,
>> JD
>>
>> On 7/30/19 5:28 PM, JD Zheng wrote:
>>> Hi Seth,
>>>
>>> Thanks for the prompt reply!
>>>
>>> Please find answers inline.
>>>
>>> JD
>>>
>>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>>> Hi JD,
>>>>
>>>> Thanks for the report. I want to ask a few questions to start
>>>> getting to the bottom of this. Since this issue doesn't currently
>>>> reproduce on our per-patch or nightly tests, I would like to
>>>> understand what's unique about your setup so that we can replicate
>>>> it in a per patch test to prevent future regressions.
>>> I am running it on aarch64 platform. I tried x86 platform and I can
>>> see same buffer alignment in memory pool but can't run the real test
>>> to reproduce it due to other missing pieces.
>>>
>>>>
>>>> What options are you passing when you create the rdma transport? Are
>>>> you creating it over RPC or in a configuration file?
>>> I am using conf file. Pls let me know if you'd like to look into conf file.
>>>
>>>>
>>>> Are you using the current DPDK submodule as your environment
>>>> abstraction layer?
>>> No. Our project uses specific version of DPDK, which is v18.11. I did
>>> quick test using latest and DPDK submodule on x86, and the buffer
>>> alignment is the same, i.e. 64B aligned.
>>>
>>>>
>>>> I notice that your error log is printing from
>>>> spdk_nvmf_transport_poll_group_create, which value exactly are you
>>>> printing out?
>>> Here is patch to add dbg print. Pls note that SPDK version is v19.04
>>>
>>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>>                                    SPDK_NOTICELOG("Unable to reserve
>>> the full number of buffers for the pg buffer cache.\n");
>>>                                    break;
>>>                            }
>>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>> group->buf_cache_count, group->buf_cache_size);
>>>                            STAILQ_INSERT_HEAD(&group->buf_cache, buf,
>>> link);
>>>                            group->buf_cache_count++;
>>>                    }
>>>
>>>>
>>>> Can you run your target with the -L rdma option to get a dump of the
>>>> memory regions registered with the NIC?
>>> Let me test and get back to you soon.
>>>
>>>>
>>>> We made a couple of changes to this code when dynamic memory
>>>> allocations were added to DPDK. There were some safeguards that we
>>>> added to try and make sure this case wouldn't hit, so I'd like to
>>>> make sure you are running on the latest DPDK submodule as well as
>>>> the latest SPDK to narrow down where we need to look.
>>> Unfortunately I can't easily update DPDK because other team maintains
>>> it internally. But if it can be repro and fixed in latest, I will try
>>> to pull in the fix.
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Seth
>>>>
>>>> -----Original Message-----
>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng
>>>> via SPDK
>>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>>> To: spdk(a)lists.01.org
>>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>>>> RDMA Memory Regions
>>>>
>>>> Hello,
>>>>
>>>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally
>>>> ran into this errors:
>>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
>>>> multiple RDMA Memory Regions"
>>>>
>>>> After digging into the code, I found that nvmf_rdma_fill_buffers()
>>>> calls spdk_mem_map_translate() to check if a data buffer sit on 2
>>>> 2MB pages, and if it is the case, it reports this error.
>>>>
>>>> The following commit added change to use data buffer start address
>>>> to calculate the size between buffer start address and 2MB boundary.
>>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with IO
>>>> Unit size (which is 8KB in my conf) to determine if the buffer
>>>> passes 2MB boundary.
>>>>
>>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>>
>>>>         memory: fix contiguous memory calculation for unaligned
>>>> buffers
>>>>
>>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new
>>>> request will use free buffer from that pool and the buffer start
>>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
>>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB in
>>>> my
>>>> case) either, instead, they are 64Byte aligned so that some buffers
>>>> will fail the checking and leads to this problem.
>>>>
>>>> The corresponding code snippets are as following:
>>>> spdk_nvmf_transport_create()
>>>> {
>>>> ...
>>>>         transport->data_buf_pool =
>>>> pdk_mempool_create(spdk_mempool_name,
>>>>                                    opts->num_shared_buffers,
>>>>                                    opts->io_unit_size +
>>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>>                                    SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
>>>> }
>>>>
>>>> Also some debug print I added shows the start address of the buffers:
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019258800 0(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x2000192557c0 1(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019252780 2(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001924f740 3(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001924c700 4(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x2000192496c0 5(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019246680 6(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019243640 7(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019240600 8(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001923d5c0 9(32)
>>>> ...
>>>>
>>>> It looks like either the buffer allocation has alignment issue or
>>>> the checking is not correct.
>>>>
>>>> Please advice how to fix this problem.
>>>>
>>>> Thanks,
>>>> JD Zheng
>>>> _______________________________________________
>>>> SPDK mailing list
>>>> SPDK(a)lists.01.org
>>>> https://lists.01.org/mailman/listinfo/spdk
>>>>
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-01 20:28 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-01 20:28 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 16285 bytes --]

Hi JD,

The 2 MiB check is just because we always do memory registrations at a granularity of at least 2 MiB (the minimum hugepage size). A buffer extending past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions, and it won't fail the translation merely for crossing that boundary.

If you look at the definition of spdk_mem_map_translate, we call map->ops->are_contiguous every time we cross a 2 MiB boundary. For RDMA, this function is registered to spdk_nvmf_rdma_check_contiguous_entries. If this function returns true, then even if the buffer crosses a 2 MiB boundary, the translation will still be valid.
The problem you are running into is not related to the buffer alignment; it is related to the fact that the two pages across which the buffer is split are registered to two different MRs in the NIC. This can only happen if those two pages are allocated independently and trigger two distinct memory event callbacks.
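
If it helps, here is a much-simplified sketch of the shape of that walk. It is illustrative only (not the real spdk_mem_map_translate()), and the lookup/contig callbacks are hypothetical stand-ins for the registered mem map ops:

#include <stdbool.h>
#include <stdint.h>

#define VALUE_2MB (1ULL << 21)
#define MASK_2MB  (VALUE_2MB - 1)

typedef uint64_t (*lookup_fn)(uint64_t page_2mb);       /* translation (e.g. an lkey) for one 2MB page */
typedef bool (*contig_fn)(uint64_t tr1, uint64_t tr2);  /* plays the role of are_contiguous */

uint64_t
translate(uint64_t vaddr, uint64_t *size, lookup_fn lookup, contig_fn contig)
{
	uint64_t page = vaddr & ~MASK_2MB;
	uint64_t tr = lookup(page);
	uint64_t cur_size = VALUE_2MB - (vaddr & MASK_2MB);

	/* Keep extending the valid length while the neighboring 2MB entries
	 * translate contiguously (i.e. the same registration). */
	while (cur_size < *size) {
		page += VALUE_2MB;
		if (!contig(tr, lookup(page))) {
			break;  /* next page belongs to a different MR */
		}
		cur_size += VALUE_2MB;
	}

	if (cur_size < *size) {
		*size = cur_size;  /* the caller sees a short translation_len and errors out */
	}
	return tr;
}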

That is why I am so interested in seeing the results from the noticelog above ibv_reg_mr. It will tell me how your target application is allocating memory. Also, when you start the SPDK target, are you using the -s option, e.g. ./app/nvmf_tgt/nvmf_tgt -s 512? (I don't know if it'll make a difference; it's more of a curiosity thing for me.)

Thanks,

Seth

-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com] 
Sent: Thursday, August 1, 2019 11:24 AM
To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

Thanks for the detailed description, now I understand the reason behind the checking. But I have a question, why checking against 2MiB? Is it because DPDK uses 2MiB page size by default so that one RDMA memory region should not cross 2 pages?

 > Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.

I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7 +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport *rtransport,
                 remaining_length -= rdma_req->req.iov[iovcnt].iov_len;

                 if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
-                       SPDK_ERRLOG("Data buffer split over multiple 
RDMA Memory Regions\n");
+                       SPDK_ERRLOG("Data buffer split over multiple
RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n", rdma_req->buffers[iovcnt], iovcnt, length, remaining_length, translation_len, rdma_req->req.iov[iovcnt].iov_len);
                         return -EINVAL;
                 }

With this I can see which buffer failed the checking.
For example, when SPKD initializes the memory pool, one of the buffers starts with 0x2000193feb00, and when failed, I got following:

rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376) (8192)

This buffer has 5376B on one 2MB page and the rest of it
(8192-5376=2816B) is on another page.

The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to use iov base should make it better as iov base is 4KiB aligned. In above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 and it should pass the checking.
However, another buffer in the pool is 0x2000192010c0 and iov_base is 0x200019201000, which would fail the checking because it is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.

I will add the change from
https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the test to get more information.

I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff -j
0x90000000:0x20000000 -c 16disk_1ns.conf"

Thanks,
JD


On 8/1/19 7:52 AM, Howell, Seth wrote:
> Hi JD,
> 
> I was doing a little bit of digging in the dpdk documentation around this process, and I have a little bit more information. We were pretty worried about the whole dynamic memory allocations thing a few releases ago, so Jim helped add a flag into DPDK that prevented allocations from being allocated and freed in different granularities. This flag also prevents malloc heap allocations from spanning multiple memory events. However, this flag didn't make it into DPDK until 19.02 (More documentation at https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer if you're interested). We have some code in the SPDK environment layer that tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that function is entirely capable of handling the heap allocations spanning multiple memory events part of the problem.
> Since you are using dpdk 18.11, the memory callback inside of lib/env_dpdk looks like a good candidate for our issue. My best guess is that somehow a heap allocation from the buffer mempool is hitting across addresses from two dynamic memory allocation events. I'd still appreciate it if you could send me the information in my last e-mail, but I think we're onto something here.
> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell, 
> Seth
> Sent: Thursday, August 1, 2019 5:26 AM
> To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance 
> Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
> RDMA Memory Regions
> 
> Hi JD,
> 
> Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.
> 
> I think it's odd that we are using the buffer base for the memory 
> check, we should be using the iov base, but I don't believe that would 
> cause the issue you are seeing. Pushed a change to modify that 
> behavior anyways though: 
> https://review.gerrithub.io/c/spdk/spdk/+/463893
> 
> There was one registration that I wasn't able to catch from your last log. Sorry about that, I forgot there wasn’t a debug log for it. Can you try it again with this change which adds noticelogs for the relevant registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -Lrdma argument this time to avoid the extra bloat in the logs.
> 
> The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the dpdk code allocates some number of memzones to accommodate those buffer objects. Then it passes those memzones down one at a time and places objects inside the mempool from the given memzone until the memzone is exhausted. Then it goes back and grabs another memzone. This process continues until all objects are accounted for.
> This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that it's possible that that's not true.
> 
> Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
> 
> Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?
> 
> Thanks,
> 
> Seth
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Wednesday, July 31, 2019 3:13 PM
> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance 
> Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
> RDMA Memory Regions
> 
> Hi Seth,
> 
> After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x2000084bf000 Length: 40000 LKey: e601
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x200008621000 Length: 10000 LKey: e701
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x200018600000 Length: 1000000 LKey: e801
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x20000847e000 Length: 40000 LKey: e701
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x20000846d000 Length: 10000 LKey: e801
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x200019800000 Length: 1000000 LKey: e901
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x200016ebb000 Length: 40000 LKey: e801
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x20000845c000 Length: 10000 LKey: e901
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x20001aa00000 Length: 1000000 LKey: ea01
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x200016e7a000 Length: 40000 LKey: e901
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x20000844b000 Length: 10000 LKey: ea01
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
> 
> Is this you are look for as memory regions registered for NIC?
> 
> I attached the complete log.
> 
> Thanks,
> JD
> 
> On 7/30/19 5:28 PM, JD Zheng wrote:
>> Hi Seth,
>>
>> Thanks for the prompt reply!
>>
>> Please find answers inline.
>>
>> JD
>>
>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>> Hi JD,
>>>
>>> Thanks for the report. I want to ask a few questions to start 
>>> getting to the bottom of this. Since this issue doesn't currently 
>>> reproduce on our per-patch or nightly tests, I would like to 
>>> understand what's unique about your setup so that we can replicate 
>>> it in a per patch test to prevent future regressions.
>> I am running it on aarch64 platform. I tried x86 platform and I can 
>> see same buffer alignment in memory pool but can't run the real test 
>> to reproduce it due to other missing pieces.
>>
>>>
>>> What options are you passing when you create the rdma transport? Are 
>>> you creating it over RPC or in a configuration file?
>> I am using conf file. Pls let me know if you'd like to look into conf file.
>>
>>>
>>> Are you using the current DPDK submodule as your environment 
>>> abstraction layer?
>> No. Our project uses specific version of DPDK, which is v18.11. I did 
>> quick test using latest and DPDK submodule on x86, and the buffer 
>> alignment is the same, i.e. 64B aligned.
>>
>>>
>>> I notice that your error log is printing from 
>>> spdk_nvmf_transport_poll_group_create, which value exactly are you 
>>> printing out?
>> Here is patch to add dbg print. Pls note that SPDK version is v19.04
>>
>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>                                   SPDK_NOTICELOG("Unable to reserve 
>> the full number of buffers for the pg buffer cache.\n");
>>                                   break;
>>                           }
>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>> group->buf_cache_count, group->buf_cache_size);
>>                           STAILQ_INSERT_HEAD(&group->buf_cache, buf, 
>> link);
>>                           group->buf_cache_count++;
>>                   }
>>
>>>
>>> Can you run your target with the -L rdma option to get a dump of the 
>>> memory regions registered with the NIC?
>> Let me test and get back to you soon.
>>
>>>
>>> We made a couple of changes to this code when dynamic memory 
>>> allocations were added to DPDK. There were some safeguards that we 
>>> added to try and make sure this case wouldn't hit, so I'd like to 
>>> make sure you are running on the latest DPDK submodule as well as 
>>> the latest SPDK to narrow down where we need to look.
>> Unfortunately I can't easily update DPDK because other team maintains 
>> it internally. But if it can be repro and fixed in latest, I will try 
>> to pull in the fix.
>>
>>>
>>> Thanks,
>>>
>>> Seth
>>>
>>> -----Original Message-----
>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng 
>>> via SPDK
>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>> To: spdk(a)lists.01.org
>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>>> RDMA Memory Regions
>>>
>>> Hello,
>>>
>>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally 
>>> ran into this errors:
>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over 
>>> multiple RDMA Memory Regions"
>>>
>>> After digging into the code, I found that nvmf_rdma_fill_buffers() 
>>> calls spdk_mem_map_translate() to check if a data buffer sit on 2 
>>> 2MB pages, and if it is the case, it reports this error.
>>>
>>> The following commit added change to use data buffer start address 
>>> to calculate the size between buffer start address and 2MB boundary. 
>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with IO 
>>> Unit size (which is 8KB in my conf) to determine if the buffer 
>>> passes 2MB boundary.
>>>
>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>
>>>        memory: fix contiguous memory calculation for unaligned 
>>> buffers
>>>
>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new 
>>> request will use free buffer from that pool and the buffer start 
>>> address is passed to nvmf_rdma_fill_buffers(). But I found that 
>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB in 
>>> my
>>> case) either, instead, they are 64Byte aligned so that some buffers 
>>> will fail the checking and leads to this problem.
>>>
>>> The corresponding code snippets are as following:
>>> spdk_nvmf_transport_create()
>>> {
>>> ...
>>>        transport->data_buf_pool =
>>> pdk_mempool_create(spdk_mempool_name,
>>>                                   opts->num_shared_buffers,
>>>                                   opts->io_unit_size + 
>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>                                   SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>                                   SPDK_ENV_SOCKET_ID_ANY); ...
>>> }
>>>
>>> Also some debug print I added shows the start address of the buffers:
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019258800 0(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x2000192557c0 1(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019252780 2(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x20001924f740 3(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x20001924c700 4(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x2000192496c0 5(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019246680 6(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019243640 7(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019240600 8(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x20001923d5c0 9(32)
>>> ...
>>>
>>> It looks like either the buffer allocation has alignment issue or 
>>> the checking is not correct.
>>>
>>> Please advice how to fix this problem.
>>>
>>> Thanks,
>>> JD Zheng
>>> _______________________________________________
>>> SPDK mailing list
>>> SPDK(a)lists.01.org
>>> https://lists.01.org/mailman/listinfo/spdk
>>>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-01 18:23 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-08-01 18:23 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 14552 bytes --]

Hi Seth,

Thanks for the detailed description; now I understand the reason behind 
the check. But I have a question: why check against 2MiB? Is it 
because DPDK uses a 2MiB page size by default, so one RDMA memory 
region should not cross two pages?

 > Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.

I've added a print in nvmf_rdma_fill_buffers():
@@ -1502,7 +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport *rtransport,
                 remaining_length -= rdma_req->req.iov[iovcnt].iov_len;

                 if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
-                       SPDK_ERRLOG("Data buffer split over multiple RDMA Memory Regions\n");
+                       SPDK_ERRLOG("Data buffer split over multiple RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n", rdma_req->buffers[iovcnt], iovcnt, length, remaining_length, translation_len, rdma_req->req.iov[iovcnt].iov_len);
                         return -EINVAL;
                 }

With this I can see which buffer failed the check.
For example, when SPDK initializes the memory pool, one of the buffers 
starts at 0x2000193feb00, and when it failed, I got the following:

rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376) (8192)

This buffer has 5376B on one 2MB page and the rest of it 
(8192-5376=2816B) is on another page.

The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to use the iov 
base should make it better, as the iov base is 4KiB aligned. In the above 
case, iov_base is 0x2000193feb00 & ~0xfff = 0x2000193fe000, and it should 
pass the check.
However, another buffer in the pool is 0x2000192010c0, whose iov_base is 
0x200019201000, which would fail the check because it is only 4KiB from a 
2MB boundary and IOUnitSize is 8KiB.
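
To double-check the first example above, here is a throwaway snippet (my own, not the code from the Gerrit change) that redoes the arithmetic with the masking described above:

#include <stdint.h>
#include <stdio.h>

#define VALUE_2MB (1ULL << 21)
#define MASK_2MB  (VALUE_2MB - 1)

int main(void)
{
	uint64_t buf = 0x2000193feb00ULL;     /* buffer start from the error above */
	uint64_t iov_base = buf & ~0xfffULL;  /* 0x2000193fe000, aligned down to 4KiB as described above */

	printf("buf:      %#llx, %llu bytes to the 2MB boundary\n",
	       (unsigned long long)buf,
	       (unsigned long long)(VALUE_2MB - (buf & MASK_2MB)));       /* 5376, fails the 8KiB check */
	printf("iov_base: %#llx, %llu bytes to the 2MB boundary\n",
	       (unsigned long long)iov_base,
	       (unsigned long long)(VALUE_2MB - (iov_base & MASK_2MB)));  /* 8192, passes the 8KiB check */
	return 0;
}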

I will add the change from 
https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the test to 
get more information.

I also attached the conf file. The command line is "nvmf_tgt -m 0xff -j 
0x90000000:0x20000000 -c 16disk_1ns.conf".

Thanks,
JD


On 8/1/19 7:52 AM, Howell, Seth wrote:
> Hi JD,
> 
> I was doing a little bit of digging in the dpdk documentation around this process, and I have a little bit more information. We were pretty worried about the whole dynamic memory allocations thing a few releases ago, so Jim helped add a flag into DPDK that prevented allocations from being allocated and freed in different granularities. This flag also prevents malloc heap allocations from spanning multiple memory events. However, this flag didn't make it into DPDK until 19.02 (More documentation at https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer if you're interested). We have some code in the SPDK environment layer that tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that function is entirely capable of handling the heap allocations spanning multiple memory events part of the problem.
> Since you are using dpdk 18.11, the memory callback inside of lib/env_dpdk looks like a good candidate for our issue. My best guess is that somehow a heap allocation from the buffer mempool is hitting across addresses from two dynamic memory allocation events. I'd still appreciate it if you could send me the information in my last e-mail, but I think we're onto something here.
> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell, Seth
> Sent: Thursday, August 1, 2019 5:26 AM
> To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
> 
> Hi JD,
> 
> Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.
> 
> I think it's odd that we are using the buffer base for the memory check, we should be using the iov base, but I don't believe that would cause the issue you are seeing. Pushed a change to modify that behavior anyways though: https://review.gerrithub.io/c/spdk/spdk/+/463893
> 
> There was one registration that I wasn't able to catch from your last log. Sorry about that, I forgot there wasn’t a debug log for it. Can you try it again with this change which adds noticelogs for the relevant registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -Lrdma argument this time to avoid the extra bloat in the logs.
> 
> The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the dpdk code allocates some number of memzones to accommodate those buffer objects. Then it passes those memzones down one at a time and places objects inside the mempool from the given memzone until the memzone is exhausted. Then it goes back and grabs another memzone. This process continues until all objects are accounted for.
> This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that it's possible that that's not true.
> 
> Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
> 
> Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?
> 
> Thanks,
> 
> Seth
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Wednesday, July 31, 2019 3:13 PM
> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
> 
> Hi Seth,
> 
> After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x2000084bf000 Length: 40000 LKey: e601
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x200008621000 Length: 10000 LKey: e701
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x200018600000 Length: 1000000 LKey: e801
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x20000847e000 Length: 40000 LKey: e701
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x20000846d000 Length: 10000 LKey: e801
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x200019800000 Length: 1000000 LKey: e901
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x200016ebb000 Length: 40000 LKey: e801
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x20000845c000 Length: 10000 LKey: e901
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x20001aa00000 Length: 1000000 LKey: ea01
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x200016e7a000 Length: 40000 LKey: e901
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x20000844b000 Length: 10000 LKey: ea01
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
> 
> Is this you are look for as memory regions registered for NIC?
> 
> I attached the complete log.
> 
> Thanks,
> JD
> 
> On 7/30/19 5:28 PM, JD Zheng wrote:
>> Hi Seth,
>>
>> Thanks for the prompt reply!
>>
>> Please find answers inline.
>>
>> JD
>>
>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>> Hi JD,
>>>
>>> Thanks for the report. I want to ask a few questions to start getting
>>> to the bottom of this. Since this issue doesn't currently reproduce
>>> on our per-patch or nightly tests, I would like to understand what's
>>> unique about your setup so that we can replicate it in a per patch
>>> test to prevent future regressions.
>> I am running it on aarch64 platform. I tried x86 platform and I can
>> see same buffer alignment in memory pool but can't run the real test
>> to reproduce it due to other missing pieces.
>>
>>>
>>> What options are you passing when you create the rdma transport? Are
>>> you creating it over RPC or in a configuration file?
>> I am using conf file. Pls let me know if you'd like to look into conf file.
>>
>>>
>>> Are you using the current DPDK submodule as your environment
>>> abstraction layer?
>> No. Our project uses specific version of DPDK, which is v18.11. I did
>> quick test using latest and DPDK submodule on x86, and the buffer
>> alignment is the same, i.e. 64B aligned.
>>
>>>
>>> I notice that your error log is printing from
>>> spdk_nvmf_transport_poll_group_create, which value exactly are you
>>> printing out?
>> Here is patch to add dbg print. Pls note that SPDK version is v19.04
>>
>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>                                   SPDK_NOTICELOG("Unable to reserve the
>> full number of buffers for the pg buffer cache.\n");
>>                                   break;
>>                           }
>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>> group->buf_cache_count, group->buf_cache_size);
>>                           STAILQ_INSERT_HEAD(&group->buf_cache, buf,
>> link);
>>                           group->buf_cache_count++;
>>                   }
>>
>>>
>>> Can you run your target with the -L rdma option to get a dump of the
>>> memory regions registered with the NIC?
>> Let me test and get back to you soon.
>>
>>>
>>> We made a couple of changes to this code when dynamic memory
>>> allocations were added to DPDK. There were some safeguards that we
>>> added to try and make sure this case wouldn't hit, so I'd like to
>>> make sure you are running on the latest DPDK submodule as well as the
>>> latest SPDK to narrow down where we need to look.
>> Unfortunately I can't easily update DPDK because other team maintains
>> it internally. But if it can be repro and fixed in latest, I will try
>> to pull in the fix.
>>
>>>
>>> Thanks,
>>>
>>> Seth
>>>
>>> -----Original Message-----
>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng
>>> via SPDK
>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>> To: spdk(a)lists.01.org
>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>>> RDMA Memory Regions
>>>
>>> Hello,
>>>
>>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally
>>> ran into this errors:
>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
>>> multiple RDMA Memory Regions"
>>>
>>> After digging into the code, I found that nvmf_rdma_fill_buffers()
>>> calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB
>>> pages, and if it is the case, it reports this error.
>>>
>>> The following commit added change to use data buffer start address to
>>> calculate the size between buffer start address and 2MB boundary. The
>>> caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit
>>> size (which is 8KB in my conf) to determine if the buffer passes 2MB
>>> boundary.
>>>
>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>
>>>        memory: fix contiguous memory calculation for unaligned buffers
>>>
>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new
>>> request will use free buffer from that pool and the buffer start
>>> address is passed to nvmf_rdma_fill_buffers(). But I found that these
>>> buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my
>>> case) either, instead, they are 64Byte aligned so that some buffers
>>> will fail the checking and leads to this problem.
>>>
>>> The corresponding code snippets are as following:
>>> spdk_nvmf_transport_create()
>>> {
>>> ...
>>>        transport->data_buf_pool =
>>> pdk_mempool_create(spdk_mempool_name,
>>>                                   opts->num_shared_buffers,
>>>                                   opts->io_unit_size +
>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>                                   SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>                                   SPDK_ENV_SOCKET_ID_ANY); ...
>>> }
>>>
>>> Also some debug print I added shows the start address of the buffers:
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019258800 0(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x2000192557c0 1(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019252780 2(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x20001924f740 3(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x20001924c700 4(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x2000192496c0 5(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019246680 6(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019243640 7(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019240600 8(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x20001923d5c0 9(32)
>>> ...
>>>
>>> It looks like either the buffer allocation has alignment issue or the
>>> checking is not correct.
>>>
>>> Please advice how to fix this problem.
>>>
>>> Thanks,
>>> JD Zheng
>>> _______________________________________________
>>> SPDK mailing list
>>> SPDK(a)lists.01.org
>>> https://lists.01.org/mailman/listinfo/spdk
>>>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
> 

[-- Attachment #2: 16disk_1ns.conf --]
[-- Type: text/plain, Size: 1648 bytes --]

[Global]
  ReactorMask 0xff
  LogFacility "local7"
[Rpc]
  Enable No
  Listen 127.0.0.1
[Nvmf]
  AcceptorPollRate 10000
[Transport]
  Type RDMA
  MaxQueuesPerSession 32
  MaxQueueDepth 256
  InCapsuleDataSize 4096
  MaxIOSize 8192
  IOUnitSize 8192
  AcceptorCore 0
  NumSharedBuffers 512
[Nvme]
  Timeout 0
  AdminPollRate 100000
  TransportId "trtype:PCIe traddr:0000:06:00.0" nvme0
  TransportId "trtype:PCIe traddr:0000:0a:00.0" nvme1
  TransportId "trtype:PCIe traddr:0000:0e:00.0" nvme2
  TransportId "trtype:PCIe traddr:0000:12:00.0" nvme3
  TransportId "trtype:PCIe traddr:0001:05:00.0" nvme4
  TransportId "trtype:PCIe traddr:0001:09:00.0" nvme5
  TransportId "trtype:PCIe traddr:0001:0d:00.0" nvme6
  TransportId "trtype:PCIe traddr:0001:11:00.0" nvme7
  TransportId "trtype:PCIe traddr:0006:04:00.0" nvme8
  TransportId "trtype:PCIe traddr:0006:08:00.0" nvme9
  TransportId "trtype:PCIe traddr:0006:0c:00.0" nvme10
  TransportId "trtype:PCIe traddr:0006:10:00.0" nvme11
  TransportId "trtype:PCIe traddr:0007:03:00.0" nvme12
  TransportId "trtype:PCIe traddr:0007:07:00.0" nvme13
  TransportId "trtype:PCIe traddr:0007:0b:00.0" nvme14
  TransportId "trtype:PCIe traddr:0007:0f:00.0" nvme15
[Subsystem0]
  NQN nqn.2016-06.io.spdk:cnode0
  Listen RDMA 192.168.2.10:4420
  SN SPDK00000000000001
  Namespace nvme0n1
  Namespace nvme4n1
  Namespace nvme8n1
  Namespace nvme12n1
  Namespace nvme1n1
  Namespace nvme5n1
  Namespace nvme9n1
  Namespace nvme13n1
  Namespace nvme2n1
  Namespace nvme6n1
  Namespace nvme10n1
  Namespace nvme14n1
  Namespace nvme3n1
  Namespace nvme7n1
  Namespace nvme11n1
  Namespace nvme15n1
  AllowAnyHost Yes

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-01 14:52 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-01 14:52 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 12006 bytes --]

Hi JD,

I was doing a little bit of digging in the DPDK documentation around this process, and I have a little more information. We were pretty worried about dynamic memory allocations a few releases ago, so Jim helped add a flag into DPDK that prevents memory from being allocated and freed at different granularities. This flag also prevents malloc heap allocations from spanning multiple memory events. However, this flag didn't make it into DPDK until 19.02 (more documentation at https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer if you're interested). We have some code in the SPDK environment layer that tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb), but I don't know that that function fully handles the case where a heap allocation spans multiple memory events.
Since you are using DPDK 18.11, the memory callback inside of lib/env_dpdk looks like a good candidate for our issue. My best guess is that a heap allocation from the buffer mempool spans addresses from two dynamic memory allocation events. I'd still appreciate it if you could send me the information in my last e-mail, but I think we're onto something here.
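
For reference, once you can move to a DPDK that has the flag, something along these lines should let the target hand --match-allocations through to the EAL. This is only a sketch: I am assuming the env_context field in spdk_env_opts here, and newer SPDK/DPDK submodule combinations may already pass the flag for you, so please check it against the versions you actually build:

/* Sketch only: forwarding the DPDK 19.02+ "--match-allocations" EAL flag
 * through SPDK's env layer. Verify the spdk_env_opts fields against your
 * SPDK version before relying on this. */
#include "spdk/env.h"

int
init_env_with_match_allocations(void)
{
	struct spdk_env_opts opts;

	spdk_env_opts_init(&opts);
	opts.name = "nvmf_tgt";
	/* Extra arguments handed through to rte_eal_init(). */
	opts.env_context = "--match-allocations";

	return spdk_env_init(&opts);
}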

Thanks,

Seth

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell, Seth
Sent: Thursday, August 1, 2019 5:26 AM
To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi JD,

Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.

I think it's odd that we are using the buffer base for the memory check, we should be using the iov base, but I don't believe that would cause the issue you are seeing. Pushed a change to modify that behavior anyways though: https://review.gerrithub.io/c/spdk/spdk/+/463893

There was one registration that I wasn't able to catch from your last log. Sorry about that, I forgot there wasn’t a debug log for it. Can you try it again with this change which adds noticelogs for the relevant registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -Lrdma argument this time to avoid the extra bloat in the logs.

The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the dpdk code allocates some number of memzones to accommodate those buffer objects. Then it passes those memzones down one at a time and places objects inside the mempool from the given memzone until the memzone is exhausted. Then it goes back and grabs another memzone. This process continues until all objects are accounted for.
This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that it's possible that that's not true.

Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.

Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?

Thanks,

Seth
-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
Sent: Wednesday, July 31, 2019 3:13 PM
To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x2000084bf000 Length: 40000 LKey: e601
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x200008621000 Length: 10000 LKey: e701
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x200018600000 Length: 1000000 LKey: e801
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x20000847e000 Length: 40000 LKey: e701
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000846d000 Length: 10000 LKey: e801
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x200019800000 Length: 1000000 LKey: e901
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x200016ebb000 Length: 40000 LKey: e801
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000845c000 Length: 10000 LKey: e901
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x20001aa00000 Length: 1000000 LKey: ea01
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x200016e7a000 Length: 40000 LKey: e901
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000844b000 Length: 10000 LKey: ea01
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x20001bc00000 Length: 1000000 LKey: eb01 ...

Is this you are look for as memory regions registered for NIC?

I attached the complete log.

Thanks,
JD

On 7/30/19 5:28 PM, JD Zheng wrote:
> Hi Seth,
> 
> Thanks for the prompt reply!
> 
> Please find answers inline.
> 
> JD
> 
> On 7/30/19 5:01 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> Thanks for the report. I want to ask a few questions to start getting 
>> to the bottom of this. Since this issue doesn't currently reproduce 
>> on our per-patch or nightly tests, I would like to understand what's 
>> unique about your setup so that we can replicate it in a per patch 
>> test to prevent future regressions.
> I am running it on aarch64 platform. I tried x86 platform and I can 
> see same buffer alignment in memory pool but can't run the real test 
> to reproduce it due to other missing pieces.
> 
>>
>> What options are you passing when you create the rdma transport? Are 
>> you creating it over RPC or in a configuration file?
> I am using conf file. Pls let me know if you'd like to look into conf file.
> 
>>
>> Are you using the current DPDK submodule as your environment 
>> abstraction layer?
> No. Our project uses specific version of DPDK, which is v18.11. I did 
> quick test using latest and DPDK submodule on x86, and the buffer 
> alignment is the same, i.e. 64B aligned.
> 
>>
>> I notice that your error log is printing from 
>> spdk_nvmf_transport_poll_group_create, which value exactly are you 
>> printing out?
> Here is patch to add dbg print. Pls note that SPDK version is v19.04
> 
> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>                                  SPDK_NOTICELOG("Unable to reserve the 
> full number of buffers for the pg buffer cache.\n");
>                                  break;
>                          }
> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
> group->buf_cache_count, group->buf_cache_size);
>                          STAILQ_INSERT_HEAD(&group->buf_cache, buf, 
> link);
>                          group->buf_cache_count++;
>                  }
> 
>>
>> Can you run your target with the -L rdma option to get a dump of the 
>> memory regions registered with the NIC?
> Let me test and get back to you soon.
> 
>>
>> We made a couple of changes to this code when dynamic memory 
>> allocations were added to DPDK. There were some safeguards that we 
>> added to try and make sure this case wouldn't hit, so I'd like to 
>> make sure you are running on the latest DPDK submodule as well as the 
>> latest SPDK to narrow down where we need to look.
> Unfortunately I can't easily update DPDK because other team maintains 
> it internally. But if it can be repro and fixed in latest, I will try 
> to pull in the fix.
> 
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng 
>> via SPDK
>> Sent: Wednesday, July 31, 2019 3:00 AM
>> To: spdk(a)lists.01.org
>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>> RDMA Memory Regions
>>
>> Hello,
>>
>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally 
>> ran into this errors:
>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over 
>> multiple RDMA Memory Regions"
>>
>> After digging into the code, I found that nvmf_rdma_fill_buffers() 
>> calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB 
>> pages, and if it is the case, it reports this error.
>>
>> The following commit added change to use data buffer start address to 
>> calculate the size between buffer start address and 2MB boundary. The 
>> caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit 
>> size (which is 8KB in my conf) to determine if the buffer passes 2MB 
>> boundary.
>>
>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>
>>       memory: fix contiguous memory calculation for unaligned buffers
>>
>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new 
>> request will use free buffer from that pool and the buffer start 
>> address is passed to nvmf_rdma_fill_buffers(). But I found that these 
>> buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my
>> case) either, instead, they are 64Byte aligned so that some buffers 
>> will fail the checking and leads to this problem.
>>
>> The corresponding code snippets are as following:
>> spdk_nvmf_transport_create()
>> {
>> ...
>>       transport->data_buf_pool =
>> pdk_mempool_create(spdk_mempool_name,
>>                                  opts->num_shared_buffers,
>>                                  opts->io_unit_size + 
>> NVMF_DATA_BUFFER_ALIGNMENT,
>>                                  SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>                                  SPDK_ENV_SOCKET_ID_ANY); ...
>> }
>>
>> Also some debug print I added shows the start address of the buffers:
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019258800 0(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192557c0 1(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019252780 2(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924f740 3(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924c700 4(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192496c0 5(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019246680 6(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019243640 7(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019240600 8(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001923d5c0 9(32)
>> ...
>>
>> It looks like either the buffer allocation has alignment issue or the 
>> checking is not correct.
>>
>> Please advice how to fix this problem.
>>
>> Thanks,
>> JD Zheng
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-01 12:26 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-01 12:26 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 10234 bytes --]

Hi JD,

Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.

I think it's odd that we are using the buffer base for the memory check; we should be using the iov base. I don't believe that would cause the issue you are seeing, but I pushed a change to modify that behavior anyway: https://review.gerrithub.io/c/spdk/spdk/+/463893

There was one registration that I wasn't able to catch from your last log. Sorry about that; I forgot there wasn't a debug log for it. Can you try it again with this change, which adds noticelogs for the relevant registrations? https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -L rdma argument this time to avoid the extra bloat in the logs.

The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the DPDK code allocates some number of memzones to accommodate those buffer objects. It then passes those memzones down one at a time and places objects into the mempool from the given memzone until the memzone is exhausted, then goes back and grabs another memzone. This process continues until all objects are accounted for.
This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that may not be true.

Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.

Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?

Thanks,

Seth
-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com] 
Sent: Wednesday, July 31, 2019 3:13 PM
To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x2000084bf000 Length: 40000 LKey: e601
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x200008621000 Length: 10000 LKey: e701
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x200018600000 Length: 1000000 LKey: e801
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x20000847e000 Length: 40000 LKey: e701
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000846d000 Length: 10000 LKey: e801
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x200019800000 Length: 1000000 LKey: e901
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x200016ebb000 Length: 40000 LKey: e801
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000845c000 Length: 10000 LKey: e901
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x20001aa00000 Length: 1000000 LKey: ea01
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x200016e7a000 Length: 40000 LKey: e901
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000844b000 Length: 10000 LKey: ea01
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x20001bc00000 Length: 1000000 LKey: eb01 ...

Is this you are look for as memory regions registered for NIC?

I attached the complete log.

Thanks,
JD

On 7/30/19 5:28 PM, JD Zheng wrote:
> Hi Seth,
> 
> Thanks for the prompt reply!
> 
> Please find answers inline.
> 
> JD
> 
> On 7/30/19 5:01 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> Thanks for the report. I want to ask a few questions to start getting 
>> to the bottom of this. Since this issue doesn't currently reproduce 
>> on our per-patch or nightly tests, I would like to understand what's 
>> unique about your setup so that we can replicate it in a per patch 
>> test to prevent future regressions.
> I am running it on aarch64 platform. I tried x86 platform and I can 
> see same buffer alignment in memory pool but can't run the real test 
> to reproduce it due to other missing pieces.
> 
>>
>> What options are you passing when you create the rdma transport? Are 
>> you creating it over RPC or in a configuration file?
> I am using conf file. Pls let me know if you'd like to look into conf file.
> 
>>
>> Are you using the current DPDK submodule as your environment 
>> abstraction layer?
> No. Our project uses specific version of DPDK, which is v18.11. I did 
> quick test using latest and DPDK submodule on x86, and the buffer 
> alignment is the same, i.e. 64B aligned.
> 
>>
>> I notice that your error log is printing from 
>> spdk_nvmf_transport_poll_group_create, which value exactly are you 
>> printing out?
> Here is patch to add dbg print. Pls note that SPDK version is v19.04
> 
> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>                                  SPDK_NOTICELOG("Unable to reserve the 
> full number of buffers for the pg buffer cache.\n");
>                                  break;
>                          }
> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
> group->buf_cache_count, group->buf_cache_size);
>                          STAILQ_INSERT_HEAD(&group->buf_cache, buf, 
> link);
>                          group->buf_cache_count++;
>                  }
> 
>>
>> Can you run your target with the -L rdma option to get a dump of the 
>> memory regions registered with the NIC?
> Let me test and get back to you soon.
> 
>>
>> We made a couple of changes to this code when dynamic memory 
>> allocations were added to DPDK. There were some safeguards that we 
>> added to try and make sure this case wouldn't hit, so I'd like to 
>> make sure you are running on the latest DPDK submodule as well as the 
>> latest SPDK to narrow down where we need to look.
> Unfortunately I can't easily update DPDK because other team maintains 
> it internally. But if it can be repro and fixed in latest, I will try 
> to pull in the fix.
> 
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng 
>> via SPDK
>> Sent: Wednesday, July 31, 2019 3:00 AM
>> To: spdk(a)lists.01.org
>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>> RDMA Memory Regions
>>
>> Hello,
>>
>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally 
>> ran into this errors:
>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over 
>> multiple RDMA Memory Regions"
>>
>> After digging into the code, I found that nvmf_rdma_fill_buffers() 
>> calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB 
>> pages, and if it is the case, it reports this error.
>>
>> The following commit added change to use data buffer start address to 
>> calculate the size between buffer start address and 2MB boundary. The 
>> caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit 
>> size (which is 8KB in my conf) to determine if the buffer passes 2MB 
>> boundary.
>>
>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>
>>       memory: fix contiguous memory calculation for unaligned buffers
>>
>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new 
>> request will use free buffer from that pool and the buffer start 
>> address is passed to nvmf_rdma_fill_buffers(). But I found that these 
>> buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my
>> case) either, instead, they are 64Byte aligned so that some buffers 
>> will fail the checking and leads to this problem.
>>
>> The corresponding code snippets are as following:
>> spdk_nvmf_transport_create()
>> {
>> ...
>>       transport->data_buf_pool = 
>> pdk_mempool_create(spdk_mempool_name,
>>                                  opts->num_shared_buffers,
>>                                  opts->io_unit_size + 
>> NVMF_DATA_BUFFER_ALIGNMENT,
>>                                  SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>                                  SPDK_ENV_SOCKET_ID_ANY); ...
>> }
>>
>> Also some debug print I added shows the start address of the buffers:
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019258800 0(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192557c0 1(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019252780 2(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924f740 3(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924c700 4(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192496c0 5(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019246680 6(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019243640 7(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019240600 8(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001923d5c0 9(32)
>> ...
>>
>> It looks like either the buffer allocation has alignment issue or the 
>> checking is not correct.
>>
>> Please advice how to fix this problem.
>>
>> Thanks,
>> JD Zheng
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-07-31 22:13 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-07-31 22:13 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 8069 bytes --]

Hi Seth,

After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x2000084bf000 Length: 40000 LKey: e601
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x200008621000 Length: 10000 LKey: e701
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x200018600000 Length: 1000000 LKey: e801
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x20000847e000 Length: 40000 LKey: e701
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000846d000 Length: 10000 LKey: e801
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x200019800000 Length: 1000000 LKey: e901
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x200016ebb000 Length: 40000 LKey: e801
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000845c000 Length: 10000 LKey: e901
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x20001aa00000 Length: 1000000 LKey: ea01
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x200016e7a000 Length: 40000 LKey: e901
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000844b000 Length: 10000 LKey: ea01
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x20001bc00000 Length: 1000000 LKey: eb01
...

Is this what you are looking for as the memory regions registered with the NIC?

I attached the complete log.

Thanks,
JD

On 7/30/19 5:28 PM, JD Zheng wrote:
> Hi Seth,
> 
> Thanks for the prompt reply!
> 
> Please find answers inline.
> 
> JD
> 
> On 7/30/19 5:01 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> Thanks for the report. I want to ask a few questions to start getting 
>> to the bottom of this. Since this issue doesn't currently reproduce on 
>> our per-patch or nightly tests, I would like to understand what's 
>> unique about your setup so that we can replicate it in a per patch 
>> test to prevent future regressions.
> I am running it on aarch64 platform. I tried x86 platform and I can see 
> same buffer alignment in memory pool but can't run the real test to 
> reproduce it due to other missing pieces.
> 
>>
>> What options are you passing when you create the rdma transport? Are 
>> you creating it over RPC or in a configuration file?
> I am using conf file. Pls let me know if you'd like to look into conf file.
> 
>>
>> Are you using the current DPDK submodule as your environment 
>> abstraction layer?
> No. Our project uses specific version of DPDK, which is v18.11. I did 
> quick test using latest and DPDK submodule on x86, and the buffer 
> alignment is the same, i.e. 64B aligned.
> 
>>
>> I notice that your error log is printing from 
>> spdk_nvmf_transport_poll_group_create, which value exactly are you 
>> printing out?
> Here is patch to add dbg print. Pls note that SPDK version is v19.04
> 
> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>                                  SPDK_NOTICELOG("Unable to reserve the 
> full number of buffers for the pg buffer cache.\n");
>                                  break;
>                          }
> +                       SPDK_ERRLOG("%p %d(%d)\n", buf, 
> group->buf_cache_count, group->buf_cache_size);
>                          STAILQ_INSERT_HEAD(&group->buf_cache, buf, link);
>                          group->buf_cache_count++;
>                  }
> 
>>
>> Can you run your target with the -L rdma option to get a dump of the 
>> memory regions registered with the NIC?
> Let me test and get back to you soon.
> 
>>
>> We made a couple of changes to this code when dynamic memory 
>> allocations were added to DPDK. There were some safeguards that we 
>> added to try and make sure this case wouldn't hit, so I'd like to make 
>> sure you are running on the latest DPDK submodule as well as the 
>> latest SPDK to narrow down where we need to look.
> Unfortunately I can't easily update DPDK because other team maintains it 
> internally. But if it can be repro and fixed in latest, I will try to 
> pull in the fix.
> 
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng 
>> via SPDK
>> Sent: Wednesday, July 31, 2019 3:00 AM
>> To: spdk(a)lists.01.org
>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA 
>> Memory Regions
>>
>> Hello,
>>
>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally 
>> ran into this errors:
>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over 
>> multiple RDMA Memory Regions"
>>
>> After digging into the code, I found that nvmf_rdma_fill_buffers() 
>> calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB 
>> pages, and if it is the case, it reports this error.
>>
>> The following commit added change to use data buffer start address to 
>> calculate the size between buffer start address and 2MB boundary. The 
>> caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit 
>> size (which is 8KB in my conf) to determine if the buffer passes 2MB 
>> boundary.
>>
>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>
>>       memory: fix contiguous memory calculation for unaligned buffers
>>
>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new 
>> request will use free buffer from that pool and the buffer start 
>> address is passed to nvmf_rdma_fill_buffers(). But I found that these 
>> buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my 
>> case) either, instead, they are 64Byte aligned so that some buffers 
>> will fail the checking and leads to this problem.
>>
>> The corresponding code snippets are as following:
>> spdk_nvmf_transport_create()
>> {
>> ...
>>       transport->data_buf_pool = pdk_mempool_create(spdk_mempool_name,
>>                                  opts->num_shared_buffers,
>>                                  opts->io_unit_size + 
>> NVMF_DATA_BUFFER_ALIGNMENT,
>>                                  SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>                                  SPDK_ENV_SOCKET_ID_ANY); ...
>> }
>>
>> Also some debug print I added shows the start address of the buffers:
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019258800 0(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192557c0 1(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019252780 2(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924f740 3(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924c700 4(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192496c0 5(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019246680 6(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019243640 7(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019240600 8(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001923d5c0 9(32)
>> ...
>>
>> It looks like either the buffer allocation has alignment issue or the 
>> checking is not correct.
>>
>> Please advice how to fix this problem.
>>
>> Thanks,
>> JD Zheng
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-07-31  2:34 Rao, Anu H
  0 siblings, 0 replies; 20+ messages in thread
From: Rao, Anu H @ 2019-07-31  2:34 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 6210 bytes --]

Bobi

Sent from my iPhone

> On Jul 30, 2019, at 5:28 PM, JD Zheng via SPDK <spdk(a)lists.01.org> wrote:
> 
> Hi Seth,
> 
> Thanks for the prompt reply!
> 
> Please find answers inline.
> 
> JD
> 
>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>> Hi JD,
>> Thanks for the report. I want to ask a few questions to start getting to the bottom of this. Since this issue doesn't currently reproduce on our per-patch or nightly tests, I would like to understand what's unique about your setup so that we can replicate it in a per patch test to prevent future regressions.
> I am running it on aarch64 platform. I tried x86 platform and I can see same buffer alignment in memory pool but can't run the real test to reproduce it due to other missing pieces.
> 
>> What options are you passing when you create the rdma transport? Are you creating it over RPC or in a configuration file?
> I am using conf file. Pls let me know if you'd like to look into conf file.
> 
>> Are you using the current DPDK submodule as your environment abstraction layer?
> No. Our project uses specific version of DPDK, which is v18.11. I did quick test using latest and DPDK submodule on x86, and the buffer alignment is the same, i.e. 64B aligned.
> 
>> I notice that your error log is printing from spdk_nvmf_transport_poll_group_create, which value exactly are you printing out?
> Here is patch to add dbg print. Pls note that SPDK version is v19.04
> 
> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>                                SPDK_NOTICELOG("Unable to reserve the full number of buffers for the pg buffer cache.\n");
>                                break;
>                        }
> +                       SPDK_ERRLOG("%p %d(%d)\n", buf, group->buf_cache_count, group->buf_cache_size);
>                        STAILQ_INSERT_HEAD(&group->buf_cache, buf, link);
>                        group->buf_cache_count++;
>                }
> 
>> Can you run your target with the -L rdma option to get a dump of the memory regions registered with the NIC?
> Let me test and get back to you soon.
> 
>> We made a couple of changes to this code when dynamic memory allocations were added to DPDK. There were some safeguards that we added to try and make sure this case wouldn't hit, so I'd like to make sure you are running on the latest DPDK submodule as well as the latest SPDK to narrow down where we need to look.
> Unfortunately I can't easily update DPDK because other team maintains it internally. But if it can be repro and fixed in latest, I will try to pull in the fix.
> 
>> Thanks,
>> Seth
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng via SPDK
>> Sent: Wednesday, July 31, 2019 3:00 AM
>> To: spdk(a)lists.01.org
>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
>> Hello,
>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally ran into this errors:
>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over multiple RDMA Memory Regions"
>> After digging into the code, I found that nvmf_rdma_fill_buffers() calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB pages, and if it is the case, it reports this error.
>> The following commit added change to use data buffer start address to calculate the size between buffer start address and 2MB boundary. The caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit size (which is 8KB in my conf) to determine if the buffer passes 2MB boundary.
>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>      memory: fix contiguous memory calculation for unaligned buffers
>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new request will use free buffer from that pool and the buffer start address is passed to nvmf_rdma_fill_buffers(). But I found that these buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my case) either, instead, they are 64Byte aligned so that some buffers will fail the checking and leads to this problem.
>> The corresponding code snippets are as following:
>> spdk_nvmf_transport_create()
>> {
>> ...
>>      transport->data_buf_pool = pdk_mempool_create(spdk_mempool_name,
>>                                 opts->num_shared_buffers,
>>                                 opts->io_unit_size + NVMF_DATA_BUFFER_ALIGNMENT,
>>                                 SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>                                 SPDK_ENV_SOCKET_ID_ANY); ...
>> }
>> Also some debug print I added shows the start address of the buffers:
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019258800 0(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192557c0 1(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019252780 2(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924f740 3(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924c700 4(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192496c0 5(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019246680 6(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019243640 7(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019240600 8(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001923d5c0 9(32)
>> ...
>> It looks like either the buffer allocation has alignment issue or the checking is not correct.
>> Please advice how to fix this problem.
>> Thanks,
>> JD Zheng
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-07-31  0:28 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-07-31  0:28 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 6023 bytes --]

Hi Seth,

Thanks for the prompt reply!

Please find answers inline.

JD

On 7/30/19 5:01 PM, Howell, Seth wrote:
> Hi JD,
> 
> Thanks for the report. I want to ask a few questions to start getting to the bottom of this. Since this issue doesn't currently reproduce on our per-patch or nightly tests, I would like to understand what's unique about your setup so that we can replicate it in a per patch test to prevent future regressions.
I am running it on an aarch64 platform. I tried an x86 platform and I can see the same buffer alignment in the memory pool, but I can't run the real test to reproduce the issue due to other missing pieces.

> 
> What options are you passing when you create the rdma transport? Are you creating it over RPC or in a configuration file?
I am using a conf file. Please let me know if you'd like to look into it.

> 
> Are you using the current DPDK submodule as your environment abstraction layer?
No. Our project uses a specific version of DPDK, which is v18.11. I did a quick test on x86 using the latest SPDK with its DPDK submodule, and the buffer alignment is the same, i.e. 64B aligned.

> 
> I notice that your error log is printing from spdk_nvmf_transport_poll_group_create, which value exactly are you printing out?
Here is the patch that adds the debug print. Please note that the SPDK version is v19.04.

@@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
                                SPDK_NOTICELOG("Unable to reserve the full number of buffers for the pg buffer cache.\n");
                                break;
                        }
+                       SPDK_ERRLOG("%p %d(%d)\n", buf, group->buf_cache_count, group->buf_cache_size);
                        STAILQ_INSERT_HEAD(&group->buf_cache, buf, link);
                        group->buf_cache_count++;
                }

> 
> Can you run your target with the -L rdma option to get a dump of the memory regions registered with the NIC?
Let me test and get back to you soon.

> 
> We made a couple of changes to this code when dynamic memory allocations were added to DPDK. There were some safeguards that we added to try and make sure this case wouldn't hit, so I'd like to make sure you are running on the latest DPDK submodule as well as the latest SPDK to narrow down where we need to look.
Unfortunately, I can't easily update DPDK because another team maintains it internally. But if the issue can be reproduced and fixed on the latest code, I will try to pull in the fix.

> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng via SPDK
> Sent: Wednesday, July 31, 2019 3:00 AM
> To: spdk(a)lists.01.org
> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
> 
> Hello,
> 
> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally ran into this errors:
> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over multiple RDMA Memory Regions"
> 
> After digging into the code, I found that nvmf_rdma_fill_buffers() calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB pages, and if it is the case, it reports this error.
> 
> The following commit added change to use data buffer start address to calculate the size between buffer start address and 2MB boundary. The caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit size (which is 8KB in my conf) to determine if the buffer passes 2MB boundary.
> 
> commit 37b7a308941b996f0e69049358a6119ed90d70a2
> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
> Date:   Tue Nov 13 17:43:46 2018 +0100
> 
>       memory: fix contiguous memory calculation for unaligned buffers
> 
> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new request will use free buffer from that pool and the buffer start address is passed to nvmf_rdma_fill_buffers(). But I found that these buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my case) either, instead, they are 64Byte aligned so that some buffers will fail the checking and leads to this problem.
> 
> The corresponding code snippets are as following:
> spdk_nvmf_transport_create()
> {
> ...
>       transport->data_buf_pool = pdk_mempool_create(spdk_mempool_name,
>                                  opts->num_shared_buffers,
>                                  opts->io_unit_size + NVMF_DATA_BUFFER_ALIGNMENT,
>                                  SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>                                  SPDK_ENV_SOCKET_ID_ANY); ...
> }
> 
> Also some debug print I added shows the start address of the buffers:
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x200019258800 0(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x2000192557c0 1(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x200019252780 2(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x20001924f740 3(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x20001924c700 4(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x2000192496c0 5(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x200019246680 6(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x200019243640 7(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x200019240600 8(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x20001923d5c0 9(32)
> ...
> 
> It looks like either the buffer allocation has alignment issue or the checking is not correct.
> 
> Please advice how to fix this problem.
> 
> Thanks,
> JD Zheng
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-07-31  0:01 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-07-31  0:01 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 4482 bytes --]

Hi JD,

Thanks for the report. I want to ask a few questions to start getting to the bottom of this. Since this issue doesn't currently reproduce on our per-patch or nightly tests, I would like to understand what's unique about your setup so that we can replicate it in a per-patch test to prevent future regressions.

What options are you passing when you create the rdma transport? Are you creating it over RPC or in a configuration file?

Are you using the current DPDK submodule as your environment abstraction layer?

I notice that your error log is printed from spdk_nvmf_transport_poll_group_create; which value exactly are you printing out?

Can you run your target with the -L rdma option to get a dump of the memory regions registered with the NIC?
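
For reference, I mean an invocation along these lines (the binary path and conf file name below are just placeholders, and the debug log output requires a build configured with --enable-debug):

./app/nvmf_tgt/nvmf_tgt -c /path/to/nvmf.conf -L rdma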

We made a couple of changes to this code when dynamic memory allocations were added to DPDK. There were some safeguards that we added to try to make sure this case wouldn't be hit, so I'd like to make sure you are running on the latest DPDK submodule as well as the latest SPDK to narrow down where we need to look.

Thanks,

Seth

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng via SPDK
Sent: Wednesday, July 31, 2019 3:00 AM
To: spdk(a)lists.01.org
Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hello,

When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally ran into this errors:
"rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over multiple RDMA Memory Regions"

After digging into the code, I found that nvmf_rdma_fill_buffers() calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB pages, and if it is the case, it reports this error.

The following commit added change to use data buffer start address to calculate the size between buffer start address and 2MB boundary. The caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit size (which is 8KB in my conf) to determine if the buffer passes 2MB boundary.

commit 37b7a308941b996f0e69049358a6119ed90d70a2
Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
Date:   Tue Nov 13 17:43:46 2018 +0100

     memory: fix contiguous memory calculation for unaligned buffers

In nvmf_tgt, the buffers are pre-allocated as a memory pool and new request will use free buffer from that pool and the buffer start address is passed to nvmf_rdma_fill_buffers(). But I found that these buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my case) either, instead, they are 64Byte aligned so that some buffers will fail the checking and leads to this problem.

The corresponding code snippets are as following:
spdk_nvmf_transport_create()
{
...
     transport->data_buf_pool = pdk_mempool_create(spdk_mempool_name,
                                opts->num_shared_buffers,
                                opts->io_unit_size + NVMF_DATA_BUFFER_ALIGNMENT,
                                SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
                                SPDK_ENV_SOCKET_ID_ANY); ...
}

Also some debug print I added shows the start address of the buffers:
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019258800 0(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x2000192557c0 1(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019252780 2(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x20001924f740 3(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x20001924c700 4(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x2000192496c0 5(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019246680 6(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019243640 7(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019240600 8(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x20001923d5c0 9(32)
...

It looks like either the buffer allocation has alignment issue or the checking is not correct.

Please advice how to fix this problem.

Thanks,
JD Zheng
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-07-30 18:59 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-07-30 18:59 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 2963 bytes --]

Hello,

When I run nvmf_tgt over RDMA using the latest SPDK code, I occasionally run 
into this error:
"rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over 
multiple RDMA Memory Regions"

After digging into the code, I found that nvmf_rdma_fill_buffers() 
calls spdk_mem_map_translate() to check whether a data buffer sits on two 2MB 
pages, and if that is the case, it reports this error.
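
Roughly, the check amounts to something like the following (my paraphrase rather than the exact rdma.c code; map, buf, and io_unit_size stand in for the real variables, and the translation/lkey returned by the call is not shown):

uint64_t translation_len = io_unit_size;

/* spdk_mem_map_translate() trims translation_len down to the number of bytes
 * starting at buf that share a single translation, i.e. one registered region. */
spdk_mem_map_translate(map, (uint64_t)buf, &translation_len);

if (translation_len < io_unit_size) {
        /* The buffer does not fit inside one 2MB page / memory region. */
        SPDK_ERRLOG("Data buffer split over multiple RDMA Memory Regions\n");
}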

The following commit added a change that uses the data buffer start address to 
calculate the size between the buffer start address and the 2MB boundary. The 
caller nvmf_rdma_fill_buffers() compares that size with the IO unit 
size (which is 8KB in my conf) to determine whether the buffer crosses the 2MB 
boundary.

commit 37b7a308941b996f0e69049358a6119ed90d70a2
Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
Date:   Tue Nov 13 17:43:46 2018 +0100

     memory: fix contiguous memory calculation for unaligned buffers

In nvmf_tgt, the buffers are pre-allocated as a memory pool; a new 
request takes a free buffer from that pool, and the buffer start address 
is passed to nvmf_rdma_fill_buffers(). But I found that these buffers 
are neither 2MB aligned nor IOUnitSize aligned (8KB in my case); 
instead, they are only 64-byte aligned, so some buffers fail the 
check, which leads to this problem.

The corresponding code snippet is as follows:
spdk_nvmf_transport_create()
{
...
     transport->data_buf_pool = spdk_mempool_create(spdk_mempool_name,
                                opts->num_shared_buffers,
                                opts->io_unit_size + NVMF_DATA_BUFFER_ALIGNMENT,
                                SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
                                SPDK_ENV_SOCKET_ID_ANY);
...
}

Also, some debug prints I added show the start addresses of the buffers:
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019258800 0(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x2000192557c0 1(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019252780 2(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x20001924f740 3(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x20001924c700 4(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x2000192496c0 5(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019246680 6(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019243640 7(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019240600 8(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x20001923d5c0 9(32)
...
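
As a rough, self-contained illustration of why this eventually trips (plain arithmetic using the addresses above; the element count of 512 is arbitrary): consecutive pool elements are 0x3040 bytes apart, which is 64B aligned but does not divide 2MB evenly, so the start offsets drift within the hugepage and roughly one element in every 2MB / 0x3040 (about 170) ends up straddling a boundary:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t base = 0x200019258800ULL; /* first buffer address printed above */
        uint64_t stride = 0x3040;          /* spacing between consecutive buffers */
        uint64_t two_mb = 2ULL * 1024 * 1024;
        uint64_t io_unit = 8192;           /* IOUnitSize from my conf */
        unsigned i;

        for (i = 0; i < 512; i++) {
                uint64_t addr = base - (uint64_t)i * stride; /* addresses decrease in the log */
                uint64_t left_in_page = two_mb - (addr & (two_mb - 1));

                if (left_in_page < io_unit) {
                        printf("buffer %u at 0x%" PRIx64 " would cross a 2MB boundary\n",
                               i, addr);
                }
        }
        return 0;
}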

It looks like either the buffer allocation has an alignment issue or the 
check is not correct.

Please advise how to fix this problem.

Thanks,
JD Zheng

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-08-21 13:15 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-19 21:16 [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions Howell, Seth
  -- strict thread matches above, loose matches on Subject: below --
2019-08-21 13:15 Sasha Kotchubievsky
2019-08-20 14:39 Howell, Seth
2019-08-20 14:15 Howell, Seth
2019-08-20 12:22 Sasha Kotchubievsky
2019-08-19 21:42 JD Zheng
2019-08-19 21:02 JD Zheng
2019-08-19 20:12 Howell, Seth
2019-08-12 23:17 JD Zheng
2019-08-01 21:22 Howell, Seth
2019-08-01 21:00 JD Zheng
2019-08-01 20:28 Howell, Seth
2019-08-01 18:23 JD Zheng
2019-08-01 14:52 Howell, Seth
2019-08-01 12:26 Howell, Seth
2019-07-31 22:13 JD Zheng
2019-07-31  2:34 Rao, Anu H
2019-07-31  0:28 JD Zheng
2019-07-31  0:01 Howell, Seth
2019-07-30 18:59 JD Zheng
