* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-19 21:16 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-19 21:16 UTC (permalink / raw)
  To: spdk


Hi JD,

What issue specifically did you see? If there is something measurable happening (other than the error message) then I think it should be high priority to get a more permanent workaround into upstream SPDK.

Thanks,

Seth

-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com] 
Sent: Monday, August 19, 2019 2:03 PM
To: Howell, Seth <seth.howell(a)intel.com>
Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan Richardson <jonathan.richardson(a)broadcom.com>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

 > Unfortunately, the only way to protect fully against this happening is
 > by using the DPDK flag --match-allocations which was introduced in DPDK 19.02.

Then I need to use DPDK 19.02. Do I need to enable this flag explicitly when moving to DPDK 19.02?

 > The good news is that the SPDK target will skip these buffers without
 > bricking, causing data corruption or doing any otherwise bad things.

Unfortunately, this is not what I saw. It appeared that SPDK gave up on this split buffer, but it still caused issues, maybe because it was retried too many times(?).

Currently, I have to use DPDK 18.11, so I added a couple of workarounds to prevent the split buffer from being used before reaching fill_buffers(). I did a little trick there: call spdk_mempool_get() but not spdk_mempool_put() later, so that the buffer stays marked as "allocated" in the mempool and will not be tried again and again. It does look like a small memory leak, though. We usually see 2-3 split buffers during an overnight run, btw.

This seems to be working OK.
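A minimal sketch of that quarantine trick, assuming a hypothetical helper buffer_spans_two_mrs() standing in for the spdk_mem_map_translate() length check (an illustration of the idea, not the actual Broadcom patch):

    /* Pop a buffer from the transport's shared pool; if it straddles two RDMA
     * MRs, intentionally never spdk_mempool_put() it, so the pool keeps it
     * marked "allocated" and never hands it out again. The "leak" is bounded
     * by the handful of split buffers seen per run. */
    void *buf = spdk_mempool_get(rtransport->transport.data_buf_pool);
    if (buf != NULL && buffer_spans_two_mrs(device, buf, io_unit_size)) {
            /* Quarantine: skip spdk_mempool_put(buf) on purpose. */
            return NULL;   /* caller retries and gets a different buffer */
    }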

For sure, I will measure performance later.

Thanks,
JD


On 8/19/19 1:12 PM, Howell, Seth wrote:
> Hi JD,
> 
> Thanks for performing that experiment. With this new information, I 
> think we can be pretty much 100% sure that the problem is related to 
> the mempool being split over two DPDK memzones. Unfortunately, the 
> only way to protect fully against this happening is by using the DPDK 
> flag --match-allocations which was introduced in DPDK 19.02. Jim 
> helped advocate for this flag specifically because of this problem 
> with mempools and RDMA.
> 
> The good news is that the SPDK target will skip these buffers without 
> bricking, causing data corruption or doing any otherwise bad things.
> What ends up happening is that the nvmf_rdma_fill_buffers function 
> will print the error message and then return NULL which will trigger 
> the target to retry the I/O again. By that time, there will be another 
> buffer there for the request to use and it won’t fail the second time 
> around. So the code currently handles the problem in a technically 
> correct way, i.e., it's not going to brick the target or initiator by 
> trying to use a buffer that spans multiple Memory Regions. Instead, it 
> properly recognizes that it is trying to use a bad buffer and 
> reschedules the request buffer parsing.
> 
> However, I am a little worried over the fact that these buffers remain 
> in the mempool and can be repeatedly used by the application. I can 
> picture a scenario where this could possibly have a  performance impact.
> Take for example a mempool with 128 entries in it in which one of them 
> is split over a memzone. Since this split buffer will never find its 
> way into a request, it’s possible that this split buffer gets pulled 
> up into requests more often than other buffers and subsequently fails 
> in nvmf_rdma_fill_buffers causing requests to have to be rescheduled 
> to the next time the poller runs. Depending on how frequently this 
> happens, the performance impact *could possibly* add up.
> 
> I have as yet been unable to replicate the split buffer error. One 
> thing you could try to see if there is any measurable performance 
> impact is try starting the NVMe-oF target with DPDK legacy memory mode 
> which will move all memory allocations to startup and prevent you from 
> splitting buffers. Then run a benchmark with a lot of connections at 
> high queue depth and see what the performance looks like compared to 
> the dynamic memory model. If there is a significant performance 
> impact, we may have to modify how we handle this error case.
> 
> Thanks,
> 
> Seth
> 
> *From:*JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> *Sent:* Monday, August 12, 2019 4:17 PM
> *To:* Howell, Seth <seth.howell(a)intel.com>
> *Cc:* Storage Performance Development Kit <spdk(a)lists.01.org>; 
> Jonathan Richardson <jonathan.richardson(a)broadcom.com>
> *Subject:* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over 
> multiple RDMA Memory Regions
> 
> + Jonathan
> 
> Hi Seth,
> 
> We finally got a chance to test with more logs enabled. You are correct 
> that the problematic buffer does sit on 2 registered memory regions:
> 
> The problematic buffer is "Buffer address: 200019bfeb00", the actual used 
> buffer pointer is "200019bff000" (SPDK makes it 4 KiB aligned), and the size is 
> 8 KiB (0x2000), so it does sit on 2 registered memory regions.
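For reference, the 4 KiB alignment mentioned here comes from rounding the 64-byte-aligned mempool object up to the data buffer alignment; roughly (paraphrasing the rdma.c request-fill code, so treat field names as approximate):

    buf = spdk_mempool_get(rtransport->transport.data_buf_pool);  /* e.g. 0x200019bfeb00 */
    /* Align up to NVMF_DATA_BUFFER_ALIGNMENT (4 KiB):
     * 0x200019bfeb00 rounds up to 0x200019bff000, matching the log above. */
    rdma_req->req.iov[iovcnt].iov_base =
            (void *)(((uintptr_t)buf + NVMF_DATA_BUFFER_MASK) & ~(uintptr_t)NVMF_DATA_BUFFER_MASK);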
> 
> However, it looks like SPDK/DPDK allocates buffers starting from the end of a 
> region and going up, but due to the extra room and alignment of each 
> buffer, there is a chance that one buffer can exceed a memory region 
> boundary?
> 
> In this case, the buffers are between 0x200019997800 and 0x200019c5320, 
> so the last buffer exceeds one region and goes into the next one.
> 
> Some logs for your information:
> 
> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start: 
> 200019800000, memory region length: 400000
> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start: 
> 200019c00000, memory region length: 400000
> 
> ...
> 
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
> 0x200019bfeb00 27(32)
> 
> ...
> 
> rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer 
> address: 200019bfeb00 iov_base address 200019bff000
> 
> Thanks,
> 
> JD
> 
> On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com 
> <mailto:seth.howell(a)intel.com>> wrote:
> 
>     There are two different assignments that you need to look at. I'll
>     detail the cases below based on line numbers from the latest master.
> 
>     Memory.c:656 *size = spdk_min(*size, cur_size):
>              This assignment is inside of the conditional "if(size ==
>     NULL || map->ops.are_contiguous == NULL)"
>              So in other words, at the offset, we figure out how much
>     space we have left in the current translation. Then, if there is no
>     callback to tell us whether the next translation will be contiguous
>     to this one, we fill the size variable with the remaining length of
>     that 2 MiB buffer.
> 
>     Memory.c:682 *size = spdk_min(*size, cur_size):
>              This assignment comes after the while loop guarded by the
>     condition "while (cur_size < *size)". This while loop assumes that
>     we have supplied some desired length for our buffer. This is true in
>     the RDMA case. Now this while loop will only break on two
>     conditions. 1. Cur_size becomes larger than *size, or the
>     are_contiguous function returns false, meaning that the two
>     translations cannot be considered together. In the case of the RDMA
>     memory map, the only time are_contiguous returns false is when the
>     two memory regions correspond to two distinct RDMA MRs. Notice that
>     in this case - the one where are_contiguous is defined and we
>     supplied a size variable - the *size variable is not overwritten
>     with cur_size until 1. cur_size is >= *size or 2. The are_contiguous
>     check fails.
> 
>     In the second case detailed above, you can see how one could  pass
>     in a buffer that spanned a 2 MiB page and still get a translation
>     value equal to the size of the buffer. This second case is the one
>     that the rdma.c code should be using since we have a registered
>     are_contiguous function with the NIC and we have supplied a size
>     pointer filled with the length of our buffer.
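Paraphrasing the two cases above as pseudocode (a sketch of the logic being described, not the literal memory.c source):

    cur_size = VALUE_2MB - _2MB_OFFSET(vaddr);
    if (size == NULL || map->ops.are_contiguous == NULL) {
            /* Case 1: no contiguity callback -- clamp to the current 2 MiB chunk. */
            if (size != NULL) {
                    *size = spdk_min(*size, cur_size);
            }
    } else {
            /* Case 2 (RDMA): keep extending across 2 MiB chunks as long as the
             * callback says the two translations (i.e. the two MRs) match. */
            while (cur_size < *size &&
                   map->ops.are_contiguous(prev_translation, cur_translation)) {
                    cur_size += VALUE_2MB;
                    /* ...advance prev_translation/cur_translation to the next entry... */
            }
            *size = spdk_min(*size, cur_size);
    }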
> 
>     -----Original Message-----
>     From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>]
>     Sent: Thursday, August 1, 2019 2:01 PM
>     To: Howell, Seth <seth.howell(a)intel.com
>     <mailto:seth.howell(a)intel.com>>; Storage Performance Development Kit
>     <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>     Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple RDMA Memory Regions
> 
>     Hi Seth,
> 
>       > Just because a buffer extends past a 2 MiB boundary doesn't mean
>     that it exists in two different Memory Regions. It also won't fail
>     the translation for being over two memory regions.
> 
>     This makes sense. However, spdk_mem_map_translate() does following
>     to calculate translation_len:
> 
>     cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
>     *size = spdk_min(*size, cur_size); // *size is the translation_len
>     from caller nvmf_rdma_fill_buffers()
> 
>     In nvmf_rdma_fill_buffers(),
> 
>     if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>                              SPDK_ERRLOG("Data buffer split over
>     multiple RDMA Memory Regions\n");
>                              return -EINVAL;
>                      }
> 
>     This just checks whether the buffer sits on 2 2MB pages, not whether it
>     spans 2 RDMA memory regions. Is my understanding correct?
> 
>     I still need some time to test. I will update you with the result with -s
>     as well.
> 
>     Thanks,
>     JD
> 
> 
>     On 8/1/19 1:28 PM, Howell, Seth wrote:
>      > Hi JD,
>      >
>      > The 2 MiB check is just because we always do memory registrations
>     at at least 2 MiB granularity (the minimum hugepage size). Just
>     because a buffer extends past a 2 MiB boundary doesn't mean that it
>     exists in two different Memory Regions. It also won't fail the
>     translation for being over two memory regions.
>      >
>      > If you look at the definition of spdk_mem_map_translate we call
>     map->ops->are_contiguous every time we cross a 2 MiB boundary. For
>     RDMA, this function is registered to
>     spdk_nvmf_rdma_check_contiguous_entries. If this function returns
>     true, then even if the buffer crosses a 2 MiB boundary, the
>     translation will still be valid.
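For reference, that callback is essentially a comparison of the stored translations, which for this map are the ibv_mr handles registered in spdk_nvmf_rdma_mem_notify; roughly (quoted from memory, so treat it as a sketch):

    static int
    spdk_nvmf_rdma_check_contiguous_entries(uint64_t addr_1, uint64_t addr_2)
    {
            /* Two 2 MiB chunks are "contiguous" only if both translate to the
             * same registered MR. */
            return addr_1 == addr_2;
    }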
>      > The problem you are running into is not related to the buffer
>     alignment, it is related to the fact that the two pages across which
>     the buffer is split are registered to two different MRs in the NIC.
>     This can only happen if those two pages are allocated independently
>     and trigger two distinct memory event callbacks.
>      >
>      > That is why I am so interested in seeing the results from the
>     noticelog above ibv_reg_mr. It will tell me how your target
>     application is allocating memory. Also, when you start the SPDK
>     target, are you using the -s option? Something like
>     ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know
>     if it'll make a difference, it's more of a curiosity thing for me)?
>      >
>      > Thanks,
>      >
>      > Seth
>      >
>      > -----Original Message-----
>      > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>]
>      > Sent: Thursday, August 1, 2019 11:24 AM
>      > To: Howell, Seth <seth.howell(a)intel.com
>     <mailto:seth.howell(a)intel.com>>; Storage Performance
>      > Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>      > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple
>      > RDMA Memory Regions
>      >
>      > Hi Seth,
>      >
>      > Thanks for the detailed description; now I understand the reason
>     behind the check. But I have a question: why check against
>     2MiB? Is it because DPDK uses a 2MiB page size by default, so that one
>     RDMA memory region should not cross 2 pages?
>      >
>      >   > Once I see what your memory registrations look like and what
>     addresses you're failing on, it will help me understand what is
>     going on better.
>      >
>      > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7
>     +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport
>     *rtransport,
>      >                   remaining_length -=
>      > rdma_req->req.iov[iovcnt].iov_len;
>      >
>      >                   if (translation_len <
>     rdma_req->req.iov[iovcnt].iov_len) {
>      > -                       SPDK_ERRLOG("Data buffer split over multiple
>      > RDMA Memory Regions\n");
>      > +                       SPDK_ERRLOG("Data buffer split over multiple
>      > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
>     rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
>     translation_len, rdma_req->req.iov[iovcnt].iov_len);
>      >                           return -EINVAL;
>      >                   }
>      >
>      > With this I can see which buffer failed the checking.
>      > For example, when SPDK initializes the memory pool, one of the
>     buffers starts with 0x2000193feb00, and when it failed, I got the following:
>      >
>      > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
>      > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376)
>     (8192)
>      >
>      > This buffer has 5376B on one 2MB page and the rest of it
>      > (8192-5376=2816B) is on another page.
>      >
>      > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to
>     use the iov base should make it better, as the iov base is 4KiB aligned. In
>     the above case, iov_base is 0x2000193feb00 & ~0xfff = 0x2000193fe000 and
>     it should pass the check.
>      > However, another buffer in the pool is 0x2000192010c0 and its
>     iov_base is 0x200019201000, which would fail the check because it
>     is only 4KiB to the 2MB boundary and IOUnitSize is 8KiB.
>      >
>      > I will add the change from
>      > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the
>     test to get more information.
>      >
>      > I also attached the conf file. The cmd line is "nvmf_tgt -m 0xff
>      > -j
>      > 0x90000000:0x20000000 -c 16disk_1ns.conf"
>      >
>      > Thanks,
>      > JD
>      >
>      >
>      > On 8/1/19 7:52 AM, Howell, Seth wrote:
>      >> Hi JD,
>      >>
>      >> I was doing a little bit of digging in the dpdk documentation
>     around this process, and I have a little bit more information. We
>     were pretty worried about the whole dynamic memory allocations thing
>     a few releases ago, so Jim helped add a flag into DPDK that
>     prevented allocations from being allocated and freed in different
>     granularities. This flag also prevents malloc heap allocations from
>     spanning multiple memory events. However, this flag didn't make it
>     into DPDK until 19.02 (More documentation at
>     https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer
>     if you're interested). We have some code in the SPDK environment
>     layer that tries to deal with that (see
>     lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that
>     function is entirely capable of handling the heap allocations
>     spanning multiple memory events part of the problem.
>      >> Since you are using dpdk 18.11, the memory callback inside of
>     lib/env_dpdk looks like a good candidate for our issue. My best
>     guess is that somehow a heap allocation from the buffer mempool is
>     hitting across addresses from two dynamic memory allocation events.
>     I'd still appreciate it if you could send me the information in my
>     last e-mail, but I think we're onto something here.
>      >>
>      >> Thanks,
>      >>
>      >> Seth
>      >>
>      >> -----Original Message-----
>      >> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>     <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of Howell,
>      >> Seth
>      >> Sent: Thursday, August 1, 2019 5:26 AM
>      >> To: JD Zheng <jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>>; Storage Performance
>      >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>      >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple
>      >> RDMA Memory Regions
>      >>
>      >> Hi JD,
>      >>
>      >> Thanks for doing that. Yeah, I am mainly looking to see how the
>     mempool addresses are mapped into the NIC with ibv_reg_mr.
>      >>
>      >> I think it's odd that we are using the buffer base for the memory
>      >> check, we should be using the iov base, but I don't believe that
>      >> would cause the issue you are seeing. Pushed a change to modify
>     that
>      >> behavior anyways though:
>      >> https://review.gerrithub.io/c/spdk/spdk/+/463893
>      >>
>      >> There was one registration that I wasn't able to catch from your
>     last log. Sorry about that, I forgot there wasn’t a debug log for
>     it. Can you try it again with this change, which adds noticelogs for
>     the relevant registrations?
>     https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able
>     to run your test without the -Lrdma argument this time to avoid the
>     extra bloat in the logs.
>      >>
>      >> The underlying assumption of the code is that any given object
>     is not going to cross a dynamic memory allocation from DPDK. For a
>     little background, when the mempool gets created, the dpdk code
>     allocates some number of memzones to accommodate those buffer
>     objects. Then it passes those memzones down one at a time and places
>     objects inside the mempool from the given memzone until the memzone
>     is exhausted. Then it goes back and grabs another memzone. This
>     process continues until all objects are accounted for.
>      >> This only works if each memzone corresponds to a single memory
>     event when using dynamic memory allocation. My understanding was
>     that this was always the case, but this error makes me think that
>     it's possible that that's not true.
>      >>
>      >> Once I see what your memory registrations look like and what
>     addresses you're failing on, it will help me understand what is
>     going on better.
>      >>
>      >> Can you also provide the command line you are using to start the
>     nvmf_tgt application and attach your configuration file?
>      >>
>      >> Thanks,
>      >>
>      >> Seth
>      >> -----Original Message-----
>      >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>]
>      >> Sent: Wednesday, July 31, 2019 3:13 PM
>      >> To: Howell, Seth <seth.howell(a)intel.com
>     <mailto:seth.howell(a)intel.com>>; Storage Performance
>      >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>      >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple
>      >> RDMA Memory Regions
>      >>
>      >> Hi Seth,
>      >>
>      >> After I enabled debug and ran nvmf_tgt with -L rdma, I got some
>     logs like:
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x2000084bf000 Length: 40000 LKey: e601
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x200008621000 Length: 10000 LKey: e701
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x200018600000 Length: 1000000 LKey: e801
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x20000847e000 Length: 40000 LKey: e701
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x20000846d000 Length: 10000 LKey: e801
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x200019800000 Length: 1000000 LKey: e901
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x200016ebb000 Length: 40000 LKey: e801
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x20000845c000 Length: 10000 LKey: e901
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x20001aa00000 Length: 1000000 LKey: ea01
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x200016e7a000 Length: 40000 LKey: e901
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x20000844b000 Length: 10000 LKey: ea01
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>      >>
>      >> Is this what you are looking for as the memory regions registered for the NIC?
>      >>
>      >> I attached the complete log.
>      >>
>      >> Thanks,
>      >> JD
>      >>
>      >> On 7/30/19 5:28 PM, JD Zheng wrote:
>      >>> Hi Seth,
>      >>>
>      >>> Thanks for the prompt reply!
>      >>>
>      >>> Please find answers inline.
>      >>>
>      >>> JD
>      >>>
>      >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>      >>>> Hi JD,
>      >>>>
>      >>>> Thanks for the report. I want to ask a few questions to start
>      >>>> getting to the bottom of this. Since this issue doesn't currently
>      >>>> reproduce on our per-patch or nightly tests, I would like to
>      >>>> understand what's unique about your setup so that we can
>     replicate
>      >>>> it in a per patch test to prevent future regressions.
>      >>> I am running it on an aarch64 platform. I tried an x86 platform and I
>     can
>      >>> see the same buffer alignment in the memory pool but can't run the real
>     test
>      >>> to reproduce it due to other missing pieces.
>      >>>
>      >>>>
>      >>>> What options are you passing when you create the rdma transport?
>      >>>> Are you creating it over RPC or in a configuration file?
>      >>> I am using a conf file. Please let me know if you'd like to look
>     into the conf file.
>      >>>
>      >>>>
>      >>>> Are you using the current DPDK submodule as your environment
>      >>>> abstraction layer?
>      >>> No. Our project uses a specific version of DPDK, which is v18.11. I
>      >>> did a quick test using the latest SPDK and DPDK submodule on x86, and the
>      >>> buffer alignment is the same, i.e. 64B aligned.
>      >>>
>      >>>>
>      >>>> I notice that your error log is printing from
>      >>>> spdk_nvmf_transport_poll_group_create, which value exactly are
>     you
>      >>>> printing out?
>      >>> Here is the patch to add the dbg print. Please note that the SPDK version is
>     v19.04.
>      >>>
>      >>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>      >>>                                    SPDK_NOTICELOG("Unable to
>     reserve
>      >>> the full number of buffers for the pg buffer cache.\n");
>      >>>                                    break;
>      >>>                            }
>      >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>      >>> group->buf_cache_count, group->buf_cache_size);
>      >>>                            STAILQ_INSERT_HEAD(&group->buf_cache,
>      >>> buf, link);
>      >>>                            group->buf_cache_count++;
>      >>>                    }
>      >>>
>      >>>>
>      >>>> Can you run your target with the -L rdma option to get a dump of
>      >>>> the memory regions registered with the NIC?
>      >>> Let me test and get back to you soon.
>      >>>
>      >>>>
>      >>>> We made a couple of changes to this code when dynamic memory
>      >>>> allocations were added to DPDK. There were some safeguards
>     that we
>      >>>> added to try and make sure this case wouldn't hit, so I'd like to
>      >>>> make sure you are running on the latest DPDK submodule as well as
>      >>>> the latest SPDK to narrow down where we need to look.
>      >>> Unfortunately, I can't easily update DPDK because another team
>      >>> maintains it internally. But if it can be reproduced and fixed in the
>     latest,
>      >>> I will try to pull in the fix.
>      >>>
>      >>>>
>      >>>> Thanks,
>      >>>>
>      >>>> Seth
>      >>>>
>      >>>> -----Original Message-----
>      >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>     <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of JD Zheng
>      >>>> via SPDK
>      >>>> Sent: Wednesday, July 31, 2019 3:00 AM
>      >>>> To: spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>
>      >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>>
>      >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>      >>>> RDMA Memory Regions
>      >>>>
>      >>>> Hello,
>      >>>>
>      >>>> When I run nvmf_tgt over RDMA using the latest SPDK code, I
>      >>>> occasionally run into this error:
>      >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split
>      >>>> over multiple RDMA Memory Regions"
>      >>>>
>      >>>> After digging into the code, I found that
>     nvmf_rdma_fill_buffers()
>      >>>> calls spdk_mem_map_translate() to check if a data buffer sits on 2
>      >>>> 2MB pages, and if that is the case, it reports this error.
>      >>>>
>      >>>> The following commit added a change to use the data buffer start
>     address
>      >>>> to calculate the size between the buffer start address and the 2MB
>     boundary.
>      >>>> The caller nvmf_rdma_fill_buffers() compares that size with the
>      >>>> IO Unit size (which is 8KB in my conf) to determine if the buffer
>      >>>> crosses a 2MB boundary.
>      >>>>
>      >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>      >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>     <mailto:dariusz.stojaczyk(a)intel.com>>
>      >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>      >>>>
>      >>>>         memory: fix contiguous memory calculation for unaligned
>      >>>> buffers
>      >>>>
>      >>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool,
>     a new
>      >>>> request uses a free buffer from that pool, and the buffer start
>      >>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
>      >>>> these buffers are neither 2MB aligned nor IOUnitSize aligned (8KB
>      >>>> in my
>      >>>> case); instead, they are 64-byte aligned, so some
>     buffers
>      >>>> will fail the check, which leads to this problem.
>      >>>>
>      >>>> The corresponding code snippet is as follows:
>      >>>> spdk_nvmf_transport_create()
>      >>>> {
>      >>>> ...
>      >>>>         transport->data_buf_pool = spdk_mempool_create(spdk_mempool_name,
>      >>>>                                    opts->num_shared_buffers,
>      >>>>                                    opts->io_unit_size + NVMF_DATA_BUFFER_ALIGNMENT,
>      >>>>                                    SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>      >>>>                                    SPDK_ENV_SOCKET_ID_ANY);
>      >>>> ...
>      >>>> }
>      >>>>
>      >>>> Also, some debug prints I added show the start addresses of the
>     buffers:
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019258800 0(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x2000192557c0 1(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019252780 2(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x20001924f740 3(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x20001924c700 4(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x2000192496c0 5(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019246680 6(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019243640 7(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019240600 8(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x20001923d5c0 9(32)
>      >>>> ...
>      >>>>
>      >>>> It looks like either the buffer allocation has an alignment issue or
>      >>>> the check is not correct.
>      >>>>
>      >>>> Please advise how to fix this problem.
>      >>>>
>      >>>> Thanks,
>      >>>> JD Zheng
>      >>>> _______________________________________________
>      >>>> SPDK mailing list
>      >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>      >>>> https://lists.01.org/mailman/listinfo/spdk
>      >>>>
>      >> _______________________________________________
>      >> SPDK mailing list
>      >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>      >> https://lists.01.org/mailman/listinfo/spdk
>      >>
> 


* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-21 13:15 Sasha Kotchubievsky
  0 siblings, 0 replies; 20+ messages in thread
From: Sasha Kotchubievsky @ 2019-08-21 13:15 UTC (permalink / raw)
  To: spdk


Seth,

Thanks for the update!

Sasha

On 20-Aug-19 5:15 PM, Howell, Seth wrote:
> Hi,
>
>> I think dynamic memory allocation doesn't really work for the RDMA case.
> Dynamic memory allocation does work for the RDMA case on the latest master. That is specifically why we use the match-allocations flag in DPDK. The Broadcom case is distinct from stock SPDK in that they are using an older version of DPDK than the submodule which doesn't support this flag and has to use mitigations such as the one you mentioned below to attempt to work around the problems we faced before DPDK was updated.
>
>> Commit 9cec99b84b9a08e9122ada4f4455172e40ff6c06 already removes memory "free" for dynamically allocated memory.
> True, but that is a mitigation for DPDK submodules between 18.05 and 19.02 which don't support the match-allocations flag. If you look at the preprocessor directives around this flag on master, it is only applicable if the RTE_VERSION is >= 18.05 and < 19.02.
>
> Anyone using the stock SPDK with the DPDK submodule should be able to rely on DPDK dynamic allocations with the RDMA case.
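A rough sketch of the version guards being described, paraphrasing the SPDK env_dpdk initialization (helper names like push_arg/_sprintf_alloc are approximations of internal helpers, not a public API):

    #if RTE_VERSION >= RTE_VERSION_NUM(19, 2, 0, 0)
            /* DPDK guarantees memory is freed back in the same granularity it was
             * allocated, so a mempool object can never straddle two memory events. */
            args = push_arg(args, &argcount, _sprintf_alloc("--match-allocations"));
    #elif RTE_VERSION >= RTE_VERSION_NUM(18, 5, 0, 0)
            /* 18.05 <= DPDK < 19.02: the mitigation from commit 9cec99b8 applies
             * instead -- dynamically allocated memory is simply never released. */
    #endif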
>
> Thanks,
>
> Seth
>
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha Kotchubievsky
> Sent: Tuesday, August 20, 2019 5:22 AM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
>
> Hi,
>
> I think dynamic memory allocation doesn't really work for the RDMA case.
>
> Commit 9cec99b84b9a08e9122ada4f4455172e40ff6c06 already removes memory "free" for dynamically allocated memory.
>
> I'd suggest pre-allocating enough memory for the nvmf target using the "-s" option.
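For example, something along these lines (illustrative size only; -s takes MB):

    ./app/nvmf_tgt/nvmf_tgt -s 4096 -m 0xff -c 16disk_1ns.conf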
>
>
> Best regards
>
> Sasha
>
> On 20-Aug-19 12:42 AM, JD Zheng via SPDK wrote:
>> Hi Seth,
>>
>> It sometimes triggered a seg fault, but I couldn't get a backtrace due to a
>> likely corrupted stack. With my workaround, this is no longer seen.
>>
>> Let me submit my change as an RFC to Gerrit. It probably isn't necessary
>> to upstream it, as DPDK 19.02 should fix this problem properly.
>>
>> Thanks,
>> JD
>>
>>>>        >>>>
>>>>        >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>>>        >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>>>>       <mailto:dariusz.stojaczyk(a)intel.com>>
>>>>        >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>>        >>>>
>>>>        >>>>         memory: fix contiguous memory calculation for
>>>> unaligned
>>>>        >>>> buffers
>>>>        >>>>
>>>>        >>>> In nvmf_tgt, the buffers are pre-allocated as a memory
>>>> pool
>>>>       and new
>>>>        >>>> request will use free buffer from that pool and the
>>>> buffer start
>>>>        >>>> address is passed to nvmf_rdma_fill_buffers(). But I
>>>> found that
>>>>        >>>> these buffers are not 2MB aligned and not IOUnitSize
>>>> aligned (8KB
>>>>        >>>> in my
>>>>        >>>> case) either, instead, they are 64Byte aligned so that
>>>> some
>>>>       buffers
>>>>        >>>> will fail the checking and leads to this problem.
>>>>        >>>>
>>>>        >>>> The corresponding code snippets are as following:
>>>>        >>>> spdk_nvmf_transport_create()
>>>>        >>>> {
>>>>        >>>> ...
>>>>        >>>>         transport->data_buf_pool =
>>>>        >>>> pdk_mempool_create(spdk_mempool_name,
>>>>        >>>>  opts->num_shared_buffers,
>>>>        >>>>  opts->io_unit_size +
>>>>        >>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>>        >>>>
>>>>         SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>>        >>>>  SPDK_ENV_SOCKET_ID_ANY); ...
>>>>        >>>> }
>>>>        >>>>
>>>>        >>>> Also some debug print I added shows the start address of
>>>> the
>>>>       buffers:
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x200019258800 0(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x2000192557c0 1(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x200019252780 2(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x20001924f740 3(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x20001924c700 4(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x2000192496c0 5(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x200019246680 6(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x200019243640 7(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x200019240600 8(32)
>>>>        >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create:
>>>> *ERROR*:
>>>>        >>>> 0x20001923d5c0 9(32)
>>>>        >>>> ...
>>>>        >>>>
>>>>        >>>> It looks like either the buffer allocation has alignment
>>>> issue or
>>>>        >>>> the checking is not correct.
>>>>        >>>>
>>>>        >>>> Please advice how to fix this problem.
>>>>        >>>>
>>>>        >>>> Thanks,
>>>>        >>>> JD Zheng
>>>>        >>>> _______________________________________________
>>>>        >>>> SPDK mailing list
>>>>        >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>>>        >>>> https://lists.01.org/mailman/listinfo/spdk
>>>>        >>>>
>>>>        >> _______________________________________________
>>>>        >> SPDK mailing list
>>>>        >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>>>        >> https://lists.01.org/mailman/listinfo/spdk
>>>>        >>
>>>>
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-20 14:39 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-20 14:39 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 32105 bytes --]

Hmm. OK. Would you be willing to share the test script you are running to get the segfault? I would really like to get to the bottom of this. It's one thing if we are just tossing an error message from time to time, but if running SPDK against an older DPDK submodule can cause it to brick, I think we need to fully squash this.

-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com] 
Sent: Monday, August 19, 2019 2:42 PM
To: Howell, Seth <seth.howell(a)intel.com>
Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan Richardson <jonathan.richardson(a)broadcom.com>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

It sometimes triggered a segfault, but I couldn't get a backtrace because the stack was likely corrupted. With my workaround, this is no longer seen.

Let me submit my change to Gerrit as an RFC. It probably isn't necessary to upstream it, since DPDK 19.02 should fix this problem properly.
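
Roughly, the workaround looks like the sketch below. This is only an
illustration of the idea, not the actual change I will push to Gerrit; the
helper name and the hard-coded 4 KiB / 2 MiB constants are assumptions from
my setup, and it conservatively drops every buffer whose I/O unit would
cross a 2 MiB boundary, which is stricter than the real MR check:

#include <stdint.h>
#include <stdlib.h>
#include "spdk/env.h"

/* Drain the transport's data_buf_pool once and simply never return the
 * buffers whose (4 KiB aligned) I/O region would cross a 2 MiB boundary.
 * They stay "allocated" in the mempool, so the fill-buffers path can
 * never pick them up again. */
static void
quarantine_split_buffers(struct spdk_mempool *pool, size_t io_unit_size)
{
        void **good = calloc(spdk_mempool_count(pool), sizeof(void *));
        size_t num_good = 0;
        void *buf;

        if (good == NULL) {
                return;
        }

        while ((buf = spdk_mempool_get(pool)) != NULL) {
                /* Same 4 KiB alignment the transport applies to iov_base. */
                uintptr_t iov_base = ((uintptr_t)buf + 0xFFF) & ~(uintptr_t)0xFFF;
                uintptr_t first_2mb = iov_base & ~(uintptr_t)0x1FFFFF;
                uintptr_t last_2mb = (iov_base + io_unit_size - 1) & ~(uintptr_t)0x1FFFFF;

                if (first_2mb == last_2mb) {
                        good[num_good++] = buf;
                } /* else: intentionally leave it "allocated" in the mempool. */
        }

        while (num_good > 0) {
                spdk_mempool_put(pool, good[--num_good]);
        }
        free(good);
}

My actual hack does this lazily, when a split buffer is first seen before
reaching fill_buffers(), but the effect is the same: the 2-3 split buffers
we see overnight become a small, bounded leak instead of a recurring error.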

Thanks,
JD

On 8/19/19 2:16 PM, Howell, Seth wrote:
> Hi JD,
> 
> What issue specifically did you see? If there is something measurable happening (other than the error message) then I think it should be high priority to get a more permanent workaround into upstream SPDK.
> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Monday, August 19, 2019 2:03 PM
> To: Howell, Seth <seth.howell(a)intel.com>
> Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan 
> Richardson <jonathan.richardson(a)broadcom.com>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
> RDMA Memory Regions
> 
> Hi Seth,
> 
>   > Unfortunately, the only way
>   > to protect fully against this happening is by using the DPDK flag  > --match-allocations which was introduced in DPDK 19.02.
> 
> Then I need to use DPDK 19.02. Do I need to enable this flag explicitly when moving DPDK 19.02?
> 
>   > The good news is that the SPDK target will skip these buffers without  > bricking, causing data corruption or doing any otherwise bad things.
> 
> Unfortunately this is not what I saw. It appeared that SPDK gave up this split buffer, but it still causes issue, maybe because it was tried too many times(?).
> 
> Currently, I have to use DPDK 18.11 so that I added a couple of workarounds to prevent the split buffer from being used before reaching fill_buffers(). I did a little trick there to call spdk_mempool_get() but not mempool_put later, so that this buffer is set as "allocated" in mempool and will not be tried again and again. It does look like small memory leak though. We can usually see 2-3 split buffers during overnight run, btw.
> 
> This seems working OK.
> 
> For sure, I will measure performance later.
> 
> Thanks,
> JD
> 
> 
> On 8/19/19 1:12 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> Thanks for performing that experiment. With this new information, I 
>> think we can be pretty much 100% sure that the problem is related to 
>> the mempool being split over two DPDK memzones. Unfortunately, the 
>> only way to protect fully against this happening is by using the DPDK 
>> flag --match-allocations which was introduced in DPDK 19.02. Jim 
>> helped advocate for this flag specifically because of this problem 
>> with mempools and RPMA.
>>
>> The good news is that the SPDK target will skip these buffers without 
>> bricking, causing data corruption or doing any otherwise bad things.
>> What ends up happening is that the nvmf_rdma_fill_buffers function 
>> will print the error message and then return NULL which will trigger 
>> the target to retry the I/O again. By that time, there will be 
>> another buffer there for the request to use and it won’t fail the 
>> second time around. So the code currently handles the problem in a 
>> technically correct way i.e. It’s not going to brick the target or 
>> initiator by trying to use a buffer that spans multiple Memory 
>> Regions. Instead, it properly recognizes that it is trying to use a 
>> bad buffer and reschedules the request buffer parsing.
>>
>> However, I am a little worried over the fact that these buffers 
>> remain in the mempool and can be repeatedly used by the application. 
>> I can picture a scenario where this could possibly have a  performance impact.
>> Take for example a mempool with 128 entries in it in which one of 
>> them is split over a memzone. Since this split buffer will never find 
>> its way into a request, it’s possible that this split buffer gets 
>> pulled up into requests more often than other buffers and 
>> subsequently fails in nvmf_rdma_fill_buffers causing requests to have 
>> to be rescheduled to the next time the poller runs. Depending on how 
>> frequently this happens, the performance impact *could possibly* add up.
>>
>> I have as yet been unable to replicate the split buffer error. One 
>> thing you could try to see if there is any measurable performance 
>> impact is try starting the NVMe-oF target with DPDK legacy memory 
>> mode which will move all memory allocations to startup and prevent 
>> you from splitting buffers. Then run a benchmark with a lot of 
>> connections at high queue depth and see what the performance looks 
>> like compared to the dynamic memory model. If there is a significant 
>> performance impact, we may have to modify how we handle this error case.
>>
>> Thanks,
>>
>> Seth
>>
>> *From:*JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> *Sent:* Monday, August 12, 2019 4:17 PM
>> *To:* Howell, Seth <seth.howell(a)intel.com>
>> *Cc:* Storage Performance Development Kit <spdk(a)lists.01.org>; 
>> Jonathan Richardson <jonathan.richardson(a)broadcom.com>
>> *Subject:* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over 
>> multiple RDMA Memory Regions
>>
>> + Jonanthan
>>
>> Hi Seth,
>>
>> We finally got chance to test with more logs enabled. You are correct 
>> that that problematic buffer does sit on 2 registered memory regions:
>>
>> The problematic buffer is "Buffer address:*200019bfeb00", *actual 
>> used buffer pointer is "*200019bff000*" (SPDK makes it 4KiB aligned), 
>> size is
>> 8KiB(0x2000) so it does sit on 2 registered memory region.
>>
>> However, looks like SPDK/DPDK allocates buffers starting from end of 
>> a region and going up, but due to the extra room and alignment of 
>> each buffer and there is chance that one buffer can exceed memory 
>> region boundary?
>>
>> In this case, the buffers are between 0x200019997800 and 
>> 0x200019c5320 so that last buffer exceeds one region and goes to next one.
>>
>> Some logs for your information:
>>
>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>> 200019800000, memory region length: 400000
>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>> 200019c00000, memory region length: 400000
>>
>> ...
>>
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019bfeb00 27(32)
>>
>> ...
>>
>> rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer 
>> address:**200019bfeb00**iov_base address 200019bff000
>>
>> Thanks,
>>
>> JD
>>
>> On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com 
>> <mailto:seth.howell(a)intel.com>> wrote:
>>
>>      There are two different assignments that you need to look at. I'll
>>      detail the cases below based on line numbers from the latest master.
>>
>>      Memory.c:656 *size = spdk_min(*size, cur_size):
>>               This assignment is inside of the conditional "if(size ==
>>      NULL || map->ops.are_contiguous == NULL)"
>>               So in other words, at the offset, we figure out how much
>>      space we have left in the current translation. Then, if there is no
>>      callback to tell us whether the next translation will be contiguous
>>      to this one, we fill the size variable with the remaining length of
>>      that 2 MiB buffer.
>>
>>      Memory.c:682 *size = spdk_min(*size, cur_size):
>>               This assignment comes after the while loop guarded by the
>>      condition "while (cur_size < *size)". This while loop assumes that
>>      we have supplied some desired length for our buffer. This is true in
>>      the RDMA case. Now this while loop will only break on two
>>      conditions. 1. Cur_size becomes larger than *size, or the
>>      are_contiguous function returns false, meaning that the two
>>      translations cannot be considered together. In the case of the RDMA
>>      memory map, the only time are_contiguous returns false is when the
>>      two memory regions correspond to two distinct RDMA MRs. Notice that
>>      in this case - the one where are_contiguous is defined and we
>>      supplied a size variable - the *size variable is not overwritten
>>      with cur_size until 1. cur_size is >= *size or 2. The are_contiguous
>>      check fails.
>>
>>      In the second case detailed above, you can see how one could  pass
>>      in a buffer that spanned a 2 MiB page and still get a translation
>>      value equal to the size of the buffer. This second case is the one
>>      that the rdma.c code should be using since we have a registered
>>      are_contiguous function with the NIC and we have supplied a size
>>      pointer filled with the length of our buffer.
>>
>>      -----Original Message-----
>>      From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>      Sent: Thursday, August 1, 2019 2:01 PM
>>      To: Howell, Seth <seth.howell(a)intel.com
>>      <mailto:seth.howell(a)intel.com>>; Storage Performance Development Kit
>>      <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>      Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple RDMA Memory Regions
>>
>>      Hi Seth,
>>
>>        > Just because a buffer extends past a 2 MiB boundary doesn't mean
>>      that it exists in two different Memory Regions. It also won't fail
>>      the translation for being over two memory regions.
>>
>>      This makes sense. However, spdk_mem_map_translate() does following
>>      to calculate translation_len:
>>
>>      cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
>>      *size = spdk_min(*size, cur_size); // *size is the translation_len
>>      from caller nvmf_rdma_fill_buffers()
>>
>>      In nvmf_rdma_fill_buffers(),
>>
>>      if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>>                               SPDK_ERRLOG("Data buffer split over
>>      multiple RDMA Memory Regions\n");
>>                               return -EINVAL;
>>                       }
>>
>>      This just checks if buffer sits on 2 2MB pages, not about 2 RDMA
>>      memory regions. Is my understanding correct?
>>
>>      I still need some time to test. I will update you the result with -s
>>      as well.
>>
>>      Thanks,
>>      JD
>>
>>
>>      On 8/1/19 1:28 PM, Howell, Seth wrote:
>>       > Hi JD,
>>       >
>>       > The 2 MiB check is just because we always do memory registrations
>>      at at least 2 MiB granularity (the minimum hugepage size). Just
>>      because a buffer extends past a 2 MiB boundary doesn't mean that it
>>      exists in two different Memory Regions. It also won't fail the
>>      translation for being over two memory regions.
>>       >
>>       > If you look at the definition of spdk_mem_map_translate we call
>>      map->ops->are_contiguous every time we cross a 2 MiB boundary. For
>>      RDMA, this function is registered to
>>      spdk_nvmf_rdma_check_contiguous_entries. IF this function returns
>>      true, then even if the buffer crosses a 2 MiB boundary, the
>>      translation will still be valid.
>>       > The problem you are running into is not related to the buffer
>>      alignment, it is related to the fact that the two pages across which
>>      the buffer is split are registered to two different MRs in the NIC.
>>      This can only happen if those two pages are allocated independently
>>      and trigger two distinct memory event callbacks.
>>       >
>>       > That is why I am so interested in seeing the results from the
>>      noticelog above ibv_reg_mr. It will tell me how your target
>>      application is allocating memory. Also, when you start the SPDK
>>      target, are you using the -s option? Something like
>>      ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know
>>      if it'll make a difference, it's more of a curiosity thing for me)?
>>       >
>>       > Thanks,
>>       >
>>       > Seth
>>       >
>>       > -----Original Message-----
>>       > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>       > Sent: Thursday, August 1, 2019 11:24 AM
>>       > To: Howell, Seth <seth.howell(a)intel.com
>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>       > Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>       > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple
>>       > RDMA Memory Regions
>>       >
>>       > Hi Seth,
>>       >
>>       > Thanks for the detailed description, now I understand the reason
>>      behind the checking. But I have a question, why checking against
>>      2MiB? Is it because DPDK uses 2MiB page size by default so that one
>>      RDMA memory region should not cross 2 pages?
>>       >
>>       >   > Once I see what your memory registrations look like and what
>>      addresses you're failing on, it will help me understand what is
>>      going on better.
>>       >
>>       > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7
>>      +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport
>>      *rtransport,
>>       >                   remaining_length -=
>>       > rdma_req->req.iov[iovcnt].iov_len;
>>       >
>>       >                   if (translation_len <
>>      rdma_req->req.iov[iovcnt].iov_len) {
>>       > -                       SPDK_ERRLOG("Data buffer split over multiple
>>       > RDMA Memory Regions\n");
>>       > +                       SPDK_ERRLOG("Data buffer split over multiple
>>       > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
>>      rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
>>      translation_len, rdma_req->req.iov[iovcnt].iov_len);
>>       >                           return -EINVAL;
>>       >                   }
>>       >
>>       > With this I can see which buffer failed the checking.
>>       > For example, when SPKD initializes the memory pool, one of the
>>      buffers starts with 0x2000193feb00, and when failed, I got following:
>>       >
>>       > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
>>       > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376)
>>      (8192)
>>       >
>>       > This buffer has 5376B on one 2MB page and the rest of it
>>       > (8192-5376=2816B) is on another page.
>>       >
>>       > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to
>>      use iov base should make it better as iov base is 4KiB aligned. In
>>      above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 and
>>      it should pass the checking.
>>       > However, another buffer in the pool is 0x2000192010c0 and
>>      iov_base is 0x200019201000, which would fail the checking because it
>>      is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
>>       >
>>       > I will add the change from
>>       > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the
>>      test to get more information.
>>       >
>>       > I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff
>>       > -j
>>       > 0x90000000:0x20000000 -c 16disk_1ns.conf"
>>       >
>>       > Thanks,
>>       > JD
>>       >
>>       >
>>       > On 8/1/19 7:52 AM, Howell, Seth wrote:
>>       >> Hi JD,
>>       >>
>>       >> I was doing a little bit of digging in the dpdk documentation
>>      around this process, and I have a little bit more information. We
>>      were pretty worried about the whole dynamic memory allocations thing
>>      a few releases ago, so Jim helped add a flag into DPDK that
>>      prevented allocations from being allocated and freed in different
>>      granularities. This flag also prevents malloc heap allocations from
>>      spanning multiple memory events. However, this flag didn't make it
>>      into DPDK until 19.02 (More documentation at
>>      https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer
>>      if you're interested). We have some code in the SPDK environment
>>      layer that tries to deal with that (see
>>      lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that
>>      function is entirely capable of handling the heap allocations
>>      spanning multiple memory events part of the problem.
>>       >> Since you are using dpdk 18.11, the memory callback inside of
>>      lib/env_dpdk looks like a good candidate for our issue. My best
>>      guess is that somehow a heap allocation from the buffer mempool is
>>      hitting across addresses from two dynamic memory allocation events.
>>      I'd still appreciate it if you could send me the information in my
>>      last e-mail, but I think we're onto something here.
>>       >>
>>       >> Thanks,
>>       >>
>>       >> Seth
>>       >>
>>       >> -----Original Message-----
>>       >> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of Howell,
>>       >> Seth
>>       >> Sent: Thursday, August 1, 2019 5:26 AM
>>       >> To: JD Zheng <jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>>; Storage Performance
>>       >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple
>>       >> RDMA Memory Regions
>>       >>
>>       >> Hi JD,
>>       >>
>>       >> Thanks for doing that. Yeah, I am mainly looking to see how the
>>      mempool addresses are mapped into the NIC with ibv_reg_mr.
>>       >>
>>       >> I think it's odd that we are using the buffer base for the memory
>>       >> check, we should be using the iov base, but I don't believe that
>>       >> would cause the issue you are seeing. Pushed a change to modify
>>      that
>>       >> behavior anyways though:
>>       >> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>       >>
>>       >> There was one registration that I wasn't able to catch from your
>>      last log. Sorry about that, I forgot there wasn’t a debug log for
>>      it. Can you try it again with this change which adds noticelogs for
>>      the relevant registrations.
>>      https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able
>>      to run your test without the -Lrdma argument this time to avoid the
>>      extra bloat in the logs.
>>       >>
>>       >> The underlying assumption of the code is that any given object
>>      is not going to cross a dynamic memory allocation from DPDK. For a
>>      little background, when the mempool gets created, the dpdk code
>>      allocates some number of memzones to accommodate those buffer
>>      objects. Then it passes those memzones down one at a time and places
>>      objects inside the mempool from the given memzone until the memzone
>>      is exhausted. Then it goes back and grabs another memzone. This
>>      process continues until all objects are accounted for.
>>       >> This only works if each memzone corresponds to a single memory
>>      event when using dynamic memory allocation. My understanding was
>>      that this was always the case, but this error makes me think that
>>      it's possible that that's not true.
>>       >>
>>       >> Once I see what your memory registrations look like and what
>>      addresses you're failing on, it will help me understand what is
>>      going on better.
>>       >>
>>       >> Can you also provide the command line you are using to start the
>>      nvmf_tgt application and attach your configuration file?
>>       >>
>>       >> Thanks,
>>       >>
>>       >> Seth
>>       >> -----Original Message-----
>>       >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>       >> Sent: Wednesday, July 31, 2019 3:13 PM
>>       >> To: Howell, Seth <seth.howell(a)intel.com
>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>       >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple
>>       >> RDMA Memory Regions
>>       >>
>>       >> Hi Seth,
>>       >>
>>       >> After I enabled debug and ran nvmf_tgt with -L rdma, I got some
>>      logs like:
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x2000084bf000 Length: 40000 LKey: e601
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x200008621000 Length: 10000 LKey: e701
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x200018600000 Length: 1000000 LKey: e801
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x20000847e000 Length: 40000 LKey: e701
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x20000846d000 Length: 10000 LKey: e801
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x200019800000 Length: 1000000 LKey: e901
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x200016ebb000 Length: 40000 LKey: e801
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x20000845c000 Length: 10000 LKey: e901
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x20001aa00000 Length: 1000000 LKey: ea01
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x200016e7a000 Length: 40000 LKey: e901
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x20000844b000 Length: 10000 LKey: ea01
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>       >>
>>       >> Is this you are look for as memory regions registered for NIC?
>>       >>
>>       >> I attached the complete log.
>>       >>
>>       >> Thanks,
>>       >> JD
>>       >>
>>       >> On 7/30/19 5:28 PM, JD Zheng wrote:
>>       >>> Hi Seth,
>>       >>>
>>       >>> Thanks for the prompt reply!
>>       >>>
>>       >>> Please find answers inline.
>>       >>>
>>       >>> JD
>>       >>>
>>       >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>       >>>> Hi JD,
>>       >>>>
>>       >>>> Thanks for the report. I want to ask a few questions to start
>>       >>>> getting to the bottom of this. Since this issue doesn't currently
>>       >>>> reproduce on our per-patch or nightly tests, I would like to
>>       >>>> understand what's unique about your setup so that we can
>>      replicate
>>       >>>> it in a per patch test to prevent future regressions.
>>       >>> I am running it on aarch64 platform. I tried x86 platform and I
>>      can
>>       >>> see same buffer alignment in memory pool but can't run the real
>>      test
>>       >>> to reproduce it due to other missing pieces.
>>       >>>
>>       >>>>
>>       >>>> What options are you passing when you create the rdma transport?
>>       >>>> Are you creating it over RPC or in a configuration file?
>>       >>> I am using conf file. Pls let me know if you'd like to look
>>      into conf file.
>>       >>>
>>       >>>>
>>       >>>> Are you using the current DPDK submodule as your environment
>>       >>>> abstraction layer?
>>       >>> No. Our project uses specific version of DPDK, which is v18.11. I
>>       >>> did quick test using latest and DPDK submodule on x86, and the
>>       >>> buffer alignment is the same, i.e. 64B aligned.
>>       >>>
>>       >>>>
>>       >>>> I notice that your error log is printing from
>>       >>>> spdk_nvmf_transport_poll_group_create, which value exactly are
>>      you
>>       >>>> printing out?
>>       >>> Here is patch to add dbg print. Pls note that SPDK version is
>>      v19.04
>>       >>>
>>       >>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>       >>>                                    SPDK_NOTICELOG("Unable to
>>      reserve
>>       >>> the full number of buffers for the pg buffer cache.\n");
>>       >>>                                    break;
>>       >>>                            }
>>       >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>       >>> group->buf_cache_count, group->buf_cache_size);
>>       >>>                            STAILQ_INSERT_HEAD(&group->buf_cache,
>>       >>> buf, link);
>>       >>>                            group->buf_cache_count++;
>>       >>>                    }
>>       >>>
>>       >>>>
>>       >>>> Can you run your target with the -L rdma option to get a dump of
>>       >>>> the memory regions registered with the NIC?
>>       >>> Let me test and get back to you soon.
>>       >>>
>>       >>>>
>>       >>>> We made a couple of changes to this code when dynamic memory
>>       >>>> allocations were added to DPDK. There were some safeguards
>>      that we
>>       >>>> added to try and make sure this case wouldn't hit, so I'd like to
>>       >>>> make sure you are running on the latest DPDK submodule as well as
>>       >>>> the latest SPDK to narrow down where we need to look.
>>       >>> Unfortunately I can't easily update DPDK because other team
>>       >>> maintains it internally. But if it can be repro and fixed in
>>      latest,
>>       >>> I will try to pull in the fix.
>>       >>>
>>       >>>>
>>       >>>> Thanks,
>>       >>>>
>>       >>>> Seth
>>       >>>>
>>       >>>> -----Original Message-----
>>       >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of JD Zheng
>>       >>>> via SPDK
>>       >>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>       >>>> To: spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>
>>       >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>>
>>       >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>>       >>>> RDMA Memory Regions
>>       >>>>
>>       >>>> Hello,
>>       >>>>
>>       >>>> When I run nvmf_tgt over RDMA using latest SPDK code, I
>>       >>>> occasionally ran into this errors:
>>       >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split
>>       >>>> over multiple RDMA Memory Regions"
>>       >>>>
>>       >>>> After digging into the code, I found that
>>      nvmf_rdma_fill_buffers()
>>       >>>> calls spdk_mem_map_translate() to check if a data buffer sit on 2
>>       >>>> 2MB pages, and if it is the case, it reports this error.
>>       >>>>
>>       >>>> The following commit added change to use data buffer start
>>      address
>>       >>>> to calculate the size between buffer start address and 2MB
>>      boundary.
>>       >>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with
>>       >>>> IO Unit size (which is 8KB in my conf) to determine if the buffer
>>       >>>> passes 2MB boundary.
>>       >>>>
>>       >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>       >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>>      <mailto:dariusz.stojaczyk(a)intel.com>>
>>       >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>       >>>>
>>       >>>>         memory: fix contiguous memory calculation for unaligned
>>       >>>> buffers
>>       >>>>
>>       >>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool
>>      and new
>>       >>>> request will use free buffer from that pool and the buffer start
>>       >>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
>>       >>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB
>>       >>>> in my
>>       >>>> case) either, instead, they are 64Byte aligned so that some
>>      buffers
>>       >>>> will fail the checking and leads to this problem.
>>       >>>>
>>       >>>> The corresponding code snippets are as following:
>>       >>>> spdk_nvmf_transport_create()
>>       >>>> {
>>       >>>> ...
>>       >>>>         transport->data_buf_pool =
>>       >>>> pdk_mempool_create(spdk_mempool_name,
>>       >>>>                                    opts->num_shared_buffers,
>>       >>>>                                    opts->io_unit_size +
>>       >>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>       >>>>
>>        SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>       >>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
>>       >>>> }
>>       >>>>
>>       >>>> Also some debug print I added shows the start address of the
>>      buffers:
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019258800 0(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x2000192557c0 1(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019252780 2(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x20001924f740 3(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x20001924c700 4(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x2000192496c0 5(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019246680 6(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019243640 7(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019240600 8(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x20001923d5c0 9(32)
>>       >>>> ...
>>       >>>>
>>       >>>> It looks like either the buffer allocation has alignment issue or
>>       >>>> the checking is not correct.
>>       >>>>
>>       >>>> Please advice how to fix this problem.
>>       >>>>
>>       >>>> Thanks,
>>       >>>> JD Zheng
>>       >>>> _______________________________________________
>>       >>>> SPDK mailing list
>>       >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>       >>>> https://lists.01.org/mailman/listinfo/spdk
>>       >>>>
>>       >> _______________________________________________
>>       >> SPDK mailing list
>>       >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>       >> https://lists.01.org/mailman/listinfo/spdk
>>       >>
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-20 14:15 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-20 14:15 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 36827 bytes --]

Hi,

> I think dynamic memory allocation doesn't really work for the RDMA case.

Dynamic memory allocation does work for the RDMA case on the latest master; that is specifically why we use the match-allocations flag in DPDK. The Broadcom case is distinct from stock SPDK in that they are pinned to a DPDK version older than the submodule, which doesn't support this flag, so they have to use mitigations such as the one you mention below to work around the problems we faced before DPDK was updated.

> Commit 9cec99b84b9a08e9122ada4f4455172e40ff6c06 already removes memory "free" for dynamically allocated memory.

True, but that is a mitigation for DPDK versions between 18.05 and 19.02, which don't support the match-allocations flag. If you look at the preprocessor directives around that mitigation on master, it only applies when RTE_VERSION is >= 18.05 and < 19.02.

Anyone using stock SPDK with the DPDK submodule should be able to rely on DPDK dynamic allocations in the RDMA case.
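
To make the two regimes concrete, here is roughly what that version gating
looks like from an application's point of view. This is only an
illustration; the real logic lives in lib/env_dpdk, and the helper below is
made up:

#include <rte_eal.h>
#include <rte_version.h>

static int
init_eal_for_rdma(void)
{
        char *eal_argv[] = {
                (char *)"nvmf_tgt",
#if RTE_VERSION >= RTE_VERSION_NUM(19, 2, 0, 0)
                /* 19.02+: memory is released back to the OS only in the same
                 * granularity it was allocated in, so a mempool object can
                 * never straddle two memory events (and therefore two MRs). */
                (char *)"--match-allocations",
#endif
        };
        int eal_argc = (int)(sizeof(eal_argv) / sizeof(eal_argv[0]));

#if RTE_VERSION >= RTE_VERSION_NUM(18, 5, 0, 0) && RTE_VERSION < RTE_VERSION_NUM(19, 2, 0, 0)
        /* No --match-allocations here: the fallback is the mitigation above,
         * i.e. never free dynamically allocated hugepage memory, which
         * narrows but does not fully close the split-buffer window. */
#endif

        return rte_eal_init(eal_argc, eal_argv);
}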

Thanks,

Seth


-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Sasha Kotchubievsky
Sent: Tuesday, August 20, 2019 5:22 AM
To: spdk(a)lists.01.org
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi,

I think dynamic memory allocation doesn't really work for the RDMA case.

Commit 9cec99b84b9a08e9122ada4f4455172e40ff6c06 already removes memory "free" for dynamically allocated memory.

I'd suggest pre-allocating enough memory for the nvmf target using the "-s" option.
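
For example, something like (the 4096 MB below is just a placeholder; it
has to be large enough to cover the shared buffer pool plus all
per-connection resources, otherwise allocations still fall back to dynamic
hugepage requests at runtime):

./app/nvmf_tgt/nvmf_tgt -s 4096 -c 16disk_1ns.conf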


Best regards

Sasha

On 20-Aug-19 12:42 AM, JD Zheng via SPDK wrote:
> Hi Seth,
>
> It sometimes triggered a segfault, but I couldn't get a backtrace because 
> the stack was likely corrupted. With my workaround, this is no longer seen.
>
> Let me submit my change to Gerrit as an RFC. It probably isn't necessary 
> to upstream it, since DPDK 19.02 should fix this problem properly.
>
> Thanks,
> JD
>
> On 8/19/19 2:16 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> What issue specifically did you see? If there is something measurable 
>> happening (other than the error message) then I think it should be 
>> high priority to get a more permanent workaround into upstream SPDK.
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> Sent: Monday, August 19, 2019 2:03 PM
>> To: Howell, Seth <seth.howell(a)intel.com>
>> Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan 
>> Richardson <jonathan.richardson(a)broadcom.com>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>> RDMA Memory Regions
>>
>> Hi Seth,
>>
>>   > Unfortunately, the only way
>>   > to protect fully against this happening is by using the DPDK flag  
>> > --match-allocations which was introduced in DPDK 19.02.
>>
>> Then I need to use DPDK 19.02. Do I need to enable this flag 
>> explicitly when moving DPDK 19.02?
>>
>>   > The good news is that the SPDK target will skip these buffers 
>> without  > bricking, causing data corruption or doing any otherwise 
>> bad things.
>>
>> Unfortunately this is not what I saw. It appeared that SPDK gave up 
>> this split buffer, but it still causes issue, maybe because it was 
>> tried too many times(?).
>>
>> Currently, I have to use DPDK 18.11 so that I added a couple of 
>> workarounds to prevent the split buffer from being used before 
>> reaching fill_buffers(). I did a little trick there to call
>> spdk_mempool_get() but not mempool_put later, so that this buffer is 
>> set as "allocated" in mempool and will not be tried again and again.
>> It does look like small memory leak though. We can usually see 2-3 
>> split buffers during overnight run, btw.
>>
>> This seems working OK.
>>
>> For sure, I will measure performance later.
>>
>> Thanks,
>> JD
>>
>>
>> On 8/19/19 1:12 PM, Howell, Seth wrote:
>>> Hi JD,
>>>
>>> Thanks for performing that experiment. With this new information, I 
>>> think we can be pretty much 100% sure that the problem is related to 
>>> the mempool being split over two DPDK memzones. Unfortunately, the 
>>> only way to protect fully against this happening is by using the 
>>> DPDK flag --match-allocations which was introduced in DPDK 19.02. 
>>> Jim helped advocate for this flag specifically because of this 
>>> problem with mempools and RPMA.
>>>
>>> The good news is that the SPDK target will skip these buffers 
>>> without bricking, causing data corruption or doing any otherwise bad things.
>>> What ends up happening is that the nvmf_rdma_fill_buffers function 
>>> will print the error message and then return NULL which will trigger 
>>> the target to retry the I/O again. By that time, there will be 
>>> another buffer there for the request to use and it won’t fail the 
>>> second time around. So the code currently handles the problem in a 
>>> technically correct way i.e. It’s not going to brick the target or 
>>> initiator by trying to use a buffer that spans multiple Memory 
>>> Regions. Instead, it properly recognizes that it is trying to use a 
>>> bad buffer and reschedules the request buffer parsing.
>>>
>>> However, I am a little worried over the fact that these buffers 
>>> remain in the mempool and can be repeatedly used by the application. 
>>> I can picture a scenario where this could possibly have a  
>>> performance impact.
>>> Take for example a mempool with 128 entries in it in which one of 
>>> them is split over a memzone. Since this split buffer will never 
>>> find its way into a request, it’s possible that this split buffer 
>>> gets pulled up into requests more often than other buffers and 
>>> subsequently fails in nvmf_rdma_fill_buffers causing requests to 
>>> have to be rescheduled to the next time the poller runs. Depending 
>>> on how frequently this happens, the performance impact *could possibly* add up.
>>>
>>> I have as yet been unable to replicate the split buffer error. One 
>>> thing you could try to see if there is any measurable performance 
>>> impact is try starting the NVMe-oF target with DPDK legacy memory 
>>> mode which will move all memory allocations to startup and prevent 
>>> you from splitting buffers. Then run a benchmark with a lot of 
>>> connections at high queue depth and see what the performance looks 
>>> like compared to the dynamic memory model. If there is a significant 
>>> performance impact, we may have to modify how we handle this error case.
>>>
>>> Thanks,
>>>
>>> Seth
>>>
>>> *From:*JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>>> *Sent:* Monday, August 12, 2019 4:17 PM
>>> *To:* Howell, Seth <seth.howell(a)intel.com>
>>> *Cc:* Storage Performance Development Kit <spdk(a)lists.01.org>; 
>>> Jonathan Richardson <jonathan.richardson(a)broadcom.com>
>>> *Subject:* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over 
>>> multiple RDMA Memory Regions
>>>
>>> + Jonanthan
>>>
>>> Hi Seth,
>>>
>>> We finally got chance to test with more logs enabled. You are 
>>> correct that that problematic buffer does sit on 2 registered memory regions:
>>>
>>> The problematic buffer is "Buffer address:*200019bfeb00", *actual 
>>> used buffer pointer is "*200019bff000*" (SPDK makes it 4KiB 
>>> aligned), size is
>>> 8KiB(0x2000) so it does sit on 2 registered memory region.
>>>
>>> However, looks like SPDK/DPDK allocates buffers starting from end of 
>>> a region and going up, but due to the extra room and alignment of 
>>> each buffer and there is chance that one buffer can exceed memory 
>>> region boundary?
>>>
>>> In this case, the buffers are between 0x200019997800 and 
>>> 0x200019c5320 so that last buffer exceeds one region and goes to next one.
>>>
>>> Some logs for your information:
>>>
>>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>>> 200019800000, memory region length: 400000
>>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>>> 200019c00000, memory region length: 400000
>>>
>>> ...
>>>
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019bfeb00 27(32)
>>>
>>> ...
>>>
>>> rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer 
>>> address:**200019bfeb00**iov_base address 200019bff000
>>>
>>> Thanks,
>>>
>>> JD
>>>
>>> On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com 
>>> <mailto:seth.howell(a)intel.com>> wrote:
>>>
>>>      There are two different assignments that you need to look at. 
>>> I'll
>>>      detail the cases below based on line numbers from the latest 
>>> master.
>>>
>>>      Memory.c:656 *size = spdk_min(*size, cur_size):
>>>               This assignment is inside of the conditional "if(size 
>>> ==
>>>      NULL || map->ops.are_contiguous == NULL)"
>>>               So in other words, at the offset, we figure out how 
>>> much
>>>      space we have left in the current translation. Then, if there 
>>> is no
>>>      callback to tell us whether the next translation will be 
>>> contiguous
>>>      to this one, we fill the size variable with the remaining 
>>> length of
>>>      that 2 MiB buffer.
>>>
>>>      Memory.c:682 *size = spdk_min(*size, cur_size):
>>>               This assignment comes after the while loop guarded by 
>>> the
>>>      condition "while (cur_size < *size)". This while loop assumes 
>>> that
>>>      we have supplied some desired length for our buffer. This is 
>>> true in
>>>      the RDMA case. Now this while loop will only break on two
>>>      conditions. 1. Cur_size becomes larger than *size, or the
>>>      are_contiguous function returns false, meaning that the two
>>>      translations cannot be considered together. In the case of the 
>>> RDMA
>>>      memory map, the only time are_contiguous returns false is when 
>>> the
>>>      two memory regions correspond to two distinct RDMA MRs. Notice 
>>> that
>>>      in this case - the one where are_contiguous is defined and we
>>>      supplied a size variable - the *size variable is not 
>>> overwritten
>>>      with cur_size until 1. cur_size is >= *size or 2. The 
>>> are_contiguous
>>>      check fails.
>>>
>>>      In the second case detailed above, you can see how one could  
>>> pass
>>>      in a buffer that spanned a 2 MiB page and still get a 
>>> translation
>>>      value equal to the size of the buffer. This second case is the 
>>> one
>>>      that the rdma.c code should be using since we have a registered
>>>      are_contiguous function with the NIC and we have supplied a 
>>> size
>>>      pointer filled with the length of our buffer.
>>>
>>>      -----Original Message-----
>>>      From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>>      Sent: Thursday, August 1, 2019 2:01 PM
>>>      To: Howell, Seth <seth.howell(a)intel.com
>>>      <mailto:seth.howell(a)intel.com>>; Storage Performance 
>>> Development Kit
>>>      <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>>      Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>>      multiple RDMA Memory Regions
>>>
>>>      Hi Seth,
>>>
>>>        > Just because a buffer extends past a 2 MiB boundary doesn't 
>>> mean
>>>      that it exists in two different Memory Regions. It also won't 
>>> fail
>>>      the translation for being over two memory regions.
>>>
>>>      This makes sense. However, spdk_mem_map_translate() does 
>>> following
>>>      to calculate translation_len:
>>>
>>>      cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
>>>      *size = spdk_min(*size, cur_size); // *size is the 
>>> translation_len
>>>      from caller nvmf_rdma_fill_buffers()
>>>
>>>      In nvmf_rdma_fill_buffers(),
>>>
>>>      if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>>>                               SPDK_ERRLOG("Data buffer split over
>>>      multiple RDMA Memory Regions\n");
>>>                               return -EINVAL;
>>>                       }
>>>
>>>      This just checks if buffer sits on 2 2MB pages, not about 2 
>>> RDMA
>>>      memory regions. Is my understanding correct?
>>>
>>>      I still need some time to test. I will update you the result 
>>> with -s
>>>      as well.
>>>
>>>      Thanks,
>>>      JD
>>>
>>>
>>>      On 8/1/19 1:28 PM, Howell, Seth wrote:
>>>       > Hi JD,
>>>       >
>>>       > The 2 MiB check is just because we always do memory 
>>> registrations
>>>      at at least 2 MiB granularity (the minimum hugepage size). Just
>>>      because a buffer extends past a 2 MiB boundary doesn't mean 
>>> that it
>>>      exists in two different Memory Regions. It also won't fail the
>>>      translation for being over two memory regions.
>>>       >
>>>       > If you look at the definition of spdk_mem_map_translate we 
>>> call
>>>      map->ops->are_contiguous every time we cross a 2 MiB boundary. 
>>> For
>>>      RDMA, this function is registered to
>>>      spdk_nvmf_rdma_check_contiguous_entries. IF this function 
>>> returns
>>>      true, then even if the buffer crosses a 2 MiB boundary, the
>>>      translation will still be valid.
>>>       > The problem you are running into is not related to the 
>>> buffer
>>>      alignment, it is related to the fact that the two pages across 
>>> which
>>>      the buffer is split are registered to two different MRs in the 
>>> NIC.
>>>      This can only happen if those two pages are allocated 
>>> independently
>>>      and trigger two distinct memory event callbacks.
>>>       >
>>>       > That is why I am so interested in seeing the results from 
>>> the
>>>      noticelog above ibv_reg_mr. It will tell me how your target
>>>      application is allocating memory. Also, when you start the SPDK
>>>      target, are you using the -s option? Something like
>>>      ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't 
>>> know
>>>      if it'll make a difference, it's more of a curiosity thing for 
>>> me)?
>>>       >
>>>       > Thanks,
>>>       >
>>>       > Seth
>>>       >
>>>       > -----Original Message-----
>>>       > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>>       > Sent: Thursday, August 1, 2019 11:24 AM
>>>       > To: Howell, Seth <seth.howell(a)intel.com
>>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>>       > Development Kit <spdk(a)lists.01.org 
>>> <mailto:spdk(a)lists.01.org>>
>>>       > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>>      multiple
>>>       > RDMA Memory Regions
>>>       >
>>>       > Hi Seth,
>>>       >
>>>       > Thanks for the detailed description, now I understand the 
>>> reason
>>>      behind the checking. But I have a question, why checking 
>>> against
>>>      2MiB? Is it because DPDK uses 2MiB page size by default so that 
>>> one
>>>      RDMA memory region should not cross 2 pages?
>>>       >
>>>       >   > Once I see what your memory registrations look like and 
>>> what
>>>      addresses you're failing on, it will help me understand what is
>>>      going on better.
>>>       >
>>>       > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7
>>>      +1503,11 @@ nvmf_rdma_fill_buffers(struct 
>>> spdk_nvmf_rdma_transport
>>>      *rtransport,
>>>       >                   remaining_length -=
>>>       > rdma_req->req.iov[iovcnt].iov_len;
>>>       >
>>>       >                   if (translation_len <
>>>      rdma_req->req.iov[iovcnt].iov_len) {
>>>       > -                       SPDK_ERRLOG("Data buffer split over 
>>> multiple
>>>       > RDMA Memory Regions\n");
>>>       > +                       SPDK_ERRLOG("Data buffer split over 
>>> multiple
>>>       > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
>>>      rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
>>>      translation_len, rdma_req->req.iov[iovcnt].iov_len);
>>>       >                           return -EINVAL;
>>>       >                   }
>>>       >
>>>       > With this I can see which buffer failed the checking.
>>>       > For example, when SPKD initializes the memory pool, one of 
>>> the
>>>      buffers starts with 0x2000193feb00, and when failed, I got
>>> following:
>>>       >
>>>       > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer 
>>> split over
>>>       > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) 
>>> (5376)
>>>      (8192)
>>>       >
>>>       > This buffer has 5376B on one 2MB page and the rest of it
>>>       > (8192-5376=2816B) is on another page.
>>>       >
>>>       > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 
>>> to
>>>      use iov base should make it better as iov base is 4KiB aligned. 
>>> In
>>>      above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 
>>> and
>>>      it should pass the checking.
>>>       > However, another buffer in the pool is 0x2000192010c0 and
>>>      iov_base is 0x200019201000, which would fail the checking 
>>> because it
>>>      is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
>>>       >
>>>       > I will add the change from
>>>       > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun 
>>> the
>>>      test to get more information.
>>>       >
>>>       > I also attached the conf file too. The cmd line is "nvmf_tgt 
>>> -m 0xff
>>>       > -j
>>>       > 0x90000000:0x20000000 -c 16disk_1ns.conf"
>>>       >
>>>       > Thanks,
>>>       > JD
>>>       >
>>>       >
>>>       > On 8/1/19 7:52 AM, Howell, Seth wrote:
>>>       >> Hi JD,
>>>       >>
>>>       >> I was doing a little bit of digging in the dpdk 
>>> documentation
>>>      around this process, and I have a little bit more information. 
>>> We
>>>      were pretty worried about the whole dynamic memory allocations 
>>> thing
>>>      a few releases ago, so Jim helped add a flag into DPDK that
>>>      prevented allocations from being allocated and freed in 
>>> different
>>>      granularities. This flag also prevents malloc heap allocations 
>>> from
>>>      spanning multiple memory events. However, this flag didn't make 
>>> it
>>>      into DPDK until 19.02 (More documentation at 
>>> https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#en
>>> vironment-abstraction-layer
>>>      if you're interested). We have some code in the SPDK 
>>> environment
>>>      layer that tries to deal with that (see
>>>      lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that 
>>> that
>>>      function is entirely capable of handling the heap allocations
>>>      spanning multiple memory events part of the problem.
>>>       >> Since you are using dpdk 18.11, the memory callback inside 
>>> of
>>>      lib/env_dpdk looks like a good candidate for our issue. My best
>>>      guess is that somehow a heap allocation from the buffer mempool 
>>> is
>>>      hitting across addresses from two dynamic memory allocation 
>>> events.
>>>      I'd still appreciate it if you could send me the information in 
>>> my
>>>      last e-mail, but I think we're onto something here.
>>>       >>
>>>       >> Thanks,
>>>       >>
>>>       >> Seth
>>>       >>
>>>       >> -----Original Message-----
>>>       >> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of Howell,
>>>       >> Seth
>>>       >> Sent: Thursday, August 1, 2019 5:26 AM
>>>       >> To: JD Zheng <jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>>; Storage Performance
>>>       >> Development Kit <spdk(a)lists.01.org 
>>> <mailto:spdk(a)lists.01.org>>
>>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split 
>>> over
>>>      multiple
>>>       >> RDMA Memory Regions
>>>       >>
>>>       >> Hi JD,
>>>       >>
>>>       >> Thanks for doing that. Yeah, I am mainly looking to see how 
>>> the
>>>      mempool addresses are mapped into the NIC with ibv_reg_mr.
>>>       >>
>>>       >> I think it's odd that we are using the buffer base for the 
>>> memory
>>>       >> check, we should be using the iov base, but I don't believe 
>>> that
>>>       >> would cause the issue you are seeing. Pushed a change to 
>>> modify
>>>      that
>>>       >> behavior anyways though:
>>>       >> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>>       >>
>>>       >> There was one registration that I wasn't able to catch from 
>>> your
>>>      last log. Sorry about that, I forgot there wasn’t a debug log 
>>> for
>>>      it. Can you try it again with this change which adds noticelogs 
>>> for
>>>      the relevant registrations.
>>>      https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be 
>>> able
>>>      to run your test without the -Lrdma argument this time to avoid 
>>> the
>>>      extra bloat in the logs.
>>>       >>
>>>       >> The underlying assumption of the code is that any given 
>>> object
>>>      is not going to cross a dynamic memory allocation from DPDK. 
>>> For a
>>>      little background, when the mempool gets created, the dpdk code
>>>      allocates some number of memzones to accommodate those buffer
>>>      objects. Then it passes those memzones down one at a time and 
>>> places
>>>      objects inside the mempool from the given memzone until the 
>>> memzone
>>>      is exhausted. Then it goes back and grabs another memzone. This
>>>      process continues until all objects are accounted for.
>>>       >> This only works if each memzone corresponds to a single 
>>> memory
>>>      event when using dynamic memory allocation. My understanding 
>>> was
>>>      that this was always the case, but this error makes me think 
>>> that
>>>      it's possible that that's not true.
>>>       >>
>>>       >> Once I see what your memory registrations look like and 
>>> what
>>>      addresses you're failing on, it will help me understand what is
>>>      going on better.
>>>       >>
>>>       >> Can you also provide the command line you are using to 
>>> start the
>>>      nvmf_tgt application and attach your configuration file?
>>>       >>
>>>       >> Thanks,
>>>       >>
>>>       >> Seth
>>>       >> -----Original Message-----
>>>       >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>>       >> Sent: Wednesday, July 31, 2019 3:13 PM
>>>       >> To: Howell, Seth <seth.howell(a)intel.com
>>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>>       >> Development Kit <spdk(a)lists.01.org 
>>> <mailto:spdk(a)lists.01.org>>
>>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split 
>>> over
>>>      multiple
>>>       >> RDMA Memory Regions
>>>       >>
>>>       >> Hi Seth,
>>>       >>
>>>       >> After I enabled debug and ran nvmf_tgt with -L rdma, I got 
>>> some
>>>      logs like:
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command
>>> Array:
>>>       >> 0x2000084bf000 Length: 40000 LKey: e601
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion
>>> Array:
>>>       >> 0x200008621000 Length: 10000 LKey: e701
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x200018600000 Length: 1000000 LKey: e801
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command
>>> Array:
>>>       >> 0x20000847e000 Length: 40000 LKey: e701
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion
>>> Array:
>>>       >> 0x20000846d000 Length: 10000 LKey: e801
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x200019800000 Length: 1000000 LKey: e901
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command
>>> Array:
>>>       >> 0x200016ebb000 Length: 40000 LKey: e801
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion
>>> Array:
>>>       >> 0x20000845c000 Length: 10000 LKey: e901
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x20001aa00000 Length: 1000000 LKey: ea01
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command
>>> Array:
>>>       >> 0x200016e7a000 Length: 40000 LKey: e901
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion
>>> Array:
>>>       >> 0x20000844b000 Length: 10000 LKey: ea01
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>>       >>
>>>       >> Is this what you are looking for as the memory regions registered with the NIC?
>>>       >>
>>>       >> I attached the complete log.
>>>       >>
>>>       >> Thanks,
>>>       >> JD
>>>       >>
>>>       >> On 7/30/19 5:28 PM, JD Zheng wrote:
>>>       >>> Hi Seth,
>>>       >>>
>>>       >>> Thanks for the prompt reply!
>>>       >>>
>>>       >>> Please find answers inline.
>>>       >>>
>>>       >>> JD
>>>       >>>
>>>       >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>>       >>>> Hi JD,
>>>       >>>>
>>>       >>>> Thanks for the report. I want to ask a few questions to 
>>> start
>>>       >>>> getting to the bottom of this. Since this issue doesn't 
>>> currently
>>>       >>>> reproduce on our per-patch or nightly tests, I would like 
>>> to
>>>       >>>> understand what's unique about your setup so that we can
>>>      replicate
>>>       >>>> it in a per patch test to prevent future regressions.
>>>       >>> I am running it on aarch64 platform. I tried x86 platform 
>>> and I
>>>      can
>>>       >>> see same buffer alignment in memory pool but can't run the 
>>> real
>>>      test
>>>       >>> to reproduce it due to other missing pieces.
>>>       >>>
>>>       >>>>
>>>       >>>> What options are you passing when you create the rdma 
>>> transport?
>>>       >>>> Are you creating it over RPC or in a configuration file?
>>>       >>> I am using conf file. Pls let me know if you'd like to 
>>> look
>>>      into conf file.
>>>       >>>
>>>       >>>>
>>>       >>>> Are you using the current DPDK submodule as your 
>>> environment
>>>       >>>> abstraction layer?
>>>       >>> No. Our project uses specific version of DPDK, which is 
>>> v18.11. I
>>>       >>> did quick test using latest and DPDK submodule on x86, and 
>>> the
>>>       >>> buffer alignment is the same, i.e. 64B aligned.
>>>       >>>
>>>       >>>>
>>>       >>>> I notice that your error log is printing from
>>>       >>>> spdk_nvmf_transport_poll_group_create, which value 
>>> exactly are
>>>      you
>>>       >>>> printing out?
>>>       >>> Here is patch to add dbg print. Pls note that SPDK version 
>>> is
>>>      v19.04
>>>       >>>
>>>       >>> @@ -215,6 +222,7 @@ 
>>> spdk_nvmf_transport_poll_group_create(st
>>>       >>> SPDK_NOTICELOG("Unable to
>>>      reserve
>>>       >>> the full number of buffers for the pg buffer cache.\n");
>>>       >>>                                    break;
>>>       >>>                            }
>>>       >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>>       >>> group->buf_cache_count, group->buf_cache_size);
>>>       >>> STAILQ_INSERT_HEAD(&group->buf_cache,
>>>       >>> buf, link);
>>>       >>> group->buf_cache_count++;
>>>       >>>                    }
>>>       >>>
>>>       >>>>
>>>       >>>> Can you run your target with the -L rdma option to get a 
>>> dump of
>>>       >>>> the memory regions registered with the NIC?
>>>       >>> Let me test and get back to you soon.
>>>       >>>
>>>       >>>>
>>>       >>>> We made a couple of changes to this code when dynamic 
>>> memory
>>>       >>>> allocations were added to DPDK. There were some 
>>> safeguards
>>>      that we
>>>       >>>> added to try and make sure this case wouldn't hit, so I'd 
>>> like to
>>>       >>>> make sure you are running on the latest DPDK submodule as 
>>> well as
>>>       >>>> the latest SPDK to narrow down where we need to look.
>>>       >>> Unfortunately I can't easily update DPDK because other 
>>> team
>>>       >>> maintains it internally. But if it can be repro and fixed 
>>> in
>>>      latest,
>>>       >>> I will try to pull in the fix.
>>>       >>>
>>>       >>>>
>>>       >>>> Thanks,
>>>       >>>>
>>>       >>>> Seth
>>>       >>>>
>>>       >>>> -----Original Message-----
>>>       >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of JD Zheng
>>>       >>>> via SPDK
>>>       >>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>>       >>>> To: spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>
>>>       >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>>
>>>       >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over 
>>> multiple
>>>       >>>> RDMA Memory Regions
>>>       >>>>
>>>       >>>> Hello,
>>>       >>>>
>>>       >>>> When I run nvmf_tgt over RDMA using latest SPDK code, I
>>>       >>>> occasionally ran into this error:
>>>       >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer 
>>> split
>>>       >>>> over multiple RDMA Memory Regions"
>>>       >>>>
>>>       >>>> After digging into the code, I found that
>>>      nvmf_rdma_fill_buffers()
>>>       >>>> calls spdk_mem_map_translate() to check if a data buffer 
>>> sits on 2
>>>       >>>> 2MB pages, and if it is the case, it reports this error.
>>>       >>>>
>>>       >>>> The following commit added change to use data buffer 
>>> start
>>>      address
>>>       >>>> to calculate the size between buffer start address and 
>>> 2MB
>>>      boundary.
>>>       >>>> The caller nvmf_rdma_fill_buffers() uses the size to 
>>> compare with
>>>       >>>> IO Unit size (which is 8KB in my conf) to determine if 
>>> the buffer
>>>       >>>> passes 2MB boundary.
>>>       >>>>
>>>       >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>>       >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>>>      <mailto:dariusz.stojaczyk(a)intel.com>>
>>>       >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>       >>>>
>>>       >>>>         memory: fix contiguous memory calculation for 
>>> unaligned
>>>       >>>> buffers
>>>       >>>>
>>>       >>>> In nvmf_tgt, the buffers are pre-allocated as a memory 
>>> pool
>>>      and new
>>>       >>>> request will use free buffer from that pool and the 
>>> buffer start
>>>       >>>> address is passed to nvmf_rdma_fill_buffers(). But I 
>>> found that
>>>       >>>> these buffers are not 2MB aligned and not IOUnitSize 
>>> aligned (8KB
>>>       >>>> in my
>>>       >>>> case) either, instead, they are 64Byte aligned so that 
>>> some
>>>      buffers
>>>       >>>> will fail the checking and leads to this problem.
>>>       >>>>
>>>       >>>> The corresponding code snippets are as following:
>>>       >>>> spdk_nvmf_transport_create()
>>>       >>>> {
>>>       >>>> ...
>>>       >>>>         transport->data_buf_pool =
>>>       >>>> spdk_mempool_create(spdk_mempool_name,
>>>       >>>>  opts->num_shared_buffers,
>>>       >>>>  opts->io_unit_size +
>>>       >>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>       >>>>
>>>        SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>       >>>>  SPDK_ENV_SOCKET_ID_ANY); ...
>>>       >>>> }
>>>       >>>>
>>>       >>>> Also some debug print I added shows the start address of 
>>> the
>>>      buffers:
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019258800 0(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x2000192557c0 1(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019252780 2(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x20001924f740 3(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x20001924c700 4(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x2000192496c0 5(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019246680 6(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019243640 7(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019240600 8(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x20001923d5c0 9(32)
>>>       >>>> ...
>>>       >>>>
>>>       >>>> It looks like either the buffer allocation has alignment 
>>> issue or
>>>       >>>> the checking is not correct.
>>>       >>>>
>>>       >>>> Please advise how to fix this problem.
>>>       >>>>
>>>       >>>> Thanks,
>>>       >>>> JD Zheng
>>>       >>>> _______________________________________________
>>>       >>>> SPDK mailing list
>>>       >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>>       >>>> https://lists.01.org/mailman/listinfo/spdk
>>>       >>>>
>>>       >> _______________________________________________
>>>       >> SPDK mailing list
>>>       >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>>       >> https://lists.01.org/mailman/listinfo/spdk
>>>       >>
>>>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-20 12:22 Sasha Kotchubievsky
  0 siblings, 0 replies; 20+ messages in thread
From: Sasha Kotchubievsky @ 2019-08-20 12:22 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 34987 bytes --]

Hi,

I think dynamic memory allocation doesn't really work for the RDMA case.

Commit 9cec99b84b9a08e9122ada4f4455172e40ff6c06 already removes the memory 
"free" for dynamically allocated memory.

I'd suggest pre-allocating enough memory for the nvmf target using the "-s" option.
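
For example (the -s value below is only a placeholder, and the remaining flags
are simply the ones already used earlier in this thread):

  ./app/nvmf_tgt/nvmf_tgt -s 2048 -m 0xff -c 16disk_1ns.conf

The intent is that the shared data buffer pool then comes out of hugepage
memory reserved at startup rather than out of later dynamic allocations.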


Best regards

Sasha

On 20-Aug-19 12:42 AM, JD Zheng via SPDK wrote:
> Hi Seth,
>
> It sometimes triggered seg fault but I couldn't get backtrace due to 
> likely corrupted stack. With my workaround, this is no longer seen.
>
> Let me submit my change as RFC to gerrit. It probably isn't necessary 
> to upstream as DPDK 19.02 should fix this problem properly.
>
> Thanks,
> JD
>
> On 8/19/19 2:16 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> What issue specifically did you see? If there is something measurable 
>> happening (other than the error message) then I think it should be 
>> high priority to get a more permanent workaround into upstream SPDK.
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> Sent: Monday, August 19, 2019 2:03 PM
>> To: Howell, Seth <seth.howell(a)intel.com>
>> Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan 
>> Richardson <jonathan.richardson(a)broadcom.com>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>> RDMA Memory Regions
>>
>> Hi Seth,
>>
>>   > Unfortunately, the only way
>>   > to protect fully against this happening is by using the DPDK 
>> flag  > --match-allocations which was introduced in DPDK 19.02.
>>
>> Then I need to use DPDK 19.02. Do I need to enable this flag 
>> explicitly when moving DPDK 19.02?
>>
>>   > The good news is that the SPDK target will skip these buffers 
>> without  > bricking, causing data corruption or doing any otherwise 
>> bad things.
>>
>> Unfortunately this is not what I saw. It appeared that SPDK gave up 
>> this split buffer, but it still causes issue, maybe because it was 
>> tried too many times(?).
>>
>> Currently, I have to use DPDK 18.11 so that I added a couple of 
>> workarounds to prevent the split buffer from being used before 
>> reaching fill_buffers(). I did a little trick there to call 
>> spdk_mempool_get() but not mempool_put later, so that this buffer is 
>> set as "allocated" in mempool and will not be tried again and again. 
>> It does look like small memory leak though. We can usually see 2-3 
>> split buffers during overnight run, btw.
>>
>> This seems working OK.
>>
>> For sure, I will measure performance later.
>>
>> Thanks,
>> JD
>>
>>
>> On 8/19/19 1:12 PM, Howell, Seth wrote:
>>> Hi JD,
>>>
>>> Thanks for performing that experiment. With this new information, I
>>> think we can be pretty much 100% sure that the problem is related to
>>> the mempool being split over two DPDK memzones. Unfortunately, the
>>> only way to protect fully against this happening is by using the DPDK
>>> flag --match-allocations which was introduced in DPDK 19.02. Jim
>>> helped advocate for this flag specifically because of this problem
>>> with mempools and RDMA.
>>>
>>> The good news is that the SPDK target will skip these buffers without
>>> bricking, causing data corruption or doing any otherwise bad things.
>>> What ends up happening is that the nvmf_rdma_fill_buffers function
>>> will print the error message and then return NULL which will trigger
>>> the target to retry the I/O again. By that time, there will be another
>>> buffer there for the request to use and it won’t fail the second time
>>> around. So the code currently handles the problem in a technically
>>> correct way i.e. It’s not going to brick the target or initiator by
>>> trying to use a buffer that spans multiple Memory Regions. Instead, it
>>> properly recognizes that it is trying to use a bad buffer and
>>> reschedules the request buffer parsing.
>>>
>>> However, I am a little worried over the fact that these buffers remain
>>> in the mempool and can be repeatedly used by the application. I can
>>> picture a scenario where this could possibly have a  performance 
>>> impact.
>>> Take for example a mempool with 128 entries in it in which one of them
>>> is split over a memzone. Since this split buffer will never find its
>>> way into a request, it’s possible that this split buffer gets pulled
>>> up into requests more often than other buffers and subsequently fails
>>> in nvmf_rdma_fill_buffers causing requests to have to be rescheduled
>>> to the next time the poller runs. Depending on how frequently this
>>> happens, the performance impact *could possibly* add up.
>>>
>>> I have as yet been unable to replicate the split buffer error. One
>>> thing you could try to see if there is any measurable performance
>>> impact is try starting the NVMe-oF target with DPDK legacy memory mode
>>> which will move all memory allocations to startup and prevent you from
>>> splitting buffers. Then run a benchmark with a lot of connections at
>>> high queue depth and see what the performance looks like compared to
>>> the dynamic memory model. If there is a significant performance
>>> impact, we may have to modify how we handle this error case.
>>>
>>> Thanks,
>>>
>>> Seth
>>>
>>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>>> Sent: Monday, August 12, 2019 4:17 PM
>>> To: Howell, Seth <seth.howell(a)intel.com>
>>> Cc: Storage Performance Development Kit <spdk(a)lists.01.org>;
>>> Jonathan Richardson <jonathan.richardson(a)broadcom.com>
>>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>> multiple RDMA Memory Regions
>>>
>>> + Jonanthan
>>>
>>> Hi Seth,
>>>
>>> We finally got chance to test with more logs enabled. You are correct
>>> that that problematic buffer does sit on 2 registered memory regions:
>>>
>>> The problematic buffer is "Buffer address: 200019bfeb00"; the actual buffer
>>> pointer used is "200019bff000" (SPDK makes it 4KiB aligned), and its size is
>>> 8KiB (0x2000), so it does sit on 2 registered memory regions.
>>>
>>> However, it looks like SPDK/DPDK allocates buffers starting from the end of
>>> a region and going up, and due to the extra room and alignment of each
>>> buffer there is a chance that one buffer can exceed the memory region
>>> boundary?
>>>
>>> In this case, the buffers are between 0x200019997800 and 0x200019c5320,
>>> so the last buffer exceeds one region and goes into the next one.
>>>
>>> Some logs for your information:
>>>
>>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>>> 200019800000, memory region length: 400000
>>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>>> 200019c00000, memory region length: 400000
>>>
>>> ...
>>>
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019bfeb00 27(32)
>>>
>>> ...
>>>
>>> rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer
>>> address: 200019bfeb00 iov_base address 200019bff000
>>>
>>> Thanks,
>>>
>>> JD
>>>
>>> On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com
>>> <mailto:seth.howell(a)intel.com>> wrote:
>>>
>>>      There are two different assignments that you need to look at. I'll
>>>      detail the cases below based on line numbers from the latest 
>>> master.
>>>
>>>      Memory.c:656 *size = spdk_min(*size, cur_size):
>>>               This assignment is inside of the conditional "if(size ==
>>>      NULL || map->ops.are_contiguous == NULL)"
>>>               So in other words, at the offset, we figure out how much
>>>      space we have left in the current translation. Then, if there 
>>> is no
>>>      callback to tell us whether the next translation will be 
>>> contiguous
>>>      to this one, we fill the size variable with the remaining 
>>> length of
>>>      that 2 MiB buffer.
>>>
>>>      Memory.c:682 *size = spdk_min(*size, cur_size):
>>>               This assignment comes after the while loop guarded by the
>>>      condition "while (cur_size < *size)". This while loop assumes that
>>>      we have supplied some desired length for our buffer. This is 
>>> true in
>>>      the RDMA case. Now this while loop will only break on two
>>>      conditions. 1. Cur_size becomes larger than *size, or the
>>>      are_contiguous function returns false, meaning that the two
>>>      translations cannot be considered together. In the case of the 
>>> RDMA
>>>      memory map, the only time are_contiguous returns false is when the
>>>      two memory regions correspond to two distinct RDMA MRs. Notice 
>>> that
>>>      in this case - the one where are_contiguous is defined and we
>>>      supplied a size variable - the *size variable is not overwritten
>>>      with cur_size until 1. cur_size is >= *size or 2. The 
>>> are_contiguous
>>>      check fails.
>>>
>>>      In the second case detailed above, you can see how one could  pass
>>>      in a buffer that spanned a 2 MiB page and still get a translation
>>>      value equal to the size of the buffer. This second case is the one
>>>      that the rdma.c code should be using since we have a registered
>>>      are_contiguous function with the NIC and we have supplied a size
>>>      pointer filled with the length of our buffer.
>>>
>>>      -----Original Message-----
>>>      From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>>      Sent: Thursday, August 1, 2019 2:01 PM
>>>      To: Howell, Seth <seth.howell(a)intel.com
>>>      <mailto:seth.howell(a)intel.com>>; Storage Performance 
>>> Development Kit
>>>      <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>>      Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>>      multiple RDMA Memory Regions
>>>
>>>      Hi Seth,
>>>
>>>        > Just because a buffer extends past a 2 MiB boundary doesn't 
>>> mean
>>>      that it exists in two different Memory Regions. It also won't fail
>>>      the translation for being over two memory regions.
>>>
>>>      This makes sense. However, spdk_mem_map_translate() does following
>>>      to calculate translation_len:
>>>
>>>      cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
>>>      *size = spdk_min(*size, cur_size); // *size is the translation_len
>>>      from caller nvmf_rdma_fill_buffers()
>>>
>>>      In nvmf_rdma_fill_buffers(),
>>>
>>>      if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>>>                               SPDK_ERRLOG("Data buffer split over
>>>      multiple RDMA Memory Regions\n");
>>>                               return -EINVAL;
>>>                       }
>>>
>>>      This just checks if buffer sits on 2 2MB pages, not about 2 RDMA
>>>      memory regions. Is my understanding correct?
>>>
>>>      I still need some time to test. I will update you the result 
>>> with -s
>>>      as well.
>>>
>>>      Thanks,
>>>      JD
>>>
>>>
>>>      On 8/1/19 1:28 PM, Howell, Seth wrote:
>>>       > Hi JD,
>>>       >
>>>       > The 2 MiB check is just because we always do memory 
>>> registrations
>>>      at at least 2 MiB granularity (the minimum hugepage size). Just
>>>      because a buffer extends past a 2 MiB boundary doesn't mean 
>>> that it
>>>      exists in two different Memory Regions. It also won't fail the
>>>      translation for being over two memory regions.
>>>       >
>>>       > If you look at the definition of spdk_mem_map_translate we call
>>>      map->ops->are_contiguous every time we cross a 2 MiB boundary. For
>>>      RDMA, this function is registered to
>>>      spdk_nvmf_rdma_check_contiguous_entries. IF this function returns
>>>      true, then even if the buffer crosses a 2 MiB boundary, the
>>>      translation will still be valid.
>>>       > The problem you are running into is not related to the buffer
>>>      alignment, it is related to the fact that the two pages across 
>>> which
>>>      the buffer is split are registered to two different MRs in the 
>>> NIC.
>>>      This can only happen if those two pages are allocated 
>>> independently
>>>      and trigger two distinct memory event callbacks.
>>>       >
>>>       > That is why I am so interested in seeing the results from the
>>>      noticelog above ibv_reg_mr. It will tell me how your target
>>>      application is allocating memory. Also, when you start the SPDK
>>>      target, are you using the -s option? Something like
>>>      ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't 
>>> know
>>>      if it'll make a difference, it's more of a curiosity thing for 
>>> me)?
>>>       >
>>>       > Thanks,
>>>       >
>>>       > Seth
>>>       >
>>>       > -----Original Message-----
>>>       > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>>       > Sent: Thursday, August 1, 2019 11:24 AM
>>>       > To: Howell, Seth <seth.howell(a)intel.com
>>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>>       > Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>>       > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>>      multiple
>>>       > RDMA Memory Regions
>>>       >
>>>       > Hi Seth,
>>>       >
>>>       > Thanks for the detailed description, now I understand the 
>>> reason
>>>      behind the checking. But I have a question, why checking against
>>>      2MiB? Is it because DPDK uses 2MiB page size by default so that 
>>> one
>>>      RDMA memory region should not cross 2 pages?
>>>       >
>>>       >   > Once I see what your memory registrations look like and 
>>> what
>>>      addresses you're failing on, it will help me understand what is
>>>      going on better.
>>>       >
>>>       > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7
>>>      +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport
>>>      *rtransport,
>>>       >                   remaining_length -=
>>>       > rdma_req->req.iov[iovcnt].iov_len;
>>>       >
>>>       >                   if (translation_len <
>>>      rdma_req->req.iov[iovcnt].iov_len) {
>>>       > -                       SPDK_ERRLOG("Data buffer split over 
>>> multiple
>>>       > RDMA Memory Regions\n");
>>>       > +                       SPDK_ERRLOG("Data buffer split over 
>>> multiple
>>>       > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
>>>      rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
>>>      translation_len, rdma_req->req.iov[iovcnt].iov_len);
>>>       >                           return -EINVAL;
>>>       >                   }
>>>       >
>>>       > With this I can see which buffer failed the checking.
>>>       > For example, when SPDK initializes the memory pool, one of the
>>>      buffers starts with 0x2000193feb00, and when failed, I got 
>>> following:
>>>       >
>>>       > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer 
>>> split over
>>>       > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376)
>>>      (8192)
>>>       >
>>>       > This buffer has 5376B on one 2MB page and the rest of it
>>>       > (8192-5376=2816B) is on another page.
>>>       >
>>>       > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to
>>>      use iov base should make it better as iov base is 4KiB aligned. In
>>>      above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 
>>> and
>>>      it should pass the checking.
>>>       > However, another buffer in the pool is 0x2000192010c0 and
>>>      iov_base is 0x200019201000, which would fail the checking 
>>> because it
>>>      is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
>>>       >
>>>       > I will add the change from
>>>       > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the
>>>      test to get more information.
>>>       >
>>>       > I also attached the conf file too. The cmd line is "nvmf_tgt 
>>> -m 0xff
>>>       > -j
>>>       > 0x90000000:0x20000000 -c 16disk_1ns.conf"
>>>       >
>>>       > Thanks,
>>>       > JD
>>>       >
>>>       >
>>>       > On 8/1/19 7:52 AM, Howell, Seth wrote:
>>>       >> Hi JD,
>>>       >>
>>>       >> I was doing a little bit of digging in the dpdk documentation
>>>      around this process, and I have a little bit more information. We
>>>      were pretty worried about the whole dynamic memory allocations 
>>> thing
>>>      a few releases ago, so Jim helped add a flag into DPDK that
>>>      prevented allocations from being allocated and freed in different
>>>      granularities. This flag also prevents malloc heap allocations 
>>> from
>>>      spanning multiple memory events. However, this flag didn't make it
>>>      into DPDK until 19.02 (More documentation at
>>> https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer
>>>      if you're interested). We have some code in the SPDK environment
>>>      layer that tries to deal with that (see
>>>      lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that 
>>> that
>>>      function is entirely capable of handling the heap allocations
>>>      spanning multiple memory events part of the problem.
>>>       >> Since you are using dpdk 18.11, the memory callback inside of
>>>      lib/env_dpdk looks like a good candidate for our issue. My best
>>>      guess is that somehow a heap allocation from the buffer mempool is
>>>      hitting across addresses from two dynamic memory allocation 
>>> events.
>>>      I'd still appreciate it if you could send me the information in my
>>>      last e-mail, but I think we're onto something here.
>>>       >>
>>>       >> Thanks,
>>>       >>
>>>       >> Seth
>>>       >>
>>>       >> -----Original Message-----
>>>       >> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of Howell,
>>>       >> Seth
>>>       >> Sent: Thursday, August 1, 2019 5:26 AM
>>>       >> To: JD Zheng <jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>>; Storage Performance
>>>       >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>>      multiple
>>>       >> RDMA Memory Regions
>>>       >>
>>>       >> Hi JD,
>>>       >>
>>>       >> Thanks for doing that. Yeah, I am mainly looking to see how 
>>> the
>>>      mempool addresses are mapped into the NIC with ibv_reg_mr.
>>>       >>
>>>       >> I think it's odd that we are using the buffer base for the 
>>> memory
>>>       >> check, we should be using the iov base, but I don't believe 
>>> that
>>>       >> would cause the issue you are seeing. Pushed a change to 
>>> modify
>>>      that
>>>       >> behavior anyways though:
>>>       >> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>>       >>
>>>       >> There was one registration that I wasn't able to catch from 
>>> your
>>>      last log. Sorry about that, I forgot there wasn’t a debug log for
>>>      it. Can you try it again with this change which adds noticelogs 
>>> for
>>>      the relevant registrations.
>>>      https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be 
>>> able
>>>      to run your test without the -Lrdma argument this time to avoid 
>>> the
>>>      extra bloat in the logs.
>>>       >>
>>>       >> The underlying assumption of the code is that any given object
>>>      is not going to cross a dynamic memory allocation from DPDK. For a
>>>      little background, when the mempool gets created, the dpdk code
>>>      allocates some number of memzones to accommodate those buffer
>>>      objects. Then it passes those memzones down one at a time and 
>>> places
>>>      objects inside the mempool from the given memzone until the 
>>> memzone
>>>      is exhausted. Then it goes back and grabs another memzone. This
>>>      process continues until all objects are accounted for.
>>>       >> This only works if each memzone corresponds to a single memory
>>>      event when using dynamic memory allocation. My understanding was
>>>      that this was always the case, but this error makes me think that
>>>      it's possible that that's not true.
>>>       >>
>>>       >> Once I see what your memory registrations look like and what
>>>      addresses you're failing on, it will help me understand what is
>>>      going on better.
>>>       >>
>>>       >> Can you also provide the command line you are using to 
>>> start the
>>>      nvmf_tgt application and attach your configuration file?
>>>       >>
>>>       >> Thanks,
>>>       >>
>>>       >> Seth
>>>       >> -----Original Message-----
>>>       >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>>       >> Sent: Wednesday, July 31, 2019 3:13 PM
>>>       >> To: Howell, Seth <seth.howell(a)intel.com
>>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>>       >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>>      multiple
>>>       >> RDMA Memory Regions
>>>       >>
>>>       >> Hi Seth,
>>>       >>
>>>       >> After I enabled debug and ran nvmf_tgt with -L rdma, I got 
>>> some
>>>      logs like:
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command 
>>> Array:
>>>       >> 0x2000084bf000 Length: 40000 LKey: e601
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion 
>>> Array:
>>>       >> 0x200008621000 Length: 10000 LKey: e701
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x200018600000 Length: 1000000 LKey: e801
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command 
>>> Array:
>>>       >> 0x20000847e000 Length: 40000 LKey: e701
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion 
>>> Array:
>>>       >> 0x20000846d000 Length: 10000 LKey: e801
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x200019800000 Length: 1000000 LKey: e901
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command 
>>> Array:
>>>       >> 0x200016ebb000 Length: 40000 LKey: e801
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion 
>>> Array:
>>>       >> 0x20000845c000 Length: 10000 LKey: e901
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x20001aa00000 Length: 1000000 LKey: ea01
>>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command 
>>> Array:
>>>       >> 0x200016e7a000 Length: 40000 LKey: e901
>>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion 
>>> Array:
>>>       >> 0x20000844b000 Length: 10000 LKey: ea01
>>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule 
>>> Data
>>>      Array:
>>>       >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>>       >>
>>>       >> Is this what you are looking for as the memory regions registered with the NIC?
>>>       >>
>>>       >> I attached the complete log.
>>>       >>
>>>       >> Thanks,
>>>       >> JD
>>>       >>
>>>       >> On 7/30/19 5:28 PM, JD Zheng wrote:
>>>       >>> Hi Seth,
>>>       >>>
>>>       >>> Thanks for the prompt reply!
>>>       >>>
>>>       >>> Please find answers inline.
>>>       >>>
>>>       >>> JD
>>>       >>>
>>>       >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>>       >>>> Hi JD,
>>>       >>>>
>>>       >>>> Thanks for the report. I want to ask a few questions to 
>>> start
>>>       >>>> getting to the bottom of this. Since this issue doesn't 
>>> currently
>>>       >>>> reproduce on our per-patch or nightly tests, I would like to
>>>       >>>> understand what's unique about your setup so that we can
>>>      replicate
>>>       >>>> it in a per patch test to prevent future regressions.
>>>       >>> I am running it on aarch64 platform. I tried x86 platform 
>>> and I
>>>      can
>>>       >>> see same buffer alignment in memory pool but can't run the 
>>> real
>>>      test
>>>       >>> to reproduce it due to other missing pieces.
>>>       >>>
>>>       >>>>
>>>       >>>> What options are you passing when you create the rdma 
>>> transport?
>>>       >>>> Are you creating it over RPC or in a configuration file?
>>>       >>> I am using conf file. Pls let me know if you'd like to look
>>>      into conf file.
>>>       >>>
>>>       >>>>
>>>       >>>> Are you using the current DPDK submodule as your environment
>>>       >>>> abstraction layer?
>>>       >>> No. Our project uses specific version of DPDK, which is 
>>> v18.11. I
>>>       >>> did quick test using latest and DPDK submodule on x86, and 
>>> the
>>>       >>> buffer alignment is the same, i.e. 64B aligned.
>>>       >>>
>>>       >>>>
>>>       >>>> I notice that your error log is printing from
>>>       >>>> spdk_nvmf_transport_poll_group_create, which value 
>>> exactly are
>>>      you
>>>       >>>> printing out?
>>>       >>> Here is patch to add dbg print. Pls note that SPDK version is
>>>      v19.04
>>>       >>>
>>>       >>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>>       >>> SPDK_NOTICELOG("Unable to
>>>      reserve
>>>       >>> the full number of buffers for the pg buffer cache.\n");
>>>       >>>                                    break;
>>>       >>>                            }
>>>       >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>>       >>> group->buf_cache_count, group->buf_cache_size);
>>>       >>> STAILQ_INSERT_HEAD(&group->buf_cache,
>>>       >>> buf, link);
>>>       >>> group->buf_cache_count++;
>>>       >>>                    }
>>>       >>>
>>>       >>>>
>>>       >>>> Can you run your target with the -L rdma option to get a 
>>> dump of
>>>       >>>> the memory regions registered with the NIC?
>>>       >>> Let me test and get back to you soon.
>>>       >>>
>>>       >>>>
>>>       >>>> We made a couple of changes to this code when dynamic memory
>>>       >>>> allocations were added to DPDK. There were some safeguards
>>>      that we
>>>       >>>> added to try and make sure this case wouldn't hit, so I'd 
>>> like to
>>>       >>>> make sure you are running on the latest DPDK submodule as 
>>> well as
>>>       >>>> the latest SPDK to narrow down where we need to look.
>>>       >>> Unfortunately I can't easily update DPDK because other team
>>>       >>> maintains it internally. But if it can be repro and fixed in
>>>      latest,
>>>       >>> I will try to pull in the fix.
>>>       >>>
>>>       >>>>
>>>       >>>> Thanks,
>>>       >>>>
>>>       >>>> Seth
>>>       >>>>
>>>       >>>> -----Original Message-----
>>>       >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of JD Zheng
>>>       >>>> via SPDK
>>>       >>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>>       >>>> To: spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>
>>>       >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com
>>>      <mailto:jiandong.zheng(a)broadcom.com>>
>>>       >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over 
>>> multiple
>>>       >>>> RDMA Memory Regions
>>>       >>>>
>>>       >>>> Hello,
>>>       >>>>
>>>       >>>> When I run nvmf_tgt over RDMA using latest SPDK code, I
>>>       >>>> occasionally ran into this error:
>>>       >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer 
>>> split
>>>       >>>> over multiple RDMA Memory Regions"
>>>       >>>>
>>>       >>>> After digging into the code, I found that
>>>      nvmf_rdma_fill_buffers()
>>>       >>>> calls spdk_mem_map_translate() to check if a data buffer 
>>> sits on 2
>>>       >>>> 2MB pages, and if it is the case, it reports this error.
>>>       >>>>
>>>       >>>> The following commit added change to use data buffer start
>>>      address
>>>       >>>> to calculate the size between buffer start address and 2MB
>>>      boundary.
>>>       >>>> The caller nvmf_rdma_fill_buffers() uses the size to 
>>> compare with
>>>       >>>> IO Unit size (which is 8KB in my conf) to determine if 
>>> the buffer
>>>       >>>> passes 2MB boundary.
>>>       >>>>
>>>       >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>>       >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>>>      <mailto:dariusz.stojaczyk(a)intel.com>>
>>>       >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>       >>>>
>>>       >>>>         memory: fix contiguous memory calculation for 
>>> unaligned
>>>       >>>> buffers
>>>       >>>>
>>>       >>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool
>>>      and new
>>>       >>>> request will use free buffer from that pool and the 
>>> buffer start
>>>       >>>> address is passed to nvmf_rdma_fill_buffers(). But I 
>>> found that
>>>       >>>> these buffers are not 2MB aligned and not IOUnitSize 
>>> aligned (8KB
>>>       >>>> in my
>>>       >>>> case) either, instead, they are 64Byte aligned so that some
>>>      buffers
>>>       >>>> will fail the checking and leads to this problem.
>>>       >>>>
>>>       >>>> The corresponding code snippets are as following:
>>>       >>>> spdk_nvmf_transport_create()
>>>       >>>> {
>>>       >>>> ...
>>>       >>>>         transport->data_buf_pool =
>>>       >>>> spdk_mempool_create(spdk_mempool_name,
>>>       >>>>  opts->num_shared_buffers,
>>>       >>>>  opts->io_unit_size +
>>>       >>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>       >>>>
>>>        SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>       >>>>  SPDK_ENV_SOCKET_ID_ANY); ...
>>>       >>>> }
>>>       >>>>
>>>       >>>> Also some debug print I added shows the start address of the
>>>      buffers:
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019258800 0(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x2000192557c0 1(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019252780 2(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x20001924f740 3(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x20001924c700 4(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x2000192496c0 5(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019246680 6(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019243640 7(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x200019240600 8(32)
>>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: 
>>> *ERROR*:
>>>       >>>> 0x20001923d5c0 9(32)
>>>       >>>> ...
>>>       >>>>
>>>       >>>> It looks like either the buffer allocation has alignment 
>>> issue or
>>>       >>>> the checking is not correct.
>>>       >>>>
>>>       >>>> Please advise how to fix this problem.
>>>       >>>>
>>>       >>>> Thanks,
>>>       >>>> JD Zheng
>>>       >>>> _______________________________________________
>>>       >>>> SPDK mailing list
>>>       >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>>       >>>> https://lists.01.org/mailman/listinfo/spdk
>>>       >>>>
>>>       >> _______________________________________________
>>>       >> SPDK mailing list
>>>       >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>>       >> https://lists.01.org/mailman/listinfo/spdk
>>>       >>
>>>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-19 21:42 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-08-19 21:42 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 31342 bytes --]

Hi Seth,

It sometimes triggered a seg fault, but I couldn't get a backtrace due to a 
likely corrupted stack. With my workaround, this is no longer seen.

Let me submit my change as an RFC to gerrit. It probably isn't necessary to 
upstream it, as DPDK 19.02 should fix this problem properly.
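
Roughly, the idea of the workaround looks like the sketch below (illustrative
only, not the actual change; the helper name, its exact call site, and the
surrounding wiring are assumptions):

#include <stdbool.h>
#include <stdint.h>
#include "spdk/env.h"

/*
 * Illustrative helper: returns true if the usable range of a shared data
 * buffer would span two RDMA memory regions, i.e. the translation length
 * that spdk_mem_map_translate() hands back is shorter than the buffer itself.
 */
static bool
request_buffer_is_split(struct spdk_mem_map *map, void *iov_base, uint64_t iov_len)
{
        uint64_t translation_len = iov_len;

        /* The returned lkey is ignored; only the clamped length matters. */
        (void)spdk_mem_map_translate(map, (uint64_t)(uintptr_t)iov_base,
                                     &translation_len);

        return translation_len < iov_len;
}

/*
 * The workaround itself: when such a buffer is pulled from the shared
 * mempool, never hand it back with spdk_mempool_put().  The mempool keeps
 * treating it as allocated and never gives it out again: a small, bounded,
 * intentional leak (a few buffers per overnight run, per this thread).
 */

The trade-off is that handful of deliberately leaked buffers per run;
--match-allocations in DPDK 19.02+ makes the whole workaround unnecessary, as
noted elsewhere in this thread.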

Thanks,
JD

On 8/19/19 2:16 PM, Howell, Seth wrote:
> Hi JD,
> 
> What issue specifically did you see? If there is something measurable happening (other than the error message) then I think it should be high priority to get a more permanent workaround into upstream SPDK.
> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Monday, August 19, 2019 2:03 PM
> To: Howell, Seth <seth.howell(a)intel.com>
> Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan Richardson <jonathan.richardson(a)broadcom.com>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
> 
> Hi Seth,
> 
>   > Unfortunately, the only way
>   > to protect fully against this happening is by using the DPDK flag  > --match-allocations which was introduced in DPDK 19.02.
> 
> Then I need to use DPDK 19.02. Do I need to enable this flag explicitly when moving DPDK 19.02?
> 
>   > The good news is that the SPDK target will skip these buffers without  > bricking, causing data corruption or doing any otherwise bad things.
> 
> Unfortunately this is not what I saw. It appeared that SPDK gave up this split buffer, but it still causes issue, maybe because it was tried too many times(?).
> 
> Currently, I have to use DPDK 18.11 so that I added a couple of workarounds to prevent the split buffer from being used before reaching fill_buffers(). I did a little trick there to call spdk_mempool_get() but not mempool_put later, so that this buffer is set as "allocated" in mempool and will not be tried again and again. It does look like small memory leak though. We can usually see 2-3 split buffers during overnight run, btw.
> 
> This seems working OK.
> 
> For sure, I will measure performance later.
> 
> Thanks,
> JD
> 
> 
> On 8/19/19 1:12 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> Thanks for performing that experiment. With this new information, I
>> think we can be pretty much 100% sure that the problem is related to
>> the mempool being split over two DPDK memzones. Unfortunately, the
>> only way to protect fully against this happening is by using the DPDK
>> flag --match-allocations which was introduced in DPDK 19.02. Jim
>> helped advocate for this flag specifically because of this problem
>> with mempools and RDMA.
>>
>> The good news is that the SPDK target will skip these buffers without
>> bricking, causing data corruption or doing any otherwise bad things.
>> What ends up happening is that the nvmf_rdma_fill_buffers function
>> will print the error message and then return NULL which will trigger
>> the target to retry the I/O again. By that time, there will be another
>> buffer there for the request to use and it won’t fail the second time
>> around. So the code currently handles the problem in a technically
>> correct way i.e. It’s not going to brick the target or initiator by
>> trying to use a buffer that spans multiple Memory Regions. Instead, it
>> properly recognizes that it is trying to use a bad buffer and
>> reschedules the request buffer parsing.
>>
>> However, I am a little worried over the fact that these buffers remain
>> in the mempool and can be repeatedly used by the application. I can
>> picture a scenario where this could possibly have a  performance impact.
>> Take for example a mempool with 128 entries in it in which one of them
>> is split over a memzone. Since this split buffer will never find its
>> way into a request, it’s possible that this split buffer gets pulled
>> up into requests more often than other buffers and subsequently fails
>> in nvmf_rdma_fill_buffers causing requests to have to be rescheduled
>> to the next time the poller runs. Depending on how frequently this
>> happens, the performance impact *could possibly* add up.
>>
>> I have as yet been unable to replicate the split buffer error. One
>> thing you could try to see if there is any measurable performance
>> impact is try starting the NVMe-oF target with DPDK legacy memory mode
>> which will move all memory allocations to startup and prevent you from
>> splitting buffers. Then run a benchmark with a lot of connections at
>> high queue depth and see what the performance looks like compared to
>> the dynamic memory model. If there is a significant performance
>> impact, we may have to modify how we handle this error case.
>>
>> Thanks,
>>
>> Seth
>>
>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> Sent: Monday, August 12, 2019 4:17 PM
>> To: Howell, Seth <seth.howell(a)intel.com>
>> Cc: Storage Performance Development Kit <spdk(a)lists.01.org>;
>> Jonathan Richardson <jonathan.richardson(a)broadcom.com>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>> multiple RDMA Memory Regions
>>
>> + Jonanthan
>>
>> Hi Seth,
>>
>> We finally got chance to test with more logs enabled. You are correct
>> that that problematic buffer does sit on 2 registered memory regions:
>>
>> The problematic buffer is "Buffer address: 200019bfeb00"; the actual buffer
>> pointer used is "200019bff000" (SPDK makes it 4KiB aligned), and its size is
>> 8KiB (0x2000), so it does sit on 2 registered memory regions.
>>
>> However, it looks like SPDK/DPDK allocates buffers starting from the end of
>> a region and going up, and due to the extra room and alignment of each
>> buffer there is a chance that one buffer can exceed the memory region
>> boundary?
>>
>> In this case, the buffers are between 0x200019997800 and 0x200019c5320,
>> so the last buffer exceeds one region and goes into the next one.
>>
>> Some logs for your information:
>>
>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>> 200019800000, memory region length: 400000
>> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
>> 200019c00000, memory region length: 400000
>>
>> ...
>>
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019bfeb00 27(32)
>>
>> ...
>>
>> rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer
>> address: 200019bfeb00 iov_base address 200019bff000
>>
>> Thanks,
>>
>> JD
>>
>> On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com
>> <mailto:seth.howell(a)intel.com>> wrote:
>>
>>      There are two different assignments that you need to look at. I'll
>>      detail the cases below based on line numbers from the latest master.
>>
>>      Memory.c:656 *size = spdk_min(*size, cur_size):
>>               This assignment is inside of the conditional "if(size ==
>>      NULL || map->ops.are_contiguous == NULL)"
>>               So in other words, at the offset, we figure out how much
>>      space we have left in the current translation. Then, if there is no
>>      callback to tell us whether the next translation will be contiguous
>>      to this one, we fill the size variable with the remaining length of
>>      that 2 MiB buffer.
>>
>>      Memory.c:682 *size = spdk_min(*size, cur_size):
>>               This assignment comes after the while loop guarded by the
>>      condition "while (cur_size < *size)". This while loop assumes that
>>      we have supplied some desired length for our buffer. This is true in
>>      the RDMA case. Now this while loop will only break on two
>>      conditions. 1. Cur_size becomes larger than *size, or the
>>      are_contiguous function returns false, meaning that the two
>>      translations cannot be considered together. In the case of the RDMA
>>      memory map, the only time are_contiguous returns false is when the
>>      two memory regions correspond to two distinct RDMA MRs. Notice that
>>      in this case - the one where are_contiguous is defined and we
>>      supplied a size variable - the *size variable is not overwritten
>>      with cur_size until 1. cur_size is >= *size or 2. The are_contiguous
>>      check fails.
>>
>>      In the second case detailed above, you can see how one could  pass
>>      in a buffer that spanned a 2 MiB page and still get a translation
>>      value equal to the size of the buffer. This second case is the one
>>      that the rdma.c code should be using since we have a registered
>>      are_contiguous function with the NIC and we have supplied a size
>>      pointer filled with the length of our buffer.
>>
>>      -----Original Message-----
>>      From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>      Sent: Thursday, August 1, 2019 2:01 PM
>>      To: Howell, Seth <seth.howell(a)intel.com
>>      <mailto:seth.howell(a)intel.com>>; Storage Performance Development Kit
>>      <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>      Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple RDMA Memory Regions
>>
>>      Hi Seth,
>>
>>        > Just because a buffer extends past a 2 MiB boundary doesn't mean
>>      that it exists in two different Memory Regions. It also won't fail
>>      the translation for being over two memory regions.
>>
>>      This makes sense. However, spdk_mem_map_translate() does following
>>      to calculate translation_len:
>>
>>      cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
>>      *size = spdk_min(*size, cur_size); // *size is the translation_len
>>      from caller nvmf_rdma_fill_buffers()
>>
>>      In nvmf_rdma_fill_buffers(),
>>
>>      if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>>                               SPDK_ERRLOG("Data buffer split over
>>      multiple RDMA Memory Regions\n");
>>                               return -EINVAL;
>>                       }
>>
>>      This just checks if buffer sits on 2 2MB pages, not about 2 RDMA
>>      memory regions. Is my understanding correct?
>>
>>      I still need some time to test. I will update you the result with -s
>>      as well.
>>
>>      Thanks,
>>      JD
>>
>>
>>      On 8/1/19 1:28 PM, Howell, Seth wrote:
>>       > Hi JD,
>>       >
>>       > The 2 MiB check is just because we always do memory registrations
>>      at at least 2 MiB granularity (the minimum hugepage size). Just
>>      because a buffer extends past a 2 MiB boundary doesn't mean that it
>>      exists in two different Memory Regions. It also won't fail the
>>      translation for being over two memory regions.
>>       >
>>       > If you look at the definition of spdk_mem_map_translate we call
>>      map->ops->are_contiguous every time we cross a 2 MiB boundary. For
>>      RDMA, this function is registered to
>>      spdk_nvmf_rdma_check_contiguous_entries. IF this function returns
>>      true, then even if the buffer crosses a 2 MiB boundary, the
>>      translation will still be valid.
>>       > The problem you are running into is not related to the buffer
>>      alignment, it is related to the fact that the two pages across which
>>      the buffer is split are registered to two different MRs in the NIC.
>>      This can only happen if those two pages are allocated independently
>>      and trigger two distinct memory event callbacks.
>>       >
>>       > That is why I am so interested in seeing the results from the
>>      noticelog above ibv_reg_mr. It will tell me how your target
>>      application is allocating memory. Also, when you start the SPDK
>>      target, are you using the -s option? Something like
>>      ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know
>>      if it'll make a difference, it's more of a curiosity thing for me)?
>>       >
>>       > Thanks,
>>       >
>>       > Seth
>>       >
>>       > -----Original Message-----
>>       > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>       > Sent: Thursday, August 1, 2019 11:24 AM
>>       > To: Howell, Seth <seth.howell(a)intel.com
>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>       > Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>       > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple
>>       > RDMA Memory Regions
>>       >
>>       > Hi Seth,
>>       >
>>       > Thanks for the detailed description, now I understand the reason
>>      behind the checking. But I have a question, why checking against
>>      2MiB? Is it because DPDK uses 2MiB page size by default so that one
>>      RDMA memory region should not cross 2 pages?
>>       >
>>       >   > Once I see what your memory registrations look like and what
>>      addresses you're failing on, it will help me understand what is
>>      going on better.
>>       >
>>       > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7
>>      +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport
>>      *rtransport,
>>       >                   remaining_length -=
>>       > rdma_req->req.iov[iovcnt].iov_len;
>>       >
>>       >                   if (translation_len <
>>      rdma_req->req.iov[iovcnt].iov_len) {
>>       > -                       SPDK_ERRLOG("Data buffer split over multiple
>>       > RDMA Memory Regions\n");
>>       > +                       SPDK_ERRLOG("Data buffer split over multiple
>>       > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
>>      rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
>>      translation_len, rdma_req->req.iov[iovcnt].iov_len);
>>       >                           return -EINVAL;
>>       >                   }
>>       >
>>       > With this I can see which buffer failed the checking.
>>       > For example, when SPKD initializes the memory pool, one of the
>>      buffers starts with 0x2000193feb00, and when failed, I got following:
>>       >
>>       > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
>>       > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376)
>>      (8192)
>>       >
>>       > This buffer has 5376B on one 2MB page and the rest of it
>>       > (8192-5376=2816B) is on another page.
>>       >
>>       > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to
>>      use iov base should make it better as iov base is 4KiB aligned. In
>>      above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 and
>>      it should pass the checking.
>>       > However, another buffer in the pool is 0x2000192010c0 and
>>      iov_base is 0x200019201000, which would fail the checking because it
>>      is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
>>       >
>>       > I will add the change from
>>       > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the
>>      test to get more information.
>>       >
>>       > I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff
>>       > -j
>>       > 0x90000000:0x20000000 -c 16disk_1ns.conf"
>>       >
>>       > Thanks,
>>       > JD
>>       >
>>       >
>>       > On 8/1/19 7:52 AM, Howell, Seth wrote:
>>       >> Hi JD,
>>       >>
>>       >> I was doing a little bit of digging in the dpdk documentation
>>      around this process, and I have a little bit more information. We
>>      were pretty worried about the whole dynamic memory allocations thing
>>      a few releases ago, so Jim helped add a flag into DPDK that
>>      prevented allocations from being allocated and freed in different
>>      granularities. This flag also prevents malloc heap allocations from
>>      spanning multiple memory events. However, this flag didn't make it
>>      into DPDK until 19.02 (More documentation at
>>      https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer
>>      if you're interested). We have some code in the SPDK environment
>>      layer that tries to deal with that (see
>>      lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that
>>      function is entirely capable of handling the heap allocations
>>      spanning multiple memory events part of the problem.
>>       >> Since you are using dpdk 18.11, the memory callback inside of
>>      lib/env_dpdk looks like a good candidate for our issue. My best
>>      guess is that somehow a heap allocation from the buffer mempool is
>>      hitting across addresses from two dynamic memory allocation events.
>>      I'd still appreciate it if you could send me the information in my
>>      last e-mail, but I think we're onto something here.
>>       >>
>>       >> Thanks,
>>       >>
>>       >> Seth
>>       >>
>>       >> -----Original Message-----
>>       >> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of Howell,
>>       >> Seth
>>       >> Sent: Thursday, August 1, 2019 5:26 AM
>>       >> To: JD Zheng <jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>>; Storage Performance
>>       >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple
>>       >> RDMA Memory Regions
>>       >>
>>       >> Hi JD,
>>       >>
>>       >> Thanks for doing that. Yeah, I am mainly looking to see how the
>>      mempool addresses are mapped into the NIC with ibv_reg_mr.
>>       >>
>>       >> I think it's odd that we are using the buffer base for the memory
>>       >> check, we should be using the iov base, but I don't believe that
>>       >> would cause the issue you are seeing. Pushed a change to modify
>>      that
>>       >> behavior anyways though:
>>       >> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>       >>
>>       >> There was one registration that I wasn't able to catch from your
>>      last log. Sorry about that, I forgot there wasn’t a debug log for
>>      it. Can you try it again with this change which adds noticelogs for
>>      the relevant registrations.
>>      https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able
>>      to run your test without the -Lrdma argument this time to avoid the
>>      extra bloat in the logs.
>>       >>
>>       >> The underlying assumption of the code is that any given object
>>      is not going to cross a dynamic memory allocation from DPDK. For a
>>      little background, when the mempool gets created, the dpdk code
>>      allocates some number of memzones to accommodate those buffer
>>      objects. Then it passes those memzones down one at a time and places
>>      objects inside the mempool from the given memzone until the memzone
>>      is exhausted. Then it goes back and grabs another memzone. This
>>      process continues until all objects are accounted for.
>>       >> This only works if each memzone corresponds to a single memory
>>      event when using dynamic memory allocation. My understanding was
>>      that this was always the case, but this error makes me think that
>>      it's possible that that's not true.
>>       >>
>>       >> Once I see what your memory registrations look like and what
>>      addresses you're failing on, it will help me understand what is
>>      going on better.
>>       >>
>>       >> Can you also provide the command line you are using to start the
>>      nvmf_tgt application and attach your configuration file?
>>       >>
>>       >> Thanks,
>>       >>
>>       >> Seth
>>       >> -----Original Message-----
>>       >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>]
>>       >> Sent: Wednesday, July 31, 2019 3:13 PM
>>       >> To: Howell, Seth <seth.howell(a)intel.com
>>      <mailto:seth.howell(a)intel.com>>; Storage Performance
>>       >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>>       >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>>      multiple
>>       >> RDMA Memory Regions
>>       >>
>>       >> Hi Seth,
>>       >>
>>       >> After I enabled debug and ran nvmf_tgt with -L rdma, I got some
>>      logs like:
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x2000084bf000 Length: 40000 LKey: e601
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x200008621000 Length: 10000 LKey: e701
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x200018600000 Length: 1000000 LKey: e801
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x20000847e000 Length: 40000 LKey: e701
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x20000846d000 Length: 10000 LKey: e801
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x200019800000 Length: 1000000 LKey: e901
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x200016ebb000 Length: 40000 LKey: e801
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x20000845c000 Length: 10000 LKey: e901
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x20001aa00000 Length: 1000000 LKey: ea01
>>       >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>>       >> 0x200016e7a000 Length: 40000 LKey: e901
>>       >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>>       >> 0x20000844b000 Length: 10000 LKey: ea01
>>       >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>>      Array:
>>       >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>       >>
>>       >> Is this what you are looking for as memory regions registered for the NIC?
>>       >>
>>       >> I attached the complete log.
>>       >>
>>       >> Thanks,
>>       >> JD
>>       >>
>>       >> On 7/30/19 5:28 PM, JD Zheng wrote:
>>       >>> Hi Seth,
>>       >>>
>>       >>> Thanks for the prompt reply!
>>       >>>
>>       >>> Please find answers inline.
>>       >>>
>>       >>> JD
>>       >>>
>>       >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>       >>>> Hi JD,
>>       >>>>
>>       >>>> Thanks for the report. I want to ask a few questions to start
>>       >>>> getting to the bottom of this. Since this issue doesn't currently
>>       >>>> reproduce on our per-patch or nightly tests, I would like to
>>       >>>> understand what's unique about your setup so that we can
>>      replicate
>>       >>>> it in a per patch test to prevent future regressions.
>>       >>> I am running it on aarch64 platform. I tried x86 platform and I
>>      can
>>       >>> see same buffer alignment in memory pool but can't run the real
>>      test
>>       >>> to reproduce it due to other missing pieces.
>>       >>>
>>       >>>>
>>       >>>> What options are you passing when you create the rdma transport?
>>       >>>> Are you creating it over RPC or in a configuration file?
>>       >>> I am using conf file. Pls let me know if you'd like to look
>>      into conf file.
>>       >>>
>>       >>>>
>>       >>>> Are you using the current DPDK submodule as your environment
>>       >>>> abstraction layer?
>>       >>> No. Our project uses specific version of DPDK, which is v18.11. I
>>       >>> did quick test using latest and DPDK submodule on x86, and the
>>       >>> buffer alignment is the same, i.e. 64B aligned.
>>       >>>
>>       >>>>
>>       >>>> I notice that your error log is printing from
>>       >>>> spdk_nvmf_transport_poll_group_create, which value exactly are
>>      you
>>       >>>> printing out?
>>       >>> Here is patch to add dbg print. Pls note that SPDK version is
>>      v19.04
>>       >>>
>>       >>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>       >>>                                    SPDK_NOTICELOG("Unable to
>>      reserve
>>       >>> the full number of buffers for the pg buffer cache.\n");
>>       >>>                                    break;
>>       >>>                            }
>>       >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>       >>> group->buf_cache_count, group->buf_cache_size);
>>       >>>                            STAILQ_INSERT_HEAD(&group->buf_cache,
>>       >>> buf, link);
>>       >>>                            group->buf_cache_count++;
>>       >>>                    }
>>       >>>
>>       >>>>
>>       >>>> Can you run your target with the -L rdma option to get a dump of
>>       >>>> the memory regions registered with the NIC?
>>       >>> Let me test and get back to you soon.
>>       >>>
>>       >>>>
>>       >>>> We made a couple of changes to this code when dynamic memory
>>       >>>> allocations were added to DPDK. There were some safeguards
>>      that we
>>       >>>> added to try and make sure this case wouldn't hit, so I'd like to
>>       >>>> make sure you are running on the latest DPDK submodule as well as
>>       >>>> the latest SPDK to narrow down where we need to look.
>>       >>> Unfortunately I can't easily update DPDK because other team
>>       >>> maintains it internally. But if it can be repro and fixed in
>>      latest,
>>       >>> I will try to pull in the fix.
>>       >>>
>>       >>>>
>>       >>>> Thanks,
>>       >>>>
>>       >>>> Seth
>>       >>>>
>>       >>>> -----Original Message-----
>>       >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>>      <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of JD Zheng
>>       >>>> via SPDK
>>       >>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>       >>>> To: spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>
>>       >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com
>>      <mailto:jiandong.zheng(a)broadcom.com>>
>>       >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>>       >>>> RDMA Memory Regions
>>       >>>>
>>       >>>> Hello,
>>       >>>>
>>       >>>> When I run nvmf_tgt over RDMA using the latest SPDK code, I
>>       >>>> occasionally ran into this error:
>>       >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split
>>       >>>> over multiple RDMA Memory Regions"
>>       >>>>
>>       >>>> After digging into the code, I found that
>>      nvmf_rdma_fill_buffers()
>>       >>>> calls spdk_mem_map_translate() to check if a data buffer sit on 2
>>       >>>> 2MB pages, and if it is the case, it reports this error.
>>       >>>>
>>       >>>> The following commit added change to use data buffer start
>>      address
>>       >>>> to calculate the size between buffer start address and 2MB
>>      boundary.
>>       >>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with
>>       >>>> IO Unit size (which is 8KB in my conf) to determine if the buffer
>>       >>>> passes 2MB boundary.
>>       >>>>
>>       >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>       >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>>      <mailto:dariusz.stojaczyk(a)intel.com>>
>>       >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>       >>>>
>>       >>>>         memory: fix contiguous memory calculation for unaligned
>>       >>>> buffers
>>       >>>>
>>       >>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool
>>      and new
>>       >>>> request will use free buffer from that pool and the buffer start
>>       >>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
>>       >>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB
>>       >>>> in my
>>       >>>> case) either, instead, they are 64Byte aligned so that some
>>      buffers
>>       >>>> will fail the checking and leads to this problem.
>>       >>>>
>>       >>>> The corresponding code snippets are as following:
>>       >>>> spdk_nvmf_transport_create()
>>       >>>> {
>>       >>>> ...
>>       >>>>         transport->data_buf_pool =
>>       >>>> spdk_mempool_create(spdk_mempool_name,
>>       >>>>                                    opts->num_shared_buffers,
>>       >>>>                                    opts->io_unit_size +
>>       >>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>       >>>>
>>        SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>       >>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
>>       >>>> }
>>       >>>>
>>       >>>> Also some debug print I added shows the start address of the
>>      buffers:
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019258800 0(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x2000192557c0 1(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019252780 2(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x20001924f740 3(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x20001924c700 4(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x2000192496c0 5(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019246680 6(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019243640 7(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x200019240600 8(32)
>>       >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>       >>>> 0x20001923d5c0 9(32)
>>       >>>> ...
>>       >>>>
>>       >>>> It looks like either the buffer allocation has alignment issue or
>>       >>>> the checking is not correct.
>>       >>>>
>>       >>>> Please advise how to fix this problem.
>>       >>>>
>>       >>>> Thanks,
>>       >>>> JD Zheng
>>       >>>> _______________________________________________
>>       >>>> SPDK mailing list
>>       >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>       >>>> https://lists.01.org/mailman/listinfo/spdk
>>       >>>>
>>       >> _______________________________________________
>>       >> SPDK mailing list
>>       >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>>       >> https://lists.01.org/mailman/listinfo/spdk
>>       >>
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-19 21:02 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-08-19 21:02 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 29364 bytes --]

Hi Seth,

 > Unfortunately, the only way
 > to protect fully against this happening is by using the DPDK flag
 > --match-allocations which was introduced in DPDK 19.02.

Then I need to use DPDK 19.02. Do I need to enable this flag explicitly 
when moving DPDK 19.02?
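
For reference, --match-allocations is a DPDK EAL command-line option; SPDK's env_dpdk layer should add it automatically when built against DPDK >= 19.02, though that is worth verifying in lib/env_dpdk for the version in use. A minimal sketch, assuming an application that initializes the EAL directly rather than through the SPDK app framework:

#include <rte_eal.h>
#include <stdio.h>

int main(void)
{
    /* Sketch only: pass --match-allocations (DPDK >= 19.02) so hugepages are
     * freed back in exactly the granularity they were allocated in, which
     * keeps a mempool from straddling two memory events. */
    char arg0[] = "nvmf_tgt";
    char arg1[] = "--match-allocations";
    char *eal_argv[] = { arg0, arg1 };
    int eal_argc = 2;

    if (rte_eal_init(eal_argc, eal_argv) < 0) {
        fprintf(stderr, "rte_eal_init failed\n");
        return 1;
    }

    /* ... rest of target setup ... */
    return 0;
}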

 > The good news is that the SPDK target will skip these buffers without
 > bricking, causing data corruption or doing any otherwise bad things.

Unfortunately this is not what I saw. It appeared that SPDK gave up this 
split buffer, but it still causes issue, maybe because it was tried too 
many times(?).

Currently, I have to use DPDK 18.11 so that I added a couple of 
workarounds to prevent the split buffer from being used before reaching 
fill_buffers(). I did a little trick there to call spdk_mempool_get() 
but not mempool_put later, so that this buffer is set as "allocated" in 
mempool and will not be tried again and again. It does look like small 
memory leak though. We can usually see 2-3 split buffers during 
overnight run, btw.

This seems working OK.
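
A minimal sketch of that quarantine trick. The helper names are made up, io_unit_size here is the 8 KiB from the config above, and the 2 MiB-boundary test is only a conservative stand-in for the real failing condition (a buffer whose pages land in two different RDMA MRs):

#include "spdk/env.h"
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define HUGEPAGE_SIZE  0x200000ULL   /* 2 MiB */
#define IOV_ALIGNMENT  0x1000ULL     /* iov_base is aligned up to 4 KiB */

/* Conservative check: would the 4 KiB-aligned data area cross a 2 MiB line? */
static bool
buf_crosses_2mb_boundary(void *buf, size_t io_unit_size)
{
    uint64_t start = ((uintptr_t)buf + IOV_ALIGNMENT - 1) & ~(IOV_ALIGNMENT - 1);
    uint64_t end = start + io_unit_size - 1;

    return (start / HUGEPAGE_SIZE) != (end / HUGEPAGE_SIZE);
}

/* Drain the shared buffer pool once at startup and only return the "good"
 * buffers; split ones stay checked out forever (a deliberate, small leak). */
static void
quarantine_split_buffers(struct spdk_mempool *pool, size_t io_unit_size)
{
    size_t i, count = spdk_mempool_count(pool);
    void **bufs = calloc(count, sizeof(void *));

    if (bufs == NULL) {
        return;
    }
    for (i = 0; i < count; i++) {
        bufs[i] = spdk_mempool_get(pool);
    }
    for (i = 0; i < count; i++) {
        if (bufs[i] != NULL &&
            !buf_crosses_2mb_boundary(bufs[i], io_unit_size)) {
            spdk_mempool_put(pool, bufs[i]);
        }
    }
    free(bufs);
}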

For sure, I will measure performance later.

Thanks,
JD


On 8/19/19 1:12 PM, Howell, Seth wrote:
> Hi JD,
> 
> Thanks for performing that experiment. With this new information, I 
> think we can be pretty much 100% sure that the problem is related to the 
> mempool being split over two DPDK memzones. Unfortunately, the only way 
> to protect fully against this happening is by using the DPDK flag 
> --match-allocations which was introduced in DPDK 19.02. Jim helped 
> advocate for this flag specifically because of this problem with 
> mempools and RPMA.
> 
> The good news is that the SPDK target will skip these buffers without 
> bricking, causing data corruption or doing any otherwise bad things. 
> What ends up happening is that the nvmf_rdma_fill_buffers function will 
> print the error message and then return NULL which will trigger the 
> target to retry the I/O again. By that time, there will be another 
> buffer there for the request to use and it won’t fail the second time 
> around. So the code currently handles the problem in a technically 
> correct way i.e. It’s not going to brick the target or initiator by 
> trying to use a buffer that spans multiple Memory Regions. Instead, it 
> properly recognizes that it is trying to use a bad buffer and 
> reschedules the request buffer parsing.
> 
> However, I am a little worried over the fact that these buffers remain 
> in the mempool and can be repeatedly used by the application. I can 
> picture a scenario where this could possibly have a  performance impact. 
> Take for example a mempool with 128 entries in it in which one of them 
> is split over a memzone. Since this split buffer will never find its way 
> into a request, it’s possible that this split buffer gets pulled up into 
> requests more often than other buffers and subsequently fails in 
> nvmf_rdma_fill_buffers causing requests to have to be rescheduled to the 
> next time the poller runs. Depending on how frequently this happens, the 
> performance impact *could possibly* add up.
> 
> I have as yet been unable to replicate the split buffer error. One thing 
> you could try to see if there is any measurable performance impact is 
> try starting the NVMe-oF target with DPDK legacy memory mode which will 
> move all memory allocations to startup and prevent you from splitting 
> buffers. Then run a benchmark with a lot of connections at high queue 
> depth and see what the performance looks like compared to the dynamic 
> memory model. If there is a significant performance impact, we may have 
> to modify how we handle this error case.
> 
> Thanks,
> 
> Seth
> 
> *From:*JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> *Sent:* Monday, August 12, 2019 4:17 PM
> *To:* Howell, Seth <seth.howell(a)intel.com>
> *Cc:* Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan 
> Richardson <jonathan.richardson(a)broadcom.com>
> *Subject:* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
> RDMA Memory Regions
> 
> + Jonathan
> 
> Hi Seth,
> 
> We finally got chance to test with more logs enabled. You are correct 
> that that problematic buffer does sit on 2 registered memory regions:
> 
> The problematic buffer is "Buffer address: 200019bfeb00", actual used
> buffer pointer is "200019bff000" (SPDK makes it 4KiB aligned), size is
> 8KiB(0x2000) so it does sit on 2 registered memory regions.
> 
> However, it looks like SPDK/DPDK allocates buffers starting from the end
> of a region and going up, and due to the extra room and alignment of each
> buffer there is a chance that one buffer can exceed a memory region
> boundary?
> 
> In this case, the buffers are between 0x200019997800 and 0x200019c5320 
> so that last buffer exceeds one region and goes to next one.
> 
> Some logs for your information:
> 
> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start: 
> 200019800000, memory region length: 400000
> rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start: 
> 200019c00000, memory region length: 400000
> 
> ...
> 
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
> 0x200019bfeb00 27(32)
> 
> ...
> 
> rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer
> address: 200019bfeb00 iov_base address 200019bff000
> 
> Thanks,
> 
> JD
> 
> On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com 
> <mailto:seth.howell(a)intel.com>> wrote:
> 
>     There are two different assignments that you need to look at. I'll
>     detail the cases below based on line numbers from the latest master.
> 
>     Memory.c:656 *size = spdk_min(*size, cur_size):
>              This assignment is inside of the conditional "if(size ==
>     NULL || map->ops.are_contiguous == NULL)"
>              So in other words, at the offset, we figure out how much
>     space we have left in the current translation. Then, if there is no
>     callback to tell us whether the next translation will be contiguous
>     to this one, we fill the size variable with the remaining length of
>     that 2 MiB buffer.
> 
>     Memory.c:682 *size = spdk_min(*size, cur_size):
>              This assignment comes after the while loop guarded by the
>     condition "while (cur_size < *size)". This while loop assumes that
>     we have supplied some desired length for our buffer. This is true in
>     the RDMA case. Now this while loop will only break on two
>     conditions. 1. Cur_size becomes larger than *size, or the
>     are_contiguous function returns false, meaning that the two
>     translations cannot be considered together. In the case of the RDMA
>     memory map, the only time are_contiguous returns false is when the
>     two memory regions correspond to two distinct RDMA MRs. Notice that
>     in this case - the one where are_contiguous is defined and we
>     supplied a size variable - the *size variable is not overwritten
>     with cur_size until 1. cur_size is >= *size or 2. The are_contiguous
>     check fails.
> 
>     In the second case detailed above, you can see how one could  pass
>     in a buffer that spanned a 2 MiB page and still get a translation
>     value equal to the size of the buffer. This second case is the one
>     that the rdma.c code should be using since we have a registered
>     are_contiguous function with the NIC and we have supplied a size
>     pointer filled with the length of our buffer.
> 
>     -----Original Message-----
>     From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>]
>     Sent: Thursday, August 1, 2019 2:01 PM
>     To: Howell, Seth <seth.howell(a)intel.com
>     <mailto:seth.howell(a)intel.com>>; Storage Performance Development Kit
>     <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>     Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple RDMA Memory Regions
> 
>     Hi Seth,
> 
>       > Just because a buffer extends past a 2 MiB boundary doesn't mean
>     that it exists in two different Memory Regions. It also won't fail
>     the translation for being over two memory regions.
> 
>     This makes sense. However, spdk_mem_map_translate() does following
>     to calculate translation_len:
> 
>     cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
>     *size = spdk_min(*size, cur_size); // *size is the translation_len
>     from caller nvmf_rdma_fill_buffers()
> 
>     In nvmf_rdma_fill_buffers(),
> 
>     if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>                              SPDK_ERRLOG("Data buffer split over
>     multiple RDMA Memory Regions\n");
>                              return -EINVAL;
>                      }
> 
>     This just checks if buffer sits on 2 2MB pages, not about 2 RDMA
>     memory regions. Is my understanding correct?
> 
>     I still need some time to test. I will update you the result with -s
>     as well.
> 
>     Thanks,
>     JD
> 
> 
>     On 8/1/19 1:28 PM, Howell, Seth wrote:
>      > Hi JD,
>      >
>      > The 2 MiB check is just because we always do memory registrations
>     at at least 2 MiB granularity (the minimum hugepage size). Just
>     because a buffer extends past a 2 MiB boundary doesn't mean that it
>     exists in two different Memory Regions. It also won't fail the
>     translation for being over two memory regions.
>      >
>      > If you look at the definition of spdk_mem_map_translate we call
>     map->ops->are_contiguous every time we cross a 2 MiB boundary. For
>     RDMA, this function is registered to
>     spdk_nvmf_rdma_check_contiguous_entries. IF this function returns
>     true, then even if the buffer crosses a 2 MiB boundary, the
>     translation will still be valid.
>      > The problem you are running into is not related to the buffer
>     alignment, it is related to the fact that the two pages across which
>     the buffer is split are registered to two different MRs in the NIC.
>     This can only happen if those two pages are allocated independently
>     and trigger two distinct memory event callbacks.
>      >
>      > That is why I am so interested in seeing the results from the
>     noticelog above ibv_reg_mr. It will tell me how your target
>     application is allocating memory. Also, when you start the SPDK
>     target, are you using the -s option? Something like
>     ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know
>     if it'll make a difference, it's more of a curiosity thing for me)?
>      >
>      > Thanks,
>      >
>      > Seth
>      >
>      > -----Original Message-----
>      > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>]
>      > Sent: Thursday, August 1, 2019 11:24 AM
>      > To: Howell, Seth <seth.howell(a)intel.com
>     <mailto:seth.howell(a)intel.com>>; Storage Performance
>      > Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>      > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple
>      > RDMA Memory Regions
>      >
>      > Hi Seth,
>      >
>      > Thanks for the detailed description, now I understand the reason
>     behind the checking. But I have a question, why checking against
>     2MiB? Is it because DPDK uses 2MiB page size by default so that one
>     RDMA memory region should not cross 2 pages?
>      >
>      >   > Once I see what your memory registrations look like and what
>     addresses you're failing on, it will help me understand what is
>     going on better.
>      >
>      > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7
>     +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport
>     *rtransport,
>      >                   remaining_length -=
>      > rdma_req->req.iov[iovcnt].iov_len;
>      >
>      >                   if (translation_len <
>     rdma_req->req.iov[iovcnt].iov_len) {
>      > -                       SPDK_ERRLOG("Data buffer split over multiple
>      > RDMA Memory Regions\n");
>      > +                       SPDK_ERRLOG("Data buffer split over multiple
>      > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
>     rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
>     translation_len, rdma_req->req.iov[iovcnt].iov_len);
>      >                           return -EINVAL;
>      >                   }
>      >
>      > With this I can see which buffer failed the checking.
>      > For example, when SPKD initializes the memory pool, one of the
>     buffers starts with 0x2000193feb00, and when failed, I got following:
>      >
>      > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
>      > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376)
>     (8192)
>      >
>      > This buffer has 5376B on one 2MB page and the rest of it
>      > (8192-5376=2816B) is on another page.
>      >
>      > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to
>     use iov base should make it better as iov base is 4KiB aligned. In
>     above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 and
>     it should pass the checking.
>      > However, another buffer in the pool is 0x2000192010c0 and
>     iov_base is 0x200019201000, which would fail the checking because it
>     is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
>      >
>      > I will add the change from
>      > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the
>     test to get more information.
>      >
>      > I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff
>      > -j
>      > 0x90000000:0x20000000 -c 16disk_1ns.conf"
>      >
>      > Thanks,
>      > JD
>      >
>      >
>      > On 8/1/19 7:52 AM, Howell, Seth wrote:
>      >> Hi JD,
>      >>
>      >> I was doing a little bit of digging in the dpdk documentation
>     around this process, and I have a little bit more information. We
>     were pretty worried about the whole dynamic memory allocations thing
>     a few releases ago, so Jim helped add a flag into DPDK that
>     prevented allocations from being allocated and freed in different
>     granularities. This flag also prevents malloc heap allocations from
>     spanning multiple memory events. However, this flag didn't make it
>     into DPDK until 19.02 (More documentation at
>     https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer
>     if you're interested). We have some code in the SPDK environment
>     layer that tries to deal with that (see
>     lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that
>     function is entirely capable of handling the heap allocations
>     spanning multiple memory events part of the problem.
>      >> Since you are using dpdk 18.11, the memory callback inside of
>     lib/env_dpdk looks like a good candidate for our issue. My best
>     guess is that somehow a heap allocation from the buffer mempool is
>     hitting across addresses from two dynamic memory allocation events.
>     I'd still appreciate it if you could send me the information in my
>     last e-mail, but I think we're onto something here.
>      >>
>      >> Thanks,
>      >>
>      >> Seth
>      >>
>      >> -----Original Message-----
>      >> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>     <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of Howell,
>      >> Seth
>      >> Sent: Thursday, August 1, 2019 5:26 AM
>      >> To: JD Zheng <jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>>; Storage Performance
>      >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>      >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple
>      >> RDMA Memory Regions
>      >>
>      >> Hi JD,
>      >>
>      >> Thanks for doing that. Yeah, I am mainly looking to see how the
>     mempool addresses are mapped into the NIC with ibv_reg_mr.
>      >>
>      >> I think it's odd that we are using the buffer base for the memory
>      >> check, we should be using the iov base, but I don't believe that
>      >> would cause the issue you are seeing. Pushed a change to modify
>     that
>      >> behavior anyways though:
>      >> https://review.gerrithub.io/c/spdk/spdk/+/463893
>      >>
>      >> There was one registration that I wasn't able to catch from your
>     last log. Sorry about that, I forgot there wasn’t a debug log for
>     it. Can you try it again with this change which adds noticelogs for
>     the relevant registrations.
>     https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able
>     to run your test without the -Lrdma argument this time to avoid the
>     extra bloat in the logs.
>      >>
>      >> The underlying assumption of the code is that any given object
>     is not going to cross a dynamic memory allocation from DPDK. For a
>     little background, when the mempool gets created, the dpdk code
>     allocates some number of memzones to accommodate those buffer
>     objects. Then it passes those memzones down one at a time and places
>     objects inside the mempool from the given memzone until the memzone
>     is exhausted. Then it goes back and grabs another memzone. This
>     process continues until all objects are accounted for.
>      >> This only works if each memzone corresponds to a single memory
>     event when using dynamic memory allocation. My understanding was
>     that this was always the case, but this error makes me think that
>     it's possible that that's not true.
>      >>
>      >> Once I see what your memory registrations look like and what
>     addresses you're failing on, it will help me understand what is
>     going on better.
>      >>
>      >> Can you also provide the command line you are using to start the
>     nvmf_tgt application and attach your configuration file?
>      >>
>      >> Thanks,
>      >>
>      >> Seth
>      >> -----Original Message-----
>      >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>]
>      >> Sent: Wednesday, July 31, 2019 3:13 PM
>      >> To: Howell, Seth <seth.howell(a)intel.com
>     <mailto:seth.howell(a)intel.com>>; Storage Performance
>      >> Development Kit <spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>>
>      >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over
>     multiple
>      >> RDMA Memory Regions
>      >>
>      >> Hi Seth,
>      >>
>      >> After I enabled debug and ran nvmf_tgt with -L rdma, I got some
>     logs like:
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x2000084bf000 Length: 40000 LKey: e601
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x200008621000 Length: 10000 LKey: e701
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x200018600000 Length: 1000000 LKey: e801
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x20000847e000 Length: 40000 LKey: e701
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x20000846d000 Length: 10000 LKey: e801
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x200019800000 Length: 1000000 LKey: e901
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x200016ebb000 Length: 40000 LKey: e801
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x20000845c000 Length: 10000 LKey: e901
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x20001aa00000 Length: 1000000 LKey: ea01
>      >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>      >> 0x200016e7a000 Length: 40000 LKey: e901
>      >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>      >> 0x20000844b000 Length: 10000 LKey: ea01
>      >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data
>     Array:
>      >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>      >>
>      >> Is this what you are looking for as memory regions registered for the NIC?
>      >>
>      >> I attached the complete log.
>      >>
>      >> Thanks,
>      >> JD
>      >>
>      >> On 7/30/19 5:28 PM, JD Zheng wrote:
>      >>> Hi Seth,
>      >>>
>      >>> Thanks for the prompt reply!
>      >>>
>      >>> Please find answers inline.
>      >>>
>      >>> JD
>      >>>
>      >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>      >>>> Hi JD,
>      >>>>
>      >>>> Thanks for the report. I want to ask a few questions to start
>      >>>> getting to the bottom of this. Since this issue doesn't currently
>      >>>> reproduce on our per-patch or nightly tests, I would like to
>      >>>> understand what's unique about your setup so that we can
>     replicate
>      >>>> it in a per patch test to prevent future regressions.
>      >>> I am running it on aarch64 platform. I tried x86 platform and I
>     can
>      >>> see same buffer alignment in memory pool but can't run the real
>     test
>      >>> to reproduce it due to other missing pieces.
>      >>>
>      >>>>
>      >>>> What options are you passing when you create the rdma transport?
>      >>>> Are you creating it over RPC or in a configuration file?
>      >>> I am using conf file. Pls let me know if you'd like to look
>     into conf file.
>      >>>
>      >>>>
>      >>>> Are you using the current DPDK submodule as your environment
>      >>>> abstraction layer?
>      >>> No. Our project uses specific version of DPDK, which is v18.11. I
>      >>> did quick test using latest and DPDK submodule on x86, and the
>      >>> buffer alignment is the same, i.e. 64B aligned.
>      >>>
>      >>>>
>      >>>> I notice that your error log is printing from
>      >>>> spdk_nvmf_transport_poll_group_create, which value exactly are
>     you
>      >>>> printing out?
>      >>> Here is patch to add dbg print. Pls note that SPDK version is
>     v19.04
>      >>>
>      >>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>      >>>                                    SPDK_NOTICELOG("Unable to
>     reserve
>      >>> the full number of buffers for the pg buffer cache.\n");
>      >>>                                    break;
>      >>>                            }
>      >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>      >>> group->buf_cache_count, group->buf_cache_size);
>      >>>                            STAILQ_INSERT_HEAD(&group->buf_cache,
>      >>> buf, link);
>      >>>                            group->buf_cache_count++;
>      >>>                    }
>      >>>
>      >>>>
>      >>>> Can you run your target with the -L rdma option to get a dump of
>      >>>> the memory regions registered with the NIC?
>      >>> Let me test and get back to you soon.
>      >>>
>      >>>>
>      >>>> We made a couple of changes to this code when dynamic memory
>      >>>> allocations were added to DPDK. There were some safeguards
>     that we
>      >>>> added to try and make sure this case wouldn't hit, so I'd like to
>      >>>> make sure you are running on the latest DPDK submodule as well as
>      >>>> the latest SPDK to narrow down where we need to look.
>      >>> Unfortunately I can't easily update DPDK because other team
>      >>> maintains it internally. But if it can be repro and fixed in
>     latest,
>      >>> I will try to pull in the fix.
>      >>>
>      >>>>
>      >>>> Thanks,
>      >>>>
>      >>>> Seth
>      >>>>
>      >>>> -----Original Message-----
>      >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org
>     <mailto:spdk-bounces(a)lists.01.org>] On Behalf Of JD Zheng
>      >>>> via SPDK
>      >>>> Sent: Wednesday, July 31, 2019 3:00 AM
>      >>>> To: spdk(a)lists.01.org <mailto:spdk(a)lists.01.org>
>      >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com
>     <mailto:jiandong.zheng(a)broadcom.com>>
>      >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>      >>>> RDMA Memory Regions
>      >>>>
>      >>>> Hello,
>      >>>>
>      >>>> When I run nvmf_tgt over RDMA using the latest SPDK code, I
>      >>>> occasionally ran into this error:
>      >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split
>      >>>> over multiple RDMA Memory Regions"
>      >>>>
>      >>>> After digging into the code, I found that
>     nvmf_rdma_fill_buffers()
>      >>>> calls spdk_mem_map_translate() to check if a data buffer sits on 2
>      >>>> 2MB pages, and if it is the case, it reports this error.
>      >>>>
>      >>>> The following commit added change to use data buffer start
>     address
>      >>>> to calculate the size between buffer start address and 2MB
>     boundary.
>      >>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with
>      >>>> IO Unit size (which is 8KB in my conf) to determine if the buffer
>      >>>> passes 2MB boundary.
>      >>>>
>      >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>      >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com
>     <mailto:dariusz.stojaczyk(a)intel.com>>
>      >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>      >>>>
>      >>>>         memory: fix contiguous memory calculation for unaligned
>      >>>> buffers
>      >>>>
>      >>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool
>     and new
>      >>>> request will use free buffer from that pool and the buffer start
>      >>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
>      >>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB
>      >>>> in my
>      >>>> case) either, instead, they are 64Byte aligned so that some
>     buffers
>      >>>> will fail the checking and leads to this problem.
>      >>>>
>      >>>> The corresponding code snippets are as following:
>      >>>> spdk_nvmf_transport_create()
>      >>>> {
>      >>>> ...
>      >>>>         transport->data_buf_pool =
>      >>>> spdk_mempool_create(spdk_mempool_name,
>      >>>>                                    opts->num_shared_buffers,
>      >>>>                                    opts->io_unit_size +
>      >>>> NVMF_DATA_BUFFER_ALIGNMENT,
>      >>>>                                    SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>      >>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
>      >>>> }
>      >>>>
>      >>>> Also some debug print I added shows the start address of the
>     buffers:
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019258800 0(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x2000192557c0 1(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019252780 2(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x20001924f740 3(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x20001924c700 4(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x2000192496c0 5(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019246680 6(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019243640 7(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x200019240600 8(32)
>      >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>      >>>> 0x20001923d5c0 9(32)
>      >>>> ...
>      >>>>
>      >>>> It looks like either the buffer allocation has alignment issue or
>      >>>> the checking is not correct.
>      >>>>
>      >>>> Please advise how to fix this problem.
>      >>>>
>      >>>> Thanks,
>      >>>> JD Zheng
>      >>>> _______________________________________________
>      >>>> SPDK mailing list
>      >>>> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>      >>>> https://lists.01.org/mailman/listinfo/spdk
>      >>>>
>      >> _______________________________________________
>      >> SPDK mailing list
>      >> SPDK(a)lists.01.org <mailto:SPDK(a)lists.01.org>
>      >> https://lists.01.org/mailman/listinfo/spdk
>      >>
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-19 20:12 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-19 20:12 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 24117 bytes --]

Hi JD,

Thanks for performing that experiment. With this new information, I think we can be pretty much 100% sure that the problem is related to the mempool being split over two DPDK memzones. Unfortunately, the only way to protect fully against this happening is by using the DPDK flag --match-allocations which was introduced in DPDK 19.02. Jim helped advocate for this flag specifically because of this problem with mempools and RPMA.

The good news is that the SPDK target will skip these buffers without bricking, causing data corruption or doing any otherwise bad things. What ends up happening is that the nvmf_rdma_fill_buffers function will print the error message and then return NULL which will trigger the target to retry the I/O again. By that time, there will be another buffer there for the request to use and it won’t fail the second time around. So the code currently handles the problem in a technically correct way i.e. It’s not going to brick the target or initiator by trying to use a buffer that spans multiple Memory Regions. Instead, it properly recognizes that it is trying to use a bad buffer and reschedules the request buffer parsing.

However, I am a little worried over the fact that these buffers remain in the mempool and can be repeatedly used by the application. I can picture a scenario where this could possibly have a  performance impact. Take for example a mempool with 128 entries in it in which one of them is split over a memzone. Since this split buffer will never find its way into a request, it’s possible that this split buffer gets pulled up into requests more often than other buffers and subsequently fails in nvmf_rdma_fill_buffers causing requests to have to be rescheduled to the next time the poller runs. Depending on how frequently this happens, the performance impact *could possibly* add up.

I have as yet been unable to replicate the split buffer error. One thing you could try to see if there is any measurable performance impact is try starting the NVMe-oF target with DPDK legacy memory mode which will move all memory allocations to startup and prevent you from splitting buffers. Then run a benchmark with a lot of connections at high queue depth and see what the performance looks like compared to the dynamic memory model. If there is a significant performance impact, we may have to modify how we handle this error case.

Thanks,

Seth

From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
Sent: Monday, August 12, 2019 4:17 PM
To: Howell, Seth <seth.howell(a)intel.com>
Cc: Storage Performance Development Kit <spdk(a)lists.01.org>; Jonathan Richardson <jonathan.richardson(a)broadcom.com>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

+ Jonathan

Hi Seth,

We finally got chance to test with more logs enabled. You are correct that that problematic buffer does sit on 2 registered memory regions:

The problematic buffer is "Buffer address: 200019bfeb00", actual used buffer pointer is "200019bff000" (SPDK makes it 4KiB aligned), size is 8KiB(0x2000) so it does sit on 2 registered memory regions.

However, it looks like SPDK/DPDK allocates buffers starting from the end of a region and going up, and due to the extra room and alignment of each buffer there is a chance that one buffer can exceed a memory region boundary?

In this case, the buffers are between 0x200019997800 and 0x200019c5320 so that last buffer exceeds one region and goes to next one.

Some logs for your information:

rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start: 200019800000, memory region length: 400000
rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start: 200019c00000, memory region length: 400000
...
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 0x200019bfeb00 27(32)
...
rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer address: 200019bfeb00 iov_base address 200019bff000

Thanks,
JD

On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com<mailto:seth.howell(a)intel.com>> wrote:
There are two different assignments that you need to look at. I'll detail the cases below based on line numbers from the latest master.

Memory.c:656 *size = spdk_min(*size, cur_size):
        This assignment is inside of the conditional "if(size == NULL || map->ops.are_contiguous == NULL)"
        So in other words, at the offset, we figure out how much space we have left in the current translation. Then, if there is no callback to tell us whether the next translation will be contiguous to this one, we fill the size variable with the remaining length of that 2 MiB buffer.

Memory.c:682 *size = spdk_min(*size, cur_size):
        This assignment comes after the while loop guarded by the condition "while (cur_size < *size)". This while loop assumes that we have supplied some desired length for our buffer. This is true in the RDMA case. Now this while loop will only break on two conditions. 1. Cur_size becomes larger than *size, or the are_contiguous function returns false, meaning that the two translations cannot be considered together. In the case of the RDMA memory map, the only time are_contiguous returns false is when the two memory regions correspond to two distinct RDMA MRs. Notice that in this case - the one where are_contiguous is defined and we supplied a size variable - the *size variable is not overwritten with cur_size until 1. cur_size is >= *size or 2. The are_contiguous check fails.

In the second case detailed above, you can see how one could  pass in a buffer that spanned a 2 MiB page and still get a translation value equal to the size of the buffer. This second case is the one that the rdma.c code should be using since we have a registered are_contiguous function with the NIC and we have supplied a size pointer filled with the length of our buffer.
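
A rough sketch of that second case, to make the calling pattern concrete. The helper name is made up, and storing the MR's lkey as the translation value is simplified from what rdma.c actually does:

#include "spdk/env.h"
#include <errno.h>

/* Ask for a translation covering the whole buffer; with an are_contiguous
 * callback registered, *size only shrinks when the buffer really spans two
 * distinct registrations (e.g. two RDMA MRs). */
static int
translate_whole_buffer(const struct spdk_mem_map *map, void *iov_base,
                       size_t iov_len, uint32_t *lkey)
{
    uint64_t translation_len = iov_len;

    *lkey = (uint32_t)spdk_mem_map_translate(map, (uint64_t)(uintptr_t)iov_base,
                                             &translation_len);

    if (translation_len < iov_len) {
        /* Buffer straddles two memory regions: defer and retry the request. */
        return -EINVAL;
    }
    return 0;
}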

-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com<mailto:jiandong.zheng(a)broadcom.com>]
Sent: Thursday, August 1, 2019 2:01 PM
To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

 > Just because a buffer extends past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions. It also won't fail the translation for being over two memory regions.

This makes sense. However, spdk_mem_map_translate() does following to calculate translation_len:

cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
*size = spdk_min(*size, cur_size); // *size is the translation_len from caller nvmf_rdma_fill_buffers()

In nvmf_rdma_fill_buffers(),

if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
                        SPDK_ERRLOG("Data buffer split over multiple RDMA Memory Regions\n");
                        return -EINVAL;
                }

This just checks whether the buffer sits on 2 2MB pages, not whether it spans 2 RDMA memory regions. Is my understanding correct?

I still need some time to test. I will update you the result with -s as well.

Thanks,
JD


On 8/1/19 1:28 PM, Howell, Seth wrote:
> Hi JD,
>
> The 2 MiB check is just because we always do memory registrations at at least 2 MiB granularity (the minimum hugepage size). Just because a buffer extends past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions. It also won't fail the translation for being over two memory regions.
>
> If you look at the definition of spdk_mem_map_translate we call map->ops->are_contiguous every time we cross a 2 MiB boundary. For RDMA, this function is registered to spdk_nvmf_rdma_check_contiguous_entries. IF this function returns true, then even if the buffer crosses a 2 MiB boundary, the translation will still be valid.
> The problem you are running into is not related to the buffer alignment, it is related to the fact that the two pages across which the buffer is split are registered to two different MRs in the NIC. This can only happen if those two pages are allocated independently and trigger two distinct memory event callbacks.
>
> That is why I am so interested in seeing the results from the noticelog above ibv_reg_mr. It will tell me how your target application is allocating memory. Also, when you start the SPDK target, are you using the -s option? Something like ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know if it'll make a difference, it's more of a curiosity thing for me)?
>
> Thanks,
>
> Seth
>
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Thursday, August 1, 2019 11:24 AM
> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance
> Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
> RDMA Memory Regions
>
> Hi Seth,
>
> Thanks for the detailed description, now I understand the reason behind the checking. But I have a question, why checking against 2MiB? Is it because DPDK uses 2MiB page size by default so that one RDMA memory region should not cross 2 pages?
>
>   > Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
>
> I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7 +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport *rtransport,
>                   remaining_length -=
> rdma_req->req.iov[iovcnt].iov_len;
>
>                   if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
> -                       SPDK_ERRLOG("Data buffer split over multiple
> RDMA Memory Regions\n");
> +                       SPDK_ERRLOG("Data buffer split over multiple
> RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n", rdma_req->buffers[iovcnt], iovcnt, length, remaining_length, translation_len, rdma_req->req.iov[iovcnt].iov_len);
>                           return -EINVAL;
>                   }
>
> With this I can see which buffer failed the checking.
> For example, when SPDK initializes the memory pool, one of the buffers starts with 0x2000193feb00, and when it failed, I got the following:
>
> rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
> multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376) (8192)
>
> This buffer has 5376B on one 2MB page and the rest of it
> (8192-5376=2816B) is on another page.
>
> The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to use iov base should make it better as iov base is 4KiB aligned. In the above case, iov_base is 0x2000193feb00 & ~0xfff = 0x2000193fe000 and it should pass the checking.
> However, another buffer in the pool is 0x2000192010c0 and iov_base is 0x200019201000, which would fail the checking because it is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
>
> I will add the change from
> https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the test to get more information.
>
> I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff
> -j
> 0x90000000:0x20000000 -c 16disk_1ns.conf"
>
> Thanks,
> JD
>
>
> On 8/1/19 7:52 AM, Howell, Seth wrote:
>> Hi JD,
>>
>> I was doing a little bit of digging in the dpdk documentation around this process, and I have a little bit more information. We were pretty worried about the whole dynamic memory allocations thing a few releases ago, so Jim helped add a flag into DPDK that prevented allocations from being allocated and freed in different granularities. This flag also prevents malloc heap allocations from spanning multiple memory events. However, this flag didn't make it into DPDK until 19.02 (More documentation at https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer if you're interested). We have some code in the SPDK environment layer that tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that function is entirely capable of handling the heap allocations spanning multiple memory events part of the problem.
>> Since you are using dpdk 18.11, the memory callback inside of lib/env_dpdk looks like a good candidate for our issue. My best guess is that somehow a heap allocation from the buffer mempool is hitting across addresses from two dynamic memory allocation events. I'd still appreciate it if you could send me the information in my last e-mail, but I think we're onto something here.
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell,
>> Seth
>> Sent: Thursday, August 1, 2019 5:26 AM
>> To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance
>> Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>> RDMA Memory Regions
>>
>> Hi JD,
>>
>> Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.
>>
>> I think it's odd that we are using the buffer base for the memory
>> check, we should be using the iov base, but I don't believe that
>> would cause the issue you are seeing. Pushed a change to modify that
>> behavior anyways though:
>> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>
>> There was one registration that I wasn't able to catch from your last log. Sorry about that, I forgot there wasn’t a debug log for it. Can you try it again with this change which adds noticelogs for the relevant registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -Lrdma argument this time to avoid the extra bloat in the logs.
>>
>> The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the dpdk code allocates some number of memzones to accommodate those buffer objects. Then it passes those memzones down one at a time and places objects inside the mempool from the given memzone until the memzone is exhausted. Then it goes back and grabs another memzone. This process continues until all objects are accounted for.
>> This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that it's possible that that's not true.
>>
>> Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
>>
>> Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?
>>
>> Thanks,
>>
>> Seth
>> -----Original Message-----
>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> Sent: Wednesday, July 31, 2019 3:13 PM
>> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance
>> Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>> RDMA Memory Regions
>>
>> Hi Seth,
>>
>> After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x2000084bf000 Length: 40000 LKey: e601
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x200008621000 Length: 10000 LKey: e701
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x200018600000 Length: 1000000 LKey: e801
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x20000847e000 Length: 40000 LKey: e701
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000846d000 Length: 10000 LKey: e801
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x200019800000 Length: 1000000 LKey: e901
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x200016ebb000 Length: 40000 LKey: e801
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000845c000 Length: 10000 LKey: e901
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x20001aa00000 Length: 1000000 LKey: ea01
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x200016e7a000 Length: 40000 LKey: e901
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000844b000 Length: 10000 LKey: ea01
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>
>> Is this what you are looking for as the memory regions registered for the NIC?
>>
>> I attached the complete log.
>>
>> Thanks,
>> JD
>>
>> On 7/30/19 5:28 PM, JD Zheng wrote:
>>> Hi Seth,
>>>
>>> Thanks for the prompt reply!
>>>
>>> Please find answers inline.
>>>
>>> JD
>>>
>>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>>> Hi JD,
>>>>
>>>> Thanks for the report. I want to ask a few questions to start
>>>> getting to the bottom of this. Since this issue doesn't currently
>>>> reproduce on our per-patch or nightly tests, I would like to
>>>> understand what's unique about your setup so that we can replicate
>>>> it in a per patch test to prevent future regressions.
>>> I am running it on aarch64 platform. I tried x86 platform and I can
>>> see same buffer alignment in memory pool but can't run the real test
>>> to reproduce it due to other missing pieces.
>>>
>>>>
>>>> What options are you passing when you create the rdma transport?
>>>> Are you creating it over RPC or in a configuration file?
>>> I am using conf file. Pls let me know if you'd like to look into conf file.
>>>
>>>>
>>>> Are you using the current DPDK submodule as your environment
>>>> abstraction layer?
>>> No. Our project uses specific version of DPDK, which is v18.11. I
>>> did quick test using latest and DPDK submodule on x86, and the
>>> buffer alignment is the same, i.e. 64B aligned.
>>>
>>>>
>>>> I notice that your error log is printing from
>>>> spdk_nvmf_transport_poll_group_create, which value exactly are you
>>>> printing out?
>>> Here is patch to add dbg print. Pls note that SPDK version is v19.04
>>>
>>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>>                                    SPDK_NOTICELOG("Unable to reserve
>>> the full number of buffers for the pg buffer cache.\n");
>>>                                    break;
>>>                            }
>>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>> group->buf_cache_count, group->buf_cache_size);
>>>                            STAILQ_INSERT_HEAD(&group->buf_cache,
>>> buf, link);
>>>                            group->buf_cache_count++;
>>>                    }
>>>
>>>>
>>>> Can you run your target with the -L rdma option to get a dump of
>>>> the memory regions registered with the NIC?
>>> Let me test and get back to you soon.
>>>
>>>>
>>>> We made a couple of changes to this code when dynamic memory
>>>> allocations were added to DPDK. There were some safeguards that we
>>>> added to try and make sure this case wouldn't hit, so I'd like to
>>>> make sure you are running on the latest DPDK submodule as well as
>>>> the latest SPDK to narrow down where we need to look.
>>> Unfortunately I can't easily update DPDK because other team
>>> maintains it internally. But if it can be repro and fixed in latest,
>>> I will try to pull in the fix.
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Seth
>>>>
>>>> -----Original Message-----
>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng
>>>> via SPDK
>>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>>> To: spdk(a)lists.01.org
>>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>>>> RDMA Memory Regions
>>>>
>>>> Hello,
>>>>
>>>> When I run nvmf_tgt over RDMA using latest SPDK code, I
>>>> occasionally ran into this error:
>>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split
>>>> over multiple RDMA Memory Regions"
>>>>
>>>> After digging into the code, I found that nvmf_rdma_fill_buffers()
>>>> calls spdk_mem_map_translate() to check if a data buffer sits on 2
>>>> 2MB pages, and if it is the case, it reports this error.
>>>>
>>>> The following commit added change to use data buffer start address
>>>> to calculate the size between buffer start address and 2MB boundary.
>>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with
>>>> IO Unit size (which is 8KB in my conf) to determine if the buffer
>>>> passes 2MB boundary.
>>>>
>>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>>
>>>>         memory: fix contiguous memory calculation for unaligned
>>>> buffers
>>>>
>>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new
>>>> request will use free buffer from that pool and the buffer start
>>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
>>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB
>>>> in my
>>>> case) either, instead, they are 64Byte aligned so that some buffers
>>>> will fail the checking and lead to this problem.
>>>>
>>>> The corresponding code snippets are as following:
>>>> spdk_nvmf_transport_create()
>>>> {
>>>> ...
>>>>         transport->data_buf_pool =
>>>> spdk_mempool_create(spdk_mempool_name,
>>>>                                    opts->num_shared_buffers,
>>>>                                    opts->io_unit_size +
>>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>>                                    SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
>>>> }
>>>>
>>>> Also some debug print I added shows the start address of the buffers:
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019258800 0(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x2000192557c0 1(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019252780 2(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001924f740 3(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001924c700 4(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x2000192496c0 5(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019246680 6(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019243640 7(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019240600 8(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001923d5c0 9(32)
>>>> ...
>>>>
>>>> It looks like either the buffer allocation has an alignment issue or
>>>> the checking is not correct.
>>>>
>>>> Please advise how to fix this problem.
>>>>
>>>> Thanks,
>>>> JD Zheng
>>>> _______________________________________________
>>>> SPDK mailing list
>>>> SPDK(a)lists.01.org
>>>> https://lists.01.org/mailman/listinfo/spdk
>>>>
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-12 23:17 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-08-12 23:17 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 21726 bytes --]

+ Jonathan

Hi Seth,

We finally got a chance to test with more logs enabled. You are correct
that the problematic buffer does sit on 2 registered memory regions:

The problematic buffer is "Buffer address: 200019bfeb00", the actual used
buffer pointer is "200019bff000" (SPDK makes it 4KiB aligned), and its size
is 8KiB (0x2000), so it does sit on 2 registered memory regions.

However, it looks like SPDK/DPDK allocates buffers starting from the end of
a region and going up, and due to the extra room and alignment of each
buffer there is a chance that one buffer can exceed the memory region
boundary?

In this case, the buffers are between 0x200019997800 and 0x200019c5320, so
the last buffer exceeds one region and goes into the next one.

Some logs for your information:

rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
200019800000, memory region length: 400000
rdma.c:1279:spdk_nvmf_rdma_mem_notify: *NOTICE*: memory region start:
200019c00000, memory region length: 400000
...
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
0x200019bfeb00 27(32)
...
rdma.c:1508:nvmf_rdma_fill_buffers: *NOTICE*: Buffer address:
200019bfeb00 iov_base
address 200019bff000

Thanks,
JD

On Thu, Aug 1, 2019 at 2:22 PM Howell, Seth <seth.howell(a)intel.com> wrote:

> There are two different assignments that you need to look at. I'll detail
> the cases below based on line numbers from the latest master.
>
> Memory.c:656 *size = spdk_min(*size, cur_size):
>         This assignment is inside of the conditional "if(size == NULL ||
> map->ops.are_contiguous == NULL)"
>         So in other words, at the offset, we figure out how much space we
> have left in the current translation. Then, if there is no callback to tell
> us whether the next translation will be contiguous to this one, we fill the
> size variable with the remaining length of that 2 MiB buffer.
>
> Memory.c:682 *size = spdk_min(*size, cur_size):
>         This assignment comes after the while loop guarded by the
> condition "while (cur_size < *size)". This while loop assumes that we have
> supplied some desired length for our buffer. This is true in the RDMA case.
> Now this while loop will only break on two conditions: 1. cur_size becomes
> larger than *size, or 2. the are_contiguous function returns false, meaning
> that the two translations cannot be considered together. In the case of the
> RDMA memory map, the only time are_contiguous returns false is when the two
> memory regions correspond to two distinct RDMA MRs. Notice that in this
> case - the one where are_contiguous is defined and we supplied a size
> variable - the *size variable is not overwritten with cur_size until 1.
> cur_size is >= *size or 2. The are_contiguous check fails.
>
> In the second case detailed above, you can see how one could  pass in a
> buffer that spanned a 2 MiB page and still get a translation value equal to
> the size of the buffer. This second case is the one that the rdma.c code
> should be using since we have a registered are_contiguous function with the
> NIC and we have supplied a size pointer filled with the length of our
> buffer.
>
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Thursday, August 1, 2019 2:01 PM
> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development
> Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA
> Memory Regions
>
> Hi Seth,
>
>  > Just because a buffer extends past a 2 MiB boundary doesn't mean that
> it exists in two different Memory Regions. It also won't fail the
> translation for being over two memory regions.
>
> This makes sense. However, spdk_mem_map_translate() does following to
> calculate translation_len:
>
> cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
> *size = spdk_min(*size, cur_size); // *size is the translation_len from
> caller nvmf_rdma_fill_buffers()
>
> In nvmf_rdma_fill_buffers(),
>
> if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
>                         SPDK_ERRLOG("Data buffer split over multiple RDMA
> Memory Regions\n");
>                         return -EINVAL;
>                 }
>
> This just checks whether the buffer sits on 2 2MB pages, not whether it
> spans 2 RDMA memory regions. Is my understanding correct?
>
> I still need some time to test. I will update you the result with -s as
> well.
>
> Thanks,
> JD
>
>
> On 8/1/19 1:28 PM, Howell, Seth wrote:
> > Hi JD,
> >
> > The 2 MiB check is just because we always do memory registrations at at
> least 2 MiB granularity (the minimum hugepage size). Just because a buffer
> extends past a 2 MiB boundary doesn't mean that it exists in two different
> Memory Regions. It also won't fail the translation for being over two
> memory regions.
> >
> > If you look at the definition of spdk_mem_map_translate we call
> map->ops->are_contiguous every time we cross a 2 MiB boundary. For RDMA,
> this function is registered to spdk_nvmf_rdma_check_contiguous_entries. IF
> this function returns true, then even if the buffer crosses a 2 MiB
> boundary, the translation will still be valid.
> > The problem you are running into is not related to the buffer alignment,
> it is related to the fact that the two pages across which the buffer is
> split are registered to two different MRs in the NIC. This can only happen
> if those two pages are allocated independently and trigger two distinct
> memory event callbacks.
> >
> > That is why I am so interested in seeing the results from the noticelog
> above ibv_reg_mr. It will tell me how your target application is allocating
> memory. Also, when you start the SPDK target, are you using the -s option?
> Something like ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I
> don't know if it'll make a difference, it's more of a curiosity thing for
> me)?
> >
> > Thanks,
> >
> > Seth
> >
> > -----Original Message-----
> > From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> > Sent: Thursday, August 1, 2019 11:24 AM
> > To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance
> > Development Kit <spdk(a)lists.01.org>
> > Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
> > RDMA Memory Regions
> >
> > Hi Seth,
> >
> > Thanks for the detailed description, now I understand the reason behind
> the checking. But I have a question, why checking against 2MiB? Is it
> because DPDK uses 2MiB page size by default so that one RDMA memory region
> should not cross 2 pages?
> >
> >   > Once I see what your memory registrations look like and what
> addresses you're failing on, it will help me understand what is going on
> better.
> >
> > I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7 +1503,11 @@
> nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport *rtransport,
> >                   remaining_length -=
> > rdma_req->req.iov[iovcnt].iov_len;
> >
> >                   if (translation_len <
> rdma_req->req.iov[iovcnt].iov_len) {
> > -                       SPDK_ERRLOG("Data buffer split over multiple
> > RDMA Memory Regions\n");
> > +                       SPDK_ERRLOG("Data buffer split over multiple
> > RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n",
> rdma_req->buffers[iovcnt], iovcnt, length, remaining_length,
> translation_len, rdma_req->req.iov[iovcnt].iov_len);
> >                           return -EINVAL;
> >                   }
> >
> > With this I can see which buffer failed the checking.
> > For example, when SPDK initializes the memory pool, one of the buffers
> starts with 0x2000193feb00, and when it failed, I got the following:
> >
> > rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
> > multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376) (8192)
> >
> > This buffer has 5376B on one 2MB page and the rest of it
> > (8192-5376=2816B) is on another page.
> >
> > The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to use iov
> base should make it better as iov base is 4KiB aligned. In the above case,
> iov_base is 0x2000193feb00 & ~0xfff = 0x2000193fe000 and it should pass the
> checking.
> > However, another buffer in the pool is 0x2000192010c0 and iov_base is
> 0x200019201000, which would fail the checking because it is only 4KiB to
> 2MB boundary and IOUnitSize is 8KiB.
> >
> > I will add the change from
> > https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the test to
> get more information.
> >
> > I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff
> > -j
> > 0x90000000:0x20000000 -c 16disk_1ns.conf"
> >
> > Thanks,
> > JD
> >
> >
> > On 8/1/19 7:52 AM, Howell, Seth wrote:
> >> Hi JD,
> >>
> >> I was doing a little bit of digging in the dpdk documentation around
> this process, and I have a little bit more information. We were pretty
> worried about the whole dynamic memory allocations thing a few releases
> ago, so Jim helped add a flag into DPDK that prevented allocations from
> being allocated and freed in different granularities. This flag also
> prevents malloc heap allocations from spanning multiple memory events.
> However, this flag didn't make it into DPDK until 19.02 (More documentation
> at
> https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer
> if you're interested). We have some code in the SPDK environment layer that
> tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb) but I
> don't know that that function is entirely capable of handling the heap
> allocations spanning multiple memory events part of the problem.
> >> Since you are using dpdk 18.11, the memory callback inside of
> lib/env_dpdk looks like a good candidate for our issue. My best guess is
> that somehow a heap allocation from the buffer mempool is hitting across
> addresses from two dynamic memory allocation events. I'd still appreciate
> it if you could send me the information in my last e-mail, but I think
> we're onto something here.
> >>
> >> Thanks,
> >>
> >> Seth
> >>
> >> -----Original Message-----
> >> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell,
> >> Seth
> >> Sent: Thursday, August 1, 2019 5:26 AM
> >> To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance
> >> Development Kit <spdk(a)lists.01.org>
> >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
> >> RDMA Memory Regions
> >>
> >> Hi JD,
> >>
> >> Thanks for doing that. Yeah, I am mainly looking to see how the mempool
> addresses are mapped into the NIC with ibv_reg_mr.
> >>
> >> I think it's odd that we are using the buffer base for the memory
> >> check, we should be using the iov base, but I don't believe that
> >> would cause the issue you are seeing. Pushed a change to modify that
> >> behavior anyways though:
> >> https://review.gerrithub.io/c/spdk/spdk/+/463893
> >>
> >> There was one registration that I wasn't able to catch from your last
> log. Sorry about that, I forgot there wasn’t a debug log for it. Can you
> try it again with this change which adds noticelogs for the relevant
> registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You
> should be able to run your test without the -Lrdma argument this time to
> avoid the extra bloat in the logs.
> >>
> >> The underlying assumption of the code is that any given object is not
> going to cross a dynamic memory allocation from DPDK. For a little
> background, when the mempool gets created, the dpdk code allocates some
> number of memzones to accommodate those buffer objects. Then it passes
> those memzones down one at a time and places objects inside the mempool
> from the given memzone until the memzone is exhausted. Then it goes back
> and grabs another memzone. This process continues until all objects are
> accounted for.
> >> This only works if each memzone corresponds to a single memory event
> when using dynamic memory allocation. My understanding was that this was
> always the case, but this error makes me think that it's possible that
> that's not true.
> >>
> >> Once I see what your memory registrations look like and what addresses
> you're failing on, it will help me understand what is going on better.
> >>
> >> Can you also provide the command line you are using to start the
> nvmf_tgt application and attach your configuration file?
> >>
> >> Thanks,
> >>
> >> Seth
> >> -----Original Message-----
> >> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> >> Sent: Wednesday, July 31, 2019 3:13 PM
> >> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance
> >> Development Kit <spdk(a)lists.01.org>
> >> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
> >> RDMA Memory Regions
> >>
> >> Hi Seth,
> >>
> >> After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs
> like:
> >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> >> 0x2000084bf000 Length: 40000 LKey: e601
> >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> >> 0x200008621000 Length: 10000 LKey: e701
> >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> >> 0x200018600000 Length: 1000000 LKey: e801
> >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> >> 0x20000847e000 Length: 40000 LKey: e701
> >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> >> 0x20000846d000 Length: 10000 LKey: e801
> >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> >> 0x200019800000 Length: 1000000 LKey: e901
> >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> >> 0x200016ebb000 Length: 40000 LKey: e801
> >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> >> 0x20000845c000 Length: 10000 LKey: e901
> >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> >> 0x20001aa00000 Length: 1000000 LKey: ea01
> >> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> >> 0x200016e7a000 Length: 40000 LKey: e901
> >> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> >> 0x20000844b000 Length: 10000 LKey: ea01
> >> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> >> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
> >>
> >> Is this what you are looking for as the memory regions registered for the NIC?
> >>
> >> I attached the complete log.
> >>
> >> Thanks,
> >> JD
> >>
> >> On 7/30/19 5:28 PM, JD Zheng wrote:
> >>> Hi Seth,
> >>>
> >>> Thanks for the prompt reply!
> >>>
> >>> Please find answers inline.
> >>>
> >>> JD
> >>>
> >>> On 7/30/19 5:01 PM, Howell, Seth wrote:
> >>>> Hi JD,
> >>>>
> >>>> Thanks for the report. I want to ask a few questions to start
> >>>> getting to the bottom of this. Since this issue doesn't currently
> >>>> reproduce on our per-patch or nightly tests, I would like to
> >>>> understand what's unique about your setup so that we can replicate
> >>>> it in a per patch test to prevent future regressions.
> >>> I am running it on aarch64 platform. I tried x86 platform and I can
> >>> see same buffer alignment in memory pool but can't run the real test
> >>> to reproduce it due to other missing pieces.
> >>>
> >>>>
> >>>> What options are you passing when you create the rdma transport?
> >>>> Are you creating it over RPC or in a configuration file?
> >>> I am using conf file. Pls let me know if you'd like to look into conf
> file.
> >>>
> >>>>
> >>>> Are you using the current DPDK submodule as your environment
> >>>> abstraction layer?
> >>> No. Our project uses specific version of DPDK, which is v18.11. I
> >>> did quick test using latest and DPDK submodule on x86, and the
> >>> buffer alignment is the same, i.e. 64B aligned.
> >>>
> >>>>
> >>>> I notice that your error log is printing from
> >>>> spdk_nvmf_transport_poll_group_create, which value exactly are you
> >>>> printing out?
> >>> Here is patch to add dbg print. Pls note that SPDK version is v19.04
> >>>
> >>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
> >>>                                    SPDK_NOTICELOG("Unable to reserve
> >>> the full number of buffers for the pg buffer cache.\n");
> >>>                                    break;
> >>>                            }
> >>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
> >>> group->buf_cache_count, group->buf_cache_size);
> >>>                            STAILQ_INSERT_HEAD(&group->buf_cache,
> >>> buf, link);
> >>>                            group->buf_cache_count++;
> >>>                    }
> >>>
> >>>>
> >>>> Can you run your target with the -L rdma option to get a dump of
> >>>> the memory regions registered with the NIC?
> >>> Let me test and get back to you soon.
> >>>
> >>>>
> >>>> We made a couple of changes to this code when dynamic memory
> >>>> allocations were added to DPDK. There were some safeguards that we
> >>>> added to try and make sure this case wouldn't hit, so I'd like to
> >>>> make sure you are running on the latest DPDK submodule as well as
> >>>> the latest SPDK to narrow down where we need to look.
> >>> Unfortunately I can't easily update DPDK because other team
> >>> maintains it internally. But if it can be repro and fixed in latest,
> >>> I will try to pull in the fix.
> >>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Seth
> >>>>
> >>>> -----Original Message-----
> >>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng
> >>>> via SPDK
> >>>> Sent: Wednesday, July 31, 2019 3:00 AM
> >>>> To: spdk(a)lists.01.org
> >>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
> >>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
> >>>> RDMA Memory Regions
> >>>>
> >>>> Hello,
> >>>>
> >>>> When I run nvmf_tgt over RDMA using latest SPDK code, I
> >>>> occasionally ran into this error:
> >>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split
> >>>> over multiple RDMA Memory Regions"
> >>>>
> >>>> After digging into the code, I found that nvmf_rdma_fill_buffers()
> >>>> calls spdk_mem_map_translate() to check if a data buffer sits on 2
> >>>> 2MB pages, and if it is the case, it reports this error.
> >>>>
> >>>> The following commit added change to use data buffer start address
> >>>> to calculate the size between buffer start address and 2MB boundary.
> >>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with
> >>>> IO Unit size (which is 8KB in my conf) to determine if the buffer
> >>>> passes 2MB boundary.
> >>>>
> >>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
> >>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
> >>>> Date:   Tue Nov 13 17:43:46 2018 +0100
> >>>>
> >>>>         memory: fix contiguous memory calculation for unaligned
> >>>> buffers
> >>>>
> >>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new
> >>>> request will use free buffer from that pool and the buffer start
> >>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
> >>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB
> >>>> in my
> >>>> case) either, instead, they are 64Byte aligned so that some buffers
> >>>> will fail the checking and lead to this problem.
> >>>>
> >>>> The corresponding code snippets are as following:
> >>>> spdk_nvmf_transport_create()
> >>>> {
> >>>> ...
> >>>>         transport->data_buf_pool =
> >>>> spdk_mempool_create(spdk_mempool_name,
> >>>>                                    opts->num_shared_buffers,
> >>>>                                    opts->io_unit_size +
> >>>> NVMF_DATA_BUFFER_ALIGNMENT,
> >>>>                                    SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
> >>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
> >>>> }
> >>>>
> >>>> Also some debug print I added shows the start address of the buffers:
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x200019258800 0(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x2000192557c0 1(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x200019252780 2(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x20001924f740 3(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x20001924c700 4(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x2000192496c0 5(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x200019246680 6(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x200019243640 7(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x200019240600 8(32)
> >>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> >>>> 0x20001923d5c0 9(32)
> >>>> ...
> >>>>
> >>>> It looks like either the buffer allocation has an alignment issue or
> >>>> the checking is not correct.
> >>>>
> >>>> Please advise how to fix this problem.
> >>>>
> >>>> Thanks,
> >>>> JD Zheng
> >>>> _______________________________________________
> >>>> SPDK mailing list
> >>>> SPDK(a)lists.01.org
> >>>> https://lists.01.org/mailman/listinfo/spdk
> >>>>
> >> _______________________________________________
> >> SPDK mailing list
> >> SPDK(a)lists.01.org
> >> https://lists.01.org/mailman/listinfo/spdk
> >>
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-01 21:22 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-01 21:22 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 19668 bytes --]

There are two different assignments that you need to look at. I'll detail the cases below based on line numbers from the latest master.

Memory.c:656 *size = spdk_min(*size, cur_size):
	This assignment is inside of the conditional "if(size == NULL || map->ops.are_contiguous == NULL)"
	So in other words, at the offset, we figure out how much space we have left in the current translation. Then, if there is no callback to tell us whether the next translation will be contiguous to this one, we fill the size variable with the remaining length of that 2 MiB buffer.

Memory.c:682 *size = spdk_min(*size, cur_size):
	This assignment comes after the while loop guarded by the condition "while (cur_size < *size)". This while loop assumes that we have supplied some desired length for our buffer. This is true in the RDMA case. Now this while loop will only break on two conditions: 1. cur_size becomes larger than *size, or 2. the are_contiguous function returns false, meaning that the two translations cannot be considered together. In the case of the RDMA memory map, the only time are_contiguous returns false is when the two memory regions correspond to two distinct RDMA MRs. Notice that in this case - the one where are_contiguous is defined and we supplied a size variable - the *size variable is not overwritten with cur_size until 1. cur_size is >= *size or 2. The are_contiguous check fails.

In the second case detailed above, you can see how one could  pass in a buffer that spanned a 2 MiB page and still get a translation value equal to the size of the buffer. This second case is the one that the rdma.c code should be using since we have a registered are_contiguous function with the NIC and we have supplied a size pointer filled with the length of our buffer.
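
To make those two cases a little more concrete, here is a simplified, self-contained paraphrase of that logic (illustration only, not the actual memory.c source; the 4 MiB "registration" size, the lookup helper and the example addresses are all invented for the sketch):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_2MB 0x200000ULL
#define SKETCH_MR  0x400000ULL	/* pretend every registration covers 4 MiB */

/* Stand-ins for the real mem map internals: the "translation" of a page is
 * just the id of the 4 MiB registration it lives in. */
static uint64_t lookup_translation(uint64_t vaddr) { return vaddr / SKETCH_MR; }
static int are_contiguous(uint64_t a, uint64_t b) { return a == b; }

static void
translate_sketch(uint64_t vaddr, uint64_t *size)
{
	uint64_t cur_size = SKETCH_2MB - (vaddr & (SKETCH_2MB - 1));
	uint64_t prev = lookup_translation(vaddr);

	/* Case 2 above: a size and an are_contiguous callback are supplied,
	 * so keep extending across 2 MiB pages while the neighbouring
	 * translations resolve to the same registration.  Case 1 would just
	 * clip *size to cur_size immediately. */
	while (cur_size < *size) {
		uint64_t next = lookup_translation(vaddr + cur_size);

		if (!are_contiguous(prev, next)) {
			break;	/* distinct registrations: stop here */
		}
		cur_size += SKETCH_2MB;
		prev = next;
	}
	if (cur_size < *size) {
		*size = cur_size;
	}
}

int main(void)
{
	uint64_t len_ok = 0x2000, len_split = 0x2000;

	translate_sketch(0x2000199ff000ULL, &len_ok);	 /* crosses a 2 MiB page inside one registration */
	translate_sketch(0x200019bff000ULL, &len_split); /* crosses into the next registration */
	printf("same registration: %" PRIu64 " bytes, split: %" PRIu64 " bytes\n",
	       len_ok, len_split);
	return 0;
}

With these example values, the first call keeps the full 8KiB translation even though the buffer crosses a 2 MiB boundary, while the second call is clipped to 4KiB at the registration boundary, which is exactly the case the error message flags.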

-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com] 
Sent: Thursday, August 1, 2019 2:01 PM
To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

 > Just because a buffer extends past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions. It also won't fail the translation for being over two memory regions.

This makes sense. However, spdk_mem_map_translate() does following to calculate translation_len:

cur_size = VALUE_2MB - _2MB_OFFSET(vaddr); ...
*size = spdk_min(*size, cur_size); // *size is the translation_len from caller nvmf_rdma_fill_buffers()

In nvmf_rdma_fill_buffers(),

if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
			SPDK_ERRLOG("Data buffer split over multiple RDMA Memory Regions\n");
			return -EINVAL;
		}

This just checks if the buffer sits on 2 2MB pages, not whether it spans 2 RDMA memory regions. Is my understanding correct?

I still need some time to test. I will update you the result with -s as well.

Thanks,
JD


On 8/1/19 1:28 PM, Howell, Seth wrote:
> Hi JD,
> 
> The 2 MiB check is just because we always do memory registrations at at least 2 MiB granularity (the minimum hugepage size). Just because a buffer extends past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions. It also won't fail the translation for being over two memory regions.
> 
> If you look at the definition of spdk_mem_map_translate we call map->ops->are_contiguous every time we cross a 2 MiB boundary. For RDMA, this function is registered to spdk_nvmf_rdma_check_contiguous_entries. IF this function returns true, then even if the buffer crosses a 2 MiB boundary, the translation will still be valid.
> The problem you are running into is not related to the buffer alignment, it is related to the fact that the two pages across which the buffer is split are registered to two different MRs in the NIC. This can only happen if those two pages are allocated independently and trigger two distinct memory event callbacks.
> 
> That is why I am so interested in seeing the results from the noticelog above ibv_reg_mr. It will tell me how your target application is allocating memory. Also, when you start the SPDK target, are you using the -s option? Something like ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know if it'll make a difference, it's more of a curiosity thing for me)?
> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Thursday, August 1, 2019 11:24 AM
> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance 
> Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
> RDMA Memory Regions
> 
> Hi Seth,
> 
> Thanks for the detailed description, now I understand the reason behind the checking. But I have a question, why checking against 2MiB? Is it because DPDK uses 2MiB page size by default so that one RDMA memory region should not cross 2 pages?
> 
>   > Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
> 
> I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7 +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport *rtransport,
>                   remaining_length -= 
> rdma_req->req.iov[iovcnt].iov_len;
> 
>                   if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
> -                       SPDK_ERRLOG("Data buffer split over multiple
> RDMA Memory Regions\n");
> +                       SPDK_ERRLOG("Data buffer split over multiple
> RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n", rdma_req->buffers[iovcnt], iovcnt, length, remaining_length, translation_len, rdma_req->req.iov[iovcnt].iov_len);
>                           return -EINVAL;
>                   }
> 
> With this I can see which buffer failed the checking.
> For example, when SPDK initializes the memory pool, one of the buffers starts with 0x2000193feb00, and when it failed, I got the following:
> 
> rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over 
> multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376) (8192)
> 
> This buffer has 5376B on one 2MB page and the rest of it
> (8192-5376=2816B) is on another page.
> 
> The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to use iov base should make it better as iov base is 4KiB aligned. In the above case, iov_base is 0x2000193feb00 & ~0xfff = 0x2000193fe000 and it should pass the checking.
> However, another buffer in the pool is 0x2000192010c0 and iov_base is 0x200019201000, which would fail the checking because it is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
> 
> I will add the change from
> https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the test to get more information.
> 
> I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff 
> -j
> 0x90000000:0x20000000 -c 16disk_1ns.conf"
> 
> Thanks,
> JD
> 
> 
> On 8/1/19 7:52 AM, Howell, Seth wrote:
>> Hi JD,
>>
>> I was doing a little bit of digging in the dpdk documentation around this process, and I have a little bit more information. We were pretty worried about the whole dynamic memory allocations thing a few releases ago, so Jim helped add a flag into DPDK that prevented allocations from being allocated and freed in different granularities. This flag also prevents malloc heap allocations from spanning multiple memory events. However, this flag didn't make it into DPDK until 19.02 (More documentation at https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer if you're interested). We have some code in the SPDK environment layer that tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that function is entirely capable of handling the heap allocations spanning multiple memory events part of the problem.
>> Since you are using dpdk 18.11, the memory callback inside of lib/env_dpdk looks like a good candidate for our issue. My best guess is that somehow a heap allocation from the buffer mempool is hitting across addresses from two dynamic memory allocation events. I'd still appreciate it if you could send me the information in my last e-mail, but I think we're onto something here.
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell, 
>> Seth
>> Sent: Thursday, August 1, 2019 5:26 AM
>> To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance 
>> Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>> RDMA Memory Regions
>>
>> Hi JD,
>>
>> Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.
>>
>> I think it's odd that we are using the buffer base for the memory 
>> check, we should be using the iov base, but I don't believe that 
>> would cause the issue you are seeing. Pushed a change to modify that 
>> behavior anyways though:
>> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>
>> There was one registration that I wasn't able to catch from your last log. Sorry about that, I forgot there wasn’t a debug log for it. Can you try it again with this change which adds noticelogs for the relevant registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -Lrdma argument this time to avoid the extra bloat in the logs.
>>
>> The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the dpdk code allocates some number of memzones to accommodate those buffer objects. Then it passes those memzones down one at a time and places objects inside the mempool from the given memzone until the memzone is exhausted. Then it goes back and grabs another memzone. This process continues until all objects are accounted for.
>> This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that it's possible that that's not true.
>>
>> Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
>>
>> Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?
>>
>> Thanks,
>>
>> Seth
>> -----Original Message-----
>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> Sent: Wednesday, July 31, 2019 3:13 PM
>> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance 
>> Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>> RDMA Memory Regions
>>
>> Hi Seth,
>>
>> After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x2000084bf000 Length: 40000 LKey: e601
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x200008621000 Length: 10000 LKey: e701
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x200018600000 Length: 1000000 LKey: e801
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x20000847e000 Length: 40000 LKey: e701
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000846d000 Length: 10000 LKey: e801
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x200019800000 Length: 1000000 LKey: e901
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x200016ebb000 Length: 40000 LKey: e801
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000845c000 Length: 10000 LKey: e901
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x20001aa00000 Length: 1000000 LKey: ea01
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x200016e7a000 Length: 40000 LKey: e901
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000844b000 Length: 10000 LKey: ea01
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>
>> Is this what you are looking for as the memory regions registered for the NIC?
>>
>> I attached the complete log.
>>
>> Thanks,
>> JD
>>
>> On 7/30/19 5:28 PM, JD Zheng wrote:
>>> Hi Seth,
>>>
>>> Thanks for the prompt reply!
>>>
>>> Please find answers inline.
>>>
>>> JD
>>>
>>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>>> Hi JD,
>>>>
>>>> Thanks for the report. I want to ask a few questions to start 
>>>> getting to the bottom of this. Since this issue doesn't currently 
>>>> reproduce on our per-patch or nightly tests, I would like to 
>>>> understand what's unique about your setup so that we can replicate 
>>>> it in a per patch test to prevent future regressions.
>>> I am running it on aarch64 platform. I tried x86 platform and I can 
>>> see same buffer alignment in memory pool but can't run the real test 
>>> to reproduce it due to other missing pieces.
>>>
>>>>
>>>> What options are you passing when you create the rdma transport? 
>>>> Are you creating it over RPC or in a configuration file?
>>> I am using conf file. Pls let me know if you'd like to look into conf file.
>>>
>>>>
>>>> Are you using the current DPDK submodule as your environment 
>>>> abstraction layer?
>>> No. Our project uses specific version of DPDK, which is v18.11. I 
>>> did quick test using latest and DPDK submodule on x86, and the 
>>> buffer alignment is the same, i.e. 64B aligned.
>>>
>>>>
>>>> I notice that your error log is printing from 
>>>> spdk_nvmf_transport_poll_group_create, which value exactly are you 
>>>> printing out?
>>> Here is patch to add dbg print. Pls note that SPDK version is v19.04
>>>
>>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>>                                    SPDK_NOTICELOG("Unable to reserve 
>>> the full number of buffers for the pg buffer cache.\n");
>>>                                    break;
>>>                            }
>>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>> group->buf_cache_count, group->buf_cache_size);
>>>                            STAILQ_INSERT_HEAD(&group->buf_cache, 
>>> buf, link);
>>>                            group->buf_cache_count++;
>>>                    }
>>>
>>>>
>>>> Can you run your target with the -L rdma option to get a dump of 
>>>> the memory regions registered with the NIC?
>>> Let me test and get back to you soon.
>>>
>>>>
>>>> We made a couple of changes to this code when dynamic memory 
>>>> allocations were added to DPDK. There were some safeguards that we 
>>>> added to try and make sure this case wouldn't hit, so I'd like to 
>>>> make sure you are running on the latest DPDK submodule as well as 
>>>> the latest SPDK to narrow down where we need to look.
>>> Unfortunately I can't easily update DPDK because other team 
>>> maintains it internally. But if it can be repro and fixed in latest, 
>>> I will try to pull in the fix.
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Seth
>>>>
>>>> -----Original Message-----
>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng 
>>>> via SPDK
>>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>>> To: spdk(a)lists.01.org
>>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>>>> RDMA Memory Regions
>>>>
>>>> Hello,
>>>>
>>>> When I run nvmf_tgt over RDMA using latest SPDK code, I 
>>>> occasionally ran into this error:
>>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split 
>>>> over multiple RDMA Memory Regions"
>>>>
>>>> After digging into the code, I found that nvmf_rdma_fill_buffers() 
>>>> calls spdk_mem_map_translate() to check if a data buffer sits on 2
>>>> 2MB pages, and if it is the case, it reports this error.
>>>>
>>>> The following commit added change to use data buffer start address 
>>>> to calculate the size between buffer start address and 2MB boundary.
>>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with 
>>>> IO Unit size (which is 8KB in my conf) to determine if the buffer 
>>>> passes 2MB boundary.
>>>>
>>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>>
>>>>         memory: fix contiguous memory calculation for unaligned 
>>>> buffers
>>>>
>>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new 
>>>> request will use free buffer from that pool and the buffer start 
>>>> address is passed to nvmf_rdma_fill_buffers(). But I found that 
>>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB 
>>>> in my
>>>> case) either, instead, they are 64Byte aligned so that some buffers 
>>>> will fail the checking and lead to this problem.
>>>>
>>>> The corresponding code snippets are as following:
>>>> spdk_nvmf_transport_create()
>>>> {
>>>> ...
>>>>         transport->data_buf_pool =
>>>> spdk_mempool_create(spdk_mempool_name,
>>>>                                    opts->num_shared_buffers,
>>>>                                    opts->io_unit_size + 
>>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>>                                    SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
>>>> }
>>>>
>>>> Also some debug print I added shows the start address of the buffers:
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019258800 0(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x2000192557c0 1(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019252780 2(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001924f740 3(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001924c700 4(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x2000192496c0 5(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019246680 6(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019243640 7(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019240600 8(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001923d5c0 9(32)
>>>> ...
>>>>
>>>> It looks like either the buffer allocation has alignment issue or 
>>>> the checking is not correct.
>>>>
>>>> Please advice how to fix this problem.
>>>>
>>>> Thanks,
>>>> JD Zheng
>>>> _______________________________________________
>>>> SPDK mailing list
>>>> SPDK(a)lists.01.org
>>>> https://lists.01.org/mailman/listinfo/spdk
>>>>
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-01 21:00 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-08-01 21:00 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 17509 bytes --]

Hi Seth,

 > Just because a buffer extends past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions. It also won't fail the translation for being over two memory regions.

This makes sense. However, spdk_mem_map_translate() does the following to
calculate translation_len:

cur_size = VALUE_2MB - _2MB_OFFSET(vaddr);
...
*size = spdk_min(*size, cur_size); // *size is the translation_len from the caller, nvmf_rdma_fill_buffers()

In nvmf_rdma_fill_buffers(),

if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
        SPDK_ERRLOG("Data buffer split over multiple RDMA Memory Regions\n");
        return -EINVAL;
}

This just checks whether the buffer sits across two 2MB pages, not whether 
it sits across two RDMA memory regions. Is my understanding correct?
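
To make sure I am reading it right, here is a small standalone sketch of that math (my own throwaway code, not the SPDK implementation, and it ignores the are_contiguous extension you describe below). It uses the buffer address and the 8KB IOUnitSize I reported earlier in this thread:

#include <stdint.h>
#include <stdio.h>

#define VALUE_2MB      (1ULL << 21)
#define _2MB_OFFSET(v) ((uint64_t)(v) & (VALUE_2MB - 1))

int main(void)
{
	uint64_t vaddr = 0x2000193feb00ULL;  /* buffer start reported earlier in this thread */
	uint64_t io_unit_size = 8192;        /* IOUnitSize in my conf */

	/* What spdk_mem_map_translate() computes before clamping *size. */
	uint64_t cur_size = VALUE_2MB - _2MB_OFFSET(vaddr);  /* 5376 for this address */

	/* Mirrors the translation_len < iov_len check in nvmf_rdma_fill_buffers(). */
	if (cur_size < io_unit_size) {
		printf("split: %llu bytes before the 2MB boundary, %llu after\n",
		       (unsigned long long)cur_size,
		       (unsigned long long)(io_unit_size - cur_size));
	}
	return 0;
}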

I still need some time to test. I will update you with the results, 
including a run with the -s option.

Thanks,
JD


On 8/1/19 1:28 PM, Howell, Seth wrote:
> Hi JD,
> 
> The 2 MiB check is just because we always do memory registrations at at least 2 MiB granularity (the minimum hugepage size). Just because a buffer extends past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions. It also won't fail the translation for being over two memory regions.
> 
> If you look at the definition of spdk_mem_map_translate we call map->ops->are_contiguous every time we cross a 2 MiB boundary. For RDMA, this function is registered to spdk_nvmf_rdma_check_contiguous_entries. IF this function returns true, then even if the buffer crosses a 2 MiB boundary, the translation will still be valid.
> The problem you are running into is not related to the buffer alignment, it is related to the fact that the two pages across which the buffer is split are registered to two different MRs in the NIC. This can only happen if those two pages are allocated independently and trigger two distinct memory event callbacks.
> 
> That is why I am so interested in seeing the results from the noticelog above ibv_reg_mr. It will tell me how your target application is allocating memory. Also, when you start the SPDK target, are you using the -s option? Something like ./app/nvmf_tgt/nvmf_tgt -s 512 or something like that (I don't know if it'll make a difference, it's more of a curiosity thing for me)?
> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Thursday, August 1, 2019 11:24 AM
> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
> 
> Hi Seth,
> 
> Thanks for the detailed description, now I understand the reason behind the checking. But I have a question, why checking against 2MiB? Is it because DPDK uses 2MiB page size by default so that one RDMA memory region should not cross 2 pages?
> 
>   > Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
> 
> I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7 +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport *rtransport,
>                   remaining_length -= rdma_req->req.iov[iovcnt].iov_len;
> 
>                   if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
> -                       SPDK_ERRLOG("Data buffer split over multiple
> RDMA Memory Regions\n");
> +                       SPDK_ERRLOG("Data buffer split over multiple
> RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n", rdma_req->buffers[iovcnt], iovcnt, length, remaining_length, translation_len, rdma_req->req.iov[iovcnt].iov_len);
>                           return -EINVAL;
>                   }
> 
> With this I can see which buffer failed the checking.
> For example, when SPKD initializes the memory pool, one of the buffers starts with 0x2000193feb00, and when failed, I got following:
> 
> rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376) (8192)
> 
> This buffer has 5376B on one 2MB page and the rest of it
> (8192-5376=2816B) is on another page.
> 
> The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to use iov base should make it better as iov base is 4KiB aligned. In above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 and it should pass the checking.
> However, another buffer in the pool is 0x2000192010c0 and iov_base is 0x200019201000, which would fail the checking because it is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.
> 
> I will add the change from
> https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the test to get more information.
> 
> I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff -j
> 0x90000000:0x20000000 -c 16disk_1ns.conf"
> 
> Thanks,
> JD
> 
> 
> On 8/1/19 7:52 AM, Howell, Seth wrote:
>> Hi JD,
>>
>> I was doing a little bit of digging in the dpdk documentation around this process, and I have a little bit more information. We were pretty worried about the whole dynamic memory allocations thing a few releases ago, so Jim helped add a flag into DPDK that prevented allocations from being allocated and freed in different granularities. This flag also prevents malloc heap allocations from spanning multiple memory events. However, this flag didn't make it into DPDK until 19.02 (More documentation at https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer if you're interested). We have some code in the SPDK environment layer that tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that function is entirely capable of handling the heap allocations spanning multiple memory events part of the problem.
>> Since you are using dpdk 18.11, the memory callback inside of lib/env_dpdk looks like a good candidate for our issue. My best guess is that somehow a heap allocation from the buffer mempool is hitting across addresses from two dynamic memory allocation events. I'd still appreciate it if you could send me the information in my last e-mail, but I think we're onto something here.
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell,
>> Seth
>> Sent: Thursday, August 1, 2019 5:26 AM
>> To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance
>> Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>> RDMA Memory Regions
>>
>> Hi JD,
>>
>> Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.
>>
>> I think it's odd that we are using the buffer base for the memory
>> check, we should be using the iov base, but I don't believe that would
>> cause the issue you are seeing. Pushed a change to modify that
>> behavior anyways though:
>> https://review.gerrithub.io/c/spdk/spdk/+/463893
>>
>> There was one registration that I wasn't able to catch from your last log. Sorry about that, I forgot there wasn’t a debug log for it. Can you try it again with this change which adds noticelogs for the relevant registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -Lrdma argument this time to avoid the extra bloat in the logs.
>>
>> The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the dpdk code allocates some number of memzones to accommodate those buffer objects. Then it passes those memzones down one at a time and places objects inside the mempool from the given memzone until the memzone is exhausted. Then it goes back and grabs another memzone. This process continues until all objects are accounted for.
>> This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that it's possible that that's not true.
>>
>> Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
>>
>> Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?
>>
>> Thanks,
>>
>> Seth
>> -----Original Message-----
>> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
>> Sent: Wednesday, July 31, 2019 3:13 PM
>> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance
>> Development Kit <spdk(a)lists.01.org>
>> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>> RDMA Memory Regions
>>
>> Hi Seth,
>>
>> After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x2000084bf000 Length: 40000 LKey: e601
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x200008621000 Length: 10000 LKey: e701
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x200018600000 Length: 1000000 LKey: e801
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x20000847e000 Length: 40000 LKey: e701
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000846d000 Length: 10000 LKey: e801
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x200019800000 Length: 1000000 LKey: e901
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x200016ebb000 Length: 40000 LKey: e801
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000845c000 Length: 10000 LKey: e901
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x20001aa00000 Length: 1000000 LKey: ea01
>> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
>> 0x200016e7a000 Length: 40000 LKey: e901
>> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
>> 0x20000844b000 Length: 10000 LKey: ea01
>> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
>> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
>>
>> Is this you are look for as memory regions registered for NIC?
>>
>> I attached the complete log.
>>
>> Thanks,
>> JD
>>
>> On 7/30/19 5:28 PM, JD Zheng wrote:
>>> Hi Seth,
>>>
>>> Thanks for the prompt reply!
>>>
>>> Please find answers inline.
>>>
>>> JD
>>>
>>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>>> Hi JD,
>>>>
>>>> Thanks for the report. I want to ask a few questions to start
>>>> getting to the bottom of this. Since this issue doesn't currently
>>>> reproduce on our per-patch or nightly tests, I would like to
>>>> understand what's unique about your setup so that we can replicate
>>>> it in a per patch test to prevent future regressions.
>>> I am running it on aarch64 platform. I tried x86 platform and I can
>>> see same buffer alignment in memory pool but can't run the real test
>>> to reproduce it due to other missing pieces.
>>>
>>>>
>>>> What options are you passing when you create the rdma transport? Are
>>>> you creating it over RPC or in a configuration file?
>>> I am using conf file. Pls let me know if you'd like to look into conf file.
>>>
>>>>
>>>> Are you using the current DPDK submodule as your environment
>>>> abstraction layer?
>>> No. Our project uses specific version of DPDK, which is v18.11. I did
>>> quick test using latest and DPDK submodule on x86, and the buffer
>>> alignment is the same, i.e. 64B aligned.
>>>
>>>>
>>>> I notice that your error log is printing from
>>>> spdk_nvmf_transport_poll_group_create, which value exactly are you
>>>> printing out?
>>> Here is patch to add dbg print. Pls note that SPDK version is v19.04
>>>
>>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>>                                    SPDK_NOTICELOG("Unable to reserve
>>> the full number of buffers for the pg buffer cache.\n");
>>>                                    break;
>>>                            }
>>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>>> group->buf_cache_count, group->buf_cache_size);
>>>                            STAILQ_INSERT_HEAD(&group->buf_cache, buf,
>>> link);
>>>                            group->buf_cache_count++;
>>>                    }
>>>
>>>>
>>>> Can you run your target with the -L rdma option to get a dump of the
>>>> memory regions registered with the NIC?
>>> Let me test and get back to you soon.
>>>
>>>>
>>>> We made a couple of changes to this code when dynamic memory
>>>> allocations were added to DPDK. There were some safeguards that we
>>>> added to try and make sure this case wouldn't hit, so I'd like to
>>>> make sure you are running on the latest DPDK submodule as well as
>>>> the latest SPDK to narrow down where we need to look.
>>> Unfortunately I can't easily update DPDK because other team maintains
>>> it internally. But if it can be repro and fixed in latest, I will try
>>> to pull in the fix.
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Seth
>>>>
>>>> -----Original Message-----
>>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng
>>>> via SPDK
>>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>>> To: spdk(a)lists.01.org
>>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>>>> RDMA Memory Regions
>>>>
>>>> Hello,
>>>>
>>>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally
>>>> ran into this errors:
>>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
>>>> multiple RDMA Memory Regions"
>>>>
>>>> After digging into the code, I found that nvmf_rdma_fill_buffers()
>>>> calls spdk_mem_map_translate() to check if a data buffer sit on 2
>>>> 2MB pages, and if it is the case, it reports this error.
>>>>
>>>> The following commit added change to use data buffer start address
>>>> to calculate the size between buffer start address and 2MB boundary.
>>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with IO
>>>> Unit size (which is 8KB in my conf) to determine if the buffer
>>>> passes 2MB boundary.
>>>>
>>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>>
>>>>         memory: fix contiguous memory calculation for unaligned
>>>> buffers
>>>>
>>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new
>>>> request will use free buffer from that pool and the buffer start
>>>> address is passed to nvmf_rdma_fill_buffers(). But I found that
>>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB in
>>>> my
>>>> case) either, instead, they are 64Byte aligned so that some buffers
>>>> will fail the checking and leads to this problem.
>>>>
>>>> The corresponding code snippets are as following:
>>>> spdk_nvmf_transport_create()
>>>> {
>>>> ...
>>>>         transport->data_buf_pool =
>>>> pdk_mempool_create(spdk_mempool_name,
>>>>                                    opts->num_shared_buffers,
>>>>                                    opts->io_unit_size +
>>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>>                                    SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>>                                    SPDK_ENV_SOCKET_ID_ANY); ...
>>>> }
>>>>
>>>> Also some debug print I added shows the start address of the buffers:
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019258800 0(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x2000192557c0 1(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019252780 2(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001924f740 3(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001924c700 4(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x2000192496c0 5(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019246680 6(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019243640 7(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x200019240600 8(32)
>>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>>> 0x20001923d5c0 9(32)
>>>> ...
>>>>
>>>> It looks like either the buffer allocation has alignment issue or
>>>> the checking is not correct.
>>>>
>>>> Please advice how to fix this problem.
>>>>
>>>> Thanks,
>>>> JD Zheng
>>>> _______________________________________________
>>>> SPDK mailing list
>>>> SPDK(a)lists.01.org
>>>> https://lists.01.org/mailman/listinfo/spdk
>>>>
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-01 20:28 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-01 20:28 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 16285 bytes --]

Hi JD,

The 2 MiB check is just because we always do memory registrations at a granularity of at least 2 MiB (the minimum hugepage size). A buffer extending past a 2 MiB boundary doesn't mean that it exists in two different Memory Regions, and it won't fail the translation merely for crossing that boundary.

If you look at the definition of spdk_mem_map_translate, we call map->ops->are_contiguous every time we cross a 2 MiB boundary. For RDMA, this function is registered to spdk_nvmf_rdma_check_contiguous_entries. If this function returns true, then even if the buffer crosses a 2 MiB boundary, the translation will still be valid.
The problem you are running into is not related to the buffer alignment; it is related to the fact that the two pages across which the buffer is split are registered to two different MRs in the NIC. This can only happen if those two pages are allocated independently and trigger two distinct memory event callbacks.
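
If it helps, here is a much-simplified sketch of the shape of that walk. It is illustrative only (not the real spdk_mem_map_translate()), and the lookup/contig callbacks are hypothetical stand-ins for the registered mem map ops:

#include <stdbool.h>
#include <stdint.h>

#define VALUE_2MB (1ULL << 21)
#define MASK_2MB  (VALUE_2MB - 1)

typedef uint64_t (*lookup_fn)(uint64_t page_2mb);       /* translation (e.g. an lkey) for one 2MB page */
typedef bool (*contig_fn)(uint64_t tr1, uint64_t tr2);  /* plays the role of are_contiguous */

uint64_t
translate(uint64_t vaddr, uint64_t *size, lookup_fn lookup, contig_fn contig)
{
	uint64_t page = vaddr & ~MASK_2MB;
	uint64_t tr = lookup(page);
	uint64_t cur_size = VALUE_2MB - (vaddr & MASK_2MB);

	/* Keep extending the valid length while the neighboring 2MB entries
	 * translate contiguously (i.e. the same registration). */
	while (cur_size < *size) {
		page += VALUE_2MB;
		if (!contig(tr, lookup(page))) {
			break;  /* next page belongs to a different MR */
		}
		cur_size += VALUE_2MB;
	}

	if (cur_size < *size) {
		*size = cur_size;  /* the caller sees a short translation_len and errors out */
	}
	return tr;
}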

That is why I am so interested in seeing the results from the noticelog above ibv_reg_mr. It will tell me how your target application is allocating memory. Also, when you start the SPDK target, are you using the -s option, e.g. ./app/nvmf_tgt/nvmf_tgt -s 512? (I don't know if it'll make a difference; it's more of a curiosity thing for me.)

Thanks,

Seth

-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com] 
Sent: Thursday, August 1, 2019 11:24 AM
To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

Thanks for the detailed description, now I understand the reason behind the checking. But I have a question, why checking against 2MiB? Is it because DPDK uses 2MiB page size by default so that one RDMA memory region should not cross 2 pages?

 > Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.

I've added some print in nvmf_rdma_fill_buffers() @@ -1502,7 +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport *rtransport,
                 remaining_length -= rdma_req->req.iov[iovcnt].iov_len;

                 if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
-                       SPDK_ERRLOG("Data buffer split over multiple 
RDMA Memory Regions\n");
+                       SPDK_ERRLOG("Data buffer split over multiple
RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n", rdma_req->buffers[iovcnt], iovcnt, length, remaining_length, translation_len, rdma_req->req.iov[iovcnt].iov_len);
                         return -EINVAL;
                 }

With this I can see which buffer failed the checking.
For example, when SPKD initializes the memory pool, one of the buffers starts with 0x2000193feb00, and when failed, I got following:

rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376) (8192)

This buffer has 5376B on one 2MB page and the rest of it
(8192-5376=2816B) is on another page.

The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to use iov base should make it better as iov base is 4KiB aligned. In above case, iov_base is 0x2000193feb00 & 0xfff = 0x2000193fe000 and it should pass the checking.
However, another buffer in the pool is 0x2000192010c0 and iov_base is 0x200019201000, which would fail the checking because it is only 4KiB to 2MB boundary and IOUnitSize is 8KiB.

I will add the change from
https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the test to get more information.

I also attached the conf file too. The cmd line is "nvmf_tgt -m 0xff -j
0x90000000:0x20000000 -c 16disk_1ns.conf"

Thanks,
JD


On 8/1/19 7:52 AM, Howell, Seth wrote:
> Hi JD,
> 
> I was doing a little bit of digging in the dpdk documentation around this process, and I have a little bit more information. We were pretty worried about the whole dynamic memory allocations thing a few releases ago, so Jim helped add a flag into DPDK that prevented allocations from being allocated and freed in different granularities. This flag also prevents malloc heap allocations from spanning multiple memory events. However, this flag didn't make it into DPDK until 19.02 (More documentation at https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer if you're interested). We have some code in the SPDK environment layer that tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that function is entirely capable of handling the heap allocations spanning multiple memory events part of the problem.
> Since you are using dpdk 18.11, the memory callback inside of lib/env_dpdk looks like a good candidate for our issue. My best guess is that somehow a heap allocation from the buffer mempool is hitting across addresses from two dynamic memory allocation events. I'd still appreciate it if you could send me the information in my last e-mail, but I think we're onto something here.
> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell, 
> Seth
> Sent: Thursday, August 1, 2019 5:26 AM
> To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance 
> Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
> RDMA Memory Regions
> 
> Hi JD,
> 
> Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.
> 
> I think it's odd that we are using the buffer base for the memory 
> check, we should be using the iov base, but I don't believe that would 
> cause the issue you are seeing. Pushed a change to modify that 
> behavior anyways though: 
> https://review.gerrithub.io/c/spdk/spdk/+/463893
> 
> There was one registration that I wasn't able to catch from your last log. Sorry about that, I forgot there wasn’t a debug log for it. Can you try it again with this change which adds noticelogs for the relevant registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -Lrdma argument this time to avoid the extra bloat in the logs.
> 
> The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the dpdk code allocates some number of memzones to accommodate those buffer objects. Then it passes those memzones down one at a time and places objects inside the mempool from the given memzone until the memzone is exhausted. Then it goes back and grabs another memzone. This process continues until all objects are accounted for.
> This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that it's possible that that's not true.
> 
> Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
> 
> Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?
> 
> Thanks,
> 
> Seth
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Wednesday, July 31, 2019 3:13 PM
> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance 
> Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
> RDMA Memory Regions
> 
> Hi Seth,
> 
> After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x2000084bf000 Length: 40000 LKey: e601
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x200008621000 Length: 10000 LKey: e701
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x200018600000 Length: 1000000 LKey: e801
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x20000847e000 Length: 40000 LKey: e701
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x20000846d000 Length: 10000 LKey: e801
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x200019800000 Length: 1000000 LKey: e901
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x200016ebb000 Length: 40000 LKey: e801
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x20000845c000 Length: 10000 LKey: e901
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x20001aa00000 Length: 1000000 LKey: ea01
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x200016e7a000 Length: 40000 LKey: e901
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x20000844b000 Length: 10000 LKey: ea01
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
> 
> Is this you are look for as memory regions registered for NIC?
> 
> I attached the complete log.
> 
> Thanks,
> JD
> 
> On 7/30/19 5:28 PM, JD Zheng wrote:
>> Hi Seth,
>>
>> Thanks for the prompt reply!
>>
>> Please find answers inline.
>>
>> JD
>>
>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>> Hi JD,
>>>
>>> Thanks for the report. I want to ask a few questions to start 
>>> getting to the bottom of this. Since this issue doesn't currently 
>>> reproduce on our per-patch or nightly tests, I would like to 
>>> understand what's unique about your setup so that we can replicate 
>>> it in a per patch test to prevent future regressions.
>> I am running it on aarch64 platform. I tried x86 platform and I can 
>> see same buffer alignment in memory pool but can't run the real test 
>> to reproduce it due to other missing pieces.
>>
>>>
>>> What options are you passing when you create the rdma transport? Are 
>>> you creating it over RPC or in a configuration file?
>> I am using conf file. Pls let me know if you'd like to look into conf file.
>>
>>>
>>> Are you using the current DPDK submodule as your environment 
>>> abstraction layer?
>> No. Our project uses specific version of DPDK, which is v18.11. I did 
>> quick test using latest and DPDK submodule on x86, and the buffer 
>> alignment is the same, i.e. 64B aligned.
>>
>>>
>>> I notice that your error log is printing from 
>>> spdk_nvmf_transport_poll_group_create, which value exactly are you 
>>> printing out?
>> Here is patch to add dbg print. Pls note that SPDK version is v19.04
>>
>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>                                   SPDK_NOTICELOG("Unable to reserve 
>> the full number of buffers for the pg buffer cache.\n");
>>                                   break;
>>                           }
>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>> group->buf_cache_count, group->buf_cache_size);
>>                           STAILQ_INSERT_HEAD(&group->buf_cache, buf, 
>> link);
>>                           group->buf_cache_count++;
>>                   }
>>
>>>
>>> Can you run your target with the -L rdma option to get a dump of the 
>>> memory regions registered with the NIC?
>> Let me test and get back to you soon.
>>
>>>
>>> We made a couple of changes to this code when dynamic memory 
>>> allocations were added to DPDK. There were some safeguards that we 
>>> added to try and make sure this case wouldn't hit, so I'd like to 
>>> make sure you are running on the latest DPDK submodule as well as 
>>> the latest SPDK to narrow down where we need to look.
>> Unfortunately I can't easily update DPDK because other team maintains 
>> it internally. But if it can be repro and fixed in latest, I will try 
>> to pull in the fix.
>>
>>>
>>> Thanks,
>>>
>>> Seth
>>>
>>> -----Original Message-----
>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng 
>>> via SPDK
>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>> To: spdk(a)lists.01.org
>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>>> RDMA Memory Regions
>>>
>>> Hello,
>>>
>>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally 
>>> ran into this errors:
>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over 
>>> multiple RDMA Memory Regions"
>>>
>>> After digging into the code, I found that nvmf_rdma_fill_buffers() 
>>> calls spdk_mem_map_translate() to check if a data buffer sit on 2 
>>> 2MB pages, and if it is the case, it reports this error.
>>>
>>> The following commit added change to use data buffer start address 
>>> to calculate the size between buffer start address and 2MB boundary. 
>>> The caller nvmf_rdma_fill_buffers() uses the size to compare with IO 
>>> Unit size (which is 8KB in my conf) to determine if the buffer 
>>> passes 2MB boundary.
>>>
>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>
>>>        memory: fix contiguous memory calculation for unaligned 
>>> buffers
>>>
>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new 
>>> request will use free buffer from that pool and the buffer start 
>>> address is passed to nvmf_rdma_fill_buffers(). But I found that 
>>> these buffers are not 2MB aligned and not IOUnitSize aligned (8KB in 
>>> my
>>> case) either, instead, they are 64Byte aligned so that some buffers 
>>> will fail the checking and leads to this problem.
>>>
>>> The corresponding code snippets are as following:
>>> spdk_nvmf_transport_create()
>>> {
>>> ...
>>>        transport->data_buf_pool =
>>> pdk_mempool_create(spdk_mempool_name,
>>>                                   opts->num_shared_buffers,
>>>                                   opts->io_unit_size + 
>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>                                   SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>                                   SPDK_ENV_SOCKET_ID_ANY); ...
>>> }
>>>
>>> Also some debug print I added shows the start address of the buffers:
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019258800 0(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x2000192557c0 1(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019252780 2(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x20001924f740 3(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x20001924c700 4(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x2000192496c0 5(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019246680 6(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019243640 7(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019240600 8(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x20001923d5c0 9(32)
>>> ...
>>>
>>> It looks like either the buffer allocation has alignment issue or 
>>> the checking is not correct.
>>>
>>> Please advice how to fix this problem.
>>>
>>> Thanks,
>>> JD Zheng
>>> _______________________________________________
>>> SPDK mailing list
>>> SPDK(a)lists.01.org
>>> https://lists.01.org/mailman/listinfo/spdk
>>>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-01 18:23 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-08-01 18:23 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 14552 bytes --]

Hi Seth,

Thanks for the detailed description; now I understand the reason behind 
the check. But I have a question: why check against 2MiB? Is it 
because DPDK uses a 2MiB page size by default, so one RDMA memory 
region should not cross two pages?

 > Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.

I've added a print in nvmf_rdma_fill_buffers():
@@ -1502,7 +1503,11 @@ nvmf_rdma_fill_buffers(struct spdk_nvmf_rdma_transport *rtransport,
                 remaining_length -= rdma_req->req.iov[iovcnt].iov_len;

                 if (translation_len < rdma_req->req.iov[iovcnt].iov_len) {
-                       SPDK_ERRLOG("Data buffer split over multiple RDMA Memory Regions\n");
+                       SPDK_ERRLOG("Data buffer split over multiple RDMA Memory Regions %p %d (%d) (%d) (%d) (%d)\n", rdma_req->buffers[iovcnt], iovcnt, length, remaining_length, translation_len, rdma_req->req.iov[iovcnt].iov_len);
                         return -EINVAL;
                 }

With this I can see which buffer failed the check.
For example, when SPDK initializes the memory pool, one of the buffers 
starts at 0x2000193feb00, and when it failed, I got the following:

rdma.c:1510:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over multiple RDMA Memory Regions 0x2000193feb00 0 (8192) (0) (5376) (8192)

This buffer has 5376B on one 2MB page and the rest of it 
(8192-5376=2816B) is on another page.

The change https://review.gerrithub.io/c/spdk/spdk/+/463893 to use the iov 
base should make it better, as the iov base is 4KiB aligned. In the above 
case, iov_base is 0x2000193feb00 & ~0xfff = 0x2000193fe000, and it should 
pass the check.
However, another buffer in the pool is 0x2000192010c0, whose iov_base is 
0x200019201000, which would fail the check because it is only 4KiB from a 
2MB boundary and IOUnitSize is 8KiB.
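
To double-check the first example above, here is a throwaway snippet (my own, not the code from the Gerrit change) that redoes the arithmetic with the masking described above:

#include <stdint.h>
#include <stdio.h>

#define VALUE_2MB (1ULL << 21)
#define MASK_2MB  (VALUE_2MB - 1)

int main(void)
{
	uint64_t buf = 0x2000193feb00ULL;     /* buffer start from the error above */
	uint64_t iov_base = buf & ~0xfffULL;  /* 0x2000193fe000, aligned down to 4KiB as described above */

	printf("buf:      %#llx, %llu bytes to the 2MB boundary\n",
	       (unsigned long long)buf,
	       (unsigned long long)(VALUE_2MB - (buf & MASK_2MB)));       /* 5376, fails the 8KiB check */
	printf("iov_base: %#llx, %llu bytes to the 2MB boundary\n",
	       (unsigned long long)iov_base,
	       (unsigned long long)(VALUE_2MB - (iov_base & MASK_2MB)));  /* 8192, passes the 8KiB check */
	return 0;
}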

I will add the change from 
https://review.gerrithub.io/c/spdk/spdk/+/463892 and rerun the test to 
get more information.

I also attached the conf file. The command line is "nvmf_tgt -m 0xff -j 
0x90000000:0x20000000 -c 16disk_1ns.conf".

Thanks,
JD


On 8/1/19 7:52 AM, Howell, Seth wrote:
> Hi JD,
> 
> I was doing a little bit of digging in the dpdk documentation around this process, and I have a little bit more information. We were pretty worried about the whole dynamic memory allocations thing a few releases ago, so Jim helped add a flag into DPDK that prevented allocations from being allocated and freed in different granularities. This flag also prevents malloc heap allocations from spanning multiple memory events. However, this flag didn't make it into DPDK until 19.02 (More documentation at https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer if you're interested). We have some code in the SPDK environment layer that tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb) but I don't know that that function is entirely capable of handling the heap allocations spanning multiple memory events part of the problem.
> Since you are using dpdk 18.11, the memory callback inside of lib/env_dpdk looks like a good candidate for our issue. My best guess is that somehow a heap allocation from the buffer mempool is hitting across addresses from two dynamic memory allocation events. I'd still appreciate it if you could send me the information in my last e-mail, but I think we're onto something here.
> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell, Seth
> Sent: Thursday, August 1, 2019 5:26 AM
> To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
> 
> Hi JD,
> 
> Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.
> 
> I think it's odd that we are using the buffer base for the memory check, we should be using the iov base, but I don't believe that would cause the issue you are seeing. Pushed a change to modify that behavior anyways though: https://review.gerrithub.io/c/spdk/spdk/+/463893
> 
> There was one registration that I wasn't able to catch from your last log. Sorry about that, I forgot there wasn’t a debug log for it. Can you try it again with this change which adds noticelogs for the relevant registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -Lrdma argument this time to avoid the extra bloat in the logs.
> 
> The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the dpdk code allocates some number of memzones to accommodate those buffer objects. Then it passes those memzones down one at a time and places objects inside the mempool from the given memzone until the memzone is exhausted. Then it goes back and grabs another memzone. This process continues until all objects are accounted for.
> This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that it's possible that that's not true.
> 
> Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.
> 
> Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?
> 
> Thanks,
> 
> Seth
> -----Original Message-----
> From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
> Sent: Wednesday, July 31, 2019 3:13 PM
> To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
> 
> Hi Seth,
> 
> After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x2000084bf000 Length: 40000 LKey: e601
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x200008621000 Length: 10000 LKey: e701
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x200018600000 Length: 1000000 LKey: e801
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x20000847e000 Length: 40000 LKey: e701
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x20000846d000 Length: 10000 LKey: e801
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x200019800000 Length: 1000000 LKey: e901
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x200016ebb000 Length: 40000 LKey: e801
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x20000845c000 Length: 10000 LKey: e901
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x20001aa00000 Length: 1000000 LKey: ea01
> rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array:
> 0x200016e7a000 Length: 40000 LKey: e901
> rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array:
> 0x20000844b000 Length: 10000 LKey: ea01
> rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array:
> 0x20001bc00000 Length: 1000000 LKey: eb01 ...
> 
> Is this you are look for as memory regions registered for NIC?
> 
> I attached the complete log.
> 
> Thanks,
> JD
> 
> On 7/30/19 5:28 PM, JD Zheng wrote:
>> Hi Seth,
>>
>> Thanks for the prompt reply!
>>
>> Please find answers inline.
>>
>> JD
>>
>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>>> Hi JD,
>>>
>>> Thanks for the report. I want to ask a few questions to start getting
>>> to the bottom of this. Since this issue doesn't currently reproduce
>>> on our per-patch or nightly tests, I would like to understand what's
>>> unique about your setup so that we can replicate it in a per patch
>>> test to prevent future regressions.
>> I am running it on aarch64 platform. I tried x86 platform and I can
>> see same buffer alignment in memory pool but can't run the real test
>> to reproduce it due to other missing pieces.
>>
>>>
>>> What options are you passing when you create the rdma transport? Are
>>> you creating it over RPC or in a configuration file?
>> I am using conf file. Pls let me know if you'd like to look into conf file.
>>
>>>
>>> Are you using the current DPDK submodule as your environment
>>> abstraction layer?
>> No. Our project uses specific version of DPDK, which is v18.11. I did
>> quick test using latest and DPDK submodule on x86, and the buffer
>> alignment is the same, i.e. 64B aligned.
>>
>>>
>>> I notice that your error log is printing from
>>> spdk_nvmf_transport_poll_group_create, which value exactly are you
>>> printing out?
>> Here is patch to add dbg print. Pls note that SPDK version is v19.04
>>
>> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>>                                   SPDK_NOTICELOG("Unable to reserve the
>> full number of buffers for the pg buffer cache.\n");
>>                                   break;
>>                           }
>> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
>> group->buf_cache_count, group->buf_cache_size);
>>                           STAILQ_INSERT_HEAD(&group->buf_cache, buf,
>> link);
>>                           group->buf_cache_count++;
>>                   }
>>
>>>
>>> Can you run your target with the -L rdma option to get a dump of the
>>> memory regions registered with the NIC?
>> Let me test and get back to you soon.
>>
>>>
>>> We made a couple of changes to this code when dynamic memory
>>> allocations were added to DPDK. There were some safeguards that we
>>> added to try and make sure this case wouldn't hit, so I'd like to
>>> make sure you are running on the latest DPDK submodule as well as the
>>> latest SPDK to narrow down where we need to look.
>> Unfortunately I can't easily update DPDK because other team maintains
>> it internally. But if it can be repro and fixed in latest, I will try
>> to pull in the fix.
>>
>>>
>>> Thanks,
>>>
>>> Seth
>>>
>>> -----Original Message-----
>>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng
>>> via SPDK
>>> Sent: Wednesday, July 31, 2019 3:00 AM
>>> To: spdk(a)lists.01.org
>>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple
>>> RDMA Memory Regions
>>>
>>> Hello,
>>>
>>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally
>>> ran into this errors:
>>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over
>>> multiple RDMA Memory Regions"
>>>
>>> After digging into the code, I found that nvmf_rdma_fill_buffers()
>>> calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB
>>> pages, and if it is the case, it reports this error.
>>>
>>> The following commit added change to use data buffer start address to
>>> calculate the size between buffer start address and 2MB boundary. The
>>> caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit
>>> size (which is 8KB in my conf) to determine if the buffer passes 2MB
>>> boundary.
>>>
>>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>>
>>>        memory: fix contiguous memory calculation for unaligned buffers
>>>
>>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new
>>> request will use free buffer from that pool and the buffer start
>>> address is passed to nvmf_rdma_fill_buffers(). But I found that these
>>> buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my
>>> case) either, instead, they are 64Byte aligned so that some buffers
>>> will fail the checking and leads to this problem.
>>>
>>> The corresponding code snippets are as following:
>>> spdk_nvmf_transport_create()
>>> {
>>> ...
>>>        transport->data_buf_pool =
>>> pdk_mempool_create(spdk_mempool_name,
>>>                                   opts->num_shared_buffers,
>>>                                   opts->io_unit_size +
>>> NVMF_DATA_BUFFER_ALIGNMENT,
>>>                                   SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>>                                   SPDK_ENV_SOCKET_ID_ANY); ...
>>> }
>>>
>>> Also some debug print I added shows the start address of the buffers:
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019258800 0(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x2000192557c0 1(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019252780 2(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x20001924f740 3(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x20001924c700 4(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x2000192496c0 5(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019246680 6(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019243640 7(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x200019240600 8(32)
>>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>>> 0x20001923d5c0 9(32)
>>> ...
>>>
>>> It looks like either the buffer allocation has alignment issue or the
>>> checking is not correct.
>>>
>>> Please advice how to fix this problem.
>>>
>>> Thanks,
>>> JD Zheng
>>> _______________________________________________
>>> SPDK mailing list
>>> SPDK(a)lists.01.org
>>> https://lists.01.org/mailman/listinfo/spdk
>>>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
> 

[-- Attachment #2: 16disk_1ns.conf --]
[-- Type: text/plain, Size: 1648 bytes --]

[Global]
  ReactorMask 0xff
  LogFacility "local7"
[Rpc]
  Enable No
  Listen 127.0.0.1
[Nvmf]
  AcceptorPollRate 10000
[Transport]
  Type RDMA
  MaxQueuesPerSession 32
  MaxQueueDepth 256
  InCapsuleDataSize 4096
  MaxIOSize 8192
  IOUnitSize 8192
  AcceptorCore 0
  NumSharedBuffers 512
[Nvme]
  Timeout 0
  AdminPollRate 100000
  TransportId "trtype:PCIe traddr:0000:06:00.0" nvme0
  TransportId "trtype:PCIe traddr:0000:0a:00.0" nvme1
  TransportId "trtype:PCIe traddr:0000:0e:00.0" nvme2
  TransportId "trtype:PCIe traddr:0000:12:00.0" nvme3
  TransportId "trtype:PCIe traddr:0001:05:00.0" nvme4
  TransportId "trtype:PCIe traddr:0001:09:00.0" nvme5
  TransportId "trtype:PCIe traddr:0001:0d:00.0" nvme6
  TransportId "trtype:PCIe traddr:0001:11:00.0" nvme7
  TransportId "trtype:PCIe traddr:0006:04:00.0" nvme8
  TransportId "trtype:PCIe traddr:0006:08:00.0" nvme9
  TransportId "trtype:PCIe traddr:0006:0c:00.0" nvme10
  TransportId "trtype:PCIe traddr:0006:10:00.0" nvme11
  TransportId "trtype:PCIe traddr:0007:03:00.0" nvme12
  TransportId "trtype:PCIe traddr:0007:07:00.0" nvme13
  TransportId "trtype:PCIe traddr:0007:0b:00.0" nvme14
  TransportId "trtype:PCIe traddr:0007:0f:00.0" nvme15
[Subsystem0]
  NQN nqn.2016-06.io.spdk:cnode0
  Listen RDMA 192.168.2.10:4420
  SN SPDK00000000000001
  Namespace nvme0n1
  Namespace nvme4n1
  Namespace nvme8n1
  Namespace nvme12n1
  Namespace nvme1n1
  Namespace nvme5n1
  Namespace nvme9n1
  Namespace nvme13n1
  Namespace nvme2n1
  Namespace nvme6n1
  Namespace nvme10n1
  Namespace nvme14n1
  Namespace nvme3n1
  Namespace nvme7n1
  Namespace nvme11n1
  Namespace nvme15n1
  AllowAnyHost Yes

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-01 14:52 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-01 14:52 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 12006 bytes --]

Hi JD,

I was doing a little bit of digging in the DPDK documentation around this process, and I have a little more information. We were pretty worried about dynamic memory allocations a few releases ago, so Jim helped add a flag into DPDK that prevents memory from being allocated and freed at different granularities. This flag also prevents malloc heap allocations from spanning multiple memory events. However, this flag didn't make it into DPDK until 19.02 (more documentation at https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#environment-abstraction-layer if you're interested). We have some code in the SPDK environment layer that tries to deal with that (see lib/env_dpdk/memory.c:memory_hotplug_cb), but I don't know that that function fully handles the case where a heap allocation spans multiple memory events.
Since you are using DPDK 18.11, the memory callback inside of lib/env_dpdk looks like a good candidate for our issue. My best guess is that a heap allocation from the buffer mempool spans addresses from two dynamic memory allocation events. I'd still appreciate it if you could send me the information in my last e-mail, but I think we're onto something here.
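
For reference, once you can move to a DPDK that has the flag, something along these lines should let the target hand --match-allocations through to the EAL. This is only a sketch: I am assuming the env_context field in spdk_env_opts here, and newer SPDK/DPDK submodule combinations may already pass the flag for you, so please check it against the versions you actually build:

/* Sketch only: forwarding the DPDK 19.02+ "--match-allocations" EAL flag
 * through SPDK's env layer. Verify the spdk_env_opts fields against your
 * SPDK version before relying on this. */
#include "spdk/env.h"

int
init_env_with_match_allocations(void)
{
	struct spdk_env_opts opts;

	spdk_env_opts_init(&opts);
	opts.name = "nvmf_tgt";
	/* Extra arguments handed through to rte_eal_init(). */
	opts.env_context = "--match-allocations";

	return spdk_env_init(&opts);
}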

Thanks,

Seth

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Howell, Seth
Sent: Thursday, August 1, 2019 5:26 AM
To: JD Zheng <jiandong.zheng(a)broadcom.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi JD,

Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.

I think it's odd that we are using the buffer base for the memory check, we should be using the iov base, but I don't believe that would cause the issue you are seeing. Pushed a change to modify that behavior anyways though: https://review.gerrithub.io/c/spdk/spdk/+/463893

There was one registration that I wasn't able to catch from your last log. Sorry about that, I forgot there wasn’t a debug log for it. Can you try it again with this change which adds noticelogs for the relevant registrations. https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -Lrdma argument this time to avoid the extra bloat in the logs.

The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the dpdk code allocates some number of memzones to accommodate those buffer objects. Then it passes those memzones down one at a time and places objects inside the mempool from the given memzone until the memzone is exhausted. Then it goes back and grabs another memzone. This process continues until all objects are accounted for.
This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that it's possible that that's not true.

Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.

Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?

Thanks,

Seth
-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com]
Sent: Wednesday, July 31, 2019 3:13 PM
To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x2000084bf000 Length: 40000 LKey: e601
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x200008621000 Length: 10000 LKey: e701
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x200018600000 Length: 1000000 LKey: e801
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x20000847e000 Length: 40000 LKey: e701
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000846d000 Length: 10000 LKey: e801
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x200019800000 Length: 1000000 LKey: e901
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x200016ebb000 Length: 40000 LKey: e801
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000845c000 Length: 10000 LKey: e901
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x20001aa00000 Length: 1000000 LKey: ea01
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x200016e7a000 Length: 40000 LKey: e901
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000844b000 Length: 10000 LKey: ea01
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x20001bc00000 Length: 1000000 LKey: eb01 ...

Is this you are look for as memory regions registered for NIC?

I attached the complete log.

Thanks,
JD

On 7/30/19 5:28 PM, JD Zheng wrote:
> Hi Seth,
> 
> Thanks for the prompt reply!
> 
> Please find answers inline.
> 
> JD
> 
> On 7/30/19 5:01 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> Thanks for the report. I want to ask a few questions to start getting 
>> to the bottom of this. Since this issue doesn't currently reproduce 
>> on our per-patch or nightly tests, I would like to understand what's 
>> unique about your setup so that we can replicate it in a per patch 
>> test to prevent future regressions.
> I am running it on aarch64 platform. I tried x86 platform and I can 
> see same buffer alignment in memory pool but can't run the real test 
> to reproduce it due to other missing pieces.
> 
>>
>> What options are you passing when you create the rdma transport? Are 
>> you creating it over RPC or in a configuration file?
> I am using conf file. Pls let me know if you'd like to look into conf file.
> 
>>
>> Are you using the current DPDK submodule as your environment 
>> abstraction layer?
> No. Our project uses specific version of DPDK, which is v18.11. I did 
> quick test using latest and DPDK submodule on x86, and the buffer 
> alignment is the same, i.e. 64B aligned.
> 
>>
>> I notice that your error log is printing from 
>> spdk_nvmf_transport_poll_group_create, which value exactly are you 
>> printing out?
> Here is patch to add dbg print. Pls note that SPDK version is v19.04
> 
> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>                                  SPDK_NOTICELOG("Unable to reserve the 
> full number of buffers for the pg buffer cache.\n");
>                                  break;
>                          }
> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
> group->buf_cache_count, group->buf_cache_size);
>                          STAILQ_INSERT_HEAD(&group->buf_cache, buf, 
> link);
>                          group->buf_cache_count++;
>                  }
> 
>>
>> Can you run your target with the -L rdma option to get a dump of the 
>> memory regions registered with the NIC?
> Let me test and get back to you soon.
> 
>>
>> We made a couple of changes to this code when dynamic memory 
>> allocations were added to DPDK. There were some safeguards that we 
>> added to try and make sure this case wouldn't hit, so I'd like to 
>> make sure you are running on the latest DPDK submodule as well as the 
>> latest SPDK to narrow down where we need to look.
> Unfortunately I can't easily update DPDK because other team maintains 
> it internally. But if it can be repro and fixed in latest, I will try 
> to pull in the fix.
> 
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng 
>> via SPDK
>> Sent: Wednesday, July 31, 2019 3:00 AM
>> To: spdk(a)lists.01.org
>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>> RDMA Memory Regions
>>
>> Hello,
>>
>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally 
>> ran into this errors:
>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over 
>> multiple RDMA Memory Regions"
>>
>> After digging into the code, I found that nvmf_rdma_fill_buffers() 
>> calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB 
>> pages, and if it is the case, it reports this error.
>>
>> The following commit added change to use data buffer start address to 
>> calculate the size between buffer start address and 2MB boundary. The 
>> caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit 
>> size (which is 8KB in my conf) to determine if the buffer passes 2MB 
>> boundary.
>>
>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>
>>       memory: fix contiguous memory calculation for unaligned buffers
>>
>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new 
>> request will use free buffer from that pool and the buffer start 
>> address is passed to nvmf_rdma_fill_buffers(). But I found that these 
>> buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my
>> case) either, instead, they are 64Byte aligned so that some buffers 
>> will fail the checking and leads to this problem.
>>
>> The corresponding code snippets are as following:
>> spdk_nvmf_transport_create()
>> {
>> ...
>>       transport->data_buf_pool =
>> pdk_mempool_create(spdk_mempool_name,
>>                                  opts->num_shared_buffers,
>>                                  opts->io_unit_size + 
>> NVMF_DATA_BUFFER_ALIGNMENT,
>>                                  SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>                                  SPDK_ENV_SOCKET_ID_ANY); ...
>> }
>>
>> Also some debug print I added shows the start address of the buffers:
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019258800 0(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192557c0 1(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019252780 2(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924f740 3(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924c700 4(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192496c0 5(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019246680 6(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019243640 7(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019240600 8(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001923d5c0 9(32)
>> ...
>>
>> It looks like either the buffer allocation has alignment issue or the 
>> checking is not correct.
>>
>> Please advice how to fix this problem.
>>
>> Thanks,
>> JD Zheng
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-08-01 12:26 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-08-01 12:26 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 10234 bytes --]

Hi JD,

Thanks for doing that. Yeah, I am mainly looking to see how the mempool addresses are mapped into the NIC with ibv_reg_mr.

I think it's odd that we are using the buffer base for the memory check; we should be using the iov base. I don't believe that would cause the issue you are seeing, but I pushed a change to modify that behavior anyway: https://review.gerrithub.io/c/spdk/spdk/+/463893

There was one registration that I wasn't able to catch from your last log. Sorry about that; I forgot there wasn't a debug log for it. Can you try it again with this change, which adds noticelogs for the relevant registrations? https://review.gerrithub.io/c/spdk/spdk/+/463892 You should be able to run your test without the -L rdma argument this time to avoid the extra bloat in the logs.

The underlying assumption of the code is that any given object is not going to cross a dynamic memory allocation from DPDK. For a little background, when the mempool gets created, the DPDK code allocates some number of memzones to accommodate those buffer objects. It then passes those memzones down one at a time and places objects into the mempool from the given memzone until the memzone is exhausted, then goes back and grabs another memzone. This process continues until all objects are accounted for.
This only works if each memzone corresponds to a single memory event when using dynamic memory allocation. My understanding was that this was always the case, but this error makes me think that may not be true.

Once I see what your memory registrations look like and what addresses you're failing on, it will help me understand what is going on better.

Can you also provide the command line you are using to start the nvmf_tgt application and attach your configuration file?

Thanks,

Seth
-----Original Message-----
From: JD Zheng [mailto:jiandong.zheng(a)broadcom.com] 
Sent: Wednesday, July 31, 2019 3:13 PM
To: Howell, Seth <seth.howell(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hi Seth,

After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x2000084bf000 Length: 40000 LKey: e601
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x200008621000 Length: 10000 LKey: e701
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x200018600000 Length: 1000000 LKey: e801
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x20000847e000 Length: 40000 LKey: e701
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000846d000 Length: 10000 LKey: e801
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x200019800000 Length: 1000000 LKey: e901
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x200016ebb000 Length: 40000 LKey: e801
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000845c000 Length: 10000 LKey: e901
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x20001aa00000 Length: 1000000 LKey: ea01
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x200016e7a000 Length: 40000 LKey: e901
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000844b000 Length: 10000 LKey: ea01
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x20001bc00000 Length: 1000000 LKey: eb01 ...

Is this you are look for as memory regions registered for NIC?

I attached the complete log.

Thanks,
JD

On 7/30/19 5:28 PM, JD Zheng wrote:
> Hi Seth,
> 
> Thanks for the prompt reply!
> 
> Please find answers inline.
> 
> JD
> 
> On 7/30/19 5:01 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> Thanks for the report. I want to ask a few questions to start getting 
>> to the bottom of this. Since this issue doesn't currently reproduce 
>> on our per-patch or nightly tests, I would like to understand what's 
>> unique about your setup so that we can replicate it in a per patch 
>> test to prevent future regressions.
> I am running it on aarch64 platform. I tried x86 platform and I can 
> see same buffer alignment in memory pool but can't run the real test 
> to reproduce it due to other missing pieces.
> 
>>
>> What options are you passing when you create the rdma transport? Are 
>> you creating it over RPC or in a configuration file?
> I am using conf file. Pls let me know if you'd like to look into conf file.
> 
>>
>> Are you using the current DPDK submodule as your environment 
>> abstraction layer?
> No. Our project uses specific version of DPDK, which is v18.11. I did 
> quick test using latest and DPDK submodule on x86, and the buffer 
> alignment is the same, i.e. 64B aligned.
> 
>>
>> I notice that your error log is printing from 
>> spdk_nvmf_transport_poll_group_create, which value exactly are you 
>> printing out?
> Here is patch to add dbg print. Pls note that SPDK version is v19.04
> 
> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>                                  SPDK_NOTICELOG("Unable to reserve the 
> full number of buffers for the pg buffer cache.\n");
>                                  break;
>                          }
> +                       SPDK_ERRLOG("%p %d(%d)\n", buf,
> group->buf_cache_count, group->buf_cache_size);
>                          STAILQ_INSERT_HEAD(&group->buf_cache, buf, 
> link);
>                          group->buf_cache_count++;
>                  }
> 
>>
>> Can you run your target with the -L rdma option to get a dump of the 
>> memory regions registered with the NIC?
> Let me test and get back to you soon.
> 
>>
>> We made a couple of changes to this code when dynamic memory 
>> allocations were added to DPDK. There were some safeguards that we 
>> added to try and make sure this case wouldn't hit, so I'd like to 
>> make sure you are running on the latest DPDK submodule as well as the 
>> latest SPDK to narrow down where we need to look.
> Unfortunately I can't easily update DPDK because other team maintains 
> it internally. But if it can be repro and fixed in latest, I will try 
> to pull in the fix.
> 
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng 
>> via SPDK
>> Sent: Wednesday, July 31, 2019 3:00 AM
>> To: spdk(a)lists.01.org
>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple 
>> RDMA Memory Regions
>>
>> Hello,
>>
>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally 
>> ran into this errors:
>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over 
>> multiple RDMA Memory Regions"
>>
>> After digging into the code, I found that nvmf_rdma_fill_buffers() 
>> calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB 
>> pages, and if it is the case, it reports this error.
>>
>> The following commit added change to use data buffer start address to 
>> calculate the size between buffer start address and 2MB boundary. The 
>> caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit 
>> size (which is 8KB in my conf) to determine if the buffer passes 2MB 
>> boundary.
>>
>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>
>>       memory: fix contiguous memory calculation for unaligned buffers
>>
>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new 
>> request will use free buffer from that pool and the buffer start 
>> address is passed to nvmf_rdma_fill_buffers(). But I found that these 
>> buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my
>> case) either, instead, they are 64Byte aligned so that some buffers 
>> will fail the checking and leads to this problem.
>>
>> The corresponding code snippets are as following:
>> spdk_nvmf_transport_create()
>> {
>> ...
>>       transport->data_buf_pool = 
>> pdk_mempool_create(spdk_mempool_name,
>>                                  opts->num_shared_buffers,
>>                                  opts->io_unit_size + 
>> NVMF_DATA_BUFFER_ALIGNMENT,
>>                                  SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>                                  SPDK_ENV_SOCKET_ID_ANY); ...
>> }
>>
>> Also some debug print I added shows the start address of the buffers:
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019258800 0(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192557c0 1(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019252780 2(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924f740 3(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924c700 4(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192496c0 5(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019246680 6(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019243640 7(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019240600 8(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001923d5c0 9(32)
>> ...
>>
>> It looks like either the buffer allocation has alignment issue or the 
>> checking is not correct.
>>
>> Please advice how to fix this problem.
>>
>> Thanks,
>> JD Zheng
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-07-31 22:13 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-07-31 22:13 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 8069 bytes --]

Hi Seth,

After I enabled debug and ran nvmf_tgt with -L rdma, I got some logs like:
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x2000084bf000 Length: 40000 LKey: e601
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x200008621000 Length: 10000 LKey: e701
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x200018600000 Length: 1000000 LKey: e801
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x20000847e000 Length: 40000 LKey: e701
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000846d000 Length: 10000 LKey: e801
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x200019800000 Length: 1000000 LKey: e901
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x200016ebb000 Length: 40000 LKey: e801
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000845c000 Length: 10000 LKey: e901
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x20001aa00000 Length: 1000000 LKey: ea01
rdma.c: 746:nvmf_rdma_resources_create: *DEBUG*: Command Array: 
0x200016e7a000 Length: 40000 LKey: e901
rdma.c: 749:nvmf_rdma_resources_create: *DEBUG*: Completion Array: 
0x20000844b000 Length: 10000 LKey: ea01
rdma.c: 753:nvmf_rdma_resources_create: *DEBUG*: In Capsule Data Array: 
0x20001bc00000 Length: 1000000 LKey: eb01
...

Is this what you are looking for as the memory regions registered with the NIC?

I attached the complete log.

Thanks,
JD

On 7/30/19 5:28 PM, JD Zheng wrote:
> Hi Seth,
> 
> Thanks for the prompt reply!
> 
> Please find answers inline.
> 
> JD
> 
> On 7/30/19 5:01 PM, Howell, Seth wrote:
>> Hi JD,
>>
>> Thanks for the report. I want to ask a few questions to start getting 
>> to the bottom of this. Since this issue doesn't currently reproduce on 
>> our per-patch or nightly tests, I would like to understand what's 
>> unique about your setup so that we can replicate it in a per patch 
>> test to prevent future regressions.
> I am running it on aarch64 platform. I tried x86 platform and I can see 
> same buffer alignment in memory pool but can't run the real test to 
> reproduce it due to other missing pieces.
> 
>>
>> What options are you passing when you create the rdma transport? Are 
>> you creating it over RPC or in a configuration file?
> I am using conf file. Pls let me know if you'd like to look into conf file.
> 
>>
>> Are you using the current DPDK submodule as your environment 
>> abstraction layer?
> No. Our project uses specific version of DPDK, which is v18.11. I did 
> quick test using latest and DPDK submodule on x86, and the buffer 
> alignment is the same, i.e. 64B aligned.
> 
>>
>> I notice that your error log is printing from 
>> spdk_nvmf_transport_poll_group_create, which value exactly are you 
>> printing out?
> Here is patch to add dbg print. Pls note that SPDK version is v19.04
> 
> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>                                  SPDK_NOTICELOG("Unable to reserve the 
> full number of buffers for the pg buffer cache.\n");
>                                  break;
>                          }
> +                       SPDK_ERRLOG("%p %d(%d)\n", buf, 
> group->buf_cache_count, group->buf_cache_size);
>                          STAILQ_INSERT_HEAD(&group->buf_cache, buf, link);
>                          group->buf_cache_count++;
>                  }
> 
>>
>> Can you run your target with the -L rdma option to get a dump of the 
>> memory regions registered with the NIC?
> Let me test and get back to you soon.
> 
>>
>> We made a couple of changes to this code when dynamic memory 
>> allocations were added to DPDK. There were some safeguards that we 
>> added to try and make sure this case wouldn't hit, so I'd like to make 
>> sure you are running on the latest DPDK submodule as well as the 
>> latest SPDK to narrow down where we need to look.
> Unfortunately I can't easily update DPDK because other team maintains it 
> internally. But if it can be repro and fixed in latest, I will try to 
> pull in the fix.
> 
>>
>> Thanks,
>>
>> Seth
>>
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng 
>> via SPDK
>> Sent: Wednesday, July 31, 2019 3:00 AM
>> To: spdk(a)lists.01.org
>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA 
>> Memory Regions
>>
>> Hello,
>>
>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally 
>> ran into this errors:
>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over 
>> multiple RDMA Memory Regions"
>>
>> After digging into the code, I found that nvmf_rdma_fill_buffers() 
>> calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB 
>> pages, and if it is the case, it reports this error.
>>
>> The following commit added change to use data buffer start address to 
>> calculate the size between buffer start address and 2MB boundary. The 
>> caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit 
>> size (which is 8KB in my conf) to determine if the buffer passes 2MB 
>> boundary.
>>
>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>
>>       memory: fix contiguous memory calculation for unaligned buffers
>>
>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new 
>> request will use free buffer from that pool and the buffer start 
>> address is passed to nvmf_rdma_fill_buffers(). But I found that these 
>> buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my 
>> case) either, instead, they are 64Byte aligned so that some buffers 
>> will fail the checking and leads to this problem.
>>
>> The corresponding code snippets are as following:
>> spdk_nvmf_transport_create()
>> {
>> ...
>>       transport->data_buf_pool = pdk_mempool_create(spdk_mempool_name,
>>                                  opts->num_shared_buffers,
>>                                  opts->io_unit_size + 
>> NVMF_DATA_BUFFER_ALIGNMENT,
>>                                  SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>                                  SPDK_ENV_SOCKET_ID_ANY); ...
>> }
>>
>> Also some debug print I added shows the start address of the buffers:
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019258800 0(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192557c0 1(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019252780 2(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924f740 3(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924c700 4(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192496c0 5(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019246680 6(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019243640 7(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019240600 8(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001923d5c0 9(32)
>> ...
>>
>> It looks like either the buffer allocation has alignment issue or the 
>> checking is not correct.
>>
>> Please advice how to fix this problem.
>>
>> Thanks,
>> JD Zheng
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-07-31  2:34 Rao, Anu H
  0 siblings, 0 replies; 20+ messages in thread
From: Rao, Anu H @ 2019-07-31  2:34 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 6210 bytes --]

Bobi

Sent from my iPhone

> On Jul 30, 2019, at 5:28 PM, JD Zheng via SPDK <spdk(a)lists.01.org> wrote:
> 
> Hi Seth,
> 
> Thanks for the prompt reply!
> 
> Please find answers inline.
> 
> JD
> 
>> On 7/30/19 5:01 PM, Howell, Seth wrote:
>> Hi JD,
>> Thanks for the report. I want to ask a few questions to start getting to the bottom of this. Since this issue doesn't currently reproduce on our per-patch or nightly tests, I would like to understand what's unique about your setup so that we can replicate it in a per patch test to prevent future regressions.
> I am running it on aarch64 platform. I tried x86 platform and I can see same buffer alignment in memory pool but can't run the real test to reproduce it due to other missing pieces.
> 
>> What options are you passing when you create the rdma transport? Are you creating it over RPC or in a configuration file?
> I am using conf file. Pls let me know if you'd like to look into conf file.
> 
>> Are you using the current DPDK submodule as your environment abstraction layer?
> No. Our project uses specific version of DPDK, which is v18.11. I did quick test using latest and DPDK submodule on x86, and the buffer alignment is the same, i.e. 64B aligned.
> 
>> I notice that your error log is printing from spdk_nvmf_transport_poll_group_create, which value exactly are you printing out?
> Here is patch to add dbg print. Pls note that SPDK version is v19.04
> 
> @@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
>                                SPDK_NOTICELOG("Unable to reserve the full number of buffers for the pg buffer cache.\n");
>                                break;
>                        }
> +                       SPDK_ERRLOG("%p %d(%d)\n", buf, group->buf_cache_count, group->buf_cache_size);
>                        STAILQ_INSERT_HEAD(&group->buf_cache, buf, link);
>                        group->buf_cache_count++;
>                }
> 
>> Can you run your target with the -L rdma option to get a dump of the memory regions registered with the NIC?
> Let me test and get back to you soon.
> 
>> We made a couple of changes to this code when dynamic memory allocations were added to DPDK. There were some safeguards that we added to try and make sure this case wouldn't hit, so I'd like to make sure you are running on the latest DPDK submodule as well as the latest SPDK to narrow down where we need to look.
> Unfortunately I can't easily update DPDK because other team maintains it internally. But if it can be repro and fixed in latest, I will try to pull in the fix.
> 
>> Thanks,
>> Seth
>> -----Original Message-----
>> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng via SPDK
>> Sent: Wednesday, July 31, 2019 3:00 AM
>> To: spdk(a)lists.01.org
>> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
>> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
>> Hello,
>> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally ran into this errors:
>> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over multiple RDMA Memory Regions"
>> After digging into the code, I found that nvmf_rdma_fill_buffers() calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB pages, and if it is the case, it reports this error.
>> The following commit added change to use data buffer start address to calculate the size between buffer start address and 2MB boundary. The caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit size (which is 8KB in my conf) to determine if the buffer passes 2MB boundary.
>> commit 37b7a308941b996f0e69049358a6119ed90d70a2
>> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
>> Date:   Tue Nov 13 17:43:46 2018 +0100
>>      memory: fix contiguous memory calculation for unaligned buffers
>> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new request will use free buffer from that pool and the buffer start address is passed to nvmf_rdma_fill_buffers(). But I found that these buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my case) either, instead, they are 64Byte aligned so that some buffers will fail the checking and leads to this problem.
>> The corresponding code snippets are as following:
>> spdk_nvmf_transport_create()
>> {
>> ...
>>      transport->data_buf_pool = pdk_mempool_create(spdk_mempool_name,
>>                                 opts->num_shared_buffers,
>>                                 opts->io_unit_size + NVMF_DATA_BUFFER_ALIGNMENT,
>>                                 SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>>                                 SPDK_ENV_SOCKET_ID_ANY); ...
>> }
>> Also some debug print I added shows the start address of the buffers:
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019258800 0(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192557c0 1(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019252780 2(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924f740 3(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001924c700 4(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x2000192496c0 5(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019246680 6(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019243640 7(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x200019240600 8(32)
>> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
>> 0x20001923d5c0 9(32)
>> ...
>> It looks like either the buffer allocation has alignment issue or the checking is not correct.
>> Please advice how to fix this problem.
>> Thanks,
>> JD Zheng
>> _______________________________________________
>> SPDK mailing list
>> SPDK(a)lists.01.org
>> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-07-31  0:28 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-07-31  0:28 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 6023 bytes --]

Hi Seth,

Thanks for the prompt reply!

Please find answers inline.

JD

On 7/30/19 5:01 PM, Howell, Seth wrote:
> Hi JD,
> 
> Thanks for the report. I want to ask a few questions to start getting to the bottom of this. Since this issue doesn't currently reproduce on our per-patch or nightly tests, I would like to understand what's unique about your setup so that we can replicate it in a per patch test to prevent future regressions.
I am running it on an aarch64 platform. I tried an x86 platform and I can see the same buffer alignment in the memory pool, but I can't run the real test to reproduce the issue due to other missing pieces.

> 
> What options are you passing when you create the rdma transport? Are you creating it over RPC or in a configuration file?
I am using a conf file. Please let me know if you'd like to look into it.

> 
> Are you using the current DPDK submodule as your environment abstraction layer?
No. Our project uses a specific version of DPDK, which is v18.11. I did a quick test on x86 using the latest SPDK with its DPDK submodule, and the buffer alignment is the same, i.e. 64B aligned.

> 
> I notice that your error log is printing from spdk_nvmf_transport_poll_group_create, which value exactly are you printing out?
Here is the patch that adds the debug print. Please note that the SPDK version is v19.04.

@@ -215,6 +222,7 @@ spdk_nvmf_transport_poll_group_create(st
                                SPDK_NOTICELOG("Unable to reserve the full number of buffers for the pg buffer cache.\n");
                                break;
                        }
+                       SPDK_ERRLOG("%p %d(%d)\n", buf, group->buf_cache_count, group->buf_cache_size);
                        STAILQ_INSERT_HEAD(&group->buf_cache, buf, link);
                        group->buf_cache_count++;
                }

> 
> Can you run your target with the -L rdma option to get a dump of the memory regions registered with the NIC?
Let me test and get back to you soon.

> 
> We made a couple of changes to this code when dynamic memory allocations were added to DPDK. There were some safeguards that we added to try and make sure this case wouldn't hit, so I'd like to make sure you are running on the latest DPDK submodule as well as the latest SPDK to narrow down where we need to look.
Unfortunately, I can't easily update DPDK because another team maintains it internally. But if the issue can be reproduced and fixed on the latest code, I will try to pull in the fix.

> 
> Thanks,
> 
> Seth
> 
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng via SPDK
> Sent: Wednesday, July 31, 2019 3:00 AM
> To: spdk(a)lists.01.org
> Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
> Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
> 
> Hello,
> 
> When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally ran into this errors:
> "rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over multiple RDMA Memory Regions"
> 
> After digging into the code, I found that nvmf_rdma_fill_buffers() calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB pages, and if it is the case, it reports this error.
> 
> The following commit added change to use data buffer start address to calculate the size between buffer start address and 2MB boundary. The caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit size (which is 8KB in my conf) to determine if the buffer passes 2MB boundary.
> 
> commit 37b7a308941b996f0e69049358a6119ed90d70a2
> Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
> Date:   Tue Nov 13 17:43:46 2018 +0100
> 
>       memory: fix contiguous memory calculation for unaligned buffers
> 
> In nvmf_tgt, the buffers are pre-allocated as a memory pool and new request will use free buffer from that pool and the buffer start address is passed to nvmf_rdma_fill_buffers(). But I found that these buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my case) either, instead, they are 64Byte aligned so that some buffers will fail the checking and leads to this problem.
> 
> The corresponding code snippets are as following:
> spdk_nvmf_transport_create()
> {
> ...
>       transport->data_buf_pool = pdk_mempool_create(spdk_mempool_name,
>                                  opts->num_shared_buffers,
>                                  opts->io_unit_size + NVMF_DATA_BUFFER_ALIGNMENT,
>                                  SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
>                                  SPDK_ENV_SOCKET_ID_ANY); ...
> }
> 
> Also some debug print I added shows the start address of the buffers:
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x200019258800 0(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x2000192557c0 1(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x200019252780 2(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x20001924f740 3(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x20001924c700 4(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x2000192496c0 5(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x200019246680 6(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x200019243640 7(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x200019240600 8(32)
> transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*:
> 0x20001923d5c0 9(32)
> ...
> 
> It looks like either the buffer allocation has alignment issue or the checking is not correct.
> 
> Please advice how to fix this problem.
> 
> Thanks,
> JD Zheng
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-07-31  0:01 Howell, Seth
  0 siblings, 0 replies; 20+ messages in thread
From: Howell, Seth @ 2019-07-31  0:01 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 4482 bytes --]

Hi JD,

Thanks for the report. I want to ask a few questions to start getting to the bottom of this. Since this issue doesn't currently reproduce on our per-patch or nightly tests, I would like to understand what's unique about your setup so that we can replicate it in a per-patch test to prevent future regressions.

What options are you passing when you create the rdma transport? Are you creating it over RPC or in a configuration file?

Are you using the current DPDK submodule as your environment abstraction layer?

I notice that your error log is printed from spdk_nvmf_transport_poll_group_create; which value exactly are you printing out?

Can you run your target with the -L rdma option to get a dump of the memory regions registered with the NIC?
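
For reference, I mean an invocation along these lines (the binary path and conf file name below are just placeholders, and the debug log output requires a build configured with --enable-debug):

./app/nvmf_tgt/nvmf_tgt -c /path/to/nvmf.conf -L rdma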

We made a couple of changes to this code when dynamic memory allocations were added to DPDK. There were some safeguards that we added to try to make sure this case wouldn't be hit, so I'd like to make sure you are running on the latest DPDK submodule as well as the latest SPDK to narrow down where we need to look.

Thanks,

Seth

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of JD Zheng via SPDK
Sent: Wednesday, July 31, 2019 3:00 AM
To: spdk(a)lists.01.org
Cc: JD Zheng <jiandong.zheng(a)broadcom.com>
Subject: [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions

Hello,

When I run nvmf_tgt over RDMA using latest SPDK code, I occasionally ran into this errors:
"rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over multiple RDMA Memory Regions"

After digging into the code, I found that nvmf_rdma_fill_buffers() calls spdk_mem_map_translate() to check if a data buffer sit on 2 2MB pages, and if it is the case, it reports this error.

The following commit added change to use data buffer start address to calculate the size between buffer start address and 2MB boundary. The caller nvmf_rdma_fill_buffers() uses the size to compare with IO Unit size (which is 8KB in my conf) to determine if the buffer passes 2MB boundary.

commit 37b7a308941b996f0e69049358a6119ed90d70a2
Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
Date:   Tue Nov 13 17:43:46 2018 +0100

     memory: fix contiguous memory calculation for unaligned buffers

In nvmf_tgt, the buffers are pre-allocated as a memory pool and new request will use free buffer from that pool and the buffer start address is passed to nvmf_rdma_fill_buffers(). But I found that these buffers are not 2MB aligned and not IOUnitSize aligned (8KB in my case) either, instead, they are 64Byte aligned so that some buffers will fail the checking and leads to this problem.

The corresponding code snippets are as following:
spdk_nvmf_transport_create()
{
...
     transport->data_buf_pool = pdk_mempool_create(spdk_mempool_name,
                                opts->num_shared_buffers,
                                opts->io_unit_size + NVMF_DATA_BUFFER_ALIGNMENT,
                                SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
                                SPDK_ENV_SOCKET_ID_ANY); ...
}

Also some debug print I added shows the start address of the buffers:
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019258800 0(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x2000192557c0 1(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019252780 2(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x20001924f740 3(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x20001924c700 4(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x2000192496c0 5(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019246680 6(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019243640 7(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019240600 8(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x20001923d5c0 9(32)
...

It looks like either the buffer allocation has alignment issue or the checking is not correct.

Please advice how to fix this problem.

Thanks,
JD Zheng
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions
@ 2019-07-30 18:59 JD Zheng
  0 siblings, 0 replies; 20+ messages in thread
From: JD Zheng @ 2019-07-30 18:59 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 2963 bytes --]

Hello,

When I run nvmf_tgt over RDMA using the latest SPDK code, I occasionally run 
into this error:
"rdma.c:1505:nvmf_rdma_fill_buffers: *ERROR*: Data buffer split over 
multiple RDMA Memory Regions"

After digging into the code, I found that nvmf_rdma_fill_buffers() 
calls spdk_mem_map_translate() to check whether a data buffer sits on two 2MB 
pages, and if that is the case, it reports this error.
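
Roughly, the check amounts to something like the following (my paraphrase rather than the exact rdma.c code; map, buf, and io_unit_size stand in for the real variables, and the translation/lkey returned by the call is not shown):

uint64_t translation_len = io_unit_size;

/* spdk_mem_map_translate() trims translation_len down to the number of bytes
 * starting at buf that share a single translation, i.e. one registered region. */
spdk_mem_map_translate(map, (uint64_t)buf, &translation_len);

if (translation_len < io_unit_size) {
        /* The buffer does not fit inside one 2MB page / memory region. */
        SPDK_ERRLOG("Data buffer split over multiple RDMA Memory Regions\n");
}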

The following commit added a change that uses the data buffer start address to 
calculate the size between the buffer start address and the 2MB boundary. The 
caller nvmf_rdma_fill_buffers() compares that size with the IO unit 
size (which is 8KB in my conf) to determine whether the buffer crosses the 2MB 
boundary.

commit 37b7a308941b996f0e69049358a6119ed90d70a2
Author: Darek Stojaczyk <dariusz.stojaczyk(a)intel.com>
Date:   Tue Nov 13 17:43:46 2018 +0100

     memory: fix contiguous memory calculation for unaligned buffers

In nvmf_tgt, the buffers are pre-allocated as a memory pool; a new 
request takes a free buffer from that pool, and the buffer start address 
is passed to nvmf_rdma_fill_buffers(). But I found that these buffers 
are neither 2MB aligned nor IOUnitSize aligned (8KB in my case); 
instead, they are only 64-byte aligned, so some buffers fail the 
check, which leads to this problem.

The corresponding code snippet is as follows:
spdk_nvmf_transport_create()
{
...
     transport->data_buf_pool = spdk_mempool_create(spdk_mempool_name,
                                opts->num_shared_buffers,
                                opts->io_unit_size + NVMF_DATA_BUFFER_ALIGNMENT,
                                SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
                                SPDK_ENV_SOCKET_ID_ANY);
...
}

Also, some debug prints I added show the start addresses of the buffers:
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019258800 0(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x2000192557c0 1(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019252780 2(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x20001924f740 3(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x20001924c700 4(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x2000192496c0 5(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019246680 6(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019243640 7(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x200019240600 8(32)
transport.c: 218:spdk_nvmf_transport_poll_group_create: *ERROR*: 
0x20001923d5c0 9(32)
...
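
As a rough, self-contained illustration of why this eventually trips (plain arithmetic using the addresses above; the element count of 512 is arbitrary): consecutive pool elements are 0x3040 bytes apart, which is 64B aligned but does not divide 2MB evenly, so the start offsets drift within the hugepage and roughly one element in every 2MB / 0x3040 (about 170) ends up straddling a boundary:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t base = 0x200019258800ULL; /* first buffer address printed above */
        uint64_t stride = 0x3040;          /* spacing between consecutive buffers */
        uint64_t two_mb = 2ULL * 1024 * 1024;
        uint64_t io_unit = 8192;           /* IOUnitSize from my conf */
        unsigned i;

        for (i = 0; i < 512; i++) {
                uint64_t addr = base - (uint64_t)i * stride; /* addresses decrease in the log */
                uint64_t left_in_page = two_mb - (addr & (two_mb - 1));

                if (left_in_page < io_unit) {
                        printf("buffer %u at 0x%" PRIx64 " would cross a 2MB boundary\n",
                               i, addr);
                }
        }
        return 0;
}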

It looks like either the buffer allocation has an alignment issue or the 
check is not correct.

Please advise how to fix this problem.

Thanks,
JD Zheng

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-08-21 13:15 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-19 21:16 [SPDK] nvmf_tgt *ERROR*: Data buffer split over multiple RDMA Memory Regions Howell, Seth
  -- strict thread matches above, loose matches on Subject: below --
2019-08-21 13:15 Sasha Kotchubievsky
2019-08-20 14:39 Howell, Seth
2019-08-20 14:15 Howell, Seth
2019-08-20 12:22 Sasha Kotchubievsky
2019-08-19 21:42 JD Zheng
2019-08-19 21:02 JD Zheng
2019-08-19 20:12 Howell, Seth
2019-08-12 23:17 JD Zheng
2019-08-01 21:22 Howell, Seth
2019-08-01 21:00 JD Zheng
2019-08-01 20:28 Howell, Seth
2019-08-01 18:23 JD Zheng
2019-08-01 14:52 Howell, Seth
2019-08-01 12:26 Howell, Seth
2019-07-31 22:13 JD Zheng
2019-07-31  2:34 Rao, Anu H
2019-07-31  0:28 JD Zheng
2019-07-31  0:01 Howell, Seth
2019-07-30 18:59 JD Zheng
