linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "lizhijian@fujitsu.com" <lizhijian@fujitsu.com>
To: Tom Talpey <tom@talpey.com>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"zyjzyj2000@gmail.com" <zyjzyj2000@gmail.com>,
	"jgg@ziepe.ca" <jgg@ziepe.ca>,
	"aharonl@nvidia.com" <aharonl@nvidia.com>,
	"leon@kernel.org" <leon@kernel.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"mbloch@nvidia.com" <mbloch@nvidia.com>,
	"liangwenpeng@huawei.com" <liangwenpeng@huawei.com>,
	"yangx.jy@fujitsu.com" <yangx.jy@fujitsu.com>,
	"rpearsonhpe@gmail.com" <rpearsonhpe@gmail.com>,
	"y-goto@fujitsu.com" <y-goto@fujitsu.com>
Subject: Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
Date: Tue, 4 Jan 2022 08:51:06 +0000	[thread overview]
Message-ID: <33284887-163e-6d6f-a3b8-c9c9b597d160@fujitsu.com> (raw)
In-Reply-To: <17122432-1d05-7a88-5d06-840cd69e57e9@talpey.com>



On 31/12/2021 10:32, Tom Talpey wrote:
>
> On 12/30/2021 8:37 PM, lizhijian@fujitsu.com wrote
>>>> +static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
>>>> +{
>>>> +    int            err;
>>>> +    int            bytes;
>>>> +    u8            *va;
>>>> +    struct rxe_map        **map;
>>>> +    struct rxe_phys_buf    *buf;
>>>> +    int            m;
>>>> +    int            i;
>>>> +    size_t            offset;
>>>> +
>>>> +    if (length == 0)
>>>> +        return 0;
>>>
>>> The length is only relevant when the flush type is "Memory Region
>>> Range".
>>>
>>> When the flush type is "Memory Region", the entire region must be
>>> flushed successfully before completing the operation.
>>
>> Yes, currently, the length has been expanded to the MR's length in such case.
>
> I'll dig deeper, but this is not immediately clear in this context.
> A zero-length operation is however valid, even though it's a no-op,

If it's no-op, what shall we do in this case.


> the application may use it to ensure certain ordering constraints.
> Will comment later if I have a specific concern.

kindly welcome :)




>
>>>> +
>>>> +    if (mr->type == IB_MR_TYPE_DMA) {
>>>> +        arch_wb_cache_pmem((void *)iova, length);
>>>> +        return 0;
>>>> +    }
>>>
>>> Are dmamr's supported for remote access? I thought that was
>>> prevented on first principles now. I might suggest not allowing
>>> them to be flushed in any event. There is no length restriction,
>>> and it's a VERY costly operation. At a minimum, protect this
>>> closely.
>> Indeed, I didn't confidence about this, the main logic comes from rxe_mr_copy()
>> Thanks for the suggestion.
>
> Well, be careful when following semantics from local behaviors. When you
> are processing a command from the wire, extreme caution is needed.
> ESPECIALLY WHEN IT'S A DMA_MR, WHICH PROVIDES ACCESS TO ALL PHYSICAL!!!
>
> Sorry for shouting. :)

Never mind :)





>
>>>> +
>>>> +static enum resp_states process_flush(struct rxe_qp *qp,
>>>> +                       struct rxe_pkt_info *pkt)
>>>> +{
>>>> +    u64 length = 0, start = qp->resp.va;
>>>> +    u32 sel = feth_sel(pkt);
>>>> +    u32 plt = feth_plt(pkt);
>>>> +    struct rxe_mr *mr = qp->resp.mr;
>>>> +
>>>> +    if (sel == IB_EXT_SEL_MR_RANGE)
>>>> +        length = qp->resp.length;
>>>> +    else if (sel == IB_EXT_SEL_MR_WHOLE)
>>>> +        length = mr->cur_map_set->length;
>>>
>>> I'm going to have to think about these
>>
>> Yes, you inspire me that we should consider to adjust the start of iova to the MR's start as well.
>
> I'll review again when you decide.
>>>> +
>>>> +    if (plt == IB_EXT_PLT_PERSIST) {
>>>> +        nvdimm_flush_iova(mr, start, length);
>>>> +        wmb(); // clwb follows by a sfence
>>>> +    } else if (plt == IB_EXT_PLT_GLB_VIS)
>>>> +        wmb(); // sfence is enough
>>>
>>> The persistence and global visibility bits are not mutually
>>> exclusive,
>> My bad, it ever appeared in my mind. o(╯□╰)o
>>
>>
>>
>>
>>> and in fact persistence does not imply global
>>> visibility in some platforms.
>> If so, and per the SPEC, why not
>>      if (plt & IB_EXT_PLT_PERSIST)
>>         do_somethingA();
>>      if (plt & IB_EXT_PLT_GLB_VIS)
>>         do_somethingB();
>
> At the simplest, yes. But there are many subtle other possibilities.
>
> The global visibility is oriented toward supporting distributed
> shared memory workloads, or publish/subscribe on high scale systems.
> It's mainly about ensuring that all devices and CPUs see the data,
> something ordinary RDMA Write does not guarantee. This often means
> flushing write pipelines, possibly followed by invalidating caches.
>
> The persistence, of course, is about ensuring the data is stored. This
> is entirely different from making it visible.
>
> In fact, you often want to optimize the two scenarios together, since
> these operations are all about optimizing latency. The choice is going
> to depend greatly on the behavior of the platform itself. For example,
> certain caches provide persistence when batteries are present. So,
> providing persistence and providing visibility are one and the same.
> No need to do that twice.
>
> On the other hand, some systems may provide persistence only after a
> significant cost, such as writing into flash from a DRAM buffer, or
> when an Optane DIMM becomes overloaded from continuous writes and
> begins to throttle them. The flush may need some rather intricate
> processing, to avoid deadlock.
>
> Your code does not appear to envision these, so the simple way might
> be best for now. But you should consider other situations.

Thanks a lot for your detailed explanation.



>
>>> Second, the "clwb" and "sfence" comments are completely
>>> Intel-specific.
>> good catch.
>>
>>
>>> What processing will be done on other
>>> processor platforms???
>>
>> I didn't dig other ARCH yet but INTEL.
>> In this function, i think i just need to call the higher level wrapper, like wmb() and
>> arch_wb_cache_pmem are enough, right ?
>
> Well, higher level wrappers may signal errors, for example if they're
> not available or unreliable, and you will need to handle them. Also,
> they may block. Is that ok in this context?

Good question.

My bad, i forgot to check to return value of nvdimm_flush_iova() previously.

But if you mean we should guarantee arch_wb_cache_pmem() and wmb(), i have not
idea yet.

arch_wb_cache_pmem() and wmb(), What i'm currently using are just to hide
the assembly instructions on different architectures. And they are void return.

I wonder if we can always assume they work always When the code reaches them.
Since all current local flushing to pmem do something like that AFAIK(Feel free
to correct me if i'm wrong)

Thanks
Zhijian

>
> You need to get this right on all platforms which will run this code.
> It is not acceptable to guess at whether the data is in the required
> state before completing the RDMA_FLUSH. If this is not guaranteed,
> the operation must raise the required error. To do anything else will
> violate the protocol contract, and the remote application may fail.


>
> Tom.

  reply	other threads:[~2022-01-04  8:51 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-28  8:07 [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem Li Zhijian
2022-01-06  0:21   ` Jason Gunthorpe
2022-01-06  6:12     ` lizhijian
2022-01-14  8:10       ` Li, Zhijian
2022-01-27 22:30         ` Jeff Moyer
2022-01-16 18:11       ` Dan Williams
2022-01-18  8:55         ` lizhijian
2022-01-18 15:28           ` Dan Williams
2022-01-19  2:01             ` lizhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 02/10] RDMA: Allow registering MR with flush access flags Li Zhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 03/10] RDMA/rxe: Allow registering FLUSH flags for supported device only Li Zhijian
2022-01-06  0:22   ` Jason Gunthorpe
2022-01-06  6:20     ` lizhijian
2022-01-13  6:43   ` lizhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 04/10] RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device Li Zhijian
2022-01-06  0:22   ` Jason Gunthorpe
2022-01-06  6:26     ` lizhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 05/10] RDMA/rxe: Allow registering persistent flag for pmem MR only Li Zhijian
2021-12-30 22:25   ` Tom Talpey
2021-12-31  3:34     ` lizhijian
2021-12-31 14:40       ` Tom Talpey
2022-01-04  1:32         ` lizhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 06/10] RDMA/rxe: Implement RC RDMA FLUSH service in requester side Li Zhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 07/10] RDMA/rxe: Set BTH's SE to zero for FLUSH packet Li Zhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side Li Zhijian
2021-12-30 22:18   ` Tom Talpey
2021-12-31  1:37     ` lizhijian
2021-12-31  2:32       ` Tom Talpey
2022-01-04  8:51         ` lizhijian [this message]
2022-01-04 16:02           ` Tom Talpey
2022-01-06  0:35         ` Jason Gunthorpe
2022-01-04 16:40   ` Tom Talpey
2022-01-05  1:43     ` lizhijian
2022-01-06  0:28   ` Jason Gunthorpe
2022-01-06  6:42     ` lizhijian
2022-01-06 17:33       ` Jason Gunthorpe
2022-01-10  5:45         ` lizhijian
2022-01-10 14:34           ` Jason Gunthorpe
2022-01-11  5:34             ` lizhijian
2022-01-11 20:48               ` Jason Gunthorpe
2022-01-12  9:50                 ` lizhijian
2022-01-12 13:12                   ` Jason Gunthorpe
2022-01-13  6:29                     ` lizhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 09/10] RDMA/rxe: Implement flush completion Li Zhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 10/10] RDMA/rxe: Add RD FLUSH service support Li Zhijian
2021-12-29  8:49 ` [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Gromadzki, Tomasz
2021-12-29 14:35   ` Tom Talpey
2021-12-31  1:10     ` lizhijian

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=33284887-163e-6d6f-a3b8-c9c9b597d160@fujitsu.com \
    --to=lizhijian@fujitsu.com \
    --cc=aharonl@nvidia.com \
    --cc=jgg@ziepe.ca \
    --cc=leon@kernel.org \
    --cc=liangwenpeng@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=mbloch@nvidia.com \
    --cc=rpearsonhpe@gmail.com \
    --cc=tom@talpey.com \
    --cc=y-goto@fujitsu.com \
    --cc=yangx.jy@fujitsu.com \
    --cc=zyjzyj2000@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).