Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side

From: "lizhijian@fujitsu.com" <lizhijian@fujitsu.com>
To: Tom Talpey <tom@talpey.com>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"zyjzyj2000@gmail.com" <zyjzyj2000@gmail.com>,
	"jgg@ziepe.ca" <jgg@ziepe.ca>,
	"aharonl@nvidia.com" <aharonl@nvidia.com>,
	"leon@kernel.org" <leon@kernel.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"mbloch@nvidia.com" <mbloch@nvidia.com>,
	"liangwenpeng@huawei.com" <liangwenpeng@huawei.com>,
	"yangx.jy@fujitsu.com" <yangx.jy@fujitsu.com>,
	"rpearsonhpe@gmail.com" <rpearsonhpe@gmail.com>,
	"y-goto@fujitsu.com" <y-goto@fujitsu.com>
Subject: Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
Date: Tue, 4 Jan 2022 08:51:06 +0000	[thread overview]
Message-ID: <33284887-163e-6d6f-a3b8-c9c9b597d160@fujitsu.com> (raw)
In-Reply-To: <17122432-1d05-7a88-5d06-840cd69e57e9@talpey.com>

On 31/12/2021 10:32, Tom Talpey wrote:
>
> On 12/30/2021 8:37 PM, lizhijian@fujitsu.com wrote
>>>> +static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
>>>> +{
>>>> +    int            err;
>>>> +    int            bytes;
>>>> +    u8            *va;
>>>> +    struct rxe_map        **map;
>>>> +    struct rxe_phys_buf    *buf;
>>>> +    int            m;
>>>> +    int            i;
>>>> +    size_t            offset;
>>>> +
>>>> +    if (length == 0)
>>>> +        return 0;
>>>
>>> The length is only relevant when the flush type is "Memory Region
>>> Range".
>>>
>>> When the flush type is "Memory Region", the entire region must be
>>> flushed successfully before completing the operation.
>>
>> Yes, currently, the length has been expanded to the MR's length in such case.
>
> I'll dig deeper, but this is not immediately clear in this context.
> A zero-length operation is however valid, even though it's a no-op,

If it's no-op, what shall we do in this case.

> the application may use it to ensure certain ordering constraints.
> Will comment later if I have a specific concern.

kindly welcome :)

>
>>>> +
>>>> +    if (mr->type == IB_MR_TYPE_DMA) {
>>>> +        arch_wb_cache_pmem((void *)iova, length);
>>>> +        return 0;
>>>> +    }
>>>
>>> Are dmamr's supported for remote access? I thought that was
>>> prevented on first principles now. I might suggest not allowing
>>> them to be flushed in any event. There is no length restriction,
>>> and it's a VERY costly operation. At a minimum, protect this
>>> closely.
>> Indeed, I didn't confidence about this, the main logic comes from rxe_mr_copy()
>> Thanks for the suggestion.
>
> Well, be careful when following semantics from local behaviors. When you
> are processing a command from the wire, extreme caution is needed.
> ESPECIALLY WHEN IT'S A DMA_MR, WHICH PROVIDES ACCESS TO ALL PHYSICAL!!!
>
> Sorry for shouting. :)

Never mind :)

>
>>>> +
>>>> +static enum resp_states process_flush(struct rxe_qp *qp,
>>>> +                       struct rxe_pkt_info *pkt)
>>>> +{
>>>> +    u64 length = 0, start = qp->resp.va;
>>>> +    u32 sel = feth_sel(pkt);
>>>> +    u32 plt = feth_plt(pkt);
>>>> +    struct rxe_mr *mr = qp->resp.mr;
>>>> +
>>>> +    if (sel == IB_EXT_SEL_MR_RANGE)
>>>> +        length = qp->resp.length;
>>>> +    else if (sel == IB_EXT_SEL_MR_WHOLE)
>>>> +        length = mr->cur_map_set->length;
>>>
>>> I'm going to have to think about these
>>
>> Yes, you inspire me that we should consider to adjust the start of iova to the MR's start as well.
>
> I'll review again when you decide.
>>>> +
>>>> +    if (plt == IB_EXT_PLT_PERSIST) {
>>>> +        nvdimm_flush_iova(mr, start, length);
>>>> +        wmb(); // clwb follows by a sfence
>>>> +    } else if (plt == IB_EXT_PLT_GLB_VIS)
>>>> +        wmb(); // sfence is enough
>>>
>>> The persistence and global visibility bits are not mutually
>>> exclusive,
>> My bad, it ever appeared in my mind. o(╯□╰)o
>>
>>
>>
>>
>>> and in fact persistence does not imply global
>>> visibility in some platforms.
>> If so, and per the SPEC, why not
>>      if (plt & IB_EXT_PLT_PERSIST)
>>         do_somethingA();
>>      if (plt & IB_EXT_PLT_GLB_VIS)
>>         do_somethingB();
>
> At the simplest, yes. But there are many subtle other possibilities.
>
> The global visibility is oriented toward supporting distributed
> shared memory workloads, or publish/subscribe on high scale systems.
> It's mainly about ensuring that all devices and CPUs see the data,
> something ordinary RDMA Write does not guarantee. This often means
> flushing write pipelines, possibly followed by invalidating caches.
>
> The persistence, of course, is about ensuring the data is stored. This
> is entirely different from making it visible.
>
> In fact, you often want to optimize the two scenarios together, since
> these operations are all about optimizing latency. The choice is going
> to depend greatly on the behavior of the platform itself. For example,
> certain caches provide persistence when batteries are present. So,
> providing persistence and providing visibility are one and the same.
> No need to do that twice.
>
> On the other hand, some systems may provide persistence only after a
> significant cost, such as writing into flash from a DRAM buffer, or
> when an Optane DIMM becomes overloaded from continuous writes and
> begins to throttle them. The flush may need some rather intricate
> processing, to avoid deadlock.
>
> Your code does not appear to envision these, so the simple way might
> be best for now. But you should consider other situations.

Thanks a lot for your detailed explanation.

>
>>> Second, the "clwb" and "sfence" comments are completely
>>> Intel-specific.
>> good catch.
>>
>>
>>> What processing will be done on other
>>> processor platforms???
>>
>> I didn't dig other ARCH yet but INTEL.
>> In this function, i think i just need to call the higher level wrapper, like wmb() and
>> arch_wb_cache_pmem are enough, right ?
>
> Well, higher level wrappers may signal errors, for example if they're
> not available or unreliable, and you will need to handle them. Also,
> they may block. Is that ok in this context?

Good question.

My bad, i forgot to check to return value of nvdimm_flush_iova() previously.

But if you mean we should guarantee arch_wb_cache_pmem() and wmb(), i have not
idea yet.

arch_wb_cache_pmem() and wmb(), What i'm currently using are just to hide
the assembly instructions on different architectures. And they are void return.

I wonder if we can always assume they work always When the code reaches them.
Since all current local flushing to pmem do something like that AFAIK(Feel free
to correct me if i'm wrong)

Thanks
Zhijian

>
> You need to get this right on all platforms which will run this code.
> It is not acceptable to guess at whether the data is in the required
> state before completing the RDMA_FLUSH. If this is not guaranteed,
> the operation must raise the required error. To do anything else will
> violate the protocol contract, and the remote application may fail.

>
> Tom.