nvdimm.lists.linux.dev archive mirror
* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
       [not found]             ` <ZivS86BrfPHopkru@memverge.com>
@ 2024-04-28  5:47               ` Dongsheng Yang
  2024-04-28 16:44                 ` Gregory Price
                                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Dongsheng Yang @ 2024-04-28  5:47 UTC (permalink / raw)
  To: Gregory Price, Dan Williams, John Groves
  Cc: axboe, linux-block, linux-kernel, linux-cxl, nvdimm



On 2024/4/27 Saturday 12:14 AM, Gregory Price wrote:
> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
>>
>>
>> On 2024/4/26 Friday 9:48 PM, Gregory Price wrote:
>>>
>>
>> In (5) of the cover letter, I mentioned that cbd addresses cache coherence
>> at the software level:
>>
>> (5) How do blkdev and backend interact through the channel?
>> 	a) For reader side, before reading the data, if the data in this channel
>> may be modified by the other party, then I need to flush the cache before
>> reading to ensure that I get the latest data. For example, the blkdev needs
>> to flush the cache before obtaining compr_head because compr_head will be
>> updated by the backend handler.
>> 	b) For writer side, if the written information will be read by others,
>> then after writing, I need to flush the cache to let the other party see it
>> immediately. For example, after blkdev submits cbd_se, it needs to update
>> cmd_head to let the handler have a new cbd_se. Therefore, after updating
>> cmd_head, I need to flush the cache to let the backend see it.
>>
> 
> Flushing the cache is insufficient.  All that cache flushing guarantees
> is that the memory has left the writer's CPU cache.  There are potentially
> many write buffers between the CPU and the actual backing media that the
> CPU has no visibility of and cannot pierce through to force a full
> guaranteed flush back to the media.
> 
> for example:
> 
> memcpy(some_cacheline, data, 64);
> mfence();
> 
> Will not guarantee that after mfence() completes that the remote host
> will have visibility of the data.  mfence() does not guarantee a full
> flush back down to the device, it only guarantees it has been pushed out
> of the CPU's cache.
> 
> similarly:
> 
> memcpy(some_cacheline, data, 64);
> mfence();
> memcpy(some_other_cacheline, data, 64);
> mfence()
> 
> Will not guarantee that some_cacheline reaches the backing media prior
> to some_other_cacheline, as there is no guarantee of write-ordering in
> CXL controllers (with the exception of writes to the same cacheline).
> 
> So this statement:
> 
>> I need to flush the cache to let the other party see it immediately.
> 
> Is misleading.  They will not see it "immediately", they will see it
> "eventually at some completely unknowable time in the future".

This is indeed one of the issues I wanted to discuss at the RFC stage. 
Thank you for pointing it out.

In my opinion, using "nvdimm_flush" might be one way to address this 
issue, but it seems to flush the entire nd_region, which might be too 
heavy. Moreover, it only applies to non-volatile memory.

This should be a general problem for CXL shared memory. In theory, FAMFS 
should also encounter this issue.

Gregory, John, and Dan, any suggestions about it?

Thanx a lot
> 
> ~Gregory
> 


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-28  5:47               ` [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dongsheng Yang
@ 2024-04-28 16:44                 ` Gregory Price
  2024-04-28 16:55                 ` John Groves
  2024-04-30  0:34                 ` Dan Williams
  2 siblings, 0 replies; 16+ messages in thread
From: Gregory Price @ 2024-04-28 16:44 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Dan Williams, John Groves, axboe, linux-block, linux-kernel,
	linux-cxl, nvdimm

On Sun, Apr 28, 2024 at 01:47:29PM +0800, Dongsheng Yang wrote:
> 
> 
> On 2024/4/27 Saturday 12:14 AM, Gregory Price wrote:
> > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> > > 
> > > 
> > > On 2024/4/26 Friday 9:48 PM, Gregory Price wrote:
> > > > 
> > > 
> > > In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> > > at the software level:
> > > 
> > > (5) How do blkdev and backend interact through the channel?
> > > 	a) For reader side, before reading the data, if the data in this channel
> > > may be modified by the other party, then I need to flush the cache before
> > > reading to ensure that I get the latest data. For example, the blkdev needs
> > > to flush the cache before obtaining compr_head because compr_head will be
> > > updated by the backend handler.
> > > 	b) For writter side, if the written information will be read by others,
> > > then after writing, I need to flush the cache to let the other party see it
> > > immediately. For example, after blkdev submits cbd_se, it needs to update
> > > cmd_head to let the handler have a new cbd_se. Therefore, after updating
> > > cmd_head, I need to flush the cache to let the backend see it.
> > > 
> > 
> > Flushing the cache is insufficient.  All that cache flushing guarantees
> > is that the memory has left the writer's CPU cache.  There are potentially
> > many write buffers between the CPU and the actual backing media that the
> > CPU has no visibility of and cannot pierce through to force a full
> > guaranteed flush back to the media.
> > 
> > for example:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > 
> > Will not guarantee that after mfence() completes that the remote host
> > will have visibility of the data.  mfence() does not guarantee a full
> > flush back down to the device, it only guarantees it has been pushed out
> > of the CPU's cache.
> > 
> > similarly:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > memcpy(some_other_cacheline, data, 64);
> > mfence()
> > 

just a derp here, meant to add an explicit clflush(some_cacheline)
between the copy and the mfence.  But the result is the same.
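
Spelled out, the corrected sequence would be roughly (illustrative
pseudocode only, mirroring the example above):

memcpy(some_cacheline, data, 64);
clflush(some_cacheline);  /* push the line out of this CPU's cache */
mfence();                 /* order the flush, nothing more */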

> > Will not guarantee that some_cacheline reaches the backing media prior
> > to some_other_cacheline, as there is no guarantee of write-ordering in
> > CXL controllers (with the exception of writes to the same cacheline).
> > 
> > So this statement:
> > 
> > > I need to flush the cache to let the other party see it immediately.
> > 
> > Is misleading.  They will not see is "immediately", they will see it
> > "eventually at some completely unknowable time in the future".
> 
> This is indeed one of the issues I wanted to discuss at the RFC stage. Thank
> you for pointing it out.
> 
> In my opinion, using "nvdimm_flush" might be one way to address this issue,
> but it seems to flush the entire nd_region, which might be too heavy.
> Moreover, it only applies to non-volatile memory.
> 

The problem is that the coherence domain really ends at the root
complex, and from the perspective of any one host the data is coherent.

Flushing only guarantees it gets pushed out from that domain, but does
not guarantee anything south of it.

Flushing semantics that don't puncture through the root complex won't
help.

>
> This should be a general problem for cxl shared memory. In theory, FAMFS
> should also encounter this issue.
> 
> Gregory, John, and Dan, Any suggestion about it?
> 
> Thanx a lot
> > 
> > ~Gregory
> > 


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-28  5:47               ` [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dongsheng Yang
  2024-04-28 16:44                 ` Gregory Price
@ 2024-04-28 16:55                 ` John Groves
  2024-05-03  9:52                   ` Jonathan Cameron
  2024-04-30  0:34                 ` Dan Williams
  2 siblings, 1 reply; 16+ messages in thread
From: John Groves @ 2024-04-28 16:55 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Gregory Price, Dan Williams, axboe, linux-block, linux-kernel,
	linux-cxl, nvdimm

On 24/04/28 01:47PM, Dongsheng Yang wrote:
> 
> 
> On 2024/4/27 Saturday 12:14 AM, Gregory Price wrote:
> > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> > > 
> > > 
> > > On 2024/4/26 Friday 9:48 PM, Gregory Price wrote:
> > > > 
> > > 
> > > In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> > > at the software level:
> > > 
> > > (5) How do blkdev and backend interact through the channel?
> > > 	a) For reader side, before reading the data, if the data in this channel
> > > may be modified by the other party, then I need to flush the cache before
> > > reading to ensure that I get the latest data. For example, the blkdev needs
> > > to flush the cache before obtaining compr_head because compr_head will be
> > > updated by the backend handler.
> > > 	b) For writter side, if the written information will be read by others,
> > > then after writing, I need to flush the cache to let the other party see it
> > > immediately. For example, after blkdev submits cbd_se, it needs to update
> > > cmd_head to let the handler have a new cbd_se. Therefore, after updating
> > > cmd_head, I need to flush the cache to let the backend see it.
> > > 
> > 
> > Flushing the cache is insufficient.  All that cache flushing guarantees
> > is that the memory has left the writer's CPU cache.  There are potentially
> > many write buffers between the CPU and the actual backing media that the
> > CPU has no visibility of and cannot pierce through to force a full
> > guaranteed flush back to the media.
> > 
> > for example:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > 
> > Will not guarantee that after mfence() completes that the remote host
> > will have visibility of the data.  mfence() does not guarantee a full
> > flush back down to the device, it only guarantees it has been pushed out
> > of the CPU's cache.
> > 
> > similarly:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > memcpy(some_other_cacheline, data, 64);
> > mfence()
> > 
> > Will not guarantee that some_cacheline reaches the backing media prior
> > to some_other_cacheline, as there is no guarantee of write-ordering in
> > CXL controllers (with the exception of writes to the same cacheline).
> > 
> > So this statement:
> > 
> > > I need to flush the cache to let the other party see it immediately.
> > 
> > Is misleading.  They will not see is "immediately", they will see it
> > "eventually at some completely unknowable time in the future".
> 
> This is indeed one of the issues I wanted to discuss at the RFC stage. Thank
> you for pointing it out.
> 
> In my opinion, using "nvdimm_flush" might be one way to address this issue,
> but it seems to flush the entire nd_region, which might be too heavy.
> Moreover, it only applies to non-volatile memory.
> 
> This should be a general problem for cxl shared memory. In theory, FAMFS
> should also encounter this issue.
> 
> Gregory, John, and Dan, Any suggestion about it?
> 
> Thanx a lot
> > 
> > ~Gregory
> > 

Hi Dongsheng,

Gregory is right about the uncertainty around "clflush" operations, but
let me drill in a bit further.

Say you copy a payload into a "bucket" in a queue and then update an
index in a metadata structure; I'm thinking of the standard producer/
consumer queuing model here, with one index mutated by the producer and
the other mutated by the consumer. 

(I have not reviewed your queueing code, but you *must* be using this
model - things like linked-lists won't work in shared memory without 
shared locks/atomics.)

Normal logic says that you should clflush the payload before updating
the index, then update and clflush the index.

But we still observe in non-cache-coherent shared memory that the payload 
may become valid *after* the clflush of the queue index.

The famfs user space has a program called pcq.c, which implements a
producer/consumer queue in a pair of famfs files. The only way to 
currently guarantee a valid read of a payload is to use sequence numbers 
and checksums on payloads.  We do observe mismatches with actual shared 
memory, and the recovery is to clflush and re-read the payload from the 
client side. (Aside: These file pairs theoretically might work for CBD 
queues.)
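
To make that concrete, here is a minimal sketch of the idea (this is not
the actual pcq.c code; flush_range(), invalidate_range() and crc32() stand
in for whatever platform flush/invalidate and checksum primitives are used):

#define PAYLOAD_SIZE 4096               /* example payload size */

struct pcq_entry {
        uint64_t seq;                   /* sequence number, published last */
        uint32_t crc;                   /* checksum over the payload */
        uint8_t  payload[PAYLOAD_SIZE];
};

/* producer: flush payload and crc before publishing the sequence number */
void pcq_produce(struct pcq_entry *e, const void *data, uint64_t seq)
{
        memcpy(e->payload, data, PAYLOAD_SIZE);
        e->crc = crc32(e->payload, PAYLOAD_SIZE);
        flush_range(e->payload, PAYLOAD_SIZE);
        flush_range(&e->crc, sizeof(e->crc));
        e->seq = seq;                   /* publish */
        flush_range(&e->seq, sizeof(e->seq));
}

/* consumer: trust the payload only if both seq and crc check out;
 * on a mismatch, invalidate and let the caller re-read - that is the
 * recovery described above */
bool pcq_consume(struct pcq_entry *e, void *out, uint64_t expect)
{
        invalidate_range(e, sizeof(*e));
        if (e->seq != expect)
                return false;
        memcpy(out, e->payload, PAYLOAD_SIZE);
        if (crc32(out, PAYLOAD_SIZE) != e->crc) {
                invalidate_range(e, sizeof(*e));
                return false;
        }
        return true;
}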

Another side note: it would be super-helpful if the CPU gave us an explicit 
invalidate rather than just clflush, which will write back before 
invalidating *if* the cache line is marked as dirty, even when software
knows this should not happen.

Note that CXL 3.1 provides a way to guarantee that stuff that should not
be written back can't be written back: read-only mappings. This one of
the features I got into the spec; using this requires CXL 3.1 DCD, and 
would require two DCD allocations (i.e. two tagged-capacity dax devices - 
one writable by the server and one by the client).

Just to make things slightly gnarlier, the MESI cache coherency protocol
allows a CPU to speculatively convert a line from exclusive to modified,
meaning it's not clear as of now whether "occasional" clean write-backs
can be avoided. Meaning those read-only mappings may be more important
than one might think. (Clean write-backs basically make it
impossible for software to manage cache coherency.)

Keep in mind that I don't think anybody has CXL 3 devices or CPUs yet, and 
shared memory is not explicitly legal in CXL 2, so there are things a CPU 
could do (or not do) in a CXL 2 environment that are not illegal because 
they should not be observable in a no-shared-memory environment.

CBD is interesting work, though for some of the reasons above I'm somewhat
skeptical of shared memory as an IPC mechanism.

Regards,
John




* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-28  5:47               ` [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dongsheng Yang
  2024-04-28 16:44                 ` Gregory Price
  2024-04-28 16:55                 ` John Groves
@ 2024-04-30  0:34                 ` Dan Williams
  2 siblings, 0 replies; 16+ messages in thread
From: Dan Williams @ 2024-04-30  0:34 UTC (permalink / raw)
  To: Dongsheng Yang, Gregory Price, Dan Williams, John Groves
  Cc: axboe, linux-block, linux-kernel, linux-cxl, nvdimm

Dongsheng Yang wrote:
> 
> 
> On 2024/4/27 Saturday 12:14 AM, Gregory Price wrote:
> > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> >>
> >>
> >> On 2024/4/26 Friday 9:48 PM, Gregory Price wrote:
> >>>
> >>
> >> In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> >> at the software level:
> >>
> >> (5) How do blkdev and backend interact through the channel?
> >> 	a) For reader side, before reading the data, if the data in this channel
> >> may be modified by the other party, then I need to flush the cache before
> >> reading to ensure that I get the latest data. For example, the blkdev needs
> >> to flush the cache before obtaining compr_head because compr_head will be
> >> updated by the backend handler.
> >> 	b) For writter side, if the written information will be read by others,
> >> then after writing, I need to flush the cache to let the other party see it
> >> immediately. For example, after blkdev submits cbd_se, it needs to update
> >> cmd_head to let the handler have a new cbd_se. Therefore, after updating
> >> cmd_head, I need to flush the cache to let the backend see it.
> >>
> > 
> > Flushing the cache is insufficient.  All that cache flushing guarantees
> > is that the memory has left the writer's CPU cache.  There are potentially
> > many write buffers between the CPU and the actual backing media that the
> > CPU has no visibility of and cannot pierce through to force a full
> > guaranteed flush back to the media.
> > 
> > for example:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > 
> > Will not guarantee that after mfence() completes that the remote host
> > will have visibility of the data.  mfence() does not guarantee a full
> > flush back down to the device, it only guarantees it has been pushed out
> > of the CPU's cache.
> > 
> > similarly:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > memcpy(some_other_cacheline, data, 64);
> > mfence()
> > 
> > Will not guarantee that some_cacheline reaches the backing media prior
> > to some_other_cacheline, as there is no guarantee of write-ordering in
> > CXL controllers (with the exception of writes to the same cacheline).
> > 
> > So this statement:
> > 
> >> I need to flush the cache to let the other party see it immediately.
> > 
> > Is misleading.  They will not see is "immediately", they will see it
> > "eventually at some completely unknowable time in the future".
> 
> This is indeed one of the issues I wanted to discuss at the RFC stage. 
> Thank you for pointing it out.
> 
> In my opinion, using "nvdimm_flush" might be one way to address this 
> issue, but it seems to flush the entire nd_region, which might be too 
> heavy. Moreover, it only applies to non-volatile memory.
> 
> This should be a general problem for cxl shared memory. In theory, FAMFS 
> should also encounter this issue.
> 
> Gregory, John, and Dan, Any suggestion about it?

The CXL equivalent is GPF (Global Persistence Flush), not to be confused
with "General Protection Fault", which is likely what will happen if
software needs to manage cache coherency for this solution. CXL GPF was
not designed to be triggered by software; it is a hardware response to a
power supply indicating loss of input power.

I do not think you want to spend community resources reviewing software
cache coherency considerations, and instead "just" mandate that this
solution requires inter-host hardware cache coherence. I understand that
is a difficult requirement to mandate, but it is likely less difficult
than getting Linux to carry a software cache coherence mitigation.

In some ways this reminds me of SMR drives and the problems those posed
to software where ultimately the programming difficulties needed to be
solved in hardware, not exported to the Linux kernel to solve.


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-28 16:55                 ` John Groves
@ 2024-05-03  9:52                   ` Jonathan Cameron
  2024-05-08 11:39                     ` Dongsheng Yang
  0 siblings, 1 reply; 16+ messages in thread
From: Jonathan Cameron @ 2024-05-03  9:52 UTC (permalink / raw)
  To: John Groves
  Cc: Dongsheng Yang, Gregory Price, Dan Williams, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm

On Sun, 28 Apr 2024 11:55:10 -0500
John Groves <John@groves.net> wrote:

> On 24/04/28 01:47PM, Dongsheng Yang wrote:
> > 
> > 
> > On 2024/4/27 Saturday 12:14 AM, Gregory Price wrote:
> > > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:  
> > > > 
> > > > 
> > > > On 2024/4/26 Friday 9:48 PM, Gregory Price wrote:
> > > > >   
> > > > 
> > > > In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> > > > at the software level:
> > > > 
> > > > (5) How do blkdev and backend interact through the channel?
> > > > 	a) For reader side, before reading the data, if the data in this channel
> > > > may be modified by the other party, then I need to flush the cache before
> > > > reading to ensure that I get the latest data. For example, the blkdev needs
> > > > to flush the cache before obtaining compr_head because compr_head will be
> > > > updated by the backend handler.
> > > > 	b) For writter side, if the written information will be read by others,
> > > > then after writing, I need to flush the cache to let the other party see it
> > > > immediately. For example, after blkdev submits cbd_se, it needs to update
> > > > cmd_head to let the handler have a new cbd_se. Therefore, after updating
> > > > cmd_head, I need to flush the cache to let the backend see it.
> > > >   
> > > 
> > > Flushing the cache is insufficient.  All that cache flushing guarantees
> > > is that the memory has left the writer's CPU cache.  There are potentially
> > > many write buffers between the CPU and the actual backing media that the
> > > CPU has no visibility of and cannot pierce through to force a full
> > > guaranteed flush back to the media.
> > > 
> > > for example:
> > > 
> > > memcpy(some_cacheline, data, 64);
> > > mfence();
> > > 
> > > Will not guarantee that after mfence() completes that the remote host
> > > will have visibility of the data.  mfence() does not guarantee a full
> > > flush back down to the device, it only guarantees it has been pushed out
> > > of the CPU's cache.
> > > 
> > > similarly:
> > > 
> > > memcpy(some_cacheline, data, 64);
> > > mfence();
> > > memcpy(some_other_cacheline, data, 64);
> > > mfence()
> > > 
> > > Will not guarantee that some_cacheline reaches the backing media prior
> > > to some_other_cacheline, as there is no guarantee of write-ordering in
> > > CXL controllers (with the exception of writes to the same cacheline).
> > > 
> > > So this statement:
> > >   
> > > > I need to flush the cache to let the other party see it immediately.  
> > > 
> > > Is misleading.  They will not see is "immediately", they will see it
> > > "eventually at some completely unknowable time in the future".  
> > 
> > This is indeed one of the issues I wanted to discuss at the RFC stage. Thank
> > you for pointing it out.
> > 
> > In my opinion, using "nvdimm_flush" might be one way to address this issue,
> > but it seems to flush the entire nd_region, which might be too heavy.
> > Moreover, it only applies to non-volatile memory.
> > 
> > This should be a general problem for cxl shared memory. In theory, FAMFS
> > should also encounter this issue.
> > 
> > Gregory, John, and Dan, Any suggestion about it?
> > 
> > Thanx a lot  
> > > 
> > > ~Gregory
> > >   
> 
> Hi Dongsheng,
> 
> Gregory is right about the uncertainty around "clflush" operations, but
> let me drill in a bit further.
> 
> Say you copy a payload into a "bucket" in a queue and then update an
> index in a metadata structure; I'm thinking of the standard producer/
> consumer queuing model here, with one index mutated by the producer and
> the other mutated by the consumer. 
> 
> (I have not reviewed your queueing code, but you *must* be using this
> model - things like linked-lists won't work in shared memory without 
> shared locks/atomics.)
> 
> Normal logic says that you should clflush the payload before updating
> the index, then update and clflush the index.
> 
> But we still observe in non-cache-coherent shared memory that the payload 
> may become valid *after* the clflush of the queue index.
> 
> The famfs user space has a program called pcq.c, which implements a
> producer/consumer queue in a pair of famfs files. The only way to 
> currently guarantee a valid read of a payload is to use sequence numbers 
> and checksums on payloads.  We do observe mismatches with actual shared 
> memory, and the recovery is to clflush and re-read the payload from the 
> client side. (Aside: These file pairs theoretically might work for CBD 
> queues.)
> 
> Anoter side note: it would be super-helpful if the CPU gave us an explicit 
> invalidate rather than just clflush, which will write-back before 
> invalidating *if* the cache line is marked as dirty, even when software
> knows this should not happen.
> 
> Note that CXL 3.1 provides a way to guarantee that stuff that should not
> be written back can't be written back: read-only mappings. This one of
> the features I got into the spec; using this requires CXL 3.1 DCD, and 
> would require two DCD allocations (i.e. two tagged-capacity dax devices - 
> one writable by the server and one by the client).
> 
> Just to make things slightly gnarlier, the MESI cache coherency protocol
> allows a CPU to speculatively convert a line from exclusive to modified,
> meaning it's not clear as of now whether "occasional" clean write-backs
> can be avoided. Meaning those read-only mappings may be more important
> than one might think. (Clean write-backs basically make it
> impossible for software to manage cache coherency.)

My understanding is that clean write-backs are an implementation-specific
issue that came as a surprise to some CPU arch folk I spoke to; we will
need some path for a host to say whether it can ever do that.

Given this definitely affects one CPU vendor, maybe solutions that
rely on this not happening are not suitable for upstream.

Maybe this market will be important enough for that CPU vendor to stop
doing it, but if they do, it will take a while...

Flushing in general is a CPU architecture problem where each of the
architectures needs to be clear about what they do / specify what their
licensees do.

I'm with Dan on encouraging all memory vendors to do hardware coherence!

J

> 
> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and 
> shared memory is not explicitly legal in cxl 2, so there are things a cpu 
> could do (or not do) in a cxl 2 environment that are not illegal because 
> they should not be observable in a no-shared-memory environment.
> 
> CBD is interesting work, though for some of the reasons above I'm somewhat
> skeptical of shared memory as an IPC mechanism.
> 
> Regards,
> John
> 
> 
> 



* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-03  9:52                   ` Jonathan Cameron
@ 2024-05-08 11:39                     ` Dongsheng Yang
  2024-05-08 12:11                       ` Jonathan Cameron
  0 siblings, 1 reply; 16+ messages in thread
From: Dongsheng Yang @ 2024-05-08 11:39 UTC (permalink / raw)
  To: Jonathan Cameron, John Groves, Dan Williams, Gregory Price
  Cc: Gregory Price, Dan Williams, axboe, linux-block, linux-kernel,
	linux-cxl, nvdimm



On 2024/5/3 Friday 5:52 PM, Jonathan Cameron wrote:
> On Sun, 28 Apr 2024 11:55:10 -0500
> John Groves <John@groves.net> wrote:
> 
>> On 24/04/28 01:47PM, Dongsheng Yang wrote:
>>>
>>>
>>> On 2024/4/27 Saturday 12:14 AM, Gregory Price wrote:
>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
>>>>>
>>>>>
>>>>> On 2024/4/26 Friday 9:48 PM, Gregory Price wrote:
>>>>>>    
>>>>>

...
>>
>> Just to make things slightly gnarlier, the MESI cache coherency protocol
>> allows a CPU to speculatively convert a line from exclusive to modified,
>> meaning it's not clear as of now whether "occasional" clean write-backs
>> can be avoided. Meaning those read-only mappings may be more important
>> than one might think. (Clean write-backs basically make it
>> impossible for software to manage cache coherency.)
> 
> My understanding is that clean write backs are an implementation specific
> issue that came as a surprise to some CPU arch folk I spoke to, we will
> need some path for a host to say if they can ever do that.
> 
> Given this definitely effects one CPU vendor, maybe solutions that
> rely on this not happening are not suitable for upstream.
> 
> Maybe this market will be important enough for that CPU vendor to stop
> doing it but if they do it will take a while...
> 
> Flushing in general is as CPU architecture problem where each of the
> architectures needs to be clear what they do / specify that their
> licensees do.
> 
> I'm with Dan on encouraging all memory vendors to do hardware coherence!

Hi Gregory, John, Jonathan and Dan:
	Thanx for your information, it helps a lot, and sorry for the late reply.

After some internal discussions, I think we can design it as follows:

(1) If the hardware implements cache coherence, then the software layer 
doesn't need to consider this issue, and can perform read and write 
operations directly.

(2) If the hardware doesn't implement cache coherence, we can consider a 
DMA-like approach, where we check architectural features to determine if 
cache coherence is supported. This could be similar to 
`dev_is_dma_coherent`.

Additionally, if the architecture supports flushing and invalidating CPU 
caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`, 
`CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`, 
`CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`), then we can handle cache 
coherence at the software layer.
(For the clean-writeback issue, I think it may also require 
clarification from the architecture, as well as how DMA handles the 
clean-writeback problem, which I haven't checked further.)

(3) If the hardware doesn't implement cache coherence and the CPU 
doesn't support the required CPU cache operations, then we can run in 
nocache mode.

CBD can initially support (3), and then transition to (1) when hardware 
supports cache-coherency. If there's sufficient market demand, we can 
also consider supporting (2).
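
To illustrate the idea, the mode selection could look roughly like this
(a rough sketch only: cbd_coherence_mode and cbd_select_coherence() are
hypothetical names, not existing code, and the Kconfig checks are just
option (2) written down):

enum cbd_coherence_mode {
        CBD_COHERENT_HW,   /* (1) inter-host hardware cache coherence */
        CBD_COHERENT_SW,   /* (2) DMA-like software flush/invalidate */
        CBD_NOCACHE,       /* (3) map the transport channel uncached */
};

static enum cbd_coherence_mode cbd_select_coherence(bool hw_multihost_coherent)
{
        if (hw_multihost_coherent)
                return CBD_COHERENT_HW;

        if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) &&
            IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU))
                return CBD_COHERENT_SW;

        return CBD_NOCACHE;
}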

How does this approach sound?

Thanx
> 
> J
> 
>>
>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
>> shared memory is not explicitly legal in cxl 2, so there are things a cpu
>> could do (or not do) in a cxl 2 environment that are not illegal because
>> they should not be observable in a no-shared-memory environment.
>>
>> CBD is interesting work, though for some of the reasons above I'm somewhat
>> skeptical of shared memory as an IPC mechanism.
>>
>> Regards,
>> John
>>
>>
>>
> 
> .
> 


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-08 11:39                     ` Dongsheng Yang
@ 2024-05-08 12:11                       ` Jonathan Cameron
  2024-05-08 13:03                         ` Dongsheng Yang
  0 siblings, 1 reply; 16+ messages in thread
From: Jonathan Cameron @ 2024-05-08 12:11 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm

On Wed, 8 May 2024 19:39:23 +0800
Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:

> On 2024/5/3 Friday 5:52 PM, Jonathan Cameron wrote:
> > On Sun, 28 Apr 2024 11:55:10 -0500
> > John Groves <John@groves.net> wrote:
> >   
> >> On 24/04/28 01:47PM, Dongsheng Yang wrote:  
> >>>
> >>>
> >>> On 2024/4/27 Saturday 12:14 AM, Gregory Price wrote:
> >>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:  
> >>>>>
> >>>>>
> >>>>> On 2024/4/26 Friday 9:48 PM, Gregory Price wrote:
> >>>>>>      
> >>>>>  
> 
> ...
> >>
> >> Just to make things slightly gnarlier, the MESI cache coherency protocol
> >> allows a CPU to speculatively convert a line from exclusive to modified,
> >> meaning it's not clear as of now whether "occasional" clean write-backs
> >> can be avoided. Meaning those read-only mappings may be more important
> >> than one might think. (Clean write-backs basically make it
> >> impossible for software to manage cache coherency.)  
> > 
> > My understanding is that clean write backs are an implementation specific
> > issue that came as a surprise to some CPU arch folk I spoke to, we will
> > need some path for a host to say if they can ever do that.
> > 
> > Given this definitely effects one CPU vendor, maybe solutions that
> > rely on this not happening are not suitable for upstream.
> > 
> > Maybe this market will be important enough for that CPU vendor to stop
> > doing it but if they do it will take a while...
> > 
> > Flushing in general is as CPU architecture problem where each of the
> > architectures needs to be clear what they do / specify that their
> > licensees do.
> > 
> > I'm with Dan on encouraging all memory vendors to do hardware coherence!  
> 
> Hi Gregory, John, Jonathan and Dan:
> 	Thanx for your information, they help a lot, and sorry for the late reply.
> 
> After some internal discussions, I think we can design it as follows:
> 
> (1) If the hardware implements cache coherence, then the software layer 
> doesn't need to consider this issue, and can perform read and write 
> operations directly.

Agreed - this is one easier case.

> 
> (2) If the hardware doesn't implement cache coherence, we can consider a 
> DMA-like approach, where we check architectural features to determine if 
> cache coherence is supported. This could be similar to 
> `dev_is_dma_coherent`.

Ok. So this would combine host support checks with checking whether the shared
memory on the device is multi-host cache coherent (it will be single-host
cache coherent, which is what makes this messy).
> 
> Additionally, if the architecture supports flushing and invalidating CPU 
> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`, 
> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`, 
> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`),

Those particular calls won't tell you much at all. They indicate that a flush
can happen as far as a common point for DMA engines in the system. No
information on whether there are caches beyond that point.

> 
> then we can handle cache coherence at the software layer.
> (For the clean writeback issue, I think it may also require 
> clarification from the architecture, and how DMA handles the clean 
> writeback problem, which I haven't further checked.)

I believe the relevant architecture only does IO coherent DMA so it is
never a problem (unlike with multihost cache coherence).
> 
> (3) If the hardware doesn't implement cache coherence and the cpu 
> doesn't support the required CPU cache operations, then we can run in 
> nocache mode.

I suspect that gets you nowhere either.  Never believe an architecture
that provides a flag that says not to cache something.  That just means
you should not be able to tell that it is cached - many, many implementations
actually cache such accesses.

> 
> CBD can initially support (3), and then transition to (1) when hardware 
> supports cache-coherency. If there's sufficient market demand, we can 
> also consider supporting (2).
I'd assume only (3) works.  The others rely on assumptions I don't think
you can rely on.

Fun fun fun,

Jonathan

> 
> How does this approach sound?
> 
> Thanx
> > 
> > J
> >   
> >>
> >> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
> >> shared memory is not explicitly legal in cxl 2, so there are things a cpu
> >> could do (or not do) in a cxl 2 environment that are not illegal because
> >> they should not be observable in a no-shared-memory environment.
> >>
> >> CBD is interesting work, though for some of the reasons above I'm somewhat
> >> skeptical of shared memory as an IPC mechanism.
> >>
> >> Regards,
> >> John
> >>
> >>
> >>  
> > 
> > .
> >   



* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-08 12:11                       ` Jonathan Cameron
@ 2024-05-08 13:03                         ` Dongsheng Yang
  2024-05-08 15:44                           ` Jonathan Cameron
  0 siblings, 1 reply; 16+ messages in thread
From: Dongsheng Yang @ 2024-05-08 13:03 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm



On 2024/5/8 Wednesday 8:11 PM, Jonathan Cameron wrote:
> On Wed, 8 May 2024 19:39:23 +0800
> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> 
>> On 2024/5/3 Friday 5:52 PM, Jonathan Cameron wrote:
>>> On Sun, 28 Apr 2024 11:55:10 -0500
>>> John Groves <John@groves.net> wrote:
>>>    
>>>> On 24/04/28 01:47PM, Dongsheng Yang wrote:
>>>>>
>>>>>
>>>>> On 2024/4/27 Saturday 12:14 AM, Gregory Price wrote:
>>>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/4/26 Friday 9:48 PM, Gregory Price wrote:
>>>>>>>>       
>>>>>>>   
>>
>> ...
>>>>
>>>> Just to make things slightly gnarlier, the MESI cache coherency protocol
>>>> allows a CPU to speculatively convert a line from exclusive to modified,
>>>> meaning it's not clear as of now whether "occasional" clean write-backs
>>>> can be avoided. Meaning those read-only mappings may be more important
>>>> than one might think. (Clean write-backs basically make it
>>>> impossible for software to manage cache coherency.)
>>>
>>> My understanding is that clean write backs are an implementation specific
>>> issue that came as a surprise to some CPU arch folk I spoke to, we will
>>> need some path for a host to say if they can ever do that.
>>>
>>> Given this definitely effects one CPU vendor, maybe solutions that
>>> rely on this not happening are not suitable for upstream.
>>>
>>> Maybe this market will be important enough for that CPU vendor to stop
>>> doing it but if they do it will take a while...
>>>
>>> Flushing in general is as CPU architecture problem where each of the
>>> architectures needs to be clear what they do / specify that their
>>> licensees do.
>>>
>>> I'm with Dan on encouraging all memory vendors to do hardware coherence!
>>
>> Hi Gregory, John, Jonathan and Dan:
>> 	Thanx for your information, they help a lot, and sorry for the late reply.
>>
>> After some internal discussions, I think we can design it as follows:
>>
>> (1) If the hardware implements cache coherence, then the software layer
>> doesn't need to consider this issue, and can perform read and write
>> operations directly.
> 
> Agreed - this is one easier case.
> 
>>
>> (2) If the hardware doesn't implement cache coherence, we can consider a
>> DMA-like approach, where we check architectural features to determine if
>> cache coherence is supported. This could be similar to
>> `dev_is_dma_coherent`.
> 
> Ok. So this would combine host support checks with checking if the shared
> memory on the device is multi host cache coherent (it will be single host
> cache coherent which is what makes this messy)
>>
>> Additionally, if the architecture supports flushing and invalidating CPU
>> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`,
>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`,
>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`),
> 
> Those particular calls won't tell you much at all. They indicate that a flush
> can happen as far as a common point for DMA engines in the system. No
> information on whether there are caches beyond that point.
> 
>>
>> then we can handle cache coherence at the software layer.
>> (For the clean writeback issue, I think it may also require
>> clarification from the architecture, and how DMA handles the clean
>> writeback problem, which I haven't further checked.)
> 
> I believe the relevant architecture only does IO coherent DMA so it is
> never a problem (unlike with multihost cache coherence).

Hi Jonathan,

Let me provide an example:
In nvmeof-rdma, the `nvme_rdma_queue_rq` function places a request into 
`req->sqe.dma`.

(1) First, it calls `ib_dma_sync_single_for_cpu()`, which invalidates 
the CPU cache:


ib_dma_sync_single_for_cpu(dev, sqe->dma,
                             sizeof(struct nvme_command), DMA_TO_DEVICE);


For example, on ARM64, this would call `arch_sync_dma_for_cpu`, followed 
by `dcache_inval_poc(start, start + size)`.

(2) Setting up data related to the NVMe request.

(3) Then it calls `ib_dma_sync_single_for_device` to flush the CPU cache to 
DMA memory:

ib_dma_sync_single_for_device(dev, sqe->dma,
                              sizeof(struct nvme_command), DMA_TO_DEVICE);

Of course, if the hardware ensures cache coherency, the above operations 
are skipped. However, if the hardware does not guarantee cache 
coherency, RDMA appears to ensure cache coherency through this method.

In the RDMA scenario, we also face the issue of multi-host cache 
coherence. So I'm thinking, can we adopt a similar approach in CXL 
shared memory to achieve data sharing?
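
In other words, the idea would be to treat the shared channel roughly like
a streaming DMA buffer (sketch only; chan_vaddr/chan_paddr/len are
placeholders, not names from the CBD patches):

/* writer side: push the update out beyond the CPU cache */
memcpy(chan_vaddr, cmd, len);
arch_sync_dma_for_device(chan_paddr, len, DMA_TO_DEVICE);

/* reader side: drop any stale cached copy before reading */
arch_sync_dma_for_cpu(chan_paddr, len, DMA_FROM_DEVICE);
memcpy(cmd, chan_vaddr, len);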

>>
>> (3) If the hardware doesn't implement cache coherence and the cpu
>> doesn't support the required CPU cache operations, then we can run in
>> nocache mode.
> 
> I suspect that gets you no where either.  Never believe an architecture
> that provides a flag that says not to cache something.  That just means
> you should not be able to tell that it is cached - many many implementations
> actually cache such accesses.

Sigh, then that really makes things difficult.
> 
>>
>> CBD can initially support (3), and then transition to (1) when hardware
>> supports cache-coherency. If there's sufficient market demand, we can
>> also consider supporting (2).
> I'd assume only (3) works.  The others rely on assumptions I don't think

I guess you mean (1), the hardware cache-coherency way, right?

:)
Thanx

> you can rely on.
> 
> Fun fun fun,
> 
> Jonathan
> 
>>
>> How does this approach sound?
>>
>> Thanx
>>>
>>> J
>>>    
>>>>
>>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
>>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu
>>>> could do (or not do) in a cxl 2 environment that are not illegal because
>>>> they should not be observable in a no-shared-memory environment.
>>>>
>>>> CBD is interesting work, though for some of the reasons above I'm somewhat
>>>> skeptical of shared memory as an IPC mechanism.
>>>>
>>>> Regards,
>>>> John
>>>>
>>>>
>>>>   
>>>
>>> .
>>>    
> 
> .
> 


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-08 13:03                         ` Dongsheng Yang
@ 2024-05-08 15:44                           ` Jonathan Cameron
  2024-05-09 11:24                             ` Dongsheng Yang
  0 siblings, 1 reply; 16+ messages in thread
From: Jonathan Cameron @ 2024-05-08 15:44 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm

On Wed, 8 May 2024 21:03:54 +0800
Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:

> On 2024/5/8 Wednesday 8:11 PM, Jonathan Cameron wrote:
> > On Wed, 8 May 2024 19:39:23 +0800
> > Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> >   
> >> On 2024/5/3 Friday 5:52 PM, Jonathan Cameron wrote:
> >>> On Sun, 28 Apr 2024 11:55:10 -0500
> >>> John Groves <John@groves.net> wrote:
> >>>      
> >>>> On 24/04/28 01:47PM, Dongsheng Yang wrote:  
> >>>>>
> >>>>>
> >>>>> On 2024/4/27 Saturday 12:14 AM, Gregory Price wrote:
> >>>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:  
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2024/4/26 Friday 9:48 PM, Gregory Price wrote:
> >>>>>>>>         
> >>>>>>>     
> >>
> >> ...  
> >>>>
> >>>> Just to make things slightly gnarlier, the MESI cache coherency protocol
> >>>> allows a CPU to speculatively convert a line from exclusive to modified,
> >>>> meaning it's not clear as of now whether "occasional" clean write-backs
> >>>> can be avoided. Meaning those read-only mappings may be more important
> >>>> than one might think. (Clean write-backs basically make it
> >>>> impossible for software to manage cache coherency.)  
> >>>
> >>> My understanding is that clean write backs are an implementation specific
> >>> issue that came as a surprise to some CPU arch folk I spoke to, we will
> >>> need some path for a host to say if they can ever do that.
> >>>
> >>> Given this definitely effects one CPU vendor, maybe solutions that
> >>> rely on this not happening are not suitable for upstream.
> >>>
> >>> Maybe this market will be important enough for that CPU vendor to stop
> >>> doing it but if they do it will take a while...
> >>>
> >>> Flushing in general is as CPU architecture problem where each of the
> >>> architectures needs to be clear what they do / specify that their
> >>> licensees do.
> >>>
> >>> I'm with Dan on encouraging all memory vendors to do hardware coherence!  
> >>
> >> Hi Gregory, John, Jonathan and Dan:
> >> 	Thanx for your information, they help a lot, and sorry for the late reply.
> >>
> >> After some internal discussions, I think we can design it as follows:
> >>
> >> (1) If the hardware implements cache coherence, then the software layer
> >> doesn't need to consider this issue, and can perform read and write
> >> operations directly.  
> > 
> > Agreed - this is one easier case.
> >   
> >>
> >> (2) If the hardware doesn't implement cache coherence, we can consider a
> >> DMA-like approach, where we check architectural features to determine if
> >> cache coherence is supported. This could be similar to
> >> `dev_is_dma_coherent`.  
> > 
> > Ok. So this would combine host support checks with checking if the shared
> > memory on the device is multi host cache coherent (it will be single host
> > cache coherent which is what makes this messy)  
> >>
> >> Additionally, if the architecture supports flushing and invalidating CPU
> >> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`,
> >> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`,
> >> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`),  
> > 
> > Those particular calls won't tell you much at all. They indicate that a flush
> > can happen as far as a common point for DMA engines in the system. No
> > information on whether there are caches beyond that point.
> >   
> >>
> >> then we can handle cache coherence at the software layer.
> >> (For the clean writeback issue, I think it may also require
> >> clarification from the architecture, and how DMA handles the clean
> >> writeback problem, which I haven't further checked.)  
> > 
> > I believe the relevant architecture only does IO coherent DMA so it is
> > never a problem (unlike with multihost cache coherence).
> 
> Hi Jonathan,
> 
> let me provide an example,
> In nvmeof-rdma, the `nvme_rdma_queue_rq` function places a request into 
> `req->sqe.dma`.
> 
> (1) First, it calls `ib_dma_sync_single_for_cpu()`, which invalidates 
> the CPU cache:
> 
> 
> ib_dma_sync_single_for_cpu(dev, sqe->dma,
>                              sizeof(struct nvme_command), DMA_TO_DEVICE);
> 
> 
> For example, on ARM64, this would call `arch_sync_dma_for_cpu`, followed 
> by `dcache_inval_poc(start, start + size)`.

Key here is the PoC. It's a flush to the point of coherence of the local
system.  It says nothing about inter-host coherency, and the PoC is not
necessarily the DRAM (in CXL or otherwise).

If you are doing software coherence, those devices will plug into today's
hosts and they have no idea that such a flush means pushing out into
the CXL fabric and to the type 3 device.

> 
> (2) Setting up data related to the NVMe request.
> 
> (3) then Calls `ib_dma_sync_single_for_device` to flush the CPU cache to 
> DMA memory:
> 
> ib_dma_sync_single_for_device(dev, sqe->dma,
>                                  sizeof(struct nvme_command), 
> DMA_TO_DEVICE);
> 
> Of course, if the hardware ensures cache coherency, the above operations 
> are skipped. However, if the hardware does not guarantee cache 
> coherency, RDMA appears to ensure cache coherency through this method.
> 
> In the RDMA scenario, we also face the issue of multi-host cache 
> coherence. so I'm thinking, can we adopt a similar approach in CXL 
> shared memory to achieve data sharing?

You don't face the same coherence issues, or at least not in the same way.
In that case the coherence guarantees are actually to the RDMA NIC.
It is guaranteed to see the clean data by the host - that may involve
flushes to the PoC.  A one-time snapshot is then sent to readers on other
hosts. If writes occur, they are also guaranteed to replace cached copies
on this host - because there is a well-defined guarantee of IO coherence
or explicit cache maintenance to the PoC.

 
> 
> >>
> >> (3) If the hardware doesn't implement cache coherence and the cpu
> >> doesn't support the required CPU cache operations, then we can run in
> >> nocache mode.  
> > 
> > I suspect that gets you no where either.  Never believe an architecture
> > that provides a flag that says not to cache something.  That just means
> > you should not be able to tell that it is cached - many many implementations
> > actually cache such accesses.  
> 
> Sigh, then that really makes thing difficult.

Yes. I think we are going to have to wait on architecture-specific clarifications
before any software-coherent use case can be guaranteed to work beyond the 3.1 ones
for temporal sharing (only one accessing host at a time) and read-only sharing, where
writes are dropped anyway, so clean write-back is irrelevant beyond possibly some
noise in the logs (if they do get logged, it is considered so rare we don't care!).

> >   
> >>
> >> CBD can initially support (3), and then transition to (1) when hardware
> >> supports cache-coherency. If there's sufficient market demand, we can
> >> also consider supporting (2).  
> > I'd assume only (3) works.  The others rely on assumptions I don't think  
> 
> I guess you mean (1), the hardware cache-coherency way, right?

Indeed - oops!
Hardware coherency is the way to go, or a well-defined and clearly documented
description of how to play with the various host architectures.

Jonathan


> 
> :)
> Thanx
> 
> > you can rely on.
> > 
> > Fun fun fun,
> > 
> > Jonathan
> >   
> >>
> >> How does this approach sound?
> >>
> >> Thanx  
> >>>
> >>> J
> >>>      
> >>>>
> >>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
> >>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu
> >>>> could do (or not do) in a cxl 2 environment that are not illegal because
> >>>> they should not be observable in a no-shared-memory environment.
> >>>>
> >>>> CBD is interesting work, though for some of the reasons above I'm somewhat
> >>>> skeptical of shared memory as an IPC mechanism.
> >>>>
> >>>> Regards,
> >>>> John
> >>>>
> >>>>
> >>>>     
> >>>
> >>> .
> >>>      
> > 
> > .
> >   



* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-08 15:44                           ` Jonathan Cameron
@ 2024-05-09 11:24                             ` Dongsheng Yang
  2024-05-09 12:21                               ` Jonathan Cameron
  0 siblings, 1 reply; 16+ messages in thread
From: Dongsheng Yang @ 2024-05-09 11:24 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm



On 2024/5/8 Wednesday 11:44 PM, Jonathan Cameron wrote:
> On Wed, 8 May 2024 21:03:54 +0800
> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> 
>> On 2024/5/8 Wednesday 8:11 PM, Jonathan Cameron wrote:
>>> On Wed, 8 May 2024 19:39:23 +0800
>>> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>>>    
>>>> On 2024/5/3 Friday 5:52 PM, Jonathan Cameron wrote:
>>>>> On Sun, 28 Apr 2024 11:55:10 -0500
>>>>> John Groves <John@groves.net> wrote:
>>>>>       
>>>>>> On 24/04/28 01:47PM, Dongsheng Yang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/4/27 Saturday 12:14 AM, Gregory Price wrote:
>>>>>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/4/26 Friday 9:48 PM, Gregory Price wrote:
>>>>>>>>>>          
>>>>>>>>>      
>>>>
>>>> ...
>>>>>>
>>>>>> Just to make things slightly gnarlier, the MESI cache coherency protocol
>>>>>> allows a CPU to speculatively convert a line from exclusive to modified,
>>>>>> meaning it's not clear as of now whether "occasional" clean write-backs
>>>>>> can be avoided. Meaning those read-only mappings may be more important
>>>>>> than one might think. (Clean write-backs basically make it
>>>>>> impossible for software to manage cache coherency.)
>>>>>
>>>>> My understanding is that clean write backs are an implementation specific
>>>>> issue that came as a surprise to some CPU arch folk I spoke to, we will
>>>>> need some path for a host to say if they can ever do that.
>>>>>
>>>>> Given this definitely effects one CPU vendor, maybe solutions that
>>>>> rely on this not happening are not suitable for upstream.
>>>>>
>>>>> Maybe this market will be important enough for that CPU vendor to stop
>>>>> doing it but if they do it will take a while...
>>>>>
>>>>> Flushing in general is as CPU architecture problem where each of the
>>>>> architectures needs to be clear what they do / specify that their
>>>>> licensees do.
>>>>>
>>>>> I'm with Dan on encouraging all memory vendors to do hardware coherence!
>>>>
>>>> Hi Gregory, John, Jonathan and Dan:
>>>> 	Thanx for your information, they help a lot, and sorry for the late reply.
>>>>
>>>> After some internal discussions, I think we can design it as follows:
>>>>
>>>> (1) If the hardware implements cache coherence, then the software layer
>>>> doesn't need to consider this issue, and can perform read and write
>>>> operations directly.
>>>
>>> Agreed - this is one easier case.
>>>    
>>>>
>>>> (2) If the hardware doesn't implement cache coherence, we can consider a
>>>> DMA-like approach, where we check architectural features to determine if
>>>> cache coherence is supported. This could be similar to
>>>> `dev_is_dma_coherent`.
>>>
>>> Ok. So this would combine host support checks with checking if the shared
>>> memory on the device is multi host cache coherent (it will be single host
>>> cache coherent which is what makes this messy)
>>>>
>>>> Additionally, if the architecture supports flushing and invalidating CPU
>>>> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`,
>>>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`,
>>>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`),
>>>
>>> Those particular calls won't tell you much at all. They indicate that a flush
>>> can happen as far as a common point for DMA engines in the system. No
>>> information on whether there are caches beyond that point.
>>>    
>>>>
>>>> then we can handle cache coherence at the software layer.
>>>> (For the clean writeback issue, I think it may also require
>>>> clarification from the architecture, and how DMA handles the clean
>>>> writeback problem, which I haven't further checked.)
>>>
>>> I believe the relevant architecture only does IO coherent DMA so it is
>>> never a problem (unlike with multihost cache coherence).
>>
>> Hi Jonathan,
>>
>> let me provide an example,
>> In nvmeof-rdma, the `nvme_rdma_queue_rq` function places a request into
>> `req->sqe.dma`.
>>
>> (1) First, it calls `ib_dma_sync_single_for_cpu()`, which invalidates
>> the CPU cache:
>>
>>
>> ib_dma_sync_single_for_cpu(dev, sqe->dma,
>>                               sizeof(struct nvme_command), DMA_TO_DEVICE);
>>
>>
>> For example, on ARM64, this would call `arch_sync_dma_for_cpu`, followed
>> by `dcache_inval_poc(start, start + size)`.
> 
> Key here is the POC. It's a flush to the point of coherence of the local
> system.  It has no idea about interhost coherency and is not necessarily
> the DRAM (in CXL or otherwise).
> 
> If you are doing software coherence, those devices will plug into today's
> hosts and they have no idea that such a flush means pushing out into
> the CXL fabric and to the type 3 device.
> 
>>
>> (2) Setting up data related to the NVMe request.
>>
>> (3) then Calls `ib_dma_sync_single_for_device` to flush the CPU cache to
>> DMA memory:
>>
>> ib_dma_sync_single_for_device(dev, sqe->dma,
>>                                   sizeof(struct nvme_command),
>> DMA_TO_DEVICE);
>>
>> Of course, if the hardware ensures cache coherency, the above operations
>> are skipped. However, if the hardware does not guarantee cache
>> coherency, RDMA appears to ensure cache coherency through this method.
>>
>> In the RDMA scenario, we also face the issue of multi-host cache
>> coherence. so I'm thinking, can we adopt a similar approach in CXL
>> shared memory to achieve data sharing?
> 
> You don't face the same coherence issues, or at least not in the same way.
> In that case the coherence guarantees are actually to the RDMA NIC.
> It is guaranteed to see the clean data by the host - that may involve
> flushes to PoC.  A one time snapshot is then sent to readers on other
> hosts. If writes occur they are also guarantee to replace cached copies
> on this host - because there is well define guarantee of IO coherence
> or explicit cache maintenance to the PoC.

Right, the PoC is not the point of coherence with the other host. That 
sounds correct. Thanx.
> 
>   
>>
>>>>
>>>> (3) If the hardware doesn't implement cache coherence and the cpu
>>>> doesn't support the required CPU cache operations, then we can run in
>>>> nocache mode.
>>>
>>> I suspect that gets you no where either.  Never believe an architecture
>>> that provides a flag that says not to cache something.  That just means
>>> you should not be able to tell that it is cached - many many implementations
>>> actually cache such accesses.
>>
>> Sigh, then that really makes thing difficult.
> 
> Yes. I think we are going to have to wait on architecture specific clarifications
> before any software coherent use case can be guaranteed to work beyond the 3.1 ones
> for temporal sharing (only one accessing host at a time) and read only sharing where
> writes are dropped anyway so clean write back is irrelevant beyond some noise in
> logs possibly (if they do get logged it is considered so rare we don't care!).

Hi Jonathan,
	Allow me to discuss further. As described in CXL 3.1:
```
Software-managed coherency schemes are complicated by any host or device 
whose caching agents generate clean writebacks. A “No Clean Writebacks” 
capability bit is available for a host in the CXL System Description 
Structure (CSDS; see Section 9.18.1.6) or for a device in the DVSEC CXL 
Capability2 register (see Section 8.1.3.7).
```

If we check and find that the "No clean writeback" bit in both CSDS and 
DVSEC is set, can we then assume that software cache-coherency is 
feasible, as outlined below:

(1) Both the writer and reader ensure cache flushes. Since there are no 
clean writebacks, there will be no background data writes.

(2) The writer writes data to shared memory and then executes a cache 
flush. If we trust the "No clean writeback" bit, we can assume that the 
data in shared memory is coherent.

(3) Before reading the data, the reader performs cache invalidation. 
Since there are no clean writebacks, this invalidation operation will 
not destroy the data written by the writer. Therefore, the data read by 
the reader should be the data written by the writer, and since the 
writer's cache is clean, it will not write data to shared memory during 
the reader's reading process. Additionally, data integrity can be ensured.
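
For illustration, a minimal sketch of the writer/reader steps above 
(assuming x86 CLFLUSH/MFENCE semantics, that the "No Clean Writebacks" bit 
holds on both hosts, and that a CPU cache flush alone is enough to reach 
the media, which is itself still an open question; all function names here 
are hypothetical):

```
#include <emmintrin.h>  /* _mm_clflush(), _mm_mfence() */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHELINE 64

/* Write back and invalidate every cacheline covering [addr, addr + len). */
static void flush_range(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += CACHELINE)
		_mm_clflush((const void *)p);
	_mm_mfence();	/* order the flushes before anything that follows */
}

/* Writer, step (2): write to shared memory, then flush it out of the cache. */
static void writer_publish(void *shared, const void *data, size_t len)
{
	memcpy(shared, data, len);
	flush_range(shared, len);
}

/* Reader, step (3): drop any stale local copies, then read. */
static void reader_fetch(void *dst, const void *shared, size_t len)
{
	flush_range(shared, len);	/* CLFLUSH also invalidates the line */
	memcpy(dst, shared, len);
}
```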

The first step for CBD should depend on hardware cache coherence, which 
is clearer and more feasible. Here, I am just exploring the possibility 
of software cache coherence, not insisting on implementing software 
cache-coherency right away. :)

Thanx
> 
>>>    
>>>>
>>>> CBD can initially support (3), and then transition to (1) when hardware
>>>> supports cache-coherency. If there's sufficient market demand, we can
>>>> also consider supporting (2).
>>> I'd assume only (3) works.  The others rely on assumptions I don't think
>>
>> I guess you mean (1), the hardware cache-coherency way, right?
> 
> Indeed - oops!
> Hardware coherency is the way to go, or a well defined and clearly document
> description of how to play with the various host architectures.
> 
> Jonathan
> 
> 
>>
>> :)
>> Thanx
>>
>>> you can rely on.
>>>
>>> Fun fun fun,
>>>
>>> Jonathan
>>>    
>>>>
>>>> How does this approach sound?
>>>>
>>>> Thanx
>>>>>
>>>>> J
>>>>>       
>>>>>>
>>>>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
>>>>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu
>>>>>> could do (or not do) in a cxl 2 environment that are not illegal because
>>>>>> they should not be observable in a no-shared-memory environment.
>>>>>>
>>>>>> CBD is interesting work, though for some of the reasons above I'm somewhat
>>>>>> skeptical of shared memory as an IPC mechanism.
>>>>>>
>>>>>> Regards,
>>>>>> John
>>>>>>
>>>>>>
>>>>>>      
>>>>>
>>>>> .
>>>>>       
>>>
>>> .
>>>    
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-09 11:24                             ` Dongsheng Yang
@ 2024-05-09 12:21                               ` Jonathan Cameron
  2024-05-09 13:03                                 ` Dongsheng Yang
  0 siblings, 1 reply; 16+ messages in thread
From: Jonathan Cameron @ 2024-05-09 12:21 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm

On Thu, 9 May 2024 19:24:28 +0800
Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:

> On 2024/5/8 Wednesday 11:44 PM, Jonathan Cameron wrote:
> > On Wed, 8 May 2024 21:03:54 +0800
> > Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> >   
> >> On 2024/5/8 Wednesday 8:11 PM, Jonathan Cameron wrote:  
> >>> On Wed, 8 May 2024 19:39:23 +0800
> >>> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> >>>      
> >>>> On 2024/5/3 Friday 5:52 PM, Jonathan Cameron wrote:  
> >>>>> On Sun, 28 Apr 2024 11:55:10 -0500
> >>>>> John Groves <John@groves.net> wrote:
> >>>>>         
> >>>>>> On 24/04/28 01:47PM, Dongsheng Yang wrote:  
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2024/4/27 Saturday 12:14 AM, Gregory Price wrote:  
> >>>>>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:  
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 2024/4/26 Friday 9:48 PM, Gregory Price wrote:  
> >>>>>>>>>>            
> >>>>>>>>>        
> >>>>
> >>>> ...  
> >>>>>>
> >>>>>> Just to make things slightly gnarlier, the MESI cache coherency protocol
> >>>>>> allows a CPU to speculatively convert a line from exclusive to modified,
> >>>>>> meaning it's not clear as of now whether "occasional" clean write-backs
> >>>>>> can be avoided. Meaning those read-only mappings may be more important
> >>>>>> than one might think. (Clean write-backs basically make it
> >>>>>> impossible for software to manage cache coherency.)  
> >>>>>
> >>>>> My understanding is that clean write backs are an implementation specific
> >>>>> issue that came as a surprise to some CPU arch folk I spoke to, we will
> >>>>> need some path for a host to say if they can ever do that.
> >>>>>
> >>>>> Given this definitely effects one CPU vendor, maybe solutions that
> >>>>> rely on this not happening are not suitable for upstream.
> >>>>>
> >>>>> Maybe this market will be important enough for that CPU vendor to stop
> >>>>> doing it but if they do it will take a while...
> >>>>>
> >>>>> Flushing in general is as CPU architecture problem where each of the
> >>>>> architectures needs to be clear what they do / specify that their
> >>>>> licensees do.
> >>>>>
> >>>>> I'm with Dan on encouraging all memory vendors to do hardware coherence!  
> >>>>
> >>>> Hi Gregory, John, Jonathan and Dan:
> >>>> 	Thanx for your information, they help a lot, and sorry for the late reply.
> >>>>
> >>>> After some internal discussions, I think we can design it as follows:
> >>>>
> >>>> (1) If the hardware implements cache coherence, then the software layer
> >>>> doesn't need to consider this issue, and can perform read and write
> >>>> operations directly.  
> >>>
> >>> Agreed - this is one easier case.
> >>>      
> >>>>
> >>>> (2) If the hardware doesn't implement cache coherence, we can consider a
> >>>> DMA-like approach, where we check architectural features to determine if
> >>>> cache coherence is supported. This could be similar to
> >>>> `dev_is_dma_coherent`.  
> >>>
> >>> Ok. So this would combine host support checks with checking if the shared
> >>> memory on the device is multi host cache coherent (it will be single host
> >>> cache coherent which is what makes this messy)  
> >>>>
> >>>> Additionally, if the architecture supports flushing and invalidating CPU
> >>>> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`,
> >>>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`,
> >>>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`),  
> >>>
> >>> Those particular calls won't tell you much at all. They indicate that a flush
> >>> can happen as far as a common point for DMA engines in the system. No
> >>> information on whether there are caches beyond that point.
> >>>      
> >>>>
> >>>> then we can handle cache coherence at the software layer.
> >>>> (For the clean writeback issue, I think it may also require
> >>>> clarification from the architecture, and how DMA handles the clean
> >>>> writeback problem, which I haven't further checked.)  
> >>>
> >>> I believe the relevant architecture only does IO coherent DMA so it is
> >>> never a problem (unlike with multihost cache coherence).
> >>
> >> Hi Jonathan,
> >>
> >> let me provide an example,
> >> In nvmeof-rdma, the `nvme_rdma_queue_rq` function places a request into
> >> `req->sqe.dma`.
> >>
> >> (1) First, it calls `ib_dma_sync_single_for_cpu()`, which invalidates
> >> the CPU cache:
> >>
> >>
> >> ib_dma_sync_single_for_cpu(dev, sqe->dma,
> >>                               sizeof(struct nvme_command), DMA_TO_DEVICE);
> >>
> >>
> >> For example, on ARM64, this would call `arch_sync_dma_for_cpu`, followed
> >> by `dcache_inval_poc(start, start + size)`.  
> > 
> > Key here is the POC. It's a flush to the point of coherence of the local
> > system.  It has no idea about interhost coherency and is not necessarily
> > the DRAM (in CXL or otherwise).
> > 
> > If you are doing software coherence, those devices will plug into today's
> > hosts and they have no idea that such a flush means pushing out into
> > the CXL fabric and to the type 3 device.
> >   
> >>
> >> (2) Setting up data related to the NVMe request.
> >>
> >> (3) then Calls `ib_dma_sync_single_for_device` to flush the CPU cache to
> >> DMA memory:
> >>
> >> ib_dma_sync_single_for_device(dev, sqe->dma,
> >>                                   sizeof(struct nvme_command),
> >> DMA_TO_DEVICE);
> >>
> >> Of course, if the hardware ensures cache coherency, the above operations
> >> are skipped. However, if the hardware does not guarantee cache
> >> coherency, RDMA appears to ensure cache coherency through this method.
> >>
> >> In the RDMA scenario, we also face the issue of multi-host cache
> >> coherence. so I'm thinking, can we adopt a similar approach in CXL
> >> shared memory to achieve data sharing?  
> > 
> > You don't face the same coherence issues, or at least not in the same way.
> > In that case the coherence guarantees are actually to the RDMA NIC.
> > It is guaranteed to see the clean data by the host - that may involve
> > flushes to PoC.  A one time snapshot is then sent to readers on other
> > hosts. If writes occur they are also guaranteed to replace cached copies
> > on this host - because there is a well-defined guarantee of IO coherence
> > or explicit cache maintenance to the PoC.
> Right, the PoC is not the point of coherence with the other host. That
> sounds correct, thanx.
> > 
> >     
> >>  
> >>>>
> >>>> (3) If the hardware doesn't implement cache coherence and the cpu
> >>>> doesn't support the required CPU cache operations, then we can run in
> >>>> nocache mode.  
> >>>
> >>> I suspect that gets you nowhere either.  Never believe an architecture  
> >>> that provides a flag that says not to cache something.  That just means
> >>> you should not be able to tell that it is cached - many many implementations
> >>> actually cache such accesses.  
> >>
> >> Sigh, then that really makes things difficult.  
> > 
> > Yes. I think we are going to have to wait on architecture specific clarifications
> > before any software coherent use case can be guaranteed to work beyond the 3.1 ones
> > for temporal sharing (only one accessing host at a time) and read only sharing where
> > writes are dropped anyway so clean write back is irrelevant beyond some noise in
> > logs possibly (if they do get logged it is considered so rare we don't care!).  
> 
> Hi Jonathan,
> 	Allow me to discuss further. As described in CXL 3.1:
> ```
> Software-managed coherency schemes are complicated by any host or device 
> whose caching agents generate clean writebacks. A “No Clean Writebacks” 
> capability bit is available for a host in the CXL System Description 
> Structure (CSDS; see Section 9.18.1.6) or for a device in the DVSEC CXL 
> Capability2 register (see Section 8.1.3.7).
> ```
> 
> If we check and find that the "No clean writeback" bit in both CSDS and 
> DVSEC is set, can we then assume that software cache-coherency is 
> feasible, as outlined below:
> 
> (1) Both the writer and reader ensure cache flushes. Since there are no 
> clean writebacks, there will be no background data writes.
> 
> (2) The writer writes data to shared memory and then executes a cache 
> flush. If we trust the "No clean writeback" bit, we can assume that the 
> data in shared memory is coherent.
> 
> (3) Before reading the data, the reader performs cache invalidation. 
> Since there are no clean writebacks, this invalidation operation will 
> not destroy the data written by the writer. Therefore, the data read by 
> the reader should be the data written by the writer, and since the 
> writer's cache is clean, it will not write data to shared memory during 
> the reader's reading process. Additionally, data integrity can be ensured.
> 
> The first step for CBD should depend on hardware cache coherence, which 
> is clearer and more feasible. Here, I am just exploring the possibility 
> of software cache coherence, not insisting on implementing software 
> cache-coherency right away. :)

Yes, if a platform sets that bit, you 'should' be fine.  What exact flush
is needed is architecture specific, however, and the DMA-related ones
may not be sufficient. I'd keep an eye open for arch doc updates from the
various vendors.

Also, the architecture that motivated that bit existing is a 'moderately
large' chip vendor so I'd go so far as to say adoption will be limited
unless they resolve that in a future implementation :)

Jonathan

> 
> Thanx
> >   
> >>>      
> >>>>
> >>>> CBD can initially support (3), and then transition to (1) when hardware
> >>>> supports cache-coherency. If there's sufficient market demand, we can
> >>>> also consider supporting (2).  
> >>> I'd assume only (3) works.  The others rely on assumptions I don't think  
> >>
> >> I guess you mean (1), the hardware cache-coherency way, right?  
> > 
> > Indeed - oops!
> > Hardware coherency is the way to go, or a well defined and clearly document
> > description of how to play with the various host architectures.
> > 
> > Jonathan
> > 
> >   
> >>
> >> :)
> >> Thanx
> >>  
> >>> you can rely on.
> >>>
> >>> Fun fun fun,
> >>>
> >>> Jonathan
> >>>      
> >>>>
> >>>> How does this approach sound?
> >>>>
> >>>> Thanx  
> >>>>>
> >>>>> J
> >>>>>         
> >>>>>>
> >>>>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
> >>>>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu
> >>>>>> could do (or not do) in a cxl 2 environment that are not illegal because
> >>>>>> they should not be observable in a no-shared-memory environment.
> >>>>>>
> >>>>>> CBD is interesting work, though for some of the reasons above I'm somewhat
> >>>>>> skeptical of shared memory as an IPC mechanism.
> >>>>>>
> >>>>>> Regards,
> >>>>>> John
> >>>>>>
> >>>>>>
> >>>>>>        
> >>>>>
> >>>>> .
> >>>>>         
> >>>
> >>> .
> >>>      
> > 
> >   


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-09 12:21                               ` Jonathan Cameron
@ 2024-05-09 13:03                                 ` Dongsheng Yang
  2024-05-21 18:41                                   ` Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Dongsheng Yang @ 2024-05-09 13:03 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm



On 2024/5/9 Thursday 8:21 PM, Jonathan Cameron wrote:
> On Thu, 9 May 2024 19:24:28 +0800
> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> 
...
>>> Yes. I think we are going to have to wait on architecture specific clarifications
>>> before any software coherent use case can be guaranteed to work beyond the 3.1 ones
>>> for temporal sharing (only one accessing host at a time) and read only sharing where
>>> writes are dropped anyway so clean write back is irrelevant beyond some noise in
>>> logs possibly (if they do get logged it is considered so rare we don't care!).
>>
>> Hi Jonathan,
>> 	Allow me to discuss further. As described in CXL 3.1:
>> ```
>> Software-managed coherency schemes are complicated by any host or device
>> whose caching agents generate clean writebacks. A “No Clean Writebacks”
>> capability bit is available for a host in the CXL System Description
>> Structure (CSDS; see Section 9.18.1.6) or for a device in the DVSEC CXL
>> Capability2 register (see Section 8.1.3.7).
>> ```
>>
>> If we check and find that the "No clean writeback" bit in both CSDS and
>> DVSEC is set, can we then assume that software cache-coherency is
>> feasible, as outlined below:
>>
>> (1) Both the writer and reader ensure cache flushes. Since there are no
>> clean writebacks, there will be no background data writes.
>>
>> (2) The writer writes data to shared memory and then executes a cache
>> flush. If we trust the "No clean writeback" bit, we can assume that the
>> data in shared memory is coherent.
>>
>> (3) Before reading the data, the reader performs cache invalidation.
>> Since there are no clean writebacks, this invalidation operation will
>> not destroy the data written by the writer. Therefore, the data read by
>> the reader should be the data written by the writer, and since the
>> writer's cache is clean, it will not write data to shared memory during
>> the reader's reading process. Additionally, data integrity can be ensured.
>>
>> The first step for CBD should depend on hardware cache coherence, which
>> is clearer and more feasible. Here, I am just exploring the possibility
>> of software cache coherence, not insisting on implementing software
>> cache-coherency right away. :)
> 
> Yes, if a platform sets that bit, you 'should' be fine.  What exact flush
> is needed is architecture specific, however, and the DMA-related ones
> may not be sufficient. I'd keep an eye open for arch doc updates from the
> various vendors.
> 
> Also, the architecture that motivated that bit existing is a 'moderately
> large' chip vendor so I'd go so far as to say adoption will be limited
> unless they resolve that in a future implementation :)

Great, I think we've had a good discussion and reached a consensus on 
this issue. The remaining aspect will depend on hardware updates. Thank 
you for the information, that helps a lot.

Thanx
> 
> Jonathan
> 
>>
>> Thanx
>>>    
>>>>>       
>>>>>>
>>>>>> CBD can initially support (3), and then transition to (1) when hardware
>>>>>> supports cache-coherency. If there's sufficient market demand, we can
>>>>>> also consider supporting (2).
>>>>> I'd assume only (3) works.  The others rely on assumptions I don't think
>>>>
>>>> I guess you mean (1), the hardware cache-coherency way, right?
>>>
>>> Indeed - oops!
>>> Hardware coherency is the way to go, or a well defined and clearly document
>>> description of how to play with the various host architectures.
>>>
>>> Jonathan
>>>
>>>    
>>>>
>>>> :)
>>>> Thanx
>>>>   
>>>>> you can rely on.
>>>>>
>>>>> Fun fun fun,
>>>>>
>>>>> Jonathan
>>>>>       
>>>>>>
>>>>>> How does this approach sound?
>>>>>>
>>>>>> Thanx
>>>>>>>
>>>>>>> J
>>>>>>>          
>>>>>>>>
>>>>>>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
>>>>>>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu
>>>>>>>> could do (or not do) in a cxl 2 environment that are not illegal because
>>>>>>>> they should not be observable in a no-shared-memory environment.
>>>>>>>>
>>>>>>>> CBD is interesting work, though for some of the reasons above I'm somewhat
>>>>>>>> skeptical of shared memory as an IPC mechanism.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> John
>>>>>>>>
>>>>>>>>
>>>>>>>>         
>>>>>>>
>>>>>>> .
>>>>>>>          
>>>>>
>>>>> .
>>>>>       
>>>
>>>    
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-09 13:03                                 ` Dongsheng Yang
@ 2024-05-21 18:41                                   ` Dan Williams
       [not found]                                     ` <8f161b2d-eacd-ad35-8959-0f44c8d132b3@easystack.cn>
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2024-05-21 18:41 UTC (permalink / raw)
  To: Dongsheng Yang, Jonathan Cameron
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm

Dongsheng Yang wrote:
> On 2024/5/9 Thursday 8:21 PM, Jonathan Cameron wrote:
[..]
> >> If we check and find that the "No clean writeback" bit in both CSDS and
> >> DVSEC is set, can we then assume that software cache-coherency is
> >> feasible, as outlined below:
> >>
> >> (1) Both the writer and reader ensure cache flushes. Since there are no
> >> clean writebacks, there will be no background data writes.
> >>
> >> (2) The writer writes data to shared memory and then executes a cache
> >> flush. If we trust the "No clean writeback" bit, we can assume that the
> >> data in shared memory is coherent.
> >>
> >> (3) Before reading the data, the reader performs cache invalidation.
> >> Since there are no clean writebacks, this invalidation operation will
> >> not destroy the data written by the writer. Therefore, the data read by
> >> the reader should be the data written by the writer, and since the
> >> writer's cache is clean, it will not write data to shared memory during
> >> the reader's reading process. Additionally, data integrity can be ensured.

What guarantees this property? How does the reader know that its local
cache invalidation is sufficient for reading data that has only reached
global visibility on the remote peer? As far as I can see, there is
nothing that guarantees that local global visibility translates to
remote visibility. In fact, the GPF feature is counter-evidence of the
fact that writes can be pending in buffers that are only flushed on a
GPF event.

I remain skeptical that a software managed inter-host cache-coherency
scheme can be made reliable with current CXL defined mechanisms.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
       [not found]                                     ` <8f161b2d-eacd-ad35-8959-0f44c8d132b3@easystack.cn>
@ 2024-05-29 15:25                                       ` Gregory Price
  2024-05-30  6:59                                         ` Dongsheng Yang
  0 siblings, 1 reply; 16+ messages in thread
From: Gregory Price @ 2024-05-29 15:25 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Dan Williams, Jonathan Cameron, John Groves, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm

On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote:
> 
> 
> On 2024/5/22 Wednesday 2:41 AM, Dan Williams wrote:
> > Dongsheng Yang wrote:
> > 
> > What guarantees this property? How does the reader know that its local
> > cache invalidation is sufficient for reading data that has only reached
> > global visibility on the remote peer? As far as I can see, there is
> > nothing that guarantees that local global visibility translates to
> > remote visibility. In fact, the GPF feature is counter-evidence of the
> > fact that writes can be pending in buffers that are only flushed on a
> > GPF event.
> 
> Sounds correct. From what I learned from GPF, ADR, and eADR, there would
> still be data in WPQ even though we perform a CPU cache line flush in the
> OS.
> 
> This means we don't have an explicit method to make data puncture all caches
> and land in the media after writing. Also, it seems there isn't an explicit
> method to invalidate all caches along the entire path.
> 
> > 
> > I remain skeptical that a software managed inter-host cache-coherency
> > scheme can be made reliable with current CXL defined mechanisms.
> 
> 
> I got your point now: according to the current CXL spec, it seems software managed
> cache-coherency for inter-host shared memory is not working. Will the next
> version of CXL spec consider it?
> > 

Sorry for missing the conversation, have been out of office for a bit.

It's not just a CXL spec issue, though that is part of it. I think the
CXL spec would have to expose some form of puncturing flush, and this
makes the assumption that such a flush doesn't cause some kind of
race/deadlock issue.  Certainly this needs to be discussed.

However, consider that the upstream processor actually has to generate
this flush.  This means adding the flush to existing coherence protocols,
or at the very least a new instruction to generate the flush explicitly.
The latter seems more likely than the former.

This flush would need to ensure the data is forced out of the local WPQ
AND all WPQs south of the PCIE complex - because what you really want to
know is that the data has actually made it back to a place where remote
viewers are capable of perceiving the change.

So this means:
1) Spec revision with puncturing flush
2) Buy-in from CPU vendors to generate such a flush
3) A new instruction added to the architecture.

Call me in a decade or so.


But really, I think it likely we see hardware-coherence well before this.
For this reason, I have become skeptical of all but a few memory sharing
use cases that depend on software-controlled cache-coherency.

There are some (FAMFS, for example). The coherence state of these
systems tends to be less volatile (e.g. mappings are read-only), or
they have inherent design limitations (cacheline-sized message passing
via write-ahead logging only).
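
For illustration only, here is one way such a cacheline-sized
write-ahead-log entry could be laid out so that each message fits in a
single 64-byte line (the field layout is hypothetical, not FAMFS's or any
real format):

```
#include <stdint.h>

/* Hypothetical 64-byte log entry: one message per cacheline, so the only
 * visibility the scheme relies on is a single line becoming observable. */
struct wal_entry {
	uint64_t seq;		/* sequence number; readers poll this for a change */
	uint32_t crc;		/* covers len + payload; detects a torn or stale line */
	uint32_t len;		/* valid payload bytes, 0..48 */
	uint8_t  payload[48];
} __attribute__((aligned(64)));
```

A writer fills payload, len and crc first and publishes by writing seq
last; a reader that observes a new seq re-checks the crc before trusting
the entry.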

~Gregory

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-29 15:25                                       ` Gregory Price
@ 2024-05-30  6:59                                         ` Dongsheng Yang
  2024-05-30 13:38                                           ` Jonathan Cameron
  0 siblings, 1 reply; 16+ messages in thread
From: Dongsheng Yang @ 2024-05-30  6:59 UTC (permalink / raw)
  To: Gregory Price
  Cc: Dan Williams, Jonathan Cameron, John Groves, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm



On 2024/5/29 Wednesday 11:25 PM, Gregory Price wrote:
> On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote:
>>
>>
>> On 2024/5/22 Wednesday 2:41 AM, Dan Williams wrote:
>>> Dongsheng Yang wrote:
>>>
>>> What guarantees this property? How does the reader know that its local
>>> cache invalidation is sufficient for reading data that has only reached
>>> global visibility on the remote peer? As far as I can see, there is
>>> nothing that guarantees that local global visibility translates to
>>> remote visibility. In fact, the GPF feature is counter-evidence of the
>>> fact that writes can be pending in buffers that are only flushed on a
>>> GPF event.
>>
>> Sounds correct. From what I learned from GPF, ADR, and eADR, there would
>> still be data in WPQ even though we perform a CPU cache line flush in the
>> OS.
>>
>> This means we don't have an explicit method to make data puncture all caches
>> and land in the media after writing. Also, it seems there isn't an explicit
>> method to invalidate all caches along the entire path.
>>
>>>
>>> I remain skeptical that a software managed inter-host cache-coherency
>>> scheme can be made reliable with current CXL defined mechanisms.
>>
>>
>> I got your point now: according to the current CXL spec, it seems software managed
>> cache-coherency for inter-host shared memory is not working. Will the next
>> version of CXL spec consider it?
>>>
> 
> Sorry for missing the conversation, have been out of office for a bit.
> 
> It's not just a CXL spec issue, though that is part of it. I think the
> CXL spec would have to expose some form of puncturing flush, and this
> makes the assumption that such a flush doesn't cause some kind of
> race/deadlock issue.  Certainly this needs to be discussed.
> 
> However, consider that the upstream processor actually has to generate
> this flush.  This means adding the flush to existing coherence protocols,
> or at the very least a new instruction to generate the flush explicitly.
> The latter seems more likely than the former.
> 
> This flush would need to ensure the data is forced out of the local WPQ
> AND all WPQs south of the PCIE complex - because what you really want to
> know is that the data has actually made it back to a place where remote
> viewers are capable of perceiving the change.
> 
> So this means:
> 1) Spec revision with puncturing flush
> 2) Buy-in from CPU vendors to generate such a flush
> 3) A new instruction added to the architecture.
> 
> Call me in a decade or so.
> 
> 
> But really, I think it likely we see hardware-coherence well before this.
> For this reason, I have become skeptical of all but a few memory sharing
> use cases that depend on software-controlled cache-coherency.

Hi Gregory,

	From my understanding, we actually have the same idea here. What I am 
saying is that we need the spec to consider this issue, meaning we need to 
describe how the entire software-coherency mechanism operates, which 
includes the necessary hardware support. Additionally, I agree that if 
software-coherency also requires hardware support, it seems that 
hardware-coherency is the better path.
> 
> There are some (FAMFS, for example). The coherence state of these
> systems tends to be less volatile (e.g. mappings are read-only), or
> they have inherent design limitations (cacheline-sized message passing
> via write-ahead logging only).

Can you explain more about this? I understand that if the reader in the 
writer-reader model is using a read-only mapping, the interaction will be 
much simpler. However, after the writer writes data, if we don't have a 
mechanism to flush and invalidate that punctures all caches, how can the 
read-only reader access the new data?
> 
> ~Gregory
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-30  6:59                                         ` Dongsheng Yang
@ 2024-05-30 13:38                                           ` Jonathan Cameron
  0 siblings, 0 replies; 16+ messages in thread
From: Jonathan Cameron @ 2024-05-30 13:38 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Gregory Price, Dan Williams, John Groves, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm, james.morse, Mark Rutland

On Thu, 30 May 2024 14:59:38 +0800
Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:

> On 2024/5/29 Wednesday 11:25 PM, Gregory Price wrote:
> > On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote:  
> >>
> >>
> >> On 2024/5/22 Wednesday 2:41 AM, Dan Williams wrote:  
> >>> Dongsheng Yang wrote:
> >>>
> >>> What guarantees this property? How does the reader know that its local
> >>> cache invalidation is sufficient for reading data that has only reached
> >>> global visibility on the remote peer? As far as I can see, there is
> >>> nothing that guarantees that local global visibility translates to
> >>> remote visibility. In fact, the GPF feature is counter-evidence of the
> >>> fact that writes can be pending in buffers that are only flushed on a
> >>> GPF event.  
> >>
> >> Sounds correct. From what I learned from GPF, ADR, and eADR, there would
> >> still be data in WPQ even though we perform a CPU cache line flush in the
> >> OS.
> >>
> >> This means we don't have an explicit method to make data puncture all caches
> >> and land in the media after writing. Also, it seems there isn't an explicit
> >> method to invalidate all caches along the entire path.
> >>  
> >>>
> >>> I remain skeptical that a software managed inter-host cache-coherency
> >>> scheme can be made reliable with current CXL defined mechanisms.  
> >>
> >>
> >> I got your point now: according to the current CXL spec, it seems software managed
> >> cache-coherency for inter-host shared memory is not working. Will the next
> >> version of CXL spec consider it?  
> >>>  
> > 
> > Sorry for missing the conversation, have been out of office for a bit.
> > 
> > It's not just a CXL spec issue, though that is part of it. I think the
> > CXL spec would have to expose some form of puncturing flush, and this
> > makes the assumption that such a flush doesn't cause some kind of
> > race/deadlock issue.  Certainly this needs to be discussed.
> > 
> > However, consider that the upstream processor actually has to generate
> > this flush.  This means adding the flush to existing coherence protocols,
> > or at the very least a new instruction to generate the flush explicitly.
> > The latter seems more likely than the former.
> > 
> > This flush would need to ensure the data is forced out of the local WPQ
> > AND all WPQs south of the PCIE complex - because what you really want to
> > know is that the data has actually made it back to a place where remote
> > viewers are capable of perceiving the change.
> > 
> > So this means:
> > 1) Spec revision with puncturing flush
> > 2) Buy-in from CPU vendors to generate such a flush
> > 3) A new instruction added to the architecture.
> > 
> > Call me in a decade or so.
> > 
> > 
> > But really, I think it likely we see hardware-coherence well before this.
> > For this reason, I have become skeptical of all but a few memory sharing
> > use cases that depend on software-controlled cache-coherency.  
> 
> Hi Gregory,
> 
> 	From my understanding, we actually have the same idea here. What I am 
> saying is that we need the spec to consider this issue, meaning we need to 
> describe how the entire software-coherency mechanism operates, which 
> includes the necessary hardware support. Additionally, I agree that if 
> software-coherency also requires hardware support, it seems that 
> hardware-coherency is the better path.
> > 
> > There are some (FAMFS, for example). The coherence state of these
> > systems tends to be less volatile (e.g. mappings are read-only), or
> > they have inherent design limitations (cacheline-sized message passing
> > via write-ahead logging only).  
> 
> Can you explain more about this? I understand that if the reader in the 
> writer-reader model is using a read-only mapping, the interaction will be 
> much simpler. However, after the writer writes data, if we don't have a 
> mechanism to flush and invalidate that punctures all caches, how can the 
> read-only reader access the new data?

There is a mechanism for doing coarse grained flushing that is known to
work on some architectures. Look at cpu_cache_invalidate_memregion().
On intel/x86 it's wbinvd_on_all_cpus();
on arm64 it's a PSCI firmware call, CLEAN_INV_MEMREGION (there is a
public alpha specification for PSCI 1.3 with that defined, but we
don't yet have kernel code).

These are very big hammers and so unsuited for anything fine grained.
In the extreme end of possible implementations they briefly stop all
CPUs and clean and invalidate all caches of all types.  So not suited
to anything fine grained, but may be acceptable for a rare setup event,
particularly if the main job of the writing host is to fill that memory
for lots of other hosts to use.

At least the ARM one takes a range so allows for a less painful
implementation.  I'm assuming we'll see new architecture over time
but this is a different (and potentially easier) problem space
to what you need.
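
As a rough sketch (not a tested implementation) of how a driver might gate
on that kernel interface, assuming the memregion API used by the CXL and
nvdimm security paths and IORES_DESC_CXL as the covering resource
descriptor:

```
#include <linux/memregion.h>
#include <linux/ioport.h>
#include <linux/errno.h>

/* One-shot, setup-time invalidation of every CPU cache that may hold lines
 * from the shared region; far too heavy a hammer for per-I/O use. */
static int invalidate_shared_region_caches(void)
{
	if (!cpu_cache_has_invalidate_memregion())
		return -EOPNOTSUPP;	/* e.g. running as a guest */

	return cpu_cache_invalidate_memregion(IORES_DESC_CXL);
}
```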

Jonathan



> > ~Gregory
> >   


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2024-05-30 13:38 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20240422071606.52637-1-dongsheng.yang@easystack.cn>
     [not found] ` <66288ac38b770_a96f294c6@dwillia2-mobl3.amr.corp.intel.com.notmuch>
     [not found]   ` <ef34808b-d25d-c953-3407-aa833ad58e61@easystack.cn>
     [not found]     ` <ZikhwAAIGFG0UU23@memverge.com>
     [not found]       ` <bbf692ec-2109-baf2-aaae-7859a8315025@easystack.cn>
     [not found]         ` <ZiuwyIVaKJq8aC6g@memverge.com>
     [not found]           ` <98ae27ff-b01a-761d-c1c6-39911a000268@easystack.cn>
     [not found]             ` <ZivS86BrfPHopkru@memverge.com>
2024-04-28  5:47               ` [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dongsheng Yang
2024-04-28 16:44                 ` Gregory Price
2024-04-28 16:55                 ` John Groves
2024-05-03  9:52                   ` Jonathan Cameron
2024-05-08 11:39                     ` Dongsheng Yang
2024-05-08 12:11                       ` Jonathan Cameron
2024-05-08 13:03                         ` Dongsheng Yang
2024-05-08 15:44                           ` Jonathan Cameron
2024-05-09 11:24                             ` Dongsheng Yang
2024-05-09 12:21                               ` Jonathan Cameron
2024-05-09 13:03                                 ` Dongsheng Yang
2024-05-21 18:41                                   ` Dan Williams
     [not found]                                     ` <8f161b2d-eacd-ad35-8959-0f44c8d132b3@easystack.cn>
2024-05-29 15:25                                       ` Gregory Price
2024-05-30  6:59                                         ` Dongsheng Yang
2024-05-30 13:38                                           ` Jonathan Cameron
2024-04-30  0:34                 ` Dan Williams
