Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)

From: Dan Williams <dan.j.williams@intel.com>
To: Dongsheng Yang <dongsheng.yang@easystack.cn>,
	Gregory Price <gregory.price@memverge.com>,
	Dan Williams <dan.j.williams@intel.com>,
	"John Groves" <John@groves.net>
Cc: <axboe@kernel.dk>, <linux-block@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-cxl@vger.kernel.org>,
	<nvdimm@lists.linux.dev>
Subject: Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
Date: Mon, 29 Apr 2024 17:34:36 -0700
Message-ID: <66303c9c98f2_148729450@dwillia2-mobl3.amr.corp.intel.com.notmuch>
In-Reply-To: <8f373165-dd2b-906f-96da-41be9f27c208@easystack.cn>

Dongsheng Yang wrote:
> 
> 
> On Sat, 2024/4/27 at 12:14 AM, Gregory Price wrote:
> > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> >>
> >>
> >> On Fri, 2024/4/26 at 9:48 PM, Gregory Price wrote:
> >>>
> >>
> >> In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> >> at the software level:
> >>
> >> (5) How do blkdev and backend interact through the channel?
> >> 	a) For the reader side, before reading the data, if the data in this
> >> channel may be modified by the other party, then I need to flush the cache
> >> before reading to ensure that I get the latest data. For example, the blkdev
> >> needs to flush the cache before obtaining compr_head because compr_head will
> >> be updated by the backend handler.
> >> 	b) For the writer side, if the written information will be read by others,
> >> then after writing, I need to flush the cache to let the other party see it
> >> immediately. For example, after blkdev submits a cbd_se, it needs to update
> >> cmd_head so that the handler sees the new cbd_se. Therefore, after updating
> >> cmd_head, I need to flush the cache to let the backend see it.
> >>
> > 
> > Flushing the cache is insufficient.  All that cache flushing guarantees
> > is that the memory has left the writer's CPU cache.  There are potentially
> > many write buffers between the CPU and the actual backing media that the
> > CPU has no visibility of and cannot pierce through to force a full
> > guaranteed flush back to the media.
> > 
> > For example:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > 
> > Will not guarantee that after mfence() completes that the remote host
> > will have visibility of the data.  mfence() does not guarantee a full
> > flush back down to the device, it only guarantees it has been pushed out
> > of the CPU's cache.
> > 
> > Similarly:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > memcpy(some_other_cacheline, data, 64);
> > mfence();
> > 
> > Will not guarantee that some_cacheline reaches the backing media prior
> > to some_other_cacheline, as there is no guarantee of write-ordering in
> > CXL controllers (with the exception of writes to the same cacheline).
> > 
> > So this statement:
> > 
> >> I need to flush the cache to let the other party see it immediately.
> > 
> > is misleading.  They will not see it "immediately", they will see it
> > "eventually at some completely unknowable time in the future".
> 
> This is indeed one of the issues I wanted to discuss at the RFC stage. 
> Thank you for pointing it out.
> 
> In my opinion, using "nvdimm_flush" might be one way to address this 
> issue, but it seems to flush the entire nd_region, which might be too 
> heavy. Moreover, it only applies to non-volatile memory.
> 
> This should be a general problem for cxl shared memory. In theory, FAMFS 
> should also encounter this issue.
> 
> Gregory, John, and Dan, any suggestions?
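Gregory's objection to the writer steps in (5b) can be made concrete with a toy C model. This is a sketch under loudly stated assumptions, not a model of any real CXL device. `struct model`, `flush_line()`, `drain_line()` and `reader_sees_torn_state()` are hypothetical names. A "flush" is assumed to do nothing more than move a line out of the CPU cache into a posted write buffer. That buffer is assumed free to drain lines to the shared media in either order, which is the lack of cross-cacheline write ordering described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model (hypothetical, not a description of any real CXL device):
 * the writer owns two lines, the command payload and cmd_head. Flushing a
 * line only hands it to a posted write buffer outside the CPU; that buffer
 * may drain lines to the shared media in either order, and the remote host
 * can only read the media. */

enum { LINE_DATA = 0, LINE_HEAD = 1, NLINES = 2 };

#define PAYLOAD 0xC0FFEEu

struct model {
    unsigned cache[NLINES];     /* writer's CPU cache */
    unsigned pending[NLINES];   /* posted write buffer beyond the CPU */
    bool is_pending[NLINES];
    unsigned media[NLINES];     /* what the remote host can observe */
};

/* Assumed semantics of "flush + fence": the line leaves the CPU cache,
 * nothing more. */
static void flush_line(struct model *m, int line)
{
    assert(line >= 0 && line < NLINES);
    m->pending[line] = m->cache[line];
    m->is_pending[line] = true;
}

/* The buffer drains one pending line to the media, at a time the writer
 * can neither observe nor control. */
static void drain_line(struct model *m, int line)
{
    assert(line >= 0 && line < NLINES);
    if (m->is_pending[line]) {
        m->media[line] = m->pending[line];
        m->is_pending[line] = false;
    }
}

/* Runs the writer steps from (5b), drains line 'first' before the other,
 * and reports whether a remote reader sampling the media in between would
 * see the new cmd_head but the old payload. */
static bool reader_sees_torn_state(int first)
{
    struct model m;
    memset(&m, 0, sizeof(m));

    m.cache[LINE_DATA] = PAYLOAD;   /* write the command payload */
    flush_line(&m, LINE_DATA);
    m.cache[LINE_HEAD] = 1;         /* bump cmd_head */
    flush_line(&m, LINE_HEAD);

    drain_line(&m, first);
    bool torn = m.media[LINE_HEAD] == 1 && m.media[LINE_DATA] != PAYLOAD;

    drain_line(&m, 1 - first);      /* "eventually" both lines arrive */
    assert(m.media[LINE_DATA] == PAYLOAD && m.media[LINE_HEAD] == 1);
    return torn;
}
```

In this model, `reader_sees_torn_state(LINE_DATA)` returns false and `reader_sees_torn_state(LINE_HEAD)` returns true. The flushes fix the order in which lines leave the CPU, not the order in which they reach the media, so a remote reader can see the new `cmd_head` before the payload it refers to.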

The CXL equivalent is GPF (Global Persistent Flush), not to be confused
with "General Protection Fault", which is likely what will happen if
software needs to manage cache coherency for this solution. CXL GPF was
not designed to be triggered by software. It is a hardware response to a
power supply indicating loss of input power.

I do not think you want to spend community resources reviewing software
cache coherency considerations; instead, "just" mandate that this
solution requires inter-host hardware cache coherence. I understand that
is a difficult requirement to mandate, but it is likely less difficult
than getting Linux to carry a software cache coherence mitigation.
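For contrast, if inter-host hardware cache coherence is mandated, the cmd_head handshake from (5) needs no explicit flushes. It becomes an ordinary single-producer/single-consumer ring built on C11 release/acquire ordering. A minimal sketch follows. `struct cmd_ring`, `ring_submit()` and `ring_consume()` are hypothetical names, not code from the CBD patches. The sketch also assumes that C11 atomics behave correctly across two hosts mapping the same coherent memory. That assumption is exactly the hardware guarantee being mandated; the C standard itself does not promise it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch assuming inter-host hardware cache coherence: the ring lives in
 * shared memory and hardware keeps every host's view coherent, so plain
 * C11 release/acquire ordering is enough. Names are hypothetical. */

#define RING_SLOTS 8u /* power of two, so uint32_t wrap-around stays correct */

struct cmd_ring {
    _Atomic uint32_t cmd_head;  /* written only by the submitter */
    _Atomic uint32_t cmd_tail;  /* written only by the handler */
    uint64_t slot[RING_SLOTS];
};

/* Submitter: fill the slot, then publish it with a release store so the
 * handler cannot observe the new head before the slot contents. */
static bool ring_submit(struct cmd_ring *r, uint64_t cmd)
{
    uint32_t head = atomic_load_explicit(&r->cmd_head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->cmd_tail, memory_order_acquire);

    if (head - tail == RING_SLOTS)
        return false;                   /* ring full */
    r->slot[head % RING_SLOTS] = cmd;
    atomic_store_explicit(&r->cmd_head, head + 1, memory_order_release);
    return true;
}

/* Handler: the acquire load of cmd_head pairs with the release store in
 * ring_submit(); the release store of cmd_tail frees the slot for reuse. */
static bool ring_consume(struct cmd_ring *r, uint64_t *cmd)
{
    uint32_t tail = atomic_load_explicit(&r->cmd_tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->cmd_head, memory_order_acquire);

    assert(head - tail <= RING_SLOTS);  /* sanity check on the indices */
    if (head == tail)
        return false;                   /* ring empty */
    *cmd = r->slot[tail % RING_SLOTS];
    atomic_store_explicit(&r->cmd_tail, tail + 1, memory_order_release);
    return true;
}
```

The release store to `cmd_head` orders the slot write before the new head becomes visible, and the handler's acquire load pairs with it. The same pairing is used in the other direction for `cmd_tail`, so the submitter never overwrites a slot the handler is still reading.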

In some ways this reminds me of SMR drives and the problems they posed
to software, where ultimately the programming difficulties needed to be
solved in hardware, not exported to the Linux kernel to solve.