* [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
@ 2024-04-22  7:15 Dongsheng Yang
  2024-04-22  7:16 ` [PATCH 1/7] block: Init for CBD(CXL " Dongsheng Yang
                   ` (7 more replies)
  0 siblings, 8 replies; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-22  7:15 UTC (permalink / raw)
  To: dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>

Hi all,
	This patchset introduces cbd (CXL block device). It is based on Linux 6.8 and is available at:
	https://github.com/DataTravelGuide/linux

(1) What is cbd:
	As shared memory is supported in the CXL 3.0 spec, we can transfer data
via CXL shared memory. CBD means CXL block device; it uses CXL shared memory
to transfer commands and data in order to access a block device on a different
host, as shown below:

┌───────────────────────────────┐                               ┌────────────────────────────────────┐
│          node-1               │                               │              node-2                │
├───────────────────────────────┤                               ├────────────────────────────────────┤
│                               │                               │                                    │
│                       ┌───────┤                               ├─────────┐                          │
│                       │ cbd0  │                               │ backend0├──────────────────┐       │
│                       ├───────┤                               ├─────────┤                  │       │
│                       │ pmem0 │                               │ pmem0   │                  ▼       │
│               ┌───────┴───────┤                               ├─────────┴────┐     ┌───────────────┤
│               │    cxl driver │                               │ cxl driver   │     │ /dev/sda      │
└───────────────┴────────┬──────┘                               └─────┬────────┴─────┴───────────────┘
                         │                                            │                               
                         │                                            │                               
                         │        CXL                         CXL     │                               
                         └────────────────┐               ┌───────────┘                               
                                          │               │                                           
                                          │               │                                           
                                          │               │                                           
                                      ┌───┴───────────────┴────────────────────┐
                                      │   shared memory device(cbd transport)  │
                                      └────────────────────────────────────────┘

Any read/write to cbd0 on node-1 will be transferred to /dev/sda on node-2. It works similarly to
nbd (network block device), but it transfers data via CXL shared memory rather than the network.

(2) Layout of transport:

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                           cbd transport                                                                                       │
├────────────────────┬───────────────────────┬───────────────────────┬──────────────────────┬───────────────────────────────────┤
│                    │       hosts           │      backends         │       blkdevs        │        channels                   │
│ cbd transport info ├────┬────┬────┬────────┼────┬────┬────┬────────┼────┬────┬────┬───────┼───────┬───────┬───────┬───────────┤
│                    │    │    │    │  ...   │    │    │    │  ...   │    │    │    │  ...  │       │       │       │   ...     │
└────────────────────┴────┴────┴────┴────────┴────┴────┴────┴────────┴────┴────┴────┴───────┴───┬───┴───────┴───────┴───────────┘
                                                                                                │                                
                                                                                                │                                
                                                                                                │                                
                                                                                                │                                
          ┌─────────────────────────────────────────────────────────────────────────────────────┘                                
          │                                                                                                                      
          │                                                                                                                      
          ▼                                                                                                                      
    ┌───────────────────────────────────────────────────────────┐                                                                
    │                     channel                               │                                                                
    ├────────────────────┬──────────────────────────────────────┤                                                                
    │    channel meta    │              channel data            │                                                                
    └─────────┬──────────┴──────────────────────────────────────┘
              │                                                                                                                  
              │                                                                                                                  
              │                                                                                                                  
              ▼                                                                                                                  
    ┌──────────────────────────────────────────────────────────┐                                                                 
    │                 channel meta                             │                                                                 
    ├───────────┬──────────────┬───────────────────────────────┤                                                                 
    │ meta ctrl │  comp ring   │       cmd ring                │                                                                 
    └───────────┴──────────────┴───────────────────────────────┘                                                                 

The shared memory is divided into five regions (a to e); each channel is further subdivided as described in f):

    a) Transport_info:
	Information about the overall transport, including the layout of the transport.
    b) Hosts:
	Each host wishing to utilize this transport needs to register its own information within a host entry in this region.
    c) Backends:
	Starting a backend on a host requires filling in information in a backend entry within this region.
    d) Blkdevs:
	Once a backend is established, it can be mapped to any associated host. The information about the blkdevs is then filled into the blkdevs region.
    e) Channels:
	This is the actual data communication area, where communication between blkdev and backend occurs. Each queue of a block device uses a channel, and each backend has a corresponding handler interacting with this queue.
    f) Channel:
	Channel is further divided into meta and data regions.
	The meta region includes cmd rings and comp rings. The blkdev converts upper-layer requests into cbd_se and fills them into the cmd ring.
	The handler accepts the cbd_se from the cmd ring and sends them to the local actual block device of the backend (e.g., sda).
	After completion, the results are formed into cbd_ce and filled into the comp ring.
	The blkdev then receives the cbd_ce and returns the results to the upper-layer IO sender.
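
The round trip described in f) can be sketched in plain C. This is a minimal userspace model, not the driver code: the demo_se/demo_ce layouts, the fixed-slot rings and all demo_ function names are assumptions made for illustration (the real cbd_se and cbd_ce are defined in cbd_internal.h, and the real rings are laid out inside the channel meta region).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RING_SLOTS 8u	/* hypothetical ring depth */

/* Hypothetical stand-ins for cbd_se (submission) and cbd_ce (completion). */
struct demo_se { uint64_t req_tid; uint64_t offset; uint32_t len; };
struct demo_ce { uint64_t req_tid; int32_t result; };

struct demo_channel {
	struct demo_se cmdr[DEMO_RING_SLOTS];	/* cmd ring */
	struct demo_ce compr[DEMO_RING_SLOTS];	/* comp ring */
	uint32_t cmd_head, cmd_tail;	/* blkdev produces, handler consumes */
	uint32_t comp_head, comp_tail;	/* handler produces, blkdev consumes */
};

static uint32_t demo_ring_next(uint32_t idx)
{
	return (idx + 1) % DEMO_RING_SLOTS;
}

/* blkdev side: queue one submission entry; returns -1 if the cmd ring is full. */
static int demo_submit(struct demo_channel *ch, struct demo_se se)
{
	assert(ch != NULL);
	if (demo_ring_next(ch->cmd_head) == ch->cmd_tail)
		return -1;
	ch->cmdr[ch->cmd_head] = se;
	ch->cmd_head = demo_ring_next(ch->cmd_head);
	return 0;
}

/* handler side: consume one entry and post its completion; returns 1 if work was done. */
static int demo_handle_one(struct demo_channel *ch)
{
	struct demo_se se;

	assert(ch != NULL);
	if (ch->cmd_tail == ch->cmd_head)
		return 0;	/* nothing submitted */
	if (demo_ring_next(ch->comp_head) == ch->comp_tail)
		return 0;	/* comp ring full, retry later */

	se = ch->cmdr[ch->cmd_tail];
	ch->cmd_tail = demo_ring_next(ch->cmd_tail);

	/* A real handler would issue the I/O to the backing device (e.g. sda) here. */
	ch->compr[ch->comp_head] = (struct demo_ce){ .req_tid = se.req_tid, .result = 0 };
	ch->comp_head = demo_ring_next(ch->comp_head);
	return 1;
}

/* blkdev side: reap one completion into *ce; returns 1 if one was available. */
static int demo_reap(struct demo_channel *ch, struct demo_ce *ce)
{
	assert(ch != NULL && ce != NULL);
	if (ch->comp_tail == ch->comp_head)
		return 0;
	*ce = ch->compr[ch->comp_tail];
	ch->comp_tail = demo_ring_next(ch->comp_tail);
	return 1;
}
```

Calling demo_submit(), demo_handle_one() and demo_reap() in that order on a zero-initialized demo_channel walks one request through the same submit, handle, complete sequence as above. The model ignores cache flushing entirely, which is covered in (5).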

    Currently, the number of entries in each region and the channel size are both set to default values. In the future, they will be made configurable.
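
To make the default layout concrete, the region offsets can be worked out from the constants that patch 1/7 defines in cbd_internal.h: 4 KiB info pages, 16 entries per region, and 4 MiB of channel meta plus 16 MiB of channel data per channel. The sketch below mirrors those macros in userspace C; the demo_ names are illustrative and not part of the driver.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_PAGE_SIZE	(1ULL << 12)			/* CBD_PAGE_SIZE */
#define DEMO_ENTRY_NUM	16ULL				/* hosts, backends, blkdevs, channels */
#define DEMO_META_SIZE	(1024ULL * DEMO_PAGE_SIZE)	/* CBDC_META_SIZE */
#define DEMO_DATA_SIZE	(16ULL * 1024 * 1024)		/* CBDC_DATA_SIZE */

struct demo_layout {
	uint64_t host_off;
	uint64_t backend_off;
	uint64_t blkdev_off;
	uint64_t channel_off;
	uint64_t channel_size;
	uint64_t total;
};

/* Compute the byte offset of each region for the default sizes. */
static struct demo_layout demo_layout_default(void)
{
	struct demo_layout l;

	l.host_off = DEMO_PAGE_SIZE;	/* the transport info page comes first */
	l.backend_off = l.host_off + DEMO_PAGE_SIZE * DEMO_ENTRY_NUM;
	l.blkdev_off = l.backend_off + DEMO_PAGE_SIZE * DEMO_ENTRY_NUM;
	l.channel_off = l.blkdev_off + DEMO_PAGE_SIZE * DEMO_ENTRY_NUM;
	l.channel_size = DEMO_META_SIZE + DEMO_DATA_SIZE;
	l.total = l.channel_off + l.channel_size * DEMO_ENTRY_NUM;
	assert(l.total > l.channel_off);
	return l;
}

/* Print the layout, one region per line. */
static void demo_layout_print(const struct demo_layout *l)
{
	printf("hosts    @ %llu\n", (unsigned long long)l->host_off);
	printf("backends @ %llu\n", (unsigned long long)l->backend_off);
	printf("blkdevs  @ %llu\n", (unsigned long long)l->blkdev_off);
	printf("channels @ %llu (%llu bytes each)\n",
	       (unsigned long long)l->channel_off,
	       (unsigned long long)l->channel_size);
	printf("total      %llu bytes\n", (unsigned long long)l->total);
}
```

With these defaults the transport occupies 335745024 bytes (about 320.2 MiB), so it fits in the 502 MiB pmem namespace used for the QEMU test in (8).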

(3) Naming of CBD:
	Strictly speaking, cbd does not depend on CXL; any shared memory can be used for cbd. But
I could not find a better name, maybe smxbd (shared memory transport block device)? I chose
CBD because it sounds more concise and elegant. Any suggestions?

(4) dax is not supported yet:
	As with famfs, dax devices are not supported here, because dax devices do not support
dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.

(5) How do blkdev and backend interact through the channel?
	a) On the reader side: if the data in the channel may have been modified by the other party, I need to flush the cache before reading to ensure that I get the latest data. For example, the blkdev needs to flush the cache before reading compr_head, because compr_head is updated by the backend handler.
	b) On the writer side: if the written information will be read by the other party, I need to flush the cache after writing so that the other party sees it immediately. For example, after the blkdev submits a cbd_se, it updates cmd_head to tell the handler that a new cbd_se is available. Therefore, after updating cmd_head, I need to flush the cache so that the backend sees it.
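
The two rules above amount to "write, then flush" on the producer side and "flush, then read" on the consumer side. The sketch below models only that ordering, in single-threaded userspace C. demo_flush() is a hypothetical stand-in for whatever cache flush primitive the driver applies to the shared-memory range; here it is just a C11 fence, which shows where the flushes go but does not by itself make memory shared between hosts coherent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the channel control area. */
struct demo_ctrl {
	uint32_t cmd_head;	/* written by the blkdev, read by the handler */
	uint32_t cmd_tail;	/* written by the handler, read by the blkdev */
};

/* Stand-in for a cache flush over [addr, addr + len). Assumption: fence only. */
static void demo_flush(const void *addr, size_t len)
{
	(void)addr;
	(void)len;
	atomic_thread_fence(memory_order_seq_cst);
}

/* Writer rule: update the shared field first, then flush so the peer can see it. */
static void demo_publish_cmd_head(struct demo_ctrl *ctrl, uint32_t new_head)
{
	assert(ctrl != NULL);
	ctrl->cmd_head = new_head;
	demo_flush(&ctrl->cmd_head, sizeof(ctrl->cmd_head));
}

/* Reader rule: flush first so a stale cached copy is not used, then read. */
static uint32_t demo_read_cmd_head(struct demo_ctrl *ctrl)
{
	assert(ctrl != NULL);
	demo_flush(&ctrl->cmd_head, sizeof(ctrl->cmd_head));
	return ctrl->cmd_head;
}
```

The same pairing applies in the other direction: the handler would publish compr_head with the writer rule, and the blkdev would read it with the reader rule.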

(6) race between management operations:
	There may be a race condition. For example, if we run backend-start on different nodes at the same time,
it is possible for both to allocate the same backend ID. This issue should be handled by the upper-layer
manager, which must ensure that all management operations are serialized, for example by acquiring a distributed lock.
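
The race comes from "find a free entry" and "claim it" being two separate steps on shared memory. The following single-threaded model lets a caller replay both the problematic interleaving and the serialized one. The slot table and the demo_ helpers are assumptions for illustration only; the real allocation lives in the transport code, and the serialization would come from a lock held by the management layer, which is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_BACKEND_NUM 16	/* matches the default number of backend entries */

/* Hypothetical view of the backends region: one "in use" flag per entry. */
struct demo_transport {
	bool backend_used[DEMO_BACKEND_NUM];
};

/* Step 1: scan for the first free backend entry; returns -1 if none is free. */
static int demo_find_free_backend(const struct demo_transport *t)
{
	int i;

	assert(t != NULL);
	for (i = 0; i < DEMO_BACKEND_NUM; i++) {
		if (!t->backend_used[i])
			return i;
	}
	return -1;
}

/* Step 2: mark the chosen entry as used. */
static void demo_claim_backend(struct demo_transport *t, int id)
{
	assert(t != NULL);
	assert(id >= 0 && id < DEMO_BACKEND_NUM);
	t->backend_used[id] = true;
}

/*
 * Serialized variant: find and claim run as one unit, as they would when the
 * caller holds the management layer's lock for the whole operation.
 */
static int demo_backend_start_serialized(struct demo_transport *t)
{
	int id = demo_find_free_backend(t);

	if (id >= 0)
		demo_claim_backend(t, id);
	return id;
}
```

If two nodes both run demo_find_free_backend() before either runs demo_claim_backend(), both get ID 0. Two calls to demo_backend_start_serialized() on a fresh table return 0 and then 1.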

(7) What's Next?
	This is a first version of CBD, and there is still much work to be done, such as: how to recover a backend service when a backend node fails? How to gracefully stop the associated blkdevs when a backend service cannot be recovered? How to clear dead information within the transport layer? For non-volatile memory transport, allocating a new area as a Write-Ahead Log (WAL) may be considered.

(8) testing with qemu:
	We can use two QEMU virtual machines to test CBD by sharing a CXLMemDev:

  a)  Start two QEMU virtual machines, sharing a CXLMemDev.

	root@qemu-2:~# cxl list
	[
	  {
	    "memdev":"mem0",
	    "pmem_size":536870912,
	    "serial":0,
	    "host":"0000:0d:00.0"
	  }
	]

	root@qemu-1:~# cxl list
	[
	  {
	    "memdev":"mem0",
	    "pmem_size":536870912,
	    "serial":0,
	    "host":"0000:0d:00.0"
	  }
	]

  b)  Register a CBD transport on node-1 and add a backend, specifying the path as /dev/ram0p1.
	root@qemu-1:~# cxl create-region -m mem0 -d decoder0.0 -t pmem
	{
	  "region":"region0",
	  "resource":"0x1890000000",
	  "size":"512.00 MiB (536.87 MB)",
	  "type":"pmem",
	  "interleave_ways":1,
	  "interleave_granularity":256,
	  "decode_state":"commit",
	  "mappings":[
	    {
	      "position":0,
	      "memdev":"mem0",
	      "decoder":"decoder2.0"
	    }
	  ]
	}
	cxl region: cmd_create_region: created 1 region
	root@qemu-1:~# ndctl create-namespace -r region0 -m fsdax --map dev -t pmem

	{
	  "dev":"namespace0.0",
	  "mode":"fsdax",
	  "map":"dev",
	  "size":"502.00 MiB (526.39 MB)",
	  "uuid":"618e9627-4345-4046-ba46-becf430a1464",
	  "sector_size":512,
	  "align":2097152,
	  "blockdev":"pmem0"
	}
	root@qemu-1:~# echo "path=/dev/pmem0,hostname=node-1,force=1,format=1" >  /sys/bus/cbd/transport_register
	root@qemu-1:~# echo "op=backend-start,path=/dev/ram0p1" > /sys/bus/cbd/devices/transport0/adm

  c)  Register a CBD transport on node-2 and add a blkdev, specifying the backend ID as the backend on node-1.
	root@qemu-2:~# cxl create-region -m mem0 -d decoder0.0 -t pmem
	{
	  "region":"region0",
	  "resource":"0x390000000",
	  "size":"512.00 MiB (536.87 MB)",
	  "type":"pmem",
	  "interleave_ways":1,
	  "interleave_granularity":256,
	  "decode_state":"commit",
	  "mappings":[
	    {
	      "position":0,
	      "memdev":"mem0",
	      "decoder":"decoder2.0"
	    }
	  ]
	}
	cxl region: cmd_create_region: created 1 region
	root@qemu-2:~# ndctl create-namespace -r region0 -m fsdax --map dev -t pmem -b 0
	{
	  "dev":"namespace0.0",
	  "mode":"fsdax",
	  "map":"dev",
	  "size":"502.00 MiB (526.39 MB)",
	  "uuid":"a7fae1a5-2cba-46d7-83a2-20a76d736848",
	  "sector_size":512,
	  "align":2097152,
	  "blockdev":"pmem0"
	}
	root@qemu-2:~# echo "path=/dev/pmem0,hostname=node-2" > /sys/bus/cbd/transport_register
	root@qemu-2:~# echo "op=dev-start,backend_id=0,queues=1" > /sys/bus/cbd/devices/transport0/adm

  d)  On node-2, you will get a /dev/cbd0, and all reads and writes to cbd0 will actually read from and write to /dev/ram0p1 on node-1.
	root@qemu-2:~# mkfs.xfs -f /dev/cbd0
	meta-data=/dev/cbd0              isize=512    agcount=4, agsize=655360 blks
		 =                       sectsz=4096  attr=2, projid32bit=1
		 =                       crc=1        finobt=1, sparse=1, rmapbt=0
		 =                       reflink=1    bigtime=0 inobtcount=0
	data     =                       bsize=4096   blocks=2621440, imaxpct=25
		 =                       sunit=0      swidth=0 blks
	naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
	log      =internal log           bsize=4096   blocks=2560, version=2
		 =                       sectsz=4096  sunit=1 blks, lazy-count=1
	realtime =none                   extsz=4096   blocks=0, rtextents=0
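
The strings written to transport_register and adm in steps b) and c) are comma-separated key=value lists. As a rough userspace illustration of how such a string breaks down, the helper below looks up a single key. demo_get_opt() is a hypothetical name and this is not the driver's parser (cbd_internal.h includes <linux/parser.h> for that purpose).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the value of `key` from a "k1=v1,k2=v2" string into `out`.
 * Returns 0 on success, -1 if the key is absent or the value does not fit.
 */
static int demo_get_opt(const char *opts, const char *key, char *out, size_t outlen)
{
	size_t klen = strlen(key);
	const char *p = opts;

	assert(outlen > 0);
	while (*p) {
		const char *end = strchr(p, ',');
		size_t toklen = end ? (size_t)(end - p) : strlen(p);

		/* A match needs "key" followed directly by '=' inside this token. */
		if (toklen > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
			size_t vlen = toklen - klen - 1;

			if (vlen >= outlen)
				return -1;
			memcpy(out, p + klen + 1, vlen);
			out[vlen] = '\0';
			return 0;
		}

		/* Advance to the next token, skipping the separating comma. */
		p += toklen;
		if (*p == ',')
			p++;
	}
	return -1;
}
```

For the dev-start string from step c), looking up "op" yields "dev-start" and looking up "backend_id" yields "0"; a key that is not present, such as "path", returns -1.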


Thanx

Dongsheng Yang (7):
  block: Init for CBD(CXL Block Device)
  cbd: introduce cbd_transport
  cbd: introduce cbd_channel
  cbd: introduce cbd_host
  cbd: introduce cbd_backend
  cbd: introduce cbd_blkdev
  cbd: add related sysfs files in transport register

 drivers/block/Kconfig             |   2 +
 drivers/block/Makefile            |   2 +
 drivers/block/cbd/Kconfig         |   4 +
 drivers/block/cbd/Makefile        |   3 +
 drivers/block/cbd/cbd_backend.c   | 254 +++++++++
 drivers/block/cbd/cbd_blkdev.c    | 375 +++++++++++++
 drivers/block/cbd/cbd_channel.c   | 179 +++++++
 drivers/block/cbd/cbd_handler.c   | 261 +++++++++
 drivers/block/cbd/cbd_host.c      | 123 +++++
 drivers/block/cbd/cbd_internal.h  | 830 +++++++++++++++++++++++++++++
 drivers/block/cbd/cbd_main.c      | 230 ++++++++
 drivers/block/cbd/cbd_queue.c     | 621 ++++++++++++++++++++++
 drivers/block/cbd/cbd_transport.c | 845 ++++++++++++++++++++++++++++++
 13 files changed, 3729 insertions(+)
 create mode 100644 drivers/block/cbd/Kconfig
 create mode 100644 drivers/block/cbd/Makefile
 create mode 100644 drivers/block/cbd/cbd_backend.c
 create mode 100644 drivers/block/cbd/cbd_blkdev.c
 create mode 100644 drivers/block/cbd/cbd_channel.c
 create mode 100644 drivers/block/cbd/cbd_handler.c
 create mode 100644 drivers/block/cbd/cbd_host.c
 create mode 100644 drivers/block/cbd/cbd_internal.h
 create mode 100644 drivers/block/cbd/cbd_main.c
 create mode 100644 drivers/block/cbd/cbd_queue.c
 create mode 100644 drivers/block/cbd/cbd_transport.c

-- 
2.34.1



* [PATCH 1/7] block: Init for CBD(CXL Block Device)
  2024-04-22  7:15 [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dongsheng Yang
@ 2024-04-22  7:16 ` Dongsheng Yang
  2024-04-22 18:39   ` Randy Dunlap
  2024-04-24  3:58   ` Chaitanya Kulkarni
  2024-04-22  7:16 ` [PATCH 2/7] cbd: introduce cbd_transport Dongsheng Yang
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-22  7:16 UTC (permalink / raw)
  To: dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>

As shared memory is supported in the CXL 3.0 spec, we can transfer data
via CXL shared memory. CBD means CXL block device; it uses CXL shared memory
to transfer commands and data in order to access a block device on a different
host, as shown below:

┌───────────────────────────────┐                               ┌────────────────────────────────────┐
│          node-1               │                               │              node-2                │
├───────────────────────────────┤                               ├────────────────────────────────────┤
│                               │                               │                                    │
│                       ┌───────┤                               ├─────────┐                          │
│                       │ cbd0  │                               │ backend0├──────────────────┐       │
│                       ├───────┤                               ├─────────┤                  │       │
│                       │ pmem0 │                               │ pmem0   │                  ▼       │
│               ┌───────┴───────┤                               ├─────────┴────┐     ┌───────────────┤
│               │    cxl driver │                               │ cxl driver   │     │ /dev/sda      │
└───────────────┴────────┬──────┘                               └─────┬────────┴─────┴───────────────┘
                         │                                            │
                         │                                            │
                         │        CXL                         CXL     │
                         └────────────────┐               ┌───────────┘
                                          │               │
                                          │               │
                                          │               │
                                      ┌───┴───────────────┴────────────────────┐
                                      │   shared memory device(cbd transport)  │
                                      └────────────────────────────────────────┘

Any read/write to cbd0 on node-1 will be transferred to /dev/sda on node-2. It works similarly to
nbd (network block device), but it transfers data via CXL shared memory rather than the network.

Signed-off-by: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
---
 drivers/block/Kconfig            |   2 +
 drivers/block/Makefile           |   2 +
 drivers/block/cbd/Kconfig        |   4 +
 drivers/block/cbd/Makefile       |   3 +
 drivers/block/cbd/cbd_internal.h | 830 +++++++++++++++++++++++++++++++
 drivers/block/cbd/cbd_main.c     | 216 ++++++++
 6 files changed, 1057 insertions(+)
 create mode 100644 drivers/block/cbd/Kconfig
 create mode 100644 drivers/block/cbd/Makefile
 create mode 100644 drivers/block/cbd/cbd_internal.h
 create mode 100644 drivers/block/cbd/cbd_main.c

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 5b9d4aaebb81..1f6376828af9 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -219,6 +219,8 @@ config BLK_DEV_NBD
 
 	  If unsure, say N.
 
+source "drivers/block/cbd/Kconfig"
+
 config BLK_DEV_RAM
 	tristate "RAM block device support"
 	help
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 101612cba303..8be2a39f5a7c 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -39,4 +39,6 @@ obj-$(CONFIG_BLK_DEV_NULL_BLK)	+= null_blk/
 
 obj-$(CONFIG_BLK_DEV_UBLK)			+= ublk_drv.o
 
+obj-$(CONFIG_BLK_DEV_CBD)	+= cbd/
+
 swim_mod-y	:= swim.o swim_asm.o
diff --git a/drivers/block/cbd/Kconfig b/drivers/block/cbd/Kconfig
new file mode 100644
index 000000000000..98b2cbcdf895
--- /dev/null
+++ b/drivers/block/cbd/Kconfig
@@ -0,0 +1,4 @@
+config BLK_DEV_CBD
+	tristate "CXL Block Device"
+	help
+	  If unsure say 'm'.
diff --git a/drivers/block/cbd/Makefile b/drivers/block/cbd/Makefile
new file mode 100644
index 000000000000..2765325486a2
--- /dev/null
+++ b/drivers/block/cbd/Makefile
@@ -0,0 +1,3 @@
+cbd-y := cbd_main.o
+
+obj-$(CONFIG_BLK_DEV_CBD) += cbd.o
diff --git a/drivers/block/cbd/cbd_internal.h b/drivers/block/cbd/cbd_internal.h
new file mode 100644
index 000000000000..7d9bf5b1c70d
--- /dev/null
+++ b/drivers/block/cbd/cbd_internal.h
@@ -0,0 +1,830 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _CBD_INTERNAL_H
+#define _CBD_INTERNAL_H
+
+#include <linux/kernel.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/blk-mq.h>
+#include <asm/byteorder.h>
+#include <asm/types.h>
+#include <linux/types.h>
+#include <linux/delay.h>
+#include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/blkdev.h>
+#include <linux/slab.h>
+#include <linux/parser.h>
+#include <linux/idr.h>
+#include <linux/workqueue.h>
+#include <linux/uuid.h>
+#include <linux/bitfield.h>
+
+/*
+ * As shared memory is supported in the CXL 3.0 spec, we can transfer data via CXL shared memory.
+ * CBD means CXL block device; it uses CXL shared memory to transport commands and data in order to
+ * access a block device on a different host, as shown below:
+ *
+ *    ┌───────────────────────────────┐                               ┌────────────────────────────────────┐
+ *    │          node-1               │                               │              node-2                │
+ *    ├───────────────────────────────┤                               ├────────────────────────────────────┤
+ *    │                               │                               │                                    │
+ *    │                       ┌───────┤                               ├─────────┐                          │
+ *    │                       │ cbd0  │                               │ backend0├──────────────────┐       │
+ *    │                       ├───────┤                               ├─────────┤                  │       │
+ *    │                       │ pmem0 │                               │ pmem0   │                  ▼       │
+ *    │               ┌───────┴───────┤                               ├─────────┴────┐     ┌───────────────┤
+ *    │               │    cxl driver │                               │ cxl driver   │     │  /dev/sda     │
+ *    └───────────────┴────────┬──────┘                               └─────┬────────┴─────┴───────────────┘
+ *                             │                                            │
+ *                             │                                            │
+ *                             │        CXL                         CXL     │
+ *                             └────────────────┐               ┌───────────┘
+ *                                              │               │
+ *                                              │               │
+ *                                              │               │
+ *                                          ┌───┴───────────────┴─────┐
+ *                                          │   shared memory device  │
+ *                                          └─────────────────────────┘
+ *
+ * Any read/write to cbd0 on node-1 will be transferred to /dev/sda on node-2. It works similarly to
+ * nbd (network block device), but it transfers data via CXL shared memory rather than the network.
+ */
+
+/* printk */
+#define cbd_err(fmt, ...)							\
+	pr_err("cbd: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__)
+#define cbd_info(fmt, ...)							\
+	pr_info("cbd: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__)
+#define cbd_debug(fmt, ...)							\
+	pr_debug("cbd: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__)
+
+#define cbdt_err(transport, fmt, ...)						\
+	cbd_err("cbd_transport%u: " fmt,					\
+		 transport->id, ##__VA_ARGS__)
+#define cbdt_info(transport, fmt, ...)						\
+	cbd_info("cbd_transport%u: " fmt,					\
+		 transport->id, ##__VA_ARGS__)
+#define cbdt_debug(transport, fmt, ...)						\
+	cbd_debug("cbd_transport%u: " fmt,					\
+		 transport->id, ##__VA_ARGS__)
+
+#define cbd_backend_err(backend, fmt, ...)					\
+	cbdt_err(backend->cbdt, "backend%d: " fmt,				\
+		 backend->backend_id, ##__VA_ARGS__)
+#define cbd_backend_info(backend, fmt, ...)					\
+	cbdt_info(backend->cbdt, "backend%d: " fmt,				\
+		 backend->backend_id, ##__VA_ARGS__)
+#define cbd_backend_debug(backend, fmt, ...)					\
+	cbdt_debug(backend->cbdt, "backend%d: " fmt,				\
+		 backend->backend_id, ##__VA_ARGS__)
+
+#define cbd_handler_err(handler, fmt, ...)					\
+	cbd_backend_err(handler->cbdb, "handler%d: " fmt,			\
+		 handler->channel.channel_id, ##__VA_ARGS__)
+#define cbd_handler_info(handler, fmt, ...)					\
+	cbd_backend_info(handler->cbdb, "handler%d: " fmt,			\
+		 handler->channel.channel_id, ##__VA_ARGS__)
+#define cbd_handler_debug(handler, fmt, ...)					\
+	cbd_backend_debug(handler->cbdb, "handler%d: " fmt,			\
+		 handler->channel.channel_id, ##__VA_ARGS__)
+
+#define cbd_blk_err(dev, fmt, ...)						\
+	cbdt_err(dev->cbdt, "cbd%d: " fmt,					\
+		 dev->mapped_id, ##__VA_ARGS__)
+#define cbd_blk_info(dev, fmt, ...)						\
+	cbdt_info(dev->cbdt, "cbd%d: " fmt,					\
+		 dev->mapped_id, ##__VA_ARGS__)
+#define cbd_blk_debug(dev, fmt, ...)						\
+	cbdt_debug(dev->cbdt, "cbd%d: " fmt,					\
+		 dev->mapped_id, ##__VA_ARGS__)
+
+#define cbd_queue_err(queue, fmt, ...)						\
+	cbd_blk_err(queue->cbd_blkdev, "queue-%d: " fmt,			\
+		     queue->index, ##__VA_ARGS__)
+#define cbd_queue_info(queue, fmt, ...)						\
+	cbd_blk_info(queue->cbd_blkdev, "queue-%d: " fmt,			\
+		     queue->index, ##__VA_ARGS__)
+#define cbd_queue_debug(queue, fmt, ...)					\
+	cbd_blk_debug(queue->cbd_blkdev, "queue-%d: " fmt,			\
+		     queue->index, ##__VA_ARGS__)
+
+#define cbd_channel_err(channel, fmt, ...)					\
+	cbdt_err(channel->cbdt, "channel%d: " fmt,				\
+		 channel->channel_id, ##__VA_ARGS__)
+#define cbd_channel_info(channel, fmt, ...)					\
+	cbdt_info(channel->cbdt, "channel%d: " fmt,				\
+		 channel->channel_id, ##__VA_ARGS__)
+#define cbd_channel_debug(channel, fmt, ...)					\
+	cbdt_debug(channel->cbdt, "channel%d: " fmt,				\
+		 channel->channel_id, ##__VA_ARGS__)
+
+#define CBD_PAGE_SHIFT		12
+#define CBD_PAGE_SIZE		(1 << CBD_PAGE_SHIFT)
+#define CBD_PAGE_MASK		(CBD_PAGE_SIZE - 1)
+
+#define CBD_TRANSPORT_MAX	1024
+#define CBD_PATH_LEN	512
+#define CBD_NAME_LEN	32
+
+/* TODO support multi queue */
+#define CBD_QUEUES_MAX		1
+
+#define CBD_PART_SHIFT 4
+#define CBD_DRV_NAME "cbd"
+#define CBD_DEV_NAME_LEN 32
+
+#define CBD_HB_INTERVAL		msecs_to_jiffies(5000) /* 5s */
+#define CBD_HB_TIMEOUT		(30 * 1000) /* 30s */
+
+/*
+ * CBD transport layout:
+ *
+ *      +-------------------------------------------------------------------------------------------------------------------------------+
+ *      |                           cbd transport                                                                                       |
+ *      +--------------------+-----------------------+-----------------------+----------------------+-----------------------------------+
+ *      |                    |       hosts           |      backends         |       blkdevs        |        channels                   |
+ *      | cbd transport info +----+----+----+--------+----+----+----+--------+----+----+----+-------+-------+-------+-------+-----------+
+ *      |                    |    |    |    |  ...   |    |    |    |  ...   |    |    |    |  ...  |       |       |       |   ...     |
+ *      +--------------------+----+----+----+--------+----+----+----+--------+----+----+----+-------+---+---+-------+-------+-----------+
+ *                                                                                                      |
+ *                                                                                                      |
+ *                                                                                                      |
+ *                                                                                                      |
+ *                +-------------------------------------------------------------------------------------+
+ *                |
+ *                |
+ *                v
+ *          +-----------------------------------------------------------+
+ *          |                     channel                               |
+ *          +--------------------+--------------------------------------+
+ *          |    channel meta    |              channel data            |
+ *          +---------+----------+--------------------------------------+
+ *                    |
+ *                    |
+ *                    |
+ *                    v
+ *          +----------------------------------------------------------+
+ *          |                 channel meta                             |
+ *          +-----------+--------------+-------------------------------+
+ *          | meta ctrl |  comp ring   |       cmd ring                |
+ *          +-----------+--------------+-------------------------------+
+ */
+
+/* cbd channel */
+#define CBD_OP_ALIGN_SIZE	sizeof(u64)
+#define CBDC_META_SIZE		(1024 * CBD_PAGE_SIZE)
+#define CBDC_CMDR_RESERVED	CBD_OP_ALIGN_SIZE
+#define CBDC_CMPR_RESERVED	sizeof(struct cbd_ce)
+
+#define CBDC_CTRL_OFF		0
+#define CBDC_CTRL_SIZE		CBD_PAGE_SIZE
+#define CBDC_COMPR_OFF		(CBDC_CTRL_OFF + CBDC_CTRL_SIZE)
+#define CBDC_COMPR_SIZE		(sizeof(struct cbd_ce) * 1024)
+#define CBDC_CMDR_OFF		(CBDC_COMPR_OFF + CBDC_COMPR_SIZE)
+#define CBDC_CMDR_SIZE		(CBDC_META_SIZE - CBDC_CMDR_OFF)
+
+#define CBDC_DATA_OFF		(CBDC_CMDR_OFF + CBDC_CMDR_SIZE)
+#define CBDC_DATA_SIZE		(16 * 1024 * 1024)
+#define CBDC_DATA_MASK		0xFFFFFF
+
+#define CBDC_UPDATE_CMDR_HEAD(head, used, size) ((head) = (((head) % (size)) + (used)) % (size))
+#define CBDC_UPDATE_CMDR_TAIL(tail, used, size) ((tail) = (((tail) % (size)) + (used)) % (size))
+
+#define CBDC_UPDATE_COMPR_HEAD(head, used, size) ((head) = (((head) % (size)) + (used)) % (size))
+#define CBDC_UPDATE_COMPR_TAIL(tail, used, size) ((tail) = (((tail) % (size)) + (used)) % (size))
+
+/* cbd transport */
+#define CBD_TRANSPORT_MAGIC		0x9a6c676896C596EFULL
+#define CBD_TRANSPORT_VERSION		1
+
+#define CBDT_INFO_OFF			0
+#define CBDT_INFO_SIZE			CBD_PAGE_SIZE
+
+#define CBDT_HOST_AREA_OFF		(CBDT_INFO_OFF + CBDT_INFO_SIZE)
+#define CBDT_HOST_INFO_SIZE		CBD_PAGE_SIZE
+#define CBDT_HOST_NUM			16
+
+#define CBDT_BACKEND_AREA_OFF		(CBDT_HOST_AREA_OFF + (CBDT_HOST_INFO_SIZE * CBDT_HOST_NUM))
+#define CBDT_BACKEND_INFO_SIZE		CBD_PAGE_SIZE
+#define CBDT_BACKEND_NUM		16
+
+#define CBDT_BLKDEV_AREA_OFF		(CBDT_BACKEND_AREA_OFF + (CBDT_BACKEND_INFO_SIZE * CBDT_BACKEND_NUM))
+#define CBDT_BLKDEV_INFO_SIZE		CBD_PAGE_SIZE
+#define CBDT_BLKDEV_NUM			16
+
+#define CBDT_CHANNEL_AREA_OFF		(CBDT_BLKDEV_AREA_OFF + (CBDT_BLKDEV_INFO_SIZE * CBDT_BLKDEV_NUM))
+#define CBDT_CHANNEL_SIZE		(CBDC_META_SIZE + CBDC_DATA_SIZE)
+#define CBDT_CHANNEL_NUM		16
+
+#define CBD_TRASNPORT_SIZE		(CBDT_CHANNEL_AREA_OFF + CBDT_CHANNEL_SIZE * CBDT_CHANNEL_NUM)
+
+/*
+ * CBD structure diagram:
+ *
+ *                                        +--------------+
+ *                                        | cbd_transport|                                               +----------+
+ *                                        +--------------+                                               | cbd_host |
+ *                                        |              |                                               +----------+
+ *                                        |   host       +---------------------------------------------->|          |
+ *                   +--------------------+   backends   |                                               | hostname |
+ *                   |                    |   devices    +------------------------------------------+    |          |
+ *                   |                    |              |                                          |    +----------+
+ *                   |                    +--------------+                                          |
+ *                   |                                                                              |
+ *                   |                                                                              |
+ *                   |                                                                              |
+ *                   |                                                                              |
+ *                   |                                                                              |
+ *                   v                                                                              v
+ *             +------------+     +-----------+     +------+                                  +-----------+      +-----------+     +------+
+ *             | cbd_backend+---->|cbd_backend+---->| NULL |                                  | cbd_blkdev+----->| cbd_blkdev+---->| NULL |
+ *             +------------+     +-----------+     +------+                                  +-----------+      +-----------+     +------+
+ *      +------+  handlers  |     |  handlers |                                        +------+  queues   |      |  queues   |
+ *      |      +------------+     +-----------+                                        |      +-----------+      +-----------+
+ *      |                                                                              |
+ *      |                                                                              |
+ *      |                                                                              |
+ *      |                                                                              |
+ *      |      +-------------+       +-------------+           +------+                |      +-----------+      +-----------+     +------+
+ *      +----->| cbd_handler +------>| cbd_handler +---------->| NULL |                +----->| cbd_queue +----->| cbd_queue +---->| NULL |
+ *             +-------------+       +-------------+           +------+                       +-----------+      +-----------+     +------+
+ *      +------+ channel     |       |   channel   |                                   +------+  channel  |      |  channel  |
+ *      |      +-------------+       +-------------+                                   |      +-----------+      +-----------+
+ *      |                                                                              |
+ *      |                                                                              |
+ *      |                                                                              |
+ *      |                                                                              v
+ *      |                                                        +-----------------------+
+ *      +------------------------------------------------------->|      cbd_channel      |
+ *                                                               +-----------------------+
+ *                                                               | channel_id            |
+ *                                                               | cmdr (cmd ring)       |
+ *                                                               | compr (complete ring) |
+ *                                                               | data (data area)      |
+ *                                                               |                       |
+ *                                                               +-----------------------+
+ */
+
+#define CBD_DEVICE(OBJ)					\
+struct cbd_## OBJ ##_device {				\
+	struct device dev;				\
+	struct cbd_transport *cbdt;			\
+	struct cbd_## OBJ ##_info *OBJ##_info;		\
+};							\
+							\
+struct cbd_## OBJ ##s_device {				\
+	struct device OBJ ##s_dev;			\
+	struct cbd_## OBJ ##_device OBJ ##_devs[];	\
+};
+
+
+/* cbd_worker_cfg */
+struct cbd_worker_cfg {
+	u32			busy_retry_cur;
+	u32			busy_retry_count;
+	u32			busy_retry_max;
+	u32			busy_retry_min;
+	u64			busy_retry_interval;
+};
+
+static inline void cbdwc_init(struct cbd_worker_cfg *cfg)
+{
+	/* init cbd_worker_cfg with default values */
+	cfg->busy_retry_cur = 0;
+	cfg->busy_retry_count = 100;
+	cfg->busy_retry_max = cfg->busy_retry_count * 2;
+	cfg->busy_retry_min = 0;
+	cfg->busy_retry_interval = 1;			/* 1us */
+}
+
+/* reset retry_cur and increase busy_retry_count */
+static inline void cbdwc_hit(struct cbd_worker_cfg *cfg)
+{
+	u32 delta;
+
+	cfg->busy_retry_cur = 0;
+
+	if (cfg->busy_retry_count == cfg->busy_retry_max)
+		return;
+
+	/* retry_count increase by 1/16 */
+	delta = cfg->busy_retry_count >> 4;
+	if (!delta)
+		delta = (cfg->busy_retry_max + cfg->busy_retry_min) >> 1;
+
+	cfg->busy_retry_count += delta;
+
+	if (cfg->busy_retry_count > cfg->busy_retry_max)
+		cfg->busy_retry_count = cfg->busy_retry_max;
+
+	return;
+}
+
+/* reset retry_cur and decrease busy_retry_count */
+static inline void cbdwc_miss(struct cbd_worker_cfg *cfg)
+{
+	u32 delta;
+
+	cfg->busy_retry_cur = 0;
+
+	if (cfg->busy_retry_count == cfg->busy_retry_min)
+		return;
+
+	/* retry_count decrease by 1/16 */
+	delta = cfg->busy_retry_count >> 4;
+	if (!delta)
+		delta = cfg->busy_retry_count;
+
+	cfg->busy_retry_count -= delta;
+
+	return;
+}
+
+static inline bool cbdwc_need_retry(struct cbd_worker_cfg *cfg)
+{
+	if (++cfg->busy_retry_cur < cfg->busy_retry_count) {
+		cpu_relax();
+		fsleep(cfg->busy_retry_interval);
+		return true;
+	}
+
+	return false;
+}
+
+/* cbd_transport */
+#define CBDT_INFO_F_BIGENDIAN		(1 << 0)
+
+struct cbd_transport_info {
+	__le64 magic;
+	__le16 version;
+	__le16 flags;
+
+	u64 host_area_off;
+	u32 host_info_size;
+	u32 host_num;
+
+	u64 backend_area_off;
+	u32 backend_info_size;
+	u32 backend_num;
+
+	u64 blkdev_area_off;
+	u32 blkdev_info_size;
+	u32 blkdev_num;
+
+	u64 channel_area_off;
+	u32 channel_size;
+	u32 channel_num;
+};
+
+struct cbd_transport {
+	u16	id;
+	struct device device;
+	struct mutex lock;
+
+	struct cbd_transport_info *transport_info;
+
+	struct cbd_host *host;
+	struct list_head backends;
+	struct list_head devices;
+
+	struct cbd_hosts_device *cbd_hosts_dev;
+	struct cbd_channels_device *cbd_channels_dev;
+	struct cbd_backends_device *cbd_backends_dev;
+	struct cbd_blkdevs_device *cbd_blkdevs_dev;
+
+	struct dax_device *dax_dev;
+	struct bdev_handle   *bdev_handle;
+};
+
+struct cbdt_register_options {
+	char hostname[CBD_NAME_LEN];
+	char path[CBD_PATH_LEN];
+	u16 format:1;
+	u16 force:1;
+	u16 unused:14;
+};
+
+struct cbd_blkdev;
+struct cbd_backend;
+
+int cbdt_register(struct cbdt_register_options *opts);
+int cbdt_unregister(u32 transport_id);
+
+struct cbd_host_info *cbdt_get_host_info(struct cbd_transport *cbdt, u32 id);
+struct cbd_backend_info *cbdt_get_backend_info(struct cbd_transport *cbdt, u32 id);
+struct cbd_blkdev_info *cbdt_get_blkdev_info(struct cbd_transport *cbdt, u32 id);
+struct cbd_channel_info *cbdt_get_channel_info(struct cbd_transport *cbdt, u32 id);
+
+int cbdt_get_empty_host_id(struct cbd_transport *cbdt, u32 *id);
+int cbdt_get_empty_backend_id(struct cbd_transport *cbdt, u32 *id);
+int cbdt_get_empty_blkdev_id(struct cbd_transport *cbdt, u32 *id);
+int cbdt_get_empty_channel_id(struct cbd_transport *cbdt, u32 *id);
+
+void cbdt_add_backend(struct cbd_transport *cbdt, struct cbd_backend *cbdb);
+void cbdt_del_backend(struct cbd_transport *cbdt, struct cbd_backend *cbdb);
+struct cbd_backend *cbdt_get_backend(struct cbd_transport *cbdt, u32 id);
+void cbdt_add_blkdev(struct cbd_transport *cbdt, struct cbd_blkdev *blkdev);
+struct cbd_blkdev *cbdt_fetch_blkdev(struct cbd_transport *cbdt, u32 id);
+
+struct page *cbdt_page(struct cbd_transport *cbdt, u64 transport_off);
+void cbdt_flush_range(struct cbd_transport *cbdt, void *pos, u64 size);
+
+/* cbd_host */
+CBD_DEVICE(host);
+
+enum cbd_host_state {
+	cbd_host_state_none	= 0,
+	cbd_host_state_running
+};
+
+struct cbd_host_info {
+	u8	state;
+	u64	alive_ts;
+	char	hostname[CBD_NAME_LEN];
+};
+
+struct cbd_host {
+	u32			host_id;
+	struct cbd_transport	*cbdt;
+
+	struct cbd_host_device	*dev;
+	struct cbd_host_info	*host_info;
+	struct delayed_work	hb_work; /* heartbeat work */
+};
+
+int cbd_host_register(struct cbd_transport *cbdt, char *hostname);
+int cbd_host_unregister(struct cbd_transport *cbdt);
+
+/* cbd_channel */
+CBD_DEVICE(channel);
+
+enum cbdc_blkdev_state {
+	cbdc_blkdev_state_none		= 0,
+	cbdc_blkdev_state_running,
+	cbdc_blkdev_state_stopped,
+};
+
+enum cbdc_backend_state {
+	cbdc_backend_state_none		= 0,
+	cbdc_backend_state_running,
+	cbdc_backend_state_stopped,
+};
+
+enum cbd_channel_state {
+	cbd_channel_state_none		= 0,
+	cbd_channel_state_running,
+};
+
+struct cbd_channel_info {
+	u8	state;
+
+	u8	blkdev_state;
+	u32	blkdev_id;
+
+	u8	backend_state;
+	u32	backend_id;
+
+	u32	cmdr_off;
+	u32	cmdr_size;
+	u32	cmd_head;
+	u32	cmd_tail;
+
+	u32	compr_head;
+	u32	compr_tail;
+	u32	compr_off;
+	u32	compr_size;
+};
+
+struct cbd_channel {
+	u32				channel_id;
+	struct cbd_channel_device	*dev;
+	struct cbd_channel_info		*channel_info;
+
+	struct cbd_transport		*cbdt;
+
+	struct page			*ctrl_page;
+
+	void				*cmdr;
+	void				*compr;
+	void				*data;
+
+	u32				data_size;
+	u32				data_head;
+	u32				data_tail;
+
+	spinlock_t			cmdr_lock;
+	spinlock_t			compr_lock;
+};
+
+void cbd_channel_init(struct cbd_channel *channel, struct cbd_transport *cbdt, u32 channel_id);
+void cbdc_copy_from_bio(struct cbd_channel *channel,
+		u32 data_off, u32 data_len, struct bio *bio);
+void cbdc_copy_to_bio(struct cbd_channel *channel,
+		u32 data_off, u32 data_len, struct bio *bio);
+void cbdc_flush_ctrl(struct cbd_channel *channel);
+
+/* cbd_handler */
+struct cbd_handler {
+	struct cbd_backend	*cbdb;
+	struct cbd_channel_info *channel_info;
+
+	struct cbd_channel	channel;
+
+	u32			se_to_handle;
+
+	struct delayed_work	handle_work;
+	struct cbd_worker_cfg	handle_worker_cfg;
+
+	struct list_head	handlers_node;
+	struct bio_set		bioset;
+	struct workqueue_struct *handle_wq;
+};
+
+void cbd_handler_destroy(struct cbd_handler *handler);
+int cbd_handler_create(struct cbd_backend *cbdb, u32 channel_id);
+
+/* cbd_backend */
+CBD_DEVICE(backend);
+
+enum cbd_backend_state {
+	cbd_backend_state_none	= 0,
+	cbd_backend_state_running,
+};
+
+#define CBDB_BLKDEV_COUNT_MAX	1
+
+struct cbd_backend_info {
+	u8	state;
+	u32	host_id;
+	u32	blkdev_count;
+	u64	alive_ts;
+	u64	dev_size; /* nr_sectors */
+	char	path[CBD_PATH_LEN];
+};
+
+struct cbd_backend {
+	u32			backend_id;
+	char			path[CBD_PATH_LEN];
+	struct cbd_transport	*cbdt;
+	struct cbd_backend_info *backend_info;
+	struct mutex 		lock;
+
+	struct block_device	*bdev;
+	struct bdev_handle	*bdev_handle;
+
+	struct workqueue_struct	*task_wq;  /* workqueue for request work */
+	struct delayed_work	state_work;
+	struct delayed_work	hb_work; /* heartbeat work */
+
+	struct list_head	node; /* cbd_transport->backends */
+	struct list_head	handlers;
+
+	struct cbd_backend_device *backend_device;
+};
+
+int cbd_backend_start(struct cbd_transport *cbdt, char *path);
+int cbd_backend_stop(struct cbd_transport *cbdt, u32 backend_id);
+void cbdb_add_handler(struct cbd_backend *cbdb, struct cbd_handler *handler);
+void cbdb_del_handler(struct cbd_backend *cbdb, struct cbd_handler *handler);
+
+/* cbd_queue */
+enum cbd_op {
+	CBD_OP_PAD = 0,
+	CBD_OP_WRITE,
+	CBD_OP_READ,
+	CBD_OP_DISCARD,
+	CBD_OP_WRITE_ZEROS,
+	CBD_OP_FLUSH,
+};
+
+struct cbd_se_hdr {
+	u32 len_op;
+	u32 flags;
+
+};
+
+struct cbd_se {
+	struct cbd_se_hdr	header;
+	u64			priv_data;	/* pointer to cbd_request */
+
+	u64			offset;
+	u32			len;
+
+	u32			data_off;
+	u32			data_len;
+};
+
+
+struct cbd_ce {
+	u64		priv_data;	/* copied from submit entry */
+	u32		result;
+	u32		flags;
+};
+
+
+struct cbd_request {
+	struct cbd_queue	*cbdq;
+
+	struct cbd_se		*se;
+	struct cbd_ce		*ce;
+	struct request		*req;
+
+	enum cbd_op		op;
+	u64			req_tid;
+	struct list_head	inflight_reqs_node;
+
+	u32			data_off;
+	u32			data_len;
+
+	struct work_struct	work;
+};
+
+#define CBD_OP_MASK 0xff
+#define CBD_OP_SHIFT 8
+
+static inline enum cbd_op cbd_se_hdr_get_op(u32 len_op)
+{
+	return (enum cbd_op)(len_op & CBD_OP_MASK);
+}
+
+static inline void cbd_se_hdr_set_op(u32 *len_op, enum cbd_op op)
+{
+	*len_op &= ~CBD_OP_MASK;
+	*len_op |= (op & CBD_OP_MASK);
+}
+
+static inline u32 cbd_se_hdr_get_len(u32 len_op)
+{
+	return len_op >> CBD_OP_SHIFT;
+}
+
+static inline void cbd_se_hdr_set_len(u32 *len_op, u32 len)
+{
+	*len_op &= CBD_OP_MASK;
+	*len_op |= (len << CBD_OP_SHIFT);
+}
+
+#define CBD_SE_HDR_DONE	1
+
+static inline bool cbd_se_hdr_flags_test(struct cbd_se *se, u32 bit)
+{
+	return (se->header.flags & bit);
+}
+
+static inline void cbd_se_hdr_flags_set(struct cbd_se *se, u32 bit)
+{
+	se->header.flags |= bit;
+}
+
+enum cbd_queue_state {
+	cbd_queue_state_none	= 0,
+	cbd_queue_state_running
+};
+
+struct cbd_queue {
+	struct cbd_blkdev	*cbd_blkdev;
+
+	bool			inited;
+	int			index;
+
+	struct list_head	inflight_reqs;
+	spinlock_t		inflight_reqs_lock;
+	u64			req_tid;
+
+	u32			*released_extents;
+
+	u32			channel_id;
+	struct cbd_channel_info	*channel_info;
+	struct cbd_channel	channel;
+	struct workqueue_struct	*task_wq;  /* workqueue for request work */
+
+	atomic_t		state;
+
+	struct delayed_work	complete_work;
+	struct cbd_worker_cfg	complete_worker_cfg;
+};
+
+int cbd_queue_start(struct cbd_queue *cbdq);
+void cbd_queue_stop(struct cbd_queue *cbdq);
+extern const struct blk_mq_ops cbd_mq_ops;
+
+/* cbd_blkdev */
+CBD_DEVICE(blkdev);
+
+enum cbd_blkdev_state {
+	cbd_blkdev_state_none	= 0,
+	cbd_blkdev_state_running
+};
+
+struct cbd_blkdev_info {
+	u8	state;
+	u64	alive_ts;
+	u32	backend_id;
+	u32	host_id;
+	u32	mapped_id;
+};
+
+struct cbd_blkdev {
+	u32			blkdev_id; /* index in transport blkdev area */
+	u32			backend_id;
+	int			mapped_id; /* id in block device such as: /dev/cbd0 */
+
+	int			major;		/* blkdev assigned major */
+	int			minor;
+	struct gendisk		*disk;		/* blkdev's gendisk and rq */
+
+	spinlock_t		lock;		/* open_count */
+	struct list_head	node;
+	struct mutex		state_lock;
+	struct delayed_work	hb_work; /* heartbeat work */
+
+	/* Block layer tags. */
+	struct blk_mq_tag_set	tag_set;
+
+	unsigned long		open_count;	/* protected by lock */
+
+	uint32_t		num_queues;
+	struct cbd_queue	*queues;
+
+	u64			dev_size;
+	u64			dev_features;
+	u32			io_timeout;
+
+	u8			state;
+	u32			state_flags;
+	struct kref		kref;
+
+	void			*cmdr;
+	void			*compr;
+	spinlock_t		cmdr_lock;
+	spinlock_t		compr_lock;
+	void			*data;
+
+	struct cbd_blkdev_device *blkdev_dev;
+	struct cbd_blkdev_info *blkdev_info;
+
+	struct cbd_transport *cbdt;
+};
+
+int cbd_blkdev_init(void);
+void cbd_blkdev_exit(void);
+int cbd_blkdev_start(struct cbd_transport *cbdt, u32 backend_id, u32 queues);
+int cbd_blkdev_stop(struct cbd_transport *cbdt, u32 devid);
+
+extern struct workqueue_struct	*cbd_wq;
+
+#define cbd_setup_device(DEV, PARENT, TYPE, fmt, ...)		\
+do {								\
+	device_initialize(DEV);					\
+	device_set_pm_not_required(DEV);			\
+	dev_set_name(DEV, fmt, ##__VA_ARGS__);			\
+	DEV->parent = PARENT;					\
+	DEV->type = TYPE;					\
+								\
+	ret = device_add(DEV);					\
+} while (0)
+
+#define CBD_OBJ_HEARTBEAT(OBJ)								\
+static void OBJ##_hb_workfn(struct work_struct *work)					\
+{											\
+	struct cbd_##OBJ *obj = container_of(work, struct cbd_##OBJ, hb_work.work);	\
+	struct cbd_##OBJ##_info *info = obj->OBJ##_info;				\
+											\
+	info->alive_ts = ktime_get_real();						\
+	cbdt_flush_range(obj->cbdt, info, sizeof(*info));				\
+											\
+	queue_delayed_work(cbd_wq, &obj->hb_work, CBD_HB_INTERVAL);			\
+}											\
+											\
+static bool OBJ##_info_is_alive(struct cbd_##OBJ##_info *info)				\
+{											\
+	ktime_t oldest, ts;								\
+											\
+	ts = info->alive_ts;								\
+	oldest = ktime_sub_ms(ktime_get_real(), CBD_HB_TIMEOUT);			\
+											\
+	if (ktime_after(ts, oldest))							\
+		return true;								\
+											\
+	return false;									\
+}											\
+											\
+static ssize_t cbd_##OBJ##_alive_show(struct device *dev,				\
+			       struct device_attribute *attr,				\
+			       char *buf)						\
+{											\
+	struct cbd_##OBJ##_device *_dev;						\
+											\
+	_dev = container_of(dev, struct cbd_##OBJ##_device, dev);			\
+											\
+	cbdt_flush_range(_dev->cbdt, _dev->OBJ##_info, sizeof(*_dev->OBJ##_info));	\
+	if (OBJ##_info_is_alive(_dev->OBJ##_info))					\
+		return sprintf(buf, "true\n");						\
+											\
+	return sprintf(buf, "false\n");							\
+}											\
+											\
+static DEVICE_ATTR(alive, 0400, cbd_##OBJ##_alive_show, NULL);				\
+
+#endif /* _CBD_INTERNAL_H */
diff --git a/drivers/block/cbd/cbd_main.c b/drivers/block/cbd/cbd_main.c
new file mode 100644
index 000000000000..0a87c95d749d
--- /dev/null
+++ b/drivers/block/cbd/cbd_main.c
@@ -0,0 +1,216 @@
+/*
+ * Copyright(C) 2024, Dongsheng Yang <dongsheng.yang.linux@gmail.com>
+ */
+
+#include <linux/module.h>
+#include <linux/io.h>
+#include <linux/blk-mq.h>
+#include <linux/blkdev.h>
+#include <linux/kernel.h>
+#include <linux/device.h>
+#include <linux/bio.h>
+#include <linux/parser.h>
+#include <linux/blk-mq.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/slab.h>
+#include <linux/idr.h>
+#include <linux/workqueue.h>
+#include <linux/delay.h>
+#include <net/genetlink.h>
+
+#include <linux/types.h>
+
+#include "cbd_internal.h"
+
+struct workqueue_struct	*cbd_wq;
+
+enum {
+	CBDT_REG_OPT_ERR		= 0,
+	CBDT_REG_OPT_FORCE,
+	CBDT_REG_OPT_FORMAT,
+	CBDT_REG_OPT_PATH,
+	CBDT_REG_OPT_HOSTNAME,
+};
+
+static const match_table_t register_opt_tokens = {
+	{ CBDT_REG_OPT_FORCE,		"force=%u" },
+	{ CBDT_REG_OPT_FORMAT,		"format=%u" },
+	{ CBDT_REG_OPT_PATH,		"path=%s" },
+	{ CBDT_REG_OPT_HOSTNAME,	"hostname=%s" },
+	{ CBDT_REG_OPT_ERR,		NULL	}
+};
+
+static int parse_register_options(
+		char *buf,
+		struct cbdt_register_options *opts)
+{
+	substring_t args[MAX_OPT_ARGS];
+	char *o, *p;
+	int token, ret = 0;
+
+	o = buf;
+
+	while ((p = strsep(&o, ",\n")) != NULL) {
+		if (!*p)
+			continue;
+
+		token = match_token(p, register_opt_tokens, args);
+		switch (token) {
+		case CBDT_REG_OPT_PATH:
+			if (match_strlcpy(opts->path, &args[0],
+					  CBD_PATH_LEN) == 0) {
+				ret = -EINVAL;
+				goto out;
+			}
+			break;
+		case CBDT_REG_OPT_FORCE:
+			if (match_uint(args, &token) || token != 1) {
+				ret = -EINVAL;
+				goto out;
+			}
+			opts->force = 1;
+			break;
+		case CBDT_REG_OPT_FORMAT:
+			if (match_uint(args, &token) || token != 1) {
+				ret = -EINVAL;
+				goto out;
+			}
+			opts->format = 1;
+			break;
+		case CBDT_REG_OPT_HOSTNAME:
+			if (match_strlcpy(opts->hostname, &args[0],
+					  CBD_NAME_LEN) == 0) {
+				ret = -EINVAL;
+				goto out;
+			}
+			break;
+		default:
+			pr_err("unknown parameter or missing value '%s'\n", p);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+out:
+	return ret;
+}
+
+static ssize_t transport_unregister_store(const struct bus_type *bus, const char *ubuf,
+				      size_t size)
+{
+	int ret;
+	u32 transport_id;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (sscanf(ubuf, "transport_id=%u", &transport_id) != 1) {
+		return -EINVAL;
+	}
+
+	return size;
+}
+
+static ssize_t transport_register_store(const struct bus_type *bus, const char *ubuf,
+				      size_t size)
+{
+	int ret;
+	char *buf;
+	struct cbdt_register_options opts = { 0 };
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	buf = kmemdup(ubuf, size + 1, GFP_KERNEL);
+	if (!buf) {
+		pr_err("failed to dup buf for register option\n");
+		return -ENOMEM;
+	}
+	buf[size] = '\0';
+
+	ret = parse_register_options(buf, &opts);
+	if (ret < 0) {
+		kfree(buf);
+		return ret;
+	}
+	kfree(buf);
+
+	return size;
+}
+
+static BUS_ATTR_WO(transport_unregister);
+static BUS_ATTR_WO(transport_register);
+
+static struct attribute *cbd_bus_attrs[] = {
+	&bus_attr_transport_unregister.attr,
+	&bus_attr_transport_register.attr,
+	NULL,
+};
+
+static const struct attribute_group cbd_bus_group = {
+	.attrs = cbd_bus_attrs,
+};
+__ATTRIBUTE_GROUPS(cbd_bus);
+
+struct bus_type cbd_bus_type = {
+	.name		= "cbd",
+	.bus_groups	= cbd_bus_groups,
+};
+
+static void cbd_root_dev_release(struct device *dev)
+{
+}
+
+struct device cbd_root_dev = {
+	.init_name =    "cbd",
+	.release =      cbd_root_dev_release,
+};
+
+static int __init cbd_init(void)
+{
+	int ret;
+
+	cbd_wq = alloc_workqueue(CBD_DRV_NAME, WQ_MEM_RECLAIM, 0);
+	if (!cbd_wq) {
+		return -ENOMEM;
+	}
+
+	ret = device_register(&cbd_root_dev);
+	if (ret < 0) {
+		put_device(&cbd_root_dev);
+		goto destroy_wq;
+	}
+
+	ret = bus_register(&cbd_bus_type);
+	if (ret < 0) {
+		goto device_unregister;
+	}
+
+	return 0;
+
+bus_unregister:
+	bus_unregister(&cbd_bus_type);
+device_unregister:
+	device_unregister(&cbd_root_dev);
+destroy_wq:
+	destroy_workqueue(cbd_wq);
+
+	return ret;
+}
+
+static void __exit cbd_exit(void)
+{
+	bus_unregister(&cbd_bus_type);
+	device_unregister(&cbd_root_dev);
+
+	destroy_workqueue(cbd_wq);
+
+	return;
+}
+
+MODULE_AUTHOR("Dongsheng Yang <dongsheng.yang.linux@gmail.com>");
+MODULE_DESCRIPTION("CXL(Compute Express Link) Block Device");
+MODULE_LICENSE("GPL v2");
+module_init(cbd_init);
+module_exit(cbd_exit);
-- 
2.34.1



* [PATCH 2/7] cbd: introduce cbd_transport
  2024-04-22  7:15 [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dongsheng Yang
  2024-04-22  7:16 ` [PATCH 1/7] block: Init for CBD(CXL " Dongsheng Yang
@ 2024-04-22  7:16 ` Dongsheng Yang
  2024-04-24  4:08   ` Chaitanya Kulkarni
  2024-04-22  7:16 ` [PATCH 3/7] cbd: introduce cbd_channel Dongsheng Yang
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-22  7:16 UTC (permalink / raw)
  To: dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>

cbd_transport represents the layout of the entire shared memory, as shown below.

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                           cbd transport                                                                                       │
├────────────────────┬───────────────────────┬───────────────────────┬──────────────────────┬───────────────────────────────────┤
│                    │       hosts           │      backends         │       blkdevs        │        channels                   │
│ cbd transport info ├────┬────┬────┬────────┼────┬────┬────┬────────┼────┬────┬────┬───────┼───────┬───────┬───────┬───────────┤
│                    │    │    │    │  ...   │    │    │    │  ...   │    │    │    │  ...  │       │       │       │   ...     │
└────────────────────┴────┴────┴────┴────────┴────┴────┴────┴────────┴────┴────┴────┴───────┴───┬───┴───────┴───────┴───────────┘
                                                                                                │
                                                                                                │
                                                                                                │
                                                                                                │
          ┌─────────────────────────────────────────────────────────────────────────────────────┘
          │
          │
          ▼
    ┌───────────────────────────────────────────────────────────┐
    │                     channel                               │
    ├────────────────────┬──────────────────────────────────────┤
    │    channel meta    │              channel data            │
    └─────────┬──────────┴──────────────────────────────────────┘
              │
              │
              │
              ▼
    ┌──────────────────────────────────────────────────────────┐
    │                 channel meta                             │
    ├───────────┬──────────────┬───────────────────────────────┤
    │ meta ctrl │  comp ring   │       cmd ring                │
    └───────────┴──────────────┴───────────────────────────────┘

The shared memory is divided into five regions:

    a) Transport_info:
	Information about the overall transport, including the layout
of the transport.
    b) Hosts:
	Each host that wants to use this transport must register
its own information in a host entry in this region.
    c) Backends:
	Starting a backend on a host requires filling in a backend
entry in this region.
    d) Blkdevs:
	Once a backend is established, it can be mapped to a CBD device
on any associated host. The information about each blkdev is then
filled into the blkdevs region.
    e) Channels:
	This is the actual data communication area, where communication
between blkdev and backend occurs. Each queue of a block device uses
a channel, and each backend has a corresponding handler interacting
with this queue.

Each channel is further divided into meta and data regions.
The meta region holds the control area (meta ctrl), the comp ring
and the cmd ring. The blkdev converts upper-layer requests into cbd_se
entries and puts them into the cmd ring. The handler takes each cbd_se
from the cmd ring and submits it to the backend's local block device
(e.g., sda). After completion, the result is written as a cbd_ce
into the comp ring. The blkdev then receives the cbd_ce and returns
the result to the upper-layer IO sender.
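
As a rough illustration of how the regions above follow one another, the
userspace sketch below recomputes the area offsets the same way the
CBDT_*_AREA_OFF macros chain them. It is not driver code. PAGE_SZ,
CHAN_META_SZ and CHAN_DATA_SZ are assumed stand-ins for CBD_PAGE_SIZE,
CBDC_META_SIZE and CBDC_DATA_SIZE, whose real values are defined elsewhere
in the series; compute_layout() and entry_off() are hypothetical helper
names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define PAGE_SZ		4096ULL				/* stand-in for CBD_PAGE_SIZE */
#define CHAN_META_SZ	(1024ULL * PAGE_SZ)		/* hypothetical CBDC_META_SIZE */
#define CHAN_DATA_SZ	(15ULL * 1024 * PAGE_SZ)	/* hypothetical CBDC_DATA_SIZE */
#define NUM_ENTRIES	16ULL	/* CBDT_{HOST,BACKEND,BLKDEV,CHANNEL}_NUM */

struct layout {
	uint64_t host_off;
	uint64_t backend_off;
	uint64_t blkdev_off;
	uint64_t channel_off;
	uint64_t total;
};

/* Each area starts where the previous one ends. */
static struct layout compute_layout(void)
{
	struct layout l;

	l.host_off = 0 + PAGE_SZ;	/* transport info takes the first page */
	l.backend_off = l.host_off + PAGE_SZ * NUM_ENTRIES;
	l.blkdev_off = l.backend_off + PAGE_SZ * NUM_ENTRIES;
	l.channel_off = l.blkdev_off + PAGE_SZ * NUM_ENTRIES;
	l.total = l.channel_off + (CHAN_META_SZ + CHAN_DATA_SZ) * NUM_ENTRIES;

	return l;
}

/* Offset of entry @id inside an area, computed in 64 bits. */
static uint64_t entry_off(uint64_t area_off, uint64_t entry_size, uint32_t id)
{
	assert(id < NUM_ENTRIES);

	return area_off + entry_size * id;
}
```

With the assumed 4 KiB page, the host area starts at 4096, the backend area
at 69632, the blkdev area at 135168 and the channel area at 200704. The
start of channel 1 is then entry_off(l.channel_off, CHAN_META_SZ +
CHAN_DATA_SZ, 1), which depends on the channel sizes actually chosen.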

Signed-off-by: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
---
 drivers/block/cbd/Makefile        |   2 +-
 drivers/block/cbd/cbd_main.c      |   8 +
 drivers/block/cbd/cbd_transport.c | 721 ++++++++++++++++++++++++++++++
 3 files changed, 730 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/cbd/cbd_transport.c

diff --git a/drivers/block/cbd/Makefile b/drivers/block/cbd/Makefile
index 2765325486a2..a22796bfa7db 100644
--- a/drivers/block/cbd/Makefile
+++ b/drivers/block/cbd/Makefile
@@ -1,3 +1,3 @@
-cbd-y := cbd_main.o
+cbd-y := cbd_main.o cbd_transport.o
 
 obj-$(CONFIG_BLK_DEV_CBD) += cbd.o
diff --git a/drivers/block/cbd/cbd_main.c b/drivers/block/cbd/cbd_main.c
index 0a87c95d749d..8cfa60dde7c5 100644
--- a/drivers/block/cbd/cbd_main.c
+++ b/drivers/block/cbd/cbd_main.c
@@ -109,6 +109,10 @@ static ssize_t transport_unregister_store(const struct bus_type *bus, const char
 		return -EINVAL;
 	}
 
+	ret = cbdt_unregister(transport_id);
+	if (ret < 0)
+		return ret;
+
 	return size;
 }
 
@@ -136,6 +140,10 @@ static ssize_t transport_register_store(const struct bus_type *bus, const char *
 	}
 	kfree(buf);
 
+	ret = cbdt_register(&opts);
+	if (ret < 0)
+		return ret;
+
 	return size;
 }
 
diff --git a/drivers/block/cbd/cbd_transport.c b/drivers/block/cbd/cbd_transport.c
new file mode 100644
index 000000000000..3a4887afab08
--- /dev/null
+++ b/drivers/block/cbd/cbd_transport.c
@@ -0,0 +1,721 @@
+#include <linux/pfn_t.h>
+
+#include "cbd_internal.h"
+
+#define CBDT_OBJ(OBJ, OBJ_SIZE)							\
+										\
+static inline struct cbd_##OBJ##_info						\
+*__get_##OBJ##_info(struct cbd_transport *cbdt, u32 id)				\
+{										\
+	struct cbd_transport_info *info = cbdt->transport_info;			\
+	void *start = cbdt->transport_info;					\
+										\
+	start += info->OBJ##_area_off;						\
+										\
+	return start + (info->OBJ_SIZE * id);					\
+}										\
+										\
+struct cbd_##OBJ##_info 							\
+*cbdt_get_##OBJ##_info(struct cbd_transport *cbdt, u32 id)			\
+{										\
+	struct cbd_##OBJ##_info *info;						\
+										\
+	mutex_lock(&cbdt->lock);						\
+	info = __get_##OBJ##_info(cbdt, id);					\
+	mutex_unlock(&cbdt->lock);						\
+										\
+	return info;								\
+}										\
+										\
+int cbdt_get_empty_##OBJ##_id(struct cbd_transport *cbdt, u32 *id)		\
+{										\
+	struct cbd_transport_info *info = cbdt->transport_info;			\
+	struct cbd_##OBJ##_info *_info;						\
+	int ret = 0;								\
+	int i;									\
+										\
+	mutex_lock(&cbdt->lock);						\
+	for (i = 0; i < info->OBJ##_num; i++) {					\
+		_info = __get_##OBJ##_info(cbdt, i);				\
+		cbdt_flush_range(cbdt, _info, sizeof(*_info));			\
+		if (_info->state == cbd_##OBJ##_state_none) {			\
+			*id = i;						\
+			goto out;						\
+		}								\
+	}									\
+										\
+	cbdt_err(cbdt, "No available " #OBJ "_id found.");			\
+	ret = -ENOENT;								\
+out:										\
+	mutex_unlock(&cbdt->lock);						\
+										\
+	return ret;								\
+}
+
+CBDT_OBJ(host, host_info_size);
+CBDT_OBJ(backend, backend_info_size);
+CBDT_OBJ(blkdev, blkdev_info_size);
+CBDT_OBJ(channel, channel_size);
+
+static struct cbd_transport *cbd_transports[CBD_TRANSPORT_MAX];
+static DEFINE_IDA(cbd_transport_id_ida);
+static DEFINE_MUTEX(cbd_transport_mutex);
+
+extern struct bus_type cbd_bus_type;
+extern struct device cbd_root_dev;
+
+static ssize_t cbd_myhost_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct cbd_transport *cbdt;
+	struct cbd_host *host;
+
+	cbdt = container_of(dev, struct cbd_transport, device);
+
+	host = cbdt->host;
+	if (!host)
+		return 0;
+
+	return sysfs_emit(buf, "%u\n", host->host_id);
+}
+
+static DEVICE_ATTR(my_host_id, 0400, cbd_myhost_show, NULL);
+
+enum {
+	CBDT_ADM_OPT_ERR		= 0,
+	CBDT_ADM_OPT_OP,
+	CBDT_ADM_OPT_FORCE,
+	CBDT_ADM_OPT_PATH,
+	CBDT_ADM_OPT_BID,
+	CBDT_ADM_OPT_DID,
+	CBDT_ADM_OPT_QUEUES,
+};
+
+enum {
+	CBDT_ADM_OP_B_START,
+	CBDT_ADM_OP_B_STOP,
+	CBDT_ADM_OP_B_CLEAR,
+	CBDT_ADM_OP_DEV_START,
+	CBDT_ADM_OP_DEV_STOP,
+};
+
+static const char *const adm_op_names[] = {
+	[CBDT_ADM_OP_B_START] = "backend-start",
+	[CBDT_ADM_OP_B_STOP] = "backend-stop",
+	[CBDT_ADM_OP_B_CLEAR] = "backend-clear",
+	[CBDT_ADM_OP_DEV_START] = "dev-start",
+	[CBDT_ADM_OP_DEV_STOP] = "dev-stop",
+};
+
+static const match_table_t adm_opt_tokens = {
+	{ CBDT_ADM_OPT_OP,		"op=%s"	},
+	{ CBDT_ADM_OPT_FORCE,		"force=%u" },
+	{ CBDT_ADM_OPT_PATH,		"path=%s" },
+	{ CBDT_ADM_OPT_BID,		"backend_id=%u" },
+	{ CBDT_ADM_OPT_DID,		"devid=%u" },
+	{ CBDT_ADM_OPT_QUEUES,		"queues=%u" },
+	{ CBDT_ADM_OPT_ERR,		NULL	}
+};
+
+
+struct cbd_adm_options {
+	u16 op;
+	u16 force:1;
+	u32 backend_id;
+	union {
+		struct host_options {
+			u32 hid;
+		} host;
+		struct backend_options {
+			char path[CBD_PATH_LEN];
+		} backend;
+		struct channel_options {
+			u32 cid;
+		} channel;
+		struct blkdev_options {
+			u32 devid;
+			u32 queues;
+		} blkdev;
+	};
+};
+
+static int parse_adm_options(struct cbd_transport *cbdt,
+		char *buf,
+		struct cbd_adm_options *opts)
+{
+	substring_t args[MAX_OPT_ARGS];
+	char *o, *p;
+	int token, ret = 0;
+
+	o = buf;
+
+	while ((p = strsep(&o, ",\n")) != NULL) {
+		if (!*p)
+			continue;
+
+		token = match_token(p, adm_opt_tokens, args);
+		switch (token) {
+		case CBDT_ADM_OPT_OP:
+			ret = match_string(adm_op_names, ARRAY_SIZE(adm_op_names), args[0].from);
+			if (ret < 0) {
+				pr_err("unknown op: '%s'\n", args[0].from);
+				goto out;
+			}
+			opts->op = ret;
+			ret = 0;
+			break;
+		case CBDT_ADM_OPT_PATH:
+			if (match_strlcpy(opts->backend.path, &args[0],
+					  CBD_PATH_LEN) == 0) {
+				ret = -EINVAL;
+				goto out;
+			}
+			break;
+		case CBDT_ADM_OPT_FORCE:
+			if (match_uint(args, &token) || token != 1) {
+				ret = -EINVAL;
+				goto out;
+			}
+			opts->force = 1;
+			break;
+		case CBDT_ADM_OPT_BID:
+			if (match_uint(args, &token)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			opts->backend_id = token;
+			break;
+		case CBDT_ADM_OPT_DID:
+			if (match_uint(args, &token)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			opts->blkdev.devid = token;
+			break;
+		case CBDT_ADM_OPT_QUEUES:
+			if (match_uint(args, &token)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			opts->blkdev.queues = token;
+			break;
+		default:
+			pr_err("unknown parameter or missing value '%s'\n", p);
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+out:
+	return ret;
+}
+
+static void transport_zero_range(struct cbd_transport *cbdt, void *pos, u64 size)
+{
+	memset(pos, 0, size);
+	cbdt_flush_range(cbdt, pos, size);
+}
+
+static void channels_format(struct cbd_transport *cbdt)
+{
+	struct cbd_transport_info *info = cbdt->transport_info;
+	struct cbd_channel_info *channel_info;
+	int i;
+
+	for (i = 0; i < info->channel_num; i++) {
+		channel_info = __get_channel_info(cbdt, i);
+		transport_zero_range(cbdt, channel_info, CBDC_META_SIZE);
+	}
+}
+
+static int cbd_transport_format(struct cbd_transport *cbdt, bool force)
+{
+	struct cbd_transport_info *info = cbdt->transport_info;
+	u64 magic;
+
+	magic = le64_to_cpu(info->magic);
+	if (magic && !force) {
+		return -EEXIST;
+	}
+
+	/* TODO make these configurable */
+	info->magic = cpu_to_le64(CBD_TRANSPORT_MAGIC);
+	info->version = cpu_to_le16(CBD_TRANSPORT_VERSION);
+#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+	info->flags = cpu_to_le16(CBDT_INFO_F_BIGENDIAN);
+#endif
+	info->host_area_off = CBDT_HOST_AREA_OFF;
+	info->host_info_size = CBDT_HOST_INFO_SIZE;
+	info->host_num = CBDT_HOST_NUM;
+
+	info->backend_area_off = CBDT_BACKEND_AREA_OFF;
+	info->backend_info_size = CBDT_BACKEND_INFO_SIZE;
+	info->backend_num = CBDT_BACKEND_NUM;
+
+	info->blkdev_area_off = CBDT_BLKDEV_AREA_OFF;
+	info->blkdev_info_size = CBDT_BLKDEV_INFO_SIZE;
+	info->blkdev_num = CBDT_BLKDEV_NUM;
+
+	info->channel_area_off = CBDT_CHANNEL_AREA_OFF;
+	info->channel_size = CBDT_CHANNEL_SIZE;
+	info->channel_num = CBDT_CHANNEL_NUM;
+
+	cbdt_flush_range(cbdt, info, sizeof(*info));
+
+	transport_zero_range(cbdt, (void *)info + info->host_area_off,
+			     info->channel_area_off - info->host_area_off);
+
+	channels_format(cbdt);
+
+	return 0;
+}
+
+
+
+static ssize_t cbd_adm_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *ubuf,
+				 size_t size)
+{
+	int ret;
+	char *buf;
+	struct cbd_adm_options opts = { 0 };
+	struct cbd_transport *cbdt;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	cbdt = container_of(dev, struct cbd_transport, device);
+
+	buf = kmemdup(ubuf, size + 1, GFP_KERNEL);
+	if (!buf) {
+		pr_err("failed to dup buf for adm option\n");
+		return -ENOMEM;
+	}
+	buf[size] = '\0';
+	ret = parse_adm_options(cbdt, buf, &opts);
+	if (ret < 0) {
+		kfree(buf);
+		return ret;
+	}
+	kfree(buf);
+
+	switch (opts.op) {
+	case CBDT_ADM_OP_B_START:
+		break;
+	case CBDT_ADM_OP_B_STOP:
+		break;
+	case CBDT_ADM_OP_B_CLEAR:
+		break;
+	case CBDT_ADM_OP_DEV_START:
+		break;
+	case CBDT_ADM_OP_DEV_STOP:
+		break;
+	default:
+		pr_err("invalid op: %d\n", opts.op);
+		return -EINVAL;
+	}
+
+	if (ret < 0)
+		return ret;
+
+	return size;
+}
+
+static DEVICE_ATTR(adm, 0200, NULL, cbd_adm_store);
+
+static ssize_t cbd_transport_info(struct cbd_transport *cbdt, char *buf)
+{
+	struct cbd_transport_info *info = cbdt->transport_info;
+	ssize_t ret;
+
+	mutex_lock(&cbdt->lock);
+	info = cbdt->transport_info;
+	mutex_unlock(&cbdt->lock);
+
+	ret = sprintf(buf, "magic: 0x%llx\n"		\
+			"version: %u\n"			\
+			"flags: %x\n\n"			\
+			"host_area_off: %llu\n"		\
+			"bytes_per_host_info: %u\n"	\
+			"host_num: %u\n\n"		\
+			"backend_area_off: %llu\n"	\
+			"bytes_per_backend_info: %u\n"	\
+			"backend_num: %u\n\n"		\
+			"blkdev_area_off: %llu\n"	\
+			"bytes_per_blkdev_info: %u\n"	\
+			"blkdev_num: %u\n\n"		\
+			"channel_area_off: %llu\n"	\
+			"bytes_per_channel: %u\n"	\
+			"channel_num: %u\n",
+			le64_to_cpu(info->magic),
+			le16_to_cpu(info->version),
+			le16_to_cpu(info->flags),
+			info->host_area_off,
+			info->host_info_size,
+			info->host_num,
+			info->backend_area_off,
+			info->backend_info_size,
+			info->backend_num,
+			info->blkdev_area_off,
+			info->blkdev_info_size,
+			info->blkdev_num,
+			info->channel_area_off,
+			info->channel_size,
+			info->channel_num);
+
+	return ret;
+}
+
+static ssize_t cbd_info_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct cbd_transport *cbdt;
+
+	cbdt = container_of(dev, struct cbd_transport, device);
+
+	return cbd_transport_info(cbdt, buf);
+}
+static DEVICE_ATTR(info, 0400, cbd_info_show, NULL);
+
+static struct attribute *cbd_transport_attrs[] = {
+	&dev_attr_adm.attr,
+	&dev_attr_info.attr,
+	&dev_attr_my_host_id.attr,
+	NULL
+};
+
+static struct attribute_group cbd_transport_attr_group = {
+	.attrs = cbd_transport_attrs,
+};
+
+static const struct attribute_group *cbd_transport_attr_groups[] = {
+	&cbd_transport_attr_group,
+	NULL
+};
+
+static void cbd_transport_release(struct device *dev)
+{
+}
+
+struct device_type cbd_transport_type = {
+	.name		= "cbd_transport",
+	.groups		= cbd_transport_attr_groups,
+	.release	= cbd_transport_release,
+};
+
+static int
+cbd_dax_notify_failure(
+	struct dax_device	*dax_devp,
+	u64			offset,
+	u64			len,
+	int			mf_flags)
+{
+
+	pr_err("%s: dax_devp %llx offset %llx len %lld mf_flags %x\n",
+	       __func__, (u64)dax_devp, (u64)offset, (u64)len, mf_flags);
+	return -EOPNOTSUPP;
+}
+
+const struct dax_holder_operations cbd_dax_holder_ops = {
+	.notify_failure		= cbd_dax_notify_failure,
+};
+
+static struct cbd_transport *cbdt_alloc(void)
+{
+	struct cbd_transport *cbdt;
+	int ret;
+
+	cbdt = kzalloc(sizeof(struct cbd_transport), GFP_KERNEL);
+	if (!cbdt) {
+		return NULL;
+	}
+
+	ret = ida_simple_get(&cbd_transport_id_ida, 0, CBD_TRANSPORT_MAX,
+				GFP_KERNEL);
+	if (ret < 0) {
+		goto cbdt_free;
+	}
+
+	cbdt->id = ret;
+	cbd_transports[cbdt->id] = cbdt;
+
+	return cbdt;
+
+cbdt_free:
+	kfree(cbdt);
+	return NULL;
+}
+
+static void cbdt_destroy(struct cbd_transport *cbdt)
+{
+	cbd_transports[cbdt->id] = NULL;
+	ida_simple_remove(&cbd_transport_id_ida, cbdt->id);
+	kfree(cbdt);
+}
+
+static int cbdt_dax_init(struct cbd_transport *cbdt, char *path)
+{
+	struct dax_device *dax_dev = NULL;
+	struct bdev_handle *handle = NULL;
+	long access_size;
+	void *kaddr;
+	u64 nr_pages = CBD_TRASNPORT_SIZE >> PAGE_SHIFT;
+	u64 start_off = 0;
+	int ret;
+
+	handle = bdev_open_by_path(path, BLK_OPEN_READ | BLK_OPEN_WRITE, cbdt, NULL);
+	if (IS_ERR(handle)) {
+		pr_err("%s: failed bdev_open_by_path(%s)\n", __func__, path);
+		ret = PTR_ERR(handle);
+		goto err;
+	}
+
+	dax_dev = fs_dax_get_by_bdev(handle->bdev, &start_off,
+				     cbdt,
+				     &cbd_dax_holder_ops);
+	if (IS_ERR_OR_NULL(dax_dev)) {
+		pr_err("%s: unable to get daxdev from handle->bdev\n", __func__);
+		ret = -ENODEV;
+		goto bdev_release;
+	}
+
+	access_size = dax_direct_access(dax_dev, 0, nr_pages, DAX_ACCESS, &kaddr, NULL);
+	if (access_size != nr_pages) {
+		ret = -EINVAL;
+		goto dax_put;
+	}
+
+	cbdt->bdev_handle = handle;
+	cbdt->dax_dev = dax_dev;
+	cbdt->transport_info = (struct cbd_transport_info *)kaddr;
+
+	return 0;
+
+dax_put:
+	fs_put_dax(dax_dev, cbdt);
+bdev_release:
+	bdev_release(handle);
+err:
+	return ret;
+}
+
+static void cbdt_dax_release(struct cbd_transport *cbdt)
+{
+	if (cbdt->dax_dev)
+		fs_put_dax(cbdt->dax_dev, cbdt);
+
+	if (cbdt->bdev_handle)
+		bdev_release(cbdt->bdev_handle);
+}
+
+static int cbd_transport_init(struct cbd_transport *cbdt)
+{
+	struct device *dev;
+
+	mutex_init(&cbdt->lock);
+	INIT_LIST_HEAD(&cbdt->backends);
+	INIT_LIST_HEAD(&cbdt->devices);
+
+	dev = &cbdt->device;
+	device_initialize(dev);
+	device_set_pm_not_required(dev);
+	dev->bus = &cbd_bus_type;
+	dev->type = &cbd_transport_type;
+	dev->parent = &cbd_root_dev;
+
+	dev_set_name(&cbdt->device, "transport%d", cbdt->id);
+
+	return device_add(&cbdt->device);
+}
+
+
+static int cbdt_validate(struct cbd_transport *cbdt)
+{
+	u16 flags;
+
+	if (le64_to_cpu(cbdt->transport_info->magic) != CBD_TRANSPORT_MAGIC) {
+		return -EINVAL;
+	}
+
+	flags = le16_to_cpu(cbdt->transport_info->flags);
+#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+	if (!(flags & CBDT_INFO_F_BIGENDIAN)) {
+		return -EINVAL;
+	}
+#else
+	if ((flags & CBDT_INFO_F_BIGENDIAN)) {
+		return -EINVAL;
+	}
+#endif
+
+	return 0;
+}
+
+int cbdt_unregister(u32 tid)
+{
+	struct cbd_transport *cbdt;
+
+	cbdt = cbd_transports[tid];
+	if (!cbdt) {
+		pr_err("tid: %u, is not registered\n", tid);
+		return -EINVAL;
+	}
+
+	mutex_lock(&cbdt->lock);
+	if (!list_empty(&cbdt->backends) || !list_empty(&cbdt->devices)) {
+		mutex_unlock(&cbdt->lock);
+		return -EBUSY;
+	}
+	mutex_unlock(&cbdt->lock);
+
+	device_unregister(&cbdt->device);
+	cbdt_dax_release(cbdt);
+	cbdt_destroy(cbdt);
+	module_put(THIS_MODULE);
+
+	return 0;
+}
+
+
+int cbdt_register(struct cbdt_register_options *opts)
+{
+	struct cbd_transport *cbdt;
+	int ret;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	/* TODO support /dev/dax */
+	if (!strstr(opts->path, "/dev/pmem")) {
+		pr_err("%s: path (%s) is not pmem\n",
+		       __func__, opts->path);
+		ret = -EINVAL;
+		goto module_put;
+	}
+
+	cbdt = cbdt_alloc();
+	if (!cbdt) {
+		ret = -ENOMEM;
+		goto module_put;
+	}
+
+	ret = cbdt_dax_init(cbdt, opts->path);
+	if (ret) {
+		goto cbdt_destroy;
+	}
+
+	if (opts->format) {
+		ret = cbd_transport_format(cbdt, opts->force);
+		if (ret < 0) {
+			goto dax_release;
+		}
+	}
+
+	ret = cbdt_validate(cbdt);
+	if (ret) {
+		goto dax_release;
+	}
+
+	ret = cbd_transport_init(cbdt);
+	if (ret) {
+		goto dax_release;
+	}
+
+	return 0;
+
+dev_unregister:
+	device_unregister(&cbdt->device);
+dax_release:
+	cbdt_dax_release(cbdt);
+cbdt_destroy:
+	cbdt_destroy(cbdt);
+module_put:
+	module_put(THIS_MODULE);
+
+	return ret;
+}
+
+void cbdt_add_backend(struct cbd_transport *cbdt, struct cbd_backend *cbdb)
+{
+	mutex_lock(&cbdt->lock);
+	list_add(&cbdb->node, &cbdt->backends);
+	mutex_unlock(&cbdt->lock);
+}
+
+void cbdt_del_backend(struct cbd_transport *cbdt, struct cbd_backend *cbdb)
+{
+	if (list_empty(&cbdb->node))
+		return;
+
+	mutex_lock(&cbdt->lock);
+	list_del_init(&cbdb->node);
+	mutex_unlock(&cbdt->lock);
+}
+
+struct cbd_backend *cbdt_get_backend(struct cbd_transport *cbdt, u32 id)
+{
+	struct cbd_backend *backend;
+
+	mutex_lock(&cbdt->lock);
+	list_for_each_entry(backend, &cbdt->backends, node) {
+		if (backend->backend_id == id) {
+			goto out;
+		}
+	}
+	backend = NULL;
+out:
+	mutex_unlock(&cbdt->lock);
+	return backend;
+}
+
+void cbdt_add_blkdev(struct cbd_transport *cbdt, struct cbd_blkdev *blkdev)
+{
+	mutex_lock(&cbdt->lock);
+	list_add(&blkdev->node, &cbdt->devices);
+	mutex_unlock(&cbdt->lock);
+}
+
+struct cbd_blkdev *cbdt_fetch_blkdev(struct cbd_transport *cbdt, u32 id)
+{
+	struct cbd_blkdev *dev;
+
+	mutex_lock(&cbdt->lock);
+	list_for_each_entry(dev, &cbdt->devices, node) {
+		if (dev->blkdev_id == id) {
+			list_del(&dev->node);
+			goto out;
+		}
+	}
+	dev = NULL;
+out:
+	mutex_unlock(&cbdt->lock);
+	return dev;
+}
+
+struct page *cbdt_page(struct cbd_transport *cbdt, u64 transport_off)
+{
+	long access_size;
+	pfn_t pfn;
+
+	access_size = dax_direct_access(cbdt->dax_dev, transport_off >> PAGE_SHIFT, 1, DAX_ACCESS, NULL, &pfn);
+
+	return pfn_t_to_page(pfn);
+}
+
+void cbdt_flush_range(struct cbd_transport *cbdt, void *pos, u64 size)
+{
+	u64 offset = pos - (void *)cbdt->transport_info;
+	u32 off_in_page = (offset & CBD_PAGE_MASK);
+
+	offset -= off_in_page;
+	size = round_up(off_in_page + size, PAGE_SIZE);
+
+	while (size) {
+		flush_dcache_page(cbdt_page(cbdt, offset));
+		offset += PAGE_SIZE;
+		size -= PAGE_SIZE;
+	}
+}
-- 
2.34.1
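
The page rounding done by cbdt_flush_range() at the end of the patch above
can be checked in userspace with a sketch like the one below. It is only an
illustration: SK_PAGE_SIZE, struct page_span and covering_pages() are
made-up names, and it assumes CBD_PAGE_MASK is PAGE_SIZE - 1 with 4 KiB
pages.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical helper type: the page-aligned span covering a byte range. */
struct page_span {
	uint64_t first_off;	/* offset of the first page, page aligned */
	uint64_t nr_pages;	/* number of pages that would be flushed */
};

static struct page_span covering_pages(uint64_t offset, uint64_t size)
{
	struct page_span span;
	uint64_t off_in_page = offset & (SK_PAGE_SIZE - 1);
	uint64_t bytes;

	assert(size > 0);

	/* Move the start down to the page boundary ... */
	span.first_off = offset - off_in_page;
	/* ... and round the enlarged length up to whole pages. */
	bytes = (off_in_page + size + SK_PAGE_SIZE - 1) &
		~(uint64_t)(SK_PAGE_SIZE - 1);
	span.nr_pages = bytes / SK_PAGE_SIZE;

	return span;
}
```

A 200-byte range starting at offset 4000 straddles a page boundary, so it
covers two pages starting at offset 0.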


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 3/7] cbd: introduce cbd_channel
  2024-04-22  7:15 [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dongsheng Yang
  2024-04-22  7:16 ` [PATCH 1/7] block: Init for CBD(CXL " Dongsheng Yang
  2024-04-22  7:16 ` [PATCH 2/7] cbd: introduce cbd_transport Dongsheng Yang
@ 2024-04-22  7:16 ` Dongsheng Yang
  2024-04-22  7:16 ` [PATCH 4/7] cbd: introduce cbd_host Dongsheng Yang
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-22  7:16 UTC (permalink / raw)
  To: dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>

The "cbd_channel" is the component responsible for the interaction
between the blkdev and the backend. It mainly provides the functions
"cbdc_copy_to_bio" and "cbdc_copy_from_bio".

The "cbdc_copy_to_bio" function copies data from the specified area of
the channel to the bio. Before copying, it flushes the dcache to ensure
that the data read from the channel is the latest.

The "cbdc_copy_from_bio" function copies data from the bio to the
specified area of the channel. After copying, it flushes the dcache to
ensure that other parties can see the latest data.

Signed-off-by: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
---
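The wrap-around copy described above can be sketched in userspace as
follows. This is only an illustration: RING_SIZE, RING_MASK and
ring_copy_in() are made-up names standing in for CBDC_DATA_SIZE,
CBDC_DATA_MASK and the per-bvec loop, and the sketch assumes the data area
size is a power of two. The dcache flushes are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for CBDC_DATA_SIZE / CBDC_DATA_MASK. */
#define RING_SIZE 16u			/* assumed to be a power of two */
#define RING_MASK (RING_SIZE - 1)

/*
 * Copy len bytes from src into the ring starting at offset head,
 * wrapping to offset 0 at the end of the ring. Returns the new head.
 */
static uint32_t ring_copy_in(uint8_t *ring, uint32_t head,
			     const uint8_t *src, uint32_t len)
{
	uint32_t done = 0;

	assert(len <= RING_SIZE);

	while (done < len) {
		uint32_t chunk;

		if (head >= RING_SIZE)
			head &= RING_MASK;

		/* Never copy past the end of the ring in one memcpy(). */
		chunk = len - done;
		if (chunk > RING_SIZE - head)
			chunk = RING_SIZE - head;

		memcpy(ring + head, src + done, chunk);
		head += chunk;
		done += chunk;
	}

	return head & RING_MASK;
}
```

Copying six bytes at head 13 of a 16-byte ring puts three bytes at offsets
13..15 and the remaining three at offsets 0..2.
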
 drivers/block/cbd/Makefile      |   2 +-
 drivers/block/cbd/cbd_channel.c | 179 ++++++++++++++++++++++++++++++++
 2 files changed, 180 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/cbd/cbd_channel.c

diff --git a/drivers/block/cbd/Makefile b/drivers/block/cbd/Makefile
index a22796bfa7db..c581ae96732b 100644
--- a/drivers/block/cbd/Makefile
+++ b/drivers/block/cbd/Makefile
@@ -1,3 +1,3 @@
-cbd-y := cbd_main.o cbd_transport.o
+cbd-y := cbd_main.o cbd_transport.o cbd_channel.o
 
 obj-$(CONFIG_BLK_DEV_CBD) += cbd.o
diff --git a/drivers/block/cbd/cbd_channel.c b/drivers/block/cbd/cbd_channel.c
new file mode 100644
index 000000000000..7253523bea3c
--- /dev/null
+++ b/drivers/block/cbd/cbd_channel.c
@@ -0,0 +1,179 @@
+#include "cbd_internal.h"
+
+static ssize_t cbd_backend_id_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct cbd_channel_device *channel;
+	struct cbd_channel_info *channel_info;
+
+	channel = container_of(dev, struct cbd_channel_device, dev);
+	channel_info = channel->channel_info;
+
+	if (channel_info->backend_state == cbdc_backend_state_none)
+		return 0;
+
+	return sprintf(buf, "%u\n", channel_info->backend_id);
+}
+
+static ssize_t cbd_blkdev_id_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct cbd_channel_device *channel;
+	struct cbd_channel_info *channel_info;
+
+	channel = container_of(dev, struct cbd_channel_device, dev);
+	channel_info = channel->channel_info;
+
+	if (channel_info->blkdev_state == cbdc_blkdev_state_none)
+		return 0;
+
+	return sprintf(buf, "%u\n", channel_info->blkdev_id);
+}
+
+static DEVICE_ATTR(backend_id, 0400, cbd_backend_id_show, NULL);
+static DEVICE_ATTR(blkdev_id, 0400, cbd_blkdev_id_show, NULL);
+
+static struct attribute *cbd_channel_attrs[] = {
+	&dev_attr_backend_id.attr,
+	&dev_attr_blkdev_id.attr,
+	NULL
+};
+
+static struct attribute_group cbd_channel_attr_group = {
+	.attrs = cbd_channel_attrs,
+};
+
+static const struct attribute_group *cbd_channel_attr_groups[] = {
+	&cbd_channel_attr_group,
+	NULL
+};
+
+static void cbd_channel_release(struct device *dev)
+{
+}
+
+struct device_type cbd_channel_type = {
+	.name		= "cbd_channel",
+	.groups		= cbd_channel_attr_groups,
+	.release	= cbd_channel_release,
+};
+
+struct device_type cbd_channels_type = {
+	.name		= "cbd_channels",
+	.release	= cbd_channel_release,
+};
+
+void cbdc_copy_to_bio(struct cbd_channel *channel,
+		u32 data_off, u32 data_len, struct bio *bio)
+{
+	struct bio_vec bv;
+	struct bvec_iter iter;
+	void *src, *dst;
+	u32 data_head = data_off;
+	u32 to_copy, page_off = 0;
+
+	cbdt_flush_range(channel->cbdt, channel->data + data_off, data_len);
+next:
+	bio_for_each_segment(bv, bio, iter) {
+		dst = kmap_atomic(bv.bv_page);
+		page_off = bv.bv_offset;
+again:
+		if (data_head >= CBDC_DATA_SIZE) {
+			data_head &= CBDC_DATA_MASK;
+		}
+
+		src = channel->data + data_head;
+		to_copy = min(bv.bv_offset + bv.bv_len - page_off,
+			      CBDC_DATA_SIZE - data_head);
+		memcpy_flushcache(dst + page_off, src, to_copy);
+
+		/* advance */
+		data_head += to_copy;
+		page_off += to_copy;
+
+		/* more data in this bv page */
+		if (page_off < bv.bv_offset + bv.bv_len) {
+			goto again;
+		}
+		kunmap_atomic(dst);
+		flush_dcache_page(bv.bv_page);
+	}
+
+	if (bio->bi_next) {
+		bio = bio->bi_next;
+		goto next;
+	}
+
+	return;
+}
+
+void cbdc_copy_from_bio(struct cbd_channel *channel,
+		u32 data_off, u32 data_len, struct bio *bio)
+{
+	struct bio_vec bv;
+	struct bvec_iter iter;
+	void *src, *dst;
+	u32 data_head = data_off;
+	u32 to_copy, page_off = 0;
+
+next:
+	bio_for_each_segment(bv, bio, iter) {
+		flush_dcache_page(bv.bv_page);
+
+		src = kmap_atomic(bv.bv_page);
+		page_off = bv.bv_offset;
+again:
+		if (data_head >= CBDC_DATA_SIZE) {
+			data_head &= CBDC_DATA_MASK;
+		}
+
+		dst = channel->data + data_head;
+		to_copy = min(bv.bv_offset + bv.bv_len - page_off,
+			      CBDC_DATA_SIZE - data_head);
+
+		memcpy_flushcache(dst, src + page_off, to_copy);
+
+		/* advance */
+		data_head += to_copy;
+		page_off += to_copy;
+
+		/* more data in this bv page */
+		if (page_off < bv.bv_offset + bv.bv_len) {
+			goto again;
+		}
+		kunmap_atomic(src);
+	}
+
+	if (bio->bi_next) {
+		bio = bio->bi_next;
+		goto next;
+	}
+
+	cbdt_flush_range(channel->cbdt, channel->data + data_off, data_len);
+
+	return;
+}
+
+void cbdc_flush_ctrl(struct cbd_channel *channel)
+{
+	flush_dcache_page(channel->ctrl_page);
+}
+
+void cbd_channel_init(struct cbd_channel *channel, struct cbd_transport *cbdt, u32 channel_id)
+{
+	struct cbd_channel_info *channel_info = cbdt_get_channel_info(cbdt, channel_id);
+
+	channel->cbdt = cbdt;
+	channel->channel_info = channel_info;
+	channel->channel_id = channel_id;
+	channel->cmdr = (void *)channel_info + CBDC_CMDR_OFF;
+	channel->compr = (void *)channel_info + CBDC_COMPR_OFF;
+	channel->data = (void *)channel_info + CBDC_DATA_OFF;
+	channel->data_size = CBDC_DATA_SIZE;
+	channel->ctrl_page = cbdt_page(cbdt, (void *)channel_info - (void *)cbdt->transport_info);
+
+	spin_lock_init(&channel->cmdr_lock);
+	spin_lock_init(&channel->compr_lock);
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 4/7] cbd: introduce cbd_host
  2024-04-22  7:15 [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dongsheng Yang
                   ` (2 preceding siblings ...)
  2024-04-22  7:16 ` [PATCH 3/7] cbd: introduce cbd_channel Dongsheng Yang
@ 2024-04-22  7:16 ` Dongsheng Yang
  2024-04-25  5:51   ` [EXTERNAL] " Bharat Bhushan
  2024-04-22  7:16 ` [PATCH 5/7] cbd: introuce cbd_backend Dongsheng Yang
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-22  7:16 UTC (permalink / raw)
  To: dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>

The "cbd_host" represents a host node. Each node needs to be registered
before it can use the "cbd_transport". After registration, the node's
information, such as its hostname, will be recorded in the "hosts" area
of this transport. Through this mechanism, we can know which nodes are
currently using each transport.

Signed-off-by: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
---
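The registration flow described above can be sketched in userspace as
follows. This is only an illustration under assumptions:
cbdt_get_empty_host_id() is not part of this patch, so the first-fit slot
search is a guess, and every sk_* name and constant is made up. The sketch
also ignores cache flushing and any locking between hosts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_HOST_NUM 4		/* hypothetical; stands in for CBDT_HOST_NUM */
#define SK_NAME_LEN 32		/* hypothetical; stands in for CBD_NAME_LEN */

enum sk_host_state { SK_HOST_NONE = 0, SK_HOST_RUNNING };

struct sk_host_info {
	uint8_t state;
	char hostname[SK_NAME_LEN];
};

/* Claim the first free slot in the "hosts" area; returns its id or -1. */
static int sk_host_register(struct sk_host_info *hosts, const char *hostname)
{
	int i;

	assert(hosts && hostname);
	if (hostname[0] == '\0')
		return -1;

	for (i = 0; i < SK_HOST_NUM; i++) {
		if (hosts[i].state != SK_HOST_NONE)
			continue;
		hosts[i].state = SK_HOST_RUNNING;
		strncpy(hosts[i].hostname, hostname, SK_NAME_LEN - 1);
		hosts[i].hostname[SK_NAME_LEN - 1] = '\0';
		return i;
	}

	return -1;	/* every slot is in use */
}

/* Release a slot so that another host can claim it later. */
static void sk_host_unregister(struct sk_host_info *hosts, int id)
{
	memset(&hosts[id], 0, sizeof(hosts[id]));
}
```

Any host can then walk the slots and read the hostnames of the nodes that
are currently using the transport.
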
 drivers/block/cbd/Makefile        |   2 +-
 drivers/block/cbd/cbd_host.c      | 123 ++++++++++++++++++++++++++++++
 drivers/block/cbd/cbd_transport.c |   8 ++
 3 files changed, 132 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/cbd/cbd_host.c

diff --git a/drivers/block/cbd/Makefile b/drivers/block/cbd/Makefile
index c581ae96732b..2389a738b12b 100644
--- a/drivers/block/cbd/Makefile
+++ b/drivers/block/cbd/Makefile
@@ -1,3 +1,3 @@
-cbd-y := cbd_main.o cbd_transport.o cbd_channel.o
+cbd-y := cbd_main.o cbd_transport.o cbd_channel.o cbd_host.o
 
 obj-$(CONFIG_BLK_DEV_CBD) += cbd.o
diff --git a/drivers/block/cbd/cbd_host.c b/drivers/block/cbd/cbd_host.c
new file mode 100644
index 000000000000..892961f5f1b2
--- /dev/null
+++ b/drivers/block/cbd/cbd_host.c
@@ -0,0 +1,123 @@
+#include "cbd_internal.h"
+
+static ssize_t cbd_host_name_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct cbd_host_device *host;
+	struct cbd_host_info *host_info;
+
+	host = container_of(dev, struct cbd_host_device, dev);
+	host_info = host->host_info;
+
+	cbdt_flush_range(host->cbdt, host_info, sizeof(*host_info));
+
+	if (host_info->state == cbd_host_state_none)
+		return 0;
+
+	if (strlen(host_info->hostname) == 0)
+		return 0;
+
+	return sprintf(buf, "%s\n", host_info->hostname);
+}
+
+static DEVICE_ATTR(hostname, 0400, cbd_host_name_show, NULL);
+
+CBD_OBJ_HEARTBEAT(host);
+
+static struct attribute *cbd_host_attrs[] = {
+	&dev_attr_hostname.attr,
+	&dev_attr_alive.attr,
+	NULL
+};
+
+static struct attribute_group cbd_host_attr_group = {
+	.attrs = cbd_host_attrs,
+};
+
+static const struct attribute_group *cbd_host_attr_groups[] = {
+	&cbd_host_attr_group,
+	NULL
+};
+
+static void cbd_host_release(struct device *dev)
+{
+}
+
+struct device_type cbd_host_type = {
+	.name		= "cbd_host",
+	.groups		= cbd_host_attr_groups,
+	.release	= cbd_host_release,
+};
+
+struct device_type cbd_hosts_type = {
+	.name		= "cbd_hosts",
+	.release	= cbd_host_release,
+};
+
+int cbd_host_register(struct cbd_transport *cbdt, char *hostname)
+{
+	struct cbd_host *host;
+	struct cbd_host_info *host_info;
+	u32 host_id;
+	int ret;
+
+	if (cbdt->host) {
+		return -EEXIST;
+	}
+
+	if (strlen(hostname) == 0) {
+		return -EINVAL;
+	}
+
+	ret = cbdt_get_empty_host_id(cbdt, &host_id);
+	if (ret < 0) {
+		return ret;
+	}
+
+	host = kzalloc(sizeof(struct cbd_host), GFP_KERNEL);
+	if (!host) {
+		return -ENOMEM;
+	}
+
+	host->host_id = host_id;
+	host->cbdt = cbdt;
+	INIT_DELAYED_WORK(&host->hb_work, host_hb_workfn);
+
+	host_info = cbdt_get_host_info(cbdt, host_id);
+	host_info->state = cbd_host_state_running;
+	memcpy(host_info->hostname, hostname, CBD_NAME_LEN);
+
+	cbdt_flush_range(cbdt, host_info, sizeof(*host_info));
+
+	host->host_info = host_info;
+	cbdt->host = host;
+
+	queue_delayed_work(cbd_wq, &host->hb_work, 0);
+
+	return 0;
+}
+
+int cbd_host_unregister(struct cbd_transport *cbdt)
+{
+	struct cbd_host *host = cbdt->host;
+	struct cbd_host_info *host_info;
+
+	if (!host) {
+		cbd_err("This host is not registered.");
+		return 0;
+	}
+
+	cancel_delayed_work_sync(&host->hb_work);
+	host_info = host->host_info;
+	memset(host_info->hostname, 0, CBD_NAME_LEN);
+	host_info->alive_ts = 0;
+	host_info->state = cbd_host_state_none;
+
+	cbdt_flush_range(cbdt, host_info, sizeof(*host_info));
+
+	kfree(host);
+	cbdt->host = NULL;
+
+	return 0;
+}
diff --git a/drivers/block/cbd/cbd_transport.c b/drivers/block/cbd/cbd_transport.c
index 3a4887afab08..682d0f45ce9e 100644
--- a/drivers/block/cbd/cbd_transport.c
+++ b/drivers/block/cbd/cbd_transport.c
@@ -571,6 +571,7 @@ int cbdt_unregister(u32 tid)
 	}
 	mutex_unlock(&cbdt->lock);
 
+	cbd_host_unregister(cbdt);
 	device_unregister(&cbdt->device);
 	cbdt_dax_release(cbdt);
 	cbdt_destroy(cbdt);
@@ -624,8 +625,15 @@ int cbdt_register(struct cbdt_register_options *opts)
 		goto dax_release;
 	}
 
+	ret = cbd_host_register(cbdt, opts->hostname);
+	if (ret) {
+		goto dev_unregister;
+	}
+
 	return 0;
 
+devs_exit:
+	cbd_host_unregister(cbdt);
 dev_unregister:
 	device_unregister(&cbdt->device);
 dax_release:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 5/7] cbd: introuce cbd_backend
  2024-04-22  7:15 [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dongsheng Yang
                   ` (3 preceding siblings ...)
  2024-04-22  7:16 ` [PATCH 4/7] cbd: introduce cbd_host Dongsheng Yang
@ 2024-04-22  7:16 ` Dongsheng Yang
  2024-04-24  5:03   ` Chaitanya Kulkarni
  2024-04-25  5:46   ` [EXTERNAL] " Bharat Bhushan
  2024-04-22  7:16 ` [PATCH 7/7] cbd: add related sysfs files in transport register Dongsheng Yang
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-22  7:16 UTC (permalink / raw)
  To: dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>

The "cbd_backend" is responsible for exposing a local block device (such
as "/dev/sda") through the "cbd_transport" to other hosts.

Any host that registers this transport can map this backend to a local
"cbd device" (such as "/dev/cbd0"). All reads and writes to "cbd0" are
transmitted through the channel inside the transport to the backend. The
handler inside the backend processes these requests and converts them
into the corresponding read and write requests on "sda".

Signed-off-by: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
---
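The way the backend decides when to attach a handler to a channel (the
periodic scan in state_work_fn() below) can be sketched in userspace as
follows. This is only an illustration: the sk_* names are made up, and the
two-value state enum is a simplification of the cbdc_blkdev_state_* and
cbdc_backend_state_* values.

```c
#include <assert.h>
#include <stdint.h>

enum sk_state { SK_NONE = 0, SK_RUNNING };
enum sk_action { SK_NOTHING = 0, SK_CREATE_HANDLER, SK_DESTROY_HANDLER };

struct sk_channel_info {
	uint8_t blkdev_state;
	uint8_t backend_state;
	uint32_t backend_id;
};

/* Decide what the backend with id my_id should do for one channel. */
static enum sk_action sk_channel_action(const struct sk_channel_info *ci,
					uint32_t my_id)
{
	assert(ci);

	if (ci->backend_id != my_id)
		return SK_NOTHING;		/* owned by another backend */

	if (ci->blkdev_state == SK_RUNNING && ci->backend_state == SK_NONE)
		return SK_CREATE_HANDLER;	/* a blkdev attached */

	if (ci->blkdev_state == SK_NONE && ci->backend_state == SK_RUNNING)
		return SK_DESTROY_HANDLER;	/* the blkdev went away */

	return SK_NOTHING;
}
```

The backend only reacts to the two asymmetric combinations; when both
sides are running or both are idle there is nothing to do.
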
 drivers/block/cbd/Makefile        |   2 +-
 drivers/block/cbd/cbd_backend.c   | 254 +++++++++++++++++++++++++++++
 drivers/block/cbd/cbd_handler.c   | 261 ++++++++++++++++++++++++++++++
 drivers/block/cbd/cbd_transport.c |   6 +
 4 files changed, 522 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/cbd/cbd_backend.c
 create mode 100644 drivers/block/cbd/cbd_handler.c

diff --git a/drivers/block/cbd/Makefile b/drivers/block/cbd/Makefile
index 2389a738b12b..b47f1e584946 100644
--- a/drivers/block/cbd/Makefile
+++ b/drivers/block/cbd/Makefile
@@ -1,3 +1,3 @@
-cbd-y := cbd_main.o cbd_transport.o cbd_channel.o cbd_host.o
+cbd-y := cbd_main.o cbd_transport.o cbd_channel.o cbd_host.o cbd_backend.o cbd_handler.o
 
 obj-$(CONFIG_BLK_DEV_CBD) += cbd.o
diff --git a/drivers/block/cbd/cbd_backend.c b/drivers/block/cbd/cbd_backend.c
new file mode 100644
index 000000000000..a06f319e62c4
--- /dev/null
+++ b/drivers/block/cbd/cbd_backend.c
@@ -0,0 +1,254 @@
+#include "cbd_internal.h"
+
+static ssize_t backend_host_id_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct cbd_backend_device *backend;
+	struct cbd_backend_info *backend_info;
+
+	backend = container_of(dev, struct cbd_backend_device, dev);
+	backend_info = backend->backend_info;
+
+	cbdt_flush_range(backend->cbdt, backend_info, sizeof(*backend_info));
+
+	if (backend_info->state == cbd_backend_state_none)
+		return 0;
+
+	return sprintf(buf, "%u\n", backend_info->host_id);
+}
+
+static DEVICE_ATTR(host_id, 0400, backend_host_id_show, NULL);
+
+static ssize_t backend_path_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct cbd_backend_device *backend;
+	struct cbd_backend_info *backend_info;
+
+	backend = container_of(dev, struct cbd_backend_device, dev);
+	backend_info = backend->backend_info;
+
+	cbdt_flush_range(backend->cbdt, backend_info, sizeof(*backend_info));
+
+	if (backend_info->state == cbd_backend_state_none)
+		return 0;
+
+	if (strlen(backend_info->path) == 0)
+		return 0;
+
+	return sprintf(buf, "%s\n", backend_info->path);
+}
+
+static DEVICE_ATTR(path, 0400, backend_path_show, NULL);
+
+CBD_OBJ_HEARTBEAT(backend);
+
+static struct attribute *cbd_backend_attrs[] = {
+	&dev_attr_path.attr,
+	&dev_attr_host_id.attr,
+	&dev_attr_alive.attr,
+	NULL
+};
+
+static struct attribute_group cbd_backend_attr_group = {
+	.attrs = cbd_backend_attrs,
+};
+
+static const struct attribute_group *cbd_backend_attr_groups[] = {
+	&cbd_backend_attr_group,
+	NULL
+};
+
+static void cbd_backend_release(struct device *dev)
+{
+}
+
+struct device_type cbd_backend_type = {
+	.name		= "cbd_backend",
+	.groups		= cbd_backend_attr_groups,
+	.release	= cbd_backend_release,
+};
+
+struct device_type cbd_backends_type = {
+	.name		= "cbd_backends",
+	.release	= cbd_backend_release,
+};
+
+void cbdb_add_handler(struct cbd_backend *cbdb, struct cbd_handler *handler)
+{
+	mutex_lock(&cbdb->lock);
+	list_add(&handler->handlers_node, &cbdb->handlers);
+	mutex_unlock(&cbdb->lock);
+}
+
+void cbdb_del_handler(struct cbd_backend *cbdb, struct cbd_handler *handler)
+{
+	mutex_lock(&cbdb->lock);
+	list_del_init(&handler->handlers_node);
+	mutex_unlock(&cbdb->lock);
+}
+
+static struct cbd_handler *cbdb_get_handler(struct cbd_backend *cbdb, u32 channel_id)
+{
+	struct cbd_handler *handler, *handler_next;
+	bool found = false;
+
+	mutex_lock(&cbdb->lock);
+	list_for_each_entry_safe(handler, handler_next, &cbdb->handlers, handlers_node) {
+		if (handler->channel.channel_id == channel_id) {
+			found = true;
+			break;
+		}
+	}
+	mutex_unlock(&cbdb->lock);
+
+	if (!found) {
+		return ERR_PTR(-ENOENT);
+	}
+
+	return handler;
+}
+
+static void state_work_fn(struct work_struct *work)
+{
+	struct cbd_backend *cbdb = container_of(work, struct cbd_backend, state_work.work);
+	struct cbd_transport *cbdt = cbdb->cbdt;
+	struct cbd_channel_info *channel_info;
+	u32 blkdev_state, backend_state, backend_id;
+	int i;
+
+	for (i = 0; i < cbdt->transport_info->channel_num; i++) {
+		channel_info = cbdt_get_channel_info(cbdt, i);
+
+		cbdt_flush_range(cbdt, channel_info, sizeof(*channel_info));
+		blkdev_state = channel_info->blkdev_state;
+		backend_state = channel_info->backend_state;
+		backend_id = channel_info->backend_id;
+
+		if (blkdev_state == cbdc_blkdev_state_running &&
+				backend_state == cbdc_backend_state_none &&
+				backend_id == cbdb->backend_id) {
+
+			cbd_handler_create(cbdb, i);
+		}
+
+		if (blkdev_state == cbdc_blkdev_state_none &&
+				backend_state == cbdc_backend_state_running &&
+				backend_id == cbdb->backend_id) {
+			struct cbd_handler *handler = cbdb_get_handler(cbdb, i);
+
+			if (!IS_ERR(handler))
+				cbd_handler_destroy(handler);
+		}
+	}
+
+	queue_delayed_work(cbd_wq, &cbdb->state_work, 1 * HZ);
+}
+
+static int cbd_backend_init(struct cbd_backend *cbdb)
+{
+	struct cbd_backend_info *b_info;
+	struct cbd_transport *cbdt = cbdb->cbdt;
+
+	b_info = cbdt_get_backend_info(cbdt, cbdb->backend_id);
+	cbdb->backend_info = b_info;
+
+	b_info->host_id = cbdb->cbdt->host->host_id;
+
+	cbdb->bdev_handle = bdev_open_by_path(cbdb->path, BLK_OPEN_READ | BLK_OPEN_WRITE, cbdb, NULL);
+	if (IS_ERR(cbdb->bdev_handle)) {
+		cbdt_err(cbdt, "failed to open bdev: %d", (int)PTR_ERR(cbdb->bdev_handle));
+		return PTR_ERR(cbdb->bdev_handle);
+	}
+	cbdb->bdev = cbdb->bdev_handle->bdev;
+	b_info->dev_size = bdev_nr_sectors(cbdb->bdev);
+
+	INIT_DELAYED_WORK(&cbdb->state_work, state_work_fn);
+	INIT_DELAYED_WORK(&cbdb->hb_work, backend_hb_workfn);
+	INIT_LIST_HEAD(&cbdb->handlers);
+	cbdb->backend_device = &cbdt->cbd_backends_dev->backend_devs[cbdb->backend_id];
+
+	mutex_init(&cbdb->lock);
+
+	queue_delayed_work(cbd_wq, &cbdb->state_work, 0);
+	queue_delayed_work(cbd_wq, &cbdb->hb_work, 0);
+
+	return 0;
+}
+
+int cbd_backend_start(struct cbd_transport *cbdt, char *path)
+{
+	struct cbd_backend *backend;
+	struct cbd_backend_info *backend_info;
+	u32 backend_id;
+	int ret;
+
+	ret = cbdt_get_empty_backend_id(cbdt, &backend_id);
+	if (ret) {
+		return ret;
+	}
+
+	backend_info = cbdt_get_backend_info(cbdt, backend_id);
+
+	backend = kzalloc(sizeof(struct cbd_backend), GFP_KERNEL);
+	if (!backend) {
+		return -ENOMEM;
+	}
+
+	strscpy(backend->path, path, CBD_PATH_LEN);
+	memcpy(backend_info->path, backend->path, CBD_PATH_LEN);
+	INIT_LIST_HEAD(&backend->node);
+	backend->backend_id = backend_id;
+	backend->cbdt = cbdt;
+
+	ret = cbd_backend_init(backend);
+	if (ret) {
+		goto backend_free;
+	}
+
+	backend_info->state = cbd_backend_state_running;
+	cbdt_flush_range(cbdt, backend_info, sizeof(*backend_info));
+
+	cbdt_add_backend(cbdt, backend);
+
+	return 0;
+
+backend_free:
+	kfree(backend);
+
+	return ret;
+}
+
+int cbd_backend_stop(struct cbd_transport *cbdt, u32 backend_id)
+{
+	struct cbd_backend *cbdb;
+	struct cbd_backend_info *backend_info;
+
+	cbdb = cbdt_get_backend(cbdt, backend_id);
+	if (!cbdb) {
+		return -ENOENT;
+	}
+
+	mutex_lock(&cbdb->lock);
+	if (!list_empty(&cbdb->handlers)) {
+		mutex_unlock(&cbdb->lock);
+		return -EBUSY;
+	}
+
+	cbdt_del_backend(cbdt, cbdb);
+
+	cancel_delayed_work_sync(&cbdb->hb_work);
+	cancel_delayed_work_sync(&cbdb->state_work);
+
+	backend_info = cbdt_get_backend_info(cbdt, cbdb->backend_id);
+	backend_info->state = cbd_backend_state_none;
+	cbdt_flush_range(cbdt, backend_info, sizeof(*backend_info));
+	mutex_unlock(&cbdb->lock);
+
+	bdev_release(cbdb->bdev_handle);
+	kfree(cbdb);
+
+	return 0;
+}
diff --git a/drivers/block/cbd/cbd_handler.c b/drivers/block/cbd/cbd_handler.c
new file mode 100644
index 000000000000..0fbfc225ea29
--- /dev/null
+++ b/drivers/block/cbd/cbd_handler.c
@@ -0,0 +1,261 @@
+#include "cbd_internal.h"
+
+static inline struct cbd_se *get_se_head(struct cbd_handler *handler)
+{
+	return (struct cbd_se *)(handler->channel.cmdr + handler->channel_info->cmd_head);
+}
+
+static inline struct cbd_se *get_se_to_handle(struct cbd_handler *handler)
+{
+	return (struct cbd_se *)(handler->channel.cmdr + handler->se_to_handle);
+}
+
+static inline struct cbd_ce *get_compr_head(struct cbd_handler *handler)
+{
+	return (struct cbd_ce *)(handler->channel.compr + handler->channel_info->compr_head);
+}
+
+struct cbd_backend_io {
+	struct cbd_se		*se;
+	u64			off;
+	u32			len;
+	struct bio		*bio;
+	struct cbd_handler	*handler;
+};
+
+static inline void complete_cmd(struct cbd_handler *handler, u64 priv_data, int ret)
+{
+	struct cbd_ce *ce = get_compr_head(handler);
+
+	memset(ce, 0, sizeof(*ce));
+	ce->priv_data = priv_data;
+	ce->result = ret;
+	CBDC_UPDATE_COMPR_HEAD(handler->channel_info->compr_head,
+			       sizeof(struct cbd_ce),
+			       handler->channel_info->compr_size);
+
+	cbdc_flush_ctrl(&handler->channel);
+
+	return;
+}
+
+static void backend_bio_end(struct bio *bio)
+{
+	struct cbd_backend_io *backend_io = bio->bi_private;
+	struct cbd_se *se = backend_io->se;
+	struct cbd_handler *handler = backend_io->handler;
+
+	if (bio->bi_status == 0 &&
+	    cbd_se_hdr_get_op(se->header.len_op) == CBD_OP_READ) {
+		cbdc_copy_from_bio(&handler->channel, se->data_off, se->data_len, bio);
+	}
+
+	complete_cmd(handler, se->priv_data, bio->bi_status);
+
+	bio_free_pages(bio);
+	bio_put(bio);
+	kfree(backend_io);
+}
+
+static int cbd_bio_alloc_pages(struct bio *bio, size_t size, gfp_t gfp_mask)
+{
+	int ret = 0;
+
+	while (size) {
+		struct page *page = alloc_pages(gfp_mask, 0);
+		unsigned len = min_t(size_t, PAGE_SIZE, size);
+
+		if (!page) {
+			pr_err("failed to alloc page\n");
+			ret = -ENOMEM;
+			break;
+		}
+
+		if (unlikely(bio_add_page(bio, page, len, 0) != len)) {
+			__free_page(page);
+			pr_err("failed to add page\n");
+			ret = -ENOMEM;
+			break;
+		}
+
+		size -= len;
+	}
+
+	if (size)
+		bio_free_pages(bio);
+	else
+		ret = 0;
+
+	return ret;
+}
+
+static struct cbd_backend_io *backend_prepare_io(struct cbd_handler *handler, struct cbd_se *se, blk_opf_t opf)
+{
+	struct cbd_backend_io *backend_io;
+	struct cbd_backend *cbdb = handler->cbdb;
+
+	backend_io = kzalloc(sizeof(struct cbd_backend_io), GFP_KERNEL);
+	backend_io->se = se;
+
+	backend_io->handler = handler;
+	backend_io->bio = bio_alloc_bioset(cbdb->bdev, roundup(se->len, 4096) / 4096, opf, GFP_KERNEL, &handler->bioset);
+
+	backend_io->bio->bi_iter.bi_sector = se->offset >> SECTOR_SHIFT;
+	backend_io->bio->bi_iter.bi_size = 0;
+	backend_io->bio->bi_private = backend_io;
+	backend_io->bio->bi_end_io = backend_bio_end;
+
+	return backend_io;
+}
+
+static int handle_backend_cmd(struct cbd_handler *handler, struct cbd_se *se)
+{
+	struct cbd_backend *cbdb = handler->cbdb;
+	u32 len = se->len;
+	struct cbd_backend_io *backend_io = NULL;
+	int ret;
+
+	if (cbd_se_hdr_flags_test(se, CBD_SE_HDR_DONE)) {
+		return 0;
+	}
+
+	switch (cbd_se_hdr_get_op(se->header.len_op)) {
+	case CBD_OP_PAD:
+		cbd_se_hdr_flags_set(se, CBD_SE_HDR_DONE);
+		return 0;
+	case CBD_OP_READ:
+		backend_io = backend_prepare_io(handler, se, REQ_OP_READ);
+		break;
+	case CBD_OP_WRITE:
+		backend_io = backend_prepare_io(handler, se, REQ_OP_WRITE);
+		break;
+	case CBD_OP_DISCARD:
+		ret = blkdev_issue_discard(cbdb->bdev, se->offset >> SECTOR_SHIFT,
+				se->len, GFP_NOIO);
+		goto complete_cmd;
+	case CBD_OP_WRITE_ZEROS:
+		ret = blkdev_issue_zeroout(cbdb->bdev, se->offset >> SECTOR_SHIFT,
+				se->len, GFP_NOIO, 0);
+		goto complete_cmd;
+	case CBD_OP_FLUSH:
+		ret = blkdev_issue_flush(cbdb->bdev);
+		goto complete_cmd;
+	default:
+		pr_err("unrecognized op: %x", cbd_se_hdr_get_op(se->header.len_op));
+		ret = -EIO;
+		goto complete_cmd;
+	}
+
+	if (!backend_io)
+		return -ENOMEM;
+
+	ret = cbd_bio_alloc_pages(backend_io->bio, len, GFP_NOIO);
+	if (ret) {
+		kfree(backend_io);
+		return ret;
+	}
+
+	if (cbd_se_hdr_get_op(se->header.len_op) == CBD_OP_WRITE) {
+		cbdc_copy_to_bio(&handler->channel, se->data_off, se->data_len, backend_io->bio);
+	}
+
+	submit_bio(backend_io->bio);
+
+	return 0;
+
+complete_cmd:
+	complete_cmd(handler, se->priv_data, ret);
+	return 0;
+}
+
+static void handle_work_fn(struct work_struct *work)
+{
+	struct cbd_handler *handler = container_of(work, struct cbd_handler, handle_work.work);
+	struct cbd_se *se;
+	int ret;
+again:
+	/* channel ctrl would be updated by blkdev queue */
+	cbdc_flush_ctrl(&handler->channel);
+	se = get_se_to_handle(handler);
+	if (se == get_se_head(handler)) {
+		if (cbdwc_need_retry(&handler->handle_worker_cfg)) {
+			goto again;
+		}
+
+		cbdwc_miss(&handler->handle_worker_cfg);
+
+		queue_delayed_work(handler->handle_wq, &handler->handle_work, usecs_to_jiffies(0));
+		return;
+	}
+
+	cbdwc_hit(&handler->handle_worker_cfg);
+	cbdt_flush_range(handler->cbdb->cbdt, se, sizeof(*se));
+	ret = handle_backend_cmd(handler, se);
+	if (!ret) {
+		/* this se is handled */
+		handler->se_to_handle = (handler->se_to_handle + cbd_se_hdr_get_len(se->header.len_op)) % handler->channel_info->cmdr_size;
+	}
+
+	goto again;
+}
+
+int cbd_handler_create(struct cbd_backend *cbdb, u32 channel_id)
+{
+	struct cbd_transport *cbdt = cbdb->cbdt;
+	struct cbd_handler *handler;
+	int ret;
+
+	handler = kzalloc(sizeof(struct cbd_handler), GFP_KERNEL);
+	if (!handler) {
+		return -ENOMEM;
+	}
+
+	handler->cbdb = cbdb;
+	cbd_channel_init(&handler->channel, cbdt, channel_id);
+	handler->channel_info = handler->channel.channel_info;
+
+	handler->handle_wq = alloc_workqueue("cbdt%u-handler%u",
+					     WQ_UNBOUND | WQ_MEM_RECLAIM,
+					     0, cbdt->id, channel_id);
+	if (!handler->handle_wq) {
+		ret = -ENOMEM;
+		goto free_handler;
+	}
+
+	handler->se_to_handle = handler->channel_info->cmd_tail;
+
+	INIT_DELAYED_WORK(&handler->handle_work, handle_work_fn);
+	INIT_LIST_HEAD(&handler->handlers_node);
+
+	bioset_init(&handler->bioset, 128, 0, BIOSET_NEED_BVECS);
+	cbdwc_init(&handler->handle_worker_cfg);
+
+	cbdb_add_handler(cbdb, handler);
+	handler->channel_info->backend_state = cbdc_backend_state_running;
+
+	cbdt_flush_range(cbdt, handler->channel_info, sizeof(*handler->channel_info));
+
+	queue_delayed_work(handler->handle_wq, &handler->handle_work, 0);
+
+	return 0;
+
+free_handler:
+	kfree(handler);
+	return ret;
+}
+
+void cbd_handler_destroy(struct cbd_handler *handler)
+{
+	cbdb_del_handler(handler->cbdb, handler);
+
+	cancel_delayed_work_sync(&handler->handle_work);
+	drain_workqueue(handler->handle_wq);
+	destroy_workqueue(handler->handle_wq);
+
+	handler->channel_info->backend_state = cbdc_backend_state_none;
+	handler->channel_info->state = cbd_channel_state_none;
+	cbdt_flush_range(handler->cbdb->cbdt, handler->channel_info, sizeof(*handler->channel_info));
+
+	bioset_exit(&handler->bioset);
+	kfree(handler);
+}
diff --git a/drivers/block/cbd/cbd_transport.c b/drivers/block/cbd/cbd_transport.c
index 682d0f45ce9e..4dd9bf1b5fd5 100644
--- a/drivers/block/cbd/cbd_transport.c
+++ b/drivers/block/cbd/cbd_transport.c
@@ -303,8 +303,14 @@ static ssize_t cbd_adm_store(struct device *dev,
 
 	switch (opts.op) {
 	case CBDT_ADM_OP_B_START:
+		ret = cbd_backend_start(cbdt, opts.backend.path);
+		if (ret < 0)
+			return ret;
 		break;
 	case CBDT_ADM_OP_B_STOP:
+		ret = cbd_backend_stop(cbdt, opts.backend_id);
+		if (ret < 0)
+			return ret;
 		break;
 	case CBDT_ADM_OP_B_CLEAR:
 		break;
-- 
2.34.1



* [PATCH 7/7] cbd: add related sysfs files in transport register
  2024-04-22  7:15 [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dongsheng Yang
                   ` (4 preceding siblings ...)
  2024-04-22  7:16 ` [PATCH 5/7] cbd: introuce cbd_backend Dongsheng Yang
@ 2024-04-22  7:16 ` Dongsheng Yang
  2024-04-25  5:24   ` [EXTERNAL] " Bharat Bhushan
  2024-04-22 22:42 ` [PATCH 6/7] cbd: introduce cbd_blkdev Dongsheng Yang
  2024-04-24  4:29 ` [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dan Williams
  7 siblings, 1 reply; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-22  7:16 UTC (permalink / raw)
  To: dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>

When a transport is registered, a corresponding sysfs entry is created
for each area within the transport: "cbd_hosts", "cbd_backends",
"cbd_blkdevs", and "cbd_channels".

Through these sysfs entries, we can examine the information of each
entity, understand the relationships between them, and thereby see the
current operational status of the transport.

For example, by examining "cbd_hosts", we can find all the hosts
currently using the transport. We can also determine which host each
backend is running on by looking at the "host_id" in "cbd_backends".
Similarly, by examining "cbd_blkdevs", we can determine which host each
blkdev is running on, and through the "mapped_id", we can know the name
of the cbd device to which the blkdev is mapped. Additionally, by
looking at "cbd_channels", we can determine which blkdev and backend are
connected through each channel by examining the "blkdev_id" and
"backend_id".
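
As an illustration of how these attributes might be consumed from user
space, here is a minimal sketch in C. It is not part of this series. The
helper names read_attr() and print_backend_host() are hypothetical, and the
directory passed in is an assumption: the real path depends on where the
transport device is placed in sysfs, which this patch does not spell out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read one sysfs-style attribute file "<dir>/<attr>" into buf and strip
 * the trailing newline. Returns 0 on success, -1 on failure or when the
 * attribute is empty.
 */
static int read_attr(const char *dir, const char *attr, char *buf, size_t len)
{
	char path[512];
	FILE *f;
	int n;

	assert(len > 0);
	n = snprintf(path, sizeof(path), "%s/%s", dir, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(buf, (int)len, f)) {
		/*
		 * The blkdev show() callbacks in this series return 0 bytes
		 * for slots in the "none" state, so an empty read is treated
		 * as "no value" here.
		 */
		fclose(f);
		return -1;
	}
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Print which host an entity runs on. entity_dir is a hypothetical
 * directory such as ".../cbd_backends/backend0".
 */
static int print_backend_host(const char *entity_dir)
{
	char host_id[32];

	if (read_attr(entity_dir, "host_id", host_id, sizeof(host_id)))
		return -1;

	printf("%s runs on host %s\n", entity_dir, host_id);
	return 0;
}
```

Calling print_backend_host() on each "backendN" directory, and reading
"blkdev_id" and "backend_id" from each channel directory in the same way,
would be one way to reconstruct the relationships described above.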

Signed-off-by: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
---
 drivers/block/cbd/cbd_transport.c | 101 +++++++++++++++++++++++++++++-
 1 file changed, 100 insertions(+), 1 deletion(-)

diff --git a/drivers/block/cbd/cbd_transport.c b/drivers/block/cbd/cbd_transport.c
index 75b9d34218fc..0e917d72b209 100644
--- a/drivers/block/cbd/cbd_transport.c
+++ b/drivers/block/cbd/cbd_transport.c
@@ -1,8 +1,91 @@
 #include <linux/pfn_t.h>
-
 #include "cbd_internal.h"
 
 #define CBDT_OBJ(OBJ, OBJ_SIZE)							\
+extern struct device_type cbd_##OBJ##_type;					\
+extern struct device_type cbd_##OBJ##s_type;					\
+										\
+static int cbd_##OBJ##s_init(struct cbd_transport *cbdt) 			\
+{ 										\
+	struct cbd_##OBJ##s_device *devs; 					\
+	struct cbd_##OBJ##_device *cbd_dev;					\
+	struct device *dev;							\
+	int i; 									\
+	int ret;								\
+										\
+	u32 memsize = struct_size(devs, OBJ##_devs,				\
+			cbdt->transport_info->OBJ##_num);			\
+	devs = kzalloc(memsize, GFP_KERNEL);					\
+	if (!devs) {								\
+	    return -ENOMEM;							\
+	}									\
+										\
+	dev = &devs->OBJ##s_dev;						\
+	device_initialize(dev);							\
+	device_set_pm_not_required(dev);					\
+	dev_set_name(dev, "cbd_" #OBJ "s");					\
+	dev->parent = &cbdt->device;						\
+	dev->type = &cbd_##OBJ##s_type;						\
+	ret = device_add(dev);							\
+	if (ret) {								\
+		goto devs_free;							\
+	}									\
+										\
+	for (i = 0; i < cbdt->transport_info->OBJ##_num; i++) {			\
+		cbd_dev = &devs->OBJ##_devs[i];					\
+		dev = &cbd_dev->dev;						\
+										\
+		cbd_dev->cbdt = cbdt;						\
+		cbd_dev->OBJ##_info = cbdt_get_##OBJ##_info(cbdt, i);		\
+		device_initialize(dev);						\
+		device_set_pm_not_required(dev);				\
+		dev_set_name(dev, #OBJ "%u", i);				\
+		dev->parent = &devs->OBJ##s_dev;				\
+		dev->type = &cbd_##OBJ##_type;					\
+										\
+		ret = device_add(dev);						\
+		if (ret) {							\
+			i--;							\
+			goto del_device;					\
+		}								\
+	}									\
+	cbdt->cbd_##OBJ##s_dev = devs;						\
+										\
+    	return 0;								\
+del_device:									\
+	for (; i >= 0; i--) {							\
+		cbd_dev = &devs->OBJ##_devs[i];					\
+		dev = &cbd_dev->dev;						\
+		device_del(dev);						\
+	}									\
+devs_free:									\
+	kfree(devs);								\
+	return ret;								\
+}										\
+										\
+static void cbd_##OBJ##s_exit(struct cbd_transport *cbdt)			\
+{										\
+	struct cbd_##OBJ##s_device *devs = cbdt->cbd_##OBJ##s_dev;		\
+	struct device *dev;							\
+	int i;									\
+										\
+	if (!devs)								\
+		return;								\
+										\
+	for (i = 0; i < cbdt->transport_info->OBJ##_num; i++) {			\
+		struct cbd_##OBJ##_device *cbd_dev = &devs->OBJ##_devs[i];	\
+		dev = &cbd_dev->dev;						\
+										\
+		device_del(dev);						\
+	}									\
+										\
+	device_del(&devs->OBJ##s_dev);						\
+										\
+	kfree(devs);								\
+	cbdt->cbd_##OBJ##s_dev = NULL;						\
+										\
+	return;									\
+}										\
 										\
 static inline struct cbd_##OBJ##_info						\
 *__get_##OBJ##_info(struct cbd_transport *cbdt, u32 id)				\
@@ -588,6 +671,11 @@ int cbdt_unregister(u32 tid)
 	}
 	mutex_unlock(&cbdt->lock);
 
+	cbd_blkdevs_exit(cbdt);
+	cbd_channels_exit(cbdt);
+	cbd_backends_exit(cbdt);
+	cbd_hosts_exit(cbdt);
+
 	cbd_host_unregister(cbdt);
 	device_unregister(&cbdt->device);
 	cbdt_dax_release(cbdt);
@@ -647,9 +735,20 @@ int cbdt_register(struct cbdt_register_options *opts)
 		goto dev_unregister;
 	}
 
+	if (cbd_hosts_init(cbdt) || cbd_backends_init(cbdt) ||
+	    cbd_channels_init(cbdt) || cbd_blkdevs_init(cbdt)) {
+		ret = -ENOMEM;
+		goto devs_exit;
+	}
+
 	return 0;
 
 devs_exit:
+	cbd_blkdevs_exit(cbdt);
+	cbd_channels_exit(cbdt);
+	cbd_backends_exit(cbdt);
+	cbd_hosts_exit(cbdt);
+
 	cbd_host_unregister(cbdt);
 dev_unregister:
 	device_unregister(&cbdt->device);
-- 
2.34.1



* Re: [PATCH 1/7] block: Init for CBD(CXL Block Device)
  2024-04-22  7:16 ` [PATCH 1/7] block: Init for CBD(CXL " Dongsheng Yang
@ 2024-04-22 18:39   ` Randy Dunlap
  2024-04-22 22:41     ` Dongsheng Yang
  2024-04-24  3:58   ` Chaitanya Kulkarni
  1 sibling, 1 reply; 52+ messages in thread
From: Randy Dunlap @ 2024-04-22 18:39 UTC (permalink / raw)
  To: Dongsheng Yang, dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

Hi,

On 4/22/24 12:16 AM, Dongsheng Yang wrote:
> diff --git a/drivers/block/cbd/Kconfig b/drivers/block/cbd/Kconfig
> new file mode 100644
> index 000000000000..98b2cbcdf895
> --- /dev/null
> +++ b/drivers/block/cbd/Kconfig
> @@ -0,0 +1,4 @@
> +config BLK_DEV_CBD
> +	tristate "CXL Block Device"
> +	help
> +	  If unsure say 'm'.

I think that needs more help text. checkpatch should have said something
about that...

And why should someone say 'm' to the config question?
Will lots of (non-server) computers have CXL block device capability?

thanks.
-- 
#Randy
https://people.kernel.org/tglx/notes-about-netiquette
https://subspace.kernel.org/etiquette.html


* Re: [PATCH 1/7] block: Init for CBD(CXL Block Device)
  2024-04-22 18:39   ` Randy Dunlap
@ 2024-04-22 22:41     ` Dongsheng Yang
  0 siblings, 0 replies; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-22 22:41 UTC (permalink / raw)
  To: Randy Dunlap, dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang



On Tuesday, 2024/4/23 at 2:39 AM, Randy Dunlap wrote:
> Hi,
> 
> On 4/22/24 12:16 AM, Dongsheng Yang wrote:
>> diff --git a/drivers/block/cbd/Kconfig b/drivers/block/cbd/Kconfig
>> new file mode 100644
>> index 000000000000..98b2cbcdf895
>> --- /dev/null
>> +++ b/drivers/block/cbd/Kconfig
>> @@ -0,0 +1,4 @@
>> +config BLK_DEV_CBD
>> +	tristate "CXL Block Device"
>> +	help
>> +	  If unsure say 'm'.
> 
> I think that needs more help text. checkpatch should have said something
> about that...
> 
> And why should someone say 'm' to the config question?
> Will lots of (non-server) computers have CXL block device capability?

Hi,
     Thanks for your review! In this RFC version, I have focused entirely
on prototype validation and demonstration, so this aspect was evidently
overlooked. I will expand the help text in the next version. And of
course, it should read 'If unsure say "n"', not "m".

Thanks
> 
> thanks.
> 


* [PATCH 6/7] cbd: introduce cbd_blkdev
  2024-04-22  7:15 [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dongsheng Yang
                   ` (5 preceding siblings ...)
  2024-04-22  7:16 ` [PATCH 7/7] cbd: add related sysfs files in transport register Dongsheng Yang
@ 2024-04-22 22:42 ` Dongsheng Yang
  2024-04-23  7:27   ` Dongsheng Yang
  2024-04-24  4:29 ` [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dan Williams
  7 siblings, 1 reply; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-22 22:42 UTC (permalink / raw)
  To: dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>

The "cbd_blkdev" represents a virtual block device named "/dev/cbdX". It
corresponds to a backend. The "blkdev" interacts with upper-layer users
and accepts IO requests from them. A "blkdev" includes multiple
"cbd_queues", each of which requires a "cbd_channel" to
interact with the backend's handler. The "cbd_queue" forwards IO
requests from the upper layer to the backend's handler through the
channel.
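
To make the queue-to-handler hand-off more concrete, here is a minimal
user-space sketch in C of a command ring in which the producer inserts a
padding entry when an entry would not fit before the end of the ring, and
the consumer skips padding entries. It is a simplified model, not the driver
code: struct ring, struct entry_hdr, the op codes, and RING_SIZE are
hypothetical, the cache flushes needed on CXL shared memory are omitted, and
the sketch assumes the caller never submits more than the ring can hold.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 64u	/* hypothetical ring size in bytes */

enum ring_op { OP_PAD = 0, OP_READ = 1, OP_WRITE = 2 };	/* hypothetical */

struct entry_hdr {
	uint32_t op;
	uint32_t len;	/* total entry length in bytes, header included */
};

struct ring {
	unsigned char buf[RING_SIZE];
	uint32_t head;	/* producer position (the blkdev queue side) */
	uint32_t tail;	/* consumer position (the backend handler side) */
};

static void put_hdr(struct ring *r, uint32_t pos, uint32_t op, uint32_t len)
{
	struct entry_hdr hdr = { .op = op, .len = len };

	memcpy(r->buf + pos, &hdr, sizeof(hdr));
}

/*
 * Submit one entry of 'len' bytes. 'len' must be a multiple of the header
 * size so that any space left at the end of the ring can hold a pad header.
 */
static void ring_submit(struct ring *r, uint32_t op, uint32_t len)
{
	uint32_t room = RING_SIZE - r->head;

	assert(len >= sizeof(struct entry_hdr) && len <= RING_SIZE);
	assert(len % sizeof(struct entry_hdr) == 0);

	if (room < len) {
		/* Entry would straddle the end: pad the rest and wrap. */
		put_hdr(r, r->head, OP_PAD, room);
		r->head = 0;
	}

	put_hdr(r, r->head, op, len);
	r->head = (r->head + len) % RING_SIZE;
}

/* Consume the next non-pad entry. Returns 1 and sets *op, or 0 if empty. */
static int ring_consume(struct ring *r, uint32_t *op)
{
	struct entry_hdr hdr;

	while (r->tail != r->head) {
		memcpy(&hdr, r->buf + r->tail, sizeof(hdr));
		r->tail = (r->tail + hdr.len) % RING_SIZE;
		if (hdr.op == OP_PAD)
			continue;	/* padding carries no command */
		*op = hdr.op;
		return 1;
	}

	return 0;
}
```

With a 64-byte ring, two 24-byte entries leave 16 bytes at the end, so a
third 24-byte entry is preceded by a 16-byte pad and lands at offset 0.
Padding keeps every entry contiguous in memory, at the cost of a few unused
bytes per wrap.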

Signed-off-by: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
---
 drivers/block/cbd/Makefile        |   2 +-
 drivers/block/cbd/cbd_blkdev.c    | 375 ++++++++++++++++++
 drivers/block/cbd/cbd_main.c      |   6 +
 drivers/block/cbd/cbd_queue.c     | 621 ++++++++++++++++++++++++++++++
 drivers/block/cbd/cbd_transport.c |  11 +
 5 files changed, 1014 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/cbd/cbd_blkdev.c
 create mode 100644 drivers/block/cbd/cbd_queue.c

diff --git a/drivers/block/cbd/Makefile b/drivers/block/cbd/Makefile
index b47f1e584946..f5fb5fd68f3d 100644
--- a/drivers/block/cbd/Makefile
+++ b/drivers/block/cbd/Makefile
@@ -1,3 +1,3 @@
-cbd-y := cbd_main.o cbd_transport.o cbd_channel.o cbd_host.o cbd_backend.o cbd_handler.o
+cbd-y := cbd_main.o cbd_transport.o cbd_channel.o cbd_host.o cbd_backend.o cbd_handler.o cbd_blkdev.o cbd_queue.o
 
 obj-$(CONFIG_BLK_DEV_CBD) += cbd.o
diff --git a/drivers/block/cbd/cbd_blkdev.c b/drivers/block/cbd/cbd_blkdev.c
new file mode 100644
index 000000000000..816bc28afb49
--- /dev/null
+++ b/drivers/block/cbd/cbd_blkdev.c
@@ -0,0 +1,375 @@
+#include "cbd_internal.h"
+
+static ssize_t blkdev_backend_id_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct cbd_blkdev_device *blkdev;
+	struct cbd_blkdev_info *blkdev_info;
+
+	blkdev = container_of(dev, struct cbd_blkdev_device, dev);
+	blkdev_info = blkdev->blkdev_info;
+
+	cbdt_flush_range(blkdev->cbdt, blkdev_info, sizeof(*blkdev_info));
+
+	if (blkdev_info->state == cbd_blkdev_state_none)
+		return 0;
+
+	return sprintf(buf, "%u\n", blkdev_info->backend_id);
+}
+
+static DEVICE_ATTR(backend_id, 0400, blkdev_backend_id_show, NULL);
+
+static ssize_t blkdev_host_id_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct cbd_blkdev_device *blkdev;
+	struct cbd_blkdev_info *blkdev_info;
+
+	blkdev = container_of(dev, struct cbd_blkdev_device, dev);
+	blkdev_info = blkdev->blkdev_info;
+
+	cbdt_flush_range(blkdev->cbdt, blkdev_info, sizeof(*blkdev_info));
+
+	if (blkdev_info->state == cbd_blkdev_state_none)
+		return 0;
+
+	return sprintf(buf, "%u\n", blkdev_info->host_id);
+}
+
+static DEVICE_ATTR(host_id, 0400, blkdev_host_id_show, NULL);
+
+static ssize_t blkdev_mapped_id_show(struct device *dev,
+			       struct device_attribute *attr,
+			       char *buf)
+{
+	struct cbd_blkdev_device *blkdev;
+	struct cbd_blkdev_info *blkdev_info;
+
+	blkdev = container_of(dev, struct cbd_blkdev_device, dev);
+	blkdev_info = blkdev->blkdev_info;
+
+	cbdt_flush_range(blkdev->cbdt, blkdev_info, sizeof(*blkdev_info));
+
+	if (blkdev_info->state == cbd_blkdev_state_none)
+		return 0;
+
+	return sprintf(buf, "%u\n", blkdev_info->mapped_id);
+}
+
+static DEVICE_ATTR(mapped_id, 0400, blkdev_mapped_id_show, NULL);
+
+CBD_OBJ_HEARTBEAT(blkdev);
+
+static struct attribute *cbd_blkdev_attrs[] = {
+	&dev_attr_mapped_id.attr,
+	&dev_attr_host_id.attr,
+	&dev_attr_backend_id.attr,
+	&dev_attr_alive.attr,
+	NULL
+};
+
+static struct attribute_group cbd_blkdev_attr_group = {
+	.attrs = cbd_blkdev_attrs,
+};
+
+static const struct attribute_group *cbd_blkdev_attr_groups[] = {
+	&cbd_blkdev_attr_group,
+	NULL
+};
+
+static void cbd_blkdev_release(struct device *dev)
+{
+}
+
+struct device_type cbd_blkdev_type = {
+	.name		= "cbd_blkdev",
+	.groups		= cbd_blkdev_attr_groups,
+	.release	= cbd_blkdev_release,
+};
+
+struct device_type cbd_blkdevs_type = {
+	.name		= "cbd_blkdevs",
+	.release	= cbd_blkdev_release,
+};
+
+
+static int cbd_major;
+static DEFINE_IDA(cbd_mapped_id_ida);
+
+static int minor_to_cbd_mapped_id(int minor)
+{
+	return minor >> CBD_PART_SHIFT;
+}
+
+
+static int cbd_open(struct gendisk *disk, blk_mode_t mode)
+{
+	return 0;
+}
+
+static void cbd_release(struct gendisk *disk)
+{
+}
+
+static const struct block_device_operations cbd_bd_ops = {
+	.owner			= THIS_MODULE,
+	.open			= cbd_open,
+	.release		= cbd_release,
+};
+
+
+static void cbd_blkdev_destroy_queues(struct cbd_blkdev *cbd_blkdev)
+{
+	int i;
+
+	for (i = 0; i < cbd_blkdev->num_queues; i++) {
+		cbd_queue_stop(&cbd_blkdev->queues[i]);
+	}
+
+	kfree(cbd_blkdev->queues);
+}
+
+static int cbd_blkdev_create_queues(struct cbd_blkdev *cbd_blkdev)
+{
+	int i;
+	int ret;
+	struct cbd_queue *cbdq;
+
+	cbd_blkdev->queues = kcalloc(cbd_blkdev->num_queues, sizeof(struct cbd_queue), GFP_KERNEL);
+	if (!cbd_blkdev->queues) {
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < cbd_blkdev->num_queues; i++) {
+		cbdq = &cbd_blkdev->queues[i];
+		cbdq->cbd_blkdev = cbd_blkdev;
+		cbdq->index = i;
+		ret = cbd_queue_start(cbdq);
+		if (ret)
+			goto err;
+
+	}
+
+	return 0;
+err:
+	cbd_blkdev_destroy_queues(cbd_blkdev);
+	return ret;
+}
+
+static int disk_start(struct cbd_blkdev *cbd_blkdev)
+{
+	int ret;
+	struct gendisk *disk;
+
+	memset(&cbd_blkdev->tag_set, 0, sizeof(cbd_blkdev->tag_set));
+	cbd_blkdev->tag_set.ops = &cbd_mq_ops;
+	cbd_blkdev->tag_set.queue_depth = 128;
+	cbd_blkdev->tag_set.numa_node = NUMA_NO_NODE;
+	cbd_blkdev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_SCHED;
+	cbd_blkdev->tag_set.nr_hw_queues = cbd_blkdev->num_queues;
+	cbd_blkdev->tag_set.cmd_size = sizeof(struct cbd_request);
+	cbd_blkdev->tag_set.timeout = 0;
+	cbd_blkdev->tag_set.driver_data = cbd_blkdev;
+
+	ret = blk_mq_alloc_tag_set(&cbd_blkdev->tag_set);
+	if (ret) {
+		pr_err("failed to alloc tag set %d", ret);
+		goto err;
+	}
+
+	disk = blk_mq_alloc_disk(&cbd_blkdev->tag_set, cbd_blkdev);
+	if (IS_ERR(disk)) {
+		ret = PTR_ERR(disk);
+		pr_err("failed to alloc disk");
+		goto out_tag_set;
+	}
+
+	snprintf(disk->disk_name, sizeof(disk->disk_name), "cbd%d",
+		 cbd_blkdev->mapped_id);
+
+	disk->major = cbd_major;
+	disk->first_minor = cbd_blkdev->mapped_id << CBD_PART_SHIFT;
+	disk->minors = (1 << CBD_PART_SHIFT);
+
+	disk->fops = &cbd_bd_ops;
+	disk->private_data = cbd_blkdev;
+
+	/* Tell the block layer that this is not a rotational device */
+	blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
+	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, disk->queue);
+	blk_queue_flag_set(QUEUE_FLAG_NOWAIT, disk->queue);
+
+	blk_queue_physical_block_size(disk->queue, PAGE_SIZE);
+	blk_queue_max_hw_sectors(disk->queue, 128);
+	blk_queue_max_segments(disk->queue, USHRT_MAX);
+	blk_queue_max_segment_size(disk->queue, UINT_MAX);
+	blk_queue_io_min(disk->queue, 4096);
+	blk_queue_io_opt(disk->queue, 4096);
+
+	disk->queue->limits.max_sectors = queue_max_hw_sectors(disk->queue);
+	/* TODO support discard */
+	disk->queue->limits.discard_granularity = 0;
+	blk_queue_max_discard_sectors(disk->queue, 0);
+	blk_queue_max_write_zeroes_sectors(disk->queue, 0);
+
+	cbd_blkdev->disk = disk;
+
+	cbdt_add_blkdev(cbd_blkdev->cbdt, cbd_blkdev);
+	cbd_blkdev->blkdev_info->mapped_id = cbd_blkdev->blkdev_id;
+	cbd_blkdev->blkdev_info->state = cbd_blkdev_state_running;
+
+	set_capacity(cbd_blkdev->disk, cbd_blkdev->dev_size);
+
+	set_disk_ro(cbd_blkdev->disk, false);
+	blk_queue_write_cache(cbd_blkdev->disk->queue, false, false);
+
+	ret = add_disk(cbd_blkdev->disk);
+	if (ret) {
+		goto put_disk;
+	}
+
+	ret = sysfs_create_link(&disk_to_dev(cbd_blkdev->disk)->kobj,
+				&cbd_blkdev->blkdev_dev->dev.kobj, "cbd_blkdev");
+	if (ret) {
+		goto del_disk;
+	}
+
+	blk_put_queue(cbd_blkdev->disk->queue);
+
+	return 0;
+
+del_disk:
+	del_gendisk(cbd_blkdev->disk);
+put_disk:
+	put_disk(cbd_blkdev->disk);
+out_tag_set:
+	blk_mq_free_tag_set(&cbd_blkdev->tag_set);
+err:
+	return ret;
+}
+
+int cbd_blkdev_start(struct cbd_transport *cbdt, u32 backend_id, u32 queues)
+{
+	struct cbd_blkdev *cbd_blkdev;
+	struct cbd_backend_info *backend_info;
+	u64 dev_size;
+	int ret;
+
+	backend_info = cbdt_get_backend_info(cbdt, backend_id);
+	cbdt_flush_range(cbdt, backend_info, sizeof(*backend_info));
+	if (backend_info->blkdev_count == CBDB_BLKDEV_COUNT_MAX) {
+		return -EBUSY;
+	}
+
+	dev_size = backend_info->dev_size;
+
+	cbd_blkdev = kzalloc(sizeof(struct cbd_blkdev), GFP_KERNEL);
+	if (!cbd_blkdev) {
+		pr_err("fail to alloc cbd_blkdev");
+		return -ENOMEM;
+	}
+
+	ret = cbdt_get_empty_blkdev_id(cbdt, &cbd_blkdev->blkdev_id);
+	if (ret < 0) {
+		goto blkdev_free;
+	}
+
+	cbd_blkdev->mapped_id = ida_simple_get(&cbd_mapped_id_ida, 0,
+					 minor_to_cbd_mapped_id(1 << MINORBITS),
+					 GFP_KERNEL);
+	if (cbd_blkdev->mapped_id < 0) {
+		ret = -ENOENT;
+		goto blkdev_free;
+	}
+
+	INIT_LIST_HEAD(&cbd_blkdev->node);
+	cbd_blkdev->cbdt = cbdt;
+	cbd_blkdev->backend_id = backend_id;
+	cbd_blkdev->num_queues = queues;
+	cbd_blkdev->dev_size = dev_size;
+	cbd_blkdev->blkdev_info = cbdt_get_blkdev_info(cbdt, cbd_blkdev->blkdev_id);
+	cbd_blkdev->blkdev_dev = &cbdt->cbd_blkdevs_dev->blkdev_devs[cbd_blkdev->blkdev_id];
+
+	cbd_blkdev->blkdev_info->state = cbd_blkdev_state_running;
+	cbdt_flush_range(cbdt, cbd_blkdev->blkdev_info, sizeof(*cbd_blkdev->blkdev_info));
+
+	INIT_DELAYED_WORK(&cbd_blkdev->hb_work, blkdev_hb_workfn);
+	queue_delayed_work(cbd_wq, &cbd_blkdev->hb_work, 0);
+
+	ret = cbd_blkdev_create_queues(cbd_blkdev);
+	if (ret < 0) {
+		goto cancel_hb;
+	}
+
+	ret = disk_start(cbd_blkdev);
+	if (ret < 0) {
+		goto destroy_queues;
+	}
+
+	backend_info->blkdev_count++;
+	cbdt_flush_range(cbdt, backend_info, sizeof(*backend_info));
+
+	return 0;
+
+destroy_queues:
+	cbd_blkdev_destroy_queues(cbd_blkdev);
+cancel_hb:
+	cancel_delayed_work_sync(&cbd_blkdev->hb_work);
+	cbd_blkdev->blkdev_info->state = cbd_blkdev_state_none;
+	cbdt_flush_range(cbdt, cbd_blkdev->blkdev_info, sizeof(*cbd_blkdev->blkdev_info));
+	ida_simple_remove(&cbd_mapped_id_ida, cbd_blkdev->mapped_id);
+blkdev_free:
+	kfree(cbd_blkdev);
+	return ret;
+}
+
+static void disk_stop(struct cbd_blkdev *cbd_blkdev)
+{
+	sysfs_remove_link(&disk_to_dev(cbd_blkdev->disk)->kobj, "cbd_blkdev");
+	del_gendisk(cbd_blkdev->disk);
+	put_disk(cbd_blkdev->disk);
+	blk_mq_free_tag_set(&cbd_blkdev->tag_set);
+}
+
+int cbd_blkdev_stop(struct cbd_transport *cbdt, u32 devid)
+{
+	struct cbd_blkdev *cbd_blkdev;
+	struct cbd_backend_info *backend_info;
+
+	cbd_blkdev = cbdt_fetch_blkdev(cbdt, devid);
+	if (!cbd_blkdev) {
+		return -EINVAL;
+	}
+
+	backend_info = cbdt_get_backend_info(cbdt, cbd_blkdev->backend_id);
+
+	disk_stop(cbd_blkdev);
+	cbd_blkdev_destroy_queues(cbd_blkdev);
+	cancel_delayed_work_sync(&cbd_blkdev->hb_work);
+	cbd_blkdev->blkdev_info->state = cbd_blkdev_state_none;
+	cbdt_flush_range(cbdt, cbd_blkdev->blkdev_info, sizeof(*cbd_blkdev->blkdev_info));
+	ida_simple_remove(&cbd_mapped_id_ida, cbd_blkdev->mapped_id);
+
+	kfree(cbd_blkdev);
+
+	backend_info->blkdev_count--;
+	cbdt_flush_range(cbdt, backend_info, sizeof(*backend_info));
+
+	return 0;
+}
+
+int cbd_blkdev_init(void)
+{
+	cbd_major = register_blkdev(0, "cbd");
+	if (cbd_major < 0)
+		return cbd_major;
+
+	return 0;
+}
+
+void cbd_blkdev_exit(void)
+{
+	unregister_blkdev(cbd_major, "cbd");
+}
diff --git a/drivers/block/cbd/cbd_main.c b/drivers/block/cbd/cbd_main.c
index 8cfa60dde7c5..658233807b59 100644
--- a/drivers/block/cbd/cbd_main.c
+++ b/drivers/block/cbd/cbd_main.c
@@ -195,6 +195,11 @@ static int __init cbd_init(void)
 		goto device_unregister;
 	}
 
+	ret = cbd_blkdev_init();
+	if (ret < 0) {
+		goto bus_unregister;
+	}
+
 	return 0;
 
 bus_unregister:
@@ -209,6 +214,7 @@ static int __init cbd_init(void)
 
 static void cbd_exit(void)
 {
+	cbd_blkdev_exit();
 	bus_unregister(&cbd_bus_type);
 	device_unregister(&cbd_root_dev);
 
diff --git a/drivers/block/cbd/cbd_queue.c b/drivers/block/cbd/cbd_queue.c
new file mode 100644
index 000000000000..6709ac016e18
--- /dev/null
+++ b/drivers/block/cbd/cbd_queue.c
@@ -0,0 +1,621 @@
+#include "cbd_internal.h"
+
+/*
+ * How do blkdev and backend interact through the channel?
+ *         a) On the reader side: if data in this channel may be modified by the
+ * other party, the reader must flush the cache before reading to ensure that it
+ * sees the latest data. For example, the blkdev needs to flush the cache before
+ * reading compr_head, because compr_head is updated by the backend handler.
+ *         b) On the writer side: if the written data will be read by the other
+ * party, the writer must flush the cache after writing so that the other party
+ * sees it immediately. For example, after the blkdev submits a cbd_se, it updates
+ * cmd_head to hand the new cbd_se to the handler, and must then flush the cache
+ * so that the backend sees the new cmd_head.
+ *
+ * The blkdev queue is the only updater of `cmd_head`, `cmd_tail`, and `compr_tail`.
+ * Therefore, it does not need to flush the cache before reading them. However, after
+ * updating them, it must flush the cache so that the backend handler sees the updates.
+ *
+ * On the other hand, `compr_head` is updated by the backend handler, so the blkdev
+ * queue must flush the cache before reading `compr_head` to see the updates.
+ *
+ *           ┌───────────┐          ┌─────────────┐
+ *           │  blkdev   │          │   backend   │
+ *           │  queue    │          │   handler   │
+ *           └─────┬─────┘          └──────┬──────┘
+ *                 ▼                       │
+ *        init data and cbd_se             │
+ *                 │                       │
+ *                 ▼                       │
+ *            update cmd_head              │
+ *                 │                       │
+ *                 ▼                       │
+ *            flush_cache                  │
+ *                 │                       ▼
+ *                 │                    flush_cache
+ *                 │                       │
+ *                 │                       ▼
+ *                 │                   handle cmd
+ *                 │                       │
+ *                 │                       ▼
+ *                 │                    fill cbd_ce
+ *                 │                       │
+ *                 │                       ▼
+ *                 │                    flush_cache
+ *                 ▼
+ *            flush_cache
+ *                 │
+ *                 ▼
+ *            complete_req
+ */
+
+static inline struct cbd_se *get_submit_entry(struct cbd_queue *cbdq)
+{
+	return (struct cbd_se *)(cbdq->channel.cmdr + cbdq->channel_info->cmd_head);
+}
+
+static inline struct cbd_se *get_oldest_se(struct cbd_queue *cbdq)
+{
+	if (cbdq->channel_info->cmd_tail == cbdq->channel_info->cmd_head)
+		return NULL;
+
+	return (struct cbd_se *)(cbdq->channel.cmdr + cbdq->channel_info->cmd_tail);
+}
+
+static inline struct cbd_ce *get_complete_entry(struct cbd_queue *cbdq)
+{
+	if (cbdq->channel_info->compr_tail == cbdq->channel_info->compr_head)
+		return NULL;
+
+	return (struct cbd_ce *)(cbdq->channel.compr + cbdq->channel_info->compr_tail);
+}
+
+static void cbd_req_init(struct cbd_queue *cbdq, enum cbd_op op, struct request *rq)
+{
+	struct cbd_request *cbd_req = blk_mq_rq_to_pdu(rq);
+
+	cbd_req->req = rq;
+	cbd_req->cbdq = cbdq;
+	cbd_req->op = op;
+
+	return;
+}
+
+static bool cbd_req_nodata(struct cbd_request *cbd_req)
+{
+	switch (cbd_req->op) {
+	case CBD_OP_WRITE:
+	case CBD_OP_READ:
+		return false;
+	case CBD_OP_DISCARD:
+	case CBD_OP_WRITE_ZEROS:
+	case CBD_OP_FLUSH:
+		return true;
+	default:
+		BUG();
+	}
+}
+
+static uint32_t cbd_req_segments(struct cbd_request *cbd_req)
+{
+	uint32_t segs = 0;
+	struct bio *bio = cbd_req->req->bio;
+
+	if (cbd_req_nodata(cbd_req))
+		return 0;
+
+	while (bio) {
+		segs += bio_segments(bio);
+		bio = bio->bi_next;
+	}
+
+	return segs;
+}
+
+static inline size_t cbd_get_cmd_size(struct cbd_request *cbd_req)
+{
+	u32 segs = cbd_req_segments(cbd_req);
+	u32 cmd_size = sizeof(struct cbd_se) + (sizeof(struct iovec) * segs);
+
+	return round_up(cmd_size, CBD_OP_ALIGN_SIZE);
+}
+
+static void insert_padding(struct cbd_queue *cbdq, u32 cmd_size)
+{
+	struct cbd_se_hdr *header;
+	u32 pad_len;
+
+	if (cbdq->channel_info->cmdr_size - cbdq->channel_info->cmd_head >= cmd_size)
+		return;
+
+	pad_len = cbdq->channel_info->cmdr_size - cbdq->channel_info->cmd_head;
+	cbd_queue_debug(cbdq, "insert pad:%u\n", pad_len);
+
+	header = (struct cbd_se_hdr *)get_submit_entry(cbdq);
+	memset(header, 0, pad_len);
+	cbd_se_hdr_set_op(&header->len_op, CBD_OP_PAD);
+	cbd_se_hdr_set_len(&header->len_op, pad_len);
+
+	cbdt_flush_range(cbdq->cbd_blkdev->cbdt, header, sizeof(*header));
+
+	CBDC_UPDATE_CMDR_HEAD(cbdq->channel_info->cmd_head, pad_len, cbdq->channel_info->cmdr_size);
+}
+
+static void queue_req_se_init(struct cbd_request *cbd_req)
+{
+	struct cbd_se	*se;
+	struct cbd_se_hdr *header;
+	u64 offset = (u64)blk_rq_pos(cbd_req->req) << SECTOR_SHIFT;
+	u64 length = blk_rq_bytes(cbd_req->req);
+
+	se = get_submit_entry(cbd_req->cbdq);
+	memset(se, 0, cbd_get_cmd_size(cbd_req));
+	header = &se->header;
+
+	cbd_se_hdr_set_op(&header->len_op, cbd_req->op);
+	cbd_se_hdr_set_len(&header->len_op, cbd_get_cmd_size(cbd_req));
+
+	se->priv_data = cbd_req->req_tid;
+	se->offset = offset;
+	se->len = length;
+
+	if (req_op(cbd_req->req) == REQ_OP_READ || req_op(cbd_req->req) == REQ_OP_WRITE) {
+		se->data_off = cbd_req->cbdq->channel.data_head;
+		se->data_len = length;
+	}
+
+	cbd_req->se = se;
+}
+
+static bool data_space_enough(struct cbd_queue *cbdq, struct cbd_request *cbd_req)
+{
+	u32 space_available;
+	u32 space_needed;
+	u32 space_used;
+	u32 space_max;
+
+	space_max = cbdq->channel.data_size - 4096;
+
+	if (cbdq->channel.data_head > cbdq->channel.data_tail)
+		space_used = cbdq->channel.data_head - cbdq->channel.data_tail;
+	else if (cbdq->channel.data_head < cbdq->channel.data_tail)
+		space_used = cbdq->channel.data_head + (cbdq->channel.data_size - cbdq->channel.data_tail);
+	else
+		space_used = 0;
+
+	space_available = space_max - space_used;
+
+	space_needed = round_up(cbd_req->data_len, 4096);
+
+	if (space_available < space_needed) {
+		cbd_queue_err(cbdq, "data space is not enough: available: %u needed: %u",
+			      space_available, space_needed);
+		return false;
+	}
+
+	return true;
+}
+
+static bool submit_ring_space_enough(struct cbd_queue *cbdq, u32 cmd_size)
+{
+	u32 space_available;
+	u32 space_needed;
+	u32 space_max, space_used;
+
+	/* Keep CBDC_CMDR_RESERVED bytes unused so that the ring can never be filled up completely */
+	space_max = cbdq->channel_info->cmdr_size - CBDC_CMDR_RESERVED;
+
+	if (cbdq->channel_info->cmd_head > cbdq->channel_info->cmd_tail)
+		space_used = cbdq->channel_info->cmd_head - cbdq->channel_info->cmd_tail;
+	else if (cbdq->channel_info->cmd_head < cbdq->channel_info->cmd_tail)
+		space_used = cbdq->channel_info->cmd_head + (cbdq->channel_info->cmdr_size - cbdq->channel_info->cmd_tail);
+	else
+		space_used = 0;
+
+	space_available = space_max - space_used;
+
+	if (cbdq->channel_info->cmdr_size - cbdq->channel_info->cmd_head >= cmd_size)
+		space_needed = cmd_size;
+	else
+		space_needed = cmd_size + cbdq->channel_info->cmdr_size - cbdq->channel_info->cmd_head;
+
+	if (space_available < space_needed)
+		return false;
+
+	return true;
+}
+
+static void queue_req_data_init(struct cbd_request *cbd_req)
+{
+	struct cbd_queue *cbdq = cbd_req->cbdq;
+	struct bio *bio = cbd_req->req->bio;
+
+	if (cbd_req->op == CBD_OP_READ) {
+		goto advance_data_head;
+	}
+
+	cbdc_copy_from_bio(&cbdq->channel, cbd_req->data_off, cbd_req->data_len, bio);
+
+advance_data_head:
+	cbdq->channel.data_head = round_up(cbdq->channel.data_head + cbd_req->data_len, PAGE_SIZE);
+	cbdq->channel.data_head %= cbdq->channel.data_size;
+
+	return;
+}
+
+static void complete_inflight_req(struct cbd_queue *cbdq, struct cbd_request *cbd_req, int ret);
+static void cbd_queue_fn(struct cbd_request *cbd_req)
+{
+	struct cbd_queue *cbdq = cbd_req->cbdq;
+	int ret = 0;
+	size_t command_size;
+
+	spin_lock(&cbdq->inflight_reqs_lock);
+	list_add_tail(&cbd_req->inflight_reqs_node, &cbdq->inflight_reqs);
+	spin_unlock(&cbdq->inflight_reqs_lock);
+
+	command_size = cbd_get_cmd_size(cbd_req);
+
+	spin_lock(&cbdq->channel.cmdr_lock);
+	if (req_op(cbd_req->req) == REQ_OP_WRITE || req_op(cbd_req->req) == REQ_OP_READ) {
+		cbd_req->data_off = cbdq->channel.data_head;
+		cbd_req->data_len = blk_rq_bytes(cbd_req->req);
+	} else {
+		cbd_req->data_off = -1;
+		cbd_req->data_len = 0;
+	}
+
+	if (!submit_ring_space_enough(cbdq, command_size) ||
+			!data_space_enough(cbdq, cbd_req)) {
+		spin_unlock(&cbdq->channel.cmdr_lock);
+
+		/* remove request from inflight_reqs */
+		spin_lock(&cbdq->inflight_reqs_lock);
+		list_del_init(&cbd_req->inflight_reqs_node);
+		spin_unlock(&cbdq->inflight_reqs_lock);
+
+		cbd_blk_debug(cbdq->cbd_blkdev, "transport space is not enough");
+		ret = -ENOMEM;
+		goto end_request;
+	}
+
+	insert_padding(cbdq, command_size);
+
+	cbd_req->req_tid = ++cbdq->req_tid;
+	queue_req_se_init(cbd_req);
+	cbdt_flush_range(cbdq->cbd_blkdev->cbdt, cbd_req->se, sizeof(struct cbd_se));
+
+	if (!cbd_req_nodata(cbd_req)) {
+		queue_req_data_init(cbd_req);
+	}
+
+	queue_delayed_work(cbdq->task_wq, &cbdq->complete_work, 0);
+
+	CBDC_UPDATE_CMDR_HEAD(cbdq->channel_info->cmd_head,
+			cbd_get_cmd_size(cbd_req),
+			cbdq->channel_info->cmdr_size);
+	cbdc_flush_ctrl(&cbdq->channel);
+	spin_unlock(&cbdq->channel.cmdr_lock);
+
+	return;
+
+end_request:
+	if (ret == -ENOMEM || ret == -EBUSY)
+		blk_mq_requeue_request(cbd_req->req, true);
+	else
+		blk_mq_end_request(cbd_req->req, errno_to_blk_status(ret));
+
+	return;
+}
+
+static void cbd_req_release(struct cbd_request *cbd_req)
+{
+	return;
+}
+
+static void advance_cmd_ring(struct cbd_queue *cbdq)
+{
+	struct cbd_se *se;
+again:
+	se = get_oldest_se(cbdq);
+	if (!se)
+		goto out;
+
+	if (cbd_se_hdr_flags_test(se, CBD_SE_HDR_DONE)) {
+		CBDC_UPDATE_CMDR_TAIL(cbdq->channel_info->cmd_tail,
+				cbd_se_hdr_get_len(se->header.len_op),
+				cbdq->channel_info->cmdr_size);
+		cbdc_flush_ctrl(&cbdq->channel);
+		goto again;
+	}
+out:
+	return;
+}
+
+static bool __advance_data_tail(struct cbd_queue *cbdq, u32 data_off, u32 data_len)
+{
+	if (data_off == cbdq->channel.data_tail) {
+		cbdq->released_extents[data_off / 4096] = 0;
+		cbdq->channel.data_tail += data_len;
+		if (cbdq->channel.data_tail >= cbdq->channel.data_size) {
+			cbdq->channel.data_tail %= cbdq->channel.data_size;
+		}
+		return true;
+	}
+
+	return false;
+}
+
+static void advance_data_tail(struct cbd_queue *cbdq, u32 data_off, u32 data_len)
+{
+	cbdq->released_extents[data_off / 4096] = data_len;
+
+	while (__advance_data_tail(cbdq, data_off, data_len)) {
+		data_off = (data_off + data_len) % cbdq->channel.data_size;
+		data_len = cbdq->released_extents[data_off / 4096];
+		if (!data_len) {
+			break;
+		}
+	}
+}
+
+static inline void complete_inflight_req(struct cbd_queue *cbdq, struct cbd_request *cbd_req, int ret)
+{
+	u32 data_off, data_len;
+	bool advance_data = false;
+
+	spin_lock(&cbdq->inflight_reqs_lock);
+	list_del_init(&cbd_req->inflight_reqs_node);
+	spin_unlock(&cbdq->inflight_reqs_lock);
+
+	cbd_se_hdr_flags_set(cbd_req->se, CBD_SE_HDR_DONE);
+	data_off = cbd_req->data_off;
+	data_len = cbd_req->data_len;
+	advance_data = (!cbd_req_nodata(cbd_req));
+
+	blk_mq_end_request(cbd_req->req, errno_to_blk_status(ret));
+
+	cbd_req_release(cbd_req);
+
+	spin_lock(&cbdq->channel.cmdr_lock);
+	advance_cmd_ring(cbdq);
+	if (advance_data)
+		advance_data_tail(cbdq, data_off, round_up(data_len, PAGE_SIZE));
+	spin_unlock(&cbdq->channel.cmdr_lock);
+}
+
+static struct cbd_request *fetch_inflight_req(struct cbd_queue *cbdq, u64 req_tid)
+{
+	struct cbd_request *req;
+	bool found = false;
+
+	list_for_each_entry(req, &cbdq->inflight_reqs, inflight_reqs_node) {
+		if (req->req_tid == req_tid) {
+			list_del_init(&req->inflight_reqs_node);
+			found = true;
+			break;
+		}
+	}
+
+	if (found)
+		return req;
+
+	return NULL;
+}
+
+static void copy_data_from_cbdteq(struct cbd_request *cbd_req)
+{
+	struct bio *bio = cbd_req->req->bio;
+	struct cbd_queue *cbdq = cbd_req->cbdq;
+
+	cbdc_copy_to_bio(&cbdq->channel, cbd_req->data_off, cbd_req->data_len, bio);
+
+	return;
+}
+
+static void complete_work_fn(struct work_struct *work)
+{
+	struct cbd_queue *cbdq = container_of(work, struct cbd_queue, complete_work.work);
+	struct cbd_ce *ce;
+	struct cbd_request *cbd_req;
+
+again:
+	/* compr_head would be updated by backend handler */
+	cbdc_flush_ctrl(&cbdq->channel);
+
+	spin_lock(&cbdq->channel.compr_lock);
+	ce = get_complete_entry(cbdq);
+	if (!ce) {
+		spin_unlock(&cbdq->channel.compr_lock);
+		if (cbdwc_need_retry(&cbdq->complete_worker_cfg)) {
+			goto again;
+		}
+
+		spin_lock(&cbdq->inflight_reqs_lock);
+		if (list_empty(&cbdq->inflight_reqs)) {
+			spin_unlock(&cbdq->inflight_reqs_lock);
+			cbdwc_init(&cbdq->complete_worker_cfg);
+			return;
+		}
+		spin_unlock(&cbdq->inflight_reqs_lock);
+
+		cbdwc_miss(&cbdq->complete_worker_cfg);
+
+		queue_delayed_work(cbdq->task_wq, &cbdq->complete_work, 0);
+		return;
+	}
+	cbdwc_hit(&cbdq->complete_worker_cfg);
+	CBDC_UPDATE_COMPR_TAIL(cbdq->channel_info->compr_tail,
+			       sizeof(struct cbd_ce),
+			       cbdq->channel_info->compr_size);
+	cbdc_flush_ctrl(&cbdq->channel);
+	spin_unlock(&cbdq->channel.compr_lock);
+
+	spin_lock(&cbdq->inflight_reqs_lock);
+	/* flush to ensure the content of ce is uptodate */
+	cbdt_flush_range(cbdq->cbd_blkdev->cbdt, ce, sizeof(*ce));
+	cbd_req = fetch_inflight_req(cbdq, ce->priv_data);
+	spin_unlock(&cbdq->inflight_reqs_lock);
+	if (!cbd_req) {
+		goto again;
+	}
+
+	if (req_op(cbd_req->req) == REQ_OP_READ) {
+		spin_lock(&cbdq->channel.cmdr_lock);
+		copy_data_from_cbdteq(cbd_req);
+		spin_unlock(&cbdq->channel.cmdr_lock);
+	}
+
+	complete_inflight_req(cbdq, cbd_req, ce->result);
+
+	goto again;
+}
+
+static blk_status_t cbd_queue_rq(struct blk_mq_hw_ctx *hctx,
+		const struct blk_mq_queue_data *bd)
+{
+	struct request *req = bd->rq;
+	struct cbd_queue *cbdq = hctx->driver_data;
+	struct cbd_request *cbd_req = blk_mq_rq_to_pdu(bd->rq);
+
+	memset(cbd_req, 0, sizeof(struct cbd_request));
+	INIT_LIST_HEAD(&cbd_req->inflight_reqs_node);
+
+	blk_mq_start_request(bd->rq);
+
+	switch (req_op(bd->rq)) {
+	case REQ_OP_FLUSH:
+		cbd_req_init(cbdq, CBD_OP_FLUSH, req);
+		break;
+	case REQ_OP_DISCARD:
+		cbd_req_init(cbdq, CBD_OP_DISCARD, req);
+		break;
+	case REQ_OP_WRITE_ZEROES:
+		cbd_req_init(cbdq, CBD_OP_WRITE_ZEROS, req);
+		break;
+	case REQ_OP_WRITE:
+		cbd_req_init(cbdq, CBD_OP_WRITE, req);
+		break;
+	case REQ_OP_READ:
+		cbd_req_init(cbdq, CBD_OP_READ, req);
+		break;
+	default:
+		return BLK_STS_IOERR;
+	}
+
+	cbd_queue_fn(cbd_req);
+
+	return BLK_STS_OK;
+}
+
+static int cbd_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data,
+			unsigned int hctx_idx)
+{
+	struct cbd_blkdev *cbd_blkdev = driver_data;
+	struct cbd_queue *cbdq;
+
+	cbdq = &cbd_blkdev->queues[hctx_idx];
+	hctx->driver_data = cbdq;
+
+	return 0;
+}
+
+const struct blk_mq_ops cbd_mq_ops = {
+	.queue_rq	= cbd_queue_rq,
+	.init_hctx	= cbd_init_hctx,
+};
+
+static int cbd_queue_channel_init(struct cbd_queue *cbdq, u32 channel_id)
+{
+	struct cbd_blkdev *cbd_blkdev = cbdq->cbd_blkdev;
+	struct cbd_transport *cbdt = cbd_blkdev->cbdt;
+
+	cbdq->channel_id = channel_id;
+	cbd_channel_init(&cbdq->channel, cbdt, channel_id);
+	cbdq->channel_info = cbdq->channel.channel_info;
+
+	cbdq->channel.data_head = cbdq->channel.data_tail = 0;
+
+	/* Initialise the channel_info of the ring buffer */
+	cbdq->channel_info->cmdr_off = CBDC_CMDR_OFF;
+	cbdq->channel_info->cmdr_size = CBDC_CMDR_SIZE;
+	cbdq->channel_info->compr_off = CBDC_COMPR_OFF;
+	cbdq->channel_info->compr_size = CBDC_COMPR_SIZE;
+
+	cbdq->channel_info->backend_id = cbd_blkdev->backend_id;
+	cbdq->channel_info->blkdev_id = cbd_blkdev->blkdev_id;
+	cbdq->channel_info->blkdev_state = cbdc_blkdev_state_running;
+	cbdq->channel_info->state = cbd_channel_state_running;
+
+	cbdc_flush_ctrl(&cbdq->channel);
+
+	return 0;
+}
+
+int cbd_queue_start(struct cbd_queue *cbdq)
+{
+	struct cbd_transport *cbdt = cbdq->cbd_blkdev->cbdt;
+	u32 channel_id;
+	int ret;
+
+	ret = cbdt_get_empty_channel_id(cbdt, &channel_id);
+	if (ret < 0) {
+		cbdt_err(cbdt, "failed to find an available channel_id.\n");
+		goto err;
+	}
+
+	ret = cbd_queue_channel_init(cbdq, channel_id);
+	if (ret) {
+		cbd_queue_err(cbdq, "failed to init dev channel_info: %d.", ret);
+		goto err;
+	}
+
+	INIT_LIST_HEAD(&cbdq->inflight_reqs);
+	spin_lock_init(&cbdq->inflight_reqs_lock);
+	cbdq->req_tid = 0;
+	INIT_DELAYED_WORK(&cbdq->complete_work, complete_work_fn);
+	cbdwc_init(&cbdq->complete_worker_cfg);
+
+	cbdq->released_extents = kcalloc(CBDC_DATA_SIZE >> PAGE_SHIFT, sizeof(u32), GFP_KERNEL);
+	if (!cbdq->released_extents) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	cbdq->task_wq = alloc_workqueue("cbd%d-queue%u", WQ_UNBOUND | WQ_MEM_RECLAIM,
+					0, cbdq->cbd_blkdev->mapped_id, cbdq->index);
+	if (!cbdq->task_wq) {
+		ret = -ENOMEM;
+		goto released_extents_free;
+	}
+
+	queue_delayed_work(cbdq->task_wq, &cbdq->complete_work, 0);
+
+	atomic_set(&cbdq->state, cbd_queue_state_running);
+
+	return 0;
+
+released_extents_free:
+	kfree(cbdq->released_extents);
+err:
+	return ret;
+}
+
+void cbd_queue_stop(struct cbd_queue *cbdq)
+{
+	if (atomic_cmpxchg(&cbdq->state,
+			   cbd_queue_state_running,
+			   cbd_queue_state_none) != cbd_queue_state_running)
+		return;
+
+	cancel_delayed_work_sync(&cbdq->complete_work);
+	drain_workqueue(cbdq->task_wq);
+	destroy_workqueue(cbdq->task_wq);
+
+	kfree(cbdq->released_extents);
+	cbdq->channel_info->blkdev_state = cbdc_blkdev_state_none;
+
+	cbdc_flush_ctrl(&cbdq->channel);
+
+	return;
+}
diff --git a/drivers/block/cbd/cbd_transport.c b/drivers/block/cbd/cbd_transport.c
index 4dd9bf1b5fd5..75b9d34218fc 100644
--- a/drivers/block/cbd/cbd_transport.c
+++ b/drivers/block/cbd/cbd_transport.c
@@ -315,8 +315,19 @@ static ssize_t cbd_adm_store(struct device *dev,
 	case CBDT_ADM_OP_B_CLEAR:
 		break;
 	case CBDT_ADM_OP_DEV_START:
+		if (opts.blkdev.queues > CBD_QUEUES_MAX) {
+			cbdt_err(cbdt, "invalid queues = %u, larger than max %u\n",
+					opts.blkdev.queues, CBD_QUEUES_MAX);
+			return -EINVAL;
+		}
+		ret = cbd_blkdev_start(cbdt, opts.backend_id, opts.blkdev.queues);
+		if (ret < 0)
+			return ret;
 		break;
 	case CBDT_ADM_OP_DEV_STOP:
+		ret = cbd_blkdev_stop(cbdt, opts.blkdev.devid);
+		if (ret < 0)
+			return ret;
 		break;
 	default:
 		pr_err("invalid op: %d\n", opts.op);
-- 
2.34.1
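
For readers following submit_ring_space_enough() and insert_padding() in the
patch above, here is a minimal user-space sketch of the same accounting. It is
not the driver code: `struct ring` and the `ring_*` helpers are hypothetical
names, the fields only mirror cmdr_size, CBDC_CMDR_RESERVED, cmd_head and
cmd_tail, and the sketch assumes a command that fits exactly in the remaining
bytes needs no pad entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the ring fields kept in channel_info. */
struct ring {
	uint32_t size;     /* total ring bytes (mirrors cmdr_size) */
	uint32_t reserved; /* bytes never handed out (mirrors CBDC_CMDR_RESERVED) */
	uint32_t head;     /* producer offset (mirrors cmd_head) */
	uint32_t tail;     /* consumer offset (mirrors cmd_tail) */
};

/* Bytes occupied between tail and head, taking wrap-around into account. */
static uint32_t ring_used(const struct ring *r)
{
	if (r->head >= r->tail)
		return r->head - r->tail;
	return r->head + (r->size - r->tail);
}

/*
 * Bytes that have to be skipped with a pad entry so that a command of
 * cmd_size bytes stays contiguous. Zero when the command fits before
 * the end of the ring.
 */
static uint32_t ring_pad_len(const struct ring *r, uint32_t cmd_size)
{
	uint32_t to_end = r->size - r->head;

	return to_end >= cmd_size ? 0 : to_end;
}

/* The command plus any padding must fit into the non-reserved free space. */
static bool ring_has_room(const struct ring *r, uint32_t cmd_size)
{
	uint32_t avail = r->size - r->reserved - ring_used(r);

	return avail >= cmd_size + ring_pad_len(r, cmd_size);
}

/* Skip the padding, reserve cmd_size bytes, return the command offset. */
static uint32_t ring_reserve(struct ring *r, uint32_t cmd_size)
{
	uint32_t off;

	assert(ring_has_room(r, cmd_size));
	r->head = (r->head + ring_pad_len(r, cmd_size)) % r->size;
	off = r->head;
	r->head = (r->head + cmd_size) % r->size;
	return off;
}
```

Because the reserved bytes are never handed out, head == tail can only mean
"empty", which is the same convention get_oldest_se() relies on.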


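The out-of-order release of data space done by advance_data_tail() in the
patch above can be sketched in user space as follows. This is only an
illustration under stated assumptions: `struct data_area`, `data_area_init()`
and `data_release()` are hypothetical names, the data area is a made-up
32 KiB, and the table is zeroed up front so that 0 can mean "not released
yet". The loop walks from the tail rather than from the released offset,
which is a restructuring of the same idea and not the driver's exact code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXT_SHIFT  12u                      /* 4 KiB extents, as in the patch */
#define EXT_SIZE   (1u << EXT_SHIFT)
#define DATA_SIZE  (8u << EXT_SHIFT)        /* hypothetical 32 KiB data area */
#define NR_EXTENTS (DATA_SIZE >> EXT_SHIFT)

struct data_area {
	uint32_t tail;                 /* offset of the oldest byte still in use */
	uint32_t released[NR_EXTENTS]; /* released length per start extent, 0 = busy */
};

static void data_area_init(struct data_area *d)
{
	/* Every entry must start at 0, otherwise the tail could run ahead. */
	memset(d, 0, sizeof(*d));
}

/*
 * Record that [off, off + len) has completed, then move the tail over
 * every released extent that is now contiguous with it. Requests that
 * complete before an older one only leave a mark in the table.
 */
static void data_release(struct data_area *d, uint32_t off, uint32_t len)
{
	assert(off % EXT_SIZE == 0 && len % EXT_SIZE == 0);
	assert(off < DATA_SIZE && len > 0);

	d->released[off >> EXT_SHIFT] = len;

	for (;;) {
		uint32_t idx = d->tail >> EXT_SHIFT;
		uint32_t run = d->released[idx];

		if (!run)
			break;

		d->released[idx] = 0;
		d->tail = (d->tail + run) % DATA_SIZE; /* wrap at the end of the area */
	}
}
```

With three requests at offsets 0, 4096 and 12288, completing the second and
third first leaves the tail at 0; once the first one completes, the tail
jumps over all three in a single call.
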
^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 6/7] cbd: introduce cbd_blkdev
  2024-04-22 22:42 ` [PATCH 6/7] cbd: introduce cbd_blkdev Dongsheng Yang
@ 2024-04-23  7:27   ` Dongsheng Yang
  0 siblings, 0 replies; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-23  7:27 UTC (permalink / raw)
  To: dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang



On Tuesday, 2024/4/23 at 6:42 AM, Dongsheng Yang wrote:
> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> 
> The "cbd_blkdev" represents a virtual block device named "/dev/cbdX". It
> corresponds to a backend. The "blkdev" interacts with upper-layer users
> and accepts IO requests from them. A "blkdev" includes multiple
> "cbd_queues", each of which requires a "cbd_channel" to
> interact with the backend's handler. The "cbd_queue" forwards IO
> requests from the upper layer to the backend's handler through the
> channel.
> 
> Signed-off-by: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> ---
>   drivers/block/cbd/Makefile        |   2 +-
>   drivers/block/cbd/cbd_blkdev.c    | 375 ++++++++++++++++++
>   drivers/block/cbd/cbd_main.c      |   6 +
>   drivers/block/cbd/cbd_queue.c     | 621 ++++++++++++++++++++++++++++++
>   drivers/block/cbd/cbd_transport.c |  11 +
>   5 files changed, 1014 insertions(+), 1 deletion(-)
>   create mode 100644 drivers/block/cbd/cbd_blkdev.c
>   create mode 100644 drivers/block/cbd/cbd_queue.c
> 
> diff --git a/drivers/block/cbd/Makefile b/drivers/block/cbd/Makefile
> index b47f1e584946..f5fb5fd68f3d 100644
> --- a/drivers/block/cbd/Makefile
> +++ b/drivers/block/cbd/Makefile
> @@ -1,3 +1,3 @@
> -cbd-y := cbd_main.o cbd_transport.o cbd_channel.o cbd_host.o cbd_backend.o cbd_handler.o
> +cbd-y := cbd_main.o cbd_transport.o cbd_channel.o cbd_host.o cbd_backend.o cbd_handler.o cbd_blkdev.o cbd_queue.o
>   
>   obj-$(CONFIG_BLK_DEV_CBD) += cbd.o
> diff --git a/drivers/block/cbd/cbd_blkdev.c b/drivers/block/cbd/cbd_blkdev.c
> new file mode 100644
> index 000000000000..816bc28afb49
> --- /dev/null
> +++ b/drivers/block/cbd/cbd_blkdev.c
> @@ -0,0 +1,375 @@
> +#include "cbd_internal.h"
> +
> +static ssize_t blkdev_backend_id_show(struct device *dev,
> +			       struct device_attribute *attr,
> +			       char *buf)
> +{
> +	struct cbd_blkdev_device *blkdev;
> +	struct cbd_blkdev_info *blkdev_info;
> +
> +	blkdev = container_of(dev, struct cbd_blkdev_device, dev);
> +	blkdev_info = blkdev->blkdev_info;
> +
> +	cbdt_flush_range(blkdev->cbdt, blkdev_info, sizeof(*blkdev_info));
> +
> +	if (blkdev_info->state == cbd_blkdev_state_none)
> +		return 0;
> +
> +	return sysfs_emit(buf, "%u\n", blkdev_info->backend_id);
> +}
> +
> +static DEVICE_ATTR(backend_id, 0400, blkdev_backend_id_show, NULL);
> +
> +static ssize_t blkdev_host_id_show(struct device *dev,
> +			       struct device_attribute *attr,
> +			       char *buf)
> +{
> +	struct cbd_blkdev_device *blkdev;
> +	struct cbd_blkdev_info *blkdev_info;
> +
> +	blkdev = container_of(dev, struct cbd_blkdev_device, dev);
> +	blkdev_info = blkdev->blkdev_info;
> +
> +	cbdt_flush_range(blkdev->cbdt, blkdev_info, sizeof(*blkdev_info));
> +
> +	if (blkdev_info->state == cbd_blkdev_state_none)
> +		return 0;
> +
> +	return sysfs_emit(buf, "%u\n", blkdev_info->host_id);
> +}
> +
> +static DEVICE_ATTR(host_id, 0400, blkdev_host_id_show, NULL);
> +
> +static ssize_t blkdev_mapped_id_show(struct device *dev,
> +			       struct device_attribute *attr,
> +			       char *buf)
> +{
> +	struct cbd_blkdev_device *blkdev;
> +	struct cbd_blkdev_info *blkdev_info;
> +
> +	blkdev = container_of(dev, struct cbd_blkdev_device, dev);
> +	blkdev_info = blkdev->blkdev_info;
> +
> +	cbdt_flush_range(blkdev->cbdt, blkdev_info, sizeof(*blkdev_info));
> +
> +	if (blkdev_info->state == cbd_blkdev_state_none)
> +		return 0;
> +
> +	return sysfs_emit(buf, "%u\n", blkdev_info->mapped_id);
> +}
> +
> +static DEVICE_ATTR(mapped_id, 0400, blkdev_mapped_id_show, NULL);
> +
> +CBD_OBJ_HEARTBEAT(blkdev);
> +
> +static struct attribute *cbd_blkdev_attrs[] = {
> +	&dev_attr_mapped_id.attr,
> +	&dev_attr_host_id.attr,
> +	&dev_attr_backend_id.attr,
> +	&dev_attr_alive.attr,
> +	NULL
> +};
> +
> +static struct attribute_group cbd_blkdev_attr_group = {
> +	.attrs = cbd_blkdev_attrs,
> +};
> +
> +static const struct attribute_group *cbd_blkdev_attr_groups[] = {
> +	&cbd_blkdev_attr_group,
> +	NULL
> +};
> +
> +static void cbd_blkdev_release(struct device *dev)
> +{
> +}
> +
> +struct device_type cbd_blkdev_type = {
> +	.name		= "cbd_blkdev",
> +	.groups		= cbd_blkdev_attr_groups,
> +	.release	= cbd_blkdev_release,
> +};
> +
> +struct device_type cbd_blkdevs_type = {
> +	.name		= "cbd_blkdevs",
> +	.release	= cbd_blkdev_release,
> +};
> +
> +
> +static int cbd_major;
> +static DEFINE_IDA(cbd_mapped_id_ida);
> +
> +static int minor_to_cbd_mapped_id(int minor)
> +{
> +	return minor >> CBD_PART_SHIFT;
> +}
> +
> +
> +static int cbd_open(struct gendisk *disk, blk_mode_t mode)
> +{
> +	return 0;
> +}
> +
> +static void cbd_release(struct gendisk *disk)
> +{
> +}
> +
> +static const struct block_device_operations cbd_bd_ops = {
> +	.owner			= THIS_MODULE,
> +	.open			= cbd_open,
> +	.release		= cbd_release,
> +};
> +
> +
> +static void cbd_blkdev_destroy_queues(struct cbd_blkdev *cbd_blkdev)
> +{
> +	int i;
> +
> +	for (i = 0; i < cbd_blkdev->num_queues; i++) {
> +		cbd_queue_stop(&cbd_blkdev->queues[i]);
> +	}
> +
> +	kfree(cbd_blkdev->queues);
> +}
> +
> +static int cbd_blkdev_create_queues(struct cbd_blkdev *cbd_blkdev)
> +{
> +	int i;
> +	int ret;
> +	struct cbd_queue *cbdq;
> +
> +	cbd_blkdev->queues = kcalloc(cbd_blkdev->num_queues, sizeof(struct cbd_queue), GFP_KERNEL);
> +	if (!cbd_blkdev->queues) {
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < cbd_blkdev->num_queues; i++) {
> +		cbdq = &cbd_blkdev->queues[i];
> +		cbdq->cbd_blkdev = cbd_blkdev;
> +		cbdq->index = i;
> +		ret = cbd_queue_start(cbdq);
> +		if (ret)
> +			goto err;
> +
> +	}
> +
> +	return 0;
> +err:
> +	cbd_blkdev_destroy_queues(cbd_blkdev);
> +	return ret;
> +}
> +
> +static int disk_start(struct cbd_blkdev *cbd_blkdev)
> +{
> +	int ret;
> +	struct gendisk *disk;
> +
> +	memset(&cbd_blkdev->tag_set, 0, sizeof(cbd_blkdev->tag_set));
> +	cbd_blkdev->tag_set.ops = &cbd_mq_ops;
> +	cbd_blkdev->tag_set.queue_depth = 128;
> +	cbd_blkdev->tag_set.numa_node = NUMA_NO_NODE;
> +	cbd_blkdev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_SCHED;
> +	cbd_blkdev->tag_set.nr_hw_queues = cbd_blkdev->num_queues;
> +	cbd_blkdev->tag_set.cmd_size = sizeof(struct cbd_request);
> +	cbd_blkdev->tag_set.timeout = 0;
> +	cbd_blkdev->tag_set.driver_data = cbd_blkdev;
> +
> +	ret = blk_mq_alloc_tag_set(&cbd_blkdev->tag_set);
> +	if (ret) {
> +		pr_err("failed to alloc tag set: %d\n", ret);
> +		goto err;
> +	}
> +
> +	disk = blk_mq_alloc_disk(&cbd_blkdev->tag_set, cbd_blkdev);
> +	if (IS_ERR(disk)) {
> +		ret = PTR_ERR(disk);
> +		pr_err("failed to alloc disk: %d\n", ret);
> +		goto out_tag_set;
> +	}
> +
> +	snprintf(disk->disk_name, sizeof(disk->disk_name), "cbd%d",
> +		 cbd_blkdev->mapped_id);
> +
> +	disk->major = cbd_major;
> +	disk->first_minor = cbd_blkdev->mapped_id << CBD_PART_SHIFT;
> +	disk->minors = (1 << CBD_PART_SHIFT);
> +
> +	disk->fops = &cbd_bd_ops;
> +	disk->private_data = cbd_blkdev;
> +
> +	/* Tell the block layer that this is not a rotational device */
> +	blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
> +	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, disk->queue);
> +	blk_queue_flag_set(QUEUE_FLAG_NOWAIT, disk->queue);
> +
> +	blk_queue_physical_block_size(disk->queue, PAGE_SIZE);
> +	blk_queue_max_hw_sectors(disk->queue, 128);
> +	blk_queue_max_segments(disk->queue, USHRT_MAX);
> +	blk_queue_max_segment_size(disk->queue, UINT_MAX);
> +	blk_queue_io_min(disk->queue, 4096);
> +	blk_queue_io_opt(disk->queue, 4096);
> +
> +	disk->queue->limits.max_sectors = queue_max_hw_sectors(disk->queue);
> +	/* TODO support discard */
> +	disk->queue->limits.discard_granularity = 0;
> +	blk_queue_max_discard_sectors(disk->queue, 0);
> +	blk_queue_max_write_zeroes_sectors(disk->queue, 0);
> +
> +	cbd_blkdev->disk = disk;
> +
> +	cbdt_add_blkdev(cbd_blkdev->cbdt, cbd_blkdev);
> +	cbd_blkdev->blkdev_info->mapped_id = cbd_blkdev->mapped_id;
> +	cbd_blkdev->blkdev_info->state = cbd_blkdev_state_running;
> +
> +	set_capacity(cbd_blkdev->disk, cbd_blkdev->dev_size);
> +
> +	set_disk_ro(cbd_blkdev->disk, false);
> +	blk_queue_write_cache(cbd_blkdev->disk->queue, false, false);
> +
> +	ret = add_disk(cbd_blkdev->disk);
> +	if (ret) {
> +		goto put_disk;
> +	}
> +
> +	ret = sysfs_create_link(&disk_to_dev(cbd_blkdev->disk)->kobj,
> +				&cbd_blkdev->blkdev_dev->dev.kobj, "cbd_blkdev");
> +	if (ret) {
> +		goto del_disk;
> +	}
> +
> +	blk_put_queue(cbd_blkdev->disk->queue);
> +
> +	return 0;
> +
> +del_disk:
> +	del_gendisk(cbd_blkdev->disk);
> +put_disk:
> +	put_disk(cbd_blkdev->disk);
> +out_tag_set:
> +	blk_mq_free_tag_set(&cbd_blkdev->tag_set);
> +err:
> +	return ret;
> +}
> +
> +int cbd_blkdev_start(struct cbd_transport *cbdt, u32 backend_id, u32 queues)
> +{
> +	struct cbd_blkdev *cbd_blkdev;
> +	struct cbd_backend_info *backend_info;
> +	u64 dev_size;
> +	int ret;
> +
> +	backend_info = cbdt_get_backend_info(cbdt, backend_id);
> +	cbdt_flush_range(cbdt, backend_info, sizeof(*backend_info));
> +	if (backend_info->blkdev_count == CBDB_BLKDEV_COUNT_MAX) {
> +		return -EBUSY;
> +	}
> +
> +	dev_size = backend_info->dev_size;
> +
> +	cbd_blkdev = kzalloc(sizeof(struct cbd_blkdev), GFP_KERNEL);
> +	if (!cbd_blkdev) {
> +		pr_err("failed to alloc cbd_blkdev\n");
> +		return -ENOMEM;
> +	}
> +
> +	ret = cbdt_get_empty_blkdev_id(cbdt, &cbd_blkdev->blkdev_id);
> +	if (ret < 0) {
> +		goto blkdev_free;
> +	}
> +
> +	cbd_blkdev->mapped_id = ida_simple_get(&cbd_mapped_id_ida, 0,
> +					 minor_to_cbd_mapped_id(1 << MINORBITS),
> +					 GFP_KERNEL);
> +	if (cbd_blkdev->mapped_id < 0) {
> +		ret = -ENOENT;
> +		goto blkdev_free;
> +	}
> +
> +	INIT_LIST_HEAD(&cbd_blkdev->node);
> +	cbd_blkdev->cbdt = cbdt;
> +	cbd_blkdev->backend_id = backend_id;
> +	cbd_blkdev->num_queues = queues;
> +	cbd_blkdev->dev_size = dev_size;
> +	cbd_blkdev->blkdev_info = cbdt_get_blkdev_info(cbdt, cbd_blkdev->blkdev_id);
> +	cbd_blkdev->blkdev_dev = &cbdt->cbd_blkdevs_dev->blkdev_devs[cbd_blkdev->blkdev_id];
> +
> +	cbd_blkdev->blkdev_info->state = cbd_blkdev_state_running;
> +	cbdt_flush_range(cbdt, cbd_blkdev->blkdev_info, sizeof(*cbd_blkdev->blkdev_info));
> +
> +	INIT_DELAYED_WORK(&cbd_blkdev->hb_work, blkdev_hb_workfn);
> +	queue_delayed_work(cbd_wq, &cbd_blkdev->hb_work, 0);
> +
> +	ret = cbd_blkdev_create_queues(cbd_blkdev);
> +	if (ret < 0) {
> +		goto cancel_hb;
> +	}
> +
> +	ret = disk_start(cbd_blkdev);
> +	if (ret < 0) {
> +		goto destroy_queues;
> +	}
> +
> +	backend_info->blkdev_count++;
> +	cbdt_flush_range(cbdt, backend_info, sizeof(*backend_info));
> +
> +	return 0;
> +
> +destroy_queues:
> +	cbd_blkdev_destroy_queues(cbd_blkdev);
> +cancel_hb:
> +	cancel_delayed_work_sync(&cbd_blkdev->hb_work);
> +	cbd_blkdev->blkdev_info->state = cbd_blkdev_state_none;
> +	cbdt_flush_range(cbdt, cbd_blkdev->blkdev_info, sizeof(*cbd_blkdev->blkdev_info));
> +	ida_simple_remove(&cbd_mapped_id_ida, cbd_blkdev->mapped_id);
> +blkdev_free:
> +	kfree(cbd_blkdev);
> +	return ret;
> +}
> +
> +static void disk_stop(struct cbd_blkdev *cbd_blkdev)
> +{
> +	sysfs_remove_link(&disk_to_dev(cbd_blkdev->disk)->kobj, "cbd_blkdev");
> +	del_gendisk(cbd_blkdev->disk);
> +	put_disk(cbd_blkdev->disk);
> +	blk_mq_free_tag_set(&cbd_blkdev->tag_set);
> +}
> +
> +int cbd_blkdev_stop(struct cbd_transport *cbdt, u32 devid)
> +{
> +	struct cbd_blkdev *cbd_blkdev;
> +	struct cbd_backend_info *backend_info;
> +
> +	cbd_blkdev = cbdt_fetch_blkdev(cbdt, devid);
> +	if (!cbd_blkdev) {
> +		return -EINVAL;
> +	}
> +
> +	backend_info = cbdt_get_backend_info(cbdt, cbd_blkdev->backend_id);
> +
> +	disk_stop(cbd_blkdev);
> +	cbd_blkdev_destroy_queues(cbd_blkdev);
> +	cancel_delayed_work_sync(&cbd_blkdev->hb_work);
> +	cbd_blkdev->blkdev_info->state = cbd_blkdev_state_none;
> +	cbdt_flush_range(cbdt, cbd_blkdev->blkdev_info, sizeof(*cbd_blkdev->blkdev_info));
> +	ida_simple_remove(&cbd_mapped_id_ida, cbd_blkdev->mapped_id);
> +
> +	kfree(cbd_blkdev);
> +
> +	backend_info->blkdev_count--;
> +	cbdt_flush_range(cbdt, backend_info, sizeof(*backend_info));
> +
> +	return 0;
> +}
> +
> +int cbd_blkdev_init(void)
> +{
> +	cbd_major = register_blkdev(0, "cbd");
> +	if (cbd_major < 0)
> +		return cbd_major;
> +
> +	return 0;
> +}
> +
> +void cbd_blkdev_exit(void)
> +{
> +	unregister_blkdev(cbd_major, "cbd");
> +}
> diff --git a/drivers/block/cbd/cbd_main.c b/drivers/block/cbd/cbd_main.c
> index 8cfa60dde7c5..658233807b59 100644
> --- a/drivers/block/cbd/cbd_main.c
> +++ b/drivers/block/cbd/cbd_main.c
> @@ -195,6 +195,11 @@ static int __init cbd_init(void)
>   		goto device_unregister;
>   	}
>   
> +	ret = cbd_blkdev_init();
> +	if (ret < 0) {
> +		goto bus_unregister;
> +	}
> +
>   	return 0;
>   
>   bus_unregister:
> @@ -209,6 +214,7 @@ static int __init cbd_init(void)
>   
>   static void cbd_exit(void)
>   {
> +	cbd_blkdev_exit();
>   	bus_unregister(&cbd_bus_type);
>   	device_unregister(&cbd_root_dev);
>   
> diff --git a/drivers/block/cbd/cbd_queue.c b/drivers/block/cbd/cbd_queue.c
> new file mode 100644
> index 000000000000..6709ac016e18
> --- /dev/null
> +++ b/drivers/block/cbd/cbd_queue.c
> @@ -0,0 +1,621 @@
> +#include "cbd_internal.h"
> +
> +/*
> + * How do blkdev and backend interact through the channel?
> + *         a) On the reader side: if data in this channel may have been modified by the
> + * other party, I need to flush the cache before reading to ensure that I get the latest
> + * data. For example, the blkdev needs to flush the cache before reading compr_head,
> + * because compr_head is updated by the backend handler.
> + *         b) On the writer side: if the written information will be read by the other
> + * party, I need to flush the cache after writing so that the other party can see it.
> + * For example, after the blkdev submits a cbd_se, it updates cmd_head to tell the
> + * handler that a new cbd_se is available. Therefore, after updating cmd_head, I need to
> + * flush the cache so that the backend can see it.
> + *
> + * For the blkdev queue, I am the only one who updates `cmd_head`, `cmd_tail`, and `compr_tail`.
> + * Therefore, I don't need to flush_dcache before reading these data. However, after updating these data,
> + * I need to flush_dcache so that the backend handler can see these updates.
> + *
> + * On the other hand, `compr_head` is updated by the backend handler. So, I need to flush_dcache before
> + * reading `compr_head` to ensure that I can see the updates.
> + *
> + *           ┌───────────┐          ┌─────────────┐
> + *           │  blkdev   │          │   backend   │
> + *           │  queue    │          │   handler   │
> + *           └─────┬─────┘          └──────┬──────┘
> + *                 ▼                       │
> + *        init data and cbd_se             │
> + *                 │                       │
> + *                 ▼                       │
> + *            update cmd_head              │
> + *                 │                       │
> + *                 ▼                       │
> + *            flush_cache                  │
> + *                 │                       ▼
> + *                 │                    flush_cache
> + *                 │                       │
> + *                 │                       ▼
> + *                 │                   handle cmd
> + *                 │                       │
> + *                 │                       ▼
> + *                 │                    fill cbd_ce
> + *                 │                       │
> + *                 │                       ▼
> + *                 │                    flush_cache
> + *                 ▼
> + *            flush_cache
> + *                 │
> + *                 ▼
> + *            complete_req
> + */
> +
> +static inline struct cbd_se *get_submit_entry(struct cbd_queue *cbdq)
> +{
> +	return (struct cbd_se *)(cbdq->channel.cmdr + cbdq->channel_info->cmd_head);
> +}
> +
> +static inline struct cbd_se *get_oldest_se(struct cbd_queue *cbdq)
> +{
> +	if (cbdq->channel_info->cmd_tail == cbdq->channel_info->cmd_head)
> +		return NULL;
> +
> +	return (struct cbd_se *)(cbdq->channel.cmdr + cbdq->channel_info->cmd_tail);
> +}
> +
> +static inline struct cbd_ce *get_complete_entry(struct cbd_queue *cbdq)
> +{
> +	if (cbdq->channel_info->compr_tail == cbdq->channel_info->compr_head)
> +		return NULL;
> +
> +	return (struct cbd_ce *)(cbdq->channel.compr + cbdq->channel_info->compr_tail);
> +}
> +
> +static void cbd_req_init(struct cbd_queue *cbdq, enum cbd_op op, struct request *rq)
> +{
> +	struct cbd_request *cbd_req = blk_mq_rq_to_pdu(rq);
> +
> +	cbd_req->req = rq;
> +	cbd_req->cbdq = cbdq;
> +	cbd_req->op = op;
> +
> +	return;
> +}
> +
> +static bool cbd_req_nodata(struct cbd_request *cbd_req)
> +{
> +	switch (cbd_req->op) {
> +		case CBD_OP_WRITE:
> +		case CBD_OP_READ:
> +			return false;
> +		case CBD_OP_DISCARD:
> +		case CBD_OP_WRITE_ZEROS:
> +		case CBD_OP_FLUSH:
> +			return true;
> +		default:
> +			BUG();
> +	}
> +}
> +
> +static uint32_t cbd_req_segments(struct cbd_request *cbd_req)
> +{
> +	uint32_t segs = 0;
> +	struct bio *bio = cbd_req->req->bio;
> +
> +	if (cbd_req_nodata(cbd_req))
> +		return 0;
> +
> +	while (bio) {
> +		segs += bio_segments(bio);
> +		bio = bio->bi_next;
> +	}
> +
> +	return segs;
> +}
> +
> +static inline size_t cbd_get_cmd_size(struct cbd_request *cbd_req)
> +{
> +	u32 segs = cbd_req_segments(cbd_req);
> +	u32 cmd_size = sizeof(struct cbd_se) + (sizeof(struct iovec) * segs);
> +
> +	return round_up(cmd_size, CBD_OP_ALIGN_SIZE);
> +}
> +
> +static void insert_padding(struct cbd_queue *cbdq, u32 cmd_size)
> +{
> +	struct cbd_se_hdr *header;
> +	u32 pad_len;
> +
> +	if (cbdq->channel_info->cmdr_size - cbdq->channel_info->cmd_head >= cmd_size)
> +		return;
> +
> +	pad_len = cbdq->channel_info->cmdr_size - cbdq->channel_info->cmd_head;
> +	cbd_queue_debug(cbdq, "insert pad:%d\n", pad_len);
> +
> +	header = (struct cbd_se_hdr *)get_submit_entry(cbdq);
> +	memset(header, 0, pad_len);
> +	cbd_se_hdr_set_op(&header->len_op, CBD_OP_PAD);
> +	cbd_se_hdr_set_len(&header->len_op, pad_len);
> +
> +	cbdt_flush_range(cbdq->cbd_blkdev->cbdt, header, sizeof(*header));
> +
> +	CBDC_UPDATE_CMDR_HEAD(cbdq->channel_info->cmd_head, pad_len, cbdq->channel_info->cmdr_size);
> +}
> +
> +static void queue_req_se_init(struct cbd_request *cbd_req)
> +{
> +	struct cbd_se	*se;
> +	struct cbd_se_hdr *header;
> +	u64 offset = (u64)blk_rq_pos(cbd_req->req) << SECTOR_SHIFT;
> +	u64 length = blk_rq_bytes(cbd_req->req);
> +
> +	se = get_submit_entry(cbd_req->cbdq);
> +	memset(se, 0, cbd_get_cmd_size(cbd_req));
> +	header = &se->header;
> +
> +	cbd_se_hdr_set_op(&header->len_op, cbd_req->op);
> +	cbd_se_hdr_set_len(&header->len_op, cbd_get_cmd_size(cbd_req));
> +
> +	se->priv_data = cbd_req->req_tid;
> +	se->offset = offset;
> +	se->len = length;
> +
> +	if (req_op(cbd_req->req) == REQ_OP_READ || req_op(cbd_req->req) == REQ_OP_WRITE) {
> +		se->data_off = cbd_req->cbdq->channel.data_head;
> +		se->data_len = length;
> +	}
> +
> +	cbd_req->se = se;
> +}
> +
> +static bool data_space_enough(struct cbd_queue *cbdq, struct cbd_request *cbd_req)
> +{
> +	u32 space_available;
> +	u32 space_needed;
> +	u32 space_used;
> +	u32 space_max;
> +
> +	space_max = cbdq->channel.data_size - 4096;
> +
> +	if (cbdq->channel.data_head > cbdq->channel.data_tail)
> +		space_used = cbdq->channel.data_head - cbdq->channel.data_tail;
> +	else if (cbdq->channel.data_head < cbdq->channel.data_tail)
> +		space_used = cbdq->channel.data_head + (cbdq->channel.data_size - cbdq->channel.data_tail);
> +	else
> +		space_used = 0;
> +
> +	space_available = space_max - space_used;
> +
> +	space_needed = round_up(cbd_req->data_len, 4096);
> +
> +	if (space_available < space_needed) {
> +		cbd_queue_err(cbdq, "data space is not enough: availaible: %u needed: %u",
> +			      space_available, space_needed);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +static bool submit_ring_space_enough(struct cbd_queue *cbdq, u32 cmd_size)
> +{
> +	u32 space_available;
> +	u32 space_needed;
> +	u32 space_max, space_used;
> +
> +	/* There is a CMDR_RESERVED we dont use to prevent the ring to be used up */
> +	space_max = cbdq->channel_info->cmdr_size - CBDC_CMDR_RESERVED;
> +
> +	if (cbdq->channel_info->cmd_head > cbdq->channel_info->cmd_tail)
> +		space_used = cbdq->channel_info->cmd_head - cbdq->channel_info->cmd_tail;
> +	else if (cbdq->channel_info->cmd_head < cbdq->channel_info->cmd_tail)
> +		space_used = cbdq->channel_info->cmd_head + (cbdq->channel_info->cmdr_size - cbdq->channel_info->cmd_tail);
> +	else
> +		space_used = 0;
> +
> +	space_available = space_max - space_used;
> +
> +	if (cbdq->channel_info->cmdr_size - cbdq->channel_info->cmd_head > cmd_size)
> +		space_needed = cmd_size;
> +	else
> +		space_needed = cmd_size + cbdq->channel_info->cmdr_size - cbdq->channel_info->cmd_head;
> +
> +	if (space_available < space_needed)
> +		return false;
> +
> +	return true;
> +}
> +
> +static void queue_req_data_init(struct cbd_request *cbd_req)
> +{
> +	struct cbd_queue *cbdq = cbd_req->cbdq;
> +	struct bio *bio = cbd_req->req->bio;
> +
> +	if (cbd_req->op == CBD_OP_READ) {
> +		goto advance_data_head;
> +	}
> +
> +	cbdc_copy_from_bio(&cbdq->channel, cbd_req->data_off, cbd_req->data_len, bio);
> +
> +advance_data_head:
> +	cbdq->channel.data_head = round_up(cbdq->channel.data_head + cbd_req->data_len, PAGE_SIZE);
> +	cbdq->channel.data_head %= cbdq->channel.data_size;
> +
> +	return;
> +}
> +
> +static void complete_inflight_req(struct cbd_queue *cbdq, struct cbd_request *cbd_req, int ret);
> +static void cbd_queue_fn(struct cbd_request *cbd_req)
> +{
> +	struct cbd_queue *cbdq = cbd_req->cbdq;
> +	int ret = 0;
> +	size_t command_size;
> +
> +	spin_lock(&cbdq->inflight_reqs_lock);
> +	list_add_tail(&cbd_req->inflight_reqs_node, &cbdq->inflight_reqs);
> +	spin_unlock(&cbdq->inflight_reqs_lock);
> +
> +	command_size = cbd_get_cmd_size(cbd_req);
> +
> +	spin_lock(&cbdq->channel.cmdr_lock);
> +	if (req_op(cbd_req->req) == REQ_OP_WRITE || req_op(cbd_req->req) == REQ_OP_READ) {
> +		cbd_req->data_off = cbdq->channel.data_head;
> +		cbd_req->data_len = blk_rq_bytes(cbd_req->req);
> +	} else {
> +		cbd_req->data_off = -1;
> +		cbd_req->data_len = 0;
> +	}
> +
> +	if (!submit_ring_space_enough(cbdq, command_size) ||
> +			!data_space_enough(cbdq, cbd_req)) {
> +		spin_unlock(&cbdq->channel.cmdr_lock);
> +
> +		/* remove request from inflight_reqs */
> +		spin_lock(&cbdq->inflight_reqs_lock);
> +		list_del_init(&cbd_req->inflight_reqs_node);
> +		spin_unlock(&cbdq->inflight_reqs_lock);
> +
> +		cbd_blk_debug(cbdq->cbd_blkdev, "transport space is not enough");
> +		ret = -ENOMEM;
> +		goto end_request;
> +	}
> +
> +	insert_padding(cbdq, command_size);
> +
> +	cbd_req->req_tid = ++cbdq->req_tid;
> +	queue_req_se_init(cbd_req);
> +	cbdt_flush_range(cbdq->cbd_blkdev->cbdt, cbd_req->se, sizeof(struct cbd_se));
> +
> +	if (!cbd_req_nodata(cbd_req)) {
> +		queue_req_data_init(cbd_req);
> +	}
> +
> +	queue_delayed_work(cbdq->task_wq, &cbdq->complete_work, 0);
> +
> +	CBDC_UPDATE_CMDR_HEAD(cbdq->channel_info->cmd_head,
> +			cbd_get_cmd_size(cbd_req),
> +			cbdq->channel_info->cmdr_size);
> +	cbdc_flush_ctrl(&cbdq->channel);
> +	spin_unlock(&cbdq->channel.cmdr_lock);
> +
> +	return;
> +
> +end_request:
> +	if (ret == -ENOMEM || ret == -EBUSY)
> +		blk_mq_requeue_request(cbd_req->req, true);
> +	else
> +		blk_mq_end_request(cbd_req->req, errno_to_blk_status(ret));
> +
> +	return;
> +}
> +
> +static void cbd_req_release(struct cbd_request *cbd_req)
> +{
> +	return;
> +}
> +
> +static void advance_cmd_ring(struct cbd_queue *cbdq)
> +{
> +       struct cbd_se *se;
> +again:
> +       se = get_oldest_se(cbdq);
> +       if (!se)
> +               goto out;
> +
> +	if (cbd_se_hdr_flags_test(se, CBD_SE_HDR_DONE)) {
> +		CBDC_UPDATE_CMDR_TAIL(cbdq->channel_info->cmd_tail,
> +				cbd_se_hdr_get_len(se->header.len_op),
> +				cbdq->channel_info->cmdr_size);
> +		cbdc_flush_ctrl(&cbdq->channel);
> +		goto again;
> +       }
> +out:
> +       return;
> +}
> +
> +static bool __advance_data_tail(struct cbd_queue *cbdq, u32 data_off, u32 data_len)
> +{
> +	if (data_off == cbdq->channel.data_tail) {
> +		cbdq->released_extents[data_off / 4096] = 0;
> +		cbdq->channel.data_tail += data_len;
> +		if (cbdq->channel.data_tail >= cbdq->channel.data_size) {
> +			cbdq->channel.data_tail %= cbdq->channel.data_size;
> +		}
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +static void advance_data_tail(struct cbd_queue *cbdq, u32 data_off, u32 data_len)
> +{
> +	cbdq->released_extents[data_off / 4096] = data_len;
> +
> +	while (__advance_data_tail(cbdq, data_off, data_len)) {
> +		data_off += data_len;
> +		data_len = cbdq->released_extents[data_off / 4096];
> +		if (!data_len) {
> +			break;
> +		}
> +	}
> +}
> +
> +static inline void complete_inflight_req(struct cbd_queue *cbdq, struct cbd_request *cbd_req, int ret)
> +{
> +	u32 data_off, data_len;
> +	bool advance_data = false;
> +
> +	spin_lock(&cbdq->inflight_reqs_lock);
> +	list_del_init(&cbd_req->inflight_reqs_node);
> +	spin_unlock(&cbdq->inflight_reqs_lock);
> +
> +	cbd_se_hdr_flags_set(cbd_req->se, CBD_SE_HDR_DONE);
> +	data_off = cbd_req->data_off;
> +	data_len = cbd_req->data_len;
> +	advance_data = (!cbd_req_nodata(cbd_req));
> +
> +	blk_mq_end_request(cbd_req->req, errno_to_blk_status(ret));
> +
> +	cbd_req_release(cbd_req);
> +
> +	spin_lock(&cbdq->channel.cmdr_lock);
> +	advance_cmd_ring(cbdq);
> +	if (advance_data)
> +		advance_data_tail(cbdq, data_off, round_up(data_len, PAGE_SIZE));
> +	spin_unlock(&cbdq->channel.cmdr_lock);
> +}
> +
> +static struct cbd_request *fetch_inflight_req(struct cbd_queue *cbdq, u64 req_tid)
> +{
> +	struct cbd_request *req;
> +	bool found = false;
> +
> +	list_for_each_entry(req, &cbdq->inflight_reqs, inflight_reqs_node) {
> +		if (req->req_tid == req_tid) {
> +			list_del_init(&req->inflight_reqs_node);
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	if (found)
> +		return req;
> +
> +	return NULL;
> +}
> +
> +static void copy_data_from_cbdteq(struct cbd_request *cbd_req)
> +{
> +	struct bio *bio = cbd_req->req->bio;
> +	struct cbd_queue *cbdq = cbd_req->cbdq;
> +
> +	cbdc_copy_to_bio(&cbdq->channel, cbd_req->data_off, cbd_req->data_len, bio);
> +
> +	return;
> +}
> +
> +static void complete_work_fn(struct work_struct *work)
> +{
> +	struct cbd_queue *cbdq = container_of(work, struct cbd_queue, complete_work.work);
> +	struct cbd_ce *ce;
> +	struct cbd_request *cbd_req;
> +
> +again:
> +	/* compr_head would be updated by backend handler */
> +	cbdc_flush_ctrl(&cbdq->channel);
> +
> +	spin_lock(&cbdq->channel.compr_lock);
> +	ce = get_complete_entry(cbdq);
> +	if (!ce) {
> +		spin_unlock(&cbdq->channel.compr_lock);
> +		if (cbdwc_need_retry(&cbdq->complete_worker_cfg)) {
> +			goto again;
> +		}
> +
> +		spin_lock(&cbdq->inflight_reqs_lock);
> +		if (list_empty(&cbdq->inflight_reqs)) {
> +			spin_unlock(&cbdq->inflight_reqs_lock);
> +			cbdwc_init(&cbdq->complete_worker_cfg);
> +			return;
> +		}
> +		spin_unlock(&cbdq->inflight_reqs_lock);
> +
> +		cbdwc_miss(&cbdq->complete_worker_cfg);
> +
> +		queue_delayed_work(cbdq->task_wq, &cbdq->complete_work, 0);
> +		return;
> +	}
> +	cbdwc_hit(&cbdq->complete_worker_cfg);
> +	CBDC_UPDATE_COMPR_TAIL(cbdq->channel_info->compr_tail,
> +			       sizeof(struct cbd_ce),
> +			       cbdq->channel_info->compr_size);
> +	cbdc_flush_ctrl(&cbdq->channel);
> +	spin_unlock(&cbdq->channel.compr_lock);
> +
> +	spin_lock(&cbdq->inflight_reqs_lock);
> +	/* flush to ensure the content of ce is uptodate */
> +	cbdt_flush_range(cbdq->cbd_blkdev->cbdt, ce, sizeof(*ce));
> +	cbd_req = fetch_inflight_req(cbdq, ce->priv_data);
> +	spin_unlock(&cbdq->inflight_reqs_lock);
> +	if (!cbd_req) {
> +		goto again;
> +	}
> +
> +	if (req_op(cbd_req->req) == REQ_OP_READ) {
> +		spin_lock(&cbdq->channel.cmdr_lock);
> +		copy_data_from_cbdteq(cbd_req);
> +		spin_unlock(&cbdq->channel.cmdr_lock);
> +	}
> +
> +	complete_inflight_req(cbdq, cbd_req, ce->result);
> +
> +	goto again;
> +}
> +
> +static blk_status_t cbd_queue_rq(struct blk_mq_hw_ctx *hctx,
> +		const struct blk_mq_queue_data *bd)
> +{
> +	struct request *req = bd->rq;
> +	struct cbd_queue *cbdq = hctx->driver_data;
> +	struct cbd_request *cbd_req = blk_mq_rq_to_pdu(bd->rq);
> +
> +	memset(cbd_req, 0, sizeof(struct cbd_request));
> +	INIT_LIST_HEAD(&cbd_req->inflight_reqs_node);
> +
> +	blk_mq_start_request(bd->rq);
> +
> +	switch (req_op(bd->rq)) {
> +	case REQ_OP_FLUSH:
> +		cbd_req_init(cbdq, CBD_OP_FLUSH, req);
> +		break;
> +	case REQ_OP_DISCARD:
> +		cbd_req_init(cbdq, CBD_OP_DISCARD, req);
> +		break;
> +	case REQ_OP_WRITE_ZEROES:
> +		cbd_req_init(cbdq, CBD_OP_WRITE_ZEROS, req);
> +		break;
> +	case REQ_OP_WRITE:
> +		cbd_req_init(cbdq, CBD_OP_WRITE, req);
> +		break;
> +	case REQ_OP_READ:
> +		cbd_req_init(cbdq, CBD_OP_READ, req);
> +		break;
> +	default:
> +		return BLK_STS_IOERR;
> +	}
> +
> +	cbd_queue_fn(cbd_req);
> +
> +	return BLK_STS_OK;
> +}
> +
> +static int cbd_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data,
> +			unsigned int hctx_idx)
> +{
> +	struct cbd_blkdev *cbd_blkdev = driver_data;
> +	struct cbd_queue *cbdq;
> +
> +	cbdq = &cbd_blkdev->queues[hctx_idx];
> +	hctx->driver_data = cbdq;
> +
> +	return 0;
> +}
> +
> +const struct blk_mq_ops cbd_mq_ops = {
> +	.queue_rq	= cbd_queue_rq,
> +	.init_hctx	= cbd_init_hctx,
> +};
> +
> +static int cbd_queue_channel_init(struct cbd_queue *cbdq, u32 channel_id)
> +{
> +	struct cbd_blkdev *cbd_blkdev = cbdq->cbd_blkdev;
> +	struct cbd_transport *cbdt = cbd_blkdev->cbdt;
> +
> +	cbdq->channel_id = channel_id;
> +	cbd_channel_init(&cbdq->channel, cbdt, channel_id);
> +	cbdq->channel_info = cbdq->channel.channel_info;
> +
> +	cbdq->channel.data_head = cbdq->channel.data_tail = 0;
> +
> +	/* Initialise the channel_info of the ring buffer */
> +	cbdq->channel_info->cmdr_off = CBDC_CMDR_OFF;
> +	cbdq->channel_info->cmdr_size = CBDC_CMDR_SIZE;
> +	cbdq->channel_info->compr_off = CBDC_COMPR_OFF;
> +	cbdq->channel_info->compr_size = CBDC_COMPR_SIZE;
> +
> +	cbdq->channel_info->backend_id = cbd_blkdev->backend_id;
> +	cbdq->channel_info->blkdev_id = cbd_blkdev->blkdev_id;
> +	cbdq->channel_info->blkdev_state = cbdc_blkdev_state_running;
> +	cbdq->channel_info->state = cbd_channel_state_running;
> +
> +	cbdc_flush_ctrl(&cbdq->channel);
> +
> +	return 0;
> +}
> +
> +int cbd_queue_start(struct cbd_queue *cbdq)
> +{
> +	struct cbd_transport *cbdt = cbdq->cbd_blkdev->cbdt;
> +	u32 channel_id;
> +	int ret;
> +
> +	ret = cbdt_get_empty_channel_id(cbdt, &channel_id);
> +	if (ret < 0) {
> +		cbdt_err(cbdt, "failed find available channel_id.\n");
> +		goto err;
> +	}
> +
> +	ret = cbd_queue_channel_init(cbdq, channel_id);
> +	if (ret) {
> +		cbd_queue_err(cbdq, "failed to init dev channel_info: %d.", ret);
> +		goto err;
> +	}
> +
> +	INIT_LIST_HEAD(&cbdq->inflight_reqs);
> +	spin_lock_init(&cbdq->inflight_reqs_lock);
> +	cbdq->req_tid = 0;
> +	INIT_DELAYED_WORK(&cbdq->complete_work, complete_work_fn);
> +	cbdwc_init(&cbdq->complete_worker_cfg);
> +
> +	cbdq->released_extents = kmalloc(sizeof(u32) * (CBDC_DATA_SIZE >> PAGE_SHIFT), GFP_KERNEL);

Quick fixup: this should be kzalloc(). The fix patch is available on the
cbd branch of the repo: https://github.com/DataTravelGuide/linux.git


     cbd: fixup: initialize cbdq->released_extents with zeros

     We have to initialize cbdq->released_extents with zeros, which means
     there are no released extents. Otherwise, advance_data_tail() can be
     confused by uninitialized values, and I/O would hang.

     Signed-off-by: Dongsheng Yang <dongsheng.yang.linux@gmail.com>

diff --git a/drivers/block/cbd/cbd_queue.c b/drivers/block/cbd/cbd_queue.c
index 6709ac016e18..ebde191eb907 100644
--- a/drivers/block/cbd/cbd_queue.c
+++ b/drivers/block/cbd/cbd_queue.c
@@ -576,7 +576,7 @@ int cbd_queue_start(struct cbd_queue *cbdq)
         INIT_DELAYED_WORK(&cbdq->complete_work, complete_work_fn);
         cbdwc_init(&cbdq->complete_worker_cfg);

-       cbdq->released_extents = kmalloc(sizeof(u32) * (CBDC_DATA_SIZE >> PAGE_SHIFT), GFP_KERNEL);
+       cbdq->released_extents = kzalloc(sizeof(u32) * (CBDC_DATA_SIZE >> PAGE_SHIFT), GFP_KERNEL);
         if (!cbdq->released_extents) {
                 ret = -ENOMEM;
                 goto err;
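
To illustrate why the array has to start out zeroed, here is a minimal
userspace sketch of the tail-advance idea. It is not the driver code: the
names data_ring, ring_init() and ring_release() are hypothetical, calloc()
stands in for kzalloc(), and the sketch wraps the offset together with the
tail, which is a simplification. A zero entry means "nothing released at
this extent", so the loop stops at the first extent that is still in use.

```c
#include <stdint.h>
#include <stdlib.h>

#define EXTENT_SIZE 4096u	/* assumed granularity, matching the 4096 used in the patch */

struct data_ring {
	uint32_t size;		/* total bytes in the data area */
	uint32_t tail;		/* offset of the oldest extent still in use */
	uint32_t *released;	/* released[i]: length released at extent i, 0 = none */
};

/* Returns 0 on success, -1 if the bookkeeping array cannot be allocated. */
static int ring_init(struct data_ring *r, uint32_t size)
{
	/* calloc() zero-fills the array, as kzalloc() would in the kernel. */
	r->released = calloc(size / EXTENT_SIZE, sizeof(*r->released));
	if (!r->released)
		return -1;

	r->size = size;
	r->tail = 0;
	return 0;
}

/* Record that [off, off + len) is free, then advance the tail as far as possible. */
static void ring_release(struct data_ring *r, uint32_t off, uint32_t len)
{
	r->released[off / EXTENT_SIZE] = len;

	/* Consume contiguous released extents, starting at the tail. */
	while (off == r->tail && len) {
		r->released[off / EXTENT_SIZE] = 0;
		r->tail = (r->tail + len) % r->size;
		off = r->tail;
		len = r->released[off / EXTENT_SIZE];
	}
}
```

With an uninitialized array, the last read in the loop could return a
non-zero garbage length for an extent that was never released, and the
tail would move past data that is still in flight.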
> +	if (!cbdq->released_extents) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +
> +	cbdq->task_wq = alloc_workqueue("cbd%d-queue%u",  WQ_UNBOUND | WQ_MEM_RECLAIM,
> +					0, cbdq->cbd_blkdev->mapped_id, cbdq->index);
> +	if (!cbdq->task_wq) {
> +		ret = -ENOMEM;
> +		goto released_extents_free;
> +	}
> +
> +	queue_delayed_work(cbdq->task_wq, &cbdq->complete_work, 0);
> +
> +	atomic_set(&cbdq->state, cbd_queue_state_running);
> +
> +	return 0;
> +
> +released_extents_free:
> +	kfree(cbdq->released_extents);
> +err:
> +	return ret;
> +}
> +
> +void cbd_queue_stop(struct cbd_queue *cbdq)
> +{
> +	if (atomic_cmpxchg(&cbdq->state,
> +			   cbd_queue_state_running,
> +			   cbd_queue_state_none) != cbd_queue_state_running)
> +		return;
> +
> +	cancel_delayed_work_sync(&cbdq->complete_work);
> +	drain_workqueue(cbdq->task_wq);
> +	destroy_workqueue(cbdq->task_wq);
> +
> +	kfree(cbdq->released_extents);
> +	cbdq->channel_info->blkdev_state = cbdc_blkdev_state_none;
> +
> +	cbdc_flush_ctrl(&cbdq->channel);
> +
> +	return;
> +}
> diff --git a/drivers/block/cbd/cbd_transport.c b/drivers/block/cbd/cbd_transport.c
> index 4dd9bf1b5fd5..75b9d34218fc 100644
> --- a/drivers/block/cbd/cbd_transport.c
> +++ b/drivers/block/cbd/cbd_transport.c
> @@ -315,8 +315,19 @@ static ssize_t cbd_adm_store(struct device *dev,
>   	case CBDT_ADM_OP_B_CLEAR:
>   		break;
>   	case CBDT_ADM_OP_DEV_START:
> +		if (opts.blkdev.queues > CBD_QUEUES_MAX) {
> +			cbdt_err(cbdt, "invalid queues = %u, larger than max %u\n",
> +					opts.blkdev.queues, CBD_QUEUES_MAX);
> +			return -EINVAL;
> +		}
> +		ret = cbd_blkdev_start(cbdt, opts.backend_id, opts.blkdev.queues);
> +		if (ret < 0)
> +			return ret;
>   		break;
>   	case CBDT_ADM_OP_DEV_STOP:
> +		ret = cbd_blkdev_stop(cbdt, opts.blkdev.devid);
> +		if (ret < 0)
> +			return ret;
>   		break;
>   	default:
>   		pr_err("invalid op: %d\n", opts.op);
> 


* Re: [PATCH 1/7] block: Init for CBD(CXL Block Device)
  2024-04-22  7:16 ` [PATCH 1/7] block: Init for CBD(CXL " Dongsheng Yang
  2024-04-22 18:39   ` Randy Dunlap
@ 2024-04-24  3:58   ` Chaitanya Kulkarni
  2024-04-24  8:36     ` Dongsheng Yang
  1 sibling, 1 reply; 52+ messages in thread
From: Chaitanya Kulkarni @ 2024-04-24  3:58 UTC (permalink / raw)
  To: Dongsheng Yang, dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

> +/*
> + * As shared memory is supported in CXL3.0 spec, we can transfer data via CXL shared memory.
> + * CBD means CXL block device, it use CXL shared memory to transport command and data to
> + * access block device in different host, as shown below:
> + *
> + *    ┌───────────────────────────────┐                               ┌────────────────────────────────────┐
> + *    │          node-1               │                               │              node-2                │
> + *    ├───────────────────────────────┤                               ├────────────────────────────────────┤
> + *    │                               │                               │                                    │
> + *    │                       ┌───────┤                               ├─────────┐                          │
> + *    │                       │ cbd0  │                               │ backend0├──────────────────┐       │
> + *    │                       ├───────┤                               ├─────────┤                  │       │
> + *    │                       │ pmem0 │                               │ pmem0   │                  ▼       │
> + *    │               ┌───────┴───────┤                               ├─────────┴────┐     ┌───────────────┤
> + *    │               │    cxl driver │                               │ cxl driver   │     │  /dev/sda     │
> + *    └───────────────┴────────┬──────┘                               └─────┬────────┴─────┴───────────────┘
> + *                             │                                            │
> + *                             │                                            │
> + *                             │        CXL                         CXL     │
> + *                             └────────────────┐               ┌───────────┘
> + *                                              │               │
> + *                                              │               │
> + *                                              │               │
> + *                                          ┌───┴───────────────┴─────┐
> + *                                          │   shared memory device  │
> + *                                          └─────────────────────────┘
> + *
> + * any read/write to cbd0 on node-1 will be transferred to node-2 /dev/sda. It works similar with
> + * nbd (network block device), but it transfer data via CXL shared memory rather than network.
> + */
> +
> +/* printk */

I don't think you need the above comment.

> +#define cbd_err(fmt, ...)							\
> +	pr_err("cbd: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__)

You can #define pr_fmt() and remove the "cbd: " prefix from each pr_xxx() above?
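
A rough sketch of how the pr_fmt() approach composes the prefix, written
as plain userspace C so it can be compiled on its own. The pr_to_buf()
macro and report_bad_queues() are hypothetical stand-ins for illustration;
in the kernel, pr_fmt() has to be defined before the printk headers are
included, and pr_err()/pr_info()/pr_debug() pick it up automatically.

```c
#include <stdio.h>

/* One definition supplies the prefix for every message in the file. */
#define pr_fmt(fmt) "cbd: %s:%d " fmt, __func__, __LINE__

/*
 * Userspace stand-in for pr_err(): formats into a buffer instead of the
 * kernel log. ##__VA_ARGS__ is the same GNU extension the patch uses.
 */
#define pr_to_buf(buf, len, fmt, ...) \
	snprintf(buf, len, pr_fmt(fmt), ##__VA_ARGS__)

/* Hypothetical caller: the message itself no longer carries the prefix. */
static int report_bad_queues(char *buf, size_t len, unsigned int queues)
{
	return pr_to_buf(buf, len, "invalid queues = %u\n", queues);
}
```

The per-object wrappers (cbdt_err() and friends) could then call pr_err()
directly and only add their own "cbd_transport%u: " style prefix.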

> +#define cbd_info(fmt, ...)							\
> +	pr_info("cbd: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__)
> +#define cbd_debug(fmt, ...)							\
> +	pr_debug("cbd: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__)
> +
> +#define cbdt_err(transport, fmt, ...)						\
> +	cbd_err("cbd_transport%u: " fmt,					\
> +		 transport->id, ##__VA_ARGS__)
> +#define cbdt_info(transport, fmt, ...)						\
> +	cbd_info("cbd_transport%u: " fmt,					\
> +		 transport->id, ##__VA_ARGS__)
> +#define cbdt_debug(transport, fmt, ...)						\
> +	cbd_debug("cbd_transport%u: " fmt,					\
> +		 transport->id, ##__VA_ARGS__)
> +
> +#define cbd_backend_err(backend, fmt, ...)					\
> +	cbdt_err(backend->cbdt, "backend%d: " fmt,				\
> +		 backend->backend_id, ##__VA_ARGS__)
> +#define cbd_backend_info(backend, fmt, ...)					\
> +	cbdt_info(backend->cbdt, "backend%d: " fmt,				\
> +		 backend->backend_id, ##__VA_ARGS__)
> +#define cbd_backend_debug(backend, fmt, ...)					\
> +	cbdt_debug(backend->cbdt, "backend%d: " fmt,				\
> +		 backend->backend_id, ##__VA_ARGS__)
> +
> +#define cbd_handler_err(handler, fmt, ...)					\
> +	cbd_backend_err(handler->cbdb, "handler%d: " fmt,			\
> +		 handler->channel.channel_id, ##__VA_ARGS__)
> +#define cbd_handler_info(handler, fmt, ...)					\
> +	cbd_backend_info(handler->cbdb, "handler%d: " fmt,			\
> +		 handler->channel.channel_id, ##__VA_ARGS__)
> +#define cbd_handler_debug(handler, fmt, ...)					\
> +	cbd_backend_debug(handler->cbdb, "handler%d: " fmt,			\
> +		 handler->channel.channel_id, ##__VA_ARGS__)
> +
> +#define cbd_blk_err(dev, fmt, ...)						\
> +	cbdt_err(dev->cbdt, "cbd%d: " fmt,					\
> +		 dev->mapped_id, ##__VA_ARGS__)
> +#define cbd_blk_info(dev, fmt, ...)						\
> +	cbdt_info(dev->cbdt, "cbd%d: " fmt,					\
> +		 dev->mapped_id, ##__VA_ARGS__)
> +#define cbd_blk_debug(dev, fmt, ...)						\
> +	cbdt_debug(dev->cbdt, "cbd%d: " fmt,					\
> +		 dev->mapped_id, ##__VA_ARGS__)
> +
> +#define cbd_queue_err(queue, fmt, ...)						\
> +	cbd_blk_err(queue->cbd_blkdev, "queue-%d: " fmt,			\
> +		     queue->index, ##__VA_ARGS__)
> +#define cbd_queue_info(queue, fmt, ...)						\
> +	cbd_blk_info(queue->cbd_blkdev, "queue-%d: " fmt,			\
> +		     queue->index, ##__VA_ARGS__)
> +#define cbd_queue_debug(queue, fmt, ...)					\
> +	cbd_blk_debug(queue->cbd_blkdev, "queue-%d: " fmt,			\
> +		     queue->index, ##__VA_ARGS__)
> +
> +#define cbd_channel_err(channel, fmt, ...)					\
> +	cbdt_err(channel->cbdt, "channel%d: " fmt,				\
> +		 channel->channel_id, ##__VA_ARGS__)
> +#define cbd_channel_info(channel, fmt, ...)					\
> +	cbdt_info(channel->cbdt, "channel%d: " fmt,				\
> +		 channel->channel_id, ##__VA_ARGS__)
> +#define cbd_channel_debug(channel, fmt, ...)					\
> +	cbdt_debug(channel->cbdt, "channel%d: " fmt,				\
> +		 channel->channel_id, ##__VA_ARGS__)
> +

[...]

> +
> +struct cbd_se {
> +	struct cbd_se_hdr	header;
> +	u64			priv_data;	// pointer to cbd_request

use /* */ instead of //


-ck




* Re: [PATCH 2/7] cbd: introduce cbd_transport
  2024-04-22  7:16 ` [PATCH 2/7] cbd: introduce cbd_transport Dongsheng Yang
@ 2024-04-24  4:08   ` Chaitanya Kulkarni
  2024-04-24  8:43     ` Dongsheng Yang
  0 siblings, 1 reply; 52+ messages in thread
From: Chaitanya Kulkarni @ 2024-04-24  4:08 UTC (permalink / raw)
  To: Dongsheng Yang, dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

> +static ssize_t cbd_myhost_show(struct device *dev,
> +			       struct device_attribute *attr,
> +			       char *buf)
> +{
> +	struct cbd_transport *cbdt;
> +	struct cbd_host *host;
> +
> +	cbdt = container_of(dev, struct cbd_transport, device);
> +
> +	host = cbdt->host;
> +	if (!host)
> +		return 0;
> +
> +	return sprintf(buf, "%d\n", host->host_id);

snprintf() ?

> +}
> +
> +static DEVICE_ATTR(my_host_id, 0400, cbd_myhost_show, NULL);
> +
> +enum {

[...]

> +
> +static ssize_t cbd_adm_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *ubuf,
> +				 size_t size)
> +{
> +	int ret;
> +	char *buf;
> +	struct cbd_adm_options opts = { 0 };
> +	struct cbd_transport *cbdt;
> +

Use reverse Christmas tree order for the declarations, to match the rest of your code?

> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	cbdt = container_of(dev, struct cbd_transport, device);
> +
> +	buf = kmemdup(ubuf, size + 1, GFP_KERNEL);
> +	if (IS_ERR(buf)) {
> +		pr_err("failed to dup buf for adm option: %d", (int)PTR_ERR(buf));
> +		return PTR_ERR(buf);
> +	}
> +	buf[size] = '\0';
> +	ret = parse_adm_options(cbdt, buf, &opts);
> +	if (ret < 0) {
> +		kfree(buf);
> +		return ret;
> +	}
> +	kfree(buf);
> +

The standard pattern is to use goto out and have only one kfree()
at the end of the function.
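
For reference, a minimal sketch of the single-exit pattern in userspace C.
The names adm_store() and parse_options() are hypothetical stand-ins for
cbd_adm_store() and parse_adm_options(), and malloc()/free() stand in for
kmemdup()/kfree(); only the control flow is the point here.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for parse_adm_options(): rejects empty input. */
static int parse_options(const char *buf)
{
	return buf[0] == '\0' ? -EINVAL : 0;
}

/* Returns the number of bytes consumed, or a negative errno. */
static long adm_store(const char *ubuf, size_t size)
{
	char *buf;
	long ret;

	buf = malloc(size + 1);
	if (!buf)
		return -ENOMEM;
	memcpy(buf, ubuf, size);
	buf[size] = '\0';

	ret = parse_options(buf);
	if (ret < 0)
		goto out;

	/* ... dispatch on the parsed operation here ... */

	ret = (long)size;
out:
	free(buf);	/* the only free, reached on every path */
	return ret;
}
```

Every later error path can then jump to the same label, so adding a new
failure case cannot leak the buffer.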

> +	switch (opts.op) {
> +	case CBDT_ADM_OP_B_START:
> +		break;
> +	case CBDT_ADM_OP_B_STOP:
> +		break;
> +	case CBDT_ADM_OP_B_CLEAR:
> +		break;
> +	case CBDT_ADM_OP_DEV_START:
> +		break;
> +	case CBDT_ADM_OP_DEV_STOP:
> +		break;
> +	default:
> +		pr_err("invalid op: %d\n", opts.op);
> +		return -EINVAL;
> +	}
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	return size;
> +}
> +

[...]

> +static struct cbd_transport *cbdt_alloc(void)
> +{
> +	struct cbd_transport *cbdt;
> +	int ret;
> +
> +	cbdt = kzalloc(sizeof(struct cbd_transport), GFP_KERNEL);
> +	if (!cbdt) {
> +		return NULL;
> +	}

No braces are needed for a single statement in an if; this applies to the
rest of the code as well.

-ck




* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-22  7:15 [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dongsheng Yang
                   ` (6 preceding siblings ...)
  2024-04-22 22:42 ` [PATCH 6/7] cbd: introduce cbd_blkdev Dongsheng Yang
@ 2024-04-24  4:29 ` Dan Williams
  2024-04-24  6:33   ` Dongsheng Yang
  7 siblings, 1 reply; 52+ messages in thread
From: Dan Williams @ 2024-04-24  4:29 UTC (permalink / raw)
  To: Dongsheng Yang, dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

Dongsheng Yang wrote:
> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> 
> Hi all,
> 	This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
> 	https://github.com/DataTravelGuide/linux
> 
[..]
> (4) dax is not supported yet:
> 	same with famfs, dax device is not supported here, because dax device does not support
> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.

I am glad that famfs is mentioned here, it demonstrates you know about
it. However, unfortunately this cover letter does not offer any analysis
of *why* the Linux project should consider this additional approach to
the inter-host shared-memory enabling problem.

To be clear I am neutral at best on some of the initiatives around CXL
memory sharing vs pooling, but famfs at least jettisons block-devices
and gets closer to a purpose-built memory semantic.

So my primary question is why would Linux need both famfs and cbd? I am
sure famfs would love feedback and help vs developing competing efforts.


* Re: [PATCH 5/7] cbd: introuce cbd_backend
  2024-04-22  7:16 ` [PATCH 5/7] cbd: introuce cbd_backend Dongsheng Yang
@ 2024-04-24  5:03   ` Chaitanya Kulkarni
  2024-04-24  8:36     ` Dongsheng Yang
  2024-04-25  5:46   ` [EXTERNAL] " Bharat Bhushan
  1 sibling, 1 reply; 52+ messages in thread
From: Chaitanya Kulkarni @ 2024-04-24  5:03 UTC (permalink / raw)
  To: Dongsheng Yang, dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

> +
> +struct cbd_backend_io {
> +	struct cbd_se		*se;
> +	u64			off;
> +	u32			len;
> +	struct bio		*bio;
> +	struct cbd_handler	*handler;
> +};
> +

Why not use inline bvecs and avoid the bio page allocation for reasonable
sizes, instead of performing the allocation for each request?

> +static inline void complete_cmd(struct cbd_handler *handler, u64 priv_data, int ret)
> +{
> +	struct cbd_ce *ce = get_compr_head(handler);
> +
> +	memset(ce, 0, sizeof(*ce));
> +	ce->priv_data = priv_data;
> +	ce->result = ret;
> +	CBDC_UPDATE_COMPR_HEAD(handler->channel_info->compr_head,
> +			       sizeof(struct cbd_ce),
> +			       handler->channel_info->compr_size);
> +
> +	cbdc_flush_ctrl(&handler->channel);
> +
> +	return;
> +}
> +
> +static void backend_bio_end(struct bio *bio)
> +{
> +	struct cbd_backend_io *backend_io = bio->bi_private;
> +	struct cbd_se *se = backend_io->se;
> +	struct cbd_handler *handler = backend_io->handler;
> +
> +	if (bio->bi_status == 0 &&
> +	    cbd_se_hdr_get_op(se->header.len_op) == CBD_OP_READ) {
> +		cbdc_copy_from_bio(&handler->channel, se->data_off, se->data_len, bio);
> +	}
> +
> +	complete_cmd(handler, se->priv_data, bio->bi_status);
> +
> +	bio_free_pages(bio);
> +	bio_put(bio);
> +	kfree(backend_io);
> +}
> +
> +static int cbd_bio_alloc_pages(struct bio *bio, size_t size, gfp_t gfp_mask)
> +{
> +	int ret = 0;
> +
> +        while (size) {
> +                struct page *page = alloc_pages(gfp_mask, 0);
> +                unsigned len = min_t(size_t, PAGE_SIZE, size);

The alloc_pages() call should be close to the check below.

> +
> +                if (!page) {
> +			pr_err("failed to alloc page");
> +			ret = -ENOMEM;
> +			break;
> +		}
> +
> +		ret = bio_add_page(bio, page, len, 0);
> +                if (unlikely(ret != len)) {
> +                        __free_page(page);
> +			pr_err("failed to add page");
> +                        break;
> +                }
> +
> +                size -= len;
> +        }
> +
> +	if (size)
> +		bio_free_pages(bio);
> +	else
> +		ret = 0;
> +
> +        return ret;
> +}

Code formatting seems to be broken in the above function, please check.

> +
> +static struct cbd_backend_io *backend_prepare_io(struct cbd_handler *handler, struct cbd_se *se, blk_opf_t opf)
> +{
> +	struct cbd_backend_io *backend_io;
> +	struct cbd_backend *cbdb = handler->cbdb;
> +
> +	backend_io = kzalloc(sizeof(struct cbd_backend_io), GFP_KERNEL);

Will the above allocation always succeed, or should there be a NULL check here?

> +	backend_io->se = se;
> +
> +	backend_io->handler = handler;
> +	backend_io->bio = bio_alloc_bioset(cbdb->bdev, roundup(se->len, 4096) / 4096, opf, GFP_KERNEL, &handler->bioset);
> +
> +	backend_io->bio->bi_iter.bi_sector = se->offset >> SECTOR_SHIFT;
> +	backend_io->bio->bi_iter.bi_size = 0;
> +	backend_io->bio->bi_private = backend_io;
> +	backend_io->bio->bi_end_io = backend_bio_end;
> +
> +	return backend_io;
> +}
> +
> +static int handle_backend_cmd(struct cbd_handler *handler, struct cbd_se *se)
> +{
> +	struct cbd_backend *cbdb = handler->cbdb;
> +	u32 len = se->len;
> +	struct cbd_backend_io *backend_io = NULL;
> +	int ret;
> +
> +	if (cbd_se_hdr_flags_test(se, CBD_SE_HDR_DONE)) {
> +		return 0 ;
> +	}
> +
> +	switch (cbd_se_hdr_get_op(se->header.len_op)) {
> +	case CBD_OP_PAD:
> +		cbd_se_hdr_flags_set(se, CBD_SE_HDR_DONE);
> +		return 0;
> +	case CBD_OP_READ:
> +		backend_io = backend_prepare_io(handler, se, REQ_OP_READ);
> +		break;
> +	case CBD_OP_WRITE:
> +		backend_io = backend_prepare_io(handler, se, REQ_OP_WRITE);
> +		break;
> +	case CBD_OP_DISCARD:
> +		ret = blkdev_issue_discard(cbdb->bdev, se->offset >> SECTOR_SHIFT,
> +				se->len, GFP_NOIO);

Any specific reason not to use GFP_KERNEL?

> +		goto complete_cmd;
> +	case CBD_OP_WRITE_ZEROS:
> +		ret = blkdev_issue_zeroout(cbdb->bdev, se->offset >> SECTOR_SHIFT,
> +				se->len, GFP_NOIO, 0);

Any specific reason not to use GFP_KERNEL?

> +		goto complete_cmd;
> +	case CBD_OP_FLUSH:
> +		ret = blkdev_issue_flush(cbdb->bdev);
> +		goto complete_cmd;
> +	default:
> +		pr_err("unrecognized op: %x", cbd_se_hdr_get_op(se->header.len_op));
> +		ret = -EIO;
> +		goto complete_cmd;
> +	}
> +
> +	if (!backend_io)
> +		return -ENOMEM;

There is no NULL check in backend_prepare_io(), so I'm not sure the
above condition can trigger in the current code unless you return NULL there.

> +
> +	ret = cbd_bio_alloc_pages(backend_io->bio, len, GFP_NOIO);
> +	if (ret) {
> +		kfree(backend_io);
> +		return ret;
> +	}
> +
> +	if (cbd_se_hdr_get_op(se->header.len_op) == CBD_OP_WRITE) {
> +		cbdc_copy_to_bio(&handler->channel, se->data_off, se->data_len, backend_io->bio);
> +	}
> +
> +	submit_bio(backend_io->bio);
> +

Unless I misunderstood the code, you are building a single bio from the
incoming request, and that bio might not have enough space to accommodate
all of the request's data, hence you return an error from
cbd_bio_alloc_pages() when bio_add_page() fails.

bio_add_page() can fail for multiple reasons. Instead of building only
one bio that might be too small for the I/O and returning an error, why
not use a chain of smaller bios? That way you will not run out of space
in a single bio, and you can still finish the I/O by avoiding the
bio_add_page() failure that occurs when the bio is full.

> +	return 0;
> +
> +complete_cmd:
> +	complete_cmd(handler, se->priv_data, ret);
> +	return 0;
> +}
> +
> +static void handle_work_fn(struct work_struct *work)
> +{
> +	struct cbd_handler *handler = container_of(work, struct cbd_handler, handle_work.work);
> +	struct cbd_se *se;
> +	int ret;
> +again:
> +	/* channel ctrl would be updated by blkdev queue */
> +	cbdc_flush_ctrl(&handler->channel);
> +	se = get_se_to_handle(handler);
> +	if (se == get_se_head(handler)) {
> +		if (cbdwc_need_retry(&handler->handle_worker_cfg)) {
> +			goto again;
> +		}
> +
> +		cbdwc_miss(&handler->handle_worker_cfg);
> +
> +		queue_delayed_work(handler->handle_wq, &handler->handle_work, usecs_to_jiffies(0));
> +		return;
> +	}
> +
> +	cbdwc_hit(&handler->handle_worker_cfg);
> +	cbdt_flush_range(handler->cbdb->cbdt, se, sizeof(*se));
> +	ret = handle_backend_cmd(handler, se);
> +	if (!ret) {
> +		/* this se is handled */
> +		handler->se_to_handle = (handler->se_to_handle + cbd_se_hdr_get_len(se->header.len_op)) % handler->channel_info->cmdr_size;

This is a really long line. If possible, keep code under 80 characters;
I know it's not a requirement anymore, but it will match the other block drivers.

-ck



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-24  4:29 ` [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dan Williams
@ 2024-04-24  6:33   ` Dongsheng Yang
  2024-04-24 15:14     ` Gregory Price
  2024-04-24 18:08     ` Dan Williams
  0 siblings, 2 replies; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-24  6:33 UTC (permalink / raw)
  To: Dan Williams, axboe, John Groves
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang



On Wed, 2024/4/24 at 12:29 PM, Dan Williams wrote:
> Dongsheng Yang wrote:
>> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
>>
>> Hi all,
>> 	This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
>> 	https://github.com/DataTravelGuide/linux
>>
> [..]
>> (4) dax is not supported yet:
>> 	same with famfs, dax device is not supported here, because dax device does not support
>> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
> 
> I am glad that famfs is mentioned here, it demonstrates you know about
> it. However, unfortunately this cover letter does not offer any analysis
> of *why* the Linux project should consider this additional approach to
> the inter-host shared-memory enabling problem.
> 
> To be clear I am neutral at best on some of the initiatives around CXL
> memory sharing vs pooling, but famfs at least jettisons block-devices
> and gets closer to a purpose-built memory semantic.
> 
> So my primary question is why would Linux need both famfs and cbd? I am
> sure famfs would love feedback and help vs developing competing efforts.

Hi,
	Thanks for your reply. IIUC, in famfs the data is stored in shared
memory, and the related nodes can share the data inside this file
system. cbd, by contrast, does not store data in shared memory; it uses
shared memory as a channel for data transmission, and the actual data is
stored in the backend block device of the remote node. In cbd, shared
memory works more like a network connecting different hosts.

That is to say, in my view, famfs and cbd do not conflict at all; they
meet the requirements of different scenarios. In cbd, shared memory
plays the role of a data transmission channel, while in famfs, shared
memory serves as the data store.

Please correct me if I am wrong.

Thanx
> .
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/7] cbd: introuce cbd_backend
  2024-04-24  5:03   ` Chaitanya Kulkarni
@ 2024-04-24  8:36     ` Dongsheng Yang
  0 siblings, 0 replies; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-24  8:36 UTC (permalink / raw)
  To: Chaitanya Kulkarni, dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang



On Wed, 2024/4/24 at 1:03 PM, Chaitanya Kulkarni wrote:
>> +
>> +struct cbd_backend_io {
>> +	struct cbd_se		*se;
>> +	u64			off;
>> +	u32			len;
>> +	struct bio		*bio;
>> +	struct cbd_handler	*handler;
>> +};
>> +
> 
> why not use inline bvecs and avoid bio page allocation for reasonable
> size ? instead of performing the allocation for each request ...

Inline bvecs sound good; I will use them in the next version.
> 
>> +static inline void complete_cmd(struct cbd_handler *handler, u64 priv_data, int ret)
>> +{
>> +	struct cbd_ce *ce = get_compr_head(handler);
>> +
>> +	memset(ce, 0, sizeof(*ce));
>> +	ce->priv_data = priv_data;
>> +	ce->result = ret;
>> +	CBDC_UPDATE_COMPR_HEAD(handler->channel_info->compr_head,
>> +			       sizeof(struct cbd_ce),
>> +			       handler->channel_info->compr_size);
>> +
>> +	cbdc_flush_ctrl(&handler->channel);
>> +
>> +	return;
>> +}
>> +
>> +static void backend_bio_end(struct bio *bio)
>> +{
>> +	struct cbd_backend_io *backend_io = bio->bi_private;
>> +	struct cbd_se *se = backend_io->se;
>> +	struct cbd_handler *handler = backend_io->handler;
>> +
>> +	if (bio->bi_status == 0 &&
>> +	    cbd_se_hdr_get_op(se->header.len_op) == CBD_OP_READ) {
>> +		cbdc_copy_from_bio(&handler->channel, se->data_off, se->data_len, bio);
>> +	}
>> +
>> +	complete_cmd(handler, se->priv_data, bio->bi_status);
>> +
>> +	bio_free_pages(bio);
>> +	bio_put(bio);
>> +	kfree(backend_io);
>> +}
>> +
>> +static int cbd_bio_alloc_pages(struct bio *bio, size_t size, gfp_t gfp_mask)
>> +{
>> +	int ret = 0;
>> +
>> +        while (size) {
>> +                struct page *page = alloc_pages(gfp_mask, 0);
>> +                unsigned len = min_t(size_t, PAGE_SIZE, size);
> 
> alloc_page() call should be close to below check ..

That's right, it should be alloc_page() rather than alloc_pages() with
an order of 0.
> 
>> +
>> +                if (!page) {
>> +			pr_err("failed to alloc page");
>> +			ret = -ENOMEM;
>> +			break;
>> +		}
>> +
>> +		ret = bio_add_page(bio, page, len, 0);
>> +                if (unlikely(ret != len)) {
>> +                        __free_page(page);
>> +			pr_err("failed to add page");
>> +                        break;
>> +                }
>> +
>> +                size -= len;
>> +        }
>> +
>> +	if (size)
>> +		bio_free_pages(bio);
>> +	else
>> +		ret = 0;
>> +
>> +        return ret;
>> +}
> 
> code formatting seems to be broken for above function plz check..

Thanks for pointing that out.
> 
>> +
>> +static struct cbd_backend_io *backend_prepare_io(struct cbd_handler *handler, struct cbd_se *se, blk_opf_t opf)
>> +{
>> +	struct cbd_backend_io *backend_io;
>> +	struct cbd_backend *cbdb = handler->cbdb;
>> +
>> +	backend_io = kzalloc(sizeof(struct cbd_backend_io), GFP_KERNEL);
> 
> will above allocation always succeed ? or NULL check should be here ?

Sure, it should be checked here. Thanks.
> 
>> +	backend_io->se = se;
>> +
>> +	backend_io->handler = handler;
>> +	backend_io->bio = bio_alloc_bioset(cbdb->bdev, roundup(se->len, 4096) / 4096, opf, GFP_KERNEL, &handler->bioset);
>> +
>> +	backend_io->bio->bi_iter.bi_sector = se->offset >> SECTOR_SHIFT;
>> +	backend_io->bio->bi_iter.bi_size = 0;
>> +	backend_io->bio->bi_private = backend_io;
>> +	backend_io->bio->bi_end_io = backend_bio_end;
>> +
>> +	return backend_io;
>> +}
>> +
>> +static int handle_backend_cmd(struct cbd_handler *handler, struct cbd_se *se)
>> +{
>> +	struct cbd_backend *cbdb = handler->cbdb;
>> +	u32 len = se->len;
>> +	struct cbd_backend_io *backend_io = NULL;
>> +	int ret;
>> +
>> +	if (cbd_se_hdr_flags_test(se, CBD_SE_HDR_DONE)) {
>> +		return 0 ;
>> +	}
>> +
>> +	switch (cbd_se_hdr_get_op(se->header.len_op)) {
>> +	case CBD_OP_PAD:
>> +		cbd_se_hdr_flags_set(se, CBD_SE_HDR_DONE);
>> +		return 0;
>> +	case CBD_OP_READ:
>> +		backend_io = backend_prepare_io(handler, se, REQ_OP_READ);
>> +		break;
>> +	case CBD_OP_WRITE:
>> +		backend_io = backend_prepare_io(handler, se, REQ_OP_WRITE);
>> +		break;
>> +	case CBD_OP_DISCARD:
>> +		ret = blkdev_issue_discard(cbdb->bdev, se->offset >> SECTOR_SHIFT,
>> +				se->len, GFP_NOIO);
> 
> any specific reason to not use GFP_KERNEL ?

GFP_NOIO is meant to keep memory reclaim from issuing new I/O while we
are already in the I/O path, which avoids recursion. In this case,
though, the code is handling remote I/O requests, so theoretically
GFP_KERNEL should also work.
> 
>> +		goto complete_cmd;
>> +	case CBD_OP_WRITE_ZEROS:
>> +		ret = blkdev_issue_zeroout(cbdb->bdev, se->offset >> SECTOR_SHIFT,
>> +				se->len, GFP_NOIO, 0);
> 
> any specific reason to not use GFP_KERNEL ?

ditto
> 
>> +		goto complete_cmd;
>> +	case CBD_OP_FLUSH:
>> +		ret = blkdev_issue_flush(cbdb->bdev);
>> +		goto complete_cmd;
>> +	default:
>> +		pr_err("unrecognized op: %x", cbd_se_hdr_get_op(se->header.len_op));
>> +		ret = -EIO;
>> +		goto complete_cmd;
>> +	}
>> +
>> +	if (!backend_io)
>> +		return -ENOMEM;
> 
> there is no NULL check in the backend_prepare_io() not sure about
> above condition in current code unless you return NULL ...

backend_prepare_io() should check for NULL :)
> 
>> +
>> +	ret = cbd_bio_alloc_pages(backend_io->bio, len, GFP_NOIO);
>> +	if (ret) {
>> +		kfree(backend_io);
>> +		return ret;
>> +	}
>> +
>> +	if (cbd_se_hdr_get_op(se->header.len_op) == CBD_OP_WRITE) {
>> +		cbdc_copy_to_bio(&handler->channel, se->data_off, se->data_len, backend_io->bio);
>> +	}
>> +
>> +	submit_bio(backend_io->bio);
>> +
> 
> unless I didn't understand the code, you are building a single bio from
> incoming request, that might not have enough space to accommodate all
> the data from incoming request, hence you are returning an error from
> cbd_bio_alloc_pages() when bio_add_page() fail ...
> 
> bio_add_page() can fail for multiple reasons, instead of trying to
> build only one bio that might be smaller for the size of the I/O and
> returning error, why not use the chain of the small size bios ? that
> way you will not run out of the space in single bio and still finish
> the I/O by avoiding bio_add_page() failure that might happen due to
> bio full ?

"bio_add_page" should only return an error when "bio->bi_vcnt >= 
bio->bi_max_vecs". However, in our case, "bi_max_vecs" is calculated 
when "bio_alloc_bioset" is called, so "bi_vcnt" should not exceed 
"bi_max_vecs". In other words, theoretically, "bio_add_page" should not 
fail here.
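
The sizing argument above can be checked with a small userspace sketch.
It is not the driver code: PAGE_SZ, nr_vecs_for() and pages_added_for()
are hypothetical names that mirror the roundup(se->len, 4096) / 4096
slot count and the page-at-a-time loop in cbd_bio_alloc_pages(). It only
covers the slot count, not any other condition under which
bio_add_page() may refuse a page.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096u	/* assumption: stands in for the hard-coded 4096 */

/* Vector slots requested up front: roundup(len, 4096) / 4096. */
static inline unsigned int nr_vecs_for(size_t len)
{
	return (unsigned int)((len + PAGE_SZ - 1) / PAGE_SZ);
}

/* Pages the allocation loop adds: one per iteration, up to PAGE_SZ bytes each. */
static inline unsigned int pages_added_for(size_t size)
{
	unsigned int pages = 0;

	while (size) {
		size_t len = size < PAGE_SZ ? size : PAGE_SZ;

		size -= len;
		pages++;
	}
	return pages;
}

/* The invariant the reply relies on: the loop never needs more slots than reserved. */
static inline void check_request(size_t len)
{
	assert(pages_added_for(len) <= nr_vecs_for(len));
}
```

Both counts round the length up to whole pages, so they are equal for
every length; the loop cannot run out of slots on that count alone.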
> 
>> +	return 0;
>> +
>> +complete_cmd:
>> +	complete_cmd(handler, se->priv_data, ret);
>> +	return 0;
>> +}
>> +
>> +static void handle_work_fn(struct work_struct *work)
>> +{
>> +	struct cbd_handler *handler = container_of(work, struct cbd_handler, handle_work.work);
>> +	struct cbd_se *se;
>> +	int ret;
>> +again:
>> +	/* channel ctrl would be updated by blkdev queue */
>> +	cbdc_flush_ctrl(&handler->channel);
>> +	se = get_se_to_handle(handler);
>> +	if (se == get_se_head(handler)) {
>> +		if (cbdwc_need_retry(&handler->handle_worker_cfg)) {
>> +			goto again;
>> +		}
>> +
>> +		cbdwc_miss(&handler->handle_worker_cfg);
>> +
>> +		queue_delayed_work(handler->handle_wq, &handler->handle_work, usecs_to_jiffies(0));
>> +		return;
>> +	}
>> +
>> +	cbdwc_hit(&handler->handle_worker_cfg);
>> +	cbdt_flush_range(handler->cbdb->cbdt, se, sizeof(*se));
>> +	ret = handle_backend_cmd(handler, se);
>> +	if (!ret) {
>> +		/* this se is handled */
>> +		handler->se_to_handle = (handler->se_to_handle + cbd_se_hdr_get_len(se->header.len_op)) % handler->channel_info->cmdr_size;
> 
> this is a really long line, if possible keep code under 80 char, I know
> it's not a requirement anymore but it will match block drivers ..

That's indeed long. I'll try to make it more concise in the next version.
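
One possible way to shorten it, shown as a userspace sketch rather than
the actual next-version code: move the wrap-around arithmetic into a
small helper. ring_advance() is a hypothetical name; the expression is
the same (pos + len) % size used on the long line.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Advance a ring offset by len bytes, wrapping at ring_size.
 * Hypothetical helper; in the driver the call site could then read:
 *
 *	handler->se_to_handle = ring_advance(handler->se_to_handle,
 *			cbd_se_hdr_get_len(se->header.len_op),
 *			handler->channel_info->cmdr_size);
 */
static inline uint32_t ring_advance(uint32_t pos, uint32_t len,
				    uint32_t ring_size)
{
	assert(ring_size != 0);	/* a zero-sized ring would divide by zero */
	return (pos + len) % ring_size;
}
```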

Kulkarni, thanks for your review, every comment helps :)

Thanx
> 
> -ck
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/7] block: Init for CBD(CXL Block Device)
  2024-04-24  3:58   ` Chaitanya Kulkarni
@ 2024-04-24  8:36     ` Dongsheng Yang
  0 siblings, 0 replies; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-24  8:36 UTC (permalink / raw)
  To: Chaitanya Kulkarni, dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang



On Wed, 2024/4/24 at 11:58 AM, Chaitanya Kulkarni wrote:
>> +/*
>> + * As shared memory is supported in CXL3.0 spec, we can transfer data via CXL shared memory.
>> + * CBD means CXL block device, it use CXL shared memory to transport command and data to
>> + * access block device in different host, as shown below:
>> + *
>> + *    ┌───────────────────────────────┐                               ┌────────────────────────────────────┐
>> + *    │          node-1               │                               │              node-2                │
>> + *    ├───────────────────────────────┤                               ├────────────────────────────────────┤
>> + *    │                               │                               │                                    │
>> + *    │                       ┌───────┤                               ├─────────┐                          │
>> + *    │                       │ cbd0  │                               │ backend0├──────────────────┐       │
>> + *    │                       ├───────┤                               ├─────────┤                  │       │
>> + *    │                       │ pmem0 │                               │ pmem0   │                  ▼       │
>> + *    │               ┌───────┴───────┤                               ├─────────┴────┐     ┌───────────────┤
>> + *    │               │    cxl driver │                               │ cxl driver   │     │  /dev/sda     │
>> + *    └───────────────┴────────┬──────┘                               └─────┬────────┴─────┴───────────────┘
>> + *                             │                                            │
>> + *                             │                                            │
>> + *                             │        CXL                         CXL     │
>> + *                             └────────────────┐               ┌───────────┘
>> + *                                              │               │
>> + *                                              │               │
>> + *                                              │               │
>> + *                                          ┌───┴───────────────┴─────┐
>> + *                                          │   shared memory device  │
>> + *                                          └─────────────────────────┘
>> + *
>> + * any read/write to cbd0 on node-1 will be transferred to node-2 /dev/sda. It works similar with
>> + * nbd (network block device), but it transfer data via CXL shared memory rather than network.
>> + */
>> +
>> +/* printk */
> 
> I don't think you need above comment ..

Agreed
> 
>> +#define cbd_err(fmt, ...)							\
>> +	pr_err("cbd: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__)
> 
> you can use #define pr_fmt and remove cmd: prefixes in each pr_xxx above ?

pr_fmt was one of the options I considered; however, it cannot fulfill
all my requirements, such as adding other prefixes like "transport%u".
So, in the end, I didn't use pr_fmt, which also keeps these macros
consistent with the others.
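
To illustrate the layering: pr_fmt gives one prefix for the whole file,
while the macros in the patch stack a per-object id at each level. The
following is a userspace sketch of that stacking only. The demo_* names
are hypothetical, and snprintf() into a buffer stands in for pr_err();
like the patch, it relies on the GNU ##__VA_ARGS__ extension.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Base layer: fixed "cbd: " prefix, written to a buffer instead of the log. */
#define demo_log(buf, n, fmt, ...) \
	snprintf(buf, n, "cbd: " fmt, ##__VA_ARGS__)

/* Each layer prepends its own id and forwards the rest to the layer below. */
#define demo_transport_log(buf, n, tid, fmt, ...) \
	demo_log(buf, n, "cbd_transport%u: " fmt, tid, ##__VA_ARGS__)

#define demo_backend_log(buf, n, tid, bid, fmt, ...) \
	demo_transport_log(buf, n, tid, "backend%d: " fmt, bid, ##__VA_ARGS__)

/* Format one backend-level message; returns a pointer to a static buffer. */
static inline const char *demo_line(unsigned int tid, int bid, const char *msg)
{
	static char line[128];

	assert(msg != NULL);
	demo_backend_log(line, sizeof(line), tid, bid, "%s", msg);
	return line;
}
```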
> 
>> +#define cbd_info(fmt, ...)							\
>> +	pr_info("cbd: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__)
>> +#define cbd_debug(fmt, ...)							\
>> +	pr_debug("cbd: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__)
>> +
>> +#define cbdt_err(transport, fmt, ...)						\
>> +	cbd_err("cbd_transport%u: " fmt,					\
>> +		 transport->id, ##__VA_ARGS__)
>> +#define cbdt_info(transport, fmt, ...)						\
>> +	cbd_info("cbd_transport%u: " fmt,					\
>> +		 transport->id, ##__VA_ARGS__)
>> +#define cbdt_debug(transport, fmt, ...)						\
>> +	cbd_debug("cbd_transport%u: " fmt,					\
>> +		 transport->id, ##__VA_ARGS__)
>> +
>> +#define cbd_backend_err(backend, fmt, ...)					\
>> +	cbdt_err(backend->cbdt, "backend%d: " fmt,				\
>> +		 backend->backend_id, ##__VA_ARGS__)
>> +#define cbd_backend_info(backend, fmt, ...)					\
>> +	cbdt_info(backend->cbdt, "backend%d: " fmt,				\
>> +		 backend->backend_id, ##__VA_ARGS__)
>> +#define cbd_backend_debug(backend, fmt, ...)					\
>> +	cbdt_debug(backend->cbdt, "backend%d: " fmt,				\
>> +		 backend->backend_id, ##__VA_ARGS__)
>> +
>> +#define cbd_handler_err(handler, fmt, ...)					\
>> +	cbd_backend_err(handler->cbdb, "handler%d: " fmt,			\
>> +		 handler->channel.channel_id, ##__VA_ARGS__)
>> +#define cbd_handler_info(handler, fmt, ...)					\
>> +	cbd_backend_info(handler->cbdb, "handler%d: " fmt,			\
>> +		 handler->channel.channel_id, ##__VA_ARGS__)
>> +#define cbd_handler_debug(handler, fmt, ...)					\
>> +	cbd_backend_debug(handler->cbdb, "handler%d: " fmt,			\
>> +		 handler->channel.channel_id, ##__VA_ARGS__)
>> +
>> +#define cbd_blk_err(dev, fmt, ...)						\
>> +	cbdt_err(dev->cbdt, "cbd%d: " fmt,					\
>> +		 dev->mapped_id, ##__VA_ARGS__)
>> +#define cbd_blk_info(dev, fmt, ...)						\
>> +	cbdt_info(dev->cbdt, "cbd%d: " fmt,					\
>> +		 dev->mapped_id, ##__VA_ARGS__)
>> +#define cbd_blk_debug(dev, fmt, ...)						\
>> +	cbdt_debug(dev->cbdt, "cbd%d: " fmt,					\
>> +		 dev->mapped_id, ##__VA_ARGS__)
>> +
>> +#define cbd_queue_err(queue, fmt, ...)						\
>> +	cbd_blk_err(queue->cbd_blkdev, "queue-%d: " fmt,			\
>> +		     queue->index, ##__VA_ARGS__)
>> +#define cbd_queue_info(queue, fmt, ...)						\
>> +	cbd_blk_info(queue->cbd_blkdev, "queue-%d: " fmt,			\
>> +		     queue->index, ##__VA_ARGS__)
>> +#define cbd_queue_debug(queue, fmt, ...)					\
>> +	cbd_blk_debug(queue->cbd_blkdev, "queue-%d: " fmt,			\
>> +		     queue->index, ##__VA_ARGS__)
>> +
>> +#define cbd_channel_err(channel, fmt, ...)					\
>> +	cbdt_err(channel->cbdt, "channel%d: " fmt,				\
>> +		 channel->channel_id, ##__VA_ARGS__)
>> +#define cbd_channel_info(channel, fmt, ...)					\
>> +	cbdt_info(channel->cbdt, "channel%d: " fmt,				\
>> +		 channel->channel_id, ##__VA_ARGS__)
>> +#define cbd_channel_debug(channel, fmt, ...)					\
>> +	cbdt_debug(channel->cbdt, "channel%d: " fmt,				\
>> +		 channel->channel_id, ##__VA_ARGS__)
>> +
> 
> [...]
> 
>> +
>> +struct cbd_se {
>> +	struct cbd_se_hdr	header;
>> +	u64			priv_data;	// pointer to cbd_request
> 
> use /**/ instead //

agreed

Thanx
> 
> 
> -ck
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/7] cbd: introduce cbd_transport
  2024-04-24  4:08   ` Chaitanya Kulkarni
@ 2024-04-24  8:43     ` Dongsheng Yang
  0 siblings, 0 replies; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-24  8:43 UTC (permalink / raw)
  To: Chaitanya Kulkarni, dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang



On Wed, 2024/4/24 at 12:08 PM, Chaitanya Kulkarni wrote:
>> +static ssize_t cbd_myhost_show(struct device *dev,
>> +			       struct device_attribute *attr,
>> +			       char *buf)
>> +{
>> +	struct cbd_transport *cbdt;
>> +	struct cbd_host *host;
>> +
>> +	cbdt = container_of(dev, struct cbd_transport, device);
>> +
>> +	host = cbdt->host;
>> +	if (!host)
>> +		return 0;
>> +
>> +	return sprintf(buf, "%d\n", host->host_id);
> 
> snprintf() ?

IMO, it will only print a decimal unsigned int, so it shouldn't overflow 
the buffer.
> 
>> +}
>> +
>> +static DEVICE_ATTR(my_host_id, 0400, cbd_myhost_show, NULL);
>> +
>> +enum {
> 
> [...]
> 
>> +
>> +static ssize_t cbd_adm_store(struct device *dev,
>> +				 struct device_attribute *attr,
>> +				 const char *ubuf,
>> +				 size_t size)
>> +{
>> +	int ret;
>> +	char *buf;
>> +	struct cbd_adm_options opts = { 0 };
>> +	struct cbd_transport *cbdt;
>> +
> 
> reverse tree order that matches rest of your code ?

Agreed,
> 
>> +	if (!capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
>> +
>> +	cbdt = container_of(dev, struct cbd_transport, device);
>> +
>> +	buf = kmemdup(ubuf, size + 1, GFP_KERNEL);
>> +	if (IS_ERR(buf)) {
>> +		pr_err("failed to dup buf for adm option: %d", (int)PTR_ERR(buf));
>> +		return PTR_ERR(buf);
>> +	}
>> +	buf[size] = '\0';
>> +	ret = parse_adm_options(cbdt, buf, &opts);
>> +	if (ret < 0) {
>> +		kfree(buf);
>> +		return ret;
>> +	}
>> +	kfree(buf);
>> +
> 
> standard format is using goto out and having only on kfree()
> at the end of the function ...

Okay, having a unified error handling path is a good idea, and it's 
suitable here as well. Thanks.
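
A minimal userspace sketch of the "goto out" shape being discussed,
under stated assumptions: adm_store() and parse_options() are
hypothetical stand-ins for cbd_adm_store() and parse_adm_options(), and
malloc()/free() stand in for the kernel allocator. It is not the code
planned for the next version.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for parse_adm_options(): rejects empty input. */
static inline int parse_options(const char *buf)
{
	return buf[0] == '\0' ? -EINVAL : 0;
}

/* Returns the number of bytes consumed, or a negative errno. */
static inline long adm_store(const char *ubuf, size_t size)
{
	long ret;
	char *buf;

	assert(ubuf != NULL);

	buf = malloc(size + 1);		/* allocation failure is NULL, not an error pointer */
	if (!buf)
		return -ENOMEM;
	memcpy(buf, ubuf, size);
	buf[size] = '\0';

	ret = parse_options(buf);
	if (ret < 0)
		goto out;

	ret = (long)size;
out:
	free(buf);			/* single cleanup point for every path */
	return ret;
}
```

With one exit label, any error path added later frees the buffer without
needing its own cleanup call.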
> 
>> +	switch (opts.op) {
>> +	case CBDT_ADM_OP_B_START:
>> +		break;
>> +	case CBDT_ADM_OP_B_STOP:
>> +		break;
>> +	case CBDT_ADM_OP_B_CLEAR:
>> +		break;
>> +	case CBDT_ADM_OP_DEV_START:
>> +		break;
>> +	case CBDT_ADM_OP_DEV_STOP:
>> +		break;
>> +	default:
>> +		pr_err("invalid op: %d\n", opts.op);
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	return size;
>> +}
>> +
> 
> [...]
> 
>> +static struct cbd_transport *cbdt_alloc(void)
>> +{
>> +	struct cbd_transport *cbdt;
>> +	int ret;
>> +
>> +	cbdt = kzalloc(sizeof(struct cbd_transport), GFP_KERNEL);
>> +	if (!cbdt) {
>> +		return NULL;
>> +	}
> 
> no braces needed for single statements in if ... applies rest of
> the code ...

Thanks, I will remove the unnecessary braces in the next version.
> 
> -ck
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-24  6:33   ` Dongsheng Yang
@ 2024-04-24 15:14     ` Gregory Price
  2024-04-26  1:25       ` Dongsheng Yang
  2024-04-24 18:08     ` Dan Williams
  1 sibling, 1 reply; 52+ messages in thread
From: Gregory Price @ 2024-04-24 15:14 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Dan Williams, axboe, John Groves, linux-block, linux-kernel,
	linux-cxl, Dongsheng Yang

On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote:
> 
> 
> On Wed, 2024/4/24 at 12:29 PM, Dan Williams wrote:
> > Dongsheng Yang wrote:
> > > From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> > > 
> > > Hi all,
> > > 	This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
> > > 	https://github.com/DataTravelGuide/linux
> > > 
> > [..]
> > > (4) dax is not supported yet:
> > > 	same with famfs, dax device is not supported here, because dax device does not support
> > > dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
> > 
> > I am glad that famfs is mentioned here, it demonstrates you know about
> > it. However, unfortunately this cover letter does not offer any analysis
> > of *why* the Linux project should consider this additional approach to
> > the inter-host shared-memory enabling problem.
> > 
> > To be clear I am neutral at best on some of the initiatives around CXL
> > memory sharing vs pooling, but famfs at least jettisons block-devices
> > and gets closer to a purpose-built memory semantic.
> > 
> > So my primary question is why would Linux need both famfs and cbd? I am
> > sure famfs would love feedback and help vs developing competing efforts.
> 
> Hi,
> 	Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in
> shared memory, and related nodes can share the data inside this file system;
> whereas cbd does not store data in shared memory, it uses shared memory as a
> channel for data transmission, and the actual data is stored in the backend
> block device of remote nodes. In cbd, shared memory works more like network
> to connect different hosts.
>

Couldn't you basically just allocate a file for use as a uni-directional
buffer on top of FAMFS and achieve the same thing without the need for
additional kernel support? Similar in a sense to allocating a file on
network storage and pinging the remote host when it's ready (except now
it's fast!)

(The point here is not "FAMFS is better" or "CBD is better", simply
trying to identify the function that will ultimately dictate the form).

~Gregory

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-24  6:33   ` Dongsheng Yang
  2024-04-24 15:14     ` Gregory Price
@ 2024-04-24 18:08     ` Dan Williams
       [not found]       ` <539c1323-68f9-d753-a102-692b69049c20@easystack.cn>
  1 sibling, 1 reply; 52+ messages in thread
From: Dan Williams @ 2024-04-24 18:08 UTC (permalink / raw)
  To: Dongsheng Yang, Dan Williams, axboe, John Groves
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

Dongsheng Yang wrote:
> 
> 
> On Wed, 2024/4/24 at 12:29 PM, Dan Williams wrote:
> > Dongsheng Yang wrote:
> >> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> >>
> >> Hi all,
> >> 	This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
> >> 	https://github.com/DataTravelGuide/linux
> >>
> > [..]
> >> (4) dax is not supported yet:
> >> 	same with famfs, dax device is not supported here, because dax device does not support
> >> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
> > 
> > I am glad that famfs is mentioned here, it demonstrates you know about
> > it. However, unfortunately this cover letter does not offer any analysis
> > of *why* the Linux project should consider this additional approach to
> > the inter-host shared-memory enabling problem.
> > 
> > To be clear I am neutral at best on some of the initiatives around CXL
> > memory sharing vs pooling, but famfs at least jettisons block-devices
> > and gets closer to a purpose-built memory semantic.
> > 
> > So my primary question is why would Linux need both famfs and cbd? I am
> > sure famfs would love feedback and help vs developing competing efforts.
> 
> Hi,
> 	Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in 
> shared memory, and related nodes can share the data inside this file 
> system; whereas cbd does not store data in shared memory, it uses shared 
> memory as a channel for data transmission, and the actual data is stored 
> in the backend block device of remote nodes. In cbd, shared memory works 
> more like network to connect different hosts.
> 
> That is to say, in my view, FAMfs and cbd do not conflict at all; they 
> meet different scenario requirements. cbd simply uses shared memory to 
> transmit data, shared memory plays the role of a data transmission 
> channel, while in FAMfs, shared memory serves as a data store role.

If shared memory is just a communication transport then a block-device
abstraction does not seem a proper fit. From the above description this
sounds similar to what CONFIG_NTB_TRANSPORT offers which is a way for
two hosts to communicate over a shared memory channel.

So, I am not really looking for an analysis of famfs vs CBD; I am
looking for CBD to clarify why Linux should consider it, and why the
architecture is fit for purpose.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [EXTERNAL] [PATCH 7/7] cbd: add related sysfs files in transport register
  2024-04-22  7:16 ` [PATCH 7/7] cbd: add related sysfs files in transport register Dongsheng Yang
@ 2024-04-25  5:24   ` Bharat Bhushan
  0 siblings, 0 replies; 52+ messages in thread
From: Bharat Bhushan @ 2024-04-25  5:24 UTC (permalink / raw)
  To: Dongsheng Yang, dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang



> -----Original Message-----
> From: Dongsheng Yang <dongsheng.yang@easystack.cn>
> Sent: Monday, April 22, 2024 12:46 PM
> To: dan.j.williams@intel.com; axboe@kernel.dk
> Cc: linux-block@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-cxl@vger.kernel.org; Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> Subject: [EXTERNAL] [PATCH 7/7] cbd: add related sysfs files in transport register
> 
> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> 
> When a transport is registered, a corresponding file is created for each
> area within the transport in the sysfs, including "cbd_hosts",
> "cbd_backends", "cbd_blkdevs", and "cbd_channels".
> 
> Through these sysfs files, we can examine the information of each entity
> and thereby understand the relationships between them. This allows us to
> further understand the current operational status of the transport.
> 
> For example, by examining "cbd_hosts", we can find all the hosts
> currently using the transport. We can also determine which host each
> backend is running on by looking at the "host_id" in "cbd_backends".
> Similarly, by examining "cbd_blkdevs", we can determine which host each
> blkdev is running on, and through the "mapped_id", we can know the name
> of the cbd device to which the blkdev is mapped. Additionally, by
> looking at "cbd_channels", we can determine which blkdev and backend are
> connected through each channel by examining the "blkdev_id" and
> "backend_id".
> 
> Signed-off-by: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> ---
>  drivers/block/cbd/cbd_transport.c | 101 +++++++++++++++++++++++++++++-
>  1 file changed, 100 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/cbd/cbd_transport.c b/drivers/block/cbd/cbd_transport.c
> index 75b9d34218fc..0e917d72b209 100644
> --- a/drivers/block/cbd/cbd_transport.c
> +++ b/drivers/block/cbd/cbd_transport.c
> @@ -1,8 +1,91 @@
>  #include <linux/pfn_t.h>
> -
>  #include "cbd_internal.h"
> 
>  #define CBDT_OBJ(OBJ, OBJ_SIZE)						\
> +extern struct device_type cbd_##OBJ##_type;					\
> +extern struct device_type cbd_##OBJ##s_type;					\
> +										\
> +static int cbd_##OBJ##s_init(struct cbd_transport *cbdt)			\
> +{										\
> +	struct cbd_##OBJ##s_device *devs;					\
> +	struct cbd_##OBJ##_device *cbd_dev;					\
> +	struct device *dev;							\
> +	int i;									\
> +	int ret;								\
> +										\
> +	u32 memsize = struct_size(devs, OBJ##_devs,				\
> +			cbdt->transport_info->OBJ##_num);			\
> +	devs = kzalloc(memsize, GFP_KERNEL);					\
> +	if (!devs) {								\
> +	    return -ENOMEM;							\
> +	}

The braces "{ }" are not needed here around a single statement.

> 	\
> +	\
> +	dev = &devs->OBJ##s_dev;	\
> +	device_initialize(dev);	\
> +	device_set_pm_not_required(dev);	\
> +	dev_set_name(dev, "cbd_" #OBJ "s");	\
> +	dev->parent = &cbdt->device;	\
> +	dev->type = &cbd_##OBJ##s_type;	\
> +	ret = device_add(dev);	\
> +	if (ret) {	\
> +		goto devs_free;	\
> +	}

The braces "{ }" are not needed here either.

Thanks
-Bharat

> 	\
> +	\
> +	for (i = 0; i < cbdt->transport_info->OBJ##_num; i++) {	\
> +		cbd_dev = &devs->OBJ##_devs[i];	\
> +		dev = &cbd_dev->dev;	\
> +	\
> +		cbd_dev->cbdt = cbdt;	\
> +		cbd_dev->OBJ##_info = cbdt_get_##OBJ##_info(cbdt, i);	\
> +		device_initialize(dev);	\
> +		device_set_pm_not_required(dev);	\
> +		dev_set_name(dev, #OBJ "%u", i);	\
> +		dev->parent = &devs->OBJ##s_dev;	\
> +		dev->type = &cbd_##OBJ##_type;	\
> +	\
> +		ret = device_add(dev);	\
> +		if (ret) {	\
> +			i--;	\
> +			goto del_device;	\
> +		}	\
> +	}	\
> +	cbdt->cbd_##OBJ##s_dev = devs;	\
> +	\
> +    	return 0;	\
> +del_device:	\
> +	for (; i >= 0; i--) {	\
> +		cbd_dev = &devs->OBJ##_devs[i];	\
> +		dev = &cbd_dev->dev;	\
> +		device_del(dev);	\
> +	}	\
> +devs_free:	\
> +	kfree(devs);	\
> +	return ret;	\
> +}	\
> +	\
> +static void cbd_##OBJ##s_exit(struct cbd_transport *cbdt)	\
> +{	\
> +	struct cbd_##OBJ##s_device *devs = cbdt->cbd_##OBJ##s_dev;	\
> +	struct device *dev;	\
> +	int i;	\
> +	\
> +	if (!devs)	\
> +		return;	\
> +	\
> +	for (i = 0; i < cbdt->transport_info->OBJ##_num; i++) {	\
> +		struct cbd_##OBJ##_device *cbd_dev = &devs->OBJ##_devs[i];	\
> +		dev = &cbd_dev->dev;	\
> +	\
> +		device_del(dev);	\
> +	}	\
> +	\
> +	device_del(&devs->OBJ##s_dev);	\
> +	\
> +	kfree(devs);	\
> +	cbdt->cbd_##OBJ##s_dev = NULL;	\
> +	\
> +	return;	\
> +}	\
>  	\
>  static inline struct cbd_##OBJ##_info	\
>  *__get_##OBJ##_info(struct cbd_transport *cbdt, u32 id)	\
> @@ -588,6 +671,11 @@ int cbdt_unregister(u32 tid)
>  	}
>  	mutex_unlock(&cbdt->lock);
> 
> +	cbd_blkdevs_exit(cbdt);
> +	cbd_channels_exit(cbdt);
> +	cbd_backends_exit(cbdt);
> +	cbd_hosts_exit(cbdt);
> +
>  	cbd_host_unregister(cbdt);
>  	device_unregister(&cbdt->device);
>  	cbdt_dax_release(cbdt);
> @@ -647,9 +735,20 @@ int cbdt_register(struct cbdt_register_options *opts)
>  		goto dev_unregister;
>  	}
> 
> +	if (cbd_hosts_init(cbdt) || cbd_backends_init(cbdt) ||
> +	    cbd_channels_init(cbdt) || cbd_blkdevs_init(cbdt)) {
> +		ret = -ENOMEM;
> +		goto devs_exit;
> +	}
> +
>  	return 0;
> 
>  devs_exit:
> +	cbd_blkdevs_exit(cbdt);
> +	cbd_channels_exit(cbdt);
> +	cbd_backends_exit(cbdt);
> +	cbd_hosts_exit(cbdt);
> +
>  	cbd_host_unregister(cbdt);
>  dev_unregister:
>  	device_unregister(&cbdt->device);
> --
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [EXTERNAL] [PATCH 5/7] cbd: introuce cbd_backend
  2024-04-22  7:16 ` [PATCH 5/7] cbd: introuce cbd_backend Dongsheng Yang
  2024-04-24  5:03   ` Chaitanya Kulkarni
@ 2024-04-25  5:46   ` Bharat Bhushan
  1 sibling, 0 replies; 52+ messages in thread
From: Bharat Bhushan @ 2024-04-25  5:46 UTC (permalink / raw)
  To: Dongsheng Yang, dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang



> -----Original Message-----
> From: Dongsheng Yang <dongsheng.yang@easystack.cn>
> Sent: Monday, April 22, 2024 12:46 PM
> To: dan.j.williams@intel.com; axboe@kernel.dk
> Cc: linux-block@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> cxl@vger.kernel.org; Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> Subject: [EXTERNAL] [PATCH 5/7] cbd: introuce cbd_backend
> 
> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> 
> The "cbd_backend" is responsible for exposing a local block device (such as
> "/dev/sda") through the "cbd_transport" to other hosts.
> 
> Any host that registers this transport can map this backend to a local "cbd
> device"(such as "/dev/cbd0"). All reads and writes to "cbd0" are transmitted
> through the channel inside the transport to the backend. The handler inside
> the backend is responsible for processing these read and write requests,
> converting them into read and write requests corresponding to "sda".
> 
> Signed-off-by: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> ---
>  drivers/block/cbd/Makefile        |   2 +-
>  drivers/block/cbd/cbd_backend.c   | 254 +++++++++++++++++++++++++++++
>  drivers/block/cbd/cbd_handler.c   | 261 ++++++++++++++++++++++++++++++
>  drivers/block/cbd/cbd_transport.c |   6 +
>  4 files changed, 522 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/block/cbd/cbd_backend.c
>  create mode 100644 drivers/block/cbd/cbd_handler.c
> 
> diff --git a/drivers/block/cbd/Makefile b/drivers/block/cbd/Makefile
> index 2389a738b12b..b47f1e584946 100644
> --- a/drivers/block/cbd/Makefile
> +++ b/drivers/block/cbd/Makefile
> @@ -1,3 +1,3 @@
> -cbd-y := cbd_main.o cbd_transport.o cbd_channel.o cbd_host.o
> +cbd-y := cbd_main.o cbd_transport.o cbd_channel.o cbd_host.o cbd_backend.o cbd_handler.o
> 
>  obj-$(CONFIG_BLK_DEV_CBD) += cbd.o
> diff --git a/drivers/block/cbd/cbd_backend.c b/drivers/block/cbd/cbd_backend.c
> new file mode 100644
> index 000000000000..a06f319e62c4
> --- /dev/null
> +++ b/drivers/block/cbd/cbd_backend.c
> @@ -0,0 +1,254 @@
> +#include "cbd_internal.h"
> +
> +static ssize_t backend_host_id_show(struct device *dev,
> +			       struct device_attribute *attr,
> +			       char *buf)
> +{
> +	struct cbd_backend_device *backend;
> +	struct cbd_backend_info *backend_info;
> +
> +	backend = container_of(dev, struct cbd_backend_device, dev);
> +	backend_info = backend->backend_info;
> +
> +	cbdt_flush_range(backend->cbdt, backend_info, sizeof(*backend_info));
> +
> +	if (backend_info->state == cbd_backend_state_none)
> +		return 0;
> +
> +	return sprintf(buf, "%u\n", backend_info->host_id);
> +}
> +
> +static DEVICE_ATTR(host_id, 0400, backend_host_id_show, NULL);
> +
> +static ssize_t backend_path_show(struct device *dev,
> +			       struct device_attribute *attr,
> +			       char *buf)
> +{
> +	struct cbd_backend_device *backend;
> +	struct cbd_backend_info *backend_info;
> +
> +	backend = container_of(dev, struct cbd_backend_device, dev);
> +	backend_info = backend->backend_info;
> +
> +	cbdt_flush_range(backend->cbdt, backend_info, sizeof(*backend_info));
> +
> +	if (backend_info->state == cbd_backend_state_none)
> +		return 0;
> +
> +	if (strlen(backend_info->path) == 0)

Cosmetic comment: maybe we can use
	if (!strlen(backend_info->path))

> +		return 0;
> +
> +	return sprintf(buf, "%s\n", backend_info->path);
> +}

sprintf() is safe with a zero-length source string, so the zero-length check above can probably be removed.
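
To illustrate the point, here is a minimal userspace sketch (not kernel code) of formatting a possibly empty string with "%s\n". The helper name format_attr() and the buffer sizes are assumptions made only for this example. With an empty source the result is a lone newline and a return value of 1, whereas the existing check returns 0 bytes, so dropping the check slightly changes what userspace reads. In-kernel show() callbacks would normally use sysfs_emit() rather than sprintf().

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the body of a sysfs show() callback:
 * format "value\n" into buf. Returns the number of bytes written
 * (excluding the trailing NUL), or -1 if the output did not fit.
 */
static int format_attr(char *buf, size_t size, const char *value)
{
	int n = snprintf(buf, size, "%s\n", value);

	if (n < 0 || (size_t)n >= size)
		return -1;

	return n;
}
```

Calling format_attr() with an empty string is well defined; only the newline is emitted.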


> +
> +static DEVICE_ATTR(path, 0400, backend_path_show, NULL);
> +
> +CBD_OBJ_HEARTBEAT(backend);
> +
> +static struct attribute *cbd_backend_attrs[] = {
> +	&dev_attr_path.attr,
> +	&dev_attr_host_id.attr,
> +	&dev_attr_alive.attr,
> +	NULL
> +};
> +
> +static struct attribute_group cbd_backend_attr_group = {
> +	.attrs = cbd_backend_attrs,
> +};
> +
> +static const struct attribute_group *cbd_backend_attr_groups[] = {
> +	&cbd_backend_attr_group,
> +	NULL
> +};
> +
> +static void cbd_backend_release(struct device *dev)
> +{
> +}
> +
> +struct device_type cbd_backend_type = {
> +	.name		= "cbd_backend",
> +	.groups		= cbd_backend_attr_groups,
> +	.release	= cbd_backend_release,
> +};
> +
> +struct device_type cbd_backends_type = {
> +	.name		= "cbd_backends",
> +	.release	= cbd_backend_release,
> +};
> +
> +void cbdb_add_handler(struct cbd_backend *cbdb, struct cbd_handler *handler)
> +{
> +	mutex_lock(&cbdb->lock);
> +	list_add(&handler->handlers_node, &cbdb->handlers);
> +	mutex_unlock(&cbdb->lock);
> +}
> +
> +void cbdb_del_handler(struct cbd_backend *cbdb, struct cbd_handler *handler)
> +{
> +	mutex_lock(&cbdb->lock);
> +	list_del_init(&handler->handlers_node);
> +	mutex_unlock(&cbdb->lock);
> +}
> +
> +static struct cbd_handler *cbdb_get_handler(struct cbd_backend *cbdb, u32 channel_id)
> +{
> +	struct cbd_handler *handler, *handler_next;
> +	bool found = false;
> +
> +	mutex_lock(&cbdb->lock);
> +	list_for_each_entry_safe(handler, handler_next, &cbdb->handlers, handlers_node) {
> +		if (handler->channel.channel_id == channel_id) {
> +			found = true;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&cbdb->lock);
> +
> +	if (!found) {
> +		return ERR_PTR(-ENOENT);
> +	}

The braces "{ }" are not needed around a single-line statement.

These braces are used in multiple places in this series; please remove them.
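
For reference, a small userspace sketch of the brace convention being asked for, following the kernel coding style: no braces around a single statement, braces kept where a loop body is more than one simple statement. find_channel() and its arguments are hypothetical names used only for this illustration; they are not part of the patch.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical lookup: return the index of id in ids[], or -ENOENT
 * when it is not present.
 */
static int find_channel(const unsigned int *ids, size_t n, unsigned int id)
{
	size_t i;

	/* The loop body is an if statement, so the loop keeps its braces. */
	for (i = 0; i < n; i++) {
		/* A single statement under the if: no braces. */
		if (ids[i] == id)
			return (int)i;
	}

	return -ENOENT;
}
```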

> +
> +	return handler;
> +}
> +
> +static void state_work_fn(struct work_struct *work)
> +{
> +	struct cbd_backend *cbdb = container_of(work, struct cbd_backend, state_work.work);
> +	struct cbd_transport *cbdt = cbdb->cbdt;
> +	struct cbd_channel_info *channel_info;
> +	u32 blkdev_state, backend_state, backend_id;
> +	int i;
> +
> +	for (i = 0; i < cbdt->transport_info->channel_num; i++) {
> +		channel_info = cbdt_get_channel_info(cbdt, i);
> +
> +		cbdt_flush_range(cbdt, channel_info, sizeof(*channel_info));
> +		blkdev_state = channel_info->blkdev_state;
> +		backend_state = channel_info->backend_state;
> +		backend_id = channel_info->backend_id;
> +
> +		if (blkdev_state == cbdc_blkdev_state_running &&
> +				backend_state == cbdc_backend_state_none &&
> +				backend_id == cbdb->backend_id) {
> +
> +			cbd_handler_create(cbdb, i);
> +		}
> +
> +		if (blkdev_state == cbdc_blkdev_state_none &&
> +				backend_state == cbdc_backend_state_running &&
> +				backend_id == cbdb->backend_id) {
> +			struct cbd_handler *handler;
> +
> +			handler = cbdb_get_handler(cbdb, i);
> +			cbd_handler_destroy(handler);
> +		}
> +	}
> +
> +	queue_delayed_work(cbd_wq, &cbdb->state_work, 1 * HZ);
> +}
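
The per-channel decision made in state_work_fn() can be sketched as a pure function. This is a userspace illustration only: the two-state enum, the action names and decide_action() are assumptions introduced for the example, and the real code reads these values from the shared channel_info.

```c
#include <stdint.h>

/* Hypothetical two-state model of each side of a channel. */
enum side_state { STATE_NONE, STATE_RUNNING };

/* What the backend should do for one channel on this polling pass. */
enum handler_action { ACT_NOTHING, ACT_CREATE, ACT_DESTROY };

static enum handler_action decide_action(enum side_state blkdev,
					 enum side_state backend,
					 uint32_t channel_backend_id,
					 uint32_t my_backend_id)
{
	/* Channels that belong to another backend are ignored. */
	if (channel_backend_id != my_backend_id)
		return ACT_NOTHING;

	/* A blkdev attached but no handler is serving it yet. */
	if (blkdev == STATE_RUNNING && backend == STATE_NONE)
		return ACT_CREATE;

	/* The blkdev went away while a handler is still running. */
	if (blkdev == STATE_NONE && backend == STATE_RUNNING)
		return ACT_DESTROY;

	return ACT_NOTHING;
}
```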
> +
> +static int cbd_backend_init(struct cbd_backend *cbdb)
> +{
> +	struct cbd_backend_info *b_info;
> +	struct cbd_transport *cbdt = cbdb->cbdt;
> +
> +	b_info = cbdt_get_backend_info(cbdt, cbdb->backend_id);
> +	cbdb->backend_info = b_info;
> +
> +	b_info->host_id = cbdb->cbdt->host->host_id;
> +
> +	cbdb->bdev_handle = bdev_open_by_path(cbdb->path, BLK_OPEN_READ | BLK_OPEN_WRITE, cbdb, NULL);
> +	if (IS_ERR(cbdb->bdev_handle)) {
> +		cbdt_err(cbdt, "failed to open bdev: %d", (int)PTR_ERR(cbdb->bdev_handle));
> +		return PTR_ERR(cbdb->bdev_handle);
> +	}
> +	cbdb->bdev = cbdb->bdev_handle->bdev;
> +	b_info->dev_size = bdev_nr_sectors(cbdb->bdev);
> +
> +	INIT_DELAYED_WORK(&cbdb->state_work, state_work_fn);
> +	INIT_DELAYED_WORK(&cbdb->hb_work, backend_hb_workfn);
> +	INIT_LIST_HEAD(&cbdb->handlers);
> +	cbdb->backend_device = &cbdt->cbd_backends_dev->backend_devs[cbdb->backend_id];
> +
> +	mutex_init(&cbdb->lock);
> +
> +	queue_delayed_work(cbd_wq, &cbdb->state_work, 0);
> +	queue_delayed_work(cbd_wq, &cbdb->hb_work, 0);
> +
> +	return 0;
> +}
> +
> +int cbd_backend_start(struct cbd_transport *cbdt, char *path)
> +{
> +	struct cbd_backend *backend;
> +	struct cbd_backend_info *backend_info;
> +	u32 backend_id;
> +	int ret;
> +
> +	ret = cbdt_get_empty_backend_id(cbdt, &backend_id);
> +	if (ret) {
> +		return ret;
> +	}

Same comment as above: the braces are not needed.

> +
> +	backend_info = cbdt_get_backend_info(cbdt, backend_id);
> +
> +	backend = kzalloc(sizeof(struct cbd_backend), GFP_KERNEL);
> +	if (!backend) {
> +		return -ENOMEM;
> +	}

Same comment as above: the braces are not needed.

> +
> +	strscpy(backend->path, path, CBD_PATH_LEN);
> +	memcpy(backend_info->path, backend->path, CBD_PATH_LEN);
> +	INIT_LIST_HEAD(&backend->node);
> +	backend->backend_id = backend_id;
> +	backend->cbdt = cbdt;
> +
> +	ret = cbd_backend_init(backend);
> +	if (ret) {
> +		goto backend_free;
> +	}
> +
> +	backend_info->state = cbd_backend_state_running;
> +	cbdt_flush_range(cbdt, backend_info, sizeof(*backend_info));
> +
> +	cbdt_add_backend(cbdt, backend);
> +
> +	return 0;
> +
> +backend_free:
> +	kfree(backend);
> +
> +	return ret;
> +}
> +
> +int cbd_backend_stop(struct cbd_transport *cbdt, u32 backend_id)
> +{
> +	struct cbd_backend *cbdb;
> +	struct cbd_backend_info *backend_info;
> +
> +	cbdb = cbdt_get_backend(cbdt, backend_id);
> +	if (!cbdb) {
> +		return -ENOENT;
> +	}
> +
> +	mutex_lock(&cbdb->lock);
> +	if (!list_empty(&cbdb->handlers)) {
> +		mutex_unlock(&cbdb->lock);
> +		return -EBUSY;
> +	}
> +
> +	cbdt_del_backend(cbdt, cbdb);
> +
> +	cancel_delayed_work_sync(&cbdb->hb_work);
> +	cancel_delayed_work_sync(&cbdb->state_work);
> +
> +	backend_info = cbdt_get_backend_info(cbdt, cbdb->backend_id);
> +	backend_info->state = cbd_backend_state_none;
> +	cbdt_flush_range(cbdt, backend_info, sizeof(*backend_info));
> +	mutex_unlock(&cbdb->lock);
> +
> +	bdev_release(cbdb->bdev_handle);
> +	kfree(cbdb);
> +
> +	return 0;
> +}
> diff --git a/drivers/block/cbd/cbd_handler.c b/drivers/block/cbd/cbd_handler.c
> new file mode 100644
> index 000000000000..0fbfc225ea29
> --- /dev/null
> +++ b/drivers/block/cbd/cbd_handler.c
> @@ -0,0 +1,261 @@
> +#include "cbd_internal.h"
> +
> +static inline struct cbd_se *get_se_head(struct cbd_handler *handler)
> +{
> +	return (struct cbd_se *)(handler->channel.cmdr + handler->channel_info->cmd_head);
> +}
> +
> +static inline struct cbd_se *get_se_to_handle(struct cbd_handler *handler)
> +{
> +	return (struct cbd_se *)(handler->channel.cmdr + handler->se_to_handle);
> +}
> +
> +static inline struct cbd_ce *get_compr_head(struct cbd_handler *handler)
> +{
> +	return (struct cbd_ce *)(handler->channel.compr + handler->channel_info->compr_head);
> +}
> +
> +struct cbd_backend_io {
> +	struct cbd_se		*se;
> +	u64			off;
> +	u32			len;
> +	struct bio		*bio;
> +	struct cbd_handler	*handler;
> +};
> +
> +static inline void complete_cmd(struct cbd_handler *handler, u64 priv_data, int ret)
> +{
> +	struct cbd_ce *ce = get_compr_head(handler);
> +
> +	memset(ce, 0, sizeof(*ce));
> +	ce->priv_data = priv_data;
> +	ce->result = ret;
> +	CBDC_UPDATE_COMPR_HEAD(handler->channel_info->compr_head,
> +			       sizeof(struct cbd_ce),
> +			       handler->channel_info->compr_size);
> +
> +	cbdc_flush_ctrl(&handler->channel);
> +
> +	return;
> +}
> +
> +static void backend_bio_end(struct bio *bio)
> +{
> +	struct cbd_backend_io *backend_io = bio->bi_private;
> +	struct cbd_se *se = backend_io->se;
> +	struct cbd_handler *handler = backend_io->handler;
> +
> +	if (bio->bi_status == 0 &&
> +	    cbd_se_hdr_get_op(se->header.len_op) == CBD_OP_READ) {
> +		cbdc_copy_from_bio(&handler->channel, se->data_off, se->data_len, bio);
> +	}
> +
> +	complete_cmd(handler, se->priv_data, bio->bi_status);
> +
> +	bio_free_pages(bio);
> +	bio_put(bio);
> +	kfree(backend_io);
> +}
> +
> +static int cbd_bio_alloc_pages(struct bio *bio, size_t size, gfp_t gfp_mask)
> +{
> +	int ret = 0;
> +
> +        while (size) {
> +                struct page *page = alloc_pages(gfp_mask, 0);
> +                unsigned len = min_t(size_t, PAGE_SIZE, size);
> +
> +                if (!page) {
> +			pr_err("failed to alloc page");
> +			ret = -ENOMEM;
> +			break;
> +		}
> +
> +		ret = bio_add_page(bio, page, len, 0);
> +                if (unlikely(ret != len)) {
> +                        __free_page(page);
> +			pr_err("failed to add page");
> +                        break;
> +                }
> +
> +                size -= len;
> +        }
> +
> +	if (size)
> +		bio_free_pages(bio);
> +	else
> +		ret = 0;
> +
> +        return ret;
> +}
> +
> +static struct cbd_backend_io *backend_prepare_io(struct cbd_handler *handler, struct cbd_se *se, blk_opf_t opf)
> +{
> +	struct cbd_backend_io *backend_io;
> +	struct cbd_backend *cbdb = handler->cbdb;
> +
> +	backend_io = kzalloc(sizeof(struct cbd_backend_io), GFP_KERNEL);
> +	backend_io->se = se;
> +
> +	backend_io->handler = handler;
> +	backend_io->bio = bio_alloc_bioset(cbdb->bdev, roundup(se->len, 4096) / 4096, opf, GFP_KERNEL, &handler->bioset);
> +
> +	backend_io->bio->bi_iter.bi_sector = se->offset >> SECTOR_SHIFT;
> +	backend_io->bio->bi_iter.bi_size = 0;
> +	backend_io->bio->bi_private = backend_io;
> +	backend_io->bio->bi_end_io = backend_bio_end;
> +
> +	return backend_io;
> +}
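
The vector count passed to bio_alloc_bioset() above, roundup(se->len, 4096) / 4096, is a ceiling division of the request length by a 4096-byte segment. A userspace sketch of the same arithmetic follows; SEG_SIZE and segs_for_len() are names assumed for the example, and this form avoids the intermediate overflow that rounding up first could hit for lengths near the top of the 32-bit range.

```c
#include <stdint.h>

/* Assumed segment size, matching the literal 4096 used in the patch. */
#define SEG_SIZE 4096u

/* Number of SEG_SIZE segments needed to hold len bytes (ceiling division). */
static uint32_t segs_for_len(uint32_t len)
{
	return len / SEG_SIZE + (len % SEG_SIZE != 0);
}
```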
> +
> +static int handle_backend_cmd(struct cbd_handler *handler, struct cbd_se *se)
> +{
> +	struct cbd_backend *cbdb = handler->cbdb;
> +	u32 len = se->len;
> +	struct cbd_backend_io *backend_io = NULL;
> +	int ret;
> +
> +	if (cbd_se_hdr_flags_test(se, CBD_SE_HDR_DONE)) {
> +		return 0 ;
> +	}
> +
> +	switch (cbd_se_hdr_get_op(se->header.len_op)) {
> +	case CBD_OP_PAD:
> +		cbd_se_hdr_flags_set(se, CBD_SE_HDR_DONE);
> +		return 0;
> +	case CBD_OP_READ:
> +		backend_io = backend_prepare_io(handler, se, REQ_OP_READ);
> +		break;
> +	case CBD_OP_WRITE:
> +		backend_io = backend_prepare_io(handler, se, REQ_OP_WRITE);
> +		break;
> +	case CBD_OP_DISCARD:
> +		ret = blkdev_issue_discard(cbdb->bdev, se->offset >> SECTOR_SHIFT,
> +				se->len, GFP_NOIO);
> +		goto complete_cmd;
> +	case CBD_OP_WRITE_ZEROS:
> +		ret = blkdev_issue_zeroout(cbdb->bdev, se->offset >> SECTOR_SHIFT,
> +				se->len, GFP_NOIO, 0);
> +		goto complete_cmd;
> +	case CBD_OP_FLUSH:
> +		ret = blkdev_issue_flush(cbdb->bdev);
> +		goto complete_cmd;
> +	default:
> +		pr_err("unrecognized op: %x", cbd_se_hdr_get_op(se->header.len_op));
> +		ret = -EIO;
> +		goto complete_cmd;
> +	}
> +
> +	if (!backend_io)
> +		return -ENOMEM;
> +
> +	ret = cbd_bio_alloc_pages(backend_io->bio, len, GFP_NOIO);
> +	if (ret) {
> +		kfree(backend_io);
> +		return ret;
> +	}
> +
> +	if (cbd_se_hdr_get_op(se->header.len_op) == CBD_OP_WRITE) {
> +		cbdc_copy_to_bio(&handler->channel, se->data_off, se->data_len, backend_io->bio);
> +	}
> +
> +	submit_bio(backend_io->bio);
> +
> +	return 0;
> +
> +complete_cmd:
> +	complete_cmd(handler, se->priv_data, ret);
> +	return 0;
> +}
> +
> +static void handle_work_fn(struct work_struct *work)
> +{
> +	struct cbd_handler *handler = container_of(work, struct cbd_handler, handle_work.work);
> +	struct cbd_se *se;
> +	int ret;
> +again:
> +	/* channel ctrl would be updated by blkdev queue */
> +	cbdc_flush_ctrl(&handler->channel);
> +	se = get_se_to_handle(handler);
> +	if (se == get_se_head(handler)) {
> +		if (cbdwc_need_retry(&handler->handle_worker_cfg)) {
> +			goto again;
> +		}
> +
> +		cbdwc_miss(&handler->handle_worker_cfg);
> +
> +		queue_delayed_work(handler->handle_wq, &handler->handle_work, usecs_to_jiffies(0));
> +		return;
> +	}
> +
> +	cbdwc_hit(&handler->handle_worker_cfg);
> +	cbdt_flush_range(handler->cbdb->cbdt, se, sizeof(*se));
> +	ret = handle_backend_cmd(handler, se);
> +	if (!ret) {
> +		/* this se is handled */
> +		handler->se_to_handle = (handler->se_to_handle + cbd_se_hdr_get_len(se->header.len_op)) % handler->channel_info->cmdr_size;
> +	}
> +
> +	goto again;
> +}
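
The wrap-around advance of se_to_handle in handle_work_fn() can be looked at in isolation. The sketch below is a userspace illustration, assuming that offsets and entry lengths are byte counts into a ring of ring_size bytes; ring_advance() is a hypothetical helper and not part of the patch. Widening to 64 bits before the modulo keeps the sum from overflowing when the ring is close to 4 GiB.

```c
#include <stdint.h>

/*
 * Advance a byte offset inside a ring of ring_size bytes by len bytes,
 * wrapping to the start once the end is passed. ring_size must be non-zero.
 */
static uint32_t ring_advance(uint32_t off, uint32_t len, uint32_t ring_size)
{
	return (uint32_t)(((uint64_t)off + len) % ring_size);
}
```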
> +
> +int cbd_handler_create(struct cbd_backend *cbdb, u32 channel_id)
> +{
> +	struct cbd_transport *cbdt = cbdb->cbdt;
> +	struct cbd_handler *handler;
> +	int ret;
> +
> +	handler = kzalloc(sizeof(struct cbd_handler), GFP_KERNEL);
> +	if (!handler) {
> +		return -ENOMEM;
> +	}
> +
> +	handler->cbdb = cbdb;
> +	cbd_channel_init(&handler->channel, cbdt, channel_id);
> +	handler->channel_info = handler->channel.channel_info;
> +
> +	handler->handle_wq = alloc_workqueue("cbdt%u-handler%u",
> +					     WQ_UNBOUND | WQ_MEM_RECLAIM,
> +					     0, cbdt->id, channel_id);
> +	if (!handler->handle_wq) {
> +		ret = -ENOMEM;
> +		goto free_handler;
> +	}
> +
> +	handler->se_to_handle = handler->channel_info->cmd_tail;
> +
> +	INIT_DELAYED_WORK(&handler->handle_work, handle_work_fn);
> +	INIT_LIST_HEAD(&handler->handlers_node);
> +
> +	bioset_init(&handler->bioset, 128, 0, BIOSET_NEED_BVECS);
> +	cbdwc_init(&handler->handle_worker_cfg);
> +
> +	cbdb_add_handler(cbdb, handler);
> +	handler->channel_info->backend_state = cbdc_backend_state_running;
> +
> +	cbdt_flush_range(cbdt, handler->channel_info, sizeof(*handler->channel_info));
> +
> +	queue_delayed_work(handler->handle_wq, &handler->handle_work, 0);
> +
> +	return 0;
> +
> +free_handler:
> +	kfree(handler);
> +	return ret;
> +};
> +
> +void cbd_handler_destroy(struct cbd_handler *handler)
> +{
> +	cbdb_del_handler(handler->cbdb, handler);
> +
> +	cancel_delayed_work_sync(&handler->handle_work);
> +	drain_workqueue(handler->handle_wq);
> +	destroy_workqueue(handler->handle_wq);
> +
> +	handler->channel_info->backend_state = cbdc_backend_state_none;
> +	handler->channel_info->state = cbd_channel_state_none;
> +	cbdt_flush_range(handler->cbdb->cbdt, handler->channel_info, sizeof(*handler->channel_info));
> +
> +	bioset_exit(&handler->bioset);
> +	kfree(handler);
> +}
> diff --git a/drivers/block/cbd/cbd_transport.c b/drivers/block/cbd/cbd_transport.c
> index 682d0f45ce9e..4dd9bf1b5fd5 100644
> --- a/drivers/block/cbd/cbd_transport.c
> +++ b/drivers/block/cbd/cbd_transport.c
> @@ -303,8 +303,14 @@ static ssize_t cbd_adm_store(struct device *dev,
> 
>  	switch (opts.op) {
>  	case CBDT_ADM_OP_B_START:
> +		ret = cbd_backend_start(cbdt, opts.backend.path);
> +		if (ret < 0)
> +			return ret;
>  		break;
>  	case CBDT_ADM_OP_B_STOP:
> +		ret = cbd_backend_stop(cbdt, opts.backend_id);
> +		if (ret < 0)
> +			return ret;
>  		break;
>  	case CBDT_ADM_OP_B_CLEAR:
>  		break;
> --
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: [EXTERNAL] [PATCH 4/7] cbd: introduce cbd_host
  2024-04-22  7:16 ` [PATCH 4/7] cbd: introduce cbd_host Dongsheng Yang
@ 2024-04-25  5:51   ` Bharat Bhushan
  0 siblings, 0 replies; 52+ messages in thread
From: Bharat Bhushan @ 2024-04-25  5:51 UTC (permalink / raw)
  To: Dongsheng Yang, dan.j.williams, axboe
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang



> -----Original Message-----
> From: Dongsheng Yang <dongsheng.yang@easystack.cn>
> Sent: Monday, April 22, 2024 12:46 PM
> To: dan.j.williams@intel.com; axboe@kernel.dk
> Cc: linux-block@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> cxl@vger.kernel.org; Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> Subject: [EXTERNAL] [PATCH 4/7] cbd: introduce cbd_host
> 
> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> 
> The "cbd_host" represents a host node. Each node needs to be registered
> before it can use the "cbd_transport". After registration, the node's
> information, such as its hostname, will be recorded in the "hosts" area of this
> transport. Through this mechanism, we can know which nodes are currently
> using each transport.
> 
> Signed-off-by: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> ---
>  drivers/block/cbd/Makefile        |   2 +-
>  drivers/block/cbd/cbd_host.c      | 123 ++++++++++++++++++++++++++++++
>  drivers/block/cbd/cbd_transport.c |   8 ++
>  3 files changed, 132 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/block/cbd/cbd_host.c
> 
> diff --git a/drivers/block/cbd/Makefile b/drivers/block/cbd/Makefile
> index c581ae96732b..2389a738b12b 100644
> --- a/drivers/block/cbd/Makefile
> +++ b/drivers/block/cbd/Makefile
> @@ -1,3 +1,3 @@
> -cbd-y := cbd_main.o cbd_transport.o cbd_channel.o
> +cbd-y := cbd_main.o cbd_transport.o cbd_channel.o cbd_host.o
> 
>  obj-$(CONFIG_BLK_DEV_CBD) += cbd.o
> diff --git a/drivers/block/cbd/cbd_host.c b/drivers/block/cbd/cbd_host.c
> new file mode 100644
> index 000000000000..892961f5f1b2
> --- /dev/null
> +++ b/drivers/block/cbd/cbd_host.c
> @@ -0,0 +1,123 @@
> +#include "cbd_internal.h"
> +
> +static ssize_t cbd_host_name_show(struct device *dev,
> +			       struct device_attribute *attr,
> +			       char *buf)
> +{
> +	struct cbd_host_device *host;
> +	struct cbd_host_info *host_info;
> +
> +	host = container_of(dev, struct cbd_host_device, dev);
> +	host_info = host->host_info;
> +
> +	cbdt_flush_range(host->cbdt, host_info, sizeof(*host_info));
> +
> +	if (host_info->state == cbd_host_state_none)
> +		return 0;
> +
> +	if (strlen(host_info->hostname) == 0)
> +		return 0;

sprintf() is safe with a zero-length source string, so this check can probably be removed.

> +
> +	return sprintf(buf, "%s\n", host_info->hostname);
> +}
> +
> +static DEVICE_ATTR(hostname, 0400, cbd_host_name_show, NULL);
> +
> +CBD_OBJ_HEARTBEAT(host);
> +
> +static struct attribute *cbd_host_attrs[] = {
> +	&dev_attr_hostname.attr,
> +	&dev_attr_alive.attr,
> +	NULL
> +};
> +
> +static struct attribute_group cbd_host_attr_group = {
> +	.attrs = cbd_host_attrs,
> +};
> +
> +static const struct attribute_group *cbd_host_attr_groups[] = {
> +	&cbd_host_attr_group,
> +	NULL
> +};
> +
> +static void cbd_host_release(struct device *dev)
> +{
> +}
> +
> +struct device_type cbd_host_type = {
> +	.name		= "cbd_host",
> +	.groups		= cbd_host_attr_groups,
> +	.release	= cbd_host_release,
> +};
> +
> +struct device_type cbd_hosts_type = {
> +	.name		= "cbd_hosts",
> +	.release	= cbd_host_release,
> +};
> +
> +int cbd_host_register(struct cbd_transport *cbdt, char *hostname)
> +{
> +	struct cbd_host *host;
> +	struct cbd_host_info *host_info;
> +	u32 host_id;
> +	int ret;
> +
> +	if (cbdt->host) {
> +		return -EEXIST;
> +	}
> +
> +	if (strlen(hostname) == 0) {
> +		return -EINVAL;
> +	}

Unnecessary braces.

Thanks
-Bharat

> +
> +	ret = cbdt_get_empty_host_id(cbdt, &host_id);
> +	if (ret < 0) {
> +		return ret;
> +	}
> +
> +	host = kzalloc(sizeof(struct cbd_host), GFP_KERNEL);
> +	if (!host) {
> +		return -ENOMEM;
> +	}
> +
> +	host->host_id = host_id;
> +	host->cbdt = cbdt;
> +	INIT_DELAYED_WORK(&host->hb_work, host_hb_workfn);
> +
> +	host_info = cbdt_get_host_info(cbdt, host_id);
> +	host_info->state = cbd_host_state_running;
> +	memcpy(host_info->hostname, hostname, CBD_NAME_LEN);
> +
> +	cbdt_flush_range(cbdt, host_info, sizeof(*host_info));
> +
> +	host->host_info = host_info;
> +	cbdt->host = host;
> +
> +	queue_delayed_work(cbd_wq, &host->hb_work, 0);
> +
> +	return 0;
> +}
> +
> +int cbd_host_unregister(struct cbd_transport *cbdt)
> +{
> +	struct cbd_host *host = cbdt->host;
> +	struct cbd_host_info *host_info;
> +
> +	if (!host) {
> +		cbd_err("This host is not registered.");
> +		return 0;
> +	}
> +
> +	cancel_delayed_work_sync(&host->hb_work);
> +	host_info = host->host_info;
> +	memset(host_info->hostname, 0, CBD_NAME_LEN);
> +	host_info->alive_ts = 0;
> +	host_info->state = cbd_host_state_none;
> +
> +	cbdt_flush_range(cbdt, host_info, sizeof(*host_info));
> +
> +	kfree(cbdt->host);
> +	cbdt->host = NULL;
> +
> +	return 0;
> +}
> diff --git a/drivers/block/cbd/cbd_transport.c b/drivers/block/cbd/cbd_transport.c
> index 3a4887afab08..682d0f45ce9e 100644
> --- a/drivers/block/cbd/cbd_transport.c
> +++ b/drivers/block/cbd/cbd_transport.c
> @@ -571,6 +571,7 @@ int cbdt_unregister(u32 tid)
>  	}
>  	mutex_unlock(&cbdt->lock);
> 
> +	cbd_host_unregister(cbdt);
>  	device_unregister(&cbdt->device);
>  	cbdt_dax_release(cbdt);
>  	cbdt_destroy(cbdt);
> @@ -624,8 +625,15 @@ int cbdt_register(struct cbdt_register_options *opts)
>  		goto dax_release;
>  	}
> 
> +	ret = cbd_host_register(cbdt, opts->hostname);
> +	if (ret) {
> +		goto dev_unregister;
> +	}
> +
>  	return 0;
> 
> +devs_exit:
> +	cbd_host_unregister(cbdt);
>  dev_unregister:
>  	device_unregister(&cbdt->device);
>  dax_release:
> --
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-24 15:14     ` Gregory Price
@ 2024-04-26  1:25       ` Dongsheng Yang
  2024-04-26 13:48         ` Gregory Price
  0 siblings, 1 reply; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-26  1:25 UTC (permalink / raw)
  To: Gregory Price
  Cc: Dan Williams, axboe, John Groves, linux-block, linux-kernel,
	linux-cxl, Dongsheng Yang



On Wed, 2024/4/24 at 11:14 PM, Gregory Price wrote:
> On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote:
>>
>>
>> On Wed, 2024/4/24 at 12:29 PM, Dan Williams wrote:
>>> Dongsheng Yang wrote:
>>>> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
>>>>
>>>> Hi all,
>>>> 	This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
>>>> 	https://github.com/DataTravelGuide/linux
>>>>
>>> [..]
>>>> (4) dax is not supported yet:
>>>> 	same with famfs, dax device is not supported here, because dax device does not support
>>>> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
>>>
>>> I am glad that famfs is mentioned here, it demonstrates you know about
>>> it. However, unfortunately this cover letter does not offer any analysis
>>> of *why* the Linux project should consider this additional approach to
>>> the inter-host shared-memory enabling problem.
>>>
>>> To be clear I am neutral at best on some of the initiatives around CXL
>>> memory sharing vs pooling, but famfs at least jettisons block-devices
>>> and gets closer to a purpose-built memory semantic.
>>>
>>> So my primary question is why would Linux need both famfs and cbd? I am
>>> sure famfs would love feedback and help vs developing competing efforts.
>>
>> Hi,
>> 	Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in
>> shared memory, and related nodes can share the data inside this file system;
>> whereas cbd does not store data in shared memory, it uses shared memory as a
>> channel for data transmission, and the actual data is stored in the backend
>> block device of remote nodes. In cbd, shared memory works more like network
>> to connect different hosts.
>>
> 
> Couldn't you basically just allocate a file for use as a uni-directional
> buffer on top of FAMFS and achieve the same thing without the need for
> additional kernel support? Similar in a sense to allocating a file on
> network storage and pinging the remote host when it's ready (except now
> it's fast!)

I'm not entirely sure I follow your suggestion. I guess it means that
cbd would no longer directly manage the pmem device, but would instead
allocate files on famfs to transfer data. I didn't do it this way for a
few reasons. One of them is that cbd_transport actually requires a DAX
device to access shared memory, and cbd has very simple requirements
for space management, so there is no need to rely on a file system
layer, which would increase architectural complexity.

However, we would still need cbd_blkdev to provide a block device, so
it would not "achieve the same thing without the need for additional
kernel support".

Could you please provide more specific details about your suggestion?
> 
> (The point here is not "FAMFS is better" or "CBD is better", simply
> trying to identify the function that will ultimately dictate the form).

Thank you for the clarification. I totally agree; discussion always
makes the issues clearer.

Thanks
> 
> ~Gregory
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-26  1:25       ` Dongsheng Yang
@ 2024-04-26 13:48         ` Gregory Price
  2024-04-26 14:53           ` Dongsheng Yang
  0 siblings, 1 reply; 52+ messages in thread
From: Gregory Price @ 2024-04-26 13:48 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Dan Williams, axboe, John Groves, linux-block, linux-kernel,
	linux-cxl, Dongsheng Yang

On Fri, Apr 26, 2024 at 09:25:53AM +0800, Dongsheng Yang wrote:
> 
> 
> On Wed, 2024/4/24 at 11:14 PM, Gregory Price wrote:
> > On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote:
> > > 
> > > 
> > > On Wed, 2024/4/24 at 12:29 PM, Dan Williams wrote:
> > > > Dongsheng Yang wrote:
> > > > > From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> > > > > 
> > > > > Hi all,
> > > > > 	This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
> > > > > 	https://github.com/DataTravelGuide/linux
> > > > > 
> > > > [..]
> > > > > (4) dax is not supported yet:
> > > > > 	same with famfs, dax device is not supported here, because dax device does not support
> > > > > dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
> > > > 
> > > > I am glad that famfs is mentioned here, it demonstrates you know about
> > > > it. However, unfortunately this cover letter does not offer any analysis
> > > > of *why* the Linux project should consider this additional approach to
> > > > the inter-host shared-memory enabling problem.
> > > > 
> > > > To be clear I am neutral at best on some of the initiatives around CXL
> > > > memory sharing vs pooling, but famfs at least jettisons block-devices
> > > > and gets closer to a purpose-built memory semantic.
> > > > 
> > > > So my primary question is why would Linux need both famfs and cbd? I am
> > > > sure famfs would love feedback and help vs developing competing efforts.
> > > 
> > > Hi,
> > > 	Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in
> > > shared memory, and related nodes can share the data inside this file system;
> > > whereas cbd does not store data in shared memory, it uses shared memory as a
> > > channel for data transmission, and the actual data is stored in the backend
> > > block device of remote nodes. In cbd, shared memory works more like network
> > > to connect different hosts.
> > > 
> > 
> > Couldn't you basically just allocate a file for use as a uni-directional
> > buffer on top of FAMFS and achieve the same thing without the need for
> > additional kernel support? Similar in a sense to allocating a file on
> > network storage and pinging the remote host when it's ready (except now
> > it's fast!)
> 
> I'm not entirely sure I follow your suggestion. I guess it means that cbd
> would no longer directly manage the pmem device, but allocate files on famfs
> to transfer data. I didn't do it this way because I considered at least a
> few points: one of them is, cbd_transport actually requires a DAX device to
> access shared memory, and cbd has very simple requirements for space
> management, so there's no need to rely on a file system layer, which would
> increase architectural complexity.
> 
> However, we still need cbd_blkdev to provide a block device, so it doesn't
> achieve "achieve the same without the need for additional kernel support".
> 
> Could you please provide more specific details about your suggestion?

Fundamentally you're shuffling bits from one place to another, the
ultimate target is storage located on another device as opposed to
the memory itself.  So you're using CXL as a transport medium.

Could you not do the same thing with a file in FAMFS, and put all of
the transport logic in userland? Then you'd just have what looks like
a kernel bypass transport mechanism built on top of a file backed by
shared memory.

Basically it's unclear to me why this must be done in the kernel.
Performance? Explicit bypass? Some technical reason I'm missing?


Also, on a tangential note, you're using pmem/qemu to emulate the
behavior of shared CXL memory.  You should probably explain the
coherence implications of the system more explicitly.

The emulated system implements what amounts to hardware-coherent
memory (i.e. the two QEMU machines run on the same physical machine,
so coherency is managed within the same coherence domain).

If there is no explicit coherence control in software, then it is
important to state that this system relies on hardware that implements
snoop back-invalidate (which is not a requirement of a CXL 3.x device,
just a feature described by the spec that may be implemented).

~Gregory


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-26 13:48         ` Gregory Price
@ 2024-04-26 14:53           ` Dongsheng Yang
  2024-04-26 16:14             ` Gregory Price
  0 siblings, 1 reply; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-26 14:53 UTC (permalink / raw)
  To: Gregory Price
  Cc: Dan Williams, axboe, John Groves, linux-block, linux-kernel, linux-cxl



On Fri, 2024/4/26 at 9:48 PM, Gregory Price wrote:
> On Fri, Apr 26, 2024 at 09:25:53AM +0800, Dongsheng Yang wrote:
>>
>>
>> On Wed, 2024/4/24 at 11:14 PM, Gregory Price wrote:
>>> On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote:
>>>>
>>>>
>>>> On Wed, 2024/4/24 at 12:29 PM, Dan Williams wrote:
>>>>> Dongsheng Yang wrote:
>>>>>> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
>>>>>>
>>>>>> Hi all,
>>>>>> 	This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
>>>>>> 	https://github.com/DataTravelGuide/linux
>>>>>>
>>>>> [..]
>>>>>> (4) dax is not supported yet:
>>>>>> 	same with famfs, dax device is not supported here, because dax device does not support
>>>>>> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
>>>>>
>>>>> I am glad that famfs is mentioned here, it demonstrates you know about
>>>>> it. However, unfortunately this cover letter does not offer any analysis
>>>>> of *why* the Linux project should consider this additional approach to
>>>>> the inter-host shared-memory enabling problem.
>>>>>
>>>>> To be clear I am neutral at best on some of the initiatives around CXL
>>>>> memory sharing vs pooling, but famfs at least jettisons block-devices
>>>>> and gets closer to a purpose-built memory semantic.
>>>>>
>>>>> So my primary question is why would Linux need both famfs and cbd? I am
>>>>> sure famfs would love feedback and help vs developing competing efforts.
>>>>
>>>> Hi,
>>>> 	Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in
>>>> shared memory, and related nodes can share the data inside this file system;
>>>> whereas cbd does not store data in shared memory, it uses shared memory as a
>>>> channel for data transmission, and the actual data is stored in the backend
>>>> block device of remote nodes. In cbd, shared memory works more like network
>>>> to connect different hosts.
>>>>
>>>
>>> Couldn't you basically just allocate a file for use as a uni-directional
>>> buffer on top of FAMFS and achieve the same thing without the need for
>>> additional kernel support? Similar in a sense to allocating a file on
>>> network storage and pinging the remote host when it's ready (except now
>>> it's fast!)
>>
>> I'm not entirely sure I follow your suggestion. I guess it means that cbd
>> would no longer directly manage the pmem device, but allocate files on famfs
>> to transfer data. I didn't do it this way because I considered at least a
>> few points: one of them is, cbd_transport actually requires a DAX device to
>> access shared memory, and cbd has very simple requirements for space
>> management, so there's no need to rely on a file system layer, which would
>> increase architectural complexity.
>>
>> However, we still need cbd_blkdev to provide a block device, so it doesn't
>> achieve "achieve the same without the need for additional kernel support".
>>
>> Could you please provide more specific details about your suggestion?
> 
> Fundamentally you're shuffling bits from one place to another, the
> ultimate target is storage located on another device as opposed to
> the memory itself.  So you're using CXL as a transport medium.
> 
> Could you not do the same thing with a file in FAMFS, and put all of
> the transport logic in userland? Then you'd just have what looks like
> a kernel bypass transport mechanism built on top of a file backed by
> shared memory.
> 
> Basically it's unclear to me why this must be done in the kernel.
> Performance? Explicit bypass? Some technical reason I'm missing?


In user space, transferring data via FAMFS files poses no problem, but 
how do we present this data to users? We cannot expect users to revamp 
all their business I/O methods.

For example, suppose a user needs to run a database on a compute node. 
As the cloud infrastructure department, we need to allocate block 
storage on the storage node and provide it to the database on the 
compute node through some transport protocol (such as iSCSI, 
NVMe over Fabrics, or our current solution, cbd). Users can then create 
any file system they like on the block device and run the database on 
it. We aim to improve the performance of this block device with cbd, 
rather than requiring the business department to adapt their database to 
a shared-memory interface to the storage node's disks.

This is why we need to provide users with a block device. If it were 
only about data transmission, we wouldn't need a block device. But when 
it comes to actually running business operations, we need a block 
storage interface for the upper layer. Additionally, the block device 
layer offers many other rich features, such as RAID.

If accessing shared memory in user space is mandatory, there's another 
option: using user space block storage technologies like ublk. However, 
this would lead to performance issues as data would need to traverse 
back to the kernel space block device from the user space process.

In summary, we need a block device sharing mechanism, similar to what is 
provided by NBD, iSCSI, or NVMe over Fabrics, because user businesses 
rely on the block device interface and ecosystem.
> 
> 
> Also, on a tangential note, you're using pmem/qemu to emulate the
> behavior of shared CXL memory.  You should probably explain the
> coherence implications of the system more explicitly.
> 
> The emulated system implements what amounts to hardware-coherent
> memory (i.e. the two QEMU machines run on the same physical machine,
> so coherency is managed within the same coherence domain).
> 
> If there is no explicit coherence control in software, then it is
> important to state that this system relies on hardware that implements
> snoop back-invalidate (which is not a requirement of a CXL 3.x device,
> just a feature described by the spec that may be implemented).

In (5) of the cover letter, I mentioned that cbd addresses cache 
coherence at the software level:

(5) How do blkdev and backend interact through the channel?
	a) For the reader side: if the data in this channel may be modified by 
the other party, then I need to flush the cache before reading to ensure 
that I get the latest data. For example, the blkdev needs to flush the 
cache before obtaining compr_head because compr_head will be updated by 
the backend handler.
	b) For the writer side: if the written information will be read by 
others, then after writing I need to flush the cache to let the other 
party see it immediately. For example, after blkdev submits a cbd_se, it 
needs to update cmd_head so that the handler sees the new cbd_se. 
Therefore, after updating cmd_head, I need to flush the cache to let the 
backend see it.


This part of the code is indeed implemented. However, as you pointed 
out, because I am currently using qemu/pmem for emulation, its effects 
cannot be observed.
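
For illustration only, the flush discipline in (5) can be sketched roughly
as below. This is not the code from the patchset: the struct layout and
the helper names (chan_flush, chan_publish_cmd_head, chan_read_compr_head)
are assumptions, only cmd_head and compr_head come from the cover letter,
and the flush hook is a counting stub standing in for whatever
cache-maintenance primitive the platform provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical control block; only the two field names come from the cover letter. */
struct chan_ctrl {
	uint32_t cmd_head;   /* advanced by blkdev, consumed by the backend handler */
	uint32_t compr_head; /* advanced by the backend handler, consumed by blkdev */
};

/* Counts flush requests so the call pattern can be checked. */
static unsigned int chan_flush_calls;

/* Stub standing in for a real write-back/invalidate of [addr, addr + len). */
static void chan_flush(const volatile void *addr, size_t len)
{
	assert(addr != NULL && len > 0);
	chan_flush_calls++;
}

/* Writer side (5b): store first, then flush so the store leaves the local cache. */
static void chan_publish_cmd_head(volatile struct chan_ctrl *c, uint32_t new_head)
{
	c->cmd_head = new_head;
	chan_flush(&c->cmd_head, sizeof(c->cmd_head));
}

/* Reader side (5a): flush first so a stale cached copy is dropped, then load. */
static uint32_t chan_read_compr_head(volatile struct chan_ctrl *c)
{
	chan_flush(&c->compr_head, sizeof(c->compr_head));
	return c->compr_head;
}
```

The sketch only shows the order of store, flush and load on each side. By
itself it says nothing about when the other host actually observes the
update.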

Thanx
> 
> ~Gregory
> .
> 


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-26 14:53           ` Dongsheng Yang
@ 2024-04-26 16:14             ` Gregory Price
  2024-04-28  5:47               ` Dongsheng Yang
  0 siblings, 1 reply; 52+ messages in thread
From: Gregory Price @ 2024-04-26 16:14 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Dan Williams, axboe, John Groves, linux-block, linux-kernel, linux-cxl

On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> 
> 
> On Fri, 2024/4/26 at 9:48 PM, Gregory Price wrote:
> > 
> > Also, on a tangential note, you're using pmem/qemu to emulate the
> > behavior of shared CXL memory.  You should probably explain the
> > coherence implications of the system more explicitly.
> > 
> > The emulated system implements what amounts to hardware-coherent
> > memory (i.e. the two QEMU machines run on the same physical machine,
> > so coherency is managed within the same coherence domain).
> > 
> > If there is no explicit coherence control in software, then it is
> > important to state that this system relies on hardware that implements
> > snoop back-invalidate (which is not a requirement of a CXL 3.x device,
> > just a feature described by the spec that may be implemented).
> 
> In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> at the software level:
> 
> (5) How do blkdev and backend interact through the channel?
> 	a) For reader side, before reading the data, if the data in this channel
> may be modified by the other party, then I need to flush the cache before
> reading to ensure that I get the latest data. For example, the blkdev needs
> to flush the cache before obtaining compr_head because compr_head will be
> updated by the backend handler.
> 	b) For writter side, if the written information will be read by others,
> then after writing, I need to flush the cache to let the other party see it
> immediately. For example, after blkdev submits cbd_se, it needs to update
> cmd_head to let the handler have a new cbd_se. Therefore, after updating
> cmd_head, I need to flush the cache to let the backend see it.
> 

Flushing the cache is insufficient.  All that cache flushing guarantees
is that the memory has left the writer's CPU cache.  There are potentially
many write buffers between the CPU and the actual backing media that the
CPU has no visibility of and cannot pierce through to force a full
guaranteed flush back to the media.

for example:

memcpy(some_cacheline, data, 64);
mfence();

Will not guarantee that the remote host has visibility of the data once
mfence() completes.  mfence() does not guarantee a full flush back down
to the device, it only guarantees it has been pushed out of the CPU's
cache.

similarly:

memcpy(some_cacheline, data, 64);
mfence();
memcpy(some_other_cacheline, data, 64);
mfence();

Will not guarantee that some_cacheline reaches the backing media prior
to some_other_cacheline, as there is no guarantee of write-ordering in
CXL controllers (with the exception of writes to the same cacheline).

So this statement:

> I need to flush the cache to let the other party see it immediately.

Is misleading.  They will not see it "immediately", they will see it
"eventually at some completely unknowable time in the future".

~Gregory


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-26 16:14             ` Gregory Price
@ 2024-04-28  5:47               ` Dongsheng Yang
  2024-04-28 16:44                 ` Gregory Price
                                   ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Dongsheng Yang @ 2024-04-28  5:47 UTC (permalink / raw)
  To: Gregory Price, Dan Williams, John Groves
  Cc: axboe, linux-block, linux-kernel, linux-cxl, nvdimm



On Sat, 2024/4/27 at 12:14 AM, Gregory Price wrote:
> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
>>
>>
>> On Fri, 2024/4/26 at 9:48 PM, Gregory Price wrote:
>>>
>>
>> In (5) of the cover letter, I mentioned that cbd addresses cache coherence
>> at the software level:
>>
>> (5) How do blkdev and backend interact through the channel?
>> 	a) For reader side, before reading the data, if the data in this channel
>> may be modified by the other party, then I need to flush the cache before
>> reading to ensure that I get the latest data. For example, the blkdev needs
>> to flush the cache before obtaining compr_head because compr_head will be
>> updated by the backend handler.
>> 	b) For writter side, if the written information will be read by others,
>> then after writing, I need to flush the cache to let the other party see it
>> immediately. For example, after blkdev submits cbd_se, it needs to update
>> cmd_head to let the handler have a new cbd_se. Therefore, after updating
>> cmd_head, I need to flush the cache to let the backend see it.
>>
> 
> Flushing the cache is insufficient.  All that cache flushing guarantees
> is that the memory has left the writer's CPU cache.  There are potentially
> many write buffers between the CPU and the actual backing media that the
> CPU has no visibility of and cannot pierce through to force a full
> guaranteed flush back to the media.
> 
> for example:
> 
> memcpy(some_cacheline, data, 64);
> mfence();
> 
> Will not guarantee that after mfence() completes that the remote host
> will have visibility of the data.  mfence() does not guarantee a full
> flush back down to the device, it only guarantees it has been pushed out
> of the CPU's cache.
> 
> similarly:
> 
> memcpy(some_cacheline, data, 64);
> mfence();
> memcpy(some_other_cacheline, data, 64);
> mfence()
> 
> Will not guarantee that some_cacheline reaches the backing media prior
> to some_other_cacheline, as there is no guarantee of write-ordering in
> CXL controllers (with the exception of writes to the same cacheline).
> 
> So this statement:
> 
>> I need to flush the cache to let the other party see it immediately.
> 
> Is misleading.  They will not see is "immediately", they will see it
> "eventually at some completely unknowable time in the future".

This is indeed one of the issues I wanted to discuss at the RFC stage. 
Thank you for pointing it out.

In my opinion, using "nvdimm_flush" might be one way to address this 
issue, but it seems to flush the entire nd_region, which might be too 
heavy. Moreover, it only applies to non-volatile memory.

This should be a general problem for cxl shared memory. In theory, FAMFS 
should also encounter this issue.

Gregory, John, and Dan, any suggestions about this?

Thanx a lot
> 
> ~Gregory
> 


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-28  5:47               ` Dongsheng Yang
@ 2024-04-28 16:44                 ` Gregory Price
  2024-04-28 16:55                 ` John Groves
  2024-04-30  0:34                 ` Dan Williams
  2 siblings, 0 replies; 52+ messages in thread
From: Gregory Price @ 2024-04-28 16:44 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Dan Williams, John Groves, axboe, linux-block, linux-kernel,
	linux-cxl, nvdimm

On Sun, Apr 28, 2024 at 01:47:29PM +0800, Dongsheng Yang wrote:
> 
> 
> On Sat, 2024/4/27 at 12:14 AM, Gregory Price wrote:
> > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> > > 
> > > 
> > > On Fri, 2024/4/26 at 9:48 PM, Gregory Price wrote:
> > > > 
> > > 
> > > In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> > > at the software level:
> > > 
> > > (5) How do blkdev and backend interact through the channel?
> > > 	a) For reader side, before reading the data, if the data in this channel
> > > may be modified by the other party, then I need to flush the cache before
> > > reading to ensure that I get the latest data. For example, the blkdev needs
> > > to flush the cache before obtaining compr_head because compr_head will be
> > > updated by the backend handler.
> > > 	b) For writter side, if the written information will be read by others,
> > > then after writing, I need to flush the cache to let the other party see it
> > > immediately. For example, after blkdev submits cbd_se, it needs to update
> > > cmd_head to let the handler have a new cbd_se. Therefore, after updating
> > > cmd_head, I need to flush the cache to let the backend see it.
> > > 
> > 
> > Flushing the cache is insufficient.  All that cache flushing guarantees
> > is that the memory has left the writer's CPU cache.  There are potentially
> > many write buffers between the CPU and the actual backing media that the
> > CPU has no visibility of and cannot pierce through to force a full
> > guaranteed flush back to the media.
> > 
> > for example:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > 
> > Will not guarantee that after mfence() completes that the remote host
> > will have visibility of the data.  mfence() does not guarantee a full
> > flush back down to the device, it only guarantees it has been pushed out
> > of the CPU's cache.
> > 
> > similarly:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > memcpy(some_other_cacheline, data, 64);
> > mfence()
> > 

Just a derp here: I meant to add an explicit clflush(some_cacheline)
between the copy and the mfence.  But the result is the same.
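
For reference, the sequence with the explicit flush added would look
roughly like the sketch below. It assumes x86-64 and the compiler
intrinsics _mm_clflush() and _mm_mfence() from <emmintrin.h>;
publish_line, flush_line and full_fence are made-up names, and on other
architectures the sketch falls back to a plain full barrier with no
flush at all.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h>
static inline void flush_line(const void *p) { _mm_clflush(p); }
static inline void full_fence(void) { _mm_mfence(); }
#else
/* Assumption: no portable flush is available here; only a full barrier is issued. */
static inline void flush_line(const void *p) { (void)p; }
static inline void full_fence(void) { __sync_synchronize(); }
#endif

#define CACHELINE 64

/* Copy one cacheline, flush it from the local CPU cache, then fence. */
static void publish_line(void *some_cacheline, const void *data)
{
	/* The copy must not straddle two lines, since only one is flushed. */
	assert(((uintptr_t)some_cacheline % CACHELINE) == 0);
	memcpy(some_cacheline, data, CACHELINE);
	flush_line(some_cacheline);
	full_fence();
}
```

Even in this form the flush only pushes the line out of the local CPU
cache. Nothing in the sequence confirms that the data has reached the
device, or that two such calls arrive there in program order.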

> > Will not guarantee that some_cacheline reaches the backing media prior
> > to some_other_cacheline, as there is no guarantee of write-ordering in
> > CXL controllers (with the exception of writes to the same cacheline).
> > 
> > So this statement:
> > 
> > > I need to flush the cache to let the other party see it immediately.
> > 
> > Is misleading.  They will not see is "immediately", they will see it
> > "eventually at some completely unknowable time in the future".
> 
> This is indeed one of the issues I wanted to discuss at the RFC stage. Thank
> you for pointing it out.
> 
> In my opinion, using "nvdimm_flush" might be one way to address this issue,
> but it seems to flush the entire nd_region, which might be too heavy.
> Moreover, it only applies to non-volatile memory.
> 

The problem is that the coherence domain really ends at the root
complex, and from the perspective of any one host the data is coherent.

Flushing only guarantees it gets pushed out from that domain, but does
not guarantee anything south of it.

Flushing semantics that don't puncture through the root complex won't
help.

>
> This should be a general problem for cxl shared memory. In theory, FAMFS
> should also encounter this issue.
> 
> Gregory, John, and Dan, Any suggestion about it?
> 
> Thanx a lot
> > 
> > ~Gregory
> > 


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-28  5:47               ` Dongsheng Yang
  2024-04-28 16:44                 ` Gregory Price
@ 2024-04-28 16:55                 ` John Groves
  2024-05-03  9:52                   ` Jonathan Cameron
  2024-04-30  0:34                 ` Dan Williams
  2 siblings, 1 reply; 52+ messages in thread
From: John Groves @ 2024-04-28 16:55 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Gregory Price, Dan Williams, axboe, linux-block, linux-kernel,
	linux-cxl, nvdimm

On 24/04/28 01:47PM, Dongsheng Yang wrote:
> 
> 
> On Sat, 2024/4/27 at 12:14 AM, Gregory Price wrote:
> > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> > > 
> > > 
> > > On Fri, 2024/4/26 at 9:48 PM, Gregory Price wrote:
> > > > 
> > > 
> > > In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> > > at the software level:
> > > 
> > > (5) How do blkdev and backend interact through the channel?
> > > 	a) For reader side, before reading the data, if the data in this channel
> > > may be modified by the other party, then I need to flush the cache before
> > > reading to ensure that I get the latest data. For example, the blkdev needs
> > > to flush the cache before obtaining compr_head because compr_head will be
> > > updated by the backend handler.
> > > 	b) For writter side, if the written information will be read by others,
> > > then after writing, I need to flush the cache to let the other party see it
> > > immediately. For example, after blkdev submits cbd_se, it needs to update
> > > cmd_head to let the handler have a new cbd_se. Therefore, after updating
> > > cmd_head, I need to flush the cache to let the backend see it.
> > > 
> > 
> > Flushing the cache is insufficient.  All that cache flushing guarantees
> > is that the memory has left the writer's CPU cache.  There are potentially
> > many write buffers between the CPU and the actual backing media that the
> > CPU has no visibility of and cannot pierce through to force a full
> > guaranteed flush back to the media.
> > 
> > for example:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > 
> > Will not guarantee that after mfence() completes that the remote host
> > will have visibility of the data.  mfence() does not guarantee a full
> > flush back down to the device, it only guarantees it has been pushed out
> > of the CPU's cache.
> > 
> > similarly:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > memcpy(some_other_cacheline, data, 64);
> > mfence()
> > 
> > Will not guarantee that some_cacheline reaches the backing media prior
> > to some_other_cacheline, as there is no guarantee of write-ordering in
> > CXL controllers (with the exception of writes to the same cacheline).
> > 
> > So this statement:
> > 
> > > I need to flush the cache to let the other party see it immediately.
> > 
> > Is misleading.  They will not see is "immediately", they will see it
> > "eventually at some completely unknowable time in the future".
> 
> This is indeed one of the issues I wanted to discuss at the RFC stage. Thank
> you for pointing it out.
> 
> In my opinion, using "nvdimm_flush" might be one way to address this issue,
> but it seems to flush the entire nd_region, which might be too heavy.
> Moreover, it only applies to non-volatile memory.
> 
> This should be a general problem for cxl shared memory. In theory, FAMFS
> should also encounter this issue.
> 
> Gregory, John, and Dan, Any suggestion about it?
> 
> Thanx a lot
> > 
> > ~Gregory
> > 

Hi Dongsheng,

Gregory is right about the uncertainty around "clflush" operations, but
let me drill in a bit further.

Say you copy a payload into a "bucket" in a queue and then update an
index in a metadata structure; I'm thinking of the standard producer/
consumer queuing model here, with one index mutated by the producer and
the other mutated by the consumer. 

(I have not reviewed your queueing code, but you *must* be using this
model - things like linked-lists won't work in shared memory without 
shared locks/atomics.)

Normal logic says that you should clflush the payload before updating
the index, then update and clflush the index.

But we still observe in non-cache-coherent shared memory that the payload 
may become valid *after* the clflush of the queue index.
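
As a rough illustration of the producer/consumer model and the "normal
logic" ordering described above (this is not a review of the CBD queue
code): one index is written only by the producer, the other only by the
consumer. Every name below is made up for the sketch, and shm_flush is a
no-op placeholder for clflush-style cache maintenance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBUCKETS  8u   /* power of two, so index wrap-around stays consistent */
#define BUCKET_SZ 64u

struct spsc_queue {
	uint32_t producer_idx; /* written only by the producer */
	uint32_t consumer_idx; /* written only by the consumer */
	unsigned char bucket[NBUCKETS][BUCKET_SZ];
};

/* Placeholder: a real version would write back/invalidate these lines. */
static void shm_flush(const void *addr, size_t len) { (void)addr; (void)len; }

/* Returns 0 on success, -1 if the queue is full. */
static int spsc_put(struct spsc_queue *q, const void *payload)
{
	assert(q != NULL && payload != NULL);
	shm_flush(&q->consumer_idx, sizeof(q->consumer_idx));
	if (q->producer_idx - q->consumer_idx == NBUCKETS)
		return -1;
	unsigned char *slot = q->bucket[q->producer_idx % NBUCKETS];
	memcpy(slot, payload, BUCKET_SZ);
	shm_flush(slot, BUCKET_SZ);                           /* 1. flush the payload */
	q->producer_idx++;                                    /* 2. update the index  */
	shm_flush(&q->producer_idx, sizeof(q->producer_idx)); /* 3. flush the index   */
	return 0;
}

/* Returns 0 on success, -1 if the queue is empty. */
static int spsc_get(struct spsc_queue *q, void *payload)
{
	assert(q != NULL && payload != NULL);
	shm_flush(&q->producer_idx, sizeof(q->producer_idx));
	if (q->producer_idx == q->consumer_idx)
		return -1;
	unsigned char *slot = q->bucket[q->consumer_idx % NBUCKETS];
	shm_flush(slot, BUCKET_SZ); /* drop any stale cached copy before reading */
	memcpy(payload, slot, BUCKET_SZ);
	q->consumer_idx++;
	shm_flush(&q->consumer_idx, sizeof(q->consumer_idx));
	return 0;
}
```

The sketch uses plain loads and stores and ignores compiler reordering; a
kernel version would presumably need READ_ONCE()/WRITE_ONCE()-style
accesses and barriers as well. It also does nothing about the reordering
described above, where the payload becomes valid after the index.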

The famfs user space has a program called pcq.c, which implements a
producer/consumer queue in a pair of famfs files. The only way to 
currently guarantee a valid read of a payload is to use sequence numbers 
and checksums on payloads.  We do observe mismatches with actual shared 
memory, and the recovery is to clflush and re-read the payload from the 
client side. (Aside: These file pairs theoretically might work for CBD 
queues.)
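
The sequence-number-plus-checksum check can be sketched as below. This is
not the pcq.c code: the record layout, the toy checksum (a real design
would more likely use a CRC), the retry bound and all names are
assumptions made for illustration, and invalidate_range is a no-op
placeholder for the clflush step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_SZ 48u

struct pcq_record {
	uint64_t seq;  /* expected to increase by one per message */
	uint32_t csum; /* checksum over seq and payload */
	unsigned char payload[PAYLOAD_SZ];
};

/* Placeholder for clflush over [addr, addr + len) on the reading host. */
static void invalidate_range(const void *addr, size_t len) { (void)addr; (void)len; }

/* Toy checksum, for the sketch only. */
static uint32_t record_csum(uint64_t seq, const unsigned char *p, size_t len)
{
	uint32_t sum = (uint32_t)seq ^ (uint32_t)(seq >> 32);

	for (size_t i = 0; i < len; i++)
		sum = sum * 31u + p[i];
	return sum;
}

/* Producer side: fill in payload, sequence number and checksum. */
static void record_fill(struct pcq_record *r, uint64_t seq, const void *data)
{
	memcpy(r->payload, data, PAYLOAD_SZ);
	r->seq = seq;
	r->csum = record_csum(seq, r->payload, PAYLOAD_SZ);
}

/*
 * Consumer side: returns 0 and copies the payload out once the sequence
 * number and checksum both match; returns -1 if the record still looks
 * stale or torn after max_tries reads.
 */
static int record_read(const struct pcq_record *shared, uint64_t want_seq,
		       void *out, int max_tries)
{
	assert(shared != NULL && out != NULL && max_tries > 0);
	for (int i = 0; i < max_tries; i++) {
		struct pcq_record snap;

		invalidate_range(shared, sizeof(*shared)); /* flush, then re-read */
		memcpy(&snap, shared, sizeof(snap));
		if (snap.seq == want_seq &&
		    snap.csum == record_csum(snap.seq, snap.payload, PAYLOAD_SZ)) {
			memcpy(out, snap.payload, PAYLOAD_SZ);
			return 0;
		}
	}
	return -1;
}
```

A caller would typically treat -1 as "not ready yet" and poll again later
rather than as a hard error, since a mismatch may simply mean the payload
has not become visible yet.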

Another side note: it would be super-helpful if the CPU gave us an explicit 
invalidate rather than just clflush, which will write-back before 
invalidating *if* the cache line is marked as dirty, even when software
knows this should not happen.

Note that CXL 3.1 provides a way to guarantee that stuff that should not
be written back can't be written back: read-only mappings. This is one of
the features I got into the spec; using this requires CXL 3.1 DCD, and 
would require two DCD allocations (i.e. two tagged-capacity dax devices - 
one writable by the server and one by the client).

Just to make things slightly gnarlier, the MESI cache coherency protocol
allows a CPU to speculatively convert a line from exclusive to modified,
meaning it's not clear as of now whether "occasional" clean write-backs
can be avoided. Meaning those read-only mappings may be more important
than one might think. (Clean write-backs basically make it
impossible for software to manage cache coherency.)

Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and 
shared memory is not explicitly legal in cxl 2, so there are things a cpu 
could do (or not do) in a cxl 2 environment that are not illegal because 
they should not be observable in a no-shared-memory environment.

CBD is interesting work, though for some of the reasons above I'm somewhat
skeptical of shared memory as an IPC mechanism.

Regards,
John




* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
       [not found]       ` <539c1323-68f9-d753-a102-692b69049c20@easystack.cn>
@ 2024-04-30  0:10         ` Dan Williams
  0 siblings, 0 replies; 52+ messages in thread
From: Dan Williams @ 2024-04-30  0:10 UTC (permalink / raw)
  To: Dongsheng Yang, Dan Williams, axboe, John Groves
  Cc: linux-block, linux-kernel, linux-cxl, Dongsheng Yang

Dongsheng Yang wrote:
> 
> 
> On Thu, 2024/4/25 at 2:08 AM, Dan Williams wrote:
> > Dongsheng Yang wrote:
> >>
> >>
> >> On Wed, 2024/4/24 at 12:29 PM, Dan Williams wrote:
> >>> Dongsheng Yang wrote:
> >>>> From: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
> >>>>
> >>>> Hi all,
> >>>> 	This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
> >>>> 	https://github.com/DataTravelGuide/linux
> >>>>
> >>> [..]
> >>>> (4) dax is not supported yet:
> >>>> 	same with famfs, dax device is not supported here, because dax device does not support
> >>>> dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.
> >>>
> >>> I am glad that famfs is mentioned here, it demonstrates you know about
> >>> it. However, unfortunately this cover letter does not offer any analysis
> >>> of *why* the Linux project should consider this additional approach to
> >>> the inter-host shared-memory enabling problem.
> >>>
> >>> To be clear I am neutral at best on some of the initiatives around CXL
> >>> memory sharing vs pooling, but famfs at least jettisons block-devices
> >>> and gets closer to a purpose-built memory semantic.
> >>>
> >>> So my primary question is why would Linux need both famfs and cbd? I am
> >>> sure famfs would love feedback and help vs developing competing efforts.
> >>
> >> Hi,
> >> 	Thanks for your reply, IIUC about FAMfs, the data in famfs is stored in
> >> shared memory, and related nodes can share the data inside this file
> >> system; whereas cbd does not store data in shared memory, it uses shared
> >> memory as a channel for data transmission, and the actual data is stored
> >> in the backend block device of remote nodes. In cbd, shared memory works
> >> more like network to connect different hosts.
> >>
> >> That is to say, in my view, FAMfs and cbd do not conflict at all; they
> >> meet different scenario requirements. cbd simply uses shared memory to
> >> transmit data, shared memory plays the role of a data transmission
> >> channel, while in FAMfs, shared memory serves as a data store role.
> > 
> > If shared memory is just a communication transport then a block-device
> > abstraction does not seem a proper fit. From the above description this
> > sounds similar to what CONFIG_NTB_TRANSPORT offers which is a way for
> > two hosts to communicate over a shared memory channel.
> > 
> > So, I am not really looking for an analysis of famfs vs CBD I am looking
> > for CBD to clarify why Linux should consider it, and why the
> > architecture is fit for purpose.
> 
> Let me explain why we need cbd:
> 
> In cloud storage scenarios, we often need to expose block devices of 
> storage nodes to compute nodes. We have options like nbd, iscsi, nvmeof, 
> etc., but these all communicate over the network. cbd aims to address 
> the same scenario but using shared memory for data transfer instead of 
> the network, aiming for better performance and reduced network latency.
> 
> Furthermore, shared memory can not only transfer data but also implement 
> features like write-ahead logging (WAL) or read/write cache, further 
> improving performance, especially latency-sensitive business scenarios. 
> (If I understand correctly, this might not be achievable with the 
> previously mentioned ntb.)
> 
> To ensure we have a common understanding, I'd like to clarify one point: 
> the /dev/cbdX block device is not an abstraction of shared memory; it is 
> a mapping of a block device (such as /dev/sda) on the remote host. 
> Reading/writing to /dev/cbdX is equivalent to reading/writing to 
> /dev/sda on the remote host.
> 
> This is the design intention of cbd. I hope this clarifies things.

It does, thanks for the clarification. Let me go back and take another
look now that I understand that this is a "remote storage target over CXL
memory" solution.


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-28  5:47               ` Dongsheng Yang
  2024-04-28 16:44                 ` Gregory Price
  2024-04-28 16:55                 ` John Groves
@ 2024-04-30  0:34                 ` Dan Williams
  2 siblings, 0 replies; 52+ messages in thread
From: Dan Williams @ 2024-04-30  0:34 UTC (permalink / raw)
  To: Dongsheng Yang, Gregory Price, Dan Williams, John Groves
  Cc: axboe, linux-block, linux-kernel, linux-cxl, nvdimm

Dongsheng Yang wrote:
> 
> 
> On 2024/4/27 (Sat) 12:14 AM, Gregory Price wrote:
> > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> >>
> >>
> >> On 2024/4/26 (Fri) 9:48 PM, Gregory Price wrote:
> >>>
> >>
> >> In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> >> at the software level:
> >>
> >> (5) How do blkdev and backend interact through the channel?
> >> 	a) For reader side, before reading the data, if the data in this channel
> >> may be modified by the other party, then I need to flush the cache before
> >> reading to ensure that I get the latest data. For example, the blkdev needs
> >> to flush the cache before obtaining compr_head because compr_head will be
> >> updated by the backend handler.
> >> 	b) For writer side, if the written information will be read by others,
> >> then after writing, I need to flush the cache to let the other party see it
> >> immediately. For example, after blkdev submits cbd_se, it needs to update
> >> cmd_head to let the handler have a new cbd_se. Therefore, after updating
> >> cmd_head, I need to flush the cache to let the backend see it.
> >>
> > 
> > Flushing the cache is insufficient.  All that cache flushing guarantees
> > is that the memory has left the writer's CPU cache.  There are potentially
> > many write buffers between the CPU and the actual backing media that the
> > CPU has no visibility of and cannot pierce through to force a full
> > guaranteed flush back to the media.
> > 
> > for example:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > 
> > Will not guarantee that after mfence() completes that the remote host
> > will have visibility of the data.  mfence() does not guarantee a full
> > flush back down to the device, it only guarantees it has been pushed out
> > of the CPU's cache.
> > 
> > similarly:
> > 
> > memcpy(some_cacheline, data, 64);
> > mfence();
> > memcpy(some_other_cacheline, data, 64);
> > mfence();
> > 
> > Will not guarantee that some_cacheline reaches the backing media prior
> > to some_other_cacheline, as there is no guarantee of write-ordering in
> > CXL controllers (with the exception of writes to the same cacheline).
> > 
> > So this statement:
> > 
> >> I need to flush the cache to let the other party see it immediately.
> > 
> > Is misleading.  They will not see it "immediately", they will see it
> > "eventually at some completely unknowable time in the future".
> 
> This is indeed one of the issues I wanted to discuss at the RFC stage. 
> Thank you for pointing it out.
> 
> In my opinion, using "nvdimm_flush" might be one way to address this 
> issue, but it seems to flush the entire nd_region, which might be too 
> heavy. Moreover, it only applies to non-volatile memory.
> 
> This should be a general problem for cxl shared memory. In theory, FAMFS 
> should also encounter this issue.
> 
> Gregory, John, and Dan, Any suggestion about it?

The CXL equivalent is GPF (Global Persistent Flush), not to be confused
with "General Protection Fault", which is likely what will happen if
software needs to manage cache coherency for this solution. CXL GPF was
not designed to be triggered by software. It is a hardware response to a
power supply indicating loss of input power.

I do not think you want to spend community resources reviewing software
cache coherency considerations; instead, "just" mandate that this
solution requires inter-host hardware cache coherence. I understand that
is a difficult requirement to mandate, but it is likely less difficult
than getting Linux to carry a software cache coherence mitigation.

In some ways this reminds me of SMR drives and the problems those posed
to software where ultimately the programming difficulties needed to be
solved in hardware, not exported to the Linux kernel to solve.


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-04-28 16:55                 ` John Groves
@ 2024-05-03  9:52                   ` Jonathan Cameron
  2024-05-08 11:39                     ` Dongsheng Yang
  0 siblings, 1 reply; 52+ messages in thread
From: Jonathan Cameron @ 2024-05-03  9:52 UTC (permalink / raw)
  To: John Groves
  Cc: Dongsheng Yang, Gregory Price, Dan Williams, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm

On Sun, 28 Apr 2024 11:55:10 -0500
John Groves <John@groves.net> wrote:

> On 24/04/28 01:47PM, Dongsheng Yang wrote:
> > 
> > 
> > On 2024/4/27 (Sat) 12:14 AM, Gregory Price wrote:
> > > On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> > > > 
> > > > 
> > > > On 2024/4/26 (Fri) 9:48 PM, Gregory Price wrote:
> > > > >   
> > > > 
> > > > In (5) of the cover letter, I mentioned that cbd addresses cache coherence
> > > > at the software level:
> > > > 
> > > > (5) How do blkdev and backend interact through the channel?
> > > > 	a) For reader side, before reading the data, if the data in this channel
> > > > may be modified by the other party, then I need to flush the cache before
> > > > reading to ensure that I get the latest data. For example, the blkdev needs
> > > > to flush the cache before obtaining compr_head because compr_head will be
> > > > updated by the backend handler.
> > > > 	b) For writer side, if the written information will be read by others,
> > > > then after writing, I need to flush the cache to let the other party see it
> > > > immediately. For example, after blkdev submits cbd_se, it needs to update
> > > > cmd_head to let the handler have a new cbd_se. Therefore, after updating
> > > > cmd_head, I need to flush the cache to let the backend see it.
> > > >   
> > > 
> > > Flushing the cache is insufficient.  All that cache flushing guarantees
> > > is that the memory has left the writer's CPU cache.  There are potentially
> > > many write buffers between the CPU and the actual backing media that the
> > > CPU has no visibility of and cannot pierce through to force a full
> > > guaranteed flush back to the media.
> > > 
> > > for example:
> > > 
> > > memcpy(some_cacheline, data, 64);
> > > mfence();
> > > 
> > > Will not guarantee that after mfence() completes that the remote host
> > > will have visibility of the data.  mfence() does not guarantee a full
> > > flush back down to the device, it only guarantees it has been pushed out
> > > of the CPU's cache.
> > > 
> > > similarly:
> > > 
> > > memcpy(some_cacheline, data, 64);
> > > mfence();
> > > memcpy(some_other_cacheline, data, 64);
> > > mfence();
> > > 
> > > Will not guarantee that some_cacheline reaches the backing media prior
> > > to some_other_cacheline, as there is no guarantee of write-ordering in
> > > CXL controllers (with the exception of writes to the same cacheline).
> > > 
> > > So this statement:
> > >   
> > > > I need to flush the cache to let the other party see it immediately.  
> > > 
> > > Is misleading.  They will not see it "immediately", they will see it
> > > "eventually at some completely unknowable time in the future".  
> > 
> > This is indeed one of the issues I wanted to discuss at the RFC stage. Thank
> > you for pointing it out.
> > 
> > In my opinion, using "nvdimm_flush" might be one way to address this issue,
> > but it seems to flush the entire nd_region, which might be too heavy.
> > Moreover, it only applies to non-volatile memory.
> > 
> > This should be a general problem for cxl shared memory. In theory, FAMFS
> > should also encounter this issue.
> > 
> > Gregory, John, and Dan, Any suggestion about it?
> > 
> > Thanx a lot  
> > > 
> > > ~Gregory
> > >   
> 
> Hi Dongsheng,
> 
> Gregory is right about the uncertainty around "clflush" operations, but
> let me drill in a bit further.
> 
> Say you copy a payload into a "bucket" in a queue and then update an
> index in a metadata structure; I'm thinking of the standard producer/
> consumer queuing model here, with one index mutated by the producer and
> the other mutated by the consumer. 
> 
> (I have not reviewed your queueing code, but you *must* be using this
> model - things like linked-lists won't work in shared memory without 
> shared locks/atomics.)
> 
> Normal logic says that you should clflush the payload before updating
> the index, then update and clflush the index.
> 
> But we still observe in non-cache-coherent shared memory that the payload 
> may become valid *after* the clflush of the queue index.
> 
> The famfs user space has a program called pcq.c, which implements a
> producer/consumer queue in a pair of famfs files. The only way to 
> currently guarantee a valid read of a payload is to use sequence numbers 
> and checksums on payloads.  We do observe mismatches with actual shared 
> memory, and the recovery is to clflush and re-read the payload from the 
> client side. (Aside: These file pairs theoretically might work for CBD 
> queues.)
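
A minimal sketch of the sequence-number-plus-checksum validation described above. This is not the famfs pcq.c code: the slot layout, the names `pcq_slot`, `slot_publish` and `slot_try_read`, and the toy rotate-and-add checksum are all assumptions made for illustration. On real non-coherent shared memory the consumer would also invalidate its cached copy of the slot before each retry, which is only noted in a comment here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_SIZE 48

/* Hypothetical slot layout: one queue "bucket" in shared memory. */
struct pcq_slot {
	uint64_t seq;                   /* sequence number set by the producer */
	uint32_t csum;                  /* checksum over seq and payload */
	uint8_t  payload[PAYLOAD_SIZE];
};

/* Toy checksum for illustration; a real queue would likely use a CRC. */
static uint32_t slot_csum(uint64_t seq, const uint8_t *p, size_t len)
{
	uint32_t c = (uint32_t)seq ^ (uint32_t)(seq >> 32);

	for (size_t i = 0; i < len; i++)
		c = ((c << 5) | (c >> 27)) + p[i];
	return c;
}

/* Producer: fill the payload, then stamp it with seq and checksum. */
static void slot_publish(struct pcq_slot *s, uint64_t seq,
			 const void *data, size_t len)
{
	assert(len <= PAYLOAD_SIZE);
	memset(s->payload, 0, PAYLOAD_SIZE);
	memcpy(s->payload, data, len);
	s->seq = seq;
	s->csum = slot_csum(seq, s->payload, PAYLOAD_SIZE);
	/* Real code would flush the slot's cache lines here. */
}

/*
 * Consumer: accept the slot only if it carries the expected sequence
 * number and its checksum matches. On failure the caller should
 * invalidate its cached copy of the slot and try again.
 */
static bool slot_try_read(const struct pcq_slot *s, uint64_t expect_seq,
			  void *out)
{
	if (s->seq != expect_seq)
		return false;
	if (s->csum != slot_csum(s->seq, s->payload, PAYLOAD_SIZE))
		return false;
	memcpy(out, s->payload, PAYLOAD_SIZE);
	return true;
}
```

The consumer never trusts the index alone: a slot whose payload has not yet become visible fails the checksum and is simply re-read later.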
> 
> Another side note: it would be super-helpful if the CPU gave us an explicit 
> invalidate rather than just clflush, which will write-back before 
> invalidating *if* the cache line is marked as dirty, even when software
> knows this should not happen.
> 
> Note that CXL 3.1 provides a way to guarantee that stuff that should not
> be written back can't be written back: read-only mappings. This one of
> the features I got into the spec; using this requires CXL 3.1 DCD, and 
> would require two DCD allocations (i.e. two tagged-capacity dax devices - 
> one writable by the server and one by the client).
> 
> Just to make things slightly gnarlier, the MESI cache coherency protocol
> allows a CPU to speculatively convert a line from exclusive to modified,
> meaning it's not clear as of now whether "occasional" clean write-backs
> can be avoided. Meaning those read-only mappings may be more important
> than one might think. (Clean write-backs basically make it
> impossible for software to manage cache coherency.)

My understanding is that clean write-backs are an implementation-specific
issue that came as a surprise to some CPU arch folk I spoke to. We will
need some path for a host to say if they can ever do that.

Given this definitely affects one CPU vendor, maybe solutions that
rely on this not happening are not suitable for upstream.

Maybe this market will be important enough for that CPU vendor to stop
doing it, but if they do it will take a while...

Flushing in general is a CPU architecture problem where each of the
architectures needs to be clear what they do / specify that their
licensees do.

I'm with Dan on encouraging all memory vendors to do hardware coherence!

J

> 
> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and 
> shared memory is not explicitly legal in cxl 2, so there are things a cpu 
> could do (or not do) in a cxl 2 environment that are not illegal because 
> they should not be observable in a no-shared-memory environment.
> 
> CBD is interesting work, though for some of the reasons above I'm somewhat
> skeptical of shared memory as an IPC mechanism.
> 
> Regards,
> John
> 
> 
> 



* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-03  9:52                   ` Jonathan Cameron
@ 2024-05-08 11:39                     ` Dongsheng Yang
  2024-05-08 12:11                       ` Jonathan Cameron
  0 siblings, 1 reply; 52+ messages in thread
From: Dongsheng Yang @ 2024-05-08 11:39 UTC (permalink / raw)
  To: Jonathan Cameron, John Groves, Dan Williams, Gregory Price
  Cc: Gregory Price, Dan Williams, axboe, linux-block, linux-kernel,
	linux-cxl, nvdimm



On 2024/5/3 (Fri) 5:52 PM, Jonathan Cameron wrote:
> On Sun, 28 Apr 2024 11:55:10 -0500
> John Groves <John@groves.net> wrote:
> 
>> On 24/04/28 01:47PM, Dongsheng Yang wrote:
>>>
>>>
>>> On 2024/4/27 (Sat) 12:14 AM, Gregory Price wrote:
>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
>>>>>
>>>>>
>>>>> On 2024/4/26 (Fri) 9:48 PM, Gregory Price wrote:
>>>>>>    
>>>>>

...
>>
>> Just to make things slightly gnarlier, the MESI cache coherency protocol
>> allows a CPU to speculatively convert a line from exclusive to modified,
>> meaning it's not clear as of now whether "occasional" clean write-backs
>> can be avoided. Meaning those read-only mappings may be more important
>> than one might think. (Clean write-backs basically make it
>> impossible for software to manage cache coherency.)
> 
> My understanding is that clean write backs are an implementation specific
> issue that came as a surprise to some CPU arch folk I spoke to, we will
> need some path for a host to say if they can ever do that.
> 
> Given this definitely affects one CPU vendor, maybe solutions that
> rely on this not happening are not suitable for upstream.
> 
> Maybe this market will be important enough for that CPU vendor to stop
> doing it but if they do it will take a while...
> 
> Flushing in general is a CPU architecture problem where each of the
> architectures needs to be clear what they do / specify that their
> licensees do.
> 
> I'm with Dan on encouraging all memory vendors to do hardware coherence!

Hi Gregory, John, Jonathan and Dan:
	Thanks for your information, it helps a lot, and sorry for the late reply.

After some internal discussions, I think we can design it as follows:

(1) If the hardware implements cache coherence, then the software layer 
doesn't need to consider this issue, and can perform read and write 
operations directly.

(2) If the hardware doesn't implement cache coherence, we can consider a 
DMA-like approach, where we check architectural features to determine if 
cache coherence is supported. This could be similar to 
`dev_is_dma_coherent`.

Additionally, if the architecture supports flushing and invalidating CPU 
caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`, 
`CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`, 
`CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`), then we can handle cache 
coherence at the software layer. (For the clean writeback issue, I think 
it may also require clarification from the architecture, and a look at 
how DMA handles the clean writeback problem, which I haven't checked 
further.)

(3) If the hardware doesn't implement cache coherence and the CPU 
doesn't support the required CPU cache operations, then we can run in 
nocache mode.

CBD can initially support (3), and then transition to (1) when hardware 
supports cache-coherency. If there's sufficient market demand, we can 
also consider supporting (2).
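
The three cases above can be sketched as a simple selection function. This is only an illustration under stated assumptions: `cbd_caps`, `cbd_coherence_mode` and `cbd_pick_mode` are hypothetical names, no existing kernel interface is implied, and how the two capability bits would actually be discovered is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability description for one host plus shared-memory device. */
struct cbd_caps {
	bool hw_multihost_coherent; /* case (1): hardware keeps hosts coherent */
	bool cpu_cache_maint;       /* case (2): CPU can flush and invalidate */
};

enum cbd_coherence_mode {
	CBD_MODE_HW_COHERENT, /* (1) plain loads and stores */
	CBD_MODE_SW_COHERENT, /* (2) explicit flush/invalidate around accesses */
	CBD_MODE_NOCACHE,     /* (3) access the channel through an uncached mapping */
};

/* Pick the least expensive mode that the described capabilities allow. */
static enum cbd_coherence_mode cbd_pick_mode(const struct cbd_caps *caps)
{
	assert(caps != NULL);

	if (caps->hw_multihost_coherent)
		return CBD_MODE_HW_COHERENT;
	if (caps->cpu_cache_maint)
		return CBD_MODE_SW_COHERENT;
	return CBD_MODE_NOCACHE;
}
```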

How does this approach sound?

Thanx
> 
> J
> 
>>
>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
>> shared memory is not explicitly legal in cxl 2, so there are things a cpu
>> could do (or not do) in a cxl 2 environment that are not illegal because
>> they should not be observable in a no-shared-memory environment.
>>
>> CBD is interesting work, though for some of the reasons above I'm somewhat
>> skeptical of shared memory as an IPC mechanism.
>>
>> Regards,
>> John
>>
>>
>>
> 
> .
> 


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-08 11:39                     ` Dongsheng Yang
@ 2024-05-08 12:11                       ` Jonathan Cameron
  2024-05-08 13:03                         ` Dongsheng Yang
  0 siblings, 1 reply; 52+ messages in thread
From: Jonathan Cameron @ 2024-05-08 12:11 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm

On Wed, 8 May 2024 19:39:23 +0800
Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:

> On 2024/5/3 (Fri) 5:52 PM, Jonathan Cameron wrote:
> > On Sun, 28 Apr 2024 11:55:10 -0500
> > John Groves <John@groves.net> wrote:
> >   
> >> On 24/04/28 01:47PM, Dongsheng Yang wrote:  
> >>>
> >>>
> >>> On 2024/4/27 (Sat) 12:14 AM, Gregory Price wrote:
> >>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> >>>>>
> >>>>>
> >>>>> On 2024/4/26 (Fri) 9:48 PM, Gregory Price wrote:
> >>>>>>      
> >>>>>  
> 
> ...
> >>
> >> Just to make things slightly gnarlier, the MESI cache coherency protocol
> >> allows a CPU to speculatively convert a line from exclusive to modified,
> >> meaning it's not clear as of now whether "occasional" clean write-backs
> >> can be avoided. Meaning those read-only mappings may be more important
> >> than one might think. (Clean write-backs basically make it
> >> impossible for software to manage cache coherency.)  
> > 
> > My understanding is that clean write backs are an implementation specific
> > issue that came as a surprise to some CPU arch folk I spoke to, we will
> > need some path for a host to say if they can ever do that.
> > 
> > Given this definitely affects one CPU vendor, maybe solutions that
> > rely on this not happening are not suitable for upstream.
> > 
> > Maybe this market will be important enough for that CPU vendor to stop
> > doing it but if they do it will take a while...
> > 
> > Flushing in general is a CPU architecture problem where each of the
> > architectures needs to be clear what they do / specify that their
> > licensees do.
> > 
> > I'm with Dan on encouraging all memory vendors to do hardware coherence!  
> 
> Hi Gregory, John, Jonathan and Dan:
> 	Thanx for your information, they help a lot, and sorry for the late reply.
> 
> After some internal discussions, I think we can design it as follows:
> 
> (1) If the hardware implements cache coherence, then the software layer 
> doesn't need to consider this issue, and can perform read and write 
> operations directly.

Agreed - this is one easier case.

> 
> (2) If the hardware doesn't implement cache coherence, we can consider a 
> DMA-like approach, where we check architectural features to determine if 
> cache coherence is supported. This could be similar to 
> `dev_is_dma_coherent`.

Ok. So this would combine host support checks with checking if the shared
memory on the device is multi-host cache coherent (it will be single-host
cache coherent, which is what makes this messy).
> 
> Additionally, if the architecture supports flushing and invalidating CPU 
> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`, 
> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`, 
> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`),

Those particular calls won't tell you much at all. They indicate that a flush
can happen as far as a common point for DMA engines in the system. No
information on whether there are caches beyond that point.

> 
> then we can handle cache coherence at the software layer.
> (For the clean writeback issue, I think it may also require 
> clarification from the architecture, and how DMA handles the clean 
> writeback problem, which I haven't further checked.)

I believe the relevant architecture only does IO coherent DMA so it is
never a problem (unlike with multihost cache coherence).
> 
> (3) If the hardware doesn't implement cache coherence and the cpu 
> doesn't support the required CPU cache operations, then we can run in 
> nocache mode.

I suspect that gets you nowhere either.  Never believe an architecture
that provides a flag that says not to cache something.  That just means
you should not be able to tell that it is cached - many, many implementations
actually cache such accesses.

> 
> CBD can initially support (3), and then transition to (1) when hardware 
> supports cache-coherency. If there's sufficient market demand, we can 
> also consider supporting (2).
I'd assume only (3) works.  The others rely on assumptions I don't think
you can rely on.

Fun fun fun,

Jonathan

> 
> How does this approach sound?
> 
> Thanx
> > 
> > J
> >   
> >>
> >> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
> >> shared memory is not explicitly legal in cxl 2, so there are things a cpu
> >> could do (or not do) in a cxl 2 environment that are not illegal because
> >> they should not be observable in a no-shared-memory environment.
> >>
> >> CBD is interesting work, though for some of the reasons above I'm somewhat
> >> skeptical of shared memory as an IPC mechanism.
> >>
> >> Regards,
> >> John
> >>
> >>
> >>  
> > 
> > .
> >   



* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-08 12:11                       ` Jonathan Cameron
@ 2024-05-08 13:03                         ` Dongsheng Yang
  2024-05-08 15:44                           ` Jonathan Cameron
  0 siblings, 1 reply; 52+ messages in thread
From: Dongsheng Yang @ 2024-05-08 13:03 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm



On 2024/5/8 (Wed) 8:11 PM, Jonathan Cameron wrote:
> On Wed, 8 May 2024 19:39:23 +0800
> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> 
>> On 2024/5/3 (Fri) 5:52 PM, Jonathan Cameron wrote:
>>> On Sun, 28 Apr 2024 11:55:10 -0500
>>> John Groves <John@groves.net> wrote:
>>>    
>>>> On 24/04/28 01:47PM, Dongsheng Yang wrote:
>>>>>
>>>>>
>>>>> On 2024/4/27 (Sat) 12:14 AM, Gregory Price wrote:
>>>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/4/26 (Fri) 9:48 PM, Gregory Price wrote:
>>>>>>>>       
>>>>>>>   
>>
>> ...
>>>>
>>>> Just to make things slightly gnarlier, the MESI cache coherency protocol
>>>> allows a CPU to speculatively convert a line from exclusive to modified,
>>>> meaning it's not clear as of now whether "occasional" clean write-backs
>>>> can be avoided. Meaning those read-only mappings may be more important
>>>> than one might think. (Clean write-backs basically make it
>>>> impossible for software to manage cache coherency.)
>>>
>>> My understanding is that clean write backs are an implementation specific
>>> issue that came as a surprise to some CPU arch folk I spoke to, we will
>>> need some path for a host to say if they can ever do that.
>>>
>>> Given this definitely affects one CPU vendor, maybe solutions that
>>> rely on this not happening are not suitable for upstream.
>>>
>>> Maybe this market will be important enough for that CPU vendor to stop
>>> doing it but if they do it will take a while...
>>>
>>> Flushing in general is a CPU architecture problem where each of the
>>> architectures needs to be clear what they do / specify that their
>>> licensees do.
>>>
>>> I'm with Dan on encouraging all memory vendors to do hardware coherence!
>>
>> Hi Gregory, John, Jonathan and Dan:
>> 	Thanx for your information, they help a lot, and sorry for the late reply.
>>
>> After some internal discussions, I think we can design it as follows:
>>
>> (1) If the hardware implements cache coherence, then the software layer
>> doesn't need to consider this issue, and can perform read and write
>> operations directly.
> 
> Agreed - this is one easier case.
> 
>>
>> (2) If the hardware doesn't implement cache coherence, we can consider a
>> DMA-like approach, where we check architectural features to determine if
>> cache coherence is supported. This could be similar to
>> `dev_is_dma_coherent`.
> 
> Ok. So this would combine host support checks with checking if the shared
> memory on the device is multi host cache coherent (it will be single host
> cache coherent which is what makes this messy)
>>
>> Additionally, if the architecture supports flushing and invalidating CPU
>> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`,
>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`,
>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`),
> 
> Those particular calls won't tell you much at all. They indicate that a flush
> can happen as far as a common point for DMA engines in the system. No
> information on whether there are caches beyond that point.
> 
>>
>> then we can handle cache coherence at the software layer.
>> (For the clean writeback issue, I think it may also require
>> clarification from the architecture, and how DMA handles the clean
>> writeback problem, which I haven't further checked.)
> 
> I believe the relevant architecture only does IO coherent DMA so it is
> never a problem (unlike with multihost cache coherence).

Hi Jonathan,

Let me provide an example.
In nvmeof-rdma, the `nvme_rdma_queue_rq` function places a request into 
`req->sqe.dma`.

(1) First, it calls `ib_dma_sync_single_for_cpu()`, which invalidates 
the CPU cache:


ib_dma_sync_single_for_cpu(dev, sqe->dma,
                             sizeof(struct nvme_command), DMA_TO_DEVICE);


For example, on ARM64, this would call `arch_sync_dma_for_cpu`, followed 
by `dcache_inval_poc(start, start + size)`.

(2) Next, it sets up the data related to the NVMe request.

(3) Then it calls `ib_dma_sync_single_for_device()` to flush the CPU cache to 
DMA memory:

ib_dma_sync_single_for_device(dev, sqe->dma,
                              sizeof(struct nvme_command), DMA_TO_DEVICE);

Of course, if the hardware ensures cache coherency, the above operations 
are skipped. However, if the hardware does not guarantee cache 
coherency, RDMA appears to ensure cache coherency through this method.

In the RDMA scenario, we also face the issue of multi-host cache 
coherence. So I'm thinking, can we adopt a similar approach in CXL 
shared memory to achieve data sharing?
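
The sync bracket described above can be modeled in user space roughly as follows. This is a toy model, not the kernel DMA API: `toy_buf`, `toy_sync_for_cpu`, `toy_cpu_write` and `toy_sync_for_device` are made-up stand-ins, and the model assumes a single cache level sitting directly in front of the memory the other side reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CMD_SIZE 64

/* Toy buffer: one cached copy on the CPU side, one copy in "DMA memory". */
struct toy_buf {
	unsigned char cache[CMD_SIZE]; /* what the CPU reads and writes */
	unsigned char mem[CMD_SIZE];   /* what the device would see */
	bool cache_valid;
};

/* Stand-in for a sync-for-cpu step: drop the cached copy, reload from memory. */
static void toy_sync_for_cpu(struct toy_buf *b)
{
	memcpy(b->cache, b->mem, CMD_SIZE);
	b->cache_valid = true;
}

/* CPU-side write: lands in the cached copy only. */
static void toy_cpu_write(struct toy_buf *b, const void *cmd, size_t len)
{
	assert(b->cache_valid && len <= CMD_SIZE);
	memcpy(b->cache, cmd, len);
}

/* Stand-in for a sync-for-device step: write the cached copy back to memory. */
static void toy_sync_for_device(struct toy_buf *b)
{
	memcpy(b->mem, b->cache, CMD_SIZE);
}
```

In this model the command only becomes visible in `mem` after `toy_sync_for_device()`; whether a real flush reaches far enough for a second host to see it is exactly the open question.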

>>
>> (3) If the hardware doesn't implement cache coherence and the cpu
>> doesn't support the required CPU cache operations, then we can run in
>> nocache mode.
> 
> I suspect that gets you no where either.  Never believe an architecture
> that provides a flag that says not to cache something.  That just means
> you should not be able to tell that it is cached - many many implementations
> actually cache such accesses.

Sigh, then that really makes things difficult.
> 
>>
>> CBD can initially support (3), and then transition to (1) when hardware
>> supports cache-coherency. If there's sufficient market demand, we can
>> also consider supporting (2).
> I'd assume only (3) works.  The others rely on assumptions I don't think

I guess you mean (1), the hardware cache-coherency way, right?

:)
Thanx

> you can rely on.
> 
> Fun fun fun,
> 
> Jonathan
> 
>>
>> How does this approach sound?
>>
>> Thanx
>>>
>>> J
>>>    
>>>>
>>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
>>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu
>>>> could do (or not do) in a cxl 2 environment that are not illegal because
>>>> they should not be observable in a no-shared-memory environment.
>>>>
>>>> CBD is interesting work, though for some of the reasons above I'm somewhat
>>>> skeptical of shared memory as an IPC mechanism.
>>>>
>>>> Regards,
>>>> John
>>>>
>>>>
>>>>   
>>>
>>> .
>>>    
> 
> .
> 


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-08 13:03                         ` Dongsheng Yang
@ 2024-05-08 15:44                           ` Jonathan Cameron
  2024-05-09 11:24                             ` Dongsheng Yang
  0 siblings, 1 reply; 52+ messages in thread
From: Jonathan Cameron @ 2024-05-08 15:44 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm

On Wed, 8 May 2024 21:03:54 +0800
Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:

> On 2024/5/8 (Wed) 8:11 PM, Jonathan Cameron wrote:
> > On Wed, 8 May 2024 19:39:23 +0800
> > Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> >   
> >> On 2024/5/3 (Fri) 5:52 PM, Jonathan Cameron wrote:
> >>> On Sun, 28 Apr 2024 11:55:10 -0500
> >>> John Groves <John@groves.net> wrote:
> >>>      
> >>>> On 24/04/28 01:47PM, Dongsheng Yang wrote:  
> >>>>>
> >>>>>
> >>>>> On 2024/4/27 (Sat) 12:14 AM, Gregory Price wrote:
> >>>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2024/4/26 (Fri) 9:48 PM, Gregory Price wrote:
> >>>>>>>>         
> >>>>>>>     
> >>
> >> ...  
> >>>>
> >>>> Just to make things slightly gnarlier, the MESI cache coherency protocol
> >>>> allows a CPU to speculatively convert a line from exclusive to modified,
> >>>> meaning it's not clear as of now whether "occasional" clean write-backs
> >>>> can be avoided. Meaning those read-only mappings may be more important
> >>>> than one might think. (Clean write-backs basically make it
> >>>> impossible for software to manage cache coherency.)  
> >>>
> >>> My understanding is that clean write backs are an implementation specific
> >>> issue that came as a surprise to some CPU arch folk I spoke to, we will
> >>> need some path for a host to say if they can ever do that.
> >>>
> >>> Given this definitely affects one CPU vendor, maybe solutions that
> >>> rely on this not happening are not suitable for upstream.
> >>>
> >>> Maybe this market will be important enough for that CPU vendor to stop
> >>> doing it but if they do it will take a while...
> >>>
> >>> Flushing in general is a CPU architecture problem where each of the
> >>> architectures needs to be clear what they do / specify that their
> >>> licensees do.
> >>>
> >>> I'm with Dan on encouraging all memory vendors to do hardware coherence!  
> >>
> >> Hi Gregory, John, Jonathan and Dan:
> >> 	Thanx for your information, they help a lot, and sorry for the late reply.
> >>
> >> After some internal discussions, I think we can design it as follows:
> >>
> >> (1) If the hardware implements cache coherence, then the software layer
> >> doesn't need to consider this issue, and can perform read and write
> >> operations directly.  
> > 
> > Agreed - this is one easier case.
> >   
> >>
> >> (2) If the hardware doesn't implement cache coherence, we can consider a
> >> DMA-like approach, where we check architectural features to determine if
> >> cache coherence is supported. This could be similar to
> >> `dev_is_dma_coherent`.  
> > 
> > Ok. So this would combine host support checks with checking if the shared
> > memory on the device is multi host cache coherent (it will be single host
> > cache coherent which is what makes this messy)  
> >>
> >> Additionally, if the architecture supports flushing and invalidating CPU
> >> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`,
> >> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`,
> >> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`),  
> > 
> > Those particular calls won't tell you much at all. They indicate that a flush
> > can happen as far as a common point for DMA engines in the system. No
> > information on whether there are caches beyond that point.
> >   
> >>
> >> then we can handle cache coherence at the software layer.
> >> (For the clean writeback issue, I think it may also require
> >> clarification from the architecture, and how DMA handles the clean
> >> writeback problem, which I haven't further checked.)  
> > 
> > I believe the relevant architecture only does IO coherent DMA so it is
> > never a problem (unlike with multihost cache coherence).
> 
> Hi Jonathan,
> 
> let me provide an example,
> In nvmeof-rdma, the `nvme_rdma_queue_rq` function places a request into 
> `req->sqe.dma`.
> 
> (1) First, it calls `ib_dma_sync_single_for_cpu()`, which invalidates 
> the CPU cache:
> 
> 
> ib_dma_sync_single_for_cpu(dev, sqe->dma,
>                              sizeof(struct nvme_command), DMA_TO_DEVICE);
> 
> 
> For example, on ARM64, this would call `arch_sync_dma_for_cpu`, followed 
> by `dcache_inval_poc(start, start + size)`.

The key here is the PoC. It's a flush to the point of coherence of the local
system.  That flush has no idea about inter-host coherency, and the PoC is not
necessarily the DRAM (in CXL or otherwise).

If you are doing software coherence, those devices will plug into today's
hosts, and those hosts have no idea that such a flush means pushing out into
the CXL fabric and to the type 3 device.

> 
> (2) Setting up data related to the NVMe request.
> 
> (3) then Calls `ib_dma_sync_single_for_device` to flush the CPU cache to 
> DMA memory:
> 
> ib_dma_sync_single_for_device(dev, sqe->dma,
>                                  sizeof(struct nvme_command), 
> DMA_TO_DEVICE);
> 
> Of course, if the hardware ensures cache coherency, the above operations 
> are skipped. However, if the hardware does not guarantee cache 
> coherency, RDMA appears to ensure cache coherency through this method.
> 
> In the RDMA scenario, we also face the issue of multi-host cache 
> coherence. so I'm thinking, can we adopt a similar approach in CXL 
> shared memory to achieve data sharing?

You don't face the same coherence issues, or at least not in the same way.
In that case the coherence guarantees are actually to the RDMA NIC.
The NIC is guaranteed to see the host's clean data - that may involve
flushes to the PoC.  A one-time snapshot is then sent to readers on other
hosts. If writes occur, they are also guaranteed to replace cached copies
on this host - because there is a well defined guarantee of IO coherence
or explicit cache maintenance to the PoC.

 
> 
> >>
> >> (3) If the hardware doesn't implement cache coherence and the cpu
> >> doesn't support the required CPU cache operations, then we can run in
> >> nocache mode.  
> > 
> > I suspect that gets you no where either.  Never believe an architecture
> > that provides a flag that says not to cache something.  That just means
> > you should not be able to tell that it is cached - many many implementations
> > actually cache such accesses.  
> 
> Sigh, then that really makes thing difficult.

Yes. I think we are going to have to wait on architecture specific clarifications
before any software coherent use case can be guaranteed to work, beyond the CXL 3.1
ones: temporal sharing (only one accessing host at a time) and read only sharing,
where writes are dropped anyway, so clean write back is irrelevant beyond possibly
some noise in logs (if they do get logged, it is considered so rare we don't care!).

> >   
> >>
> >> CBD can initially support (3), and then transition to (1) when hardware
> >> supports cache-coherency. If there's sufficient market demand, we can
> >> also consider supporting (2).  
> > I'd assume only (3) works.  The others rely on assumptions I don't think  
> 
> I guess you mean (1), the hardware cache-coherency way, right?

Indeed - oops!
Hardware coherency is the way to go, or a well defined and clearly documented
description of how to play with the various host architectures.

Jonathan


> 
> :)
> Thanx
> 
> > you can rely on.
> > 
> > Fun fun fun,
> > 
> > Jonathan
> >   
> >>
> >> How does this approach sound?
> >>
> >> Thanx  
> >>>
> >>> J
> >>>      
> >>>>
> >>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
> >>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu
> >>>> could do (or not do) in a cxl 2 environment that are not illegal because
> >>>> they should not be observable in a no-shared-memory environment.
> >>>>
> >>>> CBD is interesting work, though for some of the reasons above I'm somewhat
> >>>> skeptical of shared memory as an IPC mechanism.
> >>>>
> >>>> Regards,
> >>>> John
> >>>>
> >>>>
> >>>>     
> >>>
> >>> .
> >>>      
> > 
> > .
> >   



* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-08 15:44                           ` Jonathan Cameron
@ 2024-05-09 11:24                             ` Dongsheng Yang
  2024-05-09 12:21                               ` Jonathan Cameron
  0 siblings, 1 reply; 52+ messages in thread
From: Dongsheng Yang @ 2024-05-09 11:24 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm



On Wed, 2024/5/8 at 11:44 PM, Jonathan Cameron wrote:
> On Wed, 8 May 2024 21:03:54 +0800
> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> 
>> On Wed, 2024/5/8 at 8:11 PM, Jonathan Cameron wrote:
>>> On Wed, 8 May 2024 19:39:23 +0800
>>> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>>>    
>>>> On Fri, 2024/5/3 at 5:52 PM, Jonathan Cameron wrote:
>>>>> On Sun, 28 Apr 2024 11:55:10 -0500
>>>>> John Groves <John@groves.net> wrote:
>>>>>       
>>>>>> On 24/04/28 01:47PM, Dongsheng Yang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Sat, 2024/4/27 at 12:14 AM, Gregory Price wrote:
>>>>>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, 2024/4/26 at 9:48 PM, Gregory Price wrote:
>>>>>>>>>>          
>>>>>>>>>      
>>>>
>>>> ...
>>>>>>
>>>>>> Just to make things slightly gnarlier, the MESI cache coherency protocol
>>>>>> allows a CPU to speculatively convert a line from exclusive to modified,
>>>>>> meaning it's not clear as of now whether "occasional" clean write-backs
>>>>>> can be avoided. Meaning those read-only mappings may be more important
>>>>>> than one might think. (Clean write-backs basically make it
>>>>>> impossible for software to manage cache coherency.)
>>>>>
>>>>> My understanding is that clean write backs are an implementation specific
>>>>> issue that came as a surprise to some CPU arch folk I spoke to, we will
>>>>> need some path for a host to say if they can ever do that.
>>>>>
>>>>> Given this definitely effects one CPU vendor, maybe solutions that
>>>>> rely on this not happening are not suitable for upstream.
>>>>>
>>>>> Maybe this market will be important enough for that CPU vendor to stop
>>>>> doing it but if they do it will take a while...
>>>>>
>>>>> Flushing in general is as CPU architecture problem where each of the
>>>>> architectures needs to be clear what they do / specify that their
>>>>> licensees do.
>>>>>
>>>>> I'm with Dan on encouraging all memory vendors to do hardware coherence!
>>>>
>>>> Hi Gregory, John, Jonathan and Dan:
>>>> 	Thanx for your information, they help a lot, and sorry for the late reply.
>>>>
>>>> After some internal discussions, I think we can design it as follows:
>>>>
>>>> (1) If the hardware implements cache coherence, then the software layer
>>>> doesn't need to consider this issue, and can perform read and write
>>>> operations directly.
>>>
>>> Agreed - this is one easier case.
>>>    
>>>>
>>>> (2) If the hardware doesn't implement cache coherence, we can consider a
>>>> DMA-like approach, where we check architectural features to determine if
>>>> cache coherence is supported. This could be similar to
>>>> `dev_is_dma_coherent`.
>>>
>>> Ok. So this would combine host support checks with checking if the shared
>>> memory on the device is multi host cache coherent (it will be single host
>>> cache coherent which is what makes this messy)
>>>>
>>>> Additionally, if the architecture supports flushing and invalidating CPU
>>>> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`,
>>>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`,
>>>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`),
>>>
>>> Those particular calls won't tell you much at all. They indicate that a flush
>>> can happen as far as a common point for DMA engines in the system. No
>>> information on whether there are caches beyond that point.
>>>    
>>>>
>>>> then we can handle cache coherence at the software layer.
>>>> (For the clean writeback issue, I think it may also require
>>>> clarification from the architecture, and how DMA handles the clean
>>>> writeback problem, which I haven't further checked.)
>>>
>>> I believe the relevant architecture only does IO coherent DMA so it is
>>> never a problem (unlike with multihost cache coherence).
>>
>> Hi Jonathan,
>> let me provide an example,
>> In nvmeof-rdma, the `nvme_rdma_queue_rq` function places a request into
>> `req->sqe.dma`.
>>
>> (1) First, it calls `ib_dma_sync_single_for_cpu()`, which invalidates
>> the CPU cache:
>>
>>
>> ib_dma_sync_single_for_cpu(dev, sqe->dma,
>>                               sizeof(struct nvme_command), DMA_TO_DEVICE);
>>
>>
>> For example, on ARM64, this would call `arch_sync_dma_for_cpu`, followed
>> by `dcache_inval_poc(start, start + size)`.
> 
> Key here is the POC. It's a flush to the point of coherence of the local
> system.  It has no idea about interhost coherency and is not necessarily
> the DRAM (in CXL or otherwise).
> 
> If you are doing software coherence, those devices will plug into today's
> hosts and they have no idea that such a flush means pushing out into
> the CXL fabric and to the type 3 device.
> 
>>
>> (2) Setting up data related to the NVMe request.
>>
>> (3) then Calls `ib_dma_sync_single_for_device` to flush the CPU cache to
>> DMA memory:
>>
>> ib_dma_sync_single_for_device(dev, sqe->dma,
>>                                   sizeof(struct nvme_command),
>> DMA_TO_DEVICE);
>>
>> Of course, if the hardware ensures cache coherency, the above operations
>> are skipped. However, if the hardware does not guarantee cache
>> coherency, RDMA appears to ensure cache coherency through this method.
>>
>> In the RDMA scenario, we also face the issue of multi-host cache
>> coherence. so I'm thinking, can we adopt a similar approach in CXL
>> shared memory to achieve data sharing?
> 
> You don't face the same coherence issues, or at least not in the same way.
> In that case the coherence guarantees are actually to the RDMA NIC.
> It is guaranteed to see the clean data by the host - that may involve
> flushes to PoC.  A one time snapshot is then sent to readers on other
> hosts. If writes occur they are also guarantee to replace cached copies
> on this host - because there is well define guarantee of IO coherence
> or explicit cache maintenance to the PoC
Right, the PoC is not the point of coherence with other hosts. That sounds
correct, thanx.
> 
>   
>>
>>>>
>>>> (3) If the hardware doesn't implement cache coherence and the cpu
>>>> doesn't support the required CPU cache operations, then we can run in
>>>> nocache mode.
>>>
>>> I suspect that gets you no where either.  Never believe an architecture
>>> that provides a flag that says not to cache something.  That just means
>>> you should not be able to tell that it is cached - many many implementations
>>> actually cache such accesses.
>>
>> Sigh, then that really makes thing difficult.
> 
> Yes. I think we are going to have to wait on architecture specific clarifications
> before any software coherent use case can be guaranteed to work beyond the 3.1 ones
> for temporal sharing (only one accessing host at a time) and read only sharing where
> writes are dropped anyway so clean write back is irrelevant beyond some noise in
> logs possibly (if they do get logged it is considered so rare we don't care!).

Hi Jonathan,
	Allow me to discuss further. As described in CXL 3.1:
```
Software-managed coherency schemes are complicated by any host or device 
whose caching agents generate clean writebacks. A “No Clean Writebacks” 
capability bit is available for a host in the CXL System Description 
Structure (CSDS; see Section 9.18.1.6) or for a device in the DVSEC CXL 
Capability2 register (see Section 8.1.3.7).
```

If we check and find that the "No Clean Writebacks" bit is set in both the
CSDS and the DVSEC, can we then assume that software cache-coherency is
feasible, as outlined below?

(1) Both the writer and reader ensure cache flushes. Since there are no 
clean writebacks, there will be no background data writes.

(2) The writer writes data to shared memory and then executes a cache 
flush. If we trust the "No clean writeback" bit, we can assume that the 
data in shared memory is coherent.

(3) Before reading the data, the reader performs cache invalidation. 
Since there are no clean writebacks, this invalidation operation will 
not destroy the data written by the writer. Therefore, the data read by 
the reader should be the data written by the writer, and since the 
writer's cache is clean, it will not write data to shared memory during 
the reader's reading process. Additionally, data integrity can be ensured.
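
As an illustration only, the three steps above can be sketched with a toy
simulation: one line of shared memory and one private cached copy per host.
All names here (shared_line, host_cache, host_flush, run_handoff and so on)
are hypothetical and are not kernel or CXL APIs, and the model assumes that a
flush really reaches the shared memory, which is architecture specific. It
shows the handoff working when clean lines are silently dropped, and the
writer's data being clobbered when the reader's stale clean line is written
back.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LINE_SIZE 64

/* Toy model, hypothetical names: one line of shared CXL memory. */
struct shared_line { unsigned char data[LINE_SIZE]; };

/* One host's private cached copy of that line. */
struct host_cache {
	unsigned char data[LINE_SIZE];
	bool valid;	/* line is present in this host's cache */
	bool dirty;	/* line was modified locally */
};

/* Bring the line into the cache if it is not already there. */
static void cache_fill(struct host_cache *c, const struct shared_line *m)
{
	if (!c->valid) {
		memcpy(c->data, m->data, LINE_SIZE);
		c->valid = true;
		c->dirty = false;
	}
}

/* Store a string into the local cache only; shared memory is untouched. */
static void host_write(struct host_cache *c, const struct shared_line *m,
		       const char *s)
{
	size_t n = strlen(s) + 1;

	assert(n <= LINE_SIZE);
	cache_fill(c, m);
	memcpy(c->data, s, n);
	c->dirty = true;
}

/* Flush: write back a dirty line, then drop it from the cache. */
static void host_flush(struct host_cache *c, struct shared_line *m)
{
	if (c->valid && c->dirty)
		memcpy(m->data, c->data, LINE_SIZE);
	c->valid = false;
	c->dirty = false;
}

/* Invalidate: drop the line without writing it back. */
static void host_invalidate(struct host_cache *c)
{
	c->valid = false;
	c->dirty = false;
}

/* Read through the cache, filling it from shared memory on a miss. */
static const char *host_read(struct host_cache *c, const struct shared_line *m)
{
	cache_fill(c, m);
	return (const char *)c->data;
}

/* Background eviction; with clean writebacks even a clean line is written. */
static void host_evict(struct host_cache *c, struct shared_line *m,
		       bool clean_writebacks)
{
	if (c->valid && (c->dirty || clean_writebacks))
		memcpy(m->data, c->data, LINE_SIZE);
	c->valid = false;
	c->dirty = false;
}

/* Returns true if the reader ends up observing the writer's data. */
bool run_handoff(bool clean_writebacks)
{
	struct shared_line mem = { .data = "old" };
	struct host_cache writer = { 0 }, reader = { 0 };

	host_read(&reader, &mem);		/* reader caches a clean copy */
	host_write(&writer, &mem, "new");	/* step (2): write ... */
	host_flush(&writer, &mem);		/* ... then flush */
	host_evict(&reader, &mem, clean_writebacks); /* hardware eviction */
	host_invalidate(&reader);		/* step (3): invalidate */
	return strcmp(host_read(&reader, &mem), "new") == 0;
}
```

Calling run_handoff(false) models a platform that honours "No Clean
Writebacks"; run_handoff(true) models one that does not, where the reader's
stale line overwrites the writer's data before the reader invalidates.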

The first step for CBD should depend on hardware cache coherence, which 
is clearer and more feasible. Here, I am just exploring the possibility 
of software cache coherence, not insisting on implementing software 
cache-coherency right away. :)

Thanx
> 
>>>    
>>>>
>>>> CBD can initially support (3), and then transition to (1) when hardware
>>>> supports cache-coherency. If there's sufficient market demand, we can
>>>> also consider supporting (2).
>>> I'd assume only (3) works.  The others rely on assumptions I don't think
>>
>> I guess you mean (1), the hardware cache-coherency way, right?
> 
> Indeed - oops!
> Hardware coherency is the way to go, or a well defined and clearly document
> description of how to play with the various host architectures.
> 
> Jonathan
> 
> 
>>
>> :)
>> Thanx
>>
>>> you can rely on.
>>>
>>> Fun fun fun,
>>>
>>> Jonathan
>>>    
>>>>
>>>> How does this approach sound?
>>>>
>>>> Thanx
>>>>>
>>>>> J
>>>>>       
>>>>>>
>>>>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
>>>>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu
>>>>>> could do (or not do) in a cxl 2 environment that are not illegal because
>>>>>> they should not be observable in a no-shared-memory environment.
>>>>>>
>>>>>> CBD is interesting work, though for some of the reasons above I'm somewhat
>>>>>> skeptical of shared memory as an IPC mechanism.
>>>>>>
>>>>>> Regards,
>>>>>> John
>>>>>>
>>>>>>
>>>>>>      
>>>>>
>>>>> .
>>>>>       
>>>
>>> .
>>>    
> 
> 


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-09 11:24                             ` Dongsheng Yang
@ 2024-05-09 12:21                               ` Jonathan Cameron
  2024-05-09 13:03                                 ` Dongsheng Yang
  0 siblings, 1 reply; 52+ messages in thread
From: Jonathan Cameron @ 2024-05-09 12:21 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm

On Thu, 9 May 2024 19:24:28 +0800
Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:

> On Wed, 2024/5/8 at 11:44 PM, Jonathan Cameron wrote:
> > On Wed, 8 May 2024 21:03:54 +0800
> > Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> >   
> >> On Wed, 2024/5/8 at 8:11 PM, Jonathan Cameron wrote:
> >>> On Wed, 8 May 2024 19:39:23 +0800
> >>> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> >>>      
> >>>> On Fri, 2024/5/3 at 5:52 PM, Jonathan Cameron wrote:
> >>>>> On Sun, 28 Apr 2024 11:55:10 -0500
> >>>>> John Groves <John@groves.net> wrote:
> >>>>>         
> >>>>>> On 24/04/28 01:47PM, Dongsheng Yang wrote:  
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sat, 2024/4/27 at 12:14 AM, Gregory Price wrote:
> >>>>>>>> On Fri, Apr 26, 2024 at 10:53:43PM +0800, Dongsheng Yang wrote:  
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Fri, 2024/4/26 at 9:48 PM, Gregory Price wrote:
> >>>>>>>>>>            
> >>>>>>>>>        
> >>>>
> >>>> ...  
> >>>>>>
> >>>>>> Just to make things slightly gnarlier, the MESI cache coherency protocol
> >>>>>> allows a CPU to speculatively convert a line from exclusive to modified,
> >>>>>> meaning it's not clear as of now whether "occasional" clean write-backs
> >>>>>> can be avoided. Meaning those read-only mappings may be more important
> >>>>>> than one might think. (Clean write-backs basically make it
> >>>>>> impossible for software to manage cache coherency.)  
> >>>>>
> >>>>> My understanding is that clean write backs are an implementation specific
> >>>>> issue that came as a surprise to some CPU arch folk I spoke to, we will
> >>>>> need some path for a host to say if they can ever do that.
> >>>>>
> >>>>> Given this definitely effects one CPU vendor, maybe solutions that
> >>>>> rely on this not happening are not suitable for upstream.
> >>>>>
> >>>>> Maybe this market will be important enough for that CPU vendor to stop
> >>>>> doing it but if they do it will take a while...
> >>>>>
> >>>>> Flushing in general is as CPU architecture problem where each of the
> >>>>> architectures needs to be clear what they do / specify that their
> >>>>> licensees do.
> >>>>>
> >>>>> I'm with Dan on encouraging all memory vendors to do hardware coherence!  
> >>>>
> >>>> Hi Gregory, John, Jonathan and Dan:
> >>>> 	Thanx for your information, they help a lot, and sorry for the late reply.
> >>>>
> >>>> After some internal discussions, I think we can design it as follows:
> >>>>
> >>>> (1) If the hardware implements cache coherence, then the software layer
> >>>> doesn't need to consider this issue, and can perform read and write
> >>>> operations directly.  
> >>>
> >>> Agreed - this is one easier case.
> >>>      
> >>>>
> >>>> (2) If the hardware doesn't implement cache coherence, we can consider a
> >>>> DMA-like approach, where we check architectural features to determine if
> >>>> cache coherence is supported. This could be similar to
> >>>> `dev_is_dma_coherent`.  
> >>>
> >>> Ok. So this would combine host support checks with checking if the shared
> >>> memory on the device is multi host cache coherent (it will be single host
> >>> cache coherent which is what makes this messy)  
> >>>>
> >>>> Additionally, if the architecture supports flushing and invalidating CPU
> >>>> caches (`CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE`,
> >>>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU`,
> >>>> `CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL`),  
> >>>
> >>> Those particular calls won't tell you much at all. They indicate that a flush
> >>> can happen as far as a common point for DMA engines in the system. No
> >>> information on whether there are caches beyond that point.
> >>>      
> >>>>
> >>>> then we can handle cache coherence at the software layer.
> >>>> (For the clean writeback issue, I think it may also require
> >>>> clarification from the architecture, and how DMA handles the clean
> >>>> writeback problem, which I haven't further checked.)  
> >>>
> >>> I believe the relevant architecture only does IO coherent DMA so it is
> >>> never a problem (unlike with multihost cache coherence).
> >>
> >> Hi Jonathan,
> >> let me provide an example,
> >> In nvmeof-rdma, the `nvme_rdma_queue_rq` function places a request into
> >> `req->sqe.dma`.
> >>
> >> (1) First, it calls `ib_dma_sync_single_for_cpu()`, which invalidates
> >> the CPU cache:
> >>
> >>
> >> ib_dma_sync_single_for_cpu(dev, sqe->dma,
> >>                               sizeof(struct nvme_command), DMA_TO_DEVICE);
> >>
> >>
> >> For example, on ARM64, this would call `arch_sync_dma_for_cpu`, followed
> >> by `dcache_inval_poc(start, start + size)`.  
> > 
> > Key here is the POC. It's a flush to the point of coherence of the local
> > system.  It has no idea about interhost coherency and is not necessarily
> > the DRAM (in CXL or otherwise).
> > 
> > If you are doing software coherence, those devices will plug into today's
> > hosts and they have no idea that such a flush means pushing out into
> > the CXL fabric and to the type 3 device.
> >   
> >>
> >> (2) Setting up data related to the NVMe request.
> >>
> >> (3) then Calls `ib_dma_sync_single_for_device` to flush the CPU cache to
> >> DMA memory:
> >>
> >> ib_dma_sync_single_for_device(dev, sqe->dma,
> >>                                   sizeof(struct nvme_command),
> >> DMA_TO_DEVICE);
> >>
> >> Of course, if the hardware ensures cache coherency, the above operations
> >> are skipped. However, if the hardware does not guarantee cache
> >> coherency, RDMA appears to ensure cache coherency through this method.
> >>
> >> In the RDMA scenario, we also face the issue of multi-host cache
> >> coherence. so I'm thinking, can we adopt a similar approach in CXL
> >> shared memory to achieve data sharing?  
> > 
> > You don't face the same coherence issues, or at least not in the same way.
> > In that case the coherence guarantees are actually to the RDMA NIC.
> > It is guaranteed to see the clean data by the host - that may involve
> > flushes to PoC.  A one time snapshot is then sent to readers on other
> > hosts. If writes occur they are also guarantee to replace cached copies
> > on this host - because there is well define guarantee of IO coherence
> > or explicit cache maintenance to the PoC  
> right, the PoC is not point of cohenrence with other host. it sounds 
> correct. thanx.
> > 
> >     
> >>  
> >>>>
> >>>> (3) If the hardware doesn't implement cache coherence and the cpu
> >>>> doesn't support the required CPU cache operations, then we can run in
> >>>> nocache mode.  
> >>>
> >>> I suspect that gets you no where either.  Never believe an architecture
> >>> that provides a flag that says not to cache something.  That just means
> >>> you should not be able to tell that it is cached - many many implementations
> >>> actually cache such accesses.  
> >>
> >> Sigh, then that really makes thing difficult.  
> > 
> > Yes. I think we are going to have to wait on architecture specific clarifications
> > before any software coherent use case can be guaranteed to work beyond the 3.1 ones
> > for temporal sharing (only one accessing host at a time) and read only sharing where
> > writes are dropped anyway so clean write back is irrelevant beyond some noise in
> > logs possibly (if they do get logged it is considered so rare we don't care!).  
> 
> Hi Jonathan,
> 	Allow me to discuss further. As described in CXL 3.1:
> ```
> Software-managed coherency schemes are complicated by any host or device 
> whose caching agents generate clean writebacks. A “No Clean Writebacks” 
> capability bit is available for a host in the CXL System Description 
> Structure (CSDS; see Section 9.18.1.6) or for a device in the DVSEC CXL 
> Capability2 register (see Section 8.1.3.7).
> ```
> 
> If we check and find that the "No clean writeback" bit in both CSDS and 
> DVSEC is set, can we then assume that software cache-coherency is 
> feasible, as outlined below:
> 
> (1) Both the writer and reader ensure cache flushes. Since there are no 
> clean writebacks, there will be no background data writes.
> 
> (2) The writer writes data to shared memory and then executes a cache 
> flush. If we trust the "No clean writeback" bit, we can assume that the 
> data in shared memory is coherent.
> 
> (3) Before reading the data, the reader performs cache invalidation. 
> Since there are no clean writebacks, this invalidation operation will 
> not destroy the data written by the writer. Therefore, the data read by 
> the reader should be the data written by the writer, and since the 
> writer's cache is clean, it will not write data to shared memory during 
> the reader's reading process. Additionally, data integrity can be ensured.
> 
> The first step for CBD should depend on hardware cache coherence, which 
> is clearer and more feasible. Here, I am just exploring the possibility 
> of software cache coherence, not insisting on implementing software 
> cache-coherency right away. :)

Yes, if a platform sets that bit, you 'should' be fine.  Exactly which flush
is needed is architecture specific, however, and the DMA related ones
may not be sufficient. I'd keep an eye open for arch doc updates from the
various vendors.
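
To make that condition concrete, here is a rough sketch of how a driver
might gate its coherence mode. Everything in it is hypothetical: the struct,
the enum and cbd_pick_mode() are made-up names, the booleans stand in for
values that would have to be read from the CSDS and the DVSEC CXL Capability2
register (no bit positions are assumed here), and arch_flush_reaches_dev
stands for an architecture statement that does not exist yet for most hosts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical coherence modes for a shared-memory block driver. */
enum cbd_coherence_mode {
	CBD_COHERENCE_HW,	/* hardware keeps hosts coherent */
	CBD_COHERENCE_SW,	/* software flush/invalidate protocol */
	CBD_COHERENCE_NONE,	/* no safe multi-host sharing */
};

/* Hypothetical summary of what platform and device advertise. */
struct cbd_platform_caps {
	bool hw_coherent;		/* multi-host hardware coherence */
	bool host_no_clean_wb;		/* host bit, as reported via the CSDS */
	bool dev_no_clean_wb;		/* device bit, DVSEC CXL Capability2 */
	bool arch_flush_reaches_dev;	/* arch documents a flush that reaches
					 * the type 3 device, not just the PoC */
};

/*
 * Prefer hardware coherence. Fall back to software coherence only if
 * both "No Clean Writebacks" bits are set AND the architecture documents
 * a sufficient flush; the DMA sync helpers alone are not assumed to be.
 */
enum cbd_coherence_mode cbd_pick_mode(const struct cbd_platform_caps *caps)
{
	assert(caps);

	if (caps->hw_coherent)
		return CBD_COHERENCE_HW;
	if (caps->host_no_clean_wb && caps->dev_no_clean_wb &&
	    caps->arch_flush_reaches_dev)
		return CBD_COHERENCE_SW;
	return CBD_COHERENCE_NONE;
}
```

The point of the third flag is that the two capability bits alone say
nothing about which flush is sufficient on a given architecture.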

Also, the architecture that motivated that bit existing is a 'moderately
large' chip vendor so I'd go so far as to say adoption will be limited
unless they resolve that in a future implementation :)

Jonathan

> 
> Thanx
> >   
> >>>      
> >>>>
> >>>> CBD can initially support (3), and then transition to (1) when hardware
> >>>> supports cache-coherency. If there's sufficient market demand, we can
> >>>> also consider supporting (2).  
> >>> I'd assume only (3) works.  The others rely on assumptions I don't think  
> >>
> >> I guess you mean (1), the hardware cache-coherency way, right?  
> > 
> > Indeed - oops!
> > Hardware coherency is the way to go, or a well defined and clearly document
> > description of how to play with the various host architectures.
> > 
> > Jonathan
> > 
> >   
> >>
> >> :)
> >> Thanx
> >>  
> >>> you can rely on.
> >>>
> >>> Fun fun fun,
> >>>
> >>> Jonathan
> >>>      
> >>>>
> >>>> How does this approach sound?
> >>>>
> >>>> Thanx  
> >>>>>
> >>>>> J
> >>>>>         
> >>>>>>
> >>>>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
> >>>>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu
> >>>>>> could do (or not do) in a cxl 2 environment that are not illegal because
> >>>>>> they should not be observable in a no-shared-memory environment.
> >>>>>>
> >>>>>> CBD is interesting work, though for some of the reasons above I'm somewhat
> >>>>>> skeptical of shared memory as an IPC mechanism.
> >>>>>>
> >>>>>> Regards,
> >>>>>> John
> >>>>>>
> >>>>>>
> >>>>>>        
> >>>>>
> >>>>> .
> >>>>>         
> >>>
> >>> .
> >>>      
> > 
> >   



* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-09 12:21                               ` Jonathan Cameron
@ 2024-05-09 13:03                                 ` Dongsheng Yang
  2024-05-21 18:41                                   ` Dan Williams
  0 siblings, 1 reply; 52+ messages in thread
From: Dongsheng Yang @ 2024-05-09 13:03 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm



On Thu, 2024/5/9 at 8:21 PM, Jonathan Cameron wrote:
> On Thu, 9 May 2024 19:24:28 +0800
> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> 
...
>>> Yes. I think we are going to have to wait on architecture specific clarifications
>>> before any software coherent use case can be guaranteed to work beyond the 3.1 ones
>>> for temporal sharing (only one accessing host at a time) and read only sharing where
>>> writes are dropped anyway so clean write back is irrelevant beyond some noise in
>>> logs possibly (if they do get logged it is considered so rare we don't care!).
>>
>> Hi Jonathan,
>> 	Allow me to discuss further. As described in CXL 3.1:
>> ```
>> Software-managed coherency schemes are complicated by any host or device
>> whose caching agents generate clean writebacks. A “No Clean Writebacks”
>> capability bit is available for a host in the CXL System Description
>> Structure (CSDS; see Section 9.18.1.6) or for a device in the DVSEC CXL
>> Capability2 register (see Section 8.1.3.7).
>> ```
>>
>> If we check and find that the "No clean writeback" bit in both CSDS and
>> DVSEC is set, can we then assume that software cache-coherency is
>> feasible, as outlined below:
>>
>> (1) Both the writer and reader ensure cache flushes. Since there are no
>> clean writebacks, there will be no background data writes.
>>
>> (2) The writer writes data to shared memory and then executes a cache
>> flush. If we trust the "No clean writeback" bit, we can assume that the
>> data in shared memory is coherent.
>>
>> (3) Before reading the data, the reader performs cache invalidation.
>> Since there are no clean writebacks, this invalidation operation will
>> not destroy the data written by the writer. Therefore, the data read by
>> the reader should be the data written by the writer, and since the
>> writer's cache is clean, it will not write data to shared memory during
>> the reader's reading process. Additionally, data integrity can be ensured.
>>
>> The first step for CBD should depend on hardware cache coherence, which
>> is clearer and more feasible. Here, I am just exploring the possibility
>> of software cache coherence, not insisting on implementing software
>> cache-coherency right away. :)
> 
> Yes, if a platform sets that bit, you 'should' be fine.  What exact flush
> is needed is architecture specific however and the DMA related ones
> may not be sufficient. I'd keep an eye open for arch doc update from the
> various vendors.
> 
> Also, the architecture that motivated that bit existing is a 'moderately
> large' chip vendor so I'd go so far as to say adoption will be limited
> unless they resolve that in a future implementation :)

Great, I think we've had a good discussion and reached a consensus on
this issue. The remaining aspect will depend on hardware updates. Thank
you for the information; it helps a lot.

Thanx
> 
> Jonathan
> 
>>
>> Thanx
>>>    
>>>>>       
>>>>>>
>>>>>> CBD can initially support (3), and then transition to (1) when hardware
>>>>>> supports cache-coherency. If there's sufficient market demand, we can
>>>>>> also consider supporting (2).
>>>>> I'd assume only (3) works.  The others rely on assumptions I don't think
>>>>
>>>> I guess you mean (1), the hardware cache-coherency way, right?
>>>
>>> Indeed - oops!
>>> Hardware coherency is the way to go, or a well defined and clearly document
>>> description of how to play with the various host architectures.
>>>
>>> Jonathan
>>>
>>>    
>>>>
>>>> :)
>>>> Thanx
>>>>   
>>>>> you can rely on.
>>>>>
>>>>> Fun fun fun,
>>>>>
>>>>> Jonathan
>>>>>       
>>>>>>
>>>>>> How does this approach sound?
>>>>>>
>>>>>> Thanx
>>>>>>>
>>>>>>> J
>>>>>>>          
>>>>>>>>
>>>>>>>> Keep in mind that I don't think anybody has cxl 3 devices or CPUs yet, and
>>>>>>>> shared memory is not explicitly legal in cxl 2, so there are things a cpu
>>>>>>>> could do (or not do) in a cxl 2 environment that are not illegal because
>>>>>>>> they should not be observable in a no-shared-memory environment.
>>>>>>>>
>>>>>>>> CBD is interesting work, though for some of the reasons above I'm somewhat
>>>>>>>> skeptical of shared memory as an IPC mechanism.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> John
>>>>>>>>
>>>>>>>>
>>>>>>>>         
>>>>>>>
>>>>>>> .
>>>>>>>          
>>>>>
>>>>> .
>>>>>       
>>>
>>>    
> 


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-09 13:03                                 ` Dongsheng Yang
@ 2024-05-21 18:41                                   ` Dan Williams
  2024-05-22  6:17                                     ` Dongsheng Yang
  0 siblings, 1 reply; 52+ messages in thread
From: Dan Williams @ 2024-05-21 18:41 UTC (permalink / raw)
  To: Dongsheng Yang, Jonathan Cameron
  Cc: John Groves, Dan Williams, Gregory Price, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm

Dongsheng Yang wrote:
> On Thu, 2024/5/9 at 8:21 PM, Jonathan Cameron wrote:
[..]
> >> If we check and find that the "No clean writeback" bit in both CSDS and
> >> DVSEC is set, can we then assume that software cache-coherency is
> >> feasible, as outlined below:
> >>
> >> (1) Both the writer and reader ensure cache flushes. Since there are no
> >> clean writebacks, there will be no background data writes.
> >>
> >> (2) The writer writes data to shared memory and then executes a cache
> >> flush. If we trust the "No clean writeback" bit, we can assume that the
> >> data in shared memory is coherent.
> >>
> >> (3) Before reading the data, the reader performs cache invalidation.
> >> Since there are no clean writebacks, this invalidation operation will
> >> not destroy the data written by the writer. Therefore, the data read by
> >> the reader should be the data written by the writer, and since the
> >> writer's cache is clean, it will not write data to shared memory during
> >> the reader's reading process. Additionally, data integrity can be ensured.

What guarantees this property? How does the reader know that its local
cache invalidation is sufficient for reading data that has only reached
global visibility on the remote peer? As far as I can see, there is
nothing that guarantees that local global visibility translates to
remote visibility. In fact, the GPF feature is counter-evidence of the
fact that writes can be pending in buffers that are only flushed on a
GPF event.
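
To make the concern concrete, here is a toy model in Python. It is only a
sketch: SharedMedia, Host, and the "wpq" list are names invented for the
illustration, and nothing here models real CXL hardware. It assumes that a
cache flush moves dirty lines into a write-pending queue that software
cannot drain on demand.

```python
# Toy model only: SharedMedia, Host, and "wpq" are illustrative names,
# not anything defined by the CXL spec or the kernel.

class SharedMedia:
    """Memory contents that every host can eventually observe."""
    def __init__(self):
        self.cells = {}


class Host:
    def __init__(self, media):
        self.media = media
        self.cache = {}     # local CPU cache: addr -> value
        self.dirty = set()  # addresses modified in the local cache
        self.wpq = []       # writes that left the cache but not yet in media

    def write(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)

    def cache_flush(self):
        # Dirty lines leave the CPU cache but only reach the pending queue.
        for addr in sorted(self.dirty):
            self.wpq.append((addr, self.cache[addr]))
        self.dirty.clear()

    def cache_invalidate(self):
        # Drop all local copies so that the next read goes to media.
        self.cache.clear()
        self.dirty.clear()

    def read(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.media.cells.get(addr)
        return self.cache[addr]

    def drain_wpq(self):
        # Assumed to happen "eventually" (or on a GPF event), not on request.
        for addr, value in self.wpq:
            self.media.cells[addr] = value
        self.wpq.clear()


media = SharedMedia()
media.cells[0x40] = "old"
writer, reader = Host(media), Host(media)

writer.write(0x40, "new")
writer.cache_flush()        # step (2): the writer flushes its cache
reader.cache_invalidate()   # step (3): the reader invalidates its cache
stale = reader.read(0x40)   # still "old": the write sits in writer.wpq

writer.drain_wpq()
reader.cache_invalidate()
fresh = reader.read(0x40)   # "new" only once the queue has drained
```

In this model both hosts follow the proposed steps correctly and the reader
still observes the old value, because neither side controls when the queue
drains.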

I remain skeptical that a software managed inter-host cache-coherency
scheme can be made reliable with current CXL defined mechanisms.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-21 18:41                                   ` Dan Williams
@ 2024-05-22  6:17                                     ` Dongsheng Yang
  2024-05-29 15:25                                       ` Gregory Price
  0 siblings, 1 reply; 52+ messages in thread
From: Dongsheng Yang @ 2024-05-22  6:17 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron
  Cc: John Groves, Gregory Price, axboe, linux-block, linux-kernel,
	linux-cxl, nvdimm



On Wednesday, 2024/5/22 at 2:41 AM, Dan Williams wrote:
> Dongsheng Yang wrote:
>> On Thursday, 2024/5/9 at 8:21 PM, Jonathan Cameron wrote:
> [..]
>>>> If we check and find that the "No clean writeback" bit in both CSDS and
>>>> DVSEC is set, can we then assume that software cache-coherency is
>>>> feasible, as outlined below:
>>>>
>>>> (1) Both the writer and reader ensure cache flushes. Since there are no
>>>> clean writebacks, there will be no background data writes.
>>>>
>>>> (2) The writer writes data to shared memory and then executes a cache
>>>> flush. If we trust the "No clean writeback" bit, we can assume that the
>>>> data in shared memory is coherent.
>>>>
>>>> (3) Before reading the data, the reader performs cache invalidation.
>>>> Since there are no clean writebacks, this invalidation operation will
>>>> not destroy the data written by the writer. Therefore, the data read by
>>>> the reader should be the data written by the writer, and since the
>>>> writer's cache is clean, it will not write data to shared memory during
>>>> the reader's reading process. Additionally, data integrity can be ensured.
> 
> What guarantees this property? How does the reader know that its local
> cache invalidation is sufficient for reading data that has only reached
> global visibility on the remote peer? As far as I can see, there is
> nothing that guarantees that local global visibility translates to
> remote visibility. In fact, the GPF feature is counter-evidence of the
> fact that writes can be pending in buffers that are only flushed on a
> GPF event.

That sounds correct. From what I have learned about GPF, ADR, and eADR,
data can still be sitting in the WPQ even after we perform a CPU cache
line flush in the OS.

This means we don't have an explicit method to make data puncture all
caches and land in the media after writing. It also seems there isn't an
explicit method to invalidate all caches along the entire path.

> 
> I remain skeptical that a software managed inter-host cache-coherency
> scheme can be made reliable with current CXL defined mechanisms.


I got your point now: according to the current CXL spec, it seems
software-managed cache coherency for inter-host shared memory does not
work. Will the next version of the CXL spec consider it?
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-22  6:17                                     ` Dongsheng Yang
@ 2024-05-29 15:25                                       ` Gregory Price
  2024-05-30  6:59                                         ` Dongsheng Yang
  0 siblings, 1 reply; 52+ messages in thread
From: Gregory Price @ 2024-05-29 15:25 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Dan Williams, Jonathan Cameron, John Groves, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm

On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote:
> 
> 
> On Wednesday, 2024/5/22 at 2:41 AM, Dan Williams wrote:
> > Dongsheng Yang wrote:
> > 
> > What guarantees this property? How does the reader know that its local
> > cache invalidation is sufficient for reading data that has only reached
> > global visibility on the remote peer? As far as I can see, there is
> > nothing that guarantees that local global visibility translates to
> > remote visibility. In fact, the GPF feature is counter-evidence of the
> > fact that writes can be pending in buffers that are only flushed on a
> > GPF event.
> 
> Sounds correct. From what I learned from GPF, ADR, and eADR, there would
> still be data in WPQ even though we perform a CPU cache line flush in the
> OS.
> 
> This means we don't have an explicit method to make data puncture all caches
> and land in the media after writing. Also it seems there isn't an explicit
> method to invalidate all caches along the entire path.
> 
> > 
> > I remain skeptical that a software managed inter-host cache-coherency
> > scheme can be made reliable with current CXL defined mechanisms.
> 
> 
> I got your point now, according to the current CXL Spec, it seems software managed
> cache-coherency for inter-host shared memory is not working. Will the next
> version of CXL spec consider it?
> > 

Sorry for missing the conversation, have been out of office for a bit.

It's not just a CXL spec issue, though that is part of it. I think the
CXL spec would have to expose some form of puncturing flush, and this
makes the assumption that such a flush doesn't cause some kind of
race/deadlock issue.  Certainly this needs to be discussed.

However, consider that the upstream processor actually has to generate
this flush.  This means adding the flush to existing coherence protocols,
or at the very least a new instruction to generate the flush explicitly.
The latter seems more likely than the former.

This flush would need to ensure the data is forced out of the local WPQ
AND all WPQs south of the PCIe complex - because what you really want to
know is that the data has actually made it back to a place where remote
viewers are capable of perceiving the change.
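
As a rough sketch of what such a flush would have to do, the Python model
below treats the path as an ordered list of write buffers. The hop names and
both flush methods are hypothetical; no such "puncturing flush" exists in
the current spec or in any instruction set that I know of.

```python
# Illustrative only: the hop names and both flush methods are hypothetical.
from collections import deque


class Path:
    """Ordered write buffers between a CPU cache and shared media."""
    def __init__(self, hop_names):
        self.hops = [(name, deque()) for name in hop_names]
        self.media = {}

    def post_write(self, addr, value):
        # A cache-line flush only hands the write to the first buffer.
        self.hops[0][1].append((addr, value))

    def drain_hop(self, index):
        # Move everything queued at one hop to the next hop, or to media.
        _, queue = self.hops[index]
        while queue:
            addr, value = queue.popleft()
            if index + 1 < len(self.hops):
                self.hops[index + 1][1].append((addr, value))
            else:
                self.media[addr] = value

    def local_flush(self):
        # What a host can do today in this model: drain its own queue only.
        self.drain_hop(0)

    def puncturing_flush(self):
        # Drain every hop in order, so nothing is left pending anywhere.
        for index in range(len(self.hops)):
            self.drain_hop(index)

    def pending(self):
        return {name: len(queue) for name, queue in self.hops}


path = Path(["host WPQ", "switch buffer", "device WPQ"])
path.post_write(0x80, "payload")

path.local_flush()
after_local = (dict(path.media), path.pending())

path.puncturing_flush()
after_puncture = (dict(path.media), path.pending())
```

After local_flush() the write is still parked one hop further down and the
media is unchanged; only the flush that walks every hop makes it observable.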

So this means:
1) Spec revision with puncturing flush
2) Buy-in from CPU vendors to generate such a flush
3) A new instruction added to the architecture.

Call me in a decade or so.


But really, I think it likely we see hardware-coherence well before this.
For this reason, I have become skeptical of all but a few memory sharing
use cases that depend on software-controlled cache-coherency.

There are some (FAMFS, for example). The coherence state of these
systems tends to be less volatile (e.g. mappings are read-only), or
they have inherent design limitations (cacheline-sized message passing
via write-ahead logging only).

~Gregory

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-29 15:25                                       ` Gregory Price
@ 2024-05-30  6:59                                         ` Dongsheng Yang
  2024-05-30 13:38                                           ` Jonathan Cameron
  2024-05-31 14:23                                           ` Gregory Price
  0 siblings, 2 replies; 52+ messages in thread
From: Dongsheng Yang @ 2024-05-30  6:59 UTC (permalink / raw)
  To: Gregory Price
  Cc: Dan Williams, Jonathan Cameron, John Groves, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm



On Wednesday, 2024/5/29 at 11:25 PM, Gregory Price wrote:
> On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote:
>>
>>
>> On Wednesday, 2024/5/22 at 2:41 AM, Dan Williams wrote:
>>> Dongsheng Yang wrote:
>>>
>>> What guarantees this property? How does the reader know that its local
>>> cache invalidation is sufficient for reading data that has only reached
>>> global visibility on the remote peer? As far as I can see, there is
>>> nothing that guarantees that local global visibility translates to
>>> remote visibility. In fact, the GPF feature is counter-evidence of the
>>> fact that writes can be pending in buffers that are only flushed on a
>>> GPF event.
>>
>> Sounds correct. From what I learned from GPF, ADR, and eADR, there would
>> still be data in WPQ even though we perform a CPU cache line flush in the
>> OS.
>>
>> This means we don't have an explicit method to make data puncture all caches
>> and land in the media after writing. Also it seems there isn't an explicit
>> method to invalidate all caches along the entire path.
>>
>>>
>>> I remain skeptical that a software managed inter-host cache-coherency
>>> scheme can be made reliable with current CXL defined mechanisms.
>>
>>
>> I got your point now, according to the current CXL Spec, it seems software managed
>> cache-coherency for inter-host shared memory is not working. Will the next
>> version of CXL spec consider it?
>>>
> 
> Sorry for missing the conversation, have been out of office for a bit.
> 
> It's not just a CXL spec issue, though that is part of it. I think the
> CXL spec would have to expose some form of puncturing flush, and this
> makes the assumption that such a flush doesn't cause some kind of
> race/deadlock issue.  Certainly this needs to be discussed.
> 
> However, consider that the upstream processor actually has to generate
> this flush.  This means adding the flush to existing coherence protocols,
> or at the very least a new instruction to generate the flush explicitly.
> The latter seems more likely than the former.
> 
> This flush would need to ensure the data is forced out of the local WPQ
> AND all WPQs south of the PCIE complex - because what you really want to
> know is that the data has actually made it back to a place where remote
> viewers are capable of perceiving the change.
> 
> So this means:
> 1) Spec revision with puncturing flush
> 2) Buy-in from CPU vendors to generate such a flush
> 3) A new instruction added to the architecture.
> 
> Call me in a decade or so.
> 
> 
> But really, I think it likely we see hardware-coherence well before this.
> For this reason, I have become skeptical of all but a few memory sharing
> use cases that depend on software-controlled cache-coherency.

Hi Gregory,

	As I understand it, we actually have the same idea here. What I am
saying is that the spec needs to consider this issue, meaning it needs to
describe how the entire software-coherency mechanism operates, including
the necessary hardware support. Additionally, I agree that if
software coherency also requires hardware support, hardware coherency
seems to be the better path.
> 
> There are some (FAMFS, for example). The coherence state of these
> systems tend to be less volatile (e.g. mappings are read-only), or
> they have inherent design limitations (cacheline-sized message passing
> via write-ahead logging only).

Can you explain more about this? I understand that if the reader in the
writer-reader model uses a read-only mapping, the interaction will be
much simpler. However, after the writer writes data, if we don't have a
flush-and-invalidate mechanism that punctures all caches, how can the
read-only reader access the new data?
> 
> ~Gregory
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-30  6:59                                         ` Dongsheng Yang
@ 2024-05-30 13:38                                           ` Jonathan Cameron
  2024-06-01  3:22                                             ` Dan Williams
  2024-05-31 14:23                                           ` Gregory Price
  1 sibling, 1 reply; 52+ messages in thread
From: Jonathan Cameron @ 2024-05-30 13:38 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Gregory Price, Dan Williams, John Groves, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm, james.morse, Mark Rutland

On Thu, 30 May 2024 14:59:38 +0800
Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:

> On Wednesday, 2024/5/29 at 11:25 PM, Gregory Price wrote:
> > On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote:  
> >>
> >>
> >> On Wednesday, 2024/5/22 at 2:41 AM, Dan Williams wrote:
> >>> Dongsheng Yang wrote:
> >>>
> >>> What guarantees this property? How does the reader know that its local
> >>> cache invalidation is sufficient for reading data that has only reached
> >>> global visibility on the remote peer? As far as I can see, there is
> >>> nothing that guarantees that local global visibility translates to
> >>> remote visibility. In fact, the GPF feature is counter-evidence of the
> >>> fact that writes can be pending in buffers that are only flushed on a
> >>> GPF event.  
> >>
> >> Sounds correct. From what I learned from GPF, ADR, and eADR, there would
> >> still be data in WPQ even though we perform a CPU cache line flush in the
> >> OS.
> >>
> >> This means we don't have an explicit method to make data puncture all caches
> >> and land in the media after writing. Also it seems there isn't an explicit
> >> method to invalidate all caches along the entire path.
> >>  
> >>>
> >>> I remain skeptical that a software managed inter-host cache-coherency
> >>> scheme can be made reliable with current CXL defined mechanisms.  
> >>
> >>
> >> I got your point now, according to the current CXL Spec, it seems software managed
> >> cache-coherency for inter-host shared memory is not working. Will the next
> >> version of CXL spec consider it?  
> >>>  
> > 
> > Sorry for missing the conversation, have been out of office for a bit.
> > 
> > It's not just a CXL spec issue, though that is part of it. I think the
> > CXL spec would have to expose some form of puncturing flush, and this
> > makes the assumption that such a flush doesn't cause some kind of
> > race/deadlock issue.  Certainly this needs to be discussed.
> > 
> > However, consider that the upstream processor actually has to generate
> > this flush.  This means adding the flush to existing coherence protocols,
> > or at the very least a new instruction to generate the flush explicitly.
> > The latter seems more likely than the former.
> > 
> > This flush would need to ensure the data is forced out of the local WPQ
> > AND all WPQs south of the PCIE complex - because what you really want to
> > know is that the data has actually made it back to a place where remote
> > viewers are capable of perceiving the change.
> > 
> > So this means:
> > 1) Spec revision with puncturing flush
> > 2) Buy-in from CPU vendors to generate such a flush
> > 3) A new instruction added to the architecture.
> > 
> > Call me in a decade or so.
> > 
> > 
> > But really, I think it likely we see hardware-coherence well before this.
> > For this reason, I have become skeptical of all but a few memory sharing
> > use cases that depend on software-controlled cache-coherency.  
> 
> Hi Gregory,
> 
> 	From my understanding, we actually has the same idea here. What I am 
> saying is that we need SPEC to consider this issue, meaning we need to 
> describe how the entire software-coherency mechanism operates, which 
> includes the necessary hardware support. Additionally, I agree that if 
> software-coherency also requires hardware support, it seems that 
> hardware-coherency is the better path.
> > 
> > There are some (FAMFS, for example). The coherence state of these
> > systems tend to be less volatile (e.g. mappings are read-only), or
> > they have inherent design limitations (cacheline-sized message passing
> > via write-ahead logging only).  
> 
> Can you explain more about this? I understand that if the reader in the 
> writer-reader model is using a readonly mapping, the interaction will be 
> much simpler. However, after the writer writes data, if we don't have a 
> mechanism to flush and invalidate puncturing all caches, how can the 
> readonly reader access the new data?

There is a mechanism for doing coarse grained flushing that is known to
work on some architectures. Look at cpu_cache_invalidate_memregion().
On Intel/x86 it's wbinvd_on_all_cpus();
on arm64 it's a PSCI firmware call CLEAN_INV_MEMREGION (there is a
public alpha specification for PSCI 1.3 with that defined but we
don't yet have kernel code.)

These are very big hammers and so unsuited for anything fine grained.
In the extreme end of possible implementations they briefly stop all
CPUs and clean and invalidate all caches of all types.  So not suited
to anything fine grained, but may be acceptable for a rare setup event,
particularly if the main job of the writing host is to fill that memory
for lots of other hosts to use.

At least the ARM one takes a range so allows for a less painful
implementation.  I'm assuming we'll see new architecture over time
but this is a different (and potentially easier) problem space
to what you need.

Jonathan



> > ~Gregory
> >   


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-30  6:59                                         ` Dongsheng Yang
  2024-05-30 13:38                                           ` Jonathan Cameron
@ 2024-05-31 14:23                                           ` Gregory Price
  2024-06-03  1:33                                             ` Dongsheng Yang
  1 sibling, 1 reply; 52+ messages in thread
From: Gregory Price @ 2024-05-31 14:23 UTC (permalink / raw)
  To: Dongsheng Yang
  Cc: Dan Williams, Jonathan Cameron, John Groves, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm

On Thu, May 30, 2024 at 02:59:38PM +0800, Dongsheng Yang wrote:
> 
> 
> On Wednesday, 2024/5/29 at 11:25 PM, Gregory Price wrote:
> > 
> > There are some (FAMFS, for example). The coherence state of these
> > systems tend to be less volatile (e.g. mappings are read-only), or
> > they have inherent design limitations (cacheline-sized message passing
> > via write-ahead logging only).
> 
> Can you explain more about this? I understand that if the reader in the
> writer-reader model is using a readonly mapping, the interaction will be
> much simpler. However, after the writer writes data, if we don't have a
> mechanism to flush and invalidate puncturing all caches, how can the
> readonly reader access the new data?

This is exactly right, so the coherence/correctness of the data needs to
be enforced in some other way.

Generally speaking, the WPQs will *eventually* get flushed.  As such,
the memory will *eventually* become coherent.  So if you set up the
following pattern, you will end up with an "eventually coherent" system:

1) Writer instantiates the memory to be used
2) Writer calculates and records a checksum of that data into memory
3) Writer invalidates everything
4) Reader maps the memory
5) Reader reads the checksum and calculates the checksum of the data
   a) if the checksums match, the data is coherent
   b) if they don't, we must wait longer for the queues to flush

This is just one example of a system design which enforces coherence by
placing the limitation on the system that the data will never change
once it becomes coherent.
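
A minimal sketch of that pattern in Python, with a bytearray standing in for
the shared region. The record layout (a 4-byte CRC32 followed by the
payload) and the function names are assumptions made for this example; a
real reader would also need to invalidate its local cache before every
retry, which the model leaves out.

```python
# Sketch only: the record layout (CRC32 header + payload) and all function
# names are assumptions made for this example.
import struct
import time
import zlib

HEADER = struct.Struct("<I")  # little-endian CRC32 of the payload


def publish(region, payload):
    """Writer side: store checksum + payload; never modified afterwards."""
    record = HEADER.pack(zlib.crc32(payload)) + payload
    region[:len(record)] = record
    return len(record)


def try_read(region, length):
    """Reader side: return the payload if it validates, otherwise None."""
    (stored,) = HEADER.unpack_from(region, 0)
    payload = bytes(region[HEADER.size:length])
    return payload if zlib.crc32(payload) == stored else None


def read_when_coherent(region, length, retries=10, delay=0.0):
    # Poll until the checksum matches or give up. Every attempt costs a
    # full pass over the data, so validation speed bounds throughput.
    for _ in range(retries):
        payload = try_read(region, length)
        if payload is not None:
            return payload
        time.sleep(delay)
    return None


region = bytearray(64)
n = publish(region, b"immutable block")
complete = read_when_coherent(region, n)   # checksum matches

torn = bytearray(region)
torn[HEADER.size] ^= 0xFF                  # one payload byte "not yet visible"
partial = read_when_coherent(torn, n, retries=3)   # never validates: None
```

The scheme only works because the record is written once and never updated;
a writer that modified the payload in place would reopen the window in which
checksum and data disagree.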

Whatever the case, regardless of the scheme you come up with, you will
end up with a system where the data must be inspected and validated
before it can be used.  This has the limiting factor of performance:
throughput will be limited by how fast you can validate the data.

~Gregory

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-30 13:38                                           ` Jonathan Cameron
@ 2024-06-01  3:22                                             ` Dan Williams
  2024-06-03 12:48                                               ` Jonathan Cameron
  0 siblings, 1 reply; 52+ messages in thread
From: Dan Williams @ 2024-06-01  3:22 UTC (permalink / raw)
  To: Jonathan Cameron, Dongsheng Yang
  Cc: Gregory Price, Dan Williams, John Groves, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm, james.morse, Mark Rutland

Jonathan Cameron wrote:
> On Thu, 30 May 2024 14:59:38 +0800
> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> 
> > On Wednesday, 2024/5/29 at 11:25 PM, Gregory Price wrote:
> > > On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote:  
> > >>
> > >>
> > >> On Wednesday, 2024/5/22 at 2:41 AM, Dan Williams wrote:
> > >>> Dongsheng Yang wrote:
> > >>>
> > >>> What guarantees this property? How does the reader know that its local
> > >>> cache invalidation is sufficient for reading data that has only reached
> > >>> global visibility on the remote peer? As far as I can see, there is
> > >>> nothing that guarantees that local global visibility translates to
> > >>> remote visibility. In fact, the GPF feature is counter-evidence of the
> > >>> fact that writes can be pending in buffers that are only flushed on a
> > >>> GPF event.  
> > >>
> > >> Sounds correct. From what I learned from GPF, ADR, and eADR, there would
> > >> still be data in WPQ even though we perform a CPU cache line flush in the
> > >> OS.
> > >>
> > >> This means we don't have an explicit method to make data puncture all caches
> > >> and land in the media after writing. Also it seems there isn't an explicit
> > >> method to invalidate all caches along the entire path.
> > >>  
> > >>>
> > >>> I remain skeptical that a software managed inter-host cache-coherency
> > >>> scheme can be made reliable with current CXL defined mechanisms.  
> > >>
> > >>
> > >> I got your point now, according to the current CXL Spec, it seems software managed
> > >> cache-coherency for inter-host shared memory is not working. Will the next
> > >> version of CXL spec consider it?  
> > >>>  
> > > 
> > > Sorry for missing the conversation, have been out of office for a bit.
> > > 
> > > It's not just a CXL spec issue, though that is part of it. I think the
> > > CXL spec would have to expose some form of puncturing flush, and this
> > > makes the assumption that such a flush doesn't cause some kind of
> > > race/deadlock issue.  Certainly this needs to be discussed.
> > > 
> > > However, consider that the upstream processor actually has to generate
> > > this flush.  This means adding the flush to existing coherence protocols,
> > > or at the very least a new instruction to generate the flush explicitly.
> > > The latter seems more likely than the former.
> > > 
> > > This flush would need to ensure the data is forced out of the local WPQ
> > > AND all WPQs south of the PCIE complex - because what you really want to
> > > know is that the data has actually made it back to a place where remote
> > > viewers are capable of perceiving the change.
> > > 
> > > So this means:
> > > 1) Spec revision with puncturing flush
> > > 2) Buy-in from CPU vendors to generate such a flush
> > > 3) A new instruction added to the architecture.
> > > 
> > > Call me in a decade or so.
> > > 
> > > 
> > > But really, I think it likely we see hardware-coherence well before this.
> > > For this reason, I have become skeptical of all but a few memory sharing
> > > use cases that depend on software-controlled cache-coherency.  
> > 
> > Hi Gregory,
> > 
> > 	From my understanding, we actually has the same idea here. What I am 
> > saying is that we need SPEC to consider this issue, meaning we need to 
> > describe how the entire software-coherency mechanism operates, which 
> > includes the necessary hardware support. Additionally, I agree that if 
> > software-coherency also requires hardware support, it seems that 
> > hardware-coherency is the better path.
> > > 
> > > There are some (FAMFS, for example). The coherence state of these
> > > systems tend to be less volatile (e.g. mappings are read-only), or
> > > they have inherent design limitations (cacheline-sized message passing
> > > via write-ahead logging only).  
> > 
> > Can you explain more about this? I understand that if the reader in the 
> > writer-reader model is using a readonly mapping, the interaction will be 
> > much simpler. However, after the writer writes data, if we don't have a 
> > mechanism to flush and invalidate puncturing all caches, how can the 
> > readonly reader access the new data?
> 
> There is a mechanism for doing coarse grained flushing that is known to
> work on some architectures. Look at cpu_cache_invalidate_memregion().
> On Intel/x86 it's wbinvd_on_all_cpus();

There is no guarantee on x86 that, after cpu_cache_invalidate_memregion(),
a remote shared memory consumer will see the writes from that event.

> on arm64 it's a PSCI firmware call CLEAN_INV_MEMREGION (there is a
> public alpha specification for PSCI 1.3 with that defined but we
> don't yet have kernel code.)

That punches visibility through CXL shared memory devices?

> These are very big hammers and so unsuited for anything fine grained.
> In the extreme end of possible implementations they briefly stop all
> CPUs and clean and invalidate all caches of all types.  So not suited
> to anything fine grained, but may be acceptable for a rare setup event,
> particularly if the main job of the writing host is to fill that memory
> for lots of other hosts to use.
> 
> At least the ARM one takes a range so allows for a less painful
> implementation.  I'm assuming we'll see new architecture over time
> but this is a different (and potentially easier) problem space
> to what you need.

cpu_cache_invalidate_memregion() is only about making sure the local CPU
sees new contents after a DPA:HPA remap event. I hope CPUs are able to
get away from that responsibility long term when / if future memory
expanders just issue back-invalidate automatically when the HDM decoder
configuration changes.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-05-31 14:23                                           ` Gregory Price
@ 2024-06-03  1:33                                             ` Dongsheng Yang
  0 siblings, 0 replies; 52+ messages in thread
From: Dongsheng Yang @ 2024-06-03  1:33 UTC (permalink / raw)
  To: Gregory Price
  Cc: Dan Williams, Jonathan Cameron, John Groves, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm



On Friday, 2024/5/31 at 10:23 PM, Gregory Price wrote:
> On Thu, May 30, 2024 at 02:59:38PM +0800, Dongsheng Yang wrote:
>>
>>
>> On Wednesday, 2024/5/29 at 11:25 PM, Gregory Price wrote:
>>>
>>> There are some (FAMFS, for example). The coherence state of these
>>> systems tend to be less volatile (e.g. mappings are read-only), or
>>> they have inherent design limitations (cacheline-sized message passing
>>> via write-ahead logging only).
>>
>> Can you explain more about this? I understand that if the reader in the
>> writer-reader model is using a readonly mapping, the interaction will be
>> much simpler. However, after the writer writes data, if we don't have a
>> mechanism to flush and invalidate puncturing all caches, how can the
>> readonly reader access the new data?
> 
> This is exactly right, so the coherence/correctness of the data needs to
> be enforced in some other way.
> 
> Generally speaking, the WPQs will *eventually* get flushed.  As such,
> the memory will *eventually* become coherent.  So if you set up the
> following pattern, you will end up with an "eventually coherent" system


Yes, it is "eventually coherent" if the "No clean writeback" bit is set
in both CSDS and DVSEC.
> 
> 1) Writer instantiates the memory to be used
> 2) Writer calculates and records a checksum of that data into memory
> 3) Writer invalidates everything
> 4) Reader maps the memory
> 5) Reader reads the checksum and calculates the checksum of the data
>     a) if the checksums match, the data is coherent
>     b) if they don't, we must wait longer for the queues to flush

Yes, John mentioned the checksum approach; it is used in FAMFS/pcq_lib.c,
where pcq uses a sequence and a checksum in the consumer to ensure data
consistency.

I think it's a good idea and I was planning to introduce it into cbd. Of
course it should be optional, since cbd currently only supports
hardware-consistency usage; it can be an option for data verification.

Thanks
> 
> This is just one example of a system design which enforces coherence by
> placing the limitation on the system that the data will never change
> once it becomes coherent.
> 
> Whatever the case, regardless of the scheme you come up with, you will
> end up with a system where the data must be inspected and validated
> before it can be used.  This has the limiting factor of performance:
> throughput will be limited by how fast you can validate the data.
> 
> ~Gregory

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-06-01  3:22                                             ` Dan Williams
@ 2024-06-03 12:48                                               ` Jonathan Cameron
  2024-06-03 17:28                                                 ` James Morse
  0 siblings, 1 reply; 52+ messages in thread
From: Jonathan Cameron @ 2024-06-03 12:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dongsheng Yang, Gregory Price, John Groves, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm, james.morse, Mark Rutland

On Fri, 31 May 2024 20:22:42 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > On Thu, 30 May 2024 14:59:38 +0800
> > Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
> >   
> > > On Wednesday, 2024/5/29 at 11:25 PM, Gregory Price wrote:
> > > > On Wed, May 22, 2024 at 02:17:38PM +0800, Dongsheng Yang wrote:    
> > > >>
> > > >>
> > > >> On Wednesday, 2024/5/22 at 2:41 AM, Dan Williams wrote:
> > > >>> Dongsheng Yang wrote:
> > > >>>
> > > >>> What guarantees this property? How does the reader know that its local
> > > >>> cache invalidation is sufficient for reading data that has only reached
> > > >>> global visibility on the remote peer? As far as I can see, there is
> > > >>> nothing that guarantees that local global visibility translates to
> > > >>> remote visibility. In fact, the GPF feature is counter-evidence of the
> > > >>> fact that writes can be pending in buffers that are only flushed on a
> > > >>> GPF event.    
> > > >>
> > > >> Sounds correct. From what I learned from GPF, ADR, and eADR, there would
> > > >> still be data in WPQ even though we perform a CPU cache line flush in the
> > > >> OS.
> > > >>
> > > >> This means we don't have an explicit method to make data puncture all caches
> > > >> and land in the media after writing. Also it seems there isn't an explicit
> > > >> method to invalidate all caches along the entire path.
> > > >>    
> > > >>>
> > > >>> I remain skeptical that a software managed inter-host cache-coherency
> > > >>> scheme can be made reliable with current CXL defined mechanisms.    
> > > >>
> > > >>
> > > >> I got your point now, according to the current CXL Spec, it seems software managed
> > > >> cache-coherency for inter-host shared memory is not working. Will the next
> > > >> version of CXL spec consider it?    
> > > >>>    
> > > > 
> > > > Sorry for missing the conversation, have been out of office for a bit.
> > > > 
> > > > It's not just a CXL spec issue, though that is part of it. I think the
> > > > CXL spec would have to expose some form of puncturing flush, and this
> > > > makes the assumption that such a flush doesn't cause some kind of
> > > > race/deadlock issue.  Certainly this needs to be discussed.
> > > > 
> > > > However, consider that the upstream processor actually has to generate
> > > > this flush.  This means adding the flush to existing coherence protocols,
> > > > or at the very least a new instruction to generate the flush explicitly.
> > > > The latter seems more likely than the former.
> > > > 
> > > > This flush would need to ensure the data is forced out of the local WPQ
> > > > AND all WPQs south of the PCIE complex - because what you really want to
> > > > know is that the data has actually made it back to a place where remote
> > > > > viewers are capable of perceiving the change.
> > > > 
> > > > So this means:
> > > > 1) Spec revision with puncturing flush
> > > > 2) Buy-in from CPU vendors to generate such a flush
> > > > 3) A new instruction added to the architecture.
> > > > 
> > > > Call me in a decade or so.
> > > > 
> > > > 
> > > > But really, I think it likely we see hardware-coherence well before this.
> > > > For this reason, I have become skeptical of all but a few memory sharing
> > > > use cases that depend on software-controlled cache-coherency.    
> > > 
> > > Hi Gregory,
> > > 
> > > 	From my understanding, we actually has the same idea here. What I am 
> > > saying is that we need SPEC to consider this issue, meaning we need to 
> > > describe how the entire software-coherency mechanism operates, which 
> > > includes the necessary hardware support. Additionally, I agree that if 
> > > software-coherency also requires hardware support, it seems that 
> > > hardware-coherency is the better path.  
> > > > 
> > > > There are some (FAMFS, for example). The coherence state of these
> > > > > systems tends to be less volatile (e.g. mappings are read-only), or
> > > > they have inherent design limitations (cacheline-sized message passing
> > > > via write-ahead logging only).    
> > > 
> > > Can you explain more about this? I understand that if the reader in the 
> > > writer-reader model is using a readonly mapping, the interaction will be 
> > > much simpler. However, after the writer writes data, if we don't have a 
> > > mechanism to flush and invalidate puncturing all caches, how can the 
> > > readonly reader access the new data?  
> > 
> > There is a mechanism for doing coarse grained flushing that is known to
> > work on some architectures. Look at cpu_cache_invalidate_memregion().
> > > On intel/x86 it's wbinvd_on_all_cpus()  
> 
> > There is no guarantee on x86 that, after cpu_cache_invalidate_memregion(),
> > a remote shared memory consumer can be assured to see the writes
> > from that event.

I was wondering about that after I wrote this...  I guess it guarantees
we won't get a late landing write or is that not even true?

So if we remove memory, then add fresh memory again quickly enough,
can we get a leftover write showing up?  I guess that doesn't matter, as
the kernel will chase it with a memset(0) anyway and that will be ordered,
as it is to the same address.

However, we won't be able to elide that zeroing even if we know the device
did it, which makes some operations the device might support rather
pointless :(

> 
> > on arm64 it's a PSCI firmware call CLEAN_INV_MEMREGION (there is a
> > public alpha specification for PSCI 1.3 with that defined but we
> > don't yet have kernel code.)  
> 
> That punches visibility through CXL shared memory devices?

It's a draft spec and Mark + James in +CC can hopefully confirm.
It does say
"Cleans and invalidates all caches, including system caches".
which I'd read as meaning it should but good to confirm.

> 
> > These are very big hammers and so unsuited for anything fine grained.
> > In the extreme end of possible implementations they briefly stop all
> > CPUs and clean and invalidate all caches of all types.  So not suited
> > to anything fine grained, but may be acceptable for a rare setup event,
> > particularly if the main job of the writing host is to fill that memory
> > for lots of other hosts to use.
> > 
> > At least the ARM one takes a range so allows for a less painful
> > implementation.  I'm assuming we'll see new architecture over time
> > but this is a different (and potentially easier) problem space
> > to what you need.  
> 
> cpu_cache_invalidate_memregion() is only about making sure local CPU
> > sees new contents after a DPA:HPA remap event. I hope CPUs are able to
> get away from that responsibility long term when / if future memory
> expanders just issue back-invalidate automatically when the HDM decoder
> configuration changes.

I would love that to be the way things go, but I fear the overheads of
doing that on the protocol mean people will want the option of the painful
approach.

Jonathan
 



* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-06-03 12:48                                               ` Jonathan Cameron
@ 2024-06-03 17:28                                                 ` James Morse
  2024-06-04 14:26                                                   ` Jonathan Cameron
  0 siblings, 1 reply; 52+ messages in thread
From: James Morse @ 2024-06-03 17:28 UTC (permalink / raw)
  To: Jonathan Cameron, Dan Williams
  Cc: Dongsheng Yang, Gregory Price, John Groves, axboe, linux-block,
	linux-kernel, linux-cxl, nvdimm, Mark Rutland

Hi guys,

On 03/06/2024 13:48, Jonathan Cameron wrote:
> On Fri, 31 May 2024 20:22:42 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
>> Jonathan Cameron wrote:
>>> On Thu, 30 May 2024 14:59:38 +0800
>>> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>>>> On Wednesday, 2024/5/29 at 11:25 PM, Gregory Price wrote:
>>>>> It's not just a CXL spec issue, though that is part of it. I think the
>>>>> CXL spec would have to expose some form of puncturing flush, and this
>>>>> makes the assumption that such a flush doesn't cause some kind of
>>>>> race/deadlock issue.  Certainly this needs to be discussed.
>>>>>
>>>>> However, consider that the upstream processor actually has to generate
>>>>> this flush.  This means adding the flush to existing coherence protocols,
>>>>> or at the very least a new instruction to generate the flush explicitly.
>>>>> The latter seems more likely than the former.
>>>>>
>>>>> This flush would need to ensure the data is forced out of the local WPQ
>>>>> AND all WPQs south of the PCIE complex - because what you really want to
>>>>> know is that the data has actually made it back to a place where remote
>>>>> viewers are capable of perceiving the change.
>>>>>
>>>>> So this means:
>>>>> 1) Spec revision with puncturing flush
>>>>> 2) Buy-in from CPU vendors to generate such a flush
>>>>> 3) A new instruction added to the architecture.
>>>>>
>>>>> Call me in a decade or so.
>>>>>
>>>>>
>>>>> But really, I think it likely we see hardware-coherence well before this.
>>>>> For this reason, I have become skeptical of all but a few memory sharing
>>>>> use cases that depend on software-controlled cache-coherency.    
>>>>
>>>> Hi Gregory,
>>>>
>>>> 	From my understanding, we actually have the same idea here. What I am 
>>>> saying is that we need the spec to consider this issue, meaning we need to 
>>>> describe how the entire software-coherency mechanism operates, which 
>>>> includes the necessary hardware support. Additionally, I agree that if 
>>>> software-coherency also requires hardware support, it seems that 
>>>> hardware-coherency is the better path.  
>>>>>
>>>>> There are some (FAMFS, for example). The coherence state of these
>>>>> systems tends to be less volatile (e.g. mappings are read-only), or
>>>>> they have inherent design limitations (cacheline-sized message passing
>>>>> via write-ahead logging only).    
>>>>
>>>> Can you explain more about this? I understand that if the reader in the 
>>>> writer-reader model is using a readonly mapping, the interaction will be 
>>>> much simpler. However, after the writer writes data, if we don't have a 
>>>> mechanism to flush and invalidate puncturing all caches, how can the 
>>>> readonly reader access the new data?  
>>>
>>> There is a mechanism for doing coarse grained flushing that is known to
>>> work on some architectures. Look at cpu_cache_invalidate_memregion().
>>> On intel/x86 it's wbinvd_on_all_cpus()  
>>
>> There is no guarantee on x86 that, after cpu_cache_invalidate_memregion(),
>> a remote shared memory consumer can be assured to see the writes
>> from that event.
> 
> I was wondering about that after I wrote this...  I guess it guarantees
> we won't get a late landing write or is that not even true?
> 
> So if we remove memory, then add fresh memory again quickly enough,
> can we get a leftover write showing up?  I guess that doesn't matter, as
> the kernel will chase it with a memset(0) anyway and that will be ordered,
> as it is to the same address.
> 
> However, we won't be able to elide that zeroing even if we know the device
> did it, which makes some operations the device might support rather
> pointless :(

>>> on arm64 it's a PSCI firmware call CLEAN_INV_MEMREGION (there is a
>>> public alpha specification for PSCI 1.3 with that defined but we
>>> don't yet have kernel code.)  

I have an RFC for that - but I haven't had time to update and re-test it.

If you need this, and have a platform where it can be implemented, please get in touch
with the people that look after the specs to move it along from alpha.


>> That punches visibility through CXL shared memory devices?

> It's a draft spec and Mark + James in +CC can hopefully confirm.
> It does say
> "Cleans and invalidates all caches, including system caches".
> which I'd read as meaning it should but good to confirm.

It's intended to remove any cached entries - including lines in what the arm-arm calls
"invisible" system caches, which typically only platform firmware can touch. The next
access should have to go all the way to the media. (I don't know enough about CXL to say
what a remote shared memory consumer observes)

Without it, all we have are the by-VA operations which are painfully slow for large
regions, and insufficient for system caches.
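
To put a rough number on "painfully slow", here is a user-space sketch that
only counts how many per-line operations a by-VA loop would have to issue for
a given range. It performs no cache maintenance. The helper name and the
64-byte line size are assumptions for illustration; real hardware reports its
own line size.

```c
#include <stdint.h>

/* Assumed cache line size in bytes; real systems must query the hardware. */
#define LINE_SIZE 64u

/*
 * Hypothetical helper: number of per-line maintenance operations a by-VA
 * clean+invalidate loop would issue to cover [addr, addr + len).
 * This is a counting model only and touches no caches.
 */
static uint64_t by_va_line_ops(uint64_t addr, uint64_t len)
{
	uint64_t mask = (uint64_t)LINE_SIZE - 1;
	uint64_t start, end;

	if (len == 0)
		return 0;

	start = addr & ~mask;                /* round start down to a line */
	end = (addr + len + mask) & ~mask;   /* round end up to a line */

	return (end - start) / LINE_SIZE;
}
```

Under those assumptions a 1 GiB region comes out at 16777216 per-line
operations, each of which the CPU has to issue individually, and none of
which reach an invisible system cache.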

As with all those firmware interfaces - it's for the platform implementer to wire up
whatever is necessary to remove cached content for the specified range. Just because there
is an (alpha!) spec doesn't mean it can be supported efficiently by a particular platform.


>>> These are very big hammers and so unsuited for anything fine grained.

You forgot really ugly too!


>>> In the extreme end of possible implementations they briefly stop all
>>> CPUs and clean and invalidate all caches of all types. So not suited
>>> to anything fine grained, but may be acceptable for a rare setup event,
>>> particularly if the main job of the writing host is to fill that memory
>>> for lots of other hosts to use.
>>>
>>> At least the ARM one takes a range so allows for a less painful
>>> implementation. 

That is to allow some ranges to fail. (e.g. you can do this to the CXL windows, but not
the regular DRAM).
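
A minimal user-space model of that behaviour, assuming a made-up window table
and made-up return codes (none of these names or values come from the PSCI
spec): the call succeeds only when the whole range sits inside a window the
platform has said it can flush, and fails otherwise.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes for this model; not the PSCI values. */
#define FLUSH_OK             0
#define FLUSH_NOT_SUPPORTED (-1)

struct mem_window {
	uint64_t base;
	uint64_t size;
};

/* Made-up example: one 16 GiB CXL window starting at 64 GiB. */
static const struct mem_window flushable_windows[] = {
	{ 0x1000000000ULL, 0x400000000ULL },
};

/*
 * Model of a range-based clean+invalidate request. It only decides whether
 * the platform would accept the range; no cache maintenance is done.
 */
static int model_clean_inv_memregion(uint64_t base, uint64_t size)
{
	size_t n = sizeof(flushable_windows) / sizeof(flushable_windows[0]);
	size_t i;

	/* Reject empty ranges and ranges that wrap the address space. */
	if (size == 0 || base + size < base)
		return FLUSH_NOT_SUPPORTED;

	for (i = 0; i < n; i++) {
		const struct mem_window *w = &flushable_windows[i];

		/* The whole range must fit inside one flushable window. */
		if (base >= w->base && base + size <= w->base + w->size)
			return FLUSH_OK;
	}

	/* e.g. regular DRAM, or a range straddling a window boundary */
	return FLUSH_NOT_SUPPORTED;
}
```

With this table, a request inside the window is accepted, while a request
against an address in ordinary DRAM or one that runs past the end of the
window is refused, so the caller has to be prepared for failure.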

On the less painful implementation, arm's interconnect has a gadget that does "Address
based flush" which could be used here. I'd hope platforms with that don't need to
interrupt all CPUs - but it depends on what else needs to be done.


>>> I'm assuming we'll see new architecture over time
>>> but this is a different (and potentially easier) problem space
>>> to what you need.  
>>
>> cpu_cache_invalidate_memregion() is only about making sure local CPU
>> sees new contents after a DPA:HPA remap event. I hope CPUs are able to
>> get away from that responsibility long term when / if future memory
>> expanders just issue back-invalidate automatically when the HDM decoder
>> configuration changes.
> 
> I would love that to be the way things go, but I fear the overheads of
> doing that on the protocol mean people will want the option of the painful
> approach.



Thanks,

James


* Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
  2024-06-03 17:28                                                 ` James Morse
@ 2024-06-04 14:26                                                   ` Jonathan Cameron
  0 siblings, 0 replies; 52+ messages in thread
From: Jonathan Cameron @ 2024-06-04 14:26 UTC (permalink / raw)
  To: James Morse
  Cc: Dan Williams, Dongsheng Yang, Gregory Price, John Groves, axboe,
	linux-block, linux-kernel, linux-cxl, nvdimm, Mark Rutland

On Mon, 3 Jun 2024 18:28:51 +0100
James Morse <james.morse@arm.com> wrote:

> Hi guys,
> 
> On 03/06/2024 13:48, Jonathan Cameron wrote:
> > On Fri, 31 May 2024 20:22:42 -0700
> > Dan Williams <dan.j.williams@intel.com> wrote:  
> >> Jonathan Cameron wrote:  
> >>> On Thu, 30 May 2024 14:59:38 +0800
> >>> Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:  
> >>>> On Wednesday, 2024/5/29 at 11:25 PM, Gregory Price wrote:
> >>>>> It's not just a CXL spec issue, though that is part of it. I think the
> >>>>> CXL spec would have to expose some form of puncturing flush, and this
> >>>>> makes the assumption that such a flush doesn't cause some kind of
> >>>>> race/deadlock issue.  Certainly this needs to be discussed.
> >>>>>
> >>>>> However, consider that the upstream processor actually has to generate
> >>>>> this flush.  This means adding the flush to existing coherence protocols,
> >>>>> or at the very least a new instruction to generate the flush explicitly.
> >>>>> The latter seems more likely than the former.
> >>>>>
> >>>>> This flush would need to ensure the data is forced out of the local WPQ
> >>>>> AND all WPQs south of the PCIE complex - because what you really want to
> >>>>> know is that the data has actually made it back to a place where remote
> >>>>> viewers are capable of perceiving the change.
> >>>>>
> >>>>> So this means:
> >>>>> 1) Spec revision with puncturing flush
> >>>>> 2) Buy-in from CPU vendors to generate such a flush
> >>>>> 3) A new instruction added to the architecture.
> >>>>>
> >>>>> Call me in a decade or so.
> >>>>>
> >>>>>
> >>>>> But really, I think it likely we see hardware-coherence well before this.
> >>>>> For this reason, I have become skeptical of all but a few memory sharing
> >>>>> use cases that depend on software-controlled cache-coherency.      
> >>>>
> >>>> Hi Gregory,
> >>>>
> >>>> 	From my understanding, we actually have the same idea here. What I am 
> >>>> saying is that we need the spec to consider this issue, meaning we need to 
> >>>> describe how the entire software-coherency mechanism operates, which 
> >>>> includes the necessary hardware support. Additionally, I agree that if 
> >>>> software-coherency also requires hardware support, it seems that 
> >>>> hardware-coherency is the better path.    
> >>>>>
> >>>>> There are some (FAMFS, for example). The coherence state of these
> >>>>> systems tends to be less volatile (e.g. mappings are read-only), or
> >>>>> they have inherent design limitations (cacheline-sized message passing
> >>>>> via write-ahead logging only).      
> >>>>
> >>>> Can you explain more about this? I understand that if the reader in the 
> >>>> writer-reader model is using a readonly mapping, the interaction will be 
> >>>> much simpler. However, after the writer writes data, if we don't have a 
> >>>> mechanism to flush and invalidate puncturing all caches, how can the 
> >>>> readonly reader access the new data?    
> >>>
> >>> There is a mechanism for doing coarse grained flushing that is known to
> >>> work on some architectures. Look at cpu_cache_invalidate_memregion().
> >>> On intel/x86 it's wbinvd_on_all_cpus()    
> >>
> >> There is no guarantee on x86 that, after cpu_cache_invalidate_memregion(),
> >> a remote shared memory consumer can be assured to see the writes
> >> from that event.  
> > 
> > I was wondering about that after I wrote this...  I guess it guarantees
> > we won't get a late landing write or is that not even true?
> > 
> > So if we remove memory, then add fresh memory again quickly enough,
> > can we get a leftover write showing up?  I guess that doesn't matter, as
> > the kernel will chase it with a memset(0) anyway and that will be ordered,
> > as it is to the same address.
> > 
> > However, we won't be able to elide that zeroing even if we know the device
> > did it, which makes some operations the device might support rather
> > pointless :(  
> 
> >>> on arm64 it's a PSCI firmware call CLEAN_INV_MEMREGION (there is a
> >>> public alpha specification for PSCI 1.3 with that defined but we
> >>> don't yet have kernel code.)    
> 
> I have an RFC for that - but I haven't had time to update and re-test it.

If it's useful, I might be able to either find time to take that forwards
or get someone else to do it.

Let me know if that would be helpful; I'd love to add this to the list
of things I can forget about because it just works for the kernel
(and hence is a problem for the firmware and uarch folk).

> 
> If you need this, and have a platform where it can be implemented, please get in touch
> with the people that look after the specs to move it along from alpha.
> 
> 
> >> That punches visibility through CXL shared memory devices?  
> 
> > It's a draft spec and Mark + James in +CC can hopefully confirm.
> > It does say
> > "Cleans and invalidates all caches, including system caches".
> > which I'd read as meaning it should but good to confirm.  
> 
> It's intended to remove any cached entries - including lines in what the arm-arm calls
> "invisible" system caches, which typically only platform firmware can touch. The next
> access should have to go all the way to the media. (I don't know enough about CXL to say
> what a remote shared memory consumer observes)

Once it's out of the host bridge buffers (and the write back is known to have
succeeded, which I think the host should know), I believe what happens next is a
device implementer problem. Hopefully anyone designing a device that does memory
sharing has built that part right.

> 
> Without it, all we have are the by-VA operations which are painfully slow for large
> regions, and insufficient for system caches.
> 
> As with all those firmware interfaces - it's for the platform implementer to wire up
> whatever is necessary to remove cached content for the specified range. Just because there
> is an (alpha!) spec doesn't mean it can be supported efficiently by a particular platform.
> 
> 
> >>> These are very big hammers and so unsuited for anything fine grained.  
> 
> You forgot really ugly too!

I was being polite :)

> 
> 
> >>> In the extreme end of possible implementations they briefly stop all
> >>> CPUs and clean and invalidate all caches of all types. So not suited
> >>> to anything fine grained, but may be acceptable for a rare setup event,
> >>> particularly if the main job of the writing host is to fill that memory
> >>> for lots of other hosts to use.
> >>>
> >>> At least the ARM one takes a range so allows for a less painful
> >>> implementation.   
> 
> That is to allow some ranges to fail. (e.g. you can do this to the CXL windows, but not
> the regular DRAM).
> 
> On the less painful implementation, arm's interconnect has a gadget that does "Address
> based flush" which could be used here. I'd hope platforms with that don't need to
> interrupt all CPUs - but it depends on what else needs to be done.
> 
> 
> >>> I'm assuming we'll see new architecture over time
> >>> but this is a different (and potentially easier) problem space
> >>> to what you need.    
> >>
> >> cpu_cache_invalidate_memregion() is only about making sure local CPU
> >> sees new contents after a DPA:HPA remap event. I hope CPUs are able to
> >> get away from that responsibility long term when / if future memory
> >> expanders just issue back-invalidate automatically when the HDM decoder
> >> configuration changes.  
> > 
> > I would love that to be the way things go, but I fear the overheads of
> > doing that on the protocol mean people will want the option of the painful
> > approach.  
> 
> 
> 
> Thanks,
> 
> James

Thanks for the info,

Jonathan

> 



end of thread, other threads:[~2024-06-04 14:26 UTC | newest]

Thread overview: 52+ messages
2024-04-22  7:15 [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dongsheng Yang
2024-04-22  7:16 ` [PATCH 1/7] block: Init for CBD(CXL " Dongsheng Yang
2024-04-22 18:39   ` Randy Dunlap
2024-04-22 22:41     ` Dongsheng Yang
2024-04-24  3:58   ` Chaitanya Kulkarni
2024-04-24  8:36     ` Dongsheng Yang
2024-04-22  7:16 ` [PATCH 2/7] cbd: introduce cbd_transport Dongsheng Yang
2024-04-24  4:08   ` Chaitanya Kulkarni
2024-04-24  8:43     ` Dongsheng Yang
2024-04-22  7:16 ` [PATCH 3/7] cbd: introduce cbd_channel Dongsheng Yang
2024-04-22  7:16 ` [PATCH 4/7] cbd: introduce cbd_host Dongsheng Yang
2024-04-25  5:51   ` [EXTERNAL] " Bharat Bhushan
2024-04-22  7:16 ` [PATCH 5/7] cbd: introuce cbd_backend Dongsheng Yang
2024-04-24  5:03   ` Chaitanya Kulkarni
2024-04-24  8:36     ` Dongsheng Yang
2024-04-25  5:46   ` [EXTERNAL] " Bharat Bhushan
2024-04-22  7:16 ` [PATCH 7/7] cbd: add related sysfs files in transport register Dongsheng Yang
2024-04-25  5:24   ` [EXTERNAL] " Bharat Bhushan
2024-04-22 22:42 ` [PATCH 6/7] cbd: introduce cbd_blkdev Dongsheng Yang
2024-04-23  7:27   ` Dongsheng Yang
2024-04-24  4:29 ` [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device) Dan Williams
2024-04-24  6:33   ` Dongsheng Yang
2024-04-24 15:14     ` Gregory Price
2024-04-26  1:25       ` Dongsheng Yang
2024-04-26 13:48         ` Gregory Price
2024-04-26 14:53           ` Dongsheng Yang
2024-04-26 16:14             ` Gregory Price
2024-04-28  5:47               ` Dongsheng Yang
2024-04-28 16:44                 ` Gregory Price
2024-04-28 16:55                 ` John Groves
2024-05-03  9:52                   ` Jonathan Cameron
2024-05-08 11:39                     ` Dongsheng Yang
2024-05-08 12:11                       ` Jonathan Cameron
2024-05-08 13:03                         ` Dongsheng Yang
2024-05-08 15:44                           ` Jonathan Cameron
2024-05-09 11:24                             ` Dongsheng Yang
2024-05-09 12:21                               ` Jonathan Cameron
2024-05-09 13:03                                 ` Dongsheng Yang
2024-05-21 18:41                                   ` Dan Williams
2024-05-22  6:17                                     ` Dongsheng Yang
2024-05-29 15:25                                       ` Gregory Price
2024-05-30  6:59                                         ` Dongsheng Yang
2024-05-30 13:38                                           ` Jonathan Cameron
2024-06-01  3:22                                             ` Dan Williams
2024-06-03 12:48                                               ` Jonathan Cameron
2024-06-03 17:28                                                 ` James Morse
2024-06-04 14:26                                                   ` Jonathan Cameron
2024-05-31 14:23                                           ` Gregory Price
2024-06-03  1:33                                             ` Dongsheng Yang
2024-04-30  0:34                 ` Dan Williams
2024-04-24 18:08     ` Dan Williams
     [not found]       ` <539c1323-68f9-d753-a102-692b69049c20@easystack.cn>
2024-04-30  0:10         ` Dan Williams
