linux-bcache.vger.kernel.org archive mirror
* [PATCH 00/14] bcache patches for Linux v5.14
@ 2021-06-15  5:49 Coly Li
  2021-06-15  5:49 ` [PATCH 01/14] bcache: fix error info in register_bcache() Coly Li
                   ` (14 more replies)
  0 siblings, 15 replies; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

Hi Jens,

Here are the bcache patches for Linux v5.14.

The patches from Chao Yu and Ding Senjie are useful code cleanups. The
rest of the patches add NVDIMM support to bcache journaling.

For the series supporting NVDIMM-based bcache journaling, all issues
reported since the last merge window are fixed, and no more issues were
detected during our testing or by the kernel test robot. If any issue is
reported while the patches stay in linux-next, Jianpeng, Qiaowei and I
will respond and fix it immediately.

Please take them for Linux v5.14.

Thank you in advance.

Coly Li
---

Chao Yu (1):
  bcache: fix error info in register_bcache()

Coly Li (7):
  bcache: add initial data structures for nvm pages
  bcache: use bucket index to set GC_MARK_METADATA for journal buckets
    in bch_btree_gc_finish()
  bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set
  bcache: initialize bcache journal for NVDIMM meta device
  bcache: support storing bcache journal into NVDIMM meta device
  bcache: read jset from NVDIMM pages for journal replay
  bcache: add sysfs interface register_nvdimm_meta to register NVDIMM
    meta device

Ding Senjie (1):
  md: bcache: Fix spelling of 'acquire'

Jianpeng Ma (5):
  bcache: initialize the nvm pages allocator
  bcache: initialization of the buddy
  bcache: bch_nvm_alloc_pages() of the buddy
  bcache: bch_nvm_free_pages() of the buddy
  bcache: get allocated pages from specific owner

 drivers/md/bcache/Kconfig       |  10 +
 drivers/md/bcache/Makefile      |   1 +
 drivers/md/bcache/btree.c       |   6 +-
 drivers/md/bcache/features.h    |   9 +
 drivers/md/bcache/journal.c     | 317 ++++++++++---
 drivers/md/bcache/journal.h     |   2 +-
 drivers/md/bcache/nvm-pages.c   | 773 ++++++++++++++++++++++++++++++++
 drivers/md/bcache/nvm-pages.h   |  93 ++++
 drivers/md/bcache/super.c       |  91 +++-
 include/uapi/linux/bcache-nvm.h | 206 +++++++++
 10 files changed, 1432 insertions(+), 76 deletions(-)
 create mode 100644 drivers/md/bcache/nvm-pages.c
 create mode 100644 drivers/md/bcache/nvm-pages.h
 create mode 100644 include/uapi/linux/bcache-nvm.h

-- 
2.26.2



* [PATCH 01/14] bcache: fix error info in register_bcache()
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-22  9:47   ` Hannes Reinecke
  2021-06-15  5:49 ` [PATCH 02/14] md: bcache: Fix spelling of 'acquire' Coly Li
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Chao Yu, Coly Li

From: Chao Yu <yuchao0@huawei.com>

In register_bcache(), there are several cases where we didn't set the
correct error info (return value and/or error message):
- if kzalloc() fails, it needs to return -ENOMEM and print
"cannot allocate memory";
- if register_cache() fails, it's better to propagate its
return value rather than using the default -EINVAL.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index bea8c4429ae8..0a20ccf5a1db 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2620,8 +2620,11 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	if (SB_IS_BDEV(sb)) {
 		struct cached_dev *dc = kzalloc(sizeof(*dc), GFP_KERNEL);
 
-		if (!dc)
+		if (!dc) {
+			ret = -ENOMEM;
+			err = "cannot allocate memory";
 			goto out_put_sb_page;
+		}
 
 		mutex_lock(&bch_register_lock);
 		ret = register_bdev(sb, sb_disk, bdev, dc);
@@ -2632,11 +2635,15 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	} else {
 		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
 
-		if (!ca)
+		if (!ca) {
+			ret = -ENOMEM;
+			err = "cannot allocate memory";
 			goto out_put_sb_page;
+		}
 
 		/* blkdev_put() will be called in bch_cache_release() */
-		if (register_cache(sb, sb_disk, bdev, ca) != 0)
+		ret = register_cache(sb, sb_disk, bdev, ca);
+		if (ret)
 			goto out_free_sb;
 	}
 
-- 
2.26.2



* [PATCH 02/14] md: bcache: Fix spelling of 'acquire'
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
  2021-06-15  5:49 ` [PATCH 01/14] bcache: fix error info in register_bcache() Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-22 10:03   ` Hannes Reinecke
  2021-06-15  5:49 ` [PATCH 03/14] bcache: add initial data structures for nvm pages Coly Li
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Ding Senjie, Coly Li

From: Ding Senjie <dingsenjie@yulong.com>

acqurie -> acquire

Signed-off-by: Ding Senjie <dingsenjie@yulong.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 0a20ccf5a1db..2f1ee4fbf4d5 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2760,7 +2760,7 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
 		 * The reason bch_register_lock is not held to call
 		 * bch_cache_set_stop() and bcache_device_stop() is to
 		 * avoid potential deadlock during reboot, because cache
-		 * set or bcache device stopping process will acqurie
+		 * set or bcache device stopping process will acquire
 		 * bch_register_lock too.
 		 *
 		 * We are safe here because bcache_is_reboot sets to
-- 
2.26.2



* [PATCH 03/14] bcache: add initial data structures for nvm pages
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
  2021-06-15  5:49 ` [PATCH 01/14] bcache: fix error info in register_bcache() Coly Li
  2021-06-15  5:49 ` [PATCH 02/14] md: bcache: Fix spelling of 'acquire' Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-21 16:17   ` Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages) Coly Li
  2021-06-22 10:19   ` [PATCH 03/14] bcache: add initial data structures for nvm pages Hannes Reinecke
  2021-06-15  5:49 ` [PATCH 04/14] bcache: initialize the nvm pages allocator Coly Li
                   ` (11 subsequent siblings)
  14 siblings, 2 replies; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren

This patch initializes the prototype data structures for the nvm pages
allocator:

- struct bch_nvm_pages_sb
This is the super block allocated on each nvdimm namespace. An nvdimm
set may have multiple namespaces; bch_nvm_pages_sb->set_uuid is used
to mark which nvdimm set this namespace belongs to. Normally we will
use the bcache cache set UUID to initialize this uuid, to connect this
nvdimm set to a specified bcache cache set.

- struct bch_owner_list_head
This is a table for the heads of all owner lists. An owner list records
which page(s) are allocated to which owner. After reboot from power
failure, the owner may find all its requested and allocated pages from
the owner list via a handle looked up by its UUID.

- struct bch_nvm_pages_owner_head
This is the head of an owner list. Each owner only has one owner list,
and an nvm page only belongs to a specific owner. uuid[] will be set to
the owner's uuid; for bcache it is the bcache cache set uuid. label is
not mandatory; it is a human-readable string for debug purposes. The
recs pointers reference separate nvm pages which hold the tables of
struct bch_pgalloc_rec.

- struct bch_nvm_pgalloc_recs
This struct occupies a whole page; owner_uuid should match the uuid
in struct bch_nvm_pages_owner_head. recs[] is the real table containing
all allocation records.

- struct bch_pgalloc_rec
Each structure records a range of allocated nvm pages.
  - Bits  0 - 51: page offset (pgoff) of the allocated range.
  - Bits 52 - 57: allocated size as an order of 2, in units of page_size.
  - Bits 58 - 63: reserved.
Since each allocated range is a power-of-2 number of pages, using 6 bits
to represent the order gives a maximum size of (1 << 63) * PAGE_SIZE.
That is a 76-bit wide range size in bytes for a 4KB page size, which is
large enough currently.
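
As a hedged illustration only (not part of this patch), a minimal helper
like the following could turn one record into its byte size; the helper
name is hypothetical:

        /* Hypothetical helper: byte size of the range described by one
         * bch_pgalloc_rec (valid for practical orders that fit in 64 bits).
         */
        static inline __u64 bch_pgalloc_rec_bytes(const struct bch_pgalloc_rec *rec)
        {
                return ((__u64)1 << rec->order) << PAGE_SHIFT;
        }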

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
 include/uapi/linux/bcache-nvm.h | 200 ++++++++++++++++++++++++++++++++
 1 file changed, 200 insertions(+)
 create mode 100644 include/uapi/linux/bcache-nvm.h

diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
new file mode 100644
index 000000000000..5094a6797679
--- /dev/null
+++ b/include/uapi/linux/bcache-nvm.h
@@ -0,0 +1,200 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+#ifndef _UAPI_BCACHE_NVM_H
+#define _UAPI_BCACHE_NVM_H
+
+#if (__BITS_PER_LONG == 64)
+/*
+ * Bcache on NVDIMM data structures
+ */
+
+/*
+ * - struct bch_nvm_pages_sb
+ *   This is the super block allocated on each nvdimm namespace. An nvdimm
+ * set may have multiple namespaces; bch_nvm_pages_sb->set_uuid is used to mark
+ * which nvdimm set this namespace belongs to. Normally we will use the
+ * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
+ * set to a specified bcache cache set.
+ *
+ * - struct bch_owner_list_head
+ *   This is a table for the heads of all owner lists. An owner list records
+ * which page(s) are allocated to which owner. After reboot from power failure,
+ * the owner may find all its requested and allocated pages from the owner
+ * list via a handle looked up by its UUID.
+ *
+ * - struct bch_nvm_pages_owner_head
+ *   This is the head of an owner list. Each owner only has one owner list,
+ * and an nvm page only belongs to a specific owner. uuid[] will be set to
+ * the owner's uuid; for bcache it is the bcache cache set uuid. label is not
+ * mandatory; it is a human-readable string for debug purposes. The recs
+ * pointers reference separate nvm pages which hold the tables of struct
+ * bch_pgalloc_rec.
+ *
+ * - struct bch_nvm_pgalloc_recs
+ *   This structure occupies a whole page; owner_uuid should match the uuid
+ * in struct bch_nvm_pages_owner_head. recs[] is the real table containing all
+ * allocation records.
+ *
+ * - struct bch_pgalloc_rec
+ *   Each structure records a range of allocated nvm pages. pgoff is the
+ * offset, in units of page size, of this allocated nvm page range. order is
+ * the allocation size as an order of 2 pages, so each recorded range covers
+ * a power-of-2 number of pages.
+ *
+ *
+ * Memory layout on nvdimm namespace 0
+ *
+ *    0 +---------------------------------+
+ *      |                                 |
+ *  4KB +---------------------------------+
+ *      |         bch_nvm_pages_sb        |
+ *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
+ *      |       bch_owner_list_head       |
+ *      |                                 |
+ * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
+ *      |       bch_nvm_pgalloc_recs      |
+ *      |  (nvm pages internal usage)     |
+ * 24KB +---------------------------------+
+ *      |                                 |
+ *      |                                 |
+ * 16MB  +---------------------------------+
+ *      |      allocable nvm pages        |
+ *      |      for buddy allocator        |
+ * end  +---------------------------------+
+ *
+ *
+ *
+ * Memory layout on nvdimm namespace N
+ * (doesn't have owner list)
+ *
+ *    0 +---------------------------------+
+ *      |                                 |
+ *  4KB +---------------------------------+
+ *      |         bch_nvm_pages_sb        |
+ *  8KB +---------------------------------+
+ *      |                                 |
+ *      |                                 |
+ *      |                                 |
+ *      |                                 |
+ *      |                                 |
+ *      |                                 |
+ * 16MB  +---------------------------------+
+ *      |      allocable nvm pages        |
+ *      |      for buddy allocator        |
+ * end  +---------------------------------+
+ *
+ */
+
+#include <linux/types.h>
+
+/* In sectors */
+#define BCH_NVM_PAGES_SB_OFFSET			4096
+#define BCH_NVM_PAGES_OFFSET			(16 << 20)
+
+#define BCH_NVM_PAGES_LABEL_SIZE		32
+#define BCH_NVM_PAGES_NAMESPACES_MAX		8
+
+#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
+#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
+
+#define BCH_NVM_PAGES_SB_VERSION		0
+#define BCH_NVM_PAGES_SB_VERSION_MAX		0
+
+static const unsigned char bch_nvm_pages_magic[] = {
+	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
+	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
+static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
+	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
+	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
+
+/* takes 64bit width */
+struct bch_pgalloc_rec {
+	__u64	pgoff:52;
+	__u64	order:6;
+	__u64	reserved:6;
+};
+
+struct bch_nvm_pgalloc_recs {
+union {
+	struct {
+		struct bch_nvm_pages_owner_head	*owner;
+		struct bch_nvm_pgalloc_recs	*next;
+		unsigned char			magic[16];
+		unsigned char			owner_uuid[16];
+		unsigned int			size;
+		unsigned int			used;
+		unsigned long			_pad[4];
+		struct bch_pgalloc_rec		recs[];
+	};
+	unsigned char				pad[8192];
+};
+};
+
+#define BCH_MAX_RECS					\
+	((sizeof(struct bch_nvm_pgalloc_recs) -		\
+	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
+	 sizeof(struct bch_pgalloc_rec))
+
+struct bch_nvm_pages_owner_head {
+	unsigned char			uuid[16];
+	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
+	/* Per-namespace own lists */
+	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
+};
+
+/* heads[0] is always for nvm_pages internal usage */
+struct bch_owner_list_head {
+union {
+	struct {
+		unsigned int			size;
+		unsigned int			used;
+		unsigned long			_pad[4];
+		struct bch_nvm_pages_owner_head	heads[];
+	};
+	unsigned char				pad[8192];
+};
+};
+#define BCH_MAX_OWNER_LIST				\
+	((sizeof(struct bch_owner_list_head) -		\
+	 offsetof(struct bch_owner_list_head, heads)) /	\
+	 sizeof(struct bch_nvm_pages_owner_head))
+
+/* The on-media bit order is local CPU order */
+struct bch_nvm_pages_sb {
+	unsigned long				csum;
+	unsigned long				ns_start;
+	unsigned long				sb_offset;
+	unsigned long				version;
+	unsigned char				magic[16];
+	unsigned char				uuid[16];
+	unsigned int				page_size;
+	unsigned int				total_namespaces_nr;
+	unsigned int				this_namespace_nr;
+	union {
+		unsigned char			set_uuid[16];
+		unsigned long			set_magic;
+	};
+
+	unsigned long				flags;
+	unsigned long				seq;
+
+	unsigned long				feature_compat;
+	unsigned long				feature_incompat;
+	unsigned long				feature_ro_compat;
+
+	/* For allocable nvm pages from buddy systems */
+	unsigned long				pages_offset;
+	unsigned long				pages_total;
+
+	unsigned long				pad[8];
+
+	/* Only on the first name space */
+	struct bch_owner_list_head		*owner_list_head;
+
+	/* Just for csum_set() */
+	unsigned int				keys;
+	unsigned long				d[0];
+};
+#endif /* __BITS_PER_LONG == 64 */
+
+#endif /* _UAPI_BCACHE_NVM_H */
-- 
2.26.2



* [PATCH 04/14] bcache: initialize the nvm pages allocator
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
                   ` (2 preceding siblings ...)
  2021-06-15  5:49 ` [PATCH 03/14] bcache: add initial data structures for nvm pages Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-22 10:39   ` Hannes Reinecke
  2021-06-15  5:49 ` [PATCH 05/14] bcache: initialization of the buddy Coly Li
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe
  Cc: linux-bcache, linux-block, Jianpeng Ma, Randy Dunlap,
	Qiaowei Ren, Coly Li

From: Jianpeng Ma <jianpeng.ma@intel.com>

This patch defines the prototype in-memory data structures and
initializes the nvm pages allocator.

The nvm address space managed by this allocator can consist of many nvm
namespaces, and several namespaces can be composed into one nvm set,
similar to a cache set. In this initial implementation, only one set is
supported.

Users of this nvm pages allocator need to call bch_register_namespace()
to register an nvdimm device (like /dev/pmemX) with the allocator as an
instance of struct bch_nvm_namespace.
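
As a hedged usage sketch (the device path below is only an example, not
something mandated by this patch):

        struct bch_nvm_namespace *ns;

        ns = bch_register_namespace("/dev/pmem0");
        if (IS_ERR(ns))
                pr_err("failed to register nvdimm namespace: %ld\n",
                       PTR_ERR(ns));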

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/Kconfig     |  10 ++
 drivers/md/bcache/Makefile    |   1 +
 drivers/md/bcache/nvm-pages.c | 295 ++++++++++++++++++++++++++++++++++
 drivers/md/bcache/nvm-pages.h |  74 +++++++++
 drivers/md/bcache/super.c     |   3 +
 5 files changed, 383 insertions(+)
 create mode 100644 drivers/md/bcache/nvm-pages.c
 create mode 100644 drivers/md/bcache/nvm-pages.h

diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
index d1ca4d059c20..a69f6c0e0507 100644
--- a/drivers/md/bcache/Kconfig
+++ b/drivers/md/bcache/Kconfig
@@ -35,3 +35,13 @@ config BCACHE_ASYNC_REGISTRATION
 	device path into this file will returns immediately and the real
 	registration work is handled in kernel work queue in asynchronous
 	way.
+
+config BCACHE_NVM_PAGES
+	bool "NVDIMM support for bcache (EXPERIMENTAL)"
+	depends on BCACHE
+	depends on 64BIT
+	depends on LIBNVDIMM
+	depends on DAX
+	help
+	  Allocate/release NV-memory pages for bcache and provide allocated pages
+	  for each requestor after system reboot.
diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
index 5b87e59676b8..2397bb7c7ffd 100644
--- a/drivers/md/bcache/Makefile
+++ b/drivers/md/bcache/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_BCACHE)	+= bcache.o
 bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
 	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
 	util.o writeback.o features.o
+bcache-$(CONFIG_BCACHE_NVM_PAGES) += nvm-pages.o
diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
new file mode 100644
index 000000000000..18fdadbc502f
--- /dev/null
+++ b/drivers/md/bcache/nvm-pages.c
@@ -0,0 +1,295 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Nvdimm page-buddy allocator
+ *
+ * Copyright (c) 2021, Intel Corporation.
+ * Copyright (c) 2021, Qiaowei Ren <qiaowei.ren@intel.com>.
+ * Copyright (c) 2021, Jianpeng Ma <jianpeng.ma@intel.com>.
+ */
+
+#if defined(CONFIG_BCACHE_NVM_PAGES)
+
+#include "bcache.h"
+#include "nvm-pages.h"
+
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/dax.h>
+#include <linux/pfn_t.h>
+#include <linux/libnvdimm.h>
+#include <linux/mm_types.h>
+#include <linux/err.h>
+#include <linux/pagemap.h>
+#include <linux/bitmap.h>
+#include <linux/blkdev.h>
+
+struct bch_nvm_set *only_set;
+
+static void release_nvm_namespaces(struct bch_nvm_set *nvm_set)
+{
+	int i;
+	struct bch_nvm_namespace *ns;
+
+	for (i = 0; i < nvm_set->total_namespaces_nr; i++) {
+		ns = nvm_set->nss[i];
+		if (ns) {
+			blkdev_put(ns->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
+			kfree(ns);
+		}
+	}
+
+	kfree(nvm_set->nss);
+}
+
+static void release_nvm_set(struct bch_nvm_set *nvm_set)
+{
+	release_nvm_namespaces(nvm_set);
+	kfree(nvm_set);
+}
+
+static int init_owner_info(struct bch_nvm_namespace *ns)
+{
+	struct bch_owner_list_head *owner_list_head = ns->sb->owner_list_head;
+
+	mutex_lock(&only_set->lock);
+	only_set->owner_list_head = owner_list_head;
+	only_set->owner_list_size = owner_list_head->size;
+	only_set->owner_list_used = owner_list_head->used;
+	mutex_unlock(&only_set->lock);
+
+	return 0;
+}
+
+static int attach_nvm_set(struct bch_nvm_namespace *ns)
+{
+	int rc = 0;
+
+	mutex_lock(&only_set->lock);
+	if (only_set->nss) {
+		if (memcmp(ns->sb->set_uuid, only_set->set_uuid, 16)) {
+			pr_info("namespace id doesn't match nvm set\n");
+			rc = -EINVAL;
+			goto unlock;
+		}
+
+		if (only_set->nss[ns->sb->this_namespace_nr]) {
+			pr_info("already has the same position(%d) nvm\n",
+					ns->sb->this_namespace_nr);
+			rc = -EEXIST;
+			goto unlock;
+		}
+	} else {
+		memcpy(only_set->set_uuid, ns->sb->set_uuid, 16);
+		only_set->total_namespaces_nr = ns->sb->total_namespaces_nr;
+		only_set->nss = kcalloc(only_set->total_namespaces_nr,
+				sizeof(struct bch_nvm_namespace *), GFP_KERNEL);
+		if (!only_set->nss) {
+			rc = -ENOMEM;
+			goto unlock;
+		}
+	}
+
+	only_set->nss[ns->sb->this_namespace_nr] = ns;
+
+	/* Firstly attach */
+	if ((unsigned long)ns->sb->owner_list_head == BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET) {
+		struct bch_nvm_pages_owner_head *sys_owner_head;
+		struct bch_nvm_pgalloc_recs *sys_pgalloc_recs;
+
+		ns->sb->owner_list_head = ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET;
+		sys_pgalloc_recs = ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET;
+
+		sys_owner_head = &(ns->sb->owner_list_head->heads[0]);
+		sys_owner_head->recs[0] = sys_pgalloc_recs;
+		ns->sb->csum = csum_set(ns->sb);
+
+		sys_pgalloc_recs->owner = sys_owner_head;
+	} else
+		BUG_ON(ns->sb->owner_list_head !=
+			(ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET));
+
+unlock:
+	mutex_unlock(&only_set->lock);
+	return rc;
+}
+
+static int read_nvdimm_meta_super(struct block_device *bdev,
+			      struct bch_nvm_namespace *ns)
+{
+	struct page *page;
+	struct bch_nvm_pages_sb *sb;
+	int r = 0;
+	uint64_t expected_csum = 0;
+
+	page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
+			BCH_NVM_PAGES_SB_OFFSET >> PAGE_SHIFT, GFP_KERNEL);
+
+	if (IS_ERR(page))
+		return -EIO;
+
+	sb = (struct bch_nvm_pages_sb *)(page_address(page) +
+					offset_in_page(BCH_NVM_PAGES_SB_OFFSET));
+	r = -EINVAL;
+	expected_csum = csum_set(sb);
+	if (expected_csum != sb->csum) {
+		pr_info("csum does not match the expected one\n");
+		goto put_page;
+	}
+
+	if (memcmp(sb->magic, bch_nvm_pages_magic, 16)) {
+		pr_info("invalid bch_nvm_pages_magic\n");
+		goto put_page;
+	}
+
+	if (sb->total_namespaces_nr != 1) {
+		pr_info("currently only support one nvm device\n");
+		goto put_page;
+	}
+
+	if (sb->sb_offset != BCH_NVM_PAGES_SB_OFFSET) {
+		pr_info("invalid superblock offset\n");
+		goto put_page;
+	}
+
+	r = 0;
+	/* temporary use for DAX API */
+	ns->page_size = sb->page_size;
+	ns->pages_total = sb->pages_total;
+
+put_page:
+	put_page(page);
+	return r;
+}
+
+struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
+{
+	struct bch_nvm_namespace *ns;
+	int err;
+	pgoff_t pgoff;
+	char buf[BDEVNAME_SIZE];
+	struct block_device *bdev;
+	int id;
+	char *path = NULL;
+
+	path = kstrndup(dev_path, 512, GFP_KERNEL);
+	if (!path) {
+		pr_err("kstrndup failed\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	bdev = blkdev_get_by_path(strim(path),
+				  FMODE_READ|FMODE_WRITE|FMODE_EXEC,
+				  only_set);
+	if (IS_ERR(bdev)) {
+		pr_info("get %s error: %ld\n", dev_path, PTR_ERR(bdev));
+		kfree(path);
+		return ERR_PTR(PTR_ERR(bdev));
+	}
+
+	err = -ENOMEM;
+	ns = kzalloc(sizeof(struct bch_nvm_namespace), GFP_KERNEL);
+	if (!ns)
+		goto bdput;
+
+	err = -EIO;
+	if (read_nvdimm_meta_super(bdev, ns)) {
+		pr_info("%s read nvdimm meta super block failed.\n",
+			bdevname(bdev, buf));
+		goto free_ns;
+	}
+
+	err = -EOPNOTSUPP;
+	if (!bdev_dax_supported(bdev, ns->page_size)) {
+		pr_info("%s don't support DAX\n", bdevname(bdev, buf));
+		goto free_ns;
+	}
+
+	err = -EINVAL;
+	if (bdev_dax_pgoff(bdev, 0, ns->page_size, &pgoff)) {
+		pr_info("invalid offset of %s\n", bdevname(bdev, buf));
+		goto free_ns;
+	}
+
+	err = -ENOMEM;
+	ns->dax_dev = fs_dax_get_by_bdev(bdev);
+	if (!ns->dax_dev) {
+		pr_info("can't get dax device by %s\n", bdevname(bdev, buf));
+		goto free_ns;
+	}
+
+	err = -EINVAL;
+	id = dax_read_lock();
+	if (dax_direct_access(ns->dax_dev, pgoff, ns->pages_total,
+			      &ns->kaddr, &ns->start_pfn) <= 0) {
+		pr_info("dax_direct_access error\n");
+		dax_read_unlock(id);
+		goto free_ns;
+	}
+	dax_read_unlock(id);
+
+	ns->sb = ns->kaddr + BCH_NVM_PAGES_SB_OFFSET;
+
+	err = -EINVAL;
+	/* Check magic again to make sure DAX mapping is correct */
+	if (memcmp(ns->sb->magic, bch_nvm_pages_magic, 16)) {
+		pr_info("invalid bch_nvm_pages_magic after DAX mapping\n");
+		goto free_ns;
+	}
+
+	err = attach_nvm_set(ns);
+	if (err < 0)
+		goto free_ns;
+
+	ns->page_size = ns->sb->page_size;
+	ns->pages_offset = ns->sb->pages_offset;
+	ns->pages_total = ns->sb->pages_total;
+	ns->free = 0;
+	ns->bdev = bdev;
+	ns->nvm_set = only_set;
+	mutex_init(&ns->lock);
+
+	if (ns->sb->this_namespace_nr == 0) {
+		pr_info("only first namespace contain owner info\n");
+		err = init_owner_info(ns);
+		if (err < 0) {
+			pr_info("init_owner_info met error %d\n", err);
+			only_set->nss[ns->sb->this_namespace_nr] = NULL;
+			goto free_ns;
+		}
+	}
+
+	kfree(path);
+	return ns;
+free_ns:
+	kfree(ns);
+bdput:
+	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
+	kfree(path);
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(bch_register_namespace);
+
+int __init bch_nvm_init(void)
+{
+	only_set = kzalloc(sizeof(*only_set), GFP_KERNEL);
+	if (!only_set)
+		return -ENOMEM;
+
+	only_set->total_namespaces_nr = 0;
+	only_set->owner_list_head = NULL;
+	only_set->nss = NULL;
+
+	mutex_init(&only_set->lock);
+
+	pr_info("bcache nvm init\n");
+	return 0;
+}
+
+void bch_nvm_exit(void)
+{
+	release_nvm_set(only_set);
+	pr_info("bcache nvm exit\n");
+}
+
+#endif /* CONFIG_BCACHE_NVM_PAGES */
diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
new file mode 100644
index 000000000000..3e24c4dee7fd
--- /dev/null
+++ b/drivers/md/bcache/nvm-pages.h
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _BCACHE_NVM_PAGES_H
+#define _BCACHE_NVM_PAGES_H
+
+#if defined(CONFIG_BCACHE_NVM_PAGES)
+#include <linux/bcache-nvm.h>
+#endif /* CONFIG_BCACHE_NVM_PAGES */
+
+/*
+ * Bcache NVDIMM in memory data structures
+ */
+
+/*
+ * The following three structures in memory records which page(s) allocated
+ * to which owner. After reboot from power failure, they will be initialized
+ * based on nvm pages superblock in NVDIMM device.
+ */
+struct bch_nvm_namespace {
+	struct bch_nvm_pages_sb *sb;
+	void *kaddr;
+
+	u8 uuid[16];
+	u64 free;
+	u32 page_size;
+	u64 pages_offset;
+	u64 pages_total;
+	pfn_t start_pfn;
+
+	struct dax_device *dax_dev;
+	struct block_device *bdev;
+	struct bch_nvm_set *nvm_set;
+
+	struct mutex lock;
+};
+
+/*
+ * A set of namespaces. Currently only one set can be supported.
+ */
+struct bch_nvm_set {
+	u8 set_uuid[16];
+	u32 total_namespaces_nr;
+
+	u32 owner_list_size;
+	u32 owner_list_used;
+	struct bch_owner_list_head *owner_list_head;
+
+	struct bch_nvm_namespace **nss;
+
+	struct mutex lock;
+};
+extern struct bch_nvm_set *only_set;
+
+#if defined(CONFIG_BCACHE_NVM_PAGES)
+
+struct bch_nvm_namespace *bch_register_namespace(const char *dev_path);
+int bch_nvm_init(void);
+void bch_nvm_exit(void);
+
+#else
+
+static inline struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
+{
+	return NULL;
+}
+static inline int bch_nvm_init(void)
+{
+	return 0;
+}
+static inline void bch_nvm_exit(void) { }
+
+#endif /* CONFIG_BCACHE_NVM_PAGES */
+
+#endif /* _BCACHE_NVM_PAGES_H */
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 2f1ee4fbf4d5..ce22aefb1352 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -14,6 +14,7 @@
 #include "request.h"
 #include "writeback.h"
 #include "features.h"
+#include "nvm-pages.h"
 
 #include <linux/blkdev.h>
 #include <linux/pagemap.h>
@@ -2823,6 +2824,7 @@ static void bcache_exit(void)
 {
 	bch_debug_exit();
 	bch_request_exit();
+	bch_nvm_exit();
 	if (bcache_kobj)
 		kobject_put(bcache_kobj);
 	if (bcache_wq)
@@ -2921,6 +2923,7 @@ static int __init bcache_init(void)
 
 	bch_debug_init();
 	closure_debug_init();
+	bch_nvm_init();
 
 	bcache_is_reboot = false;
 
-- 
2.26.2



* [PATCH 05/14] bcache: initialization of the buddy
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
                   ` (3 preceding siblings ...)
  2021-06-15  5:49 ` [PATCH 04/14] bcache: initialize the nvm pages allocator Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-22 10:45   ` Hannes Reinecke
  2021-06-15  5:49 ` [PATCH 06/14] bcache: bch_nvm_alloc_pages() " Coly Li
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe
  Cc: linux-bcache, linux-block, Jianpeng Ma, kernel test robot,
	Dan Carpenter, Qiaowei Ren, Coly Li

From: Jianpeng Ma <jianpeng.ma@intel.com>

This nvm pages allocator implements a simple buddy allocator to manage
the nvm address space. This patch initializes the buddy for a new
namespace.

The alloc/free unit of the buddy is a page. DAX devices have their own
struct page (in DRAM or PMEM):

        struct {        /* ZONE_DEVICE pages */
                /** @pgmap: Points to the hosting device page map. */
                struct dev_pagemap *pgmap;
                void *zone_device_data;
                /*
                 * ZONE_DEVICE private pages are counted as being
                 * mapped so the next 3 words hold the mapping, index,
                 * and private fields from the source anonymous or
                 * page cache page while the page is migrated to device
                 * private memory.
                 * ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
                 * use the mapping, index, and private fields when
                 * pmem backed DAX files are mapped.
                 */
        };

ZONE_DEVICE pages only use pgmap; the other four words (16/32 bytes) are
unused. So the second/third words are used as a 'struct list_head' to
link the page into the buddy free lists. The fourth word (normally
struct page::index) stores pgoff, the page offset in the dax device, and
the fifth word (normally struct page::private) stores the buddy order.
page_type will be used to store the buddy flags.
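
As a rough sketch of that field reuse (mirroring what init_nvm_free_space()
in this patch does when it puts a free range on a buddy free list):

        page->index = pgoff;             /* page offset in the dax device */
        set_page_private(page, order);   /* buddy order */
        __SetPageBuddy(page);            /* page_type marks a free buddy page */
        /* zone_device_data is reused as the list_head linking free_area[order] */
        list_add((struct list_head *)&page->zone_device_data,
                 &ns->free_area[order]);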

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/nvm-pages.c   | 156 +++++++++++++++++++++++++++++++-
 drivers/md/bcache/nvm-pages.h   |   6 ++
 include/uapi/linux/bcache-nvm.h |  10 +-
 3 files changed, 165 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
index 18fdadbc502f..804ee66e97be 100644
--- a/drivers/md/bcache/nvm-pages.c
+++ b/drivers/md/bcache/nvm-pages.c
@@ -34,6 +34,10 @@ static void release_nvm_namespaces(struct bch_nvm_set *nvm_set)
 	for (i = 0; i < nvm_set->total_namespaces_nr; i++) {
 		ns = nvm_set->nss[i];
 		if (ns) {
+			kvfree(ns->pages_bitmap);
+			if (ns->pgalloc_recs_bitmap)
+				bitmap_free(ns->pgalloc_recs_bitmap);
+
 			blkdev_put(ns->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
 			kfree(ns);
 		}
@@ -48,17 +52,130 @@ static void release_nvm_set(struct bch_nvm_set *nvm_set)
 	kfree(nvm_set);
 }
 
+static struct page *nvm_vaddr_to_page(struct bch_nvm_namespace *ns, void *addr)
+{
+	return virt_to_page(addr);
+}
+
+static void *nvm_pgoff_to_vaddr(struct bch_nvm_namespace *ns, pgoff_t pgoff)
+{
+	return ns->kaddr + (pgoff << PAGE_SHIFT);
+}
+
+static inline void remove_owner_space(struct bch_nvm_namespace *ns,
+					pgoff_t pgoff, u64 nr)
+{
+	while (nr > 0) {
+		unsigned int num = nr > UINT_MAX ? UINT_MAX : nr;
+
+		bitmap_set(ns->pages_bitmap, pgoff, num);
+		nr -= num;
+		pgoff += num;
+	}
+}
+
+#define BCH_PGOFF_TO_KVADDR(pgoff) ((void *)((unsigned long)pgoff << PAGE_SHIFT))
+
 static int init_owner_info(struct bch_nvm_namespace *ns)
 {
 	struct bch_owner_list_head *owner_list_head = ns->sb->owner_list_head;
+	struct bch_nvm_pgalloc_recs *sys_recs;
+	int i, j, k, rc = 0;
 
 	mutex_lock(&only_set->lock);
 	only_set->owner_list_head = owner_list_head;
 	only_set->owner_list_size = owner_list_head->size;
 	only_set->owner_list_used = owner_list_head->used;
+
+	/* remove used space */
+	remove_owner_space(ns, 0, div_u64(ns->pages_offset, ns->page_size));
+
+	sys_recs = ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET;
+	/* suppose no hole in array */
+	for (i = 0; i < owner_list_head->used; i++) {
+		struct bch_nvm_pages_owner_head *head = &owner_list_head->heads[i];
+
+		for (j = 0; j < BCH_NVM_PAGES_NAMESPACES_MAX; j++) {
+			struct bch_nvm_pgalloc_recs *pgalloc_recs = head->recs[j];
+			unsigned long offset = (unsigned long)ns->kaddr >> PAGE_SHIFT;
+			struct page *page;
+
+			while (pgalloc_recs) {
+				u32 pgalloc_recs_pos = (unsigned int)(pgalloc_recs - sys_recs);
+
+				if (memcmp(pgalloc_recs->magic, bch_nvm_pages_pgalloc_magic, 16)) {
+					pr_info("invalid bch_nvm_pages_pgalloc_magic\n");
+					rc = -EINVAL;
+					goto unlock;
+				}
+				if (memcmp(pgalloc_recs->owner_uuid, head->uuid, 16)) {
+					pr_info("invalid owner_uuid in bch_nvm_pgalloc_recs\n");
+					rc = -EINVAL;
+					goto unlock;
+				}
+				if (pgalloc_recs->owner != head) {
+					pr_info("invalid owner in bch_nvm_pgalloc_recs\n");
+					rc = -EINVAL;
+					goto unlock;
+				}
+
+				/* recs array can have holes */
+				for (k = 0; k < pgalloc_recs->size; k++) {
+					struct bch_pgalloc_rec *rec = &pgalloc_recs->recs[k];
+
+					if (rec->pgoff) {
+						BUG_ON(rec->pgoff <= offset);
+
+						/* init struct page: index/private */
+						page = nvm_vaddr_to_page(ns,
+							BCH_PGOFF_TO_KVADDR(rec->pgoff));
+
+						set_page_private(page, rec->order);
+						page->index = rec->pgoff - offset;
+
+						remove_owner_space(ns,
+							rec->pgoff - offset,
+							1L << rec->order);
+					}
+				}
+				bitmap_set(ns->pgalloc_recs_bitmap, pgalloc_recs_pos, 1);
+				pgalloc_recs = pgalloc_recs->next;
+			}
+		}
+	}
+unlock:
 	mutex_unlock(&only_set->lock);
 
-	return 0;
+	return rc;
+}
+
+static void init_nvm_free_space(struct bch_nvm_namespace *ns)
+{
+	unsigned int start, end, pages;
+	int i;
+	struct page *page;
+	pgoff_t pgoff_start;
+
+	bitmap_for_each_clear_region(ns->pages_bitmap, start, end, 0, ns->pages_total) {
+		pgoff_start = start;
+		pages = end - start;
+
+		while (pages) {
+			for (i = BCH_MAX_ORDER - 1; i >= 0 ; i--) {
+				if ((pgoff_start % (1L << i) == 0) && (pages >= (1L << i)))
+					break;
+			}
+
+			page = nvm_vaddr_to_page(ns, nvm_pgoff_to_vaddr(ns, pgoff_start));
+			page->index = pgoff_start;
+			set_page_private(page, i);
+			__SetPageBuddy(page);
+			list_add((struct list_head *)&page->zone_device_data, &ns->free_area[i]);
+
+			pgoff_start += 1L << i;
+			pages -= 1L << i;
+		}
+	}
 }
 
 static int attach_nvm_set(struct bch_nvm_namespace *ns)
@@ -165,7 +282,7 @@ static int read_nvdimm_meta_super(struct block_device *bdev,
 struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
 {
 	struct bch_nvm_namespace *ns;
-	int err;
+	int i, err;
 	pgoff_t pgoff;
 	char buf[BDEVNAME_SIZE];
 	struct block_device *bdev;
@@ -249,18 +366,49 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
 	ns->nvm_set = only_set;
 	mutex_init(&ns->lock);
 
+	/*
+	 * The parameters of bitmap_set/clear are unsigned int.
+	 * Since the current nvm size is far from exceeding this limit,
+	 * only add a WARN_ON message here.
+	 */
+	WARN_ON(BITS_TO_LONGS(ns->pages_total) > UINT_MAX);
+	ns->pages_bitmap = kvcalloc(BITS_TO_LONGS(ns->pages_total),
+					sizeof(unsigned long), GFP_KERNEL);
+	if (!ns->pages_bitmap) {
+		err = -ENOMEM;
+		goto clear_ns_nr;
+	}
+
+	if (ns->sb->this_namespace_nr == 0) {
+		ns->pgalloc_recs_bitmap = bitmap_zalloc(BCH_MAX_PGALLOC_RECS, GFP_KERNEL);
+		if (ns->pgalloc_recs_bitmap == NULL) {
+			err = -ENOMEM;
+			goto free_pages_bitmap;
+		}
+	}
+
+	for (i = 0; i < BCH_MAX_ORDER; i++)
+		INIT_LIST_HEAD(&ns->free_area[i]);
+
 	if (ns->sb->this_namespace_nr == 0) {
 		pr_info("only first namespace contain owner info\n");
 		err = init_owner_info(ns);
 		if (err < 0) {
 			pr_info("init_owner_info met error %d\n", err);
-			only_set->nss[ns->sb->this_namespace_nr] = NULL;
-			goto free_ns;
+			goto free_recs_bitmap;
 		}
+		/* init buddy allocator */
+		init_nvm_free_space(ns);
 	}
 
 	kfree(path);
 	return ns;
+free_recs_bitmap:
+	bitmap_free(ns->pgalloc_recs_bitmap);
+free_pages_bitmap:
+	kvfree(ns->pages_bitmap);
+clear_ns_nr:
+	only_set->nss[ns->sb->this_namespace_nr] = NULL;
 free_ns:
 	kfree(ns);
 bdput:
diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
index 3e24c4dee7fd..71beb244b9be 100644
--- a/drivers/md/bcache/nvm-pages.h
+++ b/drivers/md/bcache/nvm-pages.h
@@ -16,6 +16,7 @@
  * to which owner. After reboot from power failure, they will be initialized
  * based on nvm pages superblock in NVDIMM device.
  */
+#define BCH_MAX_ORDER 20
 struct bch_nvm_namespace {
 	struct bch_nvm_pages_sb *sb;
 	void *kaddr;
@@ -27,6 +28,11 @@ struct bch_nvm_namespace {
 	u64 pages_total;
 	pfn_t start_pfn;
 
+	unsigned long *pages_bitmap;
+	struct list_head free_area[BCH_MAX_ORDER];
+
+	unsigned long *pgalloc_recs_bitmap;
+
 	struct dax_device *dax_dev;
 	struct block_device *bdev;
 	struct bch_nvm_set *nvm_set;
diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
index 5094a6797679..1fdb3eaabf7e 100644
--- a/include/uapi/linux/bcache-nvm.h
+++ b/include/uapi/linux/bcache-nvm.h
@@ -130,11 +130,15 @@ union {
 };
 };
 
-#define BCH_MAX_RECS					\
-	((sizeof(struct bch_nvm_pgalloc_recs) -		\
-	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
+#define BCH_MAX_RECS							\
+	((sizeof(struct bch_nvm_pgalloc_recs) -				\
+	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /			\
 	 sizeof(struct bch_pgalloc_rec))
 
+#define BCH_MAX_PGALLOC_RECS						\
+	((BCH_NVM_PAGES_OFFSET - BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET) /	\
+	 sizeof(struct bch_nvm_pgalloc_recs))
+
 struct bch_nvm_pages_owner_head {
 	unsigned char			uuid[16];
 	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
-- 
2.26.2



* [PATCH 06/14] bcache: bch_nvm_alloc_pages() of the buddy
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
                   ` (4 preceding siblings ...)
  2021-06-15  5:49 ` [PATCH 05/14] bcache: initialization of the buddy Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-22 10:51   ` Hannes Reinecke
  2021-06-15  5:49 ` [PATCH 07/14] bcache: bch_nvm_free_pages() " Coly Li
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, Coly Li

From: Jianpeng Ma <jianpeng.ma@intel.com>

This patch implements bch_nvm_alloc_pages() of the buddy allocator.
Functionally it is similar to the regular page buddy allocation, with
these differences:
a: it needs an owner_uuid parameter which records the owner info, and
it makes that info persistent.
b: it doesn't need flags like GFP_*; all allocations are equal.
c: it doesn't trigger other operations such as swap/reclaim.
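
A hedged usage sketch (the owner_uuid variable is illustrative, typically
the 16-byte bcache cache set uuid):

        /* allocate 1 << 2 = 4 contiguous nvm pages for this owner */
        void *kaddr = bch_nvm_alloc_pages(2, owner_uuid);

        if (!kaddr)
                pr_err("nvm page allocation failed\n");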

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/nvm-pages.c   | 174 ++++++++++++++++++++++++++++++++
 drivers/md/bcache/nvm-pages.h   |   6 ++
 include/uapi/linux/bcache-nvm.h |   6 +-
 3 files changed, 184 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
index 804ee66e97be..5d095d241483 100644
--- a/drivers/md/bcache/nvm-pages.c
+++ b/drivers/md/bcache/nvm-pages.c
@@ -74,6 +74,180 @@ static inline void remove_owner_space(struct bch_nvm_namespace *ns,
 	}
 }
 
+/* If not found, it will create if create == true */
+static struct bch_nvm_pages_owner_head *find_owner_head(const char *owner_uuid, bool create)
+{
+	struct bch_owner_list_head *owner_list_head = only_set->owner_list_head;
+	struct bch_nvm_pages_owner_head *owner_head = NULL;
+	int i;
+
+	if (owner_list_head == NULL)
+		goto out;
+
+	for (i = 0; i < only_set->owner_list_used; i++) {
+		if (!memcmp(owner_uuid, owner_list_head->heads[i].uuid, 16)) {
+			owner_head = &(owner_list_head->heads[i]);
+			break;
+		}
+	}
+
+	if (!owner_head && create) {
+		u32 used = only_set->owner_list_used;
+
+		if (only_set->owner_list_size > used) {
+			memcpy_flushcache(owner_list_head->heads[used].uuid, owner_uuid, 16);
+			only_set->owner_list_used++;
+
+			owner_list_head->used++;
+			owner_head = &(owner_list_head->heads[used]);
+		} else
+			pr_info("no free bch_nvm_pages_owner_head\n");
+	}
+
+out:
+	return owner_head;
+}
+
+static struct bch_nvm_pgalloc_recs *find_empty_pgalloc_recs(void)
+{
+	unsigned int start;
+	struct bch_nvm_namespace *ns = only_set->nss[0];
+	struct bch_nvm_pgalloc_recs *recs;
+
+	start = bitmap_find_next_zero_area(ns->pgalloc_recs_bitmap, BCH_MAX_PGALLOC_RECS, 0, 1, 0);
+	if (start > BCH_MAX_PGALLOC_RECS) {
+		pr_info("no free struct bch_nvm_pgalloc_recs\n");
+		return NULL;
+	}
+
+	bitmap_set(ns->pgalloc_recs_bitmap, start, 1);
+	recs = (struct bch_nvm_pgalloc_recs *)(ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET)
+		+ start;
+	return recs;
+}
+
+static struct bch_nvm_pgalloc_recs *find_nvm_pgalloc_recs(struct bch_nvm_namespace *ns,
+		struct bch_nvm_pages_owner_head *owner_head, bool create)
+{
+	int ns_nr = ns->sb->this_namespace_nr;
+	struct bch_nvm_pgalloc_recs *prev_recs = NULL, *recs = owner_head->recs[ns_nr];
+
+	/* If create=false, we return recs[nr] */
+	if (!create)
+		return recs;
+
+	/*
+	 * If create=true, it means we need an empty struct bch_pgalloc_rec slot.
+	 * So we should find a struct bch_nvm_pgalloc_recs with free slots, or
+	 * alloc a new struct bch_nvm_pgalloc_recs, and return it.
+	 */
+	while (recs && (recs->used == recs->size)) {
+		prev_recs = recs;
+		recs = recs->next;
+	}
+
+	/* Found a struct bch_nvm_pgalloc_recs with a free slot */
+	if (recs)
+		return recs;
+	/* Need to alloc a new struct bch_nvm_pgalloc_recs */
+	recs = find_empty_pgalloc_recs();
+	if (recs) {
+		recs->next = NULL;
+		recs->owner = owner_head;
+		memcpy_flushcache(recs->magic, bch_nvm_pages_pgalloc_magic, 16);
+		memcpy_flushcache(recs->owner_uuid, owner_head->uuid, 16);
+		recs->size = BCH_MAX_RECS;
+		recs->used = 0;
+
+		if (prev_recs)
+			prev_recs->next = recs;
+		else
+			owner_head->recs[ns_nr] = recs;
+	}
+
+	return recs;
+}
+
+static void add_pgalloc_rec(struct bch_nvm_pgalloc_recs *recs, void *kaddr, int order)
+{
+	int i;
+
+	for (i = 0; i < recs->size; i++) {
+		if (recs->recs[i].pgoff == 0) {
+			recs->recs[i].pgoff = (unsigned long)kaddr >> PAGE_SHIFT;
+			recs->recs[i].order = order;
+			recs->used++;
+			break;
+		}
+	}
+	BUG_ON(i == recs->size);
+}
+
+void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
+{
+	void *kaddr = NULL;
+	struct bch_nvm_pgalloc_recs *pgalloc_recs;
+	struct bch_nvm_pages_owner_head *owner_head;
+	int i, j;
+
+	mutex_lock(&only_set->lock);
+	owner_head = find_owner_head(owner_uuid, true);
+
+	if (!owner_head) {
+		pr_err("can't find bch_nvm_pages_owner_head by(uuid=%s)\n", owner_uuid);
+		goto unlock;
+	}
+
+	for (j = 0; j < only_set->total_namespaces_nr; j++) {
+		struct bch_nvm_namespace *ns = only_set->nss[j];
+
+		if (!ns || (ns->free < (1L << order)))
+			continue;
+
+		for (i = order; i < BCH_MAX_ORDER; i++) {
+			struct list_head *list;
+			struct page *page, *buddy_page;
+
+			if (list_empty(&ns->free_area[i]))
+				continue;
+
+			list = ns->free_area[i].next;
+			page = container_of((void *)list, struct page, zone_device_data);
+
+			list_del(list);
+
+			while (i != order) {
+				buddy_page = nvm_vaddr_to_page(ns,
+					nvm_pgoff_to_vaddr(ns, page->index + (1L << (i - 1))));
+				set_page_private(buddy_page, i - 1);
+				buddy_page->index = page->index + (1L << (i - 1));
+				__SetPageBuddy(buddy_page);
+				list_add((struct list_head *)&buddy_page->zone_device_data,
+					&ns->free_area[i - 1]);
+				i--;
+			}
+
+			set_page_private(page, order);
+			__ClearPageBuddy(page);
+			ns->free -= 1L << order;
+			kaddr = nvm_pgoff_to_vaddr(ns, page->index);
+			break;
+		}
+
+		if (i < BCH_MAX_ORDER) {
+			pgalloc_recs = find_nvm_pgalloc_recs(ns, owner_head, true);
+			/* ToDo: handle pgalloc_recs==NULL */
+			add_pgalloc_rec(pgalloc_recs, kaddr, order);
+			break;
+		}
+	}
+
+unlock:
+	mutex_unlock(&only_set->lock);
+	return kaddr;
+}
+EXPORT_SYMBOL_GPL(bch_nvm_alloc_pages);
+
 #define BCH_PGOFF_TO_KVADDR(pgoff) ((void *)((unsigned long)pgoff << PAGE_SHIFT))
 
 static int init_owner_info(struct bch_nvm_namespace *ns)
diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
index 71beb244b9be..f2583723aca6 100644
--- a/drivers/md/bcache/nvm-pages.h
+++ b/drivers/md/bcache/nvm-pages.h
@@ -62,6 +62,7 @@ extern struct bch_nvm_set *only_set;
 struct bch_nvm_namespace *bch_register_namespace(const char *dev_path);
 int bch_nvm_init(void);
 void bch_nvm_exit(void);
+void *bch_nvm_alloc_pages(int order, const char *owner_uuid);
 
 #else
 
@@ -74,6 +75,11 @@ static inline int bch_nvm_init(void)
 	return 0;
 }
 static inline void bch_nvm_exit(void) { }
+static inline void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
+{
+	return NULL;
+}
+
 
 #endif /* CONFIG_BCACHE_NVM_PAGES */
 
diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
index 1fdb3eaabf7e..9cb937292202 100644
--- a/include/uapi/linux/bcache-nvm.h
+++ b/include/uapi/linux/bcache-nvm.h
@@ -135,9 +135,11 @@ union {
 	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /			\
 	 sizeof(struct bch_pgalloc_rec))
 
+/* Currently 64 struct bch_nvm_pgalloc_recs is enough */
 #define BCH_MAX_PGALLOC_RECS						\
-	((BCH_NVM_PAGES_OFFSET - BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET) /	\
-	 sizeof(struct bch_nvm_pgalloc_recs))
+	(min_t(unsigned int, 64,					\
+		(BCH_NVM_PAGES_OFFSET - BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET) / \
+		 sizeof(struct bch_nvm_pgalloc_recs)))
 
 struct bch_nvm_pages_owner_head {
 	unsigned char			uuid[16];
-- 
2.26.2



* [PATCH 07/14] bcache: bch_nvm_free_pages() of the buddy
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
                   ` (5 preceding siblings ...)
  2021-06-15  5:49 ` [PATCH 06/14] bcache: bch_nvm_alloc_pages() " Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-22 10:53   ` Hannes Reinecke
  2021-06-15  5:49 ` [PATCH 08/14] bcache: get allocated pages from specific owner Coly Li
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, Coly Li

From: Jianpeng Ma <jianpeng.ma@intel.com>

This patch implements bch_nvm_free_pages() of the buddy allocator.

The difference from the regular page buddy free is that it needs the
owner_uuid to free the owner's allocated pages, and the result must be
persistent after the free.
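
A hedged usage sketch, pairing with bch_nvm_alloc_pages() from the
previous patch (the owner_uuid variable is illustrative):

        void *kaddr = bch_nvm_alloc_pages(2, owner_uuid);

        if (kaddr)
                bch_nvm_free_pages(kaddr, 2, owner_uuid);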

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/nvm-pages.c | 164 ++++++++++++++++++++++++++++++++--
 drivers/md/bcache/nvm-pages.h |   3 +-
 2 files changed, 159 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
index 5d095d241483..74d08950c67c 100644
--- a/drivers/md/bcache/nvm-pages.c
+++ b/drivers/md/bcache/nvm-pages.c
@@ -52,7 +52,7 @@ static void release_nvm_set(struct bch_nvm_set *nvm_set)
 	kfree(nvm_set);
 }
 
-static struct page *nvm_vaddr_to_page(struct bch_nvm_namespace *ns, void *addr)
+static struct page *nvm_vaddr_to_page(void *addr)
 {
 	return virt_to_page(addr);
 }
@@ -183,6 +183,155 @@ static void add_pgalloc_rec(struct bch_nvm_pgalloc_recs *recs, void *kaddr, int
 	BUG_ON(i == recs->size);
 }
 
+static inline void *nvm_end_addr(struct bch_nvm_namespace *ns)
+{
+	return ns->kaddr + (ns->pages_total << PAGE_SHIFT);
+}
+
+static inline bool in_nvm_range(struct bch_nvm_namespace *ns,
+		void *start_addr, void *end_addr)
+{
+	return (start_addr >= ns->kaddr) && (end_addr < nvm_end_addr(ns));
+}
+
+static struct bch_nvm_namespace *find_nvm_by_addr(void *addr, int order)
+{
+	int i;
+	struct bch_nvm_namespace *ns;
+
+	for (i = 0; i < only_set->total_namespaces_nr; i++) {
+		ns = only_set->nss[i];
+		if (ns && in_nvm_range(ns, addr, addr + (1L << order)))
+			return ns;
+	}
+	return NULL;
+}
+
+static int remove_pgalloc_rec(struct bch_nvm_pgalloc_recs *pgalloc_recs, int ns_nr,
+				void *kaddr, int order)
+{
+	struct bch_nvm_pages_owner_head *owner_head = pgalloc_recs->owner;
+	struct bch_nvm_pgalloc_recs *prev_recs, *sys_recs;
+	u64 pgoff = (unsigned long)kaddr >> PAGE_SHIFT;
+	struct bch_nvm_namespace *ns = only_set->nss[0];
+	int i;
+
+	prev_recs = pgalloc_recs;
+	sys_recs = ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET;
+	while (pgalloc_recs) {
+		for (i = 0; i < pgalloc_recs->size; i++) {
+			struct bch_pgalloc_rec *rec = &(pgalloc_recs->recs[i]);
+
+			if (rec->pgoff == pgoff) {
+				WARN_ON(rec->order != order);
+				rec->pgoff = 0;
+				rec->order = 0;
+				pgalloc_recs->used--;
+
+				if (pgalloc_recs->used == 0) {
+					int recs_pos = pgalloc_recs - sys_recs;
+
+					if (pgalloc_recs == prev_recs)
+						owner_head->recs[ns_nr] = pgalloc_recs->next;
+					else
+						prev_recs->next = pgalloc_recs->next;
+
+					pgalloc_recs->next = NULL;
+					pgalloc_recs->owner = NULL;
+
+					bitmap_clear(ns->pgalloc_recs_bitmap, recs_pos, 1);
+				}
+				goto exit;
+			}
+		}
+		prev_recs = pgalloc_recs;
+		pgalloc_recs = pgalloc_recs->next;
+	}
+exit:
+	return pgalloc_recs ? 0 : -ENOENT;
+}
+
+static void __free_space(struct bch_nvm_namespace *ns, void *addr, int order)
+{
+	unsigned long add_pages = (1L << order);
+	pgoff_t pgoff;
+	struct page *page;
+
+	page = nvm_vaddr_to_page(addr);
+	WARN_ON((!page) || (page->private != order));
+	pgoff = page->index;
+
+	while (order < BCH_MAX_ORDER - 1) {
+		struct page *buddy_page;
+
+		pgoff_t buddy_pgoff = pgoff ^ (1L << order);
+		pgoff_t parent_pgoff = pgoff & ~(1L << order);
+
+		if ((parent_pgoff + (1L << (order + 1)) > ns->pages_total))
+			break;
+
+		buddy_page = nvm_vaddr_to_page(nvm_pgoff_to_vaddr(ns, buddy_pgoff));
+		WARN_ON(!buddy_page);
+
+		if (PageBuddy(buddy_page) && (buddy_page->private == order)) {
+			list_del((struct list_head *)&buddy_page->zone_device_data);
+			__ClearPageBuddy(buddy_page);
+			pgoff = parent_pgoff;
+			order++;
+			continue;
+		}
+		break;
+	}
+
+	page = nvm_vaddr_to_page(nvm_pgoff_to_vaddr(ns, pgoff));
+	WARN_ON(!page);
+	list_add((struct list_head *)&page->zone_device_data, &ns->free_area[order]);
+	page->index = pgoff;
+	set_page_private(page, order);
+	__SetPageBuddy(page);
+	ns->free += add_pages;
+}
+
+void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid)
+{
+	struct bch_nvm_namespace *ns;
+	struct bch_nvm_pages_owner_head *owner_head;
+	struct bch_nvm_pgalloc_recs *pgalloc_recs;
+	int r;
+
+	mutex_lock(&only_set->lock);
+
+	ns = find_nvm_by_addr(addr, order);
+	if (!ns) {
+		pr_err("can't find nvm_dev by kaddr %p\n", addr);
+		goto unlock;
+	}
+
+	owner_head = find_owner_head(owner_uuid, false);
+	if (!owner_head) {
+		pr_err("can't find bch_nvm_pages_owner_head by(uuid=%s)\n", owner_uuid);
+		goto unlock;
+	}
+
+	pgalloc_recs = find_nvm_pgalloc_recs(ns, owner_head, false);
+	if (!pgalloc_recs) {
+		pr_err("can't find bch_nvm_pgalloc_recs by(uuid=%s)\n", owner_uuid);
+		goto unlock;
+	}
+
+	r = remove_pgalloc_rec(pgalloc_recs, ns->sb->this_namespace_nr, addr, order);
+	if (r < 0) {
+		pr_err("can't find bch_pgalloc_rec\n");
+		goto unlock;
+	}
+
+	__free_space(ns, addr, order);
+
+unlock:
+	mutex_unlock(&only_set->lock);
+}
+EXPORT_SYMBOL_GPL(bch_nvm_free_pages);
+
 void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
 {
 	void *kaddr = NULL;
@@ -217,7 +366,7 @@ void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
 			list_del(list);
 
 			while (i != order) {
-				buddy_page = nvm_vaddr_to_page(ns,
+				buddy_page = nvm_vaddr_to_page(
 					nvm_pgoff_to_vaddr(ns, page->index + (1L << (i - 1))));
 				set_page_private(buddy_page, i - 1);
 				buddy_page->index = page->index + (1L << (i - 1));
@@ -301,7 +450,7 @@ static int init_owner_info(struct bch_nvm_namespace *ns)
 						BUG_ON(rec->pgoff <= offset);
 
 						/* init struct page: index/private */
-						page = nvm_vaddr_to_page(ns,
+						page = nvm_vaddr_to_page(
 							BCH_PGOFF_TO_KVADDR(rec->pgoff));
 
 						set_page_private(page, rec->order);
@@ -340,11 +489,12 @@ static void init_nvm_free_space(struct bch_nvm_namespace *ns)
 					break;
 			}
 
-			page = nvm_vaddr_to_page(ns, nvm_pgoff_to_vaddr(ns, pgoff_start));
+			page = nvm_vaddr_to_page(nvm_pgoff_to_vaddr(ns, pgoff_start));
 			page->index = pgoff_start;
 			set_page_private(page, i);
-			__SetPageBuddy(page);
-			list_add((struct list_head *)&page->zone_device_data, &ns->free_area[i]);
+
+			/* in order to update ns->free */
+			__free_space(ns, nvm_pgoff_to_vaddr(ns, pgoff_start), i);
 
 			pgoff_start += 1L << i;
 			pages -= 1L << i;
@@ -535,7 +685,7 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
 	ns->page_size = ns->sb->page_size;
 	ns->pages_offset = ns->sb->pages_offset;
 	ns->pages_total = ns->sb->pages_total;
-	ns->free = 0;
+	ns->free = 0; /* increase by __free_space() */
 	ns->bdev = bdev;
 	ns->nvm_set = only_set;
 	mutex_init(&ns->lock);
diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
index f2583723aca6..0ca699166855 100644
--- a/drivers/md/bcache/nvm-pages.h
+++ b/drivers/md/bcache/nvm-pages.h
@@ -63,6 +63,7 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path);
 int bch_nvm_init(void);
 void bch_nvm_exit(void);
 void *bch_nvm_alloc_pages(int order, const char *owner_uuid);
+void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid);
 
 #else
 
@@ -79,7 +80,7 @@ static inline void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
 {
 	return NULL;
 }
-
+static inline void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid) { }
 
 #endif /* CONFIG_BCACHE_NVM_PAGES */
 
-- 
2.26.2



* [PATCH 08/14] bcache: get allocated pages from specific owner
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
                   ` (6 preceding siblings ...)
  2021-06-15  5:49 ` [PATCH 07/14] bcache: bch_nvm_free_pages() " Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-22 10:54   ` Hannes Reinecke
  2021-06-15  5:49 ` [PATCH 09/14] bcache: use bucket index to set GC_MARK_METADATA for journal buckets in bch_btree_gc_finish() Coly Li
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, Coly Li

From: Jianpeng Ma <jianpeng.ma@intel.com>

This patch implements bch_get_allocated_pages() of the buddy allocator,
which is used to retrieve the allocated pages of a specific owner.
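
A hedged usage sketch (the owner_uuid variable is illustrative); the
returned head can be walked via its recs[] pointers to find the ranges
recorded for that owner:

        struct bch_nvm_pages_owner_head *head;

        head = bch_get_allocated_pages(owner_uuid);
        if (head)
                pr_info("found owner list for %pU\n", head->uuid);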

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/nvm-pages.c | 6 ++++++
 drivers/md/bcache/nvm-pages.h | 5 +++++
 2 files changed, 11 insertions(+)

diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
index 74d08950c67c..42b0504d9564 100644
--- a/drivers/md/bcache/nvm-pages.c
+++ b/drivers/md/bcache/nvm-pages.c
@@ -397,6 +397,12 @@ void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
 }
 EXPORT_SYMBOL_GPL(bch_nvm_alloc_pages);
 
+struct bch_nvm_pages_owner_head *bch_get_allocated_pages(const char *owner_uuid)
+{
+	return find_owner_head(owner_uuid, false);
+}
+EXPORT_SYMBOL_GPL(bch_get_allocated_pages);
+
 #define BCH_PGOFF_TO_KVADDR(pgoff) ((void *)((unsigned long)pgoff << PAGE_SHIFT))
 
 static int init_owner_info(struct bch_nvm_namespace *ns)
diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
index 0ca699166855..c763bf2e2721 100644
--- a/drivers/md/bcache/nvm-pages.h
+++ b/drivers/md/bcache/nvm-pages.h
@@ -64,6 +64,7 @@ int bch_nvm_init(void);
 void bch_nvm_exit(void);
 void *bch_nvm_alloc_pages(int order, const char *owner_uuid);
 void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid);
+struct bch_nvm_pages_owner_head *bch_get_allocated_pages(const char *owner_uuid);
 
 #else
 
@@ -81,6 +82,10 @@ static inline void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
 	return NULL;
 }
 static inline void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid) { }
+static inline struct bch_nvm_pages_owner_head *bch_get_allocated_pages(const char *owner_uuid)
+{
+	return NULL;
+}
 
 #endif /* CONFIG_BCACHE_NVM_PAGES */
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 09/14] bcache: use bucket index to set GC_MARK_METADATA for journal buckets in bch_btree_gc_finish()
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
                   ` (7 preceding siblings ...)
  2021-06-15  5:49 ` [PATCH 08/14] bcache: get allocated pages from specific owner Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-22 10:55   ` Hannes Reinecke
  2021-06-15  5:49 ` [PATCH 10/14] bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set Coly Li
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren

Currently the meta data bucket locations on the cache device are still
reserved even though the meta data is stored on NVDIMM pages, to keep
the meta data layout consistent for now. So these buckets are still
marked as meta data by SET_GC_MARK() in bch_btree_gc_finish().

When BCH_FEATURE_INCOMPAT_NVDIMM_META is set, sb.d[] stores the linear
addresses of NVDIMM pages and not bucket indexes anymore. Therefore we
should avoid looking up bucket indexes from sb.d[], and directly use
the bucket indexes from ca->sb.first_bucket to (ca->sb.first_bucket +
ca->sb.njournal_buckets) for setting the gc mark of the journal
buckets.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
 drivers/md/bcache/btree.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 183a58c89377..e0d7135669ca 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1761,8 +1761,10 @@ static void bch_btree_gc_finish(struct cache_set *c)
 	ca = c->cache;
 	ca->invalidate_needs_gc = 0;
 
-	for (k = ca->sb.d; k < ca->sb.d + ca->sb.keys; k++)
-		SET_GC_MARK(ca->buckets + *k, GC_MARK_METADATA);
+	/* Range [first_bucket, first_bucket + keys) is for journal buckets */
+	for (i = ca->sb.first_bucket;
+	     i < ca->sb.first_bucket + ca->sb.njournal_buckets; i++)
+		SET_GC_MARK(ca->buckets + i, GC_MARK_METADATA);
 
 	for (k = ca->prio_buckets;
 	     k < ca->prio_buckets + prio_buckets(ca) * 2; k++)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 10/14] bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
                   ` (8 preceding siblings ...)
  2021-06-15  5:49 ` [PATCH 09/14] bcache: use bucket index to set GC_MARK_METADATA for journal buckets in bch_btree_gc_finish() Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-22 10:59   ` Hannes Reinecke
  2021-06-15  5:49 ` [PATCH 11/14] bcache: initialize bcache journal for NVDIMM meta device Coly Li
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren

This patch adds BCH_FEATURE_INCOMPAT_NVDIMM_META (value 0x0004) into
the incompat feature set. When this bit is set by bcache-tools, it
indicates that the bcache meta data should be stored on a specific
NVDIMM meta device.

The bcache meta data mainly includes the journal and btree nodes; when
this bit is set in the incompat feature set, bcache will ask the
nvm-pages allocator for NVDIMM space to store the meta data.
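
For reference, this is roughly how the new bit gets consumed by the
later patches in this series (a sketch, not part of this patch): once
bch_has_feature_nvdimm_meta() returns true, sb.d[] no longer holds
journal bucket indexes but linear NVDIMM addresses, so the journal
paths branch on the feature bit, e.g.,

	if (!bch_has_feature_nvdimm_meta(&ca->sb))
		__journal_write_unlocked(c);        /* legacy bio path */
	else
		__journal_nvdimm_write_unlocked(c); /* memcpy_flushcache() */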

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
 drivers/md/bcache/features.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
index d1c8fd3977fc..45d2508d5532 100644
--- a/drivers/md/bcache/features.h
+++ b/drivers/md/bcache/features.h
@@ -17,11 +17,19 @@
 #define BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET		0x0001
 /* real bucket size is (1 << bucket_size) */
 #define BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE	0x0002
+/* store bcache meta data on nvdimm */
+#define BCH_FEATURE_INCOMPAT_NVDIMM_META		0x0004
 
 #define BCH_FEATURE_COMPAT_SUPP		0
 #define BCH_FEATURE_RO_COMPAT_SUPP	0
+#if defined(CONFIG_BCACHE_NVM_PAGES)
+#define BCH_FEATURE_INCOMPAT_SUPP	(BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET| \
+					 BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE| \
+					 BCH_FEATURE_INCOMPAT_NVDIMM_META)
+#else
 #define BCH_FEATURE_INCOMPAT_SUPP	(BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET| \
 					 BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE)
+#endif
 
 #define BCH_HAS_COMPAT_FEATURE(sb, mask) \
 		((sb)->feature_compat & (mask))
@@ -89,6 +97,7 @@ static inline void bch_clear_feature_##name(struct cache_sb *sb) \
 
 BCH_FEATURE_INCOMPAT_FUNCS(obso_large_bucket, OBSO_LARGE_BUCKET);
 BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LOG_LARGE_BUCKET_SIZE);
+BCH_FEATURE_INCOMPAT_FUNCS(nvdimm_meta, NVDIMM_META);
 
 static inline bool bch_has_unknown_compat_features(struct cache_sb *sb)
 {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 11/14] bcache: initialize bcache journal for NVDIMM meta device
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
                   ` (9 preceding siblings ...)
  2021-06-15  5:49 ` [PATCH 10/14] bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-22 11:01   ` Hannes Reinecke
  2021-06-15  5:49 ` [PATCH 12/14] bcache: support storing bcache journal into " Coly Li
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren

The nvm-pages allocator may store and index the NVDIMM pages allocated
for the bcache journal. This patch adds the initialization to store
the bcache journal space on NVDIMM pages if the
BCH_FEATURE_INCOMPAT_NVDIMM_META bit is set by bcache-tools.

If BCH_FEATURE_INCOMPAT_NVDIMM_META is set, get_nvdimm_journal_space()
will return the linear address of the NVDIMM pages for the bcache
journal,
- If there is previously allocated space, find it from the nvm-pages
  owner list and return it to bch_journal_init().
- If there is no previously allocated space, request a new NVDIMM
  range from the nvm-pages allocator, and return it to
  bch_journal_init().

Then in bch_journal_init(), the linear address of the NVDIMM pages for
journal bucket 'i' is stored into sb.d[i], for every journal bucket
index 'i'.

Later when the bcache journaling code stores a journaling jset, the
target NVDIMM linear address (taken from sb.d[i] and then tracked and
updated in c->journal.key.ptr[0]) can be used directly in the memory
copy from DRAM pages into NVDIMM pages.
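
As a worked example of the size requested from the nvm-pages allocator
in get_nvdimm_journal_space(), take the hypothetical values of 512KB
buckets (bucket_size = 1024 sectors), 8 journal buckets and 4KB pages
(PAGE_SECTORS = 8):

	order = ilog2(bucket_size * njournal_buckets / PAGE_SECTORS)
	      = ilog2(1024 * 8 / 8)
	      = ilog2(1024) = 10    /* 2^10 pages = 4MB = 8 x 512KB */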

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
 drivers/md/bcache/journal.c | 105 ++++++++++++++++++++++++++++++++++++
 drivers/md/bcache/journal.h |   2 +-
 drivers/md/bcache/super.c   |  16 +++---
 3 files changed, 115 insertions(+), 8 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 61bd79babf7a..32599d2ff5d2 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -9,6 +9,8 @@
 #include "btree.h"
 #include "debug.h"
 #include "extents.h"
+#include "nvm-pages.h"
+#include "features.h"
 
 #include <trace/events/bcache.h>
 
@@ -982,3 +984,106 @@ int bch_journal_alloc(struct cache_set *c)
 
 	return 0;
 }
+
+#if defined(CONFIG_BCACHE_NVM_PAGES)
+
+static void *find_journal_nvm_base(struct bch_nvm_pages_owner_head *owner_list,
+				   struct cache *ca)
+{
+	unsigned long addr = 0;
+	struct bch_nvm_pgalloc_recs *recs_list = owner_list->recs[0];
+
+	while (recs_list) {
+		struct bch_pgalloc_rec *rec;
+		unsigned long jnl_pgoff;
+		int i;
+
+		jnl_pgoff = ((unsigned long)ca->sb.d[0]) >> PAGE_SHIFT;
+		rec = recs_list->recs;
+		for (i = 0; i < recs_list->used; i++) {
+			if (rec->pgoff == jnl_pgoff)
+				break;
+			rec++;
+		}
+		if (i < recs_list->used) {
+			addr = rec->pgoff << PAGE_SHIFT;
+			break;
+		}
+		recs_list = recs_list->next;
+	}
+	return (void *)addr;
+}
+
+static void *get_nvdimm_journal_space(struct cache *ca)
+{
+	struct bch_nvm_pages_owner_head *owner_list = NULL;
+	void *ret = NULL;
+	int order;
+
+	owner_list = bch_get_allocated_pages(ca->sb.set_uuid);
+	if (owner_list) {
+		ret = find_journal_nvm_base(owner_list, ca);
+		if (ret)
+			goto found;
+	}
+
+	order = ilog2(ca->sb.bucket_size *
+		      ca->sb.njournal_buckets / PAGE_SECTORS);
+	ret = bch_nvm_alloc_pages(order, ca->sb.set_uuid);
+	if (ret)
+		memset(ret, 0, (1 << order) * PAGE_SIZE);
+
+found:
+	return ret;
+}
+
+static int __bch_journal_nvdimm_init(struct cache *ca)
+{
+	int i, ret = 0;
+	void *journal_nvm_base = NULL;
+
+	journal_nvm_base = get_nvdimm_journal_space(ca);
+	if (!journal_nvm_base) {
+		pr_err("Failed to get journal space from nvdimm\n");
+		ret = -1;
+		goto out;
+	}
+
+	/* Initialized and reloaded from on-disk super block already */
+	if (ca->sb.d[0] != 0)
+		goto out;
+
+	for (i = 0; i < ca->sb.keys; i++)
+		ca->sb.d[i] =
+			(u64)(journal_nvm_base + (ca->sb.bucket_size * i));
+
+out:
+	return ret;
+}
+
+#else /* CONFIG_BCACHE_NVM_PAGES */
+
+static int __bch_journal_nvdimm_init(struct cache *ca)
+{
+	return -1;
+}
+
+#endif /* CONFIG_BCACHE_NVM_PAGES */
+
+int bch_journal_init(struct cache_set *c)
+{
+	int i, ret = 0;
+	struct cache *ca = c->cache;
+
+	ca->sb.keys = clamp_t(int, ca->sb.nbuckets >> 7,
+				2, SB_JOURNAL_BUCKETS);
+
+	if (!bch_has_feature_nvdimm_meta(&ca->sb)) {
+		for (i = 0; i < ca->sb.keys; i++)
+			ca->sb.d[i] = ca->sb.first_bucket + i;
+	} else {
+		ret = __bch_journal_nvdimm_init(ca);
+	}
+
+	return ret;
+}
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index f2ea34d5f431..e3a7fa5a8fda 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -179,7 +179,7 @@ void bch_journal_mark(struct cache_set *c, struct list_head *list);
 void bch_journal_meta(struct cache_set *c, struct closure *cl);
 int bch_journal_read(struct cache_set *c, struct list_head *list);
 int bch_journal_replay(struct cache_set *c, struct list_head *list);
-
+int bch_journal_init(struct cache_set *c);
 void bch_journal_free(struct cache_set *c);
 int bch_journal_alloc(struct cache_set *c);
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index ce22aefb1352..cce0f6bf0944 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -147,10 +147,15 @@ static const char *read_super_common(struct cache_sb *sb,  struct block_device *
 		goto err;
 
 	err = "Journal buckets not sequential";
+#if defined(CONFIG_BCACHE_NVM_PAGES)
+	if (!bch_has_feature_nvdimm_meta(sb)) {
+#endif
 	for (i = 0; i < sb->keys; i++)
 		if (sb->d[i] != sb->first_bucket + i)
 			goto err;
-
+#ifdef CONFIG_BCACHE_NVM_PAGES
+	} /* bch_has_feature_nvdimm_meta */
+#endif
 	err = "Too many journal buckets";
 	if (sb->first_bucket + sb->keys > sb->nbuckets)
 		goto err;
@@ -2072,14 +2077,11 @@ static int run_cache_set(struct cache_set *c)
 		if (bch_journal_replay(c, &journal))
 			goto err;
 	} else {
-		unsigned int j;
-
 		pr_notice("invalidating existing data\n");
-		ca->sb.keys = clamp_t(int, ca->sb.nbuckets >> 7,
-					2, SB_JOURNAL_BUCKETS);
 
-		for (j = 0; j < ca->sb.keys; j++)
-			ca->sb.d[j] = ca->sb.first_bucket + j;
+		err = "error initializing journal";
+		if (bch_journal_init(c))
+			goto err;
 
 		bch_initial_gc_finish(c);
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 12/14] bcache: support storing bcache journal into NVDIMM meta device
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
                   ` (10 preceding siblings ...)
  2021-06-15  5:49 ` [PATCH 11/14] bcache: initialize bcache journal for NVDIMM meta device Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-22 11:03   ` Hannes Reinecke
  2021-06-15  5:49 ` [PATCH 13/14] bcache: read jset from NVDIMM pages for journal replay Coly Li
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren

This patch implements two methods to store the bcache journal,
1) __journal_write_unlocked() for block interface devices
   The legacy method, which composes a bio and issues the jset bio to
   the cache device (e.g. SSD). c->journal.key.ptr[0] indicates the
   LBA on the cache device to store the journal jset.
2) __journal_nvdimm_write_unlocked() for memory interface NVDIMM
   Use the memory interface to access NVDIMM pages and store the jset
   by memcpy_flushcache(). c->journal.key.ptr[0] indicates the linear
   address of the NVDIMM pages to store the journal jset.

For a legacy configuration without an NVDIMM meta device, journal I/O
is handled by __journal_write_unlocked() with the existing code logic.
If the NVDIMM meta device is used (set up by bcache-tools), the
journal I/O will be handled by __journal_nvdimm_write_unlocked() and
go into the NVDIMM pages.

And when the NVDIMM meta device is used, sb.d[] stores the linear
addresses of NVDIMM pages (no longer bucket indexes), so in
journal_reclaim() the journaling location in c->journal.key.ptr[0]
should also be updated with the linear address from the NVDIMM pages
(no longer an LBA combined from a sector offset and a bucket index).
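
Condensed, the NVDIMM path in __journal_nvdimm_write_unlocked() is a
plain persistent-memory store instead of a bio; its core is sketched
below, with 'sectors' computed from the jset size as in the existing
code:

	void *dst = (void *)c->journal.key.ptr[0]; /* linear NVDIMM address */
	size_t bytes = sectors << 9;

	memcpy_flushcache(dst, w->data, bytes);    /* store + flush CPU caches */
	c->journal.key.ptr[0] += bytes;            /* advance the write cursor */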

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
 drivers/md/bcache/journal.c   | 119 ++++++++++++++++++++++++----------
 drivers/md/bcache/nvm-pages.h |   1 +
 drivers/md/bcache/super.c     |  28 +++++++-
 3 files changed, 110 insertions(+), 38 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 32599d2ff5d2..03ecedf813b0 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -596,6 +596,8 @@ static void do_journal_discard(struct cache *ca)
 		return;
 	}
 
+	BUG_ON(bch_has_feature_nvdimm_meta(&ca->sb));
+
 	switch (atomic_read(&ja->discard_in_flight)) {
 	case DISCARD_IN_FLIGHT:
 		return;
@@ -661,9 +663,13 @@ static void journal_reclaim(struct cache_set *c)
 		goto out;
 
 	ja->cur_idx = next;
-	k->ptr[0] = MAKE_PTR(0,
-			     bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
-			     ca->sb.nr_this_dev);
+	if (!bch_has_feature_nvdimm_meta(&ca->sb))
+		k->ptr[0] = MAKE_PTR(0,
+			bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
+			ca->sb.nr_this_dev);
+	else
+		k->ptr[0] = ca->sb.d[ja->cur_idx];
+
 	atomic_long_inc(&c->reclaimed_journal_buckets);
 
 	bkey_init(k);
@@ -729,46 +735,21 @@ static void journal_write_unlock(struct closure *cl)
 	spin_unlock(&c->journal.lock);
 }
 
-static void journal_write_unlocked(struct closure *cl)
+
+static void __journal_write_unlocked(struct cache_set *c)
 	__releases(c->journal.lock)
 {
-	struct cache_set *c = container_of(cl, struct cache_set, journal.io);
-	struct cache *ca = c->cache;
-	struct journal_write *w = c->journal.cur;
 	struct bkey *k = &c->journal.key;
-	unsigned int i, sectors = set_blocks(w->data, block_bytes(ca)) *
-		ca->sb.block_size;
-
+	struct journal_write *w = c->journal.cur;
+	struct closure *cl = &c->journal.io;
+	struct cache *ca = c->cache;
 	struct bio *bio;
 	struct bio_list list;
+	unsigned int i, sectors = set_blocks(w->data, block_bytes(ca)) *
+		ca->sb.block_size;
 
 	bio_list_init(&list);
 
-	if (!w->need_write) {
-		closure_return_with_destructor(cl, journal_write_unlock);
-		return;
-	} else if (journal_full(&c->journal)) {
-		journal_reclaim(c);
-		spin_unlock(&c->journal.lock);
-
-		btree_flush_write(c);
-		continue_at(cl, journal_write, bch_journal_wq);
-		return;
-	}
-
-	c->journal.blocks_free -= set_blocks(w->data, block_bytes(ca));
-
-	w->data->btree_level = c->root->level;
-
-	bkey_copy(&w->data->btree_root, &c->root->key);
-	bkey_copy(&w->data->uuid_bucket, &c->uuid_bucket);
-
-	w->data->prio_bucket[ca->sb.nr_this_dev] = ca->prio_buckets[0];
-	w->data->magic		= jset_magic(&ca->sb);
-	w->data->version	= BCACHE_JSET_VERSION;
-	w->data->last_seq	= last_seq(&c->journal);
-	w->data->csum		= csum_set(w->data);
-
 	for (i = 0; i < KEY_PTRS(k); i++) {
 		ca = c->cache;
 		bio = &ca->journal.bio;
@@ -793,7 +774,6 @@ static void journal_write_unlocked(struct closure *cl)
 
 		ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
 	}
-
 	/* If KEY_PTRS(k) == 0, this jset gets lost in air */
 	BUG_ON(i == 0);
 
@@ -805,6 +785,73 @@ static void journal_write_unlocked(struct closure *cl)
 
 	while ((bio = bio_list_pop(&list)))
 		closure_bio_submit(c, bio, cl);
+}
+
+#if defined(CONFIG_BCACHE_NVM_PAGES)
+
+static void __journal_nvdimm_write_unlocked(struct cache_set *c)
+	__releases(c->journal.lock)
+{
+	struct journal_write *w = c->journal.cur;
+	struct cache *ca = c->cache;
+	unsigned int sectors;
+
+	sectors = set_blocks(w->data, block_bytes(ca)) * ca->sb.block_size;
+	atomic_long_add(sectors, &ca->meta_sectors_written);
+
+	memcpy_flushcache((void *)c->journal.key.ptr[0], w->data, sectors << 9);
+
+	c->journal.key.ptr[0] += sectors << 9;
+	ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
+
+	atomic_dec_bug(&fifo_back(&c->journal.pin));
+	bch_journal_next(&c->journal);
+	journal_reclaim(c);
+
+	spin_unlock(&c->journal.lock);
+}
+
+#else /* CONFIG_BCACHE_NVM_PAGES */
+
+static void __journal_nvdimm_write_unlocked(struct cache_set *c) { }
+
+#endif /* CONFIG_BCACHE_NVM_PAGES */
+
+static void journal_write_unlocked(struct closure *cl)
+{
+	struct cache_set *c = container_of(cl, struct cache_set, journal.io);
+	struct cache *ca = c->cache;
+	struct journal_write *w = c->journal.cur;
+
+	if (!w->need_write) {
+		closure_return_with_destructor(cl, journal_write_unlock);
+		return;
+	} else if (journal_full(&c->journal)) {
+		journal_reclaim(c);
+		spin_unlock(&c->journal.lock);
+
+		btree_flush_write(c);
+		continue_at(cl, journal_write, bch_journal_wq);
+		return;
+	}
+
+	c->journal.blocks_free -= set_blocks(w->data, block_bytes(ca));
+
+	w->data->btree_level = c->root->level;
+
+	bkey_copy(&w->data->btree_root, &c->root->key);
+	bkey_copy(&w->data->uuid_bucket, &c->uuid_bucket);
+
+	w->data->prio_bucket[ca->sb.nr_this_dev] = ca->prio_buckets[0];
+	w->data->magic		= jset_magic(&ca->sb);
+	w->data->version	= BCACHE_JSET_VERSION;
+	w->data->last_seq	= last_seq(&c->journal);
+	w->data->csum		= csum_set(w->data);
+
+	if (!bch_has_feature_nvdimm_meta(&ca->sb))
+		__journal_write_unlocked(c);
+	else
+		__journal_nvdimm_write_unlocked(c);
 
 	continue_at(cl, journal_write_done, NULL);
 }
diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
index c763bf2e2721..736a661777b7 100644
--- a/drivers/md/bcache/nvm-pages.h
+++ b/drivers/md/bcache/nvm-pages.h
@@ -5,6 +5,7 @@
 
 #if defined(CONFIG_BCACHE_NVM_PAGES)
 #include <linux/bcache-nvm.h>
+#include <linux/libnvdimm.h>
 #endif /* CONFIG_BCACHE_NVM_PAGES */
 
 /*
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index cce0f6bf0944..4d6666d03aa7 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1686,7 +1686,32 @@ void bch_cache_set_release(struct kobject *kobj)
 static void cache_set_free(struct closure *cl)
 {
 	struct cache_set *c = container_of(cl, struct cache_set, cl);
-	struct cache *ca;
+	struct cache *ca = c->cache;
+
+#if defined(CONFIG_BCACHE_NVM_PAGES)
+	/* Flush cache if journal stored in NVDIMM */
+	if (ca && bch_has_feature_nvdimm_meta(&ca->sb)) {
+		unsigned long bucket_size = ca->sb.bucket_size;
+		int i;
+
+		for (i = 0; i < ca->sb.keys; i++) {
+			unsigned long offset = 0;
+			unsigned int len = round_down(UINT_MAX, 2);
+
+			if ((void *)ca->sb.d[i] == NULL)
+				continue;
+
+			while (bucket_size > 0) {
+				if (len > bucket_size)
+					len = bucket_size;
+				arch_invalidate_pmem(
+					(void *)(ca->sb.d[i] + offset), len);
+				offset += len;
+				bucket_size -= len;
+			}
+		}
+	}
+#endif /* CONFIG_BCACHE_NVM_PAGES */
 
 	debugfs_remove(c->debug);
 
@@ -1698,7 +1723,6 @@ static void cache_set_free(struct closure *cl)
 	bch_bset_sort_state_free(&c->sort);
 	free_pages((unsigned long) c->uuids, ilog2(meta_bucket_pages(&c->cache->sb)));
 
-	ca = c->cache;
 	if (ca) {
 		ca->set = NULL;
 		c->cache = NULL;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 13/14] bcache: read jset from NVDIMM pages for journal replay
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
                   ` (11 preceding siblings ...)
  2021-06-15  5:49 ` [PATCH 12/14] bcache: support storing bcache journal into " Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-22 11:04   ` Hannes Reinecke
  2021-06-15  5:49 ` [PATCH 14/14] bcache: add sysfs interface register_nvdimm_meta to register NVDIMM meta device Coly Li
  2021-06-21 15:14 ` [PATCH 00/14] bcache patches for Linux v5.14 Jens Axboe
  14 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren

This patch implements two methods to read a jset from media for
journal replay,
- __jnl_rd_bkt() for block devices
  This is the legacy method to read the jset via the block device
  interface.
- __jnl_rd_nvm_bkt() for NVDIMM
  This is the method to read the jset via the NVDIMM memory interface,
  i.e. memcpy() from NVDIMM pages to DRAM pages.

If BCH_FEATURE_INCOMPAT_NVDIMM_META is set in the incompat feature
set, journal_read_bucket() will read the journal content from NVDIMM
by __jnl_rd_nvm_bkt() while the cache set is running. The linear
addresses of the NVDIMM pages holding the jsets are stored in
sb.d[SB_JOURNAL_BUCKETS], which were initialized and maintained in
previous runs of the cache set.

One thing to note is, when bch_journal_read() is called, the linear
addresses of the NVDIMM pages are not loaded and initialized yet, so
it is necessary to call __bch_journal_nvdimm_init() before reading the
jset from NVDIMM pages.
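
Note the read side needs no cache flushing, it is a plain copy from
the DAX-mapped range. The jset address arithmetic in __jnl_rd_nvm_bkt()
works out as below (the address value is hypothetical, only the
offsets matter):

	jset_addr = (void *)ca->sb.d[bkt_idx] + (offset << 9);

	/* e.g. sb.d[2] == 0xffffa00000400000 and offset == 16 sectors:
	 * jset_addr == 0xffffa00000400000 + 0x2000 == 0xffffa00000402000
	 */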

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
 drivers/md/bcache/journal.c | 93 +++++++++++++++++++++++++++----------
 1 file changed, 69 insertions(+), 24 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 03ecedf813b0..23e5ccf125df 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -34,60 +34,96 @@ static void journal_read_endio(struct bio *bio)
 	closure_put(cl);
 }
 
+static struct jset *__jnl_rd_bkt(struct cache *ca, unsigned int bkt_idx,
+				    unsigned int len, unsigned int offset,
+				    struct closure *cl)
+{
+	sector_t bucket = bucket_to_sector(ca->set, ca->sb.d[bkt_idx]);
+	struct bio *bio = &ca->journal.bio;
+	struct jset *data = ca->set->journal.w[0].data;
+
+	bio_reset(bio);
+	bio->bi_iter.bi_sector	= bucket + offset;
+	bio_set_dev(bio, ca->bdev);
+	bio->bi_iter.bi_size	= len << 9;
+	bio->bi_end_io	= journal_read_endio;
+	bio->bi_private = cl;
+	bio_set_op_attrs(bio, REQ_OP_READ, 0);
+	bch_bio_map(bio, data);
+
+	closure_bio_submit(ca->set, bio, cl);
+	closure_sync(cl);
+
+	/* Indeed journal.w[0].data */
+	return data;
+}
+
+#if defined(CONFIG_BCACHE_NVM_PAGES)
+
+static struct jset *__jnl_rd_nvm_bkt(struct cache *ca, unsigned int bkt_idx,
+				     unsigned int len, unsigned int offset)
+{
+	void *jset_addr = (void *)ca->sb.d[bkt_idx] + (offset << 9);
+	struct jset *data = ca->set->journal.w[0].data;
+
+	memcpy(data, jset_addr, len << 9);
+
+	/* Indeed journal.w[0].data */
+	return data;
+}
+
+#else /* CONFIG_BCACHE_NVM_PAGES */
+
+static struct jset *__jnl_rd_nvm_bkt(struct cache *ca, unsigned int bkt_idx,
+				     unsigned int len, unsigned int offset)
+{
+	return NULL;
+}
+
+#endif /* CONFIG_BCACHE_NVM_PAGES */
+
 static int journal_read_bucket(struct cache *ca, struct list_head *list,
-			       unsigned int bucket_index)
+			       unsigned int bucket_idx)
 {
 	struct journal_device *ja = &ca->journal;
-	struct bio *bio = &ja->bio;
 
 	struct journal_replay *i;
-	struct jset *j, *data = ca->set->journal.w[0].data;
+	struct jset *j;
 	struct closure cl;
 	unsigned int len, left, offset = 0;
 	int ret = 0;
-	sector_t bucket = bucket_to_sector(ca->set, ca->sb.d[bucket_index]);
 
 	closure_init_stack(&cl);
 
-	pr_debug("reading %u\n", bucket_index);
+	pr_debug("reading %u\n", bucket_idx);
 
 	while (offset < ca->sb.bucket_size) {
 reread:		left = ca->sb.bucket_size - offset;
 		len = min_t(unsigned int, left, PAGE_SECTORS << JSET_BITS);
 
-		bio_reset(bio);
-		bio->bi_iter.bi_sector	= bucket + offset;
-		bio_set_dev(bio, ca->bdev);
-		bio->bi_iter.bi_size	= len << 9;
-
-		bio->bi_end_io	= journal_read_endio;
-		bio->bi_private = &cl;
-		bio_set_op_attrs(bio, REQ_OP_READ, 0);
-		bch_bio_map(bio, data);
-
-		closure_bio_submit(ca->set, bio, &cl);
-		closure_sync(&cl);
+		if (!bch_has_feature_nvdimm_meta(&ca->sb))
+			j = __jnl_rd_bkt(ca, bucket_idx, len, offset, &cl);
+		else
+			j = __jnl_rd_nvm_bkt(ca, bucket_idx, len, offset);
 
 		/* This function could be simpler now since we no longer write
 		 * journal entries that overlap bucket boundaries; this means
 		 * the start of a bucket will always have a valid journal entry
 		 * if it has any journal entries at all.
 		 */
-
-		j = data;
 		while (len) {
 			struct list_head *where;
 			size_t blocks, bytes = set_bytes(j);
 
 			if (j->magic != jset_magic(&ca->sb)) {
-				pr_debug("%u: bad magic\n", bucket_index);
+				pr_debug("%u: bad magic\n", bucket_idx);
 				return ret;
 			}
 
 			if (bytes > left << 9 ||
 			    bytes > PAGE_SIZE << JSET_BITS) {
 				pr_info("%u: too big, %zu bytes, offset %u\n",
-					bucket_index, bytes, offset);
+					bucket_idx, bytes, offset);
 				return ret;
 			}
 
@@ -96,7 +132,7 @@ reread:		left = ca->sb.bucket_size - offset;
 
 			if (j->csum != csum_set(j)) {
 				pr_info("%u: bad csum, %zu bytes, offset %u\n",
-					bucket_index, bytes, offset);
+					bucket_idx, bytes, offset);
 				return ret;
 			}
 
@@ -158,8 +194,8 @@ reread:		left = ca->sb.bucket_size - offset;
 			list_add(&i->list, where);
 			ret = 1;
 
-			if (j->seq > ja->seq[bucket_index])
-				ja->seq[bucket_index] = j->seq;
+			if (j->seq > ja->seq[bucket_idx])
+				ja->seq[bucket_idx] = j->seq;
 next_set:
 			offset	+= blocks * ca->sb.block_size;
 			len	-= blocks * ca->sb.block_size;
@@ -170,6 +206,8 @@ reread:		left = ca->sb.bucket_size - offset;
 	return ret;
 }
 
+static int __bch_journal_nvdimm_init(struct cache *ca);
+
 int bch_journal_read(struct cache_set *c, struct list_head *list)
 {
 #define read_bucket(b)							\
@@ -188,6 +226,13 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 	unsigned int i, l, r, m;
 	uint64_t seq;
 
+	/*
+	 * Linear addresses of NVDIMM pages for journaling is not
+	 * initialized yet, do it before read jset from NVDIMM pages.
+	 */
+	if (bch_has_feature_nvdimm_meta(&ca->sb))
+		__bch_journal_nvdimm_init(ca);
+
 	bitmap_zero(bitmap, SB_JOURNAL_BUCKETS);
 	pr_debug("%u journal buckets\n", ca->sb.njournal_buckets);
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 14/14] bcache: add sysfs interface register_nvdimm_meta to register NVDIMM meta device
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
                   ` (12 preceding siblings ...)
  2021-06-15  5:49 ` [PATCH 13/14] bcache: read jset from NVDIMM pages for journal replay Coly Li
@ 2021-06-15  5:49 ` Coly Li
  2021-06-22 11:04   ` Hannes Reinecke
  2021-06-21 15:14 ` [PATCH 00/14] bcache patches for Linux v5.14 Jens Axboe
  14 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-15  5:49 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Jianpeng Ma, Qiaowei Ren

This patch adds a sysfs interface register_nvdimm_meta to register an
NVDIMM meta device. The sysfs interface file only shows up when
CONFIG_BCACHE_NVM_PAGES=y. Then an NVDIMM namespace formatted by
bcache-tools can be registered with bcache by, e.g.,
  echo /dev/pmem0 > /sys/fs/bcache/register_nvdimm_meta

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
 drivers/md/bcache/super.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 4d6666d03aa7..9d506d053548 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2439,10 +2439,18 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
 					 struct kobj_attribute *attr,
 					 const char *buffer, size_t size);
+#if defined(CONFIG_BCACHE_NVM_PAGES)
+static ssize_t register_nvdimm_meta(struct kobject *k,
+				    struct kobj_attribute *attr,
+				    const char *buffer, size_t size);
+#endif
 
 kobj_attribute_write(register,		register_bcache);
 kobj_attribute_write(register_quiet,	register_bcache);
 kobj_attribute_write(pendings_cleanup,	bch_pending_bdevs_cleanup);
+#if defined(CONFIG_BCACHE_NVM_PAGES)
+kobj_attribute_write(register_nvdimm_meta, register_nvdimm_meta);
+#endif
 
 static bool bch_is_open_backing(dev_t dev)
 {
@@ -2556,6 +2564,24 @@ static void register_device_async(struct async_reg_args *args)
 	queue_delayed_work(system_wq, &args->reg_work, 10);
 }
 
+#if defined(CONFIG_BCACHE_NVM_PAGES)
+static ssize_t register_nvdimm_meta(struct kobject *k, struct kobj_attribute *attr,
+				    const char *buffer, size_t size)
+{
+	ssize_t ret = size;
+
+	struct bch_nvm_namespace *ns = bch_register_namespace(buffer);
+
+	if (IS_ERR(ns)) {
+		pr_err("register nvdimm namespace %s for meta device failed.\n",
+			buffer);
+		ret = -EINVAL;
+	}
+
+	return ret;
+}
+#endif
+
 static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			       const char *buffer, size_t size)
 {
@@ -2898,6 +2924,9 @@ static int __init bcache_init(void)
 	static const struct attribute *files[] = {
 		&ksysfs_register.attr,
 		&ksysfs_register_quiet.attr,
+#if defined(CONFIG_BCACHE_NVM_PAGES)
+		&ksysfs_register_nvdimm_meta.attr,
+#endif
 		&ksysfs_pendings_cleanup.attr,
 		NULL
 	};
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/14] bcache patches for Linux v5.14
  2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
                   ` (13 preceding siblings ...)
  2021-06-15  5:49 ` [PATCH 14/14] bcache: add sysfs interface register_nvdimm_meta to register NVDIMM meta device Coly Li
@ 2021-06-21 15:14 ` Jens Axboe
  2021-06-21 15:25   ` Coly Li
  14 siblings, 1 reply; 60+ messages in thread
From: Jens Axboe @ 2021-06-21 15:14 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-bcache, linux-block

On 6/14/21 11:49 PM, Coly Li wrote:
> Hi Jens,
> 
> Here are the bcache patches for Linux v5.14.
> 
> The patches from Chao Yu and Ding Senjie are useful code cleanup. The
> rested patches for the NVDIMM support to bcache journaling.
> 
> For the series to support NVDIMM to bache journaling, all reported
> issue since last merge window are all fixed. And no more issue detected
> during our testing or by the kernel test robot. If there is any issue
> reported during they stay in linux-next, I, Jianpang and Qiaowei will
> response and fix immediately.
> 
> Please take them for Linux v5.14.

I'd really like the user api bits to have some wider review. Maybe
I'm missing something, but there's a lot of weird stuff in the uapi
header that includes things like pointers etc.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/14] bcache patches for Linux v5.14
  2021-06-21 15:14 ` [PATCH 00/14] bcache patches for Linux v5.14 Jens Axboe
@ 2021-06-21 15:25   ` Coly Li
  2021-06-21 15:27     ` Jens Axboe
  0 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-21 15:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-bcache, linux-block

On 6/21/21 11:14 PM, Jens Axboe wrote:
> On 6/14/21 11:49 PM, Coly Li wrote:
>> Hi Jens,
>>
>> Here are the bcache patches for Linux v5.14.
>>
>> The patches from Chao Yu and Ding Senjie are useful code cleanup. The
>> rested patches for the NVDIMM support to bcache journaling.
>>
>> For the series to support NVDIMM to bache journaling, all reported
>> issue since last merge window are all fixed. And no more issue detected
>> during our testing or by the kernel test robot. If there is any issue
>> reported during they stay in linux-next, I, Jianpang and Qiaowei will
>> response and fix immediately.
>>
>> Please take them for Linux v5.14.
> I'd really like the user api bits to have some wider review. Maybe
> I'm missing something, but there's a lot of weird stuff in the uapi
> header that includes things like pointers etc.
>

Hi Jens,

As I explained two merge windows ago, we use NVDIMM as non-volatile
memory, that is, the memory objects are stored on NVDIMM as memory
objects which are non-volatile. This is why you see the pointers in
the data structures: e.g. the list is stored in the NVDIMM area, and
the code walks the list directly on the NVDIMM; we don't load it into
DRAM.

This is not a block device interface.

I will ask Dan Williams, Jan Kara, Christoph Hellwig and Hannes
Reinecke to help with the review; hopefully some of these experts can
take a look.

Coly Li



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/14] bcache patches for Linux v5.14
  2021-06-21 15:25   ` Coly Li
@ 2021-06-21 15:27     ` Jens Axboe
  0 siblings, 0 replies; 60+ messages in thread
From: Jens Axboe @ 2021-06-21 15:27 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-bcache, linux-block

On 6/21/21 9:25 AM, Coly Li wrote:
> On 6/21/21 11:14 PM, Jens Axboe wrote:
>> On 6/14/21 11:49 PM, Coly Li wrote:
>>> Hi Jens,
>>>
>>> Here are the bcache patches for Linux v5.14.
>>>
>>> The patches from Chao Yu and Ding Senjie are useful code cleanup. The
>>> rested patches for the NVDIMM support to bcache journaling.
>>>
>>> For the series to support NVDIMM to bache journaling, all reported
>>> issue since last merge window are all fixed. And no more issue detected
>>> during our testing or by the kernel test robot. If there is any issue
>>> reported during they stay in linux-next, I, Jianpang and Qiaowei will
>>> response and fix immediately.
>>>
>>> Please take them for Linux v5.14.
>> I'd really like the user api bits to have some wider review. Maybe
>> I'm missing something, but there's a lot of weird stuff in the uapi
>> header that includes things like pointers etc.
>>
> 
> Hi Jens,
> 
> As I explained 2 merge windows before, we use nvdimm as non-volatiled
> memory, that is, the
> memory objects are stored on nvdimm as memory object which are
> non-volatiled. This is why
> you see the pointers in the data structure, e.g. the list is stored on
> NVDIMM area, and the code
> goes through the list directly on the NVDIMM, we don't load them into
> memory.
> 
> This is not block device interface.

Right, and I'm not oblivious to those emails. What I'm saying is that
we need more review of that, so far it seems to stand unreviewed.

> I try to ask Dan Williams, Jan Kara, Christoph Hellwig and Hannes
> Reineicke to help to review,
> hope there are some experts may help to take a look.

That'd be great.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages)
  2021-06-15  5:49 ` [PATCH 03/14] bcache: add initial data structures for nvm pages Coly Li
@ 2021-06-21 16:17   ` Coly Li
  2021-06-22  8:41     ` Huang, Ying
  2021-06-22 10:19   ` [PATCH 03/14] bcache: add initial data structures for nvm pages Hannes Reinecke
  1 sibling, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-21 16:17 UTC (permalink / raw)
  To: Dan Williams, Jan Kara, Hannes Reinecke, Christoph Hellwig, ying.huang
  Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, axboe

Hi all my dear receivers (Christoph, Dan, Hannes, Jan and Ying),

I do need your help with code review for the following patch. This
series has been posted and refined over two merge windows already,
but we lack code review from more experienced kernel developers like
you.

The following patch defines a set of on-NVDIMM memory objects,
which are used to support NVDIMM for bcache journalling. Currently
the testing hardware is Intel AEP (Apache Pass).

Qiaowei Ren and Jianpeng Ma worked with me to compose a mini pages
allocator for NVDIMM pages; we then allocate non-volatile memory
pages from NVDIMM to store the bcache journal set. The journaling can
then be very fast, and after a system reboot, once the NVDIMM mapping
is done, the bcache code can directly reference the journal set
memory objects without loading them via the block layer interface.

In order to restore allocated non-volatile memory, we use a set of
lists (named owner lists) to trace all allocated non-volatile memory
pages, identified by UUID. Just like the bcache journal set, the
lists are stored in NVDIMM and accessed directly as typical in-memory
lists; the only difference is that they are non-volatile: we access
the lists directly from NVDIMM and update them in place.

This is why you can see pointers defined in struct
bch_nvm_pgalloc_recs: such an object is referenced directly as a
memory object, and stored directly on NVDIMM.
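
To make that concrete, below is a simplified sketch of how an owner
list is looked up in place once the namespace is DAX-mapped (ns_base,
sb_offset and owner_uuid are placeholders; the structure names are
from the patch quoted below; it assumes, as the current design does,
that the namespace is mapped at the virtual address the stored
pointers expect):

	static struct bch_nvm_pages_owner_head *
	find_owner(void *ns_base, unsigned long sb_offset,
		   const unsigned char *owner_uuid)
	{
		struct bch_nvm_pages_sb *sb = ns_base + sb_offset;
		struct bch_owner_list_head *olh = sb->owner_list_head;
		unsigned int i;

		/*
		 * olh and heads[i].recs[] are absolute pointers living
		 * on the NVDIMM; they are dereferenced in place,
		 * nothing is copied into DRAM first.
		 */
		for (i = 0; i < olh->used; i++) {
			struct bch_nvm_pages_owner_head *head = &olh->heads[i];

			if (!memcmp(head->uuid, owner_uuid, 16))
				return head;
		}
		return NULL;
	}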

The current patch series works as expected with a limited data set on
both bcache-tools and the patched kernel. Because the bcache btree
nodes are not stored on NVDIMM yet, journaling for leaf node splitting
will be handled in a later series.

The whole work of supporting NVDIMM for bcache will involve,
- Storing the bcache journal on NVDIMM
- Storing bcache btree nodes on NVDIMM
- Storing cached data on NVDIMM
- On-NVDIMM object consistency across power failure

In order to make the code review easier, we submit storing the
journal on NVDIMM upstream as the first step; the following work will
be submitted step by step.

Jens wants wider review before taking this series into the bcache
upstream, and you are all experts I trust and respect.

I do ask for your help with code review, especially for the following
data structure definition patch, because I define pointers in memory
structures and reference and store them on the NVDIMM.

Thanks in advance for your help.

Coly Li




On 6/15/21 1:49 PM, Coly Li wrote:
> This patch initializes the prototype data structures for nvm pages
> allocator,
>
> - struct bch_nvm_pages_sb
> This is the super block allocated on each nvdimm namespace. A nvdimm
> set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used
> to mark which nvdimm set this name space belongs to. Normally we will
> use the bcache's cache set UUID to initialize this uuid, to connect this
> nvdimm set to a specified bcache cache set.
>
> - struct bch_owner_list_head
> This is a table for all heads of all owner lists. A owner list records
> which page(s) allocated to which owner. After reboot from power failure,
> the ownwer may find all its requested and allocated pages from the owner
> list by a handler which is converted by a UUID.
>
> - struct bch_nvm_pages_owner_head
> This is a head of an owner list. Each owner only has one owner list,
> and a nvm page only belongs to an specific owner. uuid[] will be set to
> owner's uuid, for bcache it is the bcache's cache set uuid. label is not
> mandatory, it is a human-readable string for debug purpose. The pointer
> *recs references to separated nvm page which hold the table of struct
> bch_nvm_pgalloc_rec.
>
> - struct bch_nvm_pgalloc_recs
> This struct occupies a whole page, owner_uuid should match the uuid
> in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
> allocated records.
>
> - struct bch_nvm_pgalloc_rec
> Each structure records a range of allocated nvm pages.
>   - Bits  0 - 51: is pages offset of the allocated pages.
>   - Bits 52 - 57: allocaed size in page_size * order-of-2
>   - Bits 58 - 63: reserved.
> Since each of the allocated nvm pages are power of 2, using 6 bits to
> represent allocated size can have (1<<(1<<64) - 1) * PAGE_SIZE maximum
> value. It can be a 76 bits width range size in byte for 4KB page size,
> which is large enough currently.
>
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
> ---
>  include/uapi/linux/bcache-nvm.h | 200 ++++++++++++++++++++++++++++++++
>  1 file changed, 200 insertions(+)
>  create mode 100644 include/uapi/linux/bcache-nvm.h
>
> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
> new file mode 100644
> index 000000000000..5094a6797679
> --- /dev/null
> +++ b/include/uapi/linux/bcache-nvm.h
> @@ -0,0 +1,200 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +
> +#ifndef _UAPI_BCACHE_NVM_H
> +#define _UAPI_BCACHE_NVM_H
> +
> +#if (__BITS_PER_LONG == 64)
> +/*
> + * Bcache on NVDIMM data structures
> + */
> +
> +/*
> + * - struct bch_nvm_pages_sb
> + *   This is the super block allocated on each nvdimm namespace. A nvdimm
> + * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
> + * which nvdimm set this name space belongs to. Normally we will use the
> + * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
> + * set to a specified bcache cache set.
> + *
> + * - struct bch_owner_list_head
> + *   This is a table for all heads of all owner lists. A owner list records
> + * which page(s) allocated to which owner. After reboot from power failure,
> + * the ownwer may find all its requested and allocated pages from the owner
> + * list by a handler which is converted by a UUID.
> + *
> + * - struct bch_nvm_pages_owner_head
> + *   This is a head of an owner list. Each owner only has one owner list,
> + * and a nvm page only belongs to an specific owner. uuid[] will be set to
> + * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
> + * mandatory, it is a human-readable string for debug purpose. The pointer
> + * recs references to separated nvm page which hold the table of struct
> + * bch_pgalloc_rec.
> + *
> + *- struct bch_nvm_pgalloc_recs
> + *  This structure occupies a whole page, owner_uuid should match the uuid
> + * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
> + * allocated records.
> + *
> + * - struct bch_pgalloc_rec
> + *   Each structure records a range of allocated nvm pages. pgoff is offset
> + * in unit of page size of this allocated nvm page range. The adjoint page
> + * ranges of same owner can be merged into a larger one, therefore pages_nr
> + * is NOT always power of 2.
> + *
> + *
> + * Memory layout on nvdimm namespace 0
> + *
> + *    0 +---------------------------------+
> + *      |                                 |
> + *  4KB +---------------------------------+
> + *      |         bch_nvm_pages_sb        |
> + *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
> + *      |       bch_owner_list_head       |
> + *      |                                 |
> + * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
> + *      |       bch_nvm_pgalloc_recs      |
> + *      |  (nvm pages internal usage)     |
> + * 24KB +---------------------------------+
> + *      |                                 |
> + *      |                                 |
> + * 16MB  +---------------------------------+
> + *      |      allocable nvm pages        |
> + *      |      for buddy allocator        |
> + * end  +---------------------------------+
> + *
> + *
> + *
> + * Memory layout on nvdimm namespace N
> + * (doesn't have owner list)
> + *
> + *    0 +---------------------------------+
> + *      |                                 |
> + *  4KB +---------------------------------+
> + *      |         bch_nvm_pages_sb        |
> + *  8KB +---------------------------------+
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + * 16MB  +---------------------------------+
> + *      |      allocable nvm pages        |
> + *      |      for buddy allocator        |
> + * end  +---------------------------------+
> + *
> + */
> +
> +#include <linux/types.h>
> +
> +/* In sectors */
> +#define BCH_NVM_PAGES_SB_OFFSET			4096
> +#define BCH_NVM_PAGES_OFFSET			(16 << 20)
> +
> +#define BCH_NVM_PAGES_LABEL_SIZE		32
> +#define BCH_NVM_PAGES_NAMESPACES_MAX		8
> +
> +#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
> +#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
> +
> +#define BCH_NVM_PAGES_SB_VERSION		0
> +#define BCH_NVM_PAGES_SB_VERSION_MAX		0
> +
> +static const unsigned char bch_nvm_pages_magic[] = {
> +	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
> +	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
> +static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
> +	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
> +	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
> +
> +/* takes 64bit width */
> +struct bch_pgalloc_rec {
> +	__u64	pgoff:52;
> +	__u64	order:6;
> +	__u64	reserved:6;
> +};
> +
> +struct bch_nvm_pgalloc_recs {
> +union {
> +	struct {
> +		struct bch_nvm_pages_owner_head	*owner;
> +		struct bch_nvm_pgalloc_recs	*next;
> +		unsigned char			magic[16];
> +		unsigned char			owner_uuid[16];
> +		unsigned int			size;
> +		unsigned int			used;
> +		unsigned long			_pad[4];
> +		struct bch_pgalloc_rec		recs[];
> +	};
> +	unsigned char				pad[8192];
> +};
> +};
> +
> +#define BCH_MAX_RECS					\
> +	((sizeof(struct bch_nvm_pgalloc_recs) -		\
> +	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
> +	 sizeof(struct bch_pgalloc_rec))
> +
> +struct bch_nvm_pages_owner_head {
> +	unsigned char			uuid[16];
> +	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
> +	/* Per-namespace own lists */
> +	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
> +};
> +
> +/* heads[0] is always for nvm_pages internal usage */
> +struct bch_owner_list_head {
> +union {
> +	struct {
> +		unsigned int			size;
> +		unsigned int			used;
> +		unsigned long			_pad[4];
> +		struct bch_nvm_pages_owner_head	heads[];
> +	};
> +	unsigned char				pad[8192];
> +};
> +};
> +#define BCH_MAX_OWNER_LIST				\
> +	((sizeof(struct bch_owner_list_head) -		\
> +	 offsetof(struct bch_owner_list_head, heads)) /	\
> +	 sizeof(struct bch_nvm_pages_owner_head))
> +
> +/* The on-media bit order is local CPU order */
> +struct bch_nvm_pages_sb {
> +	unsigned long				csum;
> +	unsigned long				ns_start;
> +	unsigned long				sb_offset;
> +	unsigned long				version;
> +	unsigned char				magic[16];
> +	unsigned char				uuid[16];
> +	unsigned int				page_size;
> +	unsigned int				total_namespaces_nr;
> +	unsigned int				this_namespace_nr;
> +	union {
> +		unsigned char			set_uuid[16];
> +		unsigned long			set_magic;
> +	};
> +
> +	unsigned long				flags;
> +	unsigned long				seq;
> +
> +	unsigned long				feature_compat;
> +	unsigned long				feature_incompat;
> +	unsigned long				feature_ro_compat;
> +
> +	/* For allocable nvm pages from buddy systems */
> +	unsigned long				pages_offset;
> +	unsigned long				pages_total;
> +
> +	unsigned long				pad[8];
> +
> +	/* Only on the first name space */
> +	struct bch_owner_list_head		*owner_list_head;
> +
> +	/* Just for csum_set() */
> +	unsigned int				keys;
> +	unsigned long				d[0];
> +};
> +#endif /* __BITS_PER_LONG == 64 */
> +
> +#endif /* _UAPI_BCACHE_NVM_H */


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages)
  2021-06-21 16:17   ` Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages) Coly Li
@ 2021-06-22  8:41     ` Huang, Ying
  2021-06-23  4:32       ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Huang, Ying @ 2021-06-22  8:41 UTC (permalink / raw)
  To: Coly Li
  Cc: Dan Williams, Jan Kara, Hannes Reinecke, Christoph Hellwig,
	linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, axboe

Coly Li <colyli@suse.de> writes:

> Hi all my dear receivers (Christoph, Dan, Hannes, Jan and Ying),
>
> I do need your help on code review for the following patch. This
> series are posted and refined for 2 merge windows already but we
> are lack of code review from more experienced kernel developers
> like you all.
>
> The following patch defines a set of on-NVDIMM memory objects,
> which are used to support NVDIMM for bcache journalling. Currently
> the testing hardware is Intel AEP (Apache Pass).
>
> Qiangwei Ren and Jianpeng Ma work with me to compose a mini pages
> allocator for NVDIMM pages, then we allocate non-volatiled memory
> pages from NVDIMM to store bcache journal set. Then the jouranling
> can be very fast and after system reboots, once the NVDIMM mapping
> is done, bcache code can directly reference journal set memory
> object without loading them via block layer interface.
>
> In order to restore allocated non-volatile memory, we use a set of
> list (named owner list) to trace all allocated non-volatile memory
> pages identified by UUID. Just like the bcache journal set, the list
> stored in NVDIMM and accessed directly as typical in-memory list,
> the only difference is they are non-volatiled: we access the lists
> directly from NVDIMM, update the list in-place.
>
> This is why you can see pointers are defined in struct
> bch_nvm_pgalloc_recs, because such object is reference directly as
> memory object, and stored directly onto NVDIMM.
>
> Current patch series works as expected with limited data-set on
> both bcache-tools and patched kernel. Because the bcache btree nodes
> are not stored onto NVDIMM yet, journaling for leaf node splitting
> will be handled in later series.
>
> The whole work of supporting NVDIMM for bcache will involve in,
> - Storing bcache journal on NVDIMM
> - Store bcache btree nodes on NVDIMM
> - Store cached data on NVDIMM.
> - On-NVDIMM objects consistency for power failure
>
> In order to make the code review to be more easier, the first step
> we submit storing journal on NVDIMM into upstream firstly, following
> work will be submitted step by step.
>
> Jens wants more widely review before taking this series into bcache
> upstream, and you are all the experts I trust and have my respect.
>
> I do ask for help of code review from you all. Especially for the
> following particular data structure definition patch, because I
> define pointers in memory structures and reference and store them on
> the NVDIMM.
>
> Thanks in advance for your help.
>
> Coly Li
>
>
>
>
> On 6/15/21 1:49 PM, Coly Li wrote:
>> This patch initializes the prototype data structures for nvm pages
>> allocator,
>>
>> - struct bch_nvm_pages_sb
>> This is the super block allocated on each nvdimm namespace. A nvdimm
>> set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used
>> to mark which nvdimm set this name space belongs to. Normally we will
>> use the bcache's cache set UUID to initialize this uuid, to connect this
>> nvdimm set to a specified bcache cache set.
>>
>> - struct bch_owner_list_head
>> This is a table for all heads of all owner lists. A owner list records
>> which page(s) allocated to which owner. After reboot from power failure,
>> the ownwer may find all its requested and allocated pages from the owner
>> list by a handler which is converted by a UUID.
>>
>> - struct bch_nvm_pages_owner_head
>> This is a head of an owner list. Each owner only has one owner list,
>> and a nvm page only belongs to an specific owner. uuid[] will be set to
>> owner's uuid, for bcache it is the bcache's cache set uuid. label is not
>> mandatory, it is a human-readable string for debug purpose. The pointer
>> *recs references to separated nvm page which hold the table of struct
>> bch_nvm_pgalloc_rec.
>>
>> - struct bch_nvm_pgalloc_recs
>> This struct occupies a whole page, owner_uuid should match the uuid
>> in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
>> allocated records.
>>
>> - struct bch_nvm_pgalloc_rec
>> Each structure records a range of allocated nvm pages.
>>   - Bits  0 - 51: is pages offset of the allocated pages.
>>   - Bits 52 - 57: allocaed size in page_size * order-of-2
>>   - Bits 58 - 63: reserved.
>> Since each of the allocated nvm pages are power of 2, using 6 bits to
>> represent allocated size can have (1<<(1<<64) - 1) * PAGE_SIZE maximum
>> value. It can be a 76 bits width range size in byte for 4KB page size,
>> which is large enough currently.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
>> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
>> ---
>>  include/uapi/linux/bcache-nvm.h | 200 ++++++++++++++++++++++++++++++++
>>  1 file changed, 200 insertions(+)
>>  create mode 100644 include/uapi/linux/bcache-nvm.h
>>
>> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
>> new file mode 100644
>> index 000000000000..5094a6797679
>> --- /dev/null
>> +++ b/include/uapi/linux/bcache-nvm.h
>> @@ -0,0 +1,200 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +
>> +#ifndef _UAPI_BCACHE_NVM_H
>> +#define _UAPI_BCACHE_NVM_H
>> +
>> +#if (__BITS_PER_LONG == 64)
>> +/*
>> + * Bcache on NVDIMM data structures
>> + */
>> +
>> +/*
>> + * - struct bch_nvm_pages_sb
>> + *   This is the super block allocated on each nvdimm namespace. A nvdimm
>> + * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
>> + * which nvdimm set this name space belongs to. Normally we will use the
>> + * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
>> + * set to a specified bcache cache set.
>> + *
>> + * - struct bch_owner_list_head
>> + *   This is a table for all heads of all owner lists. A owner list records
>> + * which page(s) allocated to which owner. After reboot from power failure,
>> + * the ownwer may find all its requested and allocated pages from the owner
>> + * list by a handler which is converted by a UUID.
>> + *
>> + * - struct bch_nvm_pages_owner_head
>> + *   This is a head of an owner list. Each owner only has one owner list,
>> + * and a nvm page only belongs to an specific owner. uuid[] will be set to
>> + * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
>> + * mandatory, it is a human-readable string for debug purpose. The pointer
>> + * recs references to separated nvm page which hold the table of struct
>> + * bch_pgalloc_rec.
>> + *
>> + *- struct bch_nvm_pgalloc_recs
>> + *  This structure occupies a whole page, owner_uuid should match the uuid
>> + * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
>> + * allocated records.
>> + *
>> + * - struct bch_pgalloc_rec
>> + *   Each structure records a range of allocated nvm pages. pgoff is offset
>> + * in unit of page size of this allocated nvm page range. The adjoint page
>> + * ranges of same owner can be merged into a larger one, therefore pages_nr
>> + * is NOT always power of 2.
>> + *
>> + *
>> + * Memory layout on nvdimm namespace 0
>> + *
>> + *    0 +---------------------------------+
>> + *      |                                 |
>> + *  4KB +---------------------------------+
>> + *      |         bch_nvm_pages_sb        |
>> + *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
>> + *      |       bch_owner_list_head       |
>> + *      |                                 |
>> + * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
>> + *      |       bch_nvm_pgalloc_recs      |
>> + *      |  (nvm pages internal usage)     |
>> + * 24KB +---------------------------------+
>> + *      |                                 |
>> + *      |                                 |
>> + * 16MB  +---------------------------------+
>> + *      |      allocable nvm pages        |
>> + *      |      for buddy allocator        |
>> + * end  +---------------------------------+
>> + *
>> + *
>> + *
>> + * Memory layout on nvdimm namespace N
>> + * (doesn't have owner list)
>> + *
>> + *    0 +---------------------------------+
>> + *      |                                 |
>> + *  4KB +---------------------------------+
>> + *      |         bch_nvm_pages_sb        |
>> + *  8KB +---------------------------------+
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + * 16MB  +---------------------------------+
>> + *      |      allocable nvm pages        |
>> + *      |      for buddy allocator        |
>> + * end  +---------------------------------+
>> + *
>> + */
>> +
>> +#include <linux/types.h>
>> +
>> +/* In sectors */
>> +#define BCH_NVM_PAGES_SB_OFFSET			4096
>> +#define BCH_NVM_PAGES_OFFSET			(16 << 20)
>> +
>> +#define BCH_NVM_PAGES_LABEL_SIZE		32
>> +#define BCH_NVM_PAGES_NAMESPACES_MAX		8
>> +
>> +#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
>> +#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
>> +
>> +#define BCH_NVM_PAGES_SB_VERSION		0
>> +#define BCH_NVM_PAGES_SB_VERSION_MAX		0
>> +
>> +static const unsigned char bch_nvm_pages_magic[] = {
>> +	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
>> +	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
>> +static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
>> +	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
>> +	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
>> +
>> +/* takes 64bit width */
>> +struct bch_pgalloc_rec {
>> +	__u64	pgoff:52;
>> +	__u64	order:6;
>> +	__u64	reserved:6;
>> +};
>> +
>> +struct bch_nvm_pgalloc_recs {
>> +union {
>> +	struct {
>> +		struct bch_nvm_pages_owner_head	*owner;
>> +		struct bch_nvm_pgalloc_recs	*next;

I have concerns about using pointers directly in the on-NVDIMM data
structures too.  How can you guarantee the NVDIMM devices will be mapped
to the exact same virtual address across reboots?

Best Regards,
Huang, Ying
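
For illustration only (not from the posted series): one way around this
would be to store page offsets relative to the namespace start on media,
and convert them to virtual addresses only after the DAX mapping is set
up, e.g. with a hypothetical helper like:

	/* sketch: on-media reference kept as a page offset, not a pointer */
	static inline struct bch_nvm_pgalloc_recs *
	bch_recs_from_pgoff(struct bch_nvm_namespace *ns, __u64 recs_pgoff)
	{
		/* ns->kaddr comes from dax_direct_access() at attach time */
		return (struct bch_nvm_pgalloc_recs *)(ns->kaddr +
						(recs_pgoff << PAGE_SHIFT));
	}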

>> +		unsigned char			magic[16];
>> +		unsigned char			owner_uuid[16];
>> +		unsigned int			size;
>> +		unsigned int			used;
>> +		unsigned long			_pad[4];
>> +		struct bch_pgalloc_rec		recs[];
>> +	};
>> +	unsigned char				pad[8192];
>> +};
>> +};
>> +
>> +#define BCH_MAX_RECS					\
>> +	((sizeof(struct bch_nvm_pgalloc_recs) -		\
>> +	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
>> +	 sizeof(struct bch_pgalloc_rec))
>> +
>> +struct bch_nvm_pages_owner_head {
>> +	unsigned char			uuid[16];
>> +	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
>> +	/* Per-namespace own lists */
>> +	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
>> +};
>> +
>> +/* heads[0] is always for nvm_pages internal usage */
>> +struct bch_owner_list_head {
>> +union {
>> +	struct {
>> +		unsigned int			size;
>> +		unsigned int			used;
>> +		unsigned long			_pad[4];
>> +		struct bch_nvm_pages_owner_head	heads[];
>> +	};
>> +	unsigned char				pad[8192];
>> +};
>> +};
>> +#define BCH_MAX_OWNER_LIST				\
>> +	((sizeof(struct bch_owner_list_head) -		\
>> +	 offsetof(struct bch_owner_list_head, heads)) /	\
>> +	 sizeof(struct bch_nvm_pages_owner_head))
>> +
>> +/* The on-media bit order is local CPU order */
>> +struct bch_nvm_pages_sb {
>> +	unsigned long				csum;
>> +	unsigned long				ns_start;
>> +	unsigned long				sb_offset;
>> +	unsigned long				version;
>> +	unsigned char				magic[16];
>> +	unsigned char				uuid[16];
>> +	unsigned int				page_size;
>> +	unsigned int				total_namespaces_nr;
>> +	unsigned int				this_namespace_nr;
>> +	union {
>> +		unsigned char			set_uuid[16];
>> +		unsigned long			set_magic;
>> +	};
>> +
>> +	unsigned long				flags;
>> +	unsigned long				seq;
>> +
>> +	unsigned long				feature_compat;
>> +	unsigned long				feature_incompat;
>> +	unsigned long				feature_ro_compat;
>> +
>> +	/* For allocable nvm pages from buddy systems */
>> +	unsigned long				pages_offset;
>> +	unsigned long				pages_total;
>> +
>> +	unsigned long				pad[8];
>> +
>> +	/* Only on the first name space */
>> +	struct bch_owner_list_head		*owner_list_head;
>> +
>> +	/* Just for csum_set() */
>> +	unsigned int				keys;
>> +	unsigned long				d[0];
>> +};
>> +#endif /* __BITS_PER_LONG == 64 */
>> +
>> +#endif /* _UAPI_BCACHE_NVM_H */

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 01/14] bcache: fix error info in register_bcache()
  2021-06-15  5:49 ` [PATCH 01/14] bcache: fix error info in register_bcache() Coly Li
@ 2021-06-22  9:47   ` Hannes Reinecke
  0 siblings, 0 replies; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22  9:47 UTC (permalink / raw)
  To: Coly Li, axboe; +Cc: linux-bcache, linux-block, Chao Yu

On 6/15/21 7:49 AM, Coly Li wrote:
> From: Chao Yu <yuchao0@huawei.com>
> 
> In register_bcache(), there are several cases we didn't set
> correct error info (return value and/or error message):
> - if kzalloc() fails, it needs to return ENOMEM and print
> "cannot allocate memory";
> - if register_cache() fails, it's better to propagate its
> return value rather than using default EINVAL.
> 
> Signed-off-by: Chao Yu <yuchao0@huawei.com>
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/super.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index bea8c4429ae8..0a20ccf5a1db 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -2620,8 +2620,11 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  	if (SB_IS_BDEV(sb)) {
>  		struct cached_dev *dc = kzalloc(sizeof(*dc), GFP_KERNEL);
>  
> -		if (!dc)
> +		if (!dc) {
> +			ret = -ENOMEM;
> +			err = "cannot allocate memory";
>  			goto out_put_sb_page;
> +		}
>  
>  		mutex_lock(&bch_register_lock);
>  		ret = register_bdev(sb, sb_disk, bdev, dc);
> @@ -2632,11 +2635,15 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  	} else {
>  		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
>  
> -		if (!ca)
> +		if (!ca) {
> +			ret = -ENOMEM;
> +			err = "cannot allocate memory";
>  			goto out_put_sb_page;
> +		}
>  
>  		/* blkdev_put() will be called in bch_cache_release() */
> -		if (register_cache(sb, sb_disk, bdev, ca) != 0)
> +		ret = register_cache(sb, sb_disk, bdev, ca);
> +		if (ret)
>  			goto out_free_sb;
>  	}
>  
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 02/14] md: bcache: Fix spelling of 'acquire'
  2021-06-15  5:49 ` [PATCH 02/14] md: bcache: Fix spelling of 'acquire' Coly Li
@ 2021-06-22 10:03   ` Hannes Reinecke
  0 siblings, 0 replies; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22 10:03 UTC (permalink / raw)
  To: Coly Li, axboe; +Cc: linux-bcache, linux-block, Ding Senjie

On 6/15/21 7:49 AM, Coly Li wrote:
> From: Ding Senjie <dingsenjie@yulong.com>
> 
> acqurie -> acquire
> 
> Signed-off-by: Ding Senjie <dingsenjie@yulong.com>
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/super.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 0a20ccf5a1db..2f1ee4fbf4d5 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -2760,7 +2760,7 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
>  		 * The reason bch_register_lock is not held to call
>  		 * bch_cache_set_stop() and bcache_device_stop() is to
>  		 * avoid potential deadlock during reboot, because cache
> -		 * set or bcache device stopping process will acqurie
> +		 * set or bcache device stopping process will acquire
>  		 * bch_register_lock too.
>  		 *
>  		 * We are safe here because bcache_is_reboot sets to
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 03/14] bcache: add initial data structures for nvm pages
  2021-06-15  5:49 ` [PATCH 03/14] bcache: add initial data structures for nvm pages Coly Li
  2021-06-21 16:17   ` Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages) Coly Li
@ 2021-06-22 10:19   ` Hannes Reinecke
  2021-06-23  7:09     ` Coly Li
  1 sibling, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22 10:19 UTC (permalink / raw)
  To: Coly Li, axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/15/21 7:49 AM, Coly Li wrote:
> This patch initializes the prototype data structures for the nvm pages
> allocator:
> 
> - struct bch_nvm_pages_sb
> This is the super block allocated on each nvdimm namespace. A nvdimm
> set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used
> to mark which nvdimm set this name space belongs to. Normally we will
> use the bcache's cache set UUID to initialize this uuid, to connect this
> nvdimm set to a specified bcache cache set.
> 
> - struct bch_owner_list_head
> This is a table for all heads of all owner lists. A owner list records
> which page(s) allocated to which owner. After reboot from power failure,
> the ownwer may find all its requested and allocated pages from the owner

owner

> list by a handler which is converted by a UUID.
> 
> - struct bch_nvm_pages_owner_head
> This is a head of an owner list. Each owner only has one owner list,
> and an nvm page only belongs to a specific owner. uuid[] will be set to
> owner's uuid; for bcache it is the bcache's cache set uuid. label is not
> mandatory, it is a human-readable string for debug purposes. The pointer
> *recs references a separate nvm page which holds the table of struct
> bch_nvm_pgalloc_rec.
> 
> - struct bch_nvm_pgalloc_recs
> This struct occupies a whole page; owner_uuid should match the uuid
> in struct bch_nvm_pages_owner_head. recs[] is the real table that contains
> all allocated records.
> 
> - struct bch_nvm_pgalloc_rec
> Each structure records a range of allocated nvm pages.
>   - Bits  0 - 51: page offset of the allocated pages.
>   - Bits 52 - 57: allocated size as an order, i.e. page_size * 2^order.
>   - Bits 58 - 63: reserved.
> Since each allocated range of nvm pages is a power-of-2 number of pages,
> using 6 bits to represent the allocation order gives a maximum of
> (1 << ((1 << 6) - 1)) * PAGE_SIZE bytes. That is a 76-bit wide range size
> in bytes for a 4KB page size, which is large enough currently.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
> ---
>  include/uapi/linux/bcache-nvm.h | 200 ++++++++++++++++++++++++++++++++
>  1 file changed, 200 insertions(+)
>  create mode 100644 include/uapi/linux/bcache-nvm.h
> 
> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
> new file mode 100644
> index 000000000000..5094a6797679
> --- /dev/null
> +++ b/include/uapi/linux/bcache-nvm.h
> @@ -0,0 +1,200 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +
> +#ifndef _UAPI_BCACHE_NVM_H
> +#define _UAPI_BCACHE_NVM_H
> +
> +#if (__BITS_PER_LONG == 64)
> +/*
> + * Bcache on NVDIMM data structures
> + */
> +
> +/*
> + * - struct bch_nvm_pages_sb
> + *   This is the super block allocated on each nvdimm namespace. A nvdimm
> + * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
> + * which nvdimm set this name space belongs to. Normally we will use the
> + * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
> + * set to a specified bcache cache set.
> + *
> + * - struct bch_owner_list_head
> + *   This is a table for all heads of all owner lists. A owner list records
> + * which page(s) allocated to which owner. After reboot from power failure,
> + * the ownwer may find all its requested and allocated pages from the owner
> + * list by a handler which is converted by a UUID.
> + *
> + * - struct bch_nvm_pages_owner_head
> + *   This is a head of an owner list. Each owner only has one owner list,
> + * and a nvm page only belongs to an specific owner. uuid[] will be set to
> + * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
> + * mandatory, it is a human-readable string for debug purpose. The pointer
> + * recs references to separated nvm page which hold the table of struct
> + * bch_pgalloc_rec.
> + *
> + *- struct bch_nvm_pgalloc_recs
> + *  This structure occupies a whole page, owner_uuid should match the uuid
> + * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
> + * allocated records.
> + *
> + * - struct bch_pgalloc_rec
> + *   Each structure records a range of allocated nvm pages. pgoff is offset
> + * in unit of page size of this allocated nvm page range. The adjoint page
> + * ranges of same owner can be merged into a larger one, therefore pages_nr
> + * is NOT always power of 2.
> + *
> + *
> + * Memory layout on nvdimm namespace 0
> + *
> + *    0 +---------------------------------+
> + *      |                                 |
> + *  4KB +---------------------------------+
> + *      |         bch_nvm_pages_sb        |
> + *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
> + *      |       bch_owner_list_head       |
> + *      |                                 |
> + * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
> + *      |       bch_nvm_pgalloc_recs      |
> + *      |  (nvm pages internal usage)     |
> + * 24KB +---------------------------------+
> + *      |                                 |
> + *      |                                 |
> + * 16MB  +---------------------------------+
> + *      |      allocable nvm pages        |
> + *      |      for buddy allocator        |
> + * end  +---------------------------------+
> + *
> + *
> + *
> + * Memory layout on nvdimm namespace N
> + * (doesn't have owner list)
> + *
> + *    0 +---------------------------------+
> + *      |                                 |
> + *  4KB +---------------------------------+
> + *      |         bch_nvm_pages_sb        |
> + *  8KB +---------------------------------+
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + * 16MB  +---------------------------------+
> + *      |      allocable nvm pages        |
> + *      |      for buddy allocator        |
> + * end  +---------------------------------+
> + *
> + */
> +
> +#include <linux/types.h>
> +
> +/* In sectors */
> +#define BCH_NVM_PAGES_SB_OFFSET			4096
> +#define BCH_NVM_PAGES_OFFSET			(16 << 20)
> +
> +#define BCH_NVM_PAGES_LABEL_SIZE		32
> +#define BCH_NVM_PAGES_NAMESPACES_MAX		8
> +
> +#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
> +#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
> +
> +#define BCH_NVM_PAGES_SB_VERSION		0
> +#define BCH_NVM_PAGES_SB_VERSION_MAX		0
> +
> +static const unsigned char bch_nvm_pages_magic[] = {
> +	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
> +	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
> +static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
> +	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
> +	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
> +
> +/* takes 64bit width */
> +struct bch_pgalloc_rec {
> +	__u64	pgoff:52;
> +	__u64	order:6;
> +	__u64	reserved:6;
> +};
> +
> +struct bch_nvm_pgalloc_recs {
> +union {

Indentation.

> +	struct {
> +		struct bch_nvm_pages_owner_head	*owner;
> +		struct bch_nvm_pgalloc_recs	*next;
> +		unsigned char			magic[16];
> +		unsigned char			owner_uuid[16];
> +		unsigned int			size;
> +		unsigned int			used;
> +		unsigned long			_pad[4];
> +		struct bch_pgalloc_rec		recs[];
> +	};
> +	unsigned char				pad[8192];
> +};
> +};
> +

Consider using __u64 and friends when specifying a structure with a
fixed alignment; that also removes the need for the __BITS_PER_LONG ifdef
at the top.
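
For instance, the superblock fields could be declared with fixed-width
types, roughly like this (a sketch of the suggestion, not the final layout):

	struct bch_nvm_pages_sb {
		__u64	csum;
		__u64	ns_start;
		__u64	sb_offset;
		__u64	version;
		__u8	magic[16];
		__u8	uuid[16];
		__u32	page_size;
		__u32	total_namespaces_nr;
		__u32	this_namespace_nr;
		/* ... remaining fields as __u64/__u32 as well ... */
	};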

> +#define BCH_MAX_RECS					\
> +	((sizeof(struct bch_nvm_pgalloc_recs) -		\
> +	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
> +	 sizeof(struct bch_pgalloc_rec))
> +

What _are_ you doing here?
You're not seriously using the 'pad' field as a placeholder to size the
structure accordingly?
Also, what is the size of the 'bch_nvm_pgalloc_recs' structure?
8k + header size?
That is very awkward, as the page allocator won't be able to handle it
efficiently.
Please size it to either 8k or 16k overall.
And if you do that you can simplify this define.
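
E.g. a sketch of tying the size to a named constant so the define falls
out of it (the size macro and the elided header fields are illustrative
only):

	#define BCH_NVM_PGALLOC_RECS_SIZE	8192	/* one 8 KiB on-media block */

	struct bch_nvm_pgalloc_recs {
		union {
			struct {
				/* header fields as posted ... */
				__u32			size;
				__u32			used;
				struct bch_pgalloc_rec	recs[];
			};
			__u8	pad[BCH_NVM_PGALLOC_RECS_SIZE];
		};
	};

	#define BCH_MAX_RECS						\
		((BCH_NVM_PGALLOC_RECS_SIZE -				\
		  offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
		 sizeof(struct bch_pgalloc_rec))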

> +struct bch_nvm_pages_owner_head {
> +	unsigned char			uuid[16];
> +	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
> +	/* Per-namespace own lists */
> +	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
> +};
> +
> +/* heads[0] is always for nvm_pages internal usage */
> +struct bch_owner_list_head {
> +union {
> +	struct {
> +		unsigned int			size;
> +		unsigned int			used;
> +		unsigned long			_pad[4];
> +		struct bch_nvm_pages_owner_head	heads[];
> +	};
> +	unsigned char				pad[8192];
> +};
> +};
> +#define BCH_MAX_OWNER_LIST				\
> +	((sizeof(struct bch_owner_list_head) -		\
> +	 offsetof(struct bch_owner_list_head, heads)) /	\
> +	 sizeof(struct bch_nvm_pages_owner_head))
> +

Same here.
Please size it that the 'bch_owner_list_head' structure fits into either
8k or 16k.

> +/* The on-media bit order is local CPU order */
> +struct bch_nvm_pages_sb {
> +	unsigned long				csum;
> +	unsigned long				ns_start;
> +	unsigned long				sb_offset;
> +	unsigned long				version;
> +	unsigned char				magic[16];
> +	unsigned char				uuid[16];
> +	unsigned int				page_size;
> +	unsigned int				total_namespaces_nr;
> +	unsigned int				this_namespace_nr;
> +	union {
> +		unsigned char			set_uuid[16];
> +		unsigned long			set_magic;
> +	};
> +
> +	unsigned long				flags;
> +	unsigned long				seq;
> +
> +	unsigned long				feature_compat;
> +	unsigned long				feature_incompat;
> +	unsigned long				feature_ro_compat;
> +
> +	/* For allocable nvm pages from buddy systems */
> +	unsigned long				pages_offset;
> +	unsigned long				pages_total;
> +
> +	unsigned long				pad[8];
> +
> +	/* Only on the first name space */
> +	struct bch_owner_list_head		*owner_list_head;
> +
> +	/* Just for csum_set() */
> +	unsigned int				keys;
> +	unsigned long				d[0];
> +};

And also here, use __u64 and friends.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/14] bcache: initialize the nvm pages allocator
  2021-06-15  5:49 ` [PATCH 04/14] bcache: initialize the nvm pages allocator Coly Li
@ 2021-06-22 10:39   ` Hannes Reinecke
  2021-06-23  5:26     ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22 10:39 UTC (permalink / raw)
  To: Coly Li, axboe
  Cc: linux-bcache, linux-block, Jianpeng Ma, Randy Dunlap, Qiaowei Ren

On 6/15/21 7:49 AM, Coly Li wrote:
> From: Jianpeng Ma <jianpeng.ma@intel.com>
> 
> This patch defines the prototype data structures in memory and
> initializes the nvm pages allocator.
> 
> The nvm address space which is managed by this allocator can consist of
> many nvm namespaces, and several namespaces can be composed into one nvm
> set, like a cache set. For this initial implementation, only one set is
> supported.
> 
> The users of this nvm pages allocator need to call register_namespace()
> to register the nvdimm device (like /dev/pmemX) into this allocator as
> an instance of struct nvm_namespace.
> 
> Reported-by: Randy Dunlap <rdunlap@infradead.org>
> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/Kconfig     |  10 ++
>  drivers/md/bcache/Makefile    |   1 +
>  drivers/md/bcache/nvm-pages.c | 295 ++++++++++++++++++++++++++++++++++
>  drivers/md/bcache/nvm-pages.h |  74 +++++++++
>  drivers/md/bcache/super.c     |   3 +
>  5 files changed, 383 insertions(+)
>  create mode 100644 drivers/md/bcache/nvm-pages.c
>  create mode 100644 drivers/md/bcache/nvm-pages.h
> 
> diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
> index d1ca4d059c20..a69f6c0e0507 100644
> --- a/drivers/md/bcache/Kconfig
> +++ b/drivers/md/bcache/Kconfig
> @@ -35,3 +35,13 @@ config BCACHE_ASYNC_REGISTRATION
>  	device path into this file will returns immediately and the real
>  	registration work is handled in kernel work queue in asynchronous
>  	way.
> +
> +config BCACHE_NVM_PAGES
> +	bool "NVDIMM support for bcache (EXPERIMENTAL)"
> +	depends on BCACHE
> +	depends on 64BIT
> +	depends on LIBNVDIMM
> +	depends on DAX
> +	help
> +	  Allocate/release NV-memory pages for bcache and provide allocated pages
> +	  for each requestor after system reboot.
> diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
> index 5b87e59676b8..2397bb7c7ffd 100644
> --- a/drivers/md/bcache/Makefile
> +++ b/drivers/md/bcache/Makefile
> @@ -5,3 +5,4 @@ obj-$(CONFIG_BCACHE)	+= bcache.o
>  bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
>  	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
>  	util.o writeback.o features.o
> +bcache-$(CONFIG_BCACHE_NVM_PAGES) += nvm-pages.o
> diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
> new file mode 100644
> index 000000000000..18fdadbc502f
> --- /dev/null
> +++ b/drivers/md/bcache/nvm-pages.c
> @@ -0,0 +1,295 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Nvdimm page-buddy allocator
> + *
> + * Copyright (c) 2021, Intel Corporation.
> + * Copyright (c) 2021, Qiaowei Ren <qiaowei.ren@intel.com>.
> + * Copyright (c) 2021, Jianpeng Ma <jianpeng.ma@intel.com>.
> + */
> +
> +#if defined(CONFIG_BCACHE_NVM_PAGES)
> +

No need for this 'if' statement as it'll be excluded by the Makefile
anyway if the config option isn't set.

> +#include "bcache.h"
> +#include "nvm-pages.h"
> +
> +#include <linux/slab.h>
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +#include <linux/dax.h>
> +#include <linux/pfn_t.h>
> +#include <linux/libnvdimm.h>
> +#include <linux/mm_types.h>
> +#include <linux/err.h>
> +#include <linux/pagemap.h>
> +#include <linux/bitmap.h>
> +#include <linux/blkdev.h>
> +
> +struct bch_nvm_set *only_set;
> +
> +static void release_nvm_namespaces(struct bch_nvm_set *nvm_set)
> +{
> +	int i;
> +	struct bch_nvm_namespace *ns;
> +
> +	for (i = 0; i < nvm_set->total_namespaces_nr; i++) {
> +		ns = nvm_set->nss[i];
> +		if (ns) {
> +			blkdev_put(ns->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
> +			kfree(ns);
> +		}
> +	}
> +
> +	kfree(nvm_set->nss);
> +}
> +
> +static void release_nvm_set(struct bch_nvm_set *nvm_set)
> +{
> +	release_nvm_namespaces(nvm_set);
> +	kfree(nvm_set);
> +}
> +
> +static int init_owner_info(struct bch_nvm_namespace *ns)
> +{
> +	struct bch_owner_list_head *owner_list_head = ns->sb->owner_list_head;
> +
> +	mutex_lock(&only_set->lock);
> +	only_set->owner_list_head = owner_list_head;
> +	only_set->owner_list_size = owner_list_head->size;
> +	only_set->owner_list_used = owner_list_head->used;
> +	mutex_unlock(&only_set->lock);
> +
> +	return 0;
> +}
> +
> +static int attach_nvm_set(struct bch_nvm_namespace *ns)
> +{
> +	int rc = 0;
> +
> +	mutex_lock(&only_set->lock);
> +	if (only_set->nss) {
> +		if (memcmp(ns->sb->set_uuid, only_set->set_uuid, 16)) {
> +			pr_info("namespace id doesn't match nvm set\n");
> +			rc = -EINVAL;
> +			goto unlock;
> +		}
> +
> +		if (only_set->nss[ns->sb->this_namespace_nr]) {

Doesn't this need to be checked against 'total_namespaces_nr' to avoid
overflow?
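
Presumably something along these lines (sketch):

		if (ns->sb->this_namespace_nr >= only_set->total_namespaces_nr) {
			pr_info("namespace nr %u out of range\n",
				ns->sb->this_namespace_nr);
			rc = -EINVAL;
			goto unlock;
		}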

> +			pr_info("already has the same position(%d) nvm\n",
> +					ns->sb->this_namespace_nr);
> +			rc = -EEXIST;
> +			goto unlock;
> +		}
> +	} else {
> +		memcpy(only_set->set_uuid, ns->sb->set_uuid, 16);
> +		only_set->total_namespaces_nr = ns->sb->total_namespaces_nr;
> +		only_set->nss = kcalloc(only_set->total_namespaces_nr,
> +				sizeof(struct bch_nvm_namespace *), GFP_KERNEL);
> +		if (!only_set->nss) {

When you get here, 'set_uuid' and 'total_namespaces_nr' have already been
modified, which might cause errors later on if the kcalloc() fails.
Please move these two lines _after_ the kcalloc() to avoid this.
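
I.e. roughly (sketch):

	} else {
		only_set->nss = kcalloc(ns->sb->total_namespaces_nr,
					sizeof(struct bch_nvm_namespace *),
					GFP_KERNEL);
		if (!only_set->nss) {
			rc = -ENOMEM;
			goto unlock;
		}
		memcpy(only_set->set_uuid, ns->sb->set_uuid, 16);
		only_set->total_namespaces_nr = ns->sb->total_namespaces_nr;
	}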

> +			rc = -ENOMEM;
> +			goto unlock;
> +		}
> +	}
> +
> +	only_set->nss[ns->sb->this_namespace_nr] = ns;
> +
> +	/* Firstly attach */

Initial attach?

> +	if ((unsigned long)ns->sb->owner_list_head == BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET) {
> +		struct bch_nvm_pages_owner_head *sys_owner_head;
> +		struct bch_nvm_pgalloc_recs *sys_pgalloc_recs;
> +
> +		ns->sb->owner_list_head = ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET;
> +		sys_pgalloc_recs = ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET;
> +
> +		sys_owner_head = &(ns->sb->owner_list_head->heads[0]);
> +		sys_owner_head->recs[0] = sys_pgalloc_recs;
> +		ns->sb->csum = csum_set(ns->sb);
> +
Hmm. You are trying to pick up the 'list_head' structure from NVM, right?

Don't you need to validate that structure (e.g. by checking the checksum)
before using it, to ensure that the contents are valid?
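
For example, a basic sanity check along these lines might do (a sketch,
not from the posted patch; the exact checks are an assumption):

	/* sketch: sanity-check the on-NVDIMM owner list head before use */
	static bool bch_owner_list_head_valid(struct bch_nvm_namespace *ns)
	{
		struct bch_owner_list_head *head =
			ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET;

		return head->used <= head->size &&
		       head->size <= BCH_MAX_OWNER_LIST;
	}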

> +		sys_pgalloc_recs->owner = sys_owner_head;
> +	} else
> +		BUG_ON(ns->sb->owner_list_head !=
> +			(ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET));
> +
> +unlock:
> +	mutex_unlock(&only_set->lock);
> +	return rc;
> +}
> +
> +static int read_nvdimm_meta_super(struct block_device *bdev,
> +			      struct bch_nvm_namespace *ns)
> +{
> +	struct page *page;
> +	struct bch_nvm_pages_sb *sb;
> +	int r = 0;
> +	uint64_t expected_csum = 0;
> +
> +	page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
> +			BCH_NVM_PAGES_SB_OFFSET >> PAGE_SHIFT, GFP_KERNEL);
> +
> +	if (IS_ERR(page))
> +		return -EIO;
> +
> +	sb = (struct bch_nvm_pages_sb *)(page_address(page) +
> +					offset_in_page(BCH_NVM_PAGES_SB_OFFSET));
> +	r = -EINVAL;
> +	expected_csum = csum_set(sb);
> +	if (expected_csum != sb->csum) {
> +		pr_info("csum is not match with expected one\n");
> +		goto put_page;
> +	}
> +
> +	if (memcmp(sb->magic, bch_nvm_pages_magic, 16)) {
> +		pr_info("invalid bch_nvm_pages_magic\n");
> +		goto put_page;
> +	}
> +
> +	if (sb->total_namespaces_nr != 1) {
> +		pr_info("currently only support one nvm device\n");
> +		goto put_page;
> +	}
> +
> +	if (sb->sb_offset != BCH_NVM_PAGES_SB_OFFSET) {
> +		pr_info("invalid superblock offset\n");
> +		goto put_page;
> +	}
> +
> +	r = 0;
> +	/* temporary use for DAX API */
> +	ns->page_size = sb->page_size;
> +	ns->pages_total = sb->pages_total;
> +
> +put_page:
> +	put_page(page);
> +	return r;
> +}
> +
> +struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
> +{
> +	struct bch_nvm_namespace *ns;
> +	int err;
> +	pgoff_t pgoff;
> +	char buf[BDEVNAME_SIZE];
> +	struct block_device *bdev;
> +	int id;
> +	char *path = NULL;
> +
> +	path = kstrndup(dev_path, 512, GFP_KERNEL);
> +	if (!path) {
> +		pr_err("kstrndup failed\n");
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	bdev = blkdev_get_by_path(strim(path),
> +				  FMODE_READ|FMODE_WRITE|FMODE_EXEC,
> +				  only_set);
> +	if (IS_ERR(bdev)) {
> +		pr_info("get %s error: %ld\n", dev_path, PTR_ERR(bdev));
> +		kfree(path);
> +		return ERR_PTR(PTR_ERR(bdev));
> +	}
> +
> +	err = -ENOMEM;
> +	ns = kzalloc(sizeof(struct bch_nvm_namespace), GFP_KERNEL);
> +	if (!ns)
> +		goto bdput;
> +
> +	err = -EIO;
> +	if (read_nvdimm_meta_super(bdev, ns)) {
> +		pr_info("%s read nvdimm meta super block failed.\n",
> +			bdevname(bdev, buf));
> +		goto free_ns;
> +	}
> +
> +	err = -EOPNOTSUPP;
> +	if (!bdev_dax_supported(bdev, ns->page_size)) {
> +		pr_info("%s don't support DAX\n", bdevname(bdev, buf));
> +		goto free_ns;
> +	}
> +
> +	err = -EINVAL;
> +	if (bdev_dax_pgoff(bdev, 0, ns->page_size, &pgoff)) {
> +		pr_info("invalid offset of %s\n", bdevname(bdev, buf));
> +		goto free_ns;
> +	}
> +
> +	err = -ENOMEM;
> +	ns->dax_dev = fs_dax_get_by_bdev(bdev);
> +	if (!ns->dax_dev) {
> +		pr_info("can't by dax device by %s\n", bdevname(bdev, buf));
> +		goto free_ns;
> +	}
> +
> +	err = -EINVAL;
> +	id = dax_read_lock();
> +	if (dax_direct_access(ns->dax_dev, pgoff, ns->pages_total,
> +			      &ns->kaddr, &ns->start_pfn) <= 0) {
> +		pr_info("dax_direct_access error\n");
> +		dax_read_unlock(id);
> +		goto free_ns;
> +	}
> +	dax_read_unlock(id);
> +
> +	ns->sb = ns->kaddr + BCH_NVM_PAGES_SB_OFFSET;
> +

You already read the superblock in read_nvdimm_meta_super(), right?
Wouldn't it be better to first do the 'dax_direct_access()' call, and
then check the superblock?
That way you'll ensure that dax_direct_access() did the right thing;
with the current code you are using two different methods of accessing
the superblock, which theoretically can result in one method succeeding
and the other not ...
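
I.e. after dax_direct_access() the superblock could be validated entirely
through the DAX mapping, e.g. (sketch, reusing the helpers already used in
the posted code):

	ns->sb = ns->kaddr + BCH_NVM_PAGES_SB_OFFSET;

	err = -EINVAL;
	if (ns->sb->csum != csum_set(ns->sb) ||
	    memcmp(ns->sb->magic, bch_nvm_pages_magic, 16)) {
		pr_info("invalid nvm pages superblock via DAX mapping\n");
		goto free_ns;
	}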

> +	err = -EINVAL;
> +	/* Check magic again to make sure DAX mapping is correct */
> +	if (memcmp(ns->sb->magic, bch_nvm_pages_magic, 16)) {
> +		pr_info("invalid bch_nvm_pages_magic after DAX mapping\n");
> +		goto free_ns;
> +	}
> +
> +	err = attach_nvm_set(ns);
> +	if (err < 0)
> +		goto free_ns;
> +
> +	ns->page_size = ns->sb->page_size;
> +	ns->pages_offset = ns->sb->pages_offset;
> +	ns->pages_total = ns->sb->pages_total;
> +	ns->free = 0;
> +	ns->bdev = bdev;
> +	ns->nvm_set = only_set;
> +	mutex_init(&ns->lock);
> +
> +	if (ns->sb->this_namespace_nr == 0) {
> +		pr_info("only first namespace contain owner info\n");
> +		err = init_owner_info(ns);
> +		if (err < 0) {
> +			pr_info("init_owner_info met error %d\n", err);
> +			only_set->nss[ns->sb->this_namespace_nr] = NULL;
> +			goto free_ns;
> +		}
> +	}
> +
> +	kfree(path);
> +	return ns;
> +free_ns:
> +	kfree(ns);
> +bdput:
> +	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
> +	kfree(path);
> +	return ERR_PTR(err);
> +}
> +EXPORT_SYMBOL_GPL(bch_register_namespace);
> +
> +int __init bch_nvm_init(void)
> +{
> +	only_set = kzalloc(sizeof(*only_set), GFP_KERNEL);
> +	if (!only_set)
> +		return -ENOMEM;
> +
> +	only_set->total_namespaces_nr = 0;
> +	only_set->owner_list_head = NULL;
> +	only_set->nss = NULL;
> +
> +	mutex_init(&only_set->lock);
> +
> +	pr_info("bcache nvm init\n");
> +	return 0;
> +}
> +
> +void bch_nvm_exit(void)
> +{
> +	release_nvm_set(only_set);
> +	pr_info("bcache nvm exit\n");
> +}
> +
> +#endif /* CONFIG_BCACHE_NVM_PAGES */
> diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
> new file mode 100644
> index 000000000000..3e24c4dee7fd
> --- /dev/null
> +++ b/drivers/md/bcache/nvm-pages.h
> @@ -0,0 +1,74 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _BCACHE_NVM_PAGES_H
> +#define _BCACHE_NVM_PAGES_H
> +
> +#if defined(CONFIG_BCACHE_NVM_PAGES)
> +#include <linux/bcache-nvm.h>
> +#endif /* CONFIG_BCACHE_NVM_PAGES */
> +

Hmm? What is that doing here?
Please move it into the source file.

> +/*
> + * Bcache NVDIMM in memory data structures
> + */
> +
> +/*
> + * The following three structures in memory records which page(s) allocated
> + * to which owner. After reboot from power failure, they will be initialized
> + * based on nvm pages superblock in NVDIMM device.
> + */
> +struct bch_nvm_namespace {
> +	struct bch_nvm_pages_sb *sb;
> +	void *kaddr;
> +
> +	u8 uuid[16];
> +	u64 free;
> +	u32 page_size;
> +	u64 pages_offset;
> +	u64 pages_total;
> +	pfn_t start_pfn;
> +
> +	struct dax_device *dax_dev;
> +	struct block_device *bdev;
> +	struct bch_nvm_set *nvm_set;
> +
> +	struct mutex lock;
> +};
> +
> +/*
> + * A set of namespaces. Currently only one set can be supported.
> + */
> +struct bch_nvm_set {
> +	u8 set_uuid[16];
> +	u32 total_namespaces_nr;
> +
> +	u32 owner_list_size;
> +	u32 owner_list_used;
> +	struct bch_owner_list_head *owner_list_head;
> +
> +	struct bch_nvm_namespace **nss;
> +
> +	struct mutex lock;
> +};
> +extern struct bch_nvm_set *only_set;
> +
> +#if defined(CONFIG_BCACHE_NVM_PAGES)
> +
> +struct bch_nvm_namespace *bch_register_namespace(const char *dev_path);
> +int bch_nvm_init(void);
> +void bch_nvm_exit(void);
> +
> +#else
> +
> +static inline struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
> +{
> +	return NULL;
> +}
> +static inline int bch_nvm_init(void)
> +{
> +	return 0;
> +}
> +static inline void bch_nvm_exit(void) { }
> +
> +#endif /* CONFIG_BCACHE_NVM_PAGES */
> +
> +#endif /* _BCACHE_NVM_PAGES_H */
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 2f1ee4fbf4d5..ce22aefb1352 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -14,6 +14,7 @@
>  #include "request.h"
>  #include "writeback.h"
>  #include "features.h"
> +#include "nvm-pages.h"
>  
>  #include <linux/blkdev.h>
>  #include <linux/pagemap.h>
> @@ -2823,6 +2824,7 @@ static void bcache_exit(void)
>  {
>  	bch_debug_exit();
>  	bch_request_exit();
> +	bch_nvm_exit();
>  	if (bcache_kobj)
>  		kobject_put(bcache_kobj);
>  	if (bcache_wq)
> @@ -2921,6 +2923,7 @@ static int __init bcache_init(void)
>  
>  	bch_debug_init();
>  	closure_debug_init();
> +	bch_nvm_init();
>  
>  	bcache_is_reboot = false;
>  
> 
Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/14] bcache: initialization of the buddy
  2021-06-15  5:49 ` [PATCH 05/14] bcache: initialization of the buddy Coly Li
@ 2021-06-22 10:45   ` Hannes Reinecke
  2021-06-23  5:35     ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22 10:45 UTC (permalink / raw)
  To: Coly Li, axboe
  Cc: linux-bcache, linux-block, Jianpeng Ma, kernel test robot,
	Dan Carpenter, Qiaowei Ren

On 6/15/21 7:49 AM, Coly Li wrote:
> From: Jianpeng Ma <jianpeng.ma@intel.com>
> 
> This nvm pages allocator will implement the simple buddy to manage the
> nvm address space. This patch initializes this buddy for new namespace.
> 
Please use 'buddy allocator' instead of just 'buddy'.

> The unit of alloc/free of the buddy is a page. DAX devices have their
> own struct page (in DRAM or PMEM).
> 
>         struct {        /* ZONE_DEVICE pages */
>                 /** @pgmap: Points to the hosting device page map. */
>                 struct dev_pagemap *pgmap;
>                 void *zone_device_data;
>                 /*
>                  * ZONE_DEVICE private pages are counted as being
>                  * mapped so the next 3 words hold the mapping, index,
>                  * and private fields from the source anonymous or
>                  * page cache page while the page is migrated to device
>                  * private memory.
>                  * ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
>                  * use the mapping, index, and private fields when
>                  * pmem backed DAX files are mapped.
>                  */
>         };
> 
> ZONE_DEVICE pages only use pgmap; the other 4 words [16/32 bytes] are unused.
> So the second/third words are used as a 'struct list_head' to link the page
> into the buddy free lists. The fourth word (normally struct page::index)
> stores pgoff, the page offset within the dax device. And the fifth word
> (normally struct page::private) stores the buddy order. page_type will be
> used to store buddy flags.
> 
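
An illustrative sketch of the field reuse described above (the helper
names are not from the patch):

	static inline struct list_head *nvm_page_list(struct page *page)
	{
		/* words 2/3 of struct page double as the buddy list node */
		return (struct list_head *)&page->zone_device_data;
	}

	static inline pgoff_t nvm_page_pgoff(struct page *page)
	{
		return page->index;	/* page offset within the dax device */
	}

	static inline int nvm_page_order(struct page *page)
	{
		return page_private(page);	/* buddy order */
	}
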
> Reported-by: kernel test robot <lkp@intel.com>
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/nvm-pages.c   | 156 +++++++++++++++++++++++++++++++-
>  drivers/md/bcache/nvm-pages.h   |   6 ++
>  include/uapi/linux/bcache-nvm.h |  10 +-
>  3 files changed, 165 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
> index 18fdadbc502f..804ee66e97be 100644
> --- a/drivers/md/bcache/nvm-pages.c
> +++ b/drivers/md/bcache/nvm-pages.c
> @@ -34,6 +34,10 @@ static void release_nvm_namespaces(struct bch_nvm_set *nvm_set)
>  	for (i = 0; i < nvm_set->total_namespaces_nr; i++) {
>  		ns = nvm_set->nss[i];
>  		if (ns) {
> +			kvfree(ns->pages_bitmap);
> +			if (ns->pgalloc_recs_bitmap)
> +				bitmap_free(ns->pgalloc_recs_bitmap);
> +
>  			blkdev_put(ns->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
>  			kfree(ns);
>  		}
> @@ -48,17 +52,130 @@ static void release_nvm_set(struct bch_nvm_set *nvm_set)
>  	kfree(nvm_set);
>  }
>  
> +static struct page *nvm_vaddr_to_page(struct bch_nvm_namespace *ns, void *addr)
> +{
> +	return virt_to_page(addr);
> +}
> +
> +static void *nvm_pgoff_to_vaddr(struct bch_nvm_namespace *ns, pgoff_t pgoff)
> +{
> +	return ns->kaddr + (pgoff << PAGE_SHIFT);
> +}
> +
> +static inline void remove_owner_space(struct bch_nvm_namespace *ns,
> +					pgoff_t pgoff, u64 nr)
> +{
> +	while (nr > 0) {
> +		unsigned int num = nr > UINT_MAX ? UINT_MAX : nr;
> +
> +		bitmap_set(ns->pages_bitmap, pgoff, num);
> +		nr -= num;
> +		pgoff += num;
> +	}
> +}
> +
> +#define BCH_PGOFF_TO_KVADDR(pgoff) ((void *)((unsigned long)pgoff << PAGE_SHIFT))
> +
>  static int init_owner_info(struct bch_nvm_namespace *ns)
>  {
>  	struct bch_owner_list_head *owner_list_head = ns->sb->owner_list_head;
> +	struct bch_nvm_pgalloc_recs *sys_recs;
> +	int i, j, k, rc = 0;
>  
>  	mutex_lock(&only_set->lock);
>  	only_set->owner_list_head = owner_list_head;
>  	only_set->owner_list_size = owner_list_head->size;
>  	only_set->owner_list_used = owner_list_head->used;
> +
> +	/* remove used space */
> +	remove_owner_space(ns, 0, div_u64(ns->pages_offset, ns->page_size));
> +
> +	sys_recs = ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET;
> +	/* suppose no hole in array */
> +	for (i = 0; i < owner_list_head->used; i++) {
> +		struct bch_nvm_pages_owner_head *head = &owner_list_head->heads[i];
> +
> +		for (j = 0; j < BCH_NVM_PAGES_NAMESPACES_MAX; j++) {
> +			struct bch_nvm_pgalloc_recs *pgalloc_recs = head->recs[j];
> +			unsigned long offset = (unsigned long)ns->kaddr >> PAGE_SHIFT;
> +			struct page *page;
> +
> +			while (pgalloc_recs) {
> +				u32 pgalloc_recs_pos = (unsigned int)(pgalloc_recs - sys_recs);
> +
> +				if (memcmp(pgalloc_recs->magic, bch_nvm_pages_pgalloc_magic, 16)) {
> +					pr_info("invalid bch_nvm_pages_pgalloc_magic\n");
> +					rc = -EINVAL;
> +					goto unlock;
> +				}
> +				if (memcmp(pgalloc_recs->owner_uuid, head->uuid, 16)) {
> +					pr_info("invalid owner_uuid in bch_nvm_pgalloc_recs\n");
> +					rc = -EINVAL;
> +					goto unlock;
> +				}
> +				if (pgalloc_recs->owner != head) {
> +					pr_info("invalid owner in bch_nvm_pgalloc_recs\n");
> +					rc = -EINVAL;
> +					goto unlock;
> +				}
> +
> +				/* recs array can has hole */

can have holes ?

> +				for (k = 0; k < pgalloc_recs->size; k++) {
> +					struct bch_pgalloc_rec *rec = &pgalloc_recs->recs[k];
> +
> +					if (rec->pgoff) {
> +						BUG_ON(rec->pgoff <= offset);
> +
> +						/* init struct page: index/private */
> +						page = nvm_vaddr_to_page(ns,
> +							BCH_PGOFF_TO_KVADDR(rec->pgoff));
> +
> +						set_page_private(page, rec->order);
> +						page->index = rec->pgoff - offset;
> +
> +						remove_owner_space(ns,
> +							rec->pgoff - offset,
> +							1L << rec->order);
> +					}
> +				}
> +				bitmap_set(ns->pgalloc_recs_bitmap, pgalloc_recs_pos, 1);
> +				pgalloc_recs = pgalloc_recs->next;
> +			}
> +		}
> +	}
> +unlock:
>  	mutex_unlock(&only_set->lock);
>  
> -	return 0;
> +	return rc;
> +}
> +
> +static void init_nvm_free_space(struct bch_nvm_namespace *ns)
> +{
> +	unsigned int start, end, pages;
> +	int i;
> +	struct page *page;
> +	pgoff_t pgoff_start;
> +
> +	bitmap_for_each_clear_region(ns->pages_bitmap, start, end, 0, ns->pages_total) {
> +		pgoff_start = start;
> +		pages = end - start;
> +
> +		while (pages) {
> +			for (i = BCH_MAX_ORDER - 1; i >= 0 ; i--) {
> +				if ((pgoff_start % (1L << i) == 0) && (pages >= (1L << i)))
> +					break;
> +			}
> +
> +			page = nvm_vaddr_to_page(ns, nvm_pgoff_to_vaddr(ns, pgoff_start));
> +			page->index = pgoff_start;
> +			set_page_private(page, i);
> +			__SetPageBuddy(page);
> +			list_add((struct list_head *)&page->zone_device_data, &ns->free_area[i]);
> +
> +			pgoff_start += 1L << i;
> +			pages -= 1L << i;
> +		}
> +	}
>  }
>  
>  static int attach_nvm_set(struct bch_nvm_namespace *ns)
> @@ -165,7 +282,7 @@ static int read_nvdimm_meta_super(struct block_device *bdev,
>  struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
>  {
>  	struct bch_nvm_namespace *ns;
> -	int err;
> +	int i, err;
>  	pgoff_t pgoff;
>  	char buf[BDEVNAME_SIZE];
>  	struct block_device *bdev;
> @@ -249,18 +366,49 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
>  	ns->nvm_set = only_set;
>  	mutex_init(&ns->lock);
>  
> +	/*
> +	 * parameters of bitmap_set/clear are unsigned int.
> +	 * Given currently size of nvm is far from exceeding this limit,
> +	 * so only add a WARN_ON message.
> +	 */
> +	WARN_ON(BITS_TO_LONGS(ns->pages_total) > UINT_MAX);
> +	ns->pages_bitmap = kvcalloc(BITS_TO_LONGS(ns->pages_total),
> +					sizeof(unsigned long), GFP_KERNEL);
> +	if (!ns->pages_bitmap) {
> +		err = -ENOMEM;
> +		goto clear_ns_nr;
> +	}
> +
> +	if (ns->sb->this_namespace_nr == 0) {
> +		ns->pgalloc_recs_bitmap = bitmap_zalloc(BCH_MAX_PGALLOC_RECS, GFP_KERNEL);
> +		if (ns->pgalloc_recs_bitmap == NULL) {
> +			err = -ENOMEM;
> +			goto free_pages_bitmap;
> +		}
> +	}
> +
> +	for (i = 0; i < BCH_MAX_ORDER; i++)
> +		INIT_LIST_HEAD(&ns->free_area[i]);
> +
>  	if (ns->sb->this_namespace_nr == 0) {
>  		pr_info("only first namespace contain owner info\n");
>  		err = init_owner_info(ns);
>  		if (err < 0) {
>  			pr_info("init_owner_info met error %d\n", err);
> -			only_set->nss[ns->sb->this_namespace_nr] = NULL;
> -			goto free_ns;
> +			goto free_recs_bitmap;
>  		}
> +		/* init buddy allocator */
> +		init_nvm_free_space(ns);
>  	}
>  
>  	kfree(path);
>  	return ns;
> +free_recs_bitmap:
> +	bitmap_free(ns->pgalloc_recs_bitmap);
> +free_pages_bitmap:
> +	kvfree(ns->pages_bitmap);
> +clear_ns_nr:
> +	only_set->nss[ns->sb->this_namespace_nr] = NULL;
>  free_ns:
>  	kfree(ns);
>  bdput:
> diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
> index 3e24c4dee7fd..71beb244b9be 100644
> --- a/drivers/md/bcache/nvm-pages.h
> +++ b/drivers/md/bcache/nvm-pages.h
> @@ -16,6 +16,7 @@
>   * to which owner. After reboot from power failure, they will be initialized
>   * based on nvm pages superblock in NVDIMM device.
>   */
> +#define BCH_MAX_ORDER 20
>  struct bch_nvm_namespace {
>  	struct bch_nvm_pages_sb *sb;
>  	void *kaddr;
> @@ -27,6 +28,11 @@ struct bch_nvm_namespace {
>  	u64 pages_total;
>  	pfn_t start_pfn;
>  
> +	unsigned long *pages_bitmap;
> +	struct list_head free_area[BCH_MAX_ORDER];
> +
> +	unsigned long *pgalloc_recs_bitmap;
> >  	struct dax_device *dax_dev;
>  	struct block_device *bdev;
>  	struct bch_nvm_set *nvm_set;
> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
> index 5094a6797679..1fdb3eaabf7e 100644
> --- a/include/uapi/linux/bcache-nvm.h
> +++ b/include/uapi/linux/bcache-nvm.h
> @@ -130,11 +130,15 @@ union {
>  };
>  };
>  
> -#define BCH_MAX_RECS					\
> -	((sizeof(struct bch_nvm_pgalloc_recs) -		\
> -	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
> +#define BCH_MAX_RECS							\
> +	((sizeof(struct bch_nvm_pgalloc_recs) -				\
> +	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /			\
>  	 sizeof(struct bch_pgalloc_rec))
>  
> +#define BCH_MAX_PGALLOC_RECS						\
> +	((BCH_NVM_PAGES_OFFSET - BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET) /	\
> +	 sizeof(struct bch_nvm_pgalloc_recs))
> +
>  struct bch_nvm_pages_owner_head {
>  	unsigned char			uuid[16];
>  	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
> 
Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 06/14] bcache: bch_nvm_alloc_pages() of the buddy
  2021-06-15  5:49 ` [PATCH 06/14] bcache: bch_nvm_alloc_pages() " Coly Li
@ 2021-06-22 10:51   ` Hannes Reinecke
  2021-06-23  6:02     ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22 10:51 UTC (permalink / raw)
  To: Coly Li, axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/15/21 7:49 AM, Coly Li wrote:
> From: Jianpeng Ma <jianpeng.ma@intel.com>
> 
> This patch implements the bch_nvm_alloc_pages() of the buddy.
> In terms of functionality, this function is like the current page buddy
> allocator, but the differences are:
> a: it needs owner_uuid as a parameter to record owner info, and it
> makes that info persistent.
> b: it doesn't need flags like GFP_*; all allocations are equal.
> c: it doesn't trigger other operations such as swap/reclaim.
> 
> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/nvm-pages.c   | 174 ++++++++++++++++++++++++++++++++
>  drivers/md/bcache/nvm-pages.h   |   6 ++
>  include/uapi/linux/bcache-nvm.h |   6 +-
>  3 files changed, 184 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
> index 804ee66e97be..5d095d241483 100644
> --- a/drivers/md/bcache/nvm-pages.c
> +++ b/drivers/md/bcache/nvm-pages.c
> @@ -74,6 +74,180 @@ static inline void remove_owner_space(struct bch_nvm_namespace *ns,
>  	}
>  }
>  
> +/* If not found, it will create if create == true */
> +static struct bch_nvm_pages_owner_head *find_owner_head(const char *owner_uuid, bool create)
> +{
> +	struct bch_owner_list_head *owner_list_head = only_set->owner_list_head;
> +	struct bch_nvm_pages_owner_head *owner_head = NULL;
> +	int i;
> +
> +	if (owner_list_head == NULL)
> +		goto out;
> +
> +	for (i = 0; i < only_set->owner_list_used; i++) {
> +		if (!memcmp(owner_uuid, owner_list_head->heads[i].uuid, 16)) {
> +			owner_head = &(owner_list_head->heads[i]);
> +			break;
> +		}
> +	}
> +

Please don't name it 'heads'. If this is supposed to be a linked list,
use the standard list implementation and initialize the pointers correctly.
If it isn't, use an array (as you know in advance how many array entries
you can allocate).
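
If it stays an array, the lookup could at least be bounded explicitly,
e.g. (sketch, reusing the existing names):

	for (i = 0; i < only_set->owner_list_used &&
		    i < BCH_MAX_OWNER_LIST; i++) {
		if (!memcmp(owner_uuid, owner_list_head->heads[i].uuid, 16)) {
			owner_head = &(owner_list_head->heads[i]);
			break;
		}
	}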

> +	if (!owner_head && create) {
> +		u32 used = only_set->owner_list_used;
> +
> +		if (only_set->owner_list_size > used) {
> +			memcpy_flushcache(owner_list_head->heads[used].uuid, owner_uuid, 16);
> +			only_set->owner_list_used++;
> +
> +			owner_list_head->used++;
> +			owner_head = &(owner_list_head->heads[used]);
> +		} else
> +			pr_info("no free bch_nvm_pages_owner_head\n");
> +	}
> +
> +out:
> +	return owner_head;
> +}
> +
> +static struct bch_nvm_pgalloc_recs *find_empty_pgalloc_recs(void)
> +{
> +	unsigned int start;
> +	struct bch_nvm_namespace *ns = only_set->nss[0];
> +	struct bch_nvm_pgalloc_recs *recs;
> +
> +	start = bitmap_find_next_zero_area(ns->pgalloc_recs_bitmap, BCH_MAX_PGALLOC_RECS, 0, 1, 0);
> +	if (start > BCH_MAX_PGALLOC_RECS) {
> +		pr_info("no free struct bch_nvm_pgalloc_recs\n");
> +		return NULL;
> +	}
> +
> +	bitmap_set(ns->pgalloc_recs_bitmap, start, 1);
> +	recs = (struct bch_nvm_pgalloc_recs *)(ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET)
> +		+ start;
> +	return recs;
> +}
> +
> +static struct bch_nvm_pgalloc_recs *find_nvm_pgalloc_recs(struct bch_nvm_namespace *ns,
> +		struct bch_nvm_pages_owner_head *owner_head, bool create)
> +{
> +	int ns_nr = ns->sb->this_namespace_nr;
> +	struct bch_nvm_pgalloc_recs *prev_recs = NULL, *recs = owner_head->recs[ns_nr];
> +
> +	/* If create=false, we return recs[nr] */
> +	if (!create)
> +		return recs;
> +
> +	/*
> +	 * If create=true, it mean we need a empty struct bch_pgalloc_rec
> +	 * So we should find non-empty struct bch_nvm_pgalloc_recs or alloc
> +	 * new struct bch_nvm_pgalloc_recs. And return this bch_nvm_pgalloc_recs
> +	 */
> +	while (recs && (recs->used == recs->size)) {
> +		prev_recs = recs;
> +		recs = recs->next;
> +	}
> +
> +	/* Found empty struct bch_nvm_pgalloc_recs */
> +	if (recs)
> +		return recs;
> +	/* Need alloc new struct bch_nvm_galloc_recs */
> +	recs = find_empty_pgalloc_recs();
> +	if (recs) {
> +		recs->next = NULL;
> +		recs->owner = owner_head;
> +		memcpy_flushcache(recs->magic, bch_nvm_pages_pgalloc_magic, 16);
> +		memcpy_flushcache(recs->owner_uuid, owner_head->uuid, 16);
> +		recs->size = BCH_MAX_RECS;
> +		recs->used = 0;
> +
> +		if (prev_recs)
> +			prev_recs->next = recs;
> +		else
> +			owner_head->recs[ns_nr] = recs;
> +	}
> +

Wouldn't it be easier if the bitmap covered the entire range, and not
just the non-empty recs?
Eventually (i.e. if the NVM set becomes full) it'll cover it anyway, so
can't we save ourselves some time by allocating a large enough bitmap
upfront and only using it to figure out empty recs?

> +	return recs;
> +}
> +
> +static void add_pgalloc_rec(struct bch_nvm_pgalloc_recs *recs, void *kaddr, int order)
> +{
> +	int i;
> +
> +	for (i = 0; i < recs->size; i++) {
> +		if (recs->recs[i].pgoff == 0) {
> +			recs->recs[i].pgoff = (unsigned long)kaddr >> PAGE_SHIFT;
> +			recs->recs[i].order = order;
> +			recs->used++;
> +			break;
> +		}
> +	}
> +	BUG_ON(i == recs->size);
> +}
> +
> +void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
> +{
> +	void *kaddr = NULL;
> +	struct bch_nvm_pgalloc_recs *pgalloc_recs;
> +	struct bch_nvm_pages_owner_head *owner_head;
> +	int i, j;
> +
> +	mutex_lock(&only_set->lock);
> +	owner_head = find_owner_head(owner_uuid, true);
> +
> +	if (!owner_head) {
> +		pr_err("can't find bch_nvm_pgalloc_recs by(uuid=%s)\n", owner_uuid);
> +		goto unlock;
> +	}
> +
> +	for (j = 0; j < only_set->total_namespaces_nr; j++) {
> +		struct bch_nvm_namespace *ns = only_set->nss[j];
> +
> +		if (!ns || (ns->free < (1L << order)))
> +			continue;
> +
> +		for (i = order; i < BCH_MAX_ORDER; i++) {
> +			struct list_head *list;
> +			struct page *page, *buddy_page;
> +
> +			if (list_empty(&ns->free_area[i]))
> +				continue;
> +
> +			list = ns->free_area[i].next;

list_first_entry()?
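
A sketch of what that could look like; note that because the list node is
overlaid on page->zone_device_data (a void *), a plain
list_first_entry(head, struct page, zone_device_data) would trip
container_of()'s type check (struct list_head vs. void *), so the cast
still has to live somewhere, e.g. in a hypothetical wrapper:

	/* sketch: keep the cast in one place */
	static inline struct page *nvm_free_area_first(struct list_head *head)
	{
		return container_of((void *)head->next, struct page,
				    zone_device_data);
	}

	/* usage: page = nvm_free_area_first(&ns->free_area[i]); */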

> +			page = container_of((void *)list, struct page, zone_device_data);
> +
> +			list_del(list);
> +
> +			while (i != order) {
> +				buddy_page = nvm_vaddr_to_page(ns,
> +					nvm_pgoff_to_vaddr(ns, page->index + (1L << (i - 1))));
> +				set_page_private(buddy_page, i - 1);
> +				buddy_page->index = page->index + (1L << (i - 1));
> +				__SetPageBuddy(buddy_page);
> +				list_add((struct list_head *)&buddy_page->zone_device_data,
> +					&ns->free_area[i - 1]);
> +				i--;
> +			}
> +
> +			set_page_private(page, order);
> +			__ClearPageBuddy(page);
> +			ns->free -= 1L << order;
> +			kaddr = nvm_pgoff_to_vaddr(ns, page->index);
> +			break;
> +		}
> +
> +		if (i < BCH_MAX_ORDER) {
> +			pgalloc_recs = find_nvm_pgalloc_recs(ns, owner_head, true);
> +			/* ToDo: handle pgalloc_recs==NULL */
> +			add_pgalloc_rec(pgalloc_recs, kaddr, order);
> +			break;
> +		}
> +	}
> +
> +unlock:
> +	mutex_unlock(&only_set->lock);
> +	return kaddr;
> +}
> +EXPORT_SYMBOL_GPL(bch_nvm_alloc_pages);
> +
>  #define BCH_PGOFF_TO_KVADDR(pgoff) ((void *)((unsigned long)pgoff << PAGE_SHIFT))
>  
>  static int init_owner_info(struct bch_nvm_namespace *ns)
> diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
> index 71beb244b9be..f2583723aca6 100644
> --- a/drivers/md/bcache/nvm-pages.h
> +++ b/drivers/md/bcache/nvm-pages.h
> @@ -62,6 +62,7 @@ extern struct bch_nvm_set *only_set;
>  struct bch_nvm_namespace *bch_register_namespace(const char *dev_path);
>  int bch_nvm_init(void);
>  void bch_nvm_exit(void);
> +void *bch_nvm_alloc_pages(int order, const char *owner_uuid);
>  
>  #else
>  
> @@ -74,6 +75,11 @@ static inline int bch_nvm_init(void)
>  	return 0;
>  }
>  static inline void bch_nvm_exit(void) { }
> +static inline void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
> +{
> +	return NULL;
> +}
> +
>  
>  #endif /* CONFIG_BCACHE_NVM_PAGES */
>  
> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
> index 1fdb3eaabf7e..9cb937292202 100644
> --- a/include/uapi/linux/bcache-nvm.h
> +++ b/include/uapi/linux/bcache-nvm.h
> @@ -135,9 +135,11 @@ union {
>  	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /			\
>  	 sizeof(struct bch_pgalloc_rec))
>  
> +/* Currently 64 struct bch_nvm_pgalloc_recs is enough */
>  #define BCH_MAX_PGALLOC_RECS						\
> -	((BCH_NVM_PAGES_OFFSET - BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET) /	\
> -	 sizeof(struct bch_nvm_pgalloc_recs))
> +	(min_t(unsigned int, 64,					\
> +		(BCH_NVM_PAGES_OFFSET - BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET) / \
> +		 sizeof(struct bch_nvm_pgalloc_recs)))
>  
>  struct bch_nvm_pages_owner_head {
>  	unsigned char			uuid[16];
> 
Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/14] bcache: bch_nvm_free_pages() of the buddy
  2021-06-15  5:49 ` [PATCH 07/14] bcache: bch_nvm_free_pages() " Coly Li
@ 2021-06-22 10:53   ` Hannes Reinecke
  2021-06-23  6:06     ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22 10:53 UTC (permalink / raw)
  To: Coly Li, axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/15/21 7:49 AM, Coly Li wrote:
> From: Jianpeng Ma <jianpeng.ma@intel.com>
> 
> This patch implements the bch_nvm_free_pages() of the buddy.
> 
> The difference between this and the page buddy free path:
> it needs owner_uuid to free owner-allocated pages, and the result must
> be persistent after the free.
> 
> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/nvm-pages.c | 164 ++++++++++++++++++++++++++++++++--
>  drivers/md/bcache/nvm-pages.h |   3 +-
>  2 files changed, 159 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
> index 5d095d241483..74d08950c67c 100644
> --- a/drivers/md/bcache/nvm-pages.c
> +++ b/drivers/md/bcache/nvm-pages.c
> @@ -52,7 +52,7 @@ static void release_nvm_set(struct bch_nvm_set *nvm_set)
>  	kfree(nvm_set);
>  }
>  
> -static struct page *nvm_vaddr_to_page(struct bch_nvm_namespace *ns, void *addr)
> +static struct page *nvm_vaddr_to_page(void *addr)
>  {
>  	return virt_to_page(addr);
>  }

If you don't need this argument, please modify the patch that adds the
nvm_vaddr_to_page() function.

> @@ -183,6 +183,155 @@ static void add_pgalloc_rec(struct bch_nvm_pgalloc_recs *recs, void *kaddr, int
>  	BUG_ON(i == recs->size);
>  }
>  
> +static inline void *nvm_end_addr(struct bch_nvm_namespace *ns)
> +{
> +	return ns->kaddr + (ns->pages_total << PAGE_SHIFT);
> +}
> +
> +static inline bool in_nvm_range(struct bch_nvm_namespace *ns,
> +		void *start_addr, void *end_addr)
> +{
> +	return (start_addr >= ns->kaddr) && (end_addr < nvm_end_addr(ns));
> +}
> +
> +static struct bch_nvm_namespace *find_nvm_by_addr(void *addr, int order)
> +{
> +	int i;
> +	struct bch_nvm_namespace *ns;
> +
> +	for (i = 0; i < only_set->total_namespaces_nr; i++) {
> +		ns = only_set->nss[i];
> +		if (ns && in_nvm_range(ns, addr, addr + (1L << order)))
> +			return ns;
> +	}
> +	return NULL;
> +}
> +
> +static int remove_pgalloc_rec(struct bch_nvm_pgalloc_recs *pgalloc_recs, int ns_nr,
> +				void *kaddr, int order)
> +{
> +	struct bch_nvm_pages_owner_head *owner_head = pgalloc_recs->owner;
> +	struct bch_nvm_pgalloc_recs *prev_recs, *sys_recs;
> +	u64 pgoff = (unsigned long)kaddr >> PAGE_SHIFT;
> +	struct bch_nvm_namespace *ns = only_set->nss[0];
> +	int i;
> +
> +	prev_recs = pgalloc_recs;
> +	sys_recs = ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET;
> +	while (pgalloc_recs) {
> +		for (i = 0; i < pgalloc_recs->size; i++) {
> +			struct bch_pgalloc_rec *rec = &(pgalloc_recs->recs[i]);
> +
> +			if (rec->pgoff == pgoff) {
> +				WARN_ON(rec->order != order);
> +				rec->pgoff = 0;
> +				rec->order = 0;
> +				pgalloc_recs->used--;
> +
> +				if (pgalloc_recs->used == 0) {
> +					int recs_pos = pgalloc_recs - sys_recs;
> +
> +					if (pgalloc_recs == prev_recs)
> +						owner_head->recs[ns_nr] = pgalloc_recs->next;
> +					else
> +						prev_recs->next = pgalloc_recs->next;
> +
> +					pgalloc_recs->next = NULL;
> +					pgalloc_recs->owner = NULL;
> +
> +					bitmap_clear(ns->pgalloc_recs_bitmap, recs_pos, 1);
> +				}
> +				goto exit;
> +			}
> +		}
> +		prev_recs = pgalloc_recs;
> +		pgalloc_recs = pgalloc_recs->next;
> +	}
> +exit:
> +	return pgalloc_recs ? 0 : -ENOENT;
> +}
> +
> +static void __free_space(struct bch_nvm_namespace *ns, void *addr, int order)
> +{
> +	unsigned long add_pages = (1L << order);
> +	pgoff_t pgoff;
> +	struct page *page;
> +
> +	page = nvm_vaddr_to_page(addr);
> +	WARN_ON((!page) || (page->private != order));
> +	pgoff = page->index;
> +
> +	while (order < BCH_MAX_ORDER - 1) {
> +		struct page *buddy_page;
> +
> +		pgoff_t buddy_pgoff = pgoff ^ (1L << order);
> +		pgoff_t parent_pgoff = pgoff & ~(1L << order);
> +
> +		if ((parent_pgoff + (1L << (order + 1)) > ns->pages_total))
> +			break;
> +
> +		buddy_page = nvm_vaddr_to_page(nvm_pgoff_to_vaddr(ns, buddy_pgoff));
> +		WARN_ON(!buddy_page);
> +
> +		if (PageBuddy(buddy_page) && (buddy_page->private == order)) {
> +			list_del((struct list_head *)&buddy_page->zone_device_data);
> +			__ClearPageBuddy(buddy_page);
> +			pgoff = parent_pgoff;
> +			order++;
> +			continue;
> +		}
> +		break;
> +	}
> +
> +	page = nvm_vaddr_to_page(nvm_pgoff_to_vaddr(ns, pgoff));
> +	WARN_ON(!page);
> +	list_add((struct list_head *)&page->zone_device_data, &ns->free_area[order]);
> +	page->index = pgoff;
> +	set_page_private(page, order);
> +	__SetPageBuddy(page);
> +	ns->free += add_pages;
> +}
> +
> +void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid)
> +{
> +	struct bch_nvm_namespace *ns;
> +	struct bch_nvm_pages_owner_head *owner_head;
> +	struct bch_nvm_pgalloc_recs *pgalloc_recs;
> +	int r;
> +
> +	mutex_lock(&only_set->lock);
> +
> +	ns = find_nvm_by_addr(addr, order);
> +	if (!ns) {
> +		pr_err("can't find nvm_dev by kaddr %p\n", addr);
> +		goto unlock;
> +	}
> +
> +	owner_head = find_owner_head(owner_uuid, false);
> +	if (!owner_head) {
> +		pr_err("can't found bch_nvm_pages_owner_head by(uuid=%s)\n", owner_uuid);
> +		goto unlock;
> +	}
> +
> +	pgalloc_recs = find_nvm_pgalloc_recs(ns, owner_head, false);
> +	if (!pgalloc_recs) {
> +		pr_err("can't find bch_nvm_pgalloc_recs by(uuid=%s)\n", owner_uuid);
> +		goto unlock;
> +	}
> +
> +	r = remove_pgalloc_rec(pgalloc_recs, ns->sb->this_namespace_nr, addr, order);
> +	if (r < 0) {
> +		pr_err("can't find bch_pgalloc_rec\n");
> +		goto unlock;
> +	}
> +
> +	__free_space(ns, addr, order);
> +
> +unlock:
> +	mutex_unlock(&only_set->lock);
> +}
> +EXPORT_SYMBOL_GPL(bch_nvm_free_pages);
> +
>  void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
>  {
>  	void *kaddr = NULL;
> @@ -217,7 +366,7 @@ void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
>  			list_del(list);
>  
>  			while (i != order) {
> -				buddy_page = nvm_vaddr_to_page(ns,
> +				buddy_page = nvm_vaddr_to_page(
>  					nvm_pgoff_to_vaddr(ns, page->index + (1L << (i - 1))));
>  				set_page_private(buddy_page, i - 1);
>  				buddy_page->index = page->index + (1L << (i - 1));
> @@ -301,7 +450,7 @@ static int init_owner_info(struct bch_nvm_namespace *ns)
>  						BUG_ON(rec->pgoff <= offset);
>  
>  						/* init struct page: index/private */
> -						page = nvm_vaddr_to_page(ns,
> +						page = nvm_vaddr_to_page(
>  							BCH_PGOFF_TO_KVADDR(rec->pgoff));
>  
>  						set_page_private(page, rec->order);
> @@ -340,11 +489,12 @@ static void init_nvm_free_space(struct bch_nvm_namespace *ns)
>  					break;
>  			}
>  
> -			page = nvm_vaddr_to_page(ns, nvm_pgoff_to_vaddr(ns, pgoff_start));
> +			page = nvm_vaddr_to_page(nvm_pgoff_to_vaddr(ns, pgoff_start));
>  			page->index = pgoff_start;
>  			set_page_private(page, i);
> -			__SetPageBuddy(page);
> -			list_add((struct list_head *)&page->zone_device_data, &ns->free_area[i]);
> +
> +			/* in order to update ns->free */
> +			__free_space(ns, nvm_pgoff_to_vaddr(ns, pgoff_start), i);
>  
>  			pgoff_start += 1L << i;
>  			pages -= 1L << i;
> @@ -535,7 +685,7 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
>  	ns->page_size = ns->sb->page_size;
>  	ns->pages_offset = ns->sb->pages_offset;
>  	ns->pages_total = ns->sb->pages_total;
> -	ns->free = 0;
> +	ns->free = 0; /* increase by __free_space() */
>  	ns->bdev = bdev;
>  	ns->nvm_set = only_set;
>  	mutex_init(&ns->lock);
> diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
> index f2583723aca6..0ca699166855 100644
> --- a/drivers/md/bcache/nvm-pages.h
> +++ b/drivers/md/bcache/nvm-pages.h
> @@ -63,6 +63,7 @@ struct bch_nvm_namespace *bch_register_namespace(const char *dev_path);
>  int bch_nvm_init(void);
>  void bch_nvm_exit(void);
>  void *bch_nvm_alloc_pages(int order, const char *owner_uuid);
> +void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid);
>  
>  #else
>  
> @@ -79,7 +80,7 @@ static inline void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
>  {
>  	return NULL;
>  }
> -
> +static inline void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid) { }
>  
>  #endif /* CONFIG_BCACHE_NVM_PAGES */
>  
> 
Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 08/14] bcache: get allocated pages from specific owner
  2021-06-15  5:49 ` [PATCH 08/14] bcache: get allocated pages from specific owner Coly Li
@ 2021-06-22 10:54   ` Hannes Reinecke
  2021-06-23  6:08     ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22 10:54 UTC (permalink / raw)
  To: Coly Li, axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/15/21 7:49 AM, Coly Li wrote:
> From: Jianpeng Ma <jianpeng.ma@intel.com>
> 
> This patch implements bch_get_allocated_pages() of the buddy to be used to

buddy allocator

> get allocated pages from specific owner.
> 
> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/nvm-pages.c | 6 ++++++
>  drivers/md/bcache/nvm-pages.h | 5 +++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
> index 74d08950c67c..42b0504d9564 100644
> --- a/drivers/md/bcache/nvm-pages.c
> +++ b/drivers/md/bcache/nvm-pages.c
> @@ -397,6 +397,12 @@ void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
>  }
>  EXPORT_SYMBOL_GPL(bch_nvm_alloc_pages);
>  
> +struct bch_nvm_pages_owner_head *bch_get_allocated_pages(const char *owner_uuid)
> +{
> +	return find_owner_head(owner_uuid, false);
> +}
> +EXPORT_SYMBOL_GPL(bch_get_allocated_pages);
> +
>  #define BCH_PGOFF_TO_KVADDR(pgoff) ((void *)((unsigned long)pgoff << PAGE_SHIFT))
>  
>  static int init_owner_info(struct bch_nvm_namespace *ns)
> diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
> index 0ca699166855..c763bf2e2721 100644
> --- a/drivers/md/bcache/nvm-pages.h
> +++ b/drivers/md/bcache/nvm-pages.h
> @@ -64,6 +64,7 @@ int bch_nvm_init(void);
>  void bch_nvm_exit(void);
>  void *bch_nvm_alloc_pages(int order, const char *owner_uuid);
>  void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid);
> +struct bch_nvm_pages_owner_head *bch_get_allocated_pages(const char *owner_uuid);
>  
>  #else
>  
> @@ -81,6 +82,10 @@ static inline void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
>  	return NULL;
>  }
>  static inline void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid) { }
> +static inline struct bch_nvm_pages_owner_head *bch_get_allocated_pages(const char *owner_uuid)
> +{
> +	return NULL;
> +}
>  
>  #endif /* CONFIG_BCACHE_NVM_PAGES */
>  
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/14] bcache: use bucket index to set GC_MARK_METADATA for journal buckets in bch_btree_gc_finish()
  2021-06-15  5:49 ` [PATCH 09/14] bcache: use bucket index to set GC_MARK_METADATA for journal buckets in bch_btree_gc_finish() Coly Li
@ 2021-06-22 10:55   ` Hannes Reinecke
  2021-06-23  6:09     ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22 10:55 UTC (permalink / raw)
  To: Coly Li, axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/15/21 7:49 AM, Coly Li wrote:
> Currently the meta data bucket locations on the cache device are still
> reserved after the meta data is stored on NVDIMM pages, to keep the meta
> data layout consistent for now. So these buckets are still marked as
> meta data by SET_GC_MARK() in bch_btree_gc_finish().
> 
> When BCH_FEATURE_INCOMPAT_NVDIMM_META is set, sb.d[] stores linear
> addresses of NVDIMM pages and not bucket indexes anymore. Therefore we
> should avoid looking up bucket indexes from sb.d[], and directly use the
> bucket index range from ca->sb.first_bucket to (ca->sb.first_bucket +
> ca->sb.njournal_buckets) for setting the gc mark of the journal buckets.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
> ---
>  drivers/md/bcache/btree.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 183a58c89377..e0d7135669ca 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1761,8 +1761,10 @@ static void bch_btree_gc_finish(struct cache_set *c)
>  	ca = c->cache;
>  	ca->invalidate_needs_gc = 0;
>  
> -	for (k = ca->sb.d; k < ca->sb.d + ca->sb.keys; k++)
> -		SET_GC_MARK(ca->buckets + *k, GC_MARK_METADATA);
> +	/* Range [first_bucket, first_bucket + keys) is for journal buckets */
> +	for (i = ca->sb.first_bucket;
> +	     i < ca->sb.first_bucket + ca->sb.njournal_buckets; i++)
> +		SET_GC_MARK(ca->buckets + i, GC_MARK_METADATA);
>  
>  	for (k = ca->prio_buckets;
>  	     k < ca->prio_buckets + prio_buckets(ca) * 2; k++)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 10/14] bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set
  2021-06-15  5:49 ` [PATCH 10/14] bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set Coly Li
@ 2021-06-22 10:59   ` Hannes Reinecke
  2021-06-23  6:09     ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22 10:59 UTC (permalink / raw)
  To: Coly Li, axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/15/21 7:49 AM, Coly Li wrote:
> This patch adds BCH_FEATURE_INCOMPAT_NVDIMM_META (value 0x0004) into the
> incompat feature set. When this bit is set by bcache-tools, it indicates
> that bcache meta data should be stored on a specific NVDIMM meta device.
> 
> The bcache meta data mainly includes the journal and btree nodes. When
> this bit is set in the incompat feature set, bcache will ask the
> nvm-pages allocator for NVDIMM space to store the meta data.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
> ---
>  drivers/md/bcache/features.h | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
> index d1c8fd3977fc..45d2508d5532 100644
> --- a/drivers/md/bcache/features.h
> +++ b/drivers/md/bcache/features.h
> @@ -17,11 +17,19 @@
>  #define BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET		0x0001
>  /* real bucket size is (1 << bucket_size) */
>  #define BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE	0x0002
> +/* store bcache meta data on nvdimm */
> +#define BCH_FEATURE_INCOMPAT_NVDIMM_META		0x0004
>  
>  #define BCH_FEATURE_COMPAT_SUPP		0
>  #define BCH_FEATURE_RO_COMPAT_SUPP	0
> +#if defined(CONFIG_BCACHE_NVM_PAGES)
> +#define BCH_FEATURE_INCOMPAT_SUPP	(BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET| \
> +					 BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE| \
> +					 BCH_FEATURE_INCOMPAT_NVDIMM_META)
> +#else
>  #define BCH_FEATURE_INCOMPAT_SUPP	(BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET| \
>  					 BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE)
> +#endif
>  
>  #define BCH_HAS_COMPAT_FEATURE(sb, mask) \
>  		((sb)->feature_compat & (mask))
> @@ -89,6 +97,7 @@ static inline void bch_clear_feature_##name(struct cache_sb *sb) \
>  
>  BCH_FEATURE_INCOMPAT_FUNCS(obso_large_bucket, OBSO_LARGE_BUCKET);
>  BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LOG_LARGE_BUCKET_SIZE);
> +BCH_FEATURE_INCOMPAT_FUNCS(nvdimm_meta, NVDIMM_META);
>  
>  static inline bool bch_has_unknown_compat_features(struct cache_sb *sb)
>  {
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 11/14] bcache: initialize bcache journal for NVDIMM meta device
  2021-06-15  5:49 ` [PATCH 11/14] bcache: initialize bcache journal for NVDIMM meta device Coly Li
@ 2021-06-22 11:01   ` Hannes Reinecke
  2021-06-23  6:17     ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22 11:01 UTC (permalink / raw)
  To: Coly Li, axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/15/21 7:49 AM, Coly Li wrote:
> The nvm-pages allocator may store and index the NVDIMM pages allocated
> for the bcache journal. This patch adds the initialization to store the
> bcache journal space on NVDIMM pages if the
> BCH_FEATURE_INCOMPAT_NVDIMM_META bit is set by bcache-tools.
> 
> If BCH_FEATURE_INCOMPAT_NVDIMM_META is set, get_nvdimm_journal_space()
> will return the linear address of NVDIMM pages for the bcache journal:
> - If there is previously allocated space, find it from the nvm-pages
>   owner list and return it to bch_journal_init().
> - If there is no previously allocated space, request a new NVDIMM range
>   from the nvm-pages allocator, and return it to bch_journal_init().
> 
> And in bch_journal_init(), the keys in sb.d[] store the corresponding
> linear addresses from NVDIMM in sb.d[i].ptr[0], where 'i' is the bucket
> index used to iterate over all journal buckets.
> 
> Later, when the bcache journaling code stores a journal jset, the target
> NVDIMM linear address stored (and updated) in sb.d[i].ptr[0] can be used
> directly for the memory copy from DRAM pages into NVDIMM pages.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
> ---
>  drivers/md/bcache/journal.c | 105 ++++++++++++++++++++++++++++++++++++
>  drivers/md/bcache/journal.h |   2 +-
>  drivers/md/bcache/super.c   |  16 +++---
>  3 files changed, 115 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 61bd79babf7a..32599d2ff5d2 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -9,6 +9,8 @@
>  #include "btree.h"
>  #include "debug.h"
>  #include "extents.h"
> +#include "nvm-pages.h"
> +#include "features.h"
>  
>  #include <trace/events/bcache.h>
>  
> @@ -982,3 +984,106 @@ int bch_journal_alloc(struct cache_set *c)
>  
>  	return 0;
>  }
> +
> +#if defined(CONFIG_BCACHE_NVM_PAGES)
> +
> +static void *find_journal_nvm_base(struct bch_nvm_pages_owner_head *owner_list,
> +				   struct cache *ca)
> +{
> +	unsigned long addr = 0;
> +	struct bch_nvm_pgalloc_recs *recs_list = owner_list->recs[0];
> +
> +	while (recs_list) {
> +		struct bch_pgalloc_rec *rec;
> +		unsigned long jnl_pgoff;
> +		int i;
> +
> +		jnl_pgoff = ((unsigned long)ca->sb.d[0]) >> PAGE_SHIFT;
> +		rec = recs_list->recs;
> +		for (i = 0; i < recs_list->used; i++) {
> +			if (rec->pgoff == jnl_pgoff)
> +				break;
> +			rec++;
> +		}
> +		if (i < recs_list->used) {
> +			addr = rec->pgoff << PAGE_SHIFT;
> +			break;
> +		}
> +		recs_list = recs_list->next;
> +	}
> +	return (void *)addr;
> +}
> +
> +static void *get_nvdimm_journal_space(struct cache *ca)
> +{
> +	struct bch_nvm_pages_owner_head *owner_list = NULL;
> +	void *ret = NULL;
> +	int order;
> +
> +	owner_list = bch_get_allocated_pages(ca->sb.set_uuid);
> +	if (owner_list) {
> +		ret = find_journal_nvm_base(owner_list, ca);
> +		if (ret)
> +			goto found;
> +	}
> +
> +	order = ilog2(ca->sb.bucket_size *
> +		      ca->sb.njournal_buckets / PAGE_SECTORS);
> +	ret = bch_nvm_alloc_pages(order, ca->sb.set_uuid);
> +	if (ret)
> +		memset(ret, 0, (1 << order) * PAGE_SIZE);
> +
> +found:
> +	return ret;
> +}
> +
> +static int __bch_journal_nvdimm_init(struct cache *ca)
> +{
> +	int i, ret = 0;
> +	void *journal_nvm_base = NULL;
> +
> +	journal_nvm_base = get_nvdimm_journal_space(ca);
> +	if (!journal_nvm_base) {
> +		pr_err("Failed to get journal space from nvdimm\n");
> +		ret = -1;
> +		goto out;
> +	}
> +
> +	/* Iniialized and reloaded from on-disk super block already */
> +	if (ca->sb.d[0] != 0)
> +		goto out;
> +
> +	for (i = 0; i < ca->sb.keys; i++)
> +		ca->sb.d[i] =
> +			(u64)(journal_nvm_base + (ca->sb.bucket_size * i));
> +
> +out:
> +	return ret;
> +}
> +
> +#else /* CONFIG_BCACHE_NVM_PAGES */
> +
> +static int __bch_journal_nvdimm_init(struct cache *ca)
> +{
> +	return -1;
> +}
> +
> +#endif /* CONFIG_BCACHE_NVM_PAGES */
> +
> +int bch_journal_init(struct cache_set *c)
> +{
> +	int i, ret = 0;
> +	struct cache *ca = c->cache;
> +
> +	ca->sb.keys = clamp_t(int, ca->sb.nbuckets >> 7,
> +				2, SB_JOURNAL_BUCKETS);
> +
> +	if (!bch_has_feature_nvdimm_meta(&ca->sb)) {
> +		for (i = 0; i < ca->sb.keys; i++)
> +			ca->sb.d[i] = ca->sb.first_bucket + i;
> +	} else {
> +		ret = __bch_journal_nvdimm_init(ca);
> +	}
> +
> +	return ret;
> +}
> diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
> index f2ea34d5f431..e3a7fa5a8fda 100644
> --- a/drivers/md/bcache/journal.h
> +++ b/drivers/md/bcache/journal.h
> @@ -179,7 +179,7 @@ void bch_journal_mark(struct cache_set *c, struct list_head *list);
>  void bch_journal_meta(struct cache_set *c, struct closure *cl);
>  int bch_journal_read(struct cache_set *c, struct list_head *list);
>  int bch_journal_replay(struct cache_set *c, struct list_head *list);
> -
> +int bch_journal_init(struct cache_set *c);
>  void bch_journal_free(struct cache_set *c);
>  int bch_journal_alloc(struct cache_set *c);
>  
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index ce22aefb1352..cce0f6bf0944 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -147,10 +147,15 @@ static const char *read_super_common(struct cache_sb *sb,  struct block_device *
>  		goto err;
>  
>  	err = "Journal buckets not sequential";
> +#if defined(CONFIG_BCACHE_NVM_PAGES)
> +	if (!bch_has_feature_nvdimm_meta(sb)) {
> +#endif
>  	for (i = 0; i < sb->keys; i++)
>  		if (sb->d[i] != sb->first_bucket + i)
>  			goto err;
> -
> +#ifdef CONFIG_BCACHE_NVM_PAGES
> +	} /* bch_has_feature_nvdimm_meta */
> +#endif
>  	err = "Too many journal buckets";
>  	if (sb->first_bucket + sb->keys > sb->nbuckets)
>  		goto err;

Extremely awkward.
Make 'bch_has_feature_nvdimm_meta()' generally available, and have it
return 'false' if the config feature isn't enabled.
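
E.g. something along these lines in features.h (just a sketch, assuming
the existing BCH_FEATURE_INCOMPAT_FUNCS() wrapper stays under the ifdef;
the exact form is up to you):

    #if defined(CONFIG_BCACHE_NVM_PAGES)
    BCH_FEATURE_INCOMPAT_FUNCS(nvdimm_meta, NVDIMM_META);
    #else
    /* stub so callers do not need their own #ifdef */
    static inline bool bch_has_feature_nvdimm_meta(struct cache_sb *sb)
    {
            return false;
    }
    #endif

Then super.c can simply call bch_has_feature_nvdimm_meta() and the
#ifdef/#endif pair around the journal bucket check goes away.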

> @@ -2072,14 +2077,11 @@ static int run_cache_set(struct cache_set *c)
>  		if (bch_journal_replay(c, &journal))
>  			goto err;
>  	} else {
> -		unsigned int j;
> -
>  		pr_notice("invalidating existing data\n");
> -		ca->sb.keys = clamp_t(int, ca->sb.nbuckets >> 7,
> -					2, SB_JOURNAL_BUCKETS);
>  
> -		for (j = 0; j < ca->sb.keys; j++)
> -			ca->sb.d[j] = ca->sb.first_bucket + j;
> +		err = "error initializing journal";
> +		if (bch_journal_init(c))
> +			goto err;
>  
>  		bch_initial_gc_finish(c);
>  
> 
Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/14] bcache: support storing bcache journal into NVDIMM meta device
  2021-06-15  5:49 ` [PATCH 12/14] bcache: support storing bcache journal into " Coly Li
@ 2021-06-22 11:03   ` Hannes Reinecke
  2021-06-23  6:19     ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22 11:03 UTC (permalink / raw)
  To: Coly Li, axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/15/21 7:49 AM, Coly Li wrote:
> This patch implements two methods to store bcache journal to,
> 1) __journal_write_unlocked() for block interface device
>    The latency method to compose bio and issue the jset bio to cache
>    device (e.g. SSD). c->journal.key.ptr[0] indicates the LBA on cache
>    device to store the journal jset.
> 2) __journal_nvdimm_write_unlocked() for memory interface NVDIMM
>    Use memory interface to access NVDIMM pages and store the jset by
>    memcpy_flushcache(). c->journal.key.ptr[0] indicates the linear
>    address from the NVDIMM pages to store the journal jset.
> 
> For lagency configuration without NVDIMM meta device, journal I/O is

legacy?

> handled by __journal_write_unlocked() with existing code logic. If the
> NVDIMM meta device is used (by bcache-tools), the journal I/O will
> be handled by __journal_nvdimm_write_unlocked() and go into the NVDIMM
> pages.
> 
> And when the NVDIMM meta device is used, sb.d[] stores the linear
> addresses of NVDIMM pages (no longer bucket indexes). In
> journal_reclaim() the journaling location in c->journal.key.ptr[0]
> should also be updated with the linear address from NVDIMM pages (no
> longer an LBA composed of the sector offset and bucket index).
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
> ---
>  drivers/md/bcache/journal.c   | 119 ++++++++++++++++++++++++----------
>  drivers/md/bcache/nvm-pages.h |   1 +
>  drivers/md/bcache/super.c     |  28 +++++++-
>  3 files changed, 110 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 32599d2ff5d2..03ecedf813b0 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -596,6 +596,8 @@ static void do_journal_discard(struct cache *ca)
>  		return;
>  	}
>  
> +	BUG_ON(bch_has_feature_nvdimm_meta(&ca->sb));
> +
>  	switch (atomic_read(&ja->discard_in_flight)) {
>  	case DISCARD_IN_FLIGHT:
>  		return;
> @@ -661,9 +663,13 @@ static void journal_reclaim(struct cache_set *c)
>  		goto out;
>  
>  	ja->cur_idx = next;
> -	k->ptr[0] = MAKE_PTR(0,
> -			     bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
> -			     ca->sb.nr_this_dev);
> +	if (!bch_has_feature_nvdimm_meta(&ca->sb))
> +		k->ptr[0] = MAKE_PTR(0,
> +			bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
> +			ca->sb.nr_this_dev);
> +	else
> +		k->ptr[0] = ca->sb.d[ja->cur_idx];
> +
>  	atomic_long_inc(&c->reclaimed_journal_buckets);
>  
>  	bkey_init(k);
> @@ -729,46 +735,21 @@ static void journal_write_unlock(struct closure *cl)
>  	spin_unlock(&c->journal.lock);
>  }
>  
> -static void journal_write_unlocked(struct closure *cl)
> +
> +static void __journal_write_unlocked(struct cache_set *c)
>  	__releases(c->journal.lock)
>  {
> -	struct cache_set *c = container_of(cl, struct cache_set, journal.io);
> -	struct cache *ca = c->cache;
> -	struct journal_write *w = c->journal.cur;
>  	struct bkey *k = &c->journal.key;
> -	unsigned int i, sectors = set_blocks(w->data, block_bytes(ca)) *
> -		ca->sb.block_size;
> -
> +	struct journal_write *w = c->journal.cur;
> +	struct closure *cl = &c->journal.io;
> +	struct cache *ca = c->cache;
>  	struct bio *bio;
>  	struct bio_list list;
> +	unsigned int i, sectors = set_blocks(w->data, block_bytes(ca)) *
> +		ca->sb.block_size;
>  
>  	bio_list_init(&list);
>  
> -	if (!w->need_write) {
> -		closure_return_with_destructor(cl, journal_write_unlock);
> -		return;
> -	} else if (journal_full(&c->journal)) {
> -		journal_reclaim(c);
> -		spin_unlock(&c->journal.lock);
> -
> -		btree_flush_write(c);
> -		continue_at(cl, journal_write, bch_journal_wq);
> -		return;
> -	}
> -
> -	c->journal.blocks_free -= set_blocks(w->data, block_bytes(ca));
> -
> -	w->data->btree_level = c->root->level;
> -
> -	bkey_copy(&w->data->btree_root, &c->root->key);
> -	bkey_copy(&w->data->uuid_bucket, &c->uuid_bucket);
> -
> -	w->data->prio_bucket[ca->sb.nr_this_dev] = ca->prio_buckets[0];
> -	w->data->magic		= jset_magic(&ca->sb);
> -	w->data->version	= BCACHE_JSET_VERSION;
> -	w->data->last_seq	= last_seq(&c->journal);
> -	w->data->csum		= csum_set(w->data);
> -
>  	for (i = 0; i < KEY_PTRS(k); i++) {
>  		ca = c->cache;
>  		bio = &ca->journal.bio;
> @@ -793,7 +774,6 @@ static void journal_write_unlocked(struct closure *cl)
>  
>  		ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
>  	}
> -
>  	/* If KEY_PTRS(k) == 0, this jset gets lost in air */
>  	BUG_ON(i == 0);
>  
> @@ -805,6 +785,73 @@ static void journal_write_unlocked(struct closure *cl)
>  
>  	while ((bio = bio_list_pop(&list)))
>  		closure_bio_submit(c, bio, cl);
> +}
> +
> +#if defined(CONFIG_BCACHE_NVM_PAGES)
> +
> +static void __journal_nvdimm_write_unlocked(struct cache_set *c)
> +	__releases(c->journal.lock)
> +{
> +	struct journal_write *w = c->journal.cur;
> +	struct cache *ca = c->cache;
> +	unsigned int sectors;
> +
> +	sectors = set_blocks(w->data, block_bytes(ca)) * ca->sb.block_size;
> +	atomic_long_add(sectors, &ca->meta_sectors_written);
> +
> +	memcpy_flushcache((void *)c->journal.key.ptr[0], w->data, sectors << 9);
> +
> +	c->journal.key.ptr[0] += sectors << 9;
> +	ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
> +
> +	atomic_dec_bug(&fifo_back(&c->journal.pin));
> +	bch_journal_next(&c->journal);
> +	journal_reclaim(c);
> +
> +	spin_unlock(&c->journal.lock);
> +}
> +
> +#else /* CONFIG_BCACHE_NVM_PAGES */
> +
> +static void __journal_nvdimm_write_unlocked(struct cache_set *c) { }
> +
> +#endif /* CONFIG_BCACHE_NVM_PAGES */
> +
> +static void journal_write_unlocked(struct closure *cl)
> +{
> +	struct cache_set *c = container_of(cl, struct cache_set, journal.io);
> +	struct cache *ca = c->cache;
> +	struct journal_write *w = c->journal.cur;
> +
> +	if (!w->need_write) {
> +		closure_return_with_destructor(cl, journal_write_unlock);
> +		return;
> +	} else if (journal_full(&c->journal)) {
> +		journal_reclaim(c);
> +		spin_unlock(&c->journal.lock);
> +
> +		btree_flush_write(c);
> +		continue_at(cl, journal_write, bch_journal_wq);
> +		return;
> +	}
> +
> +	c->journal.blocks_free -= set_blocks(w->data, block_bytes(ca));
> +
> +	w->data->btree_level = c->root->level;
> +
> +	bkey_copy(&w->data->btree_root, &c->root->key);
> +	bkey_copy(&w->data->uuid_bucket, &c->uuid_bucket);
> +
> +	w->data->prio_bucket[ca->sb.nr_this_dev] = ca->prio_buckets[0];
> +	w->data->magic		= jset_magic(&ca->sb);
> +	w->data->version	= BCACHE_JSET_VERSION;
> +	w->data->last_seq	= last_seq(&c->journal);
> +	w->data->csum		= csum_set(w->data);
> +
> +	if (!bch_has_feature_nvdimm_meta(&ca->sb))
> +		__journal_write_unlocked(c);
> +	else
> +		__journal_nvdimm_write_unlocked(c);
>  
>  	continue_at(cl, journal_write_done, NULL);
>  }
> diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
> index c763bf2e2721..736a661777b7 100644
> --- a/drivers/md/bcache/nvm-pages.h
> +++ b/drivers/md/bcache/nvm-pages.h
> @@ -5,6 +5,7 @@
>  
>  #if defined(CONFIG_BCACHE_NVM_PAGES)
>  #include <linux/bcache-nvm.h>
> +#include <linux/libnvdimm.h>
>  #endif /* CONFIG_BCACHE_NVM_PAGES */
>  
>  /*
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index cce0f6bf0944..4d6666d03aa7 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1686,7 +1686,32 @@ void bch_cache_set_release(struct kobject *kobj)
>  static void cache_set_free(struct closure *cl)
>  {
>  	struct cache_set *c = container_of(cl, struct cache_set, cl);
> -	struct cache *ca;
> +	struct cache *ca = c->cache;
> +
> +#if defined(CONFIG_BCACHE_NVM_PAGES)
> +	/* Flush cache if journal stored in NVDIMM */
> +	if (ca && bch_has_feature_nvdimm_meta(&ca->sb)) {
> +		unsigned long bucket_size = ca->sb.bucket_size;
> +		int i;
> +
> +		for (i = 0; i < ca->sb.keys; i++) {
> +			unsigned long offset = 0;
> +			unsigned int len = round_down(UINT_MAX, 2);
> +
> +			if ((void *)ca->sb.d[i] == NULL)
> +				continue;
> +
> +			while (bucket_size > 0) {
> +				if (len > bucket_size)
> +					len = bucket_size;
> +				arch_invalidate_pmem(
> +					(void *)(ca->sb.d[i] + offset), len);
> +				offset += len;
> +				bucket_size -= len;
> +			}
> +		}
> +	}
> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>  
>  	debugfs_remove(c->debug);
>  
> @@ -1698,7 +1723,6 @@ static void cache_set_free(struct closure *cl)
>  	bch_bset_sort_state_free(&c->sort);
>  	free_pages((unsigned long) c->uuids, ilog2(meta_bucket_pages(&c->cache->sb)));
>  
> -	ca = c->cache;
>  	if (ca) {
>  		ca->set = NULL;
>  		c->cache = NULL;
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] bcache: read jset from NVDIMM pages for journal replay
  2021-06-15  5:49 ` [PATCH 13/14] bcache: read jset from NVDIMM pages for journal replay Coly Li
@ 2021-06-22 11:04   ` Hannes Reinecke
  2021-06-23  6:21     ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22 11:04 UTC (permalink / raw)
  To: Coly Li, axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/15/21 7:49 AM, Coly Li wrote:
> This patch implements two methods to read a jset from media for journal
> replay:
> - __jnl_rd_bkt() for block device
>   This is the legacy method to read a jset via the block device
>   interface.
> - __jnl_rd_nvm_bkt() for NVDIMM
>   This is the method to read a jset via the NVDIMM memory interface,
>   i.e. memcpy() from NVDIMM pages to DRAM pages.
> 
> If BCH_FEATURE_INCOMPAT_NVDIMM_META is set in the incompat feature set,
> then while the cache set is running, journal_read_bucket() will read the
> journal content from NVDIMM by __jnl_rd_nvm_bkt(). The linear addresses
> of the NVDIMM pages to read the jset from are stored in
> sb.d[SB_JOURNAL_BUCKETS], which were initialized and maintained in
> previous runs of the cache set.
> 
> Note that when bch_journal_read() is called, the linear addresses of the
> NVDIMM pages are not loaded and initialized yet, so it is necessary to
> call __bch_journal_nvdimm_init() before reading the jset from NVDIMM
> pages.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
> ---
>  drivers/md/bcache/journal.c | 93 +++++++++++++++++++++++++++----------
>  1 file changed, 69 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 03ecedf813b0..23e5ccf125df 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -34,60 +34,96 @@ static void journal_read_endio(struct bio *bio)
>  	closure_put(cl);
>  }
>  
> +static struct jset *__jnl_rd_bkt(struct cache *ca, unsigned int bkt_idx,
> +				    unsigned int len, unsigned int offset,
> +				    struct closure *cl)
> +{
> +	sector_t bucket = bucket_to_sector(ca->set, ca->sb.d[bkt_idx]);
> +	struct bio *bio = &ca->journal.bio;
> +	struct jset *data = ca->set->journal.w[0].data;
> +
> +	bio_reset(bio);
> +	bio->bi_iter.bi_sector	= bucket + offset;
> +	bio_set_dev(bio, ca->bdev);
> +	bio->bi_iter.bi_size	= len << 9;
> +	bio->bi_end_io	= journal_read_endio;
> +	bio->bi_private = cl;
> +	bio_set_op_attrs(bio, REQ_OP_READ, 0);
> +	bch_bio_map(bio, data);
> +
> +	closure_bio_submit(ca->set, bio, cl);
> +	closure_sync(cl);
> +
> +	/* Indeed journal.w[0].data */
> +	return data;
> +}
> +
> +#if defined(CONFIG_BCACHE_NVM_PAGES)
> +
> +static struct jset *__jnl_rd_nvm_bkt(struct cache *ca, unsigned int bkt_idx,
> +				     unsigned int len, unsigned int offset)
> +{
> +	void *jset_addr = (void *)ca->sb.d[bkt_idx] + (offset << 9);
> +	struct jset *data = ca->set->journal.w[0].data;
> +
> +	memcpy(data, jset_addr, len << 9);
> +
> +	/* Indeed journal.w[0].data */
> +	return data;
> +}
> +
> +#else /* CONFIG_BCACHE_NVM_PAGES */
> +
> +static struct jset *__jnl_rd_nvm_bkt(struct cache *ca, unsigned int bkt_idx,
> +				     unsigned int len, unsigned int offset)
> +{
> +	return NULL;
> +}
> +
> +#endif /* CONFIG_BCACHE_NVM_PAGES */
> +
>  static int journal_read_bucket(struct cache *ca, struct list_head *list,
> -			       unsigned int bucket_index)
> +			       unsigned int bucket_idx)

This renaming is pointless.

>  {
>  	struct journal_device *ja = &ca->journal;
> -	struct bio *bio = &ja->bio;
>  
>  	struct journal_replay *i;
> -	struct jset *j, *data = ca->set->journal.w[0].data;
> +	struct jset *j;
>  	struct closure cl;
>  	unsigned int len, left, offset = 0;
>  	int ret = 0;
> -	sector_t bucket = bucket_to_sector(ca->set, ca->sb.d[bucket_index]);
>  
>  	closure_init_stack(&cl);
>  
> -	pr_debug("reading %u\n", bucket_index);
> +	pr_debug("reading %u\n", bucket_idx);
>  
>  	while (offset < ca->sb.bucket_size) {
>  reread:		left = ca->sb.bucket_size - offset;
>  		len = min_t(unsigned int, left, PAGE_SECTORS << JSET_BITS);
>  
> -		bio_reset(bio);
> -		bio->bi_iter.bi_sector	= bucket + offset;
> -		bio_set_dev(bio, ca->bdev);
> -		bio->bi_iter.bi_size	= len << 9;
> -
> -		bio->bi_end_io	= journal_read_endio;
> -		bio->bi_private = &cl;
> -		bio_set_op_attrs(bio, REQ_OP_READ, 0);
> -		bch_bio_map(bio, data);
> -
> -		closure_bio_submit(ca->set, bio, &cl);
> -		closure_sync(&cl);
> +		if (!bch_has_feature_nvdimm_meta(&ca->sb))
> +			j = __jnl_rd_bkt(ca, bucket_idx, len, offset, &cl);
> +		else
> +			j = __jnl_rd_nvm_bkt(ca, bucket_idx, len, offset);
>  
>  		/* This function could be simpler now since we no longer write
>  		 * journal entries that overlap bucket boundaries; this means
>  		 * the start of a bucket will always have a valid journal entry
>  		 * if it has any journal entries at all.
>  		 */
> -
> -		j = data;
>  		while (len) {
>  			struct list_head *where;
>  			size_t blocks, bytes = set_bytes(j);
>  
>  			if (j->magic != jset_magic(&ca->sb)) {
> -				pr_debug("%u: bad magic\n", bucket_index);
> +				pr_debug("%u: bad magic\n", bucket_idx);
>  				return ret;
>  			}
>  
>  			if (bytes > left << 9 ||
>  			    bytes > PAGE_SIZE << JSET_BITS) {
>  				pr_info("%u: too big, %zu bytes, offset %u\n",
> -					bucket_index, bytes, offset);
> +					bucket_idx, bytes, offset);
>  				return ret;
>  			}
>  
> @@ -96,7 +132,7 @@ reread:		left = ca->sb.bucket_size - offset;
>  
>  			if (j->csum != csum_set(j)) {
>  				pr_info("%u: bad csum, %zu bytes, offset %u\n",
> -					bucket_index, bytes, offset);
> +					bucket_idx, bytes, offset);
>  				return ret;
>  			}
>  
> @@ -158,8 +194,8 @@ reread:		left = ca->sb.bucket_size - offset;
>  			list_add(&i->list, where);
>  			ret = 1;
>  
> -			if (j->seq > ja->seq[bucket_index])
> -				ja->seq[bucket_index] = j->seq;
> +			if (j->seq > ja->seq[bucket_idx])
> +				ja->seq[bucket_idx] = j->seq;
>  next_set:
>  			offset	+= blocks * ca->sb.block_size;
>  			len	-= blocks * ca->sb.block_size;
> @@ -170,6 +206,8 @@ reread:		left = ca->sb.bucket_size - offset;
>  	return ret;
>  }
>  
> +static int __bch_journal_nvdimm_init(struct cache *ca);
> +
>  int bch_journal_read(struct cache_set *c, struct list_head *list)
>  {
>  #define read_bucket(b)							\
> @@ -188,6 +226,13 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
>  	unsigned int i, l, r, m;
>  	uint64_t seq;
>  
> +	/*
> +	 * Linear addresses of NVDIMM pages for journaling is not
> +	 * initialized yet, do it before read jset from NVDIMM pages.
> +	 */
> +	if (bch_has_feature_nvdimm_meta(&ca->sb))
> +		__bch_journal_nvdimm_init(ca);
> +
>  	bitmap_zero(bitmap, SB_JOURNAL_BUCKETS);
>  	pr_debug("%u journal buckets\n", ca->sb.njournal_buckets);
>  
> 
Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 14/14] bcache: add sysfs interface register_nvdimm_meta to register NVDIMM meta device
  2021-06-15  5:49 ` [PATCH 14/14] bcache: add sysfs interface register_nvdimm_meta to register NVDIMM meta device Coly Li
@ 2021-06-22 11:04   ` Hannes Reinecke
  0 siblings, 0 replies; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-22 11:04 UTC (permalink / raw)
  To: Coly Li, axboe; +Cc: linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/15/21 7:49 AM, Coly Li wrote:
> This patch adds a sysfs interface register_nvdimm_meta to register the
> NVDIMM meta device. The sysfs interface file only shows up when
> CONFIG_BCACHE_NVM_PAGES=y. Then an NVDIMM namespace formatted by
> bcache-tools can be registered with bcache by, e.g.,
>   echo /dev/pmem0 > /sys/fs/bcache/register_nvdimm_meta
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
> ---
>  drivers/md/bcache/super.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 4d6666d03aa7..9d506d053548 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -2439,10 +2439,18 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
>  					 struct kobj_attribute *attr,
>  					 const char *buffer, size_t size);
> +#if defined(CONFIG_BCACHE_NVM_PAGES)
> +static ssize_t register_nvdimm_meta(struct kobject *k,
> +				    struct kobj_attribute *attr,
> +				    const char *buffer, size_t size);
> +#endif
>  
>  kobj_attribute_write(register,		register_bcache);
>  kobj_attribute_write(register_quiet,	register_bcache);
>  kobj_attribute_write(pendings_cleanup,	bch_pending_bdevs_cleanup);
> +#if defined(CONFIG_BCACHE_NVM_PAGES)
> +kobj_attribute_write(register_nvdimm_meta, register_nvdimm_meta);
> +#endif
>  
>  static bool bch_is_open_backing(dev_t dev)
>  {
> @@ -2556,6 +2564,24 @@ static void register_device_async(struct async_reg_args *args)
>  	queue_delayed_work(system_wq, &args->reg_work, 10);
>  }
>  
> +#if defined(CONFIG_BCACHE_NVM_PAGES)
> +static ssize_t register_nvdimm_meta(struct kobject *k, struct kobj_attribute *attr,
> +				    const char *buffer, size_t size)
> +{
> +	ssize_t ret = size;
> +
> +	struct bch_nvm_namespace *ns = bch_register_namespace(buffer);
> +
> +	if (IS_ERR(ns)) {
> +		pr_err("register nvdimm namespace %s for meta device failed.\n",
> +			buffer);
> +		ret = -EINVAL;
> +	}
> +
> +	return ret;
> +}
> +#endif
> +
>  static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>  			       const char *buffer, size_t size)
>  {
> @@ -2898,6 +2924,9 @@ static int __init bcache_init(void)
>  	static const struct attribute *files[] = {
>  		&ksysfs_register.attr,
>  		&ksysfs_register_quiet.attr,
> +#if defined(CONFIG_BCACHE_NVM_PAGES)
> +		&ksysfs_register_nvdimm_meta.attr,
> +#endif
>  		&ksysfs_pendings_cleanup.attr,
>  		NULL
>  	};
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages)
  2021-06-22  8:41     ` Huang, Ying
@ 2021-06-23  4:32       ` Coly Li
  2021-06-23  6:53         ` Huang, Ying
  0 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-23  4:32 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Dan Williams, Jan Kara, Hannes Reinecke, Christoph Hellwig,
	linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, axboe

Hi Ying,

I am replying to your comment in-place, right where you commented.

On 6/22/21 4:41 PM, Huang, Ying wrote:
> Coly Li <colyli@suse.de> writes:
>
>> Hi all my dear receivers (Christoph, Dan, Hannes, Jan and Ying),
>>
>> I do need your help with code review for the following patch. This
>> series has been posted and refined for 2 merge windows already, but we
>> lack code review from more experienced kernel developers like you all.
>>
>> The following patch defines a set of on-NVDIMM memory objects,
>> which are used to support NVDIMM for bcache journalling. Currently
>> the testing hardware is Intel AEP (Apache Pass).
>>
>> Qiaowei Ren and Jianpeng Ma work with me to compose a mini page
>> allocator for NVDIMM pages; we allocate non-volatile memory pages from
>> NVDIMM to store the bcache journal set. The journaling can then be very
>> fast, and after a system reboot, once the NVDIMM mapping is done, the
>> bcache code can directly reference the journal set memory objects
>> without loading them via the block layer interface.
>>
>> In order to restore allocated non-volatile memory, we use a set of
>> lists (named owner lists) to track all allocated non-volatile memory
>> pages, identified by UUID. Just like the bcache journal set, the lists
>> are stored in NVDIMM and accessed directly as typical in-memory lists;
>> the only difference is that they are non-volatile: we access the lists
>> directly from NVDIMM and update them in place.
>>
>> This is why you can see pointers defined in struct
>> bch_nvm_pgalloc_recs: such an object is referenced directly as a memory
>> object, and stored directly on NVDIMM.
>>
>> The current patch series works as expected with limited data sets on
>> both bcache-tools and the patched kernel. Because the bcache btree
>> nodes are not stored on NVDIMM yet, journaling for leaf node splitting
>> will be handled in a later series.
>>
>> The whole work of supporting NVDIMM for bcache will involve:
>> - Storing the bcache journal on NVDIMM
>> - Storing bcache btree nodes on NVDIMM
>> - Storing cached data on NVDIMM
>> - On-NVDIMM object consistency across power failure
>>
>> In order to make the code review easier, we submit storing the journal
>> on NVDIMM upstream first; the following work will be submitted step by
>> step.
>>
>> Jens wants a wider review before taking this series into bcache
>> upstream, and you are all experts I trust and respect.
>>
>> I ask all of you for help with code review, especially for the
>> following data structure definition patch, because I define pointers
>> in memory structures and reference and store them directly on the
>> NVDIMM.
>>
>> Thanks in advance for your help.
>>
>> Coly Li
>>
>>
>>
>>
>> On 6/15/21 1:49 PM, Coly Li wrote:
>>> This patch initializes the prototype data structures for nvm pages
>>> allocator,
>>>
>>> - struct bch_nvm_pages_sb
>>> This is the super block allocated on each nvdimm namespace. An nvdimm
>>> set may have multiple namespaces; bch_nvm_pages_sb->set_uuid is used
>>> to mark which nvdimm set this namespace belongs to. Normally we will
>>> use the bcache cache set UUID to initialize this uuid, to connect this
>>> nvdimm set to a specified bcache cache set.
>>>
>>> - struct bch_owner_list_head
>>> This is a table of the heads of all owner lists. An owner list records
>>> which page(s) are allocated to which owner. After a reboot from power
>>> failure, the owner may find all its requested and allocated pages from
>>> the owner list via a handle which is resolved from a UUID.
>>>
>>> - struct bch_nvm_pages_owner_head
>>> This is the head of an owner list. Each owner only has one owner list,
>>> and an nvm page only belongs to one specific owner. uuid[] will be set
>>> to the owner's uuid; for bcache it is the bcache cache set uuid. label
>>> is not mandatory; it is a human-readable string for debug purposes. The
>>> pointer *recs references a separate nvm page which holds the table of
>>> struct bch_nvm_pgalloc_rec entries.
>>>
>>> - struct bch_nvm_pgalloc_recs
>>> This struct occupies a whole page; owner_uuid should match the uuid
>>> in struct bch_nvm_pages_owner_head. recs[] is the real table containing
>>> all allocation records.
>>>
>>> - struct bch_nvm_pgalloc_rec
>>> Each structure records a range of allocated nvm pages.
>>>   - Bits  0 - 51: page offset of the allocated pages.
>>>   - Bits 52 - 57: allocated size, as a power-of-2 order of page_size
>>>   - Bits 58 - 63: reserved.
>>> Since each allocated nvm page range is a power of 2 in size, using 6
>>> bits to represent the allocated size gives a maximum value of
>>> (1 << ((1 << 6) - 1)) * PAGE_SIZE. That is a 76-bit wide range size in
>>> bytes for a 4KB page size, which is large enough currently.
>>>
>>> Signed-off-by: Coly Li <colyli@suse.de>
>>> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
>>> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
>>> ---
>>>  include/uapi/linux/bcache-nvm.h | 200 ++++++++++++++++++++++++++++++++
>>>  1 file changed, 200 insertions(+)
>>>  create mode 100644 include/uapi/linux/bcache-nvm.h
>>>
>>> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
>>> new file mode 100644
>>> index 000000000000..5094a6797679
>>> --- /dev/null
>>> +++ b/include/uapi/linux/bcache-nvm.h
>>> @@ -0,0 +1,200 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +
>>> +#ifndef _UAPI_BCACHE_NVM_H
>>> +#define _UAPI_BCACHE_NVM_H
>>> +
>>> +#if (__BITS_PER_LONG == 64)
>>> +/*
>>> + * Bcache on NVDIMM data structures
>>> + */
>>> +
>>> +/*
>>> + * - struct bch_nvm_pages_sb
>>> + *   This is the super block allocated on each nvdimm namespace. A nvdimm
>>> + * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
>>> + * which nvdimm set this name space belongs to. Normally we will use the
>>> + * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
>>> + * set to a specified bcache cache set.
>>> + *
>>> + * - struct bch_owner_list_head
>>> + *   This is a table for all heads of all owner lists. A owner list records
>>> + * which page(s) allocated to which owner. After reboot from power failure,
>>> + * the ownwer may find all its requested and allocated pages from the owner
>>> + * list by a handler which is converted by a UUID.
>>> + *
>>> + * - struct bch_nvm_pages_owner_head
>>> + *   This is a head of an owner list. Each owner only has one owner list,
>>> + * and a nvm page only belongs to an specific owner. uuid[] will be set to
>>> + * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
>>> + * mandatory, it is a human-readable string for debug purpose. The pointer
>>> + * recs references to separated nvm page which hold the table of struct
>>> + * bch_pgalloc_rec.
>>> + *
>>> + *- struct bch_nvm_pgalloc_recs
>>> + *  This structure occupies a whole page, owner_uuid should match the uuid
>>> + * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
>>> + * allocated records.
>>> + *
>>> + * - struct bch_pgalloc_rec
>>> + *   Each structure records a range of allocated nvm pages. pgoff is offset
>>> + * in unit of page size of this allocated nvm page range. The adjoint page
>>> + * ranges of same owner can be merged into a larger one, therefore pages_nr
>>> + * is NOT always power of 2.
>>> + *
>>> + *
>>> + * Memory layout on nvdimm namespace 0
>>> + *
>>> + *    0 +---------------------------------+
>>> + *      |                                 |
>>> + *  4KB +---------------------------------+
>>> + *      |         bch_nvm_pages_sb        |
>>> + *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
>>> + *      |       bch_owner_list_head       |
>>> + *      |                                 |
>>> + * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
>>> + *      |       bch_nvm_pgalloc_recs      |
>>> + *      |  (nvm pages internal usage)     |
>>> + * 24KB +---------------------------------+
>>> + *      |                                 |
>>> + *      |                                 |
>>> + * 16MB  +---------------------------------+
>>> + *      |      allocable nvm pages        |
>>> + *      |      for buddy allocator        |
>>> + * end  +---------------------------------+
>>> + *
>>> + *
>>> + *
>>> + * Memory layout on nvdimm namespace N
>>> + * (doesn't have owner list)
>>> + *
>>> + *    0 +---------------------------------+
>>> + *      |                                 |
>>> + *  4KB +---------------------------------+
>>> + *      |         bch_nvm_pages_sb        |
>>> + *  8KB +---------------------------------+
>>> + *      |                                 |
>>> + *      |                                 |
>>> + *      |                                 |
>>> + *      |                                 |
>>> + *      |                                 |
>>> + *      |                                 |
>>> + * 16MB  +---------------------------------+
>>> + *      |      allocable nvm pages        |
>>> + *      |      for buddy allocator        |
>>> + * end  +---------------------------------+
>>> + *
>>> + */
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +/* In sectors */
>>> +#define BCH_NVM_PAGES_SB_OFFSET			4096
>>> +#define BCH_NVM_PAGES_OFFSET			(16 << 20)
>>> +
>>> +#define BCH_NVM_PAGES_LABEL_SIZE		32
>>> +#define BCH_NVM_PAGES_NAMESPACES_MAX		8
>>> +
>>> +#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
>>> +#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
>>> +
>>> +#define BCH_NVM_PAGES_SB_VERSION		0
>>> +#define BCH_NVM_PAGES_SB_VERSION_MAX		0
>>> +
>>> +static const unsigned char bch_nvm_pages_magic[] = {
>>> +	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
>>> +	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
>>> +static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
>>> +	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
>>> +	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
>>> +
>>> +/* takes 64bit width */
>>> +struct bch_pgalloc_rec {
>>> +	__u64	pgoff:52;
>>> +	__u64	order:6;
>>> +	__u64	reserved:6;
>>> +};
>>> +
>>> +struct bch_nvm_pgalloc_recs {
>>> +union {
>>> +	struct {
>>> +		struct bch_nvm_pages_owner_head	*owner;
>>> +		struct bch_nvm_pgalloc_recs	*next;
> I have concerns about using pointers directly in the on-NVDIMM data
> structures too.  How can you guarantee the NVDIMM devices will be mapped
> to the exact same virtual address across reboots?
>
> Best Regards,
> Huang, Ying


We use the NVDIMM namespace as memory, and from our testing and
observation, the DAX mapping base address is consistent if the NVDIMM
address from the e820 table does not change.

And from our testing and observation, the NVDIMM address from the e820
table does not change when:
- the NVDIMM and DRAM memory population does not change
- more NVDIMM and/or DRAM is installed on top of the existing population
- the NVDIMM is always plugged into the same slot, with no movement or swap
- no CPU is removed or changed

For 99.9%+ of the time, when the hardware is working healthily, the above
conditions can be assumed. Therefore we choose to store the whole linear
address (pointer) here, rather than a relative offset inside the NVDIMM
namespace.

For the 0.0?% case where the NVDIMM address from the e820 table changes:
because the DAX mapping address from the last run is stored in ns_start
of struct bch_nvm_pages_sb, if the new DAX mapping address is different
from the ns_start value, all pointers in the owner list can be updated by,
    new_addr = (old_addr - old_ns_start) + new_ns_start

The update can be very fast (and it can be made power-failure tolerant
with careful coding). Therefore we decide to store the full linear
address for direct memory access in the 99%+ case, and to update the
pointers in the 0.0?% case when the DAX mapping address of the NVDIMM
changes.
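
To make that rebase step concrete, here is a minimal sketch (the helper
name and calling context are hypothetical, not part of the posted series):

    /*
     * Sketch only: rebase one pointer stored on NVDIMM after the DAX
     * mapping base moved across a reboot. old_ns_start is the ns_start
     * value recorded in struct bch_nvm_pages_sb by the previous run,
     * new_ns_start is the base address of the fresh DAX mapping.
     */
    static inline void *bch_nvm_rebase_ptr(void *old_addr,
                                           unsigned long old_ns_start,
                                           unsigned long new_ns_start)
    {
            if (!old_addr)
                    return NULL;
            return (void *)((unsigned long)old_addr -
                            old_ns_start + new_ns_start);
    }

Walking the owner lists and applying this to every stored pointer (then
flushing the updated pages) is the fast update mentioned above.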

Handling a DAX mapping address change is not a high priority task right
now; our next tasks after this series is merged will be power failure
tolerance of the owner list (from the Intel developers) and storing
bcache btree nodes on NVDIMM pages (from me).

Thanks for your comments and review.

Coly Li


>>> +		unsigned char			magic[16];
>>> +		unsigned char			owner_uuid[16];
>>> +		unsigned int			size;
>>> +		unsigned int			used;
>>> +		unsigned long			_pad[4];
>>> +		struct bch_pgalloc_rec		recs[];
>>> +	};
>>> +	unsigned char				pad[8192];
>>> +};
>>> +};
>>> +
>>> +#define BCH_MAX_RECS					\
>>> +	((sizeof(struct bch_nvm_pgalloc_recs) -		\
>>> +	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
>>> +	 sizeof(struct bch_pgalloc_rec))
>>> +
>>> +struct bch_nvm_pages_owner_head {
>>> +	unsigned char			uuid[16];
>>> +	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
>>> +	/* Per-namespace own lists */
>>> +	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
>>> +};
>>> +
>>> +/* heads[0] is always for nvm_pages internal usage */
>>> +struct bch_owner_list_head {
>>> +union {
>>> +	struct {
>>> +		unsigned int			size;
>>> +		unsigned int			used;
>>> +		unsigned long			_pad[4];
>>> +		struct bch_nvm_pages_owner_head	heads[];
>>> +	};
>>> +	unsigned char				pad[8192];
>>> +};
>>> +};
>>> +#define BCH_MAX_OWNER_LIST				\
>>> +	((sizeof(struct bch_owner_list_head) -		\
>>> +	 offsetof(struct bch_owner_list_head, heads)) /	\
>>> +	 sizeof(struct bch_nvm_pages_owner_head))
>>> +
>>> +/* The on-media bit order is local CPU order */
>>> +struct bch_nvm_pages_sb {
>>> +	unsigned long				csum;
>>> +	unsigned long				ns_start;
>>> +	unsigned long				sb_offset;
>>> +	unsigned long				version;
>>> +	unsigned char				magic[16];
>>> +	unsigned char				uuid[16];
>>> +	unsigned int				page_size;
>>> +	unsigned int				total_namespaces_nr;
>>> +	unsigned int				this_namespace_nr;
>>> +	union {
>>> +		unsigned char			set_uuid[16];
>>> +		unsigned long			set_magic;
>>> +	};
>>> +
>>> +	unsigned long				flags;
>>> +	unsigned long				seq;
>>> +
>>> +	unsigned long				feature_compat;
>>> +	unsigned long				feature_incompat;
>>> +	unsigned long				feature_ro_compat;
>>> +
>>> +	/* For allocable nvm pages from buddy systems */
>>> +	unsigned long				pages_offset;
>>> +	unsigned long				pages_total;
>>> +
>>> +	unsigned long				pad[8];
>>> +
>>> +	/* Only on the first name space */
>>> +	struct bch_owner_list_head		*owner_list_head;
>>> +
>>> +	/* Just for csum_set() */
>>> +	unsigned int				keys;
>>> +	unsigned long				d[0];
>>> +};
>>> +#endif /* __BITS_PER_LONG == 64 */
>>> +
>>> +#endif /* _UAPI_BCACHE_NVM_H */


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/14] bcache: initialize the nvm pages allocator
  2021-06-22 10:39   ` Hannes Reinecke
@ 2021-06-23  5:26     ` Coly Li
  2021-06-23  9:16       ` Hannes Reinecke
  0 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-23  5:26 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Randy Dunlap, Qiaowei Ren

On 6/22/21 6:39 PM, Hannes Reinecke wrote:
> On 6/15/21 7:49 AM, Coly Li wrote:
>> From: Jianpeng Ma <jianpeng.ma@intel.com>
>>
>> This patch define the prototype data structures in memory and
>> initializes the nvm pages allocator.
>>
>> The nvm address space which is managed by this allocator can consist of
>> many nvm namespaces, and some namespaces can compose into one nvm set,
>> like cache set. For this initial implementation, only one set can be
>> supported.
>>
>> The users of this nvm pages allocator need to call register_namespace()
>> to register the nvdimm device (like /dev/pmemX) into this allocator as
>> the instance of struct nvm_namespace.
>>
>> Reported-by: Randy Dunlap <rdunlap@infradead.org>
>> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
>> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
>> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>  drivers/md/bcache/Kconfig     |  10 ++
>>  drivers/md/bcache/Makefile    |   1 +
>>  drivers/md/bcache/nvm-pages.c | 295 ++++++++++++++++++++++++++++++++++
>>  drivers/md/bcache/nvm-pages.h |  74 +++++++++
>>  drivers/md/bcache/super.c     |   3 +
>>  5 files changed, 383 insertions(+)
>>  create mode 100644 drivers/md/bcache/nvm-pages.c
>>  create mode 100644 drivers/md/bcache/nvm-pages.h
>>
>> diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
>> index d1ca4d059c20..a69f6c0e0507 100644
>> --- a/drivers/md/bcache/Kconfig
>> +++ b/drivers/md/bcache/Kconfig
>> @@ -35,3 +35,13 @@ config BCACHE_ASYNC_REGISTRATION
>>  	device path into this file will returns immediately and the real
>>  	registration work is handled in kernel work queue in asynchronous
>>  	way.
>> +
>> +config BCACHE_NVM_PAGES
>> +	bool "NVDIMM support for bcache (EXPERIMENTAL)"
>> +	depends on BCACHE
>> +	depends on 64BIT
>> +	depends on LIBNVDIMM
>> +	depends on DAX
>> +	help
>> +	  Allocate/release NV-memory pages for bcache and provide allocated pages
>> +	  for each requestor after system reboot.
>> diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
>> index 5b87e59676b8..2397bb7c7ffd 100644
>> --- a/drivers/md/bcache/Makefile
>> +++ b/drivers/md/bcache/Makefile
>> @@ -5,3 +5,4 @@ obj-$(CONFIG_BCACHE)	+= bcache.o
>>  bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
>>  	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
>>  	util.o writeback.o features.o
>> +bcache-$(CONFIG_BCACHE_NVM_PAGES) += nvm-pages.o
>> diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
>> new file mode 100644
>> index 000000000000..18fdadbc502f
>> --- /dev/null
>> +++ b/drivers/md/bcache/nvm-pages.c
>> @@ -0,0 +1,295 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Nvdimm page-buddy allocator
>> + *
>> + * Copyright (c) 2021, Intel Corporation.
>> + * Copyright (c) 2021, Qiaowei Ren <qiaowei.ren@intel.com>.
>> + * Copyright (c) 2021, Jianpeng Ma <jianpeng.ma@intel.com>.
>> + */
>> +
>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>> +
> No need for this 'if' statement as it'll be excluded by the Makefile
> anyway if the config option isn't set.

Such an 'if' is necessary because stub routines are defined when
CONFIG_BCACHE_NVM_PAGES is not defined, e.g.

+#else
+
+static inline struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
+{
+       return NULL;
+}
+static inline int bch_nvm_init(void)
+{
+       return 0;
+}
+static inline void bch_nvm_exit(void) { }
+
+#endif /* CONFIG_BCACHE_NVM_PAGES */

>> +#include "bcache.h"
>> +#include "nvm-pages.h"
>> +
>> +#include <linux/slab.h>
>> +#include <linux/list.h>
>> +#include <linux/mutex.h>
>> +#include <linux/dax.h>
>> +#include <linux/pfn_t.h>
>> +#include <linux/libnvdimm.h>
>> +#include <linux/mm_types.h>
>> +#include <linux/err.h>
>> +#include <linux/pagemap.h>
>> +#include <linux/bitmap.h>
>> +#include <linux/blkdev.h>
>> +
>> +struct bch_nvm_set *only_set;
>> +
>> +static void release_nvm_namespaces(struct bch_nvm_set *nvm_set)
>> +{
>> +	int i;
>> +	struct bch_nvm_namespace *ns;
>> +
>> +	for (i = 0; i < nvm_set->total_namespaces_nr; i++) {
>> +		ns = nvm_set->nss[i];
>> +		if (ns) {
>> +			blkdev_put(ns->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
>> +			kfree(ns);
>> +		}
>> +	}
>> +
>> +	kfree(nvm_set->nss);
>> +}
>> +
>> +static void release_nvm_set(struct bch_nvm_set *nvm_set)
>> +{
>> +	release_nvm_namespaces(nvm_set);
>> +	kfree(nvm_set);
>> +}
>> +
>> +static int init_owner_info(struct bch_nvm_namespace *ns)
>> +{
>> +	struct bch_owner_list_head *owner_list_head = ns->sb->owner_list_head;
>> +
>> +	mutex_lock(&only_set->lock);
>> +	only_set->owner_list_head = owner_list_head;
>> +	only_set->owner_list_size = owner_list_head->size;
>> +	only_set->owner_list_used = owner_list_head->used;
>> +	mutex_unlock(&only_set->lock);
>> +
>> +	return 0;
>> +}
>> +
>> +static int attach_nvm_set(struct bch_nvm_namespace *ns)
>> +{
>> +	int rc = 0;
>> +
>> +	mutex_lock(&only_set->lock);
>> +	if (only_set->nss) {
>> +		if (memcmp(ns->sb->set_uuid, only_set->set_uuid, 16)) {
>> +			pr_info("namespace id doesn't match nvm set\n");
>> +			rc = -EINVAL;
>> +			goto unlock;
>> +		}
>> +
>> +		if (only_set->nss[ns->sb->this_namespace_nr]) {
> Doesn't this need to be checked against 'total_namespaces_nr' to avoid
> overflow?

Will add such a check in bch_register_namespace().


>> +			pr_info("already has the same position(%d) nvm\n",
>> +					ns->sb->this_namespace_nr);
>> +			rc = -EEXIST;
>> +			goto unlock;
>> +		}
>> +	} else {
>> +		memcpy(only_set->set_uuid, ns->sb->set_uuid, 16);
>> +		only_set->total_namespaces_nr = ns->sb->total_namespaces_nr;
>> +		only_set->nss = kcalloc(only_set->total_namespaces_nr,
>> +				sizeof(struct bch_nvm_namespace *), GFP_KERNEL);
>> +		if (!only_set->nss) {
> When you enter here, 'set_uuid' and 'total_namespace_nr' is being
> modified, which might cause errors later on.
> Please move these two lines _after_ the kcalloc() to avoid this.

Yeah, modifying the order is better.

The reason I didn't ask for a change here during my code review is that
only_set is an in-memory object and only one namespace is attached
currently. only_set->set_uuid and only_set->total_namespaces_nr will not
be checked after "only_set->nss[ns->sb->this_namespace_nr] = ns", so the
code works correctly here.

But yes, we don't need the unnecessary extra memory writes here. We
will modify this in the next post.
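
Roughly, it would become (just a sketch of the agreed-upon reordering,
not the final patch), so only_set is only touched after the allocation
succeeds:

	} else {
		only_set->nss = kcalloc(ns->sb->total_namespaces_nr,
				sizeof(struct bch_nvm_namespace *), GFP_KERNEL);
		if (!only_set->nss) {
			rc = -ENOMEM;
			goto unlock;
		}
		memcpy(only_set->set_uuid, ns->sb->set_uuid, 16);
		only_set->total_namespaces_nr = ns->sb->total_namespaces_nr;
	}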


>> +			rc = -ENOMEM;
>> +			goto unlock;
>> +		}
>> +	}
>> +
>> +	only_set->nss[ns->sb->this_namespace_nr] = ns;
>> +
>> +	/* Firstly attach */
> Initial attach?

Will fix in the next post.

>
>> +	if ((unsigned long)ns->sb->owner_list_head == BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET) {
>> +		struct bch_nvm_pages_owner_head *sys_owner_head;
>> +		struct bch_nvm_pgalloc_recs *sys_pgalloc_recs;
>> +
>> +		ns->sb->owner_list_head = ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET;
>> +		sys_pgalloc_recs = ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET;
>> +
>> +		sys_owner_head = &(ns->sb->owner_list_head->heads[0]);
>> +		sys_owner_head->recs[0] = sys_pgalloc_recs;
>> +		ns->sb->csum = csum_set(ns->sb);
>> +
> Hmm. You are trying to pick up the 'list_head' structure from NVM, right?

No, this is not a READ, it's a WRITE onto the NVDIMM.

sys_owner_head points into the NVDIMM; since ns->sb->owner_list_head is
updated, the checksum of ns->sb should be updated to the new value on
the NVDIMM. That is what the above line does.


>
> In doing so, don't you need to validate the structure (eg by checking
> the checksum) before doing so to ensure that the contents are valid?

The checksum check for the READ path is done in read_nvdimm_meta_super(),
in the following lines,
+	r = -EINVAL;
+	expected_csum = csum_set(sb);
+	if (expected_csum != sb->csum) {
+		pr_info("csum is not match with expected one\n");
+		goto put_page;
+	}

One thing to note is that currently none of the NVDIMM updates take
power failure into account. This is the next big task to do after the
first small code base is merged.


>> +		sys_pgalloc_recs->owner = sys_owner_head;
>> +	} else
>> +		BUG_ON(ns->sb->owner_list_head !=
>> +			(ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET));
>> +
>> +unlock:
>> +	mutex_unlock(&only_set->lock);
>> +	return rc;
>> +}
>> +
>> +static int read_nvdimm_meta_super(struct block_device *bdev,
>> +			      struct bch_nvm_namespace *ns)
>> +{
>> +	struct page *page;
>> +	struct bch_nvm_pages_sb *sb;
>> +	int r = 0;
>> +	uint64_t expected_csum = 0;
>> +
>> +	page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
>> +			BCH_NVM_PAGES_SB_OFFSET >> PAGE_SHIFT, GFP_KERNEL);
>> +
>> +	if (IS_ERR(page))
>> +		return -EIO;
>> +
>> +	sb = (struct bch_nvm_pages_sb *)(page_address(page) +
>> +					offset_in_page(BCH_NVM_PAGES_SB_OFFSET));
>> +	r = -EINVAL;
>> +	expected_csum = csum_set(sb);
>> +	if (expected_csum != sb->csum) {
>> +		pr_info("csum is not match with expected one\n");
>> +		goto put_page;
>> +	}
>> +
>> +	if (memcmp(sb->magic, bch_nvm_pages_magic, 16)) {
>> +		pr_info("invalid bch_nvm_pages_magic\n");
>> +		goto put_page;
>> +	}
>> +
>> +	if (sb->total_namespaces_nr != 1) {
>> +		pr_info("currently only support one nvm device\n");
>> +		goto put_page;
>> +	}
>> +
>> +	if (sb->sb_offset != BCH_NVM_PAGES_SB_OFFSET) {
>> +		pr_info("invalid superblock offset\n");
>> +		goto put_page;
>> +	}
>> +
>> +	r = 0;
>> +	/* temporary use for DAX API */
>> +	ns->page_size = sb->page_size;
>> +	ns->pages_total = sb->pages_total;
>> +
>> +put_page:
>> +	put_page(page);
>> +	return r;
>> +}
>> +
>> +struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
>> +{
>> +	struct bch_nvm_namespace *ns;
>> +	int err;
>> +	pgoff_t pgoff;
>> +	char buf[BDEVNAME_SIZE];
>> +	struct block_device *bdev;
>> +	int id;
>> +	char *path = NULL;
>> +
>> +	path = kstrndup(dev_path, 512, GFP_KERNEL);
>> +	if (!path) {
>> +		pr_err("kstrndup failed\n");
>> +		return ERR_PTR(-ENOMEM);
>> +	}
>> +
>> +	bdev = blkdev_get_by_path(strim(path),
>> +				  FMODE_READ|FMODE_WRITE|FMODE_EXEC,
>> +				  only_set);
>> +	if (IS_ERR(bdev)) {
>> +		pr_info("get %s error: %ld\n", dev_path, PTR_ERR(bdev));
>> +		kfree(path);
>> +		return ERR_PTR(PTR_ERR(bdev));
>> +	}
>> +
>> +	err = -ENOMEM;
>> +	ns = kzalloc(sizeof(struct bch_nvm_namespace), GFP_KERNEL);
>> +	if (!ns)
>> +		goto bdput;
>> +
>> +	err = -EIO;
>> +	if (read_nvdimm_meta_super(bdev, ns)) {
>> +		pr_info("%s read nvdimm meta super block failed.\n",
>> +			bdevname(bdev, buf));
>> +		goto free_ns;
>> +	}
>> +
>> +	err = -EOPNOTSUPP;
>> +	if (!bdev_dax_supported(bdev, ns->page_size)) {
>> +		pr_info("%s don't support DAX\n", bdevname(bdev, buf));
>> +		goto free_ns;
>> +	}
>> +
>> +	err = -EINVAL;
>> +	if (bdev_dax_pgoff(bdev, 0, ns->page_size, &pgoff)) {
>> +		pr_info("invalid offset of %s\n", bdevname(bdev, buf));
>> +		goto free_ns;
>> +	}
>> +
>> +	err = -ENOMEM;
>> +	ns->dax_dev = fs_dax_get_by_bdev(bdev);
>> +	if (!ns->dax_dev) {
>> +		pr_info("can't by dax device by %s\n", bdevname(bdev, buf));
>> +		goto free_ns;
>> +	}
>> +
>> +	err = -EINVAL;
>> +	id = dax_read_lock();
>> +	if (dax_direct_access(ns->dax_dev, pgoff, ns->pages_total,
>> +			      &ns->kaddr, &ns->start_pfn) <= 0) {
>> +		pr_info("dax_direct_access error\n");
>> +		dax_read_unlock(id);
>> +		goto free_ns;
>> +	}
>> +	dax_read_unlock(id);
>> +
>> +	ns->sb = ns->kaddr + BCH_NVM_PAGES_SB_OFFSET;
>> +
> You already read the superblock in read_nvdimm_meta_super(), right?
> Wouldn't it be better to first do the 'dax_direct_access()' call, and
> then check the superblock?
> That way you'll ensure that dax_direct_access()' did the right thing;
> with the current code you are using two different methods of accessing
> the superblock, which theoretically can result in one method succeeding,
> the other not ...

We have to do it this way. The mapping size for dax_direct_access()
comes from ns->pages_total, which is stored on the NVDIMM. Before
calling dax_direct_access() we need to make sure ns holds a valid super
block read from the NVDIMM, and only then can we trust the value of
ns->pages_total to do the DAX mapping.

Another method is to first map a small fixed range (e.g. 1GB), check
whether the super block on the NVDIMM is valid, and if so re-map the
whole space indicated by ns->pages_total. But either way the two
accesses cannot be avoided.
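
Just to show what the alternative would look like, roughly (a sketch
only; the 1GB probe size is an arbitrary example and the error handling
is elided):

	/* 1) probe with a small fixed mapping that covers the super block */
	id = dax_read_lock();
	if (dax_direct_access(ns->dax_dev, pgoff, SZ_1G >> PAGE_SHIFT,
			      &ns->kaddr, &ns->start_pfn) <= 0)
		goto fail;

	sb = ns->kaddr + BCH_NVM_PAGES_SB_OFFSET;
	if (csum_set(sb) != sb->csum ||
	    memcmp(sb->magic, bch_nvm_pages_magic, 16))
		goto fail;

	/* 2) super block is valid, re-map the whole space it describes */
	if (dax_direct_access(ns->dax_dev, pgoff, sb->pages_total,
			      &ns->kaddr, &ns->start_pfn) <= 0)
		goto fail;
	dax_read_unlock(id);

Either way the super block is accessed once to learn pages_total and
once again through the final mapping, so we keep the current code.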

>> +	err = -EINVAL;
>> +	/* Check magic again to make sure DAX mapping is correct */
>> +	if (memcmp(ns->sb->magic, bch_nvm_pages_magic, 16)) {
>> +		pr_info("invalid bch_nvm_pages_magic after DAX mapping\n");
>> +		goto free_ns;
>> +	}
>> +
>> +	err = attach_nvm_set(ns);
>> +	if (err < 0)
>> +		goto free_ns;
>> +
>> +	ns->page_size = ns->sb->page_size;
>> +	ns->pages_offset = ns->sb->pages_offset;
>> +	ns->pages_total = ns->sb->pages_total;
>> +	ns->free = 0;
>> +	ns->bdev = bdev;
>> +	ns->nvm_set = only_set;
>> +	mutex_init(&ns->lock);
>> +
>> +	if (ns->sb->this_namespace_nr == 0) {
>> +		pr_info("only first namespace contain owner info\n");
>> +		err = init_owner_info(ns);
>> +		if (err < 0) {
>> +			pr_info("init_owner_info met error %d\n", err);
>> +			only_set->nss[ns->sb->this_namespace_nr] = NULL;
>> +			goto free_ns;
>> +		}
>> +	}
>> +
>> +	kfree(path);
>> +	return ns;
>> +free_ns:
>> +	kfree(ns);
>> +bdput:
>> +	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
>> +	kfree(path);
>> +	return ERR_PTR(err);
>> +}
>> +EXPORT_SYMBOL_GPL(bch_register_namespace);
>> +
>> +int __init bch_nvm_init(void)
>> +{
>> +	only_set = kzalloc(sizeof(*only_set), GFP_KERNEL);
>> +	if (!only_set)
>> +		return -ENOMEM;
>> +
>> +	only_set->total_namespaces_nr = 0;
>> +	only_set->owner_list_head = NULL;
>> +	only_set->nss = NULL;
>> +
>> +	mutex_init(&only_set->lock);
>> +
>> +	pr_info("bcache nvm init\n");
>> +	return 0;
>> +}
>> +
>> +void bch_nvm_exit(void)
>> +{
>> +	release_nvm_set(only_set);
>> +	pr_info("bcache nvm exit\n");
>> +}
>> +
>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>> diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
>> new file mode 100644
>> index 000000000000..3e24c4dee7fd
>> --- /dev/null
>> +++ b/drivers/md/bcache/nvm-pages.h
>> @@ -0,0 +1,74 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifndef _BCACHE_NVM_PAGES_H
>> +#define _BCACHE_NVM_PAGES_H
>> +
>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>> +#include <linux/bcache-nvm.h>
>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>> +
> Hmm? What is that doing here?
> Please move it into the source file.

This is temporary, until the whole NVDIMM support for bcache is completed.

drivers/md/bcache/nvm-pages.h has to be included because there are still
stub routines in this header, such as,

+static inline struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
+{
+	return NULL;
+}

They will be removed after the whole code is completed and merged. So for
now we have to do this to make sure the NVDIMM-related code won't leak out
when the experimental config option is not enabled.


Thanks for your review. The addressed issues will be fixed and updated in the next post.


Coly Li

>> +/*
>> + * Bcache NVDIMM in memory data structures
>> + */
>> +
>> +/*
>> + * The following three structures in memory records which page(s) allocated
>> + * to which owner. After reboot from power failure, they will be initialized
>> + * based on nvm pages superblock in NVDIMM device.
>> + */
>> +struct bch_nvm_namespace {
>> +	struct bch_nvm_pages_sb *sb;
>> +	void *kaddr;
>> +
>> +	u8 uuid[16];
>> +	u64 free;
>> +	u32 page_size;
>> +	u64 pages_offset;
>> +	u64 pages_total;
>> +	pfn_t start_pfn;
>> +
>> +	struct dax_device *dax_dev;
>> +	struct block_device *bdev;
>> +	struct bch_nvm_set *nvm_set;
>> +
>> +	struct mutex lock;
>> +};
>> +
>> +/*
>> + * A set of namespaces. Currently only one set can be supported.
>> + */
>> +struct bch_nvm_set {
>> +	u8 set_uuid[16];
>> +	u32 total_namespaces_nr;
>> +
>> +	u32 owner_list_size;
>> +	u32 owner_list_used;
>> +	struct bch_owner_list_head *owner_list_head;
>> +
>> +	struct bch_nvm_namespace **nss;
>> +
>> +	struct mutex lock;
>> +};
>> +extern struct bch_nvm_set *only_set;
>> +
>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>> +
>> +struct bch_nvm_namespace *bch_register_namespace(const char *dev_path);
>> +int bch_nvm_init(void);
>> +void bch_nvm_exit(void);
>> +
>> +#else
>> +
>> +static inline struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
>> +{
>> +	return NULL;
>> +}
>> +static inline int bch_nvm_init(void)
>> +{
>> +	return 0;
>> +}
>> +static inline void bch_nvm_exit(void) { }
>> +
>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>> +
>> +#endif /* _BCACHE_NVM_PAGES_H */
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index 2f1ee4fbf4d5..ce22aefb1352 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -14,6 +14,7 @@
>>  #include "request.h"
>>  #include "writeback.h"
>>  #include "features.h"
>> +#include "nvm-pages.h"
>>  
>>  #include <linux/blkdev.h>
>>  #include <linux/pagemap.h>
>> @@ -2823,6 +2824,7 @@ static void bcache_exit(void)
>>  {
>>  	bch_debug_exit();
>>  	bch_request_exit();
>> +	bch_nvm_exit();
>>  	if (bcache_kobj)
>>  		kobject_put(bcache_kobj);
>>  	if (bcache_wq)
>> @@ -2921,6 +2923,7 @@ static int __init bcache_init(void)
>>  
>>  	bch_debug_init();
>>  	closure_debug_init();
>> +	bch_nvm_init();
>>  
>>  	bcache_is_reboot = false;
>>  
>>
> Cheers,
>
> Hannes


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/14] bcache: initialization of the buddy
  2021-06-22 10:45   ` Hannes Reinecke
@ 2021-06-23  5:35     ` Coly Li
  2021-06-23  5:46       ` Re[2]: " Pavel Goran
  0 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-23  5:35 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-bcache, linux-block, Jianpeng Ma, kernel test robot,
	Dan Carpenter, axboe, Qiaowei Ren

On 6/22/21 6:45 PM, Hannes Reinecke wrote:
> On 6/15/21 7:49 AM, Coly Li wrote:
>> From: Jianpeng Ma <jianpeng.ma@intel.com>
>>
>> This nvm pages allocator will implement the simple buddy to manage the
>> nvm address space. This patch initializes this buddy for new namespace.
>>
> Please use 'buddy allocator' instead of just 'buddy'.

Will update in the next post.


>
>> the unit of alloc/free of the buddy is page. DAX device has their
>> struct page(in dram or PMEM).
>>
>>         struct {        /* ZONE_DEVICE pages */
>>                 /** @pgmap: Points to the hosting device page map. */
>>                 struct dev_pagemap *pgmap;
>>                 void *zone_device_data;
>>                 /*
>>                  * ZONE_DEVICE private pages are counted as being
>>                  * mapped so the next 3 words hold the mapping, index,
>>                  * and private fields from the source anonymous or
>>                  * page cache page while the page is migrated to device
>>                  * private memory.
>>                  * ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
>>                  * use the mapping, index, and private fields when
>>                  * pmem backed DAX files are mapped.
>>                  */
>>         };
>>
>> ZONE_DEVICE pages only use pgmap. Other 4 words[16/32 bytes] don't use.
>> So the second/third word will be used as 'struct list_head ' which list
>> in buddy. The fourth word(that is normal struct page::index) store pgoff
>> which the page-offset in the dax device. And the fifth word (that is
>> normal struct page::private) store order of buddy. page_type will be used
>> to store buddy flags.
>>
>> Reported-by: kernel test robot <lkp@intel.com>
>> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
>> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
>> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
>> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>  drivers/md/bcache/nvm-pages.c   | 156 +++++++++++++++++++++++++++++++-
>>  drivers/md/bcache/nvm-pages.h   |   6 ++
>>  include/uapi/linux/bcache-nvm.h |  10 +-
>>  3 files changed, 165 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
>> index 18fdadbc502f..804ee66e97be 100644
>> --- a/drivers/md/bcache/nvm-pages.c
>> +++ b/drivers/md/bcache/nvm-pages.c
>> @@ -34,6 +34,10 @@ static void release_nvm_namespaces(struct bch_nvm_set *nvm_set)
>>  	for (i = 0; i < nvm_set->total_namespaces_nr; i++) {
>>  		ns = nvm_set->nss[i];
>>  		if (ns) {
>> +			kvfree(ns->pages_bitmap);
>> +			if (ns->pgalloc_recs_bitmap)
>> +				bitmap_free(ns->pgalloc_recs_bitmap);
>> +
>>  			blkdev_put(ns->bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
>>  			kfree(ns);
>>  		}
>> @@ -48,17 +52,130 @@ static void release_nvm_set(struct bch_nvm_set *nvm_set)
>>  	kfree(nvm_set);
>>  }
>>  
>> +static struct page *nvm_vaddr_to_page(struct bch_nvm_namespace *ns, void *addr)
>> +{
>> +	return virt_to_page(addr);
>> +}
>> +
>> +static void *nvm_pgoff_to_vaddr(struct bch_nvm_namespace *ns, pgoff_t pgoff)
>> +{
>> +	return ns->kaddr + (pgoff << PAGE_SHIFT);
>> +}
>> +
>> +static inline void remove_owner_space(struct bch_nvm_namespace *ns,
>> +					pgoff_t pgoff, u64 nr)
>> +{
>> +	while (nr > 0) {
>> +		unsigned int num = nr > UINT_MAX ? UINT_MAX : nr;
>> +
>> +		bitmap_set(ns->pages_bitmap, pgoff, num);
>> +		nr -= num;
>> +		pgoff += num;
>> +	}
>> +}
>> +
>> +#define BCH_PGOFF_TO_KVADDR(pgoff) ((void *)((unsigned long)pgoff << PAGE_SHIFT))
>> +
>>  static int init_owner_info(struct bch_nvm_namespace *ns)
>>  {
>>  	struct bch_owner_list_head *owner_list_head = ns->sb->owner_list_head;
>> +	struct bch_nvm_pgalloc_recs *sys_recs;
>> +	int i, j, k, rc = 0;
>>  
>>  	mutex_lock(&only_set->lock);
>>  	only_set->owner_list_head = owner_list_head;
>>  	only_set->owner_list_size = owner_list_head->size;
>>  	only_set->owner_list_used = owner_list_head->used;
>> +
>> +	/* remove used space */
>> +	remove_owner_space(ns, 0, div_u64(ns->pages_offset, ns->page_size));
>> +
>> +	sys_recs = ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET;
>> +	/* suppose no hole in array */
>> +	for (i = 0; i < owner_list_head->used; i++) {
>> +		struct bch_nvm_pages_owner_head *head = &owner_list_head->heads[i];
>> +
>> +		for (j = 0; j < BCH_NVM_PAGES_NAMESPACES_MAX; j++) {
>> +			struct bch_nvm_pgalloc_recs *pgalloc_recs = head->recs[j];
>> +			unsigned long offset = (unsigned long)ns->kaddr >> PAGE_SHIFT;
>> +			struct page *page;
>> +
>> +			while (pgalloc_recs) {
>> +				u32 pgalloc_recs_pos = (unsigned int)(pgalloc_recs - sys_recs);
>> +
>> +				if (memcmp(pgalloc_recs->magic, bch_nvm_pages_pgalloc_magic, 16)) {
>> +					pr_info("invalid bch_nvm_pages_pgalloc_magic\n");
>> +					rc = -EINVAL;
>> +					goto unlock;
>> +				}
>> +				if (memcmp(pgalloc_recs->owner_uuid, head->uuid, 16)) {
>> +					pr_info("invalid owner_uuid in bch_nvm_pgalloc_recs\n");
>> +					rc = -EINVAL;
>> +					goto unlock;
>> +				}
>> +				if (pgalloc_recs->owner != head) {
>> +					pr_info("invalid owner in bch_nvm_pgalloc_recs\n");
>> +					rc = -EINVAL;
>> +					goto unlock;
>> +				}
>> +
>> +				/* recs array can has hole */
> can have holes ?

It means the valid records are not always stored contiguously in recs[]
of struct bch_nvm_pgalloc_recs, because currently only an 8-byte write
to an 8-byte-aligned address on NVDIMM is atomic across power failure.

When a record is removed from the recs[] array because a block of NVDIMM
pages is freed, moving the following valid records forward to keep all
records stored contiguously would not be atomic across power failure.
We would then need a more complicated method to keep the metadata
consistent across power failure.

Allowing holes (records stored non-contiguously in the recs[] array)
makes things much simpler here.
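
To make the point concrete, the free path only needs to invalidate the
matching record in place, roughly like this (a sketch of the idea, not
the code in the later patch; it assumes the record fields can be cleared
with aligned 8-byte stores):

	/* freeing never compacts recs[], it just leaves a hole behind */
	for (i = 0; i < recs->size; i++) {
		if (recs->recs[i].pgoff == pgoff &&
		    recs->recs[i].order == order) {
			recs->recs[i].pgoff = 0;
			recs->recs[i].order = 0;
			recs->used--;
			break;
		}
	}

Since no other record is moved, an interrupted free leaves at worst a
stale or half-cleared record, never a half-compacted array, and replay
after reboot stays simple.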

Thanks for your review.

Coly Li


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re[2]: [PATCH 05/14] bcache: initialization of the buddy
  2021-06-23  5:35     ` Coly Li
@ 2021-06-23  5:46       ` Pavel Goran
  2021-06-23  6:03         ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Pavel Goran @ 2021-06-23  5:46 UTC (permalink / raw)
  To: Coly Li
  Cc: Hannes Reinecke, linux-bcache, linux-block, Jianpeng Ma,
	kernel test robot, Dan Carpenter, axboe, Qiaowei Ren

Hello Coly,

Wednesday, June 23, 2021, 12:35:21 PM, you wrote:

> ... (skipped a lot)
>>> +                            /* recs array can has hole */
>> can have holes ?

> It means the valid records are not always stored contiguously in recs[]
> of struct bch_nvm_pgalloc_recs, because currently only an 8-byte write
> to an 8-byte-aligned address on NVDIMM is atomic across power failure.

> ...

The issue is with the wording of this comment, not with the code or the
meaning of the comment.

The comment should read "recs array can have hole".

> Coly Li

Pavel Goran
  


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 06/14] bcache: bch_nvm_alloc_pages() of the buddy
  2021-06-22 10:51   ` Hannes Reinecke
@ 2021-06-23  6:02     ` Coly Li
  0 siblings, 0 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23  6:02 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/22/21 6:51 PM, Hannes Reinecke wrote:
> On 6/15/21 7:49 AM, Coly Li wrote:
>> From: Jianpeng Ma <jianpeng.ma@intel.com>
>>
>> This patch implements the bch_nvm_alloc_pages() of the buddy.
>> In terms of function, this func is like current-page-buddy-alloc.
>> But the differences are:
>> a: it need owner_uuid as parameter which record owner info. And it
>> make those info persistence.
>> b: it don't need flags like GFP_*. All allocs are the equal.
>> c: it don't trigger other ops etc swap/recycle.
>>
>> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
>> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
>> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>  drivers/md/bcache/nvm-pages.c   | 174 ++++++++++++++++++++++++++++++++
>>  drivers/md/bcache/nvm-pages.h   |   6 ++
>>  include/uapi/linux/bcache-nvm.h |   6 +-
>>  3 files changed, 184 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
>> index 804ee66e97be..5d095d241483 100644
>> --- a/drivers/md/bcache/nvm-pages.c
>> +++ b/drivers/md/bcache/nvm-pages.c
>> @@ -74,6 +74,180 @@ static inline void remove_owner_space(struct bch_nvm_namespace *ns,
>>  	}
>>  }
>>  
>> +/* If not found, it will create if create == true */
>> +static struct bch_nvm_pages_owner_head *find_owner_head(const char *owner_uuid, bool create)
>> +{
>> +	struct bch_owner_list_head *owner_list_head = only_set->owner_list_head;
>> +	struct bch_nvm_pages_owner_head *owner_head = NULL;
>> +	int i;
>> +
>> +	if (owner_list_head == NULL)
>> +		goto out;
>> +
>> +	for (i = 0; i < only_set->owner_list_used; i++) {
>> +		if (!memcmp(owner_uuid, owner_list_head->heads[i].uuid, 16)) {
>> +			owner_head = &(owner_list_head->heads[i]);
>> +			break;
>> +		}
>> +	}
>> +
> Please, don't name is 'heads'. If this is supposed to be a linked list,
> use the standard list implementation and initialize the pointers correctly.
> If it isn't use an array (as you know in advance how many array entries
> you can allocate).

heads is an array that stores the heads of all owner lists. Each element
of the heads[] array is the head of one owner list.

An owner is identified by its uuid. When allocating nvm pages from the
nvm-pages allocator, the owner's uuid is provided, and all of its
allocated nvm pages are tracked by that owner's list. Typically the
owner is a device driver which uses nvm pages, such as bcache.

After reboot, bcache will ask the nvm-pages allocator to return the
whole owner list for the uuid the bcache driver provided previously. It
is then the bcache driver's duty to restore the data layout from all the
nvm pages tracked by the returned owner list.

So heads is named as the array which stores all the heads of all the
owner lists.
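
From the owner's point of view the flow after reboot is roughly (a
sketch; restore_from_nvm_pages() and my_driver_uuid are made-up names,
the real entry point is the bch_get_allocated_pages() interface added
later in this series):

	struct bch_nvm_pages_owner_head *head;
	struct bch_nvm_pgalloc_recs *recs;
	int i;

	/* ask the allocator for everything allocated to this uuid before */
	head = bch_get_allocated_pages(my_driver_uuid);
	if (!head)
		return;		/* nothing was allocated before the reboot */

	/* walk the per-namespace record lists and rebuild the data layout */
	recs = head->recs[0];	/* namespace 0, single-namespace case */
	while (recs) {
		for (i = 0; i < recs->size; i++) {
			if (recs->recs[i].pgoff)	/* recs[] may have holes */
				restore_from_nvm_pages(recs->recs[i].pgoff,
						       recs->recs[i].order);
		}
		recs = recs->next;
	}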


>> +	if (!owner_head && create) {
>> +		u32 used = only_set->owner_list_used;
>> +
>> +		if (only_set->owner_list_size > used) {
>> +			memcpy_flushcache(owner_list_head->heads[used].uuid, owner_uuid, 16);
>> +			only_set->owner_list_used++;
>> +
>> +			owner_list_head->used++;
>> +			owner_head = &(owner_list_head->heads[used]);
>> +		} else
>> +			pr_info("no free bch_nvm_pages_owner_head\n");
>> +	}
>> +
>> +out:
>> +	return owner_head;
>> +}
>> +
>> +static struct bch_nvm_pgalloc_recs *find_empty_pgalloc_recs(void)
>> +{
>> +	unsigned int start;
>> +	struct bch_nvm_namespace *ns = only_set->nss[0];
>> +	struct bch_nvm_pgalloc_recs *recs;
>> +
>> +	start = bitmap_find_next_zero_area(ns->pgalloc_recs_bitmap, BCH_MAX_PGALLOC_RECS, 0, 1, 0);
>> +	if (start > BCH_MAX_PGALLOC_RECS) {
>> +		pr_info("no free struct bch_nvm_pgalloc_recs\n");
>> +		return NULL;
>> +	}
>> +
>> +	bitmap_set(ns->pgalloc_recs_bitmap, start, 1);
>> +	recs = (struct bch_nvm_pgalloc_recs *)(ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET)
>> +		+ start;
>> +	return recs;
>> +}
>> +
>> +static struct bch_nvm_pgalloc_recs *find_nvm_pgalloc_recs(struct bch_nvm_namespace *ns,
>> +		struct bch_nvm_pages_owner_head *owner_head, bool create)
>> +{
>> +	int ns_nr = ns->sb->this_namespace_nr;
>> +	struct bch_nvm_pgalloc_recs *prev_recs = NULL, *recs = owner_head->recs[ns_nr];
>> +
>> +	/* If create=false, we return recs[nr] */
>> +	if (!create)
>> +		return recs;
>> +
>> +	/*
>> +	 * If create=true, it mean we need a empty struct bch_pgalloc_rec
>> +	 * So we should find non-empty struct bch_nvm_pgalloc_recs or alloc
>> +	 * new struct bch_nvm_pgalloc_recs. And return this bch_nvm_pgalloc_recs
>> +	 */
>> +	while (recs && (recs->used == recs->size)) {
>> +		prev_recs = recs;
>> +		recs = recs->next;
>> +	}
>> +
>> +	/* Found empty struct bch_nvm_pgalloc_recs */
>> +	if (recs)
>> +		return recs;
>> +	/* Need alloc new struct bch_nvm_galloc_recs */
>> +	recs = find_empty_pgalloc_recs();
>> +	if (recs) {
>> +		recs->next = NULL;
>> +		recs->owner = owner_head;
>> +		memcpy_flushcache(recs->magic, bch_nvm_pages_pgalloc_magic, 16);
>> +		memcpy_flushcache(recs->owner_uuid, owner_head->uuid, 16);
>> +		recs->size = BCH_MAX_RECS;
>> +		recs->used = 0;
>> +
>> +		if (prev_recs)
>> +			prev_recs->next = recs;
>> +		else
>> +			owner_head->recs[ns_nr] = recs;
>> +	}
>> +
> Wouldn't it be easier if the bitmap covers the entire range, and not
> just the non-empty ones?
> Eventually (ie if the NVM set becomes full) it'll cover it anyway, so
> can't we save ourselves some time to allocate a large enough bitmap
> upfront and only use it do figure out empty recs?

Yes, we will do it later. The reason we don't do it now is that a struct
bch_nvm_pgalloc_recs may contain 1000+ records, while all the current
code uses only 1 record for the bcache journal. Later, when I start to
store bcache btree nodes on NVDIMM, I can test the suggested bitmap
optimization with a real workload.

Thanks for the suggestion.


>
>> +	return recs;
>> +}
>> +
>> +static void add_pgalloc_rec(struct bch_nvm_pgalloc_recs *recs, void *kaddr, int order)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < recs->size; i++) {
>> +		if (recs->recs[i].pgoff == 0) {
>> +			recs->recs[i].pgoff = (unsigned long)kaddr >> PAGE_SHIFT;
>> +			recs->recs[i].order = order;
>> +			recs->used++;
>> +			break;
>> +		}
>> +	}
>> +	BUG_ON(i == recs->size);
>> +}
>> +
>> +void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
>> +{
>> +	void *kaddr = NULL;
>> +	struct bch_nvm_pgalloc_recs *pgalloc_recs;
>> +	struct bch_nvm_pages_owner_head *owner_head;
>> +	int i, j;
>> +
>> +	mutex_lock(&only_set->lock);
>> +	owner_head = find_owner_head(owner_uuid, true);
>> +
>> +	if (!owner_head) {
>> +		pr_err("can't find bch_nvm_pgalloc_recs by(uuid=%s)\n", owner_uuid);
>> +		goto unlock;
>> +	}
>> +
>> +	for (j = 0; j < only_set->total_namespaces_nr; j++) {
>> +		struct bch_nvm_namespace *ns = only_set->nss[j];
>> +
>> +		if (!ns || (ns->free < (1L << order)))
>> +			continue;
>> +
>> +		for (i = order; i < BCH_MAX_ORDER; i++) {
>> +			struct list_head *list;
>> +			struct page *page, *buddy_page;
>> +
>> +			if (list_empty(&ns->free_area[i]))
>> +				continue;
>> +
>> +			list = ns->free_area[i].next;
> list_first_entry()?

Copied. It will be updated in the next post.

Thanks for your review.

Coly Li





^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/14] bcache: initialization of the buddy
  2021-06-23  5:46       ` Re[2]: " Pavel Goran
@ 2021-06-23  6:03         ` Coly Li
  0 siblings, 0 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23  6:03 UTC (permalink / raw)
  To: Pavel Goran
  Cc: Hannes Reinecke, linux-bcache, linux-block, Jianpeng Ma,
	kernel test robot, Dan Carpenter, axboe, Qiaowei Ren

On 6/23/21 1:46 PM, Pavel Goran wrote:
> Hello Coly,
>
> Wednesday, June 23, 2021, 12:35:21 PM, you wrote:
>
>> ... (skipped a lot)
>>>> +                            /* recs array can has hole */
>>> can have holes ?
>> It means the valid records are not always stored contiguously in recs[]
>> of struct bch_nvm_pgalloc_recs, because currently only an 8-byte write
>> to an 8-byte-aligned address on NVDIMM is atomic across power failure.
>> ...
> The issue is with the wording of this comment, not with the code or the
> meaning of the comment.
>
> The comment should read "recs array can have hole".

Oh, I see. Thanks, Pavel, for the hint :-) We will update it in the next post.

Coly Li

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/14] bcache: bch_nvm_free_pages() of the buddy
  2021-06-22 10:53   ` Hannes Reinecke
@ 2021-06-23  6:06     ` Coly Li
  0 siblings, 0 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23  6:06 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/22/21 6:53 PM, Hannes Reinecke wrote:
> On 6/15/21 7:49 AM, Coly Li wrote:
>> From: Jianpeng Ma <jianpeng.ma@intel.com>
>>
>> This patch implements the bch_nvm_free_pages() of the buddy.
>>
>> The difference between this and page-buddy-free:
>> it need owner_uuid to free owner allocated pages.And must
>> persistent after free.
>>
>> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
>> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
>> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>  drivers/md/bcache/nvm-pages.c | 164 ++++++++++++++++++++++++++++++++--
>>  drivers/md/bcache/nvm-pages.h |   3 +-
>>  2 files changed, 159 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
>> index 5d095d241483..74d08950c67c 100644
>> --- a/drivers/md/bcache/nvm-pages.c
>> +++ b/drivers/md/bcache/nvm-pages.c
>> @@ -52,7 +52,7 @@ static void release_nvm_set(struct bch_nvm_set *nvm_set)
>>  	kfree(nvm_set);
>>  }
>>  
>> -static struct page *nvm_vaddr_to_page(struct bch_nvm_namespace *ns, void *addr)
>> +static struct page *nvm_vaddr_to_page(void *addr)
>>  {
>>  	return virt_to_page(addr);
>>  }
> If you don't need this argument please modify the patch adding the
> nvm_vaddr_to_page() function.

Copied. We will fix this in the patch where nvm_vaddr_to_page() was
first added.

It will be updated in the next post.

Thanks for your review.

Coly Li


[snipped]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 08/14] bcache: get allocated pages from specific owner
  2021-06-22 10:54   ` Hannes Reinecke
@ 2021-06-23  6:08     ` Coly Li
  0 siblings, 0 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23  6:08 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/22/21 6:54 PM, Hannes Reinecke wrote:
> On 6/15/21 7:49 AM, Coly Li wrote:
>> From: Jianpeng Ma <jianpeng.ma@intel.com>
>>
>> This patch implements bch_get_allocated_pages() of the buddy to be used to
> buddy allocator
>

Copied. Will be updated in the next post.

>> get allocated pages from specific owner.
>>
>> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
>> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
>> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>  drivers/md/bcache/nvm-pages.c | 6 ++++++
>>  drivers/md/bcache/nvm-pages.h | 5 +++++
>>  2 files changed, 11 insertions(+)
>>
>> diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
>> index 74d08950c67c..42b0504d9564 100644
>> --- a/drivers/md/bcache/nvm-pages.c
>> +++ b/drivers/md/bcache/nvm-pages.c
>> @@ -397,6 +397,12 @@ void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
>>  }
>>  EXPORT_SYMBOL_GPL(bch_nvm_alloc_pages);
>>  
>> +struct bch_nvm_pages_owner_head *bch_get_allocated_pages(const char *owner_uuid)
>> +{
>> +	return find_owner_head(owner_uuid, false);
>> +}
>> +EXPORT_SYMBOL_GPL(bch_get_allocated_pages);
>> +
>>  #define BCH_PGOFF_TO_KVADDR(pgoff) ((void *)((unsigned long)pgoff << PAGE_SHIFT))
>>  
>>  static int init_owner_info(struct bch_nvm_namespace *ns)
>> diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
>> index 0ca699166855..c763bf2e2721 100644
>> --- a/drivers/md/bcache/nvm-pages.h
>> +++ b/drivers/md/bcache/nvm-pages.h
>> @@ -64,6 +64,7 @@ int bch_nvm_init(void);
>>  void bch_nvm_exit(void);
>>  void *bch_nvm_alloc_pages(int order, const char *owner_uuid);
>>  void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid);
>> +struct bch_nvm_pages_owner_head *bch_get_allocated_pages(const char *owner_uuid);
>>  
>>  #else
>>  
>> @@ -81,6 +82,10 @@ static inline void *bch_nvm_alloc_pages(int order, const char *owner_uuid)
>>  	return NULL;
>>  }
>>  static inline void bch_nvm_free_pages(void *addr, int order, const char *owner_uuid) { }
>> +static inline struct bch_nvm_pages_owner_head *bch_get_allocated_pages(const char *owner_uuid)
>> +{
>> +	return NULL;
>> +}
>>  
>>  #endif /* CONFIG_BCACHE_NVM_PAGES */
>>  
>>
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Thanks for your review.

Coly Li

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/14] bcache: use bucket index to set GC_MARK_METADATA for journal buckets in bch_btree_gc_finish()
  2021-06-22 10:55   ` Hannes Reinecke
@ 2021-06-23  6:09     ` Coly Li
  0 siblings, 0 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23  6:09 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/22/21 6:55 PM, Hannes Reinecke wrote:
> On 6/15/21 7:49 AM, Coly Li wrote:
>> Currently the meta data bucket locations on cache device are reserved
>> after the meta data stored on NVDIMM pages, for the meta data layout
>> consistentcy temporarily. So these buckets are still marked as meta data
>> by SET_GC_MARK() in bch_btree_gc_finish().
>>
>> When BCH_FEATURE_INCOMPAT_NVDIMM_META is set, the sb.d[] stores linear
>> address of NVDIMM pages and not bucket index anymore. Therefore we
>> should avoid to find bucket index from sb.d[], and directly use bucket
>> index from ca->sb.first_bucket to (ca->sb.first_bucket +
>> ca->sb.njournal_bucketsi) for setting the gc mark of journal bucket.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
>> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
>> ---
>>  drivers/md/bcache/btree.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
>> index 183a58c89377..e0d7135669ca 100644
>> --- a/drivers/md/bcache/btree.c
>> +++ b/drivers/md/bcache/btree.c
>> @@ -1761,8 +1761,10 @@ static void bch_btree_gc_finish(struct cache_set *c)
>>  	ca = c->cache;
>>  	ca->invalidate_needs_gc = 0;
>>  
>> -	for (k = ca->sb.d; k < ca->sb.d + ca->sb.keys; k++)
>> -		SET_GC_MARK(ca->buckets + *k, GC_MARK_METADATA);
>> +	/* Range [first_bucket, first_bucket + keys) is for journal buckets */
>> +	for (i = ca->sb.first_bucket;
>> +	     i < ca->sb.first_bucket + ca->sb.njournal_buckets; i++)
>> +		SET_GC_MARK(ca->buckets + i, GC_MARK_METADATA);
>>  
>>  	for (k = ca->prio_buckets;
>>  	     k < ca->prio_buckets + prio_buckets(ca) * 2; k++)
>>
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Thanks for your review.

Coly Li

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 10/14] bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set
  2021-06-22 10:59   ` Hannes Reinecke
@ 2021-06-23  6:09     ` Coly Li
  0 siblings, 0 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23  6:09 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/22/21 6:59 PM, Hannes Reinecke wrote:
> On 6/15/21 7:49 AM, Coly Li wrote:
>> This patch adds BCH_FEATURE_INCOMPAT_NVDIMM_META (value 0x0004) into the
>> incompat feature set. When this bit is set by bcache-tools, it indicates
>> bcache meta data should be stored on specific NVDIMM meta device.
>>
>> The bcache meta data mainly includes journal and btree nodes, when this
>> bit is set in incompat feature set, bcache will ask the nvm-pages
>> allocator for NVDIMM space to store the meta data.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
>> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
>> ---
>>  drivers/md/bcache/features.h | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h
>> index d1c8fd3977fc..45d2508d5532 100644
>> --- a/drivers/md/bcache/features.h
>> +++ b/drivers/md/bcache/features.h
>> @@ -17,11 +17,19 @@
>>  #define BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET		0x0001
>>  /* real bucket size is (1 << bucket_size) */
>>  #define BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE	0x0002
>> +/* store bcache meta data on nvdimm */
>> +#define BCH_FEATURE_INCOMPAT_NVDIMM_META		0x0004
>>  
>>  #define BCH_FEATURE_COMPAT_SUPP		0
>>  #define BCH_FEATURE_RO_COMPAT_SUPP	0
>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>> +#define BCH_FEATURE_INCOMPAT_SUPP	(BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET| \
>> +					 BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE| \
>> +					 BCH_FEATURE_INCOMPAT_NVDIMM_META)
>> +#else
>>  #define BCH_FEATURE_INCOMPAT_SUPP	(BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET| \
>>  					 BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE)
>> +#endif
>>  
>>  #define BCH_HAS_COMPAT_FEATURE(sb, mask) \
>>  		((sb)->feature_compat & (mask))
>> @@ -89,6 +97,7 @@ static inline void bch_clear_feature_##name(struct cache_sb *sb) \
>>  
>>  BCH_FEATURE_INCOMPAT_FUNCS(obso_large_bucket, OBSO_LARGE_BUCKET);
>>  BCH_FEATURE_INCOMPAT_FUNCS(large_bucket, LOG_LARGE_BUCKET_SIZE);
>> +BCH_FEATURE_INCOMPAT_FUNCS(nvdimm_meta, NVDIMM_META);
>>  
>>  static inline bool bch_has_unknown_compat_features(struct cache_sb *sb)
>>  {
>>
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Thanks for your review.

Coly Li

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 11/14] bcache: initialize bcache journal for NVDIMM meta device
  2021-06-22 11:01   ` Hannes Reinecke
@ 2021-06-23  6:17     ` Coly Li
  2021-06-23  9:20       ` Hannes Reinecke
  0 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-23  6:17 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/22/21 7:01 PM, Hannes Reinecke wrote:
> On 6/15/21 7:49 AM, Coly Li wrote:
>> The nvm-pages allocator may store and index the NVDIMM pages allocated
>> for bcache journal. This patch adds the initialization to store bcache
>> journal space on NVDIMM pages if BCH_FEATURE_INCOMPAT_NVDIMM_META bit is
>> set by bcache-tools.
>>
>> If BCH_FEATURE_INCOMPAT_NVDIMM_META is set, get_nvdimm_journal_space()
>> will return the linear address of NVDIMM pages for bcache journal,
>> - If there is previously allocated space, find it from nvm-pages owner
>>   list and return to bch_journal_init().
>> - If there is no previously allocated space, require a new NVDIMM range
>>   from the nvm-pages allocator, and return it to bch_journal_init().
>>
>> And in bch_journal_init(), keys in sb.d[] store the corresponding linear
>> address from NVDIMM into sb.d[i].ptr[0] where 'i' is the bucket index to
>> iterate all journal buckets.
>>
>> Later when bcache journaling code stores the journaling jset, the target
>> NVDIMM linear address stored (and updated) in sb.d[i].ptr[0] can be used
>> directly in memory copy from DRAM pages into NVDIMM pages.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
>> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
>> ---
>>  drivers/md/bcache/journal.c | 105 ++++++++++++++++++++++++++++++++++++
>>  drivers/md/bcache/journal.h |   2 +-
>>  drivers/md/bcache/super.c   |  16 +++---
>>  3 files changed, 115 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>> index 61bd79babf7a..32599d2ff5d2 100644
>> --- a/drivers/md/bcache/journal.c
>> +++ b/drivers/md/bcache/journal.c
>> @@ -9,6 +9,8 @@
>>  #include "btree.h"
>>  #include "debug.h"
>>  #include "extents.h"
>> +#include "nvm-pages.h"
>> +#include "features.h"
>>  
>>  #include <trace/events/bcache.h>
>>  
>> @@ -982,3 +984,106 @@ int bch_journal_alloc(struct cache_set *c)
>>  
>>  	return 0;
>>  }
>> +
>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>> +
>> +static void *find_journal_nvm_base(struct bch_nvm_pages_owner_head *owner_list,
>> +				   struct cache *ca)
>> +{
>> +	unsigned long addr = 0;
>> +	struct bch_nvm_pgalloc_recs *recs_list = owner_list->recs[0];
>> +
>> +	while (recs_list) {
>> +		struct bch_pgalloc_rec *rec;
>> +		unsigned long jnl_pgoff;
>> +		int i;
>> +
>> +		jnl_pgoff = ((unsigned long)ca->sb.d[0]) >> PAGE_SHIFT;
>> +		rec = recs_list->recs;
>> +		for (i = 0; i < recs_list->used; i++) {
>> +			if (rec->pgoff == jnl_pgoff)
>> +				break;
>> +			rec++;
>> +		}
>> +		if (i < recs_list->used) {
>> +			addr = rec->pgoff << PAGE_SHIFT;
>> +			break;
>> +		}
>> +		recs_list = recs_list->next;
>> +	}
>> +	return (void *)addr;
>> +}
>> +
>> +static void *get_nvdimm_journal_space(struct cache *ca)
>> +{
>> +	struct bch_nvm_pages_owner_head *owner_list = NULL;
>> +	void *ret = NULL;
>> +	int order;
>> +
>> +	owner_list = bch_get_allocated_pages(ca->sb.set_uuid);
>> +	if (owner_list) {
>> +		ret = find_journal_nvm_base(owner_list, ca);
>> +		if (ret)
>> +			goto found;
>> +	}
>> +
>> +	order = ilog2(ca->sb.bucket_size *
>> +		      ca->sb.njournal_buckets / PAGE_SECTORS);
>> +	ret = bch_nvm_alloc_pages(order, ca->sb.set_uuid);
>> +	if (ret)
>> +		memset(ret, 0, (1 << order) * PAGE_SIZE);
>> +
>> +found:
>> +	return ret;
>> +}
>> +
>> +static int __bch_journal_nvdimm_init(struct cache *ca)
>> +{
>> +	int i, ret = 0;
>> +	void *journal_nvm_base = NULL;
>> +
>> +	journal_nvm_base = get_nvdimm_journal_space(ca);
>> +	if (!journal_nvm_base) {
>> +		pr_err("Failed to get journal space from nvdimm\n");
>> +		ret = -1;
>> +		goto out;
>> +	}
>> +
>> +	/* Iniialized and reloaded from on-disk super block already */
>> +	if (ca->sb.d[0] != 0)
>> +		goto out;
>> +
>> +	for (i = 0; i < ca->sb.keys; i++)
>> +		ca->sb.d[i] =
>> +			(u64)(journal_nvm_base + (ca->sb.bucket_size * i));
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +#else /* CONFIG_BCACHE_NVM_PAGES */
>> +
>> +static int __bch_journal_nvdimm_init(struct cache *ca)
>> +{
>> +	return -1;
>> +}
>> +
>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>> +
>> +int bch_journal_init(struct cache_set *c)
>> +{
>> +	int i, ret = 0;
>> +	struct cache *ca = c->cache;
>> +
>> +	ca->sb.keys = clamp_t(int, ca->sb.nbuckets >> 7,
>> +				2, SB_JOURNAL_BUCKETS);
>> +
>> +	if (!bch_has_feature_nvdimm_meta(&ca->sb)) {
>> +		for (i = 0; i < ca->sb.keys; i++)
>> +			ca->sb.d[i] = ca->sb.first_bucket + i;
>> +	} else {
>> +		ret = __bch_journal_nvdimm_init(ca);
>> +	}
>> +
>> +	return ret;
>> +}
>> diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
>> index f2ea34d5f431..e3a7fa5a8fda 100644
>> --- a/drivers/md/bcache/journal.h
>> +++ b/drivers/md/bcache/journal.h
>> @@ -179,7 +179,7 @@ void bch_journal_mark(struct cache_set *c, struct list_head *list);
>>  void bch_journal_meta(struct cache_set *c, struct closure *cl);
>>  int bch_journal_read(struct cache_set *c, struct list_head *list);
>>  int bch_journal_replay(struct cache_set *c, struct list_head *list);
>> -
>> +int bch_journal_init(struct cache_set *c);
>>  void bch_journal_free(struct cache_set *c);
>>  int bch_journal_alloc(struct cache_set *c);
>>  
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index ce22aefb1352..cce0f6bf0944 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -147,10 +147,15 @@ static const char *read_super_common(struct cache_sb *sb,  struct block_device *
>>  		goto err;
>>  
>>  	err = "Journal buckets not sequential";
>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>> +	if (!bch_has_feature_nvdimm_meta(sb)) {
>> +#endif
>>  	for (i = 0; i < sb->keys; i++)
>>  		if (sb->d[i] != sb->first_bucket + i)
>>  			goto err;
>> -
>> +#ifdef CONFIG_BCACHE_NVM_PAGES
>> +	} /* bch_has_feature_nvdimm_meta */
>> +#endif
>>  	err = "Too many journal buckets";
>>  	if (sb->first_bucket + sb->keys > sb->nbuckets)
>>  		goto err;
> Extremely awkward.

After the feature settles and is no longer marked as EXPERIMENTAL, such
conditional code will be removed.


> Make 'bch_has_feature_nvdimm_meta()' generally available, and have it
> return 'false' if the config feature isn't enabled.

bch_has_feature_nvdimm_meta() is defined as,


#define BCH_FEATURE_COMPAT_FUNCS(name, flagname) \
static inline int bch_has_feature_##name(struct cache_sb *sb) \
{ \
        if (sb->version < BCACHE_SB_VERSION_CDEV_WITH_FEATURES) \
                return 0; \
        return (((sb)->feature_compat & \
                BCH##_FEATURE_COMPAT_##flagname) != 0); \
} \

It is not easy to check a specific Kconfig option inside the above code
block; this is why we use the compile-time condition to disable the
nvdimm-related code here, until we remove the EXPERIMENTAL mark in
Kconfig.
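
If we really wanted to avoid the #if blocks in super.c, one option would
be a thin wrapper around the generated helper, something like this
sketch (the wrapper name is made up, this is not part of the series):

static inline bool bch_nvdimm_meta_in_use(struct cache_sb *sb)
{
	return IS_ENABLED(CONFIG_BCACHE_NVM_PAGES) &&
	       bch_has_feature_nvdimm_meta(sb);
}

But since the conditional code is temporary and goes away once the
EXPERIMENTAL mark is removed, we prefer to keep the macro and the
feature helpers untouched for now.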


Thanks for your review. Do you find my response above convincing?

Coly Li


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/14] bcache: support storing bcache journal into NVDIMM meta device
  2021-06-22 11:03   ` Hannes Reinecke
@ 2021-06-23  6:19     ` Coly Li
  0 siblings, 0 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23  6:19 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/22/21 7:03 PM, Hannes Reinecke wrote:
> On 6/15/21 7:49 AM, Coly Li wrote:
>> This patch implements two methods to store bcache journal to,
>> 1) __journal_write_unlocked() for block interface device
>>    The latency method to compose bio and issue the jset bio to cache
>>    device (e.g. SSD). c->journal.key.ptr[0] indicates the LBA on cache
>>    device to store the journal jset.
>> 2) __journal_nvdimm_write_unlocked() for memory interface NVDIMM
>>    Use memory interface to access NVDIMM pages and store the jset by
>>    memcpy_flushcache(). c->journal.key.ptr[0] indicates the linear
>>    address from the NVDIMM pages to store the journal jset.
>>
>> For lagency configuration without NVDIMM meta device, journal I/O is
> legacy?
>
>> handled by __journal_write_unlocked() with existing code logic. If the
>> NVDIMM meta device is used (by bcache-tools), the journal I/O will
>> be handled by __journal_nvdimm_write_unlocked() and go into the NVDIMM
>> pages.
>>
>> And when NVDIMM meta device is used, sb.d[] stores the linear addresses
>> from NVDIMM pages (no more bucket index), in journal_reclaim() the
>> journaling location in c->journal.key.ptr[0] should also be updated by
>> linear address from NVDIMM pages (no more LBA combined by sectors offset
>> and bucket index).
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
>> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
>> ---
>>  drivers/md/bcache/journal.c   | 119 ++++++++++++++++++++++++----------
>>  drivers/md/bcache/nvm-pages.h |   1 +
>>  drivers/md/bcache/super.c     |  28 +++++++-
>>  3 files changed, 110 insertions(+), 38 deletions(-)
>>
>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>> index 32599d2ff5d2..03ecedf813b0 100644
>> --- a/drivers/md/bcache/journal.c
>> +++ b/drivers/md/bcache/journal.c
>> @@ -596,6 +596,8 @@ static void do_journal_discard(struct cache *ca)
>>  		return;
>>  	}
>>  
>> +	BUG_ON(bch_has_feature_nvdimm_meta(&ca->sb));
>> +
>>  	switch (atomic_read(&ja->discard_in_flight)) {
>>  	case DISCARD_IN_FLIGHT:
>>  		return;
>> @@ -661,9 +663,13 @@ static void journal_reclaim(struct cache_set *c)
>>  		goto out;
>>  
>>  	ja->cur_idx = next;
>> -	k->ptr[0] = MAKE_PTR(0,
>> -			     bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
>> -			     ca->sb.nr_this_dev);
>> +	if (!bch_has_feature_nvdimm_meta(&ca->sb))
>> +		k->ptr[0] = MAKE_PTR(0,
>> +			bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
>> +			ca->sb.nr_this_dev);
>> +	else
>> +		k->ptr[0] = ca->sb.d[ja->cur_idx];
>> +
>>  	atomic_long_inc(&c->reclaimed_journal_buckets);
>>  
>>  	bkey_init(k);
>> @@ -729,46 +735,21 @@ static void journal_write_unlock(struct closure *cl)
>>  	spin_unlock(&c->journal.lock);
>>  }
>>  
>> -static void journal_write_unlocked(struct closure *cl)
>> +
>> +static void __journal_write_unlocked(struct cache_set *c)
>>  	__releases(c->journal.lock)
>>  {
>> -	struct cache_set *c = container_of(cl, struct cache_set, journal.io);
>> -	struct cache *ca = c->cache;
>> -	struct journal_write *w = c->journal.cur;
>>  	struct bkey *k = &c->journal.key;
>> -	unsigned int i, sectors = set_blocks(w->data, block_bytes(ca)) *
>> -		ca->sb.block_size;
>> -
>> +	struct journal_write *w = c->journal.cur;
>> +	struct closure *cl = &c->journal.io;
>> +	struct cache *ca = c->cache;
>>  	struct bio *bio;
>>  	struct bio_list list;
>> +	unsigned int i, sectors = set_blocks(w->data, block_bytes(ca)) *
>> +		ca->sb.block_size;
>>  
>>  	bio_list_init(&list);
>>  
>> -	if (!w->need_write) {
>> -		closure_return_with_destructor(cl, journal_write_unlock);
>> -		return;
>> -	} else if (journal_full(&c->journal)) {
>> -		journal_reclaim(c);
>> -		spin_unlock(&c->journal.lock);
>> -
>> -		btree_flush_write(c);
>> -		continue_at(cl, journal_write, bch_journal_wq);
>> -		return;
>> -	}
>> -
>> -	c->journal.blocks_free -= set_blocks(w->data, block_bytes(ca));
>> -
>> -	w->data->btree_level = c->root->level;
>> -
>> -	bkey_copy(&w->data->btree_root, &c->root->key);
>> -	bkey_copy(&w->data->uuid_bucket, &c->uuid_bucket);
>> -
>> -	w->data->prio_bucket[ca->sb.nr_this_dev] = ca->prio_buckets[0];
>> -	w->data->magic		= jset_magic(&ca->sb);
>> -	w->data->version	= BCACHE_JSET_VERSION;
>> -	w->data->last_seq	= last_seq(&c->journal);
>> -	w->data->csum		= csum_set(w->data);
>> -
>>  	for (i = 0; i < KEY_PTRS(k); i++) {
>>  		ca = c->cache;
>>  		bio = &ca->journal.bio;
>> @@ -793,7 +774,6 @@ static void journal_write_unlocked(struct closure *cl)
>>  
>>  		ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
>>  	}
>> -
>>  	/* If KEY_PTRS(k) == 0, this jset gets lost in air */
>>  	BUG_ON(i == 0);
>>  
>> @@ -805,6 +785,73 @@ static void journal_write_unlocked(struct closure *cl)
>>  
>>  	while ((bio = bio_list_pop(&list)))
>>  		closure_bio_submit(c, bio, cl);
>> +}
>> +
>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>> +
>> +static void __journal_nvdimm_write_unlocked(struct cache_set *c)
>> +	__releases(c->journal.lock)
>> +{
>> +	struct journal_write *w = c->journal.cur;
>> +	struct cache *ca = c->cache;
>> +	unsigned int sectors;
>> +
>> +	sectors = set_blocks(w->data, block_bytes(ca)) * ca->sb.block_size;
>> +	atomic_long_add(sectors, &ca->meta_sectors_written);
>> +
>> +	memcpy_flushcache((void *)c->journal.key.ptr[0], w->data, sectors << 9);
>> +
>> +	c->journal.key.ptr[0] += sectors << 9;
>> +	ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
>> +
>> +	atomic_dec_bug(&fifo_back(&c->journal.pin));
>> +	bch_journal_next(&c->journal);
>> +	journal_reclaim(c);
>> +
>> +	spin_unlock(&c->journal.lock);
>> +}
>> +
>> +#else /* CONFIG_BCACHE_NVM_PAGES */
>> +
>> +static void __journal_nvdimm_write_unlocked(struct cache_set *c) { }
>> +
>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>> +
>> +static void journal_write_unlocked(struct closure *cl)
>> +{
>> +	struct cache_set *c = container_of(cl, struct cache_set, journal.io);
>> +	struct cache *ca = c->cache;
>> +	struct journal_write *w = c->journal.cur;
>> +
>> +	if (!w->need_write) {
>> +		closure_return_with_destructor(cl, journal_write_unlock);
>> +		return;
>> +	} else if (journal_full(&c->journal)) {
>> +		journal_reclaim(c);
>> +		spin_unlock(&c->journal.lock);
>> +
>> +		btree_flush_write(c);
>> +		continue_at(cl, journal_write, bch_journal_wq);
>> +		return;
>> +	}
>> +
>> +	c->journal.blocks_free -= set_blocks(w->data, block_bytes(ca));
>> +
>> +	w->data->btree_level = c->root->level;
>> +
>> +	bkey_copy(&w->data->btree_root, &c->root->key);
>> +	bkey_copy(&w->data->uuid_bucket, &c->uuid_bucket);
>> +
>> +	w->data->prio_bucket[ca->sb.nr_this_dev] = ca->prio_buckets[0];
>> +	w->data->magic		= jset_magic(&ca->sb);
>> +	w->data->version	= BCACHE_JSET_VERSION;
>> +	w->data->last_seq	= last_seq(&c->journal);
>> +	w->data->csum		= csum_set(w->data);
>> +
>> +	if (!bch_has_feature_nvdimm_meta(&ca->sb))
>> +		__journal_write_unlocked(c);
>> +	else
>> +		__journal_nvdimm_write_unlocked(c);
>>  
>>  	continue_at(cl, journal_write_done, NULL);
>>  }
>> diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
>> index c763bf2e2721..736a661777b7 100644
>> --- a/drivers/md/bcache/nvm-pages.h
>> +++ b/drivers/md/bcache/nvm-pages.h
>> @@ -5,6 +5,7 @@
>>  
>>  #if defined(CONFIG_BCACHE_NVM_PAGES)
>>  #include <linux/bcache-nvm.h>
>> +#include <linux/libnvdimm.h>
>>  #endif /* CONFIG_BCACHE_NVM_PAGES */
>>  
>>  /*
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index cce0f6bf0944..4d6666d03aa7 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -1686,7 +1686,32 @@ void bch_cache_set_release(struct kobject *kobj)
>>  static void cache_set_free(struct closure *cl)
>>  {
>>  	struct cache_set *c = container_of(cl, struct cache_set, cl);
>> -	struct cache *ca;
>> +	struct cache *ca = c->cache;
>> +
>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>> +	/* Flush cache if journal stored in NVDIMM */
>> +	if (ca && bch_has_feature_nvdimm_meta(&ca->sb)) {
>> +		unsigned long bucket_size = ca->sb.bucket_size;
>> +		int i;
>> +
>> +		for (i = 0; i < ca->sb.keys; i++) {
>> +			unsigned long offset = 0;
>> +			unsigned int len = round_down(UINT_MAX, 2);
>> +
>> +			if ((void *)ca->sb.d[i] == NULL)
>> +				continue;
>> +
>> +			while (bucket_size > 0) {
>> +				if (len > bucket_size)
>> +					len = bucket_size;
>> +				arch_invalidate_pmem(
>> +					(void *)(ca->sb.d[i] + offset), len);
>> +				offset += len;
>> +				bucket_size -= len;
>> +			}
>> +		}
>> +	}
>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>>  
>>  	debugfs_remove(c->debug);
>>  
>> @@ -1698,7 +1723,6 @@ static void cache_set_free(struct closure *cl)
>>  	bch_bset_sort_state_free(&c->sort);
>>  	free_pages((unsigned long) c->uuids, ilog2(meta_bucket_pages(&c->cache->sb)));
>>  
>> -	ca = c->cache;
>>  	if (ca) {
>>  		ca->set = NULL;
>>  		c->cache = NULL;
>>
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Thanks for your review.

Coly Li

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] bcache: read jset from NVDIMM pages for journal replay
  2021-06-22 11:04   ` Hannes Reinecke
@ 2021-06-23  6:21     ` Coly Li
  0 siblings, 0 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23  6:21 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/22/21 7:04 PM, Hannes Reinecke wrote:
> On 6/15/21 7:49 AM, Coly Li wrote:
>> This patch implements two methods to read jset from media for journal
>> replay,
>> - __jnl_rd_bkt() for block device
>>   This is the legacy method to read jset via block device interface.
>> - __jnl_rd_nvm_bkt() for NVDIMM
>>   This is the method to read jset from NVDIMM memory interface, a.k.a
>>   memcopy() from NVDIMM pages to DRAM pages.
>>
>> If BCH_FEATURE_INCOMPAT_NVDIMM_META is set in incompat feature set,
>> during running cache set, journal_read_bucket() will read the journal
>> content from NVDIMM by __jnl_rd_nvm_bkt(). The linear addresses of
>> NVDIMM pages to read jset are stored in sb.d[SB_JOURNAL_BUCKETS], which
>> were initialized and maintained in previous runs of the cache set.
>>
>> A thing should be noticed is, when bch_journal_read() is called, the
>> linear address of NVDIMM pages is not loaded and initialized yet, it
>> is necessary to call __bch_journal_nvdimm_init() before reading the jset
>> from NVDIMM pages.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
>> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
>> ---
>>  drivers/md/bcache/journal.c | 93 +++++++++++++++++++++++++++----------
>>  1 file changed, 69 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>> index 03ecedf813b0..23e5ccf125df 100644
>> --- a/drivers/md/bcache/journal.c
>> +++ b/drivers/md/bcache/journal.c
>> @@ -34,60 +34,96 @@ static void journal_read_endio(struct bio *bio)
>>  	closure_put(cl);
>>  }
>>  
>> +static struct jset *__jnl_rd_bkt(struct cache *ca, unsigned int bkt_idx,
>> +				    unsigned int len, unsigned int offset,
>> +				    struct closure *cl)
>> +{
>> +	sector_t bucket = bucket_to_sector(ca->set, ca->sb.d[bkt_idx]);
>> +	struct bio *bio = &ca->journal.bio;
>> +	struct jset *data = ca->set->journal.w[0].data;
>> +
>> +	bio_reset(bio);
>> +	bio->bi_iter.bi_sector	= bucket + offset;
>> +	bio_set_dev(bio, ca->bdev);
>> +	bio->bi_iter.bi_size	= len << 9;
>> +	bio->bi_end_io	= journal_read_endio;
>> +	bio->bi_private = cl;
>> +	bio_set_op_attrs(bio, REQ_OP_READ, 0);
>> +	bch_bio_map(bio, data);
>> +
>> +	closure_bio_submit(ca->set, bio, cl);
>> +	closure_sync(cl);
>> +
>> +	/* Indeed journal.w[0].data */
>> +	return data;
>> +}
>> +
>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>> +
>> +static struct jset *__jnl_rd_nvm_bkt(struct cache *ca, unsigned int bkt_idx,
>> +				     unsigned int len, unsigned int offset)
>> +{
>> +	void *jset_addr = (void *)ca->sb.d[bkt_idx] + (offset << 9);
>> +	struct jset *data = ca->set->journal.w[0].data;
>> +
>> +	memcpy(data, jset_addr, len << 9);
>> +
>> +	/* Indeed journal.w[0].data */
>> +	return data;
>> +}
>> +
>> +#else /* CONFIG_BCACHE_NVM_PAGES */
>> +
>> +static struct jset *__jnl_rd_nvm_bkt(struct cache *ca, unsigned int bkt_idx,
>> +				     unsigned int len, unsigned int offset)
>> +{
>> +	return NULL;
>> +}
>> +
>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>> +
>>  static int journal_read_bucket(struct cache *ca, struct list_head *list,
>> -			       unsigned int bucket_index)
>> +			       unsigned int bucket_idx)
> This renaming is pointless.

Copied, will revert this in next post.

Thanks for your review.

Coly Li



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages)
  2021-06-23  4:32       ` Coly Li
@ 2021-06-23  6:53         ` Huang, Ying
  2021-06-23  7:04           ` Christoph Hellwig
  0 siblings, 1 reply; 60+ messages in thread
From: Huang, Ying @ 2021-06-23  6:53 UTC (permalink / raw)
  To: Coly Li
  Cc: Dan Williams, Jan Kara, Hannes Reinecke, Christoph Hellwig,
	linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, axboe

Coly Li <colyli@suse.de> writes:

> Hi Ying,
>
> I reply your comment in-place where you commented on.
>
> On 6/22/21 4:41 PM, Huang, Ying wrote:
>> Coly Li <colyli@suse.de> writes:
>>

[snip]

>>>> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
>>>> new file mode 100644
>>>> index 000000000000..5094a6797679
>>>> --- /dev/null
>>>> +++ b/include/uapi/linux/bcache-nvm.h
>>>> @@ -0,0 +1,200 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>>> +
>>>> +#ifndef _UAPI_BCACHE_NVM_H
>>>> +#define _UAPI_BCACHE_NVM_H
>>>> +
>>>> +#if (__BITS_PER_LONG == 64)
>>>> +/*
>>>> + * Bcache on NVDIMM data structures
>>>> + */
>>>> +
>>>> +/*
>>>> + * - struct bch_nvm_pages_sb
>>>> + *   This is the super block allocated on each nvdimm namespace. A nvdimm
>>>> + * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
>>>> + * which nvdimm set this name space belongs to. Normally we will use the
>>>> + * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
>>>> + * set to a specified bcache cache set.
>>>> + *
>>>> + * - struct bch_owner_list_head
>>>> + *   This is a table for all heads of all owner lists. A owner list records
>>>> + * which page(s) allocated to which owner. After reboot from power failure,
>>>> + * the ownwer may find all its requested and allocated pages from the owner
>>>> + * list by a handler which is converted by a UUID.
>>>> + *
>>>> + * - struct bch_nvm_pages_owner_head
>>>> + *   This is a head of an owner list. Each owner only has one owner list,
>>>> + * and a nvm page only belongs to an specific owner. uuid[] will be set to
>>>> + * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
>>>> + * mandatory, it is a human-readable string for debug purpose. The pointer
>>>> + * recs references to separated nvm page which hold the table of struct
>>>> + * bch_pgalloc_rec.
>>>> + *
>>>> + *- struct bch_nvm_pgalloc_recs
>>>> + *  This structure occupies a whole page, owner_uuid should match the uuid
>>>> + * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
>>>> + * allocated records.
>>>> + *
>>>> + * - struct bch_pgalloc_rec
>>>> + *   Each structure records a range of allocated nvm pages. pgoff is offset
>>>> + * in unit of page size of this allocated nvm page range. The adjoint page
>>>> + * ranges of same owner can be merged into a larger one, therefore pages_nr
>>>> + * is NOT always power of 2.
>>>> + *
>>>> + *
>>>> + * Memory layout on nvdimm namespace 0
>>>> + *
>>>> + *    0 +---------------------------------+
>>>> + *      |                                 |
>>>> + *  4KB +---------------------------------+
>>>> + *      |         bch_nvm_pages_sb        |
>>>> + *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
>>>> + *      |       bch_owner_list_head       |
>>>> + *      |                                 |
>>>> + * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
>>>> + *      |       bch_nvm_pgalloc_recs      |
>>>> + *      |  (nvm pages internal usage)     |
>>>> + * 24KB +---------------------------------+
>>>> + *      |                                 |
>>>> + *      |                                 |
>>>> + * 16MB  +---------------------------------+
>>>> + *      |      allocable nvm pages        |
>>>> + *      |      for buddy allocator        |
>>>> + * end  +---------------------------------+
>>>> + *
>>>> + *
>>>> + *
>>>> + * Memory layout on nvdimm namespace N
>>>> + * (doesn't have owner list)
>>>> + *
>>>> + *    0 +---------------------------------+
>>>> + *      |                                 |
>>>> + *  4KB +---------------------------------+
>>>> + *      |         bch_nvm_pages_sb        |
>>>> + *  8KB +---------------------------------+
>>>> + *      |                                 |
>>>> + *      |                                 |
>>>> + *      |                                 |
>>>> + *      |                                 |
>>>> + *      |                                 |
>>>> + *      |                                 |
>>>> + * 16MB  +---------------------------------+
>>>> + *      |      allocable nvm pages        |
>>>> + *      |      for buddy allocator        |
>>>> + * end  +---------------------------------+
>>>> + *
>>>> + */
>>>> +
>>>> +#include <linux/types.h>
>>>> +
>>>> +/* In sectors */
>>>> +#define BCH_NVM_PAGES_SB_OFFSET			4096
>>>> +#define BCH_NVM_PAGES_OFFSET			(16 << 20)
>>>> +
>>>> +#define BCH_NVM_PAGES_LABEL_SIZE		32
>>>> +#define BCH_NVM_PAGES_NAMESPACES_MAX		8
>>>> +
>>>> +#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
>>>> +#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
>>>> +
>>>> +#define BCH_NVM_PAGES_SB_VERSION		0
>>>> +#define BCH_NVM_PAGES_SB_VERSION_MAX		0
>>>> +
>>>> +static const unsigned char bch_nvm_pages_magic[] = {
>>>> +	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
>>>> +	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
>>>> +static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
>>>> +	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
>>>> +	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
>>>> +
>>>> +/* takes 64bit width */
>>>> +struct bch_pgalloc_rec {
>>>> +	__u64	pgoff:52;
>>>> +	__u64	order:6;
>>>> +	__u64	reserved:6;
>>>> +};
>>>> +
>>>> +struct bch_nvm_pgalloc_recs {
>>>> +union {
>>>> +	struct {
>>>> +		struct bch_nvm_pages_owner_head	*owner;
>>>> +		struct bch_nvm_pgalloc_recs	*next;
>> I have concerns about using pointers directly in on-NVDIMM data
>> structure too.  How can you guarantee the NVDIMM devices will be mapped
>> to exact same virtual address across reboot?
>>
>
>
> We use the NVDIMM name space as memory, and from our testing and
> observation, the DAX mapping base address is consistent if the NVDIMM
> address from e820 table does not change.
>
> And from our testing and observation, the NVDIMM address from e820
> table does not change when,
> - NVDIMM and DRAM memory population does not change
> - Install more NVDIMM and/or DRAM based on existing memory population
> - NVDIMM always plugged in same slot location and no movement or swap
> - No CPU remove and change
>
> For 99.9%+ time when the hardware working healthily, the above condition
> can be assumed. Therefore we choose to store whole linear address
> (pointer) here, other than relative offset inside the NVDIMM name space.
>
> For the 0.0?% condition if the NVDIMM address from e820 table changes,
> because the last time DAX map address is stored in ns_start of struct
> bch_nvm_pages_sb, if the new DAX mapping address is different from
> ns_srart value, all pointers in the owner list can be updated by,
>     new_addr = (old_addr - old_ns_start) + new_ns_start
>
> The update can be very fast (and it can be power failure tolerant with
> carefully coding) for. Therefore we decide to store full linear address
> for directly memory access for 99%+ condition, and update the pointers
> for the 0.0?% condition when DAX mapping address of the NVDIMM changes.
>
> Handling DAX mapping address change is not current high priority task,
> our next task after this series merged will be the power failure tolerance
> of the owner list (from Intel developers) and storing bcache btree nodes
> on NVDIMM pages (from me).

Thanks for the detailed explanation.  Given "ns_start", this should work
even when the base address changes.

So the question becomes pointer vs. offset: which one is better?  I
guess you prefer pointers because they are easier to use.  How about
making the pointers in the NVDIMM work like per-cpu pointers, which are
implemented as an offset carried in a pointer type?  They are not too
hard to use, and with that you don't need to maintain the code that
updates all the pointers when the base address changes.
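
As an illustration of that offset-with-pointer-type idea, a minimal
sketch could look like the following; the type and helper names are
hypothetical, not anything from the posted series. The on-media field
stays a namespace-relative byte offset, and the conversion to a usable
virtual address happens only at access time, so a changed DAX base
address never needs an on-media fixup.

#include <linux/types.h>
#include <linux/stddef.h>

/* Hypothetical: namespace-relative offset stored on media */
struct bch_nvm_offset {
	__u64 off;	/* byte offset from the namespace start, 0 == NULL */
};

/* ns_base is the virtual address returned by dax_direct_access() */
static inline void *bch_nvm_deref(void *ns_base, struct bch_nvm_offset o)
{
	return o.off ? ns_base + o.off : NULL;
}

static inline struct bch_nvm_offset bch_nvm_to_offset(void *ns_base, void *p)
{
	struct bch_nvm_offset o = { .off = p ? (__u64)(p - ns_base) : 0 };

	return o;
}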

Best Regards,
Huang, Ying

>>>> +		unsigned char			magic[16];
>>>> +		unsigned char			owner_uuid[16];
>>>> +		unsigned int			size;
>>>> +		unsigned int			used;
>>>> +		unsigned long			_pad[4];
>>>> +		struct bch_pgalloc_rec		recs[];
>>>> +	};
>>>> +	unsigned char				pad[8192];
>>>> +};
>>>> +};
>>>> +
>>>> +#define BCH_MAX_RECS					\
>>>> +	((sizeof(struct bch_nvm_pgalloc_recs) -		\
>>>> +	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
>>>> +	 sizeof(struct bch_pgalloc_rec))
>>>> +
>>>> +struct bch_nvm_pages_owner_head {
>>>> +	unsigned char			uuid[16];
>>>> +	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
>>>> +	/* Per-namespace own lists */
>>>> +	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
>>>> +};
>>>> +
>>>> +/* heads[0] is always for nvm_pages internal usage */
>>>> +struct bch_owner_list_head {
>>>> +union {
>>>> +	struct {
>>>> +		unsigned int			size;
>>>> +		unsigned int			used;
>>>> +		unsigned long			_pad[4];
>>>> +		struct bch_nvm_pages_owner_head	heads[];
>>>> +	};
>>>> +	unsigned char				pad[8192];
>>>> +};
>>>> +};
>>>> +#define BCH_MAX_OWNER_LIST				\
>>>> +	((sizeof(struct bch_owner_list_head) -		\
>>>> +	 offsetof(struct bch_owner_list_head, heads)) /	\
>>>> +	 sizeof(struct bch_nvm_pages_owner_head))
>>>> +
>>>> +/* The on-media bit order is local CPU order */
>>>> +struct bch_nvm_pages_sb {
>>>> +	unsigned long				csum;
>>>> +	unsigned long				ns_start;
>>>> +	unsigned long				sb_offset;
>>>> +	unsigned long				version;
>>>> +	unsigned char				magic[16];
>>>> +	unsigned char				uuid[16];
>>>> +	unsigned int				page_size;
>>>> +	unsigned int				total_namespaces_nr;
>>>> +	unsigned int				this_namespace_nr;
>>>> +	union {
>>>> +		unsigned char			set_uuid[16];
>>>> +		unsigned long			set_magic;
>>>> +	};
>>>> +
>>>> +	unsigned long				flags;
>>>> +	unsigned long				seq;
>>>> +
>>>> +	unsigned long				feature_compat;
>>>> +	unsigned long				feature_incompat;
>>>> +	unsigned long				feature_ro_compat;
>>>> +
>>>> +	/* For allocable nvm pages from buddy systems */
>>>> +	unsigned long				pages_offset;
>>>> +	unsigned long				pages_total;
>>>> +
>>>> +	unsigned long				pad[8];
>>>> +
>>>> +	/* Only on the first name space */
>>>> +	struct bch_owner_list_head		*owner_list_head;
>>>> +
>>>> +	/* Just for csum_set() */
>>>> +	unsigned int				keys;
>>>> +	unsigned long				d[0];
>>>> +};
>>>> +#endif /* __BITS_PER_LONG == 64 */
>>>> +
>>>> +#endif /* _UAPI_BCACHE_NVM_H */

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages)
  2021-06-23  6:53         ` Huang, Ying
@ 2021-06-23  7:04           ` Christoph Hellwig
  2021-06-23  7:19             ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Hellwig @ 2021-06-23  7:04 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Coly Li, Dan Williams, Jan Kara, Hannes Reinecke,
	Christoph Hellwig, linux-bcache, linux-block, Jianpeng Ma,
	Qiaowei Ren, axboe

Storing a pointer on-media is completely broken.  It is not endian
clean, not 32-bit vs 64-bit clean and will lead to problems when addresses
change.  And they will change - maybe not often with DDR-attached
memory, but very certainly with CXL-attached memory that is completely
hot pluggable.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 03/14] bcache: add initial data structures for nvm pages
  2021-06-22 10:19   ` [PATCH 03/14] bcache: add initial data structures for nvm pages Hannes Reinecke
@ 2021-06-23  7:09     ` Coly Li
  0 siblings, 0 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23  7:09 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/22/21 6:19 PM, Hannes Reinecke wrote:
> On 6/15/21 7:49 AM, Coly Li wrote:
>> This patch initializes the prototype data structures for nvm pages
>> allocator,
>>
>> - struct bch_nvm_pages_sb
>> This is the super block allocated on each nvdimm namespace. A nvdimm
>> set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used
>> to mark which nvdimm set this name space belongs to. Normally we will
>> use the bcache's cache set UUID to initialize this uuid, to connect this
>> nvdimm set to a specified bcache cache set.
>>
>> - struct bch_owner_list_head
>> This is a table for all heads of all owner lists. A owner list records
>> which page(s) allocated to which owner. After reboot from power failure,
>> the ownwer may find all its requested and allocated pages from the owner
> owner

Fixed for next post.


>
>> list by a handler which is converted by a UUID.
>>
>> - struct bch_nvm_pages_owner_head
>> This is a head of an owner list. Each owner only has one owner list,
>> and a nvm page only belongs to an specific owner. uuid[] will be set to
>> owner's uuid, for bcache it is the bcache's cache set uuid. label is not
>> mandatory, it is a human-readable string for debug purpose. The pointer
>> *recs references to separated nvm page which hold the table of struct
>> bch_nvm_pgalloc_rec.
>>
>> - struct bch_nvm_pgalloc_recs
>> This struct occupies a whole page, owner_uuid should match the uuid
>> in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
>> allocated records.
>>
>> - struct bch_nvm_pgalloc_rec
>> Each structure records a range of allocated nvm pages.
>>   - Bits  0 - 51: is pages offset of the allocated pages.
>>   - Bits 52 - 57: allocaed size in page_size * order-of-2
>>   - Bits 58 - 63: reserved.
>> Since each of the allocated nvm pages are power of 2, using 6 bits to
>> represent allocated size can have (1<<(1<<64) - 1) * PAGE_SIZE maximum
>> value. It can be a 76 bits width range size in byte for 4KB page size,
>> which is large enough currently.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
>> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
>> ---
>>  include/uapi/linux/bcache-nvm.h | 200 ++++++++++++++++++++++++++++++++
>>  1 file changed, 200 insertions(+)
>>  create mode 100644 include/uapi/linux/bcache-nvm.h
>>
>> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
>> new file mode 100644
>> index 000000000000..5094a6797679
>> --- /dev/null
>> +++ b/include/uapi/linux/bcache-nvm.h
>> @@ -0,0 +1,200 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +
>> +#ifndef _UAPI_BCACHE_NVM_H
>> +#define _UAPI_BCACHE_NVM_H
>> +
>> +#if (__BITS_PER_LONG == 64)
>> +/*
>> + * Bcache on NVDIMM data structures
>> + */
>> +
>> +/*
>> + * - struct bch_nvm_pages_sb
>> + *   This is the super block allocated on each nvdimm namespace. A nvdimm
>> + * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
>> + * which nvdimm set this name space belongs to. Normally we will use the
>> + * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
>> + * set to a specified bcache cache set.
>> + *
>> + * - struct bch_owner_list_head
>> + *   This is a table for all heads of all owner lists. A owner list records
>> + * which page(s) allocated to which owner. After reboot from power failure,
>> + * the ownwer may find all its requested and allocated pages from the owner
>> + * list by a handler which is converted by a UUID.
>> + *
>> + * - struct bch_nvm_pages_owner_head
>> + *   This is a head of an owner list. Each owner only has one owner list,
>> + * and a nvm page only belongs to an specific owner. uuid[] will be set to
>> + * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
>> + * mandatory, it is a human-readable string for debug purpose. The pointer
>> + * recs references to separated nvm page which hold the table of struct
>> + * bch_pgalloc_rec.
>> + *
>> + *- struct bch_nvm_pgalloc_recs
>> + *  This structure occupies a whole page, owner_uuid should match the uuid
>> + * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
>> + * allocated records.
>> + *
>> + * - struct bch_pgalloc_rec
>> + *   Each structure records a range of allocated nvm pages. pgoff is offset
>> + * in unit of page size of this allocated nvm page range. The adjoint page
>> + * ranges of same owner can be merged into a larger one, therefore pages_nr
>> + * is NOT always power of 2.
>> + *
>> + *
>> + * Memory layout on nvdimm namespace 0
>> + *
>> + *    0 +---------------------------------+
>> + *      |                                 |
>> + *  4KB +---------------------------------+
>> + *      |         bch_nvm_pages_sb        |
>> + *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
>> + *      |       bch_owner_list_head       |
>> + *      |                                 |
>> + * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
>> + *      |       bch_nvm_pgalloc_recs      |
>> + *      |  (nvm pages internal usage)     |
>> + * 24KB +---------------------------------+
>> + *      |                                 |
>> + *      |                                 |
>> + * 16MB  +---------------------------------+
>> + *      |      allocable nvm pages        |
>> + *      |      for buddy allocator        |
>> + * end  +---------------------------------+
>> + *
>> + *
>> + *
>> + * Memory layout on nvdimm namespace N
>> + * (doesn't have owner list)
>> + *
>> + *    0 +---------------------------------+
>> + *      |                                 |
>> + *  4KB +---------------------------------+
>> + *      |         bch_nvm_pages_sb        |
>> + *  8KB +---------------------------------+
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + * 16MB  +---------------------------------+
>> + *      |      allocable nvm pages        |
>> + *      |      for buddy allocator        |
>> + * end  +---------------------------------+
>> + *
>> + */
>> +
>> +#include <linux/types.h>
>> +
>> +/* In sectors */
>> +#define BCH_NVM_PAGES_SB_OFFSET			4096
>> +#define BCH_NVM_PAGES_OFFSET			(16 << 20)
>> +
>> +#define BCH_NVM_PAGES_LABEL_SIZE		32
>> +#define BCH_NVM_PAGES_NAMESPACES_MAX		8
>> +
>> +#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
>> +#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
>> +
>> +#define BCH_NVM_PAGES_SB_VERSION		0
>> +#define BCH_NVM_PAGES_SB_VERSION_MAX		0
>> +
>> +static const unsigned char bch_nvm_pages_magic[] = {
>> +	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
>> +	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
>> +static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
>> +	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
>> +	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
>> +
>> +/* takes 64bit width */
>> +struct bch_pgalloc_rec {
>> +	__u64	pgoff:52;
>> +	__u64	order:6;
>> +	__u64	reserved:6;
>> +};
>> +
>> +struct bch_nvm_pgalloc_recs {
>> +union {
> Indentation.

Copied. It will be updated in next post.

>
>> +	struct {
>> +		struct bch_nvm_pages_owner_head	*owner;
>> +		struct bch_nvm_pgalloc_recs	*next;
>> +		unsigned char			magic[16];
>> +		unsigned char			owner_uuid[16];
>> +		unsigned int			size;
>> +		unsigned int			used;
>> +		unsigned long			_pad[4];
>> +		struct bch_pgalloc_rec		recs[];
>> +	};
>> +	unsigned char				pad[8192];
>> +};
>> +};
>> +
> Consider using __u64 and friends when specifying a structure with a
> fixed alignment; that also removes the need of the BITS_PER_LONG ifdef
> at the top.

That _WAS_ how the first version did it. But Jens didn't agree with this:

"This doesn't look right in a user header, any user API should be 32-bit
and 64-bit agnostic."

My follow-up explanation was not convincing, so I changed all the
__u32/__u64 types into unsigned int and unsigned long.

Considering that the nvm-pages allocator only works when both the
register word and the physical address are 64 bits wide, unsigned long
exactly matches 64 bits and unsigned int exactly matches 32 bits, so I
am fine with either form. Jens is the upper layer maintainer, and I
chose to listen to him.


>> +#define BCH_MAX_RECS					\
>> +	((sizeof(struct bch_nvm_pgalloc_recs) -		\
>> +	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
>> +	 sizeof(struct bch_pgalloc_rec))
>> +
> What _are_ you doing here?

BCH_MAX_RECS is a constant; it indicates how many elements can
be stored in the recs[] array of struct bch_nvm_pgalloc_recs.

> You're not seriously using the 'pad' field as a placeholder to size the
> structure accordingly?

The code works the way you expect. The 8KB pad forces struct
bch_nvm_pgalloc_recs to be exactly
two 4K pages, which is a power-of-2 order size for the nvm-pages buddy allocator.


> Also, what is the size of the 'bch_nvm_pgalloc_recs' structure?
> 8k + header size?

struct bch_nvm_pgalloc_recs is exactly 8K; the header is inside that 8K
space.


> That is very awkward, as the page allocator won't be able to handle it
> efficiently.
> Please size it to either 8k or 16k overall.
> And if you do that you can simplify this define.

In memory layout of struct bch_nvm_pgalloc_recs is

|<------------ 8K ----------->|
[header] [recs ...............]
         |<-- BCH_MAX_RECS -->|


So the code already works as you expect.
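
If it helps to make that layout explicit in the code, a couple of
compile-time assertions along these lines could be added (a sketch
only, not part of the posted patch):

#include <linux/build_bug.h>
#include <linux/stddef.h>

/*
 * Sketch: compile-time checks for the intended on-media layout.
 * The 8KB pad in the union keeps the structure at exactly two 4KB
 * pages, and BCH_MAX_RECS must not let recs[] overrun that space.
 */
static inline void bch_nvm_pgalloc_layout_check(void)
{
	BUILD_BUG_ON(sizeof(struct bch_nvm_pgalloc_recs) != 8192);
	BUILD_BUG_ON(offsetof(struct bch_nvm_pgalloc_recs, recs) +
		     BCH_MAX_RECS * sizeof(struct bch_pgalloc_rec) >
		     sizeof(struct bch_nvm_pgalloc_recs));
}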

>
>> +struct bch_nvm_pages_owner_head {
>> +	unsigned char			uuid[16];
>> +	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
>> +	/* Per-namespace own lists */
>> +	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
>> +};
>> +
>> +/* heads[0] is always for nvm_pages internal usage */
>> +struct bch_owner_list_head {
>> +union {
>> +	struct {
>> +		unsigned int			size;
>> +		unsigned int			used;
>> +		unsigned long			_pad[4];
>> +		struct bch_nvm_pages_owner_head	heads[];
>> +	};
>> +	unsigned char				pad[8192];
>> +};
>> +};
>> +#define BCH_MAX_OWNER_LIST				\
>> +	((sizeof(struct bch_owner_list_head) -		\
>> +	 offsetof(struct bch_owner_list_head, heads)) /	\
>> +	 sizeof(struct bch_nvm_pages_owner_head))
>> +
> Same here.
> Please size it that the 'bch_owner_list_head' structure fits into either
> 8k or 16k.

It works as you expect. But I realize the indentation might be
misleading; I should add a space before offsetof(), like this:

163 #define BCH_MAX_OWNER_LIST                              \
164         ((sizeof(struct bch_owner_list_head) -          \
165           offsetof(struct bch_owner_list_head, heads)) /\
166          sizeof(struct bch_nvm_pages_owner_head))

It means (8K - header_size) / sizeof(struct bch_nvm_pages_owner_head).

>
>> +/* The on-media bit order is local CPU order */
>> +struct bch_nvm_pages_sb {
>> +	unsigned long				csum;
>> +	unsigned long				ns_start;
>> +	unsigned long				sb_offset;
>> +	unsigned long				version;
>> +	unsigned char				magic[16];
>> +	unsigned char				uuid[16];
>> +	unsigned int				page_size;
>> +	unsigned int				total_namespaces_nr;
>> +	unsigned int				this_namespace_nr;
>> +	union {
>> +		unsigned char			set_uuid[16];
>> +		unsigned long			set_magic;
>> +	};
>> +
>> +	unsigned long				flags;
>> +	unsigned long				seq;
>> +
>> +	unsigned long				feature_compat;
>> +	unsigned long				feature_incompat;
>> +	unsigned long				feature_ro_compat;
>> +
>> +	/* For allocable nvm pages from buddy systems */
>> +	unsigned long				pages_offset;
>> +	unsigned long				pages_total;
>> +
>> +	unsigned long				pad[8];
>> +
>> +	/* Only on the first name space */
>> +	struct bch_owner_list_head		*owner_list_head;
>> +
>> +	/* Just for csum_set() */
>> +	unsigned int				keys;
>> +	unsigned long				d[0];
>> +};
>>

Thanks for your review. I will update all the addressed locations except
the __u32/__u64 types, because Jens didn't want them.

Coly Li


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages)
  2021-06-23  7:04           ` Christoph Hellwig
@ 2021-06-23  7:19             ` Coly Li
  2021-06-23  7:21               ` Christoph Hellwig
  0 siblings, 1 reply; 60+ messages in thread
From: Coly Li @ 2021-06-23  7:19 UTC (permalink / raw)
  To: Christoph Hellwig, Huang, Ying
  Cc: Dan Williams, Jan Kara, Hannes Reinecke, linux-bcache,
	linux-block, Jianpeng Ma, Qiaowei Ren, axboe

On 6/23/21 3:04 PM, Christoph Hellwig wrote:
> Storing a pointer on-media is completely broken.  It is not endian
> clean, not 32-bit vs 64-bit clean and will lead to problems when addresses

Why is it not endian clean, and not 32-bit vs. 64-bit clean, for bcache?
Bcache indeed does not support endian cleanness, and libnvdimm only works
with a 64-bit physical address width. The only restriction from using
pointers here is that the CPU register word must be 64 bits, because we
use the NVDIMM as memory.

Isn't this one of the ways NVDIMM (especially Intel AEP) is designed to
be used, i.e. as non-volatile memory?

> change.  And they will change - maybe not often with DDR-attached
> memory, but very certainly with CXL-attached memory that is completely
> hot pluggable.

Does the already mapped DAX base address change at runtime during memory
hot plug?
If not, it won't be a problem here for this specific use case.

Coly Li

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages)
  2021-06-23  7:19             ` Coly Li
@ 2021-06-23  7:21               ` Christoph Hellwig
  2021-06-23 10:05                 ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Hellwig @ 2021-06-23  7:21 UTC (permalink / raw)
  To: Coly Li
  Cc: Christoph Hellwig, Huang, Ying, Dan Williams, Jan Kara,
	Hannes Reinecke, linux-bcache, linux-block, Jianpeng Ma,
	Qiaowei Ren, axboe

On Wed, Jun 23, 2021 at 03:19:11PM +0800, Coly Li wrote:
> Bcache does not support endian clean indeed,

Then we need to fix that eventually rather than making it worse.  Which
means any _new_ data structure should start that way.

> and libnvdimm only works with
> 64bit physical address width.

Maybe it does right now.  But there is nothing fundamental in that, so
please don't design stupid on-disk formats that encode it; they are going
to come back to bite us sooner or later.  Be that by adding 32-bit support
for any Linux DAX device, or by new 96- or 128-bit CPUs.
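
To make that direction concrete, an endian-clean variant of such a
record would use fixed-width little-endian fields and namespace-relative
offsets instead of raw pointers; a rough sketch with hypothetical field
names, not a proposed format:

#include <linux/types.h>
#include <asm/byteorder.h>

/*
 * Sketch: every on-media field has a fixed width and a fixed byte
 * order, and linked structures are referenced by byte offsets from
 * the namespace start rather than by virtual addresses.
 */
struct bch_nvm_pgalloc_recs_ondisk {
	__le64	owner_offset;	/* offset of the owner head, not a pointer */
	__le64	next_offset;	/* offset of the next recs page, 0 == none */
	__u8	magic[16];
	__u8	owner_uuid[16];
	__le32	size;
	__le32	used;
};

/* ns_base is the virtual address the namespace is currently mapped at */
static inline void *recs_next(void *ns_base,
			      const struct bch_nvm_pgalloc_recs_ondisk *r)
{
	__u64 off = le64_to_cpu(r->next_offset);

	return off ? ns_base + off : NULL;
}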

> The only restriction here by using pointer is
> the CPU register word should be 64bits, because we use the NVDIMM as memory.
> 
> Is it one of the way how NVDIMM (especially Intel AEP) designed to use ?
> As a non-volatiled memory.

Not for on-disk data structures.

> Does the already mapped DAX base address change in runtime during memory
> hot plugable ?
> If not, it won't be a problem here for this specific use case.

It could change between one use and another.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/14] bcache: initialize the nvm pages allocator
  2021-06-23  5:26     ` Coly Li
@ 2021-06-23  9:16       ` Hannes Reinecke
  2021-06-23  9:34         ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-23  9:16 UTC (permalink / raw)
  To: Coly Li
  Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Randy Dunlap, Qiaowei Ren

On 6/23/21 7:26 AM, Coly Li wrote:
> On 6/22/21 6:39 PM, Hannes Reinecke wrote:
>> On 6/15/21 7:49 AM, Coly Li wrote:
>>> From: Jianpeng Ma <jianpeng.ma@intel.com>
>>>
>>> This patch define the prototype data structures in memory and
>>> initializes the nvm pages allocator.
>>>
>>> The nvm address space which is managed by this allocator can consist of
>>> many nvm namespaces, and some namespaces can compose into one nvm set,
>>> like cache set. For this initial implementation, only one set can be
>>> supported.
>>>
>>> The users of this nvm pages allocator need to call register_namespace()
>>> to register the nvdimm device (like /dev/pmemX) into this allocator as
>>> the instance of struct nvm_namespace.
>>>
>>> Reported-by: Randy Dunlap <rdunlap@infradead.org>
>>> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
>>> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
>>> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
>>> Signed-off-by: Coly Li <colyli@suse.de>
>>> ---
>>>   drivers/md/bcache/Kconfig     |  10 ++
>>>   drivers/md/bcache/Makefile    |   1 +
>>>   drivers/md/bcache/nvm-pages.c | 295 ++++++++++++++++++++++++++++++++++
>>>   drivers/md/bcache/nvm-pages.h |  74 +++++++++
>>>   drivers/md/bcache/super.c     |   3 +
>>>   5 files changed, 383 insertions(+)
>>>   create mode 100644 drivers/md/bcache/nvm-pages.c
>>>   create mode 100644 drivers/md/bcache/nvm-pages.h
>>>
>>> diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
>>> index d1ca4d059c20..a69f6c0e0507 100644
>>> --- a/drivers/md/bcache/Kconfig
>>> +++ b/drivers/md/bcache/Kconfig
>>> @@ -35,3 +35,13 @@ config BCACHE_ASYNC_REGISTRATION
>>>   	device path into this file will returns immediately and the real
>>>   	registration work is handled in kernel work queue in asynchronous
>>>   	way.
>>> +
>>> +config BCACHE_NVM_PAGES
>>> +	bool "NVDIMM support for bcache (EXPERIMENTAL)"
>>> +	depends on BCACHE
>>> +	depends on 64BIT
>>> +	depends on LIBNVDIMM
>>> +	depends on DAX
>>> +	help
>>> +	  Allocate/release NV-memory pages for bcache and provide allocated pages
>>> +	  for each requestor after system reboot.
>>> diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
>>> index 5b87e59676b8..2397bb7c7ffd 100644
>>> --- a/drivers/md/bcache/Makefile
>>> +++ b/drivers/md/bcache/Makefile
>>> @@ -5,3 +5,4 @@ obj-$(CONFIG_BCACHE)	+= bcache.o
>>>   bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
>>>   	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
>>>   	util.o writeback.o features.o
>>> +bcache-$(CONFIG_BCACHE_NVM_PAGES) += nvm-pages.o
>>> diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
>>> new file mode 100644
>>> index 000000000000..18fdadbc502f
>>> --- /dev/null
>>> +++ b/drivers/md/bcache/nvm-pages.c
>>> @@ -0,0 +1,295 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Nvdimm page-buddy allocator
>>> + *
>>> + * Copyright (c) 2021, Intel Corporation.
>>> + * Copyright (c) 2021, Qiaowei Ren <qiaowei.ren@intel.com>.
>>> + * Copyright (c) 2021, Jianpeng Ma <jianpeng.ma@intel.com>.
>>> + */
>>> +
>>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>>> +
>> No need for this 'if' statement as it'll be excluded by the Makefile
>> anyway if the config option isn't set.
> 
> Such if is necessary because stub routines are defined when
> CONFIG_BCACHE_NVM_PAGES is not defined, e.g.
> 
> 426 +#else
> 427 +
> 428 +static inline struct bch_nvm_namespace
> *bch_register_namespace(const char *dev_path)
> 429 +{
> 430 +       return NULL;
> 431 +}
> 432 +static inline int bch_nvm_init(void)
> 433 +{
> 434 +       return 0;
> 435 +}
> 436 +static inline void bch_nvm_exit(void) { }
> 437 +
> 438 +#endif /* CONFIG_BCACHE_NVM_PAGES */
> 
But then these stubs should be defined in the header file, not here.
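
Concretely, that would mean keeping only declarations and stubs in
nvm-pages.h, so the .c file needs no #if at all; a sketch paraphrasing
the stubs already posted in this series:

/* nvm-pages.h */
#if defined(CONFIG_BCACHE_NVM_PAGES)

struct bch_nvm_namespace *bch_register_namespace(const char *dev_path);
int bch_nvm_init(void);
void bch_nvm_exit(void);

#else /* CONFIG_BCACHE_NVM_PAGES */

/* Stubs used when the NVDIMM code is not built */
static inline struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
{
	return NULL;
}
static inline int bch_nvm_init(void)
{
	return 0;
}
static inline void bch_nvm_exit(void) { }

#endif /* CONFIG_BCACHE_NVM_PAGES */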

[ .. ]

>>> +			rc = -ENOMEM;
>>> +			goto unlock;
>>> +		}
>>> +	}
>>> +
>>> +	only_set->nss[ns->sb->this_namespace_nr] = ns;
>>> +
>>> +	/* Firstly attach */
>> Initial attach?
> 
> Will fix in next post.
> 
>>
>>> +	if ((unsigned long)ns->sb->owner_list_head == BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET) {
>>> +		struct bch_nvm_pages_owner_head *sys_owner_head;
>>> +		struct bch_nvm_pgalloc_recs *sys_pgalloc_recs;
>>> +
>>> +		ns->sb->owner_list_head = ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET;
>>> +		sys_pgalloc_recs = ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET;
>>> +
>>> +		sys_owner_head = &(ns->sb->owner_list_head->heads[0]);
>>> +		sys_owner_head->recs[0] = sys_pgalloc_recs;
>>> +		ns->sb->csum = csum_set(ns->sb);
>>> +
>> Hmm. You are trying to pick up the 'list_head' structure from NVM, right?
> 
> No, this is not READ, it's WRITE onto NVDIMM.
> 
> sys_owner_head points to NVDIMM, since ns->sb->owner_list_head is updated,
> the checksum of ns->sb should be updated to new value onto the NVDIMM.
> This is what the above line does.
> 
> 
Ah, right.

>>
>> In doing so, don't you need to validate the structure (eg by checking
>> the checksum) before doing so to ensure that the contents are valid?
> 
> The check sum checking for READ is done in read_nvdimm_meta_super() in
> following lines,
> 198 +       r = -EINVAL;
> 199 +       expected_csum = csum_set(sb);
> 200 +       if (expected_csum != sb->csum) {
> 201 +               pr_info("csum is not match with expected one\n");
> 202 +               goto put_page;
> 203 +       }
> 
> Once thing to note is, currently all NVDIMM update is not power failure
> considered. This is the next big task to do after the first small code
> base merged.
> 
> 
>>> +		sys_pgalloc_recs->owner = sys_owner_head;
>>> +	} else
>>> +		BUG_ON(ns->sb->owner_list_head !=
>>> +			(ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET));
>>> +
>>> +unlock:
>>> +	mutex_unlock(&only_set->lock);
>>> +	return rc;
>>> +}
>>> +
>>> +static int read_nvdimm_meta_super(struct block_device *bdev,
>>> +			      struct bch_nvm_namespace *ns)
>>> +{
>>> +	struct page *page;
>>> +	struct bch_nvm_pages_sb *sb;
>>> +	int r = 0;
>>> +	uint64_t expected_csum = 0;
>>> +
>>> +	page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
>>> +			BCH_NVM_PAGES_SB_OFFSET >> PAGE_SHIFT, GFP_KERNEL);
>>> +
>>> +	if (IS_ERR(page))
>>> +		return -EIO;
>>> +
>>> +	sb = (struct bch_nvm_pages_sb *)(page_address(page) +
>>> +					offset_in_page(BCH_NVM_PAGES_SB_OFFSET));
>>> +	r = -EINVAL;
>>> +	expected_csum = csum_set(sb);
>>> +	if (expected_csum != sb->csum) {
>>> +		pr_info("csum is not match with expected one\n");
>>> +		goto put_page;
>>> +	}
>>> +
>>> +	if (memcmp(sb->magic, bch_nvm_pages_magic, 16)) {
>>> +		pr_info("invalid bch_nvm_pages_magic\n");
>>> +		goto put_page;
>>> +	}
>>> +
>>> +	if (sb->total_namespaces_nr != 1) {
>>> +		pr_info("currently only support one nvm device\n");
>>> +		goto put_page;
>>> +	}
>>> +
>>> +	if (sb->sb_offset != BCH_NVM_PAGES_SB_OFFSET) {
>>> +		pr_info("invalid superblock offset\n");
>>> +		goto put_page;
>>> +	}
>>> +
>>> +	r = 0;
>>> +	/* temporary use for DAX API */
>>> +	ns->page_size = sb->page_size;
>>> +	ns->pages_total = sb->pages_total;
>>> +
>>> +put_page:
>>> +	put_page(page);
>>> +	return r;
>>> +}
>>> +
>>> +struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
>>> +{
>>> +	struct bch_nvm_namespace *ns;
>>> +	int err;
>>> +	pgoff_t pgoff;
>>> +	char buf[BDEVNAME_SIZE];
>>> +	struct block_device *bdev;
>>> +	int id;
>>> +	char *path = NULL;
>>> +
>>> +	path = kstrndup(dev_path, 512, GFP_KERNEL);
>>> +	if (!path) {
>>> +		pr_err("kstrndup failed\n");
>>> +		return ERR_PTR(-ENOMEM);
>>> +	}
>>> +
>>> +	bdev = blkdev_get_by_path(strim(path),
>>> +				  FMODE_READ|FMODE_WRITE|FMODE_EXEC,
>>> +				  only_set);
>>> +	if (IS_ERR(bdev)) {
>>> +		pr_info("get %s error: %ld\n", dev_path, PTR_ERR(bdev));
>>> +		kfree(path);
>>> +		return ERR_PTR(PTR_ERR(bdev));
>>> +	}
>>> +
>>> +	err = -ENOMEM;
>>> +	ns = kzalloc(sizeof(struct bch_nvm_namespace), GFP_KERNEL);
>>> +	if (!ns)
>>> +		goto bdput;
>>> +
>>> +	err = -EIO;
>>> +	if (read_nvdimm_meta_super(bdev, ns)) {
>>> +		pr_info("%s read nvdimm meta super block failed.\n",
>>> +			bdevname(bdev, buf));
>>> +		goto free_ns;
>>> +	}
>>> +
>>> +	err = -EOPNOTSUPP;
>>> +	if (!bdev_dax_supported(bdev, ns->page_size)) {
>>> +		pr_info("%s don't support DAX\n", bdevname(bdev, buf));
>>> +		goto free_ns;
>>> +	}
>>> +
>>> +	err = -EINVAL;
>>> +	if (bdev_dax_pgoff(bdev, 0, ns->page_size, &pgoff)) {
>>> +		pr_info("invalid offset of %s\n", bdevname(bdev, buf));
>>> +		goto free_ns;
>>> +	}
>>> +
>>> +	err = -ENOMEM;
>>> +	ns->dax_dev = fs_dax_get_by_bdev(bdev);
>>> +	if (!ns->dax_dev) {
>>> +		pr_info("can't by dax device by %s\n", bdevname(bdev, buf));
>>> +		goto free_ns;
>>> +	}
>>> +
>>> +	err = -EINVAL;
>>> +	id = dax_read_lock();
>>> +	if (dax_direct_access(ns->dax_dev, pgoff, ns->pages_total,
>>> +			      &ns->kaddr, &ns->start_pfn) <= 0) {
>>> +		pr_info("dax_direct_access error\n");
>>> +		dax_read_unlock(id);
>>> +		goto free_ns;
>>> +	}
>>> +	dax_read_unlock(id);
>>> +
>>> +	ns->sb = ns->kaddr + BCH_NVM_PAGES_SB_OFFSET;
>>> +
>> You already read the superblock in read_nvdimm_meta_super(), right?
>> Wouldn't it be better to first do the 'dax_direct_access()' call, and
>> then check the superblock?
>> That way you'll ensure that dax_direct_access()' did the right thing;
>> with the current code you are using two different methods of accessing
>> the superblock, which theoretically can result in one method succeeding,
>> the other not ...
> 
> We have to do it. Because the mapping size of dax_direct_access() is
> from ns->pages_total, it is stored on NVDIMM. Before calling
> dax_direct_acess()
> we need to make sure ns is an valid super block stored on NVDIMM, then
> we can
> trust value of ns->pages_total to do the DAX mapping.
> 
> Another method is firstly mapping a small fixed range (e.g. 1GB), and
> check whether the super block on NVDIMM is valid. If yes, re-map the
> whole space indicated by ns->pages_total. But the two-times accessing cannot
> be avoided.
> 
Oh, indeed, you are correct. Disregard my comments here.

>>> +	err = -EINVAL;
>>> +	/* Check magic again to make sure DAX mapping is correct */
>>> +	if (memcmp(ns->sb->magic, bch_nvm_pages_magic, 16)) {
>>> +		pr_info("invalid bch_nvm_pages_magic after DAX mapping\n");
>>> +		goto free_ns;
>>> +	}
>>> +
>>> +	err = attach_nvm_set(ns);
>>> +	if (err < 0)
>>> +		goto free_ns;
>>> +
>>> +	ns->page_size = ns->sb->page_size;
>>> +	ns->pages_offset = ns->sb->pages_offset;
>>> +	ns->pages_total = ns->sb->pages_total;
>>> +	ns->free = 0;
>>> +	ns->bdev = bdev;
>>> +	ns->nvm_set = only_set;
>>> +	mutex_init(&ns->lock);
>>> +
>>> +	if (ns->sb->this_namespace_nr == 0) {
>>> +		pr_info("only first namespace contain owner info\n");
>>> +		err = init_owner_info(ns);
>>> +		if (err < 0) {
>>> +			pr_info("init_owner_info met error %d\n", err);
>>> +			only_set->nss[ns->sb->this_namespace_nr] = NULL;
>>> +			goto free_ns;
>>> +		}
>>> +	}
>>> +
>>> +	kfree(path);
>>> +	return ns;
>>> +free_ns:
>>> +	kfree(ns);
>>> +bdput:
>>> +	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
>>> +	kfree(path);
>>> +	return ERR_PTR(err);
>>> +}
>>> +EXPORT_SYMBOL_GPL(bch_register_namespace);
>>> +
>>> +int __init bch_nvm_init(void)
>>> +{
>>> +	only_set = kzalloc(sizeof(*only_set), GFP_KERNEL);
>>> +	if (!only_set)
>>> +		return -ENOMEM;
>>> +
>>> +	only_set->total_namespaces_nr = 0;
>>> +	only_set->owner_list_head = NULL;
>>> +	only_set->nss = NULL;
>>> +
>>> +	mutex_init(&only_set->lock);
>>> +
>>> +	pr_info("bcache nvm init\n");
>>> +	return 0;
>>> +}
>>> +
>>> +void bch_nvm_exit(void)
>>> +{
>>> +	release_nvm_set(only_set);
>>> +	pr_info("bcache nvm exit\n");
>>> +}
>>> +
>>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>>> diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
>>> new file mode 100644
>>> index 000000000000..3e24c4dee7fd
>>> --- /dev/null
>>> +++ b/drivers/md/bcache/nvm-pages.h
>>> @@ -0,0 +1,74 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +
>>> +#ifndef _BCACHE_NVM_PAGES_H
>>> +#define _BCACHE_NVM_PAGES_H
>>> +
>>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>>> +#include <linux/bcache-nvm.h>
>>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>>> +
>> Hmm? What is that doing here?
>> Please move it into the source file.
> 
> This is temporary before the whole NVDIMM support for bcache is completed.
> 
> drivers/md/bcache/nvm-pages.h has to be included because there are still
> stub
> routines in this header. Such stub routines like,
> 
> +static inline struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
> +{
> +	return NULL;
> +}
> 
> will be removed after the whole code completed and merged. So currently we have to
> do this to make sure the NVDIMM related code won't be leaked out if such experimental
> configure is not enabled.
> 
> 
> Thanks for your review. The addressed issue will be fixed and updated in next post.
> 
Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 11/14] bcache: initialize bcache journal for NVDIMM meta device
  2021-06-23  6:17     ` Coly Li
@ 2021-06-23  9:20       ` Hannes Reinecke
  2021-06-23 10:14         ` Coly Li
  0 siblings, 1 reply; 60+ messages in thread
From: Hannes Reinecke @ 2021-06-23  9:20 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/23/21 8:17 AM, Coly Li wrote:
> On 6/22/21 7:01 PM, Hannes Reinecke wrote:
>> On 6/15/21 7:49 AM, Coly Li wrote:
>>> The nvm-pages allocator may store and index the NVDIMM pages allocated
>>> for bcache journal. This patch adds the initialization to store bcache
>>> journal space on NVDIMM pages if BCH_FEATURE_INCOMPAT_NVDIMM_META bit is
>>> set by bcache-tools.
>>>
>>> If BCH_FEATURE_INCOMPAT_NVDIMM_META is set, get_nvdimm_journal_space()
>>> will return the linear address of NVDIMM pages for bcache journal,
>>> - If there is previously allocated space, find it from nvm-pages owner
>>>    list and return to bch_journal_init().
>>> - If there is no previously allocated space, require a new NVDIMM range
>>>    from the nvm-pages allocator, and return it to bch_journal_init().
>>>
>>> And in bch_journal_init(), keys in sb.d[] store the corresponding linear
>>> address from NVDIMM into sb.d[i].ptr[0] where 'i' is the bucket index to
>>> iterate all journal buckets.
>>>
>>> Later when bcache journaling code stores the journaling jset, the target
>>> NVDIMM linear address stored (and updated) in sb.d[i].ptr[0] can be used
>>> directly in memory copy from DRAM pages into NVDIMM pages.
>>>
>>> Signed-off-by: Coly Li <colyli@suse.de>
>>> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
>>> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
>>> ---
>>>   drivers/md/bcache/journal.c | 105 ++++++++++++++++++++++++++++++++++++
>>>   drivers/md/bcache/journal.h |   2 +-
>>>   drivers/md/bcache/super.c   |  16 +++---
>>>   3 files changed, 115 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>>> index 61bd79babf7a..32599d2ff5d2 100644
>>> --- a/drivers/md/bcache/journal.c
>>> +++ b/drivers/md/bcache/journal.c
>>> @@ -9,6 +9,8 @@
>>>   #include "btree.h"
>>>   #include "debug.h"
>>>   #include "extents.h"
>>> +#include "nvm-pages.h"
>>> +#include "features.h"
>>>   
>>>   #include <trace/events/bcache.h>
>>>   
>>> @@ -982,3 +984,106 @@ int bch_journal_alloc(struct cache_set *c)
>>>   
>>>   	return 0;
>>>   }
>>> +
>>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>>> +
>>> +static void *find_journal_nvm_base(struct bch_nvm_pages_owner_head *owner_list,
>>> +				   struct cache *ca)
>>> +{
>>> +	unsigned long addr = 0;
>>> +	struct bch_nvm_pgalloc_recs *recs_list = owner_list->recs[0];
>>> +
>>> +	while (recs_list) {
>>> +		struct bch_pgalloc_rec *rec;
>>> +		unsigned long jnl_pgoff;
>>> +		int i;
>>> +
>>> +		jnl_pgoff = ((unsigned long)ca->sb.d[0]) >> PAGE_SHIFT;
>>> +		rec = recs_list->recs;
>>> +		for (i = 0; i < recs_list->used; i++) {
>>> +			if (rec->pgoff == jnl_pgoff)
>>> +				break;
>>> +			rec++;
>>> +		}
>>> +		if (i < recs_list->used) {
>>> +			addr = rec->pgoff << PAGE_SHIFT;
>>> +			break;
>>> +		}
>>> +		recs_list = recs_list->next;
>>> +	}
>>> +	return (void *)addr;
>>> +}
>>> +
>>> +static void *get_nvdimm_journal_space(struct cache *ca)
>>> +{
>>> +	struct bch_nvm_pages_owner_head *owner_list = NULL;
>>> +	void *ret = NULL;
>>> +	int order;
>>> +
>>> +	owner_list = bch_get_allocated_pages(ca->sb.set_uuid);
>>> +	if (owner_list) {
>>> +		ret = find_journal_nvm_base(owner_list, ca);
>>> +		if (ret)
>>> +			goto found;
>>> +	}
>>> +
>>> +	order = ilog2(ca->sb.bucket_size *
>>> +		      ca->sb.njournal_buckets / PAGE_SECTORS);
>>> +	ret = bch_nvm_alloc_pages(order, ca->sb.set_uuid);
>>> +	if (ret)
>>> +		memset(ret, 0, (1 << order) * PAGE_SIZE);
>>> +
>>> +found:
>>> +	return ret;
>>> +}
>>> +
>>> +static int __bch_journal_nvdimm_init(struct cache *ca)
>>> +{
>>> +	int i, ret = 0;
>>> +	void *journal_nvm_base = NULL;
>>> +
>>> +	journal_nvm_base = get_nvdimm_journal_space(ca);
>>> +	if (!journal_nvm_base) {
>>> +		pr_err("Failed to get journal space from nvdimm\n");
>>> +		ret = -1;
>>> +		goto out;
>>> +	}
>>> +
>>> +	/* Iniialized and reloaded from on-disk super block already */
>>> +	if (ca->sb.d[0] != 0)
>>> +		goto out;
>>> +
>>> +	for (i = 0; i < ca->sb.keys; i++)
>>> +		ca->sb.d[i] =
>>> +			(u64)(journal_nvm_base + (ca->sb.bucket_size * i));
>>> +
>>> +out:
>>> +	return ret;
>>> +}
>>> +
>>> +#else /* CONFIG_BCACHE_NVM_PAGES */
>>> +
>>> +static int __bch_journal_nvdimm_init(struct cache *ca)
>>> +{
>>> +	return -1;
>>> +}
>>> +
>>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>>> +
>>> +int bch_journal_init(struct cache_set *c)
>>> +{
>>> +	int i, ret = 0;
>>> +	struct cache *ca = c->cache;
>>> +
>>> +	ca->sb.keys = clamp_t(int, ca->sb.nbuckets >> 7,
>>> +				2, SB_JOURNAL_BUCKETS);
>>> +
>>> +	if (!bch_has_feature_nvdimm_meta(&ca->sb)) {
>>> +		for (i = 0; i < ca->sb.keys; i++)
>>> +			ca->sb.d[i] = ca->sb.first_bucket + i;
>>> +	} else {
>>> +		ret = __bch_journal_nvdimm_init(ca);
>>> +	}
>>> +
>>> +	return ret;
>>> +}
>>> diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
>>> index f2ea34d5f431..e3a7fa5a8fda 100644
>>> --- a/drivers/md/bcache/journal.h
>>> +++ b/drivers/md/bcache/journal.h
>>> @@ -179,7 +179,7 @@ void bch_journal_mark(struct cache_set *c, struct list_head *list);
>>>   void bch_journal_meta(struct cache_set *c, struct closure *cl);
>>>   int bch_journal_read(struct cache_set *c, struct list_head *list);
>>>   int bch_journal_replay(struct cache_set *c, struct list_head *list);
>>> -
>>> +int bch_journal_init(struct cache_set *c);
>>>   void bch_journal_free(struct cache_set *c);
>>>   int bch_journal_alloc(struct cache_set *c);
>>>   
>>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>>> index ce22aefb1352..cce0f6bf0944 100644
>>> --- a/drivers/md/bcache/super.c
>>> +++ b/drivers/md/bcache/super.c
>>> @@ -147,10 +147,15 @@ static const char *read_super_common(struct cache_sb *sb,  struct block_device *
>>>   		goto err;
>>>   
>>>   	err = "Journal buckets not sequential";
>>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>>> +	if (!bch_has_feature_nvdimm_meta(sb)) {
>>> +#endif
>>>   	for (i = 0; i < sb->keys; i++)
>>>   		if (sb->d[i] != sb->first_bucket + i)
>>>   			goto err;
>>> -
>>> +#ifdef CONFIG_BCACHE_NVM_PAGES
>>> +	} /* bch_has_feature_nvdimm_meta */
>>> +#endif
>>>   	err = "Too many journal buckets";
>>>   	if (sb->first_bucket + sb->keys > sb->nbuckets)
>>>   		goto err;
>> Extremely awkward.
> 
> After the feature settled and not marked as EXPERIMENTAL, such condition
> code will be removed.
> 
> 
>> Make 'bch_has_feature_nvdimm_meta()' generally available, and have it
>> return 'false' if the config feature isn't enabled.
> 
> bch_has_feature_nvdimm_meta() is defined as,
> 
> 
>   41 #define BCH_FEATURE_COMPAT_FUNCS(name, flagname) \
>   42 static inline int bch_has_feature_##name(struct cache_sb *sb) \
>   43 { \
>   44         if (sb->version < BCACHE_SB_VERSION_CDEV_WITH_FEATURES) \
>   45                 return 0; \
>   46         return (((sb)->feature_compat & \
>   47                 BCH##_FEATURE_COMPAT_##flagname) != 0); \
>   48 } \
> 
> It is not easy to check a specific Kconfig item in the above code block,
> this is why
> we choose the compiling condition to disable nvdimm related code here,
> before we remove
> the EXPERIMENTAL mark in Kconfig.
> 
But you can have the flag defined in general (i.e. outside the config
ifdefs), and only set it if the code is enabled, right?
Then the check will always work and do what we want, or?
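
A sketch of that approach in features.h, assuming the existing
BCH_FEATURE_INCOMPAT_FUNCS() generator macro (the #else stub is
hypothetical, not the posted code):

#if defined(CONFIG_BCACHE_NVM_PAGES)

/* Generates bch_has_feature_nvdimm_meta() and friends as today */
BCH_FEATURE_INCOMPAT_FUNCS(nvdimm_meta, NVDIMM_META);

#else /* CONFIG_BCACHE_NVM_PAGES */

/* Always false when the NVDIMM code is not built in */
static inline int bch_has_feature_nvdimm_meta(struct cache_sb *sb)
{
	return 0;
}

#endif /* CONFIG_BCACHE_NVM_PAGES */

With that, the #if around the journal bucket check in read_super_common()
could be dropped, since the compiler would eliminate the dead branch.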

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/14] bcache: initialize the nvm pages allocator
  2021-06-23  9:16       ` Hannes Reinecke
@ 2021-06-23  9:34         ` Coly Li
  0 siblings, 0 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23  9:34 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Randy Dunlap, Qiaowei Ren

On 6/23/21 5:16 PM, Hannes Reinecke wrote:
> On 6/23/21 7:26 AM, Coly Li wrote:
>> On 6/22/21 6:39 PM, Hannes Reinecke wrote:
>>> On 6/15/21 7:49 AM, Coly Li wrote:
>>>> From: Jianpeng Ma <jianpeng.ma@intel.com>
>>>>
>>>> This patch define the prototype data structures in memory and
>>>> initializes the nvm pages allocator.
>>>>
>>>> The nvm address space managed by this allocator can consist of many
>>>> nvm namespaces, and several namespaces can be composed into one nvm
>>>> set, like a cache set. For this initial implementation, only one set
>>>> is supported.
>>>>
>>>> The users of this nvm pages allocator need to call
>>>> register_namespace()
>>>> to register the nvdimm device (like /dev/pmemX) into this allocator as
>>>> the instance of struct nvm_namespace.
>>>>
>>>> Reported-by: Randy Dunlap <rdunlap@infradead.org>
>>>> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
>>>> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
>>>> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
>>>> Signed-off-by: Coly Li <colyli@suse.de>
>>>> ---
>>>>   drivers/md/bcache/Kconfig     |  10 ++
>>>>   drivers/md/bcache/Makefile    |   1 +
>>>>   drivers/md/bcache/nvm-pages.c | 295
>>>> ++++++++++++++++++++++++++++++++++
>>>>   drivers/md/bcache/nvm-pages.h |  74 +++++++++
>>>>   drivers/md/bcache/super.c     |   3 +
>>>>   5 files changed, 383 insertions(+)
>>>>   create mode 100644 drivers/md/bcache/nvm-pages.c
>>>>   create mode 100644 drivers/md/bcache/nvm-pages.h
>>>>
>>>> diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
>>>> index d1ca4d059c20..a69f6c0e0507 100644
>>>> --- a/drivers/md/bcache/Kconfig
>>>> +++ b/drivers/md/bcache/Kconfig
>>>> @@ -35,3 +35,13 @@ config BCACHE_ASYNC_REGISTRATION
>>>>       device path into this file will returns immediately and the real
>>>>       registration work is handled in kernel work queue in
>>>> asynchronous
>>>>       way.
>>>> +
>>>> +config BCACHE_NVM_PAGES
>>>> +    bool "NVDIMM support for bcache (EXPERIMENTAL)"
>>>> +    depends on BCACHE
>>>> +    depends on 64BIT
>>>> +    depends on LIBNVDIMM
>>>> +    depends on DAX
>>>> +    help
>>>> +      Allocate/release NV-memory pages for bcache and provide
>>>> allocated pages
>>>> +      for each requestor after system reboot.
>>>> diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
>>>> index 5b87e59676b8..2397bb7c7ffd 100644
>>>> --- a/drivers/md/bcache/Makefile
>>>> +++ b/drivers/md/bcache/Makefile
>>>> @@ -5,3 +5,4 @@ obj-$(CONFIG_BCACHE)    += bcache.o
>>>>   bcache-y        := alloc.o bset.o btree.o closure.o debug.o
>>>> extents.o\
>>>>       io.o journal.o movinggc.o request.o stats.o super.o sysfs.o
>>>> trace.o\
>>>>       util.o writeback.o features.o
>>>> +bcache-$(CONFIG_BCACHE_NVM_PAGES) += nvm-pages.o
>>>> diff --git a/drivers/md/bcache/nvm-pages.c
>>>> b/drivers/md/bcache/nvm-pages.c
>>>> new file mode 100644
>>>> index 000000000000..18fdadbc502f
>>>> --- /dev/null
>>>> +++ b/drivers/md/bcache/nvm-pages.c
>>>> @@ -0,0 +1,295 @@
>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>> +/*
>>>> + * Nvdimm page-buddy allocator
>>>> + *
>>>> + * Copyright (c) 2021, Intel Corporation.
>>>> + * Copyright (c) 2021, Qiaowei Ren <qiaowei.ren@intel.com>.
>>>> + * Copyright (c) 2021, Jianpeng Ma <jianpeng.ma@intel.com>.
>>>> + */
>>>> +
>>>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>>>> +
>>> No need for this 'if' statement as it'll be excluded by the Makefile
>>> anyway if the config option isn't set.
>>
>> Such an #if is necessary because stub routines are defined when
>> CONFIG_BCACHE_NVM_PAGES is not defined, e.g.
>>
>> 426 +#else
>> 427 +
>> 428 +static inline struct bch_nvm_namespace
>> *bch_register_namespace(const char *dev_path)
>> 429 +{
>> 430 +       return NULL;
>> 431 +}
>> 432 +static inline int bch_nvm_init(void)
>> 433 +{
>> 434 +       return 0;
>> 435 +}
>> 436 +static inline void bch_nvm_exit(void) { }
>> 437 +
>> 438 +#endif /* CONFIG_BCACHE_NVM_PAGES */
>>
> But then these stubs should be defined in the header file, not here.
>
> [ .. ]

Copied, it will be improved in the next post. Thanks for your review and
comments.
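
For reference, a sketch of the arrangement Hannes suggests (an assumed
layout, not the code that was eventually posted): the
CONFIG_BCACHE_NVM_PAGES guard and the stubs live together in nvm-pages.h,
so nvm-pages.c no longer needs to wrap its whole body in an #if.

/* nvm-pages.h */
#if defined(CONFIG_BCACHE_NVM_PAGES)

struct bch_nvm_namespace *bch_register_namespace(const char *dev_path);
int bch_nvm_init(void);
void bch_nvm_exit(void);

#else /* CONFIG_BCACHE_NVM_PAGES */

static inline struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
{
	return NULL;
}
static inline int bch_nvm_init(void)
{
	return 0;
}
static inline void bch_nvm_exit(void) { }

#endif /* CONFIG_BCACHE_NVM_PAGES */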

Coly Li


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages)
  2021-06-23  7:21               ` Christoph Hellwig
@ 2021-06-23 10:05                 ` Coly Li
  2021-06-23 11:16                   ` Coly Li
  2021-06-23 11:49                   ` Christoph Hellwig
  0 siblings, 2 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23 10:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Huang, Ying, Dan Williams, Jan Kara, Hannes Reinecke,
	linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, axboe

On 6/23/21 3:21 PM, Christoph Hellwig wrote:
> On Wed, Jun 23, 2021 at 03:19:11PM +0800, Coly Li wrote:
>> Indeed, bcache is not endian clean,
> Then we need to fix that eventually rather than making it worse.  Which
> means any _new_ data structure should start that way.

The cache device (typically an SSD) of bcache is designed to be dedicated
to a single local machine. Any storage migration between machines with
different endianness should first flush the dirty data to the backing
hard drive. The bcache metadata on the cache device is designed on the
assumption that dirty cache is NOT moved between machines with different
endianness; in practice there is no such use case, and it is not
supported by any Linux distribution.

Not supporting a different endianness is by design, so why should we fix
it for a use case that does not exist?

BTW, this discussion only concerns the cache device, because the bcache
metadata is stored on it. For the backing hard drive, endianness is
transparent to bcache and decided by upper-layer code such as the file
system or user-space applications, so it is fully endian clean.
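
For context, the endian-clean layout being asked for above usually means
fixed-endian on-media types with explicit conversion at every access; a
small illustrative sketch (not bcache's actual on-media format) could look
like this:

#include <linux/types.h>
#include <asm/byteorder.h>

/*
 * An endian-clean on-media record: fields are stored little-endian
 * regardless of the host CPU, so the same bytes are interpreted
 * identically on big- and little-endian machines.
 */
struct example_nvm_rec {
	__le64	pgoff;
	__le32	nr_pages;
	__le32	reserved;
};

static inline u64 example_rec_pgoff(const struct example_nvm_rec *rec)
{
	return le64_to_cpu(rec->pgoff);
}

static inline void example_rec_set_pgoff(struct example_nvm_rec *rec, u64 pgoff)
{
	rec->pgoff = cpu_to_le64(pgoff);
}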


>> and libnvdimm only works with
>> 64bit physical address width.
> Maybe it does right now.  But there is nothing fundamental in that, so
> please don't design stupid on-disk formats that encode it and are going
> to come back to bite us sooner or later.  Be that by adding 32-bit
> support for any Linux DAX device, or by new 96- or 128-bit CPUs.

This is an unfair restriction :-)
The nvdimm support for bcache heavily depends on libnvdimm; that is, we
should follow up on every condition that libnvdimm supports. But
requiring us to support conditions that even libnvdimm does not support
yet is too early at this stage.

And if libnvdimm (not DAX) ever supports 32-bit or new 96- or 128-bit
CPUs, then considering that the data structures are arrays and singly
linked lists, it won't be too complicated to follow up.

>> The only restriction here from using pointers is that the CPU register
>> word should be 64 bits, because we use the NVDIMM as memory.
>>
>> Isn't this one of the ways NVDIMM (especially Intel AEP) is designed to
>> be used? As non-volatile memory.
> Not for on-disk data structures.

This is not an on-disk data structure. We use the NVDIMM as memory, and
access the internal data structures the same way the existing code does
with DRAM.

Please allow us to give this might-be-different idea a serious try.

>> Does the already mapped DAX base address change at runtime during memory
>> hot plugging?
>> If not, it won't be a problem here for this specific use case.
> It could change between one use and another.

Hmm, I don't understand the implicit meaning of the above line.
Could you please offer a detailed example?


Thank you for looking at this and providing valuable comments. None of
the responses above is meant as argument or stubbornness; I do want to
reach a clear understanding through this discussion with you, so that we
won't regret the current design in the future.

Coly Li

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 11/14] bcache: initialize bcache journal for NVDIMM meta device
  2021-06-23  9:20       ` Hannes Reinecke
@ 2021-06-23 10:14         ` Coly Li
  0 siblings, 0 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23 10:14 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-bcache, axboe, linux-block, Jianpeng Ma, Qiaowei Ren

On 6/23/21 5:20 PM, Hannes Reinecke wrote:
> On 6/23/21 8:17 AM, Coly Li wrote:
>> On 6/22/21 7:01 PM, Hannes Reinecke wrote:
>>> On 6/15/21 7:49 AM, Coly Li wrote:
>>>> The nvm-pages allocator may store and index the NVDIMM pages allocated
>>>> for bcache journal. This patch adds the initialization to store bcache
>>>> journal space on NVDIMM pages if BCH_FEATURE_INCOMPAT_NVDIMM_META
>>>> bit is
>>>> set by bcache-tools.
>>>>
>>>> If BCH_FEATURE_INCOMPAT_NVDIMM_META is set, get_nvdimm_journal_space()
>>>> will return the linear address of NVDIMM pages for bcache journal,
>>>> - If there is previously allocated space, find it from nvm-pages owner
>>>>    list and return to bch_journal_init().
>>>> - If there is no previously allocated space, require a new NVDIMM
>>>> range
>>>>    from the nvm-pages allocator, and return it to bch_journal_init().
>>>>
>>>> And in bch_journal_init(), keys in sb.d[] store the corresponding
>>>> linear
>>>> address from NVDIMM into sb.d[i].ptr[0] where 'i' is the bucket
>>>> index to
>>>> iterate all journal buckets.
>>>>
>>>> Later when bcache journaling code stores the journaling jset, the
>>>> target
>>>> NVDIMM linear address stored (and updated) in sb.d[i].ptr[0] can be
>>>> used
>>>> directly in memory copy from DRAM pages into NVDIMM pages.
>>>>
>>>> Signed-off-by: Coly Li <colyli@suse.de>
>>>> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
>>>> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
>>>> ---
>>>>   drivers/md/bcache/journal.c | 105
>>>> ++++++++++++++++++++++++++++++++++++
>>>>   drivers/md/bcache/journal.h |   2 +-
>>>>   drivers/md/bcache/super.c   |  16 +++---
>>>>   3 files changed, 115 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>>>> index 61bd79babf7a..32599d2ff5d2 100644
>>>> --- a/drivers/md/bcache/journal.c
>>>> +++ b/drivers/md/bcache/journal.c
>>>> @@ -9,6 +9,8 @@
>>>>   #include "btree.h"
>>>>   #include "debug.h"
>>>>   #include "extents.h"
>>>> +#include "nvm-pages.h"
>>>> +#include "features.h"
>>>>     #include <trace/events/bcache.h>
>>>>   @@ -982,3 +984,106 @@ int bch_journal_alloc(struct cache_set *c)
>>>>         return 0;
>>>>   }
>>>> +
>>>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>>>> +
>>>> +static void *find_journal_nvm_base(struct bch_nvm_pages_owner_head
>>>> *owner_list,
>>>> +                   struct cache *ca)
>>>> +{
>>>> +    unsigned long addr = 0;
>>>> +    struct bch_nvm_pgalloc_recs *recs_list = owner_list->recs[0];
>>>> +
>>>> +    while (recs_list) {
>>>> +        struct bch_pgalloc_rec *rec;
>>>> +        unsigned long jnl_pgoff;
>>>> +        int i;
>>>> +
>>>> +        jnl_pgoff = ((unsigned long)ca->sb.d[0]) >> PAGE_SHIFT;
>>>> +        rec = recs_list->recs;
>>>> +        for (i = 0; i < recs_list->used; i++) {
>>>> +            if (rec->pgoff == jnl_pgoff)
>>>> +                break;
>>>> +            rec++;
>>>> +        }
>>>> +        if (i < recs_list->used) {
>>>> +            addr = rec->pgoff << PAGE_SHIFT;
>>>> +            break;
>>>> +        }
>>>> +        recs_list = recs_list->next;
>>>> +    }
>>>> +    return (void *)addr;
>>>> +}
>>>> +
>>>> +static void *get_nvdimm_journal_space(struct cache *ca)
>>>> +{
>>>> +    struct bch_nvm_pages_owner_head *owner_list = NULL;
>>>> +    void *ret = NULL;
>>>> +    int order;
>>>> +
>>>> +    owner_list = bch_get_allocated_pages(ca->sb.set_uuid);
>>>> +    if (owner_list) {
>>>> +        ret = find_journal_nvm_base(owner_list, ca);
>>>> +        if (ret)
>>>> +            goto found;
>>>> +    }
>>>> +
>>>> +    order = ilog2(ca->sb.bucket_size *
>>>> +              ca->sb.njournal_buckets / PAGE_SECTORS);
>>>> +    ret = bch_nvm_alloc_pages(order, ca->sb.set_uuid);
>>>> +    if (ret)
>>>> +        memset(ret, 0, (1 << order) * PAGE_SIZE);
>>>> +
>>>> +found:
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static int __bch_journal_nvdimm_init(struct cache *ca)
>>>> +{
>>>> +    int i, ret = 0;
>>>> +    void *journal_nvm_base = NULL;
>>>> +
>>>> +    journal_nvm_base = get_nvdimm_journal_space(ca);
>>>> +    if (!journal_nvm_base) {
>>>> +        pr_err("Failed to get journal space from nvdimm\n");
>>>> +        ret = -1;
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    /* Initialized and reloaded from on-disk super block already */
>>>> +    if (ca->sb.d[0] != 0)
>>>> +        goto out;
>>>> +
>>>> +    for (i = 0; i < ca->sb.keys; i++)
>>>> +        ca->sb.d[i] =
>>>> +            (u64)(journal_nvm_base + (ca->sb.bucket_size * i));
>>>> +
>>>> +out:
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +#else /* CONFIG_BCACHE_NVM_PAGES */
>>>> +
>>>> +static int __bch_journal_nvdimm_init(struct cache *ca)
>>>> +{
>>>> +    return -1;
>>>> +}
>>>> +
>>>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>>>> +
>>>> +int bch_journal_init(struct cache_set *c)
>>>> +{
>>>> +    int i, ret = 0;
>>>> +    struct cache *ca = c->cache;
>>>> +
>>>> +    ca->sb.keys = clamp_t(int, ca->sb.nbuckets >> 7,
>>>> +                2, SB_JOURNAL_BUCKETS);
>>>> +
>>>> +    if (!bch_has_feature_nvdimm_meta(&ca->sb)) {
>>>> +        for (i = 0; i < ca->sb.keys; i++)
>>>> +            ca->sb.d[i] = ca->sb.first_bucket + i;
>>>> +    } else {
>>>> +        ret = __bch_journal_nvdimm_init(ca);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
>>>> index f2ea34d5f431..e3a7fa5a8fda 100644
>>>> --- a/drivers/md/bcache/journal.h
>>>> +++ b/drivers/md/bcache/journal.h
>>>> @@ -179,7 +179,7 @@ void bch_journal_mark(struct cache_set *c,
>>>> struct list_head *list);
>>>>   void bch_journal_meta(struct cache_set *c, struct closure *cl);
>>>>   int bch_journal_read(struct cache_set *c, struct list_head *list);
>>>>   int bch_journal_replay(struct cache_set *c, struct list_head *list);
>>>> -
>>>> +int bch_journal_init(struct cache_set *c);
>>>>   void bch_journal_free(struct cache_set *c);
>>>>   int bch_journal_alloc(struct cache_set *c);
>>>>   diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>>>> index ce22aefb1352..cce0f6bf0944 100644
>>>> --- a/drivers/md/bcache/super.c
>>>> +++ b/drivers/md/bcache/super.c
>>>> @@ -147,10 +147,15 @@ static const char *read_super_common(struct
>>>> cache_sb *sb,  struct block_device *
>>>>           goto err;
>>>>         err = "Journal buckets not sequential";
>>>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>>>> +    if (!bch_has_feature_nvdimm_meta(sb)) {
>>>> +#endif
>>>>       for (i = 0; i < sb->keys; i++)
>>>>           if (sb->d[i] != sb->first_bucket + i)
>>>>               goto err;
>>>> -
>>>> +#ifdef CONFIG_BCACHE_NVM_PAGES
>>>> +    } /* bch_has_feature_nvdimm_meta */
>>>> +#endif
>>>>       err = "Too many journal buckets";
>>>>       if (sb->first_bucket + sb->keys > sb->nbuckets)
>>>>           goto err;
>>> Extremely awkward.
>>
>> Once the feature has settled and is no longer marked EXPERIMENTAL, such
>> conditional code will be removed.
>>
>>
>>> Make 'bch_has_feature_nvdimm_meta()' generally available, and have it
>>> return 'false' if the config feature isn't enabled.
>>
>> bch_has_feature_nvdimm_meta() is defined as,
>>
>>
>>   41 #define BCH_FEATURE_COMPAT_FUNCS(name, flagname) \
>>   42 static inline int bch_has_feature_##name(struct cache_sb *sb) \
>>   43 { \
>>   44         if (sb->version < BCACHE_SB_VERSION_CDEV_WITH_FEATURES) \
>>   45                 return 0; \
>>   46         return (((sb)->feature_compat & \
>>   47                 BCH##_FEATURE_COMPAT_##flagname) != 0); \
>>   48 } \
>>
>> It is not easy to check a specific Kconfig item inside the above code
>> block; this is why we chose the compile-time condition to disable the
>> nvdimm-related code here, before we remove the EXPERIMENTAL mark in
>> Kconfig.
>>
> But you can have the flag defined in general (i.e. outside the config
> ifdefs), and only set it if the code is enabled, right?
> Then the check will always work and do what we want, won't it?

I understand you; I will add a flag in struct cache and handle the
condition as you suggest.
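
A minimal sketch of that plan (the field and helper names are assumptions,
not the code actually posted): struct cache carries a boolean that is only
ever set from code compiled under CONFIG_BCACHE_NVM_PAGES, so generic
paths can test it without any #ifdef.

/* Hypothetical field; the real member name may differ. */
struct cache {
	/* ... existing members ... */
	bool		nvdimm_meta;	/* true only when NVDIMM journal is in use */
};

static inline bool cache_uses_nvdimm_meta(struct cache *ca)
{
	/* Stays false unless the CONFIG_BCACHE_NVM_PAGES init path set it. */
	return ca->nvdimm_meta;
}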

Thanks for your review and comments.

Coly Li

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages)
  2021-06-23 10:05                 ` Coly Li
@ 2021-06-23 11:16                   ` Coly Li
  2021-06-23 11:49                   ` Christoph Hellwig
  1 sibling, 0 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23 11:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Huang, Ying, Dan Williams, Jan Kara, Hannes Reinecke,
	linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, axboe

On 6/23/21 6:05 PM, Coly Li wrote:
> On 6/23/21 3:21 PM, Christoph Hellwig wrote:
> Does the already mapped DAX base address change at runtime during memory
> hot plugging?
> If not, it won't be a problem here for this specific use case.
>> It could change between one use and another.
> Hmm, I don't understand the implicit meaning of the above line.
> Could you please offer a detailed example?
>

Hi Christoph,

I think I now understand "It could change between one use and another."
Yes, this is a problem for the full-pointer design. I will switch to the
base + offset format.
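
A minimal sketch of that conversion (assumed names, not the actual
nvm-pages code): the metadata stored on the NVDIMM records only offsets
relative to the start of the namespace, and the usable pointer is rebuilt
from the current DAX mapping base each time the device is attached, since
that base may differ between one use and the next.

struct bch_nvm_namespace {
	void	*kaddr;		/* DAX mapping base for this attach */
	/* ... other members ... */
};

/* Stored on media: an offset, never a raw kernel virtual address. */
static inline u64 nvm_ptr_to_offset(struct bch_nvm_namespace *ns, void *ptr)
{
	return (u64)((char *)ptr - (char *)ns->kaddr);
}

/* Rebuilt in memory: pointer derived from the current mapping base. */
static inline void *nvm_offset_to_ptr(struct bch_nvm_namespace *ns, u64 offset)
{
	return (char *)ns->kaddr + offset;
}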

Thank you for joining the discussion and providing your comments, which
help me improve my understanding and reach a better design :-)

Coly Li

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages)
  2021-06-23 10:05                 ` Coly Li
  2021-06-23 11:16                   ` Coly Li
@ 2021-06-23 11:49                   ` Christoph Hellwig
  2021-06-23 12:09                     ` Coly Li
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Hellwig @ 2021-06-23 11:49 UTC (permalink / raw)
  To: Coly Li
  Cc: Christoph Hellwig, Huang, Ying, Dan Williams, Jan Kara,
	Hannes Reinecke, linux-bcache, linux-block, Jianpeng Ma,
	Qiaowei Ren, axboe

On Wed, Jun 23, 2021 at 06:05:51PM +0800, Coly Li wrote:
> The cache device (typically an SSD) of bcache is designed to be
> dedicated to a single local machine. Any storage migration between
> machines with different endianness should first flush the dirty data to
> the backing hard drive.

Now my G5 died and I need to recover the data using my x86 laptop,
what am I going to do?

> >> If not, it won't be a problem here for this specific use case.
> > It could change between one use and another.
> 
> Hmm, I don't understand the implicit meaning of the above line.
> Could you please offer a detailed example?

There is no guarantee your nvdimm or CXL memory device will show up
at the same address.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages)
  2021-06-23 11:49                   ` Christoph Hellwig
@ 2021-06-23 12:09                     ` Coly Li
  0 siblings, 0 replies; 60+ messages in thread
From: Coly Li @ 2021-06-23 12:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Huang, Ying, Dan Williams, Jan Kara, Hannes Reinecke,
	linux-bcache, linux-block, Jianpeng Ma, Qiaowei Ren, axboe

On 6/23/21 7:49 PM, Christoph Hellwig wrote:
> On Wed, Jun 23, 2021 at 06:05:51PM +0800, Coly Li wrote:
>> The cache device (typically an SSD) of bcache is designed to be
>> dedicated to a single local machine. Any storage migration between
>> machines with different endianness should first flush the dirty data to
>> the backing hard drive.
> Now my G5 died and I need to recover the data using my x86 laptop,
> what am I going to do?
>
>>>> If not, it won't be a problem here for this specific use case.
>>> It could change between one use and another.
>> Hmm, I don't understand the implicit meaning of the above line.
>> Could you please offer a detail example ?
> There is no guarantee your nvdimm or CXL memory device will show up
> at the same address.

Copied, I fully understand. Now I am working on converting the full
pointers to the [base + offset] format.

Thanks for your patient explanation :-)

Coly Li

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2021-06-23 12:09 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
2021-06-15  5:49 ` [PATCH 01/14] bcache: fix error info in register_bcache() Coly Li
2021-06-22  9:47   ` Hannes Reinecke
2021-06-15  5:49 ` [PATCH 02/14] md: bcache: Fix spelling of 'acquire' Coly Li
2021-06-22 10:03   ` Hannes Reinecke
2021-06-15  5:49 ` [PATCH 03/14] bcache: add initial data structures for nvm pages Coly Li
2021-06-21 16:17   ` Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages) Coly Li
2021-06-22  8:41     ` Huang, Ying
2021-06-23  4:32       ` Coly Li
2021-06-23  6:53         ` Huang, Ying
2021-06-23  7:04           ` Christoph Hellwig
2021-06-23  7:19             ` Coly Li
2021-06-23  7:21               ` Christoph Hellwig
2021-06-23 10:05                 ` Coly Li
2021-06-23 11:16                   ` Coly Li
2021-06-23 11:49                   ` Christoph Hellwig
2021-06-23 12:09                     ` Coly Li
2021-06-22 10:19   ` [PATCH 03/14] bcache: add initial data structures for nvm pages Hannes Reinecke
2021-06-23  7:09     ` Coly Li
2021-06-15  5:49 ` [PATCH 04/14] bcache: initialize the nvm pages allocator Coly Li
2021-06-22 10:39   ` Hannes Reinecke
2021-06-23  5:26     ` Coly Li
2021-06-23  9:16       ` Hannes Reinecke
2021-06-23  9:34         ` Coly Li
2021-06-15  5:49 ` [PATCH 05/14] bcache: initialization of the buddy Coly Li
2021-06-22 10:45   ` Hannes Reinecke
2021-06-23  5:35     ` Coly Li
2021-06-23  5:46       ` Re[2]: " Pavel Goran
2021-06-23  6:03         ` Coly Li
2021-06-15  5:49 ` [PATCH 06/14] bcache: bch_nvm_alloc_pages() " Coly Li
2021-06-22 10:51   ` Hannes Reinecke
2021-06-23  6:02     ` Coly Li
2021-06-15  5:49 ` [PATCH 07/14] bcache: bch_nvm_free_pages() " Coly Li
2021-06-22 10:53   ` Hannes Reinecke
2021-06-23  6:06     ` Coly Li
2021-06-15  5:49 ` [PATCH 08/14] bcache: get allocated pages from specific owner Coly Li
2021-06-22 10:54   ` Hannes Reinecke
2021-06-23  6:08     ` Coly Li
2021-06-15  5:49 ` [PATCH 09/14] bcache: use bucket index to set GC_MARK_METADATA for journal buckets in bch_btree_gc_finish() Coly Li
2021-06-22 10:55   ` Hannes Reinecke
2021-06-23  6:09     ` Coly Li
2021-06-15  5:49 ` [PATCH 10/14] bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set Coly Li
2021-06-22 10:59   ` Hannes Reinecke
2021-06-23  6:09     ` Coly Li
2021-06-15  5:49 ` [PATCH 11/14] bcache: initialize bcache journal for NVDIMM meta device Coly Li
2021-06-22 11:01   ` Hannes Reinecke
2021-06-23  6:17     ` Coly Li
2021-06-23  9:20       ` Hannes Reinecke
2021-06-23 10:14         ` Coly Li
2021-06-15  5:49 ` [PATCH 12/14] bcache: support storing bcache journal into " Coly Li
2021-06-22 11:03   ` Hannes Reinecke
2021-06-23  6:19     ` Coly Li
2021-06-15  5:49 ` [PATCH 13/14] bcache: read jset from NVDIMM pages for journal replay Coly Li
2021-06-22 11:04   ` Hannes Reinecke
2021-06-23  6:21     ` Coly Li
2021-06-15  5:49 ` [PATCH 14/14] bcache: add sysfs interface register_nvdimm_meta to register NVDIMM meta device Coly Li
2021-06-22 11:04   ` Hannes Reinecke
2021-06-21 15:14 ` [PATCH 00/14] bcache patches for Linux v5.14 Jens Axboe
2021-06-21 15:25   ` Coly Li
2021-06-21 15:27     ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).