All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hannes Reinecke <hare@suse.de>
To: Coly Li <colyli@suse.de>
Cc: linux-bcache@vger.kernel.org, axboe@kernel.dk,
	linux-block@vger.kernel.org, Jianpeng Ma <jianpeng.ma@intel.com>,
	Randy Dunlap <rdunlap@infradead.org>,
	Qiaowei Ren <qiaowei.ren@intel.com>
Subject: Re: [PATCH 04/14] bcache: initialize the nvm pages allocator
Date: Wed, 23 Jun 2021 11:16:28 +0200	[thread overview]
Message-ID: <48cab671-d4d6-d5f2-29bf-b2f19e04e533@suse.de> (raw)
In-Reply-To: <fc5b8586-0181-da7b-ba1f-73f7b00b8d16@suse.de>

On 6/23/21 7:26 AM, Coly Li wrote:
> On 6/22/21 6:39 PM, Hannes Reinecke wrote:
>> On 6/15/21 7:49 AM, Coly Li wrote:
>>> From: Jianpeng Ma <jianpeng.ma@intel.com>
>>>
>>> This patch define the prototype data structures in memory and
>>> initializes the nvm pages allocator.
>>>
>>> The nvm address space which is managed by this allocator can consist of
>>> many nvm namespaces, and some namespaces can compose into one nvm set,
>>> like cache set. For this initial implementation, only one set can be
>>> supported.
>>>
>>> The users of this nvm pages allocator need to call register_namespace()
>>> to register the nvdimm device (like /dev/pmemX) into this allocator as
>>> the instance of struct nvm_namespace.
>>>
>>> Reported-by: Randy Dunlap <rdunlap@infradead.org>
>>> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
>>> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
>>> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
>>> Signed-off-by: Coly Li <colyli@suse.de>
>>> ---
>>>   drivers/md/bcache/Kconfig     |  10 ++
>>>   drivers/md/bcache/Makefile    |   1 +
>>>   drivers/md/bcache/nvm-pages.c | 295 ++++++++++++++++++++++++++++++++++
>>>   drivers/md/bcache/nvm-pages.h |  74 +++++++++
>>>   drivers/md/bcache/super.c     |   3 +
>>>   5 files changed, 383 insertions(+)
>>>   create mode 100644 drivers/md/bcache/nvm-pages.c
>>>   create mode 100644 drivers/md/bcache/nvm-pages.h
>>>
>>> diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
>>> index d1ca4d059c20..a69f6c0e0507 100644
>>> --- a/drivers/md/bcache/Kconfig
>>> +++ b/drivers/md/bcache/Kconfig
>>> @@ -35,3 +35,13 @@ config BCACHE_ASYNC_REGISTRATION
>>>   	device path into this file will returns immediately and the real
>>>   	registration work is handled in kernel work queue in asynchronous
>>>   	way.
>>> +
>>> +config BCACHE_NVM_PAGES
>>> +	bool "NVDIMM support for bcache (EXPERIMENTAL)"
>>> +	depends on BCACHE
>>> +	depends on 64BIT
>>> +	depends on LIBNVDIMM
>>> +	depends on DAX
>>> +	help
>>> +	  Allocate/release NV-memory pages for bcache and provide allocated pages
>>> +	  for each requestor after system reboot.
>>> diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
>>> index 5b87e59676b8..2397bb7c7ffd 100644
>>> --- a/drivers/md/bcache/Makefile
>>> +++ b/drivers/md/bcache/Makefile
>>> @@ -5,3 +5,4 @@ obj-$(CONFIG_BCACHE)	+= bcache.o
>>>   bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
>>>   	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
>>>   	util.o writeback.o features.o
>>> +bcache-$(CONFIG_BCACHE_NVM_PAGES) += nvm-pages.o
>>> diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
>>> new file mode 100644
>>> index 000000000000..18fdadbc502f
>>> --- /dev/null
>>> +++ b/drivers/md/bcache/nvm-pages.c
>>> @@ -0,0 +1,295 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Nvdimm page-buddy allocator
>>> + *
>>> + * Copyright (c) 2021, Intel Corporation.
>>> + * Copyright (c) 2021, Qiaowei Ren <qiaowei.ren@intel.com>.
>>> + * Copyright (c) 2021, Jianpeng Ma <jianpeng.ma@intel.com>.
>>> + */
>>> +
>>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>>> +
>> No need for this 'if' statement as it'll be excluded by the Makefile
>> anyway if the config option isn't set.
> 
> Such if is necessary because stub routines are defined when
> CONFIG_BCACHE_NVM_PAGES is not defined, e.g.
> 
> 426 +#else
> 427 +
> 428 +static inline struct bch_nvm_namespace
> *bch_register_namespace(const char *dev_path)
> 429 +{
> 430 +       return NULL;
> 431 +}
> 432 +static inline int bch_nvm_init(void)
> 433 +{
> 434 +       return 0;
> 435 +}
> 436 +static inline void bch_nvm_exit(void) { }
> 437 +
> 438 +#endif /* CONFIG_BCACHE_NVM_PAGES */
> 
But then these stubs should be defined in the header file, not here.

[ .. ]

>>> +			rc = -ENOMEM;
>>> +			goto unlock;
>>> +		}
>>> +	}
>>> +
>>> +	only_set->nss[ns->sb->this_namespace_nr] = ns;
>>> +
>>> +	/* Firstly attach */
>> Initial attach?
> 
> Will fix in next post.
> 
>>
>>> +	if ((unsigned long)ns->sb->owner_list_head == BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET) {
>>> +		struct bch_nvm_pages_owner_head *sys_owner_head;
>>> +		struct bch_nvm_pgalloc_recs *sys_pgalloc_recs;
>>> +
>>> +		ns->sb->owner_list_head = ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET;
>>> +		sys_pgalloc_recs = ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET;
>>> +
>>> +		sys_owner_head = &(ns->sb->owner_list_head->heads[0]);
>>> +		sys_owner_head->recs[0] = sys_pgalloc_recs;
>>> +		ns->sb->csum = csum_set(ns->sb);
>>> +
>> Hmm. You are trying to pick up the 'list_head' structure from NVM, right?
> 
> No, this is not READ, it's WRITE onto NVDIMM.
> 
> sys_owner_head points to NVDIMM, since ns->sb->owner_list_head is updated,
> the checksum of ns->sb should be updated to new value onto the NVDIMM.
> This is what the above line does.
> 
> 
Ah, right.

>>
>> In doing so, don't you need to validate the structure (eg by checking
>> the checksum) before doing so to ensure that the contents are valid?
> 
> The check sum checking for READ is done in read_nvdimm_meta_super() in
> following lines,
> 198 +       r = -EINVAL;
> 199 +       expected_csum = csum_set(sb);
> 200 +       if (expected_csum != sb->csum) {
> 201 +               pr_info("csum is not match with expected one\n");
> 202 +               goto put_page;
> 203 +       }
> 
> Once thing to note is, currently all NVDIMM update is not power failure
> considered. This is the next big task to do after the first small code
> base merged.
> 
> 
>>> +		sys_pgalloc_recs->owner = sys_owner_head;
>>> +	} else
>>> +		BUG_ON(ns->sb->owner_list_head !=
>>> +			(ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET));
>>> +
>>> +unlock:
>>> +	mutex_unlock(&only_set->lock);
>>> +	return rc;
>>> +}
>>> +
>>> +static int read_nvdimm_meta_super(struct block_device *bdev,
>>> +			      struct bch_nvm_namespace *ns)
>>> +{
>>> +	struct page *page;
>>> +	struct bch_nvm_pages_sb *sb;
>>> +	int r = 0;
>>> +	uint64_t expected_csum = 0;
>>> +
>>> +	page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
>>> +			BCH_NVM_PAGES_SB_OFFSET >> PAGE_SHIFT, GFP_KERNEL);
>>> +
>>> +	if (IS_ERR(page))
>>> +		return -EIO;
>>> +
>>> +	sb = (struct bch_nvm_pages_sb *)(page_address(page) +
>>> +					offset_in_page(BCH_NVM_PAGES_SB_OFFSET));
>>> +	r = -EINVAL;
>>> +	expected_csum = csum_set(sb);
>>> +	if (expected_csum != sb->csum) {
>>> +		pr_info("csum is not match with expected one\n");
>>> +		goto put_page;
>>> +	}
>>> +
>>> +	if (memcmp(sb->magic, bch_nvm_pages_magic, 16)) {
>>> +		pr_info("invalid bch_nvm_pages_magic\n");
>>> +		goto put_page;
>>> +	}
>>> +
>>> +	if (sb->total_namespaces_nr != 1) {
>>> +		pr_info("currently only support one nvm device\n");
>>> +		goto put_page;
>>> +	}
>>> +
>>> +	if (sb->sb_offset != BCH_NVM_PAGES_SB_OFFSET) {
>>> +		pr_info("invalid superblock offset\n");
>>> +		goto put_page;
>>> +	}
>>> +
>>> +	r = 0;
>>> +	/* temporary use for DAX API */
>>> +	ns->page_size = sb->page_size;
>>> +	ns->pages_total = sb->pages_total;
>>> +
>>> +put_page:
>>> +	put_page(page);
>>> +	return r;
>>> +}
>>> +
>>> +struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
>>> +{
>>> +	struct bch_nvm_namespace *ns;
>>> +	int err;
>>> +	pgoff_t pgoff;
>>> +	char buf[BDEVNAME_SIZE];
>>> +	struct block_device *bdev;
>>> +	int id;
>>> +	char *path = NULL;
>>> +
>>> +	path = kstrndup(dev_path, 512, GFP_KERNEL);
>>> +	if (!path) {
>>> +		pr_err("kstrndup failed\n");
>>> +		return ERR_PTR(-ENOMEM);
>>> +	}
>>> +
>>> +	bdev = blkdev_get_by_path(strim(path),
>>> +				  FMODE_READ|FMODE_WRITE|FMODE_EXEC,
>>> +				  only_set);
>>> +	if (IS_ERR(bdev)) {
>>> +		pr_info("get %s error: %ld\n", dev_path, PTR_ERR(bdev));
>>> +		kfree(path);
>>> +		return ERR_PTR(PTR_ERR(bdev));
>>> +	}
>>> +
>>> +	err = -ENOMEM;
>>> +	ns = kzalloc(sizeof(struct bch_nvm_namespace), GFP_KERNEL);
>>> +	if (!ns)
>>> +		goto bdput;
>>> +
>>> +	err = -EIO;
>>> +	if (read_nvdimm_meta_super(bdev, ns)) {
>>> +		pr_info("%s read nvdimm meta super block failed.\n",
>>> +			bdevname(bdev, buf));
>>> +		goto free_ns;
>>> +	}
>>> +
>>> +	err = -EOPNOTSUPP;
>>> +	if (!bdev_dax_supported(bdev, ns->page_size)) {
>>> +		pr_info("%s don't support DAX\n", bdevname(bdev, buf));
>>> +		goto free_ns;
>>> +	}
>>> +
>>> +	err = -EINVAL;
>>> +	if (bdev_dax_pgoff(bdev, 0, ns->page_size, &pgoff)) {
>>> +		pr_info("invalid offset of %s\n", bdevname(bdev, buf));
>>> +		goto free_ns;
>>> +	}
>>> +
>>> +	err = -ENOMEM;
>>> +	ns->dax_dev = fs_dax_get_by_bdev(bdev);
>>> +	if (!ns->dax_dev) {
>>> +		pr_info("can't by dax device by %s\n", bdevname(bdev, buf));
>>> +		goto free_ns;
>>> +	}
>>> +
>>> +	err = -EINVAL;
>>> +	id = dax_read_lock();
>>> +	if (dax_direct_access(ns->dax_dev, pgoff, ns->pages_total,
>>> +			      &ns->kaddr, &ns->start_pfn) <= 0) {
>>> +		pr_info("dax_direct_access error\n");
>>> +		dax_read_unlock(id);
>>> +		goto free_ns;
>>> +	}
>>> +	dax_read_unlock(id);
>>> +
>>> +	ns->sb = ns->kaddr + BCH_NVM_PAGES_SB_OFFSET;
>>> +
>> You already read the superblock in read_nvdimm_meta_super(), right?
>> Wouldn't it be better to first do the 'dax_direct_access()' call, and
>> then check the superblock?
>> That way you'll ensure that dax_direct_access()' did the right thing;
>> with the current code you are using two different methods of accessing
>> the superblock, which theoretically can result in one method succeeding,
>> the other not ...
> 
> We have to do it. Because the mapping size of dax_direct_access() is
> from ns->pages_total, it is stored on NVDIMM. Before calling
> dax_direct_acess()
> we need to make sure ns is an valid super block stored on NVDIMM, then
> we can
> trust value of ns->pages_total to do the DAX mapping.
> 
> Another method is firstly mapping a small fixed range (e.g. 1GB), and
> check whether the super block on NVDIMM is valid. If yes, re-map the
> whole space indicated by ns->pages_total. But the two-times accessing cannot
> be avoided.
> 
Oh, indeed, you are correct. Disregard my comments here.

>>> +	err = -EINVAL;
>>> +	/* Check magic again to make sure DAX mapping is correct */
>>> +	if (memcmp(ns->sb->magic, bch_nvm_pages_magic, 16)) {
>>> +		pr_info("invalid bch_nvm_pages_magic after DAX mapping\n");
>>> +		goto free_ns;
>>> +	}
>>> +
>>> +	err = attach_nvm_set(ns);
>>> +	if (err < 0)
>>> +		goto free_ns;
>>> +
>>> +	ns->page_size = ns->sb->page_size;
>>> +	ns->pages_offset = ns->sb->pages_offset;
>>> +	ns->pages_total = ns->sb->pages_total;
>>> +	ns->free = 0;
>>> +	ns->bdev = bdev;
>>> +	ns->nvm_set = only_set;
>>> +	mutex_init(&ns->lock);
>>> +
>>> +	if (ns->sb->this_namespace_nr == 0) {
>>> +		pr_info("only first namespace contain owner info\n");
>>> +		err = init_owner_info(ns);
>>> +		if (err < 0) {
>>> +			pr_info("init_owner_info met error %d\n", err);
>>> +			only_set->nss[ns->sb->this_namespace_nr] = NULL;
>>> +			goto free_ns;
>>> +		}
>>> +	}
>>> +
>>> +	kfree(path);
>>> +	return ns;
>>> +free_ns:
>>> +	kfree(ns);
>>> +bdput:
>>> +	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
>>> +	kfree(path);
>>> +	return ERR_PTR(err);
>>> +}
>>> +EXPORT_SYMBOL_GPL(bch_register_namespace);
>>> +
>>> +int __init bch_nvm_init(void)
>>> +{
>>> +	only_set = kzalloc(sizeof(*only_set), GFP_KERNEL);
>>> +	if (!only_set)
>>> +		return -ENOMEM;
>>> +
>>> +	only_set->total_namespaces_nr = 0;
>>> +	only_set->owner_list_head = NULL;
>>> +	only_set->nss = NULL;
>>> +
>>> +	mutex_init(&only_set->lock);
>>> +
>>> +	pr_info("bcache nvm init\n");
>>> +	return 0;
>>> +}
>>> +
>>> +void bch_nvm_exit(void)
>>> +{
>>> +	release_nvm_set(only_set);
>>> +	pr_info("bcache nvm exit\n");
>>> +}
>>> +
>>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>>> diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
>>> new file mode 100644
>>> index 000000000000..3e24c4dee7fd
>>> --- /dev/null
>>> +++ b/drivers/md/bcache/nvm-pages.h
>>> @@ -0,0 +1,74 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +
>>> +#ifndef _BCACHE_NVM_PAGES_H
>>> +#define _BCACHE_NVM_PAGES_H
>>> +
>>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>>> +#include <linux/bcache-nvm.h>
>>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>>> +
>> Hmm? What is that doing here?
>> Please move it into the source file.
> 
> This is temporary before the whole NVDIMM support for bcache is completed.
> 
> drivers/md/bcache/nvm-pages.h has to be included because there are still
> stub
> routines in this header. Such stub routines like,
> 
> +static inline struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
> +{
> +	return NULL;
> +}
> 
> will be removed after the whole code completed and merged. So currently we have to
> do this to make sure the NVDIMM related code won't be leaked out if such experimental
> configure is not enabled.
> 
> 
> Thanks for your review. The addressed issue will be fixed and updated in next post.
> 
Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

  reply	other threads:[~2021-06-23  9:16 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-15  5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
2021-06-15  5:49 ` [PATCH 01/14] bcache: fix error info in register_bcache() Coly Li
2021-06-22  9:47   ` Hannes Reinecke
2021-06-15  5:49 ` [PATCH 02/14] md: bcache: Fix spelling of 'acquire' Coly Li
2021-06-22 10:03   ` Hannes Reinecke
2021-06-15  5:49 ` [PATCH 03/14] bcache: add initial data structures for nvm pages Coly Li
2021-06-21 16:17   ` Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages) Coly Li
2021-06-22  8:41     ` Huang, Ying
2021-06-23  4:32       ` Coly Li
2021-06-23  6:53         ` Huang, Ying
2021-06-23  7:04           ` Christoph Hellwig
2021-06-23  7:19             ` Coly Li
2021-06-23  7:21               ` Christoph Hellwig
2021-06-23 10:05                 ` Coly Li
2021-06-23 11:16                   ` Coly Li
2021-06-23 11:49                   ` Christoph Hellwig
2021-06-23 12:09                     ` Coly Li
2021-06-22 10:19   ` [PATCH 03/14] bcache: add initial data structures for nvm pages Hannes Reinecke
2021-06-23  7:09     ` Coly Li
2021-06-15  5:49 ` [PATCH 04/14] bcache: initialize the nvm pages allocator Coly Li
2021-06-22 10:39   ` Hannes Reinecke
2021-06-23  5:26     ` Coly Li
2021-06-23  9:16       ` Hannes Reinecke [this message]
2021-06-23  9:34         ` Coly Li
2021-06-15  5:49 ` [PATCH 05/14] bcache: initialization of the buddy Coly Li
2021-06-22 10:45   ` Hannes Reinecke
2021-06-23  5:35     ` Coly Li
2021-06-23  5:46       ` Re[2]: " Pavel Goran
2021-06-23  6:03         ` Coly Li
2021-06-15  5:49 ` [PATCH 06/14] bcache: bch_nvm_alloc_pages() " Coly Li
2021-06-22 10:51   ` Hannes Reinecke
2021-06-23  6:02     ` Coly Li
2021-06-15  5:49 ` [PATCH 07/14] bcache: bch_nvm_free_pages() " Coly Li
2021-06-22 10:53   ` Hannes Reinecke
2021-06-23  6:06     ` Coly Li
2021-06-15  5:49 ` [PATCH 08/14] bcache: get allocated pages from specific owner Coly Li
2021-06-22 10:54   ` Hannes Reinecke
2021-06-23  6:08     ` Coly Li
2021-06-15  5:49 ` [PATCH 09/14] bcache: use bucket index to set GC_MARK_METADATA for journal buckets in bch_btree_gc_finish() Coly Li
2021-06-22 10:55   ` Hannes Reinecke
2021-06-23  6:09     ` Coly Li
2021-06-15  5:49 ` [PATCH 10/14] bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set Coly Li
2021-06-22 10:59   ` Hannes Reinecke
2021-06-23  6:09     ` Coly Li
2021-06-15  5:49 ` [PATCH 11/14] bcache: initialize bcache journal for NVDIMM meta device Coly Li
2021-06-22 11:01   ` Hannes Reinecke
2021-06-23  6:17     ` Coly Li
2021-06-23  9:20       ` Hannes Reinecke
2021-06-23 10:14         ` Coly Li
2021-06-15  5:49 ` [PATCH 12/14] bcache: support storing bcache journal into " Coly Li
2021-06-22 11:03   ` Hannes Reinecke
2021-06-23  6:19     ` Coly Li
2021-06-15  5:49 ` [PATCH 13/14] bcache: read jset from NVDIMM pages for journal replay Coly Li
2021-06-22 11:04   ` Hannes Reinecke
2021-06-23  6:21     ` Coly Li
2021-06-15  5:49 ` [PATCH 14/14] bcache: add sysfs interface register_nvdimm_meta to register NVDIMM meta device Coly Li
2021-06-22 11:04   ` Hannes Reinecke
2021-06-21 15:14 ` [PATCH 00/14] bcache patches for Linux v5.14 Jens Axboe
2021-06-21 15:25   ` Coly Li
2021-06-21 15:27     ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=48cab671-d4d6-d5f2-29bf-b2f19e04e533@suse.de \
    --to=hare@suse.de \
    --cc=axboe@kernel.dk \
    --cc=colyli@suse.de \
    --cc=jianpeng.ma@intel.com \
    --cc=linux-bcache@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=qiaowei.ren@intel.com \
    --cc=rdunlap@infradead.org \
    --subject='Re: [PATCH 04/14] bcache: initialize the nvm pages allocator' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.