From: Hannes Reinecke <hare@suse.de>
To: Coly Li <colyli@suse.de>
Cc: linux-bcache@vger.kernel.org, axboe@kernel.dk,
linux-block@vger.kernel.org, Jianpeng Ma <jianpeng.ma@intel.com>,
Randy Dunlap <rdunlap@infradead.org>,
Qiaowei Ren <qiaowei.ren@intel.com>
Subject: Re: [PATCH 04/14] bcache: initialize the nvm pages allocator
Date: Wed, 23 Jun 2021 11:16:28 +0200 [thread overview]
Message-ID: <48cab671-d4d6-d5f2-29bf-b2f19e04e533@suse.de> (raw)
In-Reply-To: <fc5b8586-0181-da7b-ba1f-73f7b00b8d16@suse.de>
On 6/23/21 7:26 AM, Coly Li wrote:
> On 6/22/21 6:39 PM, Hannes Reinecke wrote:
>> On 6/15/21 7:49 AM, Coly Li wrote:
>>> From: Jianpeng Ma <jianpeng.ma@intel.com>
>>>
>>> This patch define the prototype data structures in memory and
>>> initializes the nvm pages allocator.
>>>
>>> The nvm address space which is managed by this allocator can consist of
>>> many nvm namespaces, and some namespaces can compose into one nvm set,
>>> like cache set. For this initial implementation, only one set can be
>>> supported.
>>>
>>> The users of this nvm pages allocator need to call register_namespace()
>>> to register the nvdimm device (like /dev/pmemX) into this allocator as
>>> the instance of struct nvm_namespace.
>>>
>>> Reported-by: Randy Dunlap <rdunlap@infradead.org>
>>> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
>>> Co-developed-by: Qiaowei Ren <qiaowei.ren@intel.com>
>>> Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
>>> Signed-off-by: Coly Li <colyli@suse.de>
>>> ---
>>> drivers/md/bcache/Kconfig | 10 ++
>>> drivers/md/bcache/Makefile | 1 +
>>> drivers/md/bcache/nvm-pages.c | 295 ++++++++++++++++++++++++++++++++++
>>> drivers/md/bcache/nvm-pages.h | 74 +++++++++
>>> drivers/md/bcache/super.c | 3 +
>>> 5 files changed, 383 insertions(+)
>>> create mode 100644 drivers/md/bcache/nvm-pages.c
>>> create mode 100644 drivers/md/bcache/nvm-pages.h
>>>
>>> diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
>>> index d1ca4d059c20..a69f6c0e0507 100644
>>> --- a/drivers/md/bcache/Kconfig
>>> +++ b/drivers/md/bcache/Kconfig
>>> @@ -35,3 +35,13 @@ config BCACHE_ASYNC_REGISTRATION
>>> device path into this file will returns immediately and the real
>>> registration work is handled in kernel work queue in asynchronous
>>> way.
>>> +
>>> +config BCACHE_NVM_PAGES
>>> + bool "NVDIMM support for bcache (EXPERIMENTAL)"
>>> + depends on BCACHE
>>> + depends on 64BIT
>>> + depends on LIBNVDIMM
>>> + depends on DAX
>>> + help
>>> + Allocate/release NV-memory pages for bcache and provide allocated pages
>>> + for each requestor after system reboot.
>>> diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
>>> index 5b87e59676b8..2397bb7c7ffd 100644
>>> --- a/drivers/md/bcache/Makefile
>>> +++ b/drivers/md/bcache/Makefile
>>> @@ -5,3 +5,4 @@ obj-$(CONFIG_BCACHE) += bcache.o
>>> bcache-y := alloc.o bset.o btree.o closure.o debug.o extents.o\
>>> io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
>>> util.o writeback.o features.o
>>> +bcache-$(CONFIG_BCACHE_NVM_PAGES) += nvm-pages.o
>>> diff --git a/drivers/md/bcache/nvm-pages.c b/drivers/md/bcache/nvm-pages.c
>>> new file mode 100644
>>> index 000000000000..18fdadbc502f
>>> --- /dev/null
>>> +++ b/drivers/md/bcache/nvm-pages.c
>>> @@ -0,0 +1,295 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Nvdimm page-buddy allocator
>>> + *
>>> + * Copyright (c) 2021, Intel Corporation.
>>> + * Copyright (c) 2021, Qiaowei Ren <qiaowei.ren@intel.com>.
>>> + * Copyright (c) 2021, Jianpeng Ma <jianpeng.ma@intel.com>.
>>> + */
>>> +
>>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>>> +
>> No need for this 'if' statement as it'll be excluded by the Makefile
>> anyway if the config option isn't set.
>
> Such if is necessary because stub routines are defined when
> CONFIG_BCACHE_NVM_PAGES is not defined, e.g.
>
> 426 +#else
> 427 +
> 428 +static inline struct bch_nvm_namespace
> *bch_register_namespace(const char *dev_path)
> 429 +{
> 430 + return NULL;
> 431 +}
> 432 +static inline int bch_nvm_init(void)
> 433 +{
> 434 + return 0;
> 435 +}
> 436 +static inline void bch_nvm_exit(void) { }
> 437 +
> 438 +#endif /* CONFIG_BCACHE_NVM_PAGES */
>
But then these stubs should be defined in the header file, not here.
[ .. ]
>>> + rc = -ENOMEM;
>>> + goto unlock;
>>> + }
>>> + }
>>> +
>>> + only_set->nss[ns->sb->this_namespace_nr] = ns;
>>> +
>>> + /* Firstly attach */
>> Initial attach?
>
> Will fix in next post.
>
>>
>>> + if ((unsigned long)ns->sb->owner_list_head == BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET) {
>>> + struct bch_nvm_pages_owner_head *sys_owner_head;
>>> + struct bch_nvm_pgalloc_recs *sys_pgalloc_recs;
>>> +
>>> + ns->sb->owner_list_head = ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET;
>>> + sys_pgalloc_recs = ns->kaddr + BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET;
>>> +
>>> + sys_owner_head = &(ns->sb->owner_list_head->heads[0]);
>>> + sys_owner_head->recs[0] = sys_pgalloc_recs;
>>> + ns->sb->csum = csum_set(ns->sb);
>>> +
>> Hmm. You are trying to pick up the 'list_head' structure from NVM, right?
>
> No, this is not READ, it's WRITE onto NVDIMM.
>
> sys_owner_head points to NVDIMM, since ns->sb->owner_list_head is updated,
> the checksum of ns->sb should be updated to new value onto the NVDIMM.
> This is what the above line does.
>
>
Ah, right.
>>
>> In doing so, don't you need to validate the structure (eg by checking
>> the checksum) before doing so to ensure that the contents are valid?
>
> The check sum checking for READ is done in read_nvdimm_meta_super() in
> following lines,
> 198 + r = -EINVAL;
> 199 + expected_csum = csum_set(sb);
> 200 + if (expected_csum != sb->csum) {
> 201 + pr_info("csum is not match with expected one\n");
> 202 + goto put_page;
> 203 + }
>
> Once thing to note is, currently all NVDIMM update is not power failure
> considered. This is the next big task to do after the first small code
> base merged.
>
>
>>> + sys_pgalloc_recs->owner = sys_owner_head;
>>> + } else
>>> + BUG_ON(ns->sb->owner_list_head !=
>>> + (ns->kaddr + BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET));
>>> +
>>> +unlock:
>>> + mutex_unlock(&only_set->lock);
>>> + return rc;
>>> +}
>>> +
>>> +static int read_nvdimm_meta_super(struct block_device *bdev,
>>> + struct bch_nvm_namespace *ns)
>>> +{
>>> + struct page *page;
>>> + struct bch_nvm_pages_sb *sb;
>>> + int r = 0;
>>> + uint64_t expected_csum = 0;
>>> +
>>> + page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
>>> + BCH_NVM_PAGES_SB_OFFSET >> PAGE_SHIFT, GFP_KERNEL);
>>> +
>>> + if (IS_ERR(page))
>>> + return -EIO;
>>> +
>>> + sb = (struct bch_nvm_pages_sb *)(page_address(page) +
>>> + offset_in_page(BCH_NVM_PAGES_SB_OFFSET));
>>> + r = -EINVAL;
>>> + expected_csum = csum_set(sb);
>>> + if (expected_csum != sb->csum) {
>>> + pr_info("csum is not match with expected one\n");
>>> + goto put_page;
>>> + }
>>> +
>>> + if (memcmp(sb->magic, bch_nvm_pages_magic, 16)) {
>>> + pr_info("invalid bch_nvm_pages_magic\n");
>>> + goto put_page;
>>> + }
>>> +
>>> + if (sb->total_namespaces_nr != 1) {
>>> + pr_info("currently only support one nvm device\n");
>>> + goto put_page;
>>> + }
>>> +
>>> + if (sb->sb_offset != BCH_NVM_PAGES_SB_OFFSET) {
>>> + pr_info("invalid superblock offset\n");
>>> + goto put_page;
>>> + }
>>> +
>>> + r = 0;
>>> + /* temporary use for DAX API */
>>> + ns->page_size = sb->page_size;
>>> + ns->pages_total = sb->pages_total;
>>> +
>>> +put_page:
>>> + put_page(page);
>>> + return r;
>>> +}
>>> +
>>> +struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
>>> +{
>>> + struct bch_nvm_namespace *ns;
>>> + int err;
>>> + pgoff_t pgoff;
>>> + char buf[BDEVNAME_SIZE];
>>> + struct block_device *bdev;
>>> + int id;
>>> + char *path = NULL;
>>> +
>>> + path = kstrndup(dev_path, 512, GFP_KERNEL);
>>> + if (!path) {
>>> + pr_err("kstrndup failed\n");
>>> + return ERR_PTR(-ENOMEM);
>>> + }
>>> +
>>> + bdev = blkdev_get_by_path(strim(path),
>>> + FMODE_READ|FMODE_WRITE|FMODE_EXEC,
>>> + only_set);
>>> + if (IS_ERR(bdev)) {
>>> + pr_info("get %s error: %ld\n", dev_path, PTR_ERR(bdev));
>>> + kfree(path);
>>> + return ERR_PTR(PTR_ERR(bdev));
>>> + }
>>> +
>>> + err = -ENOMEM;
>>> + ns = kzalloc(sizeof(struct bch_nvm_namespace), GFP_KERNEL);
>>> + if (!ns)
>>> + goto bdput;
>>> +
>>> + err = -EIO;
>>> + if (read_nvdimm_meta_super(bdev, ns)) {
>>> + pr_info("%s read nvdimm meta super block failed.\n",
>>> + bdevname(bdev, buf));
>>> + goto free_ns;
>>> + }
>>> +
>>> + err = -EOPNOTSUPP;
>>> + if (!bdev_dax_supported(bdev, ns->page_size)) {
>>> + pr_info("%s don't support DAX\n", bdevname(bdev, buf));
>>> + goto free_ns;
>>> + }
>>> +
>>> + err = -EINVAL;
>>> + if (bdev_dax_pgoff(bdev, 0, ns->page_size, &pgoff)) {
>>> + pr_info("invalid offset of %s\n", bdevname(bdev, buf));
>>> + goto free_ns;
>>> + }
>>> +
>>> + err = -ENOMEM;
>>> + ns->dax_dev = fs_dax_get_by_bdev(bdev);
>>> + if (!ns->dax_dev) {
>>> + pr_info("can't by dax device by %s\n", bdevname(bdev, buf));
>>> + goto free_ns;
>>> + }
>>> +
>>> + err = -EINVAL;
>>> + id = dax_read_lock();
>>> + if (dax_direct_access(ns->dax_dev, pgoff, ns->pages_total,
>>> + &ns->kaddr, &ns->start_pfn) <= 0) {
>>> + pr_info("dax_direct_access error\n");
>>> + dax_read_unlock(id);
>>> + goto free_ns;
>>> + }
>>> + dax_read_unlock(id);
>>> +
>>> + ns->sb = ns->kaddr + BCH_NVM_PAGES_SB_OFFSET;
>>> +
>> You already read the superblock in read_nvdimm_meta_super(), right?
>> Wouldn't it be better to first do the 'dax_direct_access()' call, and
>> then check the superblock?
>> That way you'll ensure that dax_direct_access()' did the right thing;
>> with the current code you are using two different methods of accessing
>> the superblock, which theoretically can result in one method succeeding,
>> the other not ...
>
> We have to do it. Because the mapping size of dax_direct_access() is
> from ns->pages_total, it is stored on NVDIMM. Before calling
> dax_direct_acess()
> we need to make sure ns is an valid super block stored on NVDIMM, then
> we can
> trust value of ns->pages_total to do the DAX mapping.
>
> Another method is firstly mapping a small fixed range (e.g. 1GB), and
> check whether the super block on NVDIMM is valid. If yes, re-map the
> whole space indicated by ns->pages_total. But the two-times accessing cannot
> be avoided.
>
Oh, indeed, you are correct. Disregard my comments here.
>>> + err = -EINVAL;
>>> + /* Check magic again to make sure DAX mapping is correct */
>>> + if (memcmp(ns->sb->magic, bch_nvm_pages_magic, 16)) {
>>> + pr_info("invalid bch_nvm_pages_magic after DAX mapping\n");
>>> + goto free_ns;
>>> + }
>>> +
>>> + err = attach_nvm_set(ns);
>>> + if (err < 0)
>>> + goto free_ns;
>>> +
>>> + ns->page_size = ns->sb->page_size;
>>> + ns->pages_offset = ns->sb->pages_offset;
>>> + ns->pages_total = ns->sb->pages_total;
>>> + ns->free = 0;
>>> + ns->bdev = bdev;
>>> + ns->nvm_set = only_set;
>>> + mutex_init(&ns->lock);
>>> +
>>> + if (ns->sb->this_namespace_nr == 0) {
>>> + pr_info("only first namespace contain owner info\n");
>>> + err = init_owner_info(ns);
>>> + if (err < 0) {
>>> + pr_info("init_owner_info met error %d\n", err);
>>> + only_set->nss[ns->sb->this_namespace_nr] = NULL;
>>> + goto free_ns;
>>> + }
>>> + }
>>> +
>>> + kfree(path);
>>> + return ns;
>>> +free_ns:
>>> + kfree(ns);
>>> +bdput:
>>> + blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXEC);
>>> + kfree(path);
>>> + return ERR_PTR(err);
>>> +}
>>> +EXPORT_SYMBOL_GPL(bch_register_namespace);
>>> +
>>> +int __init bch_nvm_init(void)
>>> +{
>>> + only_set = kzalloc(sizeof(*only_set), GFP_KERNEL);
>>> + if (!only_set)
>>> + return -ENOMEM;
>>> +
>>> + only_set->total_namespaces_nr = 0;
>>> + only_set->owner_list_head = NULL;
>>> + only_set->nss = NULL;
>>> +
>>> + mutex_init(&only_set->lock);
>>> +
>>> + pr_info("bcache nvm init\n");
>>> + return 0;
>>> +}
>>> +
>>> +void bch_nvm_exit(void)
>>> +{
>>> + release_nvm_set(only_set);
>>> + pr_info("bcache nvm exit\n");
>>> +}
>>> +
>>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>>> diff --git a/drivers/md/bcache/nvm-pages.h b/drivers/md/bcache/nvm-pages.h
>>> new file mode 100644
>>> index 000000000000..3e24c4dee7fd
>>> --- /dev/null
>>> +++ b/drivers/md/bcache/nvm-pages.h
>>> @@ -0,0 +1,74 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +
>>> +#ifndef _BCACHE_NVM_PAGES_H
>>> +#define _BCACHE_NVM_PAGES_H
>>> +
>>> +#if defined(CONFIG_BCACHE_NVM_PAGES)
>>> +#include <linux/bcache-nvm.h>
>>> +#endif /* CONFIG_BCACHE_NVM_PAGES */
>>> +
>> Hmm? What is that doing here?
>> Please move it into the source file.
>
> This is temporary before the whole NVDIMM support for bcache is completed.
>
> drivers/md/bcache/nvm-pages.h has to be included because there are still
> stub
> routines in this header. Such stub routines like,
>
> +static inline struct bch_nvm_namespace *bch_register_namespace(const char *dev_path)
> +{
> + return NULL;
> +}
>
> will be removed after the whole code completed and merged. So currently we have to
> do this to make sure the NVDIMM related code won't be leaked out if such experimental
> configure is not enabled.
>
>
> Thanks for your review. The addressed issue will be fixed and updated in next post.
>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
next prev parent reply other threads:[~2021-06-23 9:16 UTC|newest]
Thread overview: 60+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-06-15 5:49 [PATCH 00/14] bcache patches for Linux v5.14 Coly Li
2021-06-15 5:49 ` [PATCH 01/14] bcache: fix error info in register_bcache() Coly Li
2021-06-22 9:47 ` Hannes Reinecke
2021-06-15 5:49 ` [PATCH 02/14] md: bcache: Fix spelling of 'acquire' Coly Li
2021-06-22 10:03 ` Hannes Reinecke
2021-06-15 5:49 ` [PATCH 03/14] bcache: add initial data structures for nvm pages Coly Li
2021-06-21 16:17 ` Ask help for code review (was Re: [PATCH 03/14] bcache: add initial data structures for nvm pages) Coly Li
2021-06-22 8:41 ` Huang, Ying
2021-06-23 4:32 ` Coly Li
2021-06-23 6:53 ` Huang, Ying
2021-06-23 7:04 ` Christoph Hellwig
2021-06-23 7:19 ` Coly Li
2021-06-23 7:21 ` Christoph Hellwig
2021-06-23 10:05 ` Coly Li
2021-06-23 11:16 ` Coly Li
2021-06-23 11:49 ` Christoph Hellwig
2021-06-23 12:09 ` Coly Li
2021-06-22 10:19 ` [PATCH 03/14] bcache: add initial data structures for nvm pages Hannes Reinecke
2021-06-23 7:09 ` Coly Li
2021-06-15 5:49 ` [PATCH 04/14] bcache: initialize the nvm pages allocator Coly Li
2021-06-22 10:39 ` Hannes Reinecke
2021-06-23 5:26 ` Coly Li
2021-06-23 9:16 ` Hannes Reinecke [this message]
2021-06-23 9:34 ` Coly Li
2021-06-15 5:49 ` [PATCH 05/14] bcache: initialization of the buddy Coly Li
2021-06-22 10:45 ` Hannes Reinecke
2021-06-23 5:35 ` Coly Li
2021-06-23 5:46 ` Re[2]: " Pavel Goran
2021-06-23 6:03 ` Coly Li
2021-06-15 5:49 ` [PATCH 06/14] bcache: bch_nvm_alloc_pages() " Coly Li
2021-06-22 10:51 ` Hannes Reinecke
2021-06-23 6:02 ` Coly Li
2021-06-15 5:49 ` [PATCH 07/14] bcache: bch_nvm_free_pages() " Coly Li
2021-06-22 10:53 ` Hannes Reinecke
2021-06-23 6:06 ` Coly Li
2021-06-15 5:49 ` [PATCH 08/14] bcache: get allocated pages from specific owner Coly Li
2021-06-22 10:54 ` Hannes Reinecke
2021-06-23 6:08 ` Coly Li
2021-06-15 5:49 ` [PATCH 09/14] bcache: use bucket index to set GC_MARK_METADATA for journal buckets in bch_btree_gc_finish() Coly Li
2021-06-22 10:55 ` Hannes Reinecke
2021-06-23 6:09 ` Coly Li
2021-06-15 5:49 ` [PATCH 10/14] bcache: add BCH_FEATURE_INCOMPAT_NVDIMM_META into incompat feature set Coly Li
2021-06-22 10:59 ` Hannes Reinecke
2021-06-23 6:09 ` Coly Li
2021-06-15 5:49 ` [PATCH 11/14] bcache: initialize bcache journal for NVDIMM meta device Coly Li
2021-06-22 11:01 ` Hannes Reinecke
2021-06-23 6:17 ` Coly Li
2021-06-23 9:20 ` Hannes Reinecke
2021-06-23 10:14 ` Coly Li
2021-06-15 5:49 ` [PATCH 12/14] bcache: support storing bcache journal into " Coly Li
2021-06-22 11:03 ` Hannes Reinecke
2021-06-23 6:19 ` Coly Li
2021-06-15 5:49 ` [PATCH 13/14] bcache: read jset from NVDIMM pages for journal replay Coly Li
2021-06-22 11:04 ` Hannes Reinecke
2021-06-23 6:21 ` Coly Li
2021-06-15 5:49 ` [PATCH 14/14] bcache: add sysfs interface register_nvdimm_meta to register NVDIMM meta device Coly Li
2021-06-22 11:04 ` Hannes Reinecke
2021-06-21 15:14 ` [PATCH 00/14] bcache patches for Linux v5.14 Jens Axboe
2021-06-21 15:25 ` Coly Li
2021-06-21 15:27 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=48cab671-d4d6-d5f2-29bf-b2f19e04e533@suse.de \
--to=hare@suse.de \
--cc=axboe@kernel.dk \
--cc=colyli@suse.de \
--cc=jianpeng.ma@intel.com \
--cc=linux-bcache@vger.kernel.org \
--cc=linux-block@vger.kernel.org \
--cc=qiaowei.ren@intel.com \
--cc=rdunlap@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).