[RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM
@ 2022-08-02 17:57 Srinivas Aji
  2022-08-02 18:02 ` [RFC PATCH 1/4] mm/memory_hotplug: Add MHP_ALLOCATE flag which treats hotplugged memory as allocated Srinivas Aji
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Srinivas Aji @ 2022-08-02 17:57 UTC (permalink / raw)
  To: linux-nvdimm, Linux MM
  Cc: Dan Williams, David Hildenbrand, Vivek Goyal, David Woodhouse,
	Gowans, James, Yue Li, Beau Beauchamp

Linux supports adding a DAX driver managed memory region as system
memory using the KMEM driver (from version 5.1). We would like to use
a persistent addressable memory segment as system memory and
simultaneously for storing some persistent data.

Motivation: It is already possible to partition an NVDIMM device for
RAM and storage by creating separate regions on the device and using
one of them with KMEM and another as fsdax. This patch set is a start
to trying to get zero copy snapshots of processes which are using the
DAX device as RAM. That requires dynamically sharing pages between
process RAM and the storage within a single NVDIMM region.

To do this, we add a layer for handling the persistent data which does
the following:

1. When a DAX device is added as KMEM, mark all the memory as
   allocated and pass it up to a module which is aware of the storage
   layout.

2. This module scans the memory, identifies the unused parts, and
   frees those memory pages.

3. Further memory from this device is allocated using the kernel
   memory allocation API. The memory allocation API currently allows
   the allocation to be limited only based on NUMA node. So this
   feature works only when the DAX device used as KMEM is the only
   memory from its NUMA node.

4. Discarding of blocks previously used for persistent data results in
   those blocks being freed to system memory.

As an example, we implement a simple persistence module using the
above framework to provide a block device. A block device assumes all
blocks are always available, but in this case we have to get the
blocks through the memory allocation API, at an offset not under our
control. To provide block device semantics, we maintain an array which
maps the logical block number to the real physical page, if one
exists. Block device Trim/Discard support is used to mark blocks as
unused.

While we have the block device here as an example, a memory filesystem
might be a more useful implementation. I am not sure if any of the
existing in-memory filesystem structures are suited for
persistence. Any suggestions for this are appreciated.

Srinivas Aji (4):
  mm/memory_hotplug: Add MHP_ALLOCATE flag which treats hotplugged
    memory as allocated
  device-dax: Add framework for keeping persistent data in DAX KMEM
  device-dax: Add a NONE type for DAX KMEM persistence
  device-dax: Add a block device persistent type, BLK, for DAX KMEM

 drivers/dax/Kconfig            |  13 +
 drivers/dax/Makefile           |   1 +
 drivers/dax/kmem.c             | 312 +++++++++++++++++-
 drivers/dax/kmem_blk.c         | 573 +++++++++++++++++++++++++++++++++
 drivers/dax/kmem_persist.h     |  47 +++
 include/linux/memory_hotplug.h |   4 +
 mm/internal.h                  |   3 +-
 mm/memory_hotplug.c            |  15 +-
 mm/page_alloc.c                |  19 +-
 9 files changed, 975 insertions(+), 12 deletions(-)
 create mode 100644 drivers/dax/kmem_blk.c
 create mode 100644 drivers/dax/kmem_persist.h

-- 
2.30.2

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH 1/4] mm/memory_hotplug: Add MHP_ALLOCATE flag which treats hotplugged memory as allocated
  2022-08-02 17:57 [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM Srinivas Aji
@ 2022-08-02 18:02 ` Srinivas Aji
  2022-08-02 18:03 ` [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM David Hildenbrand
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Srinivas Aji @ 2022-08-02 18:02 UTC (permalink / raw)
  To: linux-nvdimm, Linux MM
  Cc: Dan Williams, David Hildenbrand, Vivek Goyal, David Woodhouse,
	Gowans, James, Yue Li, Beau Beauchamp

When memory is added by hotplug, it is treated as free by default.
This patch adds a flag MHP_ALLOCATE, which can be used with the
add_memory_...() memory hotplug functions to treat the new memory as
allocated. This allows the memory contents to be inspected before
being freed for system use. This feature is useful when the hotplugged
memory is persistent, since it can be scanned for any persistent data
which should be retained, rather than freed up for system use and
overwritten.

Signed-off-by: Srinivas Aji <srinivas.aji@memverge.com>
---
 include/linux/memory_hotplug.h |  4 ++++
 mm/internal.h                  |  3 ++-
 mm/memory_hotplug.c            | 15 +++++++++++++--
 mm/page_alloc.c                | 19 ++++++++++++-------
 4 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 20d7edf62a6a..15c9b1f2a5be 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -113,6 +113,10 @@ typedef int __bitwise mhp_t;
  * implies the node id (nid).
  */
 #define MHP_NID_IS_MGID		((__force mhp_t)BIT(2))
+/*
+ * Online this memory as when added, and also treat these pages as allocated.
+ */
+#define MHP_ALLOCATE		((__force mhp_t)BIT(3))
 
 /*
  * Extended parameters for memory hotplug:
diff --git a/mm/internal.h b/mm/internal.h
index c0f8fbe0445b..ee37749e341e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -356,7 +356,8 @@ extern void __putback_isolated_page(struct page *page, unsigned int order,
 				    int mt);
 extern void memblock_free_pages(struct page *page, unsigned long pfn,
 					unsigned int order);
-extern void __free_pages_core(struct page *page, unsigned int order);
+extern void __free_pages_core(struct page *page, unsigned int order,
+			      int allocate);
 extern void prep_compound_page(struct page *page, unsigned int order);
 extern void post_alloc_hook(struct page *page, unsigned int order,
 					gfp_t gfp_flags);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1213d0c67a53..eeef37d37bfa 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -177,6 +177,11 @@ static int __init setup_memhp_default_state(char *str)
 }
 __setup("memhp_default_state=", setup_memhp_default_state);
 
+/*
+ * If this is set, then this memory is allocated when onlining.
+ */
+static bool mhp_allocate;
+
 void mem_hotplug_begin(void)
 {
 	cpus_read_lock();
@@ -595,7 +600,7 @@ void generic_online_page(struct page *page, unsigned int order)
 	 * case in page freeing fast path.
 	 */
 	debug_pagealloc_map_pages(page, 1 << order);
-	__free_pages_core(page, order);
+	__free_pages_core(page, order, mhp_allocate);
 	totalram_pages_add(1UL << order);
 }
 EXPORT_SYMBOL_GPL(generic_online_page);
@@ -1417,10 +1422,16 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	if (mhp_flags & MHP_MERGE_RESOURCE)
 		merge_system_ram_resource(res);
 
+	if (mhp_flags & MHP_ALLOCATE)
+		mhp_allocate = true;
+
 	/* online pages if requested */
-	if (mhp_default_online_type != MMOP_OFFLINE)
+	if (mhp_default_online_type != MMOP_OFFLINE ||
+		(mhp_flags & MHP_ALLOCATE))
 		walk_memory_blocks(start, size, NULL, online_memory_block);
 
+	mhp_allocate = false;
+
 	return ret;
 error:
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b0bcab50f0a3..72b3955145ef 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1643,12 +1643,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	__count_vm_events(PGFREE, 1 << order);
 }
 
-void __free_pages_core(struct page *page, unsigned int order)
+void __free_pages_core(struct page *page, unsigned int order, int allocate)
 {
 	unsigned int nr_pages = 1 << order;
 	struct page *p = page;
 	unsigned int loop;
-
+	int count = allocate ? 1 : 0;
 	/*
 	 * When initializing the memmap, __init_single_page() sets the refcount
 	 * of all pages to 1 ("allocated"/"not free"). We have to set the
@@ -1658,13 +1658,18 @@ void __free_pages_core(struct page *page, unsigned int order)
 	for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
 		prefetchw(p + 1);
 		__ClearPageReserved(p);
-		set_page_count(p, 0);
+		set_page_count(p, count);
 	}
 	__ClearPageReserved(p);
-	set_page_count(p, 0);
+	set_page_count(p, count);
 
 	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
 
+	/*
+	 * Don't free the pages if we want them to appear allocated.
+	 */
+	if (allocate)
+		return;
 	/*
 	 * Bypass PCP and place fresh pages right to the tail, primarily
 	 * relevant for memory onlining.
@@ -1729,7 +1734,7 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn,
 {
 	if (early_page_uninitialised(pfn))
 		return;
-	__free_pages_core(page, order);
+	__free_pages_core(page, order, false);
 }
 
 /*
@@ -1818,14 +1823,14 @@ static void __init deferred_free_range(unsigned long pfn,
 	if (nr_pages == pageblock_nr_pages &&
 	    (pfn & (pageblock_nr_pages - 1)) == 0) {
 		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-		__free_pages_core(page, pageblock_order);
+		__free_pages_core(page, pageblock_order, false);
 		return;
 	}
 
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if ((pfn & (pageblock_nr_pages - 1)) == 0)
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-		__free_pages_core(page, 0);
+		__free_pages_core(page, 0, false);
 	}
 }
 
-- 
2.30.2




^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM
  2022-08-02 17:57 [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM Srinivas Aji
  2022-08-02 18:02 ` [RFC PATCH 1/4] mm/memory_hotplug: Add MHP_ALLOCATE flag which treats hotplugged memory as allocated Srinivas Aji
@ 2022-08-02 18:03 ` David Hildenbrand
  2022-08-02 18:53   ` Srinivas Aji
  2022-08-02 18:07 ` [RFC PATCH 2/4] device-dax: Add framework for keeping persistent data in DAX KMEM Srinivas Aji
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2022-08-02 18:03 UTC (permalink / raw)
  To: Srinivas Aji, linux-nvdimm, Linux MM
  Cc: Dan Williams, Vivek Goyal, David Woodhouse, Gowans, James,
	Yue Li, Beau Beauchamp

On 02.08.22 19:57, Srinivas Aji wrote:
> Linux supports adding a DAX driver managed memory region as system
> memory using the KMEM driver (from version 5.1). We would like to use
> a persistent addressable memory segment as system memory and
> simultaneously for storing some persistent data.
> 
> Motivation: It is already possible to partition an NVDIMM device for
> RAM and storage by creating separate regions on the device and using
> one of them with KMEM and another as fsdax. This patch set is a start
> to trying to get zero copy snapshots of processes which are using the
> DAX device as RAM. That requires dynamically sharing pages between
> process RAM and the storage within a single NVDIMM region.
> 
> To do this, we add a layer for handling the persistent data which does
> the following:
> 
> 1. When a DAX device is added as KMEM, mark all the memory as
>    allocated and pass it up to a module which is aware of the storage
>    layout.
> 
> 2. This module scans the memory, identifies the unused parts, and
>    frees those memory pages.
> 
> 3. Further memory from this device is allocated using the kernel
>    memory allocation API. The memory allocation API currently allows
>    the allocation to be limited only based on NUMA node. So this
>    feature works only when the DAX device used as KMEM is the only
>    memory from its NUMA node.
> 
> 4. Discarding of blocks previously used for persistent data results in
>    those blocks being freed to system memory.
> 
> As an example, we implement a simple persistence module using the
> above framework to provide a block device. A block device assumes all
> blocks are always available, but in this case we have to get the
> blocks through the memory allocation API, at an offset not under our
> control. To provide block device semantics, we maintain an array which
> maps the logical block number to the real physical page, if one
> exists. Block device Trim/Discard support is used to mark blocks as
> unused.
> 
> While we have the block device here as an example, a memory filesystem
> might be a more useful implementation. I am not sure if any of the
> existing in-memory filesystem structures are suited for
> persistence. Any suggestions for this are appreciated.
> 
> Srinivas Aji (4):
>   mm/memory_hotplug: Add MHP_ALLOCATE flag which treats hotplugged
>     memory as allocated

Without seeing the actual patches, I am very skeptical that this is the
right approach, especially regarding memory onlining/offlining.

virtio-mem achieves something similar (yet different) by hooking into
generic_online_page(). From there, you can control what should actually
happen with memory that is getting onlined (e.g., free them to the buddy
or treat them differently).

Did you evaluate going that path?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH 2/4] device-dax: Add framework for keeping persistent data in DAX KMEM
  2022-08-02 17:57 [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM Srinivas Aji
  2022-08-02 18:02 ` [RFC PATCH 1/4] mm/memory_hotplug: Add MHP_ALLOCATE flag which treats hotplugged memory as allocated Srinivas Aji
  2022-08-02 18:03 ` [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM David Hildenbrand
@ 2022-08-02 18:07 ` Srinivas Aji
  2022-08-02 18:10 ` [RFC PATCH 3/4] device-dax: Add a NONE type for DAX KMEM persistence Srinivas Aji
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Srinivas Aji @ 2022-08-02 18:07 UTC (permalink / raw)
  To: linux-nvdimm, Linux MM
  Cc: Dan Williams, David Hildenbrand, Vivek Goyal, David Woodhouse,
	Gowans, James, Yue Li, Beau Beauchamp

DAX memory treated as system RAM will be checked for persistent data
and passed to the appropriate plugin. The plugin is provided with
functions for accessing pages by device offset, and also for
allocating and freeing pages from the DAX device provided memory.

Add a config option CONFIG_DEV_DAX_KMEM_PERSIST which controls this
feature.

The plugin framework allows multiple formats for the persistent data.
The contents of the initial block determine which plugin is used.

A module parameter, persist_format_type, to the kmem module sets the
plugin type which is used to format any hotplugged DAX which does not
have a recognized initial block.

Note: With just this change but without futher patches which implement
plugins for persistence, adding a DAX device as KMEM will only add
it as fully allocated.

Limitation: Adding a DAX device as KMEM will succeed only if this memory
is the only memory in its NUMA node.

Signed-off-by: Srinivas Aji <srinivas.aji@memverge.com>
---
 drivers/dax/Kconfig        |  13 ++
 drivers/dax/kmem.c         | 266 ++++++++++++++++++++++++++++++++++++-
 drivers/dax/kmem_persist.h |  43 ++++++
 3 files changed, 320 insertions(+), 2 deletions(-)
 create mode 100644 drivers/dax/kmem_persist.h

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 5fdf269a822e..837178b841b6 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -66,4 +66,17 @@ config DEV_DAX_KMEM
 
 	  Say N if unsure.
 
+config DEV_DAX_KMEM_PERSIST
+	tristate "KMEM PERSIST: persistent storage together with kmem"
+	default DEV_DAX_KMEM
+	depends on DEV_DAX_KMEM
+	help
+	  Support using a DAX device as system memory while also allowing
+          persistent data on it. This is done by treating all the
+          persistent data as pre-allocated when we add the DAX device
+          as system memory and further acquiring and releasing persistent
+          data through memory allocate and free.
+
+	  Say N if unsure.
+
 endif
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..df7cfc8ace78 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -13,6 +13,9 @@
 #include <linux/mman.h>
 #include "dax-private.h"
 #include "bus.h"
+#ifdef CONFIG_DEV_DAX_KMEM_PERSIST
+#include "kmem_persist.h"
+#endif
 
 /* Memory resource name used for add_memory_driver_managed(). */
 static const char *kmem_name;
@@ -38,9 +41,23 @@ static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
 struct dax_kmem_data {
 	const char *res_name;
 	int mgid;
+#ifdef CONFIG_DEV_DAX_KMEM_PERSIST
+	unsigned long total_len;
+	struct kmem_persist_ops *persist_ops;
+	void *persist_data;
+#endif
 	struct resource *res[];
 };
 
+#ifdef CONFIG_DEV_DAX_KMEM_PERSIST
+static int kmem_persist_probe(struct dev_dax *dev_dax,
+			struct kmem_persist_ops **persist_ops,
+			void **persist_data);
+static int kmem_persist_cleanup(struct dev_dax *dev_dax,
+				struct kmem_persist_ops *persist_ops,
+				void *persist_data);
+#endif
+
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
@@ -48,6 +65,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	struct dax_kmem_data *data;
 	int i, rc, mapped = 0;
 	int numa_node;
+	mhp_t mhp_flags;
 
 	/*
 	 * Ensure good NUMA information for the persistent memory.
@@ -62,6 +80,18 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		return -EINVAL;
 	}
 
+#ifdef CONFIG_DEV_DAX_KMEM_PERSIST
+	/*
+	 * Check if NUMA node has any memory already
+	 */
+	if (node_online(numa_node) && node_present_pages(numa_node) != 0) {
+		dev_warn(dev,
+			"rejecting DAX region on numa_node with existing memory: numa_node %d, existing pages %lu\n",
+			numa_node, node_present_pages(numa_node));
+		return -EINVAL;
+	}
+#endif
+
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct range range;
 
@@ -92,6 +122,15 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	if (rc < 0)
 		goto err_reg_mgid;
 	data->mgid = rc;
+#ifdef CONFIG_DEV_DAX_KMEM_PERSIST
+	data->total_len = total_len;
+#endif
+
+	mhp_flags = MHP_NID_IS_MGID
+#ifdef CONFIG_DEV_DAX_KMEM_PERSIST
+		| MHP_ALLOCATE
+#endif
+		;
 
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct resource *res;
@@ -130,8 +169,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		 * this as RAM automatically.
 		 */
 		rc = add_memory_driver_managed(data->mgid, range.start,
-				range_len(&range), kmem_name, MHP_NID_IS_MGID);
-
+				range_len(&range), kmem_name, mhp_flags);
 		if (rc) {
 			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
 					i, range.start, range.end);
@@ -147,6 +185,14 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 
 	dev_set_drvdata(dev, data);
 
+#ifdef CONFIG_DEV_DAX_KMEM_PERSIST
+	rc = kmem_persist_probe(dev_dax,
+				&data->persist_ops,
+				&data->persist_data);
+	if (rc)
+		dev_err(dev, "Cannot setup kmem persistent data\n");
+#endif
+
 	return 0;
 
 err_request_mem:
@@ -165,6 +211,18 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 	struct device *dev = &dev_dax->dev;
 	struct dax_kmem_data *data = dev_get_drvdata(dev);
 
+#ifdef CONFIG_DEV_DAX_KMEM_PERSIST
+	/*
+	 * TODO:This is probably the wrong place to call this. We need to
+	 * call this before the blocks are marked offline, but after
+	 * ensuring no new allocations.
+	 */
+	if (kmem_persist_cleanup(dev_dax, data->persist_ops,
+					data->persist_data)) {
+		dev_err(dev, "Block device cannot be freed.\n");
+		return;
+	}
+#endif
 	/*
 	 * We have one shot for removing memory, if some memory blocks were not
 	 * offline prior to calling this function remove_memory() will fail, and
@@ -214,6 +272,210 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
+#ifdef CONFIG_DEV_DAX_KMEM_PERSIST
+struct page *dax_kmem_index_to_page(unsigned long page_index,
+				struct dev_dax *dev_dax)
+{
+	struct device *dev = &dev_dax->dev;
+	struct dax_kmem_data *data = dev_get_drvdata(dev);
+	int i;
+	unsigned long page_offset = 0;
+
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct resource *r = data->res[i];
+		unsigned long page_len = (r->end + 1 - r->start) >> PAGE_SHIFT;
+
+		if (page_offset + page_len <= page_index) {
+			page_offset += page_len;
+			continue;
+		}
+		return pfn_to_page((r->start >> PAGE_SHIFT) +
+				(page_index - page_offset));
+	}
+	return NULL;
+}
+
+unsigned long dax_kmem_num_pages(struct dev_dax *dev_dax)
+{
+	struct device *dev = &dev_dax->dev;
+	struct dax_kmem_data *data = dev_get_drvdata(dev);
+
+	return data->total_len >> PAGE_SHIFT;
+}
+
+struct page *dax_kmem_alloc_page(struct dev_dax *dev_dax,
+				unsigned long *page_index)
+{
+	struct device *dev = &dev_dax->dev;
+	struct dax_kmem_data *data = dev_get_drvdata(dev);
+	int i;
+	unsigned long page_offset = 0;
+	u64 phys;
+	struct page *page =
+		alloc_pages_node(dev_dax->target_node,
+				GFP_NOIO | __GFP_ZERO | __GFP_THISNODE,
+				0);
+	if (!page)
+		return NULL;
+
+	phys = __pfn_to_phys(page_to_pfn(page));
+
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct resource *r = data->res[i];
+		unsigned long page_len = (r->end + 1 - r->start) >> PAGE_SHIFT;
+
+		if (phys >= r->start && phys <= r->end) {
+			*page_index =
+				page_offset + ((phys - r->start) >> PAGE_SHIFT);
+			break;
+		}
+		page_offset += page_len;
+	}
+	if (i == dev_dax->nr_range) {
+		dev_err(dev, "Allocated page not in DAX range. Freeing.\n");
+		__free_page(page);
+		page = NULL;
+	}
+
+	return page;
+}
+
+static int persist_format_type = -1;
+module_param(persist_format_type, int, 0644);
+
+/*
+ * Forcibly format new KMEM with persist_format_type. This can cause loss
+ * of existing persistent data, so this should be replaced with some
+ * other mechanism for reformatting.
+ */
+static bool persist_format_force;
+module_param(persist_format_force, bool, 0644);
+
+static LIST_HEAD(persist_types);
+static DEFINE_MUTEX(persist_types_lock);
+
+int kmem_persist_type_register(struct kmem_persist_ops *ops)
+{
+	mutex_lock(&persist_types_lock);
+	ops->ref_count = 0;
+	list_add_tail(&ops->next, &persist_types);
+	mutex_unlock(&persist_types_lock);
+	return 0;
+}
+
+int kmem_persist_type_unregister(struct kmem_persist_ops *ops)
+{
+	mutex_lock(&persist_types_lock);
+	if (ops->ref_count != 0) {
+		mutex_unlock(&persist_types_lock);
+		return -1;
+	}
+	list_del(&ops->next);
+	mutex_unlock(&persist_types_lock);
+	return 0;
+}
+
+int kmem_persist_probe(struct dev_dax *dev_dax,
+		struct kmem_persist_ops **persist_ops,
+		void **persist_data)
+{
+	struct device *dev = &dev_dax->dev;
+	struct kmem_persist_superblock *super;
+	enum kmem_persist_type ptype;
+	bool format = false;
+	bool ptype_found = false;
+	struct kmem_persist_ops *ops;
+	void *data;
+	struct list_head *pos;
+	int rc;
+
+	super = kmap_local_page(dax_kmem_index_to_page(0, dev_dax));
+
+	if (super->magic != kmem_persist_magic) {
+		if (persist_format_type == -1) {
+			dev_err(dev, "kmem unformatted for persistence\n");
+			kunmap_local(super);
+			return -EINVAL;
+		}
+		ptype = persist_format_type;
+		format = true;
+	} else {
+		ptype = super->type;
+	}
+
+	mutex_lock(&persist_types_lock);
+	list_for_each(pos, &persist_types) {
+		ops = list_entry(pos, struct kmem_persist_ops, next);
+		if (ops->type == ptype) {
+			ops->ref_count++;
+			ptype_found = true;
+			break;
+		}
+	}
+	mutex_unlock(&persist_types_lock);
+
+	if (!ptype_found) {
+		dev_err(dev, "No persistence module with type %d\n", ptype);
+		kunmap_local(super);
+		return -EINVAL;
+	}
+
+	if (format) {
+		rc = ops->format(dev_dax);
+		if (rc ||
+			super->magic != kmem_persist_magic ||
+			super->type != persist_format_type
+			) {
+			dev_err(dev,
+				"Error formatting kmem persistence type %d\n",
+				ptype);
+			mutex_lock(&persist_types_lock);
+			ops->ref_count--;
+			mutex_unlock(&persist_types_lock);
+
+			kunmap_local(super);
+			return rc;
+		}
+	}
+
+	kunmap_local(super);
+	super = NULL;
+
+	rc = ops->probe(dev_dax, &data);
+	if (rc) {
+		dev_err(dev, "Error initializing kmem persistence type %d\n",
+			ptype);
+		return rc;
+	}
+	*persist_ops = ops;
+	*persist_data = data;
+	return 0;
+}
+
+int kmem_persist_cleanup(struct dev_dax *dev_dax,
+		struct kmem_persist_ops *pops,
+		void *persist_data)
+{
+	int rc;
+	struct device *dev = &dev_dax->dev;
+
+	rc = pops->cleanup(dev_dax, persist_data);
+
+	if (rc) {
+		dev_err(dev, "Error cleaning up kmem persistence type %d\n",
+			pops->type);
+		return rc;
+	}
+
+	mutex_lock(&persist_types_lock);
+	pops->ref_count--;
+	mutex_unlock(&persist_types_lock);
+
+	return 0;
+}
+
+#endif /* CONFIG_DEV_DAX_KMEM_PERSIST */
+
 static struct dax_device_driver device_dax_kmem_driver = {
 	.probe = dev_dax_kmem_probe,
 	.remove = dev_dax_kmem_remove,
diff --git a/drivers/dax/kmem_persist.h b/drivers/dax/kmem_persist.h
new file mode 100644
index 000000000000..dd651025f28c
--- /dev/null
+++ b/drivers/dax/kmem_persist.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright(c) 2022 MemVerge. All rights reserved.
+ */
+#ifndef __KMEM_PERSIST_H__
+#define __KMEM_PERSIST_H__
+
+struct page;
+struct dev_dax;
+
+enum kmem_persist_type {
+	KMEM_PERSIST_NONE = 0,
+};
+
+
+struct kmem_persist_ops {
+	enum kmem_persist_type type;
+	int (*format)(struct dev_dax *dev_dax);
+	int (*probe)(struct dev_dax *dev_dax, void **data);
+	int (*cleanup)(struct dev_dax *dev_dax, void *data);
+	int ref_count;
+	struct list_head next;
+};
+
+static const unsigned long kmem_persist_magic = 0x4b4d454d50455253L; // KMEMPERS
+
+struct kmem_persist_superblock {
+	unsigned long magic;
+	enum kmem_persist_type type;
+} __packed;
+
+int kmem_persist_type_register(struct kmem_persist_ops *ops);
+
+int kmem_persist_type_unregister(struct kmem_persist_ops *ops);
+
+
+struct page *dax_kmem_index_to_page(unsigned long page_index,
+				struct dev_dax *dev_dax);
+unsigned long dax_kmem_num_pages(struct dev_dax *dev_dax);
+struct page *dax_kmem_alloc_page(struct dev_dax *dev_dax,
+				unsigned long *page_index);
+
+#endif
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 3/4] device-dax: Add a NONE type for DAX KMEM persistence
  2022-08-02 17:57 [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM Srinivas Aji
                   ` (2 preceding siblings ...)
  2022-08-02 18:07 ` [RFC PATCH 2/4] device-dax: Add framework for keeping persistent data in DAX KMEM Srinivas Aji
@ 2022-08-02 18:10 ` Srinivas Aji
  2022-08-02 18:12 ` [RFC PATCH 4/4] device-dax: Add a block device persistent type, BLK, for DAX KMEM Srinivas Aji
  2022-08-05 12:46 ` [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM David Hildenbrand
  5 siblings, 0 replies; 13+ messages in thread
From: Srinivas Aji @ 2022-08-02 18:10 UTC (permalink / raw)
  To: linux-nvdimm, Linux MM
  Cc: Dan Williams, David Hildenbrand, Vivek Goyal, David Woodhouse,
	Gowans, James, Yue Li, Beau Beauchamp

The NONE type allows us to format DAX KMEM so that all the memory is
usable as system memory. This does not allow us to store any
persistent data.

Signed-off-by: Srinivas Aji <srinivas.aji@memverge.com>
---
 drivers/dax/kmem.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index df7cfc8ace78..0ca6e14f7e73 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -474,6 +474,46 @@ int kmem_persist_cleanup(struct dev_dax *dev_dax,
 	return 0;
 }
 
+/*
+ * A NONE type for DAX KMEM persistence which does not expose any persistent
+ * device or filesystem on the memory.
+ */
+
+static int kmem_persist_none_format(struct dev_dax *dev_dax)
+{
+	struct kmem_persist_superblock *super =
+		kmap_local_page(dax_kmem_index_to_page(0, dev_dax));
+	super->magic = kmem_persist_magic;
+	super->type = KMEM_PERSIST_NONE;
+	kunmap_local(super);
+	return 0;
+}
+
+static int kmem_persist_none_probe(struct dev_dax *dev_dax, void **data)
+{
+	unsigned long num_pages = dax_kmem_num_pages(dev_dax);
+	unsigned long i;
+
+	for (i = 1; i < num_pages; i++)
+		__free_page(dax_kmem_index_to_page(i, dev_dax));
+
+	*data = NULL;
+	return 0;
+}
+
+static int kmem_persist_none_cleanup(struct dev_dax *dev_dax, void *data)
+{
+	__free_page(dax_kmem_index_to_page(0, dev_dax));
+	return 0;
+}
+
+static struct kmem_persist_ops kmem_persist_none_ops = {
+	.type = KMEM_PERSIST_NONE,
+	.format = kmem_persist_none_format,
+	.probe = kmem_persist_none_probe,
+	.cleanup = kmem_persist_none_cleanup
+};
+
 #endif /* CONFIG_DEV_DAX_KMEM_PERSIST */
 
 static struct dax_device_driver device_dax_kmem_driver = {
@@ -493,6 +533,10 @@ static int __init dax_kmem_init(void)
 	rc = dax_driver_register(&device_dax_kmem_driver);
 	if (rc)
 		kfree_const(kmem_name);
+#ifdef CONFIG_DEV_DAX_KMEM_PERSIST
+	if (rc == 0)
+		kmem_persist_type_register(&kmem_persist_none_ops);
+#endif
 	return rc;
 }
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 4/4] device-dax: Add a block device persistent type, BLK, for DAX KMEM
  2022-08-02 17:57 [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM Srinivas Aji
                   ` (3 preceding siblings ...)
  2022-08-02 18:10 ` [RFC PATCH 3/4] device-dax: Add a NONE type for DAX KMEM persistence Srinivas Aji
@ 2022-08-02 18:12 ` Srinivas Aji
  2022-08-03 20:00   ` kernel test robot
                     ` (2 more replies)
  2022-08-05 12:46 ` [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM David Hildenbrand
  5 siblings, 3 replies; 13+ messages in thread
From: Srinivas Aji @ 2022-08-02 18:12 UTC (permalink / raw)
  To: linux-nvdimm, Linux MM
  Cc: Dan Williams, David Hildenbrand, Vivek Goyal, David Woodhouse,
	Gowans, James, Yue Li, Beau Beauchamp

When a DAX KMEM device is formatted as of type BLK, adding the DAX memory
exposes a block device /dev/kmem<numa_node> . A filesystem can be created
on this block device. Blocks which contain data are unavailable for use as
system memory, but blocks which are freed up using DISCARD become free for
system use.

The implementation uses an array which maps the logical block number
to the real block offset in the DAX device. This allows us to keep
block device semantics even though allocations can return any page.

Signed-off-by: Srinivas Aji <srinivas.aji@memverge.com>
---
 drivers/dax/Makefile       |   1 +
 drivers/dax/kmem.c         |   4 +-
 drivers/dax/kmem_blk.c     | 573 +++++++++++++++++++++++++++++++++++++
 drivers/dax/kmem_persist.h |   4 +
 4 files changed, 581 insertions(+), 1 deletion(-)
 create mode 100644 drivers/dax/kmem_blk.c

diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 90a56ca3b345..d0a97f4af4ea 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -3,6 +3,7 @@ obj-$(CONFIG_DAX) += dax.o
 obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
 obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
+obj-$(CONFIG_DEV_DAX_KMEM_PERSIST) += kmem_blk.o
 
 dax-y := super.o
 dax-y += bus.o
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 0ca6e14f7e73..0fa45d1ba9cc 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -534,8 +534,10 @@ static int __init dax_kmem_init(void)
 	if (rc)
 		kfree_const(kmem_name);
 #ifdef CONFIG_DEV_DAX_KMEM_PERSIST
-	if (rc == 0)
+	if (rc == 0) {
 		kmem_persist_type_register(&kmem_persist_none_ops);
+		kmem_persist_type_register(&kmem_persist_blk_ops);
+	}
 #endif
 	return rc;
 }
diff --git a/drivers/dax/kmem_blk.c b/drivers/dax/kmem_blk.c
new file mode 100644
index 000000000000..856b35713999
--- /dev/null
+++ b/drivers/dax/kmem_blk.c
@@ -0,0 +1,573 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2022 MemVerge. All rights reserved. */
+#include <linux/module.h>
+#include <linux/major.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/highmem.h>
+#include <linux/pagemap.h>
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include "dax-private.h"
+#include "kmem_persist.h"
+
+static const unsigned int index_entries_per_page = (PAGE_SIZE / sizeof(u64));
+
+struct kmem_blk_super {
+	struct kmem_persist_superblock header;
+	u64 num_index_pages;
+	u64 num_index_entries;
+} __packed;
+
+struct kmem_blk_data {
+	struct dev_dax *dev_dax;
+	struct gendisk *disk;
+	spinlock_t index_lock;
+	struct kmem_blk_super *super;
+	unsigned long num_index_pages;
+	u64 *index_page[];
+};
+
+// TODO: Make sure locking is sound for concurrent multiple IOs,
+// i.e. writes and discards.
+
+static struct page *kmem_blk_get_page(struct kmem_blk_data *data,
+				sector_t sector)
+{
+	pgoff_t i = sector >> PAGE_SECTORS_SHIFT;
+	u64 page_num;
+
+	spin_lock(&data->index_lock);
+	page_num = data->index_page
+		[i / index_entries_per_page]
+		[i % index_entries_per_page];
+	spin_unlock(&data->index_lock);
+
+	if (page_num)
+		return dax_kmem_index_to_page(page_num, data->dev_dax);
+	else
+		return NULL;
+}
+
+/*
+ * copy_to_kmem_blk_setup must be called before copy_to_kmem_blk. It may sleep.
+ */
+static int kmem_blk_insert_page(struct kmem_blk_data *data, sector_t sector)
+{
+	pgoff_t i = sector >> PAGE_SECTORS_SHIFT;
+	struct page *page;
+	unsigned long page_index = 0;
+	u64 page_num; // TODO fixup u64 / unsigned long to use one type?
+	u64 *index_ptr =
+		&data->index_page
+		[i / index_entries_per_page][i % index_entries_per_page];
+
+	/* Check if block exists */
+	spin_lock(&data->index_lock);
+	page_num = *index_ptr;
+	spin_unlock(&data->index_lock);
+	if (page_num)
+		return 0;
+
+	page = dax_kmem_alloc_page(data->dev_dax, &page_index);
+	if (!page) {
+		dev_err(&data->dev_dax->dev, "Cannot allocate page\n");
+		return -1;
+	}
+
+	spin_lock(&data->index_lock);
+	if (*index_ptr != 0)
+		__free_page(page);
+	else
+		*index_ptr = page_index;
+	spin_unlock(&data->index_lock);
+
+	return 0;
+}
+
+static int kmem_blk_discard(struct kmem_blk_data *data,
+			sector_t sector, size_t n)
+{
+	pgoff_t i = sector >> PAGE_SECTORS_SHIFT;
+	struct page *page;
+	u64 page_num; // TODO fixup u64 / unsigned long to use one type?
+	u64 *index_ptr;
+
+	BUG_ON(sector & ((1 << PAGE_SECTORS_SHIFT) - 1));
+	BUG_ON(n & (PAGE_SIZE - 1));
+
+	while (n > 0) {
+		BUG_ON(i > data->super->num_index_entries);
+		index_ptr =
+			&data->index_page
+			[i / index_entries_per_page]
+			[i % index_entries_per_page];
+		spin_lock(&data->index_lock);
+		page_num = *index_ptr;
+		if (page_num)
+			*index_ptr = 0;
+		spin_unlock(&data->index_lock);
+		if (page_num) {
+			page = dax_kmem_index_to_page(page_num, data->dev_dax);
+			__free_page(page);
+		}
+		i++;
+		n -= PAGE_SIZE;
+	}
+	return 0;
+}
+
+/*
+ * copy_to_kmem_blk_setup must be called before copy_to_kmem_blk. It may sleep.
+ */
+static int copy_to_kmem_blk_setup(struct kmem_blk_data *data, sector_t sector, size_t n)
+{
+	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	if (kmem_blk_insert_page(data, sector))
+		return -ENOSPC;
+	if (copy < n) {
+		sector += copy >> SECTOR_SHIFT;
+		if (kmem_blk_insert_page(data, sector))
+			return -ENOSPC;
+	}
+	return 0;
+}
+
+/*
+ * Copy n bytes from src to the block device starting at sector. Does not sleep.
+ */
+static void copy_to_kmem_blk(struct kmem_blk_data *data, const void *src,
+			sector_t sector, size_t n)
+{
+	struct page *page;
+	void *dst;
+	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	page = kmem_blk_get_page(data, sector);
+	BUG_ON(!page);
+
+	dst = kmap_atomic(page);
+	memcpy(dst + offset, src, copy);
+	kunmap_atomic(dst);
+
+	if (copy < n) {
+		src += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		page = kmem_blk_get_page(data, sector);
+		BUG_ON(!page);
+
+		dst = kmap_atomic(page);
+		memcpy(dst, src, copy);
+		kunmap_atomic(dst);
+	}
+}
+
+/*
+ * Copy n bytes to dst from the block device starting at sector. Does not sleep.
+ */
+static void copy_from_kmem_blk(void *dst, struct kmem_blk_data *data,
+			sector_t sector, size_t n)
+{
+	struct page *page;
+	void *src;
+	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	page = kmem_blk_get_page(data, sector);
+	if (page) {
+		src = kmap_atomic(page);
+		memcpy(dst, src + offset, copy);
+		kunmap_atomic(src);
+	} else
+		memset(dst, 0, copy);
+
+	if (copy < n) {
+		dst += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		page = kmem_blk_get_page(data, sector);
+		if (page) {
+			src = kmap_atomic(page);
+			memcpy(dst, src, copy);
+			kunmap_atomic(src);
+		} else
+			memset(dst, 0, copy);
+	}
+}
+
+/*
+ * Process a single bvec of a bio.
+ */
+static int kmem_blk_do_bvec(struct kmem_blk_data *data, struct page *page,
+			unsigned int len, unsigned int off, unsigned int op,
+			sector_t sector)
+{
+	void *mem = NULL;
+	int err = 0;
+
+	if (op == REQ_OP_WRITE) {
+		err = copy_to_kmem_blk_setup(data, sector, len);
+		if (err)
+			goto out;
+	}
+
+	if (page)
+		mem = kmap_atomic(page);
+	switch (op) {
+	case REQ_OP_READ:
+		copy_from_kmem_blk(mem + off, data, sector, len);
+		flush_dcache_page(page);
+		break;
+	case REQ_OP_WRITE:
+		flush_dcache_page(page);
+		copy_to_kmem_blk(data, mem + off, sector, len);
+		break;
+	case REQ_OP_DISCARD:
+		BUG_ON(page);
+		kmem_blk_discard(data, sector, len);
+		break;
+	default:
+		BUG();
+		break;
+	}
+	if (mem)
+		kunmap_atomic(mem);
+
+out:
+	return err;
+}
+
+static void kmem_blk_submit_bio(struct bio *bio)
+{
+	struct kmem_blk_data *data = bio->bi_bdev->bd_disk->private_data;
+	sector_t sector = bio->bi_iter.bi_sector;
+	struct bio_vec bvec;
+	struct bvec_iter iter;
+
+	/*
+	 * DISCARD and WRITE_ZEROES come separately and don't work with
+	 * bio_for_each_segment
+	 */
+	switch (bio_op(bio)) {
+	case REQ_OP_DISCARD:
+	case REQ_OP_WRITE_ZEROES:
+		kmem_blk_discard(data, sector, bio->bi_iter.bi_size);
+		bio_endio(bio);
+		return;
+	default:
+		break;
+	}
+
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+		int err;
+
+		/* Don't support un-aligned buffer */
+		WARN_ON_ONCE((bvec.bv_offset & (SECTOR_SIZE - 1)) ||
+				(len & (SECTOR_SIZE - 1)));
+		err = kmem_blk_do_bvec(data, bvec.bv_page, len, bvec.bv_offset,
+				bio_op(bio), sector);
+		if (err) {
+			bio_io_error(bio);
+			return;
+		}
+		sector += len >> SECTOR_SHIFT;
+	}
+
+	bio_endio(bio);
+}
+
+static int kmem_blk_rw_page(struct block_device *bdev, sector_t sector,
+			struct page *page, unsigned int op)
+{
+	struct kmem_blk_data *data = bdev->bd_disk->private_data;
+	int err;
+
+	if (PageTransHuge(page))
+		return -EOPNOTSUPP;
+	err = kmem_blk_do_bvec(data, page, PAGE_SIZE, 0, op, sector);
+	page_endio(page, op_is_write(op), err);
+	return err;
+}
+
+static const struct block_device_operations kmem_blk_fops = {
+	.owner =		THIS_MODULE,
+	.submit_bio =		kmem_blk_submit_bio,
+	.rw_page =		kmem_blk_rw_page,
+};
+
+
+
+
+
+static int kmem_blk_disk_init(struct kmem_blk_data *data)
+{
+	struct gendisk *disk;
+	int err;
+
+	disk = blk_alloc_disk(data->dev_dax->target_node);
+	data->disk = disk;
+
+	disk->flags = GENHD_FL_NO_PART;
+	disk->fops = &kmem_blk_fops;
+	disk->private_data = data;
+	snprintf(disk->disk_name, DISK_NAME_LEN, "kmem%d",
+		data->dev_dax->target_node);
+
+	set_capacity(disk,
+		data->super->num_index_entries << PAGE_SECTORS_SHIFT);
+
+	// TODO: Handle cases where PAGE_SIZE is too big.
+	/* Set physical and logical block size to PAGE_SIZE */
+	blk_queue_physical_block_size(disk->queue, PAGE_SIZE);
+	blk_queue_logical_block_size(disk->queue, PAGE_SIZE);
+
+	/* Tell the block layer that this is not a rotational device */
+	blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
+	/* Don't use this for randomness */
+	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, disk->queue);
+
+	/* Support discard */
+	blk_queue_flag_set(QUEUE_FLAG_DISCARD, disk->queue);
+	disk->queue->limits.discard_granularity = PAGE_SIZE;
+	blk_queue_max_discard_sectors(disk->queue, UINT_MAX);
+	/* We can handle WRITE_ZEROES as DISCARD, at units of page size */
+	blk_queue_max_write_zeroes_sectors(disk->queue, UINT_MAX);
+
+	err = add_disk(disk);
+	if (err)
+		goto out_cleanup_disk;
+
+	return 0;
+out_cleanup_disk:
+	blk_cleanup_disk(data->disk);
+	data->disk = NULL;
+	return err;
+}
+
+
+static void kmem_blk_disk_cleanup(struct kmem_blk_data *data)
+{
+	if (data->disk == NULL)
+		return;
+	del_gendisk(data->disk);
+	blk_cleanup_disk(data->disk);
+	data->disk = NULL;
+}
+
+/* Format device with full allocation */
+static int kmem_blk_format(struct dev_dax *dev_dax)
+{
+	struct kmem_blk_super *super =
+		kmap_local_page(dax_kmem_index_to_page(0, dev_dax));
+
+	unsigned long num_pages = dax_kmem_num_pages(dev_dax);
+	u64 i;
+	/*
+	 * c = a / b => c is largest c s.t. c * b <= a.
+	 * c = (a + b - 1) / b is smallest c s.t. c * b >= a
+	 * num_index_pages is the largest number such that
+	 * 1 + num_index_pages + num_index_pages * index_entries_per_page >= num_pages
+	 * num_index_pages *(1 + index_entries_per_page) >= num_pages - 1
+	 * num_index_pages =
+	 *   ((num_pages - 1) + (1 + index_entries_per_page) - 1 ) /
+	 *   (1 + index_entries_per_page)
+	 */
+	u64 num_index_pages =
+		(num_pages + index_entries_per_page - 1) /
+		(1 + index_entries_per_page);
+	super->header.magic = kmem_persist_magic;
+	super->header.type = KMEM_PERSIST_BLK;
+	super->num_index_pages = num_index_pages;
+	super->num_index_entries = num_pages - 1 - num_index_pages;
+
+	for (i = 0; i < num_index_pages; i++) {
+		u64 *index_array =
+			kmap_local_page(dax_kmem_index_to_page(1 + i, dev_dax));
+#if !defined(KMEM_PERSIST_BLK_FORMAT_FULL)
+		memset(index_array, 0, PAGE_SIZE);
+#else /* KMEM_PERSIST_BLK_FORMAT_FULL */
+		u64 j;
+
+		for (j = 0; j < index_entries_per_page; j++) {
+			u64 idx =
+				1 + num_index_pages +
+				i * index_entries_per_page + j;
+
+			if (idx >= num_pages)
+				idx = 0;
+			index_array[j] = idx;
+		}
+#endif
+		kunmap_local(index_array);
+	}
+	kunmap_local(super);
+	return 0;
+}
+
+/* Free unused blocks in the dax memory to system */
+static int kmem_blk_free_unused(struct kmem_blk_data *data)
+{
+	struct kmem_blk_super *super = data->super;
+	unsigned long num_pages = dax_kmem_num_pages(data->dev_dax);
+	u64 *alloc_bitmap;
+	unsigned long i;
+
+	/* Bitmap for tracking allocated pages. Temporary */
+	alloc_bitmap =
+		kvzalloc(sizeof(u64) * BITS_TO_U64(num_pages), GFP_KERNEL);
+	if (alloc_bitmap == NULL) {
+		dev_err(&data->dev_dax->dev,
+			"Unable to allocate bit array. Not freeing unused space.\n");
+		return -ENOMEM;
+	}
+
+	/* Free up pages unused by block storage to memory */
+	for (i = 0; i < super->num_index_entries; i++) {
+		u64 page_num = data->index_page
+			[i / index_entries_per_page]
+			[i % index_entries_per_page];
+
+		if (page_num != 0) {
+			BUG_ON(page_num < 1 + super->num_index_pages ||
+				page_num >= num_pages);
+			/* Set bit */
+			alloc_bitmap[page_num / 64] |= 1 << (page_num % 64);
+		}
+	}
+
+	for (i = 1 + super->num_index_pages; i < num_pages; i++) {
+		struct page *page;
+
+		if (!(alloc_bitmap[i / 64] & (1 << (i % 64)))) {
+			/* Bit clear. Page not used */
+			page = dax_kmem_index_to_page(i, data->dev_dax);
+			__free_page(page);
+		}
+	}
+
+	kvfree(alloc_bitmap);
+	return 0;
+}
+
+static int kmem_blk_probe(struct dev_dax *dev_dax, void **persist_data)
+{
+	struct device *dev = &dev_dax->dev;
+	struct kmem_blk_super *super;
+	unsigned long i;
+	struct kmem_blk_data *data;
+	unsigned long num_pages = dax_kmem_num_pages(dev_dax);
+
+	if (num_pages == 0) {
+		dev_err(dev, "Dax device for KMEM has no pages\n");
+		*persist_data = NULL;
+		return -1;
+	}
+
+	super = kmap(dax_kmem_index_to_page(0, dev_dax));
+
+	/* Validate superblock magic and type */
+	if (super->header.magic != kmem_persist_magic ||
+		super->header.type != KMEM_PERSIST_BLK) {
+		dev_err("KMEM not formatted for blk, magic %lx type %d\n",
+			super->header.magic, super->header.type);
+		kunmap(dax_kmem_index_to_page(0, dev_dax));
+		*persist_data = NULL;
+		return -EINVAL;
+	}
+
+	/* Validate superblock index page counts */
+	if (super->num_index_entries <=
+		super->num_index_pages * index_entries_per_page &&
+		1 + super->num_index_pages + super->num_index_entries
+		== num_pages) {
+		dev_info(dev,
+			"Found kmem_blk superblock num_index_entries %llu num_index_pages %llu num_pages %lu\n",
+			super->num_index_entries,
+			super->num_index_pages, num_pages);
+	} else {
+		dev_warn(dev,
+			"Invalid kmem_blk superblock num_index_entries %llu num_index_pages %llu num_pages %lu\n",
+			super->num_index_entries,
+			super->num_index_pages, num_pages);
+		kunmap(dax_kmem_index_to_page(0, dev_dax));
+		*persist_data = NULL;
+		return -EINVAL;
+	}
+
+	data = kzalloc(struct_size(data, index_page, super->num_index_pages),
+		GFP_KERNEL);
+	if (!data) {
+		kunmap(dax_kmem_index_to_page(0, dev_dax));
+		*persist_data = NULL;
+		return -ENOMEM;
+	}
+
+	*persist_data = data;
+	data->dev_dax = dev_dax;
+	data->super = super;
+	spin_lock_init(&data->index_lock);
+
+	for (i = 0; i < super->num_index_pages; i++)
+		data->index_page[i] =
+			kmap(dax_kmem_index_to_page(i + 1, dev_dax));
+
+	kmem_blk_free_unused(data);
+
+	kmem_blk_disk_init(data);
+
+	return 0;
+}
+
+static int kmem_blk_cleanup(struct dev_dax *dev_dax, void *persist_data)
+{
+	struct kmem_blk_data *data = persist_data;
+	unsigned long num_pages = dax_kmem_num_pages(dev_dax);
+	unsigned long i;
+
+	if (data == NULL)
+		return -1;
+
+	kmem_blk_disk_cleanup(data);
+
+	if (data->super == 0) {
+		for (i = 0; i < num_pages; i++)
+			__free_page(dax_kmem_index_to_page(i, dev_dax));
+	} else {
+		for (i = 0; i < data->super->num_index_entries; i++) {
+			u64 page_num = data->index_page
+				[i / index_entries_per_page]
+				[i % index_entries_per_page];
+			if (page_num != 0) {
+				__free_page(dax_kmem_index_to_page(page_num,
+								   dev_dax));
+			}
+		}
+		for (i = 0; i < data->super->num_index_pages; i++) {
+			struct page *page =
+				dax_kmem_index_to_page(1 + i, dev_dax);
+			data->index_page[i] = NULL;
+			kunmap(page);
+			__free_page(page);
+		}
+		data->super = NULL;
+		kunmap(dax_kmem_index_to_page(0, dev_dax));
+		__free_page(dax_kmem_index_to_page(0, dev_dax));
+	}
+	kfree(data);
+	return 0;
+}
+
+struct kmem_persist_ops kmem_persist_blk_ops = {
+	.type = KMEM_PERSIST_BLK,
+	.format = kmem_blk_format,
+	.probe = kmem_blk_probe,
+	.cleanup = kmem_blk_cleanup
+};
diff --git a/drivers/dax/kmem_persist.h b/drivers/dax/kmem_persist.h
index dd651025f28c..0e0279feaa12 100644
--- a/drivers/dax/kmem_persist.h
+++ b/drivers/dax/kmem_persist.h
@@ -10,6 +10,7 @@ struct dev_dax;
 
 enum kmem_persist_type {
 	KMEM_PERSIST_NONE = 0,
+	KMEM_PERSIST_BLK,
 };
 
 
@@ -40,4 +41,7 @@ unsigned long dax_kmem_num_pages(struct dev_dax *dev_dax);
 struct page *dax_kmem_alloc_page(struct dev_dax *dev_dax,
 				unsigned long *page_index);
 
+/* Defined in kmem_blk.c */
+extern struct kmem_persist_ops kmem_persist_blk_ops;
+
 #endif
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM
  2022-08-02 18:03 ` [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM David Hildenbrand
@ 2022-08-02 18:53   ` Srinivas Aji
  0 siblings, 0 replies; 13+ messages in thread
From: Srinivas Aji @ 2022-08-02 18:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-nvdimm, Linux MM, Dan Williams, Vivek Goyal,
	David Woodhouse, Gowans, James, Yue Li, Beau Beauchamp

On Tue, Aug 02, 2022 at 08:03:10PM +0200, David Hildenbrand wrote:
> > Srinivas Aji (4):
> >   mm/memory_hotplug: Add MHP_ALLOCATE flag which treats hotplugged
> >     memory as allocated
> 
> Without seeing the actual patches, I am very skeptical that this is the
> right approach, especially regarding memory onlining/offlining.
> 
> virtio-mem achieves something similar (yet different) by hooking into
> generic_online_page(). From there, you can control what should actually
> happen with memory that is getting onlined (e.g., free them to the buddy
> or treat them differently).
> 
> Did you evaluate going that path?

Thanks for the feedback. I had not tried this. Let me see if what I am
trying here can be done using the online_page_callback mechanism.

Thanks,
Srinivas


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 4/4] device-dax: Add a block device persistent type, BLK, for DAX KMEM
  2022-08-02 18:12 ` [RFC PATCH 4/4] device-dax: Add a block device persistent type, BLK, for DAX KMEM Srinivas Aji
@ 2022-08-03 20:00   ` kernel test robot
  2022-08-03 21:19   ` Fabio M. De Francesco
  2022-08-04 10:45   ` kernel test robot
  2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2022-08-03 20:00 UTC (permalink / raw)
  To: Srinivas Aji; +Cc: llvm, kbuild-all

Hi Srinivas,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on linus/master]
[also build test ERROR on v5.19 next-20220802]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Srinivas-Aji/Allow-persistent-data-on-DAX-device-being-used-as-KMEM/20220803-021320
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 0805c6fb39f66e01cb0adccfae8d9e0615c70fd7
config: x86_64-randconfig-a003 (https://download.01.org/0day-ci/archive/20220804/202208040329.87LkXtmz-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 495519e5f8232d144ed26e9c18dbcbac6a5f25eb)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/1d821cfc1cd5b2a7034ca77e84b59cf808b09a4f
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Srinivas-Aji/Allow-persistent-data-on-DAX-device-being-used-as-KMEM/20220803-021320
        git checkout 1d821cfc1cd5b2a7034ca77e84b59cf808b09a4f
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/dax/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> drivers/dax/kmem_blk.c:338:21: error: use of undeclared identifier 'QUEUE_FLAG_DISCARD'
           blk_queue_flag_set(QUEUE_FLAG_DISCARD, disk->queue);
                              ^
>> drivers/dax/kmem_blk.c:479:11: error: incompatible pointer types passing 'char[47]' to parameter of type 'const struct device *' [-Werror,-Wincompatible-pointer-types]
                   dev_err("KMEM not formatted for blk, magic %lx type %d\n",
                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/dev_printk.h:144:44: note: expanded from macro 'dev_err'
           dev_printk_index_wrap(_dev_err, KERN_ERR, dev, dev_fmt(fmt), ##__VA_ARGS__)
                                                     ^~~
   include/linux/dev_printk.h:110:11: note: expanded from macro 'dev_printk_index_wrap'
                   _p_func(dev, fmt, ##__VA_ARGS__);                       \
                           ^~~
   include/linux/dev_printk.h:50:36: note: passing argument to parameter 'dev' here
   void _dev_err(const struct device *dev, const char *fmt, ...);
                                      ^
>> drivers/dax/kmem_blk.c:480:4: error: incompatible integer to pointer conversion passing 'unsigned long' to parameter of type 'const char *' [-Wint-conversion]
                           super->header.magic, super->header.type);
                           ^~~~~~~~~~~~~~~~~~~
   include/linux/dev_printk.h:144:57: note: expanded from macro 'dev_err'
           dev_printk_index_wrap(_dev_err, KERN_ERR, dev, dev_fmt(fmt), ##__VA_ARGS__)
                                                                  ^~~
   include/linux/dev_printk.h:19:22: note: expanded from macro 'dev_fmt'
   #define dev_fmt(fmt) fmt
                        ^~~
   include/linux/dev_printk.h:110:16: note: expanded from macro 'dev_printk_index_wrap'
                   _p_func(dev, fmt, ##__VA_ARGS__);                       \
                                ^~~
   include/linux/dev_printk.h:50:53: note: passing argument to parameter 'fmt' here
   void _dev_err(const struct device *dev, const char *fmt, ...);
                                                       ^
   3 errors generated.


vim +/QUEUE_FLAG_DISCARD +338 drivers/dax/kmem_blk.c

   305	
   306	
   307	
   308	
   309	
   310	static int kmem_blk_disk_init(struct kmem_blk_data *data)
   311	{
   312		struct gendisk *disk;
   313		int err;
   314	
   315		disk = blk_alloc_disk(data->dev_dax->target_node);
   316		data->disk = disk;
   317	
   318		disk->flags = GENHD_FL_NO_PART;
   319		disk->fops = &kmem_blk_fops;
   320		disk->private_data = data;
   321		snprintf(disk->disk_name, DISK_NAME_LEN, "kmem%d",
   322			data->dev_dax->target_node);
   323	
   324		set_capacity(disk,
   325			data->super->num_index_entries << PAGE_SECTORS_SHIFT);
   326	
   327		// TODO: Handle cases where PAGE_SIZE is too big.
   328		/* Set physical and logical block size to PAGE_SIZE */
   329		blk_queue_physical_block_size(disk->queue, PAGE_SIZE);
   330		blk_queue_logical_block_size(disk->queue, PAGE_SIZE);
   331	
   332		/* Tell the block layer that this is not a rotational device */
   333		blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
   334		/* Don't use this for randomness */
   335		blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, disk->queue);
   336	
   337		/* Support discard */
 > 338		blk_queue_flag_set(QUEUE_FLAG_DISCARD, disk->queue);
   339		disk->queue->limits.discard_granularity = PAGE_SIZE;
   340		blk_queue_max_discard_sectors(disk->queue, UINT_MAX);
   341		/* We can handle WRITE_ZEROES as DISCARD, at units of page size */
   342		blk_queue_max_write_zeroes_sectors(disk->queue, UINT_MAX);
   343	
   344		err = add_disk(disk);
   345		if (err)
   346			goto out_cleanup_disk;
   347	
   348		return 0;
   349	out_cleanup_disk:
   350		blk_cleanup_disk(data->disk);
   351		data->disk = NULL;
   352		return err;
   353	}
   354	
   355	
   356	static void kmem_blk_disk_cleanup(struct kmem_blk_data *data)
   357	{
   358		if (data->disk == NULL)
   359			return;
   360		del_gendisk(data->disk);
   361		blk_cleanup_disk(data->disk);
   362		data->disk = NULL;
   363	}
   364	
   365	/* Format device with full allocation */
   366	static int kmem_blk_format(struct dev_dax *dev_dax)
   367	{
   368		struct kmem_blk_super *super =
   369			kmap_local_page(dax_kmem_index_to_page(0, dev_dax));
   370	
   371		unsigned long num_pages = dax_kmem_num_pages(dev_dax);
   372		u64 i;
   373		/*
   374		 * c = a / b => c is largest c s.t. c * b <= a.
   375		 * c = (a + b - 1) / b is smallest c s.t. c * b >= a
   376		 * num_index_pages is the largest number such that
   377		 * 1 + num_index_pages + num_index_pages * index_entries_per_page >= num_pages
   378		 * num_index_pages *(1 + index_entries_per_page) >= num_pages - 1
   379		 * num_index_pages =
   380		 *   ((num_pages - 1) + (1 + index_entries_per_page) - 1 ) /
   381		 *   (1 + index_entries_per_page)
   382		 */
   383		u64 num_index_pages =
   384			(num_pages + index_entries_per_page - 1) /
   385			(1 + index_entries_per_page);
   386		super->header.magic = kmem_persist_magic;
   387		super->header.type = KMEM_PERSIST_BLK;
   388		super->num_index_pages = num_index_pages;
   389		super->num_index_entries = num_pages - 1 - num_index_pages;
   390	
   391		for (i = 0; i < num_index_pages; i++) {
   392			u64 *index_array =
   393				kmap_local_page(dax_kmem_index_to_page(1 + i, dev_dax));
   394	#if !defined(KMEM_PERSIST_BLK_FORMAT_FULL)
   395			memset(index_array, 0, PAGE_SIZE);
   396	#else /* KMEM_PERSIST_BLK_FORMAT_FULL */
   397			u64 j;
   398	
   399			for (j = 0; j < index_entries_per_page; j++) {
   400				u64 idx =
   401					1 + num_index_pages +
   402					i * index_entries_per_page + j;
   403	
   404				if (idx >= num_pages)
   405					idx = 0;
   406				index_array[j] = idx;
   407			}
   408	#endif
   409			kunmap_local(index_array);
   410		}
   411		kunmap_local(super);
   412		return 0;
   413	}
   414	
   415	/* Free unused blocks in the dax memory to system */
   416	static int kmem_blk_free_unused(struct kmem_blk_data *data)
   417	{
   418		struct kmem_blk_super *super = data->super;
   419		unsigned long num_pages = dax_kmem_num_pages(data->dev_dax);
   420		u64 *alloc_bitmap;
   421		unsigned long i;
   422	
   423		/* Bitmap for tracking allocated pages. Temporary */
   424		alloc_bitmap =
   425			kvzalloc(sizeof(u64) * BITS_TO_U64(num_pages), GFP_KERNEL);
   426		if (alloc_bitmap == NULL) {
   427			dev_err(&data->dev_dax->dev,
   428				"Unable to allocate bit array. Not freeing unused space.\n");
   429			return -ENOMEM;
   430		}
   431	
   432		/* Free up pages unused by block storage to memory */
   433		for (i = 0; i < super->num_index_entries; i++) {
   434			u64 page_num = data->index_page
   435				[i / index_entries_per_page]
   436				[i % index_entries_per_page];
   437	
   438			if (page_num != 0) {
   439				BUG_ON(page_num < 1 + super->num_index_pages ||
   440					page_num >= num_pages);
   441				/* Set bit */
   442				alloc_bitmap[page_num / 64] |= 1 << (page_num % 64);
   443			}
   444		}
   445	
   446		for (i = 1 + super->num_index_pages; i < num_pages; i++) {
   447			struct page *page;
   448	
   449			if (!(alloc_bitmap[i / 64] & (1 << (i % 64)))) {
   450				/* Bit clear. Page not used */
   451				page = dax_kmem_index_to_page(i, data->dev_dax);
   452				__free_page(page);
   453			}
   454		}
   455	
   456		kvfree(alloc_bitmap);
   457		return 0;
   458	}
   459	
   460	static int kmem_blk_probe(struct dev_dax *dev_dax, void **persist_data)
   461	{
   462		struct device *dev = &dev_dax->dev;
   463		struct kmem_blk_super *super;
   464		unsigned long i;
   465		struct kmem_blk_data *data;
   466		unsigned long num_pages = dax_kmem_num_pages(dev_dax);
   467	
   468		if (num_pages == 0) {
   469			dev_err(dev, "Dax device for KMEM has no pages\n");
   470			*persist_data = NULL;
   471			return -1;
   472		}
   473	
   474		super = kmap(dax_kmem_index_to_page(0, dev_dax));
   475	
   476		/* Validate superblock magic and type */
   477		if (super->header.magic != kmem_persist_magic ||
   478			super->header.type != KMEM_PERSIST_BLK) {
 > 479			dev_err("KMEM not formatted for blk, magic %lx type %d\n",
 > 480				super->header.magic, super->header.type);
   481			kunmap(dax_kmem_index_to_page(0, dev_dax));
   482			*persist_data = NULL;
   483			return -EINVAL;
   484		}
   485	
   486		/* Validate superblock index page counts */
   487		if (super->num_index_entries <=
   488			super->num_index_pages * index_entries_per_page &&
   489			1 + super->num_index_pages + super->num_index_entries
   490			== num_pages) {
   491			dev_info(dev,
   492				"Found kmem_blk superblock num_index_entries %llu num_index_pages %llu num_pages %lu\n",
   493				super->num_index_entries,
   494				super->num_index_pages, num_pages);
   495		} else {
   496			dev_warn(dev,
   497				"Invalid kmem_blk superblock num_index_entries %llu num_index_pages %llu num_pages %lu\n",
   498				super->num_index_entries,
   499				super->num_index_pages, num_pages);
   500			kunmap(dax_kmem_index_to_page(0, dev_dax));
   501			*persist_data = NULL;
   502			return -EINVAL;
   503		}
   504	
   505		data = kzalloc(struct_size(data, index_page, super->num_index_pages),
   506			GFP_KERNEL);
   507		if (!data) {
   508			kunmap(dax_kmem_index_to_page(0, dev_dax));
   509			*persist_data = NULL;
   510			return -ENOMEM;
   511		}
   512	
   513		*persist_data = data;
   514		data->dev_dax = dev_dax;
   515		data->super = super;
   516		spin_lock_init(&data->index_lock);
   517	
   518		for (i = 0; i < super->num_index_pages; i++)
   519			data->index_page[i] =
   520				kmap(dax_kmem_index_to_page(i + 1, dev_dax));
   521	
   522		kmem_blk_free_unused(data);
   523	
   524		kmem_blk_disk_init(data);
   525	
   526		return 0;
   527	}
   528	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 4/4] device-dax: Add a block device persistent type, BLK, for DAX KMEM
  2022-08-02 18:12 ` [RFC PATCH 4/4] device-dax: Add a block device persistent type, BLK, for DAX KMEM Srinivas Aji
  2022-08-03 20:00   ` kernel test robot
@ 2022-08-03 21:19   ` Fabio M. De Francesco
  2022-08-04 10:45   ` kernel test robot
  2 siblings, 0 replies; 13+ messages in thread
From: Fabio M. De Francesco @ 2022-08-03 21:19 UTC (permalink / raw)
  To: linux-nvdimm, Linux MM, Srinivas Aji
  Cc: Dan Williams, David Hildenbrand, Vivek Goyal, David Woodhouse,
	Gowans, James, Yue Li, Beau Beauchamp, Ira Weiny

On martedì 2 agosto 2022 20:12:07 CEST Srinivas Aji wrote:
> When a DAX KMEM device is formatted as of type BLK, adding the DAX memory
> exposes a block device /dev/kmem<numa_node> . A filesystem can be created
> on this block device. Blocks which contain data are unavailable for use 
as
> system memory, but blocks which are freed up using DISCARD become free 
for
> system use.
> 
> The implementation uses an array which maps the logical block number
> to the real block offset in the DAX device. This allows us to keep
> block device semantics even though allocations can return any page.
> 
> Signed-off-by: Srinivas Aji <srinivas.aji@memverge.com>
> ---
>  drivers/dax/Makefile       |   1 +
>  drivers/dax/kmem.c         |   4 +-
>  drivers/dax/kmem_blk.c     | 573 +++++++++++++++++++++++++++++++++++++
>  drivers/dax/kmem_persist.h |   4 +
>  4 files changed, 581 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/dax/kmem_blk.c

From a quick look at this code I see a mix of kmap(), kmap_atomic() and 
kmap_local_page(). Actually it's not obvious to me why you are still using 
those kmap() and kmap_atomic() functions since they are being deprecated 
and shouldn't be used in new code.

Furthermore, there's an ongoing effort towards replacing the latter two 
functions with kmap_local_page(). This implies that this code, sooner or 
later, will be refactored to get rid of those two functions.

Please take a look at highmem.rst which was update in mainline a couple of 
months ago, while a second round of changes are still in Andrew Morton's 
"mm-unstable" branch.

There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.

With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and are still valid.

As said, since kmap_local_page() can be also called from atomic context,
and since the code shouldn't ever rely on an implicit preempt_disable(), 
this function can also safely replace kmap_atomic().

I haven't looked closely at your code because I haven't the necessary 
domain knowledge and experience to comment on design and implementation 
details. 

However, for what regards the mappings, it looks it can be converted to the 
only use of kmap_local_page and its related helpers even if it is 
immediately clear that there are some parts that need refactoring to not 
break the rules of local mapping/unmapping, especially when nesting several 
kmap_local_page() in loops.

> diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
> index 90a56ca3b345..d0a97f4af4ea 100644
> --- a/drivers/dax/Makefile
> +++ b/drivers/dax/Makefile
> @@ -3,6 +3,7 @@ obj-$(CONFIG_DAX) += dax.o
>  obj-$(CONFIG_DEV_DAX) += device_dax.o
>  obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
>  obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
> +obj-$(CONFIG_DEV_DAX_KMEM_PERSIST) += kmem_blk.o
>  
>  dax-y := super.o
>  dax-y += bus.o
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 0ca6e14f7e73..0fa45d1ba9cc 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -534,8 +534,10 @@ static int __init dax_kmem_init(void)
>  	if (rc)
>  		kfree_const(kmem_name);
>  #ifdef CONFIG_DEV_DAX_KMEM_PERSIST
> -	if (rc == 0)
> +	if (rc == 0) {
>  		kmem_persist_type_register(&kmem_persist_none_ops);
> +		kmem_persist_type_register(&kmem_persist_blk_ops);
> +	}
>  #endif
>  	return rc;
>  }
> diff --git a/drivers/dax/kmem_blk.c b/drivers/dax/kmem_blk.c
> new file mode 100644
> index 000000000000..856b35713999
> --- /dev/null
> +++ b/drivers/dax/kmem_blk.c
> @@ -0,0 +1,573 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2022 MemVerge. All rights reserved. */
> +#include <linux/module.h>
> +#include <linux/major.h>
> +#include <linux/blkdev.h>
> +#include <linux/bio.h>
> +#include <linux/highmem.h>
> +#include <linux/pagemap.h>
> +#include <linux/bitops.h>
> +#include <linux/slab.h>
> +#include "dax-private.h"
> +#include "kmem_persist.h"
> +
> +static const unsigned int index_entries_per_page = (PAGE_SIZE / 
sizeof(u64));
> +
> +struct kmem_blk_super {
> +	struct kmem_persist_superblock header;
> +	u64 num_index_pages;
> +	u64 num_index_entries;
> +} __packed;
> +
> +struct kmem_blk_data {
> +	struct dev_dax *dev_dax;
> +	struct gendisk *disk;
> +	spinlock_t index_lock;
> +	struct kmem_blk_super *super;
> +	unsigned long num_index_pages;
> +	u64 *index_page[];
> +};
> +
> +// TODO: Make sure locking is sound for concurrent multiple IOs,
> +// i.e. writes and discards.
> +
> +static struct page *kmem_blk_get_page(struct kmem_blk_data *data,
> +				sector_t sector)
> +{
> +	pgoff_t i = sector >> PAGE_SECTORS_SHIFT;
> +	u64 page_num;
> +
> +	spin_lock(&data->index_lock);
> +	page_num = data->index_page
> +		[i / index_entries_per_page]
> +		[i % index_entries_per_page];
> +	spin_unlock(&data->index_lock);
> +
> +	if (page_num)
> +		return dax_kmem_index_to_page(page_num, data->dev_dax);
> +	else
> +		return NULL;
> +}
> +
> +/*
> + * copy_to_kmem_blk_setup must be called before copy_to_kmem_blk. It may 
sleep.
> + */
> +static int kmem_blk_insert_page(struct kmem_blk_data *data, sector_t 
sector)
> +{
> +	pgoff_t i = sector >> PAGE_SECTORS_SHIFT;
> +	struct page *page;
> +	unsigned long page_index = 0;
> +	u64 page_num; // TODO fixup u64 / unsigned long to use one type?
> +	u64 *index_ptr =
> +		&data->index_page
> +		[i / index_entries_per_page][i % 
index_entries_per_page];
> +
> +	/* Check if block exists */
> +	spin_lock(&data->index_lock);
> +	page_num = *index_ptr;
> +	spin_unlock(&data->index_lock);
> +	if (page_num)
> +		return 0;
> +
> +	page = dax_kmem_alloc_page(data->dev_dax, &page_index);
> +	if (!page) {
> +		dev_err(&data->dev_dax->dev, "Cannot allocate page\n");
> +		return -1;
> +	}
> +
> +	spin_lock(&data->index_lock);
> +	if (*index_ptr != 0)
> +		__free_page(page);
> +	else
> +		*index_ptr = page_index;
> +	spin_unlock(&data->index_lock);
> +
> +	return 0;
> +}
> +
> +static int kmem_blk_discard(struct kmem_blk_data *data,
> +			sector_t sector, size_t n)
> +{
> +	pgoff_t i = sector >> PAGE_SECTORS_SHIFT;
> +	struct page *page;
> +	u64 page_num; // TODO fixup u64 / unsigned long to use one type?
> +	u64 *index_ptr;
> +
> +	BUG_ON(sector & ((1 << PAGE_SECTORS_SHIFT) - 1));
> +	BUG_ON(n & (PAGE_SIZE - 1));
> +
> +	while (n > 0) {
> +		BUG_ON(i > data->super->num_index_entries);
> +		index_ptr =
> +			&data->index_page
> +			[i / index_entries_per_page]
> +			[i % index_entries_per_page];
> +		spin_lock(&data->index_lock);
> +		page_num = *index_ptr;
> +		if (page_num)
> +			*index_ptr = 0;
> +		spin_unlock(&data->index_lock);
> +		if (page_num) {
> +			page = dax_kmem_index_to_page(page_num, data-
>dev_dax);
> +			__free_page(page);
> +		}
> +		i++;
> +		n -= PAGE_SIZE;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * copy_to_kmem_blk_setup must be called before copy_to_kmem_blk. It may 
sleep.
> + */
> +static int copy_to_kmem_blk_setup(struct kmem_blk_data *data, sector_t 
sector, size_t n)
> +{
> +	unsigned int offset = (sector & (PAGE_SECTORS-1)) << 
SECTOR_SHIFT;
> +	size_t copy;
> +
> +	copy = min_t(size_t, n, PAGE_SIZE - offset);
> +	if (kmem_blk_insert_page(data, sector))
> +		return -ENOSPC;
> +	if (copy < n) {
> +		sector += copy >> SECTOR_SHIFT;
> +		if (kmem_blk_insert_page(data, sector))
> +			return -ENOSPC;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Copy n bytes from src to the block device starting at sector. Does 
not sleep.
> + */
> +static void copy_to_kmem_blk(struct kmem_blk_data *data, const void 
*src,
> +			sector_t sector, size_t n)
> +{
> +	struct page *page;
> +	void *dst;
> +	unsigned int offset = (sector & (PAGE_SECTORS-1)) << 
SECTOR_SHIFT;
> +	size_t copy;
> +
> +	copy = min_t(size_t, n, PAGE_SIZE - offset);
> +	page = kmem_blk_get_page(data, sector);
> +	BUG_ON(!page);
> +
> +	dst = kmap_atomic(page);
> +	memcpy(dst + offset, src, copy);
> +	kunmap_atomic(dst);

Can you please replace the above three lines with memcpy_to_page()?

> +
> +	if (copy < n) {
> +		src += copy;
> +		sector += copy >> SECTOR_SHIFT;
> +		copy = n - copy;
> +		page = kmem_blk_get_page(data, sector);
> +		BUG_ON(!page);
> +
> +		dst = kmap_atomic(page);
> +		memcpy(dst, src, copy);
> +		kunmap_atomic(dst);

Same here, please.

> +	}
> +}
> +
> +/*
> + * Copy n bytes to dst from the block device starting at sector. Does 
not sleep.
> + */
> +static void copy_from_kmem_blk(void *dst, struct kmem_blk_data *data,
> +			sector_t sector, size_t n)
> +{
> +	struct page *page;
> +	void *src;
> +	unsigned int offset = (sector & (PAGE_SECTORS-1)) << 
SECTOR_SHIFT;
> +	size_t copy;
> +
> +	copy = min_t(size_t, n, PAGE_SIZE - offset);
> +	page = kmem_blk_get_page(data, sector);
> +	if (page) {
> +		src = kmap_atomic(page);
> +		memcpy(dst, src + offset, copy);
> +		kunmap_atomic(src);

Again.

> +	} else
> +		memset(dst, 0, copy);
> +
> +	if (copy < n) {
> +		dst += copy;
> +		sector += copy >> SECTOR_SHIFT;
> +		copy = n - copy;
> +		page = kmem_blk_get_page(data, sector);
> +		if (page) {
> +			src = kmap_atomic(page);
> +			memcpy(dst, src, copy);
> +			kunmap_atomic(src);

And again :-)

> +		} else
> +			memset(dst, 0, copy);
> +	}
> +}
> +
> +/*
> + * Process a single bvec of a bio.
> + */
> +static int kmem_blk_do_bvec(struct kmem_blk_data *data, struct page 
*page,
> +			unsigned int len, unsigned int off, unsigned 
int op,
> +			sector_t sector)
> +{
> +	void *mem = NULL;
> +	int err = 0;
> +
> +	if (op == REQ_OP_WRITE) {
> +		err = copy_to_kmem_blk_setup(data, sector, len);
> +		if (err)
> +			goto out;
> +	}
> +
> +	if (page)
> +		mem = kmap_atomic(page);

Are you implicitly relying to preempt_disable()? I'm not following closely. 
However, if so, you shouldn't rely on it. Please use kmap_local_page() and, 
if you need this switch and the calls under atomic context use the suited 
API in explicit ways.

> +	switch (op) {
> +	case REQ_OP_READ:
> +		copy_from_kmem_blk(mem + off, data, sector, len);
> +		flush_dcache_page(page);
> +		break;
> +	case REQ_OP_WRITE:
> +		flush_dcache_page(page);
> +		copy_to_kmem_blk(data, mem + off, sector, len);
> +		break;
> +	case REQ_OP_DISCARD:
> +		BUG_ON(page);
> +		kmem_blk_discard(data, sector, len);
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	if (mem)
> +		kunmap_atomic(mem);
> +
> +out:
> +	return err;
> +}
> +
> +static void kmem_blk_submit_bio(struct bio *bio)
> +{
> +	struct kmem_blk_data *data = bio->bi_bdev->bd_disk->private_data;
> +	sector_t sector = bio->bi_iter.bi_sector;
> +	struct bio_vec bvec;
> +	struct bvec_iter iter;
> +
> +	/*
> +	 * DISCARD and WRITE_ZEROES come separately and don't work with
> +	 * bio_for_each_segment
> +	 */
> +	switch (bio_op(bio)) {
> +	case REQ_OP_DISCARD:
> +	case REQ_OP_WRITE_ZEROES:
> +		kmem_blk_discard(data, sector, bio->bi_iter.bi_size);
> +		bio_endio(bio);
> +		return;
> +	default:
> +		break;
> +	}
> +
> +	bio_for_each_segment(bvec, bio, iter) {
> +		unsigned int len = bvec.bv_len;
> +		int err;
> +
> +		/* Don't support un-aligned buffer */
> +		WARN_ON_ONCE((bvec.bv_offset & (SECTOR_SIZE - 1)) ||
> +				(len & (SECTOR_SIZE - 1)));
> +		err = kmem_blk_do_bvec(data, bvec.bv_page, len, 
bvec.bv_offset,
> +				bio_op(bio), sector);
> +		if (err) {
> +			bio_io_error(bio);
> +			return;
> +		}
> +		sector += len >> SECTOR_SHIFT;
> +	}
> +
> +	bio_endio(bio);
> +}
> +
> +static int kmem_blk_rw_page(struct block_device *bdev, sector_t sector,
> +			struct page *page, unsigned int op)
> +{
> +	struct kmem_blk_data *data = bdev->bd_disk->private_data;
> +	int err;
> +
> +	if (PageTransHuge(page))
> +		return -EOPNOTSUPP;
> +	err = kmem_blk_do_bvec(data, page, PAGE_SIZE, 0, op, sector);
> +	page_endio(page, op_is_write(op), err);
> +	return err;
> +}
> +
> +static const struct block_device_operations kmem_blk_fops = {
> +	.owner =		THIS_MODULE,
> +	.submit_bio =		kmem_blk_submit_bio,
> +	.rw_page =		kmem_blk_rw_page,
> +};
> +
> +
> +
> +
> +
> +static int kmem_blk_disk_init(struct kmem_blk_data *data)
> +{
> +	struct gendisk *disk;
> +	int err;
> +
> +	disk = blk_alloc_disk(data->dev_dax->target_node);
> +	data->disk = disk;
> +
> +	disk->flags = GENHD_FL_NO_PART;
> +	disk->fops = &kmem_blk_fops;
> +	disk->private_data = data;
> +	snprintf(disk->disk_name, DISK_NAME_LEN, "kmem%d",
> +		data->dev_dax->target_node);
> +
> +	set_capacity(disk,
> +		data->super->num_index_entries << PAGE_SECTORS_SHIFT);
> +
> +	// TODO: Handle cases where PAGE_SIZE is too big.
> +	/* Set physical and logical block size to PAGE_SIZE */
> +	blk_queue_physical_block_size(disk->queue, PAGE_SIZE);
> +	blk_queue_logical_block_size(disk->queue, PAGE_SIZE);
> +
> +	/* Tell the block layer that this is not a rotational device */
> +	blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
> +	/* Don't use this for randomness */
> +	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, disk->queue);
> +
> +	/* Support discard */
> +	blk_queue_flag_set(QUEUE_FLAG_DISCARD, disk->queue);
> +	disk->queue->limits.discard_granularity = PAGE_SIZE;
> +	blk_queue_max_discard_sectors(disk->queue, UINT_MAX);
> +	/* We can handle WRITE_ZEROES as DISCARD, at units of page size 
*/
> +	blk_queue_max_write_zeroes_sectors(disk->queue, UINT_MAX);
> +
> +	err = add_disk(disk);
> +	if (err)
> +		goto out_cleanup_disk;
> +
> +	return 0;
> +out_cleanup_disk:
> +	blk_cleanup_disk(data->disk);
> +	data->disk = NULL;
> +	return err;
> +}
> +
> +
> +static void kmem_blk_disk_cleanup(struct kmem_blk_data *data)
> +{
> +	if (data->disk == NULL)
> +		return;
> +	del_gendisk(data->disk);
> +	blk_cleanup_disk(data->disk);
> +	data->disk = NULL;
> +}
> +
> +/* Format device with full allocation */
> +static int kmem_blk_format(struct dev_dax *dev_dax)
> +{
> +	struct kmem_blk_super *super =
> +		kmap_local_page(dax_kmem_index_to_page(0, dev_dax));
> +
> +	unsigned long num_pages = dax_kmem_num_pages(dev_dax);
> +	u64 i;
> +	/*
> +	 * c = a / b => c is largest c s.t. c * b <= a.
> +	 * c = (a + b - 1) / b is smallest c s.t. c * b >= a
> +	 * num_index_pages is the largest number such that
> +	 * 1 + num_index_pages + num_index_pages * index_entries_per_page 
>= num_pages
> +	 * num_index_pages *(1 + index_entries_per_page) >= num_pages - 1
> +	 * num_index_pages =
> +	 *   ((num_pages - 1) + (1 + index_entries_per_page) - 1 ) /
> +	 *   (1 + index_entries_per_page)
> +	 */
> +	u64 num_index_pages =
> +		(num_pages + index_entries_per_page - 1) /
> +		(1 + index_entries_per_page);
> +	super->header.magic = kmem_persist_magic;
> +	super->header.type = KMEM_PERSIST_BLK;
> +	super->num_index_pages = num_index_pages;
> +	super->num_index_entries = num_pages - 1 - num_index_pages;
> +
> +	for (i = 0; i < num_index_pages; i++) {
> +		u64 *index_array =
> +			kmap_local_page(dax_kmem_index_to_page(1 + i, 
dev_dax));
> +#if !defined(KMEM_PERSIST_BLK_FORMAT_FULL)
> +		memset(index_array, 0, PAGE_SIZE);
> +#else /* KMEM_PERSIST_BLK_FORMAT_FULL */
> +		u64 j;
> +
> +		for (j = 0; j < index_entries_per_page; j++) {
> +			u64 idx =
> +				1 + num_index_pages +
> +				i * index_entries_per_page + j;
> +
> +			if (idx >= num_pages)
> +				idx = 0;
> +			index_array[j] = idx;
> +		}
> +#endif
> +		kunmap_local(index_array);
> +	}
> +	kunmap_local(super);
> +	return 0;
> +}
> +
> +/* Free unused blocks in the dax memory to system */
> +static int kmem_blk_free_unused(struct kmem_blk_data *data)
> +{
> +	struct kmem_blk_super *super = data->super;
> +	unsigned long num_pages = dax_kmem_num_pages(data->dev_dax);
> +	u64 *alloc_bitmap;
> +	unsigned long i;
> +
> +	/* Bitmap for tracking allocated pages. Temporary */
> +	alloc_bitmap =
> +		kvzalloc(sizeof(u64) * BITS_TO_U64(num_pages), 
GFP_KERNEL);
> +	if (alloc_bitmap == NULL) {
> +		dev_err(&data->dev_dax->dev,
> +			"Unable to allocate bit array. Not freeing 
unused space.\n");
> +		return -ENOMEM;
> +	}
> +
> +	/* Free up pages unused by block storage to memory */
> +	for (i = 0; i < super->num_index_entries; i++) {
> +		u64 page_num = data->index_page
> +			[i / index_entries_per_page]
> +			[i % index_entries_per_page];
> +
> +		if (page_num != 0) {
> +			BUG_ON(page_num < 1 + super->num_index_pages 
||
> +				page_num >= num_pages);
> +			/* Set bit */
> +			alloc_bitmap[page_num / 64] |= 1 << (page_num 
% 64);
> +		}
> +	}
> +
> +	for (i = 1 + super->num_index_pages; i < num_pages; i++) {
> +		struct page *page;
> +
> +		if (!(alloc_bitmap[i / 64] & (1 << (i % 64)))) {
> +			/* Bit clear. Page not used */
> +			page = dax_kmem_index_to_page(i, data-
>dev_dax);
> +			__free_page(page);
> +		}
> +	}
> +
> +	kvfree(alloc_bitmap);
> +	return 0;
> +}
> +
> +static int kmem_blk_probe(struct dev_dax *dev_dax, void **persist_data)
> +{
> +	struct device *dev = &dev_dax->dev;
> +	struct kmem_blk_super *super;
> +	unsigned long i;
> +	struct kmem_blk_data *data;
> +	unsigned long num_pages = dax_kmem_num_pages(dev_dax);
> +
> +	if (num_pages == 0) {
> +		dev_err(dev, "Dax device for KMEM has no pages\n");
> +		*persist_data = NULL;
> +		return -1;
> +	}
> +
> +	super = kmap(dax_kmem_index_to_page(0, dev_dax));

This looks better suited for kmap_local_page().

> +
> +	/* Validate superblock magic and type */
> +	if (super->header.magic != kmem_persist_magic ||
> +		super->header.type != KMEM_PERSIST_BLK) {
> +		dev_err("KMEM not formatted for blk, magic %lx type 
%d\n",
> +			super->header.magic, super->header.type);
> +		kunmap(dax_kmem_index_to_page(0, dev_dax));
> +		*persist_data = NULL;
> +		return -EINVAL;
> +	}
> +
> +	/* Validate superblock index page counts */
> +	if (super->num_index_entries <=
> +		super->num_index_pages * index_entries_per_page &&
> +		1 + super->num_index_pages + super->num_index_entries
> +		== num_pages) {
> +		dev_info(dev,
> +			"Found kmem_blk superblock num_index_entries 
%llu num_index_pages %llu num_pages %lu\n",
> +			super->num_index_entries,
> +			super->num_index_pages, num_pages);
> +	} else {
> +		dev_warn(dev,
> +			"Invalid kmem_blk superblock 
num_index_entries %llu num_index_pages %llu num_pages %lu\n",
> +			super->num_index_entries,
> +			super->num_index_pages, num_pages);
> +		kunmap(dax_kmem_index_to_page(0, dev_dax));
> +		*persist_data = NULL;
> +		return -EINVAL;
> +	}
> +
> +	data = kzalloc(struct_size(data, index_page, super-
>num_index_pages),
> +		GFP_KERNEL);
> +	if (!data) {
> +		kunmap(dax_kmem_index_to_page(0, dev_dax));
> +		*persist_data = NULL;
> +		return -ENOMEM;
> +	}
> +
> +	*persist_data = data;
> +	data->dev_dax = dev_dax;
> +	data->super = super;
> +	spin_lock_init(&data->index_lock);
> +
> +	for (i = 0; i < super->num_index_pages; i++)
> +		data->index_page[i] =
> +			kmap(dax_kmem_index_to_page(i + 1, dev_dax));

Nesting of kmap_local_page() is performed on a stack based fashion (LIFO), 
therefore the kunmap_local() calls must be invoked in reverse order. 

Any special problems with this?  

> +
> +	kmem_blk_free_unused(data);
> +
> +	kmem_blk_disk_init(data);
> +
> +	return 0;
> +}
> +
> +static int kmem_blk_cleanup(struct dev_dax *dev_dax, void *persist_data)
> +{
> +	struct kmem_blk_data *data = persist_data;
> +	unsigned long num_pages = dax_kmem_num_pages(dev_dax);
> +	unsigned long i;
> +
> +	if (data == NULL)
> +		return -1;
> +
> +	kmem_blk_disk_cleanup(data);
> +
> +	if (data->super == 0) {
> +		for (i = 0; i < num_pages; i++)
> +			__free_page(dax_kmem_index_to_page(i, 
dev_dax));
> +	} else {
> +		for (i = 0; i < data->super->num_index_entries; i++) {
> +			u64 page_num = data->index_page
> +				[i / index_entries_per_page]
> +				[i % index_entries_per_page];
> +			if (page_num != 0) {
> +				
__free_page(dax_kmem_index_to_page(page_num,
> +								
   dev_dax));
> +			}
> +		}
> +		for (i = 0; i < data->super->num_index_pages; i++) {
> +			struct page *page =
> +				dax_kmem_index_to_page(1 + i, 
dev_dax);
> +			data->index_page[i] = NULL;
> +			kunmap(page);
> +			__free_page(page);
> +		}
> +		data->super = NULL;
> +		kunmap(dax_kmem_index_to_page(0, dev_dax));
> +		__free_page(dax_kmem_index_to_page(0, dev_dax));
> +	}
> +	kfree(data);
> +	return 0;
> +}
> +
> +struct kmem_persist_ops kmem_persist_blk_ops = {
> +	.type = KMEM_PERSIST_BLK,
> +	.format = kmem_blk_format,
> +	.probe = kmem_blk_probe,
> +	.cleanup = kmem_blk_cleanup
> +};
> diff --git a/drivers/dax/kmem_persist.h b/drivers/dax/kmem_persist.h
> index dd651025f28c..0e0279feaa12 100644
> --- a/drivers/dax/kmem_persist.h
> +++ b/drivers/dax/kmem_persist.h
> @@ -10,6 +10,7 @@ struct dev_dax;
>  
>  enum kmem_persist_type {
>  	KMEM_PERSIST_NONE = 0,
> +	KMEM_PERSIST_BLK,
>  };
>  
>  
> @@ -40,4 +41,7 @@ unsigned long dax_kmem_num_pages(struct dev_dax 
*dev_dax);
>  struct page *dax_kmem_alloc_page(struct dev_dax *dev_dax,
>  				unsigned long *page_index);
>  
> +/* Defined in kmem_blk.c */
> +extern struct kmem_persist_ops kmem_persist_blk_ops;
> +
>  #endif
> -- 
> 2.30.2
> 

Thanks,

Fabio

P.S.: I'm Cc'ing Ira Weiny.




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 4/4] device-dax: Add a block device persistent type, BLK, for DAX KMEM
  2022-08-02 18:12 ` [RFC PATCH 4/4] device-dax: Add a block device persistent type, BLK, for DAX KMEM Srinivas Aji
  2022-08-03 20:00   ` kernel test robot
  2022-08-03 21:19   ` Fabio M. De Francesco
@ 2022-08-04 10:45   ` kernel test robot
  2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2022-08-04 10:45 UTC (permalink / raw)
  To: Srinivas Aji; +Cc: llvm, kbuild-all

Hi Srinivas,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on linus/master]
[also build test WARNING on v5.19 next-20220803]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Srinivas-Aji/Allow-persistent-data-on-DAX-device-being-used-as-KMEM/20220803-021320
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 0805c6fb39f66e01cb0adccfae8d9e0615c70fd7
config: s390-buildonly-randconfig-r005-20220804 (https://download.01.org/0day-ci/archive/20220804/202208041827.gXLXUdlo-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 26dd42705c2af0b8f6e5d6cdb32c9bd5ed9524eb)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install s390 cross compiling tool for clang build
        # apt-get install binutils-s390x-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/1d821cfc1cd5b2a7034ca77e84b59cf808b09a4f
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Srinivas-Aji/Allow-persistent-data-on-DAX-device-being-used-as-KMEM/20220803-021320
        git checkout 1d821cfc1cd5b2a7034ca77e84b59cf808b09a4f
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash drivers/dax/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/dax/kmem_blk.c:338:21: error: use of undeclared identifier 'QUEUE_FLAG_DISCARD'
           blk_queue_flag_set(QUEUE_FLAG_DISCARD, disk->queue);
                              ^
>> drivers/dax/kmem_blk.c:479:3: warning: pointer/integer type mismatch in conditional expression ('unsigned long' and 'void *') [-Wconditional-type-mismatch]
                   dev_err("KMEM not formatted for blk, magic %lx type %d\n",
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/dev_printk.h:144:2: note: expanded from macro 'dev_err'
           dev_printk_index_wrap(_dev_err, KERN_ERR, dev, dev_fmt(fmt), ##__VA_ARGS__)
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/dev_printk.h:109:3: note: expanded from macro 'dev_printk_index_wrap'
                   dev_printk_index_emit(level, fmt);                      \
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/dev_printk.h:105:2: note: expanded from macro 'dev_printk_index_emit'
           printk_index_subsys_emit("%s %s: ", level, fmt)
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/printk.h:431:2: note: expanded from macro 'printk_index_subsys_emit'
           __printk_index_emit(fmt, level, subsys_fmt_prefix)
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/printk.h:397:39: note: expanded from macro '__printk_index_emit'
                                   .fmt = __builtin_constant_p(_fmt) ? (_fmt) : NULL, \
                                                                     ^ ~~~~~~   ~~~~
   drivers/dax/kmem_blk.c:479:11: error: incompatible pointer types passing 'char[47]' to parameter of type 'const struct device *' [-Werror,-Wincompatible-pointer-types]
                   dev_err("KMEM not formatted for blk, magic %lx type %d\n",
                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/dev_printk.h:144:44: note: expanded from macro 'dev_err'
           dev_printk_index_wrap(_dev_err, KERN_ERR, dev, dev_fmt(fmt), ##__VA_ARGS__)
                                                     ^~~
   include/linux/dev_printk.h:110:11: note: expanded from macro 'dev_printk_index_wrap'
                   _p_func(dev, fmt, ##__VA_ARGS__);                       \
                           ^~~
   include/linux/dev_printk.h:50:36: note: passing argument to parameter 'dev' here
   void _dev_err(const struct device *dev, const char *fmt, ...);
                                      ^
   drivers/dax/kmem_blk.c:480:4: error: incompatible integer to pointer conversion passing 'unsigned long' to parameter of type 'const char *' [-Wint-conversion]
                           super->header.magic, super->header.type);
                           ^~~~~~~~~~~~~~~~~~~
   include/linux/dev_printk.h:144:57: note: expanded from macro 'dev_err'
           dev_printk_index_wrap(_dev_err, KERN_ERR, dev, dev_fmt(fmt), ##__VA_ARGS__)
                                                                  ^~~
   include/linux/dev_printk.h:19:22: note: expanded from macro 'dev_fmt'
   #define dev_fmt(fmt) fmt
                        ^~~
   include/linux/dev_printk.h:110:16: note: expanded from macro 'dev_printk_index_wrap'
                   _p_func(dev, fmt, ##__VA_ARGS__);                       \
                                ^~~
   include/linux/dev_printk.h:50:53: note: passing argument to parameter 'fmt' here
   void _dev_err(const struct device *dev, const char *fmt, ...);
                                                       ^
   1 warning and 3 errors generated.


vim +479 drivers/dax/kmem_blk.c

   459	
   460	static int kmem_blk_probe(struct dev_dax *dev_dax, void **persist_data)
   461	{
   462		struct device *dev = &dev_dax->dev;
   463		struct kmem_blk_super *super;
   464		unsigned long i;
   465		struct kmem_blk_data *data;
   466		unsigned long num_pages = dax_kmem_num_pages(dev_dax);
   467	
   468		if (num_pages == 0) {
   469			dev_err(dev, "Dax device for KMEM has no pages\n");
   470			*persist_data = NULL;
   471			return -1;
   472		}
   473	
   474		super = kmap(dax_kmem_index_to_page(0, dev_dax));
   475	
   476		/* Validate superblock magic and type */
   477		if (super->header.magic != kmem_persist_magic ||
   478			super->header.type != KMEM_PERSIST_BLK) {
 > 479			dev_err("KMEM not formatted for blk, magic %lx type %d\n",
   480				super->header.magic, super->header.type);
   481			kunmap(dax_kmem_index_to_page(0, dev_dax));
   482			*persist_data = NULL;
   483			return -EINVAL;
   484		}
   485	
   486		/* Validate superblock index page counts */
   487		if (super->num_index_entries <=
   488			super->num_index_pages * index_entries_per_page &&
   489			1 + super->num_index_pages + super->num_index_entries
   490			== num_pages) {
   491			dev_info(dev,
   492				"Found kmem_blk superblock num_index_entries %llu num_index_pages %llu num_pages %lu\n",
   493				super->num_index_entries,
   494				super->num_index_pages, num_pages);
   495		} else {
   496			dev_warn(dev,
   497				"Invalid kmem_blk superblock num_index_entries %llu num_index_pages %llu num_pages %lu\n",
   498				super->num_index_entries,
   499				super->num_index_pages, num_pages);
   500			kunmap(dax_kmem_index_to_page(0, dev_dax));
   501			*persist_data = NULL;
   502			return -EINVAL;
   503		}
   504	
   505		data = kzalloc(struct_size(data, index_page, super->num_index_pages),
   506			GFP_KERNEL);
   507		if (!data) {
   508			kunmap(dax_kmem_index_to_page(0, dev_dax));
   509			*persist_data = NULL;
   510			return -ENOMEM;
   511		}
   512	
   513		*persist_data = data;
   514		data->dev_dax = dev_dax;
   515		data->super = super;
   516		spin_lock_init(&data->index_lock);
   517	
   518		for (i = 0; i < super->num_index_pages; i++)
   519			data->index_page[i] =
   520				kmap(dax_kmem_index_to_page(i + 1, dev_dax));
   521	
   522		kmem_blk_free_unused(data);
   523	
   524		kmem_blk_disk_init(data);
   525	
   526		return 0;
   527	}
   528	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM
  2022-08-02 17:57 [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM Srinivas Aji
                   ` (4 preceding siblings ...)
  2022-08-02 18:12 ` [RFC PATCH 4/4] device-dax: Add a block device persistent type, BLK, for DAX KMEM Srinivas Aji
@ 2022-08-05 12:46 ` David Hildenbrand
  2022-08-08 21:21   ` Srinivas Aji
  5 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2022-08-05 12:46 UTC (permalink / raw)
  To: Srinivas Aji, linux-nvdimm, Linux MM
  Cc: Dan Williams, Vivek Goyal, David Woodhouse, Gowans, James,
	Yue Li, Beau Beauchamp

On 02.08.22 19:57, Srinivas Aji wrote:
> Linux supports adding a DAX driver managed memory region as system
> memory using the KMEM driver (from version 5.1). We would like to use
> a persistent addressable memory segment as system memory and
> simultaneously for storing some persistent data.
> 
> Motivation: It is already possible to partition an NVDIMM device for
> RAM and storage by creating separate regions on the device and using
> one of them with KMEM and another as fsdax. This patch set is a start
> to trying to get zero copy snapshots of processes which are using the
> DAX device as RAM. That requires dynamically sharing pages between
> process RAM and the storage within a single NVDIMM region.
> 
> To do this, we add a layer for handling the persistent data which does
> the following:
> 
> 1. When a DAX device is added as KMEM, mark all the memory as
>    allocated and pass it up to a module which is aware of the storage
>    layout.
> 
> 2. This module scans the memory, identifies the unused parts, and
>    frees those memory pages.
> 
> 3. Further memory from this device is allocated using the kernel
>    memory allocation API. The memory allocation API currently allows
>    the allocation to be limited only based on NUMA node. So this
>    feature works only when the DAX device used as KMEM is the only
>    memory from its NUMA node.
> 
> 4. Discarding of blocks previously used for persistent data results in
>    those blocks being freed to system memory.

Can you explain how "zero copy snapshots of processes" would work, both

a) From a user space POV
b) From a kernel-internal POV

Especially, what I get is that you have a filesystem on that memory
region, and all memory that is not used for filesystem blocks can be
used as ordinary system RAM (a little like shmem, but restricted to dax
memory regions?).

But how does this interact with zero-copy snapshots?

I feel like I am missing one piece where we really need system RAM as
part of the bigger picture. Hopefully it's not some hack that converts
system RAM to file system blocks :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM
  2022-08-05 12:46 ` [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM David Hildenbrand
@ 2022-08-08 21:21   ` Srinivas Aji
  2022-08-08 23:05     ` Dan Williams
  0 siblings, 1 reply; 13+ messages in thread
From: Srinivas Aji @ 2022-08-08 21:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux MM, Dan Williams, Vivek Goyal, David Woodhouse, Gowans,
	James, Yue Li, Beau Beauchamp

On Fri, Aug 05, 2022 at 02:46:26PM +0200, David Hildenbrand wrote:
> Can you explain how "zero copy snapshots of processes" would work, both
> 
> a) From a user space POV
> b) From a kernel-internal POV
> 
> Especially, what I get is that you have a filesystem on that memory
> region, and all memory that is not used for filesystem blocks can be
> used as ordinary system RAM (a little like shmem, but restricted to dax
> memory regions?).
> 
> But how does this interact with zero-copy snapshots?
> 
> I feel like I am missing one piece where we really need system RAM as
> part of the bigger picture. Hopefully it's not some hack that converts
> system RAM to file system blocks :)

My proposal probably falls into this category. The idea is that if we
have the persistent filesystem in the same space as system RAM, we
could make most of the process pages part of a snapshot file by
holding references to the these pages and making the pages
copy-on-write for the process, in about the same way a forked child
would. (I still don't have this piece fully worked out. May be there
are reasons why this won't work or will make something else difficult,
and that is why you are advising against it.)

Regarding the userspace and kernel POV:

The userspace operation would be that the process tries to save or
restore its pages using vmsplice(). In the kernel, this would be
implemented using a filesystem which shares pages with system RAM and
uses a zero-copy COW mechanism for those process pages which can be
shared with the filesystem.

I had earlier been thinking of having a different interface to the
kernel, which creates a file with only those memory pages which can be
saved using COW and also indicates to the caller which pages have
actually been saved.  But having a vmsplice implementation which does
COW as far as possible keeps the userspace process indicating the
desired function (saving or restoring memory pages) and the kernel
implementation handling the zero copy as an optimization where
possible.

Thanks,
Srinivas

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM
  2022-08-08 21:21   ` Srinivas Aji
@ 2022-08-08 23:05     ` Dan Williams
  0 siblings, 0 replies; 13+ messages in thread
From: Dan Williams @ 2022-08-08 23:05 UTC (permalink / raw)
  To: Srinivas Aji, David Hildenbrand
  Cc: Linux MM, Dan Williams, Vivek Goyal, David Woodhouse, Gowans,
	James, Yue Li, Beau Beauchamp

Srinivas Aji wrote:
> On Fri, Aug 05, 2022 at 02:46:26PM +0200, David Hildenbrand wrote:
> > Can you explain how "zero copy snapshots of processes" would work, both
> > 
> > a) From a user space POV
> > b) From a kernel-internal POV
> > 
> > Especially, what I get is that you have a filesystem on that memory
> > region, and all memory that is not used for filesystem blocks can be
> > used as ordinary system RAM (a little like shmem, but restricted to dax
> > memory regions?).
> > 
> > But how does this interact with zero-copy snapshots?
> > 
> > I feel like I am missing one piece where we really need system RAM as
> > part of the bigger picture. Hopefully it's not some hack that converts
> > system RAM to file system blocks :)
> 
> My proposal probably falls into this category. The idea is that if we
> have the persistent filesystem in the same space as system RAM, we
> could make most of the process pages part of a snapshot file by
> holding references to the these pages and making the pages
> copy-on-write for the process, in about the same way a forked child
> would. (I still don't have this piece fully worked out. May be there
> are reasons why this won't work or will make something else difficult,
> and that is why you are advising against it.)

If I understand the proposal correctly I think you eventually run into
situations similar to what killed RDMA+FSDAX support. The filesystem
needs to be the ultimate arbiter of the physical address space and this
solution seems to want to put part of that control in an agent outside
of the filesystem.

> Regarding the userspace and kernel POV:
> 
> The userspace operation would be that the process tries to save or
> restore its pages using vmsplice(). In the kernel, this would be
> implemented using a filesystem which shares pages with system RAM and
> uses a zero-copy COW mechanism for those process pages which can be
> shared with the filesystem.
> 
> I had earlier been thinking of having a different interface to the
> kernel, which creates a file with only those memory pages which can be
> saved using COW and also indicates to the caller which pages have
> actually been saved.  But having a vmsplice implementation which does
> COW as far as possible keeps the userspace process indicating the
> desired function (saving or restoring memory pages) and the kernel
> implementation handling the zero copy as an optimization where
> possible.

While my initial reaction to hearing about this proposal back at LSF
indeed made it sound like an extension to FSDAX semantics, now I am not
so sure. This requirement you state, "...we have to get the blocks
through the memory allocation API, at an offset not under our control"
makes me feel like this is a new memory management facility where the
application thinks it is getting page allocations serviced via the
typical malloc+mempolicy APIs, but another agent is positioned to trap
and service those requests.

Correct me if I am wrong, but is the end goal similar to what an
application in a VM experiences when that VM's memory is backed by a
file mappping on the VMM side? I.e. the application is accessing a
virtual NUMA node, but the faults into physical address space are
trapped and serviced by the VMM. If that is the case then the solution
starts look more like NUMA "namespacing" than a block-device + file
interface. In other words a rough (I mean rough) strawman like:

   numactlX --remap=3,/dev/dax0.0 --membind=3 $application

Where memory allocation and refault requests can be trapped by that
modified numactl. As far as the application is concerned its memory
policy is set to allocate from NUMA node 3, and those page allocation
requests are routed to numactlX via userfaultfd-like mechanics to map
pages out of /dev/dax0.0 (or any other file for that mattter). Snap
shotting would be achieved by telling numactlX to CoW all of the pages
that it currently has mapped while the snapshot is taken.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-08-08 23:06 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-02 17:57 [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM Srinivas Aji
2022-08-02 18:02 ` [RFC PATCH 1/4] mm/memory_hotplug: Add MHP_ALLOCATE flag which treats hotplugged memory as allocated Srinivas Aji
2022-08-02 18:03 ` [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM David Hildenbrand
2022-08-02 18:53   ` Srinivas Aji
2022-08-02 18:07 ` [RFC PATCH 2/4] device-dax: Add framework for keeping persistent data in DAX KMEM Srinivas Aji
2022-08-02 18:10 ` [RFC PATCH 3/4] device-dax: Add a NONE type for DAX KMEM persistence Srinivas Aji
2022-08-02 18:12 ` [RFC PATCH 4/4] device-dax: Add a block device persistent type, BLK, for DAX KMEM Srinivas Aji
2022-08-03 20:00   ` kernel test robot
2022-08-03 21:19   ` Fabio M. De Francesco
2022-08-04 10:45   ` kernel test robot
2022-08-05 12:46 ` [RFC PATCH 0/4] Allow persistent data on DAX device being used as KMEM David Hildenbrand
2022-08-08 21:21   ` Srinivas Aji
2022-08-08 23:05     ` Dan Williams

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.