All of lore.kernel.org
 help / color / mirror / Atom feed
* another pmem variant
@ 2015-03-25 16:04 Christoph Hellwig
  2015-03-25 16:04 ` [PATCH 1/3] pmem: Initial version of persistent memory driver Christoph Hellwig
                   ` (5 more replies)
  0 siblings, 6 replies; 35+ messages in thread
From: Christoph Hellwig @ 2015-03-25 16:04 UTC (permalink / raw)
  To: linux-nvdimm, linux-fsdevel, linux-kernel, x86; +Cc: ross.zwisler, boaz, axboe

Here is another version of the same trivial pmem driver, because two
obviously aren't enough.  The first patch is the same pmem driver
that Ross posted a short time ago, just modified to use platform_devices
to find the persistant memory region instead of hardconding it in the
Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
but still allow auto-discovery.

The other two patches are a heavily rewritten version of the code that
Intel gave to various storage vendors to discover the type 12 (and earlier
type 6) nvdimms, which I massaged into a form that is hopefully suitable
for mainline.

Note that pmem.c really is the minimal version as I think we need something
included ASAP.  We'll eventually need to be able to do other I/O from and
to it, and as most people know everyone has their own preferre method to
do it, which I'd like to discuss once we have the basic driver in.

This has been tested both with a real NVDIMM on a system with a type 12
capable bios, as well as with "fake persistent" memory using the memmap=
option.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-03-25 16:04 another pmem variant Christoph Hellwig
@ 2015-03-25 16:04 ` Christoph Hellwig
  2015-03-25 20:19   ` Paul Bolle
  2015-03-25 20:21   ` Ross Zwisler
  2015-03-25 16:04 ` [PATCH 2/3] x86: add a is_e820_ram() helper Christoph Hellwig
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 35+ messages in thread
From: Christoph Hellwig @ 2015-03-25 16:04 UTC (permalink / raw)
  To: linux-nvdimm, linux-fsdevel, linux-kernel, x86; +Cc: ross.zwisler, boaz, axboe

From: Ross Zwisler <ross.zwisler@linux.intel.com>

PMEM is a new driver that presents a reserved range of memory as a
block device.  This is useful for developing with NV-DIMMs, and
can be used with volatile memory as a development platform.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
[hch: convert to use a platform_device for discovery, fix partition
 support]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 MAINTAINERS            |   6 +
 drivers/block/Kconfig  |  13 ++
 drivers/block/Makefile |   1 +
 drivers/block/pmem.c   | 370 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 390 insertions(+)
 create mode 100644 drivers/block/pmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 358eb01..efacf2b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8063,6 +8063,12 @@ S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c
 
+PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+S:	Supported
+F:	drivers/block/pmem.c
+
 RANDOM NUMBER DRIVER
 M:	"Theodore Ts'o" <tytso@mit.edu>
 S:	Maintained
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 1b8094d..9284aaf 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -404,6 +404,19 @@ config BLK_DEV_RAM_DAX
 	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
+config BLK_DEV_PMEM
+	tristate "Persistent memory block device support"
+	help
+	  Saying Y here will allow you to use a contiguous range of reserved
+	  memory as one or more block devices.  Memory for PMEM should be
+	  reserved using the "memmap" kernel parameter.
+
+	  To compile this driver as a module, choose M here: the module will be
+	  called pmem.
+
+	  Most normal users won't need this functionality, and can thus say N
+	  here.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 02b688d..9cc6c18 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_PS3_VRAM)		+= ps3vram.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
+obj-$(CONFIG_BLK_DEV_PMEM)	+= pmem.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
new file mode 100644
index 0000000..a4df2d5
--- /dev/null
+++ b/drivers/block/pmem.c
@@ -0,0 +1,370 @@
+/*
+ * Persistent Memory Driver
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * This driver is heavily based on drivers/block/brd.c.
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+
+#include <asm/cacheflush.h>
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/init.h>
+#include <linux/platform_device.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/slab.h>
+
+#define SECTOR_SHIFT		9
+#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
+#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
+
+#define PMEM_MINORS		16
+
+struct pmem_device {
+	struct request_queue	*pmem_queue;
+	struct gendisk		*pmem_disk;
+
+	/* One contiguous memory region per device */
+	phys_addr_t		phys_addr;
+	void			*virt_addr;
+	size_t			size;
+};
+
+static int pmem_major;
+static atomic_t pmem_index;
+
+static int pmem_getgeo(struct block_device *bd, struct hd_geometry *geo)
+{
+	/* some standard values */
+	geo->heads = 1 << 6;
+	geo->sectors = 1 << 5;
+	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
+	return 0;
+}
+
+/*
+ * direct translation from (pmem,sector) => void*
+ * We do not require that sector be page aligned.
+ * The return value will point to the beginning of the page containing the
+ * given sector, not to the sector itself.
+ */
+static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector)
+{
+	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
+	size_t offset = page_offset << PAGE_SHIFT;
+
+	BUG_ON(offset >= pmem->size);
+	return pmem->virt_addr + offset;
+}
+
+/* sector must be page aligned */
+static unsigned long pmem_lookup_pfn(struct pmem_device *pmem, sector_t sector)
+{
+	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
+
+	BUG_ON(sector & (PAGE_SECTORS - 1));
+	return (pmem->phys_addr >> PAGE_SHIFT) + page_offset;
+}
+
+/*
+ * sector is not required to be page aligned.
+ * n is at most a single page, but could be less.
+ */
+static void copy_to_pmem(struct pmem_device *pmem, const void *src,
+			sector_t sector, size_t n)
+{
+	void *dst;
+	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	BUG_ON(n > PAGE_SIZE);
+
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	dst = pmem_lookup_pg_addr(pmem, sector);
+	memcpy(dst + offset, src, copy);
+
+	if (copy < n) {
+		src += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		dst = pmem_lookup_pg_addr(pmem, sector);
+		memcpy(dst, src, copy);
+	}
+}
+
+/*
+ * sector is not required to be page aligned.
+ * n is at most a single page, but could be less.
+ */
+static void copy_from_pmem(void *dst, struct pmem_device *pmem,
+			  sector_t sector, size_t n)
+{
+	void *src;
+	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	BUG_ON(n > PAGE_SIZE);
+
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	src = pmem_lookup_pg_addr(pmem, sector);
+
+	memcpy(dst, src + offset, copy);
+
+	if (copy < n) {
+		dst += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		src = pmem_lookup_pg_addr(pmem, sector);
+		memcpy(dst, src, copy);
+	}
+}
+
+static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
+			unsigned int len, unsigned int off, int rw,
+			sector_t sector)
+{
+	void *mem = kmap_atomic(page);
+
+	if (rw == READ) {
+		copy_from_pmem(mem + off, pmem, sector, len);
+		flush_dcache_page(page);
+	} else {
+		/*
+		 * FIXME: Need more involved flushing to ensure that writes to
+		 * NVDIMMs are actually durable before returning.
+		 */
+		flush_dcache_page(page);
+		copy_to_pmem(pmem, mem + off, sector, len);
+	}
+
+	kunmap_atomic(mem);
+}
+
+static void pmem_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+	int rw;
+	struct bio_vec bvec;
+	sector_t sector;
+	struct bvec_iter iter;
+	int err = 0;
+
+	sector = bio->bi_iter.bi_sector;
+	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_rw(bio);
+	if (rw == READA)
+		rw = READ;
+
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+
+		BUG_ON(len > PAGE_SIZE);
+		pmem_do_bvec(pmem, bvec.bv_page, len,
+			    bvec.bv_offset, rw, sector);
+		sector += len >> SECTOR_SHIFT;
+	}
+
+out:
+	bio_endio(bio, err);
+}
+
+static int pmem_rw_page(struct block_device *bdev, sector_t sector,
+		       struct page *page, int rw)
+{
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+
+	pmem_do_bvec(pmem, page, PAGE_CACHE_SIZE, 0, rw, sector);
+	page_endio(page, rw & WRITE, 0);
+	return 0;
+}
+
+static long pmem_direct_access(struct block_device *bdev, sector_t sector,
+			      void **kaddr, unsigned long *pfn, long size)
+{
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+
+	if (!pmem)
+		return -ENODEV;
+
+	*kaddr = pmem_lookup_pg_addr(pmem, sector);
+	*pfn = pmem_lookup_pfn(pmem, sector);
+
+	return pmem->size - (sector * 512);
+}
+
+static const struct block_device_operations pmem_fops = {
+	.owner =		THIS_MODULE,
+	.rw_page =		pmem_rw_page,
+	.direct_access =	pmem_direct_access,
+	.getgeo =		pmem_getgeo,
+};
+
+/* pmem->phys_addr and pmem->size need to be set.
+ * Will then set virt_addr if successful.
+ */
+static int pmem_mapmem(struct pmem_device *pmem)
+{
+	struct resource *res_mem;
+	int err;
+
+	res_mem = request_mem_region_exclusive(pmem->phys_addr, pmem->size,
+					       "pmem");
+	if (!res_mem) {
+		pr_warn("pmem: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n",
+			   pmem->phys_addr, pmem->size);
+		return -EINVAL;
+	}
+
+	pmem->virt_addr = ioremap_cache(pmem->phys_addr, pmem->size);
+	if (unlikely(!pmem->virt_addr)) {
+		err = -ENXIO;
+		goto out_release;
+	}
+	return 0;
+
+out_release:
+	release_mem_region(pmem->phys_addr, pmem->size);
+	return err;
+}
+
+static void pmem_unmapmem(struct pmem_device *pmem)
+{
+	if (unlikely(!pmem->virt_addr))
+		return;
+
+	iounmap(pmem->virt_addr);
+	release_mem_region(pmem->phys_addr, pmem->size);
+	pmem->virt_addr = NULL;
+}
+
+static int pmem_probe(struct platform_device *pdev)
+{
+	struct pmem_device *pmem;
+	struct gendisk *disk;
+	struct resource *res;
+	int idx, err;
+
+	if (WARN_ON(pdev->num_resources > 1))
+		return -ENXIO;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (!res)
+		return -ENXIO;
+
+	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
+	if (unlikely(!pmem))
+		return -ENOMEM;
+
+	pmem->phys_addr = res->start;
+	pmem->size = resource_size(res);
+
+	err = pmem_mapmem(pmem);
+	if (unlikely(err))
+		goto out_free_dev;
+
+	err = -ENOMEM;
+	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
+	if (unlikely(!pmem->pmem_queue))
+		goto out_unmap;
+
+	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
+	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
+	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+
+	disk = alloc_disk(PMEM_MINORS);
+	if (unlikely(!disk))
+		goto out_free_queue;
+
+	idx = atomic_inc_return(&pmem_index) - 1;
+
+	disk->major		= pmem_major;
+	disk->first_minor	= PMEM_MINORS * idx;
+	disk->fops		= &pmem_fops;
+	disk->private_data	= pmem;
+	disk->queue		= pmem->pmem_queue;
+	disk->flags		= GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "pmem%d", idx);
+	disk->driverfs_dev = &pdev->dev;
+	set_capacity(disk, pmem->size >> SECTOR_SHIFT);
+	pmem->pmem_disk = disk;
+
+	add_disk(disk);
+
+	platform_set_drvdata(pdev, pmem);
+	return 0;
+
+out_free_queue:
+	blk_cleanup_queue(pmem->pmem_queue);
+out_unmap:
+	pmem_unmapmem(pmem);
+out_free_dev:
+	kfree(pmem);
+out:
+	return err;
+}
+
+static int pmem_remove(struct platform_device *pdev)
+{
+	struct pmem_device *pmem = platform_get_drvdata(pdev);
+
+	del_gendisk(pmem->pmem_disk);
+	put_disk(pmem->pmem_disk);
+	blk_cleanup_queue(pmem->pmem_queue);
+	pmem_unmapmem(pmem);
+	kfree(pmem);
+
+	return 0;
+}
+
+static struct platform_driver pmem_driver = {
+	.probe		= pmem_probe,
+	.remove		= pmem_remove,
+	.driver		= {
+		.owner	= THIS_MODULE,
+		.name	= "pmem",
+	},
+};
+
+static int __init pmem_init(void)
+{
+	int error;
+
+	pmem_major = register_blkdev(0, "pmem");
+	if (pmem_major < 0)
+		return pmem_major;
+
+	error = platform_driver_register(&pmem_driver);
+	if (error)
+		unregister_blkdev(pmem_major, "pmem");
+	return error;
+}
+
+static void pmem_exit(void)
+{
+	platform_driver_unregister(&pmem_driver);
+	unregister_blkdev(pmem_major, "pmem");
+}
+
+MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
+MODULE_LICENSE("GPL");
+
+module_init(pmem_init);
+module_exit(pmem_exit);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-25 16:04 another pmem variant Christoph Hellwig
  2015-03-25 16:04 ` [PATCH 1/3] pmem: Initial version of persistent memory driver Christoph Hellwig
@ 2015-03-25 16:04 ` Christoph Hellwig
  2015-03-26  2:15   ` [Linux-nvdimm] " Dan Williams
  2015-03-25 16:04 ` [PATCH 3/3] x86: add support for the non-standard protected e820 type Christoph Hellwig
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2015-03-25 16:04 UTC (permalink / raw)
  To: linux-nvdimm, linux-fsdevel, linux-kernel, x86; +Cc: ross.zwisler, boaz, axboe

This will allow to deal with persistent memory which needs to be
treated like ram in many, but not all cases.

Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 arch/x86/kernel/e820.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 46201de..de088e3 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -48,6 +48,11 @@ unsigned long pci_mem_start = 0xaeedbabe;
 EXPORT_SYMBOL(pci_mem_start);
 #endif
 
+static inline bool is_e820_ram(__u32 type)
+{
+	return type == E820_RAM;
+}
+
 /*
  * This function checks if any part of the range <start,end> is mapped
  * with type.
@@ -688,7 +693,7 @@ void __init e820_mark_nosave_regions(unsigned long limit_pfn)
 			register_nosave_region(pfn, PFN_UP(ei->addr));
 
 		pfn = PFN_DOWN(ei->addr + ei->size);
-		if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
+		if (!is_e820_ram(ei->type) && ei->type != E820_RESERVED_KERN)
 			register_nosave_region(PFN_UP(ei->addr), pfn);
 
 		if (pfn >= limit_pfn)
@@ -748,7 +753,7 @@ u64 __init early_reserve_e820(u64 size, u64 align)
 /*
  * Find the highest page frame number we have available
  */
-static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
+static unsigned long __init e820_end_pfn(unsigned long limit_pfn)
 {
 	int i;
 	unsigned long last_pfn = 0;
@@ -759,7 +764,7 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
 		unsigned long start_pfn;
 		unsigned long end_pfn;
 
-		if (ei->type != type)
+		if (!is_e820_ram(ei->type))
 			continue;
 
 		start_pfn = ei->addr >> PAGE_SHIFT;
@@ -784,12 +789,12 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
 }
 unsigned long __init e820_end_of_ram_pfn(void)
 {
-	return e820_end_pfn(MAX_ARCH_PFN, E820_RAM);
+	return e820_end_pfn(MAX_ARCH_PFN);
 }
 
 unsigned long __init e820_end_of_low_ram_pfn(void)
 {
-	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT), E820_RAM);
+	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT));
 }
 
 static void early_panic(char *msg)
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/3] x86: add support for the non-standard protected e820 type
  2015-03-25 16:04 another pmem variant Christoph Hellwig
  2015-03-25 16:04 ` [PATCH 1/3] pmem: Initial version of persistent memory driver Christoph Hellwig
  2015-03-25 16:04 ` [PATCH 2/3] x86: add a is_e820_ram() helper Christoph Hellwig
@ 2015-03-25 16:04 ` Christoph Hellwig
  2015-03-25 19:47   ` Elliott, Robert (Server Storage)
                     ` (3 more replies)
  2015-03-25 16:33 ` [Linux-nvdimm] another pmem variant Dan Williams
                   ` (2 subsequent siblings)
  5 siblings, 4 replies; 35+ messages in thread
From: Christoph Hellwig @ 2015-03-25 16:04 UTC (permalink / raw)
  To: linux-nvdimm, linux-fsdevel, linux-kernel, x86; +Cc: ross.zwisler, boaz, axboe

Various recent bioses support NVDIMMs or ADR using a non-standard
e820 memory type, and Intel supplied reference Linux code using this
type to various vendors.

Wire this e820 table type up to export platform devices for the pmem
driver so that we can use it in Linux, and also provide a memmap=
argument to manually tag memory as protected, which can be used
if the bios doesn't use the standard nonstandard interface, or
we just want to test the pmem driver with regular memory.

Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/kernel-parameters.txt |  6 ++++
 arch/x86/Kconfig                    | 13 +++++++
 arch/x86/include/asm/setup.h        |  6 ++++
 arch/x86/include/uapi/asm/e820.h    | 10 ++++++
 arch/x86/kernel/Makefile            |  1 +
 arch/x86/kernel/e820.c              | 22 +++++++++++-
 arch/x86/kernel/pmem.c              | 70 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/setup.c             |  2 ++
 8 files changed, 129 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/pmem.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bfcb1a6..98eeaca 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1965,6 +1965,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			         or
 			         memmap=0x10000$0x18690000
 
+	memmap=nn[KMG]!ss[KMG]
+			[KNL,X86] Mark specific memory as protected.
+			Region of memory to be used, from ss to ss+nn.
+			The memory region may be marked as type 12 and
+			is NVDIMM or ADR memory.
+
 	memory_corruption_check=0/1 [X86]
 			Some BIOSes seem to corrupt the first 64k of
 			memory when doing things like suspend/resume.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca..93a27e4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1430,6 +1430,19 @@ config ILLEGAL_POINTER_VALUE
 
 source "mm/Kconfig"
 
+config X86_PMEM_LEGACY
+	bool "Support non-stanard NVDIMMs and ADR protected memory"
+	help
+	  Treat memory marked using the non-stard e820 type of 12 as used
+	  by the Intel Sandy Bridge-EP reference BIOS as protected memory.
+	  The kernel will the offer these regions to the pmem driver so
+	  they can be used for persistent storage.
+
+	  If you say N the kernel will treat the ADR region like an e820
+	  reserved region.
+
+	  Say Y if unsure
+
 config HIGHPTE
 	bool "Allocate 3rd-level pagetables from highmem"
 	depends on HIGHMEM
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index ff4e7b2..2352fde 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -57,6 +57,12 @@ extern void x86_ce4100_early_setup(void);
 static inline void x86_ce4100_early_setup(void) { }
 #endif
 
+#ifdef CONFIG_X86_PMEM_LEGACY
+void reserve_pmem(void);
+#else
+static inline void reserve_pmem(void) { }
+#endif
+
 #ifndef _SETUP
 
 #include <asm/espfix.h>
diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
index d993e33..e040950 100644
--- a/arch/x86/include/uapi/asm/e820.h
+++ b/arch/x86/include/uapi/asm/e820.h
@@ -33,6 +33,16 @@
 #define E820_NVS	4
 #define E820_UNUSABLE	5
 
+/*
+ * This is a non-standardized way to represent ADR or NVDIMM regions that
+ * persist over a reboot.  The kernel will ignore their special capabilities
+ * unless the CONFIG_X86_PMEM_LEGACY option is set.
+ *
+ * Note that older platforms also used 6 for the same type of memory,
+ * but newer versions switched to 12 as 6 was assigned differently.  Some
+ * time they will learn..
+ */
+#define E820_PROTECTED_KERN	12
 
 /*
  * reserved RAM used by kernel itself
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index cdb1b70..971f18c 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -94,6 +94,7 @@ obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
+obj-$(CONFIG_X86_PMEM_LEGACY)	+= pmem.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
 
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index de088e3..8c6a976 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -48,10 +48,22 @@ unsigned long pci_mem_start = 0xaeedbabe;
 EXPORT_SYMBOL(pci_mem_start);
 #endif
 
+/*
+ * Memory protected by the system ADR (asynchronous dram refresh)
+ * mechanism is accounted as ram for purposes of establishing max_pfn
+ * and mem_map.
+ */
+#ifdef CONFIG_X86_PMEM_LEGACY
+static inline bool is_e820_ram(__u32 type)
+{
+	return type == E820_RAM || type == E820_PROTECTED_KERN;
+}
+#else
 static inline bool is_e820_ram(__u32 type)
 {
 	return type == E820_RAM;
 }
+#endif
 
 /*
  * This function checks if any part of the range <start,end> is mapped
@@ -154,6 +166,9 @@ static void __init e820_print_type(u32 type)
 	case E820_UNUSABLE:
 		printk(KERN_CONT "unusable");
 		break;
+	case E820_PROTECTED_KERN:
+		printk(KERN_CONT "protected (type %u)\n", type);
+		break;
 	default:
 		printk(KERN_CONT "type %u", type);
 		break;
@@ -871,6 +886,9 @@ static int __init parse_memmap_one(char *p)
 	} else if (*p == '$') {
 		start_at = memparse(p+1, &p);
 		e820_add_region(start_at, mem_size, E820_RESERVED);
+	} else if (*p == '!') {
+		start_at = memparse(p+1, &p);
+		e820_add_region(start_at, mem_size, E820_PROTECTED_KERN);
 	} else
 		e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
 
@@ -912,6 +930,7 @@ static inline const char *e820_type_to_string(int e820_type)
 	case E820_ACPI:	return "ACPI Tables";
 	case E820_NVS:	return "ACPI Non-volatile Storage";
 	case E820_UNUSABLE:	return "Unusable memory";
+	case E820_PROTECTED_KERN: return "Protected RAM";
 	default:	return "reserved";
 	}
 }
@@ -946,7 +965,8 @@ void __init e820_reserve_resources(void)
 		 * pcibios_resource_survey()
 		 */
 		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
-			res->flags |= IORESOURCE_BUSY;
+			if (e820.map[i].type != E820_PROTECTED_KERN)
+				res->flags |= IORESOURCE_BUSY;
 			insert_resource(&iomem_resource, res);
 		}
 		res++;
diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
new file mode 100644
index 0000000..902e2f9
--- /dev/null
+++ b/arch/x86/kernel/pmem.c
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2009, Intel Corporation.
+ * Copyright (c) 2015, Christoph Hellwig.
+ */
+#include <linux/memblock.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+#include <asm/e820.h>
+#include <asm/page_types.h>
+#include <asm/setup.h>
+
+void __init reserve_pmem(void)
+{
+	int i;
+
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+
+		if (ei->type != E820_PROTECTED_KERN)
+			continue;
+
+		memblock_reserve(ei->addr, ei->addr + ei->size);
+		max_pfn_mapped = init_memory_mapping(
+				ei->addr < 1UL << 32 ? 1UL << 32 : ei->addr,
+				ei->addr + ei->size);
+	}
+}
+		
+static __init void register_pmem_device(struct resource *res)
+{
+	struct platform_device *pdev;
+	int error;
+
+	pdev = platform_device_alloc("pmem", PLATFORM_DEVID_AUTO);
+	if (!pdev)
+		return;
+
+	error = platform_device_add_resources(pdev, res, 1);
+	if (error)
+		goto out_put_pdev;
+
+	error = platform_device_add(pdev);
+	if (error)
+		goto out_put_pdev;
+	return;
+out_put_pdev:
+	dev_warn(&pdev->dev, "failed to add pmem device!\n");
+	platform_device_put(pdev);
+}
+
+static __init int register_pmem_devices(void)
+{
+	int i;
+
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+
+		if (ei->type == E820_PROTECTED_KERN) {
+			struct resource res = {
+				.flags	= IORESOURCE_MEM,
+				.start	= ei->addr,
+				.end	= ei->addr + ei->size - 1,
+			};
+			register_pmem_device(&res);
+		}
+	}
+
+	return 0;
+}
+device_initcall(register_pmem_devices);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 0a2421c..f2bed2b 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1158,6 +1158,8 @@ void __init setup_arch(char **cmdline_p)
 
 	early_acpi_boot_init();
 
+	reserve_pmem();
+
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] another pmem variant
  2015-03-25 16:04 another pmem variant Christoph Hellwig
                   ` (2 preceding siblings ...)
  2015-03-25 16:04 ` [PATCH 3/3] x86: add support for the non-standard protected e820 type Christoph Hellwig
@ 2015-03-25 16:33 ` Dan Williams
  2015-03-25 16:44   ` Christoph Hellwig
  2015-03-25 18:09   ` Brooks, Adam J
  2015-03-25 21:02 ` Ross Zwisler
  5 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2015-03-25 16:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, X86 ML, Jens Axboe

On Wed, Mar 25, 2015 at 9:04 AM, Christoph Hellwig <hch@lst.de> wrote:
> Here is another version of the same trivial pmem driver, because two
> obviously aren't enough.

Welcome to the party! :-)

> The first patch is the same pmem driver
> that Ross posted a short time ago, just modified to use platform_devices
> to find the persistant memory region instead of hardconding it in the
> Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
> but still allow auto-discovery.

This is mostly ok and does not collide too much with the upcoming ACPI
mechanism for this stuff.  I do worry that the new
"memmap=nn[KMG]!ss[KMG]" kernel command line option will only be
relevant for at most one kernel cycle given the imminent publication
of the spec that unblocks our release.

Our planned solution to the "legacy pmem" problem is to have a
userspace utility craft a list of address ranges in the form that ACPI
expects and attach that to a platform device (one time setup).  It
only requires that the memory be marked reserved, not necessarily
marked type-12.

> The other two patches are a heavily rewritten version of the code that
> Intel gave to various storage vendors to discover the type 12 (and earlier
> type 6) nvdimms, which I massaged into a form that is hopefully suitable
> for mainline.

I'd prefer E820_PMEM over E820_PROTECTED_KERN, I don't know why I
chose that name initially, but to each his own bike shed.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] another pmem variant
  2015-03-25 16:33 ` [Linux-nvdimm] another pmem variant Dan Williams
@ 2015-03-25 16:44   ` Christoph Hellwig
  2015-03-25 17:00     ` Dan Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2015-03-25 16:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	X86 ML, Jens Axboe

On Wed, Mar 25, 2015 at 09:33:52AM -0700, Dan Williams wrote:
> This is mostly ok and does not collide too much with the upcoming ACPI
> mechanism for this stuff.  I do worry that the new
> "memmap=nn[KMG]!ss[KMG]" kernel command line option will only be
> relevant for at most one kernel cycle given the imminent publication
> of the spec that unblocks our release.

I don't think we can just get rid of it as legacy systems won't be
upgraded to the new discovery mechanism.  Or do you mean you plan to
introduce a better override on the command line?  In that case speak
up now!

> Our planned solution to the "legacy pmem" problem is to have a
> userspace utility craft a list of address ranges in the form that ACPI
> expects and attach that to a platform device (one time setup).  It
> only requires that the memory be marked reserved, not necessarily
> marked type-12.

I can't see any benefit of that over just doign the right thing in
kernel space.

> > The other two patches are a heavily rewritten version of the code that
> > Intel gave to various storage vendors to discover the type 12 (and earlier
> > type 6) nvdimms, which I massaged into a form that is hopefully suitable
> > for mainline.
> 
> I'd prefer E820_PMEM over E820_PROTECTED_KERN, I don't know why I
> chose that name initially, but to each his own bike shed.

Sounds fine to me.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] another pmem variant
  2015-03-25 16:44   ` Christoph Hellwig
@ 2015-03-25 17:00     ` Dan Williams
  2015-03-25 17:04       ` Christoph Hellwig
  2015-04-13  9:01       ` Greg KH
  0 siblings, 2 replies; 35+ messages in thread
From: Dan Williams @ 2015-03-25 17:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, X86 ML, Jens Axboe

On Wed, Mar 25, 2015 at 9:44 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Mar 25, 2015 at 09:33:52AM -0700, Dan Williams wrote:
>> This is mostly ok and does not collide too much with the upcoming ACPI
>> mechanism for this stuff.  I do worry that the new
>> "memmap=nn[KMG]!ss[KMG]" kernel command line option will only be
>> relevant for at most one kernel cycle given the imminent publication
>> of the spec that unblocks our release.
>
> I don't think we can just get rid of it as legacy systems won't be
> upgraded to the new discovery mechanism.  Or do you mean you plan to
> introduce a better override on the command line?  In that case speak
> up now!

The kernel command line would simply be the standard/existing memmap=
to reserve a memory range.  Then, when the platform device loads, it
does a request_firmware() to inject a binary table that further carves
memory into ranges to which the pmem driver attaches.  No need for the
legacy system BIOS to be upgraded to the "new way".

>> Our planned solution to the "legacy pmem" problem is to have a
>> userspace utility craft a list of address ranges in the form that ACPI
>> expects and attach that to a platform device (one time setup).  It
>> only requires that the memory be marked reserved, not necessarily
>> marked type-12.
>
> I can't see any benefit of that over just doign the right thing in
> kernel space.

It does do the right thing in kernel space.  The userspace utility
creates the binary table (once) that can be compiled into the platform
device driver or auto-loaded by an initrd.  The problem with a new
memmap= is that it is too coarse.  For example you can't do things
like specify a pmem range per-NUMA node.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] another pmem variant
  2015-03-25 17:00     ` Dan Williams
@ 2015-03-25 17:04       ` Christoph Hellwig
  2015-03-25 17:18         ` Dan Williams
  2015-04-13  9:01       ` Greg KH
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2015-03-25 17:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	X86 ML, Jens Axboe

On Wed, Mar 25, 2015 at 10:00:26AM -0700, Dan Williams wrote:
> The kernel command line would simply be the standard/existing memmap=
> to reserve a memory range.  Then, when the platform device loads, it
> does a request_firmware() to inject a binary table that further carves
> memory into ranges to which the pmem driver attaches.  No need for the
> legacy system BIOS to be upgraded to the "new way".

Ewww...

> It does do the right thing in kernel space.  The userspace utility
> creates the binary table (once) that can be compiled into the platform
> device driver or auto-loaded by an initrd.  The problem with a new
> memmap= is that it is too coarse.  For example you can't do things
> like specify a pmem range per-NUMA node.

Sure you can as long as you know the layout.  memmap= can be specified
multiple times.   Again, I see absolutely zero benefit of doing crap
like request_firmware() to convert interface, and I'm also tired of
having this talk about code that will eventually be released and should
be superior (and from all that I can guess so far will actually be far
worse).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] another pmem variant
  2015-03-25 17:04       ` Christoph Hellwig
@ 2015-03-25 17:18         ` Dan Williams
  0 siblings, 0 replies; 35+ messages in thread
From: Dan Williams @ 2015-03-25 17:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, X86 ML, Jens Axboe

On Wed, Mar 25, 2015 at 10:04 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Mar 25, 2015 at 10:00:26AM -0700, Dan Williams wrote:
>> The kernel command line would simply be the standard/existing memmap=
>> to reserve a memory range.  Then, when the platform device loads, it
>> does a request_firmware() to inject a binary table that further carves
>> memory into ranges to which the pmem driver attaches.  No need for the
>> legacy system BIOS to be upgraded to the "new way".
>
> Ewww...
>
>> It does do the right thing in kernel space.  The userspace utility
>> creates the binary table (once) that can be compiled into the platform
>> device driver or auto-loaded by an initrd.  The problem with a new
>> memmap= is that it is too coarse.  For example you can't do things
>> like specify a pmem range per-NUMA node.
>
> Sure you can as long as you know the layout.  memmap= can be specified
> multiple times.   Again, I see absolutely zero benefit of doing crap
> like request_firmware() to convert interface, and I'm also tired of
> having this talk about code that will eventually be released and should
> be superior (and from all that I can guess so far will actually be far
> worse).

You and me both...

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [Linux-nvdimm] another pmem variant
  2015-03-25 16:04 another pmem variant Christoph Hellwig
@ 2015-03-25 18:09   ` Brooks, Adam J
  2015-03-25 16:04 ` [PATCH 2/3] x86: add a is_e820_ram() helper Christoph Hellwig
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 35+ messages in thread
From: Brooks, Adam J @ 2015-03-25 18:09 UTC (permalink / raw)
  To: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel, x86
  Cc: axboe, Boaz Harrosh

>The other two patches are a heavily rewritten version of the code that
>Intel gave to various storage vendors to discover the type 12 (and earlier
>type 6) nvdimms, which I massaged into a form that is hopefully suitable
>for mainline.

The problem is that the e820 or the UEFI Memory Map Table on their own are really bad ways to represent NVDIMMs.  The memory table idea was originally developed 6 years ago prior to NVDIMMs existing.  It was used to define traditional battery backed memory.  With traditional battery backed memory either the whole region was going to be valid or the whole region was going to be gone.  There was also no concept of arming.  You simply have x hours of data retention based on your battery be y% charged.  Fast forward a couple years, and we continued using the memory table method for something called Copy To Flash where the CPU would copy memory from the DIMMs to a SSD of some sort.  Again this was a whole region or none of the region solution and because we were typically using SATA SSD there was no need to "arm" anything.  Additionally the restore operation (and even the save operation if you were brave enough) could be done from the OS.  Therefore there was no need for the BIOS to pass up any status regarding if the recovery was successful or not.

Fast forward again to the present day and NVDIMMs.  We used the memory table model initially for NVDIMM because 1) the BIOS code was already in place 2) we had a non-upstreamed driver (something that predated pmem by several years called ADRBD).  In a perfect world where there are no hardware failures e820+ADRBD work great for NVDIMMs.  However in the real world where there are failures it has a number of short comings.  Mainly there are the following issues with it:
1) The region may now be comprised for 2+ different NVDIMMs that have different statuses. A subset of NVDIMMs may have failed the restore.  An NVDIMM may have been added since after the last save/restore of the existing NVDIMM
2) Just based on the e820 table, the OS has no one of knowing where the boundaries of the NVDIMMs are.  It has no one of knowing if they are all interleaved together where a failure of single NVDIMM means the loss of the whole region, or if the NVDIMMs are non-interleaved and can be treated as separate memory regions to prevent the failure of one NVDIMM from causing data to be lost form all NVDIMM
2) Due to the requirement to restore the MRS/RC registers the NVDIMM restore must be done from the BIOS.  Depending on the security settings of the platform the OS may not be able to directly interrogate the individual NVDIMMs to find their status.  Even if the OS can get to the NVDIMM over SMBUS all information about the status of the last restore attempt may have been wiped if the BIOS was also configured to do the erase/arm operation

For those reasons (and more) simply using the current memory tables is not a good solution. A more detailed NVDIMM specific table is required to surface the status and configuration of the NVDIMMs.  Unfortunately that table has been perpetually delayed, and a result people are trying to move forward with Type 12.  I understand why this has been done, and for highly embedded storage appliances it is fine, because those users probably inherently know the configuration of the NVDIMMs.  However for general purpose systems where the user has no way of knowing the exact configuration of the DIMMS, just using the e820 or UEFI Memory Map table is not sufficient.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [Linux-nvdimm] another pmem variant
@ 2015-03-25 18:09   ` Brooks, Adam J
  0 siblings, 0 replies; 35+ messages in thread
From: Brooks, Adam J @ 2015-03-25 18:09 UTC (permalink / raw)
  To: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel, x86
  Cc: axboe, Boaz Harrosh

>The other two patches are a heavily rewritten version of the code that
>Intel gave to various storage vendors to discover the type 12 (and earlier
>type 6) nvdimms, which I massaged into a form that is hopefully suitable
>for mainline.

The problem is that the e820 or the UEFI Memory Map Table on their own are really bad ways to represent NVDIMMs.  The memory table idea was originally developed 6 years ago prior to NVDIMMs existing.  It was used to define traditional battery backed memory.  With traditional battery backed memory either the whole region was going to be valid or the whole region was going to be gone.  There was also no concept of arming.  You simply have x hours of data retention based on your battery be y% charged.  Fast forward a couple years, and we continued using the memory table method for something called Copy To Flash where the CPU would copy memory from the DIMMs to a SSD of some sort.  Again this was a whole region or none of the region solution and because we were typically using SATA SSD there w
 as no need to "arm" anything.  Additionally the restore operation (and even the save operation if you were brave enough) could be done from the OS.  Therefore there was no need for the BIOS to pass up any status regarding if the recovery was successful or not.

Fast forward again to the present day and NVDIMMs.  We used the memory table model initially for NVDIMM because 1) the BIOS code was already in place 2) we had a non-upstreamed driver (something that predated pmem by several years called ADRBD).  In a perfect world where there are no hardware failures e820+ADRBD work great for NVDIMMs.  However in the real world where there are failures it has a number of short comings.  Mainly there are the following issues with it:
1) The region may now be comprised for 2+ different NVDIMMs that have different statuses. A subset of NVDIMMs may have failed the restore.  An NVDIMM may have been added since after the last save/restore of the existing NVDIMM
2) Just based on the e820 table, the OS has no one of knowing where the boundaries of the NVDIMMs are.  It has no one of knowing if they are all interleaved together where a failure of single NVDIMM means the loss of the whole region, or if the NVDIMMs are non-interleaved and can be treated as separate memory regions to prevent the failure of one NVDIMM from causing data to be lost form all NVDIMM
2) Due to the requirement to restore the MRS/RC registers the NVDIMM restore must be done from the BIOS.  Depending on the security settings of the platform the OS may not be able to directly interrogate the individual NVDIMMs to find their status.  Even if the OS can get to the NVDIMM over SMBUS all information about the status of the last restore attempt may have been wiped if the BIOS was also configured to do the erase/arm operation

For those reasons (and more) simply using the current memory tables is not a good solution. A more detailed NVDIMM specific table is required to surface the status and configuration of the NVDIMMs.  Unfortunately that table has been perpetually delayed, and a result people are trying to move forward with Type 12.  I understand why this has been done, and for highly embedded storage appliances it is fine, because those users probably inherently know the configuration of the NVDIMMs.  However for general purpose systems where the user has no way of knowing the exact configuration of the DIMMS, just using the e820 or UEFI Memory Map table is not sufficient.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] another pmem variant
  2015-03-25 18:09   ` Brooks, Adam J
  (?)
@ 2015-03-25 18:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2015-03-25 18:46 UTC (permalink / raw)
  To: Brooks, Adam J
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, axboe, Boaz Harrosh

Hi Adam,

we're all well aware of the deficits of the current interface, but for now
that's all we have, and due to lead times in implementing bioses it will
be all we have for quite a while.

We're all eagerly looking forward to better interfaces and bioses that will
support them.

But for now evryone would love to just be able to use existing systems with
an out of the box Linux kernel.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH 3/3] x86: add support for the non-standard protected e820 type
  2015-03-25 16:04 ` [PATCH 3/3] x86: add support for the non-standard protected e820 type Christoph Hellwig
@ 2015-03-25 19:47   ` Elliott, Robert (Server Storage)
  2015-03-26  8:02     ` Christoph Hellwig
  2015-03-25 20:23   ` Ross Zwisler
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 35+ messages in thread
From: Elliott, Robert (Server Storage) @ 2015-03-25 19:47 UTC (permalink / raw)
  To: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel, x86
  Cc: ross.zwisler, boaz, axboe

A few editorial nits follow...

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Wednesday, March 25, 2015 11:04 AM
> Subject: [PATCH 3/3] x86: add support for the non-standard protected e820
> type
> 
> Various recent bioses support NVDIMMs or ADR using a non-standard
> e820 memory type, and Intel supplied reference Linux code using this
> type to various vendors.

If this goes into the kernel, I think someone should request that the
ACPI specification mark the value 12 as permanently tainted.  Otherwise
they could assign it to some new meaning that conflicts with all
of this.

A few editorial nits follow...

> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-
> parameters.txt
> index bfcb1a6..98eeaca 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1965,6 +1965,12 @@ bytes respectively. Such letter suffixes can also be
> entirely omitted.
>  			         or
>  			         memmap=0x10000$0x18690000
> 
> +	memmap=nn[KMG]!ss[KMG]
> +			[KNL,X86] Mark specific memory as protected.
> +			Region of memory to be used, from ss to ss+nn.
> +			The memory region may be marked as type 12 and
> +			is NVDIMM or ADR memory.

It can be confusing that E820h type values differ from UEFI 
memory map type values, so it might be worth emphasizing that is 
an E820h type value.

Showing hex alongside would also clarify that it is indeed a 
decimal 12.

Suggestion: "e820 type 12 (0xc)"

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b7d31ca..93a27e4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1430,6 +1430,19 @@ config ILLEGAL_POINTER_VALUE
> 
>  source "mm/Kconfig"
> 
> +config X86_PMEM_LEGACY
> +	bool "Support non-stanard NVDIMMs and ADR protected memory"

stanard s/b standard


> +	help
> +	  Treat memory marked using the non-stard e820 type of 12 as used

stard s/b standard

> +	  by the Intel Sandy Bridge-EP reference BIOS as protected memory.
> +	  The kernel will the offer these regions to the pmem driver so

the s/b then

> +	  they can be used for persistent storage.
> +
> +	  If you say N the kernel will treat the ADR region like an e820
> +	  reserved region.
> +
> +	  Say Y if unsure
> +

...
> diff --git a/arch/x86/include/uapi/asm/e820.h
> b/arch/x86/include/uapi/asm/e820.h
> index d993e33..e040950 100644
> --- a/arch/x86/include/uapi/asm/e820.h
> +++ b/arch/x86/include/uapi/asm/e820.h
> @@ -33,6 +33,16 @@
>  #define E820_NVS	4
>  #define E820_UNUSABLE	5
> 
> +/*
> + * This is a non-standardized way to represent ADR or NVDIMM regions that
> + * persist over a reboot.  The kernel will ignore their special
> capabilities
> + * unless the CONFIG_X86_PMEM_LEGACY option is set.
> + *
> + * Note that older platforms also used 6 for the same type of memory,
> + * but newer versions switched to 12 as 6 was assigned differently.  Some
> + * time they will learn..
> + */
> +#define E820_PROTECTED_KERN	12

The p in pmem means persistent, not protected.  To me, protected sounds 
like a security feature.  I suggest using a different macro name and 
text strings.

...
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index de088e3..8c6a976 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -48,10 +48,22 @@ unsigned long pci_mem_start = 0xaeedbabe;
>  EXPORT_SYMBOL(pci_mem_start);
>  #endif
> 
> +/*
> + * Memory protected by the system ADR (asynchronous dram refresh)
> + * mechanism is accounted as ram for purposes of establishing max_pfn
> + * and mem_map.
> + */

ADR doesn't really protect or ensure persistence; it just puts memory 
into self-refresh mode. Batteries/capacitors and other logic is what
provides the persistence.

...
> @@ -154,6 +166,9 @@ static void __init e820_print_type(u32 type)
>  	case E820_UNUSABLE:
>  		printk(KERN_CONT "unusable");
>  		break;
> +	case E820_PROTECTED_KERN:
> +		printk(KERN_CONT "protected (type %u)\n", type);
> +		break;

Same "protect" comment applies there, and a few other places in
the patch not excerpted.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-03-25 16:04 ` [PATCH 1/3] pmem: Initial version of persistent memory driver Christoph Hellwig
@ 2015-03-25 20:19   ` Paul Bolle
  2015-03-25 20:26     ` Ross Zwisler
  2015-03-25 20:21   ` Ross Zwisler
  1 sibling, 1 reply; 35+ messages in thread
From: Paul Bolle @ 2015-03-25 20:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, ross.zwisler,
	boaz, axboe

The same license nit I found in the previous two versions of this patch.

On Wed, 2015-03-25 at 17:04 +0100, Christoph Hellwig wrote:
> --- /dev/null
> +++ b/drivers/block/pmem.c

> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.

This states the license is GPL v2.

> +MODULE_LICENSE("GPL");

So you probably want to use
    MODULE_LICENSE("GPL v2");

here.


Paul Bolle


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-03-25 16:04 ` [PATCH 1/3] pmem: Initial version of persistent memory driver Christoph Hellwig
  2015-03-25 20:19   ` Paul Bolle
@ 2015-03-25 20:21   ` Ross Zwisler
  2015-03-26  8:06     ` Christoph Hellwig
  1 sibling, 1 reply; 35+ messages in thread
From: Ross Zwisler @ 2015-03-25 20:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, boaz, axboe

On Wed, 2015-03-25 at 17:04 +0100, Christoph Hellwig wrote:
> From: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> PMEM is a new driver that presents a reserved range of memory as a
> block device.  This is useful for developing with NV-DIMMs, and
> can be used with volatile memory as a development platform.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> [hch: convert to use a platform_device for discovery, fix partition
>  support]

Overall I really like this approach.  It makes things simpler, removes
unneeded code and most importantly removes the ability for the user to have a
configuration where the PMEM / memmap reservation via the command line don't
match the parameters given to pmem.

What needed to be fixed with the partition support?  I used to have real
numbers for first_minor and passed into alloc_disk(), but simplified it based
on code found in this commit in the nvme driver:

469071a37afc NVMe: Dynamically allocate partition numbers

This has worked fine for me - is there some test case in which it breaks?

> +static int pmem_probe(struct platform_device *pdev)
> +{
> +	struct pmem_device *pmem;
> +	struct gendisk *disk;
> +	struct resource *res;
> +	int idx, err;
> +
> +	if (WARN_ON(pdev->num_resources > 1))
> +		return -ENXIO;
> +
> +	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> +	if (!res)
> +		return -ENXIO;
> +
> +	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
> +	if (unlikely(!pmem))
> +		return -ENOMEM;
> +
> +	pmem->phys_addr = res->start;
> +	pmem->size = resource_size(res);
> +
> +	err = pmem_mapmem(pmem);
> +	if (unlikely(err))
> +		goto out_free_dev;
> +
> +	err = -ENOMEM;
> +	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
> +	if (unlikely(!pmem->pmem_queue))
> +		goto out_unmap;
> +
> +	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
> +	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
> +	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
> +
> +	disk = alloc_disk(PMEM_MINORS);
> +	if (unlikely(!disk))
> +		goto out_free_queue;
> +
> +	idx = atomic_inc_return(&pmem_index) - 1;
> +
> +	disk->major		= pmem_major;
> +	disk->first_minor	= PMEM_MINORS * idx;
> +	disk->fops		= &pmem_fops;
> +	disk->private_data	= pmem;
> +	disk->queue		= pmem->pmem_queue;
> +	disk->flags		= GENHD_FL_EXT_DEVT;
> +	sprintf(disk->disk_name, "pmem%d", idx);
> +	disk->driverfs_dev = &pdev->dev;
> +	set_capacity(disk, pmem->size >> SECTOR_SHIFT);
> +	pmem->pmem_disk = disk;
> +
> +	add_disk(disk);
> +
> +	platform_set_drvdata(pdev, pmem);
> +	return 0;
> +
> +out_free_queue:
> +	blk_cleanup_queue(pmem->pmem_queue);
> +out_unmap:
> +	pmem_unmapmem(pmem);
> +out_free_dev:
> +	kfree(pmem);
> +out:

This label is no longer used, and can be removed.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] x86: add support for the non-standard protected e820 type
  2015-03-25 16:04 ` [PATCH 3/3] x86: add support for the non-standard protected e820 type Christoph Hellwig
  2015-03-25 19:47   ` Elliott, Robert (Server Storage)
@ 2015-03-25 20:23   ` Ross Zwisler
  2015-03-25 20:29     ` [Linux-nvdimm] " Dan Williams
  2015-03-25 20:25   ` Ross Zwisler
  2015-03-25 20:35   ` [Linux-nvdimm] " Dan Williams
  3 siblings, 1 reply; 35+ messages in thread
From: Ross Zwisler @ 2015-03-25 20:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, boaz, axboe

On Wed, 2015-03-25 at 17:04 +0100, Christoph Hellwig wrote:
> Various recent bioses support NVDIMMs or ADR using a non-standard
> e820 memory type, and Intel supplied reference Linux code using this
> type to various vendors.
> 
> Wire this e820 table type up to export platform devices for the pmem
> driver so that we can use it in Linux, and also provide a memmap=
> argument to manually tag memory as protected, which can be used
> if the bios doesn't use the standard nonstandard interface, or
> we just want to test the pmem driver with regular memory.
> 
> Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

<snip>

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b7d31ca..93a27e4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1430,6 +1430,19 @@ config ILLEGAL_POINTER_VALUE
>  
>  source "mm/Kconfig"
>  
> +config X86_PMEM_LEGACY
> +	bool "Support non-stanard NVDIMMs and ADR protected memory"
> +	help
> +	  Treat memory marked using the non-stard e820 type of 12 as used
> +	  by the Intel Sandy Bridge-EP reference BIOS as protected memory.
> +	  The kernel will the offer these regions to the pmem driver so
> +	  they can be used for persistent storage.
> +
> +	  If you say N the kernel will treat the ADR region like an e820
> +	  reserved region.
> +
> +	  Say Y if unsure

Would it make sense to have this default to "y", or is that too strong?



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] x86: add support for the non-standard protected e820 type
  2015-03-25 16:04 ` [PATCH 3/3] x86: add support for the non-standard protected e820 type Christoph Hellwig
  2015-03-25 19:47   ` Elliott, Robert (Server Storage)
  2015-03-25 20:23   ` Ross Zwisler
@ 2015-03-25 20:25   ` Ross Zwisler
  2015-03-26  8:03     ` Christoph Hellwig
  2015-03-25 20:35   ` [Linux-nvdimm] " Dan Williams
  3 siblings, 1 reply; 35+ messages in thread
From: Ross Zwisler @ 2015-03-25 20:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, boaz, axboe

On Wed, 2015-03-25 at 17:04 +0100, Christoph Hellwig wrote:
> Various recent bioses support NVDIMMs or ADR using a non-standard
> e820 memory type, and Intel supplied reference Linux code using this
> type to various vendors.
> 
> Wire this e820 table type up to export platform devices for the pmem
> driver so that we can use it in Linux, and also provide a memmap=
> argument to manually tag memory as protected, which can be used
> if the bios doesn't use the standard nonstandard interface, or
> we just want to test the pmem driver with regular memory.

<snip>

> @@ -154,6 +166,9 @@ static void __init e820_print_type(u32 type)
>  	case E820_UNUSABLE:
>  		printk(KERN_CONT "unusable");
>  		break;
> +	case E820_PROTECTED_KERN:
> +		printk(KERN_CONT "protected (type %u)\n", type);

I don't think we want a newline in this string.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-03-25 20:19   ` Paul Bolle
@ 2015-03-25 20:26     ` Ross Zwisler
  2015-03-26  8:04       ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Ross Zwisler @ 2015-03-25 20:26 UTC (permalink / raw)
  To: Paul Bolle
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, boaz, axboe

On Wed, 2015-03-25 at 21:19 +0100, Paul Bolle wrote:
> The same license nit I found in the previous two versions of this patch.
> 
> On Wed, 2015-03-25 at 17:04 +0100, Christoph Hellwig wrote:
> > --- /dev/null
> > +++ b/drivers/block/pmem.c
> 
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> > + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> > + * more details.
> 
> This states the license is GPL v2.
> 
> > +MODULE_LICENSE("GPL");
> 
> So you probably want to use
>     MODULE_LICENSE("GPL v2");
> 
> here.

Cool - yep, feel free to update this if you want in the next version of
your series, Christoph.

- Ross


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] [PATCH 3/3] x86: add support for the non-standard protected e820 type
  2015-03-25 20:23   ` Ross Zwisler
@ 2015-03-25 20:29     ` Dan Williams
  0 siblings, 0 replies; 35+ messages in thread
From: Dan Williams @ 2015-03-25 20:29 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Christoph Hellwig, Jens Axboe, linux-nvdimm, X86 ML,
	linux-kernel, linux-fsdevel

On Wed, Mar 25, 2015 at 1:23 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Wed, 2015-03-25 at 17:04 +0100, Christoph Hellwig wrote:
>> Various recent bioses support NVDIMMs or ADR using a non-standard
>> e820 memory type, and Intel supplied reference Linux code using this
>> type to various vendors.
>>
>> Wire this e820 table type up to export platform devices for the pmem
>> driver so that we can use it in Linux, and also provide a memmap=
>> argument to manually tag memory as protected, which can be used
>> if the bios doesn't use the standard nonstandard interface, or
>> we just want to test the pmem driver with regular memory.
>>
>> Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>
> <snip>
>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index b7d31ca..93a27e4 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1430,6 +1430,19 @@ config ILLEGAL_POINTER_VALUE
>>
>>  source "mm/Kconfig"
>>
>> +config X86_PMEM_LEGACY
>> +     bool "Support non-stanard NVDIMMs and ADR protected memory"
>> +     help
>> +       Treat memory marked using the non-stard e820 type of 12 as used
>> +       by the Intel Sandy Bridge-EP reference BIOS as protected memory.
>> +       The kernel will the offer these regions to the pmem driver so
>> +       they can be used for persistent storage.
>> +
>> +       If you say N the kernel will treat the ADR region like an e820
>> +       reserved region.
>> +
>> +       Say Y if unsure
>
> Would it make sense to have this default to "y", or is that too strong?

We never default new enabling to y.  Maybe some exceptions, but this
isn't one of them in my mind.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] [PATCH 3/3] x86: add support for the non-standard protected e820 type
  2015-03-25 16:04 ` [PATCH 3/3] x86: add support for the non-standard protected e820 type Christoph Hellwig
                     ` (2 preceding siblings ...)
  2015-03-25 20:25   ` Ross Zwisler
@ 2015-03-25 20:35   ` Dan Williams
  3 siblings, 0 replies; 35+ messages in thread
From: Dan Williams @ 2015-03-25 20:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, X86 ML, Jens Axboe

On Wed, Mar 25, 2015 at 9:04 AM, Christoph Hellwig <hch@lst.de> wrote:
> Various recent bioses support NVDIMMs or ADR using a non-standard
> e820 memory type, and Intel supplied reference Linux code using this
> type to various vendors.
>
> Wire this e820 table type up to export platform devices for the pmem
> driver so that we can use it in Linux, and also provide a memmap=
> argument to manually tag memory as protected, which can be used
> if the bios doesn't use the standard nonstandard interface, or
> we just want to test the pmem driver with regular memory.
>
> Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
[..]
> +static __init int register_pmem_devices(void)
> +{
> +       int i;
> +
> +       for (i = 0; i < e820.nr_map; i++) {
> +               struct e820entry *ei = &e820.map[i];
> +
> +               if (ei->type == E820_PROTECTED_KERN) {
> +                       struct resource res = {
> +                               .flags  = IORESOURCE_MEM,
> +                               .start  = ei->addr,
> +                               .end    = ei->addr + ei->size - 1,
> +                       };
> +                       register_pmem_device(&res);
> +               }
> +       }
> +
> +       return 0;
> +}

Aside from the s/E820_PROTECTED_KERN/E820_PMEM/ suggestion this looks
ok to me.  The "vaporware" new way can be a superset of this
mechanism.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: another pmem variant
  2015-03-25 16:04 another pmem variant Christoph Hellwig
                   ` (4 preceding siblings ...)
  2015-03-25 18:09   ` Brooks, Adam J
@ 2015-03-25 21:02 ` Ross Zwisler
  5 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2015-03-25 21:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, boaz, axboe

On Wed, 2015-03-25 at 17:04 +0100, Christoph Hellwig wrote:
> Here is another version of the same trivial pmem driver, because two
> obviously aren't enough.  The first patch is the same pmem driver
> that Ross posted a short time ago, just modified to use platform_devices
> to find the persistant memory region instead of hardconding it in the
> Kconfig.  This allows to keep pmem.c separate from any discovery mechanism,
> but still allow auto-discovery.
> 
> The other two patches are a heavily rewritten version of the code that
> Intel gave to various storage vendors to discover the type 12 (and earlier
> type 6) nvdimms, which I massaged into a form that is hopefully suitable
> for mainline.
> 
> Note that pmem.c really is the minimal version as I think we need something
> included ASAP.  We'll eventually need to be able to do other I/O from and
> to it, and as most people know everyone has their own preferre method to
> do it, which I'd like to discuss once we have the basic driver in.
> 
> This has been tested both with a real NVDIMM on a system with a type 12
> capable bios, as well as with "fake persistent" memory using the memmap=
> option.
> 

For the series:
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-25 16:04 ` [PATCH 2/3] x86: add a is_e820_ram() helper Christoph Hellwig
@ 2015-03-26  2:15   ` Dan Williams
  2015-03-26  8:01     ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2015-03-26  2:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, X86 ML, Jens Axboe

On Wed, Mar 25, 2015 at 9:04 AM, Christoph Hellwig <hch@lst.de> wrote:
> This will allow to deal with persistent memory which needs to be
> treated like ram in many, but not all cases.

Random thought, type-12 memory happens to correspond to "legacy"
NVDIMM systems with smaller capacities.  Perhaps "new NVDIMM" should
not be is_e820_ram() by default?

> Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>.

...which was based on an earlier patch by me, its been nearly 4 years
to come full circle.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26  2:15   ` [Linux-nvdimm] " Dan Williams
@ 2015-03-26  8:01     ` Christoph Hellwig
  2015-03-26 13:57       ` Dan Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2015-03-26  8:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	X86 ML, Jens Axboe

On Wed, Mar 25, 2015 at 07:15:42PM -0700, Dan Williams wrote:
> Random thought, type-12 memory happens to correspond to "legacy"
> NVDIMM systems with smaller capacities.  Perhaps "new NVDIMM" should
> not be is_e820_ram() by default?

Let's look into that once we can see the spec..

> > Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>.
> 
> ...which was based on an earlier patch by me, its been nearly 4 years
> to come full circle.

That's the attribution in the patch I have access to.  I can add you
to the credits if you want.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] x86: add support for the non-standard protected e820 type
  2015-03-25 19:47   ` Elliott, Robert (Server Storage)
@ 2015-03-26  8:02     ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2015-03-26  8:02 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, ross.zwisler, boaz, axboe

On Wed, Mar 25, 2015 at 07:47:26PM +0000, Elliott, Robert (Server Storage) wrote:
> If this goes into the kernel, I think someone should request that the
> ACPI specification mark the value 12 as permanently tainted.  Otherwise
> they could assign it to some new meaning that conflicts with all
> of this.

I think reusing it now would create huge problems, but I have no idea
to how to even talk to the people writing the ACPI spec.

> It can be confusing that E820h type values differ from UEFI 
> memory map type values, so it might be worth emphasizing that is 
> an E820h type value.
> 
> Showing hex alongside would also clarify that it is indeed a 
> decimal 12.
> 
> Suggestion: "e820 type 12 (0xc)"

I've fixed this as well as the various typos.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] x86: add support for the non-standard protected e820 type
  2015-03-25 20:25   ` Ross Zwisler
@ 2015-03-26  8:03     ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2015-03-26  8:03 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, boaz, axboe

On Wed, Mar 25, 2015 at 02:25:33PM -0600, Ross Zwisler wrote:
> > +	case E820_PROTECTED_KERN:
> > +		printk(KERN_CONT "protected (type %u)\n", type);
> 
> I don't think we want a newline in this string.

Fixed.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-03-25 20:26     ` Ross Zwisler
@ 2015-03-26  8:04       ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2015-03-26  8:04 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Paul Bolle, Christoph Hellwig, linux-nvdimm, linux-fsdevel,
	linux-kernel, x86, boaz, axboe

On Wed, Mar 25, 2015 at 02:26:52PM -0600, Ross Zwisler wrote:
> Cool - yep, feel free to update this if you want in the next version of
> your series, Christoph.

Ok, I've fixed this up.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-03-25 20:21   ` Ross Zwisler
@ 2015-03-26  8:06     ` Christoph Hellwig
  2015-05-04 16:43         ` Ross Zwisler
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2015-03-26  8:06 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, boaz, axboe

On Wed, Mar 25, 2015 at 02:21:53PM -0600, Ross Zwisler wrote:
> What needed to be fixed with the partition support?  I used to have real
> numbers for first_minor and passed into alloc_disk(), but simplified it based
> on code found in this commit in the nvme driver:
> 
> 469071a37afc NVMe: Dynamically allocate partition numbers
> 
> This has worked fine for me - is there some test case in which it breaks?

Yes, if CONFIG_DEBUG_BLOCK_EXT_DEVT isn't set that code doesn't work at all.

> > +out_unmap:
> > +	pmem_unmapmem(pmem);
> > +out_free_dev:
> > +	kfree(pmem);
> > +out:
> 
> This label is no longer used, and can be removed.

Ok, fixed.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26  8:01     ` Christoph Hellwig
@ 2015-03-26 13:57       ` Dan Williams
  2015-03-26 14:32         ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Dan Williams @ 2015-03-26 13:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, X86 ML, Jens Axboe

On Thu, Mar 26, 2015 at 1:01 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Mar 25, 2015 at 07:15:42PM -0700, Dan Williams wrote:
>> Random thought, type-12 memory happens to correspond to "legacy"
>> NVDIMM systems with smaller capacities.  Perhaps "new NVDIMM" should
>> not be is_e820_ram() by default?
>
> Let's look into that once we can see the spec..
>
>> > Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>.
>>
>> ...which was based on an earlier patch by me, its been nearly 4 years
>> to come full circle.
>
> That's the attribution in the patch I have access to.  I can add you
> to the credits if you want.

Yes, please attribute Dave and myself.

...and for the series: Acked-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] [PATCH 2/3] x86: add a is_e820_ram() helper
  2015-03-26 13:57       ` Dan Williams
@ 2015-03-26 14:32         ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2015-03-26 14:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	X86 ML, Jens Axboe

On Thu, Mar 26, 2015 at 06:57:56AM -0700, Dan Williams wrote:
> Yes, please attribute Dave and myself.
> 
> ...and for the series: Acked-by: Dan Williams <dan.j.williams@intel.com>

Ok, I will resend the series with the updated attribution.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] another pmem variant
  2015-03-25 17:00     ` Dan Williams
  2015-03-25 17:04       ` Christoph Hellwig
@ 2015-04-13  9:01       ` Greg KH
  2015-04-13 16:02         ` Dan Williams
  1 sibling, 1 reply; 35+ messages in thread
From: Greg KH @ 2015-04-13  9:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	X86 ML, Jens Axboe

On Wed, Mar 25, 2015 at 10:00:26AM -0700, Dan Williams wrote:
> On Wed, Mar 25, 2015 at 9:44 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Wed, Mar 25, 2015 at 09:33:52AM -0700, Dan Williams wrote:
> >> This is mostly ok and does not collide too much with the upcoming ACPI
> >> mechanism for this stuff.  I do worry that the new
> >> "memmap=nn[KMG]!ss[KMG]" kernel command line option will only be
> >> relevant for at most one kernel cycle given the imminent publication
> >> of the spec that unblocks our release.
> >
> > I don't think we can just get rid of it as legacy systems won't be
> > upgraded to the new discovery mechanism.  Or do you mean you plan to
> > introduce a better override on the command line?  In that case speak
> > up now!
> 
> The kernel command line would simply be the standard/existing memmap=
> to reserve a memory range.  Then, when the platform device loads, it
> does a request_firmware() to inject a binary table that further carves
> memory into ranges to which the pmem driver attaches.  No need for the
> legacy system BIOS to be upgraded to the "new way".

Um, what parses that "binary table"?  The kernel better not be doing
that, as that's not what the firmware interface is for.  The firmware
interface is for "pass through" only directly to hardware.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Linux-nvdimm] another pmem variant
  2015-04-13  9:01       ` Greg KH
@ 2015-04-13 16:02         ` Dan Williams
  0 siblings, 0 replies; 35+ messages in thread
From: Dan Williams @ 2015-04-13 16:02 UTC (permalink / raw)
  To: Greg KH
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	X86 ML, Jens Axboe

On Mon, Apr 13, 2015 at 2:01 AM, Greg KH <gregkh@linuxfoundation.org> wrote:
> On Wed, Mar 25, 2015 at 10:00:26AM -0700, Dan Williams wrote:
>> On Wed, Mar 25, 2015 at 9:44 AM, Christoph Hellwig <hch@lst.de> wrote:
>> > On Wed, Mar 25, 2015 at 09:33:52AM -0700, Dan Williams wrote:
>> >> This is mostly ok and does not collide too much with the upcoming ACPI
>> >> mechanism for this stuff.  I do worry that the new
>> >> "memmap=nn[KMG]!ss[KMG]" kernel command line option will only be
>> >> relevant for at most one kernel cycle given the imminent publication
>> >> of the spec that unblocks our release.
>> >
>> > I don't think we can just get rid of it as legacy systems won't be
>> > upgraded to the new discovery mechanism.  Or do you mean you plan to
>> > introduce a better override on the command line?  In that case speak
>> > up now!
>>
>> The kernel command line would simply be the standard/existing memmap=
>> to reserve a memory range.  Then, when the platform device loads, it
>> does a request_firmware() to inject a binary table that further carves
>> memory into ranges to which the pmem driver attaches.  No need for the
>> legacy system BIOS to be upgraded to the "new way".
>
> Um, what parses that "binary table"?  The kernel better not be doing
> that, as that's not what the firmware interface is for.  The firmware
> interface is for "pass through" only directly to hardware.

I had been using it as a generic/device-model-integrated way to do
what amounts to ACPI table injection [1].  But, now that the new
memmap= command line is upstream, most of the benefits of this
approach are moot and no longer outweigh the downsides [2].  Consider
it tabled.


[1]: https://01.org/linux-acpi/documentation/overriding-dsdt
[2]: http://marc.info/?l=linux-netdev&m=135793331325647&w=2

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-03-26  8:06     ` Christoph Hellwig
@ 2015-05-04 16:43         ` Ross Zwisler
  0 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2015-05-04 16:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, boaz, axboe

On Thu, 2015-03-26 at 09:06 +0100, Christoph Hellwig wrote:
> On Wed, Mar 25, 2015 at 02:21:53PM -0600, Ross Zwisler wrote:
> > What needed to be fixed with the partition support?  I used to have real
> > numbers for first_minor and passed into alloc_disk(), but simplified it based
> > on code found in this commit in the nvme driver:
> > 
> > 469071a37afc NVMe: Dynamically allocate partition numbers
> > 
> > This has worked fine for me - is there some test case in which it breaks?
> 
> Yes, if CONFIG_DEBUG_BLOCK_EXT_DEVT isn't set that code doesn't work at all.

I can't figure out a use case that breaks when using dynamically allocated
minors without CONFIG_DEBUG_BLOCK_EXT_DEVT.  The patch that I've been testing
against is at the bottom of this mail.

Here are the minors that I get when creating a bunch of partitions using the
current code with PMEM_MINORS=16, with CONFIG_DEBUG_BLOCK_EXT_DEVT turned off:

pmem0      249:0    0 63.5G  0 rom  
├─pmem0p1  249:1    0    1G  0 part 
├─pmem0p2  249:2    0    1G  0 part 
├─pmem0p3  249:3    0    1G  0 part 
├─pmem0p4  249:4    0    1G  0 part 
├─pmem0p5  249:5    0    1G  0 part 
├─pmem0p6  249:6    0    1G  0 part 
├─pmem0p7  249:7    0    1G  0 part 
├─pmem0p8  249:8    0    1G  0 part 
├─pmem0p9  249:9    0    1G  0 part 
├─pmem0p10 249:10   0    1G  0 part 
├─pmem0p11 249:11   0    1G  0 part 
├─pmem0p12 249:12   0    1G  0 part 
├─pmem0p13 249:13   0    1G  0 part 
├─pmem0p14 249:14   0    1G  0 part 
├─pmem0p15 249:15   0    1G  0 part 
├─pmem0p16 259:0    0    1G  0 part 
├─pmem0p17 259:1    0    1G  0 part 
└─pmem0p18 259:2    0    1G  0 part 

With dynamic minor allocation, with CONFIG_DEBUG_BLOCK_EXT_DEVT turned off:

pmem0      259:0    0 63.5G  0 rom  
├─pmem0p1  259:1    0    1G  0 part 
├─pmem0p2  259:2    0    1G  0 part 
├─pmem0p3  259:3    0    1G  0 part 
├─pmem0p4  259:4    0    1G  0 part 
├─pmem0p5  259:5    0    1G  0 part 
├─pmem0p6  259:6    0    1G  0 part 
├─pmem0p7  259:7    0    1G  0 part 
├─pmem0p8  259:8    0    1G  0 part 
├─pmem0p9  259:9    0    1G  0 part 
├─pmem0p10 259:10   0    1G  0 part 
├─pmem0p11 259:11   0    1G  0 part 
├─pmem0p12 259:12   0    1G  0 part 
├─pmem0p13 259:13   0    1G  0 part 
├─pmem0p14 259:14   0    1G  0 part 
├─pmem0p15 259:15   0    1G  0 part 
├─pmem0p16 259:16   0    1G  0 part 
├─pmem0p17 259:17   0    1G  0 part 
└─pmem0p18 259:18   0    1G  0 part

And with CONFIG_DEBUG_BLOCK_EXT_DEVT turned on:

pmem0      259:262144  0 63.5G  0 rom  
├─pmem0p1  259:786432  0    1G  0 part 
├─pmem0p2  259:131072  0    1G  0 part 
├─pmem0p3  259:655360  0    1G  0 part 
├─pmem0p4  259:393216  0    1G  0 part 
├─pmem0p5  259:917504  0    1G  0 part 
├─pmem0p6  259:65536   0    1G  0 part 
├─pmem0p7  259:589824  0    1G  0 part 
├─pmem0p8  259:327680  0    1G  0 part 
├─pmem0p9  259:851968  0    1G  0 part 
├─pmem0p10 259:196608  0    1G  0 part 
├─pmem0p11 259:720896  0    1G  0 part 
├─pmem0p12 259:458752  0    1G  0 part 
├─pmem0p13 259:983040  0    1G  0 part 
├─pmem0p14 259:32768   0    1G  0 part 
├─pmem0p15 259:557056  0    1G  0 part 
├─pmem0p16 259:294912  0    1G  0 part 
├─pmem0p17 259:819200  0    1G  0 part 
└─pmem0p18 259:163840  0    1G  0 part

With CONFIG_DEBUG_BLOCK_EXT_DEVT the minors are all mangled due to
blk_mangle_minor(), but I think that all three configs work?

Was there maybe confusion between that config option and the GENHD_FL_EXT_DEVT
gendisk flag, which AFAIK are independent?

Is there a use case that breaks when using dynamic minors without
CONFIG_DEBUG_BLOCK_EXT_DEVT?

Thanks,
- Ross

--- >8 ---
>From 6202dc7c1ef765faebb905161860c6b9ab19cc8a Mon Sep 17 00:00:00 2001
From: Ross Zwisler <ross.zwisler@linux.intel.com>
Date: Mon, 4 May 2015 10:26:54 -0600
Subject: [PATCH] pmem: Dynamically allocate partition numbers

Dynamically allocate minor numbers for partitions instead of statically
preallocating them.

Inspired by this commit:

469071a37afc NVMe: Dynamically allocate partition numbers

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/block/nd/pmem.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
index 900dad61a6b9..b977def8981e 100644
--- a/drivers/block/nd/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -26,8 +26,6 @@
 #include <linux/nd.h>
 #include "nd.h"
 
-#define PMEM_MINORS		16
-
 struct pmem_device {
 	struct request_queue	*pmem_queue;
 	struct gendisk		*pmem_disk;
@@ -185,12 +183,12 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
 	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
 	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
 
-	disk = alloc_disk(PMEM_MINORS);
+	disk = alloc_disk(0);
 	if (!disk)
 		goto out_free_queue;
 
 	disk->major		= pmem_major;
-	disk->first_minor	= PMEM_MINORS * pmem->id;
+	disk->first_minor	= 0;
 	disk->fops		= &pmem_fops;
 	disk->private_data	= pmem;
 	disk->queue		= pmem->pmem_queue;
-- 
1.9.3







^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] pmem: Initial version of persistent memory driver
@ 2015-05-04 16:43         ` Ross Zwisler
  0 siblings, 0 replies; 35+ messages in thread
From: Ross Zwisler @ 2015-05-04 16:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, boaz, axboe

On Thu, 2015-03-26 at 09:06 +0100, Christoph Hellwig wrote:
> On Wed, Mar 25, 2015 at 02:21:53PM -0600, Ross Zwisler wrote:
> > What needed to be fixed with the partition support?  I used to have real
> > numbers for first_minor and passed into alloc_disk(), but simplified it based
> > on code found in this commit in the nvme driver:
> > 
> > 469071a37afc NVMe: Dynamically allocate partition numbers
> > 
> > This has worked fine for me - is there some test case in which it breaks?
> 
> Yes, if CONFIG_DEBUG_BLOCK_EXT_DEVT isn't set that code doesn't work at all.

I can't figure out a use case that breaks when using dynamically allocated
minors without CONFIG_DEBUG_BLOCK_EXT_DEVT.  The patch that I've been testing
against is at the bottom of this mail.

Here are the minors that I get when creating a bunch of partitions using the
current code with PMEM_MINORS=16, with CONFIG_DEBUG_BLOCK_EXT_DEVT turned off:

pmem0      249:0    0 63.5G  0 rom  
├─pmem0p1  249:1    0    1G  0 part 
├─pmem0p2  249:2    0    1G  0 part 
├─pmem0p3  249:3    0    1G  0 part 
├─pmem0p4  249:4    0    1G  0 part 
├─pmem0p5  249:5    0    1G  0 part 
├─pmem0p6  249:6    0    1G  0 part 
├─pmem0p7  249:7    0    1G  0 part 
├─pmem0p8  249:8    0    1G  0 part 
├─pmem0p9  249:9    0    1G  0 part 
├─pmem0p10 249:10   0    1G  0 part 
├─pmem0p11 249:11   0    1G  0 part 
├─pmem0p12 249:12   0    1G  0 part 
├─pmem0p13 249:13   0    1G  0 part 
├─pmem0p14 249:14   0    1G  0 part 
├─pmem0p15 249:15   0    1G  0 part 
├─pmem0p16 259:0    0    1G  0 part 
├─pmem0p17 259:1    0    1G  0 part 
└─pmem0p18 259:2    0    1G  0 part 

With dynamic minor allocation, with CONFIG_DEBUG_BLOCK_EXT_DEVT turned off:

pmem0      259:0    0 63.5G  0 rom  
├─pmem0p1  259:1    0    1G  0 part 
├─pmem0p2  259:2    0    1G  0 part 
├─pmem0p3  259:3    0    1G  0 part 
├─pmem0p4  259:4    0    1G  0 part 
├─pmem0p5  259:5    0    1G  0 part 
├─pmem0p6  259:6    0    1G  0 part 
├─pmem0p7  259:7    0    1G  0 part 
├─pmem0p8  259:8    0    1G  0 part 
├─pmem0p9  259:9    0    1G  0 part 
├─pmem0p10 259:10   0    1G  0 part 
├─pmem0p11 259:11   0    1G  0 part 
├─pmem0p12 259:12   0    1G  0 part 
├─pmem0p13 259:13   0    1G  0 part 
├─pmem0p14 259:14   0    1G  0 part 
├─pmem0p15 259:15   0    1G  0 part 
├─pmem0p16 259:16   0    1G  0 part 
├─pmem0p17 259:17   0    1G  0 part 
└─pmem0p18 259:18   0    1G  0 part

And with CONFIG_DEBUG_BLOCK_EXT_DEVT turned on:

pmem0      259:262144  0 63.5G  0 rom  
├─pmem0p1  259:786432  0    1G  0 part 
├─pmem0p2  259:131072  0    1G  0 part 
├─pmem0p3  259:655360  0    1G  0 part 
├─pmem0p4  259:393216  0    1G  0 part 
├─pmem0p5  259:917504  0    1G  0 part 
├─pmem0p6  259:65536   0    1G  0 part 
├─pmem0p7  259:589824  0    1G  0 part 
├─pmem0p8  259:327680  0    1G  0 part 
├─pmem0p9  259:851968  0    1G  0 part 
├─pmem0p10 259:196608  0    1G  0 part 
├─pmem0p11 259:720896  0    1G  0 part 
├─pmem0p12 259:458752  0    1G  0 part 
├─pmem0p13 259:983040  0    1G  0 part 
├─pmem0p14 259:32768   0    1G  0 part 
├─pmem0p15 259:557056  0    1G  0 part 
├─pmem0p16 259:294912  0    1G  0 part 
├─pmem0p17 259:819200  0    1G  0 part 
└─pmem0p18 259:163840  0    1G  0 part

With CONFIG_DEBUG_BLOCK_EXT_DEVT the minors are all mangled due to
blk_mangle_minor(), but I think that all three configs work?

Was there maybe confusion between that config option and the GENHD_FL_EXT_DEVT
gendisk flag, which AFAIK are independent?

Is there a use case that breaks when using dynamic minors without
CONFIG_DEBUG_BLOCK_EXT_DEVT?

Thanks,
- Ross

--- >8 ---
From 6202dc7c1ef765faebb905161860c6b9ab19cc8a Mon Sep 17 00:00:00 2001
From: Ross Zwisler <ross.zwisler@linux.intel.com>
Date: Mon, 4 May 2015 10:26:54 -0600
Subject: [PATCH] pmem: Dynamically allocate partition numbers

Dynamically allocate minor numbers for partitions instead of statically
preallocating them.

Inspired by this commit:

469071a37afc NVMe: Dynamically allocate partition numbers

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/block/nd/pmem.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/block/nd/pmem.c b/drivers/block/nd/pmem.c
index 900dad61a6b9..b977def8981e 100644
--- a/drivers/block/nd/pmem.c
+++ b/drivers/block/nd/pmem.c
@@ -26,8 +26,6 @@
 #include <linux/nd.h>
 #include "nd.h"
 
-#define PMEM_MINORS		16
-
 struct pmem_device {
 	struct request_queue	*pmem_queue;
 	struct gendisk		*pmem_disk;
@@ -185,12 +183,12 @@ static struct pmem_device *pmem_alloc(struct device *dev, struct resource *res)
 	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
 	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
 
-	disk = alloc_disk(PMEM_MINORS);
+	disk = alloc_disk(0);
 	if (!disk)
 		goto out_free_queue;
 
 	disk->major		= pmem_major;
-	disk->first_minor	= PMEM_MINORS * pmem->id;
+	disk->first_minor	= 0;
 	disk->fops		= &pmem_fops;
 	disk->private_data	= pmem;
 	disk->queue		= pmem->pmem_queue;
-- 
1.9.3






--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-05-04 16:43         ` Ross Zwisler
  (?)
@ 2015-05-07  7:26         ` Christoph Hellwig
  2015-05-07  8:35           ` Boaz Harrosh
  -1 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2015-05-07  7:26 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Christoph Hellwig, linux-nvdimm, linux-fsdevel, linux-kernel,
	x86, boaz, axboe

On Mon, May 04, 2015 at 10:43:01AM -0600, Ross Zwisler wrote:
> > Yes, if CONFIG_DEBUG_BLOCK_EXT_DEVT isn't set that code doesn't work at all.
> 
> I can't figure out a use case that breaks when using dynamically allocated
> minors without CONFIG_DEBUG_BLOCK_EXT_DEVT.  The patch that I've been testing
> against is at the bottom of this mail.
> 
> Here are the minors that I get when creating a bunch of partitions using the
> current code with PMEM_MINORS=16, with CONFIG_DEBUG_BLOCK_EXT_DEVT turned off:

FYI, your patch below works fine for me, but the original one certainly
didn't.  One big difference was that it also removed the register_blkdev
call and thus assigning a major number.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] pmem: Initial version of persistent memory driver
  2015-05-07  7:26         ` Christoph Hellwig
@ 2015-05-07  8:35           ` Boaz Harrosh
  0 siblings, 0 replies; 35+ messages in thread
From: Boaz Harrosh @ 2015-05-07  8:35 UTC (permalink / raw)
  To: Christoph Hellwig, Ross Zwisler
  Cc: linux-nvdimm, linux-fsdevel, linux-kernel, x86, axboe

On 05/07/2015 10:26 AM, Christoph Hellwig wrote:
> On Mon, May 04, 2015 at 10:43:01AM -0600, Ross Zwisler wrote:
>>> Yes, if CONFIG_DEBUG_BLOCK_EXT_DEVT isn't set that code doesn't work at all.
>>
>> I can't figure out a use case that breaks when using dynamically allocated
>> minors without CONFIG_DEBUG_BLOCK_EXT_DEVT.  The patch that I've been testing
>> against is at the bottom of this mail.
>>
>> Here are the minors that I get when creating a bunch of partitions using the
>> current code with PMEM_MINORS=16, with CONFIG_DEBUG_BLOCK_EXT_DEVT turned off:
> 
> FYI, your patch below works fine for me, but the original one certainly
> didn't.  One big difference was that it also removed the register_blkdev
> call and thus assigning a major number.
> 

assigning a major number for what?

That "assigned major number" is then never used *anywhere* in the code at all
until it is unregistered. All the devices come up with the dynamic (259) major

the register_blkdev is just dead code. I have experimented with this a lot,
I have audited the all code stack, that major is never used, when doing the:

	alloc_disk(0)
	disk->flags	|= GENHD_FL_EXT_DEVT

The:
	disk->major		= X

Is completely ignored and is immediately over written. The only relic of that
major registration is the pmem_major global member collecting dust.

Sigh, bad habits die hard, I don't really care you can keep it. Sorry for
the noise.

Cheers
Boaz


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2015-05-07  8:35 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-25 16:04 another pmem variant Christoph Hellwig
2015-03-25 16:04 ` [PATCH 1/3] pmem: Initial version of persistent memory driver Christoph Hellwig
2015-03-25 20:19   ` Paul Bolle
2015-03-25 20:26     ` Ross Zwisler
2015-03-26  8:04       ` Christoph Hellwig
2015-03-25 20:21   ` Ross Zwisler
2015-03-26  8:06     ` Christoph Hellwig
2015-05-04 16:43       ` Ross Zwisler
2015-05-04 16:43         ` Ross Zwisler
2015-05-07  7:26         ` Christoph Hellwig
2015-05-07  8:35           ` Boaz Harrosh
2015-03-25 16:04 ` [PATCH 2/3] x86: add a is_e820_ram() helper Christoph Hellwig
2015-03-26  2:15   ` [Linux-nvdimm] " Dan Williams
2015-03-26  8:01     ` Christoph Hellwig
2015-03-26 13:57       ` Dan Williams
2015-03-26 14:32         ` Christoph Hellwig
2015-03-25 16:04 ` [PATCH 3/3] x86: add support for the non-standard protected e820 type Christoph Hellwig
2015-03-25 19:47   ` Elliott, Robert (Server Storage)
2015-03-26  8:02     ` Christoph Hellwig
2015-03-25 20:23   ` Ross Zwisler
2015-03-25 20:29     ` [Linux-nvdimm] " Dan Williams
2015-03-25 20:25   ` Ross Zwisler
2015-03-26  8:03     ` Christoph Hellwig
2015-03-25 20:35   ` [Linux-nvdimm] " Dan Williams
2015-03-25 16:33 ` [Linux-nvdimm] another pmem variant Dan Williams
2015-03-25 16:44   ` Christoph Hellwig
2015-03-25 17:00     ` Dan Williams
2015-03-25 17:04       ` Christoph Hellwig
2015-03-25 17:18         ` Dan Williams
2015-04-13  9:01       ` Greg KH
2015-04-13 16:02         ` Dan Williams
2015-03-25 18:09 ` Brooks, Adam J
2015-03-25 18:09   ` Brooks, Adam J
2015-03-25 18:46   ` Christoph Hellwig
2015-03-25 21:02 ` Ross Zwisler

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.