* [PATCH 0/4] Add persistent memory driver
@ 2014-08-27 21:11 Ross Zwisler
  2014-08-27 21:11 ` [PATCH 1/4] pmem: Initial version of " Ross Zwisler
                   ` (4 more replies)
  0 siblings, 5 replies; 41+ messages in thread
From: Ross Zwisler @ 2014-08-27 21:11 UTC
  To: Jens Axboe, Matthew Wilcox, Boaz Harrosh, Nick Piggin,
	linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: Ross Zwisler

PMEM is a modified version of the Block RAM Driver, BRD. The major difference
is that BRD allocates its backing store pages from the page cache, whereas
PMEM uses reserved memory that has been ioremapped.

One benefit of this approach is that there is a direct mapping between
filesystem block numbers and virtual addresses.  In PMEM, filesystem blocks N,
N+1, N+2, etc. will all be adjacent in the virtual memory space. This property
allows us to set up PMD mappings (2 MiB) for DAX.
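
To make the linearity concrete, here is a minimal sketch (names hypothetical,
not taken from the driver) of the translation pmem performs; the real
implementation is pmem_lookup_pg_addr() in patch 1/4:

static inline void *block_to_vaddr(void *virt_base, size_t block_size,
				   unsigned long block)
{
	/*
	 * Blocks N, N+1, N+2, ... land at adjacent virtual addresses, so
	 * any 2 MiB-aligned run of blocks is also virtually contiguous
	 * and thus eligible for a PMD mapping.
	 */
	return virt_base + block * block_size;
}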

This patch set builds upon the work that Matthew Wilcox has been doing for
DAX:

https://lkml.org/lkml/2014/8/27/31

Specifically, my implementation of pmem_direct_access() in patch 4/4 uses API
enhancements introduced in Matthew's DAX patch v10 02/21:

https://lkml.org/lkml/2014/8/27/48
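
For reference, the enhanced hook (as implemented by pmem_direct_access() in
patch 4/4 below) has this shape: on success it fills in the kernel virtual
address and pfn backing the given sector, and returns the number of bytes
contiguously addressable from that point:

	long (*direct_access)(struct block_device *bdev, sector_t sector,
			      void **kaddr, unsigned long *pfn, long size);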

Ross Zwisler (4):
  pmem: Initial version of persistent memory driver
  pmem: Add support for getgeo()
  pmem: Add support for rw_page()
  pmem: Add support for direct_access()

 MAINTAINERS            |   6 +
 drivers/block/Kconfig  |  41 ++++++
 drivers/block/Makefile |   1 +
 drivers/block/pmem.c   | 375 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 423 insertions(+)
 create mode 100644 drivers/block/pmem.c

-- 
1.9.3



* [PATCH 1/4] pmem: Initial version of persistent memory driver
  2014-08-27 21:11 [PATCH 0/4] Add persistent memory driver Ross Zwisler
@ 2014-08-27 21:11 ` Ross Zwisler
  2014-09-09 16:23   ` [PATCH v2] " Boaz Harrosh
  2014-11-02  3:22   ` [PATCH 1/4] " Elliott, Robert (Server Storage)
  2014-08-27 21:12 ` [PATCH 2/4] pmem: Add support for getgeo() Ross Zwisler
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 41+ messages in thread
From: Ross Zwisler @ 2014-08-27 21:11 UTC
  To: Jens Axboe, Matthew Wilcox, Boaz Harrosh, Nick Piggin,
	linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: Ross Zwisler

PMEM is a new driver that presents a reserved range of memory as a
block device.  This is useful for developing with NV-DIMMs, and
can be used with volatile memory as a development platform.
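
As a usage sketch (addresses hypothetical): to reserve 8 GiB of RAM starting
at the 16 GiB boundary and split it into four 2 GiB pmem disks, reserve the
range on the kernel command line and load the module with the parameters this
patch adds:

	memmap=8G$16G
	modprobe pmem pmem_start_gb=16 pmem_size_gb=8 pmem_count=4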

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 MAINTAINERS            |   6 +
 drivers/block/Kconfig  |  41 ++++++
 drivers/block/Makefile |   1 +
 drivers/block/pmem.c   | 330 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 378 insertions(+)
 create mode 100644 drivers/block/pmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 3f29153..028dc99 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7505,6 +7505,12 @@ S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c
 
+PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+S:	Supported
+F:	drivers/block/pmem.c
+
 RANDOM NUMBER DRIVER
 M:	"Theodore Ts'o" <tytso@mit.edu>
 S:	Maintained
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 1b8094d..ac52f5a 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -404,6 +404,47 @@ config BLK_DEV_RAM_DAX
 	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
+config BLK_DEV_PMEM
+	tristate "Persistent memory block device support"
+	help
+	  Saying Y here will allow you to use a contiguous range of reserved
+	  memory as one or more block devices.  Memory for PMEM should be
+	  reserved using the "memmap" kernel parameter.
+
+	  To compile this driver as a module, choose M here: the module will be
+	  called pmem.
+
+	  Most normal users won't need this functionality, and can thus say N
+	  here.
+
+config BLK_DEV_PMEM_START
+	int "Offset in GiB of where to start claiming space"
+	default "0"
+	depends on BLK_DEV_PMEM
+	help
+	  Starting offset in GiB that PMEM should use when claiming memory.  This
+	  memory needs to be reserved from the OS at boot time using the
+	  "memmap" kernel parameter.
+
+	  If you provide PMEM with volatile memory it will act as a volatile
+	  RAM disk and your data will not be persistent.
+
+config BLK_DEV_PMEM_COUNT
+	int "Default number of PMEM disks"
+	default "4"
+	depends on BLK_DEV_PMEM
+	help
+	  Number of equal sized block devices that PMEM should create.
+
+config BLK_DEV_PMEM_SIZE
+	int "Size in GiB of space to claim"
+	depends on BLK_DEV_PMEM
+	default "0"
+	help
+	  Amount of memory in GiB that PMEM should use when creating block
+	  devices.  This memory needs to be reserved from the OS at
+	  boot time using the "memmap" kernel parameter.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 02b688d..9cc6c18 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_PS3_VRAM)		+= ps3vram.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
+obj-$(CONFIG_BLK_DEV_PMEM)	+= pmem.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
new file mode 100644
index 0000000..d366b9b
--- /dev/null
+++ b/drivers/block/pmem.c
@@ -0,0 +1,330 @@
+/*
+ * Persistent Memory Driver
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * This driver is heavily based on drivers/block/brd.c.
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+#include <linux/hdreg.h>
+#include <linux/highmem.h>
+#include <linux/init.h>
+#include <linux/major.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#define SECTOR_SHIFT		9
+#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
+#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
+
+/*
+ * driver-wide physical address and total_size - one single, contiguous memory
+ * region that we divide up in to same-sized devices
+ */
+phys_addr_t	phys_addr;
+void		*virt_addr;
+size_t		total_size;
+
+struct pmem_device {
+	struct request_queue	*pmem_queue;
+	struct gendisk		*pmem_disk;
+	struct list_head	pmem_list;
+
+	phys_addr_t		phys_addr;
+	void			*virt_addr;
+	size_t			size;
+};
+
+/*
+ * direct translation from (pmem,sector) => void*
+ * We do not require that sector be page aligned.
+ * The return value will point to the beginning of the page containing the
+ * given sector, not to the sector itself.
+ */
+static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector)
+{
+	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
+	size_t offset = page_offset << PAGE_SHIFT;
+
+	BUG_ON(offset >= pmem->size);
+	return pmem->virt_addr + offset;
+}
+
+/*
+ * sector is not required to be page aligned.
+ * n is at most a single page, but could be less.
+ */
+static void copy_to_pmem(struct pmem_device *pmem, const void *src,
+			sector_t sector, size_t n)
+{
+	void *dst;
+	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	BUG_ON(n > PAGE_SIZE);
+
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	dst = pmem_lookup_pg_addr(pmem, sector);
+	memcpy(dst + offset, src, copy);
+
+	if (copy < n) {
+		src += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		dst = pmem_lookup_pg_addr(pmem, sector);
+		memcpy(dst, src, copy);
+	}
+}
+
+/*
+ * sector is not required to be page aligned.
+ * n is at most a single page, but could be less.
+ */
+static void copy_from_pmem(void *dst, struct pmem_device *pmem,
+			  sector_t sector, size_t n)
+{
+	void *src;
+	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	BUG_ON(n > PAGE_SIZE);
+
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	src = pmem_lookup_pg_addr(pmem, sector);
+
+	memcpy(dst, src + offset, copy);
+
+	if (copy < n) {
+		dst += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		src = pmem_lookup_pg_addr(pmem, sector);
+		memcpy(dst, src, copy);
+	}
+}
+
+static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
+			unsigned int len, unsigned int off, int rw,
+			sector_t sector)
+{
+	void *mem = kmap_atomic(page);
+
+	if (rw == READ) {
+		copy_from_pmem(mem + off, pmem, sector, len);
+		flush_dcache_page(page);
+	} else {
+		/*
+		 * FIXME: Need more involved flushing to ensure that writes to
+		 * NVDIMMs are actually durable before returning.
+		 */
+		flush_dcache_page(page);
+		copy_to_pmem(pmem, mem + off, sector, len);
+	}
+
+	kunmap_atomic(mem);
+}
+
+static void pmem_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+	int rw;
+	struct bio_vec bvec;
+	sector_t sector;
+	struct bvec_iter iter;
+	int err = 0;
+
+	sector = bio->bi_iter.bi_sector;
+	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_rw(bio);
+	if (rw == READA)
+		rw = READ;
+
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+
+		BUG_ON(len > PAGE_SIZE);
+		pmem_do_bvec(pmem, bvec.bv_page, len,
+			    bvec.bv_offset, rw, sector);
+		sector += len >> SECTOR_SHIFT;
+	}
+
+out:
+	bio_endio(bio, err);
+}
+
+static const struct block_device_operations pmem_fops = {
+	.owner =		THIS_MODULE,
+};
+
+/* Kernel module stuff */
+static int pmem_start_gb = CONFIG_BLK_DEV_PMEM_START;
+module_param(pmem_start_gb, int, S_IRUGO);
+MODULE_PARM_DESC(pmem_start_gb, "Offset in GB of where to start claiming space");
+
+static int pmem_size_gb = CONFIG_BLK_DEV_PMEM_SIZE;
+module_param(pmem_size_gb,  int, S_IRUGO);
+MODULE_PARM_DESC(pmem_size_gb,  "Total size in GB of space to claim for all disks");
+
+static int pmem_count = CONFIG_BLK_DEV_PMEM_COUNT;
+module_param(pmem_count, int, S_IRUGO);
+MODULE_PARM_DESC(pmem_count, "Number of pmem devices to evenly split allocated space");
+
+static LIST_HEAD(pmem_devices);
+static int pmem_major;
+
+/* FIXME: move phys_addr, virt_addr, size calls up to caller */
+static struct pmem_device *pmem_alloc(int i)
+{
+	struct pmem_device *pmem;
+	struct gendisk *disk;
+	size_t disk_size = total_size / pmem_count;
+	size_t disk_sectors = disk_size / 512;
+
+	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
+	if (!pmem)
+		goto out;
+
+	pmem->phys_addr = phys_addr + i * disk_size;
+	pmem->virt_addr = virt_addr + i * disk_size;
+	pmem->size = disk_size;
+
+	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
+	if (!pmem->pmem_queue)
+		goto out_free_dev;
+
+	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
+	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
+	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+
+	disk = pmem->pmem_disk = alloc_disk(0);
+	if (!disk)
+		goto out_free_queue;
+	disk->major		= pmem_major;
+	disk->first_minor	= 0;
+	disk->fops		= &pmem_fops;
+	disk->private_data	= pmem;
+	disk->queue		= pmem->pmem_queue;
+	disk->flags		= GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "pmem%d", i);
+	set_capacity(disk, disk_sectors);
+
+	return pmem;
+
+out_free_queue:
+	blk_cleanup_queue(pmem->pmem_queue);
+out_free_dev:
+	kfree(pmem);
+out:
+	return NULL;
+}
+
+static void pmem_free(struct pmem_device *pmem)
+{
+	put_disk(pmem->pmem_disk);
+	blk_cleanup_queue(pmem->pmem_queue);
+	kfree(pmem);
+}
+
+static void pmem_del_one(struct pmem_device *pmem)
+{
+	list_del(&pmem->pmem_list);
+	del_gendisk(pmem->pmem_disk);
+	pmem_free(pmem);
+}
+
+static int __init pmem_init(void)
+{
+	int result, i;
+	struct resource *res_mem;
+	struct pmem_device *pmem, *next;
+
+	phys_addr  = (phys_addr_t) pmem_start_gb * 1024 * 1024 * 1024;
+	total_size = (size_t)	   pmem_size_gb  * 1024 * 1024 * 1024;
+
+	res_mem = request_mem_region_exclusive(phys_addr, total_size, "pmem");
+	if (!res_mem)
+		return -ENOMEM;
+
+	virt_addr = ioremap_cache(phys_addr, total_size);
+	if (!virt_addr) {
+		result = -ENOMEM;
+		goto out_release;
+	}
+
+	result = register_blkdev(0, "pmem");
+	if (result < 0) {
+		result = -EIO;
+		goto out_unmap;
+	} else
+		pmem_major = result;
+
+	for (i = 0; i < pmem_count; i++) {
+		pmem = pmem_alloc(i);
+		if (!pmem) {
+			result = -ENOMEM;
+			goto out_free;
+		}
+		list_add_tail(&pmem->pmem_list, &pmem_devices);
+	}
+
+	list_for_each_entry(pmem, &pmem_devices, pmem_list)
+		add_disk(pmem->pmem_disk);
+
+	pr_info("pmem: module loaded\n");
+	return 0;
+
+out_free:
+	list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list) {
+		list_del(&pmem->pmem_list);
+		pmem_free(pmem);
+	}
+	unregister_blkdev(pmem_major, "pmem");
+
+out_unmap:
+	iounmap(virt_addr);
+
+out_release:
+	release_mem_region(phys_addr, total_size);
+	return result;
+}
+
+static void __exit pmem_exit(void)
+{
+	struct pmem_device *pmem, *next;
+
+	list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list)
+		pmem_del_one(pmem);
+
+	unregister_blkdev(pmem_major, "pmem");
+	iounmap(virt_addr);
+	release_mem_region(phys_addr, total_size);
+
+	pr_info("pmem: module unloaded\n");
+}
+
+MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
+MODULE_LICENSE("GPL");
+module_init(pmem_init);
+module_exit(pmem_exit);
-- 
1.9.3



* [PATCH 2/4] pmem: Add support for getgeo()
  2014-08-27 21:11 [PATCH 0/4] Add persistent memory driver Ross Zwisler
  2014-08-27 21:11 ` [PATCH 1/4] pmem: Initial version of " Ross Zwisler
@ 2014-08-27 21:12 ` Ross Zwisler
  2014-11-02  3:27   ` Elliott, Robert (Server Storage)
  2014-08-27 21:12 ` [PATCH 3/4] pmem: Add support for rw_page() Ross Zwisler
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Ross Zwisler @ 2014-08-27 21:12 UTC
  To: Jens Axboe, Matthew Wilcox, Boaz Harrosh, Nick Piggin,
	linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: Ross Zwisler

Some programs require that HDIO_GETGEO work, which requires that we implement
getgeo().  Based on the work done to the NVMe driver in this commit:

commit 4cc09e2dc4cb ("NVMe: Add getgeo to block ops")
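
For illustration, a minimal userspace sketch (not part of the patch; the
device path is assumed) of the ioctl such programs issue:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>

int main(void)
{
	struct hd_geometry geo;
	int fd = open("/dev/pmem0", O_RDONLY);	/* device path assumed */

	if (fd < 0 || ioctl(fd, HDIO_GETGEO, &geo) < 0) {
		perror("HDIO_GETGEO");
		return 1;
	}
	printf("heads=%u sectors=%u cylinders=%u start=%lu\n",
	       geo.heads, geo.sectors, geo.cylinders, geo.start);
	close(fd);
	return 0;
}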

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/block/pmem.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index d366b9b..60bbe0d 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -50,6 +50,15 @@ struct pmem_device {
 	size_t			size;
 };
 
+static int pmem_getgeo(struct block_device *bd, struct hd_geometry *geo)
+{
+	/* some standard values */
+	geo->heads = 1 << 6;
+	geo->sectors = 1 << 5;
+	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
+	return 0;
+}
+
 /*
  * direct translation from (pmem,sector) => void*
  * We do not require that sector be page aligned.
@@ -176,6 +185,7 @@ out:
 
 static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
+	.getgeo =		pmem_getgeo,
 };
 
 /* Kernel module stuff */
-- 
1.9.3



* [PATCH 3/4] pmem: Add support for rw_page()
  2014-08-27 21:11 [PATCH 0/4] Add persistent memory driver Ross Zwisler
  2014-08-27 21:11 ` [PATCH 1/4] pmem: Initial version of " Ross Zwisler
  2014-08-27 21:12 ` [PATCH 2/4] pmem: Add support for getgeo() Ross Zwisler
@ 2014-08-27 21:12 ` Ross Zwisler
  2014-08-27 21:12 ` [PATCH 4/4] pmem: Add support for direct_access() Ross Zwisler
  2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and further development (mm: add_persistent_memory) Boaz Harrosh
  4 siblings, 0 replies; 41+ messages in thread
From: Ross Zwisler @ 2014-08-27 21:12 UTC
  To: Jens Axboe, Matthew Wilcox, Boaz Harrosh, Nick Piggin,
	linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: Ross Zwisler

Based on commit a72132c31d58 ("brd: add support for rw_page()")
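
For context, ->rw_page() lets the kernel read or write a single page without
building a bio.  A paraphrased sketch of the caller side, from memory of
bdev_read_page() in fs/block_dev.c (added together with the hook; treat the
details as approximate):

int bdev_read_page(struct block_device *bdev, sector_t sector,
		   struct page *page)
{
	const struct block_device_operations *ops = bdev->bd_disk->fops;

	if (!ops->rw_page)
		return -EOPNOTSUPP;
	return ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
}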

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/block/pmem.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 60bbe0d..0be3669 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -183,8 +183,19 @@ out:
 	bio_endio(bio, err);
 }
 
+static int pmem_rw_page(struct block_device *bdev, sector_t sector,
+		       struct page *page, int rw)
+{
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+
+	pmem_do_bvec(pmem, page, PAGE_CACHE_SIZE, 0, rw, sector);
+	page_endio(page, rw & WRITE, 0);
+	return 0;
+}
+
 static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
+	.rw_page =		pmem_rw_page,
 	.getgeo =		pmem_getgeo,
 };
 
-- 
1.9.3



* [PATCH 4/4] pmem: Add support for direct_access()
  2014-08-27 21:11 [PATCH 0/4] Add persistent memory driver Ross Zwisler
                   ` (2 preceding siblings ...)
  2014-08-27 21:12 ` [PATCH 3/4] pmem: Add support for rw_page() Ross Zwisler
@ 2014-08-27 21:12 ` Ross Zwisler
  2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and further development (mm: add_persistent_memory) Boaz Harrosh
  4 siblings, 0 replies; 41+ messages in thread
From: Ross Zwisler @ 2014-08-27 21:12 UTC
  To: Jens Axboe, Matthew Wilcox, Boaz Harrosh, Nick Piggin,
	linux-fsdevel, linux-kernel, linux-nvdimm
  Cc: Ross Zwisler

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/block/pmem.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 0be3669..d63bc96 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -74,6 +74,15 @@ static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector)
 	return pmem->virt_addr + offset;
 }
 
+/* sector must be page aligned */
+static unsigned long pmem_lookup_pfn(struct pmem_device *pmem, sector_t sector)
+{
+	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
+
+	BUG_ON(sector & (PAGE_SECTORS - 1));
+	return (pmem->phys_addr >> PAGE_SHIFT) + page_offset;
+}
+
 /*
  * sector is not required to be page aligned.
  * n is at most a single page, but could be less.
@@ -193,9 +202,24 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	return 0;
 }
 
+static long pmem_direct_access(struct block_device *bdev, sector_t sector,
+			      void **kaddr, unsigned long *pfn, long size)
+{
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+
+	if (!pmem)
+		return -ENODEV;
+
+	*kaddr = pmem_lookup_pg_addr(pmem, sector);
+	*pfn = pmem_lookup_pfn(pmem, sector);
+
+	return pmem->size - (sector * 512);
+}
+
 static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		pmem_rw_page,
+	.direct_access =	pmem_direct_access,
 	.getgeo =		pmem_getgeo,
 };
 
-- 
1.9.3



* [PATCH 0/9] pmem: Fixes and further development (mm: add_persistent_memory)
  2014-08-27 21:11 [PATCH 0/4] Add persistent memory driver Ross Zwisler
                   ` (3 preceding siblings ...)
  2014-08-27 21:12 ` [PATCH 4/4] pmem: Add support for direct_access() Ross Zwisler
@ 2014-09-09 15:37 ` Boaz Harrosh
  2014-09-09 15:44   ` [PATCH 4/9] SQUASHME: pmem: Support of multiple memory regions Boaz Harrosh
                     ` (4 more replies)
  4 siblings, 5 replies; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-09 15:37 UTC
  To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm, Toshi Kani, Dave Hansen, linux-mm
  Cc: Andrew Morton, linux-kernel

On 08/28/2014 12:11 AM, Ross Zwisler wrote:
> PMEM is a modified version of the Block RAM Driver, BRD. The major difference
> is that BRD allocates its backing store pages from the page cache, whereas
> PMEM uses reserved memory that has been ioremapped.
> 
> One benefit of this approach is that there is a direct mapping between
> filesystem block numbers and virtual addresses.  In PMEM, filesystem blocks N,
> N+1, N+2, etc. will all be adjacent in the virtual memory space. This property
> allows us to set up PMD mappings (2 MiB) for DAX.
> 
> This patch set builds upon the work that Matthew Wilcox has been doing for
> DAX:
> 

Let us not submit a driver with the wrong user-visible API. Let's submit the
better API (and structure) I have sent.

> https://lkml.org/lkml/2014/8/27/31
> 
> Specifically, my implementation of pmem_direct_access() in patch 4/4 uses API
> enhancements introduced in Matthew's DAX patch v10 02/21:
> 
> https://lkml.org/lkml/2014/8/27/48
> 
> Ross Zwisler (4):
>   pmem: Initial version of persistent memory driver
>   pmem: Add support for getgeo()
>   pmem: Add support for rw_page()
>   pmem: Add support for direct_access()
> 

On top of the 4 above patches here is a list of changes:

[PATCH 1/9] SQUASHME: pmem: Remove unused #include headers
[PATCH 2/9] SQUASHME: pmem: Request from fdisk 4k alignment
[PATCH 3/9] SQUASHME: pmem: Let each device manage private memory region
[PATCH 4/9] SQUASHME: pmem: Support of multiple memory regions

	These 4 need to be squashed into Ross's 
		[patch 1/4] pmem: Initial version of persistent memory driver
	See below for a suggested new patch

[PATCH 5/9 v2] mm: Let sparse_{add,remove}_one_section receive a node_id
[PATCH 6/9 v2] mm: New add_persistent_memory/remove_persistent_memory
[PATCH 7/9 v2] pmem: Add support for page structs

	These need review by Toshi and the mm people, please.

[PATCH 8/9] SQUASHME: pmem: Fixs to getgeo
[PATCH 9/9] pmem: KISS, remove register_blkdev

	And some more development atop the initial version


All these patches can be viewed in this tree/branch:
	git://git.open-osd.org/pmem.git branch pmem-jens-3.17-rc1
	[http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=shortlog;h=refs/heads/pmem-jens-3.17-rc1]

I have also prepared a new branch, *pmem*, which is already squashed
and carries my suggested commit logs for the combined patches.
Here is the commit log:

aa85c80 Boaz Harrosh  |  pmem: KISS, remove register_blkdev 
738203c Boaz Harrosh  |  pmem: Add support for page structs 
9f50a54 Boaz Harrosh  |  mm: New add_persistent_memory/remove_persistent_memory 
fdfab12 Yigal Korman  |  mm: Let sparse_{add,remove}_one_section receive a node_id 
a477a87 Ross Zwisler  |  pmem: Add support for direct_access() 
316a93a Ross Zwisler  |  pmem: Add support for rw_page() 
6850353 Boaz Harrosh  |  SQUASHME: pmem: Fixs to getgeo 
d78a84a Ross Zwisler  |  pmem: Add support for getgeo() 
bb0eb45 Ross Zwisler  |  pmem: Initial version of persistent memory driver                                                                             

All these patches can be viewed in this tree/branch:
	git://git.open-osd.org/pmem.git branch pmem
	[http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=shortlog;h=refs/heads/pmem]
Specifically, the first [bb0eb45] is needed so the first version can be released with the
proper user-visible API.
Ross, please consider taking these patches (the pmem branch) into your tree for submission.

Thanks
Boaz



* [PATCH 4/9] SQUASHME: pmem: Support of multiple memory regions
  2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and further development (mm: add_persistent_memory) Boaz Harrosh
@ 2014-09-09 15:44   ` Boaz Harrosh
  2014-09-09 15:45   ` [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-09 15:44 UTC
  To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm
  Cc: Andrew Morton, linux-kernel

From: Boaz Harrosh <boaz@plexistor.com>

After the last patch this is easy.

The API to the pmem module is changed. We now have a single string
parameter named "map" of the form:
		 map=mapS[,mapS...]

		 where mapS=nn[KMG]$ss[KMG],
		 or    mapS=nn[KMG]@ss[KMG],

		 nn=size, ss=offset

This is just like the kernel command line map and memmap parameters,
so anything you did at grub can be copied/pasted here.

The "@" form is exactly the same as the "$" form only that
at bash prompt we need to escape the "$" with \$ so also
support the '@' char for convenience.

For each specified mapS there will be a device created.

Needless to say, all the previous pmem_XXX parameters are
removed, as well as the Kconfig defaults.
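
For example (addresses hypothetical), creating two devices, 4 GiB at offset
16 GiB and 2 GiB at offset 24 GiB:

	modprobe pmem map=4G@16G,2G@24G

Each comma-separated entry is parsed by pmem_parse_map_one() below using
memparse(), so the K/M/G suffixes behave exactly as on the kernel command
line.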

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/Kconfig | 28 -----------------------
 drivers/block/pmem.c  | 62 ++++++++++++++++++++++++++++++++++++---------------
 2 files changed, 44 insertions(+), 46 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 2d50125..5da8cbf 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -416,34 +416,6 @@ config BLK_DEV_PMEM
 	  Most normal users won't need this functionality, and can thus say N
 	  here.
 
-config BLK_DEV_PMEM_START
-	int "Offset in GiB of where to start claiming space"
-	default "0"
-	depends on BLK_DEV_PMEM
-	help
-	  Starting offset in GiB that PMEM should use when claiming memory.  This
-	  memory needs to be reserved from the OS at boot time using the
-	  "memmap" kernel parameter.
-
-	  If you provide PMEM with volatile memory it will act as a volatile
-	  RAM disk and your data will not be persistent.
-
-config BLK_DEV_PMEM_COUNT
-	int "Default number of PMEM disks"
-	default "4"
-	depends on BLK_DEV_PMEM
-	help
-	  Number of equal sized block devices that PMEM should create.
-
-config BLK_DEV_PMEM_SIZE
-	int "Size in GiB of space to claim"
-	depends on BLK_DEV_PMEM
-	default "0"
-	help
-	  Amount of memory in GiB that PMEM should use when creating block
-	  devices.  This memory needs to be reserved from the OS at
-	  boot time using the "memmap" kernel parameter.
-
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index d5c5c52..e07a373 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -212,17 +212,11 @@ static const struct block_device_operations pmem_fops = {
 };
 
 /* Kernel module stuff */
-static int pmem_start_gb = CONFIG_BLK_DEV_PMEM_START;
-module_param(pmem_start_gb, int, S_IRUGO);
-MODULE_PARM_DESC(pmem_start_gb, "Offset in GB of where to start claiming space");
-
-static int pmem_size_gb = CONFIG_BLK_DEV_PMEM_SIZE;
-module_param(pmem_size_gb,  int, S_IRUGO);
-MODULE_PARM_DESC(pmem_size_gb,  "Total size in GB of space to claim for all disks");
-
-static int pmem_count = CONFIG_BLK_DEV_PMEM_COUNT;
-module_param(pmem_count, int, S_IRUGO);
-MODULE_PARM_DESC(pmem_count, "Number of pmem devices to evenly split allocated space");
+static char *map;
+module_param(map, charp, S_IRUGO);
+MODULE_PARM_DESC(map,
+	"pmem device mapping: map=mapS[,mapS...] where:\n"
+	"mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset.");
 
 static LIST_HEAD(pmem_devices);
 static int pmem_major;
@@ -272,6 +266,13 @@ static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size,
 	struct gendisk *disk;
 	int err;
 
+	if (unlikely((phys_addr & ~PAGE_MASK) || (disk_size & ~PAGE_MASK))) {
+		pr_err("phys_addr=0x%llx disk_size=0x%zx must be 4k aligned\n",
+		       phys_addr, disk_size);
+		err = -EINVAL;
+		goto out;
+	}
+
 	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
 	if (unlikely(!pmem)) {
 		err = -ENOMEM;
@@ -344,16 +345,32 @@ static void pmem_del_one(struct pmem_device *pmem)
 	pmem_free(pmem);
 }
 
+static int pmem_parse_map_one(char *map, phys_addr_t *start, size_t *size)
+{
+	char *p = map;
+
+	*size = (size_t)memparse(p, &p);
+	if ((p == map) || ((*p != '$') && (*p != '@')))
+		return -EINVAL;
+
+	if (!*(++p))
+		return -EINVAL;
+
+	*start = (phys_addr_t)memparse(p, &p);
+
+	return *p == '\0' ? 0 : -EINVAL;
+}
+
 static int __init pmem_init(void)
 {
 	int result, i;
 	struct pmem_device *pmem, *next;
-	phys_addr_t phys_addr;
-	size_t total_size, disk_size;
+	char *p, *pmem_map = map;
 
-	phys_addr  = (phys_addr_t) pmem_start_gb * 1024 * 1024 * 1024;
-	total_size = (size_t)	   pmem_size_gb  * 1024 * 1024 * 1024;
-	disk_size = total_size / pmem_count;
+	if (!pmem_map) {
+		pr_err("pmem: must specify map=nn@ss parameter.\n");
+		return -EINVAL;
+	}
 
 	result = register_blkdev(0, "pmem");
 	if (result < 0)
@@ -361,14 +378,23 @@ static int __init pmem_init(void)
 	else
 		pmem_major = result;
 
-	for (i = 0; i < pmem_count; i++) {
+	i = 0;
+	while ((p = strsep(&pmem_map, ",")) != NULL) {
+		phys_addr_t phys_addr;
+		size_t disk_size;
+
+		if (!*p)
+			continue;
+		result = pmem_parse_map_one(p, &phys_addr, &disk_size);
+		if (result)
+			goto out_free;
 		pmem = pmem_alloc(phys_addr, disk_size, i);
 		if (IS_ERR(pmem)) {
 			result = PTR_ERR(pmem);
 			goto out_free;
 		}
 		list_add_tail(&pmem->pmem_list, &pmem_devices);
-		phys_addr += disk_size;
+		++i;
 	}
 
 	list_for_each_entry(pmem, &pmem_devices, pmem_list)
-- 
1.9.3




* [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
  2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and further development (mm: add_persistent_memory) Boaz Harrosh
  2014-09-09 15:44   ` [PATCH 4/9] SQUASHME: pmem: Support of multiple memory regions Boaz Harrosh
@ 2014-09-09 15:45   ` Boaz Harrosh
  2014-09-09 18:36     ` Dave Hansen
  2014-09-09 15:47   ` [PATCH 6/9] mm: New add_persistent_memory/remove_persistent_memory Boaz Harrosh
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-09 15:45 UTC
  To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm, Toshi Kani, Dave Hansen, linux-mm
  Cc: Andrew Morton, linux-kernel

From: Yigal Korman <yigal@plexistor.com>

Refactor the arguments of sparse_add_one_section() / sparse_remove_one_section()
to take a node id instead of a struct zone *.  A memory section has no direct
connection to zones; all that was needed from the zone was the node id.

This is for add_persistent_memory(), which will want a section of pages
allocated but without any zone associated.  Belonging
to a zone would hand the memory to the page allocators, but
persistent_memory belongs to a block device, and is not available for
regular volatile usage.

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 include/linux/memory_hotplug.h | 4 ++--
 mm/memory_hotplug.c            | 4 ++--
 mm/sparse.c                    | 9 +++++----
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index d9524c4..35ca1bb 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -264,8 +264,8 @@ extern int arch_add_memory(int nid, u64 start, u64 size);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
-extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn);
-extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);
+extern int sparse_add_one_section(int nid, unsigned long start_pfn);
+extern void sparse_remove_one_section(int nid, struct mem_section *ms);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2ff8c23..e556a90 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -471,7 +471,7 @@ static int __meminit __add_section(int nid, struct zone *zone,
 	if (pfn_valid(phys_start_pfn))
 		return -EEXIST;
 
-	ret = sparse_add_one_section(zone, phys_start_pfn);
+	ret = sparse_add_one_section(zone->zone_pgdat->node_id, phys_start_pfn);
 
 	if (ret < 0)
 		return ret;
@@ -737,7 +737,7 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	start_pfn = section_nr_to_pfn(scn_nr);
 	__remove_zone(zone, start_pfn);
 
-	sparse_remove_one_section(zone, ms);
+	sparse_remove_one_section(zone->zone_pgdat->node_id, ms);
 	return 0;
 }
 
diff --git a/mm/sparse.c b/mm/sparse.c
index d1b48b6..12a10ab 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -690,10 +690,10 @@ static void free_map_bootmem(struct page *memmap)
  * set.  If this is <=0, then that means that the passed-in
  * map was not consumed and must be freed.
  */
-int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn)
+int __meminit sparse_add_one_section(int nid, unsigned long start_pfn)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
-	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct mem_section *ms;
 	struct page *memmap;
 	unsigned long *usemap;
@@ -788,11 +788,11 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 		free_map_bootmem(memmap);
 }
 
-void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
+void sparse_remove_one_section(int nid, struct mem_section *ms)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL, flags;
-	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 
 	pgdat_resize_lock(pgdat, &flags);
 	if (ms->section_mem_map) {
@@ -807,5 +807,6 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 	clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION);
 	free_section_usemap(memmap, usemap);
 }
+
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_MEMORY_HOTPLUG */
-- 
1.9.3



* [PATCH 6/9] mm: New add_persistent_memory/remove_persistent_memory
  2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and further development (mm: add_persistent_memory) Boaz Harrosh
  2014-09-09 15:44   ` [PATCH 4/9] SQUASHME: pmem: Support of multiple memory regions Boaz Harrosh
  2014-09-09 15:45   ` [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh
@ 2014-09-09 15:47   ` Boaz Harrosh
  2014-09-09 15:48   ` [PATCH 7/9] pmem: Add support for page structs Boaz Harrosh
  2014-09-09 15:51   ` [PATCH 9/9] pmem: KISS, remove register_blkdev Boaz Harrosh
  4 siblings, 0 replies; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-09 15:47 UTC
  To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm, Toshi Kani, Dave Hansen, linux-mm
  Cc: Andrew Morton, linux-kernel

From: Boaz Harrosh <boaz@plexistor.com>

Persistent Memory is not Memory. It is not presented as
a Memory Zone and is not available through the page allocators
for application/kernel volatile usage.

It belongs to a block device just as any other persistent storage does;
the novelty here is that it is directly mapped on the CPU memory
bus, and is usually as fast, or almost as fast, as system RAM.

The main motivation of add_persistent_memory() is to allocate a
page-struct "section" for a given physical memory region. This is because
the user of this memory-mapped device might need to pass page structs
of this memory to a kernel subsystem like the block layer or networking
and have its content directly DMAed to other devices.

(For example these pages can be put on a bio and sent to disk
 in a copy-less manner)

It will also request_mem_region_exclusive(.., "persistent_memory")
to own that physical memory region.

It will also map that physical region into the kernel's VM at the
address expected by page_address() for those pages allocated
above.

remove_persistent_memory() must be called to undo what
add_persistent_memory() did.

A user of this API will then use pfn_to_page(PHYSICAL_ADDR >> PAGE_SHIFT)
to receive a page struct for use on its pmem.

Any operation like page_address(), page_to_pfn(), lock_page(), ... can
be performed on these pages just as usual.
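
A minimal sketch of the intended call sequence (variable names hypothetical,
error handling abbreviated):

	void *virt_addr;
	struct page *page;
	int err;

	err = add_persistent_memory(phys_addr, size, &virt_addr);
	if (err)
		return err;

	/* page structs for the range now exist: */
	page = pfn_to_page(phys_addr >> PAGE_SHIFT);
	/* ... page_address(page), lock_page(page), DMA, etc. ... */

	remove_persistent_memory(phys_addr, size);	/* undo when done */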

An example user is presented in the next patch, to the pmem.c block device
driver.  (There are 3 more such drivers in the kernel that could use this
API.)

This patch is based on research and a patch by
Yigal Korman <yigal@plexistor.com> for the pmem driver. I took his code
and adapted it to mm, where it belongs.

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 include/linux/memory_hotplug.h |   4 +
 mm/Kconfig                     |  23 ++++++
 mm/memory_hotplug.c            | 177 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 204 insertions(+)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 35ca1bb..9a16cec 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -191,6 +191,10 @@ extern void get_page_bootmem(unsigned long ingo, struct page *page,
 void get_online_mems(void);
 void put_online_mems(void);
 
+int add_persistent_memory(phys_addr_t phys_addr, size_t size,
+			  void **o_virt_addr);
+void remove_persistent_memory(phys_addr_t phys_addr, size_t size);
+
 #else /* ! CONFIG_MEMORY_HOTPLUG */
 /*
  * Stub functions for when hotplug is off
diff --git a/mm/Kconfig b/mm/Kconfig
index 886db21..2b78d19 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -197,6 +197,29 @@ config MEMORY_HOTREMOVE
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
 
+
+# User of PERSISTENT_MEMORY_SECTION should:
+#	depends on PERSISTENT_MEMORY_DEPENDENCY and
+#	select DRIVER_NEEDS_PERSISTENT_MEMORY
+# Note that it will not be enabled if MEMORY_HOTPLUG is not enabled
+#
+# If you have changed the dependency/select of MEMORY_HOTREMOVE please also
+# update here
+#
+config PERSISTENT_MEMORY_DEPENDENCY
+	def_bool y
+	depends on MEMORY_HOTPLUG
+	depends on ARCH_ENABLE_MEMORY_HOTREMOVE && MIGRATION
+
+config DRIVER_NEEDS_PERSISTENT_MEMORY
+	bool
+
+config PERSISTENT_MEMORY_SECTION
+	def_bool y
+	depends on PERSISTENT_MEMORY_DEPENDENCY
+	depends on DRIVER_NEEDS_PERSISTENT_MEMORY
+	select MEMORY_HOTREMOVE
+
 #
 # If we have space for more page flags then we can enable additional
 # optimizations and functionality.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e556a90..1682b0e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2004,3 +2004,180 @@ void __ref remove_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
+
+#ifdef CONFIG_PERSISTENT_MEMORY_SECTION
+
+/* This helper is so we do not need to allocate a page_array bigger
+ * than PAGE_SIZE
+ */
+static int _map_sec_range(ulong sec_start_pfn, struct page **page_array)
+{
+	const uint NUM_PAGES = PAGE_SIZE / sizeof(*page_array);
+	const uint ARRAYS_IN_SECTION = PAGES_PER_SECTION / NUM_PAGES;
+	ulong pfn = sec_start_pfn;
+	uint a;
+
+	for (a = 0; a < ARRAYS_IN_SECTION; ++a) {
+		ulong virt_addr = (ulong)page_address(pfn_to_page(pfn));
+		uint p;
+		int ret;
+
+		for (p = 0; p < NUM_PAGES; ++p)
+			page_array[p] = pfn_to_page(pfn++);
+
+		ret = map_kernel_range_noflush(virt_addr, NUM_PAGES * PAGE_SIZE,
+					       PAGE_KERNEL, page_array);
+		if (unlikely(ret < 0)) {
+			pr_warn("%s: map_kernel_range(0x%lx, 0x%lx) => %d\n",
+				__func__, sec_start_pfn, virt_addr, ret);
+			return ret;
+		}
+		if (unlikely(ret < NUM_PAGES)) {
+			pr_warn("%s: map_kernel_range(0x%lx) => %d != %d last_pfn=0x%lx\n",
+				 __func__, virt_addr, NUM_PAGES, ret, pfn);
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * add_persistent_memory - Add memory sections and maps them to Kernel space
+ * @phys_addr: start of physical address to add & map
+ * @size: size of the memory range in bytes
+ * @o_virt_addr: The returned virtual address of the mapped memory range
+ *
+ * A persistent_memory block-device will use this function to add memory
+ * sections and map its physical memory range. After the call to this function
+ * There will be page-struct associated with each pfn added here, and it will
+ * be accessible from Kernel space through the returned @o_virt_addr
+ * @phys_addr will be rounded down to the nearest SECTION_SIZE, the range
+ * mapped will be in full SECTION_SIZE sections.
+ * @o_virt_addr is the address of @phys_addr not the start of the mapped section
+ * so usually mapping a range unaligned on SECTION_SIZE will work just that the
+ * unaligned start and/or end, will ignore the error and continue.
+ * (but will print "memory section XX already exists")
+ *
+ * NOTE:
+ * persistent_memory is not system ram and is not available through any
+ * allocator, for regular consumption. Therefore it does not belong to any
+ * memory zone.
+ * But it will need a memory-section allocated, so page-structs are available
+ * for this memory, so it can be DMA'd directly with zero copy.
+ * After a call to this function the ranged pages belong exclusively to the
+ * caller.
+ *
+ * RETURNS:
+ * zero on success, or -errno on failure. If successful @o_virt_addr will be set
+ */
+int add_persistent_memory(phys_addr_t phys_addr, size_t size,
+			  void **o_virt_addr)
+{
+	ulong start_pfn = phys_addr >> PAGE_SHIFT;
+	ulong nr_pages = size >> PAGE_SHIFT;
+	ulong start_sec = pfn_to_section_nr(start_pfn);
+	ulong end_sec = pfn_to_section_nr(start_pfn + nr_pages +
+							PAGES_PER_SECTION - 1);
+	int nid = memory_add_physaddr_to_nid(phys_addr);
+	struct resource *res_mem;
+	struct page **page_array;
+	ulong i;
+	int ret = 0;
+
+	page_array = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (unlikely(!page_array))
+		return -ENOMEM;
+
+	res_mem = request_mem_region_exclusive(phys_addr, size,
+					       "persistent_memory");
+	if (unlikely(!res_mem)) {
+		pr_warn("%s: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n",
+			__func__, phys_addr, size);
+		ret = -EINVAL;
+		goto free_array;
+	}
+
+	for (i = start_sec; i < end_sec; ++i) {
+		ulong sec_start_pfn = i << PFN_SECTION_SHIFT;
+
+		if (pfn_valid(sec_start_pfn)) {
+			pr_warn("%s: memory section %lu already exists.\n",
+				__func__, i);
+			continue;
+		}
+
+		ret = sparse_add_one_section(nid, sec_start_pfn);
+		if (unlikely(ret < 0)) {
+			if (ret == -EEXIST) {
+				ret = 0;
+				continue;
+			} else {
+				pr_warn("%s: sparse_add_one_section => %d\n",
+					__func__, ret);
+				goto release_region;
+			}
+		}
+
+		ret = _map_sec_range(sec_start_pfn, page_array);
+		if (unlikely(ret))
+			goto release_region;
+	}
+
+	*o_virt_addr = page_address(pfn_to_page(start_pfn));
+
+free_array:
+	kfree(page_array);
+	return ret;
+
+release_region:
+	release_mem_region(phys_addr, size);
+	goto free_array;
+}
+EXPORT_SYMBOL_GPL(add_persistent_memory);
+
+/**
+ * remove_persistent_memory - undo anything add_persistent_memory did
+ * @phys_addr: start of physical address to remove
+ * @size: size of the memory range in bytes
+ *
+ * A successful call to add_persistent_memory must be paired with
+ * remove_persistent_memory when done. It will unmap passed PFNs from
+ * Kernel virtual address, and will remove the memory sections.
+ * @phys_addr, @size must be exactly those passed to add_persistent_memory
+ * otherwise results are unexpected, there are no checks done on this.
+ */
+void remove_persistent_memory(phys_addr_t phys_addr, size_t size)
+{
+	ulong start_pfn = phys_addr >> PAGE_SHIFT;
+	ulong nr_pages = size >> PAGE_SHIFT;
+	ulong start_sec = pfn_to_section_nr(start_pfn);
+	ulong end_sec = pfn_to_section_nr(start_pfn + nr_pages +
+							PAGES_PER_SECTION - 1);
+	int nid = pfn_to_nid(start_pfn);
+	ulong virt_addr;
+	unsigned int i;
+
+	virt_addr = (ulong)page_address(
+				pfn_to_page(start_sec << PFN_SECTION_SHIFT));
+
+	for (i = start_sec; i < end_sec; ++i) {
+		struct mem_section *ms;
+
+		unmap_kernel_range(virt_addr, 1UL << SECTION_SIZE_BITS);
+		virt_addr += 1UL << SECTION_SIZE_BITS;
+
+		ms = __nr_to_section(i);
+		if (!valid_section(ms)) {
+			pr_warn("%s: memory section %d is missing.\n",
+				__func__, i);
+			continue;
+		}
+		sparse_remove_one_section(nid, ms);
+	}
+
+	release_mem_region(phys_addr, size);
+}
+EXPORT_SYMBOL_GPL(remove_persistent_memory);
+
+#endif /* def CONFIG_PERSISTENT_MEMORY_SECTION */
+
-- 
1.9.3




* [PATCH 7/9] pmem: Add support for page structs
  2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and further development (mm: add_persistent_memory) Boaz Harrosh
                     ` (2 preceding siblings ...)
  2014-09-09 15:47   ` [PATCH 6/9] mm: New add_persistent_memory/remove_persistent_memory Boaz Harrosh
@ 2014-09-09 15:48   ` Boaz Harrosh
  2014-09-09 15:51   ` [PATCH 9/9] pmem: KISS, remove register_blkdev Boaz Harrosh
  4 siblings, 0 replies; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-09 15:48 UTC
  To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel,
	linux-nvdimm, Toshi Kani, Dave Hansen, linux-mm
  Cc: Andrew Morton, linux-kernel

From: Boaz Harrosh <boaz@plexistor.com>

One of the current shortcomings of the NVDIMM/PMEM
support is that this memory does not have page structs
associated with it and therefore cannot be passed
to a block device or the network, or be DMAed in any way through
another device in the system.

The use of add_persistent_memory() fixes all this. After this patch
an FS can do:
	bdev_direct_access(,&pfn,);
	page = pfn_to_page(pfn);
And use that page for a lock_page(), set_page_dirty(), and/or
anything else one might do with a page *.
(Note that with brd one can already do this)
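
Expanded into a sketch (bdev_direct_access() as cited above; its exact
signature is assumed to match the ->direct_access hook, and error handling
is elided):

static int fs_pin_pmem_page(struct block_device *bdev, sector_t sector)
{
	void *kaddr;
	unsigned long pfn;
	struct page *page;
	long avail;

	avail = bdev_direct_access(bdev, sector, &kaddr, &pfn, PAGE_SIZE);
	if (avail < 0)
		return avail;

	/* valid because add_persistent_memory() allocated page structs */
	page = pfn_to_page(pfn);
	lock_page(page);
	set_page_dirty(page);
	unlock_page(page);
	return 0;
}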

[pmem-pages-ref-count]
pmem will serve its pages with ref==0. Once an FS does
a blkdev_get_XXX(,FMODE_EXCL,), that memory is owned by the FS.
The FS needs to manage its allocation, just as it already does
for its disk blocks. The FS should set page->count = 2 before
submission to any kernel subsystem, so that when it returns it will
never be released to the kernel's page allocators. (page_freeze)
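
As a sketch of that ownership model (device path and 'holder' hypothetical;
page->_count is the refcount field of this era):

	struct block_device *bdev;
	struct page *page;	/* obtained via pfn_to_page() as above */

	bdev = blkdev_get_by_path("/dev/pmem0",
				  FMODE_READ | FMODE_WRITE | FMODE_EXCL,
				  holder);
	if (IS_ERR(bdev))
		return PTR_ERR(bdev);

	/*
	 * Pages arrive with ref==0; pin to 2 so no submission path can
	 * ever hand them back to the page allocator.
	 */
	atomic_set(&page->_count, 2);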

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/Kconfig | 13 +++++++++++++
 drivers/block/pmem.c  | 19 +++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 5da8cbf..8a5929c 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -416,6 +416,19 @@ config BLK_DEV_PMEM
 	  Most normal users won't need this functionality, and can thus say N
 	  here.
 
+config BLK_DEV_PMEM_USE_PAGES
+	bool "Enable use of page struct pages with pmem"
+	depends on BLK_DEV_PMEM
+	depends on PERSISTENT_MEMORY_DEPENDENCY
+	select DRIVER_NEEDS_PERSISTENT_MEMORY
+	default y
+	help
+	  If a user of PMEM device needs "struct page" associated
+	  with its memory, so this memory can be sent to other
+	  block devices, or sent on the network, or be DMA transferred
+	  to other devices in the system, then you must say "Yes" here.
+	  If unsure leave as Yes.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index e07a373..b415b61 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -221,6 +221,23 @@ MODULE_PARM_DESC(map,
 static LIST_HEAD(pmem_devices);
 static int pmem_major;
 
+#ifdef CONFIG_BLK_DEV_PMEM_USE_PAGES
+/* pmem->phys_addr and pmem->size need to be set.
+ * Will then set pmem->virt_addr if successful.
+ */
+int pmem_mapmem(struct pmem_device *pmem)
+{
+	return add_persistent_memory(pmem->phys_addr, pmem->size,
+				     &pmem->virt_addr);
+}
+
+static void pmem_unmapmem(struct pmem_device *pmem)
+{
+	remove_persistent_memory(pmem->phys_addr, pmem->size);
+}
+
+#else /* !CONFIG_BLK_DEV_PMEM_USE_PAGES */
+
 /* pmem->phys_addr and pmem->size need to be set.
  * Will then set virt_addr if successful.
  */
@@ -258,6 +275,8 @@ void pmem_unmapmem(struct pmem_device *pmem)
 	release_mem_region(pmem->phys_addr, pmem->size);
 	pmem->virt_addr = NULL;
 }
+#endif /* ! CONFIG_BLK_DEV_PMEM_USE_PAGES */
+
 
 static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size,
 				      int i)
-- 
1.9.3




* [PATCH 9/9] pmem: KISS, remove register_blkdev
  2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and further development (mm: add_persistent_memory) Boaz Harrosh
                     ` (3 preceding siblings ...)
  2014-09-09 15:48   ` [PATCH 7/9] pmem: Add support for page structs Boaz Harrosh
@ 2014-09-09 15:51   ` Boaz Harrosh
  4 siblings, 0 replies; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-09 15:51 UTC
  To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm
  Cc: Andrew Morton, linux-kernel

From: Boaz Harrosh <boaz@plexistor.com>

Though this looks revolutionary, it actually does nothing
but remove dead code.

Since Ross moved pmem to the dynamic BLOCK_EXT_MAJOR
range, the pmem_major returned from register_blkdev()
is not used.

Inspecting the code at add_disk(), we can see that
whatever numbers we put at
	disk->major		= ...
	disk->first_minor	= ...

will be immediately overwritten with BLOCK_EXT_MAJOR
and a dynamically allocated minor, regardless of the
registration and/or whatever we put at disk->major.
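
The relevant logic, paraphrased from blk_alloc_devt() in block/genhd.c
(approximate): since pmem uses alloc_disk(0), disk->minors is 0 and every
devt falls through to the extended range:

	if (part->partno >= disk->minors) {
		/* always taken for alloc_disk(0): major 259 (blkext),
		 * minor allocated dynamically
		 */
		*devt = MKDEV(BLOCK_EXT_MAJOR, blk_mangle_minor(idx));
		return 0;
	}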

I have run all the tests that I know how to perform
on the devices: fdisk, partitions, multiple pmemX devices,
mknod, lsblk, blkid, and it all behaves exactly as
before this patch.

I have also done:
	find /sys/ -name "*pmem*"
	find /proc/ -name "*pmem*"
	find /dev/ -name "*pmem*"
And get the same output as before this patch.

The only thing missing is an entry in /proc/devices of
the form "250 pmem" (250, or whatever is free at the moment).

But this is good, because anyone who looks for major-250
devices after loading pmem will fail: pmem
is always registered as 259 (blkext) now.

If anyone knows of anything that would work differently after this
patch, please do tell.

But it looks like the call to register_blkdev() is just dead
code for us right now.

Thanks

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/pmem.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 600d225..5eda95a 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -225,7 +225,6 @@ MODULE_PARM_DESC(map,
 	"mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset.");
 
 static LIST_HEAD(pmem_devices);
-static int pmem_major;
 
 #ifdef CONFIG_BLK_DEV_PMEM_USE_PAGES
 /* pmem->phys_addr and pmem->size need to be set.
@@ -333,8 +332,6 @@ static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size,
 		goto out_free_queue;
 	}
 
-	disk->major		= pmem_major;
-	disk->first_minor	= 0;
 	disk->fops		= &pmem_fops;
 	disk->private_data	= pmem;
 	disk->queue		= pmem->pmem_queue;
@@ -397,12 +394,6 @@ static int __init pmem_init(void)
 		return -EINVAL;
 	}
 
-	result = register_blkdev(0, "pmem");
-	if (result < 0)
-		return -EIO;
-	else
-		pmem_major = result;
-
 	i = 0;
 	while ((p = strsep(&pmem_map, ",")) != NULL) {
 		phys_addr_t phys_addr;
@@ -433,7 +424,6 @@ out_free:
 		list_del(&pmem->pmem_list);
 		pmem_free(pmem);
 	}
-	unregister_blkdev(pmem_major, "pmem");
 
 	return result;
 }
@@ -445,7 +435,6 @@ static void __exit pmem_exit(void)
 	list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list)
 		pmem_del_one(pmem);
 
-	unregister_blkdev(pmem_major, "pmem");
 	pr_info("pmem: module unloaded\n");
 }
 
-- 
1.9.3



* [PATCH v2] pmem: Initial version of persistent memory driver
  2014-08-27 21:11 ` [PATCH 1/4] pmem: Initial version of " Ross Zwisler
@ 2014-09-09 16:23   ` Boaz Harrosh
  2014-09-09 16:53     ` [Linux-nvdimm] " Dan Williams
  2014-11-02  3:22   ` [PATCH 1/4] " Elliott, Robert (Server Storage)
  1 sibling, 1 reply; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-09 16:23 UTC
  To: Ross Zwisler, Jens Axboe, Matthew Wilcox, Nick Piggin,
	linux-fsdevel, linux-kernel, linux-nvdimm

From: Ross Zwisler <ross.zwisler@linux.intel.com>

PMEM is a new driver that presents a reserved range of memory as a
block device.  This is useful for developing with NV-DIMMs, and
can be used with volatile memory as a development platform.

[boaz]
SQUASHME: pmem: Remove unused #include headers
SQUASHME: pmem: Request from fdisk 4k alignment
SQUASHME: pmem: Let each device manage private memory region
SQUASHME: pmem: Support of multiple memory regions

The API to the pmem module is a single string parameter named "map"
of the form:
		 map=mapS[,mapS...]

		 where mapS=nn[KMG]$ss[KMG],
		 or    mapS=nn[KMG]@ss[KMG],

		 nn=size, ss=offset

This is just like the kernel command line map and memmap parameters,
so anything you did at grub can be copied/pasted here.

The "@" form is exactly the same as the "$" form only that
at bash prompt we need to escape the "$" with \$ so also
support the '@' char for convenience.

For each specified mapS there will be a device created.

[This is the accumulated version of the driver developed by
 multiple programmers. To see the real history of these
 patches see:
	git://git.open-osd.org/pmem.git
	https://github.com/01org/prd
 This patch is based on:
	[673302b] pmem: KISS, remove register_blkdev
]

TODO: Add Documentation/blockdev/pmem.txt

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 MAINTAINERS            |   6 +
 drivers/block/Kconfig  |  13 ++
 drivers/block/Makefile |   1 +
 drivers/block/pmem.c   | 385 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 405 insertions(+)
 create mode 100644 drivers/block/pmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 5e7866a..2724ede 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7504,6 +7504,12 @@ S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c
 
+PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+L:	linux-nvdimm@lists.01.org
+S:	Supported
+F:	drivers/block/pmem.c
+
 RANDOM NUMBER DRIVER
 M:	"Theodore Ts'o" <tytso@mit.edu>
 S:	Maintained
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 014a1cf..5da8cbf 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -403,6 +403,19 @@ config BLK_DEV_XIP
 	  will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
+config BLK_DEV_PMEM
+	tristate "Persistent memory block device support"
+	help
+	  Saying Y here will allow you to use a contiguous range of reserved
+	  memory as one or more block devices.  Memory for PMEM should be
+	  reserved using the "memmap" kernel parameter.
+
+	  To compile this driver as a module, choose M here: the module will be
+	  called pmem.
+
+	  Most normal users won't need this functionality, and can thus say N
+	  here.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 02b688d..9cc6c18 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_PS3_VRAM)		+= ps3vram.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
+obj-$(CONFIG_BLK_DEV_PMEM)	+= pmem.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
new file mode 100644
index 0000000..0fcda58
--- /dev/null
+++ b/drivers/block/pmem.c
@@ -0,0 +1,385 @@
+/*
+ * Persistent Memory Driver
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * This driver is heavily based on drivers/block/brd.c.
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/slab.h>
+
+#define SECTOR_SHIFT		9
+#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
+#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
+
+struct pmem_device {
+	struct request_queue	*pmem_queue;
+	struct gendisk		*pmem_disk;
+	struct list_head	pmem_list;
+
+	/* One contiguous memory region per device */
+	phys_addr_t		phys_addr;
+	void			*virt_addr;
+	size_t			size;
+};
+
+/*
+ * direct translation from (pmem,sector) => void*
+ * We do not require that sector be page aligned.
+ * The return value will point to the beginning of the page containing the
+ * given sector, not to the sector itself.
+ */
+static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector)
+{
+	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
+	size_t offset = page_offset << PAGE_SHIFT;
+
+	BUG_ON(offset >= pmem->size);
+	return pmem->virt_addr + offset;
+}
+
+/*
+ * sector is not required to be page aligned.
+ * n is at most a single page, but could be less.
+ */
+static void copy_to_pmem(struct pmem_device *pmem, const void *src,
+			sector_t sector, size_t n)
+{
+	void *dst;
+	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	BUG_ON(n > PAGE_SIZE);
+
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	dst = pmem_lookup_pg_addr(pmem, sector);
+	memcpy(dst + offset, src, copy);
+
+	if (copy < n) {
+		src += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		dst = pmem_lookup_pg_addr(pmem, sector);
+		memcpy(dst, src, copy);
+	}
+}
+
+/*
+ * sector is not required to be page aligned.
+ * n is at most a single page, but could be less.
+ */
+static void copy_from_pmem(void *dst, struct pmem_device *pmem,
+			  sector_t sector, size_t n)
+{
+	void *src;
+	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	BUG_ON(n > PAGE_SIZE);
+
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	src = pmem_lookup_pg_addr(pmem, sector);
+
+	memcpy(dst, src + offset, copy);
+
+	if (copy < n) {
+		dst += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		src = pmem_lookup_pg_addr(pmem, sector);
+		memcpy(dst, src, copy);
+	}
+}
+
+static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
+			unsigned int len, unsigned int off, int rw,
+			sector_t sector)
+{
+	void *mem = kmap_atomic(page);
+
+	if (rw == READ) {
+		copy_from_pmem(mem + off, pmem, sector, len);
+		flush_dcache_page(page);
+	} else {
+		/*
+		 * FIXME: Need more involved flushing to ensure that writes to
+		 * NVDIMMs are actually durable before returning.
+		 */
+		flush_dcache_page(page);
+		copy_to_pmem(pmem, mem + off, sector, len);
+	}
+
+	kunmap_atomic(mem);
+}
+
+static void pmem_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+	int rw;
+	struct bio_vec bvec;
+	sector_t sector;
+	struct bvec_iter iter;
+	int err = 0;
+
+	sector = bio->bi_iter.bi_sector;
+	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_rw(bio);
+	if (rw == READA)
+		rw = READ;
+
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+
+		BUG_ON(len > PAGE_SIZE);
+		pmem_do_bvec(pmem, bvec.bv_page, len,
+			    bvec.bv_offset, rw, sector);
+		sector += len >> SECTOR_SHIFT;
+	}
+
+out:
+	bio_endio(bio, err);
+}
+
+static const struct block_device_operations pmem_fops = {
+	.owner =		THIS_MODULE,
+};
+
+/* Kernel module stuff */
+static char *map;
+module_param(map, charp, S_IRUGO);
+MODULE_PARM_DESC(map,
+	"pmem device mapping: map=mapS[,mapS...] where:\n"
+	"mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset.");
+
+static LIST_HEAD(pmem_devices);
+static int pmem_major;
+
+/* pmem->phys_addr and pmem->size need to be set.
+ * Will then set virt_addr if successful.
+ */
+int pmem_mapmem(struct pmem_device *pmem)
+{
+	struct resource *res_mem;
+	int err;
+
+	res_mem = request_mem_region_exclusive(pmem->phys_addr, pmem->size,
+					       "pmem");
+	if (!res_mem) {
+		pr_warn("pmem: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n",
+			   pmem->phys_addr, pmem->size);
+		return -EINVAL;
+	}
+
+	pmem->virt_addr = ioremap_cache(pmem->phys_addr, pmem->size);
+	if (unlikely(!pmem->virt_addr)) {
+		err = -ENXIO;
+		goto out_release;
+	}
+	return 0;
+
+out_release:
+	release_mem_region(pmem->phys_addr, pmem->size);
+	return err;
+}
+
+void pmem_unmapmem(struct pmem_device *pmem)
+{
+	if (unlikely(!pmem->virt_addr))
+		return;
+
+	iounmap(pmem->virt_addr);
+	release_mem_region(pmem->phys_addr, pmem->size);
+	pmem->virt_addr = NULL;
+}
+
+static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size,
+				      int i)
+{
+	struct pmem_device *pmem;
+	struct gendisk *disk;
+	int err;
+
+	if (unlikely((phys_addr & ~PAGE_MASK) || (disk_size & ~PAGE_MASK))) {
+		pr_err("phys_addr=0x%llx disk_size=0x%zx must be 4k aligned\n",
+		       phys_addr, disk_size);
+		err = -EINVAL;
+		goto out;
+	}
+
+	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
+	if (unlikely(!pmem)) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	pmem->phys_addr = phys_addr;
+	pmem->size = disk_size;
+
+	err = pmem_mapmem(pmem);
+	if (unlikely(err))
+		goto out_free_dev;
+
+	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
+	if (unlikely(!pmem->pmem_queue)) {
+		err = -ENOMEM;
+		goto out_unmap;
+	}
+
+	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
+	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
+	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+
+	/* This is so fdisk will align partitions on 4k, because of
+	 * direct_access API needing 4k alignment, returning a PFN
+	 */
+	blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE);
+	pmem->pmem_queue->limits.io_min = 512; /* Don't use the accessor */
+
+	disk = alloc_disk(0);
+	if (unlikely(!disk)) {
+		err = -ENOMEM;
+		goto out_free_queue;
+	}
+
+	disk->major		= pmem_major;
+	disk->first_minor	= 0;
+	disk->fops		= &pmem_fops;
+	disk->private_data	= pmem;
+	disk->queue		= pmem->pmem_queue;
+	disk->flags		= GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "pmem%d", i);
+	set_capacity(disk, disk_size >> SECTOR_SHIFT);
+	pmem->pmem_disk = disk;
+
+	return pmem;
+
+out_free_queue:
+	blk_cleanup_queue(pmem->pmem_queue);
+out_unmap:
+	pmem_unmapmem(pmem);
+out_free_dev:
+	kfree(pmem);
+out:
+	return ERR_PTR(err);
+}
+
+static void pmem_free(struct pmem_device *pmem)
+{
+	put_disk(pmem->pmem_disk);
+	blk_cleanup_queue(pmem->pmem_queue);
+	pmem_unmapmem(pmem);
+	kfree(pmem);
+}
+
+static void pmem_del_one(struct pmem_device *pmem)
+{
+	list_del(&pmem->pmem_list);
+	del_gendisk(pmem->pmem_disk);
+	pmem_free(pmem);
+}
+
+static int pmem_parse_map_one(char *map, phys_addr_t *start, size_t *size)
+{
+	char *p = map;
+
+	*size = (size_t)memparse(p, &p);
+	if ((p == map) || ((*p != '$') && (*p != '@')))
+		return -EINVAL;
+
+	if (!*(++p))
+		return -EINVAL;
+
+	*start = (phys_addr_t)memparse(p, &p);
+
+	return *p == '\0' ? 0 : -EINVAL;
+}
+
+static int __init pmem_init(void)
+{
+	int result, i;
+	struct pmem_device *pmem, *next;
+	char *p, *pmem_map = map;
+
+	if (!pmem_map) {
+		pr_err("pmem: must specify map=nn@ss parameter.\n");
+		return -EINVAL;
+	}
+
+	result = register_blkdev(0, "pmem");
+	if (result < 0)
+		return -EIO;
+	else
+		pmem_major = result;
+
+	i = 0;
+	while ((p = strsep(&pmem_map, ",")) != NULL) {
+		phys_addr_t phys_addr;
+		size_t disk_size;
+
+		if (!*p)
+			continue;
+		result = pmem_parse_map_one(p, &phys_addr, &disk_size);
+		if (result)
+			goto out_free;
+		pmem = pmem_alloc(phys_addr, disk_size, i);
+		if (IS_ERR(pmem)) {
+			result = PTR_ERR(pmem);
+			goto out_free;
+		}
+		list_add_tail(&pmem->pmem_list, &pmem_devices);
+		++i;
+	}
+
+	list_for_each_entry(pmem, &pmem_devices, pmem_list)
+		add_disk(pmem->pmem_disk);
+
+	pr_info("pmem: module loaded\n");
+	return 0;
+
+out_free:
+	list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list) {
+		list_del(&pmem->pmem_list);
+		pmem_free(pmem);
+	}
+	unregister_blkdev(pmem_major, "pmem");
+
+	return result;
+}
+
+static void __exit pmem_exit(void)
+{
+	struct pmem_device *pmem, *next;
+
+	list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list)
+		pmem_del_one(pmem);
+
+	unregister_blkdev(pmem_major, "pmem");
+	pr_info("pmem: module unloaded\n");
+}
+
+MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
+MODULE_LICENSE("GPL");
+module_init(pmem_init);
+module_exit(pmem_exit);
-- 
1.9.3



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver
  2014-09-09 16:23   ` [PATCH v2] " Boaz Harrosh
@ 2014-09-09 16:53     ` Dan Williams
  2014-09-10 13:23       ` Boaz Harrosh
  0 siblings, 1 reply; 41+ messages in thread
From: Dan Williams @ 2014-09-09 16:53 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ross Zwisler, Jens Axboe, Matthew Wilcox, Nick Piggin,
	linux-fsdevel, linux-kernel, linux-nvdimm

On Tue, Sep 9, 2014 at 9:23 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> From: Ross Zwisler <ross.zwisler@linux.intel.com>
>
> PMEM is a new driver that presents a reserved range of memory as a
> block device.  This is useful for developing with NV-DIMMs, and
> can be used with volatile memory as a development platform.
>
> [boaz]
> SQUASHME: pmem: Remove unused #include headers
> SQUASHME: pmem: Request from fdisk 4k alignment
> SQUASHME: pmem: Let each device manage private memory region
> SQUASHME: pmem: Support of multiple memory regions
>
> The API to the pmem module is a single string parameter named "map"
> of the form:
>                  map=mapS[,mapS...]
>
>                  where mapS=nn[KMG]$ss[KMG],
>                  or    mapS=nn[KMG]@ss[KMG],
>
>                  nn=size, ss=offset
>
> This is just like the kernel command line memmap parameter, so
> anything you specified at grub can simply be copied here.
>
> The "@" form is exactly the same as the "$" form, except that at a
> bash prompt we need to escape the "$" as \$, so the '@' char is also
> supported for convenience.

Hmm, this looks like an "ACPI/DeviceTree-over-kernel-command-line"
description language.  I understand that this is borrowed from the
memmap= precedent, but, if I'm not mistaken, that's really more for
debug rather than a permanent platform-device descriptor.  <strawman>
Since this looks like firmware why not go ahead and use
request_firmware() to request a pmem range descriptor blob</strawman>?
 Given you can compile such a blob into a kernel image or provide it
in an initrd, I think it makes deployment more straightforward; also,
the descriptor format can be extended going forward, whereas the
command line approach mandates increasingly complicated strings.
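
Concretely, the driver side of that strawman might look something like
this (a rough sketch; the blob name and the parser helper are made up):

	#include <linux/firmware.h>

	static int pmem_load_map_blob(struct device *dev)
	{
		const struct firmware *fw;
		int err;

		err = request_firmware(&fw, "pmem_map.bin", dev);
		if (err)
			return err;

		/* fw->data / fw->size hold the range descriptor blob */
		err = pmem_parse_descriptors(fw->data, fw->size); /* hypothetical */
		release_firmware(fw);
		return err;
	}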

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
  2014-09-09 15:45   ` [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh
@ 2014-09-09 18:36     ` Dave Hansen
  2014-09-10 10:07       ` Boaz Harrosh
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2014-09-09 18:36 UTC (permalink / raw)
  To: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox,
	linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm
  Cc: Andrew Morton, linux-kernel

On 09/09/2014 08:45 AM, Boaz Harrosh wrote:
> This is for add_persistent_memory that will want a section of pages
> allocated but without any zone associated. This is because belonging
> to a zone will give the memory to the page allocators, but
> persistent_memory belongs to a block device, and is not available for
> regular volatile usage.

I don't think we should be taking patches like this into the kernel
until we've seen the other side of it.  Where is the page allocator code
which will see a page belonging to no zone?  Am I missing it in this set?

I see about 80 or so calls to page_zone() in the kernel.  How will a
zone-less page look to all of these sites?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
  2014-09-09 18:36     ` Dave Hansen
@ 2014-09-10 10:07       ` Boaz Harrosh
  2014-09-10 16:10         ` Dave Hansen
  0 siblings, 1 reply; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-10 10:07 UTC (permalink / raw)
  To: Dave Hansen, Ross Zwisler, Jens Axboe, Matthew Wilcox,
	linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm
  Cc: Andrew Morton, linux-kernel

On 09/09/2014 09:36 PM, Dave Hansen wrote:
> On 09/09/2014 08:45 AM, Boaz Harrosh wrote:
>> This is for add_persistent_memory that will want a section of pages
>> allocated but without any zone associated. This is because belonging
>> to a zone will give the memory to the page allocators, but
>> persistent_memory belongs to a block device, and is not available for
>> regular volatile usage.
> 
> I don't think we should be taking patches like this into the kernel
> until we've seen the other side of it.  Where is the page allocator code
> which will see a page belonging to no zone?  Am I missing it in this set?
> 

It is not missing. It will never be.

These pages do not belong to any allocator. They are not allocate-able
pages. In fact they are not "memory", they are "storage".

These pages belong wholly to a block device. In turn the block
device grants ownership of a partition of these pages to an FS.
The FS loaded on top has its own block allocation scheme, which
internally circulates each page's usage. But a page never goes beyond
its FS.

> I see about 80 or so calls to page_zone() in the kernel.  How will a
> zone-less page look to all of these sites?
> 

None of these 80 call sites will be reached! The pages are always used
below the FS, e.g. sent out on the network, or sent to a slower
block device via a BIO. I have a full-fledged FS on top of this code
and it all works very smoothly and stably. (And fast ;))

It is up to the pmem-based FS to manage its pages' ref counts so they
are never released outside of its own block allocator.

At the end of the day, struct page has nothing to do with zones
and allocators and "memory"; as it says in Documentation, struct
page is a facility to track the state of a physical page in the
system. All the other structures are higher in the stack, above
the physical layer. Struct pages, for me, are the upper API of the
physical memory layer, which is common to both memory and pmem. Higher
on the stack, where memory has a zone, pmem has a block device;
higher, where memory has page allocators, pmem has an FS block allocator;
higher, where memory has a slab, pmem has files for user consumption.

pmem is storage, which shares the physical layer with memory, and
this is what this patch describes. There will be no more mm interaction
at all for pmem. The rest of the picture is all there in plain sight as
part of this patchset: the pmem.c driver, then an FS on top of that. What
else do you need to see?

Thanks
Boaz


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver
  2014-09-09 16:53     ` [Linux-nvdimm] " Dan Williams
@ 2014-09-10 13:23       ` Boaz Harrosh
  2014-09-10 17:03         ` Dan Williams
  0 siblings, 1 reply; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-10 13:23 UTC (permalink / raw)
  To: Dan Williams, Boaz Harrosh
  Cc: Ross Zwisler, Jens Axboe, Matthew Wilcox, Nick Piggin,
	linux-fsdevel, linux-kernel, linux-nvdimm

On 09/09/2014 07:53 PM, Dan Williams wrote:
> On Tue, Sep 9, 2014 at 9:23 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
<>
> 
> Hmm, this looks like an "ACPI/DeviceTree-over-kernel-command-line"
> description language.  I understand that this is borrowed from the
> memmap= precedent, but, if I'm not mistaken, that's really more for
> debug rather than a permanent platform-device descriptor.

So what is bad about that?

> <strawman>
> Since this looks like firmware why not go ahead and use
> request_firmware() to request a pmem range descriptor blob
</strawman>?

What? God no! How is this firmware at all? No, this is all BUS info; it is
 "where on the bus this device's resource was allocated"

Firmware is a static compile-time thing common to all of my devices,
but this here is where on the bus this device was stuck in. Is your
suggestion that I compile a kernel and "make install" it (initrd)
for every DIMM I insert? Since when? And what tool will tell me
what to put there?

>  Given you can compile such a blob into a kernel image or provide it
> in an initrd, I think it makes deployment more straightforward; also,
> the descriptor format can be extended going forward

What? Really Dan, I think you got confused. The nn@ss thing is just:
   "what connector I stuck my DIMM disk in (and what other empty slots I have)"

Would you have me compile and install my kernel every time my sdX
number changes, or my sata channel moves? I do not think so.

> whereas the
> command line approach mandates ever increasingly complicated strings.

This is the first API. I intend (there is even a TODO in the patchset) to
also add a dynamic sysfs add/remove/grow API for on-the-fly changes.
All these can easily be driven by a simple udev rule to plug and play them.

Usually in the kernel, buses do not directly load device drivers. The
bus driver sends an "event" with a newly discovered resource, then a
user-mode udev rule loads an associated driver, which will scan for its
device or receive its identification somewhere; that can be the sysfs
interface above, or even the command line.
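
(A hypothetical udev rule, with a made-up subsystem name, is all it
takes:

	ACTION=="add", SUBSYSTEM=="nvdimm_bus", RUN+="/sbin/modprobe pmem"

and the RUN command could pass a map= string built from the event's
attributes.)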

Further down the road we might want a kernel API, through the DIMM-manager,
once each device's ID is identified, to manage volumes of DIMMs (and what
happens to them if they move in physical addressing). But even with a
DIMM-manager I would have gone through a udev event and a pmem probe, because
this way user mode can add a chain of commands associated with new insertions.
So since we have a DIMM-manager event going on anyway, why not have the udev
rule load pmem as well and not have any extra API. But all this is way down
the line.

Regardless, the command line API of pmem will always stay, as it serves
the pmem emulation as we use it now, in accord with memmap=.
(And it works very nicely in our lab with DDR3 NvDIMMs that need a memmap= as well)

We please need to start somewhere, no?

Thanks
Boaz


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
  2014-09-10 10:07       ` Boaz Harrosh
@ 2014-09-10 16:10         ` Dave Hansen
  2014-09-10 17:25           ` Boaz Harrosh
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2014-09-10 16:10 UTC (permalink / raw)
  To: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox,
	linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm
  Cc: Andrew Morton, linux-kernel

On 09/10/2014 03:07 AM, Boaz Harrosh wrote:
> On 09/09/2014 09:36 PM, Dave Hansen wrote:
>> On 09/09/2014 08:45 AM, Boaz Harrosh wrote:
>>> This is for add_persistent_memory that will want a section of pages
>>> allocated but without any zone associated. This is because belonging
>>> to a zone will give the memory to the page allocators, but
>>> persistent_memory belongs to a block device, and is not available for
>>> regular volatile usage.
>>
>> I don't think we should be taking patches like this into the kernel
>> until we've seen the other side of it.  Where is the page allocator code
>> which will see a page belonging to no zone?  Am I missing it in this set?
> 
> It is not missing. It will never be.
> 
> These pages do not belong to any allocator. They are not allocate-able
> pages. In fact they are not "memory", they are "storage".
>
> These pages belong wholly to a block device. In turn the block
> device grants ownership of a partition of these pages to an FS.
> The FS loaded on top has its own block allocation scheme, which
> internally circulates each page's usage. But a page never goes beyond
> its FS.

I'm mostly worried about things that start with an mmap().

Imagine you mmap() a persistent memory file, fault some pages in, then
'cat /proc/$pid/numa_maps'.  That code will look at the page to see
which zone and node it is in.

Or, consider if you mmap() then put a futex in the page.  The page will
have get_user_pages() called on it by the futex code, and a reference
taken.  The reference can outlast the mmap().  We either have to put the
file somewhere special and scan the page's reference occasionally, or we
need to hook something under put_page() to make sure that we keep the
page out of the normal allocator.

>> I see about 80 or so calls to page_zone() in the kernel.  How will a
>> zone-less page look to all of these sites?
> 
> None of these 80 call sites will be reached! The pages are always used
> below the FS, e.g. sent out on the network, or sent to a slower
> block device via a BIO. I have a full-fledged FS on top of this code
> and it all works very smoothly and stably. (And fast ;))

Does the fs support mmap()?

The idea of layering is a nice one, but mmap() is a big fat layering
violation. :)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver
  2014-09-10 13:23       ` Boaz Harrosh
@ 2014-09-10 17:03         ` Dan Williams
  2014-09-10 17:47           ` Boaz Harrosh
  0 siblings, 1 reply; 41+ messages in thread
From: Dan Williams @ 2014-09-10 17:03 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox,
	Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm

Hi Boaz,

On Wed, Sep 10, 2014 at 6:23 AM, Boaz Harrosh <openosd@gmail.com> wrote:
> On 09/09/2014 07:53 PM, Dan Williams wrote:
>> On Tue, Sep 9, 2014 at 9:23 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> <>
>>
>> Hmm, this looks like an "ACPI/DeviceTree-over-kernel-command-line"
>> description language.  I understand that this is borrowed from the
>> memmap= precedent, but, if I'm not mistaken, that's really more for
>> debug rather than a permanent platform-device descriptor.
>
> So what is bad about that?
>
>> <strawman>
>> Since this looks like firmware why not go ahead and use
>> request_firmware() to request a pmem range descriptor blob
> </strawman>?
>
> What? God no! How is this firmware at all? No, this is all BUS info; it is
>  "where on the bus this device's resource was allocated"
>
> Firmware is a static compile-time thing common to all of my devices,
> but this here is where on the bus this device was stuck in. Is your
> suggestion that I compile a kernel and "make install" it (initrd)
> for every DIMM I insert? Since when? And what tool will tell me
> what to put there?
>
>>  Given you can compile such a blob into a kernel image or provide it
>> in an initrd, I think it makes deployment more straightforward; also,
>> the descriptor format can be extended going forward
>
> What? Really Dan, I think you got confused. The nn@ss thing is just:
>    "what connector I stuck my DIMM disk in (and what other empty slots I have)"
>
> Would you have me compile and install my kernel every time my sdX
> number changes, or my sata channel moves? I do not think so.
>
>> whereas the
>> command line approach mandates ever increasingly complicated strings.
>
> This is the first API. I intend (there is even a TODO in the patchset) to
> also add a dynamic sysfs add/remove/grow API for on-the-fly changes.
> All these can easily be driven by a simple udev rule to plug and play them.
>
> Usually in the kernel, buses do not directly load device drivers. The
> bus driver sends an "event" with a newly discovered resource, then a
> user-mode udev rule loads an associated driver, which will scan for its
> device or receive its identification somewhere; that can be the sysfs
> interface above, or even the command line.
>
> Further down the road we might want a kernel API, through the DIMM-manager,
> once each device's ID is identified, to manage volumes of DIMMs (and what
> happens to them if they move in physical addressing). But even with a
> DIMM-manager I would have gone through a udev event and a pmem probe, because
> this way user mode can add a chain of commands associated with new insertions.
> So since we have a DIMM-manager event going on anyway, why not have the udev
> rule load pmem as well and not have any extra API. But all this is way down
> the line.
>
> Regardless, the command line API of pmem will always stay, as it serves
> the pmem emulation as we use it now, in accord with memmap=.
> (And it works very nicely in our lab with DDR3 NvDIMMs that need a memmap= as well)
>
> We please need to start somewhere, no?

Sure, but you used the operative term "start", as in you already
expect to enhance this capability down the road, right?

It's fine to dismiss this request_firmware() based approach, but don't
mis-characterize it in the process.  With regards to describing device
boundaries, a bus-descriptor-blob handed to the kernel is a superset
of the capability provided by the kernel command line.  It can be
injected statically at compile time, or dynamically loaded from the
initrd or the rootfs.  It has the added benefit of being flexible to
change whereas the kernel command line is a more permanent contract
that we will need to maintain compatibility with in perpetuity.

If you already see this bus description as a "starting" point, then I
think we need an interface that is more amenable to ongoing change,
that's not the kernel-command-line.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
  2014-09-10 16:10         ` Dave Hansen
@ 2014-09-10 17:25           ` Boaz Harrosh
  2014-09-10 18:28             ` Dave Hansen
  0 siblings, 1 reply; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-10 17:25 UTC (permalink / raw)
  To: Dave Hansen, Ross Zwisler, Jens Axboe, Matthew Wilcox,
	linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm
  Cc: Andrew Morton, linux-kernel

On 09/10/2014 07:10 PM, Dave Hansen wrote:
> On 09/10/2014 03:07 AM, Boaz Harrosh wrote:
>> On 09/09/2014 09:36 PM, Dave Hansen wrote:
>>> On 09/09/2014 08:45 AM, Boaz Harrosh wrote:
>>>> This is for add_persistent_memory that will want a section of pages
>>>> allocated but without any zone associated. This is because belonging
>>>> to a zone will give the memory to the page allocators, but
>>>> persistent_memory belongs to a block device, and is not available for
>>>> regular volatile usage.
>>>
>>> I don't think we should be taking patches like this into the kernel
>>> until we've seen the other side of it.  Where is the page allocator code
>>> which will see a page belonging to no zone?  Am I missing it in this set?
>>
>> It is not missing. It will never be.
>>
>> These pages do not belong to any allocator. They are not allocate-able
>> pages. In fact they are not "memory", they are "storage".
>>
>> These pages belong wholly to a block device. In turn the block
>> device grants ownership of a partition of these pages to an FS.
>> The FS loaded on top has its own block allocation scheme, which
>> internally circulates each page's usage. But a page never goes beyond
>> its FS.
> 
> I'm mostly worried about things that start with an mmap().
> 
> Imagine you mmap() a persistent memory file, fault some pages in, then
> 'cat /proc/$pid/numa_maps'.  That code will look at the page to see
> which zone and node it is in.
> 
> Or, consider if you mmap() then put a futex in the page.  The page will
> have get_user_pages() called on it by the futex code, and a reference
> taken.  The reference can outlast the mmap().  We either have to put the
> file somewhere special and scan the page's reference occasionally, or we
> need to hook something under put_page() to make sure that we keep the
> page out of the normal allocator.
> 

Yes, the block_allocator of the pmem-FS always holds the final REF on this
page, as long as there is valid data on this block. Even across boots, the
mount code re-initializes references. The only internal state that frees
these blocks is truncate, which only then returns these pages to the block
allocator; all this is common practice in filesystems, so the page-ref on
these blocks only ever drops to zero after they lose all visibility. And
yes, the block allocator uses special code to drop the count to zero,
not put_page().

So there is no chance these pages will ever be presented to the page
allocators through a put_page().

BTW: There is a hook in place that can be used today: by calling
  SetPagePrivate(page) and setting a .releasepage function on
  page->mapping->a_ops. If .releasepage() returns false the page is not
  released (and can be added to an internal queue for garbage collection).
  But with the above schema this is not needed at all. I have yet to find
  a test that keeps my free_block reference above 1, at which time I will
  exercise a garbage collection queue.
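
  A minimal sketch of that hook (hypothetical pmemfs names, not the
  actual FS code):

	static int pmemfs_releasepage(struct page *page, gfp_t gfp)
	{
		/* never hand this page to the page allocator; park it
		 * on our own garbage list instead
		 */
		pmemfs_queue_for_gc(page);	/* hypothetical helper */
		return 0;	/* i.e. "false": do not release */
	}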

>>> I see about 80 or so calls to page_zone() in the kernel.  How will a
>>> zone-less page look to all of these sites?
>>
>> None of these 80 call sites will be reached! The pages are always used
>> below the FS, e.g. sent out on the network, or sent to a slower
>> block device via a BIO. I have a full-fledged FS on top of this code
>> and it all works very smoothly and stably. (And fast ;))
> 
> Does the fs support mmap()?
> 
> The idea of layering is a nice one, but mmap() is a big fat layering
> violation. :)
> 

No!

Yes, the FS supports mmap, but through the DAX patchset. Please see in
Matthew's DAX patchset how he implements mmap without using pages
at all, direct PFN to virtual_addr. So these pages do not get exposed
to the top of the FS.

My FS uses his techniques exactly; only when it wants to spill over to
a slower device will it use these pages, copy-less.
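
The gist of the DAX fault path, as I understand it, is roughly this
(a simplified sketch, not Matthew's actual code; pmem_offset_to_pfn is
a made-up helper and error handling is omitted):

	static int dax_fault_sketch(struct vm_area_struct *vma,
				    struct vm_fault *vmf)
	{
		/* ask the block device for the pfn backing this offset */
		unsigned long pfn = pmem_offset_to_pfn(vma, vmf);

		/* map the pfn straight into user space, no struct page */
		vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn);
		return VM_FAULT_NOPAGE;
	}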

Cheers
Boaz


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver
  2014-09-10 17:03         ` Dan Williams
@ 2014-09-10 17:47           ` Boaz Harrosh
  2014-09-10 23:01             ` Dan Williams
  0 siblings, 1 reply; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-10 17:47 UTC (permalink / raw)
  To: Dan Williams, Boaz Harrosh
  Cc: Ross Zwisler, Jens Axboe, Matthew Wilcox, Nick Piggin,
	linux-fsdevel, linux-kernel, linux-nvdimm

On 09/10/2014 08:03 PM, Dan Williams wrote:
> Hi Boaz,
> 
<>
>> We please need to start somewhere, no?
> 
> Sure, but you used the operative term "start", as in you already
> expect to enhance this capability down the road, right?
> 

Yes

> It's fine to dismiss this request_firmware() based approach, but don't
> mis-characterize it in the process.  With regards to describing device
> boundaries, a bus-descriptor-blob handed to the kernel is a superset
> of the capability provided by the kernel command line.  It can be
> injected statically at compile time, or dynamically loaded from the
> initrd or the rootfs.  It has the added benefit of being flexible to
> change whereas the kernel command line is a more permanent contract
> that we will need to maintain compatibility with in perpetuity.
> 

initrd or rootfs means, for me, "make install". But I want my fedora
to never make or install: pre-compiled binary blobs, including rootfs, and
it needs to work.

> If you already see this bus description as a "starting" point, then I
> think we need an interface that is more amenable to ongoing change,
> that's not the kernel-command-line.
> 

module-command-line. A module can be loaded via udev, and/or a module param
can be changed dynamically on the fly, and also be specified via the
kernel command line. So it is a much less permanent contract API than
"rootfs".

And yes, I intend to add more interfaces. And no! I do not intend to
ever extend this module-param interface, as far as I can see. This one is
what it is right now. Later, sysfs objects will enable dynamic
management of devices. So both: an initial device list on load - more devices
or removal on the fly; unload all on unload. This is my plan. So right
now I do not see this map= ever needing to change in the future. Only more
interfaces added in the (very near) future.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
  2014-09-10 17:25           ` Boaz Harrosh
@ 2014-09-10 18:28             ` Dave Hansen
  2014-09-11  8:39               ` Boaz Harrosh
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2014-09-10 18:28 UTC (permalink / raw)
  To: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox,
	linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm
  Cc: Andrew Morton, linux-kernel

On 09/10/2014 10:25 AM, Boaz Harrosh wrote:
> Yes, the block_allocator of the pmem-FS always holds the final REF on this
> page, as long as there is valid data on this block. Even across boots, the
> mount code re-initializes references. The only internal state that frees
> these blocks is truncate, which only then returns these pages to the block
> allocator; all this is common practice in filesystems, so the page-ref on
> these blocks only ever drops to zero after they lose all visibility. And
> yes, the block allocator uses special code to drop the count to zero,
> not put_page().

OK, so what happens when a page is truncated out of a file and this
"last" block reference is dropped while a get_user_pages() still has a
reference?

> On 09/10/2014 07:10 PM, Dave Hansen wrote:
>> Does the fs support mmap()?
>>
> No!
> 
> Yes, the FS supports mmap, but through the DAX patchset. Please see in
> Matthew's DAX patchset how he implements mmap without using pages
> at all, direct PFN to virtual_addr. So these pages do not get exposed
> to the top of the FS.
>
> My FS uses his techniques exactly; only when it wants to spill over to
> a slower device will it use these pages, copy-less.

From my perspective, DAX is complicated, but it is necessary because we
don't have a 'struct page'.  You're saying that even if we pay the cost
of a 'struct page' for the memory, we still don't get the benefit of
having it like getting rid of this DAX stuff?

Also, about not having a zone for these pages.  Do you intend to support
32-bit systems?  If so, I believe you will require the kmap() family of
functions to map the pages in order to copy data in and out.  kmap()
currently requires knowing the zone of the page.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver
  2014-09-10 17:47           ` Boaz Harrosh
@ 2014-09-10 23:01             ` Dan Williams
  2014-09-11 10:45               ` Boaz Harrosh
  0 siblings, 1 reply; 41+ messages in thread
From: Dan Williams @ 2014-09-10 23:01 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox,
	Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm

On Wed, Sep 10, 2014 at 10:47 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 09/10/2014 08:03 PM, Dan Williams wrote:
>> Hi Boaz,
>>
> <>
>>> We please need to start somewhere, no?
>>
>> Sure, but you used the operative term "start", as in you already
>> expect to enhance this capability down the road, right?
>>
>
> Yes
>
>> It's fine to dismiss this request_firmware() based approach, but don't
>> mis-characterize it in the process.  With regards to describing device
>> boundaries, a bus-descriptor-blob handed to the kernel is a superset
>> of the capability provided by the kernel command line.  It can be
>> injected statically at compile time, or dynamically loaded from the
>> initrd or the rootfs.  It has the added benefit of being flexible to
>> change whereas the kernel command line is a more permanent contract
>> that we will need to maintain compatibility with in perpetuity.
>>
>
> initrd or rootfs means, for me, "make install". But I want my fedora
> to never make or install: pre-compiled binary blobs, including rootfs, and
> it needs to work.
>
>> If you already see this bus description as a "starting" point, then I
>> think we need an interface that is more amenable to ongoing change,
>> that's not the kernel-command-line.
>>
>
> module-command-line. A module can be loaded via udev, and/or a module param
> can be changed dynamically on the fly, and also be specified via the
> kernel command line. So it is a much less permanent contract API than
> "rootfs".
>
> And yes, I intend to add more interfaces. And no! I do not intend to
> ever extend this module-param interface, as far as I can see. This one is
> what it is right now. Later, sysfs objects will enable dynamic
> management of devices. So both: an initial device list on load - more devices
> or removal on the fly; unload all on unload. This is my plan. So right
> now I do not see this map= ever needing to change in the future. Only more
> interfaces added in the (very near) future.
>

Imagine you want to deploy a policy like "use half of the memory
provided by the dimm in slot3, i.e. the only one with a battery".
That sort of thing gets unwieldy in a command line string compared to
a description table format that we can update at will.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
  2014-09-10 18:28             ` Dave Hansen
@ 2014-09-11  8:39               ` Boaz Harrosh
  2014-09-11 17:07                 ` Dave Hansen
  0 siblings, 1 reply; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-11  8:39 UTC (permalink / raw)
  To: Dave Hansen, Boaz Harrosh, Ross Zwisler, Jens Axboe,
	Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani,
	linux-mm
  Cc: Andrew Morton, linux-kernel

On 09/10/2014 09:28 PM, Dave Hansen wrote:
<>
> 
> OK, so what happens when a page is truncated out of a file and this
> "last" block reference is dropped while a get_user_pages() still has a
> reference?
> 

I have a very simple plan for this scenario, as I said: hang these pages
with ref!=1 on a garbage list, and one of the clear threads can scan them
periodically and release them.

I have this test in place. Currently what I do is just drop the block
and let it leak (that is, not be used any more) until the next mount, where
it will be returned to the free store. Yes, stupid, I know. But I have a big
fat message when this happens and I have not been able to reproduce it.
So I'm still waiting for this test case; I guess DAX protects me.

<>
> From my perspective, DAX is complicated, but it is necessary because we
> don't have a 'struct page'.  You're saying that even if we pay the cost
> of a 'struct page' for the memory, we still don't get the benefit of
> having it like getting rid of this DAX stuff?
> 

No, DAX is still necessary because we map storage directly to app space,
and we still need it persistent. That is, we cannot/need not use an
in-RAM radix tree but directly use on-storage btrees.
Regular VFS has this two-tier model, volatile RAM over persistent store.
DAX is an alternative VFS model where you have a single tier; the name
implies "Direct Access".

So this has nothing to do with page cost or "benefit". DAX is about a new
VFS model for new storage technologies.

And please note, the complexity you are talking about is just a learning
curve, on the developers' side, not a technological one. Actually if you
compare the two models, let's call them VFS-2t and VFS-1t, then you see that
DAX is an order of magnitude simpler than the old model.

Life is hard and we do need the two models all at the same time, to support
all these different devices. So yes the complexity is added with the added
choice. But please do not confuse, DAX is not the complicated part. Having
a Choice is.

> Also, about not having a zone for these pages.  Do you intend to support
> 32-bit systems?  If so, I believe you will require the kmap() family of
> functions to map the pages in order to copy data in and out.  kmap()
> currently requires knowing the zone of the page.

No!!! This is strictly 64-bit. A 32-bit system is able to have at maximum
3GB of low-ram + storage.
DAX implies always mapped. That is, no re-mapping. So this rules out
more than a GB of storage. Since that is a joke, then no! 32-bit is out.

You need to understand that the current HW standard talks about DDR4, and
there are DDR3 samples floating around. So this is strictly 64-bit, even on
phones.


Thanks
Boaz


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver
  2014-09-10 23:01             ` Dan Williams
@ 2014-09-11 10:45               ` Boaz Harrosh
  2014-09-11 16:31                 ` Dan Williams
  0 siblings, 1 reply; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-11 10:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox,
	Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm

On 09/11/2014 02:01 AM, Dan Williams wrote:
<>
> 
> Imagine you want to deploy a policy like "use half of the memory
> provided by the dimm in slot3, i.e. the only one with a battery".
> That sort of thing gets unwieldy in a command line string compared to
> a description table format that we can update at will.
> 

Actually it is easy to do. Why? Because I do this here in the lab all the
time, with a "command line", with this code you see here.

[DDR3 NvDIMM, which means I need memmap=16G\$32G on the kernel command line.
 Then: modprobe pmem map=8G@32G,4G@44G,...
 and so on, just as a simple example where the 2/4-3/4 part of the range is
 not used. You can have holes in the middle or whatever you want. This here
 is just a table in comma-separated format. If we need flags in the future we
 can extend the format to nn@ss:flags, but I do not have any 3rd column yet]

And again I have in the pipe a dynamic interface added on top of
the module param one. So it will all be there soon. Without reverting
the old one.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver
  2014-09-11 10:45               ` Boaz Harrosh
@ 2014-09-11 16:31                 ` Dan Williams
  2014-09-14 11:18                   ` Boaz Harrosh
  0 siblings, 1 reply; 41+ messages in thread
From: Dan Williams @ 2014-09-11 16:31 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox,
	linux-fsdevel, linux-kernel, linux-nvdimm

On Thu, Sep 11, 2014 at 3:45 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 09/11/2014 02:01 AM, Dan Williams wrote:
> <>
>>
>> Imagine you want to deploy a policy like "use half of the memory
>> provided by the dimm in slot3, i.e. the only one with a battery".
>> That sort of thing gets unwieldy in a command line string compared to
>> a description table format that we can update at will.
>>
>
> Actually it is easy to do. Why? Because I do this here in the lab all the
> time, with a "command line", with this code you see here.
>
> [DDR3 NvDIMM, which means I need memmap=16G\$32G on the kernel command line.
>  Then: modprobe pmem map=8G@32G,4G@44G,...
>  and so on, just as a simple example where the 2/4-3/4 part of the range is
>  not used. You can have holes in the middle or whatever you want. This here
>  is just a table in comma-separated format. If we need flags in the future we
>  can extend the format to nn@ss:flags, but I do not have any 3rd column yet]

The point I am getting at is not requiring a priori knowledge of the
physical memory map of a system.  Rather, place holder variables to
enable simple dynamic discovery.

> And again I have in the pipe a dynamic interface added on top of
> the module param one. So it will all be there soon. Without reverting
> the old one.

Why start on step 2 when we haven't got agreement on step 1?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
  2014-09-11  8:39               ` Boaz Harrosh
@ 2014-09-11 17:07                 ` Dave Hansen
  2014-09-14  9:36                   ` Boaz Harrosh
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2014-09-11 17:07 UTC (permalink / raw)
  To: Boaz Harrosh, Boaz Harrosh, Ross Zwisler, Jens Axboe,
	Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani,
	linux-mm
  Cc: Andrew Morton, linux-kernel

On 09/11/2014 01:39 AM, Boaz Harrosh wrote:
> On 09/10/2014 09:28 PM, Dave Hansen wrote:
>> OK, so what happens when a page is truncated out of a file and this
>> "last" block reference is dropped while a get_user_pages() still has a
>> reference?
> 
> I have a very simple plan for this scenario, as I said: hang these pages
> with ref!=1 on a garbage list, and one of the clear threads can scan them
> periodically and release them.
>
> I have this test in place. Currently what I do is just drop the block
> and let it leak (that is, not be used any more) until the next mount, where
> it will be returned to the free store. Yes, stupid, I know. But I have a big
> fat message when this happens and I have not been able to reproduce it.
> So I'm still waiting for this test case; I guess DAX protects me.

OK, that sounds like it will work.  The "leaked until the next mount"
sounds disastrous, but I'm sure you'll fix that.  I can see how it might
lead to some fragmentation if only small amounts are ever pinned, but
not a deal-breaker.

>> From my perspective, DAX is complicated, but it is necessary because we
>> don't have a 'struct page'.  You're saying that even if we pay the cost
>> of a 'struct page' for the memory, we still don't get the benefit of
>> having it like getting rid of this DAX stuff?
> 
> No, DAX is still necessary because we map storage directly to app space,
> and we still need it persistent. That is, we cannot/need not use an
> in-RAM radix tree but directly use on-storage btrees.

Huh?  We obviously don't need/want persistent memory pages in the page
*cache*.  But, that's completely orthogonal to _having_ a 'struct page'
for them.

DAX does two major things:
1. avoids needing the page cache
2. creates "raw" page table entries that the VM does not manage
   for mmap()s

I'm not saying to put persistent memory in the page cache.

I'm saying that, if we have a 'struct page' for the memory, we should
try to make the mmap()s more normal.  This enables all kinds of things
that DAX does not support today, like direct I/O.

> Life is hard and we do need the two models all at the same time, to support
> all these different devices. So yes the complexity is added with the added
> choice. But please do not confuse, DAX is not the complicated part. Having
> a Choice is.

Great, so we at least agree that this adds complexity.

>> Also, about not having a zone for these pages.  Do you intend to support
>> 32-bit systems?  If so, I believe you will require the kmap() family of
>> functions to map the pages in order to copy data in and out.  kmap()
>> currently requires knowing the zone of the page.
> 
> No!!! This is strictly 64-bit. A 32-bit system is able to have at maximum
> 3GB of low-ram + storage.
> DAX implies always mapped. That is, no re-mapping. So this rules out
> more than a GB of storage. Since that is a joke, then no! 32-bit is out.
>
> You need to understand that the current HW standard talks about DDR4, and
> there are DDR3 samples floating around. So this is strictly 64-bit, even on
> phones.

OK, so I think I at least understand the scope of the patch set and the
limitations.  I think I've summarized the limitations:

1. Approach requires all of RAM+Pmem to be direct-mapped (rules out
   almost all 32-bit systems, or any 64-bit systems with more than 64TB
   of RAM+pmem-storage)
2. Approach is currently incompatible with some kernel code that
   requires a 'struct page' (such as direct I/O), and all kernel code
   that requires knowledge of zones or NUMA nodes.
3. Approach requires 1/64 of the amount of storage to be consumed by
   RAM for a pseudo 'struct page'.  If you had 64GB of storage and 1GB
   of RAM, you would simply run out of RAM.
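
(The 1/64 in #3 is just sizeof(struct page), 64 bytes on x86_64, per
4096-byte page: 4096/64 = 64.)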

Did I miss any?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id
  2014-09-11 17:07                 ` Dave Hansen
@ 2014-09-14  9:36                   ` Boaz Harrosh
  0 siblings, 0 replies; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-14  9:36 UTC (permalink / raw)
  To: Dave Hansen, Boaz Harrosh, Ross Zwisler, Jens Axboe,
	Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani,
	linux-mm
  Cc: Andrew Morton, linux-kernel

On 09/11/2014 08:07 PM, Dave Hansen wrote:
<>
> 
> OK, that sounds like it will work.  The "leaked until the next mount"
> sounds disastrous, but I'm sure you'll fix that.  I can see how it might
> lead to some fragmentation if only small amounts are ever pinned, but
> not a deal-breaker.
> 

There is no such thing as fragmentation with memory mapped storage ;-)

<>
> I'm saying that, if we have a 'struct page' for the memory, we should
> try to make the mmap()s more normal.  This enables all kinds of things
> that DAX does not support today, like direct I/O.
> 

What? No! Direct I/O is fully supported, including all of its APIs. Do
you mean open(O_DIRECT) and io_submit(..)? Yes, it is fully supported.

In fact all IO is direct IO; there is never a page-cache on the way, hence
"direct".

BTW: These patches enable something else. Say FSA is a DAX FS and FSB is a
regular disk FS, then
	fda = open("/mnt/FSA/file");
	pa = mmap(fda, ...);

	fdb = open("/mnt/FSB/file", O_DIRECT);
	io_submit(fdb, .., pa, ..);
	/* I mean pa is put for IO into the passed iocb for fdb */

Before these patches the above will not work and reverts to buffered IO,
but with these patches it will work.
Please note this is about the submitted pmem driver. With brd, which
also supports DAX, this already works, because brd always uses pages.
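
Spelled out a bit more (a hypothetical user-space sketch using the
libaio API; paths and sizes are made up, error handling omitted):

	#define _GNU_SOURCE	/* for O_DIRECT */
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <libaio.h>

	static void dax_mmap_to_direct_io(void)
	{
		int fda = open("/mnt/FSA/src", O_RDONLY);
		void *pa = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fda, 0);

		int fdb = open("/mnt/FSB/dst", O_WRONLY | O_DIRECT);
		io_context_t ctx = 0;
		struct iocb cb, *cbs[1] = { &cb };

		io_setup(1, &ctx);
		/* the DAX-mmap'ed buffer pa is the O_DIRECT write source */
		io_prep_pwrite(&cb, fdb, pa, 4096, 0);
		io_submit(ctx, 1, cbs);
	}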

<>
> Great, so we at least agree that this adds complexity.
> 

But the complexity is already there; DAX by Matthew is to go in soon, I hope.
Surely these added pages do not add to the complexity that much.

<>
> 
> OK, so I think I at least understand the scope of the patch set and the
> limitations.  I think I've summarized the limitations:
> 
> 1. Approach requires all of RAM+Pmem to be direct-mapped (rules out
>    almost all 32-bit systems, or any 64-bit systems with more than 64TB
>    of RAM+pmem-storage)

Yes, for NOW

> 2. Approach is currently incompatible with some kernel code that
>    requires a 'struct page' (such as direct I/O), and all kernel code
>    that requires knowledge of zones or NUMA nodes.

NO!
Direct IO - supported
NUMA - supported

"all kernel code that requires knowledge of zones" - Not needed

> 3. Approach requires 1/64 of the amount of storage to be consumed by
>    RAM for a pseudo 'struct page'.  If you had 64GB of storage and 1GB
>    of RAM, you would simply run out of RAM.
> 

Yes, so in a system as above with 64GB of pmem, 1GB of pmem will need to be
set aside and hotplugged as volatile memory. This already works today, BTW:
you can set aside a portion of an NvDIMM and hotplug it as system memory.
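
(For example, with CONFIG_ARCH_MEMORY_PROBE that hotplug can even be
driven from user space; a hypothetical invocation for a section at 34G:

	echo 0x880000000 > /sys/devices/system/memory/probe

though the exact mechanism is beside the point here.)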

We are already used to paying that ratio for RAM.
Via a kernel-config choice that ratio can also be paid for pmem. This is
why I left it as a configuration option.

> Did I miss any?
> 

Thanks
Boaz


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver
  2014-09-11 16:31                 ` Dan Williams
@ 2014-09-14 11:18                   ` Boaz Harrosh
  2014-09-16 13:54                     ` Jeff Moyer
  0 siblings, 1 reply; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-14 11:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox,
	linux-fsdevel, linux-kernel, linux-nvdimm

On 09/11/2014 07:31 PM, Dan Williams wrote:
<>
> 
> The point I am getting at is not requiring a priori knowledge of the
> physical memory map of a system.  Rather, place holder variables to
> enable simple dynamic discovery.
>

"simple dynamic discovery" does not yet exist, and when the DDR4 NvDIMM
is released we will still have those DDR3 parts out there which will
not work with the new discovery, which I need to support as well.

And ... the submitted API will always need to exist for the emulation
case, for the developer who wants to do memmap= at the kernel command line,
which does not have any "dynamic discovery" because it is not a real
device.

And my point was that the "dynamic discovery" can be de-coupled from
pmem, and be driven via udev. Each technology/ARCH may have its own
"dynamic discovery" method and drivers, yet all load a generic pmem driver.

<>
> Why start on step 2 when we haven't got agreement on step 1?
> 

So what you are saying is that you do not agree with the pmem driver at all?

Because my patch came to improve the one-range global memory parameters
of the prd_start=, prd_size=, prd_nr= interface submitted by Ross into a
simple map= interface that can take us a very long way. It does cover all
of today's usage, including emulated memmap= at the kernel command line.

I have not seen any objections to Ross's pmem driver in general from
you, only to my patch, which changes the API for the better and actually
lets us support all existing and future flat technologies. Yet, yes, it
does not have "dynamic discovery", which is out of scope for pmem right now.
(And I hope always will be.)

I really do not understand what you are suggesting. Are you just saying
NACK on pmem.c, that it does not do its job? How can we make it do its job
then?

Thanks
Boaz


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver
  2014-09-14 11:18                   ` Boaz Harrosh
@ 2014-09-16 13:54                     ` Jeff Moyer
  2014-09-16 16:24                       ` Boaz Harrosh
  2014-09-19 16:27                       ` Dan Williams
  0 siblings, 2 replies; 41+ messages in thread
From: Jeff Moyer @ 2014-09-16 13:54 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Dan Williams, Boaz Harrosh, Ross Zwisler, Jens Axboe,
	Matthew Wilcox, linux-fsdevel, linux-kernel, linux-nvdimm

Boaz Harrosh <boaz@plexistor.com> writes:

> On 09/11/2014 07:31 PM, Dan Williams wrote:
> <>
>> 
>> The point I am getting at is not requiring a priori knowledge of the
>> physical memory map of a system.  Rather, place holder variables to
>> enable simple dynamic discovery.
>> 
>
> "simple dynamic discovery" does not yet exist, and when the DDR4 NvDIMM
> is released we will still have those DDR3 parts out there which will
> not work with the new discovery, which I need to support as well.

Boaz,

Are you telling me that vendors are shipping parts that present
themselves as E820_RAM, and that you have to manually block off the
addresses from the kernel using the kernel command line?  If that is
true, then that is just insane and unsupportable.  All the hardware I
have access to:
1) does not present itself as normal memory and
2) provides some means for discovering its address and size

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver
  2014-09-16 13:54                     ` Jeff Moyer
@ 2014-09-16 16:24                       ` Boaz Harrosh
  2014-09-19 16:27                       ` Dan Williams
  1 sibling, 0 replies; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-16 16:24 UTC (permalink / raw)
  To: Jeff Moyer, Boaz Harrosh
  Cc: Dan Williams, Ross Zwisler, Jens Axboe, Matthew Wilcox,
	linux-fsdevel, linux-kernel, linux-nvdimm

On 09/16/2014 04:54 PM, Jeff Moyer wrote:
> Boaz Harrosh <boaz@plexistor.com> writes:
> 
>> On 09/11/2014 07:31 PM, Dan Williams wrote:
>> <>
>>>
>>> The point I am getting at is not requiring a priori knowledge of the
>>> physical memory map of a system.  Rather, place holder variables to
>>> enable simple dynamic discovery.
>>>
>>
>> "simple dynamic discovery" does not yet exist, and when the DDR4 NvDIMM
>> is released we will still have those DDR3 parts out there which will
>> not work with the new discovery, which I need to support as well.
> 
> Boaz,
> 
> Are you telling me that vendors are shipping parts that present
> themselves as E820_RAM, and that you have to manually block off the
> addresses from the kernel using the kernel command line?  If that is
> true, then that is just insane and unsupportable.  All the hardware I
> have access to:
> 1) does not present itself as normal memory and
> 2) provides some means for discovering its address and size
> 

Hi Jeff

There is one chip I have seen that is like that, yes. The funny thing
is that it has the capacitors and all, but we don't seem to be able to
save the contents on power loss. It might be a bug in the motherboard's
system BIOS, so we are investigating. But for this chip, yes, we need an
exclusion at the kernel command line. I agree that is not very usable.

Putting that aside: yes, the two other vendors of DDR3 NVDIMMs come with
their own driver that enables the chip and puts it on the bus. Then we
use a vendor-supplied tool to find the mapped physical address, size,
and unique id. We then run a script that loads pmem with this info to
drive the chips. But with DDR3 there is no standard, and each vendor has
its own discovery method. So pmem is just the generic ULD (upper-layer
driver) loaded after the vendor LLD has done its initial setup.

With DDR4 we will have a standard, and one LLD will be able to discover
DIMMs from any vendor. At that time we might do a dynamic in-kernel probe,
like the SCSI core does for its ULDs when a new target is found below. But
for me this probe can just be a udev rule from user mode, and pmem can stay
pure and generic. Let's cross that bridge later; it does not change the
current design, it only adds a probe() capability to the whole stack. All
of the current pmem code is made very friendly to a dynamic probe(), either
from code or via sysfs.

That said, the map= interface will always be needed, because pmem supports
one more option, which is the most commonly used right now by developers:
the emulation of pmem with RAM. For this, a developer puts memmap=nn@ss on
the kernel command line and map=nn@ss on the pmem command line, and can
then test and use code just as with real pmem, only of course not
persistent. Since this mode has no real device, it is never dynamically
discovered, and we will always want to keep this ability for pmem. So
releasing with this interface is fine, because there is never a reason not
to keep it; it is here to stay. (It is also good for exporting a pmem
device to a VM, with a VM shared-memory library.)
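
For concreteness, the emulation flow looks roughly like this (addresses
and sizes are made up; the reserving form of the kernel's memmap option
uses nn$ss and usually needs escaping in boot-loader configs, and map=
is the pmem parameter format described above):

	# kernel command line: reserve 4 GiB of RAM at the 16 GiB mark
	memmap=4G$16G
	# then hand the same range to pmem
	modprobe pmem map=4G@16G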

My next plan is to widen the module-param interface to enable
hotplug/hotremove/hotexpand via the same module params; a module param is
also a live sysfs file. At that stage the logic is as follows:

[parameters]
map=		- exists today
   On Load      - Same as "Write"
   On read	- Will display, in the nn@ss,[...] format, the existing devices
   On Write	- For all specified nn@ss:
		  If an existing device is found at ss and nn is bigger than
		  its current size, the device is dynamically expanded
		  (shrinking is not allowed).
		  If no device exists at ss, one is added of size nn, provided
		  that there is no overlap with an existing device.
		  Any existing devices which are not specified are HOTREMOVED.

  At this point we support everything, but it is not very udev friendly,
  so we have two more:

add= 		- New
   On Load      - Ignored
   On read	- empty
   On Write	- For all specified nn@ss:
		  If an existing device is found at ss and nn is bigger than
		  its current size, the device is dynamically expanded
		  (shrinking is not allowed).
		  If no device exists at ss, one is created of size nn, provided
		  that there is no overlap with an existing device.

remove= 	- New
   On Load      - Ignored
   On read	- empty
   On Write	- For all specified nn@ss:
		  If an existing device exactly matches nn@ss, it is HOTREMOVED.

  A HOTREMOVE is only allowed when the device ref-count is 1, that is, no
  open files (or mounted filesystems).
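
As a usage sketch of that proposed interface (values hypothetical; add=
and remove= are the parameters described above, which appear under
/sys/module/pmem/parameters/ once the module is loaded):

	# create a 4 GiB device at physical offset 16 GiB
	echo 4G@16G > /sys/module/pmem/parameters/add
	# grow the same device to 8 GiB (shrinking is not allowed)
	echo 8G@16G > /sys/module/pmem/parameters/add
	# hot-remove it; only succeeds while nothing holds it open
	echo 8G@16G > /sys/module/pmem/parameters/remove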

With such an interface we can probe new devices from udev and keep pmem
completely generic and vendor/arch agnostic. It can also be used with
non-DDR PCIe devices.

If later we want an in-kernel probe, we will need an NvM core that a pmem
ULD registers with. Any vendor LLD then triggers the core, which calls all
registered ULDs until a type match is found, same as SCSI. But for me that
registering core can just be udev in user mode. Again, we do not have to
decide now; the current pmem code is very friendly to an in-kernel probe()
when such a probe exists.

NOTE: There are 3 more possible ULDs for an NvM core; pmem is only type 1:
	type1 - All memory always mapped (pmem.ko)
	type2 - Reads always mapped, writes are slow and need IO like flash
		(will need an internal bcache and COW of write pages)
	type3 - Bigger internal nvm/flash with only a small window mapped at
		any given time; will need paging and remapping
	type4 - pmem + flash; needs specific instructions to move data from
		pmem to flash and free pmem for reuse (2 tier)

> Cheers,
> Jeff
> 

Thanks
Boaz


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver
  2014-09-16 13:54                     ` Jeff Moyer
  2014-09-16 16:24                       ` Boaz Harrosh
@ 2014-09-19 16:27                       ` Dan Williams
  2014-09-21  9:27                         ` Boaz Harrosh
  1 sibling, 1 reply; 41+ messages in thread
From: Dan Williams @ 2014-09-19 16:27 UTC (permalink / raw)
  To: Jeff Moyer, Boaz Harrosh
  Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox,
	linux-fsdevel, linux-kernel, linux-nvdimm

On Tue, Sep 16, 2014 at 6:54 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Boaz Harrosh <boaz@plexistor.com> writes:
>
>> On 09/11/2014 07:31 PM, Dan Williams wrote:
>> <>
>>>
>>> The point I am getting at is not requiring a priori knowledge of the
>>> physical memory map of a system.  Rather, place holder variables to
>>> enable simple dynamic discovery.
>>>
>>
>> "simple dynamic discovery" does not yet exist and when the DDR4 NvDIMM
>> will be released then we still have those DDR3 out there which will
>> not work with the new discovery, which I need to support as well.
>
> Boaz,
>
> Are you telling me that vendors are shipping parts that present
> themselves as E820_RAM, and that you have to manually block off the
> addresses from the kernel using the kernel command line?  If that is
> true, then that is just insane and unsupportable.  All the hardware I
> have access to:
> 1) does not present itself as normal memory and
> 2) provides some means for discovering its address and size

Jeff,

Yes, I agree.

Boaz,

The UEFI organization is in the process of defining a generic
specification for platform non-volatile memory resources.  Let's wait
until that is publicly available before adding any new device
discovery capabilities to pmem.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver
  2014-09-19 16:27                       ` Dan Williams
@ 2014-09-21  9:27                         ` Boaz Harrosh
  0 siblings, 0 replies; 41+ messages in thread
From: Boaz Harrosh @ 2014-09-21  9:27 UTC (permalink / raw)
  To: Dan Williams, Jeff Moyer
  Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox,
	linux-fsdevel, linux-kernel, linux-nvdimm

On 09/19/2014 07:27 PM, Dan Williams wrote:
> before adding any new device
> discovery capabilities to pmem.
> 

You lost me. What "new device discovery capabilities" do you see in
Ross's code or mine for pmem?

Actually my point is that I hope there will never be any: discovery
should live elsewhere, in an arch/driver LLD, while pmem stays generic.

I feel we are not disagreeing, just not communicating. Please explain
again; sorry for not understanding. What in the code I sent do you not
like? What in the code depends on the UEFI organization's paper and
might need to change?

Again, sorry for being slow; please explain again.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 1/4] pmem: Initial version of persistent memory driver
  2014-08-27 21:11 ` [PATCH 1/4] pmem: Initial version of " Ross Zwisler
  2014-09-09 16:23   ` [PATCH v2] " Boaz Harrosh
@ 2014-11-02  3:22   ` Elliott, Robert (Server Storage)
  2014-11-03 15:50     ` Jeff Moyer
  2014-11-03 16:19     ` Wilcox, Matthew R
  1 sibling, 2 replies; 41+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-11-02  3:22 UTC (permalink / raw)
  To: Ross Zwisler, Jens Axboe, Matthew Wilcox, Boaz Harrosh,
	Nick Piggin, Kani, Toshimitsu, Knippers, Linda, linux-fsdevel,
	linux-kernel, linux-nvdimm@lists.01.org

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Ross Zwisler
> Sent: Wednesday, 27 August, 2014 4:12 PM
> To: Jens Axboe; Matthew Wilcox; Boaz Harrosh; Nick Piggin; linux-
> fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> nvdimm@lists.01.org
> Cc: Ross Zwisler
> Subject: [PATCH 1/4] pmem: Initial version of persistent memory
> driver
> 
> PMEM is a new driver that presents a reserved range of memory as a
> block device.  This is useful for developing with NV-DIMMs, and
> can be used with volatile memory as a development platform.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  MAINTAINERS            |   6 +
>  drivers/block/Kconfig  |  41 ++++++
>  drivers/block/Makefile |   1 +
>  drivers/block/pmem.c   | 330 +++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 378 insertions(+)
>  create mode 100644 drivers/block/pmem.c
> 
...

> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index 1b8094d..ac52f5a 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -404,6 +404,47 @@ config BLK_DEV_RAM_DAX
>  	  and will prevent RAM block device backing store memory from
> being
>  	  allocated from highmem (only a problem for highmem systems).
> 
> +config BLK_DEV_PMEM
> +	tristate "Persistent memory block device support"
> +	help
> +	  Saying Y here will allow you to use a contiguous range of reserved
> +	  memory as one or more block devices.  Memory for PMEM should be
> +	  reserved using the "memmap" kernel parameter.
> +
> +	  To compile this driver as a module, choose M here: the module will
> +	  be called pmem.
> +
> +	  Most normal users won't need this functionality, and can thus say N
> +	  here.
> +
> +config BLK_DEV_PMEM_START
> +	int "Offset in GiB of where to start claiming space"
> +	default "0"
> +	depends on BLK_DEV_PMEM
> +	help
> +	  Starting offset in GiB that PMEM should use when claiming memory.
> +	  This memory needs to be reserved from the OS at boot time using the
> +	  "memmap" kernel parameter.
> +
> +	  If you provide PMEM with volatile memory it will act as a volatile
> +	  RAM disk and your data will not be persistent.
> +
> +config BLK_DEV_PMEM_COUNT
> +	int "Default number of PMEM disks"
> +	default "4"

For real use I think a default of 1 would be better.

> +	depends on BLK_DEV_PMEM
> +	help
> +	  Number of equal sized block devices that PMEM should create.
> +
> +config BLK_DEV_PMEM_SIZE
> +	int "Size in GiB of space to claim"
> +	depends on BLK_DEV_PMEM
> +	default "0"
> +	help
> +	  Amount of memory in GiB that PMEM should use when creating block
> +	  devices.  This memory needs to be reserved from the OS at
> +	  boot time using the "memmap" kernel parameter.
> +
>  config CDROM_PKTCDVD
>  	tristate "Packet writing on CD/DVD media"
>  	depends on !UML


...

> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
> new file mode 100644
> index 0000000..d366b9b
> --- /dev/null
> +++ b/drivers/block/pmem.c
> @@ -0,0 +1,330 @@
> +/*
> + * Persistent Memory Driver
> + * Copyright (c) 2014, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * This driver is heavily based on drivers/block/brd.c.
> + * Copyright (C) 2007 Nick Piggin
> + * Copyright (C) 2007 Novell Inc.
> + */
> +
> +#include <linux/bio.h>
> +#include <linux/blkdev.h>
> +#include <linux/fs.h>
> +#include <linux/hdreg.h>
> +#include <linux/highmem.h>
> +#include <linux/init.h>
> +#include <linux/major.h>
> +#include <linux/module.h>
> +#include <linux/moduleparam.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +
> +#define SECTOR_SHIFT		9
> +#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
> +#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
> +
> +/*
> + * driver-wide physical address and total_size - one single, contiguous
> + * memory region that we divide up into same-sized devices
> + */
> +phys_addr_t	phys_addr;
> +void		*virt_addr;
> +size_t		total_size;
> +
> +struct pmem_device {
> +	struct request_queue	*pmem_queue;
> +	struct gendisk		*pmem_disk;
> +	struct list_head	pmem_list;
> +
> +	phys_addr_t		phys_addr;
> +	void			*virt_addr;
> +	size_t			size;
> +};
> +
> +/*
> + * direct translation from (pmem,sector) => void*
> + * We do not require that sector be page aligned.
> + * The return value will point to the beginning of the page containing the
> + * given sector, not to the sector itself.
> + */
> +static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector)
> +{
> +	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
> +	size_t offset = page_offset << PAGE_SHIFT;
> +

Since page_offset is only used to calculate offset, they
could be joined to avoid relying on the compiler to
optimize it away:
	size_t offset = sector >> PAGE_SECTORS_SHIFT << PAGE_SHIFT;

> +	BUG_ON(offset >= pmem->size);

BUG_ON is severe... should this function be designed to return 
an error instead?

> +	return pmem->virt_addr + offset;
> +}

> +
> +/*
> + * sector is not required to be page aligned.
> + * n is at most a single page, but could be less.
> + */
> +static void copy_to_pmem(struct pmem_device *pmem, const void *src,
> +			sector_t sector, size_t n)
> +{
> +	void *dst;
> +	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
> +	size_t copy;
> +
> +	BUG_ON(n > PAGE_SIZE);
> +
> +	copy = min_t(size_t, n, PAGE_SIZE - offset);
> +	dst = pmem_lookup_pg_addr(pmem, sector);
> +	memcpy(dst + offset, src, copy);
> +
> +	if (copy < n) {
> +		src += copy;
> +		sector += copy >> SECTOR_SHIFT;
> +		copy = n - copy;
> +		dst = pmem_lookup_pg_addr(pmem, sector);
> +		memcpy(dst, src, copy);
> +	}
> +}
> +
> +/*
> + * sector is not required to be page aligned.
> + * n is at most a single page, but could be less.
> + */
> +static void copy_from_pmem(void *dst, struct pmem_device *pmem,
> +			  sector_t sector, size_t n)
> +{
> +	void *src;
> +	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
> +	size_t copy;
> +
> +	BUG_ON(n > PAGE_SIZE);
> +
> +	copy = min_t(size_t, n, PAGE_SIZE - offset);
> +	src = pmem_lookup_pg_addr(pmem, sector);
> +
> +	memcpy(dst, src + offset, copy);
> +
> +	if (copy < n) {
> +		dst += copy;
> +		sector += copy >> SECTOR_SHIFT;
> +		copy = n - copy;
> +		src = pmem_lookup_pg_addr(pmem, sector);
> +		memcpy(dst, src, copy);
> +	}
> +}
> +
> +static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
> +			unsigned int len, unsigned int off, int rw,
> +			sector_t sector)
> +{
> +	void *mem = kmap_atomic(page);
> +
> +	if (rw == READ) {
> +		copy_from_pmem(mem + off, pmem, sector, len);
> +		flush_dcache_page(page);

Why does READ flush the dcache after reading?  It's fine to 
leave data in dcache.

> +	} else {
> +		/*
> +		 * FIXME: Need more involved flushing to ensure that writes to
> +		 * NVDIMMs are actually durable before returning.
> +		 */
> +		flush_dcache_page(page);
> +		copy_to_pmem(pmem, mem + off, sector, len);

Why is this flushing before the write to pmem? That might flush
data that's going to be overwritten.  I'd expect a flush AFTER
the write, to push data to a durable location that will survive 
surprise power loss.

> +	}
> +
> +	kunmap_atomic(mem);
> +}

If a read or write of a pmem address gets an uncorrectable
error, the system should not crash; just report this IO is bad.  
That reflects a difference in storage philosophy vs. memory
philosophy.  All the rest of the data in the pmem might be
fine, and many users prefer to recover 99% of their data
than lose it all.

pmem_do_bvec needs a way to turn off normal DRAM "crash on
error" handling for its accesses, and provide a return value
indicating success or failure.

Also, most storage devices automatically remap bad sectors
when they are overwritten or remap them on command (e.g.,
REASSIGN BLOCKS in SCSI), rather than leave that sector 
bad forever.  I don't know how many filesystems and
applications rely on that feature now, vs. maintain their
own bad block tables; this may be something that pmem
needs to provide.
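
A minimal sketch of the return-value half of this suggestion (assumed,
not from the patch: copy_from_pmem()/copy_to_pmem() are imagined here to
return 0 or -EIO instead of void, and the machine-check integration that
would actually detect the error is the hard, unwritten part):

	static int pmem_do_bvec(struct pmem_device *pmem, struct page *page,
				unsigned int len, unsigned int off, int rw,
				sector_t sector)
	{
		void *mem = kmap_atomic(page);
		int err;

		if (rw == READ) {
			err = copy_from_pmem(mem + off, pmem, sector, len);
			flush_dcache_page(page);
		} else {
			flush_dcache_page(page);
			err = copy_to_pmem(pmem, mem + off, sector, len);
		}

		kunmap_atomic(mem);
		return err;	/* pmem_make_request would feed this to bio_endio() */
	}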

> +
> +static void pmem_make_request(struct request_queue *q, struct bio *bio)
> +{
> +	struct block_device *bdev = bio->bi_bdev;
> +	struct pmem_device *pmem = bdev->bd_disk->private_data;
> +	int rw;
> +	struct bio_vec bvec;
> +	sector_t sector;
> +	struct bvec_iter iter;
> +	int err = 0;
> +
> +	sector = bio->bi_iter.bi_sector;
> +	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
> +		err = -EIO;
> +		goto out;
> +	}
> +
> +	BUG_ON(bio->bi_rw & REQ_DISCARD);
> +

Since discard (i.e., unmap, trim) is just a hint, it could just
be ignored rather than trigger a drastic BUG.

Related suggestion: keep track of each sector that has been 
discarded, and whether it has been written after discard.
This would tell flash-backed DRAM which sectors don't need 
to be flushed to flash on a power failure.

Related suggestion: Adding WRITE SAME support would be useful
for some software (for at least the writing zeros subset) - 
that command is not just a hint.  It would result in a memset.
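
For illustration, the non-fatal alternative for discard in
pmem_make_request() could be as small as this sketch (ignoring the hint
and completing the bio instead of BUG-ing; a WRITE SAME of zeroes would
similarly reduce to a memset on the address from pmem_lookup_pg_addr()):

	/* sketch: replaces BUG_ON(bio->bi_rw & REQ_DISCARD) */
	if (bio->bi_rw & REQ_DISCARD) {
		/* nothing to trim in directly mapped memory; just succeed */
		bio_endio(bio, 0);
		return;
	}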

> +	rw = bio_rw(bio);
> +	if (rw == READA)
> +		rw = READ;
> +
> +	bio_for_each_segment(bvec, bio, iter) {
> +		unsigned int len = bvec.bv_len;
> +
> +		BUG_ON(len > PAGE_SIZE);
> +		pmem_do_bvec(pmem, bvec.bv_page, len,
> +			    bvec.bv_offset, rw, sector);
> +		sector += len >> SECTOR_SHIFT;
> +	}
> +
> +out:
> +	bio_endio(bio, err);
> +}
> +
> +static const struct block_device_operations pmem_fops = {
> +	.owner =		THIS_MODULE,
> +};
> +
> +/* Kernel module stuff */
> +static int pmem_start_gb = CONFIG_BLK_DEV_PMEM_START;
> +module_param(pmem_start_gb, int, S_IRUGO);
> +MODULE_PARM_DESC(pmem_start_gb, "Offset in GB of where to start claiming space");
> +
> +static int pmem_size_gb = CONFIG_BLK_DEV_PMEM_SIZE;
> +module_param(pmem_size_gb,  int, S_IRUGO);
> +MODULE_PARM_DESC(pmem_size_gb,  "Total size in GB of space to claim for all disks");
> +

These modparam variable names and description strings use gb 
and GB when they mean GiB.  The storage market generally uses 
correct SI units; please don't follow JEDEC's misuse.  The
Kconfig descriptions use the right IEC binary units.

Using GiB as the base unit prevents having persistent memory 
blocks < 1 GiB or not multiples of 1 GiB.  For RAID controller 
caches, 256 MiB and 512 MiB are still used.  Smaller units
may be warranted.  Accepting the number and unit would be
the most flexible and clear: allow strings like 512MiB, 
2GiB, 1GB, and 4096B.
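
The kernel already has a helper close to this: memparse() accepts the
single-letter binary suffixes (K/M/G/T), though not the full "GiB"
spelling. A sketch of a byte-granularity size parameter built on it
(pmem_size and pmem_size_ops are invented names for illustration):

	static unsigned long long pmem_size_bytes;

	static int pmem_size_set(const char *val, const struct kernel_param *kp)
	{
		char *end;

		pmem_size_bytes = memparse(val, &end);	/* "512M", "2G", "4096", ... */
		return (*end && *end != '\n') ? -EINVAL : 0;
	}

	static const struct kernel_param_ops pmem_size_ops = {
		.set = pmem_size_set,
		.get = param_get_ullong,
	};
	module_param_cb(pmem_size, &pmem_size_ops, &pmem_size_bytes, S_IRUGO);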

> +static int pmem_count = CONFIG_BLK_DEV_PMEM_COUNT;
> +module_param(pmem_count, int, S_IRUGO);
> +MODULE_PARM_DESC(pmem_count, "Number of pmem devices to evenly split allocated space");
> +
> +static LIST_HEAD(pmem_devices);
> +static int pmem_major;
> +
> +/* FIXME: move phys_addr, virt_addr, size calls up to caller */

It is unclear what that wants to fix.

> +static struct pmem_device *pmem_alloc(int i)
> +{
> +	struct pmem_device *pmem;
> +	struct gendisk *disk;
> +	size_t disk_size = total_size / pmem_count;

There is no protection for pmem_count being 0 and causing
divide-by-zero.

There is no check for disk_size being 0.  That might
cause problems in pmem_init.
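
Both cases could be rejected up front in pmem_init() with a guard along
these lines (a sketch):

	/* reject a zero device count and zero-sized disks before dividing */
	if (pmem_count <= 0 || total_size / pmem_count == 0)
		return -EINVAL;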

> +	size_t disk_sectors = disk_size / 512;
> +
> +	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
> +	if (!pmem)
> +		goto out;
> +
> +	pmem->phys_addr = phys_addr + i * disk_size;
> +	pmem->virt_addr = virt_addr + i * disk_size;
> +	pmem->size = disk_size;
> +
> +	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
> +	if (!pmem->pmem_queue)
> +		goto out_free_dev;
> +

General comment:
This driver should be blk-mq based.  Not doing so loses
out on a number of SMP and NUMA performance improvements.
For example, blk_alloc_queue calls blk_alloc_queue_node
with NUMA_NO_NODE.  If it called blk_mq_init_queue, then
each CPU would get structures located on its node.

> +	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
> +	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);

Many storage controllers support 1 MiB IOs now, and increasing
amounts of software feel comfortable generating those sizes.  
That means this should be at least 2048.

> +	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
> +

It might be appropriate to call blk_queue_rq_timeout
and set up a very short timeout, since persistent memory is
very fast.  I'm not sure what the default is for non-blk-mq
drivers; for blk-mq, I think it is 30 seconds, which is
far too long for memory based storage.

> +	disk = pmem->pmem_disk = alloc_disk(0);
> +	if (!disk)
> +		goto out_free_queue;
> +	disk->major		= pmem_major;
> +	disk->first_minor	= 0;
> +	disk->fops		= &pmem_fops;
> +	disk->private_data	= pmem;
> +	disk->queue		= pmem->pmem_queue;
> +	disk->flags		= GENHD_FL_EXT_DEVT;
> +	sprintf(disk->disk_name, "pmem%d", i);
> +	set_capacity(disk, disk_sectors);
> +
> +	return pmem;
> +
> +out_free_queue:
> +	blk_cleanup_queue(pmem->pmem_queue);
> +out_free_dev:
> +	kfree(pmem);
> +out:
> +	return NULL;
> +}
> +
> +static void pmem_free(struct pmem_device *pmem)
> +{
> +	put_disk(pmem->pmem_disk);
> +	blk_cleanup_queue(pmem->pmem_queue);
> +	kfree(pmem);
> +}
> +
> +static void pmem_del_one(struct pmem_device *pmem)
> +{
> +	list_del(&pmem->pmem_list);
> +	del_gendisk(pmem->pmem_disk);
> +	pmem_free(pmem);
> +}
> +
> +static int __init pmem_init(void)
> +{
> +	int result, i;
> +	struct resource *res_mem;
> +	struct pmem_device *pmem, *next;
> +
> +	phys_addr  = (phys_addr_t) pmem_start_gb * 1024 * 1024 * 1024;
> +	total_size = (size_t)	   pmem_size_gb  * 1024 * 1024 * 1024;
> +

There is no check for start+size wrapping around the
address space. 
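
A sketch of such a check, placed right after the computations quoted
above (the multiplications themselves can also overflow for large GiB
counts, which a 64-bit intermediate and an explicit upper bound would
catch):

	/* reject a range that wraps the physical address space */
	if (phys_addr + total_size < phys_addr)
		return -EINVAL;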

Although modparams are a fine starting point, some way to
automatically load drivers based on e820 table and other ACPI
content will be needed.  pmem may not be the only driver
vying for use of persistent memory regions; some sort of
arbiter will be needed to implement the user's choice (the
same way each boot).

> +	res_mem = request_mem_region_exclusive(phys_addr, total_size, "pmem");
> +	if (!res_mem)
> +		return -ENOMEM;
> +

Does the call to request_mem_region_exclusive, which uses:
	#define IORESOURCE_EXCLUSIVE    0x08000000      /* Userland may not map this resource */
mean that userspace cannot mmap files and directly access
persistent memory?  There are only 3 drivers that use
request.*exclusive functions: drivers/net/ethernet/intel/e1000e
and drivers/watchdog/sp5100-tco. None of the storage drivers 
do so.

> +	virt_addr = ioremap_cache(phys_addr, total_size);
> +	if (!virt_addr) {
> +		result = -ENOMEM;
> +		goto out_release;
> +	}
> +

ioremap_cache maps the addresses as writeback cacheable.  How 
about offering all the cache attributes as modparams?
* writeback (WB): costly cache flushes to preserve data
* write combining (WC): must flush the WC buffers
* writethrough (WT): safe, good for cases where the application
  expects to read the data again soon
* uncacheable (UC): safe, good for cases where the application
  does not expect to read the data

There are some use cases for persistence across reboots
that don't care about persistence across power cycles;
for these, separate options could be provided for:
* WB
* WB+flush
* WC
* WC+flush
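
A sketch of what such a knob could look like (pmem_cache_mode is an
invented parameter name; ioremap_cache(), ioremap_wc() and
ioremap_nocache() all exist today, while a writethrough ioremap variant
does not):

	static char *pmem_cache_mode = "wb";
	module_param(pmem_cache_mode, charp, S_IRUGO);
	MODULE_PARM_DESC(pmem_cache_mode, "Mapping type: wb, wc or uc");

	...
	if (!strcmp(pmem_cache_mode, "wc"))
		virt_addr = ioremap_wc(phys_addr, total_size);
	else if (!strcmp(pmem_cache_mode, "uc"))
		virt_addr = ioremap_nocache(phys_addr, total_size);
	else	/* "wb", the current behavior */
		virt_addr = ioremap_cache(phys_addr, total_size);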


> +	result = register_blkdev(0, "pmem");
> +	if (result < 0) {
> +		result = -EIO;
> +		goto out_unmap;
> +	} else
> +		pmem_major = result;
> +

In comparison, if sd fails register_blkdev, it returns
-ENODEV.  Since no pmem device is created, -ENODEV might
be more appropriate.

The pmem_major variable might be unsafe/racy if multiple
concurrent module loads are possible.  The value is stored in 
disk->major; that's the only place it should be stored,
and pmem_exit should pull the value from that location to
later pass into unregister_blkdev before deleting its
containing structure.  This might not be an issue until
you get to 16 Ki modules.

Should this function call blk_register_region after
calling register_blkdev?

> +	for (i = 0; i < pmem_count; i++) {
> +		pmem = pmem_alloc(i);
> +		if (!pmem) {
> +			result = -ENOMEM;
> +			goto out_free;
> +		}
> +		list_add_tail(&pmem->pmem_list, &pmem_devices);
> +	}
> +
> +	list_for_each_entry(pmem, &pmem_devices, pmem_list)
> +		add_disk(pmem->pmem_disk);
> +
> +	pr_info("pmem: module loaded\n");
> +	return 0;
> +
> +out_free:
> +	list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list) {
> +		list_del(&pmem->pmem_list);
> +		pmem_free(pmem);
> +	}
> +	unregister_blkdev(pmem_major, "pmem");
> +
> +out_unmap:
> +	iounmap(virt_addr);
> +
> +out_release:
> +	release_mem_region(phys_addr, total_size);
> +	return result;
> +}
> +
> +static void __exit pmem_exit(void)
> +{
> +	struct pmem_device *pmem, *next;
> +
> +	list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list)
> +		pmem_del_one(pmem);
> +
> +	unregister_blkdev(pmem_major, "pmem");
> +	iounmap(virt_addr);
> +	release_mem_region(phys_addr, total_size);
> +
> +	pr_info("pmem: module unloaded\n");
> +}
> +
> +MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
> +MODULE_LICENSE("GPL");
> +module_init(pmem_init);
> +module_exit(pmem_exit);


---
Rob Elliott    HP Server Storage




^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 2/4] pmem: Add support for getgeo()
  2014-08-27 21:12 ` [PATCH 2/4] pmem: Add support for getgeo() Ross Zwisler
@ 2014-11-02  3:27   ` Elliott, Robert (Server Storage)
  2014-11-03 16:36     ` Wilcox, Matthew R
  0 siblings, 1 reply; 41+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-11-02  3:27 UTC (permalink / raw)
  To: Ross Zwisler, Jens Axboe, Matthew Wilcox, Boaz Harrosh,
	Nick Piggin, Kani, Toshimitsu, Knippers, Linda, linux-fsdevel,
	linux-kernel, linux-nvdimm@lists.01.org



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Ross Zwisler
> Sent: Wednesday, 27 August, 2014 4:12 PM
> To: Jens Axboe; Matthew Wilcox; Boaz Harrosh; Nick Piggin; linux-
> fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> nvdimm@lists.01.org
> Cc: Ross Zwisler
> Subject: [PATCH 2/4] pmem: Add support for getgeo()
> 
> Some programs require HDIO_GETGEO to work, which requires that we
> implement getgeo.  Based off of the work done to the NVMe driver in this
> commit:
> 
> commit 4cc09e2dc4cb ("NVMe: Add getgeo to block ops")
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  drivers/block/pmem.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
> index d366b9b..60bbe0d 100644
> --- a/drivers/block/pmem.c
> +++ b/drivers/block/pmem.c
> @@ -50,6 +50,15 @@ struct pmem_device {
>  	size_t			size;
>  };
> 
> +static int pmem_getgeo(struct block_device *bd, struct hd_geometry *geo)
> +{
> +	/* some standard values */
> +	geo->heads = 1 << 6;
> +	geo->sectors = 1 << 5;
> +	geo->cylinders = get_capacity(bd->bd_disk) >> 11;

Just stuffing the result of get_capacity into the 16-bit 
cylinders field will overflow/wrap on large capacities.
0xFFFF << 11 = 0x7FF_F800 = 64 GiB (68.7 GB)

How many programs still need these meaningless fields?
Could the bogus information be created elsewhere so
each block driver doesn't need to do this?
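
A clamped version is a one-line change; a sketch (the geometry stays as
fake as before, it just stops wrapping):

	static int pmem_getgeo(struct block_device *bd, struct hd_geometry *geo)
	{
		sector_t cylinders = get_capacity(bd->bd_disk) >> 11;

		geo->heads = 1 << 6;
		geo->sectors = 1 << 5;
		/* hd_geometry.cylinders is 16 bits wide; saturate, don't wrap */
		geo->cylinders = min_t(sector_t, cylinders, 0xffff);
		return 0;
	}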


> +	return 0;
> +}
> +
>  /*
>   * direct translation from (pmem,sector) => void*
>   * We do not require that sector be page aligned.
> @@ -176,6 +185,7 @@ out:
> 
>  static const struct block_device_operations pmem_fops = {
>  	.owner =		THIS_MODULE,
> +	.getgeo =		pmem_getgeo,
>  };
> 
>  /* Kernel module stuff */
> --


---
Rob Elliott    HP Server Storage




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/4] pmem: Initial version of persistent memory driver
  2014-11-02  3:22   ` [PATCH 1/4] " Elliott, Robert (Server Storage)
@ 2014-11-03 15:50     ` Jeff Moyer
  2014-11-03 16:19     ` Wilcox, Matthew R
  1 sibling, 0 replies; 41+ messages in thread
From: Jeff Moyer @ 2014-11-03 15:50 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Ross Zwisler, Jens Axboe, Matthew Wilcox, Boaz Harrosh,
	Nick Piggin, Kani, Toshimitsu, Knippers, Linda, linux-fsdevel,
	linux-kernel, linux-nvdimm@lists.01.org

"Elliott, Robert (Server Storage)" <Elliott@hp.com> writes:

>> +config BLK_DEV_PMEM_COUNT
>> +	int "Default number of PMEM disks"
>> +	default "4"
>
> For real use I think a default of 1 would be better.

For "real" use, it will be the number of actual DIMMs, not a config
option, I would think.  I don't take any issue with defaulting to 4.
This will go away.

>> +	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
>> +	if (!pmem->pmem_queue)
>> +		goto out_free_dev;
>> +
>
> General comment:
> This driver should be blk-mq based.  Not doing so loses
> out on a number of SMP and NUMA performance improvements.
> For example, blk_alloc_queue calls blk_alloc_queue_node
> with NUMA_NO_NODE.  If it called blk_mq_init_queue, then
> each CPU would get structures located on its node.

No need to use blk-mq just to set the numa node, as the driver could
just call blk_alloc_queue_node itself.  I'd chalk this up to another
thing that could be fixed when the driver is used for actual hardware
that describes its own proximity domain.  Further, there is no DMA
engine, here, so what is the benefit of using blk-mq?  I/O completes in
the submission path, and I don't see any locking that's preventing
parallelism.
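
That is, once the device can report its locality, the fix is a
one-liner (nid is hypothetical here, e.g. from a future firmware
proximity-domain lookup):

	/* sketch: place queue structures on the pmem device's NUMA node */
	pmem->pmem_queue = blk_alloc_queue_node(GFP_KERNEL, nid);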

>
>> +	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
>> +	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
>
> Many storage controllers support 1 MiB IOs now, and increasing
> amounts of software feel comfortable generating those sizes.  
> That means this should be at least 2048.
>
>> +	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
>> +
>
> It might be appropriate to call blk_queue_rq_timeout
> and set up a very short timeout, since persistent memory is
> very fast.  I'm not sure what the default is for non-blk-mq
> drivers; for blk-mq, I think it is 30 seconds, which is
> far too long for memory based storage.

If you timeout on a memcpy, then we've definitely got problems. :)

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 1/4] pmem: Initial version of persistent memory driver
  2014-11-02  3:22   ` [PATCH 1/4] " Elliott, Robert (Server Storage)
  2014-11-03 15:50     ` Jeff Moyer
@ 2014-11-03 16:19     ` Wilcox, Matthew R
  2014-11-04 10:37       ` Boaz Harrosh
  1 sibling, 1 reply; 41+ messages in thread
From: Wilcox, Matthew R @ 2014-11-03 16:19 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage),
	Ross Zwisler, Jens Axboe, Boaz Harrosh, Nick Piggin, Kani,
	Toshimitsu, Knippers, Linda, linux-fsdevel, linux-kernel,
	linux-nvdimm@lists.01.org, Matthew Wilcox

I really wish people wouldn't use my Exchange email address for patches.  It's completely impossible to have a meaningful conversation this way.  I've resorted to inserting extra quotation marks by hand so people can stand at least some chance of understanding what the hell's going on.

> > +config BLK_DEV_PMEM_COUNT
> > +	int "Default number of PMEM disks"
> > +	default "4"
> 
> For real use I think a default of 1 would be better.

For real use, you need at least two to run xfstests.  This whole configuration mechanism is going away soon anyway.

> > +	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
> > +	size_t offset = page_offset << PAGE_SHIFT;
> > +
> 
> Since page_offset is only used to calculate offset, they
> could be joined to avoid relying on the compiler to
> optimize it away:
> 	size_t offset = sector >> PAGE_SECTORS_SHIFT << PAGE_SHIFT;

If you insist on doing the compiler's job for it, why not:

	size_t offset = sector >> (PAGE_SECTORS_SHIFT - PAGE_SHIFT);

I actually think the original is clearer.

> > +	BUG_ON(offset >= pmem->size);
> 
> BUG_ON is severe... should this function be designed to return 
> an error instead?

No ... upper levels should have prevented an out of range I/O from being communicated down to the driver.  Finding the I/O is out of range here indicates something is horribly wrong, and BUG_ON is the appropriate response.

> > +	if (rw == READ) {
> > +		copy_from_pmem(mem + off, pmem, sector, len);
> > +		flush_dcache_page(page);
> 
> Why does READ flush the dcache after reading?  It's fine to 
> leave data in dcache.

You misunderstand the purpose of flush_dcache_page(); see Documentation/cachetlb.txt.  It is to handle D-cache aliasing between user views of pages and kernel views of pages.  So we have to flush it after reading so that userspace sees the newly written data, and flush it before writing from it, so that the kernel sees all the data that userspace has written to it.  These macros are no-ops on x86, but on CPUs like PA-RISC, they perform important work.

> If a read or write of a pmem address gets an uncorrectable
> error, the system should not crash; just report this IO is bad.  
> That reflects a difference in storage philosophy vs. memory
> philosophy.  All the rest of the data in the pmem might be
> fine, and many users prefer to recover 99% of their data
> than lose it all.
> 
> pmem_do_bvec needs a way to turn off normal DRAM "crash on
> error" handling for its accesses, and provide a return value
> indicating success or failure.

This is on our longer-term TODO list.  It requires integration with the MCE handler.

> Also, most storage devices automatically remap bad sectors
> when they are overwritten or remap them on command (e.g.,
> REASSIGN BLOCKS in SCSI), rather than leave that sector 
> bad forever.  I don't know how many filesystems and
> applications rely on that feature now, vs. maintain their
> own bad block tables; this may be something that pmem
> needs to provide.

At least ext4 supports the concept of bad blocks.  XFS doesn't (due to it being a bad idea for modern hard drives), but it's pretty trivial to add (a special inode that owns all of the bad blocks; no changes to any data structures).

> > +	BUG_ON(bio->bi_rw & REQ_DISCARD);
> > +
> 
> Since discard (i.e., unmap, trip) is just a hint, it could just 
> be ignored rather than trigger a drastic BUG.

Again, upper layers should not have sent down a DISCARD request, since the driver didn't indicate support for them.  Personally, I would not have put this line in, but it's Ross' driver and I wasn't going to argue with him.

> Related suggestion: keep track of each sector that has been 
> discarded, and whether it has been written after discard.
> This would tell flash-backed DRAM which sectors don't need 
> to be flushed to flash on a power failure.

How would we communicate that info to the DIMM?  We're not opposed to putting in DISCARD support, but it needs to provide some value.

> Related suggestion: Adding WRITE SAME support would be useful
> for some software (for at least the writing zeros subset) - 
> that command is not just a hint.  It would result in a memset.

At this point, it's no more efficient in the Linux stack than having the filesystem send down I/Os full of zeroes.  The NVMe WRITE ZEROES command would be more efficient, but that's not yet supported by the Linux stack.

> These modparam variable names and description scripts use gb 
> and GB when they mean GiB.  The storage market generally uses 
> correct SI units; please don't follow JEDEC's misuse.  The
> Kconfig descriptions use the right IEC binary units.

These are also going away.

> > +/* FIXME: move phys_addr, virt_addr, size calls up to caller */
> 
> It is unclear what that wants to fix.

Also part of the configuration interface that is being replaced.

> > +	size_t disk_size = total_size / pmem_count;
> 
> There is no protection for pmem_count being 0 and causing
> divide-by-zero.
> 
> There is no check for disk_size being 0.  That might
> cause problems in pmem_init.

Also part of the configuration interface that is being replaced.

> This driver should be blk-mq based.  Not doing so loses
> out on a number of SMP and NUMA performance improvements.
> For example, blk_alloc_queue calls blk_alloc_queue_node
> with NUMA_NO_NODE.  If it called blk_mq_init_queue, then
> each CPU would get structures located on its node.

No, it's completely pointless to use blk-mq for pmem.  There is zero advantage for a synchronous block driver to include this overhead.

> > +	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
> 
> Many storage controllers support 1 MiB IOs now, and increasing
> amounts of software feel comfortable generating those sizes.  
> That means this should be at least 2048.

Linux prohibits this being larger than 1024 today; see Documentation/block/biodoc.txt

> It might be appropriate to call blk_queue_rq_timeout
> and set up a very short timeout, since persistent memory is
> very fast.  I'm not sure what the default is for non-blk-mq
> drivers; for blk-mq, I think it is 30 seconds, which is
> far too long for memory based storage.

The Linux block layer isn't tracking these I/Os because we're a BIO driver.

> > +	phys_addr  = (phys_addr_t) pmem_start_gb * 1024 * 1024 * 1024;
> > +	total_size = (size_t)	   pmem_size_gb  * 1024 * 1024 * 1024;
> 
> There is no check for start+size wrapping around the
> address space. 

Right; will go away as part of the new configuration system.

> Although modparams are a fine starting point, some way to
> automatically load drivers based on e820 table and other ACPI
> content will be needed.  pmem may not be the only driver
> vying for use of persistent memory regions; some sort of
> arbiter will be needed to implement the user's choice (the
> same way each boot).

I thought you were being cc'd on the review of that document ... ?

> > +	res_mem = request_mem_region_exclusive(phys_addr, total_size, "pmem");

> Does the call to request_mem_region_exclusive, which uses:
>	#define IORESOURCE_EXCLUSIVE    0x08000000      /* Userland may not map this resource */
> mean that userspace cannot mmap files and directly access
> persistent memory?  There are only 3 drivers that use
> request.*exclusive functions: drivers/net/ethernet/intel/e1000e
> and drivers/watchdog/sp5100-tco. None of the storage drivers 
> do so.

It does not mean that, since I do exactly this today using DAX on top of PMEM.  It means that userspace can't go poking around in /dev/mem and map the memory directly, which isn't how anybody wants programs to work.

> > +	virt_addr = ioremap_cache(phys_addr, total_size);
> 
> ioremap_cache maps the addresses as writeback cacheable.  How 
> about offering all the cache attributes as modparams?
> * writeback (WB): costly cache flushes to preserve data
> * write combining (WC): must flush the WC buffers
> * writethrough (WT): safe, good for cases where the application
>   expects to read the data again soon
> * uncacheable (UC): safe, good for cases where the application
>   does not expect to read the data
> 
> There are some use cases for persistence across reboots
> that don't care about persistence across power cycles;
> for these, separate options could be provided for:
> * WB
> * WB+flush
> * WC
> * WC+flush

We hadn't talked about this previously ... let's look into it.

> > +	result = register_blkdev(0, "pmem");
> > +	if (result < 0) {
> > +		result = -EIO;
> > +		goto out_unmap;
> 
> In comparison, if sd fails register_blkdev, it returns
> -ENODEV.  Since no pmem device is created, -ENODEV might
> be more appropriate.

I'm not sure why we don't just return the error from register_blkdev().  Ross?

> The pmem_major variable might be unsafe/racy if multiple
> concurrent module loads are possible.

They aren't.

> Should this function call blk_register_region after
> calling register_blkdev?

I don't think that's necessary with dynamic majors, but I'm happy to learn that I'm wrong about this.

Phew, that was a long one!  Thanks for your review!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 2/4] pmem: Add support for getgeo()
  2014-11-02  3:27   ` Elliott, Robert (Server Storage)
@ 2014-11-03 16:36     ` Wilcox, Matthew R
  0 siblings, 0 replies; 41+ messages in thread
From: Wilcox, Matthew R @ 2014-11-03 16:36 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage),
	Ross Zwisler, Jens Axboe, Boaz Harrosh, Nick Piggin, Kani,
	Toshimitsu, Knippers, Linda, linux-fsdevel, linux-kernel,
	linux-nvdimm@lists.01.org, Matthew Wilcox

I agree that there should be a generic fake getgeo routine; but fixing that is a topic for a different patchset and it doesn't need to get folded into this driver submission process.

-----Original Message-----
From: Elliott, Robert (Server Storage) [mailto:Elliott@hp.com] 
Sent: Saturday, November 01, 2014 8:28 PM
To: Ross Zwisler; Jens Axboe; Wilcox, Matthew R; Boaz Harrosh; Nick Piggin; Kani, Toshimitsu; Knippers, Linda; linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; linux-nvdimm@lists.01.org
Subject: RE: [PATCH 2/4] pmem: Add support for getgeo()



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Ross Zwisler
> Sent: Wednesday, 27 August, 2014 4:12 PM
> To: Jens Axboe; Matthew Wilcox; Boaz Harrosh; Nick Piggin; linux-
> fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> nvdimm@lists.01.org
> Cc: Ross Zwisler
> Subject: [PATCH 2/4] pmem: Add support for getgeo()
> 
> Some programs require HDIO_GETGEO to work, which requires that we
> implement getgeo.  Based off of the work done to the NVMe driver in this
> commit:
> 
> commit 4cc09e2dc4cb ("NVMe: Add getgeo to block ops")
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  drivers/block/pmem.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
> index d366b9b..60bbe0d 100644
> --- a/drivers/block/pmem.c
> +++ b/drivers/block/pmem.c
> @@ -50,6 +50,15 @@ struct pmem_device {
>  	size_t			size;
>  };
> 
> +static int pmem_getgeo(struct block_device *bd, struct hd_geometry *geo)
> +{
> +	/* some standard values */
> +	geo->heads = 1 << 6;
> +	geo->sectors = 1 << 5;
> +	geo->cylinders = get_capacity(bd->bd_disk) >> 11;

Just stuffing the result of get_capacity into the 16-bit 
cylinders field will overflow/wrap on large capacities.
0xFFFF << 11 = 0x7FF_F800 = 64 GiB (68.7 GB)

How many programs still need these meaningless fields?
Could the bogus information be created elsewhere so
each block driver doesn't need to do this?


> +	return 0;
> +}
> +
>  /*
>   * direct translation from (pmem,sector) => void*
>   * We do not require that sector be page aligned.
> @@ -176,6 +185,7 @@ out:
> 
>  static const struct block_device_operations pmem_fops = {
>  	.owner =		THIS_MODULE,
> +	.getgeo =		pmem_getgeo,
>  };
> 
>  /* Kernel module stuff */
> --


---
Rob Elliott    HP Server Storage





^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/4] pmem: Initial version of persistent memory driver
  2014-11-03 16:19     ` Wilcox, Matthew R
@ 2014-11-04 10:37       ` Boaz Harrosh
  2014-11-04 16:26         ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 41+ messages in thread
From: Boaz Harrosh @ 2014-11-04 10:37 UTC (permalink / raw)
  To: Wilcox, Matthew R, Elliott, Robert (Server Storage),
	Ross Zwisler, Jens Axboe, Nick Piggin, Kani, Toshimitsu,
	Knippers, Linda, linux-fsdevel, linux-kernel,
	linux-nvdimm@lists.01.org, Matthew Wilcox

On 11/03/2014 06:19 PM, Wilcox, Matthew R wrote:
<>
>>> +config BLK_DEV_PMEM_COUNT
>>> +	int "Default number of PMEM disks"
>>> +	default "4"
>>
>> For real use I think a default of 1 would be better.
> 
> For real use, you need at least two to run xfstests.  This whole configuration mechanism is going away soon anyway.
> 
>>> +	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
>>> +	size_t offset = page_offset << PAGE_SHIFT;
>>> +
>>
>> Since page_offset is only used to calculate offset, they
>> could be joined to avoid relying on the compiler to
>> optimize it away:
>> 	size_t offset = sector >> PAGE_SECTORS_SHIFT << PAGE_SHIFT;
> 
> If you insist on doing the compiler's job for it, why not:
> 
> 	size_t offset = sector >> (PAGE_SECTORS_SHIFT - PAGE_SHIFT);
> 
> I actually think the original is clearer.
> 

I wish you guys would actually review the correct code.

In the actual, proper driver all of these issues are gone:

* config defaults are gone; multiple devices and multiple memory ranges are
  fully supported, hot-plug style.
* the shift cruft above is completely gone; it was left over from brd.c and
  its page usage.
* getgeo is fixed to do what is really wanted by the only application on
  earth that still uses it, fdisk. All other partitioners do not call it
  at all.

Why are we reviewing dead code?

Cheers
Boaz


^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 1/4] pmem: Initial version of persistent memory driver
  2014-11-04 10:37       ` Boaz Harrosh
@ 2014-11-04 16:26         ` Elliott, Robert (Server Storage)
  2014-11-04 16:41           ` Ross Zwisler
  0 siblings, 1 reply; 41+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-11-04 16:26 UTC (permalink / raw)
  To: Boaz Harrosh, Wilcox, Matthew R, Ross Zwisler, Jens Axboe,
	Nick Piggin, Kani, Toshimitsu, Knippers, Linda, linux-fsdevel,
	linux-kernel, linux-nvdimm@lists.01.org, Matthew Wilcox




> -----Original Message-----
> From: Boaz Harrosh [mailto:boaz@plexistor.com]
> Sent: Tuesday, 04 November, 2014 4:38 AM
> To: Wilcox, Matthew R; Elliott, Robert (Server Storage); Ross
> Zwisler; Jens Axboe; Nick Piggin; Kani, Toshimitsu; Knippers, Linda;
> linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> nvdimm@lists.01.org; Matthew Wilcox
> Subject: Re: [PATCH 1/4] pmem: Initial version of persistent memory
> driver
> 
> On 11/03/2014 06:19 PM, Wilcox, Matthew R wrote:
...
> 
> I wish you guys would actually review the correct code.
> 
> In the actual, proper driver all of these issues are gone:
> 
> * config defaults are gone; multiple devices and multiple memory ranges
>   are fully supported, hot-plug style.
> * the shift cruft above is completely gone; it was left over from brd.c
>   and its page usage.
> * getgeo is fixed to do what is really wanted by the only application on
>   earth that still uses it, fdisk. All other partitioners do not call it
>   at all.
> 
> Why are we reviewing dead code?
> 
> Cheers
> Boaz

Ross, what's the status of Boaz' patches (available in
git://git.open-osd.org/pmem.git)?

https://github.com/01org/prd.git doesn't include any of 
them yet.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/4] pmem: Initial version of persistent memory driver
  2014-11-04 16:26         ` Elliott, Robert (Server Storage)
@ 2014-11-04 16:41           ` Ross Zwisler
  2014-11-04 17:06             ` Boaz Harrosh
  0 siblings, 1 reply; 41+ messages in thread
From: Ross Zwisler @ 2014-11-04 16:41 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Boaz Harrosh, Wilcox, Matthew R, Jens Axboe, Nick Piggin, Kani,
	Toshimitsu, Knippers, Linda, linux-fsdevel, linux-kernel,
	linux-nvdimm@lists.01.org, Matthew Wilcox, Williams, Dan J

On Tue, 2014-11-04 at 16:26 +0000, Elliott, Robert (Server Storage)
wrote:
> 
> > -----Original Message-----
> > From: Boaz Harrosh [mailto:boaz@plexistor.com]
> > Sent: Tuesday, 04 November, 2014 4:38 AM
> > To: Wilcox, Matthew R; Elliott, Robert (Server Storage); Ross
> > Zwisler; Jens Axboe; Nick Piggin; Kani, Toshimitsu; Knippers, Linda;
> > linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> > nvdimm@lists.01.org; Matthew Wilcox
> > Subject: Re: [PATCH 1/4] pmem: Initial version of persistent memory
> > driver
> > 
> > On 11/03/2014 06:19 PM, Wilcox, Matthew R wrote:
> ...
> > 
> > I wish you guys would actually review the correct code.
> > 
> > In the actual, proper driver all of these issues are gone:
> > 
> > * config defaults are gone; multiple devices and multiple memory ranges
> >   are fully supported, hot-plug style.
> > * the shift cruft above is completely gone; it was left over from brd.c
> >   and its page usage.
> > * getgeo is fixed to do what is really wanted by the only application
> >   on earth that still uses it, fdisk. All other partitioners do not
> >   call it at all.
> > 
> > Why are we reviewing dead code?
> > 
> > Cheers
> > Boaz
> 
> Ross, what's the status of Boaz' patches (available in
> git://git.open-osd.org/pmem.git)?
> 
> https://github.com/01org/prd.git doesn't include any of 
> them yet.

Hey Robert,

The UEFI organization is in the process of defining a generic specification
for platform non-volatile memory resources.  Essentially the thought was to
wait until that was publicly available before adding any new device discovery
capabilities to pmem.

What Boaz has suggested and coded up is certainly useful, but the worry is
that it will end up being incompatible with what comes out of UEFI.  If we
stay with the dead-simple module parameter method, we will have less code to
unwind later.

Thanks,
- Ross



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/4] pmem: Initial version of persistent memory driver
  2014-11-04 16:41           ` Ross Zwisler
@ 2014-11-04 17:06             ` Boaz Harrosh
  0 siblings, 0 replies; 41+ messages in thread
From: Boaz Harrosh @ 2014-11-04 17:06 UTC (permalink / raw)
  To: Ross Zwisler, Elliott, Robert (Server Storage)
  Cc: Boaz Harrosh, Wilcox, Matthew R, Jens Axboe, Nick Piggin, Kani,
	Toshimitsu, Knippers, Linda, linux-fsdevel, linux-kernel,
	linux-nvdimm@lists.01.org, Matthew Wilcox, Williams, Dan J

On 11/04/2014 06:41 PM, Ross Zwisler wrote:
<>
> 
> The UEFI organization is in the process of defining a generic specification
> for platform non-volatile memory resources.  Essentially the thought was to
> wait until that was publicly available before adding any new device discovery
> capabilities to pmem.
> 
> What Boaz has suggested and coded up is certainly useful, but the worry is
> that it will end up being incompatible with what comes out of UEFI.  If we
> stay with the dead-simple module parameter method, we will have less code to
> unwind later.
> 

What??
What I coded up is exactly a "dead-simple module parameter method" that
will not need to be changed in the future, and that is actually sane.

Actually, it is your version of the code that needs changing, with its
global parameters, single-range support, and same-size devices.

My version of the code was specifically made probe()-ready and
dynamics-ready; that was the point of it all, including the bugs and
ugliness fixed. (I have a small patch that dynamically adds/removes
devices on the fly through the same module-param interface.)

There is not a line in all of my patches that might be affected by that
UEFI committee. (Again, it actually fixes what needed changing.) And in
addition to that committee and the new HW it will define, my system also
supports all the current NVDIMMs on the market.

I really cannot understand what you guys are saying. Have you actually
looked at the code and used it, compared to the unusable thing you
have now?
[You are saying "before adding any new device discovery capabilities",
 but I have not added any. All I did was fix and simplify the
 module-param interface, which makes me think you might not have looked
 at any of my fixes.]

> Thanks,
> - Ross

Cheers
Boaz


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2014-11-04 17:06 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-27 21:11 [PATCH 0/4] Add persistent memory driver Ross Zwisler
2014-08-27 21:11 ` [PATCH 1/4] pmem: Initial version of " Ross Zwisler
2014-09-09 16:23   ` [PATCH v2] " Boaz Harrosh
2014-09-09 16:53     ` [Linux-nvdimm] " Dan Williams
2014-09-10 13:23       ` Boaz Harrosh
2014-09-10 17:03         ` Dan Williams
2014-09-10 17:47           ` Boaz Harrosh
2014-09-10 23:01             ` Dan Williams
2014-09-11 10:45               ` Boaz Harrosh
2014-09-11 16:31                 ` Dan Williams
2014-09-14 11:18                   ` Boaz Harrosh
2014-09-16 13:54                     ` Jeff Moyer
2014-09-16 16:24                       ` Boaz Harrosh
2014-09-19 16:27                       ` Dan Williams
2014-09-21  9:27                         ` Boaz Harrosh
2014-11-02  3:22   ` [PATCH 1/4] " Elliott, Robert (Server Storage)
2014-11-03 15:50     ` Jeff Moyer
2014-11-03 16:19     ` Wilcox, Matthew R
2014-11-04 10:37       ` Boaz Harrosh
2014-11-04 16:26         ` Elliott, Robert (Server Storage)
2014-11-04 16:41           ` Ross Zwisler
2014-11-04 17:06             ` Boaz Harrosh
2014-08-27 21:12 ` [PATCH 2/4] pmem: Add support for getgeo() Ross Zwisler
2014-11-02  3:27   ` Elliott, Robert (Server Storage)
2014-11-03 16:36     ` Wilcox, Matthew R
2014-08-27 21:12 ` [PATCH 3/4] pmem: Add support for rw_page() Ross Zwisler
2014-08-27 21:12 ` [PATCH 4/4] pmem: Add support for direct_access() Ross Zwisler
2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and farther development (mm: add_persistent_memory) Boaz Harrosh
2014-09-09 15:44   ` [PATCH 4/9] SQUASHME: pmem: Support of multiple memory regions Boaz Harrosh
2014-09-09 15:45   ` [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh
2014-09-09 18:36     ` Dave Hansen
2014-09-10 10:07       ` Boaz Harrosh
2014-09-10 16:10         ` Dave Hansen
2014-09-10 17:25           ` Boaz Harrosh
2014-09-10 18:28             ` Dave Hansen
2014-09-11  8:39               ` Boaz Harrosh
2014-09-11 17:07                 ` Dave Hansen
2014-09-14  9:36                   ` Boaz Harrosh
2014-09-09 15:47   ` [PATCH 6/9] mm: New add_persistent_memory/remove_persistent_memory Boaz Harrosh
2014-09-09 15:48   ` [PATCH 7/9] pmem: Add support for page structs Boaz Harrosh
2014-09-09 15:51   ` [PATCH 9/9] pmem: KISS, remove register_blkdev Boaz Harrosh
