* [PATCH 0/4] Add persistent memory driver
@ 2014-08-27 21:11 Ross Zwisler
  2014-08-27 21:11 ` [PATCH 1/4] pmem: Initial version of " Ross Zwisler
  ` (4 more replies)
0 siblings, 5 replies; 41+ messages in thread

From: Ross Zwisler @ 2014-08-27 21:11 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox, Boaz Harrosh, Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm
Cc: Ross Zwisler

PMEM is a modified version of the Block RAM Driver, BRD. The major difference is that BRD allocates its backing store pages from the page cache, whereas PMEM uses reserved memory that has been ioremapped.

One benefit of this approach is that there is a direct mapping between filesystem block numbers and virtual addresses. In PMEM, filesystem blocks N, N+1, N+2, etc. will all be adjacent in the virtual memory space. This property allows us to set up PMD mappings (2 MiB) for DAX.

This patch set builds upon the work that Matthew Wilcox has been doing for DAX:
https://lkml.org/lkml/2014/8/27/31

Specifically, my implementation of pmem_direct_access() in patch 4/4 uses API enhancements introduced in Matthew's DAX patch v10 02/21:
https://lkml.org/lkml/2014/8/27/48

Ross Zwisler (4):
  pmem: Initial version of persistent memory driver
  pmem: Add support for getgeo()
  pmem: Add support for rw_page()
  pmem: Add support for direct_access()

 MAINTAINERS            |   6 +
 drivers/block/Kconfig  |  41 ++++++
 drivers/block/Makefile |   1 +
 drivers/block/pmem.c   | 375 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 423 insertions(+)
 create mode 100644 drivers/block/pmem.c

--
1.9.3

^ permalink raw reply [flat|nested] 41+ messages in thread
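[Editor's note: a usage sketch for the v1 driver follows. The addresses and sizes are illustrative only; the parameter names (memmap, pmem_start_gb, pmem_size_gb, pmem_count) come from the cover letter and patch 1/4.]

```shell
# 1) At boot, reserve memory so the kernel will not use it, e.g. 8 GiB
#    starting at the 4 GiB physical address (kernel command line; the
#    "$" may need escaping depending on your bootloader configuration):
#        memmap=8G$4G

# 2) Load the driver, pointing it at the reserved range.  The region is
#    split into pmem_count equal-sized disks:
modprobe pmem pmem_start_gb=4 pmem_size_gb=8 pmem_count=4

# 3) Expect /dev/pmem0 .. /dev/pmem3, each 2 GiB, usable like any block
#    device (note: volatile unless backed by a real NV-DIMM).
```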
* [PATCH 1/4] pmem: Initial version of persistent memory driver 2014-08-27 21:11 [PATCH 0/4] Add persistent memory driver Ross Zwisler @ 2014-08-27 21:11 ` Ross Zwisler 2014-09-09 16:23 ` [PATCH v2] " Boaz Harrosh 2014-11-02 3:22 ` [PATCH 1/4] " Elliott, Robert (Server Storage) 2014-08-27 21:12 ` [PATCH 2/4] pmem: Add support for getgeo() Ross Zwisler ` (3 subsequent siblings) 4 siblings, 2 replies; 41+ messages in thread From: Ross Zwisler @ 2014-08-27 21:11 UTC (permalink / raw) To: Jens Axboe, Matthew Wilcox, Boaz Harrosh, Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm Cc: Ross Zwisler PMEM is a new driver that presents a reserved range of memory as a block device. This is useful for developing with NV-DIMMs, and can be used with volatile memory as a development platform. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> --- MAINTAINERS | 6 + drivers/block/Kconfig | 41 ++++++ drivers/block/Makefile | 1 + drivers/block/pmem.c | 330 +++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 378 insertions(+) create mode 100644 drivers/block/pmem.c diff --git a/MAINTAINERS b/MAINTAINERS index 3f29153..028dc99 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7505,6 +7505,12 @@ S: Maintained F: Documentation/blockdev/ramdisk.txt F: drivers/block/brd.c +PERSISTENT MEMORY DRIVER +M: Ross Zwisler <ross.zwisler@linux.intel.com> +L: linux-nvdimm@lists.01.org +S: Supported +F: drivers/block/pmem.c + RANDOM NUMBER DRIVER M: "Theodore Ts'o" <tytso@mit.edu> S: Maintained diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index 1b8094d..ac52f5a 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -404,6 +404,47 @@ config BLK_DEV_RAM_DAX and will prevent RAM block device backing store memory from being allocated from highmem (only a problem for highmem systems). 
+config BLK_DEV_PMEM + tristate "Persistent memory block device support" + help + Saying Y here will allow you to use a contiguous range of reserved + memory as one or more block devices. Memory for PMEM should be + reserved using the "memmap" kernel parameter. + + To compile this driver as a module, choose M here: the module will be + called pmem. + + Most normal users won't need this functionality, and can thus say N + here. + +config BLK_DEV_PMEM_START + int "Offset in GiB of where to start claiming space" + default "0" + depends on BLK_DEV_PMEM + help + Starting offset in GiB that PMEM should use when claiming memory. This + memory needs to be reserved from the OS at boot time using the + "memmap" kernel parameter. + + If you provide PMEM with volatile memory it will act as a volatile + RAM disk and your data will not be persistent. + +config BLK_DEV_PMEM_COUNT + int "Default number of PMEM disks" + default "4" + depends on BLK_DEV_PMEM + help + Number of equal sized block devices that PMEM should create. + +config BLK_DEV_PMEM_SIZE + int "Size in GiB of space to claim" + depends on BLK_DEV_PMEM + default "0" + help + Amount of memory in GiB that PMEM should use when creating block + devices. This memory needs to be reserved from the OS at + boot time using the "memmap" kernel parameter. 
+ config CDROM_PKTCDVD tristate "Packet writing on CD/DVD media" depends on !UML diff --git a/drivers/block/Makefile b/drivers/block/Makefile index 02b688d..9cc6c18 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -14,6 +14,7 @@ obj-$(CONFIG_PS3_VRAM) += ps3vram.o obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o obj-$(CONFIG_AMIGA_Z2RAM) += z2ram.o obj-$(CONFIG_BLK_DEV_RAM) += brd.o +obj-$(CONFIG_BLK_DEV_PMEM) += pmem.o obj-$(CONFIG_BLK_DEV_LOOP) += loop.o obj-$(CONFIG_BLK_CPQ_DA) += cpqarray.o obj-$(CONFIG_BLK_CPQ_CISS_DA) += cciss.o diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c new file mode 100644 index 0000000..d366b9b --- /dev/null +++ b/drivers/block/pmem.c @@ -0,0 +1,330 @@ +/* + * Persistent Memory Driver + * Copyright (c) 2014, Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * This driver is heavily based on drivers/block/brd.c. + * Copyright (C) 2007 Nick Piggin + * Copyright (C) 2007 Novell Inc. 
+ */ + +#include <linux/bio.h> +#include <linux/blkdev.h> +#include <linux/fs.h> +#include <linux/hdreg.h> +#include <linux/highmem.h> +#include <linux/init.h> +#include <linux/major.h> +#include <linux/module.h> +#include <linux/moduleparam.h> +#include <linux/slab.h> +#include <linux/uaccess.h> + +#define SECTOR_SHIFT 9 +#define PAGE_SECTORS_SHIFT (PAGE_SHIFT - SECTOR_SHIFT) +#define PAGE_SECTORS (1 << PAGE_SECTORS_SHIFT) + +/* + * driver-wide physical address and total_size - one single, contiguous memory + * region that we divide up in to same-sized devices + */ +phys_addr_t phys_addr; +void *virt_addr; +size_t total_size; + +struct pmem_device { + struct request_queue *pmem_queue; + struct gendisk *pmem_disk; + struct list_head pmem_list; + + phys_addr_t phys_addr; + void *virt_addr; + size_t size; +}; + +/* + * direct translation from (pmem,sector) => void* + * We do not require that sector be page aligned. + * The return value will point to the beginning of the page containing the + * given sector, not to the sector itself. + */ +static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector) +{ + size_t page_offset = sector >> PAGE_SECTORS_SHIFT; + size_t offset = page_offset << PAGE_SHIFT; + + BUG_ON(offset >= pmem->size); + return pmem->virt_addr + offset; +} + +/* + * sector is not required to be page aligned. + * n is at most a single page, but could be less. + */ +static void copy_to_pmem(struct pmem_device *pmem, const void *src, + sector_t sector, size_t n) +{ + void *dst; + unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT; + size_t copy; + + BUG_ON(n > PAGE_SIZE); + + copy = min_t(size_t, n, PAGE_SIZE - offset); + dst = pmem_lookup_pg_addr(pmem, sector); + memcpy(dst + offset, src, copy); + + if (copy < n) { + src += copy; + sector += copy >> SECTOR_SHIFT; + copy = n - copy; + dst = pmem_lookup_pg_addr(pmem, sector); + memcpy(dst, src, copy); + } +} + +/* + * sector is not required to be page aligned. 
+ * n is at most a single page, but could be less. + */ +static void copy_from_pmem(void *dst, struct pmem_device *pmem, + sector_t sector, size_t n) +{ + void *src; + unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT; + size_t copy; + + BUG_ON(n > PAGE_SIZE); + + copy = min_t(size_t, n, PAGE_SIZE - offset); + src = pmem_lookup_pg_addr(pmem, sector); + + memcpy(dst, src + offset, copy); + + if (copy < n) { + dst += copy; + sector += copy >> SECTOR_SHIFT; + copy = n - copy; + src = pmem_lookup_pg_addr(pmem, sector); + memcpy(dst, src, copy); + } +} + +static void pmem_do_bvec(struct pmem_device *pmem, struct page *page, + unsigned int len, unsigned int off, int rw, + sector_t sector) +{ + void *mem = kmap_atomic(page); + + if (rw == READ) { + copy_from_pmem(mem + off, pmem, sector, len); + flush_dcache_page(page); + } else { + /* + * FIXME: Need more involved flushing to ensure that writes to + * NVDIMMs are actually durable before returning. + */ + flush_dcache_page(page); + copy_to_pmem(pmem, mem + off, sector, len); + } + + kunmap_atomic(mem); +} + +static void pmem_make_request(struct request_queue *q, struct bio *bio) +{ + struct block_device *bdev = bio->bi_bdev; + struct pmem_device *pmem = bdev->bd_disk->private_data; + int rw; + struct bio_vec bvec; + sector_t sector; + struct bvec_iter iter; + int err = 0; + + sector = bio->bi_iter.bi_sector; + if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) { + err = -EIO; + goto out; + } + + BUG_ON(bio->bi_rw & REQ_DISCARD); + + rw = bio_rw(bio); + if (rw == READA) + rw = READ; + + bio_for_each_segment(bvec, bio, iter) { + unsigned int len = bvec.bv_len; + + BUG_ON(len > PAGE_SIZE); + pmem_do_bvec(pmem, bvec.bv_page, len, + bvec.bv_offset, rw, sector); + sector += len >> SECTOR_SHIFT; + } + +out: + bio_endio(bio, err); +} + +static const struct block_device_operations pmem_fops = { + .owner = THIS_MODULE, +}; + +/* Kernel module stuff */ +static int pmem_start_gb = CONFIG_BLK_DEV_PMEM_START; 
+module_param(pmem_start_gb, int, S_IRUGO); +MODULE_PARM_DESC(pmem_start_gb, "Offset in GB of where to start claiming space"); + +static int pmem_size_gb = CONFIG_BLK_DEV_PMEM_SIZE; +module_param(pmem_size_gb, int, S_IRUGO); +MODULE_PARM_DESC(pmem_size_gb, "Total size in GB of space to claim for all disks"); + +static int pmem_count = CONFIG_BLK_DEV_PMEM_COUNT; +module_param(pmem_count, int, S_IRUGO); +MODULE_PARM_DESC(pmem_count, "Number of pmem devices to evenly split allocated space"); + +static LIST_HEAD(pmem_devices); +static int pmem_major; + +/* FIXME: move phys_addr, virt_addr, size calls up to caller */ +static struct pmem_device *pmem_alloc(int i) +{ + struct pmem_device *pmem; + struct gendisk *disk; + size_t disk_size = total_size / pmem_count; + size_t disk_sectors = disk_size / 512; + + pmem = kzalloc(sizeof(*pmem), GFP_KERNEL); + if (!pmem) + goto out; + + pmem->phys_addr = phys_addr + i * disk_size; + pmem->virt_addr = virt_addr + i * disk_size; + pmem->size = disk_size; + + pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL); + if (!pmem->pmem_queue) + goto out_free_dev; + + blk_queue_make_request(pmem->pmem_queue, pmem_make_request); + blk_queue_max_hw_sectors(pmem->pmem_queue, 1024); + blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY); + + disk = pmem->pmem_disk = alloc_disk(0); + if (!disk) + goto out_free_queue; + disk->major = pmem_major; + disk->first_minor = 0; + disk->fops = &pmem_fops; + disk->private_data = pmem; + disk->queue = pmem->pmem_queue; + disk->flags = GENHD_FL_EXT_DEVT; + sprintf(disk->disk_name, "pmem%d", i); + set_capacity(disk, disk_sectors); + + return pmem; + +out_free_queue: + blk_cleanup_queue(pmem->pmem_queue); +out_free_dev: + kfree(pmem); +out: + return NULL; +} + +static void pmem_free(struct pmem_device *pmem) +{ + put_disk(pmem->pmem_disk); + blk_cleanup_queue(pmem->pmem_queue); + kfree(pmem); +} + +static void pmem_del_one(struct pmem_device *pmem) +{ + list_del(&pmem->pmem_list); + del_gendisk(pmem->pmem_disk); 
+ pmem_free(pmem); +} + +static int __init pmem_init(void) +{ + int result, i; + struct resource *res_mem; + struct pmem_device *pmem, *next; + + phys_addr = (phys_addr_t) pmem_start_gb * 1024 * 1024 * 1024; + total_size = (size_t) pmem_size_gb * 1024 * 1024 * 1024; + + res_mem = request_mem_region_exclusive(phys_addr, total_size, "pmem"); + if (!res_mem) + return -ENOMEM; + + virt_addr = ioremap_cache(phys_addr, total_size); + if (!virt_addr) { + result = -ENOMEM; + goto out_release; + } + + result = register_blkdev(0, "pmem"); + if (result < 0) { + result = -EIO; + goto out_unmap; + } else + pmem_major = result; + + for (i = 0; i < pmem_count; i++) { + pmem = pmem_alloc(i); + if (!pmem) { + result = -ENOMEM; + goto out_free; + } + list_add_tail(&pmem->pmem_list, &pmem_devices); + } + + list_for_each_entry(pmem, &pmem_devices, pmem_list) + add_disk(pmem->pmem_disk); + + pr_info("pmem: module loaded\n"); + return 0; + +out_free: + list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list) { + list_del(&pmem->pmem_list); + pmem_free(pmem); + } + unregister_blkdev(pmem_major, "pmem"); + +out_unmap: + iounmap(virt_addr); + +out_release: + release_mem_region(phys_addr, total_size); + return result; +} + +static void __exit pmem_exit(void) +{ + struct pmem_device *pmem, *next; + + list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list) + pmem_del_one(pmem); + + unregister_blkdev(pmem_major, "pmem"); + iounmap(virt_addr); + release_mem_region(phys_addr, total_size); + + pr_info("pmem: module unloaded\n"); +} + +MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>"); +MODULE_LICENSE("GPL"); +module_init(pmem_init); +module_exit(pmem_exit); -- 1.9.3 ^ permalink raw reply related [flat|nested] 41+ messages in thread
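[Editor's note: the split-copy logic in copy_to_pmem()/copy_from_pmem() above is easy to mis-read. The standalone userspace C sketch below mirrors the same arithmetic, assuming 4 KiB pages; names with a trailing underscore are local stand-ins, not kernel symbols.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SHIFT       9
#define PAGE_SHIFT_        12                      /* assume 4 KiB pages */
#define PAGE_SIZE_         (1u << PAGE_SHIFT_)
#define PAGE_SECTORS_SHIFT (PAGE_SHIFT_ - SECTOR_SHIFT)
#define PAGE_SECTORS       (1u << PAGE_SECTORS_SHIFT)

/* Mirror of pmem_lookup_pg_addr(): byte offset of the start of the page
 * containing the given sector (not of the sector itself). */
static size_t page_base(unsigned long long sector)
{
	return (size_t)(sector >> PAGE_SECTORS_SHIFT) << PAGE_SHIFT_;
}

/* Byte offset of the sector's data within the region: page base plus the
 * sector's offset within its page.  Always equals sector * 512. */
static size_t sector_to_byte(unsigned long long sector)
{
	return page_base(sector) +
	       ((sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT);
}

/* Mirror of copy_to_pmem(): an up-to-one-page write that may straddle a
 * page boundary is split into at most two memcpy() calls. */
static void copy_to(char *base, const char *src, unsigned long long sector,
		    size_t n)
{
	unsigned int off = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
	size_t copy = n < PAGE_SIZE_ - off ? n : PAGE_SIZE_ - off;

	memcpy(base + page_base(sector) + off, src, copy);
	if (copy < n) {
		sector += copy >> SECTOR_SHIFT;          /* next page */
		memcpy(base + page_base(sector), src + copy, n - copy);
	}
}
```

A write of 1024 bytes at sector 7, for example, lands 512 bytes at the end of page 0 and 512 bytes at the start of page 1, which is why two memcpy() calls are needed.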
* [PATCH v2] pmem: Initial version of persistent memory driver
  2014-08-27 21:11 ` [PATCH 1/4] pmem: Initial version of " Ross Zwisler
@ 2014-09-09 16:23 ` Boaz Harrosh
  2014-09-09 16:53   ` [Linux-nvdimm] " Dan Williams
  2014-11-02  3:22 ` [PATCH 1/4] " Elliott, Robert (Server Storage)
1 sibling, 1 reply; 41+ messages in thread

From: Boaz Harrosh @ 2014-09-09 16:23 UTC (permalink / raw)
To: Ross Zwisler, Jens Axboe, Matthew Wilcox, Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm

From: Ross Zwisler <ross.zwisler@linux.intel.com>

PMEM is a new driver that presents a reserved range of memory as a block device. This is useful for developing with NV-DIMMs, and can be used with volatile memory as a development platform.

[boaz]
SQUASHME: pmem: Remove unused #include headers
SQUASHME: pmem: Request from fdisk 4k alignment
SQUASHME: pmem: Let each device manage private memory region
SQUASHME: pmem: Support of multiple memory regions

The API to the pmem module is a single string parameter named "map" of the form:
	map=mapS[,mapS...]
where mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG] (nn=size, ss=offset).

This is just like the kernel command line "map" and "memmap" parameters, so anything you specified at grub can be copied and pasted here.

The "@" form is exactly the same as the "$" form, except that at a bash prompt the "$" must be escaped as \$, so the '@' character is also supported for convenience.

For each specified mapS a device will be created.

[This is the accumulated version of the driver developed by multiple programmers.
To see the real history of these patches see: git://git.open-osd.org/pmem.git https://github.com/01org/prd This patch is based on: [673302b] pmem: KISS, remove register_blkdev ] TODO: Add Documentation/blockdev/pmem.txt Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> --- MAINTAINERS | 6 + drivers/block/Kconfig | 13 ++ drivers/block/Makefile | 1 + drivers/block/pmem.c | 385 +++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 405 insertions(+) create mode 100644 drivers/block/pmem.c diff --git a/MAINTAINERS b/MAINTAINERS index 5e7866a..2724ede 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7504,6 +7504,12 @@ S: Maintained F: Documentation/blockdev/ramdisk.txt F: drivers/block/brd.c +PERSISTENT MEMORY DRIVER +M: Ross Zwisler <ross.zwisler@linux.intel.com> +L: linux-nvdimm@lists.01.org +S: Supported +F: drivers/block/pmem.c + RANDOM NUMBER DRIVER M: "Theodore Ts'o" <tytso@mit.edu> S: Maintained diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index 014a1cf..5da8cbf 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -403,6 +403,19 @@ config BLK_DEV_XIP will prevent RAM block device backing store memory from being allocated from highmem (only a problem for highmem systems). +config BLK_DEV_PMEM + tristate "Persistent memory block device support" + help + Saying Y here will allow you to use a contiguous range of reserved + memory as one or more block devices. Memory for PMEM should be + reserved using the "memmap" kernel parameter. + + To compile this driver as a module, choose M here: the module will be + called pmem. + + Most normal users won't need this functionality, and can thus say N + here. 
+ config CDROM_PKTCDVD tristate "Packet writing on CD/DVD media" depends on !UML diff --git a/drivers/block/Makefile b/drivers/block/Makefile index 02b688d..9cc6c18 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -14,6 +14,7 @@ obj-$(CONFIG_PS3_VRAM) += ps3vram.o obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o obj-$(CONFIG_AMIGA_Z2RAM) += z2ram.o obj-$(CONFIG_BLK_DEV_RAM) += brd.o +obj-$(CONFIG_BLK_DEV_PMEM) += pmem.o obj-$(CONFIG_BLK_DEV_LOOP) += loop.o obj-$(CONFIG_BLK_CPQ_DA) += cpqarray.o obj-$(CONFIG_BLK_CPQ_CISS_DA) += cciss.o diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c new file mode 100644 index 0000000..0fcda58 --- /dev/null +++ b/drivers/block/pmem.c @@ -0,0 +1,385 @@ +/* + * Persistent Memory Driver + * Copyright (c) 2014, Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * This driver is heavily based on drivers/block/brd.c. + * Copyright (C) 2007 Nick Piggin + * Copyright (C) 2007 Novell Inc. 
+ */ + +#include <linux/blkdev.h> +#include <linux/hdreg.h> +#include <linux/init.h> +#include <linux/module.h> +#include <linux/moduleparam.h> +#include <linux/slab.h> + +#define SECTOR_SHIFT 9 +#define PAGE_SECTORS_SHIFT (PAGE_SHIFT - SECTOR_SHIFT) +#define PAGE_SECTORS (1 << PAGE_SECTORS_SHIFT) + +struct pmem_device { + struct request_queue *pmem_queue; + struct gendisk *pmem_disk; + struct list_head pmem_list; + + /* One contiguous memory region per device */ + phys_addr_t phys_addr; + void *virt_addr; + size_t size; +}; + +/* + * direct translation from (pmem,sector) => void* + * We do not require that sector be page aligned. + * The return value will point to the beginning of the page containing the + * given sector, not to the sector itself. + */ +static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector) +{ + size_t page_offset = sector >> PAGE_SECTORS_SHIFT; + size_t offset = page_offset << PAGE_SHIFT; + + BUG_ON(offset >= pmem->size); + return pmem->virt_addr + offset; +} + +/* + * sector is not required to be page aligned. + * n is at most a single page, but could be less. + */ +static void copy_to_pmem(struct pmem_device *pmem, const void *src, + sector_t sector, size_t n) +{ + void *dst; + unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT; + size_t copy; + + BUG_ON(n > PAGE_SIZE); + + copy = min_t(size_t, n, PAGE_SIZE - offset); + dst = pmem_lookup_pg_addr(pmem, sector); + memcpy(dst + offset, src, copy); + + if (copy < n) { + src += copy; + sector += copy >> SECTOR_SHIFT; + copy = n - copy; + dst = pmem_lookup_pg_addr(pmem, sector); + memcpy(dst, src, copy); + } +} + +/* + * sector is not required to be page aligned. + * n is at most a single page, but could be less. 
+ */ +static void copy_from_pmem(void *dst, struct pmem_device *pmem, + sector_t sector, size_t n) +{ + void *src; + unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT; + size_t copy; + + BUG_ON(n > PAGE_SIZE); + + copy = min_t(size_t, n, PAGE_SIZE - offset); + src = pmem_lookup_pg_addr(pmem, sector); + + memcpy(dst, src + offset, copy); + + if (copy < n) { + dst += copy; + sector += copy >> SECTOR_SHIFT; + copy = n - copy; + src = pmem_lookup_pg_addr(pmem, sector); + memcpy(dst, src, copy); + } +} + +static void pmem_do_bvec(struct pmem_device *pmem, struct page *page, + unsigned int len, unsigned int off, int rw, + sector_t sector) +{ + void *mem = kmap_atomic(page); + + if (rw == READ) { + copy_from_pmem(mem + off, pmem, sector, len); + flush_dcache_page(page); + } else { + /* + * FIXME: Need more involved flushing to ensure that writes to + * NVDIMMs are actually durable before returning. + */ + flush_dcache_page(page); + copy_to_pmem(pmem, mem + off, sector, len); + } + + kunmap_atomic(mem); +} + +static void pmem_make_request(struct request_queue *q, struct bio *bio) +{ + struct block_device *bdev = bio->bi_bdev; + struct pmem_device *pmem = bdev->bd_disk->private_data; + int rw; + struct bio_vec bvec; + sector_t sector; + struct bvec_iter iter; + int err = 0; + + sector = bio->bi_iter.bi_sector; + if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) { + err = -EIO; + goto out; + } + + BUG_ON(bio->bi_rw & REQ_DISCARD); + + rw = bio_rw(bio); + if (rw == READA) + rw = READ; + + bio_for_each_segment(bvec, bio, iter) { + unsigned int len = bvec.bv_len; + + BUG_ON(len > PAGE_SIZE); + pmem_do_bvec(pmem, bvec.bv_page, len, + bvec.bv_offset, rw, sector); + sector += len >> SECTOR_SHIFT; + } + +out: + bio_endio(bio, err); +} + +static const struct block_device_operations pmem_fops = { + .owner = THIS_MODULE, +}; + +/* Kernel module stuff */ +static char *map; +module_param(map, charp, S_IRUGO); +MODULE_PARM_DESC(map, + "pmem device mapping: 
map=mapS[,mapS...] where:\n" + "mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset."); + +static LIST_HEAD(pmem_devices); +static int pmem_major; + +/* pmem->phys_addr and pmem->size need to be set. + * Will then set virt_addr if successful. + */ +int pmem_mapmem(struct pmem_device *pmem) +{ + struct resource *res_mem; + int err; + + res_mem = request_mem_region_exclusive(pmem->phys_addr, pmem->size, + "pmem"); + if (!res_mem) { + pr_warn("pmem: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n", + pmem->phys_addr, pmem->size); + return -EINVAL; + } + + pmem->virt_addr = ioremap_cache(pmem->phys_addr, pmem->size); + if (unlikely(!pmem->virt_addr)) { + err = -ENXIO; + goto out_release; + } + return 0; + +out_release: + release_mem_region(pmem->phys_addr, pmem->size); + return err; +} + +void pmem_unmapmem(struct pmem_device *pmem) +{ + if (unlikely(!pmem->virt_addr)) + return; + + iounmap(pmem->virt_addr); + release_mem_region(pmem->phys_addr, pmem->size); + pmem->virt_addr = NULL; +} + +static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size, + int i) +{ + struct pmem_device *pmem; + struct gendisk *disk; + int err; + + if (unlikely((phys_addr & ~PAGE_MASK) || (disk_size & ~PAGE_MASK))) { + pr_err("phys_addr=0x%llx disk_size=0x%zx must be 4k aligned\n", + phys_addr, disk_size); + err = -EINVAL; + goto out; + } + + pmem = kzalloc(sizeof(*pmem), GFP_KERNEL); + if (unlikely(!pmem)) { + err = -ENOMEM; + goto out; + } + + pmem->phys_addr = phys_addr; + pmem->size = disk_size; + + err = pmem_mapmem(pmem); + if (unlikely(err)) + goto out_free_dev; + + pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL); + if (unlikely(!pmem->pmem_queue)) { + err = -ENOMEM; + goto out_unmap; + } + + blk_queue_make_request(pmem->pmem_queue, pmem_make_request); + blk_queue_max_hw_sectors(pmem->pmem_queue, 1024); + blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY); + + /* This is so fdisk will align partitions on 4k, because of + * 
direct_access API needing 4k alignment, returning a PFN + */ + blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE); + pmem->pmem_queue->limits.io_min = 512; /* Don't use the accessor */ + + disk = alloc_disk(0); + if (unlikely(!disk)) { + err = -ENOMEM; + goto out_free_queue; + } + + disk->major = pmem_major; + disk->first_minor = 0; + disk->fops = &pmem_fops; + disk->private_data = pmem; + disk->queue = pmem->pmem_queue; + disk->flags = GENHD_FL_EXT_DEVT; + sprintf(disk->disk_name, "pmem%d", i); + set_capacity(disk, disk_size >> SECTOR_SHIFT); + pmem->pmem_disk = disk; + + return pmem; + +out_free_queue: + blk_cleanup_queue(pmem->pmem_queue); +out_unmap: + pmem_unmapmem(pmem); +out_free_dev: + kfree(pmem); +out: + return ERR_PTR(err); +} + +static void pmem_free(struct pmem_device *pmem) +{ + put_disk(pmem->pmem_disk); + blk_cleanup_queue(pmem->pmem_queue); + pmem_unmapmem(pmem); + kfree(pmem); +} + +static void pmem_del_one(struct pmem_device *pmem) +{ + list_del(&pmem->pmem_list); + del_gendisk(pmem->pmem_disk); + pmem_free(pmem); +} + +static int pmem_parse_map_one(char *map, phys_addr_t *start, size_t *size) +{ + char *p = map; + + *size = (size_t)memparse(p, &p); + if ((p == map) || ((*p != '$') && (*p != '@'))) + return -EINVAL; + + if (!*(++p)) + return -EINVAL; + + *start = (phys_addr_t)memparse(p, &p); + + return *p == '\0' ? 
0 : -EINVAL; +} + +static int __init pmem_init(void) +{ + int result, i; + struct pmem_device *pmem, *next; + char *p, *pmem_map = map; + + if (!pmem_map) { + pr_err("pmem: must specify map=nn@ss parameter.\n"); + return -EINVAL; + } + + result = register_blkdev(0, "pmem"); + if (result < 0) + return -EIO; + else + pmem_major = result; + + i = 0; + while ((p = strsep(&pmem_map, ",")) != NULL) { + phys_addr_t phys_addr; + size_t disk_size; + + if (!*p) + continue; + result = pmem_parse_map_one(p, &phys_addr, &disk_size); + if (result) + goto out_free; + pmem = pmem_alloc(phys_addr, disk_size, i); + if (IS_ERR(pmem)) { + result = PTR_ERR(pmem); + goto out_free; + } + list_add_tail(&pmem->pmem_list, &pmem_devices); + ++i; + } + + list_for_each_entry(pmem, &pmem_devices, pmem_list) + add_disk(pmem->pmem_disk); + + pr_info("pmem: module loaded\n"); + return 0; + +out_free: + list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list) { + list_del(&pmem->pmem_list); + pmem_free(pmem); + } + unregister_blkdev(pmem_major, "pmem"); + + return result; +} + +static void __exit pmem_exit(void) +{ + struct pmem_device *pmem, *next; + + list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list) + pmem_del_one(pmem); + + unregister_blkdev(pmem_major, "pmem"); + pr_info("pmem: module unloaded\n"); +} + +MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>"); +MODULE_LICENSE("GPL"); +module_init(pmem_init); +module_exit(pmem_exit); -- 1.9.3 ^ permalink raw reply related [flat|nested] 41+ messages in thread
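[Editor's note: the map= grammar accepted by pmem_parse_map_one() above can be sketched in userspace C as follows. memparse_ is a simplified stand-in for the kernel's memparse() and handles only the K/M/G suffixes used here.]

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal stand-in for the kernel's memparse(): parse a number with an
 * optional K/M/G suffix, advancing *endp past what was consumed. */
static unsigned long long memparse_(const char *s, char **endp)
{
	unsigned long long v = strtoull(s, endp, 0);

	switch (**endp) {
	case 'G': case 'g': v <<= 30; (*endp)++; break;
	case 'M': case 'm': v <<= 20; (*endp)++; break;
	case 'K': case 'k': v <<= 10; (*endp)++; break;
	}
	return v;
}

/* Mirror of pmem_parse_map_one(): accepts "nn[KMG]$ss[KMG]" or
 * "nn[KMG]@ss[KMG]", where nn is the size and ss the start offset.
 * Returns 0 on success, -1 on a malformed mapping. */
static int parse_map_one(const char *map, unsigned long long *start,
			 unsigned long long *size)
{
	char *p = (char *)map;

	*size = memparse_(p, &p);
	if (p == map || (*p != '$' && *p != '@'))
		return -1;		/* no size, or missing separator */
	if (!*++p)
		return -1;		/* separator but no offset */
	*start = memparse_(p, &p);
	return *p == '\0' ? 0 : -1;	/* trailing garbage is rejected */
}
```

So map=4G@8G creates one 4 GiB device backed by physical memory starting at the 8 GiB mark, and a malformed string such as "4G" (no offset) is rejected rather than silently defaulted.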
* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver 2014-09-09 16:23 ` [PATCH v2] " Boaz Harrosh @ 2014-09-09 16:53 ` Dan Williams 2014-09-10 13:23 ` Boaz Harrosh 0 siblings, 1 reply; 41+ messages in thread From: Dan Williams @ 2014-09-09 16:53 UTC (permalink / raw) To: Boaz Harrosh Cc: Ross Zwisler, Jens Axboe, Matthew Wilcox, Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm On Tue, Sep 9, 2014 at 9:23 AM, Boaz Harrosh <boaz@plexistor.com> wrote: > From: Ross Zwisler <ross.zwisler@linux.intel.com> > > PMEM is a new driver that presents a reserved range of memory as a > block device. This is useful for developing with NV-DIMMs, and > can be used with volatile memory as a development platform. > > [boaz] > SQUASHME: pmem: Remove unused #include headers > SQUASHME: pmem: Request from fdisk 4k alignment > SQUASHME: pmem: Let each device manage private memory region > SQUASHME: pmem: Support of multiple memory regions > > The API to pmem module a single string parameter named "map" > of the form: > map=mapS[,mapS...] > > where mapS=nn[KMG]$ss[KMG], > or mapS=nn[KMG]@ss[KMG], > > nn=size, ss=offset > > Just like the Kernel command line map && memmap parameters, > so anything you did at grub just copy/paste to here. > > The "@" form is exactly the same as the "$" form only that > at bash prompt we need to escape the "$" with \$ so also > support the '@' char for convenience. Hmm this looks like a "ACPI/DeviceTree-over-kernel-command-line" description language. I understand that this is borrowed from the memmap= precedent, but, if I'm not mistaken, that's really more for debug rather than a permanent platform-device descriptor. <strawman> Since this looks like firmware why not go ahead and use request_firmware() to request a pmem range descriptor blob</strawman>? 
Given you can compile such a blob into a kernel image or provide it in an initrd I think it makes deployment more straightforward, also the descriptor format can be extended going forward whereas the command line approach mandates ever increasingly complicated strings. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver 2014-09-09 16:53 ` [Linux-nvdimm] " Dan Williams @ 2014-09-10 13:23 ` Boaz Harrosh 2014-09-10 17:03 ` Dan Williams 0 siblings, 1 reply; 41+ messages in thread From: Boaz Harrosh @ 2014-09-10 13:23 UTC (permalink / raw) To: Dan Williams, Boaz Harrosh Cc: Ross Zwisler, Jens Axboe, Matthew Wilcox, Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm On 09/09/2014 07:53 PM, Dan Williams wrote: > On Tue, Sep 9, 2014 at 9:23 AM, Boaz Harrosh <boaz@plexistor.com> wrote: <> > > Hmm this looks like a "ACPI/DeviceTree-over-kernel-command-line" > description language. I understand that this is borrowed from the > memmap= precedent, but, if I'm not mistaken, that's really more for > debug rather than a permanent platform-device descriptor. So what is bad about that ? > <strawman> > Since this looks like firmware why not go ahead and use > request_firmware() to request a pmem range descriptor blob </strawman>? What? God no! How is this a firmware at all. No this is all BUS info, it is "where on the bus this device's resource was allocated" Firmware is static compile time thing common to all of my devices but this here is where in the bus this was stuck in. What is your suggestion that I compile a kernel and "make install" it (initrd) for every DIMM I insert ? since when? and what tool will tell me what to put there? > Given you can compile such a blob into a kernel image or provide it > in an initrd I think it makes deployment more straightforward, also > the descriptor format can be extended going forward What ? really Dan I think you got confused. The nn@ss thing is just: "what connector I stuck my DIMM disk in (and what other empty slots I have)" Would you have me compile and install my kernel every time my sdX number change, or my sata channel moved. I do not think so. > whereas the > command line approach mandates ever increasingly complicated strings. 
This is the first API, I intend, there is even a TODO in the patchset, to also have a dynamic sysfs add/remove/grow API for on the fly changes. All these can be easily driven by a simple udev rule to plug and play them. Usually in the Kernel, buses do not directly load devices drivers. The bus driver sends an "event" with a new discovered resource, then user-mode udev rule will load an associated driver, which will scan for its device or receive its identification somewhere, that can be the sysfs interface above, or even commandline. Farther down the road we might want a Kernel API, through to the DIMM-manager when each device's ID is Identified to managed volumes of DIMMs. (And what happens to them if they move in physical addressing), but even with DIMM-manager I would have gone through a udev-event and pmem-probe because, this way user mode can add a chain of commands associated with new insertions. So since we have a DIMM-manager event going on why not have the udev rule load pmem as well and not have any API. But All this is way down the line. Regardless of which, commandline API of pmem will always stay as this is for the pmem emulation as we use it now, in accord with memmap=. (And it works very nice in our lab with DDR3 NvDIMMs that need an memmap= as well) We please need to start somewhere, no? Thanks Boaz ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver 2014-09-10 13:23 ` Boaz Harrosh @ 2014-09-10 17:03 ` Dan Williams 2014-09-10 17:47 ` Boaz Harrosh 0 siblings, 1 reply; 41+ messages in thread From: Dan Williams @ 2014-09-10 17:03 UTC (permalink / raw) To: Boaz Harrosh Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm Hi Boaz, On Wed, Sep 10, 2014 at 6:23 AM, Boaz Harrosh <openosd@gmail.com> wrote: > On 09/09/2014 07:53 PM, Dan Williams wrote: >> On Tue, Sep 9, 2014 at 9:23 AM, Boaz Harrosh <boaz@plexistor.com> wrote: > <> >> >> Hmm this looks like a "ACPI/DeviceTree-over-kernel-command-line" >> description language. I understand that this is borrowed from the >> memmap= precedent, but, if I'm not mistaken, that's really more for >> debug rather than a permanent platform-device descriptor. > > So what is bad about that ? > >> <strawman> >> Since this looks like firmware why not go ahead and use >> request_firmware() to request a pmem range descriptor blob > </strawman>? > > What? God no! How is this a firmware at all. No this is all BUS info, it is > "where on the bus this device's resource was allocated" > > Firmware is static compile time thing common to all of my devices > but this here is where in the bus this was stuck in. What is your > suggestion that I compile a kernel and "make install" it (initrd) > for every DIMM I insert ? since when? and what tool will tell me > what to put there? > >> Given you can compile such a blob into a kernel image or provide it >> in an initrd I think it makes deployment more straightforward, also >> the descriptor format can be extended going forward > > What ? really Dan I think you got confused. The nn@ss thing is just: > "what connector I stuck my DIMM disk in (and what other empty slots I have)" > > Would you have me compile and install my kernel every time my sdX > number change, or my sata channel moved. I do not think so. 
> >> whereas the >> command line approach mandates ever increasingly complicated strings. > > This is the first API, I intend, there is even a TODO in the patchset, to > also have a dynamic sysfs add/remove/grow API for on the fly changes. > All these can be easily driven by a simple udev rule to plug and play them. > > Usually in the Kernel, buses do not directly load devices drivers. The > bus driver sends an "event" with a new discovered resource, then user-mode > udev rule will load an associated driver, which will scan for its device or > receive its identification somewhere, that can be the sysfs interface above, > or even commandline. > > Farther down the road we might want a Kernel API, through to the DIMM-manager > when each device's ID is Identified to managed volumes of DIMMs. (And what > happens to them if they move in physical addressing), but even with DIMM-manager > I would have gone through a udev-event and pmem-probe because, this way user > mode can add a chain of commands associated with new insertions. So since we > have a DIMM-manager event going on why not have the udev rule load pmem as well > and not have any API. But All this is way down the line. > > Regardless of which, commandline API of pmem will always stay as this is for > the pmem emulation as we use it now, in accord with memmap=. > (And it works very nice in our lab with DDR3 NvDIMMs that need an memmap= as well) > > We please need to start somewhere, no? Sure, but you used the operative term "start", as in you already expect to enhance this capability down the road, right? It's fine to dismiss this request_firmware() based approach, but don't mis-characterize it in the process. With regards to describing device boundaries, a bus-descriptor-blob handed to the kernel is a superset of the capability provided by the kernel command line. It can be injected statically at compile time, or dynamically loaded from the initrd or the rootfs. 
It has the added benefit of being flexible to change whereas the kernel command line is a more permanent contract that we will need to maintain compatibility with in perpetuity. If you already see this bus description as a "starting" point, then I think we need an interface that is more amenable to ongoing change, that's not the kernel-command-line. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver 2014-09-10 17:03 ` Dan Williams @ 2014-09-10 17:47 ` Boaz Harrosh 2014-09-10 23:01 ` Dan Williams 0 siblings, 1 reply; 41+ messages in thread From: Boaz Harrosh @ 2014-09-10 17:47 UTC (permalink / raw) To: Dan Williams, Boaz Harrosh Cc: Ross Zwisler, Jens Axboe, Matthew Wilcox, Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm On 09/10/2014 08:03 PM, Dan Williams wrote: > Hi Boaz, > <> >> We please need to start somewhere, no? > > Sure, but you used the operative term "start", as in you already > expect to enhance this capability down the road, right? > Yes > It's fine to dismiss this request_firmware() based approach, but don't > mis-characterize it in the process. With regards to describing device > boundaries, a bus-descriptor-blob handed to the kernel is a superset > of the capability provided by the kernel command line. It can be > injected statically at compile time, or dynamically loaded from the > initrd or the rootfs. It has the added benefit of being flexible to > change whereas the kernel command line is a more permanent contract > that we will need to maintain compatibility with in perpetuity. > initrd or rootfs means for me "make install". But I want my fedora to never make or install. Pre-compiled binary blobs including rootfs and it needs to work. > If you already see this bus description as a "starting" point, then I > think we need an interface that is more amenable to ongoing change, > that's not the kernel-command-line. > module-command-line. a module can be loaded via udev and/or module param can be changed dynamically on the fly. And also be specified via kernel-command-line. So it is much less permanent contract API than "rootfs" And yes, I intend to add more interfaces. And No! I do not intend to ever extend this module-param interface, that I can see. This one is that, which it is right now. Later a sysfs/ objects will enable dynamic management of devices. 
So both: initial device list on load - more devices or removal on the fly, unload all on unload. This is my plan. So right now I do not see this map= need ever change in the future. Only more interfaces added in (the very near) future. Thanks Boaz ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver 2014-09-10 17:47 ` Boaz Harrosh @ 2014-09-10 23:01 ` Dan Williams 2014-09-11 10:45 ` Boaz Harrosh 0 siblings, 1 reply; 41+ messages in thread From: Dan Williams @ 2014-09-10 23:01 UTC (permalink / raw) To: Boaz Harrosh Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm On Wed, Sep 10, 2014 at 10:47 AM, Boaz Harrosh <boaz@plexistor.com> wrote: > On 09/10/2014 08:03 PM, Dan Williams wrote: >> Hi Boaz, >> > <> >>> We please need to start somewhere, no? >> >> Sure, but you used the operative term "start", as in you already >> expect to enhance this capability down the road, right? >> > > Yes > >> It's fine to dismiss this request_firmware() based approach, but don't >> mis-characterize it in the process. With regards to describing device >> boundaries, a bus-descriptor-blob handed to the kernel is a superset >> of the capability provided by the kernel command line. It can be >> injected statically at compile time, or dynamically loaded from the >> initrd or the rootfs. It has the added benefit of being flexible to >> change whereas the kernel command line is a more permanent contract >> that we will need to maintain compatibility with in perpetuity. >> > > initrd or rootfs means for me "make install". But I want my fedora > to never make or install. Pre-compiled binary blobs including rootfs and > it needs to work. > >> If you already see this bus description as a "starting" point, then I >> think we need an interface that is more amenable to ongoing change, >> that's not the kernel-command-line. >> > > module-command-line. a module can be loaded via udev and/or module param > can be changed dynamically on the fly. And also be specified via > kernel-command-line. So it is much less permanent contract API than > "rootfs" > > And yes, I intend to add more interfaces. And No! 
I do not intend to > ever extend this module-param interface, that I can see. This one is > that, which it is right now. Later a sysfs/ objects will enable dynamic > management of devices. So both: initial device list on load - more devices > or removal on the fly, unload all on unload. This is my plan. So right > now I do not see this map= need ever change in the future. Only more > interfaces added in (the very near) future. > Imagine you want to deploy a policy like "use half of the memory provided by the dimm in slot3, i.e. the only one with a battery". That sort of thing gets unwieldy in a command line string compared to a description table format that we can update at will. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver 2014-09-10 23:01 ` Dan Williams @ 2014-09-11 10:45 ` Boaz Harrosh 2014-09-11 16:31 ` Dan Williams 0 siblings, 1 reply; 41+ messages in thread From: Boaz Harrosh @ 2014-09-11 10:45 UTC (permalink / raw) To: Dan Williams Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm On 09/11/2014 02:01 AM, Dan Williams wrote: <> > > Imagine you want to deploy a policy like "use half of the memory > provided by the dimm in slot3, i.e. the only one with a battery". > That sort of thing gets unwieldy in a command line string compared to > a description table format that we can update at will. > Actually it is easy to do, why? I do this here in the lab all the time. with a "command line" with this code you see here. [DDR3 NvDIMM which means I need memmap=16G\$32G on Kernel command line. Then: modprobe pmem map=8G@32G,4G@44G,... and so on Just as a simple example where 2/4-3/4 addresses are not used. You can have holes in the middle or what ever you want. This here is just a table in comma-separated format. If we need like flags in future we can extend the format to nn@ss:flags, but I do no have any 3rd column yet] And again I have in the pipe a dynamic interface added on top of the module param one. So it will all be there soon. Without reverting the old one. Thanks Boaz ^ permalink raw reply [flat|nested] 41+ messages in thread
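For readers following along: the map= string in the example above is just a comma-separated list of size@start pairs with binary-unit suffixes. A minimal userspace sketch of such a parser (hypothetical helper names; in-kernel code would more likely reuse the existing memparse() helper for the suffix handling):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse "16G", "512M", "4096" into bytes (a small subset of memparse()). */
static uint64_t parse_size(const char *s, const char **end)
{
	char *e;
	uint64_t v = strtoull(s, &e, 0);

	switch (*e) {
	case 'G': case 'g': v <<= 30; e++; break;
	case 'M': case 'm': v <<= 20; e++; break;
	case 'K': case 'k': v <<= 10; e++; break;
	}
	if (end)
		*end = e;
	return v;
}

/* Parse one "nn@ss" entry of the map= list; returns 0 on success. */
static int parse_map_entry(const char *s, uint64_t *size, uint64_t *start)
{
	const char *e;

	*size = parse_size(s, &e);
	if (*e != '@')
		return -1;	/* malformed entry: no '@' separator */
	*start = parse_size(e + 1, &e);
	/* entry must end here or at the next comma */
	return (*e == '\0' || *e == ',') ? 0 : -1;
}
```

For the `map=8G@32G,4G@44G` example above, the first entry yields size 8 GiB and start 32 GiB; a future `nn@ss:flags` third column would extend this parser, not the callers.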
* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver 2014-09-11 10:45 ` Boaz Harrosh @ 2014-09-11 16:31 ` Dan Williams 2014-09-14 11:18 ` Boaz Harrosh 0 siblings, 1 reply; 41+ messages in thread From: Dan Williams @ 2014-09-11 16:31 UTC (permalink / raw) To: Boaz Harrosh Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-kernel, linux-nvdimm On Thu, Sep 11, 2014 at 3:45 AM, Boaz Harrosh <boaz@plexistor.com> wrote: > On 09/11/2014 02:01 AM, Dan Williams wrote: > <> >> >> Imagine you want to deploy a policy like "use half of the memory >> provided by the dimm in slot3, i.e. the only one with a battery". >> That sort of thing gets unwieldy in a command line string compared to >> a description table format that we can update at will. >> > > Actually it is easy to do, why? I do this here in the lab all the time. > with a "command line" with this code you see here. > > [DDR3 NvDIMM which means I need memmap=16G\$32G on Kernel command line. > Then: modprobe pmem map=8G@32G,4G@44G,... > and so on Just as a simple example where 2/4-3/4 addresses are not used. > You can have holes in the middle or what ever you want. This here is just > a table in comma-separated format. If we need like flags in future we can > extend the format to nn@ss:flags, but I do no have any 3rd column yet] The point I am getting at is not requiring a priori knowledge of the physical memory map of a system. Rather, place holder variables to enable simple dynamic discovery. > And again I have in the pipe a dynamic interface added on top of > the module param one. So it will all be there soon. Without reverting > the old one. Why start on step 2 when we haven't got agreement on step 1? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver 2014-09-11 16:31 ` Dan Williams @ 2014-09-14 11:18 ` Boaz Harrosh 2014-09-16 13:54 ` Jeff Moyer 0 siblings, 1 reply; 41+ messages in thread From: Boaz Harrosh @ 2014-09-14 11:18 UTC (permalink / raw) To: Dan Williams Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-kernel, linux-nvdimm On 09/11/2014 07:31 PM, Dan Williams wrote: <> > > The point I am getting at is not requiring a priori knowledge of the > physical memory map of a system. Rather, place holder variables to > enable simple dynamic discovery. > "simple dynamic discovery" does not yet exist and when the DDR4 NvDIMM will be released then we still have those DDR3 out there which will not work with the new discovery, which I need to support as well. And ... The submitted API will always need to exist for the emulation case, for the developer who wants to do memmap= at the Kernel command line, which does not have any "dynamic-discovery" because it is not a real device. And my point was that the "dynamic-discovery" can be de-coupled from pmem, and be driven via udev. Each technology/ARCH may have its own "dynamic-discovery" method and drivers, yet all load a generic pmem driver. <> > Why start on step 2 when we haven't got agreement on step 1? > So are you saying that you do not agree with the pmem driver at all? Because my patch came to improve the single-range global memory parameter interface (prd_start=, prd_size=, prd_nr=) submitted by Ross into a simple map= interface that can take us a very long way. It does cover all of today's usage, including emulated memmap= at the Kernel command line. I have not seen any objections to Ross's pmem driver in general from you, only to my patch, which changes the API for the better and actually lets us support all existing and future flat technologies. Yet, yes, it does not have a "dynamic discovery", which is out of scope for pmem right now. 
(And I hope it always will be.)

I really do not understand what you are suggesting. Are you just saying NACK on pmem.c, that it does not do its job? How can we make it do its job then?

Thanks
Boaz

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver 2014-09-14 11:18 ` Boaz Harrosh @ 2014-09-16 13:54 ` Jeff Moyer 2014-09-16 16:24 ` Boaz Harrosh 2014-09-19 16:27 ` Dan Williams 0 siblings, 2 replies; 41+ messages in thread From: Jeff Moyer @ 2014-09-16 13:54 UTC (permalink / raw) To: Boaz Harrosh Cc: Dan Williams, Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-kernel, linux-nvdimm Boaz Harrosh <boaz@plexistor.com> writes: > On 09/11/2014 07:31 PM, Dan Williams wrote: > <> >> >> The point I am getting at is not requiring a priori knowledge of the >> physical memory map of a system. Rather, place holder variables to >> enable simple dynamic discovery. >> > > "simple dynamic discovery" does not yet exist and when the DDR4 NvDIMM > will be released then we still have those DDR3 out there which will > not work with the new discovery, which I need to support as well. Boaz, Are you telling me that vendors are shipping parts that present themselves as E820_RAM, and that you have to manually block off the addresses from the kernel using the kernel command line? If that is true, then that is just insane and unsupportable. All the hardware I have access to: 1) does not present itself as normal memory and 2) provides some means for discovering its address and size Cheers, Jeff ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver 2014-09-16 13:54 ` Jeff Moyer @ 2014-09-16 16:24 ` Boaz Harrosh 2014-09-19 16:27 ` Dan Williams 1 sibling, 0 replies; 41+ messages in thread From: Boaz Harrosh @ 2014-09-16 16:24 UTC (permalink / raw) To: Jeff Moyer, Boaz Harrosh Cc: Dan Williams, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-kernel, linux-nvdimm On 09/16/2014 04:54 PM, Jeff Moyer wrote: > Boaz Harrosh <boaz@plexistor.com> writes: > >> On 09/11/2014 07:31 PM, Dan Williams wrote: >> <> >>> >>> The point I am getting at is not requiring a priori knowledge of the >>> physical memory map of a system. Rather, place holder variables to >>> enable simple dynamic discovery. >>> >> >> "simple dynamic discovery" does not yet exist and when the DDR4 NvDIMM >> will be released then we still have those DDR3 out there which will >> not work with the new discovery, which I need to support as well. > > Boaz, > > Are you telling me that vendors are shipping parts that present > themselves as E820_RAM, and that you have to manually block off the > addresses from the kernel using the kernel command line? If that is > true, then that is just insane and unsupportable. All the hardware I > have access to: > 1) does not present itself as normal memory and > 2) provides some means for discovering its address and size > Hi Jeff There is one chip I have seen that is like that, yes, only the funny thing is that we have the capacitors and all, but we don't seem to be able to save on power loss. But it might be a bug at MB system bios so we are investigating. But for this chip, yes we need an exclusion at Kernel command line. I agree not very usable. Putting that aside, Yes the two other vendors of DDR3 NvDIMM come with their own driver that enables the chip and puts it on the buss. Then we use a vendor supplied tool, to find the mapped physical address + size + unique id. 
We then run a script that loads pmem with this info, to drive the chips. But with DDR3 there is no standard, and each vendor has its own discovery method. So pmem is just the generic ULD (Upper Layer Driver) loaded after the vendor LLD did its initial setup.

With DDR4 we will have a standard, and one LLD driver will be able to discover them from any vendor. At that point we might do a dynamic in-kernel probe, like the SCSI core does for its ULDs when a new target is found below. But for me this probe can just be a udev rule from user mode, and pmem can stay pure and generic. Let's cross that bridge later; it does not change the current design, it only adds a probe() capability to the whole stack. All of the current pmem code is made very friendly to a dynamic probe(), either from code or via sysfs.

That said, the map= interface will always be needed, because pmem supports one more option, which is the most commonly used right now by developers: the emulation of pmem with RAM. In such a usage a developer puts memmap=nn@ss on the Kernel command line and map=nn@ss on the pmem command line, and he can test and use code just as with real pmem, only of course non-persistent. This mode, since it has no real device, is never dynamically discovered, and we will always want to keep this ability for pmem. So releasing with this interface is fine, because there is never a reason to not keep it. It is there to stay. (It is also good for exporting a pmem device to a VM, with a VM shared-memory library.)

My next plan is to widen the module-param interface to enable hotplug/hotremove/hotexpand via the same module params. You know how a module param is also a hot sysfs file. At that stage the logic is as follows:

[parameters]

map= - exists today
  On load - same as "write".
  On read - displays the existing devices in the nn@ss,[...] format.
  On write - for each specified nn@ss:
    If an existing device is found at ss and nn is bigger than the current
    size, the device is dynamically expanded (shrinking not allowed).
    If no device exists at ss, one of nn size is added, provided there is
    no overlap with an existing device.
    Any existing devices which are not specified are HOTREMOVED.

At this point we support everything, but it is not very udev friendly, so we have two more:

add= - new
  On load - ignored.
  On read - empty.
  On write - for each specified nn@ss:
    If an existing device is found at ss and nn is bigger than the current
    size, the device is dynamically expanded (shrinking not allowed).
    If no device exists at ss, one of nn size is created, provided there is
    no overlap with an existing device.

remove= - new
  On load - ignored.
  On read - empty.
  On write - for each specified nn@ss: if an existing device exactly
    matches nn@ss it is HOTREMOVED.

A HOTREMOVE is only allowed when the device ref-count is 1, that is, no open files (or mounted filesystems).

With such an interface we can probe new devices from udev and keep pmem completely generic and vendor/ARCH agnostic. It can also be used with non-DDR PCIe devices.

If later we want an in-kernel probe, we will need an NvM-core which a pmem ULD registers with. Then any vendor LLD triggers the core, which calls all registered ULDs until a type match is found. Same as SCSI. But for me that registering core can just be udev in user mode. Again, we do not have to decide now; the current pmem code is very friendly to an in-kernel probe() when such a probe exists.

NOTE: There are 3 more possible ULDs for an NvM-core; pmem is only type1:
  type1 - All memory always mapped (pmem.ko).
  type2 - Reads always mapped, writes are slow and need IO, like flash.
          (Will need an internal bcache and COW of write pages.)
  type3 - Bigger internal nvm/flash with only a small window mapped at any
          given time. Will need paging and remapping-da
  type4 - pmem + flash, needs specific instructions to move data from pmem
          to flash and free pmem for reuse. (2-tier.)

> Cheers,
> Jeff
>

Thanks
Boaz

^ permalink raw reply [flat|nested] 41+ messages in thread
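The add=/remove= semantics sketched above reduce to a per-entry three-way decision against each existing device. A userspace model of that decision logic (illustrative only, not code from the patchset; names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

enum map_action { MAP_ADD, MAP_EXPAND, MAP_IGNORE, MAP_REJECT };

struct region { uint64_t start, size; };

/*
 * Decide what a request for 'size' bytes at 'start' means against one
 * existing device region, per the rules described above:
 *  - same start, bigger size        -> expand (shrinking not allowed,
 *                                      so a smaller size is ignored)
 *  - overlap with a different start -> reject
 *  - no overlap                     -> add a new device
 */
static enum map_action classify(const struct region *dev,
				uint64_t start, uint64_t size)
{
	if (start == dev->start)
		return size > dev->size ? MAP_EXPAND : MAP_IGNORE;
	if (start < dev->start + dev->size && dev->start < start + size)
		return MAP_REJECT;	/* overlaps an existing device */
	return MAP_ADD;
}
```

With one existing device 8G@32G, a request for 16G@32G expands it, 4G@44G adds a second device, and 1G@36G is rejected for overlapping the first.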
* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver 2014-09-16 13:54 ` Jeff Moyer 2014-09-16 16:24 ` Boaz Harrosh @ 2014-09-19 16:27 ` Dan Williams 2014-09-21 9:27 ` Boaz Harrosh 1 sibling, 1 reply; 41+ messages in thread From: Dan Williams @ 2014-09-19 16:27 UTC (permalink / raw) To: Jeff Moyer, Boaz Harrosh Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-kernel, linux-nvdimm On Tue, Sep 16, 2014 at 6:54 AM, Jeff Moyer <jmoyer@redhat.com> wrote: > Boaz Harrosh <boaz@plexistor.com> writes: > >> On 09/11/2014 07:31 PM, Dan Williams wrote: >> <> >>> >>> The point I am getting at is not requiring a priori knowledge of the >>> physical memory map of a system. Rather, place holder variables to >>> enable simple dynamic discovery. >>> >> >> "simple dynamic discovery" does not yet exist and when the DDR4 NvDIMM >> will be released then we still have those DDR3 out there which will >> not work with the new discovery, which I need to support as well. > > Boaz, > > Are you telling me that vendors are shipping parts that present > themselves as E820_RAM, and that you have to manually block off the > addresses from the kernel using the kernel command line? If that is > true, then that is just insane and unsupportable. All the hardware I > have access to: > 1) does not present itself as normal memory and > 2) provides some means for discovering its address and size Jeff, Yes, agree. Boaz, The UEFI organization is in the process of defining a generic specification for platform non-volatile memory resources. Let's wait until that is publicly available before adding any new device discovery capabilities to pmem. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Linux-nvdimm] [PATCH v2] pmem: Initial version of persistent memory driver 2014-09-19 16:27 ` Dan Williams @ 2014-09-21 9:27 ` Boaz Harrosh 0 siblings, 0 replies; 41+ messages in thread From: Boaz Harrosh @ 2014-09-21 9:27 UTC (permalink / raw) To: Dan Williams, Jeff Moyer Cc: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-kernel, linux-nvdimm On 09/19/2014 07:27 PM, Dan Williams wrote:

> before adding any new device
> discovery capabilities to pmem.
>

I lost you? What "new device discovery capabilities" do you see in Ross's code, or in mine, for pmem? Actually my point is that I hope there will never be any; the discovery should be elsewhere, in an ARCH/driver LLD, while pmem stays generic.

I feel we are not disagreeing, just not communicating. Please explain again, sorry for not understanding. What in the code I sent do you not like? What in the code depends on the UEFI organization's paper and might need to change?

Again, sorry for being slow. Please explain again?

Thanks
Boaz

^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH 1/4] pmem: Initial version of persistent memory driver 2014-08-27 21:11 ` [PATCH 1/4] pmem: Initial version of " Ross Zwisler 2014-09-09 16:23 ` [PATCH v2] " Boaz Harrosh @ 2014-11-02 3:22 ` Elliott, Robert (Server Storage) 2014-11-03 15:50 ` Jeff Moyer 2014-11-03 16:19 ` Wilcox, Matthew R 1 sibling, 2 replies; 41+ messages in thread From: Elliott, Robert (Server Storage) @ 2014-11-02 3:22 UTC (permalink / raw) To: Ross Zwisler, Jens Axboe, Matthew Wilcox, Boaz Harrosh, Nick Piggin, Kani, Toshimitsu, Knippers, Linda, linux-fsdevel, linux-kernel, linux-nvdimm@lists.01.org > -----Original Message----- > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel- > owner@vger.kernel.org] On Behalf Of Ross Zwisler > Sent: Wednesday, 27 August, 2014 4:12 PM > To: Jens Axboe; Matthew Wilcox; Boaz Harrosh; Nick Piggin; linux- > fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; linux- > nvdimm@lists.01.org > Cc: Ross Zwisler > Subject: [PATCH 1/4] pmem: Initial version of persistent memory > driver > > PMEM is a new driver that presents a reserved range of memory as a > block device. This is useful for developing with NV-DIMMs, and > can be used with volatile memory as a development platform. > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > --- > MAINTAINERS | 6 + > drivers/block/Kconfig | 41 ++++++ > drivers/block/Makefile | 1 + > drivers/block/pmem.c | 330 > +++++++++++++++++++++++++++++++++++++++++++++++++ > 4 files changed, 378 insertions(+) > create mode 100644 drivers/block/pmem.c > ... > diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig > index 1b8094d..ac52f5a 100644 > --- a/drivers/block/Kconfig > +++ b/drivers/block/Kconfig > @@ -404,6 +404,47 @@ config BLK_DEV_RAM_DAX > and will prevent RAM block device backing store memory from > being > allocated from highmem (only a problem for highmem systems). 
> > +config BLK_DEV_PMEM > + tristate "Persistent memory block device support" > + help > + Saying Y here will allow you to use a contiguous range of > reserved > + memory as one or more block devices. Memory for PMEM should > be > + reserved using the "memmap" kernel parameter. > + > + To compile this driver as a module, choose M here: the module > will be > + called pmem. > + > + Most normal users won't need this functionality, and can thus > say N > + here. > + > +config BLK_DEV_PMEM_START > + int "Offset in GiB of where to start claiming space" > + default "0" > + depends on BLK_DEV_PMEM > + help > + Starting offset in GiB that PMEM should use when claiming > memory. This > + memory needs to be reserved from the OS at boot time using > the > + "memmap" kernel parameter. > + > + If you provide PMEM with volatile memory it will act as a > volatile > + RAM disk and your data will not be persistent. > + > +config BLK_DEV_PMEM_COUNT > + int "Default number of PMEM disks" > + default "4" For real use I think a default of 1 would be better. > + depends on BLK_DEV_PMEM > + help > + Number of equal sized block devices that PMEM should create. > + > +config BLK_DEV_PMEM_SIZE > + int "Size in GiB of space to claim" > + depends on BLK_DEV_PMEM > + default "0" > + help > + Amount of memory in GiB that PMEM should use when creating > block > + devices. This memory needs to be reserved from the OS at > + boot time using the "memmap" kernel parameter. > + > config CDROM_PKTCDVD > tristate "Packet writing on CD/DVD media" > depends on !UML ... > diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c > new file mode 100644 > index 0000000..d366b9b > --- /dev/null > +++ b/drivers/block/pmem.c > @@ -0,0 +1,330 @@ > +/* > + * Persistent Memory Driver > + * Copyright (c) 2014, Intel Corporation. 
> + * > + * This program is free software; you can redistribute it and/or > modify it > + * under the terms and conditions of the GNU General Public License, > + * version 2, as published by the Free Software Foundation. > + * > + * This program is distributed in the hope it will be useful, but > WITHOUT > + * ANY WARRANTY; without even the implied warranty of > MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public > License for > + * more details. > + * > + * This driver is heavily based on drivers/block/brd.c. > + * Copyright (C) 2007 Nick Piggin > + * Copyright (C) 2007 Novell Inc. > + */ > + > +#include <linux/bio.h> > +#include <linux/blkdev.h> > +#include <linux/fs.h> > +#include <linux/hdreg.h> > +#include <linux/highmem.h> > +#include <linux/init.h> > +#include <linux/major.h> > +#include <linux/module.h> > +#include <linux/moduleparam.h> > +#include <linux/slab.h> > +#include <linux/uaccess.h> > + > +#define SECTOR_SHIFT 9 > +#define PAGE_SECTORS_SHIFT (PAGE_SHIFT - SECTOR_SHIFT) > +#define PAGE_SECTORS (1 << PAGE_SECTORS_SHIFT) > + > +/* > + * driver-wide physical address and total_size - one single, > contiguous memory > + * region that we divide up in to same-sized devices > + */ > +phys_addr_t phys_addr; > +void *virt_addr; > +size_t total_size; > + > +struct pmem_device { > + struct request_queue *pmem_queue; > + struct gendisk *pmem_disk; > + struct list_head pmem_list; > + > + phys_addr_t phys_addr; > + void *virt_addr; > + size_t size; > +}; > + > +/* > + * direct translation from (pmem,sector) => void* > + * We do not require that sector be page aligned. > + * The return value will point to the beginning of the page > containing the > + * given sector, not to the sector itself. 
> + */ > +static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t > sector) > +{ > + size_t page_offset = sector >> PAGE_SECTORS_SHIFT; > + size_t offset = page_offset << PAGE_SHIFT; > + Since page_offset is only used to calculate offset, they could be joined to avoid relying on the compiler to optimize it away: size_t offset = sector >> PAGE_SECTORS_SHIFT << PAGE_SHIFT; > + BUG_ON(offset >= pmem->size); BUG_ON is severe... should this function be designed to return an error instead? > + return pmem->virt_addr + offset; > +} > + > +/* > + * sector is not required to be page aligned. > + * n is at most a single page, but could be less. > + */ > +static void copy_to_pmem(struct pmem_device *pmem, const void *src, > + sector_t sector, size_t n) > +{ > + void *dst; > + unsigned int offset = (sector & (PAGE_SECTORS - 1)) << > SECTOR_SHIFT; > + size_t copy; > + > + BUG_ON(n > PAGE_SIZE); > + > + copy = min_t(size_t, n, PAGE_SIZE - offset); > + dst = pmem_lookup_pg_addr(pmem, sector); > + memcpy(dst + offset, src, copy); > + > + if (copy < n) { > + src += copy; > + sector += copy >> SECTOR_SHIFT; > + copy = n - copy; > + dst = pmem_lookup_pg_addr(pmem, sector); > + memcpy(dst, src, copy); > + } > +} > + > +/* > + * sector is not required to be page aligned. > + * n is at most a single page, but could be less. 
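The simplification suggested in the review comment is pure integer arithmetic and easy to sanity-check outside the kernel; both forms round a sector number down to the byte offset of the page containing it (sketch assumes 4 KiB pages and the patch's 512-byte sectors):

```c
#include <assert.h>
#include <stddef.h>

/* local copies of the patch's constants; 4 KiB pages assumed */
#define SECTOR_SHIFT		9
#define PAGE_SHIFT_LOCAL	12
#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT_LOCAL - SECTOR_SHIFT)

typedef unsigned long long sector_t;

/* Two-step form from the patch. */
static size_t pg_offset_two_step(sector_t sector)
{
	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
	return page_offset << PAGE_SHIFT_LOCAL;
}

/* Single-expression form from the review comment (shifts are
 * left-associative, so this is (sector >> 3) << 12). */
static size_t pg_offset_joined(sector_t sector)
{
	return sector >> PAGE_SECTORS_SHIFT << PAGE_SHIFT_LOCAL;
}
```

Sectors 0 through 7 all map to offset 0, sector 8 to offset 4096, and so on; the two forms agree everywhere.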
> + */ > +static void copy_from_pmem(void *dst, struct pmem_device *pmem, > + sector_t sector, size_t n) > +{ > + void *src; > + unsigned int offset = (sector & (PAGE_SECTORS - 1)) << > SECTOR_SHIFT; > + size_t copy; > + > + BUG_ON(n > PAGE_SIZE); > + > + copy = min_t(size_t, n, PAGE_SIZE - offset); > + src = pmem_lookup_pg_addr(pmem, sector); > + > + memcpy(dst, src + offset, copy); > + > + if (copy < n) { > + dst += copy; > + sector += copy >> SECTOR_SHIFT; > + copy = n - copy; > + src = pmem_lookup_pg_addr(pmem, sector); > + memcpy(dst, src, copy); > + } > +} > + > +static void pmem_do_bvec(struct pmem_device *pmem, struct page > *page, > + unsigned int len, unsigned int off, int rw, > + sector_t sector) > +{ > + void *mem = kmap_atomic(page); > + > + if (rw == READ) { > + copy_from_pmem(mem + off, pmem, sector, len); > + flush_dcache_page(page); Why does READ flush the dcache after reading? It's fine to leave data in dcache. > + } else { > + /* > + * FIXME: Need more involved flushing to ensure that > writes to > + * NVDIMMs are actually durable before returning. > + */ > + flush_dcache_page(page); > + copy_to_pmem(pmem, mem + off, sector, len); Why is this flushing before the write to pmem? That might flush data that's going to be overwritten. I'd expect a flush AFTER the write, to push data to a durable location that will survive surprise power loss. > + } > + > + kunmap_atomic(mem); > +} If a read or write of a pmem address gets an uncorrectable error, the system should not crash; just report this IO is bad. That reflects a difference in storage philosophy vs. memory philosophy. All the rest of the data in the pmem might be fine, and many users prefer to recover 99% of their data than lose it all. pmem_do_bvec needs a way to turn off normal DRAM "crash on error" handling for its accesses, and provide a return value indicating success or failure. 
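As an aid to reading the two-memcpy() structure of copy_to_pmem()/copy_from_pmem() quoted above: the first chunk is capped at the page boundary, and the remainder (if any) continues in the following page. A standalone model of that split (assumes 4 KiB pages; not code from the patch):

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_SHIFT	9
#define PAGE_SIZE_LOCAL	4096ULL		/* 4 KiB pages assumed */
#define PAGE_SECTORS	(PAGE_SIZE_LOCAL >> SECTOR_SHIFT)

typedef unsigned long long sector_t;

/*
 * For an n-byte transfer starting at 'sector' (n <= PAGE_SIZE), return the
 * number of bytes that land in the first page; the remainder, if any, goes
 * to the following page -- exactly the two-memcpy() structure above.
 */
static size_t first_chunk(sector_t sector, size_t n)
{
	size_t offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
	size_t room = PAGE_SIZE_LOCAL - offset;

	return n < room ? n : room;
}
```

For example, a full-page transfer starting at sector 7 sits 3584 bytes into its page, so only 512 bytes fit before the boundary and the remaining 3584 spill into the next page.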
Also, most storage devices automatically remap bad sectors when they are overwritten or remap them on command (e.g., REASSIGN BLOCKS in SCSI), rather than leave that sector bad forever. I don't know how many filesystems and applications rely on that feature now, vs. maintain their own bad block tables; this may be something that pmem needs to provide. > + > +static void pmem_make_request(struct request_queue *q, struct bio > *bio) > +{ > + struct block_device *bdev = bio->bi_bdev; > + struct pmem_device *pmem = bdev->bd_disk->private_data; > + int rw; > + struct bio_vec bvec; > + sector_t sector; > + struct bvec_iter iter; > + int err = 0; > + > + sector = bio->bi_iter.bi_sector; > + if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) { > + err = -EIO; > + goto out; > + } > + > + BUG_ON(bio->bi_rw & REQ_DISCARD); > + Since discard (i.e., unmap, trim) is just a hint, it could just be ignored rather than trigger a drastic BUG. Related suggestion: keep track of each sector that has been discarded, and whether it has been written after discard. This would tell flash-backed DRAM which sectors don't need to be flushed to flash on a power failure. Related suggestion: Adding WRITE SAME support would be useful for some software (for at least the writing zeros subset) - that command is not just a hint. It would result in a memset.
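To make the discard-tracking suggestion above concrete, here is a minimal userspace sketch (names, sizing, and the one-bit-per-sector scheme are all illustrative assumptions, not from the patch): one bit per sector, set on discard and cleared on write, so a flash-backed DIMM would only need to flush sectors whose bit is clear on power failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sizing: track 64 sectors, one bit each. */
#define TRACKED_SECTORS 64

static uint8_t discard_map[TRACKED_SECTORS / 8];

/* A discard marks the sector as holding no valid data. */
static void mark_discarded(uint64_t sector)
{
	discard_map[sector / 8] |= (uint8_t)(1u << (sector % 8));
}

/* A write after discard makes the sector's contents matter again. */
static void mark_written(uint64_t sector)
{
	discard_map[sector / 8] &= (uint8_t)~(1u << (sector % 8));
}

/* Only sectors not currently discarded would need flushing to flash. */
static bool needs_flush(uint64_t sector)
{
	return !(discard_map[sector / 8] & (1u << (sector % 8)));
}
```

A real implementation would need per-device state and locking against concurrent bios; this only shows the bookkeeping the comment proposes.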
> + rw = bio_rw(bio); > + if (rw == READA) > + rw = READ; > + > + bio_for_each_segment(bvec, bio, iter) { > + unsigned int len = bvec.bv_len; > + > + BUG_ON(len > PAGE_SIZE); > + pmem_do_bvec(pmem, bvec.bv_page, len, > + bvec.bv_offset, rw, sector); > + sector += len >> SECTOR_SHIFT; > + } > + > +out: > + bio_endio(bio, err); > +} > + > +static const struct block_device_operations pmem_fops = { > + .owner = THIS_MODULE, > +}; > + > +/* Kernel module stuff */ > +static int pmem_start_gb = CONFIG_BLK_DEV_PMEM_START; > +module_param(pmem_start_gb, int, S_IRUGO); > +MODULE_PARM_DESC(pmem_start_gb, "Offset in GB of where to start > claiming space"); > + > +static int pmem_size_gb = CONFIG_BLK_DEV_PMEM_SIZE; > +module_param(pmem_size_gb, int, S_IRUGO); > +MODULE_PARM_DESC(pmem_size_gb, "Total size in GB of space to claim > for all disks"); > + These modparam variable names and description strings use gb and GB when they mean GiB. The storage market generally uses correct SI units; please don't follow JEDEC's misuse. The Kconfig descriptions use the right IEC binary units. Using GiB as the base unit prevents having persistent memory blocks < 1 GiB or not multiples of 1 GiB. For RAID controller caches, 256 MiB and 512 MiB are still used. Smaller units may be warranted. Accepting the number and unit would be the most flexible and clear: allow strings like 512MiB, 2GiB, 1GB, and 4096B. > +static int pmem_count = CONFIG_BLK_DEV_PMEM_COUNT; > +module_param(pmem_count, int, S_IRUGO); > +MODULE_PARM_DESC(pmem_count, "Number of pmem devices to evenly split > allocated space"); > + > +static LIST_HEAD(pmem_devices); > +static int pmem_major; > + > +/* FIXME: move phys_addr, virt_addr, size calls up to caller */ It is unclear what that wants to fix. > +static struct pmem_device *pmem_alloc(int i) > +{ > + struct pmem_device *pmem; > + struct gendisk *disk; > + size_t disk_size = total_size / pmem_count; There is no protection for pmem_count being 0 and causing divide-by-zero.
There is no check for disk_size being 0. That might cause problems in pmem_init. > + size_t disk_sectors = disk_size / 512; > + > + pmem = kzalloc(sizeof(*pmem), GFP_KERNEL); > + if (!pmem) > + goto out; > + > + pmem->phys_addr = phys_addr + i * disk_size; > + pmem->virt_addr = virt_addr + i * disk_size; > + pmem->size = disk_size; > + > + pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL); > + if (!pmem->pmem_queue) > + goto out_free_dev; > + General comment: This driver should be blk-mq based. Not doing so loses out on a number of SMP and NUMA performance improvements. For example, blk_alloc_queue calls blk_alloc_queue_node with NUMA_NO_NODE. If it called blk_mq_init_queue, then each CPU would get structures located on its node. > + blk_queue_make_request(pmem->pmem_queue, pmem_make_request); > + blk_queue_max_hw_sectors(pmem->pmem_queue, 1024); Many storage controllers support 1 MiB IOs now, and increasing amounts of software feel comfortable generating those sizes. That means this should be at least 2048. > + blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY); > + It might be appropriate to call blk_queue_rq_timeout and set up a very short timeout, since persistent memory is very fast. I'm not sure what the default is for non-blk-mq drivers; for blk-mq, I think it is 30 seconds, which is far too long for memory based storage. 
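The "number plus unit" module parameter suggested above could look like the following userspace sketch (the function name and supported suffixes are assumptions; a kernel version would parse the modparam string the same way):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse sizes like "512MiB", "2GiB", "1GB", "4096B".
 * Returns the size in bytes, or 0 on a parse error. */
static uint64_t parse_size(const char *s)
{
	char *end;
	uint64_t n = strtoull(s, &end, 10);

	if (end == s)
		return 0;			/* no digits */
	if (*end == '\0' || !strcmp(end, "B"))
		return n;			/* bare bytes */
	/* IEC binary suffixes */
	if (!strcmp(end, "KiB")) return n << 10;
	if (!strcmp(end, "MiB")) return n << 20;
	if (!strcmp(end, "GiB")) return n << 30;
	/* SI decimal suffixes */
	if (!strcmp(end, "KB"))  return n * 1000;
	if (!strcmp(end, "MB"))  return n * 1000 * 1000;
	if (!strcmp(end, "GB"))  return n * 1000 * 1000 * 1000;
	return 0;				/* unknown suffix */
}
```

This also removes the "GiB-only granularity" limitation, since 256MiB or 512MiB caches parse naturally.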
> + disk = pmem->pmem_disk = alloc_disk(0); > + if (!disk) > + goto out_free_queue; > + disk->major = pmem_major; > + disk->first_minor = 0; > + disk->fops = &pmem_fops; > + disk->private_data = pmem; > + disk->queue = pmem->pmem_queue; > + disk->flags = GENHD_FL_EXT_DEVT; > + sprintf(disk->disk_name, "pmem%d", i); > + set_capacity(disk, disk_sectors); > + > + return pmem; > + > +out_free_queue: > + blk_cleanup_queue(pmem->pmem_queue); > +out_free_dev: > + kfree(pmem); > +out: > + return NULL; > +} > + > +static void pmem_free(struct pmem_device *pmem) > +{ > + put_disk(pmem->pmem_disk); > + blk_cleanup_queue(pmem->pmem_queue); > + kfree(pmem); > +} > + > +static void pmem_del_one(struct pmem_device *pmem) > +{ > + list_del(&pmem->pmem_list); > + del_gendisk(pmem->pmem_disk); > + pmem_free(pmem); > +} > + > +static int __init pmem_init(void) > +{ > + int result, i; > + struct resource *res_mem; > + struct pmem_device *pmem, *next; > + > + phys_addr = (phys_addr_t) pmem_start_gb * 1024 * 1024 * 1024; > + total_size = (size_t) pmem_size_gb * 1024 * 1024 * 1024; > + There is no check for start+size wrapping around the address space. Although modparams are a fine starting point, some way to automatically load drivers based on e820 table and other ACPI content will be needed. pmem may not be the only driver vying for use of persistent memory regions; some sort of arbiter will be needed to implement the user's choice (the same way each boot). > + res_mem = request_mem_region_exclusive(phys_addr, total_size, > "pmem"); > + if (!res_mem) > + return -ENOMEM; > + Does the call to request_mem_region_exclusive, which uses: #define IORESOURCE_EXCLUSIVE 0x08000000 /* Userland may not map this resource */ mean that userspace cannot mmap files and directly access persistent memory? There are only 3 drivers that use request.*exclusive functions: drivers/net/ethernet/intel/e1000e and drivers/watchdog/sp5100-tco. None of the storage drivers do so. 
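The missing start+size wrap-around check noted above is cheap to add; a portable sketch of the validation (function name assumed, not from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reject a region whose size is zero or whose end (start + size)
 * would wrap past the top of a 64-bit physical address space. */
static bool region_valid(uint64_t phys_addr, uint64_t total_size)
{
	if (total_size == 0)
		return false;
	/* equivalent to: phys_addr + total_size overflows uint64_t */
	if (total_size > UINT64_MAX - phys_addr)
		return false;
	return true;
}
```

pmem_init() could run such a check before calling request_mem_region_exclusive() and fail the load with an error instead of mapping a wrapped range.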
> + virt_addr = ioremap_cache(phys_addr, total_size); > + if (!virt_addr) { > + result = -ENOMEM; > + goto out_release; > + } > + ioremap_cache maps the addresses as writeback cacheable. How about offering all the cache attributes as modparams? * writeback (WB): costly cache flushes to preserve data * write combining (WC): must flush the WC buffers * writethrough (WT): safe, good for cases where the application expects to read the data again soon * uncacheable (UC): safe, good for cases where the application does not expect to read the data There are some use cases for persistence across reboots that don't care about persistence across power cycles; for these, separate options could be provided for: * WB * WB+flush * WC * WC+flush > + result = register_blkdev(0, "pmem"); > + if (result < 0) { > + result = -EIO; > + goto out_unmap; > + } else > + pmem_major = result; > + In comparison, if sd fails register_blkdev, it returns -ENODEV. Since no pmem device is created, -ENODEV might be more appropriate. The pmem_major variable might be unsafe/racy if multiple concurrent module loads are possible. The value is stored in disk->major; that's the only place it should be stored, and pmem_exit should pull the value from that location to later pass into unregister_blkdev before deleting its containing structure. This might not be an issue until you get to 16 Ki modules. Should this function call blk_register_region after calling register_blkdev? 
> + for (i = 0; i < pmem_count; i++) { > + pmem = pmem_alloc(i); > + if (!pmem) { > + result = -ENOMEM; > + goto out_free; > + } > + list_add_tail(&pmem->pmem_list, &pmem_devices); > + } > + > + list_for_each_entry(pmem, &pmem_devices, pmem_list) > + add_disk(pmem->pmem_disk); > + > + pr_info("pmem: module loaded\n"); > + return 0; > + > +out_free: > + list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list) > { > + list_del(&pmem->pmem_list); > + pmem_free(pmem); > + } > + unregister_blkdev(pmem_major, "pmem"); > + > +out_unmap: > + iounmap(virt_addr); > + > +out_release: > + release_mem_region(phys_addr, total_size); > + return result; > +} > + > +static void __exit pmem_exit(void) > +{ > + struct pmem_device *pmem, *next; > + > + list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list) > + pmem_del_one(pmem); > + > + unregister_blkdev(pmem_major, "pmem"); > + iounmap(virt_addr); > + release_mem_region(phys_addr, total_size); > + > + pr_info("pmem: module unloaded\n"); > +} > + > +MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>"); > +MODULE_LICENSE("GPL"); > +module_init(pmem_init); > +module_exit(pmem_exit); --- Rob Elliott HP Server Storage ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/4] pmem: Initial version of persistent memory driver 2014-11-02 3:22 ` [PATCH 1/4] " Elliott, Robert (Server Storage) @ 2014-11-03 15:50 ` Jeff Moyer 2014-11-03 16:19 ` Wilcox, Matthew R 1 sibling, 0 replies; 41+ messages in thread From: Jeff Moyer @ 2014-11-03 15:50 UTC (permalink / raw) To: Elliott, Robert (Server Storage) Cc: Ross Zwisler, Jens Axboe, Matthew Wilcox, Boaz Harrosh, Nick Piggin, Kani, Toshimitsu, Knippers, Linda, linux-fsdevel, linux-kernel, linux-nvdimm@lists.01.org "Elliott, Robert (Server Storage)" <Elliott@hp.com> writes: >> +config BLK_DEV_PMEM_COUNT >> + int "Default number of PMEM disks" >> + default "4" > > For real use I think a default of 1 would be better. For "real" use, it will be the number of actual DIMMs, not a config option, I would think. I don't take any issue with defaulting to 4. This will go away. >> + pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL); >> + if (!pmem->pmem_queue) >> + goto out_free_dev; >> + > > General comment: > This driver should be blk-mq based. Not doing so loses > out on a number of SMP and NUMA performance improvements. > For example, blk_alloc_queue calls blk_alloc_queue_node > with NUMA_NO_NODE. If it called blk_mq_init_queue, then > each CPU would get structures located on its node. No need to use blk-mq just to set the numa node, as the driver could just call blk_alloc_queue_node itself. I'd chalk this up to another thing that could be fixed when the driver is used for actual hardware that describes its own proximity domain. Further, there is no DMA engine, here, so what is the benefit of using blk-mq? I/O completes in the submission path, and I don't see any locking that's preventing parallelism. > >> + blk_queue_make_request(pmem->pmem_queue, pmem_make_request); >> + blk_queue_max_hw_sectors(pmem->pmem_queue, 1024); > > Many storage controllers support 1 MiB IOs now, and increasing > amounts of software feel comfortable generating those sizes. 
> That means this should be at least 2048. > >> + blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY); >> + > > It might be appropriate to call blk_queue_rq_timeout > and set up a very short timeout, since persistent memory is > very fast. I'm not sure what the default is for non-blk-mq > drivers; for blk-mq, I think it is 30 seconds, which is > far too long for memory based storage. If you timeout on a memcpy, then we've definitely got problems. :) Cheers, Jeff ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH 1/4] pmem: Initial version of persistent memory driver 2014-11-02 3:22 ` [PATCH 1/4] " Elliott, Robert (Server Storage) 2014-11-03 15:50 ` Jeff Moyer @ 2014-11-03 16:19 ` Wilcox, Matthew R 2014-11-04 10:37 ` Boaz Harrosh 1 sibling, 1 reply; 41+ messages in thread From: Wilcox, Matthew R @ 2014-11-03 16:19 UTC (permalink / raw) To: Elliott, Robert (Server Storage), Ross Zwisler, Jens Axboe, Boaz Harrosh, Nick Piggin, Kani, Toshimitsu, Knippers, Linda, linux-fsdevel, linux-kernel, linux-nvdimm@lists.01.org, Matthew Wilcox I really wish people wouldn't use my Exchange email address for patches. It's completely impossible to have a meaningful conversation this way. I've resorted to inserting extra quotation marks by hand so people can stand at least some chance of understanding what the hell's going on. > > +config BLK_DEV_PMEM_COUNT > > + int "Default number of PMEM disks" > > + default "4" > > For real use I think a default of 1 would be better. For real use, you need at least two to run xfstests. This whole configuration mechanism is going away soon anyway. > > + size_t page_offset = sector >> PAGE_SECTORS_SHIFT; > > + size_t offset = page_offset << PAGE_SHIFT; > > + > > Since page_offset is only used to calculate offset, they > could be joined to avoid relying on the compiler to > optimize it away: > size_t offset = sector >> PAGE_SECTORS_SHIFT << PAGE_SHIFT; If you insist on doing the compiler's job for it, why not: size_t offset = sector >> (PAGE_SECTORS_SHIFT - PAGE_SHIFT); I actually think the original is clearer. > > + BUG_ON(offset >= pmem->size); > > BUG_ON is severe... should this function be designed to return > an error instead? No ... upper levels should have prevented an out of range I/O from being communicated down to the driver. Finding the I/O is out of range here indicates something is horribly wrong, and BUG_ON is the appropriate response. 
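The sector-to-offset arithmetic debated above can be checked directly in userspace; a minimal sketch with the usual x86 values assumed (PAGE_SHIFT = 12, SECTOR_SHIFT = 9, so PAGE_SECTORS_SHIFT = 3):

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_SHIFT		9
#define PAGE_SHIFT		12
#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)	/* 3 */

/* Two-step form from the patch: sector -> page index -> byte offset. */
static size_t offset_two_step(size_t sector)
{
	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
	return page_offset << PAGE_SHIFT;
}

/* One-line form from the review comment; shifts associate left to
 * right, so the low sector bits are still discarded before the left
 * shift and the result is identical. */
static size_t offset_one_line(size_t sector)
{
	return sector >> PAGE_SECTORS_SHIFT << PAGE_SHIFT;
}
```

Note that both forms round an unaligned sector down to its containing page; the masking is why the two shifts cannot be collapsed into a single shift.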
> > + if (rw == READ) { > > + copy_from_pmem(mem + off, pmem, sector, len); > > + flush_dcache_page(page); > > Why does READ flush the dcache after reading? It's fine to > leave data in dcache. You misunderstand the purpose of flush_dcache_page(); see Documentation/cachetlb.txt. It is to handle D-cache aliasing between user views of pages and kernel views of pages. So we have to flush it after reading so that userspace sees the newly written data, and flush it before writing from it, so that the kernel sees all the data that userspace has written to it. These macros are no-ops on x86, but on CPUs like PA-RISC, they perform important work. > If a read or write of a pmem address gets an uncorrectable > error, the system should not crash; just report this IO is bad. > That reflects a difference in storage philosophy vs. memory > philosophy. All the rest of the data in the pmem might be > fine, and many users prefer to recover 99% of their data > than lose it all. > > pmem_do_bvec needs a way to turn off normal DRAM "crash on > error" handling for its accesses, and provide a return value > indicating success or failure. This is on our longer-term TODO list. It requires integration with the MCE handler. > Also, most storage devices automatically remap bad sectors > when they are overwritten or remap them on command (e.g., > REASSIGN BLOCKS in SCSI), rather than leave that sector > bad forever. I don't know how many filesystems and > applications rely on that feature now, vs. maintain their > own bad block tables; this may be something that pmem > needs to provide. At least ext4 supports the concept of bad blocks. XFS doesn't (due to it being a bad idea for modern hard drives), but it's pretty trivial to add (a special inode that owns all of the bad blocks; no changes to any data structures). > > + BUG_ON(bio->bi_rw & REQ_DISCARD); > > + > > Since discard (i.e., unmap, trim) is just a hint, it could just > be ignored rather than trigger a drastic BUG.
Again, upper layers should not have sent down a DISCARD request, since the driver didn't indicate support for them. Personally, I would not have put this line in, but it's Ross' driver and I wasn't going to argue with him. > Related suggestion: keep track of each sector that has been > discarded, and whether it has been written after discard. > This would tell flash-backed DRAM which sectors don't need > to be flushed to flash on a power failure. How would we communicate that info to the DIMM? We're not opposed to putting in DISCARD support, but it needs to provide some value. > Related suggestion: Adding WRITE SAME support would be useful > for some software (for at least the writing zeros subset) - > that command is not just a hint. It would result in a memset. At this point, it's no more efficient in the Linux stack than having the filesystem send down I/Os full of zeroes. The NVMe WRITE ZEROES command would be more efficient, but that's not yet supported by the Linux stack. > These modparam variable names and description strings use gb > and GB when they mean GiB. The storage market generally uses > correct SI units; please don't follow JEDEC's misuse. The > Kconfig descriptions use the right IEC binary units. These are also going away. > > +/* FIXME: move phys_addr, virt_addr, size calls up to caller */ > > It is unclear what that wants to fix. Also part of the configuration interface that is being replaced. > > + size_t disk_size = total_size / pmem_count; > > There is no protection for pmem_count being 0 and causing > divide-by-zero. > > There is no check for disk_size being 0. That might > cause problems in pmem_init. Also part of the configuration interface that is being replaced. > This driver should be blk-mq based. Not doing so loses > out on a number of SMP and NUMA performance improvements. > For example, blk_alloc_queue calls blk_alloc_queue_node > with NUMA_NO_NODE.
If it called blk_mq_init_queue, then > each CPU would get structures located on its node. No, it's completely pointless to use blk-mq for pmem. There is zero advantage for a synchronous block driver to include this overhead. > > + blk_queue_max_hw_sectors(pmem->pmem_queue, 1024); > > Many storage controllers support 1 MiB IOs now, and increasing > amounts of software feel comfortable generating those sizes. > That means this should be at least 2048. Linux prohibits this being larger than 1024 today; see Documentation/block/biodoc.txt > It might be appropriate to call blk_queue_rq_timeout > and set up a very short timeout, since persistent memory is > very fast. I'm not sure what the default is for non-blk-mq > drivers; for blk-mq, I think it is 30 seconds, which is > far too long for memory based storage. The Linux block layer isn't tracking these I/Os because we're a BIO driver. > > + phys_addr = (phys_addr_t) pmem_start_gb * 1024 * 1024 * 1024; > > + total_size = (size_t) pmem_size_gb * 1024 * 1024 * 1024; > > There is no check for start+size wrapping around the > address space. Right; will go away as part of the new configuration system. > Although modparams are a fine starting point, some way to > automatically load drivers based on e820 table and other ACPI > content will be needed. pmem may not be the only driver > vying for use of persistent memory regions; some sort of > arbiter will be needed to implement the user's choice (the > same way each boot). I thought you were being cc'd on the review of that document ... ? > > + res_mem = request_mem_region_exclusive(phys_addr, total_size, "pmem"); > Does the call to request_mem_region_exclusive, which uses: > #define IORESOURCE_EXCLUSIVE 0x08000000 /* Userland may not map this > resource */ > mean that userspace cannot mmap files and directly access > persistent memory? There are only 3 drivers that use > request.*exclusive functions: drivers/net/ethernet/intel/e1000e > and drivers/watchdog/sp5100-tco. 
None of the storage drivers > do so. It does not mean that, since I do exactly this today using DAX on top of PMEM. It means that userspace can't go poking around in /dev/mem and map the memory directly, which isn't how anybody wants programs to work. > > + virt_addr = ioremap_cache(phys_addr, total_size); > > ioremap_cache maps the addresses as writeback cacheable. How > about offering all the cache attributes as modparams? > * writeback (WB): costly cache flushes to preserve data > * write combining (WC): must flush the WC buffers > * writethrough (WT): safe, good for cases where the application > expects to read the data again soon > * uncacheable (UC): safe, good for cases where the application > does not expect to read the data > > There are some use cases for persistence across reboots > that don't care about persistence across power cycles; > for these, separate options could be provided for: > * WB > * WB+flush > * WC > * WC+flush We hadn't talked about this previously ... let's look into it. > > + result = register_blkdev(0, "pmem"); > > + if (result < 0) { > > + result = -EIO; > > + goto out_unmap; > > In comparison, if sd fails register_blkdev, it returns > -ENODEV. Since no pmem device is created, -ENODEV might > be more appropriate. I'm not sure why we don't just return the error from register_blkdev(). Ross? > The pmem_major variable might be unsafe/racy if multiple > concurrent module loads are possible. They aren't. > Should this function call blk_register_region after > calling register_blkdev? I don't think that's necessary with dynamic majors, but I'm happy to learn that I'm wrong about this. Phew, that was a long one! Thanks for your review! ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/4] pmem: Initial version of persistent memory driver 2014-11-03 16:19 ` Wilcox, Matthew R @ 2014-11-04 10:37 ` Boaz Harrosh 2014-11-04 16:26 ` Elliott, Robert (Server Storage) 0 siblings, 1 reply; 41+ messages in thread From: Boaz Harrosh @ 2014-11-04 10:37 UTC (permalink / raw) To: Wilcox, Matthew R, Elliott, Robert (Server Storage), Ross Zwisler, Jens Axboe, Nick Piggin, Kani, Toshimitsu, Knippers, Linda, linux-fsdevel, linux-kernel, linux-nvdimm@lists.01.org, Matthew Wilcox On 11/03/2014 06:19 PM, Wilcox, Matthew R wrote: <> >>> +config BLK_DEV_PMEM_COUNT >>> + int "Default number of PMEM disks" >>> + default "4" >> >> For real use I think a default of 1 would be better. > > For real use, you need at least two to run xfstests. This whole configuration mechanism is going away soon anyway. > >>> + size_t page_offset = sector >> PAGE_SECTORS_SHIFT; >>> + size_t offset = page_offset << PAGE_SHIFT; >>> + >> >> Since page_offset is only used to calculate offset, they >> could be joined to avoid relying on the compiler to >> optimize it away: >> size_t offset = sector >> PAGE_SECTORS_SHIFT << PAGE_SHIFT; > > If you insist on doing the compiler's job for it, why not: > > size_t offset = sector >> (PAGE_SECTORS_SHIFT - PAGE_SHIFT); > > I actually think the original is clearer. > I wish you guys would actually review the correct code. In the actual good driver that has any shape of proper code all these issues are gone. * config defaults gone, multiple-devices multiple-memory ranges fully supported hot plug style. * above shifts cruft completely gone it is left overs from brd.c and its page usage. * getgeo fixed to do what we really want by the only application on earth that still uses it, fdisk. All other partitioners do not call it at all. Why are we reviewing dead code ? Cheers Boaz ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH 1/4] pmem: Initial version of persistent memory driver 2014-11-04 10:37 ` Boaz Harrosh @ 2014-11-04 16:26 ` Elliott, Robert (Server Storage) 2014-11-04 16:41 ` Ross Zwisler 0 siblings, 1 reply; 41+ messages in thread From: Elliott, Robert (Server Storage) @ 2014-11-04 16:26 UTC (permalink / raw) To: Boaz Harrosh, Wilcox, Matthew R, Ross Zwisler, Jens Axboe, Nick Piggin, Kani, Toshimitsu, Knippers, Linda, linux-fsdevel, linux-kernel, linux-nvdimm@lists.01.org, Matthew Wilcox > -----Original Message----- > From: Boaz Harrosh [mailto:boaz@plexistor.com] > Sent: Tuesday, 04 November, 2014 4:38 AM > To: Wilcox, Matthew R; Elliott, Robert (Server Storage); Ross > Zwisler; Jens Axboe; Nick Piggin; Kani, Toshimitsu; Knippers, Linda; > linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; linux- > nvdimm@lists.01.org; Matthew Wilcox > Subject: Re: [PATCH 1/4] pmem: Initial version of persistent memory > driver > > On 11/03/2014 06:19 PM, Wilcox, Matthew R wrote: ... > > I wish you guys would actually review the correct code. > > In the actual good driver that has any shape of proper code all these > issue are gone. > > * config defaults gone, multiple-devices multiple-memory ranges fully > supported hot plug style. > * above shifts cruft completely gone it is left overs from brd.c and > its page usage. > * getgeo fixed to do what we realy want by the only application on earth > that still uses it, fdisk. All other partitioners do not call it at > all. > > Why are we reviewing dead code ? > > Cheers > Boaz Ross, what's the status of Boaz' patches (available in git://git.open-osd.org/pmem.git)? https://github.com/01org/prd.git doesn't include any of them yet.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/4] pmem: Initial version of persistent memory driver 2014-11-04 16:26 ` Elliott, Robert (Server Storage) @ 2014-11-04 16:41 ` Ross Zwisler 2014-11-04 17:06 ` Boaz Harrosh 0 siblings, 1 reply; 41+ messages in thread From: Ross Zwisler @ 2014-11-04 16:41 UTC (permalink / raw) To: Elliott, Robert (Server Storage) Cc: Boaz Harrosh, Wilcox, Matthew R, Jens Axboe, Nick Piggin, Kani, Toshimitsu, Knippers, Linda, linux-fsdevel, linux-kernel, linux-nvdimm@lists.01.org, Matthew Wilcox, Williams, Dan J On Tue, 2014-11-04 at 16:26 +0000, Elliott, Robert (Server Storage) wrote: > > > -----Original Message----- > > From: Boaz Harrosh [mailto:boaz@plexistor.com] > > Sent: Tuesday, 04 November, 2014 4:38 AM > > To: Wilcox, Matthew R; Elliott, Robert (Server Storage); Ross > > Zwisler; Jens Axboe; Nick Piggin; Kani, Toshimitsu; Knippers, Linda; > > linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; linux- > > nvdimm@lists.01.org; Matthew Wilcox > > Subject: Re: [PATCH 1/4] pmem: Initial version of persistent memory > > driver > > > > On 11/03/2014 06:19 PM, Wilcox, Matthew R wrote: > ... > > > > I wish you guys would actually review the correct code. > > > > In the actual good driver that has any shape of proper code all these > > issue are gone. > > > > * config defaults gone, multiple-devices multiple-memory ranges fully > > supported hot plug style. > > * above shifts cruft completely gone it is left overs from brd.c and > > its page usage. > > * getgeo fixed to do what we realy want by the only application on earth > > that still uses it, fdisk. All other partitioners do not call it at > > all. > > > > Why are we reviewing dead code ? > > > > Cheers > > Boaz > > Ross, what's the status of Boaz' patches (available in > git://git.open-osd.org/pmem.git)? > > https://github.com/01org/prd.git doesn't include any of > them yet. Hey Robert, The UEFI organization is in the process of defining a generic specification for platform non-volatile memory resources. 
Essentially the thought was to wait until that was publicly available before adding any new device discovery capabilities to pmem. What Boaz has suggested and coded up is certainly useful, but the worry is that it will end up being incompatible with what comes out of UEFI. If we stay with the dead-simple module parameter method, we will have less code to unwind later. Thanks, - Ross ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/4] pmem: Initial version of persistent memory driver 2014-11-04 16:41 ` Ross Zwisler @ 2014-11-04 17:06 ` Boaz Harrosh 0 siblings, 0 replies; 41+ messages in thread From: Boaz Harrosh @ 2014-11-04 17:06 UTC (permalink / raw) To: Ross Zwisler, Elliott, Robert (Server Storage) Cc: Boaz Harrosh, Wilcox, Matthew R, Jens Axboe, Nick Piggin, Kani, Toshimitsu, Knippers, Linda, linux-fsdevel, linux-kernel, linux-nvdimm@lists.01.org, Matthew Wilcox, Williams, Dan J On 11/04/2014 06:41 PM, Ross Zwisler wrote: <> > > The UEFI organization is in the process of defining a generic specification > for platform non-volatile memory resources. Essentially the thought was to > wait until that was publicly available before adding any new device discovery > capabilities to pmem. > > What Boaz has suggested and coded up is certainly useful, but the worry is > that it will end up being incompatible with what comes out of UEFI. If we > stay with the dead-simple module parameter method, we will have less code to > unwind later. > What ?? What I coded up is exactly a "dead-simple module parameter method" that will not need to be changed in future, that is actually sane. Actually your version of the code needs changing with the global parameters, one range support, and same size devices. My version of the code is specifically made very probe() ready and dynamic ready. It was the point of it all. Including bugs and ugliness fixed. (I have a small patch that dynamically adds/removes devices on the fly through the same module-param interface) There is not a line in all of my patches that might be affected by that UEFI committee. (Again actually fixing what needed changing). And in addition to that committee and the new HW they will define, my system also supports all the current NvDIMMs in the market. I really can not understand what you guys are saying? Have you actually looked at the code and used it, compared to the unusable thing you guys have now ??
[Like you are saying "before adding any new device discovery capabilities" But I have not added any? All I did was fix and simplify the module-param interface. Which makes me think that you might not have looked at any of my fixes? ] > Thanks, > - Ross Cheers Boaz ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 2/4] pmem: Add support for getgeo() 2014-08-27 21:11 [PATCH 0/4] Add persistent memory driver Ross Zwisler 2014-08-27 21:11 ` [PATCH 1/4] pmem: Initial version of " Ross Zwisler @ 2014-08-27 21:12 ` Ross Zwisler 2014-11-02 3:27 ` Elliott, Robert (Server Storage) 2014-08-27 21:12 ` [PATCH 3/4] pmem: Add support for rw_page() Ross Zwisler ` (2 subsequent siblings) 4 siblings, 1 reply; 41+ messages in thread From: Ross Zwisler @ 2014-08-27 21:12 UTC (permalink / raw) To: Jens Axboe, Matthew Wilcox, Boaz Harrosh, Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm Cc: Ross Zwisler Some programs require HDIO_GETGEO work, which requires we implement getgeo. Based off of the work done to the NVMe driver in this commit: commit 4cc09e2dc4cb ("NVMe: Add getgeo to block ops") Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> --- drivers/block/pmem.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c index d366b9b..60bbe0d 100644 --- a/drivers/block/pmem.c +++ b/drivers/block/pmem.c @@ -50,6 +50,15 @@ struct pmem_device { size_t size; }; +static int pmem_getgeo(struct block_device *bd, struct hd_geometry *geo) +{ + /* some standard values */ + geo->heads = 1 << 6; + geo->sectors = 1 << 5; + geo->cylinders = get_capacity(bd->bd_disk) >> 11; + return 0; +} + /* * direct translation from (pmem,sector) => void* * We do not require that sector be page aligned. @@ -176,6 +185,7 @@ out: static const struct block_device_operations pmem_fops = { .owner = THIS_MODULE, + .getgeo = pmem_getgeo, }; /* Kernel module stuff */ -- 1.9.3 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* RE: [PATCH 2/4] pmem: Add support for getgeo() 2014-08-27 21:12 ` [PATCH 2/4] pmem: Add support for getgeo() Ross Zwisler @ 2014-11-02 3:27 ` Elliott, Robert (Server Storage) 2014-11-03 16:36 ` Wilcox, Matthew R 0 siblings, 1 reply; 41+ messages in thread From: Elliott, Robert (Server Storage) @ 2014-11-02 3:27 UTC (permalink / raw) To: Ross Zwisler, Jens Axboe, Matthew Wilcox, Boaz Harrosh, Nick Piggin, Kani, Toshimitsu, Knippers, Linda, linux-fsdevel, linux-kernel, linux-nvdimm@lists.01.org > -----Original Message----- > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel- > owner@vger.kernel.org] On Behalf Of Ross Zwisler > Sent: Wednesday, 27 August, 2014 4:12 PM > To: Jens Axboe; Matthew Wilcox; Boaz Harrosh; Nick Piggin; linux- > fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; linux- > nvdimm@lists.01.org > Cc: Ross Zwisler > Subject: [PATCH 2/4] pmem: Add support for getgeo() > > Some programs require HDIO_GETGEO work, which requires we implement > getgeo. Based off of the work done to the NVMe driver in this > commit: > > commit 4cc09e2dc4cb ("NVMe: Add getgeo to block ops") > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > --- > drivers/block/pmem.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c > index d366b9b..60bbe0d 100644 > --- a/drivers/block/pmem.c > +++ b/drivers/block/pmem.c > @@ -50,6 +50,15 @@ struct pmem_device { > size_t size; > }; > > +static int pmem_getgeo(struct block_device *bd, struct hd_geometry > *geo) > +{ > + /* some standard values */ > + geo->heads = 1 << 6; > + geo->sectors = 1 << 5; > + geo->cylinders = get_capacity(bd->bd_disk) >> 11; Just stuffing the result of get_capacity into the 16-bit cylinders field will overflow/wrap on large capacities. 0xFFFF << 11 = 0x7FF_F800 = 64 GiB (68.7 GB) How many programs still need these meaningless fields? 
Could the bogus information be created elsewhere so each block driver doesn't need to do this? > + return 0; > +} > + > /* > * direct translation from (pmem,sector) => void* > * We do not require that sector be page aligned. > @@ -176,6 +185,7 @@ out: > > static const struct block_device_operations pmem_fops = { > .owner = THIS_MODULE, > + .getgeo = pmem_getgeo, > }; > > /* Kernel module stuff */ > -- --- Rob Elliott HP Server Storage ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH 2/4] pmem: Add support for getgeo() 2014-11-02 3:27 ` Elliott, Robert (Server Storage) @ 2014-11-03 16:36 ` Wilcox, Matthew R 0 siblings, 0 replies; 41+ messages in thread From: Wilcox, Matthew R @ 2014-11-03 16:36 UTC (permalink / raw) To: Elliott, Robert (Server Storage), Ross Zwisler, Jens Axboe, Boaz Harrosh, Nick Piggin, Kani, Toshimitsu, Knippers, Linda, linux-fsdevel, linux-kernel, linux-nvdimm@lists.01.org, Matthew Wilcox I agree that there should be a generic fake getgeo routine; but fixing that is a topic for a different patchset and it doesn't need to get folded into this driver submission process. -----Original Message----- From: Elliott, Robert (Server Storage) [mailto:Elliott@hp.com] Sent: Saturday, November 01, 2014 8:28 PM To: Ross Zwisler; Jens Axboe; Wilcox, Matthew R; Boaz Harrosh; Nick Piggin; Kani, Toshimitsu; Knippers, Linda; linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; linux-nvdimm@lists.01.org Subject: RE: [PATCH 2/4] pmem: Add support for getgeo() > -----Original Message----- > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel- > owner@vger.kernel.org] On Behalf Of Ross Zwisler > Sent: Wednesday, 27 August, 2014 4:12 PM > To: Jens Axboe; Matthew Wilcox; Boaz Harrosh; Nick Piggin; linux- > fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; linux- > nvdimm@lists.01.org > Cc: Ross Zwisler > Subject: [PATCH 2/4] pmem: Add support for getgeo() > > Some programs require HDIO_GETGEO work, which requires we implement > getgeo. 
Based off of the work done to the NVMe driver in this > commit: > > commit 4cc09e2dc4cb ("NVMe: Add getgeo to block ops") > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > --- > drivers/block/pmem.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c > index d366b9b..60bbe0d 100644 > --- a/drivers/block/pmem.c > +++ b/drivers/block/pmem.c > @@ -50,6 +50,15 @@ struct pmem_device { > size_t size; > }; > > +static int pmem_getgeo(struct block_device *bd, struct hd_geometry > *geo) > +{ > + /* some standard values */ > + geo->heads = 1 << 6; > + geo->sectors = 1 << 5; > + geo->cylinders = get_capacity(bd->bd_disk) >> 11; Just stuffing the result of get_capacity into the 16-bit cylinders field will overflow/wrap on large capacities. 0xFFFF << 11 = 0x7FF_F800 = 64 GiB (68.7 GB) How many programs still need these meaningless fields? Could the bogus information be created elsewhere so each block driver doesn't need to do this? > + return 0; > +} > + > /* > * direct translation from (pmem,sector) => void* > * We do not require that sector be page aligned. > @@ -176,6 +185,7 @@ out: > > static const struct block_device_operations pmem_fops = { > .owner = THIS_MODULE, > + .getgeo = pmem_getgeo, > }; > > /* Kernel module stuff */ > -- --- Rob Elliott HP Server Storage ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 3/4] pmem: Add support for rw_page() 2014-08-27 21:11 [PATCH 0/4] Add persistent memory driver Ross Zwisler 2014-08-27 21:11 ` [PATCH 1/4] pmem: Initial version of " Ross Zwisler 2014-08-27 21:12 ` [PATCH 2/4] pmem: Add support for getgeo() Ross Zwisler @ 2014-08-27 21:12 ` Ross Zwisler 2014-08-27 21:12 ` [PATCH 4/4] pmem: Add support for direct_access() Ross Zwisler 2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and farther development (mm: add_persistent_memory) Boaz Harrosh 4 siblings, 0 replies; 41+ messages in thread From: Ross Zwisler @ 2014-08-27 21:12 UTC (permalink / raw) To: Jens Axboe, Matthew Wilcox, Boaz Harrosh, Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm Cc: Ross Zwisler Based on commit a72132c31d58 ("brd: add support for rw_page()") Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> --- drivers/block/pmem.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c index 60bbe0d..0be3669 100644 --- a/drivers/block/pmem.c +++ b/drivers/block/pmem.c @@ -183,8 +183,19 @@ out: bio_endio(bio, err); } +static int pmem_rw_page(struct block_device *bdev, sector_t sector, + struct page *page, int rw) +{ + struct pmem_device *pmem = bdev->bd_disk->private_data; + + pmem_do_bvec(pmem, page, PAGE_CACHE_SIZE, 0, rw, sector); + page_endio(page, rw & WRITE, 0); + return 0; +} + static const struct block_device_operations pmem_fops = { .owner = THIS_MODULE, + .rw_page = pmem_rw_page, .getgeo = pmem_getgeo, }; -- 1.9.3 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 4/4] pmem: Add support for direct_access() 2014-08-27 21:11 [PATCH 0/4] Add persistent memory driver Ross Zwisler ` (2 preceding siblings ...) 2014-08-27 21:12 ` [PATCH 3/4] pmem: Add support for rw_page() Ross Zwisler @ 2014-08-27 21:12 ` Ross Zwisler 2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and farther development (mm: add_persistent_memory) Boaz Harrosh 4 siblings, 0 replies; 41+ messages in thread From: Ross Zwisler @ 2014-08-27 21:12 UTC (permalink / raw) To: Jens Axboe, Matthew Wilcox, Boaz Harrosh, Nick Piggin, linux-fsdevel, linux-kernel, linux-nvdimm Cc: Ross Zwisler Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> --- drivers/block/pmem.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c index 0be3669..d63bc96 100644 --- a/drivers/block/pmem.c +++ b/drivers/block/pmem.c @@ -74,6 +74,15 @@ static void *pmem_lookup_pg_addr(struct pmem_device *pmem, sector_t sector) return pmem->virt_addr + offset; } +/* sector must be page aligned */ +static unsigned long pmem_lookup_pfn(struct pmem_device *pmem, sector_t sector) +{ + size_t page_offset = sector >> PAGE_SECTORS_SHIFT; + + BUG_ON(sector & (PAGE_SECTORS - 1)); + return (pmem->phys_addr >> PAGE_SHIFT) + page_offset; +} + /* * sector is not required to be page aligned. * n is at most a single page, but could be less. 
@@ -193,9 +202,24 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector, return 0; } +static long pmem_direct_access(struct block_device *bdev, sector_t sector, + void **kaddr, unsigned long *pfn, long size) +{ + struct pmem_device *pmem = bdev->bd_disk->private_data; + + if (!pmem) + return -ENODEV; + + *kaddr = pmem_lookup_pg_addr(pmem, sector); + *pfn = pmem_lookup_pfn(pmem, sector); + + return pmem->size - (sector * 512); +} + static const struct block_device_operations pmem_fops = { .owner = THIS_MODULE, .rw_page = pmem_rw_page, + .direct_access = pmem_direct_access, .getgeo = pmem_getgeo, }; -- 1.9.3 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 0/9] pmem: Fixes and farther development (mm: add_persistent_memory) 2014-08-27 21:11 [PATCH 0/4] Add persistent memory driver Ross Zwisler ` (3 preceding siblings ...) 2014-08-27 21:12 ` [PATCH 4/4] pmem: Add support for direct_access() Ross Zwisler @ 2014-09-09 15:37 ` Boaz Harrosh 2014-09-09 15:44 ` [PATCH 4/9] SQUASHME: pmem: Support of multiple memory regions Boaz Harrosh ` (4 more replies) 4 siblings, 5 replies; 41+ messages in thread From: Boaz Harrosh @ 2014-09-09 15:37 UTC (permalink / raw) To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, Dave Hansen, linux-mm Cc: Andrew Morton, linux-kernel On 08/28/2014 12:11 AM, Ross Zwisler wrote: > PMEM is a modified version of the Block RAM Driver, BRD. The major difference > is that BRD allocates its backing store pages from the page cache, whereas > PMEM uses reserved memory that has been ioremapped. > > One benefit of this approach is that there is a direct mapping between > filesystem block numbers and virtual addresses. In PMEM, filesystem blocks N, > N+1, N+2, etc. will all be adjacent in the virtual memory space. This property > allows us to set up PMD mappings (2 MiB) for DAX. > > This patch set is builds upon the work that Matthew Wilcox has been doing for > DAX: > Let us not submit a driver with the wrong user visible API. Lets submit the better API (and structure) I have sent. 
> https://lkml.org/lkml/2014/8/27/31 > > Specifically, my implementation of pmem_direct_access() in patch 4/4 uses API > enhancements introduced in Matthew's DAX patch v10 02/21: > > https://lkml.org/lkml/2014/8/27/48 > > Ross Zwisler (4): > pmem: Initial version of persistent memory driver > pmem: Add support for getgeo() > pmem: Add support for rw_page() > pmem: Add support for direct_access() > On top of the 4 patches above, here is a list of changes: [PATCH 1/9] SQUASHME: pmem: Remove unused #include headers [PATCH 2/9] SQUASHME: pmem: Request from fdisk 4k alignment [PATCH 3/9] SQUASHME: pmem: Let each device manage private memory region [PATCH 4/9] SQUASHME: pmem: Support of multiple memory regions These 4 need to be squashed into Ross's [patch 1/4] pmem: Initial version of persistent memory driver; see below for a suggested new patch. [PATCH 5/9 v2] mm: Let sparse_{add,remove}_one_section receive a node_id [PATCH 6/9 v2] mm: New add_persistent_memory/remove_persistent_memory [PATCH 7/9 v2] pmem: Add support for page structs These need review by Toshi and the mm people. 
[PATCH 8/9] SQUASHME: pmem: Fixs to getgeo [PATCH 9/9] pmem: KISS, remove register_blkdev And some more development atop the initial version All these patches can be viewed in this tree/branch: git://git.open-osd.org/pmem.git branch pmem-jens-3.17-rc1 [http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=shortlog;h=refs/heads/pmem-jens-3.17-rc1] I have also prepared a new branch *pmem* which is already SQUASHED And has my suggested changed commit logs for the combined patches here is the commit-log: aa85c80 Boaz Harrosh | pmem: KISS, remove register_blkdev 738203c Boaz Harrosh | pmem: Add support for page structs 9f50a54 Boaz Harrosh | mm: New add_persistent_memory/remove_persistent_memory fdfab12 Yigal Korman | mm: Let sparse_{add,remove}_one_section receive a node_id a477a87 Ross Zwisler | pmem: Add support for direct_access() 316a93a Ross Zwisler | pmem: Add support for rw_page() 6850353 Boaz Harrosh | SQUASHME: pmem: Fixs to getgeo d78a84a Ross Zwisler | pmem: Add support for getgeo() bb0eb45 Ross Zwisler | pmem: Initial version of persistent memory driver All these patches can be viewed in this tree/branch: git://git.open-osd.org/pmem.git branch pmem [http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=shortlog;h=refs/heads/pmem] Specifically the first [bb0eb45] is needed so first version can be released with the proper user visible API. Ross please consider taking these patches (pmem branch) in your tree for submission? Thanks Boaz ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 4/9] SQUASHME: pmem: Support of multiple memory regions 2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and farther development (mm: add_persistent_memory) Boaz Harrosh @ 2014-09-09 15:44 ` Boaz Harrosh 2014-09-09 15:45 ` [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh ` (3 subsequent siblings) 4 siblings, 0 replies; 41+ messages in thread From: Boaz Harrosh @ 2014-09-09 15:44 UTC (permalink / raw) To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm Cc: Andrew Morton, linux-kernel From: Boaz Harrosh <boaz@plexistor.com> After the last patch this is easy. The API to pmem module is changed. We now have a single string parameter named "map" of the form: map=mapS[,mapS...] where mapS=nn[KMG]$ss[KMG], or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset Just like the Kernel command line map && memmap parameters, so anything you did at grub just copy/paste to here. The "@" form is exactly the same as the "$" form only that at bash prompt we need to escape the "$" with \$ so also support the '@' char for convenience. For each specified mapS there will be a device created. So needless to say that all the previous prd_XXX params are removed as well as the Kconfig defaults. Signed-off-by: Boaz Harrosh <boaz@plexistor.com> --- drivers/block/Kconfig | 28 ----------------------- drivers/block/pmem.c | 62 ++++++++++++++++++++++++++++++++++++--------------- 2 files changed, 44 insertions(+), 46 deletions(-) diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index 2d50125..5da8cbf 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -416,34 +416,6 @@ config BLK_DEV_PMEM Most normal users won't need this functionality, and can thus say N here. -config BLK_DEV_PMEM_START - int "Offset in GiB of where to start claiming space" - default "0" - depends on BLK_DEV_PMEM - help - Starting offset in GiB that PMEM should use when claiming memory. 
This - memory needs to be reserved from the OS at boot time using the - "memmap" kernel parameter. - - If you provide PMEM with volatile memory it will act as a volatile - RAM disk and your data will not be persistent. - -config BLK_DEV_PMEM_COUNT - int "Default number of PMEM disks" - default "4" - depends on BLK_DEV_PMEM - help - Number of equal sized block devices that PMEM should create. - -config BLK_DEV_PMEM_SIZE - int "Size in GiB of space to claim" - depends on BLK_DEV_PMEM - default "0" - help - Amount of memory in GiB that PMEM should use when creating block - devices. This memory needs to be reserved from the OS at - boot time using the "memmap" kernel parameter. - config CDROM_PKTCDVD tristate "Packet writing on CD/DVD media" depends on !UML diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c index d5c5c52..e07a373 100644 --- a/drivers/block/pmem.c +++ b/drivers/block/pmem.c @@ -212,17 +212,11 @@ static const struct block_device_operations pmem_fops = { }; /* Kernel module stuff */ -static int pmem_start_gb = CONFIG_BLK_DEV_PMEM_START; -module_param(pmem_start_gb, int, S_IRUGO); -MODULE_PARM_DESC(pmem_start_gb, "Offset in GB of where to start claiming space"); - -static int pmem_size_gb = CONFIG_BLK_DEV_PMEM_SIZE; -module_param(pmem_size_gb, int, S_IRUGO); -MODULE_PARM_DESC(pmem_size_gb, "Total size in GB of space to claim for all disks"); - -static int pmem_count = CONFIG_BLK_DEV_PMEM_COUNT; -module_param(pmem_count, int, S_IRUGO); -MODULE_PARM_DESC(pmem_count, "Number of pmem devices to evenly split allocated space"); +static char *map; +module_param(map, charp, S_IRUGO); +MODULE_PARM_DESC(map, + "pmem device mapping: map=mapS[,mapS...] 
where:\n" + "mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset."); static LIST_HEAD(pmem_devices); static int pmem_major; @@ -272,6 +266,13 @@ static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size, struct gendisk *disk; int err; + if (unlikely((phys_addr & ~PAGE_MASK) || (disk_size & ~PAGE_MASK))) { + pr_err("phys_addr=0x%llx disk_size=0x%zx must be 4k aligned\n", + phys_addr, disk_size); + err = -EINVAL; + goto out; + } + pmem = kzalloc(sizeof(*pmem), GFP_KERNEL); if (unlikely(!pmem)) { err = -ENOMEM; @@ -344,16 +345,32 @@ static void pmem_del_one(struct pmem_device *pmem) pmem_free(pmem); } +static int pmem_parse_map_one(char *map, phys_addr_t *start, size_t *size) +{ + char *p = map; + + *size = (size_t)memparse(p, &p); + if ((p == map) || ((*p != '$') && (*p != '@'))) + return -EINVAL; + + if (!*(++p)) + return -EINVAL; + + *start = (phys_addr_t)memparse(p, &p); + + return *p == '\0' ? 0 : -EINVAL; +} + static int __init pmem_init(void) { int result, i; struct pmem_device *pmem, *next; - phys_addr_t phys_addr; - size_t total_size, disk_size; + char *p, *pmem_map = map; - phys_addr = (phys_addr_t) pmem_start_gb * 1024 * 1024 * 1024; - total_size = (size_t) pmem_size_gb * 1024 * 1024 * 1024; - disk_size = total_size / pmem_count; + if (!pmem_map) { + pr_err("pmem: must specify map=nn@ss parameter.\n"); + return -EINVAL; + } result = register_blkdev(0, "pmem"); if (result < 0) @@ -361,14 +378,23 @@ static int __init pmem_init(void) else pmem_major = result; - for (i = 0; i < pmem_count; i++) { + i = 0; + while ((p = strsep(&pmem_map, ",")) != NULL) { + phys_addr_t phys_addr; + size_t disk_size; + + if (!*p) + continue; + result = pmem_parse_map_one(p, &phys_addr, &disk_size); + if (result) + goto out_free; pmem = pmem_alloc(phys_addr, disk_size, i); if (IS_ERR(pmem)) { result = PTR_ERR(pmem); goto out_free; } list_add_tail(&pmem->pmem_list, &pmem_devices); - phys_addr += disk_size; + ++i; } list_for_each_entry(pmem, 
&pmem_devices, pmem_list) -- 1.9.3 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id 2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and farther development (mm: add_persistent_memory) Boaz Harrosh 2014-09-09 15:44 ` [PATCH 4/9] SQUASHME: pmem: Support of multiple memory regions Boaz Harrosh @ 2014-09-09 15:45 ` Boaz Harrosh 2014-09-09 18:36 ` Dave Hansen 2014-09-09 15:47 ` [PATCH 6/9] mm: New add_persistent_memory/remove_persistent_memory Boaz Harrosh ` (2 subsequent siblings) 4 siblings, 1 reply; 41+ messages in thread From: Boaz Harrosh @ 2014-09-09 15:45 UTC (permalink / raw) To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, Dave Hansen, linux-mm Cc: Andrew Morton, linux-kernel From: Yigal Korman <yigal@plexistor.com> Refactored the arguments of sparse_add_one_section / sparse_remove_one_section to use node id instead of struct zone * - A memory section has no direct connection to zones, all that was needed from zone was the node id. This is for add_persistent_memory that will want a section of pages allocated but without any zone associated. This is because belonging to a zone will give the memory to the page allocators, but persistent_memory belongs to a block device, and is not available for regular volatile usage. 
Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> --- include/linux/memory_hotplug.h | 4 ++-- mm/memory_hotplug.c | 4 ++-- mm/sparse.c | 9 +++++---- 3 files changed, 9 insertions(+), 8 deletions(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index d9524c4..35ca1bb 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -264,8 +264,8 @@ extern int arch_add_memory(int nid, u64 start, u64 size); extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages); extern bool is_memblock_offlined(struct memory_block *mem); extern void remove_memory(int nid, u64 start, u64 size); -extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn); -extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms); +extern int sparse_add_one_section(int nid, unsigned long start_pfn); +extern void sparse_remove_one_section(int nid, struct mem_section *ms); extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pnum); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 2ff8c23..e556a90 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -471,7 +471,7 @@ static int __meminit __add_section(int nid, struct zone *zone, if (pfn_valid(phys_start_pfn)) return -EEXIST; - ret = sparse_add_one_section(zone, phys_start_pfn); + ret = sparse_add_one_section(zone->zone_pgdat->node_id, phys_start_pfn); if (ret < 0) return ret; @@ -737,7 +737,7 @@ static int __remove_section(struct zone *zone, struct mem_section *ms) start_pfn = section_nr_to_pfn(scn_nr); __remove_zone(zone, start_pfn); - sparse_remove_one_section(zone, ms); + sparse_remove_one_section(zone->zone_pgdat->node_id, ms); return 0; } diff --git a/mm/sparse.c b/mm/sparse.c index d1b48b6..12a10ab 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -690,10 +690,10 @@ static void free_map_bootmem(struct page *memmap) * set. 
If this is <=0, then that means that the passed-in * map was not consumed and must be freed. */ -int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn) +int __meminit sparse_add_one_section(int nid, unsigned long start_pfn) { unsigned long section_nr = pfn_to_section_nr(start_pfn); - struct pglist_data *pgdat = zone->zone_pgdat; + struct pglist_data *pgdat = NODE_DATA(nid); struct mem_section *ms; struct page *memmap; unsigned long *usemap; @@ -788,11 +788,11 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap) free_map_bootmem(memmap); } -void sparse_remove_one_section(struct zone *zone, struct mem_section *ms) +void sparse_remove_one_section(int nid, struct mem_section *ms) { struct page *memmap = NULL; unsigned long *usemap = NULL, flags; - struct pglist_data *pgdat = zone->zone_pgdat; + struct pglist_data *pgdat = NODE_DATA(nid); pgdat_resize_lock(pgdat, &flags); if (ms->section_mem_map) { @@ -807,5 +807,6 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms) clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION); free_section_usemap(memmap, usemap); } + #endif /* CONFIG_MEMORY_HOTREMOVE */ #endif /* CONFIG_MEMORY_HOTPLUG */ -- 1.9.3 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id 2014-09-09 15:45 ` [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh @ 2014-09-09 18:36 ` Dave Hansen 2014-09-10 10:07 ` Boaz Harrosh 0 siblings, 1 reply; 41+ messages in thread From: Dave Hansen @ 2014-09-09 18:36 UTC (permalink / raw) To: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm Cc: Andrew Morton, linux-kernel On 09/09/2014 08:45 AM, Boaz Harrosh wrote: > This is for add_persistent_memory that will want a section of pages > allocated but without any zone associated. This is because belonging > to a zone will give the memory to the page allocators, but > persistent_memory belongs to a block device, and is not available for > regular volatile usage. I don't think we should be taking patches like this in to the kernel until we've seen the other side of it. Where is the page allocator code which will see a page belonging to no zone? Am I missing it in this set? I see about 80 or so calls to page_zone() in the kernel. How will a zone-less page look to all of these sites? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id 2014-09-09 18:36 ` Dave Hansen @ 2014-09-10 10:07 ` Boaz Harrosh 2014-09-10 16:10 ` Dave Hansen 0 siblings, 1 reply; 41+ messages in thread From: Boaz Harrosh @ 2014-09-10 10:07 UTC (permalink / raw) To: Dave Hansen, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm Cc: Andrew Morton, linux-kernel On 09/09/2014 09:36 PM, Dave Hansen wrote: > On 09/09/2014 08:45 AM, Boaz Harrosh wrote: >> This is for add_persistent_memory that will want a section of pages >> allocated but without any zone associated. This is because belonging >> to a zone will give the memory to the page allocators, but >> persistent_memory belongs to a block device, and is not available for >> regular volatile usage. > > I don't think we should be taking patches like this in to the kernel > until we've seen the other side of it. Where is the page allocator code > which will see a page belonging to no zone? Am I missing it in this set? > It is not missing. It will never be. These pages do not belong to any allocator. They are not allocatable pages. In fact, they are not "memory"; they are "storage". These pages belong wholly to a block device. In turn, the block device grants ownership of a partition of these pages to an FS. The loaded FS has its own block allocation scheme, which internally circulates each page's usage. But a page never goes beyond its FS. > I see about 80 or so calls to page_zone() in the kernel. How will a > zone-less page look to all of these sites? > None of these 80 call sites will be reached! The pages are always used below the FS: sent on the network, or sent to a slower block device via a BIO. I have a full-fledged FS on top of this code and it all works very smoothly and stably. (And fast ;)) It is up to the pmem-based FS to manage its pages' ref counts so they are never released outside of its own block allocator. 
At the end of the day, struct page has nothing to do with zones, allocators, and "memory"; as the Documentation says, struct page is a facility to track the state of a physical page in the system. All the other structures sit higher in the stack, above the physical layer; for me, struct page is the upper API of the memory physical layer, which pmem shares. Higher on the stack, where memory has a zone, pmem has a block device. Where we have page allocators, pmem has an FS block allocator; where we have a slab, pmem has files for user consumption. pmem is storage that shares the physical layer with memory, and this is what this patch describes. There will be no more mm interaction at all for pmem. The rest of the picture is all there in plain sight as part of this patchset: the pmem.c driver, then an FS on top of that. What else do you need to see? Thanks Boaz ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id 2014-09-10 10:07 ` Boaz Harrosh @ 2014-09-10 16:10 ` Dave Hansen 2014-09-10 17:25 ` Boaz Harrosh 0 siblings, 1 reply; 41+ messages in thread From: Dave Hansen @ 2014-09-10 16:10 UTC (permalink / raw) To: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm Cc: Andrew Morton, linux-kernel On 09/10/2014 03:07 AM, Boaz Harrosh wrote: > On 09/09/2014 09:36 PM, Dave Hansen wrote: >> On 09/09/2014 08:45 AM, Boaz Harrosh wrote: >>> This is for add_persistent_memory that will want a section of pages >>> allocated but without any zone associated. This is because belonging >>> to a zone will give the memory to the page allocators, but >>> persistent_memory belongs to a block device, and is not available for >>> regular volatile usage. >> >> I don't think we should be taking patches like this in to the kernel >> until we've seen the other side of it. Where is the page allocator code >> which will see a page belonging to no zone? Am I missing it in this set? > > It is not missing. It will never be. > > These pages do not belong to any allocator. They are not allocate-able > pages. In fact they are not "memory" they are "storage" > > These pages belong wholesomely to a block-device. In turn the block > device grants ownership of a partition of this pages to an FS. > The FS loaded has its own block allocation schema. Which internally > circulate each pages usage around. But the page never goes beyond its > FS. I'm mostly worried about things that start with an mmap(). Imagine you mmap() a persistent memory file, fault some pages in, then 'cat /proc/$pid/numa_maps'. That code will look at the page to see which zone and node it is in. Or, consider if you mmap() then put a futex in the page. The page will have get_user_pages() called on it by the futex code, and a reference taken. The reference can outlast the mmap(). 
We either have to put the file somewhere special and scan the page's reference occasionally, or we need to hook something under put_page() to make sure that we keep the page out of the normal allocator. >> I see about 80 or so calls to page_zone() in the kernel. How will a >> zone-less page look to all of these sites? > > None of these 80 call site will be reached! the pages are always used > below the FS, like send them on the network, or send them to a slower > block device via a BIO. I have a full fledge FS on top of this code > and it all works very smoothly, and stable. (And fast ;)) Does the fs support mmap()? The idea of layering is a nice one, but mmap() is a big fat layering violation. :) ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id 2014-09-10 16:10 ` Dave Hansen @ 2014-09-10 17:25 ` Boaz Harrosh 2014-09-10 18:28 ` Dave Hansen 0 siblings, 1 reply; 41+ messages in thread From: Boaz Harrosh @ 2014-09-10 17:25 UTC (permalink / raw) To: Dave Hansen, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm Cc: Andrew Morton, linux-kernel On 09/10/2014 07:10 PM, Dave Hansen wrote: > On 09/10/2014 03:07 AM, Boaz Harrosh wrote: >> On 09/09/2014 09:36 PM, Dave Hansen wrote: >>> On 09/09/2014 08:45 AM, Boaz Harrosh wrote: >>>> This is for add_persistent_memory that will want a section of pages >>>> allocated but without any zone associated. This is because belonging >>>> to a zone will give the memory to the page allocators, but >>>> persistent_memory belongs to a block device, and is not available for >>>> regular volatile usage. >>> >>> I don't think we should be taking patches like this in to the kernel >>> until we've seen the other side of it. Where is the page allocator code >>> which will see a page belonging to no zone? Am I missing it in this set? >> >> It is not missing. It will never be. >> >> These pages do not belong to any allocator. They are not allocate-able >> pages. In fact they are not "memory" they are "storage" >> >> These pages belong wholesomely to a block-device. In turn the block >> device grants ownership of a partition of this pages to an FS. >> The FS loaded has its own block allocation schema. Which internally >> circulate each pages usage around. But the page never goes beyond its >> FS. > > I'm mostly worried about things that start with an mmap(). > > Imagine you mmap() a persistent memory file, fault some pages in, then > 'cat /proc/$pid/numa_maps'. That code will look at the page to see > which zone and node it is in. > > Or, consider if you mmap() then put a futex in the page. 
The page will > have get_user_pages() called on it by the futex code, and a reference > taken. The reference can outlast the mmap(). We either have to put the > file somewhere special and scan the page's reference occasionally, or we > need to hook something under put_page() to make sure that we keep the > page out of the normal allocator. > Yes, the block allocator of the pmem-FS always holds the final reference on this page as long as there is valid data on this block. Even across boots, the mount code re-initializes references. The only internal state that frees these blocks is truncate, which only then returns these pages to the block allocator; all this is common practice in filesystems, so the page-ref on these blocks only ever drops to zero after they lose all visibility. And yes, the block allocator uses special code to drop the count to zero without using put_page(). So there is no chance these pages will ever be presented to page allocators through a put_page(). BTW: there is a hook in place that can be used today, by calling SetPagePrivate(page) and setting a .release function on the page->mapping->a_ops. If .release() returns false, the page is not released (and can be added to an internal queue for garbage collection). But with the above scheme this is not needed at all. I have yet to find a test that keeps my free_block reference above 1, at which time I will exercise a garbage collection queue. >>> I see about 80 or so calls to page_zone() in the kernel. How will a >>> zone-less page look to all of these sites? >> >> None of these 80 call site will be reached! the pages are always used >> below the FS, like send them on the network, or send them to a slower >> block device via a BIO. I have a full fledge FS on top of this code >> and it all works very smoothly, and stable. (And fast ;)) > Does the fs support mmap()? > > The idea of layering is a nice one, but mmap() is a big fat layering > violation. :) > No! Yes, the FS supports mmap, but through the DAX patchset. 
Please see Matthew's DAX patchset for how he implements mmap without using pages at all: direct PFN to virtual address. So these pages do not get exposed to the top of the FS. My FS uses his techniques exactly; only when it wants to spill over to a slower device will it use these pages, copy-less. Cheers Boaz ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id 2014-09-10 17:25 ` Boaz Harrosh @ 2014-09-10 18:28 ` Dave Hansen 2014-09-11 8:39 ` Boaz Harrosh 0 siblings, 1 reply; 41+ messages in thread From: Dave Hansen @ 2014-09-10 18:28 UTC (permalink / raw) To: Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm Cc: Andrew Morton, linux-kernel On 09/10/2014 10:25 AM, Boaz Harrosh wrote: > Yes, the block allocator of the pmem-FS always holds the final reference on this > page, as long as there is valid data on this block. Even across boots, the > mount code re-initializes references. The only internal state that frees > these blocks is truncate, which only then returns these pages to the block > allocator. All this is common practice in filesystems, so the page-ref on > these blocks only ever drops to zero after they lose all visibility. And > yes, the block allocator uses special code to drop the count to zero, > not put_page(). OK, so what happens when a page is truncated out of a file and this "last" block reference is dropped while a get_user_pages() still has a reference? > On 09/10/2014 07:10 PM, Dave Hansen wrote: >> Does the fs support mmap()? >> > No! > > Yes the FS supports mmap, but through the DAX patchset. Please see > Matthew's DAX patchset for how he implements mmap without using pages > at all: direct PFN to virtual address. So these pages do not get exposed > to the top of the FS. > > My FS uses his techniques exactly; only when it wants to spill over to a > slower device will it use these pages, copy-less. From my perspective, DAX is complicated, but it is necessary because we don't have a 'struct page'. You're saying that even if we pay the cost of a 'struct page' for the memory, we still don't get the benefits of having it, like getting rid of this DAX stuff? Also, about not having a zone for these pages. Do you intend to support 32-bit systems? 
If so, I believe you will require the kmap() family of functions to map the pages in order to copy data in and out. kmap() currently requires knowing the zone of the page. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id 2014-09-10 18:28 ` Dave Hansen @ 2014-09-11 8:39 ` Boaz Harrosh 2014-09-11 17:07 ` Dave Hansen 0 siblings, 1 reply; 41+ messages in thread From: Boaz Harrosh @ 2014-09-11 8:39 UTC (permalink / raw) To: Dave Hansen, Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm Cc: Andrew Morton, linux-kernel On 09/10/2014 09:28 PM, Dave Hansen wrote: <> > > OK, so what happens when a page is truncated out of a file and this > "last" block reference is dropped while a get_user_pages() still has a > reference? > I have a very simple plan for this scenario: as I said, hang these pages with ref != 1 on a garbage list, and one of the cleaner threads can scan them periodically and release them. I have this test in place; currently what I do is just drop the block and let it leak (that is, not be used any more) until the next mount, where it will be returned to the free store. Yes, stupid, I know. But I print a big fat message when this happens and I have not been able to reproduce it. So I'm still waiting for this test case; I guess DAX protects me. <> > From my perspective, DAX is complicated, but it is necessary because we > don't have a 'struct page'. You're saying that even if we pay the cost > of a 'struct page' for the memory, we still don't get the benefit of > having it like getting rid of this DAX stuff? > No, DAX is still necessary, because we map storage directly to app space and we still need it persistent. That is, we cannot (and need not) use an in-RAM radix tree, but directly use on-storage btrees. The regular VFS has a two-tier model: volatile RAM over a persistent store. DAX is an alternative VFS model where you have a single tier; the name implies "Direct Access". So this has nothing to do with page cost or "benefit". DAX is about a new VFS model for new storage technologies. 
And please note, the complexity you are talking about is just a learning curve, on the developers' side, not a technological one. Actually if you compare the two models, let's call them VFS-2t and VFS-1t, you see that DAX is an order of magnitude simpler than the old model. Life is hard and we do need the two models at the same time, to support all these different devices. So yes, the complexity is added with the added choice. But please do not confuse: DAX is not the complicated part. Having a choice is. > Also, about not having a zone for these pages. Do you intend to support > 32-bit systems? If so, I believe you will require the kmap() family of > functions to map the pages in order to copy data in and out. kmap() > currently requires knowing the zone of the page. No!!! This is strictly 64-bit. A 32-bit system can have at maximum 3 GiB of low-ram + storage. DAX implies always mapped, that is, no re-mapping. So this rules out more than a GiB of storage. Since that is a joke, then no! 32-bit is out. You need to understand that the current HW standard talks about DDR4, and there are DDR3 samples floating around. So this is strictly 64-bit, even on phones. Thanks Boaz ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id 2014-09-11 8:39 ` Boaz Harrosh @ 2014-09-11 17:07 ` Dave Hansen 2014-09-14 9:36 ` Boaz Harrosh 0 siblings, 1 reply; 41+ messages in thread From: Dave Hansen @ 2014-09-11 17:07 UTC (permalink / raw) To: Boaz Harrosh, Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm Cc: Andrew Morton, linux-kernel On 09/11/2014 01:39 AM, Boaz Harrosh wrote: > On 09/10/2014 09:28 PM, Dave Hansen wrote: >> OK, so what happens when a page is truncated out of a file and this >> "last" block reference is dropped while a get_user_pages() still has a >> reference? > > I have a very simple plan for this scenario, as I said, hang these pages > with ref!=1 on a garbage list, and one of the clear threads can scan them > periodically and release them. > > I have this test in place, currently what I do is just drop the block > and let it leak (that is, not be used any more) until a next mount where > this will be returned to free store. Yes stupid I know. But I have a big > fat message when this happens and I have not been able to reproduce it. > So I'm still waiting for this test case, I guess DAX protects me. OK, that sounds like it will work. The "leaked until the next mount" sounds disastrous, but I'm sure you'll fix that. I can see how it might lead to some fragmentation if only small amounts are ever pinned, but not a deal-breaker. >> From my perspective, DAX is complicated, but it is necessary because we >> don't have a 'struct page'. You're saying that even if we pay the cost >> of a 'struct page' for the memory, we still don't get the benefit of >> having it like getting rid of this DAX stuff? > > No DAX is still necessary because we map storage directly to app space, > and we still need it persistent. That is we can-not/need-not use an > in-ram radix tree but directly use on-storage btrees. Huh? 
We obviously don't need/want persistent memory pages in the page *cache*. But, that's completely orthogonal to _having_ a 'struct page' for them. DAX does two major things: 1. avoids needing the page cache 2. creates "raw" page table entries that the VM does not manage for mmap()s I'm not saying to put persistent memory in the page cache. I'm saying that, if we have a 'struct page' for the memory, we should try to make the mmap()s more normal. This enables all kinds of things that DAX does not support today, like direct I/O. > Life is hard and we do need the two models all at the same time, to support > all these different devices. So yes the complexity is added with the added > choice. But please do not confuse, DAX is not the complicated part. Having > a Choice is. Great, so we at least agree that this adds complexity. >> Also, about not having a zone for these pages. Do you intend to support >> 32-bit systems? If so, I believe you will require the kmap() family of >> functions to map the pages in order to copy data in and out. kmap() >> currently requires knowing the zone of the page. > > No!!! This is strictly 64 bit. A 32bit system is able to have at maximum > 3Gb of low-ram + storage. > DAX implies always mapped. That is, no re-mapping. So this rules out > more then a G of storage. Since that is a joke then No! 32bit is out. > > You need to understand current HW std talks about DDR4 and there are > DDR3 samples flouting around. So this is strictly 64bit, even on > phones. OK, so I think I at least understand the scope of the patch set and the limitations. I think I've summarized the limitations: 1. Approach requires all of RAM+Pmem to be direct-mapped (rules out almost all 32-bit systems, or any 64-bit systems with more than 64TB of RAM+pmem-storage) 2. Approach is currently incompatible with some kernel code that requires a 'struct page' (such as direct I/O), and all kernel code that requires knowledge of zones or NUMA nodes. 3. 
Approach requires 1/64 of the amount of storage to be consumed by RAM for a pseudo 'struct page'. If you had 64GB of storage and 1GB of RAM, you would simply run out of RAM. Did I miss any? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id 2014-09-11 17:07 ` Dave Hansen @ 2014-09-14 9:36 ` Boaz Harrosh 0 siblings, 0 replies; 41+ messages in thread From: Boaz Harrosh @ 2014-09-14 9:36 UTC (permalink / raw) To: Dave Hansen, Boaz Harrosh, Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, linux-mm Cc: Andrew Morton, linux-kernel On 09/11/2014 08:07 PM, Dave Hansen wrote: <> > > OK, that sounds like it will work. The "leaked until the next mount" > sounds disastrous, but I'm sure you'll fix that. I can see how it might > lead to some fragmentation if only small amounts are ever pinned, but > not a deal-breaker. > There is no such thing as fragmentation with memory-mapped storage ;-) <> > I'm saying that, if we have a 'struct page' for the memory, we should > try to make the mmap()s more normal. This enables all kinds of things > that DAX does not support today, like direct I/O. > What? No! Direct I/O is fully supported, including all of its APIs. Do you mean open(O_DIRECT) and io_submit(..)? Yes, it is fully supported. In fact all IO is direct IO; there is never a page-cache in the way, hence "direct". BTW: these patches enable something else. Say FSA is DAX and FSB is a regular disk FS, then: fda = open(/mnt/FSA); pa = mmap(fda, ...); fdb = open(/mnt/FSB, O_DIRECT); io_submit(fdb,..,pa ,..); /* I mean pa is put for IO into the passed iocb for fdb */ Before these patches the above will not work and reverts to buffered IO, but with these patches it will work. Please note this is true for the submitted pmem driver. With brd, which also supports DAX, this will work, because brd always uses pages. <> > Great, so we at least agree that this adds complexity. > But the complexity is already there; DAX by Matthew is to go in soon, I hope. Surely these added pages do not add to that complexity much. <> > > OK, so I think I at least understand the scope of the patch set and the limitations. 
I think I've summarized the limitations: > > 1. Approach requires all of RAM+Pmem to be direct-mapped (rules out > almost all 32-bit systems, or any 64-bit systems with more than 64TB > of RAM+pmem-storage) Yes, for NOW. > 2. Approach is currently incompatible with some kernel code that > requires a 'struct page' (such as direct I/O), and all kernel code > that requires knowledge of zones or NUMA nodes. NO! Direct IO - supported. NUMA - supported. "All kernel code that requires knowledge of zones" - not needed. > 3. Approach requires 1/64 of the amount of storage to be consumed by > RAM for a pseudo 'struct page'. If you had 64GB of storage and 1GB > of RAM, you would simply run out of RAM. > Yes, so in a system as above with 64GB of pmem, 1GB of pmem will need to be set aside and hotplugged as volatile memory. This already works today, BTW: you can set aside a portion of an NvDIMM and hotplug it as system memory. We are already used to paying that ratio for RAM; by a kernel-config choice the same ratio can also be paid for pmem. This is why I left it a configuration option. > Did I miss any? > Thanks Boaz ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 6/9] mm: New add_persistent_memory/remove_persistent_memory 2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and farther development (mm: add_persistent_memory) Boaz Harrosh 2014-09-09 15:44 ` [PATCH 4/9] SQUASHME: pmem: Support of multiple memory regions Boaz Harrosh 2014-09-09 15:45 ` [PATCH 5/9] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh @ 2014-09-09 15:47 ` Boaz Harrosh 2014-09-09 15:48 ` [PATCH 7/9] pmem: Add support for page structs Boaz Harrosh 2014-09-09 15:51 ` [PATCH 9/9] pmem: KISS, remove register_blkdev Boaz Harrosh 4 siblings, 0 replies; 41+ messages in thread From: Boaz Harrosh @ 2014-09-09 15:47 UTC (permalink / raw) To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, Dave Hansen, linux-mm Cc: Andrew Morton, linux-kernel From: Boaz Harrosh <boaz@plexistor.com> Persistent Memory is not Memory. It is not presented as a memory zone and is not available through the page allocators for application/kernel volatile usage. It belongs to a block device just as any other persistent storage; the novelty here is that it is directly mapped on the CPU memory bus, and is usually as fast or almost as fast as system RAM. The main motivation of add_persistent_memory is to allocate a page-struct "section" for a given physical memory region. This is because the user of this memory-mapped device might need to pass page-structs of this memory to a Kernel subsystem like the block layer or networking and have its content directly DMAed to other devices. (For example, these pages can be put on a bio and sent to disk in a copy-less manner.) It will also request_mem_region_exclusive(.., "persistent_memory") to own that physical memory region, and will map that physical region into the Kernel's VM at the address expected for page_address() of the pages allocated above. remove_persistent_memory() must be called to undo what add_persistent_memory did. 
A user of this API will then use pfn_to_page(PHYSICAL_ADDR >> PAGE_SHIFT) to receive a page-struct for use on its pmem. Any operation like page_address(), page_to_pfn(), lock_page(), ... can be performed on these pages just as usual. An example user is presented in the next patch, to the pmem.c block device driver. (There are 3 more such drivers in the Kernel that could use this API.) This patch is based on research and a patch made by Yigal Korman <yigal@plexistor.com> to the pmem driver. I took his code and adapted it to mm, where it belongs. Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Boaz Harrosh <boaz@plexistor.com> --- include/linux/memory_hotplug.h | 4 + mm/Kconfig | 23 ++++++ mm/memory_hotplug.c | 177 +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 204 insertions(+) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 35ca1bb..9a16cec 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -191,6 +191,10 @@ extern void get_page_bootmem(unsigned long ingo, struct page *page, void get_online_mems(void); void put_online_mems(void); +int add_persistent_memory(phys_addr_t phys_addr, size_t size, + void **o_virt_addr); +void remove_persistent_memory(phys_addr_t phys_addr, size_t size); + #else /* ! 
CONFIG_MEMORY_HOTPLUG */ /* * Stub functions for when hotplug is off diff --git a/mm/Kconfig b/mm/Kconfig index 886db21..2b78d19 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -197,6 +197,29 @@ config MEMORY_HOTREMOVE depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE depends on MIGRATION + +# User of PERSISTENT_MEMORY_SECTION should: +# depends on PERSISTENT_MEMORY_DEPENDENCY and +# select DRIVER_NEEDS_PERSISTENT_MEMORY +# Note that it will not be enabled if MEMORY_HOTPLUG is not enabled +# +# If you have changed the dependency/select of MEMORY_HOTREMOVE please also +# update here +# +config PERSISTENT_MEMORY_DEPENDENCY + def_bool y + depends on MEMORY_HOTPLUG + depends on ARCH_ENABLE_MEMORY_HOTREMOVE && MIGRATION + +config DRIVER_NEEDS_PERSISTENT_MEMORY + bool + +config PERSISTENT_MEMORY_SECTION + def_bool y + depends on PERSISTENT_MEMORY_DEPENDENCY + depends on DRIVER_NEEDS_PERSISTENT_MEMORY + select MEMORY_HOTREMOVE + # # If we have space for more page flags then we can enable additional # optimizations and functionality. 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index e556a90..1682b0e 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -2004,3 +2004,180 @@ void __ref remove_memory(int nid, u64 start, u64 size) } EXPORT_SYMBOL_GPL(remove_memory); #endif /* CONFIG_MEMORY_HOTREMOVE */ + +#ifdef CONFIG_PERSISTENT_MEMORY_SECTION + +/* This helper is so we do not need to allocate a page_array bigger + * than PAGE_SIZE + */ +static int _map_sec_range(ulong sec_start_pfn, struct page **page_array) +{ + const uint NUM_PAGES = PAGE_SIZE / sizeof(*page_array); + const uint ARRAYS_IN_SECTION = PAGES_PER_SECTION / NUM_PAGES; + ulong pfn = sec_start_pfn; + uint a; + + for (a = 0; a < ARRAYS_IN_SECTION; ++a) { + ulong virt_addr = (ulong)page_address(pfn_to_page(pfn)); + uint p; + int ret; + + for (p = 0; p < NUM_PAGES; ++p) + page_array[p] = pfn_to_page(pfn++); + + ret = map_kernel_range_noflush(virt_addr, NUM_PAGES * PAGE_SIZE, + PAGE_KERNEL, page_array); + if (unlikely(ret < 0)) { + pr_warn("%s: map_kernel_range(0x%lx, 0x%lx) => %d\n", + __func__, sec_start_pfn, virt_addr, ret); + return ret; + } + if (unlikely(ret < NUM_PAGES)) { + pr_warn("%s: map_kernel_range(0x%lx) => %d != %d last_pfn=0x%lx\n", + __func__, virt_addr, NUM_PAGES, ret, pfn); + } + } + + return 0; +} + +/** + * add_persistent_memory - Add memory sections and maps them to Kernel space + * @phys_addr: start of physical address to add & map + * @size: size of the memory range in bytes + * @o_virt_addr: The returned virtual address of the mapped memory range + * + * A persistent_memory block-device will use this function to add memory + * sections and map its physical memory range. After the call to this function + * There will be page-struct associated with each pfn added here, and it will + * be accessible from Kernel space through the returned @o_virt_addr + * @phys_addr will be rounded down to the nearest SECTION_SIZE, the range + * mapped will be in full SECTION_SIZE sections. 
+ * @o_virt_addr is the address of @phys_addr not the start of the mapped section + * so usually mapping a range unaligned on SECTION_SIZE will work just that the + * unaligned start and/or end, will ignore the error and continue. + * (but will print "memory section XX already exists") + * + * NOTE: + * persistent_memory is not system ram and is not available through any + * allocator, for regular consumption. Therefore it does not belong to any + * memory zone. + * But it will need a memory-section allocated, so page-structs are available + * for this memory, so it can be DMA'd directly with zero copy. + * After a call to this function the ranged pages belong exclusively to the + * caller. + * + * RETURNS: + * zero on success, or -errno on failure. If successful @o_virt_addr will be set + */ +int add_persistent_memory(phys_addr_t phys_addr, size_t size, + void **o_virt_addr) +{ + ulong start_pfn = phys_addr >> PAGE_SHIFT; + ulong nr_pages = size >> PAGE_SHIFT; + ulong start_sec = pfn_to_section_nr(start_pfn); + ulong end_sec = pfn_to_section_nr(start_pfn + nr_pages + + PAGES_PER_SECTION - 1); + int nid = memory_add_physaddr_to_nid(phys_addr); + struct resource *res_mem; + struct page **page_array; + ulong i; + int ret = 0; + + page_array = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (unlikely(!page_array)) + return -ENOMEM; + + res_mem = request_mem_region_exclusive(phys_addr, size, + "persistent_memory"); + if (unlikely(!res_mem)) { + pr_warn("%s: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n", + __func__, phys_addr, size); + ret = -EINVAL; + goto free_array; + } + + for (i = start_sec; i < end_sec; ++i) { + ulong sec_start_pfn = i << PFN_SECTION_SHIFT; + + if (pfn_valid(sec_start_pfn)) { + pr_warn("%s: memory section %lu already exists.\n", + __func__, i); + continue; + } + + ret = sparse_add_one_section(nid, sec_start_pfn); + if (unlikely(ret < 0)) { + if (ret == -EEXIST) { + ret = 0; + continue; + } else { + pr_warn("%s: sparse_add_one_section => 
%d\n", + __func__, ret); + goto release_region; + } + } + + ret = _map_sec_range(sec_start_pfn, page_array); + if (unlikely(ret)) + goto release_region; + } + + *o_virt_addr = page_address(pfn_to_page(start_pfn)); + +free_array: + kfree(page_array); + return ret; + +release_region: + release_mem_region(phys_addr, size); + goto free_array; +} +EXPORT_SYMBOL_GPL(add_persistent_memory); + +/** + * remove_persistent_memory - undo anything add_persistent_memory did + * @phys_addr: start of physical address to remove + * @size: size of the memory range in bytes + * + * A successful call to add_persistent_memory must be paired with + * remove_persistent_memory when done. It will unmap passed PFNs from + * Kernel virtual address, and will remove the memory sections. + * @phys_addr, @size must be exactly those passed to add_persistent_memory + * otherwise results are unexpected, there are no checks done on this. + */ +void remove_persistent_memory(phys_addr_t phys_addr, size_t size) +{ + ulong start_pfn = phys_addr >> PAGE_SHIFT; + ulong nr_pages = size >> PAGE_SHIFT; + ulong start_sec = pfn_to_section_nr(start_pfn); + ulong end_sec = pfn_to_section_nr(start_pfn + nr_pages + + PAGES_PER_SECTION - 1); + int nid = pfn_to_nid(start_pfn); + ulong virt_addr; + unsigned int i; + + virt_addr = (ulong)page_address( + pfn_to_page(start_sec << PFN_SECTION_SHIFT)); + + for (i = start_sec; i < end_sec; ++i) { + struct mem_section *ms; + + unmap_kernel_range(virt_addr, 1UL << SECTION_SIZE_BITS); + virt_addr += 1UL << SECTION_SIZE_BITS; + + ms = __nr_to_section(i); + if (!valid_section(ms)) { + pr_warn("%s: memory section %d is missing.\n", + __func__, i); + continue; + } + sparse_remove_one_section(nid, ms); + } + + release_mem_region(phys_addr, size); +} +EXPORT_SYMBOL_GPL(remove_persistent_memory); + +#endif /* def CONFIG_PERSISTENT_MEMORY_SECTION */ + -- 1.9.3 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 7/9] pmem: Add support for page structs 2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and farther development (mm: add_persistent_memory) Boaz Harrosh ` (2 preceding siblings ...) 2014-09-09 15:47 ` [PATCH 6/9] mm: New add_persistent_memory/remove_persistent_memory Boaz Harrosh @ 2014-09-09 15:48 ` Boaz Harrosh 2014-09-09 15:51 ` [PATCH 9/9] pmem: KISS, remove register_blkdev Boaz Harrosh 4 siblings, 0 replies; 41+ messages in thread From: Boaz Harrosh @ 2014-09-09 15:48 UTC (permalink / raw) To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm, Toshi Kani, Dave Hansen, linux-mm Cc: Andrew Morton, linux-kernel From: Boaz Harrosh <boaz@plexistor.com> One of the current shortcomings of the NVDIMM/PMEM support is that this memory does not have page-structs associated with it and therefore cannot be passed to a block device or the network, or DMAed in any way through another device in the system. The use of add_persistent_memory() fixes all this. After this patch an FS can do: bdev_direct_access(,&pfn,); page = pfn_to_page(pfn); And use that page for a lock_page(), set_page_dirty(), and/or anything else one might do with a page *. (Note that with brd one can already do this.) [pmem-pages-ref-count] pmem will serve its pages with ref==0. Once an FS does a blkdev_get_XXX(,FMODE_EXCL,), that memory is owned by the FS. The FS needs to manage its allocation, just as it already does for its disk blocks. The FS should set page->count = 2 before submission to any Kernel subsystem, so when it returns the page will never be released to the Kernel's page allocators. 
(page_freeze) Signed-off-by: Boaz Harrosh <boaz@plexistor.com> --- drivers/block/Kconfig | 13 +++++++++++++ drivers/block/pmem.c | 19 +++++++++++++++++++ 2 files changed, 32 insertions(+) diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index 5da8cbf..8a5929c 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -416,6 +416,19 @@ config BLK_DEV_PMEM Most normal users won't need this functionality, and can thus say N here. +config BLK_DEV_PMEM_USE_PAGES + bool "Enable use of page struct pages with pmem" + depends on BLK_DEV_PMEM + depends on PERSISTENT_MEMORY_DEPENDENCY + select DRIVER_NEEDS_PERSISTENT_MEMORY + default y + help + If a user of PMEM device needs "struct page" associated + with its memory, so this memory can be sent to other + block devices, or sent on the network, or be DMA transferred + to other devices in the system, then you must say "Yes" here. + If unsure leave as Yes. + config CDROM_PKTCDVD tristate "Packet writing on CD/DVD media" depends on !UML diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c index e07a373..b415b61 100644 --- a/drivers/block/pmem.c +++ b/drivers/block/pmem.c @@ -221,6 +221,23 @@ MODULE_PARM_DESC(map, static LIST_HEAD(pmem_devices); static int pmem_major; +#ifdef CONFIG_BLK_DEV_PMEM_USE_PAGES +/* pmem->phys_addr and pmem->size need to be set. + * Will then set pmem->virt_addr if successful. + */ +int pmem_mapmem(struct pmem_device *pmem) +{ + return add_persistent_memory(pmem->phys_addr, pmem->size, + &pmem->virt_addr); +} + +static void pmem_unmapmem(struct pmem_device *pmem) +{ + remove_persistent_memory(pmem->phys_addr, pmem->size); +} + +#else /* !CONFIG_BLK_DEV_PMEM_USE_PAGES */ + /* pmem->phys_addr and pmem->size need to be set. * Will then set virt_addr if successful. */ @@ -258,6 +275,8 @@ void pmem_unmapmem(struct pmem_device *pmem) release_mem_region(pmem->phys_addr, pmem->size); pmem->virt_addr = NULL; } +#endif /* ! 
CONFIG_BLK_DEV_PMEM_USE_PAGES */ + static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size, int i) -- 1.9.3 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 9/9] pmem: KISS, remove register_blkdev 2014-09-09 15:37 ` [PATCH 0/9] pmem: Fixes and farther development (mm: add_persistent_memory) Boaz Harrosh ` (3 preceding siblings ...) 2014-09-09 15:48 ` [PATCH 7/9] pmem: Add support for page structs Boaz Harrosh @ 2014-09-09 15:51 ` Boaz Harrosh 4 siblings, 0 replies; 41+ messages in thread From: Boaz Harrosh @ 2014-09-09 15:51 UTC (permalink / raw) To: Ross Zwisler, Jens Axboe, Matthew Wilcox, linux-fsdevel, linux-nvdimm Cc: Andrew Morton, linux-kernel From: Boaz Harrosh <boaz@plexistor.com> Though this looks revolutionary, it actually does nothing but removes dead code. Since the move of pmem to the dynamic BLOCK_EXT_MAJOR range, by Ross, pmem_major returned from register_blkdev is not used. If inspecting the code at add_disk() we can see that whatever number we will put at disk->major = ... disk->first_minor = ... will be immediately overwritten with the BLOCK_EXT_MAJOR and a dynamically allocated minor. Regardless of the registration and/or what ever we put at disk->major. I have tested all the tests that I know how to perform on the devices, fdisk, partitions, multiple pmemX devices, mknode, lsblk, blkid, and it all behaves exactly as before this patch. I have also done: find /sys/ -name "*pmem*" find /proc/ -name "*pmem*" find /dev/ -name "*pmem*" And get the same output as before this patch. The only thing missing is an entry in /proc/devices of the form: "250 pmem" (250 or what ever is free at the moment) But this is good because if one tries to look for 250 devices after loading pmem he will fail because pmem is always registered as 259 (blkext) now. If anyone knows of what would work differently after this patch please do tell. 
But it looks like the calls to register_blkdev is just dead code for us right now Thanks Signed-off-by: Boaz Harrosh <boaz@plexistor.com> --- drivers/block/pmem.c | 11 ----------- 1 file changed, 11 deletions(-) diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c index 600d225..5eda95a 100644 --- a/drivers/block/pmem.c +++ b/drivers/block/pmem.c @@ -225,7 +225,6 @@ MODULE_PARM_DESC(map, "mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset."); static LIST_HEAD(pmem_devices); -static int pmem_major; #ifdef CONFIG_BLK_DEV_PMEM_USE_PAGES /* pmem->phys_addr and pmem->size need to be set. @@ -333,8 +332,6 @@ static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size, goto out_free_queue; } - disk->major = pmem_major; - disk->first_minor = 0; disk->fops = &pmem_fops; disk->private_data = pmem; disk->queue = pmem->pmem_queue; @@ -397,12 +394,6 @@ static int __init pmem_init(void) return -EINVAL; } - result = register_blkdev(0, "pmem"); - if (result < 0) - return -EIO; - else - pmem_major = result; - i = 0; while ((p = strsep(&pmem_map, ",")) != NULL) { phys_addr_t phys_addr; @@ -433,7 +424,6 @@ out_free: list_del(&pmem->pmem_list); pmem_free(pmem); } - unregister_blkdev(pmem_major, "pmem"); return result; } @@ -445,7 +435,6 @@ static void __exit pmem_exit(void) list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list) pmem_del_one(pmem); - unregister_blkdev(pmem_major, "pmem"); pr_info("pmem: module unloaded\n"); } -- 1.9.3 ^ permalink raw reply related [flat|nested] 41+ messages in thread