All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 0/9] pmem: Support for "struct page" with Persistent Memory storage
@ 2014-08-13 12:08 ` Boaz Harrosh
  0 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-13 12:08 UTC (permalink / raw)
  To: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

Hi Folks

There are already NvDIMMs and other PMEM devices in the market, though very rare still.
However in Linux we like to get ready for them before hand.

Current stack is coming along very nice, and filesystems supporting and leveraging this
technologies have been submitted for review in the DAX series by Matthew Wilcox.

The PMEM industry has been speaking of re inventing the wheel of storage and providing
an alternative API for storage in the form of persist_alloc(,UUID,...) for applications
to use.

But The general consensus by the Kernel flocks is that an FS like ext4 over a "direct_access"
block device supplies that API perfectly with a user simple call sequence of:
	fd = open(,"UUID_STR",);
	mmap(fd,);
And there, the wheel has already been invented.

So the PMEM stack is the usual:
	block-device
	partition
	file-system
	application file

With extra care, see Matthew's DAX patches, to shorten the stack so at the application
level when finally a store/load CPU operation is done on an mmaped pointer it will go
directly to pmem. (OK after the call to msync to flush CPU caches), but certainly
no extra copies are made anywhere in the stack.

The only thing missing from current design is that this pmem memory region is very
alien to the rest of the Kernel, application access to this memory is enabled but
if we want to send this memory on the network or to a slower block device and/or to
any device in the system. We are not able to do this. This is because of a missing
struct page support for this memory.
This shortcomings is easily fixable by this patchset. Actually by the last two
patches. The other patches are just development stuff of the prd.ko (Persistent Ram Disk)
that is destined to mange the pmem devices.

Ross hi, I have based a *prd* tree on my brd-partition branch which I hope will go into Kernel
ASAP, even into 3.17. This tree includes all your patches, slightly re-orders you can
see it here:
	git://git.open-osd.org/linux-open-osd.git branch prd-3.16
on web:
	http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/prd-3.16

This is what one will fine on the prd-3.16 branch:
7a70a75 Boaz Harrosh        |  (prd-3.16) prd: Add support for page struct mapping 
9a367c6 Yigal Korman        |  MM: export sparse_add_one_section/sparse_remove_one_section 
62c96b7 Boaz Harrosh        |  SQUASHME: prd: Support of multiple memory regions 
9290c1b Boaz Harrosh        |  SQUASHME: prd: Let each prd-device manage private memory region 
bfefdbb Boaz Harrosh        |  SQUASHME: prd: Last fixes for partitions 
0a7c7e4 Boaz Harrosh        |  SQUASHME: prd: Fixs to getgeo 

  From here on this is exactly Ross's prd tree, patches slight reordered.

709891f Ross Zwisler        |  prd: Add getgeo to block ops 
3ce69f4fe Ross Zwisler      |  prd: add support for rw_page() 
4fb0ac8 Ross Zwisler        |  SQUASHME: prd: Remove Kconfig default for PRD 
8dafb03 Ross Zwisler        |  SQUASHME: prd: remove redundant checks in prd_direct_access 
0373be8 Ross Zwisler        |  SQUASHME: prd: Updates to comments & whitespace 
0ef8f82 Ross Zwisler        |  SQUASHME: prd: Improve wording in Kconfig 
eb11421 Ross Zwisler        |  SQUASHME: prd: Dynamically allocate partition numbers 
9021eae Ross Zwisler        |  SQUASHME: prd: Remove support for discard 
4805046 Ross Zwisler        |  SQUASHME: prd: enable partitions and surface by default 
9550836 Ross Zwisler        |  SQUASHME: prd: workaround to stop weird partition overlap 
c9d84b3 Ross Zwisler        |  SQUASHME: prd: fix error handling in prd_init 
252aaff Ross Zwisler        |  prd: Initial version of Persistent RAM Driver 
a49772c Boaz Harrosh        |  (ooo/brd-partitions, brd-partitions) brd: Request from fdisk 4k alignment 

[I have edited all the commit messages to add "SQUASHME" at start so to denote to user that
 these will be incorporated into a previous patch before online submission.]

For review and discussion here I am submitting the SQUASHed prd-initial-version for
the reviewers to have a context on what we are talking about, then a set of my patches
for prd. Finally two patches one to mm and the second to prd for page-struct support.

List of patches:
[RFC 1/9] prd: Initial version of Persistent RAM Driver
[RFC 2/9] prd: add support for rw_page()
[RFC 3/9] prd: Add getgeo to block ops

	Up to here it is the latest code from Ross's prd tree

[RFC 4/9] SQUASHME: prd: Fixs to getgeo
[RFC 5/9] SQUASHME: prd: Last fixes for partitions
[RFC 6/9] SQUASHME: prd: Let each prd-device manage private memory
[RFC 7/9] SQUASHME: prd: Support of multiple memory regions

	These 4 are my fixes to the prd driver. Ross please review
	and submit to your tree.

[RFC 8/9] mm: export sparse_add/remove_one_section
[RFC 9/9] prd: Add support for page struct mapping

	These two are for the MM guys to comment, and for anyone
	that is interested in the PMEM technology.
	I wanted to post an example Kernel module that will demonstrate
	the use of pages, on top of bdev_direct_access(). But its not
	cleaned up and finished, I will send it later.

These can be found here:
	git://git.open-osd.org/linux-open-osd.git branch prd
on web:
	http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/prd

And one last Rant if I may?
I hate the prd name, why why? OK so a freak of bad luck forced us to invent a new name for
/dev/ram because it would be weird to do an lsmod and see a ram.ko hanging there which is actually
a block device driver. OK so a brd == /dev/ram. But why do we need to carry this punishment forever?
Why an additional/different name in the namespace? /dev/foo should just be foo.ko in lsmod, No?
So please, please, for my peace of mind can we call this driver pmem.ko?
I know, I would hate it if I was inventing a name and people change it, so Ross it is your call, is
it OK if we move back to just call it pmem everywhere?

Thanks
Boaz

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC 0/9] pmem: Support for "struct page" with Persistent Memory storage
@ 2014-08-13 12:08 ` Boaz Harrosh
  0 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-13 12:08 UTC (permalink / raw)
  To: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

Hi Folks

There are already NvDIMMs and other PMEM devices in the market, though very rare still.
However in Linux we like to get ready for them before hand.

Current stack is coming along very nice, and filesystems supporting and leveraging this
technologies have been submitted for review in the DAX series by Matthew Wilcox.

The PMEM industry has been speaking of re inventing the wheel of storage and providing
an alternative API for storage in the form of persist_alloc(,UUID,...) for applications
to use.

But The general consensus by the Kernel flocks is that an FS like ext4 over a "direct_access"
block device supplies that API perfectly with a user simple call sequence of:
	fd = open(,"UUID_STR",);
	mmap(fd,);
And there, the wheel has already been invented.

So the PMEM stack is the usual:
	block-device
	partition
	file-system
	application file

With extra care, see Matthew's DAX patches, to shorten the stack so at the application
level when finally a store/load CPU operation is done on an mmaped pointer it will go
directly to pmem. (OK after the call to msync to flush CPU caches), but certainly
no extra copies are made anywhere in the stack.

The only thing missing from current design is that this pmem memory region is very
alien to the rest of the Kernel, application access to this memory is enabled but
if we want to send this memory on the network or to a slower block device and/or to
any device in the system. We are not able to do this. This is because of a missing
struct page support for this memory.
This shortcomings is easily fixable by this patchset. Actually by the last two
patches. The other patches are just development stuff of the prd.ko (Persistent Ram Disk)
that is destined to mange the pmem devices.

Ross hi, I have based a *prd* tree on my brd-partition branch which I hope will go into Kernel
ASAP, even into 3.17. This tree includes all your patches, slightly re-orders you can
see it here:
	git://git.open-osd.org/linux-open-osd.git branch prd-3.16
on web:
	http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/prd-3.16

This is what one will fine on the prd-3.16 branch:
7a70a75 Boaz Harrosh        |  (prd-3.16) prd: Add support for page struct mapping 
9a367c6 Yigal Korman        |  MM: export sparse_add_one_section/sparse_remove_one_section 
62c96b7 Boaz Harrosh        |  SQUASHME: prd: Support of multiple memory regions 
9290c1b Boaz Harrosh        |  SQUASHME: prd: Let each prd-device manage private memory region 
bfefdbb Boaz Harrosh        |  SQUASHME: prd: Last fixes for partitions 
0a7c7e4 Boaz Harrosh        |  SQUASHME: prd: Fixs to getgeo 

  From here on this is exactly Ross's prd tree, patches slight reordered.

709891f Ross Zwisler        |  prd: Add getgeo to block ops 
3ce69f4fe Ross Zwisler      |  prd: add support for rw_page() 
4fb0ac8 Ross Zwisler        |  SQUASHME: prd: Remove Kconfig default for PRD 
8dafb03 Ross Zwisler        |  SQUASHME: prd: remove redundant checks in prd_direct_access 
0373be8 Ross Zwisler        |  SQUASHME: prd: Updates to comments & whitespace 
0ef8f82 Ross Zwisler        |  SQUASHME: prd: Improve wording in Kconfig 
eb11421 Ross Zwisler        |  SQUASHME: prd: Dynamically allocate partition numbers 
9021eae Ross Zwisler        |  SQUASHME: prd: Remove support for discard 
4805046 Ross Zwisler        |  SQUASHME: prd: enable partitions and surface by default 
9550836 Ross Zwisler        |  SQUASHME: prd: workaround to stop weird partition overlap 
c9d84b3 Ross Zwisler        |  SQUASHME: prd: fix error handling in prd_init 
252aaff Ross Zwisler        |  prd: Initial version of Persistent RAM Driver 
a49772c Boaz Harrosh        |  (ooo/brd-partitions, brd-partitions) brd: Request from fdisk 4k alignment 

[I have edited all the commit messages to add "SQUASHME" at start so to denote to user that
 these will be incorporated into a previous patch before online submission.]

For review and discussion here I am submitting the SQUASHed prd-initial-version for
the reviewers to have a context on what we are talking about, then a set of my patches
for prd. Finally two patches one to mm and the second to prd for page-struct support.

List of patches:
[RFC 1/9] prd: Initial version of Persistent RAM Driver
[RFC 2/9] prd: add support for rw_page()
[RFC 3/9] prd: Add getgeo to block ops

	Up to here it is the latest code from Ross's prd tree

[RFC 4/9] SQUASHME: prd: Fixs to getgeo
[RFC 5/9] SQUASHME: prd: Last fixes for partitions
[RFC 6/9] SQUASHME: prd: Let each prd-device manage private memory
[RFC 7/9] SQUASHME: prd: Support of multiple memory regions

	These 4 are my fixes to the prd driver. Ross please review
	and submit to your tree.

[RFC 8/9] mm: export sparse_add/remove_one_section
[RFC 9/9] prd: Add support for page struct mapping

	These two are for the MM guys to comment, and for anyone
	that is interested in the PMEM technology.
	I wanted to post an example Kernel module that will demonstrate
	the use of pages, on top of bdev_direct_access(). But its not
	cleaned up and finished, I will send it later.

These can be found here:
	git://git.open-osd.org/linux-open-osd.git branch prd
on web:
	http://git.open-osd.org/gitweb.cgi?p=linux-open-osd.git;a=shortlog;h=refs/heads/prd

And one last Rant if I may?
I hate the prd name, why why? OK so a freak of bad luck forced us to invent a new name for
/dev/ram because it would be weird to do an lsmod and see a ram.ko hanging there which is actually
a block device driver. OK so a brd == /dev/ram. But why do we need to carry this punishment forever?
Why an additional/different name in the namespace? /dev/foo should just be foo.ko in lsmod, No?
So please, please, for my peace of mind can we call this driver pmem.ko?
I know, I would hate it if I was inventing a name and people change it, so Ross it is your call, is
it OK if we move back to just call it pmem everywhere?

Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [RFC 1/9] prd: Initial version of Persistent RAM Driver
  2014-08-13 12:08 ` Boaz Harrosh
  (?)
@ 2014-08-13 12:10 ` Boaz Harrosh
  -1 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-13 12:10 UTC (permalink / raw)
  To: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

From: Ross Zwisler <ross.zwisler@linux.intel.com>

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/Kconfig  |  41 +++++
 drivers/block/Makefile |   1 +
 drivers/block/prd.c    | 398 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 440 insertions(+)
 create mode 100644 drivers/block/prd.c

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 014a1cf..463c45e 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -403,6 +403,47 @@ config BLK_DEV_XIP
 	  will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
+config BLK_DEV_PMEM
+	tristate "Persistent memory block device support"
+	help
+	  Saying Y here will allow you to use a contiguous range of reserved
+	  memory as one or more block devices.  Memory for PRD should be
+	  reserved using the "memmap" kernel parameter.
+
+	  To compile this driver as a module, choose M here: the module will be
+	  called prd.
+
+	  Most normal users won't need this functionality, and can thus say N
+	  here.
+
+config BLK_DEV_PMEM_START
+	int "Offset in GiB of where to start claiming space"
+	default "0"
+	depends on BLK_DEV_PMEM
+	help
+	  Starting offset in GiB that PRD should use when claiming memory.  This
+	  memory needs to be reserved from the OS at boot time using the
+	  "memmap" kernel parameter.
+
+	  If you provide PRD with volatile memory it will act as a volatile
+	  RAM disk and your data will not be persistent.
+
+config BLK_DEV_PMEM_COUNT
+	int "Default number of PMEM disks"
+	default "4"
+	depends on BLK_DEV_PMEM
+	help
+	  Number of equal sized block devices that PRD should create.
+
+config BLK_DEV_PMEM_SIZE
+	int "Size in GiB of space to claim"
+	depends on BLK_DEV_PMEM
+	default "0"
+	help
+	  Amount of memory in GiB that PRD should use when creating block
+	  devices.  This memory needs to be reserved from the OS at
+	  boot time using the "memmap" kernel parameter.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 02b688d..6e94c61 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_PS3_VRAM)		+= ps3vram.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
+obj-$(CONFIG_BLK_DEV_PMEM)	+= prd.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
diff --git a/drivers/block/prd.c b/drivers/block/prd.c
new file mode 100644
index 0000000..7684197
--- /dev/null
+++ b/drivers/block/prd.c
@@ -0,0 +1,398 @@
+/*
+ * Persistent RAM Driver
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * This driver is heavily based on drivers/block/brd.c.
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+#include <linux/highmem.h>
+#include <linux/init.h>
+#include <linux/major.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#define SECTOR_SHIFT		9
+#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
+#define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
+
+/*
+ * driver-wide physical address and total_size - one single, contiguous memory
+ * region that we divide up in to same-sized devices
+ */
+phys_addr_t	phys_addr;
+void		*virt_addr;
+size_t		total_size;
+
+struct prd_device {
+	int			prd_number;
+
+	struct request_queue	*prd_queue;
+	struct gendisk		*prd_disk;
+	struct list_head	prd_list;
+
+	phys_addr_t		phys_addr;
+	void			*virt_addr;
+	size_t			size;
+};
+
+/*
+ * direct translation from (prd,sector) => void*
+ * We do not require that sector be page aligned.
+ * The return value will point to the beginning of the page containing the
+ * given sector, not to the sector itself.
+ */
+static void *prd_lookup_pg_addr(struct prd_device *prd, sector_t sector)
+{
+	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
+	size_t offset = page_offset << PAGE_SHIFT;
+
+	BUG_ON(offset >= prd->size);
+	return prd->virt_addr + offset;
+}
+
+/* sector must be page aligned */
+static unsigned long prd_lookup_pfn(struct prd_device *prd, sector_t sector)
+{
+	size_t page_offset = sector >> PAGE_SECTORS_SHIFT;
+
+	BUG_ON(sector & (PAGE_SECTORS - 1));
+	return (prd->phys_addr >> PAGE_SHIFT) + page_offset;
+}
+
+/*
+ * sector is not required to be page aligned.
+ * n is at most a single page, but could be less.
+ */
+static void copy_to_prd(struct prd_device *prd, const void *src,
+			sector_t sector, size_t n)
+{
+	void *dst;
+	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	BUG_ON(n > PAGE_SIZE);
+
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	dst = prd_lookup_pg_addr(prd, sector);
+	memcpy(dst + offset, src, copy);
+
+	if (copy < n) {
+		src += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		dst = prd_lookup_pg_addr(prd, sector);
+		memcpy(dst, src, copy);
+	}
+}
+
+/*
+ * sector is not required to be page aligned.
+ * n is at most a single page, but could be less.
+ */
+static void copy_from_prd(void *dst, struct prd_device *prd,
+			  sector_t sector, size_t n)
+{
+	void *src;
+	unsigned int offset = (sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT;
+	size_t copy;
+
+	BUG_ON(n > PAGE_SIZE);
+
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	src = prd_lookup_pg_addr(prd, sector);
+
+	memcpy(dst, src + offset, copy);
+
+	if (copy < n) {
+		dst += copy;
+		sector += copy >> SECTOR_SHIFT;
+		copy = n - copy;
+		src = prd_lookup_pg_addr(prd, sector);
+		memcpy(dst, src, copy);
+	}
+}
+
+static void prd_do_bvec(struct prd_device *prd, struct page *page,
+			unsigned int len, unsigned int off, int rw,
+			sector_t sector)
+{
+	void *mem = kmap_atomic(page);
+
+	if (rw == READ) {
+		copy_from_prd(mem + off, prd, sector, len);
+		flush_dcache_page(page);
+	} else {
+		/*
+		 * FIXME: Need more involved flushing to ensure that writes to
+		 * NVDIMMs are actually durable before returning.
+		 */
+		flush_dcache_page(page);
+		copy_to_prd(prd, mem + off, sector, len);
+	}
+
+	kunmap_atomic(mem);
+}
+
+static void prd_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct prd_device *prd = bdev->bd_disk->private_data;
+	int rw;
+	struct bio_vec bvec;
+	sector_t sector;
+	struct bvec_iter iter;
+	int err = 0;
+
+	sector = bio->bi_iter.bi_sector;
+	if (bio_end_sector(bio) > get_capacity(bdev->bd_disk)) {
+		err = -EIO;
+		goto out;
+	}
+
+	BUG_ON(bio->bi_rw & REQ_DISCARD);
+
+	rw = bio_rw(bio);
+	if (rw == READA)
+		rw = READ;
+
+	bio_for_each_segment(bvec, bio, iter) {
+		unsigned int len = bvec.bv_len;
+
+		BUG_ON(len > PAGE_SIZE);
+		prd_do_bvec(prd, bvec.bv_page, len,
+			    bvec.bv_offset, rw, sector);
+		sector += len >> SECTOR_SHIFT;
+	}
+
+out:
+	bio_endio(bio, err);
+}
+
+static long prd_direct_access(struct block_device *bdev, sector_t sector,
+			      void **kaddr, unsigned long *pfn, long size)
+{
+	struct prd_device *prd = bdev->bd_disk->private_data;
+
+	if (!prd)
+		return -ENODEV;
+
+	*kaddr = prd_lookup_pg_addr(prd, sector);
+	*pfn = prd_lookup_pfn(prd, sector);
+
+	return size;
+}
+
+static const struct block_device_operations prd_fops = {
+	.owner =		THIS_MODULE,
+	.direct_access =	prd_direct_access,
+};
+
+/* Kernel module stuff */
+static int prd_start_gb = CONFIG_BLK_DEV_PMEM_START;
+module_param(prd_start_gb, int, S_IRUGO);
+MODULE_PARM_DESC(prd_start_gb, "Offset in GB of where to start claiming space");
+
+static int prd_size_gb = CONFIG_BLK_DEV_PMEM_SIZE;
+module_param(prd_size_gb,  int, S_IRUGO);
+MODULE_PARM_DESC(prd_size_gb,  "Total size in GB of space to claim for all disks");
+
+static int prd_major;
+module_param(prd_major, int, 0);
+MODULE_PARM_DESC(prd_major,  "Major number to request for this driver");
+
+static int prd_count = CONFIG_BLK_DEV_PMEM_COUNT;
+module_param(prd_count, int, S_IRUGO);
+MODULE_PARM_DESC(prd_count, "Number of prd devices to evenly split allocated space");
+
+static LIST_HEAD(prd_devices);
+static DEFINE_MUTEX(prd_devices_mutex);
+
+/* FIXME: move phys_addr, virt_addr, size calls up to caller */
+static struct prd_device *prd_alloc(int i)
+{
+	struct prd_device *prd;
+	struct gendisk *disk;
+	size_t disk_size = total_size / prd_count;
+	size_t disk_sectors =  disk_size / 512;
+
+	prd = kzalloc(sizeof(*prd), GFP_KERNEL);
+	if (!prd)
+		goto out;
+
+	prd->prd_number	= i;
+	prd->phys_addr = phys_addr + i * disk_size;
+	prd->virt_addr = virt_addr + i * disk_size;
+	prd->size = disk_size;
+
+	prd->prd_queue = blk_alloc_queue(GFP_KERNEL);
+	if (!prd->prd_queue)
+		goto out_free_dev;
+
+	blk_queue_make_request(prd->prd_queue, prd_make_request);
+	blk_queue_max_hw_sectors(prd->prd_queue, 1024);
+	blk_queue_bounce_limit(prd->prd_queue, BLK_BOUNCE_ANY);
+
+	disk = prd->prd_disk = alloc_disk(0);
+	if (!disk)
+		goto out_free_queue;
+	disk->major		= prd_major;
+	disk->first_minor	= 0;
+	disk->fops		= &prd_fops;
+	disk->private_data	= prd;
+	disk->queue		= prd->prd_queue;
+	disk->flags		= GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "pmem%d", i);
+	set_capacity(disk, disk_sectors);
+
+	return prd;
+
+out_free_queue:
+	blk_cleanup_queue(prd->prd_queue);
+out_free_dev:
+	kfree(prd);
+out:
+	return NULL;
+}
+
+static void prd_free(struct prd_device *prd)
+{
+	put_disk(prd->prd_disk);
+	blk_cleanup_queue(prd->prd_queue);
+	kfree(prd);
+}
+
+static struct prd_device *prd_init_one(int i)
+{
+	struct prd_device *prd;
+
+	list_for_each_entry(prd, &prd_devices, prd_list) {
+		if (prd->prd_number == i)
+			goto out;
+	}
+
+	prd = prd_alloc(i);
+	if (prd) {
+		add_disk(prd->prd_disk);
+		list_add_tail(&prd->prd_list, &prd_devices);
+	}
+out:
+	return prd;
+}
+
+static void prd_del_one(struct prd_device *prd)
+{
+	list_del(&prd->prd_list);
+	del_gendisk(prd->prd_disk);
+	prd_free(prd);
+}
+
+static struct kobject *prd_probe(dev_t dev, int *part, void *data)
+{
+	struct prd_device *prd;
+	struct kobject *kobj;
+
+	mutex_lock(&prd_devices_mutex);
+	prd = prd_init_one(MINOR(dev));
+	kobj = prd ? get_disk(prd->prd_disk) : NULL;
+	mutex_unlock(&prd_devices_mutex);
+
+	return kobj;
+}
+
+static int __init prd_init(void)
+{
+	int result, i;
+	struct resource *res_mem;
+	struct prd_device *prd, *next;
+
+	phys_addr  = (phys_addr_t) prd_start_gb * 1024 * 1024 * 1024;
+	total_size = (size_t)	   prd_size_gb  * 1024 * 1024 * 1024;
+
+	res_mem = request_mem_region_exclusive(phys_addr, total_size, "prd");
+	if (!res_mem)
+		return -ENOMEM;
+
+	virt_addr = ioremap_cache(phys_addr, total_size);
+
+	if (!virt_addr) {
+		result = -ENOMEM;
+		goto out_release;
+	}
+
+	result = register_blkdev(prd_major, "prd");
+	if (result < 0) {
+		result = -EIO;
+		goto out_unmap;
+	} else if (result > 0)
+		prd_major = result;
+
+	for (i = 0; i < prd_count; i++) {
+		prd = prd_alloc(i);
+		if (!prd) {
+			result = -ENOMEM;
+			goto out_free;
+		}
+		list_add_tail(&prd->prd_list, &prd_devices);
+	}
+
+	list_for_each_entry(prd, &prd_devices, prd_list)
+		add_disk(prd->prd_disk);
+
+	blk_register_region(MKDEV(prd_major, 0), 0,
+				  THIS_MODULE, prd_probe, NULL, NULL);
+
+	pr_info("prd: module loaded\n");
+	return 0;
+
+out_free:
+	list_for_each_entry_safe(prd, next, &prd_devices, prd_list) {
+		list_del(&prd->prd_list);
+		prd_free(prd);
+	}
+	unregister_blkdev(prd_major, "prd");
+
+out_unmap:
+	iounmap(virt_addr);
+
+out_release:
+	release_mem_region(phys_addr, total_size);
+	return result;
+}
+
+static void __exit prd_exit(void)
+{
+	struct prd_device *prd, *next;
+
+	blk_unregister_region(MKDEV(prd_major, 0), 0);
+
+	list_for_each_entry_safe(prd, next, &prd_devices, prd_list)
+		prd_del_one(prd);
+
+	unregister_blkdev(prd_major, "prd");
+	iounmap(virt_addr);
+	release_mem_region(phys_addr, total_size);
+
+	pr_info("prd: module unloaded\n");
+}
+
+MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
+MODULE_LICENSE("GPL");
+module_init(prd_init);
+module_exit(prd_exit);
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC 2/9] prd: add support for rw_page()
  2014-08-13 12:08 ` Boaz Harrosh
  (?)
  (?)
@ 2014-08-13 12:11 ` Boaz Harrosh
  -1 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-13 12:11 UTC (permalink / raw)
  To: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

From: Ross Zwisler <ross.zwisler@linux.intel.com>

Based on commit a72132c31d58 brd: add support for rw_page()

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/block/prd.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/block/prd.c b/drivers/block/prd.c
index 7684197..4cfc4f8 100644
--- a/drivers/block/prd.c
+++ b/drivers/block/prd.c
@@ -185,6 +185,16 @@ out:
 	bio_endio(bio, err);
 }
 
+static int prd_rw_page(struct block_device *bdev, sector_t sector,
+		       struct page *page, int rw)
+{
+	struct prd_device *prd = bdev->bd_disk->private_data;
+
+	prd_do_bvec(prd, page, PAGE_CACHE_SIZE, 0, rw, sector);
+	page_endio(page, rw & WRITE, 0);
+	return 0;
+}
+
 static long prd_direct_access(struct block_device *bdev, sector_t sector,
 			      void **kaddr, unsigned long *pfn, long size)
 {
@@ -201,6 +211,7 @@ static long prd_direct_access(struct block_device *bdev, sector_t sector,
 
 static const struct block_device_operations prd_fops = {
 	.owner =		THIS_MODULE,
+	.rw_page =		prd_rw_page,
 	.direct_access =	prd_direct_access,
 };
 
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC 3/9] prd: Add getgeo to block ops
  2014-08-13 12:08 ` Boaz Harrosh
                   ` (2 preceding siblings ...)
  (?)
@ 2014-08-13 12:12 ` Boaz Harrosh
  -1 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-13 12:12 UTC (permalink / raw)
  To: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

From: Ross Zwisler <ross.zwisler@linux.intel.com>

Some programs require HDIO_GETGEO work, which requires we implement
getgeo.  Based off of the work done to the NVMe driver in this commit:

4cc09e2dc4cb NVMe: Add getgeo to block ops

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/block/prd.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/block/prd.c b/drivers/block/prd.c
index 4cfc4f8..cc0aabf 100644
--- a/drivers/block/prd.c
+++ b/drivers/block/prd.c
@@ -19,6 +19,7 @@
 #include <linux/bio.h>
 #include <linux/blkdev.h>
 #include <linux/fs.h>
+#include <linux/hdreg.h>
 #include <linux/highmem.h>
 #include <linux/init.h>
 #include <linux/major.h>
@@ -52,6 +53,15 @@ struct prd_device {
 	size_t			size;
 };
 
+static int prd_getgeo(struct block_device *bd, struct hd_geometry *geo)
+{
+	/* some standard values */
+	geo->heads = 1 << 6;
+	geo->sectors = 1 << 5;
+	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
+	return 0;
+}
+
 /*
  * direct translation from (prd,sector) => void*
  * We do not require that sector be page aligned.
@@ -213,6 +223,7 @@ static const struct block_device_operations prd_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		prd_rw_page,
 	.direct_access =	prd_direct_access,
+	.getgeo =		prd_getgeo,
 };
 
 /* Kernel module stuff */
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC 4/9] SQUASHME: prd: Fixs to getgeo
  2014-08-13 12:08 ` Boaz Harrosh
                   ` (3 preceding siblings ...)
  (?)
@ 2014-08-13 12:14 ` Boaz Harrosh
  2014-08-20 22:10   ` Ross Zwisler
  -1 siblings, 1 reply; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-13 12:14 UTC (permalink / raw)
  To: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

From: Boaz Harrosh <boaz@plexistor.com>

With current values fdisk does the wrong thing.

Setting all values to 1, will make everything nice and easy.

Note that current code had a BUG with anything bigger than
64G because hd_geometry->cylinders is ushort and it would
overflow at this value. Any way capacity is not calculated
through getgeo so it does not matter what you put here.

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/prd.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/block/prd.c b/drivers/block/prd.c
index cc0aabf..62af81e 100644
--- a/drivers/block/prd.c
+++ b/drivers/block/prd.c
@@ -55,10 +55,18 @@ struct prd_device {
 
 static int prd_getgeo(struct block_device *bd, struct hd_geometry *geo)
 {
-	/* some standard values */
-	geo->heads = 1 << 6;
-	geo->sectors = 1 << 5;
-	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
+	/* Just tell fdisk to get out of the way. The math here is so
+	 * convoluted and does not make any sense at all. With all 1s
+	 * The math just gets out of the way.
+	 * NOTE: I was trying to get some values that will make fdisk
+	 * Want to align first sector on 4K (like 8, 16, 20, ... sectors) but
+	 * nothing worked, I searched the net the math is not your regular
+	 * simple multiplication at all. If you managed to get these please
+	 * fix here. For now we use 4k physical sectors for this
+	 */
+	geo->heads = 1;
+	geo->sectors = 1;
+	geo->cylinders = 1;
 	return 0;
 }
 
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC 5/9] SQUASHME: prd: Last fixes for partitions
  2014-08-13 12:08 ` Boaz Harrosh
                   ` (4 preceding siblings ...)
  (?)
@ 2014-08-13 12:16 ` Boaz Harrosh
  2014-08-14 13:04   ` Boaz Harrosh
                     ` (2 more replies)
  -1 siblings, 3 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-13 12:16 UTC (permalink / raw)
  To: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

From: Boaz Harrosh <boaz@plexistor.com>

This streamlines prd with the latest brd code.

In prd we do not allocate new devices dynamically on devnod
access, because we need parameterization of each device. So
the dynamic allocation in prd_init_one is removed.

Therefor prd_init_one only called from prd_prob is moved
there, now that it is small.

And other small fixes regarding partitions

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/prd.c | 47 ++++++++++++++++++++++++-----------------------
 1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/drivers/block/prd.c b/drivers/block/prd.c
index 62af81e..c4aeba7 100644
--- a/drivers/block/prd.c
+++ b/drivers/block/prd.c
@@ -218,13 +218,13 @@ static long prd_direct_access(struct block_device *bdev, sector_t sector,
 {
 	struct prd_device *prd = bdev->bd_disk->private_data;
 
-	if (!prd)
+	if (unlikely(!prd))
 		return -ENODEV;
 
 	*kaddr = prd_lookup_pg_addr(prd, sector);
 	*pfn = prd_lookup_pfn(prd, sector);
 
-	return size;
+	return min_t(long, size, prd->size);
 }
 
 static const struct block_device_operations prd_fops = {
@@ -279,6 +279,12 @@ static struct prd_device *prd_alloc(int i)
 	blk_queue_max_hw_sectors(prd->prd_queue, 1024);
 	blk_queue_bounce_limit(prd->prd_queue, BLK_BOUNCE_ANY);
 
+	/* This is so fdisk will align partitions on 4k, because of
+	 * direct_access API needing 4k alignment, returning a PFN
+	 */
+	blk_queue_physical_block_size(prd->prd_queue, PAGE_SIZE);
+	prd->prd_queue->limits.io_min = 512; /* Don't use the accessor */
+
 	disk = prd->prd_disk = alloc_disk(0);
 	if (!disk)
 		goto out_free_queue;
@@ -308,24 +314,6 @@ static void prd_free(struct prd_device *prd)
 	kfree(prd);
 }
 
-static struct prd_device *prd_init_one(int i)
-{
-	struct prd_device *prd;
-
-	list_for_each_entry(prd, &prd_devices, prd_list) {
-		if (prd->prd_number == i)
-			goto out;
-	}
-
-	prd = prd_alloc(i);
-	if (prd) {
-		add_disk(prd->prd_disk);
-		list_add_tail(&prd->prd_list, &prd_devices);
-	}
-out:
-	return prd;
-}
-
 static void prd_del_one(struct prd_device *prd)
 {
 	list_del(&prd->prd_list);
@@ -333,16 +321,27 @@ static void prd_del_one(struct prd_device *prd)
 	prd_free(prd);
 }
 
+/*FIXME: Actually in our driver prd_probe is never used. Can be removed */
 static struct kobject *prd_probe(dev_t dev, int *part, void *data)
 {
 	struct prd_device *prd;
 	struct kobject *kobj;
+	int number = MINOR(dev);
 
 	mutex_lock(&prd_devices_mutex);
-	prd = prd_init_one(MINOR(dev));
-	kobj = prd ? get_disk(prd->prd_disk) : NULL;
-	mutex_unlock(&prd_devices_mutex);
 
+	list_for_each_entry(prd, &prd_devices, prd_list) {
+		if (prd->prd_number == number) {
+			kobj = get_disk(prd->prd_disk);
+			goto out;
+		}
+	}
+
+	pr_err("prd: prd_probe: Unexpected parameter=%d\n", number);
+	kobj = NULL;
+
+out:
+	mutex_unlock(&prd_devices_mutex);
 	return kobj;
 }
 
@@ -424,5 +423,7 @@ static void __exit prd_exit(void)
 
 MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
 MODULE_LICENSE("GPL");
+MODULE_ALIAS("pmem");
+
 module_init(prd_init);
 module_exit(prd_exit);
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC 6/9] SQUASHME: prd: Let each prd-device manage private memory region
  2014-08-13 12:08 ` Boaz Harrosh
                   ` (5 preceding siblings ...)
  (?)
@ 2014-08-13 12:18 ` Boaz Harrosh
  2014-08-21 16:57   ` Ross Zwisler
  -1 siblings, 1 reply; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-13 12:18 UTC (permalink / raw)
  To: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

From: Boaz Harrosh <boaz@plexistor.com>

This patch removes any global memory information. And lets
each prd-device manage it's own memory region.

prd_alloc() Now receives phys_addr and disk_size and will
map that region, also prd_free will do the unmaping.

This is so we can support multiple discontinuous memory regions
in the next patch

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/prd.c | 125 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 78 insertions(+), 47 deletions(-)

diff --git a/drivers/block/prd.c b/drivers/block/prd.c
index c4aeba7..6d96e6c 100644
--- a/drivers/block/prd.c
+++ b/drivers/block/prd.c
@@ -33,14 +33,6 @@
 #define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
 #define PAGE_SECTORS		(1 << PAGE_SECTORS_SHIFT)
 
-/*
- * driver-wide physical address and total_size - one single, contiguous memory
- * region that we divide up in to same-sized devices
- */
-phys_addr_t	phys_addr;
-void		*virt_addr;
-size_t		total_size;
-
 struct prd_device {
 	int			prd_number;
 
@@ -48,6 +40,7 @@ struct prd_device {
 	struct gendisk		*prd_disk;
 	struct list_head	prd_list;
 
+	/* One contiguous memory region per device */
 	phys_addr_t		phys_addr;
 	void			*virt_addr;
 	size_t			size;
@@ -254,27 +247,71 @@ MODULE_PARM_DESC(prd_count, "Number of prd devices to evenly split allocated spa
 static LIST_HEAD(prd_devices);
 static DEFINE_MUTEX(prd_devices_mutex);
 
-/* FIXME: move phys_addr, virt_addr, size calls up to caller */
-static struct prd_device *prd_alloc(int i)
+/* prd->phys_addr and prd->size need to be set.
+ * Will then set virt_addr if successful.
+ */
+int prd_mem_map(struct prd_device *prd)
+{
+	struct resource *res_mem;
+	int err;
+
+	res_mem = request_mem_region_exclusive(prd->phys_addr, prd->size,
+					       "pmem");
+	if (!res_mem) {
+		pr_warn("prd: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n",
+			   prd->phys_addr, prd->size);
+		return -EINVAL;
+	}
+
+	prd->virt_addr = ioremap_cache(prd->phys_addr, prd->size);
+	if (unlikely(!prd->virt_addr)) {
+		err = -ENOMEM;
+		goto out_release;
+	}
+	return 0;
+
+out_release:
+	release_mem_region(prd->phys_addr, prd->size);
+	return err;
+}
+
+void prd_mem_unmap(struct prd_device *prd)
+{
+	if (unlikely(!prd->virt_addr))
+		return;
+
+	iounmap(prd->virt_addr);
+	release_mem_region(prd->phys_addr, prd->size);
+	prd->virt_addr = NULL;
+}
+
+static struct prd_device *prd_alloc(phys_addr_t phys_addr, size_t disk_size,
+				    int i)
 {
 	struct prd_device *prd;
 	struct gendisk *disk;
-	size_t disk_size = total_size / prd_count;
-	size_t disk_sectors =  disk_size / 512;
+	int err;
 
 	prd = kzalloc(sizeof(*prd), GFP_KERNEL);
-	if (!prd)
+	if (unlikely(!prd)) {
+		err = -ENOMEM;
 		goto out;
+	}
 
 	prd->prd_number	= i;
-	prd->phys_addr = phys_addr + i * disk_size;
-	prd->virt_addr = virt_addr + i * disk_size;
+	prd->phys_addr = phys_addr;
 	prd->size = disk_size;
 
-	prd->prd_queue = blk_alloc_queue(GFP_KERNEL);
-	if (!prd->prd_queue)
+	err = prd_mem_map(prd);
+	if (unlikely(err))
 		goto out_free_dev;
 
+	prd->prd_queue = blk_alloc_queue(GFP_KERNEL);
+	if (unlikely(!prd->prd_queue)) {
+		err = -ENOMEM;
+		goto out_unmap;
+	}
+
 	blk_queue_make_request(prd->prd_queue, prd_make_request);
 	blk_queue_max_hw_sectors(prd->prd_queue, 1024);
 	blk_queue_bounce_limit(prd->prd_queue, BLK_BOUNCE_ANY);
@@ -285,9 +322,11 @@ static struct prd_device *prd_alloc(int i)
 	blk_queue_physical_block_size(prd->prd_queue, PAGE_SIZE);
 	prd->prd_queue->limits.io_min = 512; /* Don't use the accessor */
 
-	disk = prd->prd_disk = alloc_disk(0);
-	if (!disk)
+	disk = alloc_disk(0);
+	if (unlikely(!disk)) {
+		err = -ENOMEM;
 		goto out_free_queue;
+	}
 	disk->major		= prd_major;
 	disk->first_minor	= 0;
 	disk->fops		= &prd_fops;
@@ -295,22 +334,26 @@ static struct prd_device *prd_alloc(int i)
 	disk->queue		= prd->prd_queue;
 	disk->flags		= GENHD_FL_EXT_DEVT;
 	sprintf(disk->disk_name, "pmem%d", i);
-	set_capacity(disk, disk_sectors);
+	set_capacity(disk, disk_size >> SECTOR_SHIFT);
+	prd->prd_disk = disk;
 
 	return prd;
 
 out_free_queue:
 	blk_cleanup_queue(prd->prd_queue);
+out_unmap:
+	prd_mem_unmap(prd);
 out_free_dev:
 	kfree(prd);
 out:
-	return NULL;
+	return ERR_PTR(err);
 }
 
 static void prd_free(struct prd_device *prd)
 {
 	put_disk(prd->prd_disk);
 	blk_cleanup_queue(prd->prd_queue);
+	prd_mem_unmap(prd);
 	kfree(prd);
 }
 
@@ -348,34 +391,30 @@ out:
 static int __init prd_init(void)
 {
 	int result, i;
-	struct resource *res_mem;
 	struct prd_device *prd, *next;
+	phys_addr_t phys_addr;
+	size_t total_size, disk_size;
 
-	phys_addr  = (phys_addr_t) prd_start_gb * 1024 * 1024 * 1024;
-	total_size = (size_t)	   prd_size_gb  * 1024 * 1024 * 1024;
-
-	res_mem = request_mem_region_exclusive(phys_addr, total_size, "prd");
-	if (!res_mem)
-		return -ENOMEM;
-
-	virt_addr = ioremap_cache(phys_addr, total_size);
-
-	if (!virt_addr) {
-		result = -ENOMEM;
-		goto out_release;
+	if (unlikely(!prd_start_gb || !prd_size_gb || !prd_count)) {
+		pr_err("prd: prd_start_gb || prd_size_gb || prd_count are 0!!\n");
+		return -EINVAL;
 	}
 
+	phys_addr = (phys_addr_t) prd_start_gb * 1024 * 1024 * 1024;
+	total_size = (size_t)	   prd_size_gb  * 1024 * 1024 * 1024;
+	disk_size = total_size / prd_count;
+
 	result = register_blkdev(prd_major, "prd");
 	if (result < 0) {
 		result = -EIO;
-		goto out_unmap;
+		goto out;
 	} else if (result > 0)
 		prd_major = result;
 
 	for (i = 0; i < prd_count; i++) {
-		prd = prd_alloc(i);
-		if (!prd) {
-			result = -ENOMEM;
+		prd = prd_alloc(phys_addr + i * disk_size, disk_size, i);
+		if (IS_ERR(prd)) {
+			result = PTR_ERR(prd);
 			goto out_free;
 		}
 		list_add_tail(&prd->prd_list, &prd_devices);
@@ -396,12 +435,7 @@ out_free:
 		prd_free(prd);
 	}
 	unregister_blkdev(prd_major, "prd");
-
-out_unmap:
-	iounmap(virt_addr);
-
-out_release:
-	release_mem_region(phys_addr, total_size);
+out:
 	return result;
 }
 
@@ -415,9 +449,6 @@ static void __exit prd_exit(void)
 		prd_del_one(prd);
 
 	unregister_blkdev(prd_major, "prd");
-	iounmap(virt_addr);
-	release_mem_region(phys_addr, total_size);
-
 	pr_info("prd: module unloaded\n");
 }
 
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC 7/9] SQUASHME: prd: Support of multiple memory regions
  2014-08-13 12:08 ` Boaz Harrosh
                   ` (6 preceding siblings ...)
  (?)
@ 2014-08-13 12:20 ` Boaz Harrosh
  2014-08-25 23:02     ` Ross Zwisler
  -1 siblings, 1 reply; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-13 12:20 UTC (permalink / raw)
  To: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

From: Boaz Harrosh <boaz@plexistor.com>

After the last patch this is easy.

The API to prd module is changed. We now have a single string
parameter named "map" of the form:
		 map=mapS[,mapS...]

		 where mapS=nn[KMG]$ss[KMG],
		 or    mapS=nn[KMG]@ss[KMG],

		 nn=size, ss=offset

Just like the Kernel command line map && memmap parameters,
so anything you did at grub just copy/paste to here.

The "@" form is exactly the same as the "$" form only that
at bash prompt we need to escape the "$" with \$ so also
support the '@' char for convenience.

For each specified mapS there will be a device created.

So needless to say that all the previous prd_XXX params are
removed as well as the Kconfig defaults.

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/Kconfig | 28 -----------------------
 drivers/block/prd.c   | 62 ++++++++++++++++++++++++++++++++++-----------------
 2 files changed, 41 insertions(+), 49 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 463c45e..8f0c225 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -416,34 +416,6 @@ config BLK_DEV_PMEM
 	  Most normal users won't need this functionality, and can thus say N
 	  here.
 
-config BLK_DEV_PMEM_START
-	int "Offset in GiB of where to start claiming space"
-	default "0"
-	depends on BLK_DEV_PMEM
-	help
-	  Starting offset in GiB that PRD should use when claiming memory.  This
-	  memory needs to be reserved from the OS at boot time using the
-	  "memmap" kernel parameter.
-
-	  If you provide PRD with volatile memory it will act as a volatile
-	  RAM disk and your data will not be persistent.
-
-config BLK_DEV_PMEM_COUNT
-	int "Default number of PMEM disks"
-	default "4"
-	depends on BLK_DEV_PMEM
-	help
-	  Number of equal sized block devices that PRD should create.
-
-config BLK_DEV_PMEM_SIZE
-	int "Size in GiB of space to claim"
-	depends on BLK_DEV_PMEM
-	default "0"
-	help
-	  Amount of memory in GiB that PRD should use when creating block
-	  devices.  This memory needs to be reserved from the OS at
-	  boot time using the "memmap" kernel parameter.
-
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/prd.c b/drivers/block/prd.c
index 6d96e6c..36b8fe4 100644
--- a/drivers/block/prd.c
+++ b/drivers/block/prd.c
@@ -228,21 +228,15 @@ static const struct block_device_operations prd_fops = {
 };
 
 /* Kernel module stuff */
-static int prd_start_gb = CONFIG_BLK_DEV_PMEM_START;
-module_param(prd_start_gb, int, S_IRUGO);
-MODULE_PARM_DESC(prd_start_gb, "Offset in GB of where to start claiming space");
-
-static int prd_size_gb = CONFIG_BLK_DEV_PMEM_SIZE;
-module_param(prd_size_gb,  int, S_IRUGO);
-MODULE_PARM_DESC(prd_size_gb,  "Total size in GB of space to claim for all disks");
-
 static int prd_major;
 module_param(prd_major, int, 0);
 MODULE_PARM_DESC(prd_major,  "Major number to request for this driver");
 
-static int prd_count = CONFIG_BLK_DEV_PMEM_COUNT;
-module_param(prd_count, int, S_IRUGO);
-MODULE_PARM_DESC(prd_count, "Number of prd devices to evenly split allocated space");
+static char *map;
+module_param(map, charp, S_IRUGO);
+MODULE_PARM_DESC(map,
+	"pmem device mapping: map=mapS[,mapS...] where:\n"
+	"mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset.");
 
 static LIST_HEAD(prd_devices);
 static DEFINE_MUTEX(prd_devices_mutex);
@@ -292,6 +286,13 @@ static struct prd_device *prd_alloc(phys_addr_t phys_addr, size_t disk_size,
 	struct gendisk *disk;
 	int err;
 
+	if (unlikely((phys_addr & ~PAGE_MASK) || (disk_size & ~PAGE_MASK))) {
+		pr_err("phys_addr=0x%llx disk_size=0x%zx must be 4k aligned\n",
+		       phys_addr, disk_size);
+		err = -EINVAL;
+		goto out;
+	}
+
 	prd = kzalloc(sizeof(*prd), GFP_KERNEL);
 	if (unlikely(!prd)) {
 		err = -ENOMEM;
@@ -388,22 +389,30 @@ out:
 	return kobj;
 }
 
+static int prd_parse_map_one(char *map, phys_addr_t *start, size_t *size)
+{
+	char *p = map;
+
+	*size = (phys_addr_t)memparse(p, &p);
+	if ((p == map) || ((*p != '$') && (*p != '@')))
+		return -EINVAL;
+
+	*start = (size_t)memparse(p + 1, &p);
+
+	return *p == '\0' ? 0 : -EINVAL;
+}
+
 static int __init prd_init(void)
 {
 	int result, i;
 	struct prd_device *prd, *next;
-	phys_addr_t phys_addr;
-	size_t total_size, disk_size;
+	char *p, *prd_map = map;
 
-	if (unlikely(!prd_start_gb || !prd_size_gb || !prd_count)) {
-		pr_err("prd: prd_start_gb || prd_size_gb || prd_count are 0!!\n");
+	if (!prd_map) {
+		pr_err("prd: must specify map parameter.\n");
 		return -EINVAL;
 	}
 
-	phys_addr = (phys_addr_t) prd_start_gb * 1024 * 1024 * 1024;
-	total_size = (size_t)	   prd_size_gb  * 1024 * 1024 * 1024;
-	disk_size = total_size / prd_count;
-
 	result = register_blkdev(prd_major, "prd");
 	if (result < 0) {
 		result = -EIO;
@@ -411,13 +420,24 @@ static int __init prd_init(void)
 	} else if (result > 0)
 		prd_major = result;
 
-	for (i = 0; i < prd_count; i++) {
-		prd = prd_alloc(phys_addr + i * disk_size, disk_size, i);
+	i = 0;
+	while ((p = strsep(&prd_map, ",")) != NULL) {
+		phys_addr_t phys_addr;
+		size_t disk_size;
+
+		if (!*p)
+			continue;
+		result = prd_parse_map_one(p, &phys_addr, &disk_size);
+		if (result)
+			goto out_free;
+
+		prd = prd_alloc(phys_addr, disk_size, i);
 		if (IS_ERR(prd)) {
 			result = PTR_ERR(prd);
 			goto out_free;
 		}
 		list_add_tail(&prd->prd_list, &prd_devices);
+		++i;
 	}
 
 	list_for_each_entry(prd, &prd_devices, prd_list)
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC 8/9] mm: export sparse_add/remove_one_section
  2014-08-13 12:08 ` Boaz Harrosh
                   ` (7 preceding siblings ...)
  (?)
@ 2014-08-13 12:21 ` Boaz Harrosh
  -1 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-13 12:21 UTC (permalink / raw)
  To: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

From: Yigal Korman <yigal@plexistor.com>

Export sparse_add_one_section & sparse_remove_one_section for use
in modules that want private memory mappings (prd for example).

Also refactored the arguments to use node id instead of
struct zone * - sparse memory has no direct connection to zones,
all that was needed from zone was the node id.

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 include/linux/memory_hotplug.h |  4 ++--
 mm/memory_hotplug.c            |  4 ++--
 mm/sparse.c                    | 11 +++++++----
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 010d125..91e0474 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -262,8 +262,8 @@ extern int arch_add_memory(int nid, u64 start, u64 size);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
-extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn);
-extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);
+extern int sparse_add_one_section(int nid, unsigned long start_pfn);
+extern void sparse_remove_one_section(int nid, struct mem_section *ms);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 469bbf5..0c87570 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -471,7 +471,7 @@ static int __meminit __add_section(int nid, struct zone *zone,
 	if (pfn_valid(phys_start_pfn))
 		return -EEXIST;
 
-	ret = sparse_add_one_section(zone, phys_start_pfn);
+	ret = sparse_add_one_section(zone->zone_pgdat->node_id, phys_start_pfn);
 
 	if (ret < 0)
 		return ret;
@@ -737,7 +737,7 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	start_pfn = section_nr_to_pfn(scn_nr);
 	__remove_zone(zone, start_pfn);
 
-	sparse_remove_one_section(zone, ms);
+	sparse_remove_one_section(zone->zone_pgdat->node_id, ms);
 	return 0;
 }
 
diff --git a/mm/sparse.c b/mm/sparse.c
index d1b48b6..d97facd3 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -690,10 +690,10 @@ static void free_map_bootmem(struct page *memmap)
  * set.  If this is <=0, then that means that the passed-in
  * map was not consumed and must be freed.
  */
-int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn)
+int __meminit sparse_add_one_section(int nid, unsigned long start_pfn)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
-	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct mem_section *ms;
 	struct page *memmap;
 	unsigned long *usemap;
@@ -738,6 +738,7 @@ out:
 	}
 	return ret;
 }
+EXPORT_SYMBOL_GPL(sparse_add_one_section);
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 #ifdef CONFIG_MEMORY_FAILURE
@@ -788,11 +789,11 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 		free_map_bootmem(memmap);
 }
 
-void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
+void sparse_remove_one_section(int nid, struct mem_section *ms)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL, flags;
-	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 
 	pgdat_resize_lock(pgdat, &flags);
 	if (ms->section_mem_map) {
@@ -807,5 +808,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 	clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION);
 	free_section_usemap(memmap, usemap);
 }
+EXPORT_SYMBOL_GPL(sparse_remove_one_section);
+
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_MEMORY_HOTPLUG */
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC 9/9] prd: Add support for page struct mapping
  2014-08-13 12:08 ` Boaz Harrosh
                   ` (8 preceding siblings ...)
  (?)
@ 2014-08-13 12:26 ` Boaz Harrosh
  2014-08-15 20:28   ` Toshi Kani
  2014-08-22 14:36     ` Dave Hansen
  -1 siblings, 2 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-13 12:26 UTC (permalink / raw)
  To: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

From: Yigal Korman <yigal@plexistor.com>

One of the current short comings of the NVDIMM/PMEM
support is that this memory does not have a page-struct(s)
associated with its memory and therefor cannot be passed
to a block-device or network or DMAed in any way through
another device in the system.

This simple patch fixes all this. After this patch an FS
can do:
	bdev_direct_access(,&pfn,);
	page = pfn_to_page(pfn);
And use that page for a lock_page(), set_page_dirty(), and/or
anything else one might do with a page *.
(Note that with brd one can already do this)

[pmem-pages-ref-count]
pmem will serve it's pages with ref==0. Once an FS does
an blkdev_get_XXX(,FMODE_EXCL,), that memory is own by the FS.
The FS needs to manage its allocation, just as it already does
for its disk blocks. The fs should set page->count = 2, before
submission to any Kernel subsystem so when it returns it will
never be released to the Kernel's page-allocators. (page_freeze)

All is actually needed for this is to allocate page-sections
and map them into kernel virtual memory. Note that these sections
are not associated with any zone, because that would add them to
the page_allocators.

In order to reuse existing code, prd now depends on memory hotplug
and sparse memory configuration options.

If system has enabled MEMORY_HOTPLUG_SPARSE then a new config option
BLK_DEV_PMEM_USE_PAGES is enabled (Yes by default)

We will also need MEMORY_HOTREMOVE so if BLK_DEV_PMEM_USE_PAGES
is on we will "select" MEMORY_HOTREMOVE. Most distro's have
MEMORY_HOTPLUG_SPARSE on but not MEMORY_HOTREMOVE. For us here
we must have both.

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/Kconfig |  13 +++++
 drivers/block/prd.c   | 137 ++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 145 insertions(+), 5 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 8f0c225..8aca1b7 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -416,6 +416,19 @@ config BLK_DEV_PMEM
 	  Most normal users won't need this functionality, and can thus say N
 	  here.
 
+config BLK_DEV_PMEM_USE_PAGES
+	bool "Enable use of page struct pages with pmem"
+	depends on BLK_DEV_PMEM
+	depends on MEMORY_HOTPLUG_SPARSE
+	select MEMORY_HOTREMOVE
+	default y
+	help
+	  If a user of PMEM device needs "struct page" associated
+	  with its memory, so this memory can be sent to other
+	  block devices, or sent on the network, or be DMA transferred
+	  to other devices in the system, then you must say "Yes" here.
+	  If unsure leave as Yes.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/prd.c b/drivers/block/prd.c
index 36b8fe4..6115553 100644
--- a/drivers/block/prd.c
+++ b/drivers/block/prd.c
@@ -241,6 +241,134 @@ MODULE_PARM_DESC(map,
 static LIST_HEAD(prd_devices);
 static DEFINE_MUTEX(prd_devices_mutex);
 
+#ifdef CONFIG_BLK_DEV_PMEM_USE_PAGES
+static int prd_add_page_mapping(phys_addr_t phys_addr, size_t total_size,
+				void **o_virt_addr)
+{
+	int nid = memory_add_physaddr_to_nid(phys_addr);
+	unsigned long start_pfn = phys_addr >> PAGE_SHIFT;
+	unsigned long nr_pages = total_size >> PAGE_SHIFT;
+	unsigned int start_sec = pfn_to_section_nr(start_pfn);
+	unsigned int end_sec = pfn_to_section_nr(start_pfn + nr_pages - 1);
+	unsigned long phys_start_pfn;
+	struct page **page_array, **mapped_page_array;
+	unsigned long i;
+	struct vm_struct *vm_area;
+	void *virt_addr;
+	int ret = 0;
+
+	for (i = start_sec; i <= end_sec; i++) {
+		phys_start_pfn = i << PFN_SECTION_SHIFT;
+
+		if (pfn_valid(phys_start_pfn)) {
+			pr_warn("prd: memory section %lu already exists.\n", i);
+			continue;
+		}
+
+		ret = sparse_add_one_section(nid, phys_start_pfn);
+		if (unlikely(ret < 0)) {
+			if (ret == -EEXIST) {
+				ret = 0;
+				continue;
+			} else {
+				pr_warn("prd: sparse_add_one_section => %d\n",
+					ret);
+				return ret;
+			}
+		}
+	}
+
+	virt_addr = page_address(pfn_to_page(phys_addr >> PAGE_SHIFT));
+
+	page_array = vmalloc(sizeof(struct page *) * nr_pages);
+	if (unlikely(!page_array)) {
+		pr_warn("prd: failed to allocate nr_pages=0x%lx\n", nr_pages);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i <  nr_pages; i++)
+		page_array[i] = pfn_to_page(start_pfn + i);
+
+	/* __get_vm_area requires a range of addresses from which to allocate
+	 * the vm_area. This range will include more pages that we need because
+	 * it allocates one guard page in the end. Usually you give it a wide
+	 * range from which to choose from, but we want exact addresses, so add
+	 * the size of the guard page to the end of the range (otherwise, this
+	 * will always fail)
+	 */
+	/* TODO this guard page may confuse users when asking for several pmem
+	 * devices in adjacent areas (the start of the next pmem will be
+	 * occupied by the guard page of the previous pmem)
+	 */
+	vm_area = __get_vm_area(total_size, VM_USERMAP, (ulong)virt_addr,
+				(ulong)virt_addr + total_size + PAGE_SIZE);
+	if (unlikely(!vm_area)) {
+		pr_err("prd: failed to __get_vm_area.\n");
+		ret = -ENOMEM;
+		goto free_array;
+	}
+
+	mapped_page_array = page_array;
+	ret = map_vm_area(vm_area, PAGE_KERNEL, &mapped_page_array);
+	if (unlikely(ret || mapped_page_array < (page_array + nr_pages))) {
+		pr_err("prd: failed to map_vm_area => %d\n", ret);
+		if (!ret) {
+			free_vm_area(vm_area);
+			ret = -ENOMEM;
+		}
+	}
+	*o_virt_addr = virt_addr;
+
+free_array:
+	vfree(page_array);
+	return ret;
+}
+
+static void prd_remove_page_mapping(phys_addr_t phys_addr, size_t total_size,
+				    void *virt_addr)
+{
+	unsigned long start_pfn = phys_addr >> PAGE_SHIFT;
+	unsigned long nr_pages = total_size >> PAGE_SHIFT;
+	unsigned int start_sec = pfn_to_section_nr(start_pfn);
+	unsigned int end_sec = pfn_to_section_nr(start_pfn + nr_pages - 1);
+	unsigned int i;
+
+	for (i = start_sec; i <= end_sec; i++) {
+		struct mem_section *ms = __nr_to_section(i);
+		int nid = pfn_to_nid(i << PFN_SECTION_SHIFT);
+
+		if (!valid_section(ms)) {
+			pr_warn("prd: memory section %d is missing.\n", i);
+			continue;
+		}
+
+		sparse_remove_one_section(nid, ms);
+	}
+	vunmap(virt_addr);
+}
+
+#else /* !CONFIG_BLK_DEV_PMEM_USE_PAGES */
+static int prd_add_page_mapping(phys_addr_t phys_addr, size_t total_size,
+				void **o_virt_addr)
+{
+	void *virt_addr = ioremap_cache(phys_addr, total_size);
+
+	if (unlikely(!virt_addr))
+		return -ENXIO;
+
+	*o_virt_addr = virt_addr;
+	return 0;
+}
+
+static void prd_remove_page_mapping(phys_addr_t phys_addr, size_t total_size,
+				    void *virt_addr)
+{
+	iounmap(virt_addr);
+}
+#endif /* CONFIG_BLK_DEV_PMEM_USE_PAGES */
+
+
+
 /* prd->phys_addr and prd->size need to be set.
  * Will then set virt_addr if successful.
  */
@@ -257,11 +385,10 @@ int prd_mem_map(struct prd_device *prd)
 		return -EINVAL;
 	}
 
-	prd->virt_addr = ioremap_cache(prd->phys_addr, prd->size);
-	if (unlikely(!prd->virt_addr)) {
-		err = -ENOMEM;
+	err = prd_add_page_mapping(prd->phys_addr, prd->size, &prd->virt_addr);
+	if (unlikely(err))
 		goto out_release;
-	}
+
 	return 0;
 
 out_release:
@@ -274,7 +401,7 @@ void prd_mem_unmap(struct prd_device *prd)
 	if (unlikely(!prd->virt_addr))
 		return;
 
-	iounmap(prd->virt_addr);
+	prd_remove_page_mapping(prd->phys_addr, prd->size, prd->virt_addr);
 	release_mem_region(prd->phys_addr, prd->size);
 	prd->virt_addr = NULL;
 }
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC 5/9] SQUASHME: prd: Last fixes for partitions
  2014-08-13 12:16 ` [RFC 5/9] SQUASHME: prd: Last fixes for partitions Boaz Harrosh
@ 2014-08-14 13:04   ` Boaz Harrosh
  2014-08-14 13:16       ` Matthew Wilcox
  2014-08-14 13:07   ` [PATCH 5/9 v2] " Boaz Harrosh
  2014-08-20 23:03   ` [RFC 5/9] " Ross Zwisler
  2 siblings, 1 reply; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-14 13:04 UTC (permalink / raw)
  To: Boaz Harrosh, Ross Zwisler, linux-fsdevel, Andrew Morton,
	linux-mm, Matthew Wilcox, Sagi Manole, Yigal Korman

On 08/13/2014 03:16 PM, Boaz Harrosh wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
> 
> This streamlines prd with the latest brd code.
> 
> In prd we do not allocate new devices dynamically on devnod
> access, because we need parameterization of each device. So
> the dynamic allocation in prd_init_one is removed.
> 
> Therefor prd_init_one only called from prd_prob is moved
> there, now that it is small.
> 
> And other small fixes regarding partitions
> 
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  drivers/block/prd.c | 47 ++++++++++++++++++++++++-----------------------
>  1 file changed, 24 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/block/prd.c b/drivers/block/prd.c
> index 62af81e..c4aeba7 100644
> --- a/drivers/block/prd.c
> +++ b/drivers/block/prd.c
> @@ -218,13 +218,13 @@ static long prd_direct_access(struct block_device *bdev, sector_t sector,
>  {
>  	struct prd_device *prd = bdev->bd_disk->private_data;
>  
> -	if (!prd)
> +	if (unlikely(!prd))
>  		return -ENODEV;
>  
>  	*kaddr = prd_lookup_pg_addr(prd, sector);
>  	*pfn = prd_lookup_pfn(prd, sector);
>  
> -	return size;
> +	return min_t(long, size, prd->size);

This is off course a BUG need to subtract offset, will send version 2

Boaz
<>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 5/9 v2] SQUASHME: prd: Last fixes for partitions
  2014-08-13 12:16 ` [RFC 5/9] SQUASHME: prd: Last fixes for partitions Boaz Harrosh
  2014-08-14 13:04   ` Boaz Harrosh
@ 2014-08-14 13:07   ` Boaz Harrosh
  2014-08-25 20:10     ` Ross Zwisler
  2014-08-20 23:03   ` [RFC 5/9] " Ross Zwisler
  2 siblings, 1 reply; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-14 13:07 UTC (permalink / raw)
  To: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

From: Boaz Harrosh <boaz@plexistor.com>

This streamlines prd with the latest brd code.

In prd we do not allocate new devices dynamically on devnod
access, because we need parameterization of each device. So
the dynamic allocation in prd_init_one is removed.

Therefor prd_init_one only called from prd_prob is moved
there, now that it is small.

And other small fixes regarding partitions

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/prd.c | 47 ++++++++++++++++++++++++-----------------------
 1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/drivers/block/prd.c b/drivers/block/prd.c
index 62af81e..f117ca5 100644
--- a/drivers/block/prd.c
+++ b/drivers/block/prd.c
@@ -218,13 +218,13 @@ static long prd_direct_access(struct block_device *bdev, sector_t sector,
 {
 	struct prd_device *prd = bdev->bd_disk->private_data;
 
-	if (!prd)
+	if (unlikely(!prd))
 		return -ENODEV;
 
 	*kaddr = prd_lookup_pg_addr(prd, sector);
 	*pfn = prd_lookup_pfn(prd, sector);
 
-	return size;
+	return min_t(long, size, prd->size - (sector << SECTOR_SHIFT));
 }
 
 static const struct block_device_operations prd_fops = {
@@ -279,6 +279,12 @@ static struct prd_device *prd_alloc(int i)
 	blk_queue_max_hw_sectors(prd->prd_queue, 1024);
 	blk_queue_bounce_limit(prd->prd_queue, BLK_BOUNCE_ANY);
 
+	/* This is so fdisk will align partitions on 4k, because of
+	 * direct_access API needing 4k alignment, returning a PFN
+	 */
+	blk_queue_physical_block_size(prd->prd_queue, PAGE_SIZE);
+	prd->prd_queue->limits.io_min = 512; /* Don't use the accessor */
+
 	disk = prd->prd_disk = alloc_disk(0);
 	if (!disk)
 		goto out_free_queue;
@@ -308,24 +314,6 @@ static void prd_free(struct prd_device *prd)
 	kfree(prd);
 }
 
-static struct prd_device *prd_init_one(int i)
-{
-	struct prd_device *prd;
-
-	list_for_each_entry(prd, &prd_devices, prd_list) {
-		if (prd->prd_number == i)
-			goto out;
-	}
-
-	prd = prd_alloc(i);
-	if (prd) {
-		add_disk(prd->prd_disk);
-		list_add_tail(&prd->prd_list, &prd_devices);
-	}
-out:
-	return prd;
-}
-
 static void prd_del_one(struct prd_device *prd)
 {
 	list_del(&prd->prd_list);
@@ -333,16 +321,27 @@ static void prd_del_one(struct prd_device *prd)
 	prd_free(prd);
 }
 
+/*FIXME: Actually in our driver prd_probe is never used. Can be removed */
 static struct kobject *prd_probe(dev_t dev, int *part, void *data)
 {
 	struct prd_device *prd;
 	struct kobject *kobj;
+	int number = MINOR(dev);
 
 	mutex_lock(&prd_devices_mutex);
-	prd = prd_init_one(MINOR(dev));
-	kobj = prd ? get_disk(prd->prd_disk) : NULL;
-	mutex_unlock(&prd_devices_mutex);
 
+	list_for_each_entry(prd, &prd_devices, prd_list) {
+		if (prd->prd_number == number) {
+			kobj = get_disk(prd->prd_disk);
+			goto out;
+		}
+	}
+
+	pr_err("prd: prd_probe: Unexpected parameter=%d\n", number);
+	kobj = NULL;
+
+out:
+	mutex_unlock(&prd_devices_mutex);
 	return kobj;
 }
 
@@ -424,5 +423,7 @@ static void __exit prd_exit(void)
 
 MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
 MODULE_LICENSE("GPL");
+MODULE_ALIAS("pmem");
+
 module_init(prd_init);
 module_exit(prd_exit);
-- 
1.9.3


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC 5/9] SQUASHME: prd: Last fixes for partitions
  2014-08-14 13:04   ` Boaz Harrosh
@ 2014-08-14 13:16       ` Matthew Wilcox
  0 siblings, 0 replies; 48+ messages in thread
From: Matthew Wilcox @ 2014-08-14 13:16 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Sagi Manole, Yigal Korman

On Thu, Aug 14, 2014 at 04:04:53PM +0300, Boaz Harrosh wrote:
> > @@ -218,13 +218,13 @@ static long prd_direct_access(struct block_device *bdev, sector_t sector,
> >  {
> >  	struct prd_device *prd = bdev->bd_disk->private_data;
> >  
> > -	if (!prd)
> > +	if (unlikely(!prd))
> >  		return -ENODEV;
> >  
> >  	*kaddr = prd_lookup_pg_addr(prd, sector);
> >  	*pfn = prd_lookup_pfn(prd, sector);
> >  
> > -	return size;
> > +	return min_t(long, size, prd->size);
> 
> This is off course a BUG need to subtract offset, will send version 2

I was wondering about simplifying the return value for the drivers
a little.  Something like this:

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 741293f..8709b9f 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -149,7 +149,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	*kaddr = (void *)(bank->ph_addr + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return min_t(long, size, bank->size - offset);
+	return bank->size - offset;
 }
 
 static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3483458..344681a 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -384,9 +384,9 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	*kaddr = page_address(page);
 	*pfn = page_to_pfn(page);
 
-	/* Could optimistically check to see if the next page in the
-	 * file is mapped to the next page of physical RAM */
-	return min_t(long, PAGE_SIZE, size);
+	/* If size > PAGE_SIZE, we could look to see if the next page in the
+	 * file happens to be mapped to the next page of physical RAM */
+	return PAGE_SIZE;
 }
 #else
 #define brd_direct_access NULL
diff --git a/drivers/block/prd.c b/drivers/block/prd.c
index cc0aabf..1cfbd5b 100644
--- a/drivers/block/prd.c
+++ b/drivers/block/prd.c
@@ -216,7 +216,7 @@ static long prd_direct_access(struct block_device *bdev, sector_t sector,
 	*kaddr = prd_lookup_pg_addr(prd, sector);
 	*pfn = prd_lookup_pfn(prd, sector);
 
-	return size;
+	return size - (sector * 512);
 }
 
 static const struct block_device_operations prd_fops = {
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 2ee5556..96bc411 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -881,7 +881,7 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
 	*kaddr = (void *) (dev_info->start + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return min_t(long, size, dev_sz - offset);
+	return dev_sz - offset;
 }
 
 static void
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93ebdd53..ce3e69c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -447,6 +447,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
 long bdev_direct_access(struct block_device *bdev, sector_t sector,
 			void **addr, unsigned long *pfn, long size)
 {
+	long max;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
 	if (!ops->direct_access)
 		return -EOPNOTSUPP;
@@ -456,8 +457,10 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	sector += get_start_sect(bdev);
 	if (sector % (PAGE_SIZE / 512))
 		return -EINVAL;
-	size = ops->direct_access(bdev, sector, addr, pfn, size);
-	return size ? size : -ERANGE;
+	max = ops->direct_access(bdev, sector, addr, pfn, size);
+	if (!max)
+		return -ERANGE;
+	return min(max, size);
 }
 EXPORT_SYMBOL_GPL(bdev_direct_access);
 

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC 5/9] SQUASHME: prd: Last fixes for partitions
@ 2014-08-14 13:16       ` Matthew Wilcox
  0 siblings, 0 replies; 48+ messages in thread
From: Matthew Wilcox @ 2014-08-14 13:16 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Sagi Manole, Yigal Korman

On Thu, Aug 14, 2014 at 04:04:53PM +0300, Boaz Harrosh wrote:
> > @@ -218,13 +218,13 @@ static long prd_direct_access(struct block_device *bdev, sector_t sector,
> >  {
> >  	struct prd_device *prd = bdev->bd_disk->private_data;
> >  
> > -	if (!prd)
> > +	if (unlikely(!prd))
> >  		return -ENODEV;
> >  
> >  	*kaddr = prd_lookup_pg_addr(prd, sector);
> >  	*pfn = prd_lookup_pfn(prd, sector);
> >  
> > -	return size;
> > +	return min_t(long, size, prd->size);
> 
> This is off course a BUG need to subtract offset, will send version 2

I was wondering about simplifying the return value for the drivers
a little.  Something like this:

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 741293f..8709b9f 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -149,7 +149,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	*kaddr = (void *)(bank->ph_addr + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return min_t(long, size, bank->size - offset);
+	return bank->size - offset;
 }
 
 static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3483458..344681a 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -384,9 +384,9 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	*kaddr = page_address(page);
 	*pfn = page_to_pfn(page);
 
-	/* Could optimistically check to see if the next page in the
-	 * file is mapped to the next page of physical RAM */
-	return min_t(long, PAGE_SIZE, size);
+	/* If size > PAGE_SIZE, we could look to see if the next page in the
+	 * file happens to be mapped to the next page of physical RAM */
+	return PAGE_SIZE;
 }
 #else
 #define brd_direct_access NULL
diff --git a/drivers/block/prd.c b/drivers/block/prd.c
index cc0aabf..1cfbd5b 100644
--- a/drivers/block/prd.c
+++ b/drivers/block/prd.c
@@ -216,7 +216,7 @@ static long prd_direct_access(struct block_device *bdev, sector_t sector,
 	*kaddr = prd_lookup_pg_addr(prd, sector);
 	*pfn = prd_lookup_pfn(prd, sector);
 
-	return size;
+	return size - (sector * 512);
 }
 
 static const struct block_device_operations prd_fops = {
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 2ee5556..96bc411 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -881,7 +881,7 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
 	*kaddr = (void *) (dev_info->start + offset);
 	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
-	return min_t(long, size, dev_sz - offset);
+	return dev_sz - offset;
 }
 
 static void
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93ebdd53..ce3e69c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -447,6 +447,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
 long bdev_direct_access(struct block_device *bdev, sector_t sector,
 			void **addr, unsigned long *pfn, long size)
 {
+	long max;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
 	if (!ops->direct_access)
 		return -EOPNOTSUPP;
@@ -456,8 +457,10 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
 	sector += get_start_sect(bdev);
 	if (sector % (PAGE_SIZE / 512))
 		return -EINVAL;
-	size = ops->direct_access(bdev, sector, addr, pfn, size);
-	return size ? size : -ERANGE;
+	max = ops->direct_access(bdev, sector, addr, pfn, size);
+	if (!max)
+		return -ERANGE;
+	return min(max, size);
 }
 EXPORT_SYMBOL_GPL(bdev_direct_access);
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC 5/9] SQUASHME: prd: Last fixes for partitions
  2014-08-14 13:16       ` Matthew Wilcox
@ 2014-08-14 13:55         ` Boaz Harrosh
  -1 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-14 13:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Sagi Manole, Yigal Korman

On 08/14/2014 04:16 PM, Matthew Wilcox wrote:
> On Thu, Aug 14, 2014 at 04:04:53PM +0300, Boaz Harrosh wrote:
>>> @@ -218,13 +218,13 @@ static long prd_direct_access(struct block_device *bdev, sector_t sector,
>>>  {
>>>  	struct prd_device *prd = bdev->bd_disk->private_data;
>>>  
>>> -	if (!prd)
>>> +	if (unlikely(!prd))
>>>  		return -ENODEV;
>>>  
>>>  	*kaddr = prd_lookup_pg_addr(prd, sector);
>>>  	*pfn = prd_lookup_pfn(prd, sector);
>>>  
>>> -	return size;
>>> +	return min_t(long, size, prd->size);
>>
>> This is off course a BUG need to subtract offset, will send version 2
> 
> I was wondering about simplifying the return value for the drivers
> a little.  Something like this:
> 

Sure, looks good

I will produce a V2 for the brd-partitions set to send Jens, sometime
next week.

Thanks
Boaz

> diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
> index 741293f..8709b9f 100644
> --- a/arch/powerpc/sysdev/axonram.c
> +++ b/arch/powerpc/sysdev/axonram.c
> @@ -149,7 +149,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
>  	*kaddr = (void *)(bank->ph_addr + offset);
>  	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
>  
> -	return min_t(long, size, bank->size - offset);
> +	return bank->size - offset;
>  }
>  
>  static const struct block_device_operations axon_ram_devops = {
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index 3483458..344681a 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -384,9 +384,9 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
>  	*kaddr = page_address(page);
>  	*pfn = page_to_pfn(page);
>  
> -	/* Could optimistically check to see if the next page in the
> -	 * file is mapped to the next page of physical RAM */
> -	return min_t(long, PAGE_SIZE, size);
> +	/* If size > PAGE_SIZE, we could look to see if the next page in the
> +	 * file happens to be mapped to the next page of physical RAM */
> +	return PAGE_SIZE;
>  }
>  #else
>  #define brd_direct_access NULL
> diff --git a/drivers/block/prd.c b/drivers/block/prd.c
> index cc0aabf..1cfbd5b 100644
> --- a/drivers/block/prd.c
> +++ b/drivers/block/prd.c
> @@ -216,7 +216,7 @@ static long prd_direct_access(struct block_device *bdev, sector_t sector,
>  	*kaddr = prd_lookup_pg_addr(prd, sector);
>  	*pfn = prd_lookup_pfn(prd, sector);
>  
> -	return size;
> +	return size - (sector * 512);
>  }
>  
>  static const struct block_device_operations prd_fops = {
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 2ee5556..96bc411 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -881,7 +881,7 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
>  	*kaddr = (void *) (dev_info->start + offset);
>  	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
>  
> -	return min_t(long, size, dev_sz - offset);
> +	return dev_sz - offset;
>  }
>  
>  static void
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 93ebdd53..ce3e69c 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -447,6 +447,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
>  long bdev_direct_access(struct block_device *bdev, sector_t sector,
>  			void **addr, unsigned long *pfn, long size)
>  {
> +	long max;
>  	const struct block_device_operations *ops = bdev->bd_disk->fops;
>  	if (!ops->direct_access)
>  		return -EOPNOTSUPP;
> @@ -456,8 +457,10 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
>  	sector += get_start_sect(bdev);
>  	if (sector % (PAGE_SIZE / 512))
>  		return -EINVAL;
> -	size = ops->direct_access(bdev, sector, addr, pfn, size);
> -	return size ? size : -ERANGE;
> +	max = ops->direct_access(bdev, sector, addr, pfn, size);
> +	if (!max)
> +		return -ERANGE;
> +	return min(max, size);
>  }
>  EXPORT_SYMBOL_GPL(bdev_direct_access);
>  
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 5/9] SQUASHME: prd: Last fixes for partitions
@ 2014-08-14 13:55         ` Boaz Harrosh
  0 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-14 13:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Sagi Manole, Yigal Korman

On 08/14/2014 04:16 PM, Matthew Wilcox wrote:
> On Thu, Aug 14, 2014 at 04:04:53PM +0300, Boaz Harrosh wrote:
>>> @@ -218,13 +218,13 @@ static long prd_direct_access(struct block_device *bdev, sector_t sector,
>>>  {
>>>  	struct prd_device *prd = bdev->bd_disk->private_data;
>>>  
>>> -	if (!prd)
>>> +	if (unlikely(!prd))
>>>  		return -ENODEV;
>>>  
>>>  	*kaddr = prd_lookup_pg_addr(prd, sector);
>>>  	*pfn = prd_lookup_pfn(prd, sector);
>>>  
>>> -	return size;
>>> +	return min_t(long, size, prd->size);
>>
>> This is off course a BUG need to subtract offset, will send version 2
> 
> I was wondering about simplifying the return value for the drivers
> a little.  Something like this:
> 

Sure, looks good

I will produce a V2 for the brd-partitions set to send Jens, sometime
next week.

Thanks
Boaz

> diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
> index 741293f..8709b9f 100644
> --- a/arch/powerpc/sysdev/axonram.c
> +++ b/arch/powerpc/sysdev/axonram.c
> @@ -149,7 +149,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
>  	*kaddr = (void *)(bank->ph_addr + offset);
>  	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
>  
> -	return min_t(long, size, bank->size - offset);
> +	return bank->size - offset;
>  }
>  
>  static const struct block_device_operations axon_ram_devops = {
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index 3483458..344681a 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -384,9 +384,9 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
>  	*kaddr = page_address(page);
>  	*pfn = page_to_pfn(page);
>  
> -	/* Could optimistically check to see if the next page in the
> -	 * file is mapped to the next page of physical RAM */
> -	return min_t(long, PAGE_SIZE, size);
> +	/* If size > PAGE_SIZE, we could look to see if the next page in the
> +	 * file happens to be mapped to the next page of physical RAM */
> +	return PAGE_SIZE;
>  }
>  #else
>  #define brd_direct_access NULL
> diff --git a/drivers/block/prd.c b/drivers/block/prd.c
> index cc0aabf..1cfbd5b 100644
> --- a/drivers/block/prd.c
> +++ b/drivers/block/prd.c
> @@ -216,7 +216,7 @@ static long prd_direct_access(struct block_device *bdev, sector_t sector,
>  	*kaddr = prd_lookup_pg_addr(prd, sector);
>  	*pfn = prd_lookup_pfn(prd, sector);
>  
> -	return size;
> +	return size - (sector * 512);
>  }
>  
>  static const struct block_device_operations prd_fops = {
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 2ee5556..96bc411 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -881,7 +881,7 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
>  	*kaddr = (void *) (dev_info->start + offset);
>  	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
>  
> -	return min_t(long, size, dev_sz - offset);
> +	return dev_sz - offset;
>  }
>  
>  static void
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 93ebdd53..ce3e69c 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -447,6 +447,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
>  long bdev_direct_access(struct block_device *bdev, sector_t sector,
>  			void **addr, unsigned long *pfn, long size)
>  {
> +	long max;
>  	const struct block_device_operations *ops = bdev->bd_disk->fops;
>  	if (!ops->direct_access)
>  		return -EOPNOTSUPP;
> @@ -456,8 +457,10 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
>  	sector += get_start_sect(bdev);
>  	if (sector % (PAGE_SIZE / 512))
>  		return -EINVAL;
> -	size = ops->direct_access(bdev, sector, addr, pfn, size);
> -	return size ? size : -ERANGE;
> +	max = ops->direct_access(bdev, sector, addr, pfn, size);
> +	if (!max)
> +		return -ERANGE;
> +	return min(max, size);
>  }
>  EXPORT_SYMBOL_GPL(bdev_direct_access);
>  
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 9/9] prd: Add support for page struct mapping
  2014-08-13 12:26 ` [RFC 9/9] prd: Add support for page struct mapping Boaz Harrosh
@ 2014-08-15 20:28   ` Toshi Kani
  2014-08-17  9:17     ` Boaz Harrosh
  2014-08-22 14:36     ` Dave Hansen
  1 sibling, 1 reply; 48+ messages in thread
From: Toshi Kani @ 2014-08-15 20:28 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

On Wed, 2014-08-13 at 15:26 +0300, Boaz Harrosh wrote:
> From: Yigal Korman <yigal@plexistor.com>
> 
> One of the current short comings of the NVDIMM/PMEM
> support is that this memory does not have a page-struct(s)
> associated with its memory and therefor cannot be passed
> to a block-device or network or DMAed in any way through
> another device in the system.
> 
> This simple patch fixes all this. After this patch an FS
> can do:
> 	bdev_direct_access(,&pfn,);
> 	page = pfn_to_page(pfn);
> And use that page for a lock_page(), set_page_dirty(), and/or
> anything else one might do with a page *.
> (Note that with brd one can already do this)
> 
> [pmem-pages-ref-count]
> pmem will serve it's pages with ref==0. Once an FS does
> an blkdev_get_XXX(,FMODE_EXCL,), that memory is own by the FS.
> The FS needs to manage its allocation, just as it already does
> for its disk blocks. The fs should set page->count = 2, before
> submission to any Kernel subsystem so when it returns it will
> never be released to the Kernel's page-allocators. (page_freeze)
> 
> All is actually needed for this is to allocate page-sections
> and map them into kernel virtual memory. Note that these sections
> are not associated with any zone, because that would add them to
> the page_allocators.

Can we just use memory hotplug and call add_memory(), instead of
directly calling sparse_add_one_section()?  Memory hotplug adds memory
as off-line state, and sets all pages reserved.  So, I do not think the
page allocators will mess with them (unless you put them online).  It
can also maps the pages with large page size.

Thanks,
-Toshi


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 9/9] prd: Add support for page struct mapping
  2014-08-15 20:28   ` Toshi Kani
@ 2014-08-17  9:17     ` Boaz Harrosh
  2014-08-18 19:48         ` Toshi Kani
  0 siblings, 1 reply; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-17  9:17 UTC (permalink / raw)
  To: Toshi Kani, Boaz Harrosh
  Cc: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

On 08/15/2014 11:28 PM, Toshi Kani wrote:
> On Wed, 2014-08-13 at 15:26 +0300, Boaz Harrosh wrote:
>> From: Yigal Korman <yigal@plexistor.com>
>>
<>
>>
>> All is actually needed for this is to allocate page-sections
>> and map them into kernel virtual memory. Note that these sections
>> are not associated with any zone, because that would add them to
>> the page_allocators.
> 
> Can we just use memory hotplug and call add_memory(), instead of
> directly calling sparse_add_one_section()?  Memory hotplug adds memory
> as off-line state, and sets all pages reserved.  So, I do not think the
> page allocators will mess with them (unless you put them online).  It
> can also maps the pages with large page size.
> 
> Thanks,
> -Toshi
> 

Thank you Toshi for your reply

I was thinking about that as well at first, but I was afraid, once I call
add_memory() what will prevent the user from enabling that memory through the sysfs
interface later, it looks to me that add_memory() will add all the necessary knobs
to do it.

It is very important to keep a clear distinction, pmem is *not* memory what-so-ever
it is however memory-mapped and needs these accesses enabled for it, hence the need
for page-struct so we can DMA it off the buss.

I am very afraid of any thing that will associate a "zone" with this memory.
Also the:
	firmware_map_add_hotplug(start, start + size, "System RAM");

"System RAM" it is not. And also I think that for DDR4 NvDIMMs we will fail with:
	ret = check_hotplug_memory_range(start, size);

Thanks
Boaz

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 9/9] prd: Add support for page struct mapping
  2014-08-17  9:17     ` Boaz Harrosh
@ 2014-08-18 19:48         ` Toshi Kani
  0 siblings, 0 replies; 48+ messages in thread
From: Toshi Kani @ 2014-08-18 19:48 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boaz Harrosh, Ross Zwisler, linux-fsdevel, Andrew Morton,
	linux-mm, Matthew Wilcox, Sagi Manole, Yigal Korman

On Sun, 2014-08-17 at 12:17 +0300, Boaz Harrosh wrote:
> On 08/15/2014 11:28 PM, Toshi Kani wrote:
> > On Wed, 2014-08-13 at 15:26 +0300, Boaz Harrosh wrote:
> >> From: Yigal Korman <yigal@plexistor.com>
 :
> >> All is actually needed for this is to allocate page-sections
> >> and map them into kernel virtual memory. Note that these sections
> >> are not associated with any zone, because that would add them to
> >> the page_allocators.
> > 
> > Can we just use memory hotplug and call add_memory(), instead of
> > directly calling sparse_add_one_section()?  Memory hotplug adds memory
> > as off-line state, and sets all pages reserved.  So, I do not think the
> > page allocators will mess with them (unless you put them online).  It
> > can also maps the pages with large page size.
> > 
> > Thanks,
> > -Toshi
> > 
> 
> Thank you Toshi for your reply
> 
> I was thinking about that as well at first, but I was afraid, once I call
> add_memory() what will prevent the user from enabling that memory through the sysfs
> interface later, it looks to me that add_memory() will add all the necessary knobs
> to do it.
> 
> It is very important to keep a clear distinction, pmem is *not* memory what-so-ever
> it is however memory-mapped and needs these accesses enabled for it, hence the need
> for page-struct so we can DMA it off the buss.
> 
> I am very afraid of any thing that will associate a "zone" with this memory.
> Also the:
> 	firmware_map_add_hotplug(start, start + size, "System RAM");
> 
> "System RAM" it is not. 

I think add_memory() can be easily extended (or modified to provide a
separate interface) for persistent memory, and avoid creating the sysfs
interface and change the handling with firmware_map.  But I can also see
your point that persistent memory should not be added to zone at all.

Anyway, I am a bit concerned with the way to create direct mappings with
map_vm_area() within the prd driver.  Can we use init_memory_mapping()
as it's used by add_memory() and supports large page size?  The size of
persistent memory will grow up quickly.  Also, I'd prefer to have an mm
interface that takes care of page allocations and mappings, and avoid a
driver to deal with them.

> And also I think that for DDR4 NvDIMMs we will fail with:
> 	ret = check_hotplug_memory_range(start, size);
> 

Can you elaborate why DDR4 will fail with the function above?

Thanks,
-Toshi


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 9/9] prd: Add support for page struct mapping
@ 2014-08-18 19:48         ` Toshi Kani
  0 siblings, 0 replies; 48+ messages in thread
From: Toshi Kani @ 2014-08-18 19:48 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boaz Harrosh, Ross Zwisler, linux-fsdevel, Andrew Morton,
	linux-mm, Matthew Wilcox, Sagi Manole, Yigal Korman

On Sun, 2014-08-17 at 12:17 +0300, Boaz Harrosh wrote:
> On 08/15/2014 11:28 PM, Toshi Kani wrote:
> > On Wed, 2014-08-13 at 15:26 +0300, Boaz Harrosh wrote:
> >> From: Yigal Korman <yigal@plexistor.com>
 :
> >> All is actually needed for this is to allocate page-sections
> >> and map them into kernel virtual memory. Note that these sections
> >> are not associated with any zone, because that would add them to
> >> the page_allocators.
> > 
> > Can we just use memory hotplug and call add_memory(), instead of
> > directly calling sparse_add_one_section()?  Memory hotplug adds memory
> > as off-line state, and sets all pages reserved.  So, I do not think the
> > page allocators will mess with them (unless you put them online).  It
> > can also maps the pages with large page size.
> > 
> > Thanks,
> > -Toshi
> > 
> 
> Thank you Toshi for your reply
> 
> I was thinking about that as well at first, but I was afraid, once I call
> add_memory() what will prevent the user from enabling that memory through the sysfs
> interface later, it looks to me that add_memory() will add all the necessary knobs
> to do it.
> 
> It is very important to keep a clear distinction, pmem is *not* memory what-so-ever
> it is however memory-mapped and needs these accesses enabled for it, hence the need
> for page-struct so we can DMA it off the buss.
> 
> I am very afraid of any thing that will associate a "zone" with this memory.
> Also the:
> 	firmware_map_add_hotplug(start, start + size, "System RAM");
> 
> "System RAM" it is not. 

I think add_memory() can be easily extended (or modified to provide a
separate interface) for persistent memory, and avoid creating the sysfs
interface and change the handling with firmware_map.  But I can also see
your point that persistent memory should not be added to zone at all.

Anyway, I am a bit concerned with the way to create direct mappings with
map_vm_area() within the prd driver.  Can we use init_memory_mapping()
as it's used by add_memory() and supports large page size?  The size of
persistent memory will grow up quickly.  Also, I'd prefer to have an mm
interface that takes care of page allocations and mappings, and avoid a
driver to deal with them.

> And also I think that for DDR4 NvDIMMs we will fail with:
> 	ret = check_hotplug_memory_range(start, size);
> 

Can you elaborate why DDR4 will fail with the function above?

Thanks,
-Toshi


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 9/9] prd: Add support for page struct mapping
  2014-08-18 19:48         ` Toshi Kani
  (?)
@ 2014-08-19  8:40         ` Boaz Harrosh
  2014-08-19 16:49             ` Toshi Kani
  -1 siblings, 1 reply; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-19  8:40 UTC (permalink / raw)
  To: Toshi Kani, Boaz Harrosh
  Cc: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Yigal Korman

On 08/18/2014 10:48 PM, Toshi Kani wrote:
> On Sun, 2014-08-17 at 12:17 +0300, Boaz Harrosh wrote:
<>
>> "System RAM" it is not. 
> 
> I think add_memory() can be easily extended (or modified to provide a
> separate interface) for persistent memory, and avoid creating the sysfs
> interface and change the handling with firmware_map.  But I can also see
> your point that persistent memory should not be added to zone at all.
> 

Right

> Anyway, I am a bit concerned with the way to create direct mappings with
> map_vm_area() within the prd driver.  Can we use init_memory_mapping()
> as it's used by add_memory() and supports large page size?  The size of
> persistent memory will grow up quickly.

A bit about large page size. The principal reason of my effort here is
that at some stage I need to send pmem blocks to block-layer or network.

The PAGE == 4K is pasted all over the block stack. Do you know how those
can work together? will we need some kind of page_split thing how does
that work?

> Also, I'd prefer to have an mm
> interface that takes care of page allocations and mappings, and avoid a
> driver to deal with them.
> 

This is a great idea you mean that I define:
+	int mm_add_page_mapping(phys_addr_t phys_addr, size_t total_size,
+				void **o_virt_addr)

At the mm level. OK It needs a much better name.

I know of 2 more drivers that will need the use of the same interface
actually, so you are absolutely right. I didn't dare ask ;-)

>> And also I think that for DDR4 NvDIMMs we will fail with:
>> 	ret = check_hotplug_memory_range(start, size);
>>
> 
> Can you elaborate why DDR4 will fail with the function above?
> 

I'm not at all familiar with the details, perhaps the Intel
guys that knows better can chip in, but from the little I
understood: Today with DDR3 these chips come up at the e820
controller, as type 12 memory and, each vendor has a driver
to drive proprietary enablement and persistence.
With DDR4 it will all be standardized, but it will not come
up through the e820 manager, but as a separate device on the
SMBus/ACPI.
So it is not clear to me that we want to plug this back into
the ARCH's memory controllers. check_hotplug_memory_range is
it per ARCH?

> Thanks,
> -Toshi
> 
> 

I will produce a new Patchset that introduces a new API
for drivers. And I will try to see about the use of
init_memory_mapping(), as long as it is not using
zones.

Do you think that the new code should sit in?
	mm/memory_hotplug.c

Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 9/9] prd: Add support for page struct mapping
  2014-08-19  8:40         ` Boaz Harrosh
@ 2014-08-19 16:49             ` Toshi Kani
  0 siblings, 0 replies; 48+ messages in thread
From: Toshi Kani @ 2014-08-19 16:49 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boaz Harrosh, Ross Zwisler, linux-fsdevel, Andrew Morton,
	linux-mm, Matthew Wilcox, Yigal Korman

On Tue, 2014-08-19 at 11:40 +0300, Boaz Harrosh wrote:
> On 08/18/2014 10:48 PM, Toshi Kani wrote:
> > On Sun, 2014-08-17 at 12:17 +0300, Boaz Harrosh wrote:
> <>
> >> "System RAM" it is not. 
> > 
> > I think add_memory() can be easily extended (or modified to provide a
> > separate interface) for persistent memory, and avoid creating the sysfs
> > interface and change the handling with firmware_map.  But I can also see
> > your point that persistent memory should not be added to zone at all.
> > 
> 
> Right
> 
> > Anyway, I am a bit concerned with the way to create direct mappings with
> > map_vm_area() within the prd driver.  Can we use init_memory_mapping()
> > as it's used by add_memory() and supports large page size?  The size of
> > persistent memory will grow up quickly.
> 
> A bit about large page size. The principal reason of my effort here is
> that at some stage I need to send pmem blocks to block-layer or network.
> 
> The PAGE == 4K is pasted all over the block stack. Do you know how those
> can work together? will we need some kind of page_split thing how does
> that work?

I do not think there will be any problem. struct page's are still
allocated for each 4KB. When you change cache attribute with
set_memory_<type>(), it will split into 4K mappings if necessary.

> > Also, I'd prefer to have an mm
> > interface that takes care of page allocations and mappings, and avoid a
> > driver to deal with them.
> > 
> 
> This is a great idea you mean that I define:
> +	int mm_add_page_mapping(phys_addr_t phys_addr, size_t total_size,
> +				void **o_virt_addr)
> 
> At the mm level. OK It needs a much better name.
> 
> I know of 2 more drivers that will need the use of the same interface
> actually, so you are absolutely right. I didn't dare ask ;-)

I think the new interface should be analogous to add_memory(). Perhaps,
the name can be something like add_persistent_memory().

> >> And also I think that for DDR4 NvDIMMs we will fail with:
> >> 	ret = check_hotplug_memory_range(start, size);
> >>
> > 
> > Can you elaborate why DDR4 will fail with the function above?
> > 
> 
> I'm not at all familiar with the details, perhaps the Intel
> guys that knows better can chip in, but from the little I
> understood: Today with DDR3 these chips come up at the e820
> controller, as type 12 memory and, each vendor has a driver
> to drive proprietary enablement and persistence.
> With DDR4 it will all be standardized, but it will not come
> up through the e820 manager, but as a separate device on the
> SMBus/ACPI.
> So it is not clear to me that we want to plug this back into
> the ARCH's memory controllers. check_hotplug_memory_range is
> it per ARCH?

check_hotplug_memory_range() is a common function, but the section size
is defined per architecture. On x86, the size is 128MB. I do not think
the firmware interface is going to be a problem for this. Some NVDIMM
may allow a window size to be smaller than 128MB, but the driver can
manage to configure with a proper size. 

> I will produce a new Patchset that introduces a new API
> for drivers. And I will try to see about the use of
> init_memory_mapping(), as long as it is not using
> zones.
> 
> Do you think that the new code should sit in?
> 	mm/memory_hotplug.c
> 

Great.  Yes, I agree that the new code should sit in
mm/memory_hotplug.c.

Thanks,
-Toshi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 9/9] prd: Add support for page struct mapping
@ 2014-08-19 16:49             ` Toshi Kani
  0 siblings, 0 replies; 48+ messages in thread
From: Toshi Kani @ 2014-08-19 16:49 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boaz Harrosh, Ross Zwisler, linux-fsdevel, Andrew Morton,
	linux-mm, Matthew Wilcox, Yigal Korman

On Tue, 2014-08-19 at 11:40 +0300, Boaz Harrosh wrote:
> On 08/18/2014 10:48 PM, Toshi Kani wrote:
> > On Sun, 2014-08-17 at 12:17 +0300, Boaz Harrosh wrote:
> <>
> >> "System RAM" it is not. 
> > 
> > I think add_memory() can be easily extended (or modified to provide a
> > separate interface) for persistent memory, and avoid creating the sysfs
> > interface and change the handling with firmware_map.  But I can also see
> > your point that persistent memory should not be added to zone at all.
> > 
> 
> Right
> 
> > Anyway, I am a bit concerned with the way to create direct mappings with
> > map_vm_area() within the prd driver.  Can we use init_memory_mapping()
> > as it's used by add_memory() and supports large page size?  The size of
> > persistent memory will grow up quickly.
> 
> A bit about large page size. The principal reason of my effort here is
> that at some stage I need to send pmem blocks to block-layer or network.
> 
> The PAGE == 4K is pasted all over the block stack. Do you know how those
> can work together? will we need some kind of page_split thing how does
> that work?

I do not think there will be any problem. struct page's are still
allocated for each 4KB. When you change cache attribute with
set_memory_<type>(), it will split into 4K mappings if necessary.

> > Also, I'd prefer to have an mm
> > interface that takes care of page allocations and mappings, and avoid a
> > driver to deal with them.
> > 
> 
> This is a great idea you mean that I define:
> +	int mm_add_page_mapping(phys_addr_t phys_addr, size_t total_size,
> +				void **o_virt_addr)
> 
> At the mm level. OK It needs a much better name.
> 
> I know of 2 more drivers that will need the use of the same interface
> actually, so you are absolutely right. I didn't dare ask ;-)

I think the new interface should be analogous to add_memory(). Perhaps,
the name can be something like add_persistent_memory().

> >> And also I think that for DDR4 NvDIMMs we will fail with:
> >> 	ret = check_hotplug_memory_range(start, size);
> >>
> > 
> > Can you elaborate why DDR4 will fail with the function above?
> > 
> 
> I'm not at all familiar with the details, perhaps the Intel
> guys that knows better can chip in, but from the little I
> understood: Today with DDR3 these chips come up at the e820
> controller, as type 12 memory and, each vendor has a driver
> to drive proprietary enablement and persistence.
> With DDR4 it will all be standardized, but it will not come
> up through the e820 manager, but as a separate device on the
> SMBus/ACPI.
> So it is not clear to me that we want to plug this back into
> the ARCH's memory controllers. check_hotplug_memory_range is
> it per ARCH?

check_hotplug_memory_range() is a common function, but the section size
is defined per architecture. On x86, the size is 128MB. I do not think
the firmware interface is going to be a problem for this. Some NVDIMM
may allow a window size to be smaller than 128MB, but the driver can
manage to configure with a proper size. 

> I will produce a new Patchset that introduces a new API
> for drivers. And I will try to see about the use of
> init_memory_mapping(), as long as it is not using
> zones.
> 
> Do you think that the new code should sit in?
> 	mm/memory_hotplug.c
> 

Great.  Yes, I agree that the new code should sit in
mm/memory_hotplug.c.

Thanks,
-Toshi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 0/9] pmem: Support for "struct page" with Persistent Memory storage
  2014-08-13 12:08 ` Boaz Harrosh
                   ` (9 preceding siblings ...)
  (?)
@ 2014-08-20 20:13 ` Ross Zwisler
  -1 siblings, 0 replies; 48+ messages in thread
From: Ross Zwisler @ 2014-08-20 20:13 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-fsdevel, Andrew Morton, linux-mm, Matthew Wilcox,
	Sagi Manole, Yigal Korman

On Wed, 2014-08-13 at 15:08 +0300, Boaz Harrosh wrote:
> And one last Rant if I may?
> I hate the prd name, why why? OK so a freak of bad luck forced us to invent a new name for
> /dev/ram because it would be weird to do an lsmod and see a ram.ko hanging there which is actually
> a block device driver. OK so a brd == /dev/ram. But why do we need to carry this punishment forever?
> Why an additional/different name in the namespace? /dev/foo should just be foo.ko in lsmod, No?
> So please, please, for my peace of mind can we call this driver pmem.ko?
> I know, I would hate it if I was inventing a name and people change it, so Ross it is your call, is
> it OK if we move back to just call it pmem everywhere?

Hi Boaz,

Yep, I'm fine with the rename from prd to pmem everywhere.  I've made this
change and will be pushing out an updated version to my GitHub repo shortly.

Thanks,
- Ross


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 4/9] SQUASHME: prd: Fixs to getgeo
  2014-08-13 12:14 ` [RFC 4/9] SQUASHME: prd: Fixs to getgeo Boaz Harrosh
@ 2014-08-20 22:10   ` Ross Zwisler
  2014-08-21  9:47     ` Boaz Harrosh
  0 siblings, 1 reply; 48+ messages in thread
From: Ross Zwisler @ 2014-08-20 22:10 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-fsdevel, Andrew Morton, linux-mm, Matthew Wilcox,
	Sagi Manole, Yigal Korman

On Wed, 2014-08-13 at 15:14 +0300, Boaz Harrosh wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
> 
> With current values fdisk does the wrong thing.
> 
> Setting all values to 1, will make everything nice and easy.
> 
> Note that current code had a BUG with anything bigger than
> 64G because hd_geometry->cylinders is ushort and it would
> overflow at this value. Any way capacity is not calculated
> through getgeo so it does not matter what you put here.
> 
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  drivers/block/prd.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/block/prd.c b/drivers/block/prd.c
> index cc0aabf..62af81e 100644
> --- a/drivers/block/prd.c
> +++ b/drivers/block/prd.c
> @@ -55,10 +55,18 @@ struct prd_device {
>  
>  static int prd_getgeo(struct block_device *bd, struct hd_geometry *geo)
>  {
> -	/* some standard values */
> -	geo->heads = 1 << 6;
> -	geo->sectors = 1 << 5;
> -	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
> +	/* Just tell fdisk to get out of the way. The math here is so
> +	 * convoluted and does not make any sense at all. With all 1s
> +	 * The math just gets out of the way.
> +	 * NOTE: I was trying to get some values that will make fdisk
> +	 * Want to align first sector on 4K (like 8, 16, 20, ... sectors) but
> +	 * nothing worked, I searched the net the math is not your regular
> +	 * simple multiplication at all. If you managed to get these please
> +	 * fix here. For now we use 4k physical sectors for this
> +	 */
> +	geo->heads = 1;
> +	geo->sectors = 1;
> +	geo->cylinders = 1;
>  	return 0;
>  }

I'm okay with this change, but can you let me know in which case fdisk was
previously doing the wrong thing?  I'm just curious because I never saw it
misbehave, and wonder what else I should be testing.

Regarding the note in the comment, is this addressed by the
blk_queue_physical_block_size() and prd->prd_queue->limits.io_min changes in
your patch 5/9, or is it an open issue?  Either way, can we nix the NOTE?

Also, you put "SQUASHME" on this patch.  I'm planning on squashing all of my
patches together into an "initial version" type patch (see
https://github.com/01org/prd).  Based on this, it probably makes sense to keep
it separate so you get credit for the patch?

- Ross


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 5/9] SQUASHME: prd: Last fixes for partitions
  2014-08-13 12:16 ` [RFC 5/9] SQUASHME: prd: Last fixes for partitions Boaz Harrosh
  2014-08-14 13:04   ` Boaz Harrosh
  2014-08-14 13:07   ` [PATCH 5/9 v2] " Boaz Harrosh
@ 2014-08-20 23:03   ` Ross Zwisler
  2014-08-21 10:05       ` Boaz Harrosh
  2 siblings, 1 reply; 48+ messages in thread
From: Ross Zwisler @ 2014-08-20 23:03 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-fsdevel, Andrew Morton, linux-mm, Matthew Wilcox,
	Sagi Manole, Yigal Korman

On Wed, 2014-08-13 at 15:16 +0300, Boaz Harrosh wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
> 
> This streamlines prd with the latest brd code.
> 
> In prd we do not allocate new devices dynamically on devnod
> access, because we need parameterization of each device. So
> the dynamic allocation in prd_init_one is removed.
> 
> Therefor prd_init_one only called from prd_prob is moved
> there, now that it is small.
> 
> And other small fixes regarding partitions
> 
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  drivers/block/prd.c | 47 ++++++++++++++++++++++++-----------------------
>  1 file changed, 24 insertions(+), 23 deletions(-)

<snip>

> @@ -308,24 +314,6 @@ static void prd_free(struct prd_device *prd)
>  	kfree(prd);
>  }
>  
> -static struct prd_device *prd_init_one(int i)
> -{
> -	struct prd_device *prd;
> -
> -	list_for_each_entry(prd, &prd_devices, prd_list) {
> -		if (prd->prd_number == i)
> -			goto out;
> -	}
> -
> -	prd = prd_alloc(i);
> -	if (prd) {
> -		add_disk(prd->prd_disk);
> -		list_add_tail(&prd->prd_list, &prd_devices);
> -	}
> -out:
> -	return prd;
> -}
> -
>  static void prd_del_one(struct prd_device *prd)
>  {
>  	list_del(&prd->prd_list);
> @@ -333,16 +321,27 @@ static void prd_del_one(struct prd_device *prd)
>  	prd_free(prd);
>  }
>  
> +/*FIXME: Actually in our driver prd_probe is never used. Can be removed */
>  static struct kobject *prd_probe(dev_t dev, int *part, void *data)
>  {
>  	struct prd_device *prd;
>  	struct kobject *kobj;
> +	int number = MINOR(dev);
>  
>  	mutex_lock(&prd_devices_mutex);
> -	prd = prd_init_one(MINOR(dev));
> -	kobj = prd ? get_disk(prd->prd_disk) : NULL;
> -	mutex_unlock(&prd_devices_mutex);
>  
> +	list_for_each_entry(prd, &prd_devices, prd_list) {
> +		if (prd->prd_number == number) {
> +			kobj = get_disk(prd->prd_disk);
> +			goto out;
> +		}
> +	}
> +
> +	pr_err("prd: prd_probe: Unexpected parameter=%d\n", number);
> +	kobj = NULL;
> +
> +out:
> +	mutex_unlock(&prd_devices_mutex);
>  	return kobj;
>  }

I really like where you're going with getting rid of prd_probe.  Clearly I
just copied this from brd, but I'd love to be rid of it entirely.  Is there a
valid way for our probe function to get called?  If not, can we just have a
little stub with a BUG() in it to make sure we hear about it if it does ever
get called, and delete a bunch of code?

I think this would let us get rid of pmem_probe(), pmem_init_one(), and the
pmem_devices_mutex.

If there *is* a valid way for this code to get called, let's figure it out so
we can at least test this function.  This will be especially necessary as we
add support for more pmem disks.

>  
> @@ -424,5 +423,7 @@ static void __exit prd_exit(void)
>  
>  MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
>  MODULE_LICENSE("GPL");
> +MODULE_ALIAS("pmem");

Let's just go with the full rename s/prd/pmem/.  That turned out to be really
clean & made everything consistent - thanks for the good suggestion.

- Ross


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 4/9] SQUASHME: prd: Fixs to getgeo
  2014-08-20 22:10   ` Ross Zwisler
@ 2014-08-21  9:47     ` Boaz Harrosh
  0 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-21  9:47 UTC (permalink / raw)
  To: Ross Zwisler, Boaz Harrosh
  Cc: linux-fsdevel, Andrew Morton, linux-mm, Matthew Wilcox,
	Sagi Manole, Yigal Korman

On 08/21/2014 01:10 AM, Ross Zwisler wrote:
> On Wed, 2014-08-13 at 15:14 +0300, Boaz Harrosh wrote:
<>
>>  static int prd_getgeo(struct block_device *bd, struct hd_geometry *geo)
>>  {
>> -	/* some standard values */
>> -	geo->heads = 1 << 6;
>> -	geo->sectors = 1 << 5;
>> -	geo->cylinders = get_capacity(bd->bd_disk) >> 11;
>> +	/* Just tell fdisk to get out of the way. The math here is so
>> +	 * convoluted and does not make any sense at all. With all 1s
>> +	 * The math just gets out of the way.
>> +	 * NOTE: I was trying to get some values that will make fdisk
>> +	 * Want to align first sector on 4K (like 8, 16, 20, ... sectors) but
>> +	 * nothing worked, I searched the net the math is not your regular
>> +	 * simple multiplication at all. If you managed to get these please
>> +	 * fix here. For now we use 4k physical sectors for this
>> +	 */
>> +	geo->heads = 1;
>> +	geo->sectors = 1;
>> +	geo->cylinders = 1;
>>  	return 0;
>>  }
> 
> I'm okay with this change, but can you let me know in which case fdisk was
> previously doing the wrong thing?  I'm just curious because I never saw it
> misbehave, and wonder what else I should be testing.
> 

OK fdisk was doing a a few wrong things for us, this one here fixes one of
them.

Ways to reproduce
You need to have a small enough device, how small I do not know, but with
a 1G device fdisk will offer a 2048 first sectors [1M] always, so with big
devices you will not see this.

But with small devices without this patch fdisk will offer I can't remember
I think it was 18 originally (note the not 4K alignment) and 20 with my
other 4k-phisical patch.
With this one applied it will give me 8 as possible first sector.

What fdisk does is takes bunch of stuff into consideration and uses the
maximum as its base alignment. Then it has more code about the very
first sector that needs to be aligned on the alignment factor but has more
considerations like device size and stuff I guess it wants 1M at start of
disk to leave space for a boot sector.

I would love it if for brd and pmem fdisk will always offer 8, I intend to
send a patch to fdisk to fix that.

But regarding the above, with high values here we can get higher first sector
and funny alignments, with all 1(s) this math gets out of the way.

> Regarding the note in the comment, is this addressed by the
> blk_queue_physical_block_size() and prd->prd_queue->limits.io_min changes in
> your patch 5/9, or is it an open issue?  Either way, can we nix the NOTE?
> 

Yes  5/9 set-physical-sector to 4k fixes this problem and with that
fdisk will not offer the wrong numbers

What you mean nix, get rid of it? We should say something. I will try to
shorten it to a single paragraph, let me send you a fix.

> Also, you put "SQUASHME" on this patch.  I'm planning on squashing all of my
> patches together into an "initial version" type patch (see
> https://github.com/01org/prd).  Based on this, it probably makes sense to keep
> it separate so you get credit for the patch?
> 

The initial version did not include the getgeo patch and the rw_page patch.
I think it makes sense to keep pmem as a small patchset for submission, basic
functionality, then added optional API, one by one. It is easier for review and
stages things nicely. For me I have this list here:

a2cd031 Yigal Korman            |  (HEAD, prd) prd: Add support for page struct mapping 
5f3d00e Yigal Korman            |  mm: export sparse_add/remove_one_section 
5bdf5f7 Boaz Harrosh            |  pmem: Add getgeo to block ops 
387daf9 Ross Zwisler            |  pmem: add support for rw_page() 
5629da1 Ross Zwisler            |  pmem: Initial version of Persistent RAM Driver 

The first patch, Initial version will need to have both our sign-off and
a note about a tree with full history at the github address. I've been
participating in the pnfs tree where we worked like that for 4 years with
100ds of patches.

The one thing you must not do is delete the old trees. at first you make a branch
with the SQUASHMES as is, then rebase do the squashes and produce a new clean branch.
Tag them. Farther development gets as SQUASHMEs on top, tagged rebased and so on.
Then final upstream submission notes of the public tree that has the real history.
That's the process we worked with the big companies lawyers.
Note that anyone that touched a part of any patch, will need to have his sign-off
on it, then authorship is arbitrated by content.

> - Ross
> 


Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 5/9] SQUASHME: prd: Last fixes for partitions
  2014-08-20 23:03   ` [RFC 5/9] " Ross Zwisler
@ 2014-08-21 10:05       ` Boaz Harrosh
  0 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-21 10:05 UTC (permalink / raw)
  To: Ross Zwisler, Boaz Harrosh
  Cc: linux-fsdevel, Andrew Morton, linux-mm, Matthew Wilcox,
	Sagi Manole, Yigal Korman

On 08/21/2014 02:03 AM, Ross Zwisler wrote:
> On Wed, 2014-08-13 at 15:16 +0300, Boaz Harrosh wrote:
<>
> 
> I really like where you're going with getting rid of prd_probe.  Clearly I
> just copied this from brd, but I'd love to be rid of it entirely.  Is there a
> valid way for our probe function to get called?  If not, can we just have a
> little stub with a BUG() in it to make sure we hear about it if it does ever
> get called, and delete a bunch of code?
> 
> I think this would let us get rid of pmem_probe(), pmem_init_one(), and the
> pmem_devices_mutex.
> 

You lost me, pmem_init_one() is gone already, and yes the mutex can go away
as well right now after this patch. But please lets keep it I want to add
a sysfs interface to add more devices dynamically similar to osdblk.

The only thing I want to clean is the the pmem_free + pmem_del_one it can
be reduced to just one function.

> If there *is* a valid way for this code to get called, let's figure it out so
> we can at least test this function.  This will be especially necessary as we
> add support for more pmem disks.
> 

Let me investigate this one, I think we can get rid of it for sure, by
passing NULL to register. Surly there is no use case for it now.

>>  
>> @@ -424,5 +423,7 @@ static void __exit prd_exit(void)
>>  
>>  MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
>>  MODULE_LICENSE("GPL");
>> +MODULE_ALIAS("pmem");
> 
> Let's just go with the full rename s/prd/pmem/.  That turned out to be really
> clean & made everything consistent - thanks for the good suggestion.
> 

hooray, yes thanks, this makes me very happy.

> - Ross
> 

Thanks
Boaz


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 5/9] SQUASHME: prd: Last fixes for partitions
@ 2014-08-21 10:05       ` Boaz Harrosh
  0 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-21 10:05 UTC (permalink / raw)
  To: Ross Zwisler, Boaz Harrosh
  Cc: linux-fsdevel, Andrew Morton, linux-mm, Matthew Wilcox,
	Sagi Manole, Yigal Korman

On 08/21/2014 02:03 AM, Ross Zwisler wrote:
> On Wed, 2014-08-13 at 15:16 +0300, Boaz Harrosh wrote:
<>
> 
> I really like where you're going with getting rid of prd_probe.  Clearly I
> just copied this from brd, but I'd love to be rid of it entirely.  Is there a
> valid way for our probe function to get called?  If not, can we just have a
> little stub with a BUG() in it to make sure we hear about it if it does ever
> get called, and delete a bunch of code?
> 
> I think this would let us get rid of pmem_probe(), pmem_init_one(), and the
> pmem_devices_mutex.
> 

You lost me, pmem_init_one() is gone already, and yes the mutex can go away
as well right now after this patch. But please lets keep it I want to add
a sysfs interface to add more devices dynamically similar to osdblk.

The only thing I want to clean is the the pmem_free + pmem_del_one it can
be reduced to just one function.

> If there *is* a valid way for this code to get called, let's figure it out so
> we can at least test this function.  This will be especially necessary as we
> add support for more pmem disks.
> 

Let me investigate this one, I think we can get rid of it for sure, by
passing NULL to register. Surly there is no use case for it now.

>>  
>> @@ -424,5 +423,7 @@ static void __exit prd_exit(void)
>>  
>>  MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
>>  MODULE_LICENSE("GPL");
>> +MODULE_ALIAS("pmem");
> 
> Let's just go with the full rename s/prd/pmem/.  That turned out to be really
> clean & made everything consistent - thanks for the good suggestion.
> 

hooray, yes thanks, this makes me very happy.

> - Ross
> 

Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 6/9] SQUASHME: prd: Let each prd-device manage private memory region
  2014-08-13 12:18 ` [RFC 6/9] SQUASHME: prd: Let each prd-device manage private memory region Boaz Harrosh
@ 2014-08-21 16:57   ` Ross Zwisler
  0 siblings, 0 replies; 48+ messages in thread
From: Ross Zwisler @ 2014-08-21 16:57 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-fsdevel, Andrew Morton, linux-mm, Matthew Wilcox,
	Sagi Manole, Yigal Korman

On Wed, 2014-08-13 at 15:18 +0300, Boaz Harrosh wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
> 
> This patch removes any global memory information. And lets
> each prd-device manage it's own memory region.
> 
> prd_alloc() Now receives phys_addr and disk_size and will
> map that region, also prd_free will do the unmaping.
> 
> This is so we can support multiple discontinuous memory regions
> in the next patch
> 
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  drivers/block/prd.c | 125 ++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 78 insertions(+), 47 deletions(-)

Looks great.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 9/9] prd: Add support for page struct mapping
  2014-08-13 12:26 ` [RFC 9/9] prd: Add support for page struct mapping Boaz Harrosh
@ 2014-08-22 14:36     ` Dave Hansen
  2014-08-22 14:36     ` Dave Hansen
  1 sibling, 0 replies; 48+ messages in thread
From: Dave Hansen @ 2014-08-22 14:36 UTC (permalink / raw)
  To: Boaz Harrosh, Ross Zwisler, linux-fsdevel, Andrew Morton,
	linux-mm, Matthew Wilcox, Sagi Manole, Yigal Korman

On 08/13/2014 05:26 AM, Boaz Harrosh wrote:
> +#ifdef CONFIG_BLK_DEV_PMEM_USE_PAGES
> +static int prd_add_page_mapping(phys_addr_t phys_addr, size_t total_size,
> +				void **o_virt_addr)
> +{
> +	int nid = memory_add_physaddr_to_nid(phys_addr);
> +	unsigned long start_pfn = phys_addr >> PAGE_SHIFT;
> +	unsigned long nr_pages = total_size >> PAGE_SHIFT;
> +	unsigned int start_sec = pfn_to_section_nr(start_pfn);
> +	unsigned int end_sec = pfn_to_section_nr(start_pfn + nr_pages - 1);

Nit: any chance you'd change this to be an exclusive end?  In the mm
code, we usually do:

	unsigned int end_sec = pfn_to_section_nr(start_pfn + nr_pages);

so the for loops end up <end_sec instead of <=end_sec.

> +	unsigned long phys_start_pfn;
> +	struct page **page_array, **mapped_page_array;
> +	unsigned long i;
> +	struct vm_struct *vm_area;
> +	void *virt_addr;
> +	int ret = 0;

This is a philosophical thing, but I don't see *ANY* block-specific code
in here.  Seems like this belongs in mm/ to me.

Is there a reason you don't just do this at boot and have to use hotplug
at runtime for it?  What are the ratio of pmem to RAM?  Is it possible
to exhaust all of RAM with 'struct page's for pmem?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 9/9] prd: Add support for page struct mapping
@ 2014-08-22 14:36     ` Dave Hansen
  0 siblings, 0 replies; 48+ messages in thread
From: Dave Hansen @ 2014-08-22 14:36 UTC (permalink / raw)
  To: Boaz Harrosh, Ross Zwisler, linux-fsdevel, Andrew Morton,
	linux-mm, Matthew Wilcox, Sagi Manole, Yigal Korman

On 08/13/2014 05:26 AM, Boaz Harrosh wrote:
> +#ifdef CONFIG_BLK_DEV_PMEM_USE_PAGES
> +static int prd_add_page_mapping(phys_addr_t phys_addr, size_t total_size,
> +				void **o_virt_addr)
> +{
> +	int nid = memory_add_physaddr_to_nid(phys_addr);
> +	unsigned long start_pfn = phys_addr >> PAGE_SHIFT;
> +	unsigned long nr_pages = total_size >> PAGE_SHIFT;
> +	unsigned int start_sec = pfn_to_section_nr(start_pfn);
> +	unsigned int end_sec = pfn_to_section_nr(start_pfn + nr_pages - 1);

Nit: any chance you'd change this to be an exclusive end?  In the mm
code, we usually do:

	unsigned int end_sec = pfn_to_section_nr(start_pfn + nr_pages);

so the for loops end up <end_sec instead of <=end_sec.

> +	unsigned long phys_start_pfn;
> +	struct page **page_array, **mapped_page_array;
> +	unsigned long i;
> +	struct vm_struct *vm_area;
> +	void *virt_addr;
> +	int ret = 0;

This is a philosophical thing, but I don't see *ANY* block-specific code
in here.  Seems like this belongs in mm/ to me.

Is there a reason you don't just do this at boot and have to use hotplug
at runtime for it?  What are the ratio of pmem to RAM?  Is it possible
to exhaust all of RAM with 'struct page's for pmem?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/9 v2] SQUASHME: prd: Last fixes for partitions
  2014-08-14 13:07   ` [PATCH 5/9 v2] " Boaz Harrosh
@ 2014-08-25 20:10     ` Ross Zwisler
  2014-08-26  8:18       ` Boaz Harrosh
  0 siblings, 1 reply; 48+ messages in thread
From: Ross Zwisler @ 2014-08-25 20:10 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-fsdevel, Andrew Morton, linux-mm, Matthew Wilcox,
	Sagi Manole, Yigal Korman

On Thu, 2014-08-14 at 16:07 +0300, Boaz Harrosh wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
> 
> This streamlines prd with the latest brd code.
> 
> In prd we do not allocate new devices dynamically on devnod
> access, because we need parameterization of each device. So
> the dynamic allocation in prd_init_one is removed.
> 
> Therefor prd_init_one only called from prd_prob is moved
> there, now that it is small.
> 
> And other small fixes regarding partitions
> 
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  drivers/block/prd.c | 47 ++++++++++++++++++++++++-----------------------
>  1 file changed, 24 insertions(+), 23 deletions(-)

<snip>

> @@ -333,16 +321,27 @@ static void prd_del_one(struct prd_device *prd)
>  	prd_free(prd);
>  }
>  
> +/*FIXME: Actually in our driver prd_probe is never used. Can be removed */
>  static struct kobject *prd_probe(dev_t dev, int *part, void *data)
>  {
>  	struct prd_device *prd;
>  	struct kobject *kobj;
> +	int number = MINOR(dev);

Unfortunately I think this is broken, and it was broken in the previous
version of prd_probe() as well.

When we were allocating minors from our own device we could rely on the fact
that there was a relationship between the minor number of the device and the
prd_number.  Now that we're using the dynamic minors scheme, our minor is
dependent on other drivers that are also using that same scheme.  For example,
when both PRD and BRD are using dynamic minors:

# ls -la /dev/ram* /dev/pmem*
brw-rw---- 1 root disk 259, 4 Aug 25 12:38 /dev/pmem0
brw-rw---- 1 root disk 259, 5 Aug 25 12:38 /dev/pmem1
brw-rw---- 1 root disk 259, 6 Aug 25 12:38 /dev/pmem2
brw-rw---- 1 root disk 259, 7 Aug 25 12:38 /dev/pmem3
brw-rw---- 1 root disk 259, 0 Aug 25 12:22 /dev/ram0
brw-rw---- 1 root disk 259, 1 Aug 25 12:22 /dev/ram1
brw-rw---- 1 root disk 259, 2 Aug 25 12:22 /dev/ram2
brw-rw---- 1 root disk 259, 3 Aug 25 12:22 /dev/ram3

pmem0 has prd_number 0, but has minor 4.

I think we can still have a working probe by instead comparing the passed in
dev_t against the dev_t we get back from disk_to_dev(prd->prd_disk), but I'd
really like a use case so I can test this.  Until then I think I'm just going
to stub out prd/pmem_probe() with a BUG() so we can see if anyone hits it.

It seems like this is preferable to just registering NULL for probe, as that
would cause an oops in kobj_lookup(() when probe() is blindly called without
checking for NULL.

- Ross

>  
>  	mutex_lock(&prd_devices_mutex);
> -	prd = prd_init_one(MINOR(dev));
> -	kobj = prd ? get_disk(prd->prd_disk) : NULL;
> -	mutex_unlock(&prd_devices_mutex);
>  
> +	list_for_each_entry(prd, &prd_devices, prd_list) {
> +		if (prd->prd_number == number) {
> +			kobj = get_disk(prd->prd_disk);
> +			goto out;
> +		}
> +	}
> +
> +	pr_err("prd: prd_probe: Unexpected parameter=%d\n", number);
> +	kobj = NULL;
> +
> +out:
> +	mutex_unlock(&prd_devices_mutex);
>  	return kobj;
>  }
>  
> @@ -424,5 +423,7 @@ static void __exit prd_exit(void)
>  
>  MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
>  MODULE_LICENSE("GPL");
> +MODULE_ALIAS("pmem");
> +
>  module_init(prd_init);
>  module_exit(prd_exit);


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 7/9] SQUASHME: prd: Support of multiple memory regions
  2014-08-13 12:20 ` [RFC 7/9] SQUASHME: prd: Support of multiple memory regions Boaz Harrosh
@ 2014-08-25 23:02     ` Ross Zwisler
  0 siblings, 0 replies; 48+ messages in thread
From: Ross Zwisler @ 2014-08-25 23:02 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-fsdevel, Andrew Morton, linux-mm, Matthew Wilcox,
	Sagi Manole, Yigal Korman

On Wed, 2014-08-13 at 15:20 +0300, Boaz Harrosh wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
> 
> After the last patch this is easy.
> 
> The API to prd module is changed. We now have a single string
> parameter named "map" of the form:
> 		 map=mapS[,mapS...]
> 
> 		 where mapS=nn[KMG]$ss[KMG],
> 		 or    mapS=nn[KMG]@ss[KMG],
> 
> 		 nn=size, ss=offset
> 
> Just like the Kernel command line map && memmap parameters,
> so anything you did at grub just copy/paste to here.
> 
> The "@" form is exactly the same as the "$" form only that
> at bash prompt we need to escape the "$" with \$ so also
> support the '@' char for convenience.
> 
> For each specified mapS there will be a device created.
> 
> So needless to say that all the previous prd_XXX params are
> removed as well as the Kconfig defaults.
> 
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  drivers/block/Kconfig | 28 -----------------------
>  drivers/block/prd.c   | 62 ++++++++++++++++++++++++++++++++++-----------------
>  2 files changed, 41 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index 463c45e..8f0c225 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -416,34 +416,6 @@ config BLK_DEV_PMEM
>  	  Most normal users won't need this functionality, and can thus say N
>  	  here.
>  
> -config BLK_DEV_PMEM_START
> -	int "Offset in GiB of where to start claiming space"
> -	default "0"
> -	depends on BLK_DEV_PMEM
> -	help
> -	  Starting offset in GiB that PRD should use when claiming memory.  This
> -	  memory needs to be reserved from the OS at boot time using the
> -	  "memmap" kernel parameter.
> -
> -	  If you provide PRD with volatile memory it will act as a volatile
> -	  RAM disk and your data will not be persistent.
> -
> -config BLK_DEV_PMEM_COUNT
> -	int "Default number of PMEM disks"
> -	default "4"
> -	depends on BLK_DEV_PMEM
> -	help
> -	  Number of equal sized block devices that PRD should create.
> -
> -config BLK_DEV_PMEM_SIZE
> -	int "Size in GiB of space to claim"
> -	depends on BLK_DEV_PMEM
> -	default "0"
> -	help
> -	  Amount of memory in GiB that PRD should use when creating block
> -	  devices.  This memory needs to be reserved from the OS at
> -	  boot time using the "memmap" kernel parameter.
> -
>  config CDROM_PKTCDVD
>  	tristate "Packet writing on CD/DVD media"
>  	depends on !UML
> diff --git a/drivers/block/prd.c b/drivers/block/prd.c
> index 6d96e6c..36b8fe4 100644
> --- a/drivers/block/prd.c
> +++ b/drivers/block/prd.c
> @@ -228,21 +228,15 @@ static const struct block_device_operations prd_fops = {
>  };
>  
>  /* Kernel module stuff */
> -static int prd_start_gb = CONFIG_BLK_DEV_PMEM_START;
> -module_param(prd_start_gb, int, S_IRUGO);
> -MODULE_PARM_DESC(prd_start_gb, "Offset in GB of where to start claiming space");
> -
> -static int prd_size_gb = CONFIG_BLK_DEV_PMEM_SIZE;
> -module_param(prd_size_gb,  int, S_IRUGO);
> -MODULE_PARM_DESC(prd_size_gb,  "Total size in GB of space to claim for all disks");
> -
>  static int prd_major;
>  module_param(prd_major, int, 0);
>  MODULE_PARM_DESC(prd_major,  "Major number to request for this driver");
>  
> -static int prd_count = CONFIG_BLK_DEV_PMEM_COUNT;
> -module_param(prd_count, int, S_IRUGO);
> -MODULE_PARM_DESC(prd_count, "Number of prd devices to evenly split allocated space");
> +static char *map;
> +module_param(map, charp, S_IRUGO);
> +MODULE_PARM_DESC(map,
> +	"pmem device mapping: map=mapS[,mapS...] where:\n"
> +	"mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset.");
>  
>  static LIST_HEAD(prd_devices);
>  static DEFINE_MUTEX(prd_devices_mutex);
> @@ -292,6 +286,13 @@ static struct prd_device *prd_alloc(phys_addr_t phys_addr, size_t disk_size,
>  	struct gendisk *disk;
>  	int err;
>  
> +	if (unlikely((phys_addr & ~PAGE_MASK) || (disk_size & ~PAGE_MASK))) {
> +		pr_err("phys_addr=0x%llx disk_size=0x%zx must be 4k aligned\n",

Need a "pmem:" prefix on this error string.

> +		       phys_addr, disk_size);
> +		err = -EINVAL;
> +		goto out;
> +	}
> +
>  	prd = kzalloc(sizeof(*prd), GFP_KERNEL);
>  	if (unlikely(!prd)) {
>  		err = -ENOMEM;
> @@ -388,22 +389,30 @@ out:
>  	return kobj;
>  }
>  
> +static int prd_parse_map_one(char *map, phys_addr_t *start, size_t *size)
> +{
> +	char *p = map;
> +
> +	*size = (phys_addr_t)memparse(p, &p);
> +	if ((p == map) || ((*p != '$') && (*p != '@')))
> +		return -EINVAL;
> +
> +	*start = (size_t)memparse(p + 1, &p);
> +
> +	return *p == '\0' ? 0 : -EINVAL;

Probably need to check for a zero-length parse on the start as well, similar
to what you did with the (p == map) check for size.  Without this a parameter
of "map=8G@" will try and map starting with address 0.  This'll fail, but
probably better to catch it during the string parsing.

> +}
> +
>  static int __init prd_init(void)
>  {
>  	int result, i;
>  	struct prd_device *prd, *next;
> -	phys_addr_t phys_addr;
> -	size_t total_size, disk_size;
> +	char *p, *prd_map = map;
>  
> -	if (unlikely(!prd_start_gb || !prd_size_gb || !prd_count)) {
> -		pr_err("prd: prd_start_gb || prd_size_gb || prd_count are 0!!\n");
> +	if (!prd_map) {
> +		pr_err("prd: must specify map parameter.\n");
>  		return -EINVAL;
>  	}
>  
> -	phys_addr = (phys_addr_t) prd_start_gb * 1024 * 1024 * 1024;
> -	total_size = (size_t)	   prd_size_gb  * 1024 * 1024 * 1024;
> -	disk_size = total_size / prd_count;
> -
>  	result = register_blkdev(prd_major, "prd");
>  	if (result < 0) {
>  		result = -EIO;
> @@ -411,13 +420,24 @@ static int __init prd_init(void)
>  	} else if (result > 0)
>  		prd_major = result;
>  
> -	for (i = 0; i < prd_count; i++) {
> -		prd = prd_alloc(phys_addr + i * disk_size, disk_size, i);
> +	i = 0;
> +	while ((p = strsep(&prd_map, ",")) != NULL) {
> +		phys_addr_t phys_addr;
> +		size_t disk_size;
> +
> +		if (!*p)
> +			continue;
> +		result = prd_parse_map_one(p, &phys_addr, &disk_size);
> +		if (result)
> +			goto out_free;
> +
> +		prd = prd_alloc(phys_addr, disk_size, i);
>  		if (IS_ERR(prd)) {
>  			result = PTR_ERR(prd);
>  			goto out_free;
>  		}
>  		list_add_tail(&prd->prd_list, &prd_devices);
> +		++i;
>  	}
>  
>  	list_for_each_entry(prd, &prd_devices, prd_list)

Overall I really like this patch.  It makes dealing with multiple regions
very easy!

Thanks,
- Ross



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 7/9] SQUASHME: prd: Support of multiple memory regions
@ 2014-08-25 23:02     ` Ross Zwisler
  0 siblings, 0 replies; 48+ messages in thread
From: Ross Zwisler @ 2014-08-25 23:02 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-fsdevel, Andrew Morton, linux-mm, Matthew Wilcox,
	Sagi Manole, Yigal Korman

On Wed, 2014-08-13 at 15:20 +0300, Boaz Harrosh wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
> 
> After the last patch this is easy.
> 
> The API to prd module is changed. We now have a single string
> parameter named "map" of the form:
> 		 map=mapS[,mapS...]
> 
> 		 where mapS=nn[KMG]$ss[KMG],
> 		 or    mapS=nn[KMG]@ss[KMG],
> 
> 		 nn=size, ss=offset
> 
> Just like the Kernel command line map && memmap parameters,
> so anything you did at grub just copy/paste to here.
> 
> The "@" form is exactly the same as the "$" form only that
> at bash prompt we need to escape the "$" with \$ so also
> support the '@' char for convenience.
> 
> For each specified mapS there will be a device created.
> 
> So needless to say that all the previous prd_XXX params are
> removed as well as the Kconfig defaults.
> 
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  drivers/block/Kconfig | 28 -----------------------
>  drivers/block/prd.c   | 62 ++++++++++++++++++++++++++++++++++-----------------
>  2 files changed, 41 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index 463c45e..8f0c225 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -416,34 +416,6 @@ config BLK_DEV_PMEM
>  	  Most normal users won't need this functionality, and can thus say N
>  	  here.
>  
> -config BLK_DEV_PMEM_START
> -	int "Offset in GiB of where to start claiming space"
> -	default "0"
> -	depends on BLK_DEV_PMEM
> -	help
> -	  Starting offset in GiB that PRD should use when claiming memory.  This
> -	  memory needs to be reserved from the OS at boot time using the
> -	  "memmap" kernel parameter.
> -
> -	  If you provide PRD with volatile memory it will act as a volatile
> -	  RAM disk and your data will not be persistent.
> -
> -config BLK_DEV_PMEM_COUNT
> -	int "Default number of PMEM disks"
> -	default "4"
> -	depends on BLK_DEV_PMEM
> -	help
> -	  Number of equal sized block devices that PRD should create.
> -
> -config BLK_DEV_PMEM_SIZE
> -	int "Size in GiB of space to claim"
> -	depends on BLK_DEV_PMEM
> -	default "0"
> -	help
> -	  Amount of memory in GiB that PRD should use when creating block
> -	  devices.  This memory needs to be reserved from the OS at
> -	  boot time using the "memmap" kernel parameter.
> -
>  config CDROM_PKTCDVD
>  	tristate "Packet writing on CD/DVD media"
>  	depends on !UML
> diff --git a/drivers/block/prd.c b/drivers/block/prd.c
> index 6d96e6c..36b8fe4 100644
> --- a/drivers/block/prd.c
> +++ b/drivers/block/prd.c
> @@ -228,21 +228,15 @@ static const struct block_device_operations prd_fops = {
>  };
>  
>  /* Kernel module stuff */
> -static int prd_start_gb = CONFIG_BLK_DEV_PMEM_START;
> -module_param(prd_start_gb, int, S_IRUGO);
> -MODULE_PARM_DESC(prd_start_gb, "Offset in GB of where to start claiming space");
> -
> -static int prd_size_gb = CONFIG_BLK_DEV_PMEM_SIZE;
> -module_param(prd_size_gb,  int, S_IRUGO);
> -MODULE_PARM_DESC(prd_size_gb,  "Total size in GB of space to claim for all disks");
> -
>  static int prd_major;
>  module_param(prd_major, int, 0);
>  MODULE_PARM_DESC(prd_major,  "Major number to request for this driver");
>  
> -static int prd_count = CONFIG_BLK_DEV_PMEM_COUNT;
> -module_param(prd_count, int, S_IRUGO);
> -MODULE_PARM_DESC(prd_count, "Number of prd devices to evenly split allocated space");
> +static char *map;
> +module_param(map, charp, S_IRUGO);
> +MODULE_PARM_DESC(map,
> +	"pmem device mapping: map=mapS[,mapS...] where:\n"
> +	"mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset.");
>  
>  static LIST_HEAD(prd_devices);
>  static DEFINE_MUTEX(prd_devices_mutex);
> @@ -292,6 +286,13 @@ static struct prd_device *prd_alloc(phys_addr_t phys_addr, size_t disk_size,
>  	struct gendisk *disk;
>  	int err;
>  
> +	if (unlikely((phys_addr & ~PAGE_MASK) || (disk_size & ~PAGE_MASK))) {
> +		pr_err("phys_addr=0x%llx disk_size=0x%zx must be 4k aligned\n",

Need a "pmem:" prefix on this error string.

> +		       phys_addr, disk_size);
> +		err = -EINVAL;
> +		goto out;
> +	}
> +
>  	prd = kzalloc(sizeof(*prd), GFP_KERNEL);
>  	if (unlikely(!prd)) {
>  		err = -ENOMEM;
> @@ -388,22 +389,30 @@ out:
>  	return kobj;
>  }
>  
> +static int prd_parse_map_one(char *map, phys_addr_t *start, size_t *size)
> +{
> +	char *p = map;
> +
> +	*size = (phys_addr_t)memparse(p, &p);
> +	if ((p == map) || ((*p != '$') && (*p != '@')))
> +		return -EINVAL;
> +
> +	*start = (size_t)memparse(p + 1, &p);
> +
> +	return *p == '\0' ? 0 : -EINVAL;

Probably need to check for a zero-length parse on the start as well, similar
to what you did with the (p == map) check for size.  Without this a parameter
of "map=8G@" will try and map starting with address 0.  This'll fail, but
probably better to catch it during the string parsing.

> +}
> +
>  static int __init prd_init(void)
>  {
>  	int result, i;
>  	struct prd_device *prd, *next;
> -	phys_addr_t phys_addr;
> -	size_t total_size, disk_size;
> +	char *p, *prd_map = map;
>  
> -	if (unlikely(!prd_start_gb || !prd_size_gb || !prd_count)) {
> -		pr_err("prd: prd_start_gb || prd_size_gb || prd_count are 0!!\n");
> +	if (!prd_map) {
> +		pr_err("prd: must specify map parameter.\n");
>  		return -EINVAL;
>  	}
>  
> -	phys_addr = (phys_addr_t) prd_start_gb * 1024 * 1024 * 1024;
> -	total_size = (size_t)	   prd_size_gb  * 1024 * 1024 * 1024;
> -	disk_size = total_size / prd_count;
> -
>  	result = register_blkdev(prd_major, "prd");
>  	if (result < 0) {
>  		result = -EIO;
> @@ -411,13 +420,24 @@ static int __init prd_init(void)
>  	} else if (result > 0)
>  		prd_major = result;
>  
> -	for (i = 0; i < prd_count; i++) {
> -		prd = prd_alloc(phys_addr + i * disk_size, disk_size, i);
> +	i = 0;
> +	while ((p = strsep(&prd_map, ",")) != NULL) {
> +		phys_addr_t phys_addr;
> +		size_t disk_size;
> +
> +		if (!*p)
> +			continue;
> +		result = prd_parse_map_one(p, &phys_addr, &disk_size);
> +		if (result)
> +			goto out_free;
> +
> +		prd = prd_alloc(phys_addr, disk_size, i);
>  		if (IS_ERR(prd)) {
>  			result = PTR_ERR(prd);
>  			goto out_free;
>  		}
>  		list_add_tail(&prd->prd_list, &prd_devices);
> +		++i;
>  	}
>  
>  	list_for_each_entry(prd, &prd_devices, prd_list)

Overall I really like this patch.  It makes dealing with multiple regions
very easy!

Thanks,
- Ross


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/9 v2] SQUASHME: prd: Last fixes for partitions
  2014-08-25 20:10     ` Ross Zwisler
@ 2014-08-26  8:18       ` Boaz Harrosh
  2014-08-26 17:36           ` Boaz Harrosh
  0 siblings, 1 reply; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-26  8:18 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-fsdevel, Andrew Morton, linux-mm, Matthew Wilcox,
	Sagi Manole, Yigal Korman

On 08/25/2014 11:10 PM, Ross Zwisler wrote:
<>
> I think we can still have a working probe by instead comparing the passed in
> dev_t against the dev_t we get back from disk_to_dev(prd->prd_disk), but I'd
> really like a use case so I can test this.  Until then I think I'm just going
> to stub out prd/pmem_probe() with a BUG() so we can see if anyone hits it.
> 
> It seems like this is preferable to just registering NULL for probe, as that
> would cause an oops in kobj_lookup(() when probe() is blindly called without
> checking for NULL.
> 

I have a version I think you will love it removes the probe and bunch of
other stuff.

I tested it heavily it just works

Comming soon, I'm preparing trees right now
Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/9 v2] SQUASHME: prd: Last fixes for partitions
  2014-08-26  8:18       ` Boaz Harrosh
@ 2014-08-26 17:36           ` Boaz Harrosh
  0 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-26 17:36 UTC (permalink / raw)
  To: Boaz Harrosh, Ross Zwisler
  Cc: linux-fsdevel, Andrew Morton, linux-mm, Matthew Wilcox,
	Sagi Manole, Yigal Korman

[-- Attachment #1: Type: text/plain, Size: 1640 bytes --]

On 08/26/2014 11:18 AM, Boaz Harrosh wrote:
> On 08/25/2014 11:10 PM, Ross Zwisler wrote:
> <>
>> I think we can still have a working probe by instead comparing the passed in
>> dev_t against the dev_t we get back from disk_to_dev(prd->prd_disk), but I'd
>> really like a use case so I can test this.  Until then I think I'm just going
>> to stub out prd/pmem_probe() with a BUG() so we can see if anyone hits it.
>>
>> It seems like this is preferable to just registering NULL for probe, as that
>> would cause an oops in kobj_lookup(() when probe() is blindly called without
>> checking for NULL.
>>
> 
> I have a version I think you will love it removes the probe and bunch of
> other stuff.
> 
> I tested it heavily it just works
> 
> Comming soon, I'm preparing trees right now
> Thanks
> Boaz

Ross hi

It is getting late around here and I will not be pushing new trees
tonight, first thing tomorrow morning. (Your last push caused me lots
of extra work, you must not do like you did when working on a rebasing
out-off-tree project involving more then one person, but more on the proper
procedure tomorrow with the push of the trees)

So I hope you are not doing any big changes, do not apply any of my patches
just yet, wait for the new trees tomorrow please I have everything all ready
on top of the rename to pmem

Meanwhile without any explanations, these will come tomorrow, I'm attaching
the most interesting bit which you have not seen before.

If you want you can inspect a preview of what's to come here:
	http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=summary

I have created a new pmem.git tree just for us.

Thanks
Boaz


[-- Attachment #2: 0012-SQUASHME-pmem-KISS-remove-the-all-pmem_major-registr.patch --]
[-- Type: text/x-patch, Size: 4482 bytes --]

>From 748370b1220693b755b6b92828e651cc0bac7846 Mon Sep 17 00:00:00 2001
From: Boaz Harrosh <boaz@plexistor.com>
Date: Thu, 21 Aug 2014 19:03:48 +0300
Subject: [PATCH 12/12] SQUASHME: pmem: KISS, remove the all pmem_major
 registration

Though this looks revolutionary, it actually does nothing
but removes dead code.

Since the move of pmem to the dynamic BLOCK_EXT_MAJOR
range, by Ross, all this stuff is not used by the Kernel.

Note that with current code pmem_major defaults to ZERO
but even if I run with
	modprobe pmem map=1G@1G pmem_major=17

I still ever get:
brw-rw----. 1 root disk 259, 0 Aug 21 19:02 /dev/pmem0
brw-rw----. 1 root disk 259, 1 Aug 21 19:02 /dev/pmem0p1
brw-rw----. 1 root disk 259, 2 Aug 21 19:02 /dev/pmem0p2

>From the BLOCK_EXT_MAJOR (259) range. Even Though the
registration of pmem_major=17 succeeded just fine, and
	disk->major = 17;
was set in pmem_alloc()

I have tested all the tests that I know how to perform
on the devices, fdisk, partitions, multiple pmemX devices,
mknode, lsblk, blkid, and it all behaves exactly as
before this patch.

I have also done:
	find /sys/ -name "*pmem*"
	find /proc/ -name "*pmem*"
	find /dev/ -name "*pmem*"
And get the same output as before this patch.

If anyone knows of what would work differently after this
patch please do tell.

But it looks like the calls to register_blkdev +
blk_register_region are just dead code for us right now

Thanks

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/pmem.c | 45 +--------------------------------------------
 1 file changed, 1 insertion(+), 44 deletions(-)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index a272c43..014bba7 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -228,9 +228,6 @@ static const struct block_device_operations pmem_fops = {
 };
 
 /* Kernel module stuff */
-static int pmem_major;
-module_param(pmem_major, int, 0);
-MODULE_PARM_DESC(pmem_major,  "Major number to request for this driver");
 
 static char *map;
 module_param(map, charp, S_IRUGO);
@@ -239,7 +236,6 @@ MODULE_PARM_DESC(map,
 	"mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset.");
 
 static LIST_HEAD(pmem_devices);
-static DEFINE_MUTEX(pmem_devices_mutex);
 
 #ifdef CONFIG_BLK_DEV_PMEM_USE_PAGES
 static int pmem_add_page_mapping(phys_addr_t phys_addr, size_t total_size,
@@ -455,7 +451,7 @@ static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size,
 		err = -ENOMEM;
 		goto out_free_queue;
 	}
-	disk->major		= pmem_major;
+	disk->major		= BLOCK_EXT_MAJOR;
 	disk->first_minor	= 0;
 	disk->fops		= &pmem_fops;
 	disk->private_data	= pmem;
@@ -492,30 +488,6 @@ static void pmem_del_one(struct pmem_device *pmem)
 	pmem_free(pmem);
 }
 
-/*FIXME: Actually in our driver pmem_probe is never used. Can be removed */
-static struct kobject *pmem_probe(dev_t dev, int *part, void *data)
-{
-	struct pmem_device *pmem;
-	struct kobject *kobj;
-	int number = MINOR(dev);
-
-	mutex_lock(&pmem_devices_mutex);
-
-	list_for_each_entry(pmem, &pmem_devices, pmem_list) {
-		if (pmem->pmem_number == number) {
-			kobj = get_disk(pmem->pmem_disk);
-			goto out;
-		}
-	}
-
-	pr_err("pmem: pmem_probe: Unexpected parameter=%d\n", number);
-	kobj = NULL;
-
-out:
-	mutex_unlock(&pmem_devices_mutex);
-	return kobj;
-}
-
 static int pmem_parse_map_one(char *map, phys_addr_t *start, size_t *size)
 {
 	char *p = map;
@@ -540,13 +512,6 @@ static int __init pmem_init(void)
 		return -EINVAL;
 	}
 
-	result = register_blkdev(pmem_major, "pmem");
-	if (result < 0) {
-		result = -EIO;
-		goto out;
-	} else if (result > 0)
-		pmem_major = result;
-
 	i = 0;
 	while ((p = strsep(&pmem_map, ",")) != NULL) {
 		phys_addr_t phys_addr;
@@ -570,9 +535,6 @@ static int __init pmem_init(void)
 	list_for_each_entry(pmem, &pmem_devices, pmem_list)
 		add_disk(pmem->pmem_disk);
 
-	blk_register_region(MKDEV(pmem_major, 0), 0,
-				  THIS_MODULE, pmem_probe, NULL, NULL);
-
 	pr_info("pmem: module loaded\n");
 	return 0;
 
@@ -581,8 +543,6 @@ out_free:
 		list_del(&pmem->pmem_list);
 		pmem_free(pmem);
 	}
-	unregister_blkdev(pmem_major, "pmem");
-out:
 	return result;
 }
 
@@ -590,12 +550,9 @@ static void __exit pmem_exit(void)
 {
 	struct pmem_device *pmem, *next;
 
-	blk_unregister_region(MKDEV(pmem_major, 0), 0);
-
 	list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list)
 		pmem_del_one(pmem);
 
-	unregister_blkdev(pmem_major, "pmem");
 	pr_info("pmem: module unloaded\n");
 }
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/9 v2] SQUASHME: prd: Last fixes for partitions
@ 2014-08-26 17:36           ` Boaz Harrosh
  0 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-26 17:36 UTC (permalink / raw)
  To: Boaz Harrosh, Ross Zwisler
  Cc: linux-fsdevel, Andrew Morton, linux-mm, Matthew Wilcox,
	Sagi Manole, Yigal Korman

[-- Attachment #1: Type: text/plain, Size: 1640 bytes --]

On 08/26/2014 11:18 AM, Boaz Harrosh wrote:
> On 08/25/2014 11:10 PM, Ross Zwisler wrote:
> <>
>> I think we can still have a working probe by instead comparing the passed in
>> dev_t against the dev_t we get back from disk_to_dev(prd->prd_disk), but I'd
>> really like a use case so I can test this.  Until then I think I'm just going
>> to stub out prd/pmem_probe() with a BUG() so we can see if anyone hits it.
>>
>> It seems like this is preferable to just registering NULL for probe, as that
>> would cause an oops in kobj_lookup(() when probe() is blindly called without
>> checking for NULL.
>>
> 
> I have a version I think you will love it removes the probe and bunch of
> other stuff.
> 
> I tested it heavily it just works
> 
> Comming soon, I'm preparing trees right now
> Thanks
> Boaz

Ross hi

It is getting late around here and I will not be pushing new trees
tonight, first thing tomorrow morning. (Your last push caused me lots
of extra work, you must not do like you did when working on a rebasing
out-off-tree project involving more then one person, but more on the proper
procedure tomorrow with the push of the trees)

So I hope you are not doing any big changes, do not apply any of my patches
just yet, wait for the new trees tomorrow please I have everything all ready
on top of the rename to pmem

Meanwhile without any explanations, these will come tomorrow, I'm attaching
the most interesting bit which you have not seen before.

If you want you can inspect a preview of what's to come here:
	http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=summary

I have created a new pmem.git tree just for us.

Thanks
Boaz


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0012-SQUASHME-pmem-KISS-remove-the-all-pmem_major-registr.patch --]
[-- Type: text/x-patch; name="0012-SQUASHME-pmem-KISS-remove-the-all-pmem_major-registr.patch", Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/9 v2] SQUASHME: prd: Last fixes for partitions
  2014-08-26 17:36           ` Boaz Harrosh
@ 2014-08-26 20:34             ` Ross Zwisler
  -1 siblings, 0 replies; 48+ messages in thread
From: Ross Zwisler @ 2014-08-26 20:34 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boaz Harrosh, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

On Tue, 2014-08-26 at 20:36 +0300, Boaz Harrosh wrote:
> Meanwhile without any explanations, these will come tomorrow, I'm attaching
> the most interesting bit which you have not seen before.
> 
> If you want you can inspect a preview of what's to come here:
> 	http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=summary

Regarding the top patch "pmem: KISS, remove the all pmem_major registration",
I like that we're getting rid of lots of dead code.  The only issue I have is
that I'm pretty sure we aren't supposed to register our disks directly with a
major of BLOCK_EXT_MAJOR.  I think we still need to register our own major via
register_blkdev(), and use that.  I'm fine with getting rid of the module
parameter though, and always getting a major dynamically.

If you look at the other block devices that use the GENHD_FL_EXT_DEVT flag
(nvme, loop, md, etc.) they all register their own major.  You can't see this
major by doing 'ls -l' on the resulting devices in /dev, but you can see it by
looking at /proc/devices:

# ls -l /dev/pmem0
brw-rw---- 1 root disk 259, 0 Aug 26 12:37 /dev/pmem0

# grep pmem /proc/devices 
250 pmem

- Ross



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/9 v2] SQUASHME: prd: Last fixes for partitions
@ 2014-08-26 20:34             ` Ross Zwisler
  0 siblings, 0 replies; 48+ messages in thread
From: Ross Zwisler @ 2014-08-26 20:34 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boaz Harrosh, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

On Tue, 2014-08-26 at 20:36 +0300, Boaz Harrosh wrote:
> Meanwhile without any explanations, these will come tomorrow, I'm attaching
> the most interesting bit which you have not seen before.
> 
> If you want you can inspect a preview of what's to come here:
> 	http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=summary

Regarding the top patch "pmem: KISS, remove the all pmem_major registration",
I like that we're getting rid of lots of dead code.  The only issue I have is
that I'm pretty sure we aren't supposed to register our disks directly with a
major of BLOCK_EXT_MAJOR.  I think we still need to register our own major via
register_blkdev(), and use that.  I'm fine with getting rid of the module
parameter though, and always getting a major dynamically.

If you look at the other block devices that use the GENHD_FL_EXT_DEVT flag
(nvme, loop, md, etc.) they all register their own major.  You can't see this
major by doing 'ls -l' on the resulting devices in /dev, but you can see it by
looking at /proc/devices:

# ls -l /dev/pmem0
brw-rw---- 1 root disk 259, 0 Aug 26 12:37 /dev/pmem0

# grep pmem /proc/devices 
250 pmem

- Ross


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/9 v2] SQUASHME: prd: Last fixes for partitions
  2014-08-26 17:36           ` Boaz Harrosh
  (?)
  (?)
@ 2014-08-27  4:38           ` Matthew Wilcox
  2014-08-27  9:55             ` Boaz Harrosh
  -1 siblings, 1 reply; 48+ messages in thread
From: Matthew Wilcox @ 2014-08-27  4:38 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boaz Harrosh, Ross Zwisler, linux-fsdevel, Andrew Morton,
	linux-mm, Sagi Manole, Yigal Korman

On Tue, Aug 26, 2014 at 08:36:19PM +0300, Boaz Harrosh wrote:
> It is getting late around here and I will not be pushing new trees
> tonight, first thing tomorrow morning. (Your last push caused me lots
> of extra work, you must not do like you did when working on a rebasing
> out-off-tree project involving more then one person, but more on the proper
> procedure tomorrow with the push of the trees)

If you're going to push a new tree, please do so on top of the DAX
v10 patches I just sent out.  If you have any changes you want to make
to the bdev_direct_access() patch, then please make them separately,
instead of editing my patch like you did last time.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/9 v2] SQUASHME: prd: Last fixes for partitions
  2014-08-26 20:34             ` Ross Zwisler
  (?)
@ 2014-08-27  9:41             ` Boaz Harrosh
  -1 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-27  9:41 UTC (permalink / raw)
  To: Ross Zwisler, Boaz Harrosh; +Cc: linux-fsdevel, Matthew Wilcox, Sagi Manole

On 08/26/2014 11:34 PM, Ross Zwisler wrote:
> On Tue, 2014-08-26 at 20:36 +0300, Boaz Harrosh wrote:
>> Meanwhile without any explanations, these will come tomorrow, I'm attaching
>> the most interesting bit which you have not seen before.
>>
>> If you want you can inspect a preview of what's to come here:
>> 	http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=summary
> 
> Regarding the top patch "pmem: KISS, remove the all pmem_major registration",
> I like that we're getting rid of lots of dead code.  The only issue I have is
> that I'm pretty sure we aren't supposed to register our disks directly with a
> major of BLOCK_EXT_MAJOR.  

Why not? if you inspect the code you will see that the system will do
that assignment regardless of what we put there. Actually I agree we
should just leave it ZERO and not touch it at all. More clear this way.
(And better for me it is a complete red "remove lines" patch I love
 those patches)

> I think we still need to register our own major via
> register_blkdev(), and use that.  I'm fine with getting rid of the module
> parameter though, and always getting a major dynamically.
> 
> If you look at the other block devices that use the GENHD_FL_EXT_DEVT flag
> (nvme, loop, md, etc.) they all register their own major.  You can't see this
> major by doing 'ls -l' on the resulting devices in /dev, but you can see it by
> looking at /proc/devices:
> 
> # ls -l /dev/pmem0
> brw-rw---- 1 root disk 259, 0 Aug 26 12:37 /dev/pmem0
> 
> # grep pmem /proc/devices 
> 250 pmem
> 

But why? all these you mentioned nvme, loop, md, they are all the case
of copy/paste and dead code. Which is never use and is actually misleading.

Like the /proc/devices thing, so "250 pmem" what does it means? It use
to mean that any 250 major will be my device and any of my device will
be 250, but here it is not. In fact you will not find any 250 related to
devices in the system anywhere. It is just dead code.

A user that will ever look at /proc/devices will just get confused
assuming there is some error in the system because he loaded pmem
and no 250 device is found.

As I said in my patch. This looks revolutionary at first because it
has never been done before, but this is the proper code, in current
Kernel, and the modern udev rules all this legacy device major is
not needed, and actually not used at all when you are a dynamic device,
Just bunch of dead code that does nothing.

[In old times /proc/devices was the only communication to user mode
 plumbing to create (mknod) node device-files at /dev/. X device
 files where created then accessed and if the Kernel returned -ENODEV
 it was deleted, I still remember those scripts that one had at your
 driver development project that would create the device-nodes at  /dev/
 for you. Actually it was before my time I never used a system that
 did not have udev. But I inherited a project that had that script and
 I remember one of the first things I did was to "git rm" that script.
 Funny I guess it is in my Karma to get rid of static device registration
]

> - Ross
> 
> 

I'll let you get used to it for a while, but think about it. It is
dead code

Thanks
Boaz


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/9 v2] SQUASHME: prd: Last fixes for partitions
  2014-08-27  4:38           ` Matthew Wilcox
@ 2014-08-27  9:55             ` Boaz Harrosh
  2014-08-27 12:46               ` Matthew Wilcox
  0 siblings, 1 reply; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-27  9:55 UTC (permalink / raw)
  To: Matthew Wilcox, Boaz Harrosh
  Cc: Ross Zwisler, linux-fsdevel, Andrew Morton, Sagi Manole

On 08/27/2014 07:38 AM, Matthew Wilcox wrote:
> On Tue, Aug 26, 2014 at 08:36:19PM +0300, Boaz Harrosh wrote:
>> It is getting late around here and I will not be pushing new trees
>> tonight, first thing tomorrow morning. (Your last push caused me lots
>> of extra work, you must not do like you did when working on a rebasing
>> out-off-tree project involving more then one person, but more on the proper
>> procedure tomorrow with the push of the trees)
> 
> If you're going to push a new tree, please do so on top of the DAX
> v10 patches I just sent out.  

Matthew hi

No, I'm pushing 4 brd patches for mainline for inclusion today, into
current 3.17-rcX cycle. The set includes your one patch which is needed
out and before the DAX patches.

> If you have any changes you want to make
> to the bdev_direct_access() patch, then please make them separately,
> instead of editing my patch like you did last time.
> 

I apologize about that but I did use the proper protocol here.
I'm sending your patch as part of my push from my tree, through
Jens into linus. I have put my sign-off-by on the patch, then
Jens will put his, before it will finally be pushed to Linus.

Having my sign-off on it and pushing to Jens means I need to
review and approve the patch, and so I had to fix the checkpatch
complain before pushing. And I did put at the patch commit message
that I have edited for check-patch. And I have sent the mail directly
addressed to you for approval.

If I have offended you in any way please forgive me, none was intended,
opposite I have only the deepest gratitude and admiration, and I want
for these patches to be pushed forward.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/9 v2] SQUASHME: prd: Last fixes for partitions
  2014-08-27  9:55             ` Boaz Harrosh
@ 2014-08-27 12:46               ` Matthew Wilcox
  2014-08-27 13:01                 ` Boaz Harrosh
  0 siblings, 1 reply; 48+ messages in thread
From: Matthew Wilcox @ 2014-08-27 12:46 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Boaz Harrosh, Ross Zwisler, linux-fsdevel, Andrew Morton, Sagi Manole

On Wed, Aug 27, 2014 at 12:55:56PM +0300, Boaz Harrosh wrote:
> > If you have any changes you want to make
> > to the bdev_direct_access() patch, then please make them separately,
> > instead of editing my patch like you did last time.
> > 
> 
> I apologize about that but I did use the proper protocol here.
> I'm sending your patch as part of my push from my tree, through
> Jens into linus. I have put my sign-off-by on the patch, then
> Jens will put his, before it will finally be pushed to Linus.
> 
> Having my sign-off on it and pushing to Jens means I need to
> review and approve the patch, and so I had to fix the checkpatch
> complain before pushing. And I did put at the patch commit message
> that I have edited for check-patch. And I have sent the mail directly
> addressed to you for approval.

That's not what SubmittingPatches says that you are to do.  Indeed,
option (c) says that you haven't modified my patch.

> If I have offended you in any way please forgive me, none was intended,
> opposite I have only the deepest gratitude and admiration, and I want
> for these patches to be pushed forward.
> 
> Thanks
> Boaz

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/9 v2] SQUASHME: prd: Last fixes for partitions
  2014-08-27 12:46               ` Matthew Wilcox
@ 2014-08-27 13:01                 ` Boaz Harrosh
  0 siblings, 0 replies; 48+ messages in thread
From: Boaz Harrosh @ 2014-08-27 13:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Boaz Harrosh, Ross Zwisler, linux-fsdevel, Andrew Morton, Sagi Manole

On 08/27/2014 03:46 PM, Matthew Wilcox wrote:
> On Wed, Aug 27, 2014 at 12:55:56PM +0300, Boaz Harrosh wrote:
<>
>> Having my sign-off on it and pushing to Jens means I need to
>> review and approve the patch, and so I had to fix the checkpatch
>> complain before pushing. And I did put at the patch commit message
>> that I have edited for check-patch. And I have sent the mail directly
>> addressed to you for approval.
> 
> That's not what SubmittingPatches says that you are to do.  Indeed,
> option (c) says that you haven't modified my patch.
> 

Sorry that we are fighting here, let us please do what you decide to
do.
But you are wrong about SubmittingPatches rules. If you read the all
section then you see that I also have (b) which holds in my case

        (b) The contribution is based upon previous work that, to the best
            of my knowledge, is covered under an appropriate open source
            license and I have the right under that license to submit that
            work with modifications, whether created in whole or in part
            by me, under the same open source license (unless I am
            permitted to submit under a different license), as indicated
            in the file; or

        (c) The contribution was provided directly to me by some other
            person who certified (a), (b) or (c) and I have not modified
            it.

Note the text that says (a) ... or (b) ... or (c)

>> If I have offended you in any way please forgive me, none was intended,
>> opposite I have only the deepest gratitude and admiration, and I want
>> for these patches to be pushed forward.
>>
>> Thanks
>> Boaz

What do you want that we do, I need to send these 4 patches to Jens ASAP?
Tell me and I'll do it

Thanks
Boaz


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 9/9] prd: Add support for page struct mapping
  2014-08-22 14:36     ` Dave Hansen
  (?)
@ 2014-09-09 16:16     ` Boaz Harrosh
  2014-09-09 16:29       ` Dave Hansen
  -1 siblings, 1 reply; 48+ messages in thread
From: Boaz Harrosh @ 2014-09-09 16:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

On 08/22/2014 05:36 PM, Dave Hansen wrote:
> On 08/13/2014 05:26 AM, Boaz Harrosh wrote:
>> +#ifdef CONFIG_BLK_DEV_PMEM_USE_PAGES
>> +static int prd_add_page_mapping(phys_addr_t phys_addr, size_t total_size,
>> +				void **o_virt_addr)
>> +{
>> +	int nid = memory_add_physaddr_to_nid(phys_addr);
>> +	unsigned long start_pfn = phys_addr >> PAGE_SHIFT;
>> +	unsigned long nr_pages = total_size >> PAGE_SHIFT;
>> +	unsigned int start_sec = pfn_to_section_nr(start_pfn);
>> +	unsigned int end_sec = pfn_to_section_nr(start_pfn + nr_pages - 1);
> 
> Nit: any chance you'd change this to be an exclusive end?  In the mm
> code, we usually do:
> 
> 	unsigned int end_sec = pfn_to_section_nr(start_pfn + nr_pages);
> 
> so the for loops end up <end_sec instead of <=end_sec.
> 

Done, thanks please see new patches I CCed you as well

>> +	unsigned long phys_start_pfn;
>> +	struct page **page_array, **mapped_page_array;
>> +	unsigned long i;
>> +	struct vm_struct *vm_area;
>> +	void *virt_addr;
>> +	int ret = 0;
> 
> This is a philosophical thing, but I don't see *ANY* block-specific code
> in here.  Seems like this belongs in mm/ to me.
> 

Yes, as suggested by Toshi as well, I have moved it there and fixed
bugs.

> Is there a reason you don't just do this at boot and have to use hotplug
> at runtime for it?  

This is a plug and play thing. This memory region is not reached via memory
controller, it is on the ACPI/SBUS device with physical address/size specified
there. On load of block-device this will be called. Also a block device can
be unloaded and should be able to cleanup.

> What are the ratio of pmem to RAM?  Is it possible
> to exhaust all of RAM with 'struct page's for pmem?

Yes! in the not very distant future there will be systems that have only pmem.
yes no RAM. This is because once available some pmem has much better power
efficiency then DRAM, because of the no refresh thing. So even cellphones and
embedded system first.

At which point the pmem management system will need to set aside an area for the
Kernel's volatile usage. This here makes a distinction that though the addressed
region is persistent the managing page-struct section is volatile and renewed on
boot.

The way I see it is that the Admin/setup will need to partition its storage with
setting up a partition "as ram", this will just be the "swap" partition. And the
system will run directly from the SWAP partition.
Note that the fact that this is persistent memory is not lost. The
hibernate/de-hibernate will then see that this is persistent memory and will do nothing.
(swapd will not run of course)

So the Admin/setup will need to calculate and configure the proper ratio of
volatile vs non-volatile portions of its system for proper usage.

Thanks
Boaz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC 9/9] prd: Add support for page struct mapping
  2014-09-09 16:16     ` Boaz Harrosh
@ 2014-09-09 16:29       ` Dave Hansen
  0 siblings, 0 replies; 48+ messages in thread
From: Dave Hansen @ 2014-09-09 16:29 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ross Zwisler, linux-fsdevel, Andrew Morton, linux-mm,
	Matthew Wilcox, Sagi Manole, Yigal Korman

On 09/09/2014 09:16 AM, Boaz Harrosh wrote:
> On 08/22/2014 05:36 PM, Dave Hansen wrote:
>> Is there a reason you don't just do this at boot and have to use hotplug
>> at runtime for it?  
> 
> This is a plug and play thing. This memory region is not reached via memory
> controller, it is on the ACPI/SBUS device with physical address/size specified
> there. On load of block-device this will be called. Also a block device can
> be unloaded and should be able to cleanup.

OK, cool.

>> What are the ratio of pmem to RAM?  Is it possible
>> to exhaust all of RAM with 'struct page's for pmem?
> 
> Yes! in the not very distant future there will be systems that have only pmem.
> yes no RAM. This is because once available some pmem has much better power
> efficiency then DRAM, because of the no refresh thing. So even cellphones and
> embedded system first.
...
> So the Admin/setup will need to calculate and configure the proper ratio of
> volatile vs non-volatile portions of its system for proper usage.

I'm just worried that somebody will plug a card in, and immediately OOM
the system.  Or, even worse, a card gets plugged in and 98% of the RAM
gets used by this prd driver and the kernel performs horribly.

I think we probably need to have some tracking of how much memory is
getting used for these mem_map[]s, and make sure they don't get out of hand.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2014-09-09 16:29 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-13 12:08 [RFC 0/9] pmem: Support for "struct page" with Persistent Memory storage Boaz Harrosh
2014-08-13 12:08 ` Boaz Harrosh
2014-08-13 12:10 ` [RFC 1/9] prd: Initial version of Persistent RAM Driver Boaz Harrosh
2014-08-13 12:11 ` [RFC 2/9] prd: add support for rw_page() Boaz Harrosh
2014-08-13 12:12 ` [RFC 3/9] prd: Add getgeo to block ops Boaz Harrosh
2014-08-13 12:14 ` [RFC 4/9] SQUASHME: prd: Fixs to getgeo Boaz Harrosh
2014-08-20 22:10   ` Ross Zwisler
2014-08-21  9:47     ` Boaz Harrosh
2014-08-13 12:16 ` [RFC 5/9] SQUASHME: prd: Last fixes for partitions Boaz Harrosh
2014-08-14 13:04   ` Boaz Harrosh
2014-08-14 13:16     ` Matthew Wilcox
2014-08-14 13:16       ` Matthew Wilcox
2014-08-14 13:55       ` Boaz Harrosh
2014-08-14 13:55         ` Boaz Harrosh
2014-08-14 13:07   ` [PATCH 5/9 v2] " Boaz Harrosh
2014-08-25 20:10     ` Ross Zwisler
2014-08-26  8:18       ` Boaz Harrosh
2014-08-26 17:36         ` Boaz Harrosh
2014-08-26 17:36           ` Boaz Harrosh
2014-08-26 20:34           ` Ross Zwisler
2014-08-26 20:34             ` Ross Zwisler
2014-08-27  9:41             ` Boaz Harrosh
2014-08-27  4:38           ` Matthew Wilcox
2014-08-27  9:55             ` Boaz Harrosh
2014-08-27 12:46               ` Matthew Wilcox
2014-08-27 13:01                 ` Boaz Harrosh
2014-08-20 23:03   ` [RFC 5/9] " Ross Zwisler
2014-08-21 10:05     ` Boaz Harrosh
2014-08-21 10:05       ` Boaz Harrosh
2014-08-13 12:18 ` [RFC 6/9] SQUASHME: prd: Let each prd-device manage private memory region Boaz Harrosh
2014-08-21 16:57   ` Ross Zwisler
2014-08-13 12:20 ` [RFC 7/9] SQUASHME: prd: Support of multiple memory regions Boaz Harrosh
2014-08-25 23:02   ` Ross Zwisler
2014-08-25 23:02     ` Ross Zwisler
2014-08-13 12:21 ` [RFC 8/9] mm: export sparse_add/remove_one_section Boaz Harrosh
2014-08-13 12:26 ` [RFC 9/9] prd: Add support for page struct mapping Boaz Harrosh
2014-08-15 20:28   ` Toshi Kani
2014-08-17  9:17     ` Boaz Harrosh
2014-08-18 19:48       ` Toshi Kani
2014-08-18 19:48         ` Toshi Kani
2014-08-19  8:40         ` Boaz Harrosh
2014-08-19 16:49           ` Toshi Kani
2014-08-19 16:49             ` Toshi Kani
2014-08-22 14:36   ` Dave Hansen
2014-08-22 14:36     ` Dave Hansen
2014-09-09 16:16     ` Boaz Harrosh
2014-09-09 16:29       ` Dave Hansen
2014-08-20 20:13 ` [RFC 0/9] pmem: Support for "struct page" with Persistent Memory storage Ross Zwisler

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.