linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips
@ 2015-03-05 10:16 Boaz Harrosh
  2015-03-05 10:20 ` [PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY Boaz Harrosh
                   ` (4 more replies)
  0 siblings, 5 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-05 10:16 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler

Hi

[v5]
* [PATCH 2/3] Added the add_taint(TAINT_FIRMWARE_WORKAROUND,...)
   and changed the printed message as requested
* Use IORESOURCE_MEM_WARN bit from the mem specific bit range
  (not 64bit only anymore, only works with memory resources)
* Fix user visible typo reserved-unkown => reserved-unknown &&
	unkown-12 => unknown-12
* Select [PATCH 3A/3] (over [PATCH 3B/3])
* ...
* Also posting RFC of pmem as reference

[v2]
* Added warning at bring up about unknown type
* Added an extra patch to warn-print in request_resource
* changed name from NvDIMM-12 => unknown-12
  I wish we would reconsider this. So we need to suffer until some unknown
  future when ACPI decides to reuse type-12. When this happens we can fix
  it then, NO?
* Now based on 4.0-rc1

[v1]
There is a deficiency in current e820.c handling where unknown new memory-chip
types come up as a BUSY resource when some other driver (like pmem) tries to
call request_mem_region_exclusive() on that resource. Even though, actually
there is nothing using it.
>From inspecting the code and the history of e820.c it looks like a BUG.

In any way this is a problem for the new type-12 NvDIMM memory chips that
are circulating around. (It is estimated that there are already 100ds of
thousands NvDIMM chips in active use)

The patches below first fixes the above problem for any future type
memory, so external drivers can access these mem chips.

I then also add the NvDIMM type-12 memory constant so it comes up
nice in dprints and at /proc/iomem

Just as before all these chips are very much usable with the pmem
driver. This lets us remove the hack for type-12 NvDIMMs that ignores
the return code from request_mem_region_exclusive() in pmem.c.

For all the pmem people. I maintain a tree with these patches
and latest pmem code here:
	git://git.open-osd.org/pmem.git (pmem branch)
	[web-view:http://git.open-osd.org/gitweb.cgi?p=pmem.git;a=summary]

List of patches:
 [PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY
	The main fix

 [PATCH 2/3] resource: Add new flag IORESOURCE_MEM_WARN
	Warn in request_resource

 [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)

Also submitted as reference is an RFC of the pmem driver that demonstrates
the use of the add_resource API for the NvDIMM chips. This can be seen in
pmem-patch-1. Also please see pmem-patch-8 an out-of-tree patch that
ignores the add_resource failure so it can work with NvDIMMs with old
kernels.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY
  2015-03-05 10:16 [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips Boaz Harrosh
@ 2015-03-05 10:20 ` Boaz Harrosh
  2015-03-05 20:41   ` Dan Williams
  2015-03-05 10:21 ` [PATCH 2/3] resource: Add new flag IORESOURCE_MEM_WARN Boaz Harrosh
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-05 10:20 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler


There is something not very nice (Gentlemen nice) In current
e820.c code.

At Multiple places for example @memblock_x86_fill() it will
add the different memory resources *except the E820_RESERVED type*

Then at e820_reserve_resources() it will mark all !E820_RESERVED
as busy.

This is all fine when we have only the known types one of:
	E820_RESERVED_KERN:
	E820_RAM:
	E820_ACPI:
	E820_NVS:
	E820_UNUSABLE:
	E820_RESERVED:

But if the system encounters a brand new memory type it will
not add it to any memory list, But will proceed to mark it
BUSY. So now any other Driver in the system that does know
how to deal with this new type, is not able to call
request_mem_region_exclusive() on this new type because it is
hard coded BUSY even though nothing really uses it.

So make any unknown type behave like E820_RESERVED memory,
it will show up as available to first caller of
request_mem_region_exclusive().

I Also change the string representation of an unknown type
from "reserved" (So to not confuse with memmap "reserved"
region). And call it "reserved-unknown"
I wish I could return "reserved-type-X" But this is not possible
because one must return a constant, code-segment, string.

(NOTE: These unknown-types where called "reserved" in
 /proc/iomem and in dmesg but behaved differently. What this
 patch does is name them differently but let them behave
 the same)

By Popular demand An Extra WARNING message is printed if
an "UNKNOWN" is found. It will look like this:
  e820: WARNING [mem 0x100000000-0x1ffffffff] is unknown type 12

An example of such "UNKNOWN" type is the not Standard type-12
DDR3-NvDIMM which is used by multiple vendors for a while
now. (Estimated 100ds of thousands sold world wide)

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
CC: Dan Williams <dan.j.williams@intel.com>
CC: Matthew Wilcox <willy@linux.intel.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 arch/x86/kernel/e820.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 46201de..c3a11cd 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -104,6 +104,21 @@ int __init e820_all_mapped(u64 start, u64 end, unsigned type)
 	return 0;
 }
 
+static bool _is_unknown_type(int e820_type)
+{
+	switch (e820_type) {
+	case E820_RESERVED_KERN:
+	case E820_RAM:
+	case E820_ACPI:
+	case E820_NVS:
+	case E820_UNUSABLE:
+	case E820_RESERVED:
+		return false;
+	default:
+		return true;
+	}
+}
+
 /*
  * Add a memory region to the kernel e820 map.
  */
@@ -119,6 +134,11 @@ static void __init __e820_add_region(struct e820map *e820x, u64 start, u64 size,
 		return;
 	}
 
+	if (unlikely(_is_unknown_type(type)))
+		pr_warn("e820: WARNING [mem %#010llx-%#010llx] is unknown type %d\n",
+		       (unsigned long long) start,
+		       (unsigned long long) (start + size - 1), type);
+
 	e820x->map[x].addr = start;
 	e820x->map[x].size = size;
 	e820x->map[x].type = type;
@@ -907,10 +927,16 @@ static inline const char *e820_type_to_string(int e820_type)
 	case E820_ACPI:	return "ACPI Tables";
 	case E820_NVS:	return "ACPI Non-volatile Storage";
 	case E820_UNUSABLE:	return "Unusable memory";
-	default:	return "reserved";
+	case E820_RESERVED:	return "reserved";
+	default:		return "reserved-unknown";
 	}
 }
 
+static bool _is_reserved_type(int e820_type)
+{
+	return (e820_type == E820_RESERVED) || _is_unknown_type(e820_type);
+}
+
 /*
  * Mark e820 reserved areas as busy for the resource manager.
  */
@@ -940,7 +966,8 @@ void __init e820_reserve_resources(void)
 		 * pci device BAR resource and insert them later in
 		 * pcibios_resource_survey()
 		 */
-		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
+		if (!_is_reserved_type(e820.map[i].type) ||
+		    res->start < (1ULL<<20)) {
 			res->flags |= IORESOURCE_BUSY;
 			insert_resource(&iomem_resource, res);
 		}
-- 
1.9.3



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/3] resource: Add new flag IORESOURCE_MEM_WARN
  2015-03-05 10:16 [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips Boaz Harrosh
  2015-03-05 10:20 ` [PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY Boaz Harrosh
@ 2015-03-05 10:21 ` Boaz Harrosh
  2015-03-05 10:24 ` [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM) Boaz Harrosh
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-05 10:21 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler


memory resource providers set this flag if they want
that request_region will print a warning in dmesg
if this particular memory resource is locked by a driver.

Thous acting as a Protocol Police about experimental
devices that did not pass a committee approval.

The Only user of  this flag is x86/kernel/e820.c that
wants to WARN about UNKNOWN memory types.

NOTE: It would be preferred if I defined a general flag say
      IORESOURCE_WARN, where any kind of resource provider
      can WARN on use, but we have run out of flags in the
      32bit long systems. So I defined a free bit from the
      resource specific flags for mem resources. This is
      why I need to check if this is a memory resource first
      so not to conflict with other resource specific flags.
      (Though actually no one is using this specific bit)

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
CC: Dan Williams <dan.j.williams@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Bjorn Helgaas <bhelgaas@google.com>
CC: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 arch/x86/kernel/e820.c |  3 +++
 include/linux/ioport.h |  1 +
 kernel/resource.c      | 11 ++++++++++-
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index c3a11cd..c2f2da2 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -961,6 +961,9 @@ void __init e820_reserve_resources(void)
 
 		res->flags = IORESOURCE_MEM;
 
+		if (_is_unknown_type(e820.map[i].type))
+			res->flags |= IORESOURCE_MEM_WARN;
+
 		/*
 		 * don't register the region that could be conflicted with
 		 * pci device BAR resource and insert them later in
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 2c525022..f78972b 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -90,6 +90,7 @@ struct resource {
 #define IORESOURCE_MEM_32BIT		(3<<3)
 #define IORESOURCE_MEM_SHADOWABLE	(1<<5)	/* dup: IORESOURCE_SHADOWABLE */
 #define IORESOURCE_MEM_EXPANSIONROM	(1<<6)
+#define IORESOURCE_MEM_WARN		(1<<7)	/* WARN if requested by driver */
 
 /* PnP I/O specific bits (IORESOURCE_BITS) */
 #define IORESOURCE_IO_16BIT_ADDR	(1<<0)
diff --git a/kernel/resource.c b/kernel/resource.c
index 19f2357..f886666 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1075,8 +1075,17 @@ struct resource * __request_region(struct resource *parent,
 			break;
 		if (conflict != parent) {
 			parent = conflict;
-			if (!(conflict->flags & IORESOURCE_BUSY))
+			if (!(conflict->flags & IORESOURCE_BUSY)) {
+				if ((resource_type(conflict) == IORESOURCE_MEM)
+				    && (conflict->flags & IORESOURCE_MEM_WARN)) {
+					add_taint(TAINT_FIRMWARE_WORKAROUND,
+						  LOCKDEP_STILL_OK);
+					pr_warn("%s requested an unknown memory type [mem %#010llx-%#010llx] %s\n",
+						name, conflict->start,
+						conflict->end, conflict->name);
+				}
 				continue;
+			}
 		}
 		if (conflict->flags & flags & IORESOURCE_MUXED) {
 			add_wait_queue(&muxed_resource_wait, &wait);
-- 
1.9.3



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)
  2015-03-05 10:16 [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips Boaz Harrosh
  2015-03-05 10:20 ` [PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY Boaz Harrosh
  2015-03-05 10:21 ` [PATCH 2/3] resource: Add new flag IORESOURCE_MEM_WARN Boaz Harrosh
@ 2015-03-05 10:24 ` Boaz Harrosh
  2015-03-05 20:56   ` Dan Williams
  2015-03-05 10:32 ` [RFC 0/8] pmem: Submission of the Persistent memory block device Boaz Harrosh
  2015-03-05 22:48 ` [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips H. Peter Anvin
  4 siblings, 1 reply; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-05 10:24 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler


There are multiple vendors of DDR3 NvDIMMs out in the market today.
At various stages of development/production. It is estimated that
there are already more the 100ds of thousands chips sold to
testers and sites.

All the BIOS vendors I know of, tagged these chips at e820 table
as type-12 memory.

Now the ACPI comity, as far as I know, did not yet define a
standard type for NvDIMM. Also, as far as I know any NvDIMM
standard will only be defined for DDR4. So DDR3 NvDIMM is
probably stuck with this none STD type.

I Wish and call the ACPI comity to Define that NvDIMM is type-12.
Also for DDR4

The motivation of this patch is to be able to differentiate
this NvDIMM type from a real future "unknown-reserved" type.

In this patch I name type-12 "unknown-12". This is because of
ACPI politics that refuse to reserve type-12 as DDR3-NvDIMM
and members keep saying:
	"What if ACPI assigns type-12 for something else in future"

[And I say: Then just don't. Please?]

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 arch/x86/kernel/e820.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index c2f2da2..4bf574a 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -25,6 +25,11 @@
 #include <asm/proto.h>
 #include <asm/setup.h>
 
+/* These are nonstandard Memory types that we do not want in the exported
+ * header.
+ */
+#define E820_UNKNOWN_12 12	/* This is the unofficial DDR3-NVDIMM */
+
 /*
  * The e820 map is the map that gets modified e.g. with command line parameters
  * and that is also registered with modifications in the kernel resource tree
@@ -169,6 +174,9 @@ static void __init e820_print_type(u32 type)
 	case E820_UNUSABLE:
 		printk(KERN_CONT "unusable");
 		break;
+	case E820_UNKNOWN_12:
+		printk(KERN_CONT "unknown-12");
+		break;
 	default:
 		printk(KERN_CONT "type %u", type);
 		break;
@@ -928,6 +936,7 @@ static inline const char *e820_type_to_string(int e820_type)
 	case E820_NVS:	return "ACPI Non-volatile Storage";
 	case E820_UNUSABLE:	return "Unusable memory";
 	case E820_RESERVED:	return "reserved";
+	case E820_UNKNOWN_12:	return "unknown-12";
 	default:		return "reserved-unknown";
 	}
 }
-- 
1.9.3



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 0/8] pmem: Submission of the Persistent memory block device
  2015-03-05 10:16 [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips Boaz Harrosh
                   ` (2 preceding siblings ...)
  2015-03-05 10:24 ` [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM) Boaz Harrosh
@ 2015-03-05 10:32 ` Boaz Harrosh
  2015-03-05 11:55   ` [PATCH 1/8] pmem: Initial version of persistent memory driver Boaz Harrosh
                     ` (8 more replies)
  2015-03-05 22:48 ` [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips H. Peter Anvin
  4 siblings, 9 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-05 10:32 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler


There are already NvDIMMs and other Persistent-memory devices in the market, and
lots more of them will be coming in near future.

Current stack is coming along very nice, and filesystems support for leveraging this
technologies has been submitted to Linus in the DAX series by Matthew Wilcox.

The general stack does not change:
	block-device
	partition
	file-system
	application file

The only extra care, see Matthew's DAX patches, Is the ->direct_access() API from
block devices that enables a direct mapping from Persistent-memory to user application
and/or Kernel for direct store/load of data.

The only missing piece is the actual block device that enables support
for such NvDIMM chips. This is the driver we submit here.

The driver is very simple, in fact it is the 2nd smallest driver inside drivers/block
What the driver does is support a physical contiguous iomem range as a single block
device. The driver has support for as many as needed iomem ranges each as its own device.
(See patch-1 for more details)

We are using this driver over a year now, In a lab with combination of VMs and real
hardware, with a variety of hardware and vendors, and it is very stable. Actually why
not it is so simple it does nothing almost.

The driver is not only good for NvDIMMs, It is good for any flat memory mapped
device. We've used it with NvDIMMs, Kernel reserved DRAM (memmap= on command line),
PCIE Battery backed memory cards, VM shared memory, and so on.

Together with this driver also submitted support for page-struct with
Persistent-memory, so Persistent-memory can be used with RDMA, DMA, block-devices
and so on, just as regular memory, in a copy-less manner.
With the use of these two simple patches, we were able to set up an RDMA target
machine which exports NvDIMMs and enables direct remote storage. The only
"complicated" thing was the remote flush of caches because most RDMA nicks in
Kernel will RDMA directly to L3 cache, so we needed to establish a message that
involves the remote CPU for this. But otherwise the mapping of pmem pointer
to an RDMA key was trivial, directly from user-mode, with no extra Kernel code.
[The target is simple with no extra code, the RDMA client on the other hand needs
 a special driver]

I maintain these patches on latest Kernels here:
	git://git.open-osd.org/pmem.git branch pmem

Thanks for reviewing
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/8] pmem: Initial version of persistent memory driver
  2015-03-05 10:32 ` [RFC 0/8] pmem: Submission of the Persistent memory block device Boaz Harrosh
@ 2015-03-05 11:55   ` Boaz Harrosh
  2015-03-05 20:35     ` Paul Bolle
                       ` (2 more replies)
  2015-03-05 11:55   ` [PATCH 2/8] pmem: KISS, remove register_blkdev Boaz Harrosh
                     ` (7 subsequent siblings)
  8 siblings, 3 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-05 11:55 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler

From: Ross Zwisler <ross.zwisler@linux.intel.com>

PMEM is a new driver That supports any physical contiguous iomem range
as a single block device. The driver has support for as many as needed
iomem ranges each as its own device.

The driver is not only good for NvDIMMs, It is good for any flat memory
mapped device. We've used it with NvDIMMs, Kernel reserved DRAM
(memmap= on command line), PCIE Battery backed memory cards, VM shared
memory, and so on.

The API to pmem module a single string parameter named "map"
of the form:
		 map=mapS[,mapS...]

		 where mapS=nn[KMG]$ss[KMG],
		 or    mapS=nn[KMG]@ss[KMG],

		 nn=size, ss=offset

Just like the Kernel command line map && memmap parameters,
so anything you did at grub just copy/paste to here.

The "@" form is exactly the same as the "$" form only that
at bash prompt we need to escape the "$" with \$ so also
support the '@' char for convenience.

For each specified mapS there will be a device created.

[This is the accumulated version of the driver developed by
 multiple programmers. To see the real history of these
 patches see:
	git://git.open-osd.org/pmem.git
	https://github.com/01org/prd
 This patch is based on (git://git.open-osd.org/pmem.git):
	[5ccf703] SQUASHME: Don't clobber the map module param

<list-of-changes>
[boaz]
SQUASHME: pmem: Remove unused #include headers
SQUASHME: pmem: Request from fdisk 4k alignment
SQUASHME: pmem: Let each device manage private memory region
SQUASHME: pmem: Support of multiple memory regions
SQUASHME: pmem: Micro optimization the hotpath 001
SQUASHME: pmem: no need to copy a page at a time
SQUASHME: pmem that 4k sector thing
SQUASHME: pmem: Cleanliness is neat
SQUASHME: Don't clobber the map module param
SQUASHME: pmem: Few changes to Initial version of pmem
SQUASHME: Changes to copyright text (trivial)
</list-of-changes>

TODO: Add Documentation/blockdev/pmem.txt

Need-signed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 MAINTAINERS            |   7 ++
 drivers/block/Kconfig  |  18 +++
 drivers/block/Makefile |   1 +
 drivers/block/pmem.c   | 334 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 360 insertions(+)
 create mode 100644 drivers/block/pmem.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ddc5a8c..21c5384 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8053,6 +8053,13 @@ S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c
 
+PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler <ross.zwisler@linux.intel.com>
+M:	Boaz Harrosh <boaz@plexistor.com>
+L:	linux-nvdimm@lists.01.org
+S:	Supported
+F:	drivers/block/pmem.c
+
 RANDOM NUMBER DRIVER
 M:	"Theodore Ts'o" <tytso@mit.edu>
 S:	Maintained
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 1b8094d..1530c2a 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -404,6 +404,24 @@ config BLK_DEV_RAM_DAX
 	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
+config BLK_DEV_PMEM
+	tristate "pmem: Persistent memory block device support"
+	help
+	  If you have Persistent memory in your system say Y/m
+	  here. The driver can support real Persistent memory chips
+	  such as NVDIMMs , as well as volatile memory that was set
+	  aside from Kernel use by the "memmap" kernel parameter.
+	  And/or any contiguous physical memory ranges that you want
+	  to represent as a block device. (Even PCIE flat memory mapped
+	  devices)
+	  See Documentation/block/pmem.txt for how to use
+
+	  To compile this driver as a module, choose M here: the module will be
+	  called pmem. Created Devices will be named: /dev/pmemX
+
+	  Most normal users won't need this functionality, and can thus say N
+	  here.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 02b688d..9cc6c18 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_PS3_VRAM)		+= ps3vram.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
 obj-$(CONFIG_BLK_DEV_RAM)	+= brd.o
+obj-$(CONFIG_BLK_DEV_PMEM)	+= pmem.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_CPQ_DA)	+= cpqarray.o
 obj-$(CONFIG_BLK_CPQ_CISS_DA)  += cciss.o
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
new file mode 100644
index 0000000..02cd118
--- /dev/null
+++ b/drivers/block/pmem.c
@@ -0,0 +1,334 @@
+/*
+ * Persistent Memory Driver
+ * Copyright (c) 2014, Intel Corporation.
+ * Copyright (c) 2014, Boaz Harrosh <boaz@plexistor.com>.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * This driver's skeleton is based on drivers/block/brd.c.
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/hdreg.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+
+struct pmem_device {
+	struct request_queue	*pmem_queue;
+	struct gendisk		*pmem_disk;
+	struct list_head	pmem_list;
+
+	/* One contiguous memory region per device */
+	phys_addr_t		phys_addr;
+	void			*virt_addr;
+	size_t			size;
+};
+
+static void pmem_do_bvec(struct pmem_device *pmem, struct page *page, uint len,
+			 uint off, int rw, sector_t sector)
+{
+	void *mem = kmap_atomic(page);
+	size_t pmem_off = sector << 9;
+
+	BUG_ON(pmem_off >= pmem->size);
+
+	if (rw == READ) {
+		memcpy(mem + off, pmem->virt_addr + pmem_off, len);
+		flush_dcache_page(page);
+	} else {
+		/*
+		 * FIXME: Need more involved flushing to ensure that writes to
+		 * NVDIMMs are actually durable before returning.
+		 */
+		flush_dcache_page(page);
+		memcpy(pmem->virt_addr + pmem_off, mem + off, len);
+	}
+
+	kunmap_atomic(mem);
+}
+
+static void pmem_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct block_device *bdev = bio->bi_bdev;
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+	int rw;
+	struct bio_vec bvec;
+	sector_t sector;
+	struct bvec_iter iter;
+	int err = 0;
+
+	if (unlikely(bio_end_sector(bio) > get_capacity(bdev->bd_disk))) {
+		err = -EIO;
+		goto out;
+	}
+
+	if (WARN_ON(bio->bi_rw & REQ_DISCARD)) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	rw = bio_rw(bio);
+	if (rw == READA)
+		rw = READ;
+
+	sector = bio->bi_iter.bi_sector;
+	bio_for_each_segment(bvec, bio, iter) {
+		/* NOTE: There is a legend saying that bv_len might be
+		 * bigger than PAGE_SIZE in the case that bv_page points to
+		 * a physical contiguous PFN set. But for us it is fine because
+		 * it means the Kernel virtual mapping is also contiguous. And
+		 * on the pmem side we are always contiguous both virtual and
+		 * physical
+		 */
+		pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len, bvec.bv_offset,
+			     rw, sector);
+		sector += bvec.bv_len >> 9;
+	}
+
+out:
+	bio_endio(bio, err);
+}
+
+static const struct block_device_operations pmem_fops = {
+	.owner =		THIS_MODULE,
+};
+
+/* Kernel module stuff */
+static char *map;
+module_param(map, charp, S_IRUGO);
+MODULE_PARM_DESC(map,
+	"pmem device mapping: map=mapS[,mapS...] where:\n"
+	"mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset.");
+
+static LIST_HEAD(pmem_devices);
+static int pmem_major;
+
+/* pmem->phys_addr and pmem->size need to be set.
+ * Will then set virt_addr if successful.
+ */
+int pmem_mapmem(struct pmem_device *pmem)
+{
+	struct resource *res_mem;
+	int err;
+
+	res_mem = request_mem_region_exclusive(pmem->phys_addr, pmem->size,
+					       "pmem");
+	if (unlikely(!res_mem)) {
+		pr_warn("pmem: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n",
+			   pmem->phys_addr, pmem->size);
+		return -EINVAL;
+	}
+
+	pmem->virt_addr = ioremap_cache(pmem->phys_addr, pmem->size);
+	if (unlikely(!pmem->virt_addr)) {
+		err = -ENXIO;
+		goto out_release;
+	}
+	return 0;
+
+out_release:
+	release_mem_region(pmem->phys_addr, pmem->size);
+	return err;
+}
+
+void pmem_unmapmem(struct pmem_device *pmem)
+{
+	if (unlikely(!pmem->virt_addr))
+		return;
+
+	iounmap(pmem->virt_addr);
+	release_mem_region(pmem->phys_addr, pmem->size);
+	pmem->virt_addr = NULL;
+}
+
+#define PMEM_ALIGNMEM PAGE_SIZE
+
+static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size,
+				      int i)
+{
+	struct pmem_device *pmem;
+	struct gendisk *disk;
+	int err;
+
+	if (unlikely((phys_addr & (PMEM_ALIGNMEM - 1)) ||
+		     (disk_size & (PMEM_ALIGNMEM - 1)))) {
+		pr_err("phys_addr=0x%llx disk_size=0x%zx must be 0x%lx aligned\n",
+		       phys_addr, disk_size, PMEM_ALIGNMEM);
+		err = -EINVAL;
+		goto out;
+	}
+
+	pmem = kzalloc(sizeof(*pmem), GFP_KERNEL);
+	if (unlikely(!pmem)) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	pmem->phys_addr = phys_addr;
+	pmem->size = disk_size;
+
+	err = pmem_mapmem(pmem);
+	if (unlikely(err))
+		goto out_free_dev;
+
+	pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL);
+	if (unlikely(!pmem->pmem_queue)) {
+		err = -ENOMEM;
+		goto out_unmap;
+	}
+
+	blk_queue_make_request(pmem->pmem_queue, pmem_make_request);
+	blk_queue_max_hw_sectors(pmem->pmem_queue, 1024);
+	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+
+	/* This is so fdisk will align partitions on 4k, because of
+	 * direct_access API needing 4k alignment, returning a PFN
+	 */
+	blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE);
+
+	disk = alloc_disk(0);
+	if (unlikely(!disk)) {
+		err = -ENOMEM;
+		goto out_free_queue;
+	}
+
+	disk->major		= pmem_major;
+	disk->first_minor	= 0;
+	disk->fops		= &pmem_fops;
+	disk->private_data	= pmem;
+	disk->queue		= pmem->pmem_queue;
+	disk->flags		= GENHD_FL_EXT_DEVT;
+	sprintf(disk->disk_name, "pmem%d", i);
+	set_capacity(disk, disk_size >> 9);
+	pmem->pmem_disk = disk;
+
+	return pmem;
+
+out_free_queue:
+	blk_cleanup_queue(pmem->pmem_queue);
+out_unmap:
+	pmem_unmapmem(pmem);
+out_free_dev:
+	kfree(pmem);
+out:
+	return ERR_PTR(err);
+}
+
+static void pmem_free(struct pmem_device *pmem)
+{
+	put_disk(pmem->pmem_disk);
+	blk_cleanup_queue(pmem->pmem_queue);
+	pmem_unmapmem(pmem);
+	kfree(pmem);
+}
+
+static void pmem_del_one(struct pmem_device *pmem)
+{
+	list_del(&pmem->pmem_list);
+	del_gendisk(pmem->pmem_disk);
+	pmem_free(pmem);
+}
+
+static int pmem_parse_map_one(char *map, phys_addr_t *start, size_t *size)
+{
+	char *p = map;
+
+	*size = (size_t)memparse(p, &p);
+	if ((p == map) || ((*p != '$') && (*p != '@')))
+		return -EINVAL;
+
+	if (!*(++p))
+		return -EINVAL;
+
+	*start = (phys_addr_t)memparse(p, &p);
+
+	return *p == '\0' ? 0 : -EINVAL;
+}
+
+static int __init pmem_init(void)
+{
+	int result, i;
+	struct pmem_device *pmem, *next;
+	char *p, *pmem_map, *map_dup;
+
+	if (unlikely(!map || !*map)) {
+		pr_err("pmem: must specify map=nn@ss parameter.\n");
+		return -EINVAL;
+	}
+
+	result = register_blkdev(0, "pmem");
+	if (unlikely(result < 0))
+		return -EIO;
+
+	pmem_major = result;
+
+	map_dup = pmem_map = kstrdup(map, GFP_KERNEL);
+	if (unlikely(!pmem_map)) {
+		pr_debug("pmem_init strdup(%s) failed\n", map);
+		return -ENOMEM;
+	}
+
+	i = 0;
+	while ((p = strsep(&pmem_map, ",")) != NULL) {
+		phys_addr_t phys_addr;
+		size_t disk_size;
+
+		if (!*p)
+			continue;
+		result = pmem_parse_map_one(p, &phys_addr, &disk_size);
+		if (result)
+			goto out_free;
+		pmem = pmem_alloc(phys_addr, disk_size, i);
+		if (IS_ERR(pmem)) {
+			result = PTR_ERR(pmem);
+			goto out_free;
+		}
+		list_add_tail(&pmem->pmem_list, &pmem_devices);
+		++i;
+	}
+
+	list_for_each_entry(pmem, &pmem_devices, pmem_list)
+		add_disk(pmem->pmem_disk);
+
+	pr_info("pmem: module loaded map=%s\n", map);
+	kfree(map_dup);
+	return 0;
+
+out_free:
+	list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list) {
+		list_del(&pmem->pmem_list);
+		pmem_free(pmem);
+	}
+	kfree(map_dup);
+	unregister_blkdev(pmem_major, "pmem");
+
+	return result;
+}
+
+static void __exit pmem_exit(void)
+{
+	struct pmem_device *pmem, *next;
+
+	list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list)
+		pmem_del_one(pmem);
+
+	unregister_blkdev(pmem_major, "pmem");
+	pr_info("pmem: module unloaded\n");
+}
+
+MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
+MODULE_LICENSE("GPL");
+module_init(pmem_init);
+module_exit(pmem_exit);
-- 
1.9.3



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/8] pmem: KISS, remove register_blkdev
  2015-03-05 10:32 ` [RFC 0/8] pmem: Submission of the Persistent memory block device Boaz Harrosh
  2015-03-05 11:55   ` [PATCH 1/8] pmem: Initial version of persistent memory driver Boaz Harrosh
@ 2015-03-05 11:55   ` Boaz Harrosh
  2015-03-05 11:56   ` [PATCH 3/8] pmem: Add support for rw_page() Boaz Harrosh
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-05 11:55 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler


Though this looks revolutionary, it actually does nothing
but removes dead code.

Since the move of pmem to the dynamic BLOCK_EXT_MAJOR
range, by Ross, pmem_major returned from register_blkdev
is not used.

If inspecting the code at add_disk() we can see that
whatever number we will put at
	disk->major		= ...
	disk->first_minor	= ...

will be immediately overwritten with the BLOCK_EXT_MAJOR
and a dynamically allocated minor. Regardless of the
registration and/or what ever we put at disk->major.

I have tested all the tests that I know how to perform
on the devices, fdisk, partitions, multiple pmemX devices,
mknode, lsblk, blkid, and it all behaves exactly as
before this patch.

I have also done:
	find /sys/ -name "*pmem*"
	find /proc/ -name "*pmem*"
	find /dev/ -name "*pmem*"
And get the same output as before this patch.

The only thing missing is an entry in /proc/devices of
the form: "250 pmem" (250 or what ever is free at the moment)

But this is good because if one tries to look for 250
devices after loading pmem he will fail because pmem
is always registered as 259 (blkext) now.

If anyone knows of what would work differently after this
patch please do tell.

But it looks like the calls to register_blkdev is just dead
code for us right now

Thanks

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/pmem.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 02cd118..bba53af 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -113,7 +113,6 @@ MODULE_PARM_DESC(map,
 	"mapS=nn[KMG]$ss[KMG] or mapS=nn[KMG]@ss[KMG], nn=size, ss=offset.");
 
 static LIST_HEAD(pmem_devices);
-static int pmem_major;
 
 /* pmem->phys_addr and pmem->size need to be set.
  * Will then set virt_addr if successful.
@@ -204,8 +203,6 @@ static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size,
 		goto out_free_queue;
 	}
 
-	disk->major		= pmem_major;
-	disk->first_minor	= 0;
 	disk->fops		= &pmem_fops;
 	disk->private_data	= pmem;
 	disk->queue		= pmem->pmem_queue;
@@ -268,12 +265,6 @@ static int __init pmem_init(void)
 		return -EINVAL;
 	}
 
-	result = register_blkdev(0, "pmem");
-	if (unlikely(result < 0))
-		return -EIO;
-
-	pmem_major = result;
-
 	map_dup = pmem_map = kstrdup(map, GFP_KERNEL);
 	if (unlikely(!pmem_map)) {
 		pr_debug("pmem_init strdup(%s) failed\n", map);
@@ -312,7 +303,6 @@ out_free:
 		pmem_free(pmem);
 	}
 	kfree(map_dup);
-	unregister_blkdev(pmem_major, "pmem");
 
 	return result;
 }
@@ -324,7 +314,6 @@ static void __exit pmem_exit(void)
 	list_for_each_entry_safe(pmem, next, &pmem_devices, pmem_list)
 		pmem_del_one(pmem);
 
-	unregister_blkdev(pmem_major, "pmem");
 	pr_info("pmem: module unloaded\n");
 }
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/8] pmem: Add support for rw_page()
  2015-03-05 10:32 ` [RFC 0/8] pmem: Submission of the Persistent memory block device Boaz Harrosh
  2015-03-05 11:55   ` [PATCH 1/8] pmem: Initial version of persistent memory driver Boaz Harrosh
  2015-03-05 11:55   ` [PATCH 2/8] pmem: KISS, remove register_blkdev Boaz Harrosh
@ 2015-03-05 11:56   ` Boaz Harrosh
  2015-03-05 11:57   ` [PATCH 4/8] pmem: Add support for direct_access() Boaz Harrosh
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-05 11:56 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler

From: Ross Zwisler <ross.zwisler@linux.intel.com>

Based on commit a72132c31d58 ("brd: add support for rw_page()")

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/block/pmem.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index bba53af..750ffdf 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -101,8 +101,19 @@ out:
 	bio_endio(bio, err);
 }
 
+static int pmem_rw_page(struct block_device *bdev, sector_t sector,
+		       struct page *page, int rw)
+{
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+
+	pmem_do_bvec(pmem, page, PAGE_CACHE_SIZE, 0, rw, sector);
+	page_endio(page, rw & WRITE, 0);
+	return 0;
+}
+
 static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
+	.rw_page =		pmem_rw_page,
 };
 
 /* Kernel module stuff */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 4/8] pmem: Add support for direct_access()
  2015-03-05 10:32 ` [RFC 0/8] pmem: Submission of the Persistent memory block device Boaz Harrosh
                     ` (2 preceding siblings ...)
  2015-03-05 11:56   ` [PATCH 3/8] pmem: Add support for rw_page() Boaz Harrosh
@ 2015-03-05 11:57   ` Boaz Harrosh
  2015-03-05 11:58   ` [PATCH 5/8] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-05 11:57 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler

From: Ross Zwisler <ross.zwisler@linux.intel.com>

Also fixed a top the initial version
[boaz]
  SQUASHME pmem: Micro optimization the hotpath 002

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/pmem.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index 750ffdf..f0f0ba0 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -111,9 +111,25 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	return 0;
 }
 
+static long pmem_direct_access(struct block_device *bdev, sector_t sector,
+			      void **kaddr, unsigned long *pfn, long size)
+{
+	struct pmem_device *pmem = bdev->bd_disk->private_data;
+	size_t offset = sector << 9;
+
+	if (unlikely(!pmem))
+		return -ENODEV;
+
+	*kaddr = pmem->virt_addr + offset;
+	*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;
+
+	return pmem->size - offset;
+}
+
 static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		pmem_rw_page,
+	.direct_access =	pmem_direct_access,
 };
 
 /* Kernel module stuff */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 5/8] mm: Let sparse_{add,remove}_one_section receive a node_id
  2015-03-05 10:32 ` [RFC 0/8] pmem: Submission of the Persistent memory block device Boaz Harrosh
                     ` (3 preceding siblings ...)
  2015-03-05 11:57   ` [PATCH 4/8] pmem: Add support for direct_access() Boaz Harrosh
@ 2015-03-05 11:58   ` Boaz Harrosh
  2015-03-06 18:43     ` Ross Zwisler
  2015-03-05 11:59   ` [PATCH 6/8] mm: New add_persistent_memory/remove_persistent_memory Boaz Harrosh
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-05 11:58 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler

From: Yigal Korman <yigal@plexistor.com>

Refactored the arguments of sparse_add_one_section / sparse_remove_one_section
to use node id instead of struct zone * - A memory section has no direct
connection to zones, all that was needed from zone was the node id.

This is for add_persistent_memory that will want a section of pages
allocated but without any zone associated. This is because belonging
to a zone will give the memory to the page allocators, but
persistent_memory belongs to a block device, and is not available for
regular volatile usage.

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 include/linux/memory_hotplug.h | 4 ++--
 mm/memory_hotplug.c            | 4 ++--
 mm/sparse.c                    | 9 +++++----
 3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8f1a419..77ca3a4 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -265,8 +265,8 @@ extern int arch_add_memory(int nid, u64 start, u64 size);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
-extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn);
-extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);
+extern int sparse_add_one_section(int nid, unsigned long start_pfn);
+extern void sparse_remove_one_section(int nid, struct mem_section *ms);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9fab107..7a73d30 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -472,7 +472,7 @@ static int __meminit __add_section(int nid, struct zone *zone,
 	if (pfn_valid(phys_start_pfn))
 		return -EEXIST;
 
-	ret = sparse_add_one_section(zone, phys_start_pfn);
+	ret = sparse_add_one_section(zone->zone_pgdat->node_id, phys_start_pfn);
 
 	if (ret < 0)
 		return ret;
@@ -738,7 +738,7 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	start_pfn = section_nr_to_pfn(scn_nr);
 	__remove_zone(zone, start_pfn);
 
-	sparse_remove_one_section(zone, ms);
+	sparse_remove_one_section(zone->zone_pgdat->node_id, ms);
 	return 0;
 }
 
diff --git a/mm/sparse.c b/mm/sparse.c
index d1b48b6..12a10ab 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -690,10 +690,10 @@ static void free_map_bootmem(struct page *memmap)
  * set.  If this is <=0, then that means that the passed-in
  * map was not consumed and must be freed.
  */
-int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn)
+int __meminit sparse_add_one_section(int nid, unsigned long start_pfn)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
-	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct mem_section *ms;
 	struct page *memmap;
 	unsigned long *usemap;
@@ -788,11 +788,11 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 		free_map_bootmem(memmap);
 }
 
-void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
+void sparse_remove_one_section(int nid, struct mem_section *ms)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL, flags;
-	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 
 	pgdat_resize_lock(pgdat, &flags);
 	if (ms->section_mem_map) {
@@ -807,5 +807,6 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 	clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION);
 	free_section_usemap(memmap, usemap);
 }
+
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_MEMORY_HOTPLUG */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 6/8] mm: New add_persistent_memory/remove_persistent_memory
  2015-03-05 10:32 ` [RFC 0/8] pmem: Submission of the Persistent memory block device Boaz Harrosh
                     ` (4 preceding siblings ...)
  2015-03-05 11:58   ` [PATCH 5/8] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh
@ 2015-03-05 11:59   ` Boaz Harrosh
  2015-03-05 11:59   ` [PATCH 7/8] pmem: Add support for page structs Boaz Harrosh
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-05 11:59 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler


Persistent Memory is not Memory. It is not presented as
a Memory Zone and is not available through the page allocators
for application/kernel volatile usage.

It belongs to A block device just as any other Persistent storage,
the novelty here is that it is directly mapped on the CPU Memory
bus, and usually as fast or almost as fast as system RAM.

The main motivation of add_persistent_memory is to allocate a
page-struct "Section" for a given physical memory region. This is because
The user of this memory mapped device might need to pass pages-struct
of this memory to a Kernel subsytem like block-layer or networking
and have it's content directly DMAed to other devices

(For example these pages can be put on a bio and sent to disk
 in a copy-less manner)

It will also request_mem_region_exclusive(.., "persistent_memory")
to own that physical memory region.

And will map that physical region to the Kernel's VM at the
address expected for page_address() of those pages allocated
above

remove_persistent_memory() must be called to undo what
add_persistent_memory did.

A user of this API will then use pfn_to_page(PHISICAL_ADDR >> PAGE_SIZE)
to receive a page-struct for use on its pmem.

Any operation like page_address() page_to_pfn() page_lock() ... can
be preformed on these pages just as usual.

An example user is presented in the next patch to pmem.c Block Device
driver (There are 3 more such drivers in the Kernel that could use this
API)

This patch is based on research and patch made by
Yigal Korman <yigal@plexistor.com> to the pmem driver. I took his code
and adapted it to mm, where it belongs.

[v2 SQUASHME1]
    Fixing 'make CONFIG_DEBUG_SECTION_MISMATCH=y'
    '__meminit' should be added
	Signed-off-by: Yigal Korman <yigal@plexistor.com>
[v3 SQUASHME2]
	We need to calculate the nid for every section, it is
	possible that the same contiguous range will span two
	numa-nodes.
	Signed-off-by: Boaz Harrosh <boaz@plexistor.com>

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 include/linux/memory_hotplug.h |   4 +
 mm/Kconfig                     |  23 ++++++
 mm/memory_hotplug.c            | 177 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 204 insertions(+)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 77ca3a4..b7d6c6e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -192,6 +192,10 @@ extern void get_page_bootmem(unsigned long ingo, struct page *page,
 void get_online_mems(void);
 void put_online_mems(void);
 
+int add_persistent_memory(phys_addr_t phys_addr, size_t size,
+			  void **o_virt_addr);
+void remove_persistent_memory(phys_addr_t phys_addr, size_t size);
+
 #else /* ! CONFIG_MEMORY_HOTPLUG */
 /*
  * Stub functions for when hotplug is off
diff --git a/mm/Kconfig b/mm/Kconfig
index a03131b..05c9585 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,6 +200,29 @@ config MEMORY_HOTREMOVE
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
 
+
+# User of PERSISTENT_MEMORY_SECTION should:
+#	depends on PERSISTENT_MEMORY_DEPENDENCY and
+#	select DRIVER_NEEDS_PERSISTENT_MEMORY
+# Note that it will not be enabled if MEMORY_HOTPLUG is not enabled
+#
+# If you have changed the dependency/select of MEMORY_HOTREMOVE please also
+# update here
+#
+config PERSISTENT_MEMORY_DEPENDENCY
+	def_bool y
+	depends on MEMORY_HOTPLUG
+	depends on ARCH_ENABLE_MEMORY_HOTREMOVE && MIGRATION
+
+config DRIVER_NEEDS_PERSISTENT_MEMORY
+	bool
+
+config PERSISTENT_MEMORY_SECTION
+	def_bool y
+	depends on PERSISTENT_MEMORY_DEPENDENCY
+	depends on DRIVER_NEEDS_PERSISTENT_MEMORY
+	select MEMORY_HOTREMOVE
+
 #
 # If we have space for more page flags then we can enable additional
 # optimizations and functionality.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7a73d30..416efdb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2025,3 +2025,180 @@ void __ref remove_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
+
+#ifdef CONFIG_PERSISTENT_MEMORY_SECTION
+
+/* This helper is so we do not need to allocate a page_array bigger
+ * than PAGE_SIZE
+ */
+static int _map_sec_range(ulong sec_start_pfn, struct page **page_array)
+{
+	const uint NUM_PAGES = PAGE_SIZE / sizeof(*page_array);
+	const uint ARRAYS_IN_SECTION = PAGES_PER_SECTION / NUM_PAGES;
+	ulong pfn = sec_start_pfn;
+	uint a;
+
+	for (a = 0; a < ARRAYS_IN_SECTION; ++a) {
+		ulong virt_addr = (ulong)page_address(pfn_to_page(pfn));
+		uint p;
+		int ret;
+
+		for (p = 0; p < NUM_PAGES; ++p)
+			page_array[p] = pfn_to_page(pfn++);
+
+		ret = map_kernel_range_noflush(virt_addr, NUM_PAGES * PAGE_SIZE,
+					       PAGE_KERNEL, page_array);
+		if (unlikely(ret < 0)) {
+			pr_warn("%s: map_kernel_range(0x%lx, 0x%lx) => %d\n",
+				__func__, sec_start_pfn, virt_addr, ret);
+			return ret;
+		}
+		if (unlikely(ret < NUM_PAGES)) {
+			pr_warn("%s: map_kernel_range(0x%lx) => %d != %d last_pfn=0x%lx\n",
+				 __func__, virt_addr, NUM_PAGES, ret, pfn);
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * add_persistent_memory - Add memory sections and maps them to Kernel space
+ * @phys_addr: start of physical address to add & map
+ * @size: size of the memory range in bytes
+ * @o_virt_addr: The returned virtual address of the mapped memory range
+ *
+ * A persistent_memory block-device will use this function to add memory
+ * sections and map its physical memory range. After the call to this function
+ * There will be page-struct associated with each pfn added here, and it will
+ * be accessible from Kernel space through the returned @o_virt_addr
+ * @phys_addr will be rounded down to the nearest SECTION_SIZE, the range
+ * mapped will be in full SECTION_SIZE sections.
+ * @o_virt_addr is the address of @phys_addr not the start of the mapped section
+ * so usually mapping a range unaligned on SECTION_SIZE will work just that the
+ * unaligned start and/or end, will ignore the error and continue.
+ * (but will print "memory section XX already exists")
+ *
+ * NOTE:
+ * persistent_memory is not system ram and is not available through any
+ * allocator, for regular consumption. Therefore it does not belong to any
+ * memory zone.
+ * But it will need a memory-section allocated, so page-structs are available
+ * for this memory, so it can be DMA'd directly with zero copy.
+ * After a call to this function the ranged pages belong exclusively to the
+ * caller.
+ *
+ * RETURNS:
+ * zero on success, or -errno on failure. If successful @o_virt_addr will be set
+ */
+int __meminit add_persistent_memory(phys_addr_t phys_addr, size_t size,
+			  void **o_virt_addr)
+{
+	ulong start_pfn = phys_addr >> PAGE_SHIFT;
+	ulong nr_pages = size >> PAGE_SHIFT;
+	ulong start_sec = pfn_to_section_nr(start_pfn);
+	ulong end_sec = pfn_to_section_nr(start_pfn + nr_pages +
+							PAGES_PER_SECTION - 1);
+	struct resource *res_mem;
+	struct page **page_array;
+	ulong i;
+	int ret = 0;
+
+	page_array = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (unlikely(!page_array))
+		return -ENOMEM;
+
+	res_mem = request_mem_region_exclusive(phys_addr, size,
+					       "persistent_memory");
+	if (unlikely(!res_mem)) {
+		pr_warn("%s: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n",
+			__func__, phys_addr, size);
+		ret = -EINVAL;
+		goto free_array;
+	}
+
+	for (i = start_sec; i < end_sec; ++i) {
+		ulong sec_pfn = i << PFN_SECTION_SHIFT;
+		int nid = memory_add_physaddr_to_nid(sec_pfn << PAGE_SHIFT);
+
+		if (pfn_valid(sec_pfn)) {
+			pr_warn("%s: memory section %lu already exists.\n",
+				__func__, i);
+			continue;
+		}
+
+		ret = sparse_add_one_section(nid, sec_pfn);
+		if (unlikely(ret < 0)) {
+			if (ret == -EEXIST) {
+				ret = 0;
+				continue;
+			} else {
+				pr_warn("%s: sparse_add_one_section => %d\n",
+					__func__, ret);
+				goto release_region;
+			}
+		}
+
+		ret = _map_sec_range(sec_pfn, page_array);
+		if (unlikely(ret))
+			goto release_region;
+	}
+
+	*o_virt_addr = page_address(pfn_to_page(start_pfn));
+
+free_array:
+	kfree(page_array);
+	return ret;
+
+release_region:
+	release_mem_region(phys_addr, size);
+	goto free_array;
+}
+EXPORT_SYMBOL_GPL(add_persistent_memory);
+
+/**
+ * remove_persistent_memory - undo anything add_persistent_memory did
+ * @phys_addr: start of physical address to remove
+ * @size: size of the memory range in bytes
+ *
+ * A successful call to add_persistent_memory must be paired with
+ * remove_persistent_memory when done. It will unmap passed PFNs from
+ * Kernel virtual address, and will remove the memory sections.
+ * @phys_addr, @size must be exactly those passed to add_persistent_memory
+ * otherwise results are unexpected, there are no checks done on this.
+ */
+void remove_persistent_memory(phys_addr_t phys_addr, size_t size)
+{
+	ulong start_pfn = phys_addr >> PAGE_SHIFT;
+	ulong nr_pages = size >> PAGE_SHIFT;
+	ulong start_sec = pfn_to_section_nr(start_pfn);
+	ulong end_sec = pfn_to_section_nr(start_pfn + nr_pages +
+							PAGES_PER_SECTION - 1);
+	int nid = pfn_to_nid(start_pfn);
+	ulong virt_addr;
+	unsigned int i;
+
+	virt_addr = (ulong)page_address(
+				pfn_to_page(start_sec << PFN_SECTION_SHIFT));
+
+	for (i = start_sec; i < end_sec; ++i) {
+		struct mem_section *ms;
+
+		unmap_kernel_range(virt_addr, 1UL << SECTION_SIZE_BITS);
+		virt_addr += 1UL << SECTION_SIZE_BITS;
+
+		ms = __nr_to_section(i);
+		if (!valid_section(ms)) {
+			pr_warn("%s: memory section %d is missing.\n",
+				__func__, i);
+			continue;
+		}
+		sparse_remove_one_section(nid, ms);
+	}
+
+	release_mem_region(phys_addr, size);
+}
+EXPORT_SYMBOL_GPL(remove_persistent_memory);
+
+#endif /* def CONFIG_PERSISTENT_MEMORY_SECTION */
+
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 7/8] pmem: Add support for page structs
  2015-03-05 10:32 ` [RFC 0/8] pmem: Submission of the Persistent memory block device Boaz Harrosh
                     ` (5 preceding siblings ...)
  2015-03-05 11:59   ` [PATCH 6/8] mm: New add_persistent_memory/remove_persistent_memory Boaz Harrosh
@ 2015-03-05 11:59   ` Boaz Harrosh
  2015-03-23 20:59     ` Dan Williams
  2015-03-05 12:01   ` [PATCH 8/8] OUT-OF-TREE: pmem: Allow request_mem to fail (BLK_DEV_PMEM_IGNORE_REQUEST_MEM_RET) Boaz Harrosh
  2015-03-06 18:37   ` [RFC 0/8] pmem: Submission of the Persistent memory block device Ross Zwisler
  8 siblings, 1 reply; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-05 11:59 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler


One of the current shortcomings of the NVDIMM/PMEM
support is that this memory does not have a page-struct(s)
associated with its memory and therefor cannot be passed
to a block-device or network or DMAed in any way through
another device in the system.

The use of add_persistent_memory() fixes all this. After this patch
an FS can do:
	bdev_direct_access(,&pfn,);
	page = pfn_to_page(pfn);
And use that page for a lock_page(), set_page_dirty(), and/or
anything else one might do with a page *.
(Note that with brd one can already do this)

[pmem-pages-ref-count]
pmem will serve it's pages with ref==0. Once an FS does
an blkdev_get_XXX(,FMODE_EXCL,), that memory is own by the FS.
The FS needs to manage its allocation, just as it already does
for its disk blocks. The fs should set page->count = 2, before
submission to any Kernel subsystem so when it returns it will
never be released to the Kernel's page-allocators. (page_freeze)

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/Kconfig | 13 +++++++++++++
 drivers/block/pmem.c  | 20 ++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 1530c2a..635fa6a 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -422,6 +422,19 @@ config BLK_DEV_PMEM
 	  Most normal users won't need this functionality, and can thus say N
 	  here.
 
+config BLK_DEV_PMEM_USE_PAGES
+	bool "Enable use of page struct pages with pmem"
+	depends on BLK_DEV_PMEM
+	depends on PERSISTENT_MEMORY_DEPENDENCY
+	select DRIVER_NEEDS_PERSISTENT_MEMORY
+	default y
+	help
+	  If a user of PMEM device needs "struct page" associated
+	  with its memory, so this memory can be sent to other
+	  block devices, or sent on the network, or be DMA transferred
+	  to other devices in the system, then you must say "Yes" here.
+	  If unsure leave as Yes.
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index f0f0ba0..d0c80f4 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -20,6 +20,7 @@
 #include <linux/blkdev.h>
 #include <linux/hdreg.h>
 #include <linux/init.h>
+#include <linux/memory_hotplug.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/slab.h>
@@ -141,6 +142,24 @@ MODULE_PARM_DESC(map,
 
 static LIST_HEAD(pmem_devices);
 
+#ifdef CONFIG_BLK_DEV_PMEM_USE_PAGES
+/* pmem->phys_addr and pmem->size need to be set.
+ * Will then set pmem->virt_addr if successful.
+ */
+int pmem_mapmem(struct pmem_device *pmem)
+{
+	return add_persistent_memory(pmem->phys_addr, pmem->size,
+				     &pmem->virt_addr);
+}
+
+static void pmem_unmapmem(struct pmem_device *pmem)
+{
+	remove_persistent_memory(pmem->phys_addr, pmem->size);
+}
+
+#define PMEM_ALIGNMEM (1UL << SECTION_SIZE_BITS)
+#else /* !CONFIG_BLK_DEV_PMEM_USE_PAGES */
+
 /* pmem->phys_addr and pmem->size need to be set.
  * Will then set virt_addr if successful.
  */
@@ -180,6 +199,7 @@ void pmem_unmapmem(struct pmem_device *pmem)
 }
 
 #define PMEM_ALIGNMEM PAGE_SIZE
+#endif /* ! CONFIG_BLK_DEV_PMEM_USE_PAGES */
 
 static struct pmem_device *pmem_alloc(phys_addr_t phys_addr, size_t disk_size,
 				      int i)
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 8/8] OUT-OF-TREE: pmem: Allow request_mem to fail (BLK_DEV_PMEM_IGNORE_REQUEST_MEM_RET)
  2015-03-05 10:32 ` [RFC 0/8] pmem: Submission of the Persistent memory block device Boaz Harrosh
                     ` (6 preceding siblings ...)
  2015-03-05 11:59   ` [PATCH 7/8] pmem: Add support for page structs Boaz Harrosh
@ 2015-03-05 12:01   ` Boaz Harrosh
  2015-03-06 18:37   ` [RFC 0/8] pmem: Submission of the Persistent memory block device Ross Zwisler
  8 siblings, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-05 12:01 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler


With old Kernels there was a bug in x86 where any unknown
memory chip type would come up BUSY when calling
request_mem_region_exclusive().

So for pmem to work with old Kernels and real NvDIMM chips
we have a new Kconfig option CONFIG_BLK_DEV_PMEM_IGNORE_REQUEST_MEM_RET.

People have been running with hacked up pmem that will ignore
the return code from request_mem_region_exclusive. So here it is
official

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 drivers/block/Kconfig | 12 ++++++++++++
 drivers/block/pmem.c  |  9 ++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 635fa6a..58a2c69 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -435,6 +435,18 @@ config BLK_DEV_PMEM_USE_PAGES
 	  to other devices in the system, then you must say "Yes" here.
 	  If unsure leave as Yes.
 
+config BLK_DEV_PMEM_IGNORE_REQUEST_MEM_RET
+	bool "Ignore the return code from request_mem_region_exclusive"
+	depends on BLK_DEV_PMEM
+	help
+	  In Old Kernels type-12 Memory type which is used by NvDIMM
+	  chips Comes out busy when calling request_mem_region_exclusive,
+	  because of a bug.
+	  If this option is set to "yes". The pmem will ignore the
+	  failure, and continue as usual. If you have an old Kernel and
+	  a real NvDIMM chip you must say yes here.
+	  (Ignored if BLK_DEV_PMEM_USE_PAGES=y)
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
diff --git a/drivers/block/pmem.c b/drivers/block/pmem.c
index d0c80f4..ba1167c 100644
--- a/drivers/block/pmem.c
+++ b/drivers/block/pmem.c
@@ -172,8 +172,10 @@ int pmem_mapmem(struct pmem_device *pmem)
 					       "pmem");
 	if (unlikely(!res_mem)) {
 		pr_warn("pmem: request_mem_region_exclusive phys=0x%llx size=0x%zx failed\n",
-			   pmem->phys_addr, pmem->size);
-		return -EINVAL;
+			pmem->phys_addr, pmem->size);
+#ifndef CONFIG_BLK_DEV_PMEM_IGNORE_REQUEST_MEM_RET
+		return -EBUSY;
+#endif
 	}
 
 	pmem->virt_addr = ioremap_cache(pmem->phys_addr, pmem->size);
@@ -184,7 +186,8 @@ int pmem_mapmem(struct pmem_device *pmem)
 	return 0;
 
 out_release:
-	release_mem_region(pmem->phys_addr, pmem->size);
+	if (res_mem)
+		release_mem_region(pmem->phys_addr, pmem->size);
 	return err;
 }
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
  2015-03-05 11:55   ` [PATCH 1/8] pmem: Initial version of persistent memory driver Boaz Harrosh
@ 2015-03-05 20:35     ` Paul Bolle
  2015-03-05 23:03     ` Andy Lutomirski
  2015-03-18 17:43     ` Ross Zwisler
  2 siblings, 0 replies; 43+ messages in thread
From: Paul Bolle @ 2015-03-05 20:35 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig, Ross Zwisler

For what it's worth, at this moment, I found a license nit.

On Thu, 2015-03-05 at 13:55 +0200, Boaz Harrosh wrote:
> --- /dev/null
> +++ b/drivers/block/pmem.c
> @@ -0,0 +1,334 @@
> +/*
> + * Persistent Memory Driver
> + * Copyright (c) 2014, Intel Corporation.
> + * Copyright (c) 2014, Boaz Harrosh <boaz@plexistor.com>.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.

This states the license is GPL v2.

> +MODULE_LICENSE("GPL");

So you probably want
    MODULE_LICENSE("GPL v2");

here.


Paul Bolle


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY
  2015-03-05 10:20 ` [PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY Boaz Harrosh
@ 2015-03-05 20:41   ` Dan Williams
  2015-03-09 10:54     ` Boaz Harrosh
  0 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2015-03-05 20:41 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ingo Molnar, X86 ML, linux-kernel, Roger C. Pao, Thomas Gleixner,
	linux-nvdimm, H. Peter Anvin, Matthew Wilcox, Andy Lutomirski,
	Christoph Hellwig, Ross Zwisler

On Thu, Mar 5, 2015 at 2:20 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>
> There is something not very nice (Gentlemen nice) In current
> e820.c code.
>
> At Multiple places for example @memblock_x86_fill() it will
> add the different memory resources *except the E820_RESERVED type*
>
> Then at e820_reserve_resources() it will mark all !E820_RESERVED
> as busy.
>
> This is all fine when we have only the known types one of:
>         E820_RESERVED_KERN:
>         E820_RAM:
>         E820_ACPI:
>         E820_NVS:
>         E820_UNUSABLE:
>         E820_RESERVED:
>
> But if the system encounters a brand new memory type it will
> not add it to any memory list, But will proceed to mark it
> BUSY. So now any other Driver in the system that does know
> how to deal with this new type, is not able to call
> request_mem_region_exclusive() on this new type because it is
> hard coded BUSY even though nothing really uses it.
>
> So make any unknown type behave like E820_RESERVED memory,
> it will show up as available to first caller of
> request_mem_region_exclusive().
>
> I Also change the string representation of an unknown type
> from "reserved" (So to not confuse with memmap "reserved"
> region). And call it "reserved-unknown"
> I wish I could return "reserved-type-X" But this is not possible
> because one must return a constant, code-segment, string.
>
> (NOTE: These unknown-types where called "reserved" in
>  /proc/iomem and in dmesg but behaved differently. What this
>  patch does is name them differently but let them behave
>  the same)
>
> By Popular demand An Extra WARNING message is printed if
> an "UNKNOWN" is found. It will look like this:
>   e820: WARNING [mem 0x100000000-0x1ffffffff] is unknown type 12
>
> An example of such "UNKNOWN" type is the not Standard type-12
> DDR3-NvDIMM which is used by multiple vendors for a while
> now. (Estimated 100ds of thousands sold world wide)
>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Ingo Molnar <mingo@redhat.com>
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: x86@kernel.org
> CC: Dan Williams <dan.j.williams@intel.com>
> CC: Matthew Wilcox <willy@linux.intel.com>
> CC: Christoph Hellwig <hch@infradead.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  arch/x86/kernel/e820.c | 31 +++++++++++++++++++++++++++++--
>  1 file changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 46201de..c3a11cd 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -104,6 +104,21 @@ int __init e820_all_mapped(u64 start, u64 end, unsigned type)
>         return 0;
>  }
>
> +static bool _is_unknown_type(int e820_type)

Any reason for the leading "_"?

> +{
> +       switch (e820_type) {
> +       case E820_RESERVED_KERN:
> +       case E820_RAM:
> +       case E820_ACPI:
> +       case E820_NVS:
> +       case E820_UNUSABLE:
> +       case E820_RESERVED:
> +               return false;
> +       default:
> +               return true;
> +       }
> +}
> +
>  /*
>   * Add a memory region to the kernel e820 map.
>   */
> @@ -119,6 +134,11 @@ static void __init __e820_add_region(struct e820map *e820x, u64 start, u64 size,
>                 return;
>         }
>
> +       if (unlikely(_is_unknown_type(type)))

Unnecessary unlikely()?

> +               pr_warn("e820: WARNING [mem %#010llx-%#010llx] is unknown type %d\n",
> +                      (unsigned long long) start,
> +                      (unsigned long long) (start + size - 1), type);

I still think this warning can go and we can just fold patch 2 into
this one, but other than that this looks ok to me.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)
  2015-03-05 10:24 ` [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM) Boaz Harrosh
@ 2015-03-05 20:56   ` Dan Williams
  2015-03-05 23:09     ` Andy Lutomirski
  2015-03-09 11:19     ` Boaz Harrosh
  0 siblings, 2 replies; 43+ messages in thread
From: Dan Williams @ 2015-03-05 20:56 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ingo Molnar, X86 ML, linux-kernel, Roger C. Pao, Thomas Gleixner,
	linux-nvdimm, H. Peter Anvin, Matthew Wilcox, Andy Lutomirski,
	Christoph Hellwig, Ross Zwisler

On Thu, Mar 5, 2015 at 2:24 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>
> There are multiple vendors of DDR3 NvDIMMs out in the market today.
> At various stages of development/production. It is estimated that
> there are already more the 100ds of thousands chips sold to
> testers and sites.
>
> All the BIOS vendors I know of, tagged these chips at e820 table
> as type-12 memory.
>
> Now the ACPI comity, as far as I know, did not yet define a
> standard type for NvDIMM. Also, as far as I know any NvDIMM
> standard will only be defined for DDR4. So DDR3 NvDIMM is
> probably stuck with this none STD type.

There's no relation between E820 types and DDR technology revisions.

> I Wish and call the ACPI comity to Define that NvDIMM is type-12.
> Also for DDR4
>
> The motivation of this patch is to be able to differentiate
> this NvDIMM type from a real future "unknown-reserved" type.
>
> In this patch I name type-12 "unknown-12". This is because of
> ACPI politics that refuse to reserve type-12 as DDR3-NvDIMM

It's not "politics".  Setting standards takes time and the platforms
in question simply jumped the gun to enable a proof-of-concept.

> and members keep saying:
>         "What if ACPI assigns type-12 for something else in future"
>
> [And I say: Then just don't. Please?]

Once a standard number is assigned, platform firmwares can update
type-12 to that number.  We might consider a compile time override for
these niche/pre-standard systems that can't/won't update, but it's not
clear to me that we even need to go that far.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips
  2015-03-05 10:16 [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips Boaz Harrosh
                   ` (3 preceding siblings ...)
  2015-03-05 10:32 ` [RFC 0/8] pmem: Submission of the Persistent memory block device Boaz Harrosh
@ 2015-03-05 22:48 ` H. Peter Anvin
  2015-03-05 23:06   ` Andy Lutomirski
  4 siblings, 1 reply; 43+ messages in thread
From: H. Peter Anvin @ 2015-03-05 22:48 UTC (permalink / raw)
  To: Boaz Harrosh, Ingo Molnar, x86, linux-kernel, Roger C. Pao,
	Dan Williams, Thomas Gleixner, linux-nvdimm, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig
  Cc: Ross Zwisler

Seriously, if type 12 is the de facto standard for NvDIMMs, I think we
should add it.  The documented ACPI method of using flags was doomed
from the start.

	-hpa


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
  2015-03-05 11:55   ` [PATCH 1/8] pmem: Initial version of persistent memory driver Boaz Harrosh
  2015-03-05 20:35     ` Paul Bolle
@ 2015-03-05 23:03     ` Andy Lutomirski
  2015-03-09 12:20       ` Boaz Harrosh
  2015-03-18 17:43     ` Ross Zwisler
  2 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2015-03-05 23:03 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Ross Zwisler, Thomas Gleixner, X86 ML,
	Dan Williams, Roger C. Pao, Ingo Molnar, linux-nvdimm,
	linux-kernel, H. Peter Anvin, Christoph Hellwig

On Mar 5, 2015 3:55 AM, "Boaz Harrosh" <boaz@plexistor.com> wrote:
>
> From: Ross Zwisler <ross.zwisler@linux.intel.com>
>
> PMEM is a new driver That supports any physical contiguous iomem range
> as a single block device. The driver has support for as many as needed
> iomem ranges each as its own device.
>
> The driver is not only good for NvDIMMs, It is good for any flat memory
> mapped device. We've used it with NvDIMMs, Kernel reserved DRAM
> (memmap= on command line), PCIE Battery backed memory cards, VM shared
> memory, and so on.
>
> The API to pmem module a single string parameter named "map"
> of the form:
>                  map=mapS[,mapS...]
>
>                  where mapS=nn[KMG]$ss[KMG],
>                  or    mapS=nn[KMG]@ss[KMG],
>
>                  nn=size, ss=offset
>
> Just like the Kernel command line map && memmap parameters,
> so anything you did at grub just copy/paste to here.
>
> The "@" form is exactly the same as the "$" form only that
> at bash prompt we need to escape the "$" with \$ so also
> support the '@' char for convenience.
>
> For each specified mapS there will be a device created.

[...]

> +       pmem->virt_addr = ioremap_cache(pmem->phys_addr, pmem->size);

I think it would be nice to have control over the caching mode.
Depending on the application, WT or UC could make more sense.

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips
  2015-03-05 22:48 ` [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips H. Peter Anvin
@ 2015-03-05 23:06   ` Andy Lutomirski
  0 siblings, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2015-03-05 23:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Boaz Harrosh, Ingo Molnar, X86 ML, linux-kernel, Roger C. Pao,
	Dan Williams, Thomas Gleixner, linux-nvdimm, Matthew Wilcox,
	Christoph Hellwig, Ross Zwisler

On Thu, Mar 5, 2015 at 2:48 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> Seriously, if type 12 is the de facto standard for NvDIMMs, I think we
> should add it.  The documented ACPI method of using flags was doomed
> from the start.

I think it's a de facto standard that type 12 means "nvdimm".  I don't
think there's any kind of standard as to whether that means that it's
permissible to touch that memory or that nvdimm means type 12.

I have no problem w/ the resource name saying "nvdimm", but I do
object to autoloading anything as a result.

(There is no type 12 on EFI, nor will there ever be, since there is no
e820 at all on EFI.)

--Andy

>
>         -hpa
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)
  2015-03-05 20:56   ` Dan Williams
@ 2015-03-05 23:09     ` Andy Lutomirski
  2015-03-09 12:10       ` Boaz Harrosh
  2015-03-09 11:19     ` Boaz Harrosh
  1 sibling, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2015-03-05 23:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Boaz Harrosh, Ingo Molnar, X86 ML, linux-kernel, Roger C. Pao,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Christoph Hellwig, Ross Zwisler

On Thu, Mar 5, 2015 at 12:56 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Mar 5, 2015 at 2:24 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>
>> There are multiple vendors of DDR3 NvDIMMs out in the market today.
>> At various stages of development/production. It is estimated that
>> there are already more the 100ds of thousands chips sold to
>> testers and sites.
>>
>> All the BIOS vendors I know of, tagged these chips at e820 table
>> as type-12 memory.
>>
>> Now the ACPI comity, as far as I know, did not yet define a
>> standard type for NvDIMM. Also, as far as I know any NvDIMM
>> standard will only be defined for DDR4. So DDR3 NvDIMM is
>> probably stuck with this none STD type.
>
> There's no relation between E820 types and DDR technology revisions.
>
>> I Wish and call the ACPI comity to Define that NvDIMM is type-12.
>> Also for DDR4
>>
>> The motivation of this patch is to be able to differentiate
>> this NvDIMM type from a real future "unknown-reserved" type.
>>
>> In this patch I name type-12 "unknown-12". This is because of
>> ACPI politics that refuse to reserve type-12 as DDR3-NvDIMM
>
> It's not "politics".  Setting standards takes time and the platforms
> in question simply jumped the gun to enable a proof-of-concept.
>
>> and members keep saying:
>>         "What if ACPI assigns type-12 for something else in future"
>>
>> [And I say: Then just don't. Please?]
>
> Once a standard number is assigned, platform firmwares can update
> type-12 to that number.  We might consider a compile time override for
> these niche/pre-standard systems that can't/won't update, but it's not
> clear to me that we even need to go that far.

I will be shocked if a standard of this form ever appears.  Modern
systems *don't have e820*.  The BIOSes that are using this type 12
hack are awful throwbacks.

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/8] pmem: Submission of the Persistent memory block device
  2015-03-05 10:32 ` [RFC 0/8] pmem: Submission of the Persistent memory block device Boaz Harrosh
                     ` (7 preceding siblings ...)
  2015-03-05 12:01   ` [PATCH 8/8] OUT-OF-TREE: pmem: Allow request_mem to fail (BLK_DEV_PMEM_IGNORE_REQUEST_MEM_RET) Boaz Harrosh
@ 2015-03-06 18:37   ` Ross Zwisler
  2015-03-07  1:39     ` Christoph Hellwig
  2015-03-09 12:41     ` Boaz Harrosh
  8 siblings, 2 replies; 43+ messages in thread
From: Ross Zwisler @ 2015-03-06 18:37 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig

On Thu, 2015-03-05 at 12:32 +0200, Boaz Harrosh wrote:
> There are already NvDIMMs and other Persistent-memory devices in the market, and
> lots more of them will be coming in near future.
> 
> Current stack is coming along very nice, and filesystems support for leveraging this
> technologies has been submitted to Linus in the DAX series by Matthew Wilcox.
> 
> The general stack does not change:
> 	block-device
> 	partition
> 	file-system
> 	application file
> 
> The only extra care, see Matthew's DAX patches, Is the ->direct_access() API from
> block devices that enables a direct mapping from Persistent-memory to user application
> and/or Kernel for direct store/load of data.
> 
> The only missing piece is the actual block device that enables support
> for such NvDIMM chips. This is the driver we submit here.
> 
> The driver is very simple, in fact it is the 2nd smallest driver inside drivers/block
> What the driver does is support a physical contiguous iomem range as a single block
> device. The driver has support for as many as needed iomem ranges each as its own device.
> (See patch-1 for more details)
> 
> We are using this driver over a year now, In a lab with combination of VMs and real
> hardware, with a variety of hardware and vendors, and it is very stable. Actually why
> not it is so simple it does nothing almost.
> 
> The driver is not only good for NvDIMMs, It is good for any flat memory mapped
> device. We've used it with NvDIMMs, Kernel reserved DRAM (memmap= on command line),
> PCIE Battery backed memory cards, VM shared memory, and so on.
> 
> Together with this driver also submitted support for page-struct with
> Persistent-memory, so Persistent-memory can be used with RDMA, DMA, block-devices
> and so on, just as regular memory, in a copy-less manner.
> With the use of these two simple patches, we were able to set up an RDMA target
> machine which exports NvDIMMs and enables direct remote storage. The only
> "complicated" thing was the remote flush of caches because most RDMA nicks in
> Kernel will RDMA directly to L3 cache, so we needed to establish a message that
> involves the remote CPU for this. But otherwise the mapping of pmem pointer
> to an RDMA key was trivial, directly from user-mode, with no extra Kernel code.
> [The target is simple with no extra code, the RDMA client on the other hand needs
>  a special driver]
> 
> I maintain these patches on latest Kernels here:
> 	git://git.open-osd.org/pmem.git branch pmem
> 
> Thanks for reviewing
> Boaz

Hey Boaz,

Regarding the PMEM series, my group has been working on an updated
version of this driver for the past 6 months or so since I initially
posted the beginnings of this series:

https://lkml.org/lkml/2014/8/27/674

That new version should be ready for public viewing sometime in April.

It's my preference that we wait to try and upstream any form of PMEM
until we've released our updated version of the driver, and you've had a
chance to review and add in any changes you need.  I'm cool with
gathering additional feedback until then, of course.

Trying to upstream this older version and then merging it with the newer
stuff in-kernel seems like it'll just end up being more work in the end.

Thanks,
- Ross


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/8] mm: Let sparse_{add,remove}_one_section receive a node_id
  2015-03-05 11:58   ` [PATCH 5/8] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh
@ 2015-03-06 18:43     ` Ross Zwisler
  0 siblings, 0 replies; 43+ messages in thread
From: Ross Zwisler @ 2015-03-06 18:43 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig

On Thu, 2015-03-05 at 13:58 +0200, Boaz Harrosh wrote:
> From: Yigal Korman <yigal@plexistor.com>
> 
> Refactored the arguments of sparse_add_one_section / sparse_remove_one_section
> to use node id instead of struct zone * - A memory section has no direct
> connection to zones, all that was needed from zone was the node id.
> 
> This is for add_persistent_memory that will want a section of pages
> allocated but without any zone associated. This is because belonging
> to a zone will give the memory to the page allocators, but
> persistent_memory belongs to a block device, and is not available for
> regular volatile usage.
> 
> Signed-off-by: Yigal Korman <yigal@plexistor.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  include/linux/memory_hotplug.h | 4 ++--
>  mm/memory_hotplug.c            | 4 ++--
>  mm/sparse.c                    | 9 +++++----
>  3 files changed, 9 insertions(+), 8 deletions(-)

For both of the MM patches in this series (this one, 5/8, and the next one,
6/8), please be sure and CC the MM folks.  I know that Dave Hansen had
feedback on an earlier version of this patch:

https://lkml.org/lkml/2014/9/9/742

We need to make sure that feedback is addressed.

Thanks,
- Ross



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/8] pmem: Submission of the Persistent memory block device
  2015-03-06 18:37   ` [RFC 0/8] pmem: Submission of the Persistent memory block device Ross Zwisler
@ 2015-03-07  1:39     ` Christoph Hellwig
  2015-03-09 12:41     ` Boaz Harrosh
  1 sibling, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2015-03-07  1:39 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Boaz Harrosh, Ingo Molnar, x86, linux-kernel, Roger C. Pao,
	Dan Williams, Thomas Gleixner, linux-nvdimm, H. Peter Anvin,
	Matthew Wilcox, Andy Lutomirski, Christoph Hellwig

On Fri, Mar 06, 2015 at 11:37:45AM -0700, Ross Zwisler wrote:
> Regarding the PMEM series, my group has been working on an updated
> version of this driver for the past 6 months or so since I initially
> posted the beginnings of this series:
> 
> https://lkml.org/lkml/2014/8/27/674
> 
> That new version should be ready for public viewing sometime in April.
> 
> It's my preference that we wait to try and upstream any form of PMEM
> until we've released our updated version of the driver, and you've had a
> chance to review and add in any changes you need.  I'm cool with
> gathering additional feedback until then, of course.
> 
> Trying to upstream this older version and then merging it with the newer
> stuff in-kernel seems like it'll just end up being more work in the end.

We've been waiting far too long to get any version of this merged.  I
dont think waiting for vapourware is a good idea.  So either please post
your new code ASAP given that you apparently have it, or you'll just
have to do more work later.  Given how simple the pmem driver is I
really can't see any major merge problems anyway.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY
  2015-03-05 20:41   ` Dan Williams
@ 2015-03-09 10:54     ` Boaz Harrosh
  0 siblings, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-09 10:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ingo Molnar, X86 ML, linux-kernel, Roger C. Pao, Thomas Gleixner,
	linux-nvdimm, H. Peter Anvin, Matthew Wilcox, Andy Lutomirski,
	Christoph Hellwig, Ross Zwisler

On 03/05/2015 10:41 PM, Dan Williams wrote:
> On Thu, Mar 5, 2015 at 2:20 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>
>> There is something not very nice (Gentlemen nice) In current
>> e820.c code.
>>
>> At Multiple places for example @memblock_x86_fill() it will
>> add the different memory resources *except the E820_RESERVED type*
>>
>> Then at e820_reserve_resources() it will mark all !E820_RESERVED
>> as busy.
>>
>> This is all fine when we have only the known types one of:
>>         E820_RESERVED_KERN:
>>         E820_RAM:
>>         E820_ACPI:
>>         E820_NVS:
>>         E820_UNUSABLE:
>>         E820_RESERVED:
>>
>> But if the system encounters a brand new memory type it will
>> not add it to any memory list, But will proceed to mark it
>> BUSY. So now any other Driver in the system that does know
>> how to deal with this new type, is not able to call
>> request_mem_region_exclusive() on this new type because it is
>> hard coded BUSY even though nothing really uses it.
>>
>> So make any unknown type behave like E820_RESERVED memory,
>> it will show up as available to first caller of
>> request_mem_region_exclusive().
>>
>> I Also change the string representation of an unknown type
>> from "reserved" (So to not confuse with memmap "reserved"
>> region). And call it "reserved-unknown"
>> I wish I could return "reserved-type-X" But this is not possible
>> because one must return a constant, code-segment, string.
>>
>> (NOTE: These unknown-types where called "reserved" in
>>  /proc/iomem and in dmesg but behaved differently. What this
>>  patch does is name them differently but let them behave
>>  the same)
>>
>> By Popular demand An Extra WARNING message is printed if
>> an "UNKNOWN" is found. It will look like this:
>>   e820: WARNING [mem 0x100000000-0x1ffffffff] is unknown type 12
>>
>> An example of such "UNKNOWN" type is the not Standard type-12
>> DDR3-NvDIMM which is used by multiple vendors for a while
>> now. (Estimated 100ds of thousands sold world wide)
>>
>> CC: Thomas Gleixner <tglx@linutronix.de>
>> CC: Ingo Molnar <mingo@redhat.com>
>> CC: "H. Peter Anvin" <hpa@zytor.com>
>> CC: x86@kernel.org
>> CC: Dan Williams <dan.j.williams@intel.com>
>> CC: Matthew Wilcox <willy@linux.intel.com>
>> CC: Christoph Hellwig <hch@infradead.org>
>> CC: Andy Lutomirski <luto@amacapital.net>
>> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
>> ---
>>  arch/x86/kernel/e820.c | 31 +++++++++++++++++++++++++++++--
>>  1 file changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>> index 46201de..c3a11cd 100644
>> --- a/arch/x86/kernel/e820.c
>> +++ b/arch/x86/kernel/e820.c
>> @@ -104,6 +104,21 @@ int __init e820_all_mapped(u64 start, u64 end, unsigned type)
>>         return 0;
>>  }
>>
>> +static bool _is_unknown_type(int e820_type)
> 
> Any reason for the leading "_"?
> 

I have a style guide that says that any static function starts with a _ and
need not a global prefix like e820_ so to not conflict when compiled in-kernel.

But This is not the style in this file. So sorry, do I must send a new version
patch?

>> +{
>> +       switch (e820_type) {
>> +       case E820_RESERVED_KERN:
>> +       case E820_RAM:
>> +       case E820_ACPI:
>> +       case E820_NVS:
>> +       case E820_UNUSABLE:
>> +       case E820_RESERVED:
>> +               return false;
>> +       default:
>> +               return true;
>> +       }
>> +}
>> +
>>  /*
>>   * Add a memory region to the kernel e820 map.
>>   */
>> @@ -119,6 +134,11 @@ static void __init __e820_add_region(struct e820map *e820x, u64 start, u64 size,
>>                 return;
>>         }
>>
>> +       if (unlikely(_is_unknown_type(type)))
> 
> Unnecessary unlikely()?
> 

Its more for a programmer's communication then CPU optimization. Though also for
the CPU it is, "don't waste space at branch predictor", in anyway it can not hurt.

But I like it here for the reading code flow that communicates. "Even if this happens
%100 of the time the code flow should be not to take this error branch"

>> +               pr_warn("e820: WARNING [mem %#010llx-%#010llx] is unknown type %d\n",
>> +                      (unsigned long long) start,
>> +                      (unsigned long long) (start + size - 1), type);
> 
> I still think this warning can go and 

I did not have it at first, but Ingo felt it has merit, I trust his experience and
instincts. I kind of like it now in the lab logs, for machines that actually have
NvDIMMs in them.

> we can just fold patch 2 into this one, but other than that this looks ok to me.
> 

No! we cannot.

1. For one it is two different subsystems and maintainers. You want them separate
   So they might go in through different trees, also conflict in different way if
   code advances.

2. These are two different topics. This patch is about fixing a bug, The resource being
   busy was not intentional.
   The second patch is about a global WARNING for when some mem resource is used, which
   is independent of this here problem and is a new fixture.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)
  2015-03-05 20:56   ` Dan Williams
  2015-03-05 23:09     ` Andy Lutomirski
@ 2015-03-09 11:19     ` Boaz Harrosh
  2015-03-09 14:44       ` Dan Williams
  1 sibling, 1 reply; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-09 11:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ingo Molnar, X86 ML, linux-kernel, Roger C. Pao, Thomas Gleixner,
	linux-nvdimm, H. Peter Anvin, Matthew Wilcox, Andy Lutomirski,
	Christoph Hellwig, Ross Zwisler

On 03/05/2015 10:56 PM, Dan Williams wrote:
> On Thu, Mar 5, 2015 at 2:24 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>
<>
>> Now the ACPI comity, as far as I know, did not yet define a
>> standard type for NvDIMM. Also, as far as I know any NvDIMM
>> standard will only be defined for DDR4. So DDR3 NvDIMM is
>> probably stuck with this none STD type.
> 
> There's no relation between E820 types and DDR technology revisions.
> 

Yes and no, I mean the DDR4 has extra legs and signals defined
for NvDIMM. So DDR3 will always mean different style of NvDIMM.

You tell me. Say the standard finally comes out. Will I have a
new bios from Intel for my DDR3 system here in the lab that will
report the new STD type ?

What I meant is that DDR3 is too old for the proposed STD and probably
only DDR4 NvDIMMs will be supported in systems. The way the STD defined
it.

<>
>> In this patch I name type-12 "unknown-12". This is because of
>> ACPI politics that refuse to reserve type-12 as DDR3-NvDIMM
> 
> It's not "politics".  Setting standards takes time and the platforms
> in question simply jumped the gun to enable a proof-of-concept.
> 

So ye, but once you have 100,000 devices out there, then the dichotomy
between standards-takes-time vs proof-of-concept, becomes politics.

This is the definition of politics, when life moves faster than some
"body", the "body" stands on its back feet and shoots fire from
his head.

>> and members keep saying:
>>         "What if ACPI assigns type-12 for something else in future"
>>
>> [And I say: Then just don't. Please?]
> 
> Once a standard number is assigned, platform firmwares can update
> type-12 to that number.  We might consider a compile time override for
> these niche/pre-standard systems that can't/won't update, but it's not
> clear to me that we even need to go that far.
> 

OK, so I do not understand what you want. Yes or No to this patch?

This patch with unknown-12 is for NOW. For systems already running.

So we can differentiate between  reserved-unknown which might mean
type-13 and this here bastard type-12 which we know is NvDIMM but
for future sake we do not call by name?

Or maybe we should call it NVDIMM-12 ?

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)
  2015-03-05 23:09     ` Andy Lutomirski
@ 2015-03-09 12:10       ` Boaz Harrosh
  2015-03-10  5:11         ` joeyli
  0 siblings, 1 reply; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-09 12:10 UTC (permalink / raw)
  To: Andy Lutomirski, Dan Williams
  Cc: Ingo Molnar, X86 ML, linux-kernel, Roger C. Pao, Thomas Gleixner,
	linux-nvdimm, H. Peter Anvin, Matthew Wilcox, Christoph Hellwig,
	Ross Zwisler

On 03/06/2015 01:09 AM, Andy Lutomirski wrote:
<>
> 
> I will be shocked if a standard of this form ever appears.  Modern
> systems *don't have e820*.  The BIOSes that are using this type 12
> hack are awful throwbacks.

So far the systems we have, with DDR4 NvDIMM(s) (Actual chips arriving soon)
still have BIOS (On by default). The BIOS was yelled "Wolfe" for so long NOW ;-)
Again these are working system in the field, who will switch them all to UEFI?

How will the UEFI present them to the system? can you point me to the relevant
code? (Or did you mean BIOS but with a different communication path than e820?)

> 
> --Andy
> 

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
  2015-03-05 23:03     ` Andy Lutomirski
@ 2015-03-09 12:20       ` Boaz Harrosh
  2015-03-18 18:06         ` Andy Lutomirski
  0 siblings, 1 reply; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-09 12:20 UTC (permalink / raw)
  To: Andy Lutomirski, Boaz Harrosh
  Cc: Matthew Wilcox, Ross Zwisler, Thomas Gleixner, X86 ML,
	Dan Williams, Roger C. Pao, Ingo Molnar, linux-nvdimm,
	linux-kernel, H. Peter Anvin, Christoph Hellwig

On 03/06/2015 01:03 AM, Andy Lutomirski wrote:
<>
> 
> I think it would be nice to have control over the caching mode.
> Depending on the application, WT or UC could make more sense.
> 

Patches are welcome. say
	map=sss@aaa:WT,sss@aaa:CA, ...

But for us, with direct_access(), all benchmarks show a slight advantage
for the cached mode.

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/8] pmem: Submission of the Persistent memory block device
  2015-03-06 18:37   ` [RFC 0/8] pmem: Submission of the Persistent memory block device Ross Zwisler
  2015-03-07  1:39     ` Christoph Hellwig
@ 2015-03-09 12:41     ` Boaz Harrosh
  1 sibling, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-09 12:41 UTC (permalink / raw)
  To: Ross Zwisler, Boaz Harrosh
  Cc: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig

On 03/06/2015 08:37 PM, Ross Zwisler wrote:
<>
>> I maintain these patches on latest Kernels here:
>> 	git://git.open-osd.org/pmem.git branch pmem
>>
>> Thanks for reviewing
>> Boaz
> 
> Hey Boaz,
> 
> Regarding the PMEM series, my group has been working on an updated
> version of this driver for the past 6 months or so since I initially
> posted the beginnings of this series:
> 
> https://lkml.org/lkml/2014/8/27/674
> 
> That new version should be ready for public viewing sometime in April.
> 
> It's my preference that we wait to try and upstream any form of PMEM
> until we've released our updated version of the driver, and you've had a
> chance to review and add in any changes you need.  I'm cool with
> gathering additional feedback until then, of course.
> 

Dear Ross

We have a very grave problem with you guys. You do not develop an open
source driver in stealth mode, and not release code until some undefined
future. This is called forking. And forking means taking the extra hit
when main stream is advancing.

I've been changing this driver and supporting code, and even if not posting
every change on ML, so not to make noise, I pushed any new version to the
public tree every time. So anyone that monitors the tree can provision for
my changes and see where I'm going with this.

I've been monitoring your tree, and there was not a single change. If you
are making progress you should make them available, as soon as they advance
so we can see and adjust. (And/or maybe not make the same effort twice?)

> Trying to upstream this older version and then merging it with the newer
> stuff in-kernel seems like it'll just end up being more work in the end.
> 

I don't think so. Did you see the latest version of this driver. It is
so small. Any changes you will make, will probably be additions and
enhancements. I do not see how it will make such an extra work.

Show me the code, If I see that you are right I will make changes
accordingly, but for now I cannot see why it can make any kind of
problems.

> Thanks,
> - Ross

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)
  2015-03-09 11:19     ` Boaz Harrosh
@ 2015-03-09 14:44       ` Dan Williams
  2015-03-09 15:14         ` Andy Lutomirski
  0 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2015-03-09 14:44 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ingo Molnar, X86 ML, linux-kernel, Roger C. Pao, Thomas Gleixner,
	linux-nvdimm, H. Peter Anvin, Matthew Wilcox, Andy Lutomirski,
	Christoph Hellwig, Ross Zwisler

On Mon, Mar 9, 2015 at 7:19 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 03/05/2015 10:56 PM, Dan Williams wrote:
[..]
>> It's not "politics".  Setting standards takes time and the platforms
>> in question simply jumped the gun to enable a proof-of-concept.
>>
>
> So ye, but once you have 100,000 devices out there, then the dichotomy
> between standards-takes-time vs proof-of-concept, becomes politics.

Ok.

...although, I have a question about this "100,000 devices" you quote.
What's not clear to me is how many platforms are shipping with
type-12.  Certainly there are very many NVDIMMs on the market, but
which off-the-shelf systems can one obtain that include type-12
support?  Not that it would change the disposition of this patch, if
platforms are in the field they're "in the field!", just clarifying
that "100,000 devices" is NVDIMMs not type-12 platforms, right?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)
  2015-03-09 14:44       ` Dan Williams
@ 2015-03-09 15:14         ` Andy Lutomirski
  2015-03-09 15:17           ` Dan Williams
  0 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2015-03-09 15:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Boaz Harrosh, Ingo Molnar, X86 ML, linux-kernel, Roger C. Pao,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Christoph Hellwig, Ross Zwisler

On Mon, Mar 9, 2015 at 7:44 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Mon, Mar 9, 2015 at 7:19 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>> On 03/05/2015 10:56 PM, Dan Williams wrote:
> [..]
>>> It's not "politics".  Setting standards takes time and the platforms
>>> in question simply jumped the gun to enable a proof-of-concept.
>>>
>>
>> So ye, but once you have 100,000 devices out there, then the dichotomy
>> between standards-takes-time vs proof-of-concept, becomes politics.
>
> Ok.
>
> ...although, I have a question about this "100,000 devices" you quote.
> What's not clear to me is how many platforms are shipping with
> type-12.  Certainly there are very many NVDIMMs on the market, but
> which off-the-shelf systems can one obtain that include type-12
> support?  Not that it would change the disposition of this patch, if
> platforms are in the field they're "in the field!", just clarifying
> that "100,000 devices" is NVDIMMs not type-12 platforms, right?

This is a type-12 platform:

http://www.supermicro.com/products/motherboard/Xeon/C600/X9DRH-iF-NV.cfm

You can order one online, for example:

http://www.acmemicro.com/Product/11926/Supermicro+X9DRH-iF-NV-O+Server+Board+DP+Xeon+E5-2600+LGA2011+DDR3+SATA3+RAID+IPMI+GbE+PCIe+ATX+MBD-X9DRH-iF-NV

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)
  2015-03-09 15:14         ` Andy Lutomirski
@ 2015-03-09 15:17           ` Dan Williams
  2015-03-10  8:47             ` Boaz Harrosh
  0 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2015-03-09 15:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Boaz Harrosh, Ingo Molnar, X86 ML, linux-kernel, Roger C. Pao,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Christoph Hellwig, Ross Zwisler

On Mon, Mar 9, 2015 at 11:14 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Mar 9, 2015 at 7:44 AM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Mon, Mar 9, 2015 at 7:19 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>> On 03/05/2015 10:56 PM, Dan Williams wrote:
>> [..]
>>>> It's not "politics".  Setting standards takes time and the platforms
>>>> in question simply jumped the gun to enable a proof-of-concept.
>>>>
>>>
>>> So ye, but once you have 100,000 devices out there, then the dichotomy
>>> between standards-takes-time vs proof-of-concept, becomes politics.
>>
>> Ok.
>>
>> ...although, I have a question about this "100,000 devices" you quote.
>> What's not clear to me is how many platforms are shipping with
>> type-12.  Certainly there are very many NVDIMMs on the market, but
>> which off-the-shelf systems can one obtain that include type-12
>> support?  Not that it would change the disposition of this patch, if
>> platforms are in the field they're "in the field!", just clarifying
>> that "100,000 devices" is NVDIMMs not type-12 platforms, right?
>
> This is a type-12 platform:
>
> http://www.supermicro.com/products/motherboard/Xeon/C600/X9DRH-iF-NV.cfm
>
> You can order one online, for example:
>
> http://www.acmemicro.com/Product/11926/Supermicro+X9DRH-iF-NV-O+Server+Board+DP+Xeon+E5-2600+LGA2011+DDR3+SATA3+RAID+IPMI+GbE+PCIe+ATX+MBD-X9DRH-iF-NV
>

Ok, the gig is up :)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)
  2015-03-09 12:10       ` Boaz Harrosh
@ 2015-03-10  5:11         ` joeyli
  2015-03-10  8:56           ` Boaz Harrosh
  2015-03-10 13:19           ` Andy Lutomirski
  0 siblings, 2 replies; 43+ messages in thread
From: joeyli @ 2015-03-10  5:11 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Andy Lutomirski, Dan Williams, Ingo Molnar, X86 ML, linux-kernel,
	Roger C. Pao, Thomas Gleixner, linux-nvdimm, H. Peter Anvin,
	Matthew Wilcox, Christoph Hellwig, Ross Zwisler

Hi, 

On Mon, Mar 09, 2015 at 02:10:37PM +0200, Boaz Harrosh wrote:
> On 03/06/2015 01:09 AM, Andy Lutomirski wrote:
> <>
> > 
> > I will be shocked if a standard of this form ever appears.  Modern
> > systems *don't have e820*.  The BIOSes that are using this type 12
> > hack are awful throwbacks.
> 
> So far the systems we have, with DDR4 NvDIMM(s) (Actual chips arriving soon)
> still have BIOS (On by default). The BIOS was yelled "Wolfe" for so long NOW ;-)
> Again these are working system in the field, who will switch them all to UEFI?
> 
> How will the UEFI present them to the system? can you point me to the relevant
> code? (Or did you mean BIOS but with a different communication path than e820?)
>

Per my understand...

With EFI Boot Stub, there have setup_e820() codes in arch/x86/boot/compressed/eboot.c
that used to transfer EFI memmap to e820 entries. Currently doesn't have any
EFI_MEMORY_TYPE reflects to NvDIMM that will map to e820_type. 

I wonder what kind of EFI_MEMORY_TYPE reported by UEFI BIOS on those "shipped
NvDIMMs motherboards", like supermicro X9DRH-iF-NV. Then we may need add code
to setup_e820() for mapping the efi memory type to e820 type 12 region.

> > 
> > --Andy
> > 
> 
> Thanks
> Boaz
> 

Thanks a lot!
Joey Lee

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)
  2015-03-09 15:17           ` Dan Williams
@ 2015-03-10  8:47             ` Boaz Harrosh
  0 siblings, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-10  8:47 UTC (permalink / raw)
  To: Dan Williams, Andy Lutomirski
  Cc: Ingo Molnar, X86 ML, linux-kernel, Roger C. Pao, Thomas Gleixner,
	linux-nvdimm, H. Peter Anvin, Matthew Wilcox, Christoph Hellwig,
	Ross Zwisler

On 03/09/2015 05:17 PM, Dan Williams wrote:
> On Mon, Mar 9, 2015 at 11:14 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Mar 9, 2015 at 7:44 AM, Dan Williams <dan.j.williams@intel.com> wrote:
>>> On Mon, Mar 9, 2015 at 7:19 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>>> On 03/05/2015 10:56 PM, Dan Williams wrote:
>>> [..]
>>>>> It's not "politics".  Setting standards takes time and the platforms
>>>>> in question simply jumped the gun to enable a proof-of-concept.
>>>>>
>>>>
>>>> So ye, but once you have 100,000 devices out there, then the dichotomy
>>>> between standards-takes-time vs proof-of-concept, becomes politics.
>>>
>>> Ok.
>>>
>>> ...although, I have a question about this "100,000 devices" you quote.
>>> What's not clear to me is how many platforms are shipping with
>>> type-12.  Certainly there are very many NVDIMMs on the market, but
>>> which off-the-shelf systems can one obtain that include type-12
>>> support?  Not that it would change the disposition of this patch, if
>>> platforms are in the field they're "in the field!", just clarifying
>>> that "100,000 devices" is NVDIMMs not type-12 platforms, right?
>>
>> This is a type-12 platform:
>>
>> http://www.supermicro.com/products/motherboard/Xeon/C600/X9DRH-iF-NV.cfm
>>
>> You can order one online, for example:
>>
>> http://www.acmemicro.com/Product/11926/Supermicro+X9DRH-iF-NV-O+Server+Board+DP+Xeon+E5-2600+LGA2011+DDR3+SATA3+RAID+IPMI+GbE+PCIe+ATX+MBD-X9DRH-iF-NV
>>
> 
> Ok, the gig is up :)
> 

Hi Dan.

Sigh, yes. These systems are out there. lots of them, and not only at developers.

So far all the system I had a hands on and all the people I've asked all reported
type-12, if the DIMM was reported at all (or even booted). With various success rate
of actual NVIDIMM functionality.

I call to all developers that are working on NvDIMM. What are the methods you see
that these chips are presented to the system? Are you using BIOS or UEFI?

(From what I understood, there is actually only one family of motherboards/firmware
 That actually boot NVDIMMs, and it all originated from Intel. I might be off by a
 mile, I have not signed any NDA and no one told me so, just my personal follow the
 bread crumbs)

Could we please reconsider and name those in /proc/iomem as NVDIMM-12 instead of unknow-12
just as the nobleness obliges? We are tainting the Kernel and complaining at driver use of
them, that will not change, just the name?

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)
  2015-03-10  5:11         ` joeyli
@ 2015-03-10  8:56           ` Boaz Harrosh
  2015-03-10 13:19           ` Andy Lutomirski
  1 sibling, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-10  8:56 UTC (permalink / raw)
  To: joeyli
  Cc: Andy Lutomirski, Dan Williams, Ingo Molnar, X86 ML, linux-kernel,
	Roger C. Pao, Thomas Gleixner, linux-nvdimm, H. Peter Anvin,
	Matthew Wilcox, Christoph Hellwig, Ross Zwisler

On 03/10/2015 07:11 AM, joeyli wrote:
<>
> 
> Per my understand...
> 
> With EFI Boot Stub, there have setup_e820() codes in arch/x86/boot/compressed/eboot.c
> that used to transfer EFI memmap to e820 entries. Currently doesn't have any
> EFI_MEMORY_TYPE reflects to NvDIMM that will map to e820_type. 
> 
> I wonder what kind of EFI_MEMORY_TYPE reported by UEFI BIOS on those "shipped
> NvDIMMs motherboards", like supermicro X9DRH-iF-NV. Then we may need add code
> to setup_e820() for mapping the efi memory type to e820 type 12 region.
> 

Thanks Joey. This is tremendous help. (Just saved me days of research)

Our DDR4 NvDIMM chips arrive this week I will try and debug all this, also
booting EFI.

If you have any chance to experiments in your lab, and compare notes it
would be great.

<>
> 
> Thanks a lot!
> Joey Lee
> 

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM)
  2015-03-10  5:11         ` joeyli
  2015-03-10  8:56           ` Boaz Harrosh
@ 2015-03-10 13:19           ` Andy Lutomirski
  1 sibling, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2015-03-10 13:19 UTC (permalink / raw)
  To: joeyli
  Cc: Matthew Wilcox, Boaz Harrosh, Ross Zwisler, Thomas Gleixner,
	X86 ML, Dan Williams, Roger C. Pao, Ingo Molnar, linux-nvdimm,
	linux-kernel, H. Peter Anvin, Christoph Hellwig

On Mar 10, 2015 1:12 AM, "joeyli" <jlee@suse.com> wrote:
>
> Hi,
>
> On Mon, Mar 09, 2015 at 02:10:37PM +0200, Boaz Harrosh wrote:
> > On 03/06/2015 01:09 AM, Andy Lutomirski wrote:
> > <>
> > >
> > > I will be shocked if a standard of this form ever appears.  Modern
> > > systems *don't have e820*.  The BIOSes that are using this type 12
> > > hack are awful throwbacks.
> >
> > So far the systems we have, with DDR4 NvDIMM(s) (Actual chips arriving soon)
> > still have BIOS (On by default). The BIOS was yelled "Wolfe" for so long NOW ;-)
> > Again these are working system in the field, who will switch them all to UEFI?
> >
> > How will the UEFI present them to the system? can you point me to the relevant
> > code? (Or did you mean BIOS but with a different communication path than e820?)
> >
>
> Per my understand...
>
> With EFI Boot Stub, there have setup_e820() codes in arch/x86/boot/compressed/eboot.c
> that used to transfer EFI memmap to e820 entries. Currently doesn't have any
> EFI_MEMORY_TYPE reflects to NvDIMM that will map to e820_type.
>
> I wonder what kind of EFI_MEMORY_TYPE reported by UEFI BIOS on those "shipped
> NvDIMMs motherboards", like supermicro X9DRH-iF-NV. Then we may need add code
> to setup_e820() for mapping the efi memory type to e820 type 12 region.

Nothing useful.  You can't detect the NvDIMM in EFI mode on my board
as far as I know.

Rumor has it that we'll learn more about the EFI case very soon.

--Andy

>
> > >
> > > --Andy
> > >
> >
> > Thanks
> > Boaz
> >
>
> Thanks a lot!
> Joey Lee

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
  2015-03-05 11:55   ` [PATCH 1/8] pmem: Initial version of persistent memory driver Boaz Harrosh
  2015-03-05 20:35     ` Paul Bolle
  2015-03-05 23:03     ` Andy Lutomirski
@ 2015-03-18 17:43     ` Ross Zwisler
  2015-03-19  9:24       ` Boaz Harrosh
  2 siblings, 1 reply; 43+ messages in thread
From: Ross Zwisler @ 2015-03-18 17:43 UTC (permalink / raw)
  To: Boaz Harrosh, Jens Axboe
  Cc: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig

On Thu, 2015-03-05 at 13:55 +0200, Boaz Harrosh wrote:
> From: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> PMEM is a new driver That supports any physical contiguous iomem range
> as a single block device. The driver has support for as many as needed
> iomem ranges each as its own device.
> 
> The driver is not only good for NvDIMMs, It is good for any flat memory
> mapped device. We've used it with NvDIMMs, Kernel reserved DRAM
> (memmap= on command line), PCIE Battery backed memory cards, VM shared
> memory, and so on.
> 
> The API to pmem module a single string parameter named "map"
> of the form:
> 		 map=mapS[,mapS...]
> 
> 		 where mapS=nn[KMG]$ss[KMG],
> 		 or    mapS=nn[KMG]@ss[KMG],
> 
> 		 nn=size, ss=offset
> 
> Just like the Kernel command line map && memmap parameters,
> so anything you did at grub just copy/paste to here.
> 
> The "@" form is exactly the same as the "$" form only that
> at bash prompt we need to escape the "$" with \$ so also
> support the '@' char for convenience.
> 
> For each specified mapS there will be a device created.
> 
> [This is the accumulated version of the driver developed by
>  multiple programmers. To see the real history of these
>  patches see:
> 	git://git.open-osd.org/pmem.git
> 	https://github.com/01org/prd
>  This patch is based on (git://git.open-osd.org/pmem.git):
> 	[5ccf703] SQUASHME: Don't clobber the map module param
> 
> <list-of-changes>
> [boaz]
> SQUASHME: pmem: Remove unused #include headers
> SQUASHME: pmem: Request from fdisk 4k alignment
> SQUASHME: pmem: Let each device manage private memory region
> SQUASHME: pmem: Support of multiple memory regions
> SQUASHME: pmem: Micro optimization the hotpath 001
> SQUASHME: pmem: no need to copy a page at a time
> SQUASHME: pmem that 4k sector thing
> SQUASHME: pmem: Cleanliness is neat
> SQUASHME: Don't clobber the map module param
> SQUASHME: pmem: Few changes to Initial version of pmem
> SQUASHME: Changes to copyright text (trivial)
> </list-of-changes>
> 
> TODO: Add Documentation/blockdev/pmem.txt
> 
> Need-signed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>

I wrote the initial version of the PMEM driver (then called PRD for Persistent
RAM Driver) in late 2013/early 2014, and posted it on GitHub.  Here's a link
to my first version:

https://github.com/01org/prd/tree/prd_3.13

Matthew Wilcox pointed Boaz to it in June of 2014, and he cloned my tree and
went off and made a bunch of changes.  A few of those changes he sent back to
me, like the one I included in the patch series I recently sent for upstream
inclusion:

https://lkml.org/lkml/2015/3/16/1102

Many of the changes he did not submit back to me for review or inclusion in my
tree.

With the first patch in this series Boaz is squashing all of our changes
together, adding his copyright and trying to install himself as maintainer.  I
believe this to be unacceptable.  

Boaz, if you have contributions that you would like to make to PMEM, please
submit them to our mailing list (linux-nvdimm@lists.01.org) and we will be
happy to review them.  But please don't try and steal control of my driver.

- Ross



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
  2015-03-09 12:20       ` Boaz Harrosh
@ 2015-03-18 18:06         ` Andy Lutomirski
  2015-03-26  4:00           ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2015-03-18 18:06 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Matthew Wilcox, Ross Zwisler, X86 ML, Thomas Gleixner,
	Dan Williams, Ingo Molnar, Roger C. Pao, linux-nvdimm,
	linux-kernel, H. Peter Anvin, Christoph Hellwig

On Mar 9, 2015 8:20 AM, "Boaz Harrosh" <boaz@plexistor.com> wrote:
>
> On 03/06/2015 01:03 AM, Andy Lutomirski wrote:
> <>
> >
> > I think it would be nice to have control over the caching mode.
> > Depending on the application, WT or UC could make more sense.
> >
>
> Patches are welcome. say
>         map=sss@aaa:WT,sss@aaa:CA, ...
>
> But for us, with direct_access(), all benchmarks show a slight advantage
> for the cached mode.

I'm sure cached is faster.  The question is: who flushes the cache?

--Andy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
  2015-03-18 17:43     ` Ross Zwisler
@ 2015-03-19  9:24       ` Boaz Harrosh
  2015-03-20  0:11         ` Dan Williams
  0 siblings, 1 reply; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-19  9:24 UTC (permalink / raw)
  To: Ross Zwisler, Jens Axboe
  Cc: Ingo Molnar, x86, linux-kernel, Roger C. Pao, Dan Williams,
	Thomas Gleixner, linux-nvdimm, H. Peter Anvin, Matthew Wilcox,
	Andy Lutomirski, Christoph Hellwig

On 03/18/2015 07:43 PM, Ross Zwisler wrote:
> On Thu, 2015-03-05 at 13:55 +0200, Boaz Harrosh wrote:
>> From: Ross Zwisler <ross.zwisler@linux.intel.com>
>>
>> PMEM is a new driver That supports any physical contiguous iomem range
>> as a single block device. The driver has support for as many as needed
>> iomem ranges each as its own device.
>>
>> The driver is not only good for NvDIMMs, It is good for any flat memory
>> mapped device. We've used it with NvDIMMs, Kernel reserved DRAM
>> (memmap= on command line), PCIE Battery backed memory cards, VM shared
>> memory, and so on.
>>
>> The API to pmem module a single string parameter named "map"
>> of the form:
>> 		 map=mapS[,mapS...]
>>
>> 		 where mapS=nn[KMG]$ss[KMG],
>> 		 or    mapS=nn[KMG]@ss[KMG],
>>
>> 		 nn=size, ss=offset
>>
>> Just like the Kernel command line map && memmap parameters,
>> so anything you did at grub just copy/paste to here.
>>
>> The "@" form is exactly the same as the "$" form only that
>> at bash prompt we need to escape the "$" with \$ so also
>> support the '@' char for convenience.
>>
>> For each specified mapS there will be a device created.
>>
>> [This is the accumulated version of the driver developed by
>>  multiple programmers. To see the real history of these
>>  patches see:
>> 	git://git.open-osd.org/pmem.git
>> 	https://github.com/01org/prd
>>  This patch is based on (git://git.open-osd.org/pmem.git):
>> 	[5ccf703] SQUASHME: Don't clobber the map module param
>>
>> <list-of-changes>
>> [boaz]
>> SQUASHME: pmem: Remove unused #include headers
>> SQUASHME: pmem: Request from fdisk 4k alignment
>> SQUASHME: pmem: Let each device manage private memory region
>> SQUASHME: pmem: Support of multiple memory regions
>> SQUASHME: pmem: Micro optimization the hotpath 001
>> SQUASHME: pmem: no need to copy a page at a time
>> SQUASHME: pmem that 4k sector thing
>> SQUASHME: pmem: Cleanliness is neat
>> SQUASHME: Don't clobber the map module param
>> SQUASHME: pmem: Few changes to Initial version of pmem
>> SQUASHME: Changes to copyright text (trivial)
>> </list-of-changes>
>>
>> TODO: Add Documentation/blockdev/pmem.txt
>>
>> Need-signed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> 
> I wrote the initial version of the PMEM driver (then called PRD for Persistent
> RAM Driver) in late 2013/early 2014, and posted it on GitHub.  Here's a link
> to my first version:
> 
> https://github.com/01org/prd/tree/prd_3.13
> 
> Matthew Wilcox pointed Boaz to it in June of 2014, and he cloned my tree and
> went off and made a bunch of changes.  A few of those changes he sent back to
> me, like the one I included in the patch series I recently sent for upstream
> inclusion:
> 
> https://lkml.org/lkml/2015/3/16/1102
> 
> Many of the changes he did not submit back to me for review or inclusion in my
> tree.
> 
> With the first patch in this series Boaz is squashing all of our changes
> together, adding his copyright and trying to install himself as maintainer.  I
> believe this to be unacceptable.  
> 
> Boaz, if you have contributions that you would like to make to PMEM, please
> submit them to our mailing list (linux-nvdimm@lists.01.org) and we will be
> happy to review them.  But please don't try and steal control of my driver.
> 

I apologize. It is not my intention to hijack your project. All but the last
2 changes I have posted again and again, even those changes I have said that
I maintain them in a public tree, and made them available publicly ASAP. I
stopped sending the (last 2) patches because it felt like I was spamming the
list, since none of my patches got any comments or have been accepted to your
tree.

It was my notion that you do not want to bother with farther development, your
tree was stuck on 3.17, while I was rebasing on every major Linux release, adding
my changes as they accumulated over time.

For example That patch that you mentioned that you accepted in the tree, that
same patch was just a staging patch, to the more important change that
throws away the toy Kconfig and 3 module params, and puts in a real world
actually usable API, for long term. You did not take that patch, why?

So I was in the notion that you don't want to maintain this driver, and I was
forced to fork the project and move on. What other choice did I have?

About the added copyright, diffing your original driver without any of my
changes including all the partition bugs, the changed API the IO path cleanup,
it comes out less then 30% similarity, as a courtesy to my employer I think he
is entitled to an added copyright.

But let us not fight. You want to maintain this thing, start by squashing all
my changes + all the other added patches and publish them on an your tree. I
need this driver usable.

> - Ross
> 
> 

Thanks
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
  2015-03-19  9:24       ` Boaz Harrosh
@ 2015-03-20  0:11         ` Dan Williams
  0 siblings, 0 replies; 43+ messages in thread
From: Dan Williams @ 2015-03-20  0:11 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ross Zwisler, Jens Axboe, Ingo Molnar, X86 ML, linux-kernel,
	Roger C. Pao, Thomas Gleixner, linux-nvdimm, H. Peter Anvin,
	Matthew Wilcox, Andy Lutomirski, Christoph Hellwig

On Thu, Mar 19, 2015 at 2:24 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> I apologize. It is not my intention to hijack your project. All but the last
> 2 changes I have posted again and again, even those changes I have said that
> I maintain them in a public tree, and made them available publicly ASAP. I
> stopped sending the (last 2) patches because it felt like I was spamming the
> list, since none of my patches got any comments or have been accepted to your
> tree.

That's not true.  We talked about your "map=" proposal at length back
in September.  You concluded "That the discovery should be elsewhere
in an ARCH/driver LLD and pmem stays generic." [1].  A generic
approach is being specified by the ACPI Working Group and will be
released "Real Soon Now (TM)" (on the order of weeks not months).  My
first choice would be to finish waiting for that specification before
we upstream a pmem driver.  Outside of that, if we need a pmem driver
"now", Ross's version has the nice property of having an easier to
revert resource discovery mechanism.  The kernel command line is
arguably an ABI and the need for "map=" is obviated by a generic
resource discovery mechanism.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2014-September/000043.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 7/8] pmem: Add support for page structs
  2015-03-05 11:59   ` [PATCH 7/8] pmem: Add support for page structs Boaz Harrosh
@ 2015-03-23 20:59     ` Dan Williams
  0 siblings, 0 replies; 43+ messages in thread
From: Dan Williams @ 2015-03-23 20:59 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ingo Molnar, X86 ML, linux-kernel, Roger C. Pao, Thomas Gleixner,
	linux-nvdimm, H. Peter Anvin, Matthew Wilcox, Andy Lutomirski,
	Christoph Hellwig, Ross Zwisler

On Thu, Mar 5, 2015 at 3:59 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>
> One of the current shortcomings of the NVDIMM/PMEM
> support is that this memory does not have a page-struct(s)
> associated with its memory and therefor cannot be passed
> to a block-device or network or DMAed in any way through
> another device in the system.
>
> The use of add_persistent_memory() fixes all this. After this patch
> an FS can do:
>         bdev_direct_access(,&pfn,);

Hmm, can we do this mapping on demand per direct access mapping rather
than unconditionally for each range that pmem is handling?

Going forward I don't think we want to be tied to guaranteeing that
plain bdev_direct_access() always yields pfn_to_page()-capable pfns.

Perhaps a DAX_MAP_PFN flag or something along those lines?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 1/8] pmem: Initial version of persistent memory driver
  2015-03-18 18:06         ` Andy Lutomirski
@ 2015-03-26  4:00           ` Elliott, Robert (Server Storage)
  2015-03-26  7:51             ` Boaz Harrosh
  2015-03-26 21:31             ` Dave Chinner
  0 siblings, 2 replies; 43+ messages in thread
From: Elliott, Robert (Server Storage) @ 2015-03-26  4:00 UTC (permalink / raw)
  To: Andy Lutomirski, Boaz Harrosh
  Cc: Matthew Wilcox, Ross Zwisler, X86 ML, Thomas Gleixner,
	Dan Williams, Ingo Molnar, Roger C. Pao, linux-nvdimm,
	linux-kernel, H. Peter Anvin, Christoph Hellwig, Kani,
	Toshimitsu, Christoph Hellwig

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1631 bytes --]



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Andy Lutomirski
> Sent: Wednesday, March 18, 2015 1:07 PM
> To: Boaz Harrosh
> Cc: Matthew Wilcox; Ross Zwisler; X86 ML; Thomas Gleixner; Dan Williams;
> Ingo Molnar; Roger C. Pao; linux-nvdimm; linux-kernel; H. Peter Anvin;
> Christoph Hellwig
> Subject: Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
> 
> On Mar 9, 2015 8:20 AM, "Boaz Harrosh" <boaz@plexistor.com> wrote:
> >
> > On 03/06/2015 01:03 AM, Andy Lutomirski wrote:
> > <>
> > >
> > > I think it would be nice to have control over the caching mode.
> > > Depending on the application, WT or UC could make more sense.
> > >
> >
> > Patches are welcome. say
> >         map=sss@aaa:WT,sss@aaa:CA, ...
> >
> > But for us, with direct_access(), all benchmarks show a slight advantage
> > for the cached mode.
> 
> I'm sure cached is faster.  The question is: who flushes the cache?
> 
> --Andy

Nobody.

Therefore, pmem as currently proposed (mapping the memory with
ioremap_cache, which uses _PAGE_CACHE_MODE_WB) is unsafe unless the
system is doing something special to ensure L1, L2, and L3 caches are
flushed on power loss.

I think pmem needs to map the memory as UC or WT by default, providing
WB and WC only as an option for users confident that those attributes
are safe to use in their system.

Even using UC or WT presumes that ADR is in place.



ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
  2015-03-26  4:00           ` Elliott, Robert (Server Storage)
@ 2015-03-26  7:51             ` Boaz Harrosh
  2015-03-26 21:31             ` Dave Chinner
  1 sibling, 0 replies; 43+ messages in thread
From: Boaz Harrosh @ 2015-03-26  7:51 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage), Andy Lutomirski
  Cc: Matthew Wilcox, Ross Zwisler, X86 ML, Thomas Gleixner,
	Dan Williams, Ingo Molnar, Roger C. Pao, linux-nvdimm,
	linux-kernel, H. Peter Anvin, Christoph Hellwig, Kani,
	Toshimitsu, Christoph Hellwig

On 03/26/2015 06:00 AM, Elliott, Robert (Server Storage) wrote:
> 
> 
>> -----Original Message-----
>> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
>> owner@vger.kernel.org] On Behalf Of Andy Lutomirski
>> Sent: Wednesday, March 18, 2015 1:07 PM
>> To: Boaz Harrosh
>> Cc: Matthew Wilcox; Ross Zwisler; X86 ML; Thomas Gleixner; Dan Williams;
>> Ingo Molnar; Roger C. Pao; linux-nvdimm; linux-kernel; H. Peter Anvin;
>> Christoph Hellwig
>> Subject: Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
>>
>> On Mar 9, 2015 8:20 AM, "Boaz Harrosh" <boaz@plexistor.com> wrote:
>>>
>>> On 03/06/2015 01:03 AM, Andy Lutomirski wrote:
>>> <>
>>>>
>>>> I think it would be nice to have control over the caching mode.
>>>> Depending on the application, WT or UC could make more sense.
>>>>
>>>
>>> Patches are welcome. say
>>>         map=sss@aaa:WT,sss@aaa:CA, ...
>>>
>>> But for us, with direct_access(), all benchmarks show a slight advantage
>>> for the cached mode.
>>
>> I'm sure cached is faster.  The question is: who flushes the cache?
>>
>> --Andy
> 
> Nobody.
> 
> Therefore, pmem as currently proposed (mapping the memory with
> ioremap_cache, which uses _PAGE_CACHE_MODE_WB) is unsafe unless the
> system is doing something special to ensure L1, L2, and L3 caches are
> flushed on power loss.
> 
> I think pmem needs to map the memory as UC or WT by default, providing
> WB and WC only as an option for users confident that those attributes
> are safe to use in their system.
> 
> Even using UC or WT presumes that ADR is in place.
> 

I will add command line options for these modes per range. (Unless you
care to send a patch before me)

Thanks this is a good idea
Boaz


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
  2015-03-26  4:00           ` Elliott, Robert (Server Storage)
  2015-03-26  7:51             ` Boaz Harrosh
@ 2015-03-26 21:31             ` Dave Chinner
  1 sibling, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2015-03-26 21:31 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Andy Lutomirski, Boaz Harrosh, Matthew Wilcox, Ross Zwisler,
	X86 ML, Thomas Gleixner, Dan Williams, Ingo Molnar, Roger C. Pao,
	linux-nvdimm, linux-kernel, H. Peter Anvin, Christoph Hellwig,
	Kani, Toshimitsu, Christoph Hellwig

On Thu, Mar 26, 2015 at 04:00:57AM +0000, Elliott, Robert (Server Storage) wrote:
> > -----Original Message-----
> > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> > owner@vger.kernel.org] On Behalf Of Andy Lutomirski
> > Sent: Wednesday, March 18, 2015 1:07 PM
> > To: Boaz Harrosh
> > Cc: Matthew Wilcox; Ross Zwisler; X86 ML; Thomas Gleixner; Dan Williams;
> > Ingo Molnar; Roger C. Pao; linux-nvdimm; linux-kernel; H. Peter Anvin;
> > Christoph Hellwig
> > Subject: Re: [PATCH 1/8] pmem: Initial version of persistent memory driver
> > 
> > On Mar 9, 2015 8:20 AM, "Boaz Harrosh" <boaz@plexistor.com> wrote:
> > >
> > > On 03/06/2015 01:03 AM, Andy Lutomirski wrote:
> > > <>
> > > >
> > > > I think it would be nice to have control over the caching mode.
> > > > Depending on the application, WT or UC could make more sense.
> > > >
> > >
> > > Patches are welcome. say
> > >         map=sss@aaa:WT,sss@aaa:CA, ...
> > >
> > > But for us, with direct_access(), all benchmarks show a slight advantage
> > > for the cached mode.
> > 
> > I'm sure cached is faster.  The question is: who flushes the cache?
> > 
> > --Andy
> 
> Nobody.

There is another discussion going on about ensuring we have
mechanisms to flush the cpu caches correctly when DAX is enabled and
data integrity operations are run.  i.e. fsync and sync will provide
cache flush triggers for DAX enabled devices once we get everything
in place.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2015-03-26 21:32 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-05 10:16 [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips Boaz Harrosh
2015-03-05 10:20 ` [PATCH 1/3] e820: Don't let unknown DIMM type come out BUSY Boaz Harrosh
2015-03-05 20:41   ` Dan Williams
2015-03-09 10:54     ` Boaz Harrosh
2015-03-05 10:21 ` [PATCH 2/3] resource: Add new flag IORESOURCE_MEM_WARN Boaz Harrosh
2015-03-05 10:24 ` [PATCH 3/3] e820: Add the unknown-12 Memory type (DDR3-NvDIMM) Boaz Harrosh
2015-03-05 20:56   ` Dan Williams
2015-03-05 23:09     ` Andy Lutomirski
2015-03-09 12:10       ` Boaz Harrosh
2015-03-10  5:11         ` joeyli
2015-03-10  8:56           ` Boaz Harrosh
2015-03-10 13:19           ` Andy Lutomirski
2015-03-09 11:19     ` Boaz Harrosh
2015-03-09 14:44       ` Dan Williams
2015-03-09 15:14         ` Andy Lutomirski
2015-03-09 15:17           ` Dan Williams
2015-03-10  8:47             ` Boaz Harrosh
2015-03-05 10:32 ` [RFC 0/8] pmem: Submission of the Persistent memory block device Boaz Harrosh
2015-03-05 11:55   ` [PATCH 1/8] pmem: Initial version of persistent memory driver Boaz Harrosh
2015-03-05 20:35     ` Paul Bolle
2015-03-05 23:03     ` Andy Lutomirski
2015-03-09 12:20       ` Boaz Harrosh
2015-03-18 18:06         ` Andy Lutomirski
2015-03-26  4:00           ` Elliott, Robert (Server Storage)
2015-03-26  7:51             ` Boaz Harrosh
2015-03-26 21:31             ` Dave Chinner
2015-03-18 17:43     ` Ross Zwisler
2015-03-19  9:24       ` Boaz Harrosh
2015-03-20  0:11         ` Dan Williams
2015-03-05 11:55   ` [PATCH 2/8] pmem: KISS, remove register_blkdev Boaz Harrosh
2015-03-05 11:56   ` [PATCH 3/8] pmem: Add support for rw_page() Boaz Harrosh
2015-03-05 11:57   ` [PATCH 4/8] pmem: Add support for direct_access() Boaz Harrosh
2015-03-05 11:58   ` [PATCH 5/8] mm: Let sparse_{add,remove}_one_section receive a node_id Boaz Harrosh
2015-03-06 18:43     ` Ross Zwisler
2015-03-05 11:59   ` [PATCH 6/8] mm: New add_persistent_memory/remove_persistent_memory Boaz Harrosh
2015-03-05 11:59   ` [PATCH 7/8] pmem: Add support for page structs Boaz Harrosh
2015-03-23 20:59     ` Dan Williams
2015-03-05 12:01   ` [PATCH 8/8] OUT-OF-TREE: pmem: Allow request_mem to fail (BLK_DEV_PMEM_IGNORE_REQUEST_MEM_RET) Boaz Harrosh
2015-03-06 18:37   ` [RFC 0/8] pmem: Submission of the Persistent memory block device Ross Zwisler
2015-03-07  1:39     ` Christoph Hellwig
2015-03-09 12:41     ` Boaz Harrosh
2015-03-05 22:48 ` [PATCH 0/3 v5] e820: Fix handling of NvDIMM chips H. Peter Anvin
2015-03-05 23:06   ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).