* [PATCH 0/5] [v5] Allow persistent memory to be used like normal RAM
From: Dave Hansen @ 2019-02-25 18:57 UTC
  To: linux-kernel
  Cc: thomas.lendacky, mhocko, linux-nvdimm, tiwai, Dave Hansen,
	ying.huang, linux-mm, jglisse, bp, baiyaowei, zwisler, bhelgaas,
	fengguang.wu, akpm

This is a relatively small delta from v4.  The review comments seem
to be settling down, so it seems like we should start thinking about
how this might get merged.  Are there any objections to taking it in
via the nvdimm tree?

Dan Williams, our intrepid nvdimm maintainer, has said he would
appreciate acks on these from relevant folks before merging them.
Reviews/acks on any in the series would be welcome, but the last
two especially are lacking any non-Intel acks:

	mm/resource: let walk_system_ram_range() search child resources
	dax: "Hotplug" persistent memory for use like normal RAM

Note: these are based on commit b20a7bfc2f9 in:

	git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git for-5.1/devdax

Changes since v4:
 * Update powerpc resource return codes too, not just generic
 * Update dev_dax_kmem_remove() comment about resource "leaks"
 * Make HMM-related warning in __request_region() use %pR's
 * Add GPL export for memory_block_size_bytes() to fix build error
 * Add a FAQ in the documentation
 * Make resource.c HMM output a pr_warn() instead of pr_debug()
 * Minor CodingStyle tweaks
 * Allow device removal by making dev_dax_kmem_remove() return 0.
   Note that although the device goes away, the memory stays
   in-use, assigned and reserved.

Changes since v3:
 * Move HMM-related resource warning instead of removing it
 * Use __request_resource() directly instead of devm.
 * Create a separate DAX_PMEM Kconfig option, complete with help text
 * Update patch descriptions and cover letter to give a better
   overview of use-cases and hardware where this might be useful.

Changes since v2:
 * Updates to dev_dax_kmem_probe() in patch 5:
   * Reject probes for devices with bad NUMA nodes.  Keeps slow
     memory from being added to node 0.
   * Use raw request_mem_region()
   * Add comments about permanent reservation
   * Use dev_*() instead of printk()s
 * Add references to nvdimm documentation in descriptions
 * Remove unneeded GPL export
 * Add Kconfig prompt and help text

Changes since v1:
 * Now based on git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git
 * Use binding/unbinding from "dax bus" code
 * Move over to a "dax bus" driver from being an nvdimm driver

--

Persistent memory is cool.  But, currently, you have to rewrite
your applications to use it.  Wouldn't it be cool if you could
just have it show up in your system like normal RAM and get to
it like a slow blob of memory?  Well... have I got the patch
series for you!

== Background / Use Cases ==

Persistent Memory (aka Non-Volatile DIMMs / NVDIMMS) themselves
are described in detail in Documentation/nvdimm/nvdimm.txt.
However, this documentation focuses on actually using them as
storage.  This set is focused on using NVDIMMs as DRAM replacement.

This is intended for Intel-style NVDIMMs (aka Intel Optane DC
persistent memory).  These DIMMs are physically persistent,
more akin to flash than traditional RAM.  They are also expected
to be more cost-effective than RAM, which is why folks want this
set in the first place.

This set is not intended for RAM-based NVDIMMs.  Those are not
cost-effective vs. plain RAM, and using them here would simply
be a waste.  However, if a user of a system does not care about
persistence and wants all the RAM they can get, this could allow
a RAM-based NVDIMM to be used where it would otherwise go unused.

But, why would you bother with this approach?  Intel itself [1]
has announced a hardware feature that does something very similar:
"Memory Mode", which turns DRAM into a cache in front of persistent
memory; the combination is then used as normal "RAM".

Here are a few reasons:
1. With memory mode, usable capacity is only the persistent
   memory that you dedicate.  DRAM capacity is "lost" because it
   is used for cache.  With this, you get PMEM+DRAM capacity for
   memory.
2. DRAM acts as a cache with memory mode, and caches can lead to
   unpredictable latencies.  Since memory mode is all-or-nothing
   (either all your DRAM is used as cache or none is), your entire
   memory space is exposed to these unpredictable latencies.  This
   solution lets you guarantee DRAM latencies if you need them.
3. The new "tier" of memory is exposed to software.  That means
   that you can build tiered applications or infrastructure.  A
   cloud provider could sell cheaper VMs that use more PMEM and
   more expensive ones that use DRAM.  That's impossible with
   memory mode.

Don't take this as criticism of memory mode.  Memory mode is
awesome, and doesn't strictly require *any* software changes (we
have software changes proposed for optimizing it though).  It has
tons of other advantages over *this* approach.  Basically, we
believe that the approach in these patches is complementary to
memory mode and that both can live side-by-side in harmony.

== Patch Set Overview ==

This series adds a new "driver" to which pmem devices can be
attached.  Once attached, the memory "owned" by the device is
hot-added to the kernel and managed like any other memory.  On
systems with an HMAT (a new ACPI table), each socket (roughly)
will have a separate NUMA node for its persistent memory, so
this newly-added memory can be selected by its unique NUMA
node.
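
As an illustration (not part of this series), once such a node is
online, ordinary NUMA tooling can place a workload on it.  The node
number and application name below are made up; check your own
topology first:

	# Show nodes; the pmem-backed node appears like any other.
	numactl --hardware

	# Example: bind an application's allocations to (pmem) node 1.
	numactl --membind=1 ./my_app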

== FAQ ==

Q: Will this break my NVDIMMs?

Intel's NVDIMMs were designed with use-cases like this in mind.
Using this feature on that hardware is not expected to affect
its service life.  If you have NVDIMMs from other sources, it
would be best to consult your hardware vendor to see if this
kind of use is appropriate.

Q: Will this leave secret data in memory after reboots?

Yes.  Data placed in persistent memory will remain there until
it is written again.  There are no specific mitigations for
this in the current patch set.  However, these kinds of issues
already exist in the kernel.  We leave secrets in "free" memory
all the time, and even across kexec-style reboots.  The only
difference here is that data can survive power cycles.

The kernel has built-in protections to ensure that sensitive
data is zeroed before being handed out to applications.  If
your persistent-memory data was available to an application,
it would almost certainly be a bug with or without this patch
set.

In addition, NVDIMMs have built-in encryption, functionally
similar to the disk passwords present on hard drives.  When
available, we suggest configuring NVDIMMs so that they are
locked on any power loss.

Q: What happens if memory errors are encountered?

If encountered at runtime, all the existing DRAM-based memory
error reporting and recovery is used: nothing changes.  The
only thing that's different is that the poison can be
persistent across reboots.  We suggest using the ndctl utility
to recover these locations for now.  Doing this automatically
could be a part of future enabling.  But, for now, please do
it from userspace.

== Testing Overview ==

Here's how I set up a system to test this thing:

1. Boot qemu with lots of memory: "-m 4096", for instance
2. Reserve 512MB of physical memory.  Reserving a spot at 2GB
   physical seems to work: memmap=512M!0x0000000080000000
   This will end up looking like a pmem device at boot.
3. When booted, convert the fsdax device to "device dax":
	ndctl create-namespace -fe namespace0.0 -m dax
4. See patch 5 for instructions on binding the kmem driver
   to a device (a rough sketch is also included after the
   onlining loop below).
5. Now, online the new memory sections.  Perhaps:

grep ^MemTotal /proc/meminfo
for f in `grep -vl online /sys/devices/system/memory/*/state`; do
	echo $f: `cat $f`
	echo online_movable > $f
	grep ^MemTotal /proc/meminfo
done
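
For step 4, the binding sequence is expected to look roughly like
the following.  This is only a sketch: the device name and sysfs
paths are assumptions here, and patch 5 is the authoritative
reference:

	# Assumed device name; check /sys/bus/dax/devices/ for yours.
	echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
	echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id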

1. https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/#gs.RKG7BeIu

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>

* [PATCH 1/5] mm/resource: return real error codes from walk failures
From: Dave Hansen @ 2019-02-25 18:57 UTC
  To: linux-kernel
  Cc: mhocko, tiwai, benh, Dave Hansen, linux-mm, paulus, baiyaowei,
	zwisler, linux-nvdimm, mpe, ying.huang, bp, thomas.lendacky,
	jglisse, bhelgaas, akpm, fengguang.wu, linuxppc-dev


From: Dave Hansen <dave.hansen@linux.intel.com>

walk_system_ram_range() can return an error code either because
*it* failed, or because the 'func' that it calls returned an
error.  The memory hotplug code does the following:

	ret = walk_system_ram_range(..., func);
	if (ret)
		return ret;

and 'ret' makes it out to userspace, eventually.  The problem
is, walk_system_ram_range() failures that result from *it*
failing (as opposed to 'func') return -1.  That leads to a very
odd -EPERM (-1) return code out to userspace.

Make walk_system_ram_range() return -EINVAL for internal
failures to keep userspace less confused.

This return code is compatible with all the callers that I
audited.

This changes both the generic mm/ and powerpc-specific
implementations to have the same return value.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Keith Busch <keith.busch@intel.com>
---

 b/arch/powerpc/mm/mem.c |    2 +-
 b/kernel/resource.c     |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff -puN arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1 arch/powerpc/mm/mem.c
--- a/arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1	2019-02-25 10:56:47.452908034 -0800
+++ b/arch/powerpc/mm/mem.c	2019-02-25 10:56:47.458908034 -0800
@@ -189,7 +189,7 @@ walk_system_ram_range(unsigned long star
 	struct memblock_region *reg;
 	unsigned long end_pfn = start_pfn + nr_pages;
 	unsigned long tstart, tend;
-	int ret = -1;
+	int ret = -EINVAL;
 
 	for_each_memblock(memory, reg) {
 		tstart = max(start_pfn, memblock_region_memory_base_pfn(reg));
diff -puN kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 kernel/resource.c
--- a/kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1	2019-02-25 10:56:47.454908034 -0800
+++ b/kernel/resource.c	2019-02-25 10:56:47.459908034 -0800
@@ -382,7 +382,7 @@ static int __walk_iomem_res_desc(resourc
 				 int (*func)(struct resource *, void *))
 {
 	struct resource res;
-	int ret = -1;
+	int ret = -EINVAL;
 
 	while (start < end &&
 	       !find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) {
@@ -462,7 +462,7 @@ int walk_system_ram_range(unsigned long
 	unsigned long flags;
 	struct resource res;
 	unsigned long pfn, end_pfn;
-	int ret = -1;
+	int ret = -EINVAL;
 
 	start = (u64) start_pfn << PAGE_SHIFT;
 	end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
_

* [PATCH 2/5] mm/resource: move HMM pr_debug() deeper into resource code
From: Dave Hansen @ 2019-02-25 18:57 UTC
  To: linux-kernel
  Cc: thomas.lendacky, mhocko, linux-nvdimm, Dave Hansen, ying.huang,
	linux-mm, jglisse, zwisler, fengguang.wu, akpm


From: Dave Hansen <dave.hansen@linux.intel.com>

HMM consumes physical address space for its own use, even
though nothing is mapped or accessible there.  It uses a
special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
to uniquely identify these areas.

When HMM consumes address space, it makes a best guess about
what to consume.  However, it is possible that a future memory
or device hotplug can collide with the reserved area.  In the
case of these conflicts, there is an error message in
register_memory_resource().

Later patches in this series move register_memory_resource()
from using request_resource_conflict() to __request_region().
Unfortunately, __request_region() does not return the conflict
like the previous function did, which makes it impossible to
check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
resource.

Instead of warning in register_memory_resource(), move the
check into the core resource code itself (__request_region())
where the conflicting resource _is_ available.  This has the
added bonus of producing a warning in case of HMM conflicts
with devices *or* RAM address space, as opposed to the RAM-
only warnings that were there previously.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
---

 b/kernel/resource.c   |    9 +++++++++
 b/mm/memory_hotplug.c |    5 -----
 2 files changed, 9 insertions(+), 5 deletions(-)

diff -puN kernel/resource.c~move-request_region-check kernel/resource.c
--- a/kernel/resource.c~move-request_region-check	2019-02-25 10:56:48.581908031 -0800
+++ b/kernel/resource.c	2019-02-25 10:56:48.588908031 -0800
@@ -1132,6 +1132,15 @@ struct resource * __request_region(struc
 		conflict = __request_resource(parent, res);
 		if (!conflict)
 			break;
+		/*
+		 * mm/hmm.c reserves physical addresses which then
+		 * become unavailable to other users.  Conflicts are
+		 * not expected.  Warn to aid debugging if encountered.
+		 */
+		if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
+			pr_warn("Unaddressable device %s %pR conflicts with %pR",
+				conflict->name, conflict, res);
+		}
 		if (conflict != parent) {
 			if (!(conflict->flags & IORESOURCE_BUSY)) {
 				parent = conflict;
diff -puN mm/memory_hotplug.c~move-request_region-check mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~move-request_region-check	2019-02-25 10:56:48.583908031 -0800
+++ b/mm/memory_hotplug.c	2019-02-25 10:56:48.588908031 -0800
@@ -111,11 +111,6 @@ static struct resource *register_memory_
 	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
 	conflict =  request_resource_conflict(&iomem_resource, res);
 	if (conflict) {
-		if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
-			pr_debug("Device unaddressable memory block "
-				 "memory hotplug at %#010llx !\n",
-				 (unsigned long long)start);
-		}
 		pr_debug("System RAM resource %pR cannot be added\n", res);
 		kfree(res);
 		return ERR_PTR(-EEXIST);
_

* [PATCH 3/5] mm/memory-hotplug: allow memory resources to be children
From: Dave Hansen @ 2019-02-25 18:57 UTC
  To: linux-kernel
  Cc: thomas.lendacky, mhocko, linux-nvdimm, tiwai, Dave Hansen,
	ying.huang, linux-mm, jglisse, bp, baiyaowei, zwisler, bhelgaas,
	fengguang.wu, akpm


From: Dave Hansen <dave.hansen@linux.intel.com>

The mm/resource.c code is used to manage the physical address
space.  The current resource configuration can be viewed in
/proc/iomem.  An example of this is at the bottom of this
description.

The nvdimm subsystem "owns" the physical address resources which
map to persistent memory and has resources inserted for them as
"Persistent Memory".  The best way to repurpose this for volatile
use is to leave the existing resource in place, but add a "System
RAM" resource underneath it. This clearly communicates the
ownership relationship of this memory.

The request_resource_conflict() API only deals with the
top-level resources.  Replace it with __request_region() which
will search for !IORESOURCE_BUSY areas lower in the resource
tree than the top level.

We *could* also simply truncate the existing top-level
"Persistent Memory" resource and take over the released address
space.  But, this means that if we ever decide to hot-unplug the
"RAM" and give it back, we need to recreate the original setup,
which may mean going back to the BIOS tables.

This should have no real effect on the existing collision
detection because the areas that truly conflict should be marked
IORESOURCE_BUSY.

00000000-00000fff : Reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c97ff : Video ROM
000c9800-000ca5ff : Adapter ROM
000f0000-000fffff : Reserved
  000f0000-000fffff : System ROM
00100000-9fffffff : System RAM
  01000000-01e071d0 : Kernel code
  01e071d1-027dfdff : Kernel data
  02dc6000-0305dfff : Kernel bss
a0000000-afffffff : Persistent Memory (legacy)
  a0000000-a7ffffff : System RAM
b0000000-bffdffff : System RAM
bffe0000-bfffffff : Reserved
c0000000-febfffff : PCI Bus 0000:00

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
---

 b/mm/memory_hotplug.c |   26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff -puN mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child	2019-02-25 10:56:49.707908029 -0800
+++ b/mm/memory_hotplug.c	2019-02-25 10:56:49.711908029 -0800
@@ -100,19 +100,21 @@ void mem_hotplug_done(void)
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
-	struct resource *res, *conflict;
-	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
-	if (!res)
-		return ERR_PTR(-ENOMEM);
+	struct resource *res;
+	unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+	char *resource_name = "System RAM";
 
-	res->name = "System RAM";
-	res->start = start;
-	res->end = start + size - 1;
-	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-	conflict =  request_resource_conflict(&iomem_resource, res);
-	if (conflict) {
-		pr_debug("System RAM resource %pR cannot be added\n", res);
-		kfree(res);
+	/*
+	 * Request ownership of the new memory range.  This might be
+	 * a child of an existing resource that was present but
+	 * not marked as busy.
+	 */
+	res = __request_region(&iomem_resource, start, size,
+			       resource_name, flags);
+
+	if (!res) {
+		pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
+				start, start + size);
 		return ERR_PTR(-EEXIST);
 	}
 	return res;
_

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 4/5] mm/resource: let walk_system_ram_range() search child resources
  2019-02-25 18:57 ` Dave Hansen
  (?)
@ 2019-02-25 18:57   ` Dave Hansen
  -1 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2019-02-25 18:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: thomas.lendacky, mhocko, linux-nvdimm, tiwai, Dave Hansen,
	ying.huang, linux-mm, jglisse, bp, baiyaowei, zwisler, bhelgaas,
	fengguang.wu, akpm


From: Dave Hansen <dave.hansen@linux.intel.com>

In the process of onlining memory, we use walk_system_ram_range()
to find the actual RAM areas inside of the area being onlined.

However, it currently only finds memory resources which are
"top-level" iomem_resources.  Children are not currently
searched which causes it to skip System RAM in areas like this
(in the format of /proc/iomem):

a0000000-bfffffff : Persistent Memory (legacy)
  a0000000-afffffff : System RAM

Changing the true->false here allows children to be searched
as well.  We need this because we add a new "System RAM"
resource underneath the "persistent memory" resource when
we use persistent memory in a volatile mode.
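
As a minimal sketch (not part of this patch; the helper names are made
up), a caller of walk_system_ram_range() looks like this, and after the
true->false change the callback also runs for nested "System RAM"
children such as the one above:

	static int count_ram_pages(unsigned long start_pfn,
				   unsigned long nr_pages, void *arg)
	{
		unsigned long *total = arg;

		*total += nr_pages;
		return 0;	/* non-zero would stop the walk */
	}

	static unsigned long ram_pages_in_range(unsigned long start_pfn,
						unsigned long nr_pages)
	{
		unsigned long total = 0;

		walk_system_ram_range(start_pfn, nr_pages, &total,
				      count_ram_pages);
		return total;
	}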

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
---

 b/kernel/resource.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff -puN kernel/resource.c~mm-walk_system_ram_range-search-child-resources kernel/resource.c
--- a/kernel/resource.c~mm-walk_system_ram_range-search-child-resources	2019-02-25 10:56:50.750908026 -0800
+++ b/kernel/resource.c	2019-02-25 10:56:50.754908026 -0800
@@ -454,6 +454,9 @@ int walk_mem_res(u64 start, u64 end, voi
  * This function calls the @func callback against all memory ranges of type
  * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
  * It is to be used only for System RAM.
+ *
+ * This will find System RAM ranges that are children of top-level resources
+ * in addition to top-level System RAM resources.
  */
 int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
 			  void *arg, int (*func)(unsigned long, unsigned long, void *))
@@ -469,7 +472,7 @@ int walk_system_ram_range(unsigned long
 	flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
 	while (start < end &&
 	       !find_next_iomem_res(start, end, flags, IORES_DESC_NONE,
-				    true, &res)) {
+				    false, &res)) {
 		pfn = (res.start + PAGE_SIZE - 1) >> PAGE_SHIFT;
 		end_pfn = (res.end + 1) >> PAGE_SHIFT;
 		if (end_pfn > pfn)
_

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 5/5] dax: "Hotplug" persistent memory for use like normal RAM
  2019-02-25 18:57 ` Dave Hansen
  (?)
@ 2019-02-25 18:57   ` Dave Hansen
  -1 siblings, 0 replies; 25+ messages in thread
From: Dave Hansen @ 2019-02-25 18:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: thomas.lendacky, mhocko, linux-nvdimm, tiwai, Dave Hansen,
	ying.huang, linux-mm, jglisse, bp, baiyaowei, zwisler, bhelgaas,
	fengguang.wu, akpm


From: Dave Hansen <dave.hansen@linux.intel.com>

This is intended for use with NVDIMMs that are physically persistent
(physically like flash) so that they can be used as a cost-effective
RAM replacement.  Intel Optane DC persistent memory is one
implementation of this kind of NVDIMM.

Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries. (DIMM-based
persistent memory hardware/software is described in great detail
here: Documentation/nvdimm/nvdimm.txt).

However, this limits persistent memory use to applications which
*have* been modified.  To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed and used just like
normal RAM would be.

To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:

	echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

and then tell the new driver that it can bind to the device:

	echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id

After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.
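
If no udev rule onlines them automatically, the new blocks can be
onlined by hand through the existing memory-hotplug sysfs interface.
A minimal userspace sketch (illustrative only, not part of this patch):

	#include <glob.h>
	#include <stdio.h>
	#include <string.h>

	/* Online every memory block that is still offline. */
	static void online_offline_blocks(void)
	{
		glob_t g;
		size_t i;

		if (glob("/sys/devices/system/memory/memory*/state",
			 0, NULL, &g))
			return;

		for (i = 0; i < g.gl_pathc; i++) {
			char state[32] = "";
			FILE *f = fopen(g.gl_pathv[i], "r+");

			if (!f)
				continue;
			if (fgets(state, sizeof(state), f) &&
			    !strncmp(state, "offline", 7)) {
				rewind(f);
				fputs("online", f);
			}
			fclose(f);
		}
		globfree(&g);
	}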

This rebinding procedure is currently a one-way trip.  Once memory
is bound to "kmem", it's there permanently and can not be
unbound and assigned back to device_dax.

The kmem driver will never bind to a dax device unless the device
is *explicitly* bound to the driver.  There are two reasons for
this: One, since it is a one-way trip, it can not be undone if
bound incorrectly.  Two, the kmem driver destroys data on the
device.  Imagine you had good data on a pmem device.  It
would be catastrophic if you compiled in "kmem" but left out
the "device_dax" driver: kmem would take over the device and
write volatile data all over your good data.

This inherits any existing NUMA information for the newly-added
memory from the persistent memory device that came from the
firmware.  On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node.  That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.

Because the memory is hotplugged into ordinary NUMA nodes, the
existing NUMA APIs and tools are sufficient to create policies for
applications or memory areas to have affinity for, or an aversion
to, this memory.
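
For example (illustrative only, not part of this patch; the node
number is hypothetical and would in practice be the pmem-backed node),
an application can steer an allocation onto such a node with the
existing mbind() API:

	#include <stddef.h>
	#include <sys/mman.h>
	#include <numaif.h>	/* mbind(), MPOL_BIND; link with -lnuma */

	static void *alloc_on_node(size_t len, int node)
	{
		unsigned long nodemask = 1UL << node;
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return NULL;
		if (mbind(p, len, MPOL_BIND, &nodemask,
			  sizeof(nodemask) * 8, 0)) {
			munmap(p, len);
			return NULL;
		}
		return p;
	}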

There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches.  But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.
This calculation is also the reason we export
memory_block_size_bytes().

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
---

 b/drivers/base/memory.c |    1 
 b/drivers/dax/Kconfig   |   16 +++++++
 b/drivers/dax/Makefile  |    1 
 b/drivers/dax/kmem.c    |  108 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 126 insertions(+)

diff -puN drivers/base/memory.c~dax-kmem-try-4 drivers/base/memory.c
--- a/drivers/base/memory.c~dax-kmem-try-4	2019-02-25 10:56:51.791908023 -0800
+++ b/drivers/base/memory.c	2019-02-25 10:56:51.800908023 -0800
@@ -88,6 +88,7 @@ unsigned long __weak memory_block_size_b
 {
 	return MIN_MEMORY_BLOCK_SIZE;
 }
+EXPORT_SYMBOL_GPL(memory_block_size_bytes);
 
 static unsigned long get_memory_block_size(void)
 {
diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig
--- a/drivers/dax/Kconfig~dax-kmem-try-4	2019-02-25 10:56:51.793908023 -0800
+++ b/drivers/dax/Kconfig	2019-02-25 10:56:51.800908023 -0800
@@ -32,6 +32,22 @@ config DEV_DAX_PMEM
 
 	  Say M if unsure
 
+config DEV_DAX_KMEM
+	tristate "KMEM DAX: volatile-use of persistent memory"
+	default DEV_DAX
+	depends on DEV_DAX
+	depends on MEMORY_HOTPLUG # for add_memory() and friends
+	help
+	  Support access to persistent memory as if it were RAM.  This
+	  allows easier use of persistent memory by unmodified
+	  applications.
+
+	  To use this feature, a DAX device must be unbound from the
+	  device_dax driver (PMEM DAX) and bound to this kmem driver
+	  on each boot.
+
+	  Say N if unsure.
+
 config DEV_DAX_PMEM_COMPAT
 	tristate "PMEM DAX: support the deprecated /sys/class/dax interface"
 	depends on DEV_DAX_PMEM
diff -puN /dev/null drivers/dax/kmem.c
--- /dev/null	2019-02-15 15:42:29.903470860 -0800
+++ b/drivers/dax/kmem.c	2019-02-25 10:56:51.800908023 -0800
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2016-2019 Intel Corporation. All rights reserved. */
+#include <linux/memremap.h>
+#include <linux/pagemap.h>
+#include <linux/memory.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/pfn_t.h>
+#include <linux/slab.h>
+#include <linux/dax.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include "dax-private.h"
+#include "bus.h"
+
+int dev_dax_kmem_probe(struct device *dev)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct resource *res = &dev_dax->region->res;
+	resource_size_t kmem_start;
+	resource_size_t kmem_size;
+	resource_size_t kmem_end;
+	struct resource *new_res;
+	int numa_node;
+	int rc;
+
+	/*
+	 * Ensure good NUMA information for the persistent memory.
+	 * Without this check, there is a risk that slow memory
+	 * could be mixed in a node with faster memory, causing
+	 * unavoidable performance issues.
+	 */
+	numa_node = dev_dax->target_node;
+	if (numa_node < 0) {
+		dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n",
+			 res, numa_node);
+		return -EINVAL;
+	}
+
+	/* Hotplug starting at the beginning of the next block: */
+	kmem_start = ALIGN(res->start, memory_block_size_bytes());
+
+	kmem_size = resource_size(res);
+	/* Adjust the size down to compensate for moving up kmem_start: */
+	kmem_size -= kmem_start - res->start;
+	/* Align the size down to cover only complete blocks: */
+	kmem_size &= ~(memory_block_size_bytes() - 1);
+	kmem_end = kmem_start + kmem_size;
+
+	/* Region is permanently reserved.  Hot-remove not yet implemented. */
+	new_res = request_mem_region(kmem_start, kmem_size, dev_name(dev));
+	if (!new_res) {
+		dev_warn(dev, "could not reserve region [%pa-%pa]\n",
+			 &kmem_start, &kmem_end);
+		return -EBUSY;
+	}
+
+	/*
+	 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
+	 * so that add_memory() can add a child resource.  Do not
+	 * inherit flags from the parent since it may set new flags
+	 * unknown to us that will break add_memory() below.
+	 */
+	new_res->flags = IORESOURCE_SYSTEM_RAM;
+	new_res->name = dev_name(dev);
+
+	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
+static int dev_dax_kmem_remove(struct device *dev)
+{
+	/*
+	 * Purposely leak the request_mem_region() for the device-dax
+	 * range and return '0' to ->remove() attempts. The removal of
+	 * the device from the driver always succeeds, but the region
+	 * is permanently pinned as reserved by the unreleased
+	 * request_mem_region().
+	 */
+	return 0;
+}
+
+static struct dax_device_driver device_dax_kmem_driver = {
+	.drv = {
+		.probe = dev_dax_kmem_probe,
+		.remove = dev_dax_kmem_remove,
+	},
+};
+
+static int __init dax_kmem_init(void)
+{
+	return dax_driver_register(&device_dax_kmem_driver);
+}
+
+static void __exit dax_kmem_exit(void)
+{
+	dax_driver_unregister(&device_dax_kmem_driver);
+}
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_LICENSE("GPL v2");
+module_init(dax_kmem_init);
+module_exit(dax_kmem_exit);
+MODULE_ALIAS_DAX_DEVICE(0);
diff -puN drivers/dax/Makefile~dax-kmem-try-4 drivers/dax/Makefile
--- a/drivers/dax/Makefile~dax-kmem-try-4	2019-02-25 10:56:51.796908023 -0800
+++ b/drivers/dax/Makefile	2019-02-25 10:56:51.800908023 -0800
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_DAX) += dax.o
 obj-$(CONFIG_DEV_DAX) += device_dax.o
+obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
 
 dax-y := super.o
 dax-y += bus.o
_

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/5] dax: "Hotplug" persistent memory for use like normal RAM
  2019-02-25 18:57   ` Dave Hansen
@ 2019-02-25 22:56     ` Verma, Vishal L
  -1 siblings, 0 replies; 25+ messages in thread
From: Verma, Vishal L @ 2019-02-25 22:56 UTC (permalink / raw)
  To: linux-kernel, dave.hansen
  Cc: thomas.lendacky, mhocko, linux-nvdimm, tiwai, Huang, 


On Mon, 2019-02-25 at 10:57 -0800, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This is intended for use with NVDIMMs that are physically persistent
> (physically like flash) so that they can be used as a cost-effective
> RAM replacement.  Intel Optane DC persistent memory is one
> implementation of this kind of NVDIMM.
> 
> Currently, a persistent memory region is "owned" by a device driver,
> either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
> allow applications to explicitly use persistent memory, generally
> by being modified to use special, new libraries. (DIMM-based
> persistent memory hardware/software is described in great detail
> here: Documentation/nvdimm/nvdimm.txt).
> 
> However, this limits persistent memory use to applications which
> *have* been modified.  To make it more broadly usable, this driver
> "hotplugs" memory into the kernel, to be managed and used just like
> normal RAM would be.
> 
> To make this work, management software must remove the device from
> being controlled by the "Device DAX" infrastructure:
> 
> 	echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> 
> and then tell the new driver that it can bind to the device:
> 
> 	echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> 
> After this, there will be a number of new memory sections visible
> in sysfs that can be onlined, or that may get onlined by existing
> udev-initiated memory hotplug rules.
> 
> This rebinding procedure is currently a one-way trip.  Once memory
> is bound to "kmem", it's there permanently and can not be
> unbound and assigned back to device_dax.
> 
> The kmem driver will never bind to a dax device unless the device
> is *explicitly* bound to the driver.  There are two reasons for
> this: One, since it is a one-way trip, it can not be undone if
> bound incorrectly.  Two, the kmem driver destroys data on the
> device.  Think of if you had good data on a pmem device.  It
> would be catastrophic if you compile-in "kmem", but leave out
> the "device_dax" driver.  kmem would take over the device and
> write volatile data all over your good data.
> 
> This inherits any existing NUMA information for the newly-added
> memory from the persistent memory device that came from the
> firmware.  On Intel platforms, the firmware has guarantees that
> require each socket's persistent memory to be in a separate
> memory-only NUMA node.  That means that this patch is not expected
> to create NUMA nodes, but will simply hotplug memory into existing
> nodes.
> 
> Because NUMA nodes are created, the existing NUMA APIs and tools
> are sufficient to create policies for applications or memory areas
> to have affinity for or an aversion to using this memory.
> 
> There is currently some metadata at the beginning of pmem regions.
> The section-size memory hotplug restrictions, plus this small
> reserved area can cause the "loss" of a section or two of capacity.
> This should be fixable in follow-on patches.  But, as a first step,
> losing 256MB of memory (worst case) out of hundreds of gigabytes
> is a good tradeoff vs. the required code to fix this up precisely.
> This calculation is also the reason we export
> memory_block_size_bytes().
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Keith Busch <keith.busch@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> Cc: Takashi Iwai <tiwai@suse.de>
> Cc: Jerome Glisse <jglisse@redhat.com>
> ---
> 
>  b/drivers/base/memory.c |    1 
>  b/drivers/dax/Kconfig   |   16 +++++++
>  b/drivers/dax/Makefile  |    1 
>  b/drivers/dax/kmem.c    |  108 ++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 126 insertions(+)

Looks good,
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/5] dax: "Hotplug" persistent memory for use like normal RAM
@ 2019-02-25 22:56     ` Verma, Vishal L
  0 siblings, 0 replies; 25+ messages in thread
From: Verma, Vishal L @ 2019-02-25 22:56 UTC (permalink / raw)
  To: linux-kernel, dave.hansen
  Cc: tiwai, bp, linux-mm, Williams, Dan J, akpm, linux-nvdimm,
	jglisse, zwisler, mhocko, Jiang, Dave, bhelgaas, thomas.lendacky,
	Busch, Keith, Huang, Ying, Wu, Fengguang, baiyaowei


On Mon, 2019-02-25 at 10:57 -0800, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This is intended for use with NVDIMMs that are physically persistent
> (physically like flash) so that they can be used as a cost-effective
> RAM replacement.  Intel Optane DC persistent memory is one
> implementation of this kind of NVDIMM.
> 
> Currently, a persistent memory region is "owned" by a device driver,
> either the "Direct DAX" or "Filesystem DAX" drivers.  These drivers
> allow applications to explicitly use persistent memory, generally
> by being modified to use special, new libraries. (DIMM-based
> persistent memory hardware/software is described in great detail
> here: Documentation/nvdimm/nvdimm.txt).
> 
> However, this limits persistent memory use to applications which
> *have* been modified.  To make it more broadly usable, this driver
> "hotplugs" memory into the kernel, to be managed and used just like
> normal RAM would be.
> 
> To make this work, management software must remove the device from
> being controlled by the "Device DAX" infrastructure:
> 
> 	echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> 
> and then tell the new driver that it can bind to the device:
> 
> 	echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> 
> After this, there will be a number of new memory sections visible
> in sysfs that can be onlined, or that may get onlined by existing
> udev-initiated memory hotplug rules.
> 
> This rebinding procedure is currently a one-way trip.  Once memory
> is bound to "kmem", it's there permanently and can not be
> unbound and assigned back to device_dax.
> 
> The kmem driver will never bind to a dax device unless the device
> is *explicitly* bound to the driver.  There are two reasons for
> this: One, since it is a one-way trip, it can not be undone if
> bound incorrectly.  Two, the kmem driver destroys data on the
> device.  Imagine you had good data on a pmem device: it would
> be catastrophic if you compiled "kmem" in but left out
> the "device_dax" driver.  kmem would take over the device and
> write volatile data all over your good data.
> 
> This inherits any existing NUMA information for the newly-added
> memory from the persistent memory device that came from the
> firmware.  On Intel platforms, the firmware has guarantees that
> require each socket's persistent memory to be in a separate
> memory-only NUMA node.  That means that this patch is not expected
> to create NUMA nodes, but will simply hotplug memory into existing
> nodes.
> 
> Because the memory lands in NUMA nodes, the existing NUMA APIs and tools
> are sufficient to create policies for applications or memory areas
> to have affinity for or an aversion to using this memory.
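
For example, standard NUMA tooling can already steer allocations toward or
away from the hotplugged memory (the node number and application name below
are illustrative; the actual node depends on the platform):

	# bind all of the app's allocations to the pmem-backed node
	numactl --membind=1 ./memory-hungry-app

	# or merely prefer it, falling back to DRAM when it fills up
	numactl --preferred=1 ./memory-hungry-app
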
> 
> There is currently some metadata at the beginning of pmem regions.
> The section-size memory hotplug restrictions, plus this small
> reserved area can cause the "loss" of a section or two of capacity.
> This should be fixable in follow-on patches.  But, as a first step,
> losing 256MB of memory (worst case) out of hundreds of gigabytes
> is a good tradeoff vs. the required code to fix this up precisely.
> This calculation is also the reason we export
> memory_block_size_bytes().
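
The block size in question can be read back from sysfs on a running system
(a sketch; the value is reported in hex and is typically 128MB on x86_64,
but that is not guaranteed):

	# memory block size that hotplugged ranges are rounded to; pmem
	# capacity that does not fill a whole block goes unused for now
	printf '%d\n' 0x$(cat /sys/devices/system/memory/block_size_bytes)
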
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Keith Busch <keith.busch@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> Cc: Takashi Iwai <tiwai@suse.de>
> Cc: Jerome Glisse <jglisse@redhat.com>
> ---
> 
>  b/drivers/base/memory.c |    1 
>  b/drivers/dax/Kconfig   |   16 +++++++
>  b/drivers/dax/Makefile  |    1 
>  b/drivers/dax/kmem.c    |  108 ++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 126 insertions(+)

Looks good,
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] mm/resource: return real error codes from walk failures
  2019-02-25 18:57   ` Dave Hansen
@ 2019-02-26  7:41     ` Christophe Leroy
  -1 siblings, 0 replies; 25+ messages in thread
From: Christophe Leroy @ 2019-02-26  7:41 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: mhocko, tiwai, keith.busch, linux-mm, paulus, baiyaowei, zwisler,
	dave.jiang, linux-nvdimm, ying.huang, bp, thomas.lendacky,
	jglisse, bhelgaas, dan.j.williams, vishal.l.verma, akpm,
	fengguang.wu, linuxppc-dev



On 25/02/2019 at 19:57, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> walk_system_ram_range() can return an error code either because
> *it* failed, or because the 'func' that it calls returned an
> error.  The memory hotplug does the following:
> 
> 	ret = walk_system_ram_range(..., func);
>          if (ret)
> 		return ret;
> 
> and 'ret' makes it out to userspace, eventually.  The problem
> is, walk_system_ram_range() failures that result from *it* failing
> (as opposed to 'func') return -1.  That leads to a very odd
> -EPERM (-1) return code out to userspace.
> 
> Make walk_system_ram_range() return -EINVAL for internal
> failures to keep userspace less confused.
> 
> This return code is compatible with all the callers that I
> audited.
> 
> This changes both the generic mm/ and powerpc-specific
> implementations to have the same return value.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> Cc: Takashi Iwai <tiwai@suse.de>
> Cc: Jerome Glisse <jglisse@redhat.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Keith Busch <keith.busch@intel.com>
> ---
> 
>   b/arch/powerpc/mm/mem.c |    2 +-

walk_system_ram_range() was dropped in commit
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=26b523356f49a0117c8f9e32ca98aa6d6e496e1a

Christophe

>   b/kernel/resource.c     |    4 ++--
>   2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff -puN arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1 arch/powerpc/mm/mem.c
> --- a/arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1	2019-02-25 10:56:47.452908034 -0800
> +++ b/arch/powerpc/mm/mem.c	2019-02-25 10:56:47.458908034 -0800
> @@ -189,7 +189,7 @@ walk_system_ram_range(unsigned long star
>   	struct memblock_region *reg;
>   	unsigned long end_pfn = start_pfn + nr_pages;
>   	unsigned long tstart, tend;
> -	int ret = -1;
> +	int ret = -EINVAL;
>   
>   	for_each_memblock(memory, reg) {
>   		tstart = max(start_pfn, memblock_region_memory_base_pfn(reg));
> diff -puN kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 kernel/resource.c
> --- a/kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1	2019-02-25 10:56:47.454908034 -0800
> +++ b/kernel/resource.c	2019-02-25 10:56:47.459908034 -0800
> @@ -382,7 +382,7 @@ static int __walk_iomem_res_desc(resourc
>   				 int (*func)(struct resource *, void *))
>   {
>   	struct resource res;
> -	int ret = -1;
> +	int ret = -EINVAL;
>   
>   	while (start < end &&
>   	       !find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) {
> @@ -462,7 +462,7 @@ int walk_system_ram_range(unsigned long
>   	unsigned long flags;
>   	struct resource res;
>   	unsigned long pfn, end_pfn;
> -	int ret = -1;
> +	int ret = -EINVAL;
>   
>   	start = (u64) start_pfn << PAGE_SHIFT;
>   	end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
> _
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] mm/resource: return real error codes from walk failures
@ 2019-02-26  7:41     ` Christophe Leroy
  0 siblings, 0 replies; 25+ messages in thread
From: Christophe Leroy @ 2019-02-26  7:41 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: thomas.lendacky, mhocko, dave.jiang, linux-nvdimm, tiwai,
	vishal.l.verma, ying.huang, keith.busch, linux-mm, jglisse,
	fengguang.wu, paulus, baiyaowei, zwisler, bhelgaas,
	dan.j.williams, bp, linuxppc-dev, akpm



On 25/02/2019 at 19:57, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> walk_system_ram_range() can return an error code either because
> *it* failed, or because the 'func' that it calls returned an
> error.  The memory hotplug does the following:
> 
> 	ret = walk_system_ram_range(..., func);
>          if (ret)
> 		return ret;
> 
> and 'ret' makes it out to userspace, eventually.  The problem
> is, walk_system_ram_range() failures that result from *it* failing
> (as opposed to 'func') return -1.  That leads to a very odd
> -EPERM (-1) return code out to userspace.
> 
> Make walk_system_ram_range() return -EINVAL for internal
> failures to keep userspace less confused.
> 
> This return code is compatible with all the callers that I
> audited.
> 
> This changes both the generic mm/ and powerpc-specific
> implementations to have the same return value.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
> Cc: Takashi Iwai <tiwai@suse.de>
> Cc: Jerome Glisse <jglisse@redhat.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Keith Busch <keith.busch@intel.com>
> ---
> 
>   b/arch/powerpc/mm/mem.c |    2 +-

walk_system_ram_range() was dropped in commit
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=26b523356f49a0117c8f9e32ca98aa6d6e496e1a

Christophe

>   b/kernel/resource.c     |    4 ++--
>   2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff -puN arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1 arch/powerpc/mm/mem.c
> --- a/arch/powerpc/mm/mem.c~memory-hotplug-walk_system_ram_range-returns-neg-1	2019-02-25 10:56:47.452908034 -0800
> +++ b/arch/powerpc/mm/mem.c	2019-02-25 10:56:47.458908034 -0800
> @@ -189,7 +189,7 @@ walk_system_ram_range(unsigned long star
>   	struct memblock_region *reg;
>   	unsigned long end_pfn = start_pfn + nr_pages;
>   	unsigned long tstart, tend;
> -	int ret = -1;
> +	int ret = -EINVAL;
>   
>   	for_each_memblock(memory, reg) {
>   		tstart = max(start_pfn, memblock_region_memory_base_pfn(reg));
> diff -puN kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1 kernel/resource.c
> --- a/kernel/resource.c~memory-hotplug-walk_system_ram_range-returns-neg-1	2019-02-25 10:56:47.454908034 -0800
> +++ b/kernel/resource.c	2019-02-25 10:56:47.459908034 -0800
> @@ -382,7 +382,7 @@ static int __walk_iomem_res_desc(resourc
>   				 int (*func)(struct resource *, void *))
>   {
>   	struct resource res;
> -	int ret = -1;
> +	int ret = -EINVAL;
>   
>   	while (start < end &&
>   	       !find_next_iomem_res(start, end, flags, desc, first_lvl, &res)) {
> @@ -462,7 +462,7 @@ int walk_system_ram_range(unsigned long
>   	unsigned long flags;
>   	struct resource res;
>   	unsigned long pfn, end_pfn;
> -	int ret = -1;
> +	int ret = -EINVAL;
>   
>   	start = (u64) start_pfn << PAGE_SHIFT;
>   	end = ((u64)(start_pfn + nr_pages) << PAGE_SHIFT) - 1;
> _
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] [v5] Allow persistent memory to be used like normal RAM
  2019-02-25 18:57 ` Dave Hansen
@ 2019-02-28  7:53   ` Dan Williams
  -1 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2019-02-28  7:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Juergen Gross, Tom Lendacky, Michal Hocko, linux-nvdimm,
	Takashi Iwai, Huang, Ying, Linux Kernel Mailing List, Linux MM,
	Jérôme Glisse, Borislav Petkov, Yaowei Bai,
	Ross Zwisler, Bjorn Helgaas, Stephen Rothwell, Andrew Morton,
	Fengguang Wu

[ add Stephen and Juergen ]

On Mon, Feb 25, 2019 at 11:02 AM Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> This is a relatively small delta from v4.  The review comments seem
> to be settling down, so it seems like we should start thinking about
> how this might get merged.  Are there any objections to taking it in
> via the nvdimm tree?
>
> Dan Williams, our intrepid nvdimm maintainer has said he would
> appreciate acks on these from relevant folks before merging them.
> Reviews/acks on any in the series would be welcome, but the last
> two especially are lacking any non-Intel acks:
>
>         mm/resource: let walk_system_ram_range() search child resources
>         dax: "Hotplug" persistent memory for use like normal RAM

I've gone ahead and added this to the libnvdimm-for-next branch for
wider exposure. Acks of course still welcome.

Stephen, this collides with commit 357b4da50a62 "x86: respect memory
size limiting via mem= parameter" in current -next. Here's my
resolution for reference, basically just add the max_mem_size
statement to Dave's rework. Holler if this causes any other problems.

diff --cc mm/memory_hotplug.c
index a9d5787044e1,b37f3a5c4833..c4f59ac21014
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@@ -102,28 -99,21 +102,24 @@@ u64 max_mem_size = U64_MAX
  /* add this memory to iomem resource */
  static struct resource *register_memory_resource(u64 start, u64 size)
  {
-       struct resource *res, *conflict;
+       struct resource *res;
+       unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+       char *resource_name = "System RAM";

 +      if (start + size > max_mem_size)
 +              return ERR_PTR(-E2BIG);
 +
-       res = kzalloc(sizeof(struct resource), GFP_KERNEL);
-       if (!res)
-               return ERR_PTR(-ENOMEM);
-
-       res->name = "System RAM";
-       res->start = start;
-       res->end = start + size - 1;
-       res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-       conflict =  request_resource_conflict(&iomem_resource, res);
-       if (conflict) {
-               if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
-                       pr_debug("Device unaddressable memory block "
-                                "memory hotplug at %#010llx !\n",
-                                (unsigned long long)start);
-               }
-               pr_debug("System RAM resource %pR cannot be added\n", res);
-               kfree(res);
+       /*
+        * Request ownership of the new memory range.  This might be
+        * a child of an existing resource that was present but
+        * not marked as busy.
+        */
+       res = __request_region(&iomem_resource, start, size,
+                              resource_name, flags);
+
+       if (!res) {
+               pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
+                               start, start + size);
                return ERR_PTR(-EEXIST);
        }
        return res;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] [v5] Allow persistent memory to be used like normal RAM
@ 2019-02-28  7:53   ` Dan Williams
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2019-02-28  7:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, Dave Jiang, Ross Zwisler,
	Vishal L Verma, Tom Lendacky, Andrew Morton, Michal Hocko,
	linux-nvdimm, Linux MM, Huang, Ying, Fengguang Wu,
	Borislav Petkov, Bjorn Helgaas, Yaowei Bai, Takashi Iwai,
	Jérôme Glisse, Keith Busch, Stephen Rothwell,
	Juergen Gross

[ add Stephen and Juergen ]

On Mon, Feb 25, 2019 at 11:02 AM Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> This is a relatively small delta from v4.  The review comments seem
> to be settling down, so it seems like we should start thinking about
> how this might get merged.  Are there any objections to taking it in
> via the nvdimm tree?
>
> Dan Williams, our intrepid nvdimm maintainer has said he would
> appreciate acks on these from relevant folks before merging them.
> Reviews/acks on any in the series would be welcome, but the last
> two especially are lacking any non-Intel acks:
>
>         mm/resource: let walk_system_ram_range() search child resources
>         dax: "Hotplug" persistent memory for use like normal RAM

I've gone ahead and added this to the libnvdimm-for-next branch for
wider exposure. Acks of course still welcome.

Stephen, this collides with commit 357b4da50a62 "x86: respect memory
size limiting via mem= parameter" in current -next. Here's my
resolution for reference, basically just add the max_mem_size
statement to Dave's rework. Holler if this causes any other problems.

diff --cc mm/memory_hotplug.c
index a9d5787044e1,b37f3a5c4833..c4f59ac21014
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@@ -102,28 -99,21 +102,24 @@@ u64 max_mem_size = U64_MAX
  /* add this memory to iomem resource */
  static struct resource *register_memory_resource(u64 start, u64 size)
  {
-       struct resource *res, *conflict;
+       struct resource *res;
+       unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+       char *resource_name = "System RAM";

 +      if (start + size > max_mem_size)
 +              return ERR_PTR(-E2BIG);
 +
-       res = kzalloc(sizeof(struct resource), GFP_KERNEL);
-       if (!res)
-               return ERR_PTR(-ENOMEM);
-
-       res->name = "System RAM";
-       res->start = start;
-       res->end = start + size - 1;
-       res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-       conflict =  request_resource_conflict(&iomem_resource, res);
-       if (conflict) {
-               if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
-                       pr_debug("Device unaddressable memory block "
-                                "memory hotplug at %#010llx !\n",
-                                (unsigned long long)start);
-               }
-               pr_debug("System RAM resource %pR cannot be added\n", res);
-               kfree(res);
+       /*
+        * Request ownership of the new memory range.  This might be
+        * a child of an existing resource that was present but
+        * not marked as busy.
+        */
+       res = __request_region(&iomem_resource, start, size,
+                              resource_name, flags);
+
+       if (!res) {
+               pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
+                               start, start + size);
                return ERR_PTR(-EEXIST);
        }
        return res;

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2019-02-28  7:54 UTC | newest]

Thread overview: 25+ messages
-- links below jump to the message on this page --
2019-02-25 18:57 [PATCH 0/5] [v5] Allow persistent memory to be used like normal RAM Dave Hansen
2019-02-25 18:57 ` Dave Hansen
2019-02-25 18:57 ` Dave Hansen
2019-02-25 18:57 ` [PATCH 1/5] mm/resource: return real error codes from walk failures Dave Hansen
2019-02-25 18:57   ` Dave Hansen
2019-02-25 18:57   ` Dave Hansen
2019-02-25 18:57   ` Dave Hansen
2019-02-26  7:41   ` Christophe Leroy
2019-02-26  7:41     ` Christophe Leroy
2019-02-25 18:57 ` [PATCH 2/5] mm/resource: move HMM pr_debug() deeper into resource code Dave Hansen
2019-02-25 18:57   ` Dave Hansen
2019-02-25 18:57   ` Dave Hansen
2019-02-25 18:57 ` [PATCH 3/5] mm/memory-hotplug: allow memory resources to be children Dave Hansen
2019-02-25 18:57   ` Dave Hansen
2019-02-25 18:57   ` Dave Hansen
2019-02-25 18:57 ` [PATCH 4/5] mm/resource: let walk_system_ram_range() search child resources Dave Hansen
2019-02-25 18:57   ` Dave Hansen
2019-02-25 18:57   ` Dave Hansen
2019-02-25 18:57 ` [PATCH 5/5] dax: "Hotplug" persistent memory for use like normal RAM Dave Hansen
2019-02-25 18:57   ` Dave Hansen
2019-02-25 18:57   ` Dave Hansen
2019-02-25 22:56   ` Verma, Vishal L
2019-02-25 22:56     ` Verma, Vishal L
2019-02-28  7:53 ` [PATCH 0/5] [v5] Allow persistent memory to be used " Dan Williams
2019-02-28  7:53   ` Dan Williams
