* [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
@ 2017-02-21 17:22 ` Nathan Fontenot
  0 siblings, 0 replies; 40+ messages in thread
From: Nathan Fontenot @ 2017-02-21 17:22 UTC (permalink / raw)
  To: linux-mm, vkuznets; +Cc: mpe, linuxppc-dev, mdroth

Commit 31bc3858e "add automatic onlining policy for the newly added memory"
provides the capability to have added memory automatically onlined
during add, but this appears to be slightly broken.

The current implementation uses walk_memory_range() to call
online_memory_block(), which uses memory_block_change_state() to online
the memory. Instead I think we should be calling device_online()
for the memory block in online_memory_block(). This would still online
the memory (the memory bus online routine memory_subsys_online(),
called from device_online(), calls memory_block_change_state()) and
would also properly update the device struct's offline flag.
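
For reference, device_online() does roughly the following (simplified
sketch from memory, not verbatim kernel source):

    int device_online(struct device *dev)
    {
            int ret = 0;

            device_lock(dev);
            if (dev->offline) {
                    /* for a memory block the bus online routine is
                     * memory_subsys_online(), which ends up calling
                     * memory_block_change_state() */
                    ret = dev->bus->online(dev);
                    if (!ret)
                            /* the update the auto online path misses */
                            dev->offline = false;
            }
            device_unlock(dev);

            return ret;
    }

So going through device_online() keeps the state change and the flag in
sync.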

As a result of the current implementation, attempting to remove
a memory block after adding it using auto online fails. This is
because doing a remove, for instance
'echo offline > /sys/devices/system/memory/memoryXXX/state', uses
device_offline() which checks the dev->offline flag.
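
The symptom looks roughly like this (the block number is made up for
the example):

    # block was auto-onlined by the kernel during hot-add
    cat /sys/devices/system/memory/memory32/state    # -> online
    cat /sys/devices/system/memory/memory32/online   # -> 0 (stale dev->offline)
    echo offline > /sys/devices/system/memory/memory32/state
    # the write is rejected because device_offline() sees dev->offline
    # already set and never calls the bus offline routine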

There is a workaround in that a user could online the memory or have
a udev rule to online the memory by using the sysfs interface. The
sysfs interface to online memory goes through device_online() which
should update the dev->offline flag. I'm not sure that having kernel
memory hotplug rely on userspace actions is the correct way to go.
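
(A typical rule -- the exact form varies by distro -- is something like

    SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

i.e. every newly added block is bounced back to the kernel to be onlined.)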

I have tried reading through the email threads from when the original patch
was submitted and could not determine if this is the expected behavior.
The problem with the current behavior was found when trying to update
memory hotplug on powerpc to use auto online.

-Nathan Fontenot
---
 drivers/base/memory.c  |    2 +-
 include/linux/memory.h |    3 ---
 mm/memory_hotplug.c    |    2 +-
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 8ab8ea1..ede46f3 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -249,7 +249,7 @@ static bool pages_correctly_reserved(unsigned long start_pfn)
 	return ret;
 }
 
-int memory_block_change_state(struct memory_block *mem,
+static int memory_block_change_state(struct memory_block *mem,
 		unsigned long to_state, unsigned long from_state_req)
 {
 	int ret = 0;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 093607f..b723a68 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -109,9 +109,6 @@ static inline int memory_isolate_notify(unsigned long val, void *v)
 extern int register_memory_isolate_notifier(struct notifier_block *nb);
 extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
 extern int register_new_memory(int, struct mem_section *);
-extern int memory_block_change_state(struct memory_block *mem,
-				     unsigned long to_state,
-				     unsigned long from_state_req);
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern int unregister_memory_section(struct mem_section *);
 #endif
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e43142c1..6f7a289 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1329,7 +1329,7 @@ int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
 
 static int online_memory_block(struct memory_block *mem, void *arg)
 {
-	return memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
+	return device_online(&mem->dev);
 }
 
 /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */


* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-21 17:22 ` Nathan Fontenot
@ 2017-02-22  9:32   ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-22  9:32 UTC (permalink / raw)
  To: Nathan Fontenot; +Cc: linux-mm, mpe, linuxppc-dev, mdroth

Hi,

s,memhp_auto_offline,memhp_auto_online, in the subject please :-)

Nathan Fontenot <nfont@linux.vnet.ibm.com> writes:

> Commit 31bc3858e "add automatic onlining policy for the newly added memory"
> provides the capability to have added memory automatically onlined
> during add, but this appears to be slightly broken.
>
> The current implementation uses walk_memory_range() to call
> online_memory_block, which uses memory_block_change_state() to online
> the memory. Instead I think we should be calling device_online()
> for the memory block in online_memory_block. This would online
> the memory (the memory bus online routine memory_subsys_online()
> called from device_online calls memory_block_change_state()) and
> properly update the device struct offline flag.
>
> As a result of the current implementation, attempting to remove
> a memory block after adding it using auto online fails.
> This is
> because doing a remove, for instance
> 'echo offline > /sys/devices/system/memory/memoryXXX/state', uses
> device_offline() which checks the dev->offline flag.

I see the issue.

>
> There is a workaround in that a user could online the memory or have
> a udev rule to online the memory by using the sysfs interface. The
> sysfs interface to online memory goes through device_online() which
> should update the dev->offline flag. I'm not sure that having kernel
> memory hotplug rely on userspace actions is the correct way to go.

Using udev rule for memory onlining is possible when you disable
memhp_auto_online but in some cases it doesn't work well, e.g. when we
use memory hotplug to address memory pressure the loop through userspace
is really slow and memory consuming, we may hit OOM before we manage to
online newly added memory. In addition to that, systemd/udev folks
continuously refused to add this udev rule to udev calling it stupid as
it actually is an unconditional and redundant ping-pong between kernel
and udev.
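
(For reference -- from memory, so double-check the exact names -- the
in-kernel policy is controlled by CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE,
the memhp_default_state= boot option, or at runtime via something like

    echo online > /sys/devices/system/memory/auto_online_blocks

so it only kicks in when the admin explicitly asked for it.)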

>
> I have tried reading through the email threads from when the original patch
> was submitted and could not determine if this is the expected behavior.
> The problem with the current behavior was found when trying to update
> memory hotplug on powerpc to use auto online.
>
> -Nathan Fontenot
> ---
>  drivers/base/memory.c  |    2 +-
>  include/linux/memory.h |    3 ---
>  mm/memory_hotplug.c    |    2 +-
>  3 files changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 8ab8ea1..ede46f3 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -249,7 +249,7 @@ static bool pages_correctly_reserved(unsigned long start_pfn)
>  	return ret;
>  }
>
> -int memory_block_change_state(struct memory_block *mem,
> +static int memory_block_change_state(struct memory_block *mem,
>  		unsigned long to_state, unsigned long from_state_req)
>  {
>  	int ret = 0;
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 093607f..b723a68 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -109,9 +109,6 @@ static inline int memory_isolate_notify(unsigned long val, void *v)
>  extern int register_memory_isolate_notifier(struct notifier_block *nb);
>  extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
>  extern int register_new_memory(int, struct mem_section *);
> -extern int memory_block_change_state(struct memory_block *mem,
> -				     unsigned long to_state,
> -				     unsigned long from_state_req);
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  extern int unregister_memory_section(struct mem_section *);
>  #endif
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e43142c1..6f7a289 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1329,7 +1329,7 @@ int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
>
>  static int online_memory_block(struct memory_block *mem, void *arg)
>  {
> -	return memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
> +	return device_online(&mem->dev);
>  }
>
>  /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */

Your patch looks good to me, I tested it on x86 (Hyper-V) and it seems
to work.

Thanks!

-- 
  Vitaly


* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-22  9:32   ` Vitaly Kuznetsov
@ 2017-02-23 12:56     ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2017-02-23 12:56 UTC (permalink / raw)
  To: Vitaly Kuznetsov; +Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth

On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
[...]
> > There is a workaround in that a user could online the memory or have
> > a udev rule to online the memory by using the sysfs interface. The
> > sysfs interface to online memory goes through device_online() which
> > should update the dev->offline flag. I'm not sure that having kernel
> > memory hotplug rely on userspace actions is the correct way to go.
> 
> Using udev rule for memory onlining is possible when you disable
> memhp_auto_online but in some cases it doesn't work well, e.g. when we
> use memory hotplug to address memory pressure the loop through userspace
> is really slow and memory consuming, we may hit OOM before we manage to
> online newly added memory.

How does the in-kernel implementation prevent that?

> In addition to that, systemd/udev folks
> continuosly refused to add this udev rule to udev calling it stupid as
> it actually is an unconditional and redundant ping-pong between kernel
> and udev.

This is a policy and as such it doesn't belong to the kernel. The whole
auto-enable in the kernel is just plain wrong IMHO and we shouldn't have
merged it.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-23 12:56     ` Michal Hocko
@ 2017-02-23 13:31       ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-23 13:31 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth

Michal Hocko <mhocko@kernel.org> writes:

> On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
> [...]
>> > There is a workaround in that a user could online the memory or have
>> > a udev rule to online the memory by using the sysfs interface. The
>> > sysfs interface to online memory goes through device_online() which
>> > should update the dev->offline flag. I'm not sure that having kernel
>> > memory hotplug rely on userspace actions is the correct way to go.
>> 
>> Using udev rule for memory onlining is possible when you disable
>> memhp_auto_online but in some cases it doesn't work well, e.g. when we
>> use memory hotplug to address memory pressure the loop through userspace
>> is really slow and memory consuming, we may hit OOM before we manage to
>> online newly added memory.
>
> How does the in-kernel implementation prevent that?
>

Onlining memory on hot-plug is much more reliable, e.g. if we were able
to add it in add_memory_resource() we'll also manage to online it. With
udev rule we may end up adding many blocks and then (as udev is
asynchronous) failing to online any of them. In-kernel operation is
synchronous.
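
(The auto-online path is just this hunk at the end of
add_memory_resource(), quoted roughly from memory:

    /* online pages if requested */
    if (online)
            walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
                              NULL, online_memory_block);

i.e. every block of the newly added range is onlined before add_memory()
returns to the caller.)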

>> In addition to that, systemd/udev folks
>> continuously refused to add this udev rule to udev calling it stupid as
>> it actually is an unconditional and redundant ping-pong between kernel
>> and udev.
>
> This is a policy and as such it doesn't belong to the kernel. The whole
> auto-enable in the kernel is just plain wrong IMHO and we shouldn't have
> merged it.

I disagree.

First of all it's not a policy, it is a default. We have many other
defaults in kernel. When I add a network card or a storage, for example,
I don't need to go anywhere and 'enable' it before I'm able to use
it from userspace. And for memory (and CPUs) we, for some unknown reason
opted for something completely different. If someone is plugging new
memory into a box he probably wants to use it, I don't see much value in
waiting for a special confirmation from him. 

Second, this feature is optional. If you want to keep old behavior just
don't enable it.

Third, this solves real world issues. With Hyper-V it is very easy to
show udev failing on stress. No other solution to the issue was ever
suggested.

-- 
  Vitaly


* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-23 13:31       ` Vitaly Kuznetsov
@ 2017-02-23 15:09         ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2017-02-23 15:09 UTC (permalink / raw)
  To: Vitaly Kuznetsov; +Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth

On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
> > [...]
> >> > There is a workaround in that a user could online the memory or have
> >> > a udev rule to online the memory by using the sysfs interface. The
> >> > sysfs interface to online memory goes through device_online() which
> >> > should update the dev->offline flag. I'm not sure that having kernel
> >> > memory hotplug rely on userspace actions is the correct way to go.
> >> 
> >> Using udev rule for memory onlining is possible when you disable
> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
> >> use memory hotplug to address memory pressure the loop through userspace
> >> is really slow and memory consuming, we may hit OOM before we manage to
> >> online newly added memory.
> >
> > How does the in-kernel implementation prevent that?
> >
> 
> Onlining memory on hot-plug is much more reliable, e.g. if we were able
> to add it in add_memory_resource() we'll also manage to online it.

How does that differ from initiating online from userspace?

> With
> udev rule we may end up adding many blocks and then (as udev is
> asynchronous) failing to online any of them.

Why would it fail?

> In-kernel operation is synchronous.

which doesn't mean anything as the context is preemptible AFAICS.

> >> In addition to that, systemd/udev folks
> >> continuously refused to add this udev rule to udev calling it stupid as
> >> it actually is an unconditional and redundant ping-pong between kernel
> >> and udev.
> >
> > This is a policy and as such it doesn't belong to the kernel. The whole
> > auto-enable in the kernel is just plain wrong IMHO and we shouldn't have
> > merged it.
> 
> I disagree.
> 
> First of all it's not a policy, it is a default. We have many other
> defaults in kernel. When I add a network card or a storage, for example,
> I don't need to go anywhere and 'enable' it before I'm able to use
> >> it from userspace. And for memory (and CPUs) we, for some unknown reason
> opted for something completely different. If someone is plugging new
> memory into a box he probably wants to use it, I don't see much value in
> waiting for a special confirmation from him. 

This was not my decision so I can only guess but to me it makes sense.
Both memory and cpus can be physically present and offline which is a
perfectly reasonable state. So having a two phase physical hotadd is
just built on top of physical vs. logical distinction. I completely
understand that some usecases will really like to online the whole node
as soon as it appears present. But an automatic in-kernel implementation
has its downsides - e.g. if this operation fails in the middle you will
not know about that unless you check all the memblocks in sysfs. This is
really a poor interface.
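
(Checking means something along the lines of

    grep -l offline /sys/devices/system/memory/memory*/state

which is hardly a usable error reporting channel.)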

> Second, this feature is optional. If you want to keep old behavior just
> don't enable it.

It just adds unnecessary configuration noise as well

> Third, this solves real world issues. With Hyper-V it is very easy to
> show udev failing on stress. 

What is the reason for these failures? Do you have any link handy?

> No other solution to the issue was ever suggested.

you mean like using ballooning for the memory overcommit like other more
reasonable virtualization solutions?

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-23 15:09         ` Michal Hocko
@ 2017-02-23 15:49           ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-23 15:49 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth

Michal Hocko <mhocko@kernel.org> writes:

> On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
>> > [...]
>> >> > There is a workaround in that a user could online the memory or have
>> >> > a udev rule to online the memory by using the sysfs interface. The
>> >> > sysfs interface to online memory goes through device_online() which
>> >> > should update the dev->offline flag. I'm not sure that having kernel
>> >> > memory hotplug rely on userspace actions is the correct way to go.
>> >> 
>> >> Using udev rule for memory onlining is possible when you disable
>> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
>> >> use memory hotplug to address memory pressure the loop through userspace
>> >> is really slow and memory consuming, we may hit OOM before we manage to
>> >> online newly added memory.
>> >
>> > How does the in-kernel implementation prevent that?
>> >
>> 
>> Onlining memory on hot-plug is much more reliable, e.g. if we were able
>> to add it in add_memory_resource() we'll also manage to online it.
>
> How does that differ from initiating online from userspace?
>
>> With
>> udev rule we may end up adding many blocks and then (as udev is
>> asynchronous) failing to online any of them.
>
> Why would it fail?
>
>> In-kernel operation is synchronous.
>
> which doesn't mean anything as the context is preemptible AFAICS.
>

It actually does,

imagine the following example: you run a small guest (256M of memory)
and now there is a request to add 1000 128MB blocks to it. In case you
do it the old way you're very likely to get OOM somewhere in the middle
as you keep adding blocks which require kernel memory and nobody is
onlining it (or, at least you're racing with the onliner). With
in-kernel implementation we're going to online the first block when it's
added and only then go to the second.
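
Back of the envelope, assuming roughly 64 bytes of struct page per 4K
page on x86_64: 128MB / 4K = 32768 pages, times 64 bytes is ~2MB of
memmap per block, so 1000 blocks need on the order of 2GB of kernel
metadata -- far more than the guest has until the new blocks actually
become usable.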

>> >> In addition to that, systemd/udev folks
>> >> continuously refused to add this udev rule to udev calling it stupid as
>> >> it actually is an unconditional and redundant ping-pong between kernel
>> >> and udev.
>> >
>> > This is a policy and as such it doesn't belong to the kernel. The whole
>> > auto-enable in the kernel is just plain wrong IMHO and we shouldn't have
>> > merged it.
>> 
>> I disagree.
>> 
>> First of all it's not a policy, it is a default. We have many other
>> defaults in kernel. When I add a network card or a storage, for example,
>> I don't need to go anywhere and 'enable' it before I'm able to use
>> >> it from userspace. And for memory (and CPUs) we, for some unknown reason
>> opted for something completely different. If someone is plugging new
>> memory into a box he probably wants to use it, I don't see much value in
>> waiting for a special confirmation from him. 
>
> This was not my decision so I can only guess but to me it makes sense.
> Both memory and cpus can be physically present and offline which is a
>> > perfectly reasonable state. So having a two phase physical hotadd is
> just built on top of physical vs. logical distinction. I completely
> understand that some usecases will really like to online the whole node
> as soon as it appears present. But an automatic in-kernel implementation
>> > has its downsides - e.g. if this operation fails in the middle you will
> not know about that unless you check all the memblocks in sysfs. This is
> really a poor interface.

And how do you know that some blocks failed to online with udev? Who
handles these failures and how? And, last but not least, why do
these failures happen?

>
>> Second, this feature is optional. If you want to keep old behavior just
>> don't enable it.
>
> It just adds unnecessary configuration noise as well
>

For any particular user everything he doesn't need is 'noise'...

>> Third, this solves real world issues. With Hyper-V it is very easy to
>> show udev failing on stress. 
>
> What is the reason for these failures? Do you have any link handy?
>

The reason is running out of memory, swapping and being slow in
general. Again, think about the example I gave above: there is a request
to add many memory blocks and if we try to handle them all before any of
them get online we will get OOM and may even kill the udev process.

>> No other solution to the issue was ever suggested.
>
> you mean like using ballooning for the memory overcommit like other more
> reasonable virtualization solutions?

Not sure how ballooning is related here. Hyper-V uses memory hotplug to
add memory to domains; I don't think we have any other solution for
that. From the hypervisor's point of view the memory was added when the
particular request succeeded; it is not aware of our 'logical/physical'
separation.

-- 
  Vitaly


* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-23 15:49           ` Vitaly Kuznetsov
@ 2017-02-23 16:12             ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2017-02-23 16:12 UTC (permalink / raw)
  To: Vitaly Kuznetsov; +Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth

On Thu 23-02-17 16:49:06, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
> >> Michal Hocko <mhocko@kernel.org> writes:
> >> 
> >> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
> >> > [...]
> >> >> > There is a workaround in that a user could online the memory or have
> >> >> > a udev rule to online the memory by using the sysfs interface. The
> >> >> > sysfs interface to online memory goes through device_online() which
> >> >> > should update the dev->offline flag. I'm not sure that having kernel
> >> >> > memory hotplug rely on userspace actions is the correct way to go.
> >> >> 
> >> >> Using udev rule for memory onlining is possible when you disable
> >> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
> >> >> use memory hotplug to address memory pressure the loop through userspace
> >> >> is really slow and memory consuming, we may hit OOM before we manage to
> >> >> online newly added memory.
> >> >
> >> > How does the in-kernel implementation prevent that?
> >> >
> >> 
> >> Onlining memory on hot-plug is much more reliable, e.g. if we were able
> >> to add it in add_memory_resource() we'll also manage to online it.
> >
> > How does that differ from initiating online from userspace?
> >
> >> With
> >> udev rule we may end up adding many blocks and then (as udev is
> >> asynchronous) failing to online any of them.
> >
> > Why would it fail?
> >
> >> In-kernel operation is synchronous.
> >
> > which doesn't mean anything as the context is preemptible AFAICS.
> >
> 
> It actually does,
> 
> imagine the following example: you run a small guest (256M of memory)
> and now there is a request to add 1000 128MB blocks to it.

Is growing from 256M -> 128GB really something that happens in real life?
Don't get me wrong, but to me this sounds quite exaggerated. Memory hotadd,
which is an operation that has to allocate memory, has to scale with the
currently available memory IMHO.

> In case you
> do it the old way you're very likely to get OOM somewhere in the middle
> as you keep adding blocks which require kernel memory and nobody is
> onlining it (or, at least you're racing with the onliner). With
> in-kernel implementation we're going to online the first block when it's
> added and only then go to the second.

Yes, adding memory will cost you some memory and that is why I am
really skeptical when memory hotplug is used under strong memory
pressure. This can lead to OOMs even when you online one block at a
time.

[...]
> > This was not my decision so I can only guess but to me it makes sense.
> > Both memory and cpus can be physically present and offline which is a
> > perfectly reasonable state. So having a two phase physical hotadd is
> > just built on top of physical vs. logical distinction. I completely
> > understand that some usecases will really like to online the whole node
> > as soon as it appears present. But an automatic in-kernel implementation
> > has its downsides - e.g. if this operation fails in the middle you will
> > not know about that unless you check all the memblocks in sysfs. This is
> > really a poor interface.
> 
> And how do you know that some blocks failed to online with udev?

Because udev will run code which can cope with that - retry if the
error is recoverable or simply report it with all the details. Compare that
to crawling the system log to see that something has broken...

> Who
> handles these failures and how? And, last but not least, why do
> these failures happen?

I haven't heard reports about such failures, and from looking into the
code they are possible but very unlikely.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-23 16:12             ` Michal Hocko
@ 2017-02-23 16:36               ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-23 16:36 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth

Michal Hocko <mhocko@kernel.org> writes:

> On Thu 23-02-17 16:49:06, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Thu 23-02-17 14:31:24, Vitaly Kuznetsov wrote:
>> >> Michal Hocko <mhocko@kernel.org> writes:
>> >> 
>> >> > On Wed 22-02-17 10:32:34, Vitaly Kuznetsov wrote:
>> >> > [...]
>> >> >> > There is a workaround in that a user could online the memory or have
>> >> >> > a udev rule to online the memory by using the sysfs interface. The
>> >> >> > sysfs interface to online memory goes through device_online() which
>> >> >> > should update the dev->offline flag. I'm not sure that having kernel
>> >> >> > memory hotplug rely on userspace actions is the correct way to go.
>> >> >> 
>> >> >> Using udev rule for memory onlining is possible when you disable
>> >> >> memhp_auto_online but in some cases it doesn't work well, e.g. when we
>> >> >> use memory hotplug to address memory pressure the loop through userspace
>> >> >> is really slow and memory consuming, we may hit OOM before we manage to
>> >> >> online newly added memory.
>> >> >
>> >> > How does the in-kernel implementation prevent that?
>> >> >
>> >> 
>> >> Onlining memory on hot-plug is much more reliable, e.g. if we were able
>> >> to add it in add_memory_resource() we'll also manage to online it.
>> >
>> > How does that differ from initiating online from userspace?
>> >
>> >> With
>> >> udev rule we may end up adding many blocks and then (as udev is
>> >> asynchronous) failing to online any of them.
>> >
>> > Why would it fail?
>> >
>> >> In-kernel operation is synchronous.
>> >
>> > which doesn't mean anything as the context is preemptible AFAICS.
>> >
>> 
>> It actually does,
>> 
>> imagine the following example: you run a small guest (256M of memory)
>> and now there is a request to add 1000 128MB blocks to it.
>
> Is growing from 256M -> 128GB really something that happens in real life?
> Don't get me wrong, but to me this sounds quite exaggerated. Memory hotadd,
> which is an operation that has to allocate memory, has to scale with the
> currently available memory IMHO.

With virtual machines this is very real and not exaggerated at
all. E.g. a Hyper-V host can be tuned to automatically add new memory when
the guest is running out of it. Even 100 blocks can represent an issue.

>
>> In case you
>> do it the old way you're very likely to get OOM somewhere in the middle
>> as you keep adding blocks which require kernel memory and nobody is
>> onlining it (or, at least you're racing with the onliner). With
>> in-kernel implementation we're going to online the first block when it's
>> added and only then go to the second.
>
> Yes, adding memory will cost you some memory and that is why I am
> really skeptical when memory hotplug is used under strong memory
> pressure. This can lead to OOMs even when you online one block at a
> time.

If you can't allocate anything then yes, of course it will fail. But if
you try to add many blocks without onlining them at the same time, the
probability of failure is orders of magnitude higher.

(A bit unrelated:) I was actually thinking about the possible failure and
had the following idea in my head: we always keep everything allocated
for one additional memory block, so when hotplug happens we use this
reserved space to add the block, online it, and immediately reserve space
for the next one. I haven't done any coding yet.

>
> [...]
>> > This was not my decision so I can only guess but to me it makes sense.
>> > Both memory and cpus can be physically present and offline which is a
>> > perfectly reasonable state. So having a two phase physicall hotadd is
>> > just built on top of physical vs. logical distinction. I completely
>> > understand that some usecases will really like to online the whole node
>> > as soon as it appears present. But an automatic in-kernel implementation
>> > has its downsides - e.g. if this operation fails in the middle you will
>> > not know about that unless you check all the memblocks in sysfs. This is
>> > really a poor interface.
>> 
>> And how do you know that some blocks failed to online with udev?
>
> Because the udev will run a code which can cope with that - retry if the
> error is recoverable or simply report with all the details. Compare that
> to crawling the system log to see that something has broken...

I don't know much about udev, but the most common rule to online memory
I've met is:

SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline",  ATTR{state}="online"

It doesn't do anything smart.

In current RHEL7 it is even worse:

SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"

so to online a new memory block we actually need to run a process.
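
For illustration: instead of a bare attribute assignment, the rule could run
a small helper which retries and reports failures. A minimal userspace sketch
(the helper name and the RUN+= wiring are hypothetical, not something any
distro actually ships):

/*
 * online-memblock.c - try to online one memory block via sysfs, with retries.
 * Usage: online-memblock memoryXXX
 * Could be wired up with something like:
 *   SUBSYSTEM=="memory", ACTION=="add", RUN+="/usr/local/sbin/online-memblock %k"
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[128];
	int i, fd, ret = 1;

	if (argc != 2) {
		fprintf(stderr, "usage: %s memoryXXX\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path),
		 "/sys/devices/system/memory/%s/state", argv[1]);

	/* retry a few times before giving up and reporting the error */
	for (i = 0; i < 5 && ret; i++) {
		fd = open(path, O_WRONLY);
		if (fd < 0)
			break;
		if (write(fd, "online", 6) == 6)
			ret = 0;	/* block onlined successfully */
		close(fd);
		if (ret)
			sleep(1);	/* back off before the next attempt */
	}
	if (ret)
		fprintf(stderr, "failed to online %s: %s\n",
			argv[1], strerror(errno));
	return ret;
}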

>
>> Who
>> handles these failures and how? And, the last but not least, why do
>> these failures happen?
>
> I haven't heard reports about the failures and from looking into the
> code those are possible but very unlikely.

My point is: failures are possible, yes, but in the most common
use-case, if we hot-plugged some memory we most probably want to use it,
and the feature does that. I'd be glad to hear about possible
improvements to it, of course.

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-23 16:36               ` Vitaly Kuznetsov
@ 2017-02-23 17:41                 ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2017-02-23 17:41 UTC (permalink / raw)
  To: Vitaly Kuznetsov; +Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth

On Thu 23-02-17 17:36:38, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
[...]
> > Is a grow from 256M -> 128GB really something that happens in real life?
> > Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
> > which is an operation which has to allocate memory has to scale with the
> > currently available memory IMHO.
> 
> With virtual machines this is very real and not exaggerated at
> all. E.g. Hyper-V host can be tuned to automatically add new memory when
> guest is running out of it. Even 100 blocks can represent an issue.

Do you have any reference to a bug report? I am really curious because
something really smells wrong and it is not clear that the chosen
solution is really the best one.
[...]
> > Because the udev will run a code which can cope with that - retry if the
> > error is recoverable or simply report with all the details. Compare that
> > to crawling the system log to see that something has broken...
> 
> I don't know much about udev, but the most common rule to online memory
> I've met is:
> 
> SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline",  ATTR{state}="online"
> 
> doesn't do anything smart.

So what? Is there anything that prevents doing something smarter?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-23 17:41                 ` Michal Hocko
@ 2017-02-23 18:14                   ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-23 18:14 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth, kys

Michal Hocko <mhocko@kernel.org> writes:

> On Thu 23-02-17 17:36:38, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
> [...]
>> > Is a grow from 256M -> 128GB really something that happens in real life?
>> > Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
>> > which is an operation which has to allocate memory has to scale with the
>> > currently available memory IMHO.
>> 
>> With virtual machines this is very real and not exaggerated at
>> all. E.g. Hyper-V host can be tuned to automatically add new memory when
>> guest is running out of it. Even 100 blocks can represent an issue.
>
> Do you have any reference to a bug report. I am really curious because
> something really smells wrong and it is not clear that the chosen
> solution is really the best one.

Unfortunately I'm not aware of any publicly posted bug reports (CC:
K. Y. - he may have a reference) but I think I still remember everything
correctly. Not sure how deep you want me to go into details though...

Virtual guests under stress were getting into OOM easily and the OOM
killer was even killing the udev process trying to online the
memory. There was a workaround for the issue added to the Hyper-V driver
doing memory add:

hv_mem_hot_add(...) {
...
 add_memory(....);
 wait_for_completion_timeout(..., 5*HZ);
 ...
}

the completion was done by waiting for the MEM_ONLINE event. This, of
course, was slowing things down significantly, and waiting for a
userspace action in the kernel is not a nice thing to have (not to mention
all the other memory adding methods which had the same issue). Just
removing this wait led us to the same OOM: the hypervisor kept
adding more and more memory, eventually even add_memory() was
failing, and udev and other processes were killed...

With the feature in place we have new memory available right after we do
add_memory(); everything is serialized.

> [...]
>> > Because the udev will run a code which can cope with that - retry if the
>> > error is recoverable or simply report with all the details. Compare that
>> > to crawling the system log to see that something has broken...
>> 
>> I don't know much about udev, but the most common rule to online memory
>> I've met is:
>> 
>> SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline",  ATTR{state}="online"
>> 
>> doesn't do anything smart.
>
> So what? Is there anything that prevents doing something smarter?

Yes, the asynchronous nature of all this stuff. There is no way you can
stop other blocks from being added to the system while you're processing
something in userspace.

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-23 18:14                   ` Vitaly Kuznetsov
@ 2017-02-24 13:37                     ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2017-02-24 13:37 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth, kys

On Thu 23-02-17 19:14:27, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Thu 23-02-17 17:36:38, Vitaly Kuznetsov wrote:
> >> Michal Hocko <mhocko@kernel.org> writes:
> > [...]
> >> > Is a grow from 256M -> 128GB really something that happens in real life?
> >> > Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
> >> > which is an operation which has to allocate memory has to scale with the
> >> > currently available memory IMHO.
> >> 
> >> With virtual machines this is very real and not exaggerated at
> >> all. E.g. Hyper-V host can be tuned to automatically add new memory when
> >> guest is running out of it. Even 100 blocks can represent an issue.
> >
> > Do you have any reference to a bug report. I am really curious because
> > something really smells wrong and it is not clear that the chosen
> > solution is really the best one.
> 
> Unfortunately I'm not aware of any publicly posted bug reports (CC:
> K. Y. - he may have a reference) but I think I still remember everything
> correctly. Not sure how deep you want me to go into details though...

As much as possible to understand what was really going on...

> Virtual guests under stress were getting into OOM easily and the OOM
> killer was even killing the udev process trying to online the
> memory.

Do you happen to have any OOM report? I am really surprised that udev
would be an OOM victim because that process is really small. Who is
consuming all the memory then?

Have you measured how much memory we need to allocate to add one
memblock?

> There was a workaround for the issue added to the hyper-v driver
> doing memory add:
> 
> hv_mem_hot_add(...) {
> ...
>  add_memory(....);
>  wait_for_completion_timeout(..., 5*HZ);
>  ...
> }

I can still see 
		/*
		 * Wait for the memory block to be onlined when memory onlining
		 * is done outside of kernel (memhp_auto_online). Since the hot
		 * add has succeeded, it is ok to proceed even if the pages in
		 * the hot added region have not been "onlined" within the
		 * allowed time.
		 */
		if (dm_device.ha_waiting)
			wait_for_completion_timeout(&dm_device.ol_waitevent,
						    5*HZ);
 
> the completion was done by observing for the MEM_ONLINE event. This, of
> course, was slowing things down significantly and waiting for a
> userspace action in kernel is not a nice thing to have (not speaking
> about all other memory adding methods which had the same issue). Just
> removing this wait was leading us to the same OOM as the hypervisor was
> adding more and more memory and eventually even add_memory() was
> failing, udev and other processes were killed,...

Yes, I agree that waiting on a user action from the kernel is very far
from ideal.
 
> With the feature in place we have new memory available right after we do
> add_memory(), everything is serialized.

What prevented you from onlining the memory explicitly from the
hv_mem_hot_add path? Why do you need a user-visible policy for that at
all? You could also add a parameter to add_memory() that would do the same
thing. Or am I missing something?
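
One possible shape of such a parameter (a sketch only, matching the diff that
appears later in this thread; the in-tree signature at this point is
add_memory(nid, start, size)):

	int add_memory(int nid, u64 start, u64 size, bool online);

Callers like the Hyper-V hot-add path would pass online=true, while the ACPI
and sysfs probe paths would keep today's behaviour with online=false.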
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-24 13:37                     ` Michal Hocko
@ 2017-02-24 14:10                       ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-24 14:10 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth, kys

Michal Hocko <mhocko@kernel.org> writes:

> On Thu 23-02-17 19:14:27, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Thu 23-02-17 17:36:38, Vitaly Kuznetsov wrote:
>> >> Michal Hocko <mhocko@kernel.org> writes:
>> > [...]
>> >> > Is a grow from 256M -> 128GB really something that happens in real life?
>> >> > Don't get me wrong but to me this sounds quite exaggerated. Hotmem add
>> >> > which is an operation which has to allocate memory has to scale with the
>> >> > currently available memory IMHO.
>> >> 
>> >> With virtual machines this is very real and not exaggerated at
>> >> all. E.g. Hyper-V host can be tuned to automatically add new memory when
>> >> guest is running out of it. Even 100 blocks can represent an issue.
>> >
>> > Do you have any reference to a bug report. I am really curious because
>> > something really smells wrong and it is not clear that the chosen
>> > solution is really the best one.
>> 
>> Unfortunately I'm not aware of any publicly posted bug reports (CC:
>> K. Y. - he may have a reference) but I think I still remember everything
>> correctly. Not sure how deep you want me to go into details though...
>
> As much as possible to understand what was really going on...
>
>> Virtual guests under stress were getting into OOM easily and the OOM
>> killer was even killing the udev process trying to online the
>> memory.
>
> Do you happen to have any OOM report? I am really surprised that udev
> would be an oom victim because that process is really small. Who is
> consuming all the memory then?

It's been a while since I worked on this and unfortunately I don't have
a log. From what I remember, the kernel itself was consuming all memory,
so *all* processes were victims.

>
> Have you measured how much memory do we need to allocate to add one
> memblock?

No, it's actually a good idea if we decide to do some sort of pre-allocation.

Just did a quick (and probably dirty) test: increasing guest memory from
4G to 8G (32 x 128MB blocks) requires 68MB of memory, so it's roughly 2MB
per block. It's really easy to trigger OOM for small guests.
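
(That figure is plausible as a rough estimate: assuming 4K pages and a
~64-byte struct page, the memmap alone for one 128MB block is
128MB / 4KB = 32768 pages, and 32768 * 64B ~= 2MB, so 32 blocks account for
~64MB of the measured 68MB, the rest being other per-block allocations.)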

>
>> There was a workaround for the issue added to the hyper-v driver
>> doing memory add:
>> 
>> hv_mem_hot_add(...) {
>> ...
>>  add_memory(....);
>>  wait_for_completion_timeout(..., 5*HZ);
>>  ...
>> }
>
> I can still see 
> 		/*
> 		 * Wait for the memory block to be onlined when memory onlining
> 		 * is done outside of kernel (memhp_auto_online). Since the hot
> 		 * add has succeeded, it is ok to proceed even if the pages in
> 		 * the hot added region have not been "onlined" within the
> 		 * allowed time.
> 		 */
> 		if (dm_device.ha_waiting)
> 			wait_for_completion_timeout(&dm_device.ol_waitevent,
> 						    5*HZ);
>

See 

 dm_device.ha_waiting = !memhp_auto_online;

30 lines above. The workaround is still there for the udev case and it is
still equally bad.

>> the completion was done by observing for the MEM_ONLINE event. This, of
>> course, was slowing things down significantly and waiting for a
>> userspace action in kernel is not a nice thing to have (not speaking
>> about all other memory adding methods which had the same issue). Just
>> removing this wait was leading us to the same OOM as the hypervisor was
>> adding more and more memory and eventually even add_memory() was
>> failing, udev and other processes were killed,...
>
> Yes, I agree that waiting on a user action from the kernel is very far
> from ideal.
>
>> With the feature in place we have new memory available right after we do
>> add_memory(), everything is serialized.
>
> What prevented you from onlining the memory explicitly from
> hv_mem_hot_add path? Why do you need a user visible policy for that at
> all? You could also add a parameter to add_memory that would do the same
> thing. Or am I missing something?

We have different mechanisms for adding memory; I'm aware of at least three:
ACPI, Xen, Hyper-V. The issue I'm addressing is general enough that I'm
pretty sure I can reproduce it on Xen, for example - just boot a
small guest and try adding tons of memory. Why should we have different
defaults for different technologies?

And, BTW, the link to the previous discussion:
https://groups.google.com/forum/#!msg/linux.kernel/AxvyuQjr4GY/TLC-K0sL_NEJ

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-24 14:10                       ` Vitaly Kuznetsov
@ 2017-02-24 14:41                         ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2017-02-24 14:41 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth, kys

On Fri 24-02-17 15:10:29, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Thu 23-02-17 19:14:27, Vitaly Kuznetsov wrote:
[...]
> >> Virtual guests under stress were getting into OOM easily and the OOM
> >> killer was even killing the udev process trying to online the
> >> memory.
> >
> > Do you happen to have any OOM report? I am really surprised that udev
> > would be an oom victim because that process is really small. Who is
> > consuming all the memory then?
> 
> It's been a while since I worked on this and unfortunatelly I don't have
> a log. From what I remember, the kernel itself was consuming all memory
> so *all* processes were victims.

This suggests that something is terminally broken!
 
> > Have you measured how much memory do we need to allocate to add one
> > memblock?
> 
> No, it's actually a good idea if we decide to do some sort of pre-allocation.
> 
> Just did a quick (and probably dirty) test, increasing guest memory from
> 4G to 8G (32 x 128mb blocks) require 68Mb of memory, so it's roughly 2Mb
> per block. It's really easy to trigger OOM for small guests.

So we need ~1.5% of the added memory. That doesn't sound like something
to trigger the OOM killer too easily, assuming that increase is not way too
large. Going from 256M (your earlier example) to 8G looks like it will eat
half the memory, which is still quite far away from OOM. I would call such
an increase bad memory balancing, though, to be honest. A more
reasonable memory balancing would go and double the available memory
IMHO. Anyway, I still think that hotplug is a terrible way to do memory
ballooning.
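
(Rough check of that claim: 256M -> 8G adds ~62 blocks of 128MB; at ~2MB of
metadata per block that is ~124MB, i.e. about half of the original 256MB.)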

[...]
> >> the completion was done by observing for the MEM_ONLINE event. This, of
> >> course, was slowing things down significantly and waiting for a
> >> userspace action in kernel is not a nice thing to have (not speaking
> >> about all other memory adding methods which had the same issue). Just
> >> removing this wait was leading us to the same OOM as the hypervisor was
> >> adding more and more memory and eventually even add_memory() was
> >> failing, udev and other processes were killed,...
> >
> > Yes, I agree that waiting on a user action from the kernel is very far
> > from ideal.
> >
> >> With the feature in place we have new memory available right after we do
> >> add_memory(), everything is serialized.
> >
> > What prevented you from onlining the memory explicitly from
> > hv_mem_hot_add path? Why do you need a user visible policy for that at
> > all? You could also add a parameter to add_memory that would do the same
> > thing. Or am I missing something?
> 
> We have different mechanisms for adding memory, I'm aware of at least 3:
> ACPI, Xen, Hyper-V. The issue I'm addressing is general enough, I'm
> pretty sure I can reproduce the issue on Xen, for example - just boot a
> small guest and try adding tons of memory. Why should we have different
> defaults for different technologies? 

Just make them all online the memory explicitly. I really do not see why
this should be decided by the poor user. To put it differently, when should I
disable auto-online when using Hyper-V or any other of the mentioned
technologies? CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE should simply die and
I would even be for killing the whole memhp_auto_online thing along the
way. This simply doesn't make any sense to me.
 
> And, BTW, the link to the previous discussion:
> https://groups.google.com/forum/#!msg/linux.kernel/AxvyuQjr4GY/TLC-K0sL_NEJ

I remember this discussion and objected to the approach back then as
well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-24 14:41                         ` Michal Hocko
@ 2017-02-24 15:05                           ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-24 15:05 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth, kys

Michal Hocko <mhocko@kernel.org> writes:

> On Fri 24-02-17 15:10:29, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Thu 23-02-17 19:14:27, Vitaly Kuznetsov wrote:
> [...]
>> >> Virtual guests under stress were getting into OOM easily and the OOM
>> >> killer was even killing the udev process trying to online the
>> >> memory.
>> >
>> > Do you happen to have any OOM report? I am really surprised that udev
>> > would be an oom victim because that process is really small. Who is
>> > consuming all the memory then?
>> 
>> It's been a while since I worked on this and unfortunatelly I don't have
>> a log. From what I remember, the kernel itself was consuming all memory
>> so *all* processes were victims.
>
> This suggests that something is terminally broken!
>
>> > Have you measured how much memory do we need to allocate to add one
>> > memblock?
>> 
>> No, it's actually a good idea if we decide to do some sort of pre-allocation.
>> 
>> Just did a quick (and probably dirty) test, increasing guest memory from
>> 4G to 8G (32 x 128mb blocks) require 68Mb of memory, so it's roughly 2Mb
>> per block. It's really easy to trigger OOM for small guests.
>
> So we need ~1.5% of the added memory. That doesn't sound like something
> to trigger OOM killer too easily. Assuming that increase is not way too
> large. Going from 256M (your earlier example) to 8G looks will eat half
> the memory which is still quite far away from the OOM.

And if the kernel itself takes 128MB of RAM (which is not something
extraordinary with many CPUs) we have zero left. Go to something bigger
than 8G and you die.

> I would call such
> an increase a bad memory balancing, though, to be honest. A more
> reasonable memory balancing would go and double the available memory
> IMHO. Anway, I still think that hotplug is a terrible way to do memory
> ballooning.

That's what we have in *all* modern hypervisors. And I don't see why
it's bad.

>
> [...]
>> >> the completion was done by observing for the MEM_ONLINE event. This, of
>> >> course, was slowing things down significantly and waiting for a
>> >> userspace action in kernel is not a nice thing to have (not speaking
>> >> about all other memory adding methods which had the same issue). Just
>> >> removing this wait was leading us to the same OOM as the hypervisor was
>> >> adding more and more memory and eventually even add_memory() was
>> >> failing, udev and other processes were killed,...
>> >
>> > Yes, I agree that waiting on a user action from the kernel is very far
>> > from ideal.
>> >
>> >> With the feature in place we have new memory available right after we do
>> >> add_memory(), everything is serialized.
>> >
>> > What prevented you from onlining the memory explicitly from
>> > hv_mem_hot_add path? Why do you need a user visible policy for that at
>> > all? You could also add a parameter to add_memory that would do the same
>> > thing. Or am I missing something?
>> 
>> We have different mechanisms for adding memory, I'm aware of at least 3:
>> ACPI, Xen, Hyper-V. The issue I'm addressing is general enough, I'm
>> pretty sure I can reproduce the issue on Xen, for example - just boot a
>> small guest and try adding tons of memory. Why should we have different
>> defaults for different technologies? 
>
> Just make them all online the memory explicitly. I really do not see why
> this should be decided by poor user. Put it differently, when should I
> disable auto online when using hyperV or other of the mentioned
> technologies? CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE should simply die and
> I would even be for killing the whole memhp_auto_online thing along the
> way. This simply doesn't make any sense to me.

ACPI, for example, is shared between KVM/QEMU, VMware and real
hardware. I can understand why bare metal guys might not want to have
auto-online by default (though major Linux distros ship the stupid
'offline' -> 'online' udev rule and nobody complains) -- they're doing
some physical action: going to a server room, opening the box,
plugging in memory, going back to their place. But with VMs it's not like
that. What's gonna be the default for ACPI then?

I don't understand why CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is
disturbing and why we need to take this choice away from distros. I
don't understand what we're gaining by replacing it with
per-memory-add-technology defaults.

>> And, BTW, the link to the previous discussion:
>> https://groups.google.com/forum/#!msg/linux.kernel/AxvyuQjr4GY/TLC-K0sL_NEJ
>
> I remember this discussion and objected to the approach back then as
> well.

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-24 15:05                           ` Vitaly Kuznetsov
@ 2017-02-24 15:32                             ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2017-02-24 15:32 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth, kys

On Fri 24-02-17 16:05:18, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Fri 24-02-17 15:10:29, Vitaly Kuznetsov wrote:
[...]
> >> Just did a quick (and probably dirty) test, increasing guest memory from
> >> 4G to 8G (32 x 128mb blocks) require 68Mb of memory, so it's roughly 2Mb
> >> per block. It's really easy to trigger OOM for small guests.
> >
> > So we need ~1.5% of the added memory. That doesn't sound like something
> > to trigger OOM killer too easily. Assuming that increase is not way too
> > large. Going from 256M (your earlier example) to 8G looks will eat half
> > the memory which is still quite far away from the OOM.
> 
> And if the kernel itself takes 128Mb of ram (which is not something
> extraordinary with many CPUs) we have zero left. Go to something bigger
> than 8G and you die.

Again, if you have 128M and jump to 8G then your memory balancing is
most probably broken.
 
> > I would call such
> > an increase a bad memory balancing, though, to be honest. A more
> > reasonable memory balancing would go and double the available memory
> > IMHO. Anway, I still think that hotplug is a terrible way to do memory
> > ballooning.
> 
> That's what we have in *all* modern hypervisors. And I don't see why
> it's bad.

Go and re-read the original thread. Dave has given many good arguments.

[...]
> > Just make them all online the memory explicitly. I really do not see why
> > this should be decided by poor user. Put it differently, when should I
> > disable auto online when using hyperV or other of the mentioned
> > technologies? CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE should simply die and
> > I would even be for killing the whole memhp_auto_online thing along the
> > way. This simply doesn't make any sense to me.
> 
> ACPI, for example, is shared between KVM/Qemu, Vmware and real
> hardware. I can understand why bare metall guys might not want to have
> auto-online by default (though, major linux distros ship the stupid
> 'offline' -> 'online' udev rule and nobody complains) -- they're doing
> some physical action - going to a server room, openning the box,
> plugging in memory, going back to their place but with VMs it's not like
> that. What's gonna be the default for ACPI then?
> 
> I don't understand why CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is

Because this is something a user has to think about and doesn't have a
reasonable way to decide. Our config space is also way too large!

> disturbing and why do we need to take this choice away from distros. I
> don't understand what we're gaining by replacing it with
> per-memory-add-technology defaults.

Because those technologies know that they want to have the memory online
as soon as possible. Jeez, just look at the hv code. It waits for
userspace to online the memory before going further. Why would it ever want
to have the tunable in the "offline" state? This just doesn't make any
sense. Look at how things get simplified if we get rid of this clutter:
---
 drivers/acpi/acpi_memhotplug.c |  2 +-
 drivers/base/memory.c          | 33 +--------------------------------
 drivers/hv/hv_balloon.c        | 26 +-------------------------
 drivers/s390/char/sclp_cmd.c   |  2 +-
 drivers/xen/balloon.c          |  2 +-
 include/linux/memory_hotplug.h |  4 +---
 mm/Kconfig                     | 16 ----------------
 mm/memory_hotplug.c            | 22 ++--------------------

 8 files changed, 8 insertions(+), 99 deletions(-)
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 6b0d3ef7309c..2b1c35fb36d1 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -228,7 +228,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
-		result = add_memory(node, info->start_addr, info->length);
+		result = add_memory(node, info->start_addr, info->length, false);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index fa26ffd25fa6..476c2c02f938 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -446,37 +446,6 @@ print_block_size(struct device *dev, struct device_attribute *attr,
 static DEVICE_ATTR(block_size_bytes, 0444, print_block_size, NULL);
 
 /*
- * Memory auto online policy.
- */
-
-static ssize_t
-show_auto_online_blocks(struct device *dev, struct device_attribute *attr,
-			char *buf)
-{
-	if (memhp_auto_online)
-		return sprintf(buf, "online\n");
-	else
-		return sprintf(buf, "offline\n");
-}
-
-static ssize_t
-store_auto_online_blocks(struct device *dev, struct device_attribute *attr,
-			 const char *buf, size_t count)
-{
-	if (sysfs_streq(buf, "online"))
-		memhp_auto_online = true;
-	else if (sysfs_streq(buf, "offline"))
-		memhp_auto_online = false;
-	else
-		return -EINVAL;
-
-	return count;
-}
-
-static DEVICE_ATTR(auto_online_blocks, 0644, show_auto_online_blocks,
-		   store_auto_online_blocks);
-
-/*
  * Some architectures will have custom drivers to do this, and
  * will not need to do it from userspace.  The fake hot-add code
  * as well as ppc64 will do all of their discovery in userspace
@@ -500,7 +469,7 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
 
 	nid = memory_add_physaddr_to_nid(phys_addr);
 	ret = add_memory(nid, phys_addr,
-			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+			 MIN_MEMORY_BLOCK_SIZE * sections_per_block, false);
 
 	if (ret)
 		goto out;
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 14c3dc4bd23c..3e052bedade5 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -535,11 +535,6 @@ struct hv_dynmem_device {
 	bool host_specified_ha_region;
 
 	/*
-	 * State to synchronize hot-add.
-	 */
-	struct completion  ol_waitevent;
-	bool ha_waiting;
-	/*
 	 * This thread handles hot-add
 	 * requests from the host as well as notifying
 	 * the host with regards to memory pressure in
@@ -587,11 +582,6 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val,
 		spin_lock_irqsave(&dm_device.ha_lock, flags);
 		dm_device.num_pages_onlined += mem->nr_pages;
 		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
-	case MEM_CANCEL_ONLINE:
-		if (dm_device.ha_waiting) {
-			dm_device.ha_waiting = false;
-			complete(&dm_device.ol_waitevent);
-		}
 		break;
 
 	case MEM_OFFLINE:
@@ -683,12 +673,9 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 		has->covered_end_pfn +=  processed_pfn;
 		spin_unlock_irqrestore(&dm_device.ha_lock, flags);
 
-		init_completion(&dm_device.ol_waitevent);
-		dm_device.ha_waiting = !memhp_auto_online;
-
 		nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
 		ret = add_memory(nid, PFN_PHYS((start_pfn)),
-				(HA_CHUNK << PAGE_SHIFT));
+				(HA_CHUNK << PAGE_SHIFT), true);
 
 		if (ret) {
 			pr_warn("hot_add memory failed error is %d\n", ret);
@@ -708,17 +695,6 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
 			spin_unlock_irqrestore(&dm_device.ha_lock, flags);
 			break;
 		}
-
-		/*
-		 * Wait for the memory block to be onlined when memory onlining
-		 * is done outside of kernel (memhp_auto_online). Since the hot
-		 * add has succeeded, it is ok to proceed even if the pages in
-		 * the hot added region have not been "onlined" within the
-		 * allowed time.
-		 */
-		if (dm_device.ha_waiting)
-			wait_for_completion_timeout(&dm_device.ol_waitevent,
-						    5*HZ);
 		post_status(&dm_device);
 	}
 
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index b9c5522b8a68..f54c621195b6 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -404,7 +404,7 @@ static void __init add_memory_merged(u16 rn)
 	if (!size)
 		goto skip_add;
 	for (addr = start; addr < start + size; addr += block_size)
-		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size);
+		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size, false);
 skip_add:
 	first_rn = rn;
 	num = 1;
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index db107fa50ca1..fce961de8771 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -355,7 +355,7 @@ static enum bp_state reserve_additional_memory(void)
 	 * callers drop the mutex before trying again.
 	 */
 	mutex_unlock(&balloon_mutex);
-	rc = add_memory_resource(nid, resource, memhp_auto_online);
+	rc = add_memory_resource(nid, resource, true);
 	mutex_lock(&balloon_mutex);
 
 	if (rc) {
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 134a2f69c21a..a72f7f64ee26 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -100,8 +100,6 @@ extern void __online_page_free(struct page *page);
 
 extern int try_online_node(int nid);
 
-extern bool memhp_auto_online;
-
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern bool is_pageblock_removable_nolock(struct page *page);
 extern int arch_remove_memory(u64 start, u64 size);
@@ -272,7 +270,7 @@ static inline void remove_memory(int nid, u64 start, u64 size) {}
 
 extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
 		void *arg, int (*func)(struct memory_block *, void *));
-extern int add_memory(int nid, u64 start, u64 size);
+extern int add_memory(int nid, u64 start, u64 size, bool online);
 extern int add_memory_resource(int nid, struct resource *resource, bool online);
 extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
 		bool for_device);
diff --git a/mm/Kconfig b/mm/Kconfig
index 9b8fccb969dc..a64a3bca43d5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,22 +193,6 @@ config MEMORY_HOTPLUG_SPARSE
 	def_bool y
 	depends on SPARSEMEM && MEMORY_HOTPLUG
 
-config MEMORY_HOTPLUG_DEFAULT_ONLINE
-        bool "Online the newly added memory blocks by default"
-        default n
-        depends on MEMORY_HOTPLUG
-        help
-	  This option sets the default policy setting for memory hotplug
-	  onlining policy (/sys/devices/system/memory/auto_online_blocks) which
-	  determines what happens to newly added memory regions. Policy setting
-	  can always be changed at runtime.
-	  See Documentation/memory-hotplug.txt for more information.
-
-	  Say Y here if you want all hot-plugged memory blocks to appear in
-	  'online' state by default.
-	  Say N here if you want the default policy to keep all hot-plugged
-	  memory blocks in 'offline' state.
-
 config MEMORY_HOTREMOVE
 	bool "Allow for memory hot remove"
 	select MEMORY_ISOLATION
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c35dd1976574..8520c9166f47 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -78,24 +78,6 @@ static struct {
 #define memhp_lock_acquire()      lock_map_acquire(&mem_hotplug.dep_map)
 #define memhp_lock_release()      lock_map_release(&mem_hotplug.dep_map)
 
-#ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
-bool memhp_auto_online;
-#else
-bool memhp_auto_online = true;
-#endif
-EXPORT_SYMBOL_GPL(memhp_auto_online);
-
-static int __init setup_memhp_default_state(char *str)
-{
-	if (!strcmp(str, "online"))
-		memhp_auto_online = true;
-	else if (!strcmp(str, "offline"))
-		memhp_auto_online = false;
-
-	return 1;
-}
-__setup("memhp_default_state=", setup_memhp_default_state);
-
 void get_online_mems(void)
 {
 	might_sleep();
@@ -1420,7 +1402,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 }
 EXPORT_SYMBOL_GPL(add_memory_resource);
 
-int __ref add_memory(int nid, u64 start, u64 size)
+int __ref add_memory(int nid, u64 start, u64 size, bool online)
 {
 	struct resource *res;
 	int ret;
@@ -1429,7 +1411,7 @@ int __ref add_memory(int nid, u64 start, u64 size)
 	if (IS_ERR(res))
 		return PTR_ERR(res);
 
-	ret = add_memory_resource(nid, res, memhp_auto_online);
+	ret = add_memory_resource(nid, res, online);
 	if (ret < 0)
 		release_memory_resource(res);
 	return ret;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-24 15:32                             ` Michal Hocko
@ 2017-02-24 16:09                               ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-24 16:09 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth, kys

Michal Hocko <mhocko@kernel.org> writes:

> On Fri 24-02-17 16:05:18, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Fri 24-02-17 15:10:29, Vitaly Kuznetsov wrote:
> [...]
>> >> Just did a quick (and probably dirty) test, increasing guest memory from
>> >> 4G to 8G (32 x 128mb blocks) require 68Mb of memory, so it's roughly 2Mb
>> >> per block. It's really easy to trigger OOM for small guests.
>> >
>> > So we need ~1.5% of the added memory. That doesn't sound like something
>> > to trigger OOM killer too easily. Assuming that increase is not way too
>> > large. Going from 256M (your earlier example) to 8G looks will eat half
>> > the memory which is still quite far away from the OOM.
>> 
>> And if the kernel itself takes 128Mb of ram (which is not something
>> extraordinary with many CPUs) we have zero left. Go to something bigger
>> than 8G and you die.
>
> Again, if you have 128M and jump to 8G then your memory balancing is
> most probably broken.
>

I don't understand what balancing you're talking about. I have a small
guest and I want to add more memory to it and the result is ... OOM. Not
something I expected.

>> > I would call such
>> > an increase a bad memory balancing, though, to be honest. A more
>> > reasonable memory balancing would go and double the available memory
>> > IMHO. Anway, I still think that hotplug is a terrible way to do memory
>> > ballooning.
>> 
>> That's what we have in *all* modern hypervisors. And I don't see why
>> it's bad.
>
> Go and re-read the original thread. Dave has given many good arguments.
>

Are we discussing taking away the memory hotplug feature from all
hypervisors here?

>> > Just make them all online the memory explicitly. I really do not see why
>> > this should be decided by poor user. Put it differently, when should I
>> > disable auto online when using hyperV or other of the mentioned
>> > technologies? CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE should simply die and
>> > I would even be for killing the whole memhp_auto_online thing along the
>> > way. This simply doesn't make any sense to me.
>> 
>> ACPI, for example, is shared between KVM/Qemu, Vmware and real
>> hardware. I can understand why bare metall guys might not want to have
>> auto-online by default (though, major linux distros ship the stupid
>> 'offline' -> 'online' udev rule and nobody complains) -- they're doing
>> some physical action - going to a server room, openning the box,
>> plugging in memory, going back to their place but with VMs it's not like
>> that. What's gonna be the default for ACPI then?
>> 
>> I don't understand why CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is
>
> Because this is something a user has to think about and doesn't have a
> reasonable way to decide. Our config space is also way too large!

Config space is for distros, not users.

>
>> disturbing and why do we need to take this choice away from distros. I
>> don't understand what we're gaining by replacing it with
>> per-memory-add-technology defaults.
>
> Because those technologies know that they want to have the memory online
> as soon as possible. Jeez, just look at the hv code. It waits for the
> userspace to online memory before going further. Why would it ever want
> to have the tunable in "offline" state? This just doesn't make any
> sense. Look at how things get simplified if we get rid of this clutter

While this will most probably work for me I still disagree with the
concept of 'one size fits all' here and the default 'false' for ACPI,
we're taking away the feature from KVM/Vmware folks so they'll again
come up with the udev rule which has known issues.
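
The udev rule being referred to is typically a one-liner along these
lines (the rule file name and exact match vary by distribution; shown
only as an illustration):

  SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"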

[snip].

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-24 16:09                               ` Vitaly Kuznetsov
@ 2017-02-24 16:23                                 ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2017-02-24 16:23 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth, kys

On Fri 24-02-17 17:09:13, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Fri 24-02-17 16:05:18, Vitaly Kuznetsov wrote:
> >> Michal Hocko <mhocko@kernel.org> writes:
> >> 
> >> > On Fri 24-02-17 15:10:29, Vitaly Kuznetsov wrote:
> > [...]
> >> >> Just did a quick (and probably dirty) test, increasing guest memory from
> >> >> 4G to 8G (32 x 128mb blocks) require 68Mb of memory, so it's roughly 2Mb
> >> >> per block. It's really easy to trigger OOM for small guests.
> >> >
> >> > So we need ~1.5% of the added memory. That doesn't sound like something
> >> > to trigger OOM killer too easily. Assuming that increase is not way too
> >> > large. Going from 256M (your earlier example) to 8G looks will eat half
> >> > the memory which is still quite far away from the OOM.
> >> 
> >> And if the kernel itself takes 128Mb of ram (which is not something
> >> extraordinary with many CPUs) we have zero left. Go to something bigger
> >> than 8G and you die.
> >
> > Again, if you have 128M and jump to 8G then your memory balancing is
> > most probably broken.
> >
> 
> I don't understand what balancing you're talking about.

balancing = dynamic memory resizing depending on the demand, both
internal (inside the guest) and external (on the host, to balance
memory between different guests).

> I have a small
> guest and I want to add more memory to it and the result is ... OOM. Not
> something I expected.

Which is not all that unexpected if you use a technology which has to
allocate memory in order to add more memory.

> >> > I would call such
> >> > an increase a bad memory balancing, though, to be honest. A more
> >> > reasonable memory balancing would go and double the available memory
> >> > IMHO. Anway, I still think that hotplug is a terrible way to do memory
> >> > ballooning.
> >> 
> >> That's what we have in *all* modern hypervisors. And I don't see why
> >> it's bad.
> >
> > Go and re-read the original thread. Dave has given many good arguments.
> >
> 
> Are we discussing taking away the memory hotplug feature from all
> hypervisors here?

No. I just consider it a bad idea because it has many problems and will
never be 100% reliable.

[...]
> >> I don't understand why CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is
> >
> > Because this is something a user has to think about and doesn't have a
> > reasonable way to decide. Our config space is also way too large!
> 
> Config space is for distros, not users.

Maybe you haven't noticed, but there are people compiling their own
kernels as well. But even distros are not really in a great position to
answer this question, because it depends on the specific use case.

> >> disturbing and why do we need to take this choice away from distros. I
> >> don't understand what we're gaining by replacing it with
> >> per-memory-add-technology defaults.
> >
> > Because those technologies know that they want to have the memory online
> > as soon as possible. Jeez, just look at the hv code. It waits for the
> > userspace to online memory before going further. Why would it ever want
> > to have the tunable in "offline" state? This just doesn't make any
> > sense. Look at how things get simplified if we get rid of this clutter
> 
> While this will most probably work for me I still disagree with the
> concept of 'one size fits all' here and the default 'false' for ACPI,
> we're taking away the feature from KVM/Vmware folks so they'll again
> come up with the udev rule which has known issues.

Well, AFAIU acpi_memory_device_add is a standard way how to announce
physical memory added to the system. Where does the KVM/VMware depend on
this to do memory ballooning?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-24 16:23                                 ` Michal Hocko
@ 2017-02-24 16:40                                   ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-24 16:40 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth, kys

Michal Hocko <mhocko@kernel.org> writes:

> On Fri 24-02-17 17:09:13, Vitaly Kuznetsov wrote:

>> I have a smal  guest and I want to add more memory to it and the
>> result is ... OOM. Not something I expected.
>
> Which is not all that unexpected if you use a technology which has to
> allocate memory in order to add more memory.
>

My point is: why should users care about this? It's our problem that we
can't always hotplug memory. And I think this problem is solvable.

>> 
>> While this will most probably work for me I still disagree with the
>> concept of 'one size fits all' here and the default 'false' for ACPI,
>> we're taking away the feature from KVM/Vmware folks so they'll again
>> come up with the udev rule which has known issues.
>
> Well, AFAIU acpi_memory_device_add is a standard way how to announce
> physical memory added to the system. Where does the KVM/VMware depend on
> this to do memory ballooning?

As far as I understand memory hotplug in KVM/VMware is pure ACPI memory
hotplug, there is no specific code for it.
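
As an illustration of that host side, with KVM/QEMU a hot-add is a
plain ACPI DIMM (QEMU monitor commands; the ids and sizes here are made
up for the example):

  # guest started with hotpluggable slots, e.g. -m 4G,slots=8,maxmem=16G
  (qemu) object_add memory-backend-ram,id=mem1,size=4G
  (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
  # the guest then sees an ordinary ACPI memory device and goes through
  # acpi_memory_device_add() -> add_memory(); the new blocks start
  # offline unless some onlining policy kicks in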

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-24 16:40                                   ` Vitaly Kuznetsov
@ 2017-02-24 16:52                                     ` Michal Hocko
  -1 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2017-02-24 16:52 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth, kys

On Fri 24-02-17 17:40:25, Vitaly Kuznetsov wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Fri 24-02-17 17:09:13, Vitaly Kuznetsov wrote:
[...]
> >> While this will most probably work for me I still disagree with the
> >> concept of 'one size fits all' here and the default 'false' for ACPI,
> >> we're taking away the feature from KVM/Vmware folks so they'll again
> >> come up with the udev rule which has known issues.
> >
> > Well, AFAIU acpi_memory_device_add is a standard way how to announce
> > physical memory added to the system. Where does the KVM/VMware depend on
> > this to do memory ballooning?
> 
> As far as I understand memory hotplug in KVM/VMware is pure ACPI memory
> hotplug, there is no specific code for it.

VMware has its ballooning driver AFAIK and I have no idea what KVM uses.
Anyway, acpi_memory_device_add is no different from doing a physical
memory hotplug IIUC so there shouldn't be any difference to how it is
handled.

I will post the patch as an RFC sometime next week; let's see what
others think about it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline
  2017-02-24 16:52                                     ` Michal Hocko
@ 2017-02-24 17:06                                       ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-24 17:06 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Nathan Fontenot, linux-mm, mpe, linuxppc-dev, mdroth, kys

Michal Hocko <mhocko@kernel.org> writes:

> On Fri 24-02-17 17:40:25, Vitaly Kuznetsov wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Fri 24-02-17 17:09:13, Vitaly Kuznetsov wrote:
> [...]
>> >> While this will most probably work for me I still disagree with the
>> >> concept of 'one size fits all' here and the default 'false' for ACPI,
>> >> we're taking away the feature from KVM/Vmware folks so they'll again
>> >> come up with the udev rule which has known issues.
>> >
>> > Well, AFAIU acpi_memory_device_add is a standard way how to announce
>> > physical memory added to the system. Where does the KVM/VMware depend on
>> > this to do memory ballooning?
>> 
>> As far as I understand memory hotplug in KVM/VMware is pure ACPI memory
>> hotplug, there is no specific code for it.
>
> VMware has its ballooning driver AFAIK and I have no idea what KVM
> uses.

They both have ballooning drivers, but ballooning is a different
thing. BTW, both Xen and Hyper-V have ballooning too; it is a separate
mechanism and not what we're discussing here.

> Anyway, acpi_memory_device_add is no different from doing a physical
> memory hotplug IIUC so there shouldn't be any difference to how it is
> handled.

With the patch you suggest we'll have different memory hotplug defaults
for different virtualization technologies. If you're suggesting an
unconditional default of onlining, it should probably be the same for
all hypervisors we support (IMO).
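
For reference, the per-caller defaults hard-coded by the diff earlier
in this thread are:

  acpi_memhotplug.c        add_memory(..., false)
  drivers/base/memory.c    add_memory(..., false)   /* probe interface */
  hv_balloon.c             add_memory(..., true)
  sclp_cmd.c (s390)        add_memory(..., false)
  xen/balloon.c            add_memory_resource(..., true)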

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread

Thread overview: 20 messages
2017-02-21 17:22 [RFC PATCH] memory-hotplug: Use dev_online for memhp_auto_offline Nathan Fontenot
2017-02-22  9:32 ` Vitaly Kuznetsov
2017-02-23 12:56   ` Michal Hocko
2017-02-23 13:31     ` Vitaly Kuznetsov
2017-02-23 15:09       ` Michal Hocko
2017-02-23 15:49         ` Vitaly Kuznetsov
2017-02-23 16:12           ` Michal Hocko
2017-02-23 16:36             ` Vitaly Kuznetsov
2017-02-23 17:41               ` Michal Hocko
2017-02-23 18:14                 ` Vitaly Kuznetsov
2017-02-24 13:37                   ` Michal Hocko
2017-02-24 14:10                     ` Vitaly Kuznetsov
2017-02-24 14:41                       ` Michal Hocko
2017-02-24 15:05                         ` Vitaly Kuznetsov
2017-02-24 15:32                           ` Michal Hocko
2017-02-24 16:09                             ` Vitaly Kuznetsov
2017-02-24 16:23                               ` Michal Hocko
2017-02-24 16:40                                 ` Vitaly Kuznetsov
2017-02-24 16:52                                   ` Michal Hocko
2017-02-24 17:06                                     ` Vitaly Kuznetsov
