[PATCH 0/2] mm, memory_hotplug: redefine memory offline retry logic

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] mm, memory_hotplug: redefine memory offline retry logic
@ 2017-09-04  8:21 ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-04  8:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

Hi,
while testing memory hotplug on a large 4TB machine we have noticed that
memory offlining is just too eager to fail. The primary reason is that
the retry logic is just too easy to give up. We have 4 ways out of the
offline
	- we have a permanent failure (isolation or memory notifiers fail,
	  or hugetlb pages cannot be dropped)
	- userspace sends a signal
	- a hardcoded 120s timeout expires
	- page migration fails 5 times
This is way too convoluted and it doesn't scale very well. We have seen both
temporary migration failures as well as 120s being triggered. After removing
those restrictions we were able to pass stress testing during memory hot
remove without any other negative side effects observed. Therefore I suggest
dropping both hard coded policies. I couldn't have found any specific reason
for them in the changelog. I neither didn't get any response [1] from Kamezawa.
If we need some upper bound - e.g. timeout based - then we should have a proper
and user defined policy for that. In any case there should be a clear use case
when introducing it.

Any comments, objections?

Shortlog
Michal Hocko (2):
      mm, memory_hotplug: do not fail offlining too early
      mm, memory_hotplug: remove timeout from __offline_memory

Diffstat
 mm/memory_hotplug.c | 48 ++++++++++++------------------------------------
 1 file changed, 12 insertions(+), 36 deletions(-)

[1] http://lkml.kernel.org/r/20170828094316.GF17097@dhcp22.suse.cz

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 0/2] mm, memory_hotplug: redefine memory offline retry logic
@ 2017-09-04  8:21 ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-04  8:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

Hi,
while testing memory hotplug on a large 4TB machine we have noticed that
memory offlining is just too eager to fail. The primary reason is that
the retry logic is just too easy to give up. We have 4 ways out of the
offline
	- we have a permanent failure (isolation or memory notifiers fail,
	  or hugetlb pages cannot be dropped)
	- userspace sends a signal
	- a hardcoded 120s timeout expires
	- page migration fails 5 times
This is way too convoluted and it doesn't scale very well. We have seen both
temporary migration failures as well as 120s being triggered. After removing
those restrictions we were able to pass stress testing during memory hot
remove without any other negative side effects observed. Therefore I suggest
dropping both hard coded policies. I couldn't have found any specific reason
for them in the changelog. I neither didn't get any response [1] from Kamezawa.
If we need some upper bound - e.g. timeout based - then we should have a proper
and user defined policy for that. In any case there should be a clear use case
when introducing it.

Any comments, objections?

Shortlog
Michal Hocko (2):
      mm, memory_hotplug: do not fail offlining too early
      mm, memory_hotplug: remove timeout from __offline_memory

Diffstat
 mm/memory_hotplug.c | 48 ++++++++++++------------------------------------
 1 file changed, 12 insertions(+), 36 deletions(-)

[1] http://lkml.kernel.org/r/20170828094316.GF17097@dhcp22.suse.cz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-09-04  8:21 ` Michal Hocko
@ 2017-09-04  8:21   ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-04  8:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Memory offlining can fail just too eagerly under a heavy memory pressure.

[ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
[ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
[ 5410.336811] page dumped because: isolation failed
[ 5410.336813] page->mem_cgroup:ffff8801cd662000
[ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed

Isolation has failed here because the page is not on LRU. Most probably
because it was on the pcp LRU cache or it has been removed from the LRU
already but it hasn't been freed yet. In both cases the page doesn't look
non-migrable so retrying more makes sense.

__offline_pages seems rather cluttered when it comes to the retry
logic. We have 5 retries at maximum and a timeout. We could argue
whether the timeout makes sense but failing just because of a race when
somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
wrong. It only takes it to race with a process which unmaps some pages
and remove them from the LRU list and we can fail the whole offline
because of something that is a temporary condition and actually not
harmful for the offline. Please note that unmovable pages should be
already excluded during start_isolate_page_range.

Fix this by removing the max retry count and only rely on the timeout
resp. interruption by a signal from the userspace. Also retry rather
than fail when check_pages_isolated sees some !free pages because those
could be a result of the race as well.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory_hotplug.c | 40 ++++++++++------------------------------
 1 file changed, 10 insertions(+), 30 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 459bbc182d10..c9dcbe6d2ac6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1597,7 +1597,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 {
 	unsigned long pfn, nr_pages, expire;
 	long offlined_pages;
-	int ret, drain, retry_max, node;
+	int ret, node;
 	unsigned long flags;
 	unsigned long valid_start, valid_end;
 	struct zone *zone;
@@ -1634,43 +1634,25 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 	pfn = start_pfn;
 	expire = jiffies + timeout;
-	drain = 0;
-	retry_max = 5;
 repeat:
 	/* start memory hot removal */
-	ret = -EAGAIN;
+	ret = -EBUSY;
 	if (time_after(jiffies, expire))
 		goto failed_removal;
 	ret = -EINTR;
 	if (signal_pending(current))
 		goto failed_removal;
-	ret = 0;
-	if (drain) {
-		lru_add_drain_all_cpuslocked();
-		cond_resched();
-		drain_all_pages(zone);
-	}
+
+	cond_resched();
+	lru_add_drain_all_cpuslocked();
+	drain_all_pages(zone);
 
 	pfn = scan_movable_pages(start_pfn, end_pfn);
 	if (pfn) { /* We have movable pages */
 		ret = do_migrate_range(pfn, end_pfn);
-		if (!ret) {
-			drain = 1;
-			goto repeat;
-		} else {
-			if (ret < 0)
-				if (--retry_max == 0)
-					goto failed_removal;
-			yield();
-			drain = 1;
-			goto repeat;
-		}
+		goto repeat;
 	}
-	/* drain all zone's lru pagevec, this is asynchronous... */
-	lru_add_drain_all_cpuslocked();
-	yield();
-	/* drain pcp pages, this is synchronous. */
-	drain_all_pages(zone);
+
 	/*
 	 * dissolve free hugepages in the memory block before doing offlining
 	 * actually in order to make hugetlbfs's object counting consistent.
@@ -1680,10 +1662,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		goto failed_removal;
 	/* check again */
 	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
-	if (offlined_pages < 0) {
-		ret = -EBUSY;
-		goto failed_removal;
-	}
+	if (offlined_pages < 0)
+		goto repeat;
 	pr_info("Offlined Pages %ld\n", offlined_pages);
 	/* Ok, all of our target is isolated.
 	   We cannot do rollback at this point. */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-09-04  8:21   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-04  8:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Memory offlining can fail just too eagerly under a heavy memory pressure.

[ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
[ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
[ 5410.336811] page dumped because: isolation failed
[ 5410.336813] page->mem_cgroup:ffff8801cd662000
[ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed

Isolation has failed here because the page is not on LRU. Most probably
because it was on the pcp LRU cache or it has been removed from the LRU
already but it hasn't been freed yet. In both cases the page doesn't look
non-migrable so retrying more makes sense.

__offline_pages seems rather cluttered when it comes to the retry
logic. We have 5 retries at maximum and a timeout. We could argue
whether the timeout makes sense but failing just because of a race when
somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
wrong. It only takes it to race with a process which unmaps some pages
and remove them from the LRU list and we can fail the whole offline
because of something that is a temporary condition and actually not
harmful for the offline. Please note that unmovable pages should be
already excluded during start_isolate_page_range.

Fix this by removing the max retry count and only rely on the timeout
resp. interruption by a signal from the userspace. Also retry rather
than fail when check_pages_isolated sees some !free pages because those
could be a result of the race as well.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory_hotplug.c | 40 ++++++++++------------------------------
 1 file changed, 10 insertions(+), 30 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 459bbc182d10..c9dcbe6d2ac6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1597,7 +1597,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 {
 	unsigned long pfn, nr_pages, expire;
 	long offlined_pages;
-	int ret, drain, retry_max, node;
+	int ret, node;
 	unsigned long flags;
 	unsigned long valid_start, valid_end;
 	struct zone *zone;
@@ -1634,43 +1634,25 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 	pfn = start_pfn;
 	expire = jiffies + timeout;
-	drain = 0;
-	retry_max = 5;
 repeat:
 	/* start memory hot removal */
-	ret = -EAGAIN;
+	ret = -EBUSY;
 	if (time_after(jiffies, expire))
 		goto failed_removal;
 	ret = -EINTR;
 	if (signal_pending(current))
 		goto failed_removal;
-	ret = 0;
-	if (drain) {
-		lru_add_drain_all_cpuslocked();
-		cond_resched();
-		drain_all_pages(zone);
-	}
+
+	cond_resched();
+	lru_add_drain_all_cpuslocked();
+	drain_all_pages(zone);
 
 	pfn = scan_movable_pages(start_pfn, end_pfn);
 	if (pfn) { /* We have movable pages */
 		ret = do_migrate_range(pfn, end_pfn);
-		if (!ret) {
-			drain = 1;
-			goto repeat;
-		} else {
-			if (ret < 0)
-				if (--retry_max == 0)
-					goto failed_removal;
-			yield();
-			drain = 1;
-			goto repeat;
-		}
+		goto repeat;
 	}
-	/* drain all zone's lru pagevec, this is asynchronous... */
-	lru_add_drain_all_cpuslocked();
-	yield();
-	/* drain pcp pages, this is synchronous. */
-	drain_all_pages(zone);
+
 	/*
 	 * dissolve free hugepages in the memory block before doing offlining
 	 * actually in order to make hugetlbfs's object counting consistent.
@@ -1680,10 +1662,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		goto failed_removal;
 	/* check again */
 	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
-	if (offlined_pages < 0) {
-		ret = -EBUSY;
-		goto failed_removal;
-	}
+	if (offlined_pages < 0)
+		goto repeat;
 	pr_info("Offlined Pages %ld\n", offlined_pages);
 	/* Ok, all of our target is isolated.
 	   We cannot do rollback at this point. */
-- 
2.14.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
  2017-09-04  8:21 ` Michal Hocko
@ 2017-09-04  8:21   ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-04  8:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

We have a hardcoded 120s timeout after which the memory offline fails
basically since the hot remove has been introduced. This is essentially
a policy implemented in the kernel. Moreover there is no way to adjust
the timeout and so we are sometimes facing memory offline failures if
the system is under a heavy memory pressure or very intensive CPU
workload on large machines.

It is not very clear what purpose the timeout actually serves. The
offline operation is interruptible by a signal so if userspace wants
some timeout based termination this can be done trivially by sending a
signal.

If there is a strong usecase to do this from the kernel then we should
do it properly and have a it tunable from the userspace with the timeout
disabled by default along with the explanation who uses it and for what
purporse.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory_hotplug.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c9dcbe6d2ac6..b8a85c11360e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1593,9 +1593,9 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 }
 
 static int __ref __offline_pages(unsigned long start_pfn,
-		  unsigned long end_pfn, unsigned long timeout)
+		  unsigned long end_pfn)
 {
-	unsigned long pfn, nr_pages, expire;
+	unsigned long pfn, nr_pages;
 	long offlined_pages;
 	int ret, node;
 	unsigned long flags;
@@ -1633,12 +1633,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		goto failed_removal;
 
 	pfn = start_pfn;
-	expire = jiffies + timeout;
 repeat:
 	/* start memory hot removal */
-	ret = -EBUSY;
-	if (time_after(jiffies, expire))
-		goto failed_removal;
 	ret = -EINTR;
 	if (signal_pending(current))
 		goto failed_removal;
@@ -1711,7 +1707,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 /* Must be protected by mem_hotplug_begin() or a device_lock */
 int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
-	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
+	return __offline_pages(start_pfn, start_pfn + nr_pages);
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
@ 2017-09-04  8:21   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-04  8:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

We have a hardcoded 120s timeout after which the memory offline fails
basically since the hot remove has been introduced. This is essentially
a policy implemented in the kernel. Moreover there is no way to adjust
the timeout and so we are sometimes facing memory offline failures if
the system is under a heavy memory pressure or very intensive CPU
workload on large machines.

It is not very clear what purpose the timeout actually serves. The
offline operation is interruptible by a signal so if userspace wants
some timeout based termination this can be done trivially by sending a
signal.

If there is a strong usecase to do this from the kernel then we should
do it properly and have a it tunable from the userspace with the timeout
disabled by default along with the explanation who uses it and for what
purporse.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory_hotplug.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c9dcbe6d2ac6..b8a85c11360e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1593,9 +1593,9 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 }
 
 static int __ref __offline_pages(unsigned long start_pfn,
-		  unsigned long end_pfn, unsigned long timeout)
+		  unsigned long end_pfn)
 {
-	unsigned long pfn, nr_pages, expire;
+	unsigned long pfn, nr_pages;
 	long offlined_pages;
 	int ret, node;
 	unsigned long flags;
@@ -1633,12 +1633,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		goto failed_removal;
 
 	pfn = start_pfn;
-	expire = jiffies + timeout;
 repeat:
 	/* start memory hot removal */
-	ret = -EBUSY;
-	if (time_after(jiffies, expire))
-		goto failed_removal;
 	ret = -EINTR;
 	if (signal_pending(current))
 		goto failed_removal;
@@ -1711,7 +1707,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 /* Must be protected by mem_hotplug_begin() or a device_lock */
 int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
-	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
+	return __offline_pages(start_pfn, start_pfn + nr_pages);
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
-- 
2.14.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
  2017-09-04  8:21   ` Michal Hocko
@ 2017-09-04  8:58     ` Xishi Qiu
  -1 siblings, 0 replies; 68+ messages in thread
From: Xishi Qiu @ 2017-09-04  8:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko

On 2017/9/4 16:21, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> We have a hardcoded 120s timeout after which the memory offline fails
> basically since the hot remove has been introduced. This is essentially
> a policy implemented in the kernel. Moreover there is no way to adjust
> the timeout and so we are sometimes facing memory offline failures if
> the system is under a heavy memory pressure or very intensive CPU
> workload on large machines.
> 
> It is not very clear what purpose the timeout actually serves. The
> offline operation is interruptible by a signal so if userspace wants

Hi Michal,

If the user know what he should do if migration for a long time,
it is OK, but I don't think all the users know this operation
(e.g. ctrl + c) and the affect.

Thanks,
Xishi Qiu

> some timeout based termination this can be done trivially by sending a
> signal.
> 
> If there is a strong usecase to do this from the kernel then we should
> do it properly and have a it tunable from the userspace with the timeout
> disabled by default along with the explanation who uses it and for what
> purporse.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memory_hotplug.c | 10 +++-------
>  1 file changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c9dcbe6d2ac6..b8a85c11360e 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1593,9 +1593,9 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
>  }
>  
>  static int __ref __offline_pages(unsigned long start_pfn,
> -		  unsigned long end_pfn, unsigned long timeout)
> +		  unsigned long end_pfn)
>  {
> -	unsigned long pfn, nr_pages, expire;
> +	unsigned long pfn, nr_pages;
>  	long offlined_pages;
>  	int ret, node;
>  	unsigned long flags;
> @@ -1633,12 +1633,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  		goto failed_removal;
>  
>  	pfn = start_pfn;
> -	expire = jiffies + timeout;
>  repeat:
>  	/* start memory hot removal */
> -	ret = -EBUSY;
> -	if (time_after(jiffies, expire))
> -		goto failed_removal;
>  	ret = -EINTR;
>  	if (signal_pending(current))
>  		goto failed_removal;
> @@ -1711,7 +1707,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  /* Must be protected by mem_hotplug_begin() or a device_lock */
>  int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  {
> -	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
> +	return __offline_pages(start_pfn, start_pfn + nr_pages);
>  }
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
@ 2017-09-04  8:58     ` Xishi Qiu
  0 siblings, 0 replies; 68+ messages in thread
From: Xishi Qiu @ 2017-09-04  8:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko

On 2017/9/4 16:21, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> We have a hardcoded 120s timeout after which the memory offline fails
> basically since the hot remove has been introduced. This is essentially
> a policy implemented in the kernel. Moreover there is no way to adjust
> the timeout and so we are sometimes facing memory offline failures if
> the system is under a heavy memory pressure or very intensive CPU
> workload on large machines.
> 
> It is not very clear what purpose the timeout actually serves. The
> offline operation is interruptible by a signal so if userspace wants

Hi Michal,

If the user know what he should do if migration for a long time,
it is OK, but I don't think all the users know this operation
(e.g. ctrl + c) and the affect.

Thanks,
Xishi Qiu

> some timeout based termination this can be done trivially by sending a
> signal.
> 
> If there is a strong usecase to do this from the kernel then we should
> do it properly and have a it tunable from the userspace with the timeout
> disabled by default along with the explanation who uses it and for what
> purporse.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memory_hotplug.c | 10 +++-------
>  1 file changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c9dcbe6d2ac6..b8a85c11360e 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1593,9 +1593,9 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
>  }
>  
>  static int __ref __offline_pages(unsigned long start_pfn,
> -		  unsigned long end_pfn, unsigned long timeout)
> +		  unsigned long end_pfn)
>  {
> -	unsigned long pfn, nr_pages, expire;
> +	unsigned long pfn, nr_pages;
>  	long offlined_pages;
>  	int ret, node;
>  	unsigned long flags;
> @@ -1633,12 +1633,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  		goto failed_removal;
>  
>  	pfn = start_pfn;
> -	expire = jiffies + timeout;
>  repeat:
>  	/* start memory hot removal */
> -	ret = -EBUSY;
> -	if (time_after(jiffies, expire))
> -		goto failed_removal;
>  	ret = -EINTR;
>  	if (signal_pending(current))
>  		goto failed_removal;
> @@ -1711,7 +1707,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  /* Must be protected by mem_hotplug_begin() or a device_lock */
>  int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  {
> -	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
> +	return __offline_pages(start_pfn, start_pfn + nr_pages);
>  }
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
  2017-09-04  8:58     ` Xishi Qiu
@ 2017-09-04  9:01       ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-04  9:01 UTC (permalink / raw)
  To: Xishi Qiu
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Mon 04-09-17 16:58:30, Xishi Qiu wrote:
> On 2017/9/4 16:21, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > We have a hardcoded 120s timeout after which the memory offline fails
> > basically since the hot remove has been introduced. This is essentially
> > a policy implemented in the kernel. Moreover there is no way to adjust
> > the timeout and so we are sometimes facing memory offline failures if
> > the system is under a heavy memory pressure or very intensive CPU
> > workload on large machines.
> > 
> > It is not very clear what purpose the timeout actually serves. The
> > offline operation is interruptible by a signal so if userspace wants
> 
> Hi Michal,
> 
> If the user know what he should do if migration for a long time,
> it is OK, but I don't think all the users know this operation
> (e.g. ctrl + c) and the affect.

How is this operation any different from other potentially long
interruptible syscalls?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
@ 2017-09-04  9:01       ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-04  9:01 UTC (permalink / raw)
  To: Xishi Qiu
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Mon 04-09-17 16:58:30, Xishi Qiu wrote:
> On 2017/9/4 16:21, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > We have a hardcoded 120s timeout after which the memory offline fails
> > basically since the hot remove has been introduced. This is essentially
> > a policy implemented in the kernel. Moreover there is no way to adjust
> > the timeout and so we are sometimes facing memory offline failures if
> > the system is under a heavy memory pressure or very intensive CPU
> > workload on large machines.
> > 
> > It is not very clear what purpose the timeout actually serves. The
> > offline operation is interruptible by a signal so if userspace wants
> 
> Hi Michal,
> 
> If the user know what he should do if migration for a long time,
> it is OK, but I don't think all the users know this operation
> (e.g. ctrl + c) and the affect.

How is this operation any different from other potentially long
interruptible syscalls?

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
  2017-09-04  9:01       ` Michal Hocko
@ 2017-09-04  9:05         ` Xishi Qiu
  -1 siblings, 0 replies; 68+ messages in thread
From: Xishi Qiu @ 2017-09-04  9:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On 2017/9/4 17:01, Michal Hocko wrote:

> On Mon 04-09-17 16:58:30, Xishi Qiu wrote:
>> On 2017/9/4 16:21, Michal Hocko wrote:
>>
>>> From: Michal Hocko <mhocko@suse.com>
>>>
>>> We have a hardcoded 120s timeout after which the memory offline fails
>>> basically since the hot remove has been introduced. This is essentially
>>> a policy implemented in the kernel. Moreover there is no way to adjust
>>> the timeout and so we are sometimes facing memory offline failures if
>>> the system is under a heavy memory pressure or very intensive CPU
>>> workload on large machines.
>>>
>>> It is not very clear what purpose the timeout actually serves. The
>>> offline operation is interruptible by a signal so if userspace wants
>>
>> Hi Michal,
>>
>> If the user know what he should do if migration for a long time,
>> it is OK, but I don't think all the users know this operation
>> (e.g. ctrl + c) and the affect.
> 
> How is this operation any different from other potentially long
> interruptible syscalls?
> 

Hi Michal,

I means the user should stop it by himself if migration always retry in endless.

Thanks,
Xishi Qiu

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
@ 2017-09-04  9:05         ` Xishi Qiu
  0 siblings, 0 replies; 68+ messages in thread
From: Xishi Qiu @ 2017-09-04  9:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On 2017/9/4 17:01, Michal Hocko wrote:

> On Mon 04-09-17 16:58:30, Xishi Qiu wrote:
>> On 2017/9/4 16:21, Michal Hocko wrote:
>>
>>> From: Michal Hocko <mhocko@suse.com>
>>>
>>> We have a hardcoded 120s timeout after which the memory offline fails
>>> basically since the hot remove has been introduced. This is essentially
>>> a policy implemented in the kernel. Moreover there is no way to adjust
>>> the timeout and so we are sometimes facing memory offline failures if
>>> the system is under a heavy memory pressure or very intensive CPU
>>> workload on large machines.
>>>
>>> It is not very clear what purpose the timeout actually serves. The
>>> offline operation is interruptible by a signal so if userspace wants
>>
>> Hi Michal,
>>
>> If the user know what he should do if migration for a long time,
>> it is OK, but I don't think all the users know this operation
>> (e.g. ctrl + c) and the affect.
> 
> How is this operation any different from other potentially long
> interruptible syscalls?
> 

Hi Michal,

I means the user should stop it by himself if migration always retry in endless.

Thanks,
Xishi Qiu

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
  2017-09-04  9:05         ` Xishi Qiu
@ 2017-09-04  9:15           ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-04  9:15 UTC (permalink / raw)
  To: Xishi Qiu
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Mon 04-09-17 17:05:15, Xishi Qiu wrote:
> On 2017/9/4 17:01, Michal Hocko wrote:
> 
> > On Mon 04-09-17 16:58:30, Xishi Qiu wrote:
> >> On 2017/9/4 16:21, Michal Hocko wrote:
> >>
> >>> From: Michal Hocko <mhocko@suse.com>
> >>>
> >>> We have a hardcoded 120s timeout after which the memory offline fails
> >>> basically since the hot remove has been introduced. This is essentially
> >>> a policy implemented in the kernel. Moreover there is no way to adjust
> >>> the timeout and so we are sometimes facing memory offline failures if
> >>> the system is under a heavy memory pressure or very intensive CPU
> >>> workload on large machines.
> >>>
> >>> It is not very clear what purpose the timeout actually serves. The
> >>> offline operation is interruptible by a signal so if userspace wants
> >>
> >> Hi Michal,
> >>
> >> If the user know what he should do if migration for a long time,
> >> it is OK, but I don't think all the users know this operation
> >> (e.g. ctrl + c) and the affect.
> > 
> > How is this operation any different from other potentially long
> > interruptible syscalls?
> > 
> 
> Hi Michal,
> 
> I means the user should stop it by himself if migration always retry in endless.

If the memory is migrateable then the migration should finish
eventually. It can take some time but it shouldn't be an endless loop.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
@ 2017-09-04  9:15           ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-04  9:15 UTC (permalink / raw)
  To: Xishi Qiu
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Mon 04-09-17 17:05:15, Xishi Qiu wrote:
> On 2017/9/4 17:01, Michal Hocko wrote:
> 
> > On Mon 04-09-17 16:58:30, Xishi Qiu wrote:
> >> On 2017/9/4 16:21, Michal Hocko wrote:
> >>
> >>> From: Michal Hocko <mhocko@suse.com>
> >>>
> >>> We have a hardcoded 120s timeout after which the memory offline fails
> >>> basically since the hot remove has been introduced. This is essentially
> >>> a policy implemented in the kernel. Moreover there is no way to adjust
> >>> the timeout and so we are sometimes facing memory offline failures if
> >>> the system is under a heavy memory pressure or very intensive CPU
> >>> workload on large machines.
> >>>
> >>> It is not very clear what purpose the timeout actually serves. The
> >>> offline operation is interruptible by a signal so if userspace wants
> >>
> >> Hi Michal,
> >>
> >> If the user know what he should do if migration for a long time,
> >> it is OK, but I don't think all the users know this operation
> >> (e.g. ctrl + c) and the affect.
> > 
> > How is this operation any different from other potentially long
> > interruptible syscalls?
> > 
> 
> Hi Michal,
> 
> I means the user should stop it by himself if migration always retry in endless.

If the memory is migrateable then the migration should finish
eventually. It can take some time but it shouldn't be an endless loop.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
  2017-09-04  9:15           ` Michal Hocko
@ 2017-09-05  5:46             ` Anshuman Khandual
  -1 siblings, 0 replies; 68+ messages in thread
From: Anshuman Khandual @ 2017-09-05  5:46 UTC (permalink / raw)
  To: Michal Hocko, Xishi Qiu
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On 09/04/2017 02:45 PM, Michal Hocko wrote:
> On Mon 04-09-17 17:05:15, Xishi Qiu wrote:
>> On 2017/9/4 17:01, Michal Hocko wrote:
>>
>>> On Mon 04-09-17 16:58:30, Xishi Qiu wrote:
>>>> On 2017/9/4 16:21, Michal Hocko wrote:
>>>>
>>>>> From: Michal Hocko <mhocko@suse.com>
>>>>>
>>>>> We have a hardcoded 120s timeout after which the memory offline fails
>>>>> basically since the hot remove has been introduced. This is essentially
>>>>> a policy implemented in the kernel. Moreover there is no way to adjust
>>>>> the timeout and so we are sometimes facing memory offline failures if
>>>>> the system is under a heavy memory pressure or very intensive CPU
>>>>> workload on large machines.
>>>>>
>>>>> It is not very clear what purpose the timeout actually serves. The
>>>>> offline operation is interruptible by a signal so if userspace wants
>>>> Hi Michal,
>>>>
>>>> If the user know what he should do if migration for a long time,
>>>> it is OK, but I don't think all the users know this operation
>>>> (e.g. ctrl + c) and the affect.
>>> How is this operation any different from other potentially long
>>> interruptible syscalls?
>>>
>> Hi Michal,
>>
>> I means the user should stop it by himself if migration always retry in endless.
> If the memory is migrateable then the migration should finish
> eventually. It can take some time but it shouldn't be an endless loop.

But what if some how the temporary condition (page removed from the PCP
LRU list and has not been freed yet to the buddy) happens again and again.
I understand we have schedule() and yield() to make sure that the context
does not hold the CPU for ever but it can take theoretically very long
time if not endless to finish. In that case sending signal to the user
space process who initiated the offline request is the only way to stop
this retry loop. I think this is still a better approach than the 120
second timeout which was kind of arbitrary.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
@ 2017-09-05  5:46             ` Anshuman Khandual
  0 siblings, 0 replies; 68+ messages in thread
From: Anshuman Khandual @ 2017-09-05  5:46 UTC (permalink / raw)
  To: Michal Hocko, Xishi Qiu
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On 09/04/2017 02:45 PM, Michal Hocko wrote:
> On Mon 04-09-17 17:05:15, Xishi Qiu wrote:
>> On 2017/9/4 17:01, Michal Hocko wrote:
>>
>>> On Mon 04-09-17 16:58:30, Xishi Qiu wrote:
>>>> On 2017/9/4 16:21, Michal Hocko wrote:
>>>>
>>>>> From: Michal Hocko <mhocko@suse.com>
>>>>>
>>>>> We have a hardcoded 120s timeout after which the memory offline fails
>>>>> basically since the hot remove has been introduced. This is essentially
>>>>> a policy implemented in the kernel. Moreover there is no way to adjust
>>>>> the timeout and so we are sometimes facing memory offline failures if
>>>>> the system is under a heavy memory pressure or very intensive CPU
>>>>> workload on large machines.
>>>>>
>>>>> It is not very clear what purpose the timeout actually serves. The
>>>>> offline operation is interruptible by a signal so if userspace wants
>>>> Hi Michal,
>>>>
>>>> If the user know what he should do if migration for a long time,
>>>> it is OK, but I don't think all the users know this operation
>>>> (e.g. ctrl + c) and the affect.
>>> How is this operation any different from other potentially long
>>> interruptible syscalls?
>>>
>> Hi Michal,
>>
>> I means the user should stop it by himself if migration always retry in endless.
> If the memory is migrateable then the migration should finish
> eventually. It can take some time but it shouldn't be an endless loop.

But what if some how the temporary condition (page removed from the PCP
LRU list and has not been freed yet to the buddy) happens again and again.
I understand we have schedule() and yield() to make sure that the context
does not hold the CPU for ever but it can take theoretically very long
time if not endless to finish. In that case sending signal to the user
space process who initiated the offline request is the only way to stop
this retry loop. I think this is still a better approach than the 120
second timeout which was kind of arbitrary.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-09-04  8:21   ` Michal Hocko
@ 2017-09-05  6:29     ` Anshuman Khandual
  -1 siblings, 0 replies; 68+ messages in thread
From: Anshuman Khandual @ 2017-09-05  6:29 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko

On 09/04/2017 01:51 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Memory offlining can fail just too eagerly under a heavy memory pressure.
> 
> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> [ 5410.336811] page dumped because: isolation failed
> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> 
> Isolation has failed here because the page is not on LRU. Most probably
> because it was on the pcp LRU cache or it has been removed from the LRU
> already but it hasn't been freed yet. In both cases the page doesn't look
> non-migrable so retrying more makes sense.
> 
> __offline_pages seems rather cluttered when it comes to the retry
> logic. We have 5 retries at maximum and a timeout. We could argue
> whether the timeout makes sense but failing just because of a race when
> somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
> wrong. It only takes it to race with a process which unmaps some pages
> and remove them from the LRU list and we can fail the whole offline
> because of something that is a temporary condition and actually not
> harmful for the offline. Please note that unmovable pages should be
> already excluded during start_isolate_page_range.
> 
> Fix this by removing the max retry count and only rely on the timeout
> resp. interruption by a signal from the userspace. Also retry rather
> than fail when check_pages_isolated sees some !free pages because those
> could be a result of the race as well.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memory_hotplug.c | 40 ++++++++++------------------------------
>  1 file changed, 10 insertions(+), 30 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 459bbc182d10..c9dcbe6d2ac6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1597,7 +1597,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  {
>  	unsigned long pfn, nr_pages, expire;
>  	long offlined_pages;
> -	int ret, drain, retry_max, node;
> +	int ret, node;
>  	unsigned long flags;
>  	unsigned long valid_start, valid_end;
>  	struct zone *zone;
> @@ -1634,43 +1634,25 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  
>  	pfn = start_pfn;
>  	expire = jiffies + timeout;
> -	drain = 0;
> -	retry_max = 5;
>  repeat:
>  	/* start memory hot removal */
> -	ret = -EAGAIN;
> +	ret = -EBUSY;
>  	if (time_after(jiffies, expire))
>  		goto failed_removal;
>  	ret = -EINTR;
>  	if (signal_pending(current))
>  		goto failed_removal;
> -	ret = 0;
> -	if (drain) {
> -		lru_add_drain_all_cpuslocked();
> -		cond_resched();
> -		drain_all_pages(zone);
> -	}

Why we had this condition before that only when we fail in migration
later in do_migrate_range function, drain the lru lists in the next
attempt. Why not from the first attempt itself ? Just being curious.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-09-05  6:29     ` Anshuman Khandual
  0 siblings, 0 replies; 68+ messages in thread
From: Anshuman Khandual @ 2017-09-05  6:29 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko

On 09/04/2017 01:51 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Memory offlining can fail just too eagerly under a heavy memory pressure.
> 
> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> [ 5410.336811] page dumped because: isolation failed
> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> 
> Isolation has failed here because the page is not on LRU. Most probably
> because it was on the pcp LRU cache or it has been removed from the LRU
> already but it hasn't been freed yet. In both cases the page doesn't look
> non-migrable so retrying more makes sense.
> 
> __offline_pages seems rather cluttered when it comes to the retry
> logic. We have 5 retries at maximum and a timeout. We could argue
> whether the timeout makes sense but failing just because of a race when
> somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
> wrong. It only takes it to race with a process which unmaps some pages
> and remove them from the LRU list and we can fail the whole offline
> because of something that is a temporary condition and actually not
> harmful for the offline. Please note that unmovable pages should be
> already excluded during start_isolate_page_range.
> 
> Fix this by removing the max retry count and only rely on the timeout
> resp. interruption by a signal from the userspace. Also retry rather
> than fail when check_pages_isolated sees some !free pages because those
> could be a result of the race as well.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memory_hotplug.c | 40 ++++++++++------------------------------
>  1 file changed, 10 insertions(+), 30 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 459bbc182d10..c9dcbe6d2ac6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1597,7 +1597,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  {
>  	unsigned long pfn, nr_pages, expire;
>  	long offlined_pages;
> -	int ret, drain, retry_max, node;
> +	int ret, node;
>  	unsigned long flags;
>  	unsigned long valid_start, valid_end;
>  	struct zone *zone;
> @@ -1634,43 +1634,25 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  
>  	pfn = start_pfn;
>  	expire = jiffies + timeout;
> -	drain = 0;
> -	retry_max = 5;
>  repeat:
>  	/* start memory hot removal */
> -	ret = -EAGAIN;
> +	ret = -EBUSY;
>  	if (time_after(jiffies, expire))
>  		goto failed_removal;
>  	ret = -EINTR;
>  	if (signal_pending(current))
>  		goto failed_removal;
> -	ret = 0;
> -	if (drain) {
> -		lru_add_drain_all_cpuslocked();
> -		cond_resched();
> -		drain_all_pages(zone);
> -	}

Why we had this condition before that only when we fail in migration
later in do_migrate_range function, drain the lru lists in the next
attempt. Why not from the first attempt itself ? Just being curious.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-09-05  6:29     ` Anshuman Khandual
@ 2017-09-05  7:13       ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-05  7:13 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Tue 05-09-17 11:59:36, Anshuman Khandual wrote:
[...]
> > @@ -1634,43 +1634,25 @@ static int __ref __offline_pages(unsigned long start_pfn,
> >  
> >  	pfn = start_pfn;
> >  	expire = jiffies + timeout;
> > -	drain = 0;
> > -	retry_max = 5;
> >  repeat:
> >  	/* start memory hot removal */
> > -	ret = -EAGAIN;
> > +	ret = -EBUSY;
> >  	if (time_after(jiffies, expire))
> >  		goto failed_removal;
> >  	ret = -EINTR;
> >  	if (signal_pending(current))
> >  		goto failed_removal;
> > -	ret = 0;
> > -	if (drain) {
> > -		lru_add_drain_all_cpuslocked();
> > -		cond_resched();
> > -		drain_all_pages(zone);
> > -	}
> 
> Why we had this condition before that only when we fail in migration
> later in do_migrate_range function, drain the lru lists in the next
> attempt. Why not from the first attempt itself ? Just being curious.
 
I can only guess but draining used to invoke IPIs and that is really
costly so an optimistic attempt could try without draining and do that
only if the migration fails. Now that we have it all done in WQ context
there shouldn't be any reason to optimize for draining.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-09-05  7:13       ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-05  7:13 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Tue 05-09-17 11:59:36, Anshuman Khandual wrote:
[...]
> > @@ -1634,43 +1634,25 @@ static int __ref __offline_pages(unsigned long start_pfn,
> >  
> >  	pfn = start_pfn;
> >  	expire = jiffies + timeout;
> > -	drain = 0;
> > -	retry_max = 5;
> >  repeat:
> >  	/* start memory hot removal */
> > -	ret = -EAGAIN;
> > +	ret = -EBUSY;
> >  	if (time_after(jiffies, expire))
> >  		goto failed_removal;
> >  	ret = -EINTR;
> >  	if (signal_pending(current))
> >  		goto failed_removal;
> > -	ret = 0;
> > -	if (drain) {
> > -		lru_add_drain_all_cpuslocked();
> > -		cond_resched();
> > -		drain_all_pages(zone);
> > -	}
> 
> Why we had this condition before that only when we fail in migration
> later in do_migrate_range function, drain the lru lists in the next
> attempt. Why not from the first attempt itself ? Just being curious.
 
I can only guess but draining used to invoke IPIs and that is really
costly so an optimistic attempt could try without draining and do that
only if the migration fails. Now that we have it all done in WQ context
there shouldn't be any reason to optimize for draining.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
  2017-09-05  5:46             ` Anshuman Khandual
@ 2017-09-05  7:23               ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-05  7:23 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Xishi Qiu, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, Igor Mammedov, Vitaly Kuznetsov, linux-mm,
	LKML

On Tue 05-09-17 11:16:57, Anshuman Khandual wrote:
> On 09/04/2017 02:45 PM, Michal Hocko wrote:
> > On Mon 04-09-17 17:05:15, Xishi Qiu wrote:
> >> On 2017/9/4 17:01, Michal Hocko wrote:
> >>
> >>> On Mon 04-09-17 16:58:30, Xishi Qiu wrote:
> >>>> On 2017/9/4 16:21, Michal Hocko wrote:
> >>>>
> >>>>> From: Michal Hocko <mhocko@suse.com>
> >>>>>
> >>>>> We have a hardcoded 120s timeout after which the memory offline fails
> >>>>> basically since the hot remove has been introduced. This is essentially
> >>>>> a policy implemented in the kernel. Moreover there is no way to adjust
> >>>>> the timeout and so we are sometimes facing memory offline failures if
> >>>>> the system is under a heavy memory pressure or very intensive CPU
> >>>>> workload on large machines.
> >>>>>
> >>>>> It is not very clear what purpose the timeout actually serves. The
> >>>>> offline operation is interruptible by a signal so if userspace wants
> >>>> Hi Michal,
> >>>>
> >>>> If the user know what he should do if migration for a long time,
> >>>> it is OK, but I don't think all the users know this operation
> >>>> (e.g. ctrl + c) and the affect.
> >>> How is this operation any different from other potentially long
> >>> interruptible syscalls?
> >>>
> >> Hi Michal,
> >>
> >> I means the user should stop it by himself if migration always retry in endless.
> > If the memory is migrateable then the migration should finish
> > eventually. It can take some time but it shouldn't be an endless loop.
> 
> But what if some how the temporary condition (page removed from the PCP
> LRU list and has not been freed yet to the buddy) happens again and again.

How would that happen? We have all pages in the range MIGRATE_ISOLATE so
no pages will get reallocated and we know that there are no unmigratable
pages in the range. So we only should have temporary failures for
migration. If that is not the case then we have a bug somewhere. 

> I understand we have schedule() and yield() to make sure that the context
> does not hold the CPU for ever but it can take theoretically very long
> time if not endless to finish. In that case sending signal to the user

I guess you meant to say signal from the user space...

> space process who initiated the offline request is the only way to stop
> this retry loop. I think this is still a better approach than the 120
> second timeout which was kind of arbitrary.

Yeah the context is interruptible so if the operation takes unbearably
too long then a watchdog can be setup trivially and to the user defined
value. There is a good reason we do not add hardocded timeouts to the
kernel.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
@ 2017-09-05  7:23               ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-05  7:23 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Xishi Qiu, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, Igor Mammedov, Vitaly Kuznetsov, linux-mm,
	LKML

On Tue 05-09-17 11:16:57, Anshuman Khandual wrote:
> On 09/04/2017 02:45 PM, Michal Hocko wrote:
> > On Mon 04-09-17 17:05:15, Xishi Qiu wrote:
> >> On 2017/9/4 17:01, Michal Hocko wrote:
> >>
> >>> On Mon 04-09-17 16:58:30, Xishi Qiu wrote:
> >>>> On 2017/9/4 16:21, Michal Hocko wrote:
> >>>>
> >>>>> From: Michal Hocko <mhocko@suse.com>
> >>>>>
> >>>>> We have a hardcoded 120s timeout after which the memory offline fails
> >>>>> basically since the hot remove has been introduced. This is essentially
> >>>>> a policy implemented in the kernel. Moreover there is no way to adjust
> >>>>> the timeout and so we are sometimes facing memory offline failures if
> >>>>> the system is under a heavy memory pressure or very intensive CPU
> >>>>> workload on large machines.
> >>>>>
> >>>>> It is not very clear what purpose the timeout actually serves. The
> >>>>> offline operation is interruptible by a signal so if userspace wants
> >>>> Hi Michal,
> >>>>
> >>>> If the user know what he should do if migration for a long time,
> >>>> it is OK, but I don't think all the users know this operation
> >>>> (e.g. ctrl + c) and the affect.
> >>> How is this operation any different from other potentially long
> >>> interruptible syscalls?
> >>>
> >> Hi Michal,
> >>
> >> I means the user should stop it by himself if migration always retry in endless.
> > If the memory is migrateable then the migration should finish
> > eventually. It can take some time but it shouldn't be an endless loop.
> 
> But what if some how the temporary condition (page removed from the PCP
> LRU list and has not been freed yet to the buddy) happens again and again.

How would that happen? We have all pages in the range MIGRATE_ISOLATE so
no pages will get reallocated and we know that there are no unmigratable
pages in the range. So we only should have temporary failures for
migration. If that is not the case then we have a bug somewhere. 

> I understand we have schedule() and yield() to make sure that the context
> does not hold the CPU for ever but it can take theoretically very long
> time if not endless to finish. In that case sending signal to the user

I guess you meant to say signal from the user space...

> space process who initiated the offline request is the only way to stop
> this retry loop. I think this is still a better approach than the 120
> second timeout which was kind of arbitrary.

Yeah the context is interruptible so if the operation takes unbearably
too long then a watchdog can be setup trivially and to the user defined
value. There is a good reason we do not add hardocded timeouts to the
kernel.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
  2017-09-05  7:23               ` Michal Hocko
@ 2017-09-05  8:54                 ` Anshuman Khandual
  -1 siblings, 0 replies; 68+ messages in thread
From: Anshuman Khandual @ 2017-09-05  8:54 UTC (permalink / raw)
  To: Michal Hocko, Anshuman Khandual
  Cc: Xishi Qiu, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, Igor Mammedov, Vitaly Kuznetsov, linux-mm,
	LKML

On 09/05/2017 12:53 PM, Michal Hocko wrote:
> On Tue 05-09-17 11:16:57, Anshuman Khandual wrote:
>> On 09/04/2017 02:45 PM, Michal Hocko wrote:
>>> On Mon 04-09-17 17:05:15, Xishi Qiu wrote:
>>>> On 2017/9/4 17:01, Michal Hocko wrote:
>>>>
>>>>> On Mon 04-09-17 16:58:30, Xishi Qiu wrote:
>>>>>> On 2017/9/4 16:21, Michal Hocko wrote:
>>>>>>
>>>>>>> From: Michal Hocko <mhocko@suse.com>
>>>>>>>
>>>>>>> We have a hardcoded 120s timeout after which the memory offline fails
>>>>>>> basically since the hot remove has been introduced. This is essentially
>>>>>>> a policy implemented in the kernel. Moreover there is no way to adjust
>>>>>>> the timeout and so we are sometimes facing memory offline failures if
>>>>>>> the system is under a heavy memory pressure or very intensive CPU
>>>>>>> workload on large machines.
>>>>>>>
>>>>>>> It is not very clear what purpose the timeout actually serves. The
>>>>>>> offline operation is interruptible by a signal so if userspace wants
>>>>>> Hi Michal,
>>>>>>
>>>>>> If the user know what he should do if migration for a long time,
>>>>>> it is OK, but I don't think all the users know this operation
>>>>>> (e.g. ctrl + c) and the affect.
>>>>> How is this operation any different from other potentially long
>>>>> interruptible syscalls?
>>>>>
>>>> Hi Michal,
>>>>
>>>> I means the user should stop it by himself if migration always retry in endless.
>>> If the memory is migrateable then the migration should finish
>>> eventually. It can take some time but it shouldn't be an endless loop.
>>
>> But what if some how the temporary condition (page removed from the PCP
>> LRU list and has not been freed yet to the buddy) happens again and again.
> 
> How would that happen? We have all pages in the range MIGRATE_ISOLATE so
> no pages will get reallocated and we know that there are no unmigratable
> pages in the range. So we only should have temporary failures for
> migration. If that is not the case then we have a bug somewhere.

Right.

> 
>> I understand we have schedule() and yield() to make sure that the context
>> does not hold the CPU for ever but it can take theoretically very long
>> time if not endless to finish. In that case sending signal to the user
> 
> I guess you meant to say signal from the user space...

Yes.

> 
>> space process who initiated the offline request is the only way to stop
>> this retry loop. I think this is still a better approach than the 120
>> second timeout which was kind of arbitrary.
> 
> Yeah the context is interruptible so if the operation takes unbearably
> too long then a watchdog can be setup trivially and to the user defined
> value. There is a good reason we do not add hardocded timeouts to the
> kernel.
> 

Right.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
@ 2017-09-05  8:54                 ` Anshuman Khandual
  0 siblings, 0 replies; 68+ messages in thread
From: Anshuman Khandual @ 2017-09-05  8:54 UTC (permalink / raw)
  To: Michal Hocko, Anshuman Khandual
  Cc: Xishi Qiu, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, Igor Mammedov, Vitaly Kuznetsov, linux-mm,
	LKML

On 09/05/2017 12:53 PM, Michal Hocko wrote:
> On Tue 05-09-17 11:16:57, Anshuman Khandual wrote:
>> On 09/04/2017 02:45 PM, Michal Hocko wrote:
>>> On Mon 04-09-17 17:05:15, Xishi Qiu wrote:
>>>> On 2017/9/4 17:01, Michal Hocko wrote:
>>>>
>>>>> On Mon 04-09-17 16:58:30, Xishi Qiu wrote:
>>>>>> On 2017/9/4 16:21, Michal Hocko wrote:
>>>>>>
>>>>>>> From: Michal Hocko <mhocko@suse.com>
>>>>>>>
>>>>>>> We have a hardcoded 120s timeout after which the memory offline fails
>>>>>>> basically since the hot remove has been introduced. This is essentially
>>>>>>> a policy implemented in the kernel. Moreover there is no way to adjust
>>>>>>> the timeout and so we are sometimes facing memory offline failures if
>>>>>>> the system is under a heavy memory pressure or very intensive CPU
>>>>>>> workload on large machines.
>>>>>>>
>>>>>>> It is not very clear what purpose the timeout actually serves. The
>>>>>>> offline operation is interruptible by a signal so if userspace wants
>>>>>> Hi Michal,
>>>>>>
>>>>>> If the user know what he should do if migration for a long time,
>>>>>> it is OK, but I don't think all the users know this operation
>>>>>> (e.g. ctrl + c) and the affect.
>>>>> How is this operation any different from other potentially long
>>>>> interruptible syscalls?
>>>>>
>>>> Hi Michal,
>>>>
>>>> I means the user should stop it by himself if migration always retry in endless.
>>> If the memory is migrateable then the migration should finish
>>> eventually. It can take some time but it shouldn't be an endless loop.
>>
>> But what if some how the temporary condition (page removed from the PCP
>> LRU list and has not been freed yet to the buddy) happens again and again.
> 
> How would that happen? We have all pages in the range MIGRATE_ISOLATE so
> no pages will get reallocated and we know that there are no unmigratable
> pages in the range. So we only should have temporary failures for
> migration. If that is not the case then we have a bug somewhere.

Right.

> 
>> I understand we have schedule() and yield() to make sure that the context
>> does not hold the CPU for ever but it can take theoretically very long
>> time if not endless to finish. In that case sending signal to the user
> 
> I guess you meant to say signal from the user space...

Yes.

> 
>> space process who initiated the offline request is the only way to stop
>> this retry loop. I think this is still a better approach than the 120
>> second timeout which was kind of arbitrary.
> 
> Yeah the context is interruptible so if the operation takes unbearably
> too long then a watchdog can be setup trivially and to the user defined
> value. There is a good reason we do not add hardocded timeouts to the
> kernel.
> 

Right.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-09-04  8:21   ` Michal Hocko
@ 2017-09-08 17:26     ` Vlastimil Babka
  -1 siblings, 0 replies; 68+ messages in thread
From: Vlastimil Babka @ 2017-09-08 17:26 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko

On 09/04/2017 10:21 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Memory offlining can fail just too eagerly under a heavy memory pressure.
> 
> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> [ 5410.336811] page dumped because: isolation failed
> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> 
> Isolation has failed here because the page is not on LRU. Most probably
> because it was on the pcp LRU cache or it has been removed from the LRU
> already but it hasn't been freed yet. In both cases the page doesn't look
> non-migrable so retrying more makes sense.
> 
> __offline_pages seems rather cluttered when it comes to the retry
> logic. We have 5 retries at maximum and a timeout. We could argue
> whether the timeout makes sense but failing just because of a race when
> somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
> wrong. It only takes it to race with a process which unmaps some pages
> and remove them from the LRU list and we can fail the whole offline
> because of something that is a temporary condition and actually not
> harmful for the offline. Please note that unmovable pages should be
> already excluded during start_isolate_page_range.

Hmm, the has_unmovable_pages() check doesn't offer any strict guarantees due to
races, per its comment. Also at the very quick glance, I see a check where it
assumes that MIGRATE_MOVABLE pageblock will have no unmovable pages. There is no
such guarantee even without races.

> Fix this by removing the max retry count and only rely on the timeout
> resp. interruption by a signal from the userspace. Also retry rather
> than fail when check_pages_isolated sees some !free pages because those
> could be a result of the race as well.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Even within a movable node where has_unmovable_pages() is a non-issue, you could
have pinned movable pages where the pinning is not temporary. So after this
patch, this will really keep retrying forever. I'm not saying it's wrong, just
pointing it out, since the changelog seems to assume there would be only
temporary failures possible and thus unbound retries are always correct.
The obvious problem if we wanted to avoid this, is how to recognize
non-temporary failures...

> ---
>  mm/memory_hotplug.c | 40 ++++++++++------------------------------
>  1 file changed, 10 insertions(+), 30 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 459bbc182d10..c9dcbe6d2ac6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1597,7 +1597,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  {
>  	unsigned long pfn, nr_pages, expire;
>  	long offlined_pages;
> -	int ret, drain, retry_max, node;
> +	int ret, node;
>  	unsigned long flags;
>  	unsigned long valid_start, valid_end;
>  	struct zone *zone;
> @@ -1634,43 +1634,25 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  
>  	pfn = start_pfn;
>  	expire = jiffies + timeout;
> -	drain = 0;
> -	retry_max = 5;
>  repeat:
>  	/* start memory hot removal */
> -	ret = -EAGAIN;
> +	ret = -EBUSY;
>  	if (time_after(jiffies, expire))
>  		goto failed_removal;
>  	ret = -EINTR;
>  	if (signal_pending(current))
>  		goto failed_removal;
> -	ret = 0;
> -	if (drain) {
> -		lru_add_drain_all_cpuslocked();
> -		cond_resched();
> -		drain_all_pages(zone);
> -	}
> +
> +	cond_resched();
> +	lru_add_drain_all_cpuslocked();
> +	drain_all_pages(zone);
>  
>  	pfn = scan_movable_pages(start_pfn, end_pfn);
>  	if (pfn) { /* We have movable pages */
>  		ret = do_migrate_range(pfn, end_pfn);
> -		if (!ret) {
> -			drain = 1;
> -			goto repeat;
> -		} else {
> -			if (ret < 0)
> -				if (--retry_max == 0)
> -					goto failed_removal;
> -			yield();
> -			drain = 1;
> -			goto repeat;
> -		}
> +		goto repeat;
>  	}
> -	/* drain all zone's lru pagevec, this is asynchronous... */
> -	lru_add_drain_all_cpuslocked();
> -	yield();
> -	/* drain pcp pages, this is synchronous. */
> -	drain_all_pages(zone);
> +
>  	/*
>  	 * dissolve free hugepages in the memory block before doing offlining
>  	 * actually in order to make hugetlbfs's object counting consistent.
> @@ -1680,10 +1662,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  		goto failed_removal;
>  	/* check again */
>  	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
> -	if (offlined_pages < 0) {
> -		ret = -EBUSY;
> -		goto failed_removal;
> -	}
> +	if (offlined_pages < 0)
> +		goto repeat;
>  	pr_info("Offlined Pages %ld\n", offlined_pages);
>  	/* Ok, all of our target is isolated.
>  	   We cannot do rollback at this point. */
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-09-08 17:26     ` Vlastimil Babka
  0 siblings, 0 replies; 68+ messages in thread
From: Vlastimil Babka @ 2017-09-08 17:26 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko

On 09/04/2017 10:21 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Memory offlining can fail just too eagerly under a heavy memory pressure.
> 
> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> [ 5410.336811] page dumped because: isolation failed
> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> 
> Isolation has failed here because the page is not on LRU. Most probably
> because it was on the pcp LRU cache or it has been removed from the LRU
> already but it hasn't been freed yet. In both cases the page doesn't look
> non-migrable so retrying more makes sense.
> 
> __offline_pages seems rather cluttered when it comes to the retry
> logic. We have 5 retries at maximum and a timeout. We could argue
> whether the timeout makes sense but failing just because of a race when
> somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
> wrong. It only takes it to race with a process which unmaps some pages
> and remove them from the LRU list and we can fail the whole offline
> because of something that is a temporary condition and actually not
> harmful for the offline. Please note that unmovable pages should be
> already excluded during start_isolate_page_range.

Hmm, the has_unmovable_pages() check doesn't offer any strict guarantees due to
races, per its comment. Also at the very quick glance, I see a check where it
assumes that MIGRATE_MOVABLE pageblock will have no unmovable pages. There is no
such guarantee even without races.

> Fix this by removing the max retry count and only rely on the timeout
> resp. interruption by a signal from the userspace. Also retry rather
> than fail when check_pages_isolated sees some !free pages because those
> could be a result of the race as well.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Even within a movable node where has_unmovable_pages() is a non-issue, you could
have pinned movable pages where the pinning is not temporary. So after this
patch, this will really keep retrying forever. I'm not saying it's wrong, just
pointing it out, since the changelog seems to assume there would be only
temporary failures possible and thus unbound retries are always correct.
The obvious problem if we wanted to avoid this, is how to recognize
non-temporary failures...

> ---
>  mm/memory_hotplug.c | 40 ++++++++++------------------------------
>  1 file changed, 10 insertions(+), 30 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 459bbc182d10..c9dcbe6d2ac6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1597,7 +1597,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  {
>  	unsigned long pfn, nr_pages, expire;
>  	long offlined_pages;
> -	int ret, drain, retry_max, node;
> +	int ret, node;
>  	unsigned long flags;
>  	unsigned long valid_start, valid_end;
>  	struct zone *zone;
> @@ -1634,43 +1634,25 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  
>  	pfn = start_pfn;
>  	expire = jiffies + timeout;
> -	drain = 0;
> -	retry_max = 5;
>  repeat:
>  	/* start memory hot removal */
> -	ret = -EAGAIN;
> +	ret = -EBUSY;
>  	if (time_after(jiffies, expire))
>  		goto failed_removal;
>  	ret = -EINTR;
>  	if (signal_pending(current))
>  		goto failed_removal;
> -	ret = 0;
> -	if (drain) {
> -		lru_add_drain_all_cpuslocked();
> -		cond_resched();
> -		drain_all_pages(zone);
> -	}
> +
> +	cond_resched();
> +	lru_add_drain_all_cpuslocked();
> +	drain_all_pages(zone);
>  
>  	pfn = scan_movable_pages(start_pfn, end_pfn);
>  	if (pfn) { /* We have movable pages */
>  		ret = do_migrate_range(pfn, end_pfn);
> -		if (!ret) {
> -			drain = 1;
> -			goto repeat;
> -		} else {
> -			if (ret < 0)
> -				if (--retry_max == 0)
> -					goto failed_removal;
> -			yield();
> -			drain = 1;
> -			goto repeat;
> -		}
> +		goto repeat;
>  	}
> -	/* drain all zone's lru pagevec, this is asynchronous... */
> -	lru_add_drain_all_cpuslocked();
> -	yield();
> -	/* drain pcp pages, this is synchronous. */
> -	drain_all_pages(zone);
> +
>  	/*
>  	 * dissolve free hugepages in the memory block before doing offlining
>  	 * actually in order to make hugetlbfs's object counting consistent.
> @@ -1680,10 +1662,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  		goto failed_removal;
>  	/* check again */
>  	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
> -	if (offlined_pages < 0) {
> -		ret = -EBUSY;
> -		goto failed_removal;
> -	}
> +	if (offlined_pages < 0)
> +		goto repeat;
>  	pr_info("Offlined Pages %ld\n", offlined_pages);
>  	/* Ok, all of our target is isolated.
>  	   We cannot do rollback at this point. */
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
  2017-09-04  8:21   ` Michal Hocko
@ 2017-09-08 17:27     ` Vlastimil Babka
  -1 siblings, 0 replies; 68+ messages in thread
From: Vlastimil Babka @ 2017-09-08 17:27 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko

On 09/04/2017 10:21 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> We have a hardcoded 120s timeout after which the memory offline fails
> basically since the hot remove has been introduced. This is essentially
> a policy implemented in the kernel. Moreover there is no way to adjust
> the timeout and so we are sometimes facing memory offline failures if
> the system is under a heavy memory pressure or very intensive CPU
> workload on large machines.
> 
> It is not very clear what purpose the timeout actually serves. The
> offline operation is interruptible by a signal so if userspace wants
> some timeout based termination this can be done trivially by sending a
> signal.
> 
> If there is a strong usecase to do this from the kernel then we should
> do it properly and have a it tunable from the userspace with the timeout
> disabled by default along with the explanation who uses it and for what
> purporse.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Makes sense to me.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/memory_hotplug.c | 10 +++-------
>  1 file changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c9dcbe6d2ac6..b8a85c11360e 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1593,9 +1593,9 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
>  }
>  
>  static int __ref __offline_pages(unsigned long start_pfn,
> -		  unsigned long end_pfn, unsigned long timeout)
> +		  unsigned long end_pfn)
>  {
> -	unsigned long pfn, nr_pages, expire;
> +	unsigned long pfn, nr_pages;
>  	long offlined_pages;
>  	int ret, node;
>  	unsigned long flags;
> @@ -1633,12 +1633,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  		goto failed_removal;
>  
>  	pfn = start_pfn;
> -	expire = jiffies + timeout;
>  repeat:
>  	/* start memory hot removal */
> -	ret = -EBUSY;
> -	if (time_after(jiffies, expire))
> -		goto failed_removal;
>  	ret = -EINTR;
>  	if (signal_pending(current))
>  		goto failed_removal;
> @@ -1711,7 +1707,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  /* Must be protected by mem_hotplug_begin() or a device_lock */
>  int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  {
> -	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
> +	return __offline_pages(start_pfn, start_pfn + nr_pages);
>  }
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory
@ 2017-09-08 17:27     ` Vlastimil Babka
  0 siblings, 0 replies; 68+ messages in thread
From: Vlastimil Babka @ 2017-09-08 17:27 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko

On 09/04/2017 10:21 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> We have a hardcoded 120s timeout after which the memory offline fails
> basically since the hot remove has been introduced. This is essentially
> a policy implemented in the kernel. Moreover there is no way to adjust
> the timeout and so we are sometimes facing memory offline failures if
> the system is under a heavy memory pressure or very intensive CPU
> workload on large machines.
> 
> It is not very clear what purpose the timeout actually serves. The
> offline operation is interruptible by a signal so if userspace wants
> some timeout based termination this can be done trivially by sending a
> signal.
> 
> If there is a strong usecase to do this from the kernel then we should
> do it properly and have a it tunable from the userspace with the timeout
> disabled by default along with the explanation who uses it and for what
> purporse.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Makes sense to me.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/memory_hotplug.c | 10 +++-------
>  1 file changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c9dcbe6d2ac6..b8a85c11360e 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1593,9 +1593,9 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
>  }
>  
>  static int __ref __offline_pages(unsigned long start_pfn,
> -		  unsigned long end_pfn, unsigned long timeout)
> +		  unsigned long end_pfn)
>  {
> -	unsigned long pfn, nr_pages, expire;
> +	unsigned long pfn, nr_pages;
>  	long offlined_pages;
>  	int ret, node;
>  	unsigned long flags;
> @@ -1633,12 +1633,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  		goto failed_removal;
>  
>  	pfn = start_pfn;
> -	expire = jiffies + timeout;
>  repeat:
>  	/* start memory hot removal */
> -	ret = -EBUSY;
> -	if (time_after(jiffies, expire))
> -		goto failed_removal;
>  	ret = -EINTR;
>  	if (signal_pending(current))
>  		goto failed_removal;
> @@ -1711,7 +1707,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  /* Must be protected by mem_hotplug_begin() or a device_lock */
>  int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  {
> -	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
> +	return __offline_pages(start_pfn, start_pfn + nr_pages);
>  }
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-09-08 17:26     ` Vlastimil Babka
@ 2017-09-11  8:17       ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-11  8:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Fri 08-09-17 19:26:06, Vlastimil Babka wrote:
> On 09/04/2017 10:21 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Memory offlining can fail just too eagerly under a heavy memory pressure.
> > 
> > [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> > [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> > [ 5410.336811] page dumped because: isolation failed
> > [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> > [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> > 
> > Isolation has failed here because the page is not on LRU. Most probably
> > because it was on the pcp LRU cache or it has been removed from the LRU
> > already but it hasn't been freed yet. In both cases the page doesn't look
> > non-migrable so retrying more makes sense.
> > 
> > __offline_pages seems rather cluttered when it comes to the retry
> > logic. We have 5 retries at maximum and a timeout. We could argue
> > whether the timeout makes sense but failing just because of a race when
> > somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
> > wrong. It only takes it to race with a process which unmaps some pages
> > and remove them from the LRU list and we can fail the whole offline
> > because of something that is a temporary condition and actually not
> > harmful for the offline. Please note that unmovable pages should be
> > already excluded during start_isolate_page_range.
> 
> Hmm, the has_unmovable_pages() check doesn't offer any strict guarantees due to
> races, per its comment. Also at the very quick glance, I see a check where it
> assumes that MIGRATE_MOVABLE pageblock will have no unmovable pages. There is no
> such guarantee even without races.

Yes, you are right that there are races possible but practically
speaking non-movable memblocks (in !MOVABLE_ZONE) would be very likely
to have reliably unmovable pages and so has_unmovable_pages would bail
out. And ZONE_MOVABLE memblocks with permanently pinned pages sound like
a bug to me.

> > Fix this by removing the max retry count and only rely on the timeout
> > resp. interruption by a signal from the userspace. Also retry rather
> > than fail when check_pages_isolated sees some !free pages because those
> > could be a result of the race as well.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Even within a movable node where has_unmovable_pages() is a non-issue, you could
> have pinned movable pages where the pinning is not temporary.

Who would pin those pages? Such a page would be unreclaimable as well
and thus a memory leak and I would argue it would be a bug.

> So after this
> patch, this will really keep retrying forever. I'm not saying it's wrong, just
> pointing it out, since the changelog seems to assume there would be only
> temporary failures possible and thus unbound retries are always correct.
> The obvious problem if we wanted to avoid this, is how to recognize
> non-temporary failures...

Yes, we should be able to distinguish the two and hopefully we can teach
the migration code to distinguish between EBUSY (likely permanent) and
EGAIN (temporal) failure. This sound like something we should aim for
longterm I guess. Anyway as I've said in other email. If somebody really
wants to have a guaratee of a bounded retry then it is trivial to set up
an alarm and send a signal itself to bail out.

Do you think that the changelog should be more clear about this?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-09-11  8:17       ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-11  8:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Fri 08-09-17 19:26:06, Vlastimil Babka wrote:
> On 09/04/2017 10:21 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Memory offlining can fail just too eagerly under a heavy memory pressure.
> > 
> > [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> > [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> > [ 5410.336811] page dumped because: isolation failed
> > [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> > [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> > 
> > Isolation has failed here because the page is not on LRU. Most probably
> > because it was on the pcp LRU cache or it has been removed from the LRU
> > already but it hasn't been freed yet. In both cases the page doesn't look
> > non-migrable so retrying more makes sense.
> > 
> > __offline_pages seems rather cluttered when it comes to the retry
> > logic. We have 5 retries at maximum and a timeout. We could argue
> > whether the timeout makes sense but failing just because of a race when
> > somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
> > wrong. It only takes it to race with a process which unmaps some pages
> > and remove them from the LRU list and we can fail the whole offline
> > because of something that is a temporary condition and actually not
> > harmful for the offline. Please note that unmovable pages should be
> > already excluded during start_isolate_page_range.
> 
> Hmm, the has_unmovable_pages() check doesn't offer any strict guarantees due to
> races, per its comment. Also at the very quick glance, I see a check where it
> assumes that MIGRATE_MOVABLE pageblock will have no unmovable pages. There is no
> such guarantee even without races.

Yes, you are right that there are races possible but practically
speaking non-movable memblocks (in !MOVABLE_ZONE) would be very likely
to have reliably unmovable pages and so has_unmovable_pages would bail
out. And ZONE_MOVABLE memblocks with permanently pinned pages sound like
a bug to me.

> > Fix this by removing the max retry count and only rely on the timeout
> > resp. interruption by a signal from the userspace. Also retry rather
> > than fail when check_pages_isolated sees some !free pages because those
> > could be a result of the race as well.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Even within a movable node where has_unmovable_pages() is a non-issue, you could
> have pinned movable pages where the pinning is not temporary.

Who would pin those pages? Such a page would be unreclaimable as well
and thus a memory leak and I would argue it would be a bug.

> So after this
> patch, this will really keep retrying forever. I'm not saying it's wrong, just
> pointing it out, since the changelog seems to assume there would be only
> temporary failures possible and thus unbound retries are always correct.
> The obvious problem if we wanted to avoid this, is how to recognize
> non-temporary failures...

Yes, we should be able to distinguish the two and hopefully we can teach
the migration code to distinguish between EBUSY (likely permanent) and
EGAIN (temporal) failure. This sound like something we should aim for
longterm I guess. Anyway as I've said in other email. If somebody really
wants to have a guaratee of a bounded retry then it is trivial to set up
an alarm and send a signal itself to bail out.

Do you think that the changelog should be more clear about this?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-09-11  8:17       ` Michal Hocko
@ 2017-09-13 11:41         ` Vlastimil Babka
  -1 siblings, 0 replies; 68+ messages in thread
From: Vlastimil Babka @ 2017-09-13 11:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On 09/11/2017 10:17 AM, Michal Hocko wrote:
> On Fri 08-09-17 19:26:06, Vlastimil Babka wrote:
>> On 09/04/2017 10:21 AM, Michal Hocko wrote:
>>> From: Michal Hocko <mhocko@suse.com>
>>>
>>> Fix this by removing the max retry count and only rely on the timeout
>>> resp. interruption by a signal from the userspace. Also retry rather
>>> than fail when check_pages_isolated sees some !free pages because those
>>> could be a result of the race as well.
>>>
>>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>>
>> Even within a movable node where has_unmovable_pages() is a non-issue, you could
>> have pinned movable pages where the pinning is not temporary.
> 
> Who would pin those pages? Such a page would be unreclaimable as well
> and thus a memory leak and I would argue it would be a bug.

I don't know who exactly, but generally it's a problem for CMA and a
reason why there was some effort from PeterZ to introduce an API for
long-term pinning.

>> So after this
>> patch, this will really keep retrying forever. I'm not saying it's wrong, just
>> pointing it out, since the changelog seems to assume there would be only
>> temporary failures possible and thus unbound retries are always correct.
>> The obvious problem if we wanted to avoid this, is how to recognize
>> non-temporary failures...
> 
> Yes, we should be able to distinguish the two and hopefully we can teach
> the migration code to distinguish between EBUSY (likely permanent) and
> EGAIN (temporal) failure. This sound like something we should aim for
> longterm I guess. Anyway as I've said in other email. If somebody really
> wants to have a guaratee of a bounded retry then it is trivial to set up
> an alarm and send a signal itself to bail out.

Sure, I would just be careful about not breaking existing userspace
(udev?) when offline triggered via ACPI from some management interface
(or whatever the exact mechanism is).

> Do you think that the changelog should be more clear about this?

It certainly wouldn't hurt :)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-09-13 11:41         ` Vlastimil Babka
  0 siblings, 0 replies; 68+ messages in thread
From: Vlastimil Babka @ 2017-09-13 11:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On 09/11/2017 10:17 AM, Michal Hocko wrote:
> On Fri 08-09-17 19:26:06, Vlastimil Babka wrote:
>> On 09/04/2017 10:21 AM, Michal Hocko wrote:
>>> From: Michal Hocko <mhocko@suse.com>
>>>
>>> Fix this by removing the max retry count and only rely on the timeout
>>> resp. interruption by a signal from the userspace. Also retry rather
>>> than fail when check_pages_isolated sees some !free pages because those
>>> could be a result of the race as well.
>>>
>>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>>
>> Even within a movable node where has_unmovable_pages() is a non-issue, you could
>> have pinned movable pages where the pinning is not temporary.
> 
> Who would pin those pages? Such a page would be unreclaimable as well
> and thus a memory leak and I would argue it would be a bug.

I don't know who exactly, but generally it's a problem for CMA and a
reason why there was some effort from PeterZ to introduce an API for
long-term pinning.

>> So after this
>> patch, this will really keep retrying forever. I'm not saying it's wrong, just
>> pointing it out, since the changelog seems to assume there would be only
>> temporary failures possible and thus unbound retries are always correct.
>> The obvious problem if we wanted to avoid this, is how to recognize
>> non-temporary failures...
> 
> Yes, we should be able to distinguish the two and hopefully we can teach
> the migration code to distinguish between EBUSY (likely permanent) and
> EGAIN (temporal) failure. This sound like something we should aim for
> longterm I guess. Anyway as I've said in other email. If somebody really
> wants to have a guaratee of a bounded retry then it is trivial to set up
> an alarm and send a signal itself to bail out.

Sure, I would just be careful about not breaking existing userspace
(udev?) when offline triggered via ACPI from some management interface
(or whatever the exact mechanism is).

> Do you think that the changelog should be more clear about this?

It certainly wouldn't hurt :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-09-13 11:41         ` Vlastimil Babka
@ 2017-09-13 12:10           ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-13 12:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Wed 13-09-17 13:41:20, Vlastimil Babka wrote:
> On 09/11/2017 10:17 AM, Michal Hocko wrote:
[...]
> > Yes, we should be able to distinguish the two and hopefully we can teach
> > the migration code to distinguish between EBUSY (likely permanent) and
> > EGAIN (temporal) failure. This sound like something we should aim for
> > longterm I guess. Anyway as I've said in other email. If somebody really
> > wants to have a guaratee of a bounded retry then it is trivial to set up
> > an alarm and send a signal itself to bail out.
> 
> Sure, I would just be careful about not breaking existing userspace
> (udev?) when offline triggered via ACPI from some management interface
> (or whatever the exact mechanism is).

The thing is that there is absolutely no timing guarantee even with
retry limit in place. We are doing allocations, potentially bouncing on
locks which can be taken elsewhere etc... So if somebody really depend
on this then it is pretty much broken already.

> > Do you think that the changelog should be more clear about this?
> 
> It certainly wouldn't hurt :)

So what do you think about the following wording:

commit 23c4ded55c2ba880165a9f5b8a67694361fb6bc7
Author: Michal Hocko <mhocko@suse.com>
Date:   Mon Aug 28 13:13:06 2017 +0200

    mm, memory_hotplug: remove timeout from __offline_memory
    
    We have a hardcoded 120s timeout after which the memory offline fails
    basically since the hot remove has been introduced. This is essentially
    a policy implemented in the kernel. Moreover there is no way to adjust
    the timeout and so we are sometimes facing memory offline failures if
    the system is under a heavy memory pressure or very intensive CPU
    workload on large machines.
    
    It is not very clear what purpose the timeout actually serves. The
    offline operation is interruptible by a signal so if userspace wants
    some timeout based termination this can be done trivially by sending a
    signal.
    
    If there is a strong usecase to do this from the kernel then we should
    do it properly and have a it tunable from the userspace with the timeout
    disabled by default along with the explanation who uses it and for what
    purporse.
    
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-09-13 12:10           ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-13 12:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Wed 13-09-17 13:41:20, Vlastimil Babka wrote:
> On 09/11/2017 10:17 AM, Michal Hocko wrote:
[...]
> > Yes, we should be able to distinguish the two and hopefully we can teach
> > the migration code to distinguish between EBUSY (likely permanent) and
> > EGAIN (temporal) failure. This sound like something we should aim for
> > longterm I guess. Anyway as I've said in other email. If somebody really
> > wants to have a guaratee of a bounded retry then it is trivial to set up
> > an alarm and send a signal itself to bail out.
> 
> Sure, I would just be careful about not breaking existing userspace
> (udev?) when offline triggered via ACPI from some management interface
> (or whatever the exact mechanism is).

The thing is that there is absolutely no timing guarantee even with
retry limit in place. We are doing allocations, potentially bouncing on
locks which can be taken elsewhere etc... So if somebody really depend
on this then it is pretty much broken already.

> > Do you think that the changelog should be more clear about this?
> 
> It certainly wouldn't hurt :)

So what do you think about the following wording:

commit 23c4ded55c2ba880165a9f5b8a67694361fb6bc7
Author: Michal Hocko <mhocko@suse.com>
Date:   Mon Aug 28 13:13:06 2017 +0200

    mm, memory_hotplug: remove timeout from __offline_memory
    
    We have a hardcoded 120s timeout after which the memory offline fails
    basically since the hot remove has been introduced. This is essentially
    a policy implemented in the kernel. Moreover there is no way to adjust
    the timeout and so we are sometimes facing memory offline failures if
    the system is under a heavy memory pressure or very intensive CPU
    workload on large machines.
    
    It is not very clear what purpose the timeout actually serves. The
    offline operation is interruptible by a signal so if userspace wants
    some timeout based termination this can be done trivially by sending a
    signal.
    
    If there is a strong usecase to do this from the kernel then we should
    do it properly and have a it tunable from the userspace with the timeout
    disabled by default along with the explanation who uses it and for what
    purporse.
    
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-09-13 12:10           ` Michal Hocko
@ 2017-09-13 12:14             ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-13 12:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Wed 13-09-17 14:10:01, Michal Hocko wrote:
> On Wed 13-09-17 13:41:20, Vlastimil Babka wrote:
> > On 09/11/2017 10:17 AM, Michal Hocko wrote:
> [...]
> > > Yes, we should be able to distinguish the two and hopefully we can teach
> > > the migration code to distinguish between EBUSY (likely permanent) and
> > > EGAIN (temporal) failure. This sound like something we should aim for
> > > longterm I guess. Anyway as I've said in other email. If somebody really
> > > wants to have a guaratee of a bounded retry then it is trivial to set up
> > > an alarm and send a signal itself to bail out.
> > 
> > Sure, I would just be careful about not breaking existing userspace
> > (udev?) when offline triggered via ACPI from some management interface
> > (or whatever the exact mechanism is).
> 
> The thing is that there is absolutely no timing guarantee even with
> retry limit in place. We are doing allocations, potentially bouncing on
> locks which can be taken elsewhere etc... So if somebody really depend
> on this then it is pretty much broken already.
> 
> > > Do you think that the changelog should be more clear about this?
> > 
> > It certainly wouldn't hurt :)
> 
> So what do you think about the following wording:

Ups, wrong patch

>From 8639496a834b4a7c24972ec23b17e50f0d6a304c Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 14 Aug 2017 10:46:12 +0200
Subject: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early

Memory offlining can fail just too eagerly under a heavy memory pressure.

[ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
[ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
[ 5410.336811] page dumped because: isolation failed
[ 5410.336813] page->mem_cgroup:ffff8801cd662000
[ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed

Isolation has failed here because the page is not on LRU. Most probably
because it was on the pcp LRU cache or it has been removed from the LRU
already but it hasn't been freed yet. In both cases the page doesn't look
non-migrable so retrying more makes sense.

__offline_pages seems rather cluttered when it comes to the retry
logic. We have 5 retries at maximum and a timeout. We could argue
whether the timeout makes sense but failing just because of a race when
somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
wrong. It only takes it to race with a process which unmaps some pages
and remove them from the LRU list and we can fail the whole offline
because of something that is a temporary condition and actually not
harmful for the offline.

Please note that unmovable pages should be already excluded during
start_isolate_page_range. We could argue that has_unmovable_pages is
racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
pages in most cases and movable zone shouldn't contain unmovable pages
at all. Some of those pages might be pinned but not for ever because
that would be a bug on its own. In any case the context is still
interruptible and so the userspace can easily bail out when the
operation takes too long. This is certainly better behavior than a
hardcoded retry loop which is racy.

Fix this by removing the max retry count and only rely on the timeout
resp. interruption by a signal from the userspace. Also retry rather
than fail when check_pages_isolated sees some !free pages because those
could be a result of the race as well.

Signed-off-by: Michal Hocko <mhocko@suse.com>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-09-13 12:14             ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-13 12:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Wed 13-09-17 14:10:01, Michal Hocko wrote:
> On Wed 13-09-17 13:41:20, Vlastimil Babka wrote:
> > On 09/11/2017 10:17 AM, Michal Hocko wrote:
> [...]
> > > Yes, we should be able to distinguish the two and hopefully we can teach
> > > the migration code to distinguish between EBUSY (likely permanent) and
> > > EGAIN (temporal) failure. This sound like something we should aim for
> > > longterm I guess. Anyway as I've said in other email. If somebody really
> > > wants to have a guaratee of a bounded retry then it is trivial to set up
> > > an alarm and send a signal itself to bail out.
> > 
> > Sure, I would just be careful about not breaking existing userspace
> > (udev?) when offline triggered via ACPI from some management interface
> > (or whatever the exact mechanism is).
> 
> The thing is that there is absolutely no timing guarantee even with
> retry limit in place. We are doing allocations, potentially bouncing on
> locks which can be taken elsewhere etc... So if somebody really depend
> on this then it is pretty much broken already.
> 
> > > Do you think that the changelog should be more clear about this?
> > 
> > It certainly wouldn't hurt :)
> 
> So what do you think about the following wording:

Ups, wrong patch

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-09-13 12:14             ` Michal Hocko
@ 2017-09-13 12:19               ` Vlastimil Babka
  -1 siblings, 0 replies; 68+ messages in thread
From: Vlastimil Babka @ 2017-09-13 12:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On 09/13/2017 02:14 PM, Michal Hocko wrote:
>>>> Do you think that the changelog should be more clear about this?
>>>
>>> It certainly wouldn't hurt :)
>>
>> So what do you think about the following wording:
> 
> Ups, wrong patch
> 
> 
> From 8639496a834b4a7c24972ec23b17e50f0d6a304c Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 14 Aug 2017 10:46:12 +0200
> Subject: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
> 
> Memory offlining can fail just too eagerly under a heavy memory pressure.
> 
> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> [ 5410.336811] page dumped because: isolation failed
> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> 
> Isolation has failed here because the page is not on LRU. Most probably
> because it was on the pcp LRU cache or it has been removed from the LRU
> already but it hasn't been freed yet. In both cases the page doesn't look
> non-migrable so retrying more makes sense.
> 
> __offline_pages seems rather cluttered when it comes to the retry
> logic. We have 5 retries at maximum and a timeout. We could argue
> whether the timeout makes sense but failing just because of a race when
> somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
> wrong. It only takes it to race with a process which unmaps some pages
> and remove them from the LRU list and we can fail the whole offline
> because of something that is a temporary condition and actually not
> harmful for the offline.
> 
> Please note that unmovable pages should be already excluded during
> start_isolate_page_range. We could argue that has_unmovable_pages is
> racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
> but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
> pages in most cases and movable zone shouldn't contain unmovable pages
> at all. Some of those pages might be pinned but not for ever because
> that would be a bug on its own. In any case the context is still
> interruptible and so the userspace can easily bail out when the
> operation takes too long. This is certainly better behavior than a
> hardcoded retry loop which is racy.
> 
> Fix this by removing the max retry count and only rely on the timeout
> resp. interruption by a signal from the userspace. Also retry rather
> than fail when check_pages_isolated sees some !free pages because those
> could be a result of the race as well.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Yeah, that's better, thanks.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-09-13 12:19               ` Vlastimil Babka
  0 siblings, 0 replies; 68+ messages in thread
From: Vlastimil Babka @ 2017-09-13 12:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On 09/13/2017 02:14 PM, Michal Hocko wrote:
>>>> Do you think that the changelog should be more clear about this?
>>>
>>> It certainly wouldn't hurt :)
>>
>> So what do you think about the following wording:
> 
> Ups, wrong patch
> 
> 
> From 8639496a834b4a7c24972ec23b17e50f0d6a304c Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 14 Aug 2017 10:46:12 +0200
> Subject: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
> 
> Memory offlining can fail just too eagerly under a heavy memory pressure.
> 
> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> [ 5410.336811] page dumped because: isolation failed
> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> 
> Isolation has failed here because the page is not on LRU. Most probably
> because it was on the pcp LRU cache or it has been removed from the LRU
> already but it hasn't been freed yet. In both cases the page doesn't look
> non-migrable so retrying more makes sense.
> 
> __offline_pages seems rather cluttered when it comes to the retry
> logic. We have 5 retries at maximum and a timeout. We could argue
> whether the timeout makes sense but failing just because of a race when
> somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
> wrong. It only takes it to race with a process which unmaps some pages
> and remove them from the LRU list and we can fail the whole offline
> because of something that is a temporary condition and actually not
> harmful for the offline.
> 
> Please note that unmovable pages should be already excluded during
> start_isolate_page_range. We could argue that has_unmovable_pages is
> racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
> but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
> pages in most cases and movable zone shouldn't contain unmovable pages
> at all. Some of those pages might be pinned but not for ever because
> that would be a bug on its own. In any case the context is still
> interruptible and so the userspace can easily bail out when the
> operation takes too long. This is certainly better behavior than a
> hardcoded retry loop which is racy.
> 
> Fix this by removing the max retry count and only rely on the timeout
> resp. interruption by a signal from the userspace. Also retry rather
> than fail when check_pages_isolated sees some !free pages because those
> could be a result of the race as well.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Yeah, that's better, thanks.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-09-13 12:19               ` Vlastimil Babka
@ 2017-09-13 12:32                 ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-13 12:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Wed 13-09-17 14:19:19, Vlastimil Babka wrote:
> On 09/13/2017 02:14 PM, Michal Hocko wrote:
> >>>> Do you think that the changelog should be more clear about this?
> >>>
> >>> It certainly wouldn't hurt :)
> >>
> >> So what do you think about the following wording:
> > 
> > Ups, wrong patch
> > 
> > 
> > From 8639496a834b4a7c24972ec23b17e50f0d6a304c Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Mon, 14 Aug 2017 10:46:12 +0200
> > Subject: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
> > 
> > Memory offlining can fail just too eagerly under a heavy memory pressure.
> > 
> > [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> > [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> > [ 5410.336811] page dumped because: isolation failed
> > [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> > [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> > 
> > Isolation has failed here because the page is not on LRU. Most probably
> > because it was on the pcp LRU cache or it has been removed from the LRU
> > already but it hasn't been freed yet. In both cases the page doesn't look
> > non-migrable so retrying more makes sense.
> > 
> > __offline_pages seems rather cluttered when it comes to the retry
> > logic. We have 5 retries at maximum and a timeout. We could argue
> > whether the timeout makes sense but failing just because of a race when
> > somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
> > wrong. It only takes it to race with a process which unmaps some pages
> > and remove them from the LRU list and we can fail the whole offline
> > because of something that is a temporary condition and actually not
> > harmful for the offline.
> > 
> > Please note that unmovable pages should be already excluded during
> > start_isolate_page_range. We could argue that has_unmovable_pages is
> > racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
> > but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
> > pages in most cases and movable zone shouldn't contain unmovable pages
> > at all. Some of those pages might be pinned but not for ever because
> > that would be a bug on its own. In any case the context is still
> > interruptible and so the userspace can easily bail out when the
> > operation takes too long. This is certainly better behavior than a
> > hardcoded retry loop which is racy.
> > 
> > Fix this by removing the max retry count and only rely on the timeout
> > resp. interruption by a signal from the userspace. Also retry rather
> > than fail when check_pages_isolated sees some !free pages because those
> > could be a result of the race as well.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Yeah, that's better, thanks.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks. I will give it a day and repost the series. If somebody still
have some concerns please speak up.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-09-13 12:32                 ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-13 12:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On Wed 13-09-17 14:19:19, Vlastimil Babka wrote:
> On 09/13/2017 02:14 PM, Michal Hocko wrote:
> >>>> Do you think that the changelog should be more clear about this?
> >>>
> >>> It certainly wouldn't hurt :)
> >>
> >> So what do you think about the following wording:
> > 
> > Ups, wrong patch
> > 
> > 
> > From 8639496a834b4a7c24972ec23b17e50f0d6a304c Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Mon, 14 Aug 2017 10:46:12 +0200
> > Subject: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
> > 
> > Memory offlining can fail just too eagerly under a heavy memory pressure.
> > 
> > [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> > [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> > [ 5410.336811] page dumped because: isolation failed
> > [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> > [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> > 
> > Isolation has failed here because the page is not on LRU. Most probably
> > because it was on the pcp LRU cache or it has been removed from the LRU
> > already but it hasn't been freed yet. In both cases the page doesn't look
> > non-migrable so retrying more makes sense.
> > 
> > __offline_pages seems rather cluttered when it comes to the retry
> > logic. We have 5 retries at maximum and a timeout. We could argue
> > whether the timeout makes sense but failing just because of a race when
> > somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
> > wrong. It only takes it to race with a process which unmaps some pages
> > and remove them from the LRU list and we can fail the whole offline
> > because of something that is a temporary condition and actually not
> > harmful for the offline.
> > 
> > Please note that unmovable pages should be already excluded during
> > start_isolate_page_range. We could argue that has_unmovable_pages is
> > racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
> > but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
> > pages in most cases and movable zone shouldn't contain unmovable pages
> > at all. Some of those pages might be pinned but not for ever because
> > that would be a bug on its own. In any case the context is still
> > interruptible and so the userspace can easily bail out when the
> > operation takes too long. This is certainly better behavior than a
> > hardcoded retry loop which is racy.
> > 
> > Fix this by removing the max retry count and only rely on the timeout
> > resp. interruption by a signal from the userspace. Also retry rather
> > than fail when check_pages_isolated sees some !free pages because those
> > could be a result of the race as well.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Yeah, that's better, thanks.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks. I will give it a day and repost the series. If somebody still
have some concerns please speak up.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-10-13 11:42               ` Michael Ellerman
@ 2017-10-13 11:58                 ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-10-13 11:58 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Vlastimil Babka, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, qiuxishi, Igor Mammedov, Vitaly Kuznetsov,
	linux-mm, LKML

On Fri 13-10-17 22:42:46, Michael Ellerman wrote:
> Vlastimil Babka <vbabka@suse.cz> writes:
> > On 10/11/2017 08:51 AM, Michal Hocko wrote:
> >> On Wed 11-10-17 13:37:50, Michael Ellerman wrote:
> >>> Michal Hocko <mhocko@kernel.org> writes:
> >>>> On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
> >>>>> Michal Hocko <mhocko@kernel.org> writes:
> >>>>>> From: Michal Hocko <mhocko@suse.com>
> >>>>>>
> >>>>>> Memory offlining can fail just too eagerly under a heavy memory pressure.
> ...
> >>>>>
> >>>>> This breaks offline for me.
> >>>>>
> >>>>> Prior to this commit:
> >>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
> >>>>>   -bash: echo: write error: Device or resource busy
> >
> > Well, that means offline didn't actually work for that block even before
> > this patch, right? Is it even a movable_node block? I guess not?
> 
> Correct. It should fail.
> 
> >>>>> After:
> >>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
> >>>>>   -bash: echo: write error: Device or resource busy
> >>>>>   
> >>>>>   real	2m0.009s
> >>>>>   user	0m0.000s
> >>>>>   sys	1m25.035s
> >>>>>
> >>>>> There's no way that block can be removed, it contains the kernel text,
> >>>>> so it should instantly fail - which it used to.
> >
> > Ah, right. So your complain is really about that the failure is not
> > instant anymore for blocks that can't be offlined.
> 
> Yes. Previously it failed instantly, now it doesn't fail, and loops
> infinitely (once the 2 minute limit is removed).

Yeah it failed only because the migration code retried few times and we
bailed out which is wrong as well. I will send two patches as a reply to
this email.

> >> This is really strange! As you write in other email the page is
> >> reserved. That means that some of the earlier checks 
> >> 	if (zone_idx(zone) == ZONE_MOVABLE)
> >> 		return false;
> >> 	mt = get_pageblock_migratetype(page);
> >> 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
> >
> > The MIGRATE_MOVABLE check is indeed bogus, because that doesn't
> > guarantee there are no unmovable pages in the block (CMA block OTOH
> > should be a guarantee).
> 
> OK I'll try that and get back to you.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-10-13 11:58                 ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-10-13 11:58 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Vlastimil Babka, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, qiuxishi, Igor Mammedov, Vitaly Kuznetsov,
	linux-mm, LKML

On Fri 13-10-17 22:42:46, Michael Ellerman wrote:
> Vlastimil Babka <vbabka@suse.cz> writes:
> > On 10/11/2017 08:51 AM, Michal Hocko wrote:
> >> On Wed 11-10-17 13:37:50, Michael Ellerman wrote:
> >>> Michal Hocko <mhocko@kernel.org> writes:
> >>>> On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
> >>>>> Michal Hocko <mhocko@kernel.org> writes:
> >>>>>> From: Michal Hocko <mhocko@suse.com>
> >>>>>>
> >>>>>> Memory offlining can fail just too eagerly under a heavy memory pressure.
> ...
> >>>>>
> >>>>> This breaks offline for me.
> >>>>>
> >>>>> Prior to this commit:
> >>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
> >>>>>   -bash: echo: write error: Device or resource busy
> >
> > Well, that means offline didn't actually work for that block even before
> > this patch, right? Is it even a movable_node block? I guess not?
> 
> Correct. It should fail.
> 
> >>>>> After:
> >>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
> >>>>>   -bash: echo: write error: Device or resource busy
> >>>>>   
> >>>>>   real	2m0.009s
> >>>>>   user	0m0.000s
> >>>>>   sys	1m25.035s
> >>>>>
> >>>>> There's no way that block can be removed, it contains the kernel text,
> >>>>> so it should instantly fail - which it used to.
> >
> > Ah, right. So your complain is really about that the failure is not
> > instant anymore for blocks that can't be offlined.
> 
> Yes. Previously it failed instantly, now it doesn't fail, and loops
> infinitely (once the 2 minute limit is removed).

Yeah it failed only because the migration code retried few times and we
bailed out which is wrong as well. I will send two patches as a reply to
this email.

> >> This is really strange! As you write in other email the page is
> >> reserved. That means that some of the earlier checks 
> >> 	if (zone_idx(zone) == ZONE_MOVABLE)
> >> 		return false;
> >> 	mt = get_pageblock_migratetype(page);
> >> 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
> >
> > The MIGRATE_MOVABLE check is indeed bogus, because that doesn't
> > guarantee there are no unmovable pages in the block (CMA block OTOH
> > should be a guarantee).
> 
> OK I'll try that and get back to you.

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-10-11  8:04             ` Vlastimil Babka
@ 2017-10-13 11:42               ` Michael Ellerman
  -1 siblings, 0 replies; 68+ messages in thread
From: Michael Ellerman @ 2017-10-13 11:42 UTC (permalink / raw)
  To: Vlastimil Babka, Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

Vlastimil Babka <vbabka@suse.cz> writes:
> On 10/11/2017 08:51 AM, Michal Hocko wrote:
>> On Wed 11-10-17 13:37:50, Michael Ellerman wrote:
>>> Michal Hocko <mhocko@kernel.org> writes:
>>>> On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
>>>>> Michal Hocko <mhocko@kernel.org> writes:
>>>>>> From: Michal Hocko <mhocko@suse.com>
>>>>>>
>>>>>> Memory offlining can fail just too eagerly under a heavy memory pressure.
...
>>>>>
>>>>> This breaks offline for me.
>>>>>
>>>>> Prior to this commit:
>>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>>>   -bash: echo: write error: Device or resource busy
>
> Well, that means offline didn't actually work for that block even before
> this patch, right? Is it even a movable_node block? I guess not?

Correct. It should fail.

>>>>> After:
>>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>>>   -bash: echo: write error: Device or resource busy
>>>>>   
>>>>>   real	2m0.009s
>>>>>   user	0m0.000s
>>>>>   sys	1m25.035s
>>>>>
>>>>> There's no way that block can be removed, it contains the kernel text,
>>>>> so it should instantly fail - which it used to.
>
> Ah, right. So your complain is really about that the failure is not
> instant anymore for blocks that can't be offlined.

Yes. Previously it failed instantly, now it doesn't fail, and loops
infinitely (once the 2 minute limit is removed).

>> This is really strange! As you write in other email the page is
>> reserved. That means that some of the earlier checks 
>> 	if (zone_idx(zone) == ZONE_MOVABLE)
>> 		return false;
>> 	mt = get_pageblock_migratetype(page);
>> 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
>
> The MIGRATE_MOVABLE check is indeed bogus, because that doesn't
> guarantee there are no unmovable pages in the block (CMA block OTOH
> should be a guarantee).

OK I'll try that and get back to you.

cheers


>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3badcedf96a7..5b4d85ae445c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -7355,9 +7355,6 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>>  	 */
>>  	if (zone_idx(zone) == ZONE_MOVABLE)
>>  		return false;
>> -	mt = get_pageblock_migratetype(page);
>> -	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
>> -		return false;
>>  
>>  	pfn = page_to_pfn(page);
>>  	for (found = 0, iter = 0; iter < pageblock_nr_pages; iter++) {
>> @@ -7368,6 +7365,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>>  
>>  		page = pfn_to_page(check);
>>  
>> +		if (PageReserved(page))
>> +			return true;
>> +
>>  		/*
>>  		 * Hugepages are not in LRU lists, but they're movable.
>>  		 * We need not scan over tail pages bacause we don't
>> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-10-13 11:42               ` Michael Ellerman
  0 siblings, 0 replies; 68+ messages in thread
From: Michael Ellerman @ 2017-10-13 11:42 UTC (permalink / raw)
  To: Vlastimil Babka, Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

Vlastimil Babka <vbabka@suse.cz> writes:
> On 10/11/2017 08:51 AM, Michal Hocko wrote:
>> On Wed 11-10-17 13:37:50, Michael Ellerman wrote:
>>> Michal Hocko <mhocko@kernel.org> writes:
>>>> On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
>>>>> Michal Hocko <mhocko@kernel.org> writes:
>>>>>> From: Michal Hocko <mhocko@suse.com>
>>>>>>
>>>>>> Memory offlining can fail just too eagerly under a heavy memory pressure.
...
>>>>>
>>>>> This breaks offline for me.
>>>>>
>>>>> Prior to this commit:
>>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>>>   -bash: echo: write error: Device or resource busy
>
> Well, that means offline didn't actually work for that block even before
> this patch, right? Is it even a movable_node block? I guess not?

Correct. It should fail.

>>>>> After:
>>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>>>   -bash: echo: write error: Device or resource busy
>>>>>   
>>>>>   real	2m0.009s
>>>>>   user	0m0.000s
>>>>>   sys	1m25.035s
>>>>>
>>>>> There's no way that block can be removed, it contains the kernel text,
>>>>> so it should instantly fail - which it used to.
>
> Ah, right. So your complain is really about that the failure is not
> instant anymore for blocks that can't be offlined.

Yes. Previously it failed instantly, now it doesn't fail, and loops
infinitely (once the 2 minute limit is removed).

>> This is really strange! As you write in other email the page is
>> reserved. That means that some of the earlier checks 
>> 	if (zone_idx(zone) == ZONE_MOVABLE)
>> 		return false;
>> 	mt = get_pageblock_migratetype(page);
>> 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
>
> The MIGRATE_MOVABLE check is indeed bogus, because that doesn't
> guarantee there are no unmovable pages in the block (CMA block OTOH
> should be a guarantee).

OK I'll try that and get back to you.

cheers


>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3badcedf96a7..5b4d85ae445c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -7355,9 +7355,6 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>>  	 */
>>  	if (zone_idx(zone) == ZONE_MOVABLE)
>>  		return false;
>> -	mt = get_pageblock_migratetype(page);
>> -	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
>> -		return false;
>>  
>>  	pfn = page_to_pfn(page);
>>  	for (found = 0, iter = 0; iter < pageblock_nr_pages; iter++) {
>> @@ -7368,6 +7365,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>>  
>>  		page = pfn_to_page(check);
>>  
>> +		if (PageReserved(page))
>> +			return true;
>> +
>>  		/*
>>  		 * Hugepages are not in LRU lists, but they're movable.
>>  		 * We need not scan over tail pages bacause we don't
>> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-10-11 14:05             ` Anshuman Khandual
@ 2017-10-11 14:16               ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-10-11 14:16 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Michael Ellerman, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, qiuxishi, Igor Mammedov, Vitaly Kuznetsov,
	linux-mm, LKML, Vlastimil Babka

On Wed 11-10-17 19:35:04, Anshuman Khandual wrote:
[...]
> >   $ grep __init_begin /proc/kallsyms
> >   c000000000d70000 T __init_begin
> >   $ ./page-types -r -a 0x0,0xd7
> >                flags	page-count       MB  symbolic-flags			long-symbolic-flags
> >   0x0000000100000000	       215       13  __________________________r_______________	reserved
> >                total	       215       13
> 
> Hey Michael,
> 
> What tool is this 'page-types' ?

tools/vm/page-types.c

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-10-11 14:16               ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-10-11 14:16 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Michael Ellerman, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, qiuxishi, Igor Mammedov, Vitaly Kuznetsov,
	linux-mm, LKML, Vlastimil Babka

On Wed 11-10-17 19:35:04, Anshuman Khandual wrote:
[...]
> >   $ grep __init_begin /proc/kallsyms
> >   c000000000d70000 T __init_begin
> >   $ ./page-types -r -a 0x0,0xd7
> >                flags	page-count       MB  symbolic-flags			long-symbolic-flags
> >   0x0000000100000000	       215       13  __________________________r_______________	reserved
> >                total	       215       13
> 
> Hey Michael,
> 
> What tool is this 'page-types' ?

tools/vm/page-types.c

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-10-11  5:19           ` Michael Ellerman
@ 2017-10-11 14:05             ` Anshuman Khandual
  -1 siblings, 0 replies; 68+ messages in thread
From: Anshuman Khandual @ 2017-10-11 14:05 UTC (permalink / raw)
  To: Michael Ellerman, Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML,
	Vlastimil Babka

On 10/11/2017 10:49 AM, Michael Ellerman wrote:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>> Michal Hocko <mhocko@kernel.org> writes:
>>> On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
>>>> Michal Hocko <mhocko@kernel.org> writes:
>>>>> From: Michal Hocko <mhocko@suse.com>
>>>>> Memory offlining can fail just too eagerly under a heavy memory pressure.
>>>>>
>>>>> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
>>>>> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
>>>>> [ 5410.336811] page dumped because: isolation failed
>>>>> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
>>>>> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
>>>>>
>>>>> Isolation has failed here because the page is not on LRU. Most probably
>>>>> because it was on the pcp LRU cache or it has been removed from the LRU
>>>>> already but it hasn't been freed yet. In both cases the page doesn't look
>>>>> non-migrable so retrying more makes sense.
>>>> This breaks offline for me.
>>>>
>>>> Prior to this commit:
>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>>   -bash: echo: write error: Device or resource busy
>>>>   
>>>>   real	0m0.001s
>>>>   user	0m0.000s
>>>>   sys	0m0.001s
>>>>
>>>> After:
>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>>   -bash: echo: write error: Device or resource busy
>>>>   
>>>>   real	2m0.009s
>>>>   user	0m0.000s
>>>>   sys	1m25.035s
>>>>
>>>> There's no way that block can be removed, it contains the kernel text,
>>>> so it should instantly fail - which it used to.
>>> OK, that means that start_isolate_page_range should have failed but it
>>> hasn't for some reason. I strongly suspect has_unmovable_pages is doing
>>> something wrong. Is the kernel text marked somehow? E.g. PageReserved?
>> I'm not sure how the text is marked, will have to dig into that.
> Yeah it's reserved:
> 
>   $ grep __init_begin /proc/kallsyms
>   c000000000d70000 T __init_begin
>   $ ./page-types -r -a 0x0,0xd7
>                flags	page-count       MB  symbolic-flags			long-symbolic-flags
>   0x0000000100000000	       215       13  __________________________r_______________	reserved
>                total	       215       13

Hey Michael,

What tool is this 'page-types' ?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-10-11 14:05             ` Anshuman Khandual
  0 siblings, 0 replies; 68+ messages in thread
From: Anshuman Khandual @ 2017-10-11 14:05 UTC (permalink / raw)
  To: Michael Ellerman, Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML,
	Vlastimil Babka

On 10/11/2017 10:49 AM, Michael Ellerman wrote:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>> Michal Hocko <mhocko@kernel.org> writes:
>>> On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
>>>> Michal Hocko <mhocko@kernel.org> writes:
>>>>> From: Michal Hocko <mhocko@suse.com>
>>>>> Memory offlining can fail just too eagerly under a heavy memory pressure.
>>>>>
>>>>> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
>>>>> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
>>>>> [ 5410.336811] page dumped because: isolation failed
>>>>> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
>>>>> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
>>>>>
>>>>> Isolation has failed here because the page is not on LRU. Most probably
>>>>> because it was on the pcp LRU cache or it has been removed from the LRU
>>>>> already but it hasn't been freed yet. In both cases the page doesn't look
>>>>> non-migrable so retrying more makes sense.
>>>> This breaks offline for me.
>>>>
>>>> Prior to this commit:
>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>>   -bash: echo: write error: Device or resource busy
>>>>   
>>>>   real	0m0.001s
>>>>   user	0m0.000s
>>>>   sys	0m0.001s
>>>>
>>>> After:
>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>>   -bash: echo: write error: Device or resource busy
>>>>   
>>>>   real	2m0.009s
>>>>   user	0m0.000s
>>>>   sys	1m25.035s
>>>>
>>>> There's no way that block can be removed, it contains the kernel text,
>>>> so it should instantly fail - which it used to.
>>> OK, that means that start_isolate_page_range should have failed but it
>>> hasn't for some reason. I strongly suspect has_unmovable_pages is doing
>>> something wrong. Is the kernel text marked somehow? E.g. PageReserved?
>> I'm not sure how the text is marked, will have to dig into that.
> Yeah it's reserved:
> 
>   $ grep __init_begin /proc/kallsyms
>   c000000000d70000 T __init_begin
>   $ ./page-types -r -a 0x0,0xd7
>                flags	page-count       MB  symbolic-flags			long-symbolic-flags
>   0x0000000100000000	       215       13  __________________________r_______________	reserved
>                total	       215       13

Hey Michael,

What tool is this 'page-types' ?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-10-11 11:17                 ` Vlastimil Babka
@ 2017-10-11 11:24                   ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-10-11 11:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michael Ellerman, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, qiuxishi, Igor Mammedov, Vitaly Kuznetsov,
	linux-mm, LKML

On Wed 11-10-17 13:17:13, Vlastimil Babka wrote:
> On 10/11/2017 10:13 AM, Michal Hocko wrote:
> > On Wed 11-10-17 10:04:39, Vlastimil Babka wrote:
> >> On 10/11/2017 08:51 AM, Michal Hocko wrote:
> > [...]
> >>> This is really strange! As you write in other email the page is
> >>> reserved. That means that some of the earlier checks 
> >>> 	if (zone_idx(zone) == ZONE_MOVABLE)
> >>> 		return false;
> >>> 	mt = get_pageblock_migratetype(page);
> >>> 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
> >>
> >> The MIGRATE_MOVABLE check is indeed bogus, because that doesn't
> >> guarantee there are no unmovable pages in the block (CMA block OTOH
> >> should be a guarantee).
> > 
> > OK, thanks for confirmation. I will remove the MIGRATE_MOVABLE check
> > here. Do you think it is worth removing CMA check as well? This is
> > merely an optimization AFAIU because we do not have to check the full
> > pageblockworth of pfns.
> 
> Actually, we should remove the CMA part as well. It's true that
> MIGRATE_CMA does guarantee that the *buddy allocator* won't allocate
> non-MOVABLE pages from the pageblock. But if the memory got allocated as
> an actual CMA allocation (alloc_contig...) it will almost certainly not
> be movable.

That was my suspicious. Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-10-11 11:24                   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-10-11 11:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michael Ellerman, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, qiuxishi, Igor Mammedov, Vitaly Kuznetsov,
	linux-mm, LKML

On Wed 11-10-17 13:17:13, Vlastimil Babka wrote:
> On 10/11/2017 10:13 AM, Michal Hocko wrote:
> > On Wed 11-10-17 10:04:39, Vlastimil Babka wrote:
> >> On 10/11/2017 08:51 AM, Michal Hocko wrote:
> > [...]
> >>> This is really strange! As you write in other email the page is
> >>> reserved. That means that some of the earlier checks 
> >>> 	if (zone_idx(zone) == ZONE_MOVABLE)
> >>> 		return false;
> >>> 	mt = get_pageblock_migratetype(page);
> >>> 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
> >>
> >> The MIGRATE_MOVABLE check is indeed bogus, because that doesn't
> >> guarantee there are no unmovable pages in the block (CMA block OTOH
> >> should be a guarantee).
> > 
> > OK, thanks for confirmation. I will remove the MIGRATE_MOVABLE check
> > here. Do you think it is worth removing CMA check as well? This is
> > merely an optimization AFAIU because we do not have to check the full
> > pageblockworth of pfns.
> 
> Actually, we should remove the CMA part as well. It's true that
> MIGRATE_CMA does guarantee that the *buddy allocator* won't allocate
> non-MOVABLE pages from the pageblock. But if the memory got allocated as
> an actual CMA allocation (alloc_contig...) it will almost certainly not
> be movable.

That was my suspicious. Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-10-11  8:13               ` Michal Hocko
@ 2017-10-11 11:17                 ` Vlastimil Babka
  -1 siblings, 0 replies; 68+ messages in thread
From: Vlastimil Babka @ 2017-10-11 11:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael Ellerman, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, qiuxishi, Igor Mammedov, Vitaly Kuznetsov,
	linux-mm, LKML

On 10/11/2017 10:13 AM, Michal Hocko wrote:
> On Wed 11-10-17 10:04:39, Vlastimil Babka wrote:
>> On 10/11/2017 08:51 AM, Michal Hocko wrote:
> [...]
>>> This is really strange! As you write in other email the page is
>>> reserved. That means that some of the earlier checks 
>>> 	if (zone_idx(zone) == ZONE_MOVABLE)
>>> 		return false;
>>> 	mt = get_pageblock_migratetype(page);
>>> 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
>>
>> The MIGRATE_MOVABLE check is indeed bogus, because that doesn't
>> guarantee there are no unmovable pages in the block (CMA block OTOH
>> should be a guarantee).
> 
> OK, thanks for confirmation. I will remove the MIGRATE_MOVABLE check
> here. Do you think it is worth removing CMA check as well? This is
> merely an optimization AFAIU because we do not have to check the full
> pageblockworth of pfns.

Actually, we should remove the CMA part as well. It's true that
MIGRATE_CMA does guarantee that the *buddy allocator* won't allocate
non-MOVABLE pages from the pageblock. But if the memory got allocated as
an actual CMA allocation (alloc_contig...) it will almost certainly not
be movable.

> Anyway, let's way for Michael to confirm it really helps. If yes I will
> post a full patch and ask Andrew to add it as a prerequisite for this
> patch when sending to Linus to prevent the regression.
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-10-11 11:17                 ` Vlastimil Babka
  0 siblings, 0 replies; 68+ messages in thread
From: Vlastimil Babka @ 2017-10-11 11:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Michael Ellerman, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, qiuxishi, Igor Mammedov, Vitaly Kuznetsov,
	linux-mm, LKML

On 10/11/2017 10:13 AM, Michal Hocko wrote:
> On Wed 11-10-17 10:04:39, Vlastimil Babka wrote:
>> On 10/11/2017 08:51 AM, Michal Hocko wrote:
> [...]
>>> This is really strange! As you write in other email the page is
>>> reserved. That means that some of the earlier checks 
>>> 	if (zone_idx(zone) == ZONE_MOVABLE)
>>> 		return false;
>>> 	mt = get_pageblock_migratetype(page);
>>> 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
>>
>> The MIGRATE_MOVABLE check is indeed bogus, because that doesn't
>> guarantee there are no unmovable pages in the block (CMA block OTOH
>> should be a guarantee).
> 
> OK, thanks for confirmation. I will remove the MIGRATE_MOVABLE check
> here. Do you think it is worth removing CMA check as well? This is
> merely an optimization AFAIU because we do not have to check the full
> pageblockworth of pfns.

Actually, we should remove the CMA part as well. It's true that
MIGRATE_CMA does guarantee that the *buddy allocator* won't allocate
non-MOVABLE pages from the pageblock. But if the memory got allocated as
an actual CMA allocation (alloc_contig...) it will almost certainly not
be movable.

> Anyway, let's way for Michael to confirm it really helps. If yes I will
> post a full patch and ask Andrew to add it as a prerequisite for this
> patch when sending to Linus to prevent the regression.
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-10-11  8:04             ` Vlastimil Babka
@ 2017-10-11  8:13               ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-10-11  8:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michael Ellerman, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, qiuxishi, Igor Mammedov, Vitaly Kuznetsov,
	linux-mm, LKML

On Wed 11-10-17 10:04:39, Vlastimil Babka wrote:
> On 10/11/2017 08:51 AM, Michal Hocko wrote:
[...]
> > This is really strange! As you write in other email the page is
> > reserved. That means that some of the earlier checks 
> > 	if (zone_idx(zone) == ZONE_MOVABLE)
> > 		return false;
> > 	mt = get_pageblock_migratetype(page);
> > 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
> 
> The MIGRATE_MOVABLE check is indeed bogus, because that doesn't
> guarantee there are no unmovable pages in the block (CMA block OTOH
> should be a guarantee).

OK, thanks for confirmation. I will remove the MIGRATE_MOVABLE check
here. Do you think it is worth removing CMA check as well? This is
merely an optimization AFAIU because we do not have to check the full
pageblockworth of pfns.

Anyway, let's way for Michael to confirm it really helps. If yes I will
post a full patch and ask Andrew to add it as a prerequisite for this
patch when sending to Linus to prevent the regression.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-10-11  8:13               ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-10-11  8:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michael Ellerman, Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab,
	Yasuaki Ishimatsu, qiuxishi, Igor Mammedov, Vitaly Kuznetsov,
	linux-mm, LKML

On Wed 11-10-17 10:04:39, Vlastimil Babka wrote:
> On 10/11/2017 08:51 AM, Michal Hocko wrote:
[...]
> > This is really strange! As you write in other email the page is
> > reserved. That means that some of the earlier checks 
> > 	if (zone_idx(zone) == ZONE_MOVABLE)
> > 		return false;
> > 	mt = get_pageblock_migratetype(page);
> > 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
> 
> The MIGRATE_MOVABLE check is indeed bogus, because that doesn't
> guarantee there are no unmovable pages in the block (CMA block OTOH
> should be a guarantee).

OK, thanks for confirmation. I will remove the MIGRATE_MOVABLE check
here. Do you think it is worth removing CMA check as well? This is
merely an optimization AFAIU because we do not have to check the full
pageblockworth of pfns.

Anyway, let's way for Michael to confirm it really helps. If yes I will
post a full patch and ask Andrew to add it as a prerequisite for this
patch when sending to Linus to prevent the regression.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-10-11  6:51           ` Michal Hocko
@ 2017-10-11  8:04             ` Vlastimil Babka
  -1 siblings, 0 replies; 68+ messages in thread
From: Vlastimil Babka @ 2017-10-11  8:04 UTC (permalink / raw)
  To: Michal Hocko, Michael Ellerman
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On 10/11/2017 08:51 AM, Michal Hocko wrote:
> On Wed 11-10-17 13:37:50, Michael Ellerman wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>>
>>> On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
>>>> Michal Hocko <mhocko@kernel.org> writes:
>>>>
>>>>> From: Michal Hocko <mhocko@suse.com>
>>>>>
>>>>> Memory offlining can fail just too eagerly under a heavy memory pressure.
>>>>>
>>>>> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
>>>>> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
>>>>> [ 5410.336811] page dumped because: isolation failed
>>>>> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
>>>>> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
>>>>>
>>>>> Isolation has failed here because the page is not on LRU. Most probably
>>>>> because it was on the pcp LRU cache or it has been removed from the LRU
>>>>> already but it hasn't been freed yet. In both cases the page doesn't look
>>>>> non-migrable so retrying more makes sense.
>>>>
>>>> This breaks offline for me.
>>>>
>>>> Prior to this commit:
>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>>   -bash: echo: write error: Device or resource busy

Well, that means offline didn't actually work for that block even before
this patch, right? Is it even a movable_node block? I guess not?

>>>>   real	0m0.001s
>>>>   user	0m0.000s
>>>>   sys	0m0.001s
>>>>
>>>> After:
>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>>   -bash: echo: write error: Device or resource busy
>>>>   
>>>>   real	2m0.009s
>>>>   user	0m0.000s
>>>>   sys	1m25.035s
>>>>
>>>>
>>>> There's no way that block can be removed, it contains the kernel text,
>>>> so it should instantly fail - which it used to.

Ah, right. So your complain is really about that the failure is not
instant anymore for blocks that can't be offlined.

>>> OK, that means that start_isolate_page_range should have failed but it
>>> hasn't for some reason. I strongly suspect has_unmovable_pages is doing
>>> something wrong. Is the kernel text marked somehow? E.g. PageReserved?
>>
>> I'm not sure how the text is marked, will have to dig into that.
>>
>>> In other words, does the diff below helps?
>>
>> No that doesn't help.
> 
> This is really strange! As you write in other email the page is
> reserved. That means that some of the earlier checks 
> 	if (zone_idx(zone) == ZONE_MOVABLE)
> 		return false;
> 	mt = get_pageblock_migratetype(page);
> 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))

The MIGRATE_MOVABLE check is indeed bogus, because that doesn't
guarantee there are no unmovable pages in the block (CMA block OTOH
should be a guarantee).

> 		return false;
> has bailed out early. I would be quite surprised if the kernel text was
> sitting in the zone movable. The migrate type check is more fishy
> AFAICS. I can imagine that the kernel text can share the movable or CMA
> mt block. I am not really familiar with this function but it looks
> suspicious. So does it help to remove this check?
> --- 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3badcedf96a7..5b4d85ae445c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7355,9 +7355,6 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  	 */
>  	if (zone_idx(zone) == ZONE_MOVABLE)
>  		return false;
> -	mt = get_pageblock_migratetype(page);
> -	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
> -		return false;
>  
>  	pfn = page_to_pfn(page);
>  	for (found = 0, iter = 0; iter < pageblock_nr_pages; iter++) {
> @@ -7368,6 +7365,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  
>  		page = pfn_to_page(check);
>  
> +		if (PageReserved(page))
> +			return true;
> +
>  		/*
>  		 * Hugepages are not in LRU lists, but they're movable.
>  		 * We need not scan over tail pages bacause we don't
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-10-11  8:04             ` Vlastimil Babka
  0 siblings, 0 replies; 68+ messages in thread
From: Vlastimil Babka @ 2017-10-11  8:04 UTC (permalink / raw)
  To: Michal Hocko, Michael Ellerman
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML

On 10/11/2017 08:51 AM, Michal Hocko wrote:
> On Wed 11-10-17 13:37:50, Michael Ellerman wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>>
>>> On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
>>>> Michal Hocko <mhocko@kernel.org> writes:
>>>>
>>>>> From: Michal Hocko <mhocko@suse.com>
>>>>>
>>>>> Memory offlining can fail just too eagerly under a heavy memory pressure.
>>>>>
>>>>> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
>>>>> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
>>>>> [ 5410.336811] page dumped because: isolation failed
>>>>> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
>>>>> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
>>>>>
>>>>> Isolation has failed here because the page is not on LRU. Most probably
>>>>> because it was on the pcp LRU cache or it has been removed from the LRU
>>>>> already but it hasn't been freed yet. In both cases the page doesn't look
>>>>> non-migrable so retrying more makes sense.
>>>>
>>>> This breaks offline for me.
>>>>
>>>> Prior to this commit:
>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>>   -bash: echo: write error: Device or resource busy

Well, that means offline didn't actually work for that block even before
this patch, right? Is it even a movable_node block? I guess not?

>>>>   real	0m0.001s
>>>>   user	0m0.000s
>>>>   sys	0m0.001s
>>>>
>>>> After:
>>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>>   -bash: echo: write error: Device or resource busy
>>>>   
>>>>   real	2m0.009s
>>>>   user	0m0.000s
>>>>   sys	1m25.035s
>>>>
>>>>
>>>> There's no way that block can be removed, it contains the kernel text,
>>>> so it should instantly fail - which it used to.

Ah, right. So your complain is really about that the failure is not
instant anymore for blocks that can't be offlined.

>>> OK, that means that start_isolate_page_range should have failed but it
>>> hasn't for some reason. I strongly suspect has_unmovable_pages is doing
>>> something wrong. Is the kernel text marked somehow? E.g. PageReserved?
>>
>> I'm not sure how the text is marked, will have to dig into that.
>>
>>> In other words, does the diff below helps?
>>
>> No that doesn't help.
> 
> This is really strange! As you write in other email the page is
> reserved. That means that some of the earlier checks 
> 	if (zone_idx(zone) == ZONE_MOVABLE)
> 		return false;
> 	mt = get_pageblock_migratetype(page);
> 	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))

The MIGRATE_MOVABLE check is indeed bogus, because that doesn't
guarantee there are no unmovable pages in the block (CMA block OTOH
should be a guarantee).

> 		return false;
> has bailed out early. I would be quite surprised if the kernel text was
> sitting in the zone movable. The migrate type check is more fishy
> AFAICS. I can imagine that the kernel text can share the movable or CMA
> mt block. I am not really familiar with this function but it looks
> suspicious. So does it help to remove this check?
> --- 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3badcedf96a7..5b4d85ae445c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7355,9 +7355,6 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  	 */
>  	if (zone_idx(zone) == ZONE_MOVABLE)
>  		return false;
> -	mt = get_pageblock_migratetype(page);
> -	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
> -		return false;
>  
>  	pfn = page_to_pfn(page);
>  	for (found = 0, iter = 0; iter < pageblock_nr_pages; iter++) {
> @@ -7368,6 +7365,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  
>  		page = pfn_to_page(check);
>  
> +		if (PageReserved(page))
> +			return true;
> +
>  		/*
>  		 * Hugepages are not in LRU lists, but they're movable.
>  		 * We need not scan over tail pages bacause we don't
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-10-11  2:37         ` Michael Ellerman
@ 2017-10-11  6:51           ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-10-11  6:51 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML,
	Vlastimil Babka

On Wed 11-10-17 13:37:50, Michael Ellerman wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
> >> Michal Hocko <mhocko@kernel.org> writes:
> >> 
> >> > From: Michal Hocko <mhocko@suse.com>
> >> >
> >> > Memory offlining can fail just too eagerly under a heavy memory pressure.
> >> >
> >> > [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> >> > [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> >> > [ 5410.336811] page dumped because: isolation failed
> >> > [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> >> > [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> >> >
> >> > Isolation has failed here because the page is not on LRU. Most probably
> >> > because it was on the pcp LRU cache or it has been removed from the LRU
> >> > already but it hasn't been freed yet. In both cases the page doesn't look
> >> > non-migrable so retrying more makes sense.
> >> 
> >> This breaks offline for me.
> >> 
> >> Prior to this commit:
> >>   /sys/devices/system/memory/memory0# time echo 0 > online
> >>   -bash: echo: write error: Device or resource busy
> >>   
> >>   real	0m0.001s
> >>   user	0m0.000s
> >>   sys	0m0.001s
> >> 
> >> After:
> >>   /sys/devices/system/memory/memory0# time echo 0 > online
> >>   -bash: echo: write error: Device or resource busy
> >>   
> >>   real	2m0.009s
> >>   user	0m0.000s
> >>   sys	1m25.035s
> >> 
> >> 
> >> There's no way that block can be removed, it contains the kernel text,
> >> so it should instantly fail - which it used to.
> >
> > OK, that means that start_isolate_page_range should have failed but it
> > hasn't for some reason. I strongly suspect has_unmovable_pages is doing
> > something wrong. Is the kernel text marked somehow? E.g. PageReserved?
> 
> I'm not sure how the text is marked, will have to dig into that.
> 
> > In other words, does the diff below helps?
> 
> No that doesn't help.

This is really strange! As you write in other email the page is
reserved. That means that some of the earlier checks 
	if (zone_idx(zone) == ZONE_MOVABLE)
		return false;
	mt = get_pageblock_migratetype(page);
	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
		return false;
has bailed out early. I would be quite surprised if the kernel text was
sitting in the zone movable. The migrate type check is more fishy
AFAICS. I can imagine that the kernel text can share the movable or CMA
mt block. I am not really familiar with this function but it looks
suspicious. So does it help to remove this check?
--- 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3badcedf96a7..5b4d85ae445c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7355,9 +7355,6 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 	 */
 	if (zone_idx(zone) == ZONE_MOVABLE)
 		return false;
-	mt = get_pageblock_migratetype(page);
-	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
-		return false;
 
 	pfn = page_to_pfn(page);
 	for (found = 0, iter = 0; iter < pageblock_nr_pages; iter++) {
@@ -7368,6 +7365,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 
 		page = pfn_to_page(check);
 
+		if (PageReserved(page))
+			return true;
+
 		/*
 		 * Hugepages are not in LRU lists, but they're movable.
 		 * We need not scan over tail pages bacause we don't

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-10-11  6:51           ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-10-11  6:51 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML,
	Vlastimil Babka

On Wed 11-10-17 13:37:50, Michael Ellerman wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
> >> Michal Hocko <mhocko@kernel.org> writes:
> >> 
> >> > From: Michal Hocko <mhocko@suse.com>
> >> >
> >> > Memory offlining can fail just too eagerly under a heavy memory pressure.
> >> >
> >> > [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> >> > [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> >> > [ 5410.336811] page dumped because: isolation failed
> >> > [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> >> > [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> >> >
> >> > Isolation has failed here because the page is not on LRU. Most probably
> >> > because it was on the pcp LRU cache or it has been removed from the LRU
> >> > already but it hasn't been freed yet. In both cases the page doesn't look
> >> > non-migrable so retrying more makes sense.
> >> 
> >> This breaks offline for me.
> >> 
> >> Prior to this commit:
> >>   /sys/devices/system/memory/memory0# time echo 0 > online
> >>   -bash: echo: write error: Device or resource busy
> >>   
> >>   real	0m0.001s
> >>   user	0m0.000s
> >>   sys	0m0.001s
> >> 
> >> After:
> >>   /sys/devices/system/memory/memory0# time echo 0 > online
> >>   -bash: echo: write error: Device or resource busy
> >>   
> >>   real	2m0.009s
> >>   user	0m0.000s
> >>   sys	1m25.035s
> >> 
> >> 
> >> There's no way that block can be removed, it contains the kernel text,
> >> so it should instantly fail - which it used to.
> >
> > OK, that means that start_isolate_page_range should have failed but it
> > hasn't for some reason. I strongly suspect has_unmovable_pages is doing
> > something wrong. Is the kernel text marked somehow? E.g. PageReserved?
> 
> I'm not sure how the text is marked, will have to dig into that.
> 
> > In other words, does the diff below helps?
> 
> No that doesn't help.

This is really strange! As you write in other email the page is
reserved. That means that some of the earlier checks 
	if (zone_idx(zone) == ZONE_MOVABLE)
		return false;
	mt = get_pageblock_migratetype(page);
	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
		return false;
has bailed out early. I would be quite surprised if the kernel text was
sitting in the zone movable. The migrate type check is more fishy
AFAICS. I can imagine that the kernel text can share the movable or CMA
mt block. I am not really familiar with this function but it looks
suspicious. So does it help to remove this check?
--- 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3badcedf96a7..5b4d85ae445c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7355,9 +7355,6 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 	 */
 	if (zone_idx(zone) == ZONE_MOVABLE)
 		return false;
-	mt = get_pageblock_migratetype(page);
-	if (mt == MIGRATE_MOVABLE || is_migrate_cma(mt))
-		return false;
 
 	pfn = page_to_pfn(page);
 	for (found = 0, iter = 0; iter < pageblock_nr_pages; iter++) {
@@ -7368,6 +7365,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 
 		page = pfn_to_page(check);
 
+		if (PageReserved(page))
+			return true;
+
 		/*
 		 * Hugepages are not in LRU lists, but they're movable.
 		 * We need not scan over tail pages bacause we don't

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-10-11  2:37         ` Michael Ellerman
@ 2017-10-11  5:19           ` Michael Ellerman
  -1 siblings, 0 replies; 68+ messages in thread
From: Michael Ellerman @ 2017-10-11  5:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML,
	Vlastimil Babka

Michael Ellerman <mpe@ellerman.id.au> writes:
> Michal Hocko <mhocko@kernel.org> writes:
>> On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
>>> Michal Hocko <mhocko@kernel.org> writes:
>>> > From: Michal Hocko <mhocko@suse.com>
>>> > Memory offlining can fail just too eagerly under a heavy memory pressure.
>>> >
>>> > [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
>>> > [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
>>> > [ 5410.336811] page dumped because: isolation failed
>>> > [ 5410.336813] page->mem_cgroup:ffff8801cd662000
>>> > [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
>>> >
>>> > Isolation has failed here because the page is not on LRU. Most probably
>>> > because it was on the pcp LRU cache or it has been removed from the LRU
>>> > already but it hasn't been freed yet. In both cases the page doesn't look
>>> > non-migrable so retrying more makes sense.
>>> 
>>> This breaks offline for me.
>>> 
>>> Prior to this commit:
>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>   -bash: echo: write error: Device or resource busy
>>>   
>>>   real	0m0.001s
>>>   user	0m0.000s
>>>   sys	0m0.001s
>>> 
>>> After:
>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>   -bash: echo: write error: Device or resource busy
>>>   
>>>   real	2m0.009s
>>>   user	0m0.000s
>>>   sys	1m25.035s
>>> 
>>> There's no way that block can be removed, it contains the kernel text,
>>> so it should instantly fail - which it used to.
>>
>> OK, that means that start_isolate_page_range should have failed but it
>> hasn't for some reason. I strongly suspect has_unmovable_pages is doing
>> something wrong. Is the kernel text marked somehow? E.g. PageReserved?
>
> I'm not sure how the text is marked, will have to dig into that.

Yeah it's reserved:

  $ grep __init_begin /proc/kallsyms
  c000000000d70000 T __init_begin
  $ ./page-types -r -a 0x0,0xd7
               flags	page-count       MB  symbolic-flags			long-symbolic-flags
  0x0000000100000000	       215       13  __________________________r_______________	reserved
               total	       215       13


I added some printks, we're getting EBUSY from do_migrate_range(pfn, end_pfn).

So we seem to just have an infinite loop:

  repeat:
  	/* start memory hot removal */
  	ret = -EINTR;
  	if (signal_pending(current))
  		goto failed_removal;
  
  	cond_resched();
  	lru_add_drain_all_cpuslocked();
  	drain_all_pages(zone);
  
  	pfn = scan_movable_pages(start_pfn, end_pfn);
  	if (pfn) { /* We have movable pages */
  		ret = do_migrate_range(pfn, end_pfn);
  		printk_ratelimited("memory-hotplug: migrate range returned %ld\n", ret);
  		goto repeat;
  	}


eg:

  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  __offline_pages: 354031 callbacks suppressed
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  __offline_pages: 355794 callbacks suppressed


cheers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-10-11  5:19           ` Michael Ellerman
  0 siblings, 0 replies; 68+ messages in thread
From: Michael Ellerman @ 2017-10-11  5:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML,
	Vlastimil Babka

Michael Ellerman <mpe@ellerman.id.au> writes:
> Michal Hocko <mhocko@kernel.org> writes:
>> On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
>>> Michal Hocko <mhocko@kernel.org> writes:
>>> > From: Michal Hocko <mhocko@suse.com>
>>> > Memory offlining can fail just too eagerly under a heavy memory pressure.
>>> >
>>> > [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
>>> > [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
>>> > [ 5410.336811] page dumped because: isolation failed
>>> > [ 5410.336813] page->mem_cgroup:ffff8801cd662000
>>> > [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
>>> >
>>> > Isolation has failed here because the page is not on LRU. Most probably
>>> > because it was on the pcp LRU cache or it has been removed from the LRU
>>> > already but it hasn't been freed yet. In both cases the page doesn't look
>>> > non-migrable so retrying more makes sense.
>>> 
>>> This breaks offline for me.
>>> 
>>> Prior to this commit:
>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>   -bash: echo: write error: Device or resource busy
>>>   
>>>   real	0m0.001s
>>>   user	0m0.000s
>>>   sys	0m0.001s
>>> 
>>> After:
>>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>>   -bash: echo: write error: Device or resource busy
>>>   
>>>   real	2m0.009s
>>>   user	0m0.000s
>>>   sys	1m25.035s
>>> 
>>> There's no way that block can be removed, it contains the kernel text,
>>> so it should instantly fail - which it used to.
>>
>> OK, that means that start_isolate_page_range should have failed but it
>> hasn't for some reason. I strongly suspect has_unmovable_pages is doing
>> something wrong. Is the kernel text marked somehow? E.g. PageReserved?
>
> I'm not sure how the text is marked, will have to dig into that.

Yeah it's reserved:

  $ grep __init_begin /proc/kallsyms
  c000000000d70000 T __init_begin
  $ ./page-types -r -a 0x0,0xd7
               flags	page-count       MB  symbolic-flags			long-symbolic-flags
  0x0000000100000000	       215       13  __________________________r_______________	reserved
               total	       215       13


I added some printks, we're getting EBUSY from do_migrate_range(pfn, end_pfn).

So we seem to just have an infinite loop:

  repeat:
  	/* start memory hot removal */
  	ret = -EINTR;
  	if (signal_pending(current))
  		goto failed_removal;
  
  	cond_resched();
  	lru_add_drain_all_cpuslocked();
  	drain_all_pages(zone);
  
  	pfn = scan_movable_pages(start_pfn, end_pfn);
  	if (pfn) { /* We have movable pages */
  		ret = do_migrate_range(pfn, end_pfn);
  		printk_ratelimited("memory-hotplug: migrate range returned %ld\n", ret);
  		goto repeat;
  	}


eg:

  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  __offline_pages: 354031 callbacks suppressed
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  memory-hotplug: migrate range returned -16
  __offline_pages: 355794 callbacks suppressed


cheers

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-10-10 12:27       ` Michal Hocko
@ 2017-10-11  2:37         ` Michael Ellerman
  -1 siblings, 0 replies; 68+ messages in thread
From: Michael Ellerman @ 2017-10-11  2:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML,
	Vlastimil Babka

Michal Hocko <mhocko@kernel.org> writes:

> On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > From: Michal Hocko <mhocko@suse.com>
>> >
>> > Memory offlining can fail just too eagerly under a heavy memory pressure.
>> >
>> > [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
>> > [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
>> > [ 5410.336811] page dumped because: isolation failed
>> > [ 5410.336813] page->mem_cgroup:ffff8801cd662000
>> > [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
>> >
>> > Isolation has failed here because the page is not on LRU. Most probably
>> > because it was on the pcp LRU cache or it has been removed from the LRU
>> > already but it hasn't been freed yet. In both cases the page doesn't look
>> > non-migrable so retrying more makes sense.
>> 
>> This breaks offline for me.
>> 
>> Prior to this commit:
>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>   -bash: echo: write error: Device or resource busy
>>   
>>   real	0m0.001s
>>   user	0m0.000s
>>   sys	0m0.001s
>> 
>> After:
>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>   -bash: echo: write error: Device or resource busy
>>   
>>   real	2m0.009s
>>   user	0m0.000s
>>   sys	1m25.035s
>> 
>> 
>> There's no way that block can be removed, it contains the kernel text,
>> so it should instantly fail - which it used to.
>
> OK, that means that start_isolate_page_range should have failed but it
> hasn't for some reason. I strongly suspect has_unmovable_pages is doing
> something wrong. Is the kernel text marked somehow? E.g. PageReserved?

I'm not sure how the text is marked, will have to dig into that.

> In other words, does the diff below helps?

No that doesn't help.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3badcedf96a7..00d042052501 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7368,6 +7368,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  
>  		page = pfn_to_page(check);
>  
> +		if (PageReserved(page))
> +			return true;
> +
>  		/*
>  		 * Hugepages are not in LRU lists, but they're movable.
>  		 * We need not scan over tail pages bacause we don't
>
>
>> With commit 3aa2823fdf66 ("mm, memory_hotplug: remove timeout from
>> __offline_memory") also applied, it appears to just get stuck forever,
>> and I get lots of:
>> 
>>   [ 1232.112953] INFO: task kworker/3:0:4609 blocked for more than 120 seconds.
>>   [ 1232.113067]       Not tainted 4.14.0-rc4-gcc6-next-20171009-g49827b9 #1
>>   [ 1232.113183] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>   [ 1232.113319] kworker/3:0     D11984  4609      2 0x00000800
>>   [ 1232.113416] Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
>>   [ 1232.113531] Call Trace:
>>   [ 1232.113579] [c0000000fb2db7a0] [c0000000fb2db900] 0xc0000000fb2db900 (unreliable)
>>   [ 1232.113717] [c0000000fb2db970] [c00000000001c964] __switch_to+0x304/0x6e0
>>   [ 1232.113840] [c0000000fb2dba10] [c000000000a408c0] __schedule+0x2e0/0xa80
>>   [ 1232.113978] [c0000000fb2dbae0] [c000000000a410a8] schedule+0x48/0xc0
>>   [ 1232.114113] [c0000000fb2dbb10] [c000000000a44d88] rwsem_down_read_failed+0x128/0x1b0
>>   [ 1232.114269] [c0000000fb2dbb70] [c0000000001696a8] __percpu_down_read+0x108/0x110
>>   [ 1232.114426] [c0000000fb2dbba0] [c00000000032e498] get_online_mems+0x68/0x80
>>   [ 1232.115487] [c0000000fb2dbbc0] [c0000000002c82ec] memcg_create_kmem_cache+0x4c/0x190
>>   [ 1232.115651] [c0000000fb2dbc60] [c0000000003483b8] memcg_kmem_cache_create_func+0x38/0xf0
>>   [ 1232.115809] [c0000000fb2dbc90] [c000000000121594] process_one_work+0x2b4/0x590
>>   [ 1232.115964] [c0000000fb2dbd20] [c000000000121908] worker_thread+0x98/0x5d0
>>   [ 1232.116095] [c0000000fb2dbdc0] [c00000000012a134] kthread+0x164/0x1b0
>>   [ 1232.116229] [c0000000fb2dbe30] [c00000000000bae0] ret_from_kernel_thread+0x5c/0x7c
>
> I do not see how this is related to the offline path.

It's blocked doing get_online_mems(). So it's unrelated to the offline,
but it can't proceed until the offline finishes, which it never does,
IIUIC.

cheers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-10-11  2:37         ` Michael Ellerman
  0 siblings, 0 replies; 68+ messages in thread
From: Michael Ellerman @ 2017-10-11  2:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML,
	Vlastimil Babka

Michal Hocko <mhocko@kernel.org> writes:

> On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > From: Michal Hocko <mhocko@suse.com>
>> >
>> > Memory offlining can fail just too eagerly under a heavy memory pressure.
>> >
>> > [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
>> > [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
>> > [ 5410.336811] page dumped because: isolation failed
>> > [ 5410.336813] page->mem_cgroup:ffff8801cd662000
>> > [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
>> >
>> > Isolation has failed here because the page is not on LRU. Most probably
>> > because it was on the pcp LRU cache or it has been removed from the LRU
>> > already but it hasn't been freed yet. In both cases the page doesn't look
>> > non-migrable so retrying more makes sense.
>> 
>> This breaks offline for me.
>> 
>> Prior to this commit:
>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>   -bash: echo: write error: Device or resource busy
>>   
>>   real	0m0.001s
>>   user	0m0.000s
>>   sys	0m0.001s
>> 
>> After:
>>   /sys/devices/system/memory/memory0# time echo 0 > online
>>   -bash: echo: write error: Device or resource busy
>>   
>>   real	2m0.009s
>>   user	0m0.000s
>>   sys	1m25.035s
>> 
>> 
>> There's no way that block can be removed, it contains the kernel text,
>> so it should instantly fail - which it used to.
>
> OK, that means that start_isolate_page_range should have failed but it
> hasn't for some reason. I strongly suspect has_unmovable_pages is doing
> something wrong. Is the kernel text marked somehow? E.g. PageReserved?

I'm not sure how the text is marked, will have to dig into that.

> In other words, does the diff below helps?

No that doesn't help.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3badcedf96a7..00d042052501 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7368,6 +7368,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  
>  		page = pfn_to_page(check);
>  
> +		if (PageReserved(page))
> +			return true;
> +
>  		/*
>  		 * Hugepages are not in LRU lists, but they're movable.
>  		 * We need not scan over tail pages bacause we don't
>
>
>> With commit 3aa2823fdf66 ("mm, memory_hotplug: remove timeout from
>> __offline_memory") also applied, it appears to just get stuck forever,
>> and I get lots of:
>> 
>>   [ 1232.112953] INFO: task kworker/3:0:4609 blocked for more than 120 seconds.
>>   [ 1232.113067]       Not tainted 4.14.0-rc4-gcc6-next-20171009-g49827b9 #1
>>   [ 1232.113183] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>   [ 1232.113319] kworker/3:0     D11984  4609      2 0x00000800
>>   [ 1232.113416] Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
>>   [ 1232.113531] Call Trace:
>>   [ 1232.113579] [c0000000fb2db7a0] [c0000000fb2db900] 0xc0000000fb2db900 (unreliable)
>>   [ 1232.113717] [c0000000fb2db970] [c00000000001c964] __switch_to+0x304/0x6e0
>>   [ 1232.113840] [c0000000fb2dba10] [c000000000a408c0] __schedule+0x2e0/0xa80
>>   [ 1232.113978] [c0000000fb2dbae0] [c000000000a410a8] schedule+0x48/0xc0
>>   [ 1232.114113] [c0000000fb2dbb10] [c000000000a44d88] rwsem_down_read_failed+0x128/0x1b0
>>   [ 1232.114269] [c0000000fb2dbb70] [c0000000001696a8] __percpu_down_read+0x108/0x110
>>   [ 1232.114426] [c0000000fb2dbba0] [c00000000032e498] get_online_mems+0x68/0x80
>>   [ 1232.115487] [c0000000fb2dbbc0] [c0000000002c82ec] memcg_create_kmem_cache+0x4c/0x190
>>   [ 1232.115651] [c0000000fb2dbc60] [c0000000003483b8] memcg_kmem_cache_create_func+0x38/0xf0
>>   [ 1232.115809] [c0000000fb2dbc90] [c000000000121594] process_one_work+0x2b4/0x590
>>   [ 1232.115964] [c0000000fb2dbd20] [c000000000121908] worker_thread+0x98/0x5d0
>>   [ 1232.116095] [c0000000fb2dbdc0] [c00000000012a134] kthread+0x164/0x1b0
>>   [ 1232.116229] [c0000000fb2dbe30] [c00000000000bae0] ret_from_kernel_thread+0x5c/0x7c
>
> I do not see how this is related to the offline path.

It's blocked doing get_online_mems(). So it's unrelated to the offline,
but it can't proceed until the offline finishes, which it never does,
IIUIC.

cheers

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-10-10 12:05     ` Michael Ellerman
@ 2017-10-10 12:27       ` Michal Hocko
  -1 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-10-10 12:27 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML,
	Vlastimil Babka

On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > From: Michal Hocko <mhocko@suse.com>
> >
> > Memory offlining can fail just too eagerly under a heavy memory pressure.
> >
> > [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> > [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> > [ 5410.336811] page dumped because: isolation failed
> > [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> > [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> >
> > Isolation has failed here because the page is not on LRU. Most probably
> > because it was on the pcp LRU cache or it has been removed from the LRU
> > already but it hasn't been freed yet. In both cases the page doesn't look
> > non-migrable so retrying more makes sense.
> 
> This breaks offline for me.
> 
> Prior to this commit:
>   /sys/devices/system/memory/memory0# time echo 0 > online
>   -bash: echo: write error: Device or resource busy
>   
>   real	0m0.001s
>   user	0m0.000s
>   sys	0m0.001s
> 
> After:
>   /sys/devices/system/memory/memory0# time echo 0 > online
>   -bash: echo: write error: Device or resource busy
>   
>   real	2m0.009s
>   user	0m0.000s
>   sys	1m25.035s
> 
> 
> There's no way that block can be removed, it contains the kernel text,
> so it should instantly fail - which it used to.

OK, that means that start_isolate_page_range should have failed but it
hasn't for some reason. I strongly suspect has_unmovable_pages is doing
something wrong. Is the kernel text marked somehow? E.g. PageReserved?
In other words, does the diff below helps?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3badcedf96a7..00d042052501 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7368,6 +7368,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 
 		page = pfn_to_page(check);
 
+		if (PageReserved(page))
+			return true;
+
 		/*
 		 * Hugepages are not in LRU lists, but they're movable.
 		 * We need not scan over tail pages bacause we don't


> With commit 3aa2823fdf66 ("mm, memory_hotplug: remove timeout from
> __offline_memory") also applied, it appears to just get stuck forever,
> and I get lots of:
> 
>   [ 1232.112953] INFO: task kworker/3:0:4609 blocked for more than 120 seconds.
>   [ 1232.113067]       Not tainted 4.14.0-rc4-gcc6-next-20171009-g49827b9 #1
>   [ 1232.113183] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1232.113319] kworker/3:0     D11984  4609      2 0x00000800
>   [ 1232.113416] Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
>   [ 1232.113531] Call Trace:
>   [ 1232.113579] [c0000000fb2db7a0] [c0000000fb2db900] 0xc0000000fb2db900 (unreliable)
>   [ 1232.113717] [c0000000fb2db970] [c00000000001c964] __switch_to+0x304/0x6e0
>   [ 1232.113840] [c0000000fb2dba10] [c000000000a408c0] __schedule+0x2e0/0xa80
>   [ 1232.113978] [c0000000fb2dbae0] [c000000000a410a8] schedule+0x48/0xc0
>   [ 1232.114113] [c0000000fb2dbb10] [c000000000a44d88] rwsem_down_read_failed+0x128/0x1b0
>   [ 1232.114269] [c0000000fb2dbb70] [c0000000001696a8] __percpu_down_read+0x108/0x110
>   [ 1232.114426] [c0000000fb2dbba0] [c00000000032e498] get_online_mems+0x68/0x80
>   [ 1232.115487] [c0000000fb2dbbc0] [c0000000002c82ec] memcg_create_kmem_cache+0x4c/0x190
>   [ 1232.115651] [c0000000fb2dbc60] [c0000000003483b8] memcg_kmem_cache_create_func+0x38/0xf0
>   [ 1232.115809] [c0000000fb2dbc90] [c000000000121594] process_one_work+0x2b4/0x590
>   [ 1232.115964] [c0000000fb2dbd20] [c000000000121908] worker_thread+0x98/0x5d0
>   [ 1232.116095] [c0000000fb2dbdc0] [c00000000012a134] kthread+0x164/0x1b0
>   [ 1232.116229] [c0000000fb2dbe30] [c00000000000bae0] ret_from_kernel_thread+0x5c/0x7c

I do not see how this is related to the offline path.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-10-10 12:27       ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-10-10 12:27 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML,
	Vlastimil Babka

On Tue 10-10-17 23:05:08, Michael Ellerman wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > From: Michal Hocko <mhocko@suse.com>
> >
> > Memory offlining can fail just too eagerly under a heavy memory pressure.
> >
> > [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> > [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> > [ 5410.336811] page dumped because: isolation failed
> > [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> > [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> >
> > Isolation has failed here because the page is not on LRU. Most probably
> > because it was on the pcp LRU cache or it has been removed from the LRU
> > already but it hasn't been freed yet. In both cases the page doesn't look
> > non-migrable so retrying more makes sense.
> 
> This breaks offline for me.
> 
> Prior to this commit:
>   /sys/devices/system/memory/memory0# time echo 0 > online
>   -bash: echo: write error: Device or resource busy
>   
>   real	0m0.001s
>   user	0m0.000s
>   sys	0m0.001s
> 
> After:
>   /sys/devices/system/memory/memory0# time echo 0 > online
>   -bash: echo: write error: Device or resource busy
>   
>   real	2m0.009s
>   user	0m0.000s
>   sys	1m25.035s
> 
> 
> There's no way that block can be removed, it contains the kernel text,
> so it should instantly fail - which it used to.

OK, that means that start_isolate_page_range should have failed but it
hasn't for some reason. I strongly suspect has_unmovable_pages is doing
something wrong. Is the kernel text marked somehow? E.g. PageReserved?
In other words, does the diff below helps?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3badcedf96a7..00d042052501 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7368,6 +7368,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 
 		page = pfn_to_page(check);
 
+		if (PageReserved(page))
+			return true;
+
 		/*
 		 * Hugepages are not in LRU lists, but they're movable.
 		 * We need not scan over tail pages bacause we don't


> With commit 3aa2823fdf66 ("mm, memory_hotplug: remove timeout from
> __offline_memory") also applied, it appears to just get stuck forever,
> and I get lots of:
> 
>   [ 1232.112953] INFO: task kworker/3:0:4609 blocked for more than 120 seconds.
>   [ 1232.113067]       Not tainted 4.14.0-rc4-gcc6-next-20171009-g49827b9 #1
>   [ 1232.113183] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   [ 1232.113319] kworker/3:0     D11984  4609      2 0x00000800
>   [ 1232.113416] Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
>   [ 1232.113531] Call Trace:
>   [ 1232.113579] [c0000000fb2db7a0] [c0000000fb2db900] 0xc0000000fb2db900 (unreliable)
>   [ 1232.113717] [c0000000fb2db970] [c00000000001c964] __switch_to+0x304/0x6e0
>   [ 1232.113840] [c0000000fb2dba10] [c000000000a408c0] __schedule+0x2e0/0xa80
>   [ 1232.113978] [c0000000fb2dbae0] [c000000000a410a8] schedule+0x48/0xc0
>   [ 1232.114113] [c0000000fb2dbb10] [c000000000a44d88] rwsem_down_read_failed+0x128/0x1b0
>   [ 1232.114269] [c0000000fb2dbb70] [c0000000001696a8] __percpu_down_read+0x108/0x110
>   [ 1232.114426] [c0000000fb2dbba0] [c00000000032e498] get_online_mems+0x68/0x80
>   [ 1232.115487] [c0000000fb2dbbc0] [c0000000002c82ec] memcg_create_kmem_cache+0x4c/0x190
>   [ 1232.115651] [c0000000fb2dbc60] [c0000000003483b8] memcg_kmem_cache_create_func+0x38/0xf0
>   [ 1232.115809] [c0000000fb2dbc90] [c000000000121594] process_one_work+0x2b4/0x590
>   [ 1232.115964] [c0000000fb2dbd20] [c000000000121908] worker_thread+0x98/0x5d0
>   [ 1232.116095] [c0000000fb2dbdc0] [c00000000012a134] kthread+0x164/0x1b0
>   [ 1232.116229] [c0000000fb2dbe30] [c00000000000bae0] ret_from_kernel_thread+0x5c/0x7c

I do not see how this is related to the offline path.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-09-18  7:08   ` Michal Hocko
@ 2017-10-10 12:05     ` Michael Ellerman
  -1 siblings, 0 replies; 68+ messages in thread
From: Michael Ellerman @ 2017-10-10 12:05 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko,
	Vlastimil Babka

Michal Hocko <mhocko@kernel.org> writes:

> From: Michal Hocko <mhocko@suse.com>
>
> Memory offlining can fail just too eagerly under a heavy memory pressure.
>
> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> [ 5410.336811] page dumped because: isolation failed
> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
>
> Isolation has failed here because the page is not on LRU. Most probably
> because it was on the pcp LRU cache or it has been removed from the LRU
> already but it hasn't been freed yet. In both cases the page doesn't look
> non-migrable so retrying more makes sense.

This breaks offline for me.

Prior to this commit:
  /sys/devices/system/memory/memory0# time echo 0 > online
  -bash: echo: write error: Device or resource busy
  
  real	0m0.001s
  user	0m0.000s
  sys	0m0.001s

After:
  /sys/devices/system/memory/memory0# time echo 0 > online
  -bash: echo: write error: Device or resource busy
  
  real	2m0.009s
  user	0m0.000s
  sys	1m25.035s


There's no way that block can be removed, it contains the kernel text,
so it should instantly fail - which it used to.


With commit 3aa2823fdf66 ("mm, memory_hotplug: remove timeout from
__offline_memory") also applied, it appears to just get stuck forever,
and I get lots of:

  [ 1232.112953] INFO: task kworker/3:0:4609 blocked for more than 120 seconds.
  [ 1232.113067]       Not tainted 4.14.0-rc4-gcc6-next-20171009-g49827b9 #1
  [ 1232.113183] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1232.113319] kworker/3:0     D11984  4609      2 0x00000800
  [ 1232.113416] Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
  [ 1232.113531] Call Trace:
  [ 1232.113579] [c0000000fb2db7a0] [c0000000fb2db900] 0xc0000000fb2db900 (unreliable)
  [ 1232.113717] [c0000000fb2db970] [c00000000001c964] __switch_to+0x304/0x6e0
  [ 1232.113840] [c0000000fb2dba10] [c000000000a408c0] __schedule+0x2e0/0xa80
  [ 1232.113978] [c0000000fb2dbae0] [c000000000a410a8] schedule+0x48/0xc0
  [ 1232.114113] [c0000000fb2dbb10] [c000000000a44d88] rwsem_down_read_failed+0x128/0x1b0
  [ 1232.114269] [c0000000fb2dbb70] [c0000000001696a8] __percpu_down_read+0x108/0x110
  [ 1232.114426] [c0000000fb2dbba0] [c00000000032e498] get_online_mems+0x68/0x80
  [ 1232.115487] [c0000000fb2dbbc0] [c0000000002c82ec] memcg_create_kmem_cache+0x4c/0x190
  [ 1232.115651] [c0000000fb2dbc60] [c0000000003483b8] memcg_kmem_cache_create_func+0x38/0xf0
  [ 1232.115809] [c0000000fb2dbc90] [c000000000121594] process_one_work+0x2b4/0x590
  [ 1232.115964] [c0000000fb2dbd20] [c000000000121908] worker_thread+0x98/0x5d0
  [ 1232.116095] [c0000000fb2dbdc0] [c00000000012a134] kthread+0x164/0x1b0
  [ 1232.116229] [c0000000fb2dbe30] [c00000000000bae0] ret_from_kernel_thread+0x5c/0x7c


cheers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-10-10 12:05     ` Michael Ellerman
  0 siblings, 0 replies; 68+ messages in thread
From: Michael Ellerman @ 2017-10-10 12:05 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko,
	Vlastimil Babka

Michal Hocko <mhocko@kernel.org> writes:

> From: Michal Hocko <mhocko@suse.com>
>
> Memory offlining can fail just too eagerly under a heavy memory pressure.
>
> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> [ 5410.336811] page dumped because: isolation failed
> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
>
> Isolation has failed here because the page is not on LRU. Most probably
> because it was on the pcp LRU cache or it has been removed from the LRU
> already but it hasn't been freed yet. In both cases the page doesn't look
> non-migrable so retrying more makes sense.

This breaks offline for me.

Prior to this commit:
  /sys/devices/system/memory/memory0# time echo 0 > online
  -bash: echo: write error: Device or resource busy
  
  real	0m0.001s
  user	0m0.000s
  sys	0m0.001s

After:
  /sys/devices/system/memory/memory0# time echo 0 > online
  -bash: echo: write error: Device or resource busy
  
  real	2m0.009s
  user	0m0.000s
  sys	1m25.035s


There's no way that block can be removed, it contains the kernel text,
so it should instantly fail - which it used to.


With commit 3aa2823fdf66 ("mm, memory_hotplug: remove timeout from
__offline_memory") also applied, it appears to just get stuck forever,
and I get lots of:

  [ 1232.112953] INFO: task kworker/3:0:4609 blocked for more than 120 seconds.
  [ 1232.113067]       Not tainted 4.14.0-rc4-gcc6-next-20171009-g49827b9 #1
  [ 1232.113183] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 1232.113319] kworker/3:0     D11984  4609      2 0x00000800
  [ 1232.113416] Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
  [ 1232.113531] Call Trace:
  [ 1232.113579] [c0000000fb2db7a0] [c0000000fb2db900] 0xc0000000fb2db900 (unreliable)
  [ 1232.113717] [c0000000fb2db970] [c00000000001c964] __switch_to+0x304/0x6e0
  [ 1232.113840] [c0000000fb2dba10] [c000000000a408c0] __schedule+0x2e0/0xa80
  [ 1232.113978] [c0000000fb2dbae0] [c000000000a410a8] schedule+0x48/0xc0
  [ 1232.114113] [c0000000fb2dbb10] [c000000000a44d88] rwsem_down_read_failed+0x128/0x1b0
  [ 1232.114269] [c0000000fb2dbb70] [c0000000001696a8] __percpu_down_read+0x108/0x110
  [ 1232.114426] [c0000000fb2dbba0] [c00000000032e498] get_online_mems+0x68/0x80
  [ 1232.115487] [c0000000fb2dbbc0] [c0000000002c82ec] memcg_create_kmem_cache+0x4c/0x190
  [ 1232.115651] [c0000000fb2dbc60] [c0000000003483b8] memcg_kmem_cache_create_func+0x38/0xf0
  [ 1232.115809] [c0000000fb2dbc90] [c000000000121594] process_one_work+0x2b4/0x590
  [ 1232.115964] [c0000000fb2dbd20] [c000000000121908] worker_thread+0x98/0x5d0
  [ 1232.116095] [c0000000fb2dbdc0] [c00000000012a134] kthread+0x164/0x1b0
  [ 1232.116229] [c0000000fb2dbe30] [c00000000000bae0] ret_from_kernel_thread+0x5c/0x7c


cheers

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
  2017-09-18  7:08 [PATCH v2 0/2] mm, memory_hotplug: redefine memory offline retry logic Michal Hocko
@ 2017-09-18  7:08   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-18  7:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko,
	Vlastimil Babka

From: Michal Hocko <mhocko@suse.com>

Memory offlining can fail just too eagerly under a heavy memory pressure.

[ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
[ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
[ 5410.336811] page dumped because: isolation failed
[ 5410.336813] page->mem_cgroup:ffff8801cd662000
[ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed

Isolation has failed here because the page is not on LRU. Most probably
because it was on the pcp LRU cache or it has been removed from the LRU
already but it hasn't been freed yet. In both cases the page doesn't look
non-migrable so retrying more makes sense.

__offline_pages seems rather cluttered when it comes to the retry
logic. We have 5 retries at maximum and a timeout. We could argue
whether the timeout makes sense but failing just because of a race when
somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
wrong. It only takes it to race with a process which unmaps some pages
and remove them from the LRU list and we can fail the whole offline
because of something that is a temporary condition and actually not
harmful for the offline.

Please note that unmovable pages should be already excluded during
start_isolate_page_range. We could argue that has_unmovable_pages is
racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
pages in most cases and movable zone shouldn't contain unmovable pages
at all. Some of those pages might be pinned but not for ever because
that would be a bug on its own. In any case the context is still
interruptible and so the userspace can easily bail out when the
operation takes too long. This is certainly better behavior than a
hardcoded retry loop which is racy.

Fix this by removing the max retry count and only rely on the timeout
resp. interruption by a signal from the userspace. Also retry rather
than fail when check_pages_isolated sees some !free pages because those
could be a result of the race as well.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory_hotplug.c | 40 ++++++++++------------------------------
 1 file changed, 10 insertions(+), 30 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 459bbc182d10..c9dcbe6d2ac6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1597,7 +1597,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 {
 	unsigned long pfn, nr_pages, expire;
 	long offlined_pages;
-	int ret, drain, retry_max, node;
+	int ret, node;
 	unsigned long flags;
 	unsigned long valid_start, valid_end;
 	struct zone *zone;
@@ -1634,43 +1634,25 @@ static int __ref __offline_pages(unsigned long start_pfn,

 	pfn = start_pfn;
 	expire = jiffies + timeout;
-	drain = 0;
-	retry_max = 5;
 repeat:
 	/* start memory hot removal */
-	ret = -EAGAIN;
+	ret = -EBUSY;
 	if (time_after(jiffies, expire))
 		goto failed_removal;
 	ret = -EINTR;
 	if (signal_pending(current))
 		goto failed_removal;
-	ret = 0;
-	if (drain) {
-		lru_add_drain_all_cpuslocked();
-		cond_resched();
-		drain_all_pages(zone);
-	}
+
+	cond_resched();
+	lru_add_drain_all_cpuslocked();
+	drain_all_pages(zone);

 	pfn = scan_movable_pages(start_pfn, end_pfn);
 	if (pfn) { /* We have movable pages */
 		ret = do_migrate_range(pfn, end_pfn);
-		if (!ret) {
-			drain = 1;
-			goto repeat;
-		} else {
-			if (ret < 0)
-				if (--retry_max == 0)
-					goto failed_removal;
-			yield();
-			drain = 1;
-			goto repeat;
-		}
+		goto repeat;
 	}
-	/* drain all zone's lru pagevec, this is asynchronous... */
-	lru_add_drain_all_cpuslocked();
-	yield();
-	/* drain pcp pages, this is synchronous. */
-	drain_all_pages(zone);
+
 	/*
 	 * dissolve free hugepages in the memory block before doing offlining
 	 * actually in order to make hugetlbfs's object counting consistent.
@@ -1680,10 +1662,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		goto failed_removal;
 	/* check again */
 	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
-	if (offlined_pages < 0) {
-		ret = -EBUSY;
-		goto failed_removal;
-	}
+	if (offlined_pages < 0)
+		goto repeat;
 	pr_info("Offlined Pages %ld\n", offlined_pages);
 	/* Ok, all of our target is isolated.
 	   We cannot do rollback at this point. */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
@ 2017-09-18  7:08   ` Michal Hocko
  0 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2017-09-18  7:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Igor Mammedov, Vitaly Kuznetsov, linux-mm, LKML, Michal Hocko,
	Vlastimil Babka

From: Michal Hocko <mhocko@suse.com>

Memory offlining can fail just too eagerly under a heavy memory pressure.

[ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
[ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
[ 5410.336811] page dumped because: isolation failed
[ 5410.336813] page->mem_cgroup:ffff8801cd662000
[ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed

Isolation has failed here because the page is not on LRU. Most probably
because it was on the pcp LRU cache or it has been removed from the LRU
already but it hasn't been freed yet. In both cases the page doesn't look
non-migrable so retrying more makes sense.

__offline_pages seems rather cluttered when it comes to the retry
logic. We have 5 retries at maximum and a timeout. We could argue
whether the timeout makes sense but failing just because of a race when
somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
wrong. It only takes it to race with a process which unmaps some pages
and remove them from the LRU list and we can fail the whole offline
because of something that is a temporary condition and actually not
harmful for the offline.

Please note that unmovable pages should be already excluded during
start_isolate_page_range. We could argue that has_unmovable_pages is
racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
pages in most cases and movable zone shouldn't contain unmovable pages
at all. Some of those pages might be pinned but not for ever because
that would be a bug on its own. In any case the context is still
interruptible and so the userspace can easily bail out when the
operation takes too long. This is certainly better behavior than a
hardcoded retry loop which is racy.

Fix this by removing the max retry count and only rely on the timeout
resp. interruption by a signal from the userspace. Also retry rather
than fail when check_pages_isolated sees some !free pages because those
could be a result of the race as well.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory_hotplug.c | 40 ++++++++++------------------------------
 1 file changed, 10 insertions(+), 30 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 459bbc182d10..c9dcbe6d2ac6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1597,7 +1597,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 {
 	unsigned long pfn, nr_pages, expire;
 	long offlined_pages;
-	int ret, drain, retry_max, node;
+	int ret, node;
 	unsigned long flags;
 	unsigned long valid_start, valid_end;
 	struct zone *zone;
@@ -1634,43 +1634,25 @@ static int __ref __offline_pages(unsigned long start_pfn,

 	pfn = start_pfn;
 	expire = jiffies + timeout;
-	drain = 0;
-	retry_max = 5;
 repeat:
 	/* start memory hot removal */
-	ret = -EAGAIN;
+	ret = -EBUSY;
 	if (time_after(jiffies, expire))
 		goto failed_removal;
 	ret = -EINTR;
 	if (signal_pending(current))
 		goto failed_removal;
-	ret = 0;
-	if (drain) {
-		lru_add_drain_all_cpuslocked();
-		cond_resched();
-		drain_all_pages(zone);
-	}
+
+	cond_resched();
+	lru_add_drain_all_cpuslocked();
+	drain_all_pages(zone);

 	pfn = scan_movable_pages(start_pfn, end_pfn);
 	if (pfn) { /* We have movable pages */
 		ret = do_migrate_range(pfn, end_pfn);
-		if (!ret) {
-			drain = 1;
-			goto repeat;
-		} else {
-			if (ret < 0)
-				if (--retry_max == 0)
-					goto failed_removal;
-			yield();
-			drain = 1;
-			goto repeat;
-		}
+		goto repeat;
 	}
-	/* drain all zone's lru pagevec, this is asynchronous... */
-	lru_add_drain_all_cpuslocked();
-	yield();
-	/* drain pcp pages, this is synchronous. */
-	drain_all_pages(zone);
+
 	/*
 	 * dissolve free hugepages in the memory block before doing offlining
 	 * actually in order to make hugetlbfs's object counting consistent.
@@ -1680,10 +1662,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
 		goto failed_removal;
 	/* check again */
 	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
-	if (offlined_pages < 0) {
-		ret = -EBUSY;
-		goto failed_removal;
-	}
+	if (offlined_pages < 0)
+		goto repeat;
 	pr_info("Offlined Pages %ld\n", offlined_pages);
 	/* Ok, all of our target is isolated.
 	   We cannot do rollback at this point. */
-- 
2.14.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2017-10-13 11:58 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-04  8:21 [PATCH 0/2] mm, memory_hotplug: redefine memory offline retry logic Michal Hocko
2017-09-04  8:21 ` Michal Hocko
2017-09-04  8:21 ` [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early Michal Hocko
2017-09-04  8:21   ` Michal Hocko
2017-09-05  6:29   ` Anshuman Khandual
2017-09-05  6:29     ` Anshuman Khandual
2017-09-05  7:13     ` Michal Hocko
2017-09-05  7:13       ` Michal Hocko
2017-09-08 17:26   ` Vlastimil Babka
2017-09-08 17:26     ` Vlastimil Babka
2017-09-11  8:17     ` Michal Hocko
2017-09-11  8:17       ` Michal Hocko
2017-09-13 11:41       ` Vlastimil Babka
2017-09-13 11:41         ` Vlastimil Babka
2017-09-13 12:10         ` Michal Hocko
2017-09-13 12:10           ` Michal Hocko
2017-09-13 12:14           ` Michal Hocko
2017-09-13 12:14             ` Michal Hocko
2017-09-13 12:19             ` Vlastimil Babka
2017-09-13 12:19               ` Vlastimil Babka
2017-09-13 12:32               ` Michal Hocko
2017-09-13 12:32                 ` Michal Hocko
2017-09-04  8:21 ` [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory Michal Hocko
2017-09-04  8:21   ` Michal Hocko
2017-09-04  8:58   ` Xishi Qiu
2017-09-04  8:58     ` Xishi Qiu
2017-09-04  9:01     ` Michal Hocko
2017-09-04  9:01       ` Michal Hocko
2017-09-04  9:05       ` Xishi Qiu
2017-09-04  9:05         ` Xishi Qiu
2017-09-04  9:15         ` Michal Hocko
2017-09-04  9:15           ` Michal Hocko
2017-09-05  5:46           ` Anshuman Khandual
2017-09-05  5:46             ` Anshuman Khandual
2017-09-05  7:23             ` Michal Hocko
2017-09-05  7:23               ` Michal Hocko
2017-09-05  8:54               ` Anshuman Khandual
2017-09-05  8:54                 ` Anshuman Khandual
2017-09-08 17:27   ` Vlastimil Babka
2017-09-08 17:27     ` Vlastimil Babka
2017-09-18  7:08 [PATCH v2 0/2] mm, memory_hotplug: redefine memory offline retry logic Michal Hocko
2017-09-18  7:08 ` [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early Michal Hocko
2017-09-18  7:08   ` Michal Hocko
2017-10-10 12:05   ` Michael Ellerman
2017-10-10 12:05     ` Michael Ellerman
2017-10-10 12:27     ` Michal Hocko
2017-10-10 12:27       ` Michal Hocko
2017-10-11  2:37       ` Michael Ellerman
2017-10-11  2:37         ` Michael Ellerman
2017-10-11  5:19         ` Michael Ellerman
2017-10-11  5:19           ` Michael Ellerman
2017-10-11 14:05           ` Anshuman Khandual
2017-10-11 14:05             ` Anshuman Khandual
2017-10-11 14:16             ` Michal Hocko
2017-10-11 14:16               ` Michal Hocko
2017-10-11  6:51         ` Michal Hocko
2017-10-11  6:51           ` Michal Hocko
2017-10-11  8:04           ` Vlastimil Babka
2017-10-11  8:04             ` Vlastimil Babka
2017-10-11  8:13             ` Michal Hocko
2017-10-11  8:13               ` Michal Hocko
2017-10-11 11:17               ` Vlastimil Babka
2017-10-11 11:17                 ` Vlastimil Babka
2017-10-11 11:24                 ` Michal Hocko
2017-10-11 11:24                   ` Michal Hocko
2017-10-13 11:42             ` Michael Ellerman
2017-10-13 11:42               ` Michael Ellerman
2017-10-13 11:58               ` Michal Hocko
2017-10-13 11:58                 ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.