All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vlastimil Babka <vbabka@suse.cz>
To: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Reza Arbab <arbab@linux.vnet.ibm.com>,
	Yasuaki Ishimatsu <yasu.isimatu@gmail.com>,
	qiuxishi@huawei.com, Igor Mammedov <imammedo@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
Date: Wed, 13 Sep 2017 14:19:19 +0200	[thread overview]
Message-ID: <25ffda93-0c0d-28b4-bd0b-7fc9df7d678a@suse.cz> (raw)
In-Reply-To: <20170913121433.yjzloaf6g447zeq2@dhcp22.suse.cz>

On 09/13/2017 02:14 PM, Michal Hocko wrote:
>>>> Do you think that the changelog should be more clear about this?
>>>
>>> It certainly wouldn't hurt :)
>>
>> So what do you think about the following wording:
> 
> Ups, wrong patch
> 
> 
> From 8639496a834b4a7c24972ec23b17e50f0d6a304c Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 14 Aug 2017 10:46:12 +0200
> Subject: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
> 
> Memory offlining can fail just too eagerly under a heavy memory pressure.
> 
> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> [ 5410.336811] page dumped because: isolation failed
> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> 
> Isolation has failed here because the page is not on LRU. Most probably
> because it was on the pcp LRU cache or it has been removed from the LRU
> already but it hasn't been freed yet. In both cases the page doesn't look
> non-migrable so retrying more makes sense.
> 
> __offline_pages seems rather cluttered when it comes to the retry
> logic. We have 5 retries at maximum and a timeout. We could argue
> whether the timeout makes sense but failing just because of a race when
> somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
> wrong. It only takes it to race with a process which unmaps some pages
> and remove them from the LRU list and we can fail the whole offline
> because of something that is a temporary condition and actually not
> harmful for the offline.
> 
> Please note that unmovable pages should be already excluded during
> start_isolate_page_range. We could argue that has_unmovable_pages is
> racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
> but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
> pages in most cases and movable zone shouldn't contain unmovable pages
> at all. Some of those pages might be pinned but not for ever because
> that would be a bug on its own. In any case the context is still
> interruptible and so the userspace can easily bail out when the
> operation takes too long. This is certainly better behavior than a
> hardcoded retry loop which is racy.
> 
> Fix this by removing the max retry count and only rely on the timeout
> resp. interruption by a signal from the userspace. Also retry rather
> than fail when check_pages_isolated sees some !free pages because those
> could be a result of the race as well.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Yeah, that's better, thanks.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

WARNING: multiple messages have this Message-ID (diff)
From: Vlastimil Babka <vbabka@suse.cz>
To: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Reza Arbab <arbab@linux.vnet.ibm.com>,
	Yasuaki Ishimatsu <yasu.isimatu@gmail.com>,
	qiuxishi@huawei.com, Igor Mammedov <imammedo@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
Date: Wed, 13 Sep 2017 14:19:19 +0200	[thread overview]
Message-ID: <25ffda93-0c0d-28b4-bd0b-7fc9df7d678a@suse.cz> (raw)
In-Reply-To: <20170913121433.yjzloaf6g447zeq2@dhcp22.suse.cz>

On 09/13/2017 02:14 PM, Michal Hocko wrote:
>>>> Do you think that the changelog should be more clear about this?
>>>
>>> It certainly wouldn't hurt :)
>>
>> So what do you think about the following wording:
> 
> Ups, wrong patch
> 
> 
> From 8639496a834b4a7c24972ec23b17e50f0d6a304c Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 14 Aug 2017 10:46:12 +0200
> Subject: [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early
> 
> Memory offlining can fail just too eagerly under a heavy memory pressure.
> 
> [ 5410.336792] page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
> [ 5410.336809] flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
> [ 5410.336811] page dumped because: isolation failed
> [ 5410.336813] page->mem_cgroup:ffff8801cd662000
> [ 5420.655030] memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed
> 
> Isolation has failed here because the page is not on LRU. Most probably
> because it was on the pcp LRU cache or it has been removed from the LRU
> already but it hasn't been freed yet. In both cases the page doesn't look
> non-migrable so retrying more makes sense.
> 
> __offline_pages seems rather cluttered when it comes to the retry
> logic. We have 5 retries at maximum and a timeout. We could argue
> whether the timeout makes sense but failing just because of a race when
> somebody isoltes a page from LRU or puts it on a pcp LRU lists is just
> wrong. It only takes it to race with a process which unmaps some pages
> and remove them from the LRU list and we can fail the whole offline
> because of something that is a temporary condition and actually not
> harmful for the offline.
> 
> Please note that unmovable pages should be already excluded during
> start_isolate_page_range. We could argue that has_unmovable_pages is
> racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
> but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
> pages in most cases and movable zone shouldn't contain unmovable pages
> at all. Some of those pages might be pinned but not for ever because
> that would be a bug on its own. In any case the context is still
> interruptible and so the userspace can easily bail out when the
> operation takes too long. This is certainly better behavior than a
> hardcoded retry loop which is racy.
> 
> Fix this by removing the max retry count and only rely on the timeout
> resp. interruption by a signal from the userspace. Also retry rather
> than fail when check_pages_isolated sees some !free pages because those
> could be a result of the race as well.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Yeah, that's better, thanks.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2017-09-13 12:19 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-09-04  8:21 [PATCH 0/2] mm, memory_hotplug: redefine memory offline retry logic Michal Hocko
2017-09-04  8:21 ` Michal Hocko
2017-09-04  8:21 ` [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early Michal Hocko
2017-09-04  8:21   ` Michal Hocko
2017-09-05  6:29   ` Anshuman Khandual
2017-09-05  6:29     ` Anshuman Khandual
2017-09-05  7:13     ` Michal Hocko
2017-09-05  7:13       ` Michal Hocko
2017-09-08 17:26   ` Vlastimil Babka
2017-09-08 17:26     ` Vlastimil Babka
2017-09-11  8:17     ` Michal Hocko
2017-09-11  8:17       ` Michal Hocko
2017-09-13 11:41       ` Vlastimil Babka
2017-09-13 11:41         ` Vlastimil Babka
2017-09-13 12:10         ` Michal Hocko
2017-09-13 12:10           ` Michal Hocko
2017-09-13 12:14           ` Michal Hocko
2017-09-13 12:14             ` Michal Hocko
2017-09-13 12:19             ` Vlastimil Babka [this message]
2017-09-13 12:19               ` Vlastimil Babka
2017-09-13 12:32               ` Michal Hocko
2017-09-13 12:32                 ` Michal Hocko
2017-09-04  8:21 ` [PATCH 2/2] mm, memory_hotplug: remove timeout from __offline_memory Michal Hocko
2017-09-04  8:21   ` Michal Hocko
2017-09-04  8:58   ` Xishi Qiu
2017-09-04  8:58     ` Xishi Qiu
2017-09-04  9:01     ` Michal Hocko
2017-09-04  9:01       ` Michal Hocko
2017-09-04  9:05       ` Xishi Qiu
2017-09-04  9:05         ` Xishi Qiu
2017-09-04  9:15         ` Michal Hocko
2017-09-04  9:15           ` Michal Hocko
2017-09-05  5:46           ` Anshuman Khandual
2017-09-05  5:46             ` Anshuman Khandual
2017-09-05  7:23             ` Michal Hocko
2017-09-05  7:23               ` Michal Hocko
2017-09-05  8:54               ` Anshuman Khandual
2017-09-05  8:54                 ` Anshuman Khandual
2017-09-08 17:27   ` Vlastimil Babka
2017-09-08 17:27     ` Vlastimil Babka
2017-09-18  7:08 [PATCH v2 0/2] mm, memory_hotplug: redefine memory offline retry logic Michal Hocko
2017-09-18  7:08 ` [PATCH 1/2] mm, memory_hotplug: do not fail offlining too early Michal Hocko
2017-09-18  7:08   ` Michal Hocko
2017-10-10 12:05   ` Michael Ellerman
2017-10-10 12:05     ` Michael Ellerman
2017-10-10 12:27     ` Michal Hocko
2017-10-10 12:27       ` Michal Hocko
2017-10-11  2:37       ` Michael Ellerman
2017-10-11  2:37         ` Michael Ellerman
2017-10-11  5:19         ` Michael Ellerman
2017-10-11  5:19           ` Michael Ellerman
2017-10-11 14:05           ` Anshuman Khandual
2017-10-11 14:05             ` Anshuman Khandual
2017-10-11 14:16             ` Michal Hocko
2017-10-11 14:16               ` Michal Hocko
2017-10-11  6:51         ` Michal Hocko
2017-10-11  6:51           ` Michal Hocko
2017-10-11  8:04           ` Vlastimil Babka
2017-10-11  8:04             ` Vlastimil Babka
2017-10-11  8:13             ` Michal Hocko
2017-10-11  8:13               ` Michal Hocko
2017-10-11 11:17               ` Vlastimil Babka
2017-10-11 11:17                 ` Vlastimil Babka
2017-10-11 11:24                 ` Michal Hocko
2017-10-11 11:24                   ` Michal Hocko
2017-10-13 11:42             ` Michael Ellerman
2017-10-13 11:42               ` Michael Ellerman
2017-10-13 11:58               ` Michal Hocko
2017-10-13 11:58                 ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=25ffda93-0c0d-28b4-bd0b-7fc9df7d678a@suse.cz \
    --to=vbabka@suse.cz \
    --cc=akpm@linux-foundation.org \
    --cc=arbab@linux.vnet.ibm.com \
    --cc=imammedo@redhat.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=qiuxishi@huawei.com \
    --cc=vkuznets@redhat.com \
    --cc=yasu.isimatu@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.