From: Shakeel Butt <shakeelb@google.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	Vlastimil Babka <vbabka@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jia He <hejianet@gmail.com>,
	Hillf Danton <hillf.zj@alibaba-inc.com>,
	Linux MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm: fix condition for throttle_direct_reclaim
Date: Mon, 13 Mar 2017 09:50:22 -0700	[thread overview]
Message-ID: <CALvZod4uKErf06PaTc136JD-Yznda1fqj8UoSL7vWKzPHEYXPA@mail.gmail.com> (raw)
In-Reply-To: <20170313154627.GU31518@dhcp22.suse.cz>

On Mon, Mar 13, 2017 at 8:46 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 13-03-17 08:07:15, Shakeel Butt wrote:
>> On Mon, Mar 13, 2017 at 2:02 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> > On Fri 10-03-17 11:46:20, Shakeel Butt wrote:
>> >> Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
>> >> unsuccessful iterations. Before going to sleep, the kswapd thread
>> >> will unconditionally wake up all threads sleeping on pfmemalloc_wait.
>> >> However, the awoken threads will recheck the watermarks, wake the
>> >> kswapd thread, and sleep again on pfmemalloc_wait. There is a chance
>> >> of continuous back and forth between kswapd and the direct reclaiming
>> >> threads if kswapd keeps failing, which defeats the purpose of adding
>> >> a backoff mechanism to kswapd. So, add a kswapd_failures check to
>> >> the throttle_direct_reclaim condition.
>> >
>> > I have to say I really do not like this. kswapd_failures shouldn't
>> > really be checked outside of the kswapd context. The
>> > pfmemalloc_watermark_ok/throttle_direct_reclaim is quite complex even
>> > without putting another variable into it. I would rather we replaced
>> > this throttling with something else. Johannes had an idea to throttle
>> > by the number of reclaimers.
>> >
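For what it's worth, here is a minimal userspace sketch of that idea as I understand it: cap the number of concurrent direct reclaimers per node instead of the pfmemalloc_wait handshake. All names and the limit are invented for illustration; this is a model, not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-node cap on concurrent direct reclaimers. */
#define MAX_DIRECT_RECLAIMERS 4

struct node_state {
	atomic_int nr_direct_reclaimers;
};

/*
 * Try to enter direct reclaim on this node. Returns false when the node
 * already has its full quota of reclaimers, in which case the caller
 * would sleep or retry instead of piling on.
 */
static bool direct_reclaim_try_enter(struct node_state *node)
{
	int cur = atomic_load(&node->nr_direct_reclaimers);

	while (cur < MAX_DIRECT_RECLAIMERS) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&node->nr_direct_reclaimers,
						 &cur, cur + 1))
			return true;
	}
	return false;
}

static void direct_reclaim_exit(struct node_state *node)
{
	atomic_fetch_sub(&node->nr_direct_reclaimers, 1);
}
```

Unlike the pfmemalloc_wait scheme, this never couples direct reclaimers to kswapd's progress, so a stuck kswapd cannot livelock them.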
>>
>> Do you suspect a race in accessing kswapd_failures in a non-kswapd
>> context?
>
> No, this is not about race conditions. It is more about the logic of the
> code. kswapd_failures is private to the kswapd daemon. Direct
> reclaimers shouldn't have any business with it - well, except resetting it.
>
>> Please do let me know more about replacing this throttling.
>
> The idea behind a different throttling would be to not allow too many
> direct reclaimers on the same set of nodes/zones. Johannes would tell
> you more.
>
>> > Anyway, I am wondering whether we can hit this issue in
>> > practice? Have you seen it happening or is this a result of the code
>> > review? I would assume that the !zone_reclaimable_pages check in
>> > pfmemalloc_watermark_ok should help to some degree.
>> >
>> Yes, I have seen this issue going on for more than an hour in my
>> test. It was a simple test where a number of processes, in the
>> presence of swap, try to allocate more memory than there is RAM.
>
> this is anonymous memory, right?
>
Yes.

>> The number of
>> processes is equal to the number of cores, and each is pinned to an
>> individual core. I suspect that the !zone_reclaimable_pages() check
>> did not help.
>
> Hmm, interesting! I would expect the OOM killer to trigger, but I guess
> I see what is going on. kswapd couldn't reclaim a single page and ran
> out of its kswapd_failures attempts, while no direct reclaimer could
> reclaim a single page either, until we reached the throttling point,
> where we are basically livelocked because neither kswapd nor _any_
> direct reclaimer can make forward progress. Although this sounds quite
> unlikely, I think it is quite possible. So we cannot really
> throttle _all_ direct reclaimers when kswapd is out of the game, which I
> hadn't fully realized when reviewing "mm: fix 100% CPU kswapd busyloop
> on unreclaimable nodes".
>
> The simplest thing to do would be something like you have proposed:
> do not throttle if kswapd is out of the game.
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bae698484e8e..d34b1afc781a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2791,6 +2791,9 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>         int i;
>         bool wmark_ok;
>
> +       if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +               return true;
> +
>         for (i = 0; i <= ZONE_NORMAL; i++) {
>                 zone = &pgdat->node_zones[i];
>                 if (!managed_zone(zone))
>
> I do not like this, as I've already said, but it would allow merging
> "mm: fix 100% CPU kswapd busyloop on unreclaimable nodes" without too
> many additional changes.
>
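To make the effect of that hunk concrete, here is a trivial userspace model of the patched check (the _model names are invented; MAX_RECLAIM_RETRIES is 16 in the kernel at the time of this thread):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16

struct pgdat_model {
	int kswapd_failures;
	bool watermarks_ok;
};

/*
 * With the hunk applied: once kswapd has given up, report the watermark
 * as OK so direct reclaimers are never parked on pfmemalloc_wait and can
 * proceed (and eventually reach the OOM killer) instead of ping-ponging
 * with a kswapd that cannot make progress.
 */
static bool pfmemalloc_watermark_ok_model(const struct pgdat_model *pgdat)
{
	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
		return true;
	return pgdat->watermarks_ok;
}
```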
> Another option would be to cap the waiting time, the same as we do for
> GFP_NOFS. Not ideal either, because I suspect we would just get herds
> of direct reclaimers that way.
>
> The best option would be to rethink the throttling and move it out of
> the direct reclaim path somehow.
>
Agreed.

> Thanks and sorry for not spotting the potential lockup previously.
> --
> Michal Hocko
> SUSE Labs


Thread overview: 7 messages
2017-03-10 19:46 [PATCH] mm: fix condition for throttle_direct_reclaim Shakeel Butt
2017-03-13  9:02 ` Michal Hocko
2017-03-13 15:07   ` Shakeel Butt
2017-03-13 15:46     ` Michal Hocko
2017-03-13 16:50       ` Shakeel Butt [this message]
2017-03-13 19:58 ` Johannes Weiner
2017-03-13 21:48   ` Shakeel Butt
