From: Chris Down <chris@chrisdown.name>
To: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] mm, memcg: reclaim more aggressively before high allocator throttling
Date: Wed, 20 May 2020 21:26:50 +0100	[thread overview]
Message-ID: <20200520202650.GB558281@chrisdown.name> (raw)
In-Reply-To: <20200520160756.GE6462@dhcp22.suse.cz>

Michal Hocko writes:
>Let me try to understand the actual problem. The high memory reclaim has
>a target which is proportional to the amount of charged memory. For most
>requests that would be SWAP_CLUSTER_MAX, though (or N times that, where
>N is the number of memcgs in excess up the hierarchy). I can see that being
>insufficient if the memcg is already in a large excess, but if the
>reclaim can make forward progress, this should just work fine because
>each charging context should reclaim at least the contributed amount.
>
>Do you have any insight on why this doesn't work in your situation?
>Especially with such a large inactive file list I would be really
>surprised if the reclaim was not able to make a forward progress.

Reclaim can fail for any number of reasons, which is why we have retries 
sprinkled all over for it already. It doesn't seem hard to believe that it 
might just fail for transient reasons and drive us deeper into the hole as a 
result.

In this case, (a) the application is producing tons of dirty pages, and (b) we 
have really heavy systemwide I/O contention on the affected machines. This high 
load is one of the reasons that direct and kswapd reclaim cannot keep up, and 
thus nr_pages can become several orders of magnitude larger than 
SWAP_CLUSTER_MAX. This is trivially reproducible on these machines; it's not an 
edge case.
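
(For anyone following along without the source handy: the single attempt under 
discussion is roughly the below, a simplified sketch of the v5.7-era 
mem_cgroup_handle_over_high() from memory rather than a verbatim quote:)

	/*
	 * Sketch: on return to userspace, the task makes one reclaim pass
	 * for however many pages it charged over memory.high, then falls
	 * through to compute its throttling penalty. nr_pages is the
	 * accumulated overage, which under the load described above can
	 * dwarf SWAP_CLUSTER_MAX, and there is no retry if the pass comes
	 * up short.
	 */
	nr_pages = current->memcg_nr_pages_over_high;
	current->memcg_nr_pages_over_high = 0;
	reclaim_high(memcg, nr_pages, GFP_KERNEL);	/* single attempt */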

Putting a trace_printk("%d\n", __LINE__) at each unsuccessful reclaim site in 
shrink_page_list shows that what's happening is always (and I really mean 
always) the "dirty page and you're not kswapd" check, as expected:

	if (PageDirty(page)) {
		/*
		 * Only kswapd can writeback filesystem pages
		 * to avoid risk of stack overflow. But avoid
		 * injecting inefficient single-page IO into
		 * flusher writeback as much as possible: only
		 * write pages when we've encountered many
		 * dirty pages, and when we've already scanned
		 * the rest of the LRU for clean pages and see
		 * the same dirty pages again (PageReclaim).
		 */
		if (page_is_file_lru(page) &&
			(!current_is_kswapd() || !PageReclaim(page) ||
			!test_bit(PGDAT_DIRTY, &pgdat->flags))) {
			/*
			 * Immediately reclaim when written back.
		 * Similar in principle to deactivate_page()
			 * except we already have the page isolated
			 * and know it's dirty
			 */
			inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
			SetPageReclaim(page);

			goto activate_locked;
		}
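
(The debugging hack itself is nothing more exotic than this, placed at the top 
of that branch; reconstructed from memory rather than copied from a real patch:)

	/* Just inside the if above, before the goto activate_locked: */
	trace_printk("%d\n", __LINE__);	/* output shows up in /sys/kernel/debug/tracing/trace */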

>Now to your patch. I do not like it much, to be honest.
>MEM_CGROUP_RECLAIM_RETRIES is quite arbitrary, and I don't like it in
>memory_high_write either, because that is an interruptible context, so there
>shouldn't be a good reason to give up after $FOO failed
>attempts. try_charge and memory_max_write are slightly different because
>we are invoking the OOM killer based on the number of failed attempts.

As Johannes mentioned, the very intent of memory.high is to have it managed 
using a userspace OOM killer, which monitors PSI. As such, I'm not sure this 
distinction means much.
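
(For the unfamiliar: such a userspace killer can be as simple as polling the 
cgroup's memory.pressure file. A minimal sketch; the cgroup path and the 
threshold are made-up examples:)

	#include <stdio.h>

	/*
	 * Toy PSI monitor: read memory.pressure and flag when the 10-second
	 * "some" average (% of walltime in which at least one task was
	 * stalled on memory) crosses a threshold.
	 */
	int main(void)
	{
		double avg10;
		FILE *f = fopen("/sys/fs/cgroup/workload/memory.pressure", "r");

		if (!f)
			return 1;
		if (fscanf(f, "some avg10=%lf", &avg10) != 1) {
			fclose(f);
			return 1;
		}
		fclose(f);

		if (avg10 > 40.0)	/* arbitrary example threshold */
			printf("sustained memory pressure; kill the workload\n");
		return 0;
	}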
