Re: [PATCH v2] mm/vmscan: get number of pages on the LRU list in memcgroup base on lru_zone_size

From: Michal Hocko <mhocko@kernel.org>
To: Honglei Wang <honglei.wang@oracle.com>
Cc: linux-mm@kvack.org, vdavydov.dev@gmail.com, hannes@cmpxchg.org
Subject: Re: [PATCH v2] mm/vmscan: get number of pages on the LRU list in memcgroup base on lru_zone_size
Date: Mon, 7 Oct 2019 16:28:05 +0200	[thread overview]
Message-ID: <20191007142805.GM2381@dhcp22.suse.cz> (raw)
In-Reply-To: <20190905071034.16822-1-honglei.wang@oracle.com>

On Thu 05-09-19 15:10:34, Honglei Wang wrote:
> lruvec_lru_size() is involving lruvec_page_state_local() to get the
> lru_size in the current code. It's base on lruvec_stat_local.count[]
> of mem_cgroup_per_node. This counter is updated in batch. It won't
> do charge if the number of coming pages doesn't meet the needs of
> MEMCG_CHARGE_BATCH who's defined as 32 now.
> 
> The testcase in LTP madvise09[1] fails due to small block memory is
> not charged. It creates a new memcgroup and sets up 32 MADV_FREE
> pages. Then it forks child who will introduce memory pressure in the
> memcgroup. The MADV_FREE pages are expected to be released under the
> pressure, but 32 is not more than MEMCG_CHARGE_BATCH and these pages
> won't be charged in lruvec_stat_local.count[] until some more pages
> come in to satisfy the needs of batch charging. So these MADV_FREE
> pages can't be freed in memory pressure which is a bit conflicted
> with the definition of MADV_FREE.

The test case is simly wrong. The caching and the batch size is an
internal implementation detail. Moreover MADV_FREE is a _hint_ so all
you can say is that those pages will get freed at some point in time but
you cannot make any assumptions about when that moment happens.

> Getting lru_size base on lru_zone_size of mem_cgroup_per_node which
> is not updated in batch can make it a bit more accurate in similar
> scenario.

What does that mean? It would be more helpful to describe the code path
which will use this more precise value and what is the effect of that.

As I've said in the previous version, I do not object to the patch
because a more precise lruvec_lru_size sounds like a nice thing as long
as we are not paying a high price for that. Just look at the global case
for mem_cgroup_disabled(). It uses node_page_state and that one is using
per-cpu accounting with regular global value refreshing IIRC.

> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c
> 
> Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
> ---
>  mm/vmscan.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c77d1e3761a7..c28672460868 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -354,12 +354,13 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
>   */
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
>  {
> -	unsigned long lru_size;
> +	unsigned long lru_size = 0;
>  	int zid;
>  
> -	if (!mem_cgroup_disabled())
> -		lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
> -	else
> +	if (!mem_cgroup_disabled()) {
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +			lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> +	} else
>  		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>  
>  	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> -- 
> 2.17.0

-- 
Michal Hocko
SUSE Labs