Re: [PATCH] shmem, memcg: enable memcg aware shrinker

From: Greg Thelen <gthelen@google.com>
To: Kirill Tkhai <ktkhai@virtuozzo.com>,
	Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A. Shutemov" <kirill@shutemov.name>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH] shmem, memcg: enable memcg aware shrinker
Date: Mon, 01 Jun 2020 14:48:43 -0700	[thread overview]
Message-ID: <xr93d06i4fus.fsf@gthelen.svl.corp.google.com> (raw)
In-Reply-To: <ffdff0be-f2b6-c7c0-debc-9c5e8a33ae4e@virtuozzo.com>

Kirill Tkhai <ktkhai@virtuozzo.com> wrote:

> Hi, Greg,
>
> good finding. See comments below.
>
> On 01.06.2020 06:22, Greg Thelen wrote:
>> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
>> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
>> called when the per-memcg per-node shrinker_map indicates that the
>> shrinker may have objects to release to the memcg and node.
>> 
>> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
>> shrinker which advertises per memcg and numa awareness.  The shmem
>> shrinker releases memory by splitting hugepages that extend beyond
>> i_size.
>> 
>> Shmem does not currently set bits in shrinker_map.  So, starting with
>> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
>> pressure.  This leads to undeserved memcg OOM kills.
>> Example that reliably sees memcg OOM kill in unpatched kernel:
>>   FS=/tmp/fs
>>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>>   mkdir -p $FS
>>   mount -t tmpfs -o huge=always nodev $FS
>>   # Create 1000 MB container, which shouldn't suffer OOM.
>>   mkdir $CONTAINER
>>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>>   echo $BASHPID >> $CONTAINER/cgroup.procs
>>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>>   # fit within container's 1000 MB.  But if data pages use 2MB
>>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>>   # which hits memcg 1000 MB limit.
>>   for i in {1..4000}; do
>>     echo . > $FS/$i
>>   done
>> 
>> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
>> aware") maintains the per-node per-memcg shrinker bitmap for THP
>> shrinker.  But there's no such logic in shmem.  Make shmem set the
>> per-memcg per-node shrinker bits when it modifies inodes to have
>> shrinkable pages.
>> 
>> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()")
>> Cc: <stable@vger.kernel.org> # 4.19+
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> ---
>>  mm/shmem.c | 61 +++++++++++++++++++++++++++++++-----------------------
>>  1 file changed, 35 insertions(+), 26 deletions(-)
>> 
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index bd8840082c94..e11090f78cb5 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
>>  	return 0;
>>  }
>>  
>> +/*
>> + * Expose inode and optional page to shrinker as having a possibly splittable
>> + * hugepage that reaches beyond i_size.
>> + */
>> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
>> +			       struct inode *inode, struct page *page)
>> +{
>> +	struct shmem_inode_info *info = SHMEM_I(inode);
>> +
>> +	spin_lock(&sbinfo->shrinklist_lock);
>> +	/*
>> +	 * _careful to defend against unlocked access to ->shrink_list in
>> +	 * shmem_unused_huge_shrink()
>> +	 */
>> +	if (list_empty_careful(&info->shrinklist)) {
>> +		list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
>> +		sbinfo->shrinklist_len++;
>> +	}
>> +	spin_unlock(&sbinfo->shrinklist_lock);
>> +
>> +#ifdef CONFIG_MEMCG
>> +	if (page && PageTransHuge(page))
>> +		memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
>> +				       inode->i_sb->s_shrink.id);
>> +#endif
>> +}
>> +
>>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>  {
>>  	struct inode *inode = d_inode(dentry);
>> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>  			 * to shrink under memory pressure.
>>  			 */
>>  			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> -				spin_lock(&sbinfo->shrinklist_lock);
>> -				/*
>> -				 * _careful to defend against unlocked access to
>> -				 * ->shrink_list in shmem_unused_huge_shrink()
>> -				 */
>> -				if (list_empty_careful(&info->shrinklist)) {
>> -					list_add_tail(&info->shrinklist,
>> -							&sbinfo->shrinklist);
>> -					sbinfo->shrinklist_len++;
>> -				}
>> -				spin_unlock(&sbinfo->shrinklist_lock);
>> +				struct page *page;
>> +
>> +				page = find_get_page(inode->i_mapping,
>> +					(newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
>> +				shmem_shrinker_add(sbinfo, inode, page);
>> +				if (page)
>> +					put_page(page);
>
> 1)I'd move PageTransHuge() check from shmem_shrinker_add() to here. In case of page is not trans huge,
>   it looks strange and completely useless to add inode to shrinklist and then to avoid memcg_set_shrinker_bit().
>   Nothing should be added to the shrinklist in this case.

Ack,  I'll take a look at this.

> 2)In general I think this "last inode page spliter" does not fit SHINKER_MEMCG_AWARE conception, and
>   shmem_unused_huge_shrink() should be reworked as a new separate !memcg-aware shrinker instead of
>   .nr_cached_objects callback of generic fs shrinker.
>
> CC: Kirill Shutemov
>
> Kirill, are there any fundamental reasons to keep this shrinking logic in the generic fs shrinker?
> Are there any no-go to for conversion this as a separate !memcg-aware shrinker?

Making the shmem shrinker !memcg-aware seems like it would require tail
pages beyond i_size not be charged to memcg.  Otherwise memcg pressure
(which only calls memcg aware shrinkers) won't uncharge them.  Currently
the entire compound page is charged.

>>  			}
>>  		}
>>  	}
>> @@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>>  	if (PageTransHuge(page) &&
>>  	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
>>  			hindex + HPAGE_PMD_NR - 1) {
>> -		/*
>> -		 * Part of the huge page is beyond i_size: subject
>> -		 * to shrink under memory pressure.
>> -		 */
>> -		spin_lock(&sbinfo->shrinklist_lock);
>> -		/*
>> -		 * _careful to defend against unlocked access to
>> -		 * ->shrink_list in shmem_unused_huge_shrink()
>> -		 */
>> -		if (list_empty_careful(&info->shrinklist)) {
>> -			list_add_tail(&info->shrinklist,
>> -				      &sbinfo->shrinklist);
>> -			sbinfo->shrinklist_len++;
>> -		}
>> -		spin_unlock(&sbinfo->shrinklist_lock);
>> +		shmem_shrinker_add(sbinfo, inode, page);
>>  	}
>>  
>>  	/*
>
> Thanks,
> Kirill