linux-kernel.vger.kernel.org archive mirror
* [PATCH] shmem, memcg: enable memcg aware shrinker
@ 2020-06-01  3:22 Greg Thelen
  2020-06-01 11:05 ` Kirill Tkhai
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Greg Thelen @ 2020-06-01  3:22 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton, Kirill Tkhai
  Cc: linux-mm, linux-kernel, Greg Thelen, stable

Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
called when the per-memcg per-node shrinker_map indicates that the
shrinker may have objects to release to the memcg and node.

shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
shrinker which advertises per memcg and numa awareness.  The shmem
shrinker releases memory by splitting hugepages that extend beyond
i_size.

Shmem does not currently set bits in shrinker_map.  So, starting with
b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
pressure.  This leads to undeserved memcg OOM kills.
Example that reliably sees memcg OOM kill in unpatched kernel:
  FS=/tmp/fs
  CONTAINER=/cgroup/memory/tmpfs_shrinker
  mkdir -p $FS
  mount -t tmpfs -o huge=always nodev $FS
  # Create 1000 MB container, which shouldn't suffer OOM.
  mkdir $CONTAINER
  echo 1000M > $CONTAINER/memory.limit_in_bytes
  echo $BASHPID >> $CONTAINER/cgroup.procs
  # Create 4000 files.  Ideally each file uses 4k data page + a little
  # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
  # fit within container's 1000 MB.  But if data pages use 2MB
  # hugepages (due to aggressive huge=always) then files consume 8GB,
  # which hits memcg 1000 MB limit.
  for i in {1..4000}; do
    echo . > $FS/$i
  done

v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
aware") maintains the per-node per-memcg shrinker bitmap for THP
shrinker.  But there's no such logic in shmem.  Make shmem set the
per-memcg per-node shrinker bits when it modifies inodes to have
shrinkable pages.

Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()")
Cc: <stable@vger.kernel.org> # 4.19+
Signed-off-by: Greg Thelen <gthelen@google.com>
---
 mm/shmem.c | 61 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 35 insertions(+), 26 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index bd8840082c94..e11090f78cb5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
 	return 0;
 }
 
+/*
+ * Expose inode and optional page to shrinker as having a possibly splittable
+ * hugepage that reaches beyond i_size.
+ */
+static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
+			       struct inode *inode, struct page *page)
+{
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	spin_lock(&sbinfo->shrinklist_lock);
+	/*
+	 * _careful to defend against unlocked access to ->shrink_list in
+	 * shmem_unused_huge_shrink()
+	 */
+	if (list_empty_careful(&info->shrinklist)) {
+		list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
+		sbinfo->shrinklist_len++;
+	}
+	spin_unlock(&sbinfo->shrinklist_lock);
+
+#ifdef CONFIG_MEMCG
+	if (page && PageTransHuge(page))
+		memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
+				       inode->i_sb->s_shrink.id);
+#endif
+}
+
 static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = d_inode(dentry);
@@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 			 * to shrink under memory pressure.
 			 */
 			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-				spin_lock(&sbinfo->shrinklist_lock);
-				/*
-				 * _careful to defend against unlocked access to
-				 * ->shrink_list in shmem_unused_huge_shrink()
-				 */
-				if (list_empty_careful(&info->shrinklist)) {
-					list_add_tail(&info->shrinklist,
-							&sbinfo->shrinklist);
-					sbinfo->shrinklist_len++;
-				}
-				spin_unlock(&sbinfo->shrinklist_lock);
+				struct page *page;
+
+				page = find_get_page(inode->i_mapping,
+					(newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
+				shmem_shrinker_add(sbinfo, inode, page);
+				if (page)
+					put_page(page);
 			}
 		}
 	}
@@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	if (PageTransHuge(page) &&
 	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
 			hindex + HPAGE_PMD_NR - 1) {
-		/*
-		 * Part of the huge page is beyond i_size: subject
-		 * to shrink under memory pressure.
-		 */
-		spin_lock(&sbinfo->shrinklist_lock);
-		/*
-		 * _careful to defend against unlocked access to
-		 * ->shrink_list in shmem_unused_huge_shrink()
-		 */
-		if (list_empty_careful(&info->shrinklist)) {
-			list_add_tail(&info->shrinklist,
-				      &sbinfo->shrinklist);
-			sbinfo->shrinklist_len++;
-		}
-		spin_unlock(&sbinfo->shrinklist_lock);
+		shmem_shrinker_add(sbinfo, inode, page);
 	}
 
 	/*
-- 
2.27.0.rc0.183.gde8f92d652-goog



* Re: [PATCH] shmem, memcg: enable memcg aware shrinker
  2020-06-01  3:22 [PATCH] shmem, memcg: enable memcg aware shrinker Greg Thelen
@ 2020-06-01 11:05 ` Kirill Tkhai
  2020-06-01 21:48   ` Greg Thelen
  2020-06-01 22:59 ` Yang Shi
  2020-06-08  4:49 ` Hugh Dickins
  2 siblings, 1 reply; 8+ messages in thread
From: Kirill Tkhai @ 2020-06-01 11:05 UTC (permalink / raw)
  To: Greg Thelen, Hugh Dickins, Andrew Morton, Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, stable

Hi, Greg,

good finding. See comments below.

On 01.06.2020 06:22, Greg Thelen wrote:
> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
> called when the per-memcg per-node shrinker_map indicates that the
> shrinker may have objects to release to the memcg and node.
> 
> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
> shrinker which advertises per memcg and numa awareness.  The shmem
> shrinker releases memory by splitting hugepages that extend beyond
> i_size.
> 
> Shmem does not currently set bits in shrinker_map.  So, starting with
> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
> pressure.  This leads to undeserved memcg OOM kills.
> Example that reliably sees memcg OOM kill in unpatched kernel:
>   FS=/tmp/fs
>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>   mkdir -p $FS
>   mount -t tmpfs -o huge=always nodev $FS
>   # Create 1000 MB container, which shouldn't suffer OOM.
>   mkdir $CONTAINER
>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>   echo $BASHPID >> $CONTAINER/cgroup.procs
>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>   # fit within container's 1000 MB.  But if data pages use 2MB
>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>   # which hits memcg 1000 MB limit.
>   for i in {1..4000}; do
>     echo . > $FS/$i
>   done
> 
> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
> aware") maintains the per-node per-memcg shrinker bitmap for THP
> shrinker.  But there's no such logic in shmem.  Make shmem set the
> per-memcg per-node shrinker bits when it modifies inodes to have
> shrinkable pages.
> 
> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()")
> Cc: <stable@vger.kernel.org> # 4.19+
> Signed-off-by: Greg Thelen <gthelen@google.com>
> ---
>  mm/shmem.c | 61 +++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 35 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index bd8840082c94..e11090f78cb5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
>  	return 0;
>  }
>  
> +/*
> + * Expose inode and optional page to shrinker as having a possibly splittable
> + * hugepage that reaches beyond i_size.
> + */
> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
> +			       struct inode *inode, struct page *page)
> +{
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +	spin_lock(&sbinfo->shrinklist_lock);
> +	/*
> +	 * _careful to defend against unlocked access to ->shrink_list in
> +	 * shmem_unused_huge_shrink()
> +	 */
> +	if (list_empty_careful(&info->shrinklist)) {
> +		list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
> +		sbinfo->shrinklist_len++;
> +	}
> +	spin_unlock(&sbinfo->shrinklist_lock);
> +
> +#ifdef CONFIG_MEMCG
> +	if (page && PageTransHuge(page))
> +		memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
> +				       inode->i_sb->s_shrink.id);
> +#endif
> +}
> +
>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	struct inode *inode = d_inode(dentry);
> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>  			 * to shrink under memory pressure.
>  			 */
>  			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> -				spin_lock(&sbinfo->shrinklist_lock);
> -				/*
> -				 * _careful to defend against unlocked access to
> -				 * ->shrink_list in shmem_unused_huge_shrink()
> -				 */
> -				if (list_empty_careful(&info->shrinklist)) {
> -					list_add_tail(&info->shrinklist,
> -							&sbinfo->shrinklist);
> -					sbinfo->shrinklist_len++;
> -				}
> -				spin_unlock(&sbinfo->shrinklist_lock);
> +				struct page *page;
> +
> +				page = find_get_page(inode->i_mapping,
> +					(newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
> +				shmem_shrinker_add(sbinfo, inode, page);
> +				if (page)
> +					put_page(page);

1) I'd move the PageTransHuge() check from shmem_shrinker_add() to here. In case the page is not trans huge,
  it looks strange and completely useless to add the inode to the shrinklist and then skip memcg_set_shrinker_bit().
  Nothing should be added to the shrinklist in this case.

2) In general I think this "last inode page splitter" does not fit the SHRINKER_MEMCG_AWARE concept, and
  shmem_unused_huge_shrink() should be reworked as a new separate !memcg-aware shrinker instead of
  the .nr_cached_objects callback of the generic fs shrinker.

CC: Kirill Shutemov

Kirill, are there any fundamental reasons to keep this shrinking logic in the generic fs shrinker?
Are there any no-gos for converting this to a separate !memcg-aware shrinker?
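
What I mean is roughly the following (a completely untested sketch, only to show the direction: the huge_shrinker and sb fields in shmem_sb_info do not exist today and are assumptions here, while the existing shmem_unused_huge_count()/shmem_unused_huge_scan() helpers are reused unchanged):

	/* Sketch: a dedicated per-sb shrinker, registered without
	 * SHRINKER_MEMCG_AWARE, replacing the .nr_cached_objects hook. */
	static unsigned long shmem_huge_shrink_count(struct shrinker *shrink,
						     struct shrink_control *sc)
	{
		struct shmem_sb_info *sbinfo =
			container_of(shrink, struct shmem_sb_info, huge_shrinker);

		return shmem_unused_huge_count(sbinfo->sb, sc);
	}

	static unsigned long shmem_huge_shrink_scan(struct shrinker *shrink,
						    struct shrink_control *sc)
	{
		struct shmem_sb_info *sbinfo =
			container_of(shrink, struct shmem_sb_info, huge_shrinker);

		return shmem_unused_huge_scan(sbinfo->sb, sc, sc->nr_to_scan);
	}

	/* in shmem_fill_super(), after sbinfo is set up (error handling omitted): */
	sbinfo->sb = sb;
	sbinfo->huge_shrinker.count_objects = shmem_huge_shrink_count;
	sbinfo->huge_shrinker.scan_objects = shmem_huge_shrink_scan;
	sbinfo->huge_shrinker.seeks = DEFAULT_SEEKS;
	register_shrinker(&sbinfo->huge_shrinker);

(with the matching unregister_shrinker() in shmem_put_super()). Being !memcg-aware, such a shrinker would only be called from global reclaim, not from memcg reclaim.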

>  			}
>  		}
>  	}
> @@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  	if (PageTransHuge(page) &&
>  	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
>  			hindex + HPAGE_PMD_NR - 1) {
> -		/*
> -		 * Part of the huge page is beyond i_size: subject
> -		 * to shrink under memory pressure.
> -		 */
> -		spin_lock(&sbinfo->shrinklist_lock);
> -		/*
> -		 * _careful to defend against unlocked access to
> -		 * ->shrink_list in shmem_unused_huge_shrink()
> -		 */
> -		if (list_empty_careful(&info->shrinklist)) {
> -			list_add_tail(&info->shrinklist,
> -				      &sbinfo->shrinklist);
> -			sbinfo->shrinklist_len++;
> -		}
> -		spin_unlock(&sbinfo->shrinklist_lock);
> +		shmem_shrinker_add(sbinfo, inode, page);
>  	}
>  
>  	/*

Thanks,
Kirill


* Re: [PATCH] shmem, memcg: enable memcg aware shrinker
  2020-06-01 11:05 ` Kirill Tkhai
@ 2020-06-01 21:48   ` Greg Thelen
  2020-06-02  8:28     ` Kirill Tkhai
  0 siblings, 1 reply; 8+ messages in thread
From: Greg Thelen @ 2020-06-01 21:48 UTC (permalink / raw)
  To: Kirill Tkhai, Hugh Dickins, Andrew Morton, Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, stable

Kirill Tkhai <ktkhai@virtuozzo.com> wrote:

> Hi, Greg,
>
> good finding. See comments below.
>
> On 01.06.2020 06:22, Greg Thelen wrote:
>> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
>> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
>> called when the per-memcg per-node shrinker_map indicates that the
>> shrinker may have objects to release to the memcg and node.
>> 
>> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
>> shrinker which advertises per memcg and numa awareness.  The shmem
>> shrinker releases memory by splitting hugepages that extend beyond
>> i_size.
>> 
>> Shmem does not currently set bits in shrinker_map.  So, starting with
>> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
>> pressure.  This leads to undeserved memcg OOM kills.
>> Example that reliably sees memcg OOM kill in unpatched kernel:
>>   FS=/tmp/fs
>>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>>   mkdir -p $FS
>>   mount -t tmpfs -o huge=always nodev $FS
>>   # Create 1000 MB container, which shouldn't suffer OOM.
>>   mkdir $CONTAINER
>>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>>   echo $BASHPID >> $CONTAINER/cgroup.procs
>>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>>   # fit within container's 1000 MB.  But if data pages use 2MB
>>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>>   # which hits memcg 1000 MB limit.
>>   for i in {1..4000}; do
>>     echo . > $FS/$i
>>   done
>> 
>> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
>> aware") maintains the per-node per-memcg shrinker bitmap for THP
>> shrinker.  But there's no such logic in shmem.  Make shmem set the
>> per-memcg per-node shrinker bits when it modifies inodes to have
>> shrinkable pages.
>> 
>> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()")
>> Cc: <stable@vger.kernel.org> # 4.19+
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> ---
>>  mm/shmem.c | 61 +++++++++++++++++++++++++++++++-----------------------
>>  1 file changed, 35 insertions(+), 26 deletions(-)
>> 
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index bd8840082c94..e11090f78cb5 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
>>  	return 0;
>>  }
>>  
>> +/*
>> + * Expose inode and optional page to shrinker as having a possibly splittable
>> + * hugepage that reaches beyond i_size.
>> + */
>> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
>> +			       struct inode *inode, struct page *page)
>> +{
>> +	struct shmem_inode_info *info = SHMEM_I(inode);
>> +
>> +	spin_lock(&sbinfo->shrinklist_lock);
>> +	/*
>> +	 * _careful to defend against unlocked access to ->shrink_list in
>> +	 * shmem_unused_huge_shrink()
>> +	 */
>> +	if (list_empty_careful(&info->shrinklist)) {
>> +		list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
>> +		sbinfo->shrinklist_len++;
>> +	}
>> +	spin_unlock(&sbinfo->shrinklist_lock);
>> +
>> +#ifdef CONFIG_MEMCG
>> +	if (page && PageTransHuge(page))
>> +		memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
>> +				       inode->i_sb->s_shrink.id);
>> +#endif
>> +}
>> +
>>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>  {
>>  	struct inode *inode = d_inode(dentry);
>> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>  			 * to shrink under memory pressure.
>>  			 */
>>  			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> -				spin_lock(&sbinfo->shrinklist_lock);
>> -				/*
>> -				 * _careful to defend against unlocked access to
>> -				 * ->shrink_list in shmem_unused_huge_shrink()
>> -				 */
>> -				if (list_empty_careful(&info->shrinklist)) {
>> -					list_add_tail(&info->shrinklist,
>> -							&sbinfo->shrinklist);
>> -					sbinfo->shrinklist_len++;
>> -				}
>> -				spin_unlock(&sbinfo->shrinklist_lock);
>> +				struct page *page;
>> +
>> +				page = find_get_page(inode->i_mapping,
>> +					(newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
>> +				shmem_shrinker_add(sbinfo, inode, page);
>> +				if (page)
>> +					put_page(page);
>
> 1) I'd move the PageTransHuge() check from shmem_shrinker_add() to here. In case the page is not trans huge,
>   it looks strange and completely useless to add the inode to the shrinklist and then skip memcg_set_shrinker_bit().
>   Nothing should be added to the shrinklist in this case.

Ack,  I'll take a look at this.

> 2) In general I think this "last inode page splitter" does not fit the SHRINKER_MEMCG_AWARE concept, and
>   shmem_unused_huge_shrink() should be reworked as a new separate !memcg-aware shrinker instead of
>   the .nr_cached_objects callback of the generic fs shrinker.
>
> CC: Kirill Shutemov
>
> Kirill, are there any fundamental reasons to keep this shrinking logic in the generic fs shrinker?
> Are there any no-gos for converting this to a separate !memcg-aware shrinker?

Making the shmem shrinker !memcg-aware seems like it would require tail
pages beyond i_size not be charged to memcg.  Otherwise memcg pressure
(which only calls memcg aware shrinkers) won't uncharge them.  Currently
the entire compound page is charged.

>>  			}
>>  		}
>>  	}
>> @@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>>  	if (PageTransHuge(page) &&
>>  	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
>>  			hindex + HPAGE_PMD_NR - 1) {
>> -		/*
>> -		 * Part of the huge page is beyond i_size: subject
>> -		 * to shrink under memory pressure.
>> -		 */
>> -		spin_lock(&sbinfo->shrinklist_lock);
>> -		/*
>> -		 * _careful to defend against unlocked access to
>> -		 * ->shrink_list in shmem_unused_huge_shrink()
>> -		 */
>> -		if (list_empty_careful(&info->shrinklist)) {
>> -			list_add_tail(&info->shrinklist,
>> -				      &sbinfo->shrinklist);
>> -			sbinfo->shrinklist_len++;
>> -		}
>> -		spin_unlock(&sbinfo->shrinklist_lock);
>> +		shmem_shrinker_add(sbinfo, inode, page);
>>  	}
>>  
>>  	/*
>
> Thanks,
> Kirill


* Re: [PATCH] shmem, memcg: enable memcg aware shrinker
  2020-06-01  3:22 [PATCH] shmem, memcg: enable memcg aware shrinker Greg Thelen
  2020-06-01 11:05 ` Kirill Tkhai
@ 2020-06-01 22:59 ` Yang Shi
  2020-06-04  8:17   ` Greg Thelen
  2020-06-08  4:49 ` Hugh Dickins
  2 siblings, 1 reply; 8+ messages in thread
From: Yang Shi @ 2020-06-01 22:59 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Hugh Dickins, Andrew Morton, Kirill Tkhai, Linux MM,
	Linux Kernel Mailing List, stable

On Sun, May 31, 2020 at 8:22 PM Greg Thelen <gthelen@google.com> wrote:
>
> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
> called when the per-memcg per-node shrinker_map indicates that the
> shrinker may have objects to release to the memcg and node.
>
> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
> shrinker which advertises per memcg and numa awareness.  The shmem
> shrinker releases memory by splitting hugepages that extend beyond
> i_size.
>
> Shmem does not currently set bits in shrinker_map.  So, starting with
> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
> pressure.  This leads to undeserved memcg OOM kills.
> Example that reliably sees memcg OOM kill in unpatched kernel:
>   FS=/tmp/fs
>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>   mkdir -p $FS
>   mount -t tmpfs -o huge=always nodev $FS
>   # Create 1000 MB container, which shouldn't suffer OOM.
>   mkdir $CONTAINER
>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>   echo $BASHPID >> $CONTAINER/cgroup.procs
>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>   # fit within container's 1000 MB.  But if data pages use 2MB
>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>   # which hits memcg 1000 MB limit.
>   for i in {1..4000}; do
>     echo . > $FS/$i
>   done

It looks like all the inodes which have a tail THP beyond i_size are on one
single list, and the shrinker actually just splits the first
nr_to_scan inodes. But since the list is not memcg aware, it seems
it may split THPs which are not charged to the victim memcg, and
the victim memcg may still suffer from a premature oom, right?

>
> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
> aware") maintains the per-node per-memcg shrinker bitmap for THP
> shrinker.  But there's no such logic in shmem.  Make shmem set the
> per-memcg per-node shrinker bits when it modifies inodes to have
> shrinkable pages.
>
> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()")
> Cc: <stable@vger.kernel.org> # 4.19+
> Signed-off-by: Greg Thelen <gthelen@google.com>
> ---
>  mm/shmem.c | 61 +++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 35 insertions(+), 26 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index bd8840082c94..e11090f78cb5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
>         return 0;
>  }
>
> +/*
> + * Expose inode and optional page to shrinker as having a possibly splittable
> + * hugepage that reaches beyond i_size.
> + */
> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
> +                              struct inode *inode, struct page *page)
> +{
> +       struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +       spin_lock(&sbinfo->shrinklist_lock);
> +       /*
> +        * _careful to defend against unlocked access to ->shrink_list in
> +        * shmem_unused_huge_shrink()
> +        */
> +       if (list_empty_careful(&info->shrinklist)) {
> +               list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
> +               sbinfo->shrinklist_len++;
> +       }
> +       spin_unlock(&sbinfo->shrinklist_lock);
> +
> +#ifdef CONFIG_MEMCG
> +       if (page && PageTransHuge(page))
> +               memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
> +                                      inode->i_sb->s_shrink.id);
> +#endif
> +}
> +
>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>         struct inode *inode = d_inode(dentry);
> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>                          * to shrink under memory pressure.
>                          */
>                         if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> -                               spin_lock(&sbinfo->shrinklist_lock);
> -                               /*
> -                                * _careful to defend against unlocked access to
> -                                * ->shrink_list in shmem_unused_huge_shrink()
> -                                */
> -                               if (list_empty_careful(&info->shrinklist)) {
> -                                       list_add_tail(&info->shrinklist,
> -                                                       &sbinfo->shrinklist);
> -                                       sbinfo->shrinklist_len++;
> -                               }
> -                               spin_unlock(&sbinfo->shrinklist_lock);
> +                               struct page *page;
> +
> +                               page = find_get_page(inode->i_mapping,
> +                                       (newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
> +                               shmem_shrinker_add(sbinfo, inode, page);
> +                               if (page)
> +                                       put_page(page);
>                         }
>                 }
>         }
> @@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>         if (PageTransHuge(page) &&
>             DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
>                         hindex + HPAGE_PMD_NR - 1) {
> -               /*
> -                * Part of the huge page is beyond i_size: subject
> -                * to shrink under memory pressure.
> -                */
> -               spin_lock(&sbinfo->shrinklist_lock);
> -               /*
> -                * _careful to defend against unlocked access to
> -                * ->shrink_list in shmem_unused_huge_shrink()
> -                */
> -               if (list_empty_careful(&info->shrinklist)) {
> -                       list_add_tail(&info->shrinklist,
> -                                     &sbinfo->shrinklist);
> -                       sbinfo->shrinklist_len++;
> -               }
> -               spin_unlock(&sbinfo->shrinklist_lock);
> +               shmem_shrinker_add(sbinfo, inode, page);
>         }
>
>         /*
> --
> 2.27.0.rc0.183.gde8f92d652-goog
>
>


* Re: [PATCH] shmem, memcg: enable memcg aware shrinker
  2020-06-01 21:48   ` Greg Thelen
@ 2020-06-02  8:28     ` Kirill Tkhai
  0 siblings, 0 replies; 8+ messages in thread
From: Kirill Tkhai @ 2020-06-02  8:28 UTC (permalink / raw)
  To: Greg Thelen, Hugh Dickins, Andrew Morton, Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, stable

On 02.06.2020 00:48, Greg Thelen wrote:
> Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> 
>> Hi, Greg,
>>
>> good finding. See comments below.
>>
>> On 01.06.2020 06:22, Greg Thelen wrote:
>>> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
>>> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
>>> called when the per-memcg per-node shrinker_map indicates that the
>>> shrinker may have objects to release to the memcg and node.
>>>
>>> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
>>> shrinker which advertises per memcg and numa awareness.  The shmem
>>> shrinker releases memory by splitting hugepages that extend beyond
>>> i_size.
>>>
>>> Shmem does not currently set bits in shrinker_map.  So, starting with
>>> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
>>> pressure.  This leads to undeserved memcg OOM kills.
>>> Example that reliably sees memcg OOM kill in unpatched kernel:
>>>   FS=/tmp/fs
>>>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>>>   mkdir -p $FS
>>>   mount -t tmpfs -o huge=always nodev $FS
>>>   # Create 1000 MB container, which shouldn't suffer OOM.
>>>   mkdir $CONTAINER
>>>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>>>   echo $BASHPID >> $CONTAINER/cgroup.procs
>>>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>>>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>>>   # fit within container's 1000 MB.  But if data pages use 2MB
>>>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>>>   # which hits memcg 1000 MB limit.
>>>   for i in {1..4000}; do
>>>     echo . > $FS/$i
>>>   done
>>>
>>> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
>>> aware") maintains the per-node per-memcg shrinker bitmap for THP
>>> shrinker.  But there's no such logic in shmem.  Make shmem set the
>>> per-memcg per-node shrinker bits when it modifies inodes to have
>>> shrinkable pages.
>>>
>>> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()")
>>> Cc: <stable@vger.kernel.org> # 4.19+
>>> Signed-off-by: Greg Thelen <gthelen@google.com>
>>> ---
>>>  mm/shmem.c | 61 +++++++++++++++++++++++++++++++-----------------------
>>>  1 file changed, 35 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index bd8840082c94..e11090f78cb5 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
>>>  	return 0;
>>>  }
>>>  
>>> +/*
>>> + * Expose inode and optional page to shrinker as having a possibly splittable
>>> + * hugepage that reaches beyond i_size.
>>> + */
>>> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
>>> +			       struct inode *inode, struct page *page)
>>> +{
>>> +	struct shmem_inode_info *info = SHMEM_I(inode);
>>> +
>>> +	spin_lock(&sbinfo->shrinklist_lock);
>>> +	/*
>>> +	 * _careful to defend against unlocked access to ->shrink_list in
>>> +	 * shmem_unused_huge_shrink()
>>> +	 */
>>> +	if (list_empty_careful(&info->shrinklist)) {
>>> +		list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
>>> +		sbinfo->shrinklist_len++;
>>> +	}
>>> +	spin_unlock(&sbinfo->shrinklist_lock);
>>> +
>>> +#ifdef CONFIG_MEMCG
>>> +	if (page && PageTransHuge(page))
>>> +		memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
>>> +				       inode->i_sb->s_shrink.id);
>>> +#endif
>>> +}
>>> +
>>>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>>  {
>>>  	struct inode *inode = d_inode(dentry);
>>> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>>  			 * to shrink under memory pressure.
>>>  			 */
>>>  			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>>> -				spin_lock(&sbinfo->shrinklist_lock);
>>> -				/*
>>> -				 * _careful to defend against unlocked access to
>>> -				 * ->shrink_list in shmem_unused_huge_shrink()
>>> -				 */
>>> -				if (list_empty_careful(&info->shrinklist)) {
>>> -					list_add_tail(&info->shrinklist,
>>> -							&sbinfo->shrinklist);
>>> -					sbinfo->shrinklist_len++;
>>> -				}
>>> -				spin_unlock(&sbinfo->shrinklist_lock);
>>> +				struct page *page;
>>> +
>>> +				page = find_get_page(inode->i_mapping,
>>> +					(newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
>>> +				shmem_shrinker_add(sbinfo, inode, page);
>>> +				if (page)
>>> +					put_page(page);
>>
>> 1) I'd move the PageTransHuge() check from shmem_shrinker_add() to here. In case the page is not trans huge,
>>   it looks strange and completely useless to add the inode to the shrinklist and then skip memcg_set_shrinker_bit().
>>   Nothing should be added to the shrinklist in this case.
> 
> Ack,  I'll take a look at this.
> 
>> 2) In general I think this "last inode page splitter" does not fit the SHRINKER_MEMCG_AWARE concept, and
>>   shmem_unused_huge_shrink() should be reworked as a new separate !memcg-aware shrinker instead of
>>   the .nr_cached_objects callback of the generic fs shrinker.
>>
>> CC: Kirill Shutemov
>>
>> Kirill, are there any fundamental reasons to keep this shrinking logic in the generic fs shrinker?
>> Are there any no-gos for converting this to a separate !memcg-aware shrinker?
> 
> Making the shmem shrinker !memcg-aware seems like it would require tail
> pages beyond i_size not be charged to memcg.  Otherwise memcg pressure
> (which only calls memcg aware shrinkers) won't uncharge them.  Currently
> the entire compound page is charged.

A shrinker is not about charging, a shrinker is about uncharging ;) The pages will remain charged
like they used to be and where they used to be.

The thing is we have a single shrinklist per superblock. An inode with a pending tail page split
is added to this list. Later, shmem_unused_huge_scan() iterates over the list. It splits the tail
page of every inode whose size is still not huge page aligned.

We do not care about memcg here. Tail pages of two inodes may be related to different memcgs. But
shmem_unused_huge_scan() shrinks all of them; it does not care about the memcg in sc->memcg. Even more:
nobody in mm/shmem.c cares about memcg.

In traditional memcg-aware shrinkers we maintain separate lists for every existing memcg. An object
which is charged to a memcg is added to a specific shrinker list related to that memcg. So, the shrinker
is able to iterate only those objects which are charged to the memcg.

In case of shmem we have a single list for those objects (shrinklist). Even more, the shrinklist
does not hold the charged objects themselves; we link inodes there. Imagine a situation: a file was shrunk
and its inode became linked to the shrinklist. The tail page is unaligned and it is related to memcg1. Then
the file was shrunk one more time. The new tail page is also unaligned and it is related to another memcg2.
So, the shmem shrinker doesn't introduce lists for every memcg (i.e. list_lru), since an inode's tail page
relation to a memcg is not constant. So, as the shmem shrinker splits pages of any memcg (regardless of
sc->memcg), it can't be called memcg-aware.

Revisiting this once again today, I think we should make shrinklist a list_lru. Every time
an inode is considered for addition to the shrinklist, we should move it to the appropriate memcg list.
In case it's already linked and the memcg has changed, we should move it to another memcg list. I.e.,
every place like:

	if (list_empty(&info->shrinklist)) {
		list_add_tail(&info->shrinklist, &sbinfo->shrinklist); 
		sbinfo->shrinklist_len++;
	}

Convert into:

	if (list_empty(&info->shrinklist)) {
		list_lru_add(&sbinfo->shrinklist, &info->shrinklist);
		info->memcg_id = memcg->id;
	} else if (memcg_changed(info)) {
		/* Remove from old memcg list */
		list_lru_del(&sbinfo->shrinklist, &info->shrinklist);
		/* Link to new memcg list */
		list_lru_add(&sbinfo->shrinklist, &info->shrinklist);
	}

We may cache memcg->id in info, so memcg_changed() will be able to compare the cached memcg
with the current one, and we will avoid the del/add in case the tail page's memcg remains the same.
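
memcg_changed() itself could be trivial, e.g. (again only a sketch: it assumes a new info->memcg_id field and that the new tail page is passed in, so the comparison is against the memcg that page is charged to):

	/* hypothetical helper: has the pending tail page moved to another memcg? */
	static bool memcg_changed(struct shmem_inode_info *info, struct page *page)
	{
		return info->memcg_id != mem_cgroup_id(page->mem_cgroup);
	}

with info->memcg_id updated to mem_cgroup_id() of the new page's memcg whenever the inode is (re)added to a list.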

Kirill


* Re: [PATCH] shmem, memcg: enable memcg aware shrinker
  2020-06-01 22:59 ` Yang Shi
@ 2020-06-04  8:17   ` Greg Thelen
  2020-06-05 22:05     ` Yang Shi
  0 siblings, 1 reply; 8+ messages in thread
From: Greg Thelen @ 2020-06-04  8:17 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Kirill Tkhai, Linux MM,
	Linux Kernel Mailing List, stable

Yang Shi <shy828301@gmail.com> wrote:

> On Sun, May 31, 2020 at 8:22 PM Greg Thelen <gthelen@google.com> wrote:
>>
>> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
>> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
>> called when the per-memcg per-node shrinker_map indicates that the
>> shrinker may have objects to release to the memcg and node.
>>
>> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
>> shrinker which advertises per memcg and numa awareness.  The shmem
>> shrinker releases memory by splitting hugepages that extend beyond
>> i_size.
>>
>> Shmem does not currently set bits in shrinker_map.  So, starting with
>> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
>> pressure.  This leads to undeserved memcg OOM kills.
>> Example that reliably sees memcg OOM kill in unpatched kernel:
>>   FS=/tmp/fs
>>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>>   mkdir -p $FS
>>   mount -t tmpfs -o huge=always nodev $FS
>>   # Create 1000 MB container, which shouldn't suffer OOM.
>>   mkdir $CONTAINER
>>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>>   echo $BASHPID >> $CONTAINER/cgroup.procs
>>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>>   # fit within container's 1000 MB.  But if data pages use 2MB
>>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>>   # which hits memcg 1000 MB limit.
>>   for i in {1..4000}; do
>>     echo . > $FS/$i
>>   done
>
> It looks like all the inodes which have a tail THP beyond i_size are on one
> single list, and the shrinker actually just splits the first
> nr_to_scan inodes. But since the list is not memcg aware, it seems
> it may split THPs which are not charged to the victim memcg, and
> the victim memcg may still suffer from a premature oom, right?

Correct.  shmem_unused_huge_shrink() is not memcg aware.  In response to
memcg pressure it will split the post-i_size tails of nr_to_scan tmpfs
inodes regardless of whether they're charged to the under-pressure memcg.
do_shrink_slab() looks like it'll repeatedly call
shmem_unused_huge_shrink().  So it will split the tails of many inodes.  So
I think it'll avoid the oom by over-shrinking.  This is not ideal.  But
it seems better than an undeserved oom kill.

I think the solution (as Kirill Tkhai suggested) is a memcg-aware index,
which would solve both:
1) avoid premature oom by registering the shrinker to respond to memcg
   pressure
2) avoid shrinking/splitting inodes unrelated to the under-pressure
   memcg

I can certainly look into that (thanks Kirill for the pointers).  In the
short term I'm still interested in avoiding premature OOMs with the
original thread (i.e. restore pre-4.19 behavior to shmem shrinker for
memcg pressure).  I plan to test and repost v2.
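
For the longer-term rework, the rough shape I have in mind for the callbacks, once shrinklist is a list_lru keyed by the memcg of the pending tail page, is something like this (only a sketch under those assumptions: the huge_shrinker field, shmem_shrink_isolate() and shmem_split_isolated() are hypothetical names, nothing below exists today):

	static unsigned long shmem_huge_shrink_count(struct shrinker *shrink,
						     struct shrink_control *sc)
	{
		struct shmem_sb_info *sbinfo =
			container_of(shrink, struct shmem_sb_info, huge_shrinker);

		/* only counts inodes queued under (sc->memcg, sc->nid) */
		return list_lru_shrink_count(&sbinfo->shrinklist, sc);
	}

	static unsigned long shmem_huge_shrink_scan(struct shrinker *shrink,
						    struct shrink_control *sc)
	{
		struct shmem_sb_info *sbinfo =
			container_of(shrink, struct shmem_sb_info, huge_shrinker);
		LIST_HEAD(to_split);

		/*
		 * Hypothetical helpers: shmem_shrink_isolate() moves inodes
		 * whose pending tail page is charged to sc->memcg onto the
		 * private to_split list; shmem_split_isolated() then splits
		 * their hugepages outside the lru lock, much like
		 * shmem_unused_huge_shrink() does today.
		 */
		list_lru_shrink_walk(&sbinfo->shrinklist, sc,
				     shmem_shrink_isolate, &to_split);
		return shmem_split_isolated(&to_split);
	}

That way memcg pressure only ever touches inodes whose oversized hugepages are actually charged to the memcg under pressure.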


* Re: [PATCH] shmem, memcg: enable memcg aware shrinker
  2020-06-04  8:17   ` Greg Thelen
@ 2020-06-05 22:05     ` Yang Shi
  0 siblings, 0 replies; 8+ messages in thread
From: Yang Shi @ 2020-06-05 22:05 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Hugh Dickins, Andrew Morton, Kirill Tkhai, Linux MM,
	Linux Kernel Mailing List, stable

On Thu, Jun 4, 2020 at 1:17 AM Greg Thelen <gthelen@google.com> wrote:
>
> Yang Shi <shy828301@gmail.com> wrote:
>
> > On Sun, May 31, 2020 at 8:22 PM Greg Thelen <gthelen@google.com> wrote:
> >>
> >> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
> >> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
> >> called when the per-memcg per-node shrinker_map indicates that the
> >> shrinker may have objects to release to the memcg and node.
> >>
> >> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
> >> shrinker which advertises per memcg and numa awareness.  The shmem
> >> shrinker releases memory by splitting hugepages that extend beyond
> >> i_size.
> >>
> >> Shmem does not currently set bits in shrinker_map.  So, starting with
> >> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
> >> pressure.  This leads to undeserved memcg OOM kills.
> >> Example that reliably sees memcg OOM kill in unpatched kernel:
> >>   FS=/tmp/fs
> >>   CONTAINER=/cgroup/memory/tmpfs_shrinker
> >>   mkdir -p $FS
> >>   mount -t tmpfs -o huge=always nodev $FS
> >>   # Create 1000 MB container, which shouldn't suffer OOM.
> >>   mkdir $CONTAINER
> >>   echo 1000M > $CONTAINER/memory.limit_in_bytes
> >>   echo $BASHPID >> $CONTAINER/cgroup.procs
> >>   # Create 4000 files.  Ideally each file uses 4k data page + a little
> >>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
> >>   # fit within container's 1000 MB.  But if data pages use 2MB
> >>   # hugepages (due to aggressive huge=always) then files consume 8GB,
> >>   # which hits memcg 1000 MB limit.
> >>   for i in {1..4000}; do
> >>     echo . > $FS/$i
> >>   done
> >
> > It looks like all the inodes which have a tail THP beyond i_size are on one
> > single list, and the shrinker actually just splits the first
> > nr_to_scan inodes. But since the list is not memcg aware, it seems
> > it may split THPs which are not charged to the victim memcg, and
> > the victim memcg may still suffer from a premature oom, right?
>
> Correct.  shmem_unused_huge_shrink() is not memcg aware.  In response to
> memcg pressure it will split the post-i_size tails of nr_to_scan tmpfs
> inodes regardless of whether they're charged to the under-pressure memcg.
> do_shrink_slab() looks like it'll repeatedly call
> shmem_unused_huge_shrink().  So it will split the tails of many inodes.  So
> I think it'll avoid the oom by over-shrinking.  This is not ideal.  But
> it seems better than an undeserved oom kill.
>
> I think the solution (as Kirill Tkhai suggested) is a memcg-aware index,
> which would solve both:
> 1) avoid premature oom by registering the shrinker to respond to memcg
>    pressure
> 2) avoid shrinking/splitting inodes unrelated to the under-pressure
>    memcg

I do agree with Kirill. Using list_lru sounds optimal. But it looks
like the memcg index is tricky. The index of the memcg which the
beyond-i_size THP is charged to should be used, instead of the inode's
memcg, which may be a different memcg.

>
> I can certainly look into that (thanks Kirill for the pointers).  In the
> short term I'm still interested in avoiding premature OOMs with the
> original thread (i.e. restore pre-4.19 behavior to shmem shrinker for
> memcg pressure).  I plan to test and repost v2.


* Re: [PATCH] shmem, memcg: enable memcg aware shrinker
  2020-06-01  3:22 [PATCH] shmem, memcg: enable memcg aware shrinker Greg Thelen
  2020-06-01 11:05 ` Kirill Tkhai
  2020-06-01 22:59 ` Yang Shi
@ 2020-06-08  4:49 ` Hugh Dickins
  2 siblings, 0 replies; 8+ messages in thread
From: Hugh Dickins @ 2020-06-08  4:49 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Hugh Dickins, Andrew Morton, Kirill Tkhai, Yang Shi,
	Kirill A. Shutemov, linux-mm, linux-kernel

On Sun, 31 May 2020, Greg Thelen wrote:

> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
> called when the per-memcg per-node shrinker_map indicates that the
> shrinker may have objects to release to the memcg and node.
> 
> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
> shrinker which advertises per memcg and numa awareness.  The shmem
> shrinker releases memory by splitting hugepages that extend beyond
> i_size.
> 
> Shmem does not currently set bits in shrinker_map.  So, starting with
> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
> pressure.  This leads to undeserved memcg OOM kills.
> Example that reliably sees memcg OOM kill in unpatched kernel:
>   FS=/tmp/fs
>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>   mkdir -p $FS
>   mount -t tmpfs -o huge=always nodev $FS
>   # Create 1000 MB container, which shouldn't suffer OOM.
>   mkdir $CONTAINER
>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>   echo $BASHPID >> $CONTAINER/cgroup.procs
>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>   # fit within container's 1000 MB.  But if data pages use 2MB
>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>   # which hits memcg 1000 MB limit.
>   for i in {1..4000}; do
>     echo . > $FS/$i
>   done
> 
> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
> aware") maintains the per-node per-memcg shrinker bitmap for THP
> shrinker.  But there's no such logic in shmem.  Make shmem set the
> per-memcg per-node shrinker bits when it modifies inodes to have
> shrinkable pages.
> 
> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()")
> Cc: <stable@vger.kernel.org> # 4.19+
> Signed-off-by: Greg Thelen <gthelen@google.com>
> ---
>  mm/shmem.c | 61 +++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 35 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index bd8840082c94..e11090f78cb5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
>  	return 0;
>  }
>  
> +/*
> + * Expose inode and optional page to shrinker as having a possibly splittable
> + * hugepage that reaches beyond i_size.
> + */
> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
> +			       struct inode *inode, struct page *page)
> +{
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +	spin_lock(&sbinfo->shrinklist_lock);
> +	/*
> +	 * _careful to defend against unlocked access to ->shrink_list in
> +	 * shmem_unused_huge_shrink()
> +	 */
> +	if (list_empty_careful(&info->shrinklist)) {
> +		list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
> +		sbinfo->shrinklist_len++;
> +	}
> +	spin_unlock(&sbinfo->shrinklist_lock);
> +
> +#ifdef CONFIG_MEMCG
> +	if (page && PageTransHuge(page))
> +		memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
> +				       inode->i_sb->s_shrink.id);
> +#endif
> +}
> +
>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	struct inode *inode = d_inode(dentry);
> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>  			 * to shrink under memory pressure.
>  			 */
>  			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> -				spin_lock(&sbinfo->shrinklist_lock);
> -				/*
> -				 * _careful to defend against unlocked access to
> -				 * ->shrink_list in shmem_unused_huge_shrink()
> -				 */
> -				if (list_empty_careful(&info->shrinklist)) {
> -					list_add_tail(&info->shrinklist,
> -							&sbinfo->shrinklist);
> -					sbinfo->shrinklist_len++;
> -				}
> -				spin_unlock(&sbinfo->shrinklist_lock);
> +				struct page *page;
> +
> +				page = find_get_page(inode->i_mapping,
> +					(newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
> +				shmem_shrinker_add(sbinfo, inode, page);
> +				if (page)
> +					put_page(page);
>  			}
>  		}
>  	}

Thanks Greg. But it looks like v5.7's 71725ed10c40 ("mm: huge tmpfs:
try to split_huge_page() when punching hole") actually made this block
of shmem_setattr() redundant, and I should have deleted it in that patch.

There are admittedly circumstances in which shmem_truncate_range() is
unable to split the huge page: but it has already retried, and putting
it on this shrinklist, probably to fail again and again, is not the right
thing.  Cleaner just to delete all this (either as part of this patch or
separately), and just add memcg_set_shrinker_bit() in shmem_getpage_gfp().
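
That is, roughly (an untested sketch of the direction, not a patch), shmem_getpage_gfp() would keep its existing shrinklist block and only gain the bit setting:

	if (PageTransHuge(page) &&
	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
			hindex + HPAGE_PMD_NR - 1) {
		/* existing shrinklist_lock block stays as it is ... */
#ifdef CONFIG_MEMCG
		memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
				       inode->i_sb->s_shrink.id);
#endif
	}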

But I see you're suggesting Cc: <stable@vger.kernel.org> # 4.19+
Hmm, I suppose we could put in a version like you have here, then I
follow up immediately with the patch deleting most of what you wrote.
Or, we could recommend 71725ed10c40 for stable too: I'm not sure.

For now at least, I'm not myself giving much thought to the wider issue,
of how best to reimplement the shrinker in a more memcg-friendly way.

Hugh

> @@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  	if (PageTransHuge(page) &&
>  	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
>  			hindex + HPAGE_PMD_NR - 1) {
> -		/*
> -		 * Part of the huge page is beyond i_size: subject
> -		 * to shrink under memory pressure.
> -		 */
> -		spin_lock(&sbinfo->shrinklist_lock);
> -		/*
> -		 * _careful to defend against unlocked access to
> -		 * ->shrink_list in shmem_unused_huge_shrink()
> -		 */
> -		if (list_empty_careful(&info->shrinklist)) {
> -			list_add_tail(&info->shrinklist,
> -				      &sbinfo->shrinklist);
> -			sbinfo->shrinklist_len++;
> -		}
> -		spin_unlock(&sbinfo->shrinklist_lock);
> +		shmem_shrinker_add(sbinfo, inode, page);
>  	}
>  
>  	/*
> -- 
> 2.27.0.rc0.183.gde8f92d652-goog


Thread overview: 8+ messages
2020-06-01  3:22 [PATCH] shmem, memcg: enable memcg aware shrinker Greg Thelen
2020-06-01 11:05 ` Kirill Tkhai
2020-06-01 21:48   ` Greg Thelen
2020-06-02  8:28     ` Kirill Tkhai
2020-06-01 22:59 ` Yang Shi
2020-06-04  8:17   ` Greg Thelen
2020-06-05 22:05     ` Yang Shi
2020-06-08  4:49 ` Hugh Dickins
