linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/4] mm/page_owner: Extend page_owner to show memcg information
@ 2022-02-08  0:05 Waiman Long
  2022-02-08  0:05 ` [PATCH v5 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Waiman Long @ 2022-02-08  0:05 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long

 v5:
  - Apply the following changes to patch 3
    1) Make cgroup_name() write directly into kbuf without using an
       intermediate buffer.
    2) Change the terminology from "offline memcg" to "dying memcg" to align
       better with similar terms used elsewhere in the kernel.

 v4:
  - Take rcu_read_lock() when memcg is being accessed as suggested by
    Michal.
  - Make print_page_owner_memcg() return the new offset into the buffer
    and put CONFIG_MEMCG block inside as suggested by Mike.
  - Directly use TASK_COMM_LEN as length of name buffer as suggested by
    Roman.

 v3:
  - Add unlikely() to patch 1 and clarify that -1 will not be returned.
  - Use a helper function to print out memcg information in patch 3.
  - Add a new patch 4 to store task command name in page_owner
    structure.

While debugging the constant increase in percpu memory consumption on
a system that spawned large number of containers, it was found that a
lot of dying mem_cgroup structures remained in place without being
freed. Further investigation indicated that those mem_cgroup structures
were pinned by some pages.

In order to find out what those pages are, the existing page_owner
debugging tool is extended to show memory cgroup information and whether
those memcgs are dying or not. With the enhanced page_owner tool,
the following is a typical page that pinned the mem_cgroup structure
in my test case:

Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE), pid 70984 (podman), ts 5421278969115 ns, free_ts 5420935666638 ns
PFN 3205061 type Movable Block 6259 type Movable Flags 0x17ffffc00c001c(uptodate|dirty|lru|reclaim|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
 prep_new_page+0x8e/0xb0
 get_page_from_freelist+0xc4d/0xe50
 __alloc_pages+0x172/0x320
 alloc_pages_vma+0x84/0x230
 shmem_alloc_page+0x3f/0x90
 shmem_alloc_and_acct_page+0x76/0x1c0
 shmem_getpage_gfp+0x48d/0x890
 shmem_write_begin+0x36/0xc0
 generic_perform_write+0xed/0x1d0
 __generic_file_write_iter+0xdc/0x1b0
 generic_file_write_iter+0x5d/0xb0
 new_sync_write+0x11f/0x1b0
 vfs_write+0x1ba/0x2a0
 ksys_write+0x59/0xd0
 do_syscall_64+0x37/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
Charged to dying memcg libpod-conmon-fbc62060b5377479a7371cc16c5c596002945f2aa00d3d6d73a0cd0d148b6637.scope

So the page was not freed because it was part of a shmem segment. That
is useful information that can help users to diagnose similar problems.

With cgroup v1, /proc/cgroups can be read to find out the total number
of memory cgroups (online + dying). With cgroup v2, the cgroup.stat
of the root cgroup can be read to find the number of dying cgroups
(most likely pinned by dying memcgs).

The page_owner feature is not supposed to be enabled for production
system due to its memory overhead. However, if it is suspected that
dying memcgs are increasing over time, a test environment with page_owner
enabled can then be set up with appropriate workload for further analysis
on what may be causing the increasing number of dying memcgs.

Waiman Long (4):
  lib/vsprintf: Avoid redundant work with 0 size
  mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
  mm/page_owner: Print memcg information
  mm/page_owner: Record task command name

 lib/vsprintf.c  |  8 +++---
 mm/page_owner.c | 72 ++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 62 insertions(+), 18 deletions(-)

-- 
2.27.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v5 1/4] lib/vsprintf: Avoid redundant work with 0 size
  2022-02-08  0:05 [PATCH v5 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
@ 2022-02-08  0:05 ` Waiman Long
  2022-02-08  0:05 ` [PATCH v5 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2022-02-08  0:05 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long,
	Mike Rapoport

For *scnprintf(), vsnprintf() is always called even if the input size is
0. That is a waste of time, so just return 0 in this case.

Note that vsnprintf() will never return -1 to indicate an error. So
skipping the call to vsnprintf() when size is 0 will have no functional
impact at all.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
---
 lib/vsprintf.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 3b8129dd374c..d419154b47bb 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -2895,13 +2895,15 @@ int vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
 {
 	int i;
 
+	if (unlikely(!size))
+		return 0;
+
 	i = vsnprintf(buf, size, fmt, args);
 
 	if (likely(i < size))
 		return i;
-	if (size != 0)
-		return size - 1;
-	return 0;
+
+	return size - 1;
 }
 EXPORT_SYMBOL(vscnprintf);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v5 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
  2022-02-08  0:05 [PATCH v5 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
  2022-02-08  0:05 ` [PATCH v5 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
@ 2022-02-08  0:05 ` Waiman Long
  2022-02-08  0:05 ` [PATCH v5 3/4] mm/page_owner: Print memcg information Waiman Long
  2022-02-08  0:05 ` [PATCH v5 4/4] mm/page_owner: Record task command name Waiman Long
  3 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2022-02-08  0:05 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long,
	Mike Rapoport

The snprintf() function can return a length greater than the given
input size. That will require a check for buffer overrun after each
invocation of snprintf(). scnprintf(), on the other hand, will never
return a greater length. By using scnprintf() in selected places, we
can avoid some buffer overrun checks except after stack_depot_snprint()
and after the last snprintf().

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
---
 mm/page_owner.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 99e360df9465..28dac73e0542 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -338,19 +338,16 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 	if (!kbuf)
 		return -ENOMEM;
 
-	ret = snprintf(kbuf, count,
+	ret = scnprintf(kbuf, count,
 			"Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n",
 			page_owner->order, page_owner->gfp_mask,
 			&page_owner->gfp_mask, page_owner->pid,
 			page_owner->ts_nsec, page_owner->free_ts_nsec);
 
-	if (ret >= count)
-		goto err;
-
 	/* Print information relevant to grouping pages by mobility */
 	pageblock_mt = get_pageblock_migratetype(page);
 	page_mt  = gfp_migratetype(page_owner->gfp_mask);
-	ret += snprintf(kbuf + ret, count - ret,
+	ret += scnprintf(kbuf + ret, count - ret,
 			"PFN %lu type %s Block %lu type %s Flags %pGp\n",
 			pfn,
 			migratetype_names[page_mt],
@@ -358,19 +355,14 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 			migratetype_names[pageblock_mt],
 			&page->flags);
 
-	if (ret >= count)
-		goto err;
-
 	ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0);
 	if (ret >= count)
 		goto err;
 
 	if (page_owner->last_migrate_reason != -1) {
-		ret += snprintf(kbuf + ret, count - ret,
+		ret += scnprintf(kbuf + ret, count - ret,
 			"Page has been migrated, last migrate reason: %s\n",
 			migrate_reason_names[page_owner->last_migrate_reason]);
-		if (ret >= count)
-			goto err;
 	}
 
 	ret += snprintf(kbuf + ret, count - ret, "\n");
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v5 3/4] mm/page_owner: Print memcg information
  2022-02-08  0:05 [PATCH v5 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
  2022-02-08  0:05 ` [PATCH v5 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
  2022-02-08  0:05 ` [PATCH v5 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
@ 2022-02-08  0:05 ` Waiman Long
  2022-02-08 12:13   ` Michal Hocko
  2022-02-08  0:05 ` [PATCH v5 4/4] mm/page_owner: Record task command name Waiman Long
  3 siblings, 1 reply; 9+ messages in thread
From: Waiman Long @ 2022-02-08  0:05 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long,
	Mike Rapoport

It was found that a number of dying memcgs were not freed because
they were pinned by some charged pages that were present. Even "echo 1 >
/proc/sys/vm/drop_caches" wasn't able to free those pages. These dying
but not freed memcgs tend to increase in number over time with the side
effect that percpu memory consumption as shown in /proc/meminfo also
increases over time.

In order to find out more information about those pages that pin
dying memcgs, the page_owner feature is extended to print memory
cgroup information especially whether the cgroup is dying or not.
RCU read lock is taken when memcg is being accessed to make sure
that it won't be freed.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
---
 mm/page_owner.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 28dac73e0542..d4c311455753 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -10,6 +10,7 @@
 #include <linux/migrate.h>
 #include <linux/stackdepot.h>
 #include <linux/seq_file.h>
+#include <linux/memcontrol.h>
 #include <linux/sched/clock.h>
 
 #include "internal.h"
@@ -325,6 +326,47 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
 	seq_putc(m, '\n');
 }
 
+/*
+ * Looking for memcg information and print it out
+ */
+static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
+					 struct page *page)
+{
+#ifdef CONFIG_MEMCG
+	unsigned long memcg_data;
+	struct mem_cgroup *memcg;
+	bool dying;
+
+	rcu_read_lock();
+	memcg_data = READ_ONCE(page->memcg_data);
+	if (!memcg_data)
+		goto out_unlock;
+
+	if (memcg_data & MEMCG_DATA_OBJCGS)
+		ret += scnprintf(kbuf + ret, count - ret,
+				"Slab cache page\n");
+
+	memcg = page_memcg_check(page);
+	if (!memcg)
+		goto out_unlock;
+
+	dying = (memcg->css.flags & CSS_DYING);
+	ret += scnprintf(kbuf + ret, count - ret,
+			"Charged %sto %smemcg ",
+			PageMemcgKmem(page) ? "(via objcg) " : "",
+			dying ? "dying " : "");
+
+	/* Write cgroup name directly into kbuf */
+	cgroup_name(memcg->css.cgroup, kbuf + ret, count - ret);
+	ret += strlen(kbuf + ret);
+	ret += scnprintf(kbuf + ret, count - ret, "\n");
+out_unlock:
+	rcu_read_unlock();
+#endif /* CONFIG_MEMCG */
+
+	return ret;
+}
+
 static ssize_t
 print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 		struct page *page, struct page_owner *page_owner,
@@ -365,6 +407,8 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 			migrate_reason_names[page_owner->last_migrate_reason]);
 	}
 
+	ret = print_page_owner_memcg(kbuf, count, ret, page);
+
 	ret += snprintf(kbuf + ret, count - ret, "\n");
 	if (ret >= count)
 		goto err;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v5 4/4] mm/page_owner: Record task command name
  2022-02-08  0:05 [PATCH v5 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
                   ` (2 preceding siblings ...)
  2022-02-08  0:05 ` [PATCH v5 3/4] mm/page_owner: Print memcg information Waiman Long
@ 2022-02-08  0:05 ` Waiman Long
  3 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2022-02-08  0:05 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long

The page_owner information currently includes the pid of the calling
task. That is useful as long as the task is still running. Otherwise,
the number is meaningless. To have more information about the allocating
tasks that had exited by the time the page_owner information is
retrieved, we need to store the command name of the task.

Add a new comm field into page_owner structure to store the command name
and display it when the page_owner information is retrieved.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/page_owner.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index d4c311455753..0d2017ebe3d8 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -29,6 +29,7 @@ struct page_owner {
 	depot_stack_handle_t free_handle;
 	u64 ts_nsec;
 	u64 free_ts_nsec;
+	char comm[TASK_COMM_LEN];
 	pid_t pid;
 };
 
@@ -165,6 +166,8 @@ static inline void __set_page_owner_handle(struct page_ext *page_ext,
 		page_owner->last_migrate_reason = -1;
 		page_owner->pid = current->pid;
 		page_owner->ts_nsec = local_clock();
+		strlcpy(page_owner->comm, current->comm,
+			sizeof(page_owner->comm));
 		__set_bit(PAGE_EXT_OWNER, &page_ext->flags);
 		__set_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags);
 
@@ -232,6 +235,7 @@ void __folio_copy_owner(struct folio *newfolio, struct folio *old)
 	new_page_owner->pid = old_page_owner->pid;
 	new_page_owner->ts_nsec = old_page_owner->ts_nsec;
 	new_page_owner->free_ts_nsec = old_page_owner->ts_nsec;
+	strcpy(new_page_owner->comm, old_page_owner->comm);
 
 	/*
 	 * We don't clear the bit on the old folio as it's going to be freed
@@ -381,10 +385,11 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 		return -ENOMEM;
 
 	ret = scnprintf(kbuf, count,
-			"Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n",
+			"Page allocated via order %u, mask %#x(%pGg), pid %d (%s), ts %llu ns, free_ts %llu ns\n",
 			page_owner->order, page_owner->gfp_mask,
 			&page_owner->gfp_mask, page_owner->pid,
-			page_owner->ts_nsec, page_owner->free_ts_nsec);
+			page_owner->comm, page_owner->ts_nsec,
+			page_owner->free_ts_nsec);
 
 	/* Print information relevant to grouping pages by mobility */
 	pageblock_mt = get_pageblock_migratetype(page);
@@ -451,9 +456,10 @@ void __dump_page_owner(const struct page *page)
 	else
 		pr_alert("page_owner tracks the page as freed\n");
 
-	pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, ts %llu, free_ts %llu\n",
+	pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d (%s), ts %llu, free_ts %llu\n",
 		 page_owner->order, migratetype_names[mt], gfp_mask, &gfp_mask,
-		 page_owner->pid, page_owner->ts_nsec, page_owner->free_ts_nsec);
+		 page_owner->pid, page_owner->comm, page_owner->ts_nsec,
+		 page_owner->free_ts_nsec);
 
 	handle = READ_ONCE(page_owner->handle);
 	if (!handle)
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 3/4] mm/page_owner: Print memcg information
  2022-02-08  0:05 ` [PATCH v5 3/4] mm/page_owner: Print memcg information Waiman Long
@ 2022-02-08 12:13   ` Michal Hocko
  2022-02-08 15:15     ` Michal Hocko
  2022-02-08 18:40     ` Waiman Long
  0 siblings, 2 replies; 9+ messages in thread
From: Michal Hocko @ 2022-02-08 12:13 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini,
	Mike Rapoport

On Mon 07-02-22 19:05:31, Waiman Long wrote:
> It was found that a number of dying memcgs were not freed because
> they were pinned by some charged pages that were present. Even "echo 1 >
> /proc/sys/vm/drop_caches" wasn't able to free those pages. These dying
> but not freed memcgs tend to increase in number over time with the side
> effect that percpu memory consumption as shown in /proc/meminfo also
> increases over time.

I still believe that this is very suboptimal way to debug offline memcgs
but memcg information can be useful in other contexts and it doesn't
cost us anything except for an additional output so I am fine with this.
 
> In order to find out more information about those pages that pin
> dying memcgs, the page_owner feature is extended to print memory
> cgroup information especially whether the cgroup is dying or not.
> RCU read lock is taken when memcg is being accessed to make sure
> that it won't be freed.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Roman Gushchin <guro@fb.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>

With few comments/questions below.

> ---
>  mm/page_owner.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 28dac73e0542..d4c311455753 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -10,6 +10,7 @@
>  #include <linux/migrate.h>
>  #include <linux/stackdepot.h>
>  #include <linux/seq_file.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched/clock.h>
>  
>  #include "internal.h"
> @@ -325,6 +326,47 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
>  	seq_putc(m, '\n');
>  }
>  
> +/*
> + * Looking for memcg information and print it out
> + */

I am not sure this is particularly useful comment.

> +static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
> +					 struct page *page)
> +{
> +#ifdef CONFIG_MEMCG
> +	unsigned long memcg_data;
> +	struct mem_cgroup *memcg;
> +	bool dying;
> +
> +	rcu_read_lock();
> +	memcg_data = READ_ONCE(page->memcg_data);
> +	if (!memcg_data)
> +		goto out_unlock;
> +
> +	if (memcg_data & MEMCG_DATA_OBJCGS)
> +		ret += scnprintf(kbuf + ret, count - ret,
> +				"Slab cache page\n");
> +
> +	memcg = page_memcg_check(page);
> +	if (!memcg)
> +		goto out_unlock;
> +
> +	dying = (memcg->css.flags & CSS_DYING);

Is there any specific reason why you haven't used mem_cgroup_online?

> +	ret += scnprintf(kbuf + ret, count - ret,
> +			"Charged %sto %smemcg ",
> +			PageMemcgKmem(page) ? "(via objcg) " : "",
> +			dying ? "dying " : "");
> +
> +	/* Write cgroup name directly into kbuf */
> +	cgroup_name(memcg->css.cgroup, kbuf + ret, count - ret);
> +	ret += strlen(kbuf + ret);

cgroup_name should return the length of the path added to the buffer.

> +	ret += scnprintf(kbuf + ret, count - ret, "\n");

I do not see any overflow prevention here. I believe you really need to
check ret >= count after each scnprintf/cgroup_name.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 3/4] mm/page_owner: Print memcg information
  2022-02-08 12:13   ` Michal Hocko
@ 2022-02-08 15:15     ` Michal Hocko
  2022-02-08 18:40     ` Waiman Long
  1 sibling, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2022-02-08 15:15 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini,
	Mike Rapoport

On Tue 08-02-22 13:13:15, Michal Hocko wrote:
> On Mon 07-02-22 19:05:31, Waiman Long wrote:
[...]
> > +	ret += scnprintf(kbuf + ret, count - ret, "\n");
> 
> I do not see any overflow prevention here. I believe you really need to
> check ret >= count after each scnprintf/cgroup_name.

OK, I have only now realized that scnprintf would return size - 1 on
overflow so explicitly checking for the overlow shouldn't be reallu
necessary.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 3/4] mm/page_owner: Print memcg information
  2022-02-08 12:13   ` Michal Hocko
  2022-02-08 15:15     ` Michal Hocko
@ 2022-02-08 18:40     ` Waiman Long
  2022-02-08 19:11       ` Michal Hocko
  1 sibling, 1 reply; 9+ messages in thread
From: Waiman Long @ 2022-02-08 18:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini,
	Mike Rapoport

On 2/8/22 07:13, Michal Hocko wrote:
> On Mon 07-02-22 19:05:31, Waiman Long wrote:
>> It was found that a number of dying memcgs were not freed because
>> they were pinned by some charged pages that were present. Even "echo 1 >
>> /proc/sys/vm/drop_caches" wasn't able to free those pages. These dying
>> but not freed memcgs tend to increase in number over time with the side
>> effect that percpu memory consumption as shown in /proc/meminfo also
>> increases over time.
> I still believe that this is very suboptimal way to debug offline memcgs
> but memcg information can be useful in other contexts and it doesn't
> cost us anything except for an additional output so I am fine with this.
I am planning to have a follow-up patch to add a new debugfs file for 
just printing page information associated with dying memcgs only. It 
will be based on the existing page_owner code, though. So I need to get 
this patch in first.
>   
>> In order to find out more information about those pages that pin
>> dying memcgs, the page_owner feature is extended to print memory
>> cgroup information especially whether the cgroup is dying or not.
>> RCU read lock is taken when memcg is being accessed to make sure
>> that it won't be freed.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> Acked-by: David Rientjes <rientjes@google.com>
>> Acked-by: Roman Gushchin <guro@fb.com>
>> Acked-by: Mike Rapoport <rppt@linux.ibm.com>
> With few comments/questions below.
>
>> ---
>>   mm/page_owner.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 44 insertions(+)
>>
>> diff --git a/mm/page_owner.c b/mm/page_owner.c
>> index 28dac73e0542..d4c311455753 100644
>> --- a/mm/page_owner.c
>> +++ b/mm/page_owner.c
>> @@ -10,6 +10,7 @@
>>   #include <linux/migrate.h>
>>   #include <linux/stackdepot.h>
>>   #include <linux/seq_file.h>
>> +#include <linux/memcontrol.h>
>>   #include <linux/sched/clock.h>
>>   
>>   #include "internal.h"
>> @@ -325,6 +326,47 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
>>   	seq_putc(m, '\n');
>>   }
>>   
>> +/*
>> + * Looking for memcg information and print it out
>> + */
> I am not sure this is particularly useful comment.
Right, I can remove that.
>
>> +static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
>> +					 struct page *page)
>> +{
>> +#ifdef CONFIG_MEMCG
>> +	unsigned long memcg_data;
>> +	struct mem_cgroup *memcg;
>> +	bool dying;
>> +
>> +	rcu_read_lock();
>> +	memcg_data = READ_ONCE(page->memcg_data);
>> +	if (!memcg_data)
>> +		goto out_unlock;
>> +
>> +	if (memcg_data & MEMCG_DATA_OBJCGS)
>> +		ret += scnprintf(kbuf + ret, count - ret,
>> +				"Slab cache page\n");
>> +
>> +	memcg = page_memcg_check(page);
>> +	if (!memcg)
>> +		goto out_unlock;
>> +
>> +	dying = (memcg->css.flags & CSS_DYING);
> Is there any specific reason why you haven't used mem_cgroup_online?
Not really. However, I think checking for CSS_DYING makes more sense now 
that I using the term "dying".
>
>> +	ret += scnprintf(kbuf + ret, count - ret,
>> +			"Charged %sto %smemcg ",
>> +			PageMemcgKmem(page) ? "(via objcg) " : "",
>> +			dying ? "dying " : "");
>> +
>> +	/* Write cgroup name directly into kbuf */
>> +	cgroup_name(memcg->css.cgroup, kbuf + ret, count - ret);
>> +	ret += strlen(kbuf + ret);
> cgroup_name should return the length of the path added to the buffer.
I realized that after I sent out the patch. I will remove te redundant 
strlen() in a future update.
>
>> +	ret += scnprintf(kbuf + ret, count - ret, "\n");
> I do not see any overflow prevention here. I believe you really need to
> check ret >= count after each scnprintf/cgroup_name.

As you have realized, the beauty of using scnprintf() is to not needing 
an overflow check after each invocation.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 3/4] mm/page_owner: Print memcg information
  2022-02-08 18:40     ` Waiman Long
@ 2022-02-08 19:11       ` Michal Hocko
  0 siblings, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2022-02-08 19:11 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini,
	Mike Rapoport

On Tue 08-02-22 13:40:57, Waiman Long wrote:
> On 2/8/22 07:13, Michal Hocko wrote:
> > On Mon 07-02-22 19:05:31, Waiman Long wrote:
> > > It was found that a number of dying memcgs were not freed because
> > > they were pinned by some charged pages that were present. Even "echo 1 >
> > > /proc/sys/vm/drop_caches" wasn't able to free those pages. These dying
> > > but not freed memcgs tend to increase in number over time with the side
> > > effect that percpu memory consumption as shown in /proc/meminfo also
> > > increases over time.
> > I still believe that this is very suboptimal way to debug offline memcgs
> > but memcg information can be useful in other contexts and it doesn't
> > cost us anything except for an additional output so I am fine with this.
>
> I am planning to have a follow-up patch to add a new debugfs file for just
> printing page information associated with dying memcgs only. It will be
> based on the existing page_owner code, though. So I need to get this patch
> in first.

Sure. I would give a shot the drgn approach as this can be much more
versatile without any additional kernel code.

[...]

> > > +	dying = (memcg->css.flags & CSS_DYING);
> > Is there any specific reason why you haven't used mem_cgroup_online?
> Not really. However, I think checking for CSS_DYING makes more sense now
> that I using the term "dying".

I do not really care much but I though CSS_DYING is a cgroup internal
thing. We have a highlevel API so I thought it would be used
preferably.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-02-08 19:11 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-08  0:05 [PATCH v5 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
2022-02-08  0:05 ` [PATCH v5 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
2022-02-08  0:05 ` [PATCH v5 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
2022-02-08  0:05 ` [PATCH v5 3/4] mm/page_owner: Print memcg information Waiman Long
2022-02-08 12:13   ` Michal Hocko
2022-02-08 15:15     ` Michal Hocko
2022-02-08 18:40     ` Waiman Long
2022-02-08 19:11       ` Michal Hocko
2022-02-08  0:05 ` [PATCH v5 4/4] mm/page_owner: Record task command name Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).