[PATCH v3 0/4] mm/page_owner: Extend page_owner to show memcg information

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/4] mm/page_owner: Extend page_owner to show memcg information
@ 2022-01-31 19:23 Waiman Long
  2022-01-31 19:23 ` [PATCH v3 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 39+ messages in thread
From: Waiman Long @ 2022-01-31 19:23 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long

 v3:
  - Add unlikely() to patch 1 and clarify that -1 will not be returned.
  - Use a helper function to print out memcg information in patch 3.
  - Add a new patch 4 to store task command name in page_owner
    structure.

 v2:
  - Remove the SNPRINTF() macro as suggested by Ira and use scnprintf()
    instead to remove some buffer overrun checks.
  - Add a patch to optimize vscnprintf with a size parameter of 0.

While debugging the constant increase in percpu memory consumption on
a system that spawned large number of containers, it was found that a
lot of offlined mem_cgroup structures remained in place without being
freed. Further investigation indicated that those mem_cgroup structures
were pinned by some pages.

In order to find out what those pages are, the existing page_owner
debugging tool is extended to show memory cgroup information and whether
those memcgs are offlined or not. With the enhanced page_owner tool,
the following is a typical page that pinned the mem_cgroup structure
in my test case:

Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE), pid 162970 (podman), ts 1097761405537 ns, free_ts 1097760838089 ns
PFN 1925700 type Movable Block 3761 type Movable Flags 0x17ffffc00c001c(uptodate|dirty|lru|reclaim|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
 prep_new_page+0xac/0xe0
 get_page_from_freelist+0x1327/0x14d0
 __alloc_pages+0x191/0x340
 alloc_pages_vma+0x84/0x250
 shmem_alloc_page+0x3f/0x90
 shmem_alloc_and_acct_page+0x76/0x1c0
 shmem_getpage_gfp+0x281/0x940
 shmem_write_begin+0x36/0xe0
 generic_perform_write+0xed/0x1d0
 __generic_file_write_iter+0xdc/0x1b0
 generic_file_write_iter+0x5d/0xb0
 new_sync_write+0x11f/0x1b0
 vfs_write+0x1ba/0x2a0
 ksys_write+0x59/0xd0
 do_syscall_64+0x37/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
Charged to offlined memcg libpod-conmon-15e4f9c758422306b73b2dd99f9d50a5ea53cbb16b4a13a2c2308a4253cc0ec8.

So the page was not freed because it was part of a shmem segment. That
is useful information that can help users to diagnose similar problems.

Waiman Long (4):
  lib/vsprintf: Avoid redundant work with 0 size
  mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
  mm/page_owner: Print memcg information
  mm/page_owner: Record task command name

 lib/vsprintf.c  |  8 +++---
 mm/page_owner.c | 69 ++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 59 insertions(+), 18 deletions(-)

-- 
2.27.0

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v3 1/4] lib/vsprintf: Avoid redundant work with 0 size
  2022-01-31 19:23 [PATCH v3 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
@ 2022-01-31 19:23 ` Waiman Long
  2022-01-31 20:42   ` Mike Rapoport
  2022-01-31 19:23 ` [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2022-01-31 19:23 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long

For *scnprintf(), vsnprintf() is always called even if the input size is
0. That is a waste of time, so just return 0 in this case.

Note that vsnprintf() will never return -1 to indicate an error. So
skipping the call to vsnprintf() when size is 0 will have no functional
impact at all.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Roman Gushchin <guro@fb.com>
---
 lib/vsprintf.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 3b8129dd374c..d419154b47bb 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -2895,13 +2895,15 @@ int vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
 {
 	int i;
 
+	if (unlikely(!size))
+		return 0;
+
 	i = vsnprintf(buf, size, fmt, args);
 
 	if (likely(i < size))
 		return i;
-	if (size != 0)
-		return size - 1;
-	return 0;
+
+	return size - 1;
 }
 EXPORT_SYMBOL(vscnprintf);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
  2022-01-31 19:23 [PATCH v3 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
  2022-01-31 19:23 ` [PATCH v3 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
@ 2022-01-31 19:23 ` Waiman Long
  2022-01-31 20:38   ` Roman Gushchin
  2022-01-31 20:43   ` Mike Rapoport
  2022-01-31 19:23 ` [PATCH v3 3/4] mm/page_owner: Print memcg information Waiman Long
  2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long
  3 siblings, 2 replies; 39+ messages in thread
From: Waiman Long @ 2022-01-31 19:23 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long

The snprintf() function can return a length greater than the given
input size. That will require a check for buffer overrun after each
invocation of snprintf(). scnprintf(), on the other hand, will never
return a greater length. By using scnprintf() in selected places, we
can avoid some buffer overrun checks except after stack_depot_snprint()
and after the last snprintf().

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 mm/page_owner.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 99e360df9465..28dac73e0542 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -338,19 +338,16 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 	if (!kbuf)
 		return -ENOMEM;
 
-	ret = snprintf(kbuf, count,
+	ret = scnprintf(kbuf, count,
 			"Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n",
 			page_owner->order, page_owner->gfp_mask,
 			&page_owner->gfp_mask, page_owner->pid,
 			page_owner->ts_nsec, page_owner->free_ts_nsec);
 
-	if (ret >= count)
-		goto err;
-
 	/* Print information relevant to grouping pages by mobility */
 	pageblock_mt = get_pageblock_migratetype(page);
 	page_mt  = gfp_migratetype(page_owner->gfp_mask);
-	ret += snprintf(kbuf + ret, count - ret,
+	ret += scnprintf(kbuf + ret, count - ret,
 			"PFN %lu type %s Block %lu type %s Flags %pGp\n",
 			pfn,
 			migratetype_names[page_mt],
@@ -358,19 +355,14 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 			migratetype_names[pageblock_mt],
 			&page->flags);
 
-	if (ret >= count)
-		goto err;
-
 	ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0);
 	if (ret >= count)
 		goto err;
 
 	if (page_owner->last_migrate_reason != -1) {
-		ret += snprintf(kbuf + ret, count - ret,
+		ret += scnprintf(kbuf + ret, count - ret,
 			"Page has been migrated, last migrate reason: %s\n",
 			migrate_reason_names[page_owner->last_migrate_reason]);
-		if (ret >= count)
-			goto err;
 	}
 
 	ret += snprintf(kbuf + ret, count - ret, "\n");
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 3/4] mm/page_owner: Print memcg information
  2022-01-31 19:23 [PATCH v3 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
  2022-01-31 19:23 ` [PATCH v3 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
  2022-01-31 19:23 ` [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
@ 2022-01-31 19:23 ` Waiman Long
  2022-01-31 20:51   ` Mike Rapoport
                     ` (2 more replies)
  2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long
  3 siblings, 3 replies; 39+ messages in thread
From: Waiman Long @ 2022-01-31 19:23 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long

It was found that a number of offlined memcgs were not freed because
they were pinned by some charged pages that were present. Even "echo
1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
offlined but not freed memcgs tend to increase in number over time with
the side effect that percpu memory consumption as shown in /proc/meminfo
also increases over time.

In order to find out more information about those pages that pin
offlined memcgs, the page_owner feature is extended to print memory
cgroup information especially whether the cgroup is offlined or not.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
---
 mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 28dac73e0542..a471c74c7fe0 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -10,6 +10,7 @@
 #include <linux/migrate.h>
 #include <linux/stackdepot.h>
 #include <linux/seq_file.h>
+#include <linux/memcontrol.h>
 #include <linux/sched/clock.h>
 
 #include "internal.h"
@@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
 	seq_putc(m, '\n');
 }
 
+#ifdef CONFIG_MEMCG
+/*
+ * Looking for memcg information and print it out
+ */
+static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret,
+					  struct page *page)
+{
+	unsigned long memcg_data = READ_ONCE(page->memcg_data);
+	struct mem_cgroup *memcg;
+	bool onlined;
+	char name[80];
+
+	if (!memcg_data)
+		return;
+
+	if (memcg_data & MEMCG_DATA_OBJCGS)
+		*pret += scnprintf(kbuf + *pret, count - *pret,
+				"Slab cache page\n");
+
+	memcg = page_memcg_check(page);
+	if (!memcg)
+		return;
+
+	onlined = (memcg->css.flags & CSS_ONLINE);
+	cgroup_name(memcg->css.cgroup, name, sizeof(name));
+	*pret += scnprintf(kbuf + *pret, count - *pret,
+			"Charged %sto %smemcg %s\n",
+			PageMemcgKmem(page) ? "(via objcg) " : "",
+			onlined ? "" : "offlined ",
+			name);
+}
+#else /* CONFIG_MEMCG */
+static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret,
+					  struct page *page) { }
+#endif /* CONFIG_MEMCG */
+
 static ssize_t
 print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 		struct page *page, struct page_owner *page_owner,
@@ -365,6 +402,8 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 			migrate_reason_names[page_owner->last_migrate_reason]);
 	}
 
+	print_page_owner_memcg(kbuf, count, &ret, page);
+
 	ret += snprintf(kbuf + ret, count - ret, "\n");
 	if (ret >= count)
 		goto err;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 4/4] mm/page_owner: Record task command name
  2022-01-31 19:23 [PATCH v3 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
                   ` (2 preceding siblings ...)
  2022-01-31 19:23 ` [PATCH v3 3/4] mm/page_owner: Print memcg information Waiman Long
@ 2022-01-31 19:23 ` Waiman Long
  2022-01-31 20:54   ` Roman Gushchin
                     ` (6 more replies)
  3 siblings, 7 replies; 39+ messages in thread
From: Waiman Long @ 2022-01-31 19:23 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long

The page_owner information currently includes the pid of the calling
task. That is useful as long as the task is still running. Otherwise,
the number is meaningless. To have more information about the allocating
tasks that had exited by the time the page_owner information is
retrieved, we need to store the command name of the task.

Add a new comm field into page_owner structure to store the command name
and display it when the page_owner information is retrieved. Only the
first 15 characters of the command name will be copied, but that should
be enough in most cases. Even for those commands with longer names,
it shouldn't be hard to guess what they are.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/page_owner.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index a471c74c7fe0..8b2b381fd986 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -20,6 +20,7 @@
  * to use off stack temporal storage
  */
 #define PAGE_OWNER_STACK_DEPTH (16)
+#define PAGE_OWNER_COMM_LEN	16
 
 struct page_owner {
 	unsigned short order;
@@ -29,6 +30,7 @@ struct page_owner {
 	depot_stack_handle_t free_handle;
 	u64 ts_nsec;
 	u64 free_ts_nsec;
+	char comm[PAGE_OWNER_COMM_LEN];
 	pid_t pid;
 };
 
@@ -146,6 +148,7 @@ void __reset_page_owner(struct page *page, unsigned short order)
 		page_owner = get_page_owner(page_ext);
 		page_owner->free_handle = handle;
 		page_owner->free_ts_nsec = free_ts_nsec;
+		page_owner->comm[0] = '\0';
 		page_ext = page_ext_next(page_ext);
 	}
 }
@@ -165,6 +168,8 @@ static inline void __set_page_owner_handle(struct page_ext *page_ext,
 		page_owner->last_migrate_reason = -1;
 		page_owner->pid = current->pid;
 		page_owner->ts_nsec = local_clock();
+		strlcpy(page_owner->comm, current->comm,
+			sizeof(page_owner->comm));
 		__set_bit(PAGE_EXT_OWNER, &page_ext->flags);
 		__set_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags);
 
@@ -232,6 +237,7 @@ void __folio_copy_owner(struct folio *newfolio, struct folio *old)
 	new_page_owner->pid = old_page_owner->pid;
 	new_page_owner->ts_nsec = old_page_owner->ts_nsec;
 	new_page_owner->free_ts_nsec = old_page_owner->ts_nsec;
+	strcpy(new_page_owner->comm, old_page_owner->comm);
 
 	/*
 	 * We don't clear the bit on the old folio as it's going to be freed
@@ -376,10 +382,11 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 		return -ENOMEM;
 
 	ret = scnprintf(kbuf, count,
-			"Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n",
+			"Page allocated via order %u, mask %#x(%pGg), pid %d (%s), ts %llu ns, free_ts %llu ns\n",
 			page_owner->order, page_owner->gfp_mask,
 			&page_owner->gfp_mask, page_owner->pid,
-			page_owner->ts_nsec, page_owner->free_ts_nsec);
+			page_owner->comm, page_owner->ts_nsec,
+			page_owner->free_ts_nsec);
 
 	/* Print information relevant to grouping pages by mobility */
 	pageblock_mt = get_pageblock_migratetype(page);
@@ -446,9 +453,10 @@ void __dump_page_owner(const struct page *page)
 	else
 		pr_alert("page_owner tracks the page as freed\n");
 
-	pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, ts %llu, free_ts %llu\n",
+	pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d (%s), ts %llu, free_ts %llu\n",
 		 page_owner->order, migratetype_names[mt], gfp_mask, &gfp_mask,
-		 page_owner->pid, page_owner->ts_nsec, page_owner->free_ts_nsec);
+		 page_owner->pid, page_owner->comm, page_owner->ts_nsec,
+		 page_owner->free_ts_nsec);
 
 	handle = READ_ONCE(page_owner->handle);
 	if (!handle)
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
  2022-01-31 19:23 ` [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
@ 2022-01-31 20:38   ` Roman Gushchin
  2022-01-31 20:43   ` Mike Rapoport
  1 sibling, 0 replies; 39+ messages in thread
From: Roman Gushchin @ 2022-01-31 20:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Rafael Aquini

On Mon, Jan 31, 2022 at 02:23:06PM -0500, Waiman Long wrote:
> The snprintf() function can return a length greater than the given
> input size. That will require a check for buffer overrun after each
> invocation of snprintf(). scnprintf(), on the other hand, will never
> return a greater length. By using scnprintf() in selected places, we
> can avoid some buffer overrun checks except after stack_depot_snprint()
> and after the last snprintf().
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>

Acked-by: Roman Gushchin <guro@fb.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 1/4] lib/vsprintf: Avoid redundant work with 0 size
  2022-01-31 19:23 ` [PATCH v3 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
@ 2022-01-31 20:42   ` Mike Rapoport
  0 siblings, 0 replies; 39+ messages in thread
From: Mike Rapoport @ 2022-01-31 20:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	David Rientjes, Roman Gushchin, Rafael Aquini

On Mon, Jan 31, 2022 at 02:23:05PM -0500, Waiman Long wrote:
> For *scnprintf(), vsnprintf() is always called even if the input size is
> 0. That is a waste of time, so just return 0 in this case.
> 
> Note that vsnprintf() will never return -1 to indicate an error. So
> skipping the call to vsnprintf() when size is 0 will have no functional
> impact at all.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> Acked-by: Roman Gushchin <guro@fb.com>

Acked-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  lib/vsprintf.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index 3b8129dd374c..d419154b47bb 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -2895,13 +2895,15 @@ int vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
>  {
>  	int i;
>  
> +	if (unlikely(!size))
> +		return 0;
> +
>  	i = vsnprintf(buf, size, fmt, args);
>  
>  	if (likely(i < size))
>  		return i;
> -	if (size != 0)
> -		return size - 1;
> -	return 0;
> +
> +	return size - 1;
>  }
>  EXPORT_SYMBOL(vscnprintf);
>  
> -- 
> 2.27.0
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
  2022-01-31 19:23 ` [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
  2022-01-31 20:38   ` Roman Gushchin
@ 2022-01-31 20:43   ` Mike Rapoport
  1 sibling, 0 replies; 39+ messages in thread
From: Mike Rapoport @ 2022-01-31 20:43 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	David Rientjes, Roman Gushchin, Rafael Aquini

On Mon, Jan 31, 2022 at 02:23:06PM -0500, Waiman Long wrote:
> The snprintf() function can return a length greater than the given
> input size. That will require a check for buffer overrun after each
> invocation of snprintf(). scnprintf(), on the other hand, will never
> return a greater length. By using scnprintf() in selected places, we
> can avoid some buffer overrun checks except after stack_depot_snprint()
> and after the last snprintf().
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>

Acked-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
>  mm/page_owner.c | 14 +++-----------
>  1 file changed, 3 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 99e360df9465..28dac73e0542 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -338,19 +338,16 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>  	if (!kbuf)
>  		return -ENOMEM;
>  
> -	ret = snprintf(kbuf, count,
> +	ret = scnprintf(kbuf, count,
>  			"Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n",
>  			page_owner->order, page_owner->gfp_mask,
>  			&page_owner->gfp_mask, page_owner->pid,
>  			page_owner->ts_nsec, page_owner->free_ts_nsec);
>  
> -	if (ret >= count)
> -		goto err;
> -
>  	/* Print information relevant to grouping pages by mobility */
>  	pageblock_mt = get_pageblock_migratetype(page);
>  	page_mt  = gfp_migratetype(page_owner->gfp_mask);
> -	ret += snprintf(kbuf + ret, count - ret,
> +	ret += scnprintf(kbuf + ret, count - ret,
>  			"PFN %lu type %s Block %lu type %s Flags %pGp\n",
>  			pfn,
>  			migratetype_names[page_mt],
> @@ -358,19 +355,14 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>  			migratetype_names[pageblock_mt],
>  			&page->flags);
>  
> -	if (ret >= count)
> -		goto err;
> -
>  	ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0);
>  	if (ret >= count)
>  		goto err;
>  
>  	if (page_owner->last_migrate_reason != -1) {
> -		ret += snprintf(kbuf + ret, count - ret,
> +		ret += scnprintf(kbuf + ret, count - ret,
>  			"Page has been migrated, last migrate reason: %s\n",
>  			migrate_reason_names[page_owner->last_migrate_reason]);
> -		if (ret >= count)
> -			goto err;
>  	}
>  
>  	ret += snprintf(kbuf + ret, count - ret, "\n");
> -- 
> 2.27.0
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information
  2022-01-31 19:23 ` [PATCH v3 3/4] mm/page_owner: Print memcg information Waiman Long
@ 2022-01-31 20:51   ` Mike Rapoport
  2022-01-31 21:43     ` Waiman Long
  2022-01-31 20:51   ` Roman Gushchin
  2022-02-01 10:54   ` Michal Hocko
  2 siblings, 1 reply; 39+ messages in thread
From: Mike Rapoport @ 2022-01-31 20:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	David Rientjes, Roman Gushchin, Rafael Aquini

On Mon, Jan 31, 2022 at 02:23:07PM -0500, Waiman Long wrote:
> It was found that a number of offlined memcgs were not freed because
> they were pinned by some charged pages that were present. Even "echo
> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
> offlined but not freed memcgs tend to increase in number over time with
> the side effect that percpu memory consumption as shown in /proc/meminfo
> also increases over time.
> 
> In order to find out more information about those pages that pin
> offlined memcgs, the page_owner feature is extended to print memory
> cgroup information especially whether the cgroup is offlined or not.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> ---
>  mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 28dac73e0542..a471c74c7fe0 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -10,6 +10,7 @@
>  #include <linux/migrate.h>
>  #include <linux/stackdepot.h>
>  #include <linux/seq_file.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched/clock.h>
>  
>  #include "internal.h"
> @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
>  	seq_putc(m, '\n');
>  }
>  
> +#ifdef CONFIG_MEMCG
> +/*
> + * Looking for memcg information and print it out
> + */
> +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret,
> +					  struct page *page)
> +{
> +	unsigned long memcg_data = READ_ONCE(page->memcg_data);
> +	struct mem_cgroup *memcg;
> +	bool onlined;
> +	char name[80];
> +
> +	if (!memcg_data)
> +		return;
> +
> +	if (memcg_data & MEMCG_DATA_OBJCGS)
> +		*pret += scnprintf(kbuf + *pret, count - *pret,
> +				"Slab cache page\n");

Don't we need to check for overflow here?

> +
> +	memcg = page_memcg_check(page);
> +	if (!memcg)
> +		return;
> +
> +	onlined = (memcg->css.flags & CSS_ONLINE);
> +	cgroup_name(memcg->css.cgroup, name, sizeof(name));
> +	*pret += scnprintf(kbuf + *pret, count - *pret,
> +			"Charged %sto %smemcg %s\n",
> +			PageMemcgKmem(page) ? "(via objcg) " : "",
> +			onlined ? "" : "offlined ",
> +			name);

Ditto

> +}
> +#else /* CONFIG_MEMCG */
> +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret,
> +					  struct page *page) { }

I think #ifdef inside the print_page_owner_memcg() functions will be
simpler and clearer.

> +#endif /* CONFIG_MEMCG */
> +
>  static ssize_t
>  print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>  		struct page *page, struct page_owner *page_owner,
> @@ -365,6 +402,8 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>  			migrate_reason_names[page_owner->last_migrate_reason]);
>  	}
>  
> +	print_page_owner_memcg(kbuf, count, &ret, page);
> +

ret can go over count here.
Why not make print_page_owner_memcg() an int so that the call will be
consistent with other calls in print_page_owner():

	ret += print_page_owner_memcg(kbuf, count, page);
	if (ret >= count)
		goto err;

>  	ret += snprintf(kbuf + ret, count - ret, "\n");
>  	if (ret >= count)
>  		goto err;
> -- 
> 2.27.0
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information
  2022-01-31 19:23 ` [PATCH v3 3/4] mm/page_owner: Print memcg information Waiman Long
  2022-01-31 20:51   ` Mike Rapoport
@ 2022-01-31 20:51   ` Roman Gushchin
  2022-02-01 10:54   ` Michal Hocko
  2 siblings, 0 replies; 39+ messages in thread
From: Roman Gushchin @ 2022-01-31 20:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Rafael Aquini

On Mon, Jan 31, 2022 at 02:23:07PM -0500, Waiman Long wrote:
> It was found that a number of offlined memcgs were not freed because
> they were pinned by some charged pages that were present. Even "echo
> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
> offlined but not freed memcgs tend to increase in number over time with
> the side effect that percpu memory consumption as shown in /proc/meminfo
> also increases over time.
> 
> In order to find out more information about those pages that pin
> offlined memcgs, the page_owner feature is extended to print memory
> cgroup information especially whether the cgroup is offlined or not.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>

Acked-by: Roman Gushchin <guro@fb.com>

Thanks!

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 4/4] mm/page_owner: Record task command name
  2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long
@ 2022-01-31 20:54   ` Roman Gushchin
  2022-01-31 21:46     ` Waiman Long
  2022-01-31 22:03   ` [PATCH v4 " Waiman Long
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Roman Gushchin @ 2022-01-31 20:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Rafael Aquini

On Mon, Jan 31, 2022 at 02:23:08PM -0500, Waiman Long wrote:
> The page_owner information currently includes the pid of the calling
> task. That is useful as long as the task is still running. Otherwise,
> the number is meaningless. To have more information about the allocating
> tasks that had exited by the time the page_owner information is
> retrieved, we need to store the command name of the task.
> 
> Add a new comm field into page_owner structure to store the command name
> and display it when the page_owner information is retrieved. Only the
> first 15 characters of the command name will be copied, but that should
> be enough in most cases. Even for those commands with longer names,
> it shouldn't be hard to guess what they are.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/page_owner.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index a471c74c7fe0..8b2b381fd986 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -20,6 +20,7 @@
>   * to use off stack temporal storage
>   */
>  #define PAGE_OWNER_STACK_DEPTH (16)
> +#define PAGE_OWNER_COMM_LEN	16

Not sure I understand why not simply use TASK_COMM_LEN ?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information
  2022-01-31 20:51   ` Mike Rapoport
@ 2022-01-31 21:43     ` Waiman Long
  2022-02-01  6:23       ` Mike Rapoport
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2022-01-31 21:43 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	David Rientjes, Roman Gushchin, Rafael Aquini


On 1/31/22 15:51, Mike Rapoport wrote:
> On Mon, Jan 31, 2022 at 02:23:07PM -0500, Waiman Long wrote:
>> It was found that a number of offlined memcgs were not freed because
>> they were pinned by some charged pages that were present. Even "echo
>> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
>> offlined but not freed memcgs tend to increase in number over time with
>> the side effect that percpu memory consumption as shown in /proc/meminfo
>> also increases over time.
>>
>> In order to find out more information about those pages that pin
>> offlined memcgs, the page_owner feature is extended to print memory
>> cgroup information especially whether the cgroup is offlined or not.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> Acked-by: David Rientjes <rientjes@google.com>
>> ---
>>   mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 39 insertions(+)
>>
>> diff --git a/mm/page_owner.c b/mm/page_owner.c
>> index 28dac73e0542..a471c74c7fe0 100644
>> --- a/mm/page_owner.c
>> +++ b/mm/page_owner.c
>> @@ -10,6 +10,7 @@
>>   #include <linux/migrate.h>
>>   #include <linux/stackdepot.h>
>>   #include <linux/seq_file.h>
>> +#include <linux/memcontrol.h>
>>   #include <linux/sched/clock.h>
>>   
>>   #include "internal.h"
>> @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
>>   	seq_putc(m, '\n');
>>   }
>>   
>> +#ifdef CONFIG_MEMCG
>> +/*
>> + * Looking for memcg information and print it out
>> + */
>> +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret,
>> +					  struct page *page)
>> +{
>> +	unsigned long memcg_data = READ_ONCE(page->memcg_data);
>> +	struct mem_cgroup *memcg;
>> +	bool onlined;
>> +	char name[80];
>> +
>> +	if (!memcg_data)
>> +		return;
>> +
>> +	if (memcg_data & MEMCG_DATA_OBJCGS)
>> +		*pret += scnprintf(kbuf + *pret, count - *pret,
>> +				"Slab cache page\n");
> Don't we need to check for overflow here?

See my previous patch 2 and the reason I used scnprintf() is that it 
never return a length that is >= the given size. So overflow won't 
happen. The final snprintf() in print_page_owner() will detect buffer 
overflow.


>
>> +
>> +	memcg = page_memcg_check(page);
>> +	if (!memcg)
>> +		return;
>> +
>> +	onlined = (memcg->css.flags & CSS_ONLINE);
>> +	cgroup_name(memcg->css.cgroup, name, sizeof(name));
>> +	*pret += scnprintf(kbuf + *pret, count - *pret,
>> +			"Charged %sto %smemcg %s\n",
>> +			PageMemcgKmem(page) ? "(via objcg) " : "",
>> +			onlined ? "" : "offlined ",
>> +			name);
> Ditto
>
>> +}
>> +#else /* CONFIG_MEMCG */
>> +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret,
>> +					  struct page *page) { }
> I think #ifdef inside the print_page_owner_memcg() functions will be
> simpler and clearer.
Yes, I see both styles used in kernel code though this style is probably 
more common. I will keep this unless there is a good reason to do otherwise.
>
>> +#endif /* CONFIG_MEMCG */
>> +
>>   static ssize_t
>>   print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>>   		struct page *page, struct page_owner *page_owner,
>> @@ -365,6 +402,8 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>>   			migrate_reason_names[page_owner->last_migrate_reason]);
>>   	}
>>   
>> +	print_page_owner_memcg(kbuf, count, &ret, page);
>> +
> ret can go over count here.
> Why not make print_page_owner_memcg() an int so that the call will be
> consistent with other calls in print_page_owner():
>
> 	ret += print_page_owner_memcg(kbuf, count, page);
> 	if (ret >= count)
> 		goto err;

See my comments above.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 4/4] mm/page_owner: Record task command name
  2022-01-31 20:54   ` Roman Gushchin
@ 2022-01-31 21:46     ` Waiman Long
  0 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2022-01-31 21:46 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Rafael Aquini

On 1/31/22 15:54, Roman Gushchin wrote:
> On Mon, Jan 31, 2022 at 02:23:08PM -0500, Waiman Long wrote:
>> The page_owner information currently includes the pid of the calling
>> task. That is useful as long as the task is still running. Otherwise,
>> the number is meaningless. To have more information about the allocating
>> tasks that had exited by the time the page_owner information is
>> retrieved, we need to store the command name of the task.
>>
>> Add a new comm field into page_owner structure to store the command name
>> and display it when the page_owner information is retrieved. Only the
>> first 15 characters of the command name will be copied, but that should
>> be enough in most cases. Even for those commands with longer names,
>> it shouldn't be hard to guess what they are.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   mm/page_owner.c | 16 ++++++++++++----
>>   1 file changed, 12 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/page_owner.c b/mm/page_owner.c
>> index a471c74c7fe0..8b2b381fd986 100644
>> --- a/mm/page_owner.c
>> +++ b/mm/page_owner.c
>> @@ -20,6 +20,7 @@
>>    * to use off stack temporal storage
>>    */
>>   #define PAGE_OWNER_STACK_DEPTH (16)
>> +#define PAGE_OWNER_COMM_LEN	16
> Not sure I understand why not simply use TASK_COMM_LEN ?

Yes, you are right. I thought TASK_COMM_LEN is larger than 16 without 
actually checking it. Will fix that.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v4 4/4] mm/page_owner: Record task command name
  2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long
  2022-01-31 20:54   ` Roman Gushchin
@ 2022-01-31 22:03   ` Waiman Long
  2022-02-01 15:28     ` Michal Hocko
  2022-02-02 20:30   ` [PATCH v4 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2022-01-31 22:03 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long

The page_owner information currently includes the pid of the calling
task. That is useful as long as the task is still running. Otherwise,
the number is meaningless. To have more information about the allocating
tasks that had exited by the time the page_owner information is
retrieved, we need to store the command name of the task.

Add a new comm field into page_owner structure to store the command name
and display it when the page_owner information is retrieved.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/page_owner.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index a471c74c7fe0..485542155483 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -29,6 +29,7 @@ struct page_owner {
 	depot_stack_handle_t free_handle;
 	u64 ts_nsec;
 	u64 free_ts_nsec;
+	char comm[TASK_COMM_LEN];
 	pid_t pid;
 };
 
@@ -146,6 +147,7 @@ void __reset_page_owner(struct page *page, unsigned short order)
 		page_owner = get_page_owner(page_ext);
 		page_owner->free_handle = handle;
 		page_owner->free_ts_nsec = free_ts_nsec;
+		page_owner->comm[0] = '\0';
 		page_ext = page_ext_next(page_ext);
 	}
 }
@@ -165,6 +167,8 @@ static inline void __set_page_owner_handle(struct page_ext *page_ext,
 		page_owner->last_migrate_reason = -1;
 		page_owner->pid = current->pid;
 		page_owner->ts_nsec = local_clock();
+		strlcpy(page_owner->comm, current->comm,
+			sizeof(page_owner->comm));
 		__set_bit(PAGE_EXT_OWNER, &page_ext->flags);
 		__set_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags);
 
@@ -232,6 +236,7 @@ void __folio_copy_owner(struct folio *newfolio, struct folio *old)
 	new_page_owner->pid = old_page_owner->pid;
 	new_page_owner->ts_nsec = old_page_owner->ts_nsec;
 	new_page_owner->free_ts_nsec = old_page_owner->ts_nsec;
+	strcpy(new_page_owner->comm, old_page_owner->comm);
 
 	/*
 	 * We don't clear the bit on the old folio as it's going to be freed
@@ -376,10 +381,11 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 		return -ENOMEM;
 
 	ret = scnprintf(kbuf, count,
-			"Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n",
+			"Page allocated via order %u, mask %#x(%pGg), pid %d (%s), ts %llu ns, free_ts %llu ns\n",
 			page_owner->order, page_owner->gfp_mask,
 			&page_owner->gfp_mask, page_owner->pid,
-			page_owner->ts_nsec, page_owner->free_ts_nsec);
+			page_owner->comm, page_owner->ts_nsec,
+			page_owner->free_ts_nsec);
 
 	/* Print information relevant to grouping pages by mobility */
 	pageblock_mt = get_pageblock_migratetype(page);
@@ -446,9 +452,10 @@ void __dump_page_owner(const struct page *page)
 	else
 		pr_alert("page_owner tracks the page as freed\n");
 
-	pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, ts %llu, free_ts %llu\n",
+	pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d (%s), ts %llu, free_ts %llu\n",
 		 page_owner->order, migratetype_names[mt], gfp_mask, &gfp_mask,
-		 page_owner->pid, page_owner->ts_nsec, page_owner->free_ts_nsec);
+		 page_owner->pid, page_owner->comm, page_owner->ts_nsec,
+		 page_owner->free_ts_nsec);
 
 	handle = READ_ONCE(page_owner->handle);
 	if (!handle)
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information
  2022-01-31 21:43     ` Waiman Long
@ 2022-02-01  6:23       ` Mike Rapoport
  0 siblings, 0 replies; 39+ messages in thread
From: Mike Rapoport @ 2022-02-01  6:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	David Rientjes, Roman Gushchin, Rafael Aquini

On Mon, Jan 31, 2022 at 04:43:32PM -0500, Waiman Long wrote:
> 
> On 1/31/22 15:51, Mike Rapoport wrote:
> > On Mon, Jan 31, 2022 at 02:23:07PM -0500, Waiman Long wrote:
> > > It was found that a number of offlined memcgs were not freed because
> > > they were pinned by some charged pages that were present. Even "echo
> > > 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
> > > offlined but not freed memcgs tend to increase in number over time with
> > > the side effect that percpu memory consumption as shown in /proc/meminfo
> > > also increases over time.
> > > 
> > > In order to find out more information about those pages that pin
> > > offlined memcgs, the page_owner feature is extended to print memory
> > > cgroup information especially whether the cgroup is offlined or not.
> > > 
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > Acked-by: David Rientjes <rientjes@google.com>
> > > ---
> > >   mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 39 insertions(+)
> > > 
> > > diff --git a/mm/page_owner.c b/mm/page_owner.c
> > > index 28dac73e0542..a471c74c7fe0 100644
> > > --- a/mm/page_owner.c
> > > +++ b/mm/page_owner.c
> > > @@ -10,6 +10,7 @@
> > >   #include <linux/migrate.h>
> > >   #include <linux/stackdepot.h>
> > >   #include <linux/seq_file.h>
> > > +#include <linux/memcontrol.h>
> > >   #include <linux/sched/clock.h>
> > >   #include "internal.h"
> > > @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
> > >   	seq_putc(m, '\n');
> > >   }
> > > +#ifdef CONFIG_MEMCG
> > > +/*
> > > + * Looking for memcg information and print it out
> > > + */
> > > +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret,
> > > +					  struct page *page)
> > > +{
> > > +	unsigned long memcg_data = READ_ONCE(page->memcg_data);
> > > +	struct mem_cgroup *memcg;
> > > +	bool onlined;
> > > +	char name[80];
> > > +
> > > +	if (!memcg_data)
> > > +		return;
> > > +
> > > +	if (memcg_data & MEMCG_DATA_OBJCGS)
> > > +		*pret += scnprintf(kbuf + *pret, count - *pret,
> > > +				"Slab cache page\n");
> > Don't we need to check for overflow here?
> 
> See my previous patch 2 and the reason I used scnprintf() is that it never
> return a length that is >= the given size. So overflow won't happen. The
> final snprintf() in print_page_owner() will detect buffer overflow.
 
Right, I've missed that 
 
> > > +
> > > +	memcg = page_memcg_check(page);
> > > +	if (!memcg)
> > > +		return;
> > > +
> > > +	onlined = (memcg->css.flags & CSS_ONLINE);
> > > +	cgroup_name(memcg->css.cgroup, name, sizeof(name));
> > > +	*pret += scnprintf(kbuf + *pret, count - *pret,
> > > +			"Charged %sto %smemcg %s\n",
> > > +			PageMemcgKmem(page) ? "(via objcg) " : "",
> > > +			onlined ? "" : "offlined ",
> > > +			name);
> > Ditto
> > 
> > > +}
> > > +#else /* CONFIG_MEMCG */
> > > +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret,
> > > +					  struct page *page) { }

> > I think #ifdef inside the print_page_owner_memcg() functions will be
> > simpler and clearer.
>
> Yes, I see both styles used in kernel code though this style is probably
> more common. I will keep this unless there is a good reason to do otherwise.

Having #ifdef inside the function is safer wrt future updates. It's often
happens that non-default arm of #ifdef is forgotten. Besides, it's several
lines less.
 
> > > +#endif /* CONFIG_MEMCG */
> > > +
> > >   static ssize_t
> > >   print_page_owner(char __user *buf, size_t count, unsigned long pfn,
> > >   		struct page *page, struct page_owner *page_owner,
> > > @@ -365,6 +402,8 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
> > >   			migrate_reason_names[page_owner->last_migrate_reason]);
> > >   	}
> > > +	print_page_owner_memcg(kbuf, count, &ret, page);
> > > +
> > ret can go over count here.
> > Why not make print_page_owner_memcg() an int so that the call will be
> > consistent with other calls in print_page_owner():
> > 
> > 	ret += print_page_owner_memcg(kbuf, count, page);
> > 	if (ret >= count)
> > 		goto err;

I still think that 'int print_page_owner_memcg()' is clearer and more
readable.
 
> See my comments above.
> 
> Cheers,
> Longman
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information
  2022-01-31 19:23 ` [PATCH v3 3/4] mm/page_owner: Print memcg information Waiman Long
  2022-01-31 20:51   ` Mike Rapoport
  2022-01-31 20:51   ` Roman Gushchin
@ 2022-02-01 10:54   ` Michal Hocko
  2022-02-01 17:04     ` Waiman Long
  2 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2022-02-01 10:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini

On Mon 31-01-22 14:23:07, Waiman Long wrote:
> It was found that a number of offlined memcgs were not freed because
> they were pinned by some charged pages that were present. Even "echo
> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
> offlined but not freed memcgs tend to increase in number over time with
> the side effect that percpu memory consumption as shown in /proc/meminfo
> also increases over time.
> 
> In order to find out more information about those pages that pin
> offlined memcgs, the page_owner feature is extended to print memory
> cgroup information especially whether the cgroup is offlined or not.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> ---
>  mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 28dac73e0542..a471c74c7fe0 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -10,6 +10,7 @@
>  #include <linux/migrate.h>
>  #include <linux/stackdepot.h>
>  #include <linux/seq_file.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched/clock.h>
>  
>  #include "internal.h"
> @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
>  	seq_putc(m, '\n');
>  }
>  
> +#ifdef CONFIG_MEMCG
> +/*
> + * Looking for memcg information and print it out
> + */
> +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret,
> +					  struct page *page)
> +{
> +	unsigned long memcg_data = READ_ONCE(page->memcg_data);
> +	struct mem_cgroup *memcg;
> +	bool onlined;
> +	char name[80];
> +
> +	if (!memcg_data)
> +		return;
> +
> +	if (memcg_data & MEMCG_DATA_OBJCGS)
> +		*pret += scnprintf(kbuf + *pret, count - *pret,
> +				"Slab cache page\n");
> +
> +	memcg = page_memcg_check(page);
> +	if (!memcg)
> +		return;
> +
> +	onlined = (memcg->css.flags & CSS_ONLINE);
> +	cgroup_name(memcg->css.cgroup, name, sizeof(name));
> +	*pret += scnprintf(kbuf + *pret, count - *pret,
> +			"Charged %sto %smemcg %s\n",
> +			PageMemcgKmem(page) ? "(via objcg) " : "",
> +			onlined ? "" : "offlined ",
> +			name);

I have asked in the previous version already but what makes the memcg
stable (why it cannot go away and be reallocated for something else)
while you are trying to get its name?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/4] mm/page_owner: Record task command name
  2022-01-31 22:03   ` [PATCH v4 " Waiman Long
@ 2022-02-01 15:28     ` Michal Hocko
  2022-02-02 16:53       ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2022-02-01 15:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini,
	Vlastimil Babka

Cc Vlastimil

On Mon 31-01-22 17:03:28, Waiman Long wrote:
> The page_owner information currently includes the pid of the calling
> task. That is useful as long as the task is still running. Otherwise,
> the number is meaningless. To have more information about the allocating
> tasks that had exited by the time the page_owner information is
> retrieved, we need to store the command name of the task.
> 
> Add a new comm field into page_owner structure to store the command name
> and display it when the page_owner information is retrieved.

I completely agree that pid is effectivelly useless (if not misleading)
but is comm really telling all that much to compensate for the
additional storage required for _each_ page in the system?

> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/page_owner.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index a471c74c7fe0..485542155483 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -29,6 +29,7 @@ struct page_owner {
>  	depot_stack_handle_t free_handle;
>  	u64 ts_nsec;
>  	u64 free_ts_nsec;
> +	char comm[TASK_COMM_LEN];
>  	pid_t pid;
>  };
>  
> @@ -146,6 +147,7 @@ void __reset_page_owner(struct page *page, unsigned short order)
>  		page_owner = get_page_owner(page_ext);
>  		page_owner->free_handle = handle;
>  		page_owner->free_ts_nsec = free_ts_nsec;
> +		page_owner->comm[0] = '\0';
>  		page_ext = page_ext_next(page_ext);
>  	}
>  }
> @@ -165,6 +167,8 @@ static inline void __set_page_owner_handle(struct page_ext *page_ext,
>  		page_owner->last_migrate_reason = -1;
>  		page_owner->pid = current->pid;
>  		page_owner->ts_nsec = local_clock();
> +		strlcpy(page_owner->comm, current->comm,
> +			sizeof(page_owner->comm));
>  		__set_bit(PAGE_EXT_OWNER, &page_ext->flags);
>  		__set_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags);
>  
> @@ -232,6 +236,7 @@ void __folio_copy_owner(struct folio *newfolio, struct folio *old)
>  	new_page_owner->pid = old_page_owner->pid;
>  	new_page_owner->ts_nsec = old_page_owner->ts_nsec;
>  	new_page_owner->free_ts_nsec = old_page_owner->ts_nsec;
> +	strcpy(new_page_owner->comm, old_page_owner->comm);
>  
>  	/*
>  	 * We don't clear the bit on the old folio as it's going to be freed
> @@ -376,10 +381,11 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>  		return -ENOMEM;
>  
>  	ret = scnprintf(kbuf, count,
> -			"Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n",
> +			"Page allocated via order %u, mask %#x(%pGg), pid %d (%s), ts %llu ns, free_ts %llu ns\n",
>  			page_owner->order, page_owner->gfp_mask,
>  			&page_owner->gfp_mask, page_owner->pid,
> -			page_owner->ts_nsec, page_owner->free_ts_nsec);
> +			page_owner->comm, page_owner->ts_nsec,
> +			page_owner->free_ts_nsec);
>  
>  	/* Print information relevant to grouping pages by mobility */
>  	pageblock_mt = get_pageblock_migratetype(page);
> @@ -446,9 +452,10 @@ void __dump_page_owner(const struct page *page)
>  	else
>  		pr_alert("page_owner tracks the page as freed\n");
>  
> -	pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, ts %llu, free_ts %llu\n",
> +	pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d (%s), ts %llu, free_ts %llu\n",
>  		 page_owner->order, migratetype_names[mt], gfp_mask, &gfp_mask,
> -		 page_owner->pid, page_owner->ts_nsec, page_owner->free_ts_nsec);
> +		 page_owner->pid, page_owner->comm, page_owner->ts_nsec,
> +		 page_owner->free_ts_nsec);
>  
>  	handle = READ_ONCE(page_owner->handle);
>  	if (!handle)
> -- 
> 2.27.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information
  2022-02-01 10:54   ` Michal Hocko
@ 2022-02-01 17:04     ` Waiman Long
  2022-02-02  8:49       ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2022-02-01 17:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini

On 2/1/22 05:54, Michal Hocko wrote:
> On Mon 31-01-22 14:23:07, Waiman Long wrote:
>> It was found that a number of offlined memcgs were not freed because
>> they were pinned by some charged pages that were present. Even "echo
>> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
>> offlined but not freed memcgs tend to increase in number over time with
>> the side effect that percpu memory consumption as shown in /proc/meminfo
>> also increases over time.
>>
>> In order to find out more information about those pages that pin
>> offlined memcgs, the page_owner feature is extended to print memory
>> cgroup information especially whether the cgroup is offlined or not.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> Acked-by: David Rientjes <rientjes@google.com>
>> ---
>>   mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 39 insertions(+)
>>
>> diff --git a/mm/page_owner.c b/mm/page_owner.c
>> index 28dac73e0542..a471c74c7fe0 100644
>> --- a/mm/page_owner.c
>> +++ b/mm/page_owner.c
>> @@ -10,6 +10,7 @@
>>   #include <linux/migrate.h>
>>   #include <linux/stackdepot.h>
>>   #include <linux/seq_file.h>
>> +#include <linux/memcontrol.h>
>>   #include <linux/sched/clock.h>
>>   
>>   #include "internal.h"
>> @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
>>   	seq_putc(m, '\n');
>>   }
>>   
>> +#ifdef CONFIG_MEMCG
>> +/*
>> + * Looking for memcg information and print it out
>> + */
>> +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret,
>> +					  struct page *page)
>> +{
>> +	unsigned long memcg_data = READ_ONCE(page->memcg_data);
>> +	struct mem_cgroup *memcg;
>> +	bool onlined;
>> +	char name[80];
>> +
>> +	if (!memcg_data)
>> +		return;
>> +
>> +	if (memcg_data & MEMCG_DATA_OBJCGS)
>> +		*pret += scnprintf(kbuf + *pret, count - *pret,
>> +				"Slab cache page\n");
>> +
>> +	memcg = page_memcg_check(page);
>> +	if (!memcg)
>> +		return;
>> +
>> +	onlined = (memcg->css.flags & CSS_ONLINE);
>> +	cgroup_name(memcg->css.cgroup, name, sizeof(name));
>> +	*pret += scnprintf(kbuf + *pret, count - *pret,
>> +			"Charged %sto %smemcg %s\n",
>> +			PageMemcgKmem(page) ? "(via objcg) " : "",
>> +			onlined ? "" : "offlined ",
>> +			name);
> I have asked in the previous version already but what makes the memcg
> stable (why it cannot go away and be reallocated for something else)
> while you are trying to get its name?

The memcg is not going away as long as the page isn't freed unless if it 
is indirectly connected via objcg. Of course, there can be a race 
between the page is going to be freed while the page_owner information 
is being displayed. One solution is to add a simple bit lock to each of 
the page_owner structure and acquire the lock when it is being written 
to or read from. Anyway a lot of these debugging aids or tools don't 
eliminate all the race conditions that affect the accuracy of the 
displayed information. I can add a patch to eliminate this direct memcg 
race if you think this is necessary.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information
  2022-02-01 17:04     ` Waiman Long
@ 2022-02-02  8:49       ` Michal Hocko
  2022-02-02 16:12         ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2022-02-02  8:49 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini

On Tue 01-02-22 12:04:37, Waiman Long wrote:
> On 2/1/22 05:54, Michal Hocko wrote:
> > On Mon 31-01-22 14:23:07, Waiman Long wrote:
> > > It was found that a number of offlined memcgs were not freed because
> > > they were pinned by some charged pages that were present. Even "echo
> > > 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
> > > offlined but not freed memcgs tend to increase in number over time with
> > > the side effect that percpu memory consumption as shown in /proc/meminfo
> > > also increases over time.
> > > 
> > > In order to find out more information about those pages that pin
> > > offlined memcgs, the page_owner feature is extended to print memory
> > > cgroup information especially whether the cgroup is offlined or not.
> > > 
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > Acked-by: David Rientjes <rientjes@google.com>
> > > ---
> > >   mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 39 insertions(+)
> > > 
> > > diff --git a/mm/page_owner.c b/mm/page_owner.c
> > > index 28dac73e0542..a471c74c7fe0 100644
> > > --- a/mm/page_owner.c
> > > +++ b/mm/page_owner.c
> > > @@ -10,6 +10,7 @@
> > >   #include <linux/migrate.h>
> > >   #include <linux/stackdepot.h>
> > >   #include <linux/seq_file.h>
> > > +#include <linux/memcontrol.h>
> > >   #include <linux/sched/clock.h>
> > >   #include "internal.h"
> > > @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
> > >   	seq_putc(m, '\n');
> > >   }
> > > +#ifdef CONFIG_MEMCG
> > > +/*
> > > + * Looking for memcg information and print it out
> > > + */
> > > +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret,
> > > +					  struct page *page)
> > > +{
> > > +	unsigned long memcg_data = READ_ONCE(page->memcg_data);
> > > +	struct mem_cgroup *memcg;
> > > +	bool onlined;
> > > +	char name[80];
> > > +
> > > +	if (!memcg_data)
> > > +		return;
> > > +
> > > +	if (memcg_data & MEMCG_DATA_OBJCGS)
> > > +		*pret += scnprintf(kbuf + *pret, count - *pret,
> > > +				"Slab cache page\n");
> > > +
> > > +	memcg = page_memcg_check(page);
> > > +	if (!memcg)
> > > +		return;
> > > +
> > > +	onlined = (memcg->css.flags & CSS_ONLINE);
> > > +	cgroup_name(memcg->css.cgroup, name, sizeof(name));
> > > +	*pret += scnprintf(kbuf + *pret, count - *pret,
> > > +			"Charged %sto %smemcg %s\n",
> > > +			PageMemcgKmem(page) ? "(via objcg) " : "",
> > > +			onlined ? "" : "offlined ",
> > > +			name);
> > I have asked in the previous version already but what makes the memcg
> > stable (why it cannot go away and be reallocated for something else)
> > while you are trying to get its name?
> 
> The memcg is not going away as long as the page isn't freed unless if it is
> indirectly connected via objcg. Of course, there can be a race between the
> page is going to be freed while the page_owner information is being
> displayed.

Right. And that means that cgtoup_name can go off the rail and wander
through memory correct?

> One solution is to add a simple bit lock to each of the
> page_owner structure and acquire the lock when it is being written to or
> read from.

I do not really see how a bit lock could prevent memcg from going away.
On the other hand I think RCU read lock should be sufficient to keep the
memcg from going away completely.

> Anyway a lot of these debugging aids or tools don't eliminate all
> the race conditions that affect the accuracy of the displayed information. I
> can add a patch to eliminate this direct memcg race if you think this is
> necessary.

I do not mind inaccurate information. That is natural but reading
through a freed memory can be really harmfull. So this really need to be
sorted out.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 3/4] mm/page_owner: Print memcg information
  2022-02-02  8:49       ` Michal Hocko
@ 2022-02-02 16:12         ` Waiman Long
  0 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2022-02-02 16:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini

On 2/2/22 03:49, Michal Hocko wrote:
> On Tue 01-02-22 12:04:37, Waiman Long wrote:
>> On 2/1/22 05:54, Michal Hocko wrote:
>>> On Mon 31-01-22 14:23:07, Waiman Long wrote:
>>>> It was found that a number of offlined memcgs were not freed because
>>>> they were pinned by some charged pages that were present. Even "echo
>>>> 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These
>>>> offlined but not freed memcgs tend to increase in number over time with
>>>> the side effect that percpu memory consumption as shown in /proc/meminfo
>>>> also increases over time.
>>>>
>>>> In order to find out more information about those pages that pin
>>>> offlined memcgs, the page_owner feature is extended to print memory
>>>> cgroup information especially whether the cgroup is offlined or not.
>>>>
>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>> Acked-by: David Rientjes <rientjes@google.com>
>>>> ---
>>>>    mm/page_owner.c | 39 +++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 39 insertions(+)
>>>>
>>>> diff --git a/mm/page_owner.c b/mm/page_owner.c
>>>> index 28dac73e0542..a471c74c7fe0 100644
>>>> --- a/mm/page_owner.c
>>>> +++ b/mm/page_owner.c
>>>> @@ -10,6 +10,7 @@
>>>>    #include <linux/migrate.h>
>>>>    #include <linux/stackdepot.h>
>>>>    #include <linux/seq_file.h>
>>>> +#include <linux/memcontrol.h>
>>>>    #include <linux/sched/clock.h>
>>>>    #include "internal.h"
>>>> @@ -325,6 +326,42 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
>>>>    	seq_putc(m, '\n');
>>>>    }
>>>> +#ifdef CONFIG_MEMCG
>>>> +/*
>>>> + * Looking for memcg information and print it out
>>>> + */
>>>> +static inline void print_page_owner_memcg(char *kbuf, size_t count, int *pret,
>>>> +					  struct page *page)
>>>> +{
>>>> +	unsigned long memcg_data = READ_ONCE(page->memcg_data);
>>>> +	struct mem_cgroup *memcg;
>>>> +	bool onlined;
>>>> +	char name[80];
>>>> +
>>>> +	if (!memcg_data)
>>>> +		return;
>>>> +
>>>> +	if (memcg_data & MEMCG_DATA_OBJCGS)
>>>> +		*pret += scnprintf(kbuf + *pret, count - *pret,
>>>> +				"Slab cache page\n");
>>>> +
>>>> +	memcg = page_memcg_check(page);
>>>> +	if (!memcg)
>>>> +		return;
>>>> +
>>>> +	onlined = (memcg->css.flags & CSS_ONLINE);
>>>> +	cgroup_name(memcg->css.cgroup, name, sizeof(name));
>>>> +	*pret += scnprintf(kbuf + *pret, count - *pret,
>>>> +			"Charged %sto %smemcg %s\n",
>>>> +			PageMemcgKmem(page) ? "(via objcg) " : "",
>>>> +			onlined ? "" : "offlined ",
>>>> +			name);
>>> I have asked in the previous version already but what makes the memcg
>>> stable (why it cannot go away and be reallocated for something else)
>>> while you are trying to get its name?
>> The memcg is not going away as long as the page isn't freed unless if it is
>> indirectly connected via objcg. Of course, there can be a race between the
>> page is going to be freed while the page_owner information is being
>> displayed.
> Right. And that means that cgtoup_name can go off the rail and wander
> through memory correct?
>
>> One solution is to add a simple bit lock to each of the
>> page_owner structure and acquire the lock when it is being written to or
>> read from.
> I do not really see how a bit lock could prevent memcg from going away.
> On the other hand I think RCU read lock should be sufficient to keep the
> memcg from going away completely.
Using rcu_read_lock() is also what I have been thinking of doing. So I 
will update the patch to add that for safety.
>
>> Anyway a lot of these debugging aids or tools don't eliminate all
>> the race conditions that affect the accuracy of the displayed information. I
>> can add a patch to eliminate this direct memcg race if you think this is
>> necessary.
> I do not mind inaccurate information. That is natural but reading
> through a freed memory can be really harmfull. So this really need to be
> sorted out.

Thanks for the review.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/4] mm/page_owner: Record task command name
  2022-02-01 15:28     ` Michal Hocko
@ 2022-02-02 16:53       ` Waiman Long
  2022-02-03 12:10         ` Vlastimil Babka
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2022-02-02 16:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini,
	Vlastimil Babka


On 2/1/22 10:28, Michal Hocko wrote:
> Cc Vlastimil
>
> On Mon 31-01-22 17:03:28, Waiman Long wrote:
>> The page_owner information currently includes the pid of the calling
>> task. That is useful as long as the task is still running. Otherwise,
>> the number is meaningless. To have more information about the allocating
>> tasks that had exited by the time the page_owner information is
>> retrieved, we need to store the command name of the task.
>>
>> Add a new comm field into page_owner structure to store the command name
>> and display it when the page_owner information is retrieved.
> I completely agree that pid is effectivelly useless (if not misleading)
> but is comm really telling all that much to compensate for the
> additional storage required for _each_ page in the system?

Yes, it does add an extra 16 bytes per page overhead. The command name 
can be useful if one want to find out which userspace command is 
responsible for a problematic page allocation. Maybe we can remove pid 
from page_owner to save 8 bytes as you also agree that this number is 
not that useful.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v4 0/4] mm/page_owner: Extend page_owner to show memcg information
  2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long
  2022-01-31 20:54   ` Roman Gushchin
  2022-01-31 22:03   ` [PATCH v4 " Waiman Long
@ 2022-02-02 20:30   ` Waiman Long
  2022-02-02 23:06     ` Rafael Aquini
  2022-02-02 20:30   ` [PATCH v4 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2022-02-02 20:30 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long

 v4:
  - Take rcu_read_lock() when memcg is being accessed as suggested by
    Michal.
  - Make print_page_owner_memcg() return the new offset into the buffer
    and put CONFIG_MEMCG block inside as suggested by Mike.
  - Directly use TASK_COMM_LEN as length of name buffer as suggested by
    Roman.

 v3:
  - Add unlikely() to patch 1 and clarify that -1 will not be returned.
  - Use a helper function to print out memcg information in patch 3.
  - Add a new patch 4 to store task command name in page_owner
    structure.

 v2:
  - Remove the SNPRINTF() macro as suggested by Ira and use scnprintf()
    instead to remove some buffer overrun checks.
  - Add a patch to optimize vscnprintf with a size parameter of 0.

While debugging the constant increase in percpu memory consumption on
a system that spawned large number of containers, it was found that a
lot of offline mem_cgroup structures remained in place without being
freed. Further investigation indicated that those mem_cgroup structures
were pinned by some pages.

In order to find out what those pages are, the existing page_owner
debugging tool is extended to show memory cgroup information and whether
those memcgs are offline or not. With the enhanced page_owner tool,
the following is a typical page that pinned the mem_cgroup structure
in my test case:

Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE), pid 162970 (podman), ts 1097761405537 ns, free_ts 1097760838089 ns
PFN 1925700 type Movable Block 3761 type Movable Flags 0x17ffffc00c001c(uptodate|dirty|lru|reclaim|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
 prep_new_page+0xac/0xe0
 get_page_from_freelist+0x1327/0x14d0
 __alloc_pages+0x191/0x340
 alloc_pages_vma+0x84/0x250
 shmem_alloc_page+0x3f/0x90
 shmem_alloc_and_acct_page+0x76/0x1c0
 shmem_getpage_gfp+0x281/0x940
 shmem_write_begin+0x36/0xe0
 generic_perform_write+0xed/0x1d0
 __generic_file_write_iter+0xdc/0x1b0
 generic_file_write_iter+0x5d/0xb0
 new_sync_write+0x11f/0x1b0
 vfs_write+0x1ba/0x2a0
 ksys_write+0x59/0xd0
 do_syscall_64+0x37/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
Charged to offline memcg libpod-conmon-15e4f9c758422306b73b2dd99f9d50a5ea53cbb16b4a13a2c2308a4253cc0ec8.

So the page was not freed because it was part of a shmem segment. That
is useful information that can help users to diagnose similar problems.

With cgroup v1, /proc/cgroups can be read to find out the total number
of memory cgroups (online + offline). With cgroup v2, the cgroup.stat of
the root cgroup can be read to find the number of dying cgroups (most
likely pinned by dying memcgs).

The page_owner feature is not supposed to be enabled for production
system due to its memory overhead. However, if it is suspected that
dying memcgs are increasing over time, a test environment with page_owner
enabled can then be set up with appropriate workload for further analysis
on what may be causing the increasing number of dying memcgs.

Waiman Long (4):
  lib/vsprintf: Avoid redundant work with 0 size
  mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
  mm/page_owner: Print memcg information
  mm/page_owner: Record task command name

 lib/vsprintf.c  |  8 +++---
 mm/page_owner.c | 70 ++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 60 insertions(+), 18 deletions(-)

-- 
2.27.0


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v4 1/4] lib/vsprintf: Avoid redundant work with 0 size
  2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long
                     ` (2 preceding siblings ...)
  2022-02-02 20:30   ` [PATCH v4 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
@ 2022-02-02 20:30   ` Waiman Long
  2022-02-08 10:08     ` Petr Mladek
  2022-02-02 20:30   ` [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2022-02-02 20:30 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long

For *scnprintf(), vsnprintf() is always called even if the input size is
0. That is a waste of time, so just return 0 in this case.

Note that vsnprintf() will never return -1 to indicate an error. So
skipping the call to vsnprintf() when size is 0 will have no functional
impact at all.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Roman Gushchin <guro@fb.com>
---
 lib/vsprintf.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 3b8129dd374c..d419154b47bb 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -2895,13 +2895,15 @@ int vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
 {
 	int i;
 
+	if (unlikely(!size))
+		return 0;
+
 	i = vsnprintf(buf, size, fmt, args);
 
 	if (likely(i < size))
 		return i;
-	if (size != 0)
-		return size - 1;
-	return 0;
+
+	return size - 1;
 }
 EXPORT_SYMBOL(vscnprintf);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
  2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long
                     ` (3 preceding siblings ...)
  2022-02-02 20:30   ` [PATCH v4 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
@ 2022-02-02 20:30   ` Waiman Long
  2022-02-03 15:46     ` Vlastimil Babka
  2022-02-02 20:30   ` [PATCH v4 3/4] mm/page_owner: Print memcg information Waiman Long
  2022-02-02 20:30   ` [PATCH v4 4/4] mm/page_owner: Record task command name Waiman Long
  6 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2022-02-02 20:30 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long

The snprintf() function can return a length greater than the given
input size. That will require a check for buffer overrun after each
invocation of snprintf(). scnprintf(), on the other hand, will never
return a greater length. By using scnprintf() in selected places, we
can avoid some buffer overrun checks except after stack_depot_snprint()
and after the last snprintf().

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 mm/page_owner.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 99e360df9465..28dac73e0542 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -338,19 +338,16 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 	if (!kbuf)
 		return -ENOMEM;
 
-	ret = snprintf(kbuf, count,
+	ret = scnprintf(kbuf, count,
 			"Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n",
 			page_owner->order, page_owner->gfp_mask,
 			&page_owner->gfp_mask, page_owner->pid,
 			page_owner->ts_nsec, page_owner->free_ts_nsec);
 
-	if (ret >= count)
-		goto err;
-
 	/* Print information relevant to grouping pages by mobility */
 	pageblock_mt = get_pageblock_migratetype(page);
 	page_mt  = gfp_migratetype(page_owner->gfp_mask);
-	ret += snprintf(kbuf + ret, count - ret,
+	ret += scnprintf(kbuf + ret, count - ret,
 			"PFN %lu type %s Block %lu type %s Flags %pGp\n",
 			pfn,
 			migratetype_names[page_mt],
@@ -358,19 +355,14 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 			migratetype_names[pageblock_mt],
 			&page->flags);
 
-	if (ret >= count)
-		goto err;
-
 	ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0);
 	if (ret >= count)
 		goto err;
 
 	if (page_owner->last_migrate_reason != -1) {
-		ret += snprintf(kbuf + ret, count - ret,
+		ret += scnprintf(kbuf + ret, count - ret,
 			"Page has been migrated, last migrate reason: %s\n",
 			migrate_reason_names[page_owner->last_migrate_reason]);
-		if (ret >= count)
-			goto err;
 	}
 
 	ret += snprintf(kbuf + ret, count - ret, "\n");
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 3/4] mm/page_owner: Print memcg information
  2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long
                     ` (4 preceding siblings ...)
  2022-02-02 20:30   ` [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
@ 2022-02-02 20:30   ` Waiman Long
  2022-02-03  6:53     ` Mike Rapoport
  2022-02-03 12:46     ` Michal Hocko
  2022-02-02 20:30   ` [PATCH v4 4/4] mm/page_owner: Record task command name Waiman Long
  6 siblings, 2 replies; 39+ messages in thread
From: Waiman Long @ 2022-02-02 20:30 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long

It was found that a number of offline memcgs were not freed because
they were pinned by some charged pages that were present. Even "echo 1 >
/proc/sys/vm/drop_caches" wasn't able to free those pages. These offline
but not freed memcgs tend to increase in number over time with the side
effect that percpu memory consumption as shown in /proc/meminfo also
increases over time.

In order to find out more information about those pages that pin
offline memcgs, the page_owner feature is extended to print memory
cgroup information especially whether the cgroup is offline or not.
RCU read lock is taken when memcg is being accessed to make sure
that it won't be freed.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
---
 mm/page_owner.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index 28dac73e0542..f7820357e4d4 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -10,6 +10,7 @@
 #include <linux/migrate.h>
 #include <linux/stackdepot.h>
 #include <linux/seq_file.h>
+#include <linux/memcontrol.h>
 #include <linux/sched/clock.h>
 
 #include "internal.h"
@@ -325,6 +326,45 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
 	seq_putc(m, '\n');
 }
 
+/*
+ * Looking for memcg information and print it out
+ */
+static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
+					 struct page *page)
+{
+#ifdef CONFIG_MEMCG
+	unsigned long memcg_data;
+	struct mem_cgroup *memcg;
+	bool online;
+	char name[80];
+
+	rcu_read_lock();
+	memcg_data = READ_ONCE(page->memcg_data);
+	if (!memcg_data)
+		goto out_unlock;
+
+	if (memcg_data & MEMCG_DATA_OBJCGS)
+		ret += scnprintf(kbuf + ret, count - ret,
+				"Slab cache page\n");
+
+	memcg = page_memcg_check(page);
+	if (!memcg)
+		goto out_unlock;
+
+	online = (memcg->css.flags & CSS_ONLINE);
+	cgroup_name(memcg->css.cgroup, name, sizeof(name));
+	ret += scnprintf(kbuf + ret, count - ret,
+			"Charged %sto %smemcg %s\n",
+			PageMemcgKmem(page) ? "(via objcg) " : "",
+			online ? "" : "offline ",
+			name);
+out_unlock:
+	rcu_read_unlock();
+#endif /* CONFIG_MEMCG */
+
+	return ret;
+}
+
 static ssize_t
 print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 		struct page *page, struct page_owner *page_owner,
@@ -365,6 +405,8 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 			migrate_reason_names[page_owner->last_migrate_reason]);
 	}
 
+	ret = print_page_owner_memcg(kbuf, count, ret, page);
+
 	ret += snprintf(kbuf + ret, count - ret, "\n");
 	if (ret >= count)
 		goto err;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v4 4/4] mm/page_owner: Record task command name
  2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long
                     ` (5 preceding siblings ...)
  2022-02-02 20:30   ` [PATCH v4 3/4] mm/page_owner: Print memcg information Waiman Long
@ 2022-02-02 20:30   ` Waiman Long
  6 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2022-02-02 20:30 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini, Waiman Long

The page_owner information currently includes the pid of the calling
task. That is useful as long as the task is still running. Otherwise,
the number is meaningless. To have more information about the allocating
tasks that had exited by the time the page_owner information is
retrieved, we need to store the command name of the task.

Add a new comm field into page_owner structure to store the command name
and display it when the page_owner information is retrieved.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/page_owner.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/page_owner.c b/mm/page_owner.c
index f7820357e4d4..d56afa9c792e 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -29,6 +29,7 @@ struct page_owner {
 	depot_stack_handle_t free_handle;
 	u64 ts_nsec;
 	u64 free_ts_nsec;
+	char comm[TASK_COMM_LEN];
 	pid_t pid;
 };
 
@@ -165,6 +166,8 @@ static inline void __set_page_owner_handle(struct page_ext *page_ext,
 		page_owner->last_migrate_reason = -1;
 		page_owner->pid = current->pid;
 		page_owner->ts_nsec = local_clock();
+		strlcpy(page_owner->comm, current->comm,
+			sizeof(page_owner->comm));
 		__set_bit(PAGE_EXT_OWNER, &page_ext->flags);
 		__set_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags);
 
@@ -232,6 +235,7 @@ void __folio_copy_owner(struct folio *newfolio, struct folio *old)
 	new_page_owner->pid = old_page_owner->pid;
 	new_page_owner->ts_nsec = old_page_owner->ts_nsec;
 	new_page_owner->free_ts_nsec = old_page_owner->ts_nsec;
+	strcpy(new_page_owner->comm, old_page_owner->comm);
 
 	/*
 	 * We don't clear the bit on the old folio as it's going to be freed
@@ -379,10 +383,11 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 		return -ENOMEM;
 
 	ret = scnprintf(kbuf, count,
-			"Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n",
+			"Page allocated via order %u, mask %#x(%pGg), pid %d (%s), ts %llu ns, free_ts %llu ns\n",
 			page_owner->order, page_owner->gfp_mask,
 			&page_owner->gfp_mask, page_owner->pid,
-			page_owner->ts_nsec, page_owner->free_ts_nsec);
+			page_owner->comm, page_owner->ts_nsec,
+			page_owner->free_ts_nsec);
 
 	/* Print information relevant to grouping pages by mobility */
 	pageblock_mt = get_pageblock_migratetype(page);
@@ -449,9 +454,10 @@ void __dump_page_owner(const struct page *page)
 	else
 		pr_alert("page_owner tracks the page as freed\n");
 
-	pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d, ts %llu, free_ts %llu\n",
+	pr_alert("page last allocated via order %u, migratetype %s, gfp_mask %#x(%pGg), pid %d (%s), ts %llu, free_ts %llu\n",
 		 page_owner->order, migratetype_names[mt], gfp_mask, &gfp_mask,
-		 page_owner->pid, page_owner->ts_nsec, page_owner->free_ts_nsec);
+		 page_owner->pid, page_owner->comm, page_owner->ts_nsec,
+		 page_owner->free_ts_nsec);
 
 	handle = READ_ONCE(page_owner->handle);
 	if (!handle)
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 0/4] mm/page_owner: Extend page_owner to show memcg information
  2022-02-02 20:30   ` [PATCH v4 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
@ 2022-02-02 23:06     ` Rafael Aquini
  0 siblings, 0 replies; 39+ messages in thread
From: Rafael Aquini @ 2022-02-02 23:06 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin

On Wed, Feb 02, 2022 at 03:30:32PM -0500, Waiman Long wrote:
>  v4:
>   - Take rcu_read_lock() when memcg is being accessed as suggested by
>     Michal.
>   - Make print_page_owner_memcg() return the new offset into the buffer
>     and put CONFIG_MEMCG block inside as suggested by Mike.
>   - Directly use TASK_COMM_LEN as length of name buffer as suggested by
>     Roman.
> 
>  v3:
>   - Add unlikely() to patch 1 and clarify that -1 will not be returned.
>   - Use a helper function to print out memcg information in patch 3.
>   - Add a new patch 4 to store task command name in page_owner
>     structure.
> 
>  v2:
>   - Remove the SNPRINTF() macro as suggested by Ira and use scnprintf()
>     instead to remove some buffer overrun checks.
>   - Add a patch to optimize vscnprintf with a size parameter of 0.
> 
> While debugging the constant increase in percpu memory consumption on
> a system that spawned large number of containers, it was found that a
> lot of offline mem_cgroup structures remained in place without being
> freed. Further investigation indicated that those mem_cgroup structures
> were pinned by some pages.
> 
> In order to find out what those pages are, the existing page_owner
> debugging tool is extended to show memory cgroup information and whether
> those memcgs are offline or not. With the enhanced page_owner tool,
> the following is a typical page that pinned the mem_cgroup structure
> in my test case:
> 
> Page allocated via order 0, mask 0x1100cca(GFP_HIGHUSER_MOVABLE), pid 162970 (podman), ts 1097761405537 ns, free_ts 1097760838089 ns
> PFN 1925700 type Movable Block 3761 type Movable Flags 0x17ffffc00c001c(uptodate|dirty|lru|reclaim|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
>  prep_new_page+0xac/0xe0
>  get_page_from_freelist+0x1327/0x14d0
>  __alloc_pages+0x191/0x340
>  alloc_pages_vma+0x84/0x250
>  shmem_alloc_page+0x3f/0x90
>  shmem_alloc_and_acct_page+0x76/0x1c0
>  shmem_getpage_gfp+0x281/0x940
>  shmem_write_begin+0x36/0xe0
>  generic_perform_write+0xed/0x1d0
>  __generic_file_write_iter+0xdc/0x1b0
>  generic_file_write_iter+0x5d/0xb0
>  new_sync_write+0x11f/0x1b0
>  vfs_write+0x1ba/0x2a0
>  ksys_write+0x59/0xd0
>  do_syscall_64+0x37/0x80
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> Charged to offline memcg libpod-conmon-15e4f9c758422306b73b2dd99f9d50a5ea53cbb16b4a13a2c2308a4253cc0ec8.
> 
> So the page was not freed because it was part of a shmem segment. That
> is useful information that can help users to diagnose similar problems.
> 
> With cgroup v1, /proc/cgroups can be read to find out the total number
> of memory cgroups (online + offline). With cgroup v2, the cgroup.stat of
> the root cgroup can be read to find the number of dying cgroups (most
> likely pinned by dying memcgs).
> 
> The page_owner feature is not supposed to be enabled for production
> system due to its memory overhead. However, if it is suspected that
> dying memcgs are increasing over time, a test environment with page_owner
> enabled can then be set up with appropriate workload for further analysis
> on what may be causing the increasing number of dying memcgs.
> 
> Waiman Long (4):
>   lib/vsprintf: Avoid redundant work with 0 size
>   mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
>   mm/page_owner: Print memcg information
>   mm/page_owner: Record task command name
> 
>  lib/vsprintf.c  |  8 +++---
>  mm/page_owner.c | 70 ++++++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 60 insertions(+), 18 deletions(-)
> 
> -- 
> 2.27.0
>

Thank you, Waiman.

Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 3/4] mm/page_owner: Print memcg information
  2022-02-02 20:30   ` [PATCH v4 3/4] mm/page_owner: Print memcg information Waiman Long
@ 2022-02-03  6:53     ` Mike Rapoport
  2022-02-03 12:46     ` Michal Hocko
  1 sibling, 0 replies; 39+ messages in thread
From: Mike Rapoport @ 2022-02-03  6:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Petr Mladek, Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	David Rientjes, Roman Gushchin, Rafael Aquini

On Wed, Feb 02, 2022 at 03:30:35PM -0500, Waiman Long wrote:
> It was found that a number of offline memcgs were not freed because
> they were pinned by some charged pages that were present. Even "echo 1 >
> /proc/sys/vm/drop_caches" wasn't able to free those pages. These offline
> but not freed memcgs tend to increase in number over time with the side
> effect that percpu memory consumption as shown in /proc/meminfo also
> increases over time.
> 
> In order to find out more information about those pages that pin
> offline memcgs, the page_owner feature is extended to print memory
> cgroup information especially whether the cgroup is offline or not.
> RCU read lock is taken when memcg is being accessed to make sure
> that it won't be freed.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Roman Gushchin <guro@fb.com>

Acked-by: Mike Rapoport <rppt@linux.ibm.com>

And my akcs for the first two patches are missing somehow in v4...

> ---
>  mm/page_owner.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 28dac73e0542..f7820357e4d4 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -10,6 +10,7 @@
>  #include <linux/migrate.h>
>  #include <linux/stackdepot.h>
>  #include <linux/seq_file.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched/clock.h>
>  
>  #include "internal.h"
> @@ -325,6 +326,45 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
>  	seq_putc(m, '\n');
>  }
>  
> +/*
> + * Looking for memcg information and print it out
> + */
> +static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
> +					 struct page *page)
> +{
> +#ifdef CONFIG_MEMCG
> +	unsigned long memcg_data;
> +	struct mem_cgroup *memcg;
> +	bool online;
> +	char name[80];
> +
> +	rcu_read_lock();
> +	memcg_data = READ_ONCE(page->memcg_data);
> +	if (!memcg_data)
> +		goto out_unlock;
> +
> +	if (memcg_data & MEMCG_DATA_OBJCGS)
> +		ret += scnprintf(kbuf + ret, count - ret,
> +				"Slab cache page\n");
> +
> +	memcg = page_memcg_check(page);
> +	if (!memcg)
> +		goto out_unlock;
> +
> +	online = (memcg->css.flags & CSS_ONLINE);
> +	cgroup_name(memcg->css.cgroup, name, sizeof(name));
> +	ret += scnprintf(kbuf + ret, count - ret,
> +			"Charged %sto %smemcg %s\n",
> +			PageMemcgKmem(page) ? "(via objcg) " : "",
> +			online ? "" : "offline ",
> +			name);
> +out_unlock:
> +	rcu_read_unlock();
> +#endif /* CONFIG_MEMCG */
> +
> +	return ret;
> +}
> +
>  static ssize_t
>  print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>  		struct page *page, struct page_owner *page_owner,
> @@ -365,6 +405,8 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>  			migrate_reason_names[page_owner->last_migrate_reason]);
>  	}
>  
> +	ret = print_page_owner_memcg(kbuf, count, ret, page);
> +
>  	ret += snprintf(kbuf + ret, count - ret, "\n");
>  	if (ret >= count)
>  		goto err;
> -- 
> 2.27.0
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/4] mm/page_owner: Record task command name
  2022-02-02 16:53       ` Waiman Long
@ 2022-02-03 12:10         ` Vlastimil Babka
  2022-02-03 18:53           ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Vlastimil Babka @ 2022-02-03 12:10 UTC (permalink / raw)
  To: Waiman Long, Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini

On 2/2/22 17:53, Waiman Long wrote:
> 
> On 2/1/22 10:28, Michal Hocko wrote:
>> Cc Vlastimil
>>
>> On Mon 31-01-22 17:03:28, Waiman Long wrote:
>>> The page_owner information currently includes the pid of the calling
>>> task. That is useful as long as the task is still running. Otherwise,
>>> the number is meaningless. To have more information about the allocating
>>> tasks that had exited by the time the page_owner information is
>>> retrieved, we need to store the command name of the task.
>>>
>>> Add a new comm field into page_owner structure to store the command name
>>> and display it when the page_owner information is retrieved.
>> I completely agree that pid is effectivelly useless (if not misleading)
>> but is comm really telling all that much to compensate for the
>> additional storage required for _each_ page in the system?
> 
> Yes, it does add an extra 16 bytes per page overhead. The command name can
> be useful if one want to find out which userspace command is responsible for
> a problematic page allocation. Maybe we can remove pid from page_owner to
> save 8 bytes as you also agree that this number is not that useful.

Pid could be used to correlate command instances (not perfectly if reuse
happens), but command name could have a higher chance to be useful. In my
experience the most useful were the stacktraces and gfp/order etc. anyway.
So I wouldn't be opposed replacing pid with comm. The mild size increase
should be acceptable, this is an opt-in feature for debugging sessions with
known tradeoff for memory and cpu overhead for the extra info.

> Cheers,
> Longman
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 3/4] mm/page_owner: Print memcg information
  2022-02-02 20:30   ` [PATCH v4 3/4] mm/page_owner: Print memcg information Waiman Long
  2022-02-03  6:53     ` Mike Rapoport
@ 2022-02-03 12:46     ` Michal Hocko
  2022-02-03 19:03       ` Waiman Long
  1 sibling, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2022-02-03 12:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini

On Wed 02-02-22 15:30:35, Waiman Long wrote:
[...]
> +#ifdef CONFIG_MEMCG
> +	unsigned long memcg_data;
> +	struct mem_cgroup *memcg;
> +	bool online;
> +	char name[80];
> +
> +	rcu_read_lock();
> +	memcg_data = READ_ONCE(page->memcg_data);
> +	if (!memcg_data)
> +		goto out_unlock;
> +
> +	if (memcg_data & MEMCG_DATA_OBJCGS)
> +		ret += scnprintf(kbuf + ret, count - ret,
> +				"Slab cache page\n");
> +
> +	memcg = page_memcg_check(page);
> +	if (!memcg)
> +		goto out_unlock;
> +
> +	online = (memcg->css.flags & CSS_ONLINE);
> +	cgroup_name(memcg->css.cgroup, name, sizeof(name));

Is there any specific reason to use another buffer allocated on the
stack? Also 80B seems too short to cover NAME_MAX.

Nothing else jumped at me.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
  2022-02-02 20:30   ` [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
@ 2022-02-03 15:46     ` Vlastimil Babka
  2022-02-03 18:49       ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Vlastimil Babka @ 2022-02-03 15:46 UTC (permalink / raw)
  To: Waiman Long, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini

On 2/2/22 21:30, Waiman Long wrote:
> The snprintf() function can return a length greater than the given
> input size. That will require a check for buffer overrun after each
> invocation of snprintf(). scnprintf(), on the other hand, will never
> return a greater length. By using scnprintf() in selected places, we
> can avoid some buffer overrun checks except after stack_depot_snprint()
> and after the last snprintf().
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>

Looks like this will work, but note that if the purpose of patch 1/4 was
that after the first scnprintf() that overflows the following calls will be
short-cut thanks to passing the size as 0, AFAICS that won't work. Because
scnprintf() returns the number without trailing zero, 'ret' will be 'count -
1' after the overflow, so 'count - ret' will be 1, never 0.

> ---
>  mm/page_owner.c | 14 +++-----------
>  1 file changed, 3 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index 99e360df9465..28dac73e0542 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -338,19 +338,16 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>  	if (!kbuf)
>  		return -ENOMEM;
>  
> -	ret = snprintf(kbuf, count,
> +	ret = scnprintf(kbuf, count,
>  			"Page allocated via order %u, mask %#x(%pGg), pid %d, ts %llu ns, free_ts %llu ns\n",
>  			page_owner->order, page_owner->gfp_mask,
>  			&page_owner->gfp_mask, page_owner->pid,
>  			page_owner->ts_nsec, page_owner->free_ts_nsec);
>  
> -	if (ret >= count)
> -		goto err;
> -
>  	/* Print information relevant to grouping pages by mobility */
>  	pageblock_mt = get_pageblock_migratetype(page);
>  	page_mt  = gfp_migratetype(page_owner->gfp_mask);
> -	ret += snprintf(kbuf + ret, count - ret,
> +	ret += scnprintf(kbuf + ret, count - ret,
>  			"PFN %lu type %s Block %lu type %s Flags %pGp\n",
>  			pfn,
>  			migratetype_names[page_mt],
> @@ -358,19 +355,14 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
>  			migratetype_names[pageblock_mt],
>  			&page->flags);
>  
> -	if (ret >= count)
> -		goto err;
> -
>  	ret += stack_depot_snprint(handle, kbuf + ret, count - ret, 0);
>  	if (ret >= count)
>  		goto err;
>  
>  	if (page_owner->last_migrate_reason != -1) {
> -		ret += snprintf(kbuf + ret, count - ret,
> +		ret += scnprintf(kbuf + ret, count - ret,
>  			"Page has been migrated, last migrate reason: %s\n",
>  			migrate_reason_names[page_owner->last_migrate_reason]);
> -		if (ret >= count)
> -			goto err;
>  	}
>  
>  	ret += snprintf(kbuf + ret, count - ret, "\n");


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
  2022-02-03 15:46     ` Vlastimil Babka
@ 2022-02-03 18:49       ` Waiman Long
  2022-02-08 10:51         ` Petr Mladek
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2022-02-03 18:49 UTC (permalink / raw)
  To: Vlastimil Babka, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Petr Mladek, Steven Rostedt, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes
  Cc: linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini

On 2/3/22 10:46, Vlastimil Babka wrote:
> On 2/2/22 21:30, Waiman Long wrote:
>> The snprintf() function can return a length greater than the given
>> input size. That will require a check for buffer overrun after each
>> invocation of snprintf(). scnprintf(), on the other hand, will never
>> return a greater length. By using scnprintf() in selected places, we
>> can avoid some buffer overrun checks except after stack_depot_snprint()
>> and after the last snprintf().
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> Acked-by: David Rientjes <rientjes@google.com>
>> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> Looks like this will work, but note that if the purpose of patch 1/4 was
> that after the first scnprintf() that overflows the following calls will be
> short-cut thanks to passing the size as 0, AFAICS that won't work. Because
> scnprintf() returns the number without trailing zero, 'ret' will be 'count -
> 1' after the overflow, so 'count - ret' will be 1, never 0.

Yes, I am aware of that. Patch 1 is just a micro-optimization for the 
very rare case.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 4/4] mm/page_owner: Record task command name
  2022-02-03 12:10         ` Vlastimil Babka
@ 2022-02-03 18:53           ` Waiman Long
  0 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2022-02-03 18:53 UTC (permalink / raw)
  To: Vlastimil Babka, Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini

On 2/3/22 07:10, Vlastimil Babka wrote:
> On 2/2/22 17:53, Waiman Long wrote:
>> On 2/1/22 10:28, Michal Hocko wrote:
>>> Cc Vlastimil
>>>
>>> On Mon 31-01-22 17:03:28, Waiman Long wrote:
>>>> The page_owner information currently includes the pid of the calling
>>>> task. That is useful as long as the task is still running. Otherwise,
>>>> the number is meaningless. To have more information about the allocating
>>>> tasks that had exited by the time the page_owner information is
>>>> retrieved, we need to store the command name of the task.
>>>>
>>>> Add a new comm field into page_owner structure to store the command name
>>>> and display it when the page_owner information is retrieved.
>>> I completely agree that pid is effectivelly useless (if not misleading)
>>> but is comm really telling all that much to compensate for the
>>> additional storage required for _each_ page in the system?
>> Yes, it does add an extra 16 bytes per page overhead. The command name can
>> be useful if one want to find out which userspace command is responsible for
>> a problematic page allocation. Maybe we can remove pid from page_owner to
>> save 8 bytes as you also agree that this number is not that useful.
> Pid could be used to correlate command instances (not perfectly if reuse
> happens), but command name could have a higher chance to be useful. In my
> experience the most useful were the stacktraces and gfp/order etc. anyway.
> So I wouldn't be opposed replacing pid with comm. The mild size increase
> should be acceptable, this is an opt-in feature for debugging sessions with
> known tradeoff for memory and cpu overhead for the extra info.

Thanks for the information.

I floated around dropping pid just as a possible way to reduce overall 
memory overhead. I did not do that in my patch and I am not planning to 
post any patch unless everybody agree.

Cheer,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 3/4] mm/page_owner: Print memcg information
  2022-02-03 12:46     ` Michal Hocko
@ 2022-02-03 19:03       ` Waiman Long
  2022-02-07 17:20         ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2022-02-03 19:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini

On 2/3/22 07:46, Michal Hocko wrote:
> On Wed 02-02-22 15:30:35, Waiman Long wrote:
> [...]
>> +#ifdef CONFIG_MEMCG
>> +	unsigned long memcg_data;
>> +	struct mem_cgroup *memcg;
>> +	bool online;
>> +	char name[80];
>> +
>> +	rcu_read_lock();
>> +	memcg_data = READ_ONCE(page->memcg_data);
>> +	if (!memcg_data)
>> +		goto out_unlock;
>> +
>> +	if (memcg_data & MEMCG_DATA_OBJCGS)
>> +		ret += scnprintf(kbuf + ret, count - ret,
>> +				"Slab cache page\n");
>> +
>> +	memcg = page_memcg_check(page);
>> +	if (!memcg)
>> +		goto out_unlock;
>> +
>> +	online = (memcg->css.flags & CSS_ONLINE);
>> +	cgroup_name(memcg->css.cgroup, name, sizeof(name));
> Is there any specific reason to use another buffer allocated on the
> stack? Also 80B seems too short to cover NAME_MAX.
>
> Nothing else jumped at me.

I suppose we can print directly into kbuf with cgroup_name(), but using 
a separate buffer is easier to read and understand. 79 characters should 
be enough for most cgroup names. Some auto-generated names with some 
kind of embedded uuids may be longer than that, but the random sequence 
of hex digits that may be missing do not convey much information for 
identification purpose. We can always increase the buffer length later 
if it turns out to be an issue.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 3/4] mm/page_owner: Print memcg information
  2022-02-03 19:03       ` Waiman Long
@ 2022-02-07 17:20         ` Michal Hocko
  2022-02-07 19:09           ` Andrew Morton
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2022-02-07 17:20 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini

On Thu 03-02-22 14:03:58, Waiman Long wrote:
> On 2/3/22 07:46, Michal Hocko wrote:
> > On Wed 02-02-22 15:30:35, Waiman Long wrote:
> > [...]
> > > +#ifdef CONFIG_MEMCG
> > > +	unsigned long memcg_data;
> > > +	struct mem_cgroup *memcg;
> > > +	bool online;
> > > +	char name[80];
> > > +
> > > +	rcu_read_lock();
> > > +	memcg_data = READ_ONCE(page->memcg_data);
> > > +	if (!memcg_data)
> > > +		goto out_unlock;
> > > +
> > > +	if (memcg_data & MEMCG_DATA_OBJCGS)
> > > +		ret += scnprintf(kbuf + ret, count - ret,
> > > +				"Slab cache page\n");
> > > +
> > > +	memcg = page_memcg_check(page);
> > > +	if (!memcg)
> > > +		goto out_unlock;
> > > +
> > > +	online = (memcg->css.flags & CSS_ONLINE);
> > > +	cgroup_name(memcg->css.cgroup, name, sizeof(name));
> > Is there any specific reason to use another buffer allocated on the
> > stack? Also 80B seems too short to cover NAME_MAX.
> > 
> > Nothing else jumped at me.
> 
> I suppose we can print directly into kbuf with cgroup_name(), but using a
> separate buffer is easier to read and understand. 79 characters should be
> enough for most cgroup names. Some auto-generated names with some kind of
> embedded uuids may be longer than that, but the random sequence of hex
> digits that may be missing do not convey much information for identification
> purpose. We can always increase the buffer length later if it turns out to
> be an issue.

Cutting a name short sounds like a source of confusion and there doesn't
seem to be any good reason for that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 3/4] mm/page_owner: Print memcg information
  2022-02-07 17:20         ` Michal Hocko
@ 2022-02-07 19:09           ` Andrew Morton
  2022-02-07 19:33             ` Waiman Long
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Morton @ 2022-02-07 19:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Waiman Long, Johannes Weiner, Vladimir Davydov, Petr Mladek,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini

On Mon, 7 Feb 2022 18:20:04 +0100 Michal Hocko <mhocko@suse.com> wrote:

> On Thu 03-02-22 14:03:58, Waiman Long wrote:
> > On 2/3/22 07:46, Michal Hocko wrote:
> > > On Wed 02-02-22 15:30:35, Waiman Long wrote:
> > > [...]
> ...
> > > > +	online = (memcg->css.flags & CSS_ONLINE);
> > > > +	cgroup_name(memcg->css.cgroup, name, sizeof(name));
> > > Is there any specific reason to use another buffer allocated on the
> > > stack? Also 80B seems too short to cover NAME_MAX.
> > > 
> > > Nothing else jumped at me.
> > 
> > I suppose we can print directly into kbuf with cgroup_name(), but using a
> > separate buffer is easier to read and understand. 79 characters should be
> > enough for most cgroup names. Some auto-generated names with some kind of
> > embedded uuids may be longer than that, but the random sequence of hex
> > digits that may be missing do not convey much information for identification
> > purpose. We can always increase the buffer length later if it turns out to
> > be an issue.
> 
> Cutting a name short sounds like a source of confusion and there doesn't
> seem to be any good reason for that.

Yes.  If we give them 79 characters, someone will go and want 94.  If
we can prevent this once and for ever, let's please do so.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 3/4] mm/page_owner: Print memcg information
  2022-02-07 19:09           ` Andrew Morton
@ 2022-02-07 19:33             ` Waiman Long
  0 siblings, 0 replies; 39+ messages in thread
From: Waiman Long @ 2022-02-07 19:33 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Petr Mladek, Steven Rostedt,
	Sergey Senozhatsky, Andy Shevchenko, Rasmus Villemoes,
	linux-kernel, cgroups, linux-mm, Ira Weiny, Mike Rapoport,
	David Rientjes, Roman Gushchin, Rafael Aquini


On 2/7/22 14:09, Andrew Morton wrote:
> On Mon, 7 Feb 2022 18:20:04 +0100 Michal Hocko <mhocko@suse.com> wrote:
>
>> On Thu 03-02-22 14:03:58, Waiman Long wrote:
>>> On 2/3/22 07:46, Michal Hocko wrote:
>>>> On Wed 02-02-22 15:30:35, Waiman Long wrote:
>>>> [...]
>> ...
>>>>> +	online = (memcg->css.flags & CSS_ONLINE);
>>>>> +	cgroup_name(memcg->css.cgroup, name, sizeof(name));
>>>> Is there any specific reason to use another buffer allocated on the
>>>> stack? Also 80B seems too short to cover NAME_MAX.
>>>>
>>>> Nothing else jumped at me.
>>> I suppose we can print directly into kbuf with cgroup_name(), but using a
>>> separate buffer is easier to read and understand. 79 characters should be
>>> enough for most cgroup names. Some auto-generated names with some kind of
>>> embedded uuids may be longer than that, but the random sequence of hex
>>> digits that may be missing do not convey much information for identification
>>> purpose. We can always increase the buffer length later if it turns out to
>>> be an issue.
>> Cutting a name short sounds like a source of confusion and there doesn't
>> seem to be any good reason for that.
> Yes.  If we give them 79 characters, someone will go and want 94.  If
> we can prevent this once and for ever, let's please do so.

Sure. Will send a version with that change.

Cheers,
Longman

>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 1/4] lib/vsprintf: Avoid redundant work with 0 size
  2022-02-02 20:30   ` [PATCH v4 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
@ 2022-02-08 10:08     ` Petr Mladek
  0 siblings, 0 replies; 39+ messages in thread
From: Petr Mladek @ 2022-02-08 10:08 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Steven Rostedt, Sergey Senozhatsky, Andy Shevchenko,
	Rasmus Villemoes, linux-kernel, cgroups, linux-mm, Ira Weiny,
	Mike Rapoport, David Rientjes, Roman Gushchin, Rafael Aquini

On Wed 2022-02-02 15:30:33, Waiman Long wrote:
> For *scnprintf(), vsnprintf() is always called even if the input size is
> 0. That is a waste of time, so just return 0 in this case.
> 
> Note that vsnprintf() will never return -1 to indicate an error. So
> skipping the call to vsnprintf() when size is 0 will have no functional
> impact at all.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> Acked-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Petr Mladek <pmladek@suse.com>

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check
  2022-02-03 18:49       ` Waiman Long
@ 2022-02-08 10:51         ` Petr Mladek
  0 siblings, 0 replies; 39+ messages in thread
From: Petr Mladek @ 2022-02-08 10:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Vlastimil Babka, Johannes Weiner, Michal Hocko, Vladimir Davydov,
	Andrew Morton, Steven Rostedt, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, linux-kernel, cgroups,
	linux-mm, Ira Weiny, Mike Rapoport, David Rientjes,
	Roman Gushchin, Rafael Aquini

On Thu 2022-02-03 13:49:02, Waiman Long wrote:
> On 2/3/22 10:46, Vlastimil Babka wrote:
> > On 2/2/22 21:30, Waiman Long wrote:
> > > The snprintf() function can return a length greater than the given
> > > input size. That will require a check for buffer overrun after each
> > > invocation of snprintf(). scnprintf(), on the other hand, will never
> > > return a greater length. By using scnprintf() in selected places, we
> > > can avoid some buffer overrun checks except after stack_depot_snprint()
> > > and after the last snprintf().
> > > 
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > Acked-by: David Rientjes <rientjes@google.com>
> > > Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> > Looks like this will work, but note that if the purpose of patch 1/4 was
> > that after the first scnprintf() that overflows the following calls will be
> > short-cut thanks to passing the size as 0, AFAICS that won't work. Because
> > scnprintf() returns the number without trailing zero, 'ret' will be 'count -
> > 1' after the overflow, so 'count - ret' will be 1, never 0.
> 
> Yes, I am aware of that. Patch 1 is just a micro-optimization for the very
> rare case.

In theory, we might micro-optimize also the case when "size == 1".

Well, I am not sure if it is worth it. After all, the primary use-case
is to print the message into a big-enough buffer. Lost information is
a bigger problem than the speed ;-)

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2022-02-08 11:31 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-31 19:23 [PATCH v3 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
2022-01-31 19:23 ` [PATCH v3 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
2022-01-31 20:42   ` Mike Rapoport
2022-01-31 19:23 ` [PATCH v3 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
2022-01-31 20:38   ` Roman Gushchin
2022-01-31 20:43   ` Mike Rapoport
2022-01-31 19:23 ` [PATCH v3 3/4] mm/page_owner: Print memcg information Waiman Long
2022-01-31 20:51   ` Mike Rapoport
2022-01-31 21:43     ` Waiman Long
2022-02-01  6:23       ` Mike Rapoport
2022-01-31 20:51   ` Roman Gushchin
2022-02-01 10:54   ` Michal Hocko
2022-02-01 17:04     ` Waiman Long
2022-02-02  8:49       ` Michal Hocko
2022-02-02 16:12         ` Waiman Long
2022-01-31 19:23 ` [PATCH v3 4/4] mm/page_owner: Record task command name Waiman Long
2022-01-31 20:54   ` Roman Gushchin
2022-01-31 21:46     ` Waiman Long
2022-01-31 22:03   ` [PATCH v4 " Waiman Long
2022-02-01 15:28     ` Michal Hocko
2022-02-02 16:53       ` Waiman Long
2022-02-03 12:10         ` Vlastimil Babka
2022-02-03 18:53           ` Waiman Long
2022-02-02 20:30   ` [PATCH v4 0/4] mm/page_owner: Extend page_owner to show memcg information Waiman Long
2022-02-02 23:06     ` Rafael Aquini
2022-02-02 20:30   ` [PATCH v4 1/4] lib/vsprintf: Avoid redundant work with 0 size Waiman Long
2022-02-08 10:08     ` Petr Mladek
2022-02-02 20:30   ` [PATCH v4 2/4] mm/page_owner: Use scnprintf() to avoid excessive buffer overrun check Waiman Long
2022-02-03 15:46     ` Vlastimil Babka
2022-02-03 18:49       ` Waiman Long
2022-02-08 10:51         ` Petr Mladek
2022-02-02 20:30   ` [PATCH v4 3/4] mm/page_owner: Print memcg information Waiman Long
2022-02-03  6:53     ` Mike Rapoport
2022-02-03 12:46     ` Michal Hocko
2022-02-03 19:03       ` Waiman Long
2022-02-07 17:20         ` Michal Hocko
2022-02-07 19:09           ` Andrew Morton
2022-02-07 19:33             ` Waiman Long
2022-02-02 20:30   ` [PATCH v4 4/4] mm/page_owner: Record task command name Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).