linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC v2] mm: introduce page pin owner
@ 2021-12-28 17:59 Minchan Kim
  2022-01-06 18:14 ` Minchan Kim
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Minchan Kim @ 2021-12-28 17:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, David Hildenbrand, linux-mm, LKML,
	Suren Baghdasaryan, John Dias, huww98, John Hubbard, Minchan Kim

A Contiguous Memory Allocator(CMA) allocation can fail if any page
within the requested range has an elevated refcount(a pinned page).

Debugging such failures is difficult, because the struct pages only
show a combined refcount, and do not show the callstacks or
backtraces of the code that acquired each refcount. So the source
of the page pins remains a mystery, at the time of CMA failure.

In order to solve this without adding too much overhead, just do
nothing most of the time, which is pretty low overhead. However,
once a CMA failure occurs, then mark the page (this requires a
pointer's worth of space in struct page, but it uses page extensions
to get that), and start tracing the subsequent put_page() calls.
As the program finishes up, each page pin will be undone, and
traced with a backtrace. The programmer reads the trace output and
sees the list of all page pinning code paths.

This will consume an additional 8 bytes per 4KB page, or an
additional 0.2% of RAM. In addition to the storage space, it will
have some performance cost, due to increasing the size of struct
page so that it is greater than the cacheline size (or multiples
thereof) of popular (x86, ...) CPUs.

The idea can apply every user of migrate_pages as well as CMA to
know the reason why the page migration failed. To support it,
the implementation takes "enum migrate_reason" string as filter
of the tracepoint(see below).

Usage)

trace_dir="/sys/kernel/tracing"
echo 1 > $trace_dir/options/stacktrace
echo 1 > $trace_dir/events/page_pin_owner/enable
echo "contig_range" > $trace_dir/events/page_pin_owner/report_page_pinners/filter

===
example:
If you are interested in compaction failure, you want to use
"compaction" as a filter instead of "contig_range".
For the other filter strings, you can refer migrate_reason_names
in mm/debug.c
===

..
run workload
..

cat $trace_dir/trace
..

            bash-7442    [007] ...1.   288.504690: report_page_pinners: pfn=0x91430 flags=uptodate|swapbacked count=2 mapcount=0 mapping=ffff88f5480a2709 mt=5 reason=contig_range err=-11
            bash-7442    [007] ...1.   288.504691: <stack trace>
 => trace_event_raw_event_report_page_pinners
 => __report_page_pinners
 => migrate_pages
 => alloc_contig_range
 => cma_alloc.cold
 => cma_alloc_write
 => simple_attr_write
 => debugfs_attr_write
 => full_proxy_write
 => vfs_write
 => ksys_write
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
..
..

            bash-7442    [007] ...1.   288.506426: page_pin_owner_put: pfn=0x91430 flags=uptodate|swapbacked count=2 mapcount=0 mapping=ffff88f5480a2709 mt=5
            bash-7442    [007] ...1.   288.506427: <stack trace>
 => trace_event_raw_event_page_pin_owner_put
 => __page_pin_owner_put
 => put_page
 => putback_movable_pages
 => alloc_contig_range
 => cma_alloc.cold
 => cma_alloc_write
 => simple_attr_write
 => debugfs_attr_write
 => full_proxy_write
 => vfs_write
 => ksys_write
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
..
             cc1-31104   [001] d..2.   288.507632: page_pin_owner_put: pfn=0x91430 flags=uptodate|swapbacked count=1 mapcount=0 mapping=ffff88f5480a2709 mt=4
             cc1-31104   [001] d..2.   288.507636: <stack trace>
 => trace_event_raw_event_page_pin_owner_put
 => __page_pin_owner_put
 => release_pages
 => tlb_flush_mmu
 => unmap_page_range
 => unmap_vmas
 => exit_mmap
 => mmput
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe
..

            make-31157   [007] d..3.   288.508333: page_pin_owner_put: pfn=0x91430 flags=uptodate|lru|swapbacked count=0 mapcount=0 mapping=ffff88f5480a2709 mt=5
            make-31157   [007] d..3.   288.508335: <stack trace>
 => trace_event_raw_event_page_pin_owner_put
 => __page_pin_owner_put
 => release_pages
 => __pagevec_lru_add
 => lru_add_drain_cpu
 => lru_add_drain
 => exit_mmap
 => mmput
 => begin_new_exec
 => load_elf_binary
 => bprm_execve
 => do_execveat_common.isra.0
 => __x64_sys_execve
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
* from v1 - https://lore.kernel.org/lkml/20211206184730.858850-1-minchan@kernel.org/
  * PagePinOwner naming suggestion - jhubbard@
  * Description/Kconfig suggestion - jhubbard@
  * General migrate_page supports - huww98@

 include/linux/mm.h                    |  7 ++-
 include/linux/page_ext.h              |  3 +
 include/linux/page_pin_owner.h        | 48 ++++++++++++++
 include/trace/events/page_pin_owner.h | 81 ++++++++++++++++++++++++
 mm/Kconfig.debug                      | 15 +++++
 mm/Makefile                           |  1 +
 mm/migrate.c                          |  5 +-
 mm/page_alloc.c                       |  3 +
 mm/page_ext.c                         |  4 ++
 mm/page_pin_owner.c                   | 91 +++++++++++++++++++++++++++
 10 files changed, 256 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/page_pin_owner.h
 create mode 100644 include/trace/events/page_pin_owner.h
 create mode 100644 mm/page_pin_owner.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 82be88c1fdbb..5c437faf129c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -26,6 +26,7 @@
 #include <linux/err.h>
 #include <linux/page-flags.h>
 #include <linux/page_ref.h>
+#include <linux/page_pin_owner.h>
 #include <linux/memremap.h>
 #include <linux/overflow.h>
 #include <linux/sizes.h>
@@ -714,8 +715,12 @@ static inline unsigned int folio_order(struct folio *folio)
  */
 static inline int put_page_testzero(struct page *page)
 {
+	int ret;
+
 	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
-	return page_ref_dec_and_test(page);
+	ret = page_ref_dec_and_test(page);
+	page_pin_owner_put(page);
+	return ret;
 }
 
 static inline int folio_put_testzero(struct folio *folio)
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index fabb2e1e087f..3236efd5ab07 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -23,6 +23,9 @@ enum page_ext_flags {
 	PAGE_EXT_YOUNG,
 	PAGE_EXT_IDLE,
 #endif
+#if defined(CONFIG_PAGE_PIN_OWNER)
+	PAGE_EXT_PIN_OWNER,
+#endif
 };
 
 /*
diff --git a/include/linux/page_pin_owner.h b/include/linux/page_pin_owner.h
new file mode 100644
index 000000000000..e68adcdd6799
--- /dev/null
+++ b/include/linux/page_pin_owner.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_PAGE_PIN_OWNER_H
+#define __LINUX_PAGE_PIN_OWNER_H
+
+#include <linux/jump_label.h>
+
+#ifdef CONFIG_PAGE_PIN_OWNER
+extern struct static_key_false page_pin_owner_inited;
+extern struct page_ext_operations page_pin_owner_ops;
+
+void __report_page_pinners(struct page *page, int reason, int err);
+void __page_pin_owner_put(struct page *page);
+void __reset_page_pin_owner(struct page *page, unsigned int order);
+
+static inline void reset_page_pin_owner(struct page *page, unsigned int order)
+{
+	if (static_branch_unlikely(&page_pin_owner_inited))
+		__reset_page_pin_owner(page, order);
+}
+
+static inline void report_page_pinners(struct page *page, int reason, int err)
+{
+	if (!static_branch_unlikely(&page_pin_owner_inited))
+		return;
+
+	__report_page_pinners(page, reason, err);
+}
+
+static inline void page_pin_owner_put(struct page *page)
+{
+	if (!static_branch_unlikely(&page_pin_owner_inited))
+		return;
+
+	__page_pin_owner_put(page);
+}
+
+#else
+static inline void reset_page_pin_owner(struct page *page, unsigned int order)
+{
+}
+static inline void report_page_pinners(struct page *page, int reason, int err)
+{
+}
+static inline void page_pin_owner_put(struct page *page)
+{
+}
+#endif /* CONFIG_PAGE_PIN_OWNER */
+#endif /*__LINUX_PAGE_PIN_OWNER_H */
diff --git a/include/trace/events/page_pin_owner.h b/include/trace/events/page_pin_owner.h
new file mode 100644
index 000000000000..0096297312f5
--- /dev/null
+++ b/include/trace/events/page_pin_owner.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM page_pin_owner
+
+#if !defined(_TRACE_PAGE_PIN_OWNER_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGE_PIN_OWNER_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+#include <trace/events/mmflags.h>
+
+TRACE_EVENT(page_pin_owner_put,
+
+	TP_PROTO(struct page *page),
+
+	TP_ARGS(page),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+		__field(unsigned long, flags)
+		__field(int, count)
+		__field(int, mapcount)
+		__field(void *, mapping)
+		__field(int, mt)
+		),
+
+	TP_fast_assign(
+		__entry->pfn = page_to_pfn(page);
+		__entry->flags = page->flags;
+		__entry->count = page_ref_count(page);
+		__entry->mapcount = page_mapcount(page);
+		__entry->mapping = page->mapping;
+		__entry->mt = get_pageblock_migratetype(page);
+		),
+
+	TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d",
+			__entry->pfn,
+			show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)),
+			__entry->count,
+			__entry->mapcount, __entry->mapping, __entry->mt)
+);
+
+TRACE_EVENT(report_page_pinners,
+
+	TP_PROTO(struct page *page, const char *reason, int err),
+
+	TP_ARGS(page, reason, err),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+		__field(unsigned long, flags)
+		__field(int, count)
+		__field(int, mapcount)
+		__field(void *, mapping)
+		__field(int, mt)
+		__field(const char *, reason)
+		__field(int, err)
+		),
+
+	TP_fast_assign(
+		__entry->pfn = page_to_pfn(page);
+		__entry->flags = page->flags;
+		__entry->count = page_ref_count(page);
+		__entry->mapcount = page_mapcount(page);
+		__entry->mapping = page->mapping;
+		__entry->mt = get_pageblock_migratetype(page);
+		__entry->reason = reason;
+		__entry->err = err;
+		),
+
+	TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d reason=%s err=%d",
+			__entry->pfn,
+			show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)),
+			__entry->count, __entry->mapcount, __entry->mapping,
+			__entry->mt, __entry->reason, __entry->err)
+);
+
+#endif /* _TRACE_PAGE_PIN_OWNER_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 5bd5bb097252..ed2d5e6450b7 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -86,6 +86,21 @@ config PAGE_TABLE_CHECK_ENFORCED
 
 	  If unsure say "n".
 
+config PAGE_PIN_OWNER
+	bool "Track page pinner"
+	depends on DEBUG_KERNEL && TRACEPOINTS && STACKTRACE_SUPPORT
+	select PAGE_EXTENSION
+	select STACKTRACE
+	help
+	  This keeps track of what call chain is the pinner of a page, may
+	  help to find contiguous page allocation(memory-hotplug, compaction,
+	  cma and so on) failure. Even if you include this feature in your
+	  build, it is disabled by default. You should pass "page_pin_owner=on"
+	  to boot parameter in order to enable it. Eats a fair amount of memory
+	  if enabled.
+
+	  If unsure, say N.
+
 config PAGE_POISONING
 	bool "Poison pages after freeing"
 	help
diff --git a/mm/Makefile b/mm/Makefile
index 588d3113f3b0..ca89f4fa38ce 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -104,6 +104,7 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
 obj-$(CONFIG_DEBUG_VM_PGTABLE) += debug_vm_pgtable.o
 obj-$(CONFIG_PAGE_OWNER) += page_owner.o
+obj-$(CONFIG_PAGE_PIN_OWNER) += page_pin_owner.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
 obj-$(CONFIG_ZPOOL)	+= zpool.o
diff --git a/mm/migrate.c b/mm/migrate.c
index c7da064b4781..aa8a2c21da8e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1105,6 +1105,8 @@ static int unmap_and_move(new_page_t get_new_page,
 	rc = __unmap_and_move(page, newpage, force, mode);
 	if (rc == MIGRATEPAGE_SUCCESS)
 		set_page_owner_migrate_reason(newpage, reason);
+	else
+		report_page_pinners(page, reason, rc);
 
 out:
 	if (rc != -EAGAIN) {
@@ -1273,7 +1275,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	if (rc == MIGRATEPAGE_SUCCESS) {
 		move_hugetlb_state(hpage, new_hpage, reason);
 		put_new_page = NULL;
-	}
+	} else
+		report_page_pinners(hpage, reason, rc);
 
 out_unlock:
 	unlock_page(hpage);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b5d62e1c8d81..98125b1b6e7e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -64,6 +64,7 @@
 #include <linux/sched/rt.h>
 #include <linux/sched/mm.h>
 #include <linux/page_owner.h>
+#include <linux/page_pin_owner.h>
 #include <linux/page_table_check.h>
 #include <linux/kthread.h>
 #include <linux/memcontrol.h>
@@ -1310,6 +1311,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 		if (memcg_kmem_enabled() && PageMemcgKmem(page))
 			__memcg_kmem_uncharge_page(page, order);
 		reset_page_owner(page, order);
+		reset_page_pin_owner(page, order);
 		page_table_check_free(page, order);
 		return false;
 	}
@@ -1350,6 +1352,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 	page_cpupid_reset_last(page);
 	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	reset_page_owner(page, order);
+	reset_page_pin_owner(page, order);
 	page_table_check_free(page, order);
 
 	if (!PageHighMem(page)) {
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 2e66d934d63f..3c0df97b9b8b 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -7,6 +7,7 @@
 #include <linux/vmalloc.h>
 #include <linux/kmemleak.h>
 #include <linux/page_owner.h>
+#include <linux/page_pin_owner.h>
 #include <linux/page_idle.h>
 #include <linux/page_table_check.h>
 
@@ -79,6 +80,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
 #ifdef CONFIG_PAGE_TABLE_CHECK
 	&page_table_check_ops,
 #endif
+#ifdef CONFIG_PAGE_PIN_OWNER
+	&page_pin_owner_ops,
+#endif
 };
 
 unsigned long page_ext_size = sizeof(struct page_ext);
diff --git a/mm/page_pin_owner.c b/mm/page_pin_owner.c
new file mode 100644
index 000000000000..736617df093c
--- /dev/null
+++ b/mm/page_pin_owner.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/page_pin_owner.h>
+#include <linux/migrate.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/page_pin_owner.h>
+
+static bool page_pin_owner_enabled;
+DEFINE_STATIC_KEY_FALSE(page_pin_owner_inited);
+EXPORT_SYMBOL(page_pin_owner_inited);
+
+static int __init early_page_pin_owner_param(char *buf)
+{
+	return kstrtobool(buf, &page_pin_owner_enabled);
+}
+early_param("page_pin_owner", early_page_pin_owner_param);
+
+static bool need_page_pin_owner(void)
+{
+	return page_pin_owner_enabled;
+}
+
+static void init_page_pin_owner(void)
+{
+	if (!page_pin_owner_enabled)
+		return;
+
+	static_branch_enable(&page_pin_owner_inited);
+}
+
+struct page_ext_operations page_pin_owner_ops = {
+	.need = need_page_pin_owner,
+	.init = init_page_pin_owner,
+};
+
+void __reset_page_pin_owner(struct page *page, unsigned int order)
+{
+	struct page_ext *page_ext;
+	int i;
+
+	page_ext = lookup_page_ext(page);
+	if (unlikely(!page_ext))
+		return;
+
+	for (i = 0; i < (1 << order); i++) {
+		if (!test_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags))
+			break;
+
+		clear_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags);
+		page_ext = page_ext_next(page_ext);
+	}
+}
+
+void __report_page_pinners(struct page *page, int reason, int err)
+{
+	struct page_ext *page_ext;
+
+	page_ext = lookup_page_ext(page);
+	if (unlikely(!page_ext))
+		return;
+
+	test_and_set_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags);
+	trace_report_page_pinners(page, migrate_reason_names[reason], err);
+}
+
+void __page_pin_owner_put(struct page *page)
+{
+	struct page_ext *page_ext = lookup_page_ext(page);
+
+	if (unlikely(!page_ext))
+		return;
+
+	if (!test_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags))
+		return;
+
+	trace_page_pin_owner_put(page);
+}
+EXPORT_SYMBOL(__page_pin_owner_put);
+
+static int __init page_pin_owner_init(void)
+{
+	if (!static_branch_unlikely(&page_pin_owner_inited)) {
+		pr_info("page_pin_owner is disabled\n");
+		return 0;
+	}
+
+	pr_info("page_pin_owner is enabled\n");
+	return 0;
+}
+late_initcall(page_pin_owner_init)
-- 
2.34.1.448.ga2b2bfdf31-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2021-12-28 17:59 [RFC v2] mm: introduce page pin owner Minchan Kim
@ 2022-01-06 18:14 ` Minchan Kim
  2022-01-06 22:27 ` John Hubbard
  2022-01-12 12:25 ` David Hildenbrand
  2 siblings, 0 replies; 15+ messages in thread
From: Minchan Kim @ 2022-01-06 18:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, David Hildenbrand, linux-mm, LKML,
	Suren Baghdasaryan, John Dias, huww98, John Hubbard

On Tue, Dec 28, 2021 at 09:59:04AM -0800, Minchan Kim wrote:
> A Contiguous Memory Allocator(CMA) allocation can fail if any page
> within the requested range has an elevated refcount(a pinned page).
> 
> Debugging such failures is difficult, because the struct pages only
> show a combined refcount, and do not show the callstacks or
> backtraces of the code that acquired each refcount. So the source
> of the page pins remains a mystery, at the time of CMA failure.
> 
> In order to solve this without adding too much overhead, just do
> nothing most of the time, which is pretty low overhead. However,
> once a CMA failure occurs, then mark the page (this requires a
> pointer's worth of space in struct page, but it uses page extensions
> to get that), and start tracing the subsequent put_page() calls.
> As the program finishes up, each page pin will be undone, and
> traced with a backtrace. The programmer reads the trace output and
> sees the list of all page pinning code paths.
> 
> This will consume an additional 8 bytes per 4KB page, or an
> additional 0.2% of RAM. In addition to the storage space, it will
> have some performance cost, due to increasing the size of struct
> page so that it is greater than the cacheline size (or multiples
> thereof) of popular (x86, ...) CPUs.
> 
> The idea can apply every user of migrate_pages as well as CMA to
> know the reason why the page migration failed. To support it,
> the implementation takes "enum migrate_reason" string as filter
> of the tracepoint(see below).

Hi Folks,

Any comment to proceed the work?

Thanks.

> 
> Usage)
> 
> trace_dir="/sys/kernel/tracing"
> echo 1 > $trace_dir/options/stacktrace
> echo 1 > $trace_dir/events/page_pin_owner/enable
> echo "contig_range" > $trace_dir/events/page_pin_owner/report_page_pinners/filter
> 
> ===
> example:
> If you are interested in compaction failure, you want to use
> "compaction" as a filter instead of "contig_range".
> For the other filter strings, you can refer migrate_reason_names
> in mm/debug.c
> ===
> 
> ..
> run workload
> ..
> 
> cat $trace_dir/trace
> ..
> 
>             bash-7442    [007] ...1.   288.504690: report_page_pinners: pfn=0x91430 flags=uptodate|swapbacked count=2 mapcount=0 mapping=ffff88f5480a2709 mt=5 reason=contig_range err=-11
>             bash-7442    [007] ...1.   288.504691: <stack trace>
>  => trace_event_raw_event_report_page_pinners
>  => __report_page_pinners
>  => migrate_pages
>  => alloc_contig_range
>  => cma_alloc.cold
>  => cma_alloc_write
>  => simple_attr_write
>  => debugfs_attr_write
>  => full_proxy_write
>  => vfs_write
>  => ksys_write
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
> ..
> ..
> 
>             bash-7442    [007] ...1.   288.506426: page_pin_owner_put: pfn=0x91430 flags=uptodate|swapbacked count=2 mapcount=0 mapping=ffff88f5480a2709 mt=5
>             bash-7442    [007] ...1.   288.506427: <stack trace>
>  => trace_event_raw_event_page_pin_owner_put
>  => __page_pin_owner_put
>  => put_page
>  => putback_movable_pages
>  => alloc_contig_range
>  => cma_alloc.cold
>  => cma_alloc_write
>  => simple_attr_write
>  => debugfs_attr_write
>  => full_proxy_write
>  => vfs_write
>  => ksys_write
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
> ..
>              cc1-31104   [001] d..2.   288.507632: page_pin_owner_put: pfn=0x91430 flags=uptodate|swapbacked count=1 mapcount=0 mapping=ffff88f5480a2709 mt=4
>              cc1-31104   [001] d..2.   288.507636: <stack trace>
>  => trace_event_raw_event_page_pin_owner_put
>  => __page_pin_owner_put
>  => release_pages
>  => tlb_flush_mmu
>  => unmap_page_range
>  => unmap_vmas
>  => exit_mmap
>  => mmput
>  => do_exit
>  => do_group_exit
>  => __x64_sys_exit_group
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
> ..
> 
>             make-31157   [007] d..3.   288.508333: page_pin_owner_put: pfn=0x91430 flags=uptodate|lru|swapbacked count=0 mapcount=0 mapping=ffff88f5480a2709 mt=5
>             make-31157   [007] d..3.   288.508335: <stack trace>
>  => trace_event_raw_event_page_pin_owner_put
>  => __page_pin_owner_put
>  => release_pages
>  => __pagevec_lru_add
>  => lru_add_drain_cpu
>  => lru_add_drain
>  => exit_mmap
>  => mmput
>  => begin_new_exec
>  => load_elf_binary
>  => bprm_execve
>  => do_execveat_common.isra.0
>  => __x64_sys_execve
>  => do_syscall_64
>  => entry_SYSCALL_64_after_hwframe
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> * from v1 - https://lore.kernel.org/lkml/20211206184730.858850-1-minchan@kernel.org/
>   * PagePinOwner naming suggestion - jhubbard@
>   * Description/Kconfig suggestion - jhubbard@
>   * General migrate_page supports - huww98@
> 
>  include/linux/mm.h                    |  7 ++-
>  include/linux/page_ext.h              |  3 +
>  include/linux/page_pin_owner.h        | 48 ++++++++++++++
>  include/trace/events/page_pin_owner.h | 81 ++++++++++++++++++++++++
>  mm/Kconfig.debug                      | 15 +++++
>  mm/Makefile                           |  1 +
>  mm/migrate.c                          |  5 +-
>  mm/page_alloc.c                       |  3 +
>  mm/page_ext.c                         |  4 ++
>  mm/page_pin_owner.c                   | 91 +++++++++++++++++++++++++++
>  10 files changed, 256 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/page_pin_owner.h
>  create mode 100644 include/trace/events/page_pin_owner.h
>  create mode 100644 mm/page_pin_owner.c
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 82be88c1fdbb..5c437faf129c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -26,6 +26,7 @@
>  #include <linux/err.h>
>  #include <linux/page-flags.h>
>  #include <linux/page_ref.h>
> +#include <linux/page_pin_owner.h>
>  #include <linux/memremap.h>
>  #include <linux/overflow.h>
>  #include <linux/sizes.h>
> @@ -714,8 +715,12 @@ static inline unsigned int folio_order(struct folio *folio)
>   */
>  static inline int put_page_testzero(struct page *page)
>  {
> +	int ret;
> +
>  	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> -	return page_ref_dec_and_test(page);
> +	ret = page_ref_dec_and_test(page);
> +	page_pin_owner_put(page);
> +	return ret;
>  }
>  
>  static inline int folio_put_testzero(struct folio *folio)
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index fabb2e1e087f..3236efd5ab07 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -23,6 +23,9 @@ enum page_ext_flags {
>  	PAGE_EXT_YOUNG,
>  	PAGE_EXT_IDLE,
>  #endif
> +#if defined(CONFIG_PAGE_PIN_OWNER)
> +	PAGE_EXT_PIN_OWNER,
> +#endif
>  };
>  
>  /*
> diff --git a/include/linux/page_pin_owner.h b/include/linux/page_pin_owner.h
> new file mode 100644
> index 000000000000..e68adcdd6799
> --- /dev/null
> +++ b/include/linux/page_pin_owner.h
> @@ -0,0 +1,48 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __LINUX_PAGE_PIN_OWNER_H
> +#define __LINUX_PAGE_PIN_OWNER_H
> +
> +#include <linux/jump_label.h>
> +
> +#ifdef CONFIG_PAGE_PIN_OWNER
> +extern struct static_key_false page_pin_owner_inited;
> +extern struct page_ext_operations page_pin_owner_ops;
> +
> +void __report_page_pinners(struct page *page, int reason, int err);
> +void __page_pin_owner_put(struct page *page);
> +void __reset_page_pin_owner(struct page *page, unsigned int order);
> +
> +static inline void reset_page_pin_owner(struct page *page, unsigned int order)
> +{
> +	if (static_branch_unlikely(&page_pin_owner_inited))
> +		__reset_page_pin_owner(page, order);
> +}
> +
> +static inline void report_page_pinners(struct page *page, int reason, int err)
> +{
> +	if (!static_branch_unlikely(&page_pin_owner_inited))
> +		return;
> +
> +	__report_page_pinners(page, reason, err);
> +}
> +
> +static inline void page_pin_owner_put(struct page *page)
> +{
> +	if (!static_branch_unlikely(&page_pin_owner_inited))
> +		return;
> +
> +	__page_pin_owner_put(page);
> +}
> +
> +#else
> +static inline void reset_page_pin_owner(struct page *page, unsigned int order)
> +{
> +}
> +static inline void report_page_pinners(struct page *page, int reason, int err)
> +{
> +}
> +static inline void page_pin_owner_put(struct page *page)
> +{
> +}
> +#endif /* CONFIG_PAGE_PIN_OWNER */
> +#endif /*__LINUX_PAGE_PIN_OWNER_H */
> diff --git a/include/trace/events/page_pin_owner.h b/include/trace/events/page_pin_owner.h
> new file mode 100644
> index 000000000000..0096297312f5
> --- /dev/null
> +++ b/include/trace/events/page_pin_owner.h
> @@ -0,0 +1,81 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM page_pin_owner
> +
> +#if !defined(_TRACE_PAGE_PIN_OWNER_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_PAGE_PIN_OWNER_H
> +
> +#include <linux/types.h>
> +#include <linux/tracepoint.h>
> +#include <trace/events/mmflags.h>
> +
> +TRACE_EVENT(page_pin_owner_put,
> +
> +	TP_PROTO(struct page *page),
> +
> +	TP_ARGS(page),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, pfn)
> +		__field(unsigned long, flags)
> +		__field(int, count)
> +		__field(int, mapcount)
> +		__field(void *, mapping)
> +		__field(int, mt)
> +		),
> +
> +	TP_fast_assign(
> +		__entry->pfn = page_to_pfn(page);
> +		__entry->flags = page->flags;
> +		__entry->count = page_ref_count(page);
> +		__entry->mapcount = page_mapcount(page);
> +		__entry->mapping = page->mapping;
> +		__entry->mt = get_pageblock_migratetype(page);
> +		),
> +
> +	TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d",
> +			__entry->pfn,
> +			show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)),
> +			__entry->count,
> +			__entry->mapcount, __entry->mapping, __entry->mt)
> +);
> +
> +TRACE_EVENT(report_page_pinners,
> +
> +	TP_PROTO(struct page *page, const char *reason, int err),
> +
> +	TP_ARGS(page, reason, err),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, pfn)
> +		__field(unsigned long, flags)
> +		__field(int, count)
> +		__field(int, mapcount)
> +		__field(void *, mapping)
> +		__field(int, mt)
> +		__field(const char *, reason)
> +		__field(int, err)
> +		),
> +
> +	TP_fast_assign(
> +		__entry->pfn = page_to_pfn(page);
> +		__entry->flags = page->flags;
> +		__entry->count = page_ref_count(page);
> +		__entry->mapcount = page_mapcount(page);
> +		__entry->mapping = page->mapping;
> +		__entry->mt = get_pageblock_migratetype(page);
> +		__entry->reason = reason;
> +		__entry->err = err;
> +		),
> +
> +	TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d reason=%s err=%d",
> +			__entry->pfn,
> +			show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)),
> +			__entry->count, __entry->mapcount, __entry->mapping,
> +			__entry->mt, __entry->reason, __entry->err)
> +);
> +
> +#endif /* _TRACE_PAGE_PIN_OWNER_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index 5bd5bb097252..ed2d5e6450b7 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -86,6 +86,21 @@ config PAGE_TABLE_CHECK_ENFORCED
>  
>  	  If unsure say "n".
>  
> +config PAGE_PIN_OWNER
> +	bool "Track page pinner"
> +	depends on DEBUG_KERNEL && TRACEPOINTS && STACKTRACE_SUPPORT
> +	select PAGE_EXTENSION
> +	select STACKTRACE
> +	help
> +	  This keeps track of what call chain is the pinner of a page, may
> +	  help to find contiguous page allocation(memory-hotplug, compaction,
> +	  cma and so on) failure. Even if you include this feature in your
> +	  build, it is disabled by default. You should pass "page_pin_owner=on"
> +	  to boot parameter in order to enable it. Eats a fair amount of memory
> +	  if enabled.
> +
> +	  If unsure, say N.
> +
>  config PAGE_POISONING
>  	bool "Poison pages after freeing"
>  	help
> diff --git a/mm/Makefile b/mm/Makefile
> index 588d3113f3b0..ca89f4fa38ce 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -104,6 +104,7 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>  obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
>  obj-$(CONFIG_DEBUG_VM_PGTABLE) += debug_vm_pgtable.o
>  obj-$(CONFIG_PAGE_OWNER) += page_owner.o
> +obj-$(CONFIG_PAGE_PIN_OWNER) += page_pin_owner.o
>  obj-$(CONFIG_CLEANCACHE) += cleancache.o
>  obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
>  obj-$(CONFIG_ZPOOL)	+= zpool.o
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c7da064b4781..aa8a2c21da8e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1105,6 +1105,8 @@ static int unmap_and_move(new_page_t get_new_page,
>  	rc = __unmap_and_move(page, newpage, force, mode);
>  	if (rc == MIGRATEPAGE_SUCCESS)
>  		set_page_owner_migrate_reason(newpage, reason);
> +	else
> +		report_page_pinners(page, reason, rc);
>  
>  out:
>  	if (rc != -EAGAIN) {
> @@ -1273,7 +1275,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  	if (rc == MIGRATEPAGE_SUCCESS) {
>  		move_hugetlb_state(hpage, new_hpage, reason);
>  		put_new_page = NULL;
> -	}
> +	} else
> +		report_page_pinners(hpage, reason, rc);
>  
>  out_unlock:
>  	unlock_page(hpage);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b5d62e1c8d81..98125b1b6e7e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -64,6 +64,7 @@
>  #include <linux/sched/rt.h>
>  #include <linux/sched/mm.h>
>  #include <linux/page_owner.h>
> +#include <linux/page_pin_owner.h>
>  #include <linux/page_table_check.h>
>  #include <linux/kthread.h>
>  #include <linux/memcontrol.h>
> @@ -1310,6 +1311,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  		if (memcg_kmem_enabled() && PageMemcgKmem(page))
>  			__memcg_kmem_uncharge_page(page, order);
>  		reset_page_owner(page, order);
> +		reset_page_pin_owner(page, order);
>  		page_table_check_free(page, order);
>  		return false;
>  	}
> @@ -1350,6 +1352,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  	page_cpupid_reset_last(page);
>  	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>  	reset_page_owner(page, order);
> +	reset_page_pin_owner(page, order);
>  	page_table_check_free(page, order);
>  
>  	if (!PageHighMem(page)) {
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 2e66d934d63f..3c0df97b9b8b 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -7,6 +7,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/kmemleak.h>
>  #include <linux/page_owner.h>
> +#include <linux/page_pin_owner.h>
>  #include <linux/page_idle.h>
>  #include <linux/page_table_check.h>
>  
> @@ -79,6 +80,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
>  #ifdef CONFIG_PAGE_TABLE_CHECK
>  	&page_table_check_ops,
>  #endif
> +#ifdef CONFIG_PAGE_PIN_OWNER
> +	&page_pin_owner_ops,
> +#endif
>  };
>  
>  unsigned long page_ext_size = sizeof(struct page_ext);
> diff --git a/mm/page_pin_owner.c b/mm/page_pin_owner.c
> new file mode 100644
> index 000000000000..736617df093c
> --- /dev/null
> +++ b/mm/page_pin_owner.c
> @@ -0,0 +1,91 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/mm.h>
> +#include <linux/page_pin_owner.h>
> +#include <linux/migrate.h>
> +
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/page_pin_owner.h>
> +
> +static bool page_pin_owner_enabled;
> +DEFINE_STATIC_KEY_FALSE(page_pin_owner_inited);
> +EXPORT_SYMBOL(page_pin_owner_inited);
> +
> +static int __init early_page_pin_owner_param(char *buf)
> +{
> +	return kstrtobool(buf, &page_pin_owner_enabled);
> +}
> +early_param("page_pin_owner", early_page_pin_owner_param);
> +
> +static bool need_page_pin_owner(void)
> +{
> +	return page_pin_owner_enabled;
> +}
> +
> +static void init_page_pin_owner(void)
> +{
> +	if (!page_pin_owner_enabled)
> +		return;
> +
> +	static_branch_enable(&page_pin_owner_inited);
> +}
> +
> +struct page_ext_operations page_pin_owner_ops = {
> +	.need = need_page_pin_owner,
> +	.init = init_page_pin_owner,
> +};
> +
> +void __reset_page_pin_owner(struct page *page, unsigned int order)
> +{
> +	struct page_ext *page_ext;
> +	int i;
> +
> +	page_ext = lookup_page_ext(page);
> +	if (unlikely(!page_ext))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		if (!test_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags))
> +			break;
> +
> +		clear_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags);
> +		page_ext = page_ext_next(page_ext);
> +	}
> +}
> +
> +void __report_page_pinners(struct page *page, int reason, int err)
> +{
> +	struct page_ext *page_ext;
> +
> +	page_ext = lookup_page_ext(page);
> +	if (unlikely(!page_ext))
> +		return;
> +
> +	test_and_set_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags);
> +	trace_report_page_pinners(page, migrate_reason_names[reason], err);
> +}
> +
> +void __page_pin_owner_put(struct page *page)
> +{
> +	struct page_ext *page_ext = lookup_page_ext(page);
> +
> +	if (unlikely(!page_ext))
> +		return;
> +
> +	if (!test_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags))
> +		return;
> +
> +	trace_page_pin_owner_put(page);
> +}
> +EXPORT_SYMBOL(__page_pin_owner_put);
> +
> +static int __init page_pin_owner_init(void)
> +{
> +	if (!static_branch_unlikely(&page_pin_owner_inited)) {
> +		pr_info("page_pin_owner is disabled\n");
> +		return 0;
> +	}
> +
> +	pr_info("page_pin_owner is enabled\n");
> +	return 0;
> +}
> +late_initcall(page_pin_owner_init)
> -- 
> 2.34.1.448.ga2b2bfdf31-goog
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2021-12-28 17:59 [RFC v2] mm: introduce page pin owner Minchan Kim
  2022-01-06 18:14 ` Minchan Kim
@ 2022-01-06 22:27 ` John Hubbard
  2022-01-06 23:24   ` Minchan Kim
  2022-01-12 12:25 ` David Hildenbrand
  2 siblings, 1 reply; 15+ messages in thread
From: John Hubbard @ 2022-01-06 22:27 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: Michal Hocko, David Hildenbrand, linux-mm, LKML,
	Suren Baghdasaryan, John Dias, huww98

On 12/28/21 09:59, Minchan Kim wrote:
> A Contiguous Memory Allocator(CMA) allocation can fail if any page
> within the requested range has an elevated refcount(a pinned page).
> 
> Debugging such failures is difficult, because the struct pages only
> show a combined refcount, and do not show the callstacks or
> backtraces of the code that acquired each refcount. So the source
> of the page pins remains a mystery, at the time of CMA failure.
> 
> In order to solve this without adding too much overhead, just do
> nothing most of the time, which is pretty low overhead. However,
> once a CMA failure occurs, then mark the page (this requires a
> pointer's worth of space in struct page, but it uses page extensions
> to get that), and start tracing the subsequent put_page() calls.
> As the program finishes up, each page pin will be undone, and
> traced with a backtrace. The programmer reads the trace output and
> sees the list of all page pinning code paths.
> 
> This will consume an additional 8 bytes per 4KB page, or an
> additional 0.2% of RAM. In addition to the storage space, it will
> have some performance cost, due to increasing the size of struct
> page so that it is greater than the cacheline size (or multiples
> thereof) of popular (x86, ...) CPUs.
> 
> The idea can apply every user of migrate_pages as well as CMA to
> know the reason why the page migration failed. To support it,
> the implementation takes "enum migrate_reason" string as filter
> of the tracepoint(see below).
> 

Hi Minchan,

If this is ready to propose, then maybe it's time to remove the "RFC"
qualification from the subject line, and re-post for final review.

And also when you do that, could you please specify which tree or commit
this applies to? I wasn't able to figure that out this time.

> Usage)

This extensive "usage" section is probably helpful, but the commit
log is certainly not the place for the "how to" documentation. Let's
find an .rst file to stash it in, I think.


thanks,
-- 
John Hubbard
NVIDIA

> 
> trace_dir="/sys/kernel/tracing"
> echo 1 > $trace_dir/options/stacktrace
> echo 1 > $trace_dir/events/page_pin_owner/enable
> echo "contig_range" > $trace_dir/events/page_pin_owner/report_page_pinners/filter
> 
> ===
> example:
> If you are interested in compaction failure, you want to use
> "compaction" as a filter instead of "contig_range".
> For the other filter strings, you can refer migrate_reason_names
> in mm/debug.c
> ===
> 
> ..
> run workload
> ..
> 
> cat $trace_dir/trace
> ..
> 
>              bash-7442    [007] ...1.   288.504690: report_page_pinners: pfn=0x91430 flags=uptodate|swapbacked count=2 mapcount=0 mapping=ffff88f5480a2709 mt=5 reason=contig_range err=-11
>              bash-7442    [007] ...1.   288.504691: <stack trace>
>   => trace_event_raw_event_report_page_pinners
>   => __report_page_pinners
>   => migrate_pages
>   => alloc_contig_range
>   => cma_alloc.cold
>   => cma_alloc_write
>   => simple_attr_write
>   => debugfs_attr_write
>   => full_proxy_write
>   => vfs_write
>   => ksys_write
>   => do_syscall_64
>   => entry_SYSCALL_64_after_hwframe
> ..
> ..
> 
>              bash-7442    [007] ...1.   288.506426: page_pin_owner_put: pfn=0x91430 flags=uptodate|swapbacked count=2 mapcount=0 mapping=ffff88f5480a2709 mt=5
>              bash-7442    [007] ...1.   288.506427: <stack trace>
>   => trace_event_raw_event_page_pin_owner_put
>   => __page_pin_owner_put
>   => put_page
>   => putback_movable_pages
>   => alloc_contig_range
>   => cma_alloc.cold
>   => cma_alloc_write
>   => simple_attr_write
>   => debugfs_attr_write
>   => full_proxy_write
>   => vfs_write
>   => ksys_write
>   => do_syscall_64
>   => entry_SYSCALL_64_after_hwframe
> ..
>               cc1-31104   [001] d..2.   288.507632: page_pin_owner_put: pfn=0x91430 flags=uptodate|swapbacked count=1 mapcount=0 mapping=ffff88f5480a2709 mt=4
>               cc1-31104   [001] d..2.   288.507636: <stack trace>
>   => trace_event_raw_event_page_pin_owner_put
>   => __page_pin_owner_put
>   => release_pages
>   => tlb_flush_mmu
>   => unmap_page_range
>   => unmap_vmas
>   => exit_mmap
>   => mmput
>   => do_exit
>   => do_group_exit
>   => __x64_sys_exit_group
>   => do_syscall_64
>   => entry_SYSCALL_64_after_hwframe
> ..
> 
>              make-31157   [007] d..3.   288.508333: page_pin_owner_put: pfn=0x91430 flags=uptodate|lru|swapbacked count=0 mapcount=0 mapping=ffff88f5480a2709 mt=5
>              make-31157   [007] d..3.   288.508335: <stack trace>
>   => trace_event_raw_event_page_pin_owner_put
>   => __page_pin_owner_put
>   => release_pages
>   => __pagevec_lru_add
>   => lru_add_drain_cpu
>   => lru_add_drain
>   => exit_mmap
>   => mmput
>   => begin_new_exec
>   => load_elf_binary
>   => bprm_execve
>   => do_execveat_common.isra.0
>   => __x64_sys_execve
>   => do_syscall_64
>   => entry_SYSCALL_64_after_hwframe
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> * from v1 - https://lore.kernel.org/lkml/20211206184730.858850-1-minchan@kernel.org/
>    * PagePinOwner naming suggestion - jhubbard@
>    * Description/Kconfig suggestion - jhubbard@
>    * General migrate_page supports - huww98@
> 
>   include/linux/mm.h                    |  7 ++-
>   include/linux/page_ext.h              |  3 +
>   include/linux/page_pin_owner.h        | 48 ++++++++++++++
>   include/trace/events/page_pin_owner.h | 81 ++++++++++++++++++++++++
>   mm/Kconfig.debug                      | 15 +++++
>   mm/Makefile                           |  1 +
>   mm/migrate.c                          |  5 +-
>   mm/page_alloc.c                       |  3 +
>   mm/page_ext.c                         |  4 ++
>   mm/page_pin_owner.c                   | 91 +++++++++++++++++++++++++++
>   10 files changed, 256 insertions(+), 2 deletions(-)
>   create mode 100644 include/linux/page_pin_owner.h
>   create mode 100644 include/trace/events/page_pin_owner.h
>   create mode 100644 mm/page_pin_owner.c
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 82be88c1fdbb..5c437faf129c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -26,6 +26,7 @@
>   #include <linux/err.h>
>   #include <linux/page-flags.h>
>   #include <linux/page_ref.h>
> +#include <linux/page_pin_owner.h>
>   #include <linux/memremap.h>
>   #include <linux/overflow.h>
>   #include <linux/sizes.h>
> @@ -714,8 +715,12 @@ static inline unsigned int folio_order(struct folio *folio)
>    */
>   static inline int put_page_testzero(struct page *page)
>   {
> +	int ret;
> +
>   	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> -	return page_ref_dec_and_test(page);
> +	ret = page_ref_dec_and_test(page);
> +	page_pin_owner_put(page);
> +	return ret;
>   }
>   
>   static inline int folio_put_testzero(struct folio *folio)
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index fabb2e1e087f..3236efd5ab07 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -23,6 +23,9 @@ enum page_ext_flags {
>   	PAGE_EXT_YOUNG,
>   	PAGE_EXT_IDLE,
>   #endif
> +#if defined(CONFIG_PAGE_PIN_OWNER)
> +	PAGE_EXT_PIN_OWNER,
> +#endif
>   };
>   
>   /*
> diff --git a/include/linux/page_pin_owner.h b/include/linux/page_pin_owner.h
> new file mode 100644
> index 000000000000..e68adcdd6799
> --- /dev/null
> +++ b/include/linux/page_pin_owner.h
> @@ -0,0 +1,48 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __LINUX_PAGE_PIN_OWNER_H
> +#define __LINUX_PAGE_PIN_OWNER_H
> +
> +#include <linux/jump_label.h>
> +
> +#ifdef CONFIG_PAGE_PIN_OWNER
> +extern struct static_key_false page_pin_owner_inited;
> +extern struct page_ext_operations page_pin_owner_ops;
> +
> +void __report_page_pinners(struct page *page, int reason, int err);
> +void __page_pin_owner_put(struct page *page);
> +void __reset_page_pin_owner(struct page *page, unsigned int order);
> +
> +static inline void reset_page_pin_owner(struct page *page, unsigned int order)
> +{
> +	if (static_branch_unlikely(&page_pin_owner_inited))
> +		__reset_page_pin_owner(page, order);
> +}
> +
> +static inline void report_page_pinners(struct page *page, int reason, int err)
> +{
> +	if (!static_branch_unlikely(&page_pin_owner_inited))
> +		return;
> +
> +	__report_page_pinners(page, reason, err);
> +}
> +
> +static inline void page_pin_owner_put(struct page *page)
> +{
> +	if (!static_branch_unlikely(&page_pin_owner_inited))
> +		return;
> +
> +	__page_pin_owner_put(page);
> +}
> +
> +#else
> +static inline void reset_page_pin_owner(struct page *page, unsigned int order)
> +{
> +}
> +static inline void report_page_pinners(struct page *page, int reason, int err)
> +{
> +}
> +static inline void page_pin_owner_put(struct page *page)
> +{
> +}
> +#endif /* CONFIG_PAGE_PIN_OWNER */
> +#endif /*__LINUX_PAGE_PIN_OWNER_H */
> diff --git a/include/trace/events/page_pin_owner.h b/include/trace/events/page_pin_owner.h
> new file mode 100644
> index 000000000000..0096297312f5
> --- /dev/null
> +++ b/include/trace/events/page_pin_owner.h
> @@ -0,0 +1,81 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM page_pin_owner
> +
> +#if !defined(_TRACE_PAGE_PIN_OWNER_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_PAGE_PIN_OWNER_H
> +
> +#include <linux/types.h>
> +#include <linux/tracepoint.h>
> +#include <trace/events/mmflags.h>
> +
> +TRACE_EVENT(page_pin_owner_put,
> +
> +	TP_PROTO(struct page *page),
> +
> +	TP_ARGS(page),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, pfn)
> +		__field(unsigned long, flags)
> +		__field(int, count)
> +		__field(int, mapcount)
> +		__field(void *, mapping)
> +		__field(int, mt)
> +		),
> +
> +	TP_fast_assign(
> +		__entry->pfn = page_to_pfn(page);
> +		__entry->flags = page->flags;
> +		__entry->count = page_ref_count(page);
> +		__entry->mapcount = page_mapcount(page);
> +		__entry->mapping = page->mapping;
> +		__entry->mt = get_pageblock_migratetype(page);
> +		),
> +
> +	TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d",
> +			__entry->pfn,
> +			show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)),
> +			__entry->count,
> +			__entry->mapcount, __entry->mapping, __entry->mt)
> +);
> +
> +TRACE_EVENT(report_page_pinners,
> +
> +	TP_PROTO(struct page *page, const char *reason, int err),
> +
> +	TP_ARGS(page, reason, err),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, pfn)
> +		__field(unsigned long, flags)
> +		__field(int, count)
> +		__field(int, mapcount)
> +		__field(void *, mapping)
> +		__field(int, mt)
> +		__field(const char *, reason)
> +		__field(int, err)
> +		),
> +
> +	TP_fast_assign(
> +		__entry->pfn = page_to_pfn(page);
> +		__entry->flags = page->flags;
> +		__entry->count = page_ref_count(page);
> +		__entry->mapcount = page_mapcount(page);
> +		__entry->mapping = page->mapping;
> +		__entry->mt = get_pageblock_migratetype(page);
> +		__entry->reason = reason;
> +		__entry->err = err;
> +		),
> +
> +	TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d reason=%s err=%d",
> +			__entry->pfn,
> +			show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)),
> +			__entry->count, __entry->mapcount, __entry->mapping,
> +			__entry->mt, __entry->reason, __entry->err)
> +);
> +
> +#endif /* _TRACE_PAGE_PIN_OWNER_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index 5bd5bb097252..ed2d5e6450b7 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -86,6 +86,21 @@ config PAGE_TABLE_CHECK_ENFORCED
>   
>   	  If unsure say "n".
>   
> +config PAGE_PIN_OWNER
> +	bool "Track page pinner"
> +	depends on DEBUG_KERNEL && TRACEPOINTS && STACKTRACE_SUPPORT
> +	select PAGE_EXTENSION
> +	select STACKTRACE
> +	help
> +	  This keeps track of what call chain is the pinner of a page, may
> +	  help to find contiguous page allocation(memory-hotplug, compaction,
> +	  cma and so on) failure. Even if you include this feature in your
> +	  build, it is disabled by default. You should pass "page_pin_owner=on"
> +	  to boot parameter in order to enable it. Eats a fair amount of memory
> +	  if enabled.
> +
> +	  If unsure, say N.
> +
>   config PAGE_POISONING
>   	bool "Poison pages after freeing"
>   	help
> diff --git a/mm/Makefile b/mm/Makefile
> index 588d3113f3b0..ca89f4fa38ce 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -104,6 +104,7 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>   obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
>   obj-$(CONFIG_DEBUG_VM_PGTABLE) += debug_vm_pgtable.o
>   obj-$(CONFIG_PAGE_OWNER) += page_owner.o
> +obj-$(CONFIG_PAGE_PIN_OWNER) += page_pin_owner.o
>   obj-$(CONFIG_CLEANCACHE) += cleancache.o
>   obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
>   obj-$(CONFIG_ZPOOL)	+= zpool.o
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c7da064b4781..aa8a2c21da8e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1105,6 +1105,8 @@ static int unmap_and_move(new_page_t get_new_page,
>   	rc = __unmap_and_move(page, newpage, force, mode);
>   	if (rc == MIGRATEPAGE_SUCCESS)
>   		set_page_owner_migrate_reason(newpage, reason);
> +	else
> +		report_page_pinners(page, reason, rc);
>   
>   out:
>   	if (rc != -EAGAIN) {
> @@ -1273,7 +1275,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>   	if (rc == MIGRATEPAGE_SUCCESS) {
>   		move_hugetlb_state(hpage, new_hpage, reason);
>   		put_new_page = NULL;
> -	}
> +	} else
> +		report_page_pinners(hpage, reason, rc);
>   
>   out_unlock:
>   	unlock_page(hpage);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b5d62e1c8d81..98125b1b6e7e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -64,6 +64,7 @@
>   #include <linux/sched/rt.h>
>   #include <linux/sched/mm.h>
>   #include <linux/page_owner.h>
> +#include <linux/page_pin_owner.h>
>   #include <linux/page_table_check.h>
>   #include <linux/kthread.h>
>   #include <linux/memcontrol.h>
> @@ -1310,6 +1311,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>   		if (memcg_kmem_enabled() && PageMemcgKmem(page))
>   			__memcg_kmem_uncharge_page(page, order);
>   		reset_page_owner(page, order);
> +		reset_page_pin_owner(page, order);
>   		page_table_check_free(page, order);
>   		return false;
>   	}
> @@ -1350,6 +1352,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>   	page_cpupid_reset_last(page);
>   	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>   	reset_page_owner(page, order);
> +	reset_page_pin_owner(page, order);
>   	page_table_check_free(page, order);
>   
>   	if (!PageHighMem(page)) {
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 2e66d934d63f..3c0df97b9b8b 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -7,6 +7,7 @@
>   #include <linux/vmalloc.h>
>   #include <linux/kmemleak.h>
>   #include <linux/page_owner.h>
> +#include <linux/page_pin_owner.h>
>   #include <linux/page_idle.h>
>   #include <linux/page_table_check.h>
>   
> @@ -79,6 +80,9 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
>   #ifdef CONFIG_PAGE_TABLE_CHECK
>   	&page_table_check_ops,
>   #endif
> +#ifdef CONFIG_PAGE_PIN_OWNER
> +	&page_pin_owner_ops,
> +#endif
>   };
>   
>   unsigned long page_ext_size = sizeof(struct page_ext);
> diff --git a/mm/page_pin_owner.c b/mm/page_pin_owner.c
> new file mode 100644
> index 000000000000..736617df093c
> --- /dev/null
> +++ b/mm/page_pin_owner.c
> @@ -0,0 +1,91 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/mm.h>
> +#include <linux/page_pin_owner.h>
> +#include <linux/migrate.h>
> +
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/page_pin_owner.h>
> +
> +static bool page_pin_owner_enabled;
> +DEFINE_STATIC_KEY_FALSE(page_pin_owner_inited);
> +EXPORT_SYMBOL(page_pin_owner_inited);
> +
> +static int __init early_page_pin_owner_param(char *buf)
> +{
> +	return kstrtobool(buf, &page_pin_owner_enabled);
> +}
> +early_param("page_pin_owner", early_page_pin_owner_param);
> +
> +static bool need_page_pin_owner(void)
> +{
> +	return page_pin_owner_enabled;
> +}
> +
> +static void init_page_pin_owner(void)
> +{
> +	if (!page_pin_owner_enabled)
> +		return;
> +
> +	static_branch_enable(&page_pin_owner_inited);
> +}
> +
> +struct page_ext_operations page_pin_owner_ops = {
> +	.need = need_page_pin_owner,
> +	.init = init_page_pin_owner,
> +};
> +
> +void __reset_page_pin_owner(struct page *page, unsigned int order)
> +{
> +	struct page_ext *page_ext;
> +	int i;
> +
> +	page_ext = lookup_page_ext(page);
> +	if (unlikely(!page_ext))
> +		return;
> +
> +	for (i = 0; i < (1 << order); i++) {
> +		if (!test_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags))
> +			break;
> +
> +		clear_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags);
> +		page_ext = page_ext_next(page_ext);
> +	}
> +}
> +
> +void __report_page_pinners(struct page *page, int reason, int err)
> +{
> +	struct page_ext *page_ext;
> +
> +	page_ext = lookup_page_ext(page);
> +	if (unlikely(!page_ext))
> +		return;
> +
> +	test_and_set_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags);
> +	trace_report_page_pinners(page, migrate_reason_names[reason], err);
> +}
> +
> +void __page_pin_owner_put(struct page *page)
> +{
> +	struct page_ext *page_ext = lookup_page_ext(page);
> +
> +	if (unlikely(!page_ext))
> +		return;
> +
> +	if (!test_bit(PAGE_EXT_PIN_OWNER, &page_ext->flags))
> +		return;
> +
> +	trace_page_pin_owner_put(page);
> +}
> +EXPORT_SYMBOL(__page_pin_owner_put);
> +
> +static int __init page_pin_owner_init(void)
> +{
> +	if (!static_branch_unlikely(&page_pin_owner_inited)) {
> +		pr_info("page_pin_owner is disabled\n");
> +		return 0;
> +	}
> +
> +	pr_info("page_pin_owner is enabled\n");
> +	return 0;
> +}
> +late_initcall(page_pin_owner_init)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2022-01-06 22:27 ` John Hubbard
@ 2022-01-06 23:24   ` Minchan Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Minchan Kim @ 2022-01-06 23:24 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Michal Hocko, David Hildenbrand, linux-mm, LKML,
	Suren Baghdasaryan, John Dias, huww98

On Thu, Jan 06, 2022 at 02:27:48PM -0800, John Hubbard wrote:
> On 12/28/21 09:59, Minchan Kim wrote:
> > A Contiguous Memory Allocator(CMA) allocation can fail if any page
> > within the requested range has an elevated refcount(a pinned page).
> > 
> > Debugging such failures is difficult, because the struct pages only
> > show a combined refcount, and do not show the callstacks or
> > backtraces of the code that acquired each refcount. So the source
> > of the page pins remains a mystery, at the time of CMA failure.
> > 
> > In order to solve this without adding too much overhead, just do
> > nothing most of the time, which is pretty low overhead. However,
> > once a CMA failure occurs, then mark the page (this requires a
> > pointer's worth of space in struct page, but it uses page extensions
> > to get that), and start tracing the subsequent put_page() calls.
> > As the program finishes up, each page pin will be undone, and
> > traced with a backtrace. The programmer reads the trace output and
> > sees the list of all page pinning code paths.
> > 
> > This will consume an additional 8 bytes per 4KB page, or an
> > additional 0.2% of RAM. In addition to the storage space, it will
> > have some performance cost, due to increasing the size of struct
> > page so that it is greater than the cacheline size (or multiples
> > thereof) of popular (x86, ...) CPUs.
> > 
> > The idea can apply every user of migrate_pages as well as CMA to
> > know the reason why the page migration failed. To support it,
> > the implementation takes "enum migrate_reason" string as filter
> > of the tracepoint(see below).
> > 
> 
> Hi Minchan,
> 
> If this is ready to propose, then maybe it's time to remove the "RFC"
> qualification from the subject line, and re-post for final review.
> 
> And also when you do that, could you please specify which tree or commit
> this applies to? I wasn't able to figure that out this time.

Sorry for that. It was based on next-20211224.

> 
> > Usage)
> 
> This extensive "usage" section is probably helpful, but the commit
> log is certainly not the place for the "how to" documentation. Let's
> find an .rst file to stash it in, I think.

I wanted to get some review for implementation/interface/usage before
respin removing the RFC. Otherwise, the the documentation need to keep
update heavily. Based on your comment, I think you are almost agree
with as-is. Then, yeah, let me cook up the doc and repost it with
removing the RFC tag.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2021-12-28 17:59 [RFC v2] mm: introduce page pin owner Minchan Kim
  2022-01-06 18:14 ` Minchan Kim
  2022-01-06 22:27 ` John Hubbard
@ 2022-01-12 12:25 ` David Hildenbrand
  2022-01-12 16:22   ` Minchan Kim
  2 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2022-01-12 12:25 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: Michal Hocko, linux-mm, LKML, Suren Baghdasaryan, John Dias,
	huww98, John Hubbard

On 28.12.21 18:59, Minchan Kim wrote:
> A Contiguous Memory Allocator(CMA) allocation can fail if any page
> within the requested range has an elevated refcount(a pinned page).
> 
> Debugging such failures is difficult, because the struct pages only
> show a combined refcount, and do not show the callstacks or
> backtraces of the code that acquired each refcount. So the source
> of the page pins remains a mystery, at the time of CMA failure.
> 
> In order to solve this without adding too much overhead, just do
> nothing most of the time, which is pretty low overhead. However,
> once a CMA failure occurs, then mark the page (this requires a
> pointer's worth of space in struct page, but it uses page extensions
> to get that), and start tracing the subsequent put_page() calls.
> As the program finishes up, each page pin will be undone, and
> traced with a backtrace. The programmer reads the trace output and
> sees the list of all page pinning code paths.
> 

It's worth noting that this is a pure debug feature, right?


I like the general approach, however, IMHO the current naming is a bit
sub-optimal and misleading. All you're doing is flagging pages that
should result in a tracepoint when unref'ed.

"page pinners" makes it somewhat sound like you're targeting FOLL_PIN,
not simply any references.

"owner" is misleading IMHO as well.


What about something like:

"mm: selective tracing of page reference holders on unref"

PAGE_EXT_PIN_OWNER -> PAGE_EXT_TRACE_UNREF

$whatever feature/user can then set the bit, for example, when migration
fails.

I somewhat dislike that it's implicitly activated by failed page
migration. At least the current naming doesn't reflect that.


> This will consume an additional 8 bytes per 4KB page, or an
> additional 0.2% of RAM. In addition to the storage space, it will
> have some performance cost, due to increasing the size of struct
> page so that it is greater than the cacheline size (or multiples
> thereof) of popular (x86, ...) CPUs.

I think I might be missing something. Aren't you simply reusing
&page_ext->flags ? I mean, the "overhead" is just ordinary page_ext
overhead ... and whee exactly are you changing "struct page" layout? Is
this description outdated?

> 
> The idea can apply every user of migrate_pages as well as CMA to
> know the reason why the page migration failed. To support it,
> the implementation takes "enum migrate_reason" string as filter
> of the tracepoint(see below).
> 

I wonder if we could achieve the same thing for debugging by

a) Tracing the PFN when migration fails
b) Tracing any unref of any PFN

User space can then combine both information to achieve the same result.
I assume one would need a big trace buffer, but maybe for a debug
feature good enough?


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2022-01-12 12:25 ` David Hildenbrand
@ 2022-01-12 16:22   ` Minchan Kim
  2022-01-12 17:42     ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2022-01-12 16:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, linux-mm, LKML, Suren Baghdasaryan,
	John Dias, huww98, John Hubbard

On Wed, Jan 12, 2022 at 01:25:04PM +0100, David Hildenbrand wrote:
> On 28.12.21 18:59, Minchan Kim wrote:
> > A Contiguous Memory Allocator(CMA) allocation can fail if any page
> > within the requested range has an elevated refcount(a pinned page).
> > 
> > Debugging such failures is difficult, because the struct pages only
> > show a combined refcount, and do not show the callstacks or
> > backtraces of the code that acquired each refcount. So the source
> > of the page pins remains a mystery, at the time of CMA failure.
> > 
> > In order to solve this without adding too much overhead, just do
> > nothing most of the time, which is pretty low overhead. However,
> > once a CMA failure occurs, then mark the page (this requires a
> > pointer's worth of space in struct page, but it uses page extensions
> > to get that), and start tracing the subsequent put_page() calls.
> > As the program finishes up, each page pin will be undone, and
> > traced with a backtrace. The programmer reads the trace output and
> > sees the list of all page pinning code paths.
> > 
> 
> It's worth noting that this is a pure debug feature, right?

Sure.

> 
> 
> I like the general approach, however, IMHO the current naming is a bit
> sub-optimal and misleading. All you're doing is flagging pages that
> should result in a tracepoint when unref'ed.
> 
> "page pinners" makes it somewhat sound like you're targeting FOLL_PIN,
> not simply any references.
> 
> "owner" is misleading IMHO as well.
> 
> 
> What about something like:
> 
> "mm: selective tracing of page reference holders on unref"
> 
> PAGE_EXT_PIN_OWNER -> PAGE_EXT_TRACE_UNREF
> 
> $whatever feature/user can then set the bit, for example, when migration
> fails.

I couldn't imagine put_page tracking is generally useful except
migration failure. Do you have reasonable usecase in your mind
to make the feature general to be used?
Otherwise, I'd like to have feature naming more higher level
to represent page migration failure and then tracking unref of
the page. In the sense, PagePinOwner John suggested was good
candidate(Even, my original naming PagePinner was worse) since
I was trouble to abstract the feature with short word.
If we approach "what feature is doing" rather than "what's
the feature's goal"(I feel the your suggestion would be close
to what feature is doing), I'd like to express "unreference on
migraiton failed page" so PAGE_EXT_UNMIGRATED_UNREF
(However, I prefer the feature naming more "what we want to achieve")

> 
> I somewhat dislike that it's implicitly activated by failed page
> migration. At least the current naming doesn't reflect that.
> 
> 
> > This will consume an additional 8 bytes per 4KB page, or an
> > additional 0.2% of RAM. In addition to the storage space, it will
> > have some performance cost, due to increasing the size of struct
> > page so that it is greater than the cacheline size (or multiples
> > thereof) of popular (x86, ...) CPUs.
> 
> I think I might be missing something. Aren't you simply reusing
> &page_ext->flags ? I mean, the "overhead" is just ordinary page_ext
> overhead ... and whee exactly are you changing "struct page" layout? Is
> this description outdated?

The feature enables page_ext which adds up 8 bytes per 4KB and on every
put operation, it need to access the additional flag on page_ext which
affects performance since page_put is the common operation.
Yeah, the struct size stuff in the wording is rather misleading.
Let me change the workding something like this:

 This will consume an additional 8 bytes per 4KB page, or an
 additional 0.2% of RAM. In addition to the storage space, it will
 have some performance cost, due to checking additional flag on
 every put_page opeartion.

> 
> > 
> > The idea can apply every user of migrate_pages as well as CMA to
> > know the reason why the page migration failed. To support it,
> > the implementation takes "enum migrate_reason" string as filter
> > of the tracepoint(see below).
> > 
> 
> I wonder if we could achieve the same thing for debugging by
> 
> a) Tracing the PFN when migration fails
> b) Tracing any unref of any PFN
> 
> User space can then combine both information to achieve the same result.
> I assume one would need a big trace buffer, but maybe for a debug
> feature good enough?

I definitely tried it for cma allocation failure but it generated
enormous output(Please keep it in mind that we also need stacktrace)
due to too frequent put_page and compaction operation(Even, I filter
them out to track only cma pages but it was still huge since the CMA
size is 1/8 of the system). Even though I increased the buffer size
a lot, the buffer was easily overwritten. Moreover, even though it's
debug feature, we need to release the feature into dogfooder to catch
the real problem in the field so consuming too much memory as well as
backtrace operhead on every put page are tough to be used in field.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2022-01-12 16:22   ` Minchan Kim
@ 2022-01-12 17:42     ` David Hildenbrand
  2022-01-12 20:41       ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2022-01-12 17:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Michal Hocko, linux-mm, LKML, Suren Baghdasaryan,
	John Dias, huww98, John Hubbard

>>
>> What about something like:
>>
>> "mm: selective tracing of page reference holders on unref"
>>
>> PAGE_EXT_PIN_OWNER -> PAGE_EXT_TRACE_UNREF
>>
>> $whatever feature/user can then set the bit, for example, when migration
>> fails.
> 
> I couldn't imagine put_page tracking is generally useful except
> migration failure. Do you have reasonable usecase in your mind
> to make the feature general to be used?

HWpoison etc. purposes maybe? Trace who still held a reference a page
that was poisoned and couldn't be removed?  Or in general, tracking
references to something that should have a refcount of 0 because it
should have been freed, but for some reason there are still references
around?

> Otherwise, I'd like to have feature naming more higher level
> to represent page migration failure and then tracking unref of
> the page. In the sense, PagePinOwner John suggested was good
> candidate(Even, my original naming PagePinner was worse) since

Personally, I dislike both variants.

> I was trouble to abstract the feature with short word.
> If we approach "what feature is doing" rather than "what's
> the feature's goal"(I feel the your suggestion would be close
> to what feature is doing), I'd like to express "unreference on
> migraiton failed page" so PAGE_EXT_UNMIGRATED_UNREF
> (However, I prefer the feature naming more "what we want to achieve")
> 
E.g., PAGE_EXT_TRACE_UNREF will trace unref to the page once the bit is
set. The functionality itself is completely independent of migration
failures. That's just the code that sets it to enable the underlying
tracing for that specific page.

>>
>> I somewhat dislike that it's implicitly activated by failed page
>> migration. At least the current naming doesn't reflect that.
>>
>>
>>> This will consume an additional 8 bytes per 4KB page, or an
>>> additional 0.2% of RAM. In addition to the storage space, it will
>>> have some performance cost, due to increasing the size of struct
>>> page so that it is greater than the cacheline size (or multiples
>>> thereof) of popular (x86, ...) CPUs.
>>
>> I think I might be missing something. Aren't you simply reusing
>> &page_ext->flags ? I mean, the "overhead" is just ordinary page_ext
>> overhead ... and whee exactly are you changing "struct page" layout? Is
>> this description outdated?
> 
> The feature enables page_ext which adds up 8 bytes per 4KB and on every
> put operation, it need to access the additional flag on page_ext which
> affects performance since page_put is the common operation.
> Yeah, the struct size stuff in the wording is rather misleading.
> Let me change the workding something like this:
> 
>  This will consume an additional 8 bytes per 4KB page, or an
>  additional 0.2% of RAM. In addition to the storage space, it will
>  have some performance cost, due to checking additional flag on
>  every put_page opeartion.

I'd adjust to

As this feature depends on page_ext->flags, it will consume an
additional 8 bytes per 4KB page to enable page_ext if not already
enabled. ...

> 
>>
>>>
>>> The idea can apply every user of migrate_pages as well as CMA to
>>> know the reason why the page migration failed. To support it,
>>> the implementation takes "enum migrate_reason" string as filter
>>> of the tracepoint(see below).
>>>
>>
>> I wonder if we could achieve the same thing for debugging by
>>
>> a) Tracing the PFN when migration fails
>> b) Tracing any unref of any PFN
>>
>> User space can then combine both information to achieve the same result.
>> I assume one would need a big trace buffer, but maybe for a debug
>> feature good enough?
> 
> I definitely tried it for cma allocation failure but it generated
> enormous output(Please keep it in mind that we also need stacktrace)
> due to too frequent put_page and compaction operation(Even, I filter
> them out to track only cma pages but it was still huge since the CMA
> size is 1/8 of the system). Even though I increased the buffer size
> a lot, the buffer was easily overwritten. Moreover, even though it's
> debug feature, we need to release the feature into dogfooder to catch
> the real problem in the field so consuming too much memory as well as
> backtrace operhead on every put page are tough to be used in field.

Makes sense, I was expecting the output to be large, but possible it's
going to be way too large.

Would it also make sense to track for a flagged page new taken
references, such that you can differentiate between new (e.g.,
temporary) ones and previous ones? Feels like a reasonable addition.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2022-01-12 17:42     ` David Hildenbrand
@ 2022-01-12 20:41       ` Minchan Kim
  2022-01-14 13:31         ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2022-01-12 20:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, linux-mm, LKML, Suren Baghdasaryan,
	John Dias, huww98, John Hubbard

On Wed, Jan 12, 2022 at 06:42:21PM +0100, David Hildenbrand wrote:
> >>
> >> What about something like:
> >>
> >> "mm: selective tracing of page reference holders on unref"
> >>
> >> PAGE_EXT_PIN_OWNER -> PAGE_EXT_TRACE_UNREF
> >>
> >> $whatever feature/user can then set the bit, for example, when migration
> >> fails.
> > 
> > I couldn't imagine put_page tracking is generally useful except
> > migration failure. Do you have reasonable usecase in your mind
> > to make the feature general to be used?
> 
> HWpoison etc. purposes maybe? Trace who still held a reference a page
> that was poisoned and couldn't be removed?  Or in general, tracking

I am not familiar with hwpoison so here dumb question goes:
Is that different one with __soft_offline_page?
It uses migrate_pages so current interface supports it with filter.

echo "memory_failure" > $trace_dir/events/page_pin_owner/report_page_pinners/filter

> references to something that should have a refcount of 0 because it
> should have been freed, but for some reason there are still references
> around?

Sounds like you are talking about memory leak? What's the purpose
with trace, not using other existing tool to find memory leak?

> 
> > Otherwise, I'd like to have feature naming more higher level
> > to represent page migration failure and then tracking unref of
> > the page. In the sense, PagePinOwner John suggested was good
> > candidate(Even, my original naming PagePinner was worse) since
> 
> Personally, I dislike both variants.
> 
> > I was trouble to abstract the feature with short word.
> > If we approach "what feature is doing" rather than "what's
> > the feature's goal"(I feel the your suggestion would be close
> > to what feature is doing), I'd like to express "unreference on
> > migraiton failed page" so PAGE_EXT_UNMIGRATED_UNREF
> > (However, I prefer the feature naming more "what we want to achieve")
> > 
> E.g., PAGE_EXT_TRACE_UNREF will trace unref to the page once the bit is
> set. The functionality itself is completely independent of migration
> failures. That's just the code that sets it to enable the underlying
> tracing for that specific page.

I agree that make something general is great but I also want to avoid
create something too big from the beginning with just imagination.
So, I'd like to hear more concrete and appealing usecases and then
we could think over this trace approach is really the best one to
achieve the goal. Once it's agreed, the naming you suggested would
make sense. 

> 
> >>
> >> I somewhat dislike that it's implicitly activated by failed page
> >> migration. At least the current naming doesn't reflect that.
> >>
> >>
> >>> This will consume an additional 8 bytes per 4KB page, or an
> >>> additional 0.2% of RAM. In addition to the storage space, it will
> >>> have some performance cost, due to increasing the size of struct
> >>> page so that it is greater than the cacheline size (or multiples
> >>> thereof) of popular (x86, ...) CPUs.
> >>
> >> I think I might be missing something. Aren't you simply reusing
> >> &page_ext->flags ? I mean, the "overhead" is just ordinary page_ext
> >> overhead ... and whee exactly are you changing "struct page" layout? Is
> >> this description outdated?
> > 
> > The feature enables page_ext which adds up 8 bytes per 4KB and on every
> > put operation, it need to access the additional flag on page_ext which
> > affects performance since page_put is the common operation.
> > Yeah, the struct size stuff in the wording is rather misleading.
> > Let me change the workding something like this:
> > 
> >  This will consume an additional 8 bytes per 4KB page, or an
> >  additional 0.2% of RAM. In addition to the storage space, it will
> >  have some performance cost, due to checking additional flag on
> >  every put_page opeartion.
> 
> I'd adjust to
> 
> As this feature depends on page_ext->flags, it will consume an
> additional 8 bytes per 4KB page to enable page_ext if not already
> enabled. ...

NP.

> 
> > 
> >>
> >>>
> >>> The idea can apply every user of migrate_pages as well as CMA to
> >>> know the reason why the page migration failed. To support it,
> >>> the implementation takes "enum migrate_reason" string as filter
> >>> of the tracepoint(see below).
> >>>
> >>
> >> I wonder if we could achieve the same thing for debugging by
> >>
> >> a) Tracing the PFN when migration fails
> >> b) Tracing any unref of any PFN
> >>
> >> User space can then combine both information to achieve the same result.
> >> I assume one would need a big trace buffer, but maybe for a debug
> >> feature good enough?
> > 
> > I definitely tried it for cma allocation failure but it generated
> > enormous output(Please keep it in mind that we also need stacktrace)
> > due to too frequent put_page and compaction operation(Even, I filter
> > them out to track only cma pages but it was still huge since the CMA
> > size is 1/8 of the system). Even though I increased the buffer size
> > a lot, the buffer was easily overwritten. Moreover, even though it's
> > debug feature, we need to release the feature into dogfooder to catch
> > the real problem in the field so consuming too much memory as well as
> > backtrace operhead on every put page are tough to be used in field.
> 
> Makes sense, I was expecting the output to be large, but possible it's
> going to be way too large.
> 
> Would it also make sense to track for a flagged page new taken
> references, such that you can differentiate between new (e.g.,
> temporary) ones and previous ones? Feels like a reasonable addition.

I actually tried it and it showed 2x times bigger output.
For me to debug CMA alloation failure, the new get_page callstack
after migration failure were waste since they repeated from lru
adding, isolate from the LRU something. Rather than get callsite,
I needed only put call sites since I could deduce where the pair-get
came from.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2022-01-12 20:41       ` Minchan Kim
@ 2022-01-14 13:31         ` David Hildenbrand
  2022-01-14 16:39           ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2022-01-14 13:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Michal Hocko, linux-mm, LKML, Suren Baghdasaryan,
	John Dias, huww98, John Hubbard

On 12.01.22 21:41, Minchan Kim wrote:
> On Wed, Jan 12, 2022 at 06:42:21PM +0100, David Hildenbrand wrote:
>>>>
>>>> What about something like:
>>>>
>>>> "mm: selective tracing of page reference holders on unref"
>>>>
>>>> PAGE_EXT_PIN_OWNER -> PAGE_EXT_TRACE_UNREF
>>>>
>>>> $whatever feature/user can then set the bit, for example, when migration
>>>> fails.
>>>
>>> I couldn't imagine put_page tracking is generally useful except
>>> migration failure. Do you have reasonable usecase in your mind
>>> to make the feature general to be used?
>>
>> HWpoison etc. purposes maybe? Trace who still held a reference a page
>> that was poisoned and couldn't be removed?  Or in general, tracking
> 
> I am not familiar with hwpoison so here dumb question goes:
> Is that different one with __soft_offline_page?
> It uses migrate_pages so current interface supports it with filter.

__soft_offline_page() won't kill the target and try to migrate because
the pages are about to be damaged and we can still access them.

ordinary memory errors mean we kill the target because we cannot access
the page anymore without triggering MCEs (or worse IIUC) again.

So in my thinking, after memory_failure(), it could eventually be
helpful to figure out who still has a (temporary) reference to such a
broken page, even after killing the process. But that's just one idea I
quickly came up with.

> 
> echo "memory_failure" > $trace_dir/events/page_pin_owner/report_page_pinners/filter
> 
>> references to something that should have a refcount of 0 because it
>> should have been freed, but for some reason there are still references
>> around?
> 
> Sounds like you are talking about memory leak? What's the purpose
> with trace, not using other existing tool to find memory leak?
> 

IIRC, kmemleak can find objects that are no longer referenced, and we
cannot track buddy allocations, but only kmalloc and friends.

>>
>>> Otherwise, I'd like to have feature naming more higher level
>>> to represent page migration failure and then tracking unref of
>>> the page. In the sense, PagePinOwner John suggested was good
>>> candidate(Even, my original naming PagePinner was worse) since
>>
>> Personally, I dislike both variants.
>>
>>> I was trouble to abstract the feature with short word.
>>> If we approach "what feature is doing" rather than "what's
>>> the feature's goal"(I feel the your suggestion would be close
>>> to what feature is doing), I'd like to express "unreference on
>>> migraiton failed page" so PAGE_EXT_UNMIGRATED_UNREF
>>> (However, I prefer the feature naming more "what we want to achieve")
>>>
>> E.g., PAGE_EXT_TRACE_UNREF will trace unref to the page once the bit is
>> set. The functionality itself is completely independent of migration
>> failures. That's just the code that sets it to enable the underlying
>> tracing for that specific page.
> 
> I agree that make something general is great but I also want to avoid
> create something too big from the beginning with just imagination.
> So, I'd like to hear more concrete and appealing usecases and then
> we could think over this trace approach is really the best one to
> achieve the goal. Once it's agreed, the naming you suggested would
> make sense. 

At least for me it's a lot cleaner if a feature clearly expresses what
it actually does. Staring at PAGE_EXT_PIN_OWNER I initially had no clue.
I was assuming we would actually track (not trace!) all active FOLL_PIN
(not unref callers!). Maybe that makes it clearer why I'd prefer a
clearer name.

>>
>> Makes sense, I was expecting the output to be large, but possible it's
>> going to be way too large.
>>
>> Would it also make sense to track for a flagged page new taken
>> references, such that you can differentiate between new (e.g.,
>> temporary) ones and previous ones? Feels like a reasonable addition.
> 
> I actually tried it and it showed 2x times bigger output.

Is 2x that bad? Or would it be worth making it configurable?

> For me to debug CMA alloation failure, the new get_page callstack
> after migration failure were waste since they repeated from lru
> adding, isolate from the LRU something. Rather than get callsite,
> I needed only put call sites since I could deduce where the pair-get
> came from.

Could maybe some filters help that exclude such LRU activity?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2022-01-14 13:31         ` David Hildenbrand
@ 2022-01-14 16:39           ` Minchan Kim
  2022-01-14 16:51             ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2022-01-14 16:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, linux-mm, LKML, Suren Baghdasaryan,
	John Dias, huww98, John Hubbard

On Fri, Jan 14, 2022 at 02:31:43PM +0100, David Hildenbrand wrote:
> On 12.01.22 21:41, Minchan Kim wrote:
> > On Wed, Jan 12, 2022 at 06:42:21PM +0100, David Hildenbrand wrote:
> >>>>
> >>>> What about something like:
> >>>>
> >>>> "mm: selective tracing of page reference holders on unref"
> >>>>
> >>>> PAGE_EXT_PIN_OWNER -> PAGE_EXT_TRACE_UNREF
> >>>>
> >>>> $whatever feature/user can then set the bit, for example, when migration
> >>>> fails.
> >>>
> >>> I couldn't imagine put_page tracking is generally useful except
> >>> migration failure. Do you have reasonable usecase in your mind
> >>> to make the feature general to be used?
> >>
> >> HWpoison etc. purposes maybe? Trace who still held a reference a page
> >> that was poisoned and couldn't be removed?  Or in general, tracking
> > 
> > I am not familiar with hwpoison so here dumb question goes:
> > Is that different one with __soft_offline_page?
> > It uses migrate_pages so current interface supports it with filter.
> 
> __soft_offline_page() won't kill the target and try to migrate because
> the pages are about to be damaged and we can still access them.
> 
> ordinary memory errors mean we kill the target because we cannot access
> the page anymore without triggering MCEs (or worse IIUC) again.
> 
> So in my thinking, after memory_failure(), it could eventually be
> helpful to figure out who still has a (temporary) reference to such a
> broken page, even after killing the process. But that's just one idea I
> quickly came up with.


Thanks for the clarification. Is the trace best fit in the case?
Presumably you know the broken page, can't you find who owns the page
using /proc/pid/pagemap?
Furthermore, page_get/put operations commonly could happen in
different contexts regardless of page's owner so the trace from
different context is still helpful?

If it's helpful, could you tell what you want to make the interface to
set the bit of broken page? For example, as-is case for page migration,
report_page_pinners is called to mark failed page at unmap_and_move.
Do you want to add something similar(maybe, set_page_ref_track under
rename) in memory-failure.c?

It would be very helpful to design the feature's interface(surely,
naming as well) and write description to convince others "yeah,
sounds like so useful for the case and that's best fit than other way".

> 
> > 
> > echo "memory_failure" > $trace_dir/events/page_pin_owner/report_page_pinners/filter
> > 
> >> references to something that should have a refcount of 0 because it
> >> should have been freed, but for some reason there are still references
> >> around?
> > 
> > Sounds like you are talking about memory leak? What's the purpose
> > with trace, not using other existing tool to find memory leak?
> > 
> 
> IIRC, kmemleak can find objects that are no longer referenced, and we
> cannot track buddy allocations, but only kmalloc and friends.

PageOwner is your good buddy.

> 
> >>
> >>> Otherwise, I'd like to have feature naming more higher level
> >>> to represent page migration failure and then tracking unref of
> >>> the page. In the sense, PagePinOwner John suggested was good
> >>> candidate(Even, my original naming PagePinner was worse) since
> >>
> >> Personally, I dislike both variants.
> >>
> >>> I was trouble to abstract the feature with short word.
> >>> If we approach "what feature is doing" rather than "what's
> >>> the feature's goal"(I feel the your suggestion would be close
> >>> to what feature is doing), I'd like to express "unreference on
> >>> migraiton failed page" so PAGE_EXT_UNMIGRATED_UNREF
> >>> (However, I prefer the feature naming more "what we want to achieve")
> >>>
> >> E.g., PAGE_EXT_TRACE_UNREF will trace unref to the page once the bit is
> >> set. The functionality itself is completely independent of migration
> >> failures. That's just the code that sets it to enable the underlying
> >> tracing for that specific page.
> > 
> > I agree that make something general is great but I also want to avoid
> > create something too big from the beginning with just imagination.
> > So, I'd like to hear more concrete and appealing usecases and then
> > we could think over this trace approach is really the best one to
> > achieve the goal. Once it's agreed, the naming you suggested would
> > make sense. 
> 
> At least for me it's a lot cleaner if a feature clearly expresses what
> it actually does. Staring at PAGE_EXT_PIN_OWNER I initially had no clue.
> I was assuming we would actually track (not trace!) all active FOLL_PIN
> (not unref callers!). Maybe that makes it clearer why I'd prefer a
> clearer name.

I totally agree PagePinOwner is not 100% straightforward. I'm open for
other better name. Currently we are discussing how we could generalize
and whether it's useful or not. Depending on the discussion, the design/
interface as well as naming could be changed. No problem.

> 
> >>
> >> Makes sense, I was expecting the output to be large, but possible it's
> >> going to be way too large.
> >>
> >> Would it also make sense to track for a flagged page new taken
> >> references, such that you can differentiate between new (e.g.,
> >> temporary) ones and previous ones? Feels like a reasonable addition.
> > 
> > I actually tried it and it showed 2x times bigger output.
> 
> Is 2x that bad? Or would it be worth making it configurable?

For my goal, 2x was bad because I need to minimize the trace buffer.
Furthermore, the new get operation was not helpful but just noisy.
If some usecase is not enough with only put callsite, we add get
easily. Implementation is not hard. The matter is how it's useful
in real practice since we would expose the interface to the user,
I guess.

> 
> > For me to debug CMA alloation failure, the new get_page callstack
> > after migration failure were waste since they repeated from lru
> > adding, isolate from the LRU something. Rather than get callsite,
> > I needed only put call sites since I could deduce where the pair-get
> > came from.
> 
> Could maybe some filters help that exclude such LRU activity?

For my case, the filter wouldn't be helpful because I shouldn't
miss the LRU activity since sometime they makes the allocation
failure. Thus, I could deduce the lru's get site from put callback.
get callsite is not helpful for my case but just wasted memory
and perf.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2022-01-14 16:39           ` Minchan Kim
@ 2022-01-14 16:51             ` David Hildenbrand
  2022-01-14 18:47               ` David Hildenbrand
  2022-01-14 18:55               ` Minchan Kim
  0 siblings, 2 replies; 15+ messages in thread
From: David Hildenbrand @ 2022-01-14 16:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Michal Hocko, linux-mm, LKML, Suren Baghdasaryan,
	John Dias, huww98, John Hubbard

On 14.01.22 17:39, Minchan Kim wrote:
> On Fri, Jan 14, 2022 at 02:31:43PM +0100, David Hildenbrand wrote:
>> On 12.01.22 21:41, Minchan Kim wrote:
>>> On Wed, Jan 12, 2022 at 06:42:21PM +0100, David Hildenbrand wrote:
>>>>>>
>>>>>> What about something like:
>>>>>>
>>>>>> "mm: selective tracing of page reference holders on unref"
>>>>>>
>>>>>> PAGE_EXT_PIN_OWNER -> PAGE_EXT_TRACE_UNREF
>>>>>>
>>>>>> $whatever feature/user can then set the bit, for example, when migration
>>>>>> fails.
>>>>>
>>>>> I couldn't imagine put_page tracking is generally useful except
>>>>> migration failure. Do you have reasonable usecase in your mind
>>>>> to make the feature general to be used?
>>>>
>>>> HWpoison etc. purposes maybe? Trace who still held a reference a page
>>>> that was poisoned and couldn't be removed?  Or in general, tracking
>>>
>>> I am not familiar with hwpoison so here dumb question goes:
>>> Is that different one with __soft_offline_page?
>>> It uses migrate_pages so current interface supports it with filter.
>>
>> __soft_offline_page() won't kill the target and try to migrate because
>> the pages are about to be damaged and we can still access them.
>>
>> ordinary memory errors mean we kill the target because we cannot access
>> the page anymore without triggering MCEs (or worse IIUC) again.
>>
>> So in my thinking, after memory_failure(), it could eventually be
>> helpful to figure out who still has a (temporary) reference to such a
>> broken page, even after killing the process. But that's just one idea I
>> quickly came up with.
> 
> 
> Thanks for the clarification. Is the trace best fit in the case?
> Presumably you know the broken page, can't you find who owns the page
> using /proc/pid/pagemap?
> Furthermore, page_get/put operations commonly could happen in
> different contexts regardless of page's owner so the trace from
> different context is still helpful?
> 
> If it's helpful, could you tell what you want to make the interface to
> set the bit of broken page? For example, as-is case for page migration,
> report_page_pinners is called to mark failed page at unmap_and_move.
> Do you want to add something similar(maybe, set_page_ref_track under
> rename) in memory-failure.c?
> 
> It would be very helpful to design the feature's interface(surely,
> naming as well) and write description to convince others "yeah,
> sounds like so useful for the case and that's best fit than other way".

I currently don't have time to explore this further, it was just a
random thought how else this might be useful.


>>
>>>
>>> echo "memory_failure" > $trace_dir/events/page_pin_owner/report_page_pinners/filter
>>>
>>>> references to something that should have a refcount of 0 because it
>>>> should have been freed, but for some reason there are still references
>>>> around?
>>>
>>> Sounds like you are talking about memory leak? What's the purpose
>>> with trace, not using other existing tool to find memory leak?
>>>
>>
>> IIRC, kmemleak can find objects that are no longer referenced, and we
>> cannot track buddy allocations, but only kmalloc and friends.
> 
> PageOwner is your good buddy.

Page owner tracks who owns a page but not who else holds a reference
(some buffer, gup, whatsoever), right?

> 
>>
>>>>
>>>>> Otherwise, I'd like to have feature naming more higher level
>>>>> to represent page migration failure and then tracking unref of
>>>>> the page. In the sense, PagePinOwner John suggested was good
>>>>> candidate(Even, my original naming PagePinner was worse) since
>>>>
>>>> Personally, I dislike both variants.
>>>>
>>>>> I was trouble to abstract the feature with short word.
>>>>> If we approach "what feature is doing" rather than "what's
>>>>> the feature's goal"(I feel the your suggestion would be close
>>>>> to what feature is doing), I'd like to express "unreference on
>>>>> migraiton failed page" so PAGE_EXT_UNMIGRATED_UNREF
>>>>> (However, I prefer the feature naming more "what we want to achieve")
>>>>>
>>>> E.g., PAGE_EXT_TRACE_UNREF will trace unref to the page once the bit is
>>>> set. The functionality itself is completely independent of migration
>>>> failures. That's just the code that sets it to enable the underlying
>>>> tracing for that specific page.
>>>
>>> I agree that make something general is great but I also want to avoid
>>> create something too big from the beginning with just imagination.
>>> So, I'd like to hear more concrete and appealing usecases and then
>>> we could think over this trace approach is really the best one to
>>> achieve the goal. Once it's agreed, the naming you suggested would
>>> make sense. 
>>
>> At least for me it's a lot cleaner if a feature clearly expresses what
>> it actually does. Staring at PAGE_EXT_PIN_OWNER I initially had no clue.
>> I was assuming we would actually track (not trace!) all active FOLL_PIN
>> (not unref callers!). Maybe that makes it clearer why I'd prefer a
>> clearer name.
> 
> I totally agree PagePinOwner is not 100% straightforward. I'm open for
> other better name. Currently we are discussing how we could generalize
> and whether it's useful or not. Depending on the discussion, the design/
> interface as well as naming could be changed. No problem.

PagePinOwner is just highly misleading. Because that's not what the
feature does. Having that said, i hope we'll get other opinions as well.

>>
>>>>
>>>> Makes sense, I was expecting the output to be large, but possible it's
>>>> going to be way too large.
>>>>
>>>> Would it also make sense to track for a flagged page new taken
>>>> references, such that you can differentiate between new (e.g.,
>>>> temporary) ones and previous ones? Feels like a reasonable addition.
>>>
>>> I actually tried it and it showed 2x times bigger output.
>>
>> Is 2x that bad? Or would it be worth making it configurable?
> 
> For my goal, 2x was bad because I need to minimize the trace buffer.
> Furthermore, the new get operation was not helpful but just noisy.
> If some usecase is not enough with only put callsite, we add get
> easily. Implementation is not hard. The matter is how it's useful
> in real practice since we would expose the interface to the user,
> I guess.

Right. but it's a debugging feature after all, so I asume we care little
about KABI. We can always add it on top, I agree.

> 
>>
>>> For me to debug CMA alloation failure, the new get_page callstack
>>> after migration failure were waste since they repeated from lru
>>> adding, isolate from the LRU something. Rather than get callsite,
>>> I needed only put call sites since I could deduce where the pair-get
>>> came from.
>>
>> Could maybe some filters help that exclude such LRU activity?
> 
> For my case, the filter wouldn't be helpful because I shouldn't
> miss the LRU activity since sometime they makes the allocation
> failure. Thus, I could deduce the lru's get site from put callback.
> get callsite is not helpful for my case but just wasted memory
> and perf.
> 

Makes sense.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2022-01-14 16:51             ` David Hildenbrand
@ 2022-01-14 18:47               ` David Hildenbrand
  2022-01-14 18:57                 ` Minchan Kim
  2022-01-14 18:55               ` Minchan Kim
  1 sibling, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2022-01-14 18:47 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Michal Hocko, linux-mm, LKML, Suren Baghdasaryan,
	John Dias, huww98, John Hubbard


>>>>>> Otherwise, I'd like to have feature naming more higher level>>>>>> to represent page migration failure and then tracking unref of
>>>>>> the page. In the sense, PagePinOwner John suggested was good
>>>>>> candidate(Even, my original naming PagePinner was worse) since
>>>>>
>>>>> Personally, I dislike both variants.
>>>>>
>>>>>> I was trouble to abstract the feature with short word.
>>>>>> If we approach "what feature is doing" rather than "what's
>>>>>> the feature's goal"(I feel the your suggestion would be close
>>>>>> to what feature is doing), I'd like to express "unreference on
>>>>>> migraiton failed page" so PAGE_EXT_UNMIGRATED_UNREF
>>>>>> (However, I prefer the feature naming more "what we want to achieve")
>>>>>>
>>>>> E.g., PAGE_EXT_TRACE_UNREF will trace unref to the page once the bit is
>>>>> set. The functionality itself is completely independent of migration
>>>>> failures. That's just the code that sets it to enable the underlying
>>>>> tracing for that specific page.
>>>>
>>>> I agree that make something general is great but I also want to avoid
>>>> create something too big from the beginning with just imagination.
>>>> So, I'd like to hear more concrete and appealing usecases and then
>>>> we could think over this trace approach is really the best one to
>>>> achieve the goal. Once it's agreed, the naming you suggested would
>>>> make sense. 
>>>
>>> At least for me it's a lot cleaner if a feature clearly expresses what
>>> it actually does. Staring at PAGE_EXT_PIN_OWNER I initially had no clue.
>>> I was assuming we would actually track (not trace!) all active FOLL_PIN
>>> (not unref callers!). Maybe that makes it clearer why I'd prefer a
>>> clearer name.
>>
>> I totally agree PagePinOwner is not 100% straightforward. I'm open for
>> other better name. Currently we are discussing how we could generalize
>> and whether it's useful or not. Depending on the discussion, the design/
>> interface as well as naming could be changed. No problem.
> 
> PagePinOwner is just highly misleading. Because that's not what the
> feature does. Having that said, i hope we'll get other opinions as well.

FWIW, I think "page reference holder" would be clearer. PageRefHolder or
PageReferenceHolder

"Trace page reference holders on unref after migration of a page failed."

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2022-01-14 16:51             ` David Hildenbrand
  2022-01-14 18:47               ` David Hildenbrand
@ 2022-01-14 18:55               ` Minchan Kim
  1 sibling, 0 replies; 15+ messages in thread
From: Minchan Kim @ 2022-01-14 18:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, linux-mm, LKML, Suren Baghdasaryan,
	John Dias, huww98, John Hubbard

On Fri, Jan 14, 2022 at 05:51:24PM +0100, David Hildenbrand wrote:
> On 14.01.22 17:39, Minchan Kim wrote:
> > On Fri, Jan 14, 2022 at 02:31:43PM +0100, David Hildenbrand wrote:
> >> On 12.01.22 21:41, Minchan Kim wrote:
> >>> On Wed, Jan 12, 2022 at 06:42:21PM +0100, David Hildenbrand wrote:
> >>>>>>
> >>>>>> What about something like:
> >>>>>>
> >>>>>> "mm: selective tracing of page reference holders on unref"
> >>>>>>
> >>>>>> PAGE_EXT_PIN_OWNER -> PAGE_EXT_TRACE_UNREF
> >>>>>>
> >>>>>> $whatever feature/user can then set the bit, for example, when migration
> >>>>>> fails.
> >>>>>
> >>>>> I couldn't imagine put_page tracking is generally useful except
> >>>>> migration failure. Do you have reasonable usecase in your mind
> >>>>> to make the feature general to be used?
> >>>>
> >>>> HWpoison etc. purposes maybe? Trace who still held a reference a page
> >>>> that was poisoned and couldn't be removed?  Or in general, tracking
> >>>
> >>> I am not familiar with hwpoison so here dumb question goes:
> >>> Is that different one with __soft_offline_page?
> >>> It uses migrate_pages so current interface supports it with filter.
> >>
> >> __soft_offline_page() won't kill the target and try to migrate because
> >> the pages are about to be damaged and we can still access them.
> >>
> >> ordinary memory errors mean we kill the target because we cannot access
> >> the page anymore without triggering MCEs (or worse IIUC) again.
> >>
> >> So in my thinking, after memory_failure(), it could eventually be
> >> helpful to figure out who still has a (temporary) reference to such a
> >> broken page, even after killing the process. But that's just one idea I
> >> quickly came up with.
> > 
> > 
> > Thanks for the clarification. Is the trace best fit in the case?
> > Presumably you know the broken page, can't you find who owns the page
> > using /proc/pid/pagemap?
> > Furthermore, page_get/put operations commonly could happen in
> > different contexts regardless of page's owner so the trace from
> > different context is still helpful?
> > 
> > If it's helpful, could you tell what you want to make the interface to
> > set the bit of broken page? For example, as-is case for page migration,
> > report_page_pinners is called to mark failed page at unmap_and_move.
> > Do you want to add something similar(maybe, set_page_ref_track under
> > rename) in memory-failure.c?
> > 
> > It would be very helpful to design the feature's interface(surely,
> > naming as well) and write description to convince others "yeah,
> > sounds like so useful for the case and that's best fit than other way".
> 
> I currently don't have time to explore this further, it was just a
> random thought how else this might be useful.

Thanks for the input. The concrete usecase/idea are really important
to define what information the trace will dump as well as defining
interface.

Please consider the current dump content are also good to cover
what you are thinking for potential cases. And potentially, someone
want to see different data for their failure and put event for
their taste. It means we need to tell why the page was marked to
be tracked originally in the put_page, such as, cma, or hwpoison
or something else so we need per-page metadata to distinguish
them. It makes me think over strong justfication even though
page_ext would be a little more tolerable than struct page.

> 
> 
> >>
> >>>
> >>> echo "memory_failure" > $trace_dir/events/page_pin_owner/report_page_pinners/filter
> >>>
> >>>> references to something that should have a refcount of 0 because it
> >>>> should have been freed, but for some reason there are still references
> >>>> around?
> >>>
> >>> Sounds like you are talking about memory leak? What's the purpose
> >>> with trace, not using other existing tool to find memory leak?
> >>>
> >>
> >> IIRC, kmemleak can find objects that are no longer referenced, and we
> >> cannot track buddy allocations, but only kmalloc and friends.
> > 
> > PageOwner is your good buddy.
> 
> Page owner tracks who owns a page but not who else holds a reference
> (some buffer, gup, whatsoever), right?

I thought you wanted to use the feature for memory leak since you
mentioned kmemleak. I don't understand fully what you want but
how could you get the trace if the reference holder never release?
Only way to get to the goal is you need to trace all the get trace
before the event happen and then correlate them with put. However,
the get happened long time ago, you need to assign huge amount of
RAM to keep the record. Without compelling usecase, I am not convinced.

> 
> > 
> >>
> >>>>
> >>>>> Otherwise, I'd like to have feature naming more higher level
> >>>>> to represent page migration failure and then tracking unref of
> >>>>> the page. In the sense, PagePinOwner John suggested was good
> >>>>> candidate(Even, my original naming PagePinner was worse) since
> >>>>
> >>>> Personally, I dislike both variants.
> >>>>
> >>>>> I was trouble to abstract the feature with short word.
> >>>>> If we approach "what feature is doing" rather than "what's
> >>>>> the feature's goal"(I feel the your suggestion would be close
> >>>>> to what feature is doing), I'd like to express "unreference on
> >>>>> migraiton failed page" so PAGE_EXT_UNMIGRATED_UNREF
> >>>>> (However, I prefer the feature naming more "what we want to achieve")
> >>>>>
> >>>> E.g., PAGE_EXT_TRACE_UNREF will trace unref to the page once the bit is
> >>>> set. The functionality itself is completely independent of migration
> >>>> failures. That's just the code that sets it to enable the underlying
> >>>> tracing for that specific page.
> >>>
> >>> I agree that make something general is great but I also want to avoid
> >>> create something too big from the beginning with just imagination.
> >>> So, I'd like to hear more concrete and appealing usecases and then
> >>> we could think over this trace approach is really the best one to
> >>> achieve the goal. Once it's agreed, the naming you suggested would
> >>> make sense. 
> >>
> >> At least for me it's a lot cleaner if a feature clearly expresses what
> >> it actually does. Staring at PAGE_EXT_PIN_OWNER I initially had no clue.
> >> I was assuming we would actually track (not trace!) all active FOLL_PIN
> >> (not unref callers!). Maybe that makes it clearer why I'd prefer a
> >> clearer name.
> > 
> > I totally agree PagePinOwner is not 100% straightforward. I'm open for
> > other better name. Currently we are discussing how we could generalize
> > and whether it's useful or not. Depending on the discussion, the design/
> > interface as well as naming could be changed. No problem.
> 
> PagePinOwner is just highly misleading. Because that's not what the
> feature does. Having that said, i hope we'll get other opinions as well.

Yeah, I'm open for the better name. Still thinking but I was not good.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2022-01-14 18:47               ` David Hildenbrand
@ 2022-01-14 18:57                 ` Minchan Kim
  2022-01-21 21:59                   ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2022-01-14 18:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, linux-mm, LKML, Suren Baghdasaryan,
	John Dias, huww98, John Hubbard

On Fri, Jan 14, 2022 at 07:47:49PM +0100, David Hildenbrand wrote:
> 
> >>>>>> Otherwise, I'd like to have feature naming more higher level>>>>>> to represent page migration failure and then tracking unref of
> >>>>>> the page. In the sense, PagePinOwner John suggested was good
> >>>>>> candidate(Even, my original naming PagePinner was worse) since
> >>>>>
> >>>>> Personally, I dislike both variants.
> >>>>>
> >>>>>> I was trouble to abstract the feature with short word.
> >>>>>> If we approach "what feature is doing" rather than "what's
> >>>>>> the feature's goal"(I feel the your suggestion would be close
> >>>>>> to what feature is doing), I'd like to express "unreference on
> >>>>>> migraiton failed page" so PAGE_EXT_UNMIGRATED_UNREF
> >>>>>> (However, I prefer the feature naming more "what we want to achieve")
> >>>>>>
> >>>>> E.g., PAGE_EXT_TRACE_UNREF will trace unref to the page once the bit is
> >>>>> set. The functionality itself is completely independent of migration
> >>>>> failures. That's just the code that sets it to enable the underlying
> >>>>> tracing for that specific page.
> >>>>
> >>>> I agree that make something general is great but I also want to avoid
> >>>> create something too big from the beginning with just imagination.
> >>>> So, I'd like to hear more concrete and appealing usecases and then
> >>>> we could think over this trace approach is really the best one to
> >>>> achieve the goal. Once it's agreed, the naming you suggested would
> >>>> make sense. 
> >>>
> >>> At least for me it's a lot cleaner if a feature clearly expresses what
> >>> it actually does. Staring at PAGE_EXT_PIN_OWNER I initially had no clue.
> >>> I was assuming we would actually track (not trace!) all active FOLL_PIN
> >>> (not unref callers!). Maybe that makes it clearer why I'd prefer a
> >>> clearer name.
> >>
> >> I totally agree PagePinOwner is not 100% straightforward. I'm open for
> >> other better name. Currently we are discussing how we could generalize
> >> and whether it's useful or not. Depending on the discussion, the design/
> >> interface as well as naming could be changed. No problem.
> > 
> > PagePinOwner is just highly misleading. Because that's not what the
> > feature does. Having that said, i hope we'll get other opinions as well.
> 
> FWIW, I think "page reference holder" would be clearer. PageRefHolder or
> PageReferenceHolder
> 
> "Trace page reference holders on unref after migration of a page failed."

Ah, crossed email. PageRefHolder. Yeah, sounds like better!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2] mm: introduce page pin owner
  2022-01-14 18:57                 ` Minchan Kim
@ 2022-01-21 21:59                   ` Minchan Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Minchan Kim @ 2022-01-21 21:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Michal Hocko, linux-mm, LKML, Suren Baghdasaryan,
	John Dias, huww98, John Hubbard

On Fri, Jan 14, 2022 at 10:57:04AM -0800, Minchan Kim wrote:
> On Fri, Jan 14, 2022 at 07:47:49PM +0100, David Hildenbrand wrote:
> > 
> > >>>>>> Otherwise, I'd like to have feature naming more higher level>>>>>> to represent page migration failure and then tracking unref of
> > >>>>>> the page. In the sense, PagePinOwner John suggested was good
> > >>>>>> candidate(Even, my original naming PagePinner was worse) since
> > >>>>>
> > >>>>> Personally, I dislike both variants.
> > >>>>>
> > >>>>>> I was trouble to abstract the feature with short word.
> > >>>>>> If we approach "what feature is doing" rather than "what's
> > >>>>>> the feature's goal"(I feel the your suggestion would be close
> > >>>>>> to what feature is doing), I'd like to express "unreference on
> > >>>>>> migraiton failed page" so PAGE_EXT_UNMIGRATED_UNREF
> > >>>>>> (However, I prefer the feature naming more "what we want to achieve")
> > >>>>>>
> > >>>>> E.g., PAGE_EXT_TRACE_UNREF will trace unref to the page once the bit is
> > >>>>> set. The functionality itself is completely independent of migration
> > >>>>> failures. That's just the code that sets it to enable the underlying
> > >>>>> tracing for that specific page.
> > >>>>
> > >>>> I agree that make something general is great but I also want to avoid
> > >>>> create something too big from the beginning with just imagination.
> > >>>> So, I'd like to hear more concrete and appealing usecases and then
> > >>>> we could think over this trace approach is really the best one to
> > >>>> achieve the goal. Once it's agreed, the naming you suggested would
> > >>>> make sense. 
> > >>>
> > >>> At least for me it's a lot cleaner if a feature clearly expresses what
> > >>> it actually does. Staring at PAGE_EXT_PIN_OWNER I initially had no clue.
> > >>> I was assuming we would actually track (not trace!) all active FOLL_PIN
> > >>> (not unref callers!). Maybe that makes it clearer why I'd prefer a
> > >>> clearer name.
> > >>
> > >> I totally agree PagePinOwner is not 100% straightforward. I'm open for
> > >> other better name. Currently we are discussing how we could generalize
> > >> and whether it's useful or not. Depending on the discussion, the design/
> > >> interface as well as naming could be changed. No problem.
> > > 
> > > PagePinOwner is just highly misleading. Because that's not what the
> > > feature does. Having that said, i hope we'll get other opinions as well.
> > 
> > FWIW, I think "page reference holder" would be clearer. PageRefHolder or
> > PageReferenceHolder
> > 
> > "Trace page reference holders on unref after migration of a page failed."
> 
> Ah, crossed email. PageRefHolder. Yeah, sounds like better!

David,

I will change the naming to PageRefHolder and update the other
code/comments to follow it.

Do you have any objection?

Otherwise, I'd like to post next version to make the work proceeding.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-01-21 21:59 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-28 17:59 [RFC v2] mm: introduce page pin owner Minchan Kim
2022-01-06 18:14 ` Minchan Kim
2022-01-06 22:27 ` John Hubbard
2022-01-06 23:24   ` Minchan Kim
2022-01-12 12:25 ` David Hildenbrand
2022-01-12 16:22   ` Minchan Kim
2022-01-12 17:42     ` David Hildenbrand
2022-01-12 20:41       ` Minchan Kim
2022-01-14 13:31         ` David Hildenbrand
2022-01-14 16:39           ` Minchan Kim
2022-01-14 16:51             ` David Hildenbrand
2022-01-14 18:47               ` David Hildenbrand
2022-01-14 18:57                 ` Minchan Kim
2022-01-21 21:59                   ` Minchan Kim
2022-01-14 18:55               ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).