linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
@ 2013-11-05 13:45 Jingbai Ma
  2013-11-05 13:45 ` [PATCH 1/3] makedumpfile: hugepage filtering: add hugepage filtering functions Jingbai Ma
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Jingbai Ma @ 2013-11-05 13:45 UTC (permalink / raw)
  To: ptesarik, d.hatayama, kumagai-atsushi
  Cc: bhe, tom.vaden, kexec, linux-kernel, lisa.mitchell, anderson,
	ebiederm, vgoyal

This patch set intends to exclude unnecessary hugepages from the vmcore dump file.

This patch set requires a kernel patch that exports the necessary data
structures into vmcoreinfo: "kexec: export hugepage data structure into vmcoreinfo"
http://lists.infradead.org/pipermail/kexec/2013-November/009997.html

This patch set introduces two new dump levels, 32 and 64: level 32 excludes
free (unused) hugepages, and level 64 excludes both free and active hugepages.
The dump level that excludes all unnecessary pages is now 127.

            |         cache    cache                    free    active
      Dump  |  zero   without  with     user    free    huge    huge
      Level |  page   private  private  data    page    page    page
     -------+----------------------------------------------------------
         0  |
         1  |   X
         2  |           X
         4  |           X        X
         8  |                            X
        16  |                                    X
        32  |                                            X
        64  |                                            X       X
       127  |   X       X        X       X       X       X       X

Examples:
To exclude all unnecessary pages:
makedumpfile -c --message-level 23 -d 127 /proc/vmcore /var/crash/kdump

To exclude all unnecessary pages but keep active hugepages:
makedumpfile -c --message-level 23 -d 63 /proc/vmcore /var/crash/kdump

---

Jingbai Ma (3):
      makedumpfile: hugepage filtering: add hugepage filtering functions
      makedumpfile: hugepage filtering: add excluding hugepage messages
      makedumpfile: hugepage filtering: add new dump levels for manual page


 makedumpfile.8 |  170 +++++++++++++++++++++++++++--------
 makedumpfile.c |  272 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 makedumpfile.h |   19 ++++
 print_info.c   |   12 +-
 print_info.h   |    2 
 5 files changed, 431 insertions(+), 44 deletions(-)

--



* [PATCH 1/3] makedumpfile: hugepage filtering: add hugepage filtering functions
  2013-11-05 13:45 [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump Jingbai Ma
@ 2013-11-05 13:45 ` Jingbai Ma
  2013-11-05 13:45 ` [PATCH 2/3] makedumpfile: hugepage filtering: add excluding hugepage messages Jingbai Ma
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: Jingbai Ma @ 2013-11-05 13:45 UTC (permalink / raw)
  To: ptesarik, d.hatayama, kumagai-atsushi
  Cc: bhe, tom.vaden, kexec, linux-kernel, lisa.mitchell, anderson,
	ebiederm, vgoyal

Add functions to exclude hugepages from the vmcore dump.

Signed-off-by: Jingbai Ma <jingbai.ma@hp.com>
---
 makedumpfile.c |  272 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 makedumpfile.h |   19 ++++
 2 files changed, 289 insertions(+), 2 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index b42565c..f0b2531 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -46,6 +46,8 @@ unsigned long long pfn_cache_private;
 unsigned long long pfn_user;
 unsigned long long pfn_free;
 unsigned long long pfn_hwpoison;
+unsigned long long pfn_free_huge;
+unsigned long long pfn_active_huge;
 
 unsigned long long num_dumped;
 
@@ -1038,6 +1040,7 @@ get_symbol_info(void)
 	SYMBOL_INIT(mem_map, "mem_map");
 	SYMBOL_INIT(vmem_map, "vmem_map");
 	SYMBOL_INIT(mem_section, "mem_section");
+	SYMBOL_INIT(hstates, "hstates");
 	SYMBOL_INIT(pkmap_count, "pkmap_count");
 	SYMBOL_INIT_NEXT(pkmap_count_next, "pkmap_count");
 	SYMBOL_INIT(system_utsname, "system_utsname");
@@ -1174,6 +1177,19 @@ get_structure_info(void)
 	OFFSET_INIT(list_head.prev, "list_head", "prev");
 
 	/*
+	 * Get offsets of the hstate's members.
+	 */
+	SIZE_INIT(hstate, "hstate");
+	OFFSET_INIT(hstate.order, "hstate", "order");
+	OFFSET_INIT(hstate.nr_huge_pages, "hstate", "nr_huge_pages");
+	OFFSET_INIT(hstate.free_huge_pages, "hstate", "free_huge_pages");
+	OFFSET_INIT(hstate.hugepage_activelist, "hstate",
+		"hugepage_activelist");
+	OFFSET_INIT(hstate.hugepage_freelists, "hstate", "hugepage_freelists");
+	MEMBER_ARRAY_LENGTH_INIT(hstate.hugepage_freelists, "hstate",
+		"hugepage_freelists");
+
+	/*
 	 * Get offsets of the node_memblk_s's members.
 	 */
 	SIZE_INIT(node_memblk_s, "node_memblk_s");
@@ -1555,6 +1571,7 @@ write_vmcoreinfo_data(void)
 	WRITE_SYMBOL("mem_map", mem_map);
 	WRITE_SYMBOL("vmem_map", vmem_map);
 	WRITE_SYMBOL("mem_section", mem_section);
+	WRITE_SYMBOL("hstates", hstates);
 	WRITE_SYMBOL("pkmap_count", pkmap_count);
 	WRITE_SYMBOL("pkmap_count_next", pkmap_count_next);
 	WRITE_SYMBOL("system_utsname", system_utsname);
@@ -1590,6 +1607,7 @@ write_vmcoreinfo_data(void)
 	WRITE_STRUCTURE_SIZE("zone", zone);
 	WRITE_STRUCTURE_SIZE("free_area", free_area);
 	WRITE_STRUCTURE_SIZE("list_head", list_head);
+	WRITE_STRUCTURE_SIZE("hstate", hstate);
 	WRITE_STRUCTURE_SIZE("node_memblk_s", node_memblk_s);
 	WRITE_STRUCTURE_SIZE("nodemask_t", nodemask_t);
 	WRITE_STRUCTURE_SIZE("pageflags", pageflags);
@@ -1628,6 +1646,13 @@ write_vmcoreinfo_data(void)
 	WRITE_MEMBER_OFFSET("vm_struct.addr", vm_struct.addr);
 	WRITE_MEMBER_OFFSET("vmap_area.va_start", vmap_area.va_start);
 	WRITE_MEMBER_OFFSET("vmap_area.list", vmap_area.list);
+	WRITE_MEMBER_OFFSET("hstate.order", hstate.order);
+	WRITE_MEMBER_OFFSET("hstate.nr_huge_pages", hstate.nr_huge_pages);
+	WRITE_MEMBER_OFFSET("hstate.free_huge_pages", hstate.free_huge_pages);
+	WRITE_MEMBER_OFFSET("hstate.hugepage_activelist",
+		hstate.hugepage_activelist);
+	WRITE_MEMBER_OFFSET("hstate.hugepage_freelists",
+		hstate.hugepage_freelists);
 	WRITE_MEMBER_OFFSET("log.ts_nsec", log.ts_nsec);
 	WRITE_MEMBER_OFFSET("log.len", log.len);
 	WRITE_MEMBER_OFFSET("log.text_len", log.text_len);
@@ -1647,6 +1672,9 @@ write_vmcoreinfo_data(void)
 	WRITE_ARRAY_LENGTH("zone.free_area", zone.free_area);
 	WRITE_ARRAY_LENGTH("free_area.free_list", free_area.free_list);
 
+	WRITE_ARRAY_LENGTH("hstate.hugepage_freelists",
+		hstate.hugepage_freelists);
+
 	WRITE_NUMBER("NR_FREE_PAGES", NR_FREE_PAGES);
 	WRITE_NUMBER("N_ONLINE", N_ONLINE);
 
@@ -1659,6 +1687,8 @@ write_vmcoreinfo_data(void)
 
 	WRITE_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE);
 
+	WRITE_NUMBER("HUGE_MAX_HSTATE", HUGE_MAX_HSTATE);
+
 	/*
 	 * write the source file of 1st kernel
 	 */
@@ -1874,6 +1904,7 @@ read_vmcoreinfo(void)
 	READ_SYMBOL("mem_map", mem_map);
 	READ_SYMBOL("vmem_map", vmem_map);
 	READ_SYMBOL("mem_section", mem_section);
+	READ_SYMBOL("hstates", hstates);
 	READ_SYMBOL("pkmap_count", pkmap_count);
 	READ_SYMBOL("pkmap_count_next", pkmap_count_next);
 	READ_SYMBOL("system_utsname", system_utsname);
@@ -1906,6 +1937,7 @@ read_vmcoreinfo(void)
 	READ_STRUCTURE_SIZE("zone", zone);
 	READ_STRUCTURE_SIZE("free_area", free_area);
 	READ_STRUCTURE_SIZE("list_head", list_head);
+	READ_STRUCTURE_SIZE("hstate", hstate);
 	READ_STRUCTURE_SIZE("node_memblk_s", node_memblk_s);
 	READ_STRUCTURE_SIZE("nodemask_t", nodemask_t);
 	READ_STRUCTURE_SIZE("pageflags", pageflags);
@@ -1940,6 +1972,13 @@ read_vmcoreinfo(void)
 	READ_MEMBER_OFFSET("vm_struct.addr", vm_struct.addr);
 	READ_MEMBER_OFFSET("vmap_area.va_start", vmap_area.va_start);
 	READ_MEMBER_OFFSET("vmap_area.list", vmap_area.list);
+	READ_MEMBER_OFFSET("hstate.order", hstate.order);
+	READ_MEMBER_OFFSET("hstate.nr_huge_pages", hstate.nr_huge_pages);
+	READ_MEMBER_OFFSET("hstate.free_huge_pages", hstate.free_huge_pages);
+	READ_MEMBER_OFFSET("hstate.hugepage_activelist",
+		hstate.hugepage_activelist);
+	READ_MEMBER_OFFSET("hstate.hugepage_freelists",
+		hstate.hugepage_freelists);
 	READ_MEMBER_OFFSET("log.ts_nsec", log.ts_nsec);
 	READ_MEMBER_OFFSET("log.len", log.len);
 	READ_MEMBER_OFFSET("log.text_len", log.text_len);
@@ -1950,6 +1989,8 @@ read_vmcoreinfo(void)
 	READ_ARRAY_LENGTH("node_memblk", node_memblk);
 	READ_ARRAY_LENGTH("zone.free_area", zone.free_area);
 	READ_ARRAY_LENGTH("free_area.free_list", free_area.free_list);
+	READ_ARRAY_LENGTH("hstate.hugepage_freelists",
+		hstate.hugepage_freelists);
 	READ_ARRAY_LENGTH("node_remap_start_pfn", node_remap_start_pfn);
 
 	READ_NUMBER("NR_FREE_PAGES", NR_FREE_PAGES);
@@ -1966,6 +2007,8 @@ read_vmcoreinfo(void)
 
 	READ_NUMBER("PAGE_BUDDY_MAPCOUNT_VALUE", PAGE_BUDDY_MAPCOUNT_VALUE);
 
+	READ_NUMBER("HUGE_MAX_HSTATE", HUGE_MAX_HSTATE);
+
 	return TRUE;
 }
 
@@ -4040,6 +4083,214 @@ exclude_free_page(void)
 	return TRUE;
 }
 
+inline int
+clear_huge_page(unsigned long long pfn, unsigned int order)
+{
+	unsigned int i;
+
+	DEBUG_MSG("Exclude huge page. start pfn: %lld, order: %d\n",
+		pfn, order);
+
+	for (i = 0; i < (1 << order); i++) {
+		if (!clear_bit_on_2nd_bitmap_for_kernel(pfn + i)) {
+			ERRMSG("Can't clear 2nd bitmap! pfn=0x%llx\n", pfn + i);
+			return FALSE;
+		}
+	}
+
+	return TRUE;
+}
+
+int
+_exclude_huge_page(void)
+{
+	int i, node, freelist_length;
+	unsigned long curr_hstate, curr_page, head, curr, previous, curr_prev;
+	struct timeval tv_start;
+	unsigned long long pfn;
+	unsigned int order;
+	unsigned long nr_huge_pages, free_huge_pages, active_huge_pages;
+
+	freelist_length = ARRAY_LENGTH(hstate.hugepage_freelists);
+	/* Exclude free huge pages */
+	if (info->dump_level & (DL_EXCLUDE_FREE_HUGE
+		| DL_EXCLUDE_ACTIVE_HUGE)) {
+		gettimeofday(&tv_start, NULL);
+		for (i = 0; i < NUMBER(HUGE_MAX_HSTATE); i++) {
+			curr_hstate = SYMBOL(hstates) + SIZE(hstate) * i;
+			/* Read order */
+			if (!readmem(VADDR,
+				curr_hstate + OFFSET(hstate.order),
+				&order, sizeof(order))) {
+				ERRMSG("Can't get hstate.order!");
+				return FALSE;
+			}
+			/* Read free_huge_pages */
+			if (!readmem(VADDR,
+				curr_hstate + OFFSET(hstate.free_huge_pages),
+				&free_huge_pages, sizeof(free_huge_pages))) {
+				ERRMSG("Can't get hstate.free_huge_pages!");
+				return FALSE;
+			}
+			for (node = 0; node < freelist_length; node++) {
+				/* head = hstate.hugepage_freelists[node] */
+				head = curr_hstate
+					+ OFFSET(hstate.hugepage_freelists)
+					+ SIZE(list_head) * node;
+				if (!readmem(VADDR,
+					head + OFFSET(list_head.next),
+					&curr, sizeof(curr))) {
+					ERRMSG("Can't get free list!");
+					return FALSE;
+				}
+				curr_prev = head;
+				/* Walking free list of the node */
+				while (head != curr && curr != 0) {
+					print_progress(PROGRESS_FREE_HUGE,
+						pfn_free_huge, free_huge_pages);
+					if (!readmem(VADDR,
+						curr + OFFSET(list_head.prev),
+						&previous, sizeof(previous))) {
+						ERRMSG("Can't get free list!");
+						return FALSE;
+					}
+					if (previous != curr_prev) {
+						ERRMSG("Free list is broken!");
+						return FALSE;
+					}
+					curr_page = curr - OFFSET(page.lru);
+					pfn = page_to_pfn(curr_page);
+					if (!clear_huge_page(pfn, order))
+						return FALSE;
+					pfn_free_huge++;
+					curr_prev = curr;
+					if (!readmem(VADDR,
+						curr + OFFSET(list_head.next),
+						&curr, sizeof(curr))) {
+						ERRMSG("Can't get free list!");
+						return FALSE;
+					}
+				}
+			}
+		}
+		/*
+		 * print [100 %]
+		 */
+		print_progress(PROGRESS_FREE_HUGE, 1, 1);
+		print_execution_time(PROGRESS_FREE_HUGE, &tv_start);
+	}
+
+	/* Exclude active huge pages */
+	if (info->dump_level & DL_EXCLUDE_ACTIVE_HUGE) {
+		gettimeofday(&tv_start, NULL);
+		for (i = 0; i < NUMBER(HUGE_MAX_HSTATE); i++) {
+			curr_hstate = SYMBOL(hstates) + SIZE(hstate) * i;
+			/* Read order */
+			if (!readmem(VADDR,
+				curr_hstate + OFFSET(hstate.order),
+				&order, sizeof(order))) {
+				ERRMSG("Can't get hstate.order!");
+				return FALSE;
+			}
+			/* Read nr_huge_pages */
+			if (!readmem(VADDR,
+				curr_hstate + OFFSET(hstate.nr_huge_pages),
+				&nr_huge_pages, sizeof(nr_huge_pages))) {
+				ERRMSG("Can't get hstate.nr_huge_pages!");
+				return FALSE;
+			}
+			/* Read free_huge_pages */
+			if (!readmem(VADDR,
+				curr_hstate + OFFSET(hstate.free_huge_pages),
+				&free_huge_pages, sizeof(free_huge_pages))) {
+				ERRMSG("Can't get hstate.free_huge_pages!");
+				return FALSE;
+			}
+			if (nr_huge_pages < free_huge_pages) {
+				ERRMSG("nr_huge_pages < free_huge_pages!");
+				return FALSE;
+			}
+			active_huge_pages = nr_huge_pages - free_huge_pages;
+			/* head = hstate.hugepage_activelist */
+			head = curr_hstate + OFFSET(hstate.hugepage_activelist);
+			if (!readmem(VADDR, head + OFFSET(list_head.next),
+				&curr, sizeof(curr))) {
+				ERRMSG("Can't get active list!");
+				return FALSE;
+			}
+			curr_prev = head;
+			/* Walking active list */
+			while (head != curr && curr != 0) {
+				print_progress(PROGRESS_ACTIVE_HUGE,
+					pfn_active_huge,
+					active_huge_pages);
+				if (!readmem(VADDR,
+					curr + OFFSET(list_head.prev),
+					&previous, sizeof(previous))) {
+					ERRMSG("Can't get active list!");
+					return FALSE;
+				}
+				if (previous != curr_prev) {
+					ERRMSG("Active list is broken!");
+					return FALSE;
+				}
+				curr_page = curr - OFFSET(page.lru);
+				pfn = page_to_pfn(curr_page);
+				if (!clear_huge_page(pfn, order))
+					return FALSE;
+				pfn_active_huge++;
+				curr_prev = curr;
+				if (!readmem(VADDR,
+					curr + OFFSET(list_head.next),
+					&curr, sizeof(curr))) {
+					ERRMSG("Can't get active list!");
+					return FALSE;
+				}
+			}
+		}
+		/*
+		 * print [100 %]
+		 */
+		print_progress(PROGRESS_ACTIVE_HUGE, 1, 1);
+		print_execution_time(PROGRESS_ACTIVE_HUGE, &tv_start);
+	}
+
+	DEBUG_MSG("\n");
+	DEBUG_MSG("free huge pages  : %lld\n", pfn_free_huge);
+	DEBUG_MSG("active huge pages: %lld\n", pfn_active_huge);
+
+	return TRUE;
+}
+
+int
+exclude_huge_page(void)
+{
+	/*
+	 * Check having necessary information.
+	 */
+	if (SYMBOL(hstates) == NOT_FOUND_SYMBOL) {
+		ERRMSG("Can't get necessary symbols for huge pages.\n");
+		return FALSE;
+	}
+
+	if ((SIZE(hstate) == NOT_FOUND_STRUCTURE)
+	    || (OFFSET(hstate.order) == NOT_FOUND_STRUCTURE)
+	    || (OFFSET(hstate.nr_huge_pages) == NOT_FOUND_STRUCTURE)
+	    || (OFFSET(hstate.free_huge_pages) == NOT_FOUND_STRUCTURE)
+	    || (OFFSET(hstate.hugepage_activelist) == NOT_FOUND_STRUCTURE)
+	    || (OFFSET(hstate.hugepage_freelists) == NOT_FOUND_STRUCTURE)
+	    || (ARRAY_LENGTH(hstate.hugepage_freelists)
+		== NOT_FOUND_STRUCTURE)) {
+		ERRMSG("Can't get necessary structures for huge pages.\n");
+		return FALSE;
+	}
+
+	/*
+	 * Detect huge pages and update 2nd-bitmap.
+	 */
+	if (!_exclude_huge_page())
+		return FALSE;
+
+	return TRUE;
+}
+
 /*
  * Let C be a cyclic buffer size and B a bitmap size used for
  * representing maximum block size managed by buddy allocator.
@@ -4532,6 +4783,13 @@ exclude_unnecessary_pages_cyclic(void)
 			return FALSE;
 
 	/*
+	 * Exclude huge pages.
+	 */
+	if (info->dump_level & (DL_EXCLUDE_FREE_HUGE | DL_EXCLUDE_ACTIVE_HUGE))
+		if (!exclude_huge_page())
+			return FALSE;
+
+	/*
 	 * Exclude cache pages, cache private pages, user data pages,
 	 * free pages and hwpoison pages.
 	 */
@@ -4661,6 +4919,13 @@ create_2nd_bitmap(void)
 			return FALSE;
 
 	/*
+	 * Exclude huge pages.
+	 */
+	if (info->dump_level & (DL_EXCLUDE_FREE_HUGE | DL_EXCLUDE_ACTIVE_HUGE))
+		if (!exclude_huge_page())
+			return FALSE;
+
+	/*
 	 * Exclude Xen user domain.
 	 */
 	if (info->flag_exclude_xen_dom) {
@@ -6513,6 +6778,7 @@ write_kdump_pages_and_bitmap_cyclic(struct cache_data *cd_header, struct cache_d
 	 */
 	pfn_zero = pfn_cache = pfn_cache_private = 0;
 	pfn_user = pfn_free = pfn_hwpoison = 0;
+	pfn_free_huge = pfn_active_huge = 0;
 	pfn_memhole = info->max_mapnr;
 
 	cd_header->offset
@@ -7416,7 +7682,8 @@ print_report(void)
 	pfn_original = info->max_mapnr - pfn_memhole;
 
 	pfn_excluded = pfn_zero + pfn_cache + pfn_cache_private
-	    + pfn_user + pfn_free + pfn_hwpoison;
+	    + pfn_user + pfn_free + pfn_hwpoison
+	    + pfn_free_huge + pfn_active_huge;
 	shrinking = (pfn_original - pfn_excluded) * 100;
 	shrinking = shrinking / pfn_original;
 
@@ -7429,6 +7696,9 @@ print_report(void)
 	    pfn_cache_private);
 	REPORT_MSG("    User process data pages : 0x%016llx\n", pfn_user);
 	REPORT_MSG("    Free pages              : 0x%016llx\n", pfn_free);
+	REPORT_MSG("    Free hugepage pages     : 0x%016llx\n", pfn_free_huge);
+	REPORT_MSG("    Active hugepage pages   : 0x%016llx\n",
+		pfn_active_huge);
 	REPORT_MSG("    Hwpoison pages          : 0x%016llx\n", pfn_hwpoison);
 	REPORT_MSG("  Remaining pages  : 0x%016llx\n",
 	    pfn_original - pfn_excluded);
diff --git a/makedumpfile.h b/makedumpfile.h
index a5826e0..1a0a5fa 100644
--- a/makedumpfile.h
+++ b/makedumpfile.h
@@ -178,7 +178,7 @@ isAnon(unsigned long mapping)
  * Dump Level
  */
 #define MIN_DUMP_LEVEL		(0)
-#define MAX_DUMP_LEVEL		(31)
+#define MAX_DUMP_LEVEL		(127)
 #define NUM_ARRAY_DUMP_LEVEL	(MAX_DUMP_LEVEL + 1) /* enough to allocate
 							all the dump_level */
 #define DL_EXCLUDE_ZERO		(0x001) /* Exclude Pages filled with Zeros */
@@ -189,6 +189,9 @@ isAnon(unsigned long mapping)
 #define DL_EXCLUDE_USER_DATA	(0x008) /* Exclude UserProcessData Pages */
 #define DL_EXCLUDE_FREE		(0x010)	/* Exclude Free Pages */
 
+#define DL_EXCLUDE_FREE_HUGE	(0x020) /* Exclude Free Huge Pages */
+#define DL_EXCLUDE_ACTIVE_HUGE	(0x040) /* Exclude Active Huge Pages */
+
 
 /*
  * For parse_line()
@@ -1098,6 +1101,7 @@ struct symbol_table {
 	unsigned long long	mem_map;
 	unsigned long long	vmem_map;
 	unsigned long long	mem_section;
+	unsigned long long	hstates;
 	unsigned long long	pkmap_count;
 	unsigned long long	pkmap_count_next;
 	unsigned long long	system_utsname;
@@ -1174,6 +1178,7 @@ struct size_table {
 	long	zone;
 	long	free_area;
 	long	list_head;
+	long	hstate;
 	long	node_memblk_s;
 	long	nodemask_t;
 
@@ -1232,6 +1237,13 @@ struct offset_table {
 	struct free_area {
 		long	free_list;
 	} free_area;
+	struct hstate {
+		long	order;
+		long	nr_huge_pages;
+		long	free_huge_pages;
+		long	hugepage_activelist;
+		long	hugepage_freelists;
+	} hstate;
 	struct list_head {
 		long	next;
 		long	prev;
@@ -1368,6 +1380,9 @@ struct array_table {
 	struct free_area_at {
 		long	free_list;
 	} free_area;
+	struct hstate_at {
+		long	hugepage_freelists;
+	} hstate;
 	struct kimage_at {
 		long	segment;
 	} kimage;
@@ -1388,6 +1403,8 @@ struct number_table {
 	long    PG_hwpoison;
 
 	long	PAGE_BUDDY_MAPCOUNT_VALUE;
+
+	long	HUGE_MAX_HSTATE;
 };
 
 struct srcfile_table {



* [PATCH 2/3] makedumpfile: hugepage filtering: add excluding hugepage messages
  2013-11-05 13:45 [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump Jingbai Ma
  2013-11-05 13:45 ` [PATCH 1/3] makedumpfile: hugepage filtering: add hugepage filtering functions Jingbai Ma
@ 2013-11-05 13:45 ` Jingbai Ma
  2013-11-05 13:46 ` [PATCH 3/3] makedumpfile: hugepage filtering: add new dump levels for manual page Jingbai Ma
  2013-11-05 20:26 ` [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump Vivek Goyal
  3 siblings, 0 replies; 25+ messages in thread
From: Jingbai Ma @ 2013-11-05 13:45 UTC (permalink / raw)
  To: ptesarik, d.hatayama, kumagai-atsushi
  Cc: bhe, tom.vaden, kexec, linux-kernel, lisa.mitchell, anderson,
	ebiederm, vgoyal

Add usage and progress messages for the new hugepage dump levels to print_info.

Signed-off-by: Jingbai Ma <jingbai.ma@hp.com>
---
 print_info.c |   12 +++++++-----
 print_info.h |    2 ++
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/print_info.c b/print_info.c
index 06939e0..978d9fb 100644
--- a/print_info.c
+++ b/print_info.c
@@ -103,17 +103,19 @@ print_usage(void)
 	MSG("      The maximum of Dump_Level is 31.\n");
 	MSG("      Note that Dump_Level for Xen dump filtering is 0 or 1.\n");
 	MSG("\n");
-	MSG("            |         cache    cache\n");
-	MSG("      Dump  |  zero   without  with     user    free\n");
-	MSG("      Level |  page   private  private  data    page\n");
-	MSG("     -------+---------------------------------------\n");
+	MSG("            |         cache    cache                    free    active\n");
+	MSG("      Dump  |  zero   without  with     user    free    huge    huge\n");
+	MSG("      Level |  page   private  private  data    page    page    page\n");
+	MSG("     -------+----------------------------------------------------------\n");
 	MSG("         0  |\n");
 	MSG("         1  |   X\n");
 	MSG("         2  |           X\n");
 	MSG("         4  |           X        X\n");
 	MSG("         8  |                            X\n");
 	MSG("        16  |                                    X\n");
-	MSG("        31  |   X       X        X       X       X\n");
+	MSG("        32  |                                            X\n");
+	MSG("        64  |                                            X       X\n");
+	MSG("       127  |   X       X        X       X       X       X       X\n");
 	MSG("\n");
 	MSG("  [-E]:\n");
 	MSG("      Create DUMPFILE in the ELF format.\n");
diff --git a/print_info.h b/print_info.h
index 01e3706..8461df6 100644
--- a/print_info.h
+++ b/print_info.h
@@ -35,6 +35,8 @@ void print_execution_time(char *step_name, struct timeval *tv_start);
 #define PROGRESS_HOLES		"Checking for memory holes  "
 #define PROGRESS_UNN_PAGES 	"Excluding unnecessary pages"
 #define PROGRESS_FREE_PAGES 	"Excluding free pages       "
+#define PROGRESS_FREE_HUGE	"Excluding free huge pages  "
+#define PROGRESS_ACTIVE_HUGE	"Excluding active huge pages"
 #define PROGRESS_ZERO_PAGES 	"Excluding zero pages       "
 #define PROGRESS_XEN_DOMAIN 	"Excluding xen user domain  "
 



* [PATCH 3/3] makedumpfile: hugepage filtering: add new dump levels for manual page
  2013-11-05 13:45 [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump Jingbai Ma
  2013-11-05 13:45 ` [PATCH 1/3] makedumpfile: hugepage filtering: add hugepage filtering functions Jingbai Ma
  2013-11-05 13:45 ` [PATCH 2/3] makedumpfile: hugepage filtering: add excluding hugepage messages Jingbai Ma
@ 2013-11-05 13:46 ` Jingbai Ma
  2013-11-05 20:26 ` [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump Vivek Goyal
  3 siblings, 0 replies; 25+ messages in thread
From: Jingbai Ma @ 2013-11-05 13:46 UTC (permalink / raw)
  To: ptesarik, d.hatayama, kumagai-atsushi
  Cc: bhe, tom.vaden, kexec, linux-kernel, lisa.mitchell, anderson,
	ebiederm, vgoyal

Add the new dump levels to the makedumpfile manual page.

Signed-off-by: Jingbai Ma <jingbai.ma@hp.com>
---
 makedumpfile.8 |  170 ++++++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 133 insertions(+), 37 deletions(-)

diff --git a/makedumpfile.8 b/makedumpfile.8
index adeb811..70e8732 100644
--- a/makedumpfile.8
+++ b/makedumpfile.8
@@ -164,43 +164,139 @@ by dump_level 11, makedumpfile retries it by dump_level 31.
 .br
 # makedumpfile \-d 11,31 \-x vmlinux /proc/vmcore dumpfile
 
-       |      |cache  |cache  |      |
-  dump | zero |without|with   | user | free
- level | page |private|private| data | page
-.br
-\-\-\-\-\-\-\-+\-\-\-\-\-\-+\-\-\-\-\-\-\-+\-\-\-\-\-\-\-+\-\-\-\-\-\-+\-\-\-\-\-\-
-     0 |      |       |       |      |
-     1 |  X   |       |       |      |
-     2 |      |   X   |       |      |
-     3 |  X   |   X   |       |      |
-     4 |      |   X   |   X   |      |
-     5 |  X   |   X   |   X   |      |
-     6 |      |   X   |   X   |      |
-     7 |  X   |   X   |   X   |      |
-     8 |      |       |       |  X   |
-     9 |  X   |       |       |  X   |
-    10 |      |   X   |       |  X   |
-    11 |  X   |   X   |       |  X   |
-    12 |      |   X   |   X   |  X   |
-    13 |  X   |   X   |   X   |  X   |
-    14 |      |   X   |   X   |  X   |
-    15 |  X   |   X   |   X   |  X   |
-    16 |      |       |       |      |  X
-    17 |  X   |       |       |      |  X
-    18 |      |   X   |       |      |  X
-    19 |  X   |   X   |       |      |  X
-    20 |      |   X   |   X   |      |  X
-    21 |  X   |   X   |   X   |      |  X
-    22 |      |   X   |   X   |      |  X
-    23 |  X   |   X   |   X   |      |  X
-    24 |      |       |       |  X   |  X
-    25 |  X   |       |       |  X   |  X
-    26 |      |   X   |       |  X   |  X
-    27 |  X   |   X   |       |  X   |  X
-    28 |      |   X   |   X   |  X   |  X
-    29 |  X   |   X   |   X   |  X   |  X
-    30 |      |   X   |   X   |  X   |  X
-    31 |  X   |   X   |   X   |  X   |  X
+       |      |cache  |cache  |      |      | free | active
+  dump | zero |without|with   | user | free | huge | huge
+ level | page |private|private| data | page | page | page
+.br
+\-\-\-\-\-\-\-+\-\-\-\-\-\-+\-\-\-\-\-\-\-+\-\-\-\-\-\-\-+\-\-\-\-\-\-+\-\-\-\-\-\-+\-\-\-\-\-\-+\-\-\-\-\-\-\-\-
+     0 |      |       |       |      |      |      |
+     1 |  X   |       |       |      |      |      |
+     2 |      |   X   |       |      |      |      |
+     3 |  X   |   X   |       |      |      |      |
+     4 |      |   X   |   X   |      |      |      |
+     5 |  X   |   X   |   X   |      |      |      |
+     6 |      |   X   |   X   |      |      |      |
+     7 |  X   |   X   |   X   |      |      |      |
+     8 |      |       |       |  X   |      |      |
+     9 |  X   |       |       |  X   |      |      |
+    10 |      |   X   |       |  X   |      |      |
+    11 |  X   |   X   |       |  X   |      |      |
+    12 |      |   X   |   X   |  X   |      |      |
+    13 |  X   |   X   |   X   |  X   |      |      |
+    14 |      |   X   |   X   |  X   |      |      |
+    15 |  X   |   X   |   X   |  X   |      |      |
+    16 |      |       |       |      |  X   |      |
+    17 |  X   |       |       |      |  X   |      |
+    18 |      |   X   |       |      |  X   |      |
+    19 |  X   |   X   |       |      |  X   |      |
+    20 |      |   X   |   X   |      |  X   |      |
+    21 |  X   |   X   |   X   |      |  X   |      |
+    22 |      |   X   |   X   |      |  X   |      |
+    23 |  X   |   X   |   X   |      |  X   |      |
+    24 |      |       |       |  X   |  X   |      |
+    25 |  X   |       |       |  X   |  X   |      |
+    26 |      |   X   |       |  X   |  X   |      |
+    27 |  X   |   X   |       |  X   |  X   |      |
+    28 |      |   X   |   X   |  X   |  X   |      |
+    29 |  X   |   X   |   X   |  X   |  X   |      |
+    30 |      |   X   |   X   |  X   |  X   |      |
+    31 |  X   |   X   |   X   |  X   |  X   |      |
+    32 |      |       |       |      |      |  X   |
+    33 |  X   |       |       |      |      |  X   |
+    34 |      |   X   |       |      |      |  X   |
+    35 |  X   |   X   |       |      |      |  X   |
+    36 |      |   X   |   X   |      |      |  X   |
+    37 |  X   |   X   |   X   |      |      |  X   |
+    38 |      |   X   |   X   |      |      |  X   |
+    39 |  X   |   X   |   X   |      |      |  X   |
+    40 |      |       |       |  X   |      |  X   |
+    41 |  X   |       |       |  X   |      |  X   |
+    42 |      |   X   |       |  X   |      |  X   |
+    43 |  X   |   X   |       |  X   |      |  X   |
+    44 |      |   X   |   X   |  X   |      |  X   |
+    45 |  X   |   X   |   X   |  X   |      |  X   |
+    46 |      |   X   |   X   |  X   |      |  X   |
+    47 |  X   |   X   |   X   |  X   |      |  X   |
+    48 |      |       |       |      |  X   |  X   |
+    49 |  X   |       |       |      |  X   |  X   |
+    50 |      |   X   |       |      |  X   |  X   |
+    51 |  X   |   X   |       |      |  X   |  X   |
+    52 |      |   X   |   X   |      |  X   |  X   |
+    53 |  X   |   X   |   X   |      |  X   |  X   |
+    54 |      |   X   |   X   |      |  X   |  X   |
+    55 |  X   |   X   |   X   |      |  X   |  X   |
+    56 |      |       |       |  X   |  X   |  X   |
+    57 |  X   |       |       |  X   |  X   |  X   |
+    58 |      |   X   |       |  X   |  X   |  X   |
+    59 |  X   |   X   |       |  X   |  X   |  X   |
+    60 |      |   X   |   X   |  X   |  X   |  X   |
+    61 |  X   |   X   |   X   |  X   |  X   |  X   |
+    62 |      |   X   |   X   |  X   |  X   |  X   |
+    63 |  X   |   X   |   X   |  X   |  X   |  X   |
+    64 |      |       |       |      |      |  X   |  X
+    65 |  X   |       |       |      |      |  X   |  X
+    66 |      |   X   |       |      |      |  X   |  X
+    67 |  X   |   X   |       |      |      |  X   |  X
+    68 |      |   X   |   X   |      |      |  X   |  X
+    69 |  X   |   X   |   X   |      |      |  X   |  X
+    70 |      |   X   |   X   |      |      |  X   |  X
+    71 |  X   |   X   |   X   |      |      |  X   |  X
+    72 |      |       |       |  X   |      |  X   |  X
+    73 |  X   |       |       |  X   |      |  X   |  X
+    74 |      |   X   |       |  X   |      |  X   |  X
+    75 |  X   |   X   |       |  X   |      |  X   |  X
+    76 |      |   X   |   X   |  X   |      |  X   |  X
+    77 |  X   |   X   |   X   |  X   |      |  X   |  X
+    78 |      |   X   |   X   |  X   |      |  X   |  X
+    79 |  X   |   X   |   X   |  X   |      |  X   |  X
+    80 |      |       |       |      |  X   |  X   |  X
+    81 |  X   |       |       |      |  X   |  X   |  X
+    82 |      |   X   |       |      |  X   |  X   |  X
+    83 |  X   |   X   |       |      |  X   |  X   |  X
+    84 |      |   X   |   X   |      |  X   |  X   |  X
+    85 |  X   |   X   |   X   |      |  X   |  X   |  X
+    86 |      |   X   |   X   |      |  X   |  X   |  X
+    87 |  X   |   X   |   X   |      |  X   |  X   |  X
+    88 |      |       |       |  X   |  X   |  X   |  X
+    89 |  X   |       |       |  X   |  X   |  X   |  X
+    90 |      |   X   |       |  X   |  X   |  X   |  X
+    91 |  X   |   X   |       |  X   |  X   |  X   |  X
+    92 |      |   X   |   X   |  X   |  X   |  X   |  X
+    93 |  X   |   X   |   X   |  X   |  X   |  X   |  X
+    94 |      |   X   |   X   |  X   |  X   |  X   |  X
+    95 |  X   |   X   |   X   |  X   |  X   |  X   |  X
+    96 |      |       |       |      |      |  X   |  X
+    97 |  X   |       |       |      |      |  X   |  X
+    98 |      |   X   |       |      |      |  X   |  X
+    99 |  X   |   X   |       |      |      |  X   |  X
+   100 |      |   X   |   X   |      |      |  X   |  X
+   101 |  X   |   X   |   X   |      |      |  X   |  X
+   102 |      |   X   |   X   |      |      |  X   |  X
+   103 |  X   |   X   |   X   |      |      |  X   |  X
+   104 |      |       |       |  X   |      |  X   |  X
+   105 |  X   |       |       |  X   |      |  X   |  X
+   106 |      |   X   |       |  X   |      |  X   |  X
+   107 |  X   |   X   |       |  X   |      |  X   |  X
+   108 |      |   X   |   X   |  X   |      |  X   |  X
+   109 |  X   |   X   |   X   |  X   |      |  X   |  X
+   110 |      |   X   |   X   |  X   |      |  X   |  X
+   111 |  X   |   X   |   X   |  X   |      |  X   |  X
+   112 |      |       |       |      |  X   |  X   |  X
+   113 |  X   |       |       |      |  X   |  X   |  X
+   114 |      |   X   |       |      |  X   |  X   |  X
+   115 |  X   |   X   |       |      |  X   |  X   |  X
+   116 |      |   X   |   X   |      |  X   |  X   |  X
+   117 |  X   |   X   |   X   |      |  X   |  X   |  X
+   118 |      |   X   |   X   |      |  X   |  X   |  X
+   119 |  X   |   X   |   X   |      |  X   |  X   |  X
+   120 |      |       |       |  X   |  X   |  X   |  X
+   121 |  X   |       |       |  X   |  X   |  X   |  X
+   122 |      |   X   |       |  X   |  X   |  X   |  X
+   123 |  X   |   X   |       |  X   |  X   |  X   |  X
+   124 |      |   X   |   X   |  X   |  X   |  X   |  X
+   125 |  X   |   X   |   X   |  X   |  X   |  X   |  X
+   126 |      |   X   |   X   |  X   |  X   |  X   |  X
+   127 |  X   |   X   |   X   |  X   |  X   |  X   |  X
 
 
 .TP


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-05 13:45 [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump Jingbai Ma
                   ` (2 preceding siblings ...)
  2013-11-05 13:46 ` [PATCH 3/3] makedumpfile: hugepage filtering: add new dump levels for manual page Jingbai Ma
@ 2013-11-05 20:26 ` Vivek Goyal
  2013-11-06  1:47   ` Jingbai Ma
  2013-11-06  2:21   ` Atsushi Kumagai
  3 siblings, 2 replies; 25+ messages in thread
From: Vivek Goyal @ 2013-11-05 20:26 UTC (permalink / raw)
  To: Jingbai Ma
  Cc: ptesarik, d.hatayama, kumagai-atsushi, bhe, tom.vaden, kexec,
	linux-kernel, lisa.mitchell, anderson, ebiederm

On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
> This patch set intends to exclude unnecessary hugepages from the vmcore dump file.
> 
> This patch requires the kernel patch to export necessary data structures into
> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
> 
> This patch introduces two new dump levels, 32 and 64, to exclude all unused and
> active hugepages. The level to exclude all unnecessary pages is now 127.

Interesting. Why should hugepages be treated any differently than normal
pages?

If the user asked to filter out free pages, then they should be filtered;
it should not matter whether a page is a huge page or not.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-05 20:26 ` [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump Vivek Goyal
@ 2013-11-06  1:47   ` Jingbai Ma
  2013-11-06  1:53     ` Vivek Goyal
  2013-11-06  2:21   ` Atsushi Kumagai
  1 sibling, 1 reply; 25+ messages in thread
From: Jingbai Ma @ 2013-11-06  1:47 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jingbai Ma, ptesarik, d.hatayama, kumagai-atsushi, bhe,
	tom.vaden, kexec, linux-kernel, lisa.mitchell, anderson,
	ebiederm

On 11/06/2013 04:26 AM, Vivek Goyal wrote:
> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
>> This patch set intends to exclude unnecessary hugepages from the vmcore dump file.
>>
>> This patch requires the kernel patch to export necessary data structures into
>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
>>
>> This patch introduces two new dump levels, 32 and 64, to exclude all unused and
>> active hugepages. The level to exclude all unnecessary pages is now 127.
>
> Interesting. Why should hugepages be treated any differently than normal
> pages?
>
> If the user asked to filter out free pages, then they should be filtered;
> it should not matter whether a page is a huge page or not.

Yes, free hugepages should be filtered out along with other free pages;
that sounds reasonable.

But for active hugepages, I would offer the user more choices/flexibility
(though that may be a bad idea).
I'm OK with filtering active hugepages along with other user data pages.

Any other comments?


>
> Thanks
> Vivek


-- 
Thanks,
Jingbai Ma

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-06  1:47   ` Jingbai Ma
@ 2013-11-06  1:53     ` Vivek Goyal
  0 siblings, 0 replies; 25+ messages in thread
From: Vivek Goyal @ 2013-11-06  1:53 UTC (permalink / raw)
  To: Jingbai Ma
  Cc: ptesarik, d.hatayama, kumagai-atsushi, bhe, tom.vaden, kexec,
	linux-kernel, lisa.mitchell, anderson, ebiederm

On Wed, Nov 06, 2013 at 09:47:49AM +0800, Jingbai Ma wrote:
> On 11/06/2013 04:26 AM, Vivek Goyal wrote:
> >On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
> >>This patch set intends to exclude unnecessary hugepages from the vmcore dump file.
> >>
> >>This patch requires the kernel patch to export necessary data structures into
> >>vmcore: "kexec: export hugepage data structure into vmcoreinfo"
> >>http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
> >>
> >>This patch introduces two new dump levels, 32 and 64, to exclude all unused and
> >>active hugepages. The level to exclude all unnecessary pages is now 127.
> >
> >Interesting. Why should hugepages be treated any differently than normal
> >pages?
> >
> >If the user asked to filter out free pages, then they should be filtered;
> >it should not matter whether a page is a huge page or not.
> 
> Yes, free hugepages should be filtered out along with other free pages;
> that sounds reasonable.
> 
> But for active hugepages, I would offer the user more
> choices/flexibility (though that may be a bad idea).
> I'm OK with filtering active hugepages along with other user data pages.
> 
> Any other comments?

I really can't see why hugepages are different from regular pages when
it comes to filtering. IMO, we should not create filtering
options/levels only for huge pages unless and until there is a strong
use case.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-05 20:26 ` [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump Vivek Goyal
  2013-11-06  1:47   ` Jingbai Ma
@ 2013-11-06  2:21   ` Atsushi Kumagai
  2013-11-06 14:23     ` Vivek Goyal
  2013-11-07  0:54     ` HATAYAMA Daisuke
  1 sibling, 2 replies; 25+ messages in thread
From: Atsushi Kumagai @ 2013-11-06  2:21 UTC (permalink / raw)
  To: vgoyal
  Cc: jingbai.ma, bhe, tom.vaden, kexec, ptesarik, linux-kernel,
	lisa.mitchell, d.hatayama, ebiederm, anderson

(2013/11/06 5:27), Vivek Goyal wrote:
> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
>> This patch set intends to exclude unnecessary hugepages from the vmcore dump file.
>>
>> This patch requires the kernel patch to export necessary data structures into
>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
>>
>> This patch introduces two new dump levels, 32 and 64, to exclude all unused and
>> active hugepages. The level to exclude all unnecessary pages is now 127.
>
> Interesting. Why should hugepages be treated any differently than normal
> pages?
>
> If the user asked to filter out free pages, then they should be filtered;
> it should not matter whether a page is a huge page or not.

I'm making an RFC patch for hugepage filtering based on such a policy.

I've attached the prototype version.
It's also able to filter out THPs, and it's suitable for cyclic processing
because it depends on mem_map, and looking it up can be divided into
cycles. This is the same idea as page_is_buddy().

So I think it's better.

-- 
Thanks
Atsushi Kumagai


From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Date: Wed, 6 Nov 2013 10:10:43 +0900
Subject: [PATCH] [RFC] Exclude hugepages.

Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
---
   makedumpfile.c | 122 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
   makedumpfile.h |   8 ++++
   2 files changed, 125 insertions(+), 5 deletions(-)

diff --git a/makedumpfile.c b/makedumpfile.c
index 428c53e..75b7123 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -63,6 +63,7 @@ do { \
   
   static void check_cyclic_buffer_overrun(void);
   static void setup_page_is_buddy(void);
+static void setup_page_is_hugepage(void);
   
   void
   initialize_tables(void)
@@ -270,6 +271,18 @@ update_mmap_range(off_t offset, int initial) {
   }
   
   static int
+page_is_hugepage(unsigned long flags) {
+	if (NUMBER(PG_head) != NOT_FOUND_NUMBER) {
+		return isHead(flags);
+	} else if (NUMBER(PG_tail) != NOT_FOUND_NUMBER) {
+		return isTail(flags);
+	} else if (NUMBER(PG_compound) != NOT_FOUND_NUMBER) {
+		return isCompound(flags);
+	}
+	return 0;
+}
+
+static int
   is_mapped_with_mmap(off_t offset) {
   
   	if (info->flag_usemmap
@@ -1107,6 +1120,8 @@ get_symbol_info(void)
   		SYMBOL_ARRAY_LENGTH_INIT(node_remap_start_pfn,
   					"node_remap_start_pfn");
   
+	SYMBOL_INIT(free_huge_page, "free_huge_page");
+
   	return TRUE;
   }
   
@@ -1214,11 +1229,19 @@ get_structure_info(void)
   
   	ENUM_NUMBER_INIT(PG_lru, "PG_lru");
   	ENUM_NUMBER_INIT(PG_private, "PG_private");
+	ENUM_NUMBER_INIT(PG_head, "PG_head");
+	ENUM_NUMBER_INIT(PG_tail, "PG_tail");
+	ENUM_NUMBER_INIT(PG_compound, "PG_compound");
   	ENUM_NUMBER_INIT(PG_swapcache, "PG_swapcache");
   	ENUM_NUMBER_INIT(PG_buddy, "PG_buddy");
   	ENUM_NUMBER_INIT(PG_slab, "PG_slab");
   	ENUM_NUMBER_INIT(PG_hwpoison, "PG_hwpoison");
   
+	if (NUMBER(PG_head) == NOT_FOUND_NUMBER &&
+	    NUMBER(PG_compound) == NOT_FOUND_NUMBER)
+		/* Pre-2.6.26 kernels did not have pageflags */
+		NUMBER(PG_compound) = PG_compound_ORIGINAL;
+
   	ENUM_TYPE_SIZE_INIT(pageflags, "pageflags");
   
   	TYPEDEF_SIZE_INIT(nodemask_t, "nodemask_t");
@@ -1603,6 +1626,7 @@ write_vmcoreinfo_data(void)
   	WRITE_SYMBOL("node_remap_start_vaddr", node_remap_start_vaddr);
   	WRITE_SYMBOL("node_remap_end_vaddr", node_remap_end_vaddr);
   	WRITE_SYMBOL("node_remap_start_pfn", node_remap_start_pfn);
+	WRITE_SYMBOL("free_huge_page", free_huge_page);
   
   	/*
   	 * write the structure size of 1st kernel
@@ -1685,6 +1709,9 @@ write_vmcoreinfo_data(void)
   
   	WRITE_NUMBER("PG_lru", PG_lru);
   	WRITE_NUMBER("PG_private", PG_private);
+	WRITE_NUMBER("PG_head", PG_head);
+	WRITE_NUMBER("PG_tail", PG_tail);
+	WRITE_NUMBER("PG_compound", PG_compound);
   	WRITE_NUMBER("PG_swapcache", PG_swapcache);
   	WRITE_NUMBER("PG_buddy", PG_buddy);
   	WRITE_NUMBER("PG_slab", PG_slab);
@@ -1932,6 +1959,7 @@ read_vmcoreinfo(void)
   	READ_SYMBOL("node_remap_start_vaddr", node_remap_start_vaddr);
   	READ_SYMBOL("node_remap_end_vaddr", node_remap_end_vaddr);
   	READ_SYMBOL("node_remap_start_pfn", node_remap_start_pfn);
+	READ_SYMBOL("free_huge_page", free_huge_page);
   
   	READ_STRUCTURE_SIZE("page", page);
   	READ_STRUCTURE_SIZE("mem_section", mem_section);
@@ -2000,6 +2028,9 @@ read_vmcoreinfo(void)
   
   	READ_NUMBER("PG_lru", PG_lru);
   	READ_NUMBER("PG_private", PG_private);
+	READ_NUMBER("PG_head", PG_head);
+	READ_NUMBER("PG_tail", PG_tail);
+	READ_NUMBER("PG_compound", PG_compound);
   	READ_NUMBER("PG_swapcache", PG_swapcache);
   	READ_NUMBER("PG_slab", PG_slab);
   	READ_NUMBER("PG_buddy", PG_buddy);
@@ -3126,6 +3157,9 @@ out:
   	if (!get_value_for_old_linux())
   		return FALSE;
   
+	/* Get page flags for compound pages */
+	setup_page_is_hugepage();
+
   	/* use buddy identification of free pages whether cyclic or not */
   	/* (this can reduce pages scan of 1TB memory from 60sec to 30sec) */
   	if (info->dump_level & DL_EXCLUDE_FREE)
@@ -4197,6 +4231,23 @@ out:
   			  "follow free lists instead of mem_map array.\n");
   }
   
+static void
+setup_page_is_hugepage(void)
+{
+	if (NUMBER(PG_head) != NOT_FOUND_NUMBER) {
+		if (NUMBER(PG_tail) == NOT_FOUND_NUMBER) {
+			/* If PG_tail is not explicitly saved, then assume
+			 * that it immediately follows PG_head.
+			 */
+			NUMBER(PG_tail) = NUMBER(PG_head) + 1;
+		}
+	} else if ((NUMBER(PG_compound) != NOT_FOUND_NUMBER)
+		   && (info->dump_level & DL_EXCLUDE_USER_DATA)) {
+		MSG("Compound page bit could not be determined: ");
+		MSG("huge pages will NOT be filtered.\n");
+	}
+}
+
   /*
    * If using a dumpfile in kdump-compressed format as a source file
    * instead of /proc/vmcore, 1st-bitmap of a new dumpfile must be
@@ -4404,8 +4455,9 @@ __exclude_unnecessary_pages(unsigned long mem_map,
   	unsigned long long pfn_read_start, pfn_read_end, index_pg;
   	unsigned char page_cache[SIZE(page) * PGMM_CACHED];
   	unsigned char *pcache;
-	unsigned int _count, _mapcount = 0;
+	unsigned int _count, _mapcount = 0, compound_order = 0;
   	unsigned long flags, mapping, private = 0;
+	unsigned long hugetlb_dtor;
   
   	/*
   	 * Refresh the buffer of struct page, when changing mem_map.
@@ -4459,6 +4511,27 @@ __exclude_unnecessary_pages(unsigned long mem_map,
   		flags   = ULONG(pcache + OFFSET(page.flags));
   		_count  = UINT(pcache + OFFSET(page._count));
   		mapping = ULONG(pcache + OFFSET(page.mapping));
+
+		if (index_pg < PGMM_CACHED - 1) {
+			compound_order = ULONG(pcache + SIZE(page) + OFFSET(page.lru)
+					       + OFFSET(list_head.prev));
+			hugetlb_dtor = ULONG(pcache + SIZE(page) + OFFSET(page.lru)
+					     + OFFSET(list_head.next));
+		} else if (pfn + 1 < pfn_end) {
+			unsigned char page_cache_next[SIZE(page)];
+			if (!readmem(VADDR, mem_map, page_cache_next, SIZE(page))) {
+				ERRMSG("Can't read the buffer of struct page.\n");
+				return FALSE;
+			}
+			compound_order = ULONG(page_cache_next + OFFSET(page.lru)
+					       + OFFSET(list_head.prev));
+			hugetlb_dtor = ULONG(page_cache_next + OFFSET(page.lru)
+					     + OFFSET(list_head.next));
+		} else {
+			compound_order = 0;
+			hugetlb_dtor = 0;
+		}
+
   		if (OFFSET(page._mapcount) != NOT_FOUND_STRUCTURE)
   			_mapcount = UINT(pcache + OFFSET(page._mapcount));
   		if (OFFSET(page.private) != NOT_FOUND_STRUCTURE)
@@ -4497,6 +4570,10 @@ __exclude_unnecessary_pages(unsigned long mem_map,
   		    && !isPrivate(flags) && !isAnon(mapping)) {
   			if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
   				pfn_cache++;
+			/*
+			 * NOTE: If THP for cache is introduced, the check for
+			 *       compound pages is needed here.
+			 */
   		}
   		/*
   		 * Exclude the cache page with the private page.
@@ -4506,14 +4583,49 @@ __exclude_unnecessary_pages(unsigned long mem_map,
   		    && !isAnon(mapping)) {
   			if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
   				pfn_cache_private++;
+			/*
+			 * NOTE: If THP for cache is introduced, the check for
+			 *       compound pages is needed here.
+			 */
   		}
   		/*
   		 * Exclude the data page of the user process.
   		 */
-		else if ((info->dump_level & DL_EXCLUDE_USER_DATA)
-		    && isAnon(mapping)) {
-			if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
-				pfn_user++;
+		else if (info->dump_level & DL_EXCLUDE_USER_DATA) {
+			/*
+			 * Exclude the anonymous pages as user pages.
+			 */
+			if (isAnon(mapping)) {
+				if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
+					pfn_user++;
+
+				/*
+				 * Check the compound page
+				 */
+				if (page_is_hugepage(flags) && compound_order > 0) {
+					int i, nr_pages = 1 << compound_order;
+
+					for (i = 1; i < nr_pages; ++i) {
+						if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i))
+							pfn_user++;
+					}
+					pfn += nr_pages - 2;
+					mem_map += (nr_pages - 1) * SIZE(page);
+				}
+			}
+			/*
+			 * Exclude the hugetlbfs pages as user pages.
+			 */
+			else if (hugetlb_dtor == SYMBOL(free_huge_page)) {
+				int i, nr_pages = 1 << compound_order;
+
+				for (i = 0; i < nr_pages; ++i) {
+					if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i))
+						pfn_user++;
+				}
+				pfn += nr_pages - 1;
+				mem_map += (nr_pages - 1) * SIZE(page);
+			}
   		}
   		/*
   		 * Exclude the hwpoison page.
diff --git a/makedumpfile.h b/makedumpfile.h
index 3a7e61a..d6ee832 100644
--- a/makedumpfile.h
+++ b/makedumpfile.h
@@ -74,6 +74,7 @@ int get_mem_type(void);
   #define PG_lru_ORIGINAL	 	(5)
   #define PG_slab_ORIGINAL	(7)
   #define PG_private_ORIGINAL	(11)	/* Has something at ->private */
+#define PG_compound_ORIGINAL	(14)	/* Is part of a compound page */
   #define PG_swapcache_ORIGINAL	(15)	/* Swap page: swp_entry_t in private */
   
   #define PAGE_BUDDY_MAPCOUNT_VALUE_v2_6_38	(-2)
@@ -140,6 +141,9 @@ test_bit(int nr, unsigned long addr)
   
   #define isLRU(flags)		test_bit(NUMBER(PG_lru), flags)
   #define isPrivate(flags)	test_bit(NUMBER(PG_private), flags)
+#define isHead(flags)		test_bit(NUMBER(PG_head), flags)
+#define isTail(flags)		test_bit(NUMBER(PG_tail), flags)
+#define isCompound(flags)	test_bit(NUMBER(PG_compound), flags)
   #define isSwapCache(flags)	test_bit(NUMBER(PG_swapcache), flags)
   #define isHWPOISON(flags)	(test_bit(NUMBER(PG_hwpoison), flags) \
   				&& (NUMBER(PG_hwpoison) != NOT_FOUND_NUMBER))
@@ -1124,6 +1128,7 @@ struct symbol_table {
   	unsigned long long	node_remap_start_vaddr;
   	unsigned long long	node_remap_end_vaddr;
   	unsigned long long	node_remap_start_pfn;
+	unsigned long long      free_huge_page;
   
   	/*
   	 * for Xen extraction
@@ -1383,6 +1388,9 @@ struct number_table {
   	 */
   	long	PG_lru;
   	long	PG_private;
+	long	PG_head;
+	long	PG_tail;
+	long	PG_compound;
   	long	PG_swapcache;
   	long	PG_buddy;
   	long	PG_slab;
-- 
1.8.0.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-06  2:21   ` Atsushi Kumagai
@ 2013-11-06 14:23     ` Vivek Goyal
  2013-11-07  8:57       ` Jingbai Ma
  2013-11-07  0:54     ` HATAYAMA Daisuke
  1 sibling, 1 reply; 25+ messages in thread
From: Vivek Goyal @ 2013-11-06 14:23 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: jingbai.ma, bhe, tom.vaden, kexec, ptesarik, linux-kernel,
	lisa.mitchell, d.hatayama, ebiederm, anderson

On Wed, Nov 06, 2013 at 02:21:39AM +0000, Atsushi Kumagai wrote:
> (2013/11/06 5:27), Vivek Goyal wrote:
> > On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
> >> This patch set intends to exclude unnecessary hugepages from the vmcore dump file.
> >>
> >> This patch requires the kernel patch to export necessary data structures into
> >> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
> >> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
> >>
> >> This patch introduces two new dump levels, 32 and 64, to exclude all unused and
> >> active hugepages. The level to exclude all unnecessary pages is now 127.
> >
> > Interesting. Why should hugepages be treated any differently than normal
> > pages?
> >
> > If the user asked to filter out free pages, then they should be filtered;
> > it should not matter whether a page is a huge page or not.
> 
> I'm making an RFC patch for hugepage filtering based on such a policy.
> 
> I've attached the prototype version.
> It's also able to filter out THPs, and it's suitable for cyclic processing
> because it depends on mem_map, and looking it up can be divided into
> cycles. This is the same idea as page_is_buddy().
> 
> So I think it's better.

Agreed. Being able to treat hugepages in the same manner as other pages
sounds good.

Jingbai, looks good to you?

Thanks
Vivek


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-06  2:21   ` Atsushi Kumagai
  2013-11-06 14:23     ` Vivek Goyal
@ 2013-11-07  0:54     ` HATAYAMA Daisuke
  2013-11-22  7:16       ` HATAYAMA Daisuke
  1 sibling, 1 reply; 25+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-07  0:54 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: vgoyal, bhe, tom.vaden, kexec, ptesarik, linux-kernel,
	lisa.mitchell, anderson, ebiederm, jingbai.ma

(2013/11/06 11:21), Atsushi Kumagai wrote:
> (2013/11/06 5:27), Vivek Goyal wrote:
>> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
>>> This patch set intends to exclude unnecessary hugepages from the vmcore dump file.
>>>
>>> This patch requires the kernel patch to export necessary data structures into
>>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
>>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
>>>
>>> This patch introduces two new dump levels, 32 and 64, to exclude all unused and
>>> active hugepages. The level to exclude all unnecessary pages is now 127.
>>
>> Interesting. Why should hugepages be treated any differently than normal
>> pages?
>>
>> If the user asked to filter out free pages, then they should be filtered;
>> it should not matter whether a page is a huge page or not.
>
> I'm making an RFC patch for hugepage filtering based on such a policy.
>
> I've attached the prototype version.
> It's also able to filter out THPs, and it's suitable for cyclic processing
> because it depends on mem_map, and looking it up can be divided into
> cycles. This is the same idea as page_is_buddy().
>
> So I think it's better.
>

> @@ -4506,14 +4583,49 @@ __exclude_unnecessary_pages(unsigned long mem_map,
>    		    && !isAnon(mapping)) {
>    			if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
>    				pfn_cache_private++;
> +			/*
> +			 * NOTE: If THP for cache is introduced, the check for
> +			 *       compound pages is needed here.
> +			 */
>    		}
>    		/*
>    		 * Exclude the data page of the user process.
>    		 */
> -		else if ((info->dump_level & DL_EXCLUDE_USER_DATA)
> -		    && isAnon(mapping)) {
> -			if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
> -				pfn_user++;
> +		else if (info->dump_level & DL_EXCLUDE_USER_DATA) {
> +			/*
> +			 * Exclude the anonymous pages as user pages.
> +			 */
> +			if (isAnon(mapping)) {
> +				if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
> +					pfn_user++;
> +
> +				/*
> +				 * Check the compound page
> +				 */
> +				if (page_is_hugepage(flags) && compound_order > 0) {
> +					int i, nr_pages = 1 << compound_order;
> +
> +					for (i = 1; i < nr_pages; ++i) {
> +						if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i))
> +							pfn_user++;
> +					}
> +					pfn += nr_pages - 2;
> +					mem_map += (nr_pages - 1) * SIZE(page);
> +				}
> +			}
> +			/*
> +			 * Exclude the hugetlbfs pages as user pages.
> +			 */
> +			else if (hugetlb_dtor == SYMBOL(free_huge_page)) {
> +				int i, nr_pages = 1 << compound_order;
> +
> +				for (i = 0; i < nr_pages; ++i) {
> +					if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i))
> +						pfn_user++;
> +				}
> +				pfn += nr_pages - 1;
> +				mem_map += (nr_pages - 1) * SIZE(page);
> +			}
>    		}
>    		/*
>    		 * Exclude the hwpoison page.

I'm concerned about the case where filtering is not applied to the part of the
mem_map entries that does not belong to the current cyclic range.

If the maximum value of compound_order is larger than the maximum buddy order,
which makedumpfile obtains via ARRAY_LENGTH(zone.free_area) (i.e. from
CONFIG_FORCE_MAX_ZONEORDER), it's necessary to align info->bufsize_cyclic with
the larger of the two in check_cyclic_buffer_overrun().
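
A minimal sketch of that alignment, for illustration only: `free_area_len`
and `max_compound_order` are hypothetical stand-ins (mirroring
ARRAY_LENGTH(zone.free_area) and the largest hugepage order), not
makedumpfile's actual fields or API.

```c
#include <assert.h>

/* Round the cyclic bitmap buffer up so that it always covers a whole
 * maximum-order block of pages, whichever of the buddy maximum and the
 * hugepage maximum is larger. Purely a sketch of the idea discussed
 * above, under the stated naming assumptions. */
static unsigned long
align_bufsize_cyclic(unsigned long bufsize, long free_area_len,
                     long max_compound_order)
{
	long max_order = free_area_len - 1;	/* buddy allocator maximum */

	if (max_compound_order > max_order)
		max_order = max_compound_order;	/* take the larger one */

	/* One bit per page in the cyclic bitmap. */
	unsigned long block_bytes = (1UL << max_order) / 8;

	if (block_bytes && bufsize % block_bytes)
		bufsize += block_bytes - bufsize % block_bytes;
	return bufsize;
}
```

With this, a compound page can never span more cycles than the filtering
loop is able to look at in one pass.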

-- 
Thanks.
HATAYAMA, Daisuke


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-06 14:23     ` Vivek Goyal
@ 2013-11-07  8:57       ` Jingbai Ma
  2013-11-08  5:12         ` Atsushi Kumagai
  0 siblings, 1 reply; 25+ messages in thread
From: Jingbai Ma @ 2013-11-07  8:57 UTC (permalink / raw)
  To: Vivek Goyal, Atsushi Kumagai
  Cc: jingbai.ma, bhe, tom.vaden, kexec, ptesarik, linux-kernel,
	lisa.mitchell, d.hatayama, ebiederm, anderson

On 11/06/2013 10:23 PM, Vivek Goyal wrote:
> On Wed, Nov 06, 2013 at 02:21:39AM +0000, Atsushi Kumagai wrote:
>> (2013/11/06 5:27), Vivek Goyal wrote:
>>> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
>>>> This patch set intend to exclude unnecessary hugepages from vmcore dump file.
>>>>
>>>> This patch requires the kernel patch to export necessary data structures into
>>>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
>>>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
>>>>
>>>> This patch introduce two new dump levels 32 and 64 to exclude all unused and
>>>> active hugepages. The level to exclude all unnecessary pages will be 127 now.
>>>
>>> Interesting. Why hugepages should be treated any differentely than normal
>>> pages?
>>>
>>> If user asked to filter out free page, then it should be filtered and
>>> it should not matter whether it is a huge page or not?
>>
>> I'm making a RFC patch of hugepages filtering based on such policy.
>>
>> I attach the prototype version.
>> It's able to filter out also THPs, and suitable for cyclic processing
>> because it depends on mem_map and looking up it can be divided into
>> cycles. This is the same idea as page_is_buddy().
>>
>> So I think it's better.
>
> Agreed. Being able to treat hugepages in same manner as other pages
> sounds good.
>
> Jingbai, looks good to you?

It looks good to me.

My only concern is that this way we can only exclude all hugepages 
together; we can't exclude just the free hugepages. I'm not sure 
whether users need to dump out only the active hugepages.

Kumagai-san, please correct me, if I'm wrong.



>
> Thanks
> Vivek
>
>>
>> --
>> Thanks
>> Atsushi Kumagai
>>

-- 
Thanks,
Jingbai Ma


* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-07  8:57       ` Jingbai Ma
@ 2013-11-08  5:12         ` Atsushi Kumagai
  2013-11-08  5:21           ` HATAYAMA Daisuke
  0 siblings, 1 reply; 25+ messages in thread
From: Atsushi Kumagai @ 2013-11-08  5:12 UTC (permalink / raw)
  To: jingbai.ma
  Cc: vgoyal, bhe, tom.vaden, kexec, ptesarik, linux-kernel,
	lisa.mitchell, d.hatayama, anderson, ebiederm

Hello Jingbai,

(2013/11/07 17:58), Jingbai Ma wrote:
> On 11/06/2013 10:23 PM, Vivek Goyal wrote:
>> On Wed, Nov 06, 2013 at 02:21:39AM +0000, Atsushi Kumagai wrote:
>>> (2013/11/06 5:27), Vivek Goyal wrote:
>>>> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
>>>>> This patch set intend to exclude unnecessary hugepages from vmcore dump file.
>>>>>
>>>>> This patch requires the kernel patch to export necessary data structures into
>>>>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
>>>>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
>>>>>
>>>>> This patch introduce two new dump levels 32 and 64 to exclude all unused and
>>>>> active hugepages. The level to exclude all unnecessary pages will be 127 now.
>>>>
>>>> Interesting. Why hugepages should be treated any differentely than normal
>>>> pages?
>>>>
>>>> If user asked to filter out free page, then it should be filtered and
>>>> it should not matter whether it is a huge page or not?
>>>
>>> I'm making a RFC patch of hugepages filtering based on such policy.
>>>
>>> I attach the prototype version.
>>> It's able to filter out also THPs, and suitable for cyclic processing
>>> because it depends on mem_map and looking up it can be divided into
>>> cycles. This is the same idea as page_is_buddy().
>>>
>>> So I think it's better.
>>
>> Agreed. Being able to treat hugepages in same manner as other pages
>> sounds good.
>>
>> Jingbai, looks good to you?
>
> It looks good to me.
>
> My only concern is by this way, we only can exclude all hugepage together, but can't exclude the free hugepages only. I'm not sure if user need to dump out the activated hugepage only.
>
> Kumagai-san, please correct me, if I'm wrong.

Yes, my patch treats all allocated hugetlbfs pages as user pages and
doesn't distinguish whether the pages are actually in use.
I did it that way because I expect it's enough for almost all users.

We can introduce a new dump level once it's actually needed,
but I don't think now is the time. Introducing it without
demand would just make this tool more complex.


Thanks
Atsushi Kumagai


* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-08  5:12         ` Atsushi Kumagai
@ 2013-11-08  5:21           ` HATAYAMA Daisuke
  2013-11-08  5:27             ` Jingbai Ma
  0 siblings, 1 reply; 25+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-08  5:21 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: jingbai.ma, vgoyal, bhe, tom.vaden, kexec, ptesarik,
	linux-kernel, lisa.mitchell, anderson, ebiederm

(2013/11/08 14:12), Atsushi Kumagai wrote:
> Hello Jingbai,
> 
> (2013/11/07 17:58), Jingbai Ma wrote:
>> On 11/06/2013 10:23 PM, Vivek Goyal wrote:
>>> On Wed, Nov 06, 2013 at 02:21:39AM +0000, Atsushi Kumagai wrote:
>>>> (2013/11/06 5:27), Vivek Goyal wrote:
>>>>> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
>>>>>> This patch set intend to exclude unnecessary hugepages from vmcore dump file.
>>>>>>
>>>>>> This patch requires the kernel patch to export necessary data structures into
>>>>>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
>>>>>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
>>>>>>
>>>>>> This patch introduce two new dump levels 32 and 64 to exclude all unused and
>>>>>> active hugepages. The level to exclude all unnecessary pages will be 127 now.
>>>>>
>>>>> Interesting. Why hugepages should be treated any differentely than normal
>>>>> pages?
>>>>>
>>>>> If user asked to filter out free page, then it should be filtered and
>>>>> it should not matter whether it is a huge page or not?
>>>>
>>>> I'm making a RFC patch of hugepages filtering based on such policy.
>>>>
>>>> I attach the prototype version.
>>>> It's able to filter out also THPs, and suitable for cyclic processing
>>>> because it depends on mem_map and looking up it can be divided into
>>>> cycles. This is the same idea as page_is_buddy().
>>>>
>>>> So I think it's better.
>>>
>>> Agreed. Being able to treat hugepages in same manner as other pages
>>> sounds good.
>>>
>>> Jingbai, looks good to you?
>>
>> It looks good to me.
>>
>> My only concern is by this way, we only can exclude all hugepage together, but can't exclude the free hugepages only. I'm not sure if user need to dump out the activated hugepage only.
>>
>> Kumagai-san, please correct me, if I'm wrong.
> 
> Yes, my patch treats all allocated hugetlbfs pages as user pages,
> doesn't distinguish whether the pages are actually used or not.
> I made so because I guess it's enough for almost all users.
> 
> We can introduce new dump level after it's needed actually,
> but I don't think now is the time. To introduce it without
> demand will make this tool just more complex.
> 

Typically, users allocate only as many huge pages as they actually use,
in order not to waste system memory. So this design seems reasonable.

-- 
Thanks.
HATAYAMA, Daisuke



* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-08  5:21           ` HATAYAMA Daisuke
@ 2013-11-08  5:27             ` Jingbai Ma
  2013-11-11  9:06               ` Petr Tesarik
  0 siblings, 1 reply; 25+ messages in thread
From: Jingbai Ma @ 2013-11-08  5:27 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: Atsushi Kumagai, jingbai.ma, vgoyal, bhe, tom.vaden, kexec,
	ptesarik, linux-kernel, lisa.mitchell, anderson, ebiederm

On 11/08/2013 01:21 PM, HATAYAMA Daisuke wrote:
> (2013/11/08 14:12), Atsushi Kumagai wrote:
>> Hello Jingbai,
>>
>> (2013/11/07 17:58), Jingbai Ma wrote:
>>> On 11/06/2013 10:23 PM, Vivek Goyal wrote:
>>>> On Wed, Nov 06, 2013 at 02:21:39AM +0000, Atsushi Kumagai wrote:
>>>>> (2013/11/06 5:27), Vivek Goyal wrote:
>>>>>> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
>>>>>>> This patch set intend to exclude unnecessary hugepages from vmcore dump file.
>>>>>>>
>>>>>>> This patch requires the kernel patch to export necessary data structures into
>>>>>>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
>>>>>>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
>>>>>>>
>>>>>>> This patch introduce two new dump levels 32 and 64 to exclude all unused and
>>>>>>> active hugepages. The level to exclude all unnecessary pages will be 127 now.
>>>>>>
>>>>>> Interesting. Why hugepages should be treated any differentely than normal
>>>>>> pages?
>>>>>>
>>>>>> If user asked to filter out free page, then it should be filtered and
>>>>>> it should not matter whether it is a huge page or not?
>>>>>
>>>>> I'm making a RFC patch of hugepages filtering based on such policy.
>>>>>
>>>>> I attach the prototype version.
>>>>> It's able to filter out also THPs, and suitable for cyclic processing
>>>>> because it depends on mem_map and looking up it can be divided into
>>>>> cycles. This is the same idea as page_is_buddy().
>>>>>
>>>>> So I think it's better.
>>>>
>>>> Agreed. Being able to treat hugepages in same manner as other pages
>>>> sounds good.
>>>>
>>>> Jingbai, looks good to you?
>>>
>>> It looks good to me.
>>>
>>> My only concern is by this way, we only can exclude all hugepage together, but can't exclude the free hugepages only. I'm not sure if user need to dump out the activated hugepage only.
>>>
>>> Kumagai-san, please correct me, if I'm wrong.
>>
>> Yes, my patch treats all allocated hugetlbfs pages as user pages,
>> doesn't distinguish whether the pages are actually used or not.
>> I made so because I guess it's enough for almost all users.
>>
>> We can introduce new dump level after it's needed actually,
>> but I don't think now is the time. To introduce it without
>> demand will make this tool just more complex.
>>
> 
> Typically, users would allocate huge pages as much as actually they use only,
> in order not to waste system memory. So, this design seems reasonable.
> 

OK, it looks reasonable.
Thanks!

-- 
Thanks,
Jingbai Ma


* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-08  5:27             ` Jingbai Ma
@ 2013-11-11  9:06               ` Petr Tesarik
  0 siblings, 0 replies; 25+ messages in thread
From: Petr Tesarik @ 2013-11-11  9:06 UTC (permalink / raw)
  To: Jingbai Ma
  Cc: HATAYAMA Daisuke, Atsushi Kumagai, vgoyal, bhe, tom.vaden, kexec,
	linux-kernel, lisa.mitchell, anderson, ebiederm

On Fri, 08 Nov 2013 13:27:05 +0800
Jingbai Ma <jingbai.ma@hp.com> wrote:

> On 11/08/2013 01:21 PM, HATAYAMA Daisuke wrote:
> > (2013/11/08 14:12), Atsushi Kumagai wrote:
> >> Hello Jingbai,
> >>
> >> (2013/11/07 17:58), Jingbai Ma wrote:
> >>> On 11/06/2013 10:23 PM, Vivek Goyal wrote:
> >>>> On Wed, Nov 06, 2013 at 02:21:39AM +0000, Atsushi Kumagai wrote:
> >>>>> (2013/11/06 5:27), Vivek Goyal wrote:
> >>>>>> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
> >>>>>>> This patch set intend to exclude unnecessary hugepages from vmcore dump file.
> >>>>>>>
> >>>>>>> This patch requires the kernel patch to export necessary data structures into
> >>>>>>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
> >>>>>>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
> >>>>>>>
> >>>>>>> This patch introduce two new dump levels 32 and 64 to exclude all unused and
> >>>>>>> active hugepages. The level to exclude all unnecessary pages will be 127 now.
> >>>>>>
> >>>>>> Interesting. Why hugepages should be treated any differentely than normal
> >>>>>> pages?
> >>>>>>
> >>>>>> If user asked to filter out free page, then it should be filtered and
> >>>>>> it should not matter whether it is a huge page or not?
> >>>>>
> >>>>> I'm making a RFC patch of hugepages filtering based on such policy.
> >>>>>
> >>>>> I attach the prototype version.
> >>>>> It's able to filter out also THPs, and suitable for cyclic processing
> >>>>> because it depends on mem_map and looking up it can be divided into
> >>>>> cycles. This is the same idea as page_is_buddy().
> >>>>>
> >>>>> So I think it's better.
> >>>>
> >>>> Agreed. Being able to treat hugepages in same manner as other pages
> >>>> sounds good.
> >>>>
> >>>> Jingbai, looks good to you?
> >>>
> >>> It looks good to me.
> >>>
> >>> My only concern is by this way, we only can exclude all hugepage together, but can't exclude the free hugepages only. I'm not sure if user need to dump out the activated hugepage only.
> >>>
> >>> Kumagai-san, please correct me, if I'm wrong.
> >>
> >> Yes, my patch treats all allocated hugetlbfs pages as user pages,
> >> doesn't distinguish whether the pages are actually used or not.
> >> I made so because I guess it's enough for almost all users.
> >>
> >> We can introduce new dump level after it's needed actually,
> >> but I don't think now is the time. To introduce it without
> >> demand will make this tool just more complex.
> >>
> > 
> > Typically, users would allocate huge pages as much as actually they use only,
> > in order not to waste system memory. So, this design seems reasonable.
> > 
> 
> OK, It looks reasonable.

Agreed. Whether a page is a huge page or not is an implementation
detail (and with THP even more so). Makedumpfile users should only be
concerned about the _meaning_ of what gets filtered, not about
implementation details.

If we expose too much of the implementation, it may become hard to
maintain backward compatibility one day...

Thank you very much for all the work!

Petr T


* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-07  0:54     ` HATAYAMA Daisuke
@ 2013-11-22  7:16       ` HATAYAMA Daisuke
  2013-11-28  7:08         ` Atsushi Kumagai
  0 siblings, 1 reply; 25+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-22  7:16 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: bhe, tom.vaden, kexec, jingbai.ma, ptesarik, linux-kernel,
	lisa.mitchell, anderson, ebiederm, vgoyal

(2013/11/07 9:54), HATAYAMA Daisuke wrote:
> (2013/11/06 11:21), Atsushi Kumagai wrote:
>> (2013/11/06 5:27), Vivek Goyal wrote:
>>> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
>>>> This patch set intend to exclude unnecessary hugepages from vmcore dump file.
>>>>
>>>> This patch requires the kernel patch to export necessary data structures into
>>>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
>>>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
>>>>
>>>> This patch introduce two new dump levels 32 and 64 to exclude all unused and
>>>> active hugepages. The level to exclude all unnecessary pages will be 127 now.
>>>
>>> Interesting. Why hugepages should be treated any differentely than normal
>>> pages?
>>>
>>> If user asked to filter out free page, then it should be filtered and
>>> it should not matter whether it is a huge page or not?
>>
>> I'm making a RFC patch of hugepages filtering based on such policy.
>>
>> I attach the prototype version.
>> It's able to filter out also THPs, and suitable for cyclic processing
>> because it depends on mem_map and looking up it can be divided into
>> cycles. This is the same idea as page_is_buddy().
>>
>> So I think it's better.
>>
>
>> @@ -4506,14 +4583,49 @@ __exclude_unnecessary_pages(unsigned long mem_map,
>>                && !isAnon(mapping)) {
>>                if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
>>                    pfn_cache_private++;
>> +            /*
>> +             * NOTE: If THP for cache is introduced, the check for
>> +             *       compound pages is needed here.
>> +             */
>>            }
>>            /*
>>             * Exclude the data page of the user process.
>>             */
>> -        else if ((info->dump_level & DL_EXCLUDE_USER_DATA)
>> -            && isAnon(mapping)) {
>> -            if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
>> -                pfn_user++;
>> +        else if (info->dump_level & DL_EXCLUDE_USER_DATA) {
>> +            /*
>> +             * Exclude the anonnymous pages as user pages.
>> +             */
>> +            if (isAnon(mapping)) {
>> +                if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
>> +                    pfn_user++;
>> +
>> +                /*
>> +                 * Check the compound page
>> +                 */
>> +                if (page_is_hugepage(flags) && compound_order > 0) {
>> +                    int i, nr_pages = 1 << compound_order;
>> +
>> +                    for (i = 1; i < nr_pages; ++i) {
>> +                        if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i))
>> +                            pfn_user++;
>> +                    }
>> +                    pfn += nr_pages - 2;
>> +                    mem_map += (nr_pages - 1) * SIZE(page);
>> +                }
>> +            }
>> +            /*
>> +             * Exclude the hugetlbfs pages as user pages.
>> +             */
>> +            else if (hugetlb_dtor == SYMBOL(free_huge_page)) {
>> +                int i, nr_pages = 1 << compound_order;
>> +
>> +                for (i = 0; i < nr_pages; ++i) {
>> +                    if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i))
>> +                        pfn_user++;
>> +                }
>> +                pfn += nr_pages - 1;
>> +                mem_map += (nr_pages - 1) * SIZE(page);
>> +            }
>>            }
>>            /*
>>             * Exclude the hwpoison page.
>
> I'm concerned about the case that filtering is not performed to part of mem_map
> entries not belonging to the current cyclic range.
>
> If maximum value of compound_order is larger than maximum value of
> CONFIG_FORCE_MAX_ZONEORDER, which makedumpfile obtains by ARRAY_LENGTH(zone.free_area),
> it's necessary to align info->bufsize_cyclic with larger one in
> check_cyclic_buffer_overrun().
>

ping, in case you overlooked this...

-- 
Thanks.
HATAYAMA, Daisuke



* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-22  7:16       ` HATAYAMA Daisuke
@ 2013-11-28  7:08         ` Atsushi Kumagai
  2013-11-28  7:48           ` HATAYAMA Daisuke
  0 siblings, 1 reply; 25+ messages in thread
From: Atsushi Kumagai @ 2013-11-28  7:08 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: bhe, tom.vaden, kexec, ptesarik, linux-kernel, lisa.mitchell,
	vgoyal, anderson, ebiederm, jingbai.ma

On 2013/11/22 16:18:20, kexec <kexec-bounces@lists.infradead.org> wrote:
> (2013/11/07 9:54), HATAYAMA Daisuke wrote:
> > (2013/11/06 11:21), Atsushi Kumagai wrote:
> >> (2013/11/06 5:27), Vivek Goyal wrote:
> >>> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
> >>>> This patch set intend to exclude unnecessary hugepages from vmcore dump file.
> >>>>
> >>>> This patch requires the kernel patch to export necessary data structures into
> >>>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
> >>>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
> >>>>
> >>>> This patch introduce two new dump levels 32 and 64 to exclude all unused and
> >>>> active hugepages. The level to exclude all unnecessary pages will be 127 now.
> >>>
> >>> Interesting. Why hugepages should be treated any differentely than normal
> >>> pages?
> >>>
> >>> If user asked to filter out free page, then it should be filtered and
> >>> it should not matter whether it is a huge page or not?
> >>
> >> I'm making a RFC patch of hugepages filtering based on such policy.
> >>
> >> I attach the prototype version.
> >> It's able to filter out also THPs, and suitable for cyclic processing
> >> because it depends on mem_map and looking up it can be divided into
> >> cycles. This is the same idea as page_is_buddy().
> >>
> >> So I think it's better.
> >>
> >
> >> @@ -4506,14 +4583,49 @@ __exclude_unnecessary_pages(unsigned long mem_map,
> >>                && !isAnon(mapping)) {
> >>                if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
> >>                    pfn_cache_private++;
> >> +            /*
> >> +             * NOTE: If THP for cache is introduced, the check for
> >> +             *       compound pages is needed here.
> >> +             */
> >>            }
> >>            /*
> >>             * Exclude the data page of the user process.
> >>             */
> >> -        else if ((info->dump_level & DL_EXCLUDE_USER_DATA)
> >> -            && isAnon(mapping)) {
> >> -            if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
> >> -                pfn_user++;
> >> +        else if (info->dump_level & DL_EXCLUDE_USER_DATA) {
> >> +            /*
> >> +             * Exclude the anonnymous pages as user pages.
> >> +             */
> >> +            if (isAnon(mapping)) {
> >> +                if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
> >> +                    pfn_user++;
> >> +
> >> +                /*
> >> +                 * Check the compound page
> >> +                 */
> >> +                if (page_is_hugepage(flags) && compound_order > 0) {
> >> +                    int i, nr_pages = 1 << compound_order;
> >> +
> >> +                    for (i = 1; i < nr_pages; ++i) {
> >> +                        if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i))
> >> +                            pfn_user++;
> >> +                    }
> >> +                    pfn += nr_pages - 2;
> >> +                    mem_map += (nr_pages - 1) * SIZE(page);
> >> +                }
> >> +            }
> >> +            /*
> >> +             * Exclude the hugetlbfs pages as user pages.
> >> +             */
> >> +            else if (hugetlb_dtor == SYMBOL(free_huge_page)) {
> >> +                int i, nr_pages = 1 << compound_order;
> >> +
> >> +                for (i = 0; i < nr_pages; ++i) {
> >> +                    if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i))
> >> +                        pfn_user++;
> >> +                }
> >> +                pfn += nr_pages - 1;
> >> +                mem_map += (nr_pages - 1) * SIZE(page);
> >> +            }
> >>            }
> >>            /*
> >>             * Exclude the hwpoison page.
> >
> > I'm concerned about the case that filtering is not performed to part of mem_map
> > entries not belonging to the current cyclic range.
> >
> > If maximum value of compound_order is larger than maximum value of
> > CONFIG_FORCE_MAX_ZONEORDER, which makedumpfile obtains by ARRAY_LENGTH(zone.free_area),
> > it's necessary to align info->bufsize_cyclic with larger one in
> > check_cyclic_buffer_overrun().
> >
> 
> ping, in case you overlooked this...

Sorry for the delayed response, I'm prioritizing the release of v1.5.5 now.

Thanks for your advice, check_cyclic_buffer_overrun() should be fixed
as you said. In addition, I'm considering another way to address such cases,
which is to carry the number of "overflowed pages" over to the next cycle and
exclude them at the top of __exclude_unnecessary_pages() like below:

               /*
                * The pages which should be excluded still remain.
                */
               if (remainder >= 1) {
                       int i;
                        unsigned long tmp = 0;
                       for (i = 0; i < remainder; ++i) {
                               if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i)) {
                                       pfn_user++;
                                       tmp++;
                               }
                       }
                       pfn += tmp;
                       remainder -= tmp;
                       mem_map += (tmp - 1) * SIZE(page);
                       continue;
               }

If this way works well, then aligning info->bufsize_cyclic will be
unnecessary.
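
As a self-contained toy model of that carry-over, the sketch below uses a
plain array and a clear_bit() helper as stand-ins for makedumpfile's 2nd
bitmap and clear_bit_on_2nd_bitmap_for_kernel(); all names here are
illustrative, not the real implementation.

```c
#include <assert.h>

#define NPAGES 16

/* Toy page bitmap: 1 = page still included in the dump. */
static unsigned char bitmap[NPAGES] = {
	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
};

/* Clear one page bit; returns 1 if it was set, mirroring the real helper. */
static int clear_bit(unsigned long pfn)
{
	if (pfn >= NPAGES || !bitmap[pfn])
		return 0;
	bitmap[pfn] = 0;
	return 1;
}

/* Filter one cycle [start, end). *remainder carries the tail pages of a
 * compound page whose head was seen in the previous cycle; they are
 * excluded first, before normal per-page filtering would resume.
 * Returns the number of carried-over pages actually excluded. */
static unsigned long
exclude_cycle(unsigned long start, unsigned long end, unsigned long *remainder)
{
	unsigned long span = end - start;
	unsigned long n = (*remainder < span) ? *remainder : span;
	unsigned long excluded = 0;

	for (unsigned long i = 0; i < n; i++)
		excluded += clear_bit(start + i);
	*remainder -= n;
	/* ... normal filtering of [start + n, end) would continue here;
	 * when a compound head is found whose tail extends past `end`,
	 * the overflow count is stored into *remainder for the next
	 * cycle ... */
	return excluded;
}
```

For example, a 6-page tail that overflows a 4-page cycle is excluded over
two cycles, with *remainder going 6 -> 2 -> 0.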


Thanks
Atsushi Kumagai

> -- 
> Thanks.
> HATAYAMA, Daisuke
> 
> 
> 


* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-28  7:08         ` Atsushi Kumagai
@ 2013-11-28  7:48           ` HATAYAMA Daisuke
  0 siblings, 0 replies; 25+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-28  7:48 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: bhe, tom.vaden, kexec, ptesarik, linux-kernel, lisa.mitchell,
	vgoyal, anderson, ebiederm, jingbai.ma

(2013/11/28 16:08), Atsushi Kumagai wrote:
> On 2013/11/22 16:18:20, kexec <kexec-bounces@lists.infradead.org> wrote:
>> (2013/11/07 9:54), HATAYAMA Daisuke wrote:
>>> (2013/11/06 11:21), Atsushi Kumagai wrote:
>>>> (2013/11/06 5:27), Vivek Goyal wrote:
>>>>> On Tue, Nov 05, 2013 at 09:45:32PM +0800, Jingbai Ma wrote:
>>>>>> This patch set intend to exclude unnecessary hugepages from vmcore dump file.
>>>>>>
>>>>>> This patch requires the kernel patch to export necessary data structures into
>>>>>> vmcore: "kexec: export hugepage data structure into vmcoreinfo"
>>>>>> http://lists.infradead.org/pipermail/kexec/2013-November/009997.html
>>>>>>
>>>>>> This patch introduce two new dump levels 32 and 64 to exclude all unused and
>>>>>> active hugepages. The level to exclude all unnecessary pages will be 127 now.
>>>>>
>>>>> Interesting. Why hugepages should be treated any differentely than normal
>>>>> pages?
>>>>>
>>>>> If user asked to filter out free page, then it should be filtered and
>>>>> it should not matter whether it is a huge page or not?
>>>>
>>>> I'm making a RFC patch of hugepages filtering based on such policy.
>>>>
>>>> I attach the prototype version.
>>>> It's able to filter out also THPs, and suitable for cyclic processing
>>>> because it depends on mem_map and looking up it can be divided into
>>>> cycles. This is the same idea as page_is_buddy().
>>>>
>>>> So I think it's better.
>>>>
>>>
>>>> @@ -4506,14 +4583,49 @@ __exclude_unnecessary_pages(unsigned long mem_map,
>>>>                 && !isAnon(mapping)) {
>>>>                 if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
>>>>                     pfn_cache_private++;
>>>> +            /*
>>>> +             * NOTE: If THP for cache is introduced, the check for
>>>> +             *       compound pages is needed here.
>>>> +             */
>>>>             }
>>>>             /*
>>>>              * Exclude the data page of the user process.
>>>>              */
>>>> -        else if ((info->dump_level & DL_EXCLUDE_USER_DATA)
>>>> -            && isAnon(mapping)) {
>>>> -            if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
>>>> -                pfn_user++;
>>>> +        else if (info->dump_level & DL_EXCLUDE_USER_DATA) {
>>>> +            /*
>>>> +             * Exclude the anonnymous pages as user pages.
>>>> +             */
>>>> +            if (isAnon(mapping)) {
>>>> +                if (clear_bit_on_2nd_bitmap_for_kernel(pfn))
>>>> +                    pfn_user++;
>>>> +
>>>> +                /*
>>>> +                 * Check the compound page
>>>> +                 */
>>>> +                if (page_is_hugepage(flags) && compound_order > 0) {
>>>> +                    int i, nr_pages = 1 << compound_order;
>>>> +
>>>> +                    for (i = 1; i < nr_pages; ++i) {
>>>> +                        if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i))
>>>> +                            pfn_user++;
>>>> +                    }
>>>> +                    pfn += nr_pages - 2;
>>>> +                    mem_map += (nr_pages - 1) * SIZE(page);
>>>> +                }
>>>> +            }
>>>> +            /*
>>>> +             * Exclude the hugetlbfs pages as user pages.
>>>> +             */
>>>> +            else if (hugetlb_dtor == SYMBOL(free_huge_page)) {
>>>> +                int i, nr_pages = 1 << compound_order;
>>>> +
>>>> +                for (i = 0; i < nr_pages; ++i) {
>>>> +                    if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i))
>>>> +                        pfn_user++;
>>>> +                }
>>>> +                pfn += nr_pages - 1;
>>>> +                mem_map += (nr_pages - 1) * SIZE(page);
>>>> +            }
>>>>             }
>>>>             /*
>>>>              * Exclude the hwpoison page.
>>>
>>> I'm concerned about the case that filtering is not performed to part of mem_map
>>> entries not belonging to the current cyclic range.
>>>
>>> If maximum value of compound_order is larger than maximum value of
>>> CONFIG_FORCE_MAX_ZONEORDER, which makedumpfile obtains by ARRAY_LENGTH(zone.free_area),
>>> it's necessary to align info->bufsize_cyclic with larger one in
>>> check_cyclic_buffer_overrun().
>>>
>>
>> ping, in case you overlooked this...
> 
> Sorry for the delayed response, I prioritize the release of v1.5.5 now.
> 
> Thanks for your advice, check_cyclic_buffer_overrun() should be fixed
> as you said. In addition, I'm considering other way to address such case,
> that is to bring the number of "overflowed pages" to the next cycle and
> exclude them at the top of __exclude_unnecessary_pages() like below:
> 
>                 /*
>                  * The pages which should be excluded still remain.
>                  */
>                 if (remainder >= 1) {
>                         int i;
>                         unsigned long tmp;
>                         for (i = 0; i < remainder; ++i) {
>                                 if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i)) {
>                                         pfn_user++;
>                                         tmp++;
>                                 }
>                         }
>                         pfn += tmp;
>                         remainder -= tmp;
>                         mem_map += (tmp - 1) * SIZE(page);
>                         continue;
>                 }
> 
> If this way works well, then aligning info->buf_size_cyclic will be
> unnecessary.
> 

I selected the current implementation of changing the cyclic buffer size because
I thought it was simpler than carrying the remaining filtered pages over to the
next cycle, in that there was no need to add extra code to the filtering
processing.

I guess the reason you now think the carry-over is better is that detecting the
maximum order of a huge page is hard in some way, right?

-- 
Thanks.
HATAYAMA, Daisuke



* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-12-03  9:05 ` HATAYAMA Daisuke
@ 2013-12-04  6:08   ` Atsushi Kumagai
  0 siblings, 0 replies; 25+ messages in thread
From: Atsushi Kumagai @ 2013-12-04  6:08 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: bhe, tom.vaden, kexec, ptesarik, linux-kernel, lisa.mitchell,
	vgoyal, anderson, ebiederm, jingbai.ma

On 2013/12/03 18:06:13, kexec <kexec-bounces@lists.infradead.org> wrote:
> >> This is a suggestion from a different point of view...
> >>
> >> In general, data in a crash dump can be corrupted. Thus, the order contained in a page
> >> descriptor can also be corrupted. For example, if the corrupted value were a huge
> >> number, a wide range of pages after the buddy page would be filtered falsely.
> >>
> >> So, actually we should sanity-check data in the crash dump before using it for
> >> application-level features. I've picked out the order contained in the page descriptor,
> >> so there would be other data used in makedumpfile that are not checked.
> > 
> > What you said is reasonable, but how will you do such a sanity check?
> > Certain reference values are necessary for a sanity check; how will
> > you prepare such values?
> > (Get them from the kernel source and hard-code them in makedumpfile?)
> > 
> >> Unlike diskdump, we no longer need to care about kernel/hardware-level data integrity
> >> outside of user-land, but we still care about the data's own integrity.
> >>
> >> On the other hand, if we do it, we might face some problems, for example, difficulty of
> >> maintenance or a performance bottleneck; that might be the reason why we don't see sanity
> >> checks in makedumpfile now.
> > 
> > There are many values which should be checked, e.g. page.flags, page._count,
> > page.mapping, list_head.next and so on.
> > If we introduce sanity checks for them, the issues you mentioned will appear
> > distinctly.
> > 
> > So I think makedumpfile has to trust crash dump in practice.
> > 
> 
> Yes, I don't mean such drastic checking; I understand the difficulty because I often
> handle/write this kind of code; I don't want to fight tremendously many dependencies...
> 
> So we need to concentrate on things that can affect makedumpfile's behavior significantly,
> e.g. an infinite loop caused by broken linked-list objects, a buffer overrun caused by large values
> from broken data, etc. We should be able to deal with them by carefully handling
> dump data against makedumpfile's runtime data structure, e.g., buffer size.
> 
> Is it OK to consider this the policy of makedumpfile for data corruption?

Right.
Of course, if there is a very simple and effective check for some dump data,
then we can adopt it.
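One example of such a cheap check: the order read out of a dumped page descriptor could be clamped before being used for filtering. The constant and helper below are a hypothetical sketch, not existing makedumpfile code; MAX_ORDER mirrors the kernel's buddy-allocator limit (11 on common x86_64 configs):

```c
#include <assert.h>

#define MAX_ORDER 11    /* assumed buddy-allocator limit, for illustration */

/* An order >= MAX_ORDER cannot come from a healthy buddy page, so it
 * must be corruption.  Falling back to 0 ("just this one page") keeps
 * a corrupted descriptor from falsely filtering a wide pfn range. */
static unsigned long sanitize_page_order(unsigned long order)
{
    return order < MAX_ORDER ? order : 0;
}
```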


Thanks
Atsushi Kumagai

> -- 
> Thanks.
> HATAYAMA, Daisuke
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-12-03  8:05 Atsushi Kumagai
@ 2013-12-03  9:05 ` HATAYAMA Daisuke
  2013-12-04  6:08   ` Atsushi Kumagai
  0 siblings, 1 reply; 25+ messages in thread
From: HATAYAMA Daisuke @ 2013-12-03  9:05 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: bhe, tom.vaden, kexec, jingbai.ma, ptesarik, linux-kernel,
	lisa.mitchell, anderson, ebiederm, vgoyal

(2013/12/03 17:05), Atsushi Kumagai wrote:
> On 2013/11/29 13:57:21, kexec <kexec-bounces@lists.infradead.org> wrote:
>> (2013/11/29 13:23), Atsushi Kumagai wrote:
>>> On 2013/11/29 12:24:45, kexec <kexec-bounces@lists.infradead.org> wrote:
>>>> (2013/11/29 12:02), Atsushi Kumagai wrote:
>>>>> On 2013/11/28 16:50:21, kexec <kexec-bounces@lists.infradead.org> wrote:
>>>>>>>> ping, in case you overlooked this...
>>>>>>>
>>>>>>> Sorry for the delayed response, I'm prioritizing the release of v1.5.5 now.
>>>>>>>
>>>>>>> Thanks for your advice, check_cyclic_buffer_overrun() should be fixed
>>>>>>> as you said. In addition, I'm considering another way to address such cases,
>>>>>>> that is, to bring the number of "overflowed pages" to the next cycle and
>>>>>>> exclude them at the top of __exclude_unnecessary_pages() like below:
>>>>>>>
>>>>>>>                    /*
>>>>>>>                     * The pages which should be excluded still remain.
>>>>>>>                     */
>>>>>>>                    if (remainder >= 1) {
>>>>>>>                            int i;
>>>>>>>                            unsigned long tmp = 0;
>>>>>>>                            for (i = 0; i < remainder; ++i) {
>>>>>>>                                    if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i)) {
>>>>>>>                                            pfn_user++;
>>>>>>>                                            tmp++;
>>>>>>>                                    }
>>>>>>>                            }
>>>>>>>                            pfn += tmp;
>>>>>>>                            remainder -= tmp;
>>>>>>>                            mem_map += (tmp - 1) * SIZE(page);
>>>>>>>                            continue;
>>>>>>>                    }
>>>>>>>
>>>>>>> If this way works well, then aligning info->buf_size_cyclic will be
>>>>>>> unnecessary.
>>>>>>>
>>>>>>
>>>>>> I selected the current implementation of changing the cyclic buffer size because
>>>>>> I thought it was simpler than carrying remaining filtered pages over to the next cycle,
>>>>>> in that there was no need to add extra code to the filtering processing.
>>>>>>
>>>>>> I guess the reason you now think this is better is that detecting the maximum
>>>>>> order of a huge page is hard in some way, right?
>>>>>
>>>>> The maximum order can be obtained from HUGETLB_PAGE_ORDER or HPAGE_PMD_ORDER,
>>>>> so I wouldn't say it's hard. However, the carry-over method doesn't depend on
>>>>> such kernel symbols, so I think it's more robust.
>>>>>
>>>>
>>>> Then, it's better to remove check_cyclic_buffer_overrun() and rewrite part of free page
>>>> filtering in __exclude_unnecessary_pages(). Could you do that too?
>>>
>>> Sure, I'll modify it too.
>>>
>>
>> This is a suggestion from a different point of view...
>>
>> In general, data in a crash dump can be corrupted. Thus, the order contained in a page
>> descriptor can also be corrupted. For example, if the corrupted value were a huge
>> number, a wide range of pages after the buddy page would be filtered falsely.
>>
>> So, actually we should sanity-check data in the crash dump before using it for
>> application-level features. I've picked out the order contained in the page descriptor,
>> so there would be other data used in makedumpfile that are not checked.
> 
> What you said is reasonable, but how will you do such a sanity check?
> Certain reference values are necessary for a sanity check; how will
> you prepare such values?
> (Get them from the kernel source and hard-code them in makedumpfile?)
> 
>> Unlike diskdump, we no longer need to care about kernel/hardware-level data integrity
>> outside of user-land, but we still care about the data's own integrity.
>>
>> On the other hand, if we do it, we might face some problems, for example, difficulty of
>> maintenance or a performance bottleneck; that might be the reason why we don't see sanity
>> checks in makedumpfile now.
> 
> There are many values which should be checked, e.g. page.flags, page._count,
> page.mapping, list_head.next and so on.
> If we introduce sanity checks for them, the issues you mentioned will appear
> distinctly.
> 
> So I think makedumpfile has to trust crash dump in practice.
> 

Yes, I don't mean such drastic checking; I understand the difficulty because I often
handle/write this kind of code; I don't want to fight tremendously many dependencies...

So we need to concentrate on things that can affect makedumpfile's behavior significantly,
e.g. an infinite loop caused by broken linked-list objects, a buffer overrun caused by large values
from broken data, etc. We should be able to deal with them by carefully handling
dump data against makedumpfile's runtime data structure, e.g., buffer size.

Is it OK to consider this the policy of makedumpfile for data corruption?
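Concretely, the infinite-loop case could be handled by putting a hard cap on every traversal of dump-resident lists; this helper is only an illustrative sketch of that policy, not actual makedumpfile code:

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; };

/* Walk a list read out of a dump without trusting it to terminate:
 * corrupted 'next' pointers can form a cycle.  Returns the number of
 * nodes visited, or -1 once 'limit' is exceeded (treat that as
 * corruption and abort the traversal instead of looping forever). */
static long walk_list_bounded(const struct node *head, long limit)
{
    long n = 0;
    const struct node *p;

    for (p = head; p != NULL; p = p->next)
        if (++n > limit)
            return -1;
    return n;
}
```

The cap ('limit') would come from makedumpfile's own runtime knowledge, e.g. an upper bound implied by memory size, which is exactly the "check dump data against runtime data structures" policy above.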

-- 
Thanks.
HATAYAMA, Daisuke


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
@ 2013-12-03  8:05 Atsushi Kumagai
  2013-12-03  9:05 ` HATAYAMA Daisuke
  0 siblings, 1 reply; 25+ messages in thread
From: Atsushi Kumagai @ 2013-12-03  8:05 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: bhe, tom.vaden, kexec, jingbai.ma, ptesarik, linux-kernel,
	lisa.mitchell, anderson, ebiederm, vgoyal

On 2013/11/29 13:57:21, kexec <kexec-bounces@lists.infradead.org> wrote:
> (2013/11/29 13:23), Atsushi Kumagai wrote:
> > On 2013/11/29 12:24:45, kexec <kexec-bounces@lists.infradead.org> wrote:
> >> (2013/11/29 12:02), Atsushi Kumagai wrote:
> >>> On 2013/11/28 16:50:21, kexec <kexec-bounces@lists.infradead.org> wrote:
> >>>>>> ping, in case you overlooked this...
> >>>>>
> >>>>> Sorry for the delayed response, I'm prioritizing the release of v1.5.5 now.
> >>>>>
> >>>>> Thanks for your advice, check_cyclic_buffer_overrun() should be fixed
> >>>>> as you said. In addition, I'm considering another way to address such cases,
> >>>>> that is, to bring the number of "overflowed pages" to the next cycle and
> >>>>> exclude them at the top of __exclude_unnecessary_pages() like below:
> >>>>>
> >>>>>                   /*
> >>>>>                    * The pages which should be excluded still remain.
> >>>>>                    */
> >>>>>                   if (remainder >= 1) {
> >>>>>                           int i;
> >>>>>                           unsigned long tmp = 0;
> >>>>>                           for (i = 0; i < remainder; ++i) {
> >>>>>                                   if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i)) {
> >>>>>                                           pfn_user++;
> >>>>>                                           tmp++;
> >>>>>                                   }
> >>>>>                           }
> >>>>>                           pfn += tmp;
> >>>>>                           remainder -= tmp;
> >>>>>                           mem_map += (tmp - 1) * SIZE(page);
> >>>>>                           continue;
> >>>>>                   }
> >>>>>
> >>>>> If this way works well, then aligning info->buf_size_cyclic will be
> >>>>> unnecessary.
> >>>>>
> >>>>
> >>>> I selected the current implementation of changing the cyclic buffer size because
> >>>> I thought it was simpler than carrying remaining filtered pages over to the next cycle,
> >>>> in that there was no need to add extra code to the filtering processing.
> >>>>
> >>>> I guess the reason you now think this is better is that detecting the maximum
> >>>> order of a huge page is hard in some way, right?
> >>>
> >>> The maximum order can be obtained from HUGETLB_PAGE_ORDER or HPAGE_PMD_ORDER,
> >>> so I wouldn't say it's hard. However, the carry-over method doesn't depend on
> >>> such kernel symbols, so I think it's more robust.
> >>>
> >>
> >> Then, it's better to remove check_cyclic_buffer_overrun() and rewrite part of free page
> >> filtering in __exclude_unnecessary_pages(). Could you do that too?
> >
> > Sure, I'll modify it too.
> >
>
> This is a suggestion from a different point of view...
>
> In general, data in a crash dump can be corrupted. Thus, the order contained in a page
> descriptor can also be corrupted. For example, if the corrupted value were a huge
> number, a wide range of pages after the buddy page would be filtered falsely.
>
> So, actually we should sanity-check data in the crash dump before using it for
> application-level features. I've picked out the order contained in the page descriptor,
> so there would be other data used in makedumpfile that are not checked.

What you said is reasonable, but how will you do such a sanity check?
Certain reference values are necessary for a sanity check; how will
you prepare such values?
(Get them from the kernel source and hard-code them in makedumpfile?)

> Unlike diskdump, we no longer need to care about kernel/hardware-level data integrity
> outside of user-land, but we still care about the data's own integrity.
>
> On the other hand, if we do it, we might face some problems, for example, difficulty of
> maintenance or a performance bottleneck; that might be the reason why we don't see sanity
> checks in makedumpfile now.

There are many values which should be checked, e.g. page.flags, page._count,
page.mapping, list_head.next and so on.
If we introduce sanity checks for them, the issues you mentioned will appear
distinctly.

So I think makedumpfile has to trust crash dump in practice.


Thanks
Atsushi Kumagai

> --
> Thanks.
> HATAYAMA, Daisuke
>
>
>


* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-29  4:23   ` Atsushi Kumagai
@ 2013-11-29  4:56     ` HATAYAMA Daisuke
  0 siblings, 0 replies; 25+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-29  4:56 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: bhe, tom.vaden, kexec, ptesarik, linux-kernel, lisa.mitchell,
	vgoyal, anderson, ebiederm, jingbai.ma

(2013/11/29 13:23), Atsushi Kumagai wrote:
> On 2013/11/29 12:24:45, kexec <kexec-bounces@lists.infradead.org> wrote:
>> (2013/11/29 12:02), Atsushi Kumagai wrote:
>>> On 2013/11/28 16:50:21, kexec <kexec-bounces@lists.infradead.org> wrote:
>>>>>> ping, in case you overlooked this...
>>>>>
>>>>> Sorry for the delayed response, I'm prioritizing the release of v1.5.5 now.
>>>>>
>>>>> Thanks for your advice, check_cyclic_buffer_overrun() should be fixed
>>>>> as you said. In addition, I'm considering another way to address such cases,
>>>>> that is, to bring the number of "overflowed pages" to the next cycle and
>>>>> exclude them at the top of __exclude_unnecessary_pages() like below:
>>>>>
>>>>>                   /*
>>>>>                    * The pages which should be excluded still remain.
>>>>>                    */
>>>>>                   if (remainder >= 1) {
>>>>>                           int i;
>>>>>                           unsigned long tmp = 0;
>>>>>                           for (i = 0; i < remainder; ++i) {
>>>>>                                   if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i)) {
>>>>>                                           pfn_user++;
>>>>>                                           tmp++;
>>>>>                                   }
>>>>>                           }
>>>>>                           pfn += tmp;
>>>>>                           remainder -= tmp;
>>>>>                           mem_map += (tmp - 1) * SIZE(page);
>>>>>                           continue;
>>>>>                   }
>>>>>
>>>>> If this way works well, then aligning info->buf_size_cyclic will be
>>>>> unnecessary.
>>>>>
>>>>
>>>> I selected the current implementation of changing the cyclic buffer size because
>>>> I thought it was simpler than carrying remaining filtered pages over to the next cycle,
>>>> in that there was no need to add extra code to the filtering processing.
>>>>
>>>> I guess the reason you now think this is better is that detecting the maximum
>>>> order of a huge page is hard in some way, right?
>>>
>>> The maximum order can be obtained from HUGETLB_PAGE_ORDER or HPAGE_PMD_ORDER,
>>> so I wouldn't say it's hard. However, the carry-over method doesn't depend on
>>> such kernel symbols, so I think it's more robust.
>>>
>>
>> Then, it's better to remove check_cyclic_buffer_overrun() and rewrite part of free page
>> filtering in __exclude_unnecessary_pages(). Could you do that too?
> 
> Sure, I'll modify it too.
> 

This is a suggestion from a different point of view...

In general, data in a crash dump can be corrupted. Thus, the order contained in a page
descriptor can also be corrupted. For example, if the corrupted value were a huge
number, a wide range of pages after the buddy page would be filtered falsely.

So, actually we should sanity-check data in the crash dump before using it for
application-level features. I've picked out the order contained in the page descriptor,
so there would be other data used in makedumpfile that are not checked.

Unlike diskdump, we no longer need to care about kernel/hardware-level data integrity
outside of user-land, but we still care about the data's own integrity.

On the other hand, if we do it, we might face some problems, for example, difficulty of
maintenance or a performance bottleneck; that might be the reason why we don't see sanity
checks in makedumpfile now.

-- 
Thanks.
HATAYAMA, Daisuke



* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-29  3:21 ` HATAYAMA Daisuke
@ 2013-11-29  4:23   ` Atsushi Kumagai
  2013-11-29  4:56     ` HATAYAMA Daisuke
  0 siblings, 1 reply; 25+ messages in thread
From: Atsushi Kumagai @ 2013-11-29  4:23 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: bhe, tom.vaden, kexec, ptesarik, linux-kernel, lisa.mitchell,
	vgoyal, anderson, ebiederm, jingbai.ma

On 2013/11/29 12:24:45, kexec <kexec-bounces@lists.infradead.org> wrote:
> (2013/11/29 12:02), Atsushi Kumagai wrote:
> > On 2013/11/28 16:50:21, kexec <kexec-bounces@lists.infradead.org> wrote:
> >>>> ping, in case you overlooked this...
> >>>
> >>> Sorry for the delayed response, I'm prioritizing the release of v1.5.5 now.
> >>>
> >>> Thanks for your advice, check_cyclic_buffer_overrun() should be fixed
> >>> as you said. In addition, I'm considering another way to address such cases,
> >>> that is, to bring the number of "overflowed pages" to the next cycle and
> >>> exclude them at the top of __exclude_unnecessary_pages() like below:
> >>>
> >>>                  /*
> >>>                   * The pages which should be excluded still remain.
> >>>                   */
> >>>                  if (remainder >= 1) {
> >>>                          int i;
> >>>                          unsigned long tmp = 0;
> >>>                          for (i = 0; i < remainder; ++i) {
> >>>                                  if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i)) {
> >>>                                          pfn_user++;
> >>>                                          tmp++;
> >>>                                  }
> >>>                          }
> >>>                          pfn += tmp;
> >>>                          remainder -= tmp;
> >>>                          mem_map += (tmp - 1) * SIZE(page);
> >>>                          continue;
> >>>                  }
> >>>
> >>> If this way works well, then aligning info->buf_size_cyclic will be
> >>> unnecessary.
> >>>
> >>
> >> I selected the current implementation of changing the cyclic buffer size because
> >> I thought it was simpler than carrying remaining filtered pages over to the next cycle,
> >> in that there was no need to add extra code to the filtering processing.
> >>
> >> I guess the reason you now think this is better is that detecting the maximum
> >> order of a huge page is hard in some way, right?
> > 
> > The maximum order can be obtained from HUGETLB_PAGE_ORDER or HPAGE_PMD_ORDER,
> > so I wouldn't say it's hard. However, the carry-over method doesn't depend on
> > such kernel symbols, so I think it's more robust.
> > 
> 
> Then, it's better to remove check_cyclic_buffer_overrun() and rewrite part of free page
> filtering in __exclude_unnecessary_pages(). Could you do that too?

Sure, I'll modify it too.


Thanks
Atsushi Kumagai


* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
  2013-11-29  3:02 Atsushi Kumagai
@ 2013-11-29  3:21 ` HATAYAMA Daisuke
  2013-11-29  4:23   ` Atsushi Kumagai
  0 siblings, 1 reply; 25+ messages in thread
From: HATAYAMA Daisuke @ 2013-11-29  3:21 UTC (permalink / raw)
  To: Atsushi Kumagai
  Cc: bhe, tom.vaden, kexec, jingbai.ma, ptesarik, linux-kernel,
	lisa.mitchell, anderson, ebiederm, vgoyal

(2013/11/29 12:02), Atsushi Kumagai wrote:
> On 2013/11/28 16:50:21, kexec <kexec-bounces@lists.infradead.org> wrote:
>>>> ping, in case you overlooked this...
>>>
>>> Sorry for the delayed response, I'm prioritizing the release of v1.5.5 now.
>>>
>>> Thanks for your advice, check_cyclic_buffer_overrun() should be fixed
>>> as you said. In addition, I'm considering another way to address such cases,
>>> that is, to bring the number of "overflowed pages" to the next cycle and
>>> exclude them at the top of __exclude_unnecessary_pages() like below:
>>>
>>>                  /*
>>>                   * The pages which should be excluded still remain.
>>>                   */
>>>                  if (remainder >= 1) {
>>>                          int i;
>>>                          unsigned long tmp = 0;
>>>                          for (i = 0; i < remainder; ++i) {
>>>                                  if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i)) {
>>>                                          pfn_user++;
>>>                                          tmp++;
>>>                                  }
>>>                          }
>>>                          pfn += tmp;
>>>                          remainder -= tmp;
>>>                          mem_map += (tmp - 1) * SIZE(page);
>>>                          continue;
>>>                  }
>>>
>>> If this way works well, then aligning info->buf_size_cyclic will be
>>> unnecessary.
>>>
>>
>> I selected the current implementation of changing the cyclic buffer size because
>> I thought it was simpler than carrying remaining filtered pages over to the next cycle,
>> in that there was no need to add extra code to the filtering processing.
>>
>> I guess the reason you now think this is better is that detecting the maximum
>> order of a huge page is hard in some way, right?
> 
> The maximum order can be obtained from HUGETLB_PAGE_ORDER or HPAGE_PMD_ORDER,
> so I wouldn't say it's hard. However, the carry-over method doesn't depend on
> such kernel symbols, so I think it's more robust.
> 

Then, it's better to remove check_cyclic_buffer_overrun() and rewrite part of free page
filtering in __exclude_unnecessary_pages(). Could you do that too?

-- 
Thanks.
HATAYAMA, Daisuke



* Re: [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump
@ 2013-11-29  3:02 Atsushi Kumagai
  2013-11-29  3:21 ` HATAYAMA Daisuke
  0 siblings, 1 reply; 25+ messages in thread
From: Atsushi Kumagai @ 2013-11-29  3:02 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: bhe, tom.vaden, kexec, jingbai.ma, ptesarik, linux-kernel,
	lisa.mitchell, anderson, ebiederm, vgoyal

On 2013/11/28 16:50:21, kexec <kexec-bounces@lists.infradead.org> wrote:
> >> ping, in case you overlooked this...
> >
> > Sorry for the delayed response, I'm prioritizing the release of v1.5.5 now.
> >
> > Thanks for your advice, check_cyclic_buffer_overrun() should be fixed
> > as you said. In addition, I'm considering another way to address such cases,
> > that is, to bring the number of "overflowed pages" to the next cycle and
> > exclude them at the top of __exclude_unnecessary_pages() like below:
> >
> >                 /*
> >                  * The pages which should be excluded still remain.
> >                  */
> >                 if (remainder >= 1) {
> >                         int i;
> >                         unsigned long tmp = 0;
> >                         for (i = 0; i < remainder; ++i) {
> >                                 if (clear_bit_on_2nd_bitmap_for_kernel(pfn + i)) {
> >                                         pfn_user++;
> >                                         tmp++;
> >                                 }
> >                         }
> >                         pfn += tmp;
> >                         remainder -= tmp;
> >                         mem_map += (tmp - 1) * SIZE(page);
> >                         continue;
> >                 }
> >
> > If this way works well, then aligning info->buf_size_cyclic will be
> > unnecessary.
> >
>
> I selected the current implementation of changing the cyclic buffer size because
> I thought it was simpler than carrying remaining filtered pages over to the next cycle,
> in that there was no need to add extra code to the filtering processing.
>
> I guess the reason you now think this is better is that detecting the maximum
> order of a huge page is hard in some way, right?

The maximum order can be obtained from HUGETLB_PAGE_ORDER or HPAGE_PMD_ORDER,
so I wouldn't say it's hard. However, the carry-over method doesn't depend on
such kernel symbols, so I think it's more robust.
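For comparison, the buffer-size alignment being dropped here would have looked roughly like the helper below: round the cyclic bitmap buffer down to a multiple of the bytes one maximal huge page occupies in the bitmap. This is a hypothetical sketch (not the real check_cyclic_buffer_overrun() logic), assuming one bitmap byte covers 8 pfns:

```c
#include <assert.h>

/* Round buf_size (bytes of cyclic bitmap) down to a multiple of the
 * bitmap footprint of one 2^max_order-page huge page, so that a cycle
 * boundary never splits a huge page across two buffers. */
static unsigned long align_buf_size(unsigned long buf_size,
                                    unsigned long max_order)
{
    unsigned long step = (1UL << max_order) / 8;  /* bytes per huge page */

    if (step == 0 || buf_size < step)
        return buf_size;                /* too small to align sensibly */
    return buf_size - buf_size % step;  /* round down */
}
```

The carry-over method avoids exactly this dependency on max_order (i.e. on HUGETLB_PAGE_ORDER/HPAGE_PMD_ORDER being available and trustworthy in the dump).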


Thanks
Atsushi Kumagai

> --
> Thanks.
> HATAYAMA, Daisuke
>
>
>


end of thread, other threads:[~2013-12-04  6:12 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-05 13:45 [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump Jingbai Ma
2013-11-05 13:45 ` [PATCH 1/3] makedumpfile: hugepage filtering: add hugepage filtering functions Jingbai Ma
2013-11-05 13:45 ` [PATCH 2/3] makedumpfile: hugepage filtering: add excluding hugepage messages Jingbai Ma
2013-11-05 13:46 ` [PATCH 3/3] makedumpfile: hugepage filtering: add new dump levels for manual page Jingbai Ma
2013-11-05 20:26 ` [PATCH 0/3] makedumpfile: hugepage filtering for vmcore dump Vivek Goyal
2013-11-06  1:47   ` Jingbai Ma
2013-11-06  1:53     ` Vivek Goyal
2013-11-06  2:21   ` Atsushi Kumagai
2013-11-06 14:23     ` Vivek Goyal
2013-11-07  8:57       ` Jingbai Ma
2013-11-08  5:12         ` Atsushi Kumagai
2013-11-08  5:21           ` HATAYAMA Daisuke
2013-11-08  5:27             ` Jingbai Ma
2013-11-11  9:06               ` Petr Tesarik
2013-11-07  0:54     ` HATAYAMA Daisuke
2013-11-22  7:16       ` HATAYAMA Daisuke
2013-11-28  7:08         ` Atsushi Kumagai
2013-11-28  7:48           ` HATAYAMA Daisuke
2013-11-29  3:02 Atsushi Kumagai
2013-11-29  3:21 ` HATAYAMA Daisuke
2013-11-29  4:23   ` Atsushi Kumagai
2013-11-29  4:56     ` HATAYAMA Daisuke
2013-12-03  8:05 Atsushi Kumagai
2013-12-03  9:05 ` HATAYAMA Daisuke
2013-12-04  6:08   ` Atsushi Kumagai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).