* [PATCH v2] vmcore: copy fractional pages into buffers in the kdump 2nd kernel
@ 2013-12-09  8:06 ` HATAYAMA Daisuke
  0 siblings, 0 replies; 12+ messages in thread
From: HATAYAMA Daisuke @ 2013-12-09  8:06 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: kexec, Atsushi Kumagai, Eric W. Biederman, Linux Kernel Mailing List

This patch fixes the mmap failure caused by the fractional page issue.

This patch might still be a bit too large as a single patch and might need to be split.
If you think it should be refactored, please suggest how.

Change Log:

v1 => v2)

- Copy fractional pages from the 1st kernel into the 2nd kernel so
  that the fractional pages are read only once, for reliability.

- Deal with the case where multiple System RAM areas are contained in
  a single fractional page.

Test:

Tested on X86_64. Fractional pages were created using the memmap=
kernel parameter on the kdump 1st kernel.
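
For example (illustrative values only, not the exact parameters used
in this test), a reservation that starts at a non-page-aligned
address, such as

  memmap=12K$0x13fffe00

makes the page at 0x13fff000 contain both System RAM and a Reserved
region, which reproduces the fractional page condition.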

From fd6b0aca54caf7f0b5fd3841ef9e5ff081121ab8 Mon Sep 17 00:00:00 2001
From: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Date: Mon, 9 Dec 2013 09:12:32 +0900
Subject: [PATCH] vmcore: copy fractional pages into buffers in the kdump 2nd kernel

As Vivek reported in https://lkml.org/lkml/2013/11/13/439, there are
real-world platforms that place a System RAM area and a Reserved area
in the same page. As a result, mmap fails at the sanity check that
compares memory cache types within a given range, causing user-land
tools to exit abnormally in the middle of crash dumping.

Although in the current case the data in the Reserved area is ACPI
data, in general arbitrary data could be located in a single page
together with a System RAM area. If that data is, for example, MMIO,
reading or writing the area could affect the corresponding devices and
thus the whole system. We should avoid such operations as much as
possible in order to keep reliability.

To address this issue, we copy fractional pages into buffers in the
kdump 2nd kernel, and then read the data of the fractional pages from
those buffers in the 2nd kernel, not from the fractional pages in the
kdump 1st kernel. Similarly, we mmap the buffers in the 2nd kernel,
not the pages in the 1st kernel. This is done just as we already do
for ELF note segments.
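
As a concrete illustration (hypothetical addresses, 4 KiB pages), a
PT_LOAD entry covering [0x1000400, 0x2000c00) is handled in three
parts:

  [0x1000000, 0x1001000)  head fractional page: copied once into a
                          2nd kernel buffer (only the System RAM part
                          [0x1000400, 0x1001000) is read)
  [0x1001000, 0x2000000)  page-aligned middle: remapped directly from
                          the 1st kernel, as before
  [0x2000000, 0x2001000)  tail fractional page: copied once into a
                          2nd kernel buffer (only [0x2000000,
                          0x2000c00) is read)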

Strictly speaking, we should avoid even mapping pages that contain a
non-System RAM area, since the mapping could trigger some platform
specific optimization that in turn leads to some kind of prefetch of
the page. However, as long as we want to read the System RAM part of
such a page, we cannot avoid mapping it. Therefore, the most reliable
approach available is to suppress the number of reads of each
fractional page to just one, by buffering the System RAM part of the
fractional page in the 2nd kernel.

To implement this, extend the vmcore structure so that an object can
represent a buffer in the 2nd kernel, i.e., introduce the
VMCORE_2ND_KERNEL flag; if a vmcore object has VMCORE_2ND_KERNEL set,
its data lives in the 2nd kernel buffer pointed to by its ->buf member.

The only non-trivial case is where multiple System RAM areas are
contained in a single page. I would like to think such a system is
unlikely, but the issue addressed here is already odd enough that we
should assume it can happen.
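
For illustration (hypothetical layout), a single 4 KiB page at 0x1000
could look like:

  0x1000 - 0x13ff : System RAM   (tail of one PT_LOAD entry)
  0x1400 - 0x17ff : Reserved
  0x1800 - 0x1fff : System RAM   (head of the next PT_LOAD entry)

Both fractional chunks round down to the same page frame, so the two
PT_LOAD entries must share one copy buffer; the reuse logic below
handles this by searching for an existing VMCORE_2ND_KERNEL object
with the same ->paddr before allocating a new one.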

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
---
 fs/proc/vmcore.c      | 271 +++++++++++++++++++++++++++++++++++++++++---------
 include/linux/kcore.h |   4 +
 2 files changed, 229 insertions(+), 46 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 9100d69..ca79120 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -231,11 +231,20 @@ static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
 
 	list_for_each_entry(m, &vmcore_list, list) {
 		if (*fpos < m->offset + m->size) {
-			tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
-			start = m->paddr + *fpos - m->offset;
-			tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
-			if (tmp < 0)
-				return tmp;
+			tsz = min_t(size_t, m->offset+m->size-*fpos, buflen);
+			if ((m->flags & VMCORE_2ND_KERNEL)) {
+				void *kaddr;
+
+				kaddr = m->buf + *fpos - m->offset;
+				if (copy_to(buffer, kaddr, tsz, userbuf))
+					return -EFAULT;
+			} else {
+				start = m->paddr + *fpos - m->offset;
+				tmp = read_from_oldmem(buffer, tsz, &start,
+						       userbuf);
+				if (tmp < 0)
+					return tmp;
+			}
 			buflen -= tsz;
 			*fpos += tsz;
 			buffer += tsz;
@@ -300,10 +309,10 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
 };
 
 /**
- * alloc_elfnotes_buf - allocate buffer for ELF note segment in
- *                      vmalloc memory
+ * alloc_copy_buf - allocate buffer to copy ELF note segment or
+ *                      fractional pages in vmalloc memory
  *
- * @notes_sz: size of buffer
+ * @sz: size of buffer
  *
  * If CONFIG_MMU is defined, use vmalloc_user() to allow users to mmap
  * the buffer to user-space by means of remap_vmalloc_range().
@@ -311,12 +320,12 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
  * If CONFIG_MMU is not defined, use vzalloc() since mmap_vmcore() is
  * disabled and there's no need to allow users to mmap the buffer.
  */
-static inline char *alloc_elfnotes_buf(size_t notes_sz)
+static inline char *alloc_copy_buf(size_t sz)
 {
 #ifdef CONFIG_MMU
-	return vmalloc_user(notes_sz);
+	return vmalloc_user(sz);
 #else
-	return vzalloc(notes_sz);
+	return vzalloc(sz);
 #endif
 }
 
@@ -383,14 +392,24 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 
 	list_for_each_entry(m, &vmcore_list, list) {
 		if (start < m->offset + m->size) {
-			u64 paddr = 0;
-
 			tsz = min_t(size_t, m->offset + m->size - start, size);
-			paddr = m->paddr + start - m->offset;
-			if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
-						   paddr >> PAGE_SHIFT, tsz,
-						   vma->vm_page_prot))
-				goto fail;
+			if ((m->flags & VMCORE_2ND_KERNEL)) {
+				unsigned long uaddr = vma->vm_start + len;
+				void *kaddr = m->buf + start - m->offset;
+
+				if (remap_vmalloc_range_partial(vma, uaddr,
+								kaddr, tsz))
+					goto fail;
+			} else {
+				u64 paddr = m->paddr + start - m->offset;
+
+				if (remap_oldmem_pfn_range(vma,
+							   vma->vm_start + len,
+							   paddr >> PAGE_SHIFT,
+							   tsz,
+							   vma->vm_page_prot))
+					goto fail;
+			}
 			size -= tsz;
 			start += tsz;
 			len += tsz;
@@ -580,7 +599,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = alloc_copy_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -760,7 +779,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = alloc_copy_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -807,7 +826,7 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
 	Elf64_Ehdr *ehdr_ptr;
 	Elf64_Phdr *phdr_ptr;
 	loff_t vmcore_off;
-	struct vmcore *new;
+	struct vmcore *m, *new;
 
 	ehdr_ptr = (Elf64_Ehdr *)elfptr;
 	phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */
@@ -816,27 +835,106 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
 	vmcore_off = elfsz + elfnotes_sz;
 
 	for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
-		u64 paddr, start, end, size;
+		u64 start, end, size, rest;
+		u64 start_up, start_down, end_up, end_down;
+		loff_t offset;
+		int rc, reuse = 0;
 
 		if (phdr_ptr->p_type != PT_LOAD)
 			continue;
 
-		paddr = phdr_ptr->p_offset;
-		start = rounddown(paddr, PAGE_SIZE);
-		end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
-		size = end - start;
+		start = phdr_ptr->p_offset;
+		start_up = roundup(start, PAGE_SIZE);
+		start_down = rounddown(start, PAGE_SIZE);
+
+		end = phdr_ptr->p_offset + phdr_ptr->p_memsz;
+		end_up = roundup(end, PAGE_SIZE);
+		end_down = rounddown(end, PAGE_SIZE);
+
+		size = end_up - start_down;
+		rest = phdr_ptr->p_memsz;
+
+		/* Add a head fractional page to vmcore list. */
+		if (!PAGE_ALIGNED(start)) {
+			/* Reuse the same buffer if multiple System
+			 * RAM entries show up in the same page. */
+			list_for_each_entry(m, vc_list, list) {
+				if (m->paddr == start_down &&
+				    m->flags == VMCORE_2ND_KERNEL) {
+					new = m;
+					reuse = 1;
+					goto skip;
+				}
+			}
+
+			new = get_new_element();
+			if (!new)
+				return -ENOMEM;
+			new->buf = alloc_copy_buf(PAGE_SIZE);
+			if (!new->buf) {
+				kfree(new);
+				return -ENOMEM;
+			}
+			new->flags = VMCORE_2ND_KERNEL;
+			new->size = PAGE_SIZE;
+			new->paddr = start_down;
+			list_add_tail(&new->list, vc_list);
+		skip:
+
+			offset = start;
+			rc = __read_vmcore(new->buf + (start - start_down),
+					   min(start_up, end) - start,
+					   &offset, 0);
+			if (rc < 0)
+				return rc;
+
+			rest -= min(start_up, end) - start;
+		}
 
 		/* Add this contiguous chunk of memory to vmcore list.*/
-		new = get_new_element();
-		if (!new)
-			return -ENOMEM;
-		new->paddr = start;
-		new->size = size;
-		list_add_tail(&new->list, vc_list);
+		if (rest > 0 && start_up < end_down) {
+			new = get_new_element();
+			if (!new)
+				return -ENOMEM;
+			new->size = end_down - start_up;
+			new->paddr = start_up;
+			list_add_tail(&new->list, vc_list);
+			rest -= end_down - start_up;
+		}
+
+		/* Add a tail fractional page to vmcore list. */
+		if (rest > 0) {
+			new = get_new_element();
+			if (!new)
+				return -ENOMEM;
+			new->buf = alloc_copy_buf(PAGE_SIZE);
+			if (!new->buf) {
+				kfree(new);
+				return -ENOMEM;
+			}
+			new->flags = VMCORE_2ND_KERNEL;
+			new->size = PAGE_SIZE;
+			new->paddr = end_down;
+			list_add_tail(&new->list, vc_list);
+
+			offset = end_down;
+			rc = __read_vmcore(new->buf, end - end_down, &offset,
+					   0);
+			if (rc < 0)
+				return rc;
+
+			rest -= end - end_down;
+		}
+
+		WARN_ON(rest > 0);
 
 		/* Update the program header offset. */
-		phdr_ptr->p_offset = vmcore_off + (paddr - start);
+		phdr_ptr->p_offset = vmcore_off + (start - start_down);
 		vmcore_off = vmcore_off + size;
+		if (reuse) {
+			phdr_ptr->p_offset -= PAGE_SIZE;
+			vmcore_off -= PAGE_SIZE;
+		}
 	}
 	return 0;
 }
@@ -850,7 +948,7 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
 	Elf32_Ehdr *ehdr_ptr;
 	Elf32_Phdr *phdr_ptr;
 	loff_t vmcore_off;
-	struct vmcore *new;
+	struct vmcore *m, *new;
 
 	ehdr_ptr = (Elf32_Ehdr *)elfptr;
 	phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr)); /* PT_NOTE hdr */
@@ -859,27 +957,106 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
 	vmcore_off = elfsz + elfnotes_sz;
 
 	for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
-		u64 paddr, start, end, size;
+		u64 start, end, size, rest;
+		u64 start_up, start_down, end_up, end_down;
+		loff_t offset;
+		int rc, reuse = 0;
 
 		if (phdr_ptr->p_type != PT_LOAD)
 			continue;
 
-		paddr = phdr_ptr->p_offset;
-		start = rounddown(paddr, PAGE_SIZE);
-		end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
-		size = end - start;
+		start = phdr_ptr->p_offset;
+		start_up = roundup(start, PAGE_SIZE);
+		start_down = rounddown(start, PAGE_SIZE);
+
+		end = phdr_ptr->p_offset + phdr_ptr->p_memsz;
+		end_up = roundup(end, PAGE_SIZE);
+		end_down = rounddown(end, PAGE_SIZE);
+
+		size = end_up - start_down;
+		rest = phdr_ptr->p_memsz;
+
+		/* Add a head fractional page to vmcore list. */
+		if (!PAGE_ALIGNED(start)) {
+			/* Reuse the same buffer if multiple System
+			 * RAM entries show up in the same page. */
+			list_for_each_entry(m, vc_list, list) {
+				if (m->paddr == start_down &&
+				    m->flags == VMCORE_2ND_KERNEL) {
+					new = m;
+					reuse = 1;
+					goto skip;
+				}
+			}
+
+			new = get_new_element();
+			if (!new)
+				return -ENOMEM;
+			new->buf = alloc_copy_buf(PAGE_SIZE);
+			if (!new->buf) {
+				kfree(new);
+				return -ENOMEM;
+			}
+			new->flags = VMCORE_2ND_KERNEL;
+			new->paddr = start_down;
+			new->size = PAGE_SIZE;
+			list_add_tail(&new->list, vc_list);
+		skip:
+
+			offset = start;
+			rc = __read_vmcore(new->buf + (start - start_down),
+					   min(start_up, end) - start,
+					   &offset, 0);
+			if (rc < 0)
+				return rc;
+
+			rest -= min(start_up, end) - start;
+		}
 
 		/* Add this contiguous chunk of memory to vmcore list.*/
-		new = get_new_element();
-		if (!new)
-			return -ENOMEM;
-		new->paddr = start;
-		new->size = size;
-		list_add_tail(&new->list, vc_list);
+		if (rest > 0 && start_up < end_down) {
+			new = get_new_element();
+			if (!new)
+				return -ENOMEM;
+			new->size = end_down - start_up;
+			new->paddr = start_up;
+			list_add_tail(&new->list, vc_list);
+			rest -= end_down - start_up;
+		}
+
+		/* Add a tail fractional page to vmcore list. */
+		if (rest > 0) {
+			new = get_new_element();
+			if (!new)
+				return -ENOMEM;
+			new->buf = alloc_copy_buf(PAGE_SIZE);
+			if (!new->buf) {
+				kfree(new);
+				return -ENOMEM;
+			}
+			new->flags = VMCORE_2ND_KERNEL;
+			new->size = PAGE_SIZE;
+			new->paddr = end_down;
+			list_add_tail(&new->list, vc_list);
+
+			offset = end_down;
+			rc = __read_vmcore(new->buf, end - end_down, &offset,
+					   0);
+			if (rc < 0)
+				return rc;
+
+			rest -= end - end_down;
+		}
+
+		WARN_ON(rest > 0);
 
 		/* Update the program header offset */
-		phdr_ptr->p_offset = vmcore_off + (paddr - start);
+		phdr_ptr->p_offset = vmcore_off + (start - start_down);
 		vmcore_off = vmcore_off + size;
+		if (reuse) {
+			phdr_ptr->p_offset -= PAGE_SIZE;
+			vmcore_off -= PAGE_SIZE;
+		}
 	}
 	return 0;
 }
@@ -1100,6 +1277,8 @@ void vmcore_cleanup(void)
 
 		m = list_entry(pos, struct vmcore, list);
 		list_del(&m->list);
+		if ((m->flags & VMCORE_2ND_KERNEL))
+			vfree(m->buf);
 		kfree(m);
 	}
 	free_elfcorebuf();
diff --git a/include/linux/kcore.h b/include/linux/kcore.h
index d927622..3a86423 100644
--- a/include/linux/kcore.h
+++ b/include/linux/kcore.h
@@ -19,11 +19,15 @@ struct kcore_list {
 	int type;
 };
 
+#define VMCORE_2ND_KERNEL 0x1
+
 struct vmcore {
 	struct list_head list;
 	unsigned long long paddr;
 	unsigned long long size;
 	loff_t offset;
+	char *buf;
+	unsigned long flags;
 };
 
 #ifdef CONFIG_PROC_KCORE
-- 
1.8.3.1

-- 
Thanks.
HATAYAMA, Daisuke


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] vmcore: copy fractional pages into buffers in the kdump 2nd kernel
  2013-12-09  8:06 ` HATAYAMA Daisuke
@ 2013-12-11 11:33   ` WANG Chao
  -1 siblings, 0 replies; 12+ messages in thread
From: WANG Chao @ 2013-12-11 11:33 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: Vivek Goyal, Atsushi Kumagai, kexec, Eric W. Biederman,
	Linux Kernel Mailing List

On 12/09/13 at 05:06pm, HATAYAMA Daisuke wrote:
> This patch fixes the mmap failure caused by the fractional page issue.
>
> This patch might still be a bit too large as a single patch and might need to be split.
> If you think it should be refactored, please suggest how.
> 
> Change Log:
> 
> v1 => v2)
> 
> - Copy fractional pages from the 1st kernel into the 2nd kernel so
>   that the fractional pages are read only once, for reliability.
> 
> - Deal with the case where multiple System RAM areas are contained in
>   a single fractional page.
> 
> Test:
> 
> Tested on X86_64. Fractional pages were created using the memmap=
> kernel parameter on the kdump 1st kernel.
> 
> From fd6b0aca54caf7f0b5fd3841ef9e5ff081121ab8 Mon Sep 17 00:00:00 2001
> From: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
> Date: Mon, 9 Dec 2013 09:12:32 +0900
> Subject: [PATCH] vmcore: copy fractional pages into buffers in the kdump 2nd kernel
> 
> As Vivek reported in https://lkml.org/lkml/2013/11/13/439, there are
> real-world platforms that place a System RAM area and a Reserved area
> in the same page. As a result, mmap fails at the sanity check that
> compares memory cache types within a given range, causing user-land
> tools to exit abnormally in the middle of crash dumping.
> 
> Although in the current case the data in the Reserved area is ACPI
> data, in general arbitrary data could be located in a single page
> together with a System RAM area. If that data is, for example, MMIO,
> reading or writing the area could affect the corresponding devices and
> thus the whole system. We should avoid such operations as much as
> possible in order to keep reliability.
> 
> To address this issue, we copy fractional pages into buffers in the
> kdump 2nd kernel, and then read the data of the fractional pages from
> those buffers in the 2nd kernel, not from the fractional pages in the
> kdump 1st kernel. Similarly, we mmap the buffers in the 2nd kernel,
> not the pages in the 1st kernel. This is done just as we already do
> for ELF note segments.
> 
> Strictly speaking, we should avoid even mapping pages that contain a
> non-System RAM area, since the mapping could trigger some platform
> specific optimization that in turn leads to some kind of prefetch of
> the page. However, as long as we want to read the System RAM part of
> such a page, we cannot avoid mapping it. Therefore, the most reliable
> approach available is to suppress the number of reads of each
> fractional page to just one, by buffering the System RAM part of the
> fractional page in the 2nd kernel.
> 
> To implement this, extend the vmcore structure so that an object can
> represent a buffer in the 2nd kernel, i.e., introduce the
> VMCORE_2ND_KERNEL flag; if a vmcore object has VMCORE_2ND_KERNEL set,
> its data lives in the 2nd kernel buffer pointed to by its ->buf member.
> 
> The only non-trivial case is where multiple System RAM areas are
> contained in a single page. I would like to think such a system is
> unlikely, but the issue addressed here is already odd enough that we
> should assume it can happen.
> 
> Reported-by: Vivek Goyal <vgoyal@redhat.com>
> Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>

Hi, HATAYAMA Daisuke

Thanks for the fix.

My workstation has been experiencing the same issue. It has a fractional
page, one part of which contains System RAM and the other ACPI data:

# cat /proc/iomem
[..]
00100000-bfdffbff : System RAM
  01000000-0167b6a5 : Kernel code
  0167b6a6-01d06cbf : Kernel data
  01e6d000-01feafff : Kernel bss
  bb000000-bf7fffff : Crash kernel
bfdffc00-bfe53bff : ACPI Non-volatile Storage

I applied your patch on top of 3.13-rc3, and makedumpfile can
successfully extract a dump with mmap().

Thanks,
WANG Chao


> ---
>  fs/proc/vmcore.c      | 271 +++++++++++++++++++++++++++++++++++++++++---------
>  include/linux/kcore.h |   4 +
>  2 files changed, 229 insertions(+), 46 deletions(-)
> 
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 9100d69..ca79120 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -231,11 +231,20 @@ static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
>  
>  	list_for_each_entry(m, &vmcore_list, list) {
>  		if (*fpos < m->offset + m->size) {
> -			tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
> -			start = m->paddr + *fpos - m->offset;
> -			tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
> -			if (tmp < 0)
> -				return tmp;
> +			tsz = min_t(size_t, m->offset+m->size-*fpos, buflen);
> +			if ((m->flags & VMCORE_2ND_KERNEL)) {
> +				void *kaddr;
> +
> +				kaddr = m->buf + *fpos - m->offset;
> +				if (copy_to(buffer, kaddr, tsz, userbuf))
> +					return -EFAULT;
> +			} else {
> +				start = m->paddr + *fpos - m->offset;
> +				tmp = read_from_oldmem(buffer, tsz, &start,
> +						       userbuf);
> +				if (tmp < 0)
> +					return tmp;
> +			}
>  			buflen -= tsz;
>  			*fpos += tsz;
>  			buffer += tsz;
> @@ -300,10 +309,10 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
>  };
>  
>  /**
> - * alloc_elfnotes_buf - allocate buffer for ELF note segment in
> - *                      vmalloc memory
> + * alloc_copy_buf - allocate buffer to copy ELF note segment or
> + *                      fractional pages in vmalloc memory
>   *
> - * @notes_sz: size of buffer
> + * @sz: size of buffer
>   *
>   * If CONFIG_MMU is defined, use vmalloc_user() to allow users to mmap
>   * the buffer to user-space by means of remap_vmalloc_range().
> @@ -311,12 +320,12 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
>   * If CONFIG_MMU is not defined, use vzalloc() since mmap_vmcore() is
>   * disabled and there's no need to allow users to mmap the buffer.
>   */
> -static inline char *alloc_elfnotes_buf(size_t notes_sz)
> +static inline char *alloc_copy_buf(size_t sz)
>  {
>  #ifdef CONFIG_MMU
> -	return vmalloc_user(notes_sz);
> +	return vmalloc_user(sz);
>  #else
> -	return vzalloc(notes_sz);
> +	return vzalloc(sz);
>  #endif
>  }
>  
> @@ -383,14 +392,24 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>  
>  	list_for_each_entry(m, &vmcore_list, list) {
>  		if (start < m->offset + m->size) {
> -			u64 paddr = 0;
> -
>  			tsz = min_t(size_t, m->offset + m->size - start, size);
> -			paddr = m->paddr + start - m->offset;
> -			if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
> -						   paddr >> PAGE_SHIFT, tsz,
> -						   vma->vm_page_prot))
> -				goto fail;
> +			if ((m->flags & VMCORE_2ND_KERNEL)) {
> +				unsigned long uaddr = vma->vm_start + len;
> +				void *kaddr = m->buf + start - m->offset;
> +
> +				if (remap_vmalloc_range_partial(vma, uaddr,
> +								kaddr, tsz))
> +					goto fail;
> +			} else {
> +				u64 paddr = m->paddr + start - m->offset;
> +
> +				if (remap_oldmem_pfn_range(vma,
> +							   vma->vm_start + len,
> +							   paddr >> PAGE_SHIFT,
> +							   tsz,
> +							   vma->vm_page_prot))
> +					goto fail;
> +			}
>  			size -= tsz;
>  			start += tsz;
>  			len += tsz;
> @@ -580,7 +599,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
>  		return rc;
>  
>  	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
> -	*notes_buf = alloc_elfnotes_buf(*notes_sz);
> +	*notes_buf = alloc_copy_buf(*notes_sz);
>  	if (!*notes_buf)
>  		return -ENOMEM;
>  
> @@ -760,7 +779,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
>  		return rc;
>  
>  	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
> -	*notes_buf = alloc_elfnotes_buf(*notes_sz);
> +	*notes_buf = alloc_copy_buf(*notes_sz);
>  	if (!*notes_buf)
>  		return -ENOMEM;
>  
> @@ -807,7 +826,7 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
>  	Elf64_Ehdr *ehdr_ptr;
>  	Elf64_Phdr *phdr_ptr;
>  	loff_t vmcore_off;
> -	struct vmcore *new;
> +	struct vmcore *m, *new;
>  
>  	ehdr_ptr = (Elf64_Ehdr *)elfptr;
>  	phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */
> @@ -816,27 +835,106 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
>  	vmcore_off = elfsz + elfnotes_sz;
>  
>  	for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> -		u64 paddr, start, end, size;
> +		u64 start, end, size, rest;
> +		u64 start_up, start_down, end_up, end_down;
> +		loff_t offset;
> +		int rc, reuse = 0;
>  
>  		if (phdr_ptr->p_type != PT_LOAD)
>  			continue;
>  
> -		paddr = phdr_ptr->p_offset;
> -		start = rounddown(paddr, PAGE_SIZE);
> -		end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
> -		size = end - start;
> +		start = phdr_ptr->p_offset;
> +		start_up = roundup(start, PAGE_SIZE);
> +		start_down = rounddown(start, PAGE_SIZE);
> +
> +		end = phdr_ptr->p_offset + phdr_ptr->p_memsz;
> +		end_up = roundup(end, PAGE_SIZE);
> +		end_down = rounddown(end, PAGE_SIZE);
> +
> +		size = end_up - start_down;
> +		rest = phdr_ptr->p_memsz;
> +
> +		/* Add a head fractional page to vmcore list. */
> +		if (!PAGE_ALIGNED(start)) {
> +			/* Reuse the same buffer if multiple System
> +			 * RAM entries show up in the same page. */
> +			list_for_each_entry(m, vc_list, list) {
> +				if (m->paddr == start_down &&
> +				    m->flags == VMCORE_2ND_KERNEL) {
> +					new = m;
> +					reuse = 1;
> +					goto skip;
> +				}
> +			}
> +
> +			new = get_new_element();
> +			if (!new)
> +				return -ENOMEM;
> +			new->buf = alloc_copy_buf(PAGE_SIZE);
> +			if (!new->buf) {
> +				kfree(new);
> +				return -ENOMEM;
> +			}
> +			new->flags = VMCORE_2ND_KERNEL;
> +			new->size = PAGE_SIZE;
> +			new->paddr = start_down;
> +			list_add_tail(&new->list, vc_list);
> +		skip:
> +
> +			offset = start;
> +			rc = __read_vmcore(new->buf + (start - start_down),
> +					   min(start_up, end) - start,
> +					   &offset, 0);
> +			if (rc < 0)
> +				return rc;
> +
> +			rest -= min(start_up, end) - start;
> +		}
>  
>  		/* Add this contiguous chunk of memory to vmcore list.*/
> -		new = get_new_element();
> -		if (!new)
> -			return -ENOMEM;
> -		new->paddr = start;
> -		new->size = size;
> -		list_add_tail(&new->list, vc_list);
> +		if (rest > 0 && start_up < end_down) {
> +			new = get_new_element();
> +			if (!new)
> +				return -ENOMEM;
> +			new->size = end_down - start_up;
> +			new->paddr = start_up;
> +			list_add_tail(&new->list, vc_list);
> +			rest -= end_down - start_up;
> +		}
> +
> +		/* Add a tail fractional page to vmcore list. */
> +		if (rest > 0) {
> +			new = get_new_element();
> +			if (!new)
> +				return -ENOMEM;
> +			new->buf = alloc_copy_buf(PAGE_SIZE);
> +			if (!new->buf) {
> +				kfree(new);
> +				return -ENOMEM;
> +			}
> +			new->flags = VMCORE_2ND_KERNEL;
> +			new->size = PAGE_SIZE;
> +			new->paddr = end_down;
> +			list_add_tail(&new->list, vc_list);
> +
> +			offset = end_down;
> +			rc = __read_vmcore(new->buf, end - end_down, &offset,
> +					   0);
> +			if (rc < 0)
> +				return rc;
> +
> +			rest -= end - end_down;
> +		}
> +
> +		WARN_ON(rest > 0);
>  
>  		/* Update the program header offset. */
> -		phdr_ptr->p_offset = vmcore_off + (paddr - start);
> +		phdr_ptr->p_offset = vmcore_off + (start - start_down);
>  		vmcore_off = vmcore_off + size;
> +		if (reuse) {
> +			phdr_ptr->p_offset -= PAGE_SIZE;
> +			vmcore_off -= PAGE_SIZE;
> +		}
>  	}
>  	return 0;
>  }
> @@ -850,7 +948,7 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
>  	Elf32_Ehdr *ehdr_ptr;
>  	Elf32_Phdr *phdr_ptr;
>  	loff_t vmcore_off;
> -	struct vmcore *new;
> +	struct vmcore *m, *new;
>  
>  	ehdr_ptr = (Elf32_Ehdr *)elfptr;
>  	phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr)); /* PT_NOTE hdr */
> @@ -859,27 +957,106 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
>  	vmcore_off = elfsz + elfnotes_sz;
>  
>  	for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> -		u64 paddr, start, end, size;
> +		u64 start, end, size, rest;
> +		u64 start_up, start_down, end_up, end_down;
> +		loff_t offset;
> +		int rc, reuse = 0;
>  
>  		if (phdr_ptr->p_type != PT_LOAD)
>  			continue;
>  
> -		paddr = phdr_ptr->p_offset;
> -		start = rounddown(paddr, PAGE_SIZE);
> -		end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
> -		size = end - start;
> +		start = phdr_ptr->p_offset;
> +		start_up = roundup(start, PAGE_SIZE);
> +		start_down = rounddown(start, PAGE_SIZE);
> +
> +		end = phdr_ptr->p_offset + phdr_ptr->p_memsz;
> +		end_up = roundup(end, PAGE_SIZE);
> +		end_down = rounddown(end, PAGE_SIZE);
> +
> +		size = end_up - start_down;
> +		rest = phdr_ptr->p_memsz;
> +
> +		/* Add a head fractional page to vmcore list. */
> +		if (!PAGE_ALIGNED(start)) {
> +			/* Reuse the same buffer if multiple System
> +			 * RAM entries show up in the same page. */
> +			list_for_each_entry(m, vc_list, list) {
> +				if (m->paddr == start_down &&
> +				    m->flags == VMCORE_2ND_KERNEL) {
> +					new = m;
> +					reuse = 1;
> +					goto skip;
> +				}
> +			}
> +
> +			new = get_new_element();
> +			if (!new)
> +				return -ENOMEM;
> +			new->buf = alloc_copy_buf(PAGE_SIZE);
> +			if (!new->buf) {
> +				kfree(new);
> +				return -ENOMEM;
> +			}
> +			new->flags = VMCORE_2ND_KERNEL;
> +			new->paddr = start_down;
> +			new->size = PAGE_SIZE;
> +			list_add_tail(&new->list, vc_list);
> +		skip:
> +
> +			offset = start;
> +			rc = __read_vmcore(new->buf + (start - start_down),
> +					   min(start_up, end) - start,
> +					   &offset, 0);
> +			if (rc < 0)
> +				return rc;
> +
> +			rest -= min(start_up, end) - start;
> +		}
>  
>  		/* Add this contiguous chunk of memory to vmcore list.*/
> -		new = get_new_element();
> -		if (!new)
> -			return -ENOMEM;
> -		new->paddr = start;
> -		new->size = size;
> -		list_add_tail(&new->list, vc_list);
> +		if (rest > 0 && start_up < end_down) {
> +			new = get_new_element();
> +			if (!new)
> +				return -ENOMEM;
> +			new->size = end_down - start_up;
> +			new->paddr = start_up;
> +			list_add_tail(&new->list, vc_list);
> +			rest -= end_down - start_up;
> +		}
> +
> +		/* Add a tail fractional page to vmcore list. */
> +		if (rest > 0) {
> +			new = get_new_element();
> +			if (!new)
> +				return -ENOMEM;
> +			new->buf = alloc_copy_buf(PAGE_SIZE);
> +			if (!new->buf) {
> +				kfree(new);
> +				return -ENOMEM;
> +			}
> +			new->flags = VMCORE_2ND_KERNEL;
> +			new->size = PAGE_SIZE;
> +			new->paddr = end_down;
> +			list_add_tail(&new->list, vc_list);
> +
> +			offset = end_down;
> +			rc = __read_vmcore(new->buf, end - end_down, &offset,
> +					   0);
> +			if (rc < 0)
> +				return rc;
> +
> +			rest -= end - end_down;
> +		}
> +
> +		WARN_ON(rest > 0);
>  
>  		/* Update the program header offset */
> -		phdr_ptr->p_offset = vmcore_off + (paddr - start);
> +		phdr_ptr->p_offset = vmcore_off + (start - start_down);
>  		vmcore_off = vmcore_off + size;
> +		if (reuse) {
> +			phdr_ptr->p_offset -= PAGE_SIZE;
> +			vmcore_off -= PAGE_SIZE;
> +		}
>  	}
>  	return 0;
>  }
> @@ -1100,6 +1277,8 @@ void vmcore_cleanup(void)
>  
>  		m = list_entry(pos, struct vmcore, list);
>  		list_del(&m->list);
> +		if ((m->flags & VMCORE_2ND_KERNEL))
> +			vfree(m->buf);
>  		kfree(m);
>  	}
>  	free_elfcorebuf();
> diff --git a/include/linux/kcore.h b/include/linux/kcore.h
> index d927622..3a86423 100644
> --- a/include/linux/kcore.h
> +++ b/include/linux/kcore.h
> @@ -19,11 +19,15 @@ struct kcore_list {
>  	int type;
>  };
>  
> +#define VMCORE_2ND_KERNEL 0x1
> +
>  struct vmcore {
>  	struct list_head list;
>  	unsigned long long paddr;
>  	unsigned long long size;
>  	loff_t offset;
> +	char *buf;
> +	unsigned long flags;
>  };
>  
>  #ifdef CONFIG_PROC_KCORE
> -- 
> 1.8.3.1
> 
> -- 
> Thanks.
> HATAYAMA, Daisuke
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 12+ messages in thread

> +			phdr_ptr->p_offset -= PAGE_SIZE;
> +			vmcore_off -= PAGE_SIZE;
> +		}
>  	}
>  	return 0;
>  }
> @@ -1100,6 +1277,8 @@ void vmcore_cleanup(void)
>  
>  		m = list_entry(pos, struct vmcore, list);
>  		list_del(&m->list);
> +		if ((m->flags & VMCORE_2ND_KERNEL))
> +			vfree(m->buf);
>  		kfree(m);
>  	}
>  	free_elfcorebuf();
> diff --git a/include/linux/kcore.h b/include/linux/kcore.h
> index d927622..3a86423 100644
> --- a/include/linux/kcore.h
> +++ b/include/linux/kcore.h
> @@ -19,11 +19,15 @@ struct kcore_list {
>  	int type;
>  };
>  
> +#define VMCORE_2ND_KERNEL 0x1
> +
>  struct vmcore {
>  	struct list_head list;
>  	unsigned long long paddr;
>  	unsigned long long size;
>  	loff_t offset;
> +	char *buf;
> +	unsigned long flags;
>  };
>  
>  #ifdef CONFIG_PROC_KCORE
> -- 
> 1.8.3.1
> 
> -- 
> Thanks.
> HATAYAMA, Daisuke
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] vmcore: copy fractional pages into buffers in the kdump 2nd kernel
  2013-12-09  8:06 ` HATAYAMA Daisuke
@ 2013-12-12 16:42   ` Vivek Goyal
  -1 siblings, 0 replies; 12+ messages in thread
From: Vivek Goyal @ 2013-12-12 16:42 UTC (permalink / raw)
  To: HATAYAMA Daisuke
  Cc: kexec, Atsushi Kumagai, Eric W. Biederman, Linux Kernel Mailing List

On Mon, Dec 09, 2013 at 05:06:06PM +0900, HATAYAMA Daisuke wrote:
> This is a patch for fixing mmap failure due to fractional page issue.
> 
> This patch might be still a bit too large as a single patch and might need to split.
> If you think patch refactoring is needed, please suggest.
> 
> Change Log:
> 
> v1 => v2)
> 
> - Copy fractional pages from 1st kernel to 2nd kernel to reduce read
>   to the fractional pages for reliability.
> 
> - Deal with the case where multiple System RAM areas are contained in
>   a single fractional page.
> 
> Test:
> 
> Tested on X86_64. Fractional pages are created using memmap= kernel
> parameter on the kdump 1st kernel.
> 
> >From fd6b0aca54caf7f0b5fd3841ef9e5ff081121ab8 Mon Sep 17 00:00:00 2001
> From: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
> Date: Mon, 9 Dec 2013 09:12:32 +0900
> Subject: [PATCH] vmcore: copy fractional pages into buffers in the kdump 2nd kernel
> 
> As Vivek reported in https://lkml.org/lkml/2013/11/13/439, in real
> world there's platform that allocates System RAM area and Reserved
> area in a single same page. As a result, mmap fails at sanity check
> that comapres memory cache types in a given range, causing user-land
> tools to exit abnormally in the middle of crash dumping.
> 
> Although in the current case the data in Reserved area is ACPI data,
> in general, arbitrary data can possibly be located in a single page
> together with System RAM area. If they are, for example, mmio, read or
> write to the area could affect the corresponding devices and so a
> whole system. We should avoid doing such operations as much as
> possible in order to keep reliability.
> 
> To address this issue, we copy fractional pages into buffers in the
> kdump 2nd kernel, and then read data on the fractional pages from the
> buffers in the kdump 2nd kernel, not from the fractional pages on the
> kdump 1st kernel. Similarly, we mmap data on the buffers on the 2nd
> kernel, not on the 1st kernel. These are done just as we've already
> done for ELF note segments.
> 
> Rigorously, we should avoid even mapping pages containing non-System
> RAM area since mapping could cause some platform specific optimization
> that could then lead to some kind of prefetch to the page. However, as
> long as trying to read the System RAM area in the page, we cannot
> avoid mapping the page. Therefore, reliable possible way is to supress
> the number of times of reading the fractional pages to just once by
> buffering System RAM part of the fractional page in the 2nd kerenel.
> 
> To implement this, extend vmcore structure to represent object in
> buffer on the 2nd kernel, i.e., introducing VMCORE_2ND_KERNEL flag;
> for a vmcore object, if it has VMCORE_2ND_KERNEL set, then its data is
> on the buffer on the 2nd kernel that is pointed to by ->buf member.
> 
> Only non-trivial case is where multiple System RAM areas are contained
> in a single page. I want to think there's unlikely to be such system,
> but the issue addressed here is already odd enough, so we should
> consider there would be likely enough to be.
> 
> Reported-by: Vivek Goyal <vgoyal@redhat.com>
> Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
> ---
>  fs/proc/vmcore.c      | 271 +++++++++++++++++++++++++++++++++++++++++---------
>  include/linux/kcore.h |   4 +
>  2 files changed, 229 insertions(+), 46 deletions(-)
> 
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 9100d69..ca79120 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -231,11 +231,20 @@ static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
>  
>  	list_for_each_entry(m, &vmcore_list, list) {
>  		if (*fpos < m->offset + m->size) {
> -			tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
> -			start = m->paddr + *fpos - m->offset;
> -			tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
> -			if (tmp < 0)
> -				return tmp;
> +			tsz = min_t(size_t, m->offset+m->size-*fpos, buflen);
							     ^^
spaces around "-".

> +			if ((m->flags & VMCORE_2ND_KERNEL)) {

I find the name "VMCORE_2ND_KERNEL" a little odd. Given that we call
the other memory OLDMEM, how about naming the new flag "VMCORE_OLDMEM"
and flagging anything that lives in old memory accordingly?

> +				void *kaddr;
> +
> +				kaddr = m->buf + *fpos - m->offset;
> +				if (copy_to(buffer, kaddr, tsz, userbuf))
> +					return -EFAULT;
> +			} else {
> +				start = m->paddr + *fpos - m->offset;
> +				tmp = read_from_oldmem(buffer, tsz, &start,
> +						       userbuf);
> +				if (tmp < 0)
> +					return tmp;
> +			}
>  			buflen -= tsz;
>  			*fpos += tsz;
>  			buffer += tsz;
> @@ -300,10 +309,10 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
>  };
>  
>  /**
> - * alloc_elfnotes_buf - allocate buffer for ELF note segment in
> - *                      vmalloc memory
> + * alloc_copy_buf - allocate buffer to copy ELF note segment or
> + *                      fractional pages in vmalloc memory

Should we just call this alloc_buf()?

Also, I don't understand why we need the ifdef CONFIG_MMU here. What
would happen if we just used vmalloc_user() with CONFIG_MMU both set
and unset?

I see that for the ELF headers we are using __get_free_pages(). I am
trying to remember why the ELF headers have to be physically contiguous
in memory. Why can't we use vmalloc_user() for the ELF headers too? At
least a comment there explaining it would be nice.

>   *
> - * @notes_sz: size of buffer
> + * @sz: size of buffer
>   *
>   * If CONFIG_MMU is defined, use vmalloc_user() to allow users to mmap
>   * the buffer to user-space by means of remap_vmalloc_range().
> @@ -311,12 +320,12 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
>   * If CONFIG_MMU is not defined, use vzalloc() since mmap_vmcore() is
>   * disabled and there's no need to allow users to mmap the buffer.
>   */
> -static inline char *alloc_elfnotes_buf(size_t notes_sz)
> +static inline char *alloc_copy_buf(size_t sz)
>  {
>  #ifdef CONFIG_MMU
> -	return vmalloc_user(notes_sz);
> +	return vmalloc_user(sz);
>  #else
> -	return vzalloc(notes_sz);
> +	return vzalloc(sz);
>  #endif
>  }
>  
> @@ -383,14 +392,24 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>  
>  	list_for_each_entry(m, &vmcore_list, list) {
>  		if (start < m->offset + m->size) {
> -			u64 paddr = 0;
> -
>  			tsz = min_t(size_t, m->offset + m->size - start, size);
> -			paddr = m->paddr + start - m->offset;
> -			if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
> -						   paddr >> PAGE_SHIFT, tsz,
> -						   vma->vm_page_prot))
> -				goto fail;
> +			if ((m->flags & VMCORE_2ND_KERNEL)) {
> +				unsigned long uaddr = vma->vm_start + len;

"uaddr" is same both for oldmem and newmem. So I think you can bring
it out of if condition and use it for both the cases.
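
For example (just sketching the suggestion, not tested), with an
unsigned long uaddr declared at the top of the loop body:

			tsz = min_t(size_t, m->offset + m->size - start, size);
			uaddr = vma->vm_start + len;
			if (m->flags & VMCORE_2ND_KERNEL) {
				void *kaddr = m->buf + start - m->offset;

				if (remap_vmalloc_range_partial(vma, uaddr,
								kaddr, tsz))
					goto fail;
			} else {
				u64 paddr = m->paddr + start - m->offset;

				if (remap_oldmem_pfn_range(vma, uaddr,
							   paddr >> PAGE_SHIFT,
							   tsz,
							   vma->vm_page_prot))
					goto fail;
			}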

> +				void *kaddr = m->buf + start - m->offset;
> +
> +				if (remap_vmalloc_range_partial(vma, uaddr,
> +								kaddr, tsz))
> +					goto fail;
> +			} else {
> +				u64 paddr = paddr = m->paddr+start-m->offset;
				     ^^^    ^^^
Why "paddr" shows up twice?

> +
> +				if (remap_oldmem_pfn_range(vma,
> +							   vma->vm_start + len,
> +							   paddr >> PAGE_SHIFT,
> +							   tsz,
> +							   vma->vm_page_prot))
> +					goto fail;
> +			}
>  			size -= tsz;
>  			start += tsz;
>  			len += tsz;
> @@ -580,7 +599,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
>  		return rc;
>  
>  	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
> -	*notes_buf = alloc_elfnotes_buf(*notes_sz);
> +	*notes_buf = alloc_copy_buf(*notes_sz);
>  	if (!*notes_buf)
>  		return -ENOMEM;
>  
> @@ -760,7 +779,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
>  		return rc;
>  
>  	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
> -	*notes_buf = alloc_elfnotes_buf(*notes_sz);
> +	*notes_buf = alloc_copy_buf(*notes_sz);
>  	if (!*notes_buf)
>  		return -ENOMEM;
>  
> @@ -807,7 +826,7 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
>  	Elf64_Ehdr *ehdr_ptr;
>  	Elf64_Phdr *phdr_ptr;
>  	loff_t vmcore_off;
> -	struct vmcore *new;
> +	struct vmcore *m, *new;
>  
>  	ehdr_ptr = (Elf64_Ehdr *)elfptr;
>  	phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */
> @@ -816,27 +835,106 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
>  	vmcore_off = elfsz + elfnotes_sz;
>  
>  	for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> -		u64 paddr, start, end, size;
> +		u64 start, end, size, rest;
> +		u64 start_up, start_down, end_up, end_down;
> +		loff_t offset;
> +		int rc, reuse = 0;
>  
>  		if (phdr_ptr->p_type != PT_LOAD)
>  			continue;
>  
> -		paddr = phdr_ptr->p_offset;
> -		start = rounddown(paddr, PAGE_SIZE);
> -		end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
> -		size = end - start;
> +		start = phdr_ptr->p_offset;
> +		start_up = roundup(start, PAGE_SIZE);
> +		start_down = rounddown(start, PAGE_SIZE);

The following code is too long and looks messy. Please seriously
simplify it and break it into smaller functions so that it becomes
easier to read.

Also, you seem to be using __read_vmcore() to read from oldmem. This
is strange. __read_vmcore() relies on all the underlying data
structures, such as vmcore_list, having been fully prepared before it
is called; here we are still in the middle of preparing the list while
calling it. It might not crash, but it does not look clean to me. Also,
conceptually we are not reading the vmcore; we are reading a part of
old memory. So why not use read_from_oldmem() directly?
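
For the head fractional page, for instance, something like this would
do (sketch only, not tested), using a local ssize_t tmp and the
physical address instead of a file offset:

			u64 addr = start;

			/* copy [start, min(start_up, end)) of old memory
			 * into the 2nd-kernel buffer */
			tmp = read_from_oldmem(new->buf + (start - start_down),
					       min(start_up, end) - start,
					       &addr, 0);
			if (tmp < 0)
				return tmp;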


> +
> +		end = phdr_ptr->p_offset + phdr_ptr->p_memsz;
> +		end_up = roundup(end, PAGE_SIZE);
> +		end_down = rounddown(end, PAGE_SIZE);
> +
> +		size = end_up - start_down;
> +		rest = phdr_ptr->p_memsz;
> +
> +		/* Add a head fractional page to vmcore list. */
> +		if (!PAGE_ALIGNED(start)) {
> +			/* Reuse the same buffer if multiple System
> +			 * RAM entries show up in the same page. */
> +			list_for_each_entry(m, vc_list, list) {
> +				if (m->paddr == start_down &&
> +				    m->flags == VMCORE_2ND_KERNEL) {
> +					new = m;
> +					reuse = 1;
> +					goto skip;
> +				}
> +			}

Do we need this optimization? Is it common to have multiple System RAM
ranges in a single page? I don't have a system like that. If it is not
common, let us remove this optimization for now; it makes the code look
uglier in its current form.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] vmcore: copy fractional pages into buffers in the kdump 2nd kernel
  2013-12-12 16:42   ` Vivek Goyal
@ 2013-12-13  1:42     ` HATAYAMA Daisuke
  -1 siblings, 0 replies; 12+ messages in thread
From: HATAYAMA Daisuke @ 2013-12-13  1:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: kexec, Atsushi Kumagai, Eric W. Biederman, Linux Kernel Mailing List

(2013/12/13 1:42), Vivek Goyal wrote:
> On Mon, Dec 09, 2013 at 05:06:06PM +0900, HATAYAMA Daisuke wrote:
>> This is a patch for fixing mmap failure due to fractional page issue.
>>
>> This patch might be still a bit too large as a single patch and might need to split.
>> If you think patch refactoring is needed, please suggest.
>>
>> Change Log:
>>
>> v1 => v2)
>>
>> - Copy fractional pages from 1st kernel to 2nd kernel to reduce read
>>    to the fractional pages for reliability.
>>
>> - Deal with the case where multiple System RAM areas are contained in
>>    a single fractional page.
>>
>> Test:
>>
>> Tested on X86_64. Fractional pages are created using memmap= kernel
>> parameter on the kdump 1st kernel.
>>
>> >From fd6b0aca54caf7f0b5fd3841ef9e5ff081121ab8 Mon Sep 17 00:00:00 2001
>> From: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
>> Date: Mon, 9 Dec 2013 09:12:32 +0900
>> Subject: [PATCH] vmcore: copy fractional pages into buffers in the kdump 2nd kernel
>>
>> As Vivek reported in https://lkml.org/lkml/2013/11/13/439, in real
>> world there's platform that allocates System RAM area and Reserved
>> area in a single same page. As a result, mmap fails at sanity check
>> that comapres memory cache types in a given range, causing user-land
>> tools to exit abnormally in the middle of crash dumping.
>>
>> Although in the current case the data in Reserved area is ACPI data,
>> in general, arbitrary data can possibly be located in a single page
>> together with System RAM area. If they are, for example, mmio, read or
>> write to the area could affect the corresponding devices and so a
>> whole system. We should avoid doing such operations as much as
>> possible in order to keep reliability.
>>
>> To address this issue, we copy fractional pages into buffers in the
>> kdump 2nd kernel, and then read data on the fractional pages from the
>> buffers in the kdump 2nd kernel, not from the fractional pages on the
>> kdump 1st kernel. Similarly, we mmap data on the buffers on the 2nd
>> kernel, not on the 1st kernel. These are done just as we've already
>> done for ELF note segments.
>>
>> Rigorously, we should avoid even mapping pages containing non-System
>> RAM area since mapping could cause some platform specific optimization
>> that could then lead to some kind of prefetch to the page. However, as
>> long as trying to read the System RAM area in the page, we cannot
>> avoid mapping the page. Therefore, reliable possible way is to supress
>> the number of times of reading the fractional pages to just once by
>> buffering System RAM part of the fractional page in the 2nd kerenel.
>>
>> To implement this, extend vmcore structure to represent object in
>> buffer on the 2nd kernel, i.e., introducing VMCORE_2ND_KERNEL flag;
>> for a vmcore object, if it has VMCORE_2ND_KERNEL set, then its data is
>> on the buffer on the 2nd kernel that is pointed to by ->buf member.
>>
>> Only non-trivial case is where multiple System RAM areas are contained
>> in a single page. I want to think there's unlikely to be such system,
>> but the issue addressed here is already odd enough, so we should
>> consider there would be likely enough to be.
>>
>> Reported-by: Vivek Goyal <vgoyal@redhat.com>
>> Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
>> ---
>>   fs/proc/vmcore.c      | 271 +++++++++++++++++++++++++++++++++++++++++---------
>>   include/linux/kcore.h |   4 +
>>   2 files changed, 229 insertions(+), 46 deletions(-)
>>
>> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
>> index 9100d69..ca79120 100644
>> --- a/fs/proc/vmcore.c
>> +++ b/fs/proc/vmcore.c
>> @@ -231,11 +231,20 @@ static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
>>
>>   	list_for_each_entry(m, &vmcore_list, list) {
>>   		if (*fpos < m->offset + m->size) {
>> -			tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
>> -			start = m->paddr + *fpos - m->offset;
>> -			tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
>> -			if (tmp < 0)
>> -				return tmp;
>> +			tsz = min_t(size_t, m->offset+m->size-*fpos, buflen);
> 							     ^^
> spaces around "-".
>
>> +			if ((m->flags & VMCORE_2ND_KERNEL)) {
>
> I am finding this name "VMCORE_2ND_KERNEL" little odd. Given the fact
> that we are calling other memory as OLDMEM, how about calling a new
> flag as "VMCORE_OLDMEM" and anything which is in old memory should be
> flagged accordingly?
>

VMCORE_OLDMEM sounds to me as if the memory flagged with it resides in
old memory, not in a buffer on the 2nd kernel. How about:

#define VMCORE_OLDMEM 0
#define VMCORE_BUFFER 1

where objects with VMCORE_OLDMEM refer to memory that resides in the
1st kernel (what we call oldmem), and objects with VMCORE_BUFFER refer
to memory that resides in the 2nd kernel after being copied there,
which is exactly what this patch does.
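
Just to illustrate the naming (a sketch only, not the final patch), the
read path would then branch on the buffer flag like this:

	if (m->flags & VMCORE_BUFFER) {
		/* data was copied into a buffer on the 2nd kernel at
		 * initialization time */
		void *kaddr = m->buf + *fpos - m->offset;

		if (copy_to(buffer, kaddr, tsz, userbuf))
			return -EFAULT;
	} else {
		/* VMCORE_OLDMEM: read directly from 1st-kernel memory */
		start = m->paddr + *fpos - m->offset;
		tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
		if (tmp < 0)
			return tmp;
	}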

>> +				void *kaddr;
>> +
>> +				kaddr = m->buf + *fpos - m->offset;
>> +				if (copy_to(buffer, kaddr, tsz, userbuf))
>> +					return -EFAULT;
>> +			} else {
>> +				start = m->paddr + *fpos - m->offset;
>> +				tmp = read_from_oldmem(buffer, tsz, &start,
>> +						       userbuf);
>> +				if (tmp < 0)
>> +					return tmp;
>> +			}
>>   			buflen -= tsz;
>>   			*fpos += tsz;
>>   			buffer += tsz;
>> @@ -300,10 +309,10 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
>>   };
>>
>>   /**
>> - * alloc_elfnotes_buf - allocate buffer for ELF note segment in
>> - *                      vmalloc memory
>> + * alloc_copy_buf - allocate buffer to copy ELF note segment or
>> + *                      fractional pages in vmalloc memory
>
> Should we just call this alloc_buf()?
>
> Also I don't understand that why should we use ifdef CONFIG_MMU. What
> will happen if we just use vmalloc_user() for both CONFIG_MMU set and
> unset?
>

I disliked setting VM_USERMAP on no-MMU architectures even though there
is no mmap_vmcore support there. But as you say, precisely because
there is no mmap_vmcore support on no-MMU, it is also safe to use it
commonly on both MMU and no-MMU architectures.

I'll change it to use vmalloc_user() on both.
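
With that, the helper would collapse to something like this (sketch;
the name is still open per your alloc_buf() suggestion):

static inline char *alloc_copy_buf(size_t sz)
{
	return vmalloc_user(sz);
}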

> I see for ELF headers we are using __get_free_pages(). I am trying to
> remember that why ELF headers have to be physically contiguous in
> memory. Why can't we use vmalloc_user() for ELF headers. Atleast a
> comment there to explain it would be nice.
>

I don't remember very well either, but I guess the reason was that the
number of memory mappings is not that large, so the program header
table doesn't grow nearly as large as the ELF note segment, and thus we
didn't bother using vmalloc() for the ELF headers.

I'll fix this in the next patch.

Maybe we could stress this assumption with 4MiB-unit software memory
hot-plug, though I have not used it, to create many holes.

>>    *
>> - * @notes_sz: size of buffer
>> + * @sz: size of buffer
>>    *
>>    * If CONFIG_MMU is defined, use vmalloc_user() to allow users to mmap
>>    * the buffer to user-space by means of remap_vmalloc_range().
>> @@ -311,12 +320,12 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
>>    * If CONFIG_MMU is not defined, use vzalloc() since mmap_vmcore() is
>>    * disabled and there's no need to allow users to mmap the buffer.
>>    */
>> -static inline char *alloc_elfnotes_buf(size_t notes_sz)
>> +static inline char *alloc_copy_buf(size_t sz)
>>   {
>>   #ifdef CONFIG_MMU
>> -	return vmalloc_user(notes_sz);
>> +	return vmalloc_user(sz);
>>   #else
>> -	return vzalloc(notes_sz);
>> +	return vzalloc(sz);
>>   #endif
>>   }
>>
>> @@ -383,14 +392,24 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>>
>>   	list_for_each_entry(m, &vmcore_list, list) {
>>   		if (start < m->offset + m->size) {
>> -			u64 paddr = 0;
>> -
>>   			tsz = min_t(size_t, m->offset + m->size - start, size);
>> -			paddr = m->paddr + start - m->offset;
>> -			if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
>> -						   paddr >> PAGE_SHIFT, tsz,
>> -						   vma->vm_page_prot))
>> -				goto fail;
>> +			if ((m->flags & VMCORE_2ND_KERNEL)) {
>> +				unsigned long uaddr = vma->vm_start + len;
>
> "uaddr" is same both for oldmem and newmem. So I think you can bring
> it out of if condition and use it for both the cases.
>
>> +				void *kaddr = m->buf + start - m->offset;
>> +
>> +				if (remap_vmalloc_range_partial(vma, uaddr,
>> +								kaddr, tsz))
>> +					goto fail;
>> +			} else {
>> +				u64 paddr = paddr = m->paddr+start-m->offset;
> 				     ^^^    ^^^
> Why "paddr" shows up twice?
>
>> +
>> +				if (remap_oldmem_pfn_range(vma,
>> +							   vma->vm_start + len,
>> +							   paddr >> PAGE_SHIFT,
>> +							   tsz,
>> +							   vma->vm_page_prot))
>> +					goto fail;
>> +			}
>>   			size -= tsz;
>>   			start += tsz;
>>   			len += tsz;
>> @@ -580,7 +599,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
>>   		return rc;
>>
>>   	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
>> -	*notes_buf = alloc_elfnotes_buf(*notes_sz);
>> +	*notes_buf = alloc_copy_buf(*notes_sz);
>>   	if (!*notes_buf)
>>   		return -ENOMEM;
>>
>> @@ -760,7 +779,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
>>   		return rc;
>>
>>   	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
>> -	*notes_buf = alloc_elfnotes_buf(*notes_sz);
>> +	*notes_buf = alloc_copy_buf(*notes_sz);
>>   	if (!*notes_buf)
>>   		return -ENOMEM;
>>
>> @@ -807,7 +826,7 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
>>   	Elf64_Ehdr *ehdr_ptr;
>>   	Elf64_Phdr *phdr_ptr;
>>   	loff_t vmcore_off;
>> -	struct vmcore *new;
>> +	struct vmcore *m, *new;
>>
>>   	ehdr_ptr = (Elf64_Ehdr *)elfptr;
>>   	phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */
>> @@ -816,27 +835,106 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
>>   	vmcore_off = elfsz + elfnotes_sz;
>>
>>   	for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
>> -		u64 paddr, start, end, size;
>> +		u64 start, end, size, rest;
>> +		u64 start_up, start_down, end_up, end_down;
>> +		loff_t offset;
>> +		int rc, reuse = 0;
>>
>>   		if (phdr_ptr->p_type != PT_LOAD)
>>   			continue;
>>
>> -		paddr = phdr_ptr->p_offset;
>> -		start = rounddown(paddr, PAGE_SIZE);
>> -		end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
>> -		size = end - start;
>> +		start = phdr_ptr->p_offset;
>> +		start_up = roundup(start, PAGE_SIZE);
>> +		start_down = rounddown(start, PAGE_SIZE);
>
> Following code is too long and looks messy. Seriously simplify it
> and break it into smaller functions so that it becomes easier to
> read.
>

I'll introduce some helper functions and refactor this part. Please
review it again after that.
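
For example, the repeated buffer setup for head and tail fractional
pages could move into a small helper along these lines (a rough sketch
with a made-up name, to be cleaned up in v3):

/* Sketch: allocate a page-sized 2nd-kernel buffer entry for the
 * fractional page starting at paddr and add it to vc_list. */
static struct vmcore * __init add_fractional_buffer(u64 paddr,
						     struct list_head *vc_list)
{
	struct vmcore *new;

	new = get_new_element();
	if (!new)
		return NULL;
	new->buf = alloc_copy_buf(PAGE_SIZE);
	if (!new->buf) {
		kfree(new);
		return NULL;
	}
	new->flags = VMCORE_2ND_KERNEL;
	new->size = PAGE_SIZE;
	new->paddr = paddr;
	list_add_tail(&new->list, vc_list);

	return new;
}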

> Also you seem to be using __read_vmcore() to read from olmem. This
> is strange. __read_vmcore() relies on that we have finished preparing
> all the underlying data structures like vmcore_list and then we call
> it. Now we are in the process of preparing the list and we are calling
> it. It might not crash, but does not look clean to me. Also conceptually
> we are not reading vmcore. We are reading a part of old memory. So why
> not use read_from_oldmem() directly.
>

Exactly. I'll fix this.

>
>> +
>> +		end = phdr_ptr->p_offset + phdr_ptr->p_memsz;
>> +		end_up = roundup(end, PAGE_SIZE);
>> +		end_down = rounddown(end, PAGE_SIZE);
>> +
>> +		size = end_up - start_down;
>> +		rest = phdr_ptr->p_memsz;
>> +
>> +		/* Add a head fractional page to vmcore list. */
>> +		if (!PAGE_ALIGNED(start)) {
>> +			/* Reuse the same buffer if multiple System
>> +			 * RAM entries show up in the same page. */
>> +			list_for_each_entry(m, vc_list, list) {
>> +				if (m->paddr == start_down &&
>> +				    m->flags == VMCORE_2ND_KERNEL) {
>> +					new = m;
>> +					reuse = 1;
>> +					goto skip;
>> +				}
>> +			}
>
> Do we need this optimzation? Is it common to have multiple system RAM
> ranges in a single page. I don't have systems like that. If it is not
> common, let us remove this optimization for now. It is making code look
> uglier in current form.
>

OK for now.

-- 
Thanks.
HATAYAMA, Daisuke


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] vmcore: copy fractional pages into buffers in the kdump 2nd kernel
  2013-12-09  8:06 ` HATAYAMA Daisuke
                   ` (2 preceding siblings ...)
  (?)
@ 2014-01-31  2:36 ` Atsushi Kumagai
  2014-01-31  9:57   ` HATAYAMA Daisuke
  -1 siblings, 1 reply; 12+ messages in thread
From: Atsushi Kumagai @ 2014-01-31  2:36 UTC (permalink / raw)
  To: HATAYAMA Daisuke; +Cc: kexec

Hello HATAYAMA-san,

On 2013/12/09 17:06:18, HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:
> This is a patch for fixing mmap failure due to fractional page issue.
> 
> This patch might be still a bit too large as a single patch and might need to split.
> If you think patch refactoring is needed, please suggest.
> 
> Change Log:
> 
> v1 => v2)
> 
> - Copy fractional pages from 1st kernel to 2nd kernel to reduce read
>   to the fractional pages for reliability.
> 
> - Deal with the case where multiple System RAM areas are contained in
>   a single fractional page.
> 
> Test:
> 
> Tested on X86_64. Fractional pages are created using memmap= kernel
> parameter on the kdump 1st kernel.

Could you tell me more details about how to reproduce this?
I tried to create such fractional pages to test the patch at
the end of this mail, using the memmap= kernel parameter as you
described, like below:

  # cat /proc/iomem
  ...
  100000000-10fff57ff : System RAM
  10fff5800-1100057ff : reserved
  110005800-11fffffff : System RAM

However, I couldn't hit the mmap() failure, and makedumpfile worked
normally even when using mmap() on linux-3.12.1. What am I missing here?


Thanks
Atsushi Kumagai


From: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Date: Thu, 23 Jan 2014 16:06:38 +0900
Subject: [PATCH] Restrict the mapping range to avoid mmap() failure.

Now we know that mmap() on /proc/vmcore fails for fractional pages
due to a sanity check in remap_pfn_range().
In that case, the current makedumpfile can fall back to read() mode,
but it's still desirable to use mmap() as much as possible for
performance.
This patch restricts the range passed to mmap() to avoid that fallback.

The original kernel issue was reported in:

  https://lkml.org/lkml/2013/11/13/439

and it will be fixed in kernel 3.14.

Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
---
 elf_info.c     | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 elf_info.h     |  4 ++++
 makedumpfile.c | 19 ++++++++++++-----
 3 files changed, 83 insertions(+), 5 deletions(-)

diff --git a/elf_info.c b/elf_info.c
index e350b99..b277f69 100644
--- a/elf_info.c
+++ b/elf_info.c
@@ -488,6 +488,71 @@ page_head_to_phys_end(unsigned long long head_paddr)
 	return 0;
 }
 
+/*
+ *  Calculate the start file offset of the PT_LOAD segment that
+ *  contains a given file offset. If this function returns 0x0, the
+ *  input page is not in the memory image.
+ */
+off_t
+offset_to_pt_load_start(off_t offset)
+{
+	int i;
+	off_t pt_load_start;
+	struct pt_load_segment *pls;
+
+	for (i = pt_load_start = 0; i < num_pt_loads; i++) {
+		pls = &pt_loads[i];
+		if ((offset >= pls->file_offset)
+		    && (offset < pls->file_offset +
+			(pls->phys_end - pls->phys_start))) {
+			pt_load_start = pls->file_offset;
+			break;
+		}
+	}
+	return pt_load_start;
+}
+
+/*
+ *  Calculate the end file offset of the PT_LOAD segment that
+ *  contains a given file offset. If this function returns 0x0, the
+ *  input page is not in the memory image.
+ */
+off_t
+offset_to_pt_load_end(off_t offset)
+{
+	int i;
+	off_t pt_load_end;
+	struct pt_load_segment *pls;
+
+	for (i = pt_load_end = 0; i < num_pt_loads; i++) {
+		pls = &pt_loads[i];
+		if ((offset >= pls->file_offset)
+		    && (offset < pls->file_offset +
+			(pls->phys_end - pls->phys_start))) {
+			pt_load_end = (off_t)(pls->file_offset +
+					      (pls->phys_end - pls->phys_start));
+			break;
+		}
+	}
+	return pt_load_end;
+}
+
+/*
+ * Judge whether the page is fractional or not.
+ */
+int
+page_is_fractional(off_t page_offset)
+{
+	if (page_offset % info->page_size != 0)
+		return TRUE;
+
+	if (offset_to_pt_load_end(page_offset) - page_offset
+	    < info->page_size)
+		return TRUE;
+
+	return FALSE;
+}
+
 unsigned long long
 vaddr_to_paddr_general(unsigned long long vaddr)
 {
diff --git a/elf_info.h b/elf_info.h
index a60fef3..801faff 100644
--- a/elf_info.h
+++ b/elf_info.h
@@ -32,10 +32,14 @@ off_t paddr_to_offset(unsigned long long paddr);
 off_t paddr_to_offset2(unsigned long long paddr, off_t hint);
 unsigned long long page_head_to_phys_start(unsigned long long head_paddr);
 unsigned long long page_head_to_phys_end(unsigned long long head_paddr);
+off_t offset_to_pt_load_start(off_t offset);
+off_t offset_to_pt_load_end(off_t offset);
 unsigned long long vaddr_to_paddr_general(unsigned long long vaddr);
 off_t vaddr_to_offset_slow(int fd, char *filename, unsigned long long vaddr);
 unsigned long long get_max_paddr(void);
 
+int page_is_fractional(off_t page_offset);
+
 int get_elf64_ehdr(int fd, char *filename, Elf64_Ehdr *ehdr);
 int get_elf32_ehdr(int fd, char *filename, Elf32_Ehdr *ehdr);
 int get_elf_info(int fd, char *filename);
diff --git a/makedumpfile.c b/makedumpfile.c
index 7536274..4dce1e2 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -237,9 +237,10 @@ read_page_desc(unsigned long long paddr, page_desc_t *pd)
 
 static int
 update_mmap_range(off_t offset, int initial) {
-	off_t start_offset;
+	off_t start_offset, end_offset;
 	off_t map_size;
 	off_t max_offset = get_max_file_offset();
+	off_t pt_load_end = offset_to_pt_load_end(offset);
 
 	munmap(info->mmap_buf,
 	       info->mmap_end_offset - info->mmap_start_offset);
@@ -247,9 +248,13 @@ update_mmap_range(off_t offset, int initial) {
 	/*
 	 * offset for mmap() must be page aligned.
 	 */
-	start_offset = round(offset, info->page_size);
+	start_offset = roundup(offset, info->page_size);
+	end_offset = MIN(max_offset, round(pt_load_end, info->page_size));
 
-	map_size = MIN(max_offset - start_offset, info->mmap_region_size);
+	if (!pt_load_end || (end_offset - start_offset) <= 0)
+		return FALSE;
+
+	map_size = MIN(end_offset - start_offset, info->mmap_region_size);
 
 	info->mmap_buf = mmap(NULL, map_size, PROT_READ, MAP_PRIVATE,
 				     info->fd_memory, start_offset);
@@ -282,9 +287,12 @@ is_mapped_with_mmap(off_t offset) {
 
 int
 initialize_mmap(void) {
+	unsigned long long phys_start;
 	info->mmap_region_size = MAP_REGION;
 	info->mmap_buf = MAP_FAILED;
-	if (!update_mmap_range(0, 1))
+
+	get_pt_load(0, &phys_start, NULL, NULL, NULL);
+	if (!update_mmap_range(phys_start, 1))
 		return FALSE;
 
 	return TRUE;
@@ -320,7 +328,8 @@ read_from_vmcore(off_t offset, void *bufptr, unsigned long size)
 {
 	const off_t failed = (off_t)-1;
 
-	if (info->flag_usemmap == MMAP_ENABLE) {
+	if (info->flag_usemmap == MMAP_ENABLE &&
+	    page_is_fractional(offset) == FALSE) {
 		if (!read_with_mmap(offset, bufptr, size)) {
 			ERRMSG("Can't read the dump memory(%s) with mmap().\n",
 			       info->name_memory);
-- 
1.8.0.2
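
As a rough sketch of how the new pieces fit together (illustrative code only,
outside the patch itself; it assumes makedumpfile's info structure, the
MIN/round/roundup macros and read_with_mmap() shown above, while the helper
name map_one_window() and the pread() fallback are made up for clarity):

/* Illustrative only: map one window that stays inside a single PT_LOAD,
 * so no fractional page is ever mmap()ed. */
static int map_one_window(off_t offset)
{
	off_t pt_load_end = offset_to_pt_load_end(offset);
	off_t start = roundup(offset, info->page_size);          /* round start up  */
	off_t end = MIN(get_max_file_offset(),
			round(pt_load_end, info->page_size));    /* round end down  */

	if (!pt_load_end || end <= start)
		return FALSE;  /* no whole page to map here; the caller reads instead */

	info->mmap_buf = mmap(NULL, MIN(end - start, info->mmap_region_size),
			      PROT_READ, MAP_PRIVATE, info->fd_memory, start);
	return info->mmap_buf != MAP_FAILED;
}

/* Illustrative read path: fractional pages always go through read(), never
 * through the mmap()ed window; pread() stands in for the real read() fallback. */
if (info->flag_usemmap == MMAP_ENABLE && !page_is_fractional(offset))
	read_with_mmap(offset, bufptr, size);
else
	pread(info->fd_memory, bufptr, size, offset);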

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] vmcore: copy fractional pages into buffers in the kdump 2nd kernel
  2014-01-31  2:36 ` Atsushi Kumagai
@ 2014-01-31  9:57   ` HATAYAMA Daisuke
  2014-02-06  9:38     ` Atsushi Kumagai
  0 siblings, 1 reply; 12+ messages in thread
From: HATAYAMA Daisuke @ 2014-01-31  9:57 UTC (permalink / raw)
  To: Atsushi Kumagai; +Cc: kexec

(2014/01/31 11:36), Atsushi Kumagai wrote:
> Hello HATAYAMA-san,
>
> On 2013/12/09 17:06:18, HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:
>> This is a patch for fixing mmap failure due to fractional page issue.
>>
>> This patch might be still a bit too large as a single patch and might need to split.
>> If you think patch refactoring is needed, please suggest.
>>
>> Change Log:
>>
>> v1 => v2)
>>
>> - Copy fractional pages from 1st kernel to 2nd kernel to reduce read
>>    to the fractional pages for reliability.
>>
>> - Deal with the case where multiple System RAM areas are contained in
>>    a single fractional page.
>>
>> Test:
>>
>> Tested on X86_64. Fractional pages are created using memmap= kernel
>> parameter on the kdump 1st kernel.
>
> Could you tell me more details about how to reproduce this ?
> I tried to create such fractional pages to test the patch at
> the end of this mail, by using memmap= kernel parameter as you said
> like below:
>
>    # cat /proc/iomem
>    ...
>    100000000-10fff57ff : System RAM
>    10fff5800-1100057ff : reserved
>    110005800-11fffffff : System RAM
>
> However, I couldn't face the mmap() failure and makedumpfile worked
> normally even using mmap() on linux-3.12.1. What am I missing here ?
>

This patch set tries to reduce the potential risk of accessing, i.e. creating
page tables for and reading, memory outside System RAM regions. The potential
risk I mean here is, for example, the side effect of touching an mmio region.

If you didn't see any failure other than the mmap() failure on fractional pages,
there is no such potential risk on your system in the sense described above.
You could probably see different behavior by choosing another System RAM
region that sits next to memory used for something special.

Also, in the early phase, our design didn't take this kind of fractional page
into account because we didn't think there were many such systems in the real
world. But the bug report came earlier than we expected. So I think we should
design this area carefully, at least as long as it can be done relatively simply.
# Sorry for delaying my work...

Of course, I think the kernel and makedumpfile should address the issue together
to reduce the potential risk as much as possible.

-- 
Thanks.
HATAYAMA, Daisuke


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] vmcore: copy fractional pages into buffers in the kdump 2nd kernel
  2014-01-31  9:57   ` HATAYAMA Daisuke
@ 2014-02-06  9:38     ` Atsushi Kumagai
  2014-02-14 12:18       ` HATAYAMA Daisuke
  0 siblings, 1 reply; 12+ messages in thread
From: Atsushi Kumagai @ 2014-02-06  9:38 UTC (permalink / raw)
  To: HATAYAMA Daisuke; +Cc: kexec

On 2014/01/31 18:58:08, HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:
> (2014/01/31 11:36), Atsushi Kumagai wrote:
> > Hello HATAYAMA-san,
> >
> > On 2013/12/09 17:06:18, HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:
> >> This is a patch for fixing mmap failure due to fractional page issue.
> >>
> >> This patch might be still a bit too large as a single patch and might need to split.
> >> If you think patch refactoring is needed, please suggest.
> >>
> >> Change Log:
> >>
> >> v1 => v2)
> >>
> >> - Copy fractional pages from 1st kernel to 2nd kernel to reduce read
> >>    to the fractional pages for reliability.
> >>
> >> - Deal with the case where multiple System RAM areas are contained in
> >>    a single fractional page.
> >>
> >> Test:
> >>
> >> Tested on X86_64. Fractional pages are created using memmap= kernel
> >> parameter on the kdump 1st kernel.
> >
> > Could you tell me more details about how to reproduce this ?
> > I tried to create such fractional pages to test the patch at
> > the end of this mail, by using memmap= kernel parameter as you said
> > like below:
> >
> >    # cat /proc/iomem
> >    ...
> >    100000000-10fff57ff : System RAM
> >    10fff5800-1100057ff : reserved
> >    110005800-11fffffff : System RAM
> >
> > However, I couldn't face the mmap() failure and makedumpfile worked
> > normally even using mmap() on linux-3.12.1. What am I missing here ?
> >
> 
> This patch set tries to reduce potential risk on accessing i.e. creating
> page tables reading memory outside System RAM regions. The potential risk
> I intend here is for example effect of accessing mmio region.
> 
> If you didn't see any failure except for mmap() failure on fractional pages,
> there's no potential risk on your system in the sense of what I mean above.
> Or you could probably see different behavior by choosing other System RAM
> region that resides in the memory that is used for something special.

Thanks for your response, but I couldn't see even the mmap() failure caused
by the sanity check in remap_pfn_range().
I'll describe below what I did for debugging.

First, here are the 1st kernel's memory map I prepared and its PT_LOAD segments.

   # cat /proc/iomem

   ...
   100000000-10fff57ff : System RAM
   10fff5800-1100057ff : ACPI Tables
   110005800-11fffffff : System RAM

   ...

   Type           Offset             VirtAddr           PhysAddr
                  FileSiz            MemSiz              Flags  Align
   LOAD           0x00000000d07ca000 0xffff880100000000 0x0000000100000000
                  0x000000000fff5800 0x000000000fff5800  RWE    0
   LOAD           0x00000000e07c0800 0xffff880110005800 0x0000000110005800
                  0x000000000fffa800 0x000000000fffa800  RWE    0


The fractional page I expected was [0x10fff5000 - 0x10fff57ff],
so its file offset was [0xe07bf000 - 0xe07bf7ff].
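(For reference, that offset follows from the first PT_LOAD header quoted above,
as file offset = Offset + (paddr - PhysAddr):
  0xd07ca000 + (0x10fff5000 - 0x100000000) = 0xe07bf000.)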

Second, I prepared a patch, shown below, to check whether the fractional page
was mapped with mmap() or not:


diff --git a/makedumpfile.c b/makedumpfile.c
index 7536274..b6abd31 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -251,11 +251,16 @@ update_mmap_range(off_t offset, int initial) {

        map_size = MIN(max_offset - start_offset, info->mmap_region_size);

+       if (start_offset <= 0xe07bf000 && 0xe07bf000 <= (start_offset + map_size)) {
+               MSG("Try mapping [%llx-%llx] with mmap()\n",
+                   (ulonglong)start_offset,
+                   (ulonglong)(start_offset + map_size));
+       }
+
        info->mmap_buf = mmap(NULL, map_size, PROT_READ, MAP_PRIVATE,
                                     info->fd_memory, start_offset);


Finally, I ran makedumpfile_v1.5.5 with only the debug patch above in the
2nd kernel (linux-3.12.1):

    # makedumpfile -D -c /proc/vmcore ./dumpfile.c
    ...
    
    mmap() is available on the kernel.
    Copying data                       : [ 92.9 %] -Try mapping [e03f9000-e07f9000] with mmap()
    Copying data                       : [100.0 %] |
    Writing erase info...
    offset_eraseinfo: 12ad1218, size_eraseinfo: 0

    The dumpfile is saved to ./dumpfile.c.

    makedumpfile Completed.
    #


According to this result, mmap() for the fractional page seemed to
succeed even without any fix, so I suspect that I am misunderstanding
something about the mmap() issue reported by Vivek.
Can a fractional page perhaps pass that sanity check depending on
the situation?

At least, with the debug patch above I confirmed that the patch I sent
truncates mmap() regions as I expected, so I think there is no problem
with it.


Thanks
Atsushi Kumagai

> Also, in early phase, our design didn't care about this kind of fractional pages
> because we don't think there were many such systems on real world. But the
> bug report came earlier than we expected. So, I think we should design
> carefully here around at least as long as they can be done relatively simply.
> # Sorry for delaying my work...
> 
> Of course, I think both kernel and makedumpfile address the issue together
> to reduce potential risk as much as possible.
> 
> -- 
> Thanks.
> HATAYAMA, Daisuke

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] vmcore: copy fractional pages into buffers in the kdump 2nd kernel
  2014-02-06  9:38     ` Atsushi Kumagai
@ 2014-02-14 12:18       ` HATAYAMA Daisuke
  0 siblings, 0 replies; 12+ messages in thread
From: HATAYAMA Daisuke @ 2014-02-14 12:18 UTC (permalink / raw)
  To: Atsushi Kumagai; +Cc: kexec

(2014/02/06 18:38), Atsushi Kumagai wrote:
> On 2014/01/31 18:58:08, HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:
>> (2014/01/31 11:36), Atsushi Kumagai wrote:
>>> Hello HATAYAMA-san,
>>>
>>> On 2013/12/09 17:06:18, HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> wrote:
>>>> This is a patch for fixing mmap failure due to fractional page issue.
>>>>
>>>> This patch might be still a bit too large as a single patch and might need to split.
>>>> If you think patch refactoring is needed, please suggest.
>>>>
>>>> Change Log:
>>>>
>>>> v1 => v2)
>>>>
>>>> - Copy fractional pages from 1st kernel to 2nd kernel to reduce read
>>>>     to the fractional pages for reliability.
>>>>
>>>> - Deal with the case where multiple System RAM areas are contained in
>>>>     a single fractional page.
>>>>
>>>> Test:
>>>>
>>>> Tested on X86_64. Fractional pages are created using memmap= kernel
>>>> parameter on the kdump 1st kernel.
>>>
>>> Could you tell me more details about how to reproduce this ?
>>> I tried to create such fractional pages to test the patch at
>>> the end of this mail, by using memmap= kernel parameter as you said
>>> like below:
>>>
>>>     # cat /proc/iomem
>>>     ...
>>>     100000000-10fff57ff : System RAM
>>>     10fff5800-1100057ff : reserved
>>>     110005800-11fffffff : System RAM
>>>
>>> However, I couldn't face the mmap() failure and makedumpfile worked
>>> normally even using mmap() on linux-3.12.1. What am I missing here ?
>>>
>>
>> This patch set tries to reduce potential risk on accessing i.e. creating
>> page tables reading memory outside System RAM regions. The potential risk
>> I intend here is for example effect of accessing mmio region.
>>
>> If you didn't see any failure except for mmap() failure on fractional pages,
>> there's no potential risk on your system in the sense of what I mean above.
>> Or you could probably see different behavior by choosing other System RAM
>> region that resides in the memory that is used for something special.
>
> Thanks for your response, but I couldn't see even the mmap() failure caused
> by a sanity check in remap_pfn_range().
> I'll describe what I did for debugging as below.
>
> First, the 1st kernel's memory map I prepared and its PT_LOAD are here.
>
>     # cat /proc/iomem
>
>     ...
>     100000000-10fff57ff : System RAM
>     10fff5800-1100057ff : ACPI Tables
>     110005800-11fffffff : System RAM
>
>     ...
>
>     Type           Offset             VirtAddr           PhysAddr
>                    FileSiz            MemSiz              Flags  Align
>     LOAD           0x00000000d07ca000 0xffff880100000000 0x0000000100000000
>                    0x000000000fff5800 0x000000000fff5800  RWE    0
>     LOAD           0x00000000e07c0800 0xffff880110005800 0x0000000110005800
>                    0x000000000fffa800 0x000000000fffa800  RWE    0
>
>
> The fractional page I expected was [0x10fff5000 - 0x10fff57ff],
> so its file offset was [0xe07bf000 - 0xe07bf7ff].
>
> Second, I prepared a patch to make sure whether the fractional page was
> mapped with mmap() or not as below:
>
>
> diff --git a/makedumpfile.c b/makedumpfile.c
> index 7536274..b6abd31 100644
> --- a/makedumpfile.c
> +++ b/makedumpfile.c
> @@ -251,11 +251,16 @@ update_mmap_range(off_t offset, int initial) {
>
>          map_size = MIN(max_offset - start_offset, info->mmap_region_size);
>
> +       if (start_offset <= 0xe07bf000 && 0xe07bf000 <= (start_offset + map_size)) {
> +               MSG("Try mapping [%llx-%llx] with mmap()\n",
> +                   (ulonglong)start_offset,
> +                   (ulonglong)(start_offset + map_size));
> +       }
> +
>          info->mmap_buf = mmap(NULL, map_size, PROT_READ, MAP_PRIVATE,
>                                       info->fd_memory, start_offset);
>
>
> Finally, I run makedumpfile_v1.5.5 with only the debug patch above in
> 2nd kernel (linux-3.12.1):
>
>      # makedumpfile -D -c /proc/vmcore ./dumpfile.c
>      ...
>
>      mmap() is available on the kernel.
>      Copying data                       : [ 92.9 %] -Try mapping [e03f9000-e07f9000] with mmap()
>      Copying data                       : [100.0 %] |
>      Writing erase info...
>      offset_eraseinfo: 12ad1218, size_eraseinfo: 0
>
>      The dumpfile is saved to ./dumpfile.c.
>
>      makedumpfile Completed.
>      #
>
>
> According to this result, mmap() for the fractional page seemed to
> succeed even without any fix, so I suspect that I misunderstand
> something about the mmap() issue reported by Vivek.
> Perhaps, can a fractional page pass that sanity check depending on
> the situation ?
>

mmap() fails if the target region contains two or more different memory types,
and in Vivek's environment, some ACPI region was mapped as UC.
Because you didn't see this failure, the ACPI Tables region in your environment
presumably has the same memory type as System RAM.
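
(To illustrate the kind of check being described: conceptually, remap_pfn_range()
only succeeds when every page in the requested range has one consistent memory
type. The sketch below is a rough model of that idea, not the kernel's actual
code; enum memtype and memtype_of() are made-up placeholders for the kernel's
memory-type tracking.)

enum memtype { MT_WB, MT_UC };               /* simplified stand-ins          */
enum memtype memtype_of(unsigned long pfn);  /* placeholder, see note above   */

/* Rough model only: a range spanning WB System RAM and a UC ACPI/reserved
 * region mixes memory types, so the whole mapping is refused. */
static int range_has_single_memtype(unsigned long first_pfn, unsigned long nr_pages)
{
	enum memtype t = memtype_of(first_pfn);
	unsigned long i;

	for (i = 1; i < nr_pages; i++)
		if (memtype_of(first_pfn + i) != t)
			return 0;    /* mixed types: mmap() of this range fails */
	return 1;
}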

-- 
Thanks.
HATAYAMA, Daisuke


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-02-14 12:20 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-09  8:06 [PATCH v2] vmcore: copy fractional pages into buffers in the kdump 2nd kernel HATAYAMA Daisuke
2013-12-09  8:06 ` HATAYAMA Daisuke
2013-12-11 11:33 ` WANG Chao
2013-12-11 11:33   ` WANG Chao
2013-12-12 16:42 ` Vivek Goyal
2013-12-12 16:42   ` Vivek Goyal
2013-12-13  1:42   ` HATAYAMA Daisuke
2013-12-13  1:42     ` HATAYAMA Daisuke
2014-01-31  2:36 ` Atsushi Kumagai
2014-01-31  9:57   ` HATAYAMA Daisuke
2014-02-06  9:38     ` Atsushi Kumagai
2014-02-14 12:18       ` HATAYAMA Daisuke
