linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/5][RESEND] introduce: Live Dump
@ 2011-12-23 13:14 YOSHIDA Masanori
  2011-12-23 13:14 ` [RFC PATCH 1/5][RESEND] livedump: Add notifier-call-chain into do_page_fault YOSHIDA Masanori
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: YOSHIDA Masanori @ 2011-12-23 13:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, hpa, x86, linux-kernel
  Cc: hpa, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Ingo Molnar, KOSAKI Motohiro, Kevin Hilman, Marcelo Tosatti,
	Michal Marek, Rik van Riel, Tejun Heo, Thomas Gleixner,
	YOSHIDA Masanori, Yinghai Lu, linux-kernel, x86, frank.rowand,
	jan.kiszka, yrl.pp-manager.tt

[-- Attachment #1: Type: text/plain, Size: 4940 bytes --]

[I'm sorry I failed to send the patch of [1/5]. Probably it vanished in the
 maze of my company's mail system (it consists of more than 10 servers...).
 Anyway, I'm sending the whole patch series again now.]

The following series introduces Live Dump, a new memory dumping mechanism
that lets users obtain a consistent memory dump without stopping a running
system.

Such a mechanism is especially useful when very important systems are
consolidated onto a single machine via virtualization. Assume a KVM host
runs multiple important VMs and one of them fails; the other VMs have to
keep running. At the same time, however, an administrator may want to
obtain a memory dump not only of the failed guest but also of the host,
because the cause of the failure may lie not in the guest but in the host
or the hardware under it.

Live Dump is based on the copy-on-write technique. Processing is basically
performed in the following order.
(1) Suspends processing on all CPUs.
(2) Makes the pages you want to dump read-only.
(3) Resumes all CPUs.
(4) On each page fault, dumps the page containing the faulting address.
(5) Finally, dumps the remaining pages that have not been updated.

Currently, Live Dump is just a simple prototype and has many limitations.
I list the important ones below.

(1) It write-protects only the kernel's straight mapping areas. Memory
    updates made through vmap areas or user space therefore don't cause
    page faults, and the corresponding pages are not dumped consistently.

(2) It supports only x86-64 architecture.

(3) It can handle only 4K pages. As we know, most pages in kernel space
    are mapped via 2M or 1G large page mappings. The current
    implementation of Live Dump therefore splits all large pages into 4K
    pages before setting up write protection.

(4) It allocates about 50% of physical RAM to store dumped pages. Live
    Dump currently saves all dumped data in memory first, and only then
    can a user access the dumped data. Live Dump itself has no feature to
    save dumped data to a disk or any other storage device.

This series consists of 5 patches.

The 1st patch adds a notifier call chain to do_page_fault. This is the
only modification to the existing code paths of the upstream kernel.

The 2nd patch introduces "livedump" misc device.

The 3rd patch adds the function to split large pages.

The 4th patch introduces the write protection management feature. This
enables users to turn on write protection for kernel space and to install
a hook function that is called every time a page fault occurs on a
protected page.

The last patch introduces the memory dumping feature. It installs a
function that dumps the content of a protected page on page fault. At the
same time, it lets users access the dumped data via the misc device
interface.


***How to test***
To test this patch series, you have to apply the attached patch to the
source code of crash[1]. The patch applies to version 6.0.1 of crash. In
addition, you have to configure your kernel with CONFIG_DEBUG_INFO
enabled.

[1]crash, http://people.redhat.com/anderson/crash-6.0.1.tar.gz

First, run the script tools/livedump/livedump in the following order.
 # livedump init
 # livedump split
 # livedump start
 # livedump sweep

At this point, the whole memory image has been saved (in memory). You can
then analyze the image by running the patched crash as follows.
 # crash /dev/livedump /boot/System.map-livedump /boot/vmlinux.o-livedump

The following command releases all resources allocated by livedump. You
can execute it at any time (after init, split, start or sweep).
 # livedump uninit

After executing "uninit", you can start again from "init".

---

YOSHIDA Masanori (5):
      livedump: Add memory dumping functionality
      livedump: Add write protection management
      livedump: Add page splitting functionality
      livedump: Add the new misc device "livedump"
      livedump: Add notifier-call-chain into do_page_fault


 arch/x86/Kconfig                 |   29 ++
 arch/x86/include/asm/traps.h     |    2 
 arch/x86/include/asm/wrprotect.h |   49 +++
 arch/x86/mm/Makefile             |    2 
 arch/x86/mm/fault.c              |    7 
 arch/x86/mm/wrprotect.c          |  613 ++++++++++++++++++++++++++++++++++++++
 kernel/Makefile                  |    1 
 kernel/livedump-memdump.c        |  227 ++++++++++++++
 kernel/livedump-memdump.h        |   45 +++
 kernel/livedump.c                |  131 ++++++++
 tools/livedump/livedump          |   17 +
 11 files changed, 1123 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/wrprotect.h
 create mode 100644 arch/x86/mm/wrprotect.c
 create mode 100644 kernel/livedump-memdump.c
 create mode 100644 kernel/livedump-memdump.h
 create mode 100644 kernel/livedump.c
 create mode 100755 tools/livedump/livedump

-- 
YOSHIDA Masanori <masanori.yoshida.tv@hitachi.com>

[-- Attachment #2: crash-6.0.1-livedump.patch --]
[-- Type: text/plain, Size: 1880 bytes --]

commit 6b649c02506a81b7d9ce36c474d68158e52ae463
Author: YOSHIDA Masanori <masanori.yoshida.tv@hitachi.com>
Date:   Wed Dec 7 16:07:57 2011 +0900

    Fix to support livedump

diff --git a/filesys.c b/filesys.c
index 48259fb..8c79e53 100755
--- a/filesys.c
+++ b/filesys.c
@@ -155,6 +155,7 @@ memory_source_init(void)
 			return;
 
 		if (!STREQ(pc->live_memsrc, "/dev/mem") &&
+		    !STREQ(pc->live_memsrc, "/dev/livedump") &&
 		     STREQ(pc->live_memsrc, pc->memory_device)) {
 			if (memory_driver_init())
 				return;
@@ -175,6 +176,9 @@ memory_source_init(void)
 	                                        strerror(errno));
 	                } else
 	                        pc->flags |= MFD_RDWR;
+		} else if (STREQ(pc->live_memsrc, "/dev/livedump")) {
+	                if ((pc->mfd = open("/dev/livedump", O_RDONLY)) < 0)
+				error(FATAL, "/dev/livedump: %s\n", strerror(errno));
 		} else if (STREQ(pc->live_memsrc, "/proc/kcore")) {
 			if ((pc->mfd = open("/proc/kcore", O_RDONLY)) < 0)
 				error(FATAL, "/proc/kcore: %s\n", 
diff --git a/main.c b/main.c
index c794ca8..b42c679 100755
--- a/main.c
+++ b/main.c
@@ -435,6 +435,19 @@ main(int argc, char **argv)
 				pc->writemem = write_dev_mem;
 				pc->live_memsrc = argv[optind];
 
+			} else if (STREQ(argv[optind], "/dev/livedump")) {
+                        	if (pc->flags & MEMORY_SOURCES) {
+                                	error(INFO, 
+                                            "too many dumpfile arguments\n");
+                                	program_usage(SHORT_FORM);
+                        	}
+				pc->flags |= DEVMEM;
+				pc->dumpfile = NULL;
+				pc->readmem = read_dev_mem;
+				pc->writemem = write_dev_mem;
+				pc->live_memsrc = argv[optind];
+				pc->program_pid = 1;
+
 			} else if (is_proc_kcore(argv[optind], KCORE_LOCAL)) {
 				if (pc->flags & MEMORY_SOURCES) {
 					error(INFO, 


* [RFC PATCH 2/5][RESEND] livedump: Add the new misc device "livedump"
  2011-12-23 13:14 [RFC PATCH 0/5][RESEND] introduce: Live Dump YOSHIDA Masanori
  2011-12-23 13:14 ` [RFC PATCH 1/5][RESEND] livedump: Add notifier-call-chain into do_page_fault YOSHIDA Masanori
@ 2011-12-23 13:14 ` YOSHIDA Masanori
  2011-12-23 13:14 ` [RFC PATCH 3/5][RESEND] livedump: Add page splitting functionality YOSHIDA Masanori
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: YOSHIDA Masanori @ 2011-12-23 13:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, hpa, x86, linux-kernel
  Cc: hpa, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Ingo Molnar, KOSAKI Motohiro, Kevin Hilman, Marcelo Tosatti,
	Michal Marek, Rik van Riel, Tejun Heo, Thomas Gleixner,
	YOSHIDA Masanori, Yinghai Lu, linux-kernel, x86, frank.rowand,
	jan.kiszka, yrl.pp-manager.tt, YOSHIDA Masanori, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, Andrew Morton, Michal Marek,
	Kevin Hilman, Borislav Petkov, linux-kernel

This patch introduces the new misc device "livedump".
The device has only open, close and empty ioctl operations.

Signed-off-by: YOSHIDA Masanori <masanori.yoshida.tv@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: linux-kernel@vger.kernel.org
---

 arch/x86/Kconfig  |   15 ++++++++++
 kernel/Makefile   |    1 +
 kernel/livedump.c |   82 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 98 insertions(+), 0 deletions(-)
 create mode 100644 kernel/livedump.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index efb4294..32bc16d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1702,6 +1702,21 @@ config CMDLINE_OVERRIDE
 	  This is used to work around broken boot loaders.  This should
 	  be set to 'N' under normal conditions.
 
+config LIVEDUMP
+	bool "Live Dump support"
+	depends on X86_64
+	---help---
+	  Set this option to 'Y' to let the kernel acquire a consistent
+	  snapshot of kernel space without stopping the system.
+
+	  This feature constantly adds a small overhead to the kernel.
+
+	  Once this feature is initialized via its special ioctl, it
+	  allocates a huge amount of memory for itself and adds much
+	  more overhead to the kernel.
+
+	  If in doubt, say N.
+
 endmenu
 
 config ARCH_ENABLE_MEMORY_HOTPLUG
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..7d858e4 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -109,6 +109,7 @@ obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
 obj-$(CONFIG_PADATA) += padata.o
 obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
+obj-$(CONFIG_LIVEDUMP) += livedump.o
 
 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/livedump.c b/kernel/livedump.c
new file mode 100644
index 0000000..198bda5
--- /dev/null
+++ b/kernel/livedump.c
@@ -0,0 +1,82 @@
+/* livedump.c - Live Dump's main
+ * Copyright (C) 2011 Hitachi, Ltd.
+ * Author: YOSHIDA Masanori <masanori.yoshida.tv@hitachi.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+ * MA  02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+
+#define DEVICE_NAME	"livedump"
+
+#define LIVEDUMP_IOC(x)	_IO(0xff, x)
+
+static long livedump_ioctl(
+		struct file *file, unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
+static int livedump_open(struct inode *inode, struct file *file)
+{
+	if (!try_module_get(THIS_MODULE))
+		return -ENOENT;
+	return 0;
+}
+
+static int livedump_release(struct inode *inode, struct file *file)
+{
+	module_put(THIS_MODULE);
+	return 0;
+}
+
+static const struct file_operations livedump_fops = {
+	.unlocked_ioctl = livedump_ioctl,
+	.open = livedump_open,
+	.release = livedump_release,
+};
+static struct miscdevice livedump_misc = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = DEVICE_NAME,
+	.fops = &livedump_fops,
+};
+
+static int livedump_module_init(void)
+{
+	int ret;
+
+	ret = misc_register(&livedump_misc);
+	if (WARN(ret, "livedump: "
+			"Failed to register livedump on misc device.\n"))
+		return ret;
+
+	return 0;
+}
+module_init(livedump_module_init);
+
+static void livedump_module_exit(void)
+{
+	misc_deregister(&livedump_misc);
+}
+module_exit(livedump_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Livedump kernel module");



* [RFC PATCH 1/5][RESEND] livedump: Add notifier-call-chain into do_page_fault
  2011-12-23 13:14 [RFC PATCH 0/5][RESEND] introduce: Live Dump YOSHIDA Masanori
@ 2011-12-23 13:14 ` YOSHIDA Masanori
  2011-12-23 13:14 ` [RFC PATCH 2/5][RESEND] livedump: Add the new misc device "livedump" YOSHIDA Masanori
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: YOSHIDA Masanori @ 2011-12-23 13:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, hpa, x86, linux-kernel
  Cc: hpa, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Ingo Molnar, KOSAKI Motohiro, Kevin Hilman, Marcelo Tosatti,
	Michal Marek, Rik van Riel, Tejun Heo, Thomas Gleixner,
	YOSHIDA Masanori, Yinghai Lu, linux-kernel, x86,
	yrl.pp-manager.tt

This patch adds a notifier call chain that is called in do_page_fault.
It can be used to check whether a page fault occurred for a normal reason
or was caused by special write protection such as Live Dump's.

Signed-off-by: YOSHIDA Masanori <masanori.yoshida.tv@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Andy Lutomirski <luto@mit.edu>
Cc: Rik van Riel <riel@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: linux-kernel@vger.kernel.org
---

 arch/x86/include/asm/traps.h |    2 ++
 arch/x86/mm/fault.c          |    7 +++++++
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 0012d09..7bf6081 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -89,4 +89,6 @@ asmlinkage void smp_thermal_interrupt(void);
 asmlinkage void mce_threshold_interrupt(void);
 #endif
 
+extern struct atomic_notifier_head page_fault_notifier_list;
+
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..2f8e2a9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -985,6 +985,8 @@ static int fault_in_kernel_space(unsigned long address)
 	return address >= TASK_SIZE_MAX;
 }
 
+ATOMIC_NOTIFIER_HEAD(page_fault_notifier_list);
+
 /*
  * This routine handles page faults.  It determines the address,
  * and the problem, and then passes it off to one of the appropriate
@@ -1008,6 +1010,11 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 	/* Get the faulting address: */
 	address = read_cr2();
 
+	if (atomic_notifier_call_chain(
+				&page_fault_notifier_list, error_code, regs) &
+			NOTIFY_OK)
+		return;
+
 	/*
 	 * Detect and handle instructions that would cause a page fault for
 	 * both a tracked kernel page and a userspace page.



* [RFC PATCH 4/5][RESEND] livedump: Add write protection management
  2011-12-23 13:14 [RFC PATCH 0/5][RESEND] introduce: Live Dump YOSHIDA Masanori
                   ` (2 preceding siblings ...)
  2011-12-23 13:14 ` [RFC PATCH 3/5][RESEND] livedump: Add page splitting functionality YOSHIDA Masanori
@ 2011-12-23 13:14 ` YOSHIDA Masanori
  2011-12-23 13:14 ` [RFC PATCH 5/5][RESEND] livedump: Add memory dumping functionality YOSHIDA Masanori
  4 siblings, 0 replies; 6+ messages in thread
From: YOSHIDA Masanori @ 2011-12-23 13:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, hpa, x86, linux-kernel
  Cc: hpa, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Ingo Molnar, KOSAKI Motohiro, Kevin Hilman, Marcelo Tosatti,
	Michal Marek, Rik van Riel, Tejun Heo, Thomas Gleixner,
	YOSHIDA Masanori, Yinghai Lu, linux-kernel, x86, frank.rowand,
	jan.kiszka, yrl.pp-manager.tt, YOSHIDA Masanori, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, linux-kernel

This patch makes it possible to write-protect pages in kernel space and to
install a handler function that is called every time a page fault occurs
on a protected page. Write protection is set up in the stop-machine state
so that all pages are protected consistently.

Write protection and fault handling are performed in the following order:

(1) Write protection phase
  - Stops machine.
  - Handles sensitive pages.
    (described below about sensitive pages)
  - Sets up write protection.
  - Resumes machine.
(2) Page fault exception handling
  - Before unprotecting faulting page, calls the handler function.
(3) Sweep phase
  - Calls the handler function against the rest of pages.

This patch exports the following 4 ioctl operations.
- Ioctl to activate this feature of write protection
- Ioctl to deactivate this feature
- Ioctl to kick stop-machine and to set up write protection
- Ioctl to sweep all the rest of pages

The processing states are as follows. They can transition only in the
order shown below.
- WRPROTECT_UNINITIALIZED
- WRPROTECT_INIT
- WRPROTECT_STARTED (= write protection already set up)
- WRPROTECT_SWEPT

However, this order is enforced only by a plain integer variable, so,
strictly speaking, this code is not safe against concurrent operation.

The livedump module has to acquire a consistent memory image of kernel
space. Write protection is therefore set up while updates to memory are
suspended. To do so, livedump currently uses stop_machine.

A page fault during page fault handling results in kernel panic, so any
page that can be updated during page fault handling must not be
write-protected. For the same reason, any page that can be updated during
NMI handling must not be write-protected. I call such pages "sensitive
pages". The handler function is called on the sensitive pages during the
stop-machine state, as if they had caused page faults at that point.

I list the sensitive pages in the following:

- Kernel/Exception/Interrupt stacks
- Page table structure
- All task_struct
- ".data" section of kernel
- per_cpu areas

The handler function is never called for pages that are not updated,
unless it is called by someone else. To handle these pages, the livedump
module finally calls the handler function on each of them. I call this
phase "sweep"; it is triggered by an ioctl operation.

To specify which pages are to be write-protected and how to handle them,
the following 3 types of hook functions need to be defined.

- int fn_select_pages(unsigned long *bmp)
    This function selects the pages to be protected. The selection is
    returned as a bitmap in which each bit corresponds to a PFN (page
    frame number). This function is called outside the stop-machine
    state, so its processing doesn't lengthen the stop-machine time.

- void fn_handle_page(unsigned long pfn)
    This function handles a faulting page. The argument pfn specifies
    which page caused the page fault. How to handle the page can be
    defined arbitrarily.
    This function is called when a page fault occurs on a page protected
    by this module. It is also called during the stop-machine state to
    handle the sensitive pages described above.

- void fn_handle_sensitive_pages(unsigned long *bmp)
    Whoever defines these hook functions may have additional sensitive
    pages, that is, pages that must not be write-protected. This function
    handles such pages during the stop-machine state. The bits in the
    bitmap corresponding to the pages handled by this function must be
    cleared.

Strictly speaking, if set_memory_rw is called between the
WRPROTECT_STARTED and WRPROTECT_SWEPT states, the consistency of the
dumped memory image can break. To solve this problem, I plan to add a
hook into set_memory_rw in the next version of the patch series.

Signed-off-by: YOSHIDA Masanori <masanori.yoshida.tv@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---

 arch/x86/include/asm/wrprotect.h |   23 ++
 arch/x86/mm/wrprotect.c          |  582 ++++++++++++++++++++++++++++++++++++++
 kernel/livedump.c                |   33 ++
 tools/livedump/livedump          |    4 
 4 files changed, 641 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/wrprotect.h b/arch/x86/include/asm/wrprotect.h
index 40489e0..f63b196 100644
--- a/arch/x86/include/asm/wrprotect.h
+++ b/arch/x86/include/asm/wrprotect.h
@@ -23,4 +23,27 @@
 
 extern int wrprotect_split(void);
 
+typedef int (*fn_select_pages_t)(unsigned long *pfn_bmp);
+typedef void (*fn_handle_sensitive_pages_t)(unsigned long *pgbmp);
+typedef void (*fn_handle_page_t)(unsigned long pfn);
+
+extern int wrprotect_init(
+		fn_select_pages_t fn_select_pages,
+		fn_handle_sensitive_pages_t fn_handle_sensitive_pages,
+		fn_handle_page_t fn_handle_page);
+extern void wrprotect_uninit(void);
+
+extern int wrprotect_start(void);
+extern int wrprotect_sweep(void);
+
+extern void wrprotect_unselect_pages_but_edges(
+		unsigned long *pgbmp,
+		unsigned long start,
+		unsigned long len);
+extern void wrprotect_handle_only_edges(
+		unsigned long *pgbmp,
+		fn_handle_page_t fn_handle_page,
+		unsigned long start,
+		unsigned long len);
+
 #endif /* _WRPROTECT_H */
diff --git a/arch/x86/mm/wrprotect.c b/arch/x86/mm/wrprotect.c
index 2a69735..bd5638c 100644
--- a/arch/x86/mm/wrprotect.c
+++ b/arch/x86/mm/wrprotect.c
@@ -19,7 +19,33 @@
  */
 
 #include <asm/wrprotect.h>
-#include <linux/highmem.h>
+#include <linux/mm.h>		/* num_physpages, __get_free_page, etc. */
+#include <linux/bitmap.h>	/* bit operations */
+#include <linux/slab.h>		/* kmalloc, kfree */
+#include <linux/hugetlb.h>	/* __flush_tlb_all */
+#include <linux/stop_machine.h>	/* stop_machine */
+#include <asm/traps.h>		/* page_fault_notifier_list */
+#include <asm/sections.h>	/* __per_cpu_* */
+
+/* wrprotect's state */
+static struct wrprotect {
+	int state;
+#define STATE_UNINIT 0
+#define STATE_INITIALIZED 1
+#define STATE_STARTED 2
+#define STATE_SWEPT 3
+} wrprotect;
+
+/* Bitmap specifying pages being write-protected */
+static unsigned long *pgbmp;
+#define PGBMP_LEN (sizeof(long) * BITS_TO_LONGS(num_physpages))
+
+/* wrprotect's hook functions, which define which and how to handle pages */
+static struct {
+	fn_select_pages_t select_pages;
+	fn_handle_sensitive_pages_t handle_sensitive_pages;
+	fn_handle_page_t handle_page;
+} ops;
 
 int wrprotect_split(void)
 {
@@ -31,3 +57,557 @@ int wrprotect_split(void)
 	}
 	return 0;
 }
+
+struct sm_context {
+	int leader_cpu;
+	int leader_done;
+	int (*fn_leader)(void *arg);
+	int (*fn_follower)(void *arg);
+	void *arg;
+};
+
+static int call_leader_follower(void *data)
+{
+	int ret;
+	struct sm_context *ctx = data;
+
+	if (smp_processor_id() == ctx->leader_cpu) {
+		ret = ctx->fn_leader(ctx->arg);
+		ctx->leader_done = 1;
+	} else {
+		while (!ctx->leader_done)
+			cpu_relax();
+		ret = ctx->fn_follower(ctx->arg);
+	}
+
+	return ret;
+}
+
+/* stop_machine_leader_follower
+ *
+ * Calls stop_machine with a leader CPU and follower CPUs
+ * executing different codes.
+ * At first, the leader CPU is selected randomly and executes its code.
+ * After that, follower CPUs execute their codes.
+ */
+static int stop_machine_leader_follower(
+		int (*fn_leader)(void *),
+		int (*fn_follower)(void *),
+		void *arg)
+{
+	int cpu;
+	struct sm_context ctx;
+
+	preempt_disable();
+	cpu = smp_processor_id();
+	preempt_enable();
+
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.leader_cpu = cpu;
+	ctx.leader_done = 0;
+	ctx.fn_leader = fn_leader;
+	ctx.fn_follower = fn_follower;
+	ctx.arg = arg;
+
+	return stop_machine(call_leader_follower, &ctx, cpu_online_mask);
+}
+
+/* wrprotect_unselect_pages_but_edges
+ *
+ * Clear bits corresponding to pages that cover a range
+ * from start to start+len-1.
+ * However, if edges (start and/or start+len) are not aligned to PAGE_SIZE,
+ * the first and the last bits are not cleared.
+ */
+void wrprotect_unselect_pages_but_edges(
+		unsigned long *pgbmp,
+		unsigned long start,
+		unsigned long len)
+{
+	unsigned long end = (start + len) & PAGE_MASK;
+
+	start = (start + PAGE_SIZE - 1) & PAGE_MASK;
+	while (start < end) {
+		unsigned long pfn = __pa(start) >> PAGE_SHIFT;
+		clear_bit(pfn, pgbmp);
+		start += PAGE_SIZE;
+	}
+}
+
+/* wrprotect_handle_only_edges
+ *
+ * Call fn_handle_page against the first and the last pages
+ * if the corresponding bits are set.
+ * When fn_handle_page is called, the corresponding bit is cleared.
+ */
+void wrprotect_handle_only_edges(
+		unsigned long *pgbmp,
+		fn_handle_page_t fn_handle_page,
+		unsigned long start,
+		unsigned long len)
+{
+	unsigned long pfn_begin = __pa(start) >> PAGE_SHIFT;
+	unsigned long pfn_last = __pa(start + len - 1) >> PAGE_SHIFT;
+
+	if (test_bit(pfn_begin, pgbmp)) {
+		fn_handle_page(pfn_begin);
+		clear_bit(pfn_begin, pgbmp);
+	}
+	if (test_bit(pfn_last, pgbmp)) {
+		fn_handle_page(pfn_last);
+		clear_bit(pfn_last, pgbmp);
+	}
+}
+
+/* handle_addr_range
+ *
+ * Call fn_handle_page in turns against pages that cover a range
+ * from start to start+len-1.
+ * At the same time, bits corresponding to the pages are cleared.
+ */
+static void handle_addr_range(
+		unsigned long *pgbmp,
+		fn_handle_page_t fn_handle_page,
+		unsigned long start,
+		unsigned long len)
+{
+	unsigned long end = start + len;
+
+	while (start < end) {
+		unsigned long pfn = __pa(start) >> PAGE_SHIFT;
+		if (test_bit(pfn, pgbmp)) {
+			fn_handle_page(pfn);
+			clear_bit(pfn, pgbmp);
+		}
+		start += PAGE_SIZE;
+	}
+}
+
+/* handle_task
+ *
+ * Call handle_addr_range against a given task_struct & thread_info
+ */
+static void handle_task(
+		unsigned long *pgbmp,
+		fn_handle_page_t fn_handle_page,
+		struct task_struct *t)
+{
+	BUG_ON(!t);
+	BUG_ON(!t->stack);
+	BUG_ON((unsigned long)t->stack & ~PAGE_MASK);
+	handle_addr_range(pgbmp, fn_handle_page,
+			(unsigned long)t, sizeof(*t));
+	handle_addr_range(pgbmp, fn_handle_page,
+			(unsigned long)t->stack, THREAD_SIZE);
+}
+
+/* handle_tasks
+ *
+ * Call handle_task against all tasks (including idle_task's).
+ */
+static void handle_tasks(
+		unsigned long *pgbmp,
+		fn_handle_page_t fn_handle_page)
+{
+	struct task_struct *p, *t;
+	unsigned int cpu;
+
+	do_each_thread(p, t) {
+		handle_task(pgbmp, fn_handle_page, t);
+	} while_each_thread(p, t);
+
+	for_each_online_cpu(cpu)
+		handle_task(pgbmp, fn_handle_page, idle_task(cpu));
+}
+
+static void handle_pmd(
+		unsigned long *pgbmp,
+		fn_handle_page_t fn_handle_page,
+		pmd_t *pmd)
+{
+	unsigned long i;
+
+	handle_addr_range(pgbmp, fn_handle_page,
+			(unsigned long)pmd, PAGE_SIZE);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (pmd_present(pmd[i]) && !pmd_large(pmd[i]))
+			handle_addr_range(pgbmp, fn_handle_page,
+					pmd_page_vaddr(pmd[i]), PAGE_SIZE);
+	}
+}
+
+static void handle_pud(
+		unsigned long *pgbmp,
+		fn_handle_page_t fn_handle_page,
+		pud_t *pud)
+{
+	unsigned long i;
+
+	handle_addr_range(pgbmp, fn_handle_page,
+			(unsigned long)pud, PAGE_SIZE);
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		if (pud_present(pud[i]) && !pud_large(pud[i]))
+			handle_pmd(pgbmp, fn_handle_page,
+					(pmd_t *)pud_page_vaddr(pud[i]));
+	}
+}
+
+/* handle_page_table
+ *
+ * Call fn_handle_page against all pages of page table structure
+ * and clear all bits corresponding to the pages.
+ */
+static void handle_page_table(
+		unsigned long *pgbmp,
+		fn_handle_page_t fn_handle_page)
+{
+	pgd_t *pgd;
+	unsigned long i;
+
+	pgd = __va(read_cr3() & PAGE_MASK);
+	handle_addr_range(pgbmp, fn_handle_page,
+			(unsigned long)pgd, PAGE_SIZE);
+	for (i = pgd_index(PAGE_OFFSET); i < PTRS_PER_PGD; i++) {
+		if (pgd_present(pgd[i]))
+			handle_pud(pgbmp, fn_handle_page,
+					(pud_t *)pgd_page_vaddr(pgd[i]));
+	}
+}
+
+/* handle_sensitive_pages
+ *
+ * Call fn_handle_page against the following pages and
+ * clear bits corresponding them.
+ */
+static void handle_sensitive_pages(
+		unsigned long *pgbmp,
+		fn_handle_page_t fn_handle_page)
+{
+	handle_tasks(pgbmp, fn_handle_page);
+	handle_page_table(pgbmp, fn_handle_page);
+	handle_addr_range(pgbmp, fn_handle_page,
+			(unsigned long)__per_cpu_offset[0], PMD_PAGE_SIZE);
+	handle_addr_range(pgbmp, fn_handle_page,
+			(unsigned long)_sdata, _end - _sdata);
+}
+
+/* protect_page
+ *
+ * Changes a specified page's _PAGE_RW flag and _PAGE_UNUSED1 flag.
+ * If the argument protect is non-zero:
+ *  - _PAGE_RW flag is cleared
+ *  - _PAGE_UNUSED1 flag is set
+ * If the argument protect is zero:
+ *  - _PAGE_RW flag is set
+ *  - _PAGE_UNUSED1 flag is cleared
+ *
+ * The change is executed only when all the following are true.
+ *  - The page is mapped by the straight mapping area.
+ *  - The page is mapped as 4K page.
+ *  - The page is originally writable.
+ *
+ * Returns 1 if the change is actually executed, otherwise returns 0.
+ */
+static int protect_page(unsigned long pfn, int protect)
+{
+	unsigned long addr = (unsigned long)pfn_to_kaddr(pfn);
+	pte_t *ptep, pte;
+	unsigned int level;
+
+	ptep = lookup_address(addr, &level);
+	if (WARN(!ptep, "livedump: Page=%016lx isn't mapped.\n", addr) ||
+	    WARN(!pte_present(*ptep),
+		    "livedump: Page=%016lx isn't mapped.\n", addr) ||
+	    WARN(PG_LEVEL_NONE == level,
+		    "livedump: Page=%016lx isn't mapped.\n", addr) ||
+	    WARN(PG_LEVEL_2M == level,
+		    "livedump: Page=%016lx consists of a 2M page.\n", addr) ||
+	    WARN(PG_LEVEL_1G == level,
+		    "livedump: Page=%016lx consists of a 1G page.\n", addr)) {
+		return 0;
+	}
+
+	pte = *ptep;
+	if (protect) {
+		if (pte_write(pte)) {
+			pte = pte_wrprotect(pte);
+			pte = pte_set_flags(pte, _PAGE_UNUSED1);
+		}
+	} else {
+		pte = pte_mkwrite(pte);
+		pte = pte_clear_flags(pte, _PAGE_UNUSED1);
+	}
+	*ptep = pte;
+
+	return 1;
+}
+
+/*
+ * Page fault error code bits:
+ *
+ *   bit 0 ==	 0: no page found	1: protection fault
+ *   bit 1 ==	 0: read access		1: write access
+ *   bit 2 ==	 0: kernel-mode access	1: user-mode access
+ *   bit 3 ==				1: use of reserved bit detected
+ *   bit 4 ==				1: fault was an instruction fetch
+ */
+enum x86_pf_error_code {
+	PF_PROT		=		1 << 0,
+	PF_WRITE	=		1 << 1,
+	PF_USER		=		1 << 2,
+	PF_RSVD		=		1 << 3,
+	PF_INSTR	=		1 << 4,
+};
+
+static int wrprotect_page_fault_notifier(
+		struct notifier_block *n, unsigned long val, void *v)
+{
+	unsigned long error_code = val;
+	pte_t *ptep, pte;
+	unsigned int level;
+	unsigned long pfn;
+
+	/*
+	 * Handle only kernel-mode write access
+	 *
+	 * error_code must be:
+	 *  (1) PF_PROT
+	 *  (2) PF_WRITE
+	 *  (3) not PF_USER
+	 *  (4) not PF_RSVD
+	 *  (5) not PF_INSTR
+	 */
+	if (!(PF_PROT  & error_code) ||
+	    !(PF_WRITE & error_code) ||
+	     (PF_USER  & error_code) ||
+	     (PF_RSVD  & error_code) ||
+	     (PF_INSTR & error_code))
+		goto not_processed;
+
+	ptep = lookup_address(read_cr2(), &level);
+	if (!ptep)
+		goto not_processed;
+	pte = *ptep;
+	if (!pte_present(pte) || PG_LEVEL_4K != level)
+		goto not_processed;
+	if (!(pte_flags(pte) & _PAGE_UNUSED1))
+		goto not_processed;
+
+	pfn = pte_pfn(pte);
+	if (test_and_clear_bit(pfn, pgbmp)) {
+		ops.handle_page(pfn);
+		protect_page(pfn, 0);
+	}
+
+	return NOTIFY_OK;
+
+not_processed:
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block wrprotect_page_fault_notifier_block = {
+	.notifier_call = wrprotect_page_fault_notifier,
+	.priority = 0,
+};
+
+/* sm_leader
+ *
+ * Is executed by a leader CPU during stop-machine.
+ *
+ * Does the following:
+ * (1) Handles sensitive pages, which must not be write-protected.
+ * (2) Registers a notifier in the kernel's page fault notifier chain.
+ * (3) Write-protects the pages specified by the bitmap.
+ * (4) Flushes the TLB of the leader CPU.
+ */
+static int sm_leader(void *arg)
+{
+	int ret;
+	unsigned long pfn;
+
+	handle_sensitive_pages(pgbmp, ops.handle_page);
+	wrprotect_handle_only_edges(pgbmp, ops.handle_page,
+			(unsigned long)pgbmp, PGBMP_LEN);
+	wrprotect_handle_only_edges(pgbmp, ops.handle_page,
+			(unsigned long)&wrprotect, sizeof(wrprotect));
+	ops.handle_sensitive_pages(pgbmp);
+
+	ret = atomic_notifier_chain_register(
+			&page_fault_notifier_list,
+			&wrprotect_page_fault_notifier_block);
+	if (WARN(ret, "livedump: Failed to register notifier.\n"))
+		return ret;
+
+	for_each_set_bit(pfn, pgbmp, num_physpages)
+		if (!protect_page(pfn, 1))
+			clear_bit(pfn, pgbmp);
+
+	__flush_tlb_all();
+
+	return 0;
+}
+
+/* sm_follower
+ *
+ * Is executed by follower CPUs during stop-machine.
+ * Flushes TLB cache of each CPU.
+ */
+static int sm_follower(void *arg)
+{
+	__flush_tlb_all();
+	return 0;
+}
+
+/* wrprotect_start
+ *
+ * Set up write protection on the kernel space in the stop-machine state.
+ */
+int wrprotect_start(void)
+{
+	int ret;
+
+	if (WARN(STATE_INITIALIZED != wrprotect.state,
+				"livedump: wrprotect isn't initialized yet.\n"))
+		return 0;
+
+	ret = stop_machine_leader_follower(sm_leader, sm_follower, NULL);
+	if (WARN(ret, "livedump: Failed to protect pages w/errno=%d.\n", ret))
+		return ret;
+
+	wrprotect.state = STATE_STARTED;
+	return 0;
+}
+
+/* wrprotect_sweep
+ *
+ * For every page specified by the bitmap, the following is executed.
+ *  - Handle the page as defined by ops.handle_page.
+ *  - Restore the page's flags by calling protect_page.
+ *
+ * The page fault notifier chain may do the same work on the same page
+ * concurrently; test_and_clear_bit on the bitmap is used for exclusion,
+ * so each page is handled exactly once.
+ */
+int wrprotect_sweep(void)
+{
+	unsigned long pfn;
+
+	if (WARN(STATE_STARTED != wrprotect.state,
+				"livedump: Pages aren't protected yet.\n"))
+		return 0;
+	for_each_set_bit(pfn, pgbmp, num_physpages) {
+		if (!test_and_clear_bit(pfn, pgbmp))
+			continue;
+		ops.handle_page(pfn);
+		protect_page(pfn, 0);
+		if (!(pfn & 0xffUL))
+			cond_resched();
+	}
+	wrprotect.state = STATE_SWEPT;
+	return 0;
+}
+
+static int default_select_pages(unsigned long *pgmap)
+{
+	unsigned long pfn;
+
+	for (pfn = 0; pfn < num_physpages; pfn++) {
+		if (e820_any_mapped(pfn << PAGE_SHIFT,
+				    (pfn + 1) << PAGE_SHIFT,
+				    E820_RAM))
+			bitmap_set(pgbmp, pfn, 1);
+		if (!(pfn & 0xffUL))
+			cond_resched();
+	}
+	return 0;
+}
+
+static void default_handle_sensitive_pages(unsigned long *pgbmp)
+{
+}
+
+static void default_handle_page(unsigned long pfn)
+{
+}
+
+int wrprotect_init(
+		fn_select_pages_t fn_select_pages,
+		fn_handle_sensitive_pages_t fn_handle_sensitive_pages,
+		fn_handle_page_t fn_handle_page)
+{
+	int ret;
+
+	if (WARN(STATE_UNINIT != wrprotect.state,
+			"livedump: wrprotect is already initialized.\n"))
+		return 0;
+
+	if (fn_select_pages && fn_handle_sensitive_pages && fn_handle_page) {
+		ops.select_pages = fn_select_pages;
+		ops.handle_sensitive_pages = fn_handle_sensitive_pages;
+		ops.handle_page = fn_handle_page;
+	} else {
+		ops.select_pages = default_select_pages;
+		ops.handle_sensitive_pages = default_handle_sensitive_pages;
+		ops.handle_page = default_handle_page;
+	}
+
+	ret = -ENOMEM;
+	pgbmp = kzalloc(PGBMP_LEN, GFP_KERNEL);
+	if (!pgbmp)
+		goto err;
+
+	ret = ops.select_pages(pgbmp);
+	if (ret)
+		goto err;
+
+	wrprotect_unselect_pages_but_edges(
+			pgbmp, (unsigned long)pgbmp, PGBMP_LEN);
+	wrprotect_unselect_pages_but_edges(
+			pgbmp, (unsigned long)&wrprotect, sizeof(wrprotect));
+
+	wrprotect.state = STATE_INITIALIZED;
+	return 0;
+
+err:
+	kfree(pgbmp);
+	pgbmp = NULL;
+
+	return ret;
+}
+
+void wrprotect_uninit(void)
+{
+	int ret;
+	unsigned long pfn;
+
+	if (STATE_UNINIT == wrprotect.state)
+		return;
+
+	if (STATE_STARTED == wrprotect.state) {
+		for_each_set_bit(pfn, pgbmp, num_physpages) {
+			if (!test_and_clear_bit(pfn, pgbmp))
+				continue;
+			protect_page(pfn, 0);
+			cond_resched();
+		}
+
+		flush_tlb_all();
+	}
+
+	if (STATE_STARTED <= wrprotect.state) {
+		ret = atomic_notifier_chain_unregister(
+				&page_fault_notifier_list,
+				&wrprotect_page_fault_notifier_block);
+		WARN(ret, "livedump: Failed to unregister notifier "
+				"w/errno=%d.\n", -ret);
+	}
+
+	ops.select_pages = NULL;
+	ops.handle_sensitive_pages = NULL;
+	ops.handle_page = NULL;
+
+	kfree(pgbmp);
+	pgbmp = NULL;
+
+	wrprotect.state = STATE_UNINIT;
+}
diff --git a/kernel/livedump.c b/kernel/livedump.c
index 899d572..99b49b9 100644
--- a/kernel/livedump.c
+++ b/kernel/livedump.c
@@ -28,6 +28,29 @@
 
 #define LIVEDUMP_IOC(x)	_IO(0xff, x)
 #define LIVEDUMP_IOC_SPLIT LIVEDUMP_IOC(1)
+#define LIVEDUMP_IOC_START LIVEDUMP_IOC(2)
+#define LIVEDUMP_IOC_SWEEP LIVEDUMP_IOC(3)
+#define LIVEDUMP_IOC_INIT LIVEDUMP_IOC(100)
+#define LIVEDUMP_IOC_UNINIT LIVEDUMP_IOC(101)
+
+static void do_uninit(void)
+{
+	wrprotect_uninit();
+}
+
+static int do_init(void)
+{
+	int ret;
+
+	ret = wrprotect_init(NULL, NULL, NULL);
+	if (WARN(ret, "livedump: Failed to initialize Protection manager.\n"))
+		goto err;
+
+	return 0;
+err:
+	do_uninit();
+	return ret;
+}
 
 static long livedump_ioctl(
 		struct file *file, unsigned int cmd, unsigned long arg)
@@ -35,6 +58,15 @@ static long livedump_ioctl(
 	switch (cmd) {
 	case LIVEDUMP_IOC_SPLIT:
 		return wrprotect_split();
+	case LIVEDUMP_IOC_START:
+		return wrprotect_start();
+	case LIVEDUMP_IOC_SWEEP:
+		return wrprotect_sweep();
+	case LIVEDUMP_IOC_INIT:
+		return do_init();
+	case LIVEDUMP_IOC_UNINIT:
+		do_uninit();
+		return 0;
 	default:
 		return -ENOIOCTLCMD;
 	}
@@ -80,6 +112,7 @@ module_init(livedump_module_init);
 static void livedump_module_exit(void)
 {
 	misc_deregister(&livedump_misc);
+	do_uninit();
 }
 module_exit(livedump_module_exit);
 
diff --git a/tools/livedump/livedump b/tools/livedump/livedump
index c5e34d6..838347d 100755
--- a/tools/livedump/livedump
+++ b/tools/livedump/livedump
@@ -5,6 +5,10 @@ import fcntl
 
 cmds = {
 	'split':0xff01,
+	'start':0xff02,
+	'sweep':0xff03,
+	'init':0xff64,
+	'uninit':0xff65
 	}
 cmd = cmds[sys.argv[1]]
 


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [RFC PATCH 5/5][RESEND] livedump: Add memory dumping functionality
  2011-12-23 13:14 [RFC PATCH 0/5][RESEND] introduce: Live Dump YOSHIDA Masanori
                   ` (3 preceding siblings ...)
  2011-12-23 13:14 ` [RFC PATCH 4/5][RESEND] livedump: Add write protection management YOSHIDA Masanori
@ 2011-12-23 13:14 ` YOSHIDA Masanori
  4 siblings, 0 replies; 6+ messages in thread
From: YOSHIDA Masanori @ 2011-12-23 13:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, hpa, x86, linux-kernel
  Cc: hpa, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Ingo Molnar, KOSAKI Motohiro, Kevin Hilman, Marcelo Tosatti,
	Michal Marek, Rik van Riel, Tejun Heo, Thomas Gleixner,
	YOSHIDA Masanori, Yinghai Lu, linux-kernel, x86, frank.rowand,
	jan.kiszka, yrl.pp-manager.tt, YOSHIDA Masanori, Andrew Morton,
	Michal Marek, Kevin Hilman, Borislav Petkov, linux-kernel

This patch realizes memory dumping of kernel space. The entire dumped
memory image is first staged in memory. To do so, this patch allocates
about 50% of RAM at initialization time.

This patch also adds read/lseek operations to the "livedump" misc device to
provide user land with means to read the dumped data. The standard dump
analysis tool "crash" can analyze the dumped data via these operations.
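
The read path serves each offset of the pseudo physical-memory file from
the saved copy of the containing page; assuming 4K pages (as on x86-64),
the offset arithmetic is just a shift and a mask. A user-space sketch,
not kernel code:

```python
# Sketch of the offset math in livedump_memdump_sys_read(): a file
# offset maps to (pfn, offset-within-page) with a shift and a mask.
PAGE_SHIFT = 12                       # assuming 4K pages, as on x86-64
PAGE_SIZE = 1 << PAGE_SHIFT
PAGE_MASK = ~(PAGE_SIZE - 1)

def locate(pos):
    """Map a dump-file offset to (pfn, offset within that page)."""
    return pos >> PAGE_SHIFT, pos & ~PAGE_MASK

assert locate(0x3007) == (3, 7)       # byte 7 of the page copied for pfn 3
```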

The previous patch made it possible to define hook functions that specify
which pages to write-protect and how to handle pages. This patch defines
the hook functions as follows.

- fn_select_pages:
    Selects all normal RAM pages, which are marked as E820_RAM.

    Also selects the pages at physical addresses from 0 up to
    CONFIG_X86_RESERVE_LOW. This range is usually used by the BIOS,
    but the crash utility also needs this range of memory.

    Pages that contain this patch's own data (e.g. the pages allocated
    to store the dumped image) are not selected, because they are not
    needed for memory dump analysis.
    However, this patch's data structures are not necessarily 4K-aligned,
    so their first and last pages can also contain unrelated data. I call
    such pages "edge pages".
    Edge pages are selected here, but all of them are handled during the
    stop-machine state because they are "sensitive pages".

- fn_handle_page:
    Saves the faulting page into the area allocated above.

- fn_handle_sensitive_pages:
    Handles edge pages as described above.

Signed-off-by: YOSHIDA Masanori <masanori.yoshida.tv@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: linux-kernel@vger.kernel.org
---

 kernel/Makefile           |    2 
 kernel/livedump-memdump.c |  227 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/livedump-memdump.h |   45 +++++++++
 kernel/livedump.c         |   13 ++-
 4 files changed, 285 insertions(+), 2 deletions(-)
 create mode 100644 kernel/livedump-memdump.c
 create mode 100644 kernel/livedump-memdump.h

diff --git a/kernel/Makefile b/kernel/Makefile
index 7d858e4..72efb90 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -109,7 +109,7 @@ obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
 obj-$(CONFIG_PADATA) += padata.o
 obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
-obj-$(CONFIG_LIVEDUMP) += livedump.o
+obj-$(CONFIG_LIVEDUMP) += livedump.o livedump-memdump.o
 
 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/livedump-memdump.c b/kernel/livedump-memdump.c
new file mode 100644
index 0000000..a283848
--- /dev/null
+++ b/kernel/livedump-memdump.c
@@ -0,0 +1,227 @@
+/* livedump-memdump.c - Live Dump's memory dumping management
+ * Copyright (C) 2011 Hitachi, Ltd.
+ * Author: YOSHIDA Masanori <masanori.yoshida.tv@hitachi.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+ * MA  02110-1301, USA.
+ */
+
+#include "livedump-memdump.h"
+#include <asm/wrprotect.h>
+
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/uaccess.h>
+
+/* memdump's bookkeeping */
+static struct memdump {
+	spinlock_t lock;
+	unsigned long alloced;
+	unsigned long used;
+} memdump;
+
+static void **pages;	/* allocated pages */
+static void **pagemap;	/* mapping from PFN to page */
+
+int livedump_memdump_init(void)
+{
+	int ret;
+	unsigned long i;
+
+	spin_lock_init(&memdump.lock);
+	memdump.alloced = num_physpages / 2 + 1;
+
+	ret = -ENOMEM;
+	pages = vmalloc(sizeof(void *) * memdump.alloced);
+	if (!pages)
+		goto err;
+	for (i = 0; i < memdump.alloced; i++) {
+		pages[i] = (void *)__get_free_page(GFP_KERNEL);
+		if (!pages[i])
+			goto err;
+	}
+
+	ret = -ENOMEM;
+	pagemap = vmalloc(sizeof(void *) * num_physpages);
+	if (!pagemap)
+		goto err;
+	memset(pagemap, 0, sizeof(void *) * num_physpages);
+
+	return 0;
+err:
+	livedump_memdump_uninit();
+	return ret;
+}
+
+void livedump_memdump_uninit(void)
+{
+	if (pagemap) {
+		vfree(pagemap);
+		pagemap = NULL;
+	}
+	if (pages) {
+		unsigned long i;
+		for (i = 0; i < memdump.alloced; i++)
+			if (pages[i])
+				free_page((unsigned long)pages[i]);
+			else
+				break;
+		vfree(pages);
+		pages = NULL;
+	}
+	memdump.used = 0;
+	memdump.alloced = 0;
+	spin_lock_init(&memdump.lock);
+}
+
+/* livedump_memdump_select_pages
+ *
+ * Selects pages to protect.
+ *
+ * The following pages are selected.
+ *  - Pages marked as RAM by E820
+ *  - Pages of low memory used by BIOS (needed for crash to work normally)
+ *
+ * Pages that contain memdump's own data are unselected (removed
+ * from the selection).
+ *
+ * On the other hand, because vmap areas are not write-protected,
+ * we don't have to unselect pagemap.
+ */
+int livedump_memdump_select_pages(unsigned long *pgbmp)
+{
+	unsigned long pfn, i;
+
+	/* Select all RAM pages */
+	for (pfn = 0; pfn < num_physpages; pfn++) {
+		if (e820_any_mapped(pfn << PAGE_SHIFT,
+				    (pfn + 1) << PAGE_SHIFT,
+				    E820_RAM))
+			set_bit(pfn, pgbmp);
+		cond_resched();
+	}
+
+	/* Essential area for executing crash with livedump */
+	bitmap_set(pgbmp, 0, (CONFIG_X86_RESERVE_LOW << 10) >> PAGE_SHIFT);
+
+	/* Unselect memdump's own pages (not needed for vmap areas) */
+	wrprotect_unselect_pages_but_edges(pgbmp,
+			(unsigned long)&memdump, sizeof(memdump));
+	for (i = 0; i < memdump.alloced; i++) {
+		clear_bit(__pa(pages[i]) >> PAGE_SHIFT, pgbmp);
+		cond_resched();
+	}
+
+	return 0;
+}
+
+/* livedump_memdump_handle_sensitive_pages
+ *
+ * Edge pages may contain both memdump's own data and something else.
+ * Such pages must not be unselected in advance; instead they must be
+ * handled during the stop-machine state.
+ *
+ * This hook function is called to do exactly that.
+ */
+void livedump_memdump_handle_sensitive_pages(unsigned long *pgbmp)
+{
+	wrprotect_handle_only_edges(pgbmp, livedump_memdump_handle_page,
+			(unsigned long)&memdump, sizeof(memdump));
+}
+
+void livedump_memdump_handle_page(unsigned long pfn)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&memdump.lock, flags);
+	if (WARN(memdump.used >= memdump.alloced,
+				"livedump: memdump buffer exhausted.\n"))
+		goto out;
+	pagemap[pfn] = pages[memdump.used++];
+	memcpy(pagemap[pfn], pfn_to_kaddr(pfn), PAGE_SIZE);
+out:
+	spin_unlock_irqrestore(&memdump.lock, flags);
+}
+
+static void *memdump_page(unsigned long pfn)
+{
+	void *p = pagemap[pfn];
+	if (p)
+		return p;
+	return empty_zero_page;
+}
+
+loff_t livedump_memdump_sys_llseek(struct file *file, loff_t offset, int origin)
+{
+	loff_t retval;
+
+	switch (origin) {
+	case SEEK_SET:
+		break;
+	case SEEK_END:
+		offset += PFN_PHYS(num_physpages);
+		break;
+	case SEEK_CUR:
+		if (offset == 0) {
+			retval = file->f_pos;
+			goto out;
+		}
+		offset += file->f_pos;
+		break;
+	case SEEK_DATA:
+	case SEEK_HOLE:
+		retval = -ENOSYS;
+		goto out;
+	default:
+		retval = -EINVAL;
+		goto out;
+	}
+	retval = -EINVAL;
+	if (offset >= 0) {
+		if (offset != file->f_pos) {
+			file->f_pos = offset;
+			file->f_version = 0;
+		}
+		retval = offset;
+	}
+out:
+	return retval;
+}
+
+ssize_t livedump_memdump_sys_read(
+		struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+	loff_t pos = *ppos;
+
+	if (pos >= PFN_PHYS(num_physpages))
+		return 0;
+	if (count > PFN_PHYS(num_physpages) - pos)
+		count = PFN_PHYS(num_physpages) - pos;
+
+	while (count) {
+		void *p = memdump_page(pos >> PAGE_SHIFT);
+		unsigned long off = pos & ~PAGE_MASK;
+		unsigned long len = min(count, PAGE_SIZE - off);
+		if (copy_to_user(buf, p + off, len))
+			return -EFAULT;
+		buf += len;
+		pos += len;
+		count -= len;
+	}
+
+	pos -= *ppos;
+	*ppos += pos;
+	return pos;
+}
diff --git a/kernel/livedump-memdump.h b/kernel/livedump-memdump.h
new file mode 100644
index 0000000..e8a5bae
--- /dev/null
+++ b/kernel/livedump-memdump.h
@@ -0,0 +1,45 @@
+/* livedump-memdump.h - Live Dump's memory dumping management
+ * Copyright (C) 2011 Hitachi, Ltd.
+ * Author: YOSHIDA Masanori <masanori.yoshida.tv@hitachi.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+ * MA  02110-1301, USA.
+ */
+
+#ifndef _LIVEDUMP_MEMDUMP_H
+#define _LIVEDUMP_MEMDUMP_H
+
+#include <linux/fs.h>
+
+extern int livedump_memdump_init(void);
+
+extern void livedump_memdump_uninit(void);
+
+extern int livedump_memdump_select_pages(unsigned long *pgbmp);
+
+extern void livedump_memdump_handle_sensitive_pages(unsigned long *pgbmp);
+
+extern void livedump_memdump_handle_page(unsigned long pfn);
+
+extern loff_t livedump_memdump_sys_llseek(
+		struct file *file, loff_t offset, int origin);
+
+extern ssize_t livedump_memdump_sys_read(
+		struct file *file,
+		char __user *buf,
+		size_t len,
+		loff_t *ppos);
+
+#endif /* _LIVEDUMP_MEMDUMP_H */
diff --git a/kernel/livedump.c b/kernel/livedump.c
index 99b49b9..7bef9c8 100644
--- a/kernel/livedump.c
+++ b/kernel/livedump.c
@@ -18,6 +18,7 @@
  * MA  02110-1301, USA.
  */
 
+#include "livedump-memdump.h"
 #include <asm/wrprotect.h>
 
 #include <linux/module.h>
@@ -36,13 +37,21 @@
 static void do_uninit(void)
 {
 	wrprotect_uninit();
+	livedump_memdump_uninit();
 }
 
 static int do_init(void)
 {
 	int ret;
 
-	ret = wrprotect_init(NULL, NULL, NULL);
+	ret = livedump_memdump_init();
+	if (WARN(ret, "livedump: Failed to initialize Dump manager.\n"))
+		goto err;
+
+	ret = wrprotect_init(
+			livedump_memdump_select_pages,
+			livedump_memdump_handle_sensitive_pages,
+			livedump_memdump_handle_page);
 	if (WARN(ret, "livedump: Failed to initialize Protection manager.\n"))
 		goto err;
 
@@ -89,6 +98,8 @@ static const struct file_operations livedump_fops = {
 	.unlocked_ioctl = livedump_ioctl,
 	.open = livedump_open,
 	.release = livedump_release,
+	.read = livedump_memdump_sys_read,
+	.llseek = livedump_memdump_sys_llseek,
 };
 static struct miscdevice livedump_misc = {
 	.minor = MISC_DYNAMIC_MINOR,



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [RFC PATCH 3/5][RESEND] livedump: Add page splitting functionality
  2011-12-23 13:14 [RFC PATCH 0/5][RESEND] introduce: Live Dump YOSHIDA Masanori
  2011-12-23 13:14 ` [RFC PATCH 1/5][RESEND] livedump: Add notifier-call-chain into do_page_fault YOSHIDA Masanori
  2011-12-23 13:14 ` [RFC PATCH 2/5][RESEND] livedump: Add the new misc device "livedump" YOSHIDA Masanori
@ 2011-12-23 13:14 ` YOSHIDA Masanori
  2011-12-23 13:14 ` [RFC PATCH 4/5][RESEND] livedump: Add write protection management YOSHIDA Masanori
  2011-12-23 13:14 ` [RFC PATCH 5/5][RESEND] livedump: Add memory dumping functionality YOSHIDA Masanori
  4 siblings, 0 replies; 6+ messages in thread
From: YOSHIDA Masanori @ 2011-12-23 13:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, hpa, x86, linux-kernel
  Cc: hpa, Andrew Morton, Andy Lutomirski, Borislav Petkov,
	Ingo Molnar, KOSAKI Motohiro, Kevin Hilman, Marcelo Tosatti,
	Michal Marek, Rik van Riel, Tejun Heo, Thomas Gleixner,
	YOSHIDA Masanori, Yinghai Lu, linux-kernel, x86, frank.rowand,
	jan.kiszka, yrl.pp-manager.tt, YOSHIDA Masanori, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, Tejun Heo, Yinghai Lu,
	linux-kernel

This patch adds the function "wrprotect_split", which splits all large
pages in kernel space into 4K pages. It also adds tools/livedump/livedump,
which invokes wrprotect_split via an ioctl.

***ATTENTION PLEASE***
Right now, the livedump module can handle only 4K pages. Before setting up
write protection, it has to split all large pages into 4K pages.
I think this job (page splitting) can be eliminated in the future.

Signed-off-by: YOSHIDA Masanori <masanori.yoshida.tv@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
---

 arch/x86/Kconfig                 |   16 +++++++++++++++-
 arch/x86/include/asm/wrprotect.h |   26 ++++++++++++++++++++++++++
 arch/x86/mm/Makefile             |    2 ++
 arch/x86/mm/wrprotect.c          |   33 +++++++++++++++++++++++++++++++++
 kernel/livedump.c                |    5 +++++
 tools/livedump/livedump          |   13 +++++++++++++
 6 files changed, 94 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/include/asm/wrprotect.h
 create mode 100644 arch/x86/mm/wrprotect.c
 create mode 100755 tools/livedump/livedump

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 32bc16d..9e6b53a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1702,9 +1702,23 @@ config CMDLINE_OVERRIDE
 	  This is used to work around broken boot loaders.  This should
 	  be set to 'N' under normal conditions.
 
+config WRPROTECT
+	bool "Write protection on kernel space"
+	depends on X86_64
+	---help---
+	  Set this option to 'Y' to allow the kernel to write-protect
+	  its own memory space and to handle the page faults caused by
+	  the write protection.
+
+	  This feature adds a small constant overhead to the kernel.
+	  While write protection is active, the overhead becomes much
+	  larger.
+
+	  If in doubt, say N.
+
 config LIVEDUMP
 	bool "Live Dump support"
-	depends on X86_64
+	depends on WRPROTECT
 	---help---
+	  Set this option to 'Y' to allow the kernel to acquire a
+	  consistent snapshot of kernel space without stopping the system.
diff --git a/arch/x86/include/asm/wrprotect.h b/arch/x86/include/asm/wrprotect.h
new file mode 100644
index 0000000..40489e0
--- /dev/null
+++ b/arch/x86/include/asm/wrprotect.h
@@ -0,0 +1,26 @@
+/* wrprotect.h - Kernel space write protection support
+ * Copyright (C) 2011 Hitachi, Ltd.
+ * Author: YOSHIDA Masanori <masanori.yoshida.tv@hitachi.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+ * MA  02110-1301, USA.
+ */
+
+#ifndef _WRPROTECT_H
+#define _WRPROTECT_H
+
+extern int wrprotect_split(void);
+
+#endif /* _WRPROTECT_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 3d11327..781a368 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -30,3 +30,5 @@ obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
 obj-$(CONFIG_HAVE_MEMBLOCK)		+= memblock.o
 
 obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_WRPROTECT)		+= wrprotect.o
diff --git a/arch/x86/mm/wrprotect.c b/arch/x86/mm/wrprotect.c
new file mode 100644
index 0000000..2a69735
--- /dev/null
+++ b/arch/x86/mm/wrprotect.c
@@ -0,0 +1,33 @@
+/* wrprotect.c - Kernel space write protection support
+ * Copyright (C) 2011 Hitachi, Ltd.
+ * Author: YOSHIDA Masanori <masanori.yoshida.tv@hitachi.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+ * MA  02110-1301, USA.
+ */
+
+#include <asm/wrprotect.h>
+#include <linux/highmem.h>
+
+int wrprotect_split(void)
+{
+	unsigned long pfn;
+	for (pfn = 0; pfn < num_physpages; pfn++) {
+		int ret = set_memory_4k((unsigned long)pfn_to_kaddr(pfn), 1);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
diff --git a/kernel/livedump.c b/kernel/livedump.c
index 198bda5..899d572 100644
--- a/kernel/livedump.c
+++ b/kernel/livedump.c
@@ -18,6 +18,8 @@
  * MA  02110-1301, USA.
  */
 
+#include <asm/wrprotect.h>
+
 #include <linux/module.h>
 #include <linux/fs.h>
 #include <linux/miscdevice.h>
@@ -25,11 +27,14 @@
 #define DEVICE_NAME	"livedump"
 
 #define LIVEDUMP_IOC(x)	_IO(0xff, x)
+#define LIVEDUMP_IOC_SPLIT LIVEDUMP_IOC(1)
 
 static long livedump_ioctl(
 		struct file *file, unsigned int cmd, unsigned long arg)
 {
 	switch (cmd) {
+	case LIVEDUMP_IOC_SPLIT:
+		return wrprotect_split();
 	default:
 		return -ENOIOCTLCMD;
 	}
diff --git a/tools/livedump/livedump b/tools/livedump/livedump
new file mode 100755
index 0000000..c5e34d6
--- /dev/null
+++ b/tools/livedump/livedump
@@ -0,0 +1,13 @@
+#!/usr/bin/python
+
+import sys
+import fcntl
+
+cmds = {
+	'split':0xff01,
+	}
+cmd = cmds[sys.argv[1]]
+
+f = open('/dev/livedump')
+fcntl.ioctl(f, cmd)
+f.close()


^ permalink raw reply related	[flat|nested] 6+ messages in thread

