[Qemu-devel] Release of COREMU, a scalable and portable full-system emulator

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] Release of COREMU, a scalable and portable full-system emulator
@ 2010-07-17 10:27 Chen Yufei
  2010-07-20 21:43 ` Blue Swirl
  0 siblings, 1 reply; 20+ messages in thread
From: Chen Yufei @ 2010-07-17 10:27 UTC (permalink / raw)
  To: qemu-devel

We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on Qemu. (Simply speaking, we made Qemu parallel.)

The project web page is located at:
http://ppi.fudan.edu.cn/coremu

You can also download the source code, images for playing on sourceforge
http://sf.net/p/coremu

COREMU is composed of
1. a parallel emulation library
2. a set of patches to qemu
(We worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2)

It currently supports full-system emulation of x64 and ARM MPcore platforms.

By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine).

Enjoy,

The COREMU Team, Parallel Processing Institute, Fudan University

http://ppi.fudan.edu.cn/system_research_group

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] Release of COREMU, a scalable and portable full-system emulator
  2010-07-17 10:27 [Qemu-devel] Release of COREMU, a scalable and portable full-system emulator Chen Yufei
@ 2010-07-20 21:43 ` Blue Swirl
  2010-07-21  7:03   ` Chen Yufei
  0 siblings, 1 reply; 20+ messages in thread
From: Blue Swirl @ 2010-07-20 21:43 UTC (permalink / raw)
  To: Chen Yufei; +Cc: qemu-devel

On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei <cyfdecyf@gmail.com> wrote:
> We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on Qemu. (Simply speaking, we made Qemu parallel.)
>
> The project web page is located at:
> http://ppi.fudan.edu.cn/coremu
>
> You can also download the source code, images for playing on sourceforge
> http://sf.net/p/coremu
>
> COREMU is composed of
> 1. a parallel emulation library
> 2. a set of patches to qemu
> (We worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2)
>
> It currently supports full-system emulation of x64 and ARM MPcore platforms.
>
> By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine).
>
> Enjoy,

Nice work. Do you plan to submit the improvements back to upstream QEMU?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] Release of COREMU, a scalable and portable full-system emulator
  2010-07-20 21:43 ` Blue Swirl
@ 2010-07-21  7:03   ` Chen Yufei
  2010-07-21 17:04     ` Stefan Weil
  0 siblings, 1 reply; 20+ messages in thread
From: Chen Yufei @ 2010-07-21  7:03 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel


On 2010-7-21, at 上午5:43, Blue Swirl wrote:

> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei <cyfdecyf@gmail.com> wrote:
>> We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on Qemu. (Simply speaking, we made Qemu parallel.)
>> 
>> The project web page is located at:
>> http://ppi.fudan.edu.cn/coremu
>> 
>> You can also download the source code, images for playing on sourceforge
>> http://sf.net/p/coremu
>> 
>> COREMU is composed of
>> 1. a parallel emulation library
>> 2. a set of patches to qemu
>> (We worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2)
>> 
>> It currently supports full-system emulation of x64 and ARM MPcore platforms.
>> 
>> By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine).
>> 
>> Enjoy,
> 
> Nice work. Do you plan to submit the improvements back to upstream QEMU?

It would be great if we can submit our code to QEMU, but we do not know the process.
Would you please give us some instructions?

--
Best regards,
Chen Yufei

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] Release of COREMU, a scalable and portable full-system emulator
  2010-07-21  7:03   ` Chen Yufei
@ 2010-07-21 17:04     ` Stefan Weil
  2010-07-22  8:48       ` Chen Yufei
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Weil @ 2010-07-21 17:04 UTC (permalink / raw)
  To: Chen Yufei; +Cc: qemu-devel

Am 21.07.2010 09:03, schrieb Chen Yufei:
> On 2010-7-21, at 上午5:43, Blue Swirl wrote:
>
>    
>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei<cyfdecyf@gmail.com>  wrote:
>>      
>>> We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on Qemu. (Simply speaking, we made Qemu parallel.)
>>>
>>> The project web page is located at:
>>> http://ppi.fudan.edu.cn/coremu
>>>
>>> You can also download the source code, images for playing on sourceforge
>>> http://sf.net/p/coremu
>>>
>>> COREMU is composed of
>>> 1. a parallel emulation library
>>> 2. a set of patches to qemu
>>> (We worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2)
>>>
>>> It currently supports full-system emulation of x64 and ARM MPcore platforms.
>>>
>>> By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine).
>>>
>>> Enjoy,
>>>        
>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>      
> It would be great if we can submit our code to QEMU, but we do not know the process.
> Would you please give us some instructions?
>
> --
> Best regards,
> Chen Yufei
>    

Some hints can be found here:
http://wiki.qemu.org/Contribute/StartHere

Kind regards,
Stefan Weil

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] Release of COREMU, a scalable and portable full-system emulator
  2010-07-21 17:04     ` Stefan Weil
@ 2010-07-22  8:48       ` Chen Yufei
  2010-07-22 11:05         ` [Qemu-devel] " Jan Kiszka
  2010-07-22 12:18         ` [Qemu-devel] " Stefan Hajnoczi
  0 siblings, 2 replies; 20+ messages in thread
From: Chen Yufei @ 2010-07-22  8:48 UTC (permalink / raw)
  To: Stefan Weil; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2704 bytes --]

On 2010-7-22, at 上午1:04, Stefan Weil wrote:

> Am 21.07.2010 09:03, schrieb Chen Yufei:
>> On 2010-7-21, at 上午5:43, Blue Swirl wrote:
>> 
>>   
>>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei<cyfdecyf@gmail.com>  wrote:
>>>     
>>>> We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on Qemu. (Simply speaking, we made Qemu parallel.)
>>>> 
>>>> The project web page is located at:
>>>> http://ppi.fudan.edu.cn/coremu
>>>> 
>>>> You can also download the source code, images for playing on sourceforge
>>>> http://sf.net/p/coremu
>>>> 
>>>> COREMU is composed of
>>>> 1. a parallel emulation library
>>>> 2. a set of patches to qemu
>>>> (We worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2)
>>>> 
>>>> It currently supports full-system emulation of x64 and ARM MPcore platforms.
>>>> 
>>>> By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine).
>>>> 
>>>> Enjoy,
>>>>       
>>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>>     
>> It would be great if we can submit our code to QEMU, but we do not know the process.
>> Would you please give us some instructions?
>> 
>> --
>> Best regards,
>> Chen Yufei
>>   
> 
> Some hints can be found here:
> http://wiki.qemu.org/Contribute/StartHere
> 
> Kind regards,
> Stefan Weil

The patch is in the attachment, produced with command
git diff 54d7cf136f040713095cbc064f62d753bff6f9d2

In order to separate what need to be done to make QEMU parallel, we created a separate library, and the patched QEMU need to be compiled and linked with that library. To submit our enhancement to QEMU, maybe we need to incorporate this library into QEMU. I don't know what would be the best solution.

Our approach to make QEMU parallel can be found at http://ppi.fudan.edu.cn/coremu

I will give a short summary here:

1. Each emulated core thread runs a separate binary translator engine and has private code cache. We marked some variables in TCG as thread local. We also modified the TB invalidation mechanism.

2. Each core has a queue holding pending interrupts. The COREMU library provides this queue, and interrupt notification is done by sending realtime signals to the emulated core thread.

3. Atomic instruction emulation has to be modified for parallel emulation. We use lightweight memory transaction which requires only compare-and-swap instruction to emulate atomic instruction.

4. Some code in the original QEMU may cause data race bug after we make it parallel. We fixed these problems.


[-- Attachment #2: patch-to-54d7cf136f040713095cbc064f62d753bff6f9d2 --]
[-- Type: application/octet-stream, Size: 162739 bytes --]

diff --git a/Makefile b/Makefile
index eb9e02b..62419ec 100644
--- a/Makefile
+++ b/Makefile
@@ -135,11 +135,12 @@ iov.o: iov.c iov.h
 qemu-img.o: qemu-img-cmds.h
 qemu-img.o qemu-tool.o qemu-nbd.o qemu-io.o: $(GENERATED_HEADERS)
 
-qemu-img$(EXESUF): qemu-img.o qemu-tool.o qemu-error.o $(block-obj-y) $(qobject-obj-y)
+include $(SRC_PATH)/coremu.mk
+qemu-img$(EXESUF): qemu-img.o qemu-tool.o qemu-error.o $(block-obj-y) $(qobject-obj-y) $(COREMU_LIB)
 
-qemu-nbd$(EXESUF): qemu-nbd.o qemu-tool.o qemu-error.o $(block-obj-y) $(qobject-obj-y)
+qemu-nbd$(EXESUF): qemu-nbd.o qemu-tool.o qemu-error.o $(block-obj-y) $(qobject-obj-y) $(COREMU_LIB)
 
-qemu-io$(EXESUF): qemu-io.o cmd.o qemu-tool.o qemu-error.o $(block-obj-y) $(qobject-obj-y)
+qemu-io$(EXESUF): qemu-io.o cmd.o qemu-tool.o qemu-error.o $(block-obj-y) $(qobject-obj-y) $(COREMU_LIB)
 
 qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx
 	$(call quiet-command,sh $(SRC_PATH)/hxtool -h < $< > $@,"  GEN   $@")
diff --git a/Makefile.target b/Makefile.target
index c092900..aec7f12 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -58,6 +58,9 @@ libobj-$(TARGET_ARM) += neon_helper.o iwmmxt_helper.o
 
 libobj-y += disas.o
 
+# coremu related object, we may need to split this later.
+libobj-y += cm-loop.o cm-intr.o cm-target-intr.o
+
 $(libobj-y): $(GENERATED_HEADERS)
 
 # libqemu
@@ -300,8 +303,11 @@ endif # CONFIG_SOFTMMU
 
 obj-$(CONFIG_GDBSTUB_XML) += gdbstub-xml.o
 
-$(QEMU_PROG): $(obj-y) $(obj-$(TARGET_BASE_ARCH)-y)
-	$(call LINK,$(obj-y) $(obj-$(TARGET_BASE_ARCH)-y))
+# COREMU_LIB is defined in coremu.mk
+include $(SRC_PATH)/coremu.mk
+
+$(QEMU_PROG): $(obj-y) $(obj-$(TARGET_BASE_ARCH)-y) $(COREMU_LIB)
+	$(call LINK,$(obj-y) $(obj-$(TARGET_BASE_ARCH)-y) -ltopology $(COREMU_LIB))
 
 
 gdbstub-xml.c: $(TARGET_XML_FILES) $(SRC_PATH)/feature_to_c.sh
diff --git a/cm-init.c b/cm-init.c
new file mode 100644
index 0000000..4dd451a
--- /dev/null
+++ b/cm-init.c
@@ -0,0 +1,131 @@
+/*
+ * COREMU Parallel Emulator Framework
+ *
+ * Initialization stuff for qemu.
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* We include this file in exec.c */
+
+#include <sys/types.h>
+#include <sys/mman.h>
+
+#define VERBOSE_COREMU
+#include "sysemu.h"
+#include "coremu-sched.h"
+#include "coremu-debug.h"
+#include "coremu-init.h"
+#include "cm-timer.h"
+#include "cm-init.h"
+
+/* XXX How to clean up the following code? */
+
+/* Since each core uses it's own code buffer, we set a large value here. */
+#undef DEFAULT_CODE_GEN_BUFFER_SIZE
+#define DEFAULT_CODE_GEN_BUFFER_SIZE (800 * 1024 * 1024)
+
+static uint64_t cm_bufsize = 0;
+static void *cm_bufbase = NULL;
+#define min(a, b) ((a) < (b) ? (a) : (b))
+
+/* Prepare a large code cache for each CORE to allocate later */
+static void cm_code_gen_alloc_all(void)
+{
+    int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT;
+
+    /*cm_bufsize = (min(DEFAULT_CODE_GEN_BUFFER_SIZE, phys_ram_size));*/
+    /* XXX what if this is larger than physical ram size? */
+    cm_bufsize = DEFAULT_CODE_GEN_BUFFER_SIZE;
+    cm_bufbase = mmap(NULL, cm_bufsize, PROT_WRITE | PROT_READ | PROT_EXEC,
+                      flags, -1, 0);
+
+    if (cm_bufbase == MAP_FAILED) {
+        cm_assert(0, "mmap failed\n");
+    }
+
+    code_gen_buffer_size = (unsigned long)(cm_bufsize / (smp_cpus));
+    cm_assert(code_gen_buffer_size >= MIN_CODE_GEN_BUFFER_SIZE,
+              "code buffer size too small");
+
+    code_gen_buffer_max_size = code_gen_buffer_size - code_gen_max_block_size();
+    code_gen_max_blocks = code_gen_buffer_size / CODE_GEN_AVG_BLOCK_SIZE;
+}
+
+/* From the allocated memory in code_gen_alloc_all, we allocate memory for each
+ * core. */
+static void cm_code_gen_alloc(void)
+{
+    /* We use cpu_index here, note that this maybe not the same as architecture
+     * dependent cpu id. eg. cpuid_apic_id. */
+    code_gen_buffer = cm_bufbase + (code_gen_buffer_size *
+                                    cpu_single_env->cpu_index);
+
+    /* Allocate space for TBs. */
+    tbs = qemu_malloc(code_gen_max_blocks * sizeof(TranslationBlock));
+
+   /* cm_print("CORE[%u] TC [%lu MB] at %p", cpu_single_env->cpu_index,
+             (code_gen_buffer_size) / (1024 * 1024), code_gen_buffer); */
+}
+
+/* For coremu, code generator related initialization should be called by all
+ * core thread. While other stuff only need to be done in the hardware
+ * thread. */
+void cm_cpu_exec_init(void)
+{
+    page_init();
+    io_mem_init();
+
+    /* Allocate code cache. */
+    cm_code_gen_alloc_all();
+
+    /* Code prologue initialization. */
+    cm_code_prologue_init();
+    map_exec(code_gen_prologue, sizeof(code_gen_prologue));
+}
+
+void cm_cpu_exec_init_core(void)
+{
+    cpu_gen_init();
+    /* Get code cache. */
+    cm_code_gen_alloc();
+    code_gen_ptr = code_gen_buffer;
+
+#if defined(TARGET_I386)
+    optimize_flags_init();
+#elif defined(TARGET_ARM)
+    arm_translate_init();
+#endif
+    /* Setup the scheduling for core thread */
+    coremu_init_sched_core();
+
+    /* Set up ticks mechanism for every core. */
+    cpu_enable_ticks();
+
+    /* Create per core timer. */
+    if (cm_init_local_timer_alarm() < 0) {
+        cm_assert(0, "local alarm initialize failed");
+    }
+
+    /* Wait other core to finish initialization. */
+    coremu_wait_init();
+}
diff --git a/cm-init.h b/cm-init.h
new file mode 100644
index 0000000..e0e1632
--- /dev/null
+++ b/cm-init.h
@@ -0,0 +1,37 @@
+/*
+ * COREMU Parallel Emulator Framework
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _CM_INIT_H
+#define _CM_INIT_H
+
+/* page_init, io_mem_init, etc. Called by hardware thread. */
+void cm_cpu_exec_init(void);
+/* Allocate code buffer for each core. Called by each core. */
+void cm_cpu_exec_init_core(void);
+
+/* This function is defined in tcg/tcg.c */
+void cm_code_prologue_init(void);
+
+#endif /* _CM_INIT_H */
diff --git a/cm-intr.c b/cm-intr.c
new file mode 100644
index 0000000..bcefe36
--- /dev/null
+++ b/cm-intr.c
@@ -0,0 +1,53 @@
+/*
+ * COREMU Parallel Emulator Framework
+ *
+ * The common interface for hardware interrupt sending and handling.
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include <stdlib.h>
+#include <stdio.h>
+#include "cpu.h"
+
+#include "coremu-intr.h"
+#include "coremu-core.h"
+#include "coremu-malloc.h"
+#include "cm-intr.h"
+
+/* The common interface to handle the interrupt, this function should to
+   be registered to coremu */
+void cm_common_intr_handler(CMIntr *intr)
+{
+    coremu_assert_core_thr();
+    if (!intr)
+        return;
+    intr->handler(intr);
+    coremu_free(intr);
+}
+
+/* To notify there is an event coming, what qemu need to do is
+   just exit current cpu loop */
+void cm_notify_event(void)
+{
+    if (cpu_single_env)
+        cpu_exit(cpu_single_env);
+}
diff --git a/cm-intr.h b/cm-intr.h
new file mode 100644
index 0000000..697b18f
--- /dev/null
+++ b/cm-intr.h
@@ -0,0 +1,40 @@
+/*
+ * COREMU Parallel Emulator Framework
+ * Defines qemu related structure and interface for hardware interrupt
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef CM_INTR_H
+#define CM_INTR_H
+
+/* This is the call back function used to handle different type interrupts */
+typedef void (*CMIntr_handler)(void *opaque);
+
+/* Base type for all types of interrupt. Subtype of CMIntr should have an
+ * object of this struct as its first member. */
+typedef struct CMIntr {
+    CMIntr_handler handler;
+} CMIntr;
+
+void cm_common_intr_handler(CMIntr *opaque);
+void cm_notify_event(void);
+#endif
diff --git a/cm-loop.c b/cm-loop.c
new file mode 100644
index 0000000..5d33743
--- /dev/null
+++ b/cm-loop.c
@@ -0,0 +1,95 @@
+/*
+ * COREMU Parallel Emulator Framework
+ * The definition of core thread function
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include "cpu.h"
+#include "cpus.h"
+
+#include "coremu-intr.h"
+#include "coremu-debug.h"
+#include "coremu-sched.h"
+#include "coremu-types.h"
+#include "cm-loop.h"
+#include "cm-timer.h"
+#include "cm-init.h"
+
+static bool cm_tcg_cpu_exec(void);
+static bool cm_tcg_cpu_exec(void)
+{
+    int ret = 0;
+    CPUState *env = cpu_single_env;
+    struct timespec halt_interval;
+    halt_interval.tv_sec = 0;
+    halt_interval.tv_nsec = 10000;
+
+    for (;;) {
+        if (cm_local_alarm_pending())
+            cm_run_all_local_timers();
+
+        coremu_receive_intr();
+        if (cm_cpu_can_run(env))
+            ret = cpu_exec(env);
+        else if (env->stop)
+            break;
+
+        if (!cm_vm_can_run())
+            break;
+
+        if (ret == EXCP_DEBUG) {
+            cm_assert(0, "debug support hasn't been finished\n");
+            break;
+        }
+        if (ret == EXCP_HALTED || ret == EXCP_HLT) {
+            coremu_cpu_sched(CM_EVENT_HALTED);
+        }
+    }
+    return ret;
+}
+
+void *cm_cpu_loop(void *args)
+{
+    int ret;
+
+    /* Must initialize cpu_single_env before initializing core thread. */
+    assert(args);
+    cpu_single_env = (CPUState *)args;
+
+    /* Setup dynamic translator */
+    cm_cpu_exec_init_core();
+
+    for (;;) {
+        ret = cm_tcg_cpu_exec();
+        if (test_reset_request()) {
+            coremu_pause_core();
+            continue;
+        }
+        break;
+    }
+    cm_stop_local_timer();
+    coremu_core_exit(NULL);
+    assert(0);
+}
diff --git a/cm-loop.h b/cm-loop.h
new file mode 100644
index 0000000..d03d13d
--- /dev/null
+++ b/cm-loop.h
@@ -0,0 +1,37 @@
+/*
+ * COREMU Parallel Emulator Framework
+ * The definition of core thread function
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef CM_LOOP_H
+#define CM_LOOP_H
+
+/*#include "cpu.h"*/
+void *cm_cpu_loop(void *args);
+
+/* The wrapper for static function of qemu */
+/*int cm_cpu_can_run(struct CPUState * env);*/
+int cm_vm_can_run(void);
+
+#endif
+
diff --git a/cm-tbinval.c b/cm-tbinval.c
new file mode 100644
index 0000000..a410403
--- /dev/null
+++ b/cm-tbinval.c
@@ -0,0 +1,199 @@
+/*
+ * COREMU Parallel Emulator Framework
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include <assert.h>
+#include "coremu-malloc.h"
+#include "coremu-atomic.h"
+#include "coremu-hw.h"
+
+static uint16_t *cm_phys_tb_cnt;
+
+extern void cm_inject_invalidate_code(TranslationBlock *tb);
+static int cm_invalidate_other(int cpu_id, target_phys_addr_t start, int len);
+
+void cm_init_tb_cnt(ram_addr_t ram_offset, ram_addr_t size)
+{
+    coremu_assert_hw_thr("cm_init_bt_cnt should only called by hw thr");
+
+    cm_phys_tb_cnt = coremu_realloc(cm_phys_tb_cnt,
+                                    ((ram_offset +
+                                      size) >> TARGET_PAGE_BITS) *
+                                    sizeof(uint16_t));
+    memset(cm_phys_tb_cnt + (ram_offset >> TARGET_PAGE_BITS), 0x0,
+           (size >> TARGET_PAGE_BITS) * sizeof(uint16_t));
+}
+
+void cm_phys_add_tb(ram_addr_t addr)
+{
+    atomic_incw(&cm_phys_tb_cnt[addr >> TARGET_PAGE_BITS]);
+}
+
+void cm_phys_del_tb(ram_addr_t addr)
+{
+    assert(cm_phys_tb_cnt[addr >> TARGET_PAGE_BITS]);
+    atomic_decw(&cm_phys_tb_cnt[addr >> TARGET_PAGE_BITS]);
+}
+
+uint16_t cm_phys_page_tb_p(ram_addr_t addr)
+{
+    return cm_phys_tb_cnt[addr >> TARGET_PAGE_BITS];
+}
+
+void cm_invalidate_bitmap(CMPageDesc *p)
+{
+    /* Get the bitmap lock */
+    coremu_spin_lock(&p->bitmap_lock);
+
+    if (p->code_bitmap) {
+        coremu_free(p->code_bitmap);
+        p->code_bitmap = NULL;
+    }
+
+    /* Unlock the bitmap lock */
+    coremu_spin_unlock(&p->bitmap_lock);
+
+}
+
+void cm_invalidate_tb(target_phys_addr_t start, int len)
+{
+    int count = tb_phys_invalidate_count;
+    if (!coremu_hw_thr_p()) {
+        tb_invalidate_phys_page_fast(start, len);
+        count = tb_phys_invalidate_count - count;
+    }
+
+    if ((!cm_phys_page_tb_p(start)) || (cm_phys_page_tb_p(start) == count))
+        goto done;
+
+#ifdef COREMU_CMC_SUPPORT
+    /* XXX: not finish need Lazy invalidate here! */
+    int have_done = count;
+    int cpu_idx = 0;
+    for (cpu_idx = 0; cpu_idx < coremu_get_targetcpu(); cpu_idx++) {
+        if ((!coremu_hw_thr_p()) && cpu_idx == cpu_single_env->cpuid_apic_id)
+            continue;
+        have_done += cm_invalidate_other(cpu_idx, start, len);
+        if (have_done > cm_phys_page_tb_p(start))
+            break;
+    }
+#endif
+
+done:
+    return;
+}
+
+void cm_tlb_reset_dirty_range(CPUTLBEntry *tlb_entry,
+                              unsigned long start, unsigned long length)
+{
+    unsigned long addr, old, addend;
+    old = tlb_entry->addr_write;
+    addend = tlb_entry->addend;
+
+    if ((old & ~TARGET_PAGE_MASK) == IO_MEM_RAM) {
+        addr = (old & TARGET_PAGE_MASK) + addend;
+        if ((addr - start) < length) {
+            uint64_t newv = (tlb_entry->addr_write & TARGET_PAGE_MASK) |
+                TLB_NOTDIRTY;
+            atomic_compare_exchangeq(&tlb_entry->addr_write, old, newv);
+        }
+    }
+}
+
+/* Try to Lazy invalidate the TB of CPU[cpu_id]
+ * return 1: successful find and invalidate TB of CPU[cpu_id]
+ *        0: dosn't exist
+ */
+static int cm_lazy_invalidate_tb(TranslationBlock *tbs,
+                                 target_phys_addr_t start, int len)
+{
+    int n, ret = 0;
+    TranslationBlock *tb_next;
+    TranslationBlock *tb = tbs;
+
+    target_phys_addr_t end = start + len;
+    target_ulong tb_start, tb_end;
+
+    while (tb != NULL) {
+        n = (long)tb & 3;
+        tb = (TranslationBlock *)((long)tb & ~3);
+        tb_next = tb->page_next[n];
+        /* NOTE: this is subtle as a TB may span two physical pages */
+        if (n == 0) {
+            /* NOTE: tb_end may be after the end of the page, but
+               it is not a problem */
+            tb_start = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
+            tb_end = tb_start + tb->size;
+        } else {
+            tb_start = tb->page_addr[1];
+            tb_end = tb_start + ((tb->pc + tb->size) & ~TARGET_PAGE_MASK);
+        }
+        if (!(tb_end <= start || tb_start >= end)) {
+
+            /* change the code cache of the tb */
+            cm_inject_invalidate_code(tb);
+            ret = 1;
+        }
+        tb = tb_next;
+    }
+
+    return ret;
+}
+
+
+/* Try to invalidate the TB of CPU[cpu_id]
+ * return 1: successful find and invalidate TB of CPU[cpu_id]
+ *        0: dosn't exist
+ */
+static int cm_invalidate_other(int cpu_id, target_phys_addr_t start, int len)
+{
+    /* Find if exit any TB intersect with start -- start+len */
+    PageDesc *p = page_find(start >> TARGET_PAGE_BITS);
+    if (!p)
+        return 0;
+
+    int offset, b;
+    uint8_t *bit_map;
+    int need_invalidate = 1;
+    int ret = 0;
+
+    coremu_spin_lock(&p->cpu_tbs[cpu_id].bitmap_lock);
+    bit_map = p->cpu_tbs[cpu_id].code_bitmap;
+    if (bit_map) {
+        offset = start & ~TARGET_PAGE_MASK;
+        b = bit_map[offset >> 3] >> (offset & 7);
+        if (!(b & ((1 << len) - 1)))
+            need_invalidate = 0;
+    }
+    coremu_spin_unlock(&p->cpu_tbs[cpu_id].bitmap_lock);
+
+    if (need_invalidate) {
+        coremu_spin_lock(&p->cpu_tbs[cpu_id].tb_list_lock);
+        //find the code ptr
+        ret = cm_lazy_invalidate_tb(p->cpu_tbs[cpu_id].first_tb, start, len);
+        coremu_spin_unlock(&p->cpu_tbs[cpu_id].tb_list_lock);
+        //change the code here
+    }
+
+    return ret;
+}
diff --git a/cm-tbinval.h b/cm-tbinval.h
new file mode 100644
index 0000000..9a1879a
--- /dev/null
+++ b/cm-tbinval.h
@@ -0,0 +1,51 @@
+/*
+ * COREMU Parallel Emulator Framework
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _CM_TBINVAL_H
+#define _CM_TBINVAL_H
+
+typedef struct {
+    /* List of TBs of this cpu intersecting this ram page */
+    TranslationBlock *first_tb;
+    /* This lock is used to guarantee the other read and self modify conflict */
+    CMSpinLock tb_list_lock;
+
+    /* Use a bitmap to optimize the self modify code */
+    uint8_t *code_bitmap;
+    CMSpinLock bitmap_lock;
+} CMPageDesc;
+
+void cm_init_tb_cnt(ram_addr_t ram_offset, ram_addr_t size);
+void cm_phys_add_tb(ram_addr_t addr);
+void cm_phys_del_tb(ram_addr_t addr);
+uint16_t cm_phys_page_tb_p(ram_addr_t addr);
+
+void cm_invalidate_bitmap(CMPageDesc *p);
+void cm_invalidate_tb(target_phys_addr_t start, int len);
+
+void cm_tlb_reset_dirty_range(CPUTLBEntry *tlb_entry, unsigned long start,
+                              unsigned long length);
+
+#endif
diff --git a/cm-timer.c b/cm-timer.c
new file mode 100644
index 0000000..3c6a0a1
--- /dev/null
+++ b/cm-timer.c
@@ -0,0 +1,265 @@
+/*
+ * COREMU Parallel Emulator Framework
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *  Xi Wu           <wuxi@fudan.edu.cn>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* We include this file in qemu-timer.c qemu_alarm_timer is defined in it, and
+ * there's lots of static function there. */
+#include "coremu-sched.h"
+#include <math.h>
+int cm_pit_freq;
+
+static int64_t cm_local_next_deadline(void);
+static uint64_t cm_local_next_deadline_dyntick(void);
+static void cm_local_dynticks_rearm_timer(struct qemu_alarm_timer *t);
+static void cm_qemu_run_local_timers(QEMUClock *clock);
+
+COREMU_THREAD QEMUTimer *cm_local_active_timers;
+COREMU_THREAD struct qemu_alarm_timer *cm_local_alarm_timer;
+static COREMU_THREAD struct qemu_alarm_timer cm_local_alarm_timers[] = {
+    {"dynticks", dynticks_start_timer,
+     dynticks_stop_timer, cm_local_dynticks_rearm_timer, NULL},
+    {NULL,}
+};
+
+void cm_init_pit_freq(void)
+{
+    double v_num = coremu_get_targetcpu();
+    double p_num = coremu_get_hostcpu();
+    double p_root = sqrt(p_num) / 4;
+    double suggest = p_root * pow(v_num / p_num, p_root);
+    int pit_freq_suggest = ceil(suggest);
+    cm_pit_freq = 1193182 / pit_freq_suggest;
+}
+
+/* Called by each core thread to create a local timer. */
+int cm_init_local_timer_alarm(void)
+{
+    coremu_assert_core_thr();
+    /* core thr block the Timer Alarm signal */
+    struct qemu_alarm_timer *t = NULL;
+    int i, err = -1;
+
+    for (i = 0; cm_local_alarm_timers[i].name; i++) {
+        t = &cm_local_alarm_timers[i];
+        if (!t)
+            return 0;
+        err = t->start(t);
+        if (!err)
+            break;
+    }
+
+    if (err) {
+        err = -ENOENT;
+        goto fail;
+    }
+
+    /* first event is at time 0 */
+    t->pending = 1;
+    cm_local_alarm_timer = t;
+
+    return 0;
+
+fail:
+    return err;
+}
+
+/* Mode the local virtual timer for core.
+   Because for x86_64, there is only one timer for every core
+   so there is no need to do the link list.
+*/
+void cm_mod_local_timer(QEMUTimer *ts, int64_t expire_time)
+{
+    QEMUTimer **pt, *t;
+
+    cm_del_local_timer(ts);
+
+    /* add the timer in the sorted list */
+    /* NOTE: this code must be signal safe because
+       qemu_timer_expired() can be called from a signal. */
+    pt = &cm_local_active_timers;
+    for (;;) {
+        t = *pt;
+        if (!t)
+            break;
+        if (t->expire_time > expire_time)
+            break;
+        pt = &t->next;
+    }
+    ts->expire_time = expire_time;
+    ts->next = *pt;
+    *pt = ts;
+
+    /* Rearm if necessary  */
+    if (pt == &cm_local_active_timers) {
+        if (!cm_local_alarm_timer->pending) {
+            qemu_rearm_alarm_timer(cm_local_alarm_timer);
+        }
+    }
+}
+
+void cm_del_local_timer(QEMUTimer *ts)
+{
+    QEMUTimer **pt, *t;
+
+    /* NOTE: this code must be signal safe because
+       qemu_timer_expired() can be called from a signal. */
+    pt = &cm_local_active_timers;
+    for (;;) {
+        t = *pt;
+        if (!t)
+            break;
+        if (t == ts) {
+            *pt = t->next;
+            break;
+        }
+        pt = &t->next;
+    }
+}
+
+int cm_local_alarm_pending(void)
+{
+    return cm_local_alarm_timer->pending;
+}
+
+void cm_run_all_local_timers(void)
+{
+    cm_local_alarm_timer->pending = 0;
+
+    /* rearm timer, if not periodic */
+    if (cm_local_alarm_timer->expired) {
+        cm_local_alarm_timer->expired = 0;
+        qemu_rearm_alarm_timer(cm_local_alarm_timer);
+    }
+
+    if (vm_running) {
+        cm_qemu_run_local_timers(vm_clock);
+    }
+}
+
+void cm_local_host_alarm_handler(int host_signum)
+{
+    coremu_assert_core_thr();
+
+    struct qemu_alarm_timer *t = cm_local_alarm_timer;
+    if (!t)
+        return;
+
+    if (alarm_has_dynticks(t) ||
+        qemu_timer_expired(cm_local_active_timers, qemu_get_clock(vm_clock))) {
+        t->expired = alarm_has_dynticks(t);
+        t->pending = 1;
+        cm_notify_event();
+    }
+}
+
+static void cm_qemu_run_local_timers(QEMUClock *clock)
+{
+    QEMUTimer **ptimer_head, *ts;
+    int64_t current_time;
+
+    if (!clock->enabled)
+        return;
+
+    current_time = qemu_get_clock(clock);
+    ptimer_head = &cm_local_active_timers;
+    for (;;) {
+        ts = *ptimer_head;
+        if (!ts || ts->expire_time > current_time)
+            break;
+        /* remove timer from the list before calling the callback */
+        *ptimer_head = ts->next;
+        ts->next = NULL;
+
+        /* run the callback (the timer list can be modified) */
+        ts->cb(ts->opaque);
+    }
+}
+
+static void cm_local_dynticks_rearm_timer(struct qemu_alarm_timer *t)
+{
+    timer_t host_timer = (timer_t)(long)t->priv;
+    struct itimerspec timeout;
+    int64_t nearest_delta_us = INT64_MAX;
+    int64_t current_us;
+
+    assert(alarm_has_dynticks(t));
+    if (!cm_local_active_timers)
+        return;
+
+    nearest_delta_us = cm_local_next_deadline_dyntick();
+
+    /* check whether a timer is already running */
+    if (timer_gettime(host_timer, &timeout)) {
+        perror("gettime");
+        fprintf(stderr, "Internal timer error: aborting\n");
+        exit(1);
+    }
+    current_us =
+        timeout.it_value.tv_sec * 1000000 + timeout.it_value.tv_nsec / 1000;
+    if (current_us && current_us <= nearest_delta_us)
+        return;
+
+    timeout.it_interval.tv_sec = 0;
+    timeout.it_interval.tv_nsec = 0;    /* 0 for one-shot timer */
+    timeout.it_value.tv_sec = nearest_delta_us / 1000000;
+    timeout.it_value.tv_nsec = (nearest_delta_us % 1000000) * 1000;
+    if (timer_settime(host_timer, 0 /* RELATIVE */, &timeout, NULL)) {
+        perror("settime");
+        fprintf(stderr, "Internal timer error: aborting\n");
+        exit(1);
+    }
+}
+
+static uint64_t cm_local_next_deadline_dyntick(void)
+{
+    int64_t delta;
+
+    delta = (cm_local_next_deadline() + 999) / 1000;
+
+    if (delta < MIN_TIMER_REARM_US)
+        delta = MIN_TIMER_REARM_US;
+
+    return delta;
+}
+
+static int64_t cm_local_next_deadline(void)
+{
+    /* To avoid problems with overflow limit this to 2^32.  */
+    int64_t delta = INT32_MAX;
+
+    if (cm_local_active_timers) {
+        delta = cm_local_active_timers->expire_time - qemu_get_clock(vm_clock);
+    }
+
+    if (delta < 0)
+        delta = 0;
+
+    return delta;
+}
+
+void cm_stop_local_timer(void)
+{
+    cm_local_alarm_timer->stop(cm_local_alarm_timer);
+}
diff --git a/cm-timer.h b/cm-timer.h
new file mode 100644
index 0000000..627819e
--- /dev/null
+++ b/cm-timer.h
@@ -0,0 +1,39 @@
+/*
+ * COREMU Parallel Emulator Framework
+ * The definition of core thread function
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *  Xi Wu           <wuxi@fudan.edu.cn>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef CM_TIMER_H
+#define CM_TIMER_H
+
+#include "qemu-common.h"
+int cm_init_local_timer_alarm(void);
+void cm_mod_local_timer(QEMUTimer * ts, int64_t expire_time);
+void cm_del_local_timer(QEMUTimer * ts);
+void cm_run_all_local_timers(void);
+void cm_local_host_alarm_handler(int host_signum);
+int cm_local_alarm_pending(void);
+void cm_init_pit_freq(void);
+void cm_stop_local_timer(void);
+#endif
diff --git a/cpu-all.h b/cpu-all.h
index 52a1817..0110aac 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -22,6 +22,8 @@
 #include "qemu-common.h"
 #include "cpu-common.h"
 
+#include "coremu-config.h"
+
 /* some important defines:
  *
  * WORDS_ALIGNED : if defined, the host cpu can only make word aligned
@@ -772,7 +774,7 @@ void cpu_dump_statistics (CPUState *env, FILE *f,
 void QEMU_NORETURN cpu_abort(CPUState *env, const char *fmt, ...)
     __attribute__ ((__format__ (__printf__, 2, 3)));
 extern CPUState *first_cpu;
-extern CPUState *cpu_single_env;
+extern COREMU_THREAD CPUState *cpu_single_env;
 
 #define CPU_INTERRUPT_HARD   0x02 /* hardware interrupt pending */
 #define CPU_INTERRUPT_EXITTB 0x04 /* exit the current TB (use for x86 a20 case) */
diff --git a/cpu-exec.c b/cpu-exec.c
index dc81e79..086d330 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -22,6 +22,9 @@
 #include "tcg.h"
 #include "kvm.h"
 
+#include "coremu-config.h"
+#include "coremu-intr.h"
+
 #if !defined(CONFIG_SOFTMMU)
 #undef EAX
 #undef ECX
@@ -44,7 +47,7 @@
 #define env cpu_single_env
 #endif
 
-int tb_invalidated_flag;
+COREMU_THREAD int tb_invalidated_flag;
 
 //#define CONFIG_DEBUG_EXEC
 //#define DEBUG_SIGNAL
@@ -224,8 +227,9 @@ int cpu_exec(CPUState *env1)
     if (cpu_halted(env1) == EXCP_HALTED)
         return EXCP_HALTED;
 
+#ifndef CONFIG_COREMU
     cpu_single_env = env1;
-
+#endif
     /* the access to env below is actually saving the global register's
        value, so that files not including target-xyz/exec.h are free to
        use it.  */
@@ -264,6 +268,9 @@ int cpu_exec(CPUState *env1)
     /* prepare setjmp context for exception handling */
     for(;;) {
         if (setjmp(env->jmp_env) == 0) {
+#ifdef CONFIG_COREMU
+                coremu_receive_intr();
+#endif
 #if defined(__sparc__) && !defined(CONFIG_SOLARIS)
 #undef env
                     env = cpu_single_env;
@@ -601,7 +608,27 @@ int cpu_exec(CPUState *env1)
                     env = cpu_single_env;
 #define env cpu_single_env
 #endif
+
+#ifdef CONFIG_COREMU
+                    coremu_receive_intr();
+#endif
                     next_tb = tcg_qemu_tb_exec(tc_ptr);
+
+#ifdef CONFIG_COREMU
+                    coremu_receive_intr();
+
+#ifdef COREMU_CMC_SUPPORT
+                    if((next_tb & 3) == 3) {
+                        //assert(0);
+                        /* this tb has been invalidate */
+                        TranslationBlock *tmp_tb = (TranslationBlock *)(next_tb & ~3);
+                        next_tb = 0;
+                        cpu_pc_from_tb(env, tmp_tb);
+                        tb_phys_invalidate(tmp_tb, -1);
+                    }
+#endif
+#endif
+
                     env->current_tb = NULL;
                     if ((next_tb & 3) == 2) {
                         /* Instruction counter expired.  */
@@ -665,8 +692,10 @@ int cpu_exec(CPUState *env1)
     asm("");
     env = (void *) saved_env_reg;
 
+#ifndef CONFIG_COREMU
     /* fail safe : never use cpu_single_env outside cpu_exec() */
     cpu_single_env = NULL;
+#endif
     return ret;
 }
 
diff --git a/cpus.c b/cpus.c
index 29462e5..5bdcd65 100644
--- a/cpus.c
+++ b/cpus.c
@@ -33,6 +33,8 @@
 
 #include "cpus.h"
 
+#include "coremu-config.h"
+
 #ifdef SIGRTMIN
 #define SIG_IPI (SIGRTMIN+4)
 #else
@@ -269,7 +271,10 @@ void qemu_notify_event(void)
 {
     CPUState *env = cpu_single_env;
 
-    qemu_event_increment ();
+#ifndef CONFIG_COREMU
+    qemu_event_increment();
+#endif
+
     if (env) {
         cpu_exit(env);
     }
@@ -812,3 +817,11 @@ void list_cpus(FILE *f, int (*cpu_fprintf)(FILE *f, const char *fmt, ...),
     cpu_list(f, cpu_fprintf); /* deprecated */
 #endif
 }
+
+#ifdef CONFIG_COREMU
+int cm_cpu_can_run(CPUState * env);
+int cm_cpu_can_run(CPUState * env)
+{
+    return cpu_can_run(env);
+}
+#endif
diff --git a/exec-all.h b/exec-all.h
index 1016de2..50bc79a 100644
--- a/exec-all.h
+++ b/exec-all.h
@@ -22,6 +22,8 @@
 
 #include "qemu-common.h"
 
+#include "coremu-config.h"
+
 /* allow to see translation results - the slowdown should be negligible, so we leave it */
 #define DEBUG_DISAS
 
@@ -69,9 +71,9 @@ typedef struct TranslationBlock TranslationBlock;
 
 #define OPPARAM_BUF_SIZE (OPC_BUF_SIZE * MAX_OPC_PARAM)
 
-extern target_ulong gen_opc_pc[OPC_BUF_SIZE];
-extern uint8_t gen_opc_instr_start[OPC_BUF_SIZE];
-extern uint16_t gen_opc_icount[OPC_BUF_SIZE];
+extern COREMU_THREAD target_ulong gen_opc_pc[OPC_BUF_SIZE];
+extern COREMU_THREAD uint8_t gen_opc_instr_start[OPC_BUF_SIZE];
+extern COREMU_THREAD uint16_t gen_opc_icount[OPC_BUF_SIZE];
 
 #include "qemu-log.h"
 
@@ -162,6 +164,9 @@ struct TranslationBlock {
     struct TranslationBlock *jmp_next[2];
     struct TranslationBlock *jmp_first;
     uint32_t icount;
+#ifdef CONFIG_COREMU
+    uint16_t has_invalidate; /* if this TB has been invalidated */
+#endif
 };
 
 static inline unsigned int tb_jmp_cache_hash_page(target_ulong pc)
@@ -191,8 +196,8 @@ void tb_link_page(TranslationBlock *tb,
                   tb_page_addr_t phys_pc, tb_page_addr_t phys_page2);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
 
-extern TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
-extern uint8_t *code_gen_ptr;
+extern COREMU_THREAD TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
+extern COREMU_THREAD uint8_t *code_gen_ptr;
 extern int code_gen_max_blocks;
 
 #if defined(USE_DIRECT_JUMP)
@@ -273,9 +278,9 @@ TranslationBlock *tb_find_pc(unsigned long pc_ptr);
 
 #include "qemu-lock.h"
 
-extern spinlock_t tb_lock;
+extern COREMU_THREAD spinlock_t tb_lock;
 
-extern int tb_invalidated_flag;
+extern COREMU_THREAD int tb_invalidated_flag;
 
 #if !defined(CONFIG_USER_ONLY)
 
diff --git a/exec.c b/exec.c
index 3416aed..4d4064f 100644
--- a/exec.c
+++ b/exec.c
@@ -71,6 +71,13 @@
 //#define DEBUG_IOPORT
 //#define DEBUG_SUBPAGE
 
+#include "coremu-config.h"
+#include "coremu-spinlock.h"
+#include "coremu-malloc.h"
+#include "coremu-atomic.h"
+#include "coremu-hw.h"
+#include "cm-tbinval.h"
+
 #if !defined(CONFIG_USER_ONLY)
 /* TB consistency checks only implemented for usermode emulation.  */
 #undef DEBUG_TB_CHECK
@@ -78,12 +85,12 @@
 
 #define SMC_BITMAP_USE_THRESHOLD 10
 
-static TranslationBlock *tbs;
+static COREMU_THREAD TranslationBlock *tbs;
 int code_gen_max_blocks;
-TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
-static int nb_tbs;
+COREMU_THREAD TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
+static COREMU_THREAD int nb_tbs;
 /* any access to the tbs or the page table must use this lock */
-spinlock_t tb_lock = SPIN_LOCK_UNLOCKED;
+COREMU_THREAD spinlock_t tb_lock = SPIN_LOCK_UNLOCKED;
 
 #if defined(__arm__) || defined(__sparc_v9__)
 /* The prologue must be reachable with a direct jump. ARM and Sparc64
@@ -102,11 +109,11 @@ spinlock_t tb_lock = SPIN_LOCK_UNLOCKED;
 #endif
 
 uint8_t code_gen_prologue[1024] code_gen_section;
-static uint8_t *code_gen_buffer;
+static COREMU_THREAD uint8_t *code_gen_buffer;
 static unsigned long code_gen_buffer_size;
 /* threshold to flush the translated code buffer */
 static unsigned long code_gen_buffer_max_size;
-uint8_t *code_gen_ptr;
+COREMU_THREAD uint8_t *code_gen_ptr;
 
 #if !defined(CONFIG_USER_ONLY)
 int phys_ram_fd;
@@ -130,7 +137,7 @@ ram_addr_t last_ram_offset;
 CPUState *first_cpu;
 /* current CPU in the current thread. It is only valid inside
    cpu_exec() */
-CPUState *cpu_single_env;
+COREMU_THREAD CPUState *cpu_single_env;
 /* 0 = Do not count executed instructions.
    1 = Precise instruction counting.
    2 = Adaptive rate instruction counting.  */
@@ -139,6 +146,16 @@ int use_icount = 0;
    include some instructions that have not yet been executed.  */
 int64_t qemu_icount;
 
+#ifdef CONFIG_COREMU
+typedef struct PageDesc {
+    /* in order to optimize self modifying code, we count the number
+       of lookups we do to a given page to use a bitmap */
+    unsigned int code_write_count;
+
+   /* Given the different page and tb information for different cpu */
+    CMPageDesc cpu_tbs[COREMU_MAX_CPU];
+} PageDesc;
+#else
 typedef struct PageDesc {
     /* list of TBs intersecting this ram page */
     TranslationBlock *first_tb;
@@ -150,7 +167,7 @@ typedef struct PageDesc {
     unsigned long flags;
 #endif
 } PageDesc;
-
+#endif
 /* In system mode we want L1_MAP to be based on ram offsets,
    while in user mode we want it to be based on virtual addresses.  */
 #if !defined(CONFIG_USER_ONLY)
@@ -237,7 +254,7 @@ static int log_append = 0;
 static int tlb_flush_count;
 #endif
 static int tb_flush_count;
-static int tb_phys_invalidate_count;
+static COREMU_THREAD int tb_phys_invalidate_count;
 
 #ifdef _WIN32
 static void map_exec(void *addr, long size)
@@ -383,8 +400,13 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
             if (!alloc) {
                 return NULL;
             }
+#ifdef CONFIG_COREMU
+            coremu_atomic_mallocz(lp, sizeof(void *) * L2_SIZE);
+            p = *lp;
+#else
             ALLOC(p, sizeof(void *) * L2_SIZE);
             *lp = p;
+#endif
         }
 
         lp = p + ((index >> (i * L2_BITS)) & (L2_SIZE - 1));
@@ -395,8 +417,13 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
         if (!alloc) {
             return NULL;
         }
+#ifdef CONFIG_COREMU
+        coremu_atomic_mallocz(lp, sizeof(PageDesc) * L2_SIZE);
+        pd = *lp;
+#else
         ALLOC(pd, sizeof(PageDesc) * L2_SIZE);
         *lp = pd;
+#endif
     }
 
 #undef ALLOC
@@ -426,7 +453,12 @@ static PhysPageDesc *phys_page_find_alloc(target_phys_addr_t index, int alloc)
             if (!alloc) {
                 return NULL;
             }
+#ifdef CONFIG_COREMU
+            coremu_atomic_mallocz(lp, sizeof(void *) * L2_SIZE);
+            p = *lp;
+#else
             *lp = p = qemu_mallocz(sizeof(void *) * L2_SIZE);
+#endif
         }
         lp = p + ((index >> (i * L2_BITS)) & (L2_SIZE - 1));
     }
@@ -438,9 +470,12 @@ static PhysPageDesc *phys_page_find_alloc(target_phys_addr_t index, int alloc)
         if (!alloc) {
             return NULL;
         }
-
+#ifdef CONFIG_COREMU
+        coremu_atomic_mallocz(lp, sizeof(PhysPageDesc) * L2_SIZE);
+        pd = *lp;
+#else
         *lp = pd = qemu_malloc(sizeof(PhysPageDesc) * L2_SIZE);
-
+#endif
         for (i = 0; i < L2_SIZE; i++) {
             pd[i].phys_offset = IO_MEM_UNASSIGNED;
             pd[i].region_offset = (index + i) << TARGET_PAGE_BITS;
@@ -649,10 +684,19 @@ void cpu_exec_init(CPUState *env)
 
 static inline void invalidate_page_bitmap(PageDesc *p)
 {
+#ifdef CONFIG_COREMU
+#if defined(TARGET_I386)
+    int cpuid = cpu_single_env->cpuid_apic_id;
+#elif defined(TARGET_ARM)
+    int cpuid = cpu_single_env->cpu_index;
+#endif
+    cm_invalidate_bitmap(&p->cpu_tbs[cpuid]);
+#else
     if (p->code_bitmap) {
         qemu_free(p->code_bitmap);
         p->code_bitmap = NULL;
     }
+#endif
     p->code_write_count = 0;
 }
 
@@ -665,10 +709,20 @@ static void page_flush_tb_1 (int level, void **lp)
     if (*lp == NULL) {
         return;
     }
+#if defined(TARGET_I386)
+    int cpuid = cpu_single_env->cpuid_apic_id;
+#elif defined(TARGET_ARM)
+    int cpuid = cpu_single_env->cpu_index;
+#endif
     if (level == 0) {
         PageDesc *pd = *lp;
         for (i = 0; i < L2_SIZE; ++i) {
+#ifdef CONFIG_COREMU
+            /* XXX only flush tb for the corresponding cpu. */
+            pd[i].cpu_tbs[cpuid].first_tb = NULL;
+#else
             pd[i].first_tb = NULL;
+#endif
             invalidate_page_bitmap(pd + i);
         }
     } else {
@@ -691,7 +745,9 @@ static void page_flush_tb(void)
 /* XXX: tb_flush is currently not thread safe */
 void tb_flush(CPUState *env1)
 {
+#ifndef CONFIG_COREMU
     CPUState *env;
+#endif
 #if defined(DEBUG_FLUSH)
     printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
            (unsigned long)(code_gen_ptr - code_gen_buffer),
@@ -702,10 +758,13 @@ void tb_flush(CPUState *env1)
         cpu_abort(env1, "Internal error: code buffer overflow\n");
 
     nb_tbs = 0;
-
+#ifdef CONFIG_COREMU
+    memset (env1->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof (void *));
+#else
     for(env = first_cpu; env != NULL; env = env->next_cpu) {
         memset (env->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof (void *));
     }
+#endif
 
     memset (tb_phys_hash, 0, CODE_GEN_PHYS_HASH_SIZE * sizeof (void *));
     page_flush_tb();
@@ -829,7 +888,14 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     unsigned int h, n1;
     tb_page_addr_t phys_pc;
     TranslationBlock *tb1, *tb2;
-
+#ifdef CONFIG_COREMU
+#if defined(TARGET_I386)
+    int cpuid = cpu_single_env->cpuid_apic_id;
+#elif defined(TARGET_ARM)
+    int cpuid = cpu_single_env->cpu_index;
+#endif
+    CMPageDesc *cp; 
+#endif
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
     h = tb_phys_hash_func(phys_pc);
@@ -839,12 +905,26 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     /* remove the TB from the page list */
     if (tb->page_addr[0] != page_addr) {
         p = page_find(tb->page_addr[0] >> TARGET_PAGE_BITS);
+#ifdef CONFIG_COREMU
+        cp = &p->cpu_tbs[cpuid];
+        coremu_spin_lock(&cp->tb_list_lock);
+        tb_page_remove(&cp->first_tb, tb);
+        coremu_spin_unlock(&cp->tb_list_lock);
+#else
         tb_page_remove(&p->first_tb, tb);
+#endif
         invalidate_page_bitmap(p);
     }
     if (tb->page_addr[1] != -1 && tb->page_addr[1] != page_addr) {
         p = page_find(tb->page_addr[1] >> TARGET_PAGE_BITS);
+#ifdef CONFIG_COREMU
+        cp = &p->cpu_tbs[cpuid];
+        coremu_spin_lock(&cp->tb_list_lock);
+        tb_page_remove(&cp->first_tb, tb);
+        coremu_spin_unlock(&cp->tb_list_lock);
+#else
         tb_page_remove(&p->first_tb, tb);
+#endif
         invalidate_page_bitmap(p);
     }
 
@@ -852,10 +932,17 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 
     /* remove the TB from the hash list */
     h = tb_jmp_cache_hash_func(tb->pc);
+
+#ifdef CONFIG_COREMU
+    env = cpu_single_env;
+    if (env->tb_jmp_cache[h] == tb)
+        env->tb_jmp_cache[h] = NULL;
+#else
     for(env = first_cpu; env != NULL; env = env->next_cpu) {
         if (env->tb_jmp_cache[h] == tb)
             env->tb_jmp_cache[h] = NULL;
     }
+#endif    
 
     /* suppress this TB from the two jump lists */
     tb_jmp_remove(tb, 0);
@@ -909,10 +996,20 @@ static void build_page_bitmap(PageDesc *p)
 {
     int n, tb_start, tb_end;
     TranslationBlock *tb;
-
+#ifdef CONFIG_COREMU
+#if defined(TARGET_I386)
+    int cpuid = cpu_single_env->cpuid_apic_id;
+#elif defined(TARGET_ARM)
+    int cpuid = cpu_single_env->cpu_index;
+#endif
+    CMPageDesc *cp = &p->cpu_tbs[cpuid];
+    coremu_spin_lock(&cp->bitmap_lock);
+    cp->code_bitmap = coremu_mallocz(TARGET_PAGE_SIZE / 8);
+    tb = cp->first_tb;
+#else
     p->code_bitmap = qemu_mallocz(TARGET_PAGE_SIZE / 8);
-
     tb = p->first_tb;
+#endif
     while (tb != NULL) {
         n = (long)tb & 3;
         tb = (TranslationBlock *)((long)tb & ~3);
@@ -928,9 +1025,16 @@ static void build_page_bitmap(PageDesc *p)
             tb_start = 0;
             tb_end = ((tb->pc + tb->size) & ~TARGET_PAGE_MASK);
         }
+#ifdef CONFIG_COREMU
+        set_bits(cp->code_bitmap, tb_start, tb_end - tb_start);
+#else
         set_bits(p->code_bitmap, tb_start, tb_end - tb_start);
+#endif
         tb = tb->page_next[n];
     }
+#ifdef CONFIG_COREMU
+    coremu_spin_unlock(&cp->bitmap_lock);
+#endif
 }
 
 TranslationBlock *tb_gen_code(CPUState *env,
@@ -996,6 +1100,22 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
     p = page_find(start >> TARGET_PAGE_BITS);
     if (!p)
         return;
+#ifdef CONFIG_COREMU
+#if defined(TARGET_I386)
+    int cpuid = cpu_single_env->cpuid_apic_id;
+#elif defined(TARGET_ARM)
+    int cpuid = cpu_single_env->cpu_index;
+#endif
+    atomic_incl((uint32_t *)&p->code_write_count);
+    if (!p->cpu_tbs[cpuid].code_bitmap &&
+        p->code_write_count >= SMC_BITMAP_USE_THRESHOLD &&
+        is_cpu_write_access) {
+        /* build code bitmap */
+        build_page_bitmap(p);
+    }
+
+    tb = p->cpu_tbs[cpuid].first_tb;
+#else
     if (!p->code_bitmap &&
         ++p->code_write_count >= SMC_BITMAP_USE_THRESHOLD &&
         is_cpu_write_access) {
@@ -1006,6 +1126,8 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
     /* we remove all the TBs in the range [start, end[ */
     /* XXX: see if in some cases it could be faster to invalidate all the code */
     tb = p->first_tb;
+#endif
+
     while (tb != NULL) {
         n = (long)tb & 3;
         tb = (TranslationBlock *)((long)tb & ~3);
@@ -1052,6 +1174,9 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
                 saved_tb = env->current_tb;
                 env->current_tb = NULL;
             }
+#ifdef CONFIG_COREMU
+            tb->has_invalidate = 1;
+#endif
             tb_phys_invalidate(tb, -1);
             if (env) {
                 env->current_tb = saved_tb;
@@ -1063,6 +1188,15 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
     }
 #if !defined(CONFIG_USER_ONLY)
     /* if no code remaining, no need to continue to use slow writes */
+#ifdef CONFIG_COREMU
+    if (!p->cpu_tbs[cpuid].first_tb) {
+        invalidate_page_bitmap(p);
+        cm_phys_del_tb(start);
+        if ((!cm_phys_page_tb_p(start)) && is_cpu_write_access) {
+            tlb_unprotect_code_phys(env, start, env->mem_io_vaddr);
+        }
+    }
+#else
     if (!p->first_tb) {
         invalidate_page_bitmap(p);
         if (is_cpu_write_access) {
@@ -1070,6 +1204,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
         }
     }
 #endif
+#endif
 #ifdef TARGET_HAS_PRECISE_SMC
     if (current_tb_modified) {
         /* we generate a block containing just the instruction
@@ -1098,9 +1233,20 @@ static inline void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
     p = page_find(start >> TARGET_PAGE_BITS);
     if (!p)
         return;
+#ifdef CONFIG_COREMU
+#if defined(TARGET_I386)
+    int cpuid = cpu_single_env->cpuid_apic_id;
+#elif defined(TARGET_ARM)
+    int cpuid = cpu_single_env->cpu_index;
+#endif
+    if (p->cpu_tbs[cpuid].code_bitmap) {
+        offset = start & ~TARGET_PAGE_MASK;
+        b = p->cpu_tbs[cpuid].code_bitmap[offset >> 3] >> (offset & 7);
+#else
     if (p->code_bitmap) {
         offset = start & ~TARGET_PAGE_MASK;
         b = p->code_bitmap[offset >> 3] >> (offset & 7);
+#endif
         if (b & ((1 << len) - 1))
             goto do_invalidate;
     } else {
@@ -1179,9 +1325,21 @@ static inline void tb_alloc_page(TranslationBlock *tb,
 
     tb->page_addr[n] = page_addr;
     p = page_find_alloc(page_addr >> TARGET_PAGE_BITS, 1);
+#ifdef CONFIG_COREMU
+    assert(cpu_single_env);
+#if defined(TARGET_I386)
+    int cpuid = cpu_single_env->cpuid_apic_id;
+#elif defined(TARGET_ARM)
+    int cpuid = cpu_single_env->cpu_index;
+#endif
+    tb->page_next[n] = p->cpu_tbs[cpuid].first_tb;
+    last_first_tb = p->cpu_tbs[cpuid].first_tb;
+    p->cpu_tbs[cpuid].first_tb = (TranslationBlock *)((long)tb | n);
+#else
     tb->page_next[n] = p->first_tb;
     last_first_tb = p->first_tb;
     p->first_tb = (TranslationBlock *)((long)tb | n);
+#endif
     invalidate_page_bitmap(p);
 
 #if defined(TARGET_HAS_SMC) || 1
@@ -1217,7 +1375,13 @@ static inline void tb_alloc_page(TranslationBlock *tb,
        protected. So we handle the case where only the first TB is
        allocated in a physical page */
     if (!last_first_tb) {
+#ifdef CONFIG_COREMU
+        cm_phys_add_tb(page_addr);
+        if(cm_phys_page_tb_p(page_addr) == 1)
+            tlb_protect_code(page_addr);
+#else
         tlb_protect_code(page_addr);
+#endif
     }
 #endif
 
@@ -1390,7 +1554,11 @@ static void breakpoint_invalidate(CPUState *env, target_ulong pc)
         pd = p->phys_offset;
     }
     ram_addr = (pd & TARGET_PAGE_MASK) | (pc & ~TARGET_PAGE_MASK);
+#ifdef CONFIG_COREMU
+    cm_invalidate_tb(ram_addr, 1);
+#else
     tb_invalidate_phys_page_range(ram_addr, ram_addr + 1, 0);
+#endif
 }
 #endif
 #endif /* TARGET_HAS_ICE */
@@ -1612,7 +1780,7 @@ static void cpu_unlink_tb(CPUState *env)
        emulation this often isn't actually as bad as it sounds.  Often
        signals are used primarily to interrupt blocking syscalls.  */
     TranslationBlock *tb;
-    static spinlock_t interrupt_lock = SPIN_LOCK_UNLOCKED;
+    static COREMU_THREAD spinlock_t interrupt_lock = SPIN_LOCK_UNLOCKED;
 
     spin_lock(&interrupt_lock);
     tb = env->current_tb;
@@ -1936,7 +2104,14 @@ void tlb_flush(CPUState *env, int flush_global)
     for(i = 0; i < CPU_TLB_SIZE; i++) {
         int mmu_idx;
         for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+#ifdef CONFIG_COREMU
+            /* XXX: temporay solution to the tlb lookup data race problem */
+            env->tlb_table[mmu_idx][i].addr_read = -1;
+            env->tlb_table[mmu_idx][i].addr_write = -1;
+            env->tlb_table[mmu_idx][i].addr_code = -1;
+#else
             env->tlb_table[mmu_idx][i] = s_cputlb_empty_entry;
+#endif
         }
     }
 
@@ -2048,8 +2223,13 @@ void cpu_physical_memory_reset_dirty(ram_addr_t start, ram_addr_t end,
         int mmu_idx;
         for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
             for(i = 0; i < CPU_TLB_SIZE; i++)
+#ifdef CONFIG_COREMU
+                cm_tlb_reset_dirty_range(&env->tlb_table[mmu_idx][i],
+                                        start1, length);
+#else
                 tlb_reset_dirty_range(&env->tlb_table[mmu_idx][i],
                                       start1, length);
+#endif
         }
     }
 }
@@ -2639,7 +2819,17 @@ void cpu_register_physical_memory_offset(target_phys_addr_t start_addr,
        reset the modified entries */
     /* XXX: slow ! */
     for(env = first_cpu; env != NULL; env = env->next_cpu) {
+    /* If there is no hot plug device this function won't be invoked
+       after pci bus initialized, so we don't enable broadcast flush
+       tlb in common case. */
+#if defined(CONFIG_COREMU) && defined(COREMU_FLUSH_TLB)
+        if(coremu_init_done_p())
+            cm_send_tlb_flush_req(env->cpuid_apic_id);
+        else
+            tlb_flush(env, 1);
+#else
         tlb_flush(env, 1);
+#endif
     }
 }
 
@@ -2805,6 +2995,10 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
     memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS),
            0xff, size >> TARGET_PAGE_BITS);
 
+#ifdef CONFIG_COREMU
+    coremu_assert_hw_thr("qemu_ram_alloc should only called by hw thr");
+    cm_init_tb_cnt(last_ram_offset, size);
+#endif
     last_ram_offset += size;
 
     if (kvm_enabled())
@@ -2847,11 +3041,16 @@ void *qemu_get_ram_ptr(ram_addr_t addr)
         abort();
     }
     /* Move this entry to to start of the list.  */
+#ifndef CONFIG_COREMU
+    /* Different core can access this function at the same time.
+     * For coremu, disable this optimization to avoid data race. 
+     * XXX or use spin lock here if performance impact is big. */
     if (prev) {
         prev->next = block->next;
         block->next = *prevp;
         *prevp = block;
     }
+#endif
     return block->host + (addr - block->offset);
 }
 
@@ -2956,7 +3155,11 @@ static void notdirty_mem_writeb(void *opaque, target_phys_addr_t ram_addr,
     dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
     if (!(dirty_flags & CODE_DIRTY_FLAG)) {
 #if !defined(CONFIG_USER_ONLY)
+#ifdef CONFIG_COREMU
+        cm_invalidate_tb(ram_addr, 1);
+#else
         tb_invalidate_phys_page_fast(ram_addr, 1);
+#endif
         dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
 #endif
     }
@@ -2976,7 +3179,11 @@ static void notdirty_mem_writew(void *opaque, target_phys_addr_t ram_addr,
     dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
     if (!(dirty_flags & CODE_DIRTY_FLAG)) {
 #if !defined(CONFIG_USER_ONLY)
+#ifdef CONFIG_COREMU
+        cm_invalidate_tb(ram_addr, 2);
+#else
         tb_invalidate_phys_page_fast(ram_addr, 2);
+#endif
         dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
 #endif
     }
@@ -2996,7 +3203,11 @@ static void notdirty_mem_writel(void *opaque, target_phys_addr_t ram_addr,
     dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
     if (!(dirty_flags & CODE_DIRTY_FLAG)) {
 #if !defined(CONFIG_USER_ONLY)
+#ifdef CONFIG_COREMU
+        cm_invalidate_tb(ram_addr, 4);
+#else
         tb_invalidate_phys_page_fast(ram_addr, 4);
+#endif
         dirty_flags = cpu_physical_memory_get_dirty_flags(ram_addr);
 #endif
     }
@@ -3419,7 +3630,11 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
                 memcpy(ptr, buf, l);
                 if (!cpu_physical_memory_is_dirty(addr1)) {
                     /* invalidate code */
+#ifdef CONFIG_COREMU
+                    cm_invalidate_tb(addr1, l);
+#else
                     tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
+#endif
                     /* set dirty bit */
                     cpu_physical_memory_set_dirty_flags(
                         addr1, (0xff & ~CODE_DIRTY_FLAG));
@@ -3626,7 +3841,11 @@ void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
                     l = access_len;
                 if (!cpu_physical_memory_is_dirty(addr1)) {
                     /* invalidate code */
+#ifdef CONFIG_COREMU
+                    cm_invalidate_tb(addr1, l);
+#else
                     tb_invalidate_phys_page_range(addr1, addr1 + l, 0);
+#endif
                     /* set dirty bit */
                     cpu_physical_memory_set_dirty_flags(
                         addr1, (0xff & ~CODE_DIRTY_FLAG));
@@ -3785,7 +4004,11 @@ void stl_phys_notdirty(target_phys_addr_t addr, uint32_t val)
         if (unlikely(in_migration)) {
             if (!cpu_physical_memory_is_dirty(addr1)) {
                 /* invalidate code */
+ #ifdef CONFIG_COREMU
+                cm_invalidate_tb(addr1, 4);
+ #else
                 tb_invalidate_phys_page_range(addr1, addr1 + 4, 0);
+ #endif
                 /* set dirty bit */
                 cpu_physical_memory_set_dirty_flags(
                     addr1, (0xff & ~CODE_DIRTY_FLAG));
@@ -3854,7 +4077,11 @@ void stl_phys(target_phys_addr_t addr, uint32_t val)
         stl_p(ptr, val);
         if (!cpu_physical_memory_is_dirty(addr1)) {
             /* invalidate code */
+ #ifdef CONFIG_COREMU
+            cm_invalidate_tb(addr1, 4);
+ #else
             tb_invalidate_phys_page_range(addr1, addr1 + 4, 0);
+ #endif
             /* set dirty bit */
             cpu_physical_memory_set_dirty_flags(addr1,
                 (0xff & ~CODE_DIRTY_FLAG));
@@ -4076,3 +4303,8 @@ void dump_exec_info(FILE *f,
 #undef env
 
 #endif
+
+#ifdef CONFIG_COREMU
+#include "cm-init.c"
+#include "cm-tbinval.c"
+#endif
diff --git a/hw/apic.c b/hw/apic.c
old mode 100644
new mode 100755
index 9029dad..3d64331
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -26,6 +26,9 @@
 #include "kvm.h"
 
 //#define DEBUG_APIC
+#include "coremu-config.h"
+#include "cm-target-intr.h"
+#include "cm-timer.h"
 
 /* APIC Local Vector Table */
 #define APIC_LVT_TIMER   0
@@ -244,7 +247,11 @@ static void apic_bus_deliver(const uint32_t *deliver_bitmask,
                 if (d >= 0) {
                     apic_iter = local_apics[d];
                     if (apic_iter) {
+#ifdef CONFIG_COREMU
+                        cm_send_apicbus_intr(apic_iter->id, CPU_INTERRUPT_HARD, vector_num, trigger_mode);
+#else
                         apic_set_irq(apic_iter, vector_num, trigger_mode);
+#endif
                     }
                 }
             }
@@ -254,19 +261,37 @@ static void apic_bus_deliver(const uint32_t *deliver_bitmask,
             break;
 
         case APIC_DM_SMI:
+#ifdef CONFIG_COREMU
+            /* Vector number is -1 which indecates ignore */
+            foreach_apic(apic_iter, deliver_bitmask,
+                    cm_send_apicbus_intr(apic_iter->id, CPU_INTERRUPT_SMI, -1, -1) );
+#else
             foreach_apic(apic_iter, deliver_bitmask,
                 cpu_interrupt(apic_iter->cpu_env, CPU_INTERRUPT_SMI) );
+#endif
             return;
 
         case APIC_DM_NMI:
+#ifdef CONFIG_COREMU
+            /* Vector number is -1 which indecates ignore */
+            foreach_apic(apic_iter, deliver_bitmask,
+                    cm_send_apicbus_intr(apic_iter->id, CPU_INTERRUPT_NMI, -1, -1) );
+#else
             foreach_apic(apic_iter, deliver_bitmask,
                 cpu_interrupt(apic_iter->cpu_env, CPU_INTERRUPT_NMI) );
+#endif
             return;
 
         case APIC_DM_INIT:
             /* normal INIT IPI sent to processors */
+#ifdef CONFIG_COREMU
+            /* Vector number is -1 which indecates ignore */
+            foreach_apic(apic_iter, deliver_bitmask,
+                    cm_send_apicbus_intr(apic_iter->id, CPU_INTERRUPT_INIT, -1, -1) );
+#else
             foreach_apic(apic_iter, deliver_bitmask,
                          cpu_interrupt(apic_iter->cpu_env, CPU_INTERRUPT_INIT) );
+#endif
             return;
 
         case APIC_DM_EXTINT:
@@ -277,8 +302,14 @@ static void apic_bus_deliver(const uint32_t *deliver_bitmask,
             return;
     }
 
+#ifdef CONFIG_COREMU
+    /* Vector number is -1 which indecates ignore */
+    foreach_apic(apic_iter, deliver_bitmask,
+            cm_send_apicbus_intr(apic_iter->id, CPU_INTERRUPT_HARD, vector_num, trigger_mode) );
+#else
     foreach_apic(apic_iter, deliver_bitmask,
                  apic_set_irq(apic_iter, vector_num, trigger_mode) );
+#endif
 }
 
 void apic_deliver_irq(uint8_t dest, uint8_t dest_mode,
@@ -553,16 +584,26 @@ static void apic_deliver(APICState *s, uint8_t dest, uint8_t dest_mode,
                 int trig_mode = (s->icr[0] >> 15) & 1;
                 int level = (s->icr[0] >> 14) & 1;
                 if (level == 0 && trig_mode == 1) {
+#ifdef CONFIG_COREMU
+                    foreach_apic(apic_iter, deliver_bitmask,
+                                 cm_send_ipi_intr(apic_iter->id, vector_num, 0));
+#else
                     foreach_apic(apic_iter, deliver_bitmask,
                                  apic_iter->arb_id = apic_iter->id );
+#endif
                     return;
                 }
             }
             break;
 
         case APIC_DM_SIPI:
+#ifdef CONFIG_COREMU
+            foreach_apic(apic_iter, deliver_bitmask,
+                         cm_send_ipi_intr(apic_iter->id, vector_num, 1));
+#else
             foreach_apic(apic_iter, deliver_bitmask,
                          apic_startup(apic_iter, vector_num) );
+#endif
             return;
     }
 
@@ -646,11 +687,19 @@ static void apic_timer_update(APICState *s, int64_t current_time)
             d = (uint64_t)s->initial_count + 1;
         }
         next_time = s->initial_count_load_time + (d << s->count_shift);
+#ifdef CONFIG_COREMU
+        cm_mod_local_timer(s->timer, next_time);
+#else
         qemu_mod_timer(s->timer, next_time);
+#endif
         s->next_time = next_time;
     } else {
     no_timer:
+#ifdef CONFIG_COREMU
+        cm_del_local_timer(s->timer);
+#else
         qemu_del_timer(s->timer);
+#endif
     }
 }
 
@@ -1000,3 +1049,26 @@ int apic_init(CPUState *env)
     local_apics[s->idx] = s;
     return 0;
 }
+
+/*
+ * COREMU Parallel Emulator Framework
+ * The wrapper for COREMU IO emulate mechanism.
+ *
+ * Copyright (C) 2010 PPI, Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ */
+/* The declaration for wrapper interface */
+void cm_apic_set_irq(struct APICState *s, int vector_num, int trigger_mode)
+{
+    apic_set_irq(s, vector_num, trigger_mode);
+}
+
+void cm_apic_startup(struct APICState *s, int vector_num)
+{
+    apic_startup(s, vector_num);
+}
+
+void cm_apic_setup_arbid(struct APICState *s)
+{
+    s->arb_id = s->id;
+}
diff --git a/hw/arm_gic.c b/hw/arm_gic.c
old mode 100644
new mode 100755
index c4afc6a..1594213
--- a/hw/arm_gic.c
+++ b/hw/arm_gic.c
@@ -12,6 +12,10 @@
    Nested Vectored Interrupt Controller.  */
 
 //#define DEBUG_GIC
+#include "coremu-config.h"
+#include "coremu-spinlock.h"
+#include "cm-target-intr.h"
+#include "coremu-hw.h"
 
 #ifdef DEBUG_GIC
 #define DPRINTF(fmt, ...) \
@@ -151,24 +155,46 @@ static void __attribute__((unused))
 gic_set_pending_private(gic_state *s, int cpu, int irq)
 {
     int cm = 1 << cpu;
-
+#ifdef CONFIG_COREMU
+    if(coremu_hw_thr_p())
+        coremu_spin_lock(&cm_hw_lock);
+#endif
     if (GIC_TEST_PENDING(irq, cm))
+    {
+#ifdef CONFIG_COREMU
+    if(coremu_hw_thr_p())
+        coremu_spin_unlock(&cm_hw_lock);
+#endif
         return;
-
+    }
     DPRINTF("Set %d pending cpu %d\n", irq, cpu);
     GIC_SET_PENDING(irq, cm);
     gic_update(s);
+#ifdef CONFIG_COREMU
+    if(coremu_hw_thr_p())
+        coremu_spin_unlock(&cm_hw_lock);
+#endif
 }
 
 /* Process a change in an external IRQ input.  */
 static void gic_set_irq(void *opaque, int irq, int level)
 {
+#ifdef CONFIG_COREMU
+    if(coremu_hw_thr_p())
+        coremu_spin_lock(&cm_hw_lock);
+#endif
     gic_state *s = (gic_state *)opaque;
     /* The first external input line is internal interrupt 32.  */
     irq += 32;
-    if (level == GIC_TEST_LEVEL(irq, ALL_CPU_MASK))
+    if (level == GIC_TEST_LEVEL(irq, ALL_CPU_MASK)) {
+#ifdef CONFIG_COREMU
+        if(coremu_hw_thr_p())
+            coremu_spin_unlock(&cm_hw_lock);
+#endif
         return;
 
+    }
+
     if (level) {
         GIC_SET_LEVEL(irq, ALL_CPU_MASK);
         if (GIC_TEST_TRIGGER(irq) || GIC_TEST_ENABLED(irq)) {
@@ -179,6 +205,10 @@ static void gic_set_irq(void *opaque, int irq, int level)
         GIC_CLEAR_LEVEL(irq, ALL_CPU_MASK);
     }
     gic_update(s);
+#ifdef CONFIG_COREMU
+        if(coremu_hw_thr_p())
+            coremu_spin_unlock(&cm_hw_lock);
+#endif
 }
 
 static void gic_set_running_irq(gic_state *s, int cpu, int irq)
@@ -194,6 +224,10 @@ static void gic_set_running_irq(gic_state *s, int cpu, int irq)
 
 static uint32_t gic_acknowledge_irq(gic_state *s, int cpu)
 {
+#ifdef CONFIG_COREMU
+    if(coremu_hw_thr_p())
+        coremu_spin_lock(&cm_hw_lock);
+#endif
     int new_irq;
     int cm = 1 << cpu;
     new_irq = s->current_pending[cpu];
@@ -208,11 +242,19 @@ static uint32_t gic_acknowledge_irq(gic_state *s, int cpu)
     GIC_CLEAR_PENDING(new_irq, GIC_TEST_MODEL(new_irq) ? ALL_CPU_MASK : cm);
     gic_set_running_irq(s, cpu, new_irq);
     DPRINTF("ACK %d\n", new_irq);
+#ifdef CONFIG_COREMU
+    if(coremu_hw_thr_p())
+        coremu_spin_unlock(&cm_hw_lock);
+#endif
     return new_irq;
 }
 
 static void gic_complete_irq(gic_state * s, int cpu, int irq)
 {
+#ifdef CONFIG_COREMU
+    if(coremu_hw_thr_p())
+        coremu_spin_lock(&cm_hw_lock);
+#endif
     int update = 0;
     int cm = 1 << cpu;
     DPRINTF("EOI %d\n", irq);
@@ -245,6 +287,10 @@ static void gic_complete_irq(gic_state * s, int cpu, int irq)
         /* Complete the current running IRQ.  */
         gic_set_running_irq(s, cpu, s->last_active[s->running_irq[cpu]][cpu]);
     }
+#ifdef CONFIG_COREMU
+    if(coremu_hw_thr_p())
+        coremu_spin_unlock(&cm_hw_lock);
+#endif
 }
 
 static uint32_t gic_dist_readb(void *opaque, target_phys_addr_t offset)
diff --git a/hw/arm_pic.c b/hw/arm_pic.c
old mode 100644
new mode 100755
index f44568c..3aa6016
--- a/hw/arm_pic.c
+++ b/hw/arm_pic.c
@@ -11,6 +11,9 @@
 #include "pc.h"
 #include "arm-misc.h"
 
+#include "coremu-config.h"
+#include "cm-target-intr.h"
+
 /* Stub functions for hardware that doesn't exist.  */
 void pic_info(Monitor *mon)
 {
@@ -45,5 +48,9 @@ static void arm_pic_cpu_handler(void *opaque, int irq, int level)
 
 qemu_irq *arm_pic_init_cpu(CPUState *env)
 {
+#ifdef CONFIG_COREMU
     return qemu_allocate_irqs(arm_pic_cpu_handler, env, 2);
+#else
+    return qemu_allocate_irqs(cm_arm_pic_cpu_handler, env, 2);
+#endif
 }
diff --git a/hw/i8259.c b/hw/i8259.c
old mode 100644
new mode 100755
index ea48e0e..3943ec7
--- a/hw/i8259.c
+++ b/hw/i8259.c
@@ -32,6 +32,7 @@
 
 //#define DEBUG_IRQ_LATENCY
 //#define DEBUG_IRQ_COUNT
+#include "coremu-config.h"
 
 typedef struct PicState {
     uint8_t last_irr; /* edge detection */
@@ -245,7 +246,13 @@ int pic_read_irq(PicState2 *s)
         irq = 7;
         intno = s->pics[0].irq_base + irq;
     }
+#ifndef CONFIG_COREMU
+    /* COREMU XXX: in parallel emulation, we always use real-time signals to
+     * inform the emulator about interrupts, there is no need for such update by
+     * emulator itself.
+     * ??? more check on this ??? */
     pic_update_irq(s);
+#endif
 
 #ifdef DEBUG_IRQ_LATENCY
     printf("IRQ%d latency=%0.3fus\n",
diff --git a/hw/ide/core.c b/hw/ide/core.c
old mode 100644
new mode 100755
index 0757528..ad46a3a
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -582,7 +582,21 @@ static void ide_read_dma_cb(void *opaque, int ret)
     /* end of transfer ? */
     if (s->nsector == 0) {
         s->status = READY_STAT | SEEK_STAT;
+
+/* For coremu dma state need to be changed before irq is sent */
+#ifdef CONFIG_COREMU
+        bm->status &= ~BM_STATUS_DMAING;
+        bm->status |= BM_STATUS_INT;
+        bm->dma_cb = NULL;
+        bm->unit = -1;
+        bm->aiocb = NULL;
+#endif
         ide_set_irq(s->bus);
+
+#ifdef CONFIG_COREMU
+        return;
+#endif
+
     eot:
         bm->status &= ~BM_STATUS_DMAING;
         bm->status |= BM_STATUS_INT;
@@ -726,7 +740,21 @@ static void ide_write_dma_cb(void *opaque, int ret)
     /* end of transfer ? */
     if (s->nsector == 0) {
         s->status = READY_STAT | SEEK_STAT;
+
+/* For coremu dma state need to be changed before irq is sent */
+#ifdef CONFIG_COREMU
+        bm->status &= ~BM_STATUS_DMAING;
+        bm->status |= BM_STATUS_INT;
+        bm->dma_cb = NULL;
+        bm->unit = -1;
+        bm->aiocb = NULL;
+#endif
         ide_set_irq(s->bus);
+
+#ifdef CONFIG_COREMU
+        return;
+#endif
+
     eot:
         bm->status &= ~BM_STATUS_DMAING;
         bm->status |= BM_STATUS_INT;
diff --git a/hw/ioapic.c b/hw/ioapic.c
index 7ad8018..0cbeac3 100644
--- a/hw/ioapic.c
+++ b/hw/ioapic.c
@@ -179,6 +179,14 @@ static void ioapic_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t va
             default:
                 index = (s->ioregsel - 0x10) >> 1;
                 if (index >= 0 && index < IOAPIC_NUM_PINS) {
+#ifdef CONFIG_COREMU
+                    /* Qemu's code will cause data race: when ioapic_service
+                     * reads the table entry, and the read happens between the
+                     * two assignment, it may get a zero entry.
+                     * In fact, we just need to assign to the high or low 32 bits of
+                     * the table entry according to ioregsel. */
+                    *((uint32_t *)(s->ioredtbl + index) + (s->ioregsel & 1)) = val;
+#else
                     if (s->ioregsel & 1) {
                         s->ioredtbl[index] &= 0xffffffff;
                         s->ioredtbl[index] |= (uint64_t)val << 32;
@@ -186,6 +194,7 @@ static void ioapic_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t va
                         s->ioredtbl[index] &= ~0xffffffffULL;
                         s->ioredtbl[index] |= val;
                     }
+#endif
                     ioapic_service(s);
                 }
         }
diff --git a/hw/pc.c b/hw/pc.c
old mode 100644
new mode 100755
index db2b9a2..1340916
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -50,6 +50,10 @@
 
 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
+#include "coremu-config.h"
+#include "coremu-init.h"
+#include "coremu-core.h"
+#include "cm-target-intr.h"
 
 #define BIOS_FILENAME "bios.bin"
 
@@ -141,7 +145,9 @@ int cpu_get_pic_interrupt(CPUState *env)
     if (intno >= 0) {
         /* set irq request if a PIC irq is still pending */
         /* XXX: improve that */
+#ifndef CONFIG_COREMU
         pic_update_irq(isa_pic);
+#endif
         return intno;
     }
     /* read the irq from the PIC */
@@ -795,6 +801,9 @@ static CPUState *pc_new_cpu(const char *cpu_model)
     } else {
         qemu_register_reset((QEMUResetHandler*)cpu_reset, env);
     }
+#ifdef CONFIG_COREMU
+    coremu_core_init(env->cpuid_apic_id, env);
+#endif
     return env;
 }
 
@@ -916,8 +925,11 @@ static void pc_init1(ram_addr_t ram_size,
     for (i = 0; i < nb_option_roms; i++) {
         rom_add_option(option_rom[i]);
     }
-
+#ifdef CONFIG_COREMU
+    cpu_irq = qemu_allocate_irqs(cm_pic_irq_request, NULL, 1);
+#else
     cpu_irq = qemu_allocate_irqs(pic_irq_request, NULL, 1);
+#endif
     i8259 = i8259_init(cpu_irq[0]);
     isa_irq_state = qemu_mallocz(sizeof(*isa_irq_state));
     isa_irq_state->i8259 = i8259;
@@ -1217,3 +1229,37 @@ static void pc_machine_init(void)
 }
 
 machine_init(pc_machine_init);
+
+/*
+ * COREMU Parallel Emulator Framework
+ * The wrapper for COREMU IO emulate mechanism
+ *
+ * Copyright (C) 2010 PPI, Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ */
+#ifdef CONFIG_COREMU
+/* The pic irq request */
+void cm_pic_irq_request(void * opaque, int irq, int level)
+{
+    CPUState *env = NULL;
+
+    if (coremu_init_done_p()) {
+        /* Send the signal to core thread */
+        env = first_cpu;
+        if (env->apic_state) {
+            while (env) {
+                if (apic_accept_pic_intr(env)) {
+                    cm_send_pic_intr(env->cpuid_apic_id, level);
+                }
+                env = env->next_cpu;
+            }
+        } else {
+            /* Uniprocessor system without lapic */
+            cm_send_pic_intr(env->cpuid_apic_id, level);
+        }
+    } else {
+        /* Initialization hasn't finished */
+        pic_irq_request(opaque, irq, level);
+    }
+}
+#endif
diff --git a/hw/pc.h b/hw/pc.h
index d11a576..5a954fb 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -3,6 +3,8 @@
 
 #include "qemu-common.h"
 #include "ioport.h"
+#include "coremu-config.h"
+#include "coremu-sched.h"
 
 /* PC-style peripherals (also used by other machines).  */
 
@@ -37,8 +39,16 @@ void pic_info(Monitor *mon);
 void irq_info(Monitor *mon);
 
 /* i8254.c */
-
+#ifdef CONFIG_COREMU
+extern int cm_pit_freq;
+/* *
+ * For parallel emualtion, timer frequency need to be reduced when
+ * more than one thread runs on a simple physical cores
+ */
+#define PIT_FREQ cm_pit_freq
+#else
 #define PIT_FREQ 1193182
+#endif
 
 typedef struct PITState PITState;
 
diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index b43c531..392cce5 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -29,6 +29,10 @@
 
 #include "block/raw-posix-aio.h"
 
+#include "coremu-config.h"
+#include "coremu-hw.h"
+#include "coremu-thread.h"
+
 
 struct qemu_paiocb {
     BlockDriverAIOCB common;
@@ -302,10 +306,13 @@ static ssize_t handle_aiocb_rw(struct qemu_paiocb *aiocb)
 
 static void *aio_thread(void *unused)
 {
-    pid_t pid;
 
+#ifdef CONFIG_COREMU
+    coremu_thread_setpriority(PRIO_PROCESS, 0, -21);
+#else
+    pid_t pid;
     pid = getpid();
-
+#endif
     while (1) {
         struct qemu_paiocb *aiocb;
         ssize_t ret = 0;
@@ -353,8 +360,11 @@ static void *aio_thread(void *unused)
         aiocb->ret = ret;
         idle_threads++;
         mutex_unlock(&lock);
-
+#ifdef CONFIG_COREMU
+        coremu_signal_hw_thr(aiocb->ev_signo);
+#else
         if (kill(pid, aiocb->ev_signo)) die("kill failed");
+#endif
     }
 
     idle_threads--;
@@ -499,6 +509,10 @@ static PosixAioState *posix_aio_state;
 
 static void aio_signal_handler(int signum)
 {
+#ifdef CONFIG_COREMU
+    coremu_assert_hw_thr("aio_signal_handler should only called by hw thr\n");
+#endif
+
     if (posix_aio_state) {
         char byte = 0;
         ssize_t ret;
@@ -507,8 +521,9 @@ static void aio_signal_handler(int signum)
         if (ret < 0 && errno != EAGAIN)
             die("write()");
     }
-
+#ifndef CONFIG_COREMU
     qemu_service_io();
+#endif
 }
 
 static void paio_remove(struct qemu_paiocb *acb)
@@ -570,7 +585,11 @@ BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
         return NULL;
     acb->aio_type = type;
     acb->aio_fildes = fd;
+#ifdef CONFIG_COREMU
+    acb->ev_signo = COREMU_AIO_SIG;
+#else
     acb->ev_signo = SIGUSR2;
+#endif
     acb->async_context_id = get_async_context_id();
 
     if (qiov) {
@@ -598,7 +617,11 @@ BlockDriverAIOCB *paio_ioctl(BlockDriverState *bs, int fd,
         return NULL;
     acb->aio_type = QEMU_AIO_IOCTL;
     acb->aio_fildes = fd;
+#ifdef CONFIG_COREMU
+    acb->ev_signo = COREMU_AIO_SIG;
+#else
     acb->ev_signo = SIGUSR2;
+#endif
     acb->aio_offset = 0;
     acb->aio_ioctl_buf = buf;
     acb->aio_ioctl_cmd = req;
@@ -625,7 +648,11 @@ int paio_init(void)
     sigfillset(&act.sa_mask);
     act.sa_flags = 0; /* do not restart syscalls to interrupt select() */
     act.sa_handler = aio_signal_handler;
+#ifdef CONFIG_COREMU
+    sigaction(COREMU_AIO_SIG, &act, NULL);
+#else
     sigaction(SIGUSR2, &act, NULL);
+#endif
 
     s->first_aio = NULL;
     if (qemu_pipe(fds) == -1) {
diff --git a/qemu-timer.c b/qemu-timer.c
index bdc8206..fa50562 100644
--- a/qemu-timer.c
+++ b/qemu-timer.c
@@ -55,6 +55,14 @@
 
 #include "qemu-timer.h"
 
+#include "coremu-config.h"
+#include "coremu-timer.h"
+#include "coremu-debug.h"
+#include "coremu-core.h"
+#include "coremu-hw.h"
+#include "cm-intr.h"
+#include "cm-timer.h"
+
 /* Conversion factor from emulated instructions to virtual clock ticks.  */
 int icount_time_shift;
 /* Arbitrarily pick 1MIPS as the minimum allowable speed.  */
@@ -121,7 +129,7 @@ static void init_get_clock(void)
 static int64_t get_clock(void)
 {
 #if defined(__linux__) || (defined(__FreeBSD__) && __FreeBSD_version >= 500000) \
-	|| defined(__DragonFly__) || defined(__FreeBSD_kernel__)
+    || defined(__DragonFly__) || defined(__FreeBSD_kernel__)
     if (use_rt_clock) {
         struct timespec ts;
         clock_gettime(CLOCK_MONOTONIC, &ts);
@@ -147,7 +155,7 @@ typedef struct TimersState {
     int64_t dummy;
 } TimersState;
 
-TimersState timers_state;
+COREMU_THREAD TimersState timers_state;
 
 /* return the host CPU cycle counter and handle stop/restart */
 int64_t cpu_get_ticks(void)
@@ -160,12 +168,14 @@ int64_t cpu_get_ticks(void)
     } else {
         int64_t ticks;
         ticks = cpu_get_real_ticks();
+#ifndef CONFIG_COREMU
         if (timers_state.cpu_ticks_prev > ticks) {
             /* Note: non increasing ticks may happen if the host uses
                software suspend */
             timers_state.cpu_ticks_offset += timers_state.cpu_ticks_prev - ticks;
         }
         timers_state.cpu_ticks_prev = ticks;
+#endif
         return ticks + timers_state.cpu_ticks_offset;
     }
 }
@@ -423,7 +433,7 @@ void configure_alarms(char const *opt)
             /* Ignore */
             goto next;
 
-	/* Swap */
+    /* Swap */
         tmp = alarm_timers[i];
         alarm_timers[i] = alarm_timers[cur];
         alarm_timers[cur] = tmp;
@@ -718,9 +728,12 @@ static void CALLBACK host_alarm_handler(UINT uTimerID, UINT uMsg,
 static void host_alarm_handler(int host_signum)
 #endif
 {
+    //printf("host_alarm_handler\n");
+    coremu_assert_hw_thr("Host_alarm_handler should be called by hw thr\n");
+
     struct qemu_alarm_timer *t = alarm_timer;
     if (!t)
-	return;
+        return;
 
 #if 0
 #define DISP_FREQ 1000
@@ -926,9 +939,27 @@ static int dynticks_start_timer(struct qemu_alarm_timer *t)
     act.sa_flags = 0;
     act.sa_handler = host_alarm_handler;
 
+#ifdef CONFIG_COREMU
+    int signo;
+    (void) ev;
+
+    if (coremu_hw_thr_p()) {
+        signo = COREMU_HARDWARE_ALARM;
+        sigaction(COREMU_HARDWARE_ALARM, &act, NULL);
+    } else {
+        /* Core signal handler is registerd before running all core. */
+        signo = COREMU_CORE_ALARM;
+    }
+
+    if (coremu_timer_create(signo, &host_timer)) {
+        perror("timer_create");
+        cm_assert(0, "timer create failed");
+        return -1;
+    }
+#else
     sigaction(SIGALRM, &act, NULL);
 
-    /* 
+    /*
      * Initialize ev struct to 0 to avoid valgrind complaining
      * about uninitialized data in timer_create call
      */
@@ -945,7 +976,7 @@ static int dynticks_start_timer(struct qemu_alarm_timer *t)
 
         return -1;
     }
-
+#endif
     t->priv = (void *)(long)host_timer;
 
     return 0;
@@ -1160,7 +1191,7 @@ int qemu_calculate_timeout(void)
         int64_t add;
         int64_t delta;
         /* Advance virtual time to the next event.  */
-	delta = qemu_icount_delta();
+        delta = qemu_icount_delta();
         if (delta > 0) {
             /* If virtual time is ahead of real time then just
                wait for IO.  */
@@ -1188,3 +1219,6 @@ int qemu_calculate_timeout(void)
 #endif
 }
 
+#ifdef CONFIG_COREMU
+#include "cm-timer.c"
+#endif
diff --git a/qemu-timer.h b/qemu-timer.h
index 1494f79..288cb27 100644
--- a/qemu-timer.h
+++ b/qemu-timer.h
@@ -2,7 +2,6 @@
 #define QEMU_TIMER_H
 
 #include "qemu-common.h"
-
 /* timers */
 
 typedef struct QEMUClock QEMUClock;
diff --git a/softmmu_template.h b/softmmu_template.h
index c2df9ec..6289480 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -18,6 +18,11 @@
  */
 #include "qemu-timer.h"
 
+#if defined(TARGET_ARM)
+#include "coremu-spinlock.h"
+#include "cm-target-intr.h"
+#endif
+
 #define DATA_SIZE (1 << SHIFT)
 
 #if DATA_SIZE == 8
@@ -55,6 +60,9 @@ static inline DATA_TYPE glue(io_read, SUFFIX)(target_phys_addr_t physaddr,
                                               target_ulong addr,
                                               void *retaddr)
 {
+#if defined(CONFIG_COREMU) && defined(TARGET_ARM)
+    coremu_spin_lock(&cm_hw_lock);
+#endif
     DATA_TYPE res;
     int index;
     index = (physaddr >> IO_MEM_SHIFT) & (IO_MEM_NB_ENTRIES - 1);
@@ -77,6 +85,10 @@ static inline DATA_TYPE glue(io_read, SUFFIX)(target_phys_addr_t physaddr,
     res |= (uint64_t)io_mem_read[index][2](io_mem_opaque[index], physaddr + 4) << 32;
 #endif
 #endif /* SHIFT > 2 */
+
+#if defined(CONFIG_COREMU) && defined(TARGET_ARM)
+    coremu_spin_unlock(&cm_hw_lock);
+#endif
     return res;
 }
 
@@ -199,6 +211,9 @@ static inline void glue(io_write, SUFFIX)(target_phys_addr_t physaddr,
                                           target_ulong addr,
                                           void *retaddr)
 {
+#if defined(CONFIG_COREMU) && defined(TARGET_ARM)
+    coremu_spin_lock(&cm_hw_lock);
+#endif
     int index;
     index = (physaddr >> IO_MEM_SHIFT) & (IO_MEM_NB_ENTRIES - 1);
     physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
@@ -220,6 +235,10 @@ static inline void glue(io_write, SUFFIX)(target_phys_addr_t physaddr,
     io_mem_write[index][2](io_mem_opaque[index], physaddr + 4, val >> 32);
 #endif
 #endif /* SHIFT > 2 */
+
+#if defined(CONFIG_COREMU) && defined(TARGET_ARM)
+    coremu_spin_unlock(&cm_hw_lock);
+#endif
 }
 
 void REGPARM glue(glue(__st, SUFFIX), MMUSUFFIX)(target_ulong addr,
diff --git a/target-arm/cm-atomic.c b/target-arm/cm-atomic.c
new file mode 100644
index 0000000..9d57243
--- /dev/null
+++ b/target-arm/cm-atomic.c
@@ -0,0 +1,211 @@
+/*
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* We include this file in op_helper.c */
+
+#include <stdlib.h>
+#include <pthread.h>
+#include "coremu-atomic.h"
+#include "coremu-sched.h"
+#include "coremu-types.h"
+
+/* These definitions are copied from translate.c */
+#if defined(WORDS_BIGENDIAN)
+#define REG_B_OFFSET (sizeof(target_ulong) - 1)
+#define REG_H_OFFSET (sizeof(target_ulong) - 2)
+#define REG_W_OFFSET (sizeof(target_ulong) - 2)
+#define REG_L_OFFSET (sizeof(target_ulong) - 4)
+#define REG_LH_OFFSET (sizeof(target_ulong) - 8)
+#else
+#define REG_B_OFFSET 0
+#define REG_H_OFFSET 1
+#define REG_W_OFFSET 0
+#define REG_L_OFFSET 0
+#define REG_LH_OFFSET 4
+#endif
+
+#define REG_LOW_MASK (~(uint64_t)0x0>>32)
+
+/* gen_op instructions */
+/* i386 arith/logic operations */
+enum {
+    OP_ADDL,
+    OP_ORL,
+    OP_ADCL,
+    OP_SBBL,
+    OP_ANDL,
+    OP_SUBL,
+    OP_XORL,
+    OP_CMPL,
+};
+
+/* XXX: This function is not platform specific, move them to other place
+ * later. */
+
+/* Given the guest virtual address, get the corresponding host address.
+ * This macro resembles ldxxx in softmmu_template.h
+ * NOTE: This must be inlined since the use of GETPC needs to get the
+ * return address. Using always inline also works, we use macro here to be more
+ * explicit. */
+#define CM_GET_QEMU_ADDR(q_addr, v_addr) \
+do {                                                                        \
+    int __mmu_idx, __index;                                                 \
+    CPUState *__env1 = cpu_single_env;                                      \
+    void *__retaddr;                                                        \
+    __index = (v_addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);            \
+    /* get the CPL, hence determine the MMU mode */                         \
+    __mmu_idx = cpu_mmu_index(__env1);                                      \
+    /* We use this function in the implementation of atomic instructions */ \
+    /* and we are going to modify these memory. So we use addr_write. */    \
+    if (unlikely(__env1->tlb_table[__mmu_idx][__index].addr_write           \
+                != (v_addr & TARGET_PAGE_MASK))) {                          \
+        __retaddr = GETPC();                                                \
+        tlb_fill(v_addr, 1, __mmu_idx, __retaddr);                          \
+    }                                                                       \
+    q_addr = v_addr + __env1->tlb_table[__mmu_idx][__index].addend;         \
+} while(0)
+
+#define LD_b ldub_raw
+#define LD_w lduw_raw
+#define LD_l ldl_raw
+#define LD_q ldq_raw
+
+/* Lightweight transactional memory. */
+#define TX(vaddr, type, value, command) \
+    target_ulong __q_addr;                                    \
+    DATA_##type __oldv;                                       \
+    DATA_##type value;                                        \
+                                                              \
+    CM_GET_QEMU_ADDR(__q_addr, vaddr);                        \
+    do {                                                      \
+        __oldv = value = LD_##type((DATA_##type *)__q_addr);  \
+        {command;};                                           \
+        mb();                                                 \
+    } while (__oldv != (atomic_compare_exchange##type(        \
+                    (DATA_##type *)__q_addr, __oldv, value)))
+
+COREMU_THREAD uint64_t cm_exclusive_val;
+COREMU_THREAD uint32_t cm_exclusive_addr = -1;
+
+#define GEN_LOAD_EXCLUSIVE(type, TYPE) \
+void HELPER(load_exclusive##type)(uint32_t reg, uint32_t addr)        \
+{                                                                     \
+    ram_addr_t q_addr = 0;                                            \
+    DATA_##type val = 0;                                              \
+                                                                      \
+    cm_exclusive_addr = addr;                                         \
+    CM_GET_QEMU_ADDR(q_addr,addr);                                    \
+    val = *(DATA_##type *)q_addr;                                     \
+    cm_exclusive_val = val;                                           \
+    cpu_single_env->regs[reg] = val;                                  \
+}
+
+GEN_LOAD_EXCLUSIVE(b, B);
+GEN_LOAD_EXCLUSIVE(w, W);
+GEN_LOAD_EXCLUSIVE(l, L);
+//GEN_LOAD_EXCLUSIVE(q, Q);
+
+#define GEN_STORE_EXCLUSIVE(type, TYPE) \
+void HELPER(store_exclusive##type)(uint32_t res, uint32_t reg, uint32_t addr) \
+{                                                                             \
+    ram_addr_t q_addr = 0;                                                    \
+    DATA_##type val = 0;                                                      \
+    DATA_##type r = 0;                                                        \
+                                                                              \
+    if(addr != cm_exclusive_addr)                                             \
+        goto fail;                                                            \
+                                                                              \
+    CM_GET_QEMU_ADDR(q_addr,addr);                                            \
+    val = (DATA_##type)cpu_single_env->regs[reg];                             \
+                                                                              \
+    r = atomic_compare_exchange##type((DATA_##type *)q_addr,                  \
+                                    (DATA_##type)cm_exclusive_val, val);      \
+                                                                              \
+    if(r == (DATA_##type)cm_exclusive_val) {                                  \
+        cpu_single_env->regs[res] = 0;                                        \
+        goto done;                                                            \
+    } else {                                                                  \
+        goto fail;                                                            \
+    }                                                                         \
+                                                                              \
+fail:                                                                         \
+    cpu_single_env->regs[res] = 1;                                            \
+                                                                              \
+done:                                                                         \
+    cm_exclusive_addr = -1;                                                   \
+    return;                                                                   \
+}
+
+GEN_STORE_EXCLUSIVE(b, B);
+GEN_STORE_EXCLUSIVE(w, W);
+GEN_STORE_EXCLUSIVE(l, L);
+//GEN_STORE_EXCLUSIVE(q, Q);
+
+void HELPER(load_exclusiveq)(uint32_t reg, uint32_t addr)
+{
+   ram_addr_t q_addr = 0;
+   uint64_t val = 0;
+
+   cm_exclusive_addr = addr;
+   CM_GET_QEMU_ADDR(q_addr,addr);
+   val = *(uint64_t *)q_addr;
+   cm_exclusive_val = val;
+   cpu_single_env->regs[reg] = (uint32_t)val;
+   cpu_single_env->regs[reg + 1] = (uint32_t)(val>>32);
+}
+
+void HELPER(store_exclusiveq)(uint32_t res, uint32_t reg, uint32_t addr)
+{
+   ram_addr_t q_addr = 0;
+   uint64_t val = 0;
+   uint64_t r = 0;
+
+   if(addr != cm_exclusive_addr)
+        goto fail;
+
+   CM_GET_QEMU_ADDR(q_addr,addr);
+   val = (uint32_t)cpu_single_env->regs[reg];
+   val |= ((uint64_t)cpu_single_env->regs[reg + 1]) << 32;
+
+   r = atomic_compare_exchangeq((uint64_t *)q_addr,
+                                    (uint64_t)cm_exclusive_val, val);
+
+   if(r == (uint64_t)cm_exclusive_val) {
+        cpu_single_env->regs[res] = 0;
+        goto done;
+   } else {
+        goto fail;
+   }
+
+fail:
+    cpu_single_env->regs[res] = 1;
+
+done:
+    cm_exclusive_addr = -1;
+    return;
+}
+
+void HELPER(clear_exclusive)()
+{
+    cm_exclusive_addr = -1;
+}
diff --git a/target-arm/cm-atomic.h b/target-arm/cm-atomic.h
new file mode 100644
index 0000000..26ee256
--- /dev/null
+++ b/target-arm/cm-atomic.h
@@ -0,0 +1,34 @@
+/*
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define __GEN_HEADER(type) \
+DEF_HELPER_2(load_exclusive##type, void, i32, i32)           \
+DEF_HELPER_3(store_exclusive##type, void, i32, i32, i32)
+
+__GEN_HEADER(b)
+__GEN_HEADER(w)
+__GEN_HEADER(l)
+__GEN_HEADER(q)
+
+DEF_HELPER_0(clear_exclusive, void)
+
diff --git a/target-arm/cm-target-intr.c b/target-arm/cm-target-intr.c
new file mode 100644
index 0000000..1627d28
--- /dev/null
+++ b/target-arm/cm-target-intr.c
@@ -0,0 +1,74 @@
+/*
+ * COREMU Parallel Emulator Framework
+ * The definition of interrupt related interface for i386
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include "cpu.h"
+#include "../hw/arm-misc.h"
+#include "coremu-intr.h"
+#include "coremu-malloc.h"
+#include "coremu-atomic.h"
+#include "coremu-spinlock.h"
+#include "cm-intr.h"
+#include "cm-target-intr.h"
+
+CMSpinLock cm_hw_lock;
+static void cm_gic_intr_handler(void *opaque)
+{
+    CMGICIntr *gic_intr = (CMGICIntr *) opaque;
+    switch (gic_intr->irq_num) {
+    case ARM_PIC_CPU_IRQ:
+        if (gic_intr->level)
+            cpu_interrupt(cpu_single_env, CPU_INTERRUPT_HARD);
+        else
+            cpu_reset_interrupt(cpu_single_env, CPU_INTERRUPT_HARD);
+        break;
+    case ARM_PIC_CPU_FIQ:
+        if (gic_intr->level)
+            cpu_interrupt(cpu_single_env, CPU_INTERRUPT_FIQ);
+        else
+            cpu_reset_interrupt(cpu_single_env, CPU_INTERRUPT_FIQ);
+        break;
+    default:
+        hw_error("arm_pic_cpu_handler: Bad interrput line %d\n",
+                 gic_intr->irq_num);
+    }
+
+}
+
+static CMIntr *cm_gic_intr_init(int irq, int level)
+{
+    CMGICIntr *intr = coremu_mallocz(sizeof(*intr));
+    ((CMIntr *)intr)->handler = cm_gic_intr_handler;
+    intr->irq_num = irq;
+    intr->level = level;
+    return (CMIntr *)intr;
+}
+
+void cm_arm_pic_cpu_handler(void *opaque, int irq, int level)
+{
+    CPUState *env = (CPUState *)opaque;
+    coremu_send_intr(cm_gic_intr_init(irq, level), env->cpu_index);
+}
diff --git a/target-arm/cm-target-intr.h b/target-arm/cm-target-intr.h
new file mode 100644
index 0000000..7a2a148
--- /dev/null
+++ b/target-arm/cm-target-intr.h
@@ -0,0 +1,40 @@
+/*
+ * COREMU Parallel Emulator Framework
+ * The definition of interrupt related interface for i386
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef CM_ARM_INTR_H
+#define CM_ARM_INTR_H
+#include "cm-intr.h"
+#include "coremu-spinlock.h"
+
+typedef struct CMGICIntr {
+    CMIntr *base;
+    int irq_num;
+    int level;
+} CMGICIntr;
+
+extern CMSpinLock cm_hw_lock;
+void cm_arm_pic_cpu_handler(void *opaque, int irq, int level);
+
+#endif
diff --git a/target-arm/helper.c b/target-arm/helper.c
old mode 100644
new mode 100755
index 99e0394..efbb2fa
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -295,9 +295,12 @@ CPUARMState *cpu_arm_init(const char *cpu_model)
         return NULL;
     env = qemu_mallocz(sizeof(CPUARMState));
     cpu_exec_init(env);
+
     if (!inited) {
         inited = 1;
+#ifndef CONFIG_COREMU
         arm_translate_init();
+#endif
     }
 
     env->cpu_model_str = cpu_model;
@@ -314,6 +317,10 @@ CPUARMState *cpu_arm_init(const char *cpu_model)
                                  19, "arm-vfp.xml", 0);
     }
     qemu_init_vcpu(env);
+
+#ifdef CONFIG_COREMU
+    coremu_core_init(env->cpu_index, env);
+#endif
     return env;
 }
 
diff --git a/target-arm/helpers.h b/target-arm/helpers.h
index 0d1bc47..9e07338 100644
--- a/target-arm/helpers.h
+++ b/target-arm/helpers.h
@@ -447,4 +447,9 @@ DEF_HELPER_3(iwmmxt_muladdswl, i64, i64, i32, i32)
 
 DEF_HELPER_2(set_teecr, void, env, i32)
 
+#include "coremu-config.h"
+#ifdef CONFIG_COREMU
+#include "cm-atomic.h"
+#endif
+
 #include "def-helper.h"
diff --git a/target-arm/neon_helper.c b/target-arm/neon_helper.c
index 5e6452b..e58a8cd 100644
--- a/target-arm/neon_helper.c
+++ b/target-arm/neon_helper.c
@@ -18,7 +18,7 @@
 
 #define SET_QC() env->vfp.xregs[ARM_VFP_FPSCR] = CPSR_Q
 
-static float_status neon_float_status;
+static COREMU_THREAD float_status neon_float_status;
 #define NFS &neon_float_status
 
 /* Helper routines to perform bitwise copies between float and int.  */
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index 9b1a014..3503855 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -487,3 +487,8 @@ uint64_t HELPER(neon_sub_saturate_u64)(uint64_t src1, uint64_t src2)
     }
     return res;
 }
+
+#include "coremu-config.h"
+#ifdef CONFIG_COREMU
+#include "cm-atomic.c"
+#endif
diff --git a/target-arm/translate.c b/target-arm/translate.c
index 0eccca5..b6d07c5 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -72,21 +72,21 @@ typedef struct DisasContext {
 #define DISAS_WFI 4
 #define DISAS_SWI 5
 
-static TCGv_ptr cpu_env;
+static COREMU_THREAD TCGv_ptr cpu_env;
 /* We reuse the same 64-bit temporaries for efficiency.  */
-static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
-static TCGv_i32 cpu_R[16];
-static TCGv_i32 cpu_exclusive_addr;
-static TCGv_i32 cpu_exclusive_val;
-static TCGv_i32 cpu_exclusive_high;
+static COREMU_THREAD TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
+static COREMU_THREAD TCGv_i32 cpu_R[16];
+static COREMU_THREAD TCGv_i32 cpu_exclusive_addr;
+static COREMU_THREAD TCGv_i32 cpu_exclusive_val;
+static COREMU_THREAD TCGv_i32 cpu_exclusive_high;
 #ifdef CONFIG_USER_ONLY
 static TCGv_i32 cpu_exclusive_test;
 static TCGv_i32 cpu_exclusive_info;
 #endif
 
 /* FIXME:  These should be removed.  */
-static TCGv cpu_F0s, cpu_F1s;
-static TCGv_i64 cpu_F0d, cpu_F1d;
+static COREMU_THREAD TCGv cpu_F0s, cpu_F1s;
+static COREMU_THREAD TCGv_i64 cpu_F0d, cpu_F1d;
 
 #include "gen-icount.h"
 
@@ -123,7 +123,7 @@ void arm_translate_init(void)
 #include "helpers.h"
 }
 
-static int num_temps;
+static COREMU_THREAD int num_temps;
 
 /* Allocate a temporary variable.  */
 static TCGv_i32 new_tmp(void)
@@ -6026,6 +6026,12 @@ static void disas_arm_insn(CPUState * env, DisasContext *s)
     TCGv tmp2;
     TCGv tmp3;
     TCGv addr;
+
+#ifdef CONFIG_COREMU
+    TCGv cm_tmp;
+    TCGv cm_tmp1;
+#endif
+
     TCGv_i64 tmp64;
 
     insn = ldl_code(s->pc);
@@ -6069,7 +6075,11 @@ static void disas_arm_insn(CPUState * env, DisasContext *s)
             switch ((insn >> 4) & 0xf) {
             case 1: /* clrex */
                 ARCH(6K);
+#ifdef CONFIG_COREMU
+                gen_helper_clear_exclusive();
+#else
                 gen_clrex(s);
+#endif
                 return;
             case 4: /* dsb */
             case 5: /* dmb */
@@ -6655,36 +6665,75 @@ static void disas_arm_insn(CPUState * env, DisasContext *s)
                         addr = tcg_temp_local_new_i32();
                         load_reg_var(s, addr, rn);
                         if (insn & (1 << 20)) {
+#ifdef CONFIG_COREMU
+							cm_tmp = tcg_const_i32(rd);
+#endif
                             switch (op1) {
                             case 0: /* ldrex */
-                                gen_load_exclusive(s, rd, 15, addr, 2);
+#ifdef CONFIG_COREMU
+                                gen_helper_load_exclusivel(cm_tmp, addr);
+#else
+                                 gen_load_exclusive(s, rd, 15, addr, 2);
+#endif
                                 break;
                             case 1: /* ldrexd */
+#ifdef CONFIG_COREMU
+                                gen_helper_load_exclusiveq(cm_tmp, addr);
+#else
                                 gen_load_exclusive(s, rd, rd + 1, addr, 3);
                                 break;
+#endif
                             case 2: /* ldrexb */
+#ifdef CONFIG_COREMU
+                                gen_helper_load_exclusiveb(cm_tmp, addr);
+#else
                                 gen_load_exclusive(s, rd, 15, addr, 0);
+#endif
                                 break;
                             case 3: /* ldrexh */
+#ifdef CONFIG_COREMU
+                                gen_helper_load_exclusivew(cm_tmp, addr);
+#else
                                 gen_load_exclusive(s, rd, 15, addr, 1);
+#endif
                                 break;
                             default:
                                 abort();
                             }
                         } else {
                             rm = insn & 0xf;
+#ifdef CONFIG_COREMU
+							cm_tmp = tcg_const_i32(rd);
+                            cm_tmp1 = tcg_const_i32(rm);
+#endif
                             switch (op1) {
                             case 0:  /*  strex */
+#ifdef CONFIG_COREMU
+                                gen_helper_store_exclusivel(cm_tmp, cm_tmp1, addr);
+#else
                                 gen_store_exclusive(s, rd, rm, 15, addr, 2);
+#endif
                                 break;
                             case 1: /*  strexd */
+#ifdef CONFIG_COREMU
+                                gen_helper_store_exclusiveq(cm_tmp, cm_tmp1, addr);
+#else
                                 gen_store_exclusive(s, rd, rm, rm + 1, addr, 3);
+#endif
                                 break;
                             case 2: /*  strexb */
+#ifdef CONFIG_COREMU
+                                gen_helper_store_exclusiveb(cm_tmp, cm_tmp1, addr);
+#else
                                 gen_store_exclusive(s, rd, rm, 15, addr, 0);
+#endif
                                 break;
                             case 3: /* strexh */
+#ifdef CONFIG_COREMU
+                                gen_helper_store_exclusivew(cm_tmp, cm_tmp1, addr);
+#else
                                 gen_store_exclusive(s, rd, rm, 15, addr, 1);
+#endif
                                 break;
                             default:
                                 abort();
@@ -7333,6 +7382,10 @@ static int disas_thumb2_insn(CPUState *env, DisasContext *s, uint16_t insn_hw1)
     TCGv tmp;
     TCGv tmp2;
     TCGv tmp3;
+#ifdef CONFIG_COREMU
+    TCGv cm_tmp;
+    TCGv cm_tmp1;
+#endif
     TCGv addr;
     TCGv_i64 tmp64;
     int op;
@@ -7445,9 +7498,20 @@ static int disas_thumb2_insn(CPUState *env, DisasContext *s, uint16_t insn_hw1)
                 load_reg_var(s, addr, rn);
                 tcg_gen_addi_i32(addr, addr, (insn & 0xff) << 2);
                 if (insn & (1 << 20)) {
+#ifdef CONFIG_COREMU
+                    cm_tmp = tcg_const_i32(rs);
+                    gen_helper_load_exclusivel(cm_tmp, addr);
+#else
                     gen_load_exclusive(s, rs, 15, addr, 2);
+#endif
                 } else {
+#ifdef CONFIG_COREMU
+                    cm_tmp = tcg_const_i32(rd);
+                    cm_tmp1 = tcg_const_i32(rs);
+                    gen_helper_store_exclusivel(cm_tmp, cm_tmp1, addr);
+#else
                     gen_store_exclusive(s, rd, rs, 15, addr, 2);
+#endif
                 }
                 tcg_temp_free(addr);
             } else if ((insn & (1 << 6)) == 0) {
diff --git a/target-i386/cm-atomic.c b/target-i386/cm-atomic.c
new file mode 100644
index 0000000..ecb9349
--- /dev/null
+++ b/target-i386/cm-atomic.c
@@ -0,0 +1,491 @@
+/*
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *  Xi Wu           <wuxi@fudan.edu.cn>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* We include this file in op_helper.c */
+
+#include <stdlib.h>
+#include <pthread.h>
+#include "coremu-atomic.h"
+#include "coremu-sched.h"
+#include "coremu-types.h"
+
+/* These definitions are copied from translate.c */
+#if defined(WORDS_BIGENDIAN)
+#define REG_B_OFFSET (sizeof(target_ulong) - 1)
+#define REG_H_OFFSET (sizeof(target_ulong) - 2)
+#define REG_W_OFFSET (sizeof(target_ulong) - 2)
+#define REG_L_OFFSET (sizeof(target_ulong) - 4)
+#define REG_LH_OFFSET (sizeof(target_ulong) - 8)
+#else
+#define REG_B_OFFSET 0
+#define REG_H_OFFSET 1
+#define REG_W_OFFSET 0
+#define REG_L_OFFSET 0
+#define REG_LH_OFFSET 4
+#endif
+
+#define REG_LOW_MASK (~(uint64_t)0x0>>32)
+
+/* gen_op instructions */
+/* i386 arith/logic operations */
+enum {
+    OP_ADDL,
+    OP_ORL,
+    OP_ADCL,
+    OP_SBBL,
+    OP_ANDL,
+    OP_SUBL,
+    OP_XORL,
+    OP_CMPL,
+};
+
+/* XXX: This function is not platform specific, move them to other place
+ * later. */
+
+/* Given the guest virtual address, get the corresponding host address.
+ * This macro resembles ldxxx in softmmu_template.h
+ * NOTE: This must be inlined since the use of GETPC needs to get the
+ * return address. Using always inline also works, we use macro here to be more
+ * explicit. */
+#define CM_GET_QEMU_ADDR(q_addr, v_addr) \
+do {                                                                        \
+    int __mmu_idx, __index;                                                 \
+    CPUState *__env1 = cpu_single_env;                                      \
+    void *__retaddr;                                                        \
+    __index = (v_addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);            \
+    /* get the CPL, hence determine the MMU mode */                         \
+    __mmu_idx = cpu_mmu_index(__env1);                                      \
+    /* We use this function in the implementation of atomic instructions */ \
+    /* and we are going to modify these memory. So we use addr_write. */    \
+    if (unlikely(__env1->tlb_table[__mmu_idx][__index].addr_write           \
+                != (v_addr & TARGET_PAGE_MASK))) {                          \
+        __retaddr = GETPC();                                                \
+        tlb_fill(v_addr, 1, __mmu_idx, __retaddr);                          \
+    }                                                                       \
+    q_addr = v_addr + __env1->tlb_table[__mmu_idx][__index].addend;         \
+} while(0)
+
+static target_ulong cm_get_reg_val(int ot, int hregs, int reg)
+{
+    target_ulong val, offset;
+    CPUState *env1 = cpu_single_env;
+
+    switch(ot) {
+    case 0:  /*OT_BYTE*/
+        if (reg < 4 || reg >= 8 || hregs) {
+            goto std_case;
+        } else {
+            offset = offsetof(CPUState, regs[reg - 4]) + REG_H_OFFSET;
+            val = *(((uint8_t *)env1) + offset);
+        }
+        break;
+    default:
+    std_case:
+        val =  env1->regs[reg];
+        break;
+    }
+
+    return val;
+}
+
+static void cm_set_reg_val(int ot, int hregs, int reg, target_ulong val)
+{
+      target_ulong offset;
+
+      CPUState *env1 = cpu_single_env;
+
+      switch(ot) {
+      case 0: /* OT_BYTE */
+          if (reg < 4 || reg >= 8 || hregs) {
+              offset = offsetof(CPUState, regs[reg]) + REG_B_OFFSET;
+              *(((uint8_t *) env1) + offset) = (uint8_t)val;
+          } else {
+              offset = offsetof(CPUState, regs[reg - 4]) + REG_H_OFFSET;
+              *(((uint8_t *) env1) + offset) = (uint8_t)val;
+          }
+          break;
+      case 1: /* OT_WORD */
+          offset = offsetof(CPUState, regs[reg]) + REG_W_OFFSET;
+          *((uint16_t *)((uint8_t *)env1 + offset)) = (uint16_t)val;
+          break;
+      case 2: /* OT_LONG */
+          env1->regs[reg] = REG_LOW_MASK & val;
+          break;
+      default:
+      case 3: /* OT_QUAD */
+          env1->regs[reg] = val;
+          break;
+      }
+}
+
+#define LD_b ldub_raw
+#define LD_w lduw_raw
+#define LD_l ldl_raw
+#define LD_q ldq_raw
+
+/* Lightweight transactional memory. */
+#define TX(vaddr, type, value, command) \
+    target_ulong __q_addr;                                    \
+    DATA_##type __oldv;                                       \
+    DATA_##type value;                                        \
+                                                              \
+    CM_GET_QEMU_ADDR(__q_addr, vaddr);                        \
+    do {                                                      \
+        __oldv = value = LD_##type((DATA_##type *)__q_addr);  \
+        {command;};                                           \
+        mb();                                                 \
+    } while (__oldv != (atomic_compare_exchange##type(        \
+                    (DATA_##type *)__q_addr, __oldv, value)))
+
+/* Atomically emulate INC instruction using CAS1 and memory transaction. */
+
+#define GEN_ATOMIC_INC(type, TYPE) \
+void helper_atomic_inc##type(target_ulong a0, int c)                  \
+{                                                                     \
+    int eflags_c, eflags;                                             \
+    int cc_op;                                                        \
+                                                                      \
+    /* compute the previous instruction c flags */                    \
+    eflags_c = helper_cc_compute_c(CC_OP);                            \
+                                                                      \
+    TX(a0, type, value, {                                             \
+        if (c > 0) {                                                  \
+            value++;                                                  \
+            cc_op = CC_OP_INC##TYPE;                                  \
+        } else {                                                      \
+            value--;                                                  \
+            cc_op = CC_OP_DEC##TYPE;                                  \
+        }                                                             \
+    });                                                               \
+                                                                      \
+    CC_SRC = eflags_c;                                                \
+    CC_DST = value;                                                   \
+                                                                      \
+    eflags = helper_cc_compute_all(cc_op);                            \
+    CC_SRC = eflags;                                                  \
+}                                                                     \
+
+GEN_ATOMIC_INC(b, B);
+GEN_ATOMIC_INC(w, W);
+GEN_ATOMIC_INC(l, L);
+GEN_ATOMIC_INC(q, Q);
+
+#define OT_b 0
+#define OT_w 1
+#define OT_l 2
+#define OT_q 3
+
+#define GEN_XCHG(type) \
+void helper_xchg##type(target_ulong a0, int reg, int hreg)    \
+{                                                             \
+    DATA_##type val, out;                                     \
+    target_ulong q_addr;                                      \
+                                                              \
+    CM_GET_QEMU_ADDR(q_addr, a0);                             \
+    val = (DATA_##type)cm_get_reg_val(OT_##type, hreg, reg);  \
+    out = atomic_exchange##type((DATA_##type *)q_addr, val);  \
+    mb();                                                     \
+                                                              \
+    cm_set_reg_val(OT_##type, hreg, reg, out);                \
+}
+
+GEN_XCHG(b);
+GEN_XCHG(w);
+GEN_XCHG(l);
+GEN_XCHG(q);
+
+#define GEN_OP(type, TYPE) \
+void helper_atomic_op##type(target_ulong a0, target_ulong t1,    \
+                       int op)                                   \
+{                                                                \
+    DATA_##type operand;                                         \
+    int eflags_c, eflags;                                        \
+    int cc_op;                                                   \
+                                                                 \
+    /* compute the previous instruction c flags */               \
+    eflags_c = helper_cc_compute_c(CC_OP);                       \
+    operand = (DATA_##type)t1;                                   \
+                                                                 \
+    TX(a0, type, value, {                                        \
+        switch(op) {                                             \
+        case OP_ADCL:                                            \
+            value += operand + eflags_c;                         \
+            cc_op = CC_OP_ADD##TYPE + (eflags_c << 2);           \
+            CC_SRC = operand;                                    \
+            break;                                               \
+        case OP_SBBL:                                            \
+            value = value - operand - eflags_c;                  \
+            cc_op = CC_OP_SUB##TYPE + (eflags_c << 2);           \
+            CC_SRC = operand;                                    \
+            break;                                               \
+        case OP_ADDL:                                            \
+            value += operand;                                    \
+            cc_op = CC_OP_ADD##TYPE;                             \
+            CC_SRC = operand;                                    \
+            break;                                               \
+        case OP_SUBL:                                            \
+            value -= operand;                                    \
+            cc_op = CC_OP_SUB##TYPE;                             \
+            CC_SRC = operand;                                    \
+            break;                                               \
+        default:                                                 \
+        case OP_ANDL:                                            \
+            value &= operand;                                    \
+            cc_op = CC_OP_LOGIC##TYPE;                           \
+            break;                                               \
+        case OP_ORL:                                             \
+            value |= operand;                                    \
+            cc_op = CC_OP_LOGIC##TYPE;                           \
+            break;                                               \
+        case OP_XORL:                                            \
+            value ^= operand;                                    \
+            cc_op = CC_OP_LOGIC##TYPE;                           \
+            break;                                               \
+        case OP_CMPL:                                            \
+            abort();                                             \
+            break;                                               \
+        }                                                        \
+    });                                                          \
+    CC_DST = value;                                              \
+    /* successful transaction, compute the eflags */             \
+    eflags = helper_cc_compute_all(cc_op);                       \
+    CC_SRC = eflags;                                             \
+}
+
+GEN_OP(b, B);
+GEN_OP(w, W);
+GEN_OP(l, L);
+GEN_OP(q, Q);
+
+/* xadd */
+#define GEN_XADD(type, TYPE) \
+void helper_atomic_xadd##type(target_ulong a0, int reg,   \
+                        int hreg)                         \
+{                                                         \
+    DATA_##type operand, oldv;                            \
+    int eflags;                                           \
+                                                          \
+    operand = (DATA_##type)cm_get_reg_val(                \
+            OT_##type, hreg, reg);                        \
+                                                          \
+    TX(a0, type, newv, {                                  \
+        oldv = newv;                                      \
+        newv += operand;                                  \
+    });                                                   \
+                                                          \
+    /* transaction successes */                           \
+    /* xchg the register and compute the eflags */        \
+    cm_set_reg_val(OT_##type, hreg, reg, oldv);           \
+    CC_SRC = oldv;                                        \
+    CC_DST = newv;                                        \
+                                                          \
+    eflags = helper_cc_compute_all(CC_OP_ADD##TYPE);      \
+    CC_SRC = eflags;                                      \
+}
+
+GEN_XADD(b, B);
+GEN_XADD(w, W);
+GEN_XADD(l, L);
+GEN_XADD(q, Q);
+
+/* cmpxchg */
+#define GEN_CMPXCHG(type, TYPE) \
+void helper_atomic_cmpxchg##type(target_ulong a0, int reg,       \
+                            int hreg)                            \
+{                                                                \
+    DATA_##type reg_v, eax_v, res;                               \
+    int eflags;                                                  \
+    target_ulong q_addr;                                         \
+                                                                 \
+    CM_GET_QEMU_ADDR(q_addr, a0);                                \
+    reg_v = (DATA_##type)cm_get_reg_val(OT_##type, hreg, reg);   \
+    eax_v = (DATA_##type)cm_get_reg_val(OT_##type, 0, R_EAX);    \
+                                                                 \
+    res = atomic_compare_exchange##type(                         \
+            (DATA_##type *)q_addr, eax_v, reg_v);                \
+    mb();                                                        \
+                                                                 \
+    if (res != eax_v)                                            \
+        cm_set_reg_val(OT_##type, 0, R_EAX, res);                \
+                                                                 \
+    CC_SRC = res;                                                \
+    CC_DST = eax_v - res;                                        \
+                                                                 \
+    eflags = helper_cc_compute_all(CC_OP_SUB##TYPE);             \
+    CC_SRC = eflags;                                             \
+}
+
+GEN_CMPXCHG(b, B);
+GEN_CMPXCHG(w, W);
+GEN_CMPXCHG(l, L);
+GEN_CMPXCHG(q, Q);
+
+/* cmpxchgb (8, 16) */
+void helper_atomic_cmpxchg8b(target_ulong a0)
+{
+    uint64_t edx_eax, ecx_ebx, res;
+    int eflags;
+    target_ulong q_addr;
+
+    eflags = helper_cc_compute_all(CC_OP);
+    CM_GET_QEMU_ADDR(q_addr, a0);
+
+    edx_eax = (((uint64_t)EDX << 32) | (uint32_t)EAX);
+    ecx_ebx = (((uint64_t)ECX << 32) | (uint32_t)EBX);
+
+    res = atomic_compare_exchangeq((uint64_t *)q_addr, edx_eax, ecx_ebx);
+    mb();
+
+    if (res == edx_eax) {
+         eflags |= CC_Z;
+    } else {
+        EDX = (uint32_t)(res >> 32);
+        EAX = (uint32_t)res;
+        eflags &= ~CC_Z;
+    }
+
+    CC_SRC = eflags;
+}
+
+void helper_atomic_cmpxchg16b(target_ulong a0)
+{
+    uint8_t res;
+    int eflags;
+    target_ulong q_addr;
+
+    eflags = helper_cc_compute_all(CC_OP);
+    CM_GET_QEMU_ADDR(q_addr, a0);
+
+    uint64_t old_rax = *(uint64_t *)q_addr;
+    uint64_t old_rdx = *(uint64_t *)(q_addr + 8);
+    res = atomic_compare_exchange16b((uint64_t *)q_addr, EAX, EDX, EBX, ECX);
+    mb();
+
+    if (res) {
+        eflags |= CC_Z;         /* swap success */
+    } else {
+        EDX = old_rdx;
+        EAX = old_rax;
+        eflags &= ~CC_Z;        /* read the old value ! */
+    }
+
+    CC_SRC = eflags;
+}
+
+/* not */
+#define GEN_NOT(type) \
+void helper_atomic_not##type(target_ulong a0)  \
+{                                              \
+    TX(a0, type, value, {                      \
+        value = ~value;                        \
+    });                                        \
+}
+
+GEN_NOT(b);
+GEN_NOT(w);
+GEN_NOT(l);
+GEN_NOT(q);
+
+/* neg */
+#define GEN_NEG(type, TYPE) \
+void helper_atomic_neg##type(target_ulong a0)        \
+{                                                    \
+    int eflags;                                      \
+                                                     \
+    TX(a0, type, value, {                            \
+        value = -value;                              \
+    });                                              \
+                                                     \
+    /* We should use the old value to compute CC */  \
+    CC_SRC = CC_DST = -value;                        \
+                                                     \
+    eflags = helper_cc_compute_all(CC_OP_SUB##TYPE); \
+    CC_SRC = eflags;                                 \
+}                                                    \
+
+GEN_NEG(b, B);
+GEN_NEG(w, W);
+GEN_NEG(l, L);
+GEN_NEG(q, Q);
+
+/* This is only used in BTX instruction, with an additional offset.
+ * Note that, when using register bitoffset, the value can be larger than
+ * operand size - 1 (operand size can be 16/32/64), refer to intel manual 2A
+ * page 3-11. */
+#define TX2(vaddr, type, value, offset, command) \
+    target_ulong __q_addr;                                    \
+    DATA_##type __oldv;                                       \
+    DATA_##type value;                                        \
+                                                              \
+    CM_GET_QEMU_ADDR(__q_addr, vaddr);                        \
+    __q_addr += offset >> 3;                                  \
+    do {                                                      \
+        __oldv = value = LD_##type((DATA_##type *)__q_addr);  \
+        {command;};                                           \
+        mb();                                                 \
+    } while (__oldv != (atomic_compare_exchange##type(        \
+                    (DATA_##type *)__q_addr, __oldv, value)))
+
+#define GEN_BTX(ins, command) \
+void helper_atomic_##ins(target_ulong a0, target_ulong offset, \
+        int ot)                                                \
+{                                                              \
+    uint8_t old_byte;                                          \
+    int eflags;                                                \
+                                                               \
+    TX2(a0, b, value, offset, {                                 \
+        old_byte = value;                                      \
+        {command;};                                            \
+    });                                                        \
+                                                               \
+    CC_SRC = (old_byte >> (offset & 0x7));                     \
+    CC_DST = 0;                                                \
+    eflags = helper_cc_compute_all(CC_OP_SARB + ot);           \
+    CC_SRC = eflags;                                           \
+}
+
+/* bts */
+GEN_BTX(bts, {
+    value |= (1 << (offset & 0x7));
+});
+/* btr */
+GEN_BTX(btr, {
+    value &= ~(1 << (offset & 0x7));
+});
+/* btc */
+GEN_BTX(btc, {
+    value ^= (1 << (offset & 0x7));
+});
+
+/* fence **/
+void helper_fence(void)
+{
+    mb();
+}
+
+/* pause */
+void helper_pause(void)
+{
+    coremu_cpu_sched(CM_EVENT_PAUSE);
+}
diff --git a/target-i386/cm-atomic.h b/target-i386/cm-atomic.h
new file mode 100644
index 0000000..f888231
--- /dev/null
+++ b/target-i386/cm-atomic.h
@@ -0,0 +1,50 @@
+/*
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#define __GEN_HEADER(type) \
+DEF_HELPER_2(atomic_inc##type, void, tl, int)                \
+DEF_HELPER_3(xchg##type, void, tl, int, int)                 \
+DEF_HELPER_3(atomic_op##type, void, tl, tl, int)             \
+DEF_HELPER_3(atomic_xadd##type, void, tl, int, int)          \
+DEF_HELPER_3(atomic_cmpxchg##type, void, tl, int, int)       \
+DEF_HELPER_1(atomic_not##type, void, tl)                     \
+DEF_HELPER_1(atomic_neg##type, void, tl)
+
+__GEN_HEADER(b)
+__GEN_HEADER(w)
+__GEN_HEADER(l)
+__GEN_HEADER(q)
+
+DEF_HELPER_1(atomic_cmpxchg8b, void, tl)
+DEF_HELPER_1(atomic_cmpxchg16b, void, tl)
+
+DEF_HELPER_3(atomic_bts, void, tl, tl, int)
+DEF_HELPER_3(atomic_btr, void, tl, tl, int)
+DEF_HELPER_3(atomic_btc, void, tl, tl, int)
+
+/* fence */
+DEF_HELPER_0(fence, void)
+
+/* pause */
+DEF_HELPER_0(pause, void)
+
diff --git a/target-i386/cm-target-intr.c b/target-i386/cm-target-intr.c
new file mode 100644
index 0000000..7d8a21d
--- /dev/null
+++ b/target-i386/cm-target-intr.c
@@ -0,0 +1,162 @@
+/*
+ * COREMU Parallel Emulator Framework
+ * The definition of interrupt related interface for i386
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include "cpu.h"
+#include "../hw/apic.h"
+
+#include "coremu-intr.h"
+#include "coremu-malloc.h"
+#include "coremu-atomic.h"
+#include "cm-intr.h"
+#include "cm-target-intr.h"
+
+/* The initial function for interrupts */
+
+static CMIntr *cm_pic_intr_init(int level)
+{
+    CMPICIntr *intr = coremu_mallocz(sizeof(*intr));
+    ((CMIntr *)intr)->handler = cm_pic_intr_handler;
+
+    intr->level = level;
+
+    return (CMIntr *)intr;
+}
+
+static CMIntr *cm_apicbus_intr_init(int mask, int vector_num, int trigger_mode)
+{
+    CMAPICBusIntr *intr = coremu_mallocz(sizeof(*intr));
+    ((CMIntr *)intr)->handler = cm_apicbus_intr_handler;
+
+    intr->mask = mask;
+    intr->vector_num = vector_num;
+    intr->trigger_mode = trigger_mode;
+
+    return (CMIntr *)intr;
+}
+
+static CMIntr *cm_ipi_intr_init(int vector_num, int deliver_mode)
+{
+    CMIPIIntr *intr = coremu_mallocz(sizeof(*intr));
+    ((CMIntr *)intr)->handler = cm_ipi_intr_handler;
+
+    intr->vector_num = vector_num;
+    intr->deliver_mode = deliver_mode;
+
+    return (CMIntr *)intr;
+}
+
+static CMIntr *cm_tlb_flush_req_init(void)
+{
+    CMTLBFlushReq *intr = coremu_mallocz(sizeof(*intr));
+    ((CMIntr *)intr)->handler = cm_tlb_flush_req_handler;
+
+    return (CMIntr *)intr;
+}
+
+void cm_send_pic_intr(int target, int level)
+{
+    coremu_send_intr(cm_pic_intr_init(level), target);
+}
+
+void cm_send_apicbus_intr(int target, int mask,
+                          int vector_num, int trigger_mode)
+{
+    coremu_send_intr(cm_apicbus_intr_init(mask, vector_num, trigger_mode),
+                     target);
+}
+
+void cm_send_ipi_intr(int target, int vector_num, int deliver_mode)
+{
+    coremu_send_intr(cm_ipi_intr_init(vector_num, deliver_mode), target);
+}
+
+void cm_send_tlb_flush_req(int target)
+{
+    assert(0);
+    coremu_send_intr(cm_tlb_flush_req_init(), target);
+}
+
+/* Handle the interrupt from the i8259 chip */
+void cm_pic_intr_handler(void *opaque)
+{
+    CMPICIntr *pic_intr = (CMPICIntr *) opaque;
+
+    CPUState *self = cpu_single_env;
+    int level = pic_intr->level;
+
+    if (self->apic_state) {
+        if (apic_accept_pic_intr(self))
+            apic_deliver_pic_intr(self, pic_intr->level);
+    } else {
+        if (level)
+            cpu_interrupt(self, CPU_INTERRUPT_HARD);
+        else
+            cpu_reset_interrupt(self, CPU_INTERRUPT_HARD);
+    }
+}
+
+/* Handle the interrupt from the apic bus.
+   Because hardware connect to ioapic and inter-processor interrupt
+   are all delivered through apic bus, so this kind of interrupt can
+   be hw interrupt or IPI */
+void cm_apicbus_intr_handler(void *opaque)
+{
+    CMAPICBusIntr *apicbus_intr = (CMAPICBusIntr *)opaque;
+
+    CPUState *self = cpu_single_env;
+
+    if (apicbus_intr->vector_num >= 0) {
+        cm_apic_set_irq(self->apic_state, apicbus_intr->vector_num,
+                        apicbus_intr->trigger_mode);
+    } else {
+        /* For NMI, SMI and INIT the vector information is ignored */
+        cpu_interrupt(self, apicbus_intr->mask);
+    }
+}
+
+/* Handle the inter-processor interrupt (Only for INIT De-assert or SIPI) */
+void cm_ipi_intr_handler(void *opaque)
+{
+    CMIPIIntr *ipi_intr = (CMIPIIntr *)opaque;
+
+    CPUState *self = cpu_single_env;
+
+    if (ipi_intr->deliver_mode) {
+        /* SIPI */
+        cm_apic_startup(self->apic_state, ipi_intr->vector_num);
+    } else {
+        /* the INIT level de-assert */
+        cm_apic_setup_arbid(self->apic_state);
+    }
+}
+
+
+/* Handler the tlb flush request */
+void cm_tlb_flush_req_handler(void *opaque)
+{
+    tlb_flush(cpu_single_env, 1);
+}
diff --git a/target-i386/cm-target-intr.h b/target-i386/cm-target-intr.h
new file mode 100644
index 0000000..aa64215
--- /dev/null
+++ b/target-i386/cm-target-intr.h
@@ -0,0 +1,96 @@
+/*
+ * COREMU Parallel Emulator Framework
+ * Defines qemu related structure and interface for i386 architecture.
+ *
+ * Copyright (C) 2010 Parallel Processing Institute (PPI), Fudan Univ.
+ *  <http://ppi.fudan.edu.cn/system_research_group>
+ *
+ * Authors:
+ *  Zhaoguo Wang    <zgwang@fudan.edu.cn>
+ *  Yufei Chen      <chenyufei@fudan.edu.cn>
+ *  Ran Liu         <naruilone@gmail.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+ /* Interrupt types for i386 architecture */
+#ifndef CM_I386_INTR_H
+#define CM_I386_INTR_H
+
+#include "cm-intr.h"
+
+enum cm_i386_intr_type {
+    PIC_INTR,                   /* Interrupt from i8259 pic */
+    APICBUS_INTR,               /* Interrupt from APIC BUS
+                                   can be issued by other core or ioapic */
+    IPI_INTR,                   /* Interrupt from other core
+                                   Only for de-assert INIT and SIPI */
+    DIRECT_INTR,                /* Direct interrupt (SMI) */
+    SHUTDOWN_REQ,               /* Shut down request */
+    TLB_FLUSH_REQ,              /* This kind of request does not exist in real world,
+                                   we do this is just to confirm to Qemu framework */
+};
+
+
+/* Interrupt infomation for i8259 pic */
+typedef struct CMPICIntr {
+    CMIntr *base;
+    int level;                  /* the level of interrupt */
+} CMPICIntr;
+
+
+/* Interrupt information for IOAPIC */
+typedef struct CMAPICBusIntr {
+    CMIntr *base;
+    int mask;                   /* Qemu will use this to check which
+                                   kind of interrupt is issued */
+    int vector_num;             /* The interrupt vector number
+                                   If the vector number is -1, it indicates
+                                   the vector information is ignored (SMI, NMI, INIT) */
+    int trigger_mode;           /* The trigger mode of interrupt */
+} CMAPICBusIntr;
+
+
+typedef struct CMIPIIntr {
+    CMIntr *base;
+    int vector_num;             /* The interrupt vector number */
+    int deliver_mode;           /* The deliver mode of interrupt
+                                   0: INIT Level De-assert
+                                   1: Start up IPI */
+} CMIPIIntr;
+
+typedef struct CMTLBFlushReq {
+    CMIntr *base;
+} CMTLBFlushReq;
+
+/* The declaration for apic wrapper function */
+void cm_apic_set_irq(struct APICState *s, int vector_num, int trigger_mode);
+void cm_apic_startup(struct APICState *s, int vector_num);
+void cm_apic_setup_arbid(struct APICState *s);
+
+/* The declaration for pic wrapper function */
+void cm_pic_irq_request(void *opaque, int irq, int level);
+
+/* The common declaration */
+void cm_send_pic_intr(int target, int level);
+void cm_send_apicbus_intr(int target, int mask, int vector_num, int
+						  trigger_mode);
+void cm_send_ipi_intr(int target, int vector_num, int deliver_mode);
+void cm_send_tlb_flush_req(int target);
+
+void cm_pic_intr_handler(void *opaque);
+void cm_apicbus_intr_handler(void *opaque);
+void cm_ipi_intr_handler(void *opaque);
+void cm_tlb_flush_req_handler(void *opaque);
+#endif
diff --git a/target-i386/helper.c b/target-i386/helper.c
index c9508a8..1c487aa 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1127,7 +1127,10 @@ CPUX86State *cpu_x86_init(const char *cpu_model)
     /* init various static tables */
     if (!inited) {
         inited = 1;
+#ifndef CONFIG_COREMU
+        /* For coremu, this is called in cm_cpu_exec_init_core. */
         optimize_flags_init();
+#endif
 #ifndef CONFIG_USER_ONLY
         prev_debug_excp_handler =
             cpu_set_debug_excp_handler(breakpoint_handler);
diff --git a/target-i386/helper.h b/target-i386/helper.h
index 6b518ad..c75a441 100644
--- a/target-i386/helper.h
+++ b/target-i386/helper.h
@@ -217,4 +217,9 @@ DEF_HELPER_2(rclq, tl, tl, tl)
 DEF_HELPER_2(rcrq, tl, tl, tl)
 #endif
 
+#include "coremu-config.h"
+#ifdef CONFIG_COREMU
+#include "cm-atomic.h"
+#endif
+
 #include "def-helper.h"
diff --git a/target-i386/op_helper.c b/target-i386/op_helper.c
index dcbdfe7..121e739 100644
--- a/target-i386/op_helper.c
+++ b/target-i386/op_helper.c
@@ -5662,3 +5662,8 @@ uint32_t helper_cc_compute_c(int op)
 #endif
     }
 }
+
+#include "coremu-config.h"
+#ifdef CONFIG_COREMU
+#include "cm-atomic.c"
+#endif
diff --git a/target-i386/translate.c b/target-i386/translate.c
index 38c6016..f5f0fab 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -27,6 +27,7 @@
 #include "exec-all.h"
 #include "disas.h"
 #include "tcg-op.h"
+#include "coremu-sched.h"
 
 #include "helper.h"
 #define GEN_HELPER 1
@@ -59,25 +60,25 @@
 //#define MACRO_TEST   1
 
 /* global register indexes */
-static TCGv_ptr cpu_env;
-static TCGv cpu_A0, cpu_cc_src, cpu_cc_dst, cpu_cc_tmp;
-static TCGv_i32 cpu_cc_op;
-static TCGv cpu_regs[CPU_NB_REGS];
+static COREMU_THREAD TCGv_ptr cpu_env;
+static COREMU_THREAD TCGv cpu_A0, cpu_cc_src, cpu_cc_dst, cpu_cc_tmp;
+static COREMU_THREAD TCGv_i32 cpu_cc_op;
+static COREMU_THREAD TCGv cpu_regs[CPU_NB_REGS];
 /* local temps */
-static TCGv cpu_T[2], cpu_T3;
+static COREMU_THREAD TCGv cpu_T[2], cpu_T3;
 /* local register indexes (only used inside old micro ops) */
-static TCGv cpu_tmp0, cpu_tmp4;
-static TCGv_ptr cpu_ptr0, cpu_ptr1;
-static TCGv_i32 cpu_tmp2_i32, cpu_tmp3_i32;
-static TCGv_i64 cpu_tmp1_i64;
-static TCGv cpu_tmp5;
+static COREMU_THREAD TCGv cpu_tmp0, cpu_tmp4;
+static COREMU_THREAD TCGv_ptr cpu_ptr0, cpu_ptr1;
+static COREMU_THREAD TCGv_i32 cpu_tmp2_i32, cpu_tmp3_i32;
+static COREMU_THREAD TCGv_i64 cpu_tmp1_i64;
+static COREMU_THREAD TCGv cpu_tmp5;
 
-static uint8_t gen_opc_cc_op[OPC_BUF_SIZE];
+static COREMU_THREAD uint8_t gen_opc_cc_op[OPC_BUF_SIZE];
 
 #include "gen-icount.h"
 
 #ifdef TARGET_X86_64
-static int x86_64_hregs;
+static COREMU_THREAD int x86_64_hregs;
 #endif
 
 typedef struct DisasContext {
@@ -1307,6 +1308,31 @@ static void gen_helper_fp_arith_STN_ST0(int op, int opreg)
 /* if d == OR_TMP0, it means memory operand (address in A0) */
 static void gen_op(DisasContext *s1, int op, int ot, int d)
 {
+#ifdef CONFIG_COREMU
+    if (s1->prefix & PREFIX_LOCK) {
+        if (s1->cc_op != CC_OP_DYNAMIC)
+            gen_op_set_cc_op(s1->cc_op);
+
+        switch (ot & 3) {
+        case 0:
+            gen_helper_atomic_opb(cpu_A0,cpu_T[1], tcg_const_i32(op));
+            break;
+        case 1:
+            gen_helper_atomic_opw(cpu_A0,cpu_T[1], tcg_const_i32(op));
+            break;
+        case 2:
+            gen_helper_atomic_opl(cpu_A0,cpu_T[1], tcg_const_i32(op));
+            break;
+        default:
+#ifdef TARGET_X86_64
+        case 3:
+            gen_helper_atomic_opq(cpu_A0,cpu_T[1], tcg_const_i32(op));
+#endif
+        }
+        s1->cc_op = CC_OP_EFLAGS;
+        return;
+    }
+#endif
     if (d != OR_TMP0) {
         gen_op_mov_TN_reg(ot, 0, d);
     } else {
@@ -1403,6 +1429,37 @@ static void gen_op(DisasContext *s1, int op, int ot, int d)
 /* if d == OR_TMP0, it means memory operand (address in A0) */
 static void gen_inc(DisasContext *s1, int ot, int d, int c)
 {
+#ifdef CONFIG_COREMU
+    /* with lock prefix */
+    if (s1->prefix & PREFIX_LOCK) {
+        assert(d == OR_TMP0);
+
+        /* The helper will use CAS1 as a unified way to
+           implement atomic inc (locked inc) */
+        if (s1->cc_op != CC_OP_DYNAMIC)
+            gen_op_set_cc_op(s1->cc_op);
+
+        switch(ot & 3) {
+        case 0:
+            gen_helper_atomic_incb(cpu_A0, tcg_const_i32(c));
+            break;
+        case 1:
+            gen_helper_atomic_incw(cpu_A0, tcg_const_i32(c));
+            break;
+        case 2:
+            gen_helper_atomic_incl(cpu_A0, tcg_const_i32(c));
+            break;
+        default:
+#ifdef TARGET_X86_64
+        case 3:
+            gen_helper_atomic_incq(cpu_A0, tcg_const_i32(c));
+#endif
+        }
+        s1->cc_op = CC_OP_EFLAGS;
+        return;
+    }
+#endif
+
     if (d != OR_TMP0)
         gen_op_mov_TN_reg(ot, 0, d);
     else
@@ -2712,7 +2769,7 @@ static void gen_eob(DisasContext *s)
     if (s->singlestep_enabled) {
         gen_helper_debug();
     } else if (s->tf) {
-	gen_helper_single_step();
+    gen_helper_single_step();
     } else {
         tcg_gen_exit_tb(0);
     }
@@ -4208,9 +4265,13 @@ static target_ulong disas_insn(DisasContext *s, target_ulong pc_start)
     s->aflag = aflag;
     s->dflag = dflag;
 
+#ifndef CONFIG_COREMU
+    /* In coremu, atomic instructions are emulated by light-weight memory
+     * transaction, so there's no need to use lock. */
     /* lock generation */
     if (prefixes & PREFIX_LOCK)
         gen_helper_lock();
+#endif
 
     /* now check op code */
  reswitch:
@@ -4372,6 +4433,30 @@ static target_ulong disas_insn(DisasContext *s, target_ulong pc_start)
             s->cc_op = CC_OP_LOGICB + ot;
             break;
         case 2: /* not */
+#ifdef CONFIG_COREMU
+            if (s->prefix & PREFIX_LOCK) {
+                if (mod == 3)
+                    goto illegal_op;
+
+                switch(ot & 3) {
+                case 0:
+                    gen_helper_atomic_notb(cpu_A0);
+                    break;
+                case 1:
+                    gen_helper_atomic_notw(cpu_A0);
+                    break;
+                case 2:
+                    gen_helper_atomic_notl(cpu_A0);
+                    break;
+                default:
+#ifdef TARGET_X86_64
+                case 3:
+                    gen_helper_atomic_notq(cpu_A0);
+#endif
+                }
+                break;
+            }
+#endif
             tcg_gen_not_tl(cpu_T[0], cpu_T[0]);
             if (mod != 3) {
                 gen_op_st_T0_A0(ot + s->mem_index);
@@ -4380,6 +4465,34 @@ static target_ulong disas_insn(DisasContext *s, target_ulong pc_start)
             }
             break;
         case 3: /* neg */
+#ifdef CONFIG_COREMU
+            if (s->prefix & PREFIX_LOCK) {
+                if (mod == 3)
+                    goto illegal_op;
+
+                if (s->cc_op != CC_OP_DYNAMIC)
+                    gen_op_set_cc_op(s->cc_op);
+
+                switch(ot & 3) {
+                case 0:
+                    gen_helper_atomic_negb(cpu_A0);
+                    break;
+                case 1:
+                    gen_helper_atomic_negw(cpu_A0);
+                    break;
+                case 2:
+                    gen_helper_atomic_negl(cpu_A0);
+                    break;
+                default:
+#ifdef TARGET_X86_64
+                case 3:
+                    gen_helper_atomic_negq(cpu_A0);
+#endif
+                }
+                s->cc_op = CC_OP_EFLAGS;
+                break;
+            }
+#endif
             tcg_gen_neg_tl(cpu_T[0], cpu_T[0]);
             if (mod != 3) {
                 gen_op_st_T0_A0(ot + s->mem_index);
@@ -4834,7 +4947,38 @@ static target_ulong disas_insn(DisasContext *s, target_ulong pc_start)
             gen_op_addl_T0_T1();
             gen_op_mov_reg_T1(ot, reg);
             gen_op_mov_reg_T0(ot, rm);
-        } else {
+        } else
+#ifdef CONFIG_COREMU
+        if (s->prefix & PREFIX_LOCK) {
+            gen_lea_modrm(s, modrm, &reg_addr, &offset_addr);
+            if (s->cc_op != CC_OP_DYNAMIC)
+                gen_op_set_cc_op(s->cc_op);
+
+            switch (ot & 3) {
+            case 0:
+                gen_helper_atomic_xaddb(cpu_A0, tcg_const_i32(reg),
+                        tcg_const_i32(x86_64_hregs));
+                break;
+            case 1:
+                gen_helper_atomic_xaddw(cpu_A0, tcg_const_i32(reg),
+                        tcg_const_i32(x86_64_hregs));
+                break;
+            case 2:
+                gen_helper_atomic_xaddl(cpu_A0, tcg_const_i32(reg),
+                        tcg_const_i32(x86_64_hregs));
+                break;
+            default:
+#ifdef TARGET_X86_64
+            case 3:
+                gen_helper_atomic_xaddq(cpu_A0, tcg_const_i32(reg),
+                        tcg_const_i32(x86_64_hregs));
+#endif
+            }
+            s->cc_op = CC_OP_EFLAGS;
+            break;
+        } else
+#endif
+        {
             gen_lea_modrm(s, modrm, &reg_addr, &offset_addr);
             gen_op_mov_TN_reg(ot, 0, reg);
             gen_op_ld_T1_A0(ot + s->mem_index);
@@ -4858,6 +5002,41 @@ static target_ulong disas_insn(DisasContext *s, target_ulong pc_start)
             modrm = ldub_code(s->pc++);
             reg = ((modrm >> 3) & 7) | rex_r;
             mod = (modrm >> 6) & 3;
+
+#ifdef CONFIG_COREMU
+            if (s->prefix & PREFIX_LOCK) {
+                if (mod == 3)
+                    goto illegal_op;
+
+                gen_lea_modrm(s, modrm, &reg_addr, &offset_addr);
+
+                if (s->cc_op != CC_OP_DYNAMIC)
+                    gen_op_set_cc_op(s->cc_op);
+
+                switch(ot & 3) {
+                case 0:
+                    gen_helper_atomic_cmpxchgb(cpu_A0, tcg_const_i32(reg),
+                                                    tcg_const_i32(x86_64_hregs));
+                    break;
+                case 1:
+                    gen_helper_atomic_cmpxchgw(cpu_A0, tcg_const_i32(reg),
+                                                    tcg_const_i32(x86_64_hregs));
+                    break;
+                case 2:
+                    gen_helper_atomic_cmpxchgl(cpu_A0, tcg_const_i32(reg),
+                                                    tcg_const_i32(x86_64_hregs));
+                    break;
+                default:
+#ifdef TARGET_X86_64
+                case 3:
+                    gen_helper_atomic_cmpxchgq(cpu_A0, tcg_const_i32(reg),
+                            tcg_const_i32(x86_64_hregs));
+#endif
+                }
+                s->cc_op = CC_OP_EFLAGS;
+                break;
+            }
+#endif
             t0 = tcg_temp_local_new();
             t1 = tcg_temp_local_new();
             t2 = tcg_temp_local_new();
@@ -4912,9 +5091,14 @@ static target_ulong disas_insn(DisasContext *s, target_ulong pc_start)
             if (s->cc_op != CC_OP_DYNAMIC)
                 gen_op_set_cc_op(s->cc_op);
             gen_lea_modrm(s, modrm, &reg_addr, &offset_addr);
+#ifdef CONFIG_COREMU
+            if (s->prefix | PREFIX_LOCK) {
+                gen_helper_atomic_cmpxchg16b(cpu_A0);
+            } else
+#endif
             gen_helper_cmpxchg16b(cpu_A0);
         } else
-#endif        
+#endif
         {
             if (!(s->cpuid_features & CPUID_CX8))
                 goto illegal_op;
@@ -4922,6 +5106,11 @@ static target_ulong disas_insn(DisasContext *s, target_ulong pc_start)
             if (s->cc_op != CC_OP_DYNAMIC)
                 gen_op_set_cc_op(s->cc_op);
             gen_lea_modrm(s, modrm, &reg_addr, &offset_addr);
+#ifdef CONFIG_COREMU
+            if (s->prefix | PREFIX_LOCK) {
+                gen_helper_atomic_cmpxchg8b(cpu_A0);
+            } else
+#endif
             gen_helper_cmpxchg8b(cpu_A0);
         }
         s->cc_op = CC_OP_EFLAGS;
@@ -5315,15 +5504,43 @@ static target_ulong disas_insn(DisasContext *s, target_ulong pc_start)
             gen_op_mov_reg_T1(ot, reg);
         } else {
             gen_lea_modrm(s, modrm, &reg_addr, &offset_addr);
+
+#ifdef CONFIG_COREMU
+            /* for xchg, lock is implicit.
+               XXX: none flag is affected! */
+            switch (ot & 3) {
+            case 0:
+                gen_helper_xchgb(cpu_A0, tcg_const_i32(reg),
+                        tcg_const_i32(x86_64_hregs));
+                break;
+            case 1:
+                gen_helper_xchgw(cpu_A0, tcg_const_i32(reg),
+                        tcg_const_i32(x86_64_hregs));
+                break;
+            case 2:
+                gen_helper_xchgl(cpu_A0, tcg_const_i32(reg),
+                        tcg_const_i32(x86_64_hregs));
+                break;
+            default:
+#ifdef TARGET_X86_64
+                case 3:
+                gen_helper_xchgq(cpu_A0, tcg_const_i32(reg),
+                        tcg_const_i32(x86_64_hregs));
+#endif
+            }
+#else
             gen_op_mov_TN_reg(ot, 0, reg);
             /* for xchg, lock is implicit */
             if (!(prefixes & PREFIX_LOCK))
                 gen_helper_lock();
             gen_op_ld_T1_A0(ot + s->mem_index);
             gen_op_st_T0_A0(ot + s->mem_index);
+#ifndef CONFIG_COREMU
             if (!(prefixes & PREFIX_LOCK))
                 gen_helper_unlock();
+#endif
             gen_op_mov_reg_T1(ot, reg);
+#endif
         }
         break;
     case 0xc4: /* les Gv */
@@ -6530,6 +6747,28 @@ static target_ulong disas_insn(DisasContext *s, target_ulong pc_start)
         }
     bt_op:
         tcg_gen_andi_tl(cpu_T[1], cpu_T[1], (1 << (3 + ot)) - 1);
+#ifdef CONFIG_COREMU
+        if (s->prefix & PREFIX_LOCK) {
+            if (s->cc_op != CC_OP_DYNAMIC)
+                gen_op_set_cc_op(s->cc_op);
+
+            switch (op) {
+            case 0:
+                goto illegal_op;
+                break;
+            case 1:
+                gen_helper_atomic_bts(cpu_A0, cpu_T[1], tcg_const_i32(ot));
+                break;
+            case 2:
+                gen_helper_atomic_btr(cpu_A0, cpu_T[1], tcg_const_i32(ot));
+                break;
+            case 3:
+                gen_helper_atomic_btc(cpu_A0, cpu_T[1], tcg_const_i32(ot));
+            }
+            s->cc_op = CC_OP_EFLAGS;
+            break;
+        }
+#endif
         switch(op) {
         case 0:
             tcg_gen_shr_tl(cpu_cc_src, cpu_T[0], cpu_T[1]);
@@ -6669,6 +6908,11 @@ static target_ulong disas_insn(DisasContext *s, target_ulong pc_start)
             goto illegal_op;
         if (prefixes & PREFIX_REPZ) {
             gen_svm_check_intercept(s, pc_start, SVM_EXIT_PAUSE);
+            /* When the emulated core number is more than the real number
+               on the machine, we need to catch the pause instruction to
+               avoid the lockholder thread to be preemted. */
+            if (!coremu_physical_core_enough_p())
+                gen_helper_pause();
         }
         break;
     case 0x9b: /* fwait */
@@ -7647,12 +7891,16 @@ static target_ulong disas_insn(DisasContext *s, target_ulong pc_start)
         goto illegal_op;
     }
     /* lock generation */
+#ifndef CONFIG_COREMU
     if (s->prefix & PREFIX_LOCK)
         gen_helper_unlock();
+#endif
     return s->pc;
  illegal_op:
+#ifndef CONFIG_COREMU
     if (s->prefix & PREFIX_LOCK)
         gen_helper_unlock();
+#endif
     /* XXX: ensure that no lock was generated */
     gen_exception(s, EXCP06_ILLOP, pc_start - s->cs_base);
     return s->pc;
diff --git a/tcg/tcg.c b/tcg/tcg.c
index a99ecb9..b6e50d4 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -48,6 +48,7 @@
 #include "cache-utils.h"
 #include "host-utils.h"
 #include "qemu-timer.h"
+#include "coremu-atomic.h"
 
 /* Note: the long term plan is to reduce the dependancies on the QEMU
    CPU definitions. Currently they are used for qemu_ld/st
@@ -59,6 +60,8 @@
 #include "tcg-op.h"
 #include "elf.h"
 
+#include "coremu-config.h"
+
 #if defined(CONFIG_USE_GUEST_BASE) && !defined(TCG_TARGET_HAS_GUEST_BASE)
 #error GUEST_BASE not supported on this host.
 #endif
@@ -66,7 +69,7 @@
 static void patch_reloc(uint8_t *code_ptr, int type, 
                         tcg_target_long value, tcg_target_long addend);
 
-static TCGOpDef tcg_op_defs[] = {
+static COREMU_THREAD TCGOpDef tcg_op_defs[] = {
 #define DEF(s, n, copy_size) { #s, 0, 0, n, n, 0, copy_size },
 #define DEF2(s, oargs, iargs, cargs, flags) { #s, oargs, iargs, cargs, iargs + oargs + cargs, flags, 0 },
 #include "tcg-opc.h"
@@ -74,12 +77,12 @@ static TCGOpDef tcg_op_defs[] = {
 #undef DEF2
 };
 
-static TCGRegSet tcg_target_available_regs[2];
-static TCGRegSet tcg_target_call_clobber_regs;
+static COREMU_THREAD TCGRegSet tcg_target_available_regs[2];
+static COREMU_THREAD TCGRegSet tcg_target_call_clobber_regs;
 
 /* XXX: move that inside the context */
-uint16_t *gen_opc_ptr;
-TCGArg *gen_opparam_ptr;
+COREMU_THREAD uint16_t *gen_opc_ptr;
+COREMU_THREAD TCGArg *gen_opparam_ptr;
 
 static inline void tcg_out8(TCGContext *s, uint8_t v)
 {
@@ -242,9 +245,13 @@ void tcg_context_init(TCGContext *s)
     tcg_target_init(s);
 
     /* init global prologue and epilogue */
+#ifndef CONFIG_COREMU
+    /* For coremu, we only need one piece code prologue. We initialize it in the
+     * hardware thread. */
     s->code_buf = code_gen_prologue;
     s->code_ptr = s->code_buf;
     tcg_target_qemu_prologue(s);
+#endif
     flush_icache_range((unsigned long)s->code_buf, 
                        (unsigned long)s->code_ptr);
 }
@@ -2143,3 +2150,37 @@ void tcg_dump_info(FILE *f,
     cpu_fprintf(f, "[TCG profiler not compiled]\n");
 }
 #endif
+
+#ifdef CONFIG_COREMU
+
+#include <sys/types.h>
+#include <sys/mman.h>
+#include "cm-init.h"
+void cm_code_prologue_init(void)
+{
+    TCGContext tmp_ctx;
+    memset(&tmp_ctx, 0, sizeof(tmp_ctx));
+
+    /* init global prologue and epilogue */
+    tmp_ctx.code_buf = code_gen_prologue;
+    tmp_ctx.code_ptr = tmp_ctx.code_buf;
+    tcg_target_qemu_prologue(&tmp_ctx);
+}
+
+void cm_inject_invalidate_code(TranslationBlock *tb)
+{
+    uint16_t ret =  atomic_compare_exchangew(&tb->has_invalidate, 0, 1);
+
+    if(ret == 1)
+       return;
+
+    TCGContext *s = &tcg_ctx;
+    s->code_buf = tb->tc_ptr;
+    s->code_ptr = tb->tc_ptr;
+
+    tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_RAX, (long) tb + 3);
+    tcg_out8(s, 0xe9); /* jmp tb_ret_addr */
+    tcg_out32(s, tb_ret_addr - s->code_ptr - 4);
+}
+
+#endif /* CONFIG_COREMU */
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 44856e1..171b233 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -318,11 +318,11 @@ struct TCGContext {
 #endif
 };
 
-extern TCGContext tcg_ctx;
-extern uint16_t *gen_opc_ptr;
-extern TCGArg *gen_opparam_ptr;
-extern uint16_t gen_opc_buf[];
-extern TCGArg gen_opparam_buf[];
+extern COREMU_THREAD TCGContext tcg_ctx;
+extern COREMU_THREAD uint16_t *gen_opc_ptr;
+extern COREMU_THREAD TCGArg *gen_opparam_ptr;
+extern COREMU_THREAD uint16_t gen_opc_buf[];
+extern COREMU_THREAD TCGArg gen_opparam_buf[];
 
 /* pool based memory allocation */
 
diff --git a/tcg/x86_64/tcg-target.c b/tcg/x86_64/tcg-target.c
index 3892f75..fdc8784 100644
--- a/tcg/x86_64/tcg-target.c
+++ b/tcg/x86_64/tcg-target.c
@@ -22,6 +22,8 @@
  * THE SOFTWARE.
  */
 
+#include "coremu-config.h"
+
 #ifndef NDEBUG
 static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
     "%rax",
diff --git a/translate-all.c b/translate-all.c
index 91cbbc4..d7dc4ea 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -31,15 +31,17 @@
 #include "tcg.h"
 #include "qemu-timer.h"
 
+#include "coremu-config.h"
+
 /* code generation context */
-TCGContext tcg_ctx;
+COREMU_THREAD TCGContext tcg_ctx;
 
-uint16_t gen_opc_buf[OPC_BUF_SIZE];
-TCGArg gen_opparam_buf[OPPARAM_BUF_SIZE];
+COREMU_THREAD uint16_t gen_opc_buf[OPC_BUF_SIZE];
+COREMU_THREAD TCGArg gen_opparam_buf[OPPARAM_BUF_SIZE];
 
-target_ulong gen_opc_pc[OPC_BUF_SIZE];
-uint16_t gen_opc_icount[OPC_BUF_SIZE];
-uint8_t gen_opc_instr_start[OPC_BUF_SIZE];
+COREMU_THREAD target_ulong gen_opc_pc[OPC_BUF_SIZE];
+COREMU_THREAD uint16_t gen_opc_icount[OPC_BUF_SIZE];
+COREMU_THREAD uint8_t gen_opc_instr_start[OPC_BUF_SIZE];
 
 /* XXX: suppress that */
 unsigned long code_gen_max_block_size(void)
diff --git a/vl.c b/vl.c
index 85bcc84..68b2505 100644
--- a/vl.c
+++ b/vl.c
@@ -165,6 +165,18 @@ int main(int argc, char **argv)
 
 //#define DEBUG_NET
 //#define DEBUG_SLIRP
+#include "coremu-config.h"
+#include "coremu-init.h"
+#include "coremu-core.h"
+#include "coremu-intr.h"
+#include "coremu-thread.h"
+#include "coremu-debug.h"
+#include "cm-loop.h"
+#include "cm-intr.h"
+#include "cm-init.h"
+#include "cm-timer.h"
+
+//#include "cm-i386-intr.h"
 
 #define DEFAULT_RAM_SIZE 128
 
@@ -1710,6 +1722,13 @@ int debug_requested;
 int vmstop_requested;
 static int exit_requested;
 
+#ifdef CONFIG_COREMU
+int test_reset_request(void)
+{
+	return reset_requested;
+}
+#endif
+
 int qemu_shutdown_requested(void)
 {
     int r = shutdown_requested;
@@ -1874,8 +1893,12 @@ void main_loop_wait(int nonblocking)
     if (nonblocking)
         timeout = 0;
     else {
+#ifdef CONFIG_COREMU
+        timeout = 1000;
+#else
         timeout = qemu_calculate_timeout();
         qemu_bh_update_timeout(&timeout);
+#endif
     }
 
     host_main_loop_wait(&timeout);
@@ -1959,8 +1982,28 @@ qemu_irq qemu_system_powerdown;
 static void main_loop(void)
 {
     int r;
+#ifdef CONFIG_COREMU
+    /* 1. Not finish: need some initialization */
+
+    /* 2. register hook functions */
+    /* register the interrupt handler */
+    coremu_register_event_handler((void (*)(void*))cm_common_intr_handler);
+    /* register the event notifier */
+    coremu_register_event_notifier(cm_notify_event);
 
+    /* 3. register core alarm handler. */
+    struct sigaction act;
+    sigfillset(&act.sa_mask);
+    act.sa_flags = 0;
+    extern void cm_local_host_alarm_handler(int host_signum);
+    act.sa_handler = cm_local_host_alarm_handler;
+    sigaction(COREMU_CORE_ALARM, &act, NULL);
+
+    /* 4. Create cpu thread body*/
+    coremu_run_all_cores(cm_cpu_loop);
+#else
     qemu_main_loop_start();
+#endif
 
     for (;;) {
         do {
@@ -1968,9 +2011,11 @@ static void main_loop(void)
 #ifdef CONFIG_PROFILER
             int64_t ti;
 #endif
+#ifndef CONFIG_COREMU
 #ifndef CONFIG_IOTHREAD
             nonblocking = tcg_cpu_exec();
 #endif
+#endif
 #ifdef CONFIG_PROFILER
             ti = profile_getclock();
 #endif
@@ -1991,6 +2036,7 @@ static void main_loop(void)
             } else
                 break;
         }
+#ifndef CONFIG_COREMU
         if (qemu_reset_requested()) {
             pause_all_vcpus();
             qemu_system_reset();
@@ -2000,6 +2046,18 @@ static void main_loop(void)
             monitor_protocol_event(QEVENT_POWERDOWN, NULL);
             qemu_irq_raise(qemu_system_powerdown);
         }
+#else
+        if (reset_requested) {
+            coremu_wait_all_cores_pause();
+            qemu_system_reset();
+            reset_requested=0;
+            coremu_restart_all_cores();
+        }
+        if (qemu_powerdown_requested()) {
+            monitor_protocol_event(QEVENT_POWERDOWN, NULL);
+            exit(0);
+        }
+#endif
         if ((r = qemu_vmstop_requested())) {
             vm_stop(r);
         }
@@ -3456,6 +3514,16 @@ int main(int argc, char **argv, char **envp)
         exit(1);
     }
 
+#ifdef CONFIG_COREMU
+    cm_print("\n%s\n%s\n%s",
+             "------------------------------------",
+             "|     [COREMU Parallel Emulator]   |",
+             "------------------------------------");
+    coremu_init(smp_cpus);
+    /* Initialize qemu variable for coremu */
+    cm_init_pit_freq();
+#endif
+
     qemu_opts_foreach(&qemu_device_opts, default_driver_check, NULL, 0);
     qemu_opts_foreach(&qemu_global_opts, default_driver_check, NULL, 0);
 
@@ -3632,7 +3700,11 @@ int main(int argc, char **argv, char **envp)
         ram_size = DEFAULT_RAM_SIZE * 1024 * 1024;
 
     /* init the dynamic translator */
+#ifdef CONFIG_COREMU
+    cm_cpu_exec_init();
+#else
     cpu_exec_init_all(tb_size * 1024 * 1024);
+#endif
 
     bdrv_init_with_whitelist();
 
@@ -3917,3 +3989,10 @@ int main(int argc, char **argv, char **envp)
 
     return 0;
 }
+
+#ifdef CONFIG_COREMU
+int cm_vm_can_run(void)
+{
+    return vm_can_run();
+}
+#endif

[-- Attachment #3: Type: text/plain, Size: 31 bytes --]



--
Best regards,
Chen Yufei


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
  2010-07-22  8:48       ` Chen Yufei
@ 2010-07-22 11:05         ` Jan Kiszka
  2010-07-22 12:18         ` [Qemu-devel] " Stefan Hajnoczi
  1 sibling, 0 replies; 20+ messages in thread
From: Jan Kiszka @ 2010-07-22 11:05 UTC (permalink / raw)
  To: Chen Yufei; +Cc: qemu-devel

Chen Yufei wrote:
> On 2010-7-22, at 上午1:04, Stefan Weil wrote:
> 
>> Am 21.07.2010 09:03, schrieb Chen Yufei:
>>> On 2010-7-21, at 上午5:43, Blue Swirl wrote:
>>>
>>>   
>>>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei<cyfdecyf@gmail.com>  wrote:
>>>>     
>>>>> We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on Qemu. (Simply speaking, we made Qemu parallel.)
>>>>>
>>>>> The project web page is located at:
>>>>> http://ppi.fudan.edu.cn/coremu
>>>>>
>>>>> You can also download the source code, images for playing on sourceforge
>>>>> http://sf.net/p/coremu
>>>>>
>>>>> COREMU is composed of
>>>>> 1. a parallel emulation library
>>>>> 2. a set of patches to qemu
>>>>> (We worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2)
>>>>>
>>>>> It currently supports full-system emulation of x64 and ARM MPcore platforms.
>>>>>
>>>>> By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine).
>>>>>
>>>>> Enjoy,
>>>>>       
>>>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>>>     
>>> It would be great if we can submit our code to QEMU, but we do not know the process.
>>> Would you please give us some instructions?
>>>
>>> --
>>> Best regards,
>>> Chen Yufei
>>>   
>> Some hints can be found here:
>> http://wiki.qemu.org/Contribute/StartHere
>>
>> Kind regards,
>> Stefan Weil
> 
> The patch is in the attachment, produced with command
> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
> 
> In order to separate what need to be done to make QEMU parallel, we created a separate library, and the patched QEMU need to be compiled and linked with that library. To submit our enhancement to QEMU, maybe we need to incorporate this library into QEMU. I don't know what would be the best solution.

For upstream QEMU, the goal should be to integrate your modifications
and enhancements into the existing architecture in a mostly seamless
way. The library approach may help maintaining your changes out of tree,
but it likely cannot contribute any benefit to an in-tree extension of
QEMU for parallel TCG VCPUs.

> 
> Our approach to make QEMU parallel can be found at http://ppi.fudan.edu.cn/coremu
> 
> I will give a short summary here:
> 
> 1. Each emulated core thread runs a separate binary translator engine and has private code cache. We marked some variables in TCG as thread local. We also modified the TB invalidation mechanism.
> 
> 2. Each core has a queue holding pending interrupts. The COREMU library provides this queue, and interrupt notification is done by sending realtime signals to the emulated core thread.
> 
> 3. Atomic instruction emulation has to be modified for parallel emulation. We use lightweight memory transaction which requires only compare-and-swap instruction to emulate atomic instruction.
> 
> 4. Some code in the original QEMU may cause data race bug after we make it parallel. We fixed these problems.
> 

Upstream integration requires such iterative steps as well - in form of
ideally small, focused patches that finally convert QEMU into a parallel
emulator.

Also note that upstream already supports threaded VCPUs - in KVM mode.
You obviously have resolved the major blocking points to apply this on
TCG mode as well. But I don't see yet why we may need a new VCPU
threading infrastructure for this. Rather only small tuning of what KVM
already uses should suffice - if that's required at all.

To give it a start, you could identify some more trivial changes in your
patches, split them out and rebase them over latest qemu.git, then post
them as a patch series for inclusion (see the mailing list for various
examples). Make sure to describe the reason for your changes as clear as
possible, specifically if they are not (yet) obvious in the absence of
COREMU features in upstream QEMU.

Be prepared that merging your code can be a lengthy process with quite a
few discussions about why and how things are done, likely also with
requests to change your current solution in some aspects. However, the
result should be an optimal solution for the overall goal, parallel VCPU
emulation - and no longer any need to maintain your private set of
patches against quickly evolving QEMU.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] Release of COREMU, a scalable and portable full-system emulator
  2010-07-22  8:48       ` Chen Yufei
  2010-07-22 11:05         ` [Qemu-devel] " Jan Kiszka
@ 2010-07-22 12:18         ` Stefan Hajnoczi
  2010-07-22 13:00           ` [Qemu-devel] " Jan Kiszka
  1 sibling, 1 reply; 20+ messages in thread
From: Stefan Hajnoczi @ 2010-07-22 12:18 UTC (permalink / raw)
  To: Chen Yufei; +Cc: qemu-devel

On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei <cyfdecyf@gmail.com> wrote:
> On 2010-7-22, at 上午1:04, Stefan Weil wrote:
>
>> Am 21.07.2010 09:03, schrieb Chen Yufei:
>>> On 2010-7-21, at 上午5:43, Blue Swirl wrote:
>>>
>>>
>>>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei<cyfdecyf@gmail.com>  wrote:
>>>>
>>>>> We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on Qemu. (Simply speaking, we made Qemu parallel.)
>>>>>
>>>>> The project web page is located at:
>>>>> http://ppi.fudan.edu.cn/coremu
>>>>>
>>>>> You can also download the source code, images for playing on sourceforge
>>>>> http://sf.net/p/coremu
>>>>>
>>>>> COREMU is composed of
>>>>> 1. a parallel emulation library
>>>>> 2. a set of patches to qemu
>>>>> (We worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2)
>>>>>
>>>>> It currently supports full-system emulation of x64 and ARM MPcore platforms.
>>>>>
>>>>> By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine).
>>>>>
>>>>> Enjoy,
>>>>>
>>>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>>>
>>> It would be great if we can submit our code to QEMU, but we do not know the process.
>>> Would you please give us some instructions?
>>>
>>> --
>>> Best regards,
>>> Chen Yufei
>>>
>>
>> Some hints can be found here:
>> http://wiki.qemu.org/Contribute/StartHere
>>
>> Kind regards,
>> Stefan Weil
>
> The patch is in the attachment, produced with command
> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>
> In order to separate what need to be done to make QEMU parallel, we created a separate library, and the patched QEMU need to be compiled and linked with that library. To submit our enhancement to QEMU, maybe we need to incorporate this library into QEMU. I don't know what would be the best solution.
>
> Our approach to make QEMU parallel can be found at http://ppi.fudan.edu.cn/coremu
>
> I will give a short summary here:
>
> 1. Each emulated core thread runs a separate binary translator engine and has private code cache. We marked some variables in TCG as thread local. We also modified the TB invalidation mechanism.
>
> 2. Each core has a queue holding pending interrupts. The COREMU library provides this queue, and interrupt notification is done by sending realtime signals to the emulated core thread.
>
> 3. Atomic instruction emulation has to be modified for parallel emulation. We use lightweight memory transaction which requires only compare-and-swap instruction to emulate atomic instruction.
>
> 4. Some code in the original QEMU may cause data race bug after we make it parallel. We fixed these problems.
>
>
>
>
> --
> Best regards,
> Chen Yufei

Looking at the patch it seems there is a global lock for hardware
access via coremu_spin_lock(&cm_hw_lock).  How many cores have you
tried running and do you have lock contention data for cm_hw_lock?
Have you thought about making hardware emulation concurrent?

These are issues that qemu-kvm faces today since it executes vcpu
threads in parallel.  Both qemu-kvm and the COREMU patches could
benefit from a solution for concurrent hardware access.

Stefan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Qemu-devel] Re: Release of COREMU, a scalable and portable  full-system emulator
  2010-07-22 12:18         ` [Qemu-devel] " Stefan Hajnoczi
@ 2010-07-22 13:00           ` Jan Kiszka
  2010-07-22 13:21             ` Stefan Hajnoczi
  2010-07-22 15:19             ` wang Tiger
  0 siblings, 2 replies; 20+ messages in thread
From: Jan Kiszka @ 2010-07-22 13:00 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Chen Yufei, qemu-devel

Stefan Hajnoczi wrote:
> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei <cyfdecyf@gmail.com> wrote:
>> On 2010-7-22, at 上午1:04, Stefan Weil wrote:
>>
>>> Am 21.07.2010 09:03, schrieb Chen Yufei:
>>>> On 2010-7-21, at 上午5:43, Blue Swirl wrote:
>>>>
>>>>
>>>>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei<cyfdecyf@gmail.com>  wrote:
>>>>>
>>>>>> We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on Qemu. (Simply speaking, we made Qemu parallel.)
>>>>>>
>>>>>> The project web page is located at:
>>>>>> http://ppi.fudan.edu.cn/coremu
>>>>>>
>>>>>> You can also download the source code, images for playing on sourceforge
>>>>>> http://sf.net/p/coremu
>>>>>>
>>>>>> COREMU is composed of
>>>>>> 1. a parallel emulation library
>>>>>> 2. a set of patches to qemu
>>>>>> (We worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2)
>>>>>>
>>>>>> It currently supports full-system emulation of x64 and ARM MPcore platforms.
>>>>>>
>>>>>> By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine).
>>>>>>
>>>>>> Enjoy,
>>>>>>
>>>>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>>>>
>>>> It would be great if we can submit our code to QEMU, but we do not know the process.
>>>> Would you please give us some instructions?
>>>>
>>>> --
>>>> Best regards,
>>>> Chen Yufei
>>>>
>>> Some hints can be found here:
>>> http://wiki.qemu.org/Contribute/StartHere
>>>
>>> Kind regards,
>>> Stefan Weil
>> The patch is in the attachment, produced with command
>> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>>
>> In order to separate what need to be done to make QEMU parallel, we created a separate library, and the patched QEMU need to be compiled and linked with that library. To submit our enhancement to QEMU, maybe we need to incorporate this library into QEMU. I don't know what would be the best solution.
>>
>> Our approach to make QEMU parallel can be found at http://ppi.fudan.edu.cn/coremu
>>
>> I will give a short summary here:
>>
>> 1. Each emulated core thread runs a separate binary translator engine and has private code cache. We marked some variables in TCG as thread local. We also modified the TB invalidation mechanism.
>>
>> 2. Each core has a queue holding pending interrupts. The COREMU library provides this queue, and interrupt notification is done by sending realtime signals to the emulated core thread.
>>
>> 3. Atomic instruction emulation has to be modified for parallel emulation. We use lightweight memory transaction which requires only compare-and-swap instruction to emulate atomic instruction.
>>
>> 4. Some code in the original QEMU may cause data race bug after we make it parallel. We fixed these problems.
>>
>>
>>
>>
>> --
>> Best regards,
>> Chen Yufei
> 
> Looking at the patch it seems there is a global lock for hardware
> access via coremu_spin_lock(&cm_hw_lock).  How many cores have you
> tried running and do you have lock contention data for cm_hw_lock?

BTW, this kind of lock is called qemu_global_mutex in QEMU, thus it is a
sleepy lock here which is likely better for the code paths protected by
it in upstream. Are they shorter in COREMU?

> Have you thought about making hardware emulation concurrent?
> 
> These are issues that qemu-kvm faces today since it executes vcpu
> threads in parallel.  Both qemu-kvm and the COREMU patches could
> benefit from a solution for concurrent hardware access.

While we are all looking forward to see more scalable hardware models
:), I think it is a topic that can be addressed widely independent of
parallelizing TCG VCPUs. The latter can benefit from the former, for
sure, but it first of all has to solve its own issues.

Note that --enable-io-thread provides truly parallel KVM VCPUs also in
upstream these days. Just for TCG, we need that sightly suboptimal CPU
scheduling inside single-threaded tcg_cpu_exec (was renamed to
cpu_exec_all today).

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
  2010-07-22 13:00           ` [Qemu-devel] " Jan Kiszka
@ 2010-07-22 13:21             ` Stefan Hajnoczi
  2010-07-22 15:19             ` wang Tiger
  1 sibling, 0 replies; 20+ messages in thread
From: Stefan Hajnoczi @ 2010-07-22 13:21 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Chen Yufei, qemu-devel

2010/7/22 Jan Kiszka <jan.kiszka@siemens.com>:
> Stefan Hajnoczi wrote:
>> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei <cyfdecyf@gmail.com> wrote:
>>> On 2010-7-22, at 上午1:04, Stefan Weil wrote:
>>>
>>>> Am 21.07.2010 09:03, schrieb Chen Yufei:
>>>>> On 2010-7-21, at 上午5:43, Blue Swirl wrote:
>>>>>
>>>>>
>>>>>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei<cyfdecyf@gmail.com>  wrote:
>>>>>>
>>>>>>> We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on Qemu. (Simply speaking, we made Qemu parallel.)
>>>>>>>
>>>>>>> The project web page is located at:
>>>>>>> http://ppi.fudan.edu.cn/coremu
>>>>>>>
>>>>>>> You can also download the source code, images for playing on sourceforge
>>>>>>> http://sf.net/p/coremu
>>>>>>>
>>>>>>> COREMU is composed of
>>>>>>> 1. a parallel emulation library
>>>>>>> 2. a set of patches to qemu
>>>>>>> (We worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2)
>>>>>>>
>>>>>>> It currently supports full-system emulation of x64 and ARM MPcore platforms.
>>>>>>>
>>>>>>> By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine).
>>>>>>>
>>>>>>> Enjoy,
>>>>>>>
>>>>>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>>>>>
>>>>> It would be great if we can submit our code to QEMU, but we do not know the process.
>>>>> Would you please give us some instructions?
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Chen Yufei
>>>>>
>>>> Some hints can be found here:
>>>> http://wiki.qemu.org/Contribute/StartHere
>>>>
>>>> Kind regards,
>>>> Stefan Weil
>>> The patch is in the attachment, produced with command
>>> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>>>
>>> In order to separate what need to be done to make QEMU parallel, we created a separate library, and the patched QEMU need to be compiled and linked with that library. To submit our enhancement to QEMU, maybe we need to incorporate this library into QEMU. I don't know what would be the best solution.
>>>
>>> Our approach to make QEMU parallel can be found at http://ppi.fudan.edu.cn/coremu
>>>
>>> I will give a short summary here:
>>>
>>> 1. Each emulated core thread runs a separate binary translator engine and has private code cache. We marked some variables in TCG as thread local. We also modified the TB invalidation mechanism.
>>>
>>> 2. Each core has a queue holding pending interrupts. The COREMU library provides this queue, and interrupt notification is done by sending realtime signals to the emulated core thread.
>>>
>>> 3. Atomic instruction emulation has to be modified for parallel emulation. We use lightweight memory transaction which requires only compare-and-swap instruction to emulate atomic instruction.
>>>
>>> 4. Some code in the original QEMU may cause data race bug after we make it parallel. We fixed these problems.
>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Chen Yufei
>>
>> Looking at the patch it seems there is a global lock for hardware
>> access via coremu_spin_lock(&cm_hw_lock).  How many cores have you
>> tried running and do you have lock contention data for cm_hw_lock?
>
> BTW, this kind of lock is called qemu_global_mutex in QEMU, thus it is a
> sleepy lock here which is likely better for the code paths protected by
> it in upstream. Are they shorter in COREMU?
>
>> Have you thought about making hardware emulation concurrent?
>>
>> These are issues that qemu-kvm faces today since it executes vcpu
>> threads in parallel.  Both qemu-kvm and the COREMU patches could
>> benefit from a solution for concurrent hardware access.
>
> While we are all looking forward to see more scalable hardware models
> :), I think it is a topic that can be addressed widely independent of
> parallelizing TCG VCPUs. The latter can benefit from the former, for
> sure, but it first of all has to solve its own issues.

Right, but it's worth discussing with people who have worked on
parallel vcpus from a different angle.

> Note that --enable-io-thread provides truly parallel KVM VCPUs also in
> upstream these days. Just for TCG, we need that sightly suboptimal CPU
> scheduling inside single-threaded tcg_cpu_exec (was renamed to
> cpu_exec_all today).
>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
  2010-07-22 13:00           ` [Qemu-devel] " Jan Kiszka
  2010-07-22 13:21             ` Stefan Hajnoczi
@ 2010-07-22 15:19             ` wang Tiger
  2010-07-22 15:47               ` Stefan Hajnoczi
  1 sibling, 1 reply; 20+ messages in thread
From: wang Tiger @ 2010-07-22 15:19 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Stefan Hajnoczi, Chen Yufei, qemu-devel

在 2010年7月22日 下午9:00，Jan Kiszka <jan.kiszka@siemens.com> 写道：
> Stefan Hajnoczi wrote:
>> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei <cyfdecyf@gmail.com> wrote:
>>> On 2010-7-22, at 上午1:04, Stefan Weil wrote:
>>>
>>>> Am 21.07.2010 09:03, schrieb Chen Yufei:
>>>>> On 2010-7-21, at 上午5:43, Blue Swirl wrote:
>>>>>
>>>>>
>>>>>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei<cyfdecyf@gmail.com>  wrote:
>>>>>>
>>>>>>> We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on Qemu. (Simply speaking, we made Qemu parallel.)
>>>>>>>
>>>>>>> The project web page is located at:
>>>>>>> http://ppi.fudan.edu.cn/coremu
>>>>>>>
>>>>>>> You can also download the source code, images for playing on sourceforge
>>>>>>> http://sf.net/p/coremu
>>>>>>>
>>>>>>> COREMU is composed of
>>>>>>> 1. a parallel emulation library
>>>>>>> 2. a set of patches to qemu
>>>>>>> (We worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2)
>>>>>>>
>>>>>>> It currently supports full-system emulation of x64 and ARM MPcore platforms.
>>>>>>>
>>>>>>> By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine).
>>>>>>>
>>>>>>> Enjoy,
>>>>>>>
>>>>>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>>>>>
>>>>> It would be great if we can submit our code to QEMU, but we do not know the process.
>>>>> Would you please give us some instructions?
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Chen Yufei
>>>>>
>>>> Some hints can be found here:
>>>> http://wiki.qemu.org/Contribute/StartHere
>>>>
>>>> Kind regards,
>>>> Stefan Weil
>>> The patch is in the attachment, produced with command
>>> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>>>
>>> In order to separate what need to be done to make QEMU parallel, we created a separate library, and the patched QEMU need to be compiled and linked with that library. To submit our enhancement to QEMU, maybe we need to incorporate this library into QEMU. I don't know what would be the best solution.
>>>
>>> Our approach to make QEMU parallel can be found at http://ppi.fudan.edu.cn/coremu
>>>
>>> I will give a short summary here:
>>>
>>> 1. Each emulated core thread runs a separate binary translator engine and has private code cache. We marked some variables in TCG as thread local. We also modified the TB invalidation mechanism.
>>>
>>> 2. Each core has a queue holding pending interrupts. The COREMU library provides this queue, and interrupt notification is done by sending realtime signals to the emulated core thread.
>>>
>>> 3. Atomic instruction emulation has to be modified for parallel emulation. We use lightweight memory transaction which requires only compare-and-swap instruction to emulate atomic instruction.
>>>
>>> 4. Some code in the original QEMU may cause data race bug after we make it parallel. We fixed these problems.
>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Chen Yufei
>>
>> Looking at the patch it seems there is a global lock for hardware
>> access via coremu_spin_lock(&cm_hw_lock).  How many cores have you
>> tried running and do you have lock contention data for cm_hw_lock?

The global lock for hardware access is only for ARM target in our
implementation. It is mainly because that we are not quite familiar
with ARM. 4 ARM cores (Cortex A9 limitation) could be emulated in such
way.
For x86_64 target, we have already made hardware emulation
concurrently accessed. We can emulate 255 cores on a quad-core
machine.

>
> BTW, this kind of lock is called qemu_global_mutex in QEMU, thus it is a
> sleepy lock here which is likely better for the code paths protected by
> it in upstream. Are they shorter in COREMU?
>
>> Have you thought about making hardware emulation concurrent?
>>
>> These are issues that qemu-kvm faces today since it executes vcpu
>> threads in parallel.  Both qemu-kvm and the COREMU patches could
>> benefit from a solution for concurrent hardware access.

In our implementation for x86_64 target, all devices except LAPIC are
emulated in a seperate thread. VCPUs are emulated  in other threads
(one thread per VCPU).
By observing some device drivers in linux, we have a hypothethis that
drivers in OS have already ensured correct synchronization on
concurrent hardware accesses.

For example, when emulating IDE with bus master DMA,
1. Two VCPUs will not send disk w/r requests at the same time.
2. New DMA request will not be sent until the previous one has completed.
These two points guarantee the emulated IDE with DMA can be
concurrently accessed by either VCPU thread or hw thread with no
additional locks.

The only work we need to do is to fix some misbehaving emulated device
in current Qemu.
For example, in the function ide_write_dma_cb of Qemu

if (s->nsector == 0) {
        s->status = READY_STAT | SEEK_STAT;
        ide_set_irq(s->bus);
/* In parallel emulation, OS may receive interrupt here before the DMA
state is updated */
    eot:
        bm->status &= ~BM_STATUS_DMAING;
        bm->status |= BM_STATUS_INT;
        bm->dma_cb = NULL;
        bm->unit = -1;
        bm->aiocb = NULL;
        return;
    }

The DMA state is changed after the IRQ has been sent. This is correct
in sequantial emulation. But in parallel emulation, OS may find the
DMA is busy even after an end of request interrupt is received.
The correct solution should be:

if (s->nsector == 0) {
        s->status = READY_STAT | SEEK_STAT;
/* For coremu dma state need to be changed before irq is sent */
        bm->status &= ~BM_STATUS_DMAING;
        bm->status |= BM_STATUS_INT;
        bm->dma_cb = NULL;
        bm->unit = -1;
        bm->aiocb = NULL;
        ide_set_irq(s->bus);
        return;
       eot:
       ...
}

The DMA state need to be changed before the IRQ has been sent as what
real hardware does.

Our evaluation shows that the implementation based on this hypothethis
could correctly handle concurrent  device accesses.
We also use a per VCPU lock-free queue to hold interrupts information
for each VCPU.

For your convience, here is the url for our project
http://sourceforge.net/p/coremu/
We will do our best to merge our code to upstream. :-)

>
> While we are all looking forward to see more scalable hardware models
> :), I think it is a topic that can be addressed widely independent of
> parallelizing TCG VCPUs. The latter can benefit from the former, for
> sure, but it first of all has to solve its own issues.
>
> Note that --enable-io-thread provides truly parallel KVM VCPUs also in
> upstream these days. Just for TCG, we need that sightly suboptimal CPU
> scheduling inside single-threaded tcg_cpu_exec (was renamed to
> cpu_exec_all today).
>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux
>
>



-- 
Zhaoguo Wang, Parallel Processing Institute, Fudan University

Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China

tigerwang1986@gmail.com
http://ppi.fudan.edu.cn/zhaoguo_wang

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
  2010-07-22 15:19             ` wang Tiger
@ 2010-07-22 15:47               ` Stefan Hajnoczi
  2010-07-23  3:29                 ` wang Tiger
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Hajnoczi @ 2010-07-22 15:47 UTC (permalink / raw)
  To: wang Tiger; +Cc: Jan Kiszka, Chen Yufei, qemu-devel

2010/7/22 wang Tiger <tigerwang1986@gmail.com>:
> 在 2010年7月22日 下午9:00，Jan Kiszka <jan.kiszka@siemens.com> 写道：
>> Stefan Hajnoczi wrote:
>>> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei <cyfdecyf@gmail.com> wrote:
>>>> On 2010-7-22, at 上午1:04, Stefan Weil wrote:
>>>>
>>>>> Am 21.07.2010 09:03, schrieb Chen Yufei:
>>>>>> On 2010-7-21, at 上午5:43, Blue Swirl wrote:
>>>>>>
>>>>>>
>>>>>>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei<cyfdecyf@gmail.com>  wrote:
>>>>>>>
>>>>>>>> We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on Qemu. (Simply speaking, we made Qemu parallel.)
>>>>>>>>
>>>>>>>> The project web page is located at:
>>>>>>>> http://ppi.fudan.edu.cn/coremu
>>>>>>>>
>>>>>>>> You can also download the source code, images for playing on sourceforge
>>>>>>>> http://sf.net/p/coremu
>>>>>>>>
>>>>>>>> COREMU is composed of
>>>>>>>> 1. a parallel emulation library
>>>>>>>> 2. a set of patches to qemu
>>>>>>>> (We worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2)
>>>>>>>>
>>>>>>>> It currently supports full-system emulation of x64 and ARM MPcore platforms.
>>>>>>>>
>>>>>>>> By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine).
>>>>>>>>
>>>>>>>> Enjoy,
>>>>>>>>
>>>>>>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>>>>>>
>>>>>> It would be great if we can submit our code to QEMU, but we do not know the process.
>>>>>> Would you please give us some instructions?
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Chen Yufei
>>>>>>
>>>>> Some hints can be found here:
>>>>> http://wiki.qemu.org/Contribute/StartHere
>>>>>
>>>>> Kind regards,
>>>>> Stefan Weil
>>>> The patch is in the attachment, produced with command
>>>> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>>>>
>>>> In order to separate what need to be done to make QEMU parallel, we created a separate library, and the patched QEMU need to be compiled and linked with that library. To submit our enhancement to QEMU, maybe we need to incorporate this library into QEMU. I don't know what would be the best solution.
>>>>
>>>> Our approach to make QEMU parallel can be found at http://ppi.fudan.edu.cn/coremu
>>>>
>>>> I will give a short summary here:
>>>>
>>>> 1. Each emulated core thread runs a separate binary translator engine and has private code cache. We marked some variables in TCG as thread local. We also modified the TB invalidation mechanism.
>>>>
>>>> 2. Each core has a queue holding pending interrupts. The COREMU library provides this queue, and interrupt notification is done by sending realtime signals to the emulated core thread.
>>>>
>>>> 3. Atomic instruction emulation has to be modified for parallel emulation. We use lightweight memory transaction which requires only compare-and-swap instruction to emulate atomic instruction.
>>>>
>>>> 4. Some code in the original QEMU may cause data race bug after we make it parallel. We fixed these problems.
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Chen Yufei
>>>
>>> Looking at the patch it seems there is a global lock for hardware
>>> access via coremu_spin_lock(&cm_hw_lock).  How many cores have you
>>> tried running and do you have lock contention data for cm_hw_lock?
>
> The global lock for hardware access is only for ARM target in our
> implementation. It is mainly because that we are not quite familiar
> with ARM. 4 ARM cores (Cortex A9 limitation) could be emulated in such
> way.
> For x86_64 target, we have already made hardware emulation
> concurrently accessed. We can emulate 255 cores on a quad-core
> machine.
>
>>
>> BTW, this kind of lock is called qemu_global_mutex in QEMU, thus it is a
>> sleepy lock here which is likely better for the code paths protected by
>> it in upstream. Are they shorter in COREMU?
>>
>>> Have you thought about making hardware emulation concurrent?
>>>
>>> These are issues that qemu-kvm faces today since it executes vcpu
>>> threads in parallel.  Both qemu-kvm and the COREMU patches could
>>> benefit from a solution for concurrent hardware access.
>
> In our implementation for x86_64 target, all devices except LAPIC are
> emulated in a seperate thread. VCPUs are emulated  in other threads
> (one thread per VCPU).
> By observing some device drivers in linux, we have a hypothethis that
> drivers in OS have already ensured correct synchronization on
> concurrent hardware accesses.

This hypothesis is too optimistic.  If hardware emulation code assumes
it is only executed in a single-threaded fashion, but guests can
execute it in parallel, then this opens up the possibility of race
conditions that malicious guests can exploit.  There needs to be
isolation: a guest should not be able to cause QEMU to crash.

If you have one hardware thread that handles all device emulation and
vcpu threads do no hardware emulation, then all hardware emulation is
serialized anyway.  Does this describe COREMU's model?

> For example, when emulating IDE with bus master DMA,
> 1. Two VCPUs will not send disk w/r requests at the same time.
> 2. New DMA request will not be sent until the previous one has completed.
> These two points guarantee the emulated IDE with DMA can be
> concurrently accessed by either VCPU thread or hw thread with no
> additional locks.
>
> The only work we need to do is to fix some misbehaving emulated device
> in current Qemu.
> For example, in the function ide_write_dma_cb of Qemu
>
> if (s->nsector == 0) {
>        s->status = READY_STAT | SEEK_STAT;
>        ide_set_irq(s->bus);
> /* In parallel emulation, OS may receive interrupt here before the DMA
> state is updated */
>    eot:
>        bm->status &= ~BM_STATUS_DMAING;
>        bm->status |= BM_STATUS_INT;
>        bm->dma_cb = NULL;
>        bm->unit = -1;
>        bm->aiocb = NULL;
>        return;
>    }
>
> The DMA state is changed after the IRQ has been sent. This is correct
> in sequantial emulation. But in parallel emulation, OS may find the
> DMA is busy even after an end of request interrupt is received.
> The correct solution should be:
>
> if (s->nsector == 0) {
>        s->status = READY_STAT | SEEK_STAT;
> /* For coremu dma state need to be changed before irq is sent */
>        bm->status &= ~BM_STATUS_DMAING;
>        bm->status |= BM_STATUS_INT;
>        bm->dma_cb = NULL;
>        bm->unit = -1;
>        bm->aiocb = NULL;
>        ide_set_irq(s->bus);
>        return;
>       eot:
>       ...
> }
>
> The DMA state need to be changed before the IRQ has been sent as what
> real hardware does.
>
> Our evaluation shows that the implementation based on this hypothethis
> could correctly handle concurrent  device accesses.
> We also use a per VCPU lock-free queue to hold interrupts information
> for each VCPU.
>
> For your convience, here is the url for our project
> http://sourceforge.net/p/coremu/
> We will do our best to merge our code to upstream. :-)
>
>>
>> While we are all looking forward to see more scalable hardware models
>> :), I think it is a topic that can be addressed widely independent of
>> parallelizing TCG VCPUs. The latter can benefit from the former, for
>> sure, but it first of all has to solve its own issues.
>>
>> Note that --enable-io-thread provides truly parallel KVM VCPUs also in
>> upstream these days. Just for TCG, we need that sightly suboptimal CPU
>> scheduling inside single-threaded tcg_cpu_exec (was renamed to
>> cpu_exec_all today).
>>
>> Jan
>>
>> --
>> Siemens AG, Corporate Technology, CT T DE IT 1
>> Corporate Competence Center Embedded Linux
>>
>>
>
>
>
> --
> Zhaoguo Wang, Parallel Processing Institute, Fudan University
>
> Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China
>
> tigerwang1986@gmail.com
> http://ppi.fudan.edu.cn/zhaoguo_wang
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
  2010-07-22 15:47               ` Stefan Hajnoczi
@ 2010-07-23  3:29                 ` wang Tiger
  2010-07-23  7:53                   ` Jan Kiszka
  0 siblings, 1 reply; 20+ messages in thread
From: wang Tiger @ 2010-07-23  3:29 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Jan Kiszka, Chen Yufei, qemu-devel

在 2010年7月22日 下午11:47，Stefan Hajnoczi <stefanha@gmail.com> 写道：
> 2010/7/22 wang Tiger <tigerwang1986@gmail.com>:
>> 在 2010年7月22日 下午9:00，Jan Kiszka <jan.kiszka@siemens.com> 写道：
>>> Stefan Hajnoczi wrote:
>>>> On Thu, Jul 22, 2010 at 9:48 AM, Chen Yufei <cyfdecyf@gmail.com> wrote:
>>>>> On 2010-7-22, at 上午1:04, Stefan Weil wrote:
>>>>>
>>>>>> Am 21.07.2010 09:03, schrieb Chen Yufei:
>>>>>>> On 2010-7-21, at 上午5:43, Blue Swirl wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Sat, Jul 17, 2010 at 10:27 AM, Chen Yufei<cyfdecyf@gmail.com>  wrote:
>>>>>>>>
>>>>>>>>> We are pleased to announce COREMU, which is a "multicore-on-multicore" full-system emulator built on Qemu. (Simply speaking, we made Qemu parallel.)
>>>>>>>>>
>>>>>>>>> The project web page is located at:
>>>>>>>>> http://ppi.fudan.edu.cn/coremu
>>>>>>>>>
>>>>>>>>> You can also download the source code, images for playing on sourceforge
>>>>>>>>> http://sf.net/p/coremu
>>>>>>>>>
>>>>>>>>> COREMU is composed of
>>>>>>>>> 1. a parallel emulation library
>>>>>>>>> 2. a set of patches to qemu
>>>>>>>>> (We worked on the master branch, commit 54d7cf136f040713095cbc064f62d753bff6f9d2)
>>>>>>>>>
>>>>>>>>> It currently supports full-system emulation of x64 and ARM MPcore platforms.
>>>>>>>>>
>>>>>>>>> By leveraging the underlying multicore resources, it can emulate up to 255 cores running commodity operating systems (even on a 4-core machine).
>>>>>>>>>
>>>>>>>>> Enjoy,
>>>>>>>>>
>>>>>>>> Nice work. Do you plan to submit the improvements back to upstream QEMU?
>>>>>>>>
>>>>>>> It would be great if we can submit our code to QEMU, but we do not know the process.
>>>>>>> Would you please give us some instructions?
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Chen Yufei
>>>>>>>
>>>>>> Some hints can be found here:
>>>>>> http://wiki.qemu.org/Contribute/StartHere
>>>>>>
>>>>>> Kind regards,
>>>>>> Stefan Weil
>>>>> The patch is in the attachment, produced with command
>>>>> git diff 54d7cf136f040713095cbc064f62d753bff6f9d2
>>>>>
>>>>> In order to separate what need to be done to make QEMU parallel, we created a separate library, and the patched QEMU need to be compiled and linked with that library. To submit our enhancement to QEMU, maybe we need to incorporate this library into QEMU. I don't know what would be the best solution.
>>>>>
>>>>> Our approach to make QEMU parallel can be found at http://ppi.fudan.edu.cn/coremu
>>>>>
>>>>> I will give a short summary here:
>>>>>
>>>>> 1. Each emulated core thread runs a separate binary translator engine and has private code cache. We marked some variables in TCG as thread local. We also modified the TB invalidation mechanism.
>>>>>
>>>>> 2. Each core has a queue holding pending interrupts. The COREMU library provides this queue, and interrupt notification is done by sending realtime signals to the emulated core thread.
>>>>>
>>>>> 3. Atomic instruction emulation has to be modified for parallel emulation. We use lightweight memory transaction which requires only compare-and-swap instruction to emulate atomic instruction.
>>>>>
>>>>> 4. Some code in the original QEMU may cause data race bug after we make it parallel. We fixed these problems.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Chen Yufei
>>>>
>>>> Looking at the patch it seems there is a global lock for hardware
>>>> access via coremu_spin_lock(&cm_hw_lock).  How many cores have you
>>>> tried running and do you have lock contention data for cm_hw_lock?
>>
>> The global lock for hardware access is only for ARM target in our
>> implementation. It is mainly because that we are not quite familiar
>> with ARM. 4 ARM cores (Cortex A9 limitation) could be emulated in such
>> way.
>> For x86_64 target, we have already made hardware emulation
>> concurrently accessed. We can emulate 255 cores on a quad-core
>> machine.
>>
>>>
>>> BTW, this kind of lock is called qemu_global_mutex in QEMU, thus it is a
>>> sleepy lock here which is likely better for the code paths protected by
>>> it in upstream. Are they shorter in COREMU?
>>>
>>>> Have you thought about making hardware emulation concurrent?
>>>>
>>>> These are issues that qemu-kvm faces today since it executes vcpu
>>>> threads in parallel.  Both qemu-kvm and the COREMU patches could
>>>> benefit from a solution for concurrent hardware access.
>>
>> In our implementation for x86_64 target, all devices except LAPIC are
>> emulated in a seperate thread. VCPUs are emulated  in other threads
>> (one thread per VCPU).
>> By observing some device drivers in linux, we have a hypothethis that
>> drivers in OS have already ensured correct synchronization on
>> concurrent hardware accesses.
>
> This hypothesis is too optimistic.  If hardware emulation code assumes
> it is only executed in a single-threaded fashion, but guests can
> execute it in parallel, then this opens up the possibility of race
> conditions that malicious guests can exploit.  There needs to be
> isolation: a guest should not be able to cause QEMU to crash.

In our prototype, we assume the guest behaves correctly. If hardware
emulation code can ensure atomic access(behave like real hardware),
VCPUS can access device freely.  We actually refine some hardward
emulation code (eg. BMDMA, IOAPIC ) to ensure the atomicity of
hardware access.

>
> If you have one hardware thread that handles all device emulation and
> vcpu threads do no hardware emulation, then all hardware emulation is
> serialized anyway.  Does this describe COREMU's model?

In our previous implementation, VCPU threads do no hardware emulation.
When VCPU w/r an ioport, it put the ioport address and value
infomation into a lock free queue. Hardware thread polls the queue to
serve the request. When hardware issues an interrupt, hardware thread
also put the irq information into a per VCPU lockfree queue. VCPU sets
its LAPIC and serve the interrupt request.

For performance reason, we abandon this approach. VCPU is allowed to
modify hardware state.
Hardware code is only to be slightly modified. The misbehavior of the
guest OS can be easily detected.
>
>> For example, when emulating IDE with bus master DMA,
>> 1. Two VCPUs will not send disk w/r requests at the same time.
>> 2. New DMA request will not be sent until the previous one has completed.
>> These two points guarantee the emulated IDE with DMA can be
>> concurrently accessed by either VCPU thread or hw thread with no
>> additional locks.
>>
>> The only work we need to do is to fix some misbehaving emulated device
>> in current Qemu.
>> For example, in the function ide_write_dma_cb of Qemu
>>
>> if (s->nsector == 0) {
>>        s->status = READY_STAT | SEEK_STAT;
>>        ide_set_irq(s->bus);
>> /* In parallel emulation, OS may receive interrupt here before the DMA
>> state is updated */
>>    eot:
>>        bm->status &= ~BM_STATUS_DMAING;
>>        bm->status |= BM_STATUS_INT;
>>        bm->dma_cb = NULL;
>>        bm->unit = -1;
>>        bm->aiocb = NULL;
>>        return;
>>    }
>>
>> The DMA state is changed after the IRQ has been sent. This is correct
>> in sequantial emulation. But in parallel emulation, OS may find the
>> DMA is busy even after an end of request interrupt is received.
>> The correct solution should be:
>>
>> if (s->nsector == 0) {
>>        s->status = READY_STAT | SEEK_STAT;
>> /* For coremu dma state need to be changed before irq is sent */
>>        bm->status &= ~BM_STATUS_DMAING;
>>        bm->status |= BM_STATUS_INT;
>>        bm->dma_cb = NULL;
>>        bm->unit = -1;
>>        bm->aiocb = NULL;
>>        ide_set_irq(s->bus);
>>        return;
>>       eot:
>>       ...
>> }
>>
>> The DMA state need to be changed before the IRQ has been sent as what
>> real hardware does.
>>
>> Our evaluation shows that the implementation based on this hypothethis
>> could correctly handle concurrent  device accesses.
>> We also use a per VCPU lock-free queue to hold interrupts information
>> for each VCPU.
>>
>> For your convience, here is the url for our project
>> http://sourceforge.net/p/coremu/
>> We will do our best to merge our code to upstream. :-)
>>
>>>
>>> While we are all looking forward to see more scalable hardware models
>>> :), I think it is a topic that can be addressed widely independent of
>>> parallelizing TCG VCPUs. The latter can benefit from the former, for
>>> sure, but it first of all has to solve its own issues.
>>>
>>> Note that --enable-io-thread provides truly parallel KVM VCPUs also in
>>> upstream these days. Just for TCG, we need that sightly suboptimal CPU
>>> scheduling inside single-threaded tcg_cpu_exec (was renamed to
>>> cpu_exec_all today).
>>>
>>> Jan
>>>
>>> --
>>> Siemens AG, Corporate Technology, CT T DE IT 1
>>> Corporate Competence Center Embedded Linux
>>>
>>>
>>
>>
>>
>> --
>> Zhaoguo Wang, Parallel Processing Institute, Fudan University
>>
>> Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China
>>
>> tigerwang1986@gmail.com
>> http://ppi.fudan.edu.cn/zhaoguo_wang
>>
>



-- 
Zhaoguo Wang, Parallel Processing Institute, Fudan University

Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China

tigerwang1986@gmail.com
http://ppi.fudan.edu.cn/zhaoguo_wang

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Qemu-devel] Re: Release of COREMU, a scalable and portable  full-system emulator
  2010-07-23  3:29                 ` wang Tiger
@ 2010-07-23  7:53                   ` Jan Kiszka
  2010-07-23  8:38                     ` Alexander Graf
  2010-07-23 10:35                     ` wang Tiger
  0 siblings, 2 replies; 20+ messages in thread
From: Jan Kiszka @ 2010-07-23  7:53 UTC (permalink / raw)
  To: wang Tiger; +Cc: Stefan Hajnoczi, Chen Yufei, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1760 bytes --]

wang Tiger wrote:
> 在 2010年7月22日 下午11:47，Stefan Hajnoczi <stefanha@gmail.com> 写道：
>> 2010/7/22 wang Tiger <tigerwang1986@gmail.com>:
>>> In our implementation for x86_64 target, all devices except LAPIC are
>>> emulated in a seperate thread. VCPUs are emulated  in other threads
>>> (one thread per VCPU).
>>> By observing some device drivers in linux, we have a hypothethis that
>>> drivers in OS have already ensured correct synchronization on
>>> concurrent hardware accesses.
>> This hypothesis is too optimistic.  If hardware emulation code assumes
>> it is only executed in a single-threaded fashion, but guests can
>> execute it in parallel, then this opens up the possibility of race
>> conditions that malicious guests can exploit.  There needs to be
>> isolation: a guest should not be able to cause QEMU to crash.
> 
> In our prototype, we assume the guest behaves correctly. If hardware
> emulation code can ensure atomic access(behave like real hardware),
> VCPUS can access device freely.  We actually refine some hardward
> emulation code (eg. BMDMA, IOAPIC ) to ensure the atomicity of
> hardware access.

This approach is surely helpful for a prototype to explore the limits.
But it's not applicable to production systems. It would create a huge
source of potential subtle regressions for other guest OSes,
specifically those that you cannot analyze regarding synchronized
hardware access. We must play safe.

That's why we currently have the global mutex. Its conversion can only
happen step-wise, e.g. by establishing an infrastructure to declare the
need of device models for that Big Lock. Then you can start converting
individual models to private locks or even smart lock-less patterns.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable  full-system emulator
  2010-07-23  7:53                   ` Jan Kiszka
@ 2010-07-23  8:38                     ` Alexander Graf
  2010-07-23  9:13                       ` Stefan Hajnoczi
  2010-07-23 10:35                     ` wang Tiger
  1 sibling, 1 reply; 20+ messages in thread
From: Alexander Graf @ 2010-07-23  8:38 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Stefan Hajnoczi, Chen Yufei, wang Tiger, qemu-devel


On 23.07.2010, at 09:53, Jan Kiszka wrote:

> wang Tiger wrote:
>> 在 2010年7月22日 下午11:47，Stefan Hajnoczi <stefanha@gmail.com> 写道：
>>> 2010/7/22 wang Tiger <tigerwang1986@gmail.com>:
>>>> In our implementation for x86_64 target, all devices except LAPIC are
>>>> emulated in a seperate thread. VCPUs are emulated  in other threads
>>>> (one thread per VCPU).
>>>> By observing some device drivers in linux, we have a hypothethis that
>>>> drivers in OS have already ensured correct synchronization on
>>>> concurrent hardware accesses.
>>> This hypothesis is too optimistic.  If hardware emulation code assumes
>>> it is only executed in a single-threaded fashion, but guests can
>>> execute it in parallel, then this opens up the possibility of race
>>> conditions that malicious guests can exploit.  There needs to be
>>> isolation: a guest should not be able to cause QEMU to crash.
>> 
>> In our prototype, we assume the guest behaves correctly. If hardware
>> emulation code can ensure atomic access(behave like real hardware),
>> VCPUS can access device freely.  We actually refine some hardward
>> emulation code (eg. BMDMA, IOAPIC ) to ensure the atomicity of
>> hardware access.
> 
> This approach is surely helpful for a prototype to explore the limits.
> But it's not applicable to production systems. It would create a huge
> source of potential subtle regressions for other guest OSes,
> specifically those that you cannot analyze regarding synchronized
> hardware access. We must play safe.
> 
> That's why we currently have the global mutex. Its conversion can only
> happen step-wise, e.g. by establishing an infrastructure to declare the
> need of device models for that Big Lock. Then you can start converting
> individual models to private locks or even smart lock-less patterns.

But isn't that independent from making TCG atomic capable and parallel? At that point a TCG vCPU would have the exact same issues and interfaces as a KVM vCPU, right? And then we can tackle the concurrent device access issues together.


Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
  2010-07-23  8:38                     ` Alexander Graf
@ 2010-07-23  9:13                       ` Stefan Hajnoczi
  2010-07-23  9:47                         ` Jan Kiszka
  2010-07-23 10:59                         ` wang Tiger
  0 siblings, 2 replies; 20+ messages in thread
From: Stefan Hajnoczi @ 2010-07-23  9:13 UTC (permalink / raw)
  To: wang Tiger; +Cc: Chen Yufei, Jan Kiszka, Alexander Graf, qemu-devel

2010/7/23 Alexander Graf <agraf@suse.de>:
>
> On 23.07.2010, at 09:53, Jan Kiszka wrote:
>
>> wang Tiger wrote:
>>> 在 2010年7月22日 下午11:47，Stefan Hajnoczi <stefanha@gmail.com> 写道：
>>>> 2010/7/22 wang Tiger <tigerwang1986@gmail.com>:
>>>>> In our implementation for x86_64 target, all devices except LAPIC are
>>>>> emulated in a seperate thread. VCPUs are emulated  in other threads
>>>>> (one thread per VCPU).
>>>>> By observing some device drivers in linux, we have a hypothethis that
>>>>> drivers in OS have already ensured correct synchronization on
>>>>> concurrent hardware accesses.
>>>> This hypothesis is too optimistic.  If hardware emulation code assumes
>>>> it is only executed in a single-threaded fashion, but guests can
>>>> execute it in parallel, then this opens up the possibility of race
>>>> conditions that malicious guests can exploit.  There needs to be
>>>> isolation: a guest should not be able to cause QEMU to crash.
>>>
>>> In our prototype, we assume the guest behaves correctly. If hardware
>>> emulation code can ensure atomic access(behave like real hardware),
>>> VCPUS can access device freely.  We actually refine some hardward
>>> emulation code (eg. BMDMA, IOAPIC ) to ensure the atomicity of
>>> hardware access.
>>
>> This approach is surely helpful for a prototype to explore the limits.
>> But it's not applicable to production systems. It would create a huge
>> source of potential subtle regressions for other guest OSes,
>> specifically those that you cannot analyze regarding synchronized
>> hardware access. We must play safe.
>>
>> That's why we currently have the global mutex. Its conversion can only
>> happen step-wise, e.g. by establishing an infrastructure to declare the
>> need of device models for that Big Lock. Then you can start converting
>> individual models to private locks or even smart lock-less patterns.
>
> But isn't that independent from making TCG atomic capable and parallel? At that point a TCG vCPU would have the exact same issues and interfaces as a KVM vCPU, right? And then we can tackle the concurrent device access issues together.

An issue that might affect COREMU today is core QEMU subsystems that
are not thread-safe and used from hardware emulation, for example:

cpu_physical_memory_read/write() to RAM will use qemu_get_ram_ptr().
This function moves the found RAMBlock to the head of the global RAM
blocks list in a non-atomic way.  Therefore, two unrelated hardware
devices executing cpu_physical_memory_*() simultaneously face a race
condition.  I have seen this happen when playing with parallel
hardware emulation.

Tiger: If you are only locking the hardware thread for ARM target,
your hardware emulation is not safe for other targets.  Have I missed
something in the COREMU patch that defends against this problem?

Stefan

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
  2010-07-23  9:13                       ` Stefan Hajnoczi
@ 2010-07-23  9:47                         ` Jan Kiszka
  2010-07-23 10:59                         ` wang Tiger
  1 sibling, 0 replies; 20+ messages in thread
From: Jan Kiszka @ 2010-07-23  9:47 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Chen Yufei, wang Tiger, Alexander Graf, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 3501 bytes --]

Stefan Hajnoczi wrote:
> 2010/7/23 Alexander Graf <agraf@suse.de>:
>> On 23.07.2010, at 09:53, Jan Kiszka wrote:
>>
>>> wang Tiger wrote:
>>>> 在 2010年7月22日 下午11:47，Stefan Hajnoczi <stefanha@gmail.com> 写道：
>>>>> 2010/7/22 wang Tiger <tigerwang1986@gmail.com>:
>>>>>> In our implementation for x86_64 target, all devices except LAPIC are
>>>>>> emulated in a seperate thread. VCPUs are emulated  in other threads
>>>>>> (one thread per VCPU).
>>>>>> By observing some device drivers in linux, we have a hypothethis that
>>>>>> drivers in OS have already ensured correct synchronization on
>>>>>> concurrent hardware accesses.
>>>>> This hypothesis is too optimistic.  If hardware emulation code assumes
>>>>> it is only executed in a single-threaded fashion, but guests can
>>>>> execute it in parallel, then this opens up the possibility of race
>>>>> conditions that malicious guests can exploit.  There needs to be
>>>>> isolation: a guest should not be able to cause QEMU to crash.
>>>> In our prototype, we assume the guest behaves correctly. If hardware
>>>> emulation code can ensure atomic access(behave like real hardware),
>>>> VCPUS can access device freely.  We actually refine some hardward
>>>> emulation code (eg. BMDMA, IOAPIC ) to ensure the atomicity of
>>>> hardware access.
>>> This approach is surely helpful for a prototype to explore the limits.
>>> But it's not applicable to production systems. It would create a huge
>>> source of potential subtle regressions for other guest OSes,
>>> specifically those that you cannot analyze regarding synchronized
>>> hardware access. We must play safe.
>>>
>>> That's why we currently have the global mutex. Its conversion can only
>>> happen step-wise, e.g. by establishing an infrastructure to declare the
>>> need of device models for that Big Lock. Then you can start converting
>>> individual models to private locks or even smart lock-less patterns.
>> But isn't that independent from making TCG atomic capable and parallel? At that point a TCG vCPU would have the exact same issues and interfaces as a KVM vCPU, right? And then we can tackle the concurrent device access issues together.
> 
> An issue that might affect COREMU today is core QEMU subsystems that
> are not thread-safe and used from hardware emulation, for example:
> 
> cpu_physical_memory_read/write() to RAM will use qemu_get_ram_ptr().
> This function moves the found RAMBlock to the head of the global RAM
> blocks list in a non-atomic way.  Therefore, two unrelated hardware
> devices executing cpu_physical_memory_*() simultaneously face a race
> condition.  I have seen this happen when playing with parallel
> hardware emulation.

Those issues need to be identified and, in a first step, worked around
by holding dedicated locks or just the global mutex. Maybe the above
conflict can also directly be resolved by creating per-VCPU lookup lists
(likely more efficient than tapping on other VCPU shoes by constantly
reordering a global list). Likely a good example for a self-contained
preparatory patch.

However, getting concurrency right is tricky enough. We should really be
careful with turning to much upside down in a rush. Even if TCG may have
some deeper hooks into the device model or thread-unsafe core parts than
KVM, parallelizing it can and should remain a separate topic. And we
also have to keep an eye on performance if a bit less than 255 VCPUs
shall be emulated.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
  2010-07-23  7:53                   ` Jan Kiszka
  2010-07-23  8:38                     ` Alexander Graf
@ 2010-07-23 10:35                     ` wang Tiger
  1 sibling, 0 replies; 20+ messages in thread
From: wang Tiger @ 2010-07-23 10:35 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Stefan Hajnoczi, Chen Yufei, qemu-devel

在 2010年7月23日 下午3:53，Jan Kiszka <jan.kiszka@web.de> 写道：
> wang Tiger wrote:
>> 在 2010年7月22日 下午11:47，Stefan Hajnoczi <stefanha@gmail.com> 写道：
>>> 2010/7/22 wang Tiger <tigerwang1986@gmail.com>:
>>>> In our implementation for x86_64 target, all devices except LAPIC are
>>>> emulated in a seperate thread. VCPUs are emulated  in other threads
>>>> (one thread per VCPU).
>>>> By observing some device drivers in linux, we have a hypothethis that
>>>> drivers in OS have already ensured correct synchronization on
>>>> concurrent hardware accesses.
>>> This hypothesis is too optimistic.  If hardware emulation code assumes
>>> it is only executed in a single-threaded fashion, but guests can
>>> execute it in parallel, then this opens up the possibility of race
>>> conditions that malicious guests can exploit.  There needs to be
>>> isolation: a guest should not be able to cause QEMU to crash.
>>
>> In our prototype, we assume the guest behaves correctly. If hardware
>> emulation code can ensure atomic access(behave like real hardware),
>> VCPUS can access device freely.  We actually refine some hardward
>> emulation code (eg. BMDMA, IOAPIC ) to ensure the atomicity of
>> hardware access.
>
> This approach is surely helpful for a prototype to explore the limits.
> But it's not applicable to production systems. It would create a huge
> source of potential subtle regressions for other guest OSes,
> specifically those that you cannot analyze regarding synchronized
> hardware access. We must play safe.
>
> That's why we currently have the global mutex. Its conversion can only
> happen step-wise, e.g. by establishing an infrastructure to declare the
> need of device models for that Big Lock. Then you can start converting
> individual models to private locks or even smart lock-less patterns.
>
> Jan
>
>
I agree with you on this point. The approach we used is really helpful
for a research prototype. But it needs a lot of work to make it
applicable to production systems.
Its my pleasure if we can tackle this issue togethor.

-- 
Zhaoguo Wang, Parallel Processing Institute, Fudan University

Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China

tigerwang1986@gmail.com
http://ppi.fudan.edu.cn/zhaoguo_wang

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
  2010-07-23  9:13                       ` Stefan Hajnoczi
  2010-07-23  9:47                         ` Jan Kiszka
@ 2010-07-23 10:59                         ` wang Tiger
  2010-07-23 11:02                           ` Stefan Hajnoczi
  1 sibling, 1 reply; 20+ messages in thread
From: wang Tiger @ 2010-07-23 10:59 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Chen Yufei, Jan Kiszka, Alexander Graf, qemu-devel

在 2010年7月23日 下午5:13，Stefan Hajnoczi <stefanha@gmail.com> 写道：
> 2010/7/23 Alexander Graf <agraf@suse.de>:
>>
>> On 23.07.2010, at 09:53, Jan Kiszka wrote:
>>
>>> wang Tiger wrote:
>>>> 在 2010年7月22日 下午11:47，Stefan Hajnoczi <stefanha@gmail.com> 写道：
>>>>> 2010/7/22 wang Tiger <tigerwang1986@gmail.com>:
>>>>>> In our implementation for x86_64 target, all devices except LAPIC are
>>>>>> emulated in a seperate thread. VCPUs are emulated  in other threads
>>>>>> (one thread per VCPU).
>>>>>> By observing some device drivers in linux, we have a hypothethis that
>>>>>> drivers in OS have already ensured correct synchronization on
>>>>>> concurrent hardware accesses.
>>>>> This hypothesis is too optimistic.  If hardware emulation code assumes
>>>>> it is only executed in a single-threaded fashion, but guests can
>>>>> execute it in parallel, then this opens up the possibility of race
>>>>> conditions that malicious guests can exploit.  There needs to be
>>>>> isolation: a guest should not be able to cause QEMU to crash.
>>>>
>>>> In our prototype, we assume the guest behaves correctly. If hardware
>>>> emulation code can ensure atomic access(behave like real hardware),
>>>> VCPUS can access device freely.  We actually refine some hardward
>>>> emulation code (eg. BMDMA, IOAPIC ) to ensure the atomicity of
>>>> hardware access.
>>>
>>> This approach is surely helpful for a prototype to explore the limits.
>>> But it's not applicable to production systems. It would create a huge
>>> source of potential subtle regressions for other guest OSes,
>>> specifically those that you cannot analyze regarding synchronized
>>> hardware access. We must play safe.
>>>
>>> That's why we currently have the global mutex. Its conversion can only
>>> happen step-wise, e.g. by establishing an infrastructure to declare the
>>> need of device models for that Big Lock. Then you can start converting
>>> individual models to private locks or even smart lock-less patterns.
>>
>> But isn't that independent from making TCG atomic capable and parallel? At that point a TCG vCPU would have the exact same issues and interfaces as a KVM vCPU, right? And then we can tackle the concurrent device access issues together.
>
> An issue that might affect COREMU today is core QEMU subsystems that
> are not thread-safe and used from hardware emulation, for example:
>
> cpu_physical_memory_read/write() to RAM will use qemu_get_ram_ptr().
> This function moves the found RAMBlock to the head of the global RAM
> blocks list in a non-atomic way.  Therefore, two unrelated hardware
> devices executing cpu_physical_memory_*() simultaneously face a race
> condition.  I have seen this happen when playing with parallel
> hardware emulation.
>
> Tiger: If you are only locking the hardware thread for ARM target,
> your hardware emulation is not safe for other targets.  Have I missed
> something in the COREMU patch that defends against this problem?
>
> Stefan
>
In fact, we solve this problem through a really simple method.
In our prototype, we removed this piece of code like this:
void *qemu_get_ram_ptr(ram_addr_t addr)
{
    ......

    /* Move this entry to to start of the list.  */
#ifndef CONFIG_COREMU
    /* Different core can access this function at the same time.
     * For coremu, disable this optimization to avoid data race.
     * XXX or use spin lock here if performance impact is big. */
    if (prev) {
        prev->next = block->next;
        block->next = *prevp;
        *prevp = block;
    }
#endif
    return block->host + (addr - block->offset);
}

CONFIG_COREMU is defined when TCG parallel mode is configured.
And the list is more likely to be read only without hotplug device, so
we don't use a lock to protect it.
Reimplement this list with a lock free list is also reasonable, but
seems unnecessary. :-)
-- 
Zhaoguo Wang, Parallel Processing Institute, Fudan University

Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China

tigerwang1986@gmail.com
http://ppi.fudan.edu.cn/zhaoguo_wang

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] Re: Release of COREMU, a scalable and portable full-system emulator
  2010-07-23 10:59                         ` wang Tiger
@ 2010-07-23 11:02                           ` Stefan Hajnoczi
  2010-07-25 15:56                             ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Stefan Hajnoczi @ 2010-07-23 11:02 UTC (permalink / raw)
  To: wang Tiger; +Cc: Chen Yufei, Jan Kiszka, Alexander Graf, qemu-devel

2010/7/23 wang Tiger <tigerwang1986@gmail.com>:
> 在 2010年7月23日 下午5:13，Stefan Hajnoczi <stefanha@gmail.com> 写道：
>> 2010/7/23 Alexander Graf <agraf@suse.de>:
>>>
>>> On 23.07.2010, at 09:53, Jan Kiszka wrote:
>>>
>>>> wang Tiger wrote:
>>>>> 在 2010年7月22日 下午11:47，Stefan Hajnoczi <stefanha@gmail.com> 写道：
>>>>>> 2010/7/22 wang Tiger <tigerwang1986@gmail.com>:
>>>>>>> In our implementation for x86_64 target, all devices except LAPIC are
>>>>>>> emulated in a seperate thread. VCPUs are emulated  in other threads
>>>>>>> (one thread per VCPU).
>>>>>>> By observing some device drivers in linux, we have a hypothethis that
>>>>>>> drivers in OS have already ensured correct synchronization on
>>>>>>> concurrent hardware accesses.
>>>>>> This hypothesis is too optimistic.  If hardware emulation code assumes
>>>>>> it is only executed in a single-threaded fashion, but guests can
>>>>>> execute it in parallel, then this opens up the possibility of race
>>>>>> conditions that malicious guests can exploit.  There needs to be
>>>>>> isolation: a guest should not be able to cause QEMU to crash.
>>>>>
>>>>> In our prototype, we assume the guest behaves correctly. If hardware
>>>>> emulation code can ensure atomic access(behave like real hardware),
>>>>> VCPUS can access device freely.  We actually refine some hardward
>>>>> emulation code (eg. BMDMA, IOAPIC ) to ensure the atomicity of
>>>>> hardware access.
>>>>
>>>> This approach is surely helpful for a prototype to explore the limits.
>>>> But it's not applicable to production systems. It would create a huge
>>>> source of potential subtle regressions for other guest OSes,
>>>> specifically those that you cannot analyze regarding synchronized
>>>> hardware access. We must play safe.
>>>>
>>>> That's why we currently have the global mutex. Its conversion can only
>>>> happen step-wise, e.g. by establishing an infrastructure to declare the
>>>> need of device models for that Big Lock. Then you can start converting
>>>> individual models to private locks or even smart lock-less patterns.
>>>
>>> But isn't that independent from making TCG atomic capable and parallel? At that point a TCG vCPU would have the exact same issues and interfaces as a KVM vCPU, right? And then we can tackle the concurrent device access issues together.
>>
>> An issue that might affect COREMU today is core QEMU subsystems that
>> are not thread-safe and used from hardware emulation, for example:
>>
>> cpu_physical_memory_read/write() to RAM will use qemu_get_ram_ptr().
>> This function moves the found RAMBlock to the head of the global RAM
>> blocks list in a non-atomic way.  Therefore, two unrelated hardware
>> devices executing cpu_physical_memory_*() simultaneously face a race
>> condition.  I have seen this happen when playing with parallel
>> hardware emulation.
>>
>> Tiger: If you are only locking the hardware thread for ARM target,
>> your hardware emulation is not safe for other targets.  Have I missed
>> something in the COREMU patch that defends against this problem?
>>
>> Stefan
>>
> In fact, we solve this problem through a really simple method.
> In our prototype, we removed this piece of code like this:
> void *qemu_get_ram_ptr(ram_addr_t addr)
> {
>    ......
>
>    /* Move this entry to to start of the list.  */
> #ifndef CONFIG_COREMU
>    /* Different core can access this function at the same time.
>     * For coremu, disable this optimization to avoid data race.
>     * XXX or use spin lock here if performance impact is big. */
>    if (prev) {
>        prev->next = block->next;
>        block->next = *prevp;
>        *prevp = block;
>    }
> #endif
>    return block->host + (addr - block->offset);
> }
>
> CONFIG_COREMU is defined when TCG parallel mode is configured.
> And the list is more likely to be read only without hotplug device, so
> we don't use a lock to protect it.
> Reimplement this list with a lock free list is also reasonable, but
> seems unnecessary. :-)

Ah, good :).

Stefan

> --
> Zhaoguo Wang, Parallel Processing Institute, Fudan University
>
> Address: Room 320, Software Building, 825 Zhangheng Road, Shanghai, China
>
> tigerwang1986@gmail.com
> http://ppi.fudan.edu.cn/zhaoguo_wang
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Qemu-devel] Re: Release of COREMU, a scalable and portable  full-system emulator
  2010-07-23 11:02                           ` Stefan Hajnoczi
@ 2010-07-25 15:56                             ` Paolo Bonzini
  0 siblings, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2010-07-25 15:56 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Jan Kiszka, Chen Yufei, wang Tiger, Alexander Graf, qemu-devel

On 07/23/2010 01:02 PM, Stefan Hajnoczi wrote:
>> In fact, we solve this problem through a really simple method.
>> In our prototype, we removed this piece of code like this:
>> void *qemu_get_ram_ptr(ram_addr_t addr)
>> {
>>     ......
>>
>>     /* Move this entry to to start of the list.  */
>> #ifndef CONFIG_COREMU
>>     /* Different core can access this function at the same time.
>>      * For coremu, disable this optimization to avoid data race.
>>      * XXX or use spin lock here if performance impact is big. */
>>     if (prev) {
>>         prev->next = block->next;
>>         block->next = *prevp;
>>         *prevp = block;
>>     }
>> #endif
>>     return block->host + (addr - block->offset);
>> }
>>
>> CONFIG_COREMU is defined when TCG parallel mode is configured.
>> And the list is more likely to be read only without hotplug device, so
>> we don't use a lock to protect it.
>> Reimplement this list with a lock free list is also reasonable, but
>> seems unnecessary. :-)
> 
> Ah, good :).

For this one in particular, you could just use circular lists (without a
"head" node, unlike the Linux kernel's list data type, as there's always
a RAM entry) and start iteration at "prev".

Paolo

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2010-07-25 15:56 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-17 10:27 [Qemu-devel] Release of COREMU, a scalable and portable full-system emulator Chen Yufei
2010-07-20 21:43 ` Blue Swirl
2010-07-21  7:03   ` Chen Yufei
2010-07-21 17:04     ` Stefan Weil
2010-07-22  8:48       ` Chen Yufei
2010-07-22 11:05         ` [Qemu-devel] " Jan Kiszka
2010-07-22 12:18         ` [Qemu-devel] " Stefan Hajnoczi
2010-07-22 13:00           ` [Qemu-devel] " Jan Kiszka
2010-07-22 13:21             ` Stefan Hajnoczi
2010-07-22 15:19             ` wang Tiger
2010-07-22 15:47               ` Stefan Hajnoczi
2010-07-23  3:29                 ` wang Tiger
2010-07-23  7:53                   ` Jan Kiszka
2010-07-23  8:38                     ` Alexander Graf
2010-07-23  9:13                       ` Stefan Hajnoczi
2010-07-23  9:47                         ` Jan Kiszka
2010-07-23 10:59                         ` wang Tiger
2010-07-23 11:02                           ` Stefan Hajnoczi
2010-07-25 15:56                             ` Paolo Bonzini
2010-07-23 10:35                     ` wang Tiger

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.