* [Qemu-devel] simulated memory instead of host memory
@ 2003-06-09 18:31 Johan Rydberg
  2003-06-09 19:09 ` Fabrice Bellard
  0 siblings, 1 reply; 5+ messages in thread
From: Johan Rydberg @ 2003-06-09 18:31 UTC (permalink / raw)
  To: qemu-devel


First question of the day;

First of all I would like to say that I really like the concept of QEMU.  
Let GCC do most of the work and just glue it all together.  Brilliant.
One downside of it though is all the tampering with flags to GCC.

To the question.  How hard would it be to make QEMU a full-system 
simulator?  Or, more concretely: how hard would it be to use simulated
memory (on a per-page basis) for the simulated app instead of the host
memory?

-- 
Johan Rydberg, Free Software Developer, Sweden
http://rtmk.sf.net | http://www.nongnu.org/guss/

Listening to Tricky - Where I'm from


* Re: [Qemu-devel] simulated memory instead of host memory
  2003-06-09 18:31 [Qemu-devel] simulated memory instead of host memory Johan Rydberg
@ 2003-06-09 19:09 ` Fabrice Bellard
  2003-06-09 19:37   ` Johan Rydberg
  0 siblings, 1 reply; 5+ messages in thread
From: Fabrice Bellard @ 2003-06-09 19:09 UTC (permalink / raw)
  To: qemu-devel

Johan Rydberg wrote:
> First question of the day;
> 
> First of all I would like to say that I really like the concept of QEMU.  
> Let GCC do most of the work and just glue it all together.  Brilliant.
> One downside of it though is all the tampering with flags to GCC.

Yes, every new GCC version may cause problems... One solution may be 
to distribute binary-only versions of some of QEMU's object files.

I hope that someday someone will make a proper code generator, but it is 
really a lot of work!

> To the question.  How hard would it be to make QEMU a full-system 
> simulator?  Or, more concretely: how hard would it be to use simulated
> memory (on a per-page basis) for the simulated app instead of the host
> memory?

It would be possible. I spent a lot of time thinking about it, but I did 
not implement it for lack of time and motivation. I see three solutions:

1) The very slow (but simplest) solution is to just modify the memory 
access inline functions in 'cpu-i386.h' to emulate the x86 MMU.

2) A faster solution is to use 4MB tables containing the addresses of 
each CPU page. One 4MB table would be used for read, one table for 
write. The tables can be seen as big TLBs. Unmapped pages would have a 
NULL entry in the tables so that a fault is generated on access to fill 
the table.

3) An even faster solution is to use Linux memory mappings to emulate 
the MMU. The Linux MM state of the process would be considered as a TLB 
of the virtual x86 MMU state. It works only if the host has <= 4KB page 
size and if the guest OS does not do any mapping in memory >= 0xc0000000. 
With Linux as guest it would work as you can easily change the base 
address of the kernel. The restriction about mappings >= 0xc0000000 
could be lifted with a small (but tricky) kernel patch which would 
allow mmap() at addresses >= 0xc0000000.
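
Solution (2) above could be sketched in C roughly as follows. This is
only an illustration under assumptions, not QEMU code: the names
read_table, write_table, mmu_walk, guest_ram and RAM_PAGES are
hypothetical, the MMU walk is a stub that wraps guest pages onto a tiny
RAM array, the fill is done with an explicit NULL check rather than by
catching a host fault, and accesses are assumed not to cross a page
boundary. With 4-byte host pointers each table takes 4MB (8MB on a
64-bit host).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BITS 12
#define PAGE_SIZE (1u << PAGE_BITS)
#define NUM_PAGES (1u << (32 - PAGE_BITS))   /* 2^20 entries per table */
#define RAM_PAGES 16u                        /* hypothetical tiny guest RAM */

static uint8_t guest_ram[RAM_PAGES * PAGE_SIZE];
static uint8_t **read_table;    /* host page address per guest page, or NULL */
static uint8_t **write_table;

static int tables_init(void)
{
    read_table = calloc(NUM_PAGES, sizeof *read_table);
    write_table = calloc(NUM_PAGES, sizeof *write_table);
    return (read_table && write_table) ? 0 : -1;
}

/* Hypothetical stand-in for the x86 MMU walk: wraps every guest page
 * onto the small RAM array instead of reading guest page tables. */
static uint8_t *mmu_walk(uint32_t vaddr, int is_write)
{
    (void)is_write;  /* a real walk would check write permission here */
    return guest_ram + ((vaddr >> PAGE_BITS) % RAM_PAGES) * PAGE_SIZE;
}

static uint8_t *translate(uint32_t vaddr, int is_write)
{
    uint8_t **table = is_write ? write_table : read_table;
    uint32_t vpn = vaddr >> PAGE_BITS;

    assert(table != NULL);           /* tables_init() must have succeeded */
    if (table[vpn] == NULL)          /* unmapped entry: fill it */
        table[vpn] = mmu_walk(vaddr, is_write);
    return table[vpn] + (vaddr & (PAGE_SIZE - 1));
}

static void store32(uint32_t vaddr, uint32_t value)
{
    memcpy(translate(vaddr, 1), &value, sizeof value);
}

static uint32_t load32(uint32_t vaddr)
{
    uint32_t value;
    memcpy(&value, translate(vaddr, 0), sizeof value);
    return value;
}
```

After tables_init(), a store32() followed by a load32() at the same
guest address fills one entry in each table and returns the stored
value; entries for pages that were never touched stay NULL.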

I wanted to implement solution (3) to be able to simulate an unpatched 
Linux kernel (and call the project 'qplex86' !).
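
The idea behind solution (3) can be illustrated with a small C sketch.
It is a simplification under stated assumptions and not the design
described above: instead of mapping pages at the guest's own addresses,
it reserves a host window (gs->base) and maps guest pages inside it, so
that it is safe to run in an ordinary process. The names guest_space,
guest_space_init and guest_map_page are hypothetical, and a temporary
file stands in for guest physical RAM. It assumes a Linux host with a
page size of at most 4KB.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define GUEST_PAGE 4096u

struct guest_space {
    int ram_fd;      /* temporary file standing in for guest physical RAM */
    uint8_t *base;   /* reserved host window standing in for guest virtual space */
    size_t size;     /* window size in bytes */
};

static int guest_space_init(struct guest_space *gs, size_t ram_bytes,
                            size_t virt_bytes)
{
    FILE *f = tmpfile();
    void *win;

    if (f == NULL)
        return -1;
    gs->ram_fd = fileno(f);
    if (ftruncate(gs->ram_fd, (off_t)ram_bytes) != 0)
        return -1;
    /* Reserve the window with no access rights: every page starts out
     * "not in the TLB", so touching it raises a host fault. */
    win = mmap(NULL, virt_bytes, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (win == MAP_FAILED)
        return -1;
    gs->base = win;
    gs->size = virt_bytes;
    return 0;
}

/* Install one guest translation, the equivalent of a TLB fill. */
static int guest_map_page(struct guest_space *gs, uint32_t vaddr,
                          uint32_t paddr, int prot)
{
    uint32_t vpage = vaddr & ~(GUEST_PAGE - 1);
    uint32_t ppage = paddr & ~(GUEST_PAGE - 1);
    void *p;

    assert(gs->base != NULL);
    if (vpage >= gs->size)
        return -1;
    p = mmap(gs->base + vpage, GUEST_PAGE, prot, MAP_SHARED | MAP_FIXED,
             gs->ram_fd, (off_t)ppage);
    return p == MAP_FAILED ? -1 : 0;
}
```

Two guest virtual pages mapped onto the same guest physical page see
each other's writes, as they would behind a real MMU. A full
implementation would also need a SIGSEGV handler that walks the guest
page tables and calls something like guest_map_page() on a miss.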

To run any OS you would also need precise segment limits and rights 
emulation, at least for non-user code.

Fabrice.


* Re: [Qemu-devel] simulated memory instead of host memory
  2003-06-09 19:09 ` Fabrice Bellard
@ 2003-06-09 19:37   ` Johan Rydberg
  2003-06-09 20:18     ` Fabrice Bellard
  0 siblings, 1 reply; 5+ messages in thread
From: Johan Rydberg @ 2003-06-09 19:37 UTC (permalink / raw)
  To: qemu-devel

On Mon, 09 Jun 2003 21:09:37 +0200
Fabrice Bellard <fabrice.bellard@free.fr> wrote:

: It would be possible. I spent a lot of time thinking about it, but I did 
: not implement it for lack of time and motivation. I see three solutions:
: [...]
: 2) A faster solution is to use 4MB tables containing the addresses of 
: each CPU page. One 4MB table would be used for read, one table for 
: write. The tables can be seen as big TLBs. Unmapped pages would have a 
: NULL entry in the tables so that a fault is generated on access to fill 
: the table.

In the current version of GUSS I use a similar technique.  I call them
mtcaches, which stands for memory translation caches.  Each can be seen
as a direct-mapped cache, with the virtual page number as index.  The
tag is constructed from the virtual address, with the offset masked off.
The cache contains <tag, diff> tuples.  The diff is the host memory
address minus the virtual address.  When there is an mtcache hit, all
that has to be done to get the host memory address is add the virtual 
address to the diff value.  
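
As a rough C illustration of the hit path described above (a sketch
only, not the GUSS source): the struct layout, the names mtcache_flush,
mtcache_insert and mtcache_lookup, and the use of 0xffffffff as an
"invalid" tag are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS       12
#define PAGE_MASK       0xfffff000u
#define MTCACHE_ENTRIES 256u
#define INVALID_TAG     0xffffffffu  /* never page aligned, so never matches */

struct mtcache_entry {
    uint32_t  tag;   /* virtual address with the page offset masked off */
    uintptr_t diff;  /* host page address minus virtual page address */
};

struct mtcache {
    struct mtcache_entry e[MTCACHE_ENTRIES];
};

static void mtcache_flush(struct mtcache *c)
{
    for (size_t i = 0; i < MTCACHE_ENTRIES; i++)
        c->e[i].tag = INVALID_TAG;
}

static void mtcache_insert(struct mtcache *c, uint32_t vaddr, void *host_page)
{
    uint32_t vpage = vaddr & PAGE_MASK;
    struct mtcache_entry *e =
        &c->e[(vaddr >> PAGE_BITS) & (MTCACHE_ENTRIES - 1)];

    assert(host_page != NULL);
    e->tag = vpage;
    /* Unsigned arithmetic wraps, so diff + vaddr below is still correct
     * when the host page lies below the virtual address. */
    e->diff = (uintptr_t)host_page - vpage;
}

/* Returns the host address on a hit, NULL on a miss. */
static void *mtcache_lookup(const struct mtcache *c, uint32_t vaddr)
{
    const struct mtcache_entry *e =
        &c->e[(vaddr >> PAGE_BITS) & (MTCACHE_ENTRIES - 1)];

    if (e->tag != (vaddr & PAGE_MASK))
        return NULL;
    return (void *)(e->diff + vaddr);
}
```

Two virtual pages whose page numbers share the low 8 bits compete for
the same entry, which is the usual cost of a direct-mapped design.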

When there is an mtcache miss, the full MMU emulation code is called.
It is up to that code to add entries to the mtcache (there are separate
mtcaches for reads and writes, and for user and supervisor mode).
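
The miss path and the split into separate caches might look like this
in C. Again this is a sketch with assumptions: mmu_translate is a
hypothetical stand-in for the full MMU emulation (eight identity-mapped
pages, with user-mode writes refused on pages 4 and up), and the hits
and misses counters exist only to make the behaviour visible.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum access { ACC_READ, ACC_WRITE, ACC_COUNT };
enum mode   { MODE_USER, MODE_SUPERVISOR, MODE_COUNT };

struct entry { uint32_t tag; uintptr_t diff; };

/* One 256-entry mtcache per (access type, CPU mode) pair. */
static struct entry mtcache[ACC_COUNT][MODE_COUNT][256];
static unsigned long hits, misses;
static uint8_t guest_ram[8 * 4096];  /* hypothetical tiny guest RAM */

/* Hypothetical stand-in for the full MMU emulation code. */
static uint8_t *mmu_translate(uint32_t vaddr, enum access acc, enum mode mode)
{
    uint32_t vpn = vaddr >> 12;

    if (vpn >= 8)
        return NULL;                       /* unmapped */
    if (mode == MODE_USER && acc == ACC_WRITE && vpn >= 4)
        return NULL;                       /* protection violation */
    return guest_ram + vpn * 4096u;
}

static void mtcache_reset(void)
{
    /* All-ones tags are not page aligned, so no lookup can match them. */
    memset(mtcache, 0xff, sizeof mtcache);
    hits = misses = 0;
}

/* Returns the host address, or NULL if the guest access faults. */
static uint8_t *translate(uint32_t vaddr, enum access acc, enum mode mode)
{
    struct entry *e;
    uint32_t tag = vaddr & 0xfffff000u;

    assert(acc < ACC_COUNT && mode < MODE_COUNT);
    e = &mtcache[acc][mode][(vaddr >> 12) & 0xff];
    if (e->tag != tag) {
        uint8_t *host_page = mmu_translate(vaddr, acc, mode);

        misses++;
        if (host_page == NULL)
            return NULL;                   /* fault: nothing is cached */
        e->tag = tag;
        e->diff = (uintptr_t)host_page - tag;
    } else {
        hits++;
    }
    return (uint8_t *)(e->diff + vaddr);
}
```

Because each (access, mode) pair has its own cache, a page that is
readable from user mode but writable only from supervisor mode never
produces a false hit: the user-write cache simply never gets an entry
for it.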

Some early testing (booting the Linux kernel on a simulated MIPS32 4Kc)
shows that you can get a 95% hit rate or more. 

On SPARC and other RISC architectures which have bit extraction insns
and register+register addressing the test against the mtcache can be done
in 6-8 insns.  The test on IA-32 is a bit more complex (12-14 insns),
mainly due to the limited number of general purpose registers.

This is what my code generator emits for a memory store.  The value
to be stored is located in %ebx.  The virtual address is in %eax.
%ecx must be pushed on the stack to free a register.  

  40017160: 0000005b: push   %ecx
  40017161: 0000005c: mov    0x805cce4,%ebp        pointer to mtcache
  40017167: 00000062: mov    %eax,%ecx
  40017169: 00000064: shr    $0xc,%ecx
  4001716c: 00000067: and    $0xff,%ecx            256 entries
  40017172: 0000006d: lea    0x0(%ebp,%ecx,8),%esi mtcache entry at %esi
  40017176: 00000071: mov    %eax,%ecx
  40017178: 00000073: and    $0xfffff000,%ecx      make tag
  4001717e: 00000079: cmp    %ecx,0x0(%esi)        and compare
  40017181: 0000007c: jne    0x00000439            miss -> slow way
  40017187: 00000082: mov    0x4(%esi),%esi
  4001718a: 00000085: add    %eax,%esi
  4001718c: 00000087: mov    %ebx,0x0(%esi)        do the store
  4001718f: 0000008a: pop    %ecx

Can you think of a faster way to do it?  Note that I generate
the code by hand (not using GCC).

: 3) An even faster solution is to use Linux memory mappings to emulate 
: the MMU. The Linux MM state of the process would be considered as a TLB 
: of the virtual x86 MMU state. It works only if the host has <= 4KB page 
: size and if the guest OS does not do any mapping in memory >= 0xc0000000. 
: With Linux as guest it would work as you can easily change the base 
: address of the kernel. The restriction about mappings >= 0xc0000000 
: could be lifted with a small (but tricky) kernel patch which would 
: allow mmap() at addresses >= 0xc0000000.

Since it isn't very portable I don't think it is an option.

: I wanted to implement solution (3) to be able to simulate an unpatched 
: Linux kernel (and call the project 'qplex86' !).
: 
: To run any OS you would also need precise segment limits and rights 
: emulation, at least for non-user code.

Of course.  Everything has to be simulated.  That is the challenge :)

-- 
Johan Rydberg, Free Software Developer, Sweden
http://rtmk.sf.net | http://www.nongnu.org/guss/

Listening to Her Majesty - F.U.N.E.R.A.L.


* Re: [Qemu-devel] simulated memory instead of host memory
  2003-06-09 19:37   ` Johan Rydberg
@ 2003-06-09 20:18     ` Fabrice Bellard
  2003-06-09 20:43       ` Johan Rydberg
  0 siblings, 1 reply; 5+ messages in thread
From: Fabrice Bellard @ 2003-06-09 20:18 UTC (permalink / raw)
  To: qemu-devel

Johan Rydberg wrote:

> This is what my code generator emits for a memory store.  The value
> to be stored is located in %ebx.  The virtual address is in %eax.
> %ecx must be pushed on the stack to free a register.  
> 
>   40017160: 0000005b: push   %ecx
>   40017161: 0000005c: mov    0x805cce4,%ebp        pointer to mtcache
>   40017167: 00000062: mov    %eax,%ecx
>   40017169: 00000064: shr    $0xc,%ecx
>   4001716c: 00000067: and    $0xff,%ecx            256 entries
>   40017172: 0000006d: lea    0x0(%ebp,%ecx,8),%esi mtcache entry at %esi
>   40017176: 00000071: mov    %eax,%ecx
>   40017178: 00000073: and    $0xfffff000,%ecx      make tag
>   4001717e: 00000079: cmp    %ecx,0x0(%esi)        and compare
>   40017181: 0000007c: jne    0x00000439            miss -> slow way
>   40017187: 00000082: mov    0x4(%esi),%esi
>   4001718a: 00000085: add    %eax,%esi
>   4001718c: 00000087: mov    %ebx,0x0(%esi)        do the store
>   4001718f: 0000008a: pop    %ecx
> 
> Can you think of a faster way to do it?  Note that I generate
> the code by hand (not using GCC).

Using a cache as you do is a good idea. You can save some insns, and 
more if you use different bits of the address (do a mask with 0x7f8), 
but you would have fewer cache hits.

40017160: 0000005b: push   %ecx
40017167: 00000062: mov    %eax,%esi
40017169: 00000064: shr    $0xc,%esi
                    mov    %esi,%ecx
4001716c: 00000067: and    $0xff,%esi            256 entries
4001717e: 00000079: cmp    %ecx,0x805cee4(,%esi,8)  compare
40017181: 0000007c: jne    0x00000439            miss -> slow way
40017187: 00000082: add    0x805cee8(,%esi,8),%eax
4001718c: 00000087: mov    %ebx,0x0(%eax)        do the store
4001718f: 0000008a: pop    %ecx

I guess GCC should give nearly optimal code.
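
The trade-off between the two ways of picking a cache entry can be made
concrete with a small C sketch. It rests on one reading of the remark
about the 0x7f8 mask, namely that the masked address is used directly
as a byte offset into a table of 8-byte entries; the function names are
hypothetical and only the index computation is modelled, not the cache
itself.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of the entry when indexing by virtual page number:
 * shift, mask, then scale by the 8-byte entry size. */
static uint32_t index_by_page(uint32_t vaddr)
{
    return ((vaddr >> 12) & 0xff) * 8;
}

/* Byte offset when masking the address directly: one AND, already
 * scaled, but it uses bits 3..10, which lie inside the page offset. */
static uint32_t index_by_low_bits(uint32_t vaddr)
{
    return vaddr & 0x7f8;
}

/* Count how many distinct entries a linear walk over
 * [start, start + len) touches under a given index function. */
static unsigned entries_touched(uint32_t (*index_fn)(uint32_t),
                                uint32_t start, uint32_t len, uint32_t stride)
{
    uint8_t seen[256] = {0};
    unsigned n = 0;

    for (uint32_t off = 0; off < len; off += stride) {
        uint32_t slot = index_fn(start + off) / 8;

        assert(slot < 256);
        if (!seen[slot]) {
            seen[slot] = 1;
            n++;
        }
    }
    return n;
}
```

Walking a single 4KB page in 4-byte steps touches one entry with the
page-number index but all 256 entries with the low-bits index, so every
one of those entries would have to be filled through the slow path
first. That is where the lower hit rate would come from.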

> : 3) An even faster solution is to use Linux memory mappings to emulate 
> : the MMU. The Linux MM state of the process would be considered as a TLB 
> : of the virtual x86 MMU state. It works only if the host has <= 4KB page 
> : size and if the guest OS does not do any mapping in memory >= 0xc0000000. 
> : With Linux as guest it would work as you can easily change the base 
> : address of the kernel. The restriction about mappings >= 0xc0000000 
> : could be lifted with a small (but tricky) kernel patch which would 
> : allow mmap() at addresses >= 0xc0000000.
> 
> Since it isn't very portable I don't think it is an option.

Well, if you generate code it is already not portable :-)

Fabrice.


* Re: [Qemu-devel] simulated memory instead of host memory
  2003-06-09 20:18     ` Fabrice Bellard
@ 2003-06-09 20:43       ` Johan Rydberg
  0 siblings, 0 replies; 5+ messages in thread
From: Johan Rydberg @ 2003-06-09 20:43 UTC (permalink / raw)
  To: qemu-devel

Fabrice Bellard <fabrice.bellard@free.fr> wrote:

: Using a cache as you do is a good idea. You can save some insns, and 
: more if you use different bits of the address (do a mask with 0x7f8), 
: but you would have fewer cache hits.

Since doing it the "slow way" is really slow you should try to maximize
the hit rate.  The ideal would be something like a 95-99% hit rate for 
normal pages, and it should only have to escape into the slow path on
accesses to memory mapped I/O devices.  Well, you could always dream I 
guess.
 
: 40017160: 0000005b: push   %ecx
: 40017167: 00000062: mov    %eax,%esi
: 40017169: 00000064: shr    $0xc,%esi
:                     mov    %esi,%ecx
: 4001716c: 00000067: and    $0xff,%esi            256 entries
: 4001717e: 00000079: cmp    %ecx,0x805cee4(,%esi,8)  compare
: 40017181: 0000007c: jne    0x00000439            miss -> slow way
: 40017187: 00000082: add    0x805cee8(,%esi,8),%eax
: 4001718c: 00000087: mov    %ebx,0x0(%eax)        do the store
: 4001718f: 0000008a: pop    %ecx

Does this really work?  0x805cee4 is the address to _a pointer_ that holds
the address of the mtcache.  The reason for having a pointer to the real
mtcache is that it is much faster just to change the pointer when switching
between user and supervisor mode (and the other way around).

Maybe it would be better to have a centralized mtcache, and copy the contents
of the per-cpu and per-state mtcaches into that one when the state changes.
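
The two alternatives can be put side by side in a short C sketch. The
names are hypothetical, and saving the central cache back to its
per-mode copy before loading the new one is an extra step assumed here
so that entries filled while a mode was active are not lost.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ENTRIES 256

struct entry   { uint32_t tag; uintptr_t diff; };
struct mtcache { struct entry e[ENTRIES]; };

enum mode { MODE_USER, MODE_SUPERVISOR, MODE_COUNT };

static struct mtcache per_mode[MODE_COUNT];

/* Variant 1: generated code first loads this pointer, then indexes the
 * cache it points to.  A mode switch is a single pointer store, but
 * every memory access pays one extra load. */
static struct mtcache *current = &per_mode[MODE_USER];

static void switch_mode_by_pointer(enum mode m)
{
    assert(m < MODE_COUNT);
    current = &per_mode[m];
}

/* Variant 2: generated code addresses `central` at a fixed location, so
 * the table address can be an immediate in the emitted instructions.
 * A mode switch copies whole caches instead. */
static struct mtcache central;
static enum mode central_mode = MODE_USER;

static void switch_mode_by_copy(enum mode m)
{
    assert(m < MODE_COUNT);
    memcpy(&per_mode[central_mode], &central, sizeof central);  /* save */
    memcpy(&central, &per_mode[m], sizeof central);             /* load */
    central_mode = m;
}
```

Which variant wins depends on how often the simulated CPU changes mode
compared with how often it accesses memory: the copy moves a few
kilobytes per switch, while the pointer costs one load per access.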

The reason for masking the virtual address as I did, and using it as the tag
in the cache, is that you can also check for unaligned memory accesses. 
This is not an issue when simulating IA-32, but you must detect it when 
simulating machines that cannot cope with unaligned accesses.
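
One way such an alignment check could be folded into the tag compare is
sketched below in C. The details are an assumption, not taken from the
listing earlier in the thread (which masks with 0xfffff000 only): here
the low bits that must be zero for a naturally aligned access are left
in the computed tag, so an unaligned address can never match a stored,
page-aligned tag and falls through to the slow path.

```c
#include <assert.h>
#include <stdint.h>

struct entry { uint32_t tag; uintptr_t diff; };

/* size is the access width in bytes and must be a power of two. */
static uint32_t make_tag(uint32_t vaddr, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    /* Keep the page number plus the low bits that break alignment. */
    return vaddr & (0xfffff000u | (size - 1));
}

/* Non-zero only when the page matches and the access is aligned. */
static int fast_path_hit(const struct entry *e, uint32_t vaddr, uint32_t size)
{
    return e->tag == make_tag(vaddr, size);
}
```

The check costs nothing extra on the fast path, since the mask is a
constant chosen when the code for a given access width is generated.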

: I guess GCC should give nearly optimal code.

Most probably. I will wrap something together and see what it generates.


: [...]
: Well, if you generate code it is already not portable :-)

I meant between systems such as GNU/Linux and BSD.

-- 
Johan Rydberg, Free Software Developer, Sweden
http://rtmk.sf.net | http://www.nongnu.org/guss/

Listening to Her Majesty - Rules to follow
